# Finding Causal Relations With Question Answering Models
In this Jupyter notebook, we train three distinct models, each with a specific role:

1. **Causal Marker Model**: Identifies the causal marker in a given sentence.
2. **Cause Identification Model**: Given the sentence and its causal marker, pinpoints the cause.
3. **Effect Identification Model**: Given the sentence and its causal marker, determines the effect.

Before training these models, we install the necessary dependencies.

In [14]:
!pip install datasets | grep -v 'already satisfied'
!pip install transformers | grep -v 'already satisfied'

In [15]:
import json
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
from datasets import Dataset
import datasets
import numpy as np
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForQuestionAnswering, TrainingArguments, Trainer
from pathlib import Path
from tools import run_model

## Data & Model
In this section, we perform two main tasks:

1. **Data Preparation**: We read the data from a JSON file and transform it into a suitable dataset format for our models.
2. **Model Initialization**: We set up the initial configurations for our models.

The same notebook can train any of the three models. To switch models, change the data file that the cell below loads (it currently reads `data_effect.json`).
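
The `preprocess` function later in this notebook reads `question`, `context`, and `answers` from each record, and takes the first entry of `answers[0]["text"]` and `answers[0]["answer_start"]`. Based on that access pattern, a record appears to have the shape sketched here. The sample sentence and the helper `check_record` are illustrative assumptions, not part of the original data or code.

```python
def check_record(record):
    """Return True if the first answer span matches the context at its stated offset."""
    answer = record["answers"][0]
    text, start = answer["text"][0], answer["answer_start"][0]
    if text == "":
        return True  # empty answer: preprocess maps this case to the [CLS] position
    return record["context"][start:start + len(text)] == text

# Illustrative record (English stand-in for the Persian data)
sample = {
    "context": "The match was cancelled because it rained.",
    "question": "because",
    "answers": [{"text": ["The match was cancelled"], "answer_start": [0]}],
}

# Same record with a wrong offset, to show what the check catches
shifted = {**sample, "answers": [{"text": ["The match was cancelled"], "answer_start": [4]}]}

print(check_record(sample))   # True
print(check_record(shifted))  # False
```

Running such a check over a data file before training can catch offsets that drifted during annotation or text cleaning.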

In [None]:
model_name = "HooshvareLab/bert-fa-base-uncased"
tokenizer, config = AutoTokenizer.from_pretrained(model_name), AutoConfig.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Load the QA-style records; swap the file name to train a different model
with open('data_effect.json', 'r', encoding='utf-8') as f:
    file = json.load(f)
df = pd.json_normalize(file['data']).sample(frac=1, random_state=10)  # shuffled; 3080 rows

# First 2400 shuffled rows for training, the remainder for validation
train_data = Dataset.from_pandas(df.iloc[:2400], preserve_index=False)
validation_data = Dataset.from_pandas(df.iloc[2400:], preserve_index=False)
data = datasets.DatasetDict({"train": train_data, "validation": validation_data})


## Preprocess
The `preprocess` function tokenizes each question-context pair and converts the character-level answer span into start and end token positions, the label format that extractive Question-Answering (QA) models expect.
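
As a toy illustration of the alignment step in the function below: for each answer, it picks the context token whose start offset is closest to the answer's first character, and the context token whose end offset is closest to the answer's last character. Question and special tokens receive a fixed penalty of 100 so they are not selected. The offsets and sequence ids here are written by hand for an English sentence; a real tokenizer would produce them, and `nearest_token` is a hypothetical helper name.

```python
def nearest_token(offsets, sequence_ids, char_pos, use_end=False, penalty=100):
    """Index of the context token whose start (or end) offset is closest to char_pos."""
    side = 1 if use_end else 0
    costs = [
        abs(span[side] - char_pos) + (0 if seq == 1 else penalty)
        for span, seq in zip(offsets, sequence_ids)
    ]
    return costs.index(min(costs))

# Hand-written encoding of: [CLS] because [SEP] it rained so we left [SEP]
# Context string: "it rained so we left"
offsets = [(0, 0), (0, 7), (0, 0), (0, 2), (3, 9), (10, 12), (13, 15), (16, 20), (0, 0)]
sequence_ids = [None, 0, None, 1, 1, 1, 1, 1, None]

answer_start, answer_text = 13, "we left"
start = nearest_token(offsets, sequence_ids, answer_start)
end = nearest_token(offsets, sequence_ids, answer_start + len(answer_text), use_end=True)
print(start, end)  # 6 7
```

Tokens 6 and 7 are "we" and "left", so the character span is recovered exactly. When an answer boundary falls inside a token, the nearest-offset rule still returns the closest token rather than failing.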

In [None]:
def preprocess(examples):
    """
    Prepare the data to be fed into a QA model.

    :param examples: A batch with "question", "context" and "answers" columns
    :return: The tokenized batch with "start_positions" and "end_positions" added
    """

    tokenized_examples = tokenizer(examples["question"], examples["context"], return_offsets_mapping=True)
    tokenized_examples['start_positions'], tokenized_examples['end_positions'] = [], []

    cls_index = 0  # an empty answer is labelled with the [CLS] position
    for i, offset in enumerate(tokenized_examples['offset_mapping']):
        answer = examples['answers'][i][0]

        # 1 for context tokens, 0 for question and special tokens
        types = np.array([1 if t == 1 else 0 for t in tokenized_examples.sequence_ids(i)])
        # Penalty that keeps the argmin below inside the context
        penalty = 100 * (1 - types)

        if len(answer['text'][0]) == 0:
            s, e = cls_index, cls_index

        else:
            answer_start = answer['answer_start'][0]
            answer_end = answer_start + len(answer['text'][0])

            # Token whose start offset is closest to the first answer character
            starts = np.array([o[0] for o in offset])
            s = int(np.argmin(np.abs(starts - answer_start) + penalty))

            # Token whose end offset is closest to the end of the answer
            ends = np.array([o[1] for o in offset])
            e = int(np.argmin(np.abs(ends - answer_end) + penalty))

        tokenized_examples['start_positions'].append(s)
        tokenized_examples['end_positions'].append(e)

    tokenized_examples.pop('offset_mapping')
    return tokenized_examples

## Train
We tokenize the dataset with `preprocess` and then fine-tune the model with the Hugging Face `Trainer`, evaluating on the validation split every 12 steps.

In [None]:
tokenized_ds = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = TrainingArguments(
    "result",  # output directory
    evaluation_strategy="steps",  # or "epoch"; newer transformers releases name this argument eval_strategy
    eval_steps=12,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['validation'],
    tokenizer=tokenizer)

trainer.train()
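
To get a feel for what these settings imply, the arithmetic can be worked out by hand. This is a rough sketch that assumes a single device, no gradient accumulation, and that a final partial batch would be kept; with different hardware or arguments the numbers change.

```python
import math

train_examples = 2400   # rows in the training split
batch_size = 16         # per_device_train_batch_size
epochs = 1              # num_train_epochs
eval_steps = 12         # evaluate every 12 optimizer steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
evaluations = total_steps // eval_steps

print(total_steps, evaluations)  # 150 12
```

Under these assumptions the run takes 150 optimizer steps and evaluates 12 times, at steps 12, 24, and so on up to 144.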


## Connect to Google Drive
This section mounts Google Drive in Colab so that you can:
1. **Save your Trained Models**: After training, store a model on Google Drive for future use by uncommenting the `trainer.save_model` line.
2. **Load Pre-Trained Models**: Load models that were previously trained and saved on Google Drive into the current session.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
#trainer.save_model('/content/gdrive/My Drive/effect')

## Test
The test set consists of 300 sentences. The process is:

1. **Load the Trained Models**: Load the three fine-tuned models and their tokenizers from Google Drive.
2. **Run the First Model**: Identify the causal marker in each test sentence.
3. **Run the Second and Third Models**: Pass the sentence together with the predicted marker to the second and third models, which identify the cause and the effect respectively.
4. **Display Results**: Print the marker, cause, and effect predicted for each test sample.

This lets you inspect how the models perform on unseen data.
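
The `run_model` helper comes from the local `tools` module, which is not shown in this notebook. As background, an extractive QA model outputs one start score and one end score per token, and the answer is usually read off as the best-scoring valid span. The sketch below is a hypothetical illustration of that decoding step with hand-written scores; it is not the actual `run_model` implementation, and `best_span` and `max_len` are assumed names.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximizing start + end score, with start <= end < start + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hand-written scores standing in for model output
tokens = ["[CLS]", "because", "[SEP]", "it", "rained", "so", "we", "left", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.1, 0.0, 2.5, 0.3, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.1, 0.4, 0.0, 0.2, 3.0, 0.0]

s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # we left
```

A prediction of span `(0, 0)` decodes to `[CLS]`, which is the value the test loop treats as "no causal marker found".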

In [None]:
gdp = '/content/gdrive/My Drive/'
paths = [gdp + 'marker', gdp + 'cause', gdp + 'effect']
tokenizers = [AutoTokenizer.from_pretrained(p) for p in paths]
models = [AutoModelForQuestionAnswering.from_pretrained(p) for p in paths]

with open('test.txt', mode='r', encoding='utf-8') as f:
  lines = f.readlines()
# Strip the annotation characters so the models see plain sentences
texts = [s.replace('*', '').replace('+', '').replace('&', '') for s in lines]

# Persian prompt listing marker types, roughly:
# "because of - result - cause - because - inference - provided that"
marker_question = 'به دلیل این که - نتیجه - علت - زیرا - استنتاج - درصورتی که'

for i, text in enumerate(texts):
  mark = run_model(models[0], tokenizers[0], text, marker_question)
  caus = run_model(models[1], tokenizers[1], text, mark)
  effe = run_model(models[2], tokenizers[2], text, mark)
  # A '[CLS]' marker means no causal marker was found, so report nothing
  answer = [mark, caus, effe] if mark != '[CLS]' else ['', '', '']

  print(i)
  print(lines[i], end='')  # original annotated line, for comparison

  for label, value in zip(['marker: ', 'cause: ', 'effect: '], answer):
    print(label, end='')
    print(value, end='    ')

  print()
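
The loop above runs the cause and effect models even when the marker model returns `[CLS]`, and then discards their output. One possible variation is to stop after the marker step in that case, which saves two forward passes per sentence without changing what is printed. The sketch below shows the control flow only; `extract_relation` and the toy callables are hypothetical stand-ins, and in the notebook each callable would wrap `run_model` with one model and tokenizer pair.

```python
def extract_relation(text, find_marker, find_cause, find_effect, no_answer="[CLS]"):
    """Run the marker step first and skip the other two steps when no marker is found."""
    marker = find_marker(text)
    if marker == no_answer:
        return {"marker": "", "cause": "", "effect": ""}
    return {
        "marker": marker,
        "cause": find_cause(text, marker),
        "effect": find_effect(text, marker),
    }

# Toy stand-ins for the three models (string rules, not real predictions)
def toy_marker(text):
    return "because" if "because" in text else "[CLS]"

def toy_cause(text, marker):
    return text.split(marker)[1].strip(" .")

def toy_effect(text, marker):
    return text.split(marker)[0].strip()

print(extract_relation("We left because it rained.", toy_marker, toy_cause, toy_effect))
# {'marker': 'because', 'cause': 'it rained', 'effect': 'We left'}
print(extract_relation("It rained all day.", toy_marker, toy_cause, toy_effect))
# {'marker': '', 'cause': '', 'effect': ''}
```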
