<a href="https://colab.research.google.com/github/langdonholmes/lexical_analysis/blob/main/ICNALE_lexical_prepare_dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing dataframes

Working with thousands of separate text files can be cumbersome.  
Preparing .csv files with the texts and metadata is a simple alternative.

In [None]:
import os
import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

## Texts to Dataframe


In [None]:
corpus_dir = '/content/drive/MyDrive/data/icnale_bert_lexical/ICNALE_W_single_folder_cleaned/'
meta_data = '/content/drive/MyDrive/data/icnale_bert_lexical/final_icnale_variable_smoking_lme.csv'
smj_only = '/content/drive/MyDrive/data/icnale_bert_lexical/smj.csv'

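# Build the combined CSV only once; later runs reuse the saved file.
if not os.path.isfile(smj_only):
    texts = []
    # Collect (filename, text) pairs from every .txt file under corpus_dir.
    for file in tqdm(pathlib.Path(corpus_dir).rglob("*.txt")):
        with open(file, 'r', encoding="utf-8") as f:
            texts.append((file.name, f.read()))
    df = pd.DataFrame.from_records(texts, columns=['Filename', 'text'])
    # Inner join on 'Filename': texts without a metadata row are dropped.
    df = df.merge(pd.read_csv(meta_data), on='Filename')
    df.to_csv(smj_only)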
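
Because `merge` defaults to an inner join, any text whose filename has no metadata row disappears without warning. The following is a minimal sketch of one way to check for that before saving. The filenames and `Country` values are toy data made up for illustration, not taken from ICNALE.

```python
import pandas as pd

# Toy stand-ins for the texts and the metadata table (hypothetical values).
texts = pd.DataFrame({
    'Filename': ['a.txt', 'b.txt', 'c.txt'],
    'text': ['first essay', 'second essay', 'third essay'],
})
meta = pd.DataFrame({
    'Filename': ['a.txt', 'b.txt', 'd.txt'],
    'Country': ['JPN', 'KOR', 'CHN'],
})

# A left join with indicator=True keeps every text and records, in the
# '_merge' column, whether a matching metadata row was found.
checked = texts.merge(meta, on='Filename', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Filename'].tolist()
print(f'{len(unmatched)} text(s) without metadata: {unmatched}')

# The default inner join silently drops those texts.
merged = texts.merge(meta, on='Filename')
print(len(texts), '->', len(merged))
```

If the list of unmatched files is empty on the real data, the inner join used above loses nothing.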

## Train Valid Split on Smoking Prompt

Later on, we will fine-tune BERT, which requires a held-out validation set.  
We make the split here, once, so that all our methods use the same train/validation split.

In [None]:
train_valid = ['/content/drive/MyDrive/data/icnale_bert_lexical/smj_train.csv',
               '/content/drive/MyDrive/data/icnale_bert_lexical/smj_valid.csv']

if not os.path.isfile(train_valid[0]):
    # Load the merged smoking-prompt data written in the previous step.
    df = pd.read_csv(smj_only, index_col=0)
    # Hold out 20%, stratified by country so both splits share the same mix.
    train, valid = train_test_split(
        df,
        test_size=.2,
        stratify=df['Country'],
        random_state=42)
    for fname, frame in zip(train_valid, [train, valid]):
        if not os.path.isfile(fname):
            frame.to_csv(fname)
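
To see what `stratify` does, here is a small sketch on toy data. The three country codes and the 50/30/20 mix are assumptions chosen for illustration; only the `train_test_split` arguments mirror the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows with a 50/30/20 mix of three hypothetical country codes.
toy = pd.DataFrame({
    'Filename': [f'essay_{i:03d}.txt' for i in range(100)],
    'Country': ['AAA'] * 50 + ['BBB'] * 30 + ['CCC'] * 20,
})

toy_train, toy_valid = train_test_split(
    toy,
    test_size=.2,
    stratify=toy['Country'],
    random_state=42)

# Compare the share of each country in the two splits.
train_share = toy_train['Country'].value_counts(normalize=True).sort_index()
valid_share = toy_valid['Country'].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'valid': valid_share}))
```

Both columns should show the same proportions, which is the point of stratifying: a model evaluated on the validation set sees the same country mix it was trained on.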

## Augment training data with part time job prompt

all_icnale_for_bert_training.csv was produced on my PC. I should include the original dataset.

In [None]:
all_icnale = '/content/drive/MyDrive/data/icnale_bert_lexical/all_icnale_for_bert_training.csv'
both_prompts_but_no_validation = '/content/drive/MyDrive/data/icnale_bert_lexical/all_icnale_train.csv'
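if not os.path.isfile(both_prompts_but_no_validation):
    valid = pd.read_csv('/content/drive/MyDrive/data/icnale_bert_lexical/smj_valid.csv', index_col=0)
    # Named all_df to avoid shadowing the built-in all().
    all_df = pd.read_csv(all_icnale, index_col=0)
    # Keep only rows that have a VST value.
    all_df = all_df[all_df['VST'].notna()].reset_index()
    # Exclude validation essays so they never appear in the training data.
    all_train = all_df[~all_df['Filename'].isin(valid['Filename'])]
    all_train.to_csv(both_prompts_but_no_validation, index=False)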
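
The filtering above is what keeps validation essays out of the augmented training set. A minimal sketch of that logic with an explicit leakage check follows. The filenames and VST values are hypothetical; the real ones come from the CSV files.

```python
import pandas as pd

# Hypothetical filenames and scores, for illustration only.
valid = pd.DataFrame({'Filename': ['s2.txt', 's4.txt']})
all_df = pd.DataFrame({
    'Filename': ['s1.txt', 's2.txt', 's3.txt', 's4.txt', 'p1.txt', 'p2.txt'],
    'VST': [30.0, 41.0, None, 28.0, 35.0, 39.0],
})

# Same two filters as above: drop rows without a VST value, then drop
# every essay that belongs to the validation set.
scored = all_df[all_df['VST'].notna()]
all_train = scored[~scored['Filename'].isin(valid['Filename'])]

# Leakage check: no filename may appear in both frames.
overlap = set(all_train['Filename']) & set(valid['Filename'])
assert not overlap, f'validation essays leaked into training: {overlap}'
print(all_train['Filename'].tolist())
```

Running the same assertion on the real frames before writing the CSV would catch a mismatch in filename conventions between the two source files.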

## Get to know your data
This is just a quick-and-dirty test to get a point of reference.  
We will run the final analysis in R, after verifying the assumptions of a linear model.

In [None]:
from sklearn import linear_model

def ball_park():
  predictors = ['LD_Mean_RT_CW', 'Kuperman_AoA_CW', 'WN_Mean_RT_CW',
                'aoe_index_above_threshold_40', 'PLD_CW', 'mtld_original_cw',
                'hdd42_cw', 'McD_CD_CW', 'Brysbaert_Concreteness_Combined_CW',
                'COCA_magazine_bi_DP']

  smj_train = '/content/drive/MyDrive/data/icnale_bert_lexical/smj_train.csv'
  smj_valid = '/content/drive/MyDrive/data/icnale_bert_lexical/smj_valid.csv'

  train = pd.read_csv(smj_train, index_col=0)
  valid = pd.read_csv(smj_valid, index_col=0)
  X_train = train[predictors]
  X_test = valid[predictors]
  y_train = train['VST']
  y_test = valid['VST']

  lin = linear_model.LinearRegression()
  lin.fit(X_train, y_train)
  return lin.score(X_test, y_test) # R Squared

print(f'Ordinary Least Squares (OLS) predicts with an R Squared of {ball_park():.3f}.')
print('This is based on linguistic features generated with TAALES output.')

Ordinary Least Squares (OLS) predicts with an R Squared of 0.173.
This is based on linguistic features generated with TAALES output.
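
For reference, `LinearRegression.score` reports the coefficient of determination, 1 - SS_res / SS_tot, computed on whatever data is passed to it. On held-out data this value can be negative, so it is not literally a squared correlation. The sketch below reproduces the number by hand on synthetic data; the data, coefficients, and split sizes are assumptions for illustration and have nothing to do with the ICNALE features.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (not ICNALE): two predictors plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=2.0, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

lin = linear_model.LinearRegression().fit(X_train, y_train)

# R squared by hand: 1 - (residual sum of squares / total sum of squares),
# with the total computed around the mean of the held-out targets.
residuals = y_test - lin.predict(X_test)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'manual: {r2_manual:.3f}  score(): {lin.score(X_test, y_test):.3f}')
```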
