<a href="https://colab.research.google.com/github/langdonholmes/lexical_analysis/blob/main/ICNALE_lexical_prepare_dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing dataframes

Working with thousands of separate text files can be cumbersome.  
Preparing .csv files with the texts and metadata is a simple alternative.

In [1]:
import os
import pathlib
import pandas as pd
from tqdm import tqdm

## Texts to Dataframe
Using the ICNALE corpus WE24 retrieved in 2020

In [2]:
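# Build one dataframe from a directory of text files and cache it as a .csv.
# NOTE: corpus_dir must point to the directory that holds the ICNALE .txt files.
# The value below repeats the output .csv path, which looks like a slip;
# replace it with the actual corpus directory before running.
corpus_dir = 'data/all_icnale_texts.csv'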
meta_data_path = 'data/WE24.csv'
all_icnale_path = 'data/all_icnale_texts.csv'

if not os.path.isfile(all_icnale_path):
    texts = []
    for file in tqdm(pathlib.Path(corpus_dir).rglob("*/*.txt")):
        with open(file, 'r', encoding="utf-8") as f: 
            text = f.read()
            texts.append((file.name, text))
    all_texts_df = pd.DataFrame.from_records(texts, columns=['Filename', 'text'])

    # Filename: W_CHN_PTJ0_021_A2_0.txt --> id: W_CHN_021
    tmp_arr = (
        all_texts_df['Filename']
        .str.split('_', expand=True) # ['W', 'CHN', 'PTJ0', '021', 'A2', '0.txt']
        )

    all_texts_df['id'] = (
        tmp_arr
        .iloc[:, [0, 1, 3]] # ['W', 'CHN', '021']
        .apply(lambda x: '_'.join(x), axis='columns') # 'W_CHN_021'
    )
    
    all_texts_df['prompt'] = (tmp_arr.iloc[:, 2]) # 'PTJ0'
    
    info = pd.read_csv(meta_data_path)[['Code', 'VST']]
    info = info.rename(columns={'Code':'id'})
    all_texts_df = all_texts_df.merge(info, on='id', how='left')
    all_texts_df.to_csv(all_icnale_path, index=False)
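
The filename parsing and the left merge above can be checked on a toy example before running them on the full corpus. This is a minimal sketch, not part of the original pipeline: the second filename and the `VST` value are made up, and `toy`, `meta`, and `n_unmatched` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy filenames following the pattern quoted above; the second one is made up.
toy = pd.DataFrame({'Filename': ['W_CHN_PTJ0_021_A2_0.txt',
                                 'W_JPN_SMK0_105_B1_1.txt']})

# Split on underscores; columns 0, 1 and 3 form the writer id, column 2 is the prompt.
parts = toy['Filename'].str.split('_', expand=True)
toy['id'] = parts[0] + '_' + parts[1] + '_' + parts[3]
toy['prompt'] = parts[2]

# Made-up metadata covering only one of the two writers.
meta = pd.DataFrame({'Code': ['W_CHN_021'], 'VST': [30]})

# indicator=True adds a '_merge' column that flags rows with no metadata match.
merged = toy.merge(meta.rename(columns={'Code': 'id'}),
                   on='id', how='left', indicator=True)
n_unmatched = (merged['_merge'] == 'left_only').sum()

print(merged[['id', 'prompt', 'VST', '_merge']])
print(f'{n_unmatched} text(s) without metadata')
```

Because the merge uses `how='left'`, texts without a matching `Code` stay in the dataframe with a missing `VST`, so counting the `left_only` rows is a cheap way to spot parsing problems.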

## Train Test Split on Smoking Prompt

Later on, we will be fine-tuning BERT, which means that we need a train/test split.  
We create it here so that all our methods use the same split.  
`smk.csv` was prepared previously for another project.

In [6]:
from sklearn.model_selection import train_test_split

smk_path = 'data/smk.csv'
train_path = 'data/smk_train.csv'
test_path = 'data/smk_test.csv'

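if not os.path.isfile(train_path):
    df = pd.read_csv(smk_path)
    train, test = train_test_split(
        df,
        test_size=.2,
        stratify=df['Country'],  # keep each country's share the same in train and test
        random_state=42)  # fixed seed so the split is reproducible
    train.to_csv(train_path, index=False)
    test.to_csv(test_path, index=False)
else:
    # Reuse the cached split so that `train` and `test` are defined for later cells.
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)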
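
To see what `stratify` does, the same call can be run on a small synthetic table. This is only a sketch: the filenames and the two-country layout are invented, and `toy`, `toy_train`, and `toy_test` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for smk.csv: 20 essays, 10 per country.
toy = pd.DataFrame({
    'Filename': [f'essay_{i}.txt' for i in range(20)],
    'Country': ['CHN'] * 10 + ['JPN'] * 10,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=.2,
    stratify=toy['Country'],  # preserve the country proportions in both parts
    random_state=42)

# Both parts should show the same 50/50 country balance as the full table.
print(toy_train['Country'].value_counts().to_dict())
print(toy_test['Country'].value_counts().to_dict())
```

Comparing the two `value_counts()` results on the real split is a quick way to confirm that no country is over-represented in the test set.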

## Augment training data with part-time job prompt

Add the PTJ0 (part-time job) essays back to the training set, while making sure that no essay from the test set ends up in it.

In [9]:
all_train_path = 'data/all_train.csv'

if not os.path.isfile(all_train_path):
    all_texts_df = pd.read_csv(all_icnale_path)
    all_train = all_texts_df[~all_texts_df['Filename'].isin(test['Filename'])]
    all_train.to_csv(all_train_path, index=False)
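
The filter above removes test essays by `Filename`. A toy example makes it clear what that does and does not exclude. This is a sketch under assumptions: the shortened filenames are made up, and it assumes the test dataframe also carries an `id` column, which the notebook does not confirm.

```python
import pandas as pd

# Two writers, each with one essay per prompt (filenames shortened and made up).
all_toy = pd.DataFrame({
    'Filename': ['W_CHN_SMK0_001.txt', 'W_CHN_PTJ0_001.txt',
                 'W_CHN_SMK0_002.txt', 'W_CHN_PTJ0_002.txt'],
    'id': ['W_CHN_001', 'W_CHN_001', 'W_CHN_002', 'W_CHN_002'],
})

# Pretend writer 002's smoking essay landed in the test set.
test_toy = all_toy[all_toy['Filename'] == 'W_CHN_SMK0_002.txt']

# Same approach as above: drop rows whose Filename appears in the test set.
by_file = all_toy[~all_toy['Filename'].isin(test_toy['Filename'])]

# Stricter alternative (an assumption, not the notebook's method):
# drop every essay written by a writer who appears in the test set.
by_writer = all_toy[~all_toy['id'].isin(test_toy['id'])]

print(by_file['Filename'].tolist())
print(by_writer['Filename'].tolist())
```

Filtering by `Filename` keeps the PTJ0 essay of a writer whose SMK0 essay is in the test set. Since `VST` is merged per writer `id`, filtering by `id` may be worth considering if the test set should also be free of known writers.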

## Get to know your data
This is just a quick and dirty test to get a point of reference.  
We will run the final analysis in R, after verifying the assumptions of a linear model.

In [None]:
from sklearn import linear_model

def ball_park():
    """Fit OLS on the training split and return R squared on the test split."""
    # Lexical indices used as predictors; uses the global `train` and `test` dataframes.
    predictors = ['LD_Mean_RT_CW', 'Kuperman_AoA_CW', 'WN_Mean_RT_CW',
                  'aoe_index_above_threshold_40', 'PLD_CW', 'mtld_original_cw',
                  'hdd42_cw', 'McD_CD_CW', 'Brysbaert_Concreteness_Combined_CW',
                  'COCA_magazine_bi_DP']

    X_train = train[predictors]
    X_test = test[predictors]
    y_train = train['VST']
    y_test = test['VST']

    lin = linear_model.LinearRegression()
    lin.fit(X_train, y_train)
    return lin.score(X_test, y_test)  # R squared on held-out data

print(f'Ordinary Least Squares (OLS) predicts with an R Squared of {ball_park():.3f}.')
print('This is based on linguistic features generated with TAALES output.')

Ordinary Least Squares (OLS) predicts with an R Squared of 0.173.
This is based on linguistic features generated with TAALES output.
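
For reference, the value returned by `LinearRegression.score` is the coefficient of determination, 1 - SS_res / SS_tot, computed on whatever data is passed in. The sketch below checks this on synthetic data; the data, coefficients, and variable names are invented for illustration and have nothing to do with the ICNALE features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: 3 predictors plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=200)

# Simple 80/20 split by position.
X_fit, X_held = X[:160], X[160:]
y_fit, y_held = y[:160], y[160:]

model = LinearRegression().fit(X_fit, y_fit)
score = model.score(X_held, y_held)

# R squared by hand: 1 - residual sum of squares / total sum of squares.
pred = model.predict(X_held)
ss_res = ((y_held - pred) ** 2).sum()
ss_tot = ((y_held - y_held.mean()) ** 2).sum()
manual = 1 - ss_res / ss_tot

print(f'score(): {score:.3f}, by hand: {manual:.3f}')
```

Because the score is computed on held-out data, it can be lower than the in-sample R squared, and it can even be negative when the model predicts worse than the test-set mean.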
