# Thesis data preparation
This notebook organises the data preparation for the thesis.

### Inputs
The `settings.json` file contains all necessary local variables and directories for data files. TODO: Template.
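
Until a template is available, the following is a minimal sketch of what the file could look like. It is inferred from the keys this notebook reads (`settings`, `lexisnexis_source`, `data_csv`). Both paths are placeholders, and `missing_keys` is a hypothetical helper, not part of the notebook.

```python
import json

# Hypothetical template: both paths are placeholders to be replaced locally.
template = {
    "settings": {
        "lexisnexis_source": "path/to/lexisnexis/files/",
        "data_csv": "path/to/output/data.csv",
    }
}

# Keys that this notebook looks up in the "settings" object.
REQUIRED_KEYS = ("lexisnexis_source", "data_csv")


def missing_keys(settings):
    """Return the required keys that are absent from a settings dict."""
    return [key for key in REQUIRED_KEYS if key not in settings]


# Serialise the template as it would appear in settings.json.
template_text = json.dumps(template, indent=4)
print(template_text)

# Round trip: read it back the same way the notebook does and check the keys.
settings = json.loads(template_text)["settings"]
print(missing_keys(settings))
```

The trailing separator on the source directory matters if file names are appended to it by plain string concatenation.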

Edit `lang_map` and `month_map` if necessary. To find the values that crash the script, list the values that occur in a column with, for example, `df['TITLE'].value_counts()`. Errors occur in the conversion to `datetime.date` when the month is wrongly specified and therefore cannot be converted to an integer in the range 1 to 12.
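
One way to look for problematic month values is sketched below on a small, made-up `Series` standing in for `df['DATE']`. The example strings, the shortened `month_map`, and the helper `to_date` are assumptions for illustration only.

```python
import datetime
import pandas as pd

# Hypothetical raw date strings, standing in for df['DATE'].
dates = pd.Series(["january 5, 2010", "sept 12, 2011", "march 2012", "january 9, 2013"])

# Shortened version of the notebook's month_map, for illustration only.
month_map = {"january": 1, "march": 3, "september": 9}

# Extract the first run of lowercase letters, as the notebook's regex does.
month_names = dates.str.extract(r"([a-z]+)", expand=False)

# Count each extracted value, then keep only those without a mapping.
counts = month_names.value_counts()
unmapped = counts[~counts.index.isin(list(month_map))]
print(unmapped)


def to_date(year, month, day):
    """Return a datetime.date, or None if the parts do not form a valid date."""
    try:
        return datetime.date(int(year), int(month), int(day))
    except (TypeError, ValueError):
        return None
```

Values that show up in `unmapped` are candidates for new entries in `month_map`. A guard like `to_date` is one option for skipping rows that still fail instead of stopping the script.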

### LexisParser
The `LexisParser` class parses LexisNexis files with the `.txt` extension into dicts containing all metadata. These are then collected in a pandas `DataFrame` for easy manipulation and export (if necessary).
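
The step from dicts to a `DataFrame` can be sketched as follows. The records are invented, and the sketch assumes that the parser returns one dict per article. Only the field names are taken from the columns this notebook uses.

```python
import pandas as pd

# Hypothetical parser output: one dict of metadata per article.
records = [
    {"TITLE": "Example headline A", "LENGTH": "512 words",
     "LANGUAGE": "ENGLISH", "DATE": "january 5, 2010"},
    {"TITLE": "Example headline B", "LENGTH": "1304 words",
     "LANGUAGE": "English", "DATE": "march 12, 2011"},
]

# Each dict becomes a row and each key becomes a column.
df = pd.DataFrame(records)
print(df.shape)
print(df.columns.tolist())
```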

In [1]:
%matplotlib inline

#Imports
import json
import pandas as pd
import numpy as np
from lexisnexisparse import LexisParser
import os
import re
import datetime

#Settings need to include:
#    lexisnexis_source - source directory for LexisNexis files (cannot be shared openly due to copyright regulations)
#    data_csv - destination path for the exported CSV file
settings_file = "D:/thesis/settings.json"

In [2]:
#Preparation

#Read settings
with open(settings_file, encoding = 'utf-8') as f:
    settings = json.load(f)["settings"]

#LexisNexis file list
#os.path.join also works when the directory is given without a trailing separator
ln_files = [os.path.join(settings['lexisnexis_source'], fname)
            for fname in os.listdir(settings['lexisnexis_source'])
            if fname.lower().endswith(".txt")]

In [3]:
#Parsing LexisNexis Files and placing them in a pandas DataFrame
lp = LexisParser()

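#One concat over all files; ignore_index gives the unique 0..n-1 index that the date loop below relies on
df = pd.concat([pd.DataFrame(lp.parse_file(file)) for file in ln_files], ignore_index = True)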

#Translating variables from text to more useful values

##Length
df['LENGTH'] = df['LENGTH'].apply(lambda x: int(x.replace(" words","")) if isinstance(x,str) else -1)

##Language
lang_map = {'ENGLISH': 'en', 'English': 'en', 'English english': 'en'}
df['LANGUAGE'] = df['LANGUAGE'].map(lang_map)

##Date
re_day = re.compile(r"(?<!\d)\d{1,2}(?!\d)")
month_map = {'january' : 1, 'february' : 2, 'march' : 3, 'april' : 4, 'may' : 5, 'june' : 6, 'july' : 7,
            'august' : 8, 'september' : 9, 'october' : 10, 'november' : 11, 'december' : 12, '': -1}

df['MONTH'] = df['DATE'].apply(lambda x: re.search("[a-z]+",x)[0] if re.search("[a-z]+",x) is not None else "")
df['MONTH'] = df['MONTH'].map(month_map)
df['YEAR'] = df['DATE'].apply(lambda x: int(re.search("[0-9]{4}",x)[0]) if re.search("[0-9]{4}",x) is not None else -1)
df['DAY'] = df['DATE'].apply(lambda x: int(re.search(re_day,x)[0]) if re.search(re_day,x) is not None else -1)

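for i in df.index:
    #Cast to int: unmapped months become NaN, which makes MONTH a float column that datetime.date rejects
    if df.at[i,'MONTH'] > 0 and df.at[i,'DAY'] > 0 and df.at[i,'YEAR'] > 0:
        df.at[i,'DATE_dt'] = datetime.date(year = int(df.at[i,'YEAR']), day = int(df.at[i,'DAY']), month = int(df.at[i,'MONTH']))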
        
del re_day, month_map, lang_map

In [5]:
#Export data
df.to_csv(settings['data_csv'], encoding = 'utf-8')