# Supervised Learning Text Complexity Classifier - Dev

## Dev notes

This version of the notebook uses only the training csv. The goal is to retrain the algorithms tested in the POC notebook, this time on the full training data. Once trained, the models are written to pickle files; the PRD notebook loads those files and generates predictions on the test data.


## Data import and cleaning

The primary goals here are to load the training data set and to create metrics that convey the complexity of the text to our classifiers. We used 3 additional data sources that are part of the Kaggle set. The age of acquisition data set covers around 50,000 words and contains each word's lemmatized root, the average age at which a person learns the word, and the frequency of its use. The concreteness ratings data set contains a smaller number of words, but gives an impression of how strongly a word is associated with a particular idea. Finally, the dale_chall data set contains a list of words that are considered 'basic English'.


#### Dependencies

Below are the data cleaning dependencies only. A longer list of imports for modeling is at the start of that section.

In [1]:
import pandas as pd
import numpy as np
import re
import statistics as stats
import pickle as pkl
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score 
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression
from gensim.models.word2vec import Word2Vec

In [2]:
df = pd.read_csv('WikiLarge_Train.csv')

#### Control of how much data we train with

The train and test dataframes can be adjusted here to be either the full data or a random sample of the full data set. For efficiency, the random samples were used for the initial investigation of appropriate models and some parameter tuning. Final tuning and model selection were based on training done on the full data set.

In [3]:
df_train = df.copy()

# Train Data

### Word Tokenization and Stop Word Removal

Metrics are created from the word count and the syllable counts within the text. However, stop words are common and often only one syllable, so they were removed. The metrics are therefore based only on the remaining words. This will throw off the Flesch-Kincaid score relative to what it would have been on the full text, but when trying to predict text complexity, stop words can be noisy and non-informative.

In [4]:
def tokenize_and_remove_stops(text):
    text_list = word_tokenize(text)
    stop_word_set = set(stopwords.words('english'))
    clean_list = [word.lower() for word in text_list if word.lower() not in stop_word_set]
    clean_list = [word for word in clean_list if re.match('^[a-z]+$', word)]
    
    return clean_list
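
The filtering logic can be tried without the NLTK downloads. The sketch below is a simplified stand-in, not the notebook's function: it assumes whitespace splitting in place of `word_tokenize` and a small hand-picked stop list (`TOY_STOPS`, a hypothetical name) in place of the NLTK list. It shows how the `^[a-z]+$` filter drops punctuation, hyphenated tokens, and bracket markers such as `-LRB-`.

```python
import re

# Hypothetical, tiny stop list standing in for stopwords.words('english').
TOY_STOPS = {"is", "the", "in", "of", "a"}

def toy_tokenize_and_remove_stops(text):
    # Whitespace split is an assumption; the notebook uses nltk's word_tokenize.
    tokens = [tok.lower() for tok in text.split()]
    kept = [tok for tok in tokens if tok not in TOY_STOPS]
    # Keep purely alphabetic tokens only, as in the notebook's regex filter.
    return [tok for tok in kept if re.match(r'^[a-z]+$', tok)]

sample = "Geneva -LRB- , ; -RRB- is the second-most populous city in Switzerland ."
toy_tokens = toy_tokenize_and_remove_stops(sample)
print(toy_tokens)
```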

In [5]:
df_train['clean_text'] = df_train['original_text'].apply(tokenize_and_remove_stops)

df_train.head()

| | original_text | label | clean_text |
|---|---|---|---|
| 0 | There is manuscript evidence that Austen conti... | 1 | [manuscript, evidence, austen, continued, work... |
| 1 | In a remarkable comparative analysis , Mandaea... | 1 | [remarkable, comparative, analysis, mandaean, ... |
| 2 | Before Persephone was released to Hermes , who... | 1 | [persephone, released, hermes, sent, retrieve,... |
| 3 | Cogeneration plants are commonly found in dist... | 1 | [cogeneration, plants, commonly, found, distri... |
| 4 | Geneva -LRB- , ; , ; , ; ; -RRB- is the second... | 1 | [geneva, city, switzerland, populous, city, ro... |


#### Syllable count
While the age of acquisition data source has a syllable count attached to each word, not all words in the texts appear in that data source. This syllable counter was retrieved from Stack Overflow and validated.

In [6]:
def syllable_count(clean_text_list):
    count = 0
    for word in clean_text_list:
        word = word.lower()
        vowels = "aeiouy"
        if word[0] in vowels:
            count += 1
        for index in range(1, len(word)):
            if word[index] in vowels and word[index - 1] not in vowels:
                count += 1
        if word.endswith("e"):
            count -= 1
        if count == 0:
            count += 1
    return count
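
In `syllable_count` the final `if count == 0` guard tests the running total for the whole text, so it can only fire while that total is still zero; a later word that nets zero (for example "be") is not bumped up to one syllable. Whether this matters for the downstream features is not established here. The sketch below is a hypothetical per-word variant (`syllables_in_word` and `syllable_count_per_word` are names introduced for illustration) that applies the same heuristic to each word separately. Switching to it would change the `syllables` values shown in this notebook.

```python
def syllables_in_word(word):
    # Same vowel-group heuristic as the notebook, scoped to a single word.
    word = word.lower()
    vowels = "aeiouy"
    count = 0
    if word and word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    # Every word gets at least one syllable.
    return max(count, 1)

def syllable_count_per_word(clean_text_list):
    # Sum the per-word counts for one cleaned text.
    return sum(syllables_in_word(word) for word in clean_text_list)

print(syllable_count_per_word(["manuscript", "evidence", "be"]))
```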

In [7]:
df_train['syllables'] = df_train['clean_text'].apply(syllable_count)
df_train['word_count'] = df_train['clean_text'].apply(lambda x: len(x))
df_train['avg_syll_per_word'] = df_train['syllables'] / df_train['word_count']
df_train['avg_syll_per_word'] = df_train['avg_syll_per_word'].fillna(0)

df_train.head()

| | original_text | label | clean_text | syllables | word_count | avg_syll_per_word |
|---|---|---|---|---|---|---|
| 0 | There is manuscript evidence that Austen conti... | 1 | [manuscript, evidence, austen, continued, work... | 33 | 17 | 1.941176 |
| 1 | In a remarkable comparative analysis , Mandaea... | 1 | [remarkable, comparative, analysis, mandaean, ... | 33 | 13 | 2.538462 |
| 2 | Before Persephone was released to Hermes , who... | 1 | [persephone, released, hermes, sent, retrieve,... | 38 | 19 | 2.0 |
| 3 | Cogeneration plants are commonly found in dist... | 1 | [cogeneration, plants, commonly, found, distri... | 57 | 27 | 2.111111 |
| 4 | Geneva -LRB- , ; , ; , ; ; -RRB- is the second... | 1 | [geneva, city, switzerland, populous, city, ro... | 19 | 8 | 2.375 |


#### Flesch_kincaid ease

This is a relatively common metric for text complexity based on the number of words per sentence and the number of syllables per word. Here each row's `word_count` stands in for words per sentence. If for some reason the score fails to calculate (this happens on just a couple of rows in the full data set), the mean of all successful scores is substituted.

In [8]:
def flesch_kincaid_ease(row):
    return round(206.835 - 1.015*(row['word_count']) - 84.6*(row['avg_syll_per_word']),2)
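
As a quick check of the formula, the sketch below recomputes the score for row 0 of `df_train` from the values shown in the earlier output (17 remaining words, 33 syllables). `flesch_reading_ease` is a stand-alone helper written for this illustration; it takes plain numbers instead of a DataFrame row and otherwise mirrors the arithmetic above.

```python
def flesch_reading_ease(word_count, avg_syll_per_word):
    # The row's word count is used where the formula has words per sentence.
    return round(206.835 - 1.015 * word_count - 84.6 * avg_syll_per_word, 2)

# Row 0: 17 words after stop word removal, 33 syllables.
row0_score = flesch_reading_ease(17, 33 / 17)
# Row 2: 19 words, 2.0 syllables per word.
row2_score = flesch_reading_ease(19, 2.0)
print(row0_score, row2_score)
```

Longer texts and more syllables per word both lower the score, which is why some rows come out negative.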

In [9]:
df_train['fc_ease'] = df_train.apply(flesch_kincaid_ease, axis = 1)

fc_mean = df_train['fc_ease'].dropna().mean()
df_train['fc_ease'] = df_train['fc_ease'].fillna(fc_mean)

df_train.head()

| | original_text | label | clean_text | syllables | word_count | avg_syll_per_word | fc_ease |
|---|---|---|---|---|---|---|---|
| 0 | There is manuscript evidence that Austen conti... | 1 | [manuscript, evidence, austen, continued, work... | 33 | 17 | 1.941176 | 25.36 |
| 1 | In a remarkable comparative analysis , Mandaea... | 1 | [remarkable, comparative, analysis, mandaean, ... | 33 | 13 | 2.538462 | -21.11 |
| 2 | Before Persephone was released to Hermes , who... | 1 | [persephone, released, hermes, sent, retrieve,... | 38 | 19 | 2.0 | 18.35 |
| 3 | Cogeneration plants are commonly found in dist... | 1 | [cogeneration, plants, commonly, found, distri... | 57 | 27 | 2.111111 | 0.83 |
| 4 | Geneva -LRB- , ; , ; , ; ; -RRB- is the second... | 1 | [geneva, city, switzerland, populous, city, ro... | 19 | 8 | 2.375 | -2.21 |


Quick check to see if there is a difference between the two labels on this metric.

In [10]:
df_train.groupby('label')['fc_ease'].mean()

label
0    25.707627
1    15.813260
Name: fc_ease, dtype: float64

#### Concreteness Ratings

The cell below reads the concreteness data source and converts it into a dictionary with the word as the key and the measure values as entries in a nested dictionary. This allows me to calculate averages for a given text.

In [11]:
df_conc = pd.read_csv('Concreteness_ratings_Brysbaert_et_al_BRM.txt', sep = '\t').fillna('0')
basic_list = pd.read_csv('dale_chall.txt')['a'].to_list()


df_conc['non_basic'] = df_conc['Word'].apply(lambda x: 0 if x in basic_list else 1)
mean_conc = round(df_conc['Conc.M'].mean(), 4)
print(df_conc.head(1))
print(mean_conc)

conc_keys = df_conc['Word'].tolist()

dict_list = df_conc.set_index('Word').to_dict(orient = 'records')

conc_dict = {}

for i in range(len(conc_keys)):
    conc_dict[conc_keys[i]] = dict_list[i]

          Word  Bigram  Conc.M  Conc.SD  Unknown  Total  Percent_known  \
0  roadsweeper       0    4.85     0.37        1     27           0.96   

   SUBTLEX Dom_Pos  non_basic  
0        0       0          1  
3.0363


#### Age of Acquisition

Similar to the concreteness data set, the cell below gets the desired measures from the AoA data source into a dictionary for use in the metric calculations. One difference is that the AoA data source also has a column for alternative spellings. If the alternative spelling listed is different than the primary spelling, that spelling is added to the final dictionary as a unique entry.

In [12]:
df_age = pd.read_csv('AoA_51715_words.csv')
df_age['non_basic'] = df_age['Lemma_highest_PoS'].apply(lambda x: 0 if x in basic_list else 1)
#print(df_age.head())

mean_aoa = round(df_age['AoA_Kup_lem'].mean(), 4)
mean_perc_known = round(df_age['Perc_known_lem'].mean(),4)
med_freq = df_age['Freq_pm'].median()

#print(mean_aoa, mean_perc_known, med_freq)

age_keys = df_age['Word'].to_list()
age_alt_keys = df_age['Alternative.spelling'].to_list()

df_columns_for_dict = df_age[['Freq_pm', 'Dom_PoS_SUBTLEX', 'AoA_Kup_lem', 'Perc_known_lem', 'non_basic']].copy()
df_columns_for_dict.columns = ['freq', 'pos', 'aoa', 'perc_known', 'non_basic']
dict_list = df_columns_for_dict.to_dict(orient = 'records')


age_dict = {}
for i in range(len(age_keys)):
    age_dict[age_keys[i]] = dict_list[i]

for i in range(len(age_alt_keys)):
    if age_alt_keys[i] not in age_dict:
        age_dict[age_alt_keys[i]] = dict_list[i]
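
The two-pass construction can be seen on a toy example. The rows below are made up for illustration and are not taken from the AoA file; `toy_rows` and `toy_age_dict` are hypothetical names. Primary spellings are inserted first, then an alternative spelling is added only when it is not already a key.

```python
# Toy rows standing in for the AoA file; the values are invented.
toy_rows = [
    {"Word": "color", "Alternative.spelling": "colour", "aoa": 4.0},
    {"Word": "cat", "Alternative.spelling": "cat", "aoa": 3.0},
]

toy_age_dict = {}
# First pass: primary spellings.
for row in toy_rows:
    toy_age_dict[row["Word"]] = {"aoa": row["aoa"]}
# Second pass: alternative spellings, added only when not already present.
for row in toy_rows:
    toy_age_dict.setdefault(row["Alternative.spelling"], {"aoa": row["aoa"]})

print(sorted(toy_age_dict))
```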
        

#### Metric calculations

The goal of the functions below is to get the average value of each measure for each clean text entry in df_train. The percent_known measure is present in both data sources, so the larger data source (age of acquisition) is checked first. Frequency has such heavy outliers on the high end that the median was selected instead of the mean. If a word cannot be found, the means of the features are substituted (median for frequency).

In [13]:
def get_mean_conc(text):
    sum_conc = 0
    word_count = max(len(text), 1)
    for word in text:
        try:
            sum_conc += conc_dict[word]['Conc.M']
        except:
            sum_conc += mean_conc
            
    return round(sum_conc / word_count, 4)

def get_mean_aoa(text):
    sum_aoa = 0
    word_count = max(len(text), 1)
    for word in text:
        try:
            sum_aoa += age_dict[word]['aoa']
        except:
            sum_aoa += mean_aoa
            
    return round(sum_aoa / word_count, 4)

def get_mean_perc(text):
    sum_perc = 0
    word_count = max(len(text), 1)
    for word in text:
        try:
            sum_perc += age_dict[word]['perc_known']
        except:
            try:
                sum_perc += conc_dict[word]['Percent_known']
            except:
                sum_perc += mean_perc_known
                
    return round(sum_perc / word_count, 4)

def get_mean_freq(text):
    list_freq = []
    for word in text:
        try:
            list_freq.append(age_dict[word]['freq'])
        except:
            list_freq.append(med_freq)
    
    if not list_freq:
        list_freq.append(0)
            
    return stats.median(list_freq)

def count_non_basic(text):
    count = 0
    for word in text:
        try:
            count += age_dict[word]['non_basic']
        except:
            try:
                count += conc_dict[word]['non_basic']
            except:
                count += 1
    return count
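
The lookup-with-fallback pattern used by these functions can also be written with `dict.get`, which falls back only when the word is missing rather than on any exception. The sketch below is one possible alternative under that assumption, not the notebook's implementation; the dictionary contents and the fallback value are invented for illustration.

```python
# Toy lookup table and fallback mean; the numbers are invented.
toy_conc_dict = {"dog": {"Conc.M": 4.8}, "idea": {"Conc.M": 1.6}}
toy_mean_conc = 3.0

def toy_get_mean_conc(text):
    word_count = max(len(text), 1)  # avoid dividing by zero on empty texts
    total = 0.0
    for word in text:
        entry = toy_conc_dict.get(word)
        # Unknown words contribute the fallback mean.
        total += entry["Conc.M"] if entry is not None else toy_mean_conc
    return round(total / word_count, 4)

print(toy_get_mean_conc(["dog", "idea", "zyzzyva"]))
print(toy_get_mean_conc([]))
```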

In [14]:
df_train['mean_conc'] = df_train['clean_text'].apply(get_mean_conc)
df_train['mean_aoa'] = df_train['clean_text'].apply(get_mean_aoa)
df_train['mean_perc_known'] = df_train['clean_text'].apply(get_mean_perc)
df_train['mean_freq'] = df_train['clean_text'].apply(get_mean_freq)
df_train['non_basic_words'] = df_train['clean_text'].apply(count_non_basic)
df_train = df_train.fillna(0)
#df_train.replace([np.inf, -np.inf], np.nan).isnull().sum() #sanity check for nulls or infs

#df_train.head()

### Tfidf Logistic Regression Probability

The goal here is to train a logistic regression classifier on the tfidf vectors of the lemmatized words remaining in the cleaned text. Once trained, it will be used to predict the probability of the class. This probability will be used as a feature along with other text-based features calculated below.

In [15]:
def lem_combine(text_list):
    lem = WordNetLemmatizer()
    lem_word = []
    for word in text_list:
        lem_word.append(lem.lemmatize(word))
        
    return (' '.join(word for word in lem_word))

In [16]:
df_train['lem_text'] = df_train['clean_text'].apply(lem_combine)

In [17]:
vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range = (1,3), max_features = 30000)

X_vec = vectorizer.fit_transform(df_train['lem_text'])
y_vec = df_train['label']

log_vec = LogisticRegression(max_iter = 1000)
log_vec.fit(X_vec, y_vec)

LogisticRegression(max_iter=1000)

In [18]:
df_train['logreg_prob'] = [num[0] for num in log_vec.predict_proba(X_vec)]

In [19]:
### Word2Vec Average Vector

In [20]:
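w2v = Word2Vec(df_train['lem_text'].apply(lambda x: x.split()))


def document_vector(text):
    # Go through w2v.wv for membership tests and vector lookups;
    # this works in both gensim 3.x and 4.x.
    doc = [word for word in text.split() if word in w2v.wv]
    if len(doc) == 0:
        doc.append('he')
    # No axis argument: the mean covers every element, giving one scalar per text.
    return np.mean(w2v.wv[doc])

df_train['w2v'] = df_train.lem_text.apply(document_vector)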
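
In `document_vector`, `np.mean` is called without an `axis` argument, so it averages over every element of every word vector and each text ends up with a single scalar in the `w2v` column. The sketch below contrasts that with a per-dimension average, which would keep one value per embedding dimension. It uses plain Python lists and made-up 3-dimensional vectors (`toy_vectors` is hypothetical) so it runs without a trained model.

```python
from statistics import fmean

# Hypothetical 3-dimensional word vectors; real Word2Vec vectors are learned.
toy_vectors = {"manuscript": [0.2, -0.4, 0.6], "evidence": [0.4, 0.0, -0.2]}

def scalar_document_mean(words):
    # One number per text: mean over all components of all word vectors.
    return fmean(value for word in words for value in toy_vectors[word])

def per_dimension_mean(words):
    # One number per dimension: same length as a single word vector.
    return [fmean(column) for column in zip(*(toy_vectors[w] for w in words))]

doc = ["manuscript", "evidence"]
scalar_mean = scalar_document_mean(doc)
dimension_means = per_dimension_mean(doc)
print(scalar_mean)
print(dimension_means)
```

The scalar form fits into a single feature column, as used here; the per-dimension form would need one column per dimension.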

  


#### Final X and y for model

Below is a list of the columns being used to make X and y. 

In [21]:
X_train = df_train[['word_count', 'avg_syll_per_word', 'fc_ease', 'mean_conc', 'mean_aoa', 'mean_perc_known', 'mean_freq', 'non_basic_words', 'logreg_prob', 'w2v']]
y_train = df_train['label']

#### K-Nearest Neighbors Classifier (high performer)

In [22]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

knn_clf = KNeighborsClassifier(n_neighbors =100, weights = 'distance')
knn_clf.fit(X_train_scaled, y_train)

KNeighborsClassifier(n_neighbors=100, weights='distance')

In [24]:
with open('log_reg.pkl', 'wb') as handle:
    pkl.dump(log_vec, handle, protocol = pkl.HIGHEST_PROTOCOL)
    
with open('knn.pkl', 'wb') as handle:
    pkl.dump(knn_clf, handle, protocol = pkl.HIGHEST_PROTOCOL)
    
with open('vectorizer.pkl', 'wb') as handle:
    pkl.dump(vectorizer, handle, protocol = pkl.HIGHEST_PROTOCOL)
    
with open('aoa.pkl', 'wb') as handle:
    pkl.dump(age_dict, handle, protocol = pkl.HIGHEST_PROTOCOL)
    
with open('concrete.pkl', 'wb') as handle:
    pkl.dump(conc_dict, handle, protocol = pkl.HIGHEST_PROTOCOL)
    
with open('w2v.pkl', 'wb') as handle:
    pkl.dump(w2v, handle, protocol = pkl.HIGHEST_PROTOCOL)
    
    
misc_dict = {
    'mean_conc' : mean_conc,
    'mean_aoa' : mean_aoa,
    'mean_perc_known' : mean_perc_known,
    'med_freq' : med_freq
    
}

with open('misc.pkl', 'wb') as handle:
    pkl.dump(misc_dict, handle, protocol = pkl.HIGHEST_PROTOCOL)
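
On the reading side, each file is opened in binary mode and passed to `pkl.load`. The sketch below is a minimal round trip under assumptions: it writes a stand-in dictionary to a temporary directory instead of the notebook's working directory, and the `med_freq` value is invented for illustration.

```python
import os
import pickle as pkl
import tempfile

# Stand-in values; the notebook's misc_dict holds the computed fallbacks.
toy_misc = {"mean_conc": 3.0363, "med_freq": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "misc.pkl")
    with open(path, "wb") as handle:
        pkl.dump(toy_misc, handle, protocol=pkl.HIGHEST_PROTOCOL)
    # Reading side: open in binary mode and call pkl.load.
    with open(path, "rb") as handle:
        restored = pkl.load(handle)

print(restored == toy_misc)
```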
