# Notebook 2 


### Preprocessing and building new features from original text

This notebook discusses the various steps needed to preprocess the dataset and get it ready for the classification task. The binary classes we are interested in are:

- **0** text does NOT need to be simplified 
- **1** text needs to be simplified*

Let's review the steps needed to accomplish the document classification task.

### Steps:

1. Problem definition and solution approach (covered in the README file)
2. Input data
3. Exploratory Data Analysis (Covered in Notebook 1 "Exploratory Data Analysis notebook")
4. Feature Engineering (covered in this notebook "Notebook 2")
5. Predictive Models (covered in modeling notebook "Notebook 3" and "Notebook 4")

## Input data

The training data contains 416,768 sentences from simple Wikipedia, already labeled with one of the above categories. There are two columns in the dataset (original text and label).
There is an additional data set (test data) that contains 119,092 comments that are unlabeled. 

## Feature Engineering

There are several aspects of a text that influence its readability, including the vocabulary level, the syntactic structure, and overall coherence. Based on findings from Cardoso et al., 2013 and using some intuition of what might make a text easy or hard to read, we settled on the following features:


#### Surface Features

- Average number of characters per word in the document

- Average number of words per sentence in the document

- Total number of words in the document

#### Syntactic Features

According to Kate et al.(2010), syntactic structures appear to affect text complexity level. As Barzilay and Lapata(2008)note, more noun phrases make texts more complex and harder to understand.

Example of a noun phrase:
**The glowing stars shining with bright colors, took our breath away this evening.**

For more information on identifying phrases refer to: https://www.albert.io/blog/identifying-phrases/

- Number of noun phrases in the document

- Number of verb phrases in the document

#### Discourse and coherence features

Discourse features can refer to text cohesion and coherence. *Text cohesion* refers to the grammatical and lexical links which connect linguistic entities together. *Text coherence* refers to the connection between ideas (Davoodi & Kosseim, 2016)

- Number of pronouns in the document (e.g "I", "he"..etc)

- Number of coordinate and subordinate conjuncts in the document("however", "consequently"..etc)

#### Readability scores

- Age of Acquisition (the approximate age in years when a word was learned). The assumption here is that a word learned at a younger age is simpler than a word learned at a more advanced age. 

- Flesch Reading Ease Score: measures a text’s complexity level and maps it to an educational level (low scores indicate higher difficulty)

- Dale-Chall Readability Score (uses a lookup table of the most commonly used 3000 English words and returns the grade level using the New Dale-Chall Formula)

- McAlpine EFLAW Readability Score (Returns a score for the readability of an english text for a foreign learner or English, focusing on the number of miniwords and length of sentences)

In [6]:
# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

#data processing
import pandas as pd
import numpy as np
import random
import re
import pickle
import ast

# natural language processing
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from collections import defaultdict, Counter 
from nltk import pos_tag
import spacy

#visualization
%matplotlib inline
import matplotlib.pyplot as plt 
import seaborn as sns

#tracking progress
from tqdm import tqdm
from tqdm.notebook import tqdm
tqdm.pandas()

#display complete non-truncated text data
pd.set_option('display.max_colwidth', None)
# print multiple lines at once
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Textstat is a python package, to calculate statistics
# from text to determine readability,complexity and grade level of a particular corpus.
# Package can be found at https://pypi.python.org/pypi/textstat
import textstat
from textstat.textstat import textstatistics

First, let's define a function that takes the large file 'WikiLarge_Train.csv' and returns a random sample of the data of size 1000. This will help us experiment with different techniques and classifiers

In [692]:
def load_data(filename='data/WikiLarge_Train.csv', partial=False, size=1000):
    """
    function to take a random sample of the data
    if partial, then a random sample of size 'size' is returned
    size is the number of lines we take if partial==True
    """

    if partial:
        with open(filename, 'r') as fp:
            for count, line in enumerate(fp):
                pass
        length= count
        #take a random sample
        skip = sorted(random.sample(range(1, length + 1), length - size))
        df = pd.read_csv(filename, skiprows=skip)
        csv_file= df.to_csv('data/sample.csv', index= False) 
        return csv_file

    df = pd.read_csv(filename)
    return df

Now, we are ready to normalize and preprocess the input data. The goal is to remove as much noise as possible. This includes removing non-letter characters, stop words, converting words into lower case, tokenizing and lemmatizing. I debated whether or not to keep stop words but decided to ultimately keep them. My logic is that stop words will be important in extracting discourse and cohesion features.

**NOTE** There are many different parameter settings for Vectorizer objects in scikit-learn. Small changes in these settings can result in very different text representations and significant changes in final classifier accuracy.

In [715]:
def preprocessing(text):
    text= [word for word in text.lower().split() if not word.startswith('-') and word.isalpha()]
    if len(text)> 1:
        return text

In [716]:
# load dataframe and get the preprocessed text
df = load_data(filename='data/WikiLarge_Train.csv', partial=False, size=1000)
df['preprocessed']= df['original_text'].progress_apply(preprocessing)

  0%|          | 0/416768 [00:00<?, ?it/s]

In [717]:
# remove None values for empty strings
df = df.replace(to_replace='None', value=np.nan).dropna()
df.head()

Unnamed: 0,original_text,label,preprocessed
0,"There is manuscript evidence that Austen continued to work on these pieces as late as the period 1809 â '' 11 , and that her niece and nephew , Anna and James Edward Austen , made further additions as late as 1814 .",1,"[there, is, manuscript, evidence, that, austen, continued, to, work, on, these, pieces, as, late, as, the, period, â, and, that, her, niece, and, nephew, anna, and, james, edward, austen, made, further, additions, as, late, as]"
1,"In a remarkable comparative analysis , Mandaean scholar Säve-Söderberg demonstrated that Mani 's Psalms of Thomas were closely related to Mandaean texts .",1,"[in, a, remarkable, comparative, analysis, mandaean, scholar, demonstrated, that, mani, psalms, of, thomas, were, closely, related, to, mandaean, texts]"
2,"Before Persephone was released to Hermes , who had been sent to retrieve her , Hades tricked her into eating pomegranate seeds , -LRB- six or three according to the telling -RRB- which forced her to return to the underworld for a period each year .",1,"[before, persephone, was, released, to, hermes, who, had, been, sent, to, retrieve, her, hades, tricked, her, into, eating, pomegranate, seeds, six, or, three, according, to, the, telling, which, forced, her, to, return, to, the, underworld, for, a, period, each, year]"
3,"Cogeneration plants are commonly found in district heating systems of cities , hospitals , prisons , oil refineries , paper mills , wastewater treatment plants , thermal enhanced oil recovery wells and industrial plants with large heating needs .",1,"[cogeneration, plants, are, commonly, found, in, district, heating, systems, of, cities, hospitals, prisons, oil, refineries, paper, mills, wastewater, treatment, plants, thermal, enhanced, oil, recovery, wells, and, industrial, plants, with, large, heating, needs]"
4,"Geneva -LRB- , ; , ; , ; ; -RRB- is the second-most-populous city in Switzerland -LRB- after Zürich -RRB- and is the most populous city of Romandie -LRB- the French-speaking part of Switzerland -RRB- .",1,"[geneva, is, the, city, in, switzerland, after, zürich, and, is, the, most, populous, city, of, romandie, the, part, of, switzerland]"


In [718]:
# check if all the processed data contains words of at least length 2
for row in df['preprocessed']:
    if len(row)==1:
        print("found one")

Now that we have a new column containing the preprocessed text, we can go ahead and do some fearure extraction.

### Surface Features 

- Total number of words in a document

- Average word length per document

- Average number of syllables in a document

- Number of uncommon words

- Number of difficult words


In [719]:
def word_count(text):
    """
    Returns Number of Words in each document
    """
    words = len(text)

    return words

def avg_word_length(text):
    """
    Calculate the average token length
    """
    try:
        lengths = [len(token) for token in text]
  
        if len(lengths) > 0:
            return np.nanmean(lengths)
        else:
            if None:
                return np.nan
    except:
        return np.nan
    

def syllables_count(text):
    """
    Returns the average number of syllables per doc
    """
    syls=[]
    try:
        if len(text)>0:
            for word in text:
                syl= textstatistics().syllable_count(word) 
                syls.append(syl)
            return np.mean(syls)
        else:
            if None:
                return 0
    except:
        return 0
    
def difficult_words(text):
    """
    Return total number of difficult words in each doc
    difficult words are those with syllables >= 2
    easy_word_set is provide by Textstat as
    a list of common words
    """
    if text is None:
        text = [""]
    else:
        text= " ".join(text)
        diff_words= textstat.difficult_words(text)
 
        return diff_words

def uncommon (text):
    """
    Returns words that are not_in_dale_chall
    """
    Dale_Chall_List = pd.read_csv("data/dale_chall.txt")
    if text is None:
        text = []

    n = [w for w in text if w not in list (Dale_Chall_List['a'])]
    n1 = len(n)
    return n1

In [720]:
#get word count feature
df['word_count']=df['preprocessed'].progress_apply(word_count)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [721]:
#get average word length feature
df['avg_word_count']=df['preprocessed'].progress_apply(avg_word_length)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [723]:
#get average number of characters per word
df['syllable_count']=df['preprocessed'].progress_apply(syllables_count)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [725]:
#get the number of uncommon words
df['uncommon']= df['preprocessed'].progress_apply(uncommon)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [726]:
#get the number of difficult words
df['difficult_words']= df['preprocessed'].progress_apply(difficult_words)

  0%|          | 0/403608 [00:00<?, ?it/s]

### Save data into intermediate file

Once a file is saved into a csv file, lists are converted into strings. I found out that to prevent this from
happening, we'd need to serialize the dataframe in pickle format using pd.to_pickle


In [727]:
df.to_pickle("data/five_features.pkl") 

### Stemming

Some applications benefit from stemming such as opic modeling and training a word vector since accurate counts within the window of a word would be disrupted by an irrelevant inflection like a simple plural or present tense inflection.

In [728]:
def stem(lst):
    """
    Converts words into stems
    """
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    porter = PorterStemmer()
 
    stem_sentence=[]
    for word in res:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

In [842]:
df = pd.read_pickle("data/five_features.pkl")

In [844]:
df['stem']=df['preprocessed'].progress_apply(stem)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [845]:
#Let's check what the output looks like and if it matches our expectations
df.head(2)

Unnamed: 0,original_text,label,preprocessed,word_count,avg_word_count,syllable_count,uncommon,difficult_words,stem
0,"There is manuscript evidence that Austen continued to work on these pieces as late as the period 1809 â '' 11 , and that her niece and nephew , Anna and James Edward Austen , made further additions as late as 1814 .",1,"[there, is, manuscript, evidence, that, austen, continued, to, work, on, these, pieces, as, late, as, the, period, â, and, that, her, niece, and, nephew, anna, and, james, edward, austen, made, further, additions, as, late, as]",35,4.485714,1.371429,14,7,there is manuscript evid that austen continu to work on these piec as late as the period â and that her niec and nephew anna and jame edward austen made further addit as late as
1,"In a remarkable comparative analysis , Mandaean scholar Säve-Söderberg demonstrated that Mani 's Psalms of Thomas were closely related to Mandaean texts .",1,"[in, a, remarkable, comparative, analysis, mandaean, scholar, demonstrated, that, mani, psalms, of, thomas, were, closely, related, to, mandaean, texts]",19,6.0,1.789474,14,8,there is manuscript evid that austen continu to work on these piec as late as the period â and that her niec and nephew anna and jame edward austen made further addit as late as


### Discourse and coherence features

- Number of pronouns in the document


In [846]:
english_pronouns = ["I", "you", "he", "she", "it", "they", "me", "you", "him", "her", "it", "my", "mine", \
                    "your", "yours", "his", "her", "hers", "its", "who", "whom", "whose", "what", "which", \
                    "another", "each", "everything", "nobody", "either", "someone", "who", "whom", "whose", \
                    "that", "which", "myself", "yourself", "himself", "herself", "itself", "this", "that"]
   
english_conjunctions = ["and", "nor", "but", "or", "yet", "so", "though", "although", "even though", "while", \
                        "if", "only if", "unless", "until", "provided that", "assuming that", "even if", \
                        "in case", "lest", "than", "rather than", "whether", "as much as", "whereas", "after", \
                        "as long as", "as soon as", "before", "by the time", "now that", "once", "since", "till",\
                        "until", "when", "whenever", "while", "because", "since", "so that", "in order", "why", \
                        "that", "what", "whatever", "which", "whichever", "as though", "as if", "wherever", "also",\
                        "besides", "furthermore", "likewise", "moreover", "however", "nevertheless", "nonetheless",\
                        "still", "conversely", "instead", "otherwise", "rather", "accordingly", "consequently",\
                        "hence", "meanwhile", "then", "therefore", "thus"]
 
def get_discourse(text):
    number_of_pronouns=0
    for pronoun in english_pronouns:
        if pronoun in text:
            number_of_pronouns+=1
    return number_of_pronouns

def cohesive_features(text):
    number_of_conjunctions = 0
    for conjunct in english_conjunctions:
        if conjunct in text:
            number_of_conjunctions+=1
    return number_of_conjunctions

In [847]:
#get the number of pronouns
df['discourse']=df['preprocessed'].progress_apply(get_discourse)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [848]:
#get the number of cohesive features
df['cohesive_features']=df['preprocessed'].progress_apply(cohesive_features)

  0%|          | 0/403608 [00:00<?, ?it/s]

#### Readability scores

- Age of Acquisition

- Flesch Reading Ease Score 

- Dale-Chall Readability Score  

- McAlpine EFLAW Readability Score

In [7]:
def mean_aoa(doc):
    """
    Return the mean of the aoa score of a document
    """
    import warnings
    warnings.filterwarnings("ignore")
    # read the file
    aoa_df = pd.read_csv('data/AoA_51715_words.csv', encoding='iso-8859-1')
    # create a dictionary to hold a word and its score
    aoa_dict= aoa_df.set_index('Word').to_dict()['AoA_Kup_lem']
    scores = []
    for token in doc:
        if token.lower() in aoa_dict:
            scores.append(aoa_dict[token.lower()])
    if scores:
        return np.nanmean(scores)
    else:
        return np.nan

    
def flesch_reading_ease(text):
    """
        Implements Flesch Formula:
        Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
        Here,
          ASL = average sentence length (number of words
                divided by number of sentences)
          ASW = average word length in syllables (number of syllables
                divided by number of words)
    """
    text= " ".join(text)
    FRE = textstat.flesch_reading_ease(text)
    return FRE

def dale_chall_readability_score(text):
    """
        Implements Dale Challe Formula:
        Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
        
        PDW = Percentage of difficult words.
        ASL = Average sentence length
    """
    text= " ".join(text)
    score= textstat.dale_chall_readability_score(text)
    return score

def mcalpine(text):
    text= " ".join(text)
    return textstat.mcalpine_eflaw(text)

In [850]:
#get the Flesch score
df['flesch']=df['preprocessed'].progress_apply(flesch_reading_ease)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [851]:
#get the Dale Chall score
df['dale']=df['preprocessed'].progress_apply(dale_chall_readability_score)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [852]:
#get the mcalpine score
df['mcalpine']=df['preprocessed'].progress_apply(mcalpine)

  0%|          | 0/403608 [00:00<?, ?it/s]

#### Syntactic Features

- Proportion of nouns and adjectives in each document

In [862]:
#part of speech tags: source https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html

def pos_nouns(doc):
    """
    Returns proprotion of nouns and adjectives in NLTK part of speech for each doc
    """
    count=0
    word_count=0
    for pos in pos_tag(doc):
        word_count+=1
        if pos[1]=='NN' or pos[1]== 'NNS' or pos[1]=='ADJ':
            count+=1
    return count/word_count

In [863]:
df['nouns_adjs']=df['preprocessed'].progress_apply(pos_nouns)

  0%|          | 0/403608 [00:00<?, ?it/s]

In [866]:
# create a new column that contains the cleaned up text (string as opposed to a list)
df['normalized']=df['preprocessed'].apply(lambda x: " ".join(x))

In [867]:
# let's save the data in a pickle file
df.to_pickle("data/features.pkl")


In [868]:
# columns in the features file
df.columns

Index(['original_text', 'label', 'preprocessed', 'word_count',
       'avg_word_count', 'syllable_count', 'uncommon', 'difficult_words',
       'stem', 'discourse', 'cohesive_features', 'flesch', 'dale', 'mcalpine',
       'nouns_adjs', 'normalized'],
      dtype='object')

In [870]:
#df['aoa']=df['preprocessed'].progress_apply(mean_aoa)

Function to vectorize the text into tf-idf vectors. 

We'll define our vectorizer that will learn that mapping and then call the fit transform function to learn how to map our text to a particular matrix. 

In [871]:
def split_vectorize(df):
    """
    Returns vectorized features for training and testing data sets
    """
    #Note that split does not shuffle, so we'll use DataFrame.sample() and randomly resample our entire dataset 
    #to get a random shuffle before the split.
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    train_df,dev_df,test_df=np.split(df.sample(frac=1,random_state=42), 
                                 [int(.8*len(df)), 
                                  int(.9*len(df))])
    
    # initialize the vectorizer object and set max features to 1000
    vectorizer= TfidfVectorizer(max_features=1000, ngram_range=(1,2), min_df=1) #remove extrememly rare words
    # fit transform/only fit test and dev set
    X_train=vectorizer.fit_transform(train_df["normalized"])
    X_test = vectorizer.fit(test_df["normalized"])
    
    return X_train,X_test

In [875]:
df.columns

Index(['original_text', 'label', 'preprocessed', 'word_count',
       'avg_word_count', 'syllable_count', 'uncommon', 'difficult_words',
       'stem', 'discourse', 'cohesive_features', 'flesch', 'dale', 'mcalpine',
       'nouns_adjs', 'normalized'],
      dtype='object')

In [None]:
plt.figure(figsize=(14,11))

df_plotting= df[['word_count','avg_word_count', 'syllable_count', 'uncommon', 'difficult_words',\
                 'discourse', 'cohesive_features', 'flesch', 'dale', 'mcalpine','nouns_adjs']]
ax = sns.pairplot(data = df, plot_kws = dict(color = "maroon"))
plt.show()

<Figure size 1008x792 with 0 Axes>

In [3]:
import pandas as pd
df2=pd.read_pickle("data/features.pkl")

In [8]:
df2['aoa']=df2['preprocessed'].progress_apply(mean_aoa)

  0%|          | 0/403608 [00:00<?, ?it/s]