### This notebook is meant for data preprocessing and feature engineering, not feature visualisation and EDA. Further data analysis will be done in "2. Exploratory Data Analysis.ipynb"
This preprocessing notebook is a standalone notebook containing intensive preprocessing that takes roughly two hours to preprocess the data. ALL EDA is done in notebook number 2 for ease of use. Data will be stored in the path "datasets/" <br><br>
Here, I add the following features:
- num_numbers - If someone uses numbers, perhaps they are being more elaborate and is a good answer
- prop_numbers - Not only the numbers but the proportion as well
- num_words - Does the number of the word affect outcomes?
- nchar - Another way to meature words
- word_density - To measure average length of words, I used n_char / num_words
- num_punctuation - Maybe punctuation would help, indicative of code and proper grammar
- prop_punctuation - Percentage of punctuation would be a proxy for percentage of code
- noun_count - Count number of nouns used, may have some effect
- verb_count - Count number of verbs used, may have some effect
- adj_count - Count number of adjectives used, may have some effect
- adv_count - Count number of adverbs used, may have some effect
- pron_count - Count number of pronouns used, may have some effect
- Latent dirichlet allocation - Topic Modelling May help

The proportions of noun, adjectives, adverbs, pronouns, verbs were constructed in the EDA notebook, although these weren't used in the end.

Read in libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import textblob
from string import punctuation
from tqdm.auto import tqdm  # for notebooks
from sklearn import decomposition
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
train = pd.read_csv("datasets/train.csv")
test = pd.read_csv("datasets/test.csv")
# Standardise all texts to use lower
train["Comment"] = train["Comment"].apply(lambda x: x.lower())
test["Comment"] = test["Comment"].apply(lambda x: x.lower())
train.head()

Unnamed: 0,Comment,Outcome,Id
0,combining lindelof's and gregg lind's ideas: l...,1,15086
1,in most cases r is an interpreted language tha...,1,41061
2,"i don't know r at all, but a bit of creative g...",1,34417
3,if you don't want to modify the list in-place ...,1,30549
4,i assume it helps if the matrix is sparse? yes...,1,8496


#### Functions to get part of speech tagging, to obtain nouns, pronouns, verbs, adjectives and adverbs.

In [3]:
# Text blob conventions, map them to just five categories
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

# function to check and get the part of speech tag count of a words in a given sentence
def get_pos_tags(x):
    result = ()
    try:
        wiki = textblob.TextBlob(x)
        result = wiki.tags
    except:
        pass
    return result

def check_pos_tag(tags, flag):
    cnt = 0
    try:
        for tup in tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except:
        pass
    return cnt

#### Feature engineering. Only for these variables, I find the proportion. E.g proportion of numbers and punctiation

In [4]:
train["num_numbers"] = train["Comment"].apply(lambda text: len([1 for i in text if i.isdigit()]))
train["prop_numbers"] = train["Comment"].apply(lambda text: len([1 for i in text if i.isdigit()])/len(text))
train["num_words"] = train["Comment"].apply(lambda text: len(text.split()))
train["num_punctuation"] = train["Comment"].apply(lambda text: len([i for i in text if i in punctuation]))
train["prop_punctuation"] = train["Comment"].apply(lambda text: len([i for i in text if i in punctuation])/len(text))
train["nchar"] = train["Comment"].apply(lambda text: len(text))
train["word_density"] = train['word_density'] = train["nchar"] / (train["num_words"] + 1)

#### Obtaining the POS tags takes awhile

In [5]:
tqdm.pandas()

# Now you can use `progress_apply` instead of `apply`
train["pos_tags"] = train['Comment'].progress_apply(lambda x: get_pos_tags(x))

HBox(children=(FloatProgress(value=0.0, max=44459.0), HTML(value='')))

#### For these variables, I do not find the proportion these words. I found that using the count of these words better than using proportions in my EDA notebook.

In [6]:
train['noun_count'] = train['pos_tags'].apply(lambda x: check_pos_tag(x, 'noun'))
train['verb_count'] = train['pos_tags'].apply(lambda x: check_pos_tag(x, 'verb'))
train['adj_count'] = train['pos_tags'].apply(lambda x: check_pos_tag(x, 'adj'))
train['adv_count'] = train['pos_tags'].apply(lambda x: check_pos_tag(x, 'adv'))
train['pron_count'] = train['pos_tags'].apply(lambda x: check_pos_tag(x, 'pron'))

#### Repeat the same for the test set

In [7]:
test["num_numbers"] = test["Comment"].apply(lambda text: len([1 for i in text if i.isdigit()]))
test["prop_numbers"] = test["Comment"].apply(lambda text: len([1 for i in text if i.isdigit()])/len(text))
test["num_words"] = test["Comment"].apply(lambda text: len(text.split()))
test["num_punctuation"] = test["Comment"].apply(lambda text: len([i for i in text if i in punctuation]))
test["prop_punctuation"] = test["Comment"].apply(lambda text: len([i for i in text if i in punctuation])/len(text))
test["nchar"] = test["Comment"].apply(lambda text: len(text))
test["word_density"] = test['word_density'] = train["nchar"] / (train["num_words"] + 1)

In [8]:
test["pos_tags"] = test['Comment'].progress_apply(lambda x: get_pos_tags(x))

HBox(children=(FloatProgress(value=0.0, max=27924.0), HTML(value='')))

In [9]:
test['noun_count'] = test['pos_tags'].apply(lambda x: check_pos_tag(x, 'noun'))
test['verb_count'] = test['pos_tags'].apply(lambda x: check_pos_tag(x, 'verb'))
test['adj_count'] = test['pos_tags'].apply(lambda x: check_pos_tag(x, 'adj'))
test['adv_count'] = test['pos_tags'].apply(lambda x: check_pos_tag(x, 'adv'))
test['pron_count'] = test['pos_tags'].apply(lambda x: check_pos_tag(x, 'pron'))

#### Using LDA on the raw data that we have, with no preprocessing. I initially wanted to use this as a possible feature set, but I decided not to use this in the end due to time constraints, could have been a further work I would have done given more time

In [11]:
lda_stuff = pd.concat([train.Comment, test.Comment]).reset_index().drop(columns="index")

count_vect = CountVectorizer()
count_vect.fit(lda_stuff.Comment)
all_count = count_vect.transform(lda_stuff.Comment)

lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(all_count)
topic_word = lda_model.components_ 
vocab = count_vect.get_feature_names()

# view the topic models
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))

In [12]:
topic_words

array(['the', 'plot', 'col', 'width', 'axis', 'to', 'size', 'image',
       'grid', 'color'], dtype='<U1536')

#### Quite unclean because I didn't preprocess, would have been fine tuned had I decided to use this. However, my intuition is that this would not help so much, unless it could somehow identify advaned topics. However, I thought this was not highly likely and did not work on this further

In [13]:
topic_summaries

['string input re text character word match strings unicode characters',
 'class def return object name method args url type function',
 'na 1l x1 x2 rm median max 2l 5l min',
 'true false val nan ind arr city grepl rank logical',
 'the in to of is you for data and as',
 'the to you is and in it of that this',
 'pip flask 000000 person cv2 dis pypi load_const events opencv',
 'np numpy random matplotlib timeit __main__ __name__ plt node scipy',
 'state pd pool mydata init x00 getattr param dictionaries ordereddict',
 'paste output sep function lst form div shiny collapse tk',
 'model int models size thread time number for return bytes',
 'df 000 100 pandas 2000 var1 2007 2008 var2 frame',
 'num kwargs iris 255 species length spaces setosa sepal dbl',
 'py http com request json bin usr www org lib',
 'data id aes ggplot group frame library value ggplot2 variable',
 'self print import file def in sys read line if',
 '10 12 00 01 11 15 20 13 14 30',
 'foo date datetime time bar year forma

In [14]:
lda_stuff["lda"] = pd.Series(list(X_topics))

In [15]:
train_lda = lda_stuff.iloc[:len(train)].reset_index().drop(columns="index")
test_lda = lda_stuff.iloc[len(train):].reset_index().drop(columns="index")

In [16]:
train["lda"] = train_lda.lda
test["lda"] = test_lda.lda

In [17]:
train.head()

Unnamed: 0,Comment,Outcome,Id,num_numbers,prop_numbers,num_words,num_punctuation,prop_punctuation,nchar,word_density,pos_tags,noun_count,verb_count,adj_count,adv_count,pron_count,lda
0,combining lindelof's and gregg lind's ideas: l...,1,15086,3,0.006897,80,51,0.117241,435,5.37037,"[(combining, VBG), (lindelof, NN), ('s, POS), ...",30,9,9,6,1,"[0.0007692307699486388, 0.0007692307922939407,..."
1,in most cases r is an interpreted language tha...,1,41061,0,0.0,39,4,0.017094,234,5.85,"[(in, IN), (most, JJS), (cases, NNS), (r, NN),...",10,8,7,1,0,"[0.001315789481560718, 0.0013157894744339404, ..."
2,"i don't know r at all, but a bit of creative g...",1,34417,12,0.013423,164,49,0.05481,894,5.418182,"[(i, NNS), (do, VBP), (n't, RB), (know, VB), (...",49,30,14,10,4,"[0.0003184713469724488, 0.0003184713501225187,..."
3,if you don't want to modify the list in-place ...,1,30549,12,0.021164,102,92,0.162257,567,5.504854,"[(if, IN), (you, PRP), (do, VBP), (n't, RB), (...",54,18,8,6,2,"[0.0006578947390142372, 0.000657894744747982, ..."
4,i assume it helps if the matrix is sparse? yes...,1,8496,0,0.0,23,14,0.084848,165,6.875,"[(i, NN), (assume, VBP), (it, PRP), (helps, VB...",9,6,0,2,1,"[0.001724137932075156, 0.0017241379372974095, ..."


In [18]:
test.head()

Unnamed: 0,Comment,Id,num_numbers,prop_numbers,num_words,num_punctuation,prop_punctuation,nchar,word_density,pos_tags,noun_count,verb_count,adj_count,adv_count,pron_count,lda
0,use variables in the outer function instead of...,76343,0,0.0,52,25,0.072886,343,5.37037,"[(use, NN), (variables, NNS), (in, IN), (the, ...",15,9,8,4,4,"[0.0008196721438439306, 0.0008196721501989234,..."
1,if you're looking for something as nice as pyt...,66862,1,0.00369,49,27,0.099631,271,5.85,"[(if, IN), (you, PRP), ('re, VBP), (looking, V...",20,9,11,3,4,"[0.0011363636588532812, 0.0011363636538054228,..."
2,"i use the tail() function: tail(vector, n=1) t...",69629,1,0.007246,23,15,0.108696,138,5.418182,"[(i, NN), (use, VBP), (the, DT), (tail, NN), (...",14,4,1,1,1,"[0.0021739130453386356, 0.0021739130466134305,..."
3,clearly i should have worked on this for anoth...,50008,95,0.087719,179,81,0.074792,1083,5.504854,"[(clearly, RB), (i, NN), (should, MD), (have, ...",66,31,8,11,7,"[0.00026315789532992156, 0.0002631578975795275..."
4,you can drop any row containing a missing usin...,66750,25,0.042808,113,65,0.111301,584,6.875,"[(you, PRP), (can, MD), (drop, VB), (any, DT),...",27,27,10,6,6,"[0.0005747126448159517, 0.0005747126440909237,..."


In [19]:
train.to_csv("datasets/preprocessed_train.csv", index=False)
test.to_csv("datasets/preprocessed_test.csv", index=False)