# Pre-Processing

In [1]:
# Let us start by importing the library from the package directory
import pandas as pd

# Read the data
df = pd.read_csv("./train.csv")

In [2]:
# Check the first five rows of our dataset
df.head()

Unnamed: 0,id,text,harsh,extremely_harsh,vulgar,threatening,disrespect,targeted_hate
0,a8be7c5d4527adbbf15f,""", 6 December 2007 (UTC)\nI am interested, not...",0,0,0,0,0,0
1,0b7ca73f388222aad64d,I added about three missing parameters to temp...,0,0,0,0,0,0
2,db934381501872ba6f38,SANDBOX?? \n\nI DID YOUR MADRE DID IN THE SANDBOX,1,0,0,0,0,0
3,228015c4a87c4b1f09a7,"why good sir? Why? \n\nYou, sir, obviously do ...",1,0,1,1,1,0
4,b18f26cfa1408b52e949,"""\n\n Source \n\nIncase I forget, or someone e...",0,0,0,0,0,0


### Remove urls

In [3]:
import re

def clean_url(review_text):
    return re.sub(r'http\S+', ' ', review_text)

df['text'] = df['text'].apply(clean_url)

### Remove html tags

In [4]:
def clean_html_tags(review_text):
    return re.sub('<[^<]+?>', '', review_text)

df['text'] = df['text'].apply(clean_url)

### Remove numbers and punctuation

In [5]:
def clean_non_alphanumeric(review_text):
    return re.sub('[^a-zA-Z]', ' ', review_text)

df['text'] = df['text'].apply(clean_non_alphanumeric)

In [6]:
df['text'].iloc[0]

'     December       UTC  I am interested  not in arguing  but in the policies which resolve our ongoing content dispute  Also  see Wikipedia  WikiProject United States presidential elections for what I ll be working on  Also  the moneybomb closer just self reverted on two different requests  which echoed what I would have requested   I will rephrase     which I didn t see an answer to  building on our agreement that   moneybomb   should not be a redlink  Given the deletion reversion  what should be the outline of the article called   moneybomb   or should it be submitted for AFD again in due time   If the latter  see the previous version of      However  this version will require a detailed answer because any ambiguity will only necessitate clarifying questions          '

Removing extra white-spaces

In [7]:
df.text = df.text.apply(lambda x : " ".join(x.split()))

In [8]:
df.text.iloc[0]

'December UTC I am interested not in arguing but in the policies which resolve our ongoing content dispute Also see Wikipedia WikiProject United States presidential elections for what I ll be working on Also the moneybomb closer just self reverted on two different requests which echoed what I would have requested I will rephrase which I didn t see an answer to building on our agreement that moneybomb should not be a redlink Given the deletion reversion what should be the outline of the article called moneybomb or should it be submitted for AFD again in due time If the latter see the previous version of However this version will require a detailed answer because any ambiguity will only necessitate clarifying questions'

### Spell Checker

In [9]:
# from textblob import TextBlob

# df['text'] = df['text'].apply(lambda x : TextBlob(x).correct())

### Lemmatization

In [10]:
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

df.text = df.text.apply(lambda x : " ".join([lemma.lemmatize(word = w, pos='n') for w in x.split(" ")]))
df.text = df.text.apply(lambda x : " ".join([lemma.lemmatize(word = w, pos='v') for w in x.split(" ")]))

In [11]:
df.text.iloc[0]

'December UTC I be interest not in argue but in the policy which resolve our ongoing content dispute Also see Wikipedia WikiProject United States presidential election for what I ll be work on Also the moneybomb closer just self revert on two different request which echo what I would have request I will rephrase which I didn t see an answer to build on our agreement that moneybomb should not be a redlink Given the deletion reversion what should be the outline of the article call moneybomb or should it be submit for AFD again in due time If the latter see the previous version of However this version will require a detail answer because any ambiguity will only necessitate clarify question'

# Text -> Features

### BOW (Bag of Words)

Each text is represented as a vector consisting of the frequency / occurence(0,1) / weighted values of vocabulary (list of unique words).

![BOW](assets/BOW.png "Bag Of Words")

Disadvantages ->
1.	Vocabulary can become too large
2.	Each text vector will contain many 0's which will result in a sparse matrix (Why is that a problem?)
3.	Ordering of the words (grammar, meaning) is lost
4.	Words which occur frequently across all text vectors may introduce a bias.

### TF-IDF (Term Frequency and Inverse document Frequency)

TF-IDF kind of normalizes the text vectors. Frequent words in a text-vector are "rewarded", but they also get "punished" if they are frequent in other text-vectors too. Thus, higher weight is assigned to unique or rare terms considering all text-vectors.

TF(t,d) = (count of t in document d) / (Total words in document d)

IDF(t) = log(Total documents / Number of documents containing t)

![TF-IDF](assets/IDF.png "TF_IDF")

Disadvantages ->

1. TF-IDF is basically BOW with weights. It cannot capture the semantics / meaning of text.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Explore other parameters of TfidVectorizer
vectorizer = TfidfVectorizer(binary=True, lowercase=True, stop_words={'english'})

X = vectorizer.fit_transform(df.text)

In [13]:
vectorizer.get_feature_names()

# Spell Check necessary?



['aa',
 'aaa',
 'aaaa',
 'aaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaany',
 'aaaaaaaaaah',
 'aaaaaaahhhhhhhhhhhhhhhhhhhhhhhh',
 'aaaah',
 'aaaannnnyyyywwwwhhhheeeerrrreeee',
 'aaaawwww',
 'aaaboyz',
 'aaages',
 'aaaghh',
 'aaah',
 'aaahhh',
 'aaai',
 'aaajade',
 'aaand',
 'aaaww',
 'aaba',
 'aaberg',
 'aabove',
 'aac',
 'aachen',
 'aachi',
 'aacs',
 'aad',
 'aademia',
 'aadmi',
 'aaffect',
 'aafia',
 'aaflight',
 'aage',
 'aagin',
 'aah',
 'aai',
 'aaiha',
 'aajonus',
 'aakash',
 'aake',
 'aalborg',
 'aalertbot',
 'aalexa',
 'aaliya',
 'aaliyah',
 'aalst',
 'aam',
 'aamir',
 'aamirjamil',
 'aamu',
 'aan',
 'aanas',
 'aand',
 'aang',
 'aao',
 'aaot',
 'aap',
 'aapl',
 'aapropriate',
 'aar',
 'aarabs',
 'aarau',
 'aardvark',
 'aare',
 'aarem',
 'aaroamal',
 'aarohi',
 'aaron',
 'aaroncrick',
 'aaronic',
 'aarons',
 'aaronshavit',
 'aaronsw',
 'aarp',
 'aarrow',
 'aarticles',
 'aaruveetil',
 'aas',
 'aat',
 'aatc',
 'aau',
 'aave',
 'aaviksoo',
 'aav

In [21]:
vectorizer.vocabulary_

{'december': 23464,
 'utc': 101338,
 'be': 9145,
 'interest': 47034,
 'not': 66289,
 'in': 45549,
 'argue': 5591,
 'but': 13837,
 'the': 95315,
 'policy': 73503,
 'which': 104839,
 'resolve': 80244,
 'our': 69082,
 'ongoing': 68078,
 'content': 20026,
 'dispute': 26344,
 'also': 3169,
 'see': 84919,
 'wikipedia': 105510,
 'wikiproject': 105590,
 'united': 100212,
 'states': 90535,
 'presidential': 74762,
 'election': 29306,
 'for': 35002,
 'what': 104734,
 'll': 55504,
 'work': 106484,
 'on': 68009,
 'moneybomb': 61907,
 'closer': 17955,
 'just': 50048,
 'self': 85085,
 'revert': 80644,
 'two': 98806,
 'different': 25481,
 'request': 80095,
 'echo': 28605,
 'would': 106610,
 'have': 41417,
 'will': 105794,
 'rephrase': 79897,
 'didn': 25400,
 'an': 3707,
 'answer': 4406,
 'to': 96680,
 'build': 13440,
 'agreement': 2011,
 'that': 95279,
 'should': 86659,
 'redlink': 78826,
 'given': 38114,
 'deletion': 23969,
 'reversion': 80641,
 'outline': 69162,
 'of': 67464,
 'article': 5974,
 'cal

# Model Selection and Training