# Machine Learning Model for Text Classification
Now let's build a basic ML model that classifies text and detects hate speech in tweets, following the example exercise in the [Intro to NLP course by Shivam Bansal](https://courses.analyticsvidhya.com/courses/Intro-to-NLP).

In [39]:
import pandas as pd
import re
import nltk
import numpy as np

In [9]:
dF = pd.read_csv('data/final_dataset_basicmlmodel.csv')
dF.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


This dataframe already contains the output we want to predict: the `label` column stores the hate speech classification, where 0 means no hate speech and 1 means hate speech. The first step is to clean the tweets to reduce noise.

In [10]:
def clean_tweets(text):
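    # keep only ASCII letters and apostrophes; replace everything else with a space
    text = re.sub(r'[^a-zA-Z\']', ' ', text)
    
    # drop any remaining non-ASCII characters (a no-op after the step above, kept as a safeguard)
    text = re.sub(r'[^\x00-\x7F]+', '', text)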
    
    # enforce lower case
    text = text.lower()
    
    return text
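
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same letter-and-apostrophe filter to a made-up tweet. The non-ASCII substitution is left out because the first pattern has already replaced those characters. The `collapse_whitespace` helper and the sample string are assumptions for illustration and are not part of the original notebook; the helper only shows one way to squeeze the runs of spaces that the filter leaves behind.

```python
import re

def clean_tweets(text):
    # Keep only ASCII letters and apostrophes; everything else becomes a space
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.lower()

def collapse_whitespace(text):
    # Hypothetical helper (not in the original notebook): squeeze runs of
    # whitespace into single spaces and trim both ends
    return " ".join(text.split())

# Made-up sample tweet, for illustration only
sample = "@user thanks for #lyft credit, i can't use it!!"
cleaned = clean_tweets(sample)

print(repr(cleaned))                       # note the leading and doubled spaces
print(repr(collapse_whitespace(cleaned)))  # single-spaced, trimmed
```

The extra spaces are harmless for the tokenizer used later, so collapsing them is optional and mostly helps when inspecting the cleaned text by eye.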

In [11]:
dF['clean_tweet'] = dF.tweet.apply(clean_tweets)

dF.head()

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thanks for lyft credit i can't us...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide society now motivation


# Feature Engineering
Next, let's engineer features from the cleaned tweets so that we can train a classification model. The first step is to remove stop words.

In [105]:
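stop_words = nltk.corpus.stopwords.words('english')
# The tokenizer splits "i'm" into "i" and "'m"; "'m" is missing from this stop words list, so let's add it
stop_words.append("'m")

def compare_stop_words(text):
    # tokenize the tweet into an array of words
    word_list = np.array(nltk.tokenize.word_tokenize(text))
    # keep only the tokens that are not stop words
    word_list = word_list[~np.isin(word_list, stop_words)]
    # rebuild a single space-separated string
    return ' '.join(word_list)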
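
The same filtering idea can be sketched with the standard library alone, which may make the logic easier to follow. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny hand-written set rather than the NLTK list, `remove_stop_words` is a hypothetical name, and `str.split()` stands in for `word_tokenize`, so contractions are not split the way NLTK splits them.

```python
# Hypothetical, deliberately tiny stop list for illustration only;
# the notebook uses the full NLTK English list plus "'m".
STOP_WORDS = {"a", "is", "and", "so", "he", "his", "into", "when"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # str.split() is a stand-in for nltk's word_tokenize: it splits on
    # whitespace only and does not break contractions apart
    tokens = text.split()
    # set membership gives constant-time lookups per token
    return " ".join(tok for tok in tokens if tok not in stop_words)

# Made-up input resembling a cleaned tweet
sample = "  user when a father is dysfunctional and is so selfish"
result = remove_stop_words(sample)
print(result)
```

Using a set for the stop list keeps each membership check cheap, which can matter when the function is applied to every row of a dataframe.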

In [106]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [107]:
# test the function
print(compare_stop_words(dF.loc[0, 'clean_tweet']))
print(dF.loc[0, 'clean_tweet'])

user father dysfunctional selfish drags kids dysfunction run
  user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction     run


In [108]:
dF['stop_words_removed'] = dF.clean_tweet.apply(compare_stop_words)
dF

Unnamed: 0,id,label,tweet,clean_tweet,stop_words_removed
0,1,0,@user when a father is dysfunctional and is s...,user when a father is dysfunctional and is s...,user father dysfunctional selfish drags kids d...
1,2,0,@user @user thanks for #lyft credit i can't us...,user user thanks for lyft credit i can't us...,user user thanks lyft credit ca n't use cause ...
2,3,0,bihday your majesty,bihday your majesty,bihday majesty
3,4,0,#model i love u take with u all the time in ...,model i love u take with u all the time in ...,model love u take u time ur
4,5,0,factsguide: society now #motivation,factsguide society now motivation,factsguide society motivation
...,...,...,...,...,...
5237,31935,1,lady banned from kentucky mall. @user #jcpenn...,lady banned from kentucky mall user jcpenn...,lady banned kentucky mall user jcpenny kentucky
5238,31947,1,@user omfg i'm offended! i'm a mailbox and i'...,user omfg i'm offended i'm a mailbox and i'...,user omfg offended mailbox proud mailboxpride ...
5239,31948,1,@user @user you don't have the balls to hashta...,user user you don't have the balls to hashta...,user user n't balls hashtag say weasel away lu...
5240,31949,1,"makes you ask yourself, who am i? then am i a...",makes you ask yourself who am i then am i a...,makes ask anybody god oh thank god
