Context
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from all other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

Content
The training data contains the full tweet texts together with their labels.
Usernames of mentioned users are replaced with @user.

Acknowledgements
The dataset is provided by Analytics Vidhya.


In [140]:
# Importing required libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [141]:
# Reading training and testing data files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [142]:
# Concatenating the training and testing data into a single DataFrame for easier exploration;
# the "ind" column records which file each row came from
data = pd.concat([test.assign(ind="test"), train.assign(ind="train")])

In [143]:
# Retrieving the first five observations using the head function
data.head()

Unnamed: 0,id,tweet,ind,label
0,31963,#studiolife #aislife #requires #passion #dedic...,test,
1,31964,@user #white #supremacists want everyone to s...,test,
2,31965,safe ways to heal your #acne!! #altwaystohe...,test,
3,31966,is the hp and the cursed child book up for res...,test,
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",test,


In [197]:
# Retrieving a random sample (50% of the rows) of the combined data
data.sample(frac=0.50)

Unnamed: 0,id,tweet,ind,label
11822,43785,@user #summersurvivaltips from @user for a ...,test,
25810,25811,"they said, god will give what's opposite to th...",train,0.0
11947,43910,let us pray for orlando.,test,
17695,17696,i'm so and #grateful now that - #affirmations,train,0.0
16629,16630,@user #coldplaywembley with @user @user !,train,0.0
...,...,...,...,...
6615,6616,self-doubt uninstalling... #vlicobs #xoxo #lov...,train,0.0
8862,8863,getting ready for the opening of hb abc in pho...,train,0.0
17057,49020,i am brave. #i_am #positive #affirmation,test,
3650,3651,@user can't wait... excited to watch lemans th...,train,0.0


In [144]:
# Retrieving the last five observations using the tail function
data.tail()

Unnamed: 0,id,tweet,ind,label
31957,31958,ate @user isz that youuu?ðððððð...,train,0.0
31958,31959,to see nina turner on the airwaves trying to...,train,0.0
31959,31960,listening to sad songs on a monday morning otw...,train,0.0
31960,31961,"@user #sikh #temple vandalised in in #calgary,...",train,1.0
31961,31962,thank you @user for you follow,train,0.0


In [145]:
# number of tweets labelled as neither racist nor sexist (label 0)
sum(data["label"] == 0)

29720

In [146]:
# number of tweets labelled as racist or sexist (label 1)
sum(data["label"] == 1)

2242
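
The two counts above imply a strong class imbalance. As a rough sanity check, the arithmetic below uses only those two printed counts to work out the share of racist/sexist tweets and the accuracy a trivial classifier would reach by always predicting label 0. The variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the two cells above
n_negative = 29720  # label == 0
n_positive = 2242   # label == 1
n_labelled = n_negative + n_positive

# Share of labelled tweets that are racist/sexist
positive_share = n_positive / n_labelled

# Accuracy of a trivial model that always predicts the majority class (label 0)
majority_baseline = n_negative / n_labelled

print(f"labelled tweets: {n_labelled}")
print(f"positive share: {positive_share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy figure reported later is best read against this baseline rather than against 0%.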

In [147]:
# Checking for missing values
data.isnull().sum()

id           0
tweet        0
ind          0
label    17197
dtype: int64
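
The missing labels are expected: the rows that came from test.csv show an empty label in the outputs above. A minimal sketch, on made-up toy frames rather than the real files, of how the missing labels could be counted per source to confirm that none belong to the training rows:

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical rows)
toy_train = pd.DataFrame({"id": [1, 2, 3], "tweet": ["a", "b", "c"], "label": [0, 1, 0]})
toy_test = pd.DataFrame({"id": [4, 5], "tweet": ["d", "e"]})  # no label column

# Same concatenation pattern as in the notebook
toy_data = pd.concat([toy_test.assign(ind="test"), toy_train.assign(ind="train")])

# Count missing labels separately for each source
missing_by_source = toy_data["label"].isna().groupby(toy_data["ind"]).sum()
print(missing_by_source)
```

On the real data the same expression would be applied to `data` instead of `toy_data`.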

# Data Cleaning 

In [148]:
#install tweet-preprocessor to clean tweets
!pip install tweet-preprocessor



In [149]:
# remove special characters using the regular expression library
import re

# punctuation to delete outright (raw strings avoid invalid-escape warnings)
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\{)|(\})")
# characters to replace with a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s/><br\s/?)|(-)|(/)|(:).")
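
To make the effect of the two patterns concrete, here is a small sketch that applies them, in the same order the cleaning function uses, to a made-up string. The sample text is hypothetical; the patterns are the ones defined above, written as raw strings.

```python
import re

# Same patterns as the cell above, written as raw strings
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\{)|(\})")
REPLACE_WITH_SPACE = re.compile(r"(<br\s/><br\s/?)|(-)|(/)|(:).")

sample = "[2/2] Left-Right: OK!"  # made-up example text

# Step 1: lower-case the text and delete the listed punctuation
step1 = REPLACE_NO_SPACE.sub("", sample.lower())
# Step 2: turn "-" and "/" into spaces
step2 = REPLACE_WITH_SPACE.sub(" ", step1)

print(step1)
print(step2)
```

Because colons, "<" and ">" are already deleted in the first step, only the "-" and "/" alternatives of the second pattern can still match at that point.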

In [150]:
import preprocessor as p

# custom function to clean the dataset (combining tweet-preprocessor and regular expressions)
def clean_tweets(df):
    tempArr = []
    for line in df:
        # send to tweet-preprocessor
        tmpL = p.clean(line)
        # convert to lower case and remove punctuation
        tmpL = REPLACE_NO_SPACE.sub("", tmpL.lower())
        # replace the remaining separators with a space
        tmpL = REPLACE_WITH_SPACE.sub(" ", tmpL)
        tempArr.append(tmpL)
    return tempArr

In [152]:
# splitting the combined data back into test and train sets for further analysis
test, train = data[data["ind"].eq("test")], data[data["ind"].eq("train")]

In [153]:
# clean training data
train_tweet = clean_tweets(train["tweet"])
train_tweet = pd.DataFrame(train_tweet)

In [154]:
# append cleaned tweets to the training data; train is a slice of data, so take
# an explicit copy first to avoid pandas' SettingWithCopyWarning
train = train.copy()
train["clean_tweet"] = train_tweet

# compare the cleaned and uncleaned tweets
train.head(10)


Unnamed: 0,id,tweet,ind,label,clean_tweet
0,1,@user when a father is dysfunctional and is s...,train,0.0,when a father is dysfunctional and is so selfi...
1,2,@user @user thanks for #lyft credit i can't us...,train,0.0,thanks for credit i cant use cause they dont o...
2,3,bihday your majesty,train,0.0,bihday your majesty
3,4,#model i love u take with u all the time in ...,train,0.0,i love u take with u all the time in ur
4,5,factsguide: society now #motivation,train,0.0,factsguide society now
5,6,[2/2] huge fan fare and big talking before the...,train,0.0,2 2 huge fan fare and big talking before they ...
6,7,@user camping tomorrow @user @user @user @use...,train,0.0,camping tomorrow danny
7,8,the next school year is the year for exams.ð...,train,0.0,the next school year is the year for exams can...
8,9,we won!!! love the land!!! #allin #cavs #champ...,train,0.0,we won love the land
9,10,@user @user welcome here ! i'm it's so #gr...,train,0.0,welcome here im its so


In [155]:
# clean the test data
test_tweet = clean_tweets(test["tweet"])
test_tweet = pd.DataFrame(test_tweet)
# append cleaned tweets to the test data; test is a slice of data, so take
# an explicit copy first to avoid pandas' SettingWithCopyWarning
test = test.copy()
test["clean_tweet"] = test_tweet

# compare the cleaned and uncleaned tweets
test.tail()


Unnamed: 0,id,tweet,ind,label,clean_tweet
17192,49155,thought factory: left-right polarisation! #tru...,test,,thought factory left right polarisation &gt3
17193,49156,feeling like a mermaid ð #hairflip #neverre...,test,,feeling like a mermaid
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...,test,,today in omg &amp used words like assets&ampli...
17195,49158,"happy, at work conference: right mindset leads...",test,,happy at work conference right mindset leads t...
17196,49159,"my song ""so glad"" free download! #shoegaze ...",test,,my song so glad free download


In [157]:
#importing the train and test split library
from sklearn.model_selection import train_test_split

# extract the labels from the train data
y = train.label.values

# using 70% for the training and 30% for the testing
x_train, x_test, y_train, y_test = train_test_split(train.clean_tweet.values, y, stratify=y, random_state=1, test_size=0.3, shuffle=True)
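
With `stratify=y`, `train_test_split` keeps the class proportions the same in both splits, which matters when one class is rare. A small sketch on made-up labels (not the tweet data) illustrating this; the toy sizes are assumptions chosen so the proportions divide evenly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10% positive (hypothetical, not the tweet data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.3, random_state=1, shuffle=True
)

# The positive share is preserved in both splits
print("positive share overall:", y_toy.mean())
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```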

# Vectorize Tweets Using CountVectorizer

CountVectorizer Example

In [158]:
from sklearn.feature_extraction.text import CountVectorizer

In [159]:
documents = ["This is Import Data's Youtube channel",
             "Data science is my passion and it is fun!",
             "Please subscribe to my channel"]

# initializing the countvectorizer
vectorizer = CountVectorizer()

# tokenize and make the document into a matrix
document_term_matrix = vectorizer.fit_transform(documents)

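# check the result (get_feature_names_out replaces get_feature_names, which newer scikit-learn releases no longer provide)
pd.DataFrame(document_term_matrix.toarray(), columns = vectorizer.get_feature_names_out())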

Unnamed: 0,and,channel,data,fun,import,is,it,my,passion,please,science,subscribe,this,to,youtube
0,0,1,1,0,1,1,0,0,0,0,0,0,1,0,1
1,1,0,1,1,0,2,1,1,1,0,1,0,0,0,0
2,0,1,0,0,0,0,0,1,0,1,0,1,0,1,0


In [160]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorize tweets for model building
vectorizer = CountVectorizer(binary=True, stop_words='english')

# learn the vocabulary from the training split only, so that the held-out
# split does not influence the features
vectorizer.fit(x_train)

# transform documents to document-term matrix
x_train_vec = vectorizer.transform(x_train)
x_test_vec = vectorizer.transform(x_test)
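
The vectorizer used for model building differs from the earlier example in two options, `binary=True` and `stop_words='english'`. A short sketch on one of the example sentences showing what each option changes; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["Data science is my passion and it is fun!"]

# Default settings: raw term counts, no stop-word filtering
counts = CountVectorizer()
count_row = counts.fit_transform(doc).toarray()[0]
is_count = count_row[counts.vocabulary_["is"]]

# binary=True records presence/absence instead of counts
binary = CountVectorizer(binary=True)
binary_row = binary.fit_transform(doc).toarray()[0]
is_binary = binary_row[binary.vocabulary_["is"]]

# stop_words='english' drops common words such as "is" from the vocabulary
filtered = CountVectorizer(binary=True, stop_words="english")
filtered.fit(doc)

print(is_count, is_binary, "is" in filtered.vocabulary_)
```

For short texts such as tweets, most words occur at most once, so presence/absence features usually lose little information compared with counts.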

# Model Application

Support Vector Machine (SVM)

In [138]:
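# Support vector classifier; importing SVC directly avoids rebinding the
# name of the sklearn.svm module to the model object
from sklearn.svm import SVC

# Creating the support vector classifier
svm = SVC(kernel = 'linear', probability=True)

# fit the SVC model on the training data (fit returns the fitted estimator itself)
_ = svm.fit(x_train_vec, y_train)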

In [174]:
# perform classification and prediction on samples in x_test
y_pred_svm = svm.predict(x_test_vec)

In [191]:
# Predicting probabilities
svm.predict_proba(x_test_vec[0:10])

array([[0.95899018, 0.04100982],
       [0.97747001, 0.02252999],
       [0.95454169, 0.04545831],
       [0.92470233, 0.07529767],
       [0.83674823, 0.16325177],
       [0.9971064 , 0.0028936 ],
       [0.94370099, 0.05629901],
       [0.98839812, 0.01160188],
       [0.93765012, 0.06234988],
       [0.95326183, 0.04673817]])

In [139]:
# Importing the accuracy metric
from sklearn.metrics import accuracy_score

# Model accuracy score
print("Accuracy score for SVC is: ", accuracy_score(y_test, y_pred_svm) * 100, '%')

Accuracy score for SVC is:  94.86912086766085 %
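
Only 2,242 of the 31,962 labelled tweets carry label 1, so accuracy can look high even for a model that never flags a tweet. A hedged sketch, on made-up labels with a similar imbalance, of metrics that expose this; the arrays are hypothetical and stand in for `y_test` and a model's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Toy labels with roughly the same imbalance as the tweet data (hypothetical)
y_true = [0] * 93 + [1] * 7
# A trivial "model" that never flags a tweet
y_all_negative = [0] * 100

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative, zero_division=0)
f1 = f1_score(y_true, y_all_negative, zero_division=0)

print("accuracy:", acc)
print("recall:", rec)
print("F1:", f1)
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_all_negative))
```

On the real split, the same functions could be called with `y_test` and `y_pred_svm` to see how many racist/sexist tweets the classifier actually finds.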


Logistic Regression

In [166]:
# Importing Logistic Regression library
from sklearn.linear_model import LogisticRegression

# Creating the Logistic Regression model object
model = LogisticRegression(solver="liblinear", random_state=42)

In [167]:
# Fitting our model
model.fit(x_train_vec, y_train)

LogisticRegression(random_state=42, solver='liblinear')

In [170]:
# Predicting probabilities
model.predict_proba(x_test_vec)

array([[0.97451696, 0.02548304],
       [0.98073684, 0.01926316],
       [0.95567169, 0.04432831],
       ...,
       [0.98721219, 0.01278781],
       [0.99277082, 0.00722918],
       [0.22434625, 0.77565375]])
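
Each row of the `predict_proba` output holds one probability per class, with columns ordered as in the model's `classes_` attribute, so here the second column is the estimated probability of label 1. A minimal sketch on a made-up toy problem (not the tweet features) illustrating the column order and how the probabilities relate to `predict`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy problem (hypothetical data)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression(solver="liblinear", random_state=42)
toy_model.fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)
print(toy_model.classes_)   # column order of predict_proba
print(proba.sum(axis=1))    # the two probabilities in each row add up to 1

# predict() corresponds to picking the more probable class in each row
labels_from_proba = toy_model.classes_[proba.argmax(axis=1)]
print(labels_from_proba)
```

Thresholding the second column at a value other than 0.5 is one way to trade precision against recall for the rare class.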

In [171]:
# train and test accuracy score 
print("Accuracy of train:", model.score(x_train_vec, y_train))
print("Accuracy of test:", model.score(x_test_vec,y_test))

Accuracy of train: 0.9679971394091091
Accuracy of test: 0.9500469287725519
