**Toxic Comment Classifier**

This model takes text as input and predicts whether it is toxic across six toxicity labels: toxic, severe_toxic, obscene, threat, insult and identity_hate.

This model draws on a few references, mainly the 'Linebyline.ai' repository on GitHub.
Link: https://github.com/line-by-line/toxic_comments_classifier

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
cd /content/drive/MyDrive/Datasets

/content/drive/MyDrive/Datasets


In [4]:
ls

identity_hate_model.pkl  sample_submission.csv   threat_vect.pkl
identity_hate_vect.pkl   severe_toxic_model.pkl  toxic_model.pkl
insult_model.pkl         severe_toxic_vect.pkl   toxic_vect.pkl
insult_vect.pkl          test.csv                train.csv
obscene_model.pkl        test_labels.csv
obscene_vect.pkl         threat_model.pkl


In [5]:
import pandas as pd
import tensorflow as tf
import nltk
import re
import numpy as np
np.set_printoptions(suppress=True)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score, roc_curve, f1_score, confusion_matrix
from nltk.corpus import stopwords
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import f1_score, precision_score, recall_score, precision_recall_curve, fbeta_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

In [6]:

df=pd.read_csv('train.csv')

In [7]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


**Data Analysis**

In [8]:
print(df.shape)
df.dtypes

(159571, 8)


id               object
comment_text     object
toxic             int64
severe_toxic      int64
obscene           int64
threat            int64
insult            int64
identity_hate     int64
dtype: object

In [9]:
df.isnull().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

In [10]:
df['comment_text']=df['comment_text'].astype('str')
df.dtypes

id               object
comment_text     object
toxic             int64
severe_toxic      int64
obscene           int64
threat            int64
insult            int64
identity_hate     int64
dtype: object

In [11]:
# Checking the distribution of multi labels data in actual numbers

multi_labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate']
values_numbers={}
for key,value in df.items():
  if key in multi_labels:
    values_numbers[key + ' label division'] = df[key].value_counts()


In [12]:
values_numbers=pd.DataFrame(values_numbers)
values_numbers

Unnamed: 0,toxic label division,severe_toxic label division,obscene label division,threat label division,insult label division,identity_hate label division
0,144277,157976,151122,159093,151694,158166
1,15294,1595,8449,478,7877,1405


In [13]:
label_df=df[multi_labels]
label_df

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
159566,0,0,0,0,0,0
159567,0,0,0,0,0,0
159568,0,0,0,0,0,0
159569,0,0,0,0,0,0


In [14]:
# Checking the distribution of multi labels data in percentage

multi_labels=['toxic','severe_toxic','obscene','threat','insult','identity_hate']
values_percentage={}
for key,value in df.items():
  if key in multi_labels:
    values_percentage[key + ' label division (%)'] = df[key].value_counts()/len(df[key].index)*100
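The same percentages can be read off without a loop, because the mean of a 0/1 column is its share of positives. A minimal sketch on a made-up toy frame; `label_share` and `toy` are illustrative names, not part of the notebook:

```python
import pandas as pd

def label_share(frame, labels):
    """Return the percentage of rows flagged 1 for each label column."""
    # The mean of a 0/1 column equals the fraction of ones.
    return frame[labels].mean() * 100

# Toy data standing in for the label columns of train.csv (assumption).
toy = pd.DataFrame({"toxic": [1, 0, 0, 0], "threat": [0, 0, 0, 0]})
share = label_share(toy, ["toxic", "threat"])
print(share)
```

This returns only the share of ones; the share of zeros is 100 minus that value.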


In [15]:
values_percentage=pd.DataFrame(values_percentage)
values_percentage

Unnamed: 0,toxic label division (%),severe_toxic label division (%),obscene label division (%),threat label division (%),insult label division (%),identity_hate label division (%)
0,90.415552,99.000445,94.705178,99.700447,95.063639,99.119514
1,9.584448,0.999555,5.294822,0.299553,4.936361,0.880486


**DATA CLEANING**

In [16]:
df_copy=df.copy()
df_copy

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0


In [17]:
from nltk.stem import WordNetLemmatizer
import string
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [18]:
df_copy['comment_text']

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object

Removing line breaks, punctuation, links, hashtags, etc.

In [19]:
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub("\n|\r", " ", x)) # Line breaks
# Note: '[^-9A-Za-z ]' keeps only letters, spaces, hyphens and the digit 9; '[^0-9A-Za-z ]' was probably intended
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub('[^-9A-Za-z ]', '', x)) # Punctuation and other symbols
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub('[%s]' % re.escape(string.punctuation), '', x.lower())) # Remaining punctuation, lowercase
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub(r'[^\x00-\x7f]',r'', x)) # Non-ASCII
# Note: '@' and '#' were already stripped above, so the next two patterns no longer match anything
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub(r'@\S+', '', x)) # Mentions @
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub(r'#\S+', '', x)) # Hashtags
df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub(r'https*\S+', '', x)) # Links (text from 'http' up to the next space)
# df_copy['comment_text']=df_copy['comment_text'].apply( lambda x : re.sub(r'\s+', '', x, flags=re.I))
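The order of the substitutions matters: links, mentions and hashtags can only be recognised while their marker characters are still present. The following is a sketch of one possible ordering, not the cleaning that produced the outputs in this notebook. The function name `clean_comment`, the URL pattern and the choice to keep all digits are assumptions.

```python
import re

LINE_BREAKS = re.compile(r"[\r\n]+")
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\S+")
HASHTAG = re.compile(r"#\S+")
NON_ALNUM = re.compile(r"[^0-9A-Za-z ]")
SPACES = re.compile(r"\s+")

def clean_comment(text):
    """Strip links, mentions and hashtags first, then punctuation."""
    text = LINE_BREAKS.sub(" ", text)
    text = URL.sub(" ", text)        # needs '://' to still be present
    text = MENTION.sub(" ", text)    # needs '@' to still be present
    text = HASHTAG.sub(" ", text)    # needs '#' to still be present
    text = NON_ALNUM.sub("", text)   # also drops non-ASCII characters
    text = SPACES.sub(" ", text).strip()
    return text.lower()

cleaned = clean_comment("Hi @bob, see https://example.com/page #wow\nIt's 2021!")
print(cleaned)
```

It could be applied with `df_copy['comment_text'].map(clean_comment)`, in which case the outputs shown in later cells would change.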


In [20]:
#Removing stopwords and reducing each word to its base form using lemmatization
lmt=WordNetLemmatizer()
stop_words=set(stopwords.words('english'))  # build the set once instead of reloading the list for every word

def remove_stopwords(text):
  text=[lmt.lemmatize(word) for word in text.split() if word not in stop_words]
  return " ".join(text)

In [21]:
df_copy['comment_text']=df_copy['comment_text'].map(remove_stopwords)

In [22]:
df_copy['comment_text']

0         explanation edits made username hardcore metal...
1         daww match background colour im seemingly stuc...
2         hey man im really trying edit war guy constant...
3         cant make real suggestion improvement wondered...
4                       sir hero chance remember page thats
                                ...                        
159566    second time asking view completely contradicts...
159567              ashamed horrible thing put talk page 99
159568    spitzer umm there actual article prostitution ...
159569    look like actually put speedy first version de...
159570    really dont think understand came idea bad rig...
Name: comment_text, Length: 159571, dtype: object

In [23]:
 #Toxic comment example
df_copy[df_copy['toxic']==1]['comment_text']

6                               cocksucker piss around work
12        hey talk exclusive group wp talibanswho good d...
16             bye dont look come think comming back tosser
42        gay antisemmitian archangel white tiger meow g...
43                                fuck filthy mother as dry
                                ...                        
159494    previous conversation fucking shit eating libe...
159514                              mischievious pubic hair
159541    absurd edits absurd edits great white shark to...
159546    hey listen dont ever delete edits ever im anno...
159554    im going keep posting stuff u deleted fucking ...
Name: comment_text, Length: 15294, dtype: object

In [24]:
#Number of words in a single comment (despite its name, max_len returns a word count)
def max_len(x):
    a=x.split()
    return len(a)
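Word counts can also be computed with pandas string methods instead of a Python function per row. A small sketch on toy strings; `word_counts` is an illustrative name. Note that an empty string yields a count of 0, which is how zero-length comments can appear after cleaning.

```python
import pandas as pd

def word_counts(series):
    """Count whitespace-separated tokens in each string of a Series."""
    return series.str.split().str.len()

toy = pd.Series(["one two three", "single", ""])
counts = word_counts(toy)
print(counts.sort_values(ascending=False))
```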

In [25]:
sen_len=df['comment_text'].apply(max_len)
print('Maximum length of each sentence')
sen_len.sort_values(ascending=False)

Maximum length of each sentence


140904    1411
4712      1403
81295     1354
35817     1344
32143     1250
          ... 
111438       1
141293       1
52475        1
106891       1
110293       1
Name: comment_text, Length: 159571, dtype: int64

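Balancing the data: keep every comment that carries at least one toxic label, plus the first 17,000 comments that carry none, so that the two groups are roughly equal in size.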

In [26]:
df_copy_toxic=df_copy[(df_copy['toxic']==1)|(df_copy['severe_toxic']==1)|(df_copy['obscene']==1)|(df_copy['threat']==1)|(df_copy['insult']==1)|(df_copy['identity_hate']==1)]
df_copy_nontoxic=df_copy[(df_copy['toxic']==0)&(df_copy['severe_toxic']==0)&(df_copy['obscene']==0)&(df_copy['threat']==0)&(df_copy['insult']==0)&(df_copy['identity_hate']==0)].iloc[0:17000,:]
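Taking the first 17,000 clean rows depends on the file order. An alternative is to draw the clean rows at random with a fixed seed. This is a sketch of that variation on a toy frame, not the selection used for the results in this notebook. `balance_by_undersampling` and the seed value are assumptions.

```python
import pandas as pd

def balance_by_undersampling(frame, labels, seed=42):
    """Keep all rows with any positive label plus an equal-sized random
    sample of rows with no positive label, then shuffle."""
    has_label = frame[labels].any(axis=1)   # True if any label column is non-zero
    flagged = frame[has_label]
    clean = frame[~has_label]
    n = min(len(flagged), len(clean))
    clean_sample = clean.sample(n=n, random_state=seed)
    return pd.concat([flagged, clean_sample]).sample(frac=1, random_state=seed)

# Toy data: 2 flagged rows and 6 clean rows (assumption).
toy = pd.DataFrame({
    "comment_text": list("abcdefgh"),
    "toxic":  [1, 0, 0, 0, 0, 0, 0, 0],
    "insult": [0, 1, 0, 0, 0, 0, 0, 0],
})
balanced = balance_by_undersampling(toy, ["toxic", "insult"])
print(balanced)
```

Fixing `random_state` makes the sample and the shuffle reproducible between runs.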

In [27]:
df_copy_toxic

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
6,0002bcb3da6cb337,cocksucker piss around work,1,1,1,0,1,0
12,0005c987bdfc9d4b,hey talk exclusive group wp talibanswho good d...,1,0,0,0,0,0
16,0007e25b2121310b,bye dont look come think comming back tosser,1,0,0,0,0,0
42,001810bf8c45bf5f,gay antisemmitian archangel white tiger meow g...,1,0,1,0,1,1
43,00190820581d90ce,fuck filthy mother as dry,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...
159494,fef4cf7ba0012866,previous conversation fucking shit eating libe...,1,0,1,0,1,1
159514,ff39a2895fc3b40e,mischievious pubic hair,1,0,0,0,1,0
159541,ffa33d3122b599d6,absurd edits absurd edits great white shark to...,1,0,1,0,1,0
159546,ffb47123b2d82762,hey listen dont ever delete edits ever im anno...,1,0,0,0,1,0


In [28]:
df_copy_nontoxic

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanation edits made username hardcore metal...,0,0,0,0,0,0
1,000103f0d9cfb60f,daww match background colour im seemingly stuc...,0,0,0,0,0,0
2,000113f07ec002fd,hey man im really trying edit war guy constant...,0,0,0,0,0,0
3,0001b41b1c6bb37e,cant make real suggestion improvement wondered...,0,0,0,0,0,0
4,0001d958c54c6e35,sir hero chance remember page thats,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
18955,32110519a80f7f14,hey wiki dronebot get life asshole removing li...,0,0,0,0,0,0
18956,3211215e87ef7e93,checked article matter word matter rigorous un...,0,0,0,0,0,0
18958,321156bd48d3dfae,root bd aramaic mean work article state sh bib...,0,0,0,0,0,0
18959,321183c94c23961a,try find source source armin wenger photo,0,0,0,0,0,0


In [29]:
#Merging the two datasets (toxic and non-toxic, roughly equal in size)
df_copy2=pd.concat([df_copy_toxic,df_copy_nontoxic], axis=0)
df_copy2

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
6,0002bcb3da6cb337,cocksucker piss around work,1,1,1,0,1,0
12,0005c987bdfc9d4b,hey talk exclusive group wp talibanswho good d...,1,0,0,0,0,0
16,0007e25b2121310b,bye dont look come think comming back tosser,1,0,0,0,0,0
42,001810bf8c45bf5f,gay antisemmitian archangel white tiger meow g...,1,0,1,0,1,1
43,00190820581d90ce,fuck filthy mother as dry,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...
18955,32110519a80f7f14,hey wiki dronebot get life asshole removing li...,0,0,0,0,0,0
18956,3211215e87ef7e93,checked article matter word matter rigorous un...,0,0,0,0,0,0
18958,321156bd48d3dfae,root bd aramaic mean work article state sh bib...,0,0,0,0,0,0
18959,321183c94c23961a,try find source source armin wenger photo,0,0,0,0,0,0


In [30]:
#Random Shuffling
df_copy2=df_copy2.sample(frac=1)
df_copy2

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
14230,258cbc477568c98a,national security im george w bush article vio...,0,0,0,0,0,0
1287,037f4ca77bf8a95b,redirect talkcentral statistical office india,0,0,0,0,0,0
74223,c693d75c64948cf5,using sandbox as wipe,1,0,1,0,0,0
4058,0adb17bbf1430e1d,well put qg rarely work towards consensus comp...,0,0,0,0,0,0
148280,49a0976f53582792,hey dickhead please stop posting threat page r...,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...
13288,23234e7b1c8de0f2,accomplish proved point admins abuse power tot...,0,0,0,0,0,0
76690,cd64e48fc4b0d0bf,quoting dumb bastard really dont understand an...,1,0,1,0,1,0
13659,240dae7d3104f327,please stop vandalizing userpages member,0,0,0,0,0,0
77095,ce7a5ce77188a1db,going fucking say bay murphy forum biased beli...,1,0,1,0,0,0


In [31]:
#Word counts per comment in the new balanced data frame
sen_len2=df_copy2['comment_text'].apply(max_len)
print(sen_len2.shape[0])
sen_len2.sort_values(ascending=False)

33225


76598     1250
32143     1250
153353    1247
32400     1235
106964    1078
          ... 
40970        1
9395         0
8846         0
3990         0
2407         0
Name: comment_text, Length: 33225, dtype: int64

**MULTI LABEL CLASSIFICATION USING TRADITIONAL MACHINE LEARNING ALGORITHMS**

In [32]:
df_copy2

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
14230,258cbc477568c98a,national security im george w bush article vio...,0,0,0,0,0,0
1287,037f4ca77bf8a95b,redirect talkcentral statistical office india,0,0,0,0,0,0
74223,c693d75c64948cf5,using sandbox as wipe,1,0,1,0,0,0
4058,0adb17bbf1430e1d,well put qg rarely work towards consensus comp...,0,0,0,0,0,0
148280,49a0976f53582792,hey dickhead please stop posting threat page r...,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...
13288,23234e7b1c8de0f2,accomplish proved point admins abuse power tot...,0,0,0,0,0,0
76690,cd64e48fc4b0d0bf,quoting dumb bastard really dont understand an...,1,0,1,0,1,0
13659,240dae7d3104f327,please stop vandalizing userpages member,0,0,0,0,0,0
77095,ce7a5ce77188a1db,going fucking say bay murphy forum biased beli...,1,0,1,0,0,0


To perform multi-label classification with scikit-learn, the data is split into one binary dataset per label, and a separate classifier is trained on each.

In [48]:
# One binary dataset per label, taken from the balanced data

toxic_data=df_copy2.loc[:,['id','comment_text','toxic']]
severe_toxic_data=df_copy2.loc[:,['id','comment_text','severe_toxic']]
obscene_data=df_copy2.loc[:,['id','comment_text','obscene']]
threat_data=df_copy2.loc[:,['id','comment_text','threat']]
insult_data=df_copy2.loc[:,['id','comment_text','insult']]
identity_hate_data=df_copy2.loc[:,['id','comment_text','identity_hate']]
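The six assignments can also be kept in a dictionary keyed by label name, so each dataset always travels with its own label. A minimal sketch on a toy frame; `split_per_label` and the toy rows are illustrative:

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def split_per_label(frame, labels=LABELS):
    """Return {label: frame with id, comment_text and that one label}."""
    return {label: frame.loc[:, ["id", "comment_text", label]] for label in labels}

# Toy frame with the same columns as the training data (assumption).
toy = pd.DataFrame({
    "id": ["a1", "b2"],
    "comment_text": ["first comment", "second comment"],
    **{label: [0, 1] for label in LABELS},
})
per_label = split_per_label(toy)
print(per_label["threat"])
```

Iterating with `for label, data in per_label.items()` then avoids keeping two parallel lists in the same order.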

One binary dataset per label, taken from the full (unbalanced) data

In [34]:
# toxic_data=df_copy.loc[:,['id','comment_text','toxic']]
# severe_toxic_data=df_copy.loc[:,['id','comment_text','severe_toxic']]
# obscene_data=df_copy.loc[:,['id','comment_text','obscene']]
# threat_data=df_copy.loc[:,['id','comment_text','threat']]
# insult_data=df_copy.loc[:,['id','comment_text','insult']]
# identity_hate_data=df_copy.loc[:,['id','comment_text','identity_hate']]

In [35]:
toxic_data

Unnamed: 0,id,comment_text,toxic
0,0000997932d777bf,explanation edits made username hardcore metal...,0
1,000103f0d9cfb60f,daww match background colour im seemingly stuc...,0
2,000113f07ec002fd,hey man im really trying edit war guy constant...,0
3,0001b41b1c6bb37e,cant make real suggestion improvement wondered...,0
4,0001d958c54c6e35,sir hero chance remember page thats,0
...,...,...,...
159566,ffe987279560d7ff,second time asking view completely contradicts...,0
159567,ffea4adeee384e90,ashamed horrible thing put talk page 99,0
159568,ffee36eab5c267c9,spitzer umm there actual article prostitution ...,0
159569,fff125370e4aaaf3,look like actually put speedy first version de...,0


In [36]:
#Method which splits data into train and test sets
from sklearn.model_selection import train_test_split

def splitting_data (df):
  # Column 1 holds the comment text, the last column holds the label
  X_final=df.iloc[:, 1]
  y_final=df.iloc[:,-1]

  X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=42)

  return X_train, X_test, y_train, y_test

In [37]:
#Method which vectorizes the text, trains three models and reports their scores

def vectorization (df, vector):

  X_train, X_test, y_train, y_test=splitting_data(df)

  if vector == 'cv' :
    vec=CountVectorizer()
  elif vector=='tf_idf':
    vec=TfidfVectorizer()
  else:
    raise ValueError("vector must be 'cv' or 'tf_idf'")

  # Fit the vocabulary on the training split only, then transform both splits
  X_train_vec=vec.fit_transform(X_train)
  X_test_vec=vec.transform(X_test)

  # Training the models; f1_score expects (y_true, y_pred)
  svm=LinearSVC().fit(X_train_vec,y_train)
  svm_f1=f1_score(y_test, svm.predict(X_test_vec))

  logistic_model=LogisticRegression().fit(X_train_vec,y_train)
  logistic_model_f1=f1_score(y_test, logistic_model.predict(X_test_vec))

  rf_model=RandomForestClassifier().fit(X_train_vec,y_train)
  rf_model_f1=f1_score(y_test, rf_model.predict(X_test_vec))

  scores= {'SVM_score':{'Accuracy':svm.score(X_test_vec, y_test), 'F1_score': svm_f1}, 'logistic_score': {'Accuracy':logistic_model.score(X_test_vec, y_test), 'F1_score':logistic_model_f1}, 'Random_Forest_score': {'Accuracy':rf_model.score(X_test_vec, y_test), 'F1_score': rf_model_f1 }}

  return pd.DataFrame(scores)
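The vectorize-then-fit pattern can also be written as a scikit-learn pipeline, which keeps the vectorizer and the classifier together and fits the vocabulary on the training texts only. A small sketch on made-up sentences; `fit_and_score` and the toy data are assumptions, and the scores on such a tiny sample say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def fit_and_score(train_texts, train_labels, test_texts, test_labels):
    """Fit TF-IDF + LinearSVC on the training texts and score the test texts."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    scores = {
        "Accuracy": accuracy_score(test_labels, predicted),
        # f1_score takes the true labels first, then the predictions.
        "F1_score": f1_score(test_labels, predicted, zero_division=0),
    }
    return model, scores

# Made-up examples standing in for cleaned comments (assumption).
train_texts = ["you are an idiot", "stupid idiot go away",
               "thanks for the helpful edit", "great article thanks"]
train_labels = [1, 1, 0, 0]
test_texts = ["what an idiot", "thanks for the article"]
test_labels = [1, 0]

model, scores = fit_and_score(train_texts, train_labels, test_texts, test_labels)
print(scores)
```

The fitted `model` accepts raw strings directly, for example `model.predict(["some new comment"])`.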
  

In [38]:
X_train, X_test, y_train, y_test= splitting_data(toxic_data)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((119678,), (39893,), (119678,), (39893,))

In [39]:
# X_train_vec.shape, X_test_vec.shape

In [40]:
acc_toxic=vectorization(toxic_data, 'tf_idf')

In [42]:
acc_toxic

Unnamed: 0,SVM_score,logistic_score,Random_Forest_score
Accuracy,0.961146,0.95701,0.952423
F1_score,0.774513,0.729282,0.688648


In [43]:
# acc_severe=vectorization(severe_toxic_data, 'tf_idf')
# acc_obscene=vectorization(obscene_data, 'tf_idf')
# acc_threat=vectorization(threat_data, 'tf_idf')
# acc_insult=vectorization(insult_data, 'tf_idf')
# acc_identity=vectorization(identity_hate_data, 'tf_idf')

Choosing a model 

In [44]:
# Column 1 holds the comment text, the last column holds the label
X_final=obscene_data.iloc[:, 1]
y_final=obscene_data.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=42)

tf_idf=TfidfVectorizer()
X_train_vec=tf_idf.fit_transform(X_train)
X_test_vec=tf_idf.transform(X_test)

# Training the models; f1_score expects (y_true, y_pred)
svm=LinearSVC().fit(X_train_vec,y_train)
svm_f1=f1_score(y_test, svm.predict(X_test_vec))

randomforest = RandomForestClassifier(n_estimators=100, random_state=42)
randomforest.fit(X_train_vec, y_train)
randomforest.predict(X_test_vec)


array([0, 0, 0, ..., 0, 0, 0])

In [45]:
svm.predict(X_test_vec)

array([0, 0, 0, ..., 0, 0, 0])

In [46]:
rf_model_f1=f1_score(y_test, randomforest.predict(X_test_vec))

In [47]:
svm_f1, rf_model_f1

(0.7837203235063919, 0.740885054272196)

So far the SVM gives a higher F1 score than the random forest on both labels tested (0.775 vs 0.689 on toxic, 0.784 vs 0.741 on obscene), so it is chosen as the final model.

For pickling, I used the existing method provided by 'LinebyLine.ai' at the following link: https://github.com/line-by-line/toxic_comments_classifier/blob/master/Toxic%20Comments%20Classifier.ipynb .

In [52]:
import pickle

def pickle_model(df, label):
    
    # Column 1 holds the comment text, the last column holds the label
    X_final=df.iloc[:, 1]
    y_final=df.iloc[:,-1]

    # Initiate a Tfidf vectorizer
    tfv = TfidfVectorizer(stop_words='english')
    
    # Fit the vectorizer and convert the text into a sparse document-term matrix
    X_train_vec = tfv.fit_transform(X_final)  
    
    # Save the fitted vectorizer (including its vocabulary)
    # "wb" opens the file for writing in binary mode
    with open(r"{}.pkl".format(label + '_dict'), "wb") as f:   
        pickle.dump(tfv, f)   
        
    # Note: a random forest is fitted here, although the SVM scored higher above
    rf=RandomForestClassifier().fit(X_train_vec,y_final)
    
    # Save the fitted random forest
    with open(r"{}.pkl".format(label + '_model'), "wb") as f:  
        pickle.dump(rf, f)

In [None]:
#Creating pickle files for the full (unbalanced) data
import pickle
datasets = [toxic_data, severe_toxic_data, obscene_data, threat_data, insult_data, identity_hate_data]
# Labels must be in the same order as datasets, otherwise files are saved under the wrong name
label = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for i,j in zip(datasets,label):
    pickle_model(i, j)

The pickle files created above are large (about 300 MB each) and cannot be pushed to the GitHub repository, even with Git LFS, which allocates only 1 GB per month.

Since the files exceed that 1 GB allowance, I am creating a new model on a smaller training set.

To reduce the pickle file size, I balanced the data so that toxic (1) and non-toxic (0) examples are roughly equal in number.
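Another option worth checking is what gets pickled. A linear model stores one weight per vocabulary term, which is typically much smaller than the trees of a random forest, although the sizes for this dataset were not measured here. The sketch below pickles a single TF-IDF + LinearSVC pipeline to a temporary file, reports its size and checks that it reloads correctly. The helper names, file name and toy data are assumptions.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def save_pipeline(pipeline, path):
    """Write a fitted pipeline (vectorizer + model) to one pickle file."""
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)

def load_pipeline(path):
    """Read a pipeline back; only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Made-up examples standing in for cleaned comments (assumption).
texts = ["you are an idiot", "stupid idiot go away",
         "thanks for the helpful edit", "great article thanks"]
labels = [1, 1, 0, 0]
pipe = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toxic_pipeline.pkl")
    save_pipeline(pipe, path)
    size_bytes = os.path.getsize(path)
    restored = load_pipeline(path)

same = list(restored.predict(texts)) == list(pipe.predict(texts))
print(size_bytes, same)
```

Comparing `size_bytes` for a random forest pipeline and a linear one on the real data would show whether the smaller training set is still needed.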

In [49]:
X_train, X_test, y_train, y_test= splitting_data(toxic_data)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((24918,), (8307,), (24918,), (8307,))

In [50]:
acc_toxic=vectorization(toxic_data, 'tf_idf')

In [51]:
acc_toxic

Unnamed: 0,SVM_score,logistic_score,Random_Forest_score
Accuracy,0.869989,0.869026,0.846756
F1_score,0.85782,0.853645,0.821634


In [53]:
#Creating pickle files for the balanced data
import pickle
datasets = [toxic_data, severe_toxic_data, obscene_data, threat_data, insult_data, identity_hate_data]
# Labels must be in the same order as datasets, otherwise files are saved under the wrong name
label = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for i,j in zip(datasets,label):
    pickle_model(i, j)

To increase the accuracy of the model, I also tried deep learning techniques, training my own word embeddings and word2vec models.

However, these approaches gave similar results, so they were not validated further.