### DESCRIPTION
---

Using NLP and machine learning, make a model to identify toxic comments from the Talk edit pages on Wikipedia. Help identify the words that make a comment toxic.

#### Problem Statement:  
---
Wikipedia is the world’s largest and most popular reference work on the internet with about 500 million unique visitors per month. It also has millions of contributors who can make edits to pages. The Talk edit pages, the key community interaction forum where the contributing community interacts or discusses or debates about the changes pertaining to a particular topic. 

Wikipedia continuously strives to help online discussion become more productive and respectful. You are a data scientist at Wikipedia who will help Wikipedia to build a predictive model that identifies toxic comments in the discussion and marks them for cleanup by using NLP and machine learning. Post that, help identify the top terms from the toxic comments.

### Tasks: 

<i> 1. Load the data using read_csv function from pandas package <br>
2. Get the comments into a list, for easy text cleanup and manipulation <br>
3. Cleanup:
- Using regular expressions, remove IP addresses <br>
- Using regular expressions, remove URLs <br>
- Normalize the casing <br>
- Tokenize using word_tokenize from NLTK <br>
- Remove stop words <br>
- Remove punctuation <br>
- Define a function to perform all these steps, you’ll use this later on the actual test set <br> </i>

In [3]:
## Calling required libraries..
import numpy as np
import pandas as pd
import matplotlib as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, RegexpStemmer,SnowballStemmer,WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.corpus import brown
import re
import nltk
from sklearn.metrics.pairwise import cosine_similarity
import spacy
import string as str
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from nltk.corpus import wordnet as wn
from sklearn.linear_model import LogisticRegression
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.corpus import words
from nltk.probability import ConditionalFreqDist
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold,StratifiedKFold
from sklearn.model_selection import cross_val_score
from nltk.stem import PorterStemmer,WordNetLemmatizer
from spellchecker import SpellChecker
PS = PorterStemmer()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error,mean_squared_log_error,r2_score,classification_report,confusion_matrix
import seaborn as sns
from collections import  Counter


import warnings
warnings.filterwarnings('ignore')

In [4]:
Wikipedia_Toxicity = pd.read_csv('C:/Working Files/Mac ka folder/Simplilearn/NLP/Online Classes/Wikipedia Toxicity/train.csv')

In [5]:
Wikipedia_Toxicity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            5000 non-null   object
 1   comment_text  5000 non-null   object
 2   toxic         5000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 117.3+ KB


In [6]:
Wikipedia_Toxicity["toxic"].value_counts()

0    4563
1     437
Name: toxic, dtype: int64

In [7]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"don\'t", " do not", phrase)
    phrase = re.sub(r"doesn\'t", " does not", phrase)
    
    return phrase

In [8]:
import re
my_stopwords = stopwords.words("english")
def textPreprocessing(document):
    document = decontracted(document)
    #0.Remove IP Address
    withoutIPaddress = re.sub(r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}", "", document)
    #0.Remove IP URL
    withouturladdress = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*',"",withoutIPaddress)
    #1. Remove Punctuations
    sentWithoutPunct = ''.join([char for char in withouturladdress  if char not in str.punctuation]) 
    #2. Extract words out of the sentences
    words = sentWithoutPunct.split()
    #3. Normalize the data (lowercase)
    wordNormalized = [word.lower() for word in words]
    # 4. Remove Stopwords
    vocabulary = [word for word in wordNormalized if word not in my_stopwords]
    sent = ' '.join(vocabulary)
    return sent

In [9]:
Wikipedia_Toxicity ["cleaned_comment"] = Wikipedia_Toxicity["comment_text"].apply(textPreprocessing)

In [10]:
Wikipedia_Toxicity

Unnamed: 0,id,comment_text,toxic,cleaned_comment
0,e617e2489abe9bca,"""\r\n\r\n A barnstar for you! \r\n\r\n The De...",0,barnstar defender wiki barnstar like edit kaya...
1,9250cf637294e09d,"""\r\n\r\nThis seems unbalanced. whatever I ha...",0,seems unbalanced whatever said mathsci said fa...
2,ce1aa4592d5240ca,"Marya Dzmitruk was born in Minsk, Belarus in M...",0,marya dzmitruk born minsk belarus march 19 199...
3,48105766ff7f075b,"""\r\n\r\nTalkback\r\n\r\n Dear Celestia... """,0,talkback dear celestia
4,0543d4f82e5470b6,New Categories \r\n\r\nI honestly think that w...,0,new categories honestly think need add categor...
...,...,...,...,...
4995,60229df7b48ba6ff,"""\r\n\r\n Dildo, if you read my response corre...",0,dildo read response correctly never said going...
4996,36a645227572ec5c,"CALM DOWN, CALM DOWN, DON'T GET A BIG DICK",1,calm calm dont get big dick
4997,6d47fa39945ed6f5,In my opinion Dougweller is using his privileg...,0,opinion dougweller using privileges poorly per...
4998,de2e4c0d38db6e30,The style section has been expanded too. I did...,0,style section expanded remember placed tag


### Task
<i> 4. Using a counter, find the top terms in the data:
 - Can any of these be considered contextual stop words? <br> 
 - Words like “Wikipedia”, “page”, “edit” are examples of contextual stop words <br>
 - If yes, drop these from the data <br> </i>

In [11]:
def top_non_stopwords(text):
    stop=set(stopwords.words('english'))
    
    new= text.str.split()
    new=new.values.tolist()
    corpus=[word for i in new for word in i]

    counter=Counter(corpus)
    most=counter.most_common()
    x, y=[], []
    for word,count in most[:300]:
        if (word not in stop):
            x.append(word)
            y.append(count)
            
    return pd.DataFrame(x,y,columns=["word"])

In [12]:
top_300_words = top_non_stopwords(Wikipedia_Toxicity["cleaned_comment"])

In [13]:
top_300_words.iloc[251:300,0:1]

Unnamed: 0,word
110,changes
110,facts
110,day
110,word
109,life
109,mentioned
108,message
108,reverted
108,following
108,consider


#### The definition of contextual is depending on the context, or surrounding words, phrases, and paragraphs, of the writing. Following context words are dropped from cleaned_comment. As they don't directly represents context of statement.
pages, page, wikipedia, wiki, encyclopedia, articles, article, sources, source, discussion, ytmndin, •, —,version

In [19]:
context_word = ("pages page wikipedia wiki encyclopedia articles article artricle sources source discussion ytmndin • — version").split()

In [20]:
def document_without_context_word(document):
    words = document.split()
    vocabulary = [word for word in words if word not in context_word]
    sent = ' '.join(vocabulary)
    return sent

In [21]:
spell = SpellChecker(distance=1)
def Correct(x):
    return spell.correction(x)

In [22]:
Wikipedia_Toxicity["cleaned_comment_final"] = Wikipedia_Toxicity["cleaned_comment"].apply(document_without_context_word)

In [23]:
Wikipedia_Toxicity["cleaned_comment_final"] = Wikipedia_Toxicity["cleaned_comment_final"].apply(Correct)

In [24]:
Wikipedia_Toxicity

Unnamed: 0,id,comment_text,toxic,cleaned_comment,cleaned_comment_final
0,e617e2489abe9bca,"""\r\n\r\n A barnstar for you! \r\n\r\n The De...",0,barnstar defender wiki barnstar like edit kaya...,barnstar defender barnstar like edit kayastha ...
1,9250cf637294e09d,"""\r\n\r\nThis seems unbalanced. whatever I ha...",0,seems unbalanced whatever said mathsci said fa...,seems unbalanced whatever said mathsci said fa...
2,ce1aa4592d5240ca,"Marya Dzmitruk was born in Minsk, Belarus in M...",0,marya dzmitruk born minsk belarus march 19 199...,marya dzmitruk born minsk belarus march 19 199...
3,48105766ff7f075b,"""\r\n\r\nTalkback\r\n\r\n Dear Celestia... """,0,talkback dear celestia,talkback dear celestia
4,0543d4f82e5470b6,New Categories \r\n\r\nI honestly think that w...,0,new categories honestly think need add categor...,new categories honestly think need add categor...
...,...,...,...,...,...
4995,60229df7b48ba6ff,"""\r\n\r\n Dildo, if you read my response corre...",0,dildo read response correctly never said going...,dildo read response correctly never said going...
4996,36a645227572ec5c,"CALM DOWN, CALM DOWN, DON'T GET A BIG DICK",1,calm calm dont get big dick,calm calm dont get big dick
4997,6d47fa39945ed6f5,In my opinion Dougweller is using his privileg...,0,opinion dougweller using privileges poorly per...,opinion dougweller using privileges poorly per...
4998,de2e4c0d38db6e30,The style section has been expanded too. I did...,0,style section expanded remember placed tag,style section expanded remember placed tag


### Task
<i> 5. Separate into train and test sets
 - Use train-test method to divide your data into 2 sets: train and test <br>
 - Use a 70-30 split <br></i>
 
<i> 6. Use TF-IDF values for the terms as feature to get into a vector space model
 - Import TF-IDF vectorizer from sklearn <br>
 - Instantiate with a maximum of 4000 terms in your vocabulary <br>
 - Fit and apply on the train set <br>
 - Apply on the test set <br> </i>

In [25]:
Features = Wikipedia_Toxicity["cleaned_comment_final"]
Label = Wikipedia_Toxicity["toxic"]

In [26]:
print("Features >>", Features.shape, "Label >>", Label.shape)

Features >> (5000,) Label >> (5000,)


In [27]:
vectorizer =  TfidfVectorizer(max_features=4000)

In [28]:
X_train,X_test,y_train,y_test = train_test_split(Features,Label,test_size=0.30,random_state=20)

In [29]:
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_test = vectorizer.fit_transform(X_test).toarray()

In [30]:
print("Train >>", tf_idf_train.shape, "Test >>", tf_idf_test.shape)

Train >> (3500, 4000) Test >> (1500, 4000)


### Task
<i> 7. Model building: Support Vector Machine
 - Instantiate SVC from sklearn with a linear kernel <br>
 - Fit on the train data <br>
 - Make predictions for the train and the test set <br> </i>

In [31]:
from sklearn.svm import SVC

In [32]:
SVC = SVC(C=1.0, kernel='linear', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=- 1, decision_function_shape='ovr', break_ties=False, random_state=None)

In [33]:
SVC.fit(tf_idf_train,y_train)

SVC(kernel='linear')

In [34]:
print("Training score is ",round(SVC.score(tf_idf_train,y_train),2))
print("Testing score is ",round(SVC.score(tf_idf_test,y_test),2))

Training score is  0.97
Testing score is  0.91


### Task
<i> 8. Model evaluation: Accuracy, recall, and f1_score
 - Report the accuracy on the train set <br>
 - Report the recall on the train set:decent, high, low? <br>
 - Get the f1_score on the train set <br> </i>

In [35]:
print(classification_report(y_train,SVC.predict(tf_idf_train),target_names=["Non-Toxic","Toxic"]))

              precision    recall  f1-score   support

   Non-Toxic       0.97      1.00      0.98      3186
       Toxic       0.99      0.64      0.78       314

    accuracy                           0.97      3500
   macro avg       0.98      0.82      0.88      3500
weighted avg       0.97      0.97      0.96      3500



### Train data evaluation
---
precision is high for both Non-Toxic and Toxic. <br>
f1 score is high for predicting Non-Toxic comments, comparatively low for Toxic comments. <br>

In [36]:
print(classification_report(y_test,SVC.predict(tf_idf_test),target_names=["Non-Toxic","Toxic"]))

              precision    recall  f1-score   support

   Non-Toxic       0.92      0.99      0.95      1377
       Toxic       0.00      0.00      0.00       123

    accuracy                           0.91      1500
   macro avg       0.46      0.49      0.48      1500
weighted avg       0.84      0.91      0.87      1500



### Test data evaluation
---
model failed to predict Toxic comments

In [37]:
tf_idf_features = vectorizer.fit_transform(Features).toarray()

In [38]:
print(classification_report(Label,SVC.predict(tf_idf_features),target_names=["Non-Toxic","Toxic"]))

              precision    recall  f1-score   support

   Non-Toxic       0.91      0.99      0.95      4563
       Toxic       0.10      0.01      0.02       437

    accuracy                           0.90      5000
   macro avg       0.51      0.50      0.49      5000
weighted avg       0.84      0.90      0.87      5000



### Overall Data evaluation
---
Model fully baised for Non-Toxic data..

1. One approach to addressing imbalanced datasets is to adjust the parameters gamma='scale', class_weight='balanced' in the SVC module

2. Second approach to addressing imbalanced datasets is to oversample the minority class, in this case Toxic data set. This type of data augmentation for the minority class is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

### Task
<i> 9. Looks like you need to adjust  the class imbalance, as the model seems to focus on the 0s
 - Adjust the appropriate parameter in the SVC module </i>
 
<i> 10. Train again with the adjustment and evaluate
 - Train the model on the train set <br>
 - Evaluate the predictions on the validation set: accuracy, recall, f1_score <br> </i>

### Approach1: Adjust the parameters
---

In [39]:
from sklearn.model_selection import RepeatedStratifiedKFold,StratifiedKFold
from sklearn.svm import SVC

In [40]:
weights = {0:1.0, 1:85.0}
SVC1 = SVC(class_weight=weights,kernel='linear')

In [41]:
SVC1.fit(tf_idf_train,y_train)

SVC(class_weight={0: 1.0, 1: 85.0}, kernel='linear')

In [48]:
print("Training score is ",round(SVC1.score(tf_idf_train,y_train),2))
print("Testing score is ",round(SVC1.score(tf_idf_test,y_test),2))
print("-"*50)
print("Training Data Statistics")
print("-"*50)
print(classification_report(y_train,SVC1.predict(tf_idf_train),target_names=["Non-Toxic","Toxic"]))
print("-"*50)
print("Overall Data Statistics")
print("-"*50)
print(classification_report(Label,SVC1.predict(tf_idf_features),target_names=["Non-Toxic","Toxic"]))

Training score is  0.97
Testing score is  0.59
--------------------------------------------------
Training Data Statistics
--------------------------------------------------
              precision    recall  f1-score   support

   Non-Toxic       1.00      0.97      0.99      3186
       Toxic       0.78      1.00      0.88       314

    accuracy                           0.97      3500
   macro avg       0.89      0.99      0.93      3500
weighted avg       0.98      0.97      0.98      3500

--------------------------------------------------
Overall Data Statistics
--------------------------------------------------
              precision    recall  f1-score   support

   Non-Toxic       0.94      0.65      0.77      4563
       Toxic       0.13      0.54      0.21       437

    accuracy                           0.64      5000
   macro avg       0.53      0.60      0.49      5000
weighted avg       0.87      0.64      0.72      5000



We will replace weights = {0:1.0, 1:85.0} with 'balanced' pre-defined function with SVC to manage to manage imbalance data.

In [43]:
SVC2 = SVC(class_weight='balanced',kernel='linear')

In [44]:
SVC2.fit(tf_idf_train,y_train)

SVC(class_weight='balanced', kernel='linear')

In [49]:
print("Training score is ",round(SVC2.score(tf_idf_train,y_train),2))
print("Testing score is ",round(SVC2.score(tf_idf_test,y_test),2))
print("-"*50)
print("Training Data Statistics")
print("-"*50)
print(classification_report(y_train,SVC2.predict(tf_idf_train),target_names=["Non-Toxic","Toxic"]))
print("-"*50)
print("Overall Data Statistics")
print("-"*50)
print(classification_report(Label,SVC2.predict(tf_idf_features),target_names=["Non-Toxic","Toxic"]))

Training score is  0.98
Testing score is  0.85
--------------------------------------------------
Training Data Statistics
--------------------------------------------------
              precision    recall  f1-score   support

   Non-Toxic       1.00      0.99      0.99      3186
       Toxic       0.87      0.98      0.92       314

    accuracy                           0.98      3500
   macro avg       0.93      0.98      0.96      3500
weighted avg       0.99      0.98      0.98      3500

--------------------------------------------------
Overall Data Statistics
--------------------------------------------------
              precision    recall  f1-score   support

   Non-Toxic       0.92      0.95      0.93      4563
       Toxic       0.14      0.08      0.10       437

    accuracy                           0.88      5000
   macro avg       0.53      0.52      0.52      5000
weighted avg       0.85      0.88      0.86      5000



# Conclusion 
---
'Overall Data Statistics' results clearly shows - Toxic data predictibility high with  weights = {0:1.0, 1:85.0} compare to  'balanced' pre-defined. we continued to use 'balanced' as requested by assignment.

### Task
<i> 11. Hyperparameter tuning
 - Import GridSearch and StratifiedKFold (because of class imbalance) <br>
 - Provide the parameter grid to choose for ‘C’ <br>
    
<i> 12. Use a balanced class weight while instantiating the Support Vector Classifier </i>
 - Find the parameters with the best recall in cross validation <br>
 - Choose ‘recall’ as the metric for scoring <br>
 - Choose stratified 5 fold cross validation scheme <br> </i>

In [52]:
from sklearn.model_selection import GridSearchCV

In [53]:
params = {'C': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}

In [54]:
SVC3 = SVC(class_weight="balanced",kernel='linear')

In [55]:
model = GridSearchCV(SVC3, param_grid = params , cv=StratifiedKFold(5),scoring=['precision','recall', 'f1'], refit='recall')

### Task
<i>13. Fit on the train set <br>
14. What are the best parameters? <br>
 - Predict and evaluate using the best estimator <br>
 - Use best estimator from the grid search to make predictions on the test set <br>
 - What is the recall on the test set for the toxic comments? <br>
 - What is the f1_score? <br></i>

In [56]:
model.fit(tf_idf_train,y_train)

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=SVC(class_weight='balanced', kernel='linear'),
             param_grid={'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]},
             refit='recall', scoring=['precision', 'recall', 'f1'])

In [57]:
model.best_estimator_

SVC(C=0.2, class_weight='balanced', kernel='linear')

In [58]:
model.best_score_ 

0.5923195084485406

In [59]:
print("Training score is ",round(model.score(tf_idf_train,y_train),2))
print("Testing score is ",round(model.score(tf_idf_test,y_test),2))
print("-"*50)
print("Training Data Statistics")
print("-"*50)
print(classification_report(y_train,model.predict(tf_idf_train),target_names=["Non-Toxic","Toxic"]))
print("-"*50)
print("Overall Data Statistics")
print("-"*50)
print(classification_report(Label,model.predict(tf_idf_features),target_names=["Non-Toxic","Toxic"]))

Training score is  0.96
Testing score is  0.07
--------------------------------------------------
Training Data Statistics
--------------------------------------------------
              precision    recall  f1-score   support

   Non-Toxic       1.00      0.97      0.98      3186
       Toxic       0.77      0.96      0.86       314

    accuracy                           0.97      3500
   macro avg       0.88      0.97      0.92      3500
weighted avg       0.98      0.97      0.97      3500

--------------------------------------------------
Overall Data Statistics
--------------------------------------------------
              precision    recall  f1-score   support

   Non-Toxic       0.91      0.95      0.93      4563
       Toxic       0.11      0.07      0.08       437

    accuracy                           0.87      5000
   macro avg       0.51      0.51      0.51      5000
weighted avg       0.84      0.87      0.86      5000



# Conclusion
---
Model is not generalized, training_accuracy > testing_accuracy (Overfitted model). Alternate approach using SMOTE can be considered.

### Task
<i> 15. What are the most prominent terms in the toxic comments? <br>
 - Separate the comments from the test set that the model identified as toxic <br>
 - Make one large list of the terms <br>
 - Get the top 15 terms <br> </i>

In [118]:
predictions = model.predict(tf_idf_test)

In [143]:
new_df = pd.DataFrame()

In [144]:
new_df["comments"] = X_test

In [145]:
new_df.tail()

Unnamed: 0,comments
3912,quite accurate says debuted november 2004 coun...
345,harassing arbcom decision based lie said could...
946,way english speaker going search spelling none...
1265,dear jeff g ツ blow fag
157,warangal fort deleted warangal fort contained ...


In [146]:
new_df["prediction"] = predictions

In [168]:
new_df["original"] = y_test

In [169]:
new_df.head()

Unnamed: 0,comments,prediction,original
71,gonna see able continue,0,0
1561,please stop continue blank remove portions con...,0,0
2439,worse story invocation personal attacks reason...,0,0
1294,copyright problem removed prior content duplic...,0,0
1946,imageadrianneweyjpg image deletion warning ima...,0,0


In [170]:
top_300_words = top_non_stopwords(new_df["comments"][new_df["original"]==1])

In [171]:
top_300_words.iloc[1:15,0:1]

Unnamed: 0,word
363,suck
356,mexicans
29,shoot
29,freak
20,like
19,fuck
18,go
16,would
12,people
12,talk


In [172]:
top_300_words = top_non_stopwords(new_df["comments"][new_df["prediction"]==1])

In [173]:
top_300_words.iloc[1:15,0:1]

Unnamed: 0,word
30,see
25,would
17,talk
16,like
15,please
15,also
14,much
14,well
12,could
11,make


### Approach2: SMOTE
---

In [60]:
import imblearn

In [61]:
from imblearn.over_sampling import SMOTE

In [62]:
smote = SMOTE()

In [66]:
tf_idf_features_smote,Label_smote  = smote.fit_resample(tf_idf_features,Label)

In [67]:
print("Pre-Sampling:\n",Label.value_counts())
print("Post-Sampling:\n",Label_smote.value_counts())

Pre-Sampling:
 0    4563
1     437
Name: toxic, dtype: int64
Post-Sampling:
 0    4563
1    4563
Name: toxic, dtype: int64


In [69]:
SVC = SVC(C=1.0, kernel='linear', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=- 1, decision_function_shape='ovr', break_ties=False, random_state=None)

In [70]:
SVC.fit(tf_idf_features_smote,Label_smote)

SVC(kernel='linear')

In [71]:
print(classification_report(Label_smote,SVC.predict(tf_idf_features_smote),target_names=["Non-Toxic","Toxic"]))

              precision    recall  f1-score   support

   Non-Toxic       1.00      0.95      0.97      4563
       Toxic       0.95      1.00      0.97      4563

    accuracy                           0.97      9126
   macro avg       0.98      0.97      0.97      9126
weighted avg       0.98      0.97      0.97      9126



In [72]:
X_train1,X_test1,y_train1,y_test1 = train_test_split(tf_idf_features_smote,Label_smote,test_size=0.30,random_state=20)

In [73]:
print("Train >>", X_train1.shape, "Test >>", X_test1.shape)

Train >> (6388, 4000) Test >> (2738, 4000)


In [74]:
SVC.fit(X_train1,y_train1)

SVC(kernel='linear')

In [75]:
print("Training score is ",round(SVC.score(X_train1,y_train1),2))
print("Testing score is ",round(SVC.score(X_test1,y_test1),2))

Training score is  0.97
Testing score is  0.93


In [76]:
print(classification_report(y_train1,SVC.predict(X_train1),target_names=["Non-Toxic","Toxic"]))

              precision    recall  f1-score   support

   Non-Toxic       0.99      0.95      0.97      3175
       Toxic       0.95      0.99      0.97      3213

    accuracy                           0.97      6388
   macro avg       0.97      0.97      0.97      6388
weighted avg       0.97      0.97      0.97      6388



In [77]:
print(classification_report(y_test1,SVC.predict(X_test1),target_names=["Non-Toxic","Toxic"]))

              precision    recall  f1-score   support

   Non-Toxic       0.99      0.87      0.92      1388
       Toxic       0.88      0.99      0.93      1350

    accuracy                           0.93      2738
   macro avg       0.93      0.93      0.93      2738
weighted avg       0.93      0.93      0.93      2738



# Conclusion:
---
Training score 0.97 > Testing score 0.93; but model performing much better in predicting Toxic. This is recommened approach for model building for imbalanced data.