# ML Model Building on The Lion King (2019) Movie Reviews


### In this section
1. I extracted two sets of reviews from Rotten Tomatoes:
    a. A train set of 3000 reviews, which is split into train and valid1 sets for fitting and comparing algorithms.
    b. A valid2 set of 1100 reviews, which serves as a secondary validation set.
2. I tried several algorithms, and SVM worked best.

## Importing Necessary Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import random
import string
random.seed(123)
import datetime as dt

import warnings
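# The second positional argument is a message regex, so the category must be passed by keyword.
warnings.filterwarnings('ignore', category=RuntimeWarning)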

import nltk
import re
from nltk.corpus import stopwords
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from wordcloud import WordCloud

import spacy
nlp = spacy.load('en')

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Show full review text; None means no truncation (pandas >= 1.0, where -1 is deprecated).
pd.set_option('display.max_colwidth', None)

- data1 is split into the train and valid1 sets.
- valid2 is held out as a second, unseen validation set.
- test is the set on which the final predictions are made.

In [2]:
data1 = pd.read_csv('alreviews_df_3000.csv')
valid2 = pd.read_csv('alreviews_df_1100_validation.csv')
test = pd.read_csv('test-1566381431512.csv')

In [3]:
print(data1.shape)
print(valid2.shape)
print(test.shape)

(3000, 13)
(1100, 12)
(1200, 2)


In [4]:
print(data1.isna().sum().sum())
print(valid2.isna().sum().sum())
print(test.isna().sum().sum())

2916
1071
0


In [5]:
data1.sample()

Unnamed: 0,createDate,displayImageUrl,displayName,hasProfanity,hasSpoilers,isSuperReviewer,isVerified,rating,review,score,timeFromCreation,updateDate,primary_key
500,2019-08-13T20:28:36.454Z,,Lisa,False,False,False,True,STAR_2,The small changes from the original to new were not good.,2.0,5d ago,2019-08-13T20:28:36.454Z,500


In [6]:
valid2.sample()

Unnamed: 0,createDate,displayImageUrl,displayName,hasProfanity,hasSpoilers,isSuperReviewer,isVerified,rating,review,score,timeFromCreation,updateDate
830,2019-07-30T15:34:54.560Z,,Lolmom,False,False,False,False,STAR_3,"The humorous extras they added were fun. The imagery itself was not as impressive as the animated version, just because it's all been done before. The vocals did not live up to the original either in some cases.",3.0,"Jul 30, 2019",2019-07-30T15:34:54.560Z


In [7]:
test.sample()

Unnamed: 0,ReviewID,Review
566,93442,IT WAS AMAZING THE MOVIE KICKED ASS!!!


In [8]:
data1.drop(['createDate','displayImageUrl','displayName','hasProfanity','hasSpoilers','isSuperReviewer','isVerified','rating','timeFromCreation','updateDate','primary_key'],axis=1,inplace=True)
valid2.drop(['createDate','displayImageUrl','displayName','hasProfanity','hasSpoilers','isSuperReviewer','isVerified','rating','timeFromCreation','updateDate'],axis=1,inplace=True)

In [9]:
data1.head()

Unnamed: 0,review,score
0,"Really enjoyed it. The songs were amazing and the visuals were spectacular. I see what people are saying about it’s like if emotion and dullness, but I personally liked it and if you ever liked the original than you’ll like this. The acting and singing was amazing!",4.0
1,"Realky enjoyable. We've seen the original animated movie, the stage play and now the live action movie and it's a worthy extension of the story.",5.0
2,Beautiful! Loved it!,5.0
3,"Absolutely loved the movie, it had my emotions all over the place. Thank you for an awesome experience.",5.0
4,Tha movie was phenomenal,5.0


In [10]:
valid2.head()

Unnamed: 0,review,score
0,I thought it really good glad they didn't make scar a queer,4.0
1,I can only say this movie is good in animate this movie feels realistic but its not original The Lion King movie so I enjoy it as much as Im very disappointed in it. Id give it 4 half star for its movie but 1 star for the title because its not exact same as the 1994 The Lion King movie.,3.0
2,Great songs-----forgot it was animated,4.5
3,The kids and adults love the movie!,5.0
4,Absolutely loved it!! Heyyy what better way to release a New movie... Its Loe Season,5.0


# Creating a target 'sentiment' from score
### Positive : 0
### Negative : 1

- The evaluation metric is the F1 score on negative reviews, so negative reviews (score of 3.0 or below) are labeled 1 and positive reviews (score above 3.0) are labeled 0.

In [11]:
data1['sentiment'] = np.where((data1['score']>3.0),0,1)
valid2['sentiment'] = np.where((valid2['score']>3.0),0,1)

In [12]:
data1.sentiment.value_counts(normalize=True)

0    0.723333
1    0.276667
Name: sentiment, dtype: float64

In [13]:
valid2.sentiment.value_counts(normalize=True)

0    0.75
1    0.25
Name: sentiment, dtype: float64
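
With roughly 72% positive and 28% negative reviews, a model that always predicts positive would already be about 72% accurate, which is why the F1 score on the negative class (label 1) is the more informative metric here. The sketch below works this out by hand on made-up labels. `f1_for_label` is a hypothetical helper written for illustration only; the notebook itself uses sklearn's `f1_score`.

```python
def f1_for_label(y_true, y_pred, positive=1):
    """F1 score treating `positive` as the class of interest (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct detections of the class of interest
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 7 positive reviews (0) and 3 negative reviews (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# A model that always predicts "positive" looks fine on accuracy...
always_positive = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_positive)) / len(y_true)
# ...but finds no negative reviews at all, so its F1 on label 1 is zero.
f1_always_positive = f1_for_label(y_true, always_positive)

# A model that finds 2 of the 3 negatives and raises 1 false alarm.
some_model = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
f1_some_model = f1_for_label(y_true, some_model)

print(accuracy, f1_always_positive, round(f1_some_model, 3))
```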

In [14]:
CONTRACTION_MAP = {"ain't": 'is not', "aren't": 'are not', "can't": 'cannot', "can't've": 'cannot have', "'cause": 'because', "could've": 'could have', "couldn't": 'could not', "couldn't've": 'could not have', "didn't": 'did not', "doesn't": 'does not', "don't": 'do not', "hadn't": 'had not', "hadn't've": 'had not have', "hasn't": 'has not', "haven't": 'have not', "he'd": 'he would', "he'd've": 'he would have', "he'll": 'he will', "he'll've": 'he will have', "he's": 'he is', "how'd": 'how did', "how'd'y": 'how do you', "how'll": 'how will', "how's": 'how is', "I'd": 'I would', "I'd've": 'I would have', "I'll": 'I will', "I'll've": 'I will have', "I'm": 'I am', "I've": 'I have', "i'd": 'i would', "i'd've": 'i would have', "i'll": 'i will', "i'll've": 'i will have', "i'm": 'i am', "i've": 'i have', "isn't": 'is not', "it'd": 'it would', "it'd've": 'it would have', "it'll": 'it will', "it'll've": 'it will have', "it's": 'it is', "let's": 'let us', "ma'am": 'madam', "mayn't": 'may not', "might've": 'might have', "mightn't": 'might not', "mightn't've": 'might not have', "must've": 'must have', "mustn't": 'must not', "mustn't've": 'must not have', "needn't": 'need not', "needn't've": 'need not have', "o'clock": 'of the clock', "oughtn't": 'ought not', "oughtn't've": 'ought not have', "shan't": 'shall not', "sha'n't": 'shall not', "shan't've": 'shall not have', "she'd": 'she would', "she'd've": 'she would have', "she'll": 'she will', "she'll've": 'she will have', "she's": 'she is', "should've": 'should have', "shouldn't": 'should not', "shouldn't've": 'should not have', "so've": 'so have', "so's": 'so as', "that'd": 'that would', "that'd've": 'that would have', "that's": 'that is', "there'd": 'there would', "there'd've": 'there would have', "there's": 'there is', "they'd": 'they would', "they'd've": 'they would have', "they'll": 'they will', "they'll've": 'they will have', "they're": 'they are', "they've": 'they have', "to've": 'to have', "wasn't": 'was not', "we'd": 
'we would', "we'd've": 'we would have', "we'll": 'we will', "we'll've": 'we will have', "we're": 'we are', "we've": 'we have', "weren't": 'were not', "what'll": 'what will', "what'll've": 'what will have', "what're": 'what are', "what's": 'what is', "what've": 'what have', "when's": 'when is', "when've": 'when have', "where'd": 'where did', "where's": 'where is', "where've": 'where have', "who'll": 'who will', "who'll've": 'who will have', "who's": 'who is', "who've": 'who have', "why's": 'why is', "why've": 'why have', "will've": 'will have', "won't": 'will not', "won't've": 'will not have', "would've": 'would have', "wouldn't": 'would not', "wouldn't've": 'would not have', "y'all": 'you all', "y'all'd": 'you all would', "y'all'd've": 'you all would have', "y'all're": 'you all are', "y'all've": 'you all have', "you'd": 'you would', "you'd've": 'you would have', "you'll": 'you will', "you'll've": 'you will have', "you're": 'you are', "you've": 'you have'}

In [15]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # Try longer contractions first so that "can't've" is not matched as "can't".
    keys = sorted(contraction_mapping, key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        # Look up the exact match first, then fall back to its lower-case form.
        expanded_contraction = contraction_mapping.get(match)\
            or contraction_mapping.get(match.lower())
        # Preserve a leading capital. Copying the first character itself would
        # turn "ain't" into "as not" and "'cause" into "'ecause".
        if match[0].isupper():
            expanded_contraction = expanded_contraction[0].upper()+expanded_contraction[1:]
        return expanded_contraction

    # Normalise curly apostrophes so they match the keys of the mapping.
    expanded_text = re.sub("’", "'", text)
    expanded_text = contractions_pattern.sub(expand_match, expanded_text)

    return expanded_text

In [16]:
# Function to preprocess the reviews
def clean_doc(doc):
    # expand contractions ("don't" -> "do not")
    doc = expand_contractions(doc)
    
    # split into tokens on single spaces
    tokens = doc.split(' ')
    
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    
    # remove everything except letters, '#' and whitespace, then drop line breaks
    tokens = [re.sub(r"[^a-zA-Z#\s]",'',i) for i in tokens]
    tokens = [re.sub(r"[\r\n]",'',i) for i in tokens]
    # filter out NLTK English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    
    # lemmatize with WordNet (treats every token as a noun by default)
    lmtzr = nltk.stem.WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(w) for w in tokens]
    
    # filter out empty and single-character tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

In [17]:
data1['modified_review'] = data1.review.apply(lambda x: ' '.join(clean_doc(x)))
valid2['modified_review'] = valid2.review.apply(lambda x: ' '.join(clean_doc(x)))
test['modified_review'] = test.Review.apply(lambda x: ' '.join(clean_doc(x)))

## Model Building

In [18]:
# Creating dependent and independent variables.
X = data1['modified_review']
y = data1['sentiment']

In [19]:
X_train, X_valid1, y_train, y_valid1 = train_test_split(X,y,test_size=0.3, random_state=1234)

In [20]:
X_valid2 = valid2['modified_review']
y_valid2 = valid2['sentiment']
X_test = test['modified_review']

In [21]:
print(X_train.shape)
print(y_train.shape)
print(X_valid1.shape)
print(y_valid1.shape)
print(X_valid2.shape)
print(y_valid2.shape)

(2100,)
(2100,)
(900,)
(900,)
(1100,)
(1100,)


### Vectorizing the reviews using TFIDF Vectorizer

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.90,max_features=1000,stop_words='english')

In [23]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_valid1_tfidf = tfidf_vectorizer.transform(X_valid1)
X_valid2_tfidf = tfidf_vectorizer.transform(X_valid2)

X_test_tfidf = tfidf_vectorizer.transform(X_test)
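
The vectorizer is fitted on X_train only and then applied unchanged to the validation and test reviews, so words that never occur in the training reviews are simply dropped. The following pure-Python sketch mimics that behaviour on made-up sentences. `fit_idf` and `transform_doc` are hypothetical helpers, the IDF formula is the smoothed variant that scikit-learn documents for `smooth_idf=True`, and options such as `max_df`, `max_features`, and `stop_words` are ignored.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform_doc(doc, idf):
    """Weight term counts by IDF and L2-normalise; unseen terms are dropped."""
    counts = Counter(w for w in doc.split() if w in idf)
    weights = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}


# Made-up training reviews (already cleaned).
train_docs = ['loved movie', 'boring movie', 'loved song']
idf = fit_idf(train_docs)

# 'cgi' was never seen during fitting, so it gets no feature.
valid_vector = transform_doc('loved cgi', idf)
print(valid_vector)
```

Fitting on the training split alone keeps information from valid1, valid2, and test out of the vocabulary and IDF weights.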

### Vectorizing the reviews using Count Vectorizer

In [24]:
count_vectorizer = CountVectorizer(stop_words='english',lowercase=True, strip_accents='unicode',decode_error='ignore')

In [25]:
X_train_cv = count_vectorizer.fit_transform(X_train)
X_valid1_cv = count_vectorizer.transform(X_valid1)
X_valid2_cv = count_vectorizer.transform(X_valid2)

X_test_cv = count_vectorizer.transform(X_test)

## Logistic Regression with TFIDF Vectorizer

In [26]:
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

In [27]:
logreg = LogisticRegression(penalty='l2',class_weight='balanced',C=0.5)
lr_clf = logreg.fit(X_train_tfidf,y_train)

train_pred = lr_clf.predict(X_train_tfidf)

valid1_pred = lr_clf.predict(X_valid1_tfidf)

valid2_pred = lr_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.77
Valid1 F1 Score : 0.713
Valid2 F1 Score : 0.701


#### Observation
- Setting class_weight='balanced' increases the F1 score.
- Increasing C above 0.5 makes the model overfit.
- Decreasing C below 0.5 makes the model underfit.
- The L2 penalty gives the best results.
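
One likely reason the balanced setting helps: scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`, so mistakes on the rarer negative class cost more. The sketch below reproduces that formula in plain Python. `balanced_class_weights` is a hypothetical helper, and the counts (2170 positive, 830 negative) are derived from the data1 proportions printed earlier; the 2100-row training split will have slightly different counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Assumed counts: 0.723333 * 3000 = 2170 positive, 0.276667 * 3000 = 830 negative.
labels = [0] * 2170 + [1] * 830
weights = balanced_class_weights(labels)

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

Under these assumed counts, an error on a negative review weighs roughly 2.6 times as much as an error on a positive one.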

## Logistic Regression with Count Vectorizer

In [28]:
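# The L1 penalty needs a solver that supports it; liblinear does (newer scikit-learn defaults to lbfgs, which does not).
logreg = LogisticRegression(penalty='l1',solver='liblinear',class_weight='balanced',C=0.9)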
lr_clf = logreg.fit(X_train_cv,y_train)

train_pred = lr_clf.predict(X_train_cv)

valid1_pred = lr_clf.predict(X_valid1_cv)

valid2_pred = lr_clf.predict(X_valid2_cv)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.827
Valid1 F1 Score : 0.682
Valid2 F1 Score : 0.714


#### Observation
- Setting class_weight='balanced' increases the F1 score.
- Increasing C above 0.9 makes the model overfit.
- Decreasing C below 0.9 makes the model underfit.
- The L1 penalty gives the best results.
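
Every model in this notebook is scored the same way on train, valid1, and valid2, so the predict-and-print steps could be wrapped in one small helper. This is only a sketch: `report_scores`, `ThresholdModel`, and `accuracy` are hypothetical names used to keep the example self-contained. In the notebook, the fitted classifier and sklearn's `f1_score` would be passed instead.

```python
def report_scores(model, splits, metric, digits=3):
    """Score an already-fitted model on each named (X, y) split."""
    scores = {}
    for name, (X, y) in splits.items():
        scores[name] = round(metric(y, model.predict(X)), digits)
        print(f'{name} score : {scores[name]}')
    return scores


class ThresholdModel:
    """Stand-in for a fitted classifier: predicts 1 when the feature is positive."""

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


def accuracy(y_true, y_pred):
    """Stand-in metric with the same (y_true, y_pred) signature as f1_score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up splits: each value is a (features, labels) pair.
splits = {
    'Train': ([1, -1, 2, -3], [1, 0, 1, 0]),
    'Valid1': ([-1, 5], [1, 1]),
}
scores = report_scores(ThresholdModel(), splits, accuracy)
```

In the notebook this would look something like `report_scores(lr_clf, {'Train': (X_train_cv, y_train), ...}, f1_score)`.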

## Naive Bayes with TFIDF Vectorizer

In [29]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB(alpha=0.1)
NB_clf = classifier.fit(X_train_tfidf,y_train)

train_pred = NB_clf.predict(X_train_tfidf)

valid1_pred = NB_clf.predict(X_valid1_tfidf)

valid2_pred = NB_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.798
Valid1 F1 Score : 0.581
Valid2 F1 Score : 0.594


## Naive Bayes with Count Vectorizer


In [30]:
classifier = MultinomialNB(alpha=0.5)
NB_clf = classifier.fit(X_train_cv,y_train)

train_pred = NB_clf.predict(X_train_cv)

valid1_pred = NB_clf.predict(X_valid1_cv)

valid2_pred = NB_clf.predict(X_valid2_cv)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.859
Valid1 F1 Score : 0.647
Valid2 F1 Score : 0.664


## SVM with TFIDF

In [31]:
from sklearn import svm

In [32]:
svm_classifier = svm.SVC(kernel='linear')
svm_clf = svm_classifier.fit(X_train_tfidf,y_train)

train_pred = svm_clf.predict(X_train_tfidf)

valid1_pred = svm_clf.predict(X_valid1_tfidf)

valid2_pred = svm_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.834
Valid1 F1 Score : 0.596
Valid2 F1 Score : 0.635


In [33]:
svm_classifier = svm.SVC(kernel="linear", class_weight="balanced")
svm_clf = svm_classifier.fit(X_train_tfidf,y_train)

train_pred = svm_clf.predict(X_train_tfidf)

valid1_pred = svm_clf.predict(X_valid1_tfidf)

valid2_pred = svm_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.831
Valid1 F1 Score : 0.698
Valid2 F1 Score : 0.685


In [34]:
# test_pred = svm_clf.predict(X_test_tfidf)

# pd.Series(test_pred).value_counts(normalize=True)

# submission = pd.read_csv('samplesubmission.csv')

# submission.sentiment = test_pred.astype('int64')

# submission.sentiment.dtype

# submission.to_csv('submission1.csv',index=False)

In [35]:
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

## AdaBoost

In [36]:
Adaboost_model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=600,learning_rate=1)

In [37]:
adb_clf = Adaboost_model.fit(X_train_tfidf,y_train)

train_pred = adb_clf.predict(X_train_tfidf)

valid1_pred = adb_clf.predict(X_valid1_tfidf)

valid2_pred = adb_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.99
Valid1 F1 Score : 0.565
Valid2 F1 Score : 0.587


## AdaBoost Grid Search with TFIDF

In [38]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators':[100,150,200],
             'learning_rate':[0.1,0.5,0.9]}

Adaboost_model_grid = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier(max_depth=2)),param_grid,n_jobs=-1)

In [39]:
adb_clf = Adaboost_model_grid.fit(X_train_tfidf,y_train)

print(Adaboost_model_grid.best_params_)

train_pred = adb_clf.predict(X_train_tfidf)

valid1_pred = adb_clf.predict(X_valid1_tfidf)

valid2_pred = adb_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

{'learning_rate': 0.1, 'n_estimators': 150}
Train F1 Score : 0.787
Valid1 F1 Score : 0.499
Valid2 F1 Score : 0.527
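
GridSearchCV ranks parameter combinations by the estimator's default score, which for classifiers is accuracy, while the metric of interest in this notebook is F1 on the negative class. A possible variation, shown here on synthetic data rather than the reviews, is to pass `scoring='f1'`. The dataset, the smaller grid, `cv=3`, and the `random_state` values are all assumptions made to keep the sketch quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a class balance similar to the reviews (about 72% / 28%).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.72, 0.28], random_state=0
)

# A deliberately small grid so the example finishes quickly.
demo_grid = {'n_estimators': [25, 50], 'learning_rate': [0.1, 0.5]}

grid_f1 = GridSearchCV(
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), random_state=0),
    demo_grid,
    scoring='f1',  # rank candidates by F1 on label 1 instead of accuracy
    cv=3,
)
grid_f1.fit(X_demo, y_demo)

print(grid_f1.best_params_, round(grid_f1.best_score_, 3))
```

Selecting on F1 may pick different hyperparameters than selecting on accuracy, so the scores reported above would not necessarily carry over.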


## AdaBoost Grid Search with CountVec

In [40]:
adb_clf = Adaboost_model_grid.fit(X_train_cv,y_train)

print(Adaboost_model_grid.best_params_)

train_pred = adb_clf.predict(X_train_cv)

valid1_pred = adb_clf.predict(X_valid1_cv)

valid2_pred = adb_clf.predict(X_valid2_cv)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

{'learning_rate': 0.1, 'n_estimators': 200}
Train F1 Score : 0.781
Valid1 F1 Score : 0.54
Valid2 F1 Score : 0.549


## Gradient Boosting with TFIDF

In [41]:
from sklearn.ensemble import GradientBoostingClassifier

In [42]:
GBM_model = GradientBoostingClassifier(n_estimators=50,learning_rate=0.3,subsample=0.8)
gbm_clf = GBM_model.fit(X_train_tfidf,y_train)

train_pred = gbm_clf.predict(X_train_tfidf)

valid1_pred = gbm_clf.predict(X_valid1_tfidf)

valid2_pred = gbm_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.782
Valid1 F1 Score : 0.526
Valid2 F1 Score : 0.552


## Gradient Boosting with CountVec

In [43]:
GBM_model = GradientBoostingClassifier(n_estimators=50,learning_rate=0.3,subsample=0.8)
gbm_clf = GBM_model.fit(X_train_cv,y_train)

train_pred = gbm_clf.predict(X_train_cv)

valid1_pred = gbm_clf.predict(X_valid1_cv)

valid2_pred = gbm_clf.predict(X_valid2_cv)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.754
Valid1 F1 Score : 0.531
Valid2 F1 Score : 0.534


## SVM with CountVec

In [44]:
svm_classifier_linear = svm.SVC(kernel="linear", class_weight="balanced",C=0.09)

svm_clf = svm_classifier_linear.fit(X_train_cv,y_train)

train_pred = svm_clf.predict(X_train_cv)

valid1_pred = svm_clf.predict(X_valid1_cv)

valid2_pred = svm_clf.predict(X_valid2_cv)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.847
Valid1 F1 Score : 0.682
Valid2 F1 Score : 0.695


In [45]:
# test_pred = svm_clf.predict(X_test_cv)

# pd.Series(test_pred).value_counts(normalize=True)

# submission = pd.read_csv('samplesubmission.csv')

# submission.sentiment = test_pred.astype('int64')

# submission.sentiment.dtype

# submission.to_csv('submission2.csv',index=False)


In [46]:
svm_classifier_linear = svm.SVC(kernel="linear", class_weight="balanced",C=1)
svm_clf = svm_classifier_linear.fit(X_train_tfidf,y_train)

train_pred = svm_clf.predict(X_train_tfidf)

valid1_pred = svm_clf.predict(X_valid1_tfidf)

valid2_pred = svm_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.831
Valid1 F1 Score : 0.698
Valid2 F1 Score : 0.685


## Grid Search on Gradient Boosting with CountVec

In [47]:
from sklearn.model_selection import GridSearchCV

param_grid = {
              'learning_rate':[0.1,0.5],
              'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1]}

GBM_model_grid = GridSearchCV(GradientBoostingClassifier(n_estimators=200,),param_grid,n_jobs=-1,cv=5)

In [48]:
gbm_clf = GBM_model_grid.fit(X_train_cv,y_train)

train_pred = gbm_clf.predict(X_train_cv)

valid1_pred = gbm_clf.predict(X_valid1_cv)

valid2_pred = gbm_clf.predict(X_valid2_cv)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.763
Valid1 F1 Score : 0.563
Valid2 F1 Score : 0.559


In [49]:
GBM_model_grid.best_params_

{'learning_rate': 0.1, 'subsample': 0.8}

In [50]:
GBM_model = GradientBoostingClassifier(random_state=123,n_estimators=300,learning_rate=0.5,min_samples_leaf=10,max_features=25,max_depth=2)

gbm_clf = GBM_model.fit(X_train_tfidf,y_train)

train_pred = gbm_clf.predict(X_train_tfidf)

valid1_pred = gbm_clf.predict(X_valid1_tfidf)

valid2_pred = gbm_clf.predict(X_valid2_tfidf)

print('Train F1 Score :',round(f1_score(y_train,train_pred),2))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),2))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),2))

Train F1 Score : 0.87
Valid1 F1 Score : 0.58
Valid2 F1 Score : 0.63


## CountVec using N-grams (1,2) and (1,3)

In [51]:
count_vectorizern2 = CountVectorizer(stop_words='english',lowercase=True, strip_accents='unicode',decode_error='ignore',ngram_range=(1,2))
count_vectorizern3 = CountVectorizer(stop_words='english',lowercase=True, strip_accents='unicode',decode_error='ignore',ngram_range=(1,3))
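
To make the `ngram_range` setting concrete, here is a small pure-Python sketch of which features a word-level range such as (1,2) or (1,3) produces for one cleaned review. `word_ngrams` is a hypothetical helper and the tokens are made up; CountVectorizer additionally applies its own tokenization and stop-word filtering before building n-grams.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams with n in the inclusive range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


tokens = ['loved', 'original', 'movie']

uni_bi = word_ngrams(tokens, (1, 2))
uni_bi_tri = word_ngrams(tokens, (1, 3))

print(uni_bi)
print(uni_bi_tri)
```

Each extra n-gram order adds many rare features, which gives the models below more room to fit the training reviews.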

In [52]:
X_train_cvn2 = count_vectorizern2.fit_transform(X_train)
X_valid1_cvn2 = count_vectorizern2.transform(X_valid1)
X_valid2_cvn2 = count_vectorizern2.transform(X_valid2)

X_test_cvn2 = count_vectorizern2.transform(X_test)

In [53]:
X_train_cvn3 = count_vectorizern3.fit_transform(X_train)
X_valid1_cvn3 = count_vectorizern3.transform(X_valid1)
X_valid2_cvn3 = count_vectorizern3.transform(X_valid2)

X_test_cvn3 = count_vectorizern3.transform(X_test)

In [54]:
logreg = LogisticRegression(penalty='l2',class_weight='balanced',C=0.1)
lr_clf = logreg.fit(X_train_cvn2,y_train)

train_pred = lr_clf.predict(X_train_cvn2)

valid1_pred = lr_clf.predict(X_valid1_cvn2)

valid2_pred = lr_clf.predict(X_valid2_cvn2)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.891
Valid1 F1 Score : 0.673
Valid2 F1 Score : 0.693


In [55]:
logreg = LogisticRegression(penalty='l2',class_weight='balanced',C=0.08)
lr_clf = logreg.fit(X_train_cvn3,y_train)

train_pred = lr_clf.predict(X_train_cvn3)

valid1_pred = lr_clf.predict(X_valid1_cvn3)

valid2_pred = lr_clf.predict(X_valid2_cvn3)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.918
Valid1 F1 Score : 0.673
Valid2 F1 Score : 0.683


In [56]:
svm_classifier_linear = svm.SVC(kernel="linear", class_weight="balanced",C=0.05)

svm_clf = svm_classifier_linear.fit(X_train_cvn2,y_train)

train_pred = svm_clf.predict(X_train_cvn2)

valid1_pred = svm_clf.predict(X_valid1_cvn2)

valid2_pred = svm_clf.predict(X_valid2_cvn2)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.921
Valid1 F1 Score : 0.652
Valid2 F1 Score : 0.676


In [57]:
svm_classifier_linear = svm.SVC(kernel="linear", class_weight="balanced",C=0.05)

svm_clf = svm_classifier_linear.fit(X_train_cvn3,y_train)

train_pred = svm_clf.predict(X_train_cvn3)

valid1_pred = svm_clf.predict(X_valid1_cvn3)

valid2_pred = svm_clf.predict(X_valid2_cvn3)

print('Train F1 Score :',round(f1_score(y_train,train_pred),3))
print('Valid1 F1 Score :',round(f1_score(y_valid1,valid1_pred),3))
print('Valid2 F1 Score :',round(f1_score(y_valid2,valid2_pred),3))

Train F1 Score : 0.961
Valid1 F1 Score : 0.63
Valid2 F1 Score : 0.656
