For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

- Data cleaning / processing / language parsing
- Create features using two different NLP methods: For example, BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.
- Pick one of the models and try to increase accuracy by at least 5 percentage points.
- Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.
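As a rough illustration of the cross-validation step in the list above, the sketch below splits row indices into k folds in plain Python. The function name `k_fold_indices` is hypothetical and is not part of any library. The notebook itself relies on scikit-learn's `cross_val_score` and `cross_val_predict`, which use stratified folds for classifiers when `cv` is an integer, so this unstratified version only shows the general idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unstratified folds."""
    # The first n_samples % k folds get one extra row so every row is tested once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), len(test))
```

Each model is then fit on the training indices of a fold and scored on that fold's test indices, which gives one score per fold.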

In [54]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import spacy
from nltk.corpus import movie_reviews, stopwords
from collections import Counter
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

In [22]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to C:\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


True

In [23]:
movie_reviews

<CategorizedPlaintextCorpusReader in 'C:\\nltk_data\\corpora\\movie_reviews'>

In [24]:
print (len(movie_reviews.fileids()))

2000


In [25]:
print (movie_reviews.categories())

['neg', 'pos']


In [26]:
print (len(movie_reviews.fileids('pos')))
print (len(movie_reviews.fileids('neg')))

1000
1000


In [27]:
# Join the raw text of the first five negative and the first five positive reviews
neg_rev_list = []
for i in movie_reviews.fileids('neg')[0:5]:
    #print(i)
    neg_rev_list.append(movie_reviews.raw(i))
all_neg_revs = ' '.join(neg_rev_list)

pos_rev_list = []
for i in movie_reviews.fileids('pos')[0:5]:
    #print(i)
    pos_rev_list.append(movie_reviews.raw(i))
all_pos_revs = ' '.join(pos_rev_list)

In [28]:
all_neg_revs[0:500]

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt'

In [29]:
all_pos_revs[0:500]

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject"

In [30]:
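# Utility function for standard text cleaning (from the curriculum).
def text_cleaner(text):
    # Collapse all runs of whitespace (including newlines) into single spaces.
    text = ' '.join(text.split())
    # Replace a spaced-out ellipsis with a period; the dots are escaped so
    # they match literal periods rather than any character.
    text = re.sub(r' \. \. \. ', '. ', text)
    return text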
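In a regular expression an unescaped `.` matches any character, which matters for a pattern meant to catch a spaced-out ellipsis. The sketch below compares an unescaped and an escaped version of such a pattern on two short sample strings; the variable names are illustrative only.

```python
import re

spaced_ellipsis = 'find out . . . critique'
single_letters = 'plan a b c now'

# Unescaped dots match any character, so unrelated text is swallowed too.
unescaped = re.sub(r' . . . ', '. ', single_letters)

# Escaped dots match only literal periods.
escaped_ellipsis = re.sub(r' \. \. \. ', '. ', spaced_ellipsis)
escaped_letters = re.sub(r' \. \. \. ', '. ', single_letters)

print(unescaped)
print(escaped_ellipsis)
print(escaped_letters)
```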

In [31]:
neg_rev_clean = text_cleaner(all_neg_revs)
neg_rev_all_clean = neg_rev_clean.replace("\\",'')

pos_rev_clean = text_cleaner(all_pos_revs)
pos_rev_all_clean = pos_rev_clean.replace("\\",'')

In [32]:
all_neg_revs[0:500]

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt'

In [33]:
neg_rev_all_clean[0:500]

'plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what\'s the deal ? watch the movie and " sorta " find out. critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break t'

In [34]:
all_pos_revs[0:500]

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . \nfor starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . \nto say moore and campbell thoroughly researched the subject"

In [35]:
pos_rev_all_clean[0:500]

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . to say moore and campbell thoroughly researched the subject o"

In [36]:
# Parse using SpaCy
nlp = spacy.load('en_core_web_sm')
neg_rev = nlp(neg_rev_all_clean)
pos_rev = nlp(pos_rev_all_clean)

In [37]:
#Group into sentences
neg_sents = [[sent,'Negative'] for sent in neg_rev.sents]
pos_sents = [[sent, 'Positive'] for sent in pos_rev.sents]


sentences_df = pd.DataFrame(neg_sents + pos_sents)
sentences_df.head()

Unnamed: 0,0,1
0,"(plot, :, two, teen, couples, go, to, a, churc...",Negative
1,"(they, get, into, an, accident, .)",Negative
2,"(one, of, the, guys, dies, ,, but, his, girlfr...",Negative
3,"(what, 's, the, deal, ?)",Negative
4,"(watch, the, movie, and, "", sorta, "", find, ou...",Negative


# BoW

In [38]:
# BoW function
def bag_of_words(text, most_common_count):
    
    # filter out punctuation and stop words
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    print('allwords count', len(allwords))
    # Return most common words
    return [item[0] for item in Counter(allwords).most_common(most_common_count)]

In [39]:
# Get bags 
neg_words = bag_of_words(neg_rev, 500)

pos_words = bag_of_words(pos_rev, 500)

# Combine bags to create common set of unique words
common_words = set(neg_words + pos_words)

allwords count 1201
allwords count 1710


In [40]:
len(pos_words)

500

In [41]:
len(neg_words)

500

In [42]:
# Create bag of words data frame using combined common words and sentences
def bow_features(sentences, common_words):
    
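    # Build data frame. Use a list of column names, since newer pandas
    # versions do not accept a set as an indexer.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0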
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
    
    return df
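The bag-of-words steps above (take the most common words per class, combine them into one vocabulary, then count them per sentence) can be sketched without spaCy or pandas. This is a simplified illustration under stated assumptions: it splits on whitespace instead of lemmatizing, the stop-word list is a tiny made-up one, and the helper names `top_words` and `count_rows` are hypothetical.

```python
from collections import Counter

STOP_WORDS = {'the', 'a', 'is', 'and', 'it'}  # tiny illustrative stop-word list


def top_words(sentences, n):
    """Return the n most frequent non-stop words across the sentences."""
    counts = Counter(word for s in sentences for word in s.split()
                     if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


def count_rows(sentences, vocabulary):
    """Return one {word: count} row per sentence, restricted to the vocabulary."""
    rows = []
    for s in sentences:
        counts = Counter(word for word in s.split() if word in vocabulary)
        rows.append({word: counts[word] for word in vocabulary})
    return rows


neg = ['the plot is bad', 'bad plot and a bad script']
pos = ['the plot is good', 'it is a good plot']

# Union of the top words from each class, as in the notebook.
vocabulary = sorted(set(top_words(neg, 2) + top_words(pos, 2)))
rows = count_rows(neg + pos, vocabulary)
print(vocabulary)
print(rows[1])
```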

In [43]:
# Create bow features 
reviews = bow_features(sentences_df, common_words)
reviews.head()

Unnamed: 0,skip,police,deftly,horror,hunt,entire,twin,witch,design,palate,...,enter,scene,20th,high,carve,ape,hollow,indiglo,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(plot, :, two, teen, couples, go, to, a, churc...",Negative
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(they, get, into, an, accident, .)",Negative
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(one, of, the, guys, dies, ,, but, his, girlfr...",Negative
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(what, 's, the, deal, ?)",Negative
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(watch, the, movie, and, "", sorta, "", find, ou...",Negative


# TF-IDF

In [44]:
neg_sents_list = []
pos_sents_list = []
all_sents_list = []
for i in movie_reviews.fileids('neg')[0:5]:
    
    neg_rev_sents = movie_reviews.sents(i)
    neg_sents_list.append([ " ".join(sent) for sent in neg_rev_sents]    )

for i in movie_reviews.fileids('pos')[0:5]:
    
    pos_rev_sents = movie_reviews.sents(i)
    pos_sents_list.append([ " ".join(sent) for sent in pos_rev_sents]    )

neg_pos_sent = neg_sents_list + pos_sents_list

for sublist in neg_pos_sent:
    for item in sublist:
        all_sents_list.append(item)

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

#vectorize

vectorizer = TfidfVectorizer(max_df=0.5, 
                             min_df=2, 
                             stop_words='english',   
                             use_idf=True,
                             norm=u'l2', 
                             smooth_idf=True 
                            )

all_sents_tfidf = vectorizer.fit_transform(all_sents_list)
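To make the weighting concrete, here is a minimal hand-rolled sketch of the smoothed tf-idf that `TfidfVectorizer` computes with `smooth_idf=True` and `norm='l2'`: raw term counts times `ln((1 + n) / (1 + df)) + 1`, followed by l2 normalization of each row. The function `tfidf_rows` is a hypothetical helper; it splits on whitespace and ignores `max_df`, `min_df`, and stop words, so its numbers will not match the vectorizer's output on real text.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document using smoothed idf and l2 norm."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        # Scale the row to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(['good movie', 'bad movie'])
print(rows[0])
```

A term that appears in every document ("movie") ends up with a lower weight than one that appears in only one ("good").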

In [46]:
neg_cnt = 0
for sublist in neg_sents_list:
    for item in sublist:
        neg_cnt += 1
print(neg_cnt)   
pos_cnt = 0
for sublist in pos_sents_list:
    for item in sublist:
        pos_cnt += 1
print(pos_cnt)

144
161
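These counts are later used to build the label list for the tf-idf features. An alternative sketch, using made-up sentences and hypothetical variable names, collects each label at the same time as its sentence so the two lists cannot drift out of step:

```python
neg_sents_by_review = [['bad plot .', 'weak ending .'], ['dull .']]
pos_sents_by_review = [['great cast .'], ['sharp script .', 'fun .']]

sentences, labels = [], []
for label, reviews_for_label in (('Negative', neg_sents_by_review),
                                 ('Positive', pos_sents_by_review)):
    for review in reviews_for_label:
        for sentence in review:
            sentences.append(sentence)
            labels.append(label)

print(len(sentences), labels)
```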


In [47]:
# Set model variables

# BoW
X_bow = reviews.drop(['text_sentence', 'text_source'], axis=1)
Y_bow = reviews['text_source']

# Tfidf
X_tfidf = all_sents_tfidf
Y_tfidf = ['Negative'] * neg_cnt + ['Positive'] * pos_cnt

In [48]:
X_train_tfidf, X_test_tfidf,y_train_tfidf, y_test_tfidf= train_test_split(X_tfidf,Y_tfidf, test_size=0.3, random_state=13)

In [49]:
X_train_bow, X_test_bow,y_train_bow, y_test_bow= train_test_split(X_bow,Y_bow, test_size=0.3, random_state=13)

# Models

In [72]:
from sklearn.linear_model import LogisticRegression

#BoW
log_r_bow = LogisticRegression(random_state=13)

log_r_bow.fit(X_train_bow, y_train_bow)

y_train_bow_predlog = cross_val_predict(log_r_bow, X_train_bow, y_train_bow, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_bow, y_train_bow_predlog))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_bow, y_train_bow_predlog))
print('Cross Validation Score: \n' ,cross_val_score(log_r_bow, X_train_bow, y_train_bow, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_bow, y_train_bow_predlog))

y_test_bow_predlog = log_r_bow.predict(X_test_bow)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_bow, y_test_bow_predlog))

Accuracy Score: 
 0.7112068965517241
Confusion Matrix: 
 [[ 53  49]
 [ 18 112]]
Cross Validation Score: 
 [0.65957447 0.80851064 0.69565217 0.69565217 0.69565217]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.75      0.52      0.61       102
    Positive       0.70      0.86      0.77       130

    accuracy                           0.71       232
   macro avg       0.72      0.69      0.69       232
weighted avg       0.72      0.71      0.70       232

Test Set Accuracy Score: 
 0.75
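The accuracy, precision, and recall in the report above follow directly from the confusion matrix. A short worked check, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, in the order Negative, Positive):

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# in the order [Negative, Positive].
confusion = [[53, 49],
             [18, 112]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Negative class: precision uses the first column, recall uses the first row.
neg_precision = confusion[0][0] / (confusion[0][0] + confusion[1][0])
neg_recall = confusion[0][0] / (confusion[0][0] + confusion[0][1])

print(round(accuracy, 4), round(neg_precision, 2), round(neg_recall, 2))
```

The low recall for the Negative class reflects the 49 negative sentences that were predicted as Positive.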


In [71]:
#tfidf
log_r_tfidf = LogisticRegression(random_state=13)

log_r_tfidf.fit(X_train_tfidf, y_train_tfidf)

y_train_tfidf_predlog = cross_val_predict(log_r_tfidf, X_train_tfidf, y_train_tfidf, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_tfidf, y_train_tfidf_predlog))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_tfidf, y_train_tfidf_predlog))
print('Cross Validation Score: \n' ,cross_val_score(log_r_tfidf, X_train_tfidf, y_train_tfidf, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_tfidf, y_train_tfidf_predlog))

y_test_tfidf_predlog = log_r_tfidf.predict(X_test_tfidf)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_tfidf, y_test_tfidf_predlog))

Accuracy Score: 
 0.676056338028169
Confusion Matrix: 
 [[49 51]
 [18 95]]
Cross Validation Score: 
 [0.74418605 0.6744186  0.55813953 0.66666667 0.73809524]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.73      0.49      0.59       100
    Positive       0.65      0.84      0.73       113

    accuracy                           0.68       213
   macro avg       0.69      0.67      0.66       213
weighted avg       0.69      0.68      0.66       213

Test Set Accuracy Score: 
 0.7717391304347826


In [70]:
from sklearn import ensemble

#BoW
rfc_bow = ensemble.RandomForestClassifier(random_state=13)

rfc_bow.fit(X_train_bow, y_train_bow)

y_train_bow_predrfc = cross_val_predict(rfc_bow, X_train_bow, y_train_bow, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_bow, y_train_bow_predrfc))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_bow, y_train_bow_predrfc))
print('Cross Validation Score: \n' ,cross_val_score(rfc_bow, X_train_bow, y_train_bow, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_bow, y_train_bow_predrfc))

y_test_bow_predrfc = rfc_bow.predict(X_test_bow)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_bow, y_test_bow_predrfc))

Accuracy Score: 
 0.6637931034482759
Confusion Matrix: 
 [[ 33  69]
 [  9 121]]
Cross Validation Score: 
 [0.63829787 0.68085106 0.65217391 0.7173913  0.63043478]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.79      0.32      0.46       102
    Positive       0.64      0.93      0.76       130

    accuracy                           0.66       232
   macro avg       0.71      0.63      0.61       232
weighted avg       0.70      0.66      0.63       232

Test Set Accuracy Score: 
 0.74


In [69]:
#tfidf
rfc_tfidf = ensemble.RandomForestClassifier(random_state=13)

rfc_tfidf.fit(X_train_tfidf, y_train_tfidf)

y_train_tfidf_predrfc = cross_val_predict(rfc_tfidf, X_train_tfidf, y_train_tfidf, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_tfidf, y_train_tfidf_predrfc))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_tfidf, y_train_tfidf_predrfc))
print('Cross Validation Score: \n' ,cross_val_score(rfc_tfidf, X_train_tfidf, y_train_tfidf, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_tfidf, y_train_tfidf_predrfc))

y_test_tfidf_predrfc = rfc_tfidf.predict(X_test_tfidf)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_tfidf, y_test_tfidf_predrfc))

Accuracy Score: 
 0.6995305164319249
Confusion Matrix: 
 [[85 15]
 [49 64]]
Cross Validation Score: 
 [0.69767442 0.74418605 0.62790698 0.71428571 0.71428571]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.63      0.85      0.73       100
    Positive       0.81      0.57      0.67       113

    accuracy                           0.70       213
   macro avg       0.72      0.71      0.70       213
weighted avg       0.73      0.70      0.69       213

Test Set Accuracy Score: 
 0.7282608695652174


In [68]:
from sklearn.svm import SVC

#BoW
svc_bow = SVC(random_state=13)

svc_bow.fit(X_train_bow, y_train_bow)

y_train_bow_predsvc = cross_val_predict(svc_bow, X_train_bow, y_train_bow, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_bow, y_train_bow_predsvc))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_bow, y_train_bow_predsvc))
print('Cross Validation Score: \n' ,cross_val_score(svc_bow, X_train_bow, y_train_bow, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_bow, y_train_bow_predsvc))

y_test_bow_predsvc = svc_bow.predict(X_test_bow)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_bow, y_test_bow_predsvc))

Accuracy Score: 
 0.6422413793103449
Confusion Matrix: 
 [[ 20  82]
 [  1 129]]
Cross Validation Score: 
 [0.61702128 0.68085106 0.63043478 0.67391304 0.60869565]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.95      0.20      0.33       102
    Positive       0.61      0.99      0.76       130

    accuracy                           0.64       232
   macro avg       0.78      0.59      0.54       232
weighted avg       0.76      0.64      0.57       232

Test Set Accuracy Score: 
 0.62


In [65]:
#tfidf
svc_tfidf = SVC(random_state=13)
svc_tfidf.fit(X_train_tfidf, y_train_tfidf)

y_train_tfidf_predsvc = cross_val_predict(svc_tfidf, X_train_tfidf, y_train_tfidf, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_tfidf, y_train_tfidf_predsvc))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_tfidf, y_train_tfidf_predsvc))
print('Cross Validation Score: \n' ,cross_val_score(svc_tfidf, X_train_tfidf, y_train_tfidf, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_tfidf, y_train_tfidf_predsvc))

y_test_tfidf_predsvc = svc_tfidf.predict(X_test_tfidf)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_tfidf, y_test_tfidf_predsvc))

Accuracy Score: 
 0.6666666666666666
Confusion Matrix: 
 [[45 55]
 [16 97]]
Cross Validation Score: 
 [0.55813953 0.74418605 0.58139535 0.76190476 0.69047619]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.74      0.45      0.56       100
    Positive       0.64      0.86      0.73       113

    accuracy                           0.67       213
   macro avg       0.69      0.65      0.65       213
weighted avg       0.68      0.67      0.65       213

Test Set Accuracy Score: 
 0.717391304347826


# Improve Accuracy

In [93]:
# Rebuild the bags of words (still the 500 most common lemmas per class)
neg_words = bag_of_words(neg_rev, 500)

pos_words = bag_of_words(pos_rev, 500)

# Combine bags to create common set of unique words
common_words = set(neg_words + pos_words)

allwords count 1201
allwords count 1710


In [94]:
# Create bow features 
reviews = bow_features(sentences_df, common_words)
reviews.head()

Unnamed: 0,skip,police,deftly,horror,hunt,entire,twin,witch,design,palate,...,enter,scene,20th,high,carve,ape,hollow,indiglo,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(plot, :, two, teen, couples, go, to, a, churc...",Negative
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(they, get, into, an, accident, .)",Negative
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(one, of, the, guys, dies, ,, but, his, girlfr...",Negative
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(what, 's, the, deal, ?)",Negative
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(watch, the, movie, and, "", sorta, "", find, ou...",Negative


In [95]:
def entity_types(df):
    # Map each spaCy entity label to the name of the count column it feeds.
    ent_columns = {
        'PERSON': 'person_ent',
        'QUANTITY': 'qty_ent',
        'ORDINAL': 'ordinal_ent',
        'TIME': 'time_ent',
        'ORG': 'org_ent',
        'LANGUAGE': 'lang_ent',
        'DATE': 'date_ent',
        'CARDINAL': 'card_ent',
        'GPE': 'gpe_ent',
        'FAC': 'fac_ent',
    }
    counts = {column: [] for column in ent_columns.values()}

    for sentence in df['text_sentence']:
        # Count the tokens in this sentence that carry each entity label.
        label_counts = Counter(token.ent_type_ for token in sentence)
        for label, column in ent_columns.items():
            counts[column].append(label_counts[label])

    # Add one count column per entity type.
    for column, values in counts.items():
        df[column] = values
    return df

In [96]:
reviews = entity_types(reviews)

In [97]:
reviews.head()

Unnamed: 0,skip,police,deftly,horror,hunt,entire,twin,witch,design,palate,...,person_ent,qty_ent,ordinal_ent,time_ent,org_ent,lang_ent,date_ent,card_ent,gpe_ent,fac_ent
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [98]:
X_bow = reviews.drop(['text_sentence', 'text_source'], axis=1)
Y_bow = reviews['text_source']
X_train_bow, X_test_bow,y_train_bow, y_test_bow= train_test_split(X_bow, Y_bow, test_size=0.3, random_state=13)

In [99]:
#BoW
log_r_bow = LogisticRegression(random_state=13)

log_r_bow.fit(X_train_bow, y_train_bow)

y_train_bow_predlog = cross_val_predict(log_r_bow, X_train_bow, y_train_bow, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_bow, y_train_bow_predlog))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_bow, y_train_bow_predlog))
print('Cross Validation Score: \n' ,cross_val_score(log_r_bow, X_train_bow, y_train_bow, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_bow, y_train_bow_predlog))

y_test_bow_predlog = log_r_bow.predict(X_test_bow)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_bow, y_test_bow_predlog))

Accuracy Score: 
 0.7025862068965517
Confusion Matrix: 
 [[ 53  49]
 [ 20 110]]
Cross Validation Score: 
 [0.63829787 0.74468085 0.73913043 0.76086957 0.63043478]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.73      0.52      0.61       102
    Positive       0.69      0.85      0.76       130

    accuracy                           0.70       232
   macro avg       0.71      0.68      0.68       232
weighted avg       0.71      0.70      0.69       232

Test Set Accuracy Score: 
 0.74


In [100]:
#tune the parameters
parameters =[ {'C': [0.01, 0.1, 1, 10, 100],'solver':['liblinear'],'penalty':['l1', 'l2'],'fit_intercept':[True]},
            {'C': [0.01, 0.1, 1, 10, 100],'solver':['lbfgs','newton-cg'],'fit_intercept':[True]}
            ]

grid_logr = GridSearchCV(log_r_bow, param_grid = parameters )
grid_logr.fit(X_train_bow, y_train_bow)
print('Best Parameter ', grid_logr.best_params_)

Best Parameter  {'C': 1, 'fit_intercept': True, 'penalty': 'l2', 'solver': 'liblinear'}
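For a sense of how much work the grid search does, the sketch below expands the same parameter grids into individual candidates using only the standard library. The helper `expand_grid` is hypothetical; scikit-learn performs this expansion internally and then cross-validates each candidate.

```python
from itertools import product

parameters = [
    {'C': [0.01, 0.1, 1, 10, 100], 'solver': ['liblinear'],
     'penalty': ['l1', 'l2'], 'fit_intercept': [True]},
    {'C': [0.01, 0.1, 1, 10, 100], 'solver': ['lbfgs', 'newton-cg'],
     'fit_intercept': [True]},
]


def expand_grid(grids):
    """Return every parameter combination described by a list of grids."""
    candidates = []
    for grid in grids:
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


candidates = expand_grid(parameters)
print(len(candidates))
print(candidates[0])
```

The two grids contribute 10 candidates each, so 20 settings are compared in total.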


In [101]:
log_r_tuned = LogisticRegression(**grid_logr.best_params_, random_state = 13)
log_r_tuned.fit(X_train_bow, y_train_bow)

y_train_bow_predlog = cross_val_predict(log_r_tuned, X_train_bow, y_train_bow, cv=5)
print('Accuracy Score: \n' ,metrics.accuracy_score(y_train_bow, y_train_bow_predlog))
print('Confusion Matrix: \n' ,confusion_matrix(y_train_bow, y_train_bow_predlog))
print('Cross Validation Score: \n' ,cross_val_score(log_r_tuned, X_train_bow, y_train_bow, cv=5, scoring='accuracy'))
print('Classification Report: \n' ,classification_report(y_train_bow, y_train_bow_predlog))

y_test_bow_predlog = log_r_tuned.predict(X_test_bow)
print('Test Set Accuracy Score: \n' ,metrics.accuracy_score(y_test_bow, y_test_bow_predlog))

Accuracy Score: 
 0.7025862068965517
Confusion Matrix: 
 [[ 53  49]
 [ 20 110]]
Cross Validation Score: 
 [0.63829787 0.74468085 0.73913043 0.76086957 0.63043478]
Classification Report: 
               precision    recall  f1-score   support

    Negative       0.73      0.52      0.61       102
    Positive       0.69      0.85      0.76       130

    accuracy                           0.70       232
   macro avg       0.71      0.68      0.68       232
weighted avg       0.71      0.70      0.69       232

Test Set Accuracy Score: 
 0.74


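Adding entity features did not help: for logistic regression on the BoW features, cross-validated accuracy on the training set went from 0.711 to 0.703 and test-set accuracy from 0.75 to 0.74. The grid search selected C=1 with an l2 penalty and the liblinear solver, and the tuned model scored the same as the untuned one, so the goal of a 5 percentage point improvement was not met.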
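To put a number on the change, the fold accuracies from the two logistic regression runs on BoW features can be averaged and compared. The values are copied from the outputs above; the variable names are illustrative.

```python
from statistics import mean

# Fold accuracies copied from the logistic regression outputs above.
bow_only_folds = [0.65957447, 0.80851064, 0.69565217, 0.69565217, 0.69565217]
bow_entity_folds = [0.63829787, 0.74468085, 0.73913043, 0.76086957, 0.63043478]

bow_only_mean = mean(bow_only_folds)
bow_entity_mean = mean(bow_entity_folds)
change_in_points = (bow_entity_mean - bow_only_mean) * 100

print(round(bow_only_mean, 3), round(bow_entity_mean, 3), round(change_in_points, 1))
```

The spread between individual folds (roughly 0.63 to 0.81) is much larger than this difference, so with so few sentences the change is small relative to fold-to-fold variation.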