
## Applying sentiment analysis to movie review data posted on Kaggle

https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

### Process unclassified phrases using saved Sentiment module and trained data


## Part 4 - Evaluate Model Using Kaggle Training Set
__Summary__:  

Accuracy is the share of phrases for which the model correctly predicts the sentiment. This is a supervised learning exercise. The Kaggle data is labeled on a 0-4 scale, so a review labeled 2 is neither positive nor negative. Our classification is binary: each prediction is either positive or negative. After excluding neutral reviews, 6,874 labeled phrases remain for measuring the model, and 100% of the training data is used to test its accuracy. Only the top 5,000 features across all 6,874 phrases are used to train the model. With neutral labels removed, the base prediction (always predicting positive) is 52.40%. Overfitting is a risk because this is supervised learning and we are testing the model against the training set.
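As a rough illustration of where the 52.40% base prediction comes from, the sketch below computes the null accuracy (the accuracy of always predicting the majority class). The class counts are copied from the classification report later in this notebook; the variable names are illustrative only.

```python
# Class counts taken from the "support" column of the classification report.
n_positive = 3602
n_negative = 3272
total = n_positive + n_negative

# Null accuracy: the score obtained by always predicting the majority class.
null_accuracy = max(n_positive, n_negative) / total
print("null accuracy: {:.2%}".format(null_accuracy))
```

Any model worth keeping should beat this figure, since it requires no learning at all.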

## Does the model work?  
Our movie sentiment analysis yielded 68.45% accuracy on the training set, which is 16.05 percentage points above the null accuracy of 52.40%. The model has much higher precision for positive predictions (95%) than for negative predictions (60%), while having a higher F1 score for negative predictions (0.75 vs 0.58). Overall I would be more confident in predicting negative reviews than positive reviews, which may reflect sarcasm that is more easily misinterpreted by the parts of speech (POS) selected for the model.
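To show how precision and recall combine into the F1 scores quoted above, here is a minimal sketch using the counts from the confusion matrix in this notebook. The helper name `f1` is an illustrative choice, not part of the original code.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Counts from the confusion matrix: 1513 true positives, 80 false positives,
# 2089 false negatives, 3192 true negatives.
pos_precision, pos_recall = 1513 / (1513 + 80), 1513 / (1513 + 2089)
neg_precision, neg_recall = 3192 / (3192 + 2089), 3192 / (3192 + 80)

f1_pos = f1(pos_precision, pos_recall)
f1_neg = f1(neg_precision, neg_recall)
print("F1 positive: {:.2f}".format(f1_pos))
print("F1 negative: {:.2f}".format(f1_neg))
```

The positive class has high precision but low recall, which pulls its F1 down; the negative class has the opposite profile and ends up with the higher F1.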


1.  Create a processed phrase table for the training set in Postgres DB.
2.  Evaluate the model using Sklearn metrics classification and confusion matrix.

#### References
>[binary classification](https://en.wikipedia.org/wiki/Binary_classification)

>[confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

>[precision, recall and f measures](https://en.wikipedia.org/wiki/Precision_and_recall)

>[Python 3 Text Processing with NLTK 3 Cookbook](https://www.amazon.com/Python-Text-Processing-NLTK-Cookbook-ebook/dp/B00N2RWMJU/ref=sr_1_2?ie=UTF8&qid=1547992672&sr=8-2&keywords=nltk)

>[Tokenizing Words and Sentences with NLTK](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

<img src="confusionmatrix.png">

---

In [1]:
#!/usr/local/bin/python3.6
import os
import psycopg2
import psycopg2.extras
import numpy as np
from sqlalchemy import create_engine

In [2]:
#postgres authentication
user = "alexp"
password = "secret"
host = "pg_db"
port = "5432"
database = "priv_workspace"

In [3]:
#view pg environment
connection = None  # defined up front so the finally block is safe if connect() fails
try:
    connection = psycopg2.connect(user = user,
                                  password = password,
                                  host = host,
                                  port = port,
                                  database = database)
    cursor = connection.cursor()
    # Print PostgreSQL connection properties
    print(connection.get_dsn_parameters(), "\n")
    # Print PostgreSQL version
    cursor.execute("SELECT version();")
    record = cursor.fetchone()
    print("You are connected to - ", record, "\n")
except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)
finally:
    # close the database connection only if it was opened
    if connection:
        cursor.close()
        connection.close()
        print("PostgreSQL connection is closed")

{'user': 'alexp', 'dbname': 'priv_workspace', 'host': 'pg_db', 'port': '5432', 'tty': '', 'options': '', 'sslmode': 'prefer', 'sslcompression': '1', 'krbsrvname': 'postgres'} 

You are connected to -  ('PostgreSQL 10.5 (Debian 10.5-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit',) 

PostgreSQL connection is closed


#### 1.  Create a processed phrase table for the training set in Postgres DB.

In [7]:
# DDL for the processed_phrases_train table
processedPhrasesTable = '''CREATE TABLE IF NOT EXISTS "processed_phrases_train"(
    "phraseid" INTEGER REFERENCES sent_movie_train(phraseid),
    "sentenceid" INTEGER,
    "rowno" INT,
    "sentiment" TEXT,
    "sent_score" NUMERIC (5,2)
);'''

In [9]:
import sentiment_mod as s
#score phrases for sentiment and write to pg db
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

connection = None  # defined up front so the finally block is safe if connect() fails
try:
    connection = psycopg2.connect(user = user,
                                  password = password,
                                  host = host,
                                  port = port,
                                  database = database)
    cursor = connection.cursor()
    connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)

    #create processed train table if it doesn't already exist
    cursor.execute(processedPhrasesTable)

    #fetch every phrase once, numbered in phrase order
    #(avoids re-running the window query for each row)
    cursor.execute('''SELECT a.phraseid, a.sentenceid, a.phrase,
        row_number() over (order by phrase)
        FROM "sent_train_tble" a''')
    rows = cursor.fetchall()
    totalDocuments = len(rows)
    print('retrieved total documents ', totalDocuments)

    #parameterized insert: psycopg2 handles quoting of the values
    sql = '''insert into processed_phrases_train values(%s, %s, %s, %s, %s)'''
    for phraseid, sentenceid, phrase, rowno in rows:
        #s.sentiment returns a tuple whose first two items are the label and its score
        sentiment = s.sentiment(phrase)
        cursor.execute(sql, (phraseid, sentenceid, rowno, sentiment[0], sentiment[1]))
    print('inserted ', totalDocuments,  ' rows into processed_phrases_train table')

except (Exception, psycopg2.DatabaseError) as error:
    print("Error while scoring phrases or writing to PostgreSQL", error)
finally:
    #close the database connection only if it was opened
    if connection:
        cursor.close()
        connection.close()
        print("PostgreSQL connection is closed")

retrieved total documents  8529
inserted  8529  rows into processed_phrases table
PostgreSQL connection is closed
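
The row-by-row inserts above could also be sent in batches. The following is only a sketch under stated assumptions: it assumes `s.sentiment` returns a `(label, confidence)` pair, as the column order in the insert suggests, and it uses a stand-in scorer (`fake_scorer`, hypothetical) so the row-building step can run without the saved model or a database. `score_rows` and `insert_scored` are illustrative names; `execute_values` is the batching helper from `psycopg2.extras`.

```python
def score_rows(rows, scorer):
    """Turn (phraseid, sentenceid, phrase, rowno) tuples into insert-ready tuples."""
    scored = []
    for phraseid, sentenceid, phrase, rowno in rows:
        label, confidence = scorer(phrase)  # assumed (label, confidence) pair
        scored.append((phraseid, sentenceid, rowno, label, confidence))
    return scored


def insert_scored(cursor, scored):
    """Insert all scored rows in batches using psycopg2's execute_values."""
    # Imported lazily so the scoring helper works without psycopg2 installed.
    from psycopg2.extras import execute_values
    execute_values(
        cursor,
        "INSERT INTO processed_phrases_train VALUES %s",
        scored,
    )


# Stand-in scorer for illustration only; the notebook would pass s.sentiment.
def fake_scorer(phrase):
    return ("pos", 1.0) if "good" in phrase else ("neg", 0.6)


demo = score_rows([(1, 1, "a good film", 1), (2, 1, "a dull film", 2)], fake_scorer)
print(demo)
```

In the notebook this would be called as `insert_scored(cursor, score_rows(rows, s.sentiment))` with an open cursor; separating scoring from writing also makes the scoring step easy to check on its own.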


In [11]:
import pandas as pd
#load predicted versus actual labels into a pandas dataframe

engine_string = "postgresql://{}:{}@{}:{}/{}".format(user, password, host, port, database)
engine = create_engine(engine_string)

try:
    #get predicted versus actual results
    #actual: Kaggle labels 0-1 -> 0 (negative), 3-4 -> 1 (positive), 2 -> NULL (neutral)
    sql = '''SELECT a.phraseid, a.sentiment, a.sent_score, 
    CASE WHEN b.sentiment < 2 THEN 0 
    WHEN b.sentiment > 2 THEN 1
    ELSE NULL END actual,
    CASE WHEN a.sentiment = 'neg' THEN 0 
    WHEN a.sentiment = 'pos' THEN 1
    ELSE NULL END predicted
    FROM "processed_phrases_train" a
    LEFT JOIN sent_train_tble b on a.phraseid=b.phraseid'''
    df = pd.read_sql_query(sql, con=engine)
    print('created pandas dataframe ')

except Exception as error:
    print("Error while querying PostgreSQL", error)
finally:
    #release the pooled database connections held by the engine
    engine.dispose()
    print("PostgreSQL connection is closed")

created pandas dataframe 
PostgreSQL connection is closed
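
For readers less familiar with SQL `CASE` expressions, the mapping of Kaggle's 0-4 labels to the binary `actual` column can be sketched in plain Python. The function name `to_binary` is illustrative and does not appear in the notebook.

```python
def to_binary(label):
    """Map a 0-4 Kaggle sentiment label to 0 (negative), 1 (positive) or None (neutral)."""
    if label < 2:
        return 0
    if label > 2:
        return 1
    return None  # label 2 is neutral and is dropped before evaluation


labels = [0, 1, 2, 3, 4]
mapped = [to_binary(x) for x in labels]
print(mapped)
```

The `None` values correspond to the SQL `NULL`s, which is why a later `dropna()` removes exactly the neutral phrases.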


In [13]:
df.shape

(8529, 5)

In [107]:
#remove rows whose Kaggle label was 2 (neutral on the 0-4 scale); their "actual" value is null
dfEval = df.dropna()

In [113]:
dfEval.shape

(6874, 5)

In [130]:
dfEval.head()

|   | phraseid | sentiment | sent_score | actual | predicted |
|---|---|---|---|---|---|
| 0 | 1 | neg | 1.0 | 0.0 | 0 |
| 1 | 64 | pos | 1.0 | 1.0 | 1 |
| 2 | 82 | neg | 1.0 | 0.0 | 0 |
| 3 | 117 | neg | 0.6 | 1.0 | 0 |
| 4 | 157 | neg | 1.0 | 0.0 | 0 |


In [135]:
dfEval.describe()

|   | phraseid | sent_score | actual | predicted |
|---|---|---|---|---|
| count | 6874.0 | 6874.0 | 6874.0 | 6874.0 |
| mean | 81557.343759 | 0.953331 | 0.524003 | 0.231743 |
| std | 44255.796561 | 0.113403 | 0.49946 | 0.421976 |
| min | 1.0 | 0.6 | 0.0 | 0.0 |
| 25% | 44167.0 | 1.0 | 0.0 | 0.0 |
| 50% | 82664.0 | 1.0 | 1.0 | 0.0 |
| 75% | 119954.75 | 1.0 | 1.0 | 0.0 |
| max | 156032.0 | 1.0 | 1.0 | 1.0 |


#### 2.  Evaluate the model using Sklearn metrics classification and confusion matrix.

In [141]:
y_pred = [0 if n == 0 else 1 for n in dfEval['predicted']]
y_actual = [0 if n == 0 else 1 for n in dfEval['actual']]

In [115]:
print('''Train set without "neutral sentiment" has total {0} entries with {1:.2f}% negative and {2:.2f}% positive'''.format(len(dfEval['predicted']),
                    len(dfEval[dfEval['actual'] ==0.0])/ len(dfEval['actual'])*100,
                    len(dfEval[dfEval['actual'] ==1.0])/ len(dfEval['actual'])*100
                   ))

Train set without "neutral sentiment" has total 6874 entries with 47.60% negative and 52.40% positive


In [156]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

conmat = np.array(confusion_matrix(y_actual, y_pred, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['positive', 'negative'],
                         columns=['predicted_positive','predicted_negative'])

print("Accuracy Score: {0:.2f}%".format(accuracy_score(y_actual, y_pred)*100))

print("-"*80)

print("Confusion Matrix\n")
print(confusion)
print("-"*80)
print("Classification Report\n")
target_names = ['class neg', 'class pos']
print(classification_report(y_actual, y_pred, target_names=target_names))

Accuracy Score: 68.45%
--------------------------------------------------------------------------------
Confusion Matrix

          predicted_positive  predicted_negative
positive                1513                2089
negative                  80                3192
--------------------------------------------------------------------------------
Classification Report

             precision    recall  f1-score   support

  class neg       0.60      0.98      0.75      3272
  class pos       0.95      0.42      0.58      3602

avg / total       0.79      0.68      0.66      6874
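
As a cross-check of the report above, the accuracy and the weighted precision can be recomputed by hand from the four confusion matrix counts. This is a sketch; it assumes the `avg / total` row is a support-weighted average of the per-class values, which matches the printed 0.79.

```python
# Counts copied from the confusion matrix printed above.
tp, fn = 1513, 2089   # actual positive: predicted positive / predicted negative
fp, tn = 80, 3192     # actual negative: predicted positive / predicted negative

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

# Per-class precision and support (number of actual members of each class).
precision = {"pos": tp / (tp + fp), "neg": tn / (tn + fn)}
support = {"pos": tp + fn, "neg": tn + fp}

# Support-weighted average, assumed to be what the "avg / total" row reports.
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print("accuracy: {:.2%}".format(accuracy))
print("weighted precision: {:.2f}".format(weighted_precision))
```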



In [166]:
### better understand how the measures are formed
print("total reviews ", 3602 + 3272)
print("actual positive reviews", 3602, "positive precision", 1513/(1513+80), '\n')
print("actual positive reviews", 3602, "positive recall", 1513/(1513+2089), '\n\n')


print("actual negative reviews", 3272, "negative precision", 3192/(3192+2089), '\n')
print("actual negative reviews", 3272, "negative recall", 3192/(3192+80), '\n')

total reviews  6874
actual positive reviews 3602 positive precision 0.9497802887633396 

actual positive reviews 3602 positive recall 0.42004441976679624 


actual negative reviews 3272 negative precision 0.6044309789812535 

actual negative reviews 3272 negative recall 0.9755501222493888 

