## Text Classification
The moviereviews2.tsv dataset contains the text of 6000 movie reviews. The text has been reduced
and preprocessed as a tab-delimited file. For more information on this dataset visit
http://ai.stanford.edu/~amaas/data/sentiment/
- Perform imports and load the dataset into a pandas DataFrame.
- Data Cleanup: Handle missing (NaN) values and whitespace-only strings.
- Split the data into train & test sets. Use test_size=0.33, random_state=42
- Build a pipeline to vectorize the data, then train and fit a model. You may use whatever model you like, such as LinearSVC.
- Run predictions and analyze the results. Report the confusion matrix and classification report.

In [51]:
import numpy as np
import pandas as pd

#Load the dataset
df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
df.head()

  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...


In [5]:
#Check for NaN values
df.isnull().sum()

label      0
review    20
dtype: int64

There are 20 reviews that are null.

In [6]:
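#Check for whitespace-only strings
blanks = []

# Each row unpacks as (index, label, review)
for i, lb, rv in df.itertuples():
    # NaN reviews are floats, so only test actual strings
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

len(blanks)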

0

There aren't any whitespace-only strings.
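
The row-by-row loop above can also be expressed as a vectorized check. The following is a minimal sketch on a small made-up DataFrame (`toy` and its contents are hypothetical stand-ins for the real dataset), showing how missing and whitespace-only reviews could be flagged together:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', '   ', None, 'Dull.'],
})

# Replace missing reviews with '' first; ''.isspace() is False,
# so missing values are not counted as whitespace-only here
is_blank = toy['review'].fillna('').str.isspace()
blank_index = toy[is_blank].index.tolist()

# Drop rows that are either missing or whitespace-only
to_drop = toy['review'].isnull() | is_blank
cleaned = toy[~to_drop]

print(blank_index)
print(cleaned)
```

On the real data the same two masks could be applied to `df['review']`; since no whitespace-only reviews were found, `dropna` alone is sufficient there.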

In [7]:
#Data Cleanup
df.dropna(inplace=True)

In [8]:
#Check for NaN values after cleanup
df.isnull().sum()

label     0
review    0
dtype: int64

In [10]:
#Split train and test sets
from sklearn.model_selection import train_test_split

y = df['label']
X = df['review']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [11]:
#Build a pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  
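
Once fitted, a pipeline like this accepts raw text directly, because the vectorizer step runs before the classifier. The sketch below illustrates that on a tiny made-up corpus (`toy_reviews`, `toy_labels` and `toy_clf` are hypothetical names, and four sentences are far too few for a meaningful model):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, made up for illustration only
toy_reviews = [
    'I loved this movie, wonderful acting',
    'A wonderful, touching story',
    'Terrible plot and awful acting',
    'Awful, boring and a waste of time',
]
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),  # raw text -> TF-IDF features
    ('clf', LinearSVC()),          # linear classifier on those features
])
toy_clf.fit(toy_reviews, toy_labels)

# predict() expects an iterable of documents, not a single bare string
new_review = ['What a wonderful movie']
prediction = toy_clf.predict(new_review)
print(prediction[0])
```

The same call pattern should apply to `text_clf` after it has been fitted on the training split.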



In [12]:
#Run predictions and analyze the results
from sklearn import metrics

predictions = text_clf.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))

[[900  91]
 [ 63 920]]


In [13]:
#Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974
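
The figures in the classification report follow directly from the confusion matrix. As a rough check, the sketch below recomputes them in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, ordered `neg` then `pos`):

```python
# Confusion matrix from the output above: rows = true, columns = predicted
cm = [[900, 91],
      [63, 920]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

report = {}
for i, label in enumerate(labels):
    correct = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this label
    support = sum(cm[i])                   # row sum: everything truly of this label
    precision = correct / predicted
    recall = correct / support
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), support)

print(report)
# {'neg': (0.93, 0.91, 0.92, 991), 'pos': (0.91, 0.94, 0.92, 983)}
print(round(accuracy, 2))
# 0.92
```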



## Sentiment Analysis
*Task #1* : Write a function (word_vector) that takes in 3 strings (words), performs a - b + c vector arithmetic on your own words, and returns the ten closest results by cosine similarity. The goal is to come as close to an expected word as possible.

In [10]:
import spacy
nlp = spacy.load('en_core_web_md')

In [11]:
from scipy import spatial

def cosine_similarity(x, y):
    return 1 - spatial.distance.cosine(x, y)
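
`spatial.distance.cosine` returns the cosine distance, which is why it is subtracted from 1 to get a similarity. To make the quantity concrete, here is a minimal plain-Python sketch of the same measure (`cosine_similarity_plain` is a hypothetical helper, not part of SciPy or spaCy):

```python
import math

def cosine_similarity_plain(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity_plain([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity_plain([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0]))  # opposite: -1.0
```

Only the angle between the vectors matters, not their lengths, which is why word vectors of different magnitudes can still be compared.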

In [12]:
def word_vector(a,b,c):
    # Look up the vector for each input word
    a_vector = nlp.vocab[a].vector
    b_vector = nlp.vocab[b].vector
    c_vector = nlp.vocab[c].vector

    # Vector arithmetic: a - b + c
    new_vector = a_vector - b_vector + c_vector
    computed_similarities = []

    # Compare against every lowercase, alphabetic vocabulary entry that has a vector
    for word in nlp.vocab:
        if word.has_vector and word.is_lower and word.is_alpha:
            similarity = cosine_similarity(new_vector, word.vector)
            computed_similarities.append((word, similarity))

    # Sort by similarity, highest first
    computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

    return [w[0].text for w in computed_similarities[:10]]

In [17]:
word_vector('wolf','dog','cat')

['wolf', 'cat', 'man', 'woman', 'i', 'nt', 'nuff', 'cos', 'cuz', 'coz']
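
Note that two of the input words ('wolf' and 'cat') appear at the top of the result, since the computed vector stays close to its own operands. One possible refinement, sketched here as an assumption rather than as part of the task, is to drop the input words from the ranked list (`drop_input_words` is a hypothetical helper):

```python
def drop_input_words(ranked_words, input_words, top_n=10):
    """Remove the query words from an already ranked list of candidates."""
    excluded = {w.lower() for w in input_words}
    return [w for w in ranked_words if w.lower() not in excluded][:top_n]

# Ranked output reported above for word_vector('wolf', 'dog', 'cat')
ranked = ['wolf', 'cat', 'man', 'woman', 'i', 'nt', 'nuff', 'cos', 'cuz', 'coz']
print(drop_input_words(ranked, ('wolf', 'dog', 'cat')))
# ['man', 'woman', 'i', 'nt', 'nuff', 'cos', 'cuz', 'coz']
```

To still return ten words, the filter would need to run inside `word_vector` before the list is cut to its first ten entries.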

*Task #2* : Write a function to perform VADER Sentiment Analysis on your own review. The function returns a set of “SentimentIntensityAnalyzer” polarity scores based on a written review. Consider returning a score of "Positive", "Negative" or "Neutral".

In [21]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/neslisahcelek/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [22]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [44]:
office_review = 'The season is of to a high standard with hysterical one liners, terrific pranks from Jim, Michael’s hilarious leadership, and the introduction to the cast of the brilliant Ed Helms as Andy Bernard.'

In [45]:
sid.polarity_scores(office_review)

{'neg': 0.028, 'neu': 0.724, 'pos': 0.248, 'compound': 0.8591}

In [48]:
def rating_score(string):
    scores = sid.polarity_scores(string)
    if scores['compound'] == 0:
        return 'Neutral'
    elif scores['compound'] > 0:
        return 'Positive'
    else:
        return 'Negative'

In [49]:
rating_score(office_review)

'Positive'
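
`rating_score` labels a review 'Neutral' only when the compound score is exactly 0. A commonly cited convention for VADER instead treats a small band around zero as neutral, typically compound scores between -0.05 and 0.05. The sketch below shows that variant; `rating_from_compound` and its `threshold` parameter are hypothetical names, and the cutoff is an assumption that can be tuned:

```python
def rating_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label, with a neutral band around zero."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

print(rating_from_compound(0.8591))  # Positive (the compound score reported above)
print(rating_from_compound(0.02))    # Neutral
print(rating_from_compound(-0.4))    # Negative
```

It could be combined with the analyzer as `rating_from_compound(sid.polarity_scores(text)['compound'])`.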