## Text Classification with spaCy - Sentiment Analysis

https://www.kaggle.com/abhideshmukh1/nlp-spacy-prac-yelp-reviews/notebook

In [1]:
import pandas as pd
import spacy

## Problem Statement:

In this project, natural language processing is used for sentiment analysis of labelled online reviews from Amazon, Yelp and IMDB.  The data set contains about 1,000 reviews from each site, of which roughly 500 are labelled positive and 500 negative.

Each review is preprocessed with the Python spaCy package: it is tokenized, lemmatized and filtered for stop words and punctuation.  The data is split into training and testing sets, and a pipeline vectorizes the preprocessed text (first with a count vectorizer, later with a TF-IDF vectorizer) before training and evaluating the machine learning models.

## Load the data

Read the files separately, then concatenate them, using the source name as a key

[Sentiment Labelled Sentences Data Set](http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#)

In [2]:
# Load our dataset
# http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
df_yelp = pd.read_csv('../data/SentimentData/yelp_labelled.txt', sep = '\t')
df_imdb = pd.read_csv('../data/SentimentData/imdb_labelled.txt', sep = '\t')
df_amzn = pd.read_csv('../data/SentimentData/amazon_cells_labelled.txt', sep = '\t')

In [3]:
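# The output below shows that the files have no header row: read_csv has used the first review of the file as the 
# column names (passing header=None would keep that row as data).  The columns hold the review text and a label: 
# 0 when negative, 1 when positive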
df_yelp.columns

Index(['Wow... Loved this place.', '1'], dtype='object')
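
The column names above are the first review of the Yelp file, which suggests that `read_csv` consumed the first data row as a header. A minimal sketch of the difference, using a two-line in-memory sample instead of the real files (the sample text and variable names are illustrative assumptions):

```python
import io
import pandas as pd

# Two tab-separated reviews without a header row, mimicking the layout of the data files
sample = "Wow... Loved this place.\t1\nCrust is not good.\t0\n"

# Default behaviour: the first line becomes the column names, so one review is lost
with_default = pd.read_csv(io.StringIO(sample), sep='\t')

# header=None keeps every line as data; names supplies the column labels directly
with_header_none = pd.read_csv(io.StringIO(sample), sep='\t',
                               header=None, names=['Message', 'Target'])

print(with_default.shape)       # rows read with the default header handling
print(with_header_none.shape)   # rows read with header=None
print(with_header_none.columns.tolist())
```

Passing `names` at read time would also make the later renaming step unnecessary.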

In [4]:
# Concatenate our datasets
frames = [df_yelp, df_imdb, df_amzn]

# Rename the columns of each frame
for frame in frames:
    frame.columns = ['Message', 'Target']

for frame in frames:
    print(frame.columns)

Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')


In [5]:
# To record which dataset each review came from, the company name is added as the outer level of the row index
company = ['Yelp', 'IMDB', 'Amazon']

# Concatenate our datasets
df = pd.concat(frames, keys = company)
df.shape

(2745, 2)
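
The shape above still reports two columns because `keys` adds an outer index level rather than a column. A small sketch of how that level can be named and, if preferred, turned into a regular column; the toy frames and the level names `Company` and `Row` are assumptions for illustration:

```python
import pandas as pd

# Toy frames standing in for the per-site review tables
yelp = pd.DataFrame({'Message': ['Great food.'], 'Target': [1]})
imdb = pd.DataFrame({'Message': ['Bad plot.'], 'Target': [0]})

# keys labels each frame; names gives the resulting index levels readable names
combined = pd.concat([yelp, imdb], keys=['Yelp', 'IMDB'], names=['Company', 'Row'])
print(combined.shape)         # still two columns: the key lives in the index
print(combined.loc['Yelp'])   # select one source through the outer index level

# Move the outer level into an ordinary column when that is more convenient
flat = combined.reset_index(level='Company')
print(flat.columns.tolist())
```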

In [6]:
df.head()

Unnamed: 0,Unnamed: 1,Message,Target
Yelp,0,Crust is not good.,0
Yelp,1,Not tasty and the texture was just nasty.,0
Yelp,2,Stopped by during the late May bank holiday of...,1
Yelp,3,The selection on the menu was great and so wer...,1
Yelp,4,Now I am getting angry and I want my damn pho.,0


In [7]:
df.tail()

Unnamed: 0,Unnamed: 1,Message,Target
Amazon,994,The screen does get smudged easily because it ...,0
Amazon,995,What a piece of junk.. I lose more calls on th...,0
Amazon,996,Item Does Not Match Picture.,0
Amazon,997,The only thing that disappoint me is the infra...,0
Amazon,998,"You can not answer calls with the unit, never ...",0


In [8]:
df.to_csv('../data/SentimentData/sentiment_dataset.csv')
print(df.columns)
df.isnull().sum()

Index(['Message', 'Target'], dtype='object')


Message    0
Target     0
dtype: int64

## Preprocessing the data using spaCy and training machine learning models using sklearn

In this stage, spaCy is used to lemmatize the reviews and remove stop words from the dataset

In [9]:
# Import the English language class
from spacy.lang.en import English

# https://stackoverflow.com/questions/51072516/pos-in-spacy-is-not-returning-any-results-in-python
#
# Instantiate the English language class.  It contains the language-specific data, such as tokenization rules and 
# stop words, but it does not load a statistical model, which is what enables spaCy to predict part-of-speech tags 
# and other linguistic annotations.  spaCy will therefore tokenize the text, but its pipeline is empty: it has no 
# tagger or parser, and no model weights
nlp_parser = English()
nlp_parser.pipeline

[]
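
To see what a tokenizer-only pipeline still provides, here is a short sketch. It uses `spacy.blank("en")`, which builds the same kind of empty English pipeline; the sample sentence and the name `nlp_blank` are illustrative assumptions:

```python
import spacy

# A blank English pipeline: tokenizer and language data only, no trained components
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)   # names of the pipeline components (none are loaded)

doc = nlp_blank("Wow... Loved this place.")
for token in doc:
    # text, is_punct and is_stop come from the tokenizer and lexical data,
    # so they are available without a tagger or parser
    print(repr(token.text), token.is_punct, token.is_stop)
```

Whether `token.lemma_` is populated without a trained pipeline depends on the spaCy version and the lookup data installed, so it is worth inspecting the tokenizer output in your own environment.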

In [10]:
from spacy.lang.en.stop_words import STOP_WORDS

# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)
stopwords

['see',
 'take',
 'between',
 "n't",
 'hereafter',
 'n’t',
 'into',
 '‘re',
 'move',
 'its',
 '‘ve',
 'your',
 'noone',
 'why',
 'hence',
 'off',
 'toward',
 'around',
 'name',
 'up',
 'one',
 'only',
 'yourself',
 'amongst',
 'more',
 'wherever',
 'during',
 'whereas',
 'so',
 'among',
 '‘s',
 'this',
 'less',
 'for',
 'just',
 'onto',
 'therefore',
 'because',
 'seemed',
 'whereby',
 'himself',
 'always',
 'thru',
 'on',
 'hers',
 'forty',
 'herself',
 'without',
 'mostly',
 'formerly',
 'what',
 'upon',
 'would',
 'along',
 'though',
 'go',
 'already',
 'often',
 'give',
 'if',
 "'re",
 'whose',
 'something',
 'empty',
 '‘ll',
 'below',
 'anyhow',
 'our',
 'beside',
 'now',
 'latter',
 'yet',
 'yours',
 'can',
 'or',
 'over',
 'throughout',
 'those',
 'were',
 'somehow',
 'do',
 'does',
 'towards',
 'above',
 'itself',
 'ever',
 'being',
 'here',
 'but',
 'when',
 'hereby',
 'perhaps',
 'is',
 'next',
 'again',
 'whence',
 'herein',
 'them',
 'how',
 'ourselves',
 'and',
 'themselve

In [11]:
# Use the punctuation characters defined in the string module
import string
punctuations = string.punctuation

In [12]:
# Create spacy tokenizer that accepts a sentence as input and processes the sentence into tokens, performing lemmatization, 
# lowercasing, and removing stop words and punctuation 
def spacy_tokenizer(sentence):
    mytokens = nlp_parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != '-PRON-' else word.lower_ for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    return mytokens
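
One detail of the filter is worth checking: `punctuations` is a single string, so `word not in punctuations` is a substring test. The sketch below works on a hand-written token list (an assumption; no spaCy is involved) and compares it with a stricter character-by-character check:

```python
import string

punctuations = string.punctuation       # one string of ASCII punctuation characters
punctuation_chars = set(punctuations)

# Hypothetical tokens, including multi-character punctuation runs and an empty string
tokens = ['wow', '...', 'love', 'place', '.', '!!!', '']

# Substring test, as in spacy_tokenizer: drops '.' and '' but keeps '...' and '!!!',
# because those runs do not occur as substrings of string.punctuation
kept_substring = [t for t in tokens if t not in punctuations]

def is_punctuation(token):
    """True when the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in punctuation_chars for ch in token)

# Stricter filter: drops empty tokens and any token consisting only of punctuation
kept_strict = [t for t in tokens if t and not is_punctuation(t)]

print(kept_substring)
print(kept_strict)
```

If the tokenizer emits runs such as `...` as a single token, the stricter check (or spaCy's own `token.is_punct`) keeps them out of the vocabulary.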

## Machine Learning with sklearn

In [21]:
# ML Packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [22]:
from sklearn.svm import SVC

# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary = True)
# classifier = LinearSVC()
classifier = SVC(C = 150, gamma = 2e-2, probability = True)
print(vectorizer)

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function spacy_tokenizer at 0x000001E033C01EA0>,
        vocabulary=None)


In [23]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

# Features and labels
X = df['Message']
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

## Create a sklearn pipeline with components: vectorizer, classifier

For the tokenizer and vectorizer we build our own custom modules using spaCy

• The first stage of the pipeline vectorizes the data.  Tokens are lemmatized and filtered for stop words and 
  punctuation by spacy_tokenizer(); pronouns keep their lowercased surface form.  CountVectorizer is used here, and 
  TfidfVectorizer is tried later to decide which is better.<br>
• The second stage of the pipeline defines the classifier.  A support vector classifier (SVC, with its default RBF kernel) is used here; LinearSVC is left commented out as an alternative

In [24]:
# Create the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('vectorizer', vectorizer),
                 ('classifier', classifier)])

# Fit our data
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        ...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [25]:
# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample, pred) in zip(X_test, sample_prediction):
    print(sample, 'Prediction =>', pred)

Great pork sandwich. Prediction => 1
It is a true classic.   Prediction => 0
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction => 1
Audio Quality is poor, very poor. Prediction => 0
We loved the biscuits!!! Prediction => 1
I don't have very many words to say about this place, but it does everything pretty well. Prediction => 0
Was not happy. Prediction => 1
The headsets are easy to use and everyone loves them. Prediction => 1
I miss it and wish they had one in Philadelphia! Prediction => 0
Still it's quite interesting and entertaining to follow.   Prediction => 1
All three broke within two months of use. Prediction => 0
Oh yeah, and the storyline was pathetic too.   Prediction => 0
IT'S REALLY EASY. Prediction => 1
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction => 0
The food was outstanding and the prices were very reasonable. Prediction => 1
I am so tired of clich

He owns the film, just as Spacek owned Coal Miner's Daughter" and Quaid owned "Great Balls of Fire.   Prediction => 1
A standout scene.   Prediction => 0
Worst hour and a half of my life!Oh my gosh!   Prediction => 0
They had a toro tartare with a cavier that was extraordinary and I liked the thinly sliced wagyu with white truffle. Prediction => 1
Also, it's a real treat to see Anthony Quinn playing Crazy Horse.   Prediction => 1
I don't each much pasta, but I love the homemade /hand made pastas and thin pizzas here. Prediction => 1
I am very impressed with the job that Motorola did on the sturdiness of this phone. Prediction => 0
I received my headset in good time and was happy with it. Prediction => 1
Would recommend this item. Prediction => 0
We definately enjoyed ourselves. Prediction => 1
very good product, well made. Prediction => 1
The writer, Gorman Bechard, undoubtedly did his homework because all references are industry and character-age appropriate.   Prediction => 1
I like 

Also were served hot bread and butter, and home made potato chips with bacon bits on top....very original and very good. Prediction => 0
There was hardly any meat. Prediction => 0
This really is how Vegas fine dining used to be, right down to the menus handed to the ladies that have no prices listed. Prediction => 1
The manager was the worst. Prediction => 0
Not worth it. Prediction => 0
Frozen pucks of disgust, with some of the worst people behind the register. Prediction => 0
Omelets are to die for! Prediction => 0
Excellent short film.   Prediction => 1
Great place to eat, reminds me of the little mom and pop shops in the San Francisco Bay Area. Prediction => 1
The camera on the phone may be used as a dustpan when indoors... I'd rather be using a disposable then this. Prediction => 1
I use this product in a motor control center where there is a lot of high voltage humming from the equipment, and it works great! Prediction => 1
GO AND SEE IT!   Prediction => 0
I recently had problems

It dit not work most of the time with my Nokia 5320. Prediction => 0
Rating: 0/10 (Grade: Z) Note: The Show Is So Bad That Even Mother Of The Cast Pull Her Daughter Out Of The Show.   Prediction => 0
This is one I did not have, but I watched it recently on Fox Movie Channel, and was very disappointed.   Prediction => 0
I couldn't take them seriously.   Prediction => 0
Bought mainly for the charger, which broke soon after purchasing. Prediction => 0
The bottowm line...another worthless, cheap gimmick from Sprint. Prediction => 0
Protects the phone on all sides. Prediction => 0
Good show, Samsung. Prediction => 1
Gave up trying to eat any of the crust (teeth still sore). Prediction => 0
Saw the movie today and thought it was a good effort, good messages for kids.   Prediction => 1
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.   Prediction => 0
She was quite disappointed although some blame needs to be placed at her door. Prediction => 0

But when someone strives for greatness and poetry, but delivers a muddled (and often ridiculous) story, a bunch of disparate scenes, pretentious dialogue... Then you get the worst kind of a movie that some other reviewer very accurately defined as pretentious crap".   Prediction => 1
Uncomfortable In the Ear, Don't use with LG VX9900 (EnV). Prediction => 1
Every single character was hilarious and deserved to be called a lead.   Prediction => 1
This case seems well made. Prediction => 1
It'll be a regular stop on my trips to Phoenix! Prediction => 1
This is a good battery, and it got here really fast. Prediction => 1
What a waste. Prediction => 0
Fantastic buy and will get again for whatever my next phone is Prediction => 1
I'd love to go back. Prediction => 1
Thoroughly disappointed! Prediction => 0
The sweet potato fries were very good and seasoned well. Prediction => 1
Cinematography: The film was shot in an interesting way.   Prediction => 1
first time there and might just be the la

In [26]:
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# Note: roc_curve receives hard 0/1 predictions here, so the curve has a single operating point
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, sample_prediction)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC: \t\t\t\t', roc_auc)
print('Accuracy: X_test vs y_test \t\t', pipe.score(X_test, y_test))
print('Accuracy: sample_prediction vs y_test \t', metrics.accuracy_score(y_test, sample_prediction)) # arguments are (y_true, y_pred)

ROC_AUC: 				 0.7638540004752368
Accuracy: X_test vs y_test 		 0.7627365356622998
Accuracy: sample_prediction vs y_test 	 0.7627365356622998
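
The ROC AUC and accuracy above are nearly identical, which is expected when `roc_curve` is given hard class labels: the area then reduces to the mean of the true positive and true negative rates. A sketch with made-up labels and scores (all values are illustrative assumptions) showing how the result differs once continuous scores are used:

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical ground truth and classifier scores (e.g. probabilities of the positive class)
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.4, 0.7, 0.9]

# Thresholding at 0.5 throws away the ranking information inside each class
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)
balanced_acc = balanced_accuracy_score(y_true, hard_labels)

print('AUC from hard labels:', auc_from_labels)
print('Balanced accuracy:   ', balanced_acc)
print('AUC from scores:     ', auc_from_scores)
```

For the pipeline in this notebook, scores could be taken from `pipe.predict_proba(X_test)[:, 1]`, since the SVC was created with `probability = True`.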


## Model Interpretation with Eli5

[TEXT CLASSIFICATION AND ML MODEL INTERPRETATION WITH ELI5,SKLEARN AND SPACY](https://jcharistech.wordpress.com/2019/05/28/text-classification-and-ml-model-interpretation-with-eli5sklearn-and-spacy/)

In [27]:
# Load Interpretation Pkgs
import eli5
from eli5.lime import TextExplainer

In [28]:
pipe.predict_proba

<function sklearn.pipeline.Pipeline.predict_proba(self, X)>

In [40]:
# TextExplainer explains the predictions of black-box text classifiers using the LIME algorithm
exp = TextExplainer(random_state = 42)

X_test.values[0]

'Great pork sandwich.'

In [41]:
exp.fit(X_test.values[0], pipe.predict_proba)

TextExplainer(char_based=False,
       clf=SGDClassifier(alpha=0.001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
       power_t=0.5,
       random_state=<mtrand.RandomState object at 0x000001E036A75CA8>,
       shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
       warm_start=False),
       expand_factor=10, n_samples=5000, position_dependent=False,
       random_state=42, rbf_sigma=None,
       sampler=MaskingTextSamplers(random_state=<mtrand.RandomState object at 0x000001E036A75CA8>,
          sampler_params=None, token_pattern='(?u)\\b\\w+\\b',
          weights=array([0.7, 0.3])),
       token_pattern='(?u)\\b\\w+\\b',
       vec=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='c

In [42]:
y.unique()

array([0, 1], dtype=int64)

In [43]:
target_names = ['Negative','Positive']

exp.show_prediction()

Contribution?,Feature
1.97,Highlighted in text (sum)
-0.088,<BIAS>


In [44]:
exp.show_prediction(target_names = target_names)

Contribution?,Feature
1.97,Highlighted in text (sum)
-0.088,<BIAS>


## TextExplainer: debugging black-box text classifiers

* ‘score’ is an accuracy score weighted by the cosine distance between each generated sample and the original document (i.e. texts which are closer to the example are more important). Accuracy shows how good the ‘top 1’ predictions are.
* ‘mean_KL_divergence’ is the mean Kullback–Leibler divergence over all target classes; it is also weighted by distance. KL divergence shows how well the probabilities are approximated; 0.0 means a perfect match.

https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html
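
For intuition about the KL divergence metric, here is a minimal sketch of the unweighted divergence between two discrete distributions. It is not eli5's implementation (which also weights samples by distance), and the probability values are made up:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length sequences."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class probabilities for one perturbed text
black_box = [0.9, 0.1]    # what the black-box pipeline predicts
good_fit = [0.88, 0.12]   # a surrogate model that tracks it closely
poor_fit = [0.5, 0.5]     # a surrogate model that has learned nothing

print(kl_divergence(black_box, black_box))   # identical distributions
print(kl_divergence(black_box, good_fit))    # small divergence
print(kl_divergence(black_box, poor_fit))    # larger divergence
```

A value close to 0.0, as reported in `exp.metrics_`, indicates that the surrogate classifier reproduces the black-box probabilities well in the neighbourhood of the example.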

In [45]:
exp.metrics_

{'mean_KL_divergence': 0.00073010505221788, 'score': 1.0}

In [46]:
exp.show_weights()

Weight?,Feature
2.004,great
0.165,pork sandwich
0.163,great pork
-0.053,sandwich
-0.088,<BIAS>
-0.309,pork


In [77]:
# Check the vectorizer and classifier used
exp.vec_, exp.clf_

(CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 2), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
         vocabulary=None),
 SGDClassifier(alpha=0.001, average=False, class_weight=None,
        early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
        l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
        n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
        power_t=0.5,
        random_state=<mtrand.RandomState object at 0x000001E0369F33F0>,
        shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
        warm_start=False))

## Second notebook

https://mahadev001.github.io/Mahadev-Upadhyayula/Sentiment%20Analysis%20via%20NLP/Sentiment%20Analysis%20using%20NLP%20with%20Spacy%20and%20%20SVM.html

## Natural Language Processing (NLP)

NLP helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as 
speech recognition or text analytics.  Basic NLP tasks include tokenization and parsing, lemmatization/stemming, part-of-speech 
tagging, language detection and identification of semantic relationships.  NLP tasks break down language into shorter, elemental 
pieces, try to understand relationships between the pieces and explore how the pieces work together to create meaning.

## spaCy

spaCy is used by creating a pipeline and querying the properties of the documents it produces.  A pipeline is created by loading a model. 
The models provided contain information about the language: vocabularies, trained vectors, syntax and entities.  Pipelines output 
document properties such as tokens with their indices, part-of-speech tags, entities, vectors, sentiment, vocabulary, etc:

* **Tokenization**:  Every spaCy document is tokenized; the tokens can be accessed by iterating over the document, and sentences through .sents once sentence boundaries have been set.
* **Part of Speech Tagging**:  Part-of-speech tags are properties of a word that are defined by how the word is used in a grammatically correct sentence.  Tags are used in information filtering, statistical models, and rule-based parsing.
* **Entity Detection**: spaCy can identify entity phrases in the document (e.g., a person, location, organization, dates, numerals, etc), which are accessed through the .ents property.
* **Dependency Parsing**:  A powerful, fast, and accurate syntactic dependency parser that can also be used for sentence boundary detection and phrase chunking. Relations are accessed through properties such as .children, .root, .ancestors, etc.
* **Noun Phrases**:  Dependency trees can also be used to generate noun phrases.
* **Word to Vectors Integration**:   Integration of dense, real-valued vectors representing distributional similarity information. GloVe vectors are used; GloVe is an unsupervised learning algorithm for obtaining vector representations of words.

## Create a sklearn pipeline with components: cleaner/tokenizer, vectorizer, classifier

For tokenizer and vectorizer we build our own custom modules using spaCy:

• The 'predictors' class is used as the first stage of the pipeline.  It is derived from the TransformerMixin class 
  and performs the cleaning of the data.  <span style="color:red">**It is omitted in [NLP Spacy prac - yelp reviews](https://www.kaggle.com/abhideshmukh1/nlp-spacy-prac-yelp-reviews/notebook), and does not really appear to be necessary**</span><br>
• The second stage of the pipeline vectorizes the cleaned data.  Tokens are lemmatized and filtered for stop words and 
  punctuation by spacy_tokenizer(); pronouns keep their lowercased surface form.  CountVectorizer and TfidfVectorizer 
  are employed in turn to decide which is better.<br>
• The third stage of the pipeline defines the classifier.  A linear support vector machine classifier (LinearSVC) was chosen

A pipeline component is typically a class derived from TransformerMixin that provides three methods:
* fit - Uses the input data to train the transformer.  Used by components such as the CountVectorizer to set up the internal mapping from words to vector elements.  It receives both the features and the expected output.
* transform - Takes the input features and transforms them.  Transformers are not allowed to add records to or remove records from the input dataset.
* get_params - Returns the component's parameters as a dictionary.  It is needed for cloning and for parameter searches such as GridSearchCV, and is normally inherited from BaseEstimator.
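
As a sketch of the three methods, the hypothetical `TextCleaner` below (its name and `lowercase` parameter are assumptions, not part of the notebook) inherits `get_params` from `BaseEstimator` and `fit_transform` from `TransformerMixin`, so only `fit` and `transform` need to be written:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TextCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical cleaner: strips whitespace and optionally lowercases each text."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets the pipeline chain the calls
        return self

    def transform(self, X):
        # One output per input record: records are neither added nor removed
        cleaned = [text.strip() for text in X]
        return [text.lower() for text in cleaned] if self.lowercase else cleaned

cleaner = TextCleaner()
print(cleaner.get_params())                                  # constructor parameters as a dict
print(cleaner.fit_transform(['  Great pork sandwich.  ']))   # fit followed by transform
```

Because `get_params` reflects the constructor arguments, a grid search could tune `lowercase` like any other hyperparameter.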

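Use the pipeline to perform a grid search, which lets the computer automatically pick the best hyperparameters for your algorithm from the candidates you supply.

```
from sklearn.model_selection import GridSearchCV

# param_grid is a list of dictionaries, so more than one scenario with different parameter sets can be tested.
# Keys have the form '<step name>__<parameter>': 'classify' must be the name of a pipeline step whose 
# estimator has a max_depth parameter (for example, a tree-based classifier)
param_grid = [{ 'classify__max_depth': [5, 10, 15, 20] }]

grid_search = GridSearchCV(pipeline, param_grid = param_grid)
grid_search.fit(features, labels)
```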
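
A self-contained variant of the grid search is sketched here. The toy corpus, the step names and the candidate values for `C` are assumptions chosen for illustration, and the default word tokenizer stands in for the spaCy one:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus: 1 = positive, 0 = negative
texts = ['great food', 'loved the movie', 'excellent phone', 'very good service',
         'terrible food', 'hated the movie', 'awful phone', 'very bad service']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([('vectorizer', CountVectorizer()),
                     ('classifier', LinearSVC())])

# 'classifier__C' addresses the C parameter of the step named 'classifier'
param_grid = [{'classifier__C': [0.1, 1, 10]}]

# cv = 2 keeps the folds large enough for such a small corpus
grid_search = GridSearchCV(pipeline, param_grid = param_grid, cv = 2)
grid_search.fit(texts, labels)

print(grid_search.best_params_)
print(grid_search.predict(['good movie']))
```

With the notebook's data, `texts` and `labels` would be replaced by `X_train` and `y_train`, and the best pipeline would then be scored on the held-out test set.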

In [50]:
from sklearn.base import TransformerMixin

# Custom (pipeline component) transformer using spaCy 
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic utility function to clean the text 
def clean_text(text):     
    return text.strip().lower()

## Create a custom tokenizer function using spacy parser and some basic cleaning

The text features can be replaced with word vectors (especially beneficial in deep learning models)

In [51]:
# Create spacy tokenizer that accepts a sentence as input and processes the sentence into tokens, performing lemmatization, 
# lowercasing, and removing stop words and punctuation 
def spacy_tokenizer(sentence):
    mytokens = nlp_parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != '-PRON-' else word.lower_ for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    return mytokens

# Create vectorizer object to generate feature vectors, using custom spaCy tokenizer
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1))
classifier = LinearSVC()

In [79]:
ex1 = 'He was walking with the walker in the Wall; he may had sat and run with the runner, too'
spacy_tokenizer(ex1)

['walk', 'walker', 'wall', 'sit', 'run', 'runner']

In [52]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

# Features and labels
X = df['Message']
y = df['Target']

# Split into test and training datasets prior to feeding into the ML pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [80]:
X_train.shape

(2196,)

In [81]:
X_train[:10]

Amazon  480    Verizon's bills, however, are difficult to und...
IMDB    737    The opening sequence of this gem is a classic,...
        19     This if the first movie I've given a 10 to in ...
Yelp    891    Pricing is a bit of a concern at Mellow Mushroom.
Amazon  460                  We would recommend these to others.
        772               The reception has been generally good.
Yelp    847    To my disbelief, each dish qualified as the wo...
Amazon  218                I'm still infatuated with this phone.
IMDB    616    But this understated film leaves a lasting imp...
Amazon  120                  Could not get strong enough signal.
Name: Message, dtype: object

## Creating a Pipeline and Generating the Model

Create a pipeline with three components: a cleaner, a vectorizer, and a classifier. 
* The cleaner uses our predictors class to clean and preprocess the text
* The vectorizer uses a CountVectorizer object to create the bag-of-words matrix for our text
* The classifier is a LinearSVC model that classifies the sentiment

Once this pipeline is built, we’ll fit the pipeline components using fit().

In [53]:
# Create the pipeline to clean, tokenize, vectorize, and classify using CountVectorizer
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])
# Fit our data
pipe.fit(X_train,y_train)

# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample, pred) in zip(X_test, sample_prediction):
    print(sample, 'Prediction =>', pred)

Great pork sandwich. Prediction => 1
It is a true classic.   Prediction => 0
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction => 0
Audio Quality is poor, very poor. Prediction => 0
We loved the biscuits!!! Prediction => 1
I don't have very many words to say about this place, but it does everything pretty well. Prediction => 0
Was not happy. Prediction => 1
The headsets are easy to use and everyone loves them. Prediction => 1
I miss it and wish they had one in Philadelphia! Prediction => 0
Still it's quite interesting and entertaining to follow.   Prediction => 1
All three broke within two months of use. Prediction => 0
Oh yeah, and the storyline was pathetic too.   Prediction => 0
IT'S REALLY EASY. Prediction => 1
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction => 0
The food was outstanding and the prices were very reasonable. Prediction => 1
I am so tired of clich

He owns the film, just as Spacek owned Coal Miner's Daughter" and Quaid owned "Great Balls of Fire.   Prediction => 1
A standout scene.   Prediction => 0
Worst hour and a half of my life!Oh my gosh!   Prediction => 0
They had a toro tartare with a cavier that was extraordinary and I liked the thinly sliced wagyu with white truffle. Prediction => 1
Also, it's a real treat to see Anthony Quinn playing Crazy Horse.   Prediction => 1
I don't each much pasta, but I love the homemade /hand made pastas and thin pizzas here. Prediction => 1
I am very impressed with the job that Motorola did on the sturdiness of this phone. Prediction => 0
I received my headset in good time and was happy with it. Prediction => 1
Would recommend this item. Prediction => 0
We definately enjoyed ourselves. Prediction => 1
very good product, well made. Prediction => 1
The writer, Gorman Bechard, undoubtedly did his homework because all references are industry and character-age appropriate.   Prediction => 1
I like 

Also were served hot bread and butter, and home made potato chips with bacon bits on top....very original and very good. Prediction => 1
There was hardly any meat. Prediction => 0
This really is how Vegas fine dining used to be, right down to the menus handed to the ladies that have no prices listed. Prediction => 1
The manager was the worst. Prediction => 0
Not worth it. Prediction => 0
Frozen pucks of disgust, with some of the worst people behind the register. Prediction => 0
Omelets are to die for! Prediction => 0
Excellent short film.   Prediction => 1
Great place to eat, reminds me of the little mom and pop shops in the San Francisco Bay Area. Prediction => 1
The camera on the phone may be used as a dustpan when indoors... I'd rather be using a disposable then this. Prediction => 0
I use this product in a motor control center where there is a lot of high voltage humming from the equipment, and it works great! Prediction => 1
GO AND SEE IT!   Prediction => 0
I recently had problems

It dit not work most of the time with my Nokia 5320. Prediction => 0
Rating: 0/10 (Grade: Z) Note: The Show Is So Bad That Even Mother Of The Cast Pull Her Daughter Out Of The Show.   Prediction => 0
This is one I did not have, but I watched it recently on Fox Movie Channel, and was very disappointed.   Prediction => 0
I couldn't take them seriously.   Prediction => 0
Bought mainly for the charger, which broke soon after purchasing. Prediction => 0
The bottowm line...another worthless, cheap gimmick from Sprint. Prediction => 0
Protects the phone on all sides. Prediction => 0
Good show, Samsung. Prediction => 1
Gave up trying to eat any of the crust (teeth still sore). Prediction => 0
Saw the movie today and thought it was a good effort, good messages for kids.   Prediction => 1
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.   Prediction => 0
She was quite disappointed although some blame needs to be placed at her door. Prediction => 0

In [82]:
X_test.shape

(549,)

In [85]:
X_test.values[0]

'Great pork sandwich.'

In [86]:
sample_prediction

array([1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,

In [88]:
# Accuracy
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, sample_prediction)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC: \t\t\t', roc_auc)
print('Accuracy: X_test vs y_test \t', pipe.score(X_test, y_test), ' *****')
print('Accuracy: X_train vs y_train \t', pipe.score(X_train, y_train))

ROC_AUC: 			 0.7985184693064045
Accuracy: X_test vs y_test 	 0.7978142076502732  *****
Accuracy: X_train vs y_train 	 0.9849726775956285


In [55]:
# Another random review
pipe.predict(['This was a great movie'])

array([1], dtype=int64)

In [56]:
example = ['I do enjoy my job', 'What a poor product!, I will have to get a new one', 'I feel amazing!']
pipe.predict(example)

array([1, 0, 1], dtype=int64)

## Text Classification With Machine Learning and spaCy - Using TF-IDF

https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/Text%20Classification%20With%20Machine%20Learning,SpaCy,Sklearn(Sentiment%20Analysis)/Text%20Classification%20&%20Sentiment%20Analysis%20with%20SpaCy,Sklearn.ipynb

In [57]:
from sklearn.ensemble import RandomForestClassifier

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

classifier = RandomForestClassifier(n_jobs = -1, max_depth = 6, n_estimators = 10)
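
TF-IDF can be reached in one step with `TfidfVectorizer` or in two steps by counting first and reweighting afterwards. The sketch below, on a made-up corpus and with the default tokenizer rather than `spacy_tokenizer`, checks that the two routes give the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical mini corpus
corpus = ['great pork sandwich',
          'the sandwich was not great',
          'audio quality is poor']

# Route 1: raw term counts, then TF-IDF reweighting
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two_step = two_step.fit_transform(corpus)

# Route 2: TfidfVectorizer performs both steps at once
X_one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print(X_one_step.shape, same)
```

Either route can be used as the `'vectorizer'` stage of the pipeline.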

In [58]:
# Create the  pipeline to clean, tokenize, vectorize, and classify 
pipe_tfid = Pipeline([("cleaner", predictors()),
                 ('vectorizer', tfvectorizer),
                 ('classifier', classifier)])

In [59]:
pipe_tfid.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x000001E033D2F8D0>), ('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ..._jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [60]:
sample_prediction_tfid = pipe_tfid.predict(X_test)

In [61]:
for (sample, pred) in zip(X_test, sample_prediction_tfid):
    print(sample, 'Prediction=>', pred)

Great pork sandwich. Prediction=> 0
It is a true classic.   Prediction=> 0
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction=> 0
Audio Quality is poor, very poor. Prediction=> 0
We loved the biscuits!!! Prediction=> 1
I don't have very many words to say about this place, but it does everything pretty well. Prediction=> 1
Was not happy. Prediction=> 1
The headsets are easy to use and everyone loves them. Prediction=> 1
I miss it and wish they had one in Philadelphia! Prediction=> 0
Still it's quite interesting and entertaining to follow.   Prediction=> 0
All three broke within two months of use. Prediction=> 1
Oh yeah, and the storyline was pathetic too.   Prediction=> 0
IT'S REALLY EASY. Prediction=> 0
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction=> 0
The food was outstanding and the prices were very reasonable. Prediction=> 0
I am so tired of clichés that is just

He owns the film, just as Spacek owned Coal Miner's Daughter" and Quaid owned "Great Balls of Fire.   Prediction=> 1
A standout scene.   Prediction=> 0
Worst hour and a half of my life!Oh my gosh!   Prediction=> 0
They had a toro tartare with a cavier that was extraordinary and I liked the thinly sliced wagyu with white truffle. Prediction=> 1
Also, it's a real treat to see Anthony Quinn playing Crazy Horse.   Prediction=> 0
I don't each much pasta, but I love the homemade /hand made pastas and thin pizzas here. Prediction=> 1
I am very impressed with the job that Motorola did on the sturdiness of this phone. Prediction=> 1
I received my headset in good time and was happy with it. Prediction=> 1
Would recommend this item. Prediction=> 0
We definately enjoyed ourselves. Prediction=> 1
very good product, well made. Prediction=> 1
The writer, Gorman Bechard, undoubtedly did his homework because all references are industry and character-age appropriate.   Prediction=> 0
I like design and l

Not worth it. Prediction=> 0
Frozen pucks of disgust, with some of the worst people behind the register. Prediction=> 0
Omelets are to die for! Prediction=> 0
Excellent short film.   Prediction=> 1
Great place to eat, reminds me of the little mom and pop shops in the San Francisco Bay Area. Prediction=> 0
The camera on the phone may be used as a dustpan when indoors... I'd rather be using a disposable then this. Prediction=> 1
I use this product in a motor control center where there is a lot of high voltage humming from the equipment, and it works great! Prediction=> 0
GO AND SEE IT!   Prediction=> 0
I recently had problems where I could not stay connected for more than 10 minutes before being disconnected. Prediction=> 1
You get what you pay for I guess. Prediction=> 0
After charging overnight, these batteries work great. Prediction=> 0
They really want to make your experience a good one. Prediction=> 1
It was horrendous.   Prediction=> 0
I can hear while I'm driving in the car, and u

Good show, Samsung. Prediction=> 1
Gave up trying to eat any of the crust (teeth still sore). Prediction=> 0
Saw the movie today and thought it was a good effort, good messages for kids.   Prediction=> 1
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.   Prediction=> 0
She was quite disappointed although some blame needs to be placed at her door. Prediction=> 0
Fantastic earphones. Prediction=> 0
disappointing. Prediction=> 0
Special mention should be made of the superb music score and sound effects, which are an integral element in helping to make this such a memorable and enjoyable cartoon.   Prediction=> 0
Lewis Black's considerable talent is wasted here too, as he is at his most incendiary when he is unrestrained, which the PG-13 rating certainly won't allow.   Prediction=> 0
Still, it makes up for all of this with a super ending that depicts a great sea vessel being taken out by the mighty frost.   Prediction=> 0
This phone works gr

## Compute Area Under the Receiver Operating Characteristic (ROC) Curve

An ROC curve is a plot that shows the diagnostic ability of a binary classifier as its discrimination threshold is varied.
It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
The TPR is also known as sensitivity, recall, or probability of detection. The FPR is also known as the fall-out or
probability of false alarm and is calculated as (1 − specificity). Because the cells in this notebook pass hard 0/1
predictions rather than scores to roc_curve, each curve has a single interior point, and the reported AUC equals the
average of sensitivity and specificity.
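
To make the threshold sweep concrete, here is a minimal pure-Python sketch of how ROC points and the area under them can be computed. The labels and scores are made up for illustration and are not taken from this notebook's data; `roc_points` and `trapezoid_auc` are hypothetical helper names, not scikit-learn functions. The sketch also compares scoring with continuous scores against scoring with hard 0/1 predictions.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, strictest first."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: three positive and three negative reviews
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]          # e.g. predicted P(positive)
hard = [1 if s >= 0.5 else 0 for s in scores]    # thresholded 0/1 predictions

auc_scores = trapezoid_auc(roc_points(y_true, scores))
auc_hard = trapezoid_auc(roc_points(y_true, hard))

print("AUC from scores:          ", round(auc_scores, 4))
print("AUC from hard predictions:", round(auc_hard, 4))
```

With hard predictions the curve collapses to a single interior point, so the area reduces to (TPR + (1 − FPR)) / 2. If a classifier exposes probabilities or decision scores, passing those instead would give a fuller curve.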

In [62]:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, sample_prediction_tfid)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC: \t\t\t', roc_auc)

# Accuracy
print('Accuracy: X_test vs y_test \t', pipe_tfid.score(X_test, y_test))
print('Accuracy: X_train vs y_train \t', pipe_tfid.score(X_train, y_train))

ROC_AUC: 			 0.6288865267074143
Accuracy: X_test vs y_test 	 0.6193078324225865
Accuracy: X_train vs y_train 	 0.6739526411657559


| LinearSVC Results | Scored | Result |
|-----|-----|-----|
| ROC_AUC | * | 0.7986446983789529 |
| Accuracy | X_test vs y_test | 0.7978142076502732 |
| Accuracy | X_train vs y_train | 0.9795081967213115 |

| RandomForest Results | Scored | Result |
|-----|-----|-----|
| ROC_AUC | * | 0.6810789263885197 |
| Accuracy | X_test vs y_test | 0.6721311475409836 |
| Accuracy | X_train vs y_train | 0.6780510018214936 |
 


In [63]:
# Another random review
pipe_tfid.predict(['This was a great movie'])

array([0], dtype=int64)

In [64]:
example = ['I do enjoy my job', 'What a poor product!  I will ahve to get a new one', 'I feel amazing']
pipe_tfid.predict(example)

array([1, 0, 0], dtype=int64)

## Model Interpretation with Eli5

[TEXT CLASSIFICATION AND ML MODEL INTERPRETATION WITH ELI5,SKLEARN AND SPACY](https://jcharistech.wordpress.com/2019/05/28/text-classification-and-ml-model-interpretation-with-eli5sklearn-and-spacy/)

In [66]:
pipe_tfid.predict_proba

<function sklearn.pipeline.Pipeline.predict_proba(self, X)>

In [67]:
from eli5.lime import TextExplainer

# TextExplainer explains the predictions of black-box text classifiers using the LIME algorithm
exp = TextExplainer(random_state = 12)

X_test.values[1]

'It is a true classic.  '

In [69]:
exp.fit(X_test.values[1], pipe_tfid.predict_proba)

TextExplainer(char_based=False,
       clf=SGDClassifier(alpha=0.001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
       power_t=0.5,
       random_state=<mtrand.RandomState object at 0x000001E01F56D318>,
       shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
       warm_start=False),
       expand_factor=10, n_samples=5000, position_dependent=False,
       random_state=12, rbf_sigma=None,
       sampler=MaskingTextSamplers(random_state=<mtrand.RandomState object at 0x000001E01F56D318>,
          sampler_params=None, token_pattern='(?u)\\b\\w+\\b',
          weights=array([0.7, 0.3])),
       token_pattern='(?u)\\b\\w+\\b',
       vec=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='c

In [70]:
y.unique()

array([0, 1], dtype=int64)

In [71]:
target_names = ['Negative','Positive']

exp.show_prediction()

Contribution?,Feature
0.094,Highlighted in text (sum)
0.022,<BIAS>


In [72]:
exp.show_prediction(target_names = target_names)

Contribution?,Feature
0.094,Highlighted in text (sum)
0.022,<BIAS>


In [73]:
exp.metrics_

{'mean_KL_divergence': 0.00017557432380346498, 'score': 1.0}

In [74]:
exp.show_weights()

Weight?,Feature
-0.022,<BIAS>
-0.027,true classic
-0.067,it is


In [78]:
# Check the vectorizer and classifier used by the explainer
exp.vec_, exp.clf_

(CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 2), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
         vocabulary=None),
 SGDClassifier(alpha=0.001, average=False, class_weight=None,
        early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
        l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
        n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
        power_t=0.5,
        random_state=<mtrand.RandomState object at 0x000001E0369F33F0>,
        shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
        warm_start=False))
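
The output above shows that the explainer fits its own simple model (a CountVectorizer plus an SGDClassifier) to imitate the black-box pipeline around one review. The following is a simplified, pure-Python sketch of that LIME-style idea, not eli5's implementation: it randomly drops words, queries a black box, and fits an unweighted linear surrogate by gradient descent. Real LIME also weights samples by their similarity to the original text, which is omitted here. `toy_black_box` is a hypothetical stand-in for `pipe_tfid.predict_proba`, and `explain` is an illustrative name.

```python
import random


def toy_black_box(text):
    """Hypothetical stand-in for a classifier: returns P(positive) for one text."""
    words = text.lower().split()
    score = 0.5
    if "great" in words:
        score += 0.4
    if "poor" in words:
        score -= 0.4
    return score


def explain(text, predict, n_samples=500, seed=12, epochs=300, lr=0.5):
    """Fit a local linear surrogate on word-presence features of perturbed texts."""
    rng = random.Random(seed)
    tokens = text.split()

    # 1. Perturb: keep each word with probability 0.5 and query the black box
    masks, targets = [], []
    for _ in range(n_samples):
        mask = [1 if rng.random() < 0.5 else 0 for _ in tokens]
        kept = " ".join(t for t, m in zip(tokens, mask) if m)
        masks.append(mask)
        targets.append(predict(kept))

    # 2. Fit bias + one weight per word by gradient descent on mean squared error
    weights = [0.0] * len(tokens)
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(tokens)
        grad_b = 0.0
        for mask, target in zip(masks, targets):
            err = bias + sum(w * m for w, m in zip(weights, mask)) - target
            grad_b += err
            for i, m in enumerate(mask):
                grad_w[i] += err * m
        bias -= lr * grad_b / n_samples
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]

    return list(zip(tokens, weights)), bias


explanation, bias = explain("This was a great movie", toy_black_box)
for token, weight in explanation:
    print(f"{token:<8} {weight:+.3f}")
print(f"{'<BIAS>':<8} {bias:+.3f}")
```

In this toy setup the surrogate should assign nearly all of the weight to "great", since that is the only word in the sentence the stand-in black box reacts to.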

## Custom Function

In [None]:
def executeModel(classifier, modelName):
    """Fit a cleaner -> TF-IDF -> classifier pipeline and print ROC AUC and accuracy."""
    # Create the pipeline to clean, tokenize, vectorize, and classify
    pipe_tfid = Pipeline([('cleaner', predictors()),
                          ('vectorizer', tfvectorizer),
                          ('classifier', classifier)])
    pipe_tfid.fit(X_train, y_train)
    sample_prediction_tfid = pipe_tfid.predict(X_test)
    print('Results for', modelName, ':')
    # ROC AUC computed from the hard 0/1 predictions on the test set
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, sample_prediction_tfid)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print('\tROC_AUC: \t\t\t', roc_auc)
    # Accuracy on the test set and on the training set
    print('\tAccuracy: X_test vs y_test \t', pipe_tfid.score(X_test, y_test))
    print('\tAccuracy: X_train vs y_train \t', pipe_tfid.score(X_train, y_train))
    print('-' * 70)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB

names = ["Logistic Regression", "Nearest Neighbors", "Decision Tree", "Random Forest", "AdaBoost 50", "Sigmoid SVM", 
         "AdaBoost 10", "GradientBoost", "MultinomialNB", "LinearSVC"]

classifiers = [ 
    LogisticRegression(solver = 'lbfgs', max_iter = 100, C = 1e9),
    KNeighborsClassifier(n_neighbors = 14),
    DecisionTreeClassifier(max_depth = 6, min_samples_split = 2, random_state = 0),
    RandomForestClassifier(n_jobs = -1, max_depth = 6, n_estimators = 11, max_features = 1),
    AdaBoostClassifier(),
    SVC(kernel = "sigmoid", gamma = 'auto', C = 0.025),
    AdaBoostClassifier(n_estimators = 10),
    GradientBoostingClassifier(n_estimators = 10, learning_rate = 1.0, max_depth = 1, random_state = 0),
    MultinomialNB(),
    LinearSVC()]

# Evaluate the classifiers
# zip pairs the two lists element by element, yielding (name, classifier) tuples; no dictionary is created
for modelName, clf in zip(names, classifiers):
    executeModel(clf, modelName)
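
Since executeModel only prints its metrics, comparing models means reading through the printed output. As a small sketch of how zip behaves and how collected scores could be ranked, assume each model's test accuracy had been gathered into a list. The scores below are made-up placeholders, not results from this notebook.

```python
# Model names from the notebook; the scores are made-up placeholders,
# NOT results produced by this notebook.
names = ["Logistic Regression", "MultinomialNB", "LinearSVC"]
placeholder_scores = [0.10, 0.30, 0.20]

# zip yields (name, score) tuples, one per position
pairs = list(zip(names, placeholder_scores))

# A dictionary appears only if dict() is called on the pairs explicitly
lookup = dict(zip(names, placeholder_scores))

# Rank the models by score, best first
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20} {score:.2f}")
```

Returning the metrics from executeModel instead of printing them would allow the same ranking to be applied to the real results.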