# Text Classification Project

### Approach
* Read in a collection of documents - a *corpus*
* Transform text into numerical vector data using a pipeline
* Create a classifier
* Fit/train the classifier
* Test the classifier on new data
* Evaluate performance

For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/

**We will try to predict the positive/negative labels from the text content alone.**

In [1]:
#Importing libraries
import numpy as np
import pandas as pd

In [2]:
#Read the tab-separated file
df = pd.read_csv('moviereviews.tsv',sep='\t')

In [3]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [4]:
#length of dataframe
len(df)

2000

### Dropping NULL Records

In [5]:
#Checking null values
df.isnull().sum()

label      0
review    35
dtype: int64

In [6]:
#Dropping missing values
df.dropna(inplace=True)

In [7]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


**Detecting and Removing Empty Strings**

Some reviews are not `NaN` but contain nothing except whitespace. To detect these strings we need to iterate over each row in the DataFrame. The `.itertuples()` pandas method is a good tool for this as it provides access to every field. For brevity, we'll assign the names `i`, `lb` and `rv` to the `index`, `label` and `review` columns.

In [8]:
#Collect the index of every review that is only whitespace
blanks = []

#Iterate over the dataframe rows as (index, label, review) tuples
for i,lb,rv in df.itertuples(): 
    #skip any non-string values such as NaN
    if isinstance(rv, str):
        #the review is a string made up only of whitespace
        if rv.isspace():
            #record the index of the blank review
            blanks.append(i)
            
print(len(blanks) , 'blanks:',blanks)

27 blanks: [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
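The loop above can also be written as a boolean mask. The sketch below is one possible alternative, not the notebook's method; the `toy` frame and the `is_blank` helper are hypothetical names introduced only for illustration. It also covers truly empty strings (`''`), which `str.isspace()` does not flag.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the review data
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos', 'neg'],
    'review': ['bad film', ' ', np.nan, '', 'dull   plot'],
})

def is_blank(value):
    """Return True for strings that are empty or contain only whitespace."""
    return isinstance(value, str) and value.strip() == ''

# Boolean mask: True where the review is a blank string (NaN gives False)
blank_mask = toy['review'].apply(is_blank)
blanks = toy.index[blank_mask.to_numpy()].tolist()
print(blanks)

# Drop NaN rows first, then the blank rows, mirroring the cells above
cleaned = toy.dropna().drop(blanks)
print(cleaned)
```

Since `pd.read_csv` normally reads an empty field as `NaN`, the empty-string case may never occur in this dataset; the extra check only matters if the data comes from another source.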


In [9]:
#Dropping the rows whose review is only whitespace
df.drop(blanks, inplace=True)

In [10]:
#Length of dataframe
len(df)

1938

### `Label` Column value counts

In [11]:
df['label'].value_counts()

#Dataset is balanced

neg    969
pos    969
Name: label, dtype: int64

## Model Building

### Splitting data into train and test

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
#Creating X and y
X = df['review']
y = df['label']

In [14]:
X_train , X_test , y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
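A plain random split does not guarantee that the test set keeps the 50/50 class balance of the full dataset (the classification report further down shows 188 `neg` and 200 `pos` test reviews). If equal proportions matter, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on a hypothetical 20-item toy corpus, not a change to the notebook's split:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 negative and 10 positive "reviews"
texts = [f'review {n}' for n in range(20)]
labels = ['neg'] * 10 + ['pos'] * 10

# stratify=labels asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

test_counts = Counter(y_te)
print(test_counts)
```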

### Creating a Pipeline

In [15]:
#Pipeline
from sklearn.pipeline import Pipeline

In [16]:
#Tfidf Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Naive Bayes Classifier

In [18]:
#Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

In [19]:
# Classification Model (Naive Bayes):
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

### Fitting the model

In [20]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
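To make clear what the pipeline does, the sketch below compares it with the equivalent manual steps: fit the vectorizer on the training text, train the classifier on the resulting matrix, then transform (not re-fit) new text before predicting. The toy documents and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the movie reviews
train_docs = ['great fun film', 'wonderful acting', 'boring dull film', 'terrible acting']
train_labels = ['pos', 'pos', 'neg', 'neg']
new_docs = ['great acting', 'dull film']

# Pipeline version: one object handles vectorizing and classifying
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(new_docs)

# Manual version of the same two steps
vec = TfidfVectorizer()
X_train_vec = vec.fit_transform(train_docs)       # learn vocabulary and IDF weights
clf = MultinomialNB().fit(X_train_vec, train_labels)
manual_pred = clf.predict(vec.transform(new_docs))  # reuse the fitted vocabulary

print(list(pipe_pred), list(manual_pred))
```

Both versions should give the same predictions; the pipeline simply removes the risk of forgetting the `transform` step on new data.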

### Run predictions

In [21]:
#Predicting on the test set
predictions = text_clf_nb.predict(X_test)

In [22]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[174  14]
 [ 65 135]]


In [23]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.73      0.93      0.81       188
         pos       0.91      0.68      0.77       200

    accuracy                           0.80       388
   macro avg       0.82      0.80      0.79       388
weighted avg       0.82      0.80      0.79       388
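The figures in the report can be derived by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, both in the order `neg`, `pos`; the variable names are illustrative.

```python
# Confusion matrix copied from the output above
# rows = true label, columns = predicted label, order = (neg, pos)
cm = [[174, 14],
      [65, 135]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(labels))) / total

report = {}
for k, name in enumerate(labels):
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    truly_k = sum(cm[k])                        # row sum (the "support")
    precision = cm[k][k] / predicted_as_k
    recall = cm[k][k] / truly_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = {'precision': precision, 'recall': recall,
                    'f1': f1, 'support': truly_k}

for name, row in report.items():
    print(name, {key: round(val, 3) for key, val in row.items()})
print('accuracy', accuracy)
```

Read this way, the low recall for `pos` comes from the 65 positive reviews that the model labelled as negative.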



In [24]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7963917525773195


### Predicting on New Data

In [25]:
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."

In [26]:
print(text_clf_nb.predict([myreview]))

['neg']


You can also use the stopword list from NLTK, and you can add a lemmatizer as a further preprocessing step.

### Adding Stopwords to CountVectorizer
By default, **CountVectorizer** and **TfidfVectorizer** do *not* filter stopwords. However, they offer some optional settings, including passing in your own stopword list.
<div class="alert alert-info" style="margin: 20px">CAUTION: There are some [known issues](http://aclweb.org/anthology/W18-2502) using Scikit-learn's built-in stopwords list. Some words that are filtered may in fact aid in classification. We'll pass in our own stopword list, so that we know exactly what's being filtered.</div>

The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class accepts the following arguments:
> *CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, **stop_words=None**, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)*

[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) supports the same arguments and more. Under *stop_words* we have the following options:
> stop_words : *string {'english'}, list, or None (default)*

That is, we can run `TfidfVectorizer(stop_words='english')` to accept scikit-learn's built-in list,<br>
or `TfidfVectorizer(stop_words=['a', 'and', 'the'])` to filter these three words. In practice we would assign our list to a variable and pass that in instead.
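As a small illustration of the `stop_words` option, the sketch below fits two vectorizers on a hypothetical two-sentence corpus and compares the learned vocabularies. Note that the default `token_pattern` already ignores one-character tokens such as `a`, so only `and` and `the` are actually removed by the list here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = ['the cat and a dog', 'a bird and the fish']

default_vec = TfidfVectorizer().fit(docs)
filtered_vec = TfidfVectorizer(stop_words=['a', 'and', 'the']).fit(docs)

# vocabulary_ maps each retained token to its column index
print(sorted(default_vec.vocabulary_))
print(sorted(filtered_vec.vocabulary_))
```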

Scikit-learn's built-in list contains 318 stopwords:
> <pre>from sklearn.feature_extraction import text
> print(sorted(text.ENGLISH_STOP_WORDS))</pre>
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']

However, there are words in this list that may influence a classification of movie reviews. With this in mind, let's trim the list to just 60 words:

In [27]:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can', \
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his', \
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or', \
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this', \
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']
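One way to see why the trimmed list may be safer for sentiment data is to check a few negation words against both lists. The sketch below is an illustration only: the `negations` sample is a hypothetical hand-picked set, and `custom_stopwords` repeats the 60 words defined above so the block runs on its own.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The 60-word list from the cell above, repeated so this block is self-contained
custom_stopwords = [
    'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
    'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
    'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
    'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
    'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

# Hypothetical sample of words that can flip the sentiment of a sentence
negations = ['no', 'not', 'never', 'nothing', 'cannot']

in_builtin = [w for w in negations if w in ENGLISH_STOP_WORDS]
in_custom = [w for w in negations if w in custom_stopwords]

print(len(custom_stopwords))
print('filtered by built-in list:', in_builtin)
print('filtered by custom list:  ', in_custom)
```

All five negations appear in the built-in list shown earlier, while the custom list leaves them in the text for the classifier to use.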

**Now let's repeat the process above and see if the removal of stopwords improves or impairs our score.**

In [28]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')
df.dropna(inplace=True)
blanks = []
for i,lb,rv in df.itertuples():
    if type(rv)==str:
        if rv.isspace():
            blanks.append(i)
df.drop(blanks, inplace=True)
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [29]:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                     ('clf', MultinomialNB()),
])
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('clf', MultinomialNB())])

In [31]:
predictions = text_clf_nb.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))

[[282  26]
 [105 227]]


In [32]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.73      0.92      0.81       308
         pos       0.90      0.68      0.78       332

    accuracy                           0.80       640
   macro avg       0.81      0.80      0.79       640
weighted avg       0.82      0.80      0.79       640



In [33]:
print(metrics.accuracy_score(y_test,predictions))

0.7953125


#### Our score didn't change much.
We went from 79.6% without filtering stopwords to 79.5% after adding a stopword filter to our pipeline. The two runs are not strictly comparable, though: the first held out 20% of the data for testing (388 reviews), while the second held out 33% (640 reviews). Keep in mind that 2000 movie reviews is a relatively small dataset. The real gain from stripping stopwords is improved processing speed; depending on the size of the corpus, it might save hours.
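Because the two runs above used different `test_size` values (0.2 and 0.33), a cleaner comparison would score both pipelines on the same split. The sketch below shows one way to do that; `score_with` is a hypothetical helper and the corpus is toy data, so the printed numbers say nothing about the movie review dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def score_with(stop_words, X_train, X_test, y_train, y_test):
    """Fit TF-IDF + naive Bayes with the given stop_words; return test accuracy."""
    pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)),
                     ('clf', MultinomialNB())])
    pipe.fit(X_train, y_train)
    return accuracy_score(y_test, pipe.predict(X_test))

# Hypothetical toy corpus, repeated to give the split something to work with
pos_docs = ['a great and moving film', 'the acting was wonderful',
            'i loved this movie', 'a fun and clever story']
neg_docs = ['a dull and boring film', 'the acting was terrible',
            'i hated this movie', 'a slow and clumsy story']
docs = pos_docs * 5 + neg_docs * 5
labels = ['pos'] * 20 + ['neg'] * 20

# One split, reused for both models
split = train_test_split(docs, labels, test_size=0.2, random_state=42)

baseline = score_with(None, *split)
filtered = score_with(['a', 'and', 'the', 'was', 'this', 'i'], *split)
print(baseline, filtered)
```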

## Feed new data into a trained model
Once we've developed a fairly accurate model, it's time to feed new data through it. 

In [35]:
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."

In [36]:
print(text_clf_nb.predict([myreview]))  # be sure to put "myreview" inside square brackets

['neg']
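The square brackets matter because the pipeline expects an iterable of documents, not a single string. The sketch below, on hypothetical toy training data, shows that a bare string is rejected (a `ValueError` in the scikit-learn versions I am aware of) and also prints class probabilities through `predict_proba`, which `MultinomialNB` supports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data standing in for the movie reviews
train_docs = ['great fun film', 'wonderful acting', 'boring dull film', 'terrible acting']
train_labels = ['pos', 'pos', 'neg', 'neg']

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_docs, train_labels)

review = 'the execution was terrible'

# Passing the bare string: the vectorizer refuses it
try:
    pipe.predict(review)
    bare_string_rejected = False
except ValueError:
    bare_string_rejected = True
print('bare string rejected:', bare_string_rejected)

# Passing a one-element list works
label = pipe.predict([review])[0]
proba = pipe.predict_proba([review])[0]
print('predicted label:', label)
for name, p in zip(pipe.classes_, proba):
    print(f'{name}: {p:.3f}')
```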


In this project we built a text classification pipeline in scikit-learn, applied a naive Bayes classifier, handled stopwords, and tested the fitted model on new data.

# Text Classification Assessment - Solution
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`. 

We've included 20 reviews that contain either `NaN` data, or have strings made up of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'moviereviews2.tsv'`.

In [28]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews2.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


### Check for missing values:

In [30]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [31]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
len(blanks)

0

### Remove NaN values:

In [32]:
df.dropna(inplace=True)

### Take a quick look at the `label` column:

In [33]:
df['label'].value_counts()

neg    2990
pos    2990
Name: label, dtype: int64

### Split the data into train & test sets:

In [34]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

## Linear SVC

In [35]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

### Run predictions and analyze the results

In [36]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [37]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[900  91]
 [ 63 920]]


In [38]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [39]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9219858156028369
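A single train/test split gives one accuracy estimate. As an optional follow-up, k-fold cross-validation averages over several splits. The sketch below is not part of the solution notebook; it uses a hypothetical toy corpus and `cross_val_score` with the same TF-IDF + `LinearSVC` pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; in the notebook this would be X and y
pos_docs = ['a great and moving film', 'the acting was wonderful',
            'i loved this movie', 'a fun and clever story']
neg_docs = ['a dull and boring film', 'the acting was terrible',
            'i hated this movie', 'a slow and clumsy story']
docs = pos_docs * 5 + neg_docs * 5
labels = ['pos'] * 20 + ['neg'] * 20

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# The whole pipeline is re-fitted on each fold, so the vectorizer
# never sees the held-out documents during training
scores = cross_val_score(pipe, docs, labels, cv=5, scoring='accuracy')
print(scores)
print(scores.mean())
```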
