# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [26]:
import numpy as np
import pandas as pd

# This notebook reads a zipped copy of the dataset rather than the .tsv path given above
CONST_DATA_FILE = 'data/moviereviews2.zip'

df = pd.read_csv(CONST_DATA_FILE, sep='\t', compression='zip')
df.head()

|  | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |
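
Because the file is tab-delimited, `sep='\t'` is what lets pandas split each row into a label and a review, even when the review text contains commas. A minimal sketch on a tiny in-memory sample (the two rows below are made up for illustration and are not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same label<TAB>review layout.
sample_tsv = (
    "label\treview\n"
    "pos\tA warm, touching movie.\n"
    "neg\tBored out of my head.\n"
)

# Commas inside the review text are harmless because the separator is a tab.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t')
print(sample_df.shape)
print(list(sample_df.columns))
```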


In [27]:
df.describe()

|  | label | review |
|---|---|---|
| count | 6000 | 5980 |
| unique | 2 | 5966 |
| top | neg | God, I was bored out of my head as I watched t... |
| freq | 3000 | 2 |


### Task #2: Data Cleaning - Check for missing values:

In [3]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [4]:
# Count reviews that are empty or whitespace-only (NaN rows compare as False here)
print(f"Number of empty reviews : {(df['review'].str.strip() == '').sum()}")

Number of empty reviews : 0


In [5]:
# Print DataFrame entries with null values
df_with_nulls = df[df.isnull().any(axis=1)]
print(df_with_nulls.describe())
print('dataframe entries with null values:')
df_with_nulls

       label review
count     20      0
unique     2      0
top      neg    NaN
freq      10    NaN
dataframe entries with null values:


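|  | label | review |
|---|---|---|
| 825 | neg | NaN |
| 895 | neg | NaN |
| 1889 | neg | NaN |
| 2038 | pos | NaN |
| 2260 | pos | NaN |
| 2452 | neg | NaN |
| 2713 | pos | NaN |
| 2980 | pos | NaN |
| 3182 | neg | NaN |
| 3250 | pos | NaN |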


In [6]:
# use the index of the rows that are nulls to remove them from the original dataframe
df.drop(df_with_nulls.index, inplace=True)
df.describe()

|  | label | review |
|---|---|---|
| count | 5980 | 5980 |
| unique | 2 | 5966 |
| top | neg | God, I was bored out of my head as I watched t... |
| freq | 2990 | 2 |


In [7]:
# Check again for NaN values:
df.isnull().sum()

label     0
review    0
dtype: int64

In [8]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = df[df['review'].str.strip() == ''].index
print(f"Number of empty reviews : {len(blanks)}")
# Drop them in place so the later cells, which keep using df, see the cleaned data
df.drop(blanks, inplace=True)
len(df)


Number of empty reviews : 0


5980
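
The two checks above catch different problems: `isnull()` finds missing values, while the `str.strip()` comparison finds strings that contain only whitespace. A minimal sketch on a made-up four-row frame (the `toy` data is hypothetical) shows that neither check covers the other:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one NaN review and one whitespace-only review.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film', np.nan, '   ', 'Dull plot'],
})

# isnull() flags the NaN row but not the whitespace-only row.
n_nan = toy['review'].isnull().sum()

# str.strip() leaves NaN untouched, so the comparison is False for that row.
is_blank = toy['review'].str.strip() == ''

# Keep only rows that pass both checks.
toy_clean = toy[toy['review'].notnull() & ~is_blank]
print(n_nan, is_blank.sum(), len(toy_clean))
```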

### Task #3: Data Cleaning - Check for duplicate values:

In [9]:
# List duplicate reviews (every occurrence after the first)
df[df.duplicated()]

|  | label | review |
|---|---|---|
| 503 | neg | Wow, here it finally is; the action "movie" wi... |
| 2042 | neg | God, I was bored out of my head as I watched t... |
| 2045 | pos | Everyone knows about this ''Zero Day'' event. ... |
| 2672 | pos | Though structured totally different from the b... |
| 2875 | pos | One of Disney's best films that I can enjoy wa... |
| 3037 | pos | Smallville episode Justice is the best episode... |
| 3179 | pos | I loved this movie. I knew it would be chocked... |
| 3657 | neg | Awful, simply awful. It proves my theory about... |
| 3973 | neg | What was an exciting and fairly original serie... |
| 4363 | neg | I have been familiar with the fantastic book o... |


In [10]:
# One example of a duplicated review
df[df['review'].str.match('Wow, here it finally is; ')]

|  | label | review |
|---|---|---|
| 270 | neg | Wow, here it finally is; the action "movie" wi... |
| 503 | neg | Wow, here it finally is; the action "movie" wi... |


In [11]:
# Just to be sure (use .loc: 270 and 503 are index labels, and positions can shift after rows are dropped):
df.loc[270, 'review'] == df.loc[503, 'review']

True

In [12]:
# Let's remove the duplicate entries
df.drop(df[df.duplicated()].index, inplace=True)
df.describe()

|  | label | review |
|---|---|---|
| count | 5966 | 5966 |
| unique | 2 | 5966 |
| top | pos | Here is the explanation screenwriter Pamela Ka... |
| freq | 2984 | 1 |
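
As an aside, pandas also offers `drop_duplicates()`, which should give the same result as dropping the index of `df[df.duplicated()]` when the index labels are unique. A minimal sketch on a made-up three-row frame (the `toy` data and its index labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: rows 10 and 30 are identical.
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg'],
    'review': ['Simply awful.', 'Loved it.', 'Simply awful.'],
}, index=[10, 20, 30])

# duplicated() marks every repeat after the first occurrence.
dup_mask = toy.duplicated()

# drop_duplicates() keeps the first occurrence by default (keep='first').
deduped = toy.drop_duplicates()
print(dup_mask.tolist(), deduped.index.tolist())
```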


### Task #4: Take a quick look at the `label` column:

In [13]:
df['label'].value_counts()

pos    2984
neg    2982
Name: label, dtype: int64
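
With the classes this evenly split, accuracy is a reasonable headline metric. As a rough sanity check, the counts above give the score of a trivial classifier that always predicts the more frequent label, which is the bar any real model has to clear:

```python
# Label counts copied from the value_counts() output above.
label_counts = {'pos': 2984, 'neg': 2982}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the majority label on the full dataset.
baseline_accuracy = label_counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 4))
```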

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [14]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(f'X_train.shape:{X_train.shape}, X_test.shape:{X_test.shape}, y_train.shape:{y_train.shape}, y_test.shape:{y_test.shape}')


X_train.shape:(3997,), X_test.shape:(1969,), y_train.shape:(3997,), y_test.shape:(1969,)
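
The split above is purely random. With classes this balanced that is usually fine, but `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both subsets. A minimal sketch on made-up data (the texts and labels are hypothetical); note that stratifying the real data would produce a different split, so the results would no longer match the solution notebook:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 positive and 6 negative labels.
texts = [f'review {i}' for i in range(12)]
labels = ['pos'] * 6 + ['neg'] * 6

# stratify=labels asks for the same class proportions in train and test.
tr_X, te_X, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels)

print(len(tr_X), len(te_X), te_y.count('pos'), te_y.count('neg'))
```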


### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
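
To make the pipeline's role concrete, here is a minimal sketch on a made-up four-review corpus (the texts, labels, and the `toy_clf` name are hypothetical). It shows that the pipeline accepts raw strings in both `fit` and `predict`, and that the fitted vectorizer can be inspected through its step name:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical four-review corpus, invented for illustration.
toy_texts = ['good movie', 'great plot', 'bad movie', 'awful plot']
toy_labels = ['pos', 'pos', 'neg', 'neg']

toy_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),   # text -> sparse TF-IDF matrix
    ('clf', MultinomialNB()),       # classifier trained on that matrix
])
toy_clf.fit(toy_texts, toy_labels)

# The fitted vectorizer is reachable by its step name.
vocab = sorted(toy_clf.named_steps['tfidf'].vocabulary_)
print(vocab)

# Raw strings go straight into predict(); the pipeline vectorizes them first.
toy_prediction = toy_clf.predict(['good great'])
print(toy_prediction)
```

Bundling the vectorizer with the classifier also means the vocabulary is learned from the training text only, so nothing from the test set leaks into the features.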


### Task #7: Run predictions and analyze the results

#### We'll run naïve Bayes first

In [16]:
# Feed the training data through the pipeline
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [17]:
# Form a prediction set
predictions_naive_bayes = text_clf_nb.predict(X_test)

In [18]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions_naive_bayes))


[[950  36]
 [126 857]]


In [19]:
# Print a classification report
print(metrics.classification_report(y_test,predictions_naive_bayes))

              precision    recall  f1-score   support

         neg       0.88      0.96      0.92       986
         pos       0.96      0.87      0.91       983

    accuracy                           0.92      1969
   macro avg       0.92      0.92      0.92      1969
weighted avg       0.92      0.92      0.92      1969
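
The per-class figures in this report can be reproduced by hand from the confusion matrix printed earlier. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, here in the sorted order `neg`, `pos`. A short worked sketch using the Naive Bayes numbers:

```python
# Naive Bayes confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, ordered (neg, pos).
cm = [[950, 36],
      [126, 857]]

neg_precision = cm[0][0] / (cm[0][0] + cm[1][0])  # 950 / 1076
neg_recall = cm[0][0] / (cm[0][0] + cm[0][1])     # 950 / 986
pos_precision = cm[1][1] / (cm[0][1] + cm[1][1])  # 857 / 893
pos_recall = cm[1][1] / (cm[1][0] + cm[1][1])     # 857 / 983

# Accuracy is the diagonal divided by the total number of test reviews.
total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

print(round(neg_precision, 2), round(neg_recall, 2))
print(round(pos_precision, 2), round(pos_recall, 2))
print(round(accuracy, 4))
```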



In [20]:
# Print the overall accuracy
naive_bayes_accuracy = metrics.accuracy_score(y_test,predictions_naive_bayes)
print(naive_bayes_accuracy)

0.9177247333671915


#### Now let's run the Linear SVC:

In [21]:
# Feed the training data through the pipeline
text_clf_lsvc.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [22]:
# Form a prediction set
predictions_linear_svc = text_clf_lsvc.predict(X_test)

In [23]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions_linear_svc))

[[910  76]
 [ 72 911]]


In [24]:
# Print a classification report
print(metrics.classification_report(y_test,predictions_linear_svc))

              precision    recall  f1-score   support

         neg       0.93      0.92      0.92       986
         pos       0.92      0.93      0.92       983

    accuracy                           0.92      1969
   macro avg       0.92      0.92      0.92      1969
weighted avg       0.92      0.92      0.92      1969



In [25]:
# Print the overall accuracy
linear_svc_accuracy = metrics.accuracy_score(y_test,predictions_linear_svc)
print(linear_svc_accuracy)

0.9248349415947181
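
The two accuracy scores are close, but the confusion matrices show that the models err in different ways. A small sketch using the two matrices printed above, treating `pos` as the positive class (a convention chosen here for illustration):

```python
# Confusion matrices copied from the outputs above; rows = true, columns = predicted.
nb_cm = [[950, 36], [126, 857]]
svc_cm = [[910, 76], [72, 911]]

def error_counts(cm):
    """Return (false positives, false negatives) with 'pos' as the positive class."""
    false_pos = cm[0][1]  # true neg predicted as pos
    false_neg = cm[1][0]  # true pos predicted as neg
    return false_pos, false_neg

for name, cm in [('Naive Bayes', nb_cm), ('Linear SVC', svc_cm)]:
    fp, fn = error_counts(cm)
    print(f'{name}: {fp} false positives, {fn} false negatives, {fp + fn} errors')
```

Naive Bayes mislabels positive reviews as negative far more often than the reverse (126 versus 36), while the Linear SVC errors are nearly balanced (72 versus 76).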


## Great job! We get 92.5% overall accuracy for the Linear SVC classifier and slightly less (91.8%) for Naive Bayes.