___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

Xiaoyi Wang

2403234885

# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import pandas as pd
import os
import glob

In [41]:
cwd = os.getcwd()
dir_path = os.path.join(cwd, 'TextFiles')

file_paths = []
# Search for all text files in the specified directory structure
for folder_name in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        search_path = os.path.join(dir_path, folder_name, sentiment, '*.txt')
        file_paths.extend(glob.glob(search_path))

records = []
# Read each review file; its label is the parent folder name ('pos' or 'neg')
for file_path in file_paths:
    # Set the encoding explicitly so the read does not depend on the platform default
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    sentiment = file_path.split(os.sep)[-2]
    records.append({'text': text, 'label': sentiment})

# Build a single DataFrame from all collected records
df = pd.DataFrame(records, columns=['text', 'label'])

In [42]:
df.shape

(50000, 2)

In [16]:
df[df['text'].apply(len) < 33]

                                   text  label
46499  Read the book, forget the movie!    neg


In [43]:
df.sample(5)

                                                     text  label
16961  I went into this movie with high hopes. Normal...    neg
46670  It's hard to believe a movie can be this bad, ...    neg
25721  Make sure you make this delightful comedy part...    pos
12909  As a writer I find films this bad making it in...    neg
28554  It's not difficult, after watching this film, ...    pos


In [44]:
# Saving the output
# df.to_csv('TextFiles/moviereviews2.tsv', sep='\t', index=False)

In [3]:
# Reading the file
df = pd.read_csv('TextFiles/moviereviews2.tsv', sep='\t')

### Task #2: Check for missing values:

In [4]:
# Check for NaN values:
df.isnull().sum()

text     0
label    0
dtype: int64

In [5]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

for i, rv, lb in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):        # skip NaN values, which are floats
        if rv.isspace():           # test 'review' for whitespace
            blanks.append(i)       # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

0 blanks:  []
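
The same whitespace check can also be written as a boolean mask instead of an explicit loop. The following is a minimal sketch on a small made-up DataFrame; the `demo` frame and its rows are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reviews DataFrame
demo = pd.DataFrame({
    'text': ['Great film', '   ', np.nan, '\t\n', 'Awful'],
    'label': ['pos', 'neg', 'pos', 'neg', 'neg'],
})

# True only for real strings made up entirely of whitespace (NaN is skipped)
is_blank = demo['text'].apply(lambda s: isinstance(s, str) and s.isspace())
blank_idx = demo.index[is_blank].tolist()
print(len(blank_idx), 'blanks: ', blank_idx)

# Drop NaN rows first, then the whitespace-only rows
cleaned = demo.dropna().drop(blank_idx)
print(len(cleaned))
```

Collecting the index labels first, as the loop does, keeps the later `drop` call unchanged.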


### Task #3: Remove NaN values:

In [61]:
df.dropna(inplace=True)
df.drop(blanks, inplace=True)
len(df)

50000

### Task #4: Take a quick look at the `label` column:

In [50]:
df['label'].unique()

array(['pos', 'neg'], dtype=object)

In [51]:
df['label'].value_counts()

pos    25000
neg    25000
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [52]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.33, random_state=42)

print('Training Data Shape:', X_train.shape)
print('Testing Data Shape: ', X_test.shape)

Training Data Shape: (33500,)
Testing Data Shape:  (16500,)
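
If you want both classes represented in the same proportion in the train and test sets, `train_test_split` accepts a `stratify` argument. The sketch below uses a small made-up DataFrame (`toy` and its contents are assumptions for illustration); note that adding `stratify` changes which rows land in each set, so results will no longer match the solution notebook exactly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 6 'pos' and 6 'neg' rows
toy = pd.DataFrame({
    'text': [f'review {i}' for i in range(12)],
    'label': ['pos', 'neg'] * 6,
})

# stratify keeps the pos/neg ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['text'], toy['label'],
    test_size=0.33, random_state=42,
    stratify=toy['label'],
)

print(y_tr.value_counts())
print(y_te.value_counts())
```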


### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [55]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
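
Because the vectorizer is part of the pipeline, a fitted pipeline accepts raw strings directly, with no separate transform step. The sketch below illustrates this on four made-up training sentences; `toy_clf` and the example texts are assumptions for illustration, and a model trained on so little data is not meaningful beyond showing the call pattern.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical miniature training set
train_texts = [
    'a wonderful, moving film',
    'great acting and a great story',
    'terrible plot and awful acting',
    'a boring, awful mess',
]
train_labels = ['pos', 'pos', 'neg', 'neg']

toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
toy_clf.fit(train_texts, train_labels)

# predict() vectorizes the raw strings with the fitted TfidfVectorizer first
preds = toy_clf.predict(['great story', 'awful plot'])
print(preds)
```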

### Task #7: Run predictions and analyze the results

In [56]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [57]:
from sklearn import metrics

In [58]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

[[7322  874]
 [ 796 7508]]


In [59]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.90      0.89      0.90      8196
         pos       0.90      0.90      0.90      8304

    accuracy                           0.90     16500
   macro avg       0.90      0.90      0.90     16500
weighted avg       0.90      0.90      0.90     16500



In [60]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.8987878787878788
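
The figures in the classification report and the accuracy score can all be recovered from the confusion matrix printed earlier. As a quick sanity check, the sketch below recomputes them with NumPy, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in the sorted order `['neg', 'pos']`.

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted
cm = np.array([[7322, 874],
               [796, 7508]])

correct = np.diag(cm)                   # correctly classified reviews per class
accuracy = correct.sum() / cm.sum()     # (7322 + 7508) / 16500
precision = correct / cm.sum(axis=0)    # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)       # divide by row totals (true counts, the "support")

print(accuracy)
print(precision.round(2), recall.round(2))
```

The row totals, 8196 and 8304, match the `support` column of the report.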


## Great job!