<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Text Classification (Sentiment Analysis) with Scikit-Learn

In this lab exercise, we will learn how to perform Sentiment Analysis with Scikit-Learn, a popular toolkit for Classical Machine Learning. Sentiment Analysis is a Text Classification task in which a model learns to classify a paragraph or document of text as expressing a positive or a negative sentiment.

We will use TF-IDF features together with several Classical Machine Learning algorithms, such as Naive Bayes and SVM, to classify movie reviews as positive or negative.

In [None]:
#!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/aiup/day2-pm/lab1/lab1.zip
#!unzip lab1.zip

In [None]:
import pandas as pd
import numpy as np
import os
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn import metrics

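import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# If NLTK raises a LookupError later on, download the missing data once, for example:
# nltk.download('punkt'); nltk.download('wordnet')  # the error message names the exact resource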

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
from functools import reduce


## Section 1.1 - Explore Your Data

Take a look at the IMDB Dataset.csv file to see the format of the data that we will be using for this exercise. If you intend to use this set of Jupyter Notebooks later for your own Sentiment Analysis projects, please ensure that you collect your data in this format.

There are 50,000 rows in the IMDB Dataset.csv file. We used Excel to cut out 40,000 rows and save them into the train.csv file, and saved the remaining 10,000 rows into the test.csv file.
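The same 80/20 split can also be done in code rather than in Excel. The sketch below is one possible way, assuming the `review` and `sentiment` column names used later in this lab; the ten-row `full_df` is a hypothetical stand-in for the real file, and shuffling before the split is our own assumption rather than something the original split is known to have done.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: 10 rows instead of 50,000.
full_df = pd.DataFrame({
    "review": [f"review text {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# Shuffle the rows reproducibly, then keep the first 80% for training.
shuffled_df = full_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_train = int(len(shuffled_df) * 0.8)   # 40,000 of 50,000 for the real dataset

train_df = shuffled_df.iloc[:n_train]
test_df = shuffled_df.iloc[n_train:]

print(len(train_df), len(test_df))
# To write the files: train_df.to_csv("train.csv", index=False), and likewise for test_df.
```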

## Section 1.2 - Load Data from CSV

Update the following code to load the training and test data from the correct CSV file path, and indicate the appropriate column names to extract the input text, and output classification label.

The path to the training file should be **"data/train.csv"**, and the path to the test file should be **"data/test.csv"**. The column names for the input text and the output labels can be found in the train.csv and test.csv files.



In [None]:
train_csv_file = 'datasets/train.csv'
test_csv_file='datasets/test.csv'


# Load up the data from the training CSV file.
#
print ("Loading training data...")
train_data_df = pd.read_csv(train_csv_file)


# Load up the data from the test CSV file.
#    
print ("Loading test data...")
test_data_df = pd.read_csv(test_csv_file)


Write some code to view the first five rows of the train and test data.

<details>
    <summary>Click here to see code</summary>

```python
train_data_df.head()
test_data_df.head()
```

</details>

In [None]:
#insert code here to look at train data


In [None]:
#insert code here to look at test data


Let's write some code to save the appropriate columns in the variables x_train, y_train, x_test and y_test.

<details>
    <summary>Click here to see code</summary>

```python
x_train=train_data_df['review']
y_train=train_data_df['sentiment']
x_test=test_data_df['review']
y_test=test_data_df['sentiment']
```

</details>

In [None]:
#insert code here



## Section 1.3 - Create the Classical Machine Learning Text Classification Model

The following creates the Classical Machine Learning model for our Text Classification task. 

Run the following cell to create a Scikit-Learn pipeline that tokenizes the text (splits it into words), lemmatizes each token (converts it into its root form), converts the tokens into Bag-of-Words counts weighted by TF-IDF, and then passes those features into the Machine Learning model.

This is what the processing pipeline for Natural Language Processing in Scikit-Learn looks like.

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/day2-pm/scikit.PNG" />
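The stages in the diagram can be sketched end to end on a tiny, made-up dataset (`toy_texts` and `toy_labels` are hypothetical). This sketch keeps Scikit-Learn's default tokenizer instead of the lemmatizing tokenizer so that it runs without any NLTK data; it is an illustration of how the stages chain together, not the lab's model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the lab uses the IMDB reviews instead.
toy_texts = [
    "wonderful acting and a wonderful story",
    "loved every minute",
    "terrible plot and terrible acting",
    "hated every minute",
]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),      # text -> word counts
    ("tfidf", TfidfTransformer()),    # word counts -> TF-IDF weights
    ("clf", MultinomialNB()),         # TF-IDF weights -> class label
])

# fit() runs every stage in order; predict() reuses the fitted stages.
toy_pipeline.fit(toy_texts, toy_labels)

# Each fitted stage can be inspected by the name given in the pipeline.
print(len(toy_pipeline.named_steps["vect"].vocabulary_))
print(toy_pipeline.predict(["a wonderful story", "terrible plot"]))
```

Because the vectorizer, the TF-IDF transformer and the classifier live in one object, raw strings can be passed straight to `fit` and `predict`.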

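The code below creates a Naive Bayes, SVM or artificial neural network model. The SVM is a linear one trained with `SGDClassifier`, whose default hinge loss gives a linear SVM. Uncomment the appropriate section to create the model of your choice. Let's start with Naive Bayes.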



In [None]:
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [None]:
# Naive Bayes
text_classifier_model = Pipeline([
    ('vect', CountVectorizer(tokenizer=LemmaTokenizer())),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Linear SVM trained with SGD (SGDClassifier uses the hinge loss by default)
# text_classifier_model = Pipeline([
#     ('vect', CountVectorizer(tokenizer=LemmaTokenizer())),
#     ('tfidf', TfidfTransformer()),
#     ('clf', SGDClassifier(verbose=1) ),
# ])

# ANN (a small neural network with one hidden layer of 20 units)
# text_classifier_model = Pipeline([
#     ('vect', CountVectorizer(tokenizer=LemmaTokenizer())),
#     ('tfidf', TfidfTransformer()),
#     ('clf', MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=20, random_state=1, max_iter=5, verbose=True) ),
# ])

## Section 1.4 - Training and Evaluating the Model

Run the following cell to perform the training. Once the training is complete, run the next cell to evaluate the classifier, and review the results to see how well your model is doing. Pay particular attention to the F1 score on the test data, because it tells us how well the model works on data that was not in the training set.
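As a reminder of what the F1 score measures, the sketch below computes it by hand from a confusion matrix and compares the result with Scikit-Learn's `f1_score`. The labels in `y_true` and `y_pred` are made up for illustration and are not results from this lab.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels for eight reviews.
y_true = ["positive", "positive", "positive", "positive",
          "negative", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "positive", "negative",
          "negative", "negative", "positive", "positive"]

# Rows are actual classes and columns are predicted classes, in the order given by labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["negative", "positive"]).ravel()

precision = tp / (tp + fp)   # of the reviews predicted positive, the share that really are
recall = tp / (tp + fn)      # of the truly positive reviews, the share that were found
f1_manual = 2 * precision * recall / (precision + recall)

# The same score for the positive class, computed by Scikit-Learn.
f1_sklearn = f1_score(y_true, y_pred, pos_label="positive")
print(precision, recall, f1_manual, f1_sklearn)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high. The classification report printed later in this section lists the same three values for each class.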


In [None]:
print ("Training classifier...")
text_classifier_model.fit(x_train, y_train)


In [None]:
print ("Evaluating classifier...")
pred_y_train = text_classifier_model.predict(x_train)
pred_y_test = text_classifier_model.predict(x_test)

plt.figure(figsize=(20,6))  

labels=['negative','positive']
labels = np.array(labels)

# Print the first Confusion Matrix for the training data
#
cm = confusion_matrix(y_train, pred_y_train)

cm_df = pd.DataFrame(cm, labels, labels)          
plt.subplot(1, 2, 1)
plt.title('Confusion Matrix (Train Data)')
sns.heatmap(cm_df, annot=True, fmt='d')  # fmt='d' shows whole counts instead of scientific notation
plt.ylabel('Actual')
plt.xlabel('Predicted')        

# Print the second Confusion Matrix for the test data
#    
cm = confusion_matrix(y_test, pred_y_test)

cm_df = pd.DataFrame(cm, labels, labels)          
plt.subplot(1, 2, 2)
plt.title('Confusion Matrix (Test Data)')
sns.heatmap(cm_df, annot=True, fmt='d')  # fmt='d' shows whole counts instead of scientific notation
plt.ylabel('Actual')
plt.xlabel('Predicted')        

plt.show()

In [None]:
# Finally display the classification reports
#
print ("Train Data")
print ("--------------------------------------------------------")
print(metrics.classification_report(y_train, pred_y_train, target_names=labels))
print ("")
print ("Test Data")
print ("--------------------------------------------------------")
print(metrics.classification_report(y_test, pred_y_test, target_names=labels))


## Section 1.5 - Saving the Model

Let's save the model into a file that we can reload and use later on.

Once you have run the following cell, take a look at the saved files in the models folder.

Once you have saved the model, you can head back to Section 1.3 to train your text classifier with another Machine Learning model, and save it under another filename.
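Saving and reloading with `pickle` can be checked with a round trip: a reloaded model should give the same predictions as the original. The sketch below does this with a hypothetical toy pipeline and a temporary folder (the `toy-model` filename is illustrative only), so that it does not touch the lab's own files.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy model standing in for text_classifier_model.
toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_model.fit(["good film", "bad film"], ["positive", "negative"])

with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, "toy-model")  # illustrative filename

    # The with-statement closes the file even if pickling fails.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    with open(model_path, "rb") as f:
        reloaded_model = pickle.load(f)

# The reloaded pipeline should behave exactly like the original one.
samples = ["good film", "bad film"]
same_predictions = list(toy_model.predict(samples)) == list(reloaded_model.predict(samples))
print(same_predictions)
```

One thing to keep in mind for the lab's own model: its pipeline holds a `LemmaTokenizer` instance, so the class has to be defined (for example by running its cell) before the model is loaded in a fresh session.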


In [None]:
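text_classifier_labels=['negative','positive']

saved='models/Naive-Bayes'
os.makedirs(os.path.dirname(saved), exist_ok=True)  # create the models folder if it is missing

# Use with-blocks so that both files are flushed and closed after writing.
with open(saved + ".labels", "wb") as f:
    pickle.dump(text_classifier_labels, f)
with open(saved, "wb") as f:
    pickle.dump(text_classifier_model, f)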

## Section 1.6 - Loading the Model 

Run the following cell to load your model.

In [None]:
# Use with-blocks so that both files are closed after reading.
with open(saved + ".labels", "rb") as f:
    text_classifier_labels = pickle.load(f)
with open(saved, "rb") as f:
    text_classifier_model = pickle.load(f)

## Section 1.7 - Testing the Model

Let's try to run the following cell to test our model. When prompted for an input, enter any line of text and see what your machine learning model has classified the text as.

Also try loading the Naive Bayes model and the SVM model in turn, and run the text classification with each of them.

Discuss your findings. 

1. Which model was more accurate based on the F1-score calculated after training?

2. Do you think that the classification has been accurate when you actually tried the model?

3. What else can you do to improve the accuracy of the model?


In [None]:
print ("Enter some text:")
text = input()
print ("You entered: %s" % (text))
input_text=pd.Series(text)
result = text_classifier_model.predict(input_text)

print ("Classification result:")
print (result)