<img src="https://www.nyp.edu.sg/content/dam/nyp/logo.png" width='200'/>

Welcome to the lab! Before we get started, here are a few pointers on Jupyter notebooks.

1. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

2. You can execute code cells by clicking the ```Run``` icon in the menu, or via the following keyboard shortcuts: ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

3. To interrupt cell execution, click the ```Stop``` button on the toolbar, or navigate to the ```Kernel``` menu and select ```Interrupt```.


# Lab 1 - Sentiment Analysis with Scikit-Learn

In this lab exercise, we will learn how to perform Sentiment Analysis with Scikit-Learn, a popular toolkit for Classical Machine Learning. Sentiment Analysis is a Text Classification task in which your model learns to classify a paragraph or a document of text as expressing either a positive or a negative sentiment.

We will explore using TF-IDF and various Classical Machine Learning algorithms such as Naive Bayes and SVM to classify whether sentiments of movie reviews are positive or negative.

In [None]:
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/day2-pm/lab1/lab1.zip
!unzip lab1.zip

from helpers import *
print("Importing helpers complete.")

## Section 1.1 - Explore Your Data

Take a look at the IMDB Dataset.csv file to see the format of the text file that we will be using for this exercise. If you intend to use this set of Jupyter Notebooks later for your own Sentiment Analysis projects, please ensure that you collect your data in this format.

There are 50,000 rows in the IMDB Dataset.csv file. We used Excel to cut out 40,000 rows and saved them into the train.csv file, and saved the remaining 10,000 rows into the test.csv file.
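
If you later prepare your own dataset, the same 80/20 split can also be done in code instead of Excel. The following is only a minimal sketch, assuming pandas is available; the tiny in-memory table and the column names `text` and `label` are hypothetical stand-ins, not the actual columns of the lab files.

```python
import pandas as pd

# Tiny stand-in for a full dataset; "text" and "label" are hypothetical column names.
reviews = pd.DataFrame({
    "text": [f"review number {i}" for i in range(10)],
    "label": ["positive", "negative"] * 5,
})

# Shuffle with a fixed seed so the split is reproducible, then cut at 80%.
shuffled = reviews.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = int(len(shuffled) * 0.8)
train_df = shuffled.iloc[:cut]
test_df = shuffled.iloc[cut:]

print(len(train_df), len(test_df))
```

Each part could then be written out with `train_df.to_csv(..., index=False)`. Shuffling before cutting avoids a split in which one file ends up containing mostly one kind of review.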

## Section 1.2 - Load Data from CSV

Update the following code to load the training and test data from the correct CSV file paths, and indicate the appropriate column names for the input text and the output classification label.

The path to the training file should be **"data/train.csv"**, and the path to the test file should be **"test.csv"**. The column names for the input text and output labels can be found in the train.csv and test.csv files.


In [None]:
# TODO: Set the filename to the path containing our training and test files. 
#
load_text_data_from_csv_for_scikit(
    "???",                           # The training CSV file
    "???",                           # The test CSV file
    "???",                           # The column in the CSV used as the input text
    "???")                           # The column in the CSV used as the output classification label

## Section 1.3 - Display Loaded Data

Run the following code to display the training data that we have loaded.

Can you identify which parts are the input texts, and which parts are the output labels?

In [None]:
display_trainx_trainy()

## Section 1.4 - Create the Classical Machine Learning Text Classification Model

The following creates the Classical Machine Learning model for our Text Classification task. 

We have written code for you to create a model with either Naive Bayes or SVM. Let's start with Naive Bayes.



In [None]:
# TODO:
# Run either this or SVM.
#
create_text_classifier_model_naivebayes(
    1.0,        # Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
    True,       # Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
    None        # Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
)
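
The three arguments above have the same meaning as the `alpha`, `fit_prior` and `class_prior` parameters of Scikit-Learn's `MultinomialNB`. As a rough illustration only, assuming the helper wraps that class (check helpers.py to confirm), the sketch below fits a Naive Bayes model on a toy word-count matrix; the data is made up.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: rows are documents, columns are words (made-up data).
X = np.array([
    [3, 0, 1],   # positive review
    [2, 0, 0],   # positive review
    [0, 3, 1],   # negative review
    [0, 2, 0],   # negative review
])
y = np.array(["positive", "positive", "negative", "negative"])

# alpha=1.0 is Laplace smoothing; fit_prior=True learns class priors from the data.
model = MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
model.fit(X, y)

# A new document that contains only the first word, once.
prediction = model.predict(np.array([[1, 0, 0]]))[0]
print(prediction)
```

Smoothing matters here because the first word never appears in a negative review; with `alpha=1.0` its probability under the negative class is small but not zero.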


In [None]:
# TODO:
# Run either this or Naive Bayes.
#
create_text_classifier_model_svm()


## Section 1.5 - Training and Evaluating the Model

Run the following cell to perform the training. The Scikit-Learn data pipeline set up in the helpers uses NLTK to tokenize (split into words) and lemmatize (convert words into root forms) the text, converts the result into Bag-of-Words counts weighted by TF-IDF, and then passes those features into the Machine Learning model.

This is what the processing pipeline for Natural Language Processing in Scikit-Learn looks like.

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/day2-pm/scikit.PNG" />
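
The stages in the diagram can be sketched with Scikit-Learn's `Pipeline`. This is a simplified illustration, not the code in helpers.py: it skips the NLTK tokenizing and lemmatizing step and relies on `CountVectorizer`'s default tokenizer, and the four training sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, two reviews per class.
train_texts = [
    "a wonderful and moving film",
    "great acting and a wonderful story",
    "a dull and boring film",
    "terrible acting and a boring story",
]
train_labels = ["positive", "positive", "negative", "negative"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),      # text -> word counts (Bag-of-Words)
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
    ("clf", MultinomialNB()),        # TF-IDF features -> class label
])
pipeline.fit(train_texts, train_labels)

predicted = pipeline.predict(["a wonderful story"])[0]
print(predicted)
```

Keeping every step in one pipeline object means new text passes through exactly the same transformations as the training data before it reaches the classifier.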

Once the training is complete, review the results below to see how well your model is faring. Pay particular attention to the test data's F1 score, because it is a meaningful metric that tells us how well our model works on data that does not exist in the training set.
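
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for a single class; the function name `f1_for_label` and the toy labels are hypothetical, and the helpers may report the score differently (for example, averaged over both classes).

```python
def f1_for_label(y_true, y_pred, label):
    """Compute the F1 score for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 true positives, 1 false positive, 1 false negative.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "neg", "neg"]

score = f1_for_label(y_true, y_pred, "pos")
print(round(score, 3))
```

Because it combines precision and recall, F1 penalizes a model that scores well on one of them only by ignoring the other.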


In [None]:
train_text_classifier_model()


## Section 1.6 - Saving the Model

Let's save the model into a file that we can reload and use later on.

Once you have run the following cell, take a look at the file in the folder. 

If you trained using Naive Bayes, we recommend that you save the model to a file name such as **"models/naivebayes.scikit"**. If you trained your model with SVM, save it to a file name such as **"models/svm.scikit"**.

Once you have saved the model, head back to Section 1.4 and train your text classifier with the other Machine Learning model.

In [None]:
# TODO: 
# Give the model a file name.
#
save_text_classifier_model("???")

## Section 1.7 - Loading the Model 

Update the following cell to provide the file name of your model and run the cell.

In [None]:
# TODO: 
# Update the file name of the model that you want to load.
#
load_text_classifier_model("???")
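
Saving and reloading a trained Scikit-Learn model, as in Sections 1.6 and 1.7, can be sketched as a round trip through a file. This is only an illustration that assumes Python's `pickle` module as the file format; the helpers may use a different mechanism, and the training data and the choice of `LinearSVC` are made up for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data.
texts = ["loved it", "great film", "hated it", "awful film"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Write the fitted pipeline to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svm.scikit")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should behave exactly like the original.
same_predictions = list(model.predict(texts)) == list(restored.predict(texts))
print(same_predictions)
```

Only load pickled model files that come from a source you trust, because unpickling can execute arbitrary code.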

## Section 1.8 - Testing the Model

Let's try to run the following cell to test our model. When prompted for an input, enter any line of text and see what your machine learning model has classified the text as.

Also try loading the Naive Bayes model and then the SVM model, and test the text classification with each of them.

Discuss your findings. 

1. Which model was more accurate based on the F1-score calculated after training?

2. Do you think that the classification has been accurate when you actually tried the model?

3. What else can you do to improve the accuracy of the model?


In [None]:
print("Enter some text:")
user_text = input()
classify_text(user_text)


## Section 1.9 - Explore the helpers.py code

Take a look at the helpers.py file to see the code that loads the training and test data, creates the Machine Learning model, trains the model and performs classification.