# SynergyLabs Machine Learning Assignment

### Overview
You are expected to build a text classification model that predicts whether a review is positive or negative (binary classification). The dataset you will use in this assignment contains text snippets from real-world reviews. We encourage you to actively explore novel ways of completing this assignment - there are no "right" or "wrong" answers. Where something is not specified, you are free to make your own assumptions.

If you are not comfortable with Jupyter Notebooks, please let us know before you start.

The assignment is divided into three stages, each with different requirements. You will be building a simple machine learning pipeline that:

* Reads data points from text files
* Featurizes data points in a proposed way
* Builds a machine learning model based on the featurized data points

Expected completion time: 5 hours

Please talk to lab members (Sudershan, Chen, or Dohyun) if you have any questions or doubts.


### Getting Started
Download the dataset using the following URL: [https://placeholder.com]
The archive contains the following:
* README.MD
* stop_words.txt -- A text file containing stop words, one per line (Source: [NLTK's list of English stopwords](https://gist.github.com/sebleier/554280))
* data/ : Contains 2 directories (positive and negative). Each of these directories contains text files with reviews belonging to the class indicated by the directory name. (Source: [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/))
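
As a starting point, reading the reviews from this layout could look like the sketch below. The function name `load_reviews`, the default path `data`, and the label encoding (1 for positive, 0 for negative) are illustrative assumptions, not requirements of the assignment.

```python
import os

def load_reviews(data_dir="data"):
    """Return a list of (text, label) pairs, one per review file.

    Assumes data_dir contains the 'positive' and 'negative' directories.
    The label encoding (1 = positive, 0 = negative) is an arbitrary choice.
    """
    labels = {"positive": 1, "negative": 0}
    reviews = []
    for class_name, label in labels.items():
        class_dir = os.path.join(data_dir, class_name)
        # Sort the file names so the order is reproducible across runs.
        for file_name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, file_name)
            # errors="replace" keeps reading even if a file contains odd bytes.
            with open(path, encoding="utf-8", errors="replace") as handle:
                reviews.append((handle.read(), label))
    return reviews
```

Calling `load_reviews("data")` after unzipping the archive should give you one pair per review; adjust the path if you extract the archive somewhere else.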

You can execute shell commands by prefacing them with "!". For example, to execute the `ls` command, you would enter `!ls`. Give it a try below:

(*Press `Shift+Return` to execute code inside a cell*)

```
!ls
```

Enter your commands to download and unzip the dataset below:

### Stage 1: Tokenizing the text file

> *There are a number of solutions to the given problem on websites like StackOverflow or through the use of open libraries (e.g. `sklearn.feature_extraction.text`). However, you are expected **NOT** to copy & paste from them and **NOT** to use pre-built libraries. You should complete this stage of the assignment using Python native libraries (e.g. `open`, `readline`, etc.) only.*


You can use [this reference](https://docs.python.org/3/tutorial/inputoutput.html) as a starting point.

For each text file, you need to read the contents and transform it into a "[Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model)" representation. For example, if a review contains the following text:

```John likes to watch movies. Mary likes movies too.```

Then, we have one occurrence of "John", two occurrences of "movies", and so on. We count the number of times each word appears in the text file. If we transform the example review into its bag of words representation, it should look like the following.

```BoW = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}```

When you build a bag of words, you can decide which words to count. Consider excluding punctuation (e.g. "." or "-") and stop words (words that are not useful as features - one such list is included in the archive you downloaded earlier).
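
To make the idea concrete, here is one possible sketch that uses only the Python standard library. The function names `bag_of_words` and `load_stop_words`, the `lowercase` switch, and the choice to replace punctuation with spaces are assumptions made for illustration; you are free to tokenize differently.

```python
import string

def bag_of_words(text, stop_words=frozenset(), lowercase=False):
    """Count word occurrences in text, skipping punctuation and stop words."""
    if lowercase:
        text = text.lower()
    # Replace every ASCII punctuation character with a space before splitting.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = {}
    for word in text.translate(table).split():
        if word in stop_words:
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts

def load_stop_words(path="stop_words.txt"):
    """Read one stop word per line into a set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}

example = "John likes to watch movies. Mary likes movies too."
print(bag_of_words(example))
```

With the default arguments, the printed dictionary matches the `BoW` example above. Note two trade-offs of this sketch: replacing punctuation with spaces splits a word like "don't" into "don" and "t", and stop-word matching is case-sensitive unless you pass `lowercase=True`.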


### Stage 2: Featurizing Data points
> *There are a number of solutions to the given problem on websites like StackOverflow or through the use of open libraries (e.g. `sklearn.feature_extraction.text`). However, you are expected **NOT** to copy & paste from them and **NOT** to use pre-built libraries. You should complete this stage of the assignment using Python native libraries (e.g. `open`, `readline`, etc.) only.*

To extract better features from the reviews, we will use [TF/IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) featurization. There are different ways to compute term frequency (TF) and inverse document frequency (IDF); for the sake of simplicity, please follow the approach in the "Example of tf-idf" section of the linked Wikipedia article. You can also try additional techniques such as smoothing, but please do not spend too much time on them.

Implement and use TF/IDF to transform the Bag of Words into features for classification.
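
The sketch below shows one way to compute these values with the standard library only. It assumes TF is the raw count divided by the number of counted words in the document, and IDF is the base-10 logarithm of the number of documents divided by the number of documents containing the word, which is how the Wikipedia example appears to be worked out. The function names are hypothetical.

```python
import math

def term_frequency(bow):
    """TF: raw count divided by the total number of counted words in the document."""
    total = sum(bow.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in bow.items()}

def inverse_document_frequency(bows):
    """IDF: log10(N / number of documents that contain the word)."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for word in bow:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

def tf_idf(bow, idf):
    """Weight each TF by its IDF; words without an IDF entry are skipped."""
    tf = term_frequency(bow)
    return {word: value * idf[word] for word, value in tf.items() if word in idf}

# The two toy documents from the Wikipedia example, already as bags of words.
doc1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
doc2 = {"this": 1, "is": 1, "another": 2, "example": 3}
idf = inverse_document_frequency([doc1, doc2])
print(round(tf_idf(doc2, idf)["example"], 3))  # 0.129
```

The word "this" appears in both documents, so its IDF (and therefore its TF/IDF) is 0, while "example" appears only in the second document and gets a positive weight.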

### Stage 3: Build Machine Learning Models
> *You are now allowed to use external libraries. [Sklearn](http://scikit-learn.org/stable/index.html) is a good start and we highly encourage its use.*

With the data points featurized, you will now use them to build a machine learning model. You will need to build a two-label classification model using [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).
You are required to report test [accuracy](https://en.wikipedia.org/wiki/F1_score) with your selected hyperparameter (only the C term). If a word that did not appear in the training set appears in the test set, you can skip it. Also, please use the IDF values learned from the training set.
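
One way to honor both rules is to fix the vocabulary and the IDF values on the training documents, then build every vector (training or test) over that vocabulary. The sketch below is only an illustration: the function names are hypothetical, the IDF uses a base-10 logarithm, and the TF denominator counts all words in the document, including unseen ones, which is just one possible convention.

```python
import math

def fit_vocabulary_and_idf(train_bows):
    """Learn the word order and IDF values from the training documents only."""
    doc_freq = {}
    for bow in train_bows:
        for word in bow:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    vocabulary = sorted(doc_freq)
    n_docs = len(train_bows)
    idf = {word: math.log10(n_docs / doc_freq[word]) for word in vocabulary}
    return vocabulary, idf

def to_feature_vector(bow, vocabulary, idf):
    """Turn one bag of words into a dense TF/IDF vector over the training vocabulary."""
    # Assumption: the total includes words that are not in the vocabulary.
    total = sum(bow.values())
    vector = []
    for word in vocabulary:
        tf = bow.get(word, 0) / total if total else 0.0
        vector.append(tf * idf[word])
    return vector

# Toy bags of words, made up for illustration.
train_bows = [{"good": 2, "film": 1}, {"bad": 1, "film": 1}]
vocabulary, idf = fit_vocabulary_and_idf(train_bows)

# "masterpiece" never appeared in training, so it gets no slot in the vector.
test_vector = to_feature_vector({"good": 1, "masterpiece": 1}, vocabulary, idf)
print(vocabulary, test_vector)
```

Because the loop runs over the training vocabulary, unseen test words are skipped automatically and every vector has the same length.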

You must decide the following for yourself:
* How to represent each data point (vector, matrix, or sparse matrix, etc)
* How to split the training/testing set
* Hyper-parameter tuning - you do not have to exhaustively explore hyperparameters. It is sufficient to consider 3 different values of C (the error term penalty, i.e. the inverse of regularization strength in Sklearn) for Logistic Regression. There are other hyperparameters and configuration options, but you may use the default values for them.
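
The following sketch shows one possible shape for this stage, not a prescribed solution. It uses randomly generated toy vectors in place of featurized reviews, an 60/20/20 train/validation/test split, and the candidate values 0.1, 1.0, and 10.0 for C; all of these are assumptions made for illustration. With real reviews, the split would come before computing IDF so that IDF is learned from the training documents only.

```python
import random

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for featurized reviews: two features per point,
# labeled 1 when the first feature is larger than the second.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if a > b else 0 for a, b in X]

# Hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Try three values of C and keep the one with the best validation accuracy.
candidates = [0.1, 1.0, 10.0]
val_accuracy = {}
for c in candidates:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train, y_train)
    val_accuracy[c] = accuracy_score(y_val, model.predict(X_val))

best_c = max(val_accuracy, key=val_accuracy.get)

# Report accuracy on the untouched test set with the selected C.
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(best_c, test_accuracy)
```

Selecting C on a validation set rather than on the test set keeps the reported test accuracy from being optimistically biased; cross-validation on the training set would be another reasonable choice.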