# Neural networks from scratch

## Part I: Preprocessing

In our last sessions, we talked about what it means for a supervised algorithm to *learn*. We saw that in supervised learning, each **input** comes with a corresponding **output** label. The goal is to learn weights for the input features that produce the correct output; this is what it means to train a model.

In this exercise, we will apply what we learned to predict the rating of a review from its text and its read count.

In the `train.csv` file, there are a small number of reviews; for each review, there is the text of the review, the number of people who have read the review, and the numerical score of the review:

```
"This video is incredible!!!",1,5.0
"Sad and sorry excuse for words on a screen",8,1.5
"The instructor was amazing, incredible in every way!",2,5.0
"I feel sorry for anyone who waste their time on this book.",9,1.0
"This is so-so, not bad but not good either.",6,3.0
...
```
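Each row can be read with Python's built-in `csv` module, which handles the quoted review text, including commas inside the quotes. The following is a minimal sketch that parses two of the sample rows from an in-memory string rather than the real file; the names `sample` and `rows` are chosen here only for illustration.

```python
import csv
import io

# Two rows copied from the sample above, held in memory instead of train.csv
sample = (
    '"This video is incredible!!!",1,5.0\n'
    '"The instructor was amazing, incredible in every way!",2,5.0\n'
)

rows = []
for text, read_count, rating in csv.reader(io.StringIO(sample)):
    # csv.reader yields strings, so convert the numeric columns explicitly
    rows.append((text, int(read_count), float(rating)))

print(rows[1])
```

Note that the comma inside the second review stays part of the text field because the field is quoted.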

The first step in most NLP pipelines is cleaning and tokenization. Here, that means extracting the vocabulary and counting how often each token occurs.

In real life, you will have to decide what constitutes a **token**. Here, let's use a simple definition. A token is:
* a single valid English word
* lowercased
* stemmed to its singular form if it is a plural
* not a stop word
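The four rules above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: "valid English word" is approximated by a run of letters (no dictionary lookup), plural stemming just drops a single trailing "s", and `EXAMPLE_STOP_WORDS` and `tokenize` are hypothetical names, not part of the exercise code.

```python
import re

# A tiny stop-word set for this example only
EXAMPLE_STOP_WORDS = {"a", "and", "for", "is", "on", "the", "this"}

def tokenize(text, stop_words=EXAMPLE_STOP_WORDS):
    """Split text into tokens following the four rules listed above."""
    tokens = []
    # Lowercase, then take runs of letters as candidate words
    for word in re.findall(r"[a-z]+", text.lower()):
        # Drop stop words before stemming so "this" is not turned into "thi"
        if word in stop_words:
            continue
        # Naive plural stemming: drop one trailing "s" from longer words,
        # leaving "ss" endings such as "pass" untouched
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        tokens.append(word)
    return tokens

print(tokenize("Sad and sorry excuse for words on a screen"))
# → ['sad', 'sorry', 'excuse', 'word', 'screen']
```

A rule this simple will mis-stem words such as "always", which is one reason a dedicated stemming library is usually preferable.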


In [7]:
import string
from collections import Counter

# Let's try a max vocab size of 12 for now, and see how this works
MAX_VOCAB_SIZE = 12
STOP_WORDS = ['a', 'an', 'and', 'are', 'as', 'at',
              'be', 'but', 'by', 'for', 'if', 'in',
              'into', 'is', 'it', 'no', 'not', 'of',
              'on', 'or', 'such', 'that', 'the', 'their',
              'then', 'there', 'these', 'they', 'this',
              'to', 'was', 'will', 'with']

def get_vocabulary(reviews):
    '''
        Given a list of reviews, returns a vocabulary dictionary mapping each
        token to its count, keeping only the MAX_VOCAB_SIZE most frequent tokens.
        For example, given ["foo bar", "bar baz hello", "hello world"],
        return {"foo": 1, "bar": 2, "baz": 1, "hello": 2, "world": 1}
    '''
    counter = Counter()
    for review in reviews:
        for word in review.split():
            # Lowercase and strip punctuation first, so that "This" and
            # "this!" both match the stop word "this"
            token = word.lower().strip(string.punctuation)
            if token and token not in STOP_WORDS:
                counter[token] += 1

    # Keep only the most frequent tokens. Plural stemming is not handled here.
    return dict(counter.most_common(MAX_VOCAB_SIZE))


{'bar': 2,
 'baz': 1,
 'broski': 1,
 'dude': 1,
 'foo': 1,
 'hello': 2,
 'how': 1,
 'i': 1,
 'incredible': 1,
 'love': 1,
 'world': 1}

Stemming and excluding stop words is tedious work. **Question**: What libraries might you use that can help you with this? Do some googling and write down below at least two libraries and function calls you can make.

**TODO: what are two libraries and their functions I could call to make this easier...**
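1. NLTK: `nltk.tokenize.word_tokenize` for tokenizing, `nltk.corpus.stopwords.words('english')` for stop words, and `nltk.stem.PorterStemmer().stem` for stemming
2. spaCy: `spacy.lang.en.English` (or a pipeline loaded with `spacy.load`) for tokenizing, with `token.is_stop` to flag stop words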

Now, we will convert each sample into a *feature vector*. In real life, you will decide what features to use, and in the future, we will use deep learning to help us decide features. But for now, we will use these features:
* Counts of each token
* The number of people who have read the review
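As a worked example of this layout, suppose the vocabulary were just the three tokens `incredible`, `love`, and `sorry`. That vocabulary and the helper name `to_feature_vector` are made up for illustration and are not the ones used in the exercise. The feature vector is then one count per vocabulary token, in a fixed order, followed by the read count:

```python
from collections import Counter

# Hypothetical three-token vocabulary, in a fixed order
example_vocab = ["incredible", "love", "sorry"]

def to_feature_vector(tokens, read_count, vocab=example_vocab):
    """Return per-token counts in vocab order, followed by the read count."""
    counts = Counter(tokens)
    return [counts[token] for token in vocab] + [float(read_count)]

# Tokens of "I love how incredible this was." after lowercasing
# and removing stop words
tokens = ["i", "love", "how", "incredible"]
print(to_feature_vector(tokens, 2))  # → [1, 1, 0, 2.0]
```

Tokens outside the vocabulary ("i", "how") contribute nothing, and vocabulary tokens missing from the review ("sorry") get a count of 0, so every sample ends up with a vector of the same length.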

In [59]:

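import string

def vectorize(data, vocab):
    '''
        Given a sample and the vocabulary dictionary, convert the sample to a feature vector.
        For example, given "I love how incredible this was.",2,4.5
        it should return a list with the count of each vocab word in the review,
        followed by the # of people who have read the review
    '''
    review, read_count, rating = data
    # Split into words (not characters) and normalize them like the vocabulary
    tokens = [word.lower().strip(string.punctuation) for word in review.split()]
    counts = Counter(tokens)

    # One count per vocab token, in sorted order so every sample lines up,
    # then the read count as the last feature
    features = [counts[token] for token in sorted(vocab)]
    features.append(float(read_count))
    return features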


Again, vectorizing by hand can be tough work. **Question**: How might we be able to use sklearn to help us out here? Do some googling and name the class we could use.

**TODO: what is the general class in sklearn that helps us with converting data to feature vectors?**

Now, call your two functions to get the vocabulary and vectorize the data into feature vectors. You should end up with $X$, the array of feature vectors for each sample, and $y$, the array of truth labels (in this case, the review score) for each sample.

In [60]:
import csv
import numpy as np

# Read every row as [review text, read count, rating]
with open('./train.csv', 'r', newline='') as train_f:
    review_data = [row for row in csv.reader(train_f)]
review_text = [row[0] for row in review_data]

# process your file here, call your functions, etc
vocab = get_vocabulary(review_text)

# X holds one feature vector per review
X = np.array([vectorize(row, vocab) for row in review_data])

# y holds the truth labels: the review score in the third column
y = np.array([float(row[2]) for row in review_data])

# make sure they're numpy array objects.
assert isinstance(X, np.ndarray)
assert isinstance(y, np.ndarray)

Q: How many features do we have? In other words, what is the dimensionality of our training set?

Each sample has one feature per vocabulary token plus one for the number of review reads, so with `MAX_VOCAB_SIZE = 12` there are at most 13 features. The review score is the label $y$, not a feature.

## Part II: Training

TODO

## Part III: Evaluation

TODO