# Neural networks from scratch

In our last sessions, we talked about what it means for a supervised algorithm to *learn*. We saw that, in a supervised learning algorithm, each **input** comes with a corresponding **output** label. The goal is to learn weights for the input features that produce the correct output; this is what it means to train a model.
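
As a minimal sketch of what "weights for features" means, assume the simplest possible model: a weighted sum of the features plus a bias. The feature values, weights, bias, and target below are made up for illustration and do not come from the exercise data.

```python
# Hypothetical values, chosen only for illustration.
features = [2.0, 0.0, 1.0]   # one input sample
weights = [0.5, -1.0, 1.5]   # one weight per feature
bias = 1.0
target = 4.0                 # the label we would like to predict


def predict(features, weights, bias):
    """Weighted sum of the features plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias


prediction = predict(features, weights, bias)
error = prediction - target
print(prediction, error)
```

Training, in this picture, means adjusting `weights` and `bias` so that `error` shrinks across all samples.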

In this exercise, we will apply what we learned to predict the numerical score of a review from its text and the number of people who have read it.

The `train.csv` file contains a small number of reviews. Each row holds the text of the review, the number of people who have read the review, and the numerical score of the review:

```
"This video is incredible!!!",1,5.0
"Sad and sorry excuse for words on a screen",8,1.5
"The instructor was amazing, incredible in every way!",2,5.0
"I feel sorry for anyone who waste their time on this book.",9,1.0
"This is so-so, not bad but not good either.",6,3.0
...
```
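
To see how rows like these split into fields, here is a small sketch that parses three of the sample rows with Python's `csv` module. The `SAMPLE` string and the `parse_rows` helper are hypothetical names used only for this illustration; the sketch reads from an in-memory string instead of `train.csv`.

```python
import csv
import io

# Three rows copied from the sample above.
SAMPLE = '''"This video is incredible!!!",1,5.0
"Sad and sorry excuse for words on a screen",8,1.5
"The instructor was amazing, incredible in every way!",2,5.0
'''


def parse_rows(text):
    """Parse CSV text into (review, readers, score) tuples with numeric types."""
    rows = []
    for review, readers, score in csv.reader(io.StringIO(text)):
        rows.append((review, int(readers), float(score)))
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Note that the comma inside the third review stays part of the text, because the field is quoted. Splitting each line on `,` by hand would break that row.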

## Part I: Preprocessing

The first step in every NLP pipeline is cleaning and tokenization. Here, that means extracting the vocabulary and its count statistics.

In real life, you will have to decide what constitutes a **token**. Here, let's use a simple definition. A token is:
* a single valid English word
* lowercased
* stemmed from its plural to its singular form
* not a stop word


In [3]:
from collections import Counter

# Let's try a max vocab size of 12 for now, and see how this works
MAX_VOCAB_SIZE = 12
STOP_WORDS = ['a', 'an', 'and', 'are', 'as', 'at',
              'be', 'but', 'by', 'for', 'if', 'in',
              'into', 'is', 'it', 'no', 'not', 'of',
              'on', 'or', 'such', 'that', 'the', 'their',
              'then', 'there', 'these', 'they', 'this',
              'to', 'was', 'will', 'with']
PUNCTUATIONS = ['.', ',', '?', '!', '\n']


def get_vocabulary(reviews):
    '''
        Given a list of reviews, returns a vocabulary dictionary mapping each
        token to its count, keeping only the MAX_VOCAB_SIZE most frequent tokens.
        For example, given ["foo bar", "bar baz hello", "hello world"],
        return {"bar": 2, "hello": 2, "foo": 1, "baz": 1, "world": 1}
    '''
    counter = Counter()
    for review in reviews:
        counter.update(tokenize_sentence(review))
    # Apply the size cap after counting everything, so we keep the most
    # frequent tokens rather than whichever tokens happen to appear first.
    return dict(counter.most_common(MAX_VOCAB_SIZE))

def tokenize(word):
    ''' Normalizes a single word; returns '' for stop words and empty strings. '''
    word = word.lower()
    # Strip punctuation first, so that "was." is recognized as the stop word "was"
    for punctuation in PUNCTUATIONS:
        word = word.replace(punctuation, '')
    if word in STOP_WORDS:
        return ''
    # Naive stemming: drop a trailing "s" ("apples" -> "apple")
    if len(word) > 1 and word.endswith('s'):
        word = word[:-1]
    return word

def tokenize_sentence(sentence):
    words = []
    # split() with no argument also handles repeated spaces and newlines
    for word in sentence.split():
        token = tokenize(word)
        if token:
            words.append(token)
    return words


get_vocabulary(["foo bar", "bar baz hello", "hello world", "HellO Josh", "I like apples", "I love how incredible this was."])

{'hello': 3,
 'bar': 2,
 'i': 2,
 'foo': 1,
 'baz': 1,
 'world': 1,
 'josh': 1,
 'like': 1,
 'apple': 1,
 'love': 1,
 'how': 1,
 'incredible': 1}

Stemming and excluding stop words by hand is tedious work. **Question**: What libraries might you use that can help you with this? Do some googling and write down below at least two libraries and function calls you can make.

**TODO: what are two libraries and their functions I could call to make this easier...**
1. NLTK: `nltk.tokenize.word_tokenize(text)` to tokenize, `nltk.stem.PorterStemmer().stem(word)` to stem, and `nltk.corpus.stopwords.words('english')` for a stop word list
2. spaCy: `nlp = spacy.load("en_core_web_sm"); doc = nlp(text)`, then read `token.lemma_` and `token.is_stop` on each token

Now, we will convert each sample into a *feature vector*. In real life, you will decide what features to use, and in the future, we will use deep learning to help us decide features. But for now, we will use these features:
* Counts of each token
* The number of people who have read the review

In [43]:
import numpy as np

def vectorize(data, vocab):
    '''
        Given a sample and the vocabulary dictionary, convert the sample to a feature vector.
        For example, given "I love how incredible this was.",2,4.5
        it should return a vector with the count of each vocab word in the sentence,
        followed by the # of people who have read the review
    '''
    sentence, num_of_readers, rating = data
    # Count how often each token occurs in this sample (not in the whole corpus)
    occurrences = Counter(tokenize_sentence(sentence))
    # One slot per vocab token, in the vocab's order, so every vector has the same length
    features = [occurrences[token] for token in vocab]
    # csv gives us strings, so convert the reader count to a number
    features.append(float(num_of_readers))
    return features


Again, vectorizing by hand can be tough work. **Question**: How might we be able to use sklearn to help us out here? Do some googling and name the class we could use.

**TODO: what is the general class in sklearn that helps us with converting data to feature vectors?**
1. `sklearn.feature_extraction.text.CountVectorizer`
2. `sklearn.feature_extraction.text.TfidfVectorizer`

Now, call your two functions to get the vocabulary and vectorize the data into feature vectors. You should end up with $X$, the array of feature vectors for each sample, and $y$, the array of ground-truth labels (in this case, the review score) for each sample.

In [64]:
import csv


# call the get_vocabulary() and vectorize() functions
def parse_csv():
    rows = []
    with open('./train.csv', 'r', newline='') as train_f:
        reader = csv.reader(train_f)
        for row in reader:
            rows.append(row)
    return rows

# process your file here, call your functions, etc.
data = parse_csv()
sentences = [d[0] for d in data]
vocab = get_vocabulary(sentences)

X = np.asarray([vectorize(d, vocab) for d in data])
y = np.asarray([float(d[-1]) for d in data])

# make sure they're numpy array objects.
assert isinstance(X, np.ndarray)
assert isinstance(y, np.ndarray)
# one row per review, one column per feature
assert X.shape == (len(data), len(vocab) + 1)

Each row of `X` is now a fixed-length feature vector: the 15 reviews in `train.csv` give `X` 15 rows and `len(vocab) + 1` columns.


Q: How many features do we have? In other words, what is the dimensionality of our training set?

**TODO: Type in your answer**
`len(vocab) + 1`: one count for each vocabulary token, plus the number of people who have read the review. With a 12-token vocabulary, that is 13 features.

## Part II: Training

TODO

## Part III: Evaluation

TODO