# Neural networks from scratch

In our last sessions, we talked about what it means for a supervised algorithm to *learn*. We saw that, in supervised learning, each **input** comes with a corresponding **output** label. The goal is to learn weights for the input features that produce the correct output; this is what it means to train a model.
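
To make "learning the weights" concrete, here is a minimal sketch of a linear model trained by gradient descent on made-up data. The toy inputs, the hidden `true_w`, the learning rate, and the iteration count are all assumptions chosen for illustration; none of them come from this exercise's dataset.

```python
import numpy as np

# Toy data (made up for illustration): 4 samples, 2 features each.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
true_w = np.array([2.0, -1.0])   # the "correct" weights, hidden from the learner
y = X @ true_w                   # output labels

w = np.zeros(2)                  # start with all-zero weights
learning_rate = 0.1

for _ in range(500):
    predictions = X @ w
    error = predictions - y
    gradient = 2 * X.T @ error / len(y)   # gradient of the mean squared error
    w -= learning_rate * gradient

print(w)
```

After enough iterations, `w` should end up very close to `true_w`, which is the sense in which the model has "learned" from the labeled samples.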

In this exercise, we will apply what we learned to predict the rating of a review from its text and the number of people who have read it.

The `train.csv` file contains a small number of reviews. Each row holds the text of the review, the number of people who have read the review, and the numerical score of the review:

```
"This video is incredible!!!",1,5.0
"Sad and sorry excuse for words on a screen",8,1.5
"The instructor was amazing, incredible in every way!",2,5.0
"I feel sorry for anyone who waste their time on this book.",9,1.0
"This is so-so, not bad but not good either.",6,3.0
...
```
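
As a rough illustration of this layout, the sketch below parses two of the sample rows with Python's `csv` module. It reads from an in-memory string rather than the real `train.csv`, and the variable names are only illustrative.

```python
import csv
import io

# Two rows copied from the sample above; io.StringIO stands in for the file.
sample = io.StringIO(
    '"This video is incredible!!!",1,5.0\n'
    '"Sad and sorry excuse for words on a screen",8,1.5\n'
)

rows = []
for text, readers, score in csv.reader(sample):
    # csv.reader yields every field as a string, so convert the numbers.
    rows.append((text, int(readers), float(score)))

print(rows[0])
```

Note that `csv.reader` returns every field as a string, so the reader count and the score need an explicit conversion before they can be used as numbers.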

## Part I: Preprocessing

The first step in most NLP pipelines is cleaning and tokenization. Here, that means extracting the vocabulary and its count statistics.

In real life, you will have to decide what constitutes a **token**. Here, let's use a simple definition. A token:
* is a single valid English word
* is lowercased
* has plurals stemmed to their singular form
* is not a stop word
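
The definition above could be approximated in plain Python as follows. This is a deliberately naive sketch: the stop-word set and the plural rule are assumptions made up for illustration, and it does not check that a word is valid English.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
EXAMPLE_STOP_WORDS = {"a", "an", "and", "the", "is", "this", "for", "on", "in"}

def naive_tokens(text):
    """Lowercase, keep alphabetic words, drop stop words, strip a plural 's'."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in EXAMPLE_STOP_WORDS:
            continue
        # Very crude stemming: "words" -> "word", but leave "ss" endings alone.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        tokens.append(word)
    return tokens

print(naive_tokens("Sad and sorry excuse for words on a screen"))
# prints ['sad', 'sorry', 'excuse', 'word', 'screen']
```

A rule this simple will mangle words such as "does", which is one reason a dedicated stemming or lemmatization library is usually preferable.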


In [37]:
from collections import Counter
import spacy

# Let's try a max vocab size of 12 for now, and see how this works
MAX_VOCAB_SIZE = 12
# STOP_WORDS = [TODO - DECIDE WHAT YOUR STOP WORDS ARE]

nlp = spacy.load('en_core_web_sm')

def get_vocabulary(reviews):
    '''
        Given a list of reviews, returns a dictionary mapping each token to its count.
        For example, given ["foo bar", "bar baz hello", "hello world"],
        return {"foo": 1, "bar": 2, "baz": 1, "hello": 2, "world": 1}
    '''
    vocabulary = Counter()
    for review in reviews:
        vocabulary.update(token.text for token in nlp(review))        
    return dict(vocabulary)

assert get_vocabulary([]) == {}
assert get_vocabulary(['']) == {}
assert get_vocabulary(['hello']) == {'hello': 1}
assert get_vocabulary(['foo bar', 'bar baz hello', 'hello world']) == {'foo': 1, 'bar': 2, 'baz': 1, 'hello': 2, 'world': 1}

print('all good')

all good


Stemming and excluding stop words by hand is tedious work. **Question**: Which libraries could help with this? Do some googling and write down at least two libraries below, along with the function calls you would make.

**TODO: what are two libraries and their functions I could call to make this easier...**

Now, we will convert each sample into a *feature vector*. In real life, you will decide what features to use, and in the future, we will use deep learning to help us decide features. But for now, we will use these features:
* Counts of each token
* The number of people who have read the review
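
One way to picture this feature vector is the sketch below, which builds it by hand with NumPy. The helper `to_feature_vector` and the fixed `vocab_order` list are hypothetical names used only for illustration; the exercise code itself uses scikit-learn instead.

```python
import numpy as np
from collections import Counter

def to_feature_vector(tokens, readers, vocab_order):
    """Return [count of each vocab token..., number of readers] as floats."""
    counts = Counter(tokens)
    # Tokens outside the vocabulary are ignored; missing tokens count as 0.
    vector = [counts[token] for token in vocab_order]
    vector.append(readers)
    return np.array(vector, dtype=float)

vocab_order = ["bar", "baz", "foo", "hello", "world"]  # fixed column order
x = to_feature_vector("hello world world baz".split(), 42, vocab_order)
print(x)
```

Here the first five entries are the token counts in `vocab_order`, and the last entry is the reader count, so every sample maps to a vector of the same length.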

In [38]:
import numpy as np
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()

def vectorize(data, vocab):
    '''
        Given a sample and the vocabulary dictionary, convert the sample to a feature vector. 
        For example, given "I love how incredible this was.",2,4.5 
        it should return a vector of counts of each vocab word and the # of people who have read the review
    '''
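    # Fit on the vocabulary plus one extra numeric feature for the reader count.
    vocab = dict(vocab, _readers_=1)
    vectorizer.fit([vocab])
    
    print(data)
    review, readers, *_ = data
    features = get_vocabulary([review])
    # Cast to float: DictVectorizer one-hot encodes string values as
    # "_readers_=<value>", a feature name that was never fitted and is dropped.
    features['_readers_'] = float(readers)
    
    return vectorizer.transform(features)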
    
vocab = get_vocabulary(['foo bar', 'bar baz hello', 'hello world'])
print(vocab)
print(vectorize(['hello world world world', 42, 4.5], vocab))
print(vectorize(['hello world world baz', 42, 4.5], vocab))

{'baz': 1, 'foo': 1, 'hello': 2, 'world': 1, 'bar': 2}
['hello world world world', 42, 4.5]
  (0, 0)	42.0
  (0, 4)	1.0
  (0, 5)	3.0
['hello world world baz', 42, 4.5]
  (0, 0)	42.0
  (0, 2)	1.0
  (0, 4)	1.0
  (0, 5)	2.0


Again, vectorizing by hand can be tough work. **Question**: How might we be able to use sklearn to help us out here? Do some googling and name the class we could use.

**TODO: what is the general class in sklearn that helps us with converting data to feature vectors?**

Now, call your two functions to get the vocabulary and vectorize the data into feature vectors. You should end up with $X$, the array of feature vectors (one per sample), and $y$, the array of ground-truth labels (in this case, the review score) for each sample.

In [49]:
# call the get_vocabulary() and vectorize() functions
import csv

with open('./train.csv', 'r', newline='') as train_f:
    reviews = list(csv.reader(train_f))

# process your file here, call your functions, etc.
vocab = get_vocabulary([review[0] for review in reviews])

# Each vectorize() call returns a 1 x n_features sparse matrix; densify and
# stack them so X is a dense numeric array of shape (n_samples, n_features).
X = np.vstack([vectorize(review, vocab).toarray() for review in reviews])
y = np.asarray([float(review[-1]) for review in reviews])

# make sure they're numpy array objects.

assert isinstance(X, np.ndarray)
assert isinstance(y, np.ndarray)
print("Dimensionality is", len(vocab), "+ 1")

['This video is incredible!!!', '1', '5.0']
['Sad and sorry excuse for words on a screen', '8', '1.5']
['The instructor was amazing and incredible in every way!', '2', '5.0']
['I feel sorry for anyone who waste their time on this book.', '9', '1.0']
['This is so-so not bad but not good either.', '6', '3.0']
['Amazing resource! A must read for anyone learning web development!', '3', '4.5']
['I had to excuse myself halfway through the training because it was so boring', '7', '2.5']
['Great book for beginner programmers.', '2', '4.0']
['Best way for learning design patterns', '1', '5.0']
['Waste of time and waste of money.', '6', '2.0']
['the video was not very legible on screen', '6', '2.0']
['really boring and not very practical', '8', '2.0']
['i just wasted 2 hours of my life on this', '7', '1.5']
['must read for beginners', '3', '4.5']
['great great great book!!!', '1', '5.0']
Dimensionality is 78 + 1


**Question**: How many features do we have? In other words, what is the dimensionality of our training set?

**TODO: Type in your answer**

## Part II: Training

TODO

## Part III: Evaluation

TODO