# Neural networks from scratch

In our last sessions, we talked about what it means for a supervised algorithm to *learn*. We saw that, in a supervised learning algorithm, we have an **input** with a corresponding **output** label. The goal is to learn the weights for given features that will give us the correct output - this is what it means to train a model.

In this exercise, we will apply what we learned to predict the score of a review from its text and the number of people who have read it.

In the `train.csv` file, there are a small number of reviews; for each review, there is the text of the review, the number of people who have read the review, and the numerical score of the review:

```
"This video is incredible!!!",1,5.0
"Sad and sorry excuse for words on a screen",8,1.5
"The instructor was amazing, incredible in every way!",2,5.0
"I feel sorry for anyone who waste their time on this book.",9,1.0
"This is so-so, not bad but not good either.",6,3.0
...
```
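
Note that some review texts contain commas, so splitting each line on `,` would break them apart. A minimal sketch, using only the standard library and two rows copied from the sample above, of how `csv.reader` respects the quoted field (the names `sample` and `rows` are illustrative):

```python
import csv
import io

# Two rows in the same layout as train.csv: text, readers, score
sample = (
    '"This video is incredible!!!",1,5.0\n'
    '"The instructor was amazing, incredible in every way!",2,5.0\n'
)

# csv.reader keeps the quoted text together, even when it contains a comma
rows = [
    (text, int(readers), float(score))
    for text, readers, score in csv.reader(io.StringIO(sample))
]

for row in rows:
    print(row)
```

Each row comes back as a `(text, readers, score)` tuple with the numeric fields already cast.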

## Part I: Preprocessing

Cleaning and tokenization come first in most NLP pipelines. From the resulting tokens, we then extract the vocabulary and count statistics.

In real life, you will have to decide what constitutes a **token**. Here, let's use a simple definition. A token is:
* a single valid English word
* lowercased
* stemmed, so that plurals reduce to their singular form
* not a stop word


In [61]:
import re
from collections import Counter, OrderedDict

from porter2stemmer import Porter2Stemmer

DEBUG = False
MAX_VOCAB_SIZE = 12
REVIEWS = ["Foo bar", "bar baz hello", "hello world", "functionally"]
STOP_WORDS = set('a an and are as at be but by for i if in into is it no not of on or such that the their then '
                 'there these they this to was will with'.split(' '))
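# Raw string avoids invalid-escape warnings; hyphens are deliberately kept so "so-so" stays one token
PUNCTUATION = re.compile(r'[~`!@#$%^&*()+={\[}\]|\\:;"\',<.>/?]')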

MORE_REVIEWS = [
    ("This video is incredible!!!",1,5.0),
    ("Sad and sorry excuse for words on a screen",8,1.5),
    ("The instructor was amazing, incredible in every way!",2,5.0),
    ("I feel sorry for anyone who waste their time on this book.",9,1.0),
    ("This is so-so, not bad but not good either.",6,3.0),
]


def print_debug(*texts, DEBUG=DEBUG):
    if DEBUG:
        print(*texts)
        

def tokenize(text):
    """Tokenize an entire chunk of text

    That is:
      - remove punctuation
      - cast to lower case
      - split on remaining spaces
      - remove stop words
    """
    text = PUNCTUATION.sub('', text)
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def stem_words(words):
    """Stem all words in a container"""
    stemmer = Porter2Stemmer()
    return [stemmer.stem(w) for w in words]


def get_vocabulary(reviews, most_common=MAX_VOCAB_SIZE):
    '''
        Given a list of reviews, returns a vocabulary dictionary mapping each
        token to its count, limited to the `most_common` most frequent tokens.
        For example, given ["foo bar", "bar baz hello", "hello world"],
        return {"bar": 2, "baz": 1, "foo": 1, "hello": 2, "world": 1}
    '''
    vocab = Counter()
    
    for phrase in reviews:
        vocab.update(stem_words(tokenize(phrase)))
        
    v = OrderedDict(sorted(vocab.most_common(most_common)))
    return v

v1 = get_vocabulary(REVIEWS)
print_debug(v1)
assert v1 == {'bar': 2, 'hello': 2, 'foo': 1, 'baz': 1, 'world': 1, 'function': 1}

v2 = get_vocabulary([m[0] for m in MORE_REVIEWS], 25)
print_debug(v2)
assert v2 == {
    'amaz': 1, 'anyon': 1, 'bad': 1, 'book': 1, 'either': 1, 'everi': 1, 'excus': 1, 'feel': 1, 'good': 1,
    'incred': 2, 'instructor': 1, 'sad': 1, 'screen': 1, 'so-so': 1, 'sorri': 2, 'time': 1, 'video': 1,
    'wast': 1, 'way': 1, 'who': 1, 'word': 1
}


Stemming and excluding stop words is tedious work. **Question**: What libraries might you use that can help you with this? Do some googling and write down below at least two libraries and function calls you can make.

**TODO: what are two libraries and their functions I could call to make this easier...**

Stemming is not a straightforward task, so we chose a library available on PyPI, `porter2stemmer`, which performs stemming for us using the Porter2 (Snowball English) stemming algorithm.

Some other libraries that would be useful are `nltk` (the Natural Language Toolkit), `pystemmer`, `whoosh` (a full-text search engine akin to Lucene that also includes stemming tools), and `gensim`.
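
As one possible answer to the question above, here is a minimal sketch using `nltk`, assuming the package is installed. `SnowballStemmer("english")` implements the same Porter2 algorithm and needs no extra data downloads with its default settings. The word list is illustrative.

```python
from nltk.stem.snowball import SnowballStemmer

# The "english" Snowball stemmer is the Porter2 algorithm
stemmer = SnowballStemmer("english")

words = ["words", "functionally", "incredible", "sorry"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

`nltk` also ships a stop-word list (`nltk.corpus.stopwords`), but that one requires downloading the corpus first, so it is left out of this sketch.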

Now, we will convert each sample into a *feature vector*. In real life, you will decide what features to use, and in the future, we will use deep learning to help us decide features. But for now, we will use these features:
* Counts of each token
* The number of people who have read the review
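
To make the target shape concrete, here is a small worked example using only the standard library. The vocabulary, the token list, and the reader count are hypothetical values chosen for illustration. The resulting vector has one slot per vocabulary token, followed by the reader count.

```python
from collections import Counter

# Hypothetical vocabulary (already stemmed) in a fixed order
vocab = ["bad", "good", "incred", "sorri"]

# Hypothetical review after tokenizing and stemming, read by 2 people
tokens = ["incred", "good", "good"]
num_readers = 2

# Count how often each vocabulary token occurs in this one review
counts = Counter(tokens)
features = [counts[w] for w in vocab] + [num_readers]

print(features)  # [0, 2, 1, 0, 2]
```

The counts are per review: `good` appears twice in this review, so its slot holds 2 regardless of how often it appears in the rest of the corpus.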

In [55]:
import numpy as np

def vectorize(data, vocab):
    '''
        Given a sample and the vocabulary dictionary, convert the sample to a feature vector.
        For example, given ("I love how incredible this was.", 2, 4.5)
        it should return a vector of counts of each vocab word, followed by the # of people who have read the review
    '''
    review_text, num_viewers, _ = data  # the score is the label, not a feature

    words = stem_words(tokenize(review_text))

    # Count occurrences in *this* review, not the corpus-wide count stored in vocab
    occurrences = [words.count(w) for w in vocab]
    # Build one flat numeric vector: token counts first, reader count last
    return np.array(occurrences + [num_viewers])


vocab = get_vocabulary([m[0] for m in MORE_REVIEWS], 12)
    
for review in MORE_REVIEWS:
    v = vectorize(review, vocab)
    print_debug(v)
    
print_debug(vocab)


Again, vectorizing by hand can be tough work. **Question**: How might we be able to use sklearn to help us out here? Do some googling and name the class we could use.

**TODO: what is the general class in sklearn that helps us with converting data to feature vectors?**

Now, call your two functions to get the vocabulary and vectorize the data into feature vectors. You should end up with $X$, the array of feature vectors for each sample, and $y$, the array of truth labels (in this case, the review score) for each sample.

In [57]:
import csv


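def get_reviews_from_file(filepath):
    """Read reviews from a CSV file, casting readers to int and score to float"""
    reviews = []
    # newline='' is what the csv module recommends when opening files
    with open(filepath, 'r', newline='') as csvfile: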
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            reviews.append((row[0], int(row[1]), float(row[2])))
            
    return reviews


def get_feature_vectors(reviews, most_common_words=MAX_VOCAB_SIZE):
    """Return a 2-D numpy array with one feature vector (row) per review"""
    # Build the vocabulary from the reviews passed in, not the hard-coded samples
    vocab = get_vocabulary([r[0] for r in reviews], most_common_words)
    return np.array([vectorize(r, vocab) for r in reviews])


def get_truth_labels(reviews):
    """Get our expected values"""
    return np.array([r[2] for r in reviews])

def get_features_and_truths(filepath):
    reviews = get_reviews_from_file(filepath)
    print_debug(reviews)
    return (get_feature_vectors(reviews),
            get_truth_labels(reviews))


x, y = get_features_and_truths('./train.csv')

print_debug(x)
print_debug(y)

# make sure they're numpy array objects.
assert isinstance(x, np.ndarray)
assert isinstance(y, np.ndarray)

**Q:** How many features do we have? In other words, what is the dimensionality of our training set?

**A:** Each token count is its own feature, and the number of readers adds one more, so the dimensionality is the vocabulary size plus one. With `MAX_VOCAB_SIZE = 12`, that gives at most 13 features per review.

## Part II: Training

TODO

## Part III: Evaluation

TODO