# Exercise 1: Sentiment analysis as bag of words

In [4]:
import bz2
import pandas as pd
import numpy as np
import sklearn.feature_extraction

In [5]:
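def get_labels_and_texts(file, limit=100000):
    """Read up to `limit` reviews; a `limit` of 0 or less reads the whole file."""
    labels = []
    texts = []
    for line_number, line in enumerate(bz2.BZ2File(file), start=1):
        x = line.decode("utf-8")
        # The label digit follows the 9-character prefix "__label__";
        # subtracting 1 maps the labels 1/2 to the classes 0/1.
        labels.append(int(x[9]) - 1)
        # The review text starts after the label digit.
        texts.append(x[10:].strip())
        if limit > 0 and line_number >= limit:
            break
    return np.array(labels), texts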
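
The slicing in `get_labels_and_texts` relies on the layout of each line, which we assume to be `__label__1` or `__label__2`, followed by a space and the review text. The following sketch applies the same indexing to a single made-up line (the sample text is invented for illustration and not taken from the data).

```python
# A made-up line in the assumed "__label__<digit> <text>" layout.
sample = "__label__2 Great album: I keep coming back to it.\n"

prefix = "__label__"
print(len(prefix))               # the prefix has 9 characters, so the digit sits at index 9

label = int(sample[9]) - 1       # labels 1/2 in the file become classes 0/1
text = sample[10:].strip()       # everything after the digit, whitespace trimmed

print(label, repr(text))
```

If the files used a different prefix, the indices 9 and 10 would have to change accordingly.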


If `data/amazon/train.ft.txt.bz2` does not exist, we download it. (The exclamation mark `!` indicates that the command is executed by the underlying shell, not by Python. That is why the syntax does not look like Python at all.)

In [1]:
! if ! [[ -f data/amazon/train.ft.txt.bz2 ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/train.ft.txt.bz2 > data/amazon/train.ft.txt.bz2; fi
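
For readers who prefer to stay in Python, the same "download only if missing" logic could be sketched as follows. This is not part of the original notebook: the function name `download_if_missing` and its `fetch` parameter are hypothetical, and the sketch uses only the standard library.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, target, fetch=urlretrieve):
    """Download `url` to `target` unless the file already exists.

    `fetch` is a hypothetical hook so the download step can be swapped out.
    Returns True if a download was triggered, False otherwise.
    """
    target = Path(target)
    if target.is_file():
        return False
    # Unlike the shell one-liner, create the target directory if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(target))
    return True
```

A call such as `download_if_missing("https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/train.ft.txt.bz2", "data/amazon/train.ft.txt.bz2")` would then take the place of the shell command.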

The test file `data/amazon/test.ft.txt.bz2` is downloaded and loaded in the same way. The corresponding cells appear near the end of this notebook and must be run before Task 1, which uses `test_texts`.

The following line opens and parses the training file we have just downloaded. The function `get_labels_and_texts(...)` is defined above. Because we do not override its default argument `limit`, the function loads only the first 100,000 reviews.

In [10]:
train_labels, train_texts = get_labels_and_texts('data/amazon/train.ft.txt.bz2')

The training data is now loaded into variables: `train_labels` contains the classes (0 or 1), `train_texts` the corresponding reviews. Feel free to inspect them.
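
One quick check is the class balance. The sketch below counts labels with `collections.Counter` on a tiny made-up list; in the notebook, one would pass `train_labels.tolist()` instead. The helper name `class_balance` is hypothetical.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: fraction of instances} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


demo_labels = [1, 1, 0, 1, 0]   # invented values, not taken from the data
print(class_balance(demo_labels))
```

A strongly skewed balance would be worth knowing about before interpreting accuracy values later on.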

In [12]:
# shows the first five texts
train_texts[:5]

['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^',
 "The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.",
 'Amazing!: This soundtrack is my favorite music of all

## Task 1

In this exercise, we want the input to be a document-term matrix, i.e., each document is represented by a numeric vector. The vector has one dimension for each token in the vocabulary, which is the set of unique tokens in the entire (training) corpus. The value in each dimension is the number of times the token occurs in the document.

As an example, consider the following matrix

| document | dog | cat | mouse | the | a | an |
| --- | --- | --- | --- | --- | --- | --- | 
| d1 | 5 | 6 | 0 | 10 | 5 | 6 |

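Here, the document `d1` contains the token `dog` five times, `cat` six times, `mouse` not at all, `the` ten times, and so on. With more documents, the matrix gets one additional row per document.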
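
To make the idea concrete, here is a minimal pure-Python sketch that builds such a matrix for two invented sentences. It is only meant to illustrate the principle: `CountVectorizer` uses its own tokenizer (by default it drops single-character tokens such as `a`) and returns a sparse matrix. The function name `document_term_matrix` is hypothetical.

```python
import re
from collections import Counter


def document_term_matrix(documents):
    """Build a count-based document-term matrix.

    Returns (vocabulary, rows): a sorted list of tokens and one list of
    counts per document, aligned with the vocabulary.
    """
    # Lowercase and keep runs of word characters; a simplification for illustration.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


docs = ["The dog chased the cat", "A cat saw a mouse"]
vocabulary, rows = document_term_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row has the same length as the vocabulary, so all documents live in the same vector space regardless of their length.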

In [13]:
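# Restrict the vocabulary to the 5,000 most frequent tokens.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(max_features=5000)

# Learn the vocabulary from the training texts only.
vectorizer.fit(train_texts)

# Map both corpora onto that vocabulary.
# Note: test_texts is created by the test-loading cell (In [7]), which must have been run.
train_texts_vec = vectorizer.transform(train_texts)
test_texts_vec = vectorizer.transform(test_texts)
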

## Task 2

The document vectors are now stored as a SciPy sparse matrix, i.e., they are not fully realized (only the non-zero entries are stored). To make them dense, we can use the sparse matrix method `todense()`. Since a dense matrix needs memory for every cell, this is also a good opportunity to limit the number of training instances for development.
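
A back-of-the-envelope calculation shows why limiting the number of instances matters. The sketch assumes 8 bytes per matrix cell (64-bit integers, which we take to be the default for count matrices) and the 5,000 features configured above; the helper name `dense_size_bytes` is hypothetical.

```python
def dense_size_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for a dense n_rows x n_cols matrix (assumes 8-byte values)."""
    return n_rows * n_cols * bytes_per_value


full = dense_size_bytes(100_000, 5_000)    # all reviews loaded by default
subset = dense_size_bytes(10_000, 5_000)   # a development subset

print(f"{full / 1e9:.1f} GB for 100,000 reviews")
print(f"{subset / 1e9:.1f} GB for 10,000 reviews")
```

The sparse representation avoids most of this cost because a single review contains only a small fraction of the 5,000 vocabulary tokens.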


In [12]:
numInstances = 10000

# Densify only the first numInstances rows; np.asarray turns the resulting
# numpy.matrix into a plain ndarray.
x_train = np.asarray(train_texts_vec[:numInstances].todense())
y_train = train_labels[:numInstances]

## Task 3

We are now ready to define the neural network. Please define (for starters) one with
1. an input layer
2. a hidden layer with 5 units and activation function `sigmoid`
3. an output layer with activation function `sigmoid`
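
To see what these three layers compute, the following pure-Python sketch runs a forward pass through a toy network (3 inputs, 2 hidden units, 1 output) and counts the trainable parameters of the network described above. It is an illustration with invented all-zero weights, not the Keras implementation; the helper names `dense` and `parameter_count` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer with sigmoid activation.

    `weights[j]` holds the incoming weights of unit j.
    """
    return [sigmoid(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]


def parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of the hidden and the output layer."""
    hidden = n_inputs * n_hidden + n_hidden
    output = n_hidden * n_outputs + n_outputs
    return hidden + output


# Toy forward pass: 3 input features, 2 hidden units, 1 output unit.
x = [1.0, 0.0, 2.0]
hidden = dense(x, weights=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], biases=[0.0, 0.0])
output = dense(hidden, weights=[[0.0, 0.0]], biases=[0.0])
print(hidden, output)            # all-zero weights give sigmoid(0) = 0.5 everywhere

# 5,000 input features and 5 hidden units, as in this exercise.
print(parameter_count(5000, 5))
```

Almost all parameters sit between the input and the hidden layer, because every vocabulary token is connected to every hidden unit.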

In [16]:
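# Assumption: the Keras API bundled with TensorFlow. With standalone Keras,
# use `from keras import layers, models, regularizers` instead.
from tensorflow.keras import layers, models, regularizers

ffnn = models.Sequential()
# Hidden layer: 5 sigmoid units over the 5,000 vocabulary features,
# with L2 regularization on the activations.
ffnn.add(layers.Dense(5, input_shape=(5000,), activation="sigmoid", activity_regularizer=regularizers.l2(0.2)))
# Output layer: a single sigmoid unit for the binary class.
ffnn.add(layers.Dense(1, activation="sigmoid"))

ffnn.compile(loss="mean_squared_error", optimizer="sgd",
  metrics=["accuracy"])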

This model `ffnn` can now be trained on the input data, using the function `fit()`.

In [None]:
ffnn.fit(x_train, y_train, epochs=25, batch_size=5, verbose=1)
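
During training, Keras reports the `accuracy` metric requested in `compile()`. For a single sigmoid output, this conceptually amounts to thresholding the predicted value at 0.5 and comparing it with the true class. The sketch below shows that computation in plain Python on invented values; it is not the Keras implementation, and the function name `binary_accuracy` is used here only for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of `threshold`."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == t) for p, t in zip(predictions, y_true))
    return correct / len(y_true)


y_true = [1, 0, 1, 0]            # invented labels
y_prob = [0.9, 0.2, 0.4, 0.6]    # invented network outputs
print(binary_accuracy(y_true, y_prob))
```

In this toy example, the first two predictions are correct and the last two are not.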

In [6]:
! if ! [[ -f data/amazon/test.ft.txt.bz2 ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/test.ft.txt.bz2 > data/amazon/test.ft.txt.bz2; fi

In [7]:
test_labels, test_texts = get_labels_and_texts('data/amazon/test.ft.txt.bz2')

## Credits

This notebook is based on [this one](https://www.kaggle.com/muonneutrino/sentiment-analysis-with-amazon-reviews) by MuonNeutrino on Kaggle.