In [1]:
import project1 as p1
import utils
import numpy as np

# 6. Automatic review analyzer

> Now that you have verified the correctness of your implementations, you are ready to tackle the main task of this project: building a classifier that labels reviews as positive or negative using text-based features and the linear classifiers that you implemented in the previous section!

## The Data

> The data consists of several reviews, each of which has been labeled with −1 or +1, corresponding to a negative or positive review, respectively. The original data has been split into the following files:

- reviews_train.tsv (4000 examples)
- reviews_validation.tsv (500 examples)
- reviews_test.tsv (500 examples)

> To get a feel for how the data looks, we suggest first opening the files with a text editor, spreadsheet program, or other scientific software package (like pandas).

## Translating reviews to feature vectors

> We will convert review texts into feature vectors using a bag of words approach. We start by compiling all the words that appear in a training set of reviews into a dictionary, thereby producing a list of 𝑑 unique words.


> We can then transform each of the reviews into a feature vector of length 𝑑 by setting the 𝑖th coordinate of the feature vector to 1 if the 𝑖th word in the dictionary appears in the review, or 0 otherwise. For instance, consider two simple documents "Mary loves apples" and "Red apples". In this case, the dictionary is the set {Mary; loves; apples; red}, and the documents are represented as (1; 1; 1; 0) and (0; 0; 1; 1).
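
The two-document example above can be reproduced with a minimal sketch. The helper names `build_dictionary` and `to_feature_matrix` are hypothetical, and splitting on whitespace after lowercasing is an assumption; the course's own `bag_of_words` may tokenize differently.

```python
import numpy as np

def build_dictionary(texts):
    """Map each unique lowercase word to a column index (order of first appearance)."""
    dictionary = {}
    for text in texts:
        for word in text.lower().split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

def to_feature_matrix(texts, dictionary):
    """Binary matrix: entry (r, c) is 1 if dictionary word c appears in text r."""
    features = np.zeros((len(texts), len(dictionary)), dtype=int)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in dictionary:
                features[row, dictionary[word]] = 1
    return features

docs = ["Mary loves apples", "Red apples"]
dictionary = build_dictionary(docs)
features = to_feature_matrix(docs, dictionary)
print(dictionary)  # {'mary': 0, 'loves': 1, 'apples': 2, 'red': 3}
print(features)    # rows (1, 1, 1, 0) and (0, 0, 1, 1)
```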

> A bag of words model can be easily expanded to include phrases of length 𝑚. A unigram model is the case for which 𝑚=1. In the example, the unigram dictionary would be (Mary; loves; apples; red). In the bigram case, 𝑚=2, the dictionary is (Mary loves; loves apples; Red apples), and representations for each sample are (1; 1; 0), (0; 0; 1). In this section, you will only use the unigram word features. These functions are already implemented for you in the `bag_of_words` function.

> In utils.py, we have supplied you with the `load_data` function, which can be used to read the .tsv files and returns the labels and texts. We have also supplied you with the `bag_of_words` function in project1.py, which takes the raw data and returns a dictionary of unigram words. The resulting dictionary is an input to `extract_bow_feature_vectors`, which computes a feature matrix of ones and zeros that can be used as the input for the classification algorithms. Using the feature matrix and your implementation of learning algorithms from before, you will be able to compute 𝜃 and 𝜃0.
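
Although only unigrams are used in this project, the bigram case (𝑚=2) described above can be sketched in a few lines. The `ngrams` helper is hypothetical and assumes simple whitespace tokenization with lowercasing.

```python
def ngrams(text, m):
    """Return the list of m-word phrases in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [" ".join(words[i:i + m]) for i in range(len(words) - m + 1)]

docs = ["Mary loves apples", "Red apples"]

# Build the bigram dictionary in order of first appearance.
bigram_dictionary = {}
for doc in docs:
    for gram in ngrams(doc, 2):
        bigram_dictionary.setdefault(gram, len(bigram_dictionary))

# One binary vector per document, one coordinate per bigram.
bigram_vectors = [
    [1 if gram in ngrams(doc, 2) else 0 for gram in bigram_dictionary]
    for doc in docs
]
print(list(bigram_dictionary))  # ['mary loves', 'loves apples', 'red apples']
print(bigram_vectors)           # [[1, 1, 0], [0, 0, 1]]
```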


# 7. Classification and Accuracy

> Now we need a way to actually use our model to classify the data points. In this section, you will implement a way to classify the data points using your model parameters, and then measure the accuracy of your model.

## Classification

> Implement a classification function that uses 𝜃 and 𝜃0 to classify a set of data points. You are given the feature matrix, 𝜃, and 𝜃0 as defined in previous sections. This function should return a numpy array of -1s and 1s. If a prediction is greater than zero, it should be considered a positive classification.

In [41]:
def classify(feature_matrix, theta, theta_0):
    """
    A classification function that uses theta and theta_0 to classify a set of
    data points.

    Args:
        feature_matrix - A numpy matrix describing the given data. Each row
            represents a single data point.
        theta - A numpy array describing the linear classifier.
        theta_0 - A real valued number representing the offset parameter.

    Returns: A numpy array of 1s and -1s where the kth element of the array is
    the predicted classification of the kth row of the feature matrix using the
    given theta and theta_0. If a prediction is GREATER THAN zero, it should
    be considered a positive classification.
    """
    # Your code here
    predictions = []
    
    for feature_vector in feature_matrix:
        prediction = np.dot(theta, feature_vector) + theta_0
        if prediction > 0:  # strictly greater than zero counts as positive
            predictions.append(1)
        else:
            predictions.append(-1)
    return np.array(predictions)
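
As an alternative to the explicit loop, the same rule can be written in vectorized form. This is only a sketch; the name `classify_vectorized` is hypothetical, and it assumes `feature_matrix` is a 2-D numpy array and `theta` a 1-D array of matching length.

```python
import numpy as np

def classify_vectorized(feature_matrix, theta, theta_0):
    """Label each row +1 if theta . x + theta_0 > 0, otherwise -1."""
    scores = feature_matrix @ theta + theta_0
    return np.where(scores > 0, 1, -1)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([1.0, -1.0])
labels = classify_vectorized(X, theta, 0.0)
print(labels)  # scores are 1, -1, 0, so the labels are 1, -1, -1
```

Note that a score of exactly zero is labeled -1, matching the "GREATER THAN zero" wording in the docstring.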

In [33]:
arr = np.array([])
res = np.array([])
for i in range(10):
    print(i)
    tmp = np.append(arr, [i])
    res = np.append(res, tmp)
res

0
1
2
3
4
5
6
7
8
9


array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [37]:
x = []
for i in range( 10 ):
    x.append( i )
x

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [40]:
y = []
y.append(1)
y.append(-1)
y

[1, -1]

## Accuracy

> We have supplied you with an accuracy function:

```python
def accuracy(preds, targets):
    """
    Given length-N vectors containing predicted and target labels,
    returns the fraction of correct predictions.
    """
    return (preds == targets).mean()

```


> The accuracy function takes a numpy array of predicted labels and a numpy array of actual labels and returns the prediction accuracy. You should use this function along with the functions that you have implemented thus far in order to implement classifier_accuracy.
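
A small check of what `accuracy` returns, using made-up predictions and targets. The function body is copied from the snippet above; the arrays are illustrative only.

```python
import numpy as np

def accuracy(preds, targets):
    """Fraction of positions where the prediction equals the target."""
    return (preds == targets).mean()

preds = np.array([1, -1, 1, 1])
targets = np.array([1, 1, 1, -1])
result = accuracy(preds, targets)
print(result)  # 0.5: two of the four predictions match
```

The return value is a fraction between 0 and 1, not a percentage.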

> The classifier_accuracy function should take 6 arguments:


- a classifier function that, itself, takes arguments (feature_matrix, labels, **kwargs)

- the training feature matrix

- the validation feature matrix

- the training labels

- the validation labels

- a **kwargs argument to be passed to the classifier function



### Approach

1. Training phase
    - todo:
        - Use the training dataset with the `classifier(feature_matrix, labels, **kwargs)` function to fit (theta, theta_0)
    - input:
        - `training feature matrix`
        - `training labels`
    - output:
        - `theta`, `theta_0`
2. Validation phase
    - todo:
        - Use `classify(feature_matrix, theta, theta_0)`, implemented in the previous problem, to assign labels (+1, -1) with the fitted (theta, theta_0)
    - input1:
        - `training feature matrix`
        - `(theta, theta_0)`
    - output1:
        - training dataset predictions
    - input2:
        - `validation feature matrix`
        - `(theta, theta_0)`
    - output2:
        - validation dataset predictions
3. Compute the classification accuracy phase
    - todo:
        - Use `accuracy(preds, targets)` to compare the training/validation predictions with the training/validation labels and compute the classification accuracy
    - input1:
        - training predictions
        - training labels
    - output1:
        - training accuracy
    - input2:
        - validation predictions
        - validation labels
    - output2:
        - validation accuracy

In [None]:
def classifier_accuracy(
        classifier,
        train_feature_matrix,
        val_feature_matrix,
        train_labels,
        val_labels,
        **kwargs):
    """
    Trains a linear classifier and computes accuracy.
    The classifier is trained on the train data. The classifier's
    accuracy on the train and validation data is then returned.

    Args:
        classifier - A classifier function that takes arguments
            (feature matrix, labels, **kwargs) and returns (theta, theta_0)
        train_feature_matrix - A numpy matrix describing the training
            data. Each row represents a single data point.
        val_feature_matrix - A numpy matrix describing the validation
            data. Each row represents a single data point.
        train_labels - A numpy array where the kth element of the array
            is the correct classification of the kth row of the training
            feature matrix.
        val_labels - A numpy array where the kth element of the array
            is the correct classification of the kth row of the validation
            feature matrix.
        **kwargs - Additional named arguments to pass to the classifier
            (e.g. T or L)

    Returns: A tuple in which the first element is the (scalar) accuracy of the
    trained classifier on the training data and the second element is the
    accuracy of the trained classifier on the validation data.
    """
    # Your code here

    ## 1. Training phase: fit (theta, theta_0) on the training data
    theta, theta_0 = classifier(train_feature_matrix, train_labels, **kwargs)

    ## 2. Validation phase: label both datasets with the fitted parameters
    training_predictions = classify(train_feature_matrix, theta, theta_0)
    validation_predictions = classify(val_feature_matrix, theta, theta_0)

    ## 3. Compute the classification accuracy phase
    training_accuracy = p1.accuracy(training_predictions, train_labels)
    validation_accuracy = p1.accuracy(validation_predictions, val_labels)

    return (training_accuracy, validation_accuracy)
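
To see how `**kwargs` travels from the caller to the classifier, here is a self-contained toy run. Everything in it is a stand-in: `toy_classifier` and its `scale` keyword are hypothetical, and `classify` and `accuracy` are compact re-implementations rather than the project's own functions.

```python
import numpy as np

def accuracy(preds, targets):
    return (preds == targets).mean()

def classify(feature_matrix, theta, theta_0):
    return np.where(feature_matrix @ theta + theta_0 > 0, 1, -1)

def toy_classifier(feature_matrix, labels, scale=1.0):
    """Hypothetical stand-in: theta is the label-weighted sum of the rows."""
    theta = scale * (labels @ feature_matrix)
    return theta, 0.0

def classifier_accuracy(classifier, train_X, val_X, train_y, val_y, **kwargs):
    # kwargs (here: scale) are forwarded untouched to the classifier.
    theta, theta_0 = classifier(train_X, train_y, **kwargs)
    train_acc = accuracy(classify(train_X, theta, theta_0), train_y)
    val_acc = accuracy(classify(val_X, theta, theta_0), val_y)
    return train_acc, val_acc

train_X = np.array([[1.0, 0.0], [0.0, 1.0]])
train_y = np.array([1, -1])
val_X = np.array([[2.0, 1.0], [0.0, 3.0]])
val_y = np.array([1, -1])

result = classifier_accuracy(toy_classifier, train_X, val_X, train_y, val_y, scale=2.0)
print(result)  # both accuracies are 1.0 on this toy data
```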


### A note on the classifier() call in classifier_accuracy: Python and first-class functions

> For folks new to Python or functional programming: In Python, one can store a reference to a function in a variable, and pass that function reference as an argument. Then one can later call the function by using the variable name just as though it were the function name.

In [4]:
# Get a reference to an example function and store it in a variable.
# This just makes an alias for the function.
foo = np.greater
# And some arguments for it.
a = np.array([1, 2, 3])
b = np.array([2, 2, 2])
# Now call the function.
foo(a, b)
# Result is array([False, False,  True])

array([False, False,  True])

# 8. Parameter Tuning

> You finally have your algorithms up and running, and a way to measure performance! But, it's still unclear what values the hyperparameters like  𝑇  and  𝜆  should have. In this section, you'll tune these hyperparameters to maximize the performance of each model.

---

> One way to tune your hyperparameters for any given Machine Learning algorithm is to perform a grid search over all the possible combinations of values. If your hyperparameters can be any real number, you will need to limit the search to some finite set of possible values for each hyperparameter. For efficiency reasons, you might often want to tune one individual parameter, keeping all others constant, and then move on to the next one. Compared to a full grid search, there are many fewer possible combinations to check, and this is what you'll be doing for the questions below.

> In main.py, uncomment Problem 8 to run the staff-provided tuning algorithm from utils.py. For the purposes of this assignment, please try the following values for 𝑇: [1, 5, 10, 15, 25, 50] and the following values for 𝜆: [0.001, 0.01, 0.1, 1, 10]. For the Pegasos algorithm, first fix 𝜆=0.01 to tune 𝑇, and then use the best 𝑇 to tune 𝜆.
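
The difference between a full grid search and tuning one parameter at a time can be sketched as follows. This is not the staff tuning code from utils.py: `toy_score` is a made-up stand-in for validation accuracy, and both search functions are hypothetical.

```python
import math
from itertools import product

T_VALUES = [1, 5, 10, 15, 25, 50]
L_VALUES = [0.001, 0.01, 0.1, 1, 10]

def toy_score(T, L):
    """Hypothetical validation score that peaks at T=15, L=0.01."""
    return -abs(T - 15) - abs(math.log10(L) + 2)

def full_grid_search(score):
    """Evaluate every (T, L) combination."""
    candidates = list(product(T_VALUES, L_VALUES))
    best = max(candidates, key=lambda tl: score(*tl))
    return best, len(candidates)

def one_at_a_time_search(score, L_fixed=0.01):
    """Tune T with L fixed, then tune L with the best T."""
    best_T = max(T_VALUES, key=lambda T: score(T, L_fixed))
    best_L = max(L_VALUES, key=lambda L: score(best_T, L))
    return (best_T, best_L), len(T_VALUES) + len(L_VALUES)

grid_best, grid_evals = full_grid_search(toy_score)
seq_best, seq_evals = one_at_a_time_search(toy_score)
print(grid_best, grid_evals)  # 30 evaluations
print(seq_best, seq_evals)    # 11 evaluations
```

On this toy score both searches agree, but in general the one-at-a-time search can miss the best combination, because it never evaluates most (T, L) pairs.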

## Keywords

- hyperparameter
- grid search

## Performance After Tuning

### Result

- best method: `Pegasos`

```
perceptron valid: [(1, 0.758), (5, 0.72), (10, 0.716), (15, 0.778), (25, 0.794), (50, 0.79)]
best = 0.7940, T=25.0000
avg perceptron valid: [(1, 0.794), (5, 0.792), (10, 0.798), (15, 0.798), (25, 0.8), (50, 0.796)]
best = 0.8000, T=25.0000
Pegasos valid: tune T [(1, 0.7), (5, 0.782), (10, 0.794), (15, 0.806), (25, 0.804), (50, 0.806)]
best = 0.8060, T=15.0000
Pegasos valid: tune L [(0.001, 0.79), (0.01, 0.806), (0.1, 0.752), (1, 0.594), (10, 0.518)]
best = 0.8060, L=0.0100
```


## Accuracy on the test set

> After you have chosen your best method (perceptron, average perceptron or Pegasos) and parameters, use this classifier to compute testing accuracy on the test set. 

> We have supplied the feature matrix and labels in main.py as test_bow_features and test_labels.

> Note: In practice the validation set is used for tuning hyperparameters while a heldout test set is the final benchmark used to compare disparate models that have already been tuned. You may notice that your results using a validation set don't always align with those of the test set, and this is to be expected.


```python
T = 25
L = 0.01

# Your code here
peg_train_accuracy, peg_test_accuracy = p1.classifier_accuracy(
    p1.pegasos, train_bow_features, test_bow_features,
    train_labels, test_labels, T=T, L=L)
print("{:50} {:.4f}".format("Training accuracy for Pegasos:", peg_train_accuracy))
print("{:50} {:.4f}".format("Test accuracy for Pegasos:", peg_test_accuracy))

```

### Result

```
Training accuracy for Pegasos:                     0.9195
Test accuracy for Pegasos:                         0.8020
```

## The most explanatory unigrams

> According to the largest weights (i.e. individual 𝜃𝑖 values in your 𝜃 vector), you can find out which unigrams were the most impactful ones in predicting positive labels. Uncomment the relevant part in main.py to call utils.most_explanatory_word.

> Report the top ten most explanatory word features for positive classification below:


### Analysis

- theta has `13234` elements
- its type is a numpy array

```
theta dim: (13234,)
theta type: <class 'numpy.ndarray'>
```
- Top 10 values in theta
    - theta sorted in descending order

```
0.5903715521713366

0.5685059656607246

0.44734183045507536

0.40064895432564757

0.38411827540030485

0.36204152120051414

0.3599248573066697

0.35073693119923643

0.32758790048448905

0.3274736683441225
```

- Size of wordlist (a list)
    - Same as the number of elements in theta

```
13234
```

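- Analysis of `most_explanatory_word(theta, wordlist)`
    - It pairs each element of theta (a weight) with its word using zip
    - The descending sort happens inside this function, so sorting theta before passing it in would break the pairing between weights and words.
    - wordlist has 13234 entries, so theta must be passed in with all of its elements as well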

```
[word for (theta_i, word) in sorted(zip(theta, wordlist))[::-1]]
```
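
The pairing problem noted above is easy to reproduce on a tiny example. The weights and words below are made up; the `np.argsort` variant is an alternative shown for comparison, not what `most_explanatory_word` itself does.

```python
import numpy as np

theta = np.array([0.2, -0.5, 0.9])
wordlist = ["ok", "bad", "great"]

# Correct: zip first, then sort the (weight, word) pairs in descending order.
correct = [word for (_, word) in sorted(zip(theta, wordlist))[::-1]]

# Wrong: sorting theta beforehand detaches each weight from its word.
presorted_theta = np.sort(theta)[::-1]
wrong = [word for (_, word) in sorted(zip(presorted_theta, wordlist))[::-1]]

# Alternative: sort the indices instead of the values.
by_argsort = [wordlist[i] for i in np.argsort(theta)[::-1]]

print(correct)     # ['great', 'ok', 'bad']
print(wrong)       # ['ok', 'bad', 'great']
print(by_argsort)  # ['great', 'ok', 'bad']
```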

In [5]:
a = np.array([2,1,10,-1])

In [16]:
np.sort(a, kind='quicksort')

array([-1,  1,  2, 10])

In [17]:
id(a)

139985953269360

In [18]:
a[::-1].sort()

In [19]:
a

array([10,  2,  1, -1])

In [20]:
id(a)

139985953269360

### Analyzing the behavior of zip inside most_explanatory_word

In [21]:
theta_dummy = np.random.randint(1,10,5)

In [22]:
theta_dummy

array([2, 7, 9, 3, 5])

In [23]:
wordlist = ['apple', 'banana', 'grape', 'orange', 'peach']

In [24]:
wordlist

['apple', 'banana', 'grape', 'orange', 'peach']

#### Trying out the zip function

In [33]:
zipped = zip(theta_dummy, wordlist)

In [34]:
print(tuple(zipped))

((2, 'apple'), (7, 'banana'), (9, 'grape'), (3, 'orange'), (5, 'peach'))


#### Sorting the zipped result

Note: [What is the difference between sorted(list) vs list.sort()?](https://stackoverflow.com/questions/22442378/what-is-the-difference-between-sortedlist-vs-list-sort)

In [42]:
sorted_zipped = sorted(zip(theta_dummy, wordlist))

In [39]:
sorted_zipped

[(2, 'apple'), (3, 'orange'), (5, 'peach'), (7, 'banana'), (9, 'grape')]

#### Sorting the zipped result in descending order

In [40]:
sorted_zipped = sorted(zip(theta_dummy, wordlist))[::-1]

In [41]:
sorted_zipped

[(9, 'grape'), (7, 'banana'), (5, 'peach'), (3, 'orange'), (2, 'apple')]

#### When the zipped sequences have different lengths

- #theta < #wordlist
- Result: zip stops at the shorter sequence, so the remaining words are silently dropped

In [43]:
theta_dummy2 = np.random.randint(1,10,2)

In [None]:
theta_dummy2

In [45]:
zipped2 = zip(theta_dummy2, wordlist)

In [46]:
print(tuple(zipped2))

((3, 'apple'), (2, 'banana'))
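
Because zip truncates silently, one defensive option is to check the lengths before pairing. The helper `pair_weights_with_words` is hypothetical and only illustrates the idea.

```python
def pair_weights_with_words(theta, wordlist):
    """Pair weights with words, refusing to truncate on a length mismatch."""
    if len(theta) != len(wordlist):
        raise ValueError(
            "theta has {} elements but wordlist has {}".format(len(theta), len(wordlist))
        )
    return list(zip(theta, wordlist))

pairs = pair_weights_with_words([0.5, -0.1], ["good", "bad"])
print(pairs)  # [(0.5, 'good'), (-0.1, 'bad')]

try:
    pair_weights_with_words([0.5], ["good", "bad"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```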


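### Approach

- Assign to best_theta the theta produced by the Pegasos algorithm on the training data, exactly as is
    - "As is" means no sorting, filtering, or any other processing
- There is no function that runs Pegasos over the dataset and returns the parameters, so create `classifier_parameter` in main.py, modeled on `classifier_accuracy` (@main.py)
- Outline:
    1. Convert text (review) -> bow_features (vectors of 1s and 0s). Turning the text into numbers lets the algorithm compute on it.
    2. Run the algorithm on the bow_features of the training dataset to optimize the parameters, i.e. find the optimized theta and theta_0.
    3. The resulting theta is a vector with as many elements as the BoW wordlist, and each element is the weight of the corresponding word. So pair theta with the wordlist again to build a collection of (weight, word) tuples, and sort it by weight in descending order. (This is what `most_explanatory_word` does internally.)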
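
The outline can be sketched end to end on toy data. The body of `classifier_parameter` is an assumption about what such a helper might look like, `toy_classifier` is a hypothetical stand-in for Pegasos, and `most_explanatory_word` reuses the one-line expression quoted earlier.

```python
import numpy as np

def classifier_parameter(classifier, train_feature_matrix, train_labels, **kwargs):
    """Train the classifier and return (theta, theta_0) without any post-processing."""
    return classifier(train_feature_matrix, train_labels, **kwargs)

def toy_classifier(feature_matrix, labels, **kwargs):
    """Hypothetical stand-in for Pegasos: label-weighted sum of the rows."""
    return labels @ feature_matrix, 0.0

def most_explanatory_word(theta, wordlist):
    return [word for (theta_i, word) in sorted(zip(theta, wordlist))[::-1]]

wordlist = ["great", "bad", "apple"]
train_X = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0],
                    [1.0, 0.0, 0.0]])
train_y = np.array([1, -1, 1])

# theta is passed on unsorted and unfiltered, so it stays aligned with wordlist.
best_theta, best_theta_0 = classifier_parameter(toy_classifier, train_X, train_y)
top_words = most_explanatory_word(best_theta, wordlist)
print(best_theta)  # weights 2, -1, 0 for 'great', 'bad', 'apple'
print(top_words)   # ['great', 'apple', 'bad']
```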

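### Result

- The listed words were correct. However, the ranking in the latter half was wrong, so there still seems to be a problem somewhere in the Pegasos implementation.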

```
Most Explanatory Word Features
['delicious', 'great', '!', 'best', 'perfect', 'loves', 'glad', 'wonderful', 'quickly', 'love']
```

- Incidentally, sorting in ascending order instead gives the ten most negative words

```python
return [word for (theta_i, word) in sorted(zip(theta, wordlist))]
```

```
['disappointed', 'however', 'bad', 'not', 'but', 'unfortunately', 'awful', '$', 'ok', 'money']
```


# 9. Feature Engineering

> Frequently, the way the data is represented can have a significant impact on the performance of a machine learning method. Try to improve the performance of your best classifier by using different features. In this problem, we will practice two simple variants of the bag of words (BoW) representation.

---

## Remove Stop Words

> Try to implement stop words removal in your feature engineering code. Specifically, load the file stopwords.txt, remove the words in the file from your dictionary, and use features constructed from the new dictionary to train your model and make predictions.

> Compare your results on the test data with the Pegasos algorithm using 𝑇=25 and 𝐿=0.01 when you remove the words in stopwords.txt from your dictionary.

> Hint: Instead of replacing the feature matrix with zero columns on stop words, you can modify the bag_of_words function to prevent adding stopwords to the dictionary.

> Accuracy on the test set using the original dictionary: 0.8020
>
> Accuracy on the test set using the dictionary with stop words removed:


### Approach

- Load `stopwords.txt` and turn it into a list (@main.py)
- Create a `bag_of_words_remove_stop_words(train_texts, stopwords_data)` function in project1.py
    - The list of stopwords is passed in as an argument
    - When adding a word, check it against stopwords and skip it if it is a stop word
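
A minimal sketch of the function described above. The signature comes from the notes, but the body is an assumption: it tokenizes by lowercasing and splitting on whitespace, which may differ from the word extraction used in project1.py.

```python
def bag_of_words_remove_stop_words(train_texts, stopwords_data):
    """Build a word -> index dictionary, skipping any word listed in stopwords_data."""
    stopwords = set(stopwords_data)  # set membership checks are fast
    dictionary = {}
    for text in train_texts:
        for word in text.lower().split():
            if word in stopwords:
                continue  # never add stop words to the dictionary
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

texts = ["I love my apple", "my orange"]
stopwords_data = ["i", "me", "my", "myself"]
dictionary = bag_of_words_remove_stop_words(texts, stopwords_data)
print(dictionary)  # {'love': 0, 'apple': 1, 'orange': 2}
```

Filtering while the dictionary is built keeps the column indices contiguous, so the feature matrix never contains columns for stop words.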

In [48]:
stopwords = [
    'i',
    'me',
    'my',
    'myself',
    'we',
    'our',
    'ours',
    'ourselves',
    'you',
    'your',
    'yours',
    'yourself',
    'yourselves',
    'he'
]

In [50]:
word_list = [
    'apple',
    'my',
    'orange'
]

In [53]:
for w in word_list:
    if w not in stopwords:
        print('word ' + w)
    else:
        print('stop word ' + w)

word apple
stop word my
word orange


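### Result

- Compared with the original dictionary, accuracy did improve with the dictionary with stop words removed
- However, the answer was marked incorrect. There still seems to be a problem with the Pegasos implementation.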

```shell
Training accuracy for Pegasos:                     0.9150
Test accuracy for Pegasos:                         0.8100
13108
Most Explanatory Word Features
['delicious', 'great', 'loves', '!', 'best', 'perfect', 'excellent', 'wonderful', 'favorite', 'love']
```