# INFO-4604 HW5: Ensemble Learning 

* Created by Michael Paul on November 27, 2017
* Modified by James Gallmeister on December 3, 2017

##### Deadline: Monday, December 4, 8:00pm MT

In this assignment, you will continue working with the Twitter sentiment dataset from HW4. This time, you will build a classifier that combines the individual classifiers submitted by everyone in the class.

### What to hand in

You will submit the assignment on Piazza. A private note to the instructor should be submitted with the subject _"Submission 5 from [your full name]"_ with the submission file(s) as an attachment. The note should be submitted to the `submissions` folder (**not** the `hw5` folder).

Submit a single Jupyter notebook named `hw5lastname.ipynb`, where lastname is replaced with your last name.

## Combined Dataset

Recall that in HW4B, you submitted the sentiment probabilities from your classifier. The features were randomized so that most classifiers will be slightly different.

The probabilities from all of the submissions have been put together for this assignment. The format is a CSV file where the first column is the label, and subsequent columns are classifier probabilities. Each three-column sequence is the probability of negative ($-1$), neutral ($0$), and positive ($1$), in that order. For example, column 2 (where column 1 is the label) is the negative probability from the first submission, column 4 is the positive probability of the first submission, column 5 is the negative probability of the second submission, column 6 is the neutral probability of the second submission, and so on. There are two files: the first should be used for training and cross-validation, and the second should be used for testing.
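The column layout described above can be sketched with a small helper. This is only an illustration on a made-up toy matrix: `submission_columns` and `X_toy` are hypothetical names, not part of the assignment code, and the matrix is assumed to have the label column already removed.

```python
import numpy as np

def submission_columns(X, k):
    """Return the (negative, neutral, positive) columns of submission k (1-indexed).

    X is assumed to be the feature matrix with the label column already removed.
    """
    start = 3 * (k - 1)
    return X[:, start:start + 3]

# Toy matrix: 2 tweets, 2 submissions (6 probability columns).
X_toy = np.array([
    [0.7, 0.2, 0.1, 0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3, 0.3, 0.4, 0.3],
])

second = submission_columns(X_toy, 2)
# argmax yields 0, 1, or 2; subtracting 1 maps it to the labels -1, 0, 1.
labels_second = second.argmax(axis=1) - 1
print(labels_second)
```

The same `argmax(...) - 1` mapping is what the loading code uses to turn each submission's probabilities into a predicted label.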

As usual, run the code below to load the data. The accuracies of each individual system are also calculated.

In [13]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score

df_train = pd.read_csv('http://cmci.colorado.edu/classes/INFO-4604/data/tweet_predictions_cv.csv', header=None)
df_test = pd.read_csv('http://cmci.colorado.edu/classes/INFO-4604/data/tweet_predictions_test.csv', header=None)

Y_train = df_train.iloc[0:, 0].values
X_train = df_train.iloc[0:, 1:].values

Y_test = df_test.iloc[0:, 0].values
X_test = df_test.iloc[0:, 1:].values

for i in np.arange(0, len(X_train[0]), 3):
    print("Submission %d:" % (1 + int(i/3)))
    predictions_cv = [np.argmax(x)-1 for x in X_train[0:, i:i+3]]
    print(" Validation accuracy: %0.6f" % accuracy_score(Y_train, predictions_cv))
    predictions_test = [np.argmax(x)-1 for x in X_test[0:, i:i+3]]
    print(" Test accuracy: %0.6f" % accuracy_score(Y_test, predictions_test))


Submission 1:
 Validation accuracy: 0.651113
 Test accuracy: 0.633333
Submission 2:
 Validation accuracy: 0.616119
 Test accuracy: 0.600000
Submission 3:
 Validation accuracy: 0.716861
 Test accuracy: 0.755556
Submission 4:
 Validation accuracy: 0.752916
 Test accuracy: 0.766667
Submission 5:
 Validation accuracy: 0.722163
 Test accuracy: 0.744444
Submission 6:
 Validation accuracy: 0.727466
 Test accuracy: 0.766667
Submission 7:
 Validation accuracy: 0.737010
 Test accuracy: 0.755556
Submission 8:
 Validation accuracy: 0.760339
 Test accuracy: 0.788889
Submission 9:
 Validation accuracy: 0.727466
 Test accuracy: 0.777778
Submission 10:
 Validation accuracy: 0.645811
 Test accuracy: 0.644444
Submission 11:
 Validation accuracy: 0.679745
 Test accuracy: 0.600000
Submission 12:
 Validation accuracy: 0.734889
 Test accuracy: 0.766667
Submission 13:
 Validation accuracy: 0.621421
 Test accuracy: 0.633333
Submission 14:
 Validation accuracy: 0.713680
 Test accuracy: 0.744444
Submission 15:


## Problem 1: Ensemble Classifier [7 points]

First, build a classifier that uses the probabilities from the 36 submissions as features. Since each submission contains 3 probabilities, there are 108 total features.

Following HW4B, you should use multinomial logistic regression as the classifier. Use `sklearn`'s [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class, setting the `multi_class` argument to `'multinomial'`, the `solver` argument to `'lbfgs'`, and the `random_state` argument to `123` (as usual). 

Additionally, use [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to select the `C` parameter using 5-fold cross-validation. For the grid search, try the following values for `C`: $\{0.1, 0.2, 0.3, 0.4, \ldots, 1.8, 1.9, 2.0\}$. (You can easily generate this list of values using [`numpy.arange`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.arange.html).) When making predictions on the test data, you should use the optimal classifier tuned during cross-validation.
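One detail worth checking when generating the grid: the `stop` value of `numpy.arange` is exclusive, so a stop of `2.0` leaves out the last value. The sketch below compares that call with `numpy.linspace`, which includes both endpoints. The variable names are illustrative only.

```python
import numpy as np

c_arange = np.arange(0.1, 2.0, 0.1)      # stop is exclusive: ends near 1.9
c_linspace = np.linspace(0.1, 2.0, 20)   # both endpoints included
c_values = np.round(c_linspace, 1)       # tidy up floating-point noise

print(len(c_arange), len(c_values))
print(c_values[0], c_values[-1])
```

Either function works for the grid search as long as the resulting list has all 20 values from 0.1 through 2.0.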

You may wish to refer to the HW4B code to get started, since the code will be similar.

#### Deliverable 1.1: Implement the ensemble classifier as described, and calculate both the cross-validation accuracy and test accuracy.

See the code and output below.

#### Deliverable 1.2: Examine the validation and test accuracies of the individual submissions above. How do these accuracies compare to the validation and test accuracy of your ensemble classifier?

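The ensemble's validation accuracy (0.815) is higher than that of every individual submission listed above (the best is Submission 8 at 0.760). Its test accuracy (0.800) is also higher than the best listed test accuracy (0.789, Submission 8), though by a smaller margin.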

#### Deliverable 1.3: Based on what was discussed in lecture, explain these results. If the ensemble outperformed the individual classifiers, explain why ensembles are able to do this. If the ensemble did not outperform the individual classifiers, explain why this particular ensemble might not have been effective.

The ensemble outperformed the individual classifiers. A likely reason is that combining many classifiers averages out the biases learned by the individual models and reduces their variance. An ensemble is also unlikely to overfit if none of the individual models overfit, and based on the individual results above, none of the individual models appear to be overfitting.
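The variance-reduction argument can be illustrated with a toy simulation. This is a sketch under strong assumptions that do not hold for the real submissions: a binary task, 15 classifiers that are each correct 70% of the time, and errors that are fully independent. It also uses a simple majority vote rather than the logistic regression ensemble from this problem.

```python
import numpy as np

rng = np.random.default_rng(123)
n_examples, n_classifiers, p_correct = 10_000, 15, 0.7

# correct[i, j] is True when simulated classifier j gets example i right.
correct = rng.random((n_examples, n_classifiers)) < p_correct

individual_accuracy = correct.mean(axis=0)  # one accuracy per classifier
# In a binary task, the majority vote is right when more than half are right.
majority_correct = correct.sum(axis=1) > n_classifiers / 2
ensemble_accuracy = majority_correct.mean()

print("Best individual accuracy: %0.3f" % individual_accuracy.max())
print("Majority-vote accuracy:   %0.3f" % ensemble_accuracy)
```

Because the real submissions were trained on the same data, their errors are correlated, so the gain from combining them is expected to be smaller than in this idealized setting.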

In [14]:
# code for 1.1 here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Multinomial logistic regression over the 108 probability features
base_classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=123)
# C grid 0.1, 0.2, ..., 2.0; np.linspace includes the endpoint 2.0, unlike np.arange(0.1, 2.0, 0.1)
params = [{'C': np.linspace(0.1, 2.0, 20)}]

gs_classifier = GridSearchCV(base_classifier, params, cv=5)
gs_classifier.fit(X_train, Y_train)
print('Best parameter settings:', gs_classifier.best_params_)
print('Validation accuracy: %0.6f' % gs_classifier.best_score_)
print('Test accuracy: %0.6f' % gs_classifier.score(X_test, Y_test))

('Best parameter settings:', {'C': 0.40000000000000002})
Validation accuracy: 0.815483
Test accuracy: 0.800000


## Problem 2: Dimensionality Reduction [5 points]

Since the features are continuous-valued and correlated with each other, this feature set is a good candidate for dimensionality reduction with principal component analysis (PCA). You will experiment with PCA here.

Use the [`sklearn.decomposition.PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) class to transform the feature vectors (`X_train` and `X_test`) using PCA. You should fit PCA with the training data, and then transform the feature vectors of both the training and test data. This will require a combination of the `fit`, `transform`, and/or `fit_transform` functions. Read the documentation linked here. The workflow is similar to the feature selection with [`sklearn.feature_selection.chi2`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) that you used in HW4B, so you may find it helpful to refer back to your code for feature selection.

When creating a `PCA` object, you set the number of components (that is, the dimensionality of the feature vectors) with the `n_components` argument. Additionally, set `random_state` to `123`.
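The fit-on-train, transform-both pattern can be sketched on synthetic data. Everything here is a made-up stand-in for the real features: the toy matrices have 6 correlated columns driven by 2 latent factors, and the names `X_toy_train`, `X_toy_test`, and `mixing` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
# Toy data: 6 correlated columns generated from 2 latent factors plus small noise.
mixing = rng.normal(size=(2, 6))
X_toy_train = rng.normal(size=(200, 2)) @ mixing + 0.01 * rng.normal(size=(200, 6))
X_toy_test = rng.normal(size=(50, 2)) @ mixing + 0.01 * rng.normal(size=(50, 6))

pca = PCA(n_components=2, random_state=123)
Z_train = pca.fit_transform(X_toy_train)  # learn the components from training data only
Z_test = pca.transform(X_toy_test)        # reuse those components on the test data

explained = pca.explained_variance_ratio_.sum()
print(Z_train.shape, Z_test.shape)
print("Variance explained by 2 components: %0.4f" % explained)
```

The `explained_variance_ratio_` attribute is one way to see how much of the original variation a given number of components retains, which can help interpret how accuracy changes with `n_components`.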

You should run the same classifier from Problem 1 on the PCA-reduced data. You should continue to use `GridSearchCV` to tune `C`.

#### Deliverable 2.1: Apply PCA to the data and calculate the validation and test accuracies when the number of components is each of: $1, 2, 10, 20, 30, 40, 50, 100$.

[you may wish to plot these results, but it is not required as long as your results are readable]

In [34]:
# code for 2.1 here
from sklearn.decomposition import PCA

comp = [1, 2, 10, 20, 30, 40, 50, 100]
base_classifier1 = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=123)
# C grid 0.1, 0.2, ..., 2.0 (np.linspace includes the endpoint 2.0)
params = [{'C': np.linspace(0.1, 2.0, 20)}]
gs_classifier1 = GridSearchCV(base_classifier1, params, cv=5)

for n in comp:
    # Fit PCA on the training data only, then apply the same projection to both sets
    pca = PCA(n_components=n, random_state=123)
    pca.fit(X_train)
    pca_x_train = pca.transform(X_train)
    pca_x_test = pca.transform(X_test)
    # Re-tune C with 5-fold cross-validation on the reduced features
    gs_classifier1.fit(pca_x_train, Y_train)
    print('Validation Accuracy with %d components: %0.6f' % (n, gs_classifier1.best_score_))
    print('Test Accuracy with %d components: %0.6f' % (n, gs_classifier1.score(pca_x_test, Y_test)))

Validation Accuracy with 1 components: 0.667020
Test Accuracy with 1 components: 0.677778
Validation Accuracy with 2 components: 0.763521
Test Accuracy with 2 components: 0.777778
Validation Accuracy with 10 components: 0.779427
Test Accuracy with 10 components: 0.788889
Validation Accuracy with 20 components: 0.797455
Test Accuracy with 20 components: 0.800000
Validation Accuracy with 30 components: 0.808059
Test Accuracy with 30 components: 0.822222
Validation Accuracy with 40 components: 0.814422
Test Accuracy with 40 components: 0.811111
Validation Accuracy with 50 components: 0.814422
Test Accuracy with 50 components: 0.777778
Validation Accuracy with 100 components: 0.814422
Test Accuracy with 100 components: 0.800000
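
An alternative design, not required by the assignment, is to put PCA and the classifier in a single scikit-learn `Pipeline` so that `GridSearchCV` refits PCA inside each cross-validation fold and can tune `n_components` together with `C`. The sketch below uses synthetic data from `make_classification` and a much smaller grid than the one above. The `multi_class` argument is omitted because it is deprecated in recent scikit-learn releases, where the `lbfgs` solver fits a multinomial model by default. `max_iter=1000` is an added assumption to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic 3-class data standing in for the real probability features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=123
)

pipe = Pipeline([
    ("pca", PCA(random_state=123)),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000, random_state=123)),
])

# Step names prefix the parameter names: <step>__<parameter>.
param_grid = {
    "pca__n_components": [2, 6, 12],
    "clf__C": [0.1, 1.0, 2.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)

print("Best parameter settings:", search.best_params_)
print("Validation accuracy: %0.6f" % search.best_score_)
```

With this setup, each validation fold is projected with components learned only from the other four folds, which keeps the cross-validation estimate independent of the held-out data.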


## Problem 3: Feedback [+1 EC]

#### Deliverable 3.1: Approximately how much time did you spend on this assignment?

Around 3 hours or so.
