# Project 3 - Classification
Welcome to the third project of Data 8!  You will build a classifier that guesses whether a song is hip-hop or country, using only the numbers of times words appear in the song's lyrics.  By the end of the project, you should know how to:

1. Build a k-nearest-neighbors classifier.
2. Test a classifier on data.

### Logistics


**Deadline.** This project is due at 11:59pm on Thursday 4/27. You can earn an early submission bonus point by submitting your completed project by Wednesday 4/26. Late submissions will be accepted until Tuesday 5/2, but a 10% late penalty will be applied for each day late. It's **much** better to be early than late, so start working now.

**Checkpoint.** For full credit, you must also **complete Part 1 of the project (out of 4) and submit it by 11:59pm on Friday 4/21**. You will have some lab time to work on the Part 1 questions, but we recommend that you start the project before lab and leave time to finish the checkpoint afterward.

**Partners.** You may work with one partner. It's best to work with someone in your lab. Only one of you is required to submit the project. On [okpy.org](http://okpy.org), the person who submits should also designate their partner so that both of you receive credit.

**Rules.** Don't share your code with anybody but your partner. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.

**Support.** You are not alone! Come to office hours, post on Piazza, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Piazza post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, email your TA or tutor for help. You can find contact information for the staff on the [course website](http://data8.org/sp17/staff.html).

**Tests.** Passing the tests for a question **does not** mean that you answered the question correctly. Tests usually only check that your table has the correct column labels. However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work!

**Advice.** Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. 

To get started, load `datascience`, `numpy`, `plots`, and `ok`.

In [1]:
# Run this cell to set up the notebook, but please don't change it.
!pip install datascience
!pip install client
import numpy as np
import math
from datascience import *

# These lines set up the plotting functionality and formatting.
import matplotlib
# The `warn` keyword is not accepted by newer matplotlib releases, so it is omitted.
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

# These lines load the tests.
#from client.api.notebook import Notebook
#ok = Notebook('project3.ok')
#_ = ok.auth(inline=True)


!pip install -U scikit-learn
!pip install --upgrade pip
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# `sklearn.cross_validation` was removed from scikit-learn; `model_selection` replaces it.
from sklearn.model_selection import train_test_split





# 1. The Dataset

Our dataset is a table of songs, each with a name, an artist, and a genre.  We'll be trying to predict each song's genre.

The only attributes we will use to predict the genre of a song are its lyrics. In particular, we have a list of just under 5,000 words that might occur in a song.  For each song, our dataset tells us the frequency with which each of these words occurs in that song. All words have been converted to lowercase.

Run the cell below to read the `lyrics` table. **It may take up to a minute to load.**

In [2]:
lyrics = Table.read_table('lyrics.csv')
lyrics.where("Title", "In Your Eyes").select(0, 1, 2, 3, 4, 5, "like", "love")


| Title | Artist | Genre | i | the | you | like | love |
|---|---|---|---|---|---|---|---|
| In Your Eyes | Alison Krauss | Country | 0.107143 | 0 | 0.0297619 | 0.0119048 | 0.0595238 |


That cell prints a few columns of the row for the country song ["In Your Eyes" by Alison Krauss](http://www.azlyrics.com/lyrics/alisonkrauss/inyoureyes.html).  The song contains 168 words. The word "like" appears twice:  $\frac{2}{168} \approx 0.0119$ of the words in the song. The word "love" appears 10 times: $\frac{10}{168} \approx 0.0595$ of the words. The word "the" doesn't appear at all.
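The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration of how a word proportion is computed; the helper name `word_proportions` and the toy lyric are made up for the example and are not part of the project or the dataset.

```python
from collections import Counter

def word_proportions(lyrics_text):
    """Return {word: fraction of the song's words equal to that word}."""
    words = lyrics_text.lower().split()   # lowercase, as in the dataset
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy lyric (7 words), not taken from the dataset.
toy = "Love me like you love the road"
props = word_proportions(toy)
print(props["love"])   # "love" is 2 of the 7 words
print(props["like"])   # "like" is 1 of the 7 words

# The two proportions quoted for "In Your Eyes":
like_fraction = 2 / 168
love_fraction = 10 / 168
```

The proportions for one song always add up to at most 1, since each word in the song is counted once.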

Our dataset doesn't contain all information about a song.  For example, it doesn't describe the order of words in the song, let alone the melody, instruments, or rhythm. Nonetheless, you may find that word frequencies alone are sufficient to build an accurate genre classifier.
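To see how word frequencies alone can drive a prediction, here is a rough, dependency-free sketch of a k-nearest-neighbors classifier. The five "songs", their two word proportions, and the function names are all hypothetical; the project's own classifier may be organized differently.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training rows closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical proportions of two words for five made-up songs.
train_X = [[0.00, 0.06], [0.01, 0.05], [0.05, 0.00], [0.04, 0.01], [0.06, 0.01]]
train_y = ["Country", "Country", "Hip-hop", "Hip-hop", "Hip-hop"]

print(knn_predict(train_X, train_y, [0.01, 0.06]))
print(knn_predict(train_X, train_y, [0.05, 0.00]))
```

Using an odd `k` avoids tied votes when there are only two genres.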

All titles are unique. The `row_for_title` function provides fast access to the one row for each title. 
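The definition of `row_for_title` is not shown in this section. One way such a lookup could be made fast, sketched here with plain dictionaries rather than a `Table`, is to build an index from title to row once and reuse it. The second row and the variable names are hypothetical.

```python
# Hypothetical rows standing in for the `lyrics` table.
rows = [
    {"Title": "In Your Eyes", "Artist": "Alison Krauss", "Genre": "Country",
     "like": 0.0119048, "love": 0.0595238},
    {"Title": "Toy Song", "Artist": "Nobody", "Genre": "Hip-hop",
     "like": 0.02, "love": 0.0},
]

# Build the index once; each later lookup is a single dictionary access.
# This relies on titles being unique.
title_index = {row["Title"]: row for row in rows}

def row_for_title(title):
    """Return the one row whose Title matches `title`."""
    return title_index[title]

print(row_for_title("In Your Eyes")["love"])
```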

## Ungraded and Optional: A Custom Classifier
Try to create an even better classifier. You're not restricted to using only word proportions as features.  For example, given the data, you could compute various notions of vocabulary size or estimated song length.  If you're feeling very adventurous, you could also try other classification methods, like logistic regression.  If you think you built a classifier that works well, post on Piazza and let us know.
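As one possible starting point, the two extra features mentioned above could be derived from a song's row of proportions roughly as follows. The function names and the toy song are hypothetical, and the length estimate rests on a stated assumption: that the rarest word present in the song occurs exactly once.

```python
def vocabulary_size(proportions):
    """Number of tracked words that appear at least once in the song."""
    return sum(1 for p in proportions if p > 0)

def estimated_length(proportions):
    """Estimate the song's word count, assuming its rarest present word occurs once."""
    nonzero = [p for p in proportions if p > 0]
    if not nonzero:
        return 0
    return round(1 / min(nonzero))

# Hypothetical song of 20 words with counts 1, 4, 0 and 5 for four tracked words.
song = [1 / 20, 4 / 20, 0.0, 5 / 20]
print(vocabulary_size(song), estimated_length(song))
```

If the rarest present word actually occurs more than once, the estimate comes out too small, so treat it as a rough feature rather than a true length.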

In [3]:
#####################
# Custom Classifier #
#####################
#SVC Best kernel = linear

## Kaggle Competition

**Note:** This part is completely optional and will not contribute towards your grade in any way.

We decided to *hold out* a set of 100 songs, for which we have provided the attributes but not the genres. You can use this set to evaluate how well your classifier performs on data for which you have never seen the correct genres. Optionally, you can submit your predictions on this dataset to Kaggle to compare your classifier with those of other participants.

To participate, use your classifier to predict the genre of each row in the `holdout` table. Then, call `create_competition_submission` to generate a CSV file that you can submit to the competition!

If you want to participate in the competition, you will have to create a Kaggle account. It's easiest for the staff to determine the winners of the competition if you use your `@berkeley.edu` email when doing so, but you can also contact your GSI if you decide to use another email address. Winners may receive honor and glory, but no material benefit.

When you are ready to make a submission, go to https://inclass.kaggle.com/c/hip-hop-or-country for further instructions.

## Entire Data Set

In [4]:
holdout = Table.read_table('holdout_attributes.csv').drop('Id')
#holdout.select(0, 1, 2, 3, 4).show(5)

contest = holdout.to_array()
contest = [list(item) for item in contest]

X = lyrics.drop(0, 1, 2).to_array()
X = [list(item) for item in X]
y = lyrics.column('Genre')

In [None]:
print(len(X[0]), len(contest[0]))

4817 4817


In [None]:
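from sklearn.svm import SVC
# Linear-kernel SVM. The score below is accuracy on the training data itself,
# so it overstates how well the model will do on unseen songs.
linear_svc = SVC(kernel='linear')
linear_svc.fit(X, y)
y_pred = linear_svc.predict(X)
metrics.accuracy_score(y, y_pred)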

In [None]:
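from sklearn.svm import SVC
# Sigmoid-kernel SVM for comparison; this is again accuracy on the training data.
sigmoid_svc = SVC(kernel='sigmoid')
sigmoid_svc.fit(X, y)
y_pred = sigmoid_svc.predict(X)
metrics.accuracy_score(y, y_pred)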

## Feature Selection

In [None]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold()
X = sel.fit_transform(X)
contest = sel.transform(contest)
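With its default settings, `VarianceThreshold` drops features whose value is the same in every training row. A dependency-free sketch of that idea is below; the helper name and the toy rows are hypothetical, and this is not scikit-learn's implementation. Note that the columns are chosen from the training rows only and the same choice is then applied to the contest rows, which mirrors the `fit_transform` / `transform` pair above.

```python
from statistics import pvariance

def constant_free_columns(rows):
    """Return the indices of columns whose values are not all identical."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if pvariance([row[j] for row in rows]) > 0]

def keep_columns(rows, kept):
    """Keep only the columns listed in `kept`, in order."""
    return [[row[j] for j in kept] for row in rows]

# Hypothetical training rows: the middle word never appears in any song.
train_rows = [[0.1, 0.0, 0.3], [0.2, 0.0, 0.3], [0.4, 0.0, 0.1]]
contest_rows = [[0.5, 0.9, 0.2]]

kept = constant_free_columns(train_rows)          # learned from training rows only
train_small = keep_columns(train_rows, kept)
contest_small = keep_columns(contest_rows, kept)  # same columns applied to contest rows
print(kept, len(train_small[0]), len(contest_small[0]))
```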

In [None]:
print(len(X[0]), len(contest[0]))

In [None]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.decomposition import PCA

pca = PCA(n_components=1000)

# Maybe some original features were good, too?
selection = SelectKBest(k=1000)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)

In [None]:
grid_search.best_params_

## Train-test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
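For intuition, a split with `test_size=0.2` shuffles the rows and sets aside about a fifth of them for testing. The sketch below shows the idea with the standard library only; `simple_train_test_split`, its `seed` parameter, and the demo data are hypothetical, and scikit-learn's own function has more options than this.

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_tr = [X[i] for i in train_idx]
    X_te = [X[i] for i in test_idx]
    y_tr = [y[i] for i in train_idx]
    y_te = [y[i] for i in test_idx]
    return X_tr, X_te, y_tr, y_te

# Hypothetical data: ten one-feature songs.
X_demo = [[i / 10] for i in range(10)]
y_demo = ["Country" if i < 5 else "Hip-hop" for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

Each row keeps its own label because features and labels are selected with the same shuffled indices.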

## Train-test Split with Combined Features

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=0.2)

In [None]:
len(X_features[0])

## Best features based on Tree

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a tree ensemble, then keep only the features whose importance is at
# least SelectFromModel's default threshold (the mean importance).
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
model = SelectFromModel(clf, prefit=True)

In [None]:
X = model.transform(X)
contest = model.transform(contest)

In [None]:
len(contest[0])

In [None]:
len(X[0])

## Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(38,))
# Candidate parameter sets kept for reference:
#{'solver': 'adam', 'learning_rate': 'adaptive', 'hidden_layer_sizes': 8, 'alpha': 1e-05, 'activation': 'tanh'}
#{'solver': 'lbfgs', 'learning_rate': 'adaptive', 'hidden_layer_sizes': 32, 'alpha': 0.001, 'activation': 'tanh'}
#{'learning_rate': 'constant', 'hidden_layer_sizes': 38, 'alpha': 0.001, 'activation': 'identity'}
#{'solver': 'lbfgs', 'learning_rate': 'constant', 'hidden_layer_sizes': 34, 'alpha': 0.001, 'activation': 'logistic'}
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
print(scores.mean())
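The idea behind a 10-fold score is that each row is used for testing exactly once, and the fold accuracies are averaged. Below is a simplified sketch with a hypothetical majority-label model standing in for the classifier. It assigns rows to folds by index and does not stratify by genre, so it is an illustration of the technique rather than a reproduction of `cross_val_score`.

```python
from collections import Counter

def k_fold_accuracy(X, y, fit_predict, k=5):
    """Average accuracy over k folds.

    `fit_predict(train_X, train_y, test_X)` must return one predicted label per test row.
    """
    n = len(X)
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / k

def majority_baseline(train_X, train_y, test_X):
    """Hypothetical stand-in model: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_X)

# Hypothetical data: 7 country songs followed by 3 hip-hop songs.
X_cv = [[i] for i in range(10)]
y_cv = ["Country"] * 7 + ["Hip-hop"] * 3
mean_accuracy = k_fold_accuracy(X_cv, y_cv, majority_baseline, k=5)
print(mean_accuracy)
```

A baseline like this is also a useful reference point: a real classifier should beat the accuracy obtained by always guessing the more common genre.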

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

clf = RandomForestClassifier(n_estimators=3000, oob_score=True)
# Candidate parameter sets kept for reference:
#{'bootstrap': True, 'criterion': 'entropy', 'max_features': 'sqrt', 'min_samples_leaf': 90, 'min_samples_split': 9, 'n_estimators': 1000}
#{'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 'sqrt', 'min_samples_leaf': 60, 'min_samples_split': 9, 'n_estimators': 900}
#{'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 400}
#{'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_leaf': 2, 'min_samples_split': 7}
#{'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_leaf': 7, 'min_samples_split': 6}
#{'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 9, 'min_samples_leaf': 1, 'min_samples_split': 5}
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))


In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
print(scores.mean())

In [None]:
from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Candidate settings to sample from. 'auto' is omitted because, for
# classifiers, it meant the same thing as 'sqrt'.
param_dist = {"n_estimators": np.arange(100, 1100, 100),
              #"max_depth": [3, None],
              "max_features": [None, 'sqrt', 0.3],
              #"min_samples_split": sp_randint(2, 11),
              #"min_samples_leaf": np.arange(50, 510, 10),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# A fresh forest is searched here: the `clf` above sets oob_score=True,
# which cannot be combined with bootstrap=False.
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist)
start = time()
random_search.fit(X_train, y_train)
n_candidates = len(random_search.cv_results_['params'])
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_candidates))



In [None]:
print(random_search.best_score_)
print(random_search.best_estimator_)
print(random_search.best_params_)

In [None]:
print(len(contest[0]), len(X[0]))

In [None]:
clf

In [None]:
predictions = clf.predict(contest)
predictions

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
metrics.accuracy_score(y_test, y_pred)

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
print(scores.mean())

In [None]:
#contest = holdout.to_array()
#contest = [list(item) for item in contest]
#contest = model.transform(contest)
gnb_pred = gnb.predict(contest)
gnb_pred

(79, 79)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Search over neural-network settings. A fresh MLPClassifier is used because
# `clf` may have been rebound to a different model by an earlier cell.
gs = RandomizedSearchCV(MLPClassifier(), param_distributions={
    'hidden_layer_sizes': np.arange(1, 50),
    'alpha': [1e-5, 0.001, 0.1, 10, 1000],
    'solver' : ['lbfgs', 'sgd', 'adam'],
    'learning_rate' : ['constant', 'invscaling', 'adaptive'],
    'activation' : ['identity', 'logistic', 'tanh', 'relu']})
gs.fit(X_train, y_train)

In [None]:
print(gs.best_score_)
print(gs.best_estimator_)
print(gs.best_params_)

In [None]:
print(holdout.num_columns)

In [None]:
def create_competition_submission(predictions, filename='my_submission.csv'):
    """
    Create a submission CSV for the Kaggle competition.

    Inputs:
      predictions - list or array of your predictions (Generated as in Question 3.3.1.)
      filename - name of the CSV file to write
    """
    Table().with_columns('Id', np.arange(len(predictions)), 'Predictions', predictions).to_csv(filename)
    print('Created', filename)

Here's an example of how to generate a submission file.

In [None]:
create_competition_submission(predictions, 'idkanymore.csv')