Enter Team Member Names here (double click to edit):

- Name 1:
- Name 2:
- Name 3:

________

# In Class Assignment Three
In the following assignment you will be asked to fill in Python code and derivations for a number of different problems. Please read all instructions carefully and turn in the rendered notebook (or HTML of the rendered notebook) to Blackboard before the end of class.

**Distance Students**: please finish this assignment in 2 hours and 30 minutes. Turn in before next class per the instructions on Blackboard.

________________________________________________________________________________________________________

## Downloading the Document Data
Please run the following code to read in the "20 newsgroups" dataset from sklearn's data loading module.

In [1]:
from sklearn.datasets import fetch_20newsgroups_vectorized
import numpy as np

# this takes about 30 seconds to compute, read the next section while this downloads
ds = fetch_20newsgroups_vectorized(subset='train')

# this holds the continuous feature data (which is tfidf)
print 'features shape:', ds.data.shape # there are ~11000 instances and ~130k features per instance
print 'target shape:', ds.target.shape 
print 'range of target:', np.min(ds.target),np.max(ds.target)
print 'Data type is', type(ds.data), float(ds.data.nnz)/(ds.data.shape[0]*ds.data.shape[1])*100, '% of the data is non-zero'

features shape: (11314, 130107)
target shape: (11314,)
range of target: 0 19
Data type is <class 'scipy.sparse.csr.csr_matrix'> 0.121435315436 % of the data is non-zero


## Understanding the Dataset
Look at the description for the 20 newsgroups dataset at http://qwone.com/~jason/20Newsgroups/. You have just downloaded the "vectorized" version of the dataset, which means the words in each document have been transformed into a vector of about 130 thousand numeric features, each corresponding to a word (token) in the vocabulary. You can ignore the details of TF-IDF; just recognize that it is a means of converting text to a vector of data.

**Question Set 1**:
- How many instances are in the dataset? 
- What does each instance represent? 
- How many classes are in the dataset and what does each class represent?
- Would you expect a classifier trained on this data to generalize to documents written in the past week? Why or why not?
- Is the data represented as a sparse or dense matrix?

___
Enter your answer here:

*Double click to edit*

This is a dataset of newsgroup posts labeled by topic. The training subset loaded above has 11,314 instances, each representing one message posted to a newsgroup. There are 20 classes, one for each newsgroup (topic) that a message was posted to. The data is very sparse: only about 0.12% of the entries are non-zero.

A classifier trained on this data would likely not generalize well to documents written in the past week. The posts are roughly twenty years old, so many of today's topics are missing. In fact, words like 'facebook', 'twitter', and 'Obama' would not even be present in the existing dataset, but are important for modern topic classifiers to understand. However, some health related topics may still work well; it depends on how much the underlying distribution of words has changed.

This data is represented as a sparse matrix, meaning only non-zero values are stored, along with their matrix locations.
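
To make "only non-zero values are stored with their matrix location" concrete, here is a minimal sketch using a tiny made-up matrix (the numbers are hypothetical and not taken from the newsgroups data). It assumes the same CSR format reported in the loading cell's output.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny dense matrix that is mostly zeros (a stand-in for tf-idf features).
dense = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.2, 0.0, 0.0, 0.7]])

sparse = csr_matrix(dense)

# CSR keeps three arrays: the non-zero values, their column indices,
# and row pointers marking where each row starts in the other two arrays.
values = sparse.data       # non-zero values, stored row by row
columns = sparse.indices   # column index of each stored value
row_ptr = sparse.indptr    # row i occupies values[row_ptr[i]:row_ptr[i + 1]]

# Same density calculation as in the loading cell.
percent_nonzero = 100.0 * sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print("stored values: {}".format(values.tolist()))
print("percent non-zero: {:.1f}".format(percent_nonzero))
```

Only three values are stored for the twelve entries here. At roughly 0.12% density, the same idea is what keeps an 11,314 by 130,107 matrix small enough to hold in memory.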

___

## Measures of Distance
In the following block of code, we isolate three instances from the dataset. Instance `a` is from the computer graphics group, `b` is from the recreation autos group, and `c` is from the recreation motorcycles group. **Exercise for part 2**: Calculate the Euclidean distance, cosine distance, and Jaccard dissimilarity between each pair of instances using the functions imported below. Remember that the Jaccard measure is only defined for binary-valued vectors, so convert the vectors to binary using a threshold. **Question for part 2**: Which distance seems most appropriate to use for this data? Why?

In [2]:
from scipy.spatial.distance import cosine
from scipy.spatial.distance import euclidean
from scipy.spatial.distance import jaccard
import numpy as np

# get first instance (comp)
idx = 550
a = ds.data[idx].todense()
a_class = ds.target_names[ds.target[idx]]
print 'Instance A is from class', a_class

# get second instance (autos)
idx = 4000
b = ds.data[idx].todense()
b_class = ds.target_names[ds.target[idx]]
print 'Instance B is from class', b_class

# get third instance (motorcycle)
idx = 7000
c = ds.data[idx].todense()
c_class = ds.target_names[ds.target[idx]]
print 'Instance C is from class', c_class

# Enter distance comparison below for each pair of vectors:
print 'Euclidean Distance\n ab:', euclidean(a,b), 'ac:', euclidean(a,c), 'bc:', euclidean(b,c)
print 'Cosine Distance\n ab:', cosine(a,b), 'ac:', cosine(a,c), 'bc:', cosine(b,c)
print 'Jaccard Dissimilarity (vectors should be boolean values)\n ab:', jaccard(a>0,b>0), 'ac:', jaccard(a>0,c>0), 'bc:', jaccard(b>0,c>0)

print '\n\nThe most appropriate distance is...' 
print 'Cosine. It clearly delineates between topics effectively. B and C are closest (by far),',
print 'and A is about the same distance from B and C. Euclidean also works well, but not quite as well as "cosine"'

Instance A is from class comp.graphics
Instance B is from class rec.autos
Instance C is from class rec.motorcycles
Euclidean Distance
 ab: 1.09851846719 ac: 1.18914054254 bc: 0.917779422666
Cosine Distance
 ab: 0.603371411376 ac: 0.707027614956 bc: 0.421159534335
Jaccard Dissimilarity (vectors should be boolean values)
 ab: 0.882113821138 ac: 0.875471698113 bc: 0.908794788274


The most appropriate distance is...
Cosine. It clearly delineates between topics effectively. B and C are closest (by far), and A is about the same distance from B and C. Euclidean also works well, but not quite as well as "cosine"
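
The three measures used above can also be written out by hand. The sketch below uses two small made-up vectors (`u` and `v` are hypothetical, not rows of `ds.data`) and plain NumPy, following the usual definitions of Euclidean distance, cosine distance, and Jaccard dissimilarity on binarized vectors.

```python
import numpy as np

# Two small hypothetical tf-idf-like vectors, each with unit length.
u = np.array([0.0, 0.6, 0.0, 0.8])
v = np.array([0.6, 0.0, 0.0, 0.8])

# Euclidean distance: length of the difference vector.
euclid = np.sqrt(np.sum((u - v) ** 2))

# Cosine distance: one minus the cosine of the angle between the vectors.
cos_dist = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Jaccard dissimilarity on binarized vectors: positions where exactly one
# vector is non-zero, divided by positions where at least one is non-zero.
ub, vb = u > 0, v > 0
jaccard_dissim = float(np.sum(ub != vb)) / np.sum(ub | vb)

print("euclidean: {:.4f}".format(euclid))
print("cosine distance: {:.4f}".format(cos_dist))
print("jaccard dissimilarity: {:.4f}".format(jaccard_dissim))
```

For unit-length vectors, the squared Euclidean distance equals twice the cosine distance. The Euclidean and cosine values printed above fit this relationship (for example, 1.0985² ≈ 1.2067 ≈ 2 × 0.6034), which suggests the rows of `ds.data` are scaled to unit length, so both measures should rank these pairs in the same order.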


___
## Using scikit-learn with KNN
Now let's use stratified cross validation with a holdout set to train a KNN model in `scikit-learn`. Use the example below to train a KNN classifier. The documentation for KNeighbors is here: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html  

**Exercise for part 3**: Use the code below to test what value of `n_neighbors` works best for the given data. *Note: do NOT change the metric to be anything other than `'euclidean'`. Other distance functions are not optimized for the amount of data we are working with.* **Question for part 3**: What is the accuracy of the best classifier you can create for this data (by changing only the `n_neighbors` parameter)? 

In [3]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# one stratified 50/50 split into a training set and a holdout set
cv = StratifiedShuffleSplit(n_splits=1, test_size=0.5, train_size=0.5)

for trainidx, testidx in cv.split(ds.data, ds.target):
    # note that these are sparse matrices
    X_train = ds.data[trainidx] 
    X_test = ds.data[testidx] 
    y_train = ds.target[trainidx]
    y_test = ds.target[testidx]

clf = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')

# fill in your code here
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)

print 'Accuracy of classifier is:', accuracy_score(y_test,y_hat)

Accuracy of classifier is: 0.500972246774
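
One way to approach the `n_neighbors` exercise is to loop over several candidate values and record the accuracy of each. The sketch below is only an illustration under stated assumptions: it uses a small synthetic two-class dataset and a hand-written brute-force KNN (`knn_predict` is a hypothetical helper, not part of scikit-learn) so that it runs on its own. With the notebook's data, the same loop would wrap `KNeighborsClassifier(n_neighbors=k, ...)` instead.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Brute-force k-nearest-neighbor prediction with Euclidean distance."""
    # Pairwise squared distances, shape (n_test, n_train).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for each test point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote among the neighbors' labels.
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Synthetic two-class data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Simple shuffled 50/50 split (not stratified, unlike the cell above).
order = rng.permutation(len(y))
train, test = order[:100], order[100:]

accuracies = {}
for k in [1, 3, 5, 11, 21]:
    y_hat = knn_predict(X[train], y[train], X[test], k)
    accuracies[k] = np.mean(y_hat == y[test])

best_k = max(accuracies, key=accuracies.get)
print("best n_neighbors: {} (accuracy {:.3f})".format(best_k, accuracies[best_k]))
```

Small values of `k` follow individual training points closely, while larger values average over more neighbors, so the best setting depends on the data.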


**Question for part 3**: With sparse data, does the use of a KDTree representation make sense? Why or why not?

Enter your answer below:

*Double Click to edit*

*Not usually because the cost of measuring distance with truly sparse data is small--usually smaller than the cost of creating a tree. Sparse data also will not branch well in a KDTree because there are so many zeros. The number of "zero" comparisons will mean each tree needs to be very deep. Therefore the benefit of the KDTree is drastically reduced.*

_____
## KNN extensions - Centroids
Now let's look at a classifier closely related to KNN, called nearest centroid. In this classifier (which is more appropriate for big data scenarios and sparse data), the training step calculates a centroid for each class, and these centroids are saved. At prediction time, an unknown instance only needs its distance calculated to each saved centroid, drastically decreasing the time required for a prediction. **Exercise for part 4**: Use the template code below to create a nearest centroid classifier. Test which metric has the best performance: Euclidean, Cosine, or Manhattan. In `scikit-learn` you can see the documentation for NearestCentroid here: 
- http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html#sklearn.neighbors.NearestCentroid

and for supported distance metrics here:
- http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics

In [4]:
from sklearn.neighbors import NearestCentroid

# the parameters for the nearest centroid metric to test are:
#    l1, l2, and cosine (all are optimized)
clf = NearestCentroid(metric='euclidean')

# fill in your code here
for d in ['l1','l2','cosine']:
    clf = NearestCentroid(metric=d)
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    print 'Accuracy of', d,'is:', accuracy_score(y_test,y_hat)

print 'The best distance metric is: ', 'Cosine'

Accuracy of l1 is: 0.346119851511
Accuracy of l2 is: 0.410641682871
Accuracy of cosine is: 0.489658829768
The best distance metric is:  Cosine
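
The two steps of nearest centroid described earlier in this section (compute one centroid per class, then assign each new instance to its closest centroid) can be sketched in a few lines of NumPy. This is a minimal illustration with Euclidean distance on a made-up dense toy dataset. The helper names `fit_centroids` and `predict_nearest_centroid` are hypothetical, and this is not how `scikit-learn` implements `NearestCentroid` internally.

```python
import numpy as np

def fit_centroids(X, y):
    """Training step: one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(X, classes, centroids):
    """Prediction step: assign each row to the class of its closest centroid."""
    # Squared Euclidean distance to every centroid, shape (n_samples, n_classes).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy training data: two classes in two dimensions (hypothetical values).
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(X_toy, y_toy)
pred = predict_nearest_centroid(np.array([[0.1, 0.2], [0.8, 0.8]]), classes, centroids)
print("predicted classes: {}".format(pred.tolist()))
```

Prediction compares each instance against one centroid per class (20 for this dataset) rather than against every training instance, which is where the speedup over KNN comes from.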




___
## Naive Bayes Classification
Now let's look at the use of the Naive Bayes classifier. The 20 newsgroups dataset has 20 classes and about 130,000 features. Recall that the Naive Bayes classifier calculates a posterior probability for each possible class. Each posterior is proportional to the class prior multiplied by many conditional probabilities, one per feature. **Question for part 5**: With this many classes and features, how many different conditional probabilities need to be parameterized? How many priors need to be modeled?

Enter your answer here:

*Double Click to edit*

*The number of features is 130,107 and the number of classes is 20. The total number of conditionals is the product of these, or roughly 2.6 million. There are 20 priors, one for each class.*

In [5]:
# Use this space for any calculations you might want to do

n_features = float(ds.data.shape[1])
n_classes = float(len(ds.target_names))

print n_features, n_classes, n_features*n_classes

130107.0 20.0 2602140.0
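
For reference, the count above follows from the Naive Bayes factorization. Writing $C$ for the number of classes and $d$ for the number of features (symbols introduced here for illustration), the posterior for class $c$ given an instance $x = (x_1, \dots, x_d)$ is

$$
P(c \mid x_1, \dots, x_d) \;\propto\; P(c) \prod_{i=1}^{d} P(x_i \mid c)
$$

so the model needs one prior $P(c)$ per class and one conditional $P(x_i \mid c)$ per feature and class pair:

$$
\text{priors} = C = 20, \qquad \text{conditionals} = C \cdot d = 20 \times 130{,}107 = 2{,}602{,}140
$$

Each conditional may itself need more than one parameter, for example a mean and a variance when it is modeled as a Gaussian.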


___
## Naive Bayes in Scikit-learn
Scikit-learn has several implementations of the Naive Bayes classifier: `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. Look at the documentation here: http://scikit-learn.org/stable/modules/naive_bayes.html Take a look at each implementation and then answer this question: **Question for part 6**: If my instances contain continuous attributes, would it be better to use Gaussian Naive Bayes, Multinomial Naive Bayes, or Bernoulli Naive Bayes? Why? If the data is sparse, does this change your answer? Why or why not?

Enter your answer here:

*Double Click to edit*

*It is probably better to use Gaussian as long as the number of examples is large because the continuous data can be characterized more easily. However, sparse matrices are much harder to find realistic Gaussian models for because they always have a mean near zero. For sparse data, it is probably better (and faster) to use multinomial naive Bayes. An argument can also be made for Bernoulli if binarizing the feature data helps to reduce the complexity of the problem.*
___

## Naive Bayes Comparison
For the final section of this notebook let's compare the performance of Naive Bayes classifiers for document classification. Look at the parameters for `MultinomialNB` and `BernoulliNB` (especially `alpha` and `binarize`). **Exercise for part 7**: Using the example code below, change the parameters for each classifier and see how accurate you can make the classifiers on the test set. **Question for part 7**: Why are these implementations so fast to train? What does the `alpha` value control in these models (*i.e.*, how does it change the learned models)? 

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB


clf_mnb = MultinomialNB(alpha=0.001)
clf_bnb = BernoulliNB(alpha=0.001, binarize=0.02)

# fill in your code here
for clf in [clf_mnb,clf_bnb]:
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    print 'Accuracy of classifier is:', accuracy_score(y_test,y_hat)
            
print 'These classifiers are simply counting instances and values, '
print 'which can be done extremely effectively for sparse matrices of data'
print 'The alpha values are controlling the smoothing parameter for never allowing'
print 'a class-value pair to have zero probability. This is called smoothing.'


Accuracy of classifier is: 0.886689057804
Accuracy of classifier is: 0.871840197985
These classifiers are simply counting instances and values, 
which can be done extremely effectively for sparse matrices of data
The alpha values are controlling the smoothing parameter for never allowing
a class-value pair to have zero probability. This is called smoothing.
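
To illustrate what `alpha` does, the sketch below applies additive (Laplace/Lidstone) smoothing to made-up word counts for a single class. The counts and the helper `smoothed_probs` are hypothetical. The formula, (count + alpha) / (total + alpha × number of features), is the standard smoothed estimate for multinomial Naive Bayes, but this is a hand-written sketch rather than scikit-learn's own code.

```python
import numpy as np

# Hypothetical word counts for one class over a 4-word vocabulary.
# The third word never appears in documents of this class.
counts = np.array([3.0, 1.0, 0.0, 6.0])

def smoothed_probs(counts, alpha):
    """Additive smoothing of per-class feature counts."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

unsmoothed = smoothed_probs(counts, alpha=0.0)  # third entry is exactly zero
light = smoothed_probs(counts, alpha=0.001)     # tiny but non-zero
heavy = smoothed_probs(counts, alpha=1.0)       # pulled toward uniform

print("alpha=0:     {}".format(unsmoothed.tolist()))
print("alpha=0.001: {}".format(light.tolist()))
print("alpha=1:     {}".format(heavy.tolist()))
```

With `alpha=0`, a single unseen word would drive the whole product of conditionals to zero for that class. Any positive `alpha` prevents this, and larger values push the estimates closer to a uniform distribution.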
