### 1)	Goal: observe the performance of classification on difficult vs easy datasets
#### Newsgroup data has 6 main classes with multiple subclasses for each class. The main classes include Computer, recreation, science, politics, miscellaneous and other.

#### a.	Perform Naiive bayes classification on an easy data set that includes the classes (recreation, computers)

In [1]:
# sys module allows python to interface with underlying operating system and access file system without worrying about underlying operating system. 
import sys 
# os module is same as above but contains ability to rename
import os
# NumPy (Numeric Python) has powerful data structures and function for efficient computation of multi-dimen
import numpy as np

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups()

In [2]:
from pprint import pprint
news.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']

In [3]:
# printing list of categories
pprint(list(news.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [4]:
news = fetch_20newsgroups(
    subset='train', 
    categories=('comp.sys.ibm.pc.hardware','rec.sport.baseball'), 
    remove=('headers','footers','quotes'))
pprint(list(news.target_names))

['comp.sys.ibm.pc.hardware', 'rec.sport.baseball']


In [5]:
#generate term frequency matrix
#feature_extraction.text is a module that turns text into vectors of numerical values suitable for statistical analysis. 
#CountVectorizer converts a collection of text documents into a matrix of token counts
#We will create below counts for the 10,000 most frequent single word tokens ignoring those words with a document frequency > 500 and english stop words
from sklearn.feature_extraction.text import CountVectorizer
tf_vec = CountVectorizer (max_df=500,
                         min_df=0,
                         max_features=10000,
                         ngram_range=(1,1),
                         stop_words='english')
#this creates a sparse matrix in which only non-zero values are stored 
#(each row is document/token pair w/ non-zero count, then count)
tf_matrix = tf_vec.fit_transform(news.data[:500])
print ('the data has %d rows and %d columns' % (tf_matrix.shape[0], tf_matrix.shape[1]))

the data has 500 rows and 7430 columns


In [6]:
print (tf_matrix[1:10])

  (0, 1242)	1
  (0, 2843)	1
  (0, 4883)	1
  (0, 6809)	1
  (0, 6356)	1
  (0, 2204)	1
  (0, 5597)	2
  (0, 7191)	1
  (0, 2138)	1
  (0, 6094)	1
  (0, 4540)	1
  (0, 3888)	1
  (0, 4577)	2
  (0, 6745)	1
  (0, 2174)	1
  (0, 386)	1
  (0, 247)	1
  (0, 358)	1
  (0, 1328)	1
  (0, 4411)	1
  (0, 3090)	2
  (0, 261)	2
  (0, 2920)	3
  (0, 6933)	2
  (0, 2099)	1
  :	:
  (8, 4292)	2
  (8, 6554)	1
  (8, 6192)	1
  (8, 5416)	1
  (8, 3378)	1
  (8, 7192)	1
  (8, 1510)	1
  (8, 5609)	1
  (8, 6381)	1
  (8, 4843)	1
  (8, 7023)	1
  (8, 3076)	2
  (8, 7251)	1
  (8, 6010)	1
  (8, 2924)	2
  (8, 7343)	1
  (8, 6260)	1
  (8, 5533)	1
  (8, 2454)	1
  (8, 5813)	1
  (8, 7037)	1
  (8, 7191)	1
  (8, 261)	1
  (8, 4260)	1
  (8, 5944)	1


In [7]:
#pandas module is library for data manipulation and analysis, especially to manipulate time series and numerical tables
import pandas as pd
#DataFrame function creates a spreadsheet essentially in which zero values are also stored
full_matrix = pd.DataFrame(tf_matrix.todense(),columns=tf_vec.get_feature_names())
print (full_matrix[1:10])

   00  000  007  01  02  03  0300  031  0334  038   ...    zbiciak  zeile  \
1   0    0    0   0   0   0     0    0     0    0   ...          0      0   
2   0    0    0   0   0   0     0    0     0    0   ...          0      0   
3   0    0    0   0   0   0     0    0     0    0   ...          0      0   
4   0    0    0   0   0   0     0    0     0    0   ...          0      0   
5   0    0    0   0   0   0     0    0     0    0   ...          0      0   
6   0    0    0   0   0   0     0    0     0    0   ...          0      0   
7   0    0    0   0   0   0     0    0     0    0   ...          0      0   
8   0    0    0   0   0   0     0    0     0    0   ...          0      0   
9   0    0    0   0   0   0     0    0     0    0   ...          0      0   

   zenith  zeos  zero  zip  zone  zoom  zorro  zupcic  
1       0     0     0    0     0     0      0       0  
2       0     0     0    0     0     0      0       0  
3       0     0     0    0     0     0      0       0  
4    

In [8]:
#classify data
#we will make an array of the true category value for each post
t=np.asarray(news.target[:500])
print t [1:10]
#train_test_split module splits our full matrix array and our category array (formatted as numpy array) 
#into random train and test subsets
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(full_matrix.as_matrix(),t,random_state=50) 


[0 0 0 0 1 1 1 0 1]


In [9]:
# naive_bayes module assuming normal distribution of features (Gaussian)
# popular baseline method for text categorization using word frequencies as features
# meaning: for each word(feature) in the document, 
# the training set has calculated a prior probability of that word(feature) belonging to a particular category 
from sklearn.naive_bayes import GaussianNB as NB
clf = NB()
y_pred = clf.fit(xtrain, ytrain).predict(xtest)
error = (y_pred != ytest).sum()
print ("number of mislabels out of %d points: %d" % (xtest.shape[0],error ))

number of mislabels out of 125 points: 6


#### b.	Perform Naiive Bayes classification on a difficult dataset that includes the classes rec.motorcycles and rec.autos.

In [10]:
news2 = fetch_20newsgroups(
    subset='train', 
    categories=('rec.motorcycles','rec.autos'), 
    remove=('headers','footers','quotes'))
tf_vec2 = CountVectorizer (max_df=500,
                         min_df=0,
                         max_features=10000,
                         ngram_range=(1,1),
                         stop_words='english')
tf_matrix2 = tf_vec2.fit_transform(news2.data[:500])
print ('the data has %d rows and %d columns' % (tf_matrix.shape[0], tf_matrix.shape[1]))

full_matrix2 = pd.DataFrame(tf_matrix2.todense(),columns=tf_vec2.get_feature_names())
t2=np.asarray(news.target[:500])
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(full_matrix2.as_matrix(),t2,random_state=50) 
y_pred2 = clf.fit(xtrain2, ytrain2).predict(xtest2)
error = (y_pred2 != ytest2).sum()
print ("number of mislabels out of %d points: %d" % (xtest2.shape[0],error ))

the data has 500 rows and 7430 columns
number of mislabels out of 125 points: 60


#### c.	Repeat a and b using the decision tree classifier

In [11]:
# tree module includes decision tree-based models for classification and regression
from sklearn import tree
# Decision Tree Classifier generates an algorithm based on feature values that classifies the data
clf = tree.DecisionTreeClassifier()
y_pred = clf.fit(xtrain, ytrain).predict(xtest)
error = (y_pred != ytest).sum()
print ("for the easy data set, number of mislabels out of %d points: %d" % (xtest.shape[0],error ))
y_pred2 = clf.fit(xtrain2, ytrain2).predict(xtest2)
error = (y_pred2 != ytest2).sum()
print ("for the difficult data set, number of mislabels out of %d points: %d" % (xtest2.shape[0],error ))

for the easy data set, number of mislabels out of 125 points: 18
for the difficult data set, number of mislabels out of 125 points: 64


#### d.	Discuss the results

Naive Bayes is popular for test classification and this exercise demonstrates why. Because there are so many words(features) in the text volume overall, Naive Bayes allows you to assess all relevant features in a document at one time and combine their prior probabilities to predict the class of the new document. Decision tree is less flexible. The algorithm is developed to pick the most polarized features first and then assign class based on value of one feature at a time. Once classified, the algorithm can't then reassess the other features in the document to refine the likelihood of the correct class. So it makes sense that the decision tree classifiers perform less well than the naive bayes classifiers.  But naive bayes seems to lose it's power as the classification becomes more subtle (as when the text documents are on very similar topics) and error rate approaches that of decision tree. Which by the way is pretty close to chance. 

#### Repeat the classification of the difficult data set above using:

a.	Bagging

In [12]:
# Bagging or Bootstrap Aggregation takes a sample data set with replacement to train and generates a classifier
# Classification in this case is naive bayes
# This process is repeated over and over.  Then all classifiers are applied a novel data set.
# For each target, the class with the most votes wins!

from sklearn.ensemble import BaggingClassifier 
# in this case we are using base estimator of naive bayes, a default number of 10 times
# not sure what max_samples and max_features means
bagging = BaggingClassifier (NB(), max_samples=.5, max_features=.5)
y_pred2 = bagging.fit(xtrain2, ytrain2).predict(xtest2)
error = (y_pred2 != ytest2).sum()
print ("With Bagging - number of mislabels out of %d points: %d" % (xtest2.shape[0],error ))

With Bagging - number of mislabels out of 125 points: 65


b.	AdaBoost

In [13]:
# AdaBoost generates a classifier, then gives more weight to the points that were incorrectly classified.
# Then generates another classifier, and so on
# Each classifier is weighted based on its accuracy and the votes are combined
from sklearn.ensemble import AdaBoostClassifier
# Here we again using base estimator of naive bayes 300 times using the SAMME algorithm
# Discrete SAMME AdaBoost adapts based on errors in predicted class labels whereas real SAMME.R uses the predicted class probabilities.
clf = AdaBoostClassifier(NB(),
                         algorithm="SAMME",
                         n_estimators=300)

y_pred2 = clf.fit(xtrain2, ytrain2).predict(xtest2)
error = (y_pred2 != ytest2).sum()
print ("With AdaBoost - number of mislabels out of %d points: %d" % (xtest2.shape[0],error ))

With AdaBoost - number of mislabels out of 125 points: 60


c.	RandomForest

In [14]:
# RandomForst generates a classifier (tree) using a bootstrap sample, for each node a set of random features are chosen
# the best feature to divide the sample is chosen to split the data 
# a bunch of additional classifiers are generated in the same way and then votes are combined
from sklearn.ensemble import RandomForestClassifier
# here we are generating 100 trees with each node assessing sqrt of our 10,000 features
clf = RandomForestClassifier(n_estimators=100, max_depth=None,random_state=10, max_features='auto')
y_pred2 = clf.fit(xtrain2, ytrain2).predict(xtest2)
error = (y_pred2 != ytest2).sum()
print ("With Random Forest- number of mislabels out of %d points: %d" % (xtest2.shape[0],error ))

With Random Forest- number of mislabels out of 125 points: 61


#### Compare performance of the ensemble classifiers.

Each time they are run, they give a slightly different answer all hovering around 60. There seems to be more variation in Bagging's output than AdaBoost or Randome Forest but mostly, they perform about equally (certain within the margin of error).  But none of them help me to better categorize documents as auto or motorcycle. 

#### Experiment with (10,20,30)-fold cross validation and discuss whether increasing the folds affects the stability of the ensembles performance.

In [15]:
from sklearn.model_selection import KFold
from sklearn import metrics

t=np.asarray(news2.target[:500])   # true labels
kf = KFold(n_splits =30)
i=0
for train, test in kf.split(tf_matrix): 
    xtrain,xtest = tf_matrix[train],  tf_matrix[test]
    ytrain, ytest = t[train], t[test]
    #clf = RandomForestClassifier(n_estimators=100, max_depth=None,random_state=10, max_features='auto')
    #clf = NB()
    clf = AdaBoostClassifier(NB(),algorithm="SAMME",n_estimators=300)
    y = clf.fit(xtrain.toarray(), ytrain).predict(xtest.toarray())
    acc=metrics.accuracy_score(ytest, y)
    i=i+1

    print (" Accuracy of fold (%d) is %3.3f" % (i,acc ))



 Accuracy of fold (1) is 0.176
 Accuracy of fold (2) is 0.353
 Accuracy of fold (3) is 0.471
 Accuracy of fold (4) is 0.471
 Accuracy of fold (5) is 0.529
 Accuracy of fold (6) is 0.529
 Accuracy of fold (7) is 0.529
 Accuracy of fold (8) is 0.294
 Accuracy of fold (9) is 0.824
 Accuracy of fold (10) is 0.353
 Accuracy of fold (11) is 0.765
 Accuracy of fold (12) is 0.471
 Accuracy of fold (13) is 0.529
 Accuracy of fold (14) is 0.647
 Accuracy of fold (15) is 0.294
 Accuracy of fold (16) is 0.706
 Accuracy of fold (17) is 0.294
 Accuracy of fold (18) is 0.353
 Accuracy of fold (19) is 0.588
 Accuracy of fold (20) is 0.529
 Accuracy of fold (21) is 0.375
 Accuracy of fold (22) is 0.625
 Accuracy of fold (23) is 0.688
 Accuracy of fold (24) is 0.375
 Accuracy of fold (25) is 0.438
 Accuracy of fold (26) is 0.688
 Accuracy of fold (27) is 0.312
 Accuracy of fold (28) is 0.688
 Accuracy of fold (29) is 0.562
 Accuracy of fold (30) is 0.438


Increasing the folds does affect the stability of the ensemble classifiers performance negatively. The more folds you have, the wider the variation in accuracy. 

### 2)	 Multi-label classification:
#### Form a sample of the following classes: comp.graphics, rec.autos, talk.politics.gun, soc.religion.christian. Include at least 200 documents from each subclass. 

In [16]:
news = fetch_20newsgroups(
    subset='train', 
    categories=('comp.graphics', 'rec.autos', 'talk.politics.guns', 'soc.religion.christian'), 
    remove=('headers','footers','quotes'))
tf_vec = CountVectorizer (max_df=500,
                         min_df=0,
                         max_features=30000,
                         ngram_range=(1,1),
                         stop_words='english')
tf_matrix = tf_vec.fit_transform(news.data)
print ('the data has %d rows and %d columns' % (tf_matrix.shape[0], tf_matrix.shape[1]))


the data has 2323 rows and 29847 columns


#### Perform one-vs-all classification using Naïve Bayes, decision trees and support vector machine SVC classifiers.

In [None]:
# one versus all classification develops classifiers by comparing one category to all others
# then applies all developed classifiers to each new instance and the class with the highest confidence score wins
full_matrix = pd.DataFrame(tf_matrix.todense(),columns=tf_vec.get_feature_names())
t=np.asarray(news.target)
xtrain, xtest, ytrain, ytest = train_test_split(full_matrix.as_matrix(),t,random_state=50) 

# A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. 
# In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

clf1= LinearSVC(random_state=10)
clf2 = NB()
clf3 = tree.DecisionTreeClassifier()

for name, classifier in zip(['SVC','NB','DT'],[clf1,clf2,clf3]):
    y_pred = OneVsRestClassifier(classifier).fit(xtrain, ytrain).predict(xtest)
    error = (y_pred != ytest).sum()
    print ("One-vs-All %s --> number of mislabels out of %d points in the test test: %d" % (name, xtest.shape[0],error ))




One-vs-All SVC --> number of mislabels out of 581 points in the test test: 92
One-vs-All NB --> number of mislabels out of 581 points in the test test: 125


#### Repeat classification using all-vs-all.

In [20]:
# One-vs-One (All-vs-All)
# one vs one makes a classifier for each pair of categories, then again runs all classifiers on new instance
# which ever category has the most votes wins
# 
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

clf1 = LinearSVC(random_state=10)
clf2 = NB()
clf3 = tree.DecisionTreeClassifier()

y_pred = OneVsOneClassifier(clf1).fit(xtrain, ytrain).predict(xtest)
error = (y_pred != ytest).sum()
print ("One-vs-One --> SVC number of mislabels out of %d points in the test test: %d" % (xtest.shape[0],error ))
print("accuracy  is %2.2f " % (metrics.accuracy_score(ytest, y_pred)))
y_pred = OneVsOneClassifier(clf2).fit(xtrain, ytrain).predict(xtest)
error = (y_pred != ytest).sum()
print ("One-vs-One --> NB number of mislabels out of %d points in the test test: %d" % (xtest.shape[0],error ))
print("accuracy  is %2.2f " % (metrics.accuracy_score(ytest, y_pred)))
y_pred = OneVsOneClassifier(clf3).fit(xtrain, ytrain).predict(xtest)
error = (y_pred != ytest).sum()
print ("One-vs-One --> DT number of mislabels out of %d points in the test test: %d" % (xtest.shape[0],error ))
print("accuracy  is %2.2f " % (metrics.accuracy_score(ytest, y_pred)))


One-vs-One --> SVC number of mislabels out of 581 points in the test test: 100
accuracy  is 0.83 
One-vs-One --> NB number of mislabels out of 581 points in the test test: 71
accuracy  is 0.88 
One-vs-One --> DT number of mislabels out of 581 points in the test test: 113
accuracy  is 0.81 


#### Discuss the results for both approaches and rank classifiers based on performance.

For this text categorization, naive bayes in one vs one performs the best. Categories are somewhat different making naive bayes a good choice but the one vs one is a better model than the one vs rest. For some reason one vs rest using decision tree keeps timing out but I did get it to run previous and it was similar to NB. 
1. ovo NB
2. ovr SVC
3. ovo SVC
4. ovo DT
5. ovo NB
6. ovo DT

### 3) Assuming a decision stump classifier with the AdaBoost ensemble classifier. How does the classifier process the weights of the data points to focus on misclassified data points?

Each misclassified point is increased in weight so then a new stump that fits more of those points would be preferred over the original stump. I feel like I'm missing something here though.  Would love an explanation.