# Supervised Learning

Dataset: [20 newsgroups](http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset) from scikit-learn

In [49]:
from sklearn.datasets import fetch_20newsgroups


# To speed up processing, we load only 4 of the 20 topics in the dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# get data for training
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

# get data for testing
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

print("\n",train.keys())


 dict_keys(['description', 'target', 'target_names', 'data', 'filenames', 'DESCR'])


## Prepare data
Let us first convert the collection of text documents to a matrix of token counts with CountVectorizer.


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
# learn the vocabulary from the training data and transform it
train_token = count_vec.fit_transform(train.data)

# only transform the test data, reusing the training vocabulary
test_token = count_vec.transform(test.data)
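
To make the token-count matrix concrete, here is a minimal sketch on a hypothetical two-document corpus (the sentences and the `toy_` names are illustrative assumptions, not part of the 20 newsgroups data). It also shows why the test data is only transformed: tokens that were not seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary

# Recover the column order from the learned vocabulary (token -> column index)
toy_vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())

# "dog" was never seen during fit, so it contributes no column and no count
toy_unseen = toy_vec.transform(["the dog sat"]).toarray()
print(toy_unseen)
```

Each row is one document and each column is one token of the vocabulary, so the matrix for the real dataset has one row per post and one column per distinct token.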

To feed text into predictive or clustering models, we need to turn it into vectors of numerical values suitable for statistical analysis. Raw counts give too much weight to very common words, so we rescale them with TF-IDF, which reduces the weight of words that appear in many documents.

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tf = TfidfTransformer()
# fit the IDF weights on the training counts and transform them
train_tf = tf.fit_transform(train_token)

# only transform the test counts, reusing the training IDF weights
test_tf = tf.transform(test_token)
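
The down-weighting of common words can be checked on a small example. This is a minimal sketch on a hypothetical three-document corpus (the sentences and the `toy_` names are assumptions for illustration), using the default settings of `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: "the" occurs in every document, "cat" in only one
toy_docs = ["the cat", "the dog", "the bird sings"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)

toy_tfidf = TfidfTransformer()
toy_weights = toy_tfidf.fit_transform(toy_counts).toarray()

# Map each token to its learned inverse document frequency
idf = {token: toy_tfidf.idf_[col] for token, col in toy_vec.vocabulary_.items()}
print(idf)

# In "the cat" both tokens occur once, yet "cat" receives the larger weight
the_col = toy_vec.vocabulary_["the"]
cat_col = toy_vec.vocabulary_["cat"]
print(toy_weights[0, the_col], toy_weights[0, cat_col])
```

A token that appears in every document gets the smallest possible IDF, so it contributes less to the final vector than a rarer token with the same count.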

The last two steps can be done in a single step with `TfidfVectorizer`.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer()
# fit on the raw training text and transform it
train_tv = tv.fit_transform(train.data)

# only transform the raw test text
test_tv = tv.transform(test.data)
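
As a sanity check, the one-step and two-step routes should produce the same matrix when default settings are used. The sketch below verifies this on a hypothetical toy corpus (the sentences and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration
toy_docs = ["graphics card drivers", "medical study results", "graphics study"]

# Two steps: token counts first, then TF-IDF weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(toy_docs))

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(toy_docs)

same_matrix = np.allclose(two_step.toarray(), one_step.toarray())
print(same_matrix)
```

This equivalence is why a model trained on `train_tf` and one trained on `train_tv` can be expected to give the same score.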


## KNN

In [12]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier().fit(train_tv, train.target)
knn_prediction = knn.predict(test_tv)

knn.score(test_tv, test.target)


0.76564580559254325

In [13]:

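# same model on the two-step TF-IDF features; the score is identical
# because TfidfVectorizer is equivalent to CountVectorizer + TfidfTransformer
knn = KNeighborsClassifier().fit(train_tf, train.target)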
knn_prediction = knn.predict(test_tf)

knn.score(test_tf, test.target)

0.76564580559254325

An accuracy of about 0.77 leaves room for improvement. Let's see whether tuning `n_neighbors` with `GridSearchCV` gives a better result.

In [41]:
from sklearn.model_selection import GridSearchCV
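parameters = {'n_neighbors': [1, 2, 3, 4, 5, 7, 10]}

# cross-validated search over n_neighbors on the training data
knn_cv = GridSearchCV(knn, parameters).fit(train_tf, train.target)
knn_cv_prediction = knn_cv.predict(test_tf)

# accuracy of the refit best model on the held-out test set
print(knn_cv.score(test_tf, test.target))
print(knn_cv.best_params_)
# mean cross-validated accuracy of the best parameters on the training set
print("Best score: ", knn_cv.best_score_)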

0.796271637816
{'n_neighbors': 1}
Best score:  0.899867080195
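
When reading these numbers, keep in mind that `best_score_` is the mean cross-validated score of the best parameter setting, computed on the training data only, so it is not directly comparable with accuracy on the test set. The sketch below illustrates the difference on the small iris dataset bundled with scikit-learn (an assumption made so the example runs quickly; it is not the newsgroups data).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset, used here only to keep the sketch fast
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]}, cv=5)
search.fit(X_train, y_train)

# best_score_ is the mean validation score of the winning parameter setting
cv_score = search.best_score_
same_value = search.cv_results_['mean_test_score'][search.best_index_]

# score() evaluates the refit best model on data the search never saw
test_score = search.score(X_test, y_test)

print(search.best_params_, cv_score, test_score)
```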


## SVM

Key concepts to remember:
* Kernels
* Kernel trick
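
The kernel trick can be illustrated with a small numeric check. This is a minimal sketch, assuming a homogeneous degree-2 polynomial kernel on 2-D vectors; the helper names `phi` and `poly2_kernel` are hypothetical. It shows that the kernel value equals a dot product in a higher-dimensional feature space without ever building that space explicitly.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b) ** 2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)      # same value, computed in the original 2-D space

print(explicit, implicit)
```

Kernels such as `'rbf'` in `SVC` rely on the same idea, where the implicit feature space is too large to construct directly.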

In [38]:
from sklearn.svm import SVC

svc = SVC(kernel = 'linear', C=1.0).fit(train_tv, train.target)
svc_prediction = svc.predict(test_tv)

svc.score(test_tv, test.target)

0.92077230359520634

The score is better, but let's see if we can improve it further with `GridSearchCV`.

In [39]:
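parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10, 15]}
# cross-validated search over kernel and C on the training data
svc_cv = GridSearchCV(svc, parameters).fit(train_tv, train.target)
svc_cv_prediction = svc_cv.predict(test_tv)

# accuracy of the refit best model on the held-out test set
print(svc_cv.score(test_tv, test.target))
print(svc_cv.best_params_)
# mean cross-validated accuracy of the best parameters on the training set
print("Best score: ", svc_cv.best_score_)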

0.922103861518
{'C': 10, 'kernel': 'linear'}
Best score:  0.965883916704


## Decision Tree

Key concepts to remember:
* Entropy
* Tends to overfit
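
Entropy measures how mixed the class labels in a node are, and a tree can choose splits that reduce it the most. The following is a minimal sketch in plain Python; the helper names `entropy` and `information_gain` are hypothetical, and scikit-learn's `DecisionTreeClassifier` uses the Gini criterion by default unless `criterion='entropy'` is passed.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

mixed = [0, 0, 1, 1]  # evenly mixed node: maximum entropy for two classes
pure = [0, 0, 0, 0]   # pure node: no uncertainty

print(entropy(mixed), entropy(pure))
print(information_gain(mixed, [0, 0], [1, 1]))
```

A tree that keeps splitting until every leaf is pure fits the training data perfectly, which is one reason unconstrained trees tend to overfit.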

In [44]:
from sklearn import tree

dt = tree.DecisionTreeClassifier().fit(train_tv, train.target)
dt_prediction = dt.predict(test_tv)
print(dt.score(test_tv, test.target))


0.717043941411



#### Check for overfitting using k-fold cross-validation

In [62]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt, train_tv, train.target, cv=5)

print(scores.mean())

0.782014111989
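
To show what `cross_val_score` does internally, here is a minimal sketch that repeats the procedure by hand on the small iris dataset bundled with scikit-learn (an assumption to keep the example fast). For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A fixed random_state makes the tree deterministic so both runs match
model = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Train a fresh copy on 4 folds, score it on the remaining fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(model, X, y, cv=5)

print(np.mean(manual_scores), auto_scores.mean())
```

Cross-validation does not change the model. It gives a more reliable estimate of how the model generalizes, which helps reveal overfitting.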


#### Reduce overfitting using a Random Forest classifier

In [71]:
from sklearn.ensemble import RandomForestClassifier

# baseline forest with default settings, scored on the test set
rf = RandomForestClassifier().fit(train_tv, train.target)
rf_prediction = rf.predict(test_tv)
print(rf.score(test_tv, test.target))

# cross-validated search over the number of trees
parameters = {'n_estimators': [1, 10, 15, 20, 30, 90]}
rf_cv = GridSearchCV(rf, parameters).fit(train_tv, train.target)
rf_cv_prediction = rf_cv.predict(test_tv)


print(rf_cv.best_params_)
print("Best score: ", rf_cv.best_score_)

0.735019973369
{'n_estimators': 90}
Best score:  0.880372175454
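
A random forest reduces the variance of individual trees by averaging them. The sketch below, which uses the small iris dataset bundled with scikit-learn as an assumption to keep it fast, checks that the forest's class probabilities are the mean of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Class probabilities from every individual tree: (n_trees, n_samples, n_classes)
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
averaged = per_tree.mean(axis=0)
matches_forest = np.allclose(averaged, forest.predict_proba(X))

print(per_tree.shape, matches_forest)
```

Because each tree is trained on a different bootstrap sample with random feature subsets, their individual errors partly cancel out in the average.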


## Naive Bayes

In [63]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB().fit(train_tv, train.target)
nb_prediction = nb.predict(test_tv)

print(nb.score(test_tv, test.target))

0.834886817577
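
The vectorizer and the classifier can also be chained so that fitting and transforming always happen in the right order. This is a minimal sketch using `make_pipeline` on a hypothetical four-document corpus; the sentences, labels, and `toy_` names are assumptions for illustration, not the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 1 = advertising, label 0 = work mail
toy_docs = [
    "buy cheap pills now",
    "cheap pills online",
    "meeting agenda attached",
    "project meeting tomorrow",
]
toy_labels = [1, 1, 0, 0]

# fit() runs fit_transform on the vectorizer, then fits the classifier;
# predict() runs transform only, then predicts
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(toy_docs, toy_labels)

toy_pred = text_clf.predict(["cheap pills", "meeting agenda"])
print(toy_pred)
```

With the real data, the same pipeline could be fitted on `train.data` and scored on `test.data` directly, without keeping separate `train_tv` and `test_tv` matrices.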
