# News Category Classification

The goal of this lab is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroup posts) on twenty different topics.

In this lab we will see how to:

- Load the file contents and the categories
- Extract feature vectors suitable for machine learning
- Train a classifier
- Build a pipeline
- Tune parameters using grid search
- Evaluate performance on the test set

## Loading the 20 newsgroups dataset

We will use the built-in dataset loader for 20 newsgroups from scikit-learn. To get faster execution times for this first example, we will work on a partial dataset with only 3 of the 20 available categories:

In [None]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']

We can now load the list of files matching those categories as follows:

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=12)

The target_names attribute holds the list of the requested category names:

In [None]:
twenty_train.target_names

The samples and the corresponding categories used for training are as follows:

In [None]:
X_train = twenty_train.data
y_train = twenty_train.target

## Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

### Write code to implement the following:
- initialize TfidfVectorizer
- fit and transform using training data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
### START CODE HERE ###
# initialize TfidfVectorizer as tfidf_vectorizer
tfidf_vectorizer = 

# fit and transform using training data
X_train_tfidf = 
### END CODE HERE ###
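
To illustrate what TF-IDF feature extraction produces, here is a minimal sketch on a made-up three-sentence corpus rather than the newsgroups data. The names `toy_docs`, `toy_vectorizer`, `toy_matrix` and `new_matrix` are hypothetical and chosen only for this example; default vectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (an assumption for illustration, not the lab data)
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then returns a sparse matrix
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per vocabulary term
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# transform reuses the learned vocabulary; words not seen during fit are ignored
new_matrix = toy_vectorizer.transform(["the cat chased the dog"])
print(new_matrix.shape)
```

Note that new text is passed to `transform`, not `fit_transform`, so that it is mapped onto the same columns as the training matrix.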

## Training a classifier

Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s use [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), a linear model that is a widely used and strong baseline for text classification.

### Write code to implement the following:
- initialize and train a LogisticRegression classifier as clf

In [None]:
from sklearn.linear_model import LogisticRegression
### START CODE HERE ###
# initialize and train a LogisticRegression classifier as clf
clf = 
### END CODE HERE ###

- predict the category on new documents 

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
### START CODE HERE ###
# transform new documents using the same feature extraction method as before
# that is, use the fitted tfidf_vectorizer to transform the new documents
X_new_tfidf = 
# predict the category on new documents 
predicted = 
### END CODE HERE ###
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
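
The fit-then-predict workflow above can be sketched on a tiny hand-written corpus. Everything prefixed with `toy_` (and the `new_docs` list) is a hypothetical name invented for this illustration, and default LogisticRegression settings are assumed; with so little data the predictions themselves are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: label 0 = religion, label 1 = graphics
toy_docs = [
    "prayer and faith in church",
    "faith and scripture study",
    "rendering polygons with shaders",
    "shaders and texture rendering",
]
toy_labels = [0, 0, 1, 1]
toy_names = ["religion", "graphics"]

# Learn the vocabulary on the training documents only
toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_docs)

# Train the classifier on the TF-IDF features
toy_clf = LogisticRegression()
toy_clf.fit(toy_features, toy_labels)

# New documents go through transform (not fit_transform) before predict
new_docs = ["church scripture", "texture polygons"]
new_features = toy_vectorizer.transform(new_docs)
toy_predicted = toy_clf.predict(new_features)

for doc, label in zip(new_docs, toy_predicted):
    print("%r => %s" % (doc, toy_names[label]))
```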

## Building a pipeline

To make the vectorizer => classifier chain easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier.

### Write code to implement the following:
- construct a pipeline to assemble TfidfVectorizer and LogisticRegression
- name the pipeline text_clf
- you will use [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline); check out its documentation for more details.

In [None]:
from sklearn.pipeline import Pipeline
### START CODE HERE ###
text_clf = 
### END CODE HERE ###
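
As a rough sketch of how a Pipeline chains a vectorizer and a classifier, the following uses made-up documents and hypothetical step names (`toy_tfidf`, `toy_logreg`); the step names are free-form strings, so any other names would work equally well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = religion, label 1 = graphics
toy_docs = [
    "prayer and faith in church",
    "faith and scripture study",
    "rendering polygons with shaders",
    "shaders and texture rendering",
]
toy_labels = [0, 0, 1, 1]

# Each step is a (name, estimator) pair; the last step is the classifier
toy_pipeline = Pipeline([
    ("toy_tfidf", TfidfVectorizer()),
    ("toy_logreg", LogisticRegression()),
])

# fit runs fit_transform on the vectorizer, then fit on the classifier
toy_pipeline.fit(toy_docs, toy_labels)

# predict accepts raw text: the pipeline applies transform, then predict
toy_predicted = toy_pipeline.predict(["church scripture", "texture polygons"])
print(toy_predicted)
```

The benefit is that raw strings go in and labels come out, so the vectorizer cannot accidentally be refit on new data.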

## Parameter tuning using grid search

We’ve already encountered some parameters such as ngram_range in the TfidfVectorizer. Classifiers tend to have many parameters as well; e.g., LogisticRegression has a parameter C, the inverse of the regularization strength (smaller values mean stronger regularization).

Instead of tweaking the parameters of the various components of the chain by hand, it is possible to run an exhaustive search for the best parameters on a grid of possible values.

### Write code to implement the following:
- try every combination of features built from either single words or words plus bigrams, and a LogisticRegression parameter C of either 1, 0.1 or 0.01
- you will use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html); check out its documentation for more details.

In [None]:
from sklearn.model_selection import GridSearchCV
### START CODE HERE ###
parameters = 
### END CODE HERE ###
# n_jobs specifies how many CPU cores to use; with a value of -1, grid search uses all available cores
# cv determines the cross-validation splitting strategy
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)

The grid search instance behaves like a normal scikit-learn model. Let’s perform the search on the training data.

In [None]:
gs_clf = gs_clf.fit(X_train, y_train)

A more detailed summary of the search is available in gs_clf.cv_results_. The cv_results_ attribute is a dictionary that can be easily loaded into a pandas DataFrame for further inspection.

In [None]:
import pandas as pd
cv_result = pd.DataFrame(gs_clf.cv_results_)
cv_result

The object’s best_score_ and best_params_ attributes store the best mean cross-validated score and the parameter setting corresponding to that score.

In [None]:
gs_clf.best_score_

In [None]:
gs_clf.best_params_
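
The search described in this section can be sketched end to end on toy data. This is only an illustration under stated assumptions: the documents, the `toy_` names and the pipeline step names are hypothetical, and `cv=3` is used because the toy corpus has just three documents per class. Parameters of a pipeline step are addressed in the grid as `<step name>__<parameter name>`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = religion, label 1 = graphics
toy_docs = [
    "prayer and faith in church",
    "faith and scripture study",
    "church choir and prayer",
    "rendering polygons with shaders",
    "shaders and texture rendering",
    "texture mapping on polygons",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_pipeline = Pipeline([
    ("toy_tfidf", TfidfVectorizer()),
    ("toy_logreg", LogisticRegression()),
])

# Keys follow the "<step name>__<parameter name>" convention
toy_grid = {
    "toy_tfidf__ngram_range": [(1, 1), (1, 2)],
    "toy_logreg__C": [1, 0.1, 0.01],
}

# Every combination in the grid is scored with 3-fold cross-validation
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=3)
toy_search.fit(toy_docs, toy_labels)

# 2 n-gram settings x 3 values of C = 6 candidate settings
print(len(toy_search.cv_results_["params"]))
print(toy_search.best_params_)
print(toy_search.best_score_)
```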

## Evaluation of the performance on the test set

Let's load the test data.

In [None]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
X_test = twenty_test.data
y_test = twenty_test.target

The result of calling fit on a GridSearchCV object is a classifier that we can use to predict:

In [None]:
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

### Write code to implement the following:
- predict the category on test data
- use [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to print out the result; check out its documentation for more details.

In [None]:
from sklearn.metrics import classification_report
### START CODE HERE ###

### END CODE HERE ###

We can also print out the confusion matrix:

In [None]:
from sklearn.metrics import confusion_matrix
y_pre = gs_clf.predict(X_test)
confusion_matrix(y_test, y_pre)
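
To show how to read these two outputs, here is a small sketch on hand-written labels rather than real model predictions. The arrays `toy_true` and `toy_pred` and the class names in `toy_names` are invented for illustration. In the confusion matrix, rows are true classes and columns are predicted classes, so the diagonal counts the correct predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical class names and labels (not produced by any model)
toy_names = ["atheism", "graphics", "christian"]
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 0, 2]

# Per-class precision, recall, F1 score and support as a formatted string
toy_report = classification_report(toy_true, toy_pred, target_names=toy_names)
print(toy_report)

# Entry [i, j] counts samples of true class i predicted as class j
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)
# [[2 1 0]
#  [0 2 0]
#  [1 0 2]]
```

Here 6 of the 8 samples lie on the diagonal, which corresponds to an accuracy of 0.75.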