# Template for the homework


## Task
- Train a classifier for the 20newsgroups dataset: assign each document in the dataset to one of the 20 available categories.
- The objective is to get the best possible accuracy on the test set. You can use any library and model covered in the course.
- The deliverable is a single Jupyter notebook containing all the code. It must run in Colab.

## Template structure

- A Jupyter notebook template is provided for the task. Its structure is:
  - Read the train and validation data.
  - Transform to generate numerical features: build your own transformations here.
  - Model: build your own model or models here. Check the accuracy on the validation set.
  - Evaluate results: build your scoring function here and apply it to the test set.
- You need to complete the transform and model steps to achieve the best result in the evaluation metric, which is the accuracy on the test set.
- Loading or using the test dataset is forbidden, except once in the final evaluate results step.

## Evaluation

- The exercise is graded on a scale of 0 to 10 points.
- To obtain 5 points you must deliver a notebook that runs without errors and provides a solution with a minimum accuracy of 67%.
- An accuracy over 87% earns 10 points.
- Intermediate accuracies between 67% and 87% earn intermediate points proportionally.
- Extra points: it is possible to earn 2 extra points (up to the maximum of 10 points):
  - One extra point if one of the models tested is built with tensorflow-keras.
  - One extra point if one of the models tested uses pretrained embeddings.
- Reduced points:
  - One point is deducted if only one model is tested.
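
The grading rules above can be sketched as a small function. This is only an illustration: the name `estimate_points` is hypothetical, and two details are assumptions because the rules do not state them: an accuracy below 67% is treated as 0 base points, and extra points and the deduction are applied before the 10-point cap.

```python
def estimate_points(accuracy, extra_points=0, only_one_model=False):
    """Hypothetical helper that estimates the grade from the rules above."""
    if accuracy < 0.67:
        # Assumption: the rules do not say what happens below the threshold.
        base = 0.0
    else:
        # 5 points at 67%, rising linearly to 10 points at 87%, capped at 10.
        base = min(10.0, 5.0 + 5.0 * (accuracy - 0.67) / (0.87 - 0.67))

    penalty = 1.0 if only_one_model else 0.0

    # Assumption: extras and the deduction are applied before the 10-point cap.
    return max(0.0, min(10.0, base + extra_points - penalty))


# Example: 77% accuracy, one extra point, more than one model tested.
print(estimate_points(0.77, extra_points=1))
```

Under these assumptions, each additional percentage point of accuracy between 67% and 87% is worth a quarter of a point.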


## Tips
- Optionally, you can include the headers, footers and quotes of the dataset.
- Once you have selected your final model for the test evaluation, you can retrain it on all the train and validation data before testing it.
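
A minimal sketch of the second tip, retraining on train plus validation data. It uses tiny toy lists in place of `text_trn`, `text_val`, `y_trn` and `y_val`, and a scikit-learn pipeline (an assumption, not part of the template) so that the vocabulary is refitted on the combined data together with the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the notebook's train and validation splits.
text_trn = ["the engine of the car is loud", "the rocket reached orbit"]
y_trn = [0, 1]
text_val = ["new tires for my car", "the satellite left orbit"]
y_val = [0, 1]

# Concatenate both splits; list() also accepts NumPy label arrays.
text_full = list(text_trn) + list(text_val)
y_full = list(y_trn) + list(y_val)

# The pipeline refits the text encoding and the classifier on the full data.
final_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
final_model.fit(text_full, y_full)

print(final_model.predict(["a car with a loud engine"]))
```

In the template, `encoding_text` relies on a vectorizer fitted only on `text_trn`, so retraining the classifier alone would leave the vocabulary unchanged; refitting the encoding step as well is one way to use the validation texts fully.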


In [1]:
# Header
from __future__ import print_function

import pandas as pd


## 01 Load data

In [2]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train',
                 remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)

print(twenty_train.target_names)

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


________________________________________________________________________________
Cache loading failed
________________________________________________________________________________
Can't get attribute 'Bunch' on <module 'sklearn.utils' from '/Users/jorge/anaconda/lib/python3.6/site-packages/sklearn/utils/__init__.py'>
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [3]:
# Separate train and validation
from sklearn.model_selection import train_test_split

# Recommended: 20% for validation.
text_trn, text_val, y_trn, y_val = train_test_split(twenty_train.data, twenty_train.target, test_size=0.2)
print(len(text_trn), len(text_val))

9051 2263
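
The split above is random and changes on every run. A possible variant, shown here on toy data rather than on `twenty_train`, fixes `random_state` for reproducibility and passes `stratify` so that each class keeps the same proportion in both splits. Both parameters belong to `train_test_split`; using them here is a suggestion, not a requirement of the template.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for twenty_train.data and twenty_train.target.
docs = ["doc %d" % i for i in range(20)]
labels = [i % 2 for i in range(20)]

text_trn, text_val, y_trn, y_val = train_test_split(
    docs, labels,
    test_size=0.2,      # 20% for validation, as recommended
    stratify=labels,    # keep class proportions in both splits
    random_state=42,    # same split on every run
)

print(len(text_trn), len(text_val))
print(Counter(y_trn), Counter(y_val))
```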


## 02 Text encoding

In [4]:
# ------------------------------------
# Define your own encoding process here
# ------------------------------------

# EXAMPLE OF CODE
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Extract word occurrences
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=5000,
                                stop_words='english')
X_train_counts = tf_vectorizer.fit_transform(text_trn)

# From occurrences to TF-IDF weighted frequencies
tfidf_transformer = TfidfTransformer().fit(X_train_counts)

def encoding_text(text):
    '''
    Encoding function
        Input: iterable of raw text documents
        Output: sparse TF-IDF feature matrix to feed the model
    '''
    text_counts = tf_vectorizer.transform(text)
    text_tf = tfidf_transformer.transform(text_counts)
    return text_tf

# Encode train
X_trn = encoding_text(text_trn)
print(X_trn.shape)
# END OF EXAMPLE OF CODE


(9051, 5000)
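
The two-step encoding in the example cell (`CountVectorizer` followed by `TfidfTransformer`) can also be written with `TfidfVectorizer`, which scikit-learn documents as equivalent to that combination. The sketch below checks this on three toy sentences, which are made up for illustration, with matching settings on both sides.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy documents, made up for illustration.
docs = ["the car engine is loud",
        "the rocket reached orbit",
        "the car needs new tires"]

# Two-step version, as in the template cell.
counts = CountVectorizer(stop_words='english').fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step version with the same settings.
one_step = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Compare the dense versions of both feature matrices.
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

With a single fitted object, the same `transform` call can later be applied to validation and test texts.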


## 03 Model and score function

In [5]:
# ------------------------------------
# Put your model or models here
# ------------------------------------

# EXAMPLE OF CODE
from sklearn.naive_bayes import MultinomialNB
# Define and fit in one line
clf = MultinomialNB().fit(X_trn, y_trn)
# END OF EXAMPLE OF CODE
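
Since testing only one model costs a point, one way to organise the experiments is to fit several candidates and compare them on the validation set. The sketch below uses toy texts in place of the notebook's splits. `LogisticRegression` and `LinearSVC` are example choices and an assumption about which models are allowed; the names `candidates`, `val_scores` and `best_name` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-ins for text_trn / y_trn and text_val / y_val.
text_trn = ["the car engine is loud", "new tires for the car",
            "the rocket reached orbit", "the satellite left orbit"]
y_trn = [0, 0, 1, 1]
text_val = ["a loud car", "a rocket in orbit"]
y_val = [0, 1]

# Fit the encoding on the training texts only, then reuse it for validation.
vectorizer = TfidfVectorizer()
X_trn = vectorizer.fit_transform(text_trn)
X_val = vectorizer.transform(text_val)

# Candidate models to compare (example choices).
candidates = {
    'naive_bayes': MultinomialNB(),
    'logistic_regression': LogisticRegression(max_iter=1000),
    'linear_svc': LinearSVC(),
}

# Fit each candidate and record its validation accuracy.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_trn, y_trn)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Select the model with the highest validation accuracy.
best_name = max(val_scores, key=val_scores.get)
print(val_scores)
print('Best on validation:', best_name)
```

The selection uses validation accuracy only, which keeps the test set untouched until the final step.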


In [6]:
# Score function
def score_function(data, model):
    '''
    score_function
        Input: raw text data and a fitted model
        Output: predicted category for each text
    '''

    # ------------------------------------
    # Define your own score function
    # ------------------------------------
    
    # EXAMPLE OF CODE
    # Transformation steps
    X_test_tf = encoding_text(data)
    # Prediction steps
    predicted = model.predict(X_test_tf)
    # END OF EXAMPLE OF CODE

    return predicted



## 04 Evaluate valid data

In [7]:
# Accuracy and confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate validation data
pred_val = score_function(text_val, clf)

# Calculate accuracy with sklearn
print('Accuracy valid: ', accuracy_score(y_val, pred_val))
pd.DataFrame(confusion_matrix(y_val, pred_val))


Accuracy valid:  0.70083959346


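| true / predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 5 | 3 | 1 | 2 | 0 | 29 | 2 | 8 | 0 | 1 |
| 1 | 1 | 84 | 4 | 8 | 3 | 8 | 2 | 0 | 0 | 1 | 2 | 1 | 4 | 1 | 1 | 2 | 0 | 0 | 0 | 1 |
| 2 | 0 | 6 | 85 | 12 | 3 | 11 | 0 | 1 | 0 | 1 | 8 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 13 | 81 | 9 | 1 | 2 | 1 | 0 | 0 | 2 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 7 | 14 | 78 | 1 | 5 | 0 | 0 | 0 | 8 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 11 | 5 | 2 | 0 | 100 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 2 | 1 | 10 | 2 | 3 | 87 | 1 | 2 | 1 | 2 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 7 | 0 | 1 | 1 | 0 | 0 | 3 | 3 | 78 | 13 | 0 | 8 | 2 | 5 | 0 | 1 | 1 | 0 | 2 | 0 | 0 |
| 8 | 0 | 1 | 0 | 0 | 2 | 1 | 5 | 8 | 84 | 2 | 2 | 1 | 3 | 5 | 0 | 1 | 1 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 98 | 12 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |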
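
In the matrix above, rows are true categories and columns are predicted ones, so large off-diagonal cells show which classes the model mixes up. For instance, row 0 (`alt.atheism`) has 29 documents in column 15 (`soc.religion.christian`). A small sketch for listing such cells automatically follows; the helper name `most_confused_pairs` and the 3x3 toy matrix are hypothetical, for illustration only.

```python
import numpy as np


def most_confused_pairs(cm, top_k=3):
    """Return (true_label, predicted_label, count) for the largest off-diagonal cells."""
    cm = np.asarray(cm).copy()
    np.fill_diagonal(cm, 0)  # ignore correct predictions

    # Indices of the largest remaining cells, in descending order of count.
    flat_order = np.argsort(cm, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(int(r), int(c), int(cm[r, c])) for r, c in zip(rows, cols)]


# Toy 3-class confusion matrix, made up for illustration.
toy_cm = [[45, 1, 29],
          [2, 84, 4],
          [0, 6, 85]]

print(most_confused_pairs(toy_cm, top_k=2))
```

In the notebook, the same function could be called on `confusion_matrix(y_val, pred_val)`, with the indices mapped to names through `twenty_train.target_names`.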


## 05 Evaluate test data
- Do not edit anything after this point!
- Execute only ONCE, with the optimal model selected based on the validation accuracy measured over multiple experiments.

In [8]:
# Test Accuracy
twenty_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# score_function needs the selected model as its second argument (clf in the example)
predicted = score_function(twenty_test.data, clf)

print('Accuracy test: ', accuracy_score(twenty_test.target, predicted))


Accuracy test:  0.641396707382
