# Model Training

Once the feature vectors are built, we'll try several machine learning classification models to find which one performs best on our data. We will try the following models:

* Baseline Classifier
* Random Forest
* Support Vector Machine
* K Nearest Neighbors
* Multinomial Naïve Bayes
* Multinomial Logistic Regression
* Gradient Boosting

The methodology used to train each model is as follows:

1. First, we'll decide which hyperparameters to tune.
2. Second, we'll define the metric used to measure the performance of a model. In this case, we'll use **accuracy**.
3. We'll perform a Randomized Search Cross Validation to find the hyperparameter region that yields the highest accuracy.
4. Once we find that region, we'll use a Grid Search Cross Validation to exhaustively find the best combination of hyperparameters within it.
5. With the best combination of hyperparameters, we'll compute the accuracy on the training data and the test data, the classification report and the confusion matrix.
6. Finally, we'll compute the accuracy of a model with default hyperparameters, to see whether hyperparameter tuning achieved better results.
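
The six steps above can be sketched as follows. This is only a minimal illustration on synthetic data from `make_classification`, not on the article features; the parameter ranges, the small `n_iter` and `cv` values, and the way the fine grid is built around the coarse result are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic stand-in for the feature vectors (assumption: 5 classes, like the dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=8)

# Steps 1-3: coarse randomized search over the chosen hyperparameters, scored by accuracy
coarse_space = {'n_estimators': [20, 40, 60],
                'max_depth': [5, 10, None],
                'min_samples_leaf': [1, 2, 4]}
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=8),
                                   param_distributions=coarse_space,
                                   n_iter=4, scoring='accuracy', cv=3,
                                   random_state=8)
random_search.fit(X_tr, y_tr)
best = random_search.best_params_

# Step 4: exhaustive grid search in the neighbourhood of the best coarse result
fine_grid = {'n_estimators': [best['n_estimators'], best['n_estimators'] + 10],
             'max_depth': [best['max_depth']],
             'min_samples_leaf': [best['min_samples_leaf']]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=8),
                           param_grid=fine_grid, scoring='accuracy', cv=3)
grid_search.fit(X_tr, y_tr)

# Step 5: accuracy of the tuned model on the training and test data
tuned = grid_search.best_estimator_
train_acc = accuracy_score(y_tr, tuned.predict(X_tr))
test_acc = accuracy_score(y_te, tuned.predict(X_te))

# Step 6: the same metric for a model with default hyperparameters
default = RandomForestClassifier(random_state=8).fit(X_tr, y_tr)
default_acc = accuracy_score(y_te, default.predict(X_te))
print(train_acc, test_acc, default_acc)
```

The classification report and confusion matrix from step 5 can be obtained from `sklearn.metrics.classification_report` and `sklearn.metrics.confusion_matrix` with the same predictions.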

Keep in mind that our dataset contains only 5 categories:

* Business
* Politics
* Sports
* Tech
* Entertainment

So, when we get news articles that don't belong to any of those categories (for example, weather or terrorism news articles), the prediction will necessarily be wrong. For this reason, we will look at the conditional probability of belonging to each class and set a lower threshold: if all 5 conditional probabilities are below 65%, the prediction will be 'other'. Some models expose this probability vector directly, while others do not, so we will take this into account when choosing the model to use.
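
One way the threshold rule could look in code is sketched below. It assumes a fitted scikit-learn classifier that provides `predict_proba`; the helper name `predict_with_other`, the synthetic data and the choice of `LogisticRegression` are illustrative and not taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

CATEGORIES = np.array(['business', 'politics', 'sports', 'tech', 'entertainment'])

def predict_with_other(model, X, threshold=0.65):
    """Return the most probable class, or 'other' when no class reaches the threshold."""
    proba = model.predict_proba(X)      # shape (n_samples, n_classes)
    best_idx = proba.argmax(axis=1)     # column index of the most probable class
    best_p = proba.max(axis=1)          # its probability
    # Columns of predict_proba follow the order of model.classes_
    return np.where(best_p >= threshold, model.classes_[best_idx], 'other')

# Synthetic stand-in for the feature vectors (assumption, not the article data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=5, random_state=8)
model = LogisticRegression(max_iter=1000).fit(X, CATEGORIES[y])
predictions = predict_with_other(model, X[:10])
print(predictions)
```

Models without `predict_proba` cannot be used with this helper as written, which is the practical reason for weighing probability support when choosing the model.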

In [8]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

In [9]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_train.pickle"

with open(path_df, 'rb') as data:
    X_train = pickle.load(data)

In [10]:
X_train.head()

1    artificial intelligence (ai)  use   wide varie...
Name: Content_Parsed, dtype: object

In [11]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_train.pickle"

with open(path_df, 'rb') as data:
    Y_train = pickle.load(data)

In [12]:
Y_train.head()

1    Suziki
Name: Category_Code, dtype: object

In [13]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_test.pickle"

with open(path_df, 'rb') as data:
    X_test = pickle.load(data)

In [14]:
X_test.head()

0     year ago detroit  silicon valley  visions  pu...
Name: Content_Parsed, dtype: object

In [15]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_test.pickle"

with open(path_df, 'rb') as data:
    Y_test = pickle.load(data)

In [16]:
Y_test.head()

0    Baleno
Name: Category_Code, dtype: object

In [57]:
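from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
# pd.get_dummies creates one column per distinct value, so encoding the train
# and test series separately gives matrices with different numbers of columns.
trainX = pd.get_dummies(X_train)
trainY = pd.get_dummies(Y_train)
testX = pd.get_dummies(X_test)
testY = pd.get_dummies(Y_test)
regressor.fit(trainX, trainY)
# predict fails here: testX does not have the same columns as trainX
Y_pred = regressor.predict(testX)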

ValueError: X has 1 features, but DecisionTreeRegressor is expecting 2 features as input.
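
The error above comes from encoding the training and test text separately, so the two matrices end up with different numbers of columns. A possible way around it, sketched below under assumptions, is to fit a `TfidfVectorizer` on the training text only and reuse it to transform the test text, then train a classifier rather than a regressor. The toy series (`toy_X_train`, `toy_y_train`, `toy_X_test`) are made up for illustration and stand in for the pickled data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the pickled Series
toy_X_train = pd.Series(['stock market shares rise', 'team wins league match',
                         'bank profit shares fall', 'player scores match goal'])
toy_y_train = pd.Series(['business', 'sports', 'business', 'sports'])
toy_X_test = pd.Series(['shares market profit'])

# Learn the vocabulary on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer()
features_train = vectorizer.fit_transform(toy_X_train)
features_test = vectorizer.transform(toy_X_test)

# Both matrices now share the same columns, so predict no longer fails
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(features_train, toy_y_train)
toy_pred = clf.predict(features_test)
print(features_train.shape, features_test.shape, toy_pred)
```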

In [58]:
print(X_train.shape)
print(Y_test.shape)

(2,)
(1,)


In [59]:
rf_0 = RandomForestClassifier(random_state = 8)

print('Parameters currently in use:\n')
print(rf_0.get_params())

Parameters currently in use:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 8, 'verbose': 0, 'warm_start': False}


In [60]:
# n_estimators
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)]

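# max_features ('auto' means sqrt(n_features) for RandomForestClassifier in this
# scikit-learn version, so both entries select the same setting)
max_features = ['auto', 'sqrt']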

# max_depth
max_depth = [int(x) for x in np.linspace(20, 100, num = 5)]
max_depth.append(None)

# min_samples_split
min_samples_split = [2, 5, 10]

# min_samples_leaf
min_samples_leaf = [1, 2, 4]

# bootstrap
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000], 'max_features': ['auto', 'sqrt'], 'max_depth': [20, 40, 60, 80, 100, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
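
To put the size of this search space in context, the number of combinations in the full grid can be counted with scikit-learn's `ParameterGrid`. The dictionary below repeats the values printed above; the names `grid_values` and `n_combinations` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the random grid printed above
grid_values = {'n_estimators': [200, 400, 600, 800, 1000],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [20, 40, 60, 80, 100, None],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

# ParameterGrid enumerates the Cartesian product: 5 * 2 * 6 * 3 * 3 * 2
n_combinations = len(ParameterGrid(grid_values))
print(n_combinations)  # 1080
```

A randomized search with `n_iter=10` evaluates 10 of these 1080 combinations, which is under 1% of the grid.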


In [35]:
# First create the base model to tune
rfc = RandomForestClassifier(random_state=8)

# Definition of the random search
random_search = RandomizedSearchCV(estimator=rfc,
                                   param_distributions=random_grid,
                                   n_iter=10,
                                   scoring='accuracy',
                                   cv=2, 
                                   verbose=1, 
                                   random_state=8)

# Fit the random search model
random_search.fit(X_train, Y_train)

Fitting 2 folds for each of 10 candidates, totalling 20 fits


ValueError: n_splits=2 cannot be greater than the number of members in each class.
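
When `cv` is an integer and the estimator is a classifier, scikit-learn splits the data with stratified K-fold, which needs enough samples per class to form the folds. With only two training rows, as printed earlier, that requirement cannot be met. The sketch below shows one way to check a label set before launching a search; the helper name `can_stratify` and the toy labels are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def can_stratify(labels, n_splits):
    """Return True if StratifiedKFold can split `labels` into `n_splits` folds."""
    try:
        # The feature values are irrelevant for splitting, only their count matters
        placeholder_X = np.zeros((len(labels), 1))
        list(StratifiedKFold(n_splits=n_splits).split(placeholder_X, labels))
        return True
    except ValueError:
        return False

print(can_stratify(['a', 'b'], 2))            # one sample per class: not possible
print(can_stratify(['a', 'a', 'b', 'b'], 2))  # two samples per class: possible
```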

In [61]:
# Same approach with more trees and out-of-bag scoring enabled
model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1, random_state=1)
trainX = pd.get_dummies(X_train)
trainY = pd.get_dummies(Y_train)
testX = pd.get_dummies(X_test)
model.fit(trainX, trainY)
# testX still has a different number of columns than trainX, so predict raises the same error
Y_pred = model.predict(testX)

ValueError: X has 1 features, but DecisionTreeRegressor is expecting 2 features as input.