# News Classification 

By: Kanika Chopra

In [0]:
!pip install nltk
!pip install scikit-learn
!pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/85/41/c3dfd5feb91a8d587ed1a59f553f07c05f95ad4e5d00ab78702fbf8fe48a/contractions-0.0.24-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  C

In [0]:
import pandas as pd
import numpy as np

# NLP Preprocessing 
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Models 
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Features 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

# Parameter Tuning and Evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, precision_score, recall_score, classification_report, make_scorer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold

import string
import contractions
import re

from nlp import get_pos


## The Problem

I will train a supervised machine learning model on a Kaggle dataset of HuffPost news articles and their categories. 

The problem is to train a classifier that assigns news articles to categories based on their headlines, and then to assess how well transfer learning lets us apply this model to news-related tweets. 

First, let's begin with some data analysis. 

## Data Analysis 

Let's take a look at the data we have. This will help us determine what type of data preprocessing needs to be done before we can start feature extraction.

In [0]:
df = pd.read_json("News_Category_Dataset_v2.json", lines = True)

# What columns do we have 
df.columns

Index(['category', 'headline', 'authors', 'link', 'short_description', 'date'], dtype='object')

In [0]:
# Categories Distribution
df['category'].value_counts()

POLITICS          32739
WELLNESS          17827
ENTERTAINMENT     16058
TRAVEL             9887
STYLE & BEAUTY     9649
PARENTING          8677
HEALTHY LIVING     6694
QUEER VOICES       6314
FOOD & DRINK       6226
BUSINESS           5937
COMEDY             5175
SPORTS             4884
BLACK VOICES       4528
HOME & LIVING      4195
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3651
WOMEN              3490
IMPACT             3459
DIVORCE            3426
CRIME              3405
MEDIA              2815
WEIRD NEWS         2670
GREEN              2622
WORLDPOST          2579
RELIGION           2556
STYLE              2254
SCIENCE            2178
WORLD NEWS         2177
TASTE              2096
TECH               2082
MONEY              1707
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
ARTS & CULTURE     1339
ENVIRONMENT        1323
COLLEGE            1144
LATINO VOICES      1129
CULTURE & ARTS     1030
EDUCATION          1004
Name: category, dtype: int64

In [0]:
# Number of categories
len(df['category'].value_counts())

41

In [0]:
# Let's view our data
df.head()

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |


In [0]:
# How long is our dataset
len(df), len(df.columns)

(200853, 6)

We have 200,853 news headlines and 6 columns. Our classifier is going to be built on the headline column to predict category. 

Before we begin data preprocessing, note that we have a lot of categories (41, specifically), so I want to first focus on training on a subset of this data. For simplicity, the distinct topics that I am most interested in are Politics, Sports, Entertainment, Business and Crime.

In [0]:
news = df[df['category'].isin(['CRIME', 'ENTERTAINMENT', 'POLITICS', 'SPORTS', 'BUSINESS'])]

# Remove unnecessary columns
news = news[['date', 'authors', 'headline', 'category', 'link']]

news.head()

| | date | authors | headline | category | link |
|---|---|---|---|---|---|
| 0 | 2018-05-26 | Melissa Jeltsen | There Were 2 Mass Shootings In Texas Last Week... | CRIME | https://www.huffingtonpost.com/entry/texas-ama... |
| 1 | 2018-05-26 | Andy McDonald | Will Smith Joins Diplo And Nicky Jam For The 2... | ENTERTAINMENT | https://www.huffingtonpost.com/entry/will-smit... |
| 2 | 2018-05-26 | Ron Dicker | Hugh Grant Marries For The First Time At Age 57 | ENTERTAINMENT | https://www.huffingtonpost.com/entry/hugh-gran... |
| 3 | 2018-05-26 | Ron Dicker | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | ENTERTAINMENT | https://www.huffingtonpost.com/entry/jim-carre... |
| 4 | 2018-05-26 | Ron Dicker | Julianna Margulies Uses Donald Trump Poop Bags... | ENTERTAINMENT | https://www.huffingtonpost.com/entry/julianna-... |


In [0]:
news['category'].value_counts()

POLITICS         32739
ENTERTAINMENT    16058
BUSINESS          5937
SPORTS            4884
CRIME             3405
Name: category, dtype: int64

We can see that the class distribution is highly skewed, with politics having over 30,000 headlines vs. crime having fewer than 3,500 headlines. This means our classifier might be really good at classifying politics-related articles, but weaker on crime-related articles. 

We will keep this in the back of our mind for now: first we will train our model, and then look at the confusion matrix to see if we need to take further steps to handle the imbalanced dataset.


## Data Preprocessing

We want to use bag of words and tf-idf for our features, so we first need to complete some preprocessing steps. Below are the steps we will be taking:

*   Break apart contractions
*   Make all headlines lowercase
*   Convert all numbers to the string 'num'
*   Remove punctuation (replace with empty string) 
*   Remove all stop words
*   Lemmatize all words (back to their root words)
*   Combine the list of words into a string again


In [0]:
# Break apart all contractions (except name possession e.g. Sarah's)
news['headline'] = news['headline'].apply(lambda x: contractions.fix(x))

In [0]:
# Download stop words, the POS tagger and WordNet (used for lemmatization)
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
# Convert to lowercase 
news['clean_headline'] = news['headline'].apply(lambda x: x.lower())

# Convert all numbers in the headlines to the word 'num' using re
news['clean_headline'] = news['clean_headline'].apply(lambda x: re.sub(r'\d+', 'num', x))

# Remove punctuation
punct = str.maketrans('', '', string.punctuation)
news['clean_headline'] = news['clean_headline'].apply(lambda x: x.translate(punct))

# Initialize tokenizer so that it doesn't include punctuation
tokenizer = RegexpTokenizer(r'\w+')
news['clean_headline'] = [tokenizer.tokenize(x) for x in news['clean_headline']]

# Remove stopwords
stop_words = set(stopwords.words('english'))
news['clean_headline'] = news['clean_headline'].apply(lambda x: [word for word in x if word not in stop_words])

In [0]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the headlines
news['clean_headline'] = news['clean_headline'].apply(lambda x: [lemmatizer.lemmatize(word, get_pos(word)) for word in x])
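
`get_pos` is imported from a local `nlp` module that is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming it tags each word with NLTK's `pos_tag` and maps the resulting Penn Treebank tag to the single-letter part-of-speech codes that `WordNetLemmatizer.lemmatize` accepts. The names `treebank_to_wordnet` and `get_pos_sketch`, and the fallback to noun, are hypothetical and not the author's code.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # assumption: treat everything else as a noun


def get_pos_sketch(word, tagger):
    """Hypothetical stand-in for nlp.get_pos.

    `tagger` is any callable shaped like nltk.pos_tag: it takes a list
    of tokens and returns a list of (token, tag) pairs.
    """
    tag = tagger([word])[0][1]
    return treebank_to_wordnet(tag)
```

With NLTK installed and the tagger data downloaded, this would be called as `get_pos_sketch('shootings', nltk.pos_tag)`. Note that tagging one word at a time gives the tagger no sentence context, so tags may be less reliable than tagging the whole headline.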

In [0]:
# Combine the list of words into a string
news['clean_headline'] = news['clean_headline'].apply(lambda x: ' '.join(x))

In [0]:
# Let's compare the before and after
print(news['headline'][0])
print(news['clean_headline'][0])

There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV
num mass shooting texas last week num tv


## Feature Extraction

We are going to have a bag-of-words model and a tf-idf matrix for our features.

In [0]:
# Generate bag of words object with maximum vocab size of 1000
vec_counter = CountVectorizer(max_features = 1000)

# Get bag of words model as sparse matrix
bag_of_words = vec_counter.fit_transform(news['clean_headline'])

In [0]:
# Generate our tf-idf object with maximum vocab size of 1000
tf_counter = TfidfVectorizer(max_features=1000)

# Get tf-idf matrix as sparse matrix
tfidf = tf_counter.fit_transform(news['clean_headline'])

In [0]:
# Get a preview of the words corresponding to the vocab index 
# (on scikit-learn 1.0 and later, use get_feature_names_out() instead)
tf_counter.get_feature_names()[:25]

['abortion',
 'abuse',
 'access',
 'accuse',
 'act',
 'action',
 'activist',
 'actor',
 'actually',
 'ad',
 'adam',
 'address',
 'administration',
 'admits',
 'adorable',
 'adviser',
 'age',
 'agency',
 'ago',
 'ahead',
 'aide',
 'aim',
 'air',
 'al',
 'alabama']

In [0]:
# The sparse matrix exposes its shape directly, so there is no need to densify it
tfidf.shape

(63023, 1000)

This means that each of the 63,023 headlines is represented by 1000 features, each holding the tf-idf score of one unigram (with the default `ngram_range=(1, 1)`, the vectorizer does not extract bigrams). 

This step was mostly to get an idea of what CountVectorizer and TF-IDF are doing; these steps will be added into a pipeline when training our models.
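
To make the tf-idf scores concrete, here is a small pure-Python sketch of the weighting that scikit-learn applies with its default settings, as far as I understand them: raw term counts multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three toy headlines are made up for illustration.

```python
import math
from collections import Counter

# Toy, already-cleaned headlines (made up for illustration)
docs = ["num mass shooting texas", "texas senator vote", "senator vote vote"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many headlines does each term appear?
df = Counter(t for doc in tokenized for t in set(doc))


def tfidf_row(tokens):
    """Return one L2-normalized tf-idf vector over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(doc) for doc in tokenized]
```

In the first headline, `texas` appears in two of the three documents and so receives a lower weight than `mass`, which appears in only one.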

## Training and Testing Dataset

We are going to split our data into 80% training data, and 20% testing data.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(news['clean_headline'], news['category'],
                                                    test_size=0.2, random_state=42, stratify=news['category'])

In [0]:
# Check the distribution of classes in the training and testing dataset
# (built from the two value_counts Series directly, so it does not depend on
# the column names that reset_index produces in a given pandas version)
y_distr = pd.DataFrame({'train': y_train.value_counts(),
                        'test': y_test.value_counts()})
y_distr = y_distr.sort_values('train', ascending=False)
y_distr = y_distr.rename_axis('category').reset_index()

# Add percentages of testing and training 
y_distr['train pct'] = y_distr['train'] / y_distr['train'].sum()
y_distr['test pct'] = y_distr['test'] / y_distr['test'].sum()

y_distr

| | category | train | test | train pct | test pct |
|---|---|---|---|---|---|
| 0 | POLITICS | 26191 | 6548 | 0.519477 | 0.519476 |
| 1 | ENTERTAINMENT | 12846 | 3212 | 0.25479 | 0.25482 |
| 2 | BUSINESS | 4750 | 1187 | 0.094212 | 0.094169 |
| 3 | SPORTS | 3907 | 977 | 0.077492 | 0.077509 |
| 4 | CRIME | 2724 | 681 | 0.054028 | 0.054026 |


We can see that the class distributions of our training and testing sets are nearly identical. Now that we have done our preprocessing and have our training and testing sets, we can train our models.

## Building a Pipeline 

The code above builds the bag-of-words and tf-idf features individually. We can instead create a pipeline that performs these two steps and then trains a model. We will be creating pipelines for the following models:
1. Naive Bayes Classifier (NB)
2. Support Vector Machine (SVM) 
    * Linear Kernel
    * Polynomial Kernel
    * Gaussian Kernel
3. Logistic Regression
4. Random Forest

In [0]:
# Set evaluation scores to use when fine-tuning parameters
scorers = {
    'precision_score': make_scorer(precision_score, greater_is_better=True, average='micro'),
    'recall_score': make_scorer(recall_score, greater_is_better=True, average='micro'),
    'accuracy_score': make_scorer(accuracy_score)
}

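# Note: skf is defined here but never passed to GridSearchCV (via cv=skf) below,
# so the searches fall back to GridSearchCV's default cross-validation
skf = StratifiedKFold(n_splits=10)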

## 1. Naive Bayes Classifier
Let's begin with training a classifier for Naive Bayes first.

In [0]:
# Training Model
nb_clas = Pipeline([('vect', CountVectorizer()), 
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

nb_clas = nb_clas.fit(X_train, y_train)

Let's test our model now and get the performance on the 20% testing set. We will take the testing set inputs and compare the model's predictions to the actual labels.

In [0]:
# Testing Dataset 
# Predictions 
nb_predicted = nb_clas.predict(X_test)

# Performance
np.mean(nb_predicted == y_test)

0.7717572391907973

We got ~77.18% accuracy on the testing set, which isn't bad as a starting point with a Naive Bayes classifier. 

Let's try the next model, compare the two and then we can start to fine-tune the chosen model.

## 2. Support Vector Machines (SVM)
Time to try our second model, Support Vector Machines. We are going to try three variations of this model: 


*   Linear Kernel
*   Polynomial Kernel
*   Gaussian Kernel

### Linear Kernel
We use the SGDClassifier from sklearn's linear models. With the hinge loss (its default), it fits a linear SVM, i.e. an SVM with a linear kernel, trained using Stochastic Gradient Descent.
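
As a rough illustration of what "a linear SVM trained with SGD" means, here is a simplified sketch for the binary case with labels in {-1, +1}. It is not scikit-learn's implementation: it assumes a constant learning rate (SGDClassifier uses its own schedule by default) and handles only two classes (SGDClassifier combines several binary classifiers for multiclass problems). The function names and the toy data are made up.

```python
def train_linear_svm_sgd(X, y, alpha=0.01, lr=0.1, epochs=20):
    """Minimize alpha/2 * ||w||^2 + hinge loss with plain SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights a little on every step
            w = [wj * (1 - lr * alpha) for wj in w]
            # The hinge loss only contributes when the margin is violated
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, X):
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
            for xi in X]


# Toy, linearly separable data (made up)
X = [(2, 1), (3, 2), (2, 3), (-2, -1), (-3, -2), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm_sgd(X, y)
```

The `alpha` argument plays the same role as the `alpha` searched over in the grid below: larger values regularize the weights more strongly.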

In [0]:
# Creating pipeline
linSvm_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('svm-lin', SGDClassifier(loss='hinge', penalty='l2',random_state=42))])

# Setting GridSearch Parameters
linSVM_param_grid = {
     'svm-lin__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
     'tfidf__use_idf': (True, False),
     'vect__ngram_range': [(1,1), (1,2)]
}

linSVM_search = GridSearchCV(linSvm_pipe, linSVM_param_grid, scoring=scorers, refit='precision_score', return_train_score=True, n_jobs=-1)
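
To get a feel for the cost of this search, the sketch below enumerates the parameter combinations in the grid. The number of folds is an assumption: no `cv` argument is passed, and recent scikit-learn versions default to 5 folds.

```python
from itertools import product

param_grid = {
    'svm-lin__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
    'tfidf__use_idf': (True, False),
    'vect__ngram_range': [(1, 1), (1, 2)],
}

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_folds = 5  # assumption: GridSearchCV default in recent scikit-learn versions
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

The grid has 7 x 2 x 2 = 28 candidates, so each extra value added to any one parameter multiplies the training time accordingly.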

In [0]:
lin_svm = linSVM_search.fit(X_train, y_train)

In [0]:
# Predictions on Test Set
svm_lin_pred = lin_svm.predict(X_test)

In [0]:
linSVM_results = {'Metrics': ['Precision', 'Recall', 'Accuracy'],
                  'Scores': [precision_score(y_test, svm_lin_pred, average='micro'),
                             recall_score(y_test, svm_lin_pred, average='micro'),
                             accuracy_score(y_test, svm_lin_pred)]}

pd.DataFrame(linSVM_results)

| | Metrics | Scores |
|---|---|---|
| 0 | Precision | 0.881793 |
| 1 | Recall | 0.881793 |
| 2 | Accuracy | 0.881793 |


In [0]:
print('Best parameters:', linSVM_search.best_params_)

Best parameters: {'svm-lin__alpha': 1e-05, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


With Linear SVM, we got ~88.18% accuracy on our testing set.
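
The precision, recall and accuracy reported above are identical. That is expected with `average='micro'` in a single-label multiclass problem: every misclassified headline counts as one false positive (for the predicted class) and one false negative (for the true class), so all three metrics reduce to the fraction of correct predictions. A small sketch with made-up labels shows this, with macro-averaged recall as a contrast:

```python
# Made-up labels for illustration
y_true = ['POLITICS', 'POLITICS', 'CRIME', 'SPORTS', 'CRIME', 'POLITICS']
y_pred = ['POLITICS', 'CRIME', 'CRIME', 'SPORTS', 'POLITICS', 'POLITICS']
labels = sorted(set(y_true))
pairs = list(zip(y_true, y_pred))

# Per-class true positives, false positives and false negatives
tp = {c: sum(t == c and p == c for t, p in pairs) for c in labels}
fp = {c: sum(t != c and p == c for t, p in pairs) for c in labels}
fn = {c: sum(t == c and p != c for t, p in pairs) for c in labels}

# Micro-averaging pools the counts over all classes before dividing
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
micro_recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Macro-averaging computes recall per class, then takes the plain mean
macro_recall = sum(tp[c] / (tp[c] + fn[c]) for c in labels) / len(labels)
```

Here the micro-averaged scores and the accuracy are all 4/6, while macro recall differs because it weights each class equally. With imbalanced classes like ours, macro-averaged or per-class scores are usually more informative.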

### Polynomial Kernel
Now, we'll try the Polynomial Kernel to see if it works better than the linear kernel at classifying our headlines.

In [0]:
# Creating the Pipeline
polySVM_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # SVC has no n_jobs parameter; parallelism is set on GridSearchCV instead
                     ('svm-poly', SVC(kernel='poly', degree=2))])

# Setting GridSearch Parameters
polySVM_param_grid = {
    'svm-poly__C': [0.01, 0.1, 1, 10],
    'svm-poly__degree': [2,3],
    'tfidf__use_idf': (True, False),
    'vect__ngram_range': [(1,1), (1,2)]
}

polySVM_search = GridSearchCV(polySVM_pipe, polySVM_param_grid, scoring=scorers, 
                              refit='precision_score', return_train_score=True)

In [0]:
poly_svm = polySVM_search.fit(X_train, y_train)

In [0]:
# Predictions on Test Set 
svm_poly_pred = poly_svm.predict(X_test)

In [0]:
polySVM_results = {'Metrics': ['Precision', 'Recall', 'Accuracy'],
                  'Scores': [precision_score(y_test, svm_poly_pred, average='micro'),
                             recall_score(y_test, svm_poly_pred, average='micro'),
                             accuracy_score(y_test, svm_poly_pred)]}

pd.DataFrame(polySVM_results)

In [0]:
print('Best parameters:', polySVM_search.best_params_)

With Polynomial SVM, we got ~79.83% accuracy on our testing set, so the linear kernel worked better than the polynomial kernel of degree 2 with C=10.

### Gaussian Kernel
Lastly, we will be trying the Gaussian kernel for SVM. We will be trying different parameters for C, and for gamma for this kernel.
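
For intuition about gamma, the Gaussian (RBF) kernel scores the similarity of two vectors as `exp(-gamma * ||x - z||^2)`. The sketch below evaluates it for the gamma values in the grid that follows, using two made-up vectors:

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two made-up vectors with squared distance 2
x, z = [1.0, 0.0], [0.0, 1.0]
for gamma in [0.001, 0.01, 0.1, 1]:
    print(gamma, round(rbf_kernel(x, z, gamma), 4))
```

A larger gamma makes the similarity fall off faster with distance, so each training example influences a smaller neighbourhood and the decision boundary can become more complex. C controls how strongly misclassified training points are penalized.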

In [0]:
# Training the Model
gausSVM_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('svm-gaus', SVC(kernel='rbf'))])

# Setting GridSearch Parameters for the pipeline
gausSVM_param_grid = {
    'svm-gaus__C': [0.001, 0.01, 0.1, 1, 10],
    'svm-gaus__gamma': [0.001, 0.01, 0.1, 1],
    'tfidf__use_idf': (True, False),
    'vect__ngram_range': [(1,1), (1,2)]
}

gausSVM_search = GridSearchCV(gausSVM_pipe, gausSVM_param_grid, scoring=scorers, 
                              refit='precision_score', return_train_score=True,
                              n_jobs=-1)

In [0]:
gaus_svm = gausSVM_search.fit(X_train, y_train)

In [0]:
# Predictions on Test Set 
svm_gaus_pred = gaus_svm.predict(X_test)


In [0]:
gausSVM_results = {'Metrics': ['Precision', 'Recall', 'Accuracy'],
                  'Scores': [precision_score(y_test, svm_gaus_pred, average='micro'),
                             recall_score(y_test, svm_gaus_pred, average='micro'),
                             accuracy_score(y_test, svm_gaus_pred)]}

pd.DataFrame(gausSVM_results)

In [0]:
print('Best parameters:', gausSVM_search.best_params_)

The Gaussian kernel had an accuracy of ~83.76% on the testing set. Therefore, among the SVMs, the linear kernel worked best, followed by the Gaussian kernel and then the polynomial kernel of degree 2.


## 3. Logistic Regression
Logistic regression is a simple model and easy to understand; it is extremely useful in binary classification but can easily be generalized to multiple classes like in our case.
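
One way the generalization to multiple classes works is the multinomial (softmax) formulation: the model produces one linear score per class and softmax turns the scores into probabilities. The sketch below assumes that formulation; whether scikit-learn uses it or a one-vs-rest scheme depends on the version and solver. The scores are made up.

```python
import math


def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ['BUSINESS', 'CRIME', 'ENTERTAINMENT', 'POLITICS', 'SPORTS']
scores = [0.2, -1.0, 0.5, 2.1, -0.3]  # made-up linear scores for one headline

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
```

The predicted category is the one with the highest probability, which is also the one with the highest score.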

In [0]:
# Creating the Pipeline
logreg_clas = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('logreg', LogisticRegression(n_jobs=-1, max_iter= 500, penalty='l2'))])

# Setting GridSearch Parameters for the pipeline
logreg_param_grid = {
    'logreg__C': [0.001, 0.01, 0.1, 1],
    'tfidf__use_idf': (True, False),
    'vect__ngram_range': [(1,1), (1,2)]
}

logreg_search = GridSearchCV(logreg_clas, logreg_param_grid, scoring=scorers, 
                             refit='precision_score', return_train_score=True)

In [0]:
# Training the Model
logreg_clas = logreg_search.fit(X_train, y_train)

In [0]:
# Predictions on Test Set
log_reg_pred = logreg_clas.predict(X_test)

In [0]:
logreg_results = {'Metrics': ['Precision', 'Recall', 'Accuracy'],
                  'Scores': [precision_score(y_test, log_reg_pred, average='micro'),
                             recall_score(y_test, log_reg_pred, average='micro'),
                             accuracy_score(y_test, log_reg_pred)]}

pd.DataFrame(logreg_results)

| | Metrics | Scores |
|---|---|---|
| 0 | Precision | 0.858786 |
| 1 | Recall | 0.858786 |
| 2 | Accuracy | 0.858786 |


In [0]:
print('Best parameters:', logreg_search.best_params_)

Best parameters: {'logreg__C': 1, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}


Multinomial Logistic Regression has an accuracy of ~85.88% on the testing set. 

This is higher than Multinomial Naive Bayes, but lower than Linear SVM by ~2.3 percentage points.

## 4. Random Forest 


In [0]:
# Creating the Pipeline
rf_pipe = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('rf-clas', RandomForestClassifier(n_jobs=-1))])

# Setting GridSearch Parameters
rf_param_grid = {
    'rf-clas__n_estimators': [100,200,300],
    'rf-clas__max_depth': [3,5,10,25],
    'rf-clas__max_features': [3, 5, 10, 25, 31],
    'tfidf__use_idf': (True, False),
    'vect__ngram_range': [(1,1), (1,2)]
}

rf_search = GridSearchCV(rf_pipe, rf_param_grid, scoring=scorers, 
                         refit='precision_score', return_train_score=True)

In [0]:
rf_clas = rf_search.fit(X_train,y_train)

In [0]:
# Predictions 
rf_pred = rf_clas.predict(X_test)

In [0]:
rf_results = {'Metrics': ['Precision', 'Recall', 'Accuracy'],
                  'Scores': [precision_score(y_test, rf_pred, average='micro'),
                             recall_score(y_test, rf_pred, average='micro'),
                             accuracy_score(y_test, rf_pred)]}

pd.DataFrame(rf_results)

| | Metrics | Scores |
|---|---|---|
| 0 | Precision | 0.519556 |
| 1 | Recall | 0.519556 |
| 2 | Accuracy | 0.519556 |


In [0]:
print('Best parameters:', rf_search.best_params_)

Best parameters: {'rf-clas__max_depth': 25, 'rf-clas__max_features': 31, 'rf-clas__n_estimators': 100, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}


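Random Forest has the lowest accuracy on the test set, with an accuracy of ~51.96%. This is almost exactly the share of POLITICS headlines in the test set (51.95%), which suggests the forest is predicting the majority class for nearly every headline.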

## Overall Comparison
Now that we have trained all of our models, let's compare their accuracy on the testing set. We will then choose one model to focus on evaluating and fine-tuning.

In [0]:
overall_models = pd.DataFrame({'Models': ['Multinomial Naive Bayes', 'Linear SVM',
                                          'Polynomial SVM', 'Gaussian SVM', 
                                          'Logistic Regression', 'Random Forest'],
                               'Training Set Accuracy': [nb_clas.score(X_train, y_train),
                                                         lin_svm.score(X_train, y_train),
                                                         poly_svm.score(X_train, y_train),
                                                         gaus_svm.score(X_train, y_train),
                                                         logreg_clas.score(X_train, y_train),
                                                         rf_clas.score(X_train, y_train)],
                               'Test Set Accuracy': [np.mean(nb_predicted == y_test),
                                            np.mean(svm_lin_pred == y_test),
                                            np.mean(svm_poly_pred == y_test),
                                            np.mean(svm_gaus_pred == y_test),
                                            np.mean(log_reg_pred == y_test),
                                            np.mean(rf_pred == y_test)]})

In [0]:
overall_models

We can see that logistic regression, polynomial SVM and Gaussian SVM are severely overfitting the data. The accuracy of random forest is too low in comparison to the other models so we take a look at Multinomial Naive Bayes and Linear SVM. 

In this case, Linear SVM has the higher test set accuracy, so we will focus on fine-tuning this model and using it for our classification purposes. We notice that it is overfitted as well, so we will look at what we can do to fix this, keeping in mind that the plan is to add Twitter data later, and that the extra data will hopefully resolve some of the overfitting.

## Fine-tuning Chosen Model

We have chosen Linear SVM for our model. There are two steps we need to consider:
1. Dealing with the overfitting 
2. Handling imbalanced classes 

### Overfitting 

We will first try retraining our SVM model with fewer features to see if this helps with our overfitting. 

The next step would be to gather more data, but that will be considered after we look at step 2. There are two options: find more news headline data, which can be targeted to help with the imbalanced classes as well, or use the Twitter data that I will be labelling for the transfer learning step.

In [0]:
# Precision, Recall, F1 Score and Support per Class (classes in sorted label order)
from sklearn.metrics import precision_recall_fscore_support
precision_recall_fscore_support(y_test, svm_lin_pred, average=None)

### Imbalanced Classes
We saw that we had approximately 10x more data for politics than we did for crime, so our model could end up classifying politics articles better than crime articles, or predicting politics more often than it should. 

Let's begin by looking at our confusion matrix. 

In [0]:
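# confusion_matrix orders rows (true) and columns (predicted) by sorted label,
# so the display names must follow the same alphabetical order
conf_matrix = pd.DataFrame(confusion_matrix(y_test, svm_lin_pred))
labels = ['Business', 'Crime', 'Entertainment', 'Politics', 'Sports']
conf_matrix.columns = labels
conf_matrix.index = labels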

conf_matrix

In [0]:
# The string labels are used as row names automatically, in sorted order
print(classification_report(y_test, svm_lin_pred))

The majority of our discrepancies seem to be between crime and sports, and between sports and politics. The next step is to look into how to handle imbalanced datasets and see if those preprocessing methods help with the model training, since we had a lot of politics data but substantially less crime data.
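
One common way to handle the imbalance is class weighting, where errors on rare classes cost more during training. This is a sketch of one option rather than something tried in this notebook. It computes the weights that scikit-learn's `class_weight='balanced'` setting uses, `n_samples / (n_classes * class_count)`, from the category counts shown earlier:

```python
# Headline counts per category, taken from the value_counts output above
counts = {
    'POLITICS': 32739,
    'ENTERTAINMENT': 16058,
    'BUSINESS': 5937,
    'SPORTS': 4884,
    'CRIME': 3405,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" weights: inversely proportional to class frequency
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for category, weight in weights.items():
    print(f"{category:<14}{weight:.3f}")
```

Under this scheme a crime headline would weigh roughly ten times as much as a politics headline. `SGDClassifier` accepts a `class_weight` argument, so this could be tried by adding `class_weight='balanced'` to the Linear SVM pipeline step.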

# Next Steps 
1. How do we handle overfitting and the imbalanced dataset?
2. Try to use different features (e.g. Word2Vec with logistic regression)
3. Collect tweets and apply transfer learning with the model and train again on the new data
4. Create visualizations 

# Extensions
1. Learn about deep learning and K-Nearest Neighbors (KNN) and see if these models perform better