# Classifying Science vs. Rural News

For this challenge, I will choose a corpus of data from nltk that includes predictable categories and create an analysis pipeline that includes the following steps:

- Data cleaning / processing / language parsing
- Create features using two different NLP methods: For example, BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.
- Pick one of the models and try to increase accuracy by at least 5 percentage points.

I will be looking at news clippings from the Australian Broadcasting Commission (the `abc` corpus in nltk) and using various Natural Language Processing methods to predict whether a sentence from an unseen news clipping is categorized as 'rural' or 'science'.

In [59]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import sklearn
%matplotlib inline

from nltk.corpus import abc
import re
import spacy

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

# Data cleaning / processing / language parsing

First, import rural/scientific news clips from text files. <br>
Check length and show previews of content.

In [3]:
rural = abc.raw('rural.txt')
science = abc.raw('science.txt')

print(f'Length of Rural Doc: {len(rural)}')
print(f'Length of Science Doc: {len(science)}\n')

print(rural[:3000] + '\n')
print(science[:3000])

Length of Rural Doc: 1808022
Length of Science Doc: 2246756

PM denies knowledge of AWB kickbacks
The Prime Minister has denied he knew AWB was paying kickbacks to Iraq despite writing to the wheat exporter asking to be kept fully informed on Iraq wheat sales.
Letters from John Howard and Deputy Prime Minister Mark Vaile to AWB have been released by the Cole inquiry into the oil for food program.
In one of the letters Mr Howard asks AWB managing director Andrew Lindberg to remain in close contact with the Government on Iraq wheat sales.
The Opposition's Gavan O'Connor says the letter was sent in 2002, the same time AWB was paying kickbacks to Iraq though a Jordanian trucking company.
He says the Government can longer wipe its hands of the illicit payments, which totalled $290 million.
"The responsibility for this must lay may squarely at the feet of Coalition ministers in trade, agriculture and the Prime Minister," he said.
But the Prime Minister says letters show he was inquiring abou

At roughly 2 million characters each, these documents may be too long for my computer to process. I will truncate each document to its first 500,000 characters. This may reduce the accuracy of the classifier, but it will allow the program to run faster on my processor.

In [4]:
rural_short = rural[:500000]
science_short = science[:500000]

Now, I'd like to remove the heading titles from the documents, as they are not complete sentences and will skew the classifier's interpretation of what a sentence is. I will identify heading titles with a regular expression that matches two or more newlines, then a run of text containing no full stop, then another newline. Each match is replaced with a single space.
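To see what this pattern does, here is a minimal sketch on a made-up two-story string (the sample text is an assumption for illustration, not taken from the corpus). It suggests that a heading between two stories is removed, while a heading at the very start of the text survives because nothing precedes it to satisfy the `\n\n+` part of the pattern.

```python
import re

# Same pattern as in the notebook cell.
heading_pattern = r'(\n\n+.[^\.]+\n)'

# Hypothetical sample: two stories, each a heading line followed by a body sentence.
sample = "Headline one\nBody one.\n\nHeadline two\nBody two.\n"

cleaned = re.sub(heading_pattern, ' ', sample)
print(repr(cleaned))
# 'Headline one\nBody one. Body two.\n'
```

Because `[^\.]` also matches newlines, the pattern can in principle swallow more than one line when a heading is followed by text that has no full stop, so it is worth spot-checking the cleaned output.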

In [5]:
rural_no_head = re.sub(r'(\n\n+.[^\.]+\n)', ' ', rural_short)
science_no_head = re.sub(r'(\n\n+.[^\.]+\n)', ' ', science_short)

Now, parse the documents with spaCy so we can split them into sentences and extract the lemmatized text of each one. Stop words are filtered out later, by the vectorizers.

In [6]:
nlp = spacy.load('en')

In [7]:
rural_spacy = nlp(rural_no_head)
science_spacy = nlp(science_no_head)

In [53]:
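# Each item of .sents is a spaCy Span (one sentence); Span.lemma_ is its lemmatized text.
rural_lemma_sents = [[sent.lemma_] for sent in rural_spacy.sents]
science_lemma_sents = [[sent.lemma_] for sent in science_spacy.sents]


# str() of a one-item list keeps the brackets and quotes (visible in the preview below);
# the vectorizers' default token pattern ignores that punctuation.
rural_str_sents = [[str(sent), 'rural'] for sent in rural_lemma_sents]
science_str_sents = [[str(sent), 'science'] for sent in science_lemma_sents]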

sentences_df = pd.DataFrame(rural_str_sents + science_str_sents)
print(sentences_df.head())
print(sentences_df.shape)

                                                   0      1
0              ['pm deny knowledge of awb kickback']  rural
1  ['the prime minister have deny -PRON- know awb...  rural
2  ['letters from john howard and deputy prime mi...  rural
3  ['in one of the letter mr howard ask awb manag...  rural
4  ["the opposition have gavan o'connor say the l...  rural
(7464, 2)


# Pipeline

Creating a pipeline will complete the following core objectives:
- Create features using two different NLP methods: For example, BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.

I will create a pipeline and run a grid search over it to determine the best techniques and parameters for feature generation and classification. The first step in the pipeline is feature generation, using either a bag-of-words or a tf-idf vectorizer. The second step fits a supervised learning model, either random forest or logistic regression, to those features. The grid search uses cross-validation and returns the parameters of the model with the best performance (highest mean accuracy). I will then use those parameters to evaluate the model on the held-out test set.
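As a rough illustration of how the two feature types differ, the sketch below runs both vectorizers on a made-up three-document corpus (the documents are an assumption, chosen only so the numbers are easy to follow). Bag-of-words stores raw term counts, while tf-idf scales each count down for terms that occur in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus, purely for illustration.
docs = ["wheat price rise", "wheat wheat harvest", "genome study"]

bow = CountVectorizer()
counts = bow.fit_transform(docs).toarray()

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

# Bag-of-words stores raw counts: "wheat" occurs twice in the second document.
print(counts[1, bow.vocabulary_["wheat"]])

# In the first document "wheat" and "price" each occur once, but tf-idf gives
# "wheat" the lower weight because it appears in two documents rather than one.
wheat_weight = weights[0, tfidf.vocabulary_["wheat"]]
price_weight = weights[0, tfidf.vocabulary_["price"]]
print(wheat_weight < price_weight)
```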

In [54]:
X = sentences_df[0]
Y = sentences_df[1]

X_train, X_test, Y_train, Y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.8,
                                                    random_state=1)
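Note that `test_size=0.8` holds out 80% of the sentences for testing and trains on the remaining 20%. The arithmetic below is a small sketch of the resulting split sizes; it assumes the test count is rounded up, which is my understanding of how `train_test_split` handles a fractional `test_size`.

```python
import math

n_rows = 7464      # number of sentences, from sentences_df.shape above
test_size = 0.8    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
# 1492 5972
```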

In [31]:
cv_min_df = [0.25, 0.5, 0.75]
cv_max_df = [0.25, 0.5, 0.75]
cv_max_features = [100, 500, 1000, None]

tfidf_min_df = [0.25, 0.5, 0.75]
tfidf_max_df = [0.25, 0.5, 0.75]
tfidf_max_features = [100, 500, 1000, None]

rf_depth_max = [None, 10,100, 1000]
rf_ft_max = [None, 100, 1000]
rf_min_splits = [2, 10, 100, 1000]
rf_n_trees = [10, 50]

logistic_c = [1e-3, 1e-2, 1e-1, 1, 10, 50, 100, 1000, 10000]

In [55]:
pipe = Pipeline([
        ('create_feat', CountVectorizer()),
        ('classify', RandomForestClassifier())
         ])

cv_min_df = [1, 5]
cv_max_df = [0.5, 0.75]
cv_max_features = [100, 500, None]

tfidf_min_df = [1, 5]
tfidf_max_df = [0.5, 0.75]
tfidf_max_features = [100, 500, None]

rf_depth_max = [None, 10,100, 1000]
rf_ft_max = [None, 100, 500]
rf_min_splits = [2, 10, 100, 1000]
rf_n_trees = [10, 50]

logistic_c = [1e-3, 1e-2, 1e-1, 1, 10, 50, 100, 1000, 10000]

param_grid = [
    {
        'create_feat': [CountVectorizer(stop_words='english',analyzer='word')],
        'create_feat__min_df': cv_min_df,
        'create_feat__max_df': cv_max_df,
        'create_feat__max_features': cv_max_features,
        'classify': [RandomForestClassifier()],
        'classify__max_depth': rf_depth_max,
        #'classify__max_features': rf_ft_max,
        'classify__min_samples_split': rf_min_splits,
        'classify__n_estimators': rf_n_trees
    },
    {
        'create_feat': [TfidfVectorizer(stop_words='english',analyzer='word')],
        'create_feat__min_df': tfidf_min_df,
        'create_feat__max_df': tfidf_max_df,
        'create_feat__max_features': tfidf_max_features,
        'classify': [RandomForestClassifier()],
        'classify__max_depth': rf_depth_max,
        #'classify__max_features': rf_ft_max,
        'classify__min_samples_split': rf_min_splits,
        'classify__n_estimators': rf_n_trees
    },
    {
        'create_feat': [CountVectorizer(stop_words='english',analyzer='word')],
        'create_feat__min_df': cv_min_df,
        'create_feat__max_df': cv_max_df,
        'create_feat__max_features': cv_max_features,
        'classify': [LogisticRegression()],
        'classify__C': logistic_c
    },
        {
        'create_feat': [TfidfVectorizer(stop_words='english',analyzer='word')],
        'create_feat__min_df': tfidf_min_df,
        'create_feat__max_df': tfidf_max_df,
        'create_feat__max_features': tfidf_max_features,
        'classify': [LogisticRegression()],
        'classify__C': logistic_c
    },
]

grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)
grid.fit(X_train, Y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('create_feat', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
      ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'create_feat': [CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.75, max_features=None, min_df=5,
        ngram_range=(1, 1), preprocessor=None, stop_words='engl...     verbose=0, warm_start=False)], 'classify__C': [0.001, 0.01, 0.1, 1, 10, 50, 100, 1000, 10000]}],
       pre_dispatch='2*n_jobs', refit=True, return_tra
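To get a feel for how much work this search does, the following sketch counts the candidate settings implied by the value lists in the grid cell above. The per-parameter counts are copied from those lists; the dictionary names are my own and exist only for this illustration.

```python
from math import prod

# Number of values tried per hyperparameter, copied from the grid cell above.
vectorizer_sizes = {"min_df": 2, "max_df": 2, "max_features": 3}
random_forest_sizes = {"max_depth": 4, "min_samples_split": 4, "n_estimators": 2}
logistic_sizes = {"C": 9}

n_vectorizer = prod(vectorizer_sizes.values())
n_rf = n_vectorizer * prod(random_forest_sizes.values())
n_lr = n_vectorizer * prod(logistic_sizes.values())

# Each classifier is paired with both vectorizers (bag-of-words and tf-idf).
n_candidates = 2 * n_rf + 2 * n_lr
n_fits = n_candidates * 3  # cv=3, not counting the final refit on all training data

print(n_candidates, n_fits)
# 984 2952
```

Most of the candidates come from the random forest grids, so trimming those lists is the quickest way to shorten the search.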

In [56]:
print(f'best params:\n {grid.best_params_}')
print(f'best score:\n {grid.best_score_}')

best params:
 {'classify': LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False), 'classify__C': 1, 'create_feat': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.75, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None), 'create_feat__max_df': 0.75, 'create_feat__max_features': None, 'create_feat__min_df': 1}
best score:
 0.8820375335120644


The best method of feature generation was tf-idf vectorizer, and the best method of classification was logistic regression with default C=1. This generated a cross-validated training score of **0.882**. Let's see how this model performs on the test set.

In [57]:
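# Only the 'create_feat' step is replaced here; 'classify' keeps the pipeline's
# original RandomForestClassifier (see oob_score in the printed pipeline below).
pipe.set_params(create_feat=TfidfVectorizer(stop_words='english',analyzer='word'),
                create_feat__min_df=1,
                create_feat__max_df=0.75,
                create_feat__max_features=None).fit(X_train,Y_train)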

Pipeline(memory=None,
     steps=[('create_feat', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.75, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [60]:
test_pred = pipe.predict(X_test)
print(f'Testing Accuracy: {accuracy_score(Y_test, test_pred)}')

Testing Accuracy: 0.8039182853315472


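The testing score of **0.804** is well below the cross-validated training score of 0.882. Overfitting is one possible explanation, but the comparison is not like-for-like: `set_params` changed only the `create_feat` step, so the pipeline that was scored still uses `RandomForestClassifier` (its `oob_score` parameter appears in the printed pipeline) rather than the logistic regression that the grid search selected.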
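One way to make sure the test score belongs to the model that the grid search selected is to predict with `grid.best_estimator_`, which, with the default `refit=True`, is the winning pipeline refit on all of the training data. The sketch below uses a tiny made-up corpus and a two-candidate grid (the `toy_` names and sentences are assumptions for illustration), and also shows that the pipeline passed to `GridSearchCV` is left unchanged.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to make the sketch runnable.
toy_X = [
    "wheat farmers welcome rain", "cattle prices rise at market",
    "drought hits grain harvest", "farmers sell sheep at auction",
    "rain helps wheat crop", "grain growers fear drought",
    "scientists map the genome", "telescope finds new planet",
    "researchers study brain cells", "physicists measure the particle",
    "genome study finds new gene", "planet orbits distant star",
]
toy_y = ["rural"] * 6 + ["science"] * 6

toy_pipe = Pipeline([
    ("create_feat", CountVectorizer()),
    ("classify", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_grid = GridSearchCV(
    toy_pipe,
    param_grid=[
        {"create_feat": [CountVectorizer()],
         "classify": [RandomForestClassifier(n_estimators=10, random_state=0)]},
        {"create_feat": [TfidfVectorizer()],
         "classify": [LogisticRegression()]},
    ],
    cv=3,
)
toy_grid.fit(toy_X, toy_y)

# GridSearchCV works on clones, so the pipeline passed in keeps its original steps.
print(type(toy_pipe.named_steps["classify"]).__name__)

# best_estimator_ carries both the selected vectorizer and the selected classifier.
best = toy_grid.best_estimator_
print(type(best.named_steps["classify"]).__name__)
toy_pred = best.predict(["farmers expect rain"])
print(toy_pred)
```

In this notebook the equivalent call would be `grid.best_estimator_.predict(X_test)`; I have not run that here, so no score is quoted for it.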

# Pick one of the models and try to increase accuracy by at least 5 percentage points

Let's try to improve this model by narrowing the grid search to just tf-idf and logistic regression, and adding more variety to the hyperparameters for these techniques.

pipe2 = Pipeline([
        ('create_feat', TfidfVectorizer()),
        ('classify', LogisticRegression())
         ])

tfidf_min_df2 = [1, 2, 5, 10]
tfidf_max_df2 = [0.5, 0.6,0.7, 0.75, 0.8]
tfidf_max_features2 = [50, 100, 150, None]

logistic_c2 = [1e-3, 1e-2, 1e-1, 1, 10, 50, 100, 1000, 10000]

param_grid2 = [
        {
        'create_feat': [TfidfVectorizer(stop_words='english',analyzer='word')],
        'create_feat__min_df': tfidf_min_df2,
        'create_feat__max_df': tfidf_max_df2,
        'create_feat__max_features': tfidf_max_features2,
        'classify': [LogisticRegression()],
        'classify__C': logistic_c2
    },
]

grid2 = GridSearchCV(pipe2, cv=3, n_jobs=1, param_grid=param_grid2)
grid2.fit(X_train, Y_train)

In [64]:
print(f'best params:\n {grid2.best_params_}')
print(f'best score:\n {grid2.best_score_}')

best params:
 {'classify': LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False), 'classify__C': 1, 'create_feat': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.6, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None), 'create_feat__max_df': 0.6, 'create_feat__max_features': None, 'create_feat__min_df': 1}
best score:
 0.8820375335120644


In [67]:
pipe2.set_params(create_feat=TfidfVectorizer(stop_words='english',analyzer='word'),
                create_feat__min_df=1,
                create_feat__max_df=0.6,
                create_feat__max_features=None).fit(X_train,Y_train)
test_pred2 = pipe2.predict(X_test)
print(f'Testing Accuracy: {accuracy_score(Y_test, test_pred2)}')

Testing Accuracy: 0.8879772270596116


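Decreasing `max_df` from 0.75 to 0.6 left the cross-validated training score unchanged (0.882 in both searches), while the testing score rose from 0.804 to **0.888**, a gain of about 8 percentage points, so the goal of a 5-point improvement is met. The gain should not be credited to `max_df` alone, though: `pipe2` was built with `LogisticRegression` as its `classify` step, whereas the first test run used `pipe`, whose `classify` step was still `RandomForestClassifier`, so the two test scores also differ in classifier.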
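For context on what this hyperparameter controls: when `max_df` is a float, the vectorizer drops every term whose document frequency exceeds that fraction, so 0.6 and 0.75 can only differ for terms that occur in more than 60% but at most 75% of the sentences. The sketch below shows this on a made-up three-document corpus (an assumption for illustration) in which "wheat" appears in two of three documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "wheat" is in 2 of 3 documents (document frequency about 0.67).
docs = ["wheat harvest", "wheat price", "genome study"]

kept = TfidfVectorizer(max_df=0.75).fit(docs)
dropped = TfidfVectorizer(max_df=0.6).fit(docs)

print("wheat" in kept.vocabulary_)     # 0.67 <= 0.75, so the term is kept
print("wheat" in dropped.vocabulary_)  # 0.67 > 0.6, so the term is dropped
print(sorted(dropped.vocabulary_))
```

Comparing `len(vectorizer.vocabulary_)` for the two fitted vectorizers on the real training sentences would show whether any terms fall in that band at all.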

# Conclusion

Implementing a pipeline with grid search was a convenient way to compare approaches for classifying rural vs. science news on a per-sentence basis. I offered two options for feature generation (bag-of-words and tf-idf vectorizer) and two options for classification (random forest and logistic regression). I chose the best model by looking at the cross-validation score on the training dataset (20% of the data, since `test_size=0.8`) and then feeding unseen test data into the trained model. Tf-idf vectorization with logistic regression was the best-performing model, with a cross-validation score of 0.882 and a final testing score of **0.888.**