# CSC 620 -- HA #9

By: Mark Kim

Adapted from
[dbaghern](https://www.kaggle.com/code/baghern/a-deep-dive-into-sklearn-pipelines/notebook)

This assignment is about feature engineering and streamlining the process of
modeling.

Below, we import the libraries we need and read in our training data.  This data
consists of sentences from horror stories and their authors.  The purpose of this notebook is
to predict the author of a horror story sentence given its text.

In [153]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df = pd.read_csv('./input/train.csv')

df = df.dropna(axis=0)  # dropna returns a new frame, so keep the result
df.set_index('id', inplace = True)

df.head()

id,text,author
id26305,"This process, however, afforded me no means of...",EAP
id17569,It never once occurred to me that the fumbling...,HPL
id11008,"In his left hand was a gold snuff box, from wh...",EAP
id27763,How lovely is spring As we looked from Windsor...,MWS
id12958,"Finding nothing else, not even gold, the Super...",HPL


## Preprocessing and Feature Engineering

The function in this next cell does a little bit of text normalization 
(removal of punctuation and capitalization), then does some feature engineering.

The feature engineering of the original author consists of the following:
1. Number of characters in the sentence/text.
2. Number of words in the sentence/text.
3. Number of words in the sentence/text (minus stopwords).
4. Average word length of words in sentence/text (minus stopwords).
5. Number of commas used in a sentence/text.

Starting on line 26 of the code below, I have implemented the three additional
features required by the assignment:
1. Number of Adjectives.
2. Number of Nouns.
3. Number of Verbs.

In [154]:
import re
from nltk.corpus import stopwords
from nltk.tag import pos_tag

stopWords = set(stopwords.words('english'))

#creating a function to encapsulate preprocessing, to make it easy to replicate on submission data
def processing(df):
    #lowering and removing punctuation
    df['processed'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]','', x.lower()))
    
    #numerical feature engineering
    #total length of sentence
    df['length'] = df['processed'].apply(lambda x: len(x))
    #get number of words
    df['words'] = df['processed'].apply(lambda x: len(x.split(' ')))
    df['words_not_stopword'] = df['processed'].apply(
        lambda x: len([t for t in x.split(' ') if t not in stopWords]))
    #get the average word length
    df['avg_word_length'] = df['processed'].apply(
        lambda x: np.mean([len(t) for t in x.split(' ') if t not in stopWords]) 
            if len([len(t) for t in x.split(' ') if t not in stopWords]) > 0 else 0)
    #get the number of commas (counted on the raw text, before punctuation is removed)
    df['commas'] = df['text'].apply(lambda x: x.count(','))

    # This section adds the three new features as required in the assignment instructions
    # tokenize and tag parts of speech for processing
    df['pos'] = df['processed'].apply(lambda x: x.split(' ')).apply(lambda x: pos_tag(x))
    # get the count of adjectives
    adj = df['pos'].apply(
        lambda x: list(filter(lambda y: y[1] == 'JJ' 
        or y[1] == 'JJR' 
        or y[1] == 'JJS', x)))
    df['adj_count'] = adj.apply(lambda x: len(x))
    # get the count of nouns
    noun = df['pos'].apply(
        lambda x: list(filter(lambda y: y[1] == 'NN' 
        or y[1] == 'NNS' 
        or y[1] == 'NNP'
        or y[1] == 'NNPS', x)))
    df['noun_count'] = noun.apply(lambda x: len(x))
    # get the count of verbs
    verb = df['pos'].apply(
        lambda x: list(filter(lambda y: y[1] == 'VB' 
        or y[1] == 'VBD' 
        or y[1] == 'VBG'
        or y[1] == 'VBN' 
        or y[1] == 'VBP'
        or y[1] == 'VBZ', x)))
    df['verb_count'] = verb.apply(lambda x: len(x))

    return df

df = processing(df)

df.head()

id,text,author,processed,length,words,words_not_stopword,avg_word_length,commas,pos,adj_count,noun_count,verb_count
id26305,"This process, however, afforded me no means of...",EAP,this process however afforded me no means of a...,224,41,21,6.380952,4,"[(this, DT), (process, NN), (however, RB), (af...",2,12,6
id17569,It never once occurred to me that the fumbling...,HPL,it never once occurred to me that the fumbling...,70,14,6,6.166667,0,"[(it, PRP), (never, RB), (once, RB), (occurred...",1,2,2
id11008,"In his left hand was a gold snuff box, from wh...",EAP,in his left hand was a gold snuff box from whi...,195,36,19,5.947368,4,"[(in, IN), (his, PRP$), (left, JJ), (hand, NN)...",5,10,4
id27763,How lovely is spring As we looked from Windsor...,MWS,how lovely is spring as we looked from windsor...,202,34,21,6.47619,3,"[(how, WRB), (lovely, RB), (is, VBZ), (spring,...",6,10,5
id12958,"Finding nothing else, not even gold, the Super...",HPL,finding nothing else not even gold the superin...,170,27,16,7.1875,2,"[(finding, VBG), (nothing, NN), (else, RB), (n...",1,6,6
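
As a small check on the features above, the sketch below recomputes the tokens and tag counts for one sentence. The sentence and the hand-written tags are made up for illustration (they stand in for `pos_tag` output, so no NLTK data is needed), and `count_tags` is a hypothetical helper. It also shows one thing to watch for: `split(' ')` yields empty strings when removing punctuation leaves two spaces in a row, whereas `split()` does not.

```python
import re

# Hypothetical example sentence (not from the dataset)
sentence = "The door - old, heavy - opened."

# Same normalization as processing(): lowercase, then strip punctuation
processed = re.sub(r'[^\w\s]', '', sentence.lower())

tokens_single_space = processed.split(' ')  # may contain '' tokens
tokens_whitespace = processed.split()       # never contains '' tokens

# Hand-written (token, tag) pairs in Penn Treebank style,
# standing in for the output of pos_tag
tagged = [("the", "DT"), ("door", "NN"), ("old", "JJ"),
          ("heavy", "JJ"), ("opened", "VBD")]

def count_tags(tagged_tokens, prefix):
    """Count tokens whose tag starts with the prefix ('JJ', 'NN', or 'VB')."""
    return sum(1 for _, tag in tagged_tokens if tag.startswith(prefix))

adj_count = count_tags(tagged, "JJ")   # JJ, JJR, JJS
noun_count = count_tags(tagged, "NN")  # NN, NNS, NNP, NNPS
verb_count = count_tags(tagged, "VB")  # VB, VBD, VBG, VBN, VBP, VBZ

print(tokens_single_space)
print(tokens_whitespace)
print(adj_count, noun_count, verb_count)
```

Matching on the tag prefix selects the same tags as the chained `or` comparisons in `processing`. Because the empty strings produced by `split(' ')` are not stopwords, they would be counted in `words` and `words_not_stopword` and would pull `avg_word_length` down.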


### Further split the `train.csv` set into train and test sets.

Here we extract the feature columns from the dataframe (one list that has
all features and another that only has numeric features).

Then the set is split into train and test sets.

In [155]:
from sklearn.model_selection import train_test_split

features = [c for c in df.columns.values if c not in ['id', 'text', 'author', 'pos']]
numeric_features = [c for c in df.columns.values if c not in ['id', 'text', 'author', 'processed', 'pos']]
target = 'author'

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.33, random_state=42)
X_train.head()

id,processed,length,words,words_not_stopword,avg_word_length,commas,adj_count,noun_count,verb_count
id19417,this panorama is indeed glorious and i should ...,91,18,6,6.666667,1,1,4,2
id09522,there was a simple natural earnestness about h...,240,44,18,6.277778,4,7,8,7
id22732,who are you pray that i duc de lomelette princ...,387,74,38,5.552632,9,3,18,10
id10351,he had gone in the carriage to the nearest tow...,118,24,11,5.363636,0,1,8,3
id24580,there is no method in their proceedings beyond...,71,13,5,7.0,1,0,4,1
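
The split sizes can be reproduced without the data. This is a minimal sketch that assumes the cleaned dataframe has 19,579 rows, which is the sum of the 13,117 training rows and 6,462 test rows reported elsewhere in this notebook; `row_ids` is a stand-in for the real rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed row count: 13117 training rows + 6462 test rows
n_rows = 13117 + 6462

# Stand-in for the dataframe rows: one integer per sentence
row_ids = np.arange(n_rows)

train_ids, test_ids = train_test_split(
    row_ids, test_size=0.33, random_state=42)

print(len(train_ids), len(test_ids))
```

With a fractional `test_size`, the test set size is rounded up, and the remaining rows go to the training set.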


### Creating a Pipeline

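First, we create transformer classes that each return a single column from a
dataframe.  Apparently, the original author ran into problems when selecting a
text column versus a numerical column, so a separate class was created for each:
`TextSelector` returns a one-dimensional Pandas Series (the input that
`TfidfVectorizer` expects), while `NumberSelector` returns a single-column
DataFrame (the two-dimensional input that `StandardScaler` expects).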

In [156]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
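
The only difference between the two selectors is single versus double brackets. The sketch below shows the resulting shapes on a made-up two-row frame (the frame and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical two-row frame with one text column and one numeric column
frame = pd.DataFrame({"processed": ["a dark night", "the old house"],
                      "length": [12, 13]})

text_column = frame["processed"]    # single brackets -> 1-D Series
numeric_column = frame[["length"]]  # double brackets -> 2-D, one-column DataFrame

print(type(text_column).__name__, text_column.shape)
print(type(numeric_column).__name__, numeric_column.shape)
```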
    


Here, a pipeline is formed that retrieves the `processed` text column from a
dataframe and then applies the `TfidfVectorizer` to the resulting Pandas Series.

I have explained the `TfidfVectorizer` in detail in a previous assignment.

In [157]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

text = Pipeline([
                ('selector', TextSelector(key='processed')),
                ('tfidf', TfidfVectorizer( stop_words='english'))
            ])

text.fit_transform(X_train)

<13117x21516 sparse matrix of type '<class 'numpy.float64'>'
	with 148061 stored elements in Compressed Sparse Row format>
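
To see what the vectorizer produces on a scale that can be read by eye, here is a minimal sketch on a made-up three-sentence corpus (the sentences are assumptions, not rows from the dataset).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-sentence corpus
corpus = ["the raven spoke",
          "the monster spoke",
          "the raven and the monster"]

vectorizer = TfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(corpus)  # sparse: one row per sentence

# Terms kept after English stop words are removed
vocabulary = sorted(vectorizer.vocabulary_)

print(matrix.shape)
print(vocabulary)
```

Each row corresponds to a sentence and each column to a vocabulary term, which is why the matrix above has 13,117 rows (training sentences) and 21,516 columns (terms).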

A similar pipeline is applied to the numeric data from the dataframe.  In this
case the data is standardized by removing the mean and scaling to unit variance.
The standard score of each sample $x$ is
$$ z = \frac{x - \mu}{\sigma}, $$
where $\mu$ and $\sigma$ are the mean and standard deviation of the feature in
the training set.

This is done so that each feature is centered around $0$ with unit variance.
Standardizing does not change the shape of a feature's distribution, but many
elements of the objective functions of learning algorithms (for example, the L1
and L2 regularizers of linear models) assume that all features are centered
around zero and have variances of the same order.

In [158]:
from sklearn.preprocessing import StandardScaler

length =  Pipeline([
                ('selector', NumberSelector(key='length')),
                ('standard', StandardScaler())
            ])

length.fit_transform(X_train)

array([[-0.50769254],
       [ 0.88000324],
       [ 2.24907223],
       ...,
       [-0.46112557],
       [-0.14447015],
       [-0.39593181]])
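
The standard score can be checked by hand. This is a minimal sketch with NumPy on made-up sentence lengths (the values are assumptions); it uses the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical sentence lengths
lengths = np.array([70.0, 91.0, 118.0, 195.0, 224.0])

mu = lengths.mean()
sigma = lengths.std()  # population standard deviation (ddof=0)

z = (lengths - mu) / sigma

print(z)
print(z.mean(), z.std())
```

After the transformation the values have mean $0$ and standard deviation $1$ (up to floating-point error), but their relative spacing is unchanged.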

The same process is followed for each of the other numeric features.

In [159]:
words =  Pipeline([
                ('selector', NumberSelector(key='words')),
                ('standard', StandardScaler())
            ])
words_not_stopword =  Pipeline([
                ('selector', NumberSelector(key='words_not_stopword')),
                ('standard', StandardScaler())
            ])
avg_word_length =  Pipeline([
                ('selector', NumberSelector(key='avg_word_length')),
                ('standard', StandardScaler())
            ])
commas =  Pipeline([
                ('selector', NumberSelector(key='commas')),
                ('standard', StandardScaler()),
            ])
adj_count =  Pipeline([
                ('selector', NumberSelector(key='adj_count')),
                ('standard', StandardScaler())
            ])
noun_count =  Pipeline([
                ('selector', NumberSelector(key='noun_count')),
                ('standard', StandardScaler())
            ])
verb_count =  Pipeline([
                ('selector', NumberSelector(key='verb_count')),
                ('standard', StandardScaler()),
            ])
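
Since every numeric pipeline has the same two steps, the repetition could be folded into a small helper. The sketch below is one possible way to do that, not the original author's code: `make_numeric_pipeline` and the toy frame are hypothetical, and `NumberSelector` is copied from above so the block runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class NumberSelector(BaseEstimator, TransformerMixin):
    """Copy of the notebook's selector so this sketch is self-contained."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

def make_numeric_pipeline(key):
    """Hypothetical helper: select one numeric column, then standardize it."""
    return Pipeline([
        ('selector', NumberSelector(key=key)),
        ('standard', StandardScaler()),
    ])

numeric_keys = ['length', 'words', 'commas']
numeric_pipelines = {key: make_numeric_pipeline(key) for key in numeric_keys}

# Hypothetical toy frame to show that the pipelines run
toy = pd.DataFrame({'length': [70, 195, 224],
                    'words': [14, 36, 41],
                    'commas': [0, 4, 4]})
scaled_commas = numeric_pipelines['commas'].fit_transform(toy)
print(scaled_commas.shape)
```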

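`FeatureUnion` is then used to concatenate the results from all the
`Pipeline` objects.  The resulting estimator applies each pipeline to the same
input and stacks the outputs side by side.  The original author then puts the
`FeatureUnion` estimator into a pipeline to demonstrate a fit/transformation on
the training set; that step is omitted here, and the cell below only defines the
union of the original features as `feats_original`.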

In [160]:
from sklearn.pipeline import FeatureUnion

feats_original = FeatureUnion([('text', text), 
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)])

Here, I added the three new features to the `FeatureUnion` for analysis.

In [161]:
feats = FeatureUnion([('text', text), 
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas),
                      ('adj_count', adj_count),
                      ('noun_count', noun_count),
                      ('verb_count', verb_count)])
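
To make the column-wise concatenation concrete, the sketch below joins one text pipeline and one numeric pipeline on a made-up three-row frame. The frame is an assumption for illustration, and the two selector classes are copied from above so the block runs on its own.

```python
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

class TextSelector(BaseEstimator, TransformerMixin):
    """Copy of the notebook's text selector (returns a 1-D Series)."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

class NumberSelector(BaseEstimator, TransformerMixin):
    """Copy of the notebook's numeric selector (returns a 2-D DataFrame)."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

# Hypothetical toy data
toy = pd.DataFrame({
    'processed': ["the raven spoke", "the monster spoke",
                  "the raven and the monster"],
    'length': [15, 17, 25],
})

union = FeatureUnion([
    ('text', Pipeline([('selector', TextSelector(key='processed')),
                       ('tfidf', TfidfVectorizer(stop_words='english'))])),
    ('length', Pipeline([('selector', NumberSelector(key='length')),
                         ('standard', StandardScaler())])),
])

combined = union.fit_transform(toy)

# Number of TF-IDF columns learned by the fitted text pipeline
fitted_text = dict(union.transformer_list)['text']
n_text_columns = len(fitted_text.named_steps['tfidf'].vocabulary_)

print(combined.shape, n_text_columns)
```

The combined matrix has one column per TF-IDF term plus one column per numeric feature, and it stays sparse because one of the inputs is sparse.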

### Adding a Classifier

Ultimately, the original author expanded the previous cell by adding a
`RandomForestClassifier` to the pipeline so that it can be fit to the dataset.
That model predicted the test set correctly $67.92\%$ of the time.  In our case,
we use Logistic Regression instead, as shown in the cell following this
commented-out cell.

In [30]:
# from sklearn.ensemble import RandomForestClassifier

# pipeline = Pipeline([
#     ('features',feats),
#     ('classifier', RandomForestClassifier(random_state = 42)),
# ])

# pipeline.fit(X_train, y_train)

# preds = pipeline.predict(X_test)
# np.mean(preds == y_test)

0.6792014856081708

### 2a Using Logistic Regression instead of Random Forest

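This cell replaces Random Forest with Logistic Regression and fits two
pipelines: one with the original features (`pipeline_original`) and one with the
three added features (`pipeline`).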

In [163]:
from sklearn.linear_model import LogisticRegression

pipeline_original = Pipeline([
    ('features',feats_original),
    ('classifier', LogisticRegression(max_iter=350, random_state = 42)),
])

pipeline = Pipeline([
    ('features',feats),
    ('classifier', LogisticRegression(max_iter=350, random_state = 42)),
])

pipeline_original.fit(X_train, y_train)
pipeline.fit(X_train, y_train)

preds_original = pipeline_original.predict(X_test)
preds = pipeline.predict(X_test)
print("Original: ", np.mean(preds_original == y_test))
print("New: ", np.mean(preds == y_test))

Original:  0.7807180439492417
New:  0.780099040544723


### 2c - Classification Report

This is the classification report before cross-validation/hyperparameter tuning.
Although a Random Forest report is not shown here, Logistic Regression performs
noticeably better by accuracy: about $78\%$ versus $67.92\%$ for Random Forest.

In [164]:
from sklearn.metrics import classification_report

print("Old model:")
print(classification_report(y_test, preds_original))

print("New model (with 3 added features):")
print(classification_report(y_test, preds))

Old model:
              precision    recall  f1-score   support

         EAP       0.74      0.85      0.79      2587
         HPL       0.81      0.74      0.77      1852
         MWS       0.83      0.73      0.78      2023

    accuracy                           0.78      6462
   macro avg       0.79      0.77      0.78      6462
weighted avg       0.79      0.78      0.78      6462

New model (with 3 added features):
              precision    recall  f1-score   support

         EAP       0.74      0.84      0.79      2587
         HPL       0.81      0.75      0.78      1852
         MWS       0.81      0.73      0.77      2023

    accuracy                           0.78      6462
   macro avg       0.79      0.77      0.78      6462
weighted avg       0.78      0.78      0.78      6462
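
For reference, the per-class columns of the report can be computed from raw counts. This is a minimal sketch in plain Python on made-up labels (the six labels and the helper `precision_recall_f1` are assumptions for illustration, not output from the models above).

```python
# Hypothetical true and predicted author labels for six sentences
y_true = ["EAP", "EAP", "EAP", "HPL", "HPL", "MWS"]
y_pred = ["EAP", "EAP", "HPL", "HPL", "EAP", "MWS"]

def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

for label in ["EAP", "HPL", "MWS"]:
    print(label, precision_recall_f1(y_true, y_pred, label))
```

Precision asks how many of the sentences attributed to an author were really by that author; recall asks how many of an author's sentences were found.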



## Cross Validation

As was covered in class, we want to tune the parameters.  Here, the original
author explores the parameters available for tuning.

In [135]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'features', 'classifier', 'features__n_jobs', 'features__transformer_list', 'features__transformer_weights', 'features__verbose', 'features__text', 'features__length', 'features__words', 'features__words_not_stopword', 'features__avg_word_length', 'features__commas', 'features__adj_count', 'features__noun_count', 'features__verb_count', 'features__text__memory', 'features__text__steps', 'features__text__verbose', 'features__text__selector', 'features__text__tfidf', 'features__text__selector__key', 'features__text__tfidf__analyzer', 'features__text__tfidf__binary', 'features__text__tfidf__decode_error', 'features__text__tfidf__dtype', 'features__text__tfidf__encoding', 'features__text__tfidf__input', 'features__text__tfidf__lowercase', 'features__text__tfidf__max_df', 'features__text__tfidf__max_features', 'features__text__tfidf__min_df', 'features__text__tfidf__ngram_range', 'features__text__tfidf__norm', 'features__text__tfidf__preprocessor', '
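
The long key names follow a pattern: step names are joined by double underscores until a parameter of the innermost estimator is reached. The sketch below shows the same convention on a smaller, hypothetical two-step pipeline (`demo` is not part of the notebook).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical two-step pipeline, smaller than the one in this notebook
demo = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])

# '<step name>__<parameter>' reaches into a step of the pipeline
demo.set_params(tfidf__max_df=0.8, classifier__C=0.9)

print(demo.named_steps['tfidf'].max_df)
print(demo.named_steps['classifier'].C)
print('tfidf__ngram_range' in demo.get_params())
```

In the notebook's pipeline the text vectorizer sits two levels deeper, inside the `features` union and its `text` pipeline, which gives names such as `features__text__tfidf__max_df`.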

The original author chose the hyperparameters shown in the cell below and then
ran a `GridSearchCV` to tune over the list of values in the hyperparameter
dictionary.

In [147]:
from sklearn.model_selection import GridSearchCV

hyperparameters = { 'features__text__tfidf__max_df': [0.8, 0.95],
                    'features__text__tfidf__ngram_range': [(1,1), (1,2)],
                    'classifier__C': [0.9, 1.1],
                    'classifier__intercept_scaling': [0.9, 1.1]
                  }
clf_old = GridSearchCV(pipeline_original, hyperparameters, cv=5)
clf = GridSearchCV(pipeline, hyperparameters, cv=5)
 
# Fit and tune the old model
clf_old.fit(X_train, y_train)
 
# Fit and tune the new model
clf.fit(X_train, y_train)
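
The cost of the search can be counted before running it. This minimal sketch uses `ParameterGrid` on the same dictionary as above; the fold count of 5 matches `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {'features__text__tfidf__max_df': [0.8, 0.95],
                   'features__text__tfidf__ngram_range': [(1, 1), (1, 2)],
                   'classifier__C': [0.9, 1.1],
                   'classifier__intercept_scaling': [0.9, 1.1]}

grid = ParameterGrid(hyperparameters)

n_candidates = len(grid)     # every combination of the listed values
n_fits = n_candidates * 5    # one fit per candidate per fold (cv=5)

print(n_candidates, n_fits)
```

Each of the two searches therefore fits the pipeline once per candidate per fold, plus one final fit on the whole training set. As far as the scikit-learn documentation indicates, `intercept_scaling` only has an effect with the `liblinear` solver, so with the default solver that axis of the grid may not change the model at all.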

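The best parameters are found using the `best_params_` attribute.  Note that
several of the values printed below (for example, a `classifier__C` of 1.05) do
not appear in the grid defined above, so this output appears to come from a run
with a different grid.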

In [148]:
print("Old Model Best Params:")
print(clf_old.best_params_)

print("New Model Best Params:")
print(clf.best_params_)

Old Model Best Params:
{'classifier__C': 1.05, 'classifier__intercept_scaling': 0.95, 'features__text__tfidf__max_df': 0.9, 'features__text__tfidf__ngram_range': (1, 1)}
New Model Best Params:
{'classifier__C': 1.05, 'classifier__intercept_scaling': 0.95, 'features__text__tfidf__max_df': 0.9, 'features__text__tfidf__ngram_range': (1, 1)}


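`GridSearchCV` refits the best parameter combination on the entire training set
automatically, because `refit=True` is the default, so the tuned models can be
used for prediction directly.

In [149]:
# GridSearchCV was created with the default refit=True, so the best estimator
# has already been refit on the entire training data; no extra call is needed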

preds_old = clf_old.predict(X_test)
probs_old = clf_old.predict_proba(X_test)
preds = clf.predict(X_test)
probs = clf.predict_proba(X_test)

print("Old model:")
print(np.mean(preds_old == y_test))
print("New Model:")
print(np.mean(preds == y_test))

Old model:
0.7816465490560198
New Model:
0.7807180439492417
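
In `GridSearchCV`, `refit` is a constructor argument rather than a method, and calling `predict` on the search object delegates to the refitted best estimator. The sketch below checks this on the iris dataset with a hypothetical one-parameter grid; it is an illustration, not the notebook's model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical small search: two values of C, 3-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1.0]}, cv=3)
search.fit(X, y)

# refit is a stored setting (True by default), not something to call
print(search.refit)

# predict() on the search object uses the refitted best estimator
same = np.array_equal(search.predict(X),
                      search.best_estimator_.predict(X))
print(search.best_params_, same)
```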


### 2c - Classification Report

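This is the classification report after cross-validation/hyperparameter tuning.
The tuned predictions for the old model are stored in `preds_old`.  The "Old
model" table in the output below is identical to the report before tuning
because it was generated from `preds_original`.

In [150]:
print("Old model:")
print(classification_report(y_test, preds_old))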

print("New model (with 3 added features):")
print(classification_report(y_test, preds))

Old model:
              precision    recall  f1-score   support

         EAP       0.74      0.85      0.79      2587
         HPL       0.81      0.74      0.77      1852
         MWS       0.83      0.73      0.78      2023

    accuracy                           0.78      6462
   macro avg       0.79      0.77      0.78      6462
weighted avg       0.79      0.78      0.78      6462

New model (with 3 added features):
              precision    recall  f1-score   support

         EAP       0.74      0.84      0.79      2587
         HPL       0.81      0.75      0.78      1852
         MWS       0.82      0.73      0.77      2023

    accuracy                           0.78      6462
   macro avg       0.79      0.77      0.78      6462
weighted avg       0.78      0.78      0.78      6462



## Final Predictions

Here, the final results are gathered with the probability score for each author
for each sentence.

In [152]:
submission = pd.read_csv('./input/test.csv')

#preprocessing
submission = processing(submission)
predictions_old = clf_old.predict_proba(submission)
predictions = clf.predict_proba(submission)

preds_old = pd.DataFrame(data=predictions_old, columns = clf_old.best_estimator_.named_steps['classifier'].classes_)
preds = pd.DataFrame(data=predictions, columns = clf.best_estimator_.named_steps['classifier'].classes_)

#generating a submission file
result_old = pd.concat([submission[['id']], preds_old], axis=1)
result_old.set_index('id', inplace = True)
print("Old:")
print(result_old.head())
print()

result = pd.concat([submission[['id']], preds], axis=1)
result.set_index('id', inplace = True)
print("New:")
print(result.head())

Old:
              EAP       HPL       MWS
id                                   
id02310  0.292199  0.062944  0.644857
id24541  0.848385  0.047281  0.104334
id00134  0.238463  0.685973  0.075564
id27757  0.682623  0.211395  0.105982
id04081  0.712984  0.211258  0.075758

New:
              EAP       HPL       MWS
id                                   
id02310  0.236220  0.063602  0.700179
id24541  0.835888  0.037979  0.126133
id00134  0.243195  0.694784  0.062021
id27757  0.702322  0.192928  0.104749
id04081  0.694676  0.218217  0.087106


## Discussion

Looking at the accuracy scores and classification reports, we can see that the
new model performed slightly worse than the old model (0.7801 versus 0.7807
before tuning, and 0.7807 versus 0.7816 after).  The reason for this is unclear,
since I expected the new features to increase precision and recall.

One item of note is that the difference in performance between the old model and
the new model is small: less than 0.1 percentage points of accuracy, or about 6
of the 6,462 test sentences.  Without a statistical test, we do not know whether
the difference is statistically significant.  Consequently, no firm conclusion
can be drawn, and any interpretation of these results without a deeper
statistical analysis would be anecdotal at best.
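
One way to test a difference like this is an exact McNemar test, which looks only at the sentences that one model gets right and the other gets wrong. The sketch below is a swapped-in technique for illustration, not something run in this notebook: `mcnemar_exact` is a hypothetical helper and the six labels are made up.

```python
from math import comb

def mcnemar_exact(y_true, preds_a, preds_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts samples that
    only model A classified correctly and b_only those only B got right.
    """
    triples = list(zip(y_true, preds_a, preds_b))
    a_only = sum(a == t and b != t for t, a, b in triples)
    b_only = sum(a != t and b == t for t, a, b in triples)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree
    k = min(a_only, b_only)
    # Binomial(n, 0.5) probability of a split at least this uneven
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)

# Hypothetical labels and predictions from two models
y_true  = ["EAP", "HPL", "MWS", "EAP", "HPL", "MWS"]
preds_a = ["EAP", "HPL", "MWS", "EAP", "EAP", "MWS"]
preds_b = ["EAP", "EAP", "HPL", "HPL", "HPL", "MWS"]

a_only, b_only, p_value = mcnemar_exact(y_true, preds_a, preds_b)
print(a_only, b_only, p_value)
```

In this notebook the call would be along the lines of `mcnemar_exact(y_test, preds_old, preds)`. A large p-value would mean the observed gap is consistent with chance.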

Yet another dimension to all of this is the difference in performance between
the initial fit and the model with tuned hyperparameters.  The hyperparameter
grid I chose made little difference to the final results (for the new model,
accuracy moved from 0.7801 to 0.7807).  All of this requires further
investigation and analysis.