# Model Comparison

Having taken a preliminary look at our data, we now work toward building a pipeline, comparing some models, and performing hyperparameter tuning on the best model.

## Importing preprocessed data

We read in the preprocessed data using `pandas`.

In [4]:
import pandas as pd
import numpy as np

# Dataset URL:
# https://www.kaggle.com/datasets/athu1105/book-genre-prediction

# Read the data into dataframe
df = pd.read_csv('../data/book_genre_dataset.csv')

We create a new column in our dataframe that is the concatenation of the book title and summary. This combined text is what we will treat as the primary input text for the models.

In [5]:
df['combined'] = df['title'] + '. ' + df['summary']

## Preprocess combined text

Since the combined text has not been preprocessed, we recreate part of the preprocessing pipeline from the exploratory data analysis here.

### Feature selector

In [6]:
from sklearn.base import TransformerMixin, BaseEstimator

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.loc[:,self.variables]
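
As a quick illustration of how this selector behaves, the sketch below repeats the class so that it runs on its own and applies it to a made-up two-row dataframe (the `toy` data is an assumption for illustration, not part of the dataset). Passing a single column name yields a `Series`, while passing a list of names yields a `DataFrame`; the text transformers in this notebook rely on the former.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.loc[:, self.variables]

# Toy dataframe; the values are made up for illustration
toy = pd.DataFrame({
    "title": ["Dune", "Emma"],
    "summary": ["Sand.", "Matchmaking."],
})

as_series = FeatureSelector("title").fit_transform(toy)
as_frame = FeatureSelector(["title", "summary"]).fit_transform(toy)

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```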

### Language detection

In [7]:
import langdetect

class LangDetection(BaseEstimator, TransformerMixin):
    def __init__(self, lang='en'):
        self.lang = lang
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # X is a Series of texts; detect the language of each entry
        langs = X.apply(langdetect.detect)
        # Keep only the entries whose detected language matches self.lang
        return X[langs == self.lang]

### Lowercasing

In [8]:
class LowercaseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_lower = X.apply(lambda x: x.lower())
        return X_lower

### Removing punctuation

In [9]:
import re

class RemovePunctuation(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_no_punct = X.apply(lambda x: re.sub(r'[^\w\s]|_', '', x))
        return X_no_punct
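
To make the effect of the regular expression concrete, here is a small standalone sketch on a made-up sample string. The pattern removes every character that is neither a word character nor whitespace, and additionally removes underscores, which `\w` would otherwise keep.

```python
import re

# Same pattern as in RemovePunctuation
pattern = r'[^\w\s]|_'

# Made-up sample string for illustration
sample = "Hello, world! snake_case it's"
cleaned = re.sub(pattern, '', sample)

print(cleaned)  # Hello world snakecase its
```

Note that removing apostrophes merges contractions ("it's" becomes "its"), which may matter for the later stopword removal.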

### Dropping data entries

In [10]:
class DropDataEntries(BaseEstimator, TransformerMixin):
    def __init__(self, ids):
        self.ids = ids
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # drop() returns a new object, so the input is left untouched
        return X.drop(self.ids)

## Clean text pipeline

We create a pipeline out of these custom transformers in order to clean up the combined text.

In [11]:
from sklearn.pipeline import Pipeline

nonsense_summaries_idx = [338, 588, 834, 1574, 1772, 2410, 2485]

clean_text_pipeline = Pipeline([
    ('get_combined_text', FeatureSelector('combined')),
    ('detect_lang', LangDetection()),
    ('lowercase',    LowercaseTransformer()),
    ('remove_punctuation', RemovePunctuation()),
    ('drop_nonsense_summaries', DropDataEntries(nonsense_summaries_idx))
])

In [12]:
clean_text_pipeline

We now use the `fit_transform()` method to apply the cleaning to the combined text column of our data.

In [13]:
X_combined_clean = clean_text_pipeline.fit_transform(df)

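We intend to keep only the rows that survived the cleaning of the combined text, and we create our new input data for the model, consisting of the title, summary, and combined text. Note that `DataFrame.drop` returns a new dataframe rather than modifying `df` in place; since its result is not assigned in the cell below, `df` is left unchanged and the input data contains all 4,657 entries.

In [15]:
# NOTE: the result of drop() is discarded, so df is not filtered here;
# assigning it back (df = df.drop(...)) would apply the filter
df.drop([i for i in range(len(df)) if i not in X_combined_clean])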
X = df[['title','summary','combined']]
y = df['genre']
X.shape, y.shape

((4657, 3), (4657,))

## Further preprocessing

To finish the second part of our preprocessing, we once again reuse part of the preprocessing from the exploratory data analysis.

### Removing stopwords

In [18]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

class RemoveStopwords(BaseEstimator, TransformerMixin):
    def __init__(self, stopwords):
        self.stopwords = stopwords
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_no_stopwords = X.apply(lambda x: ' '.join([w for w in x.split() if w not in self.stopwords]))
        return X_no_stopwords

[nltk_data] Downloading package stopwords to /home/mia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Lemmatizer

In [19]:
from nltk.stem import WordNetLemmatizer

class Lemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, lemmatizer):
        self.lemmatizer = lemmatizer
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        lemmatize_func = lambda x: ' '.join(
            [self.lemmatizer.lemmatize(w) for w in x.split()]
        )
        X_lemmatized = X.apply(lemmatize_func)
        return X_lemmatized

### Removing short words

In [20]:
class RemoveShortWords(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))

## Putting preprocessing pipeline together

We create a preprocessing pipeline containing the `RemoveStopwords`, `Lemmatizer`, and `RemoveShortWords` transformers defined above. In addition, we include a `TfidfVectorizer` and `SelectKBest`.

In order to convert the raw text in our data into a format that our models can use as input, we need to vectorize the text. This means converting the text into numerical vectors.

The method we'll use for this is TF-IDF (term frequency-inverse document frequency). For a given term (i.e. word), the TF-IDF score increases with the frequency of the term in a given document (i.e. a particular summary), and decreases with the number of documents that contain the term. In other words, the more frequent a word is in a given summary the more important it is, and the more summaries it appears in, the less important it is.
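
The following is a minimal sketch of this behaviour, using three made-up "summaries" rather than the real data. It relies on the default settings of `TfidfVectorizer` (smoothed IDF and L2 normalisation); the exact scores depend on those settings, so only the relative ordering is checked here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
docs = [
    "dragon wizard castle",
    "detective murder castle",
    "wizard wizard spell",
]

vectorizer = TfidfVectorizer()
dense = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # maps each term to its column index

# "dragon" occurs in one document, "castle" in two, so within the
# first document "dragon" receives the higher score
print(dense[0, vocab["dragon"]] > dense[0, vocab["castle"]])  # True

# "wizard" occurs twice in the third document, "spell" only once
print(dense[2, vocab["wizard"]] > dense[2, vocab["spell"]])   # True
```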

Since the TF-IDF vectorizer creates a very large matrix (the number of features will equal the size of the vocabulary, i.e. all the unique words across all the input text), we use the `SelectKBest` transformer provided by `sklearn`. With its default scoring function (`f_classif`), `SelectKBest` calculates ANOVA F-values between each feature and the labels, and keeps the `k` features with the highest scores, i.e. those most predictive of the correct labels. This provides a way to pare down the size of our vectorized input data. For now, we select the 5,000 most predictive features. Later, we'll tune our model with `k` as a hyperparameter.
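
As a rough illustration of the selection step, the sketch below builds a synthetic dataset (all values are generated, not taken from the book data) in which only one of four columns carries information about the label, and checks that `SelectKBest` with `f_classif` keeps that column.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic binary labels: 20 samples per class
y_toy = np.array([0] * 20 + [1] * 20)

# One column that tracks the label closely, plus three pure-noise columns
informative = y_toy + rng.normal(scale=0.1, size=40)
noise = rng.normal(size=(40, 3))
X_toy = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X_toy, y_toy)

print(selector.get_support(indices=True))  # index of the kept column
print(X_selected.shape)                    # (40, 1)
```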

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest

preprocess_text_pipeline = Pipeline([
    ('remove_stopwords', RemoveStopwords(stop_words)),
    ('lemmatize', Lemmatizer(WordNetLemmatizer())),
    ('remove_short_words', RemoveShortWords()),
    ('tfidf', TfidfVectorizer()),
    ('select_k_best', SelectKBest(k=5000))
])

To finalize our main text pipeline, we use the `FeatureSelector` to extract the `combined` column of our dataset, and apply the `preprocess_text_pipeline` to the combined column.

In [24]:
text_pipeline = Pipeline([
    ('select_combined', FeatureSelector('combined')),
    ('preprocess', preprocess_text_pipeline),
])

## Feature engineering

In order to augment our input data, we'll define some hand-crafted features that contain potentially predictive information explicitly.

To this end, we define a new custom transformer that creates new data columns containing each of the following features for both the title and the summary:

- **char_count:** number of characters in the title/summary
- **word_count:** number of words in the title/summary
- **avg_word_len:** average number of characters per word in the title/summary
- **stopword_count:** the number of stopwords in the title/summary

In [25]:
def stopword_count(text):
    stopwords_in_text = [w for w in text.split() if w in stop_words]
    return len(stopwords_in_text)

In [26]:
class TitleSummaryFE(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_ = X.copy()
        feats = pd.DataFrame()
        feats['char_count'] = X_.apply(lambda x: len(x))
        feats['word_count'] = X_.apply(lambda x: len(x.split()))
        feats['avg_word_len'] = feats['char_count'] / feats['word_count']
        feats['stopword_count'] = X_.apply(stopword_count)
        return feats
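
To see what these features look like, here is a small worked sketch on two made-up titles. It uses a hypothetical three-word stopword set (`toy_stop_words`) in place of the NLTK list so that it runs without any downloads; the feature definitions otherwise mirror the transformer above.

```python
import pandas as pd

# Hypothetical mini stopword list, standing in for the NLTK list
toy_stop_words = {"the", "of", "a"}

# Made-up titles for illustration
titles = pd.Series(["The Lord of the Rings", "Dune"])

feats = pd.DataFrame()
feats["char_count"] = titles.apply(len)
feats["word_count"] = titles.apply(lambda t: len(t.split()))
feats["avg_word_len"] = feats["char_count"] / feats["word_count"]
feats["stopword_count"] = titles.apply(
    lambda t: sum(w in toy_stop_words for w in t.split())
)

print(feats)
```

Two details are worth noticing. The character count includes spaces, so `avg_word_len` is slightly larger than the true mean word length (21 / 5 = 4.2 for the first title). The stopword match is also case-sensitive: the capitalised "The" is not counted, so the first title has a `stopword_count` of 2.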

As we did for preprocessing the combined text, we create two new pipelines that extract the title and summary, respectively, and create the new features for each.

In [27]:
title_features = Pipeline([
    ('select_title', FeatureSelector('title')),
    ('title_features', TitleSummaryFE()),
])

In [28]:
summary_features = Pipeline([
    ('select_summary', FeatureSelector('summary')),
    ('summary_features', TitleSummaryFE()),
])

Finally, we create a transformer that generates the following features from the combined text:

- **unique_word_count:** the total number of unique words in the combined text
- **unique_word_ratio:** the proportion of words in the combined text that are unique
- **sentiment_score:** the compound sentiment analysis score as calculated by the `SentimentIntensityAnalyzer` provided by `nltk.sentiment`

In [29]:
from nltk.sentiment import SentimentIntensityAnalyzer

class CombinedFE(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        feats = pd.DataFrame()
        sia = SentimentIntensityAnalyzer()
        X_ = X.copy()
        feats['unique_word_count'] = X_.apply(lambda x: len(set(x.split())))
        feats['unique_word_ratio'] = feats['unique_word_count'] / X_.apply(lambda x: len(x.split()))
        feats['sentiment_score'] = X_.apply(lambda x: sia.polarity_scores(x)['compound'])
        return feats

As before, we create a new pipeline that extracts the combined text column of the data, and generates the relevant features.

In [30]:
combined_features = Pipeline([
    ('select_combined', FeatureSelector('combined')),
    ('combined_features', CombinedFE()),
])

## Putting it all together and comparing models

In order to finalize our input data for the model, we use the `FeatureUnion` provided by `sklearn`. A `FeatureUnion` is assembled like a `Pipeline`, but its transformers are applied in parallel rather than in series: each one receives the same input, and their outputs are concatenated column-wise.

We therefore define `features`, which will apply the `text_pipeline`, `title_features`, `summary_features`, and `combined_features` transformers in parallel. The output of this transformer will be a single feature matrix (sparse, since the TF-IDF output is sparse) that contains the processed and vectorized combined text, as well as all of the hand-crafted title, summary, and combined features.

In [31]:
from sklearn.pipeline import FeatureUnion

features = FeatureUnion([
    ('text', text_pipeline),
    ('title', title_features),
    ('summary', summary_features),
    ('combined', combined_features)
])
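
The column-wise concatenation performed by a `FeatureUnion` can be seen in a minimal sketch with two toy transformers. The `FunctionTransformer` steps and the `X_toy` array are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Made-up input: 2 samples, 2 columns
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])

union = FeatureUnion([
    # Produces 2 columns
    ("doubled", FunctionTransformer(lambda X: X * 2)),
    # Produces 1 column
    ("row_sum", FunctionTransformer(lambda X: X.sum(axis=1, keepdims=True))),
])

out = union.fit_transform(X_toy)
print(out.shape)  # (2, 3)
print(out)
```

Both transformers see the same `X_toy`, and the result has 2 + 1 = 3 columns, one block per transformer in the order they were listed.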

In order to be able to run multiple experiments with our models, we define `run_experiment()`, which takes the following input:
- **X:** input dataframe
- **y:** gold labels
- **pipeline:** a scikit-learn `Pipeline` object that implements a model we wish to train and evaluate
- **num_expts:** (default is 5) the number of experiments to run

We start by using `train_test_split` to set aside 20% of the data for evaluation, using the remaining 80% to train our model. We then fit the model on our training data, predict labels using the trained model, and calculate the accuracy score of our model. This is repeated `num_expts` times, with a new random split each time, and the accuracy scores are averaged.

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def run_experiment(X, y, pipeline, num_expts=5):
    scores = list()
    for i in range(num_expts):
        X_train, X_test, y_train, y_true = train_test_split(X, y, test_size=0.2)
        model = pipeline.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        score = accuracy_score(y_true, y_pred)
        scores.append(score)
    
    print(f"Average accuracy over {num_expts} experiments: {sum(scores) / num_expts} \n")
    print("Classification report for the last experiment:\n")
    print(classification_report(y_true, y_pred))
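
The same repeated random-split evaluation can also be expressed with scikit-learn's built-in tools. The sketch below is an alternative formulation, not the approach used in this notebook: it uses synthetic data from `make_classification`, a fixed `random_state` for reproducibility, and `StratifiedShuffleSplit`, which (unlike the plain `train_test_split` call above) preserves the class proportions in each split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Five independent 80/20 splits, stratified by class
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = cross_val_score(
    LinearSVC(dual=False), X_toy, y_toy, cv=splitter, scoring="accuracy"
)

print(len(scores))    # 5
print(scores.mean())  # average accuracy over the five splits
```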

Since we want to compare multiple models, we define a helper function `model_pipeline()` which takes a model object (e.g. `LinearSVC()`) and creates a pipeline that combines the feature generation with the input model.

In [43]:
def model_pipeline(model):
    pipeline = Pipeline([
        ('features', features),
        ('model', model)
    ])
    return pipeline

For this project, we compare a linear support vector classifier (`LinearSVC`), a k-nearest neighbors classifier (`KNeighborsClassifier`), and a decision tree classifier (`DecisionTreeClassifier`).

We then run five experiments with each of these models.

### Linear Support Vector

In [44]:
from sklearn.svm import LinearSVC

svc = LinearSVC(dual=False)
run_experiment(X, y, model_pipeline(svc), num_expts=5)

Average accuracy over 5 experiments: 0.6622317596566523 

Classification report for the last experiment:

              precision    recall  f1-score   support

       crime       0.74      0.70      0.72        99
     fantasy       0.65      0.78      0.71       183
     history       0.66      0.70      0.68       127
      horror       0.61      0.55      0.58       121
  psychology       0.94      0.62      0.75        24
     romance       0.50      0.36      0.42        14
     science       0.68      0.65      0.66       130
      sports       0.64      0.64      0.64        14
    thriller       0.61      0.61      0.61       203
      travel       0.69      0.53      0.60        17

    accuracy                           0.66       932
   macro avg       0.67      0.61      0.64       932
weighted avg       0.66      0.66      0.65       932



### k-Nearest Neighbors

In [45]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
run_experiment(X, y, model_pipeline(knn), num_expts=5)

Average accuracy over 5 experiments: 0.21716738197424892 

Classification report for the last experiment:

              precision    recall  f1-score   support

       crime       0.13      0.22      0.17        94
     fantasy       0.25      0.42      0.31       163
     history       0.20      0.17      0.18       121
      horror       0.20      0.13      0.16       126
  psychology       0.22      0.19      0.21        21
     romance       0.00      0.00      0.00        16
     science       0.23      0.13      0.17       139
      sports       0.29      0.06      0.11        31
    thriller       0.32      0.32      0.32       199
      travel       0.17      0.05      0.07        22

    accuracy                           0.23       932
   macro avg       0.20      0.17      0.17       932
weighted avg       0.23      0.23      0.22       932



### Decision Tree

In [46]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
run_experiment(X, y, model_pipeline(dtree), num_expts=5)

Average accuracy over 5 experiments: 0.453862660944206 

Classification report for the last experiment:

              precision    recall  f1-score   support

       crime       0.55      0.49      0.52       117
     fantasy       0.48      0.51      0.49       166
     history       0.45      0.48      0.47       115
      horror       0.32      0.30      0.31       117
  psychology       0.31      0.29      0.30        14
     romance       0.23      0.24      0.23        21
     science       0.39      0.42      0.40       129
      sports       0.70      0.58      0.64        24
    thriller       0.51      0.51      0.51       211
      travel       0.42      0.28      0.33        18

    accuracy                           0.45       932
   macro avg       0.43      0.41      0.42       932
weighted avg       0.45      0.45      0.45       932



## Hyperparameter tuning

As the model comparison above shows, `LinearSVC` was the best-performing model by a wide margin: its average accuracy over five experiments was about 0.66, compared with about 0.45 for `DecisionTreeClassifier` and about 0.22 for `KNeighborsClassifier`.

In order to perform hyperparameter tuning on the `LinearSVC` model, we first define a dict with all the parameters we want to consider in our search grid.

In [47]:
params = {
    # tf-idf params
    "features__text__preprocess__tfidf__max_df" : [0.6, 0.8, 1.0],
    "features__text__preprocess__tfidf__min_df" : [1, 3, 5],
    "features__text__preprocess__tfidf__ngram_range" : [(1, 1), (1, 2)],
    "features__text__preprocess__tfidf__norm" : ["l1", "l2"],    
    # select k-best params
    "features__text__preprocess__select_k_best__k" : [100,1000,5000],
    # svc model params
    "model__C" : [0.1, 1],
}
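
The long parameter names follow scikit-learn's convention of joining nested step names with double underscores. The sketch below shows this on a simplified, hypothetical pipeline (`toy_pipeline`, with fewer nesting levels than the real one); `get_params()` lists every name that can be used as a key in a search grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Simplified stand-in for the notebook's pipeline
toy_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", Pipeline([("tfidf", TfidfVectorizer())])),
    ])),
    ("model", LinearSVC(dual=False)),
])

param_names = toy_pipeline.get_params().keys()
print("features__text__tfidf__max_df" in param_names)  # True
print("model__C" in param_names)                       # True

# The same names can be used to set values directly
toy_pipeline.set_params(features__text__tfidf__max_df=0.8, model__C=0.1)
print(toy_pipeline.get_params()["model__C"])  # 0.1
```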

We then define a `GridSearchCV` object with the full pipeline (feature generation plus `LinearSVC`) as the estimator and the dict of parameters from above.

In [48]:
from sklearn.model_selection import GridSearchCV

svc = LinearSVC(dual=False)

grid = GridSearchCV(
    estimator = model_pipeline(svc),
    param_grid = params,
    cv = 3,
    verbose = 1
)

Before performing the grid search, we do a train-test split of the data. Although the grid search performs its own train-validation splits during cross-validation, the best parameters are chosen using those validation folds, so the cross-validation score is an optimistic estimate. We therefore withhold 10% of the data from the grid search entirely, and use it afterwards to get a more representative estimate of the accuracy we can expect from the optimized model on unseen data.

In [49]:
X_train, X_test, y_train, y_true = train_test_split(X, y, test_size=0.1)

Run the grid search.

**NOTE:** this takes a very long time to run! The runtime depends on your computer, the number of parameter combinations, and the number of cross-validation folds. As configured here, it takes approximately 4 hours.

In [50]:
grid.fit(X_train, y_train)

Fitting 3 folds for each of 216 candidates, totalling 648 fits
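
The number of candidates reported above is simply the product of the number of values per parameter. As a quick check, the sketch below counts the combinations with `ParameterGrid`, using shortened key names (an assumption for readability; only the number of values per key matters for the count).

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as in `params`, with shortened keys
toy_params = {
    "max_df": [0.6, 0.8, 1.0],
    "min_df": [1, 3, 5],
    "ngram_range": [(1, 1), (1, 2)],
    "norm": ["l1", "l2"],
    "k": [100, 1000, 5000],
    "C": [0.1, 1],
}

n_candidates = len(ParameterGrid(toy_params))  # 3 * 3 * 2 * 2 * 3 * 2
n_fits = n_candidates * 3                      # 3 cross-validation folds

print(n_candidates, n_fits)  # 216 648
```

Adding one more value to any list, or one more fold, multiplies the total accordingly, which is why the runtime grows so quickly.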


After running the grid search, we can get the best-performing combination of parameters from the `grid` object.

In [54]:
grid.best_params_

{'features__text__preprocess__select_k_best__k': 5000,
 'features__text__preprocess__tfidf__max_df': 0.6,
 'features__text__preprocess__tfidf__min_df': 3,
 'features__text__preprocess__tfidf__ngram_range': (1, 1),
 'features__text__preprocess__tfidf__norm': 'l2',
 'model__C': 1}

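Finally, we use the 10% of the data that we set aside before performing the grid search to evaluate the accuracy of the optimized model. The tuned model reaches an accuracy of about 0.689 on this held-out set, roughly 2.7 percentage points above the 0.662 average of the untuned `LinearSVC`, which is not bad. Since the two figures come from different random splits, the comparison is only approximate.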

In [56]:
y_predicted = grid.predict(X_test)
accuracy = accuracy_score(y_true, y_predicted)
print(f"Accuracy: {accuracy}")
print(classification_report(y_true, y_predicted))

Accuracy: 0.6888412017167382
              precision    recall  f1-score   support

       crime       0.61      0.57      0.59        53
     fantasy       0.74      0.75      0.74        93
     history       0.61      0.81      0.70        57
      horror       0.73      0.60      0.66        50
  psychology       1.00      0.50      0.67         8
     romance       0.33      0.33      0.33         9
     science       0.77      0.75      0.76        67
      sports       0.78      0.70      0.74        10
    thriller       0.65      0.69      0.67       102
      travel       1.00      0.65      0.79        17

    accuracy                           0.69       466
   macro avg       0.72      0.63      0.66       466
weighted avg       0.70      0.69      0.69       466

