<h1>Text Classification using Ensemble Model<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Ensemble-Learning" data-toc-modified-id="Ensemble-Learning-1">Ensemble Learning</a></span><ul class="toc-item"><li><span><a href="#What-is-Ensemble-Learning?" data-toc-modified-id="What-is-Ensemble-Learning?-1.1">What is Ensemble Learning?</a></span></li><li><span><a href="#Bagging" data-toc-modified-id="Bagging-1.2">Bagging</a></span></li><li><span><a href="#Boosting" data-toc-modified-id="Boosting-1.3">Boosting</a></span></li><li><span><a href="#Stacking" data-toc-modified-id="Stacking-1.4">Stacking</a></span></li></ul></li><li><span><a href="#Machine-Learning-Project-Lifecycle:-Fourth-Iteration" data-toc-modified-id="Machine-Learning-Project-Lifecycle:-Fourth-Iteration-2">Machine Learning Project Lifecycle: Fourth Iteration</a></span><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-2.1">Problem Statement</a></span></li><li><span><a href="#Training-Data" data-toc-modified-id="Training-Data-2.2">Training Data</a></span></li><li><span><a href="#Preprocessing-+-Feature-Engineering" data-toc-modified-id="Preprocessing-+-Feature-Engineering-2.3">Preprocessing + Feature Engineering</a></span></li><li><span><a href="#Machine-Learning-Algorithm:-Ensemble-Learning" data-toc-modified-id="Machine-Learning-Algorithm:-Ensemble-Learning-2.4">Machine Learning Algorithm: Ensemble Learning</a></span><ul class="toc-item"><li><span><a href="#Naive-Bayes" data-toc-modified-id="Naive-Bayes-2.4.1">Naive Bayes</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-2.4.2">Logistic Regression</a></span></li><li><span><a href="#Ensemble-Model-1:-Bagging-using-Naive-Bayes" data-toc-modified-id="Ensemble-Model-1:-Bagging-using-Naive-Bayes-2.4.3">Ensemble Model 1: Bagging using Naive Bayes</a></span></li><li><span><a href="#Ensemble-Model-2:-Bagging-using-LR" data-toc-modified-id="Ensemble-Model-2:-Bagging-using-LR-2.4.4">Ensemble Model 2: Bagging using LR</a></span></li><li><span><a href="#Ensemble-Model-3:-Boosting-using-Naive-Bayes" data-toc-modified-id="Ensemble-Model-3:-Boosting-using-Naive-Bayes-2.4.5">Ensemble Model 3: Boosting using Naive Bayes</a></span></li><li><span><a href="#Ensemble-Model-4:-Boosting-Using-LR" data-toc-modified-id="Ensemble-Model-4:-Boosting-Using-LR-2.4.6">Ensemble Model 4: Boosting Using LR</a></span></li><li><span><a href="#Ensemble-Model-5:-Basic-Stacking-using-Naive-Bayes-&amp;-LR" data-toc-modified-id="Ensemble-Model-5:-Basic-Stacking-using-Naive-Bayes-&amp;-LR-2.4.7">Ensemble Model 5: Basic Stacking using Naive Bayes &amp; LR</a></span></li><li><span><a href="#Ensemble-Model-6:-Stacking-using-Naive-Bayes-&amp;-LR" data-toc-modified-id="Ensemble-Model-6:-Stacking-using-Naive-Bayes-&amp;-LR-2.4.8">Ensemble Model 6: Stacking using Naive Bayes &amp; LR</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-2.5">Model Evaluation</a></span></li><li><span><a href="#Quality-Metrics" data-toc-modified-id="Quality-Metrics-2.6">Quality Metrics</a></span></li><li><span><a href="#Model-Evaluation-on-Test-Dataset" data-toc-modified-id="Model-Evaluation-on-Test-Dataset-2.7">Model Evaluation on Test Dataset</a></span></li></ul></li><li><span><a href="#Homework" data-toc-modified-id="Homework-3">Homework</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-4">Resources</a></span></li></ul></div>

<img src="../images/classification.png" alt="Classification" style="width: 700px;"/>

## Ensemble Learning

### What is Ensemble Learning?

### Bagging

<img src='../images/bagging_example.png' alt='Overfitting' style="width: 500px;" align="left">

### Boosting

### Stacking

<img src='../images/stacking_example.png' alt='Overfitting' style="width: 700px;" align="left">

## Machine Learning Project Lifecycle: Fourth Iteration

### Problem Statement

Classify the Financial Consumer Complaints into different Product Categories given consumer complaint text.

**Product Categories**

- Credit reporting, repair, or other
- Debt collection
- Student loan
- Money transfer, virtual currency, or money service
- Bank account or service

### Training Data

[Kaggle: Consumer Complaint Database](https://www.kaggle.com/selener/consumer-complaint-database)

In [153]:
import pandas as pd
import numpy as np

In [5]:
complaints_training_dataset = pd.read_csv('../datasets/consumer_complaints_training_dataset.csv')

In [6]:
complaints_training_dataset.head()

Unnamed: 0,Product,Complaint_text
0,"Credit reporting, repair, or other","My name is XXXX XXXX XXXX , not XXXX X..."
1,"Credit reporting, repair, or other",I was shocked when I reviewed my credit report...
2,"Credit reporting, repair, or other",Equifax misused of credit file. Disputing acco...
3,"Credit reporting, repair, or other",I am disturbed that you continue to list the v...
4,"Credit reporting, repair, or other",I went to multiple different credit report web...


In [7]:
complaints_training_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Product         20000 non-null  object
 1   Complaint_text  20000 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


**Q) What is the distribution of complaints for each product type?**

In [8]:
complaints_training_dataset.Product.unique()

array(['Credit reporting, repair, or other', 'Debt collection',
       'Student loan',
       'Money transfer, virtual currency, or money service',
       'Bank account or service'], dtype=object)

In [9]:
complaints_training_dataset\
    .groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

Unnamed: 0_level_0,Count
Product,Unnamed: 1_level_1
Bank account or service,4000
"Credit reporting, repair, or other",4000
Debt collection,4000
"Money transfer, virtual currency, or money service",4000
Student loan,4000


**Q) Find out the Occurances of Duplicate Text messages if any?**

In [10]:
complaints_training_dataset['Complaint_text'].nunique()

19913

In [11]:
duplicate_complaints = complaints_training_dataset['Complaint_text']\
    .value_counts()\
    [complaints_training_dataset['Complaint_text'].value_counts() > 2].index

In [12]:
len(duplicate_complaints)

9

### Preprocessing + Feature Engineering

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

RANDOM_STATE = 19

In [14]:
count_vectorizer = CountVectorizer(stop_words='english', max_features=5000)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    complaints_training_dataset['Complaint_text'],
    complaints_training_dataset['Product'],
    test_size=.2,
    stratify=complaints_training_dataset['Product'],
    random_state=RANDOM_STATE)

In [16]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((16000,), (4000,), (16000,), (4000,))

In [17]:
X_train_count_vectorizer = count_vectorizer.fit_transform(X_train)
X_test_count_vectorizer = count_vectorizer.transform(X_test)

In [18]:
len(count_vectorizer.get_feature_names())

5000

In [19]:
count_vectorizer.get_feature_names()[:10]

['00', '000', '10', '100', '1000', '10000', '100000', '1005', '11', '110']

In [20]:
list(count_vectorizer.vocabulary_.items())[:10]

[('xxxx', 4976),
 ('account', 322),
 ('listed', 2727),
 ('credit', 1279),
 ('report', 3828),
 ('experian', 1842),
 ('paid', 3234),
 ('closed', 1021),
 ('2007', 63),
 ('like', 2712)]

In [21]:
X_train_count_vectorizer.shape, X_test_count_vectorizer.shape

((16000, 5000), (4000, 5000))

### Machine Learning Algorithm: Ensemble Learning

#### Naive Bayes

In [137]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

naive_bayes = MultinomialNB()

naive_bayes.fit(X_train_count_vectorizer, y_train)

naive_bayes_predictions = naive_bayes.predict(X_test_count_vectorizer)

naive_bayes_accuracy_score = accuracy_score(y_test, naive_bayes_predictions)

naive_bayes_accuracy_score

0.859

#### Logistic Regression

In [141]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=501, random_state=RANDOM_STATE)

lr.fit(X_train_count_vectorizer, y_train)

lr_predictions = lr.predict(X_test_count_vectorizer)

lr_accuracy_score = accuracy_score(y_test, lr_predictions)
lr_accuracy_score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.84075

In [139]:
print(naive_bayes_accuracy_score * 100, lr_accuracy_score * 100)

85.9 84.075


In [143]:
accuracy_score(y_test, ensembled_predictions) * 100

86.75

#### Ensemble Model 1: Bagging using Naive Bayes

In [234]:
from collections import Counter

count_vectorizer_with_all_features = CountVectorizer(stop_words='english')

X_train_count_vect_with_all_features = count_vectorizer_with_all_features.fit_transform(
    X_train)
X_test_count_vect_with_all_features = count_vectorizer_with_all_features.transform(
    X_test)

print(X_train_count_vect_with_all_features.shape)

bagging_naive_bayes_models = []
model_counts = 5
features_size = 5000
bagging_nb_training_set_accuracies = []

for index in range(model_counts):
    features_index = np.random.choice(X_train_count_vect_with_all_features.shape[0],
                                     features_size,
                                     replace=True)
    features = X_train_count_vect_with_all_features[:, features_index]
    nb_model = MultinomialNB()
    nb_model.fit(features, y_train)
    training_set_accuracy = accuracy_score(y_train, nb_model.predict(features))
    print(f'Model {index} training set accuracy: {training_set_accuracy}')
    bagging_nb_training_set_accuracies.append(training_set_accuracy)
    bagging_naive_bayes_models.append([nb_model, features_index])
    
bagging_nb_predictions = []
for test_row in X_test_count_vect_with_all_features:
    local_predictions = []
    for index in range(model_counts):
        test_features = test_row[0, bagging_naive_bayes_models[index][1]]
        local_predictions.append(bagging_naive_bayes_models[index][0].predict(test_features)[0])
    bagging_nb_predictions.append(Counter(local_predictions).most_common()[0][0])

bagging_nb_test_set_accuracy = accuracy_score(y_test, bagging_nb_predictions)
bagging_nb_mean_training_set_accuracy = sum(bagging_nb_training_set_accuracies) / len(bagging_nb_training_set_accuracies)
print(f'Mean Accuracy of individual models is {bagging_nb_mean_training_set_accuracy}')
print(f'Bagging Model Accuracy is {bagging_nb_test_set_accuracy}')

(16000, 24253)
Model 0 training set accuracy: 0.7374375
Model 1 training set accuracy: 0.73825
Model 2 training set accuracy: 0.7318125
Model 3 training set accuracy: 0.7554375
Model 4 training set accuracy: 0.72125
Mean Accuracy of individual models is 0.7368374999999999
Bagging Model Accuracy is 0.78175


#### Ensemble Model 2: Bagging using LR

In [235]:
from collections import Counter

count_vectorizer_with_all_features = CountVectorizer(stop_words='english')

X_train_count_vect_with_all_features = count_vectorizer_with_all_features.fit_transform(
    X_train)
X_test_count_vect_with_all_features = count_vectorizer_with_all_features.transform(
    X_test)

bagging_lr_models = []
model_counts = 5
features_size = 5000
bagging_lr_training_set_accuracies = []
bagging_lr_predictions = []

for index in range(model_counts):
    features_index = np.random.choice(X_train_count_vect_with_all_features.shape[0],
                                     features_size,
                                     replace=True)
    features = X_train_count_vect_with_all_features[:, features_index]
    lr_model = LogisticRegression(penalty='l2', max_iter=501, random_state=RANDOM_STATE)
    lr_model.fit(features, y_train)
    training_set_accuracy = accuracy_score(y_train, lr_model.predict(features))
    print(f'Model {index} training set accuracy: {training_set_accuracy}')
    bagging_lr_training_set_accuracies.append(training_set_accuracy)
    bagging_lr_models.append([lr_model, features_index])
    
for test_row in X_test_count_vect_with_all_features:
    local_predictions = []
    for index in range(model_counts):
        test_features = test_row[0, bagging_lr_models[index][1]]
        local_predictions.append(bagging_lr_models[index][0].predict(test_features)[0])
    bagging_lr_predictions.append(Counter(local_predictions).most_common()[0][0])

print(f'Mean Accuracy of individual models is {sum(bagging_lr_training_set_accuracies) / len(bagging_lr_training_set_accuracies)}')
print(f'Bagging Model Accuracy is {accuracy_score(y_test, bagging_lr_predictions)}')

Model 0 training set accuracy: 0.806
Model 1 training set accuracy: 0.8079375


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model 2 training set accuracy: 0.7999375
Model 3 training set accuracy: 0.806
Model 4 training set accuracy: 0.8215
Mean Accuracy of individual models is 0.8082750000000001
Bagging Model Accuracy is 0.78075


In [252]:
from collections import Counter

bagging_lr_models2 = []
model_counts = 5
max_rows = 5000
bagging_lr_training_set_accuracies2 = []
bagging_lr_predictions2 = []

for index in range(model_counts):
    rows_index = np.random.choice(X_train.shape[0],
                                  max_rows,
                                  replace=True)
    mini_count_vect = CountVectorizer(stop_words='english', max_features=5000)
    mini_X_train = mini_count_vect.fit_transform(X_train.iloc[rows_index])
    mini_y_train = y_train.iloc[rows_index]
    print(mini_X_train.shape, mini_y_train.shape)
    
    lr_model = LogisticRegression(penalty='l2', max_iter=501, random_state=RANDOM_STATE)
    lr_model.fit(mini_X_train, mini_y_train)
    training_set_accuracy = accuracy_score(mini_y_train, lr_model.predict(mini_X_train))
    print(f'Model {index} training set accuracy: {training_set_accuracy}')
    bagging_lr_training_set_accuracies2.append(training_set_accuracy)
    bagging_lr_models2.append([lr_model, mini_count_vect])
    
for test_row in X_test:
    local_predictions = []
    for index in range(model_counts):
        test_features = bagging_lr_models2[index][1].transform([test_row])
        local_predictions.append(bagging_lr_models2[index][0].predict(test_features)[0])
    bagging_lr_predictions2.append(Counter(local_predictions).most_common()[0][0])

print(f'Mean Accuracy of individual models is {sum(bagging_lr_training_set_accuracies2) / len(bagging_lr_training_set_accuracies2)}')
print(f'Bagging Model Accuracy is {accuracy_score(y_test, bagging_lr_predictions2)}')

(5000, 5000) (5000,)
Model 0 training set accuracy: 0.998
(5000, 5000) (5000,)
Model 1 training set accuracy: 0.9976
(5000, 5000) (5000,)
Model 2 training set accuracy: 0.9992
(5000, 5000) (5000,)
Model 3 training set accuracy: 0.9982
(5000, 5000) (5000,)
Model 4 training set accuracy: 0.997
Mean Accuracy of individual models is 0.998
Bagging Model Accuracy is 0.8495


#### Ensemble Model 3: Boosting using Naive Bayes

In [204]:
from sklearn.ensemble import AdaBoostClassifier

In [218]:
adaboost = AdaBoostClassifier(MultinomialNB(), n_estimators=500,
                              random_state=RANDOM_STATE)

adaboost.fit(X_train_count_vectorizer, y_train)

predictions = adaboost.predict(X_test_count_vectorizer)

accuracy_score(y_test, predictions)

0.396

#### Ensemble Model 4: Boosting Using LR

In [220]:
adaboost = AdaBoostClassifier(LogisticRegression(max_iter=500), n_estimators=10,
                              random_state=RANDOM_STATE)

adaboost.fit(X_train_count_vectorizer, y_train)

predictions = adaboost.predict(X_test_count_vectorizer)

accuracy_score(y_test, predictions)

0.80975

#### Ensemble Model 5: Basic Stacking using Naive Bayes & LR

- Make prediction using Naive Bayes & LR.
- Final output will be *max(NB prediction, LR prediction)*

In [142]:
ensembled_predictions = []
same_predictions, same_and_correct_predictions,\
    mismatch_naive_bayes_prediction, mismatch_lr_prediction= 0, 0, 0, 0
for i, input_row in enumerate(X_test_count_vectorizer):
    naive_bayes_prediction = naive_bayes.predict(input_row)
    lr_prediction = lr.predict(input_row)
    naive_bayes_max_prob = naive_bayes.predict_proba(input_row).max()
    lr_max_prob = lr.predict_proba(input_row).max()
    if naive_bayes_prediction == lr_prediction:
        ensembled_predictions.append(naive_bayes_prediction)
        same_predictions += 1
        if naive_bayes_prediction == y_test_list[i]:
            same_and_correct_predictions += 1
    elif naive_bayes_max_prob > lr_max_prob:
        ensembled_predictions.append(naive_bayes_prediction)
        mismatch_naive_bayes_prediction += 1
    elif lr_max_prob > naive_bayes_max_prob:
        ensembled_predictions.append(lr_prediction)
        mismatch_lr_prediction += 1

#### Ensemble Model 6: Stacking using Naive Bayes & LR

### Model Evaluation

In [267]:
from sklearn.model_selection import cross_val_score

### Quality Metrics

In [270]:
from sklearn.metrics import (accuracy_score,
                             confusion_matrix)

import seaborn as sns;
sns.set()

import matplotlib.pyplot as plt
%matplotlib inline

### Model Evaluation on Test Dataset

- Note: Retrain the model using full training [dataset](../datasets/consumer_complaints_training_dataset.csv) & test using the test [dataset](../datasets/consumer_complaints_test_dataset.csv).

## Homework

## Resources

- [Model Ensembles](https://www.youtube.com/watch?v=ZeAv5k71AS4)
- [Bagging](https://www.youtube.com/watch?v=1zSkR2xFWKg&t=9s)
- [Boosting](https://www.youtube.com/watch?v=ZqbPS7TvhqM)