<h1>Text Classification using Ensemble Model<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Ensemble-Learning" data-toc-modified-id="Ensemble-Learning-1">Ensemble Learning</a></span></li><li><span><a href="#Machine-Learning-Project-Lifecycle:-Fourth-Iteration" data-toc-modified-id="Machine-Learning-Project-Lifecycle:-Fourth-Iteration-2">Machine Learning Project Lifecycle: Fourth Iteration</a></span><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-2.1">Problem Statement</a></span></li><li><span><a href="#Training-Data" data-toc-modified-id="Training-Data-2.2">Training Data</a></span></li><li><span><a href="#Preprocessing-+-Feature-Engineering" data-toc-modified-id="Preprocessing-+-Feature-Engineering-2.3">Preprocessing + Feature Engineering</a></span></li><li><span><a href="#Machine-Learning-Algorithm:-Ensemble-Learning" data-toc-modified-id="Machine-Learning-Algorithm:-Ensemble-Learning-2.4">Machine Learning Algorithm: Ensemble Learning</a></span><ul class="toc-item"><li><span><a href="#Naive-Bayes" data-toc-modified-id="Naive-Bayes-2.4.1">Naive Bayes</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-2.4.2">Logistic Regression</a></span></li><li><span><a href="#Ensemble-Model" data-toc-modified-id="Ensemble-Model-2.4.3">Ensemble Model</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-2.5">Model Evaluation</a></span></li><li><span><a href="#Quality-Metrics" data-toc-modified-id="Quality-Metrics-2.6">Quality Metrics</a></span></li><li><span><a href="#Model-Evaluation-on-Test-Dataset" data-toc-modified-id="Model-Evaluation-on-Test-Dataset-2.7">Model Evaluation on Test Dataset</a></span></li></ul></li><li><span><a href="#Homework" data-toc-modified-id="Homework-3">Homework</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-4">Resources</a></span></li></ul></div>

<img src="../images/classification.png" alt="Classification" style="width: 700px;"/>

## Ensemble Learning

## Machine Learning Project Lifecycle: Fourth Iteration

### Problem Statement

Classify financial consumer complaints into product categories based on the complaint text.

**Product Categories**

- Credit reporting, repair, or other
- Debt collection
- Student loan
- Money transfer, virtual currency, or money service
- Bank account or service

### Training Data

[Kaggle: Consumer Complaint Database](https://www.kaggle.com/selener/consumer-complaint-database)

In [4]:
import pandas as pd

In [5]:
complaints_training_dataset = pd.read_csv('../datasets/consumer_complaints_training_dataset.csv')

In [6]:
complaints_training_dataset.head()

| | Product | Complaint_text |
|---|---|---|
| 0 | Credit reporting, repair, or other | My name is XXXX XXXX XXXX , not XXXX X... |
| 1 | Credit reporting, repair, or other | I was shocked when I reviewed my credit report... |
| 2 | Credit reporting, repair, or other | Equifax misused of credit file. Disputing acco... |
| 3 | Credit reporting, repair, or other | I am disturbed that you continue to list the v... |
| 4 | Credit reporting, repair, or other | I went to multiple different credit report web... |


In [7]:
complaints_training_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Product         20000 non-null  object
 1   Complaint_text  20000 non-null  object
dtypes: object(2)
memory usage: 312.6+ KB


**Q) What is the distribution of complaints for each product type?**

In [8]:
complaints_training_dataset.Product.unique()

array(['Credit reporting, repair, or other', 'Debt collection',
       'Student loan',
       'Money transfer, virtual currency, or money service',
       'Bank account or service'], dtype=object)

In [9]:
complaints_training_dataset\
    .groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

| Product | Count |
|---|---|
| Bank account or service | 4000 |
| Credit reporting, repair, or other | 4000 |
| Debt collection | 4000 |
| Money transfer, virtual currency, or money service | 4000 |
| Student loan | 4000 |


**Q) Are there any duplicate complaint texts, and how often do they occur?**

In [10]:
complaints_training_dataset['Complaint_text'].nunique()

19913

In [11]:
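# Count each distinct complaint text once, then keep the texts that appear
# more than twice (use `> 1` to also include texts that appear exactly twice).
complaint_text_counts = complaints_training_dataset['Complaint_text'].value_counts()
duplicate_complaints = complaint_text_counts[complaint_text_counts > 2].index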

In [12]:
len(duplicate_complaints)

9
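
The two counts above answer slightly different questions: 20000 rows with 19913 distinct texts means 87 rows are surplus copies of some earlier row, while the `> 2` filter only keeps texts that occur three or more times. A minimal sketch on a toy DataFrame (the rows and the name `toy` are made up for illustration) shows how the thresholds differ:

```python
import pandas as pd

# Toy stand-in for the complaints DataFrame (hypothetical rows).
toy = pd.DataFrame({
    'Complaint_text': ['a', 'b', 'a', 'c', 'a', 'b', 'd'],
})

counts = toy['Complaint_text'].value_counts()

# Texts that occur at least twice (any duplicate at all).
repeated_at_least_twice = counts[counts > 1].index.tolist()

# Texts that occur more than twice (the threshold used in the cell above).
repeated_more_than_twice = counts[counts > 2].index.tolist()

# Rows that are surplus copies of an earlier row.
surplus_rows = int(toy['Complaint_text'].duplicated().sum())

# Surplus copies always equal total rows minus distinct texts.
assert surplus_rows == len(toy) - toy['Complaint_text'].nunique()
```

If duplicates should be dropped before training, `drop_duplicates(subset='Complaint_text')` is one option. Whether that is appropriate for this dataset is a judgment call the notebook does not make.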

### Preprocessing + Feature Engineering

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

RANDOM_STATE = 19

In [14]:
count_vectorizer = CountVectorizer(stop_words='english', max_features=5000)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    complaints_training_dataset['Complaint_text'],
    complaints_training_dataset['Product'],
    test_size=.2,
    stratify=complaints_training_dataset['Product'],
    random_state=RANDOM_STATE)

In [16]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((16000,), (4000,), (16000,), (4000,))

In [17]:
X_train_count_vectorizer = count_vectorizer.fit_transform(X_train)
X_test_count_vectorizer = count_vectorizer.transform(X_test)

In [18]:
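# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
len(count_vectorizer.get_feature_names())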

5000

In [19]:
count_vectorizer.get_feature_names()[:10]

['00', '000', '10', '100', '1000', '10000', '100000', '1005', '11', '110']

In [20]:
list(count_vectorizer.vocabulary_.items())[:10]

[('xxxx', 4976),
 ('account', 322),
 ('listed', 2727),
 ('credit', 1279),
 ('report', 3828),
 ('experian', 1842),
 ('paid', 3234),
 ('closed', 1021),
 ('2007', 63),
 ('like', 2712)]

In [21]:
X_train_count_vectorizer.shape, X_test_count_vectorizer.shape

((16000, 5000), (4000, 5000))
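
The vectorizer is fitted on the training split only and then reused on the test split, which is why both matrices share the same 5000 columns. A small sketch with made-up sentences (the texts and the `*_toy` names are illustrative assumptions) shows the consequence: words that appear only in the test text get no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature train/test split.
train_texts = ['the bank closed my account',
               'student loan payment was late']
test_texts = ['my loan account was closed by the lender']

vectorizer = CountVectorizer(stop_words='english', max_features=5000)

# The vocabulary is learned from the training texts only.
X_train_toy = vectorizer.fit_transform(train_texts)

# The test texts are mapped onto that fixed vocabulary.
X_test_toy = vectorizer.transform(test_texts)

# Same number of columns, and 'lender' (unseen in training) is ignored.
same_width = X_train_toy.shape[1] == X_test_toy.shape[1]
lender_known = 'lender' in vectorizer.vocabulary_
```

Fitting on the training split alone keeps information about the test vocabulary out of the model.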

### Machine Learning Algorithm: Ensemble Learning

#### Naive Bayes

In [111]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()

naive_bayes.fit(X_train_count_vectorizer, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#### Logistic Regression

In [112]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=501, random_state=RANDOM_STATE)

lr.fit(X_train_count_vectorizer, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=501,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=19, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [113]:
from sklearn.metrics import accuracy_score

naive_bayes_predictions = naive_bayes.predict(X_test_count_vectorizer)

naive_bayes_accuracy_score = accuracy_score(y_test, naive_bayes_predictions)

In [114]:
lr_predictions = lr.predict(X_test_count_vectorizer)

lr_accuracy_score = accuracy_score(y_test, lr_predictions)

In [130]:
print(naive_bayes_accuracy_score * 100, lr_accuracy_score * 100)

85.9 84.075


#### Ensemble Model

In [116]:
naive_bayes.predict_proba(X_test_count_vectorizer[:5])

array([[1.00000000e+00, 2.51806180e-63, 3.68676059e-60, 1.11750021e-37,
        2.96267797e-64],
       [7.60171500e-11, 1.25061781e-07, 9.99999874e-01, 3.49006817e-10,
        1.26483728e-10],
       [1.97851722e-22, 2.24140674e-02, 9.77585933e-01, 6.06726006e-26,
        3.83842754e-13],
       [1.59947701e-15, 8.64915524e-12, 5.15139416e-11, 2.60977681e-14,
        1.00000000e+00],
       [3.21608638e-22, 1.35274986e-21, 2.20350678e-22, 1.00000000e+00,
        6.68301179e-27]])

In [117]:
lr.predict_proba(X_test_count_vectorizer[:5])

array([[1.00000000e+00, 1.72189067e-22, 2.36560427e-20, 7.19659569e-15,
        5.78540875e-22],
       [9.02917827e-04, 2.40253670e-04, 9.97852426e-01, 2.51937279e-04,
        7.52464778e-04],
       [3.35346726e-06, 4.07829198e-01, 5.91659455e-01, 1.96361325e-08,
        5.07973393e-04],
       [9.35927807e-06, 9.38755502e-05, 3.63831168e-05, 1.28014298e-04,
        9.99732368e-01],
       [1.53568667e-10, 2.91341504e-09, 3.28672722e-09, 9.99999993e-01,
        1.08871320e-09]])

In [118]:
naive_bayes.predict(X_test_count_vectorizer[:5])

array(['Bank account or service', 'Debt collection', 'Debt collection',
       'Student loan',
       'Money transfer, virtual currency, or money service'], dtype='<U50')

In [119]:
lr.predict(X_test_count_vectorizer[:5])

array(['Bank account or service', 'Debt collection', 'Debt collection',
       'Student loan',
       'Money transfer, virtual currency, or money service'], dtype=object)

In [120]:
mismatch_indexes = []
for i, input_row in enumerate(X_test_count_vectorizer):
    naive_bayes_prediction = naive_bayes.predict(input_row)
    lr_prediction = lr.predict(input_row)
    if naive_bayes_prediction != lr_prediction:
        mismatch_indexes.append(i)
print(len(mismatch_indexes))

533
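
The row-by-row loop above calls `predict` twice per test row. Assuming both models have already produced a full label array (for instance the `naive_bayes_predictions` and `lr_predictions` arrays computed earlier), the disagreeing positions can be found in one comparison. A sketch on made-up label arrays:

```python
import numpy as np

# Hypothetical predictions from two models for four test rows.
nb_labels = np.array(['Debt collection', 'Student loan',
                      'Student loan', 'Debt collection'])
lr_labels = np.array(['Debt collection', 'Debt collection',
                      'Student loan', 'Bank account or service'])

# Elementwise comparison gives a boolean mask of disagreements.
disagree_mask = nb_labels != lr_labels

# Integer positions of the rows on which the models disagree.
mismatch_positions = np.flatnonzero(disagree_mask)
mismatch_count = int(disagree_mask.sum())
```

The resulting positions can index the sparse test matrix the same way `mismatch_indexes` does.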


In [121]:
mismatch_inputs = X_test_count_vectorizer[mismatch_indexes]

In [122]:
mismatch_outputs = []
for input_row in mismatch_inputs:
    naive_bayes_prediction = naive_bayes.predict(input_row)
    lr_prediction = lr.predict(input_row)
    naive_bayes_max_prob = naive_bayes.predict_proba(input_row).max()
    lr_max_prob = lr.predict_proba(input_row).max()
    mismatch_outputs.append([(naive_bayes_prediction, naive_bayes_max_prob),
                             (lr_prediction, lr_max_prob)])

In [123]:
mismatch_outputs[:5]

[[(array(['Student loan'], dtype='<U50'), 0.9992997576074214),
  (array(['Debt collection'], dtype=object), 0.930392953089541)],
 [(array(['Student loan'], dtype='<U50'), 0.9999999812662284),
  (array(['Debt collection'], dtype=object), 0.49534116285616897)],
 [(array(['Student loan'], dtype='<U50'), 0.9999999982051122),
  (array(['Debt collection'], dtype=object), 0.7199304816810411)],
 [(array(['Debt collection'], dtype='<U50'), 0.8307008376983429),
  (array(['Bank account or service'], dtype=object), 0.572151883494653)],
 [(array(['Credit reporting, repair, or other'], dtype='<U50'),
   0.9999999996043698),
  (array(['Debt collection'], dtype=object), 0.9225176582699213)]]

In [124]:
y_test_list = list(y_test)

In [125]:
ensembled_predictions = []
same_predictions, same_and_correct_predictions, \
    mismatch_naive_bayes_prediction, mismatch_lr_prediction = 0, 0, 0, 0
for i, input_row in enumerate(X_test_count_vectorizer):
    naive_bayes_prediction = naive_bayes.predict(input_row)
    lr_prediction = lr.predict(input_row)
    # Confidence of each model = probability of its top class.
    naive_bayes_max_prob = naive_bayes.predict_proba(input_row).max()
    lr_max_prob = lr.predict_proba(input_row).max()
    if naive_bayes_prediction == lr_prediction:
        # Both models agree: keep the shared label.
        ensembled_predictions.append(naive_bayes_prediction)
        same_predictions += 1
        if naive_bayes_prediction == y_test_list[i]:
            same_and_correct_predictions += 1
    elif naive_bayes_max_prob > lr_max_prob:
        # Models disagree and Naive Bayes is more confident.
        ensembled_predictions.append(naive_bayes_prediction)
        mismatch_naive_bayes_prediction += 1
    else:
        # Logistic Regression is at least as confident. Using `else` also
        # covers exact ties, so every row receives a prediction.
        ensembled_predictions.append(lr_prediction)
        mismatch_lr_prediction += 1

In [131]:
accuracy_score(y_test, ensembled_predictions) * 100

86.75

In [127]:
print(same_predictions,
      same_and_correct_predictions,
      mismatch_naive_bayes_prediction,
      mismatch_lr_prediction)

3467 3162 435 98
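
When the two models agree, the more confident model's label is the shared label anyway, so the rule above reduces to "take the label from whichever model has the higher top probability". A vectorized sketch of that rule follows. The probability values are made up, ties are sent to Naive Bayes as an arbitrary assumption, and `classes` stands in for `naive_bayes.classes_`, assuming both models order their classes identically.

```python
import numpy as np

# Stand-in for naive_bayes.classes_ (assumed identical to lr.classes_).
classes = np.array(['Bank account or service',
                    'Debt collection',
                    'Student loan'])

# Hypothetical predict_proba outputs for three test rows.
nb_proba = np.array([[0.90, 0.05, 0.05],
                     [0.10, 0.30, 0.60],
                     [0.20, 0.70, 0.10]])
lr_proba = np.array([[0.80, 0.10, 0.10],
                     [0.05, 0.90, 0.05],
                     [0.40, 0.35, 0.25]])

# Each model's confidence is the probability of its top class.
nb_conf = nb_proba.max(axis=1)
lr_conf = lr_proba.max(axis=1)

# Pick Naive Bayes where it is at least as confident (ties go to NB here).
use_nb = nb_conf >= lr_conf
winner_idx = np.where(use_nb,
                      nb_proba.argmax(axis=1),
                      lr_proba.argmax(axis=1))

ensembled = classes[winner_idx]
```

One caveat: Naive Bayes probabilities tend to be pushed towards 0 or 1, as the `predict_proba` outputs earlier in this section suggest, so a raw confidence comparison may favour Naive Bayes more often than its accuracy alone would justify.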


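In [128]:
# Share of test rows on which both models predict the same label
same_predictions / len(y_test_list)

0.86675

In [129]:
# Rows on which both models agree but the shared label is wrong
same_predictions - same_and_correct_predictions

305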
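
scikit-learn also ships a ready-made ensemble, `VotingClassifier`. With `voting='soft'` it averages the class probabilities of its members, which is a different rule from the "most confident model wins" approach used in this notebook. A hedged sketch on a made-up miniature dataset (texts, labels and `max_iter=1000` are illustrative assumptions):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset.
texts = ['collector keeps calling about a debt',
         'debt collector threatened me',
         'student loan servicer lost my payment',
         'my student loan interest is wrong']
labels = ['Debt collection', 'Debt collection',
          'Student loan', 'Student loan']

# Vectorizer and soft-voting ensemble chained in one pipeline.
soft_voter = make_pipeline(
    CountVectorizer(stop_words='english'),
    VotingClassifier(
        estimators=[('nb', MultinomialNB()),
                    ('lr', LogisticRegression(max_iter=1000))],
        voting='soft'),
)
soft_voter.fit(texts, labels)

new_text = ['a collector called me about an old debt']
predictions = soft_voter.predict(new_text)
probabilities = soft_voter.predict_proba(new_text)
```

Whether soft voting would beat the 86.75% obtained above on the complaints data is not something this sketch can tell; it would need to be measured.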

### Model Evaluation

In [267]:
from sklearn.model_selection import cross_val_score
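
`cross_val_score` is imported here but not yet called. One way it might be used, sketched on a made-up miniature dataset, is to wrap the vectorizer and classifier in a pipeline so that the vocabulary is re-learned inside each fold. The texts, labels and `cv=2` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset, four texts per class.
texts = ['debt collector keeps calling',
         'collector threatened to sue over debt',
         'debt collection agency harassed me',
         'collection letter for a debt I paid',
         'student loan servicer lost payment',
         'student loan interest is wrong',
         'loan forgiveness for student denied',
         'student loan balance not updated']
labels = ['Debt collection'] * 4 + ['Student loan'] * 4

# Vectorizing inside the pipeline avoids leaking fold vocabulary.
model = make_pipeline(CountVectorizer(stop_words='english'),
                      MultinomialNB())

# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(model, texts, labels, cv=2, scoring='accuracy')
mean_accuracy = scores.mean()
```

On the real data, the same call could take `complaints_training_dataset['Complaint_text']` and `complaints_training_dataset['Product']` with a larger `cv`.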

### Quality Metrics

In [270]:
from sklearn.metrics import (accuracy_score,
                             confusion_matrix)

import seaborn as sns
sns.set()

import matplotlib.pyplot as plt
%matplotlib inline
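
Accuracy alone hides which categories get confused with each other. A minimal sketch of `confusion_matrix` on made-up labels (the `y_true` and `y_pred` values are illustrative, not results from this notebook) shows how per-class recall falls out of the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Fixing the label order makes rows and columns easy to read.
label_order = ['Debt collection', 'Student loan']

# Hypothetical true and predicted labels.
y_true = ['Debt collection', 'Debt collection',
          'Student loan', 'Student loan', 'Student loan']
y_pred = ['Debt collection', 'Student loan',
          'Student loan', 'Student loan', 'Debt collection']

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=label_order)

# Recall per class: correct predictions divided by the row total.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
```

With seaborn loaded as above, the matrix could be drawn with `sns.heatmap(cm, annot=True, fmt='d', xticklabels=label_order, yticklabels=label_order)`.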

### Model Evaluation on Test Dataset

- Note: Retrain the model on the full training [dataset](../datasets/consumer_complaints_training_dataset.csv) and evaluate it on the test [dataset](../datasets/consumer_complaints_test_dataset.csv).

## Homework

## Resources