<h1>Text Classification using Naive Bayes<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#Naive-Bayes-Theory" data-toc-modified-id="Naive-Bayes-Theory-2">Naive Bayes Theory</a></span></li><li><span><a href="#Machine-Learning-Project-Lifecycle:-First-Iteration" data-toc-modified-id="Machine-Learning-Project-Lifecycle:-First-Iteration-3">Machine Learning Project Lifecycle: First Iteration</a></span><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-3.1">Problem Statement</a></span></li><li><span><a href="#Training-Data" data-toc-modified-id="Training-Data-3.2">Training Data</a></span></li><li><span><a href="#Preprocessing-+-Feature-Engineering" data-toc-modified-id="Preprocessing-+-Feature-Engineering-3.3">Preprocessing + Feature Engineering</a></span></li><li><span><a href="#Machine-Learning-Algorithm:-Naive-Bayes" data-toc-modified-id="Machine-Learning-Algorithm:-Naive-Bayes-3.4">Machine Learning Algorithm: Naive Bayes</a></span></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-3.5">Modeling</a></span></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-3.6">Model Evaluation</a></span></li></ul></li><li><span><a href="#Homework" data-toc-modified-id="Homework-4">Homework</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-5">Resources</a></span></li></ul></div>

## Introduction

## Naive Bayes Theory

## Machine Learning Project Lifecycle: First Iteration

### Problem Statement

### Training Data

In [1]:
import pandas as pd

In [2]:
complaints_dataset = pd.read_csv('../datasets/consumer_complaints_dataset.csv')

In [3]:
complaints_dataset.head()

| | Product | Complaint_text |
|---|---|---|
| 0 | Credit reporting, repair, or other | The Summer of XX/XX/2018 I was denied a mortga... |
| 1 | Credit reporting, repair, or other | There are many mistakes appear in my report wi... |
| 2 | Credit reporting, repair, or other | There are many mistakes appear in my report wi... |
| 3 | Credit reporting, repair, or other | There are many mistakes appear in my report wi... |
| 4 | Credit reporting, repair, or other | There are many mistakes appear in my report wi... |


In [4]:
complaints_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254350 entries, 0 to 254349
Data columns (total 2 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Product         254350 non-null  object
 1   Complaint_text  254350 non-null  object
dtypes: object(2)
memory usage: 3.9+ MB


**Q) What is the distribution of complaints across product types?**

In [5]:
complaints_dataset\
    .groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

| Product | Count |
|---|---|
| Credit reporting, repair, or other | 123966 |
| Debt collection | 86710 |
| Student loan | 21810 |
| Bank account or service | 14885 |
| Money transfer, virtual currency, or money service | 6979 |


**Q) How often do duplicate complaint texts occur? Also remove them.**

In [6]:
complaints_dataset['Complaint_text'].nunique()

238121

In [7]:
# Count occurrences of each complaint text once, then keep texts seen more than twice.
complaint_counts = complaints_dataset['Complaint_text'].value_counts()
duplicate_complaints = complaint_counts[complaint_counts > 2].index

In [8]:
len(duplicate_complaints)

3354
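
The question above also asks to remove the repeated complaints, but no cell does so. A minimal sketch on a toy DataFrame follows, assuming the intent is to drop every complaint whose text occurs more than twice. The `toy` frame and its contents are made up for illustration; only the column names follow the dataset above.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    'Product': ['Debt collection'] * 3 + ['Student loan'] * 3,
    'Complaint_text': ['same text', 'same text', 'same text',
                       'loan issue', 'loan issue', 'unique text'],
})

# Count how often each complaint text occurs.
text_counts = toy['Complaint_text'].value_counts()

# Texts occurring more than twice (the same threshold as the cell above).
frequent_texts = text_counts[text_counts > 2].index

# Keep only rows whose text is not in that set.
deduplicated = toy[~toy['Complaint_text'].isin(frequent_texts)].reset_index(drop=True)

print(len(toy), '->', len(deduplicated))
```

If the goal is instead to keep one copy of each text, `toy.drop_duplicates(subset='Complaint_text')` is a possible alternative.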

**Extract Mini Dataset with Equal number of examples in each Product class**

In [9]:
dataset_indexes = []
for product in complaints_dataset['Product'].unique():
    # Sample 4000 complaints per product with a fixed seed for reproducibility.
    indexes = complaints_dataset[complaints_dataset.Product == product]\
        .sample(4000, random_state=19).index
    dataset_indexes.extend(indexes)

In [10]:
mini_complaints_dataset = complaints_dataset.loc[dataset_indexes].copy()

In [11]:
mini_complaints_dataset.shape

(20000, 2)

In [12]:
mini_complaints_dataset.groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

| Product | Count |
|---|---|
| Bank account or service | 4000 |
| Credit reporting, repair, or other | 4000 |
| Debt collection | 4000 |
| Money transfer, virtual currency, or money service | 4000 |
| Student loan | 4000 |
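
The per-product loop used to build the balanced mini dataset can probably be condensed with `DataFrameGroupBy.sample`, which is available from pandas 1.1 onward. The sketch below uses a made-up toy frame (`toy`, with invented classes `A` and `B`) rather than the real complaints data.

```python
import pandas as pd

# Hypothetical imbalanced toy data; column names mirror the dataset above.
toy = pd.DataFrame({
    'Product': ['A'] * 6 + ['B'] * 3,
    'Complaint_text': [f'complaint {i}' for i in range(9)],
})

# Draw the same number of rows from every class in one call.
balanced = toy.groupby('Product').sample(n=3, random_state=19)

print(balanced['Product'].value_counts().to_dict())
```

The result keeps the original index labels, so it can be used directly where the loop-based version used `.loc[dataset_indexes]`.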


### Preprocessing + Feature Engineering

- Bag of words
- `CountVectorizer`
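
A bag-of-words representation ignores word order and records only which vocabulary words each document contains, and optionally how often. A rough pure-Python sketch of the idea follows. The `tokenize` helper and the sample sentences are illustrative assumptions; the regular expression mirrors the default token pattern of `CountVectorizer`, which otherwise offers many more options.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep word tokens of 2+ characters."""
    return re.findall(r'\b\w\w+\b', text.lower())

documents = [
    "The bank closed my account",
    "My account was closed by the bank and the bank kept my money",
]

# Vocabulary: every distinct token, sorted so each word gets a stable column.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# One row per document, one column per vocabulary word.
count_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```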

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

RANDOM_STATE = 19

**Split the Data into Train & Test Sets**

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    mini_complaints_dataset['Complaint_text'],
    mini_complaints_dataset['Product'],
    test_size=.2,
    stratify=mini_complaints_dataset['Product'],
    random_state=RANDOM_STATE)

In [15]:
y_train.value_counts()

Money transfer, virtual currency, or money service    3200
Debt collection                                       3200
Student loan                                          3200
Credit reporting, repair, or other                    3200
Bank account or service                               3200
Name: Product, dtype: int64

In [16]:
y_test.value_counts()

Credit reporting, repair, or other                    800
Money transfer, virtual currency, or money service    800
Student loan                                          800
Debt collection                                       800
Bank account or service                               800
Name: Product, dtype: int64

**`CountVectorizer` Examples**

In [17]:
example_dataset = [
    "Today is Thursday.",
    "Second session of Machine Learning Series",
    "Machine Learning uses Stats methods to learn from data. Stats is awesome",
]

In [48]:
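# Default settings keep raw term counts, so a repeated word such as "stats" shows up as 2 in the matrix below.
example_vectorizer = CountVectorizer()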
transformed_data = example_vectorizer.fit_transform(example_dataset)

In [19]:
print(example_vectorizer.get_feature_names())

['awesome', 'data', 'learn', 'learning', 'machine', 'methods', 'second', 'series', 'session', 'stats', 'thursday', 'today', 'uses']


In [20]:
example_vectorizer.vocabulary_

{'today': 11,
 'thursday': 10,
 'second': 6,
 'session': 8,
 'machine': 4,
 'learning': 3,
 'series': 7,
 'uses': 12,
 'stats': 9,
 'methods': 5,
 'learn': 2,
 'data': 1,
 'awesome': 0}

In [21]:
transformed_data.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 0, 1]])
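
The `binary` parameter of `CountVectorizer` decides whether a cell stores how many times a word occurs or only whether it occurs at all. A small sketch with a made-up sentence compares the two settings; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Stats is awesome and stats is fun"]

count_vectorizer = CountVectorizer()              # raw term counts
binary_vectorizer = CountVectorizer(binary=True)  # presence/absence only

counts = count_vectorizer.fit_transform(sentences).toarray()
flags = binary_vectorizer.fit_transform(sentences).toarray()

# Recover the column order from the learned vocabulary mapping.
vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
print(vocab)
print(counts[0])
print(flags[0])
```

With `binary=True` every nonzero count is clipped to 1, so the repeated words `is` and `stats` are recorded once.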

**Preprocessing + Feature Engineering of the Dataset**

In [22]:
binary_count_vectorizer = CountVectorizer(binary=True, stop_words='english')

In [23]:
X_train_binary_count_vectorizer = binary_count_vectorizer.fit_transform(X_train)

In [24]:
X_test_binary_count_vectorizer = binary_count_vectorizer.transform(X_test)

In [25]:
X_train_binary_count_vectorizer.shape, X_test_binary_count_vectorizer.shape

((16000, 24253), (4000, 24253))

In [26]:
binary_count_vectorizer.get_feature_names()[:10]

['00', '000', '0001', '001', '0010', '00109', '004', '00i', '01', '014']

In [27]:
binary_count_vectorizer.vocabulary_

{'xxxx': 23870,
 'account': 1264,
 'listed': 12934,
 'credit': 5847,
 'report': 18089,
 'experian': 8616,
 'paid': 15420,
 'closed': 4699,
 '2007': 316,
 'like': 12884,
 'removed': 17959,
 'years': 24179,
 'employer': 7949,
 'submitted': 20596,
 'incorrect': 11389,
 'information': 11571,
 'hsa': 10867,
 'bank': 2995,
 'subsidiary': 20629,
 'webster': 23415,
 'money': 14021,
 'deducted': 6314,
 'paycheck': 15611,
 '15': 141,
 'days': 6155,
 'hired': 10690,
 'refusing': 17693,
 'release': 17849,
 'funds': 9757,
 'corrected': 5640,
 'requested': 18186,
 'verifying': 23042,
 'identity': 11020,
 'social': 19889,
 'security': 19170,
 'number': 14713,
 'began': 3183,
 'disputing': 7212,
 'items': 12217,
 'bureaus': 3841,
 'xx': 23868,
 'sent': 19261,
 'follow': 9343,
 'letters': 12801,
 'believe': 3227,
 'public': 16890,
 'records': 17518,
 'reporting': 18098,
 'accurately': 1326,
 'compliance': 5057,
 'fcra': 8951,
 'fdcpa': 8960,
 'received': 17384,
 'didnt': 6848,
 'say': 18958,
 'company'

**Inspect the Feature Matrix**

In [30]:
X_train_array = X_train_binary_count_vectorizer.toarray()

In [46]:
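def raw_text_to_vocab(text_index):
    """Print a training complaint and the vocabulary words found in it."""
    # Fetch the feature names once instead of on every loop iteration.
    # Note: scikit-learn 1.2+ removed get_feature_names() in favor of get_feature_names_out().
    feature_names = binary_count_vectorizer.get_feature_names()
    print(X_train.iloc[text_index])
    for index, exists in enumerate(X_train_array[text_index]):
        if exists:
            print(index, '->', feature_names[index])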

In [47]:
raw_text_to_vocab(0)

XXXX account listed on my credit report with Experian and XXXX has been paid and closed since XXXX 2007. I would like this account removed from my credit report as it has been over 7 years.
316 -> 2007
1264 -> account
4699 -> closed
5847 -> credit
8616 -> experian
12884 -> like
12934 -> listed
15420 -> paid
17959 -> removed
18089 -> report
23870 -> xxxx
24179 -> years


**Q) Which words are common across the different product classes?**

**Q) Which words are specific to each product class?**
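
One possible way to approach both questions is to count, per class, the number of documents containing each word and then compare the classes. The sketch below does this on invented toy complaints with two made-up labels (`bank`, `loan`); it is an illustration, not the notebook's own answer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy complaints; labels are invented for illustration.
texts = [
    "bank charged fee on my account",
    "bank closed my account",
    "loan servicer lost my payment",
    "loan payment was applied late",
]
labels = np.array(["bank", "bank", "loan", "loan"])
classes = sorted(set(labels.tolist()))

vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(texts)
words = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# For each class, count the documents in which each word appears.
doc_freq = {
    label: np.asarray(features[np.flatnonzero(labels == label)].sum(axis=0)).ravel()
    for label in classes
}

# Words seen in every class versus words seen in exactly one class.
in_class = np.vstack([doc_freq[label] > 0 for label in classes])
common_words = words[in_class.all(axis=0)]
specific_words = {
    label: words[(doc_freq[label] > 0) & (in_class.sum(axis=0) == 1)]
    for label in classes
}

print("common:", list(common_words))
for label, found in specific_words.items():
    print(label, "only:", list(found))
```

On the real data the same idea would apply to `X_train_binary_count_vectorizer` and `y_train`, although a frequency threshold is likely more informative than strict presence or absence.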

### Machine Learning Algorithm: Naive Bayes

### Modeling

### Model Evaluation

- Quality Metrics
    - Confusion Matrix
    - Accuracy (as a percentage)
- Evaluation
    - Cross Validation
    - K-Fold Cross Validation
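
The metrics and evaluation schemes listed above could be computed along the following lines. This is a sketch under stated assumptions: the toy texts and labels are invented, and `MultinomialNB` is assumed as the Naive Bayes variant since the notebook does not say which one it uses (`BernoulliNB` is another candidate for binary features).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy complaints standing in for the real train/test split.
train_texts = [
    "bank charged a fee on my checking account",
    "bank closed my savings account",
    "my account was overdrawn by the bank",
    "bank froze my checking account",
    "loan servicer lost my student payment",
    "student loan payment applied late",
    "servicer raised my student loan interest",
    "loan servicer ignored my payment plan",
]
train_labels = ["bank"] * 4 + ["loan"] * 4
test_texts = ["bank account fee", "student loan servicer"]
test_labels = ["bank", "loan"]

# Vectorizer and classifier in one pipeline, so the vocabulary is
# learned only from the training portion of each fold.
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(test_labels, predictions, labels=["bank", "loan"]))
print("accuracy: {:.0%}".format(accuracy_score(test_labels, predictions)))

# K-fold cross-validation (K=4 here) on the training data.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=19)
scores = cross_val_score(model, train_texts, train_labels, cv=folds)
print("fold accuracies:", scores)
```

Wrapping the vectorizer in the pipeline is a design choice: it avoids leaking vocabulary from held-out folds into training during cross-validation.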

## Homework

- Drop the complaints that occur more than 2 times from the dataset and see how removing them affects the accuracy.
- Use `CountVectorizer` with `binary=False` and see how it affects the accuracy.
- Fit `CountVectorizer` without stop-word removal and see how it affects the accuracy.

## Resources

- [How to Use CountVectorizer](https://kavita-ganesan.com/how-to-use-countvectorizer/)