<h1>Text Classification using Naive Bayes<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#Essential-Concepts" data-toc-modified-id="Essential-Concepts-2">Essential Concepts</a></span><ul class="toc-item"><li><span><a href="#Naive-Bayes-Theory" data-toc-modified-id="Naive-Bayes-Theory-2.1">Naive Bayes Theory</a></span></li><li><span><a href="#Numpy" data-toc-modified-id="Numpy-2.2">Numpy</a></span></li><li><span><a href="#SKLearn" data-toc-modified-id="SKLearn-2.3">SKLearn</a></span></li></ul></li><li><span><a href="#Machine-Learning-Project-Lifecycle:-First-Iteration" data-toc-modified-id="Machine-Learning-Project-Lifecycle:-First-Iteration-3">Machine Learning Project Lifecycle: First Iteration</a></span><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-3.1">Problem Statement</a></span></li><li><span><a href="#Training-Data" data-toc-modified-id="Training-Data-3.2">Training Data</a></span></li><li><span><a href="#Preprocessing-+-Feature-Engineering" data-toc-modified-id="Preprocessing-+-Feature-Engineering-3.3">Preprocessing + Feature Engineering</a></span></li><li><span><a href="#Machine-Learning-Algorithm:-Naive-Bayes" data-toc-modified-id="Machine-Learning-Algorithm:-Naive-Bayes-3.4">Machine Learning Algorithm: Naive Bayes</a></span></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-3.5">Modeling</a></span></li><li><span><a href="#Quality-Metrics" data-toc-modified-id="Quality-Metrics-3.6">Quality Metrics</a></span></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-3.7">Model Evaluation</a></span></li></ul></li><li><span><a href="#Homework" data-toc-modified-id="Homework-4">Homework</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-5">Resources</a></span></li></ul></div>

## Introduction

## Essential Concepts

### Naive Bayes Theory

### Numpy

### SKLearn

## Machine Learning Project Lifecycle: First Iteration

### Problem Statement

### Training Data

In [2]:
import pandas as pd

In [3]:
complaints_dataset = pd.read_csv('../datasets/consumer_complaints_dataset.csv')

In [4]:
complaints_dataset.head()

,Product,Complaint_text
0,"Credit reporting, repair, or other",The Summer of XX/XX/2018 I was denied a mortga...
1,"Credit reporting, repair, or other",There are many mistakes appear in my report wi...
2,"Credit reporting, repair, or other",There are many mistakes appear in my report wi...
3,"Credit reporting, repair, or other",There are many mistakes appear in my report wi...
4,"Credit reporting, repair, or other",There are many mistakes appear in my report wi...


In [5]:
complaints_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254350 entries, 0 to 254349
Data columns (total 2 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Product         254350 non-null  object
 1   Complaint_text  254350 non-null  object
dtypes: object(2)
memory usage: 3.9+ MB


**Q) What is the distribution of rows for each product type?**

In [6]:
complaints_dataset\
    .groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

Product,Count
"Credit reporting, repair, or other",123966
Debt collection,86710
Student loan,21810
Bank account or service,14885
"Money transfer, virtual currency, or money service",6979

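The same distribution can also be read off with `Series.value_counts`, and passing `normalize=True` reports each class as a share of all rows. A minimal sketch, assuming a toy frame (`toy` and its rows are made up for illustration and are not the complaints dataset):

```python
import pandas as pd

# Toy stand-in for complaints_dataset (hypothetical rows, for illustration only).
toy = pd.DataFrame({
    'Product': ['Debt collection', 'Debt collection', 'Student loan', 'Debt collection'],
    'Complaint_text': ['a', 'b', 'c', 'd'],
})

# Absolute row count per product, largest first.
counts = toy['Product'].value_counts()

# Fraction of rows per product; the fractions sum to 1.
shares = toy['Product'].value_counts(normalize=True)

print(counts)
print(shares)
```

Shares make it easier to compare class balance across datasets of different sizes.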

**Q) How many complaint texts occur more than once? Remove the duplicates.**

In [7]:
complaints_dataset['Complaint_text'].nunique()

238121

In [46]:
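# Count how often each complaint text occurs (computed once and reused).
text_counts = complaints_dataset['Complaint_text'].value_counts()
# Note: `> 2` flags only texts that occur three or more times;
# texts that occur exactly twice are not flagged (use `> 1` to flag every repeat).
duplicate_complaints = text_counts[text_counts > 2].index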

In [47]:
len(duplicate_complaints)

3354

In [53]:
complaints_dataset[complaints_dataset['Complaint_text'].isin(duplicate_complaints)].shape

(13377, 2)
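Another way to look at repeated texts is `Series.duplicated`, which returns a boolean mask per row. The sketch below uses a toy column (hypothetical data, not the complaints dataset) to show the difference between marking every repeated row (`keep=False`) and marking only the extra copies (the default `keep='first'`). The number of extra copies always equals the row count minus `nunique()`, which for the figures above would be 254350 - 238121 = 16229.

```python
import pandas as pd

# Toy column: 'x' occurs three times, 'y' twice, 'z' once (illustrative only).
toy = pd.DataFrame({'Complaint_text': ['x', 'x', 'x', 'y', 'y', 'z']})

# keep=False marks every row whose text occurs more than once.
all_repeats = toy['Complaint_text'].duplicated(keep=False)

# keep='first' (the default) marks only the second and later copies.
extra_copies = toy['Complaint_text'].duplicated()

n_repeat_rows = int(all_repeats.sum())
n_extra_copies = int(extra_copies.sum())

print(n_repeat_rows, n_extra_copies)
```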

In [50]:
complaints_dataset[
    ~complaints_dataset['Complaint_text'].isin(duplicate_complaints)]\
    .groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

Product,Count
"Credit reporting, repair, or other",112055
Debt collection,85286
Student loan,21794
Bank account or service,14872
"Money transfer, virtual currency, or money service",6966


In [54]:
complaints_dataset.drop(
    complaints_dataset[complaints_dataset['Complaint_text'].isin(duplicate_complaints)].index,
    inplace=True)

In [55]:
complaints_dataset.shape

(240973, 2)
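The cell above removes every row whose text was flagged, including the first occurrence. If the aim were instead to keep one copy of each text, `DataFrame.drop_duplicates` is one option. This is an alternative sketch on a toy frame (hypothetical data), not what the notebook does:

```python
import pandas as pd

# Toy frame with repeated texts (illustrative only).
toy = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B', 'B'],
    'Complaint_text': ['x', 'x', 'y', 'z', 'y'],
})

# Keep the first occurrence of each text and drop the later copies.
deduplicated = toy.drop_duplicates(subset='Complaint_text', keep='first')

print(deduplicated)
```

Which behaviour is preferable depends on whether repeated complaints are considered noise to discard entirely or valid examples that should simply not be over-counted.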

**Extract a mini dataset with an equal number of examples in each Product class**

- [ ] Understand the problem caused by imbalanced classes in text classification
    - [imbalanced text classification problems in machine learning](https://www.google.com/search?sxsrf=ALeKk01CQN9VXMRaO6OF2vc125fXCmoWQA:1586266419857&q=imbalanced+text+classification+problems+in+machine+learning&sa=X&ved=2ahUKEwjbyuXzttboAhX27nMBHZ9jDGMQ7xYoAHoECA4QJw&biw=1280&bih=618)
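One way to see why imbalance matters is to compute the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below uses the per-class counts from the table above (after duplicate removal); the variable names are illustrative assumptions.

```python
# Class counts after duplicate removal, copied from the table above.
class_counts = {
    'Credit reporting, repair, or other': 112055,
    'Debt collection': 85286,
    'Student loan': 21794,
    'Bank account or service': 14872,
    'Money transfer, virtual currency, or money service': 6966,
}

total = sum(class_counts.values())

# A classifier that always predicts the largest class is right this often,
# even though it never predicts any of the other four classes.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

With classes this uneven, plain accuracy can look acceptable while the smaller classes are ignored, which is one motivation for sampling an equal number of rows per class.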

In [56]:
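# Sample 4000 complaints per product (fixed seed for reproducibility)
# and collect their index labels.
dataset_indexes = []
for product in complaints_dataset['Product'].unique():
    indexes = complaints_dataset[complaints_dataset['Product'] == product]\
        .sample(4000, random_state=19).index
    dataset_indexes.extend(indexes)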

In [57]:
mini_complaints_dataset = complaints_dataset.loc[dataset_indexes].copy()

In [58]:
mini_complaints_dataset.shape

(20000, 2)

In [59]:
mini_complaints_dataset.groupby('Product')\
    [['Complaint_text']]\
    .count()\
    .rename(columns={'Complaint_text': 'Count'})\
    .sort_values('Count', ascending=False)

Product,Count
Bank account or service,4000
"Credit reporting, repair, or other",4000
Debt collection,4000
"Money transfer, virtual currency, or money service",4000
Student loan,4000
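The per-class sampling loop can likely be written more compactly with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. A minimal sketch on a toy frame (hypothetical data); the rows it selects for a given `random_state` are not guaranteed to match those picked by the loop:

```python
import pandas as pd

# Toy frame with unequal class sizes (illustrative only).
toy = pd.DataFrame({
    'Product': ['A'] * 5 + ['B'] * 3,
    'Complaint_text': [f'text {i}' for i in range(8)],
})

# Draw the same number of rows from every product group in one call.
balanced = toy.groupby('Product').sample(n=2, random_state=19)

print(balanced['Product'].value_counts())
```

As with the loop, `n` must not exceed the size of the smallest class unless sampling with replacement is requested.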


### Preprocessing + Feature Engineering

### Machine Learning Algorithm: Naive Bayes

### Modeling

### Quality Metrics

### Model Evaluation

## Homework

## Resources