# Multiclass Text Classification on Consumer Complaints for Financial Products
## Author: Georgios Spyrou
## Date: 19/08/2020

### Project Description

The data used in this project consists of complaints made by consumers about financial services and products (e.g. student loans, credit reports, mortgages) in the United States between November 2011 and May 2019. Each complaint is assigned to one Product category. This makes the data well suited to supervised learning, with the text of the consumer's complaint as the **input** and the category that the complaint belongs to as the **target** variable.

The dataset is publicly available from the U.S. Consumer Financial Protection Bureau, which updates it daily, and it can be found <a href="https://catalog.data.gov/dataset/consumer-complaint-database" style="text-decoration: none">here</a>.

After this short introduction, we can move on to the main part of the project. First we will load the dataset into Python, perform some data cleaning, and continue with exploratory data analysis to get a better understanding of the dataset. Once we understand the data, we will proceed to the modeling part, where we will try different classification algorithms and compare how well they predict the class/category each complaint should fall into.

### Part 1 - Data Loading

In [3]:
# Import dependencies
import os
import re
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
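# Set the working directory to the project root (adjust this path for your machine)
project_dir = r'C:\Users\george\Desktop\GitHub\Projects\Consumer_Complaints'
os.chdir(project_dir)

# Load the complaints data from the project's Data folder
complaints_df = pd.read_csv(os.path.join(project_dir, 'Data', 'complaints.csv'))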

In [5]:
complaints_df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')

Of the features contained in the dataset, we only need two columns: **Consumer complaint narrative**, which holds the text of the consumer's complaint, and **Product**, which is the category that the complaint falls into and will be our target variable.
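As a rough illustration of narrowing the data to these two columns, the sketch below reads only the text and target columns and drops rows with no narrative. It uses a tiny in-memory CSV with made-up rows in place of `complaints.csv`, and the names `sample_csv`, `text_col`, `target_col` and `subset_df` are assumptions introduced here, not part of the original notebook. Dropping rows without a narrative is also an assumption about how one might prepare the data, not necessarily the step taken later in this project.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for complaints.csv (hypothetical rows, for illustration only)
sample_csv = io.StringIO(
    "Product,Issue,Consumer complaint narrative\n"
    "Mortgage,Loan servicing,I was charged a late fee in error.\n"
    "Student loan,Repayment,\n"
    "Mortgage,Escrow,My escrow balance is wrong.\n"
)

text_col = 'Consumer complaint narrative'  # model input
target_col = 'Product'                     # target variable

# usecols restricts parsing to the two columns of interest
subset_df = pd.read_csv(sample_csv, usecols=[target_col, text_col])

# Rows without a narrative cannot serve as model input, so drop them
subset_df = subset_df.dropna(subset=[text_col]).reset_index(drop=True)

print(subset_df.shape)
print(subset_df[target_col].value_counts())
```

With the real file, the same `usecols` argument could be passed to the `pd.read_csv` call above to avoid loading the unused columns into memory.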

### Part 2 - Data Cleaning & Exploratory Data Analysis