# Lab | Customer Analysis Round 1

#### Remember the process:

1. Case Study
2. Get data
3. Cleaning/Wrangling/EDA
4. Processing Data
5. Modeling
6. Validation
7. Reporting

### Abstract

The objective of this data is to understand customer demographics and buying behavior. Later during the week, we will use predictive analytics to analyze the most profitable customers and how they interact. After that, we will take targeted actions to increase profitable customer response, retention, and growth.

For this lab, we will gather the data from 3 _csv_ files that are provided in the `files_for_lab` folder. Use that data and complete the data cleaning tasks as mentioned later in the instructions.

### Instructions

- Read the three files into python as dataframes
- Show the DataFrame's shape.
- Standardize header names.
- Rearrange the columns in the dataframe as needed
- Concatenate the three dataframes
- Which columns are numerical?
- Which columns are categorical?
- Understand the meaning of all columns
- Perform the data cleaning operations mentioned so far in class

  - Delete the column education and the number of open complaints from the dataframe.
  - Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and change `dtype` to `numerical` type.
  - Check for duplicate rows in the data and remove if any.
  - Filter out the data for customers who have an income of 0 or less.

1. Read the three files into python as dataframes

In [25]:
import pandas as pd
import numpy as np

In [26]:
file1 = pd.read_csv("./files_for_lab/file1.csv")
file2 = pd.read_csv("./files_for_lab/file2.csv")
file3 = pd.read_csv("./files_for_lab/file3.csv")

2. Show the DataFrame's shape

In [27]:
print(file1.shape)
print(file2.shape)
print(file3.shape)

(4008, 11)
(996, 11)
(7070, 11)


3. Standardize header names

In [28]:
file1.columns

Index(['Customer', 'ST', 'GENDER', 'Education', 'Customer Lifetime Value',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Vehicle Class', 'Total Claim Amount'],
      dtype='object')

In [29]:
file2.columns

Index(['Customer', 'ST', 'GENDER', 'Education', 'Customer Lifetime Value',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Total Claim Amount', 'Policy Type', 'Vehicle Class'],
      dtype='object')

In [30]:
file3.columns

Index(['Customer', 'State', 'Customer Lifetime Value', 'Education', 'Gender',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Total Claim Amount', 'Vehicle Class'],
      dtype='object')

In [22]:
cols = file1.columns
cols

Index(['Customer', 'ST', 'GENDER', 'Education', 'Customer Lifetime Value',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Vehicle Class', 'Total Claim Amount'],
      dtype='object')

In [35]:
file2 = file2[file1.columns]
file2.head()

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,Personal Auto,Four-Door Car,633.6
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,Special Auto,SUV,547.2
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,Personal Auto,Two-Door Car,537.6
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,Personal Auto,Luxury Car,1027.2
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,Corporate Auto,Two-Door Car,451.2


In [36]:
file3 = file2[file1.columns]
file3.head()

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,Personal Auto,Four-Door Car,633.6
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,Special Auto,SUV,547.2
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,Personal Auto,Two-Door Car,537.6
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,Personal Auto,Luxury Car,1027.2
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,Corporate Auto,Two-Door Car,451.2


3. Rearrange columns in database

In [37]:
column_names = file1.columns
column_names

Index(['Customer', 'ST', 'GENDER', 'Education', 'Customer Lifetime Value',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Vehicle Class', 'Total Claim Amount'],
      dtype='object')

In [38]:
data = pd.DataFrame(columns=column_names)
data

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount


4. Concatenate the three files

In [47]:
data = pd.concat([data,file1, file2, file3], axis=0)
data.shape

(12000, 11)

5. Which columns are numerical?

In [48]:
data.select_dtypes('float')

Unnamed: 0,Income,Monthly Premium Auto,Total Claim Amount
0,0.0,1000.0,2.704934
1,0.0,94.0,1131.464935
2,48767.0,108.0,566.472247
3,0.0,106.0,529.881344
4,36357.0,68.0,17.269323
...,...,...,...
991,63513.0,70.0,185.667213
992,58161.0,68.0,140.747286
993,83640.0,70.0,471.050488
994,0.0,96.0,28.460568


6. Which columns are categorical?

In [49]:
data.select_dtypes('object')

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Number of Open Complaints,Policy Type,Vehicle Class
0,RB50392,Washington,,Master,,1/0/00,Personal Auto,Four-Door Car
1,QZ44356,Arizona,F,Bachelor,697953.59%,1/0/00,Personal Auto,Four-Door Car
2,AI49188,Nevada,F,Bachelor,1288743.17%,1/0/00,Personal Auto,Two-Door Car
3,WW63253,California,M,Bachelor,764586.18%,1/0/00,Corporate Auto,SUV
4,GA49547,Washington,M,High School or Below,536307.65%,1/0/00,Personal Auto,Four-Door Car
...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,Master,847141.75%,1/0/00,Personal Auto,Four-Door Car
992,BS91566,Arizona,F,College,543121.91%,1/0/00,Corporate Auto,Four-Door Car
993,IL40123,Nevada,F,College,568964.41%,1/0/00,Corporate Auto,Two-Door Car
994,MY32149,California,F,Master,368672.38%,1/0/00,Personal Auto,Two-Door Car


7. Data Cleaning

* Delete columns: "Education" and "Number of open complaints"

In [53]:
data = data.drop(['Education', 'Number of Open Complaints'], axis=1)
data

Unnamed: 0,Customer,ST,GENDER,Customer Lifetime Value,Income,Monthly Premium Auto,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,697953.59%,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,1288743.17%,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,764586.18%,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,536307.65%,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,847141.75%,63513.0,70.0,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,543121.91%,58161.0,68.0,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,568964.41%,83640.0,70.0,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,368672.38%,0.0,96.0,Personal Auto,Two-Door Car,28.460568


* Correct values in "Customer Lifetime Value"

In [54]:
data['Customer Lifetime Value'] = data['Customer Lifetime Value']*100
data

Unnamed: 0,Customer,ST,GENDER,Customer Lifetime Value,Income,Monthly Premium Auto,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,697953.59%697953.59%697953.59%697953.59%697953...,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,1288743.17%1288743.17%1288743.17%1288743.17%12...,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,764586.18%764586.18%764586.18%764586.18%764586...,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,536307.65%536307.65%536307.65%536307.65%536307...,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,847141.75%847141.75%847141.75%847141.75%847141...,63513.0,70.0,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,543121.91%543121.91%543121.91%543121.91%543121...,58161.0,68.0,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,568964.41%568964.41%568964.41%568964.41%568964...,83640.0,70.0,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,368672.38%368672.38%368672.38%368672.38%368672...,0.0,96.0,Personal Auto,Two-Door Car,28.460568


In [56]:
data['Customer Lifetime Value'] =  pd.to_numeric(data['Customer Lifetime Value'], errors='coerce')
data

Unnamed: 0,Customer,ST,GENDER,Customer Lifetime Value,Income,Monthly Premium Auto,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,,63513.0,70.0,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,,58161.0,68.0,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,,83640.0,70.0,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,,0.0,96.0,Personal Auto,Two-Door Car,28.460568


* Check and remove duplicates

In [57]:
data = data.drop_duplicates()

* Filter customers with income of 0 or less