# Lab Case Study
___

## Scenario
___
You are working as an analyst for an auto insurance company. The company has collected data about its customers, including their demographics, education, employment, policy details, insured vehicle information, and claim amounts. You will help senior management answer business questions so that they can better understand their customers, improve their services, and improve profitability.

## Business Objectives
___
- Retain customers.
- Analyze relevant customer data.
- Develop focused customer retention programs.

Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.


## Activities
___
### Part 1
- [ ] Aggregate data into one DataFrame using Pandas.
- [ ] Standardizing header names
- [ ] Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data
- [ ] Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints)
- [ ] Filtering data and Correcting typos – Filter the data in state and gender column to standardize the texts in those columns
- [ ] Removing duplicates
- [ ] Replacing null values – Replace missing values with means of the column (for numerical columns)

### Part 2
- [ ] Bucketing the data - Write a function that maps the "State" column to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central
- [ ] Standardizing the data – Use string functions to standardize the text data (lower case)

### Part 3
- [ ] Which columns are numerical?
- [ ] Which columns are categorical?
- [ ] Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February, and March. Hint: If data from March does not exist, consider only January and February.

___

### Importing relevant libraries / modules

In [1]:
import pandas as pd
import numpy as np

### Importing and merging files

In [2]:
file1 = pd.read_csv('Data/file1.csv')
file2 = pd.read_csv('Data/file2.csv')
file3 = pd.read_csv('Data/file3.csv')

In [3]:
file1

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
4003,,,,,,,,,,,
4004,,,,,,,,,,,
4005,,,,,,,,,,,
4006,,,,,,,,,,,


In [4]:
file2

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Total Claim Amount,Policy Type,Vehicle Class
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,633.600000,Personal Auto,Four-Door Car
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,547.200000,Special Auto,SUV
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,537.600000,Personal Auto,Two-Door Car
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,1027.200000,Personal Auto,Luxury Car
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,451.200000,Corporate Auto,Two-Door Car
...,...,...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,Master,847141.75%,63513,70,1/0/00,185.667213,Personal Auto,Four-Door Car
992,BS91566,Arizona,F,College,543121.91%,58161,68,1/0/00,140.747286,Corporate Auto,Four-Door Car
993,IL40123,Nevada,F,College,568964.41%,83640,70,1/0/00,471.050488,Corporate Auto,Two-Door Car
994,MY32149,California,F,Master,368672.38%,0,96,1/0/00,28.460568,Personal Auto,Two-Door Car


In [5]:
file3

Unnamed: 0,Customer,State,Customer Lifetime Value,Education,Gender,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Total Claim Amount,Vehicle Class
0,SA25987,Washington,3479.137523,High School or Below,M,0,104,0,Personal Auto,499.200000,Two-Door Car
1,TB86706,Arizona,2502.637401,Master,M,0,66,0,Personal Auto,3.468912,Two-Door Car
2,ZL73902,Nevada,3265.156348,Bachelor,F,25820,82,0,Personal Auto,393.600000,Four-Door Car
3,KX23516,California,4455.843406,High School or Below,F,0,121,0,Personal Auto,699.615192,SUV
4,FN77294,California,7704.958480,High School or Below,M,30366,101,2,Personal Auto,484.800000,SUV
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,23405.987980,Bachelor,M,71941,73,0,Personal Auto,198.234764,Four-Door Car
7066,PK87824,California,3096.511217,College,F,21604,79,0,Corporate Auto,379.200000,Four-Door Car
7067,TD14365,California,8163.890428,Bachelor,M,0,85,3,Corporate Auto,790.784983,Four-Door Car
7068,UP19263,California,7524.442436,College,M,21941,96,0,Personal Auto,691.200000,Four-Door Car


**Some observations about the files:**

1. It seems like *'file1'* has a lot of rows with only `NaN` values. We can get rid of them.
2. *'file3'* uses different column names ('State', 'Gender') than the other files ('ST', 'GENDER'), which would cause trouble when concatenating.
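To make the second observation concrete, here is a small sketch of what concatenation does with mismatched headers. The two toy frames and their values are made up for illustration; only the column names come from the files above.

```python
import pandas as pd

# Toy stand-ins for file1 and file3 (values are invented)
left = pd.DataFrame({'Customer': ['A1'], 'ST': ['Arizona'], 'GENDER': ['F']})
right = pd.DataFrame({'Customer': ['B2'], 'State': ['Nevada'], 'Gender': ['M']})

# Without renaming, concat keeps both spellings as separate columns
# and fills the gaps with NaN
misaligned = pd.concat([left, right], ignore_index=True)
print(misaligned.columns.tolist())
print(misaligned.isna().sum().sum())

# After renaming, the columns line up and no NaN is introduced
aligned = pd.concat(
    [left, right.rename(columns={'State': 'ST', 'Gender': 'GENDER'})],
    ignore_index=True,
)
print(aligned.shape)
```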

In [6]:
#1. delete rows only if ALL values are missing
file1.dropna(axis=0, how='all', inplace=True)
file1.shape

(1071, 11)

Almost 3,000 rows were deleted from file1. Running the same step on the other two files changes nothing:

```python
file2.dropna(axis=0, how='all', inplace=True)
file2.shape   # (996, 11)

file3.dropna(axis=0, how='all', inplace=True)
file3.shape   # (7070, 11)
```
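The choice of `how='all'` matters here. As a quick sketch on made-up data, compare it with `how='any'`, which would also throw away rows that are only partially empty (such as the first row of file1, where only some fields are missing):

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is partially filled, row 1 is completely empty
toy = pd.DataFrame({
    'ST': ['Washington', np.nan, 'Arizona'],
    'GENDER': [np.nan, np.nan, 'F'],
})

only_empty_rows_removed = toy.dropna(axis=0, how='all')  # drops row 1 only
any_gap_removed = toy.dropna(axis=0, how='any')          # drops rows 0 and 1

print(len(only_empty_rows_removed), len(any_gap_removed))
```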


In [7]:
#2. rename columns in file3 to match the ones in the other files
file3.rename(columns={'State': 'ST', 'Gender': 'GENDER'}, inplace=True)
file3.columns

Index(['Customer', 'ST', 'Customer Lifetime Value', 'Education', 'GENDER',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Total Claim Amount', 'Vehicle Class'],
      dtype='object')

Now we can finally concatenate all the files and start working on them.

In [8]:
customer_data = pd.concat([file1, file2, file3]).reset_index(drop=True)
customer_data

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
9132,LA72316,California,M,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
9133,PK87824,California,F,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
9134,TD14365,California,M,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
9135,UP19263,California,M,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000


In [9]:
customer_data.nunique()

Customer                     9056
ST                              8
GENDER                          5
Education                       6
Customer Lifetime Value      8211
Income                       5655
Monthly Premium Auto          209
Number of Open Complaints      12
Policy Type                     3
Vehicle Class                   6
Total Claim Amount           5070
dtype: int64

In [10]:
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9137 entries, 0 to 9136
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Customer                   9137 non-null   object 
 1   ST                         9137 non-null   object 
 2   GENDER                     9015 non-null   object 
 3   Education                  9137 non-null   object 
 4   Customer Lifetime Value    9130 non-null   object 
 5   Income                     9137 non-null   float64
 6   Monthly Premium Auto       9137 non-null   float64
 7   Number of Open Complaints  9137 non-null   object 
 8   Policy Type                9137 non-null   object 
 9   Vehicle Class              9137 non-null   object 
 10  Total Claim Amount         9137 non-null   float64
dtypes: float64(3), object(8)
memory usage: 785.3+ KB


**More observations about the data**

- Header names have different naming conventions
- For analysis purposes, the 'Customer' column doesn't add any useful information
- 'GENDER' has more unique values than expected
- The columns 'Customer Lifetime Value' and 'Number of Open Complaints' have inconsistent values and 'object' as their data type, when they are supposed to be numeric
- Only two columns have missing values: 'GENDER' and 'Customer Lifetime Value'
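The last observation can also be read off directly instead of from the `info()` output. A minimal sketch on a toy frame (the values are invented, the column names follow the data above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'GENDER': ['F', np.nan, 'M'],
    'Customer Lifetime Value': ['697953.59%', np.nan, 3479.137523],
    'Income': [0.0, 48767.0, 0.0],
})

# Count missing values per column and keep only the columns that have any
missing_per_column = toy.isna().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```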

### Standardizing header names

In [11]:
# change the columns to lower case and snake case (with an underscore)
customer_data.columns = customer_data.columns.str.lower().str.replace(' ', '_')

# change the column 'st' to a more intuitive name
customer_data.rename(columns={'st': 'state'}, inplace=True)

customer_data

Unnamed: 0,customer,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
9132,LA72316,California,M,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
9133,PK87824,California,F,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
9134,TD14365,California,M,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
9135,UP19263,California,M,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000
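If the same header clean-up has to be applied to several frames, it could be wrapped in a small helper. This is only a sketch: `standardize_headers` is a hypothetical name, and the extra `str.strip()` is an assumption in case headers carry stray whitespace.

```python
import pandas as pd

def standardize_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(' ', '_', regex=False)
    )
    # 'st' is renamed to the more intuitive 'state', as in the cell above
    return out.rename(columns={'st': 'state'})

toy = pd.DataFrame(columns=['Customer', 'ST', 'Monthly Premium Auto'])
new_headers = standardize_headers(toy).columns.tolist()
print(new_headers)
```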


### Deleting unnecessary columns

In [12]:
customer_data.drop('customer', axis=1, inplace=True)

In [13]:
customer_data

Unnamed: 0,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...
9132,California,M,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
9133,California,F,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
9134,California,M,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
9135,California,M,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000


### Filtering data and correcting typos

**'state'**

In [14]:
# check the values in the 'state' column
customer_data['state'].value_counts(dropna=False)

California    3032
Oregon        2601
Arizona       1630
Nevada         882
Washington     768
Cali           120
AZ              74
WA              30
Name: state, dtype: int64

Note that the column has no missing values. Some states, however, appear under inconsistent names ('Cali', 'AZ', 'WA'), so let's correct them.

In [15]:
customer_data['state'] = customer_data['state'].replace({'Cali': 'California', 'AZ': 'Arizona', 'WA': 'Washington'})
customer_data['state'].value_counts()

California    3152
Oregon        2601
Arizona       1704
Nevada         882
Washington     798
Name: state, dtype: int64
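Instead of spotting odd labels by eye in `value_counts()`, we could compare the column against the set of labels we expect. A rough sketch, where `EXPECTED_STATES` is an assumed whitelist built from the five states seen above:

```python
import pandas as pd

# Assumed list of valid labels, taken from the value counts above
EXPECTED_STATES = {'California', 'Oregon', 'Arizona', 'Nevada', 'Washington'}

toy = pd.Series(['California', 'Cali', 'AZ', 'Oregon', 'WA'])

# Labels that still need to be mapped to a canonical name
unexpected = sorted(set(toy.dropna()) - EXPECTED_STATES)
print(unexpected)
```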

**'gender'**

In [16]:
# check the values in the 'gender' column
customer_data['gender'].value_counts(dropna=False)

F         4560
M         4368
NaN        122
Male        40
female      30
Femal       17
Name: gender, dtype: int64

There are some inconsistencies and typos in the values that we need to correct.

In [17]:
customer_data['gender'] = customer_data['gender'].replace({'Male': 'M', 'Femal': 'F', 'female': 'F'})
customer_data['gender'].value_counts(dropna=False)

F      4607
M      4408
NaN     122
Name: gender, dtype: int64
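An explicit mapping works well for the handful of variants found here. As an alternative sketch, assuming every valid spelling starts with 'm' or 'f', the first letter alone could be used; `normalize_gender` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def normalize_gender(series: pd.Series) -> pd.Series:
    """Collapse spellings such as 'Male', 'female', 'Femal' to 'M' / 'F'."""
    first_letter = series.str.strip().str[0].str.upper()
    # Anything that is not M or F (including NaN) is left as missing
    return first_letter.where(first_letter.isin(['M', 'F']))

toy = pd.Series(['F', 'Male', 'female', 'Femal', np.nan])
normalized = normalize_gender(toy)
print(normalized.tolist())
```

The trade-off is that this silently accepts any label beginning with 'm' or 'f', so the explicit mapping remains the safer choice when the variants are known.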

**'education'**

In [18]:
customer_data['education'].value_counts()

Bachelor                2719
College                 2682
High School or Below    2616
Master                   752
Doctor                   344
Bachelors                 24
Name: education, dtype: int64

In [19]:
# Just a small issue regarding 'Bachelor/s'
customer_data['education'] = customer_data['education'].replace('Bachelors', 'Bachelor')
customer_data['education'].value_counts(dropna=False)

Bachelor                2743
College                 2682
High School or Below    2616
Master                   752
Doctor                   344
Name: education, dtype: int64

**'policy_type'**

In [20]:
customer_data['policy_type'].value_counts()

Personal Auto     6792
Corporate Auto    1965
Special Auto       380
Name: policy_type, dtype: int64

**'vehicle_class'**

In [21]:
customer_data['vehicle_class'].value_counts()

Four-Door Car    4641
Two-Door Car     1896
SUV              1774
Sports Car        483
Luxury SUV        182
Luxury Car        161
Name: vehicle_class, dtype: int64

### Correcting data types

**'customer_lifetime_value'**

In [22]:
# count the string values that contain anything other than digits
customer_data['customer_lifetime_value'].str.contains(r'(?!^\d+$)^.+$').sum()

2060

In [23]:
customer_data['customer_lifetime_value'].str.contains('%').sum()

2060

Both counts are 2060, so every value flagged by the first check contains a '%'. Let's strip that character and convert the column to float.

In [24]:
cleaning = lambda x: x.replace('%', '') if type(x) == str else x
customer_data['customer_lifetime_value'] = customer_data['customer_lifetime_value'].apply(cleaning)

In [25]:
customer_data['customer_lifetime_value'] = customer_data['customer_lifetime_value'].astype(float)
customer_data['customer_lifetime_value']

0                NaN
1       6.979536e+05
2       1.288743e+06
3       7.645862e+05
4       5.363077e+05
            ...     
9132    2.340599e+04
9133    3.096511e+03
9134    8.163890e+03
9135    7.524442e+03
9136    2.611837e+03
Name: customer_lifetime_value, Length: 9137, dtype: float64
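The same conversion can be sketched in a vectorized way with `pd.to_numeric`. The helper name `percent_strings_to_float` is hypothetical. Note the `astype(str)` step: the column mixes strings and floats, and the `.str` accessor would return `NaN` for the entries that are already floats.

```python
import numpy as np
import pandas as pd

def percent_strings_to_float(series: pd.Series) -> pd.Series:
    """Strip a trailing '%' and convert everything to float."""
    as_text = series.astype(str).str.rstrip('%')
    # errors='coerce' turns anything unparseable into NaN instead of raising
    return pd.to_numeric(as_text, errors='coerce')

# Toy column mixing a missing value, a percent string, and a plain float
toy = pd.Series([np.nan, '697953.59%', 3479.137523], dtype=object)
cleaned = percent_strings_to_float(toy)
print(cleaned)
```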

**'number_of_open_complaints'**

In [26]:
customer_data['number_of_open_complaints'].value_counts()

0         5629
1/0/00    1626
1          765
2          283
1/1/00     247
3          230
4          119
1/2/00      93
1/3/00      60
5           44
1/4/00      29
1/5/00      12
Name: number_of_open_complaints, dtype: int64

For the values of the form '1/n/00' (with n from 0 to 5), the count we want is the middle number, so let's extract it.

In [27]:
cleaning = lambda x: x.split('/')[1] if type(x) == str else x
# We only want to modify the values stored as strings. By splitting them, we should get the numbers stored as a list
# and we are interested in the second element (index 1)

customer_data['number_of_open_complaints'] = customer_data['number_of_open_complaints'].apply(cleaning).astype(int)
customer_data['number_of_open_complaints'].value_counts()

0    7255
1    1012
2     376
3     290
4     148
5      56
Name: number_of_open_complaints, dtype: int64
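The lambda above assumes every string contains two slashes. A slightly more defensive sketch, using a hypothetical `parse_open_complaints` helper, also copes with plain numeric strings such as '5' (an assumed case that does not occur in this data):

```python
import pandas as pd

def parse_open_complaints(value) -> int:
    """Return the complaint count from either 3 or '1/3/00'."""
    if isinstance(value, str) and '/' in value:
        # '1/3/00' -> ['1', '3', '00'] -> middle element
        return int(value.split('/')[1])
    return int(value)

toy = pd.Series(['1/0/00', '1/3/00', 2, '5'], dtype=object)
parsed = toy.apply(parse_open_complaints)
print(parsed.tolist())
```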

Check that the columns now have the right types:

In [28]:
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9137 entries, 0 to 9136
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   state                      9137 non-null   object 
 1   gender                     9015 non-null   object 
 2   education                  9137 non-null   object 
 3   customer_lifetime_value    9130 non-null   float64
 4   income                     9137 non-null   float64
 5   monthly_premium_auto       9137 non-null   float64
 6   number_of_open_complaints  9137 non-null   int32  
 7   policy_type                9137 non-null   object 
 8   vehicle_class              9137 non-null   object 
 9   total_claim_amount         9137 non-null   float64
dtypes: float64(4), int32(1), object(5)
memory usage: 678.3+ KB


### Removing duplicates

In [29]:
customer_data.drop_duplicates(inplace=True)
customer_data

Unnamed: 0,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,Washington,,Master,,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934
1,Arizona,F,Bachelor,6.979536e+05,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935
2,Nevada,F,Bachelor,1.288743e+06,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247
3,California,M,Bachelor,7.645862e+05,0.0,106.0,0,Corporate Auto,SUV,529.881344
4,Washington,M,High School or Below,5.363077e+05,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...
9132,California,M,Bachelor,2.340599e+04,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
9133,California,F,College,3.096511e+03,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
9134,California,M,Bachelor,8.163890e+03,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
9135,California,M,College,7.524442e+03,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000
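Before dropping, it can help to count how many rows are affected. Keep in mind that the 'customer' identifier was removed earlier, so two different customers with identical attributes are also treated as duplicates. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['California', 'California', 'Oregon'],
    'income': [0.0, 0.0, 21604.0],
})

# duplicated() marks every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```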


### Replacing null values

The only numeric column with missing values is 'customer_lifetime_value'. We'll replace those with the column mean.

In [30]:
customer_data['customer_lifetime_value'] = customer_data['customer_lifetime_value'].fillna(customer_data['customer_lifetime_value'].mean())
customer_data['customer_lifetime_value'].isna().sum()

0
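Here only one numeric column needed filling. If several did, the step could be generalized along these lines (a sketch on toy data; the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'customer_lifetime_value': [100.0, np.nan, 300.0],
    'income': [0.0, 50.0, np.nan],
    'gender': ['F', np.nan, 'M'],
})

# Fill each numeric column with its own mean; text columns are left alone
numeric_cols = toy.select_dtypes(include='number').columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```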

For the 'gender' column, let's use 'Unknown' for the missing values.

In [31]:
customer_data['gender'] = customer_data['gender'].fillna('Unknown')
customer_data['gender'].isna().sum()

0

In [32]:
customer_data.reset_index(drop=True, inplace=True)
customer_data

Unnamed: 0,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,Washington,Unknown,Master,1.855902e+05,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934
1,Arizona,F,Bachelor,6.979536e+05,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935
2,Nevada,F,Bachelor,1.288743e+06,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247
3,California,M,Bachelor,7.645862e+05,0.0,106.0,0,Corporate Auto,SUV,529.881344
4,Washington,M,High School or Below,5.363077e+05,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...
8870,California,M,Bachelor,2.340599e+04,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
8871,California,F,College,3.096511e+03,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
8872,California,M,Bachelor,8.163890e+03,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
8873,California,M,College,7.524442e+03,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000


### Bucketing the data

Map the 'state' column to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central.

In [33]:
customer_data['zones'] = customer_data['state'].replace({
    'California': 'West Region',
    'Oregon': 'North West',
    'Washington': 'East',
    'Arizona': 'Central',
    'Nevada': 'Central'
})
customer_data

Unnamed: 0,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount,zones
0,Washington,Unknown,Master,1.855902e+05,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934,East
1,Arizona,F,Bachelor,6.979536e+05,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935,Central
2,Nevada,F,Bachelor,1.288743e+06,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247,Central
3,California,M,Bachelor,7.645862e+05,0.0,106.0,0,Corporate Auto,SUV,529.881344,West Region
4,Washington,M,High School or Below,5.363077e+05,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323,East
...,...,...,...,...,...,...,...,...,...,...,...
8870,California,M,Bachelor,2.340599e+04,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,West Region
8871,California,F,College,3.096511e+03,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,West Region
8872,California,M,Bachelor,8.163890e+03,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,West Region
8873,California,M,College,7.524442e+03,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,West Region
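The activity asks for a function, so the same mapping could also be written as one. This is a sketch: `state_to_zone` and `ZONES` are hypothetical names, and passing unknown states through unchanged is an assumption about the desired behaviour.

```python
import pandas as pd

ZONES = {
    'California': 'West Region',
    'Oregon': 'North West',
    'Washington': 'East',
    'Arizona': 'Central',
    'Nevada': 'Central',
}

def state_to_zone(state: str) -> str:
    """Return the zone for a state; unknown states are returned unchanged."""
    return ZONES.get(state, state)

toy = pd.Series(['California', 'Oregon', 'Washington', 'Arizona', 'Nevada'])
zones = toy.apply(state_to_zone)
print(zones.tolist())
```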


### Standardizing the data

Standardize the text data (lower case)

In [34]:
text_data = customer_data.columns[customer_data.dtypes == 'object'].to_list()
text_data

['state', 'gender', 'education', 'policy_type', 'vehicle_class', 'zones']
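The same idea answers the Part 3 questions about which columns are numerical and which are categorical. A minimal sketch with `select_dtypes` on a toy frame (values are made up); treating every non-numeric column as categorical is an assumption that holds for this data set:

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['washington', 'arizona'],
    'income': [0.0, 48767.0],
    'number_of_open_complaints': [0, 2],
})

# Numeric dtypes (int, float) versus everything else
numerical_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numerical_cols, categorical_cols)
```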

In [35]:
# lower-case each text column with the vectorized string accessor
customer_data[text_data] = customer_data[text_data].apply(lambda col: col.str.lower())
customer_data[text_data]

Unnamed: 0,state,gender,education,policy_type,vehicle_class,zones
0,washington,unknown,master,personal auto,four-door car,east
1,arizona,f,bachelor,personal auto,four-door car,central
2,nevada,f,bachelor,personal auto,two-door car,central
3,california,m,bachelor,corporate auto,suv,west region
4,washington,m,high school or below,personal auto,four-door car,east
...,...,...,...,...,...,...
8870,california,m,bachelor,personal auto,four-door car,west region
8871,california,f,college,corporate auto,four-door car,west region
8872,california,m,bachelor,corporate auto,four-door car,west region
8873,california,m,college,personal auto,four-door car,west region
