# Lab Case Study
___

## Scenario
___
You are working as an analyst for an auto insurance company. The company has collected data about its customers, including their demographics, education, employment, policy details, information on the insured vehicles, and claim amounts. You will help senior management answer business questions so they can better understand their customers, improve their services, and improve profitability.


## Business Objectives
___
- Retain customers.
- Analyze relevant customer data.
- Develop focused customer retention programs.

Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.

## Activities
___
- [ ] Aggregate the data into one DataFrame using pandas.
- [ ] Standardize header names.
- [ ] Delete and rearrange columns – delete the column customer, as it is only a unique identifier for each row of data.
- [ ] Work with data types – check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints).
- [ ] Filter data and correct typos – filter the data in the state and gender columns to standardize the text in those columns.
- [ ] Remove duplicates.
- [ ] Replace null values – replace missing values with the column mean (for numerical columns).

## Data
___
The CSV files are provided in the folder. The columns in the files are self-explanatory.

### Importing relevant libraries / modules

In [68]:
import pandas as pd
import numpy as np

### Importing and merging files

In [76]:
file1 = pd.read_csv('Data/file1.csv')
file2 = pd.read_csv('Data/file2.csv')
file3 = pd.read_csv('Data/file3.csv')

In [83]:
file1

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
4003,,,,,,,,,,,
4004,,,,,,,,,,,
4005,,,,,,,,,,,
4006,,,,,,,,,,,


In [82]:
file1.dropna(axis=0, how='all')

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
1066,TM65736,Oregon,M,Master,305955.03%,38644.0,78.0,1/1/00,Personal Auto,Four-Door Car,361.455219
1067,VJ51327,Cali,F,High School or Below,2031499.76%,63209.0,102.0,1/2/00,Personal Auto,SUV,207.320041
1068,GS98873,Arizona,F,Bachelor,323912.47%,16061.0,88.0,1/0/00,Personal Auto,Four-Door Car,633.600000
1069,CW49887,California,F,Master,462680.11%,79487.0,114.0,1/0/00,Special Auto,SUV,547.200000
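
Note that `dropna` returns a new DataFrame and leaves `file1` unchanged unless the result is assigned back. A minimal sketch on a small made-up frame (the data and the names `toy` and `cleaned` are assumptions for illustration, not the lab's files):

```python
import numpy as np
import pandas as pd

# Toy stand-in for file1: two real rows followed by two completely empty rows.
toy = pd.DataFrame({
    "Customer": ["RB50392", "QZ44356", np.nan, np.nan],
    "Income": [0.0, 0.0, np.nan, np.nan],
})

# dropna returns a new object; `toy` itself still has four rows afterwards.
cleaned = toy.dropna(axis=0, how="all")
rows_before = len(toy)
rows_after = len(cleaned)

# Assigning the result back is what makes the change stick.
toy = toy.dropna(axis=0, how="all").reset_index(drop=True)
```

In this notebook the result is not assigned, so the empty rows are presumably still part of `file1` when the files are concatenated, which is consistent with the 2937 missing values reported in later cells.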


In [5]:
file3

Unnamed: 0,Customer,State,Customer Lifetime Value,Education,Gender,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Total Claim Amount,Vehicle Class
0,SA25987,Washington,3479.137523,High School or Below,M,0,104,0,Personal Auto,499.200000,Two-Door Car
1,TB86706,Arizona,2502.637401,Master,M,0,66,0,Personal Auto,3.468912,Two-Door Car
2,ZL73902,Nevada,3265.156348,Bachelor,F,25820,82,0,Personal Auto,393.600000,Four-Door Car
3,KX23516,California,4455.843406,High School or Below,F,0,121,0,Personal Auto,699.615192,SUV
4,FN77294,California,7704.958480,High School or Below,M,30366,101,2,Personal Auto,484.800000,SUV
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,23405.987980,Bachelor,M,71941,73,0,Personal Auto,198.234764,Four-Door Car
7066,PK87824,California,3096.511217,College,F,21604,79,0,Corporate Auto,379.200000,Four-Door Car
7067,TD14365,California,8163.890428,Bachelor,M,0,85,3,Corporate Auto,790.784983,Four-Door Car
7068,UP19263,California,7524.442436,College,M,21941,96,0,Personal Auto,691.200000,Four-Door Car


In [6]:
df = pd.concat([file1, file2, file3]).reset_index(drop=True)
df

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,,
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,,
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,,
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,,
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,LA72316,,,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
12070,PK87824,,,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
12071,TD14365,,,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
12072,UP19263,,,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M


Some observations about the data:
- There are two columns for state ('ST' and 'State') and two columns for gender ('GENDER' and 'Gender')
    - That's due to 'file3' having a different structure compared to files 1 and 2.
- The column 'Customer Lifetime Value' has inconsistent values (some are missing the '%' symbol)
- The column 'Number of Open Complaints' also has inconsistent values
- For analysis purposes, the 'Customer' column doesn't add any useful information
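
One way to avoid the duplicated state and gender columns is to align the headers before concatenating. A sketch on toy data, assuming (as noted above) that only 'file3' uses the 'State' / 'Gender' spelling; the frames `old_layout` and `new_layout` and the dict `header_map` are hypothetical names:

```python
import pandas as pd

# Toy frames mimicking the two header layouts (values are illustrative only).
old_layout = pd.DataFrame(
    {"Customer": ["RB50392"], "ST": ["Washington"], "GENDER": ["F"]}
)
new_layout = pd.DataFrame(
    {"Customer": ["SA25987"], "State": ["Washington"], "Gender": ["M"]}
)

# Map the file1/file2-style headers onto the file3-style ones before concatenating.
header_map = {"ST": "State", "GENDER": "Gender"}
combined = pd.concat(
    [old_layout.rename(columns=header_map), new_layout],
    ignore_index=True,
)
```

With aligned headers the concatenated frame has a single 'State' and a single 'Gender' column, so no values need to be merged back together afterwards.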

### Standardizing header names

In [7]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df

Unnamed: 0,customer,st,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount,state,gender.1
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,,
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,,
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,,
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,,
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12069,LA72316,,,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
12070,PK87824,,,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
12071,TD14365,,,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
12072,UP19263,,,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M
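
Lowercasing makes 'GENDER' and 'Gender' collapse into the same label, so the frame now holds two columns called 'gender' (the second one is shown as 'gender.1' in the table above). A quick check for such collisions, sketched on a shortened, illustrative list of headers:

```python
import pandas as pd

# Shortened list of headers as they look after the concat (illustrative subset).
columns = pd.Index(["Customer", "ST", "GENDER", "Education", "State", "Gender"])

# Same normalization as in the cell above.
normalized = columns.str.lower().str.replace(" ", "_")

# Index.duplicated marks the second and later occurrences of a label.
duplicated_labels = normalized[normalized.duplicated()].tolist()
```

An empty `duplicated_labels` list would mean every column can still be addressed unambiguously by name.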


### Deleting unnecessary columns

In [8]:
df.drop('customer', axis=1, inplace=True)

In [9]:
df

Unnamed: 0,st,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount,state,gender.1
0,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,,
1,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,,
2,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,,
3,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,,
4,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,,
...,...,...,...,...,...,...,...,...,...,...,...,...
12069,,,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
12070,,,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
12071,,,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
12072,,,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M


### Filtering data and correcting typos

#### 'st' vs 'state'

In [10]:
df['st'].value_counts(dropna=False)

NaN           10007
Oregon          623
California      488
Arizona         328
Nevada          223
Washington      181
Cali            120
AZ               74
WA               30
Name: st, dtype: int64

In [11]:
df['state'].value_counts(dropna=False)

NaN           5004
California    2544
Oregon        1978
Arizona       1302
Nevada         659
Washington     587
Name: state, dtype: int64

First, let's standardize the names in column 'st'. Then we'll use 'state' to fill the null values in 'st' that come from the way the files were concatenated.

In [12]:
df['st'] = df['st'].replace({'Cali': 'California', 'AZ': 'Arizona', 'WA': 'Washington'})

In [13]:
df['st'] = df['st'].fillna(df['state'])
df['st'].value_counts(dropna=False)

California    3152
NaN           2937
Oregon        2601
Arizona       1704
Nevada         882
Washington     798
Name: st, dtype: int64

Now all our information about the states is condensed in the column 'st'. We can drop the second column and rename 'st' to 'state'.

In [14]:
df.drop('state', axis=1, inplace=True)

In [15]:
df.rename(columns={'st': 'state'}, inplace=True)
df

Unnamed: 0,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount,gender.1
0,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,
1,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,
2,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,
3,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,
4,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,
...,...,...,...,...,...,...,...,...,...,...,...
12069,California,,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,M
12070,California,,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,F
12071,California,,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,M
12072,California,,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,M


#### 'gender'

In [16]:
# only possible with positional indices, as there are two columns with the same name!
df.iloc[:, 1].value_counts(dropna=False)

NaN       10129
F           984
M           874
Male         40
female       30
Femal        17
Name: gender, dtype: int64

In [17]:
df.iloc[:, 10].value_counts(dropna=False)

NaN    5004
F      3576
M      3494
Name: gender, dtype: int64

First, let's standardize the values in the first 'gender' column (index 1). Then we'll use the values from the second column (index 10) to fill the null values that come from the way the files were concatenated.

In [18]:
df.iloc[:, 1] = df.iloc[:, 1].replace('Male', 'M').replace(['female', 'Femal'], 'F')
df.iloc[:, 1].value_counts(dropna=False)

NaN    10129
F       1031
M        914
Name: gender, dtype: int64

In [19]:
df.iloc[:, 1] = df.iloc[:, 1].fillna(df.iloc[:, 10])
df.iloc[:, 1].value_counts(dropna=False)

F      4607
M      4408
NaN    3059
Name: gender, dtype: int64

Now all our information about gender is condensed in one column ('gender', index 1) and we can get rid of the second column (index 10).

In [20]:
df.drop(df.columns[10], axis=1, inplace=True)

In [21]:
df

Unnamed: 0,state,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,Washington,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,Arizona,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,Nevada,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,California,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,Washington,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
12069,California,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
12070,California,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
12071,California,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
12072,California,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000
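
The table above no longer has any 'gender' column: `df.columns[10]` evaluates to the label 'gender', and `drop` removes every column carrying that label. A sketch of one way to drop only the second column by position (toy data; `toy` and `keep` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy frame with a duplicated 'gender' label, as in the notebook at this point.
toy = pd.DataFrame(
    [["Washington", "F", np.nan], ["California", "M", "M"]],
    columns=["state", "gender", "gender"],
)

# Dropping by label removes BOTH 'gender' columns.
dropped_by_label = toy.drop(toy.columns[2], axis=1)

# Selecting by position keeps the first 'gender' column and discards the second.
keep = [i for i in range(toy.shape[1]) if i != 2]
dropped_by_position = toy.iloc[:, keep]
```

Selecting the positions to keep with `iloc` sidesteps label lookup entirely, which is why it still works when labels are duplicated.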


### Checking and correcting data types

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12074 entries, 0 to 12073
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   state                      9137 non-null   object 
 1   education                  9137 non-null   object 
 2   customer_lifetime_value    9130 non-null   object 
 3   income                     9137 non-null   float64
 4   monthly_premium_auto       9137 non-null   float64
 5   number_of_open_complaints  9137 non-null   object 
 6   policy_type                9137 non-null   object 
 7   vehicle_class              9137 non-null   object 
 8   total_claim_amount         9137 non-null   float64
dtypes: float64(3), object(6)
memory usage: 849.1+ KB


One would expect the columns 'customer_lifetime_value' and 'number_of_open_complaints' to be numeric, but both have dtype object. Let's investigate these columns.

In [41]:
df['number_of_open_complaints'].value_counts(dropna=False)  ## get the middle number

0         5629
NaN       2937
1/0/00    1626
1          765
2          283
1/1/00     247
3          230
4          119
1/2/00      93
1/3/00      60
5           44
1/4/00      29
1/5/00      12
Name: number_of_open_complaints, dtype: int64

In [74]:
df['number_of_open_complaints'].str.contains('/').sum()

2067

In [59]:
# count the strings that contain anything other than digits
df['customer_lifetime_value'].str.contains(r'(?!^\d+$)^.+$').sum()

2060

In [60]:
df['customer_lifetime_value'].str.contains('%').sum()

2060

The two counts match, so '%' appears to be the only problematic character in the string values. Let's strip it and convert the values to float.

In [58]:
df['customer_lifetime_value'].str.replace('%', '').astype(float)

0               NaN
1         697953.59
2        1288743.17
3         764586.18
4         536307.65
            ...    
12069           NaN
12070           NaN
12071           NaN
12072           NaN
12073           NaN
Name: customer_lifetime_value, Length: 12074, dtype: float64
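
In the output above the last rows are NaN even though those rows held plain numbers before: the `.str` accessor returns NaN for values that are not strings, so the numeric entries are lost. A sketch that handles both kinds of value on toy data (the names `clv`, `lossy`, and `safe` are hypothetical; whether the '%' values should also be rescaled is not settled here and is left out):

```python
import numpy as np
import pandas as pd

# Toy column mixing '%'-suffixed strings, a plain float, and a missing value.
clv = pd.Series(["697953.59%", 3479.137523, np.nan, "1288743.17%"], dtype=object)

# .str methods yield NaN for the non-string entry, which loses the float.
lossy = clv.str.replace("%", "", regex=False).astype(float)

# Casting everything to str first keeps the float; the text 'nan' parses back to NaN.
safe = pd.to_numeric(
    clv.astype(str).str.replace("%", "", regex=False),
    errors="coerce",
)
```

Comparing `lossy.notna().sum()` with `safe.notna().sum()` gives a quick sense of how many values the direct `.str` approach would discard.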

In [64]:
str1 = '1/1/00'

int(str1.split('/')[1])

1

In [69]:
'/' in np.nan

TypeError: argument of type 'float' is not iterable
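
The `in` test fails on NaN because NaN is a float, so a per-value parser has to handle missing values before looking for '/'. A sketch of such a helper (the name `parse_complaints` is hypothetical, and it assumes that the middle number of strings like '1/2/00' is the complaint count):

```python
import numpy as np
import pandas as pd

def parse_complaints(value):
    """Return the complaint count as a float, or NaN if the value is missing."""
    if pd.isna(value):
        return np.nan
    text = str(value)
    if "/" in text:
        # Assumed layout 'x/N/yy': take the middle field.
        return float(text.split("/")[1])
    return float(text)

# Toy column mixing both encodings, a plain integer, and a missing value.
complaints = pd.Series(["1/0/00", "1/2/00", 3, "4", np.nan], dtype=object)
parsed = complaints.map(parse_complaints)
```

Returning floats keeps NaN representable; the column could be cast to an integer type later, once the missing values have been dealt with.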