# Lab | Customer Analysis Round 1 |
##### Isidre Munné-Bertran

### Abstract 
The objective of this analysis is to understand customer demographics and buying behavior. Later in the week, we will use predictive analytics to identify the most profitable customers and how they interact. After that, we will take targeted actions to increase profitable customer response, retention, and growth.

For this lab, we will gather the data from three CSV files that are provided in the `files_for_lab` folder.

### Instructions
* Read the three files into python as dataframes
* Show the DataFrame's shape.
* Standardize header names.
* Rearrange the columns in the dataframe as needed
* Concatenate the three dataframes
* Which columns are numerical?
* Which columns are categorical?
* Understand the meaning of all columns
* Perform the data cleaning operations mentioned so far in class
    * Delete the column education and the number of open complaints from the dataframe.
    * Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and change dtype to `numerical` type.
    * Check for duplicate rows in the data and remove if any.
    * Filter out the data for customers who have an income of 0 or less.
    
### Our structure as Ironhackers
1. Case Study
2. Get data
3. Cleaning/Wrangling/EDA
4. Processing Data
5. Modeling
6. Validation
7. Reporting

### 0. Importing Libraries

In [1]:
# First, we import the libraries we are going to use:
import pandas as pd
import numpy as np

### 1. Read the three files into python as dataframes
Let's start by reading the three files into dataframes and then continue with data exploration.

In [2]:
# We use pandas to open the files
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')
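
If the CSV files sit in a separate folder such as `files_for_lab`, the three reads can be written as one loop. This is only a minimal sketch: the helper name `read_csv_files` is hypothetical, and the folder location is an assumption that should be adjusted to wherever the files are actually stored.

```python
from pathlib import Path

import pandas as pd


def read_csv_files(folder, filenames):
    """Read each CSV in `filenames` from `folder` and return a list of dataframes."""
    folder = Path(folder)
    return [pd.read_csv(folder / name) for name in filenames]


# Hypothetical usage; the folder name is an assumption:
# df1, df2, df3 = read_csv_files("files_for_lab", ["file1.csv", "file2.csv", "file3.csv"])
```

Keeping the file names in a list makes it easy to add a fourth file later without repeating the `pd.read_csv` call.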

### 2. Show the dataframe shape (Exploring the Data)
Now, we will create a function to `explore` our dataframes: it prints each dataset's **shape** and column names, then returns the dataframe so the notebook displays it.

In [3]:
def explore_data(ds):
    shape = ds.shape # (number of rows, number of columns) of the dataframe
    columns = ds.columns # Index with the column names
    print("The dataframe shape is", shape, ".")
    print(columns)
    return ds # Returning the dataframe lets the notebook render it as a table

In [4]:
explore_data(df1)

The dataframe shape is (4008, 11) .
Index(['Customer', 'ST', 'GENDER', 'Education', 'Customer Lifetime Value',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Vehicle Class', 'Total Claim Amount'],
      dtype='object')


Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
4003,,,,,,,,,,,
4004,,,,,,,,,,,
4005,,,,,,,,,,,
4006,,,,,,,,,,,


This dataset has 4008 rows of customer data and 11 columns. The columns that might be useful for analyzing **customer demographics** are `Customer`, `ST` (state), `GENDER`, `Education`, `Income` and `Customer Lifetime Value`. Since we also want to understand **buying behavior**, we will use `Vehicle Class`, `Monthly Premium Auto`, `Policy Type` and `Total Claim Amount`. The column names seem to be self-explanatory, which is useful since we do not have the documentation.

Exploring the dataset, we observe missing (`NaN`) values, including rows at the end that appear to be entirely empty.
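
The missing values can be quantified rather than spotted by eye. The sketch below uses a small made-up dataframe (`sample` is a stand-in, not the real `df1`) to count `NaN` values per column and the rows in which every column is empty.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1: one partial row, one complete row, one empty row
sample = pd.DataFrame({
    "Customer": ["RB50392", "QZ44356", np.nan],
    "GENDER": [np.nan, "F", np.nan],
    "Income": [0.0, 0.0, np.nan],
})

missing_per_column = sample.isna().sum()      # NaN count in each column
empty_rows = sample.isna().all(axis=1).sum()  # rows where every column is NaN

print(missing_per_column)
print("Completely empty rows:", empty_rows)
```

Running the same two lines on `df1` would show how many of its 4008 rows actually hold data.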

In [5]:
explore_data(df2)

The dataframe shape is (996, 11) .
Index(['Customer', 'ST', 'GENDER', 'Education', 'Customer Lifetime Value',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Total Claim Amount', 'Policy Type', 'Vehicle Class'],
      dtype='object')


Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Total Claim Amount,Policy Type,Vehicle Class
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,633.600000,Personal Auto,Four-Door Car
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,547.200000,Special Auto,SUV
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,537.600000,Personal Auto,Two-Door Car
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,1027.200000,Personal Auto,Luxury Car
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,451.200000,Corporate Auto,Two-Door Car
...,...,...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,Master,847141.75%,63513,70,1/0/00,185.667213,Personal Auto,Four-Door Car
992,BS91566,Arizona,F,College,543121.91%,58161,68,1/0/00,140.747286,Corporate Auto,Four-Door Car
993,IL40123,Nevada,F,College,568964.41%,83640,70,1/0/00,471.050488,Corporate Auto,Two-Door Car
994,MY32149,California,F,Master,368672.38%,0,96,1/0/00,28.460568,Personal Auto,Two-Door Car


This dataset contains 996 rows of customer data and 11 columns. It follows the same structure as the first one, except that `Total Claim Amount` comes before `Policy Type` and `Vehicle Class`, so we will use the same columns.

Also, in this case the displayed rows do not show any `NaN` missing values.

In [6]:
explore_data(df3)

The dataframe shape is (7070, 11) .
Index(['Customer', 'State', 'Customer Lifetime Value', 'Education', 'Gender',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Total Claim Amount', 'Vehicle Class'],
      dtype='object')


Unnamed: 0,Customer,State,Customer Lifetime Value,Education,Gender,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Total Claim Amount,Vehicle Class
0,SA25987,Washington,3479.137523,High School or Below,M,0,104,0,Personal Auto,499.200000,Two-Door Car
1,TB86706,Arizona,2502.637401,Master,M,0,66,0,Personal Auto,3.468912,Two-Door Car
2,ZL73902,Nevada,3265.156348,Bachelor,F,25820,82,0,Personal Auto,393.600000,Four-Door Car
3,KX23516,California,4455.843406,High School or Below,F,0,121,0,Personal Auto,699.615192,SUV
4,FN77294,California,7704.958480,High School or Below,M,30366,101,2,Personal Auto,484.800000,SUV
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,23405.987980,Bachelor,M,71941,73,0,Personal Auto,198.234764,Four-Door Car
7066,PK87824,California,3096.511217,College,F,21604,79,0,Corporate Auto,379.200000,Four-Door Car
7067,TD14365,California,8163.890428,Bachelor,M,0,85,3,Corporate Auto,790.784983,Four-Door Car
7068,UP19263,California,7524.442436,College,M,21941,96,0,Personal Auto,691.200000,Four-Door Car


The last dataset contains 7070 rows of customer data and 11 columns. It differs from the other two: it has a `State` column instead of `ST`, a `Gender` column instead of `GENDER`, a different column order, and no `%` symbol in `Customer Lifetime Value`. We will correct these differences now, while cleaning the data:

### 3. Standardize header names (Cleaning the Data)
For this exercise we will standardize some of the column names before joining the data. This means:
* First, making all column names lowercase and replacing spaces with `_`
* Changing the column name `st` to `state` in the first two dataframes.
* Modifying the order in which the columns are displayed.

In [7]:
# We define a function that returns the dataframe with cleaned column names
# (lowercase, with spaces replaced by underscores)
# cc = cleaned columns
def cc(ds):
    ds.columns = [column.lower().replace(' ', '_') for column in ds.columns]
    ds = ds.rename(columns={"st": "state"}) # Renames "st" to "state" when that column exists
    return ds
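
Note that `cc` assigns to `ds.columns`, so it also changes the dataframe that was passed in. A variant that leaves its input untouched could look like the sketch below; `standardize_headers` is a hypothetical name, and the extra `strip()` is an added assumption to guard against stray spaces in headers.

```python
import pandas as pd


def standardize_headers(df):
    """Return a copy with lowercase, underscore-separated headers and `st` renamed to `state`."""
    renamed = df.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))
    return renamed.rename(columns={"st": "state"})


raw = pd.DataFrame(columns=["Customer", "ST", "GENDER", "Customer Lifetime Value"])
clean = standardize_headers(raw)
print(list(clean.columns))
print(list(raw.columns))  # the original headers are unchanged
```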

In [8]:
# And we will use it for each dataframe, checking the result:
# First data frame
df1 = cc(df1)
cc(df1)

Unnamed: 0,customer,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
4003,,,,,,,,,,,
4004,,,,,,,,,,,
4005,,,,,,,,,,,
4006,,,,,,,,,,,


In [9]:
df2 = cc(df2)
cc(df2)

Unnamed: 0,customer,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,total_claim_amount,policy_type,vehicle_class
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,633.600000,Personal Auto,Four-Door Car
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,547.200000,Special Auto,SUV
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,537.600000,Personal Auto,Two-Door Car
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,1027.200000,Personal Auto,Luxury Car
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,451.200000,Corporate Auto,Two-Door Car
...,...,...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,Master,847141.75%,63513,70,1/0/00,185.667213,Personal Auto,Four-Door Car
992,BS91566,Arizona,F,College,543121.91%,58161,68,1/0/00,140.747286,Corporate Auto,Four-Door Car
993,IL40123,Nevada,F,College,568964.41%,83640,70,1/0/00,471.050488,Corporate Auto,Two-Door Car
994,MY32149,California,F,Master,368672.38%,0,96,1/0/00,28.460568,Personal Auto,Two-Door Car


In [10]:
df3 = cc(df3)
cc(df3)

Unnamed: 0,customer,state,customer_lifetime_value,education,gender,income,monthly_premium_auto,number_of_open_complaints,policy_type,total_claim_amount,vehicle_class
0,SA25987,Washington,3479.137523,High School or Below,M,0,104,0,Personal Auto,499.200000,Two-Door Car
1,TB86706,Arizona,2502.637401,Master,M,0,66,0,Personal Auto,3.468912,Two-Door Car
2,ZL73902,Nevada,3265.156348,Bachelor,F,25820,82,0,Personal Auto,393.600000,Four-Door Car
3,KX23516,California,4455.843406,High School or Below,F,0,121,0,Personal Auto,699.615192,SUV
4,FN77294,California,7704.958480,High School or Below,M,30366,101,2,Personal Auto,484.800000,SUV
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,23405.987980,Bachelor,M,71941,73,0,Personal Auto,198.234764,Four-Door Car
7066,PK87824,California,3096.511217,College,F,21604,79,0,Corporate Auto,379.200000,Four-Door Car
7067,TD14365,California,8163.890428,Bachelor,M,0,85,3,Corporate Auto,790.784983,Four-Door Car
7068,UP19263,California,7524.442436,College,M,21941,96,0,Personal Auto,691.200000,Four-Door Car


### 4. Rearrange the columns in the dataframe as needed (Cleaning the Data)
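Again, we define a function, this time to arrange the columns in a common order. As written, it returns only the reordered column index and does not reassign the dataframes, so `df1`, `df2` and `df3` keep their original column order: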

In [11]:
def rc(ds):
    ds = ds[['customer', 'state', 'gender', 'policy_type', 'vehicle_class', 'education','customer_lifetime_value', 'income',
 'monthly_premium_auto', 'number_of_open_complaints', 'total_claim_amount']]
    return ds.columns

In [12]:
rc(df1)

Index(['customer', 'state', 'gender', 'policy_type', 'vehicle_class',
       'education', 'customer_lifetime_value', 'income',
       'monthly_premium_auto', 'number_of_open_complaints',
       'total_claim_amount'],
      dtype='object')

In [13]:
rc(df2)

Index(['customer', 'state', 'gender', 'policy_type', 'vehicle_class',
       'education', 'customer_lifetime_value', 'income',
       'monthly_premium_auto', 'number_of_open_complaints',
       'total_claim_amount'],
      dtype='object')

In [14]:
rc(df3)

Index(['customer', 'state', 'gender', 'policy_type', 'vehicle_class',
       'education', 'customer_lifetime_value', 'income',
       'monthly_premium_auto', 'number_of_open_complaints',
       'total_claim_amount'],
      dtype='object')
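
Since `rc` returns `ds.columns`, the calls above only display the target order. To actually rearrange a dataframe, the function would need to return the reordered dataframe and the result would need to be assigned. A minimal sketch, where `reorder_columns` is a hypothetical name and the three-column frame is made-up data:

```python
import pandas as pd


def reorder_columns(df, order):
    """Return `df` with its columns in `order`; raises KeyError if a column is missing."""
    return df[order]


shuffled = pd.DataFrame(
    [["F", "AB12345", "Arizona"]],
    columns=["gender", "customer", "state"],
)
ordered = reorder_columns(shuffled, ["customer", "state", "gender"])
print(list(ordered.columns))
```

In the notebook this would be used as `df1 = reorder_columns(df1, order)` with the full list of eleven column names.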

### 5. Concatenate the three dataframes
Now that the column names match, we can concatenate the three datasets. `pd.concat` aligns columns by name, so the differing column order is not a problem:

In [15]:
data = pd.concat([df1, df2, df3])
data

Unnamed: 0,customer,state,gender,education,customer_lifetime_value,income,monthly_premium_auto,number_of_open_complaints,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,M,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764
7066,PK87824,California,F,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000
7067,TD14365,California,M,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983
7068,UP19263,California,M,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000
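
The concatenated dataframe keeps each file's original row labels, so labels such as `0` appear three times. Passing `ignore_index=True` renumbers the rows. The sketch below uses two made-up frames with different column orders to show both the alignment by name and the new index.

```python
import pandas as pd

a = pd.DataFrame({"customer": ["AA1", "AA2"], "state": ["Arizona", "Nevada"]})
b = pd.DataFrame({"state": ["Oregon"], "customer": ["BB1"]})  # different column order

# Columns are aligned by name; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```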


### 6. Which columns are numerical? (Exploring our Data)

In [16]:
data.select_dtypes(include=["float", "int"])

Unnamed: 0,income,monthly_premium_auto,total_claim_amount
0,0.0,1000.0,2.704934
1,0.0,94.0,1131.464935
2,48767.0,108.0,566.472247
3,0.0,106.0,529.881344
4,36357.0,68.0,17.269323
...,...,...,...
7065,71941.0,73.0,198.234764
7066,21604.0,79.0,379.200000
7067,0.0,85.0,790.784983
7068,21941.0,96.0,691.200000


As expected, `income`, `monthly_premium_auto` and `total_claim_amount` are numerical. We also notice that `number_of_open_complaints` and `customer_lifetime_value` are missing from this selection, because they are stored as `object`.

### 7. Which columns are categorical?

In [17]:
data.select_dtypes(exclude=["float", "int"])

Unnamed: 0,customer,state,gender,education,customer_lifetime_value,number_of_open_complaints,policy_type,vehicle_class
0,RB50392,Washington,,Master,,1/0/00,Personal Auto,Four-Door Car
1,QZ44356,Arizona,F,Bachelor,697953.59%,1/0/00,Personal Auto,Four-Door Car
2,AI49188,Nevada,F,Bachelor,1288743.17%,1/0/00,Personal Auto,Two-Door Car
3,WW63253,California,M,Bachelor,764586.18%,1/0/00,Corporate Auto,SUV
4,GA49547,Washington,M,High School or Below,536307.65%,1/0/00,Personal Auto,Four-Door Car
...,...,...,...,...,...,...,...,...
7065,LA72316,California,M,Bachelor,23405.98798,0,Personal Auto,Four-Door Car
7066,PK87824,California,F,College,3096.511217,0,Corporate Auto,Four-Door Car
7067,TD14365,California,M,Bachelor,8163.890428,3,Corporate Auto,Four-Door Car
7068,UP19263,California,M,College,7524.442436,0,Personal Auto,Four-Door Car


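8 of our 11 columns are of type `object`: `customer`, `state`, `gender`, `education`, `customer_lifetime_value`, `number_of_open_complaints`, `policy_type` and `vehicle_class`. Of these, `customer_lifetime_value` and `number_of_open_complaints` hold numeric information stored as text, while the others contain `str` values (`customer` being an identifier).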
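
When only the column names are needed, the two selections can be reduced to lists. A small sketch on made-up data, using `"number"` as the dtype selector so that every numeric dtype is covered:

```python
import pandas as pd

sample = pd.DataFrame({
    "state": ["Arizona", "Nevada"],
    "customer_lifetime_value": ["697953.59%", "1288743.17%"],  # text until cleaned
    "income": [0.0, 48767.0],
})

numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

As in the real data, a numeric quantity stored as text ends up in the categorical list until its dtype is corrected.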

### 8. Understand the meaning of all columns

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer                   9137 non-null   object 
 1   state                      9137 non-null   object 
 2   gender                     9015 non-null   object 
 3   education                  9137 non-null   object 
 4   customer_lifetime_value    9130 non-null   object 
 5   income                     9137 non-null   float64
 6   monthly_premium_auto       9137 non-null   float64
 7   number_of_open_complaints  9137 non-null   object 
 8   policy_type                9137 non-null   object 
 9   vehicle_class              9137 non-null   object 
 10  total_claim_amount         9137 non-null   float64
dtypes: float64(3), object(8)
memory usage: 1.1+ MB


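As we saw before, `customer_lifetime_value` is an `object` instead of a numeric column, because two of the datasets store it as text with a `%` at the end. `number_of_open_complaints` is also an `object`, since the first two datasets use values such as `1/0/00`.
Looking at it in detail, `income`, `monthly_premium_auto` and `total_claim_amount` are all `float64`. The dataframe has 12074 entries but no column has more than 9137 non-null values, so 2937 rows carry no customer identifier.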
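
The gap between 12074 entries and 9137 non-null values suggests rows that are empty in every column. A minimal sketch of removing such rows with `dropna(how="all")`, shown on made-up data rather than the real `data` frame:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "customer": ["RB50392", np.nan, "QZ44356"],
    "gender": [np.nan, np.nan, "F"],
    "income": [0.0, np.nan, 0.0],
})

# how="all" drops a row only when every column in it is NaN
non_empty = sample.dropna(how="all").reset_index(drop=True)
print(non_empty)
```

The first row survives even though `gender` is missing, because it still contains other values.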

### 9. Perform the data cleaning operations mentioned so far in class (Cleaning)

* Delete the column education and the number of open complaints from the dataframe.

In [19]:
data = data.drop(columns=["education", "number_of_open_complaints"])
print(data.columns)

Index(['customer', 'state', 'gender', 'customer_lifetime_value', 'income',
       'monthly_premium_auto', 'policy_type', 'vehicle_class',
       'total_claim_amount'],
      dtype='object')


* Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and change `dtype` to `numerical` type.

In [20]:
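# We select our target column from our dataset and strip the "%" with the str.replace method.
# Note: the .str accessor returns NaN for entries that are not strings,
# so values that were already numeric (the rows from the third file) become NaN here.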
data['customer_lifetime_value'] = data['customer_lifetime_value'].str.replace("%", "")
data['customer_lifetime_value']

0              NaN
1        697953.59
2       1288743.17
3        764586.18
4        536307.65
           ...    
7065           NaN
7066           NaN
7067           NaN
7068           NaN
7069           NaN
Name: customer_lifetime_value, Length: 12074, dtype: object

In [21]:
# Now that we removed the %, we can change the type of our data column:
data = data.astype({'customer_lifetime_value':'float'})
data['customer_lifetime_value']

0              NaN
1        697953.59
2       1288743.17
3        764586.18
4        536307.65
           ...    
7065           NaN
7066           NaN
7067           NaN
7068           NaN
7069           NaN
Name: customer_lifetime_value, Length: 12074, dtype: float64

In [22]:
# Lastly, we multiply the values by 100 (a vectorized operation, so no lambda is needed)
data['customer_lifetime_value'] = data['customer_lifetime_value'] * 100
data['customer_lifetime_value']

0               NaN
1        69795359.0
2       128874317.0
3        76458618.0
4        53630765.0
           ...     
7065            NaN
7066            NaN
7067            NaN
7068            NaN
7069            NaN
Name: customer_lifetime_value, Length: 12074, dtype: float64
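
The output above shows `NaN` for rows 7065 to 7069, although those rows held numeric values before the `%` was stripped. One way to keep them is to convert every entry to text first and then parse it with `pd.to_numeric`. This is a sketch on a made-up Series, not the author's method, and it leaves open whether the values from the third file are on the same scale as the percent strings.

```python
import numpy as np
import pandas as pd

# Mixed column: a percent string, an already numeric value, and a missing value
clv = pd.Series(["697953.59%", 3479.137523, np.nan], dtype="object")

as_text = clv.astype(str).str.rstrip("%")              # every entry becomes text
clv_numeric = pd.to_numeric(as_text, errors="coerce")  # unparseable entries become NaN

print(clv_numeric)
```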

* Check for duplicate rows in the data and remove if any.

In [23]:
# Again we use a pandas method, duplicated(), which returns a boolean Series: True for each row that repeats an earlier row
# This only flags duplicates; it does not remove them
data.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
7065    False
7066    False
7067    False
7068    False
7069    False
Length: 12074, dtype: bool
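
The truncated output does not reveal whether any row is flagged, and `duplicated()` alone removes nothing. A short sketch on made-up data that counts the duplicates and then drops them:

```python
import pandas as pd

sample = pd.DataFrame({
    "customer": ["AA1", "AA2", "AA1"],
    "income": [100.0, 200.0, 100.0],
})

n_duplicates = sample.duplicated().sum()  # rows that repeat an earlier row
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print(deduplicated)
```

In the notebook the result would be assigned back, for example `data = data.drop_duplicates()`.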

* Filter out the data for customers who have an income of 0 or less.

In [24]:
data[data['income'] <= 0] # Selects the rows where income is less than or equal to 0

Unnamed: 0,customer,state,gender,customer_lifetime_value,income,monthly_premium_auto,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,69795359.0,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
3,WW63253,California,M,76458618.0,0.0,106.0,Corporate Auto,SUV,529.881344
7,CF85061,Arizona,M,72161003.0,0.0,101.0,Corporate Auto,Four-Door Car,363.029680
10,SX51350,California,M,47389920.0,0.0,67.0,Personal Auto,Four-Door Car,482.400000
...,...,...,...,...,...,...,...,...,...
7059,WZ45103,California,F,,0.0,76.0,Personal Auto,Four-Door Car,364.800000
7061,RX91025,California,M,,0.0,185.0,Personal Auto,SUV,1950.725547
7062,AC13887,California,M,,0.0,67.0,Corporate Auto,Two-Door Car,482.400000
7067,TD14365,California,M,,0.0,85.0,Corporate Auto,Four-Door Car,790.784983


2294 customers have an income of 0 or less.
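
The cell above selects the customers with an income of 0 or less. If "filter out" is read as removing them, the condition is inverted and the result assigned. A minimal sketch on made-up data; note that a comparison with `NaN` is `False`, so rows with a missing income are dropped as well.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "customer": ["AA1", "AA2", "AA3"],
    "income": [0.0, 48767.0, np.nan],
})

n_zero_or_less = (sample["income"] <= 0).sum()  # how many rows match the condition
with_income = sample[sample["income"] > 0]      # keeps only strictly positive incomes

print("Income of 0 or less:", n_zero_or_less)
print(with_income)
```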

### Extra: Replacing null values
Lastly, as good practice, we will replace all `NaN` values with `0`.

In [25]:
# Original NaN values
data

Unnamed: 0,customer,state,gender,customer_lifetime_value,income,monthly_premium_auto,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,69795359.0,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,128874317.0,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,76458618.0,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,53630765.0,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,M,,71941.0,73.0,Personal Auto,Four-Door Car,198.234764
7066,PK87824,California,F,,21604.0,79.0,Corporate Auto,Four-Door Car,379.200000
7067,TD14365,California,M,,0.0,85.0,Corporate Auto,Four-Door Car,790.784983
7068,UP19263,California,M,,21941.0,96.0,Personal Auto,Four-Door Car,691.200000


In [26]:
# Replace NaN with 0 (fillna returns a new dataframe; `data` is unchanged unless the result is assigned)
data.fillna(0)

Unnamed: 0,customer,state,gender,customer_lifetime_value,income,monthly_premium_auto,policy_type,vehicle_class,total_claim_amount
0,RB50392,Washington,0,0.0,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,69795359.0,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,128874317.0,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,76458618.0,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,53630765.0,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,M,0.0,71941.0,73.0,Personal Auto,Four-Door Car,198.234764
7066,PK87824,California,F,0.0,21604.0,79.0,Corporate Auto,Four-Door Car,379.200000
7067,TD14365,California,M,0.0,0.0,85.0,Corporate Auto,Four-Door Car,790.784983
7068,UP19263,California,M,0.0,21941.0,96.0,Personal Auto,Four-Door Car,691.200000
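
Filling every column with `0` also puts a `0` into text columns such as `gender`, as the first row above shows. `fillna` accepts a dictionary, so each column can get its own replacement. In this sketch the placeholder `"Unknown"` is an assumption chosen for illustration, and the data is made up.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "gender": ["F", np.nan],
    "customer_lifetime_value": [69795359.0, np.nan],
})

# One replacement value per column; the result must be assigned to be kept
filled = sample.fillna({"gender": "Unknown", "customer_lifetime_value": 0})
print(filled)
```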
