#### Jorge Castro DAPT BER





# Lab (Customer Analysis Final Round)
    



Jump to:
* <a href="#01---Problem-(case-study)">01 - Problem (case study)</a>
    * <a href="#Data-Description">Data Description</a>
    * <a href="#Goal">Goal</a>
* <a href="#02---Getting-Data">02 - Getting Data</a>
    * <a href="#Read-the-.csv-file">Read the .csv file</a>
* <a href="#03---Cleaning/Wrangling/EDA">03 - Cleaning/Wrangling/EDA</a>
    * <a href="#Change-headers-names">Change headers names</a>
    * <a href="#Deal-with-NaN-values">Deal with NaN values</a>
    * <a href="#Categorical-Features">Categorical Features</a>
    * <a href="#Numerical-Features">Numerical Features</a>
    * <a href="#Exploration">Exploration</a>
* <a href="#04---Processing-Data">04 - Processing Data</a>
    * <a href="#Dealing-with-outliers">Dealing with outliers</a>
    * <a href="#Normalization">Normalization</a>
    * <a href="#Encoding-Categorical-Data">Encoding Categorical Data</a>
    * <a href="#Splitting-into-train-set-and-test-set">Splitting into train set and test set</a>
* <a href="#05---Modeling">05 - Modeling</a>
    * <a href="#Apply-model">Apply model</a>
* <a href="#06---Model-Validation">06 - Model Validation</a>
    * <a href="#R2">R2</a>
    * <a href="#MSE">MSE</a>
    * <a href="#RMSE">RMSE</a>
    * <a href="#MAE">MAE</a>
* <a href="#07---Reporting">07 - Reporting</a>
    * <a href="#Present-results">Present results</a>
    


* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

* [Go bottom](#Go-bottom)




In [28]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import sweetviz as sv
%matplotlib inline

# 01 - Problem (case study)
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

##### Data Description

    


The dataset comes from an insurance company.

##### Goal
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

Use predictive analytics to identify the most profitable customers and understand how they interact.

# 02 - Getting Data
##### Read the .csv file
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>


In [29]:
df_d = pd.read_csv('marketing_customer_analysis.csv')
# Showing dataframe shape

df_d.shape

FileNotFoundError: [Errno 2] No such file or directory: 'marketing_customer_analysis.csv'
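
The `FileNotFoundError` above indicates that pandas could not find the file relative to the notebook's working directory. A minimal sketch of a guarded load follows, assuming the CSV is expected next to the notebook. The helper name `load_csv` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a clearer message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the directory that was actually searched (assumption: the
        # file is expected relative to the current working directory).
        raise FileNotFoundError(
            f"{csv_path.name} not found in {csv_path.resolve().parent}"
        )
    return pd.read_csv(csv_path)
```

With this in place, the load would read `df_d = load_csv('marketing_customer_analysis.csv')`, and a missing file would name the folder that was searched.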

In [None]:
df_d.head()

# 03 - Cleaning/Wrangling/EDA
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

##### Change headers names

In [None]:
# Lowercasing headers
df_d.columns = df_d.columns.str.lower()

In [None]:
# Replacing spaces with underscores
df_d.columns = df_d.columns.str.replace(' ', '_')

In [None]:
# Dropping the column 'unnamed:_0' as it duplicates the index
# (the axis is passed by keyword; recent pandas versions reject a positional axis)
df_d = df_d.drop(columns='unnamed:_0')
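
The three header steps above can also be chained. The sketch below runs them on a toy DataFrame with made-up values so the effect is visible without the real CSV; the name `toy` is an assumption used only for illustration.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real CSV
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "Customer Lifetime Value": [1.5, 2.5]})

# Lowercase the headers and replace spaces with underscores in one chain
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Drop the column that only duplicates the index
toy = toy.drop(columns="unnamed:_0")

headers = list(toy.columns)
```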

##### Deal with NaN values
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

In [None]:
# Dealing with null values

# Build a helper DataFrame holding the percentage of nulls in each column,
# then use it as a guide to drop the columns with more than 50% nulls.
nulls_df = pd.DataFrame(round(df_d.isna().sum() / len(df_d) * 100, 2))

# The column names are currently the index of nulls_df; resetting the index
# turns them into a regular column that can be filtered on.
nulls_df = nulls_df.reset_index()

# Renaming the headers of the helper DataFrame
nulls_df.columns = ['header_name', '%_nulls']

# Names of the columns whose null percentage exceeds 50%
columns_drop = nulls_df[nulls_df['%_nulls'] > 50]['header_name']

# Dropping those columns
df1 = df_d.drop(columns_drop, axis=1)

# Dropping columns leaves the row index unchanged, so no index reset is needed here.

# The remaining columns have a low percentage of nulls. Those rows are removed
# further down with dropna(how='any'), which drops every row that contains
# at least one null.

# Helper DataFrame to inspect the remaining null percentages
nulls_percent_2 = pd.DataFrame(round(df1.isna().sum() / len(df1) * 100, 2))
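
As a more compact alternative to the helper DataFrame, the null percentage can be computed with `isna().mean()` and used directly as a column mask. This is a sketch on made-up toy data, not the notebook's own method; the 50% threshold mirrors the one used above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one column is 75% null
toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],
    "some_null": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

# isna().mean() gives the fraction of nulls per column
null_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns at or below the 50% threshold
kept = toy.loc[:, null_pct <= 50]
```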





In [None]:
# Let's have a look at the remaining % of nulls

nulls_percent_2

In [None]:
# Applying the "dropna" method, which has a parameter called 'how' that
# accepts "any" or "all".
# "any" means that a row is dropped if it contains at least one null.

df = df1.dropna(how='any')
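
The difference between the two `how` options is easiest to see on a small example. The sketch below uses a toy DataFrame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 has one null, row 2 is entirely null
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

rows_any = toy.dropna(how="any")  # drops rows with at least one null
rows_all = toy.dropna(how="all")  # drops only rows where every value is null
```

Here `how="any"` keeps only row 0, while `how="all"` keeps rows 0 and 1.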

In [None]:
# Verifying that all the nulls have been removed
nulls_percent_3 = pd.DataFrame(round(df.isna().sum()/len(df)*100,2))

In [None]:
nulls_percent_3

In [None]:
df.head()

In [None]:
# Inspecting columns 2 and 21, whose decimals will be rounded
df.iloc[:, [2,21]]


In [None]:
# Rounding decimals (round already returns a DataFrame, so no wrapper is needed)
df = df.round({'customer_lifetime_value': 2, 'total_claim_amount': 2})
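
Passing a dictionary to `round` affects only the named columns. A small sketch with made-up values shows this; the column `other` is hypothetical and included only to show that it is left untouched.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "customer_lifetime_value": [2763.519279],
    "total_claim_amount": [384.811147],
    "other": [0.123456],  # hypothetical column, not listed in the dict
})

# Only the columns named in the dict are rounded
rounded = toy.round({"customer_lifetime_value": 2, "total_claim_amount": 2})
```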

In [None]:
# Creating a new DataFrame with only the columns worth keeping
# (.copy() makes the subset independent of df)
df_0 = df.iloc[:, [1, 2, 5, 7, 8, 9, 11, 16, 21]].copy()
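
Positional indices such as the ones above select different columns if the column order ever changes. One option is to translate the positions into names once and select by name afterwards. The sketch below does this on a toy DataFrame; all values are made up, and the `customer` and `state` columns are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "customer": ["A1", "B2"],   # hypothetical column
    "state": ["X", "Y"],        # hypothetical column
    "customer_lifetime_value": [2763.52, 6979.54],
    "total_claim_amount": [384.81, 1131.46],
})

# Translate positions into names once, which documents what is being kept
kept_names = list(toy.columns[[1, 2, 3]])

# Select by name; .copy() makes the subset independent of the original
subset = toy[kept_names].copy()
```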

In [None]:
df_0.head()

##### Categorical Features
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

In [None]:
df_cat = pd.DataFrame(df_0.select_dtypes(include='object'))

In [None]:
df_cat

##### Numerical Features
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

In [None]:
# To see only the columns with numeric data types, we use select_dtypes,
# the public alternative to the private _get_numeric_data method.

In [None]:
df_num = df_0.select_dtypes(include='number')

In [None]:
df_num
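
The categorical/numerical split above can be sketched on a toy DataFrame as follows. Using `exclude="number"` for the categorical side is an assumption made here so that the two halves always partition the columns, regardless of how text columns are stored; all values are made up.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "state": ["X", "Y"],
    "income": [56274, 0],
    "total_claim_amount": [384.81, 1131.46],
})

# Numeric columns (integers and floats)
toy_num = toy.select_dtypes(include="number")

# Everything else is treated as categorical
toy_cat = toy.select_dtypes(exclude="number")
```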

##### Exploration
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

# 04 - Processing Data
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

##### Dealing with outliers
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

##### Normalization
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>



##### Encoding Categorical Data
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>


##### Splitting into train set and test set
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>


# 05 - Modeling
##### Apply model
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

# 06 - Model Validation
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>

##### R2

##### MSE

##### RMSE

##### MAE
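
The notebook does not yet compute the four metrics listed in this section. As a hedged sketch of how they could be obtained with `sklearn.metrics`, the example below uses made-up targets and predictions; it does not reflect results from the insurance dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

r2 = r2_score(y_true, y_pred)              # share of variance explained
mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
```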

# 07 - Reporting
##### Present results
* <a href="#Jorge-Castro-DAPT-BER">Back to Top</a>
* <a href="#.">Go down</a>



##### .
##### Go bottom