# ECB Data Academy - Week 2 - Workshop - Solutions

[Krisolis](http://www.krisolis.ie)

## Data Exploration and Analysis

In [None]:
import pandas as pd
import numpy as np

import TAS_Python_Utilities
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
import sklearn.impute

#import pandas_profiling

## Workshop Tasks
Perform the following tasks:
- Import the dataset from the file InsureABC_Channel_Data.csv into a pandas data frame called abt.
- Explore the data in the ABT Insure Channel dataset using appropriate summary statistics and data visualisations.
- Make a list of any data quality issues associated with the data.
- Fix the data quality issues.

### Load Dataset

Import the dataset from the file InsureABC_Channel_Data.csv into a pandas data frame called abt.

In [None]:
abt = pd.read_csv("InsureABC_Channel_Data.csv", encoding = "UTF-8", index_col = 0)
target_feature_name = 'PrefChannel'
print(abt.columns)
print(abt.shape)
display(abt.head())

### Explore Dataset

Explore the dataset.

In [None]:
# Print descriptive statistics for each column
print("Summary Stats")
# Numeric columns (count, mean, std, quartiles)
if abt.select_dtypes(include=[np.number]).shape[1] > 0:
    display(abt.select_dtypes(include=[np.number]).describe().transpose())
# Categorical columns (count, unique, top, freq)
if abt.select_dtypes(include=[object]).shape[1] > 0:
    display(abt.select_dtypes(include=[object]).describe().transpose())

# Check for presence of missing values
print("Missing Values")
print(abt.isnull().sum())


In [None]:
#pandas_profiling.ProfileReport(abt, minimal = True)

**Data Quality Plan**

Irregular Cardinality
- PrefChannel: 6 levels instead of the expected 3
- GivenName: far too many levels
- MiddleInitial: far too many levels
- Surname: far too many levels
- Occupation: far too many levels
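
The cardinality issues listed above can be spotted by counting distinct levels per column. The following is a minimal sketch on a made-up data frame called `toy` (a hypothetical stand-in for `abt`; the values are invented for illustration), not the workshop's own code.

```python
import pandas as pd

# Hypothetical toy data standing in for abt; all values are made up.
toy = pd.DataFrame({
    "PrefChannel": ["Phone", "P", "Email", "E", "SMS", "S"],
    "Surname": ["Byrne", "Kelly", "Walsh", "Ryan", "Doyle", "Murphy"],
    "MotorType": ["Single", "Single", "Bundle", "Single", "Bundle", "Single"],
})

# Number of distinct levels in each non-numeric column
levels = toy.select_dtypes(exclude="number").nunique()

# Distinct levels divided by row count: values near 1 suggest an
# identifier-like column that is unlikely to be useful as a category
level_ratio = levels / len(toy)

print(levels)
print(level_ratio)

# Listing the levels of the target shows the spurious codes (P, E, S)
print(toy["PrefChannel"].value_counts())
```

On the real data, the same calls applied to `abt` should show six levels for PrefChannel and very high counts for the name and occupation columns.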

Missing values
- CreditCardType: missing, possible impute of constant value
- MotorValue: missing, possible impute of 0
- MotorType: missing, possible impute of constant value
- HealthType: missing, possible impute of constant value
- TravelType: missing, possible impute of constant value
- HealthDependentsAdults: missing, possible impute of constant value
- HealthDependentsKids: missing, possible impute of constant value
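
Before choosing a fill value it can help to see how much of each column is missing, and whether related columns are missing on the same rows. The sketch below uses a made-up data frame `toy` (hypothetical values, not the workshop data). The idea that a missing motor value means "no motor policy" is an assumption to be checked against the data, not something the dataset states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for abt; all values are made up.
toy = pd.DataFrame({
    "MotorValue": [12000.0, np.nan, 8500.0, np.nan],
    "MotorType": ["Single", None, "Bundle", None],
    "HealthDependentsKids": [2.0, np.nan, np.nan, np.nan],
    "Age": [34, 51, 28, 45],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Keep only columns that actually have gaps, worst first
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)

# Are MotorValue and MotorType missing on exactly the same rows?
# If so, a constant fill (0 and 'none') is easier to justify.
same_rows = (toy["MotorValue"].isnull() == toy["MotorType"].isnull()).all()
print(same_rows)
```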


### Data Preparation

Remove columns with too many levels

In [None]:
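# Drop the high-cardinality columns; unlike columns.difference(), drop()
# keeps the remaining columns in their original order
abt = abt.drop(columns=['GivenName', 'MiddleInitial', 'Surname', 'Occupation'])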

Remap spurious target level values

In [None]:
abt.loc[abt['PrefChannel'] == 'P','PrefChannel'] = "Phone"
abt.loc[abt['PrefChannel'] == 'E','PrefChannel'] = "Email"
abt.loc[abt['PrefChannel'] == 'S','PrefChannel'] = "SMS"
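
An alternative way to write the remapping above is a single dictionary passed to `Series.replace`, followed by a check that only the expected levels remain. This is a sketch on a made-up data frame `toy`; the name `channel_map` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for abt
toy = pd.DataFrame({"PrefChannel": ["Phone", "P", "Email", "E", "SMS", "S"]})

# Map each spurious code to its full level name
channel_map = {"P": "Phone", "E": "Email", "S": "SMS"}
toy["PrefChannel"] = toy["PrefChannel"].replace(channel_map)

# Only the three intended levels should be left
remaining_levels = sorted(toy["PrefChannel"].unique())
print(remaining_levels)
```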

Perform simple imputation on columns with missing values

In [None]:
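# Keep each fitted imputer in a dictionary, keyed by column name,
# so it can be looked up again later if needed
imputers = dict()

# Missing dependent counts are filled with the constant 0
imputers['HealthDependentsAdults'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
abt['HealthDependentsAdults'] = imputers['HealthDependentsAdults'].fit_transform(abt['HealthDependentsAdults'].values.reshape(-1, 1))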

#imp = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
#abt['HealthDependentsAdults'] = imp.fit_transform(abt['HealthDependentsAdults'].values.reshape(-1, 1))

#imp = sklearn.impute.SimpleImputer(strategy="median")
#abt['HealthDependentsAdults'] = imp.fit_transform(abt['HealthDependentsAdults'].values.reshape(-1, 1))


imputers['HealthDependentsKids'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
abt['HealthDependentsKids'] = imputers['HealthDependentsKids'].fit_transform(abt['HealthDependentsKids'].values.reshape(-1, 1))

imputers['CreditCardType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'missing')
abt['CreditCardType'] = imputers['CreditCardType'].fit_transform(abt['CreditCardType'].values.reshape(-1, 1))

imputers['MotorValue'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 0)
abt['MotorValue'] = imputers['MotorValue'].fit_transform(abt['MotorValue'].values.reshape(-1, 1))

imputers['MotorType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'none')
abt['MotorType'] = imputers['MotorType'].fit_transform(abt['MotorType'].values.reshape(-1, 1))

imputers['HealthType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'none')
abt['HealthType'] = imputers['HealthType'].fit_transform(abt['HealthType'].values.reshape(-1, 1))

imputers['TravelType'] = sklearn.impute.SimpleImputer(strategy="constant", fill_value = 'none')
abt['TravelType'] = imputers['TravelType'].fit_transform(abt['TravelType'].values.reshape(-1, 1))
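
Since every column above follows the same pattern, the imputation could also be driven by a dictionary of fill values. The sketch below shows the idea on a made-up data frame `toy`; the name `fill_values` and the data are assumptions for illustration, and only a subset of the columns is shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for abt; all values are made up.
toy = pd.DataFrame({
    "MotorValue": [12000.0, np.nan, 8500.0],
    "MotorType": pd.Series(["Single", np.nan, "Bundle"], dtype=object),
    "HealthDependentsKids": [2.0, np.nan, 0.0],
})

# Constant fill value per column, mirroring the choices made above
fill_values = {
    "MotorValue": 0,
    "MotorType": "none",
    "HealthDependentsKids": 0,
}

imputers = {}
for column, fill_value in fill_values.items():
    imputers[column] = SimpleImputer(strategy="constant", fill_value=fill_value)
    # fit_transform expects 2D input and returns a 2D array; flatten it
    # back to one dimension before assigning it to the column
    toy[column] = imputers[column].fit_transform(toy[[column]]).ravel()

print(toy)
print(toy.isnull().sum())
```

Keeping the fill values in one place makes it easier to review the imputation choices against the data quality plan.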

Check transformed dataset

In [None]:
#pandas_profiling.ProfileReport(abt, minimal = True)

In [None]:
# Print descriptive statistics for each column
print("Summary Stats")
# Numeric columns (count, mean, std, quartiles)
if abt.select_dtypes(include=[np.number]).shape[1] > 0:
    display(abt.select_dtypes(include=[np.number]).describe().transpose())
# Categorical columns (count, unique, top, freq)
if abt.select_dtypes(include=[object]).shape[1] > 0:
    display(abt.select_dtypes(include=[object]).describe().transpose())

# Check for presence of missing values
print("Missing Values")
print(abt.isnull().sum())
