# ECB Data Academy - Week 2 - Data Exploration and Data Preparation

[Krisolis](http://www.krisolis.ie)

## Data Exploration in Python

This notebook demonstrates some simple data exploration in Python using **pandas** and **pandas_profiling**.

### Package Imports

To build predictive models in Python we use a set of libraries that are imported here. **pandas** and **sklearn** are the most important of these.

In [None]:
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import TAS_Python_Utilities

import sklearn
import sklearn.preprocessing
import sklearn.impute


### Load Data

In [None]:
dataset = pd.read_csv('../data/ACME_ABT.csv')
print(dataset.shape)
display(dataset.head())

### Explore the Dataset

Examine the distribution of the two classes of the target feature, **churn**.

In [None]:
dataset["churn"].value_counts()
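Raw counts are easier to judge for class imbalance when shown alongside proportions. A minimal sketch on a toy series follows; the labels `"yes"` and `"no"` are illustrative assumptions and may not match the values stored in the real **churn** column.

```python
import pandas as pd

# Toy stand-in for the target column; labels and counts are illustrative only
churn = pd.Series(["yes", "no", "no", "no", "yes", "no", "no", "no"], name="churn")

class_counts = churn.value_counts()                      # number of rows per class
class_proportions = churn.value_counts(normalize=True)   # fraction of rows per class

print(class_counts)
print(class_proportions)
```

On the real data the same call would be `dataset["churn"].value_counts(normalize=True)`.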

Generate summary statistics for each feature in the dataset.

In [None]:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0: 
    display(dataset.select_dtypes(include=[np.number]).describe().transpose())
if dataset.select_dtypes(include=[object]).shape[1] > 0: 
    display(dataset.select_dtypes(include=[object]).describe().transpose())

Check for missing values within each feature in the dataset.

In [None]:
# Check for presence of missing values
print("Missing Values")
print(dataset.isnull().sum())
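The counts above can be hard to interpret without the dataset size, so it may help to report the percentage of missing values as well. The sketch below uses a small made-up frame; the column names are borrowed from this notebook but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names echo the notebook
toy = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0],
    "occupation": ["clerk", np.nan, np.nan, "teacher"],
    "regionType": ["town", "rural", "town", "suburban"],
})

# isnull().sum() counts missing cells; isnull().mean() gives the fraction missing
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```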

Visualise the distribution of each feature in the dataset.

In [None]:
TAS_Python_Utilities.data_viz(dataset)

Visualise the distribution of each feature in the dataset, separated by the target feature.

In [None]:
TAS_Python_Utilities.data_viz_target(dataset, "churn")

### Explore the Dataset Using Pandas Profiler

The pandas_profiling package is a great way to examine a dataset.

In [None]:
#import pandas_profiling
#profile = pandas_profiling.ProfileReport(dataset, minimal = True)
#profile.to_file("your_report.html")

### Clean Data

Replace spurious categories in **creditCard** feature.

In [None]:
dataset.loc[dataset['creditCard'] == 't','creditCard'] = "TRUE"
dataset.loc[dataset['creditCard'] == 'f','creditCard'] = "FALSE"
dataset.loc[dataset['creditCard'] == 'yes','creditCard'] = "TRUE"
dataset.loc[dataset['creditCard'] == 'no','creditCard'] = "FALSE"

Convert the string values "TRUE" and "FALSE" to the booleans True and False.

In [None]:
dataset['creditCard'] = dataset['creditCard'].map({"TRUE": True, "FALSE": False})
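One thing to watch with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. The sketch below folds the replacement and conversion steps into a single mapping and then lists the values that were not mapped. The raw value `"maybe"` is a made-up example of an unexpected category.

```python
import pandas as pd

# Toy column; "maybe" is an invented unexpected category
raw = pd.Series(["TRUE", "t", "yes", "FALSE", "f", "no", "maybe"], name="creditCard")

# Single lookup table covering every spelling handled in the cells above
to_bool = {
    "TRUE": True, "t": True, "yes": True,
    "FALSE": False, "f": False, "no": False,
}

cleaned = raw.map(to_bool)   # values not in to_bool become NaN

# Values that were present in the raw data but had no entry in the mapping
unmapped = raw[cleaned.isnull() & raw.notnull()].unique()
print(cleaned)
print("Unmapped values:", list(unmapped))
```

Checking `unmapped` before moving on guards against silently turning an unexpected spelling into a missing value.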

Replace spurious categories in **regionType** feature.

In [None]:
dataset.loc[dataset['regionType'] == 's','regionType'] = "suburban"
dataset.loc[dataset['regionType'] == 't','regionType'] = "town"
dataset.loc[dataset['regionType'] == 'r','regionType'] = "rural"

Clamp outlier values in **avgReceivedMins** to the range [0, 1000].

In [None]:
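# Assign the result back: an in-place clip on a selected column may not update the DataFrame in recent pandas versions
dataset["avgReceivedMins"] = dataset["avgReceivedMins"].clip(0, 1000)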
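Before clamping, it can be useful to count how many values fall outside the chosen range, since clipping changes them without any warning. A small sketch with invented values:

```python
import pandas as pd

# Toy column with invented values, two of which lie outside [0, 1000]
toy = pd.DataFrame({"avgReceivedMins": [-5.0, 12.5, 250.0, 4000.0]})

# Count the values that clipping will change
out_of_range = ((toy["avgReceivedMins"] < 0) | (toy["avgReceivedMins"] > 1000)).sum()
print("Values outside [0, 1000]:", out_of_range)

# Values below 0 become 0 and values above 1000 become 1000
toy["avgReceivedMins"] = toy["avgReceivedMins"].clip(lower=0, upper=1000)
print(toy)
```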

### Prepare Data for Modelling

Replace missing values using imputation.

In [None]:
# Fill missing occupations with the constant category 'unknown'
occupation_imputer = sklearn.impute.SimpleImputer(strategy="constant", 
                                                  fill_value = 'unknown')
occupation_imputer.fit(dataset['occupation'].values.reshape(-1, 1))
# transform returns a 2D array of shape (n, 1); ravel flattens it back to a single column
dataset['occupation'] = occupation_imputer.transform(dataset['occupation'].values.reshape(-1, 1)).ravel()

# Fill missing region types with the most frequent category
regionType_imputer = sklearn.impute.SimpleImputer(strategy="most_frequent")
regionType_imputer.fit(dataset['regionType'].values.reshape(-1, 1))
dataset['regionType'] = regionType_imputer.transform(dataset['regionType'].values.reshape(-1, 1)).ravel()

# Treat an age of 0 as missing and replace it with the mean of the remaining ages
age_imputer = sklearn.impute.SimpleImputer(missing_values = 0, strategy="mean")
age_imputer.fit(dataset['age'].values.reshape(-1, 1))
dataset['age'] = age_imputer.transform(dataset['age'].values.reshape(-1, 1)).ravel()
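To see what each of the three strategies does, the sketch below applies them to a tiny invented frame. It uses `fit_transform` as a shorthand for the separate `fit` and `transform` calls above, and forces the `object` dtype for the text columns as a cautious assumption about how string columns are stored.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with invented values; object dtype is forced for the text columns
toy = pd.DataFrame({
    "occupation": pd.Series(["clerk", np.nan, "teacher", np.nan], dtype=object),
    "regionType": pd.Series(["town", "town", np.nan, "rural"], dtype=object),
    "age": [30.0, 0.0, 50.0, 40.0],
})

# Constant strategy: every missing occupation becomes 'unknown'
occupation_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
toy["occupation"] = occupation_imputer.fit_transform(toy[["occupation"]]).ravel()

# Most-frequent strategy: the missing region becomes 'town'
region_imputer = SimpleImputer(strategy="most_frequent")
toy["regionType"] = region_imputer.fit_transform(toy[["regionType"]]).ravel()

# Mean strategy with 0 flagged as missing: 0 becomes mean(30, 50, 40) = 40
age_imputer = SimpleImputer(missing_values=0, strategy="mean")
toy["age"] = age_imputer.fit_transform(toy[["age"]]).ravel()

print(toy)
```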

Convert ordinal features to numeric.

In [None]:
creditRating_oe = sklearn.preprocessing.OrdinalEncoder()
creditRating_oe.fit(dataset['creditRating'].values.reshape(-1, 1))
dataset['creditRating'] = creditRating_oe.transform(dataset['creditRating'].values.reshape(-1, 1))
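By default `OrdinalEncoder` assigns codes in sorted (alphabetical) order, which may not match the intended ranking of the ratings. The sketch below contrasts the default with an explicit `categories` list. The labels `"low"`, `"medium"` and `"high"` are hypothetical and are not taken from the **creditRating** column.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rating labels, shaped as a single-column 2D array
ratings = np.array(["low", "high", "medium", "low"], dtype=object).reshape(-1, 1)

# Default behaviour: categories are sorted, so high=0, low=1, medium=2
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(ratings).ravel()

# Explicit order: low=0, medium=1, high=2
ordered_oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_oe.fit_transform(ratings).ravel()

print("Default codes:", default_codes)
print("Ordered codes:", ordered_codes)
```

Inspecting `creditRating_oe.categories_` after fitting shows which order was actually used on the real data.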

Convert categorical features to dummy coding.

In [None]:
dataset = pd.get_dummies(dataset)
print(dataset.shape)
display(dataset.head())
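As a small illustration of what dummy coding produces, the sketch below encodes a toy frame with invented values. Numeric columns pass through unchanged, while each text column is expanded into one indicator column per category.

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "regionType": ["town", "rural", "suburban"],
})

# One indicator column per category, named <column>_<category>
encoded = pd.get_dummies(toy)
print(encoded)

# drop_first=True omits the first category of each feature, removing the redundant column
encoded_reduced = pd.get_dummies(toy, drop_first=True)
print(encoded_reduced)
```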

Rescale numeric features to defined range

In [None]:
cols = dataset.columns     # Save column names to avoid losing them when converting from a pandas DataFrame to a NumPy array
min_max_scaler = sklearn.preprocessing.MinMaxScaler(feature_range=(0,1))
min_max_scaler.fit(dataset)
a = min_max_scaler.transform(dataset)
dataset = pd.DataFrame(a, columns = cols) # Watch out for putting back in columns here

In [None]:
min_max_scaler.data_max_

In [None]:
dataset

In [None]:
cols = dataset.columns     # Save column names to avoid losing them when converting from a pandas DataFrame to a NumPy array
a = min_max_scaler.inverse_transform(dataset)   # Undo the scaling to recover values in their original units
dataset_unscaled = pd.DataFrame(a, columns = cols) # Watch out for putting back in columns here
dataset_unscaled.head()
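Min-max scaling maps each value $x$ of a feature to $(x - \min) / (\max - \min)$, and `inverse_transform` reverses that mapping. The sketch below checks the round trip on a toy frame with invented values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with invented values
toy = pd.DataFrame({
    "age": [20.0, 40.0, 60.0],
    "avgReceivedMins": [0.0, 250.0, 1000.0],
})

scaler = MinMaxScaler(feature_range=(0, 1))

# fit_transform returns a NumPy array, so column names and index are restored by hand
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# inverse_transform maps the scaled values back to their original units
restored = pd.DataFrame(scaler.inverse_transform(scaled), columns=toy.columns, index=toy.index)

print(scaled)
print("Round trip matches original:", np.allclose(restored, toy))
```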

In [None]:
dataset

Examine the newly transformed dataset. 

In [None]:
display(dataset.head())

In [None]:
import pandas_profiling   # Needed here because the earlier import is commented out
pandas_profiling.ProfileReport(dataset, minimal = True)