<img src= 'http://www.bigbang-datascience.com/wp-content/uploads/2017/09/cropped-Logo-01.jpg' width=500/>

## Project Guide  
------------  
- [Project Overview](#project-overview)  
- [Part 1: Reading Data - Exploratory Data Analysis with Pandas](#I)
- [Part 2: Visual data analysis in Python](#II)
- [Part 3: Data Pre-processing &  Preparation](#III)
- [Part 4: Predictive Analytics](#IV)
- [Part 5: Optimization (Hyper Parameter Tuning)](#V)

In [None]:
# Roadmap for Building Machine Learning Models

# 1. Prepare Problem
# a) Define The Business Objective
# b) Select the datasets
# c) Load dataset
# d) Load libraries


# Data Pre-processing
# This is the first step in building a machine learning model. Data pre-processing refers to the transformation of data
# before feeding it into the model. It deals with the techniques that are used to convert unusable raw data into clean 
# reliable data.

# Since data collection is often not performed in a controlled manner, raw data often contains outliers 
# (for example, age = 120), nonsensical data combinations (for example, model: bicycle, type: 4-wheeler), missing values, 
# scale problems, and so on. Because of this, raw data cannot be fed into a machine learning model because it might 
# compromise the quality of the results. As such, this is the most important step in the process of data science.


# 2. Summarize Data
# a) Descriptive statistics
# b) Data visualizations

# 3. Prepare Data
# a) Data Cleaning
# b) Feature Selection
# c) Data Transformation

# Model Learning
# After pre-processing the data and splitting it into train/test sets (more on this later), we move on to modeling. Models 
# are nothing but sets of well-defined methods called algorithms that use pre-processed data to learn patterns, which can 
# later be used to make predictions. There are different types of learning algorithms, including supervised, semi-supervised, 
# unsupervised, and reinforcement learning. These will be discussed later.

# 4. Modeling Strategy
# a) Select Suitable Algorithms
# b) Select Training/Testing Approaches
# c) Train 


# Model Evaluation
# In this stage, the models are evaluated with the help of specific performance metrics. With these metrics, we can go on to 
# tune the hyperparameters of a model in order to improve it. This process is called hyperparameter optimization. We will 
# repeat this step until we are satisfied with the performance.

# 5. Evaluate Algorithms
# a) Split-out validation dataset
# b) Test options and evaluation metric
# c) Spot Check Algorithms
# d) Compare Algorithms

# Prediction
# Once we are happy with the results from the evaluation step, we will then move on to predictions. Predictions are made 
# by the trained model when it is exposed to a new dataset. In a business setting, these predictions can be shared with 
# decision makers to make effective business choices.

# 6. Improve Accuracy
# a) Algorithm Tuning
# b) Ensembles

# Model Deployment
# The whole process of machine learning does not just stop with model building and prediction. It also involves making use 
# of the model to build an application with the new data. Depending on the business requirements, the deployment may be a 
# report, or it may be some repetitive data science steps that are to be executed. After deployment, a model needs proper 
# management and maintenance at regular intervals to keep it up and running.

# 7. Finalize Model
# a) Predictions on validation dataset
# b) Create standalone model on entire training dataset
# c) Save model for later use


<a id="I"></a>

# I.  Reading Data - Exploratory Data Analysis with Pandas

### Article outline
1. Demonstration of main Pandas methods
2. First attempt on predicting Bank churn
3. Useful resources

### 1. Demonstration of main Pandas methods 

**[Pandas](http://pandas.pydata.org)** is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like `.csv`, `.tsv`, or `.xlsx`. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with `Matplotlib` and `Seaborn`, `Pandas` provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in `Pandas` are implemented with **Series** and **DataFrame** classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of `Series` instances. `DataFrames` are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
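
To make the "dictionary of `Series`" picture concrete, here is a minimal sketch on made-up values. The names `age`, `balance` and `toy` are hypothetical and are not part of the churn dataset used below.

```python
import pandas as pd

# Two Series that share the same index (hypothetical toy values)
age = pd.Series([42, 35, 58], index=["a", "b", "c"], name="age")
balance = pd.Series([0.0, 1250.5, 830.0], index=["a", "b", "c"], name="balance")

# A DataFrame can be built from a dictionary that maps column names to Series
toy = pd.DataFrame({"age": age, "balance": balance})

print(toy.shape)          # (number of rows, number of columns)
print(toy.dtypes)         # each column keeps its own data type
print(type(toy["age"]))   # selecting one column gives back a Series
```

Each row of `toy` is one instance and each column is one feature, which is the layout assumed throughout the rest of this notebook.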

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns
sns.set()  # apply seaborn's default plot theme
# we don't like warnings
# you can comment the following 2 lines if you'd like to
import warnings
warnings.filterwarnings('ignore')


We’ll demonstrate the main methods in action by analyzing a [dataset](https://bigml.com/user/francisco/gallery/dataset/5163ad540c0b5e5b22000383) on the churn rate of telecom operator clients. Let’s read the data (using `read_csv`), and take a look at the first 5 lines using the `head` method:


In [None]:
Bchurn = pd.read_csv('./churn-modeling/Churn_Modelling.csv')
Bchurn.head()

<details>
<summary>About printing DataFrames in Jupyter notebooks</summary>
<p>
In Jupyter notebooks, Pandas DataFrames are printed as these pretty tables seen above while `print(df.head())` looks worse.
By default, Pandas displays 20 columns and 60 rows, so, if your DataFrame is bigger, use the `set_option` function as shown in the example below:

```python
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
```
</p>
</details>

Recall that each row corresponds to one client, an **instance**, and columns are **features** of this instance.

The cell below will load our dataset in, and perform some rudimentary cleaning.  

Do not worry if you do not understand all of the code below. Comments are provided if you are interested in following along.

In [None]:
# Column names may be accessed (and changed) using the `.columns` attribute as below
print("Old Column Names:\n", Bchurn.columns) 

In [None]:
# Stripping out spaces from ends of names, and replacing internal spaces with "_"
print("\nStripping spaces from ends of column names; replacing internal spaces with '_'\n")
Bchurn.columns = [col.strip().replace(' ', '_').lower() for col in Bchurn.columns]

# Print edited column names
print("\nNew Column Names:\n", Bchurn.columns)

Let’s have a look at data dimensionality, feature names, and feature types.

In [None]:
print(Bchurn.shape)

The output is a `(rows, columns)` tuple: the first number is how many clients the table holds, and the second is how many features describe each of them.

Now let’s try printing out column names using `columns`:

In [None]:
print(Bchurn.columns)

We can use the `info()` method to output some general information about the dataframe: 

In [None]:
Bchurn.info()

`int64`, `float64` and `object` are the data types you will typically see here: numeric features are stored as `int64` or `float64`, and text features such as names or categories are stored as `object`. With this same method, we can easily see if there are any missing values: a column has none when its non-null count equals the number of rows we saw before with `shape`.

We can **change the column type** with the `astype` method. Let’s apply this method to the `churned` feature to convert it into `int64`:


In [None]:
Bchurn['churned'] = Bchurn['churned'].astype('int64')

In [None]:
# Number of rows
len(Bchurn)  # row count, recorded for traceability


The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): number of non-missing values, mean, standard deviation, range, median, 0.25 and 0.75 quartiles.

In [None]:
Bchurn.describe()  # Similar to summary() in R

In [None]:
Bchurn.describe().transpose()  # change the rows and columns

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the `include` parameter.

In [None]:
Bchurn.describe(include=['object', 'bool'])

For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method. Let’s have a look at the distribution of `churned`:

In [None]:
Bchurn['churned'].value_counts()

Customers whose `churned` value is `0` are *loyal*; the remaining customers have churned. To calculate fractions instead of counts, pass `normalize=True` to the `value_counts` function.

In [None]:
Bchurn['churned'].value_counts(normalize=True)


### 2.  Indexing and retrieving data

A DataFrame can be indexed in different ways. 

To get a single column, you can use a `DataFrame['Name']` construction. Let's use this to answer a question about that column alone: **what is the proportion of churned users in our dataframe?**



In [None]:
Bchurn['churned'].mean()


Because `churned` is coded as 0 or 1, its mean is exactly the share of customers who left. A high churn rate is bad news for a company: losing customers this quickly can threaten the business.

**Boolean indexing** with one column is also very convenient. The syntax is `df[P(df['Name'])]`, where `P` is some logical condition that is checked for each element of the `Name` column. The result of such indexing is the DataFrame consisting only of rows that satisfy the `P` condition on the `Name` column. 

Let’s use it to answer the question:

**What are average values of numerical features for churned users?**


In [None]:
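# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas
Bchurn[Bchurn['churned'] == 1].mean(numeric_only=True)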


DataFrames can be indexed by column name (label) or row name (index) or by the serial number of a row. The `loc` indexer is used for **indexing by name**, while `iloc` is used for **indexing by number**.

In the first case, we would say *"give us the values of the rows with index from 0 to 5 (inclusive) and columns labeled from `geography` to `age` (inclusive)"*, and, in the second case, we would say *"give us the values of the first five rows in the first three columns"* (as in a typical Python slice: the maximal value is not included).


In [None]:
Bchurn.loc[0:5, 'geography':'age']

In [None]:
Bchurn.iloc[0:5, 0:3]
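
The difference between the inclusive `loc` slice and the exclusive `iloc` slice is easy to miss, so here is a small sketch on a hypothetical frame `toy` (its column names `a` to `d` are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with a default 0..9 row index and four columns
toy = pd.DataFrame(
    {"a": range(10), "b": range(10, 20), "c": range(20, 30), "d": range(30, 40)}
)

# Label-based: rows labeled 0..5 and columns 'a'..'c', both ends included
by_label = toy.loc[0:5, "a":"c"]

# Position-based: row positions 0..4 and column positions 0..2, right end excluded
by_position = toy.iloc[0:5, 0:3]

print(by_label.shape)
print(by_position.shape)
```

The same slice bounds therefore return six rows with `loc` but only five with `iloc`.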

If we need the first or the last line of the data frame, we can use the `df[:1]` or `df[-1:]` construct:

In [None]:
Bchurn[-1:]


### Applying Functions to Cells, Columns and Rows

**To apply functions to each column, use `apply()`:**


In [None]:
Bchurn.apply(np.max) 

<a id="II"></a>
# II. Visual data analysis in Python


In the field of Machine Learning, *data visualization* is not just making fancy graphics for reports; it is used extensively in day-to-day work for all phases of a project.

To start with, visual exploration of data is the first thing one tends to do when dealing with a new task. We do preliminary checks and analysis using graphics and tables to summarize the data and leave out the less important details. It is much more convenient for us, humans, to grasp the main points this way than by reading many lines of raw data. It is amazing how much insight can be gained from seemingly simple charts created with available visualization tools.

Next, when we analyze the performance of a model or report results, we also often use charts and images. Sometimes, for interpreting a complex model, we need to project high-dimensional spaces onto more visually intelligible 2D or 3D figures.

All in all, visualization is a relatively fast way to learn something new about your data. Thus, it is vital to learn its most useful techniques and make them part of your everyday ML toolbox.

We are going to get hands-on experience with visual exploration of data using popular libraries such as `matplotlib` and `seaborn`.

### Article outline

1. Dataset
2. Univariate visualization
    * 2.1 Quantitative features
    * 2.2 Categorical and binary features
3. Multivariate visualization
    * 3.1 Quantitative–Quantitative
    * 3.2 Quantitative–Categorical
    * 3.3 Categorical–Categorical
4. Whole dataset
    * 4.1 Naive approach
    * 4.2 Dimensionality reduction
    * 4.3 t-SNE
5. Useful resources

### 1. Univariate visualization

*Univariate* analysis looks at one feature at a time. When we analyze a feature independently, we are usually mostly interested in the *distribution of its values* and ignore other features in the dataset.

Below, we will consider different statistical types of features and the corresponding tools for their individual visual analysis.

#### 1.1 Quantitative features

*Quantitative features* take on ordered numerical values. Those values can be *discrete*, like integers, or *continuous*, like real numbers, and usually express a count or a measurement.

##### 1.1.1 Histograms and density plots

The easiest way to take a look at the distribution of a numerical variable is to plot its *histogram* using the `DataFrame`'s method [`hist()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html).

In [None]:
features = ['balance', 'tenure']

Bchurn[features].hist(figsize=(10, 4));

A histogram groups values into *bins* of equal value range. The shape of the histogram may contain clues about the underlying distribution type: Gaussian, exponential etc. You can also spot any skewness in its shape when the distribution is nearly regular but has some anomalies. Knowing the distribution of the feature values becomes important when you use Machine Learning methods that assume a particular type of it, most often Gaussian.
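
To see what "bins of equal value range" means numerically, the following sketch uses `numpy.histogram` on synthetic data. The sample `values` is randomly generated and only stands in for a real feature.

```python
import numpy as np

# Synthetic stand-in for a numerical feature (not taken from the churn data)
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)

# Ten bins between the minimum and the maximum of the sample
counts, edges = np.histogram(values, bins=10)
widths = np.diff(edges)

print(counts)   # how many values fall into each bin
print(edges)    # 11 edges delimit 10 bins
print(widths)   # every bin covers the same value range
```

Changing `bins` changes the picture, which is why it is worth trying a few settings before reading too much into the shape.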

In the plot above, look at the shape of each histogram: a roughly symmetric bell suggests a normal distribution, while a longer tail on one side indicates skew (a tail that is longer on the right means the variable is skewed right).

There is also another, often clearer, way to grasp the distribution: *density plots* or, more formally, *Kernel Density Plots*. They can be considered a [smoothed](https://en.wikipedia.org/wiki/Kernel_smoother) version of the histogram. Their main advantage over the latter is that they do not depend on the size of the bins. Let's create density plots for the same two variables:
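
As a rough illustration of the smoothing idea, the sketch below builds a Gaussian kernel density estimate by hand with NumPy. The helper `gaussian_kde_sketch` and the chosen `bandwidth` are assumptions made for this example; plotting libraries pick the bandwidth automatically and may differ in detail.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic sample standing in for a numerical feature
rng = np.random.default_rng(0)
samples = rng.normal(size=200)

grid = np.linspace(-6, 6, 601)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.4)

# A density should integrate to roughly 1 (Riemann-sum approximation)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

The estimate no longer depends on bin edges, but it does depend on the bandwidth: a larger value gives a smoother, flatter curve.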

In [None]:
Bchurn[features].plot(kind='density', subplots=True, layout=(1, 2), 
                  sharex=False, figsize=(10, 4));

It is also possible to plot a distribution of observations with `seaborn`'s [`histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html), which replaces the deprecated `distplot()`. For example, let's look at the distribution of `age`. With `kde=True`, the plot displays the histogram with the [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE) drawn on top.

In [None]:
# Set the default figure size (width, height in inches)
plt.rcParams['figure.figsize'] = 10, 6
sns.histplot(Bchurn["age"], kde=True)  # pass it one variable

In [None]:
# Same plot for the estimated salary
plt.rcParams['figure.figsize'] = 10, 6
sns.histplot(Bchurn["estimatedsalary"], kde=True)  # pass it one variable

#### 1.2 Categorical and binary features

*Categorical features* take on a fixed number of values. Each of these values assigns an observation to a corresponding group, known as a *category*, which reflects some qualitative property of this example. *Binary* variables are an important special case of categorical variables when the number of possible values is exactly 2. If the values of a categorical variable are ordered, it is called *ordinal*.

##### 1.2.1 Frequency table

Let’s check the class balance in our dataset by looking at the distribution of the target variable: the *churn rate*. First, we will get a frequency table, which shows how frequent each value of the categorical variable is. For this, we will use the [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method:

In [None]:
Bchurn['churned'].value_counts()

By default, the entries in the output are sorted from the most to the least frequently-occurring values.

In our case, the data is not *balanced*; that is, our two target classes, loyal and disloyal customers, are not represented equally in the dataset. Only a minority of the clients churned. As we will see in the following articles, this fact may imply some restrictions on measuring the classification performance, and, in the future, we may want to additionally penalize our model errors in predicting the minority "Churn" class.
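
One common way to penalize errors on the minority class more heavily is to weight each class inversely to its frequency. The sketch below computes such weights on hypothetical labels `y`; the formula is the "balanced" heuristic `n_samples / (n_classes * class_count)`, shown here as one possible choice rather than the method this notebook settles on.

```python
import numpy as np

# Hypothetical labels: 80 loyal customers (0) and 20 churned customers (1)
y = np.array([0] * 80 + [1] * 20)

classes, counts = np.unique(y, return_counts=True)

# Rarer classes receive proportionally larger weights
weights = len(y) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

A dictionary of this shape can typically be passed to classifiers that accept per-class weights.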

##### 1.2.2 Bar plot

The bar plot is a graphical representation of the frequency table. The easiest way to create it is to use the `seaborn`'s function [`countplot()`](https://seaborn.pydata.org/generated/seaborn.countplot.html). There is another function in `seaborn` that is somewhat confusingly called [`barplot()`](https://seaborn.pydata.org/generated/seaborn.barplot.html) and is mostly used for representation of some basic statistics of a numerical variable grouped by a categorical feature.

Let's plot the distributions for two categorical variables:

In [None]:
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

sns.countplot(x='churned', data=Bchurn, ax=axes[0]);
sns.countplot(x='age', data=Bchurn, ax=axes[1]);

##### 1.2.3 Distributions of categorical features

In [None]:
# Distributions of categorical features
plt.rcParams['figure.figsize'] = 8,6
sns.countplot(y='gender', data=Bchurn)
plt.show()

sns.countplot(y='geography', data=Bchurn)
plt.show()

While the histograms, discussed above, and bar plots may look similar, there are several differences between them:
1. *Histograms* are best suited for looking at the distribution of numerical variables while *bar plots* are used for categorical features.
2. The values on the X-axis in the *histogram* are numerical; a *bar plot* can have any type of values on the X-axis: numbers, strings, booleans.
3. The *histogram*'s X-axis is a *Cartesian coordinate axis* along which values cannot be changed; the ordering of the *bars* is not predefined. Still, it is useful to note that the bars are often sorted by height, that is, the frequency of the values. Also, when we consider *ordinal* variables (like `tenure` in our data), the bars are usually ordered by variable value.

The left chart above vividly illustrates the imbalance in our target variable. The bar plot for `age` on the right shows how the customers are spread across ages. But, as we want to be able to predict the minority class, we may be more interested in how the churned customers behave. It may well be that a particular part of that bar plot contains most of our churn. These are just hypotheses for now, so let's move on to some more interesting and powerful visual techniques.

### 2. Multivariate visualization

*Multivariate* plots allow us to see relationships between two and more different variables, all in one figure. Just as in the case of univariate plots, the specific type of visualization will depend on the types of the variables being analyzed.

#### 2.1 Quantitative–Quantitative

##### 2.1.1 Correlation matrix

Let's look at the correlations among the numerical variables in our dataset. This information is important to know as there are Machine Learning algorithms (for example, linear and logistic regression) that do not handle highly correlated input variables well.

First, we will use the method [`corr()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) on a `DataFrame` that calculates the correlation between each pair of features. Then, we pass the resulting *correlation matrix* to [`heatmap()`](https://seaborn.pydata.org/generated/seaborn.heatmap.html) from `seaborn`, which renders a color-coded matrix for the provided values:

In [None]:
Bchurn.head(5)

In [None]:
# Drop identifier and non-numerical columns (names are lowercase after the renaming step above)
numerical = list(set(Bchurn.columns) - 
                 set(['rownumber', 'customerid', 'gender', 'surname', 'geography']))

# Calculate Correlation
corr_matrix = Bchurn[numerical].corr()
corr_matrix


In [None]:
# Correlation heatmap of the numeric variables
plt.rcParams['figure.figsize'] = 20, 10  # control plot size
sns.heatmap(Bchurn.select_dtypes(include='number').corr(), cmap="coolwarm")

In [None]:
# seaborn: heatmap of the lower triangle only
corr = Bchurn.select_dtypes(include='number').corr()  # compute correlation matrix
mask = np.zeros_like(corr, dtype=bool)  # make mask (np.bool was removed from NumPy)
mask[np.triu_indices_from(mask)] = True  # mask the upper triangle

fig, ax = plt.subplots(figsize=(11, 9))  # create a figure and a subplot
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # custom color map
sns.heatmap(
    corr,
    mask=mask,
    cmap=cmap,
    center=0,
    linewidths=0.5,
    cbar_kws={'shrink': 0.5}
);

In [None]:
# Box and Whisker Plots
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 20,10  # control plot size
Bchurn.plot(kind='box', subplots=True, layout=(3,6), sharex=False, sharey=False)
plt.show()


In [None]:
# Univariate Density Plots
plt.rcParams['figure.figsize'] = 20,10  # control plot size

Bchurn.plot(kind='density', subplots=True, layout=(4,4), sharex=False)
plt.show()


In [None]:
# Univariate Histograms
plt.rcParams['figure.figsize'] = 20,10  # control plot size

Bchurn.hist()
plt.show()


##### 2.1.2 Scatter plot

The *scatter plot* displays values of two numerical variables as *Cartesian coordinates* in 2D space. Scatter plots in 3D are also possible.

Let's try out the function [`scatter()`](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.scatter.html) from the `matplotlib` library:

In [None]:
plt.scatter(Bchurn['balance'], Bchurn['estimatedsalary']);

The picture is only interesting if the points show some structure. When the cloud forms an ellipse-like shape aligned with the axes, the two features are likely uncorrelated; a tilted ellipse would point to a linear relationship.

There is a slightly fancier option to create a scatter plot with the `seaborn` library:

In [None]:
sns.jointplot(x='balance', y='estimatedsalary', 
              data=Bchurn, kind='scatter');

The function [`jointplot()`](https://seaborn.pydata.org/generated/seaborn.jointplot.html) also plots two histograms, one for each variable, along the margins of the scatter plot, which may be useful in some cases.

Using the same function, we can also get a smoothed version of our bivariate distribution:

In [None]:
# x and y must be passed as keywords in recent seaborn versions
sns.jointplot(x='tenure', y='age', data=Bchurn,
              kind="kde", color="g");

This is basically a bivariate version of the *Kernel Density Plot* discussed earlier.

##### 2.1.3 Scatterplot matrix

In some cases, we may want to plot a *scatterplot matrix* such as the one shown below. Its diagonal contains the distributions of the corresponding variables, and the scatter plots for each pair of variables fill the rest of the matrix.

In [None]:
# `pairplot()` may become very slow with the SVG format
%config InlineBackend.figure_format = 'png'
sns.pairplot(Bchurn[numerical]);

<a id="III"></a>
# III. Data Pre-processing & Preparation

In [None]:
Bchurn = pd.read_csv('Churn_Modelling.csv')
Bchurn.head()

In [None]:
# Convert data types

print(Bchurn.dtypes)
# Bchurn = Bchurn.astype(float)
# print(Bchurn.dtypes)


In [None]:
# Exclude unwanted columns
Bchurn = Bchurn[['CreditScore', 'Geography','Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Churned']]

In [None]:
# skewness along the index axis 
Bchurn.skew(axis=0, skipna=True, numeric_only=True)  # text columns have no skewness

###  1. Data Integration

In [None]:
'''# Use merge function to combine two files

df = pd.merge(df1, df2, on = 'Student_id')'''
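
Since the merge cell above is disabled and `df1` and `df2` are never defined in this notebook, here is a self-contained sketch on two hypothetical frames that share a `Student_id` key:

```python
import pandas as pd

# Hypothetical inputs; only the key column name comes from the cell above
df1 = pd.DataFrame({"Student_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
df2 = pd.DataFrame({"Student_id": [2, 3, 4], "score": [88, 92, 75]})

# Default (inner) merge keeps only keys present in both frames
inner = pd.merge(df1, df2, on="Student_id")

# A left merge keeps every row of df1 and fills missing matches with NaN
left = pd.merge(df1, df2, on="Student_id", how="left")

print(inner)
print(left)
```

Choosing `how` deliberately matters in integration work, because an inner merge silently drops rows that lack a partner.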

###  2. Fixing missing values

#### 2.1 Find missing values

In [None]:
# Finding missing values

Bchurn.isna().sum()

#### 2.2 Delete rows with NaN values

Once we know where the missing values are, we can remove every row that contains one. To do so, we use the `dropna()` function:


In [None]:
# removing Null values

Bchurn = Bchurn.dropna()

In [None]:
# Delete a column
'''Bchurn.drop('Code', axis=1, inplace=True)
print(Bchurn.shape)
Bchurn.head(10)'''

In [None]:
# Delete rows with NaN values

'''print(Bchurn.shape)
Bchurn[['Bare-Nuclei']] = Bchurn[['Bare-Nuclei']].replace('?', np.nan)
Bchurn.dropna(axis=0, how='any', inplace=True)
print(Bchurn.shape)'''


#### 2.3 Mark value as NaN

In [None]:
# Mark value as NaN

'''print(pd.unique(Bchurn['Bare-Nuclei']))
Bchurn[['Bare-Nuclei']] = Bchurn[['Bare-Nuclei']].replace('?', np.nan)
print(pd.unique(Bchurn['Bare-Nuclei']))'''
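
The disabled cells above refer to a `Bare-Nuclei` column that is not part of the churn data. The following sketch reproduces the idea on a hypothetical frame `toy`, where `'?'` is used as a placeholder for missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical column in which '?' stands for "unknown"
toy = pd.DataFrame({"Bare-Nuclei": ["1", "10", "?", "3"], "label": [0, 1, 0, 1]})

# Step 1: mark the placeholder as a real missing value
marked = toy.replace("?", np.nan)

# Step 2: the column can now be converted to a numeric type
marked["Bare-Nuclei"] = pd.to_numeric(marked["Bare-Nuclei"])

# Step 3: drop every row that still contains a missing value
cleaned = marked.dropna(axis=0, how="any")

print(marked.isna().sum())
print(cleaned.shape)
```

Marking placeholders first is what makes `isna()`, `dropna()` and `fillna()` see them at all.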


#### 2.4 Impute missing values 

Impute the numerical data of the `Age` column with its mean. To do so, first find the mean of the column using the pandas `mean()` function (which skips missing values), print it, and then fill the gaps with `fillna()`:

In [None]:
# Impute numerical data with the mean
mean_age = Bchurn['Age'].mean()
print(mean_age)
Bchurn['Age'] = Bchurn['Age'].fillna(mean_age)

We could also impute the categorical data with its mode. We first need to find the mode:

In [None]:
# Impute categorical data with the mode
mode_Gender = Bchurn['Gender'].mode()[0]
print(mode_Gender)

# Impute the missing data of the Gender column with its mode using the fillna() function
Bchurn['Gender'] = Bchurn['Gender'].fillna(mode_Gender)
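
Because the rows with missing values were already dropped earlier, the two imputation cells have nothing left to fill on this data. The sketch below shows the effect on a hypothetical frame `toy` that does contain gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing gender
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0, 40.0],
    "Gender": ["Female", "Male", None, "Female"],
})

# Numerical column: fill with the mean of the observed values (30, 50, 40)
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent value
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

print(toy)
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on it, avoids the chained-assignment pitfalls of recent pandas versions.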

##### 2.5 Splitting data - on whether or not "Age" is specified.

In [None]:
'''### Splitting data - on whether or not "Age" is specified.

# Training data -- "Age" Not null; "Age" as target
train = titanic_knn[titanic_knn.Age.notnull()]
X_train = train.drop(['Age'], axis = 1)
y_train = train.Age'''

##### 2.5.1 Data to impute, -- Where Age is null; Remove completely-null "Age" column.

In [None]:
'''# Data to impute, -- Where Age is null; Remove completely-null "Age" column.
impute = titanic_knn[titanic_knn.Age.isnull()].drop(['Age'], axis = 1)
print("Data to Impute")
print(impute.head(3))'''

##### 2.5.2 import algorithm

In [None]:
'''# import algorithm
from sklearn.neighbors import KNeighborsRegressor

# Instantiate
knr = KNeighborsRegressor()

# Fit
knr.fit(X_train, y_train)

# Create Predictions
imputed_ages = knr.predict(impute)'''

##### 2.5.3 Add to Df

In [None]:
'''# Add to Df
impute['Age'] = imputed_ages
print("\nImputed Ages")
print(impute.head(3))'''

##### 2.5.4 Re-combine dataframes

In [None]:
'''# Re-combine dataframes
titanic_imputed = pd.concat([train, impute], sort = False, axis = 0)'''

##### 2.5.5 Return to original order - to match back up with "Survived"

In [None]:
'''# Return to original order - to match back up with "Survived"
titanic_imputed.sort_index(inplace = True)
print("Shape with imputed values:", titanic_imputed.shape)
print("Shape before imputation:", titanic_knn.shape)
titanic_imputed.head(7)'''
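
The disabled cells above depend on a `titanic_knn` frame that is not loaded in this notebook. As a sketch of the same split, fit, predict and recombine steps, here is a self-contained version on a hypothetical frame `toy`; the feature values and `n_neighbors=2` are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data: two rows have no Age
toy = pd.DataFrame({
    "Fare":   [10.0, 12.0, 50.0, 55.0, 11.0, 52.0],
    "Pclass": [3, 3, 1, 1, 3, 1],
    "Age":    [22.0, 25.0, 48.0, 52.0, np.nan, np.nan],
})

# Split on whether Age is known; Age is the target for the known rows
train = toy[toy["Age"].notnull()]
impute = toy[toy["Age"].isnull()].drop(columns="Age")

# Fit a KNN regressor on the known rows and predict the missing ages
knr = KNeighborsRegressor(n_neighbors=2)
knr.fit(train.drop(columns="Age"), train["Age"])
impute = impute.assign(Age=knr.predict(impute))

# Recombine and restore the original row order
toy_imputed = pd.concat([train, impute], sort=False).sort_index()
print(toy_imputed)
```

Each missing age is replaced by the average age of its two nearest neighbours in the feature space. In practice the features would usually be scaled first, since KNN distances are sensitive to units.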

###  3. Finding and Fixing Outliers

In [None]:
# Boxplots
# Horizontal boxplot with observations
plt.rcParams['figure.figsize'] = 8,4
sns.boxplot(x=Bchurn['Age'])  # passing the data as x draws the box horizontally

In the box plot above, the box spans the interquartile range (from the first to the third quartile), the line inside the box marks the median, and points drawn beyond the whiskers are potential outliers.

##### Box plot

Another useful type of visualization is a *box plot*. `seaborn` does a great job here:

In [None]:
plt.rcParams['figure.figsize'] = 4,8
sns.boxplot(y='Age', data=Bchurn);  # mapping Age to y draws a vertical box


The box plot uses the IQR method to decide which points to draw as outliers. To list those outliers ourselves, we compute the same quantities directly. The following code finds the IQR of the `Age` column:

In [None]:
Q1 = Bchurn["Age"].quantile(0.25)

Q3 = Bchurn["Age"].quantile(0.75)

IQR = Q3 - Q1

print(IQR)

Next, compute the lower and upper fences; any value below the lower fence or above the upper fence is treated as a potential outlier:

In [None]:
Lower_Fence = Q1 - (1.5 * IQR)

Upper_Fence = Q3 + (1.5 * IQR)

print(Lower_Fence)

print(Upper_Fence)

In [None]:
# Show the first rows that lie outside the fences (the potential outliers)
Bchurn[((Bchurn["Age"] < Lower_Fence) |(Bchurn["Age"] > Upper_Fence))].head()


Filter out the outlier data and keep only the remaining rows. To do so, just negate the preceding condition using the `~` operator:

In [None]:
Bchurn = Bchurn[~((Bchurn["Age"] < Lower_Fence) | (Bchurn["Age"] > Upper_Fence))]
Bchurn.head()
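
To check the fence arithmetic by hand, the same steps can be run on a small hypothetical series `ages` (values made up for illustration):

```python
import pandas as pd

# Hypothetical ages with one suspicious value
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 95], name="Age")

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows outside the fences are flagged; ~ keeps everything else
is_outlier = (ages < lower_fence) | (ages > upper_fence)
outliers = ages[is_outlier]
kept = ages[~is_outlier]

print(q1, q3, iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

Here the quartiles are 27.75 and 37.25, so the fences sit at 13.5 and 51.5 and only the value 95 is flagged.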

In [None]:
# Test if outliers are fixed

plt.rcParams['figure.figsize'] = 4,8
sns.boxplot(y='Age', data=Bchurn);  # mapping Age to y draws a vertical box


### 2. Data Feature Selection

#### 2.1 Feature Importance with Extra Trees Classifier

In [None]:
'''# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier

array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]

# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)
'''

#### 2.2 Identify Features with Low Variance

In [None]:
'''# Identify Features with Low Variance
from pandas import read_csv
from sklearn.feature_selection import VarianceThreshold
# load data
array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]
# feature selection
threshold = 0.8 * (1 - 0.8)
test = VarianceThreshold(threshold)
fit = test.fit(X)
print(fit.variances_)
features = fit.transform(X)
print(features)
'''
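
The threshold `0.8 * (1 - 0.8)` in the disabled cell comes from the variance of a 0/1 feature, which is `p * (1 - p)` when a share `p` of the rows equals 1. The sketch below applies the same rule by hand with NumPy on a hypothetical matrix `X`; it is meant to illustrate the filter, not to replace `VarianceThreshold`.

```python
import numpy as np

# Hypothetical features: column 0 is 1 in 90% of rows, column 1 in 50%,
# column 2 is a continuous measurement
X = np.array([
    [1, 0, 3.0],
    [1, 1, 1.0],
    [1, 0, 4.0],
    [1, 1, 1.5],
    [1, 0, 5.0],
    [1, 1, 2.0],
    [1, 0, 3.5],
    [1, 0, 2.5],
    [1, 1, 4.5],
    [0, 1, 0.5],
])

threshold = 0.8 * (1 - 0.8)      # variance of a 0/1 feature with p = 0.8
variances = X.var(axis=0)        # population variance of every column
keep = variances > threshold     # keep only columns that vary enough
X_selected = X[:, keep]

print(variances)
print(keep)
```

Column 0 has variance `0.9 * 0.1 = 0.09`, which is below the threshold, so it is dropped as nearly constant.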

#### 2.3 Feature Extraction with PCA

In [None]:
'''# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
'''

#### 2.4 Feature Extraction with Recursive Feature Elimination (RFE)

In [None]:
'''# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
'''

#### 2.5 Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

In [None]:
'''# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

'''

### 3. Data Transformation

There are some algorithms that can work well with categorical data, such as decision trees. But most machine learning algorithms cannot operate directly with categorical data. These algorithms require the input and output both to be in numerical form. If the output to be predicted is categorical, then after prediction we convert them back to categorical data from numerical data. Let's discuss some key challenges that we face while dealing with categorical data:

#### 3.1 Simple Replacement of Categorical Data with a Number

Find the categorical column and separate it out with a different dataframe. To do so, use the select_dtypes() function from pandas:

In [None]:
Bchurn_categorical = Bchurn.select_dtypes(exclude=[np.number])

Bchurn_categorical.head()

We could also extract the categorical features using a boolean mask:

In [None]:
# Categorical boolean mask
categorical_feature_mask = Bchurn.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = Bchurn.columns[categorical_feature_mask].tolist()

categorical_cols

Find the distinct values in the Geography and Gender columns. To do so, call the unique() function from pandas on each column:

In [None]:
print(Bchurn_categorical['Geography'].unique())
print(Bchurn_categorical['Gender'].unique())

In [None]:
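# Map each category to an integer code and assign the result back to the column
Bchurn_categorical['Gender'] = Bchurn_categorical['Gender'].replace({"Female": 1, "Male": 2})
Bchurn_categorical['Geography'] = Bchurn_categorical['Geography'].replace({"France": 1, "Spain": 2, "Germany": 3})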

In [None]:
# A cell only displays its last expression, so print both columns explicitly
print(Bchurn_categorical.Gender.head())
print(Bchurn_categorical.Geography.head())

#### 3.2 Label Encoding

This is a technique in which we replace each value in a categorical column with an integer from 0 to N-1, where N is the number of distinct categories. For example, say we've got a list of employee names in a column. After performing label encoding, each employee name will be assigned a numeric label. This is not suitable for all cases, because a model might treat the numeric labels as magnitudes or weights. Label encoding is **best suited to ordinal data**, where the categories have a natural order, as long as the assigned integers follow that order. The scikit-learn library provides LabelEncoder(), which assigns labels in sorted order of the class names. Let's look at an exercise in the next section.

Before encoding, remove all the missing data using the dropna() function. Then select all the columns that are not numeric using the following code:

In [None]:
data_column_category = Bchurn.select_dtypes(exclude=[np.number]).columns
data_column_category

Iterate through this category column and convert it to numeric data using LabelEncoder(). To do so, import the sklearn.preprocessing package and use the LabelEncoder() class to transform the data:

In [None]:
#import the LabelEncoder class

from sklearn.preprocessing import LabelEncoder

#Creating the object instance

label_encoder = LabelEncoder()

for i in data_column_category:

    Bchurn[i] = label_encoder.fit_transform(Bchurn[i])

print("Label Encoded Data: ")

Bchurn.head()
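The caveat about ordinal data can be checked on a small example. The sketch below uses a made-up column (`sizes` is hypothetical and not part of the churn dataset) to show that LabelEncoder() numbers classes in sorted order, that inverse_transform() converts the integers back to the original categories, and that scikit-learn's OrdinalEncoder() can be given an explicit category order when the natural order matters.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column (illustrative only, not from the churn data)
sizes = np.array(["low", "medium", "high", "medium", "low"])

# LabelEncoder assigns integers in sorted order of the class names,
# so "high" gets 0 even though it is the largest category
le = LabelEncoder()
codes = le.fit_transform(sizes)
print(le.classes_)
print(codes)

# inverse_transform maps the integers back to the original labels
restored = le.inverse_transform(codes)
print(restored)

# OrdinalEncoder accepts an explicit order: low < medium < high
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = oe.fit_transform(sizes.reshape(-1, 1)).ravel()
print(ordered_codes)
```

If the integer order matters to the model, the explicit order is the safer choice; for nominal columns such as Geography, the order carries no meaning either way.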

#### 3.3 One-Hot Encoding

In label encoding, categorical data is converted to numerical data, and the values are assigned labels (such as 1, 2, and 3). Predictive models that use this numerical data might mistake these labels for some kind of order (for example, a model might treat a label of 3 as "greater" than a label of 1, which is meaningless for nominal categories). To avoid this confusion, we can use one-hot encoding. Here, the label-encoded column is split into n binary columns, where n is the number of unique labels generated during label encoding; each row has a 1 in the column for its category and 0 elsewhere. For example, if label encoding produced three labels, one-hot encoding creates three columns, so n is 3. Let's look at an exercise to get further clarification.
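To make the "n labels become n columns" idea concrete, here is a minimal sketch in plain NumPy. The `labels` array is hypothetical, and the indexing trick assumes the labels are already the integers 0 to n-1, as label encoding produces.

```python
import numpy as np

# Hypothetical label-encoded column with n = 3 unique labels (0, 1, 2)
labels = np.array([0, 2, 1, 2])

# Number of unique labels, which is also the number of new columns
n = len(np.unique(labels))

# Row i of the identity matrix has a single 1 in column i,
# so indexing it with the labels yields one binary column per label
one_hot = np.eye(n, dtype=int)[labels]
print(one_hot)
```

Each row contains exactly one 1, so no label is numerically "larger" than another.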

Once we have performed label encoding, we execute one-hot encoding. Add the following code to implement this:

In [None]:
#Performing Onehot Encoding
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() converts it to a dense NumPy array
onehot_encoded = onehot_encoder.fit_transform(Bchurn[data_column_category]).toarray()

Now we create a new dataframe with the encoded data and print the first five rows. Add the following code to do this:


In [None]:
onehot_encoded_frame = pd.DataFrame(onehot_encoded, columns=onehot_encoder.get_feature_names_out(data_column_category))

onehot_encoded_frame.head()

For every level or category, a new column is created.

#### 3.4 Dummy Variables

To prefix each new column with the name of its source column, you can use pd.get_dummies(), an alternative way to create a one-hot encoding:

In [None]:
# Create dummy variables for the categorical features
to_dummy = ['Gender','Geography']
Bchurn_getdummies = pd.get_dummies(Bchurn, prefix=to_dummy, columns=to_dummy, drop_first=True)
Bchurn_getdummies.head()

### 4. Data Discretization

Data discretization is the process of converting continuous data into discrete buckets by grouping values into intervals. Discretized data is often easier to maintain and interpret, and some models train faster on it. Continuous-valued data carries more information, however, so discretization trades some precision for simplicity and speed. Common methods of data discretization are binning and histogram analysis.

The main challenge in discretization is to choose the number of intervals or bins and to decide on their width.

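Here we make use of the pandas.cut() function, which sorts values into bins. Passing an integer as the second argument splits the range of the data into that many equal-width bins, and the labels argument assigns a name to each bin.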

In [None]:
Bchurn['Tenure'] = pd.cut(Bchurn['Tenure'],5,labels=['Poor','Below_average','Average','Above_Average','Excellent'])

Bchurn.head(10)
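The choice of bin edges discussed above can be explored with a short sketch. It compares equal-width bins from pd.cut() with equal-frequency bins from pd.qcut() on a made-up series; `tenure` and its values are hypothetical and only stand in for a skewed numeric column.

```python
import pandas as pd

# Hypothetical tenure values in years (illustrative, not the churn data)
tenure = pd.Series([0, 1, 1, 2, 2, 3, 5, 8, 9, 10])

# Equal-width bins: the value range is split into 5 intervals of the same width,
# so sparse regions of the data produce nearly empty bins
equal_width = pd.cut(tenure, bins=5)

# Equal-frequency bins: edges are taken from quantiles,
# so each bin holds a similar number of rows
equal_freq = pd.qcut(tenure, q=5, duplicates="drop")

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Comparing the two sets of counts is a quick way to judge whether equal-width bins leave some buckets too thin for a given column.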

### 5. Binarization

In [None]:
'''# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy

array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])'''

### 6. Box-Cox transform

In [None]:
'''# Box-Cox transform
import pandas
from scipy.stats import boxcox

array = Bchurn.values
X = array[:,0:8]
Y = array[:,8]
X_boxcox = boxcox(1+X[:,2])[0]

print(X_boxcox)
'''

### 7. Convert a string class label to an integer

In [None]:
'''# Convert a string class label to an integer
import pandas
from sklearn.preprocessing import LabelEncoder

array = Bchurn.values

y = array[:, -1]  # assumption: the class label is the last column of Bchurn
encoder = LabelEncoder()
encoder.fit(y)
print(encoder.classes_)
encoded_y = encoder.transform(y)
print(encoded_y)
'''

### 8. Normalize data (length of 1)

In [None]:
'''# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
'''

In real life, values in a dataset might have a variety of different magnitudes, ranges, or scales. Algorithms that rely on distances between data points, such as k-nearest neighbors, can be dominated by the features with the largest values. There are various data transformation techniques that bring the features of our data onto the same scale, magnitude, or range. This ensures that each feature has an appropriate effect on a model's predictions.

Some features in our data might have high-magnitude values (for example, annual salary), while others might have relatively low values (for example, the number of years worked at a company). Just because some data has smaller values does not mean it is less significant. So, to make sure our prediction does not vary because of different magnitudes of features in our data, we can perform feature scaling, standardization, or normalization (these are three similar ways of dealing with magnitude issues in data).
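As a minimal sketch of the magnitude problem described above, the following compares min-max rescaling with standardization on a tiny made-up matrix. The salary and years-worked values are hypothetical and chosen only to show two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: annual salary and years worked at a company
X_demo = np.array([[30000.0, 1.0],
                   [60000.0, 5.0],
                   [90000.0, 9.0]])

# Min-max scaling maps each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_demo)

# Standardization gives each column zero mean and unit standard deviation
X_std = StandardScaler().fit_transform(X_demo)

print(X_minmax)
print(X_std)
```

After either transformation the two columns are directly comparable, even though the raw salary values are thousands of times larger than the years worked.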

### 9. Rescale data (between 0 and 1)

In [None]:
'''# Rescale data (between 0 and 1)
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
'''

### 10. Standardize data (0 mean, 1 stdev)

In [None]:
'''# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
'''

<a id="IV"></a>
## IV Predictive Analytics 

### 1.1 Identifying X and y

In [None]:
Bchurn = pd.read_csv('Churn_Modelling.csv')
Bchurn.head()

In [None]:
print(Bchurn.columns)
# Keep only the numeric predictors and the target column
Bchurn = Bchurn[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
                 'IsActiveMember', 'EstimatedSalary', 'Churned']]

In [None]:
'''X = Bchurn.iloc[:, [0,7]].values
y = Bchurn.iloc[:, 8].values'''

In [None]:
X = Bchurn.drop('Churned', axis = 1)
y = Bchurn.Churned

In [None]:
'''X = Bchurn.iloc[:, [3,6,7,8,8,10,11,12,13]].values
y = Bchurn.iloc[:, 13].values'''

### 1.3 Data Split & Feature Scaling

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 123)


In [None]:
# Feature Scaling
'''from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)'''

### 2. Building Model (Decision Tree)

###  2.1 First Method (using function)

In [None]:
# from sklearn.tree import DecisionTreeClassifier as Model

In [None]:
# def train(features, target):
#     model = Model()
#     model.fit(features, target)
#     return model

In [None]:
# def predict(model, new_features):
#     preds = model.predict(new_features)
#     return preds

In [None]:
# # Assumes X_train, y_train and X_test
# # come from the train/test split above
# model = train(X_train, y_train)
# predictions = predict(model, X_test)

###  2.2 Second Method 

In [None]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier

In [None]:
DecisionTreeModel = DecisionTreeClassifier(criterion = 'entropy', random_state = 123)

In [None]:
DecisionTreeModel.fit(X_train, y_train)  # Training input and its Target variables

In [None]:
DecisionTreePred = DecisionTreeModel.predict(X_test) # Predict on the test features; y_test is held back for evaluation

### 2.2  Making the Confusion Matrix
**Accuracy** is perhaps the most intuitive performance measure. It is simply the ratio of correctly predicted observations to the total number of observations.  
**Precision**: the ratio of correctly predicted positive observations to all observations predicted as positive.  
**Recall**: also known as sensitivity or true positive rate. It is the ratio of correctly predicted positive observations to all observations that are actually positive.  
**F1 Score**: the harmonic mean of precision and recall. It therefore takes both false positives and false negatives into account.  
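These definitions can be checked by hand against scikit-learn. The sketch below uses a small set of made-up labels (`y_true` and `y_hat` are hypothetical, with 1 standing for a churned customer), computes each metric from the confusion-matrix counts, and compares the results with the library functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = churned, 0 = active
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The manual values agree with scikit-learn's implementations
print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```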

In [None]:
# Confusion Matrix
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Import machine learning modules
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence
from sklearn.metrics import roc_auc_score

In [None]:
# Confusion Matrix
CMTD = confusion_matrix(y_test, DecisionTreePred) # Compare the predicted target variable to the original target variable
CMTD

In [None]:
target = 'Churned'
CMTD = pd.crosstab(y_test,DecisionTreePred, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(CMTD,
            xticklabels=['Active', 'Churned'],
            yticklabels=['Active', 'Churned'],
            annot=True, fmt='d', ax=ax1,  # fmt='d' shows the counts as plain integers
            linewidths=.2, linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

In [None]:
# Accuracy Score
ACDT = accuracy_score(y_test, DecisionTreePred )
print(ACDT)

### 2.3  Plot feature importances

In [None]:
# Plot feature importances
plt.title('Normalized Feature Importances')
sns.barplot(y=X.columns, x=DecisionTreeModel.feature_importances_)
plt.show()

In [None]:
'''tmp = pd.DataFrame({'Feature': X.columns, 'Feature importance': DecisionTreeModel.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()   '''

In [None]:
from sklearn.metrics import classification_report
#d = DecisionTreeModel.fit(X_train.values, y_train.values.copy(), 50)

#preds = predict(X_test, d)
print(classification_report(y_test,DecisionTreePred))