
<h2 id="Problem-Definition"><b>Problem Definition</b><a class="anchor-link" href="#Problem-Definition">¶</a></h2><p><strong>The context:</strong> Why is this problem important to solve?<br/>
    Solving this problem will improve the accuracy of determining who is most likely to default on their loan. This will in turn help the bank give out loans to people who will actually pay them back.<br/>
<strong>The objectives:</strong> What is the intended goal?<br/>
    The goal is to build a classification model that takes the given inputs and predicts whether the customer will default on their loan. The model should be as free of bias as possible. We will also give the bank recommendations on the important features to consider when approving a loan. <br/>
<strong>The key questions:</strong> What are the key questions that need to be answered?<br/>
What are the key features to determine if someone will default on a loan? <br/>
<strong>The problem formulation:</strong> What is it that we are trying to solve using data science?<br/>
We are trying to use data science to eliminate the human error and bias that come into determining whether someone will default on their loan.</p>
<h2 id="Data-Description:"><strong>Data Description:</strong><a class="anchor-link" href="#Data-Description:">¶</a></h2><p>The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.</p>
<ul>
<li><p><strong>BAD:</strong> 1 = Client defaulted on loan, 0 = loan repaid</p>
</li>
<li><p><strong>LOAN:</strong> Amount of loan approved.</p>
</li>
<li><p><strong>MORTDUE:</strong> Amount due on the existing mortgage.</p>
</li>
<li><p><strong>VALUE:</strong> Current value of the property.</p>
</li>
<li><p><strong>REASON:</strong> Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)</p>
</li>
<li><p><strong>JOB:</strong> The type of job that loan applicant has such as manager, self, etc.</p>
</li>
<li><p><strong>YOJ:</strong> Years at present job.</p>
</li>
<li><p><strong>DEROG:</strong> Number of major derogatory reports (which indicates a serious delinquency or late payments).</p>
</li>
<li><p><strong>DELINQ:</strong> Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).</p>
</li>
<li><p><strong>CLAGE:</strong> Age of the oldest credit line in months.</p>
</li>
<li><p><strong>NINQ:</strong> Number of recent credit inquiries.</p>
</li>
<li><p><strong>CLNO:</strong> Number of existing credit lines.</p>
</li>
<li><p><strong>DEBTINC:</strong> Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income). This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.</p>
</li>
</ul>



<h2 id="Important-Notes"><b>Important Notes</b><a class="anchor-link" href="#Important-Notes">¶</a></h2><ul>
<li><p>This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.</p>
</li>
<li><p>In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.</p>
</li>
<li><p>The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.</p>
</li>
<li><p>All the outputs in the notebook are just for reference and can be different if you follow a different approach.</p>
</li>
<li><p>There are sections called <strong>Think About It</strong> in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.</p>
</li>
</ul>



<h3 id="Import-the-necessary-libraries"><strong>Import the necessary libraries</strong><a class="anchor-link" href="#Import-the-necessary-libraries">¶</a></h3>


In [None]:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,precision_score,recall_score,f1_score,precision_recall_curve

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

import scipy.stats as stats

from sklearn.model_selection import GridSearchCV

from sklearn.neighbors import KNeighborsClassifier


import warnings
warnings.filterwarnings('ignore')





<h3 id="Read-the-dataset"><strong>Read the dataset</strong><a class="anchor-link" href="#Read-the-dataset">¶</a></h3>


In [None]:
hm=pd.read_csv("hmeq.csv")

In [None]:
# Copying data to another variable to avoid any changes to original data
data=hm.copy()


<h3 id="Print-the-first-and-last-5-rows-of-the-dataset"><strong>Print the first and last 5 rows of the dataset</strong><a class="anchor-link" href="#Print-the-first-and-last-5-rows-of-the-dataset">¶</a></h3>


In [None]:
# Display first five rows
# Remove ___________ and complete the code

data.head()

In [None]:
# Display last 5 rows
# Remove ___________ and complete the code
data.tail()


<h3 id="Understand-the-shape-of-the-dataset"><strong>Understand the shape of the dataset</strong><a class="anchor-link" href="#Understand-the-shape-of-the-dataset">¶</a></h3>


In [None]:
# Check the shape of the data
# Remove ___________ and complete the code

data.shape


<p><strong>Insights: There are some missing values, which we can see in the first five rows. We have 12 inputs for our one output, BAD. There are 5,960 rows in our dataset. Having an option like "Other" for JOB seems vague. The missing values will need to be addressed if we want an accurate model.</strong></p>



<h3 id="Check-the-data-types-of-the-columns"><strong>Check the data types of the columns</strong><a class="anchor-link" href="#Check-the-data-types-of-the-columns">¶</a></h3>


In [None]:
# Check info of the data
# Remove ___________ and complete the code
data.info()


<p><strong>Insights: We have a lot of null values in multiple columns. Otherwise, the data types all make sense for their respective columns.</strong></p>



<h3 id="Check-for-missing-values"><strong>Check for missing values</strong><a class="anchor-link" href="#Check-for-missing-values">¶</a></h3>


In [None]:
# Analyse missing values - Hint: use isnull() function
# Remove ___________ and complete the code
print(data.isnull())
print(data.isnull().sum())

In [None]:
# Check the percentage of missing values in the each column.
# Hint: divide the result from the previous code by the number of rows in the dataset
# Remove ___________ and complete the code

data.isnull().sum() * 100 / len(data)


<p><strong>Insights: Several columns are missing a significant amount of data, specifically DEBTINC, DEROG, and DELINQ.</strong></p>



<h3 id="Think-about-it:"><strong>Think about it:</strong><a class="anchor-link" href="#Think-about-it:">¶</a></h3><ul>
<li>We found the total number of missing values and the percentage of missing values, which is better to consider?</li>
<li>What can be the limit for % missing values in a column in order to avoid it and what are the challenges associated with filling them and avoiding them? </li>
</ul>
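<p>One way to approach the first question is to put the count and the percentage side by side, since the percentage is what makes columns of different sizes comparable. The following is a minimal sketch on a made-up toy frame (the values and the name <code>missing_summary</code> are illustrative assumptions, not part of the notebook):</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; the values are made up for illustration.
toy = pd.DataFrame({
    "DEBTINC": [34.8, np.nan, np.nan, 29.1, np.nan],
    "DEROG": [0.0, 1.0, np.nan, 0.0, 0.0],
    "LOAN": [1100, 1300, 1500, 1500, 1700],
})

# Count and percentage of missing values per column, side by side.
# isnull().mean() is the fraction of missing rows, because True counts as 1.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```

<p>Sorting by percentage puts the columns that need the most attention at the top of the table.</p>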



<p><strong>We can convert the object type columns to categories</strong></p>
<p><code>converting "objects" to "category" reduces the data space required to store the dataframe</code></p>
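<p>The memory claim can be checked directly with <code>memory_usage(deep=True)</code>. This is a small sketch on a hypothetical column of repeated labels (the series <code>jobs</code> is an assumption for illustration, not the notebook's data):</p>

```python
import pandas as pd

# Hypothetical column with a few repeated labels, similar in spirit to JOB.
jobs = pd.Series(["Other", "Office", "Mgr", "Other", "Sales"] * 1000)

as_object = jobs.astype("object")
as_category = jobs.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object bytes:  ", object_bytes)
print("category bytes:", category_bytes)
```

<p>The saving comes from storing each distinct label once and keeping small integer codes per row, so it is largest when a column has few unique values relative to its length.</p>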



<h3 id="Convert-the-data-types"><strong>Convert the data types</strong><a class="anchor-link" href="#Convert-the-data-types">¶</a></h3>


In [None]:
cols = data.select_dtypes(['object']).columns.tolist()

# Adding the target variable to this list, as this is a classification problem and the target variable is categorical

cols.append('BAD')

In [None]:
cols

In [None]:
# Changing the data type of object type column to category. hint use astype() function
# remove ___________ and complete the code

for i in cols:
    data[i] = data[i].astype('category')

In [None]:
# Checking the info again and the datatype of different variable
# remove ___________ and complete the code

data.dtypes


<h3 id="Analyze-Summary-Statistics-of-the-dataset"><strong>Analyze Summary Statistics of the dataset</strong><a class="anchor-link" href="#Analyze-Summary-Statistics-of-the-dataset">¶</a></h3>


In [None]:
# Analyze the summary statistics for numerical variables
# Remove ___________ and complete the code

data.describe().T


<p><strong>Insights: There seem to be some very high home values compared to the mean of 5,848, with the max being almost 200x bigger. DEROG seems to have several outliers on the high side, with the majority of people having zero. The same can be said for DELINQ. The oldest credit line is almost 100 years old, which looks a little off. The max of 71 credit lines is a little alarming as well. </strong></p>


In [None]:
# Check summary for categorical data - Hint: inside describe function you can use the argument include=['category']
# Remove ___________ and complete the code

data.describe(include=['category']).T


<p><strong>Insights: JOB does not seem to carry a lot of information, since there are only 6 categories and the majority have chosen Other. This could be an area of improvement for the future. It also looks like the majority of our loans are not bad loans and are getting repaid, which is a good thing but can be improved. </strong></p>



<p><strong>Let's look at the unique values in all the categorical variables</strong></p>


In [None]:
# Checking the count of unique values in each categorical column 
# Remove ___________ and complete the code

cols_cat= data.select_dtypes(['category'])

for i in cols_cat.columns:
    print('Unique values in',i, 'are :')
    print(pd.unique(cols_cat[i]))
    print('*'*40)


<p><strong>Insights: Both REASON and JOB have NaN values, which can be an issue. Also, the unique options for JOB are limited to just a few positions, which could be an issue as well. </strong></p>



<h3 id="Think-about-it"><strong>Think about it</strong><a class="anchor-link" href="#Think-about-it">¶</a></h3><ul>
<li>The results above gave the absolute count of unique values in each categorical column. Are absolute values a good measure? </li>
<li>If not, what else can be used? Try implementing that. </li>
</ul>
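<p>One possible answer is to look at relative frequencies instead of raw counts. The sketch below uses a made-up series (the values are assumptions for illustration) and <code>value_counts(normalize=True)</code>, keeping missing entries visible with <code>dropna=False</code>:</p>

```python
import pandas as pd

# Made-up REASON values, including one missing entry.
reason = pd.Series(["DebtCon", "DebtCon", "HomeImp", None, "DebtCon", "HomeImp"])

# Share of each category; dropna=False keeps the missing values as their own row.
shares = reason.value_counts(normalize=True, dropna=False)

print(shares)
```

<p>Shares sum to 1, so they can be compared across columns or across datasets of different sizes.</p>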



<h2 id="Exploratory-Data-Analysis-(EDA)-and-Visualization"><strong>Exploratory Data Analysis (EDA) and Visualization</strong><a class="anchor-link" href="#Exploratory-Data-Analysis-(EDA)-and-Visualization">¶</a></h2>



<h2 id="Univariate-Analysis"><strong>Univariate Analysis</strong><a class="anchor-link" href="#Univariate-Analysis">¶</a></h2><p>Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables</p>



<h3 id="1.-Univariate-Analysis---Numerical-Data"><strong>1. Univariate Analysis - Numerical Data</strong><a class="anchor-link" href="#1.-Univariate-Analysis---Numerical-Data">¶</a></h3><p>Histograms and box plots help to visualize and describe numerical data. We use box plot and histogram to analyze the numerical columns.</p>
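<p>Alongside the plots, a few numbers can summarize the same picture. As a rough sketch on a made-up right-skewed sample (the values are assumptions, not taken from the dataset), comparing the mean with the median and checking the sign of the skewness gives a quick read on the shape of a distribution:</p>

```python
import pandas as pd

# Made-up sample with one large value on the right, mimicking a right-skewed column.
toy_loan = pd.Series([5, 8, 10, 12, 15, 18, 20, 90], dtype=float)

mean_value = toy_loan.mean()      # pulled towards the large value
median_value = toy_loan.median()  # robust to the large value
skewness = toy_loan.skew()        # positive for a right-skewed sample

print(mean_value, median_value, skewness)
```

<p>A mean well above the median, together with positive skewness, is what a long right tail in the histogram looks like numerically.</p>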


In [None]:
# While doing univariate analysis of numerical variables we want to study their central tendency and dispersion.
# Let us write a function that will help us create a boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and draws the boxplot and histogram for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15,10), bins = None):
    """ Boxplot and histogram combined
    feature: 1-d feature array (pandas Series)
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2, # Number of rows of the subplot grid= 2
                                           sharex = True, # x-axis will be shared among all subplots
                                           gridspec_kw = {"height_ratios": (.25, .75)}, 
                                           figsize = figsize 
                                           ) # creating the 2 subplots
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='violet') # boxplot will be created and a marker will indicate the mean value of the column
    # For histogram: sns.histplot replaces the deprecated sns.distplot; 'auto' is its default binning
    sns.histplot(x=feature, kde=False, ax=ax_hist2, bins=bins if bins else 'auto')
    ax_hist2.axvline(feature.mean(), color='green', linestyle='--') # Add mean to the histogram (missing values are skipped)
    ax_hist2.axvline(feature.median(), color='black', linestyle='-') # Add median to the histogram (missing values are skipped)


<h4 id="Using-the-above-function,-let's-first-analyze-the-Histogram-and-Boxplot-for-LOAN">Using the above function, let's first analyze the Histogram and Boxplot for LOAN<a class="anchor-link" href="#Using-the-above-function,-let's-first-analyze-the-Histogram-and-Boxplot-for-LOAN">¶</a></h4>


In [None]:
# Build the histogram boxplot for Loan
histogram_boxplot(data['LOAN'])


<p><strong>Insights: Our data is skewed to the right. There are quite a few outliers above the upper whisker of the box plot. If those points are set aside, the data is fairly close to normally distributed. </strong></p>



<h4 id="Note:-As-done-above,-analyze-Histogram-and-Boxplot-for-other-variables"><strong>Note:</strong> As done above, analyze Histogram and Boxplot for other variables<a class="anchor-link" href="#Note:-As-done-above,-analyze-Histogram-and-Boxplot-for-other-variables">¶</a></h4>


In [None]:
# Build the histogram boxplot for MORTDUE
histogram_boxplot(data['MORTDUE'])

# Build the histogram boxplot for VALUE
histogram_boxplot(data['VALUE'])

# Build the histogram boxplot for YOJ
histogram_boxplot(data['YOJ'])

# Build the histogram boxplot for DEROG
histogram_boxplot(data['DEROG'])

# Build the histogram boxplot for DELINQ
histogram_boxplot(data['DELINQ'])

# Build the histogram boxplot for CLAGE
histogram_boxplot(data['CLAGE'])

# Build the histogram boxplot for NINQ
histogram_boxplot(data['NINQ'])

# Build the histogram boxplot for CLNO
histogram_boxplot(data['CLNO'])

# Build the histogram boxplot for DEBTINC
histogram_boxplot(data['DEBTINC'])


<p><strong>Insights: MORTDUE is similar to LOAN in that it is skewed right with several outliers in that direction. The majority of the data is normally distributed. For VALUE, we have some outliers that are way out to the right, over 4x the mean. For YOJ we don't have a normal distribution. Most of the data is on the left, with very few data points over 10 years. For both DEROG and DELINQ, the majority of our data is 0, with very few data points above that. CLAGE is fairly normally distributed with a few outliers, but they seem to be minimal. NINQ is 0 or close to it the majority of the time. CLNO is fairly normally distributed. It is not the smoothest and has several outliers, but it is still close to normal. DEBTINC looks a little left skewed. There are a few outliers, but it doesn't look like too many.</strong></p>



<h3 id="2.-Univariate-Analysis---Categorical-Data"><strong>2. Univariate Analysis - Categorical Data</strong><a class="anchor-link" href="#2.-Univariate-Analysis---Categorical-Data">¶</a></h3>


In [None]:
# Function to create barplots that indicate percentage for each category.

def perc_on_bar(plot, feature):
    '''
    plot: the Axes returned by sns.countplot
    feature: categorical feature
    the function won't work if a column is passed in hue parameter
    '''

    total = len(feature) # length of the column
    for p in plot.patches: # use the Axes passed in, not a global variable
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05 # x position of the label (middle of the bar)
        y = p.get_y() + p.get_height()           # y position of the label (top of the bar)
        plot.annotate(percentage, (x, y), size = 12) # annotate the percentage 
        
    plt.show() # show the plot


<h4 id="Analyze-Barplot-for-DELINQ">Analyze Barplot for DELINQ<a class="anchor-link" href="#Analyze-Barplot-for-DELINQ">¶</a></h4>


In [None]:
#Build barplot for DELINQ

plt.figure(figsize=(15,5))
ax = sns.countplot(x=data["DELINQ"],palette='winter') # pass the column by keyword; recent seaborn versions do not accept it positionally
perc_on_bar(ax,data["DELINQ"])


<p><strong>Insights: Over 85% of DELINQ values fall in the 0-2 range, with everything being under 10.</strong></p>



<h4 id="Note:-As-done-above,-analyze-Histogram-and-Boxplot-for-other-variables."><strong>Note:</strong> As done above, analyze bar plots for other variables.<a class="anchor-link" href="#Note:-As-done-above,-analyze-Histogram-and-Boxplot-for-other-variables.">¶</a></h4>


In [None]:
#Build barplot for DEROG

plt.figure(figsize=(15,5))
ax = sns.countplot(x=data["DEROG"],palette='winter')
perc_on_bar(ax,data["DEROG"])

#Build barplot for REASON

plt.figure(figsize=(15,5))
ax = sns.countplot(x=data["REASON"],palette='winter')
perc_on_bar(ax,data["REASON"])

#Build barplot for JOB

plt.figure(figsize=(15,5))
ax = sns.countplot(x=data["JOB"],palette='winter')
perc_on_bar(ax,data["JOB"])


<p><strong>Insights: Almost all of DEROG is less than 2. About two-thirds of the reasons for the loan have been debt consolidation. The largest share of the jobs is Other, with ProfExe being the next highest. </strong></p>



<h2 id="Bivariate-Analysis"><strong>Bivariate Analysis</strong><a class="anchor-link" href="#Bivariate-Analysis">¶</a></h2>



<h3 id="Bivariate-Analysis:-Continuous-and-Categorical-Variables"><strong>Bivariate Analysis: Continuous and Categorical Variables</strong><a class="anchor-link" href="#Bivariate-Analysis:-Continuous-and-Categorical-Variables">¶</a></h3>



<h4 id="Analyze-BAD-vs-Loan">Analyze BAD vs Loan<a class="anchor-link" href="#Analyze-BAD-vs-Loan">¶</a></h4>


In [None]:
sns.boxplot(x=data["BAD"],y=data['LOAN'],palette="PuBu") # x and y passed by keyword, as required by recent seaborn versions


<p><strong>Insights: The loan amount doesn't appear to have much effect on whether it's going to be a good or bad loan. </strong></p>



<h4 id="Note:-As-shown-above,-perform-Bi-Variate-Analysis-on-different-pair-of-Categorical-and-continuous-variables"><strong>Note:</strong> As shown above, perform Bi-Variate Analysis on different pair of Categorical and continuous variables<a class="anchor-link" href="#Note:-As-shown-above,-perform-Bi-Variate-Analysis-on-different-pair-of-Categorical-and-continuous-variables">¶</a></h4>


In [None]:
#sns.boxplot(x=data["BAD"],y=data['MORTDUE'],palette="PuBu")

#sns.boxplot(x=data["BAD"],y=data['VALUE'],palette="PuBu")

#sns.boxplot(x=data["BAD"],y=data['YOJ'],palette="PuBu")

#sns.boxplot(x=data["BAD"],y=data['CLAGE'],palette="PuBu")

#sns.boxplot(x=data["BAD"],y=data['NINQ'],palette="PuBu")

#sns.boxplot(x=data["BAD"],y=data['CLNO'],palette="PuBu")

sns.boxplot(x=data["BAD"],y=data['DEBTINC'],palette="PuBu")


<h3 id="Bivariate-Analysis:-Two-Continuous-Variables"><strong>Bivariate Analysis: Two Continuous Variables</strong><a class="anchor-link" href="#Bivariate-Analysis:-Two-Continuous-Variables">¶</a></h3>


In [None]:
sns.scatterplot(x=data["VALUE"],y=data['MORTDUE']) # palette dropped: it has no effect without a hue variable


<p><strong>Insights: There appears to be some correlation between MORTDUE and VALUE. </strong></p>
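<p>A scatter plot suggests a relationship, and the Pearson correlation coefficient puts a number on how linear it is. The sketch below uses made-up values for the two columns (assumptions for illustration only, not the dataset's values) and <code>Series.corr</code>, which ignores pairs where either value is missing:</p>

```python
import pandas as pd

# Made-up property values and mortgage balances (in thousands), for illustration only.
toy = pd.DataFrame({
    "VALUE": [100, 150, 200, 250, 300],
    "MORTDUE": [60, 100, 130, 190, 200],
})

# Pearson correlation between the two columns (the default method of Series.corr).
r = toy["VALUE"].corr(toy["MORTDUE"])

print(round(r, 3))
```

<p>A value close to 1 indicates a strong positive linear relationship, which would also mean the two features carry overlapping information for a model.</p>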



<h4 id="Note:-As-shown-above,-perform-Bivariate-Analysis-on-different-pairs-of-continuous-variables"><strong>Note:</strong> As shown above, perform Bivariate Analysis on different pairs of continuous variables<a class="anchor-link" href="#Note:-As-shown-above,-perform-Bivariate-Analysis-on-different-pairs-of-continuous-variables">¶</a></h4>


In [None]:
sns.scatterplot(x=data["CLAGE"],y=data['CLNO']) # palette dropped: it has no effect without a hue variable


<p><strong>Insights: The majority of credit lines are not over 400 months old, which is roughly 33 years.</strong></p>



<h3 id="Bivariate-Analysis:--BAD-vs-Categorical-Variables"><strong>Bivariate Analysis:  BAD vs Categorical Variables</strong><a class="anchor-link" href="#Bivariate-Analysis:--BAD-vs-Categorical-Variables">¶</a></h3>



<p><strong>The stacked bar chart (aka stacked bar graph)</strong> extends the standard bar chart from looking at numeric values across one categorical variable to two.</p>
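<p>The table behind such a chart is a cross-tabulation normalized by row, so each bar sums to 1. A minimal sketch on a made-up frame (the rows are assumptions for illustration) shows the difference between raw counts and row shares:</p>

```python
import pandas as pd

# Made-up rows: four DebtCon loans (one default) and two HomeImp loans (one default).
toy = pd.DataFrame({
    "REASON": ["DebtCon", "DebtCon", "DebtCon", "DebtCon", "HomeImp", "HomeImp"],
    "BAD": [0, 0, 0, 1, 0, 1],
})

# Raw counts per (REASON, BAD) pair.
counts = pd.crosstab(toy["REASON"], toy["BAD"])

# Row-normalized shares: each row sums to 1, which is what a stacked bar displays.
shares = pd.crosstab(toy["REASON"], toy["BAD"], normalize="index")

print(counts)
print(shares)
```

<p>Normalizing by row makes default rates comparable between categories even when one category has many more loans than another.</p>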


In [None]:
### Function to plot stacked bar charts for categorical columns

def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x,data['BAD'],margins=True)
    print(tab1)
    print('-'*120)
    tab = pd.crosstab(x,data['BAD'],normalize='index')
    tab.plot(kind='bar',stacked=True,figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1,1)) # place the legend outside the plot area
    plt.show()


<h4 id="Plot-stacked-bar-plot-for-for-LOAN-and-REASON">Plot stacked bar plot for BAD and REASON<a class="anchor-link" href="#Plot-stacked-bar-plot-for-for-LOAN-and-REASON">¶</a></h4>


In [None]:
# Plot stacked bar plot for BAD and REASON
stacked_plot(data['REASON'])


<p><strong>Insights: Around 20 percent of both DebtCon and HomeImp default on their loans. Slightly more on HomeImp.</strong></p>



<h4 id="Note:-As-shown-above,-perform-Bivariate-Analysis-on-different-pairs-of-Categorical-vs-BAD"><strong>Note:</strong> As shown above, perform Bivariate Analysis on different pairs of Categorical vs BAD<a class="anchor-link" href="#Note:-As-shown-above,-perform-Bivariate-Analysis-on-different-pairs-of-Categorical-vs-BAD">¶</a></h4>


In [None]:
# Plot stacked bar plot for BAD and JOB
stacked_plot(data['JOB'])

In [None]:
# Plot stacked bar plot for BAD and DEROG
stacked_plot(data['DEROG'])

In [None]:
# Plot stacked bar plot for BAD and DELINQ
stacked_plot(data['DELINQ'])


<p><strong>Insights: Sales and Self (self-employed) have the highest default rates. As DEROG increases, defaults increase significantly. The same is true for DELINQ. </strong></p>



<h3 id="Multivariate-Analysis"><strong>Multivariate Analysis</strong><a class="anchor-link" href="#Multivariate-Analysis">¶</a></h3>



<h4 id="Analyze-Correlation-Heatmap-for-Numerical-Variables">Analyze Correlation Heatmap for Numerical Variables<a class="anchor-link" href="#Analyze-Correlation-Heatmap-for-Numerical-Variables">¶</a></h4>


In [None]:
# Separating numerical variables
numerical_col = data.select_dtypes(include=np.number).columns.tolist()

# Build correlation matrix for numerical columns
# Remove ___________ and complete the code

corr = data[numerical_col].corr()

# plot the heatmap
# Remove ___________ and complete the code

plt.figure(figsize=(16,12))
sns.heatmap(corr,cmap='coolwarm',vmax=1,vmin=-1,
        annot=True, # show the correlation values; fmt only takes effect when annot is set
        fmt=".2f",
        xticklabels=corr.columns,
        yticklabels=corr.columns);

In [None]:
data.head()

In [None]:
# Build pairplot for the data with hue = 'BAD'
# Remove ___________ and complete the code

sns.pairplot(data=data, hue = 'BAD')


<h3 id="Think-about-it"><strong>Think about it</strong><a class="anchor-link" href="#Think-about-it">¶</a></h3><ul>
<li>Are there missing values and outliers in the dataset? If yes, how can you treat them? </li>
<li>Can you think of different ways in which this can be done and when to treat these outliers or not?</li>
<li>Can we create new features based on Missing values?</li>
</ul>
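<p>One common way to treat outliers is to clip values to the box plot whiskers, that is, to 1.5 times the interquartile range beyond the first and third quartiles. The following is a minimal sketch on a made-up series; the helper name <code>clip_to_whiskers</code> and the sample values are assumptions for illustration:</p>

```python
import pandas as pd


def clip_to_whiskers(values: pd.Series) -> pd.Series:
    """Clip a numeric series to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)


# Made-up sample with one extreme value on the right.
toy_loan = pd.Series([10, 12, 13, 15, 16, 18, 20, 200], dtype=float)
clipped = clip_to_whiskers(toy_loan)

print(clipped.tolist())
```

<p>Clipping keeps every row, which matters when the dataset is small, but it also flattens genuinely extreme applicants. For count-like columns where most values are zero, the quartiles can both be zero, in which case clipping would erase all variation, so it is worth checking each column before applying it.</p>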



<h4 id="Treating-Outliers">Treating Outliers<a class="anchor-link" href="#Treating-Outliers">¶</a></h4>


In [None]:
def treat_outliers(df,col):
    '''
    treats outliers in a variable
    df: data frame
    col: str, name of the numerical variable
    '''
    
    Q1=df[col].quantile(q=0.25) # 25th quantile of the column
    Q3=df[col].quantile(q=0.75)  # 75th quantile of the column
    IQR=Q3-Q1   # IQR Range
    Lower_Whisker = Q1-(1.5*IQR) #define lower whisker
    Upper_Whisker = Q3+(1.5*IQR)  # define upper Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker) # all the values smaller than Lower_Whisker will be assigned value of Lower_Whisker 
                                                            # and all the values above Upper_Whisker will be assigned value of Upper_Whisker 
    return df

def treat_outliers_all(df, col_list):
    '''
    treat outliers in all numerical variables
    col_list: list of numerical variables
    df: data frame
    '''
    for c in col_list:
        df = treat_outliers(df,c)
        
    return df

In [None]:
data.head()

In [None]:
df_raw = data.copy()

numerical_col = df_raw.select_dtypes(include=np.number).columns.tolist()# getting list of numerical columns

df = treat_outliers_all(df_raw, numerical_col)

In [None]:
sns.boxplot(x=df["BAD"],y=df['DEBTINC'],palette="PuBu")


<h4 id="Adding-new-columns-in-the-dataset-for-each-column-which-has-missing-values">Adding new columns in the dataset for each column which has missing values<a class="anchor-link" href="#Adding-new-columns-in-the-dataset-for-each-column-which-has-missing-values">¶</a></h4>


In [None]:
#For each column we create a binary flag for the row, if there is missing value in the row, then 1 else 0. 
def add_binary_flag(df,col):
    '''
    df: It is the dataframe
    col: it is column which has missing values
    It returns a dataframe which has a binary flag for missing values in column col
    '''
    new_col = str(col)
    new_col += '_missing_values_flag'
    df[new_col] = df[col].isna()
    return df




In [None]:


# list of columns that has missing values in it
missing_col = [col for col in df.columns if df[col].isnull().any()]

for colmn in missing_col:
    add_binary_flag(df,colmn)
    




In [None]:


df.head()





<h4 id="Filling-missing-values-in-numerical-columns-with-median-and-mode-in-categorical-variables">Filling missing values in numerical columns with median and mode in categorical variables<a class="anchor-link" href="#Filling-missing-values-in-numerical-columns-with-median-and-mode-in-categorical-variables">¶</a></h4>


In [None]:


# Treat missing values: median for numerical columns, mode for categorical columns
# Select numeric columns.
num_data = df.select_dtypes('number')

# Select categorical columns.
cat_data = df.select_dtypes('category').columns.tolist()

# Fill numeric columns with median.
# Remove _________ and complete the code
# Take the median of the numeric columns only; df.median() can fail on categorical columns
df[num_data.columns] = num_data.fillna(num_data.median())

# Fill categorical columns with mode.
# Remove _________ and complete the code
for column in cat_data:
    mode = df[column].mode()[0]
    df[column] = df[column].fillna(mode)




In [None]:


df.head()




In [None]:


sns.boxplot(x=df["BAD"],y=df['CLAGE'],palette="PuBu")




In [None]:


sns.pairplot(data=df, hue = 'BAD')





<h2 id="Proposed-approach"><strong>Proposed approach</strong><a class="anchor-link" href="#Proposed-approach">¶</a></h2><p><strong>1. Potential techniques</strong> - What different techniques should be explored?
        We should look at different ML techniques such as random forests and decision trees.</p>
<p><strong>2. Overall solution design</strong> - What is the potential solution design?</p>
<p><strong>3. Measures of success</strong> - What are the key measures of success?
        The key measure of success is limiting the number of loans that default but get classified as good (false negatives), which corresponds to high recall on the default class.</p>



<h2 id="Model-Building---Approach"><strong>Model Building - Approach</strong><a class="anchor-link" href="#Model-Building---Approach">¶</a></h2><ol>
<li>Data preparation</li>
<li>Partition the data into train and test set</li>
<li>Fit on the train data</li>
<li>Tune the model and prune the tree, if required</li>
<li>Test the model on test set</li>
</ol>



<h2 id="Data-Preparation"><strong>Data Preparation</strong><a class="anchor-link" href="#Data-Preparation">¶</a></h2>



<h3 id="Separating-the-target-variable-from-other-variables"><strong>Separating the target variable from other variables</strong><a class="anchor-link" href="#Separating-the-target-variable-from-other-variables">¶</a></h3>


In [None]:


# Drop the dependent variable from the dataframe and create the X(independent variable) matrix
# Remove _________ and complete the code
X = df.drop(['BAD'], axis=1)

# Create dummy variables for the categorical variables - Hint: use the get_dummies() function
# Remove _________ and complete the code
X = pd.get_dummies(X)

# Create y (dependent variable)
# Remove _________ and complete the code

y = df.BAD
X.head()





<h3 id="Splitting-the-data-into-70%-train-and-30%-test-set"><strong>Splitting the data into 70% train and 30% test set</strong><a class="anchor-link" href="#Splitting-the-data-into-70%-train-and-30%-test-set">¶</a></h3>


In [None]:


# Split the data into training and test set
# Remove _________ and complete the code


x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)





<h3 id="Think-about-it"><strong>Think about it</strong><a class="anchor-link" href="#Think-about-it">¶</a></h3><ul>
<li>You can try different splits like 70:30 or 80:20 as per your choice. Does this change in split affect the performance?</li>
<li>If the data is imbalanced, can you make the split more balanced and if yes, how?</li>
</ul>
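<p>On the second question, stratifying does not balance the classes, but it does keep the class ratio the same in the train and test sets. The sketch below checks this on made-up labels with roughly the same imbalance as BAD (the arrays <code>X_toy</code> and <code>y_toy</code> are placeholders, not the notebook's data):</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 20% positives, similar to the share of defaults in BAD.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps the share of positives equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print("train share of positives:", y_tr.mean())
print("test share of positives: ", y_te.mean())
```

<p>If the imbalance itself is a concern, options to explore include resampling the training set or passing <code>class_weight='balanced'</code> to estimators that support it, such as <code>LogisticRegression</code>. Whether either helps here would need to be checked on the test set.</p>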


In [None]:


x_train_2,x_test_2,y_train_2,y_test_2=train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)





<h2 id="Model-Evaluation-Criterion"><strong>Model Evaluation Criterion</strong><a class="anchor-link" href="#Model-Evaluation-Criterion">¶</a></h2><h4 id="After-understanding-the-problem-statement,-think-about-which-evaluation-metrics-to-consider-and-why.">After understanding the problem statement, think about which evaluation metrics to consider and why.<a class="anchor-link" href="#After-understanding-the-problem-statement,-think-about-which-evaluation-metrics-to-consider-and-why.">¶</a></h4>
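<p>To make the choice of metric concrete, the sketch below computes recall, precision, and accuracy by hand from made-up predictions (the arrays are assumptions for illustration), with 1 meaning default. Recall on class 1 answers "of the loans that actually defaulted, how many did we catch?", which is the quantity a missed default hurts:</p>

```python
import numpy as np

# Made-up labels and predictions: 1 = default, 0 = repaid.
actual = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

tp = int(((actual == 1) & (predicted == 1)).sum())  # defaults caught
fn = int(((actual == 1) & (predicted == 0)).sum())  # defaults missed
fp = int(((actual == 0) & (predicted == 1)).sum())  # good loans flagged as default
tn = int(((actual == 0) & (predicted == 0)).sum())  # good loans passed

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / len(actual)

print(recall, precision, accuracy)
```

<p>In this toy case accuracy looks reasonable even though half of the defaults are missed, which is why accuracy alone can be misleading on imbalanced data.</p>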


In [None]:


#creating metric function 
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Eligible', 'Eligible'], yticklabels=['Not Eligible', 'Eligible'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()





<h3 id="Build-a-Logistic-Regression-Model"><strong>Build a Logistic Regression Model</strong><a class="anchor-link" href="#Build-a-Logistic-Regression-Model">¶</a></h3>


In [None]:


# Defining the Logistic regression model
# Remove _________ and complete the code
log_reg= LogisticRegression()
log_reg_2= LogisticRegression()

# Fitting the model on the training data 
# Remove _________ and complete the code

log_reg.fit(x_train,y_train)
log_reg_2.fit(x_train_2,y_train_2)





<h4 id="Checking-the-performance-on-the-train-dataset">Checking the performance on the train dataset<a class="anchor-link" href="#Checking-the-performance-on-the-train-dataset">¶</a></h4>


In [None]:


#Predict for train set
# Remove _________ and complete the code
y_pred_train = log_reg.predict(x_train)

#checking the performance on the train dataset
# Remove _________ and complete the code
metrics_score(y_train, y_pred_train)




In [None]:


#Predict for train set
# Remove _________ and complete the code
y_pred_train_2 = log_reg_2.predict(x_train_2)

#checking the performance on the train dataset
# Remove _________ and complete the code
metrics_score(y_train_2, y_pred_train_2)





<h4 id="Checking-the-performance-on-the-test-dataset">Checking the performance on the test dataset<a class="anchor-link" href="#Checking-the-performance-on-the-test-dataset">¶</a></h4>


In [None]:


#Predict for test set
# Remove _________ and complete the code

y_pred_test = log_reg.predict(x_test)

#checking the performance on the test dataset
# Remove _________ and complete the code

metrics_score(y_test, y_pred_test)





<p><strong>Observations: The model does a very good job of correctly predicting Not Eligible and of limiting the number of Not Eligible cases that are predicted as Eligible. It does a very poor job of predicting Eligible, but this is less important because we are focused on predicting defaults. </strong></p>


In [None]:


#Predict for test set
# Remove _________ and complete the code

y_pred_test_2 = log_reg_2.predict(x_test_2)

#checking the performance on the test dataset
# Remove _________ and complete the code

metrics_score(y_test_2, y_pred_test_2)





<h4 id="Let's-check-the-coefficients,-and-check-which-variables-are-important-and-how-they-affect-the-process-of-loan-approval">Let's check the coefficients, and check which variables are important and how they affect the process of loan approval<a class="anchor-link" href="#Let's-check-the-coefficients,-and-check-which-variables-are-important-and-how-they-affect-the-process-of-loan-approval">¶</a></h4>


In [None]:


# Printing the coefficients of logistic regression
# Remove _________ and complete the code


pd.Series(log_reg.coef_[0], index=x_train.columns).sort_values(ascending=False)





<p><strong>Insights: DELINQ has the largest effect on our model, followed by DEROG. Both make sense, since delinquencies seem to be a good indicator of whether someone will pay back their loan. It looks like the missing values of DEBTINC could play a role as well. If the bank were to obtain that information, we might be able to build a better model. </strong></p>



<h3 id="Think-about-it:"><strong>Think about it:</strong><a class="anchor-link" href="#Think-about-it:">¶</a></h3><ul>
<li>The above logistic regression model was built with a threshold of 0.5. Can we use a different threshold?</li>
<li>How do we find an optimal threshold, and which curve helps with that? - Precision-Recall Curve for Logistic Regression</li>
<li>How do accuracy, precision, and recall change with the threshold?</li>
</ul>


In [None]:


y_scores=log_reg.predict_proba(x_train) #predict_proba gives the probability of each observation belonging to each class


precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores[:,1])

#Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
plt.plot(thresholds, recalls[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()




In [None]:


#calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds)):
    if precisions[i]==recalls[i]:
        print(thresholds[i])
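
<p>The loop above prints a threshold only when precision and recall are exactly equal, which may not happen on every dataset. A hedged alternative is to take the threshold where the two curves are closest. The sketch below uses synthetic data; <code>X_demo</code>, <code>y_demo</code>, and <code>balanced_threshold</code> are hypothetical names.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic imbalanced data standing in for the training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.8, 0.2], random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
scores = model.predict_proba(X_demo)[:, 1]  # probability of class 1

precisions, recalls, thresholds = precision_recall_curve(y_demo, scores)

# precisions and recalls have one more element than thresholds, so drop the last point
gap = np.abs(precisions[:-1] - recalls[:-1])
best_idx = int(np.argmin(gap))
balanced_threshold = thresholds[best_idx]
print(balanced_threshold)
```

<p>Choosing the crossover point is only one convention; if missed defaults are costlier than rejected good applicants, a lower threshold that favours recall may be more appropriate.</p>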




In [None]:


optimal_threshold1 = 0.326
metrics_score(y_train, y_scores[:,1]>optimal_threshold1)





<h3 id="Build-a-Decision-Tree-Model"><strong>Build a Decision Tree Model</strong><a class="anchor-link" href="#Build-a-Decision-Tree-Model">¶</a></h3>



<h3 id="Think-about-it:"><strong>Think about it:</strong><a class="anchor-link" href="#Think-about-it:">¶</a></h3><ul>
<li>In Logistic regression we treated the outliers and built the model, should we do the same for tree based models or not? If not, why?</li>
</ul>
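<p>We do not need to treat outliers for tree-based models. A decision tree splits on thresholds, so only the ordering of a feature's values matters, not their magnitude, and an extreme value simply falls on one side of a split.</p>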
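
<p>One way to see why outlier treatment matters less for trees is to compare a tree fitted on a raw feature with one fitted on a monotone transform of it. The sketch below is a toy check on made-up numbers, not an experiment on the HMEQ data.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One feature with an extreme value at the end (illustrative data)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)
y = (x.ravel() > 5).astype(int)

# Fit one tree on the raw values and one on log-transformed values
tree_raw = DecisionTreeClassifier(random_state=1).fit(x, y)
tree_log = DecisionTreeClassifier(random_state=1).fit(np.log(x), y)

pred_raw = tree_raw.predict(x)
pred_log = tree_log.predict(np.log(x))

# The log transform shrinks the outlier but keeps the ordering, so the partition is unchanged
same_predictions = np.array_equal(pred_raw, pred_log)
print(same_predictions)
```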



<h4 id="Data-Preparation-for-the-tree-based-model">Data Preparation for the tree based model<a class="anchor-link" href="#Data-Preparation-for-the-tree-based-model">¶</a></h4>


In [None]:


# Add binary flags
# List of columns that has missing values in it
missing_col = [col for col in data.columns if data[col].isnull().any()]

for colmn in missing_col:
    add_binary_flag(data,colmn)
    




In [None]:


#  Treat Missing values in numerical columns with median and mode in categorical variables
# Select numeric columns.
num_data = data.select_dtypes('number')

# Select categorical columns.
cat_data = data.select_dtypes('category').columns.tolist()

# Fill numeric columns with median.
# Remove _________ and complete the code
# Use the median of the numeric columns only; data.median() can fail on non-numeric columns
data[num_data.columns] = num_data.fillna(num_data.median())

# Fill categorical columns with mode.
# Remove _________ and complete the code
for column in cat_data:
    mode = data[column].mode()[0]
    data[column] = data[column].fillna(mode)





<h4 id="Separating-the-target-variable-y-and-independent-variable-x">Separating the target variable y and independent variable x<a class="anchor-link" href="#Separating-the-target-variable-y-and-independent-variable-x">¶</a></h4>


In [None]:


# Drop dependent variable from dataframe and create the X(independent variable) matrix
# Remove _________ and complete the code

X = data.drop(['BAD'], axis=1)

# Create dummy variables for the categorical variables - Hint: use the get_dummies() function
# Remove _________ and complete the code
X = pd.get_dummies(X)

# Create y (dependent variable)
# Remove _________ and complete the code

y = data.BAD





<h4 id="Split-the-data">Split the data<a class="anchor-link" href="#Split-the-data">¶</a></h4>


In [None]:


# Split the data into training and test set
# Remove _________ and complete the code


x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y) 




In [None]:


#Defining Decision tree model with class weights class_weight={0: 0.2, 1: 0.8}
# Remove ___________ and complete the code

dt = DecisionTreeClassifier(class_weight={0:0.2,1:0.8}, random_state=1)




In [None]:


#fitting Decision tree model
# Remove ___________ and complete the code
dt.fit(x_train, y_train)





<h4 id="Checking-the-performance-on-the-train-dataset">Checking the performance on the train dataset<a class="anchor-link" href="#Checking-the-performance-on-the-train-dataset">¶</a></h4>


In [None]:


# Checking performance on the training data
# Remove ___________ and complete the code

y_train_pred_dt=dt.predict(x_train)
metrics_score(y_train,y_train_pred_dt)





<h4 id="Checking-the-performance-on-the-test-dataset">Checking the performance on the test dataset<a class="anchor-link" href="#Checking-the-performance-on-the-test-dataset">¶</a></h4>


In [None]:


# Checking performance on the testing data
# Remove _________ and complete the code

y_test_pred_dt=dt.predict(x_test)
metrics_score(y_test,y_test_pred_dt)





<p><strong>Insights: The model scored perfectly on the training data but noticeably worse on the test data, which suggests that it is overfitting. </strong></p>



<h3 id="Think-about-it:"><strong>Think about it:</strong><a class="anchor-link" href="#Think-about-it:">¶</a></h3><ul>
<li>Can we improve this model? <ul>
<li>Yes - the model is overfitting and needs to be adjusted.</li>
</ul>
</li>
<li>How to get optimal parameters in order to get the best possible results?<ul>
<li>We can plot feature importances to see which features have the greatest effect.</li>
<li>We can use grid search to tune the hyperparameters.</li>
</ul>
</li>
</ul>



<h3 id="Decision-Tree---Hyperparameter-Tuning"><strong>Decision Tree - Hyperparameter Tuning</strong><a class="anchor-link" href="#Decision-Tree---Hyperparameter-Tuning">¶</a></h3><ul>
<li>Hyperparameter tuning is tricky in the sense that <strong>there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model</strong>, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.</li>
<li><strong>Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.</strong> </li>
<li><strong>It is an exhaustive search</strong> that is performed on the specific parameter values of a model.</li>
<li>The parameters of the estimator/model used to apply these methods are <strong>optimized by cross-validated grid-search</strong> over a parameter grid.</li>
</ul>
<p><strong>Criterion {“gini”, “entropy”}</strong></p>
<p>The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.</p>
<p><strong>max_depth</strong></p>
<p>The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.</p>
<p><strong>min_samples_leaf</strong></p>
<p>The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.</p>
<p>You can learn about more hyperparameters at this link and try to tune them.</p>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html</a></p>
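
<p>The grid search described above can be sketched on a small synthetic problem. This is a minimal illustration, assuming a reduced grid and 3-fold cross-validation to keep it fast; <code>X_demo</code>, <code>y_demo</code>, and <code>param_grid</code> are hypothetical names.</p>

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Every combination in this grid is evaluated: 3 * 2 * 2 = 12 candidates
param_grid = {
    "max_depth": [2, 3, 4],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [5, 10],
}

# Score each candidate by recall on the positive class
recall_scorer = make_scorer(recall_score, pos_label=1)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring=recall_scorer, cv=3
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_, search.best_score_)
```

<p>With the default <code>refit=True</code>, <code>best_estimator_</code> is already refitted on all the data passed to <code>fit</code>, so a further call to <code>fit</code> on the same data is optional.</p>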



<h4 id="Using-GridSearchCV-for-Hyperparameter-tuning-on-the-model">Using GridSearchCV for Hyperparameter tuning on the model<a class="anchor-link" href="#Using-GridSearchCV-for-Hyperparameter-tuning-on-the-model">¶</a></h4>


In [None]:


# Choose the type of classifier. 
# Remove _________ and complete the code
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.2,1:0.8}, random_state=1)


# Grid of parameters to choose from
# Remove _________ and complete the code
parameters = {'max_depth': np.arange(2,7), 
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }


# Type of scoring used to compare parameter combinations
# Remove _________ and complete the code
scorer = metrics.make_scorer(recall_score, pos_label=1)


# Run the grid search
# Remove _________ and complete the code
gridCV= GridSearchCV(dtree_estimator, parameters, scoring=scorer,cv=10)


# Fit the GridSearch on train dataset
# Remove _________ and complete the code
gridCV = gridCV.fit(x_train, y_train)


# Set the clf to the best combination of parameters
# Remove _________ and complete the code
dtree_estimator = gridCV.best_estimator_


# Fit the best algorithm to the data. 
# Remove _________ and complete the code
dtree_estimator.fit(x_train, y_train)





<h4 id="Checking-the-performance-on-the-train-dataset">Checking the performance on the train dataset<a class="anchor-link" href="#Checking-the-performance-on-the-train-dataset">¶</a></h4>


In [None]:


# Checking performance on the training data based on the tuned model
# Remove _________ and complete the code

y_train_pred_dt=dtree_estimator.predict(x_train)
metrics_score(y_train,y_train_pred_dt)





<h4 id="Checking-the-performance-on-the-test-dataset">Checking the performance on the test dataset<a class="anchor-link" href="#Checking-the-performance-on-the-test-dataset">¶</a></h4>


In [None]:


# Checking performance on the testing data based on the tuned model
# Remove _________ and complete the code

y_test_pred_dt=dtree_estimator.predict(x_test)
metrics_score(y_test,y_test_pred_dt)





<p><strong>Insights: This model is much improved over the previous one. It's not overfitting: the train and test sets both have a recall of 0.89, and the f1-score and precision are close as well. </strong></p>



<h4 id="Plotting-the-Decision-Tree">Plotting the Decision Tree<a class="anchor-link" href="#Plotting-the-Decision-Tree">¶</a></h4>


In [None]:


# Plot the decision  tree and analyze it to build the decision rule
# Remove _________ and complete the code

features = list(X.columns)

plt.figure(figsize=(30,20))

tree.plot_tree(dtree_estimator,max_depth=5,feature_names=features,filled=True,fontsize=12,node_ids=True,class_names=True)
plt.show()





<h4 id="Deduce-the-business-rules-apparent-from-the-Decision-Tree-and-write-them-down:-Debt-to-income-less-than-43.7,-Deliquencies-less-than-1,-and-Credit-Line-ages-less-than-182-month-and-you-will-not-default-on-your-loan.">Deduce the business rules apparent from the Decision Tree and write them down: if debt-to-income is less than 43.7, delinquencies are less than 1, and credit line age is less than 182 months, the applicant is predicted not to default on the loan.<a class="anchor-link" href="#Deduce-the-business-rules-apparent-from-the-Decision-Tree-and-write-them-down:-Debt-to-income-less-than-43.7,-Deliquencies-less-than-1,-and-Credit-Line-ages-less-than-182-month-and-you-will-not-default-on-your-loan.">¶</a></h4>



<h3 id="Building-a-Random-Forest-Classifier"><strong>Building a Random Forest Classifier</strong><a class="anchor-link" href="#Building-a-Random-Forest-Classifier">¶</a></h3><p><strong>Random Forest is a bagging algorithm where the base models are Decision Trees.</strong> Samples are taken from the training data and on each sample a decision tree makes a prediction.</p>
<p><strong>The results from all the decision trees are combined together and the final prediction is made using voting or averaging.</strong></p>
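
<p>The bagging idea described above can be sketched by hand: draw bootstrap samples, fit one tree per sample, and combine the predictions by majority vote. This is a simplified illustration on synthetic data, not how the notebook's models are built; note that scikit-learn's <code>RandomForestClassifier</code> averages predicted probabilities rather than counting hard votes.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" considers a random subset of features at each split
    t = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
    )
    trees.append(t.fit(X_demo[idx], y_demo[idx]))

# Stack per-tree predictions: shape (n_trees, n_samples)
votes = np.stack([t.predict(X_demo) for t in trees])

# Majority vote across trees for each sample
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_demo).mean()
print(accuracy)
```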


In [None]:


# Defining Random Forest Classifier
# Remove _________ and complete the code

rf_estimator = RandomForestClassifier()
rf_estimator.fit(x_train,y_train)





<h4 id="Checking-the-performance-on-the-train-dataset">Checking the performance on the train dataset<a class="anchor-link" href="#Checking-the-performance-on-the-train-dataset">¶</a></h4>


In [None]:


#Checking performance on the training data
# Remove _________ and complete the code
y_pred_train_rf = rf_estimator.predict(x_train)
metrics_score(y_train, y_pred_train_rf)





<h4 id="Checking-the-performance-on-the-test-dataset">Checking the performance on the test dataset<a class="anchor-link" href="#Checking-the-performance-on-the-test-dataset">¶</a></h4>


In [None]:


# Checking performance on the test data
# Remove _________ and complete the code

y_pred_test_rf = rf_estimator.predict(x_test)
metrics_score(y_test, y_pred_test_rf)





<p><strong>Observations: The model seems to be overfitting the training data; however, it's still performing very well on the test data. There could still be room for improvement. </strong></p>



<h3 id="Build-a-Random-Forest-model-with-Class-Weights"><strong>Build a Random Forest model with Class Weights</strong><a class="anchor-link" href="#Build-a-Random-Forest-model-with-Class-Weights">¶</a></h3>


In [None]:


# Defining Random Forest model with class weights class_weight={0: 0.2, 1: 0.8}

# Remove _________ and complete the code

rf_model = RandomForestClassifier(class_weight={0:0.2,1:0.8})

# Fitting Random Forest model
# Remove _________ and complete the code

rf_model.fit(x_train,y_train)





<h4 id="Checking-the-performance-on-the-train-dataset">Checking the performance on the train dataset<a class="anchor-link" href="#Checking-the-performance-on-the-train-dataset">¶</a></h4>


In [None]:


# Checking performance on the train data
# Remove _________ and complete the code

y_pred_train_rf = rf_model.predict(x_train)
metrics_score(y_train, y_pred_train_rf)





<h4 id="Checking-the-performance-on-the-test-dataset">Checking the performance on the test dataset<a class="anchor-link" href="#Checking-the-performance-on-the-test-dataset">¶</a></h4>


In [None]:


# Checking performance on the test data
# Remove _________ and complete the code

y_pred_test_rf = rf_model.predict(x_test)
metrics_score(y_test, y_pred_test_rf)





<h3 id="Think-about-it:"><strong>Think about it:</strong><a class="anchor-link" href="#Think-about-it:">¶</a></h3><ul>
<li>Can we try different weights?</li>
<li>If yes, should we increase or decrease class weights for different classes? </li>
</ul>
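
<p>One hedged starting point for choosing weights is scikit-learn's "balanced" heuristic, which weights each class inversely to its frequency. The sketch below applies it to a made-up target with a 20 percent minority class, similar to the share of defaults in this dataset; <code>y_demo</code> is a hypothetical name.</p>

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up target with 80 repaid (0) and 20 defaulted (1) cases
y_demo = np.array([0] * 80 + [1] * 20)
classes = np.array([0, 1])

# "balanced" computes n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weight_map = dict(zip(classes.tolist(), weights.tolist()))

# Only the ratio between the weights matters for how strongly class 1 is favoured
ratio = weights[1] / weights[0]
print(weight_map, ratio)
```

<p>For a 20 percent minority class the balanced ratio works out to 1:4, the same ratio as <code>{0: 0.2, 1: 0.8}</code>. Increasing the weight on class 1 beyond that would be expected to push recall up at the cost of precision.</p>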



<h3 id="Tuning-the-Random-Forest"><strong>Tuning the Random Forest</strong><a class="anchor-link" href="#Tuning-the-Random-Forest">¶</a></h3>



<ul>
<li>Hyperparameter tuning is tricky in the sense that <strong>there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model</strong>, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.</li>
<li><strong>Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.</strong> </li>
<li><strong>It is an exhaustive search</strong> that is performed on the specific parameter values of a model.</li>
<li>The parameters of the estimator/model used to apply these methods are <strong>optimized by cross-validated grid-search</strong> over a parameter grid.</li>
</ul>
<p><strong>n_estimators</strong>: The number of trees in the forest.</p>
<p><strong>min_samples_split</strong>: The minimum number of samples required to split an internal node:</p>
<p><strong>min_samples_leaf</strong>: The minimum number of samples required to be at a leaf node.</p>
<p><strong>max_features{“auto”, “sqrt”, “log2”, 'None'}</strong>: The number of features to consider when looking for the best split.</p>
<ul>
<li><p>If “auto”, then max_features=sqrt(n_features). This option was deprecated in scikit-learn 1.1 and removed in 1.3; use “sqrt” on newer versions.</p>
</li>
<li><p>If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).</p>
</li>
<li><p>If “log2”, then max_features=log2(n_features).</p>
</li>
<li><p>If None, then max_features=n_features.</p>
</li>
</ul>
<p>You can learn more about Random Forest Hyperparameters from the link given below and try to tune them</p>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html</a></p>



<h4 id="Warning:-This-may-take-a-long-time-depending-on-the-parameters-you-tune."><strong>Warning:</strong> This may take a long time depending on the parameters you tune.<a class="anchor-link" href="#Warning:-This-may-take-a-long-time-depending-on-the-parameters-you-tune.">¶</a></h4>


In [None]:
# Choose the type of classifier. 
# Remove _________ and complete the code
rf_estimator_tuned = RandomForestClassifier(class_weight={0:0.2,1:0.8})


# Grid of parameters to choose from
# Remove _________ and complete the code
params_rf = {  
        "n_estimators": [100,250,500],
        "min_samples_leaf": np.arange(1, 4,1),
        # 'sqrt' is what 'auto' meant for classifiers; 'auto' is rejected by newer scikit-learn versions
        "max_features": [0.7,0.9,'sqrt'],
}

# Type of scoring used to compare parameter combinations
# Remove _________ and complete the code
scorer = metrics.make_scorer(recall_score, pos_label=1)


# Run the grid search
# Remove _________ and complete the code
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring=scorer, cv=5)


#fit the GridSearch on train dataset
# Remove _________ and complete the code
grid_obj = grid_obj.fit(x_train, y_train)


# Set the clf to the best combination of parameters
# Remove _________ and complete the code
rf_estimator_tuned = grid_obj.best_estimator_


# Fit the best algorithm to the data. 
# Remove _________ and complete the code
rf_estimator_tuned.fit(x_train, y_train)





<h4 id="Checking-the-performance-on-the-train-dataset">Checking the performance on the train dataset<a class="anchor-link" href="#Checking-the-performance-on-the-train-dataset">¶</a></h4>


In [None]:
# Checking performance on the training data
# Remove _________ and complete the code
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)
metrics_score(y_train, y_pred_train_rf_tuned)





<h4 id="Checking-the-performance-on-the-test-dataset">Checking the performance on the test dataset<a class="anchor-link" href="#Checking-the-performance-on-the-test-dataset">¶</a></h4>


In [None]:
# Checking performance on test dataset
# Remove _________ and complete the code

y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)
metrics_score(y_test, y_pred_test_rf_tuned)





<p><strong>Insights: There is a slight improvement in precision, and the model doesn't appear to be overfitting. We could tweak the parameters for further improvement. </strong></p>



<h4 id="Plot-the-Feature-importance-of-the-tuned-Random-Forest">Plot the Feature importance of the tuned Random Forest<a class="anchor-link" href="#Plot-the-Feature-importance-of-the-tuned-Random-Forest">¶</a></h4>


In [None]:
# Importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; it is also known as the Gini importance)
# Plotting the feature importances of the tuned Random Forest
# Remove _________ and complete the code

importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()





<h3 id="Think-about-it:"><strong>Think about it:</strong><a class="anchor-link" href="#Think-about-it:">¶</a></h3><ul>
<li>We have only built 3 models so far, Logistic Regression, Decision Tree and Random Forest </li>
<li>We can build other Machine Learning classification models like kNN, LDA, QDA or even Support Vector Machines (SVM).</li>
<li>Can we also perform feature engineering and create model features and build a more robust and accurate model for this problem statement? </li>
</ul>


In [None]:
knn = KNeighborsClassifier()

# We select the best value of k for which the error rate is the least in the validation data
# Let us loop over a few values of k to determine the best k

train_error = []
test_error = []
knn_many_split = {}

error_df_knn = pd.DataFrame()
features = X.columns

for k in range(1,15):
    train_error = []
    test_error = []
    lista = []
    knn = KNeighborsClassifier(n_neighbors=k)
    for i in range(30):
        x_train_new, x_val, y_train_new, y_val = train_test_split(x_train, y_train, test_size = 0.20)
    
        #Fitting knn on training data
        knn.fit(x_train_new, y_train_new)
        #Calculating error on training and validation data
        train_error.append(1 - knn.score(x_train_new, y_train_new)) 
        test_error.append(1 - knn.score(x_val, y_val))
    lista.append(sum(train_error)/len(train_error))
    lista.append(sum(test_error)/len(test_error))
    knn_many_split[k] = lista

knn_many_split




In [None]:


kltest = []
vltest = []
for k, v in knn_many_split.items():
    kltest.append(k)
    vltest.append(knn_many_split[k][1])

kltrain = []
vltrain = []

for k, v in knn_many_split.items():
    kltrain.append(k)
    vltrain.append(knn_many_split[k][0])

# Plotting k vs error
plt.figure(figsize=(10,6))
plt.plot(kltest,vltest, label = 'test' )
plt.plot(kltrain,vltrain, label = 'train')
plt.legend()
plt.show()




In [None]:


#define knn model
knn=KNeighborsClassifier(n_neighbors=4)




In [None]:


#fitting data to the KNN model
knn.fit(x_train,y_train)




In [None]:


#checking the performance of knn model
y_pred_train_knn = knn.predict(x_train)
metrics_score(y_train, y_pred_train_knn)




In [None]:


y_pred_test_knn = knn.predict(x_test)
metrics_score(y_test, y_pred_test_knn)




In [None]:


params_knn={'n_neighbors':np.arange(3,15),'weights':['uniform','distance'],'p':[1,2]}

grid_knn=GridSearchCV(estimator=knn,param_grid=params_knn,scoring='recall',cv=10)

model_knn=grid_knn.fit(x_train,y_train)

knn_estimator = model_knn.best_estimator_
print(knn_estimator)




In [None]:


#Fit the best estimator on the training data
knn_estimator.fit(x_train, y_train)




In [None]:


y_pred_train_knn_estimator = knn_estimator.predict(x_train)
metrics_score(y_train, y_pred_train_knn_estimator)




In [None]:


y_pred_test_knn_estimator = knn_estimator.predict(x_test)
metrics_score(y_test, y_pred_test_knn_estimator)
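
<p>kNN is distance-based, so features on large scales (for example a loan amount) can dominate the neighbour search. The cells above fit kNN on the unscaled features; the sketch below shows one possible variation that standardizes inside a pipeline. It uses synthetic data, and <code>knn_pipe</code> is a hypothetical name that does not appear elsewhere in the notebook.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is put on a much larger scale on purpose
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_demo[:, 0] *= 10000

# The scaler is fitted on each training fold only, which avoids leaking validation data
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Cross-validated recall for the positive class
scores = cross_val_score(knn_pipe, X_demo, y_demo, cv=5, scoring="recall")
print(scores.mean())
```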





<h3 id="Comparing-Model-Performances"><strong>Comparing Model Performances</strong><a class="anchor-link" href="#Comparing-Model-Performances">¶</a></h3>


In [None]:


def get_recall_score(model,flag=True,x_train=x_train,x_test=x_test):
    '''
    model : classifier to predict values of X

    '''
    a = [] # defining an empty list to store train and test results
    pred_train = model.predict(x_train)
    pred_test = model.predict(x_test)
    train_recall = metrics.recall_score(y_train,pred_train)
    test_recall = metrics.recall_score(y_test,pred_test)
    a.append(train_recall) # adding train recall to list 
    a.append(test_recall) # adding test recall to list
    
    # The following print statements are displayed only if flag is True (the default).
    if flag:
        print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
        print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
    
    return a # returning the list with train and test scores




In [None]:


##  Function to calculate precision score
def get_precision_score(model,flag=True,x_train=x_train,x_test=x_test):
    '''
    model : classifier to predict values of X

    '''
    b = []  # defining an empty list to store train and test results
    pred_train = model.predict(x_train)
    pred_test = model.predict(x_test)
    train_precision = metrics.precision_score(y_train,pred_train)
    test_precision = metrics.precision_score(y_test,pred_test)
    b.append(train_precision) # adding train precision to list
    b.append(test_precision) # adding test precision to list
    
    # The following print statements are displayed only if flag is True (the default).
    if flag:
        print("Precision on training set : ",metrics.precision_score(y_train,pred_train))
        print("Precision on test set : ",metrics.precision_score(y_test,pred_test))

    return b # returning the list with train and test scores




In [None]:


##  Function to calculate accuracy score
def get_accuracy_score(model,flag=True,X_train=x_train,X_test=x_test):
    '''
    model : classifier to predict values of X

    '''
    c = [] # defining an empty list to store train and test results
    train_acc = model.score(X_train,y_train)
    test_acc = model.score(X_test,y_test)
    c.append(train_acc) # adding train accuracy to list
    c.append(test_acc) # adding test accuracy to list
    
    # The following print statements are displayed only if flag is True (the default).
    if flag:
        print("Accuracy on training set : ",model.score(X_train,y_train))
        print("Accuracy on test set : ",model.score(X_test,y_test))
    
    return c # returning the list with train and test scores




In [None]:


# Make the list of all the model names 

models = [log_reg, dtree_estimator, rf_model, rf_estimator_tuned]
# Remove _________ and complete the code

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []

# looping through all the models to get the accuracy,recall and precision scores
for model in models:
     # accuracy score
    j = get_accuracy_score(model,False)
    acc_train.append(j[0])
    acc_test.append(j[1])

    # recall score
    k = get_recall_score(model,False)
    recall_train.append(k[0])
    recall_test.append(k[1])

    # precision score
    l = get_precision_score(model,False)
    precision_train.append(l[0])
    precision_test.append(l[1])




In [None]:


# Mention the model names in the list, for example 'Model': ['Decision Tree', 'Tuned Decision Tree', ...]; write the names of all the models built
# Remove _________ and complete the code

comparison_frame = pd.DataFrame({'Model':['Logistic Regression', 'Decision Tree', 'Random Forest', 'Random Forest Tuned'], 
                                          'Train_Accuracy': acc_train,
                                          'Test_Accuracy': acc_test,
                                          'Train_Recall': recall_train,
                                          'Test_Recall': recall_test,
                                          'Train_Precision': precision_train,
                                          'Test_Precision': precision_test}) 
comparison_frame
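
<p>The three helper functions and the loop above could also be folded into a single function that returns the comparison table directly. The sketch below is one possible shape, shown on synthetic data with two stand-in models; <code>summarize</code> and the <code>_demo</code> names are hypothetical.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def summarize(models, X_tr, y_tr, X_te, y_te):
    """Return one row per fitted model with train/test accuracy, recall, precision."""
    rows = []
    for name, model in models.items():
        row = {"Model": name}
        for split, X_part, y_part in (("Train", X_tr, y_tr), ("Test", X_te, y_te)):
            pred = model.predict(X_part)
            row[f"{split}_Accuracy"] = accuracy_score(y_part, pred)
            row[f"{split}_Recall"] = recall_score(y_part, pred)
            # zero_division=0 avoids a warning if a model predicts no positives
            row[f"{split}_Precision"] = precision_score(y_part, pred, zero_division=0)
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic stand-in data and models (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
models_demo = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr),
}
summary = summarize(models_demo, X_tr, y_tr, X_te, y_te)
print(summary)
```

<p>Passing the data explicitly, rather than relying on default arguments bound to global variables, makes it harder to score a model on a feature matrix it was not trained on.</p>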





<p><strong>Insights: The Tuned Random Forest appears to have the best overall performance based on the metrics above. However, I believe I can achieve better results with some more tuning or possibly some different methods.</strong></p>



<p><strong>1. Refined insights -</strong> What are the most meaningful insights from the data relevant to the problem?</p>



<p><strong>2. Comparison of various techniques and their relative performance -</strong> How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?</p>



<p><strong>3. Proposal for the final solution design -</strong> What model do you propose to be adopted? Why is this the best solution to adopt?</p>
