# Home Loan Prediction

The goal of this task is to build a predictive machine learning classifier that will use historic data of loan decisions (ground truth) to make recommended decisions for new loan applications. Along the way, you’ll get comfortable with using Jupyter Notebooks to document your work, using scikit-learn to train and test basic ML models, and start thinking about the strengths and weaknesses of models like this.

The dataset `home_loans_sample.csv` is a listing of home loan applications in Washington, USA, where each row of the dataset is an individual loan application. Your goal in this assignment is to build a machine learning model that can accurately predict whether a given loan application was accepted or rejected.

## Background
In the US (and a number of other countries), it is common for people who want to buy a home to apply for a loan (called a “mortgage”) from a bank or other lender. In order for the bank to decide whether to give them this loan, the bank will look at a number of factors to determine whether they are likely to be able to repay the loan because the bank does not want to lend money to somebody who would not be able to pay it back. These factors typically include how much money the person makes, how much money they are asking for in the loan, and a handful of other things.  Banks typically have formulas in place to determine who they will approve for a given loan and who they will deny, but there is also frequently some human judgment involved in the process.

If the person is approved for the loan, they will use the loan to purchase the home and will pay it back in small chunks every month for a set period of time (usually 15 or 30 years). If they are denied, they may have to find a less expensive home to buy or address some of the factors that led them to be denied (e.g., they might have to wait until they make more money). 

In this assignment, we are given a set of data about whether customers were approved or denied a loan based on a number of factors. Your job will be to build a model that predicts whether a person will be approved or denied based on these same factors.


## Part 1: Data Exploration
The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data. Why? Datasets can consume a _lot_ of space in your computer's memory, and traditional Python data structures like lists or dictionaries become painfully slow as we add thousands of rows of data. `pandas` provides a specialized data structure called a `DataFrame` that is designed to be fast and efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/

In [1]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans_sample.csv', low_memory=False) # read the csv file into a pandas dataframe object

To understand what kind of data was collected, `pandas` has some handy commands:

- `df.head()` will show us the first 5 rows of our dataset. You can also specify the number of rows: for example, `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset (but this may not show all the columns!)
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
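
As a minimal sketch of the commands listed above, the snippet below runs them on a tiny, made-up DataFrame (the values are hypothetical and only stand in for the real loan file; the column names are borrowed from the dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the real loan file.
toy = pd.DataFrame({
    "loan_amount_000s": [120, 250, 90, 310],
    "applicant_income_000s": [55, 98, 40, 130],
    "loan_purpose_name": ["Refinancing", "Home purchase",
                          "Refinancing", "Home improvement"],
})

n_rows, n_cols = toy.shape          # (rows, columns) tuple
first_two = toy.head(2)             # first 2 rows only
column_names = list(toy.columns)    # column labels as a plain list
summary = toy.describe()            # statistics for numeric columns only

print(n_rows, n_cols)
print(column_names)
print(summary.loc["mean"])
```

Note that `describe()` skips the text column `loan_purpose_name` by default, which is why it only summarizes numerical columns.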

### Question 1.a:  How many rows are in this dataset? How many columns?
_Double click this text to write your answer to the question here. Show your python work in the code block (`In [  ]:`) below:_

27 columns and 369281 rows

In [2]:
print("{} columns and {} rows".format(len(df.columns),len(df)))

27 columns and 369281 rows


### Question 1.b: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?
_Double click this text to write your answer to the question here. Show your python work in the code block below, if applicable:_

The 10th column: `loan_approved`

In [3]:
print(df.columns)

Index(['town_name', 'county_name', 'loan_amount_000s', 'applicant_income_000s',
       'property_type_name', 'occupied_by_owner', 'loan_type_name',
       'is_hoepa_loan', 'loan_purpose_name', 'loan_approved',
       'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1',
       'co_applicant_sex_name', 'co_applicant_race_name_5',
       'co_applicant_race_name_4', 'co_applicant_race_name_3',
       'co_applicant_race_name_2', 'co_applicant_race_name_1',
       'co_applicant_ethnicity_name', 'applicant_sex_name',
       'applicant_race_name_5', 'applicant_race_name_4',
       'applicant_race_name_3', 'applicant_race_name_2',
       'applicant_race_name_1', 'applicant_ethnicity_name'],
      dtype='object')


### Question 1.c: What reasons were given in this dataset for denying a loan application?
Hint: There are 3 columns in the dataset that list why a loan was denied. Try looking up the pandas command to list the unique values in a column.

_Double click this text to write your answer to the question here. Show your python work in the code block below:_

nan, 'Other', 'Credit history',
       'Insufficient cash (downpayment, closing costs)',
       'Employment history', 'Debt-to-income ratio',
       'Unverifiable information', 'Collateral',
       'Credit application incomplete', 'Mortgage insurance denied'

In [4]:
df['denial_reason_name_1'].unique()
df['denial_reason_name_2'].unique()
df['denial_reason_name_3'].unique()

array([nan, 'Other', 'Credit history',
       'Insufficient cash (downpayment, closing costs)',
       'Employment history', 'Debt-to-income ratio',
       'Unverifiable information', 'Collateral',
       'Credit application incomplete', 'Mortgage insurance denied'],
      dtype=object)
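
A notebook cell only displays the value of its last expression, so the cell above shows the unique values of one column only. One possible way to gather the reasons from all three columns at once is sketched below on hypothetical toy data; `reason_cols` and the toy values are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's three denial-reason columns.
toy = pd.DataFrame({
    "denial_reason_name_1": ["Credit history", None, "Collateral"],
    "denial_reason_name_2": [None, None, "Other"],
    "denial_reason_name_3": [None, None, "Credit history"],
})
reason_cols = ["denial_reason_name_1", "denial_reason_name_2",
               "denial_reason_name_3"]

# melt() stacks the three columns into one long "value" column,
# dropna() discards missing entries, unique() keeps each reason once.
reasons = sorted(toy[reason_cols].melt()["value"].dropna().unique())
print(reasons)
```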

### Question 1.d: Given the denial reasons columns and the rest of the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.
_Double click this text to write your answer to the question here. Show your python work in the code block below, if applicable:_

#1. Renting history

#2. Criminal history

#3. Marital status / number of people living with the person

In [5]:
df.columns

Index(['town_name', 'county_name', 'loan_amount_000s', 'applicant_income_000s',
       'property_type_name', 'occupied_by_owner', 'loan_type_name',
       'is_hoepa_loan', 'loan_purpose_name', 'loan_approved',
       'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1',
       'co_applicant_sex_name', 'co_applicant_race_name_5',
       'co_applicant_race_name_4', 'co_applicant_race_name_3',
       'co_applicant_race_name_2', 'co_applicant_race_name_1',
       'co_applicant_ethnicity_name', 'applicant_sex_name',
       'applicant_race_name_5', 'applicant_race_name_4',
       'applicant_race_name_3', 'applicant_race_name_2',
       'applicant_race_name_1', 'applicant_ethnicity_name'],
      dtype='object')

## Part 2: Preparing Data to Input to a Model
Here we'll start using `scikit-learn`, which provides simple library calls for most things we'd like to do in a simple machine learning pipeline. If you haven't used `scikit-learn` before, this tutorial may be useful to give you a sense of what the library can do: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Machine learning models can only understand data that is represented numerically, but many of the columns in our dataset, like `town_name`, are text _categorical_ data. Meanwhile, many models do better when continuous numerical data falls within a small, consistent range, such as between -1 and 1, which is definitely not the case with our loan amounts in thousands of dollars.

So first, we will separate the features (sometimes called attributes) of our samples (called _X_) into categorical and continuous groups so that we can preprocess each group appropriately.

In [6]:
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities

features_cat = ['loan_purpose_name', 'applicant_sex_name', 'town_name',
                'county_name']

features_num = ['loan_amount_000s', 'applicant_income_000s', 'is_hoepa_loan']

X_cat = df[features_cat]
X_num = df[features_num]

### Part 2.A One Hot Encode categorical variables
Run the following code to One Hot Encode the categorical features:

In [7]:
enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

In [8]:
# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names_out())
X_cat_proc.head()

Unnamed: 0,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable,town_name_Bellingham - WA,"town_name_Bremerton, Silverdale - WA","town_name_Kennewick, Richland - WA",...,county_name_Snohomish County,county_name_Spokane County,county_name_Stevens County,county_name_Thurston County,county_name_Wahkiakum County,county_name_Walla Walla County,county_name_Whatcom County,county_name_Whitman County,county_name_Yakima County,county_name_nan
0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Question 2.a: In your own words, how is one hot encoding transforming the categorical data? What does the term "one-hot" refer to?
One-hot encoding converts categorical data into numerical data made of binary values (0s and 1s). Each distinct category value becomes its own column; in each row, the column matching that row's category is set to 1 and all other columns for that feature are set to 0. It is called "one-hot" because exactly one column per original feature is "hot" (1) in any given row.
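
To make the idea concrete, here is a small sketch that one-hot encodes a single made-up column of loan purposes (the four rows are hypothetical). Each row of the result should contain exactly one 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, four hypothetical applications.
purposes = np.array([["Refinancing"],
                     ["Home purchase"],
                     ["Refinancing"],
                     ["Home improvement"]])

enc = OneHotEncoder()
encoded = enc.fit_transform(purposes).toarray()  # dense array of 0s and 1s

print(enc.categories_[0])   # one output column per distinct category
print(encoded)

row_sums = encoded.sum(axis=1)  # exactly one "hot" column per row
```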

### Part 2.B Scaling down continuous numerical data
Run the following code to standardize the continuous data around each column's mean. This ensures that the average of each feature, such as the average loan amount requested, is scaled to 0 (and, with the default settings of `preprocessing.scale`, that each column has unit variance). Values less than the average become negative numbers, and values larger than the average become positive numbers.

In [9]:
from sklearn import preprocessing

scaled = preprocessing.scale(X_num)
X_num_proc = pd.DataFrame(scaled, columns=features_num)
X_num_proc.head()

|   | loan_amount_000s | applicant_income_000s | is_hoepa_loan |
|---|---|---|---|
| 0 | -0.130864 | 0.016448 | -0.005701 |
| 1 | -0.10368 | -0.596232 | -0.005701 |
| 2 | -0.101589 | 0.024727 | -0.005701 |
| 3 | 0.128424 | 1.664059 | -0.005701 |
| 4 | 0.266432 | -0.000111 | -0.005701 |
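
As a quick sanity check of what scaling does, the sketch below applies `preprocessing.scale` to four made-up loan amounts and confirms that the result has mean 0 and standard deviation 1. The numbers are hypothetical.

```python
import numpy as np
from sklearn import preprocessing

# Four hypothetical loan amounts (in thousands of dollars), one column.
loan_amounts = np.array([[100.0], [200.0], [300.0], [400.0]])

scaled = preprocessing.scale(loan_amounts)

print(scaled.ravel())   # below-average values are negative, above-average positive
print(scaled.mean())    # approximately 0
print(scaled.std())     # approximately 1
```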


### Part 2.C Merge our feature sets into one sample dataset _X_ and fix NaN values
Run the code below to combine the numerical and categorical feature sets.

In [10]:
X = pd.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s,is_hoepa_loan,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable,...,county_name_Snohomish County,county_name_Spokane County,county_name_Stevens County,county_name_Thurston County,county_name_Wahkiakum County,county_name_Walla Walla County,county_name_Whatcom County,county_name_Whitman County,county_name_Yakima County,county_name_nan
0,-0.130864,0.016448,-0.005701,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.10368,-0.596232,-0.005701,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.101589,0.024727,-0.005701,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.128424,1.664059,-0.005701,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.266432,-0.000111,-0.005701,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Question 2.c The code line below replaces any NaN values in our sample with 0. NaNs are missing values that a model won't be able to understand. What is the _semantic_ meaning of replacing a NaN with 0 for the categorical variables? And for the continuous numerical variables?

Because we one-hot encode the categorical variables, a row with 0 in every column of a given category (e.g., race) means that we have no information for that category, so it will not contribute to the decision of whether to give the person a loan. For the continuous numerical variables, a filled-in 0 is indistinguishable from a real value of 0; since these columns were scaled around their mean, this treats the missing value as if it were the column average.

In [11]:
X = X.fillna(0) # remove NaN values
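
The sketch below illustrates, on a made-up income column, why filling a NaN with 0 after standardizing amounts to imputing the column mean. The standardization is done by hand with pandas here (an assumption made for clarity, not the notebook's own code path).

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one missing value.
income = pd.Series([40.0, 60.0, np.nan, 80.0])

mean = income.mean()          # pandas ignores NaN: (40 + 60 + 80) / 3
std = income.std(ddof=0)      # population standard deviation, NaN ignored

standardized = (income - mean) / std   # the NaN stays NaN
filled = standardized.fillna(0)        # replace the NaN with 0

# Undo the scaling to see what raw value the filled-in 0 corresponds to.
recovered = filled * std + mean
print(recovered[2], mean)
```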

### Part 2.D Create our target array _y_ that our model will try to predict

In [12]:
y = df['loan_approved'] # target

### Part 2.E Split our data into training, test, and validation sets
Run the code below to split the data. Both validation and test sets will be used for testing our model, but use the validation set while you are developing and improving your model, and leave the test for late stage evaluation.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50,  random_state=16) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct

(258496, 65) (55392, 65) (55393, 65)
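
A possible variation on the split above, shown on made-up data: passing `random_state` to both calls makes the split reproducible, and `stratify` keeps the approved/denied ratio the same in every subset. Both are optional arguments of `train_test_split`; using them here is a suggestion, not part of the assignment code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80% labelled 1 and 20% labelled 0.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 80 + [0] * 20)

# 70% training, 30% held out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=16)

# Split the held-out 30% in half: 15% validation, 15% test.
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=16)

print(len(X_tr), len(X_val), len(X_te))
print(y_tr.mean(), y_val.mean(), y_te.mean())  # share of class 1 in each subset
```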


### Question 2.e: In a single sentence, what is the difference between train, test, and validation sets?


_The training set_ is used to build our ML model, _the validation set_ is a subset of our original dataset that helps us evaluate the model's performance as we train and tweak it, and _the test set_ is our final step in evaluating the model's performance on data it hasn't seen.

## Part 3. Developing Models
Scikit-learn has a substantial library of different models we can use for classification. Two of the simplest classification models, Logistic Regression and the Dummy Classifier, are implemented below.

In [14]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    print('\nReport:\n', classification_report(y_true, y_pred))

In [15]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs',max_iter=1000).fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

#how is the precision = 1 ? solved in office hours

Confusion matrix:
 [[   24  9041]
 [    8 46319]]

Report:
               precision    recall  f1-score   support

           0       0.75      0.00      0.01      9065
           1       0.84      1.00      0.91     46327

    accuracy                           0.84     55392
   macro avg       0.79      0.50      0.46     55392
weighted avg       0.82      0.84      0.76     55392
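
To see where the numbers in the report come from, the sketch below recomputes precision, recall, and accuracy by hand from the confusion matrix printed above. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, with label 0 first.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[24, 9041],
               [8, 46319]])

# Precision: of everything predicted as a class, how much was correct?
precision_0 = cm[0, 0] / cm[:, 0].sum()
precision_1 = cm[1, 1] / cm[:, 1].sum()

# Recall: of everything truly in a class, how much did we find?
recall_0 = cm[0, 0] / cm[0, :].sum()
recall_1 = cm[1, 1] / cm[1, :].sum()

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_1, 2), round(recall_1, 2))
print(round(accuracy, 2))
```

The model predicted "denied" only 32 times in total, which is why class 0 has reasonable precision but almost no recall.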



The Dummy Classifier is a 'dummy' because it is going to use zero machine learning, and simply predict "approve this loan" (value 1) for every loan it sees.

In [16]:
from sklearn.dummy import DummyClassifier

approve_everyone = DummyClassifier(strategy='constant', constant = 1).fit(X_train, y_train) # first fit (train) the model
y_pred_dummy = approve_everyone.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred_dummy) # finally evaluate performance

Confusion matrix:
 [[    0  9065]
 [    0 46327]]

Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      9065
           1       0.84      1.00      0.91     46327

    accuracy                           0.84     55392
   macro avg       0.42      0.50      0.46     55392
weighted avg       0.70      0.84      0.76     55392





### Question 3.a: Considering only the data itself, why do Logistic Regression and the Dummy Classifier perform the same? What is the semantic meaning for why Dummy Classifier has such high accuracy?

I think it's because the data is imbalanced: most of the applications in our data were approved, so the Dummy Classifier's constant "approve" prediction is right most of the time.
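
A short sketch of this reasoning: the accuracy of an "approve everyone" classifier equals the share of approved loans. The series below is rebuilt from the support counts in the validation reports above (46327 approved, 9065 denied); the variable names are illustrative.

```python
import pandas as pd

# Validation labels reconstructed from the support counts in the reports.
y_val_counts = pd.Series([1] * 46327 + [0] * 9065)

class_share = y_val_counts.value_counts(normalize=True)  # fraction per class
baseline_accuracy = class_share.max()                    # always predict the majority

print(class_share)
print(round(baseline_accuracy, 2))
```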

## Part 4: Obtaining a Baseline

### Task 4.a: Create a new balanced dataset where exactly half of the samples are rejected loan applications and half are accepted loan applications.
(Hint: You may choose to do this iteratively with for..loops (which may take a _long_ time to run), or consider adapting the _Down-sample Majority Class_ code from [this link](https://elitedatascience.com/imbalanced-classes) - although do note that `balance` is not our target column and `49` is likely not the correct number of samples)

_Show your python work in the code block below:_

In [17]:
import pandas as pd # import pandas library
from sklearn.utils import resample,shuffle

df = pd.read_csv('data/home_loans_sample.csv', low_memory=False)

# Separate majority and minority classes
df_majority = df[df.loan_approved==1]
df_minority = df[df.loan_approved==0]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, replace= False, n_samples=60380, random_state=123) 

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.loan_approved.value_counts()
# 1    60380
# 0    60380

1    60380
0    60380
Name: loan_approved, dtype: int64
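
The cell above hard-codes the minority class size (`60380`). A slightly more general sketch, shown here on a made-up frame, reads that number from the data with `len(minority)` so the code keeps working if the dataset changes. Shuffling and resetting the index afterwards are optional extras, not something the assignment requires.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 8 approved loans, 3 denied.
toy = pd.DataFrame({
    "loan_approved": [1] * 8 + [0] * 3,
    "loan_amount_000s": range(11),
})

majority = toy[toy["loan_approved"] == 1]
minority = toy[toy["loan_approved"] == 0]

# Draw as many majority rows as there are minority rows, without replacement.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=123)

# Combine, shuffle, and give the result a clean 0..n-1 index.
balanced = (pd.concat([majority_down, minority])
            .sample(frac=1, random_state=123)
            .reset_index(drop=True))

counts = balanced["loan_approved"].value_counts()
print(counts)
```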

### Task 4.b: Below, retry training and evaluating a Logistic regression model on the updated data. What is the new performance of the model? If you were to re-run the DummyClassifier on the balanced data, what do you think would happen to its performance?
(Hint: After balancing your original dataset, you might want to repeat the data processing, feature selection, etc. process performed in Parts 2 and 3).


If we rerun the Dummy Classifier, we'd get about 0.5 accuracy because the data is now balanced (50% approved loans, 50% denied).


In [18]:
from sklearn import preprocessing # import preprocessing utilities
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

features_cat = ['loan_purpose_name', 'applicant_sex_name', 'town_name',
'county_name']

features_num = ['loan_amount_000s', 'applicant_income_000s', 'is_hoepa_loan']

X_cat = df_downsampled[features_cat]
X_num = df_downsampled[features_num]


enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names_out())
#X_cat_proc.head()

scaled = preprocessing.scale(X_num)
X_num_proc = pd.DataFrame(scaled, columns=features_num)
X = pd.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X = X.fillna(0) # remove NaN values


y = df_downsampled['loan_approved'] # target


X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50,  random_state=16) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct


model = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set

def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    print('\nReport:\n', classification_report(y_true, y_pred))
    
    

metrics(y_validation, y_pred)



(84532, 65) (18114, 65) (18114, 65)
Confusion matrix:
 [[7131 1882]
 [4314 4787]]

Report:
               precision    recall  f1-score   support

           0       0.62      0.79      0.70      9013
           1       0.72      0.53      0.61      9101

    accuracy                           0.66     18114
   macro avg       0.67      0.66      0.65     18114
weighted avg       0.67      0.66      0.65     18114



### Task 4.c: Interpret the Confusion Matrix by identifying the numbers of true/false positives/negatives in the code below:

In [21]:
# Write the numbers of each in below. MAKE SURE you understand which cell in the confusion matrix has each of these.
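false_negative = 4314 # truly approved (1), predicted denied (0)
false_positive = 1882 # truly denied (0), predicted approved (1)
true_negative = 7131 # truly denied (0), predicted denied (0)
true_positive = 4787 # truly approved (1), predicted approved (1)
# these values change a little every time I rerun the code above 

print (false_negative, false_positive, true_negative, true_positive)

4314 1882 7131 4787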
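
For a binary problem, scikit-learn's confusion matrix puts true labels on the rows and predicted labels on the columns, with label 0 first. Treating 1 (approved) as the positive class, flattening the matrix with `ravel()` therefore yields the counts in the order TN, FP, FN, TP. The sketch below applies this to the matrix printed in Task 4.b.

```python
import numpy as np

# Confusion matrix copied from the Task 4.b output.
cm = np.array([[7131, 1882],
               [4314, 4787]])

# Row 0 = truly denied, row 1 = truly approved;
# column 0 = predicted denied, column 1 = predicted approved.
tn, fp, fn, tp = cm.ravel()

print("true negatives: ", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives: ", tp)
```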


### Task 4.d: How does a *false positive* in this situation hurt _the bank_? (~20 words)

This would hurt the bank by lending money to somebody who most likely would not be able to pay it back.

### Task 4.e: How does a *false negative* in this situation hurt _the bank_? (~20 words)

The bank would lose customers to whom it could safely have lent money, and with them the potential interest income on loans that could have been granted.

### Task 4.f: How does a *false positive* in this situation hurt _the customer_? (~20 words)

The customer might take on a loan that they would probably not be able to pay back and end up drowning in debt as a result.

### Task 4.g: How does a *false negative* in this situation hurt _the customer_? (~20 words)

The customer won't be able to get this source of financial assistance, which might cause them other financial and personal problems.

## Part 5: Your Turn!

### Task 5.a: Improving the model - why might this be difficult? 

Use your own imagination and experimentation to improve predictive performance for this task, modifying the model/algorithm choices, feature/attribute choices, and data processing however you think will be most effective. It is _very_ difficult to improve the model's performance significantly. Why might that be?

Important! Don't spend too much time on this portion! Consider your ability to improve the model above the baseline after Task 4.B to be only ~10% of this assignment effort, with ~5% of that given for small improvements to performance. Thus while I encourage you to experiment, do not sink excessive time into this task. The goal is to learn about the process of machine learning model development, and some of the common pitfalls!

_Hint:_ Be sure to check out the other Supervised ML models/algorithms available from the [sci-kit learn documentation](https://scikit-learn.org/stable/user_guide.html#user-guide).

- Training on 60% or less of our data instead of 70% might help us avoid overfitting.
- More data would help, but we are constrained by having fewer "0" outcomes than "1" outcomes, so using more data would mean going back to imbalanced data.
- We can be more selective about the features we feed the model; for example, race/gender shouldn't matter in deciding whether a person gets a home loan.
- We can also experiment with different classification algorithms, but the ones I tried from the linked website only seemed to perform worse than logistic regression.
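
When experimenting with features and models, repeating the encoding and scaling code by hand for every attempt is error-prone. One possible way to organize it, sketched below on synthetic data, is scikit-learn's `ColumnTransformer` inside a `Pipeline`, so the preprocessing is fitted on the training split only and reused for validation. The toy data, the synthetic approval rule, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the loan data (values are made up).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "loan_amount_000s": rng.uniform(50, 500, n),
    "applicant_income_000s": rng.uniform(20, 200, n),
    "loan_purpose_name": rng.choice(
        ["Refinancing", "Home purchase", "Home improvement"], n),
})
# Synthetic rule: approve when income is large relative to the loan.
toy["loan_approved"] = (
    toy["applicant_income_000s"] * 3 > toy["loan_amount_000s"]).astype(int)

num_cols = ["loan_amount_000s", "applicant_income_000s"]
cat_cols = ["loan_purpose_name"]

# Scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_train, X_val, y_train, y_val = train_test_split(
    toy[num_cols + cat_cols], toy["loan_approved"],
    test_size=0.30, random_state=16)

clf.fit(X_train, y_train)            # preprocessing is fitted on training data only
val_accuracy = clf.score(X_val, y_val)
print(round(val_accuracy, 3))
```

Swapping in a different feature list or a different final estimator then only requires changing one line.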

In [29]:
# Load libraries
import pandas as pd
from sklearn import preprocessing # import preprocessing utilities
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier

# NOTE: 'loan_purpose_name' and 'applicant_sex_name' appear twice in this list,
# so their one-hot columns are duplicated in X.
# NOTE: the denial_reason_name_* columns describe why an application was denied,
# so including them as features likely leaks the target into the model.
features_cat = ['loan_purpose_name', 'applicant_sex_name', 'town_name',
'county_name','property_type_name', 'occupied_by_owner', 'loan_type_name',
'loan_purpose_name', 'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1',
'co_applicant_sex_name', 'co_applicant_race_name_5',
'co_applicant_race_name_4', 'co_applicant_race_name_3',
'co_applicant_race_name_2', 'co_applicant_race_name_1',
'co_applicant_ethnicity_name', 'applicant_sex_name',
'applicant_race_name_5', 'applicant_race_name_4',
'applicant_race_name_3', 'applicant_race_name_2',
'applicant_race_name_1', 'applicant_ethnicity_name']

features_num = ['loan_amount_000s', 'applicant_income_000s', 'is_hoepa_loan']

X_cat = df_downsampled[features_cat]
X_num = df_downsampled[features_num]


enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data 
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names_out())
#X_cat_proc.head()

scaled = preprocessing.scale(X_num)
X_num_proc = pd.DataFrame(scaled, columns=features_num)
X = pd.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X = X.fillna(0) # remove NaN values
y = df_downsampled['loan_approved'] # target

X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.60) # hold out 60%, leaving 40% of our data for training
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.20) # split the held-out data into validation (48% of all data) and test (12%)
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct
    

model = LogisticRegression(solver='lbfgs',max_iter=1000).fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
predictions = model.predict(X_test)
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    print('\nReport:\n', classification_report(y_true, y_pred))
    
#print("Accuracy:",metrics(y_test, y_pred))
metrics(y_validation, y_pred)

# evaluate on the withheld test set
cm = confusion_matrix(y_test, predictions)
print(cm)
print(classification_report(y_test, predictions))

(48304, 170) (57964, 170) (14492, 170)
Confusion matrix:
 [[21063  7887]
 [ 3449 25565]]

Report:
               precision    recall  f1-score   support

           0       0.86      0.73      0.79     28950
           1       0.76      0.88      0.82     29014

    accuracy                           0.80     57964
   macro avg       0.81      0.80      0.80     57964
weighted avg       0.81      0.80      0.80     57964

[[5260 2016]
 [ 803 6413]]
              precision    recall  f1-score   support

           0       0.87      0.72      0.79      7276
           1       0.76      0.89      0.82      7216

    accuracy                           0.81     14492
   macro avg       0.81      0.81      0.80     14492
weighted avg       0.81      0.81      0.80     14492



### Task 5.b.: What is the performance of this new model on your validation data? 

Accuracy on the validation data improved from 0.66 to 0.80.

### Task 5.c.: How does your selected model perform on the withheld test set?
Almost the same accuracy as on the validation data: 0.81.

### Task 5.d: What other models did you try, and why might this one perform better than the others?

I tried Naive Bayes and decision trees. I am unsure why logistic regression (LR) outperformed Naive Bayes, but I think it is more suitable in this case than decision trees because the outcome we're trying to predict is binary, either 1 or 0.
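
A comparison like this can be made repeatable by fitting each candidate on the same training split and scoring it on the same validation split. The sketch below does this on synthetic data from `make_classification`; the choice of `GaussianNB` as the Naive Bayes variant, the tree depth, and all names are assumptions, since the original experiments are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan features.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=16)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Fit every model on the same training data, score on the same validation data.
scores = {name: model.fit(X_tr, y_tr).score(X_val, y_val)
          for name, model in candidates.items()}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```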

## Submit Your Assignment

Once you've completed all of the above, you're done with assignment 1! You might want to double check that your code works like you expect. You can do this by choosing "Restart & Run All" in the Kernel menu. If it outputs errors, you may want to go back and check what you've done.

Once you think everything is set, please upload your final notebook (with all of your code run and output showing), to Glow with filename [yourunixID]_haii21[assignmentnumber].ipynb, e.g., ikh1_haii21a1.ipynb.