# Home Loan Prediction
This dataset `full_home_loans.csv` is about home loan applications in Washington state, USA, where each row of the dataset is an individual loan application. Your goal in this assignment is to build a machine learning model that can accurately predict whether a given loan application was accepted or rejected.


## Part 1: Data Exploration
The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data. Why? Datasets can consume a _lot_ of space in your computer's memory, and traditional Python data structures like lists or dictionaries become painfully slow as we add thousands of rows of data. `pandas` provides a specialized data structure called a `DataFrame` that is designed to be fast and efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/

In [1]:
from google.colab import files
data_to_load = files.upload()

Saving home_loans.csv to home_loans.csv


In [2]:
import io
import pandas as pd # import pandas library
df = pd.read_csv(io.BytesIO(data_to_load['home_loans.csv']))


To understand what kind of data was collected, `pandas` has some handy commands:

- `df.head()` will show us the first 5 rows of our dataset. You can also ask for the first N rows: `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
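
The commands listed above can be tried on a tiny frame before running them on the full dataset. The sketch below uses made-up rows (an assumption for illustration); only the column names are borrowed from the real data.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror three columns of the real dataset.
toy = pd.DataFrame({
    "loan_amount_000s": [120, 250, 90, 400],
    "applicant_income_000s": [55, 98, 40, 150],
    "loan_approved": [1, 1, 0, 1],
})

print(toy.head(2))           # first 2 rows
print(toy.sample(2))         # 2 randomly chosen rows
n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(list(toy.columns))     # column names
summary = toy.describe()     # count, mean, std, min, quartiles, max per numeric column
print(summary)
```

Note that `shape` and `columns` are attributes, not methods, so they are written without parentheses.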

### Question 1.A:  How many rows are in this dataset? How many columns?

There are 369281 rows and 27 columns in this dataset.

In [3]:
rows = df.shape[0]
columns = df.shape[1]
print(rows)
print(columns)

369281
27


### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?

The column `loan_approved` is the outcome value for each application.

In [4]:
print(df.columns)

Index(['town_name', 'county_name', 'loan_amount_000s', 'applicant_income_000s',
       'property_type_name', 'occupied_by_owner', 'loan_type_name',
       'is_hoepa_loan', 'loan_purpose_name', 'loan_approved',
       'denial_reason_name_3', 'denial_reason_name_2', 'denial_reason_name_1',
       'co_applicant_sex_name', 'co_applicant_race_name_5',
       'co_applicant_race_name_4', 'co_applicant_race_name_3',
       'co_applicant_race_name_2', 'co_applicant_race_name_1',
       'co_applicant_ethnicity_name', 'applicant_sex_name',
       'applicant_race_name_5', 'applicant_race_name_4',
       'applicant_race_name_3', 'applicant_race_name_2',
       'applicant_race_name_1', 'applicant_ethnicity_name'],
      dtype='object')


### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: There are 3 columns in the dataset that list why a loan was denied. Try looking up the pandas command to list the unique values in a column.

The reasons that were given for denying a loan application are:

*   Credit history
*   Insufficient cash (downpayment, closing costs)
*   Employment history
*   Debt-to-income ratio
*   Unverifiable information
*   Collateral
*   Credit application incomplete
*   Mortgage insurance denied
*   Other

In [5]:
df['denial_reason_name_1'].unique()
df['denial_reason_name_2'].unique()
df['denial_reason_name_3'].unique()

array([nan, 'Other', 'Credit history',
       'Insufficient cash (downpayment, closing costs)',
       'Employment history', 'Debt-to-income ratio',
       'Unverifiable information', 'Collateral',
       'Credit application incomplete', 'Mortgage insurance denied'],
      dtype=object)
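
A notebook cell displays only the value of its last expression, so a cell with three `unique()` calls shows the result for the third column only. A minimal sketch of collecting the reasons across all three columns follows; the toy rows are made up, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses the three denial-reason column names.
toy = pd.DataFrame({
    "denial_reason_name_1": ["Credit history", np.nan, "Collateral"],
    "denial_reason_name_2": [np.nan, np.nan, "Other"],
    "denial_reason_name_3": [np.nan, np.nan, np.nan],
})

reason_cols = ["denial_reason_name_1", "denial_reason_name_2", "denial_reason_name_3"]

reasons = set()
for col in reason_cols:
    # dropna() skips applications that have no reason recorded in this column
    reasons.update(toy[col].dropna().unique())

print(sorted(reasons))
```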

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.

_Double click to write your answer question here. Show your work in code below if applicable._

1. Credit score
2. Value of the mortgaged property
3. Debt-to-income ratio

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Part 2: Preparing Data to Input to a Model
Here we'll start using `scikit-learn`, which provides simple library calls for most things we'd like to do in a simple machine learning pipeline. If you haven't used `scikit-learn` before, this tutorial may be useful to give you a sense of what the library can do: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

Many machine learning models can only understand data that is represented numerically, but lots of the columns in our dataset, like "town_name", are text _categorical_ data. Meanwhile, many models do better when continuous numerical data is within small, consistent ranges, such as all data being between -1 and 1, which is definitely not the case with our loan amounts in thousands of dollars.

So first, we will separate our samples (called _X_) into the categorical features and the continuous features we'd like to include in our model, so that we can preprocess each group appropriately.

In [7]:
import sklearn # import scikit-learn
from sklearn import preprocessing # import preprocessing utilities

features_cat = ['loan_purpose_name', 'applicant_sex_name']
features_num = ['loan_amount_000s', 'applicant_income_000s']

X_cat = df[features_cat]
X_num = df[features_num]

### Part 2.A One Hot Encode Categorical Variables
Run the following code to one hot encode the categorical features:

In [8]:
enc = preprocessing.OneHotEncoder()
enc.fit(X_cat) # fit the encoder to categories in our data
one_hot = enc.transform(X_cat) # transform data into one hot encoded sparse array format

In [9]:
# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc = pd.DataFrame(one_hot.toarray(), columns=enc.get_feature_names_out())
X_cat_proc.head()

Unnamed: 0,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable
0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Question 2.A: In your own words, how is one-hot encoding transforming the categorical data? What does the term "one-hot" refer to?

One-hot encoding transforms categorical data into a binary format so that the model sees every category as 1s and 0s. It creates a new column for each unique value in the original categorical column, and each row gets a 1 in the column that matches its category and a 0 in all the others. This lets the model treat each category as an independent item rather than as a point on a numeric scale. 'One-hot' refers to the single bit in each row's binary code that is 1 ("hot") while all the other bits are 0.

### Part 2.B Scaling down continuous numerical data
Run the following code to standardize any continuous numerical features, such as loan dollar amount, so that each has a mean of 0 and a standard deviation of 1. This process will ensure that the average of that feature, such as the average amount that a person asks for in loan amount, is scaled to 0. Values less than the average will be negative numbers, and values larger than the average will be positive numbers. The scaled values are not limited to a fixed range such as -1 to 1.

In [10]:
scaled = preprocessing.scale(X_num)
X_num_proc = pd.DataFrame(scaled, columns=features_num)
X_num_proc.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s
0,-0.130864,0.016448
1,-0.10368,-0.596232
2,-0.101589,0.024727
3,0.128424,1.664059
4,0.266432,-0.000111
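
To see what `preprocessing.scale` does to a single feature, the sketch below standardizes five made-up loan amounts (an assumption for illustration) and checks the resulting mean and standard deviation.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical loan amounts in thousands of dollars, including one large outlier.
amounts = np.array([[100.0], [150.0], [200.0], [250.0], [2000.0]])

scaled = preprocessing.scale(amounts)  # subtract the mean, divide by the standard deviation

print(scaled.ravel())
print("mean:", scaled.mean(), "std:", scaled.std())
print("largest value:", scaled.max())  # greater than 1 because of the outlier
```

Standardized values are centered on 0 but are not bounded, so an unusually large loan can map to a value well above 1.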


### Part 2.C Merge our feature sets into one sample dataset _X_ and fix NaN values
Run the code below to combine the numerical and categorical feature sets.

In [11]:
X = pd.concat([X_num_proc, X_cat_proc], axis=1, sort=False)
X.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable
0,-0.130864,0.016448,0.0,0.0,1.0,1.0,0.0,0.0,0.0
1,-0.10368,-0.596232,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-0.101589,0.024727,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.128424,1.664059,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.266432,-0.000111,1.0,0.0,0.0,1.0,0.0,0.0,0.0


### Question 2.C Describe what the code below does.

This code replaces all null (NaN) values in the dataset with 0.

In [12]:
X = X.fillna(0)
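
Before filling, it can help to check which columns actually contain missing values. The sketch below does this on a made-up frame (the values are assumptions). Because the numerical features were standardized to a mean of 0, filling with 0 behaves roughly like replacing a missing value with the column average.

```python
import numpy as np
import pandas as pd

# Hypothetical standardized features with some missing incomes.
toy = pd.DataFrame({
    "loan_amount_000s": [0.5, -0.2, 1.1],
    "applicant_income_000s": [np.nan, 0.3, np.nan],
})

missing_before = toy.isna().sum()   # number of NaN values per column
filled = toy.fillna(0)              # returns a new frame; toy itself is unchanged
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
```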

### Part 2.D Create our target array _y_ that our model will try to predict

In [13]:
y = df['loan_approved'] # target

### Part 2.E Split our data into training, test, and validation sets
Run the code below to split the data. Both validation and test sets will be used for testing our model, but use the validation set while you are developing and improving your model, and leave the test set for late-stage evaluation.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_TEMP, y_train, y_TEMP = train_test_split(X, y, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct

(258496, 9) (55392, 9) (55393, 9)
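
The split above is random, so the exact rows in each set change on every run. One possible variation, sketched here on made-up data, passes `random_state` to make the split repeatable and `stratify` to keep the approved/rejected ratio the same in each set. Both are standard `train_test_split` parameters, but using them is a suggestion and not part of the assignment code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 approved (1) and 20 rejected (0).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 80 + [0] * 20)

# 70% training, then split the remaining 30% evenly into validation and test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
print("approved share in training set:", y_tr.mean())
```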


### Question 2.E:  Describe the differences between train, test, and validation sets?

The training set is the input to the ML model during its learning phase. The model processes this data repeatedly to learn the patterns in it and to improve at its intended purpose.
The validation set is used during model development to provide a first check against unseen data. Its results show whether the model is working correctly and guide improvements, so they can be known to the implementers of the model.
The test set is used once the model is built. It provides a final check of how well the model predicts on data it has never seen.

## Part 3. Developing Models
Scikit-learn has a substantial library of different models we can use for classification. Two of the simplest classification models, Logistic Regression and the Dummy Classifier, are implemented below.

In [15]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# helper method to print basic model metrics
def metrics(y_true, y_pred):
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
    print('\nReport:\n', classification_report(y_true, y_pred))

In [16]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred) # finally evaluate performance

Confusion matrix:
 [[    0  9038]
 [    0 46354]]

Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      9038
           1       0.84      1.00      0.91     46354

    accuracy                           0.84     55392
   macro avg       0.42      0.50      0.46     55392
weighted avg       0.70      0.84      0.76     55392





The Dummy Classifier is a 'dummy' because it is going to use zero machine learning, and simply predict "approve this loan" (value 1) for every loan it sees.

In [17]:
from sklearn.dummy import DummyClassifier

approve_everyone = DummyClassifier(strategy='constant', constant = 1).fit(X_train, y_train) # first fit (train) the model
y_pred_dummy = approve_everyone.predict(X_validation) # next get the model's predictions for a sample in the validation set
metrics(y_validation, y_pred_dummy) # finally evaluate performance

Confusion matrix:
 [[    0  9038]
 [    0 46354]]

Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      9038
           1       0.84      1.00      0.91     46354

    accuracy                           0.84     55392
   macro avg       0.42      0.50      0.46     55392
weighted avg       0.70      0.84      0.76     55392





### Question 3.A: Considering only the data itself, why do Logistic Regression and the Dummy Classifier perform the same? What is the semantic meaning for why Dummy Classifier has such high accuracy?

Logistic Regression and the Dummy Classifier perform the same because the dataset is imbalanced: there are far more approved loans than rejected loans. With this imbalance, the logistic regression model ends up predicting "approved" for every application, exactly like the dummy classifier. The Dummy Classifier has high accuracy simply because about 84% of the applications in the validation set were approved, so always predicting approval is right about 84% of the time. High accuracy here doesn't mean the models are performing well: neither one identifies a single rejected application.
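
The link between class imbalance and accuracy can be checked directly from the confusion matrix printed above. The sketch below copies those four numbers and recomputes the accuracy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the validation output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[0, 9038],
               [0, 46354]])

total = cm.sum()
accuracy = np.trace(cm) / total           # correct predictions / all predictions
share_approved = cm[1].sum() / total      # fraction of truly approved applications
recall_rejected = cm[0, 0] / cm[0].sum()  # fraction of rejected applications caught

print(round(accuracy, 2), round(share_approved, 2), recall_rejected)
```

The accuracy equals the share of approved applications, which is what happens whenever a model predicts the majority class for every sample.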

## Part 4: Your turn!

### Task 4.A: Create a new balanced dataset where exactly half of the samples are rejected loan applications and half are accepted loan applications.
_show your work below_

In [18]:
approved_df = df[df['loan_approved']==1]
rejected_df = df[df['loan_approved']!=1]

approved_df_rows = approved_df.shape[0]
rejected_df_rows = rejected_df.shape[0]

print(approved_df_rows, rejected_df_rows)

308901 60380


In [19]:
half_accepted = approved_df.sample(n=int(rejected_df_rows))
half_rejected = rejected_df.sample(n=int(rejected_df_rows))

balanced_df = pd.concat([half_accepted,half_rejected],ignore_index=True)

In [20]:
print(half_accepted.shape)
print(half_rejected.shape)
print(balanced_df.shape)

(60380, 27)
(60380, 27)
(120760, 27)
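
An alternative way to express the same downsampling idea is `groupby(...).sample(...)`, shown below on made-up rows (an assumption for illustration). Passing `random_state` makes the sample repeatable, and a final shuffle avoids having all rows of one class first.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 8 approved, 2 rejected.
toy = pd.DataFrame({
    "loan_amount_000s": range(10),
    "loan_approved": [1] * 8 + [0] * 2,
})

# Size of the smaller class determines how many rows to keep per class.
n_minority = int(toy["loan_approved"].value_counts().min())

balanced = (
    toy.groupby("loan_approved")
       .sample(n=n_minority, random_state=0)  # same number of rows from each class
       .sample(frac=1, random_state=0)        # shuffle the combined rows
       .reset_index(drop=True)
)

counts = balanced["loan_approved"].value_counts()
print(counts)
```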


### Task 4.B: Below, retry training and evaluating a Logistic regression model on the updated data.
_show your work below_

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

features_cat_u = ['loan_purpose_name', 'applicant_sex_name']
features_num_u = ['loan_amount_000s', 'applicant_income_000s']

X_cat_u = balanced_df[features_cat_u]
X_num_u = balanced_df[features_num_u]

In [22]:
enc_u = preprocessing.OneHotEncoder()
enc_u.fit(X_cat_u) # fit the encoder to categories in our data
one_hot_u = enc_u.transform(X_cat_u) # transform data into one hot encoded sparse array format

In [23]:
# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc_u = pd.DataFrame(one_hot_u.toarray(), columns=enc_u.get_feature_names_out())
X_cat_proc_u.head()

Unnamed: 0,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [24]:
scaled = preprocessing.scale(X_num_u)
X_num_proc_u = pd.DataFrame(scaled, columns=features_num_u)
X_num_proc_u.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s
0,0.336727,0.06004
1,-0.04135,-0.164476
2,-0.177196,-0.41996
3,-0.083904,-0.265122
4,-0.218113,-0.551573


In [25]:
X_u = pd.concat([X_num_proc_u, X_cat_proc_u], axis=1, sort=False)
X_u.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable
0,0.336727,0.06004,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,-0.04135,-0.164476,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,-0.177196,-0.41996,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,-0.083904,-0.265122,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,-0.218113,-0.551573,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [26]:
X_u = X_u.fillna(0)
X_u.shape

(120760, 9)

In [27]:
y_u = balanced_df['loan_approved']
y_u.shape

(120760,)

In [28]:
X_train, X_TEMP, y_train, y_TEMP = train_test_split(X_u, y_u, test_size=0.30) # split out into training 70% of our data
X_validation, X_test, y_validation, y_test = train_test_split(X_TEMP, y_TEMP, test_size=0.50) # split out into validation 15% of our data and test 15% of our data
print(X_train.shape, X_validation.shape, X_test.shape) # print data shape to check the sizing is correct

(84532, 9) (18114, 9) (18114, 9)


In [29]:
model = LogisticRegression(solver='lbfgs').fit(X_train, y_train) # first fit (train) the model
y_pred = model.predict(X_validation)
print("Accuracy (in %) of the Logistic Regression model is", accuracy_score(y_validation, y_pred)*100)

Accuracy (in %) of the Logistic Regression model is 64.94976261455227


### Task 4.C: Use your own imagination and experimentation to improve predictive performance for this task, modifying the model choices, feature choices, and data processing however you wish.
_Important! Your ability to improve the model above the baseline after Task 4.B will count for 10% of this assignment grade, with 5% of that given for modest improvements to performance. Thus while we encourage you to experiment, do not sink excessive time into this task. We will test the performance on our own holdout dataset._

_show your work below_

#### Trying the XGBoost model

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

features_cat_bal = ['loan_purpose_name', 'applicant_sex_name']
features_num_bal = ['loan_amount_000s', 'applicant_income_000s']

X_cat_bal = balanced_df[features_cat_bal]
X_num_bal = balanced_df[features_num_bal]

In [31]:
enc_bal = preprocessing.OneHotEncoder()
enc_bal.fit(X_cat_bal) # fit the encoder to categories in our data
one_hot_bal = enc_bal.transform(X_cat_bal) # transform data into one hot encoded sparse array format

In [32]:
# Finally, put the newly encoded sparse array back into a pandas dataframe so that we can use it
X_cat_proc_bal = pd.DataFrame(one_hot_bal.toarray(), columns=enc_bal.get_feature_names_out())
X_cat_proc_bal.head()

Unnamed: 0,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [33]:
scaled_bal = preprocessing.scale(X_num_bal)
X_num_proc_bal = pd.DataFrame(scaled_bal, columns=features_num_bal)
X_num_proc_bal.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s
0,0.336727,0.06004
1,-0.04135,-0.164476
2,-0.177196,-0.41996
3,-0.083904,-0.265122
4,-0.218113,-0.551573


In [34]:
X_bal = pd.concat([X_num_proc_bal, X_cat_proc_bal], axis=1, sort=False)
X_bal.head()

Unnamed: 0,loan_amount_000s,applicant_income_000s,loan_purpose_name_Home improvement,loan_purpose_name_Home purchase,loan_purpose_name_Refinancing,applicant_sex_name_Female,"applicant_sex_name_Information not provided by applicant in mail, Internet, or telephone application",applicant_sex_name_Male,applicant_sex_name_Not applicable
0,0.336727,0.06004,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,-0.04135,-0.164476,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,-0.177196,-0.41996,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,-0.083904,-0.265122,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,-0.218113,-0.551573,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [35]:
X_bal = X_bal.fillna(0)

In [36]:
y_bal = balanced_df['loan_approved']

In [37]:
X_bal_train, X_bal_TEMP, y_bal_train, y_bal_TEMP = train_test_split(X_bal, y_bal, test_size=0.20) # split out into training 80% of our data
X_bal_validation, X_bal_test, y_bal_validation, y_bal_test = train_test_split(X_bal_TEMP, y_bal_TEMP, test_size=0.50) # split out into validation 10% of our data and test 10% of our data
print(X_bal_train.shape, X_bal_validation.shape, X_bal_test.shape) # print data shape to check the sizing is correct

(96608, 9) (12076, 9) (12076, 9)


In [38]:
import xgboost as xgb

In [39]:
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_bal_train,y_bal_train)

In [40]:
y_xgb_pred = xgb_model.predict(X_bal_test)

In [42]:
print("Accuracy for XGBoost", accuracy_score(y_bal_test,y_xgb_pred)*100)

Accuracy for XGBoost 68.20967207684664
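
The preprocessing steps are repeated by hand for each experiment in this notebook. One possible way to bundle them, sketched below on made-up rows, is a scikit-learn `Pipeline` with a `ColumnTransformer`. This is a suggestion rather than the method used above; the toy values and the choice of median imputation are assumptions. A pipeline also makes it easy to add more feature columns later by extending the two lists.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows reusing column names from the dataset.
toy = pd.DataFrame({
    "loan_amount_000s": [120, 250, 90, 400, 310, 75, 180, 220],
    "applicant_income_000s": [55, 98, np.nan, 150, 120, 30, 60, 85],
    "loan_purpose_name": ["Refinancing", "Home purchase", "Refinancing", "Home purchase",
                          "Home purchase", "Home improvement", "Refinancing", "Home purchase"],
    "loan_approved": [1, 1, 0, 1, 1, 0, 0, 1],
})
features_num = ["loan_amount_000s", "applicant_income_000s"]
features_cat = ["loan_purpose_name"]

# Numerical columns: fill missing values, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply the right preprocessing to each group of columns.
preprocess = ColumnTransformer([
    ("num", numeric_steps, features_num),
    ("cat", OneHotEncoder(handle_unknown="ignore"), features_cat),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_toy = toy[features_num + features_cat]
clf.fit(X_toy, toy["loan_approved"])
preds = clf.predict(X_toy)
print(preds)
```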


#### Trying a neural network model

In [44]:
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler

In [45]:
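scaler = StandardScaler()
X_bal_train = scaler.fit_transform(X_bal_train)       # fit the scaler on the training data only
X_bal_validation = scaler.transform(X_bal_validation) # apply the same scaling to the validation data
X_bal_test = scaler.transform(X_bal_test)             # apply the same scaling to the test data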

In [46]:
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(9,)),  # First hidden layer: 64 neurons, takes the 9 input features
    keras.layers.Dense(128, activation='relu'),                  # Second hidden layer: 128 neurons with ReLU activation
    keras.layers.Dense(3, activation='softmax')                 # Softmax output layer with 3 neurons; the labels only take 2 values (0/1), so 2 neurons would suffice
])

In [47]:
y_bal_validation

67817     0
70931     0
102793    0
8106      1
62968     0
         ..
77020     0
34581     1
28065     1
3153      1
97892     0
Name: loan_approved, Length: 12076, dtype: int64

In [48]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [49]:
model.fit(X_bal_train, y_bal_train, epochs=10, batch_size=40, verbose=2)

Epoch 1/10
2416/2416 - 4s - loss: 0.6085 - accuracy: 0.6568 - 4s/epoch - 2ms/step
Epoch 2/10
2416/2416 - 4s - loss: 0.5964 - accuracy: 0.6663 - 4s/epoch - 2ms/step
Epoch 3/10
2416/2416 - 5s - loss: 0.5937 - accuracy: 0.6709 - 5s/epoch - 2ms/step
Epoch 4/10
2416/2416 - 3s - loss: 0.5925 - accuracy: 0.6715 - 3s/epoch - 1ms/step
Epoch 5/10
2416/2416 - 3s - loss: 0.5914 - accuracy: 0.6714 - 3s/epoch - 1ms/step
Epoch 6/10
2416/2416 - 5s - loss: 0.5912 - accuracy: 0.6725 - 5s/epoch - 2ms/step
Epoch 7/10
2416/2416 - 4s - loss: 0.5916 - accuracy: 0.6732 - 4s/epoch - 1ms/step
Epoch 8/10
2416/2416 - 3s - loss: 0.5901 - accuracy: 0.6731 - 3s/epoch - 1ms/step
Epoch 9/10
2416/2416 - 3s - loss: 0.5907 - accuracy: 0.6731 - 3s/epoch - 1ms/step
Epoch 10/10
2416/2416 - 5s - loss: 0.5898 - accuracy: 0.6741 - 5s/epoch - 2ms/step


<keras.src.callbacks.History at 0x7bb419564550>

In [50]:
X_bal_validation.shape

(12076, 9)

In [51]:
y_bal_proba = model.predict(X_bal_validation)  # softmax probabilities, one row per sample
y_bal_pred = y_bal_proba.argmax(axis=1)        # most probable class for each sample
# metrics(y_bal_validation, y_bal_pred)
# accuracy_nn = accuracy_score(y_bal_validation, y_bal_pred)
# print(f"Neural Network Accuracy: {accuracy_nn}")
print(X_bal_validation.shape)
print(y_bal_pred.shape)

(12076, 9)
(12076,)
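
A softmax output layer returns one probability per output neuron for every sample, so flattening the prediction array yields samples × neurons values instead of one label per sample. The sketch below uses made-up probabilities (an assumption, not model output) to show how `argmax` turns each row into a class label that `accuracy_score` or a direct comparison can use.

```python
import numpy as np

# Made-up softmax outputs for 4 samples and 3 output neurons (each row sums to 1).
probs = np.array([
    [0.70, 0.25, 0.05],
    [0.20, 0.75, 0.05],
    [0.55, 0.40, 0.05],
    [0.10, 0.85, 0.05],
])
y_true = np.array([0, 1, 1, 1])  # hypothetical true labels

flat = probs.flatten()           # shape (12,): one value per sample and neuron
labels = probs.argmax(axis=1)    # shape (4,): index of the largest probability per row
accuracy = (labels == y_true).mean()

print(flat.shape, labels.shape)
print(labels, accuracy)
```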


In [53]:
# accuracy = accuracy_score(y_bal_validation, y_bal_pred)

# Documenting collaborations
## Briefly list and describe the sources you received help from, and how they helped you
### These may include friends, peers, TAs, generative AI tools, etc.

**References:**
https://www.geeksforgeeks.org/python-pandas-dataframe-sample/#
https://www.geeksforgeeks.org/understanding-logistic-regression/

**Friends:**
Savani Mengawade

**Generative AI:**
ChatGPT (for resolving errors in code)


# Learning assessment

### Reflect in a few words the amount of new content learned from completing the assignment.
### If most of the material was not new to you, where did you see it before?

I was unfamiliar with the ML model implementation; I learned how to divide the data into train, test, and validation datasets. I have also learned how to use logistic regression. I also learned more about data preprocessing techniques such as one hot encoding. I was unable to complete a few tasks, such as boosting the model's accuracy. I tried to discuss with my friends about how

This was something new for me. I have previously worked on OpenCV applications, but data processing and regression were new to me.