# ***Cloud-Driven Loan Default Predictor using Machine Learning***
<hr>

#### ***Please run the cell below to import the libraries:***

In [None]:
#### Import statements here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report 
from sklearn import datasets

import warnings
import boto3
from sagemaker import get_execution_role

warnings.filterwarnings('ignore')
####

### Task I - Data Loading 

**Instructions:**
- Build the S3 path for the dataset `loan_cleaned_data.csv` by using string formatting to combine the bucket name, the folder name, and the file key (i.e., the name of the dataset).
    - Note: Bucket name - `loan_dataXYZXYZ` (XYZXYZ can be any random integers) & Folder name - `loan_cleaned_data`.
- Load the dataset into a pandas DataFrame. 


**Hints:**
- Sample S3 URI - `s3://bucket_name/folder_name/file_name.csv`

In [None]:
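#### Import the dataset from S3
bucket = "loan_dataXYZXYZ"  # replace XYZXYZ with the integers in your bucket name
folder_name = "loan_cleaned_data"
data_key = "loan_cleaned_data.csv"
# Combine the three parts into an S3 URI: s3://bucket_name/folder_name/file_name.csv
data_location = f"s3://{bucket}/{folder_name}/{data_key}"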

In [None]:
##### Load the dataset

data = pd.read_csv(data_location)
data.head()

### Task II - Feature Engineering

**Instructions:**
- Convert the values in the categorical column `purpose` into numerical format using **One-hot Encoding**. The datatype of the new columns should be *int*.


In [None]:
#### Store the updated dataframe below

data = pd.get_dummies(data, columns=['purpose'], dtype=int)
data.head()
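
To see what the one-hot encoding step produces, here is a minimal sketch on a toy dataframe. The `purpose` values below are illustrative assumptions and are not taken from `loan_cleaned_data.csv`.

```python
import pandas as pd

# Toy frame; the category values are hypothetical examples.
toy = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "purpose": ["credit_card", "educational", "credit_card"],
})

# Each distinct value of `purpose` becomes its own 0/1 integer column,
# and the original `purpose` column is dropped.
encoded = pd.get_dummies(toy, columns=["purpose"], dtype=int)
print(encoded)
```

Each new column is named `purpose_<value>`, so the number of added columns equals the number of distinct categories in `purpose`.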

### Task III - Data Preprocessing

**Instructions:**
- Inspect the target column `not_fully_paid` and identify the count of records belonging to the two classes.
- Filter out the majority and minority classes and store them separately.
- Handle the data imbalance by oversampling the minority class using the **resample** method so that the final count of records in both classes becomes equal. Store the result in the variable *df_minority_upsampled*.
- Concatenate the upsampled minority data with the majority data and assign the result to the new dataframe *df_balanced*.
- Inspect the target column of the new dataframe to verify that the data is balanced. 

In [None]:
print(data['not_fully_paid'].value_counts())

In [None]:

# Separate majority and minority classes
df_majority = data[data['not_fully_paid'] == 0]
df_minority = data[data['not_fully_paid'] == 1]

In [None]:
# Oversample the minority class with resample so that it matches the majority class size
df_minority_upsampled = resample(
    df_minority,
    replace=True,                       # sample with replacement
    n_samples=df_majority.shape[0],     # match the majority class count
    random_state=42,
)

In [None]:
# Concatenate the upsampled minority records with the majority class records and shuffle the resulting dataframe
df_balanced = shuffle(pd.concat([df_majority, df_minority_upsampled]), random_state=42)

# Optional: verify that both classes now have the same number of records
print(df_balanced['not_fully_paid'].value_counts())
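
The oversampling steps above can be tried on a small, self-contained example. This is a sketch on made-up data: the `feature` column and the 8:2 class split are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 8 records of class 0, 2 records of class 1.
toy = pd.DataFrame({
    "feature": range(10),
    "not_fully_paid": [0] * 8 + [1] * 2,
})
majority = toy[toy["not_fully_paid"] == 0]
minority = toy[toy["not_fully_paid"] == 1]

# Draw minority rows with replacement until there are as many as majority rows.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["not_fully_paid"].value_counts()
print(counts)
```

Because sampling is done with replacement, the upsampled minority set consists only of repeated copies of the original minority rows; no new records are created.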

### Task IV - Model Training

**Instructions:**
- Drop the columns `sl_no` and `not_fully_paid` and create a dataframe of independent variables named *X*. Filter the dependent variable and store it in *y*.
- Split the data into training and test sets using a **60:40** ratio. Use a random state equal to **42**.
- Train a **Random Forest Classifier** model called *rf* using the training data. Use a random state equal to **42**. 


In [None]:
# Create X and y data for train-test split

X = df_balanced.drop(['sl_no', 'not_fully_paid'], axis=1)
y = df_balanced['not_fully_paid']

In [None]:
# Split the data 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)


In [None]:
# Train a Random Forest Classifier model

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
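
As a quick check of what a 60:40 split and the fitted model look like, the sketch below repeats the same two calls on synthetic data. The data comes from `make_classification`, and the names `X_toy`, `y_toy` and `rf_toy` are hypothetical stand-ins for the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 100 records, 5 features.
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=42)

# test_size=0.4 keeps 60% of the records for training and 40% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42
)
print(X_tr.shape, X_te.shape)

# Fit a forest with default hyperparameters and a fixed seed.
rf_toy = RandomForestClassifier(random_state=42)
rf_toy.fit(X_tr, y_tr)
print(len(rf_toy.estimators_), "trees fitted")
```

Fixing `random_state` in both calls makes the split and the trained forest reproducible across runs.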


### Task V - Model Evaluation

**Instructions:**
- Predict using the trained **Random Forest Classifier** model *rf* on the test data *X_test*.
- Evaluate the predictions by comparing them with the actual test labels *y_test*.
- Print the classification report to determine the evaluation metric scores. 

In [None]:
# Predict using the trained Random Forest Classifier model

y_pred = rf.predict(X_test)


In [None]:
# Print the classification report 
print("Classification Report:\n")
print(classification_report(y_test, y_pred))
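
The notebook imports `confusion_matrix` but does not call it. As a possible complement to the classification report, the sketch below shows how the matrix is read for a binary target. The labels are made up for illustration; in the notebook the inputs would be *y_test* and *y_pred*.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for six records.
y_true = [0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

For a loan default problem, the false negatives (defaults predicted as fully paid) are often the count of most interest, and the matrix shows it directly.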

### Task VI - Saving the Model to AWS S3 

**Instructions:** 
- Serialize the trained Random Forest model using `joblib`. 
- Initialize the S3 client using the `boto3` library. 
- Save the serialized model to a temporary file using `tempfile`. 
- Upload the model file to the specified S3 bucket named `loan-data`. 
- Ensure the model is saved as `model.pkl` in the S3 bucket. 

**Hints:**
- Temporary files in Python can be managed using `tempfile.TemporaryFile()`.
- Use `joblib.dump()` for saving the model.
- We can push objects into S3 using the `.put_object(...)` method of the `boto3` S3 client, passing the necessary parameters.


In [None]:
#### Uploading the model data to S3 bucket
import tempfile
import boto3
import joblib

BUCKET_NAME = "loan-data"  # S3 bucket names must be lowercase and cannot contain underscores

# initialize the S3 client used to save the model
s3_client = boto3.client('s3')

# name to save model as in s3
model_name = "model.pkl"

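# Serialize the model into a temporary file, then upload its bytes to S3
with tempfile.TemporaryFile() as fp:
    joblib.dump(rf, fp)  # write the trained model into the temporary file
    fp.seek(0)           # rewind so that read() starts from the first byte
    s3_client.put_object(
        Body=fp.read(),      # serialized model bytes
        Bucket=BUCKET_NAME,  # destination bucket
        Key=model_name       # object name inside the bucket
    )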

print(f'Model saved to s3 as: {model_name}')
####
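
The serialization half of this task can be checked locally without AWS access. The sketch below dumps a small model into a temporary file, reads the bytes that would be passed as `Body`, and loads them back from memory. The toy model and the names `payload` and `restored` are assumptions for illustration only.

```python
import io
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X_toy, y_toy = make_classification(n_samples=50, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Same pattern as the upload cell: dump, rewind, read the raw bytes.
with tempfile.TemporaryFile() as fp:
    joblib.dump(model, fp)
    fp.seek(0)
    payload = fp.read()  # these bytes are what put_object would receive as Body

# Rebuild the model from the bytes and confirm it predicts identically.
restored = joblib.load(io.BytesIO(payload))
same = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(len(payload), "bytes; predictions match:", same)
```

The same `joblib.load(io.BytesIO(...))` call should work on bytes read back from the bucket, for example from the `Body` of an S3 `get_object` response.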