
# Student Loan Risk with Deep Learning

In [None]:
# Imports
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from pathlib import Path

---

## Prepare the data to be used on a neural network model

### Step 1: Read the `student-loans.csv` file into a Pandas DataFrame. Review the DataFrame, looking for columns that could eventually define your features and target variables.   

In [None]:
# Read the csv into a Pandas DataFrame
file_path = "https://static.bc-edx.com/ai/ail-v-1-0/m18/lms/datasets/student-loans.csv"
loans_df = pd.read_csv(file_path)

# Review the DataFrame
loans_df.head()

Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score,credit_ranking
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0


## Observations:
The DataFrame contains numerical, student-level attributes such as payment history, location, GPA, workshop scores, and cohort ranking, along with information about the student's program and the school's alumni success. The target is the binary `credit_ranking` column, where 1 indicates that the loan was repaid and 0 indicates that it was not.

In [None]:
# Review the data types associated with the columns
loans_df.dtypes

Unnamed: 0,0
payment_history,float64
location_parameter,float64
stem_degree_score,float64
gpa_ranking,float64
alumni_success,float64
study_major_code,float64
time_to_completion,float64
finance_workshop_score,float64
cohort_ranking,float64
total_loan_score,float64


In [None]:
# Check the credit_ranking value counts
loans_df["credit_ranking"].value_counts()

Unnamed: 0_level_0,count
credit_ranking,Unnamed: 1_level_1
1,855
0,744


## Observations:
Of the 1,599 observations, 855 students (53%) have a `credit_ranking` of 1, indicating successful repayment; the remaining 744 (47%) did not repay their loans. The two classes are close to balanced, so the model sees plenty of examples of both outcomes during training.
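
The class shares quoted above can be checked directly from the two counts. This is a minimal sketch that uses only the numbers printed by `value_counts()`; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
counts = {1: 855, 0: 744}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"credit_ranking={label}: {counts[label]} rows ({share:.1%})")
```

On the loaded DataFrame, `loans_df["credit_ranking"].value_counts(normalize=True)` should give the same proportions.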

### Step 2: Using the preprocessed data, create the features (`X`) and target (`y`) datasets. The target dataset should be defined by the preprocessed DataFrame column “credit_ranking”. The remaining columns should define the features dataset.

In [None]:
# Since we're trying to predict the Loan Repayment Success rate, we're going to isolate the "credit_ranking" value as the y dataset

# Define the target set y using the credit_ranking column
y = loans_df["credit_ranking"]

# Display a sample of y
y[:5]

Unnamed: 0,credit_ranking
0,0
1,0
2,0
3,1
4,0


In [None]:
# The remaining attributes about the students, program and school will be our predictive features as X

# Define features set X by selecting all columns but credit_ranking
X = loans_df.drop(columns=["credit_ranking"])

# Review the features DataFrame
X.head()

Unnamed: 0,payment_history,location_parameter,stem_degree_score,gpa_ranking,alumni_success,study_major_code,time_to_completion,finance_workshop_score,cohort_ranking,total_loan_score,financial_aid_score
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


### Step 3: Split the features and target sets into training and testing datasets.


In [None]:
# We'll split the dataset into training and testing sets so the model can be validated on unseen data. Setting random_state to 1 makes the split reproducible each time the notebook is run

# Split the preprocessed data into a training and testing dataset
# Assign the function a random_state equal to 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
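
Because no `test_size` is passed, `train_test_split` falls back to its default of 0.25. The arithmetic below is a sketch of what that implies for this dataset. The batch size of 32 is the Keras default and is an assumption here, since the notebook never sets it explicitly.

```python
import math

n_rows = 1599       # total observations reported earlier
test_size = 0.25    # scikit-learn default when no size is given
batch_size = 32     # Keras default for fit() and evaluate() (assumed, not set in the notebook)

# scikit-learn rounds the test share up and gives the rest to training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Number of batches needed to cover each set once
steps_per_epoch = math.ceil(n_train / batch_size)
eval_steps = math.ceil(n_test / batch_size)

print(n_train, n_test, steps_per_epoch, eval_steps)
```

The resulting step counts line up with the `38/38` and `13/13` progress counters in the training and evaluation logs further down.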

### Step 4: Use scikit-learn's `StandardScaler` to scale the features data.

In [None]:
# We'll standardize the features to a mean of 0 and a standard deviation of 1 so that the magnitude of the values does not sway the model. For neural networks, this also helps the model converge faster and more reliably.

# Create a StandardScaler instance
X_scaler = StandardScaler()

# Fit the scaler to the features training dataset
X_scaler.fit(X_train)

# Scale the training and testing features using the fitted scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
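
To make the fit-then-transform pattern concrete, here is a hand-rolled sketch of the same calculation in NumPy. The numbers are made up, the `_demo` names are hypothetical, and it is an illustration of the idea rather than the `StandardScaler` source: statistics are learned from the training rows only and then reused on the test rows.

```python
import numpy as np

# Toy feature matrices (made-up values, not the loan data)
X_train_demo = np.array([[7.4, 34.0], [7.8, 67.0], [11.2, 60.0], [7.4, 40.0]])
X_test_demo = np.array([[7.8, 54.0]])

# "Fit": learn per-column statistics from the training rows only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": apply the training statistics to both sets
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std

print(X_train_demo_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_demo_scaled.std(axis=0))   # approximately 1 per column
```

Fitting on the training set alone keeps information about the test rows from leaking into the scaling step.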

---

## Compile and Evaluate a Model Using a Neural Network

### Step 1: Create a deep neural network by assigning the number of input features, the number of layers, and the number of neurons on each layer using Tensorflow’s Keras.

> **Hint** You can start with a two-layer deep neural network model that uses the `relu` activation function for both layers.


In [None]:
# Define the number of inputs (features) to the model
input_nodes = len(X.columns)

# Review the number of features
input_nodes

# there are 11 features input to the model

11

In [None]:
# As a rule of thumb, one hidden layer is enough for simple problems, and 2 to 3 hidden layers suit moderately complex problems with many features and a binary classification output.
# Too many layers can lead to overfitting. In this case, I've chosen to use 2 hidden layers.
# For the number of nodes, I'm using the halfway rule: roughly half the number of input features (rounded up) for the first layer, then roughly half of that for the second layer.

# Define the number of hidden nodes for the first hidden layer
hidden_nodes_layer1 = (input_nodes + 1) // 2

# Define the number of hidden nodes for the second hidden layer
hidden_nodes_layer2 = (hidden_nodes_layer1 + 1) // 2

# Define the number of neurons in the output layer
output_nodes = 1
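
With 11 input features, the integer arithmetic above works out to 6 and 3 hidden nodes. The sketch below repeats that calculation and derives the number of trainable parameters per layer from the usual Dense-layer formula (weights plus one bias per unit). The counts are computed by hand, not copied from `nn_model.summary()`.

```python
input_nodes = 11  # feature count reported above
hidden_nodes_layer1 = (input_nodes + 1) // 2
hidden_nodes_layer2 = (hidden_nodes_layer1 + 1) // 2
output_nodes = 1

layer_sizes = [input_nodes, hidden_nodes_layer1, hidden_nodes_layer2, output_nodes]

# A Dense layer has (inputs * units) weights plus one bias per unit
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(layer_sizes)
print(params_per_layer, sum(params_per_layer))
```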

In [None]:
# For our case, we'll use a Keras Sequential model to build a simple, linear stack of layers with one input tensor and one output tensor.
# We'll use the ReLU activation function for both hidden layers because it helps mitigate the vanishing gradient problem and lets the network learn non-linear patterns.
# We'll use a sigmoid activation on the output layer for binary classification: it returns a probability between 0 and 1 that the loan is repaid (credit_ranking = 1)

# Create the Sequential model instance
nn_model = tf.keras.models.Sequential()

# Add the first hidden layer
nn_model.add(tf.keras.layers.Dense(units = hidden_nodes_layer1, activation = "relu", input_dim = input_nodes))

# Add the second hidden layer
nn_model.add(tf.keras.layers.Dense(units = hidden_nodes_layer2, activation= "relu"))

# Add the output layer to the model specifying the number of output neurons and activation function
nn_model.add(tf.keras.layers.Dense(units = output_nodes, activation="sigmoid"))



In [None]:
# Display the Sequential model summary
nn_model.summary()

### Step 2: Compile and fit the model using the `binary_crossentropy` loss function, the `adam` optimizer, and the `accuracy` evaluation metric.


In [None]:
# Next we'll compile the model using the binary_crossentropy loss function, the standard loss for binary classification tasks
# The adam optimizer is efficient and adapts the learning rate for each weight during training
# Our aim is to maximize accuracy and minimize loss over 50 epochs (full passes through the training data)

# Compile the Sequential model
nn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
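
To show what the loss value measures, here is a hand-written sketch of mean binary cross-entropy. It is meant for illustration only and is not the Keras implementation; the function name, the `eps` clipping value, and the sample inputs are all assumptions.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predicted probabilities
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])  # low loss
uninformative = binary_crossentropy([1, 0], [0.5, 0.5])    # ln(2), about 0.693
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])  # high loss

print(confident_right, uninformative, confident_wrong)
```

A model that outputs 0.5 for every row scores ln(2), roughly 0.693, which is close to the first-epoch loss of 0.7060 in the training log; the loss then falls as predictions move toward the correct labels.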

In [None]:
# Fit the model using 50 epochs and the training data
fit_model = nn_model.fit(X_train_scaled, y_train, epochs=50)

Epoch 1/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.5332 - loss: 0.7060
Epoch 2/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.5608 - loss: 0.6834
Epoch 3/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.5927 - loss: 0.6629
Epoch 4/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.6244 - loss: 0.6609
Epoch 5/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6334 - loss: 0.6447
Epoch 6/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.6666 - loss: 0.6364
Epoch 7/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.6748 - loss: 0.6250
Epoch 8/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step - accuracy: 0.6675 - loss: 0.6162
Epoch 9/50
[1m38/38[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[

### Step 3: Evaluate the model using the test data to determine the model’s loss and accuracy.


In [41]:
# Evaluate the model loss and accuracy metrics using the evaluate method and the test data
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled,y_test,verbose=2)

# Display the model loss and accuracy results
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")


13/13 - 0s - 37ms/step - accuracy: 0.7550 - loss: 0.5086
Loss: 0.508571445941925, Accuracy: 0.7549999952316284


## Observations:

Our model quality is OK at 75.5% accuracy and a loss of 0.51 on the test data.

I'd like to see a model with 80%+ accuracy and a lower loss value (closer to 0), but for a binary classification task where the predictions are probabilistic, this model is a good starting point.

In the real world, it's possible that even a student with poor grades and school alumni outcomes can pay back their loans successfully if they are lucky in the jobs market, start their own successful company, or get help from a spouse/family member or inheritance to pay off their loans. Therefore, while our features can be predictors of future success/failure, they do not account for all possible outcomes to merit an exceptional model score.
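
One way to put the test accuracy in context is to compare it with a majority-class baseline, that is, a "model" that always predicts the more common class. The sketch below assumes the test-set class counts shown in the support column of the classification report later in this notebook (188 and 212) and rounds the accuracy from the `evaluate` output to 0.755.

```python
# Test-set class counts, taken from the support column of the classification report
support = {0: 188, 1: 212}
model_accuracy = 0.755  # rounded from the evaluate() output above

n_test = sum(support.values())
baseline_accuracy = max(support.values()) / n_test  # always predict the larger class

print(f"Baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}")
print(f"Improvement: {model_accuracy - baseline_accuracy:.3f}")
```

The gap between the two numbers is the share of accuracy that the network actually earns from the features.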

### Step 4: Save and export your model to a keras file, and name the file `student_loans.keras`.


In [None]:
# Save the trained model so it can be reloaded later without retraining, which saves time and gives consistent results on future test sets or in production.

# Set the model's file path
file_path = Path("student_loans.keras")

# Export your model to a keras file
nn_model.save(file_path)

---
## Predict Loan Repayment Success by Using your Neural Network Model

### Step 1: Reload your saved model.

In [None]:
# Set the model's file path
file_path = Path("student_loans.keras")

# Load the model to a new object
nn_model = tf.keras.models.load_model(file_path)

### Step 2: Make predictions on the testing data and save the predictions to a DataFrame.

In [None]:
# Make predictions with the test data
predictions = nn_model.predict(X_test_scaled)

# Display a sample of the predictions
display(predictions[:5])

# Predictions from the model will be a value between 0 and 1

13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step


array([[0.5228555 ],
       [0.3088252 ],
       [0.70050627],
       [0.65888   ],
       [0.97015995]], dtype=float32)

In [None]:
# Save the predictions to a DataFrame and round the predictions to binary results
predictions_df = pd.DataFrame(predictions, columns=["predictions"]).round(0)

# Review the DataFrame
predictions_df

# Rounding each predicted probability to 0 or 1 labels every unseen test row as a predicted failure or success, which is easier to interpret

Unnamed: 0,predictions
0,1.0
1,0.0
2,1.0
3,1.0
4,1.0
...,...
395,1.0
396,0.0
397,1.0
398,0.0
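
Rounding is equivalent to applying a 0.5 cutoff to the predicted probabilities. As an alternative sketch, the cutoff can be written out explicitly, which returns integer labels and makes the threshold easy to change. The `sample` array holds the five probabilities displayed earlier; `threshold` and `labels` are illustrative names.

```python
import numpy as np

# The five sample probabilities displayed above
sample = np.array(
    [[0.5228555], [0.3088252], [0.70050627], [0.65888], [0.97015995]],
    dtype=np.float32,
)

threshold = 0.5
labels = (sample > threshold).astype(int).ravel()

print(labels)
```

The result matches the first five rows of the rounded DataFrame (1, 0, 1, 1, 1).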


### Step 3: Display a classification report with the y test data and predictions

In [None]:
# Use the Classification Report function to gain deeper understanding of the model's performance beyond accuracy
# Precision is the share of predictions for a class that are correct
# Recall is the share of actual members of a class that the model correctly identifies
# F1-Score is the harmonic mean of precision and recall, balancing both into one value
# Support is the number of actual occurrences of each class in the test data


# Print the classification report with the y test data and predictions
print(classification_report(y_test, predictions_df))

              precision    recall  f1-score   support

           0       0.72      0.78      0.75       188
           1       0.79      0.74      0.76       212

    accuracy                           0.76       400
   macro avg       0.76      0.76      0.75       400
weighted avg       0.76      0.76      0.76       400



## Observations:

The macro and weighted averages of precision and recall are both 0.76, and overall accuracy is 76%. This is an OK outcome, given the nature of the data.

The model correctly identified 78% of the loans that were not repaid and 74% of the loans that were repaid (recall). Its precision was 72% for non-payment predictions and 79% for repayment predictions.

In the test dataset, 188 loans were labeled as not repaid and 212 as repaid. The model predicted 76% of the 400 outcomes correctly.
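
The report's columns all derive from four confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an inference: they are the values that reproduce the rounded report and the 0.755 test accuracy, and should be treated as an assumption and not as recorded output.

```python
# Confusion-matrix counts inferred from the report (an assumption, not notebook output)
tn, fp = 146, 42   # actual class 0: predicted 0, predicted 1
fn, tp = 56, 156   # actual class 1: predicted 0, predicted 1

def prf(true_pos, false_pos, false_neg):
    """Return precision, recall, and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

class_0 = prf(tn, fn, fp)  # treat class 0 as the "positive" class
class_1 = prf(tp, fp, fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

for name, (p, r, f) in (("0", class_0), ("1", class_1)):
    print(f"class {name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.3f}")
```

Running `sklearn.metrics.confusion_matrix(y_test, predictions_df)` in the notebook would confirm or correct these counts.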

---
## Discuss creating a recommendation system for student loans

Briefly answer the following questions in the space provided:

1. Describe the data that you would need to collect to build a recommendation system to recommend student loan options for students. Explain why this data would be relevant and appropriate.

2. Based on the data you chose to use in this recommendation system, would your model be using collaborative filtering, content-based filtering, or context-based filtering? Justify why the data you selected would be suitable for your choice of filtering method.

3. Describe two real-world challenges that you would take into consideration while building a recommendation system for student loans. Explain why these challenges would be of concern for a student loan recommendation system.

## 1. Recommendation System:

### Data Needed:

- student demographics (age, income level, school type)
- academic background (GPA, field of study)
- financial data (existing debt, family income)
- loan features (interest rates, repayment terms)
- area of study (difficulty, future pay, job prospects, growth rate)

### Relevance:
This data helps assess a student's financial and academic profile to recommend suitable loan options, and it may also guide students away from majors that perform poorly in the job market or are a poor fit for their academic background. Anecdotally, students who regret their major or their student loans often point to how little they understood at 18, when they signed up for loans or chose majors that don't pay well in the real world.

## 2. Recommendation System Filtering:

### Content-based filtering

### Justification:
This method uses a student's profile to match them with loan products based on features like interest rates and repayment terms. Unlike collaborative filtering, which requires user interaction data, content-based filtering can work well with structured, attribute-rich data.

This type of filtering works well when we have detailed information about both the users and the items being recommended. It also handles new items that have no interaction history but whose attributes are fully known, for example, a new type of loan structure that is rolling out the next school year.
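
As a rough illustration of content-based matching, the sketch below ranks loan products by cosine similarity between a student's preference vector and each product's attribute vector. Everything in it is hypothetical: the product names, the three attributes, and the 0 to 1 scores are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical loan products scored 0-1 on three illustrative attributes:
# [low interest rate, flexible repayment, high borrowing limit]
loans = {
    "fixed_low_rate": [0.9, 0.3, 0.4],
    "income_based": [0.5, 0.9, 0.3],
    "high_limit_private": [0.2, 0.2, 0.9],
}

# Hypothetical student preference profile on the same attributes
student = [0.4, 0.9, 0.2]

ranked = sorted(
    loans,
    key=lambda name: cosine_similarity(student, loans[name]),
    reverse=True,
)
print(ranked)
```

Because the ranking depends only on item attributes, a newly launched loan product can be scored as soon as its attributes are known, with no interaction history required.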


## 3. Challenges:

### 1. Data Privacy and Security:
Handling sensitive financial and personal data requires stringent privacy measures. A breach could result in significant harm to students and can trigger federal investigations as student loans are often originated by the US government.

### 2. Bias and Fairness:
Models trained on biased data might reinforce inequalities, unfairly disadvantaging certain demographics. Ensuring fairness in recommendations is essential to avoid perpetuating systemic biases.

A recommendation system may also push students toward majors or schools that are a poor fit for qualitative reasons a model cannot easily measure, which can lead to poor outcomes. A STEM degree might be the most in demand and highest paying today, but these careers could be affected by AI in the future, leading to lower pay and poor job security in the long term.
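
One simple screening check for the fairness concern is to compare how often a favorable recommendation is made to each demographic group. The sketch below uses an invented audit log with hypothetical group labels; a large gap does not prove unfair treatment on its own, but it flags where a closer review is warranted.

```python
from collections import defaultdict

# Hypothetical audit log: (group label, whether the low-cost loan was recommended)
recommendations = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

shown = defaultdict(int)
total = defaultdict(int)
for group, recommended in recommendations:
    total[group] += 1
    shown[group] += int(recommended)

# Share of each group that received the favorable recommendation
rates = {group: shown[group] / total[group] for group in total}
gap = max(rates.values()) - min(rates.values())

print(rates, gap)
```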

