# Unit 2: Learning from Data

---

## Lesson 2.4: Logistic Regression With a Real Dataset

In this lesson, we'll again work with a real dataset, but this time we'll train a **logistic regression** model and feed it multiple features.

### What’s in the Dataset?

We are using only 4 features from the full dataset:

`person_income`, `loan_amnt`, `loan_int_rate`, and `credit_score`

We will use these features to predict whether a person **gets a loan approved** (`loan_status` = 1) or **not approved** (`loan_status` = 0).

Since this is a classification problem (the answer is either 0 or 1, not a number on a continuous scale), we can't use linear regression and have to use logistic regression instead.
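To see why logistic regression suits a yes/no target, here is a minimal sketch of the idea behind it: the model computes a weighted sum of the features, then passes it through the sigmoid function, which squashes any number into a probability between 0 and 1. The raw scores below are made up for illustration and are not taken from the loan dataset.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw scores (weighted sums of features), chosen only for illustration.
for z in (-2.0, 0.5, 3.0):
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0  # probabilities above 0.5 become class 1
    print(f"score = {z:+.1f} -> probability = {p:.3f} -> predicted loan_status = {label}")
```

A negative score lands below 0.5 (predicted 0) and a positive score lands above 0.5 (predicted 1), which is how a continuous calculation turns into a class label.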


## Coding

In [None]:
import pandas as pd

# this will be used to split the original data into training and testing data
from sklearn.model_selection import train_test_split

# our model
from sklearn.linear_model import LogisticRegression

# used to determine the accuracy of our model
from sklearn.metrics import accuracy_score, confusion_matrix

# used to visualize our results
import matplotlib.pyplot as plt


# importing the dataset using pandas read_csv() function
train_df = pd.read_csv("Data/loan_data.csv")

# setting loan_status as the target variable (whether they got the loan).
y = train_df["loan_status"]

# setting our four chosen columns as the features (what we are using to predict the loan status)
X = train_df[["person_income", "loan_amnt", "loan_int_rate", "credit_score"]]

# creating train and test data: test_size=0.2 keeps 80% of the rows for training and 20% for testing
# random_state=42 makes the split reproducible, so we get the same rows every time we run this cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
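If you want to convince yourself of what the split does, here is a small sketch on a toy table. The 10 rows below are made up (they are not values from `loan_data.csv`), and the `toy_` names are just for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loan data: 10 made-up rows, not values from loan_data.csv.
toy_df = pd.DataFrame({
    "person_income": [30000, 45000, 52000, 61000, 75000, 38000, 90000, 27000, 66000, 58000],
    "loan_status": [0, 0, 1, 1, 1, 0, 1, 0, 1, 0],
})

toy_X = toy_df[["person_income"]]
toy_y = toy_df["loan_status"]

# Same settings as the lesson: 20% of the rows are held out for testing.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print("training rows:", len(toy_X_train))
print("testing rows: ", len(toy_X_test))
```

The model only ever sees the training rows during `.fit()`, so the testing rows give us an honest check on data it has not seen before.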

In [None]:
model = LogisticRegression()

# training the model using .fit()
model.fit(X_train, y_train)

In [None]:
# storing our predictions for the test set in a variable
y_preds = model.predict(X_test)
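Behind each 0 or 1 that `.predict()` returns, the model has an estimated probability. The sketch below shows both side by side using scikit-learn's `predict_proba()`. It uses a tiny made-up dataset with one small-scale feature (the `toy_` names and values are assumptions for illustration, not the loan data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up single feature and labels, purely for illustration.
toy_X = np.array([[3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

new_people = np.array([[3.5], [7.5]])
labels = toy_model.predict(new_people)       # hard 0/1 decisions
probs = toy_model.predict_proba(new_people)  # columns: P(class 0), P(class 1)

for x, label, (p0, p1) in zip(new_people[:, 0], labels, probs):
    print(f"feature = {x}: predicted = {label}, P(0) = {p0:.2f}, P(1) = {p1:.2f}")
```

The predicted label is simply whichever class has the higher probability.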

In [None]:
# Comparing our predictions to the actual values
accuracy = accuracy_score(y_test, y_preds)
print("Accuracy:", accuracy)

confusion_matrix(y_test, y_preds)

## What Do These Numbers Mean?

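Notice that we are using different methods to measure how well our model performs. That is because the right metric depends on the kind of prediction: a linear regression model predicts a number, so we measure how far off each prediction is, while a logistic regression model predicts a category, so we count how many predictions were right or wrong.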

## 1. accuracy_score

- Tells us the **percentage of correct predictions** out of all predictions.
- A simple first check of how good the model is.

### Example:
If the model made 10 predictions and got 8 correct:

```python
accuracy = 8 / 10  # 0.8, i.e. 80% accurate
```

## 2. Confusion Matrix

- Shows exactly how many predictions were right and wrong.
- Breaks it down into 4 groups:

<img src="../Photos/ConfusionMatrix.png" width="500">

### What It Tells Us:

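Here "Yes" means the loan was approved (`loan_status` = 1) and "No" means it was not approved (`loan_status` = 0).

- True Positive (TP): Model said Yes and the actual answer was Yes

- True Negative (TN): Model said No and the actual answer was No

- False Positive (FP): Model said Yes but the actual answer was No

- False Negative (FN): Model said No but the actual answer was Yes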
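To connect the four groups to what `confusion_matrix()` actually returns, here is a small sketch with made-up labels for 10 applicants (not results from our model). For 0/1 labels, scikit-learn puts the actual values in the rows and the predicted values in the columns, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for 10 applicants (1 = approved, 0 = not approved).
toy_actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

cm = confusion_matrix(toy_actual, toy_predicted)
print(cm)

# Flatten the 2x2 matrix row by row to name each group.
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

# Accuracy is the correct predictions (TP + TN) divided by all predictions.
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
print("accuracy from the matrix:", manual_accuracy)
print("accuracy_score:          ", accuracy_score(toy_actual, toy_predicted))
```

Both accuracy values agree, which shows that accuracy is a one-number summary of the confusion matrix, while the matrix itself tells us which kind of mistake the model is making.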