# Logistic regression

## Data

Download the dataset from Canvas, in the folder `files` (not from GitHub, because the data is not publicly available without an account). It's from a [paper](https://www.sciencedirect.com/science/article/pii/S0957417420303912) that uses neural networks to predict corporate defaults of Italian firms. You can see the full data [on Kaggle](https://www.kaggle.com/datasets/lukaszpostek/newconnect-market-corporate-default-prediction?resource=download) (requires a free account).

Save it somewhere on your computer. You will need to know the path to import it in the next cell.

If running on Colab, run this cell to connect Colab to your Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Logistic regression

Here we implement logistic regression for our data. We load the data with pandas, which stores it in a "dataframe": a table whose columns you can select by name with the syntax `dataframe[list_of_column_names]` (we'll see more of it later; please ignore the line that renames the columns).

In [None]:
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/defaults.csv", delimiter="\t")

# Students: Please ignore these lines, which clean the data.
data.rename(columns={"x2": "assets_ratio", "x4": "ebitda"}, inplace=True)

# Show a bit of the data.
print(data.head())

# Show summary statistics.
data[["default", "assets_ratio", "ebitda"]].describe()
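The column-selection syntax used above can be tried on a small hand-made dataframe. This is a minimal sketch: the values in `toy` are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
toy = pd.DataFrame({
    "default": [0, 1, 0],
    "assets_ratio": [0.8, 0.2, 0.5],
    "ebitda": [1.5, -0.3, 0.7],
})

# A list of column names returns a dataframe with just those columns.
subset = toy[["assets_ratio", "ebitda"]]
print(subset.shape)

# A single column name (no list) returns a one-dimensional Series instead.
single = toy["default"]
print(type(single).__name__)
```

Selecting with a list keeps the result two-dimensional, which is the shape scikit-learn expects for the regressors.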

Here, we'll use two regressors to predict default: the assets ratio (`x2`) and EBITDA (`x4`).

Import scikit-learn and run the regression:

In [None]:
import sklearn.linear_model

columns = ["ebitda", "assets_ratio"]
x = data[columns]
y = data[["default"]].squeeze()  # Converts the one-column dataframe to a 1-D Series, as scikit-learn expects.

log_reg = sklearn.linear_model.LogisticRegression()
classifier = log_reg.fit(x, y)
accuracy = 100 * log_reg.score(x, y)
print("Accuracy on training data: %.1f%%" % accuracy)

coeffs = log_reg.coef_
for i in range(len(columns)):
    print("Coefficient on %s: %.2f" % (columns[i], coeffs[0, i]))


In [None]:
print(coeffs)
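The accuracy returned by `.score()` is the share of observations whose predicted class matches the true label. The sketch below checks this by hand. It uses synthetic data (an assumption made so the example runs without the course file), and the names `x_demo`, `y_demo` and `demo_reg` are illustrative only.

```python
import numpy as np
import sklearn.linear_model

# Synthetic data for illustration, not the defaults dataset.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] - x_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

demo_reg = sklearn.linear_model.LogisticRegression().fit(x_demo, y_demo)

# Accuracy by hand: fraction of rows where the predicted class equals the label.
manual_accuracy = np.mean(demo_reg.predict(x_demo) == y_demo)
print(manual_accuracy, demo_reg.score(x_demo, y_demo))
```

Both printed numbers should agree. Note that this is accuracy on the training data, so it may overstate how well the model does on new observations.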

Now we predict values for training data or for new, test data. We use the `.predict()` method to predict 0 or 1, and `.predict_proba()` to predict the probability of each outcome.

In [None]:
# The same probabilities computed by hand (requires `import numpy as np`):
# x_array = np.array(x.loc[:2, columns])
# 1 / (1 + np.exp(-(log_reg.intercept_ + np.matmul(x_array, coeffs.T))))

# Prediction on training data: 0/1.
print("Prediction for first three data points:")
print(classifier.predict(x.loc[:2, columns]))

# Prediction on training data: probabilities.
print("Predicted probabilities for first three data points:")
print(classifier.predict_proba(x.loc[:2, columns]))

# Generate a new, unseen data point.
new_data = pd.DataFrame([[0.5, 0.1]], columns=columns)
print("Prediction for new data point:")
print(classifier.predict(new_data))
print("Predicted probability for new data point:")
print(classifier.predict_proba(new_data))

# Print the classes that these probabilities refer to:
print("Classes: ", classifier.classes_)
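To see where the probabilities from `.predict_proba()` come from, the sketch below recomputes them with the logistic (sigmoid) formula for a two-class model. It runs on synthetic data, which is an assumption made for illustration; `x_demo`, `y_demo` and `demo_reg` are not part of the notebook above.

```python
import numpy as np
import sklearn.linear_model

# Synthetic data for illustration, not the defaults dataset.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] + 0.5 * x_demo[:, 1] + rng.normal(size=200) > 0).astype(int)
demo_reg = sklearn.linear_model.LogisticRegression().fit(x_demo, y_demo)

# Linear index for the first three rows: intercept + x . coefficients.
z = demo_reg.intercept_ + x_demo[:3] @ demo_reg.coef_.T  # shape (3, 1)

# The sigmoid maps the index to the probability of class 1.
p_one = 1 / (1 + np.exp(-z))

# Column 0 is P(class 0), column 1 is P(class 1), matching classes_ = [0, 1].
manual_proba = np.hstack([1 - p_one, p_one])
print(manual_proba)
print(demo_reg.predict_proba(x_demo[:3]))
```

The two printed arrays should match. `.predict()` then returns class 1 whenever the second column exceeds 0.5, which is the same as the linear index `z` being positive.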

For comparison, we also run a linear regression. Its coefficients are very different: about 20 times smaller than the logistic regression coefficients.

In [None]:
lin_reg = sklearn.linear_model.LinearRegression()
lin_reg.fit(x, y)
print(lin_reg.coef_)
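One reason the two sets of coefficients differ in scale is that logistic coefficients describe changes in the log-odds, whereas linear regression coefficients describe changes in the predicted probability itself. A rough way to compare them is the average marginal effect of the logistic model, which multiplies each coefficient by the average of p(1 - p). Because p(1 - p) is at most 0.25, these effects are always at most a quarter of the logistic coefficients in size. The sketch below illustrates this on synthetic data (an assumption; the names are illustrative and the numbers will not match the defaults dataset).

```python
import numpy as np
import sklearn.linear_model

# Synthetic data for illustration, not the defaults dataset.
rng = np.random.default_rng(2)
x_demo = rng.normal(size=(500, 2))
y_demo = (x_demo[:, 0] - 0.5 * x_demo[:, 1] + rng.normal(size=500) > 0).astype(int)

logit = sklearn.linear_model.LogisticRegression().fit(x_demo, y_demo)
linear = sklearn.linear_model.LinearRegression().fit(x_demo, y_demo)

# Predicted probability of class 1 for every observation.
p = logit.predict_proba(x_demo)[:, 1]

# Average marginal effect: mean of p * (1 - p), times each logistic coefficient.
avg_marginal_effect = np.mean(p * (1 - p)) * logit.coef_[0]

print("Logistic coefficients:   ", logit.coef_[0])
print("Average marginal effects:", avg_marginal_effect)
print("Linear coefficients:     ", linear.coef_)
```

The average marginal effects are usually much closer to the linear coefficients than the raw logistic coefficients are, although the two need not coincide.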

Finally, we plot the decision boundary to show how the two regressors interact: a lower EBITDA is a bad sign for a company, unless it has a higher assets ratio.

In [None]:
import sklearn.inspection
_ = sklearn.inspection.DecisionBoundaryDisplay.from_estimator(
    log_reg,
    x,
    response_method="predict",
    cmap="RdBu_r",
    alpha=0.5,
)
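With two regressors, the predicted class flips where the linear index `intercept + b1 * x1 + b2 * x2` equals zero, so the boundary in the plot is a straight line. The sketch below solves for that line and checks that points on it get probability 0.5. It uses synthetic data as a stand-in for the defaults dataset, and all names in it are illustrative.

```python
import numpy as np
import sklearn.linear_model

# Synthetic data for illustration, not the defaults dataset.
rng = np.random.default_rng(3)
x_demo = rng.normal(size=(300, 2))
y_demo = (x_demo[:, 0] + x_demo[:, 1] + rng.normal(size=300) > 0).astype(int)
demo_reg = sklearn.linear_model.LogisticRegression().fit(x_demo, y_demo)

b0 = demo_reg.intercept_[0]
b1, b2 = demo_reg.coef_[0]

# Solve b0 + b1 * x1 + b2 * x2 = 0 for x2 to get points on the boundary.
x1_grid = np.linspace(-2, 2, 5)
x2_boundary = -(b0 + b1 * x1_grid) / b2
boundary_points = np.column_stack([x1_grid, x2_boundary])

# On the boundary, both classes are equally likely.
boundary_proba = demo_reg.predict_proba(boundary_points)[:, 1]
print(boundary_proba)
```

The slope of the line, `-b1 / b2`, summarizes the trade-off between the two regressors: how much one must rise to offset a fall in the other while keeping the prediction unchanged.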

# Other data: loan applications

We want to automate loan approval decisions, or assist a new loan officer in making them.

We have data on past loan applications from [this GitHub repository](https://github.com/SandeepHonnali/Loan-Approval-Prediction-using-Machine-Learning/blob/main/1Copy%20of%20loan.csv). The outcome variable is 0-1, whether the loan officer granted the loan. The explanatory variables for logistic regression are the applicant's income and the loan amount.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("/content/drive/MyDrive/loans.csv", delimiter=",")

# Clean the data
outcome = "outcome"
df[outcome] = (df["Loan_Status"] == "Y").astype(np.uint8)

# Keep only loans whose term equals 360 (rows with a missing term are dropped too).
df = df[df["Loan_Amount_Term"] == 360.0]

features = ["ApplicantIncome", "LoanAmount", "CoapplicantIncome"]

# Copy so the assignments below modify a new dataframe, not a view of the old one.
df = df[[outcome] + features].copy()
for f in features:
  # log(1 + x) compresses large values and stays finite when x is 0.
  df[f] = np.log1p(df[f])
df.dropna(inplace=True)


# Show a bit of the data and summary statistics.
print(df.head())
df[["outcome", "ApplicantIncome", "LoanAmount"]].describe()
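The two cleaning steps above (turning the `Y`/`N` status into a 0-1 outcome, and taking `log(1 + x)` of the monetary variables) can be checked on a tiny example. The values in `toy` below are hypothetical and not taken from the loan file; the point is only that `log(1 + x)` stays finite when an income is zero, where a plain logarithm would not.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration, not taken from the loan file.
toy = pd.DataFrame({
    "Loan_Status": ["Y", "N", "Y"],
    "CoapplicantIncome": [0.0, 1500.0, 4000.0],
})

# The comparison gives True/False, which astype converts to 1/0.
toy["outcome"] = (toy["Loan_Status"] == "Y").astype(np.uint8)

# np.log1p(x) computes log(1 + x); a zero income maps to exactly 0.
toy["log_income"] = np.log1p(toy["CoapplicantIncome"])
print(toy)
```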