### Instructions: To be solved using Python

### Predicting Loan Repayment

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this problem, we will use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010. The binary dependent variable not.fully.paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was "charged off," meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, we will use the following independent variables available to the investor when deciding whether to fund a loan:

Variable | Description
------------------ | ------------------------------------
credit.policy | 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose | The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
int.rate | The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment | The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc | The natural log of the self-reported annual income of the borrower.
dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico | The FICO credit score of the borrower.
days.with.cr.line | The number of days the borrower has had a credit line.
revol.bal | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths | The borrower's number of inquiries by creditors in the last 6 months.
delinq.2yrs | The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec | The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

#### Problem 2.1 - Prediction Models

Now that we have prepared the dataset, we need to split it into a training set and a testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select 70% of the observations for the training set (the dependent variable for sample.split is not.fully.paid).

Name the data frames train and test.
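
Since the assignment is to be solved in Python, the split needs a Python counterpart to sample.split (an R function from the caTools package), which samples within each class of the dependent variable so both sets keep roughly the same share of not.fully.paid loans. The sketch below shows that idea with the standard library only; the helper name `stratified_split` and the toy labels are assumptions made for illustration. With scikit-learn, `train_test_split(..., stratify=y, random_state=144)` is the usual equivalent. A Python seed of 144 will not reproduce the exact rows chosen by R.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=144):
    """Return (train_idx, test_idx), sampling train_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy outcome vector (hypothetical): 80 repaid loans (0) and 20 not fully paid (1).
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

The index lists can then be used to select the rows that form train and test.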

Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.

#### Q(1): Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.)

Select all that apply.
* credit.policy
* purpose2 (credit card)
* purpose3 (debt consolidation)
* purpose4 (educational)
* purpose5 (home improvement)
* purpose6 (major purchase)
* purpose7 (small business)
* int.rate
* installment
* log.annual.inc
* dti
* fico
* days.with.cr.line
* revol.bal
* revol.util
* inq.last.6mths
* delinq.2yrs
* pub.rec
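
The Pr(>|z|) column comes from a Wald test on each coefficient: z is the coefficient divided by its standard error, and the p-value is the two-sided tail probability under a standard normal. The sketch below shows that calculation; the variable names `var_a` and `var_b` and their numbers are made up for illustration and are not results from the loans data. In Python, statsmodels reports the same quantity in the `P>|z|` column of its logistic regression summary.

```python
import math

def two_sided_p_value(coef, std_err):
    """Wald test: z = coef / std_err, p = 2 * (1 - Phi(|z|))."""
    z = coef / std_err
    # Standard normal CDF expressed through the error function.
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return z, 2.0 * (1.0 - phi)

# Hypothetical (coefficient, standard error) pairs, NOT results from the loans data.
summary = {"var_a": (0.50, 0.10), "var_b": (0.05, 0.10)}
significant = []
for name, (coef, se) in summary.items():
    z, p = two_sided_p_value(coef, se)
    if p < 0.05:
        significant.append(name)
print(significant)
```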

#### Problem 1.0 - Prediction Models


Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.
Let Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B. 

#### Q(2): What is the value of Logit(A) - Logit(B)?
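
As a hint on the structure of the calculation: the two applications share every variable except fico, so all other terms of the linear predictor cancel in the difference. Writing $\beta_{\text{fico}}$ for the fitted fico coefficient (a symbol introduced here, not notation from the problem), the difference reduces to:

$$
\mathrm{Logit}(A) - \mathrm{Logit}(B) = \beta_{\text{fico}}\,(700 - 710) = -10\,\beta_{\text{fico}}
$$

The numeric answer follows once the coefficient is read from your fitted model.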

Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B. 

#### Q(3): What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A)*exp(B)*exp(C). Also, remember that exp() is the exponential function, available in Python as math.exp or numpy.exp.)
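
Because odds are the exponential of the log odds, the ratio O(A)/O(B) equals exp(Logit(A) - Logit(B)), and only the fico term survives in that difference. The sketch below works through the arithmetic with a made-up coefficient; `beta_fico = -0.01` is an assumption for illustration and should be replaced by the value from your own fitted model.

```python
import math

# Hypothetical fico coefficient for illustration; substitute the fitted value.
beta_fico = -0.01

logit_diff = beta_fico * (700 - 710)   # Logit(A) - Logit(B)
odds_ratio = math.exp(logit_diff)      # O(A) / O(B)
print(logit_diff, odds_ratio)
```

With a negative fico coefficient the difference is positive, so the odds ratio comes out above 1: the lower-score application has higher predicted odds of not being paid back in full.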

#### Problem 1.1 - Prediction Models

Predict the probability of the test set loans not being paid back in full (in R, remember type="response" for the predict function; in Python, make sure your model returns probabilities rather than class labels). Store these predicted probabilities in a variable named predicted.risk and add it to your test set (we will use this variable in later parts of the problem).
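
A predicted probability is the inverse logit of the linear predictor, which is what type="response" asks for in R. The sketch below shows that conversion for a single row; the intercept, the coefficients, and the row values are hypothetical, and the helper name `predicted_probability` is an assumption rather than a library function.

```python
import math

def predicted_probability(intercept, coefs, row):
    """Inverse logit of the linear predictor: 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    linear = intercept + sum(coefs[name] * row[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients and one hypothetical test-set row (not fitted values).
coefs = {"int.rate": 2.0, "fico": -0.01}
row = {"int.rate": 0.10, "fico": 700}
row["predicted.risk"] = predicted_probability(6.0, coefs, row)
print(row["predicted.risk"])
```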

#### Q(4):

a. Compute the confusion matrix using a threshold of 0.5.

b. What is the accuracy of the logistic regression model? Input the accuracy as a number between 0 and 1.

c. What is the accuracy of the baseline model? Input the accuracy as a number between 0 and 1.
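
The three parts above can be sketched on a toy example. The labels and probabilities below are made up, and the baseline is assumed to be the model that always predicts the most frequent outcome in the test set. The dictionary keys are (actual, predicted) pairs.

```python
def confusion_matrix(actual, predicted_prob, threshold=0.5):
    """Return counts as {(actual, predicted): n} for binary 0/1 labels."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(actual, predicted_prob):
        counts[(y, int(p > threshold))] += 1
    return counts

# Toy test set (hypothetical values, not the loans data).
actual = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
predicted_risk = [0.1, 0.2, 0.05, 0.3, 0.6, 0.15, 0.7, 0.4, 0.25, 0.2]

cm = confusion_matrix(actual, predicted_risk)
accuracy = (cm[(0, 0)] + cm[(1, 1)]) / len(actual)
# Baseline: always predict the most frequent class in the test set.
baseline_accuracy = max(actual.count(0), actual.count(1)) / len(actual)
print(cm, accuracy, baseline_accuracy)
```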

#### Problem 2.0 - Prediction Models

#### Q(5): Use the appropriate package to compute the test set AUC.

The model has poor accuracy at the threshold 0.5. But despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.
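
The AUC depends only on how the model orders the loans, not on any threshold: it is the probability that a randomly chosen not-fully-paid loan receives a higher predicted risk than a randomly chosen fully paid one. The sketch below computes it directly from that definition on made-up data; the function name `auc` is an assumption. For the actual test set, scikit-learn's `roc_auc_score(y_true, y_score)` computes the same quantity.

```python
def auc(actual, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(actual, scores) if y == 1]
    neg = [s for y, s in zip(actual, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical values).
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(actual, scores))  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```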

#### Problem 2.1 - A "Smart Baseline"

In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations.

However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan's risk. This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan's interest rate as a "smart baseline" to order the loans according to risk.

#### Q(5): Using the training set, build a bivariate logistic regression model (that is, a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.
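
To make concrete what fitting a single-variable logistic regression involves, the sketch below maximizes the log-likelihood by Newton-Raphson using only the standard library. The function name `fit_bivariate_logit` and the eight toy observations are assumptions for illustration. For the real training set a library such as statsmodels or scikit-learn would normally be used instead.

```python
import math

def fit_bivariate_logit(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += (X'WX)^-1 * gradient, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy training data (hypothetical): interest rates and 0/1 outcomes.
int_rate = [0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
not_fully_paid = [0, 0, 1, 0, 0, 1, 1, 1]
b0, b1 = fit_bivariate_logit(int_rate, not_fully_paid)
print(b0, b1)
```

A positive slope `b1` means higher interest rates go with higher predicted risk in the toy data.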

#### Q(6): The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?

a) int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.

b) This effect is likely due to the training/testing set split we used. In other splits, we could see the opposite effect.

c) These models are trained on a different set of observations, so the coefficients are not comparable.
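
Option a) is a claim about correlation between int.rate and the other predictors, which can be checked on the training set. The sketch below computes a Pearson correlation from its definition; the two toy columns are made up and say nothing about the real data. With pandas, `DataFrame.corr()` gives the full correlation matrix in one call.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical toy columns, not the loans data.
int_rate = [0.07, 0.09, 0.11, 0.13, 0.15]
fico = [780, 740, 720, 690, 660]
print(pearson(int_rate, fico))
```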

#### Problem 2.2 - A "Smart Baseline"

#### Q(6): 

a. Make test set predictions for the bivariate model.

b. What is the highest predicted probability of a loan not being paid in full on the testing set?

c. With a logistic regression cutoff of 0.5, how many loans would be predicted as not being paid in full on the testing set?
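
Once the predictions exist, parts b and c are a maximum and a count. A minimal sketch, assuming the predictions are held in a list named `pred_bivariate` (a hypothetical name, filled here with made-up values):

```python
# Hypothetical predicted probabilities from a bivariate model (not real results).
pred_bivariate = [0.08, 0.12, 0.21, 0.62, 0.43, 0.17]

highest = max(pred_bivariate)                     # part b
flagged = sum(p > 0.5 for p in pred_bivariate)    # part c: predictions above the cutoff
print(highest, flagged)
```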

#### Problem 2.3 - A "Smart Baseline"

#### Q(7): What is the test set AUC of the bivariate model?
