## Predictive Modelling Midterm

## Part One

In [92]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, precision_recall_curve
from sklearn.pipeline import Pipeline
import statsmodels.api as sm
from scipy import stats

In [93]:
from statsmodels.sandbox.regression.gmm import IV2SLS 

In [94]:
from statsmodels.sandbox.regression.gmm import GMM

In [95]:
input_table = pd.read_csv('/Users/shubhangimallik/Downloads/midterm_partone.csv')
input_table.head()

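| | Constant | Stock Change | Inventory Turnover | Operating Profit | Interaction Effect | Current Ratio | Quick Ratio | Debt Asset Ratio |
|---|----------|--------------|--------------------|------------------|--------------------|---------------|-------------|------------------|
| 0 | 1 | 0.870332 | 1.795946 | 0.115846 | 0.208053 | 1.672527 | 0.255171 | 0.473317 |
| 1 | 1 | -0.047347 | 1.395501 | 0.436967 | 0.609788 | 1.637261 | 0.221763 | 0.489967 |
| 2 | 1 | 0.001176 | 1.664563 | 0.541016 | 0.900555 | 1.640619 | 0.189141 | 0.374269 |
| 3 | 1 | -0.9012 | 1.605738 | 0.539399 | 0.866133 | 1.436221 | 0.131944 | 0.224399 |
| 4 | 1 | -0.176353 | 1.591451 | 0.539938 | 0.859285 | 1.43314 | 0.183095 | 0.213446 |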


In [96]:
model_iv = sm.OLS(input_table["Inventory Turnover"],input_table[["Constant","Current Ratio","Quick Ratio",\
                                                                 "Debt Asset Ratio"]]).fit()
endog_predict = model_iv.predict(input_table[["Constant","Current Ratio","Quick Ratio","Debt Asset Ratio"]])
input_table["Endogenous Param"] = endog_predict

In [97]:
model_2sls = sm.OLS(input_table["Stock Change"], input_table[["Constant","Endogenous Param",\
                                                              "Operating Profit","Interaction Effect",\
                                                             ]]).fit()
model_2sls.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Stock Change | R-squared: | 0.015 |
| Model: | OLS | Adj. R-squared: | 0.013 |
| Method: | Least Squares | F-statistic: | 8.53 |
| Date: | Sun, 12 Nov 2023 | Prob (F-statistic): | 1.27e-05 |
| Time: | 18:27:58 | Log-Likelihood: | -1186.5 |
| No. Observations: | 1696 | AIC: | 2381.0 |
| Df Residuals: | 1692 | BIC: | 2403.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|------|---------|---|---------|--------|--------|
| Constant | -0.0176 | 0.020 | -0.896 | 0.370 | -0.056 | 0.021 |
| Endogenous Param | 0.0011 | 0.001 | 1.827 | 0.068 | -7.76e-05 | 0.002 |
| Operating Profit | -0.1201 | 0.028 | -4.319 | 0.000 | -0.175 | -0.066 |
| Interaction Effect | 0.0014 | 0.000 | 3.621 | 0.000 | 0.001 | 0.002 |

| | | | |
|---|---|---|---|
| Omnibus: | 368.832 | Durbin-Watson: | 2.243 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 3433.92 |
| Skew: | 0.742 | Prob(JB): | 0.0 |
| Kurtosis: | 9.811 | Cond. No. | 109.0 |


# OLS Regression Results

## Stock Change Prediction

- **Model Fit:**
  - The model explains only a small portion of the variability in Stock Change: the R-squared is 0.015 (about 1.5% of the variance). The Adjusted R-squared, which penalizes the number of predictors, is 0.013.

- **Statistical Significance:**
  - The F-statistic of 8.530 is statistically significant (p-value: 1.27e-05), indicating that at least one predictor is significantly related to Stock Change.

- **Coefficients:**
  - **Constant:** The constant term is not statistically significant (p-value: 0.370), so the intercept cannot be distinguished from zero.
  - **Endogenous Param:** This is the first-stage fitted value of Inventory Turnover. Its coefficient of 0.0011 has a p-value of 0.068, so it is significant at the 10% level but not at the 5% level.
  - **Operating Profit:** Operating Profit is statistically significant (p-value below 0.001), and the negative coefficient of -0.1201 indicates a negative relationship with Stock Change.
  - **Interaction Effect:** The Interaction Effect is statistically significant (p-value below 0.001), with a small positive coefficient of 0.0014.

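These results suggest that the model is jointly significant but explains very little of the variation in Stock Change, with Operating Profit and the Interaction Effect being the predictors with clear statistical support. Because the second stage is run as a plain OLS on a generated regressor, the reported standard errors are not 2SLS standard errors and should be read with caution.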
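The two-stage procedure used above can be sketched in plain NumPy to show where a manual second-stage OLS and textbook 2SLS part ways. This is a minimal sketch on synthetic data, not the midterm dataset: the function name `two_stage_least_squares` and all variable names are hypothetical. It assumes the usual convention that the first stage includes every exogenous variable (in this notebook that would mean adding Operating Profit and Interaction Effect to the first stage), and that the residual variance is computed with the observed regressor rather than its fitted value.

```python
import numpy as np

def two_stage_least_squares(y, X_exog, x_endog, Z_excl):
    """2SLS for y = [X_exog, x_endog] @ beta + u; returns (beta, std_err)."""
    n = len(y)
    # First stage: regress the endogenous regressor on every exogenous
    # variable (included regressors and excluded instruments).
    Z = np.column_stack([X_exog, Z_excl])
    gamma, *_ = np.linalg.lstsq(Z, x_endog, rcond=None)
    x_hat = Z @ gamma

    # Second stage: replace the endogenous regressor by its fitted value.
    X_hat = np.column_stack([X_exog, x_hat])
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

    # Residuals for the variance estimate use the observed regressor,
    # not the fitted one; a manual second-stage OLS gets this part wrong.
    X = np.column_stack([X_exog, x_endog])
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X_hat.T @ X_hat)
    return beta, np.sqrt(np.diag(cov))

# Synthetic data (hypothetical): x_endog is correlated with the error u.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=(n, 2))            # excluded instruments
w = rng.normal(size=n)                 # exogenous regressor
u = rng.normal(size=n)                 # structural error
x_endog = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=n)
X_exog = np.column_stack([np.ones(n), w])
y = 1.0 + 2.0 * w + 0.5 * x_endog + u  # true coefficient on x_endog is 0.5

beta_2sls, se_2sls = two_stage_least_squares(y, X_exog, x_endog, z)
beta_ols, *_ = np.linalg.lstsq(np.column_stack([X_exog, x_endog]), y, rcond=None)
print("2SLS:", beta_2sls.round(3), "OLS:", beta_ols.round(3))
```

On this synthetic example the OLS coefficient on the endogenous regressor is biased upward, while the 2SLS estimate lands close to the true value of 0.5.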



In [98]:
y_vals  = np.array(input_table["Stock Change"])
x_vals  = np.array(input_table[["Inventory Turnover","Operating Profit","Interaction Effect"]])
iv_vals = np.array(input_table[["Current Ratio","Quick Ratio","Debt Asset Ratio"]])

class gmm(GMM):
    def momcond(self, params):
        p0, p1, p2, p3 = params
        endog = self.endog
        exog = self.exog
        inst = self.instrument

        # Structural residual. exog columns: Inventory Turnover (endogenous),
        # Operating Profit, Interaction Effect.
        resid = endog - p0 - p1 * exog[:,0] - p2 * exog[:,1] - p3 * exog[:,2]

        # One moment per exogenous variable: the constant, the two exogenous
        # regressors and the three instruments (6 moments, 4 parameters).
        g = np.column_stack((resid, resid * exog[:,1], resid * exog[:,2],
                             resid * inst[:,0], resid * inst[:,1], resid * inst[:,2]))
        return g


beta0 = np.array([0.1, 0.1, 0.1, 0.1])
res = gmm(endog = y_vals, exog = x_vals, instrument = iv_vals, k_moms=6, k_params=4).fit(beta0)

res.summary()


Optimization terminated successfully.
         Current function value: 0.000046
         Iterations: 8
         Function evaluations: 12
         Gradient evaluations: 12
Optimization terminated successfully.
         Current function value: 0.000373
         Iterations: 7
         Function evaluations: 13
         Gradient evaluations: 13
Optimization terminated successfully.
         Current function value: 0.000372
         Iterations: 5
         Function evaluations: 9
         Gradient evaluations: 9
Optimization terminated successfully.
         Current function value: 0.000372
         Iterations: 5
         Function evaluations: 11
         Gradient evaluations: 11
Optimization terminated successfully.
         Current function value: 0.000372
         Iterations: 0
         Function evaluations: 1
         Gradient evaluations: 1


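| | | | |
|---|---|---|---|
| Dep. Variable: | y | Hansen J: | 0.6317 |
| Model: | gmm | Prob (Hansen J): | 0.729 |
| Method: | GMM | | |
| Date: | Sun, 12 Nov 2023 | | |
| Time: | 18:27:58 | | |
| No. Observations: | 1696 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|------|---------|---|---------|--------|--------|
| p 0 | -0.0200 | 0.021 | -0.964 | 0.335 | -0.061 | 0.021 |
| p 1 | 0.0011 | 0.001 | 1.843 | 0.065 | -6.89e-05 | 0.002 |
| p 2 | -0.1071 | 0.032 | -3.370 | 0.001 | -0.169 | -0.045 |
| p 3 | 0.0011 | 0.000 | 2.760 | 0.006 | 0.000 | 0.002 |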


# Optimization Results

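- The optimizer terminated successfully in each of the five runs. Each message appears to correspond to one step of the iterated GMM fit, in which the weighting matrix is re-estimated and the objective is minimized again.
- The objective value is 0.000046 in the first run, 0.000373 in the second, and settles at 0.000372 from the third run on.
- The number of iterations per run falls from 8 to 0, so the last run starts at a point that is already optimal.
- Function and gradient evaluations fall from 12 or 13 in the first two runs to 1 in the last, which is consistent with the estimates having converged.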
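For a linear model such as this one, each GMM step has a closed-form solution, which makes it easier to see what the repeated optimizer runs are doing. The following is a minimal sketch on synthetic data, not a description of the statsmodels internals: `linear_gmm` and every variable name are hypothetical, the first step is assumed to use an identity weighting matrix, and the J statistic is computed with an uncentered moment covariance estimate.

```python
import numpy as np

def linear_gmm(y, X, Z, n_steps=3):
    """Iterated linear GMM for the moment conditions E[Z'(y - X beta)] = 0."""
    n = len(y)
    W = np.eye(Z.shape[1])                # first step: identity weighting
    XZ = X.T @ Z / n
    Zy = Z.T @ y / n
    for _ in range(n_steps):
        # Closed-form minimizer of g_bar(beta)' W g_bar(beta).
        beta = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ Zy)
        g = Z * (y - X @ beta)[:, None]   # per-observation moments
        S = g.T @ g / n                   # moment covariance estimate
        W = np.linalg.inv(S)              # weighting matrix for the next step
    g_bar = g.mean(axis=0)
    j_stat = n * g_bar @ W @ g_bar        # Hansen J at the final estimate
    return beta, j_stat

# Synthetic data (hypothetical): one endogenous regressor, two instruments.
rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=(n, 2))
w = rng.normal(size=n)
u = rng.normal(size=n)
x_endog = z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=n)
y = 1.0 + 0.5 * x_endog + 2.0 * w + u

X = np.column_stack([np.ones(n), x_endog, w])   # regressors
Z = np.column_stack([np.ones(n), w, z])         # constant, exogenous, instruments
beta_hat, j_stat = linear_gmm(y, X, Z)
print(beta_hat.round(3), round(float(j_stat), 3))
```

With four moments and three parameters, the J statistic in this sketch would be compared with a chi-squared distribution with one degree of freedom.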

# GMM Results

- **Dependent Variable:** y (Stock Change)
- **Hansen J Statistic:** 0.6317
- **Prob (Hansen J):** 0.729
- **Model Method:** GMM
- **Date:** Sun, 12 Nov 2023
- **Time:** 18:27:58
- **No. Observations:** 1696

## Coefficients

| Coefficient | Estimate | Std Error | Z-value | P>\|z\| | 95% Confidence Interval |
|-------------|----------|-----------|---------|------|--------------------------|
| p0          | -0.0200  | 0.021     | -0.964  | 0.335| [-0.061, 0.021]         |
| p1          | 0.0011   | 0.001     | 1.843   | 0.065| [-6.89e-05, 0.002]      |
| p2          | -0.1071  | 0.032     | -3.370  | 0.001| [-0.169, -0.045]        |
| p3          | 0.0011   | 0.000     | 2.760   | 0.006| [0.000, 0.002]          |

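Interpretation: 
- p0 (intercept) is not significantly different from zero (p-value = 0.335).
- p1 (Inventory Turnover) is marginally significant: with a p-value of 0.065 it passes the 10% level but not the 5% level.
- p2 (Operating Profit) is significant at the 1% level (p-value = 0.001), with a negative coefficient of -0.1071.
- p3 (Interaction Effect) is significant at the 1% level (p-value = 0.006).
- The Hansen J statistic of 0.6317 (p-value = 0.729) does not reject the over-identifying restrictions, so the data give no evidence against the moment conditions.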
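The Hansen J p-value reported above can be reproduced by hand. The model has 6 moment conditions and 4 parameters, so the J statistic is compared with a chi-squared distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(-x/2). A minimal sketch using only the standard library (the helper name `chi2_sf_df2` is hypothetical):

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

k_moms, k_params = 6, 4
df = k_moms - k_params          # over-identifying restrictions
j_stat = 0.6317                 # Hansen J from the summary above
p_value = chi2_sf_df2(j_stat)
print(df, round(p_value, 3))    # 2 0.729
```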




In [99]:
class gmm_with_delta(GMM):
    def momcond(self, params):
        p0, p1, p2, p3, delta = params
        endog = self.endog
        exog = self.exog
        inst = self.instrument

        # Structural residual, identical to the baseline model.
        resid = endog - p0 - p1 * exog[:,0] - p2 * exog[:,1] - p3 * exog[:,2]

        # delta shifts only the three instrument moments, allowing the
        # instruments to have a common non-zero covariance with the residual
        # (6 moments, 5 parameters).
        g = np.column_stack((resid, resid * exog[:,1], resid * exog[:,2],
                             resid * inst[:,0] - delta,
                             resid * inst[:,1] - delta,
                             resid * inst[:,2] - delta))
        return g

y_vals = np.array(input_table["Stock Change"])
x_vals = np.array(input_table[["Inventory Turnover", "Operating Profit", "Interaction Effect"]])
iv_vals = np.array(input_table[["Current Ratio", "Quick Ratio", "Debt Asset Ratio"]])


beta0_with_delta = np.array([0.1, 0.1, 0.1, 0.1, 0.1]) 
results = gmm_with_delta(endog=y_vals, exog=x_vals, instrument=iv_vals, k_moms=6, k_params=5).fit(beta0_with_delta)
results.summary()

Optimization terminated successfully.
         Current function value: 0.000031
         Iterations: 10
         Function evaluations: 15
         Gradient evaluations: 15
Optimization terminated successfully.
         Current function value: 0.000345
         Iterations: 9
         Function evaluations: 11
         Gradient evaluations: 11
Optimization terminated successfully.
         Current function value: 0.000346
         Iterations: 7
         Function evaluations: 10
         Gradient evaluations: 10
Optimization terminated successfully.
         Current function value: 0.000346
         Iterations: 2
         Function evaluations: 5
         Gradient evaluations: 5


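| | | | |
|---|---|---|---|
| Dep. Variable: | y | Hansen J: | 0.5862 |
| Model: | gmm_with_delta | Prob (Hansen J): | 0.444 |
| Method: | GMM | | |
| Date: | Sun, 12 Nov 2023 | | |
| Time: | 18:27:58 | | |
| No. Observations: | 1696 | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|------|---------|---|---------|--------|--------|
| p 0 | -0.0208 | 0.021 | -0.986 | 0.324 | -0.062 | 0.020 |
| p 1 | 0.0011 | 0.001 | 1.839 | 0.066 | -7.31e-05 | 0.002 |
| p 2 | -0.1062 | 0.032 | -3.316 | 0.001 | -0.169 | -0.043 |
| p 3 | 0.0011 | 0.000 | 2.688 | 0.007 | 0.000 | 0.002 |
| p 4 | -0.0006 | 0.003 | -0.213 | 0.831 | -0.006 | 0.005 |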
p 4,-0.0006,0.003,-0.213,0.831,-0.006,0.005


# Optimization Results

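- The optimizer terminated successfully in each of the four runs, which appear to be successive steps of the iterated GMM fit with a re-estimated weighting matrix.
- The objective value is 0.000031 in the first run, 0.000345 in the second, and 0.000346 in the last two.
- The number of iterations per run falls from 10 to 2.
- Function and gradient evaluations fall from 15 to 5 per run, which is consistent with the estimates having converged.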

# GMM_with_delta Results

- **Dependent Variable:** y (Stock Change)
- **Hansen J Statistic:** 0.5862
- **Prob (Hansen J):** 0.444
- **Model Method:** GMM_with_delta
- **Date:** Sun, 12 Nov 2023
- **Time:** 18:27:58
- **No. Observations:** 1696

## Coefficients

| Coefficient | Estimate | Std Error | Z-value | P>\|z\| | 95% Confidence Interval |
|-------------|----------|-----------|---------|------|--------------------------|
| p0          | -0.0208  | 0.021     | -0.986  | 0.324| [-0.062, 0.020]         |
| p1          | 0.0011   | 0.001     | 1.839   | 0.066| [-7.31e-05, 0.002]      |
| p2          | -0.1062  | 0.032     | -3.316  | 0.001| [-0.169, -0.043]        |
| p3          | 0.0011   | 0.000     | 2.688   | 0.007| [0.000, 0.002]          |
| p4          | -0.0006  | 0.003     | -0.213  | 0.831| [-0.006, 0.005]         |

## Interpretation

- The optimization process converged to a solution in each run.
- The GMM_with_delta model has five parameters: p0 to p3 are the regression coefficients, and p4 is the δ term subtracted from the three instrument moment conditions.
- The Hansen J statistic tests the over-identifying restrictions in GMM. Here it is 0.5862 with a p-value of 0.444, so the restrictions are not rejected at conventional significance levels.
- p2 (p-value = 0.001) and p3 (p-value = 0.007) are statistically significant, and p1 is marginally significant (p-value = 0.066). The intercept p0 (p-value = 0.324) and the δ term p4 (estimate -0.0006, p-value = 0.831) are not statistically significant.



The estimated δ (p4) is -0.0006 with a p-value of 0.831, and its 95% confidence interval [-0.006, 0.005] includes zero. In the context of the GMM model with delta, there is therefore no statistical evidence to support the industry expert's claim that the δ term has a significant effect on the model.

In this particular analysis, the industry expert's claim is not statistically justified.
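The two test results behind this conclusion can be checked by hand with the standard library. The sketch below assumes the reported z-value for p4 (-0.213) is treated as standard normal, and that the J statistic has 6 - 5 = 1 degree of freedom in the model with delta. The helper names are hypothetical.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

p_delta = two_sided_p_from_z(-0.213)   # z-value of p4 (delta) from the summary
p_hansen = chi2_sf_df1(0.5862)         # Hansen J with 6 moments, 5 parameters
print(round(p_delta, 3), round(p_hansen, 3))
```

Both values should agree with the summary table up to rounding of the reported z-value and J statistic.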


## Part Two

In [101]:
df = pd.read_csv('/Users/shubhangimallik/Downloads/midterm_parttwo.csv')
df.head()

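| | Years of Education after High School | Requested Credit Amount | Number of Dependents | Monthly Income | Monthly Expense | Marital Status | Credit Rating |
|---|--------------------------------------|-------------------------|----------------------|----------------|-----------------|----------------|---------------|
| 0 | 1 | Low | No dependent | Very low | Very low | Married | Positive |
| 1 | 2 | Low | No dependent | Very low | Very low | Single | Positive |
| 2 | 1 | Low | No dependent | Very low | Very low | Single | Positive |
| 3 | 3 | Low | No dependent | Very low | Very low | Married | Positive |
| 4 | 3 | Low | No dependent | Very low | Very low | Single | Negative |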


In [102]:
df = pd.get_dummies(df,columns=['Requested Credit Amount', 'Marital Status', 'Number of Dependents',
                                      'Monthly Income', 'Monthly Expense'], drop_first=True)

In [103]:
X = df.drop('Credit Rating', axis=1)
y = df['Credit Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [104]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [105]:

y_pred = model.predict(X_test)


In [106]:
from sklearn.metrics import recall_score, precision_score, f1_score

recall = recall_score(y_test, y_pred, pos_label='Positive')
precision = precision_score(y_test, y_pred, pos_label='Positive')
f1 = f1_score(y_test, y_pred, pos_label='Positive')


print(f"Recall: {recall:.2f}")
print(f"Precision: {precision:.2f}")
print(f"F1 Score: {f1:.2f}")


Recall: 1.00
Precision: 0.86
F1 Score: 0.92


In [107]:
confusion = confusion_matrix(y_test, y_pred)

In [108]:
print("Confusion Matrix:")
print(confusion)

Confusion Matrix:
[[   0  577]
 [   0 3464]]


## Results Before Threshold Adjustment

### Metrics:
- **Recall (Sensitivity or True Positive Rate):** 1.00
- **Precision:** 0.86
- **F1 Score:** 0.92

### Confusion Matrix:
[[   0  577]
 [   0 3464]]

### Interpretation:
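1. **Recall:** The model identified all positive instances (credit fully repaid) in the test set, but only because it predicts Positive for every applicant: the predicted-Negative column of the confusion matrix is empty, so all 577 negative cases are misclassified.
2. **Precision:** 86% of instances predicted as positive were actually positive. Since every instance is predicted positive, this is simply the share of positives in the test set (3464 of 4041).
3. **F1 Score:** The score of 0.92 reflects the class imbalance rather than discriminating power. At the default 0.5 cut-off the model behaves like a majority-class classifier.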
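The metrics above follow directly from the four cells of the confusion matrix. A minimal sketch in plain Python, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes, labels sorted so that Negative comes before Positive); the helper `metrics_from_counts` is hypothetical:

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Precision, recall, F1 and positive base rate from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    base_rate = (tp + fn) / (tp + fp + fn + tn)
    return precision, recall, f1, base_rate

# Confusion matrix [[0, 577], [0, 3464]] read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 0, 577, 0, 3464
precision, recall, f1, base_rate = metrics_from_counts(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} base_rate={base_rate:.2f}")
# precision=0.86 recall=1.00 f1=0.92 base_rate=0.86
```

Precision and base rate coincide here, which is what happens whenever a classifier labels every case as positive.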



In [109]:
X_test.columns

Index(['Years of Education after High School', 'Requested Credit Amount_Low',
       'Requested Credit Amount_Medium', 'Marital Status_Not specified',
       'Marital Status_Single', 'Number of Dependents_More than 2',
       'Number of Dependents_No dependent', 'Monthly Income_Low',
       'Monthly Income_Moderate', 'Monthly Income_Very High',
       'Monthly Income_Very low', 'Monthly Expense_Low',
       'Monthly Expense_Moderate', 'Monthly Expense_Very high',
       'Monthly Expense_Very low'],
      dtype='object')

In [110]:
approval_threshold = 0.15 
y_pred_prob = model.predict_proba(X_test)[:, 1]
threshold_value = sorted(y_pred_prob)[int((1 - approval_threshold) * len(y_pred_prob))]

In [111]:
print(threshold_value)

0.8875163812479165
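One caveat with a percentile cut-off is that tied probabilities at the threshold are all approved by a `>=` comparison, so the approved share can exceed the target. With categorical predictors, many applicants can share the same predicted probability. The sketch below uses made-up scores and a hypothetical helper `approve_top_fraction` to compare the percentile approach used above with ranking and approving exactly the top k cases.

```python
import numpy as np

def approve_top_fraction(scores, fraction):
    """Approve exactly round(fraction * n) cases with the highest scores."""
    scores = np.asarray(scores)
    k = int(round(fraction * len(scores)))
    approved = np.zeros(len(scores), dtype=bool)
    if k > 0:
        # Stable sort: ties are broken by original order.
        top_idx = np.argsort(-scores, kind="stable")[:k]
        approved[top_idx] = True
    return approved

# Made-up scores with a four-way tie at the top.
scores = np.array([0.9, 0.9, 0.9, 0.9, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
fraction = 0.2

# Percentile cut-off, as in the cell above.
threshold = sorted(scores)[int((1 - fraction) * len(scores))]
by_threshold = scores >= threshold

# Rank-based approval.
by_rank = approve_top_fraction(scores, fraction)

print(int(by_threshold.sum()), int(by_rank.sum()))   # 4 2
```

Which behavior is preferable depends on whether the 15% figure is a hard cap or a guideline; the rank-based version breaks ties arbitrarily.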


In [112]:
y_pred_new_threshold = [1 if p >= threshold_value else 0 for p in y_pred_prob]

In [113]:

y_pred_new_threshold_mapped = ['Negative' if pred == 0 else 'Positive' for pred in y_pred_new_threshold]

confusion_new_threshold = confusion_matrix(y_test, y_pred_new_threshold_mapped)
recall_new_threshold = recall_score(y_test, y_pred_new_threshold_mapped, pos_label='Positive')
precision_new_threshold = precision_score(y_test, y_pred_new_threshold_mapped, pos_label='Positive')
f1_new_threshold = f1_score(y_test, y_pred_new_threshold_mapped, pos_label='Positive')


print("\nConfusion Matrix with New Threshold:")
print(confusion_new_threshold)
print(f"Recall with New Threshold: {recall_new_threshold:.2f}")
print(f"Precision with New Threshold: {precision_new_threshold:.2f}")
print(f"F1 Score with New Threshold: {f1_new_threshold:.2f}")



Confusion Matrix with New Threshold:
[[ 495   82]
 [2936  528]]
Recall with New Threshold: 0.15
Precision with New Threshold: 0.87
F1 Score with New Threshold: 0.26


## Results After Threshold Adjustment

### Metrics:
- **Confusion Matrix with New Threshold:**
[[ 495   82]
 [2936  528]]
- **Recall with New Threshold:** 0.15
- **Precision with New Threshold:** 0.87
- **F1 Score with New Threshold:** 0.26

### Updated Interpretation:
1. **Confusion Matrix with New Threshold:** 
   - True Positives (TP): 528
   - True Negatives (TN): 495
   - False Positives (FP): 82
   - False Negatives (FN): 2936

2. **Recall with New Threshold:** 
   - The model correctly identified 15% of positive instances (credit fully repaid) in the test set.

3. **Precision with New Threshold:** 
   - 87% of instances predicted as positive were actually positive.

4. **F1 Score with New Threshold:** 
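   - The F1 score has dropped to 0.26 because recall collapses while precision barely moves. The cut-off of about 0.888 approves only the top 15% of predicted probabilities (610 of 4041 applicants), so recall cannot exceed 610/3464 ≈ 0.18 even with a perfect ranking. Precision of 0.87 is only marginally above the 0.86 share of positives in the test set, which suggests the predicted probabilities separate the two classes only weakly.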
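The effect of the approval cap on recall and precision can be worked out from the confusion matrix alone. A minimal sketch in plain Python, reading the matrix as [[TN, FP], [FN, TP]]; the variable names are illustrative:

```python
# Confusion matrix with the new threshold: [[495, 82], [2936, 528]].
tn, fp, fn, tp = 495, 82, 2936, 528

total = tn + fp + fn + tp          # 4041 test cases
approved = tp + fp                 # cases predicted Positive
positives = tp + fn                # cases that are actually Positive

approval_rate = approved / total
recall = tp / positives
# Best possible recall if every approved case were a true positive.
recall_ceiling = min(approved, positives) / positives
precision = tp / approved
base_rate = positives / total
lift = precision / base_rate       # 1.0 means no better than approving at random

print(f"approval_rate={approval_rate:.3f} recall={recall:.3f} "
      f"recall_ceiling={recall_ceiling:.3f}")
print(f"precision={precision:.3f} base_rate={base_rate:.3f} lift={lift:.3f}")
```

A lift close to 1 means the approved group contains positives at nearly the same rate as the whole test set.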

---

