
In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import mean_squared_error

## Linear Regression

In [None]:
# Load data
data = pd.read_csv('bank-additional-full.csv', delimiter=';')

# Drop duplicates
data = data.drop_duplicates()

# Select only the relevant features
selected_features = ['marital', 'default', 'housing', 'loan', 'campaign', 'emp.var.rate', 'cons.price.idx', 'euribor3m', 'nr.employed', 'y']
data = data[selected_features].copy()  # copy so the column assignments below do not raise SettingWithCopyWarning

# Manual binary encoding: 1 = named category, 0 = every other category (including 'unknown').
# Note the direction: default = 1 means NO credit default, loan = 1 means NO personal loan.
data['marital'] = data['marital'].map({'married': 1, 'single': 0, 'divorced': 0, 'unknown': 0})
data['default'] = data['default'].map({'no': 1, 'yes': 0, 'unknown': 0})
data['housing'] = data['housing'].map({'yes': 1, 'no': 0, 'unknown': 0})
data['loan'] = data['loan'].map({'no': 1, 'yes': 0, 'unknown': 0})

# Split data
X = data.drop('y', axis=1)
y = data['y'].map({'yes': 1, 'no': 0})  # Encoding target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f'R^2 on training data: {r2_train}')
print(f'R^2 on testing data: {r2_test}')

# Coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

# Feature names
feature_names = X.columns

# Constructing the Linear Regression Equation
equation_parts = [f"{coeff:.4f}*{name}" for coeff, name in zip(coefficients, feature_names)]
linear_regression_equation = f"y = {intercept:.4f} + " + " + ".join(equation_parts)

print("Linear Regression Equation:")
print(linear_regression_equation)


R^2 on training data: 0.12802395471392958
R^2 on testing data: 0.15090186369163583
Linear Regression Equation:
y = 9.2467 + -0.0110*marital + 0.0236*default + -0.0013*housing + 0.0050*loan + -0.0019*campaign + -0.0427*emp.var.rate + 0.0276*cons.price.idx + 0.0691*euribor3m + -0.0023*nr.employed



### Linear Regression Equation Dummy Variables Interpretation

1. **Intercept (9.2467):** This is the predicted value of \( y \) when every predictor equals 0. Because the outcome is coded 0/1, fitted values of a linear regression are read as probabilities (a linear probability model), not log-odds. The value 9.2467 is not a valid probability: continuous predictors such as cons.price.idx and nr.employed take values far from 0 in the data, so the intercept is an extrapolation with no direct interpretation. For the dummy variables, 0 is the reference level, i.e. not being in the '1' category.

2. **Marital (-0.011):** Being married (marital = 1) is associated with a predicted probability of \( y = 1 \) that is 0.011 lower (about 1.1 percentage points) than for not being married (single, divorced, unknown), holding the other variables constant. This suggests that married individuals are slightly less likely to subscribe to a Term Deposit.

3. **Default (0.0236):** Having no default on credit (default = 1) is associated with a predicted probability of \( y = 1 \) that is 0.0236 higher (about 2.4 percentage points) than for having a default or unknown status, holding the other variables constant. This indicates that individuals with no default history are more likely to subscribe to a Term Deposit.

4. **Housing (-0.0013):** Having a housing loan (housing = 1) is associated with a predicted probability of \( y = 1 \) that is 0.0013 lower (about 0.1 percentage points), holding the other variables constant. Individuals with housing loans are therefore only marginally less likely to subscribe to a Term Deposit.

5. **Loan (0.005):** Not having a personal loan (loan = 1) is associated with a predicted probability of \( y = 1 \) that is 0.005 higher (about 0.5 percentage points), holding the other variables constant. This suggests a slight positive effect of not having a personal loan on the likelihood of subscribing to a Term Deposit.
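
In a linear regression on a 0/1 outcome, a dummy coefficient measures a difference in predicted probability rather than in log-odds. The sketch below illustrates this on made-up data (not the bank dataset): with a single 0/1 predictor, the least-squares slope equals the difference in the share of y = 1 between the two groups. The toy data and the helper name `ols_single` are assumptions for illustration only.

```python
from statistics import mean

def ols_single(x, y):
    """Return (intercept, slope) of a least-squares fit of y on one predictor x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical toy data: dummy = 1 for four clients, 0 for six clients
dummy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
subscribed = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

intercept, slope = ols_single(dummy, subscribed)

# Share of y = 1 within each group
share_1 = mean([y for d, y in zip(dummy, subscribed) if d == 1])
share_0 = mean([y for d, y in zip(dummy, subscribed) if d == 0])

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"difference in shares = {share_1 - share_0:.4f}")
```

With several predictors the equality no longer holds exactly, but the reading is the same: the coefficient is the change in predicted probability when the dummy switches from 0 to 1 and the other variables are held fixed.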



### Overall Model Performance
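- **R^2 on Training Data (0.128):** `model.score` returns R^2 for a linear regression, not accuracy. The model explains only about 12.8% of the variance in \( y \) on the training data, so it fits the training data poorly.
- **R^2 on Testing Data (0.151):** The model explains about 15.1% of the variance in \( y \) on the testing data. R^2 is similarly low on new, unseen data, so the linear model has little explanatory power for this binary outcome.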
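
`LinearRegression.score` reports the coefficient of determination R^2 rather than classification accuracy, and the two can differ a lot on a 0/1 outcome. The sketch below computes both on made-up predictions; the numbers, the 0.5 threshold, and the function names are assumptions chosen for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

def accuracy_at_threshold(y_true, y_pred, threshold=0.5):
    """Share of correct labels after cutting continuous predictions at threshold."""
    labels = [1 if yp >= threshold else 0 for yp in y_pred]
    return sum(label == yt for label, yt in zip(labels, y_true)) / len(y_true)

# Hypothetical outcome and fitted values from a linear model
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3]

r2 = r_squared(y_true, y_pred)
accuracy = accuracy_at_threshold(y_true, y_pred)
print(f"R^2 = {r2:.4f}, accuracy at 0.5 = {accuracy:.2f}")
```

The two metrics answer different questions, so an R^2 from the linear model should not be compared directly with an accuracy from a classifier.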


---



## Logistic Regression

In [None]:
# Split data (same features and same split as in the linear regression above)
X = data.drop('y', axis=1)
y = data['y'].map({'yes': 1, 'no': 0})  # Encoding target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation
accuracy_train = model.score(X_train, y_train)
accuracy_test = model.score(X_test, y_test)
print(f'Accuracy on training data: {accuracy_train}')
print(f'Accuracy on testing data: {accuracy_test}')

# Coefficients and intercept
coefficients = model.coef_[0]
intercept = model.intercept_[0]

# Feature names
feature_names = X.columns

# Constructing the Logistic Regression Equation
equation_parts = [f"{coeff:.4f}*{name}" for coeff, name in zip(coefficients, feature_names)]
logistic_regression_equation = f"Log-odds(P(y=1)) = {intercept:.4f} + " + " + ".join(equation_parts)

print("Logistic Regression Equation:")
print(logistic_regression_equation)


Accuracy on training data: 0.8854895421756624
Accuracy on testing data: 0.8874825226285966
Logistic Regression Equation:
Log-odds(P(y=1)) = 0.0061 + -0.2075*marital + 0.3772*default + -0.0778*housing + 0.0782*loan + -0.0544*campaign + -0.5529*emp.var.rate + 0.4978*cons.price.idx + 0.2762*euribor3m + -0.0097*nr.employed
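
The printed equation gives log-odds, which the logistic function converts into a probability. The sketch below applies it to one client whose feature values are hypothetical, chosen only for illustration. Because the coefficients are copied at four decimals, the result is approximate, especially for large-valued predictors such as nr.employed.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Rounded intercept and coefficients copied from the printed equation
intercept = 0.0061
coefs = {
    "marital": -0.2075, "default": 0.3772, "housing": -0.0778, "loan": 0.0782,
    "campaign": -0.0544, "emp.var.rate": -0.5529, "cons.price.idx": 0.4978,
    "euribor3m": 0.2762, "nr.employed": -0.0097,
}

# Hypothetical client (illustrative values, not a result from the notebook)
client = {
    "marital": 1, "default": 1, "housing": 0, "loan": 1,
    "campaign": 2, "emp.var.rate": 1.1, "cons.price.idx": 93.994,
    "euribor3m": 4.857, "nr.employed": 5191.0,
}

log_odds = intercept + sum(coefs[name] * client[name] for name in coefs)
probability = sigmoid(log_odds)
print(f"log-odds = {log_odds:.3f}, P(y=1) = {probability:.3f}")
```

On the fitted model itself, `model.predict_proba(X_test)[:, 1]` returns the same kind of probability using the unrounded coefficients.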



### Logistic Regression Equation Interpretation


#### Coefficients Interpretation

1. **Intercept (0.0061):** This is the log-odds of \( y = 1 \) when every predictor equals 0. For the dummy variables, 0 is the reference level, i.e. not being in the '1' category. The continuous predictors (for example cons.price.idx and nr.employed) take values far from 0 in the data, so the intercept is an extrapolation and should not be read as the log-odds for a typical client.

2. **Marital (-0.2075):** Being married (marital = 1) is associated with a decrease in the log-odds of \( y = 1 \) by 0.2075 compared to not being married (single, divorced, unknown). This suggests that married individuals are less likely to subscribe to a Term Deposit, assuming other variables are held constant.

3. **Default (0.3772):** Having no default on credit (default = 1) is associated with an increase in the log-odds of \( y = 1 \) by 0.3772 compared to having a default or unknown status, holding other variables constant. This indicates that individuals with no default history are more likely to subscribe to a Term Deposit.

4. **Housing (-0.0778):** Having a housing loan (housing = 1) is associated with a decrease in the log-odds of \( y = 1 \) by 0.0778, holding other variables constant. This implies that individuals with housing loans are slightly less likely to subscribe to a Term Deposit.

5. **Loan (0.0782):** Not having a personal loan (loan = 1) is associated with an increase in the log-odds of \( y = 1 \) by 0.0782, holding other variables constant. This suggests a slight positive effect of not having a personal loan on the likelihood of subscribing to a Term Deposit.
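
Log-odds changes are hard to read directly, so they are often exponentiated into odds ratios. A minimal sketch using the four dummy coefficients from the printed equation (rounded to four decimals, so the ratios are approximate):

```python
import math

# Dummy-variable coefficients copied from the printed logistic regression equation
log_odds_coefs = {
    "marital": -0.2075,
    "default": 0.3772,
    "housing": -0.0778,
    "loan": 0.0782,
}

# exp(coefficient) is the multiplicative change in the odds of y = 1
# when the dummy switches from 0 to 1, other variables held constant
odds_ratios = {name: math.exp(coef) for name, coef in log_odds_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f} ({(ratio - 1) * 100:+.1f}% change in odds)")
```

An odds ratio below 1 means lower odds of subscribing for the '1' category, and a ratio above 1 means higher odds. These are changes in odds, not in probability.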

### Overall Model Performance
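- **Accuracy on Training Data (88.55%):** The model classifies 88.55% of the training observations correctly.
- **Accuracy on Testing Data (88.75%):** Test accuracy is almost identical to training accuracy, so there is no sign of overfitting. Accuracy alone can be misleading here, though: most clients in this dataset did not subscribe, so the figure should be compared with the share of the majority class before concluding that the model predicts well on new, unseen data.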
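
Accuracy is easier to judge against a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. A minimal sketch with made-up labels; the 89/11 split and the function name `majority_baseline_accuracy` are illustrative assumptions, not figures from the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 1 = subscribed, 0 = did not subscribe
y_example = [0] * 89 + [1] * 11

baseline = majority_baseline_accuracy(y_example)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

In the notebook the same check could be run as `majority_baseline_accuracy(list(y_test))`. A model whose test accuracy is close to this baseline adds little beyond predicting 'no' for everyone.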
