# Exercise Instructions: Panel Data Modeling with Machine Learning Models

**Objective:**
The goal of this exercise is to practice panel data modeling skills using three machine learning models (Random Forest, Single Decision Tree, and Linear Regression with Elastic Net) that have not been utilized in the project so far. Completing the entire task or a significant portion during the class will earn you an additional 7 points (above what is outlined in the syllabus) towards your final grade.

**Tasks:**

1. **GitHub Setup:**
   - If you haven't done so already, [create](https://github.com/join) a GitHub account.
   - [Download](https://desktop.github.com) and [configure](https://docs.github.com/en/desktop/configuring-and-customizing-github-desktop/configuring-basic-settings-in-github-desktop) GitHub Desktop on your laptop. (A good introduction to the GitHub Desktop app is available [here](https://joshuadull.github.io/GitHub-Desktop/02-getting-started/index.html).) If you prefer the git command line, you can follow these [instructions](https://github.com/michaelwozniak/ml2_tools?tab=readme-ov-file#git).
2. **Repository Forking:**
   - [Fork](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) the following repository to your projects: [https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates](https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates)

3. **Repository Cloning:**
   - [Clone](https://docs.github.com/en/desktop/adding-and-cloning-repositories/cloning-a-repository-from-github-to-github-desktop) the forked repository to your local computer using GitHub Desktop.

4. **Notebook Exploration:**
   - Open the file `notebooks/10.exercise.ipynb` to begin the ML tasks.

5. **Model Creation:**

   In the file `notebooks/10.exercise.ipynb`:
   - Create the following models:
      1. Random Forest ([RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html))
      2. Decision Tree ([DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html))
      3. Linear Regression with Elastic Net ([ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html))
   
   Follow a similar process to the models presented in class (e.g., KNN - `notebooks/07.knn-model.ipynb`):
      - Load the prepared training data.
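      - Perform feature engineering if deemed necessary (note: the tree-based models, Random Forest and Decision Tree, do not require data standardization, unlike SVM and KNN; Elastic Net is a penalized linear model, so its coefficients are sensitive to feature scale).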
      - Conduct feature selection.
      - Perform hyperparameter tuning.
      - Identify a local champion for each model class (the best model for RF, DT, Elastic Net).
      - Save local champions to a pickle file.

6. **Model Evaluation:**
   - In the notebook `notebooks/09.final-comparison-and-summary.ipynb`, load the models you created and check if they outperform the previously used models.

7. **Version Control:**
   - At the end of the class, even if the tasks are incomplete, [commit](https://docs.github.com/en/desktop/making-changes-in-a-branch/committing-and-reviewing-changes-to-your-project-in-github-desktop) your changes using GitHub Desktop.
   - [Push](https://docs.github.com/en/desktop/making-changes-in-a-branch/pushing-changes-to-github-from-github-desktop) your changes to your remote GitHub repository.

8. **Submission:**
   - Send me the link to your GitHub project (my email: *mj.wozniak9@uw.edu.pl*).
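
Steps 5 and 6 above (saving the local champions and comparing them later) can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, the grids are deliberately tiny, and the output directory and file names are hypothetical rather than the ones used in the repository.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared train/test data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.4, -0.2, 0.0, 0.1]) + rng.normal(scale=0.05, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

# One small grid per model class; real grids would be larger
searches = {
    "dt": GridSearchCV(DecisionTreeRegressor(random_state=0),
                       {"max_depth": [2, 3, 4]}, cv=5,
                       scoring="neg_mean_squared_error"),
    "en": GridSearchCV(ElasticNet(),
                       {"alpha": [0.001, 0.01, 0.1]}, cv=5,
                       scoring="neg_mean_squared_error"),
}

# Hypothetical output directory; adjust to the repository layout
out_dir = Path(tempfile.mkdtemp())

for name, search in searches.items():
    search.fit(X_train, y_train)
    # best_estimator_ is refitted on all training rows: the local champion
    with (out_dir / f"{name}_champion.pkl").open("wb") as f:
        pickle.dump(search.best_estimator_, f)

# Later (e.g. in the comparison notebook): load and score on the held-out set
rmse = {}
for path in sorted(out_dir.glob("*_champion.pkl")):
    with path.open("rb") as f:
        model = pickle.load(f)
    preds = model.predict(X_test)
    rmse[path.stem] = float(np.sqrt(mean_squared_error(y_test, preds)))

# Print the champions from best (lowest RMSE) to worst
for name, value in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {value:.4f}")
```

Scoring every champion with the same metric on the same held-out rows is what makes the comparison with the earlier models meaningful.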

Good luck with the exercise! If you have any questions, feel free to ask.

In [23]:
# Imports: data handling, metrics, preprocessing, model selection, and the three model classes
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import ElasticNet

In [28]:
df_train = pd.read_csv("./train_fe.csv", index_col=0)
df_test = pd.read_csv("./test_fe.csv", index_col=0)

df = pd.concat([df_train, df_test], axis=0)

fr = pd.read_excel("feature_ranking.xlsx", index_col=0)

### Feature engineering

We standardize the variables with range standardization (`MinMaxScaler`) because the data contains dummy variables. Min-max scaling maps every continuous variable to the interval [0, 1], the same range as the dummies, so each variable gets a chance to have the same impact on the model.
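
As a small illustration of why range standardization suits data with dummies: min-max scaling computes `(x - min) / (max - min)` per column, so a 0/1 dummy is left unchanged while a wide-ranging continuous column is squeezed into [0, 1]. The column names and values below are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): one continuous column and one dummy
toy = pd.DataFrame({
    "revenue": [10.0, 250.0, 40.0, 1000.0],  # continuous, wide range
    "is_energy": [0, 1, 0, 1],               # dummy variable
})

# Scale each column to [0, 1] with (x - min) / (max - min)
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
print(scaled)
```

After scaling, both columns span exactly [0, 1], and the dummy column keeps its original 0/1 values.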

In [25]:
columns = [
    "rok",
    "ta",
    "txt",
    "pi",
    "str",
    "xrd",
    "ni",
    "ppent",
    "intant",
    "dlc",
    "dltt",
    "capex",
    "revenue",
    "cce",
    "adv",
    "etr",
    "diff",
    "roa",
    "lev",
    "intan",
    "rd",
    "ppe",
    "sale",
    "cash_holdings",
    "adv_expenditure",
    "capex2",
    "cfc",
    "dta",
    "capex2_scaled",
    "y_v2x_polyarchy",
    "y_e_p_polity",
    "y_BR_Democracy",
    "WB_GDPgrowth",
    "WB_GDPpc",
    "WB_Inflation",
    "rr_per_country",
    "rr_per_sector",
    "sektor_consumer discretionary",
    "sektor_consumer staples",
    "sektor_energy",
    "sektor_health care",
    "sektor_industrials",
    "sektor_materials",
    "sektor_real estate",
    "sektor_technology",
    "sektor_utilities",
    "gielda_2",
    "gielda_3",
    "gielda_4",
    "gielda_5",
    "ta_log",
    "txt_cat_(-63.011, -34.811]",
    "txt_cat_(-34.811, 0.488]",
    "txt_cat_(0.488, 24.415]",
    "txt_cat_(24.415, 25.05]",
    "txt_cat_(25.05, 308.55]",
    "txt_cat_(308.55, 327.531]",
    "txt_cat_(327.531, inf]",
    "pi_cat_(-8975.0, -1.523]",
    "pi_cat_(-1.523, 157.119]",
    "pi_cat_(157.119, 465.9]",
    "pi_cat_(465.9, 7875.5]",
    "pi_cat_(7875.5, 8108.5]",
    "pi_cat_(8108.5, inf]",
    "str_cat_(0.0875, 0.192]",
    "str_cat_(0.192, 0.28]",
    "str_cat_(0.28, inf]",
    "xrd_exists",
    "ni_profit",
    "ni_profit_20000",
    "ppent_sqrt",
    "intant_sqrt",
    "dlc_cat_(42.262, 176.129]",
    "dlc_cat_(176.129, 200.9]",
    "dlc_cat_(200.9, inf]",
    "dltt_cat_(39.38, 327.85]",
    "dltt_cat_(327.85, 876.617]",
    "dltt_cat_(876.617, inf]",
    "capex_cat_(7.447, 79.55]",
    "capex_cat_(79.55, 5451.0]",
    "capex_cat_(5451.0, inf]",
    "revenue_cat_(0.174, 1248.817]",
    "revenue_cat_(1248.817, 4233.587]",
    "revenue_cat_(4233.587, inf]",
    "cce_cat_(5.619, 63.321]",
    "cce_cat_(63.321, inf]",
    "adv_cat_(0.3, 874.5]",
    "adv_cat_(874.5, inf]",
    "diff_positive",
    "roa_clip",
    "lev_sqrt",
    "intan_pow2",
    "rd_sqrt",
    "ppe_clip",
    "cash_holdings_sqrt",
    "adv_expenditure_positive",
    "diff_dta",
    "cfc_dta",
    "etr_y_past",
    "etr_y_ma",
    "diff_ma",
    "roa_ma",
    "lev_ma",
    "intan_ma",
    "ppe_ma",
    "sale_ma",
    "cash_holdings_ma",
    "roa_past",
    "lev_past",
    "intan_past",
    "ppe_past",
    "sale_past",
    "cash_holdings_past",
]

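# Split columns: variables with more than two unique values are scaled;
# binary (dummy) variables are left as they are
standardization = []
not_standardization = []
for i in columns:
    if df[i].nunique() > 2:
        standardization.append(i)
    else:
        not_standardization.append(i)
standardization.remove("etr")  # the target is not scaled
standardization.append("y_e_p_polity")  # scaled explicitly as well

# Note: the scaler is fit on the concatenated train and test data
scaler = MinMaxScaler()
scaler.fit(df[standardization])
df[standardization] = scaler.transform(df[standardization])
# Check the range of every column after scaling
print(df[columns].describe().T["min"].unique())
print(df[columns].describe().T["max"].unique())

# Feature selection: the ten features with the highest mi_score in the ranking
var = fr.mi_score.sort_values(ascending=False).index.tolist()[0:10]
df.shape[0] ** (0.5)  # square root of the number of rows (not used below)

# greater_is_better=False makes the scorer return the negated MSE
mse = make_scorer(mean_squared_error, greater_is_better=False)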

[0. 1.]
[1. 1. 1.]
[2.01600000e+03 0.00000000e+00 8.73839499e-04 1.00000000e+00]
[2.01600000e+03 1.00000000e+00 1.00000000e+00 1.00000000e+00
 5.16991643e-01 1.00000000e+00 0.00000000e+00]


['etr_y_past',
 'etr_y_ma',
 'txt',
 'diff',
 'ni',
 'pi',
 'intant',
 'intant_sqrt',
 'ta',
 'revenue']
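
The ten features above are taken from a precomputed ranking. How `feature_ranking.xlsx` was produced is not shown here, so the following is only a sketch of one way such a ranking could be built, assuming `mi_score` stands for a mutual information score. The data is synthetic and the column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (assumption): only column "a" drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y = 2.0 * X["a"] + 0.1 * rng.normal(size=500)

# Estimate mutual information between each feature and the target
mi = pd.Series(
    mutual_info_regression(X, y, random_state=0),
    index=X.columns,
    name="mi_score",
)

# Keep the k highest-ranked features, mirroring the selection of `var`
top = mi.sort_values(ascending=False).index.tolist()[:2]
print(mi.sort_values(ascending=False))
print(top)
```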

## RandomForest

In [29]:
# Define the parameter grid for RandomForest
param_grid_random_forest = {
    'n_estimators': [10, 50, 100],
    'max_features': ['auto', 'sqrt'],  # 'auto' is rejected by scikit-learn >= 1.3
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create the GridSearchCV object for RandomForest
# The forest's default split criterion is squared error, so it is not passed explicitly
# No `scoring` is given, so candidates are ranked by the estimator's default score (R^2)
grid_search_rf = GridSearchCV(RandomForestRegressor(), param_grid_random_forest, cv=5)
grid_search_rf.fit(df.loc[:, var].values, df.loc[:, "etr"].values.ravel())

# Print the best parameters and best score
print("Best parameters for RandomForest:", grid_search_rf.best_params_)
print("Best score for RandomForest:", grid_search_rf.best_score_)

540 fits failed out of a total of 1080.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
540 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/omilod/Desktop/projects/temp/env/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/omilod/Desktop/projects/temp/env/lib/python3.9/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/Users/omilod/Desktop/projects/temp/env/lib/python3.9/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/Users/omilod/Desktop/projects/temp/env/lib/python3.9/site-packages/sk

Best parameters for RandomForest: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
Best score for RandomForest: 0.14725091287694575
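
Exactly half of the fits failed (540 of 1080), and the traceback ends in scikit-learn's parameter validation. That pattern is consistent with one of the two `max_features` values being rejected: `'auto'` was deprecated in scikit-learn 1.1 and removed in 1.3. For regressors it meant "use all features", which `1.0` expresses directly. A minimal sketch of a grid that avoids the failure, on synthetic data and with a reduced grid to keep it quick:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for df.loc[:, var] and the etr target (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

param_grid = {
    "n_estimators": [10, 50],
    "max_features": [1.0, "sqrt"],  # 1.0 = all features, in place of 'auto'
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # same metric as the decision tree search
    error_score="raise",               # fail loudly instead of recording NaN scores
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best negated MSE:", search.best_score_)
```

Setting `error_score="raise"` is optional, but it surfaces an invalid parameter on the first fit rather than after the whole grid has run.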


## Decision Tree

In [31]:
reg = DecisionTreeRegressor()

param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3]
}

# With scoring='neg_mean_squared_error', best_score_ is the negated MSE (closer to 0 is better)
grid_search = GridSearchCV(estimator=reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(df.loc[:, var].values, df.loc[:, "etr"].values.ravel())

print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Best parameters:  {'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best score:  -0.019032479638322383
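
The decision tree score is a negated MSE, whereas the Random Forest and ElasticNet searches pass no `scoring` argument and therefore report R². The three numbers are not directly comparable as printed. As a small worked conversion, the tree's score can be turned into an RMSE on the scale of `etr`:

```python
import math

best_neg_mse = -0.019032479638322383  # best_score_ reported above

mse = -best_neg_mse    # undo the sign flip applied by the scorer
rmse = math.sqrt(mse)  # back to the units of the target

print(f"MSE = {mse:.4f}, RMSE = {rmse:.3f}")
# prints: MSE = 0.0190, RMSE = 0.138
```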


## ElasticNet

In [32]:
# Define the parameter grid for ElasticNet
param_grid_elastic_net = {
    'alpha': [0.1, 1, 10],
    'l1_ratio': [0.1, 0.5, 0.9]
}

# Create the GridSearchCV object for ElasticNet
# No `scoring` is given, so candidates are ranked by the estimator's default score (R^2)
grid_search_en = GridSearchCV(ElasticNet(), param_grid_elastic_net, cv=5)
grid_search_en.fit(df.loc[:, var].values, df.loc[:, "etr"].values.ravel())

# Print the best parameters and best score
print("Best parameters for ElasticNet:", grid_search_en.best_params_)
print("Best score for ElasticNet:", grid_search_en.best_score_)

Best parameters for ElasticNet: {'alpha': 0.1, 'l1_ratio': 0.5}
Best score for ElasticNet: 0.02505182932311576
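
The selected `alpha` of 0.1 is the smallest value in the grid, which suggests the optimum may lie below the range that was searched. With features scaled to [0, 1], penalties of this size may shrink most coefficients heavily. A hedged sketch of a wider, log-spaced search follows; the data is synthetic, so the values it selects say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic features in [0, 1], as after min-max scaling (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.02, size=300)

param_grid = {
    "alpha": np.logspace(-4, 0, 5),  # 0.0001, 0.001, 0.01, 0.1, 1
    "l1_ratio": [0.1, 0.5, 0.9],
}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",  # comparable with the decision tree search
)
search.fit(X, y)

# Count the coefficients the penalty did not shrink to exactly zero
n_nonzero = int(np.count_nonzero(search.best_estimator_.coef_))
print("Best parameters:", search.best_params_)
print("Non-zero coefficients:", n_nonzero)
```

If the best `alpha` again lands on an edge of the grid, extending the grid further in that direction is a reasonable next step.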
