# Exercise Instructions: Panel Data Modeling with Machine Learning Models

**Objective:**
The goal of this exercise is to practice panel data modeling skills using three machine learning models (Random Forest, Single Decision Tree, and Linear Regression with Elastic Net) that have not been utilized in the project so far. Completing the entire task or a significant portion during the class will earn you an additional 7 points (above what is outlined in the syllabus) towards your final grade.

**Tasks:**

1. **GitHub Setup:**
   - If you haven't done so already, [create](https://github.com/join) a GitHub account.
   - [Download](https://desktop.github.com) and [configure](https://docs.github.com/en/desktop/configuring-and-customizing-github-desktop/configuring-basic-settings-in-github-desktop) GitHub Desktop on your laptop. A good introduction to the GitHub Desktop app is available [here](https://joshuadull.github.io/GitHub-Desktop/02-getting-started/index.html). If you prefer the git command line, follow these [instructions](https://github.com/michaelwozniak/ml2_tools?tab=readme-ov-file#git).
2. **Repository Forking:**
   - [Fork](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) the following repository to your projects: [https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates](https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates)

3. **Repository Cloning:**
   - [Clone](https://docs.github.com/en/desktop/adding-and-cloning-repositories/cloning-a-repository-from-github-to-github-desktop) the forked repository to your local computer using GitHub Desktop.

4. **Notebook Exploration:**
   - Open the file `notebooks/10.exercise.ipynb` to begin the ML tasks.

5. **Model Creation:**

   In the file `notebooks/10.exercise.ipynb`:
   - Create the following models:
      1. Random Forest ([RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html))
      2. Decision Tree ([DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html))
      3. Linear Regression with Elastic Net ([ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html))
   
   Follow a similar process to the models presented in class (e.g., KNN - `notebooks/07.knn-model.ipynb`):
      - Load the prepared training data.
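      - Perform feature engineering if deemed necessary (note: the tree-based models, Random Forest and Decision Tree, do not require data standardization; the Elastic Net penalty is scale-sensitive, so standardizing its inputs is usually advisable, as with SVM and KNN).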
      - Conduct feature selection.
      - Perform hyperparameter tuning.
      - Identify a local champion for each model class (the best model for RF, DT, Elastic Net).
      - Save local champions to a pickle file.

6. **Model Evaluation:**
   - In the notebook `notebooks/09.final-comparison-and-summary.ipynb`, load the models you created and check if they outperform the previously used models.

7. **Version Control:**
   - At the end of the class, even if the tasks are incomplete, [commit](https://docs.github.com/en/desktop/making-changes-in-a-branch/committing-and-reviewing-changes-to-your-project-in-github-desktop) your changes using GitHub Desktop.
   - [Push](https://docs.github.com/en/desktop/making-changes-in-a-branch/pushing-changes-to-github-from-github-desktop) your changes to your remote GitHub repository.

8. **Submission:**
   - Send me the link to your GitHub project (my email: *mj.wozniak9@uw.edu.pl*).

Good luck with the exercise! If you have any questions, feel free to ask.

In [2]:
import pandas as pd

# Define file paths
train_data_path = "train_fe.csv"
test_data_path = "test_fe.csv"

# Load the datasets
df_train = pd.read_csv(train_data_path, index_col=0)
df_test = pd.read_csv(test_data_path, index_col=0)

# Display the first few rows of each dataframe to understand their structure
df_train.head(), df_test.head()


(          Ticker             Nazwa2   rok         ta      txt        pi   str  \
 0  11B PW Equity  11 bit studios SA  2005  21.127613  1.24185  6.329725  0.19   
 1  11B PW Equity  11 bit studios SA  2006  21.127613  1.24185  6.329725  0.19   
 2  11B PW Equity  11 bit studios SA  2007  21.127613  1.24185  6.329725  0.19   
 3  11B PW Equity  11 bit studios SA  2008  21.127613  1.24185  6.329725  0.19   
 4  11B PW Equity  11 bit studios SA  2009  21.127613  1.24185  6.329725  0.19   
 
    xrd      ni     ppent  ...  intan_ma    ppe_ma   sale_ma  cash_holdings_ma  \
 0  0.0  5.0879  0.276275  ...  0.198598  0.013076  0.445954          0.574744   
 1  0.0  5.0879  0.276275  ...  0.198598  0.013076  0.445954          0.574744   
 2  0.0  5.0879  0.276275  ...  0.198598  0.013076  0.445954          0.574744   
 3  0.0  5.0879  0.276275  ...  0.198598  0.013076  0.445954          0.574744   
 4  0.0  5.0879  0.276275  ...  0.198598  0.013076  0.445954          0.574744   
 
    roa_past

In [6]:
# numpy is needed for the numeric dtype filter below
import numpy as np

# Keep only numeric columns and compute their correlation matrix
df_train_numeric = df_train.select_dtypes(include=[np.number])
correlation_matrix_numeric = df_train_numeric.corr()

# Find the 10 variables most correlated with 'ni' (by absolute value), excluding 'ni' itself
top_correlated_features_numeric = (
    correlation_matrix_numeric['ni']
    .drop('ni')
    .abs()
    .sort_values(ascending=False)
    .head(10)
    .index.tolist()
)

# Display the selected variables
top_correlated_features_numeric


['ni_profit_20000',
 'capex',
 'txt',
 'revenue',
 'capex_cat_(5451.0, inf]',
 'xrd',
 'pi',
 'pi_cat_(8108.5, inf]',
 'roa_ma',
 'roa_clip']
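
A column name such as `ni_profit_20000` suggests that some candidates may be derived from the target itself, which would leak information into the model. The sketch below shows one way to exclude such columns before ranking by correlation. It runs on synthetic data, and both the `ni_profit_flag` column and the prefix rule are assumptions made for illustration, not something confirmed by the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train (hypothetical values)
rng = np.random.default_rng(3)
demo = pd.DataFrame({"pi": rng.normal(size=100), "capex": rng.normal(size=100)})
demo["ni"] = 0.9 * demo["pi"] + rng.normal(scale=0.1, size=100)
demo["ni_profit_flag"] = (demo["ni"] > 0).astype(int)  # hypothetical target-derived column

target = "ni"
# Assumption: columns whose name starts with "<target>_" are derived from the target
candidates = [
    col
    for col in demo.select_dtypes(include=[np.number]).columns
    if col != target and not col.startswith(f"{target}_")
]

# Rank the remaining candidates by absolute correlation with the target
ranking = demo[candidates].corrwith(demo[target]).abs().sort_values(ascending=False)
selected = ranking.head(10).index.tolist()
```

Whether a given column really is target-derived has to be checked against the feature engineering notebooks; the name alone is only a hint.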

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Define the features (X) and target (y)
X = df_train[top_correlated_features_numeric]
y = df_train['ni']

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Creating and training the Decision Tree model
decision_tree_model = DecisionTreeRegressor(random_state = 42)
decision_tree_model.fit(X_train, y_train)

# Model evaluation on the held-out split; score() returns the coefficient of determination (R^2)
model_score = decision_tree_model.score(X_test, y_test)

model_score


0.9221004127278455
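
The split above is random, so rows from the same firm and from later years can land on both sides. For panel data, a time-based holdout is a common alternative. The following is a minimal sketch on synthetic data, assuming the `rok` column (Polish for "year", visible in the `head()` output) identifies the period; the cutoff year and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic firm-year panel standing in for df_train (hypothetical values)
rng = np.random.default_rng(42)
years = np.arange(2005, 2015)
panel = pd.DataFrame({
    "Ticker": np.repeat([f"F{i}" for i in range(20)], len(years)),
    "rok": np.tile(years, 20),
})
panel["pi"] = rng.normal(size=len(panel))
panel["ni"] = 0.8 * panel["pi"] + rng.normal(scale=0.1, size=len(panel))

# Hold out the most recent years instead of a random 20% of rows
cutoff_year = 2012  # assumed cutoff, chosen for illustration only
train_mask = panel["rok"] <= cutoff_year
X_tr, y_tr = panel.loc[train_mask, ["pi"]], panel.loc[train_mask, "ni"]
X_val, y_val = panel.loc[~train_mask, ["pi"]], panel.loc[~train_mask, "ni"]

# Fit on the earlier years, score (R^2) on the later ones
tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)
holdout_r2 = tree.score(X_val, y_val)
```

With this layout every validation row is strictly later than every training row, which is closer to how a forecasting model would be used.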

In [8]:
from sklearn.ensemble import RandomForestRegressor

# Creating and training the Random Forest model
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train, y_train)

# Model evaluation using the test set
random_forest_score = random_forest_model.score(X_test, y_test)

random_forest_score


0.9682824576389796

In [9]:
from sklearn.linear_model import ElasticNet

# Creating and training the ElasticNet model
elastic_net_model = ElasticNet(random_state=42)
elastic_net_model.fit(X_train, y_train)

# Model evaluation using the test set
elastic_net_score = elastic_net_model.score(X_test, y_test)

elastic_net_score


0.6931578443329434
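
Elastic Net penalizes coefficient magnitudes, so features on very different scales are penalized unevenly. One option is to wrap the model in a pipeline that standardizes the inputs first. This is a sketch on synthetic data rather than the notebook's actual setup; the feature scales and the `alpha`/`l1_ratio` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (hypothetical values)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=200),
    rng.normal(scale=1000.0, size=200),
])
y_demo = 3.0 * X_demo[:, 0] + 0.003 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize inside the pipeline so the scaler is fit only on the data passed to fit()
scaled_en = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42),
)
scaled_en.fit(X_demo, y_demo)
scaled_r2 = scaled_en.score(X_demo, y_demo)
```

Because the scaler lives inside the pipeline, the same object can be passed to `GridSearchCV`, and each fold is then scaled using only its own training rows.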

In [10]:
from sklearn.model_selection import GridSearchCV

# Decision Tree Regressor Grid Search
dt_param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}
dt_grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), dt_param_grid, cv=5)
dt_grid_search.fit(X_train, y_train)

# Random Forest Regressor Grid Search
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
}
rf_grid_search = GridSearchCV(RandomForestRegressor(random_state=42), rf_param_grid, cv=5)
rf_grid_search.fit(X_train, y_train)

# ElasticNet Grid Search
en_param_grid = {
    'alpha': [0.1, 1, 10],
    'l1_ratio': [0.1, 0.5, 0.9]
}
en_grid_search = GridSearchCV(ElasticNet(random_state=42), en_param_grid, cv=5)
en_grid_search.fit(X_train, y_train)

# Best parameters and scores
dt_best_params = dt_grid_search.best_params_
rf_best_params = rf_grid_search.best_params_
en_best_params = en_grid_search.best_params_

dt_best_score = dt_grid_search.best_score_
rf_best_score = rf_grid_search.best_score_
en_best_score = en_grid_search.best_score_

dt_best_params, rf_best_params, en_best_params, dt_best_score, rf_best_score, en_best_score


({'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 20},
 {'max_depth': 20,
  'min_samples_leaf': 1,
  'min_samples_split': 2,
  'n_estimators': 50},
 {'alpha': 0.1, 'l1_ratio': 0.9},
 0.938135911861351,
 0.9558107271809357,
 0.7427128384321796)
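
With `cv=5`, the folds are formed without regard to firm, so observations of one company can appear in both the training and validation parts of a fold. A possible variation is to group the folds by firm with `GroupKFold`. The sketch below uses synthetic data, and the integer firm identifiers stand in for a column such as `Ticker`; the grid values are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic panel: 30 firms observed for 8 years each (hypothetical values)
rng = np.random.default_rng(1)
n_firms, n_years = 30, 8
groups = np.repeat(np.arange(n_firms), n_years)  # stand-in for a firm identifier
X_demo = rng.normal(size=(n_firms * n_years, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0]) + rng.normal(scale=0.2, size=n_firms * n_years)

# Grid search where all rows of a firm stay in the same fold
search = GridSearchCV(
    ElasticNet(random_state=42),
    {"alpha": [0.01, 0.1, 1], "l1_ratio": [0.1, 0.5, 0.9]},
    cv=GroupKFold(n_splits=5),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo, groups=groups)
best_params = search.best_params_
```

The `scoring` argument is set explicitly here; without it, `GridSearchCV` falls back to the estimator's `score` method, which is R^2 for these regressors.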

In [11]:
from sklearn.metrics import mean_squared_error, r2_score

# Re-train models with best parameters
best_dt_model = DecisionTreeRegressor(**dt_grid_search.best_params_, random_state=42)
best_dt_model.fit(X_train, y_train)
dt_predictions = best_dt_model.predict(X_test)

best_rf_model = RandomForestRegressor(**rf_grid_search.best_params_, random_state=42)
best_rf_model.fit(X_train, y_train)
rf_predictions = best_rf_model.predict(X_test)

best_en_model = ElasticNet(**en_grid_search.best_params_, random_state=42)
best_en_model.fit(X_train, y_train)
en_predictions = best_en_model.predict(X_test)

# Calculate other metrics
dt_mse = mean_squared_error(y_test, dt_predictions)
rf_mse = mean_squared_error(y_test, rf_predictions)
en_mse = mean_squared_error(y_test, en_predictions)

dt_r2 = r2_score(y_test, dt_predictions)
rf_r2 = r2_score(y_test, rf_predictions)
en_r2 = r2_score(y_test, en_predictions)

metrics = {
    "Decision Tree": {"MSE": dt_mse, "R^2": dt_r2},
    "Random Forest": {"MSE": rf_mse, "R^2": rf_r2},
    "ElasticNet": {"MSE": en_mse, "R^2": en_r2}
}

metrics


{'Decision Tree': {'MSE': 4889090.10448625, 'R^2': 0.9367537851973622},
 'Random Forest': {'MSE': 2338120.6769855632, 'R^2': 0.9697536188921073},
 'ElasticNet': {'MSE': 12652454.29141733, 'R^2': 0.8363254051789203}}
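
MSE is expressed in squared units of the target, which makes the values above hard to read. Taking the square root gives the RMSE, which is in the same units as `ni`. The snippet below simply applies that conversion to the MSE values reported above.

```python
import math

# MSE values copied from the output above
reported_mse = {
    "Decision Tree": 4889090.10448625,
    "Random Forest": 2338120.6769855632,
    "ElasticNet": 12652454.29141733,
}

# RMSE = sqrt(MSE), in the same units as the target
rmse = {name: math.sqrt(value) for name, value in reported_mse.items()}

# Model with the lowest error on the held-out split
best_model = min(rmse, key=rmse.get)
```

The ranking is unchanged by the square root, so Random Forest remains the model with the lowest error on this split.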

In [12]:
from joblib import dump

# Save the local champions, i.e. the models retrained above with the best
# parameters found by the grid search
dump(best_dt_model, 'decision_tree_model.sav')
dump(best_rf_model, 'random_forest_model.sav')
dump(best_en_model, 'elastic_net_model.sav')

# Return the paths for confirmation
model_paths = {
    "Decision Tree Model": "decision_tree_model.sav",
    "Random Forest Model": "random_forest_model.sav",
    "Elastic Net Model": "elastic_net_model.sav"
}

model_paths


{'Decision Tree Model': 'decision_tree_model.sav',
 'Random Forest Model': 'random_forest_model.sav',
 'Elastic Net Model': 'elastic_net_model.sav'}
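
For the comparison step, the saved files can be read back with `joblib.load`. The following sketch checks the round trip on a small synthetic model written to a temporary directory; the data and the `max_depth` value are illustrative assumptions, and only the file name mirrors the one used above.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeRegressor

# Small synthetic model standing in for a local champion (hypothetical data)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] - X_demo[:, 1]
model = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "decision_tree_model.sav"
    dump(model, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
```

Files written by `joblib.dump` should be loaded with a compatible scikit-learn version, and only from sources you trust, since loading executes pickled code.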