**Random Forest**

The Python script below ports the R code to Python using pandas, scikit-learn, and numpy. It loads tab-separated data from the clipboard with pd.read_clipboard, standing in for R's read.csv, and holds it in a pandas DataFrame. The chas column, which is categorical, is converted to the pandas categorical dtype with astype('category'). The data is then split into training and test sets with scikit-learn's train_test_split: test_size=0.2 gives an 80-20 split, and random_state=1234 makes the split reproducible.
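
The loading and splitting steps can be tried without the clipboard. The following is a minimal sketch, assuming a small hand-made DataFrame; the name toy and its values are hypothetical stand-ins for the real data and are not part of the original script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the clipboard data (10 rows)
toy = pd.DataFrame({
    "rm":   [6.5, 6.4, 7.1, 6.9, 7.1, 6.4, 6.0, 6.1, 5.6, 6.0],
    "chas": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "medv": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9],
})

# Same conversion as in the script: mark 'chas' as categorical
toy["chas"] = toy["chas"].astype("category")

# 80-20 split; random_state fixes which rows land in each set
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=1234)

print(train_toy.shape, test_toy.shape)  # (8, 3) (2, 3)
```

With 10 rows, test_size=0.2 puts 2 rows in the test set and 8 in the training set; rerunning with the same random_state selects the same rows.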

The Random Forest model is then tuned over max_features, the scikit-learn counterpart of R's mtry: the number of features considered at each split. The parameter grid lists four candidate values ([2, 4, 6, 8]) and is passed to GridSearchCV, which evaluates each candidate with 5-fold cross-validation scored by negative mean squared error. Because every candidate is scored on five different held-out folds of the training data, the choice of max_features does not hinge on a single train/validation split. The number of trees (n_estimators) is set to 2000 to match the R script's ntree parameter.
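
To see what the grid search records for each candidate, here is a small sketch on synthetic data. The data from make_regression, the names X_demo, y_demo, and search, and the reduced n_estimators=50 are all assumptions chosen to keep the run fast; they are not taken from the original script.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the real features/target)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Same grid and scoring as the script, but far fewer trees for speed
search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_features": [2, 4, 6, 8]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# One mean score per candidate; scores are negated MSE, so flip the sign
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, "mean CV MSE:", -score)

print("Selected:", search.best_params_)
```

GridSearchCV maximizes its score, which is why scikit-learn exposes MSE in negated form; the candidate with the highest (least negative) mean score is the one stored in best_params_.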

Once the grid search has identified the best model, its predict method generates predictions for the test set. Two metrics evaluate the result: the Mean Absolute Percentage Error (MAPE), reported here as the complementary 1 - MAPE so that higher is better, and the Root Mean Squared Error (RMSE). MAPE expresses the average error relative to the true values, while RMSE gives the typical error magnitude in the units of the target (medv).
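
Both metrics are simple enough to check by hand. The sketch below uses three made-up values (y_true and y_pred are hypothetical, not results from the model) and compares a manual numpy calculation with the scikit-learn functions used in the script.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([20.0, 25.0, 40.0])
y_pred = np.array([22.0, 24.0, 36.0])

# MAPE: mean of |error| / |true value|  -> (0.10 + 0.04 + 0.10) / 3 = 0.08
mape_manual = np.mean(np.abs((y_true - y_pred) / y_true))

# RMSE: square root of the mean squared error -> sqrt((4 + 1 + 16) / 3) = sqrt(7)
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The scikit-learn helpers should agree with the manual versions
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print("1 - MAPE:", 1 - mape_manual)
print("RMSE:", rmse_manual)
```

Note that scikit-learn returns MAPE as a fraction rather than a percentage, so 1 - MAPE of about 0.92 here corresponds to an average relative error of 8%.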

The script's structure, from data processing through hyperparameter tuning to evaluation, mirrors the flow and intent of the R script. With the random seeds fixed, the split and the model fit are repeatable given the same input data.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

In [2]:
# Load data
data = pd.read_clipboard(sep="\t")

In [3]:
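# Convert 'chas' to a categorical variable
# (scikit-learn's random forest has no special handling for this dtype,
# so a 0/1 column is still treated as numeric by the model)
data['chas'] = data['chas'].astype('category')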

In [4]:
# Split the data into train and test sets (80/20)
# random_state alone fixes the shuffle; no global numpy seed is needed
train, test = train_test_split(data, test_size=0.2, random_state=1234)


In [5]:
# Define the feature matrix (X) and target vector (y)
X_train = train.drop(columns=['medv'])
y_train = train['medv']
X_test = test.drop(columns=['medv'])
y_test = test['medv']

In [6]:
# Define the parameter grid for 'mtry' equivalent (max_features in scikit-learn)
param_grid = {'max_features': [2, 4, 6, 8]}


In [7]:
# Initialize the Random Forest model
rf = RandomForestRegressor(n_estimators=2000, random_state=1234)

In [8]:
# Perform Grid Search with Cross Validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

In [9]:
# Best model
best_rf_model = grid_search.best_estimator_
# Print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)

# Predict on test data
predicted_rf = best_rf_model.predict(X_test)

# Evaluate the model
# scikit-learn returns MAPE as a fraction, so 1 - MAPE is a "higher is better" score
one_minus_mape = 1 - mean_absolute_percentage_error(y_test, predicted_rf)
rmse_score = np.sqrt(mean_squared_error(y_test, predicted_rf))

print("1 - MAPE:", one_minus_mape)
print("RMSE:", rmse_score)

Best hyperparameters: {'max_features': 6}
1 - MAPE: 0.8888084987045756
RMSE: 2.982834132541551
