# Random Forest Model 

**Why Random Forest Model?** 

As a data scientist, my goal is to select a model that provides the best balance of accuracy, interpretability, and efficiency, while addressing the specific requirements of the project. Here are some of the key reasons I chose the Random Forest model for this task:

Accuracy: Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. By aggregating the results of multiple trees, Random Forest typically achieves higher accuracy compared to a single decision tree. The model has the ability to capture complex patterns and relationships in the data, making it well-suited for this task.

Robustness to Overfitting: One of the main challenges when working with decision trees is their tendency to overfit the training data. Random Forest mitigates this issue by averaging the predictions of multiple trees, resulting in a more generalized model that is less susceptible to overfitting.

Handling of Missing and Categorical Data: Random Forest can handle missing data and categorical features more effectively than some other models, making it a good choice for datasets that may have missing values or a mix of categorical and numerical features, like the Yelp dataset.

Parallelizability: Random Forest can be easily parallelized, meaning that it can take advantage of multi-core processors to speed up the training process. This is particularly beneficial when working with large datasets, as it allows the model to be trained more quickly.

Feature Importance: Random Forest provides an easy way to estimate the importance of each feature in making predictions. This can be valuable for understanding which variables are the most influential in the model and may help guide further feature engineering or model selection efforts.

Model Flexibility: Random Forest can be applied to both classification and regression problems, making it a versatile choice that can be adapted to a wide range of tasks.

Given these advantages, the Random Forest model was chosen as a suitable option for predicting the star rating of a business using the Yelp dataset. However, it's important to note that other models could be explored as well, depending on the specific needs of the project and the desired trade-offs between accuracy, interpretability, and computational efficiency.

**Alternatives** 

There are several alternative models that could be considered for this task. I'll outline a few of them and explain why they were not chosen in the previous code snippet:

Logistic Regression: Logistic Regression is a simple and interpretable model that works well for binary classification tasks. However, in this case, we have a multi-class classification problem, and while logistic regression can be extended to handle multi-class problems, it may not perform as well as other models when dealing with complex relationships between features. Moreover, logistic regression assumes a linear relationship between features and the log-odds of the target class, which may not hold true for this dataset.

Support Vector Machines (SVM): SVM is a powerful classification model that works well with high-dimensional data and can capture complex decision boundaries. However, SVM can be computationally expensive, especially with large datasets, and may not scale well to the size of the Yelp dataset. Additionally, SVM is less interpretable than some other models, which may be a consideration for certain projects.

K-Nearest Neighbors (KNN): KNN is a simple and intuitive model that can work well for classification tasks. However, KNN can be sensitive to the choice of the number of neighbors (k) and the distance metric used. Additionally, KNN can be computationally expensive for large datasets, as it requires calculating the distance between each data point and all other data points in the dataset. This may be a concern when working with a large dataset like the Yelp data.

Neural Networks: Neural networks, particularly deep learning models, can achieve high accuracy on complex classification tasks. However, they can be computationally expensive to train and require a large amount of data to perform well. Moreover, they are less interpretable than other models, making it difficult to understand the relationships between features and the target variable. In this case, the added complexity of a neural network may not be necessary, given that simpler models like Random Forest can achieve good performance while being more interpretable and computationally efficient.

In the given context, the Random Forest model was chosen because it balances accuracy, interpretability, and computational efficiency while being robust to overfitting and able to handle missing and categorical data. However, depending on the specific requirements of a project, it might be worthwhile to explore alternative models, as each model has its own strengths and weaknesses. In practice, data scientists often try multiple models and compare their performance to choose the most suitable one for the task at hand.


## Imports 

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

## data import

In [4]:
# Load the preprocessed data
preprocessed_data = pd.read_csv('../data/preprocessed_data.csv')

In [5]:
# Prepare the features and target variable
features = preprocessed_data[['review_count', 'stars_review']]
target = preprocessed_data['stars_business'].apply(lambda x: int(x))

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)


In [16]:
# Create a RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)

## Hyper-parameters 


In [11]:
# Define the hyperparameter search space
param_dist = {
    'n_estimators': np.arange(100, 1001, 50),
    'max_depth': [None] + list(np.arange(10, 51, 5)),
    'min_samples_split': np.arange(2, 21, 2),
    'min_samples_leaf': np.arange(1, 21, 2),
    'max_features': ['auto', 'sqrt', 'log2']
}

In [18]:
# Create the RandomizedSearchCV instance
random_search = RandomizedSearchCV(
    rf_classifier, param_distributions=param_dist, n_iter=50, cv=5, n_jobs=-1, verbose=2, random_state=42
)

In [19]:
random_search.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END max_depth=50, max_features=log2, min_samples_leaf=13, min_samples_split=20, n_estimators=650; total time= 8.6min
[CV] END max_depth=50, max_features=log2, min_samples_leaf=13, min_samples_split=20, n_estimators=650; total time= 8.6min
[CV] END max_depth=50, max_features=log2, min_samples_leaf=13, min_samples_split=20, n_estimators=650; total time= 8.6min
[CV] END max_depth=50, max_features=log2, min_samples_leaf=13, min_samples_split=20, n_estimators=650; total time= 8.7min


In [None]:
# Print the best hyperparameters found
print("Best hyperparameters:", random_search.best_params_)

In [None]:
# Evaluate the model with the best hyperparameters on the test set
best_rf = random_search.best_estimator_
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Evaluate the model
print("Accuracy on the test set:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
