# **CS 180 Group LED - Project Implementation**

This is a mobile phone price dataset in which each column records one specification. It contains 836 entries (index 0 to 835) with the following columns:

* Unnamed: 0: An index or identifier for each entry in the dataset.
* Brand me: The brand and model name of the mobile phone.
* Ratings: The user ratings for the mobile phone on a scale of 1 to 5.
* RAM: The amount of RAM (Random Access Memory) in the mobile phone, usually measured in gigabytes (GB).
* ROM: The amount of ROM (Read-Only Memory) or storage capacity in the mobile phone, usually measured in gigabytes (GB).
* Mobile_Size: The size of the mobile phone referring to the display size in inches.
* Primary_Cam: The resolution of the primary camera in the mobile phone in megapixels.
* Selfi_Cam: The resolution of the front or selfie camera in the mobile phone in megapixels.
* Battery_Power: The battery capacity of the mobile phone in milliampere-hours (mAh).
* Price: The price of the mobile phone in the local currency.

The dataset is available on [Kaggle](https://www.kaggle.com/datasets/jsonali2003/mobile-price-prediction-dataset).

## **Dataset Importing**

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [2]:
import requests
import pandas as pd

# URL of the CSV file
url = 'https://drive.google.com/file/d/1peEM4fuLa-t9X1ynhfEcq-BFcCz7iDsw/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]

# Read the CSV file directly into a DataFrame
df = pd.read_csv(url, sep=",")

# Display the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,Brand me,Ratings,RAM,ROM,Mobile_Size,Primary_Cam,Selfi_Cam,Battery_Power,Price
0,0,"LG V30+ (Black, 128 )",4.3,4.0,128.0,6.0,48,13.0,4000,24999
1,1,I Kall K11,3.4,6.0,64.0,4.5,48,12.0,4000,15999
2,2,Nokia 105 ss,4.3,4.0,4.0,4.5,64,16.0,4000,15000
3,3,"Samsung Galaxy A50 (White, 64 )",4.4,6.0,64.0,6.4,48,15.0,3800,18999
4,4,"POCO F1 (Steel Blue, 128 )",4.5,6.0,128.0,6.18,35,15.0,3800,18999


## **Preprocess the dataset**

Some entries in the dataset contain errors that need to be corrected and verified against other sources. In this section, we remove duplicates, detect outliers, and handle null values through removal and imputation.

In [3]:
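# Remove duplicates (note: the unique 'Unnamed: 0' index column is still present here,
# so rows that differ only in that index are not treated as duplicates)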
df.drop_duplicates(inplace=True)

In [4]:
# remove unnamed column since this is not important
df.drop(columns=['Unnamed: 0'], inplace=True)

In [5]:
df.head()  # First 5 rows

Unnamed: 0,Brand me,Ratings,RAM,ROM,Mobile_Size,Primary_Cam,Selfi_Cam,Battery_Power,Price
0,"LG V30+ (Black, 128 )",4.3,4.0,128.0,6.0,48,13.0,4000,24999
1,I Kall K11,3.4,6.0,64.0,4.5,48,12.0,4000,15999
2,Nokia 105 ss,4.3,4.0,4.0,4.5,64,16.0,4000,15000
3,"Samsung Galaxy A50 (White, 64 )",4.4,6.0,64.0,6.4,48,15.0,3800,18999
4,"POCO F1 (Steel Blue, 128 )",4.5,6.0,128.0,6.18,35,15.0,3800,18999


## **Summary statistics**


In [6]:
df.describe()  # Basic stats (mean, std, etc.)

Unnamed: 0,Ratings,RAM,ROM,Mobile_Size,Primary_Cam,Selfi_Cam,Battery_Power,Price
count,805.0,829.0,832.0,834.0,836.0,567.0,836.0,836.0
mean,4.103106,6.066345,74.261298,5.597282,47.983254,9.784832,3274.688995,17614.246411
std,0.365356,2.530336,82.297798,3.898664,11.170093,6.503838,927.518852,49339.273936
min,2.8,0.0,0.0,2.0,5.0,0.0,1020.0,479.0
25%,3.8,6.0,32.0,4.5,48.0,5.0,3000.0,984.75
50%,4.1,6.0,45.5,4.77,48.0,8.0,3000.0,1697.0
75%,4.4,6.0,64.0,6.3,48.0,13.0,3800.0,18999.0
max,4.8,34.0,512.0,44.0,64.0,61.0,6000.0,573000.0


The maximum price, 573000, is far above the 75th percentile (18999) and is treated as an outlier. We remove the rows with this price.

In [7]:
# remove the outlier
df = df[df['Price'] != 573000]
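
The outlier above was identified by inspecting the summary statistics. As a rough cross-check, a common rule of thumb flags values that lie more than 1.5 times the interquartile range (IQR) above the third quartile. The sketch below applies that rule to a small, hand-picked list of prices. The helper name `iqr_outliers`, the 1.5 multiplier, and the sample values are assumptions for illustration and are not part of the original notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values above Q3 + k * IQR (upper fence only)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [v for v in values if v > upper_fence]

# Illustrative prices only; the last value mimics the 573000 entry
prices = [479, 799, 985, 1299, 1697, 9790, 15999, 18999, 24999, 573000]
print(iqr_outliers(prices))
```

On the real data the same idea could be applied to `df['Price']`, although the fence would then depend on the actual quartiles and might flag more rows than the single price removed here.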

## **Handle the missing values**

In [8]:
#count the missing values per column
print(df.isnull().sum())

Brand me           0
Ratings           31
RAM                7
ROM                4
Mobile_Size        2
Primary_Cam        0
Selfi_Cam        269
Battery_Power      0
Price              0
dtype: int64


In [9]:
# Impute the null rating values

# Calculate the mean rating (excluding missing values)
mean_rating = df['Ratings'].mean()

# Replace missing values in 'Ratings' with the calculated mean
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
df['Ratings'] = df['Ratings'].fillna(mean_rating)

# replace the null selfie cam with 0
df['Selfi_Cam'] = df['Selfi_Cam'].fillna(0)

# remove remaining null values
df.dropna(inplace=True)

print("NULL VALUES================")
print(df.isnull().sum())


print("\nTYPES==================\n")
df.dtypes

Brand me         0
Ratings          0
RAM              0
ROM              0
Mobile_Size      0
Primary_Cam      0
Selfi_Cam        0
Battery_Power    0
Price            0
dtype: int64




Brand me          object
Ratings          float64
RAM              float64
ROM              float64
Mobile_Size      float64
Primary_Cam        int64
Selfi_Cam        float64
Battery_Power      int64
Price              int64
dtype: object

## **One Hot Encoding for Brand**

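The `Brand me` column holds the full product name, so the brand is first extracted with a regular expression and then encoded using one-hot encoding. This results in one indicator column per brand in the dataset, such as `Brand_Apple` and `Brand_Nokia`. Names that do not start with a brand listed in the pattern (Samsung, for example, is not listed) are grouped under `Brand_Unknown`.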

In [10]:
import re
def extract_brand(text):
  match = re.search(r"^(Alcatel|Apple|Blacear|Blacerry|Black Shark|BlackZone|Callbar|Detel|Dublin|Easyfone|Ecotel|F-Fook|Forme|GAMMA|Gee|Gfive|Good One|Google|Grabo|GreenBerry|Heemax|Hicell|Honor|Huawei|I Kall|Infinix|InFocus|Inovu|Intex|iQOO|Itel|JIVI|Jmax|Karbonn|Kechaoda|Lava|Lenovo|LG|Mafe|Megus|Meizu|Mi|Micax|Moto|Motorola|MTR|Muphone|Mymax|Nexus|Nokia|OnePlus|OPPO|Peace|POCO|Q-Tel|Realme|Redmi)"
, text)
  return match.group(0) if match else "Unknown"  # Handle unknown brands

df['Brand'] = df['Brand me'].apply(extract_brand)
# df.drop(columns=['Brand me'], inplace=True)

df = pd.get_dummies(df, columns=['Brand'])

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 820 entries, 0 to 835
Data columns (total 65 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Brand me           820 non-null    object 
 1   Ratings            820 non-null    float64
 2   RAM                820 non-null    float64
 3   ROM                820 non-null    float64
 4   Mobile_Size        820 non-null    float64
 5   Primary_Cam        820 non-null    int64  
 6   Selfi_Cam          820 non-null    float64
 7   Battery_Power      820 non-null    int64  
 8   Price              820 non-null    int64  
 9   Brand_Alcatel      820 non-null    uint8  
 10  Brand_Apple        820 non-null    uint8  
 11  Brand_Blacear      820 non-null    uint8  
 12  Brand_Blacerry     820 non-null    uint8  
 13  Brand_Black Shark  820 non-null    uint8  
 14  Brand_BlackZone    820 non-null    uint8  
 15  Brand_Callbar      820 non-null    uint8  
 16  Brand_Detel        820 non

In [11]:
df

Unnamed: 0,Brand me,Ratings,RAM,ROM,Mobile_Size,Primary_Cam,Selfi_Cam,Battery_Power,Price,Brand_Alcatel,...,Brand_Nokia,Brand_OPPO,Brand_OnePlus,Brand_POCO,Brand_Peace,Brand_Q-Tel,Brand_Realme,Brand_Redmi,Brand_Unknown,Brand_iQOO
0,"LG V30+ (Black, 128 )",4.3,4.0,128.0,6.00,48,13.0,4000,24999,0,...,0,0,0,0,0,0,0,0,0,0
1,I Kall K11,3.4,6.0,64.0,4.50,48,12.0,4000,15999,0,...,0,0,0,0,0,0,0,0,0,0
2,Nokia 105 ss,4.3,4.0,4.0,4.50,64,16.0,4000,15000,0,...,1,0,0,0,0,0,0,0,0,0
3,"Samsung Galaxy A50 (White, 64 )",4.4,6.0,64.0,6.40,48,15.0,3800,18999,0,...,0,0,0,0,0,0,0,0,1,0
4,"POCO F1 (Steel Blue, 128 )",4.5,6.0,128.0,6.18,35,15.0,3800,18999,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
831,Karbonn K24 Plus Pro,3.8,6.0,32.0,4.54,48,12.0,2800,1299,0,...,0,0,0,0,0,0,0,0,0,0
832,InFocus POWER 2,4.1,8.0,64.0,4.54,64,0.0,2500,1390,0,...,0,0,0,0,0,0,0,0,0,0
833,"Alcatel 5V (Spectrum Blue, 32 )",4.4,3.0,32.0,6.20,48,1.0,3800,9790,1,...,0,0,0,0,0,0,0,0,0,0
834,JIVI JV 12M,3.7,10.0,32.0,4.50,64,0.0,3500,799,0,...,0,0,0,0,0,0,0,0,0,0
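
One detail of the brand extraction is worth checking: Python's `re` module tries the alternatives of a pattern from left to right and keeps the first one that matches. A short brand that is a prefix of a longer one (`Mi` before `Micax`, `Moto` before `Motorola`) would therefore win. The sketch below shows the effect on a reduced brand list and one possible remedy, sorting the alternatives longest first. The helper `build_pattern` and the product names are hypothetical and are not part of the original notebook.

```python
import re

# Reduced subset of the brand list, kept in the same relative order
BRANDS = ["Mi", "Micax", "Moto", "Motorola", "Nokia"]

def build_pattern(brands):
    """Compile a prefix pattern that tries longer brand names first."""
    ordered = sorted(brands, key=len, reverse=True)
    return re.compile(r"^(" + "|".join(re.escape(b) for b in ordered) + r")")

def extract_brand(pattern, text):
    match = pattern.search(text)
    return match.group(0) if match else "Unknown"

naive = re.compile(r"^(" + "|".join(BRANDS) + r")")  # list order as given
fixed = build_pattern(BRANDS)                        # longest alternative first

name = "Micax Canvas"  # hypothetical product name
print(extract_brand(naive, name))  # the shorter prefix "Mi" matches first
print(extract_brand(fixed, name))  # "Micax" is tried before "Mi"
```

Applying such a change to the notebook would alter the one-hot columns, so the correlations and model results reported here would need to be regenerated.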


## **Feature Selection**

In this section, we evaluate which features are significantly correlated with the target variable, `Price`. Reducing the number of features used to train the models lowers complexity and helps prevent overfitting.

In [12]:
!pip install tabulate



In [13]:
import pandas as pd
from sklearn.feature_selection import f_regression, SelectKBest

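# 1. Separate target variable
y = df['Price']
X = df.drop('Price', axis=1)

# Absolute correlation of each numeric feature with the target
# ('Brand me' is text, so it is excluded from the correlation)
correlations = X.drop(columns=['Brand me']).corrwith(y).abs()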

# 4. Sort correlations in descending order
sorted_correlations = correlations.sort_values(ascending=False)

# 5. Print sorted correlations
print("Sorted Correlations with Price:")
print(sorted_correlations.to_markdown(numalign="left", stralign="left"))

# 6. Determine correlation threshold (example: 0.1)
correlation_threshold = 0.1

# 7. Select features above threshold
selected_features = sorted_correlations[sorted_correlations > correlation_threshold]

# 8. Print selected features
print("\nSelected Features:")
print(selected_features.index.tolist())


Sorted Correlations with Price:
|                   | 0          |
|:------------------|:-----------|
| ROM               | 0.59395    |
| Ratings           | 0.485609   |
| Brand_Apple       | 0.478219   |
| Battery_Power     | 0.274435   |
| RAM               | 0.189521   |
| Primary_Cam       | 0.165387   |
| Selfi_Cam         | 0.162467   |
| Brand_Unknown     | 0.139483   |
| Brand_OnePlus     | 0.127623   |
| Brand_I Kall      | 0.123887   |
| Brand_Kechaoda    | 0.113485   |
| Brand_Karbonn     | 0.0979535  |
| Brand_Lava        | 0.0964794  |
| Brand_Mi          | 0.0775185  |
| Brand_Nokia       | 0.0767297  |
| Brand_iQOO        | 0.0738297  |
| Brand_Itel        | 0.0709285  |
| Mobile_Size       | 0.0686533  |
| Brand_Blacear     | 0.065324   |
| Brand_MTR         | 0.0645543  |
| Brand_InFocus     | 0.0594355  |
| Brand_Redmi       | 0.0593715  |
| Brand_Easyfone    | 0.0585569  |
| Brand_LG          | 0.0495379  |
| Brand_JIVI        | 0.0480934  |
| Brand_OPPO        | 0



In [14]:
# remove the Mobile_Size based on the feature selection
X = X.drop(columns=['Mobile_Size'])

# drop brand me
X = X.drop(columns=['Brand me'])
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 820 entries, 0 to 835
Data columns (total 62 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Ratings            820 non-null    float64
 1   RAM                820 non-null    float64
 2   ROM                820 non-null    float64
 3   Primary_Cam        820 non-null    int64  
 4   Selfi_Cam          820 non-null    float64
 5   Battery_Power      820 non-null    int64  
 6   Brand_Alcatel      820 non-null    uint8  
 7   Brand_Apple        820 non-null    uint8  
 8   Brand_Blacear      820 non-null    uint8  
 9   Brand_Blacerry     820 non-null    uint8  
 10  Brand_Black Shark  820 non-null    uint8  
 11  Brand_BlackZone    820 non-null    uint8  
 12  Brand_Callbar      820 non-null    uint8  
 13  Brand_Detel        820 non-null    uint8  
 14  Brand_Dublin       820 non-null    uint8  
 15  Brand_Easyfone     820 non-null    uint8  
 16  Brand_Ecotel       820 non

## **Splitting and Normalizing the features (min/max scaling)**

In this section, we normalize the dataset so that all features are on a similar scale. This matters because the features have different units of measurement (e.g., ratings, storage capacity, battery power) and vastly different ranges. Normalization prevents features with larger ranges from dominating the learning process of scale-sensitive models and puts the features on a comparable footing when predicting mobile phone prices. Specifically, we use Min-Max scaling, fitted on the training set only, to rescale the numerical features to a range between 0 and 1 while preserving the relative spacing between values.

We also split the data into:

*   Training Set - 80%
*   Validation Set - 10%
*   Test Set - 10%
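
As a small worked example of the scaling step, the sketch below applies the Min-Max formula x' = (x - min) / (max - min) by hand, learning the minimum and maximum from the training values only. The function names and the battery capacities are illustrative assumptions, not values from the actual split.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training values only."""
    return min(train_values), max(train_values)

def transform_min_max(values, lo, hi):
    """Apply x' = (x - min) / (max - min) using the training min and max."""
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative battery capacities in mAh (hypothetical split)
train = [1020, 3000, 6000]
val = [2500, 7000]

lo, hi = fit_min_max(train)
train_scaled = transform_min_max(train, lo, hi)
val_scaled = transform_min_max(val, lo, hi)

print(train_scaled)  # training values span exactly 0 to 1
print(val_scaled)    # 7000 exceeds the training maximum, so it maps above 1
```

Because the scaler is not refitted on the validation and test sets, values outside the training range can fall slightly below 0 or above 1. That is expected and avoids leaking information from held-out data.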

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# 2. Split into training and temporary set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Split temporary set into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# 4. Feature Scaling (Min-Max) on training set ONLY
scaler = MinMaxScaler()

# Select numeric columns to scale (excluding boolean/categorical columns)
columns_to_scale = X_train.select_dtypes(include=['float64', 'int64']).columns
X_train[columns_to_scale] = scaler.fit_transform(X_train[columns_to_scale])

# 5. Apply the same scaler to validation and test sets (without refitting)
X_val[columns_to_scale] = scaler.transform(X_val[columns_to_scale])
X_test[columns_to_scale] = scaler.transform(X_test[columns_to_scale])



In [16]:
# Display the shapes of each set
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (656, 62)
y_train shape: (656,)
X_val shape: (82, 62)
y_val shape: (82,)
X_test shape: (82, 62)
y_test shape: (82,)


In [17]:
y_train

183      595
253      998
465      798
366     1110
513     1570
       ...  
73     42998
111    77999
279     9999
446      927
107      649
Name: Price, Length: 656, dtype: int64

## **Model Summary**

In [18]:
def clean_summary():
  empty = pd.DataFrame(columns=['Info','Random Forest (Random)','Random Forest (Grid)',
                                'Neural Networks (Random)', 'Neural Networks (Grid)',
                                'Gradient Boosting (Random)', 'Gradient Boosting (Grid)'])
  # Define the rows to add
  rows = [
      ['Best Params', '', '', '', '', '', ''],
      ['Train MSE', '', '', '', '', '', ''],
      ['Train MAE', '', '', '', '', '', ''],
      ['Train R2', '', '', '', '', '', ''],
      ['Val MSE', '', '', '', '', '', ''],
      ['Val MAE', '', '', '', '', '', ''],
      ['Val R2', '', '', '', '', '', ''],
      ['Test MSE', '', '', '', '', '', ''],
      ['Test MAE', '', '', '', '', '', ''],
      ['Test R2', '', '', '', '', '', '']
  ]

  # Add the rows to the DataFrame
  for row in rows:
    empty.loc[len(empty)] = row

  return empty

summary = clean_summary()
pd.set_option('display.max_colwidth', None)  # Display entire content of columns
summary.head(10)


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,,,,,,
1,Train MSE,,,,,,
2,Train MAE,,,,,,
3,Train R2,,,,,,
4,Val MSE,,,,,,
5,Val MAE,,,,,,
6,Val R2,,,,,,
7,Test MSE,,,,,,
8,Test MAE,,,,,,
9,Test R2,,,,,,


In [19]:
# Metrics Calculation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return mse, mae, r2
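
For reference, the three metrics returned by `evaluate_model` can be written out directly. The following is a plain-Python sketch of the standard definitions of MSE, MAE, and R-squared. The function names and sample values are illustrative and are not used elsewhere in the notebook.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative values only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

MSE is expressed in squared price units, which is why it is several orders of magnitude larger than MAE in the results; MAE is in the same units as `Price`.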

In [20]:
def predict_and_evaluate(modelstr, model, X_train, y_train, X_val, y_val):
    # Predictions on training set
    y_train_pred = model.predict(X_train)
    # Predictions on validation set
    y_val_pred = model.predict(X_val)

    # Evaluation metrics on training set
    mse_train, mae_train, r2_train = evaluate_model(y_train, y_train_pred)
    # Evaluation metrics on validation set
    mse_val, mae_val, r2_val = evaluate_model(y_val, y_val_pred)

    # Print metrics
    print(modelstr,"- Training Metrics:")
    print("MSE:", mse_train)
    print("MAE:", mae_train)
    print("R-squared:", r2_train)

    print("\n",modelstr,"- Validation Metrics:")
    print("MSE:", mse_val)
    print("MAE:", mae_val)
    print("R-squared:", r2_val)

    return {
        'mse_train': mse_train,
        'mae_train': mae_train,
        'r2_train': r2_train,
        'mse_val': mse_val,
        'mae_val': mae_val,
        'r2_val': r2_val,
    }

In [21]:
def test(modelstr, model, X_test, y_test):
    # Predictions on test set
    y_test_pred = model.predict(X_test)

    # Evaluation metrics on test set
    mse_test, mae_test, r2_test = evaluate_model(y_test, y_test_pred)

    # Print metrics

    print("\n",modelstr,"- Test Metrics:")
    print("MSE:", mse_test)
    print("MAE:", mae_test)
    print("R-squared:", r2_test)

    return {
        'mse_test': mse_test,
        'mae_test': mae_test,
        'r2_test': r2_test
    }

# **Model 1: Random Forest Regression**

## **Random Search**

In [22]:
y_train

183      595
253      998
465      798
366     1110
513     1570
       ...  
73     42998
111    77999
279     9999
446      927
107      649
Name: Price, Length: 656, dtype: int64

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import randint as sp_randint

#regularization parameters
param_distributions = {
    'n_estimators': sp_randint(50, 200),
    'max_depth': sp_randint(3, 10),
    'min_samples_split': sp_randint(2, 10),
    'min_samples_leaf': sp_randint(1, 5),
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_distributions=param_distributions,
    n_iter=20,  # Number of parameter settings sampled
    scoring='neg_mean_squared_error',  # Use MSE as the scoring metric
    cv=5,       # 5-fold cross-validation
    random_state=1,
    n_jobs=-1  # Use all available cores for parallel processing
)

random_search.fit(X_train, y_train)
best_rf_model_random = random_search.best_estimator_

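# best_estimator_ has already been refit on the full training set (refit=True is the default),
# so this extra fit is redundant but harmless
best_rf_model_random.fit(X_train, y_train)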

### **Evaluation**

In [24]:
evaluation_metrics = predict_and_evaluate("Random Forest (Random)",best_rf_model_random, X_train, y_train, X_val, y_val)
print(evaluation_metrics)

Random Forest (Random) - Training Metrics:
MSE: 331384248.4671575
MAE: 4695.463258451932
R-squared: 0.685413500279692

 Random Forest (Random) - Validation Metrics:
MSE: 181821091.90063015
MAE: 5898.2151083813915
R-squared: 0.6534883321121633
{'mse_train': 331384248.4671575, 'mae_train': 4695.463258451932, 'r2_train': 0.685413500279692, 'mse_val': 181821091.90063015, 'mae_val': 5898.2151083813915, 'r2_val': 0.6534883321121633}


### **Summary**

In [25]:
model = 'Random Forest (Random)'

# Use .loc instead of chained indexing so the assignment always writes to `summary`
# (.loc label slices are inclusive)
summary.loc[0, model] = str(best_rf_model_random)
summary.loc[1:3, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_train', 'mae_train', 'r2_train')]
summary.loc[4:6, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_val', 'mae_val', 'r2_val')]

# Format numeric values to two decimal places
summary[model] = summary[model].apply(lambda x: '{:.2f}'.format(x) if isinstance(x, (int, float)) else x)

# Print the updated DataFrame
summary

Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)",,,,,
1,Train MSE,331384248.47,,,,,
2,Train MAE,4695.46,,,,,
3,Train R2,0.69,,,,,
4,Val MSE,181821091.90,,,,,
5,Val MAE,5898.22,,,,,
6,Val R2,0.65,,,,,
7,Test MSE,,,,,,
8,Test MAE,,,,,,
9,Test R2,,,,,,


In [26]:
evaluation_metrics = test("Random Forest (Random)",best_rf_model_random, X_test, y_test)
print(evaluation_metrics)

# .loc label slices are inclusive, so 7:9 covers the three test rows
summary.loc[7:9, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_test', 'mae_test', 'r2_test')]

summary


 Random Forest (Random) - Test Metrics:
MSE: 51688624.439550094
MAE: 4674.959158255656
R-squared: 0.92518137553859
{'mse_test': 51688624.439550094, 'mae_test': 4674.959158255656, 'r2_test': 0.92518137553859}


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)",,,,,
1,Train MSE,331384248.47,,,,,
2,Train MAE,4695.46,,,,,
3,Train R2,0.69,,,,,
4,Val MSE,181821091.90,,,,,
5,Val MAE,5898.22,,,,,
6,Val R2,0.65,,,,,
7,Test MSE,51688624.44,,,,,
8,Test MAE,4674.96,,,,,
9,Test R2,0.93,,,,,


## **Grid Search**

In [27]:
from sklearn.model_selection import GridSearchCV
# 5. Hyperparameter Tuning (Grid Search)
param_grid = {
    'n_estimators': [50, 75, 100, 150],           # Finer grid for n_estimators
    'max_depth': [None, 5, 7, 10],                # Include None for unlimited depth
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # Use MSE as the scoring metric
    cv=5,       # 5-fold cross-validation
    n_jobs=-1  # Use all available cores for parallel processing
)

grid_search.fit(X_train, y_train)
best_rf_model_grid = grid_search.best_estimator_

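# 6. best_estimator_ has already been refit on the full training set (refit=True is the default),
# so this extra fit is redundant but harmless
best_rf_model_grid.fit(X_train, y_train)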


### **Evaluation**

In [28]:
# Usage
evaluation_metrics = predict_and_evaluate("Random Forest (Grid)",best_rf_model_grid, X_train, y_train, X_val, y_val)

Random Forest (Grid) - Training Metrics:
MSE: 285582687.7852591
MAE: 3636.148509236694
R-squared: 0.7288933962714101

 Random Forest (Grid) - Validation Metrics:
MSE: 180536044.43347248
MAE: 5196.25785300536
R-squared: 0.655937354590827


### **Summary**

In [29]:
model = 'Random Forest (Grid)'

# Use .loc instead of chained indexing so the assignment always writes to `summary`
# (.loc label slices are inclusive)
summary.loc[0, model] = str(best_rf_model_grid)
summary.loc[1:3, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_train', 'mae_train', 'r2_train')]
summary.loc[4:6, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_val', 'mae_val', 'r2_val')]

# Format numeric values to two decimal places
summary[model] = summary[model].apply(lambda x: '{:.2f}'.format(x) if isinstance(x, (int, float)) else x)

# Print the updated DataFrame
summary

Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)",,,,
1,Train MSE,331384248.47,285582687.79,,,,
2,Train MAE,4695.46,3636.15,,,,
3,Train R2,0.69,0.73,,,,
4,Val MSE,181821091.90,180536044.43,,,,
5,Val MAE,5898.22,5196.26,,,,
6,Val R2,0.65,0.66,,,,
7,Test MSE,51688624.44,,,,,
8,Test MAE,4674.96,,,,,
9,Test R2,0.93,,,,,


In [30]:
evaluation_metrics = test("Random Forest (Grid)",best_rf_model_grid, X_test, y_test)
print(evaluation_metrics)

# .loc label slices are inclusive, so 7:9 covers the three test rows
summary.loc[7:9, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_test', 'mae_test', 'r2_test')]

summary


 Random Forest (Grid) - Test Metrics:
MSE: 51621137.38395479
MAE: 3732.6801707600825
R-squared: 0.9252790621906017
{'mse_test': 51621137.38395479, 'mae_test': 3732.6801707600825, 'r2_test': 0.9252790621906017}


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)",,,,
1,Train MSE,331384248.47,285582687.79,,,,
2,Train MAE,4695.46,3636.15,,,,
3,Train R2,0.69,0.73,,,,
4,Val MSE,181821091.90,180536044.43,,,,
5,Val MAE,5898.22,5196.26,,,,
6,Val R2,0.65,0.66,,,,
7,Test MSE,51688624.44,51621137.38,,,,
8,Test MAE,4674.96,3732.68,,,,
9,Test R2,0.93,0.93,,,,


# **Model 2: Neural Networks**

In [31]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## **Random Search**

In [32]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define the hyperparameter distributions
param_distributions = {
    'hidden_layer_sizes': randint(100, 501),  # Random integers between 100 and 500
    'alpha': uniform(1e-6, 1e-3),  # uniform(loc, scale): random floats in [0.000001, 0.001001]
    'learning_rate_init': uniform(1e-4, 1e-2),  # uniform(loc, scale): random floats in [0.0001, 0.0101]
    'solver': ['adam'],  # Categorical choice
}

# Create the model
mlp = MLPRegressor(max_iter=500, random_state=3)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=mlp, param_distributions=param_distributions, n_iter=20, cv=10, scoring='neg_mean_squared_error', random_state=427, n_jobs=-1, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Create the final model with the best hyperparameters
best_mlp_model_random = MLPRegressor(**best_params, max_iter=500, random_state=1)
best_mlp_model_random.fit(X_train, y_train)

print("Best hyperparameters:", best_params)

Fitting 10 folds for each of 20 candidates, totalling 200 fits




Best hyperparameters: {'alpha': 0.0004293927950001161, 'hidden_layer_sizes': 408, 'learning_rate_init': 0.009439794788935859, 'solver': 'adam'}
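
The sampled values above are consistent with how `scipy.stats.uniform(loc, scale)` is parameterized: it draws from the interval [loc, loc + scale], not from [loc, scale]. The sketch below mirrors that convention with the standard library's `random.uniform` as a stand-in (it does not call SciPy), so the ranges used in the search can be checked quickly. The helper name `sample_uniform` and the seed are assumptions for illustration.

```python
import random

def sample_uniform(loc, scale, rng):
    """Draw one value from [loc, loc + scale], mirroring the (loc, scale) convention."""
    return rng.uniform(loc, loc + scale)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Same (loc, scale) pairs as in the random search above
alpha_samples = [sample_uniform(1e-6, 1e-3, rng) for _ in range(1000)]
lr_samples = [sample_uniform(1e-4, 1e-2, rng) for _ in range(1000)]

print(min(alpha_samples), max(alpha_samples))  # stays within about 0.000001 to 0.001001
print(min(lr_samples), max(lr_samples))        # stays within about 0.0001 to 0.0101
```

Both reported best values (`alpha` of about 0.00043 and `learning_rate_init` of about 0.0094) fall inside these intervals.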




### **Evaluation**

In [33]:
evaluation_metrics = predict_and_evaluate("Neural Networks (Random)", best_mlp_model_random, X_train, y_train, X_val, y_val)
print(evaluation_metrics)

Neural Networks (Random) - Training Metrics:
MSE: 588497900.297037
MAE: 7534.423869944959
R-squared: 0.4413328472806284

 Neural Networks (Random) - Validation Metrics:
MSE: 165005387.06997937
MAE: 6294.158788107245
R-squared: 0.6855354277855457
{'mse_train': 588497900.297037, 'mae_train': 7534.423869944959, 'r2_train': 0.4413328472806284, 'mse_val': 165005387.06997937, 'mae_val': 6294.158788107245, 'r2_val': 0.6855354277855457}


### **Summary**

In [34]:
model = 'Neural Networks (Random)'

# Use .loc instead of chained indexing so the assignment always writes to `summary`
# (.loc label slices are inclusive)
summary.loc[0, model] = str(best_mlp_model_random)
summary.loc[1:3, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_train', 'mae_train', 'r2_train')]
summary.loc[4:6, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_val', 'mae_val', 'r2_val')]
# summary[model][7:] = ['{:.2f}'.format(evaluation_metrics['mse_test']), '{:.2f}'.format(evaluation_metrics['mae_test']), '{:.2f}'.format(evaluation_metrics['r2_test'])]

# Format numeric values to two decimal places
summary[model] = summary[model].apply(lambda x: '{:.2f}'.format(x) if isinstance(x, (int, float)) else x)

# Print the updated DataFrame
summary

Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)",,,
1,Train MSE,331384248.47,285582687.79,588497900.30,,,
2,Train MAE,4695.46,3636.15,7534.42,,,
3,Train R2,0.69,0.73,0.44,,,
4,Val MSE,181821091.90,180536044.43,165005387.07,,,
5,Val MAE,5898.22,5196.26,6294.16,,,
6,Val R2,0.65,0.66,0.69,,,
7,Test MSE,51688624.44,51621137.38,,,,
8,Test MAE,4674.96,3732.68,,,,
9,Test R2,0.93,0.93,,,,


In [35]:
evaluation_metrics = test("Neural Networks (Random)",best_mlp_model_random, X_test, y_test)
print(evaluation_metrics)

# .loc label slices are inclusive, so 7:9 covers the three test rows
summary.loc[7:9, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_test', 'mae_test', 'r2_test')]

summary


 Neural Networks (Random) - Test Metrics:
MSE: 170468374.83748007
MAE: 7513.637337238976
R-squared: 0.7532492021638466
{'mse_test': 170468374.83748007, 'mae_test': 7513.637337238976, 'r2_test': 0.7532492021638466}


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)",,,
1,Train MSE,331384248.47,285582687.79,588497900.30,,,
2,Train MAE,4695.46,3636.15,7534.42,,,
3,Train R2,0.69,0.73,0.44,,,
4,Val MSE,181821091.90,180536044.43,165005387.07,,,
5,Val MAE,5898.22,5196.26,6294.16,,,
6,Val R2,0.65,0.66,0.69,,,
7,Test MSE,51688624.44,51621137.38,170468374.84,,,
8,Test MAE,4674.96,3732.68,7513.64,,,
9,Test R2,0.93,0.93,0.75,,,


## **Grid Search**

In [36]:
# Define the hyperparameter grid
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 25), (100, 50)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01, 0.1],
    'solver': ['adam', 'sgd']
}

# Create the GridSearchCV object
mlp = MLPRegressor(max_iter=500, random_state=427)
grid_search = GridSearchCV(mlp, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
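
The fit count reported above follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below reproduces that arithmetic with `itertools.product`. It only enumerates the combinations and does not train any model; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 25), (100, 50)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01, 0.1],
    'solver': ['adam', 'sgd'],
}
cv_folds = 5

# One candidate per combination of values: 4 * 3 * 3 * 2
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

By comparison, the random search for this model evaluated 20 sampled candidates with 10 folds each, for 200 fits.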




### **Evaluation**

In [37]:
# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best model
best_mlp_model_grid = grid_search.best_estimator_

In [38]:
evaluation_metrics = predict_and_evaluate("Neural Networks (Grid)",best_mlp_model_grid, X_train, y_train, X_val, y_val)
print(evaluation_metrics)

Neural Networks (Grid) - Training Metrics:
MSE: 541206659.2108954
MAE: 6523.047352871206
R-squared: 0.48622691231097914

 Neural Networks (Grid) - Validation Metrics:
MSE: 188261424.31384292
MAE: 5328.95442860486
R-squared: 0.6412144517667845
{'mse_train': 541206659.2108954, 'mae_train': 6523.047352871206, 'r2_train': 0.48622691231097914, 'mse_val': 188261424.31384292, 'mae_val': 5328.95442860486, 'r2_val': 0.6412144517667845}


### **Summary**

In [39]:
model = 'Neural Networks (Grid)'

# Use .loc instead of chained indexing so the assignment always writes to `summary`
# (.loc label slices are inclusive)
summary.loc[0, model] = str(best_mlp_model_grid)
summary.loc[1:3, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_train', 'mae_train', 'r2_train')]
summary.loc[4:6, model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_val', 'mae_val', 'r2_val')]
# summary[model][7:] = ['{:.2f}'.format(evaluation_metrics['mse_test']), '{:.2f}'.format(evaluation_metrics['mae_test']), '{:.2f}'.format(evaluation_metrics['r2_test'])]

# Format numeric values to two decimal places
summary[model] = summary[model].apply(lambda x: '{:.2f}'.format(x) if isinstance(x, (int, float)) else x)

# Print the updated DataFrame
summary

Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)","MLPRegressor(alpha=0.001, hidden_layer_sizes=(50, 25), learning_rate_init=0.1,\n max_iter=500, random_state=427)",,
1,Train MSE,331384248.47,285582687.79,588497900.30,541206659.21,,
2,Train MAE,4695.46,3636.15,7534.42,6523.05,,
3,Train R2,0.69,0.73,0.44,0.49,,
4,Val MSE,181821091.90,180536044.43,165005387.07,188261424.31,,
5,Val MAE,5898.22,5196.26,6294.16,5328.95,,
6,Val R2,0.65,0.66,0.69,0.64,,
7,Test MSE,51688624.44,51621137.38,170468374.84,,,
8,Test MAE,4674.96,3732.68,7513.64,,,
9,Test R2,0.93,0.93,0.75,,,


In [40]:
evaluation_metrics = test("Neural Networks (Grid)",best_mlp_model_grid, X_test, y_test)
print(evaluation_metrics)

# Fill in the test rows (7 onward) through .loc instead of chained indexing
summary.loc[summary.index[7:], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_test', 'mae_test', 'r2_test')]

summary


 Neural Networks (Grid) - Test Metrics:
MSE: 198377441.1690148
MAE: 5978.9504579905915
R-squared: 0.7128511846973586
{'mse_test': 198377441.1690148, 'mae_test': 5978.9504579905915, 'r2_test': 0.7128511846973586}


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)","MLPRegressor(alpha=0.001, hidden_layer_sizes=(50, 25), learning_rate_init=0.1,\n max_iter=500, random_state=427)",,
1,Train MSE,331384248.47,285582687.79,588497900.30,541206659.21,,
2,Train MAE,4695.46,3636.15,7534.42,6523.05,,
3,Train R2,0.69,0.73,0.44,0.49,,
4,Val MSE,181821091.90,180536044.43,165005387.07,188261424.31,,
5,Val MAE,5898.22,5196.26,6294.16,5328.95,,
6,Val R2,0.65,0.66,0.69,0.64,,
7,Test MSE,51688624.44,51621137.38,170468374.84,198377441.17,,
8,Test MAE,4674.96,3732.68,7513.64,5978.95,,
9,Test R2,0.93,0.93,0.75,0.71,,


# **Model 3: Gradient Boosting**

## **Random Search**

In [41]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define the hyperparameter distributions
param_distributions = {
    'max_depth': randint(3, 10),  # Random integers between 3 and 9 (inclusive); the upper bound 10 is excluded
    'n_estimators': randint(100, 301),  # Random integers between 100 and 300 (inclusive)
    'learning_rate': uniform(0.0001, 0.01 - 0.0001)  # Random floats between 0.0001 and 0.01
}

# Create the model
gb_regressor = GradientBoostingRegressor(random_state=1)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=gb_regressor, param_distributions=param_distributions, n_iter=20, cv=5, scoring='neg_mean_squared_error', random_state=427, n_jobs=-1, verbose=1)

# Fit the RandomizedSearchCV object to the training data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Create the final model with the best hyperparameters
best_gb_model_random = GradientBoostingRegressor(**best_params, random_state=42)
best_gb_model_random.fit(X_train, y_train)

print("Best hyperparameters:", best_params)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best hyperparameters: {'learning_rate': 0.00640120499556805, 'max_depth': 9, 'n_estimators': 250}
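
The ranges searched here depend on how `scipy.stats` parameterises its distributions: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers `[loc, loc + scale]`. The sketch below draws samples from the same two distributions to check the realised ranges. The sample size and seed are arbitrary choices for illustration.

```python
from scipy.stats import randint, uniform

# Same distributions as in the random search above
max_depth_dist = randint(3, 10)                      # integers 3..9 (upper bound excluded)
learning_rate_dist = uniform(0.0001, 0.01 - 0.0001)  # uniform(loc, scale): floats in [0.0001, 0.01]

# Draw samples to inspect the realised ranges
depth_samples = max_depth_dist.rvs(size=1000, random_state=0)
rate_samples = learning_rate_dist.rvs(size=1000, random_state=0)

print('max_depth range:', depth_samples.min(), 'to', depth_samples.max())
print('learning_rate range:', rate_samples.min(), 'to', rate_samples.max())
```

This is consistent with the best `max_depth` of 9 reported above, which lies at the top of the sampled range.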


### **Evaluation**

In [42]:
evaluation_metrics = predict_and_evaluate("Gradient Boosting (Random)",best_gb_model_random, X_train, y_train, X_val, y_val)
print(evaluation_metrics)

Gradient Boosting (Random) - Training Metrics:
MSE: 292303023.2113613
MAE: 4535.870668613658
R-squared: 0.722513712238681

 Gradient Boosting (Random) - Validation Metrics:
MSE: 192269244.5907642
MAE: 6155.996843367584
R-squared: 0.6335764133289244
{'mse_train': 292303023.2113613, 'mae_train': 4535.870668613658, 'r2_train': 0.722513712238681, 'mse_val': 192269244.5907642, 'mae_val': 6155.996843367584, 'r2_val': 0.6335764133289244}


### **Summary**

In [43]:
model = 'Gradient Boosting (Random)'

# Write through .loc rather than chained indexing (summary[model][...]),
# which may assign to a temporary copy instead of the DataFrame
summary.loc[summary.index[0], model] = str(best_gb_model_random)
summary.loc[summary.index[1:4], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_train', 'mae_train', 'r2_train')]
summary.loc[summary.index[4:7], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_val', 'mae_val', 'r2_val')]
# summary[model][7:] = ['{:.2f}'.format(evaluation_metrics['mse_test']), '{:.2f}'.format(evaluation_metrics['mae_test']), '{:.2f}'.format(evaluation_metrics['r2_test'])]

# Format numeric values to two decimal places
summary[model] = summary[model].apply(lambda x: '{:.2f}'.format(x) if isinstance(x, (int, float)) else x)

# Print the updated DataFrame
summary

Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)","MLPRegressor(alpha=0.001, hidden_layer_sizes=(50, 25), learning_rate_init=0.1,\n max_iter=500, random_state=427)","GradientBoostingRegressor(learning_rate=0.00640120499556805, max_depth=9,\n n_estimators=250, random_state=42)",
1,Train MSE,331384248.47,285582687.79,588497900.30,541206659.21,292303023.21,
2,Train MAE,4695.46,3636.15,7534.42,6523.05,4535.87,
3,Train R2,0.69,0.73,0.44,0.49,0.72,
4,Val MSE,181821091.90,180536044.43,165005387.07,188261424.31,192269244.59,
5,Val MAE,5898.22,5196.26,6294.16,5328.95,6156.00,
6,Val R2,0.65,0.66,0.69,0.64,0.63,
7,Test MSE,51688624.44,51621137.38,170468374.84,198377441.17,,
8,Test MAE,4674.96,3732.68,7513.64,5978.95,,
9,Test R2,0.93,0.93,0.75,0.71,,


In [44]:
evaluation_metrics = test("Gradient Boosting (Random)",best_gb_model_random, X_test, y_test)
print(evaluation_metrics)

# Fill in the test rows (7 onward) through .loc instead of chained indexing
summary.loc[summary.index[7:], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_test', 'mae_test', 'r2_test')]

summary


 Gradient Boosting (Random) - Test Metrics:
MSE: 66073990.36756006
MAE: 4684.105111700044
R-squared: 0.9043587418783257
{'mse_test': 66073990.36756006, 'mae_test': 4684.105111700044, 'r2_test': 0.9043587418783257}


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)","MLPRegressor(alpha=0.001, hidden_layer_sizes=(50, 25), learning_rate_init=0.1,\n max_iter=500, random_state=427)","GradientBoostingRegressor(learning_rate=0.00640120499556805, max_depth=9,\n n_estimators=250, random_state=42)",
1,Train MSE,331384248.47,285582687.79,588497900.30,541206659.21,292303023.21,
2,Train MAE,4695.46,3636.15,7534.42,6523.05,4535.87,
3,Train R2,0.69,0.73,0.44,0.49,0.72,
4,Val MSE,181821091.90,180536044.43,165005387.07,188261424.31,192269244.59,
5,Val MAE,5898.22,5196.26,6294.16,5328.95,6156.00,
6,Val R2,0.65,0.66,0.69,0.64,0.63,
7,Test MSE,51688624.44,51621137.38,170468374.84,198377441.17,66073990.37,
8,Test MAE,4674.96,3732.68,7513.64,5978.95,4684.11,
9,Test R2,0.93,0.93,0.75,0.71,0.90,


## **Grid Search**

In [45]:
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1]
}

# Create the model
gb_regressor = GradientBoostingRegressor(random_state=1)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gb_regressor, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Create the final model with the best hyperparameters
best_gb_regressor_grid = GradientBoostingRegressor(**best_params, random_state=1)
best_gb_regressor_grid.fit(X_train, y_train)
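
Grid search cost grows with the product of the grid sizes. The sketch below counts the candidates for the grid above using only the standard library. It illustrates the counting and is not how `GridSearchCV` is implemented internally; `cv_folds` and `candidates` are illustrative names.

```python
from itertools import product

# Grid copied from the Gradient Boosting grid search above
param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
total_fits = n_candidates * cv_folds
print(f'{n_candidates} candidates x {cv_folds} folds = {total_fits} fits')
```

With the default `refit=True`, `GridSearchCV` performs one further fit of the best candidate on the full training set, which this count does not include.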

### **Evaluation**

In [46]:
evaluation_metrics = predict_and_evaluate("Gradient Boosting (Grid)",best_gb_regressor_grid, X_train, y_train, X_val, y_val)
print(evaluation_metrics)

Gradient Boosting (Grid) - Training Metrics:
MSE: 287303090.9757526
MAE: 4321.593104874118
R-squared: 0.7272601997018434

 Gradient Boosting (Grid) - Validation Metrics:
MSE: 218800683.40944466
MAE: 6215.883378573559
R-squared: 0.5830132304747069
{'mse_train': 287303090.9757526, 'mae_train': 4321.593104874118, 'r2_train': 0.7272601997018434, 'mse_val': 218800683.40944466, 'mae_val': 6215.883378573559, 'r2_val': 0.5830132304747069}


### **Summary**

In [47]:
model = 'Gradient Boosting (Grid)'

# Write through .loc rather than chained indexing (summary[model][...]),
# which may assign to a temporary copy instead of the DataFrame
summary.loc[summary.index[0], model] = str(best_gb_regressor_grid)
summary.loc[summary.index[1:4], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_train', 'mae_train', 'r2_train')]
summary.loc[summary.index[4:7], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_val', 'mae_val', 'r2_val')]
# summary[model][7:] = ['{:.2f}'.format(evaluation_metrics['mse_test']), '{:.2f}'.format(evaluation_metrics['mae_test']), '{:.2f}'.format(evaluation_metrics['r2_test'])]

# Format numeric values to two decimal places
summary[model] = summary[model].apply(lambda x: '{:.2f}'.format(x) if isinstance(x, (int, float)) else x)

# Print the updated DataFrame
summary

Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)","MLPRegressor(alpha=0.001, hidden_layer_sizes=(50, 25), learning_rate_init=0.1,\n max_iter=500, random_state=427)","GradientBoostingRegressor(learning_rate=0.00640120499556805, max_depth=9,\n n_estimators=250, random_state=42)","GradientBoostingRegressor(learning_rate=0.05, random_state=1)"
1,Train MSE,331384248.47,285582687.79,588497900.30,541206659.21,292303023.21,287303090.98
2,Train MAE,4695.46,3636.15,7534.42,6523.05,4535.87,4321.59
3,Train R2,0.69,0.73,0.44,0.49,0.72,0.73
4,Val MSE,181821091.90,180536044.43,165005387.07,188261424.31,192269244.59,218800683.41
5,Val MAE,5898.22,5196.26,6294.16,5328.95,6156.00,6215.88
6,Val R2,0.65,0.66,0.69,0.64,0.63,0.58
7,Test MSE,51688624.44,51621137.38,170468374.84,198377441.17,66073990.37,
8,Test MAE,4674.96,3732.68,7513.64,5978.95,4684.11,
9,Test R2,0.93,0.93,0.75,0.71,0.90,


In [48]:
evaluation_metrics = test("Gradient Boosting (Grid)",best_gb_regressor_grid, X_test, y_test)
print(evaluation_metrics)

# Fill in the test rows (7 onward) through .loc instead of chained indexing
summary.loc[summary.index[7:], model] = ['{:.2f}'.format(evaluation_metrics[k]) for k in ('mse_test', 'mae_test', 'r2_test')]

summary


 Gradient Boosting (Grid) - Test Metrics:
MSE: 48466577.31407807
MAE: 4239.530757838103
R-squared: 0.9298452476476183
{'mse_test': 48466577.31407807, 'mae_test': 4239.530757838103, 'r2_test': 0.9298452476476183}


Unnamed: 0,Info,Random Forest (Random),Random Forest (Grid),Neural Networks (Random),Neural Networks (Grid),Gradient Boosting (Random),Gradient Boosting (Grid)
0,Best Params,"RandomForestRegressor(max_depth=8, max_features='sqrt', min_samples_split=5,\n n_estimators=192, random_state=1)","RandomForestRegressor(max_depth=10, max_features='sqrt', n_estimators=150,\n random_state=1)","MLPRegressor(alpha=0.0004293927950001161, hidden_layer_sizes=408,\n learning_rate_init=0.009439794788935859, max_iter=500,\n random_state=1)","MLPRegressor(alpha=0.001, hidden_layer_sizes=(50, 25), learning_rate_init=0.1,\n max_iter=500, random_state=427)","GradientBoostingRegressor(learning_rate=0.00640120499556805, max_depth=9,\n n_estimators=250, random_state=42)","GradientBoostingRegressor(learning_rate=0.05, random_state=1)"
1,Train MSE,331384248.47,285582687.79,588497900.30,541206659.21,292303023.21,287303090.98
2,Train MAE,4695.46,3636.15,7534.42,6523.05,4535.87,4321.59
3,Train R2,0.69,0.73,0.44,0.49,0.72,0.73
4,Val MSE,181821091.90,180536044.43,165005387.07,188261424.31,192269244.59,218800683.41
5,Val MAE,5898.22,5196.26,6294.16,5328.95,6156.00,6215.88
6,Val R2,0.65,0.66,0.69,0.64,0.63,0.58
7,Test MSE,51688624.44,51621137.38,170468374.84,198377441.17,66073990.37,48466577.31
8,Test MAE,4674.96,3732.68,7513.64,5978.95,4684.11,4239.53
9,Test R2,0.93,0.93,0.75,0.71,0.90,0.93


# **Test Summary**

In [49]:
import pickle

# save X_train and y_train as csvs
X_train.to_csv('X_train.csv', index=False)
y_train.to_csv('y_train.csv', index=False)

# Map each summary column name to its fitted model
models = {
    'Random Forest (Random)': best_rf_model_random,
    'Random Forest (Grid)': best_rf_model_grid,
    'Neural Networks (Random)': best_mlp_model_random,
    'Neural Networks (Grid)': best_mlp_model_grid,
    'Gradient Boosting (Random)': best_gb_model_random,
    'Gradient Boosting (Grid)': best_gb_regressor_grid,
}

# Pickle each model to '<name>.pkl'
for name, fitted_model in models.items():
    with open(f'{name}.pkl', 'wb') as file:
        pickle.dump(fitted_model, file)

# **Conclusion**

Overall, the tree-based ensembles performed best on this dataset. Random Forest is recommended for use: both its Random Search and Grid Search variants achieved a test R² score of 0.93, and the Grid Search variant had the lowest test MAE (3732.68). Gradient Boosting is a strong alternative: its Grid Search variant also reached a test R² score of 0.93 with the lowest test MSE (48466577.31), while its Random Search variant reached 0.90. Neural Networks, while still viable, performed worse, with test R² scores of 0.75 (Random Search) and 0.71 (Grid Search). When selecting a model, consider the trade-off between performance and interpretability. Random Forest may be preferred if interpretability is a concern, while Gradient Boosting can provide slightly better performance in some cases. It is crucial to monitor the chosen model's performance on new data and retrain as necessary to ensure ongoing accuracy and reliability.
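
Because several models share a test R² of 0.93 after rounding, it can help to rank them on test MSE and MAE as well. The sketch below does this with the values from the final summary table. The variable names are illustrative and not part of the original notebook.

```python
# Test metrics copied from the final summary table: (MSE, MAE, R2)
test_metrics = {
    'Random Forest (Random)': (51688624.44, 4674.96, 0.93),
    'Random Forest (Grid)': (51621137.38, 3732.68, 0.93),
    'Neural Networks (Random)': (170468374.84, 7513.64, 0.75),
    'Neural Networks (Grid)': (198377441.17, 5978.95, 0.71),
    'Gradient Boosting (Random)': (66073990.37, 4684.11, 0.90),
    'Gradient Boosting (Grid)': (48466577.31, 4239.53, 0.93),
}

# Lower is better for both MSE and MAE
ranked_by_mse = sorted(test_metrics, key=lambda name: test_metrics[name][0])
ranked_by_mae = sorted(test_metrics, key=lambda name: test_metrics[name][1])

for rank, name in enumerate(ranked_by_mse, start=1):
    mse, mae, r2 = test_metrics[name]
    print(f'{rank}. {name}: MSE={mse:.2f}, MAE={mae:.2f}, R2={r2:.2f}')
```

The two rankings need not agree, so the choice of metric should follow from whether large individual errors (MSE) or typical errors (MAE) matter more for the intended use.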

# **Going Forward**

For similar projects in the future, we recommend expanding the dataset to include a larger number of phone records. This will help improve the robustness and generalizability of our models. Additionally, we suggest incorporating phone brand and model information into the analysis, as these factors can have a significant impact on pricing. Certain brands and models are known to command higher prices due to their popularity, features, or perceived value. Finally, it is crucial to consider the age of the phones in the dataset, as phone prices tend to depreciate over time. By accounting for these factors, we can develop more accurate and comprehensive models for predicting phone prices.
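
As a rough sketch of how brand information might be derived, the example below extracts a brand label from the phone names shown in the dataset preview at the top of the notebook. The `MULTI_WORD_BRANDS` list and the `extract_brand` helper are hypothetical. A real pipeline would need a fuller brand list and some manual checking.

```python
# Sample values of the "Brand me" column, copied from the dataset preview
names = [
    'LG V30+ (Black, 128 )',
    'I Kall K11',
    'Nokia 105 ss',
    'Samsung Galaxy A50 (White, 64 )',
]

# Hypothetical list of brands whose names span more than one word
MULTI_WORD_BRANDS = ('I Kall',)

def extract_brand(name):
    """Return a crude brand label: a known multi-word brand, else the first word."""
    for brand in MULTI_WORD_BRANDS:
        if name.startswith(brand + ' '):
            return brand
    return name.split()[0]

brands = [extract_brand(name) for name in names]
print(brands)
```

The resulting labels could then be one-hot encoded, for example with `pd.get_dummies`, before being passed to the models.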