# 02 - Price Regression Model

## 1. Introduction

This notebook focuses on building, evaluating, and comparing machine learning models to predict smartphone prices in USD based on technical specifications.

Key goals:

- Build multiple regression models
- Handle missing values correctly
- Use pipelines and best practices
- Compare models using MAE (Mean Absolute Error)
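For reference, the standard definition of MAE is given below. The symbols are introduced here for illustration and do not appear in the notebook: $n$ is the number of test rows, $y_i$ the true price in USD, and $\hat{y}_i$ the predicted price.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```

Because the errors are averaged in absolute value, MAE is expressed in the same unit as the target, so every score in this notebook can be read directly as an average error in USD.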


## 2. Imports & Data Loading

In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

## 3. Target Conversion (INR → USD)

Prices were originally provided in INR and converted to USD for global consistency.

In [6]:
# Fixed conversion rate applied to every row
INR_TO_USD = 0.012

df['price_usd'] = df['price_inr'] * INR_TO_USD
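As a quick illustration of the conversion, here is a minimal sketch on a toy DataFrame. The prices are made up and `toy` is a hypothetical name; only the rate and the column names come from the notebook.

```python
import pandas as pd

# Same fixed rate the notebook uses.
INR_TO_USD = 0.012

# Hypothetical prices, for illustration only.
toy = pd.DataFrame({"price_inr": [10_000, 25_000, 80_000]})
toy["price_usd"] = toy["price_inr"] * INR_TO_USD

print(toy)
```

The three rows come out at roughly 120, 300, and 960 USD. Since the conversion is a constant multiplication, an error of 1 USD corresponds to about 83 INR (1 / 0.012) at this rate.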

## 4. Feature Selection

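We keep only the numerical features and drop both price columns from the predictors: `price_usd` is the target, and `price_inr` is the same quantity in another currency, so keeping it would leak the target.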

In [7]:
X = (
    df.select_dtypes(include=['int64', 'float64'])
      .drop(columns=['price_inr', 'price_usd'])
)
y = df['price_usd']
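To show what this selection keeps and drops, here is a minimal sketch on a toy DataFrame. The `brand` column and all values are hypothetical; the other column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "brand": ["A", "B", "C"],             # text column, excluded by select_dtypes
    "ram_gb": [4, 8, 12],                 # int64, kept
    "display_inches": [6.1, 6.5, 6.7],    # float64, kept
    "price_inr": [10_000, 25_000, 80_000],
    "price_usd": [120.0, 300.0, 960.0],
})

X_toy = (
    toy.select_dtypes(include=["int64", "float64"])
       .drop(columns=["price_inr", "price_usd"])
)
y_toy = toy["price_usd"]

print(list(X_toy.columns))  # ['ram_gb', 'display_inches']
```

Text columns such as the hypothetical `brand` are silently discarded by this approach. Using them would require an explicit encoding step, which this notebook does not cover.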

## 5. Train / Test Split


We hold out 20% of the rows as a test set and fix the random seed so the split is reproducible.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## 6. Baseline Model – Linear Regression

In [9]:
lr_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LinearRegression())
])

lr_pipeline.fit(X_train, y_train)
pred_lr = lr_pipeline.predict(X_test)
mae_lr = mean_absolute_error(y_test, pred_lr)

mae_lr

105.53085259762565

## 7. Random Forest Regressor

In [10]:
rf_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('model', RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    ))
])

rf_pipeline.fit(X_train, y_train)
pred_rf = rf_pipeline.predict(X_test)
mae_rf = mean_absolute_error(y_test, pred_rf)

mae_rf

46.88047408082713

## 8. Gradient Boosting Regressor

scikit-learn's `GradientBoostingRegressor` does not accept NaN values, so the imputation step in the pipeline is required.

In [11]:
gbr_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('model', GradientBoostingRegressor(random_state=42))
])

gbr_pipeline.fit(X_train, y_train)
pred_gbr = gbr_pipeline.predict(X_test)
mae_gbr = mean_absolute_error(y_test, pred_gbr)

mae_gbr

49.70370951543606
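All three pipelines rely on the same median imputation step. As a minimal sketch of what that step does to missing entries, here it is applied to a toy matrix; the values are made up, and `X_toy` and `X_filled` are hypothetical names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with made-up values; np.nan marks missing entries.
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)

# Per-column medians learned from the non-missing values: 2.0 and 6.0
print(imputer.statistics_)
# Each NaN is replaced by the median of its own column
print(X_filled)
```

Inside a `Pipeline`, the medians are learned when `fit` is called on the training data and then reused unchanged when `predict` is called on the test data.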

## 9. Model Comparison

In [12]:
results = pd.DataFrame({
    'Model': [
        'Linear Regression',
        'Random Forest',
        'Gradient Boosting'
    ],
    'MAE (USD)': [
        mae_lr,
        mae_rf,
        mae_gbr
    ]
}).sort_values(by='MAE (USD)')

results

|   | Model | MAE (USD) |
|---|-------------------|------------|
| 1 | Random Forest | 46.880474 |
| 2 | Gradient Boosting | 49.703710 |
| 0 | Linear Regression | 105.530853 |
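To put the gap in relative terms, the sketch below computes each model's MAE reduction against the linear baseline. The MAE values are copied from the outputs above; `mae` and `reduction_pct` are hypothetical names.

```python
# MAE values (USD) as reported by the cells above.
mae = {
    "Linear Regression": 105.53085259762565,
    "Random Forest": 46.88047408082713,
    "Gradient Boosting": 49.70370951543606,
}

baseline = mae["Linear Regression"]

# Percentage reduction in MAE relative to the linear baseline.
reduction_pct = {
    name: 100 * (baseline - value) / baseline
    for name, value in mae.items()
}

for name, pct in reduction_pct.items():
    print(f"{name}: {pct:.1f}% lower MAE than the baseline")
```

This gives a reduction of about 55.6% for Random Forest and about 52.9% for Gradient Boosting, both measured on the single 80/20 split used here.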


## 10. Feature Importance (Random Forest)

In [13]:
# Impurity-based importances from the fitted forest, labelled by feature name
feature_importance = pd.Series(
    rf_pipeline.named_steps['model'].feature_importances_,
    index=X.columns
).sort_values(ascending=False)

feature_importance.head(10)

clock_speed_ghz         0.576071
rating_score            0.152484
display_inches          0.117897
battery_mah             0.034308
storage_gb              0.029142
res_width_px            0.027161
charging_watt           0.026499
front_camera_main_mp    0.011296
res_height_px           0.009786
ram_gb                  0.005568
dtype: float64

## 11. Key Insights

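- Tree-based models clearly outperform linear regression: Random Forest (MAE ≈ 46.9 USD) and Gradient Boosting (MAE ≈ 49.7 USD) roughly halve the linear baseline's error (MAE ≈ 105.5 USD).
- Performance-related features dominate the price predictions: `clock_speed_ghz` alone carries about 58% of the Random Forest's impurity-based importance, ahead of `rating_score` (about 15%) and `display_inches` (about 12%).
- Proper preprocessing (imputation + pipelines) is essential: with the imputer inside the pipeline, its medians are learned from the training data only, so nothing from the test set leaks into preprocessing.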
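The point about preprocessing can be made concrete with cross-validation. The sketch below is an illustration under stated assumptions, not part of the notebook's results: the data is synthetic, and `X_demo`, `y_demo`, `pipe`, and `mae_per_fold` are hypothetical names. Because the imputer sits inside the pipeline, `cross_val_score` refits it on the training portion of each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Blank out roughly 10% of the entries to simulate missing values.
missing = rng.random(X_demo.shape) < 0.1
X_demo[missing] = np.nan

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# scikit-learn maximizes scores, so MAE is exposed as a negative value.
scores = cross_val_score(
    pipe, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
mae_per_fold = -scores

print(mae_per_fold.mean())
```

Imputing the full dataset once before splitting would instead let held-out rows influence the medians, which is the leakage the pipeline avoids.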