## UTS Machine Learning - Regression

**Name:** Agatha Kinanthi Pramdriswara Truly Amorta

**Class:** TK-46-04

**NIM:** 1103223212


*- This notebook is part of the midterm assignment for the Machine Learning course.*  

*- The objective is to build a regression pipeline to predict continuous values from given features.*


**1. Imports**

In [None]:
# Core libraries
import os
import numpy as np
import pandas as pd
# Core libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Preprocessing and splitting
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# Evaluation metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Helper function for regression metrics
def regression_report(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}  R2: {r2:.4f}")

**2. Mount Google Drive and Load Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

datasets = '/content/drive/MyDrive/Machine-Learning-Midterm-Datasets/'

file_path = datasets + 'midterm-regresi-dataset.csv'
file_size = os.path.getsize(file_path) / (1024*1024) #Megabyte
print(f"File size: {file_size:.2f} MB")

df = pd.read_csv(file_path, nrows=20000)
print("Shape:", df.shape)
print("First 5 rows of the dataset: \n")
df.head()

**3. Exploratory Data Analysis (EDA)**

In [None]:
df.info()

In [None]:
# Summary statistics
df.describe().T

In [None]:
# Missing values
print("Missing values per column:\n", df.isnull().sum().sort_values(ascending=False).head(10))

In [None]:
# Correlations
correlation = df.corr()
plt.figure(figsize=(7,5))
sns.heatmap(correlation.iloc[:20, :20], cmap="coolwarm", center=0)
plt.title("Correlation heatmap (first 20 features)")
plt.show()

In [None]:
# Distribution
sns.histplot(df[df.columns[0]], bins=30, kde=True)
plt.title(f"Distribution of target column: {df.columns[0]}")
plt.show()

**4. Preprocessing**

In [None]:
# Define target and features and handle missing values
target_col = df.columns[0]
y = df[target_col]
X = df.drop(columns=[target_col])
X = X.fillna(X.median()) #prevent errors

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features for Linear Regression
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

In [None]:
# Training data size
print("Train shape:", X_train.shape)

In [None]:
# Testing data size
print("Test shape:", X_test.shape)

In [None]:
# Training data standardization
print("Scaled train sample:\n", X_train_s[:5])

**5. Baseline Models**

In [None]:
# Linear Regression
lr = LinearRegression()
lr.fit(X_train_s, y_train)
pred_lr = lr.predict(X_test_s)

print("===Linear Regression Performance===")
regression_report(y_test, pred_lr)

In [None]:
# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

print("===Random Forest Regressor Performance===")
regression_report(y_test, pred_rf)

**6. Basic Hyperparameter Tuning**

In [None]:
parameter_grid_small = {
    'n_estimators': [100],
    'max_depth': [None, 10]
}

grid_small = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    parameter_grid_small,
    cv=2,
    scoring='neg_mean_squared_error'
)

grid_small.fit(X_train, y_train)

print("Best parameters (small grid):", grid_small.best_params_)

best_rf_small = grid_small.best_estimator_
pred_best_rf_small = best_rf_small.predict(X_test)

print("Performance with small grid tuning:")
regression_report(y_test, pred_best_rf_small)

**7. Interpretation of Results**

- Linear Regression achieved RMSE of *9.3910*, MAE of *6.7686*, and R² of *0.2137*.
- Random Forest archived RMSE of *9.0608*,  MAE of *6.6164*, and R² of *0.2680*. The baseline improved performance, showing lower error and higher R².
- Hyperparameter tuning further optimized Random Forest, reducing RMSE and improving fit.


### **Conclusion**

Random Forest with tuned parameters is the best model for this dataset, capturing non-linear relationships better than Linear Regression.