**Machine Learning.**  
*  LinearRegression, DecisionTree, RandomForest
*  Scoring: MAE and RMSE

In [107]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

# Load the data
URL = "https://github.com/ageron/handson-ml2/blob/master/datasets/housing/housing.csv?raw=true"
df = pd.read_csv(URL)

# Split the data into training and testing sets
train_set, test_set = train_test_split(df, random_state=10, test_size=0.2)

x_train = train_set.drop('median_house_value', axis=1)
y_train = train_set['median_house_value']

# Define the numeric and categorical columns
num = list(x_train.drop('ocean_proximity', axis=1))
cat = ['ocean_proximity']

# Create a pipeline for numeric features
pipeline_num = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std', StandardScaler())])

# Create a pipeline for categorical features.
# handle_unknown='ignore' encodes any category that was not seen during
# fitting as an all-zero row instead of raising an error at transform time.
# This is a safe choice when you cannot guarantee that every category in the
# test set also appears in the training set.
pipeline_cat = Pipeline([
    ('cat', OneHotEncoder(handle_unknown='ignore'))])

# Combine numeric and categorical pipelines
full_pipeline = ColumnTransformer([
    ('num', pipeline_num, num),
    ('cat', pipeline_cat, cat)])

# Fit and transform the training data
x_prepared = full_pipeline.fit_transform(x_train)

# Create and train the Linear Regression model
LR_model = LinearRegression()
LR_model.fit(x_prepared, y_train)

# Select the first 10 rows of x_train
a = x_train.head(10)
# Get the corresponding y_train values
b = y_train.loc[a.index]

# Transform the selected rows using the already fitted pipeline
readyforp = full_pipeline.transform(a)

# Predict using the Linear Regression model
predictions = LR_model.predict(readyforp)

pd.DataFrame({'predicted':predictions, 'actual_price':b})


Unnamed: 0,predicted,actual_price
12346,183595.18251,145200.0
19326,218139.44911,117000.0
16824,314336.128565,263900.0
6869,157880.716926,163700.0
16677,234363.065039,236100.0
1811,217750.961164,129100.0
15642,255232.042376,420000.0
704,197324.446192,162500.0
18433,326506.88955,291200.0
2272,112829.607725,71900.0
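
The effect of `handle_unknown='ignore'` in the categorical pipeline can be seen in isolation. The following is a minimal sketch, not part of the original notebook: the toy frames `train` and `new` are hypothetical and only reuse the `ocean_proximity` column name for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: only two categories are present at fit time.
train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})
# "ISLAND" never appears in the frame used for fitting.
new = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.transform(new).toarray()
print(encoder.categories_[0])  # categories learned during fit
print(encoded)                 # the unseen "ISLAND" row is all zeros

# With the default handle_unknown="error", the same input raises ValueError.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True
print(f"default encoder raised ValueError: {raised}")
```

The all-zero row means the model treats an unseen category as "none of the known categories", which avoids a crash but carries no information about the new value.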


Now we apply the fitted pipeline and model to the test set:

In [112]:
x_test = test_set.drop('median_house_value', axis=1)
y_test = test_set['median_house_value']

# Only transform here: the pipeline was already fitted on the training set
xpreparedtest = full_pipeline.transform(x_test)
y_predicted = LR_model.predict(xpreparedtest)

pd.DataFrame({'predicted': y_predicted, 'actual_price': y_test})

Unnamed: 0,predicted,actual_price
20303,276345.505187,167400.0
16966,281753.138103,354100.0
10623,274363.452764,160200.0
6146,116480.088800,140800.0
2208,158926.767080,107800.0
...,...,...
3263,102510.278010,106300.0
11694,328594.916648,393700.0
1729,229754.465322,131300.0
5087,117460.910492,92300.0


In [117]:
from sklearn.ensemble import RandomForestRegressor

# No random_state is set, so results vary slightly between runs
RF_model = RandomForestRegressor()
RF_model.fit(x_prepared, y_train)

rfpredicted = RF_model.predict(xpreparedtest)

pd.DataFrame({'predicted': rfpredicted, 'actual_price': y_test})

Unnamed: 0,predicted,actual_price
20303,352442.25,167400.0
16966,319330.00,354100.0
10623,261472.00,160200.0
6146,148281.00,140800.0
2208,99755.00,107800.0
...,...,...
3263,95383.00,106300.0
11694,335232.01,393700.0
1729,147611.00,131300.0
5087,105156.00,92300.0


In [118]:
from sklearn.tree import DecisionTreeRegressor

DT_model = DecisionTreeRegressor()
DT_model.fit(x_prepared, y_train)

dtpredicted = DT_model.predict(xpreparedtest)
pd.DataFrame({'predicted': dtpredicted, 'actual_price': y_test})

Unnamed: 0,predicted,actual_price
20303,500001.0,167400.0
16966,292600.0,354100.0
10623,275000.0,160200.0
6146,141600.0,140800.0
2208,89300.0,107800.0
...,...,...
3263,88500.0,106300.0
11694,338800.0,393700.0
1729,162800.0,131300.0
5087,90100.0,92300.0


Scoring: MAE and RMSE

In [130]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Mean Absolute Error (MAE)
maeLR = mean_absolute_error(y_test, y_predicted)
maeRF = mean_absolute_error(y_test, rfpredicted)
maeDT = mean_absolute_error(y_test, dtpredicted)
print('Mean Absolute Error (MAE):')
print(f'maeLR: {maeLR}')
print(f'maeRF: {maeRF}')
print(f'maeDT: {maeDT}')

# Root Mean Squared Error (RMSE): the square root of the mean squared error
mseLR = mean_squared_error(y_test, y_predicted)
mseRF = mean_squared_error(y_test, rfpredicted)
mseDT = mean_squared_error(y_test, dtpredicted)
print('\nRoot Mean Squared Error (RMSE):')
print(f'rmseLR: {np.sqrt(mseLR)}')
print(f'rmseRF: {np.sqrt(mseRF)}')
print(f'rmseDT: {np.sqrt(mseDT)}')


Mean Absolute Error (MAE):
maeLR: 49901.48765982328
maeRF: 31502.241637596897
maeDT: 42705.664001937985

Root Mean Squared Error (RMSE):
rmseLR: 69450.8877477631
rmseRF: 49480.12942282797
rmseDT: 68020.22313599246


**Conclusion:** Random Forest (RF) outperforms both Linear Regression (LR) and Decision Tree (DT) on the test set under both metrics. Its MAE is about 31,500 (versus about 49,900 for LR and 42,700 for DT) and its RMSE is about 49,500 (versus about 69,500 for LR and 68,000 for DT).
Lower MAE and RMSE are better because they mean the predictions are, on average, closer to the actual values. RMSE penalizes large errors more heavily than MAE, which suggests why the Decision Tree beats Linear Regression clearly on MAE but only narrowly on RMSE: it makes a number of large individual errors.
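
To make the two metrics concrete, here is a small worked example computed by hand with NumPy and checked against scikit-learn. The four prices are hypothetical and chosen only to keep the arithmetic easy to follow; they are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical prices, not from the housing dataset.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 360.0])

errors = predicted - actual            # [10, -10, 30, -40]
mae = np.mean(np.abs(errors))          # (10 + 10 + 30 + 40) / 4 = 22.5
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt(2700 / 4) = sqrt(675), about 25.98

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")

# The hand-computed values agree with scikit-learn's helpers.
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(rmse, np.sqrt(mean_squared_error(actual, predicted)))
```

RMSE comes out larger than MAE here because squaring gives the single error of 40 more weight than the two errors of 10.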

**Bonus: Cross Validation**

In [131]:
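x = df.drop("median_house_value", axis=1)
y = df['median_house_value']

# Note: this refits the preprocessing on the full dataset, so the imputer and
# scaler statistics already include the rows later used as validation folds.
prepared_x = full_pipeline.fit_transform(x)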

In [161]:
from sklearn.model_selection import cross_val_score

# scoring='neg_mean_squared_error' returns negated MSE values,
# so flip the sign before taking the square root to get RMSE.

# LinearRegression
cv_scoresLR = cross_val_score(LR_model, prepared_x, y, scoring='neg_mean_squared_error', cv=5)
cv_rmse_scoresLR = np.sqrt(-cv_scoresLR)

# RandomForest
cv_scoresRF = cross_val_score(RF_model, prepared_x, y, scoring='neg_mean_squared_error', cv=5)
cv_rmse_scoresRF = np.sqrt(-cv_scoresRF)

# DecisionTree
cv_scoresDT = cross_val_score(DT_model, prepared_x, y, scoring='neg_mean_squared_error', cv=5)
cv_rmse_scoresDT = np.sqrt(-cv_scoresDT)

In [164]:
# Print the per-fold cross-validation RMSE for each model
print('Cross Validation: LinearRegression')
for i, score in enumerate(cv_rmse_scoresLR):
    print(f"Fold {i+1}: RMSE = {score}")


print('\nCross Validation: RandomForest')
for i, score in enumerate(cv_rmse_scoresRF):
    print(f"Fold {i+1}: RMSE = {score}")


print('\nCross Validation: DecisionTree')
for i, score in enumerate(cv_rmse_scoresDT):
    print(f"Fold {i+1}: RMSE = {score}")

Cross Validation: LinearRegression
Fold 1: RMSE = 73500.76739874051
Fold 2: RMSE = 75661.16140367663
Fold 3: RMSE = 75879.2416808118
Fold 4: RMSE = 77082.1315721913
Fold 5: RMSE = 66503.85978097937

Cross Validation: RandomForest
Fold 1: RMSE = 81633.97811337588
Fold 2: RMSE = 67760.73863588246
Fold 3: RMSE = 65418.88036070297
Fold 4: RMSE = 97017.00878507124
Fold 5: RMSE = 71842.98394584734

Cross Validation: DecisionTree
Fold 1: RMSE = 117715.83732832603
Fold 2: RMSE = 87929.75175590396
Fold 3: RMSE = 91874.26573130225
Fold 4: RMSE = 109158.77297723528
Fold 5: RMSE = 92156.00502915753
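
The fold scores above vary widely and are noticeably worse than the test-set scores. Two properties of the setup may contribute: `cv=5` splits a regression dataset into consecutive, unshuffled folds, and the preprocessing was fitted on all rows before splitting. A minimal sketch of an alternative follows, assuming the preprocessing and the model are wrapped in one `Pipeline` and the folds are shuffled with `KFold`. The `demo` frame, its column names, and `target` are synthetic stand-ins, not the housing data, and the numbers it prints say nothing about the models above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption, not the housing dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "rooms": rng.normal(5, 1, n),
    "income": rng.normal(4, 1.5, n),
    "region": rng.choice(["A", "B", "C"], n),
})
demo.loc[::25, "rooms"] = np.nan  # a few missing values for the imputer
target = 50_000 * demo["income"] + rng.normal(0, 10_000, n)

preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("std", StandardScaler())]), ["rooms", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# Putting preprocessing inside the pipeline means it is refitted on the
# training part of every fold, so validation rows never influence it.
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# Shuffled folds, so each fold is a random sample of the rows.
folds = KFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(model, demo, target,
                         scoring="neg_mean_squared_error", cv=folds)
rmse = np.sqrt(-scores)

print(f"RMSE per fold: {rmse.round(1)}")
print(f"mean = {rmse.mean():.1f}, std = {rmse.std():.1f}")
```

Reporting the mean and standard deviation across folds also makes the three models easier to compare than reading fifteen individual numbers.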
