## <center>Khalifa University</center>
## <center>Computer Science Department</center>
### <center>ENGR 202: Data Science and Artificial Intelligence - Spring 2024</center>
# <center>Lab 5: Supervised Machine Learning</center>
### Aim: 
    
This lab aims to prepare the dataset for machine learning algorithms, evaluate the performance of various machine learning models, and explore multiple evaluation techniques.
    
### Objectives:		

* Designing and testing supervised machine learning models
    
* Exploring the performance of one of the machine learning models: Linear Regression
    
* Evaluating performance using multiple evaluation metrics: Mean Squared Error and Mean Absolute Error
  
  
#### Risk Assessment: Low

# Introduction:

This lab is a continuation of the work completed in Lab 4. The preprocessing steps for the training data are given to you as a starting point. Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics. The objective of this lab is to design, test, and evaluate the model.

### Data Transformation (Preprocessing)

As you learned in the last lab, the housing dataset involves two data types: numerical and text. Machine learning models do not handle data in text format directly. Thus, you need to encode the text data (a step this lab calls categorization) using one of the well-known encoding methods: ordinal encoding or one-hot encoding. In the last lab, you encoded the text data using ordinal encoding; however, one-hot encoding is preferable when you deal with nominal data (where there is no inherent order among the categories).
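To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The category values are hypothetical stand-ins and are not read from `housing.csv`; the variable names are chosen only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy nominal column (hypothetical values, not read from housing.csv)
proximity = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

# Ordinal encoding: one integer per category (categories are sorted alphabetically)
ordinal_encoder = OrdinalEncoder()
ordinal_codes = ordinal_encoder.fit_transform(proximity)

# One-hot encoding: one binary column per category
one_hot_encoder = OneHotEncoder()
one_hot_codes = one_hot_encoder.fit_transform(proximity).toarray()

print(ordinal_encoder.categories_[0])  # the learned categories
print(ordinal_codes.ravel())           # a single column of integer codes
print(one_hot_codes)                   # one 0/1 column per category
```

The ordinal codes imply an order (and a distance) between categories that does not exist for nominal data, whereas the one-hot columns treat every category as equally different from the others.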

In this section, you will learn how to compose all preprocessing you have done so far in the last lab into a well-structured transformation, in which you can apply all preprocessing at once.
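One possible way to express "all preprocessing at once" is scikit-learn's `Pipeline` combined with `ColumnTransformer`. The sketch below is an alternative illustration, not the function used in this lab; the toy DataFrame and its values are hypothetical, and only the column names are borrowed from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data (hypothetical values) with one missing number and one text column
toy = pd.DataFrame({
    "total_rooms": [100.0, np.nan, 300.0, 200.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})
numerical_cols = ["total_rooms", "median_income"]
categorical_cols = ["ocean_proximity"]

# Numerical columns: impute the median first, then scale to [0, 1]
numerical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", MinMaxScaler()),
])

# Route each group of columns to its own transformation;
# sparse_threshold=0 asks for a dense array as the result
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_steps, numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

# fit_transform learns the medians, min/max values, and categories, then applies them
toy_transformed = preprocessor.fit_transform(toy)
print(toy_transformed.shape)
```

With this design, new data would go through `preprocessor.transform(...)`, which reuses the values learned during fitting instead of recomputing them.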

## Step 1: Dataset Preparation

**Task 1.1: Read the dataset from the `housing.csv` file and display its info using `.info()`.**

In [18]:
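import pandas as pd

# Read the dataset and display a summary of its columns, dtypes, and non-null counts
housing = pd.read_csv('./housing.csv')
housing.info()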

**Task 1.2: Split the dataset into 80% training and 20% testing data.**

In [19]:
#Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.20, random_state=5,shuffle=True)


The goal is to predict the `median_house_value` value from the other features provided in the dataset. Hence, this column will be the target/label we need to predict. Therefore, you will separate this column from the rest of the features.

Since we have two sets, train and test, we will have four variables `X_train` and `X_test` for the features and `y_train` and `y_test` for the target, which is the price of the house. 

**Task 1.3: Split the target column from the features on the train and test dataset**

In [20]:
# Split the training target from the feature columns, as you did in Lab 4
X_train = train_set.drop("median_house_value", axis=1)
y_train = train_set["median_house_value"]

# Similarly, split the target from the features in the test dataset
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"]

## Step 2: Data Preprocessing

Generally, any new data should pass through the same preprocessing you applied to the training dataset. So, you will apply all the preprocessing steps from the previous lab, done on the training dataset, to the testing set as well.

Hence, it is best practice to have a function that takes the data, composes all the preprocessing, and returns the preprocessed data (ready for training/prediction).

In [21]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

def one_hot_encoding(data, column, transformer_one_hot_encoder= None):
    training = False

    if transformer_one_hot_encoder is None:
        training = True
        # Initialize OneHotEncoder
        transformer_one_hot_encoder = OneHotEncoder(sparse_output=False)

        # Fit the encoder (training data only)
        transformer_one_hot_encoder.fit(data[[column]])

    # Encode the column with the fitted encoder
    one_hot_encoded = transformer_one_hot_encoder.transform(data[[column]])

    # Convert back to a DataFrame, keeping the row index so that concat aligns the rows
    encoded_df = pd.DataFrame(
        one_hot_encoded,
        columns=transformer_one_hot_encoder.get_feature_names_out([column]),
        index=data.index,
    )

    # Drop the categorical column
    data = data.drop(column, axis=1)

    # Combine with original data
    df_encoded = pd.concat([data, encoded_df], axis=1)
    if training:
        return df_encoded, transformer_one_hot_encoder
    else:
        return df_encoded

def preprocessing_training(training_data):
    # Make a copy of the original data to work on.
    train_data_cp = training_data.copy()
    ############################################### Data cleaning #################################################

    # Determine the numerical/categorical columns, since you shouldn't apply normalization to the categorical columns.
    numerical_cols = list(train_data_cp.iloc[:, train_data_cp.columns != 'ocean_proximity'])

    imputer = SimpleImputer(strategy="median")
    transformer_clean = ColumnTransformer(
        transformers=[('imputer', imputer, numerical_cols)], # Define the transformation & the columns to apply it to
        remainder = 'passthrough' # Leave the remaining columns unchanged.
    )
    # Fit the imputer on the training data (learn the medians) and apply it.
    train_data_cp = transformer_clean.fit_transform(train_data_cp)

    # Restore the column names of the cleaned dataset
    train_data_cp = pd.DataFrame(train_data_cp, columns = numerical_cols + ['ocean_proximity'])
    
    ############################################## Feature Scaling #############################################
    scaler = MinMaxScaler()
    transformer_scale = ColumnTransformer(
        transformers=[('scaler', scaler, numerical_cols)], # Define the transformation & the columns to apply it to
        remainder = 'passthrough' # Leave the remaining columns unchanged.
    )
    # Fit the scaler on the training data (learn each column's min and max) and apply it.
    train_data_cp = transformer_scale.fit_transform(train_data_cp)

    # Restore the column names of the scaled dataset
    train_data_cp = pd.DataFrame(train_data_cp, columns = numerical_cols + ['ocean_proximity'])
    ################################################ Categorization ###############################################

    train_data_cp, transformer_one_hot_encoder = one_hot_encoding(data = train_data_cp, column= "ocean_proximity")
 
    
    return train_data_cp, transformer_one_hot_encoder, transformer_clean, transformer_scale

**Question:** Check the code written inside the `preprocessing_training()` function and describe, in your own words, all the preprocessing it includes.

The function preprocesses the data in three steps: it imputes missing values in the numerical features using the median strategy, scales the numerical features to the [0, 1] range with `MinMaxScaler`, and one-hot encodes the categorical feature `ocean_proximity`.

**Question:** What does the `preprocessing_training()` function return?

It returns the preprocessed training data, the fitted one-hot encoder, the cleaning transformer, and the scaling transformer.

**Question:** Is the order important while applying preprocessing, referring to the `preprocessing_training` function? Could changing the order in any way cause an error (Exception)? If **yes**, give an example of what could happen? 

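Yes. The one-hot encoder targets a different column than the imputer and the scaler, so its result does not depend on theirs. In this function, however, it still has to run last: the cleaning and scaling steps rebuild the DataFrame with the column names `numerical_cols + ['ocean_proximity']`, so encoding first would replace `ocean_proximity` with several new columns and the hard-coded column list would no longer match the data, raising an exception. The order between the cleaning and scaling transformers is also important: missing values should be imputed before the later steps, since null values that survive preprocessing can cause errors or produce data different from the intended result.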

The `preprocessing_training()` function takes the training dataset and applies the preprocessing techniques described above to it. Complete the line below so that the function applies the preprocessing to the `X_train` data.

In [22]:
X_train_transformed, transformer_one_hot_encoder, transformer_clean, transformer_scale = preprocessing_training(X_train)

### Task 1: Data Preprocessing (Student Exercise)

So far, you have done preprocessing on the training dataset. You may notice that the `preprocessing_training()` function returns `transformer_clean` and `transformer_scale` alongside the preprocessed data. The main reason is that the clean and scale transformers hold the computed medians and the other values (such as each feature's minimum and maximum) used while scaling the data. These values should be **fixed** for any further preprocessing you apply to a test or external dataset.
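The following minimal sketch illustrates what "fixed" means here. The arrays are toy values invented for this example (they are not from the housing data), and the imputer and scaler are used directly rather than through the lab's `ColumnTransformer` objects.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values)
train = np.array([[10.0], [20.0], [np.nan], [40.0]])
test = np.array([[np.nan], [50.0]])

# Fit on the training data only
imputer = SimpleImputer(strategy="median").fit(train)
scaler = MinMaxScaler().fit(imputer.transform(train))

print(imputer.statistics_)                 # median learned from the training data
print(scaler.data_min_, scaler.data_max_)  # range learned from the training data

# The test set is only transformed; nothing is re-fitted on it
test_ready = scaler.transform(imputer.transform(test))
print(test_ready)
```

The missing test value is filled with the training median, and the test value 50 maps above 1 because it lies outside the range seen during training. That is expected: the scaling must stay identical to the one the model was trained with.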

**Question:** In this task, you must write the preprocessing function for the test dataset using the scaler and imputer obtained from the training dataset. **Remember that you should not re-compute the scaling values or the median on the test dataset; use the ones you obtained from the training data.**

In [23]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler


def preprocessing_testing(testing_data, transformer_one_hot_encoder, transformer_clean, transformer_scale):
    # Make a copy of the original data
    testing_data_cp = testing_data.copy()

    ############################################### Data cleaning #################################################
  

    # Determine the numerical/categorical columns, since you shouldn't apply normalization to the categorical columns.
    numerical_cols = list(testing_data_cp.iloc[:, testing_data_cp.columns != 'ocean_proximity'])

    # Apply the cleaning learned from the training data (no re-fitting on the test data).
    testing_data_cp = transformer_clean.transform(testing_data_cp)

    # Restore the column names of the cleaned dataset
    testing_data_cp = pd.DataFrame(testing_data_cp, columns = numerical_cols + ['ocean_proximity'])

    ############################################## Feature Scaling #############################################

    # Apply the scaling learned from the training data (no re-fitting on the test data).
    testing_data_cp = transformer_scale.transform(testing_data_cp)

    # Restore the column names of the scaled dataset
    testing_data_cp = pd.DataFrame(testing_data_cp, columns = numerical_cols + ['ocean_proximity'])
    ################################################ Categorization ###############################################
    testing_data_cp = one_hot_encoding(data = testing_data_cp, column= "ocean_proximity", transformer_one_hot_encoder=transformer_one_hot_encoder)

    return testing_data_cp

**Question:** Use the function you have just implemented to preprocess the testing data.

In [24]:
X_test_transformed = preprocessing_testing(X_test, transformer_one_hot_encoder, transformer_clean, transformer_scale)

## Step 3: Machine Learning Models: Training and Performance Evaluation Comparison

### 3.1 Linear Regression Model

 **Task 3.1.1: Initialize a linear regression model and use `fit()` function to train the model on the training dataset.**

In [25]:
from sklearn.linear_model import LinearRegression

linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train_transformed, y_train)

 **Task 3.1.2: Assess the overall performance using the `mean_squared_error()` function with the `squared=False` argument, which returns the root mean squared error (RMSE), together with `mean_absolute_error()`, to measure the average error of the predicted house prices with respect to the actual prices. <span style="color:red">You need to do it on both training and testing.</span>**

 ![RMSE formula](https://docs.oracle.com/en/cloud/saas/planning-budgeting-cloud/pfusu/img/insights_rmse_formula.jpg)
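For reference, the two metrics used in this task can be written out as follows. The notation is an assumption made for this note: $y_i$ is the actual price of sample $i$, $\hat{y}_i$ is the predicted price, and $n$ is the number of samples.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

Both are expressed in the same unit as the target (here, the house price). Because the RMSE squares each error before averaging, large errors weigh more heavily in it than in the MAE.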

In [26]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Use the train model to predict house prices for the training data
y_train_prediction = linear_regression_model.predict(X_train_transformed)
# Use the train model to predict house prices for the testing data
y_test_prediction = linear_regression_model.predict(X_test_transformed)

# Compute the RMSE and the MAE between the predicted and the actual prices of the training data
# Note: `squared=False` is deprecated in scikit-learn 1.4+, where `root_mean_squared_error()` replaces it
rmse_train = mean_squared_error(y_true=y_train,y_pred=y_train_prediction, squared=False)
mas_train  = mean_absolute_error(y_true=y_train,y_pred=y_train_prediction)

# Compute the RMSE and the MAE between the predicted and the actual prices of the testing data
rmse_test = mean_squared_error(y_true=y_test,y_pred=y_test_prediction, squared=False)
mas_test  = mean_absolute_error(y_true=y_test,y_pred=y_test_prediction)

print("Root Mean Square Error: train:",rmse_train, ", test: ", rmse_test)
print("Absolute Mean Error: train:",mas_train, ", test: ", mas_test)

Root Mean Square Error: train: 68542.09756453634 , test:  69400.20403287992
Absolute Mean Error: train: 49540.16651057423 , test:  50582.766987339455




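**Question:** Based on the error shown on the training and testing datasets, does the model suffer from overfitting or underfitting? Justify your answer.

The model does not overfit, as the errors on the training and testing data are close to each other. However, both errors are large compared with the house prices themselves, which suggests that the model may be underfitting.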

**Question:** Given the RMSE result above, what is your initial opinion of the prediction performance? Could you put an **error range** that the price prediction could have when it predicts the price for another house?

Based on the test RMSE, we expect the actual price to be within roughly +/- 69400 of the predicted price.

**Task 3.1.3: Pick random samples from the test dataset, predict their prices, and display the actual and the predicted price for each, along with the absolute difference.**

In [27]:
# Pick 5 samples (here, the first five rows of the test set)
five_samples = X_test_transformed.head()
y_five_samples_actual = y_test.head()

y_five_samples_prediction = linear_regression_model.predict(five_samples)

for actual, prediction in zip(y_five_samples_actual, y_five_samples_prediction):
    #print(f'Expected - Actual = {mas_test - abs(prediction - actual)}')
    print('Actual house price: ', actual, ' Predicted price: ', prediction, ' abs_error: ', abs(prediction - actual))

Actual house price:  93600.0  Predicted price:  166082.65773116273  abs_error:  72482.65773116273
Actual house price:  153600.0  Predicted price:  210820.0587284672  abs_error:  57220.05872846721
Actual house price:  132500.0  Predicted price:  104433.22097642202  abs_error:  28066.779023577983
Actual house price:  147900.0  Predicted price:  178408.570216519  abs_error:  30508.570216519
Actual house price:  120700.0  Predicted price:  184704.89404781692  abs_error:  64004.894047816924


**Question:** From the absolute errors of the five samples, what is the difference between your expected error range, from ***last question***, and the highest and lowest absolute error value in the five samples?

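Taking the test MAE (about 50582.77) as the expected error and subtracting each sample's absolute error gives:

- Expected - Actual = -21899.890743823278
- Expected - Actual = -6637.291741127752
- Expected - Actual = 22515.987963761472
- Expected - Actual = 20074.196770820454
- Expected - Actual = -13422.127060477469

The lowest absolute error was about 28067, while the highest was about 72483, so the five errors range from roughly 22516 below to roughly 21900 above the expected error.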

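**Question**: What is the new error range you expect for housing price prediction, based on the absolute errors of the five samples?

Averaging the five absolute errors gives about 50456.59, so we would expect the actual price to be within roughly +/- 50457 of the predicted price.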
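The arithmetic behind this estimate can be checked with a few lines of plain Python. The values are the absolute errors copied from the five-sample output above; the variable names are chosen only for this check.

```python
# Absolute errors copied from the five-sample output above
abs_errors = [
    72482.65773116273,
    57220.05872846721,
    28066.779023577983,
    30508.570216519,
    64004.894047816924,
]

# Average absolute error of the five samples
mean_abs_error = sum(abs_errors) / len(abs_errors)

print("mean absolute error:", round(mean_abs_error, 2))
print("lowest:", min(abs_errors), "highest:", max(abs_errors))
```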

**Question**: Do you think this range will change if you changed the number of samples, i.e. ten instead of five? If **yes**, what is the best way, in your opinion, to have an estimation of the prediction error range for a new dataset, neither train/test?

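Yes, it will change slightly, but as the number of samples grows it will converge to the true error (or come close enough to it), which is around 50500 here. The best estimate therefore comes from averaging the absolute error over as many unseen samples as possible, such as the whole test set, rather than over a handful of samples.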
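The effect of the sample size can be illustrated with a small simulation. Everything in this sketch is synthetic: the pool of absolute errors is drawn from an exponential distribution with a mean of 50000, chosen only for illustration, and it is not the model's real error distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool of absolute errors (synthetic, NOT the lab's real errors)
all_abs_errors = rng.exponential(scale=50_000, size=4_000)

spread = {}
for n in (5, 50, 500):
    # Draw many random subsets of size n and average each one
    sample_means = [
        rng.choice(all_abs_errors, size=n, replace=False).mean()
        for _ in range(1_000)
    ]
    # Standard deviation of the estimates: how much they vary between subsets
    spread[n] = float(np.std(sample_means))
    print(f"n={n}: mean estimate {np.mean(sample_means):.0f}, spread {spread[n]:.0f}")
```

The estimates from five samples vary widely from one subset to the next, while the estimates from larger subsets cluster much more tightly around the pool average.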