# **Introduction**

- The used car market is large and dynamic, with prices influenced by multiple factors such as mileage, engine size, fuel efficiency, year of manufacture, and brand. For buyers, accurate price predictions help them avoid overpaying, while for sellers and dealerships, predictive models can guide competitive pricing strategies.

- In this lab, you will develop predictive models that estimate the price of a used car from its listed attributes.

<div align="center">

![Used car](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTwsNyFobkAjOIx4t1z4eublgdP90ALrglnIA&s)

</div>

# **Environment**

- We will be using .ipynb (Jupyter Notebook) files. If you don’t already have an environment to run these files, we recommend using **Anaconda**.

- The **coding exam** will also use Anaconda, so it’s a good idea to get familiar with it. For guidance, refer to the tutorial anaconda_guide.pptx.

- If you are unsure about a function or its parameters, you can use help() to view its documentation. For example: help(train_test_split)


# **Requirement**

- Do it individually, not as a team! (Teams are for the final project.)

- Deadline: **2025/9/25 23:59** (Late submission is not allowed!)

- Hand in the following files to eeclass with exactly these names (do not compress them!)
	- Lab1.ipynb
	- Lab1_basic.csv
	- Lab1_advanced.csv

- Lab 1 will be covered on the next coding and writing exam.

- You may modify the provided sample code or add new cells as needed, as long as you meet the requirements.

- Responsible TA: Pin-Shun Wang (wangpinshun@gmail.com)
	- Email for questions or visit EECS 639 during TA hours (Make a reservation in advance).
	- No debugging service

# **Penalty Rules**

You will receive 0 points if any of the following applies:
- Plagiarism
- Late submission
- Not using the template or importing any other packages
- No code (“Lab1.ipynb”) submitted to eeclass
- No prediction CSV files submitted to eeclass
- Your submission was not generated by your code

5 points will be deducted if your submission format is incorrect.


# **Lab1: Regression**
In this lab, you are required to complete the following tasks:

1.  Part I (**60%**) - Preprocess data and implement a linear regression model to predict used car price

    - Step 1: Split Data
    - Step 2: Preprocess Data
    - Step 3: Train Model and Generate Result

2.  Part II (**35%**) - Extend the linear regression model from Part I to a polynomial regression model and improve prediction performance

    - Step 1: Generate the Polynomial Features
    - Step 2: Train Model and Generate Result

3. Part III (**5%**) – Write a report that answers the given questions.

### Import Packages

⚠️You **cannot** import any other package


In [4]:
import pandas as pd
import csv
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

### Global attributes
- Define the global attributes. You can also add your own global attributes here

In [5]:
training_dataroot = 'train.csv' # Training data file named 'train.csv'
testing_dataroot = 'test.csv'   # Testing data file named as 'test.csv'
basic_output_path = 'Lab1_basic.csv' # Your model prediction in part I to submit to eeclass
advanced_output_path = 'Lab1_advanced.csv' # Your model prediction in part II to submit to eeclass

basic_output =  [] # save your model prediction in part I
advanced_output = [] # save your model prediction in part II

### Load the Input File

First, load the input files **train.csv** and **test.csv**.

In [6]:
df_train = pd.read_csv(training_dataroot)
df_test = pd.read_csv(testing_dataroot)

display(df_train.head(5))
display(df_test.head(5))
print("Number of training data: ", len(df_train))
print("Number of testing data: ", len(df_test))

Unnamed: 0,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,B Class,2017,Semi-Auto,26704,Diesel,145,68.9,2.1,13999
1,CL Class,2020,Semi-Auto,1000,Diesel,145,55.4,2.0,30389
2,V Class,2018,Manual,24164,Diesel,145,46.3,2.1,19498
3,E Class,2017,Semi-Auto,28078,Diesel,145,65.7,2.0,21799
4,C Class,2019,Semi-Auto,15838,Diesel,145,61.4,2.0,24498


Unnamed: 0,model,year,transmission,mileage,fuelType,tax,mpg,engineSize
0,A Class,2019,Automatic,8478,Diesel,145,65.7,1.5
1,E Class,2014,Automatic,60514,Diesel,145,52.3,2.1
2,E Class,2020,Automatic,2568,Diesel,145,42.8,2.0
3,GLC Class,2020,Semi-Auto,2000,Diesel,145,40.9,2.0
4,C Class,2017,Automatic,20949,Diesel,145,61.4,2.1


Number of training data:  10495
Number of testing data:  2624


---
# 1. Part I (60%)
In Part I, you need to implement linear regression to predict used car prices.

You will receive full credit (60 points) if the MAPE of your predictions on the testing data is below **20**.

⚠️**Please save the prediction result for the testing data in a CSV file and submit it to eeclass. This file will be used to evaluate your assignment**⚠️

## Step 1: Split data

Use **train_test_split** from scikit-learn to divide the dataset into a training set and a validation set. The training set is used to fit your regression model, while the validation set is used to evaluate its performance.

- **We recommend setting random_state=0 in train_test_split to ensure that the validation data is representative and the evaluation is consistent with the testing data.**
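
As a minimal sketch of why a fixed `random_state` matters, the snippet below splits a small made-up DataFrame twice and checks that both calls select the same validation rows. The `toy` data and the `test_size=0.2` value are illustrative assumptions, not part of the lab data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df_train
toy = pd.DataFrame({"mileage": range(10), "price": range(100, 110)})
X_toy = toy.drop(columns=["price"])
y_toy = toy["price"]

# The same random_state gives the same split on every call
a_train, a_valid, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
b_train, b_valid, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

same_split = a_valid.index.equals(b_valid.index)
print("Validation rows:", len(a_valid), "| identical split:", same_split)
```

Without a fixed `random_state`, each run may pick different validation rows, so validation scores from different runs are not directly comparable.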

In [7]:
# TODO Split df_train into training set and validation set

X = df_train.drop(columns=['price'])  # features (everything except price)
y = df_train['price']                 # price = target variable

# Split data into training and validation sets
# X and y are split together: 18.5% goes to validation, the rest to training
x_train, x_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.185, random_state=0
) # set parameter random_state=0

display(x_train.head(5))
display(y_train.head(5))

Unnamed: 0,model,year,transmission,mileage,fuelType,tax,mpg,engineSize
6880,GLC Class,2020,Semi-Auto,4000,Diesel,145,40.9,2.0
4629,E Class,2019,Automatic,8000,Diesel,145,45.6,2.0
1536,E Class,2019,Automatic,11210,Diesel,150,72.4,2.0
5570,C Class,2019,Semi-Auto,13027,Diesel,145,61.4,2.0
6593,E Class,2016,Semi-Auto,29511,Diesel,30,65.7,2.0


6880    46481
4629    31995
1536    24250
5570    25899
6593    19998
Name: price, dtype: int64

## Step 2: Preprocess Data

As we can see from the input file, the scales of the input features vary significantly. Therefore, it is important to standardize them first to ensure that no single feature dominates the regression results.

### Step 2-1: Standardize Continuous Value

- Try to use StandardScaler() to transform both the training and validation data.

**Note**: Always fit the scaler on the training data only (to compute the mean and standard deviation), and then use the same scaler to transform both the training and validation sets.
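
The note above can be illustrated with a small sketch. The `toy_train` and `toy_valid` frames are made-up values; the point is only that the mean and standard deviation come from the training rows, and the validation row is transformed with those same statistics.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy columns, not the lab data
toy_train = pd.DataFrame({"mileage": [1000.0, 2000.0, 3000.0]})
toy_valid = pd.DataFrame({"mileage": [4000.0]})

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)                       # statistics come from training data only
scaled_valid = toy_scaler.transform(toy_valid)  # validation reuses those statistics

print("mean learned from training data:", toy_scaler.mean_[0])
print("scaled validation value:", scaled_valid[0, 0])
```

Calling `fit` (or `fit_transform`) on the validation data instead would leak validation statistics into preprocessing and make the validation score less trustworthy.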

In [8]:
cont_columns = ['year', 'mileage', 'tax','mpg', 'engineSize']
example_train = x_train[cont_columns]
example_valid = x_valid[cont_columns]
print("Before standardization:", example_train.iloc[0].values, example_valid.iloc[0].values)

# TODO Standardize both example_train and example_valid.
scaler = StandardScaler()
scale_train = scaler.fit_transform(example_train)  # fit on the training set only so the scaler never sees validation data
scale_valid = scaler.transform(example_valid)      
print("After standardization:", scale_train[0], scale_valid[0])

Before standardization: [2.02e+03 4.00e+03 1.45e+02 4.09e+01 2.00e+00] [2.013e+03 3.100e+04 1.450e+02 5.540e+01 2.200e+00]
After standardization: [ 1.21110625 -0.85200702  0.24519601 -0.94338844 -0.11776907] [-1.92351861  0.42389884  0.24519601  0.00404356  0.23516676]


### Step 2-2: Encode Categorical Value

- The dataset contains several categorical columns. For example, the model column has 27 distinct car models in the training split used here (see the encoded feature names in the output below). Since regression models cannot directly handle categorical values, we need to encode these features first.

- We can use one-hot encoding to handle this. It creates a new binary feature for each distinct category. The column corresponding to the observed category is set to 1, while all others are 0.

**Note**: Just like with scaling, you should fit the encoder on the training data only.

In [9]:
category_columns = ['model','transmission', 'fuelType']
example_train = x_train[category_columns]
example_valid = x_valid[category_columns]

# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error
onehotencoder = OneHotEncoder(handle_unknown='ignore')

# TODO Encode example_train and example_valid
onehot_train = onehotencoder.fit_transform(example_train)   
onehot_valid = onehotencoder.transform(example_valid)      

print("Before encoding:")
print(f"Feature: {example_train.columns.values}")
print('=' * 100)
print("After encoding:")
print(f"Feature: {onehotencoder.get_feature_names_out()}")

Before encoding:
Feature: ['model' 'transmission' 'fuelType']
After encoding:
Feature: ['model_ A Class' 'model_ B Class' 'model_ C Class' 'model_ CL Class'
 'model_ CLA Class' 'model_ CLC Class' 'model_ CLK' 'model_ CLS Class'
 'model_ E Class' 'model_ G Class' 'model_ GL Class' 'model_ GLA Class'
 'model_ GLB Class' 'model_ GLC Class' 'model_ GLE Class'
 'model_ GLS Class' 'model_ M Class' 'model_ R Class' 'model_ S Class'
 'model_ SL CLASS' 'model_ SLK' 'model_ V Class' 'model_ X-CLASS'
 'model_180' 'model_200' 'model_220' 'model_230' 'transmission_Automatic'
 'transmission_Manual' 'transmission_Other' 'transmission_Semi-Auto'
 'fuelType_Diesel' 'fuelType_Hybrid' 'fuelType_Other' 'fuelType_Petrol']


### Step 2-3: Use ColumnTransformer

- The input CSV file contains both continuous and categorical features. Since these types of data require different preprocessing steps, it is convenient to use ColumnTransformer in scikit-learn to apply the appropriate transformations to each column.

- Using the preprocessing steps we defined earlier (scaling for continuous features and one-hot encoding for categorical features), define a preprocessor that transforms the input data into a format suitable for linear regression, all in a single step.

In [10]:
# TODO Define the preprocessor
preprocessor = ColumnTransformer([
    ("numerical", scaler, cont_columns),
    ("category", onehotencoder, category_columns)
])

# TODO Preprocess continuous data and categorical data at the same time
preprocess_train = preprocessor.fit_transform(x_train)  
preprocess_valid = preprocessor.transform(x_valid)      

print(preprocess_train.shape)
print(preprocess_valid.shape)

(8553, 40)
(1942, 40)


## Step 3: Train Model and Generate Result

- Now that you know how to preprocess the data, let’s train a linear regression model. For convenience, you don’t need to preprocess the data separately before training. Instead, you can use a **Pipeline** to combine the preprocessor and the regression model, so that you can perform preprocessing and model training in a single step.

- In this lab, we use Mean Absolute Percentage Error (MAPE) to evaluate the performance. It measures the average absolute difference between the predicted and actual values, expressed as a percentage of the actual values. It is more interpretable because it provides a relative error in percentage terms, making it easier to understand and compare across different datasets or scales. You can calculate it with the imported function **mean_absolute_percentage_error**. The formula is as below:
<div align="center">

![MAPE_formula](https://ithelp.ithome.com.tw/upload/images/20210929/20142004n36Qnhw9js.png)

</div>

- Save the predicted values for the testing dataset in `basic_output`
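
To make the metric concrete, here is a hand-written sketch of MAPE on three made-up prices. The helper name `mape_fraction` is hypothetical, and it assumes no true value is zero. Like scikit-learn's `mean_absolute_percentage_error`, it returns a fraction rather than a percentage, which is why the lab code multiplies the result by 100 before printing.

```python
def mape_fraction(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.1 means 10%)."""
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Hypothetical prices: two predictions are off by 10%, one is exact
toy_true = [10000, 20000, 40000]
toy_pred = [11000, 18000, 40000]

toy_mape = mape_fraction(toy_true, toy_pred)
print(f"MAPE: {toy_mape * 100:.2f}%")
```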

In [None]:
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

# TODO Fit the model
pipeline.fit(x_train, y_train)

# TODO Evaluate your model using validation data
y_pred = pipeline.predict(x_valid)

mape = mean_absolute_percentage_error(y_valid, y_pred)
print("MAPE (%):", mape * 100)

# TODO Predict used car price in the testing data
basic_output = pipeline.predict(df_test)

In [71]:
# Optional: print the fitted regression equation (useful for Part III, question 1)
model = pipeline.named_steps["model"]

# Strip the "numerical__" / "category__" prefixes added by ColumnTransformer
feature_names = pipeline.named_steps["preprocessor"].get_feature_names_out()
feature = [name.split("__")[-1] for name in feature_names]

coefficients = model.coef_
intercept = model.intercept_

equation_terms = []
for coef, name in zip(coefficients, feature):
    if abs(coef) > 1e-8:  # skip ~zero terms
        equation_terms.append(f"({coef:.4f} * {name})")

equation = " + ".join(equation_terms)
equation = f"price = {intercept:.4f} + " + equation
print(equation)

# Coefficients only, as a comma-separated list
coef_str = ", ".join([f"{coef:.4f}" for coef in coefficients])
print(coef_str)



### Write the Output File

Write the prediction in *basic_output* to Lab1_basic.csv
> Format: 'ID', 'price'

⚠️**Remember to submit it to eeclass. This file will be used to evaluate your part I**⚠️

In [72]:
# Assume that basic_output is a list with length = 2624
with open(basic_output_path, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['ID', 'price'])
  for i in range(len(basic_output)):
    writer.writerow([i,basic_output[i]])

# 2. Part II (35%)
In Part II, you need to implement polynomial regression to improve your price predictions. Polynomial regression is useful because it can capture non-linear relationships between the input features and the target variable.

You will receive full credit (35 points) if the MAPE of your predictions on the testing data is below **15**.

⚠️**Please save the prediction result for the testing data in a CSV file and submit it to eeclass. This file will be used to evaluate your assignment**⚠️

---
## Step 1: Generate the Polynomial Features

To implement polynomial regression, we first need to expand the original input features into polynomial features. For example, suppose we have two input features ($x_1$, $x_2$), and we want to generate polynomial features up to degree 3. The transformed features would include:
- Degree 1: ($x_1$, $x_2$)
- Degree 2: ($x_1^2$, $x_2^2$, $x_1 x_2$)
- Degree 3: ($x_1^3$, $x_2^3$, $x_1^2 x_2$, $x_1 x_2^2$)

In total, this gives us 9 polynomial features. By applying the regression model from the basic part to these expanded features, we can capture non-linear relationships between the input variables and the target variable.
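
The count of 9 can be checked with a short standard-library sketch that enumerates every product of the inputs up to a given degree. The helper `polynomial_terms` is a hypothetical illustration of the counting argument, not how `PolynomialFeatures` is implemented internally.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List every monomial of the given names from degree 1 up to `degree`."""
    terms = []
    for d in range(1, degree + 1):
        # Each multiset of d names corresponds to one monomial of degree d
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

toy_terms = polynomial_terms(["x1", "x2"], degree=3)
print(len(toy_terms), toy_terms)
```

Applied to the five continuous columns of this lab with degree 3, the same function lists 55 terms, which shows how quickly the feature count grows with the degree.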

In [73]:
example_columns = ['tax', 'mpg']
example_train = x_train[example_columns]
example_valid = x_valid[example_columns]

# set include_bias=False because the regression model fits its own intercept term
poly = PolynomialFeatures(degree=3, include_bias=False)

# TODO Generate the polynomial features for example_train and example_valid
poly_train = poly.fit_transform(example_train)
poly_valid = poly.transform(example_valid)

print("Original Feature shape: ", example_train.shape)
print("Polynomial Feature shape: ", poly_train.shape)

Original Feature shape:  (8553, 2)
Polynomial Feature shape:  (8553, 9)


## Step 2: Train Model and Generate Result

Extend the **ColumnTransformer** you defined in the basic part by adding **PolynomialFeatures**, and use it to generate predictions on the testing dataset. To apply polynomial expansion and standardization together, wrap them in a Pipeline so they are processed in the correct order.

**Hint**: You can experiment with different polynomial degrees, or try alternative linear models such as **Ridge** or **Lasso** regression to improve performance.

- Save the predicted values for the testing dataset in `advanced_output`
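
One possible way to follow the hint is to loop over a few candidate settings and compare their validation MAPE. The sketch below does this on made-up data whose target is an exact quadratic function of a single feature; the candidate degrees and `alpha` values are arbitrary assumptions, and the real lab pipeline would also include the categorical branch of the `ColumnTransformer`.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical toy data: y is an exact quadratic function of x
toy = pd.DataFrame({"x": [i / 10 for i in range(1, 101)]})
toy["y"] = 50 + 3 * toy["x"] + 2 * toy["x"] ** 2

tx_train, tx_valid, ty_train, ty_valid = train_test_split(
    toy[["x"]], toy["y"], test_size=0.2, random_state=0
)

results = {}
for degree in (1, 2, 3):
    for alpha in (0.1, 10.0):
        candidate = Pipeline([
            ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
            ("scaler", StandardScaler()),
            ("model", Ridge(alpha=alpha)),
        ])
        candidate.fit(tx_train, ty_train)
        pred = candidate.predict(tx_valid)
        # Validation MAPE as a fraction for this (degree, alpha) pair
        results[(degree, alpha)] = mean_absolute_percentage_error(ty_valid, pred)

best_degree, best_alpha = min(results, key=results.get)
print("best (degree, alpha):", (best_degree, best_alpha))
```

Selecting settings on the validation set, never on the testing data, keeps the comparison fair; on the real data the best combination may differ from what this toy example suggests.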

In [74]:
# TODO Define a new preprocessor that includes PolynomialFeatures
preprocessor = ColumnTransformer([
    ("numerical", Pipeline([
        ("poly", poly),  
        ("scaler", scaler)
    ]), cont_columns),
    ("category", onehotencoder, category_columns)
])

# TODO Build your own pipeline with different regression model
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", Ridge(alpha=20))  
])

#from sklearn.model_selection import GridSearchCV

#params = {"model__alpha": [0.1, 0.5, 1, 3, 5, 10, 20, 30, 35, 100]}
#search = GridSearchCV(pipeline, params, cv=5, scoring="neg_mean_absolute_percentage_error")
#search.fit(x_train, y_train)

#print("Best alpha from CV:", search.best_params_)

# TODO Fit the model
pipeline.fit(x_train, y_train)

# TODO Evaluate your model using validation data
y_pred = pipeline.predict(x_valid)
mape = mean_absolute_percentage_error(y_valid, y_pred)
print("MAPE (%):", mape * 100)


# TODO Predict used car price in the testing data
advanced_output = pipeline.predict(df_test)

MAPE (%): 11.613134367256439


### Write the Output File

Write the prediction in *advanced_output* to Lab1_advanced.csv
> Format: 'ID', 'price'

⚠️**Remember to submit it to eeclass. This file will be used to evaluate your part II**⚠️

In [75]:
# Assume that advanced_output is a list with length = 2624
with open(advanced_output_path, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['ID', 'price'])
  for i in range(len(advanced_output)):
    writer.writerow([i,advanced_output[i]])

# 3. Part III (5%)

Answer each question in the markdown cell below.

1. Write down your regression equation from the basic part. For example: `1 + 20*x1 + 30*x2` (1%)

2. When standardizing input features, why do we standardize each feature across all samples, rather than standardizing each sample individually? (1%)

3. Why don’t we simply map each categorical value to an integer (0 to number of classes – 1)? What advantages does one-hot encoding provide compared to this approach? (1%)

4. In the advanced part, should we generate polynomial features first or standardize the data first? Explain your reasoning. (2%)

## Your answer to the questions
1. 36340.2482 + (4183.3429 * year) + (-3131.3262 * mileage) + (-1291.2512 * tax) + (-2573.7598 * mpg) + (5215.4466 * engineSize) + (-5136.1274 * model_ A Class) + (-7839.7146 * model_ B Class) + (-5721.5492 * model_ C Class) + (-6210.7999 * model_ CL Class) + (-5187.9780 * model_ CLA Class) + (-3020.2951 * model_ CLC Class) + (-1567.9213 * model_ CLK) + (-6095.3489 * model_ CLS Class) + (-4180.5088 * model_ E Class) + (53873.8685 * model_ G Class) + (-4173.7563 * model_ GL Class) + (-7399.4666 * model_ GLA Class) + (747.1302 * model_ GLB Class) + (-818.2329 * model_ GLC Class) + (1565.8302 * model_ GLE Class) + (4505.9866 * model_ GLS Class) + (1148.3186 * model_ M Class) + (-1133.8473 * model_ R Class) + (6323.3326 * model_ S Class) + (-5189.1113 * model_ SL CLASS) + (-2394.5984 * model_ SLK) + (-129.5069 * model_ V Class) + (-8256.3399 * model_ X-CLASS) + (-1209.6356 * model_180) + (-8928.2316 * model_200) + (-10010.5658 * model_220) + (26439.0693 * model_230) + (2124.1269 * transmission_Automatic) + (-40.9900 * transmission_Manual) + (-4196.2965 * transmission_Other) + (2113.1596 * transmission_Semi-Auto) + (-9635.6153 * fuelType_Diesel) + (2139.8923 * fuelType_Hybrid) + (15559.7937 * fuelType_Other) + (-8064.0708 * fuelType_Petrol)
   Here `year`, `mileage`, `tax`, `mpg`, and `engineSize` are the standardized values of those columns, and every `model_`, `transmission_`, and `fuelType_` term is a 0/1 one-hot indicator.

2. Because we want to be able to compare features across the dataset. By standardizing each feature across samples, the model can obtain global information about what is “high” or “low” for each feature. Standardizing per sample would not help us find the patterns we want the model to learn.

3. Since our categories have no natural order, we use one-hot encoding so the model does not mistakenly treat one category as larger or smaller than another. If we mapped each category to an integer, the regression would treat those integers as ordered, evenly spaced numeric values. With one-hot encoding, the categories stay separate and equal: each one gets its own coefficient, which directly tells us that category's contribution to the price as an individual option rather than through an arbitrary numeric order.

4. We should generate the polynomial features first and then standardize them. Polynomial terms can be far larger or smaller than the original features, so if we standardized only the original features, the expanded terms would still end up on very different scales. Standardizing after expansion puts the original and polynomial features on the same scale, so the model is not biased toward large-valued polynomial terms and treats all features fairly.
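
   The reasoning in answer 4 can be checked on a made-up column using only the standard library. The `toy_mileage` values and the `standardize` helper are illustrative assumptions; the sketch compares the mean and standard deviation of a squared feature under both orders.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical mileage values, not the lab data
toy_mileage = [1000.0, 5000.0, 20000.0, 60000.0]

# Order A: standardize first, then square
scaled_then_squared = [z ** 2 for z in standardize(toy_mileage)]
# Order B: square first, then standardize
squared_then_scaled = standardize([v ** 2 for v in toy_mileage])

print("A mean/std:", round(fmean(scaled_then_squared), 3), round(pstdev(scaled_then_squared), 3))
print("B mean/std:", round(fmean(squared_then_scaled), 3), round(pstdev(squared_then_scaled), 3))
```

   In order A the squared feature no longer has zero mean or unit variance, while in order B it does, which matches the argument for expanding first and standardizing afterwards.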

# Save the Code File
Please save your code as a Jupyter Notebook file (Lab1.ipynb) and submit it to eeclass along with your prediction files (Lab1_basic.csv and Lab1_advanced.csv).