# CSI4142 - Group 48 - Assignment 3 - Part 1

---

## Introduction
In this report, we conduct an empirical study to evaluate linear regression on a regression task, using the ___ dataset. The study follows these steps:

1. Clean the data
2. Encode categorical features as numerical features
3. Conduct an EDA (Exploratory Data Analysis) to visualize the data and find outliers in the features using LOF (Local Outlier Factor)
4. (Optional) Explore the `LinearRegression` method provided by scikit-learn (or other packages)
5. Program a feature aggregator to create 2 additional features
6. Split the data into train, validation, and test sets; choose an evaluation metric (e.g., MSE, RMSE, R²); establish a baseline using linear regression without outlier removal or feature aggregation; perform 4-fold cross-validation on different system variations; select the best model; and evaluate it on the untouched test set
7. Analyze the results
8. Discuss the outliers and feature aggregation, and how the results on the unseen test set compare to the cross-validation results
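
As a rough illustration of steps 3 and 6, the sketch below flags outliers with LOF and compares a baseline against an outlier-filtered variation using 4-fold cross-validation. It runs on synthetic data, not the car dataset, and the names `cv_rmse`, `X_dev`, and `inlier_mask` are hypothetical. For brevity, LOF is fitted once on the whole development set; a stricter setup would fit it inside each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in data (an assumption, NOT the car dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=400)

# Hold out a test set that stays untouched until the final evaluation
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# LOF labels inliers as 1 and outliers as -1
lof = LocalOutlierFactor(n_neighbors=20)
inlier_mask = lof.fit_predict(X_dev) == 1


def cv_rmse(X_part, y_part):
    """Mean RMSE over 4 folds (scikit-learn reports the score negated)."""
    folds = KFold(n_splits=4, shuffle=True, random_state=0)
    scores = cross_val_score(
        LinearRegression(), X_part, y_part,
        cv=folds, scoring="neg_root_mean_squared_error",
    )
    return -scores.mean()


# Compare the two system variations on the development data only
baseline_rmse = cv_rmse(X_dev, y_dev)
filtered_rmse = cv_rmse(X_dev[inlier_mask], y_dev[inlier_mask])
print(f"Baseline CV RMSE: {baseline_rmse:.4f}")
print(f"LOF-filtered CV RMSE: {filtered_rmse:.4f}")

# Refit the better variation and score it once on the held-out test set
if filtered_rmse < baseline_rmse:
    X_fit, y_fit = X_dev[inlier_mask], y_dev[inlier_mask]
else:
    X_fit, y_fit = X_dev, y_dev
model = LinearRegression().fit(X_fit, y_fit)
test_rmse = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))
print(f"Test RMSE: {test_rmse:.4f}")
```

Comparing `test_rmse` with the cross-validation scores is the kind of check step 8 calls for: a large gap between the two would suggest that the selected variation does not generalize as well as cross-validation implied.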


#### Group 48 Members
- Ali Bhangu - 300234254
- Justin Wang - 300234186

<br>

---

## Dataset Descriptions

### Car Detail Dataset

- **Dataset Name:** Vehicle Dataset
- **Author:** Nehal Birla, Nishant Verma, Nikhil Kushwaha (Kaggle)
- **Purpose:** This dataset contains information about used cars. It can be used for many purposes, such as price prediction, to illustrate linear regression in machine learning.

#### Dataset Shape
- **Rows:** 4339
- **Columns:** 8

#### Features & Descriptions
| Feature Name       | Data Type  | Category    | Description |
|--------------------|------------|-------------|-------------|
| `name`             | String     | Categorical | Name of the car |
| `year`             | Integer    | Numerical   | Year in which the car was bought |
| `selling_price`    | Integer    | Numerical   | Price at which the car is being sold |
| `km_driven`        | Integer    | Numerical   | Number of kilometres the car has been driven |
| `fuel`             | String     | Categorical | Fuel type of the car (petrol / diesel / CNG / LPG / electric) |
| `seller_type`      | String     | Categorical | Whether the seller is an individual or a dealer |
| `transmission`     | String     | Categorical | Gear transmission of the car (Automatic / Manual) |
| `owner`            | String     | Categorical | Ownership count of the car (e.g., First Owner, Second Owner) |
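
The split between numerical and categorical features in the table above can be checked programmatically. The following is a minimal sketch that rebuilds two sample rows by hand (values copied from the notebook output) instead of reading the CSV; the names `sample`, `numeric_cols`, and `categorical_cols` are hypothetical.

```python
import pandas as pd

# Two hand-copied sample rows; the full dataset is loaded from CSV later on
sample = pd.DataFrame({
    "name": ["Maruti 800 AC", "Hyundai Verna 1.6 SX"],
    "year": [2007, 2012],
    "selling_price": [60000, 600000],
    "km_driven": [70000, 100000],
    "fuel": ["Petrol", "Diesel"],
    "seller_type": ["Individual", "Individual"],
    "transmission": ["Manual", "Manual"],
    "owner": ["First Owner", "First Owner"],
})

# Numeric dtypes form the numerical features; everything else is categorical
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Running the same two `select_dtypes` calls on the loaded DataFrame would show whether pandas infers the types the table lists.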

---

In [8]:
# Importing the required Python libraries
import numpy as npy
import pandas as pd
from fuzzywuzzy import fuzz
import os
import re
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

In [7]:
# Define paths
csv_path = "CAR.csv"

# Load dataset
carSet = pd.read_csv(csv_path)
print("Dataset loaded successfully.")
carSet.tail()

Dataset loaded successfully.


Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner
4339,Renault KWID RXT,2016,225000,40000,Petrol,Individual,Manual,First Owner


### a) Clean Data

In [None]:
# TBD

### b) Categorical feature encoding

In [None]:
print(f"Car data : \n{carSet.head()}")

# Select the text (object-dtype) columns to encode
categorical_columns = carSet.select_dtypes(include=['object']).columns.tolist()

# Dense output so the result can be wrapped in a DataFrame
encoder = OneHotEncoder(sparse_output=False)

one_hot_encoded = encoder.fit_transform(carSet[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Append the one-hot columns, then drop the original categorical columns
df_encoded = pd.concat([carSet, one_hot_df], axis=1)

df_encoded = df_encoded.drop(categorical_columns, axis=1)
print(f"Encoded car data : \n{df_encoded.head()}")


Car data : 
                       name  year  selling_price  km_driven    fuel  \
0             Maruti 800 AC  2007          60000      70000  Petrol   
1  Maruti Wagon R LXI Minor  2007         135000      50000  Petrol   
2      Hyundai Verna 1.6 SX  2012         600000     100000  Diesel   
3    Datsun RediGO T Option  2017         250000      46000  Petrol   
4     Honda Amaze VX i-DTEC  2014         450000     141000  Diesel   

  seller_type transmission         owner  
0  Individual       Manual   First Owner  
1  Individual       Manual   First Owner  
2  Individual       Manual   First Owner  
3  Individual       Manual   First Owner  
4  Individual       Manual  Second Owner  
Encoded car data : 
                       name  year  selling_price  km_driven    fuel  \
0             Maruti 800 AC  2007          60000      70000  Petrol   
1  Maruti Wagon R LXI Minor  2007         135000      50000  Petrol   
2      Hyundai Verna 1.6 SX  2012         600000     100000 