# Car Price Prediction Project
## Project Overview

Car price prediction uses machine learning to estimate a car's market value from attributes such as brand, body style, horsepower, and fuel efficiency. Such estimates help buyers and sellers make informed decisions and help businesses set pricing strategy. This notebook trains two regression models on a dataset of car specifications and compares how well each one predicts price on held-out data.

### Column Description of Car Dataset

1. **car_ID**: Unique identifier for each car.
2. **symboling**: Risk factor assigned to the car (-3 to +3), where a higher value indicates more risk.
3. **CarName**: Name of the car (includes brand and model).
4. **fueltype**: Type of fuel used by the car (e.g., 'gas', 'diesel').
5. **aspiration**: Type of aspiration used in the engine (e.g., 'std', 'turbo').
6. **doornumber**: Number of doors on the car (e.g., 'two', 'four').
7. **carbody**: Body style of the car (e.g., 'sedan', 'hatchback').
8. **drivewheel**: Type of drive wheel (e.g., 'fwd' - front-wheel drive, 'rwd' - rear-wheel drive, '4wd' - four-wheel drive).
9. **enginelocation**: Location of the engine (e.g., 'front', 'rear').
10. **wheelbase**: Distance between the front and rear wheels (in inches).
11. **carlength**: Length of the car (in inches).
12. **carwidth**: Width of the car (in inches).
13. **carheight**: Height of the car (in inches).
14. **curbweight**: Weight of the car without passengers or cargo (in pounds).
15. **enginetype**: Type of engine (e.g., 'ohc', 'ohcf', 'dohc').
16. **cylindernumber**: Number of cylinders in the engine (e.g., 'four', 'six').
17. **enginesize**: Size of the engine (in cubic inches).
18. **fuelsystem**: Fuel system used (e.g., 'mpfi', '2bbl').
19. **boreratio**: Bore ratio of the engine (diameter of the cylinder).
20. **stroke**: Stroke of the engine (length of the piston travel).
21. **compressionratio**: Ratio of the engine's cylinder volume to its combustion chamber volume.
22. **horsepower**: Engine power (in horsepower).
23. **peakrpm**: Peak revolutions per minute (RPM) of the engine.
24. **citympg**: Fuel efficiency in the city (miles per gallon).
25. **highwaympg**: Fuel efficiency on the highway (miles per gallon).
26. **price**: Price of the car (in dollars).
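
Because `CarName` combines brand and model, one option is to split the brand out as its own feature. The sketch below is an assumption about how that could be done, not a step taken in this notebook. The `brand` column name is hypothetical, and the sample names are copied from the first rows of the dataset printed in Step 2.

```python
import pandas as pd

# Sample CarName values copied from the first rows of the dataset.
cars = pd.DataFrame({
    "CarName": [
        "alfa-romero giulia",
        "alfa-romero stelvio",
        "alfa-romero Quadrifoglio",
        "audi 100 ls",
        "audi 100ls",
    ]
})

# Treat everything before the first space as the brand (hypothetical column),
# lower-cased so that differences in capitalization collapse together.
cars["brand"] = cars["CarName"].str.split(" ", n=1).str[0].str.lower()

print(cars["brand"].value_counts().to_dict())
```

A brand column has far fewer distinct values than the full name, which keeps one-hot encoding compact.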

## Step 1: Import Libraries

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


## Step 2: Load the Dataset

In [11]:
# Load the dataset
data = pd.read_csv('CarPrice_Assignment.csv')

# Display the first few rows of the dataset
print(data.head())


   car_ID  symboling                   CarName fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two   
1       2          3       alfa-romero stelvio      gas        std        two   
2       3          1  alfa-romero Quadrifoglio      gas        std        two   
3       4          2               audi 100 ls      gas        std       four   
4       5          2                audi 100ls      gas        std       four   

       carbody drivewheel enginelocation  wheelbase  ...  enginesize  \
0  convertible        rwd          front       88.6  ...         130   
1  convertible        rwd          front       88.6  ...         130   
2    hatchback        rwd          front       94.5  ...         152   
3        sedan        fwd          front       99.8  ...         109   
4        sedan        4wd          front       99.4  ...         136   

   fuelsystem  boreratio  stroke compressionratio horsepower  peakrpm citympg  \
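
`print(data.head())` wraps wide tables, so it can help to also check the shape and the parsed column types. The following is a minimal sketch that rebuilds the first six columns of the rows printed above from an in-memory CSV string, so it runs without the data file. The names `csv_text`, `sample`, `numeric_cols`, and `text_cols` are illustrative.

```python
import io
import pandas as pd

# First six columns of the first five rows, copied from the printed output.
csv_text = """car_ID,symboling,CarName,fueltype,aspiration,doornumber
1,3,alfa-romero giulia,gas,std,two
2,3,alfa-romero stelvio,gas,std,two
3,1,alfa-romero Quadrifoglio,gas,std,two
4,2,audi 100 ls,gas,std,four
5,2,audi 100ls,gas,std,four
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # which columns were parsed as numbers and which as text

# Columns parsed as numeric; the remaining ones are text that needs encoding.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(text_cols)
```

Running the same three checks on the full `data` frame shows how many rows were loaded and which columns need encoding.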

## Step 3: Data Preprocessing

### Handling Missing Values

In [12]:
# Check for missing values
print(data.isnull().sum())

# Drop rows with missing values (if any)
data.dropna(inplace=True)


car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64
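
Every count above is zero, so `dropna` removes nothing from this dataset. For data that does contain gaps, dropping rows discards information, and a common alternative is imputation. The following is a minimal sketch, assuming scikit-learn's `SimpleImputer` with a median strategy. The toy values are made up for illustration and do not come from the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up toy values with one missing entry per column (not from the dataset).
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 120.0, 140.0],
    "citympg": [30.0, 25.0, np.nan, 20.0],
})
print(toy.isnull().sum())

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In a real workflow the imputer would usually sit inside the preprocessing pipeline so that its medians are learned from the training split only.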


### Encoding Categorical Variables

In [14]:
# Define categorical and numerical columns
categorical_cols = [
    'CarName', 'fueltype', 'aspiration', 'doornumber', 'carbody',
    'drivewheel', 'enginelocation', 'enginetype', 'cylindernumber',
    'fuelsystem',
]
numerical_cols = [
    'symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight',
    'curbweight', 'enginesize', 'boreratio', 'stroke', 'compressionratio',
    'horsepower', 'peakrpm', 'citympg', 'highwaympg',
]

# Create a preprocessor for encoding and scaling
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ])
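
To see what this kind of preprocessor produces, the sketch below applies the same two transformers to a tiny made-up frame and then to a row whose category was never seen during fitting. The data and the names `train`, `unseen`, and `demo_preprocessor` are illustrative. `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy rows (not from the dataset): one numeric, one categorical column.
train = pd.DataFrame({
    "horsepower": [100.0, 120.0, 140.0],
    "fueltype": ["gas", "diesel", "gas"],
})
unseen = pd.DataFrame({"horsepower": [120.0], "fueltype": ["electric"]})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["horsepower"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fueltype"]),
    ],
    sparse_threshold=0,  # always return a dense array so it prints readably
)

train_encoded = demo_preprocessor.fit_transform(train)
unseen_encoded = demo_preprocessor.transform(unseen)

print(train_encoded.shape)  # one scaled column plus one column per category
print(unseen_encoded)       # one-hot columns are all zero for the unseen category
```

With `handle_unknown='ignore'`, a category that appears only in the test set is encoded as all zeros instead of raising an error at prediction time.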


## Step 4: Split the Data into Training and Testing Sets

In [15]:
# Define features and target variable
X = data.drop(columns=['car_ID', 'price'])
y = data['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
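
The effect of `test_size=0.2` and `random_state=42` can be checked on a small stand-in array. This is a sketch on ten made-up rows, not on the car data, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for the real feature matrix and target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Same seed, same split: the held-out rows are identical on a second call.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` makes the split reproducible, so both models are later scored on exactly the same held-out rows.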


## Step 5: Build and Train the Model

### Random Forest Regressor

In [16]:
# Create a pipeline with the preprocessor and the model
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))])

# Train the model
rf_pipeline.fit(X_train, y_train)


### Linear Regression

In [18]:
# Create a pipeline with the preprocessor and the model
lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', LinearRegression())])

# Train the model
lr_pipeline.fit(X_train, y_train)
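
Both pipelines above receive the same `preprocessor` object. With caching disabled (the default), a scikit-learn `Pipeline` fits its steps in place rather than copying them, so the two pipelines share one fitted transformer. That is harmless here because both are fitted on the same `X_train`, but separate copies avoid surprises if the training data ever differs. A minimal sketch, assuming `sklearn.base.clone`; the single-column preprocessor and the `_demo` names are illustrative:

```python
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative preprocessor with a single numeric column.
base_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["horsepower"])]
)

# clone() returns a new, unfitted estimator with the same parameters.
rf_demo = Pipeline(steps=[
    ("preprocessor", clone(base_preprocessor)),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42)),
])
lr_demo = Pipeline(steps=[
    ("preprocessor", clone(base_preprocessor)),
    ("regressor", LinearRegression()),
])

shared = rf_demo.named_steps["preprocessor"] is lr_demo.named_steps["preprocessor"]
print(shared)
```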


## Step 6: Evaluate the Models

In [19]:
# Function to evaluate the model
def evaluate_model(model, X_test, y_test):
    """Return (MSE, R2) of a fitted model's predictions on the test set."""
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

# Evaluate Random Forest Regressor
rf_mse, rf_r2 = evaluate_model(rf_pipeline, X_test, y_test)
print(f'Random Forest Regressor - MSE: {rf_mse}, R2: {rf_r2}')

# Evaluate Linear Regression
lr_mse, lr_r2 = evaluate_model(lr_pipeline, X_test, y_test)
print(f'Linear Regression - MSE: {lr_mse}, R2: {lr_r2}')


Random Forest Regressor - MSE: 3524118.638162779, R2: 0.9553592710518894
Linear Regression - MSE: 41996850.6932343, R2: 0.4680173339884177


## Conclusions

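You have built and evaluated two car price prediction models, a Random Forest Regressor and a Linear Regression, on the same held-out test set. The Random Forest reached an R2 of about 0.955 (MSE about 3.52 million), while Linear Regression reached about 0.468 (MSE about 42.0 million), so the Random Forest predicted prices on this split far more accurately. Along the way, the project demonstrates data preprocessing, feature scaling, categorical encoding, and model evaluation.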

Feel free to adjust the hyperparameters of the models and experiment with different algorithms to improve the performance of your car price prediction model.
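
One way to adjust hyperparameters systematically is a cross-validated grid search over a pipeline. The following is a minimal sketch on synthetic data, assuming scikit-learn's `GridSearchCV`. The grid values are arbitrary starting points rather than tuned recommendations, and the `_syn` and `search` names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (not the car dataset).
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(80, 3))
y_syn = 5 * X_syn[:, 0] + 2 * X_syn[:, 1] + rng.normal(0, 0.1, size=80)

search_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(random_state=42)),
])

# Keys follow the "<step name>__<parameter>" convention to reach into the pipeline.
param_grid = {
    "regressor__n_estimators": [50, 100],
    "regressor__max_depth": [None, 5],
}

# 5-fold cross-validation, scored by R2, over every combination in the grid.
search = GridSearchCV(search_pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern should apply to `rf_pipeline` with `X_train` and `y_train`, since its model step is also named `regressor`.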