# <center> Predicting Car Prices </center>

In this project, we will be predicting car prices using a dataset of car features and prices. We will use a variety of techniques, including data cleaning, exploratory data analysis, feature engineering, and machine learning modeling.


In [None]:
# !pip install pandas
# !pip install matplotlib
# !pip install seaborn

#Import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
#Load Data

#Check the data
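
The cells that follow expect a DataFrame named `df`. A minimal sketch of a loading step with `pd.read_csv`, assuming the data is a CSV that contains the `CarName` and `price` columns used later in this notebook. The inline rows, the name `demo_df`, and the file name in the comment are hypothetical stand-ins, not the project's actual data.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for the real file; only the column
# names CarName and price come from the later cells of this notebook.
demo_csv = io.StringIO(
    "CarName,price\n"
    "acme roadster,13495.0\n"
    "acme hatch,6500.0\n"
    "zenith sedan,21000.0\n"
)

# pd.read_csv accepts a file-like object or a path. For a file on disk,
# pass its path instead, e.g. pd.read_csv("cars.csv") (hypothetical name).
demo_df = pd.read_csv(demo_csv)

# Check the data
print(demo_df.head())
```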



In [None]:
#Checking the dimensions of the data
df.shape

In [None]:
# Basic information about the dataset
df.info()

In [None]:
#Statistical analysis of the data
df.describe()

In [None]:
#Get count of missing values in each column
df.isnull().sum().to_frame().rename(columns={0:"Total No. of Missing Values"})

In [None]:
#Show categorical variables
df.select_dtypes(include="object").head()

In [None]:
# Show numerical variables.
df.select_dtypes(include="number").head()

## Data Cleaning

Data cleaning involves checking for missing values, outliers, and errors in the data. We will also perform some basic data transformations, such as converting categorical variables to numerical variables and scaling numerical variables.
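
The outlier check mentioned above could be sketched as follows. This is only an illustration under stated assumptions: the 1.5 × IQR rule is one common convention that this notebook does not itself prescribe, and the helper `iqr_outlier_mask` and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper for illustration."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up prices; the last one is far above the rest.
demo_prices = pd.Series([5000, 6000, 7000, 8000, 9000, 45000], name="price")
demo_mask = iqr_outlier_mask(demo_prices)

# Show only the rows flagged as outliers
print(demo_prices[demo_mask])
```

Flagged rows are worth inspecting before removing them, since an expensive car can be a genuine data point rather than an error.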


In [None]:
#Extract the company name (the first word of CarName) into a new column, CompanyName

Company_Name = df["CarName"].apply(lambda x: x.split(" ")[0])
df.insert(2,"CompanyName",Company_Name)

# Now we can drop the CarName Feature.
df.drop(columns=["CarName"],inplace=True)

In [None]:
#Check for spelling mistakes in car company names
df["CompanyName"].unique()

In [None]:
#Fix spelling mistakes in Car company name
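def replace(a,b):
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which may not update df under pandas copy-on-write.
    df["CompanyName"] = df["CompanyName"].replace(a,b)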

replace('maxda','mazda')
replace('porcshce','porsche')
replace('toyouta','toyota')
replace('vokswagen','volkswagen')
replace('vw','volkswagen')

df["CompanyName"].unique()

## Exploratory Data Analysis

Next, we will perform exploratory data analysis to understand the relationships between the variables and the target variable (car price). We will use visualization techniques such as scatterplots, histograms, and box plots to explore the data and identify any trends or patterns.

### Visualizing Car Company w.r.t Price.


In [None]:
#Draw a boxplot of price by company and a bar graph of average price by company
plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
sns.boxplot(x="CompanyName",y="price",data=df)
plt.xticks(rotation=90)
plt.title("Car Company vs Price", pad=10, fontweight="black", fontsize=20)

plt.subplot(1,2,2)
x = pd.DataFrame(df.groupby("CompanyName")["price"].mean().sort_values(ascending=False))
sns.barplot(x=x.index,y="price",data=x) 
plt.xticks(rotation=90)
plt.title("Car Company vs Average Price", pad=10, fontweight="black", fontsize=20)
plt.tight_layout()
plt.show()



Insights

    Jaguar and Buick seem to have the highest-priced cars.
    Car companies such as Nissan, Renault, and Mercury have only one or two data points,
    so we cannot draw any conclusions about the lowest-priced car companies.

Note

    Because the car company feature has too many categories, we can derive a new feature, Company Price Range, which labels each company as Low Range, Medium Range, or High Range.



## Feature Engineering

Based on our exploratory data analysis, we will perform feature engineering to create new variables that may be useful in predicting car prices. This can include combining existing variables, creating interaction terms, and transforming variables to better capture their relationship with the target variable.
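
A minimal sketch of the three kinds of derived features described above: a combined variable, an interaction term, and a transformed variable. The column names `horsepower`, `curbweight`, and `price` appear in this dataset, but the sample values and the new feature names (`power_to_weight`, `hp_x_weight`, `log_price`) are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
demo_cars = pd.DataFrame({
    "horsepower": [70, 110, 200],
    "curbweight": [2000, 2500, 3500],
    "price": [6000.0, 12000.0, 35000.0],
})

# Combining existing variables: a ratio of two columns
demo_cars["power_to_weight"] = demo_cars["horsepower"] / demo_cars["curbweight"]

# Interaction term: the product of two columns
demo_cars["hp_x_weight"] = demo_cars["horsepower"] * demo_cars["curbweight"]

# Transformation: the natural log compresses a right-skewed variable
demo_cars["log_price"] = np.log(demo_cars["price"])

print(demo_cars)
```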

In [None]:
# Deriving new features from the "CompanyName" feature.
# As noted above, we can group the car companies into price ranges (Low Range, Medium Range, High Range) based on each company's mean price.
z = round(df.groupby(["CompanyName"])["price"].agg(["mean"]),2).T
z

In [None]:
df = df.merge(z.T,how="left",on="CompanyName")
bins = [0,10000,20000,40000]
cars_bin=['Budget','Medium','Highend']
df['CarsRange'] = pd.cut(df['mean'],bins,right=False,labels=cars_bin)
df.head()
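
To make the binning above easier to follow, here is a small sketch of how `pd.cut` treats the bin edges when `right=False`. The bin edges and labels match the cell above; the sample mean prices and the `demo_` names are hypothetical.

```python
import pandas as pd

# Made-up mean prices chosen to sit on and around the bin edges
demo_means = pd.Series([9999.99, 10000.0, 19999.0, 20000.0, 36000.0, 40000.0])
demo_bins = [0, 10000, 20000, 40000]
demo_labels = ["Budget", "Medium", "Highend"]

# right=False makes each interval closed on the left and open on the right:
# [0, 10000), [10000, 20000), [20000, 40000)
demo_ranges = pd.cut(demo_means, demo_bins, right=False, labels=demo_labels)

print(demo_ranges.tolist())
```

A value equal to the last edge (40000 here) falls outside every interval and becomes `NaN`, so the top edge should sit above the largest company mean.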

## Data Preprocessing

Data preprocessing involves preparing the data for analysis by cleaning, transforming, and normalizing it. This can involve steps such as removing missing values, scaling numerical variables, encoding categorical variables, and splitting the data into training and testing sets. 

In [None]:
# Creating new DataFrame with all the useful Features.

# .copy() makes new_df independent of df, so later edits do not touch df
new_df = df[['fueltype','aspiration','doornumber','carbody','drivewheel','enginetype','cylindernumber','fuelsystem',
             'wheelbase','carlength','carwidth','curbweight','enginesize','boreratio','horsepower','citympg','highwaympg',
             'price','CarsRange']].copy()

In [None]:
new_df.head()


In [None]:
new_df = pd.get_dummies(columns=["fueltype","aspiration","doornumber","carbody","drivewheel","enginetype",
                                "cylindernumber","fuelsystem","CarsRange"],data=new_df)

In [None]:
new_df.head()

In [None]:
%pip install scikit-learn
from sklearn.preprocessing import StandardScaler
# Feature Scaling of Numerical Data
scaler = StandardScaler()

In [None]:
num_cols = ['wheelbase','carlength','carwidth','curbweight','enginesize','boreratio','horsepower',
            'citympg','highwaympg']

new_df[num_cols] = scaler.fit_transform(new_df[num_cols])

In [None]:
new_df.head()

In [None]:
# Selecting Features & Labels for Model Training & Testing
x = new_df.drop(columns=["price"])
y = new_df["price"]

In [None]:
x.shape,y.shape

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)


In [None]:
print("x_train - >  ",x_train.shape)
print("x_test - >  ",x_test.shape)
print("y_train - >  ",y_train.shape)
print("y_test - >  ",y_test.shape)
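
The cells above scale the numerical columns before the split, so the scaler's mean and standard deviation are computed from rows that later land in the test set. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. All `demo_` names and the random values are hypothetical; only the column names come from this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, generated with a fixed seed
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "horsepower": rng.normal(100, 30, size=50),
    "curbweight": rng.normal(2500, 400, size=50),
    "price": rng.normal(13000, 5000, size=50),
})

demo_x = demo.drop(columns=["price"])
demo_y = demo["price"]

# Split first ...
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42
)

# ... then learn the scaling parameters from the training rows only
demo_scaler = StandardScaler()
demo_x_train_scaled = demo_scaler.fit_transform(demo_x_train)

# The test rows are transformed with the training mean and std
demo_x_test_scaled = demo_scaler.transform(demo_x_test)

print(demo_x_train_scaled.shape, demo_x_test_scaled.shape)
```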

## Machine Learning Modeling

Finally, we will build a machine learning model to predict car prices. We will use a variety of models, such as linear regression, decision trees, and random forests, and evaluate their performance using metrics such as mean squared error and R-squared.
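
As a quick illustration of the two metrics named above, the sketch below computes mean squared error and R-squared with scikit-learn on a handful of made-up values. The arrays and the `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices; every prediction is off by 1000
demo_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
demo_pred = np.array([11000.0, 14000.0, 21000.0, 24000.0])

# Mean squared error: the average of the squared residuals
demo_mse = mean_squared_error(demo_true, demo_pred)

# Root mean squared error is in the same units as the price
demo_rmse = np.sqrt(demo_mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
demo_r2 = r2_score(demo_true, demo_pred)

print("MSE:", demo_mse)
print("RMSE:", demo_rmse)
print("R2:", demo_r2)
```

Here the residual sum of squares is 4,000,000 and the total sum of squares is 125,000,000, so R-squared is 1 - 4/125 = 0.968.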

In [None]:
training_score = []
testing_score = []

In [None]:
def model_prediction(model):
    # Fit the model, then score it on both the training and the test set
    model.fit(x_train,y_train)
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)
    # R-squared expressed as a percentage
    a = r2_score(y_train,y_train_pred)*100
    b = r2_score(y_test,y_test_pred)*100
    training_score.append(a)
    testing_score.append(b)

    print(f"r2_Score of {model} model on Training Data is:",a)
    print(f"r2_Score of {model} model on Testing Data is:",b)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
model_prediction(LinearRegression())

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
model_prediction(DecisionTreeRegressor())

In [None]:
model_prediction(RandomForestRegressor())

In [None]:
models = ["Linear Regression","Decision Tree","Random Forest"]

In [None]:
# Use a separate name so the results table does not overwrite the car data in df
results_df = pd.DataFrame({"Algorithms":models,
                           "Training Score":training_score,
                           "Testing Score":testing_score})
results_df