# Car Price Prediction
#### Project Objective

- [x] Prepare data and Exploratory data analysis (EDA)
- [ ] Use linear regression for predicting price
- [ ] Understanding the internals of linear regression
- [ ] Evaluating the model with RMSE
- [ ] Feature engineering
- [ ] Regularization
- [ ] Using the model

In [83]:
# Importing libraries.
import pandas as pd
import numpy as np

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# plot style setup
plt.style.use("tableau-colorblind10")
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [2]:
# Reading data into pandas dataframe using `read_csv()` method of pandas.
car_data = pd.read_csv("../input/cardataset/data.csv")

In [3]:
# Shape of the data
print("Shape::", car_data.shape)

# Data Preparation

In [5]:
# view first five rows.
car_data.head().T

In [8]:
# columns of dataframe
print("Columns::", car_data.columns.tolist())

In [13]:
# Clean up the column names: lowercase them and replace spaces with underscores.
car_data.columns = car_data.columns.str.lower()
car_data.columns = car_data.columns.str.replace(" ", "_")
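
As a quick check of the renaming step above, here is a minimal sketch on an empty toy frame. The column names are illustrative assumptions rather than a statement about the dataset's actual headers, and `toy` and `cleaned_columns` are hypothetical names.

```python
import pandas as pd

# Hypothetical column names, used only to illustrate the cleanup.
toy = pd.DataFrame(columns=["Engine HP", "Vehicle Size", "MSRP"])

# Same two steps as in the cell above, chained into one expression.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

cleaned_columns = toy.columns.tolist()
print(cleaned_columns)
```

Normalized names like these can be used with attribute access, which is how `car_data.msrp` is written later in the notebook.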

In [115]:
# view data
car_data.head().T

In [44]:
# Normalize the string values the same way: lowercase them and replace spaces with '_'.

# Categorical data
categorical_cols = car_data.select_dtypes(include='object').columns.tolist()
for col in categorical_cols:
    car_data[col] = car_data[col].str.lower().str.replace(" ", "_")

In [114]:
# view first rows of data
car_data.head().T

In [19]:
# check the datatype of columns using `dtypes` attribute of the pandas dataframe.
car_data.dtypes

In [25]:
# check the index of the dataframe using `index` attribute of the pandas dataframe.
car_data.index

# Exploratory data analysis (EDA)

In [35]:
# Check the unique values of each column using the `unique()` and `nunique()` methods of a pandas Series.
for col in car_data.columns:
    print()
    print(f"{col}: {car_data[col].unique()[:5]}")
    print("Number of unique values::", car_data[col].nunique())

## Missing Data

In [46]:
# Missing data
car_data.isnull().sum()

The columns `engine_fuel_type`, `engine_hp`, `engine_cylinders`, and `market_category` have missing values.
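
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to get both on a small toy frame. The values are made up for illustration, and `toy`, `missing_count`, `missing_share`, and `summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value; not taken from the real dataset.
toy = pd.DataFrame({
    "engine_hp": [335.0, np.nan, 300.0, 230.0],
    "msrp": [46135, 40650, 36350, 29450],
})

# isnull() gives a boolean frame: sum() counts the missing cells,
# mean() gives the fraction of rows that are missing.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})

# Show only the columns that actually have missing values.
print(summary[summary["missing"] > 0])
```

The same two calls can be applied to `car_data` directly.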

## Distribution of data

In [85]:
# Let's examine the distribution of car prices.
sns.histplot(data=car_data, 
             x='msrp', 
             bins=50, kde=True)

plt.title("Long tail distribution of car prices")
plt.xlabel("Car price")
plt.show()
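
A long tail can also be checked numerically: the mean sits well above the median and the skewness is clearly positive. The sketch below uses synthetic log-normal data as a stand-in for prices, which is an assumption made only so the example is self-contained. On the real data, the same methods would be called on `car_data['msrp']`.

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed sample (log-normal), standing in for car prices.
rng = np.random.default_rng(0)
synthetic_prices = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=10_000))

mean_price = synthetic_prices.mean()
median_price = synthetic_prices.median()
skewness = synthetic_prices.skew()  # positive for a right (long) tail

print(f"mean={mean_price:.0f}, median={median_price:.0f}, skew={skewness:.2f}")
```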

In [111]:
# Let's examine the distribution of car prices below $100,000.
sns.histplot(data = car_data[car_data.msrp < 100000],
             x='msrp', 
             bins=50, kde=True)

plt.title("Distribution of car prices less than 100000")
plt.xlabel("Car price")
plt.show()

In [113]:
# Log transformation: compress the long tail of the target variable `msrp` so that
# its distribution is closer to normal.
# NumPy's `log1p` function computes log(1 + x), which is also defined for x = 0.
# Note: this overwrites `msrp` in place, so re-running the cell applies the log twice.
car_data['msrp'] = np.log1p(car_data['msrp'])

# Examine the distribution of the target variable `car price` after log transformation.
sns.histplot(data=car_data, 
             x='msrp', 
             bins=50, kde=True)

plt.title("Distribution of car prices after log transformation")
plt.xlabel("log(1 + car price)")
plt.show()
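
Once the target is on the log scale, values have to be mapped back to dollars before they can be read as prices. A minimal sketch of the round trip, assuming a few made-up prices (`prices`, `log_prices`, and `restored` are hypothetical names):

```python
import numpy as np

# Made-up prices, including 0 to show why log1p is used instead of log.
prices = np.array([0.0, 2000.0, 29450.0, 2065902.0])

# log1p(x) = log(1 + x), so a price of 0 maps to 0 rather than -inf.
log_prices = np.log1p(prices)

# expm1(y) = exp(y) - 1 is the exact inverse of log1p.
restored = np.expm1(log_prices)

print(np.allclose(restored, prices))
```

The same `np.expm1` call would convert predictions made on the log scale back to the original price scale.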