<a href="https://colab.research.google.com/github/lucifernob/Exploratory-Data-Analysis-of-Car-Features/blob/master/Exploratory_Data_Analysis_of_Car_Features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Exploratory Data Analysis of Car Features**


---



# Table of Contents

[Problem statement](#scrollTo=cLYy6-OZaxjA)

1. [Importing libraries](#scrollTo=txOHBj0enPui)
      *   Loading the data file
      *   Loading the data into a DataFrame
      *   Checking the types of data and basic summary stats
2. [Dropping irrelevant data](#scrollTo=DX1l2uVysU_Z)
3. [Renaming the columns](#scrollTo=N7gAUtMDwuDe)
4. [Removing duplicate data](#scrollTo=Uvz31baGA1a2)
      *   Dropping duplicate rows
      *   Dropping null/missing values
5. [Detecting outliers](#scrollTo=C5qMTUprWnoh)
6. [Top car brands and average price](#scrollTo=uRU42vmga-26)
      *   Top 10 car brands
      *   Average price of car
      *   Top 5 highest-priced cars
      *   Top 5 least popular cars
7. [Correlation matrix](#scrollTo=uKC0lEmbFE8i)
8. [Plotting different graphs & Performing EDA](#scrollTo=-eOQrWVu2byn)
      *   Scatter plots
      *   Heat Map
      *   Most sold car segment
      *   Forming new group "price_group"
      *   Multivariate Graphs
      *   Pair Plot
9. [Basic Machine learning model](#scrollTo=mYmYeeKC2mKV)
      *   Splitting the dataset
      *   ML with linear regression
10. [Spot checking algorithms](#scrollTo=g03tzB_J7j8s)
      *   Polynomial regression
      *   SVR regression
      *   Random Forest

[Downloading the output graphs we made](#scrollTo=Wt6UZ9GajDd1)

[Conclusion](#scrollTo=I5FAHdD8jFcD)

---

## **Problem Statement**
The goal is to perform exploratory data analysis (EDA) on how the different features of a car relate to its price.

The code below uses Pandas to walk through the basic steps of EDA: cleaning, combining, reshaping, and transforming data for analysis.

EDA is a critical building block of data analysis. We use it for purposes such as:

*   Finding patterns in the data
*   Determining relationships in the data
*   Detecting mistakes, and many more.

The data comes from the [Kaggle dataset](https://www.kaggle.com/CooperUnion/cardataset) "Car Features and MSRP". It describes almost 12,000 car models, sold in the USA between 1990 and 2017, with the market price (new or used) and some features.

# **1. Importing libraries**

Import all the libraries required for the project.

In [None]:
import pandas as pd                                                             # Data manipulation and analysis
import numpy as np                                                              # Multi-dimensional arrays and matrices
import seaborn as sns                                                           # High-level data visualisation
import matplotlib.pyplot as plt                                                 # Plotting library for Python
%matplotlib inline

**1.1 Load data file**

The data file is in .csv format, and there are three main ways to import it:
- From the local drive (used here for ease)
- From a URL
- From Google Drive


> *Note: Importing from the local drive may require you to upload the data file every time you run the code, so importing from Google Drive is the better option.*



In [None]:
#Import data file from your google drive
'''
from google.colab import drive
drive.mount("/content/gdrive")

import pandas as pd
pd.read_csv('/content/gdrive/My Drive/Internship studio/Project/data.csv')      # Copy your file path and replace with the given path
'''

In [None]:
#Import data file from local drive
from google.colab import files
uploaded=files.upload()                                                         #Creates an upload widget to load the desired file from your local drive

**1.2 Loading the data into a DataFrame**

Load the required data file for data analysis, and check whether data is loaded properly.
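
The upload step returns a dictionary that maps each file name to its raw bytes, which is why the loading cell wraps the bytes in `io.BytesIO`. The following is a minimal sketch of the same pattern that runs outside Colab. The `uploaded` dictionary here is a hypothetical stand-in, and its rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the dict returned by files.upload():
# file name -> raw file content as bytes (rows are made up).
uploaded = {"data.csv": b"Make,Year,MSRP\nAcme,2015,25000\nAcme,2016,27000\n"}

# Wrap the bytes in a file-like object so read_csv can parse them.
demo_df = pd.read_csv(io.BytesIO(uploaded["data.csv"]))

# Quick sanity checks that the load worked.
print(demo_df.shape)
print(demo_df.columns.tolist())
```

Checking the shape and the column names right after loading is a cheap way to catch a wrong file or a wrong delimiter early.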

In [None]:
import io
df = pd.read_csv(io.BytesIO(uploaded['data.csv']))                              #Reading the file "data.csv"

In [None]:
#To display the top 10 rows
df.head(10)

In [None]:
#To display the bottom 10 rows
df.tail(10)

**1.3 Checking the types of data and basic summary stats**

Sometimes data is stored in the wrong format, for example integers stored as strings, and must be converted. We therefore check the data types here.


> Note: Don't proceed further before checking the data types.
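
If a numeric column does turn out to be stored as text, one possible fix is `pd.to_numeric`. The sketch below uses a made-up two-column frame, not the car data, and assumes that unparseable entries should become missing values.

```python
import pandas as pd

# Made-up rows: "HP" arrives as text, with one unparseable entry.
demo = pd.DataFrame({"Make": ["A", "B", "C"], "HP": ["300", "150", "n/a"]})
print(demo.dtypes)

# errors="coerce" turns values that cannot be parsed into NaN
# instead of raising, so they can be handled as missing data later.
demo["HP"] = pd.to_numeric(demo["HP"], errors="coerce")
print(demo.dtypes)
```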







In [None]:
df.info()                                                                       # This will give Index, Datatype and Memory information

No column required a change of format, so we proceed further. If you find any in your data, change its data type before moving ahead.

In [None]:
df.describe(include = "all")                                                    # Use include='all' option to generate descriptive statistics for all columns

From `describe` we can see that "Engine HP" and "Engine Cylinders" do not have a count of 11914 like the other parameters, which means they have some missing data. We will deal with these in a coming section.

# **2. Dropping irrelevant data**

Imported data often contains irrelevant data, i.e. columns or rows that are not needed for the analysis, so we remove the columns or rows that are less relevant to us.


> Data can also contain duplicate rows and missing values; depending on our needs we can remove them or impute new values. Remember: the more data we provide, the more accurate the result is likely to be.



In [None]:
df=df.drop(['Number of Doors','Market Category'], axis=1)                       #axis=1 refers to columns: drop the labelled columns
df.head(5)

In this case, parameters such as "Number of Doors" and "Market Category" do not have a big impact on our analysis, so we drop them.

# **3. Renaming the columns**
Some column names are long, so we rename some parameters to make the data easier to read.

In [None]:
df=df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price"})
df.head(10)                                                                     #Show the top 10 rows

# **4. Dropping duplicate data**

A large dataset is likely to contain duplicate rows or null values, so we should either remove them or impute new values.




In [None]:
df.shape                                                                        # size of data

**4.1 Dropping duplicate rows**

We will drop the rows which have duplicate data.
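
As a small illustration of what `duplicated()` and `drop_duplicates()` do, here is a sketch on a made-up four-row frame (not the car data):

```python
import pandas as pd

# Made-up rows: the first row appears three times.
demo = pd.DataFrame({
    "Make": ["A", "A", "B", "A"],
    "Price": [100, 100, 200, 100],
})

# duplicated() marks every repeat of an earlier row (the first copy is kept).
n_duplicates = int(demo.duplicated().sum())

# drop_duplicates() removes those repeats; reset_index tidies the row labels.
deduped = demo.drop_duplicates().reset_index(drop=True)
print(n_duplicates, deduped.shape)
```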

In [None]:
duplicate_rows_df=df[df.duplicated()]                                           #Finding duplicate rows
print("No. of duplicate rows= ", duplicate_rows_df.shape[0])                    #Print how many rows with duplicate data are present.

In [None]:
df=df.drop_duplicates()                                                         #Drop duplicate data
df.head(10)

In [None]:
df.shape                                                                        #Fewer rows are left after removing the duplicate rows.

**4.2 Dropping the missing or null values**

As with duplicates, a large dataset is likely to contain null values. This data set contains very few of them, so we simply drop the affected rows instead of imputing.

> *NOTE: Instead of removing the null values we can also impute the missing values. This approach is often better than dropping, as the more data we provide to the system, the more accurate the result is likely to be.*
> If we impute, we prefer the median of the column over the mean, because the median is more robust to outliers.
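
To see why the median is the more robust choice, consider this sketch on made-up horsepower values with one extreme entry. It is an illustration of the imputation alternative only; the notebook itself drops the rows.

```python
import numpy as np
import pandas as pd

# Made-up horsepower values with one extreme value and one gap.
demo = pd.DataFrame({"HP": [100.0, 120.0, 110.0, np.nan, 1000.0]})

mean_hp = demo["HP"].mean()      # pulled upward by the extreme value
median_hp = demo["HP"].median()  # barely affected by it

# Fill the gap with the median instead of dropping the row.
imputed = demo["HP"].fillna(median_hp)
print(mean_hp, median_hp)
```

Here the mean lands far above every typical value, while the median stays among them, so the imputed row looks like an ordinary car.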





In [None]:
#Printing the number of null values per column.
print (df.isnull().sum())                                                       # Shows the null count for each column; note that zeros (0) are not counted as null

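After the drop in Section 2, "Market Category" is no longer in the data frame, so the missing values that remain are in columns such as "HP" and "Cylinders" (originally "Engine HP" and "Engine Cylinders").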

In [None]:
#Dropping the null values
df=df.dropna()
df.count()

In [None]:
#Rechecking how many null values are there now after removing
print(df.isnull().sum())

# **5. Detecting outliers**

We will use box plots to visualise the outliers and then remove them.



In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['Year'],color="#AEB404");

# saving the plot
plt.savefig("Detecting outliers-1.pdf")

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['HP'],color="#74DF00");

# saving the plot
plt.savefig('Detecting outliers-2.pdf')

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['Cylinders']);

# saving the plot
plt.savefig('Detecting outliers-3.pdf')

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['MPG-H'],color="#424242");

# saving the plot
plt.savefig('Detecting outliers-4.pdf')

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['MPG-C'],color="#8000FF");

# saving the plot
plt.savefig('Detecting outliers-5.pdf')

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['Popularity'],color="#0489B1");

# saving the plot
plt.savefig('Detecting outliers-6.pdf')

In [None]:
sns.set_style("whitegrid")
sns.boxplot(x=df['Price'],color="#088A68");

# saving the plot
plt.savefig('Detecting outliers-7.pdf')

In [None]:
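Q1=df.quantile(0.25, numeric_only=True)                                         #First quartile (25th percentile) of each numeric column
Q3=df.quantile(0.75, numeric_only=True)                                         #Third quartile (75th percentile) of each numeric column
IQR=Q3-Q1                                                                       #Interquartile range
print(IQR)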

In [None]:
num=df[Q1.index]                                                                #Compare only the columns that have quartiles
df=df[~((num<(Q1-1.5*IQR))| (num>(Q3+1.5*IQR))).any(axis=1)]                    #Standard 1.5*IQR rule: drop rows with any value outside the fences
df.shape

After removing the irrelevant data and the outliers, the data size went from (11193, 14) to (8608, 14).
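
The rule used above keeps a value only if it lies between Q1 - 1.5·IQR and Q3 + 1.5·IQR. A minimal sketch of that rule on a single made-up column, so the fences can be checked by hand:

```python
import pandas as pd

# Made-up prices with one obvious outlier.
prices = pd.Series([20, 22, 23, 25, 27, 28, 30, 200])

q1 = prices.quantile(0.25)   # first quartile
q3 = prices.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # lower fence
upper = q3 + 1.5 * iqr       # upper fence

# Keep only the values inside the fences.
kept = prices[(prices >= lower) & (prices <= upper)]
print(lower, upper, len(kept))
```

Because the notebook applies the rule to every numeric column at once, a row is dropped as soon as any one of its values falls outside that column's fences.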

# **6. Most represented car brands**

In this section we find the top 10 car brands and calculate the average price per brand.

In [None]:
#Percentage of cars per brand
counts=df['Make'].value_counts()*100/sum(df['Make'].value_counts())

#Top 10 car brands
popular_labels=counts.index[:10]

#Plot
plt.figure(figsize=(14,6))
plt.barh(popular_labels,width=counts[:10],color="#086A87")
plt.title('Top 10 car brands', fontsize="15")
plt.show();

These are the top 10 car brands. **Chevrolet** has the most entries of all the car brands, i.e. it is the most represented brand in this dataset.
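
The percentage per brand can also be obtained with `value_counts(normalize=True)`, and the resulting index can drive a per-brand average without typing brand names by hand. A sketch on made-up data (the brand names `A`, `B`, `C` and the prices are hypothetical):

```python
import pandas as pd

# Made-up rows standing in for the cleaned car data.
demo = pd.DataFrame({
    "Make": ["A", "A", "A", "B", "B", "C"],
    "Price": [10, 20, 30, 40, 60, 100],
})

# normalize=True returns fractions that sum to 1; multiply by 100 for percent.
share = demo["Make"].value_counts(normalize=True) * 100

# The index is sorted by frequency, so its head holds the top brands.
top_brands = share.index[:2]

# Average price restricted to those top brands.
avg_price = demo[demo["Make"].isin(top_brands)].groupby("Make")["Price"].mean()
print(share.round(2))
print(avg_price)
```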

**6.1 Average price among the top car brands**

In [None]:
prices=df[['Make','Price']].loc[
                                (df['Make']=='Chevrolet') |
                                (df['Make']=='Toyota') |
                                (df['Make']=='Volkswagen') |
                                (df['Make']=='Nissan') |
                                (df['Make']=='GMC') |
                                (df['Make']=='Dodge') |
                                (df['Make']=='Mazda') | 
                                (df['Make']=='Honda') |
                                (df['Make']=='Suzuki') |
                                (df['Make']=='Infiniti')].groupby('Make').mean()
print(round(prices,2));                                                         #Printing the average price rounded to 2 decimal places

**6.2 Top 5 highest-priced cars**

In [None]:
df[df.Price.isin(df.Price.nlargest(5))].sort_values(['Model','Make','MPG-H','MPG-C','Popularity','Price'])

This shows the 5 **highest-priced** cars and their model details; these are likely to appeal to buyers in the high-income group.

**6.3 Top 5 least popular cars**

In [None]:
df[df.Popularity.isin(df.Popularity.nsmallest(5))].sort_values(['Model','Make','MPG-H','MPG-C',"Popularity",'Price'])

This shows the 5 **least popular** cars and their model details; these models could be avoided when selling.

# **7. Correlation matrix**

In [None]:
#Performing correlation on the numeric columns only
df.corr(numeric_only=True)

**Correlation & Anticorrelation**

From the above analysis we observe:


1.   High correlation between

      *   Cylinders and HP
      *   Highway mpg and City mpg
      *   HP and Price



The more cylinders a car has, the higher its horsepower, i.e. more power.

2.   High anticorrelation between

      *   Cylinders and highway mpg

More cylinders mean more power, which results in higher fuel consumption and hence lower mileage.
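
Rather than reading the strongest pairs off the matrix by eye, they can be listed programmatically. The sketch below does this on made-up values that mimic the relationships described above; the numbers are hypothetical and only the technique carries over.

```python
import numpy as np
import pandas as pd

# Made-up values: HP rises with cylinders, highway MPG falls.
demo = pd.DataFrame({
    "Cylinders": [4, 4, 6, 6, 8, 8],
    "HP": [120, 140, 200, 230, 300, 340],
    "MPG-H": [38, 36, 30, 28, 22, 20],
    "Make": ["A", "B", "A", "B", "A", "B"],
})

# numeric_only=True skips the text column "Make".
corr = demo.corr(numeric_only=True)

# Keep the upper triangle only, so each pair appears once
# and the diagonal of ones is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values()
print(pairs)
```

Sorting puts the strongest anticorrelations at the top of the list and the strongest correlations at the bottom.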








# **8. Plotting different graphs & Performing EDA**

**8.1 Scatterplots**

A scatter plot is used to visualise the relationship between two variables. From the previous result we see strong correlation between **Cylinders and HP**, **Highway mpg and City mpg**, and **HP and Price**, so we plot the graph for each pair to inspect the trend.
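
The scatter cells in this notebook plot the points only. If a trend line is wanted, one simple option is a least-squares straight line from `np.polyfit`. This is a sketch on made-up numbers, constructed so that price = 150·HP + 5000 exactly:

```python
import numpy as np

# Made-up data lying exactly on the line price = 150*hp + 5000.
hp = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
price = np.array([20000.0, 27500.0, 35000.0, 42500.0, 50000.0])

# deg=1 fits a straight line; coefficients come back highest power first.
slope, intercept = np.polyfit(hp, price, deg=1)

# Points of the fitted line, ready to be drawn over a scatter plot,
# e.g. with ax.plot(hp, trend) on an existing matplotlib axis.
trend = slope * hp + intercept
print(slope, intercept)
```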

In [None]:
#Scatterplot between HP & Cylinders

fig, ax=plt.subplots(figsize=(10,6))
ax.scatter(df['Cylinders'], df['HP'], color='#2E64FE')
sns.set()                                                                       #set background 'dark grid'
plt.title("Scatter Plot of Cylinders and HP", fontsize = 25)
ax.set_xlabel('Cylinders',fontsize= 15)
ax.set_ylabel('HP',fontsize= 15)
plt.show();

In [None]:
#Scatterplot between MPG-H & MPG-C

fig, ax=plt.subplots(figsize=(10,6))
ax.scatter(df['MPG-H'], df['MPG-C'], color='#800000')
plt.title("Scatter Plot between MPG-C MPG-H", fontsize = 25)
ax.set_xlabel('MPG-H',fontsize= 15)
ax.set_ylabel('MPG-C',fontsize= 15)
plt.show();

In [None]:
#Scatterplot between HP & Price

fig, ax=plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'],color='#04B431')
plt.title("Scatter Plot between Price and HP", fontsize = 25)
ax.set_xlabel('HP',fontsize= 15)
ax.set_ylabel('Price',fontsize= 15)
plt.show();

**8.2 Heat Map**

A heat map is also a good way to view the correlation between variables. The graph below shows which features are most strongly related to each other.

In [None]:
plt.figure(figsize=(12,8))
c=df.corr(numeric_only=True)
sns.heatmap(c,cmap="BrBG", annot=True, linewidths=0.5);

# saving the plot
plt.savefig('Heat Map.png')

As with the correlation matrix studied before, Cylinders and HP, Highway mpg and City mpg, and HP and Price are strongly correlated with each other, and we see high anticorrelation between Cylinders and highway mpg.

**8.3 Most sold car segment**

In [None]:
#Bar chart for the "Vehicle Style" variable
df["Vehicle Style"].value_counts().plot.bar(figsize=(10,6),color="#FF8000")
sns.set()  
plt.title("Most car sold",fontsize= 15)
plt.xlabel("Car type",fontsize= 15)
plt.ylabel("No. of vehicles",fontsize= 15);

# saving the plot
plt.savefig('Most car sold.png')

From the chart we can see that **Sedan** cars were the most sold cars, followed by 4dr SUV.

In [None]:
#Vehicle style type and Drive type analysis
sns.set_style("whitegrid")
plt.figure(figsize=(25,15))
sns.countplot(y="Vehicle Style", data=df, hue="Drive Mode")
plt.title("Vehicle Type v/s Drive mode Type", fontsize="20")
plt.ylabel("Vehicle Type",fontsize= 15)
plt.xlabel("Count of vehicles",fontsize= 15)
plt.show();

Looking deeper, we find that **front wheel drive sedans** followed by **all wheel drive 4dr SUVs** are the most preferred combinations.
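
The counts behind such a grouped count plot can be tabulated with `pd.crosstab`, which makes the most frequent combination easy to read off. A sketch on made-up rows (the counts are hypothetical):

```python
import pandas as pd

# Made-up rows: three front wheel drive sedans, two all wheel drive SUVs, etc.
demo = pd.DataFrame({
    "Vehicle Style": ["Sedan", "Sedan", "Sedan", "Sedan",
                      "4dr SUV", "4dr SUV", "Coupe"],
    "Drive Mode": ["front wheel drive", "front wheel drive", "front wheel drive",
                   "rear wheel drive", "all wheel drive", "all wheel drive",
                   "rear wheel drive"],
})

# Rows are vehicle styles, columns are drive modes, cells are counts.
table = pd.crosstab(demo["Vehicle Style"], demo["Drive Mode"])

# Flatten to (style, drive mode) pairs and pick the largest count.
top_combo = table.stack().idxmax()
print(table)
print(top_combo)
```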

**8.4 Forming new group "price_group"**

In [None]:
#Create a new group "price_group" and assign the value based on the car price

df["price_group"]=pd.cut(df["Price"],[0,10000,20000,40000,60000,80000,100000,500000],
                          labels=["<10k","10-19K","20-39K","40-59K","60-79K","80-99K",">100k"], include_lowest=True)
df["price_group"]=df["price_group"].astype(object)

In [None]:
(df["price_group"].value_counts()/len(df)*100).plot.bar(figsize=(10,6),color=('grey', 'red', 'green', 'blue', 'black'))
sns.set_style("whitegrid")
plt.title("Price Group bar diagram")
plt.ylabel("% of vehicles",fontsize= 15)
plt.xlabel("Price Group",fontsize= 15);

# saving the plot
plt.savefig('Price group.png')

Hence, we binned the price into groups (the code defines seven bins, and only the groups that contain cars appear in the chart). We can see that **more than 50% of cars are sold in the price range of "20-39k US dollars"**, and the fewest are sold in the price range 60-79k.
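
One detail of `pd.cut` worth knowing when reading these labels: the intervals are closed on the right by default, so a price exactly on a bin edge falls into the lower group. A sketch with made-up prices and a shortened, hypothetical list of bins:

```python
import pandas as pd

# Made-up prices; 10000 sits exactly on a bin edge.
prices = pd.Series([9000, 10000, 15000, 25000, 39000, 45000])
bins = [0, 10000, 20000, 40000, 60000]
labels = ["<10k", "10-19K", "20-39K", "40-59K"]

# Intervals are (left, right], so 10000 lands in "<10k".
# include_lowest=True additionally lets the first bin include its left edge (0).
groups = pd.cut(prices, bins, labels=labels, include_lowest=True)

# Share of cars per price group, in percent.
share = groups.value_counts(normalize=True) * 100
print(groups.tolist())
print(share.round(2))
```

If a car priced at exactly 10,000 should count as "10-19K" instead, `right=False` can be passed to `pd.cut` to close the intervals on the left.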

**8.5 Multivariate Graphs**

In [None]:
sns.lmplot(x='Year', y='Price', data=df, fit_reg=False, hue='Vehicle Size', height=8, aspect=1.5)
sns.set_style("whitegrid")
plt.title("Price distribution over the years w.r.t Vehicle size", fontsize="25")
plt.show();

The multivariate graph above shows the price distribution over the years with respect to vehicle size. Prices rise with the year, which may mean that people bought more expensive cars or simply that car prices increased over time. The important aspect to notice is that, in any time frame, **people prefer buying "Large" vehicle size cars**.

**8.6  Pair Plot**

In [None]:
sns.set_color_codes()
sns.set()  
sns.pairplot(df, hue='Vehicle Size', height=2, aspect=1.5)
plt.show();

This pairplot gives the observations which already have been referred from other graphs above, such as:

*     The most preferred car size over the years is "Large"
*     Large cars give 0-30 MPG-C and MPG-H, while midsize cars give above 30

# **9. Basic Machine learning model**

With "Price" as the target variable we will build a machine learning model.


In [None]:
x=df[["Year","HP","Cylinders","MPG-H","MPG-C","Popularity"]]
y=df["Price"].values

In [None]:
#Feature scaling, which helps the optimisation converge faster

from sklearn.preprocessing import StandardScaler
sc_x=StandardScaler()
sc_y=StandardScaler()
x=sc_x.fit_transform(x)
y=sc_y.fit_transform(y.reshape(-1,1))

**9.1 Splitting the data set**

We split the data set 80-20 into a training set and a test set.
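
This notebook scales the full data set before splitting it. A common alternative, shown here only as a sketch on randomly generated data and not as what the notebook does, is to split first and fit the scaler on the training rows alone, so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in features and target (not the car data).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0])

# 80-20 split, as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_s = scaler.transform(X_test)        # test rows reuse those statistics
print(X_train_s.shape, X_test_s.shape)
```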

In [None]:
#Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2,random_state=0)

**9.2 ML with linear regression**

Performing Linear Regression with "Price" as the target variable.

In [None]:
#Fitting Multiple Linear regression to the training set

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train);

In [None]:
#predicting the test set results

sns.set()
plt.figure(figsize=(10,6))
y_pred=regressor.predict(x_test)
plt.scatter(y_test,y_pred,color="#610B21");

In [None]:
sns.set()
sns.histplot((y_test-y_pred).ravel(),bins=50,kde=True,stat="density",color="#610B21");   #Distribution of the residuals

In [None]:
from sklearn import metrics
print("Mean Absolute error=", metrics.mean_absolute_error(y_test,y_pred))
print("Root Mean Squared Error=", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print("R2 Score=", metrics.r2_score(y_test,y_pred))

From the metrics we can see that linear regression does not perform well: the R2 score is only "0.654", which is poor, so we will try different algorithms to get better performance from our model.
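
For reference, the three metrics printed above can be computed by hand from the residuals. A sketch on four made-up values, small enough to verify on paper:

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat

# Mean absolute error: average size of the residuals.
mae = np.mean(np.abs(residuals))

# Root mean squared error: penalises large residuals more strongly.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares).
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, rmse, r2)
```

An R2 of 1 means a perfect fit, while 0 means the model does no better than always predicting the mean of the target.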

# **10. Spot checking algorithms**

**10.1 Polynomial Regression**

In [None]:
#Fitting Polynomial Regression to the data set

from sklearn.preprocessing import PolynomialFeatures
poly_reg=PolynomialFeatures(degree=4)
x_poly=poly_reg.fit_transform(x_train)
lin_reg2=LinearRegression()                                                     #Linear model fitted on the polynomial features
lin_reg2.fit(x_poly,y_train);

In [None]:
#Predicting the new result with Polynomial Regression

plt.figure(figsize=(10,6))
y_pred=lin_reg2.predict(poly_reg.transform(x_test))                             #Reuse the transformer fitted on the training set
plt.scatter(y_test,y_pred,color="#AEB404")
plt.show()

In [None]:
sns.histplot((y_test-y_pred).ravel(),bins=50,kde=True,stat="density",color="#AEB404");   #Distribution of the residuals

In [None]:
#Checking the performance over metrics

from sklearn import metrics
print("Mean Absolute error=", metrics.mean_absolute_error(y_test,y_pred))
print("Root Mean Squared Error=", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print("R2 Score=", metrics.r2_score(y_test,y_pred))

Here we can see that polynomial regression performs much better than linear regression, with an R2 score of "0.79".
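
The feature expansion and the linear fit can also be chained in a scikit-learn pipeline, which guarantees that the test data goes through exactly the transformation fitted on the training data. This is a sketch on a made-up, noise-free quadratic with degree 2, chosen for illustration; it is not the notebook's degree-4 model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: an exact quadratic in one feature.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 0] ** 2

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Plain linear regression cannot follow the curvature.
linear = LinearRegression().fit(X_train, y_train)

# The pipeline expands the features, then fits a linear model on them.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)

r2_linear = linear.score(X_test, y_test)
r2_poly = poly.score(X_test, y_test)
print(r2_linear, r2_poly)
```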

**10.2 SVR regression**

In [None]:
#Fitting SVR to the dataset

from sklearn.svm import SVR
regressor=SVR(kernel="rbf")
regressor.fit(x_train,y_train.ravel());                                         #SVR expects a 1-D target

In [None]:
#Predicting the new result

plt.figure(figsize=(10,6))
y_pred=regressor.predict(x_test)
plt.scatter(y_test,y_pred);

In [None]:
sns.histplot(y_test.ravel()-y_pred.ravel(),bins=50,kde=True,stat="density");    #Residuals; ravel() avoids broadcasting (n,1) against (n,)

In [None]:
#Checking the performance over metrics

from sklearn import metrics
print("Mean Absolute error=", metrics.mean_absolute_error(y_test,y_pred))
print("Root Mean Squared Error=", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print("R2 Score=", metrics.r2_score(y_test,y_pred))

With SVR we get a slightly more accurate result, with an R2 score of "0.80".
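
Because "Price" was standardised with `sc_y` before training, the MAE and RMSE printed in these cells are in units of standard deviations, not dollars (R2 is unaffected by the scaling). One way to report errors in the original units is `inverse_transform`. The sketch below uses three made-up prices and pretend predictions that are off by 0.1 in scaled units:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices as a column vector, scaled like the notebook's target.
price = np.array([[20000.0], [30000.0], [40000.0]])
sc_y = StandardScaler()
price_scaled = sc_y.fit_transform(price)

# Pretend predictions: each one is 0.1 too high in scaled units.
pred_scaled = price_scaled + 0.1

# Map the predictions back to the original price units.
pred_price = sc_y.inverse_transform(pred_scaled)

mae_scaled = np.mean(np.abs(price_scaled - pred_scaled))
mae_dollars = np.mean(np.abs(price - pred_price))
print(mae_scaled, mae_dollars)
```

The dollar error equals the scaled error multiplied by the standard deviation stored in `sc_y.scale_`.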

**10.3 Random Forest**

In [None]:
#Fitting Random Forest Regression to the dataset

from sklearn.ensemble import RandomForestRegressor
regressor=RandomForestRegressor(n_estimators=300,random_state=0)
regressor.fit(x_train,y_train.ravel());                                         #1-D target avoids a data conversion warning

In [None]:
#Predicting the new result

plt.figure(figsize=(10,6))
y_pred=regressor.predict(x_test)
plt.scatter(y_test,y_pred,color="#33cc33");

# saving the plot
plt.savefig('Random forest.png')

In [None]:
sns.histplot(y_test.ravel()-y_pred.ravel(),bins=50,kde=True,stat="density",color="#33cc33");   #Residuals; ravel() avoids broadcasting (n,1) against (n,)

# saving the plot
plt.savefig('Random forest-1.png')

In [None]:
#Checking the performance over metrics

from sklearn import metrics
print("Mean Absolute error=", metrics.mean_absolute_error(y_test,y_pred))
print("Root Mean Squared Error=", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
print("R2 Score=", metrics.r2_score(y_test,y_pred))

Among all the algorithms, "Random Forest" performed best with an R2 score of "0.93", so Random Forest is the best-fit model for this data.
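
Spot checking can also be written as a single loop over candidate models, which keeps the split and the metric identical for every algorithm. The following is a sketch on randomly generated non-linear data, with a smaller forest than the notebook uses to keep it quick; the scores it prints say nothing about the car data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Randomly generated non-linear stand-in data (not the car data).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models to compare on the same split.
models = {
    "Linear": LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(name, round(score, 3))
```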

## **Downloading the output graphs we made**

We saved some figures along the way; they can be downloaded in one go from here.

> Remove the triple quotes around the code to download all the figures.
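
An alternative that this notebook does not use is to bundle the figures into one zip archive, so that a single download call is enough. The sketch below relies only on the standard library; the archive name `figures.zip` and the placeholder files are hypothetical, created in a temporary folder so the example runs anywhere.

```python
import os
import tempfile
import zipfile

# Work in a temporary folder with hypothetical placeholder files
# standing in for the figures saved by the notebook.
workdir = tempfile.mkdtemp()
names = ["Heat Map.png", "Price group.png"]
for name in names:
    with open(os.path.join(workdir, name), "wb") as fh:
        fh.write(b"placeholder")

# Bundle the files into one archive, stored under their bare names.
archive_path = os.path.join(workdir, "figures.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    for name in names:
        archive.write(os.path.join(workdir, name), arcname=name)

# Confirm what went into the archive.
with zipfile.ZipFile(archive_path) as archive:
    archived = sorted(archive.namelist())
print(archived)
```

In Colab, passing the archive path to `files.download` would then fetch all figures at once.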



In [None]:
#Downloading the saved figure

files.download('Detecting outliers-1.pdf')
'''
files.download('Detecting outliers-2.pdf')
files.download('Detecting outliers-3.pdf')
files.download('Detecting outliers-4.pdf')
files.download('Detecting outliers-5.pdf')
files.download('Detecting outliers-6.pdf')
files.download('Detecting outliers-7.pdf')

files.download('Heat Map.png')

files.download('Most car sold.png')

files.download('Price group.png')

files.download('Random forest.png')
files.download('Random forest-1.png')
'''

## **Conclusion**

I learnt various things during this project, such as:

*    Exploratory Data Analysis (EDA) can be carried out using Pandas plotting together with the matplotlib and seaborn packages to **develop better insights** about the data.
*    Preprocessing helps in **dealing with missing values** and irregularities present in the data.
*    Creating new features and **plotting various graphs** helps to analyse the data from every viewpoint.
*    Analysing the **impact of various columns** like Mileage, Year and HP on the increase or decrease in Price.
*    The most important inference drawn from this analysis is that we get to know the features with **which price is highly positively and negatively correlated, and which type of cars the public preferred over time**.
*    This analysis helped me in **choosing which machine learning model** to apply for various purposes.
*    We tried various machine learning models and found that the **most suitable model for our data was "Random Forest"**. Its errors were close to normally distributed, and it gave the most accurate result.