<a href="https://colab.research.google.com/github/lucifernob/Exploratory-Data-Analysis-of-Car-Features/blob/master/Project_car_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Exploratory Data Analysis of Car Features**

---


# **1. Importing libraries**

Import all the libraries required for the project.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns                     #used for visualisation
import matplotlib.pyplot as plt           #used for visualisation
%matplotlib inline

**1.1 Load data file**

The data file is in .csv format, and there are 3 main ways to import it:
- From local drive
- From URL
- From Google Drive


> *Note: Importing from the local drive may require you to upload the data file every time you run the code, so importing from Google Drive is the better option.*



In [None]:
#Import data file from your Google Drive
from google.colab import drive
drive.mount("/content/gdrive")

import pandas as pd
df = pd.read_csv('/content/gdrive/My Drive/Internship studio /Project/data.csv')    #Copy your own file path and use it in place of the given path

In [None]:
#Import data file from local drive
#from google.colab import files
#uploaded=files.upload()                      #creates an upload option to load your desired file from your local drive
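
The list above also mentions importing from a URL, which has no cell of its own. A minimal sketch follows, assuming the CSV is reachable over HTTP(S): `pd.read_csv` accepts a URL in the same way as a local path. The helper name `load_csv` and the `example.com` address are hypothetical placeholders, and the in-memory sample only stands in for a real file so that the cell runs offline.

```python
import io
import pandas as pd

def load_csv(source):
    """Read a CSV from a local path, a URL, or a file-like object into a DataFrame."""
    return pd.read_csv(source)

# Hypothetical URL, shown only to illustrate the call; replace it with a real raw-file link.
# df = load_csv("https://example.com/data.csv")

# Offline stand-in (made-up values) so the sketch runs without network access.
sample = io.StringIO("Make,MSRP\nFord,20000\nBMW,45000\n")
demo = load_csv(sample)
print(demo.shape)
```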


**1.2 Loading the data into the data frame**

Load the required data file for analysis, and check that the data has loaded properly.

In [None]:
import io
df = pd.read_csv(io.BytesIO(uploaded['data.csv']))                            #Reading the file "data.csv" (needs `uploaded` from the local-drive cell above)

In [None]:
#To display the top 5 rows
df.head(5)

In [None]:
#To display the bottom 5 rows
df.tail(5)

**1.3 Checking the types of data and basic summary stats**

Sometimes the data is not in the correct format, for example integer data stored as strings, and has to be converted, so we check the data types here.


> Note: Don't proceed further before checking the data types.







In [None]:
df.info()

In [None]:
df.describe()
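
If `df.info()` shows a numeric column stored as `object`, it can be converted before going further. The sketch below uses a small made-up frame rather than the car data set; `pd.to_numeric` with `errors="coerce"` is one common way to do the conversion, turning unparseable entries into `NaN` instead of raising an error.

```python
import pandas as pd

# Toy frame (made-up values): 'HP' was read as text because of a stray "n/a".
toy = pd.DataFrame({"Make": ["Ford", "BMW", "Audi"], "HP": ["150", "n/a", "220"]})
print(toy.dtypes)          # HP is reported as object (string)

# errors="coerce" replaces anything that cannot be parsed with NaN
toy["HP"] = pd.to_numeric(toy["HP"], errors="coerce")
print(toy.dtypes)          # HP is now a float column
```

The coerced `NaN` values then show up in the null-value check later in the notebook.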

# **2. Dropping irrelevant data**

When we import data, some of it may be irrelevant, so we drop the columns or rows that are less relevant to us.


> Irrelevant data can include multiple rows of the same data or missing values. Depending on our needs, we can remove them or impute new values. Remember: the more data we provide, the more accurate the result we will get.



In [None]:
df=df.drop(['Number of Doors','Market Category','Engine Fuel Type'], axis=1) #axis=1 means the labels refer to columns, so the listed columns are dropped.
df.head(5)

# **3. Renaming the columns**

Renaming the columns for a better understanding of the data.

In [None]:
df=df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price"})
df.head(5)

# **4. Dropping duplicate rows**

A large dataset may contain duplicate rows or null values, so removing them is a good idea.

> *Note: We are only dropping duplicate rows here, not null values.*




In [None]:
df.shape             # size of data

In [None]:
duplicate_rows_df=df[df.duplicated()]                                #Finding duplicate rows
print("No. of duplicate rows= ", duplicate_rows_df.shape[0])         #shape[0] is the row count, i.e. how many duplicate rows are present.

In [None]:
df=df.drop_duplicates()
df.head(5)

In [None]:
df.shape          #So we are left with fewer rows after removing duplicate rows.
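
To see what `duplicated()` and `drop_duplicates()` do on a small scale, here is a sketch on a made-up four-row frame (not the car data set). By default a row counts as a duplicate only when every column matches an earlier row, and the first occurrence is kept.

```python
import pandas as pd

# Made-up rows: the second row repeats the first exactly.
toy = pd.DataFrame({
    "Make": ["Ford", "Ford", "BMW", "Ford"],
    "Price": [20000, 20000, 45000, 21000],
})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence of each row

print(n_dupes, toy.shape, deduped.shape)
```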

**4.1 Dropping the missing or null values**

As with duplicates, a large dataset may contain null values. This data set contains very few of them, so we can simply remove them instead of imputing.

> *NOTE: Instead of removing the null values, we can also impute the missing values. This is often preferable to dropping, because it keeps more data for the analysis.*
> If we need to impute, we impute with the median of that column rather than the mean, because the median is more robust to outliers.





In [None]:
print(df.isnull().sum())   #Printing the number of null values in each column.

In [None]:
df=df.dropna()          #Dropping the null values
df.count()

In [None]:
print(df.isnull().sum())      
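
The note above mentions median imputation as an alternative to `dropna()`. A minimal sketch of that alternative on made-up numbers (not the car data set): `fillna` accepts a Series of per-column values, so the column medians can be passed in directly. The 900 in the toy `HP` column is a deliberate extreme value to show why the median is the safer fill.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap per column; 900 is a deliberate extreme value.
toy = pd.DataFrame({
    "HP": [100.0, 120.0, np.nan, 110.0, 900.0],
    "Cylinders": [4.0, np.nan, 4.0, 6.0, 8.0],
})

medians = toy.median(numeric_only=True)   # one median per numeric column
imputed = toy.fillna(medians)             # each gap gets its own column's median

print(medians)
print(toy["HP"].mean())                   # the mean is pulled up by the extreme value
print(imputed)
```

Here the `HP` gap is filled with 115.0 (the median), whereas the mean of the same column is 307.5.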

# **5. Detecting outliers**



> We will use box-and-whisker plots to visualise outliers



In [None]:
sns.boxplot(x=df['Year'])

In [None]:
sns.boxplot(x=df['HP'])

In [None]:
sns.boxplot(x=df['Cylinders'])

In [None]:
sns.boxplot(x=df['MPG-H'])

In [None]:
sns.boxplot(x=df['MPG-C'])

In [None]:
sns.boxplot(x=df['Popularity'])

In [None]:
sns.boxplot(x=df['Price'])

In [None]:
Q1=df.quantile(0.25, numeric_only=True)    #First quartile (25th percentile)
Q3=df.quantile(0.75, numeric_only=True)    #Third quartile (75th percentile)
IQR=Q3-Q1                                  #Interquartile range
print(IQR)

In [None]:
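#Standard formula but we can also use mean to 
num=df[Q1.index]                                                  #Compare only the numeric columns that have quartiles
df=df[~((num<(Q1-1.5*IQR))| (num>(Q3+1.5*IQR))).any(axis=1)]      #Keep rows with no value outside the 1.5*IQR fences
df.shape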
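
The filter above can be hard to read in one line, so here is the same 1.5 × IQR rule as a small function applied to a made-up frame. The function name `drop_iqr_outliers` and the parameter `k` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

def drop_iqr_outliers(frame, k=1.5):
    """Drop rows where any numeric value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = frame.select_dtypes(include="number")   # ignore text columns such as 'Make'
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    return frame[~outside.any(axis=1)]

# Made-up prices: 100 sits far away from the rest.
toy = pd.DataFrame({"Make": list("ABCDEF"), "Price": [10, 12, 11, 13, 12, 100]})
kept = drop_iqr_outliers(toy)
print(kept)
```

For this toy column Q1 = 11.25 and Q3 = 12.75, so the fences are 9.0 and 15.0 and only the row with price 100 is removed. Note that a row is dropped if it is an outlier in any one column, which can remove a large share of a real data set.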

# **6. Most represented car brands**

In this section we find the top 10 car brands and calculate the average price of the cars within those brands.

In [None]:
#Percentage of cars per brand
counts=df['Make'].value_counts(normalize=True)*100

#Top 10 car brands
popular_labels=counts.index[:10]

#Plot
plt.figure(figsize=(14,6))
plt.barh(popular_labels,width=counts[:10])
plt.title('Top 10 car brands')
plt.show()

In [None]:
#Average price for each of the listed brands
top_makes=['Chevrolet','Ford','Volkswagen','Toyota','Dodge','Nissan','GMC','Honda','Mazda']
prices=df.loc[df['Make'].isin(top_makes),['Make','Price']].groupby('Make').mean()
print(prices)
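
The brand names in the cell above are typed in by hand. As a possible alternative, the list can be derived from the counts so that it stays in step with the data. The sketch below uses a made-up frame and a hypothetical `top_n` of 2; on the real data the same pattern would use `df` with `top_n = 10`.

```python
import pandas as pd

# Made-up listings: Ford appears 3 times, BMW twice, Audi once.
toy = pd.DataFrame({
    "Make": ["Ford", "Ford", "Ford", "BMW", "BMW", "Audi"],
    "Price": [20000, 22000, 24000, 40000, 50000, 35000],
})

top_n = 2
top_makes = toy["Make"].value_counts().index[:top_n]   # most frequent brands first

avg_price = (toy[toy["Make"].isin(top_makes)]
             .groupby("Make")["Price"].mean()
             .sort_values(ascending=False))
print(avg_price)
```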

# **7. Correlation matrix**

In [None]:
df.corr(numeric_only=True)    #Correlate numeric columns only; text columns such as 'Make' are skipped
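
To show how the matrix is read, here is a small sketch on made-up values in which horsepower rises with price and falls with city mileage. It assumes pandas 1.5 or newer, where `corr` accepts `numeric_only`.

```python
import pandas as pd

# Made-up values: HP rises in step with Price and falls in step with MPG-C.
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "HP": [100, 150, 200, 250],
    "MPG-C": [30, 25, 20, 15],
    "Price": [10000, 15000, 20000, 25000],
})

corr = toy.corr(numeric_only=True)   # Pearson correlation; the text column is left out
print(corr.round(2))
```

Values close to 1 mean two columns rise together, values close to -1 mean one falls as the other rises, and values near 0 mean little linear relationship.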