# Week_03: Descriptive Statistics
Course: WMASDS-04_Introduction to Data Science with Python
<br>Instructor: Farhana Afrin Duti, Department of Statistics, JU
### Topics:
- Descriptive Statistics
- Exploratory Data Analysis
- Data Preparation

## Used-car price data is used as the example
- In this dataset, we analyze used-car prices; the EDA focuses on identifying the factors that influence the car price.
- Data source: https://www.kaggle.com/datasets/sukhmanibedi/cars4u

<a id="contents"></a>
### Contents
- [Understanding the Data](#understanding-the-data)
    - [Data Description](#dataset_description)
    - [Import Libraries](#import-libraries)
    - [Loading Dataset](#Reading_Dataset)
    - [Dimension of the Data](#dimension-of-the-data)
    - [Features](#features)
    - [Data Structure](#data-structure)
    - [Summary of the Data](#summary)
- [Looking at the data](#looking-at-the-data)
    - [Head, Tail, and Random Sample](#head-tail-and-random-sample)
    - [Subset of data, Slicing](#slicing)
    - [Unique Values](#unique-values)
    - [Grouping data](#grouping-data)
    - [Categorical Variables](#categorical-variables)
- [Dealing with Missing Values](#dealing-with-missing-values)
    - [Check Missing Values](#check-missing-values)
    - [Dropping Row/Column](#dropping-row-column)
    - [Imputation- Mean, Median, Mode](#imputation)
- [Feature Engineering](#feature-engineering)
    - [Feature Extraction](#feature-extraction)
    - [Feature Creation](#feature-creation) 
- [Data Cleaning](#data-cleaning)
    - [Renaming variables/values](#rename-variables) 
    - [Changing data types](#changing-datatypes)
    - [Dropping Redundant Information](#dropping-redundant-information)
- [Feature Scaling](#feature-scalling)
    - [Standardization](#standardization)
    - [Normalization](#normalization)
- [Visualizing Data](#visualization)
    - [Histogram](#histogram)
    - [Barplot](#barplot)
- [Dealing with Outliers](#dealing-with-outliers)
    - [Boxplot](#boxplot)
    - [Treatment of Outliers](#treatment-of-outliers)

***

<a id='understanding-the-data'></a>
## Understanding the Data

<a id='dataset_description'></a>
### Dataset Description

    
1. S.No. : Serial Number<br>
    
2. Name : Name of the car which includes Brand name and Model name<br>
    
3. Location : The city in which the car is being sold or is available for purchase<br>
    
4. Year : Manufacturing year of the car<br>
    
5. Kilometers_Driven : The total kilometers driven by the previous owner(s), in km.<br>
    
6. Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)<br>
    
7. Transmission : The type of transmission used by the car. (Automatic / Manual)<br>
    
8. Owner_Type : Type of ownership<br>
    
9. Mileage : The standard mileage offered by the car company in kmpl or km/kg<br>
    
10. Engine : The displacement volume of the engine in CC.<br>
    
11. Power : The maximum power of the engine in bhp [Brake Horsepower].<br>
    
12. Seats : The number of seats in the car.<br>
    
13. New_Price : The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000)<br>
    
14. Price : The price of the used car in INR Lakhs (1 Lakh = 100,000)<br>

## Questions regarding dataset
<p style = "font-size : 15px ; color: black;font-family:TimesNewRoman">

- Do the various predictor variables affect the price of a used car?<br>
- Which independent variables affect the pricing of used cars?<br>
- Does the name of a car have any effect on its price?<br>
- How does the type of transmission affect pricing?<br>
- Does the location in which the car is being sold have any effect on the price?<br>
- Are Kilometers_Driven and Year of manufacture negatively correlated with the price of the car?<br>
- Do Mileage, Engine, and Power have any effect on the pricing of the car?<br>
- How do the number of seats and the fuel type affect pricing?<br>
</p>

***

<a id="import-libraries"></a>
## Import Python Libraries
- Pandas and NumPy: for data manipulation and numerical calculations
- Matplotlib and Seaborn: for data visualization


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# #to ignore warnings
# import warnings
# warnings.filterwarnings('ignore')

***

<a id="Reading_Dataset"></a>
### Reading Dataset
- The pandas library can load data into a DataFrame from many file formats, such as JSON, .csv, .xlsx, SQL, pickle, HTML, and plain text.
- To read CSV files as a DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [None]:
data = pd.read_csv("used_cars_data.csv")

***

<a id="dimension-of-the-data"></a>
### Dimension of the data
[back to contents](#contents)

In [None]:
data.ndim

In [None]:
data.shape

In [None]:
print('Number of rows or observations in the dataset is:',data.shape[0])
print('Number of columns or features or variables in the dataset is:',data.shape[1])

***

<a id="features"></a>
### Features in the dataset

In [None]:
data.columns

In [None]:
data.dtypes

***

<a id="data-structure"></a>
### Data Structures

In [None]:
data.info()

<div class="alert alert-block alert-info">
<b> Note:</b> info() summarizes the dataset: the number of non-null records in each column, the data type of each column, and the memory usage of the dataset.
    </div>

* There are 7253 observations and 14 variables in our dataset.

* data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have missing values.

* Numeric variables such as Year, Kilometers_Driven, Seats, and Price are stored as int64 or float64.

* Categorical variables such as Location, Fuel_Type, Transmission, and Owner_Type are of object data type. Mileage, Engine, Power, and New_Price are also read as object because their values include units.




***

<a id="summary"></a>
### Summary of the Data

In [None]:
data.describe()

#### From the statistics summary, we can infer the below findings :
- Year ranges from 1996 to 2019, a wide range, which shows that the used cars include both recent and old models.
- On average, a used car has been driven ~58k km. The gap between the minimum and maximum is huge: the maximum of 650000 km suggests an outlier, and this record can be removed.
- The minimum value of Mileage is 0. Cars are not sold with 0 mileage, so this looks like a data entry issue.
- Engine and Power appear to have outliers, and the data is right-skewed.
- The average number of seats in a car is 5. The number of seats can be an important contributor to price.
- The maximum price of a used car is 160 (in INR Lakhs), which is unusually high for a used car. There may be an outlier or a data entry issue.


<a id="looking-at-the-data"></a>
### Looking at the data

In [None]:
#head() will display the top 5 observations of the dataset 
display(data.head())

#tail() will display the last 5 observations of the dataset
display(data.tail())

In [None]:
data.sample(10)

<a id="unique-values"></a>
### Unique Values

In [None]:
data.nunique()

In [None]:
data['Fuel_Type'].unique()

<a id="slicing"></a>
### Subset or slicing

In [None]:
data[data['Fuel_Type']=='Diesel'].head()

***

<a id="grouping-data"></a>
### Grouping Data
- Groupby: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
- Pivot table: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

In [None]:
data.groupby(['Transmission', 'Fuel_Type'])['Price'].describe()

In [None]:
data.groupby('Fuel_Type')['Price'].describe()

In [None]:
data.pivot_table??

In [None]:
data.pivot_table(values='Price',
    index='Fuel_Type',
    columns='Transmission',
    aggfunc= ['count', 'mean'],
    fill_value=None,
    sort=True)

<a id="categorical-variables"></a>
### Categorical Variables
[back to contents](#contents)

In [None]:
def cat_var(series):
    # Print the frequency of each unique value in one column (a Series)
    print('This is about', series.name)
    print(series.value_counts())
    print("#" * 40,'\n')
cat_var(data['Seats'])

In [None]:
# Making a list of all categorical variables
cat_col = [
    "Fuel_Type",
    "Location",
    "Transmission",
    "Seats",
    "Year",
    "Owner_Type",
    
]
# Printing number of count of each unique value in each column
for column in cat_col:
    print('This is about ', column)
    print(data[column].value_counts())
    print("#" * 40,'\n')


<div class="alert alert-block alert-info">
<b> Notes:</b> 
    </div>

 - Most cars on sale have Diesel as their fuel type.
 - Mumbai has the highest number of cars available for purchase.
 - 5204 cars with manual transmission are available for purchase.
 - Most of the cars are 5-seaters and first-owner cars.
 - The manufacturing year of the cars ranges from 1996 to 2019.

***

<a id="dealing-with-missing-values"></a>
### Dealing with Missing Values
[back to contents](#contents)

<a id="check-missing-values"></a>
#### Checking for Missing Values

In [None]:
data.isnull().sum()

In [None]:
#The below code helps to calculate the percentage of missing values in each column
(data.isnull().sum()/(len(data)))*100

<div class="alert alert-block alert-info">
<b> Comments</b>  
    </div>

-  **`New_Price`** has only 1006 values; around 86% of its values are missing.

-  **`Price`**, which is the target variable, has 17% missing values. This needs to be analysed further.

-  **`Seats`** has only 53 missing values, and the number of seats can be one of the key factors in deciding price.
-  **`Power`** and **`Engine`** each have 46 missing values.

-  **`Mileage`** has only two missing values.

-  **`Mileage`, `Power`, `Engine`, and `New_Price`** are quantitative variables, but they are of object dtype here and need to be converted to numeric.

***


<a id="dropping-row-column"></a>
### Dropping Row or Column
- dropna: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

[back to contents](#contents)

In [None]:
## delete the column "New_Price"
# data.drop('New_Price',axis = 1, inplace = True)
# data.head()

In [None]:
# data.dropna()

<a id="imputation"></a>
### Imputation
- mean
- median
- mode
- forward fill
- backward fill
- constant or zero values
- advanced imputation algorithms such as KNN or regression-based imputation

<div class = "alert alert-block alert-success"><b>If we have domain knowledge, data can be imputed on assumptions.</b></div>

<br>[back to contents](#contents)
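
The simpler strategies listed above can be sketched on a small made-up Series. The values are hypothetical and not taken from the cars dataset; `fillna`, `ffill`, and `bfill` are standard pandas methods, and which strategy is appropriate depends on the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values (not from the cars data)
seats = pd.Series([5.0, np.nan, 7.0, 5.0, np.nan, 4.0])

mean_filled = seats.fillna(seats.mean())      # mean of the observed values
median_filled = seats.fillna(seats.median())  # median is less sensitive to outliers
mode_filled = seats.fillna(seats.mode()[0])   # most frequent observed value
forward_filled = seats.ffill()                # carry the last observed value forward
backward_filled = seats.bfill()               # pull the next observed value backward
zero_filled = seats.fillna(0)                 # constant value

print(pd.DataFrame({
    "original": seats,
    "mean": mean_filled,
    "median": median_filled,
    "mode": mode_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
}))
```

For a discrete variable such as the number of seats, the median or mode keeps the imputed value realistic, whereas the mean can produce a fractional seat count.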

In [None]:
data['Mileage'][5]

In [None]:
data.info()

In [None]:
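# Note: this comparison only matches once Mileage has been converted to float (see Converting Datatypes)
data.loc[data["Mileage"]==0.0,'Mileage']=np.nan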
data.Mileage.isnull().sum()
# data['Mileage'].fillna(value=np.mean(data['Mileage']),inplace=True)

- Similarly, Seats can be imputed. As mentioned earlier, this requires some general insight into the data.

**Let's assume that cars of the same brand and model have nearly the same Engine, Mileage, Power, and number of seats. Let's impute the missing values from the existing data on that basis:**



In [None]:
# Requires the Brand and Model columns (see Feature Extraction) and numeric Seats, Engine, Power
print(data['Seats'].isnull().sum())
# Fill each missing value with the median of the cars that share the same Brand and Model
for col in ['Seats', 'Engine', 'Power']:
    data[col] = data[col].fillna(data.groupby(['Brand', 'Model'])[col].transform('median'))

- In general, there are no defined or perfect rules for imputing missing values in a dataset. Each method may perform well on some datasets and poorly on others. Only practice and experimentation show which works better.


***


<a id = "feature-engineering" ></a>
### Feature Engineering
- Feature Extraction
- Creating New Feature
<br> [back to contents](#contents)

<a id = "feature-extraction"></a>
### Feature Extraction
[back to contents](#contents)

- Processing **Engine**,**Power**, and **Mileage** columns

In [None]:
np.random.seed(9)
data[['Engine','Power','Mileage']].sample(10)

In [None]:
typeoffuel=['CNG','LPG']
data.loc[data.Fuel_Type.isin(typeoffuel)].head(10)

<div class = "alert alert-block alert-info"><b> Comments</b>  </div>

 
- Power has some values recorded as "null bhp".
- Mileage also has some observations equal to 0. For the fuel types CNG and LPG, mileage is measured in km/kg, whereas for the other types it is measured in kmpl.
- Since both units express kilometers per unit of fuel, no conversion is applied here.
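
As a rough sketch of the unit removal discussed above, the leading number can be captured with a regular expression instead of stripping characters. The rows below are hypothetical values in the style of the Mileage and Power columns, and `to_number` is a helper name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values in the same style as the Mileage and Power columns
raw = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", "0.0 kmpl", None],
    "Power": ["58.16 bhp", "null bhp", "126.2 bhp", None],
})

def to_number(col):
    # Capture the leading number; text such as "null bhp" has no match and becomes NaN
    number = col.str.extract(r"^(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(number, errors="coerce")

clean = raw.apply(to_number)
# A mileage of 0 is not meaningful, so treat it as missing
clean["Mileage"] = clean["Mileage"].replace(0.0, np.nan)
print(clean)
```

Extracting the number avoids a subtle property of `str.rstrip`, which removes any trailing characters from the given set rather than an exact suffix.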

### Remove units from Mileage, Engine, and Power
***

#### Mileage

In [None]:
data[data.Mileage.isnull()]

In [None]:
# Remove the unit suffix ("kmpl" or "km/kg"); rstrip() strips a set of characters, not a suffix
data["Mileage"] = data["Mileage"].str.replace(r"\s*(kmpl|km/kg)$", "", regex=True)

#### Engine

In [None]:
#remove units
data["Engine"] = data["Engine"].str.rstrip(" CC")


#### Power

In [None]:
#remove bhp and replace null with nan
data["Power"] = data["Power"].str.rstrip(" bhp")
data["Power"]= data["Power"].replace(regex="null", value = np.nan)

In [None]:
#verify the data
num=['Engine','Power','Mileage']
data[num].sample(20)

In [None]:
data.info()
data.isnull().sum()

<div class = "alert alert-block alert-info"><b>Note:</b> Some values in Power and Mileage are 0.0.</div>

In [None]:
data.query("Power == '0.0'")['Power'].count()

In [None]:
data.query("Mileage == '0.0'")['Mileage'].count()

In [None]:
data.query("Mileage == '0.0'")['Mileage']

### Converting null(0.0) observations to *NaN* 
*** 

In [None]:
data.loc[data["Mileage"]=='0.0','Mileage']=np.nan

In [None]:
data.loc[data["Engine"]=='0.0','Engine'].count()

In [None]:
data[num].nunique()

In [None]:
data[num].isnull().sum()

<div class = "alert alert-block alert-info">
    <b>Comments:</b>
    There are 46 missing values in Engine, 175 in Power, and 83 in Mileage.
    </div>


<a id="section_ID"></a>
### Processing Seats

In [None]:
data.query("Seats == 0.0")['Seats'].count()

In [None]:
data.query("Seats == 0.0")['Seats']

In [None]:
#seats cannot be 0 so changing it to nan and will be handled in missing value
data.loc[data['Seats'] == 0.0, 'Seats'] = np.nan

In [None]:
data.head()

### Processing New Price
- We know that New_Price is the price of a new car of the same model in INR Lakhs (1 Lakh = 100,000).
- This column clearly has a lot of missing values. We will impute the missing values later. For now, we will only extract the numeric values from this column.
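
A minimal sketch of this extraction, assuming the values look like "8.61 Lakh" or "1.28 Cr" (1 Cr = 100 Lakh). The strings below are hypothetical examples, and the variable names are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical New_Price strings; 1 Cr (crore) = 100 Lakh
new_price = pd.Series(["8.61 Lakh", "1.28 Cr", np.nan, "21 Lakh"])

parts = new_price.str.split(expand=True)            # column 0: number, column 1: unit
value = pd.to_numeric(parts[0], errors="coerce")    # numeric part as float
factor = parts[1].map({"Lakh": 1.0, "Cr": 100.0})   # multiplier that converts to Lakh
new_price_lakh = (value * factor).round(2)          # missing rows stay NaN
print(new_price_lakh)
```

Mapping the unit to a multiplier converts the whole column in one step, without looping over individual rows.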

In [None]:
data['unit'] = data['New_Price'].str.split().str.get(1)
data['New_Price'] = data['New_Price'].str.split().str.get(0)

data.head()

In [None]:
data[data['unit']=='Lakh'].shape

In [None]:
data[data['unit']=='Cr'].shape

In [None]:
data[data['unit'].isnull()].shape

In [None]:
data[data['unit']=='Cr']['New_Price'].head(10)
# data[data['unit']=='Lakh']['New_Price'].head(10)

In [None]:
data[data['unit']=='Cr'].index

In [None]:
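# 1 Cr = 100 Lakh: convert the crore values to Lakh without chained assignment
cr_mask = data['unit'] == 'Cr'
data.loc[cr_mask, 'New_Price'] = (100 * data.loc[cr_mask, 'New_Price'].astype(float)).round(2).astype(str)
print(data.loc[cr_mask, 'New_Price'])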

In [None]:
# data.sample(100)

***

<a id="section_ID"></a>
### Processing Name column to obtain *Brand* and *Model*
Brand plays an important role in car selection and price. Let's split the name and introduce the new variables "Brand" and "Model".

In [None]:
#dropping rows with name as null
# data = data.dropna(subset=['Name'])

In [None]:
#As mentioned in dataset car name has Brand and model so extracting it ,This can help to fill missing values of price column as brand 
data['Brand'] = data['Name'].str.split(' ').str[0] #Separating Brand name from the Name
data['Model'] = data['Name'].str.split(' ').str[1] + data['Name'].str.split(' ').str[2]

In [None]:
# data['Brand'] = data.Name.str.split().str.get(0)
# data['Model'] = data.Name.str.split().str.get(1) + data.Name.str.split().str.get(2)
data[['Name','Brand','Model']]

<a id = "feature-creation"></a>
## Feature Creation
[back to contents](#contents)

### Processing Year to obtain age of car

-	the column “Year” shows the manufacturing year of the car.

In [None]:
data['Current_year']=2023
data['Ageofcars']=data['Current_year']-data['Year']
data.drop('Current_year',axis=1,inplace=True)
data.head()

***
***

<a id = "data-cleaning"></a>
## Further approaches of data cleaning
- dropping redundant rows/columns
- changing datatypes
- Rename variables/values

<br>[back to content](#contents)

<a id="changing-datatypes"></a>
### Converting Datatypes
[back to contents](#contents)

In [None]:
# #converting object data type to category data type
# data["Fuel_Type"] = data["Fuel_Type"].astype("category")
# data["Transmission"] = data["Transmission"].astype("category")
# data["Owner_Type"] = data["Owner_Type"].astype("category")
# #converting Continuous datatype  
data["Mileage"] = data["Mileage"].astype(float)
data["Power"] = data["Power"].astype(float)
data["Engine"]=data["Engine"].astype(float)
data.head()

In [None]:
# data.info()

In [None]:
data['New_Price'] = data['New_Price'].astype(float)


In [None]:
# data.info()

In [None]:
data.drop('unit',axis=1,inplace=True)

In [None]:
data.head()

***

<a id="rename-variables"></a>
### Renaming
- Some variable names are not meaningful or easy to understand. Some data may have data entry errors, and some variables may need data type conversion.
- The brand names 'Isuzu'/'ISUZU', 'Mini', and 'Land' look incorrect: the first appears with two spellings, and the other two are truncated. These need to be corrected.

<br>[back to contents](#contents)

In [None]:
print(data.Brand.unique())
print(data.Brand.nunique())

In [None]:
searchfor = ['Isuzu' ,'ISUZU','Mini','Land']
data[data.Brand.str.contains('|'.join(searchfor))].head(5)

In [None]:
data["Brand"] = data["Brand"].replace({"ISUZU": "Isuzu", "Mini": "Mini Cooper", "Land": "Land Rover"})

In [None]:
data.Brand.unique()

In [None]:
# #changing brandnames
# data.loc[data.Brand == 'ISUZU','Brand']='Isuzu'
# data.loc[data.Brand=='Mini','Brand']='Mini Cooper'
# data.loc[data.Brand=='Land','Brand']='Land Rover'
# #data['Brand']=data["Brand"].astype("category")

In [None]:
data.groupby(data.Brand).size().sort_values(ascending =False)

<div class = "alert alert-block alert-info">There are 32 unique brands in the dataset. Maruti is the brand most often available for purchase or sold, followed by Hyundai.</div> 

***

<a id = "dropping-redundant-information"></a>
### Dropping Redundant Information
- Some columns or variables can be dropped if they do not add value to our analysis.
- In our dataset, the column S.No. holds only ID values, which we assume have no predictive power for the dependent variable.

<br>[back to contents](#contents)

In [None]:
# Remove S.No. column from data
data = data.drop(['S.No.'], axis = 1)
data.info()

**Comments:** We start our Feature Engineering as we need to add some columns required for analysis.
***

***

<a id = "feature-scalling"></a>
## Feature Scaling
- MinMaxScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
- StandardScaler:  https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- Log transformation

<br>[back to contents](#contents)
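
The three approaches listed above can be sketched with plain pandas and NumPy. The values are hypothetical stand-ins for a right-skewed column such as Kilometers_Driven; the formulas are the ones that StandardScaler and MinMaxScaler apply to each column, written out by hand rather than calling scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for a column such as Kilometers_Driven
km = pd.Series([10000.0, 20000.0, 40000.0, 80000.0, 650000.0])

# Standardization (z-score): mean 0, standard deviation 1.
# ddof=0 is the population standard deviation, which StandardScaler also uses.
standardized = (km - km.mean()) / km.std(ddof=0)

# Normalization (min-max): rescales the values to the range [0, 1]
normalized = (km - km.min()) / (km.max() - km.min())

# Log transformation: compresses large values and reduces right skew
logged = np.log(km)

print(pd.DataFrame({
    "raw": km,
    "standardized": standardized,
    "normalized": normalized,
    "log": logged,
}))
print("skew before:", km.skew(), "skew after log:", logged.skew())
```

Standardization and normalization change only the scale of a variable, not the shape of its distribution; the log transformation is the one that reduces skewness.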

***
***

<a id = "visualization"></a>
### Visualization
[back to contents](#contents)

- Categorical variables can be visualized using a count plot, bar chart, pie plot, etc.
- Numerical variables can be visualized using a histogram, box plot, density plot, etc.
- In our example, we perform a univariate analysis using a histogram and a box plot for continuous variables.
- In the figures below, a histogram and a box plot are used to show the pattern of each variable, as some variables have skewness and outliers.


In [None]:
#Before we do EDA, lets separate Numerical and categorical variables for easy analysis

cat_cols=data.select_dtypes(include=['object']).columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

### Univariate Analysis
- Price and Kilometers_Driven are right-skewed, so these variables will be transformed, and outliers will be handled during imputation.
- Categorical variables are visualized using a count plot. They show the pattern of factors influencing car price.


In [None]:
print('Skewness :', data['Price'].skew())

In [None]:
plt.figure(figsize = (15, 4))
data['Price'].hist(grid=False)
plt.ylabel('count')
plt.show()

<a id = "histogram"></a>
### Histogram
[back to contents](#contents)

<a id = "boxplot"></a>
### Boxplot
[back to contents](#contents)

In [None]:
plt.figure(figsize = (15, 4))
sns.boxplot(x=data['Price'])
plt.show()

In [None]:
plt.figure(figsize = (15, 4))
plt.subplot(1, 2, 1)
data['Price'].hist(grid=False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x=data['Price'])
plt.show()

##### Histogram and boxplot for all continuous variables using subplot

In [None]:
for col in num_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()

<a id = "barplot"></a>
### Bar Plot
[back to contents](#contents)

In [None]:
plt.figure(figsize = (15, 4))

sns.countplot( x = 'Fuel_Type', data = data, color = 'blue', 
              order = data['Fuel_Type'].value_counts().index);


In [None]:
fig, axes = plt.subplots(3, 2, figsize = (18, 18))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'Fuel_Type', data = data, color = 'blue', 
              order = data['Fuel_Type'].value_counts().index);
sns.countplot(ax = axes[0, 1], x = 'Transmission', data = data, color = 'blue', 
              order = data['Transmission'].value_counts().index);
sns.countplot(ax = axes[1, 0], x = 'Owner_Type', data = data, color = 'blue', 
              order = data['Owner_Type'].value_counts().index);
sns.countplot(ax = axes[1, 1], x = 'Location', data = data, color = 'blue', 
              order = data['Location'].value_counts().index);
sns.countplot(ax = axes[2, 0], x = 'Brand', data = data, color = 'blue', 
              order = data['Brand'].value_counts().index[:20]);  # 20 most frequent brands
sns.countplot(ax = axes[2, 1], x = 'Model', data = data, color = 'blue', 
              order = data['Model'].value_counts().index[:20]);  # 20 most frequent models
axes[1][1].tick_params(labelrotation=45);
axes[2][0].tick_params(labelrotation=90);
axes[2][1].tick_params(labelrotation=90);


<div class = "alert alert-block alert-info"><b>Comments</b></div>	
From the count plot, we can make the following observations:

- Mumbai has the highest number of cars available for purchase, followed by Hyderabad and Coimbatore.
- ~53% of cars have Diesel as their fuel type, which may reflect a preference for the performance of diesel cars.
- ~72% of cars have manual transmission.
- ~82% of cars are first-owner cars, which suggests that most buyers prefer to purchase first-owner cars.
- ~20% of cars belong to the brand Maruti, followed by 19% belonging to Hyundai.
- WagonR ranks first among all models available for purchase.


### Bivariate Analysis
- Bivariate analysis helps to understand how variables are related to each other, including the relationship between the dependent and independent variables in the dataset.
- For numerical variables, pair plots and scatter plots are widely used for bivariate analysis.
- A stacked bar chart can be used for categorical variables if the output variable is categorical. Bar plots can be used if the output variable is continuous.
- In our example, a pair plot is used to show the relationship between pairs of numerical variables.

<div class = "alert alert-block alert-warning"><b> Need Preprocessing Again! Data Transformation</b></div>

- Before we proceed to bivariate analysis: the univariate analysis showed that some variables need to be transformed.
- Price and Kilometers_Driven are highly skewed and on a larger scale than the other variables. Let's apply a log transformation.
- A log transformation reduces the skew and brings these variables onto a scale comparable with the other variables:

https://numpy.org/doc/stable/reference/generated/numpy.log.html

In [None]:
data['Price_log']=np.log(data['Price'])

In [None]:
# data.head()

In [None]:
data["Kilometers_Driven_log"]=np.log(data["Kilometers_Driven"])
#Log transformation of the feature 'Kilometers_Driven'
# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(data["Kilometers_Driven_log"], kde=True);


In [None]:
plt.figure(figsize=(13,17))
sns.pairplot(data=data.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()

### Pair Plot provides following insights:
- The variable Year has a positive correlation with price and mileage.
- Year has a negative correlation with kilometers driven.
- Mileage is negatively correlated with Power: as power increases, mileage decreases.
- Recently made cars are priced higher: as the age of the car increases, the price decreases.
- As Engine and Power increase, the price of the car increases.
- A bar plot can be used to show the relationship between categorical variables and continuous variables.

In [None]:
data.columns

In [None]:
data['Price_log'] = np.log(data['Price'])

In [None]:
fig, axarr = plt.subplots(4, 2, figsize=(12, 18))
data.groupby('Location')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][0], fontsize=12)
axarr[0][0].set_title("Location Vs Price", fontsize=18)
data.groupby('Transmission')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][1], fontsize=12)
axarr[0][1].set_title("Transmission Vs Price", fontsize=18)
data.groupby('Fuel_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][0], fontsize=12)
axarr[1][0].set_title("Fuel_Type Vs Price", fontsize=18)
data.groupby('Owner_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][1], fontsize=12)
axarr[1][1].set_title("Owner_Type Vs Price", fontsize=18)
data.groupby('Brand')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(ax=axarr[2][0], fontsize=12)
axarr[2][0].set_title("Brand Vs Price", fontsize=18)
data.groupby('Model')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(ax=axarr[2][1], fontsize=12)
axarr[2][1].set_title("Model Vs Price", fontsize=18)
data.groupby('Seats')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][0], fontsize=12)
axarr[3][0].set_title("Seats Vs Price", fontsize=18)
data.groupby('Ageofcars')['Price_log'].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][1], fontsize=12)
axarr[3][1].set_title("Ageofcars Vs Price", fontsize=18)
plt.subplots_adjust(hspace=1.0)
plt.subplots_adjust(wspace=.5)
sns.despine()

<div class = "alert alert-block alert-info"><b>Comments:</b></div>

- Car prices are highest in Coimbatore and lowest in Kolkata and Jaipur.
- Automatic cars are priced higher than manual cars.
- Diesel and Electric cars have almost the same price, which is the highest, and LPG cars have the lowest price.
- First-owner cars are the highest in price, followed by second-owner cars.
- Third-owner cars are priced lower than fourth-and-above-owner cars.
- Lamborghini is the highest-priced brand.
- Gallardocoupe is the highest-priced model.
- 2-seaters have the highest price, followed by 7-seaters.
- The latest model cars are high in price.

### Multivariate Analysis
- Multivariate analysis is a useful method for determining relationships and analyzing patterns among several variables in a dataset.
- A heat map shows the correlation between the variables and whether it is positive or negative.
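
As a small illustration of the correlation matrix behind a heat map, the sketch below uses hypothetical toy rows rather than the cars dataset. `numeric_only=True` is a parameter of `DataFrame.corr` in recent pandas versions; it restricts the calculation to numeric columns.

```python
import pandas as pd

# Hypothetical toy rows (not taken from the cars dataset)
cars = pd.DataFrame({
    "Fuel_Type": ["CNG", "Petrol", "Diesel", "Diesel", "Diesel"],
    "Engine": [998, 1197, 1498, 1968, 2993],
    "Power": [58.0, 82.0, 98.0, 140.0, 255.0],
    "Mileage": [26.6, 19.7, 18.2, 15.2, 11.0],
})

# numeric_only=True skips text columns such as Fuel_Type,
# which would otherwise raise an error in recent pandas versions
corr = cars.corr(numeric_only=True)
print(corr.round(2))
```

Each cell holds the Pearson correlation between two columns, so the matrix is symmetric with ones on the diagonal.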

In [None]:
data.drop(['Kilometers_Driven','Price'],axis=1, inplace = True)

In [None]:
plt.figure(figsize=(12, 7))
sns.heatmap(data.corr(numeric_only=True), annot = True, vmin = -1, vmax = 1)
plt.show()


<div class = "alert alert-block alert-info"><b> Comments: </b> From the Heat map, we can infer the following</div>

- Engine has a strong positive correlation with Power (0.86).
- Price has a positive correlation with Engine (0.69) as well as Power (0.77).
- Mileage is negatively correlated with Engine, Power, and Price.
- Price has a moderate positive correlation with Year.
- Kilometers driven has a negative correlation with Year and not much impact on the price.
- Car age has a negative correlation with Price.
- Car age is positively correlated with Kilometers-Driven: as the age of the car increases, the kilometers driven also increase.
- Car age has a negative correlation with Mileage, which makes sense.


***

<a id = "dealing-with-outliers"></a>
### Dealing with Outliers
- https://www.analyticsvidhya.com/blog/2021/05/detecting-and-treating-outliers-treating-the-odd-one-out/

[back to contents](#contents)

### Outlier detection
- Boxplots
- Z-score
- Interquartile Range (IQR)

![image.png](attachment:image.png), ![image-2.png](attachment:image-2.png)
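
The IQR and z-score rules can be sketched as follows. The prices are hypothetical values in INR Lakhs with one extreme entry; the 1.5 × IQR fence and the threshold of 3 standard deviations are common conventions, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical prices in INR Lakhs: 19 ordinary values plus one extreme value
price = pd.Series(list(np.arange(1.5, 11.0, 0.5)) + [160.0])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles (the boxplot whisker rule)
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = price[(price < lower) | (price > upper)]

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (price - price.mean()) / price.std()
z_outliers = price[z.abs() > 3]

print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

The z-score rule depends on the mean and standard deviation, which are themselves pulled by extreme values, so in very small samples it can miss an outlier that the IQR rule catches.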

### Treatment of outlier
- Trimming or removing
- Quantile based flooring and capping
- Imputation
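
A minimal sketch of the three treatments, again on hypothetical prices. The 5th and 95th percentiles used for capping and the median used for imputation are illustrative choices, not prescribed values.

```python
import numpy as np
import pandas as pd

# Hypothetical prices in INR Lakhs: 19 ordinary values plus one extreme value
price = pd.Series(list(np.arange(1.5, 11.0, 0.5)) + [160.0])

# Flag outliers with the IQR rule
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
is_outlier = (price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)

# 1. Trimming: drop the flagged rows
trimmed = price[~is_outlier]

# 2. Quantile-based flooring and capping: clip to the 5th and 95th percentiles
capped = price.clip(lower=price.quantile(0.05), upper=price.quantile(0.95))

# 3. Imputation: replace flagged values with the median of the remaining values
imputed = price.mask(is_outlier, price[~is_outlier].median())

print(len(trimmed), capped.max(), imputed.max())
```

Trimming shortens the dataset, while capping and imputation keep every row, which matters when the other columns of the same row are still useful.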

***

<div class = "alert alert-block alert-success"><b> Overall Comments:</b>
Through EDA, we got useful insights. Below are the factors influencing the price of the car and a few takeaways:

- 2-seat cars are priced higher than other cars.
- The price of the car decreases as the age of the car increases.
- First-owner cars are priced higher than second- or third-owner cars, which suggests customers prefer them.
- Electric cars are among the highest priced, which may reflect growing demand as fuel prices rise.
- Automatic cars are priced higher than manual cars.
</div>

This way, we perform EDA on the datasets to explore the data and extract all possible insights, which can help in model building and better decision making.
<br>**However, this was only an overview of how EDA works; you can go deeper into it and attempt the stages on larger datasets.**

### Save the preprocessed data file as csv

In [None]:
import pandas as pd
data.to_csv('data_processed.csv')
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

***

<div class = "alert alert-block alert-danger"><b> Why Exploratory Data Analysis?</b></div>

- Exploratory Data Analysis refers to the crucial process of performing initial investigations on data to discover patterns and check assumptions with the help of summary statistics and graphical representations.
-	EDA can be leveraged to check for outliers, patterns, and trends in the given data.
-	EDA helps to find meaningful patterns in data.
-	EDA provides in-depth insights into the data sets to solve our business problems.
-	EDA gives a clue to impute missing values in the dataset 

# Thank You..