
# Used Car Price Exploration

This notebook introduces the pandas library for data loading and preparation and explores techniques of Exploratory Data Analysis (EDA).

### Dataset

**Filename**: usercarlastest.csv

It is a comma-separated file with 14 columns.

1. Id - Car's id
2. Name - The brand and model of the car.
3. Location - The location in which the car is being sold or is available for purchase.
4. Year - The year or edition of the model.
5. Kilometers_Driven - The total distance driven by the previous owner(s), in km.
6. Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
7. Transmission - The type of transmission used by the car. (Automatic / Manual)
8. Owner_Type - First, Second, Third, or Fourth & Above
9. Mileage - The standard mileage offered by the car company in kmpl or km/kg
10. Engine - The displacement volume of the engine in CC.
11. Power - The maximum power of the engine in bhp.
12. Seats - The number of seats in the car.
13. New_Price - The price of a new car of the same model.
14. Price - The price of the car (target).

#### Importing required libraries

In [None]:
import pandas as pd
import numpy as np

#### Check Library Versions

In [None]:
pd.__version__

In [None]:
np.__version__

### Loading the dataset

In [None]:
cars_df = pd.read_csv( "./usercarlastest.csv" )

In [None]:
type(cars_df)

### Showing a few records

In [None]:
cars_df.head(5)

In [None]:
cars_df.tail(5)

## Getting metadata

In [None]:
## Dimension of the dataset
cars_df.shape

In [None]:
cars_df.info()

## Indexing and Slicing

Selecting a specific set of rows and columns: how to slice, dice, and generally get and set subsets of pandas objects.

Detailed Tutorial: https://pandas.pydata.org/pandas-docs/dev/user_guide/indexing.html

In [None]:
cars_df[4:10]

In [None]:
cars_df[-5:]

In [None]:
cars_df['Name'][0:5]

In [None]:
cols = ['Name', 'Price']
cars_df[cols][0:5]

In [None]:
cars_df[['Name', 'Price']][0:5]
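
The `[]` slices above pick rows by position and columns by name. As a complement, here is a minimal sketch of the same kind of selection with `.iloc` (position-based) and `.loc` (label-based), both covered in the tutorial linked above. The small `toy_df` frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for cars_df (values are invented)
toy_df = pd.DataFrame({
    "Name": ["Maruti Swift", "Hyundai i20", "Honda City", "Toyota Innova"],
    "Year": [2015, 2017, 2012, 2018],
    "Price": [4.5, 6.2, 3.9, 15.0],
})

# Position-based: rows 1 and 2 (the end position is exclusive), columns 0 and 2
by_position = toy_df.iloc[1:3, [0, 2]]

# Label-based: row labels 1 through 2 (the end label is inclusive)
by_label = toy_df.loc[1:2, ["Name", "Price"]]

print(by_position)
print(by_label)
```

Both selections return the same two rows here because the default row labels coincide with the row positions; after filtering or sorting, that is no longer guaranteed.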

## Sampling Records

In [None]:
cars_df.sample(5)

#### random_state in sample()
If you pass `random_state` an integer, pandas uses it as the seed for a pseudo-random number generator. As the name suggests, the generator does not produce true randomness. Instead, it has an internal state, initialized from the seed (for NumPy's global generator, np.random.get_state() returns that state). When initialized with the same seed, it reproduces the same sequence of "random numbers".

If you pass it a RandomState, it uses this (already initialized/seeded) RandomState to generate pseudo-random numbers. This also gives reproducible results: set a fixed seed when initializing the RandomState, then pass that RandomState around.

https://stackoverflow.com/questions/45211624/what-exactly-does-the-pandas-random-state-do
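
The reproducibility described above can be checked with a small sketch. The `toy_df` frame below is invented for illustration; only `DataFrame.sample` and its `random_state` parameter come from the notebook.

```python
import pandas as pd

# Hypothetical toy frame (invented values), standing in for cars_df
toy_df = pd.DataFrame({"Id": range(20), "Price": [p * 0.5 for p in range(20)]})

# Same seed -> same rows, every time the cell is run
first_draw = toy_df.sample(5, random_state=100)
second_draw = toy_df.sample(5, random_state=100)

print(first_draw.index.tolist())
print(second_draw.index.tolist())
print(first_draw.equals(second_draw))
```

Without `random_state`, two calls to `sample(5)` would usually return different rows.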

In [None]:
cars_df.sample(5, random_state=100)

In [None]:
cars_df.shape

## Understanding distribution of variables

## Variable Types

Two variable types
- Numerical : quantify 
   - e.g. age, salary, sales
   - Two types
       - Continuous
       - Discrete : Specific values
           - e.g. Number of dependents
           - e.g. number of cars you own
- Categorical : label or group
    - e.g Transmission, Location etc.
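
As a rough first pass, the two variable types can be separated by column dtype. This is a sketch on an invented `toy_df`; the column names mirror the dataset, but the values are made up. The dtype is only a heuristic: a numeric column holding codes can still be categorical in meaning.

```python
import pandas as pd

# Hypothetical toy frame (invented values)
toy_df = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual"],   # categorical
    "Location": ["Pune", "Delhi", "Kochi"],              # categorical
    "Seats": [5, 7, 5],                                  # numerical, discrete
    "Price": [4.5, 15.0, 3.9],                           # numerical, continuous
})

# Columns with a numeric dtype vs. everything else
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```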

### What are the different transmission types, and what percentage of cars is available for each type?

In [None]:
cars_df.Transmission.unique()

In [None]:
#cars_df['Transmission'].value_counts()
cars_df.Transmission.value_counts()

In [None]:
cars_df.Transmission.value_counts(normalize=True)*100

### 1. Participants Exercises:

- What are the different owner types?
- What percentage of the cars available for resale comes from each owner type?

In [None]:
#TODO by participants


### Distribution of car sale price

A distribution in statistics is a function that shows the possible values for a variable and how often they occur.

- How many cars are sold at different price ranges? For example: cars sold in the price range of 1L - 2L, 2L - 3L etc.?
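
Before drawing a histogram, the same question can be answered as a table. The sketch below bins an invented list of prices (assumed to be in lakhs) with `pd.cut` and counts the values per range; the bin edges are chosen only for illustration.

```python
import pandas as pd

# Invented prices, assumed to be in lakhs
toy_prices = pd.Series([1.2, 1.8, 2.5, 2.9, 3.1, 4.7, 1.5], name="Price")

# Left-closed, right-open ranges: [1, 2), [2, 3), [3, 4), [4, 5)
price_ranges = pd.cut(toy_prices, bins=[1, 2, 3, 4, 5], right=False)

# Count the values in each range, listed in range order
counts_per_range = price_ranges.value_counts().sort_index()
print(counts_per_range)
```

A histogram shows these same counts as bar heights, one bar per range.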

#### Histogram for plotting Continuous Variables (Price)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn

sn.set_style("whitegrid")
sn.set_context("paper")
# set_palette() applies the palette; color_palette() only returns it
sn.set_palette("Set2")

In [None]:
plt.figure(figsize=(15,5))
plt.hist(cars_df['Price']);
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.title("Histogram for Price");

#### Setting our own bins for the histogram

In [None]:
plt.figure(figsize=(15,5))
hist_data = plt.hist(cars_df['Price'], bins=list(range(0, 15, 1)));
plt.xticks(range(0, 15, 1))
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.title("Histogram for Price");
plt.savefig("price.png")

In [None]:
# distplot() is deprecated in recent seaborn releases; kdeplot() draws the same density curve
sn.kdeplot(cars_df.Price);

In [None]:
cars_df.Price.min(), cars_df.Price.max() 

#### Sort the bins by their counts

In [None]:
type(hist_data)

In [None]:
hist_data[0]

In [None]:
hist_data[1]

In [None]:
len(hist_data[1])

In [None]:
hist_data_df = pd.DataFrame( { "lv_bins" : hist_data[1][0:-1],
                               "uv_bins" : hist_data[1][1:], 
                               "count" : hist_data[0] } )

In [None]:
hist_data_df.sort_values("count", ascending=False)[0:10]

### 3. Participants Exercises:

- Draw a histogram of kilometers driven with a bin width of 10,000 km

In [None]:
#TODO by participants


## What is a Normal Distribution?

The normal distribution, also known as the Gaussian distribution, is the most important probability distribution in statistics for independent, random variables. Most people recognize its familiar bell-shaped curve in statistical reports.

- The normal distribution is a continuous probability distribution that is symmetrical around its mean. Most observations cluster around the central peak, and the probabilities for values further from the mean taper off equally in both directions.
- Extreme values in both tails of the distribution are similarly unlikely. While the normal distribution is symmetrical, not all symmetrical distributions are normal.

Source: https://statisticsbyjim.com/basics/normal-distribution/
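
How strongly observations cluster around the mean can be made concrete by computing the probability of falling within 1, 2, and 3 standard deviations. The sketch below uses `statistics.NormalDist` from the Python standard library rather than scipy, purely to keep the example dependency-free; `prob_within` is a helper name introduced here.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The three probabilities come out at roughly 68%, 95%, and 99.7%, which is where the 0.95 and 0.997 levels used later in this notebook come from.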

References:

https://en.wikipedia.org/wiki/Normal_distribution

https://courses.lumenlearning.com/math4libarts/chapter/understanding-normal-distribution/


<img src="normal.png" alt="Normal Distribution" width="600"/>

#### Finding distribution parameters of Price

https://www.cuemath.com/data/standard-deviation/

In [None]:
cars_df['Price'].mean()

In [None]:
cars_df['Price'].std()

#### 95% of the cars were sold in which price range?

In [None]:
from scipy import stats

In [None]:
stats.norm.interval(0.95,
                    cars_df['Price'].mean(),
                    cars_df['Price'].std())

In [None]:
stats.norm.interval(0.997,
                    cars_df['Price'].mean(),
                    cars_df['Price'].std())

In [None]:
len(cars_df[cars_df.Price > 10.71])

## Outliers

- In statistics, an outlier is a data point that differs significantly from other observations.
- An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
- An outlier can cause serious problems in statistical analyses.

Source: https://en.wikipedia.org/wiki/Outlier

Using the three-standard-deviation approach:

In [None]:
cars_df[cars_df.Price > 10.71].shape
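
The cutoff used above is a hard-coded number. A minimal sketch of computing the limits from the data instead is shown below, on invented prices (not the dataset). It uses a multiplier of exactly 3, which is close to, but not identical with, the 0.997 interval from `stats.norm.interval`.

```python
import numpy as np
import pandas as pd

# Invented prices (assumed lakhs): 50 ordinary values plus one extreme value
toy_prices = pd.Series(np.append(np.linspace(3.0, 7.0, 50), 40.0), name="Price")

mean_price = toy_prices.mean()
std_price = toy_prices.std()   # sample standard deviation (ddof=1)

# Limits at three standard deviations on either side of the mean
lower_limit = mean_price - 3 * std_price
upper_limit = mean_price + 3 * std_price

toy_outliers = toy_prices[(toy_prices < lower_limit) | (toy_prices > upper_limit)]
print(lower_limit, upper_limit)
print(toy_outliers)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so this approach can miss outliers when several extreme values are present.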

### 4. Participants Exercise

- Find the mean and standard deviation of Kilometers_Driven for the cars sold.
- Are there any outliers in terms of Kilometers_Driven?

In [None]:
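# Keep the numeric part of strings like "19.67 kmpl"; missing or non-numeric values become NaN
cars_df['mlg'] = pd.to_numeric(cars_df.Mileage.str.split().str[0], errors='coerce')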

In [None]:
cars_df['mlg'].mean()

In [None]:
stats.norm.interval(0.997,
                    cars_df['mlg'].mean(),
                    cars_df['mlg'].std())

In [None]:
cars_df[cars_df.mlg > 30.54].shape

### Using box plot

- A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). 
- It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

- The minimum or lowest value of the dataset
- The first quartile Q1, which represents a quarter of the way through the list of all data
- The median of the data set, which represents the midpoint of the whole list of data
- The third quartile Q3, which represents three-quarters of the way through the list of all data
- The maximum or highest value of the data set.

Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
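
The five-number summary listed above can be computed directly with `Series.quantile`. This is a sketch on invented values; note that in a box plot the whiskers usually stop at the most extreme points within 1.5 x IQR of the box, so the plotted "minimum" and "maximum" can differ from the true ones when outliers exist.

```python
import pandas as pd

# Invented prices, already sorted for readability
toy_prices = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], name="Price")

# Quantiles at 0%, 25%, 50%, 75% and 100% of the sorted data
five_number_summary = toy_prices.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
five_number_summary.index = ["min", "Q1", "median", "Q3", "max"]
print(five_number_summary)
```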

<img src="box.png" alt="Box plot" width="600"/>

In [None]:
box_info = plt.boxplot(cars_df['Price']);

In [None]:
plt.figure(figsize=(15,6))
# Pass the column as x explicitly so the plot is horizontal across seaborn versions
boxp = sn.boxplot(x=cars_df['Price']);

In [None]:
box_info

#### Finding median, IQR, min and max values

The range (maximum minus minimum) is highly sensitive to extreme values. The interquartile range is a similar measure of spread but is less sensitive to outliers. To find it, subtract the first quartile from the third quartile:

**IQR = Q3 – Q1**

Though it's not often affected much by them, the interquartile range can be used to detect outliers. This is done using these steps:


- Calculate the interquartile range for the data.
- Multiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers).
- Add 1.5 x (IQR) to the third quartile. Any number greater than this is a suspected outlier.
- Subtract 1.5 x (IQR) from the first quartile. Any number less than this is a suspected outlier.

Source: https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244
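
The four steps above can be sketched as follows, on invented values rather than the dataset. The `toy_` names are introduced here so that the example does not overwrite variables used elsewhere in the notebook.

```python
import pandas as pd

# Invented prices with one value far from the rest
toy_prices = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0], name="Price")

# Step 1: interquartile range
toy_q1 = toy_prices.quantile(0.25)
toy_q3 = toy_prices.quantile(0.75)
toy_iqr = toy_q3 - toy_q1

# Steps 2-4: fences at 1.5 x IQR below Q1 and above Q3
lower_fence = toy_q1 - 1.5 * toy_iqr
upper_fence = toy_q3 + 1.5 * toy_iqr

suspected_outliers = toy_prices[(toy_prices < lower_fence) | (toy_prices > upper_fence)]
print(lower_fence, upper_fence)
print(suspected_outliers)
```

For a quantity that cannot be negative, such as price, a negative lower fence simply means that no low-side outliers are flagged.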

In [None]:
cars_df['Price'].median()

In [None]:
from scipy import stats

In [None]:
iqr = stats.iqr(cars_df['Price'])
iqr

In [None]:
cars_df['Price'].describe()

In [None]:
min_price = cars_df['Price'].quantile(0.25) - 1.5 * iqr
max_price = cars_df['Price'].quantile(0.75) + 1.5 * iqr

In [None]:
min_price, max_price

In [None]:
cars_df[cars_df.Price > 10.30]['Price'].shape

In [None]:
outliers_df = cars_df[cars_df.Price > 10.30]

In [None]:
outliers_df.shape

In [None]:
#sn.distplot(cars_df[cars_df.Price < 10.30]['Price']);

### Creating new factors: Age of car

This is not the actual age of the car.

We are subtracting the edition year of the car from the year 2020.

In [None]:
#TODO by participants


In [None]:
cars_df['age'] = 2020 - cars_df['Year']

In [None]:
plt.hist(cars_df.age);

### 5. Participants Exercise:

- Find the outliers for the age of the cars using a box plot
- What are the min and max values?
- How many outliers are present?

### Different makes and models

We are assuming the first token to be the make and the second token to be the model.

In [None]:
cars_df['make'] = cars_df['Name'].map( lambda x: x.lower().split()[0] )

In [None]:
cars_df.make.value_counts()

In [None]:
cars_df['model'] = cars_df['Name'].map( lambda x: x.lower().split()[1] )

In [None]:
cars_df['model'].unique()

In [None]:
len(cars_df['model'].unique())

### Find out top 10 selling models

In [None]:
top_models = cars_df['model'].value_counts().reset_index()

In [None]:
top_models

In [None]:
top_models.columns = ['model', 'count']

In [None]:
top_models[0:10]

### Top 10 reselling makes

In [None]:
top_10_models = list(top_models['model'][0:10])

In [None]:
top_10_models

In [None]:
plt.figure(figsize=(15, 6))
sn.countplot(data = cars_df,
             x = 'model',
             order = top_10_models);

In [None]:
plt.figure(figsize=(15, 6))
sn.countplot(data = cars_df,
             x = 'Transmission');

In [None]:
plt.figure(figsize=(15, 6))
sn.barplot(data = top_models[0:10],
          x = 'model',
          y = 'count');

In [None]:
#cars_df[(cars_df.Year > 2000) & (cars_df.Year < 2005)]

In [None]:
top_makes = cars_df[cars_df.Price < 5.0]['make'].value_counts().reset_index()

In [None]:
top_makes

In [None]:
# The first column holds the make names ('index' in older pandas, 'make' in pandas 2.x)
top_makes_5 = list(top_makes.iloc[:, 0][0:5])

In [None]:
top_makes_5

In [None]:
plt.figure(figsize=(15, 6))
sn.countplot(data = cars_df[cars_df.Price < 5.0],
             x = 'make',
             order = top_makes_5);

### 6. Participants Exercise

- Plot the number of cars sold for each edition year (Year) using a count plot.
- Find the top 5 makes (highest number of cars sold) among used cars sold for less than 5 lakhs.

### Rules for plotting

- Single Variable (Univariate Analysis)

    - Continuous -> Histogram, boxplot, distribution plot
    - Categorical -> Count Plot/Bar Plot

- Two Variables (Bivariate Analysis)

    - Continuous + Categorical -> Box plot, Overlapped Distribution Plot
    - Continuous + Continuous -> Scatter Plot, heatmap
    - Categorical + Categorical -> Bar Plot / Count Plot, heatmap

## Analyzing two variables

### What is the average selling price of each of the top 10 models, for cars less than 3 years old?

In [None]:
top_10_models_df = cars_df[(cars_df.age < 3) 
                           & (cars_df.model.isin(top_10_models))]

In [None]:
model_avg_price = top_10_models_df.groupby("model")['Price'].mean()

In [None]:
model_avg_price = model_avg_price.reset_index()

In [None]:
model_avg_price.columns = ['model', 'avg_price']
model_avg_price

### What is the distribution of sale prices for the top 10 models, for cars less than 3 years old?

- What are the mean and variance of the sale price for each model?

In [None]:
plt.figure(figsize=(15, 6))
sn.boxplot(data = top_10_models_df,
           x = 'model',
           y = 'Price');

In [None]:
top_10_models_df[(top_10_models_df.model == 'alto')]

### 7. Participant Exercise:

- Find out how the sale price of a specific model (Swift) varies across different ages.

In [None]:
plt.figure(figsize=(15, 6))
sn.boxplot(data = cars_df[cars_df.model == 'swift'],
           x = 'age',
           y = 'Price');

## Find out the demand of top 10 selling models across different locations

In [None]:
top_10_models_df = cars_df[cars_df.model.isin(top_10_models)]
models_ct = pd.crosstab(top_10_models_df.Location,
                        top_10_models_df.model,
                        normalize = 'index') * 100

In [None]:
models_ct
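
To see what `normalize = 'index'` does, here is a small sketch on invented rows (not the dataset): each row of the cross-tabulation is divided by its own total, so after multiplying by 100 every location's row sums to 100.

```python
import pandas as pd

# Hypothetical listings (invented): four in Pune, two in Delhi
toy_df = pd.DataFrame({
    "Location": ["Pune", "Pune", "Pune", "Pune", "Delhi", "Delhi"],
    "model":    ["swift", "swift", "swift", "city", "city", "swift"],
})

# Row-wise percentages: share of each model within a location
toy_ct = pd.crosstab(toy_df.Location, toy_df.model, normalize="index") * 100
row_totals = toy_ct.sum(axis=1)

print(toy_ct)
print(row_totals)
```

Reading along a row therefore compares models within one location; the values in a column are not shares of a model across locations.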

In [None]:
plt.figure(figsize=(10, 8))
sn.heatmap(models_ct, annot=True, fmt = "0.2f", cmap="YlGnBu");

In [None]:
top_10_models_df.shape

### 8. Participants Exercise:

- Find out the demand for petrol and diesel cars across top 10 selling models

In [None]:
top_10_models_df = cars_df[cars_df.model.isin(top_10_models)]
models_ct = pd.crosstab(top_10_models_df.model,
                        top_10_models_df.Fuel_Type,
                        normalize = 'index') * 100

In [None]:
models_ct

In [None]:
plt.figure(figsize=(3, 12))
sn.heatmap(models_ct, annot=True, fmt = "0.2f", cmap="YlGnBu");

## Converting datatypes of columns

Engine capacity, power and mileage are stored as text (a number followed by a unit), not as numerical values. We need to convert them to numerical values for analysis.

In [None]:
cars_df.info()

In [None]:
cars_df[0:10]

In [None]:
import math

In [None]:
def get_float_val(x):
    if x is None:
        return None   
    
    val = str(x).split()[0]
    
    if val.replace(".","",1).isdigit():
        return float(val)        
    else:
        return None
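
As an alternative to a row-by-row helper, the same conversion can be sketched with vectorised pandas calls. The strings below are invented examples of the "number plus unit" format, including a non-numeric entry and a missing value. This is not a drop-in replacement: `pd.to_numeric` also accepts forms such as negative numbers or exponent notation, which the helper above rejects.

```python
import pandas as pd

# Invented raw values in the "number unit" format
raw_mileage = pd.Series(["26.6 km/kg", "19.67 kmpl", "null bhp", None, "18.2 kmpl"])

# Take the first whitespace-separated token of each string
first_token = raw_mileage.str.split().str[0]

# Convert to float; anything that is not a number becomes NaN
mileage_numeric = pd.to_numeric(first_token, errors="coerce")
print(mileage_numeric)
```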

In [None]:
cars_df['mileage_new'] = cars_df.Mileage.map(lambda x: get_float_val(x))

In [None]:
cars_df.mileage_new

In [None]:
cars_df['engine_new'] = cars_df.Engine.map(lambda x: get_float_val(x))
cars_df['power_new'] = cars_df.Power.map(lambda x: get_float_val(x))

In [None]:
cars_df.info()

## Finding correlation between two numerical variables

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It’s a common tool for describing simple relationships without making a statement about cause and effect.

We describe correlations with a unit-free measure called the correlation coefficient which ranges from -1 to +1 and is denoted by r. 

- The closer r is to zero, the weaker the linear relationship.
- Positive r values indicate a positive correlation, where the values of both variables tend to increase together.
- Negative r values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.

Source: https://www.jmp.com/en_in/statistics-knowledge-portal/what-is-correlation.html
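
The coefficient r described above can be computed by hand and compared with pandas. The sketch below uses invented values (not the dataset); `pearson_r` is a helper name introduced here, implementing the standard Pearson formula.

```python
import numpy as np
import pandas as pd

# Invented values: price rises with power and falls with age
toy_df = pd.DataFrame({
    "power": [70.0, 90.0, 120.0, 150.0, 180.0],
    "age":   [8, 6, 5, 3, 1],
    "Price": [3.0, 4.5, 6.0, 8.5, 10.0],
})

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_power_price = pearson_r(toy_df["power"], toy_df["Price"])
r_age_price = pearson_r(toy_df["age"], toy_df["Price"])

print(r_power_price, toy_df["power"].corr(toy_df["Price"]))
print(r_age_price, toy_df["age"].corr(toy_df["Price"]))
```

`Series.corr` uses the Pearson method by default, so each pair of printed values should agree up to floating-point error.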

### How do engine power and mileage affect the price of used cars?

In [None]:
sn.lmplot(data = cars_df.sample(100),
          x = 'power_new',
          y = 'Price');

In [None]:
sn.lmplot(data = cars_df.sample(200),
          x = 'mileage_new',
          y = 'Price');

### 9. Participants Exercise:

- Find out the correlation between 
    - engine capacity and price
    - Kilometer driven and price

In [None]:
sn.lmplot(data = cars_df.sample(100),
          x = 'engine_new',
          y = 'Price');

### Finding correlation between multiple variables (numerical)

In [None]:
corr_mat = cars_df[['Price', 
                    'mileage_new', 
                    'engine_new', 
                    'power_new', 
                    'Kilometers_Driven',
                    'age']].corr()

In [None]:
corr_mat

In [None]:
sn.heatmap(corr_mat,
           annot=True,
           vmin = -1.0,
           vmax = 1.0,
           cmap = sn.diverging_palette(240, 10));

#### Inferences

- Age and Kilometers_Driven are negatively correlated with price
- Engine capacity and power are positively correlated with price

### Changing the Unit of KM Driven

In [None]:
cars_df.Kilometers_Driven

In [None]:
cars_df['KM_Driven'] = cars_df['Kilometers_Driven'].map(lambda x: int(x/1000))

In [None]:
cars_df['KM_Driven']

In [None]:
cars_df.columns

### Drop the columns not required

- Drop the following columns
    - index, Name, Year, Kilometers_Driven, Mileage, Engine, Power, New_Price

In [None]:
cols_to_be_dropped = ["New_Price",
                      "index",
                      "Name",
                      "Year", 
                      "Kilometers_Driven", 
                      "Mileage", 
                      "Engine", 
                      "Power"]                      

In [None]:
new_cars_df = cars_df.drop(cols_to_be_dropped, axis = 1)

In [None]:
new_cars_df

### Saving the dataset (with new features) 

In [None]:
new_cars_df.to_csv( "new_used_car.csv", index = False)