# Exploratory Data Analysis in IPython


<img src='figures/Data-Visualization.jpg' width=500>


## Introduction

**What is Exploratory Data Analysis?**

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them. 
This step is especially important before we model the data. 
Common plots include histograms, box plots, scatter plots and many more. 
Exploring the data often takes a considerable amount of time.

**How to perform Exploratory Data Analysis?**

The answer depends on the data set you are working with. 
There is no single method or fixed set of methods for performing EDA.
In this tutorial I present common methods and plots that are used in the EDA process.

**What data are we exploring today?**

Because most people drive cars, I picked a data set of cars from Kaggle. 
The data set can be downloaded [here](https://www.kaggle.com/CooperUnion/cardataset). 
The data set contains more than 10,000 rows and more than 10 columns, with features such as: Engine Fuel Type, Engine HP, Transmission Type, highway MPG, city MPG and many more.



---



## 1. Importing the required libraries for EDA

Below are the libraries used to perform EDA in this tutorial.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation
%matplotlib inline     
sns.set(color_codes=True)



---



## 2. Loading the data into the data frame

Loading the data into a pandas data frame is one of the most important steps in EDA. The values in this data set are comma-separated, so all we have to do is read the CSV file into a data frame; pandas does the rest.

To get the data set into the notebook, I only needed one simple step. In Google Colab, on the left-hand side of the notebook, you will find a > (greater-than symbol). Clicking it opens a panel with three options; select Files. You can then upload your file with the Upload option. There is no need to mount Google Drive or use any specific libraries. One thing to remember is that uploaded files are deleted when the runtime is recycled. Whichever way you obtain the file, the path passed to `read_csv` below must point to where it is stored.

In [None]:
df = pd.read_csv("data/cars.csv")
# To display the top 5 rows 
df.head(5)               

In [None]:
df.tail(5)                        # To display the bottom 5 rows



---



## 3. Checking the types of data

Here we check the data types, because sometimes the MSRP (the price of the car) is stored as a string. In that case we have to convert the string to a numeric type before we can plot it. In this case the data is already in integer format, so there is nothing to worry about.

In [None]:
df.dtypes
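
If a price column did arrive as text, a conversion along the following lines would be needed before plotting. This is only a minimal sketch: the `raw` frame and its `$`/`,` formatting are made up for illustration and are not taken from the cars data set.

```python
import pandas as pd

# Hypothetical data: prices stored as strings with currency formatting.
raw = pd.DataFrame({"MSRP": ["$46,135", "$40,650", "n/a"]})

# Strip the symbols, then convert; errors="coerce" turns entries that
# cannot be parsed into NaN instead of raising an exception.
cleaned = raw["MSRP"].str.replace(r"[$,]", "", regex=True)
raw["MSRP"] = pd.to_numeric(cleaned, errors="coerce")

print(raw.dtypes)
```

Coercing to `NaN` keeps the bad entries visible, so they can be handled together with the other missing values later on.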



---



## 4. Dropping irrelevant columns

This step is needed in most EDA work, because some columns are irrelevant to the analysis. In this case, columns such as Engine Fuel Type, Market Category, Vehicle Style, Popularity, Number of Doors and Vehicle Size may not be useful, so we will remove them.

In [None]:
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)



---



## 5. Renaming the columns

In this instance, several of the column names are long or inconsistent, so I shortened them. This is a good practice because it improves the readability of the data set.

In [None]:
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
df.head(5)



---



## 6. Dropping the duplicate rows

A large data set often contains duplicate rows, so it is worth checking for them.
In this case we found 989 duplicate rows.

In [None]:
df.shape

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

Now let's remove the duplicate data.

In [None]:
df.count()      # Number of non-null values in each column

As seen above, there are 11914 rows, and we are removing the 989 duplicates.

In [None]:
df = df.drop_duplicates()
df.head(5)

In [None]:
df.count()
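
The duplicate check can be tried on a small frame first. The sketch below uses a made-up `toy` frame (an assumption for illustration, not the cars data) to show how `duplicated()` and `drop_duplicates()` relate to each other.

```python
import pandas as pd

# Hypothetical toy frame: the second and third rows are identical.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "Audi", "Fiat"],
    "HP": [335, 200, 200, 160],
})

n_before = len(toy)
# duplicated() marks rows that repeat an earlier row; the first copy is kept.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_before, n_duplicates, len(deduped))
```

The number of rows after `drop_duplicates()` should equal the original count minus the number of rows flagged by `duplicated()`.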



---



## 7. Dropping the missing or null values

This is similar to the previous step, but now we remove rows with null or missing values.
Here there are nearly 100 rows with missing values.
Removing data can affect your results, but here we are only removing a small fraction of the total data (about 100 rows out of more than 10,000).

In [None]:
print(df.isnull().sum())

In [None]:
df = df.dropna()    # Dropping the missing values.
df.count()

In [None]:
print(df.isnull().sum())   # After dropping the values
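
`isnull().sum()` counts missing values per column, which is not the same as the number of rows that `dropna()` removes. The sketch below, on a made-up `toy` frame (hypothetical values), counts the affected rows and the share of data that would be lost.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete rows.
toy = pd.DataFrame({
    "HP": [335.0, np.nan, 200.0, 160.0],
    "Cylinders": [6.0, 4.0, np.nan, 4.0],
})

# A row is dropped by dropna() if any of its values is missing.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
share_lost = rows_with_nulls / len(toy)
cleaned = toy.dropna()

print(f"{rows_with_nulls} of {len(toy)} rows ({share_lost:.0%}) contain a missing value")
```

Checking the share before dropping makes it easier to judge whether removing the rows is acceptable or whether filling the gaps would be a better choice.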



---



## 8. Detecting Outliers

An outlier is a point or set of points that differs markedly from the other points. 
Outliers can be very high or very low values. 
It's often a good idea to detect and remove them, because outliers can make a model less accurate. 
Here we will try the IQR score technique: a value is treated as an outlier if it lies more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3), where IQR = Q3 - Q1. 

Outliers can often be seen in a box plot. 
Shown below are the box plots of Price, Horsepower and Cylinders. 
In all the plots, the points that lie outside the whiskers are the outliers. 
The technique for finding and removing outliers used in this tutorial follows a tutorial from [Towards Data Science](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba).

In [None]:
sns.boxplot(x=df['Price'])

In [None]:
sns.boxplot(x=df['HP'])

In [None]:
sns.boxplot(x=df['Cylinders'])

In [None]:
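# Quartiles of the numeric columns only (text columns have no quantiles)
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)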

Don't worry about the individual values above; what matters is knowing how to use this technique to remove the outliers.

In [None]:
# Compare only the numeric columns that Q1, Q3 and IQR were computed for
num = df[Q1.index]
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape

As seen above, around 1,600 rows were outliers. This technique does not remove every outlier: one or two may remain afterwards. That is acceptable, because the large majority of them have been removed.
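
The IQR rule is easier to follow on a single column. The sketch below applies it to a made-up price series; the helper name `iqr_mask` and the values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)

# Hypothetical prices: one value is far above the rest.
prices = pd.Series([20000, 22000, 25000, 27000, 30000, 250000])
kept = prices[iqr_mask(prices)]

print(kept.tolist())
```

Here Q1 is 22750 and Q3 is 29250, so the accepted range is 13000 to 39000 and only the 250000 entry is dropped.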



---



## 9. Plot different features against one another (scatter), against frequency (histogram)

### Histogram

A histogram shows how often the values of a variable occur within each interval. 
In this case, the data set contains many different car manufacturers, and we would like to know which one has the most cars. 
Because Make is a categorical feature, a bar chart of the number of cars per make, the categorical counterpart of a histogram, gives this information directly.

In [None]:
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
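
The bar chart is drawn from the table that `value_counts()` produces. As a rough sketch on made-up makes (hypothetical values, not the cars data), this is what that table contains before it is plotted:

```python
import pandas as pd

# Hypothetical makes, not taken from the cars data set.
makes = pd.Series(["Audi", "BMW", "Audi", "Fiat", "Audi", "BMW"], name="Make")

counts = makes.value_counts()   # one row per make, largest count first
top_two = counts.nlargest(2)    # keep only the two most frequent makes

print(top_two)
```

Calling `.plot(kind='bar')` on such a series draws one bar per make, with the count as the bar height.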

### Heat Maps

A heat map is useful when we need to find **dependent variables**, that is, features that are correlated with one another. 
In the heat map below, the price feature appears to depend mainly on Horsepower and the Number of Cylinders.

In [None]:
plt.figure(figsize=(10,5))
c = df.corr(numeric_only=True)   # Correlate the numeric columns only
sns.heatmap(c, cmap="BrBG", annot=True)
c
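
The numbers annotated in the heat map are pairwise Pearson correlation coefficients. The sketch below computes them on a made-up `toy` frame (hypothetical values) that mixes text and numeric columns, assuming a pandas version that supports the `numeric_only` argument of `corr`.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and text columns.
toy = pd.DataFrame({
    "Make": ["Audi", "BMW", "Fiat", "Ford"],
    "HP": [200, 335, 160, 250],
    "Price": [41000, 66000, 31000, 52000],
})

# numeric_only=True skips the text column; the result is a symmetric
# matrix of coefficients between -1 and 1, with 1.0 on the diagonal.
corr = toy.corr(numeric_only=True)

print(corr.loc["HP", "Price"])
```

A coefficient close to 1 or -1 indicates a strong linear relationship; it does not by itself show that one feature causes the other.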

### Scatterplot

We generally use scatter plots to look for a correlation between **two variables**. 
Here we consider Horsepower and Price. 
In the plot below, the points follow a pattern through which a trend line can be drawn.

This plot should give us the intuition that we could use **linear regression** to make predictions based on this data.

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()
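
A trend line through such points can be estimated with a least-squares fit. The sketch below uses `np.polyfit` on made-up (HP, Price) pairs; the values are hypothetical, and this is one simple way to get a line rather than a full regression model.

```python
import numpy as np

# Hypothetical (HP, Price) pairs, not values from the cars data set.
hp = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
price = np.array([24000.0, 31000.0, 39000.0, 45000.0, 53000.0])

# Degree-1 least-squares fit: price is approximated by slope * hp + intercept.
slope, intercept = np.polyfit(hp, price, deg=1)
predicted = slope * hp + intercept

print(f"slope={slope:.1f}, intercept={intercept:.1f}")
```

Passing `hp` and `predicted` to `ax.plot` would overlay the fitted line on the scatter plot.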

## 10. Reporting Initial Findings

I think there is a strong relationship between the MSRP (Price) and the Horsepower of the car. 
My problem statement is “Predicting the price (MSRP) of the car given the specifications of the car”. 
Because I have to **predict a value**, I should use **Regression Algorithms**, with the specifications as the independent features and the price as the dependent feature. 
There are many types of Regression Algorithms, such as Linear Regression, Random Forest Regression, Lasso and Ridge Regression and many more. 
I might use one of these algorithms to implement a machine learning model that predicts the price. 
With that, I am now ready to build a model. 
