# Exploratory Data Analysis

We will explore several methods to identify the characteristics that have the most impact on the car price.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('clean_df_auto.csv')
df.head()

### Analyzing Individual Feature Patterns using Visualization 

Seaborn can be installed with pip, the Python package manager.

Import the visualization packages "Matplotlib" and "Seaborn". Don't forget "%matplotlib inline" so that plots render inside the Jupyter notebook.

In [None]:
%%capture
! pip install seaborn

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>
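
As a minimal sketch of that first step, the columns can be split by type before picking a plot. The small `toy` frame below is hypothetical stand-in data, not the car dataset itself.

```python
import pandas as pd

# Hypothetical toy data standing in for the car dataset.
toy = pd.DataFrame({
    "engine-size": [130, 152, 109],
    "price": [13495.0, 16500.0, 13950.0],
    "body-style": ["convertible", "hatchback", "sedan"],
})

# Numeric columns are candidates for scatterplots with fitted lines.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated here as categorical (candidates for boxplots).
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```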


In [None]:
df.dtypes

In [None]:
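# correlation between the numeric variables (types "int64" and "float64") in df
# (selecting numeric columns first avoids errors on text columns in recent pandas versions)
df.select_dtypes(include='number').corr()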

In [None]:
# or we can just calculate correlation between certain variables
df[['engine-size', 'wheel-base', 'city-L/100km', 'horsepower','price']].corr()

<h2>Continuous numerical variables:</h2> 

<p>Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>To start understanding the (linear) relationship between an individual variable and the price, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.</p>
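
To make the idea of a fitted line concrete, here is a rough sketch that computes a straight least-squares line by hand with NumPy. The values are made up for illustration, and it assumes the line drawn by "regplot" is an ordinary least-squares fit of this kind.

```python
import numpy as np

# Hypothetical toy values, not taken from the car dataset.
engine_size = np.array([100.0, 120.0, 140.0, 160.0, 180.0])
price = np.array([9000.0, 11500.0, 13000.0, 16500.0, 18000.0])

# Degree-1 least-squares fit: price is approximated by slope * engine_size + intercept.
slope, intercept = np.polyfit(engine_size, price, 1)

# Points on the fitted line, one per observed engine size.
fitted = slope * engine_size + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A positive slope corresponds to the "as x goes up, y goes up" reading of the plot, and a slope near zero corresponds to a nearly horizontal line.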

In [None]:
# Is "engine size" a potential predictor variable of price ?

sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

print(df[['engine-size','price']].corr(),'\n')

As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.

We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87

In [None]:
# Is "highway-mpg" a potential predictor variable of price ?

print(df[["highway-mpg",'price']].corr(),'\n')
# recent seaborn versions require x and y to be passed as keyword arguments
sns.regplot(x="highway-mpg", y="price", data=df)

As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704

In [None]:
# Is "peak-rpm" a potential predictor variable of price ?

print(df[["peak-rpm",'price']].corr(),'\n')
# recent seaborn versions require x and y to be passed as keyword arguments
sns.regplot(x="peak-rpm", y="price", data=df)

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it is not a reliable variable.

We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616

<h3>Categorical variables</h3>

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.</p>
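
A boxplot is essentially a picture of per-category quartiles. As a small sketch with made-up categories and prices, the same numbers can be computed directly:

```python
import pandas as pd

# Hypothetical toy data; the category names and prices are made up.
toy = pd.DataFrame({
    "body-style": ["sedan", "sedan", "sedan", "wagon", "wagon", "wagon"],
    "price": [10000, 12000, 14000, 9000, 15000, 30000],
})

# Quartiles of price within each category: the box edges (25%, 75%)
# and the centre line (50%, the median) of each box.
quartiles = (
    toy.groupby("body-style")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```

If the boxes of two categories span largely the same price range, the category tells us little about price.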

In [None]:
# Is "body-style" a potential predictor variable of price ?

sns.boxplot(x="body-style", y="price", data=df)

<p>We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price.</p>

In [None]:
# Is "engine-location" a potential predictor variable of price ?

sns.boxplot(x="engine-location", y="price", data=df)

The distributions of price for the two engine-location categories are distinct enough to take engine-location as a potentially good predictor of price.

In [None]:
# Is "drive-wheels" a potential predictor variable of price ?

sns.boxplot(x="drive-wheels", y="price", data=df)

The distribution of price differs between the drive-wheels categories; as such, drive-wheels could potentially be a predictor of price.

### Descriptive Statistical Analysis

The __describe__ function automatically computes basic statistics for __all continuous variables__. Any NaN values are automatically skipped in these statistics.
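
The NaN handling can be checked on a tiny example. This is only a sketch with a hypothetical column containing one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
prices = pd.Series([10000.0, 12000.0, np.nan, 14000.0], name="price")

summary = prices.describe()

print(summary["count"])  # number of non-missing values
print(summary["mean"])   # mean over the non-missing values only
```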

In [None]:
df.describe()

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type __object__ as follows:

In [None]:
df.describe(include=['object'])

__Value Counts__

Value counts are a good way of understanding how many units of each characteristic/variable we have.
The method __"value_counts"__ is applied here to a pandas Series, so we select the column with one bracket df[''] rather than two brackets df[['']].
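
A quick sketch on a hypothetical column shows both the raw counts and, with `normalize=True`, the share of each category:

```python
import pandas as pd

# Hypothetical toy column; the labels mirror the drive-wheels categories.
drive = pd.Series(["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"], name="drive-wheels")

counts = drive.value_counts()                # absolute counts, most frequent first
shares = drive.value_counts(normalize=True)  # the same counts as fractions of the total

print(counts)
print(shares)
```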

Let's look at the "drive-wheels" column.

In [None]:
df['drive-wheels'].value_counts()

We can convert the series to a DataFrame as follows:

In [None]:
df['drive-wheels'].value_counts().to_frame()

- save the result to a dataframe
- name the count column 'value_counts'
- rename the index to 'drive-wheels'

In [None]:
# naming the column in to_frame() works regardless of the default name pandas gives the counts
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame(name='value_counts')
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts

In [None]:
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame(name='value_counts')
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts

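We only have three cars with a rear engine, so the two categories are heavily imbalanced.

Despite the clear separation in the boxplot above, 'engine-location' would therefore not be a reliable predictor variable for the price.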

### Basics of Grouping

The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and the analysis is then performed on each group.
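
As a minimal sketch of what grouping does, the example below uses made-up prices and compares the grouped mean with the same value computed by hand for one category:

```python
import pandas as pd

# Hypothetical toy data; the prices are made up.
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "rwd", "4wd"],
    "price": [8000, 20000, 10000, 24000, 11000],
})

# Split by category, apply the mean to each group, combine into one result.
mean_by_group = toy.groupby("drive-wheels")["price"].mean()

# The same number for one group, computed by hand with a boolean mask.
manual_rwd = toy.loc[toy["drive-wheels"] == "rwd", "price"].mean()

print(mean_by_group)
print(manual_rwd)
```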

In [None]:
# Let's group by the variable "drive-wheels"

df['drive-wheels'].unique()

If we want to know, on average, __which type of drive wheel__ is most valuable, we can group "drive-wheels" and then average them.

In [None]:
df_group_drive = df[['drive-wheels', 'price']].groupby(['drive-wheels'], as_index=False).mean()
df_group_drive

From our data, it seems rear-wheel drive vehicles are, on average, the most expensive.

We can also group with multiple variables.

In [None]:
# Let's group by both 'drive-wheels' and 'body-style'

df_grp_driv_body = df[['drive-wheels','body-style','price']].groupby(
    ['drive-wheels','body-style'],as_index=False).mean()

df_grp_driv_body

This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the dataframe to a pivot table using the method __"pivot"__.
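
Here is a small sketch of the reshaping on hypothetical grouped data. Unlike the cell that follows, it passes `values="price"` so that the resulting columns are flat, and it deliberately leaves out one combination so that the corresponding cell comes out as NaN.

```python
import pandas as pd

# Hypothetical grouped averages; the rwd/wagon combination is deliberately absent.
grouped = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd"],
    "body-style": ["sedan", "wagon", "sedan"],
    "price": [9000.0, 10000.0, 21000.0],
})

# Rows become drive-wheels, columns become body-style, cells hold the price.
table = grouped.pivot(index="drive-wheels", columns="body-style", values="price")

print(table)
```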

In [None]:
grouped_pivot = df_grp_driv_body.pivot(index='drive-wheels', columns='body-style')
grouped_pivot 

Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well.

In [None]:
# fill missing values with 0
grouped_pivot = grouped_pivot.fillna(0) 
grouped_pivot

We can use a heatmap to visualize the relationship between variables.

The heatmap represents the target variable (price) as colour, with 'drive-wheels' and 'body-style' on the vertical and horizontal axes respectively. This allows us to visualize how the price is related to 'drive-wheels' and 'body-style'.

In [None]:
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

The default labels convey no useful information to us, so let's change them.

In [None]:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

# label names
x_labels = grouped_pivot.columns.levels[1]  # level 0 = 'price', level 1 = 'body-style'
y_labels = grouped_pivot.index              # 'drive-wheels'

# move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

# insert labels
ax.set_xticklabels(x_labels, minor=False)
ax.set_yticklabels(y_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=60)

fig.colorbar(im)
plt.show()

### Correlation

Correlation is a measure of the extent of interdependence between variables.

<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Total positive linear correlation.</li>
    <li><b>0</b>: No linear correlation, the two variables most likely do not affect each other.</li>
    <li><b>-1</b>: Total negative linear correlation.</li>
</ul>

Pearson Correlation is the default method of the function "corr".
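
For intuition, the coefficient can be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below uses made-up values and checks the result against NumPy's `corrcoef`.

```python
import numpy as np

# Hypothetical toy values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each value from its mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: co-variation of x and y, scaled by the variation of each.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```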

In [None]:
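# restrict to numeric columns so that text columns do not raise an error in recent pandas versions
df.select_dtypes(include='number').corr()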

<b>P-value</b>: 
The p-value indicates whether the correlation between two variables is statistically significant. It is the probability of observing a correlation at least this strong if the two variables were actually uncorrelated.

Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.

By convention, when the p-value is:
<ul>
    <li>$<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>$<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>$<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>$>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
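
These thresholds can be written down as a small helper. `evidence_label` is a hypothetical name introduced only for this sketch, and treating a p-value of exactly 0.1 as "none" is an assumption, since the convention above does not cover that boundary.

```python
def evidence_label(p_value: float) -> str:
    """Map a p-value to the conventional wording listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    # Assumption: a p-value of exactly 0.1 is treated as "none".
    return "none"


for p in (0.0001, 0.02, 0.07, 0.3):
    print(p, evidence_label(p))
```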

In [None]:
# we can obtain this information using the "stats" module in the "scipy" library
from scipy import stats

### Calculating the Correlation and P-values of the Variables

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print('for **wheel-base** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>
<p>Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585)</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print('for **horsepower** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>

<p>Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1)</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print('for **length** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>
<p>Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print('for **width** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

##### Conclusion:

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print('for **curb-weight** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>
<p>Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print('for **engine-size** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>

<p>Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print('for **bore** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>
<p>Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'city-mpg' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print('for **city-mpg** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

<h5>Conclusion:</h5>
<p>Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of ~ -0.687 shows that the relationship is negative and moderately strong.</p>

In [None]:
# Let's calculate the Pearson Correlation Coefficient and P-value of 'highway-mpg' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print('for **highway-mpg** variable ','\n\n','pearson_corr: ',pearson_coef,'\n', 'p_value: ', p_value)

##### Conclusion:
Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of ~ -0.705 shows that the relationship is negative and moderately strong.

<h3>ANOVA: Analysis of Variance</h3>
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>: The p-value tells us how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.</p>
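
To show where the F-test score comes from, the sketch below computes it by hand on three made-up groups as the ratio of between-group variance to within-group variance, then compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical toy groups (for example, prices in thousands for three categories).
groups = [
    np.array([8.0, 9.0, 10.0]),
    np.array([18.0, 20.0, 22.0]),
    np.array([10.0, 11.0, 12.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between groups / mean square within groups.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```

Group means that are far apart relative to the spread inside each group give a large F-test score and a small p-value.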

<h3>Drive Wheels</h3>

<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see whether the different types of 'drive-wheels' impact 'price', we first group the data.</p>

In [None]:
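# we don't use the .mean() method because ANOVA will handle it
# grouping by a single column name (not a list) keeps get_group('rwd') working across pandas versions

grouped_drive_test = df[['drive-wheels', 'price']].groupby('drive-wheels')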

We can obtain the values of a single group using the method "get_group".

In [None]:
grouped_drive_test.get_group('rwd')['price']

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.

In [None]:
f_val, p_val = stats.f_oneway(grouped_drive_test.get_group('fwd')['price'],
                             grouped_drive_test.get_group('rwd')['price'],
                             grouped_drive_test.get_group('4wd')['price'])

print("ANOVA Results:", "\n" "F-test_score = ", f_val, "P-value = ", p_val)

The result shows a large F-test score, indicating a strong correlation, and a p-value of almost 0, implying almost certain statistical significance. But does this mean that all three tested groups are this highly correlated?
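
One compact way to examine every pair of groups is to loop over the combinations. The sketch below uses made-up prices rather than the car dataset, and the `pairwise` dictionary is a name introduced here only for illustration.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical toy data; the prices are made up.
toy = pd.DataFrame({
    "drive-wheels": ["fwd"] * 3 + ["rwd"] * 3 + ["4wd"] * 3,
    "price": [8.0, 9.0, 10.0, 18.0, 20.0, 22.0, 10.0, 11.0, 12.0],
})

# One price Series per drive-wheels category.
prices = {name: grp["price"] for name, grp in toy.groupby("drive-wheels")}

# Run a one-way ANOVA on every pair of categories.
pairwise = {}
for a, b in combinations(sorted(prices), 2):
    f_val, p_val = stats.f_oneway(prices[a], prices[b])
    pairwise[(a, b)] = (f_val, p_val)
    print(a, "vs", b, "F-test score =", f_val, "P-value =", p_val)
```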

__Separately: fwd and rwd__

In [None]:
f_val, p_val = stats.f_oneway(grouped_drive_test.get_group('fwd')['price'],
                             grouped_drive_test.get_group('rwd')['price'])

print("ANOVA Results:", "\n" "F-test_score = ", f_val, "P-value = ", p_val)

__4wd and rwd__

In [None]:
f_val, p_val = stats.f_oneway(grouped_drive_test.get_group('4wd')['price'],
                             grouped_drive_test.get_group('rwd')['price'])

print("ANOVA Results:", "\n" "F-test_score = ", f_val, "P-value = ", p_val)

__4wd and fwd__

In [None]:
f_val, p_val = stats.f_oneway(grouped_drive_test.get_group('4wd')['price'],
                             grouped_drive_test.get_group('fwd')['price'])

print("ANOVA Results:", "\n" "F-test_score = ", f_val, "P-value = ", p_val)

<h2>Conclusion: Important Variables</h2>

<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:</p>

Continuous numerical variables:
<ul>
    <li>Length</li>
    <li>Width</li>
    <li>Curb-weight</li>
    <li>Engine-size</li>
    <li>Horsepower</li>
    <li>City-mpg</li>
    <li>Highway-mpg</li>
    <li>Wheel-base</li>
    <li>Bore</li>
</ul>
    
Categorical variables:
<ul>
    <li>Drive-wheels</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>