<a href="https://colab.research.google.com/github/khraumz/DAD/blob/main/Hands_on_Data_Understanding_1_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Understanding 1
1. Data Collection
2. Exploratory Data Analysis

# Data Understanding 2
- Summary Statistics

# 1.Data Collection

## About Automobile Dataset
An automobile dataset typically includes information about various types of vehicles, such as cars, trucks, and motorcycles. This dataset consists of data from the 1985 Ward's Automotive Yearbook and the other sources listed below.

Sources:

1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

This data set consists of three types of entities:

a) the specification of an auto in terms of various characteristics,

b) its assigned insurance risk rating, which corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less risky), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

c) its normalized losses in use as compared to other cars. It is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc…), and represents the average loss per car per year.


The information on the vehicle's price, both new and used, could include the manufacturer's suggested retail price (MSRP), as well as prices from various dealers and resellers. Overall, an automobile dataset can be a valuable resource for researchers, analysts, and businesses in the automotive industry.


## A) Data Retrieval via API Kaggle

## 1. Install Kaggle Module:
You can install it inline in the notebook as follows.
(or externally using anaconda prompt)

In [None]:
!pip install kaggle

## 2. Create API Token
1) Log in to Kaggle.com
2) Go to Profile (photo) --> Settings
3) Scroll down to API and click Create New Token
The kaggle.json file will be downloaded automatically.

##3. Upload json file
To simplify the workflow, you do not need to manually move the downloaded JSON file into the working directory. Instead, use the code snippet below to directly upload the JSON file into your Colab notebook.

In [None]:
import os

# Upload kaggle.json to Colab - kaggle.json is needed for api access
# It can be downloaded from Kaggle account settings
from google.colab import files
files.upload()

# Create a directory for the Kaggle API token
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Change the permissions of the token file
!chmod 600 ~/.kaggle/kaggle.json

##4. Download Dataset from Kaggle


In [None]:
# Show the help text for listing/searching datasets on Kaggle
!kaggle datasets list -h

In [None]:
#for example, you can search by the hottest dataset
!kaggle datasets list --sort-by hottest

In [None]:
#we want to specifically choose the automobile dataset
!kaggle datasets list --search automobile

In [None]:
#choose the oldest one, by toramky (first row)
# Download and extract dataset, by default will be put on the working directory
!kaggle datasets download -d toramky/automobile-dataset --force

In [None]:
# Unzip the dataset
!unzip -q automobile-dataset.zip -d automobile-dataset

In [None]:
# checking the content of the unzipped dataset

!ls automobile-dataset

<h2>Save Dataset</h2>
<p>
Pandas also enables us to save a dataset to CSV. The <code>dataframe.to_csv()</code> method takes the file path and name, in quotation marks, inside the parentheses.
</p>
<p>
For example, to save the dataframe <b>df</b> as <b>automobile.csv</b> on your local machine, you can call <code>df.to_csv('automobile.csv', index=False)</code>, where <code>index=False</code> means the row names will not be written.
</p>
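
As a minimal sketch of the save step, assuming a small stand-in dataframe (the values in `df_demo` are made up for illustration) and a temporary folder in place of your own path, the file can be written and read back like this:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the automobile dataframe
df_demo = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "automobile.csv")
    df_demo.to_csv(csv_path, index=False)  # index=False: do not write row labels
    df_back = pd.read_csv(csv_path)        # read it back to confirm the round trip

print(df_back)
```

Without `index=False`, the row labels would be written as an extra unnamed first column and would show up again when the file is read.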

We can also read and save other file formats. We can use similar functions like **`pd.read_csv()`** and **`df.to_csv()`** for other data formats. The functions are listed in the following table:


<h2>Read/Save Other Data Formats</h2>


| Data Format  |        Read       |            Save |
| ------------ | :---------------: | --------------: |
| csv          |  `pd.read_csv()`  |   `df.to_csv()` |
| json         |  `pd.read_json()` |  `df.to_json()` |
| excel        | `pd.read_excel()` | `df.to_excel()` |
| hdf          |  `pd.read_hdf()`  |   `df.to_hdf()` |
| sql          |  `pd.read_sql()`  |   `df.to_sql()` |
| ...          |        ...        |             ... |

In [None]:
# Read the CSV, save it as dataframe df and display the first few rows
import pandas as pd
df = pd.read_csv('/content/automobile-dataset/Automobile_data.csv')
df.head()

## B) Manual Data Retrieval
Alternatively, you can download the csv file manually and put in the working directory (for example: Colab Notebooks on your Google Drive). Download link: https://www.kaggle.com/datasets/toramky/automobile-dataset

In [None]:
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Import pandas
import pandas as pd

# Step 3: Define the file path (adjust username / path if needed)
file_path = '/content/drive/My Drive/Colab Notebooks/DAD 2025-2026/Automobile_data.csv'

# Step 4: Read the CSV file
df_manual = pd.read_csv(file_path)

# Step 5: Display the first few rows
df_manual.head()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #1: </h1>
<b>Check the bottom 10 rows of data frame "df".</b>
</div>




In [None]:
# Write your code below and press Shift+Enter to execute
# show the bottom 10 rows using dataframe.tail() method

<details><summary>Click here for the solution</summary>

```python
print("The last 10 rows of the dataframe\n")
df.tail(10)
```

</details>

# 2.Exploratory Data Analysis (EDA)

###2.1. Distinguish Attributes

<h3><b>Data Types</b></h3>
<p>
Data has a variety of types.<br>

The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:

</p>


In [None]:
df.dtypes

<h3><b>Info</b></h3>

Another method you can use to check your dataset is dataframe.info().

It provides a concise summary of your DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [None]:
#Column checking
df.info()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #2: </h1>

Answer the following questions by analyzing the result of df.info()
1. How many attributes are there?
2. How many records/rows?
3. How many null values?
4. Which are the numerical attributes? Which are the categorical ones?
5. Are they already in the right data type?
</div>





Answers:
1. ...
2. ...
3. ...
4. ...
5. ...

<details><summary>Click here for the solution</summary>

```
1. 26; 25 descriptive attributes, 1 target
2. 205
3. 0 according to df.info(), because missing values are stored as "?" rather than NaN
4. Numerical: those whose dtype is int64 or float64; categorical: those whose dtype is object
5. Not necessarily. "int64" could represent categories (i.e. 1, 2, 3, etc.), so we should explore whether the unique values represent categories or numeric values. A numeric column is recognized as "object" if it contains a symbol like "?" (from df.head(5) we can see that "normalized-losses" has "?" in the first 3 rows)
```

</details>

###Missing Value Detection

In [None]:
df.isnull().any()

It seems that we don't have missing values, but if we look at the first 5 rows we can clearly see that missing values are indicated with '?' instead of NaN.
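
A quick way to confirm this is to count the '?' markers directly. The sketch below uses a made-up three-row sample (`df_demo` is hypothetical, not the real data) to show that the NaN-based check finds nothing while a direct comparison does:

```python
import pandas as pd

# Hypothetical three-row sample that mimics how the raw file marks missing values
df_demo = pd.DataFrame({
    "normalized-losses": ["?", "164", "?"],
    "make": ["alfa-romero", "audi", "audi"],
    "price": ["13495", "?", "17450"],
})

nan_counts = df_demo.isnull().sum()          # NaN-based check: finds nothing
placeholder_counts = (df_demo == "?").sum()  # counts the '?' markers per column

print(nan_counts)
print(placeholder_counts)
```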

We can also see that some quantitative/numerical variables are represented as object in this dataframe.

The dataset contains both numerical and categorical data.

**Numerical variables** take numeric values, for example price.

**Categorical variables** take one of a finite number of values, for example body-style.

To keep things simple, we will keep the original distinction even though some variables could change type. For example, the number of doors could be treated as numeric, but in this case it has just 3 levels ("four", "two" and unknown), so we could treat it as categorical as well.

In [None]:
df.head(5)

It is shown that "normalized-losses" contains numerical values, but it is recognized as "object" (from df.info()) due to the presence of "?" (unknown).

**We should examine each "object" (categorical) attribute to see if it actually contains numeric data.**

In [None]:
df["normalized-losses"].value_counts()

Doing so, we discover that "normalized-losses" is in practice a numeric variable and should be converted to a numeric type as well.

<h1> Question #3: </h1>

**Examine other "object" features the same way!**

In [None]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click here for the solution</summary>

```python
#1   normalized-losses  205 non-null    object ->num
#2   make               205 non-null    object ->obj
#3   fuel-type          205 non-null    object ->obj
#4   aspiration         205 non-null    object ->obj
#5   num-of-doors       205 non-null    object ->num in obj (one, two, etc)
#6   body-style         205 non-null    object ->obj
#7   drive-wheels       205 non-null    object ->obj
#14  engine-type        205 non-null    object ->obj
#15  num-of-cylinders   205 non-null    object ->num in obj (one, two, etc)
#17  fuel-system        205 non-null    object ->obj
#18  bore               205 non-null    object ->num
#19  stroke             205 non-null    object ->num
#21  horsepower         205 non-null    object ->num
#22  peak-rpm           205 non-null    object ->num
#25  price              205 non-null    object ->num

# we leave the case "object ->num in obj" like 'num-of-doors' and 'num-of-cylinders' as it is.

num_as_obj = ['normalized-losses', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price']
for col in num_as_obj:
    # Show value count of each column
    print(f"Value counts for column: {col}")
    print(df[col].value_counts())
    print("\n")

```

</details>


OOPS!!! We just discovered that some variables are not really categorical. Some of them could be converted to numeric by replacing some values.

These variables are: 'normalized-losses', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price'.

This is because the missing-data marker "?" makes pandas read the whole column as object. So, next we solve the missing data problem.
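
As an alternative sketch (not the approach this notebook takes), the "?" markers could be turned into missing values at load time by passing `na_values` to `pd.read_csv`. The CSV text below is a made-up excerpt, not the real file:

```python
import io

import pandas as pd

# Hypothetical excerpt in the same style as the raw CSV ('?' marks a missing value)
raw_csv = io.StringIO(
    "make,horsepower,price\n"
    "audi,102,13950\n"
    "bmw,?,16430\n"
    "isuzu,78,?\n"
)

# na_values tells the parser to treat '?' as NaN while reading
df_demo = pd.read_csv(raw_csv, na_values="?")

print(df_demo.dtypes)
print(df_demo.isnull().sum())
```

With this option the affected columns are parsed as numeric right away, and `isnull()` reports the gaps.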

**Missing Data**

Fill the missing data of normalized-losses, price, horsepower, peak-rpm, bore and stroke (numeric) with the respective column mean, or with the median (which is more robust to outliers).


Fill the missing data of the categorical column num-of-doors with the mode of the column, i.e. "four".
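
To see why the median can be the safer choice when a column has outliers, here is a minimal sketch with made-up prices (the `prices` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value and one extreme outlier
prices = pd.Series([7000.0, 8000.0, 9000.0, np.nan, 45000.0])

mean_fill = prices.fillna(prices.mean())      # the mean is pulled up by the outlier
median_fill = prices.fillna(prices.median())  # the median depends only on the middle values

print("filled with mean:  ", mean_fill[3])
print("filled with median:", median_fill[3])
```

In this sample the mean-based fill lands far above the three typical prices, while the median-based fill stays among them.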

In [None]:
import numpy as np

#replace all "?" in all columns with NaN
df = df.replace('?',np.nan)
df.isnull().sum()

In [None]:
# Fill missing numerical data with the mean
numerical_cols_to_fill = ['normalized-losses', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price']
for col in numerical_cols_to_fill:
    # Ensure the column is numeric before calculating the mean
    df[col] = pd.to_numeric(df[col], errors='coerce')
    mean_val = df[col].mean()
    # Assign the result back: calling fillna(..., inplace=True) on a selected column
    # is deprecated and may leave df unchanged in recent pandas versions
    df[col] = df[col].fillna(mean_val)

# Fill missing categorical data ('num-of-doors') with the mode
if df['num-of-doors'].isnull().any():
    mode_val = df['num-of-doors'].mode()[0]
    df['num-of-doors'] = df['num-of-doors'].fillna(mode_val)

# Convert columns to appropriate types after filling NaNs
# (astype(int) drops the decimals of mean-filled values)
df['normalized-losses'] = df['normalized-losses'].astype(float)
df['bore'] = df['bore'].astype(float)
df['stroke'] = df['stroke'].astype(float)
df['horsepower'] = df['horsepower'].astype(int)
df['peak-rpm'] = df['peak-rpm'].astype(int)
df['price'] = df['price'].astype(int)

df.head()

###Value Counts


<p>Value counts are a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Here we want the counts of a single column, so we select it with one bracket, <code>df['drive-wheels']</code>, which returns a pandas Series, rather than with two brackets, <code>df[['drive-wheels']]</code>, which returns a one-column dataframe.</p>
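
The difference between the two bracket styles can be checked directly. A minimal sketch, assuming a made-up five-row sample of the column:

```python
import pandas as pd

# Hypothetical mini sample of the 'drive-wheels' column
df_demo = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd"]})

one_bracket = df_demo["drive-wheels"]     # a pandas Series
two_brackets = df_demo[["drive-wheels"]]  # a one-column DataFrame

print(type(one_bracket).__name__, type(two_brackets).__name__)
print(one_bracket.value_counts())
```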


In [None]:
df['drive-wheels'].value_counts()

We can convert the series to a dataframe as follows:


In [None]:
df['drive-wheels'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the counts column to 'value_counts'.


In [None]:
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
# The counts column is named 'drive-wheels' in pandas 1.x and 'count' in pandas 2.x,
# so both names are mapped to 'value_counts'
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts', 'count': 'value_counts'}, inplace=True)
drive_wheels_counts

Now let's rename the index to 'drive-wheels':


In [None]:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts

We can repeat the above process for the variable 'engine-location'.


In [None]:
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
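# Handle both the pandas 1.x column name ('engine-location') and the pandas 2.x name ('count')
engine_loc_counts.rename(columns={'engine-location': 'value_counts', 'count': 'value_counts'}, inplace=True)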
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)

###Summary Statistics

<p>After examining the value counts of the engine location, we see that engine location would not be a good predictor variable for the price. This is because only three cars have a rear engine while all the others have a front engine, so the data is heavily imbalanced. Thus, we are not able to draw any conclusions about the engine location.</p>


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li>
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%), from which the IQR (Interquartile Range) can be derived</li>
    <li>the maximum value</li>
</ul>
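
As a small sketch of how these statistics relate, the interquartile range can be computed from the 25% and 75% entries that describe returns. The engine sizes below are made up so that the quartiles are easy to check by hand:

```python
import pandas as pd

# Hypothetical engine sizes, chosen so the quartiles are easy to verify
engine_size = pd.Series([90, 100, 110, 120, 130])

summary = engine_size.describe()
iqr = summary["75%"] - summary["25%"]  # interquartile range = Q3 - Q1

print(summary)
print("IQR:", iqr)
```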


####*Summary Statistics of Numerical Variables*

In [None]:
#Summary statistics: provide various summary statistics, excluding NaN (Not a Number) values.
df.describe()

This shows the statistical summary of all numeric-typed (int, float) columns.
For example, the attribute "symboling" has 205 counts, the mean value of this column is 0.83, the standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th percentile is 2, and the maximum value is 3.

However, what if we would also like to check all the columns including those that are of type object?


You can add the argument include = "all" inside the parentheses. Let's try it again.

In [None]:
# describe all the columns in "df"
df.describe(include = "all")

Now it provides the statistical summary of all the columns, including object-typed attributes.
For the object-typed columns, we can now see how many unique values there are, which value is the most frequent (top), and how often it occurs (freq).

Some values in the table above show as "NaN". This is because those numbers are not available regarding a particular column type.

####*Summary Statistics of Categorical Variables*

We can apply the method "describe" on the variables of type 'object' as follows:


In [None]:
df.describe(include=['object'])

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #4: </h1>

<p>
You can select the columns of a dataframe by indicating the name of each column. For example, you can select the three columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']]</code>
</p>
<p>
Where "column" is the name of the column, you can apply the method ".describe()" to get the statistics of those columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']].describe()</code>
</p>

Apply the method ".describe()" to the columns 'length' and 'compression-ratio'.

</div>


In [None]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click here for the solution</summary>

```python
df[['length', 'compression-ratio']].describe()
```

</details>


###2.2. Univariate Analysis


What are the main characteristics that have the most impact on the car price?

This step is analyzing individual feature patterns using visualization.


If Seaborn is not installed yet, it can be installed with pip, the Python package manager (<code>pip install seaborn</code>).


Import visualization packages "Matplotlib" and "Seaborn". Don't forget about "%matplotlib inline" to plot in a Jupyter notebook.


In [None]:
import matplotlib.pyplot as plt #for visualization
import seaborn as sns #for visualization
%matplotlib inline

<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [None]:
# list the data types for each column (again, to check after conversion)
print(df.dtypes)

In [None]:
#Splitting variables in numerical and objects
df.select_dtypes(include=object).columns,df.select_dtypes(include=np.number).columns

In [None]:
#copy each index and save to categorical_columns and numerical_column respectively.
categorical_columns = ['make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
        'drive-wheels', 'engine-location', 'engine-type', 'num-of-cylinders',
        'fuel-system']
numerical_columns = ['symboling', 'normalized-losses', 'wheel-base', 'length', 'width',
        'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
        'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
        'highway-mpg', 'price']

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #5:</h3>

<b>What is the data type of the column "peak-rpm"? </b>

</div>


In [None]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click here for the solution</summary>

```python
# int64, because 'peak-rpm' was cast with astype(int) in the cleaning step
df['peak-rpm'].dtype
```

</details>


####Categorical Variable Plot
We can use a count plot for the categorical features to get an idea of how they are distributed.

In [None]:
for i in categorical_columns:
    plt.figure(figsize=(10, 7))
    sns.countplot(
        x=i,
        data=df,
        hue=i,            # explicitly map colors to categories
        order=df[i].value_counts().index,
        palette="Set2",   # you can try "Set1", "Paired", "husl", "coolwarm", etc.
        legend=False,     # hide duplicate legend
    )

    plt.title(f"Distribution of {i}", fontsize=14)
    plt.xlabel(i, fontsize=12)
    plt.ylabel("Count", fontsize=12)
    plt.xticks(rotation=45)
    plt.show()


**Findings**

1. More than 70% of the vehicles have an ohc engine type
2. 57% of the cars have four doors
3. 85% of the vehicles run on gas
4. The most common body style is sedan (around 48%), followed by hatchback (32%)
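
Percentages like these can be read off the plots, or computed directly. A minimal sketch, assuming a made-up five-row sample in place of `df['fuel-type']`:

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df['fuel-type']
fuel_type = pd.Series(["gas", "gas", "diesel", "gas", "gas"])

# normalize=True returns proportions instead of raw counts
share_pct = fuel_type.value_counts(normalize=True).mul(100).round(1)

print(share_pct)
```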

####**Numeric Variable Plot**

In [None]:
for i in numerical_columns:
  df[i].hist(figsize=(10,8),bins=6)
  plt.xlabel(i, fontsize=12)
  plt.ylabel("Count", fontsize=12)
  plt.xticks(rotation=45)
  plt.tight_layout()
  plt.show()

**Findings**

1. Most of the cars have a curb weight in the range 1900 to 3100
2. The engine size is mostly in the range 60 to 190
3. Most vehicles have a horsepower of 50 to 125
4. Most vehicles are in the price range 5000 to 18000
5. Peak rpm is mostly distributed between 4600 and 5700

##2.3. Bivariate Analysis

Bivariate analysis is the statistical method of analyzing the relationship between two variables at the same time.

The goal is to see whether and how the two variables are related (e.g., associated, correlated, dependent, or independent). It is used to observe the relationship between descriptive features and the target feature, and between two descriptive features (ensure that they are independent/low correlation).

Common forms of bivariate analysis:

1. Numerical vs. Numerical → correlation, scatter plot, regression (e.g., engine-size vs. price).

2. Categorical vs. Categorical → cross-tabulation, chi-square test (e.g., engine-location vs. body-style).

3. Numerical vs. Categorical → t-test, ANOVA, boxplot (e.g., engine-location vs. price).
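
The categorical vs. categorical case is not demonstrated later in this notebook, so here is a minimal sketch of a cross-tabulation followed by a chi-square test. The eight rows in `df_demo` are made up, and a sample this small is far too little data for a reliable test; it only shows the mechanics:

```python
import pandas as pd
from scipy import stats

# Hypothetical sample of two categorical columns
df_demo = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "gas", "diesel", "gas", "gas", "diesel"],
    "aspiration": ["std", "std", "turbo", "std", "turbo", "turbo", "std", "std"],
})

# Cross-tabulation: rows are fuel types, columns are aspiration types
table = pd.crosstab(df_demo["fuel-type"], df_demo["aspiration"])

# Chi-square test of independence on the contingency table
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print("chi2 =", chi2, ", p =", p_value, ", dof =", dof)
```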

###**Continuous Numerical Variables**

<p>Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using <b>scatterplots with fitted lines</b>.</p>

<p>In order to start understanding the (linear) relationship between an individual variable and the price, we can use "regplot" which plots the scatterplot plus the fitted regression line for the data.</p>
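
To make the idea of a fitted line concrete, the sketch below computes a straight-line least-squares fit with NumPy. The numbers are made up and lie exactly on a line, so the recovered slope and intercept are easy to check; real data will scatter around the line instead:

```python
import numpy as np

# Hypothetical engine sizes and prices lying exactly on price = 100 * size + 1000
engine_size = np.array([100.0, 120.0, 150.0, 200.0])
price = 100.0 * engine_size + 1000.0

# Degree-1 least-squares fit: returns [slope, intercept]
slope, intercept = np.polyfit(engine_size, price, 1)

print("slope:", slope, "intercept:", intercept)
```

A positive slope corresponds to a rising line in the plot, a negative slope to a falling one, and a slope near zero to an almost horizontal line.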


Let's see several examples of different linear relationships:


####*Positive Linear Relationship*


Let's find the scatterplot of "engine-size" and "price".


In [None]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

<p>As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.</p>


We can examine the correlation between 'engine-size' and 'price' and see that it's approximately 0.87.


In [None]:
df[["engine-size", "price"]].corr()

Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-mpg" and "price".


####*Negative Linear Relationship*

In [None]:
sns.regplot(x="highway-mpg", y="price", data=df)

<p>As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.</p>


We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704.


In [None]:
df[['highway-mpg', 'price']].corr()

####*Weak Linear Relationship*


Let's see if "peak-rpm" is a predictor variable of "price".


In [None]:
sns.regplot(x="peak-rpm", y="price", data=df)

<p>Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.</p>


We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616.


In [None]:
df[['peak-rpm','price']].corr()

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  5 a): </h1>

<p>Find the correlation  between x="stroke" and y="price".</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[["stroke","price"]].  </p>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#The correlation is 0.0823, the non-diagonal elements of the table.

df[["stroke","price"]].corr()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1>Question  5 b):</h1>

<p>Given the correlation results between "price" and "stroke", do you expect a linear relationship?</p>
<p>Verify your results using the function "regplot()".</p>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute

#low correlation

<details><summary>Click here for the solution</summary>

```python

#There is a weak correlation between the variables 'stroke' and 'price', so a linear regression will not fit well. We can demonstrate this using "regplot".

#Code:
sns.regplot(x="stroke", y="price", data=df)

```

</details>


We can plot all the numerical variables in one figure, using:

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(df)
plt.show()

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(df, hue = "fuel-type", height=2, markers=["o", "s"])
plt.show()

###**Categorical Variables**

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.</p>


Let's look at the relationship between "body-style" and "price".


In [None]:
sns.boxplot(x="body-style", y="price", data=df)

<p>We see that the distributions of price between the different body-style categories have a significant overlap, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":</p>


In [None]:
sns.boxplot(x="engine-location", y="price", data=df)

<p>Here we see that the distributions of price for the two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price, although with only three rear-engine cars this comparison rests on very little data.</p>


Let's examine "drive-wheels" and "price".


In [None]:
# drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)

<p>Here we see that the distribution of price between the different drive-wheels categories differs. As such, drive-wheels could potentially be a predictor of price.</p>
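
The numbers behind such a comparison can also be printed. A minimal sketch, assuming made-up prices for two categories, that computes the per-group quartiles a boxplot draws:

```python
import pandas as pd

# Hypothetical prices per drive-wheels category
df_demo = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd"],
    "price": [6000, 7000, 8000, 15000, 20000, 25000],
})

# Quartiles per group: the box spans 25% to 75%, the line inside is the 50% (median)
box_stats = df_demo.groupby("drive-wheels")["price"].describe()[["25%", "50%", "75%"]]

print(box_stats)
```

When the 25% to 75% ranges of two groups do not overlap, as in this sample, the boxes in the plot are clearly separated.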


We may also draw boxplots of every numerical column grouped by a particular categorical feature.

In [None]:
# DataFrame.boxplot creates its own figure, so no plt.figure() call is needed
df.boxplot(by="fuel-type", figsize=(15, 10))
plt.show()

<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>

<p>The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the car price?".</p>

<p>To get a better measure of the important characteristics, we look at the correlation of these variables with the car price. In other words: how is the car price dependent on this variable?</p>


###**Correlation and Causation**


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


####**Pearson Correlation**
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation. The two variables may still be related in a nonlinear way.</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
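
To show where the values 1 and -1 come from, here is a minimal sketch that computes the coefficient by hand in plain Python. The helper `pearson_r` and the sample values are made up for illustration; in practice you would use the "corr" method:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical values chosen to hit the two extremes of the scale
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises exactly in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly in step with x

print(r_up, r_down)
```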


<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' (numerical) variables. For categorical variables, we have to convert the classes into a numerical representation. We will learn how to do this in the data preprocessing step.</p>


In [None]:
df[numerical_columns].corr()

Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05, which means that we call a correlation significant only when a result this extreme would occur less than 5% of the time by chance alone.</p>

By convention, when

<ul>
    <li>the p-value is < 0.001: we say there is strong evidence that the correlation is significant.</li>
    <li>the p-value is < 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>the p-value is < 0.1: there is weak evidence that the correlation is significant.</li>
    <li>the p-value is > 0.1: there is no evidence that the correlation is significant.</li>
</ul>
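
These bands can be written as a small helper. The function `evidence_label` below is a hypothetical convenience for this notebook, not part of scipy, and the p-values passed to it are made up:

```python
def evidence_label(p_value):
    """Map a p-value to the conventional wording used in the list above."""
    if p_value < 0.001:
        return "strong evidence"
    if p_value < 0.05:
        return "moderate evidence"
    if p_value < 0.1:
        return "weak evidence"
    return "no evidence"

# Hypothetical p-values, one per band
for p in (0.0004, 0.03, 0.08, 0.4):
    print(p, "->", evidence_label(p))
```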


We can obtain this information using  "stats" module in the "scipy"  library.


In [None]:
from scipy import stats

<h3>Wheel-Base vs. Price</h3>


Let's calculate the  Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.


In [None]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

<h4>Conclusion:</h4>
<p>Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).</p>


<h3>Horsepower vs. Price</h3>


Let's calculate the  Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.


In [None]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

<h4>Conclusion:</h4>

<p>Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).</p>


<h3>Length vs. Price</h3>

Let's calculate the  Pearson Correlation Coefficient and P-value of 'length' and 'price'.


In [None]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

<h4>Conclusion:</h4>
<p>Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).</p>


<h3>Width vs. Price</h3>


Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':


In [None]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )

Conclusion:

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (\~0.751).


<h3> Curb-Weight vs. Price </h3>


Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':


In [None]:
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

*Conclusion:*
Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).


<h3>Engine-Size vs. Price</h3>

Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':


In [None]:
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

<h4>Conclusion:</h4>

<p>Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).</p>


<h3>Bore vs. Price</h3>


Let's calculate the  Pearson Correlation Coefficient and P-value of 'bore' and 'price':


In [None]:
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value )

<h4>Conclusion:</h4>
<p>Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).</p>


We can repeat the process for 'city-mpg' and 'highway-mpg':


<h3>City-mpg vs. Price</h3>


In [None]:
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

<h4>Conclusion:</h4>
<p>Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.</p>


<h3>Highway-mpg vs. Price</h3>


In [None]:
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )

Conclusion:

Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of about -0.705 shows that the relationship is negative and moderately strong.


####**ANOVA**


<h3>ANOVA: Analysis of Variance</h3>
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual group means deviate from that assumption relative to the variation within the groups, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
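
As a sketch of what the F-test score measures, the plain-Python helper below divides the variation between the group means by the variation inside the groups. The function `f_statistic` and the three price groups are made up for illustration; the notebook itself relies on scipy for the real calculation:

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical prices (in thousands) for three groups
f_val = f_statistic([[6, 7, 8], [15, 20, 25], [9, 10, 11]])
print("F =", f_val)
```

Groups whose means are far apart compared with the spread inside each group give a large F, which is the pattern we hope to see for a useful predictor.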


<h3>Drive Wheels</h3>


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different types of 'drive-wheels' impact  'price', we group the data.</p>


In [None]:
df_gptest=df[['drive-wheels', 'price']].groupby(['drive-wheels'])
df_gptest.head(2)

In [None]:
df_gptest

We can obtain the values of a single group using the method "get_group".


In [None]:
df_gptest.get_group('4wd')['price']

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [None]:
# ANOVA
f_val, p_val = stats.f_oneway(df_gptest.get_group('fwd')['price'], df_gptest.get_group('rwd')['price'], df_gptest.get_group('4wd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

This is a great result: the large F-test score shows a strong association between drive-wheels and price, and a P-value of almost 0 implies almost certain statistical significance. But does this mean that every pair of the three tested groups differs this strongly?

Let's examine them separately.


**fwd and rwd:**


In [None]:
f_val, p_val = stats.f_oneway(df_gptest.get_group('fwd')['price'], df_gptest.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val )

Let's examine the other groups.


**4wd and rwd:**


In [None]:
f_val, p_val = stats.f_oneway(df_gptest.get_group('4wd')['price'], df_gptest.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)

**4wd and fwd:**


In [None]:
f_val, p_val = stats.f_oneway(df_gptest.get_group('4wd')['price'], df_gptest.get_group('fwd')['price'])

print("ANOVA results: F=", f_val, ", P =", p_val)
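
The three pairwise cells above follow the same pattern, so they could also be written as one loop. A minimal sketch, assuming made-up prices in `df_demo` (the names `prices_by_group` and `pair_results` are hypothetical helpers):

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical prices per drive-wheels category
df_demo = pd.DataFrame({
    "drive-wheels": ["fwd"] * 3 + ["rwd"] * 3 + ["4wd"] * 3,
    "price": [6000, 7000, 8000, 15000, 20000, 25000, 9000, 10000, 11000],
})

# One price Series per category
prices_by_group = {name: grp["price"] for name, grp in df_demo.groupby("drive-wheels")}

# Run the two-group ANOVA for every pair of categories
pair_results = {}
for a, b in combinations(sorted(prices_by_group), 2):
    f_val, p_val = stats.f_oneway(prices_by_group[a], prices_by_group[b])
    pair_results[(a, b)] = (f_val, p_val)
    print(a, "vs", b, ": F =", f_val, ", P =", p_val)
```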

####Conclusion: Important Variables


<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:</p>

Continuous numerical variables:

<ul>
    <li>Length</li>
    <li>Width</li>
    <li>Curb-weight</li>
    <li>Engine-size</li>
    <li>Horsepower</li>
    <li>City-mpg</li>
    <li>Highway-mpg</li>
    <li>Wheel-base</li>
    <li>Bore</li>
</ul>

Categorical variables:

<ul>
    <li>Drive-wheels</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>
