# Module 3: Exploratory Data Analysis
In this module, we're going to cover the basics of exploratory data analysis using Python. 

Exploratory data analysis, or EDA, is an approach to analyzing data in order to:

- Summarize main characteristics of the data
- Gain a better understanding of the dataset
- Uncover relationships between different variables
- Extract important variables for the problem we're trying to solve

The main question we are trying to answer in this module is: what are the characteristics that have the most impact on the car price? 

We will be going through several useful exploratory data analysis techniques to answer this question. In this module, you will learn about:

- Descriptive statistics, which describe basic features of a dataset and provide a short summary about the sample and measures of the data.
- Basics of grouping data using GroupBy and how this can help to transform our dataset.
- ANOVA (analysis of variance), a statistical method in which the variation in a set of observations is divided into distinct components.
- The correlation between different variables.
- Lastly, advanced correlation, where we'll introduce you to various correlation statistical methods, namely Pearson correlation and correlation heatmaps.


In this video, we'll be talking about descriptive statistics. When you begin to analyze data, it's important to first explore your data before you spend time building complicated models. 

One easy way to do so is to calculate some descriptive statistics for your data. Descriptive statistical analysis helps to describe basic features of a dataset and obtain a short summary about the sample and measures of the data. Let's show you a couple of different useful methods.

### Descriptive Statistics Methods

1. **Using the `describe` Function in Pandas**
   - The `describe` method automatically computes basic statistics for all numerical variables in your data frame. It shows:
     - Count (the number of non-missing data points)
     - Mean
     - Standard deviation
     - Extreme values (minimum and maximum)
     - Quartiles (25th, 50th, and 75th percentiles)
   - Any NaN values are automatically skipped in these statistics. This method will give you a clearer idea of the distribution of your different variables.

2. **Summarizing Categorical Data with `value_counts`**
   - Categorical variables are variables that can be divided up into different categories or groups and have discrete values.
   - The `value_counts` function summarizes the categorical data, providing the count of each category.
   - You can change the name of the column to make it easier to read.
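
As a minimal sketch of both methods, the snippet below builds a small made-up data frame and calls `describe` and `value_counts` on it. The column names mirror the car dataset, but the `cars` frame and all of its values are invented for illustration. Relabeling the counts with `to_frame(name=...)` is just one possible way to rename the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
cars = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"],
    "price": [7000.0, 16500.0, 8200.0, np.nan, 6500.0, 21000.0],
})

# describe() summarizes the numeric columns; the NaN price is skipped
summary = cars.describe()
print(summary)

# value_counts() counts how often each category occurs;
# to_frame(name=...) gives the resulting column a readable name
drive_counts = cars["drive-wheels"].value_counts().to_frame(name="value_counts")
drive_counts.index.name = "drive-wheels"
print(drive_counts)
```

Note that the `count` row of `summary` reports only the non-missing prices, which is a quick way to spot columns with missing data.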

### Box Plots
- Box plots are a great way to visualize numeric data, showing various distributions of the data. 
- Key features of a box plot:
  - Median: Represents where the middle data point is.
  - Upper Quartile: Shows where the 75th percentile is.
  - Lower Quartile: Shows where the 25th percentile is.
  - Inter-Quartile Range (IQR): The range between the upper and lower quartiles, which covers the middle 50% of the data.
  - Lower and Upper Extremes: Calculated as 1.5 times the IQR above the 75th percentile and below the 25th percentile.
  - Outliers: Displayed as individual dots outside the upper and lower extremes.
- Box plots make it easy to compare between groups. They can help easily spot outliers and see the distribution and skewness of the data.
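
The quantities a box plot draws can also be computed directly. The sketch below does this with NumPy on a made-up `prices` array (the values are invented, and 40000 is included deliberately as an outlier). It follows the 1.5 × IQR rule described above and is only meant to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical prices (invented values); 40000 is a deliberate outlier
prices = np.array([5000, 6000, 7000, 8000, 9000, 10000, 11000, 40000], dtype=float)

# Lower quartile, median, and upper quartile
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# The extremes sit 1.5 * IQR beyond the quartiles;
# points outside them are the ones a box plot shows as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Plotting libraries typically draw each whisker at the most extreme data point that still lies inside its fence, so the whisker ends in a plot may sit slightly inside the values computed here.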

### Scatter Plots
- Scatter plots visualize the relationship between two variables.
- The predictor variable (x-axis) is the variable used to predict an outcome, and the target variable (y-axis) is the variable being predicted.
- In our example, engine size is the predictor variable, and price is the target variable.
- A scatter plot shows how the target variable changes as the predictor variable changes.
- It's crucial to label your axes and write a general plot title for clarity.

From the scatter plot, we see that as the engine size goes up, the price of the car also goes up. This indicates a positive linear relationship between these two variables.
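
A scatter plot of this kind could be sketched with Matplotlib as follows. The data here is synthetic, generated only to mimic an upward trend, and the variable names `engine_size` and `price` are placeholders rather than columns read from the actual dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data (invented values): larger engines cost more, plus some noise
rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 300, size=50)
price = 150 * engine_size + rng.normal(0, 3000, size=50)

fig, ax = plt.subplots()
ax.scatter(engine_size, price)

# Label the axes and give the plot a title so it can be read on its own
ax.set_title("Scatter Plot of Engine Size vs. Price")
ax.set_xlabel("Engine Size")  # predictor variable on the x-axis
ax.set_ylabel("Price")        # target variable on the y-axis
plt.show()
```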



In this video, we'll cover the basics of grouping and how this can help to transform our dataset. 

Assume you want to know: is there any relationship between the different types of drive systems (front-wheel, rear-wheel, and four-wheel drive) and the price of the vehicles? If so, which type of drive system adds the most value to a vehicle? 

It would be nice if we could group all the data by the different types of drive wheels and compare the results of these different drive wheels against each other. In Pandas, this can be done using the `groupby` method. 

The `groupby` method is used on categorical variables, grouping the data into subsets according to the different categories of that variable. You can group by a single variable or you can group by multiple variables by passing in multiple variable names. 
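
To make the single-variable versus multi-variable distinction concrete, here is a small sketch on a made-up data frame. The `cars` frame and its values are invented for illustration; only the column names follow the car dataset.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
cars = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "sedan"],
    "price": [8000.0, 7000.0, 20000.0, 24000.0, 12000.0],
})

# Group by a single categorical variable
by_drive = cars.groupby("drive-wheels")["price"].mean()
print(by_drive)

# Group by several variables by passing a list of column names
by_drive_and_body = cars.groupby(["drive-wheels", "body-style"])["price"].mean()
print(by_drive_and_body)
```

Selecting the `price` column before calling `mean` keeps the aggregation restricted to the numeric column of interest.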

### Grouping Data with `groupby`

As an example, let's say we want to find the average price of vehicles and observe how it differs across body styles and drive-wheel types. To do this:
```python
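# df is assumed to be the car data frame loaded earlier in the course
# Keep only the columns of interest, then average the price within each subgroup
data = df[['body-style', 'drive-wheels', 'price']]
grouped_data = data.groupby(['drive-wheels', 'body-style']).mean()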
```

The data is now grouped into subcategories, and only the average price of each subcategory is shown. We can see that according to our data, rear-wheel drive convertibles and rear-wheel drive hardtops have the highest value, while four-wheel drive hatchbacks have the lowest value.

### Pivot Table

A table of this form isn't the easiest to read or to visualize. To make it easier to understand, we can transform it into a pivot table by using the `pivot` method. 
```python
# Move the grouping keys back into ordinary columns,
# then spread body-style across the columns with price as the cell value
pivot_table = grouped_data.reset_index().pivot(index='drive-wheels', columns='body-style', values='price')
```
In the previous table, both drive wheels and body style were listed in columns. A pivot table has one variable displayed along the columns and the other variable displayed along the rows. 

The price data now becomes a rectangular grid, which is easier to visualize. This is similar to what is usually done in Excel spreadsheets.
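
The reshaping is easier to see on a tiny example. The sketch below uses a made-up `grouped` frame (invented averages, with one row per subgroup) and pivots it into a grid. It also shows what happens to a combination that never occurs in the data.

```python
import pandas as pd

# Hypothetical grouped averages (invented values), one row per subgroup
grouped = pd.DataFrame({
    "drive-wheels": ["4wd", "fwd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "hatchback", "sedan"],
    "price": [12000.0, 7000.0, 8000.0, 15000.0, 22000.0],
})

# Rows become drive-wheels, columns become body-style, cells hold the price
grid = grouped.pivot(index="drive-wheels", columns="body-style", values="price")
print(grid)

# Combinations missing from the data (here 4wd hatchback) show up as NaN
missing_cells = int(grid.isna().sum().sum())
print(missing_cells)
```

Whether to leave such cells as NaN or fill them with a placeholder value depends on how the grid will be used afterwards.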

### Heat Map

Another way to represent the pivot table is using a heat map plot. A heat map takes a rectangular grid of data and assigns a color intensity based on the data value at the grid points. It is a great way to plot the target variable over multiple variables and get visual clues of the relationship between these variables and the target. 
```python
# Example code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, cmap='RdBu')
plt.title('Heat Map of Price by Body Style and Drive Wheels')
plt.xlabel('Body Style')
plt.ylabel('Drive Wheels')
plt.show()
```
In the output plot, each body style is labeled along the x-axis, and each drive-wheel type is labeled along the y-axis. The average prices are plotted with varying colors based on their values according to the color bar. We see that one section of the heat map seems to have higher prices than the rest; going by the grouped averages above, that is the rear-wheel-drive row.

In this video, we'll talk about the correlation between different variables. Correlation is a statistical metric for measuring the extent to which different variables are interdependent. In other words, when we look at two variables over time, if one variable changes, how does this affect change in the other variable?

### Examples of Correlation
- **Smoking and Lung Cancer**: Smoking is known to be correlated with lung cancer; you have a higher chance of getting lung cancer if you smoke.
- **Umbrellas and Rain**: There is a correlation between umbrellas and rain; more precipitation means more people use umbrellas. If it doesn't rain, people would not carry umbrellas. Therefore, umbrellas and rain are interdependent and, by definition, correlated.

### Correlation vs. Causation
It is important to know that correlation doesn't imply causation. For instance, while we can say that umbrellas and rain are correlated, the correlation alone cannot tell us whether the umbrella caused the rain or the rain caused the umbrella.

### Visualizing Correlation with Scatter Plots
In data science, we often use correlation. Let's look at the correlation between engine size and price.

#### Positive Correlation: Engine Size and Price
We can visualize these two variables using a scatter plot with a regression line, indicating the relationship between them.
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x="engine-size", y="price", data=df)
plt.title('Scatter Plot with Regression Line: Engine Size vs. Price')
plt.show()
```
In this example, the regression line is steep, showing a positive linear relationship. As the engine size increases, the price goes up as well, indicating a positive correlation between engine size and price.

#### Negative Correlation: Highway Miles per Gallon and Price
Now, let's look at the relationship between highway miles per gallon (mpg) and car price.
```python
sns.regplot(x="highway-mpg", y="price", data=df)
plt.title('Scatter Plot with Regression Line: Highway MPG vs. Price')
plt.show()
```
In this plot, when highway mpg values go up, the price goes down, indicating a negative linear relationship. Although the relationship is negative, the slope of the line is steep, which suggests that highway mpg is still a good predictor of price.

#### Weak Correlation: Peak RPM and Price
Lastly, we examine a weak correlation example. Low and high values of peak RPM have both low and high prices, indicating a weak relationship. Therefore, we cannot use RPM to predict the prices effectively.
```python
sns.regplot(x="peak-rpm", y="price", data=df)
plt.title('Scatter Plot with Regression Line: Peak RPM vs. Price')
plt.show()
```
In this case, the scattered data points and the flat regression line suggest a weak correlation.

### Conclusion
Correlation helps us understand how variables are related but remember, it does not imply causation. Visualizing these relationships using scatter plots and regression lines can provide valuable insights into the data.

# Introduction to Correlation Statistical Methods

In this video, we'll introduce you to various correlation statistical methods. One way to measure the strength of the correlation between continuous numerical variables is by using a method called **Pearson Correlation**.

## Pearson Correlation

The Pearson Correlation method will give you two values:
1. **Correlation Coefficient**: Measures the strength and direction of the linear relationship between two variables.
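2. **P-Value**: The probability of seeing a correlation at least this strong if the two variables were actually uncorrelated. A small p-value means the observed correlation is unlikely to be due to chance alone.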

### Interpreting the Correlation Coefficient
- **Close to 1**: Large positive correlation
- **Close to -1**: Large negative correlation
- **Close to 0**: No linear correlation

### Interpreting the P-Value
- **< 0.001**: Strong certainty about the correlation
- **0.001 - 0.05**: Moderate certainty
- **0.05 - 0.1**: Weak certainty
- **> 0.1**: No certainty of correlation

A strong correlation exists when the correlation coefficient is close to 1 or -1 and the p-value is less than 0.001.

### Example: Horsepower and Car Price
In this example, we examine the correlation between horsepower and car price. Using the `stats` module from SciPy, we can calculate the Pearson Correlation.

```python
from scipy import stats

correlation_coefficient, p_value = stats.pearsonr(df['horsepower'], df['price'])
print(f"Correlation Coefficient: {correlation_coefficient}")
print(f"P-Value: {p_value}")
```

- **Correlation Coefficient**: Approximately 0.8, indicating a strong positive correlation.
- **P-Value**: Much smaller than 0.001, confirming strong certainty about the correlation.
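
To see how the coefficient and the p-value behave together, the following sketch runs `stats.pearsonr` on three synthetic pairs of variables. The data is generated only for illustration, and `certainty_label` is a hypothetical helper that encodes the rule-of-thumb p-value ranges listed earlier in this lesson. It is not part of SciPy.

```python
import numpy as np
from scipy import stats

def certainty_label(p_value):
    """Map a p-value to the rule-of-thumb labels used in this lesson."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

# Hypothetical synthetic data, generated only for illustration
rng = np.random.default_rng(42)
x = rng.normal(size=200)
noise = rng.normal(size=200)

pairs = {
    "positive": x + 0.3 * noise,   # rises as x rises
    "negative": -x + 0.3 * noise,  # falls as x rises
    "unrelated": noise,            # generated independently of x
}

results = {}
for name, y in pairs.items():
    r, p = stats.pearsonr(x, y)
    results[name] = (r, p, certainty_label(p))
    print(f"{name:>9}: r={r:+.3f}, p={p:.3g}, certainty={certainty_label(p)}")
```

The first two pairs should give coefficients near 1 and -1 with very small p-values, while the third should give a coefficient near 0.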

### Heat Map of Correlations
Taking all variables into account, we can create a heat map that indicates the correlation between each pair of variables.

```python
import seaborn as sns
import matplotlib.pyplot as plt

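# Restrict to numeric columns; recent pandas versions raise an error on text columns
correlation_matrix = df.corr(numeric_only=True)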
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heat Map')
plt.show()
```

The color scheme indicates the Pearson correlation coefficient, showing the strength of the correlation between variables. The diagonal line with a dark red color represents the correlation of variables with themselves, which is always 1.

This correlation heat map provides an overview of how different variables are related to one another, and most importantly, how these variables are related to price.
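
When price is the variable of interest, it can help to read off just the price column of the correlation matrix and sort it. The sketch below does this on a made-up `cars` frame. The values are invented, and the non-numeric `make` column is included only to show that it has to be excluded first.

```python
import pandas as pd

# Hypothetical toy data (invented values); "make" is non-numeric on purpose
cars = pd.DataFrame({
    "make": ["a", "b", "c", "d", "e", "f"],
    "engine-size": [90, 110, 130, 150, 180, 210],
    "highway-mpg": [40, 37, 33, 30, 26, 22],
    "price": [6500, 8200, 11000, 14500, 19000, 26000],
})

# Keep numeric columns only, then take the "price" column of the matrix
numeric = cars.select_dtypes(include="number")
price_corr = numeric.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Sorting places the strongest positive relationships at the top and the strongest negative ones at the bottom, which mirrors what the heat map shows with color.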

# Lesson Summary

Congratulations! You have completed this lesson. At this point in the course, you know: 

- **Tools like the `describe` function in pandas** can quickly calculate key statistical measures like mean, standard deviation, and quartiles for all numerical variables in your data frame. 
- Use the **`value_counts` function** to summarize data into different categories for categorical data. 
- **Box plots** offer a more visual representation of the data's distribution for numerical data, indicating features like the median, quartiles, and outliers.
- **Scatter plots** are excellent for exploring relationships between continuous variables, like engine size and price, in a car data set.
- Use **Pandas' `groupby` method** to explore relationships between categorical variables.
- Use **pivot tables and heat maps** for better data visualizations.
- **Correlation between variables** is a statistical measure that indicates how changes in one variable might be associated with changes in another variable.
- When exploring correlation, use **scatter plots combined with a regression line** to visualize relationships between variables.
- **Visualization functions like `regplot` from the seaborn library** are especially useful for exploring correlation.
- The **Pearson correlation**, a key method for assessing the correlation between continuous numerical variables, provides two critical values: 
  - **The coefficient**, which indicates the strength and direction of the correlation.
  - **The P-value**, which assesses the certainty of the correlation.
- A **correlation coefficient** close to 1 or -1 indicates a strong positive or negative correlation, respectively, while one close to zero suggests no correlation.
- For **P-values**, values less than 0.001 indicate strong certainty in the correlation, while larger values indicate less certainty. Both the coefficient and P-value are important for confirming a strong correlation.
- **Heatmaps** provide a comprehensive visual summary of the strength and direction of correlations among multiple variables.

The correlation coefficient is not the same thing as the slope of the fitted line in a scatter plot, although for a linear fit the two always share the same sign. In general:

- If the correlation coefficient is positive, there is a positive relationship between the variables.
- If the correlation coefficient is negative, there is a negative relationship between the variables.
- If the correlation coefficient is close to 0, there is no linear relationship between the variables.

The slope of the line, by contrast, gives the rate of change of one variable per unit of the other. In a scatter plot that shows a linear relationship, the slope tells you how much the target variable changes when the other variable changes by one unit. The correlation coefficient instead indicates how closely the points follow a straight line overall, regardless of the units of either variable.

So when the goal is to judge how strongly two variables are related, the correlation coefficient is a better indicator than the slope: it does not depend on units and always lies between -1 and 1, so it has a consistent meaning across different pairs of variables.


Let's look at a concrete example of the correlation coefficient and the slope when analyzing the relationship between two variables.

Consider a scatter plot showing the relationship between a person's age and income (Age vs. Income):

![Scatter Plot - Age vs. Income](https://miro.medium.com/max/732/1*yJ9VFX6Td9fgo8JUBux2xQ.png)

In this case, the slope of the line in the scatter plot indicates the rate of change of income per unit of age. A negative slope means that income decreases as age increases, and a positive slope means that income increases as age increases.

However, the slope alone does not show how strong the relationship between age and income is. Its sign gives the direction, but its size depends on the units of the two variables, so a steep line can still sit among widely scattered points.

For the correlation coefficient, a positive value close to 1 means a strong positive relationship, a negative value close to -1 means a strong negative relationship, and a value close to 0 means no linear relationship. The correlation coefficient is therefore the better indicator of how strongly the two variables are related, and it gives a concrete statistical measure for analyzing that relationship.
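
The difference between slope and correlation coefficient can be checked numerically. The sketch below uses synthetic age and income data (invented for illustration) and a hypothetical helper, `slope_and_r`. It measures age first in years and then in months: the slope changes with the unit, while the correlation coefficient does not.

```python
import numpy as np

# Hypothetical data (invented values): income rises with age, plus noise
rng = np.random.default_rng(1)
age_years = rng.uniform(20, 60, size=100)
income = 1000 * age_years + rng.normal(0, 8000, size=100)

def slope_and_r(x, y):
    """Return the least-squares slope of y on x and the Pearson coefficient."""
    slope = np.polyfit(x, y, 1)[0]
    r = np.corrcoef(x, y)[0, 1]
    return slope, r

slope_years, r_years = slope_and_r(age_years, income)

# Same data, but with age measured in months instead of years
slope_months, r_months = slope_and_r(age_years * 12, income)

print(f"age in years:  slope={slope_years:.1f}, r={r_years:.3f}")
print(f"age in months: slope={slope_months:.1f}, r={r_months:.3f}")
```

The slope in months should be one twelfth of the slope in years, while both runs report the same coefficient.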


1. What method provides summary statistics of a data frame?

   1 point
   
   - [x] describe()
   
   - [ ] tail()
   
   - [ ] head()
   
   - [ ] summary()


2. As the Pearson Correlation value nears zero, then ...

   1 point
   
   - [ ] It indicates uncertainty about the correlation between two variables
   
   - [ ] It indicates the mean of the data is near zero
   
   - [x] It indicates that two variables are not correlated
   
   - [ ] It indicates minimal deviation in a variable's values from the mean


3. What range of the Pearson correlation p-value is considered too high to support any certainty about the correlation between variables?

   1 point
   
   - [x] p > 0.1
   
   - [ ] 0.001 < p < 0.05
   
   - [ ] p < 0.001
   
   - [ ] 0.05 < p < 0.1


4. Consider the following data frame:

   `df_test = df[['body-style', 'price']]`

   The following operation is applied:

   `df_grp = df_test.groupby(['body-style'], as_index=False).mean()`

   What are the resulting values of `df_grp['price']`?

   1 point
   
   - [ ] It averages the body-style variable data values.
   
   - [ ] The average price
   
   - [ ] It writes the mean value of each body style price to the data frame.
   
   - [x] It averages the price for each body style


5. What is the Pearson Correlation between two variables if the input variable is equal to the output variable?

   1 point
   
   - [x] 1
   
   - [ ] Between -1 and 0
   
   - [ ] -1
   
   - [ ] Between 0 and 1
