# **Week 3: Exploratory Data Analysis (EDA) in Applied Economics**

## Definition and Importance of EDA

Exploratory Data Analysis (EDA) is an essential initial step in the data analysis process. It involves visualizing, summarizing, and interpreting the information that is hidden in rows and columns of data. The main aim of EDA is to understand the data, identify anomalies or outliers, uncover patterns, and extract valuable insights. This, in turn, provides a solid foundation for the subsequent analytical or modeling activities.

## Types of Data in Applied Economics

There are various types of data that researchers and professionals in applied economics frequently encounter:

- **Categorical Data**: This data type represents categories or labels. It can be further divided into nominal (e.g., crop types like wheat, rice, maize) and ordinal data (e.g., low, medium, high).

- **Numerical Data**: Represents numbers, and can be either discrete (e.g., number of farms) or continuous (e.g., rainfall in mm).

- **Time-series Data**: Observations recorded at regular time intervals. For instance, the monthly prices of a commodity over several years.

- **Cross-sectional Data**: Observations on many units (e.g., farms, households, regions) recorded at the same point in time. An example could be the agricultural yield of different farms in a particular year.

- **Panel Data**: This is a combination of time-series and cross-sectional data. For example, observing the yields of multiple farms over multiple years.

To help differentiate these data types further, here's a simple table:

| Data Type        | Description                                           | Example                                       |
|------------------|-------------------------------------------------------|-----------------------------------------------|
| Categorical      | Represents categories or labels                        | Crop types: Wheat, Rice                       |
| Numerical        | Represents numbers                                     | Rainfall in mm, Number of farms               |
| Time-series      | Observations over regular intervals                   | Monthly commodity prices over several years   |
| Cross-sectional  | Observations at a specific point in time               | Agricultural yield of farms in a given year   |
| Panel            | Combination of time-series and cross-sectional data    | Yields of multiple farms over several years   |
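
As a minimal sketch of how these data types might look in pandas, the example below builds a small panel of farm yields. The column names (`farm`, `year`, `crop`, `yield_t_ha`) and all values are hypothetical and only illustrate the structure.

```python
import pandas as pd

# Hypothetical panel: two farms observed in two years (all values are made up)
panel = pd.DataFrame({
    "farm": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "crop": ["wheat", "wheat", "maize", "maize"],
    "yield_t_ha": [3.1, 3.4, 4.8, 5.0],
})

# Nominal categorical data: store the labels with the 'category' dtype
panel["crop"] = panel["crop"].astype("category")

# Cross-section: all farms in a single year
cross_section = panel[panel["year"] == 2020]

# Time series: one farm followed over time
time_series = panel[panel["farm"] == "A"].set_index("year")["yield_t_ha"]

print(panel.dtypes)
print(cross_section)
print(time_series)
```

The full `panel` dataframe is the panel dataset. Slicing it by year gives a cross-section, and slicing it by farm gives a time series.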

## Statistical Summaries in EDA

### Descriptive Statistics

Before proceeding with any analytical approach, understanding the basic statistical properties of the dataset is essential. This process provides a snapshot of central tendencies, spread, and shape of the dataset's distribution.

```python
# Assuming data is loaded in a dataframe named 'df'
stats_summary = df.describe()
print(stats_summary)
```
This provides count, mean, standard deviation, min, 25th percentile, median (50th percentile), 75th percentile, and max for all numerical columns.
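
The sketch below shows a few companions to `describe()` that are often useful at this stage. The dataframe `df_demo` and its values are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame({
    "rainfall_mm": [410.0, 520.0, 480.0, 615.0, 450.0],
    "crop": ["wheat", "maize", "maize", "wheat", "maize"],
})

# Numerical columns only (the default behaviour)
summary = df_demo.describe()
print(summary)

# Include categorical columns as well (adds unique, top, freq)
print(df_demo.describe(include="all"))

# Skewness is not part of describe(), so compute it separately
print(df_demo["rainfall_mm"].skew())

# Frequency table for a categorical column
crop_counts = df_demo["crop"].value_counts()
print(crop_counts)
```

For categorical columns, a frequency table from `value_counts()` is usually more informative than a mean or standard deviation.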

### Correlation and Association

Correlation measures the strength and direction of a linear relationship between two variables. The Pearson correlation coefficient, which ranges from -1 to 1, is the most widely used method to measure it. A value close to 1 implies a strong positive correlation, while a value close to -1 indicates a strong negative correlation. A value close to 0 implies a weak or no linear correlation.

```python
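# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
correlation_matrix = df.corr(numeric_only=True)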
print(correlation_matrix)
```
This returns a matrix of Pearson correlation coefficients between every pair of numerical columns in the dataframe. Read each entry using the interpretation above: values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak or absent one.
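
To make the coefficient less of a black box, the sketch below computes Pearson's r by hand (the sum of the products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with the pandas result. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: fertiliser use and yield for five farms (values are made up)
demo = pd.DataFrame({
    "fertiliser_kg": [50, 60, 70, 80, 90],
    "yield_t_ha": [2.0, 2.4, 2.9, 3.1, 3.6],
})

x = demo["fertiliser_kg"]
y = demo["yield_t_ha"]

# Deviations of each observation from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r computed from the definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient via pandas
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))
```

Both values should agree up to floating-point rounding. Keep in mind that a high coefficient describes a linear association only and says nothing about causation.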



### Detecting and Handling Outliers

Outliers are data points that deviate markedly from the other observations in the dataset. They may reflect genuine variability or errors in measurement or data entry. The Interquartile Range (IQR) method is commonly used to detect them. Here's the logic:

- **IQR**: The difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile), i.e. `IQR = Q3 - Q1`.
- Any data point that falls below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR` is considered an outlier.

#### **Using IQR to flag outliers**:

Instead of filtering out the outliers, you can create a new column in your dataframe that flags them.

```python
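# `num` is `df` restricted to its numerical columns; quantiles are undefined for text columns
num = df.select_dtypes(include="number")
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# Create a new column 'is_outlier' that flags outliers with a '1' and '0' otherwise
df['is_outlier'] = ((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1).astype(int)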
```

Now, the dataframe `df` has a new column named `is_outlier` which flags the outliers as '1' and the other data points as '0'.

**Let's break down the logic:**
1. **Outlier Boundaries**: 
    - Lower Boundary: \(Q1 - 1.5 \times IQR\)
    - Upper Boundary: \(Q3 + 1.5 \times IQR\)
   
    These boundaries define what we consider as outliers based on the Interquartile Range (IQR) method. Anything below the lower boundary or above the upper boundary is treated as an outlier.

2. **Condition Checks**:
    - \(df < (Q1 - 1.5 \times IQR)\): This checks if each data point in the dataframe `df` is below the lower boundary.
    - \(df > (Q3 + 1.5 \times IQR)\): This checks if each data point in `df` is above the upper boundary.

3. **Combining Conditions with OR (|)**:
    - \((df < (Q1 - 1.5 \times IQR)) | (df > (Q3 + 1.5 \times IQR))\): 
      This combines the two conditions using the "OR" operator `|`. If a data point satisfies either of the conditions, it's considered an outlier.

4. **Checking across rows with `.any(axis=1)`**:
    - The `.any(axis=1)` method checks row-wise. If any column in a specific row is `True` (i.e., it's an outlier), the method returns `True` for that row. Otherwise, it returns `False`.

5. **Converting Boolean to Integer with `.astype(int)`**:
    - The `True` and `False` values are then converted to integers (1 for `True` and 0 for `False`) using `.astype(int)`.

6. **Storing the Results**:
    - Finally, the result, which is a series of 1s and 0s indicating outliers and non-outliers respectively, is stored in a new column `is_outlier` in the dataframe `df`.

In essence, this line of code creates a new column in the dataframe that flags outliers as '1' and non-outliers as '0' based on the IQR method.
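
A single `is_outlier` column does not tell you which variable triggered the flag. One possible alternative, sketched below, is to flag each variable separately. The helper `flag_iqr_outliers`, the column names, and the data are all hypothetical and only illustrate the idea.

```python
import pandas as pd

def flag_iqr_outliers(series, k=1.5):
    """Return 1 where a value lies outside [Q1 - k*IQR, Q3 + k*IQR], else 0."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return outside.astype(int)

# Hypothetical data: the last weight is deliberately extreme
demo = pd.DataFrame({
    "weight": [2.1, 2.4, 2.5, 2.6, 2.8, 9.0],
    "length": [80.0, 82.0, 85.0, 86.0, 88.0, 89.0],
})

# One flag column per variable
demo["weight_outlier"] = flag_iqr_outliers(demo["weight"])
demo["length_outlier"] = flag_iqr_outliers(demo["length"])

print(demo)
```

Here only the last row is flagged, and only for `weight`. Flagging rather than deleting keeps the decision about what to do with these rows open for later.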


#### **Z-Score and Modified Z-Score**

Note that there are also other methods to detect outliers, such as Z-Score and Modified Z-Score. Here's a quick summary of the two:

**Z-Score**
- Indicates how many standard deviations a data point lies from the mean.
- Outliers typically have Z-scores > 3 or < -3.

**Modified Z-Score**
- Measures how far a data point lies from the median, scaled by the Median Absolute Deviation (MAD) instead of the standard deviation.
- Because the median and MAD are barely affected by extreme values, this version is more robust than the ordinary Z-score.
- Outliers often have scores > 3.5 or < -3.5.
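
The two scores can be sketched as follows. The data are hypothetical, and the constant 0.6745 is the commonly used scaling factor that makes the MAD comparable to the standard deviation for normally distributed data.

```python
import pandas as pd

# Hypothetical data with one extreme value
values = pd.Series([10.0, 11.0, 12.0, 12.0, 13.0, 14.0, 40.0])

# Z-score: distance from the mean in (sample) standard deviations
z = (values - values.mean()) / values.std()

# Modified Z-score: distance from the median, scaled by the MAD
median = values.median()
mad = (values - median).abs().median()
modified_z = 0.6745 * (values - median) / mad

# Apply the usual thresholds
z_outlier = z.abs() > 3
modified_z_outlier = modified_z.abs() > 3.5

print(pd.DataFrame({
    "value": values,
    "z": z.round(2),
    "modified_z": modified_z.round(2),
    "z_outlier": z_outlier,
    "modified_z_outlier": modified_z_outlier,
}))
```

In this small sample the ordinary Z-score does not flag the value 40, because that value inflates both the mean and the standard deviation. The modified Z-score flags it clearly. Note that this sketch does not handle the case where the MAD is zero, which would lead to a division by zero.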

## Tutorial 3, Part 1: Plotting data with Matplotlib and Seaborn

Before you get started
- Go to your Setup.ipynb and pull the latest version of the course repository
- Then create a notebook for this tutorial and rename it to \<your_name>\<Lecture_3_Tutorial>
- Share with me: jan5020@gmail.com

Then you need to do the following:
- import the necessary libraries
- load the data
    - the possum data, calling it df_pos
    - the South African maize data, df_maize

Now
- describe the possum data
- describe the maize data
- check the correlation between the variables in the possum data
- check the correlation between the variables in the maize data
- identify the outliers in the `totalL` and `weight` variables in the possum data, and store each set of flags in an appropriately named column

## **Data Visualization in EDA**
### Python Libraries for Visualization

Data visualization is an integral part of Exploratory Data Analysis (EDA) as it allows for a graphical representation of information and data. Python offers a range of libraries tailored for various visualization needs. Here, we'll introduce three of the most popular ones:

- **Matplotlib**: One of the most widely used visualization libraries in Python, Matplotlib provides a flexible platform to create a vast array of static, animated, and interactive visualizations. It's highly customizable and serves as the foundation for many other plotting libraries.

- **Seaborn**: Built on top of Matplotlib, Seaborn provides a higher-level interface for creating beautiful, statistically-themed visualizations. It comes with several built-in themes and color palettes to make aesthetically pleasing charts with ease.

- **Plotly**: Unlike the other two, Plotly is mainly known for enabling interactive visualizations. It supports a multitude of chart types and is particularly useful when you want to create visualizations that users can interact with.

The examples in this section use Matplotlib and Seaborn. To start working with them, you'll first need to import them. Here's how you can do it:

```python
import matplotlib.pyplot as plt
import seaborn as sns
```

#### Matplotlib (`plt`)

**Matplotlib** is the foundational library for many other Python plotting libraries. Here's a quick primer:

- **Basic Plotting**:
    ```python
    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]
    plt.plot(x, y)
    plt.show()
    ```
    This plots `y` against `x` as a line. Markers are not drawn by default; add `marker="o"` to show them.

- **Titles & Labels**:
    ```python
    plt.plot(x, y)
    plt.title("Square Numbers")
    plt.xlabel("Value")
    plt.ylabel("Square of Value")
    plt.show()
    ```

- **Multiple Plots**:
    ```python
    y2 = [1, 8, 27, 64, 125]
    plt.plot(x, y, label="Squares")
    plt.plot(x, y2, label="Cubes")
    plt.legend()  # To show the legend
    plt.show()
    ```
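
When two series have very different scales, as squares and cubes do, it can help to draw them in separate panels. The sketch below is one way to do this with `plt.subplots`, which returns a figure and its axes objects. The variable names are chosen for illustration only.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
squares = [v ** 2 for v in x]
cubes = [v ** 3 for v in x]

# One figure with two axes side by side
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

ax_left.plot(x, squares, marker="o")
ax_left.set_title("Squares")
ax_left.set_xlabel("Value")

ax_right.plot(x, cubes, marker="o", color="orange")
ax_right.set_title("Cubes")
ax_right.set_xlabel("Value")

fig.tight_layout()  # Avoid overlapping labels
plt.show()
```

Working with the axes objects directly (`ax_left`, `ax_right`) makes it explicit which panel each command applies to.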

#### Seaborn (`sns`)

**Seaborn** is built on top of Matplotlib and offers a higher-level, more aesthetically pleasing interface:

- **Histogram**:
    ```python
    data = [1, 1, 2, 3, 3, 3, 4, 4, 5]
    sns.histplot(data)
    plt.show()
    ```

- **Box Plot**:
    ```python
    sns.boxplot(x="day", y="total_bill", data=tips_dataset)
    plt.show()
    ```
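    Here, `tips_dataset` is a sample dataset that you have already loaded, for example with `tips_dataset = sns.load_dataset("tips")` (this downloads Seaborn's example data, so it needs an internet connection). The plot shows the distribution of total bill amounts by day.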

- **Scatter Plot**:
    ```python
    sns.scatterplot(x="total_bill", y="tip", data=tips_dataset)
    plt.show()
    ```

- **Scatter Plot with Regression Line**:
    ```python
    sns.regplot(x="total_bill", y="tip", data=tips_dataset)
    plt.show()
    ```
    
- **Line plot**:
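    ```python
    import pandas as pd
    # `data` must be a dataframe with the columns "X" and "Y" (example values below)
    data = pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [1, 4, 9, 16, 25]})
    sns.lineplot(data=data, x="X", y="Y", label="Line Plot", color='green', marker='o')
    plt.title("Line Plot with Seaborn")
    plt.show()
    ```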

Both libraries work well with Pandas dataframes, which means you can directly pass columns of a dataframe to these functions. Remember to always use `plt.show()` with Matplotlib to ensure the plot is rendered correctly.

Explore the Seaborn plotting library here: https://seaborn.pydata.org/examples/index.html 

## Tutorial 3, Part 2: Plotting data with Seaborn

Using the possum data and **sns**, create the following plots:
- a scatter plot with regression line of skull length vs skull width
- a histogram of skull length
- a boxplot of weight, categorised by sex

Using the maize data and **sns**, create the following plots:
- but first, calculate the average yield for each year
- a line plot of yield vs year
- a line plot of the white and yellow maize price
    - add R/ton to the y-axis
    - add a legend
    - make the white maize blue red and the yellow maize line orange

Stretch goal:
- calculate the average yield per decade
- create a bar plot of yield vs decade
- a boxplot of yield, categorised by decade