# Chapter 1 - Getting to Know Your Data: The Power of Exploratory Data Analysis

In data science, it's easy to get caught up in the excitement of building complex models and making predictions. However, the most crucial step in any data project is often the most overlooked: understanding your data. This is where Exploratory Data Analysis (EDA) comes in. Pioneered by the statistician John Tukey, EDA is all about getting to know your data intimately before diving into modeling or analysis.

## Why is EDA So Important?

EDA is the initial investigation that can make or break a data project. It helps you:

- **Uncover patterns and insights**: By summarizing and visualizing your data, you can start to see trends, relationships, and anomalies that you might not have expected.
- **Understand data distributions**: EDA helps you to understand the shape of your data, where most of the data is located (central tendency), and how it is dispersed.
- **Identify errors and inconsistencies**: Spotting outliers, missing values, or data entry errors early on can save you from building models on flawed data.
- **Inform your modeling choices**: The insights gained from EDA can guide the selection of appropriate statistical models and machine learning algorithms.

## Key Techniques in Exploratory Data Analysis

### 1. Understanding Data Types and Structures

Before any analysis can occur, the data needs to be in a usable format. This usually means structuring it into a table with rows and columns. The data type of each column then helps determine the appropriate visual display, analysis, or statistical model.

- **Numeric Data**:
  - **Continuous**: Data that can take any value within a range (e.g., temperature, height).
  - **Discrete**: Data that can only take specific, often integer, values (e.g., number of clicks, number of items).

- **Categorical Data**:
  - **Binary**: A special case with two values (e.g., yes/no, true/false).
  - **Ordinal**: Categories with a meaningful order (e.g., low, medium, high).

Rectangular data is a common format and can be represented as a data frame in R and Python. Each row represents a record, and each column represents a feature. The terms vary with the background of the data scientist: a row may also be called an observation, case, or sample, and a column a variable, attribute, or predictor.
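As a rough illustration of the data types above, the sketch below builds a small data frame with pandas. The column names and values are invented for this example, and the ordering of the `risk` categories is an assumption made explicit in the code.

```python
import pandas as pd

# Hypothetical records; every column name and value is invented for illustration.
df = pd.DataFrame(
    {
        "height_cm": [162.5, 180.1, 175.0, 168.3],  # numeric, continuous
        "clicks": [3, 0, 7, 2],                      # numeric, discrete
        "subscribed": [True, False, True, True],     # categorical, binary
        "risk": ["low", "high", "medium", "low"],    # categorical, ordinal
    }
)

# Declare the category order explicitly so sorting and comparisons respect it.
df["risk"] = pd.Categorical(
    df["risk"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)         # one data type per column (feature)
print(df.shape)          # (number of records, number of features)
print(df["risk"].max())  # largest category under the declared order
```

Checking `df.dtypes` early is a cheap way to catch columns that were read with an unintended type, such as numbers stored as text.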

### 2. Estimating Location and Variability

One of the first steps in EDA is to get a sense of the "typical" values in your dataset and how much the data varies.

- **Estimates of Location**:
  - **Mean**: The average value. It is easy to compute and convenient to use, but it is sensitive to outliers, so it is not always the best measure of a central value.
  - **Median**: The middle value when data is sorted, robust to outliers.
  - **Weighted Mean and Median**: Averages that take into account assigned weights.

- **Estimates of Variability**:
  - **Mean Absolute Deviation**: The average of the absolute differences between each value and the mean.
  - **Variance and Standard Deviation**: Measures of how spread out the data is around the mean. The variance is based on squared deviations from the mean, and the standard deviation is its square root. (Note: in practice, the standard deviation is used more often than the mean absolute deviation or the variance.)
  - **Median Absolute Deviation**: The median of the absolute differences between each value and the median, robust to outliers.

Note: For a more in-depth discussion of these concepts, see [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/).
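The estimates of location and variability listed above can be sketched with Python's standard `statistics` module. The sample values and weights are made up, and `weighted_median` is a hypothetical helper that follows one simple convention (the smallest value at which the cumulative weight reaches half the total); other definitions exist.

```python
import statistics

# Hypothetical sample with one extreme value (95) to show the effect of an outlier.
values = [2, 3, 3, 4, 5, 6, 95]
weights = [1, 1, 1, 1, 1, 1, 0.1]  # assumed weights; the outlier is down-weighted


def weighted_median(xs, ws):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(ws) / 2
    running = 0.0
    for x, w in sorted(zip(xs, ws)):
        running += w
        if running >= half:
            return x


# Estimates of location
mean = statistics.mean(values)
median = statistics.median(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
w_median = weighted_median(values, weights)

# Estimates of variability
mean_abs_dev = sum(abs(x - mean) for x in values) / len(values)
variance = statistics.variance(values)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)      # square root of the sample variance
median_abs_dev = statistics.median(abs(x - median) for x in values)

print(f"mean={mean:.2f}  median={median}  weighted mean={weighted_mean:.2f}")
print(f"std dev={std_dev:.2f}  mean abs dev={mean_abs_dev:.2f}  "
      f"median abs dev={median_abs_dev}")
```

In this sample the single extreme value pulls the mean and standard deviation far from the bulk of the data, while the median and the median absolute deviation barely react, which is what "robust to outliers" means in practice.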

### 3. Exploring Data Distribution

Understanding how your data is distributed is crucial to choosing the right analysis methods.

- **Percentiles and Boxplots**:
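  - Percentiles are the values below which a given percentage of the data falls. For example, the 25th, 50th, and 75th percentiles (the quartiles) split the data into four equal parts, which makes them helpful for summarizing spread.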
  - Boxplots visualize data distribution using percentiles, showing the median, quartiles, and outliers.
  - An example of a boxplot is shown below:
  
  ![boxplot](figure/c1/box_plot_ex.png)

- **Frequency Tables and Histograms**:
  - Frequency tables summarize how many data values fall into a set of intervals or bins.
  - Histograms plot frequency tables with bins on the x-axis and the count (or proportion) on the y-axis.
  - An example of a histogram is shown below:

  ![histogram](figure/c1/histogram_ex.png)

- **Density Plots**: Smoothed versions of histograms that show the data distribution as a continuous line, often based on kernel density estimates. An example of a density plot is shown below:

  ![density plot](figure/c1/density_ex.png)
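The numbers behind a boxplot and a histogram can also be computed directly. The following sketch uses NumPy on a made-up sample; the 1.5 × IQR rule for flagging outliers is a common boxplot convention assumed here, not something this chapter prescribes, and the choice of four bins is arbitrary.

```python
import numpy as np

# Hypothetical sample; the value 12 sits far from the rest.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])

# Quartiles are the 25th, 50th, and 75th percentiles.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Assumed convention: points more than 1.5 * IQR beyond the quartiles
# are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Frequency table: counts of values in four equal-width bins.
# The last bin also includes its right edge.
counts, edges = np.histogram(values, bins=4)

print("quartiles:", q1, median, q3)
print("outliers:", outliers.tolist())
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```

A histogram is a plot of exactly these bin counts, so changing `bins` changes the picture; trying a few bin counts is a reasonable habit before drawing conclusions about shape.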

### 4. Exploring Categorical Data

Categorical data requires different summarization and visualization techniques.

- **Mode**: The most frequent category.
- **Expected Value**: When each category has an associated numeric value, the average of those values weighted by each category's probability of occurrence.
- **Bar Charts**: Display the frequency or proportion of each category.
- **Pie Charts**: An alternative to bar charts, but often considered less visually informative.
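A minimal sketch of these categorical summaries, using only the standard library. The subscription tiers and their prices are hypothetical, and the observed proportions stand in for the probabilities of occurrence.

```python
from collections import Counter

# Hypothetical subscription tiers, one entry per customer.
tiers = ["free", "basic", "free", "premium", "free",
         "basic", "free", "basic", "free", "premium"]
# Assumed numeric value attached to each category (monthly price).
price = {"free": 0.0, "basic": 10.0, "premium": 30.0}

counts = Counter(tiers)

# Mode: the most frequent category.
mode, mode_count = counts.most_common(1)[0]

# Proportions: the heights of the bars in a proportion bar chart.
proportions = {tier: n / len(tiers) for tier, n in counts.items()}

# Expected value: each category's value weighted by its proportion.
expected_value = sum(price[tier] * p for tier, p in proportions.items())

print("mode:", mode, f"({mode_count} of {len(tiers)})")
print("proportions:", proportions)
print(f"expected monthly revenue per customer: {expected_value:.2f}")
```

The `counts` and `proportions` dictionaries hold the same information a bar chart would display, so they are a quick text-only check before plotting.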

### 5. Exploring Relationships Between Variables

Understanding how variables relate to each other is an important aspect of EDA.

- **Scatterplots**: Visualize the relationship between two numeric variables.
- **Contingency Tables**: Tally the counts for each combination of categories across two categorical variables.
- **Boxplots and Violin Plots**: Compare the distributions of a numeric variable across different categories.
- **Hexagonal Binning**: Useful for examining the relationship of two numeric variables without being overwhelmed by large amounts of data.
- **Contour Plots**: Used to visualize the density of two numeric variables like a topographical map.
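Two of these techniques have direct tabular counterparts, sketched below with pandas on invented data: `pd.crosstab` builds a contingency table, and a grouped `describe()` produces the per-category quartiles that side-by-side boxplots would draw. The column names `plan`, `churned`, and `sessions` are hypothetical.

```python
import pandas as pd

# Hypothetical customer records, invented for illustration.
df = pd.DataFrame(
    {
        "plan": ["free", "free", "paid", "paid", "free", "paid", "free", "paid"],
        "churned": ["yes", "no", "no", "no", "yes", "yes", "no", "no"],
        "sessions": [2, 5, 9, 12, 1, 4, 6, 10],
    }
)

# Contingency table: counts for each (plan, churned) combination.
contingency = pd.crosstab(df["plan"], df["churned"])

# Numeric variable summarized per category: count, mean, std, min,
# quartiles, and max of sessions for each plan.
by_plan = df.groupby("plan")["sessions"].describe()

print(contingency)
print(by_plan[["min", "25%", "50%", "75%", "max"]])
```

For the plots themselves, matplotlib provides `scatter`, `hexbin`, `boxplot`, and `violinplot`, which accept the same kind of column data.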

### 6. Visualizing Multiple Variables

Conditioning can be used to extend the use of scatterplots, hexagonal binning, and boxplots to multiple variables.
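Conditioning amounts to repeating the same summary or plot within each level of an additional variable. As a rough tabular sketch, assuming hypothetical `region`, `plan`, and `sessions` columns, the median of a numeric variable can be computed for every combination of two categorical variables:

```python
import pandas as pd

# Hypothetical records, invented for illustration.
df = pd.DataFrame(
    {
        "region": ["north", "north", "north", "north",
                   "south", "south", "south", "south"],
        "plan": ["free", "free", "paid", "paid",
                 "free", "free", "paid", "paid"],
        "sessions": [2, 4, 9, 11, 1, 3, 5, 7],
    }
)

# Median sessions by plan, conditioned on region:
# one row per region, one column per plan.
medians = df.groupby(["region", "plan"])["sessions"].median().unstack("plan")

print(medians)
```

Each row of `medians` corresponds to one panel of a conditioned (faceted) plot, so the table is a quick way to see whether the relationship between `plan` and `sessions` changes from one region to another.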

## The Importance of EDA in Practice

With the increased availability of computing power and software, EDA has become even more powerful. Today, data scientists have a wide variety of tools in R and Python to explore their data.

EDA is not just a step; it's a mindset. It's about being curious, asking questions, and letting the data speak to you. By investing the time in this crucial first step, you'll be well on your way to creating more accurate and effective data projects.

---

### Key Ideas:
- EDA is the first, and often the most important, step in any data science project.
- EDA involves summarizing data with metrics and visualizing it with charts.
- The goal of EDA is to gain intuition and understanding of the data.
