# Introduction to Data Visualization

**Data visualization** is the process of presenting data in a graphical format. It is a key aspect of data science, and it is especially **crucial** in organizations because decisions are often made based on a final visualization or interpretation of the data.

---

## Why Do We Use Data Visualization?
We can think of data visualization as the **interface between those who work directly with data sources and those who make higher-level strategic decisions in a business**. It’s important to be able to convey information to others, especially people who may not have the technical background needed to interpret the raw data or the statistical analysis.

Remember that **the purpose of data science at an enterprise level** is to inform _key decisions_ and improve products or services. Data visualization is a tool for this purpose; _it's not the end goal_ itself.

---

## Which Chart to Use?
When you think about using data visualization, you should always ask these two questions:

1. _"What is the information I want to share or the story I am trying to tell?"_
2. _"How does this visualization help in conveying that information to others?"_

Information often doesn't need a pretty visual to be useful. A more basic approach, such as a simple summary metric (e.g. the mean), might do the job.
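As a minimal sketch of the "simple metrics" approach, the snippet below summarizes a small sample with Python's standard `statistics` module. The `delivery_minutes` values are made up for illustration; they are not from any real dataset.

```python
import statistics

# Hypothetical sample: delivery times in minutes (made-up values).
delivery_minutes = [32, 28, 35, 30, 31, 29, 90]

mean_minutes = statistics.mean(delivery_minutes)
median_minutes = statistics.median(delivery_minutes)

print(f"mean:   {mean_minutes:.1f} min")
print(f"median: {median_minutes:.1f} min")
```

Here the single large value pulls the mean well above the median, which is already a useful finding without any chart.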

If you _do need_ data visualization to tell your story, it's time to decide what type of plot, chart, or visualization to use.

In this section, you'll learn the key characteristics of each plot type so you can decide whether it is the one you need. You can then check that plot's own section for more detailed information.

### 1. Scatter Plots
Scatter plots allow us to visualize the relationship between *two* (usually continuous) numerical variables. Each point represents a pair of values (x, y).

We use scatter plots to:

* Identify correlations (_Do the variables tend to increase or decrease together?_).
* Detect clusters (_Are there groups of data points that are similar to each other?_).
* Find outliers (_Are there any data points that are far away from the general trend?_).
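The correlation that a scatter plot shows visually can also be checked numerically. The sketch below computes the Pearson correlation coefficient by hand, using only the standard library. The `pearson_r` helper and the `hours`/`scores` values are hypothetical and exist only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (x, y) pairs: hours studied vs. exam score (made-up values).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

A value close to +1 or -1 suggests the points would fall near a straight line. A value near 0 does not rule out clusters or a non-linear pattern, which is one reason the plot itself is still worth looking at.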

### 2. Line Plots
Line plots allow us to visualize how one variable changes along another when the x-axis values have a natural, *continuous* order. They are most commonly used for time series data (showing how something changes over time).

We use line plots in order to:

* Track trends over time (_Stock prices, temperature changes, population growth, etc._).
* Show continuous processes (_Anything where the x-axis represents a continuous quantity, e.g., distance, speed, concentration_).

We should _not_ use line plots for:

* Unrelated data points with no inherent continuous relationship between them (_for example, in the "tips" dataset, connecting the total bills of different parties is meaningless_). A trend line fitted to a scatter plot is different: it summarizes the overall relationship instead of connecting individual points.
* A categorical x-axis, in most cases (_if the x-axis represents distinct categories, e.g., countries or product types, a bar plot is usually more appropriate_).
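A minimal line plot sketch follows, assuming `matplotlib` is installed. The monthly temperatures are made-up values, and the figure is written to an in-memory buffer so that the snippet runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical time series: average temperature per month (made-up values).
months = list(range(1, 13))
avg_temp_c = [2, 4, 8, 13, 18, 22, 25, 24, 19, 13, 7, 3]

fig, ax = plt.subplots()
# The x values are ordered in time, so connecting consecutive points is meaningful.
ax.plot(months, avg_temp_c, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Average temperature (°C)")
ax.set_title("Temperature over one year")

# Render to memory; pass a file name to savefig to write an image file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

The line is appropriate here because each point follows the previous one in time. With unordered or categorical x values, the connecting segments would suggest a progression that does not exist.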

### 3. Distribution Plots (Histograms)
Histograms allow us to visualize the *distribution* of a *single, continuous* numerical variable. They show how frequently different values occur.

We use histograms in order to:

* Understand data spread (_Is the data tightly clustered or widely spread?_).
* Identify skewness (_Is the distribution symmetrical or skewed to one side?_).
* Detect multimodality (_Are there multiple peaks in the distribution, suggesting different subgroups?_).

In a histogram:

* **X-axis:** represents the continuous numerical variable being analyzed.
* **Y-axis:** represents the frequency (count) or density of data points within each bin.
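To make the idea of bins concrete, here is a rough sketch of the counting behind a histogram, written in plain Python. The `histogram_counts` helper and the `ages` sample are hypothetical, and equal-width bins are an assumption; plotting libraries offer other binning strategies.

```python
def histogram_counts(values, bin_count):
    """Split the range of `values` into equal-width bins and count values per bin."""
    low, high = min(values), max(values)
    if bin_count < 1 or high == low:
        raise ValueError("need at least one bin and a non-zero value range")
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        index = min(int((v - low) / width), bin_count - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bin_count + 1)]
    return edges, counts

# Hypothetical sample: ages of survey respondents (made-up values).
ages = [21, 22, 23, 25, 25, 26, 28, 30, 31, 33, 35, 38, 41, 45, 52, 60]

edges, counts = histogram_counts(ages, bin_count=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

The bin edges correspond to the x-axis and the counts to the y-axis. In this sample most values land in the first bin and the counts tail off to the right, which is what a right-skewed distribution looks like.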

_To compare the distribution of a numerical variable across the levels of a categorical variable, use box plots_.

### 4. Categorical Plots (Bar Plots)
Bar plots allow us to visualize the relationship between a *categorical* variable (x-axis) and a *numerical* variable (y-axis). They are often used to compare a statistic (e.g., mean, count, sum) across different categories.

We use bar plots in order to:

* Compare group means (_e.g., average income by education level_).
* Show counts (_Number of observations in each category_).
* Display proportions (_Percentage of respondents choosing each option in a survey_).

**Bar plots** have a categorical x-axis and a numerical y-axis, while **histograms** have a continuous numerical x-axis and a frequency/count y-axis.
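The statistic behind each bar has to be computed per category before anything is drawn. The sketch below groups made-up `(education level, income)` records and computes the mean for each group using only the standard library. All names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (education level, annual income in thousands). Made-up values.
records = [
    ("High school", 35), ("High school", 41),
    ("Bachelor", 52), ("Bachelor", 58), ("Bachelor", 61),
    ("Master", 70), ("Master", 76),
]

incomes_by_level = defaultdict(list)
for level, income in records:
    incomes_by_level[level].append(income)

# One bar per category: the bar height is the mean of that group.
mean_income = {level: mean(values) for level, values in incomes_by_level.items()}

# Crude text rendering: one '#' per 5 units of mean income.
for level, value in mean_income.items():
    print(f"{level:<12} {'#' * round(value / 5)} {value:.1f}")
```

Swapping `mean` for `len` or `sum` gives the count or total per category, which covers the other common uses of a bar plot.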

### 5. Box Plots
Box plots (also known as box-and-whisker plots) are a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. 

Box plots are particularly useful for:

* Comparing distributions (_Quickly comparing the central tendency, spread, and skewness of multiple datasets_).
* Identifying outliers (_Points that fall outside the "whiskers" are often considered outliers_).
* Visualizing data spread (_Showing the interquartile range (IQR = Q3 - Q1) and the overall range of the data_).
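The five-number summary can be computed directly, as in the sketch below, which uses `statistics.quantiles` from the standard library (Python 3.8+). The `five_number_summary` helper and the sample are hypothetical. The 1.5 × IQR outlier fences are an assumed convention (often called Tukey's rule), not something this guide prescribes, and different tools may compute quartiles slightly differently.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical sample with one unusually large value (made-up values).
data = [7, 9, 10, 11, 12, 13, 14, 15, 40]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the quartiles count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("min, Q1, median, Q3, max:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
print("outliers:", outliers)
```

In a box plot, Q1 and Q3 form the edges of the box and the median is the line inside it. Under the assumed rule, the whiskers stop at the most extreme values inside the fences, and anything beyond them is drawn as an individual point.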