# Basic and Specialized Visualization Tools

## Basic Visualization Tools

### Area Plots with Matplotlib

This section explains what area plots are, when to use them, and how to create them using Matplotlib and Pandas.

#### What is an Area Plot?
* An area plot (also called an area chart or area graph) is a visualization that displays the magnitude and proportion of multiple variables over a continuous axis (usually time). It is like a line plot, but the area below each line is filled with color.
* **Key Features:**
    * Shows the cumulative magnitude of variables.
    * Emphasizes the overall trend and the contribution of each variable to the total.
    * Good for comparing multiple quantities over time.

#### When to Use Area Plots
Area plots are particularly useful when:
* You want to show the cumulative total of a quantity over time.
* You want to compare the contributions of different categories to a total over time.
* You want to emphasize the overall trend rather than individual data points.
* The data is cumulative in nature.
* You are visualizing population demographics.
* You are displaying the contribution of resources across various sectors.

#### Creating the Area Plot

```python
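import matplotlib.pyplot as plt

# kind='area' creates the area plot; figsize sets the figure size in inches.
# df is assumed to be a DataFrame with one numeric column per series.
df.plot(kind='area', figsize=(10, 6))
plt.show()  # show the plot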
```
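
The call above can be tried end to end with a small sketch. The DataFrame below is made-up illustration data (the names `df_demo`, `A`, `B`, `C` and all values are assumptions, not the dataset used in these notes). It also shows that, with the default stacking, the top edge of the stack is the row-wise total.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: yearly counts for three made-up categories
df_demo = pd.DataFrame(
    {"A": [10, 12, 15, 18], "B": [5, 7, 6, 9], "C": [2, 3, 5, 4]},
    index=[2010, 2011, 2012, 2013],
)

# Area plots are stacked by default, so each band sits on top of the previous one
ax = df_demo.plot(kind="area", figsize=(10, 6))
ax.set_xlabel("Year")
ax.set_ylabel("Count")

# The height of the full stack at each year is the row-wise sum
stack_top = df_demo.sum(axis=1)

plt.show()
```

Passing `stacked=False` overlays the areas instead of stacking them, which suits comparing individual series rather than their contribution to a total.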

### Histograms with Matplotlib

This section defines histograms, explains their purpose, and demonstrates how to create them using Matplotlib and NumPy.

#### What is a Histogram?
* A histogram is a graphical representation of the _frequency distribution_ of a _numerical_ dataset. It shows how many data points fall within specific ranges (called "bins").
* **How it works:**
    * **Partitioning:** The range of the data (the difference between the minimum and maximum values) is divided into a series of intervals called _bins_. These bins are usually of equal width.
    * **Counting:** For each bin, the number of data points that fall within that bin's range is counted. This count is the _frequency_ for that bin.
    * **Plotting:** A bar is drawn for each bin. The bar's width spans the bin's range, and its height shows the bin's frequency.
* **Key Features:**
    * **X-axis:** Represents the range of the numerical variable, divided into bins.
    * **Y-axis:** Represents the frequency (count) or relative frequency (proportion) of data points within each bin.
    * **Bars:** Adjacent bars (usually touching) represent the frequency of each bin.
* **Purpose:** 
    * To visualize the distribution of a numerical variable.
    * To show the _shape_ of the distribution (symmetrical, skewed, bimodal, etc.).
    * To identify the _central tendency_ (where most data points are clustered).
    * To identify _outliers_ (unusually high or low values).
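
The partitioning and counting steps can be sketched with `np.histogram`. The sample values are hypothetical and chosen only to keep the arithmetic easy to check by hand; three bins are used here instead of NumPy's default of ten.

```python
import numpy as np

# Hypothetical sample values, chosen only to make the arithmetic easy to follow
values = np.array([1, 2, 2, 3, 5, 6, 6, 6, 9, 10])

# Partitioning and counting: 3 equal-width bins spanning min..max
counts, bin_edges = np.histogram(values, bins=3)

# Relative frequency: each count divided by the total number of points
relative = counts / counts.sum()

print(bin_edges)  # [ 1.  4.  7. 10.]
print(counts)     # [4 4 2]
print(relative)   # [0.4 0.4 0.2]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which also includes the maximum value.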

#### Creating Histograms with Matplotlib and NumPy

```python
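import matplotlib.pyplot as plt
import numpy as np

# Use NumPy to compute the bin edges (10 equal-width bins by default)
count, bin_edges = np.histogram(data_2013)

# Plot the histogram and place the x-axis ticks at the bin edges
data_2013.plot(kind='hist', figsize=(8, 5), xticks=bin_edges)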

plt.title('Histogram of Immigration from Various Countries in 2013')
plt.xlabel('Number of Immigrants')
plt.ylabel('Number of Countries')
plt.show()
```

### Bar Charts with Matplotlib

This section explains what bar charts are, when to use them, and how to create them using Matplotlib and Pandas.

#### What is a Bar Chart?
* A bar chart (also called a bar graph) is a type of plot that uses rectangular bars to represent the value of a categorical variable. The length (or height) of each bar is proportional to the value it represents.
* **Key Features:**
    * **X-axis (usually):** Represents the categories being compared.
    * **Y-axis (usually):** Represents the numerical value associated with each category.
    * **Bars:** Rectangular bars, either vertical (more common) or horizontal, representing each category. The length/height of the bar corresponds to the value.
    * **Discrete Data:** Bar charts are best suited for _discrete_ categorical data, where the categories are distinct and separate. (Histograms, in contrast, are used for _continuous_ numerical data).

#### When to Use Bar Charts
Bar charts are excellent for:
* **Comparing values across categories:** Showing the differences in magnitude between different groups.
* **Displaying frequencies or counts:** Showing how many items fall into each category.
* **Tracking changes over time (for a limited number of time points):** If you have data for a few distinct time periods, a bar chart can be used to show changes. For many time points, a line graph is usually better.
* **Ranking:** Displaying ranked values.

#### Creating Bar Charts with Matplotlib and Pandas

```python
import matplotlib.pyplot as plt

# kind='bar' creates a vertical bar chart; df is assumed to hold one value per year
df.plot(kind='bar', figsize=(10, 6))

plt.title('Immigration from Iceland to Canada (1980-2013)')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.show()
```

## Specialized Visualization Tools

### Pie Charts


This section explains what pie charts are, how to create them in Matplotlib, and discusses their limitations and alternatives.

#### What is a Pie Chart?
* **Definition:** A pie chart is a circular statistical graphic that represents proportions of a whole. The circle is divided into slices (like pieces of a pie), and the size of each slice is proportional to the quantity it represents.
* **Key Features:**
    * **Circle:** Represents the whole (100%).
    * **Slices:** Represent parts of the whole.  The area (and central angle) of each slice is proportional to the value it represents.
    * **Proportions:** Pie charts are best for showing *proportions* or *percentages*, not absolute values.

#### When to Use (and When *Not* to Use) Pie Charts
* **Appropriate Use Cases:**
    * Showing the composition of a *single* categorical variable with a *small* number of categories (ideally no more than 5 to 7).
    * When you want to emphasize the relative proportions of each category to the *whole*.
* **When to *Avoid* Pie Charts (and Why):**
    * **Many Categories:** If you have too many categories, the slices become thin and hard to compare.  The chart becomes cluttered and difficult to read.
    * **Comparing Values Across Categories:**  Humans are not good at judging relative areas accurately, especially in circular shapes.  It's much harder to compare the sizes of pie slices than the lengths of bars in a bar chart.
    * **Showing Changes Over Time:** Pie charts are not suitable for showing trends over time.
    * **Comparing Different Datasets:** Comparing multiple pie charts side-by-side is extremely difficult.
* **Alternatives to Pie Charts:**
    * **Bar Charts:**  Almost always a better choice for comparing categories.  They are much easier to read and interpret accurately.
    * **Stacked Bar Charts:**  Can show proportions within categories, similar to a pie chart, but in a more readable format.
    * **Treemaps:** Another alternative for visualizing hierarchical data.
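
To illustrate the bar chart alternative, the sketch below converts totals into percentages of the whole, as a pie chart would, and draws them as horizontal bars. The `totals` Series and its region names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals for a single categorical variable
totals = pd.Series({"North": 120, "South": 80, "East": 150, "West": 50})

# Convert absolute values to percentages of the whole, as a pie chart would
share = totals / totals.sum() * 100

# Sort so the ranking is readable from bar length alone
ax = share.sort_values().plot(kind="barh", figsize=(8, 4))
ax.set_xlabel("Share of total (%)")
plt.show()
```

The same proportions are shown, but the reader compares lengths along a common axis instead of angles or areas.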

#### Creating Pie Charts with Matplotlib and Pandas

```python
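import matplotlib.pyplot as plt

# One color and one explode offset per slice; df_continents is assumed to have six rows
colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']
explode_list = [0.1, 0, 0, 0, 0.1, 0.1]  # Offset of each slice from the center, as a fraction of the radius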

df_continents['Total'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%',  # Add percentages
                            startangle=90,     # Start angle for the first slice
                            shadow=True,       # Add a shadow
                            labels=None,         # Remove labels from slices, put in legend.
                            pctdistance=1.12,    # Distance of the autopct text from the center, as a fraction of the radius
                            colors=colors_list,  # Set custom colors
                            explode=explode_list # Explode slices
                            )

plt.title('Immigration to Canada by Continent (1980-2013)', y = 1.1) # y increases the distance of title from the plot
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.legend(labels=df_continents.index, loc='upper left') # Add legend
plt.show()
```

### Box Plots with Matplotlib

This section explains what box plots (also known as box-and-whisker plots) are, how to interpret them, and how to create them using Matplotlib and Pandas.

#### What is a Box Plot?
* A box plot is a standardized way of displaying the distribution of numerical data based on five key summary statistics:
    * **Minimum:**  The smallest value in the dataset, *excluding outliers*. (More on outliers below.)
    * **First Quartile (Q1):**  The 25th percentile.  25% of the data falls below this value.
    * **Median (Q2):** The 50th percentile (the middle value). 50% of the data falls below this value.
    * **Third Quartile (Q3):** The 75th percentile.  75% of the data falls below this value.
    * **Maximum:** The largest value in the dataset, *excluding outliers*.
* **Outliers:** Data points that fall significantly outside the main pattern of the data.  Box plots typically identify outliers as points that are:
    *   Below Q1 - 1.5 * IQR
    *   Above Q3 + 1.5 * IQR
    *   Where IQR (Interquartile Range) = Q3 - Q1.  The IQR represents the range of the middle 50% of the data.
* **Visual Representation:**
    * **Box:**  The box itself represents the interquartile range (IQR), from Q1 to Q3.
    * **Line inside the box:**  Represents the median (Q2).
    * **Whiskers:** Lines extending from the box.  They typically extend to the minimum and maximum values *within* 1.5 * IQR of the quartiles.
    * **Individual Points:**  Outliers are plotted as individual points beyond the whiskers.
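
The five summary statistics and the 1.5 * IQR outlier rule can be computed directly. This is a minimal sketch on a hypothetical sample; it assumes quartiles are taken with `np.percentile` and its default linear interpolation, and other quartile conventions can give slightly different values.

```python
import numpy as np

# Hypothetical sample with one unusually large value
data = np.array([2, 4, 5, 6, 7, 8, 9, 11, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
inside = data[(data >= lower_fence) & (data <= upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)        # 5.0 7.0 9.0 4.0
print(lower_fence, upper_fence)   # -1.0 15.0
print(outliers)                   # [30]
```

Here the upper whisker stops at 11, the largest value below the upper fence of 15, and 30 is drawn as an individual outlier point.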

#### Interpreting a Box Plot
A box plot provides a quick visual summary of:
* **Central Tendency:**  The median line shows the center of the data.
* **Spread (Variability):**  The IQR (the size of the box) and the length of the whiskers show how spread out the data is.  A larger box and longer whiskers indicate greater variability.
* **Skewness:**
    * **Symmetrical Distribution:**  The median is in the center of the box, and the whiskers are roughly equal in length.
    * **Right-Skewed Distribution:**  The median is closer to Q1 (the bottom of the box in a vertical plot), and the whisker on the high-value side is longer.
    * **Left-Skewed Distribution:** The median is closer to Q3 (the top of the box in a vertical plot), and the whisker on the low-value side is longer.
* **Outliers:**  The presence and location of outliers are clearly visible.
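
One rough way to quantify the skewness reading is to measure where the median sits inside the box. The helper `median_position` below is a hypothetical illustration, not a library function, and it assumes the IQR is not zero.

```python
import numpy as np

def median_position(values):
    """Return where the median sits inside the box, from 0 (at Q1) to 1 (at Q3)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (median - q1) / (q3 - q1)

# Hypothetical samples
symmetric = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
right_skewed = np.array([1, 1, 2, 2, 3, 4, 6, 10, 20])

print(median_position(symmetric))     # 0.5
print(median_position(right_skewed))  # 0.25
```

A value near 0.5 matches the symmetrical case, while a value well below 0.5 means the median sits close to Q1, as in a right-skewed distribution.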

#### Creating Box Plots with Matplotlib and Pandas

```python
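import matplotlib.pyplot as plt

# df_japan is assumed to be a DataFrame with one numeric column of yearly counts
df_japan.plot(kind='box', figsize=(8, 6))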

plt.title('Box Plot of Japanese Immigrants to Canada (1980-2013)')
plt.ylabel('Number of Immigrants')
plt.show()
```

### Scatter Plots with Matplotlib

This section explains what scatter plots are, when to use them, and how to create them using Matplotlib and Pandas.

#### What is a Scatter Plot?
* A scatter plot is a type of graph that displays the relationship between *two numerical variables*.  Each data point is represented as a point on a two-dimensional plane.
* **Key Features:**
    * **X-axis:** Represents one numerical variable (often the *independent* variable).
    * **Y-axis:** Represents the other numerical variable (often the *dependent* variable).
    * **Points:** Each point represents a single observation (e.g., a person, a country, a year). The position of the point is determined by its values on the x and y variables.

#### When to Use Scatter Plots
Scatter plots are primarily used to:
* **Identify Relationships:** Determine if there's a relationship (correlation) between two variables.  This relationship can be:
    * **Positive Correlation:** As one variable increases, the other tends to increase (upward trend).
    * **Negative Correlation:** As one variable increases, the other tends to decrease (downward trend).
    * **No Correlation:** No clear relationship between the variables (points are scattered randomly).
* **Identify Outliers:**  Find data points that deviate significantly from the general pattern.
* **Identify Clusters:** See if the data points group together in distinct clusters.
* **Visualize Large Datasets:** Because every observation is drawn as its own point, overall patterns remain visible even when there are many observations.
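
The direction of a relationship seen in a scatter plot can be checked numerically with the Pearson correlation coefficient. The arrays below are hypothetical paired observations used only for illustration.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5, 6])
y_up = np.array([2, 4, 5, 4, 6, 7])    # tends to rise with x
y_down = np.array([9, 7, 8, 5, 4, 2])  # tends to fall as x rises

# Pearson correlation coefficient: the sign gives the direction of the trend
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(r_up > 0, r_down < 0)  # True True
```

A coefficient near zero would correspond to the "no correlation" case, although it only measures linear relationships.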

#### Creating Scatter Plots with Matplotlib and Pandas

```python
import matplotlib.pyplot as plt

# 'Year' and 'Total' are assumed to be numeric columns of df_total
df_total.plot(kind='scatter', x='Year', y='Total', figsize=(10, 6))

plt.title('Total Annual Immigration to Canada (1980-2013)')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.show()
```