##### Data Visualization with Python
---

# 2. Types of Plots

## 2.1. Line Plot (Line Chart)
A line plot, also known as a line chart, displays data as a series of points connected by straight line segments. It's a fundamental chart type used to visualize trends and changes in data over a continuous interval, most commonly *time*. The independent variable (often time) is plotted on the x-axis, and the dependent variable is plotted on the y-axis.

### Suitable Variable Types
* **X-axis (Independent Variable):** Usually a continuous variable, most often representing time (e.g., years, months, days) or another ordered quantity. Can be ordinal or interval/ratio level data. While less common, the x-axis *could* represent categories if those categories have a clear, inherent order (e.g., stages of a process), but a bar chart is often better in those cases.
* **Y-axis (Dependent Variable):** Typically a numerical variable (interval or ratio level data). The y-axis shows the value of the variable that is changing in response to the independent variable.

### Use Cases
1. **Showing Trends Over Time (Time Series Data):** This is the most common use case. Examples include:
    * Stock prices over days, months, or years.
    * Temperature fluctuations over a period.
    * Population growth over time.
    * Company revenue or profit over quarters or years.
    * Website traffic over days or weeks.
2. **Visualizing Continuous Changes:** Line plots are effective for showing how a variable changes continuously in response to another, even if the independent variable isn't strictly time. Examples include:
    * The relationship between speed and fuel efficiency of a car.
    * The change in a chemical reaction rate with increasing temperature.
    * The growth of a plant in relation to the amount of sunlight it receives.
3. **Comparing Multiple Series:** Line plots can display multiple lines on the same graph, making it easy to compare trends across different groups or categories.  For example:
    * Comparing the stock prices of several different companies.
    * Tracking the sales of different product lines over time.
    * Comparing the growth rates of different populations.
4. **Highlighting Patterns:** Line plots make patterns easy to spot, including fluctuations, sustained increases or decreases, and changes in the rate of change (the slope of the line).

### Potential Pitfalls
1. **Misleading Scales:**  The choice of scale on the y-axis can dramatically alter the perception of the trend.  A truncated y-axis (not starting at zero) can exaggerate changes, while an overly wide y-axis range can minimize them.  It's crucial to choose scales thoughtfully and ethically.  Always consider starting at zero unless there's a very strong and justifiable reason not to.
2. **Overplotting (Too Many Lines):** If you plot too many lines on the same graph, it can become cluttered and difficult to interpret.  Consider using separate plots, small multiples, or interactive features (like tooltips or toggles) to handle many series.
3. **Interpolation Issues:** The straight lines connecting data points *imply* a continuous trend, even if the underlying data is only collected at discrete intervals.  Be cautious about interpreting values *between* the plotted points, especially if the data is sparse or the underlying phenomenon isn't truly continuous.
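4. **Ignoring Irregular Intervals:** If the data points are *not* evenly spaced along the x-axis (e.g., unevenly spaced time intervals) but are drawn at equal spacing (for example, because the x-values are treated as category labels), the plot will visually distort the rate of change.  Plot against the actual numeric or date values so the spacing reflects the real intervals, add markers at each observation, or explicitly indicate the uneven intervals.
5. **Extrapolation:** Extending a trend line beyond the observed time frame assumes the pattern will continue unchanged.  The plot only supports conclusions about the range that was actually measured.
6. **Causation vs. Correlation:** When two lines rise or fall together, it is easy to assume that one causes the other.  A shared trend shows correlation only; both series may be driven by a third factor or may coincide by chance.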
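
The effect of the y-axis scale described in the first pitfall can be seen by drawing the same series twice. The following is a minimal sketch using Matplotlib; the yearly values are made up for illustration, and the variable names are not taken from any particular dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical yearly values with a small upward drift (illustrative only).
years = [2019, 2020, 2021, 2022, 2023]
values = [100.0, 100.8, 101.5, 102.1, 103.0]

fig, (ax_zoomed, ax_zero) = plt.subplots(1, 2, figsize=(9, 3.5))

# Left: Matplotlib's default autoscaling zooms in on the data range,
# so a change of about 3% looks steep.
ax_zoomed.plot(years, values, marker="o")
ax_zoomed.set_title("Autoscaled y-axis")

# Right: the same data with the y-axis forced to start at zero.
ax_zero.plot(years, values, marker="o")
ax_zero.set_ylim(bottom=0)
ax_zero.set_title("y-axis starting at zero")

for ax in (ax_zoomed, ax_zero):
    ax.set_xticks(years)
    ax.set_xlabel("Year")
    ax.set_ylabel("Value")

fig.tight_layout()
```

Call `plt.show()` (or `fig.savefig(...)`) to render the figure. Neither panel is wrong on its own; which one is appropriate depends on whether small changes are meaningful for the quantity being shown.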

### Example (Conceptual)
Imagine plotting the monthly average temperature in a city over several years. The x-axis would represent time (months and years), and the y-axis would represent temperature. The line plot would clearly show the seasonal temperature cycle, any long-term warming or cooling trends, and potentially any unusual temperature spikes or dips.
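
A sketch of this example in Matplotlib might look like the following. The temperatures are synthetic (a sine-shaped seasonal cycle plus a slight upward trend and random noise), so the numbers are assumptions chosen only to produce a plausible shape.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: three years of monthly average temperatures (illustrative only).
rng = np.random.default_rng(seed=0)
n_months = 36
months = np.arange(n_months)                           # 0, 1, ..., 35
seasonal = 10 * np.sin(2 * np.pi * (months - 3) / 12)  # yearly cycle
trend = 0.02 * months                                  # slight long-term warming
noise = rng.normal(scale=1.0, size=n_months)
temperature = 15 + seasonal + trend + noise

fig, ax = plt.subplots(figsize=(9, 3.5))
ax.plot(months, temperature, marker="o", markersize=3)

# Label the start of each year instead of every month to keep the axis readable.
ax.set_xticks(months[::12])
ax.set_xticklabels(["Year 1", "Year 2", "Year 3"])
ax.set_xlabel("Time (months)")
ax.set_ylabel("Average temperature (°C)")
ax.set_title("Monthly average temperature (synthetic data)")
fig.tight_layout()
```

Call `plt.show()` to display the figure. With real data, the x-values would typically be dates rather than a month index.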

## 2.2. Bar Chart (Bar Graph)

A bar chart, also known as a bar graph, uses rectangular bars to represent a numerical value for each category of a categorical variable. The length (or height) of each bar is proportional to the value it represents. Bar charts are excellent for comparing values across different categories or groups.

### Suitable Variable Types
* **X-axis (typically):** Represents the *categories* being compared. This is usually a categorical (nominal or ordinal) variable.
* **Y-axis (typically):** Represents the *numerical value* associated with each category. This is usually a numerical variable (interval or ratio).
* **Note:** It's also possible to have the axes swapped (horizontal bar chart), in which case the variable types would also be swapped.

### Use Cases
1. **Comparing Values Across Categories:** This is the primary use case. Examples include:
    * Comparing sales figures for different product lines.
    * Showing the population of different countries.
    * Displaying the number of students enrolled in different courses.
    * Comparing average income across different professions.
2. **Displaying Frequencies or Counts:** Showing how many items fall into each category.  For example:
    * The number of respondents who selected each answer choice in a survey.
    * The frequency of different types of errors in a system.
3. **Tracking Changes Over Time (Limited Time Points):** While line graphs are generally preferred for time series data, bar charts can be used effectively if you have only a *few* distinct time points. For example, comparing sales figures for Q1, Q2, Q3, and Q4 of a single year. *Avoid* using bar charts for many time points; use a line graph instead.
4. **Ranking:** Displaying ranked values (e.g., top 10 products by sales).

### Potential Pitfalls
1. **Misleading Scales:** Similar to line plots, the y-axis scale can significantly impact the visual impression. A truncated y-axis (not starting at zero) can exaggerate differences between bars. Always consider starting the y-axis at zero, especially when representing counts or frequencies.  If you *must* use a truncated axis, clearly indicate this to the viewer.
2. **Too Many Categories:** If you have a very large number of categories, a bar chart can become cluttered and difficult to read. Consider grouping categories, using a horizontal bar chart (which often handles many categories better), or using a different visualization type.
3. **Ordering of Categories:** For nominal categories (no inherent order), consider ordering the bars by value (e.g., descending order of frequency) to make comparisons easier. For ordinal categories, maintain the logical order.
4. **3D Bar Charts:** Avoid 3D bar charts. They add no extra information and often distort the data, making it harder to accurately compare bar lengths.
5. **Overlapping Bars:** Avoid overlapping bars, which can make it difficult to read the values.  Use grouped or stacked bar charts (discussed separately) instead.
6. **Comparing Groups of Unequal Size:** Be careful when making direct bar-to-bar comparisons of raw counts when the groups have different sample sizes. In such cases, rates or percentages are usually a fairer basis for comparison.

### Example (Conceptual)
Imagine you want to compare the number of students enrolled in different academic departments (e.g., Biology, Chemistry, Physics, Mathematics).  A bar chart would be ideal. The x-axis would list the departments (categories), and the y-axis would represent the number of students (numerical value). Each department would have a bar, and the height of the bar would directly correspond to the enrollment number, making comparisons easy.
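
This example could be sketched in Matplotlib as shown below. The enrollment figures are hypothetical. Because departments are nominal categories, the bars are sorted by value, and the y-axis is set to start at zero, following the guidance in the pitfalls above.

```python
import matplotlib.pyplot as plt

# Hypothetical enrollment counts (made up for illustration).
enrollment = {"Biology": 240, "Chemistry": 180, "Physics": 150, "Mathematics": 210}

# Departments have no inherent order, so sort by count (descending)
# to make comparisons easier.
ordered = sorted(enrollment.items(), key=lambda item: item[1], reverse=True)
departments = [name for name, _ in ordered]
counts = [count for _, count in ordered]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(departments, counts)
ax.set_ylim(bottom=0)  # bar length encodes the value, so start at zero
ax.set_xlabel("Department")
ax.set_ylabel("Number of students")
ax.set_title("Enrollment by department (hypothetical data)")
fig.tight_layout()
```

Replacing `ax.bar` with `ax.barh` (and swapping the axis labels) gives the horizontal variant, which tends to leave more room for long category names.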

## 2.3. Pie Chart
A pie chart is a circular statistical graphic that represents proportions of a whole. The circle is divided into slices (like pieces of a pie), where the area (and central angle) of each slice is proportional to the quantity it represents.

### Suitable Variable Types
* **Categorical Variable:** Pie charts are designed to show the parts of a *single* categorical variable. The categories should be mutually exclusive and exhaustive (covering all possibilities).
* **Proportions/Percentages:** The numerical values associated with each category represent proportions or percentages of the total.

### Use Cases
1. **Showing Composition of a Whole:** The primary use of a pie chart is to display how a whole is divided into its constituent parts.  Examples include:
    * Market share of different companies.
    * Budget allocation across different departments.
    * Demographic breakdown of a population (e.g., by ethnicity or age group).
    * Survey responses where respondents choose one option from a list.
2. **Emphasizing a Dominant Category:** If one category is significantly larger than the others, a pie chart can effectively highlight this dominance.
3. **Few Categories:** Pie charts work best when the whole is split into only a few categories (ideally 5 or fewer).

### Potential Pitfalls
1. **Difficulty Comparing Slices:** Humans are much better at judging lengths (as in bar charts) than areas or angles.  It can be difficult to accurately compare the sizes of slices in a pie chart, especially if the proportions are close.
2. **Too Many Categories:** With more than a few slices, the pie chart becomes cluttered and difficult to read. Small slices become almost invisible.
3. **3D Pie Charts:**  *Never* use 3D pie charts. They distort the proportions and make the chart misleading. The added perspective makes slices closer to the viewer appear larger than they should be.
4. **Comparing Across Multiple Pie Charts:**  It's extremely difficult to compare proportions across different pie charts.  If you need to make comparisons across groups, use a different chart type (e.g., stacked bar chart).
5. **Representing Absolute Values:** Pie charts show *proportions*, not absolute values.  If the absolute values are important, include them in labels or use a different chart type.
6. **Exploding Slices:** While "exploding" a slice (pulling it out from the center) can highlight it, overuse of this technique can make the chart look messy and distract from the overall proportions.
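7. **Donut Charts:** Donut charts are pie charts with a hole in the center, so they share the pitfalls listed above. Switching to a donut does not make the slices easier to compare.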

### Example (Conceptual)
Imagine you're surveying people about their favorite type of fruit. A pie chart could show the percentage of respondents who prefer apples, bananas, oranges, and other fruits.  If a large majority prefers apples, that slice would be visually dominant.  If the preferences are evenly split, the slices would be roughly equal in size.
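
A minimal Matplotlib sketch of this survey example follows. The response counts are invented for illustration; `autopct` prints each slice's percentage so readers do not have to estimate it from the angle alone.

```python
import matplotlib.pyplot as plt

# Hypothetical survey results: number of respondents per favorite fruit.
labels = ["Apples", "Bananas", "Oranges", "Other"]
votes = [45, 25, 20, 10]

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(
    votes,
    labels=labels,
    autopct="%1.0f%%",   # show each slice's share of the total
    startangle=90,       # start the first slice at 12 o'clock
    counterclock=False,  # lay slices out clockwise
)
ax.set_title("Favorite fruit (hypothetical survey)")
fig.tight_layout()
```

Matplotlib normalizes the values to proportions of their sum, so raw counts can be passed directly. If the absolute counts matter, they need to be added to the labels separately.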

## 2.4. Scatter Plot
A scatter plot displays the relationship between *two* numerical variables. Each data point is represented as a point on a two-dimensional plane, with one variable determining the position on the x-axis and the other variable determining the position on the y-axis.

### Suitable Variable Types
*  **X-axis (Independent Variable):** Numerical (interval or ratio).
*  **Y-axis (Dependent Variable):** Numerical (interval or ratio).

### Use Cases
1. **Identifying Relationships (Correlation):** Scatter plots are primarily used to see if there's a relationship, or correlation, between two variables. The relationship can be:
    * **Positive Correlation:** As one variable increases, the other tends to increase (an upward trend).
    * **Negative Correlation:** As one variable increases, the other tends to decrease (a downward trend).
    * **No Correlation:** No clear relationship (points are scattered randomly).
2. **Identifying Outliers:**  Data points that fall far away from the general pattern can be easily identified as outliers.
3. **Identifying Clusters:** Scatter plots can reveal whether the data points group together in distinct clusters, suggesting subgroups within the data.
4. **Visualizing Large Datasets:** Scatter plots can effectively handle large datasets, revealing overall patterns and trends even with many data points.
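5. **Non-Linear Relationships:** Scatter plots can reveal relationships that are not straight lines (e.g., curved, U-shaped, or leveling-off patterns), which a single correlation coefficient may miss.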

### Potential Pitfalls
1. **Overplotting (Too Many Points):** With very large datasets, points can overlap excessively, making it difficult to see the underlying pattern.  Solutions include:
    * **Transparency:**  Making the points partially transparent (using the `alpha` parameter in Matplotlib).
    * **Smaller Markers:** Using smaller point markers.
    * **Sampling:**  Plotting a random sample of the data.
    * **Density Plots:**  Using a 2D histogram or density plot (e.g., a hexbin plot) to show the density of points.
2. **Correlation vs. Causation:**  A scatter plot can show a *correlation* between two variables, but it *does not* prove causation.  Just because two variables are related doesn't mean that one causes the other. There might be a third, unobserved variable influencing both.
3. **Misinterpreting Clusters:**  Clusters can be meaningful, but they can also be artifacts of the data collection process or simply random chance.  Careful interpretation is needed.
4. **Ignoring a Third Variable:**  A simple scatter plot only shows the relationship between two variables.  A third variable might be influencing the relationship.  Consider using color, size, or shape to represent a third variable.
5. **Extrapolation:**  Avoid making predictions outside the range of the observed data. The relationship between the variables might not hold true beyond the plotted points.
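6. **Unequal Variances:**  If the spread of the y-values changes across the range of x (heteroscedasticity, often visible as a fan or funnel shape), a single trend line or correlation coefficient can give a misleading summary of the relationship.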
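
Two of the overplotting remedies listed under the first pitfall, transparency and a hexbin density plot, can be compared side by side. The sketch below uses synthetic, randomly generated data; the point count, marker size, `alpha` value, and `gridsize` are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: many points with a moderate positive relationship.
rng = np.random.default_rng(seed=1)
n_points = 20_000
x = rng.normal(size=n_points)
y = 0.6 * x + rng.normal(scale=0.8, size=n_points)

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Remedy 1: small, partially transparent markers.
# Dense regions appear darker because overlapping points accumulate.
ax_alpha.scatter(x, y, s=4, alpha=0.1)
ax_alpha.set_title("Small markers with alpha=0.1")

# Remedy 2: hexagonal binning. Color encodes the number of points per bin.
hexes = ax_hex.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hexes, ax=ax_hex, label="Points per bin")
ax_hex.set_title("Hexbin density")

for ax in (ax_alpha, ax_hex):
    ax.set_xlabel("x")
    ax.set_ylabel("y")

fig.tight_layout()
```

Transparency keeps individual points visible at the edges of the cloud, while hexbin gives a more explicit density reading in the crowded center.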

### Example (Conceptual)
Imagine plotting the relationship between hours studied and exam scores for a group of students. The x-axis would represent hours studied, and the y-axis would represent exam scores.  A positive correlation would be expected (more study time generally leads to higher scores).  Outliers might represent students who studied a lot but still did poorly, or students who studied very little but did well.
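
The example could be sketched as follows. All data are simulated, and the third variable (hours of sleep, encoded as color) is a hypothetical addition used only to illustrate how color can carry extra information. The Pearson correlation coefficient is computed with `np.corrcoef` and shown in the title.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data for 60 students (illustrative only).
rng = np.random.default_rng(seed=2)
n_students = 60
hours = rng.uniform(0, 20, size=n_students)  # hours studied
sleep = rng.uniform(4, 9, size=n_students)   # hypothetical third variable
noise = rng.normal(scale=8, size=n_students)
scores = np.clip(40 + 2.0 * hours + 2.0 * (sleep - 6.5) + noise, 0, 100)

# Pearson correlation between hours studied and exam score.
r = np.corrcoef(hours, scores)[0, 1]

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(hours, scores, c=sleep, cmap="viridis")
fig.colorbar(points, ax=ax, label="Hours of sleep (hypothetical)")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title(f"Hours studied vs. exam score (Pearson r = {r:.2f})")
fig.tight_layout()
```

A positive `r` here reflects how the data were simulated; with real data it would describe an association, not evidence that studying causes higher scores.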

## 2.5. Box Plot (Box-and-Whisker Plot)
A box plot, also known as a box-and-whisker plot, is a standardized way of displaying the distribution of numerical data based on five key summary statistics: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. It provides a concise visual summary of the central tendency, spread, and skewness of a dataset, and also highlights potential outliers.

### Suitable Variable Types
* **Numerical (Interval or Ratio):** Box plots are designed for visualizing the distribution of a single numerical variable.
* **Categorical (Optional, for Comparisons):** Box plots are *very* useful for comparing the distributions of a numerical variable *across different categories*. You can create multiple box plots side-by-side, one for each category.

### Use Cases
1. **Summarizing a Distribution:** Quickly see the central tendency (median), spread (IQR and range), and skewness of a dataset.
2. **Comparing Distributions:**  The primary use case is comparing the distributions of a numerical variable across different groups or categories.  For example:
    * Comparing test scores of students in different classes.
    * Comparing the salaries of employees in different departments.
    * Comparing the response times of different servers.
3. **Identifying Outliers:** Box plots clearly identify potential outliers, which are data points that fall far outside the main pattern of the data.
4. **Assessing Symmetry and Skewness:** The position of the median within the box and the lengths of the whiskers provide information about the symmetry or skewness of the distribution.
5. **Checking Normality:** A box plot is not a definitive test, but it offers a quick visual check: roughly normal data tends to produce a symmetric box with a centered median, whiskers of similar length, and few outliers.

### Anatomy of a Box Plot
A box plot consists of the following components:

* **Box:** The box itself represents the interquartile range (IQR), which contains the middle 50% of the data (from Q1 to Q3).
* **Line inside the box:** Represents the median (Q2).
* **Whiskers:** Lines extending from the box. They typically extend to the minimum and maximum values *within* 1.5 * IQR of the quartiles.
    * **Lower Whisker:** Extends from Q1 to the smallest data point that is greater than or equal to (Q1 - 1.5 * IQR).
    * **Upper Whisker:** Extends from Q3 to the largest data point that is less than or equal to (Q3 + 1.5 * IQR).
* **Individual Points (Outliers):** Data points that fall outside the whiskers (i.e., below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR) are plotted as individual points. These are considered potential outliers.
* **IQR (Interquartile Range):**  IQR = Q3 - Q1.
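
The components above can be computed directly with NumPy. The function below is a sketch, not a library API: `box_plot_stats` and its return keys are names introduced here for illustration. It uses `np.percentile` with its default linear interpolation; other quartile conventions exist and can give slightly different values on small samples.

```python
import numpy as np

def box_plot_stats(values, whisker_factor=1.5):
    """Compute the quantities drawn in a box plot (illustrative helper)."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Fences: limits beyond which points count as potential outliers.
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr

    # Whiskers end at the most extreme data points still inside the fences.
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(sample)
print(stats)
```

For this sample, Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the upper fence is 7.75 + 1.5 * 4.5 = 14.5. The upper whisker therefore stops at 9, and 30 is plotted as a potential outlier.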

### Potential Pitfalls
1. **Hiding the Underlying Distribution:** A box plot summarizes the data, but it *doesn't* show the full distribution.  Two datasets with very different shapes (e.g., bimodal vs. uniform) could have similar box plots.  Consider using a histogram or violin plot *in addition to* a box plot to reveal the full distribution.
2. **Misinterpreting Outliers:**  Outliers identified by a box plot are *potential* outliers based on a specific rule (1.5 * IQR). They are not necessarily errors or invalid data points.  Always investigate outliers to understand *why* they are different.
3. **Small Sample Sizes:** Box plots can be misleading with very small sample sizes. The quartiles and median may not be reliable estimates of the population parameters.
4. **Assuming Normality:**  The 1.5 * IQR rule for outlier detection is calibrated with roughly normal data in mind: for a normal distribution, it flags only about 0.7% of points.  If the data is highly non-normal (e.g., strongly skewed or heavy-tailed), this rule might flag too many or too few points as outliers.

### Example (Conceptual)
Imagine comparing the heights of students in three different schools.  You could create three box plots, one for each school, side-by-side. This would allow you to quickly compare:

* **Median height:** Which school has the tallest typical student (the highest median)? Note that the median is not the same as the mean.
* **Spread of heights:** Which school has the greatest variation in student heights?
* **Skewness:** Are the heights in any school skewed towards taller or shorter students?
* **Outliers:** Are there any unusually tall or short students in any of the schools?
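
A side-by-side comparison like this could be sketched in Matplotlib as follows. The heights are randomly generated, and the school names, means, and spreads are assumptions chosen only to make the three boxes visibly different.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated student heights in cm for three hypothetical schools.
rng = np.random.default_rng(seed=3)
heights = {
    "School A": rng.normal(loc=150, scale=6, size=80),
    "School B": rng.normal(loc=155, scale=10, size=80),
    "School C": rng.normal(loc=148, scale=4, size=80),
}
names = list(heights.keys())

fig, ax = plt.subplots(figsize=(6, 4))

# One box per school; by default whiskers use the 1.5 * IQR rule
# and points beyond them are drawn individually.
result = ax.boxplot(list(heights.values()))

# Boxes are placed at x = 1, 2, 3 by default, so label those positions.
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(names)
ax.set_ylabel("Height (cm)")
ax.set_title("Student heights by school (simulated data)")
fig.tight_layout()
```

Call `plt.show()` to display the figure. The dictionary returned by `ax.boxplot` holds the drawn artists (for example `result["medians"]`), which can be useful for further styling.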