# Scatter Plots

Scatter plots (also called scatter graphs or scatter charts) are a type of data visualization that displays the relationship between *two* numerical variables. Each point on the plot represents a single observation, with one variable determining the position on the horizontal axis (x-axis) and the other variable determining the position on the vertical axis (y-axis).

We use scatter plots to:

* **Identify Correlations:** Determine if there's a relationship between the two variables.  Do they tend to increase or decrease together (positive correlation), or does one increase as the other decreases (negative correlation)?  Is there no apparent relationship (no correlation)?
* **Detect Clusters:** Identify groups of data points that are similar to each other.  This can suggest subgroups within the data.
* **Find Outliers:** Spot data points that are far away from the general trend, which might indicate errors or unusual cases.
* **Visualize the Shape of the Relationship:**  Is the relationship linear, curved, or something else?
* **Communicate Findings:** Present results in a form that is easy to grasp visually.
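
The strength and direction of a correlation can also be put into a number. The sketch below computes the Pearson correlation coefficient, one common measure that this section does not name explicitly; it ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with values near 0 suggesting no *linear* relationship. The function name `pearson_r` and the sample data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (not taken from any real dataset)
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises as hours rise
score_down = [10, 8, 6, 4, 2]  # falls as hours rise

print(pearson_r(hours, score_up))    # positive correlation
print(pearson_r(hours, score_down))  # negative correlation
```

A coefficient near zero does not rule out a curved relationship, which is one reason to look at the scatter plot itself rather than rely on the number alone.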

## Key Components and Interpretation

* **X-axis (Horizontal):** Typically represents the *independent* variable (the predictor or explanatory variable).  This is the variable that you *might* think of as causing or influencing the other variable (though correlation does *not* prove causation!).
* **Y-axis (Vertical):** Typically represents the *dependent* variable (the response or outcome variable). This is the variable that you *might* think of as being affected by the other variable.
* **Points:** Each point represents a single observation, with its position determined by the values of the two variables for that observation.
* **Trend Line (Regression Line, Line of Best Fit, Least Squares Line):**  A line drawn through the scatter plot that best represents the overall trend in the data.  Under the common "least squares" method, it is the line that *minimizes* the sum of the squared vertical distances between the data points and the line.  The trend line is *not* guaranteed to pass through any of the actual data points.
* **Slope (m):** The rate at which the line rises or falls, that is, the change in y for a one-unit increase in x.
* **Y-Intercept (b):** The value of y at which the line crosses the vertical axis (where x = 0).
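
For reference, the least squares idea can be written out. The notation here is an assumption introduced for illustration: $n$ observations $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$. The slope $m$ and intercept $b$ are chosen to minimize the sum of squared vertical distances between the points and the line:

$$
\min_{m,\,b} \; \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2
$$

For a straight line this minimization has a closed-form solution:

$$
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
$$

The second formula implies that the least squares line always passes through the point of means $(\bar{x}, \bar{y})$, even though it need not pass through any observed point.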

## Equation of the Trend Line

The trend line is often expressed in the form `ŷ = mx + b`, where:

*   `ŷ` (y-hat) represents the *predicted* value of the dependent variable (y) for a given value of the independent variable (x).  The "hat" distinguishes the predicted value from the actual observed y values.
*   `m` is the *slope* of the line (the change in y for a one-unit change in x). A positive slope indicates a positive correlation; a negative slope indicates a negative correlation.
*   `b` is the *y-intercept* (the predicted value of y when x = 0).
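
As a minimal sketch of how `m` and `b` might be computed and then used for prediction, the code below applies the standard least squares formulas in plain Python. The function names `fit_line` and `predict` and the sample data are hypothetical, chosen so that the points lie exactly on a known line and the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least squares line y-hat = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope: change in y-hat per one-unit change in x
    b = mean_y - m * mean_x  # intercept: y-hat when x = 0
    return m, b

def predict(m, b, x):
    """Predicted value y-hat for a given x."""
    return m * x + b

# Hypothetical data lying exactly on y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

m, b = fit_line(xs, ys)
print(m, b)               # 2.0 1.0
print(predict(m, b, 10))  # 21.0
```

With real data the points will not fall exactly on the line, so `predict` returns an expected value rather than a guaranteed one.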


## Example: Shipping Data

Let's use an example of a small business tracking shipments. Our data includes:

*   **State:** The state to which the order was shipped.
*   **Region:**  The region of the country (West, Midwest, etc.).
*   **Item Count:** The number of items in the order.
*   **Order Total:** The total dollar amount of the order.

A scatter plot could be used to examine the relationship between "Item Count" (x-axis) and "Order Total" (y-axis).  Intuitively, we might expect a positive correlation: more items generally lead to a higher total cost.

| State    | Region     | Item Count | Order Total |
| :------- | :--------- | ---------: | ----------: |
| CA       | West       |          3 |       $2.08 |
| MI       | Midwest    |          6 |       $5.00 |
| NY       | Northeast  |          1 |       $9.65 |
| TX       | South      |          0 |       $0.00 |
| ...      | ...        |        ... |         ... |
| GA       | Southeast  |          9 |      $45.24 |

**Plotting the Data:**

Each order would be represented by a single point.  For example, the first order (3 items, $2.08) would be a point near the bottom-left of the plot.
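
A rough sketch of how the table rows might be turned into (x, y) points follows. It uses only the rows visible in the table; the elided rows are not reproduced. The helper name `parse_dollars` and the tuple layout are assumptions made for illustration.

```python
# Only the rows shown in the table; the elided rows ("...") are left out.
rows = [
    ("CA", "West", 3, "$2.08"),
    ("MI", "Midwest", 6, "$5.00"),
    ("NY", "Northeast", 1, "$9.65"),
    ("TX", "South", 0, "$0.00"),
    ("GA", "Southeast", 9, "$45.24"),
]

def parse_dollars(text):
    """Convert a string such as "$45.24" to a float."""
    return float(text.replace("$", "").replace(",", ""))

# x = Item Count, y = Order Total
points = [(item_count, parse_dollars(total)) for _, _, item_count, total in rows]

for x, y in points:
    print(f"x = {x} items, y = {y:.2f} dollars")
```

The resulting x and y values are what a plotting library would take as input; with matplotlib, for instance, they could be passed to `pyplot.scatter`.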

**Interpreting the Trend Line:**

The trend line would likely slope upwards from left to right, indicating a positive correlation.  However, it's *crucial* to understand that the trend line represents an *average* or *expected* relationship.  Individual data points will likely *deviate* from the line.
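
The deviations mentioned above are usually called residuals: the observed y minus the value the line predicts. The sketch below fits a least squares line to the five rows visible in the table and prints each residual. Because the elided rows are missing, the fitted line is only an illustration and is not the trend line for the full dataset; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least squares line y-hat = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

# Item Count and Order Total for the five visible table rows only
item_counts = [3, 6, 1, 0, 9]
order_totals = [2.08, 5.00, 9.65, 0.00, 45.24]

m, b = fit_line(item_counts, order_totals)

# Residual = observed value minus predicted value
residuals = [y - (m * x + b) for x, y in zip(item_counts, order_totals)]

for x, y, r in zip(item_counts, order_totals, residuals):
    side = "above" if r > 0 else "below"
    print(f"{x} items: observed {y:.2f}, {abs(r):.2f} {side} the line")
```

For a least squares line with an intercept, the residuals sum to zero (up to rounding), so points above the line are balanced by points below it.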
