### Step 1: Import Necessary Packages

Before we begin, we need to import the necessary Python libraries for plotting and performing the regression. We'll use:

- `matplotlib.pyplot` for creating the graph
- `numpy` for numerical operations


In [None]:
import matplotlib.pyplot as plt
import numpy as np

### Step 2: Anscombe's Quartet Dataset

This dataset is known as **Anscombe's Quartet**, created by statistician Francis Anscombe to illustrate the importance of visualizing data. Despite having nearly identical statistical properties (e.g., mean, variance, correlation, and linear regression), each dataset tells a very different story when graphed.

- **x**: The independent variable, common across three datasets.
- **y1, y2, y3**: Three different dependent variables associated with the same `x` values.
- **x4, y4**: A special case where most of the `x` values are identical, with one outlier.

#### Anscombe's Quartet:


In [None]:
# Anscombe's Quartet:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
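
The claim that the four datasets share nearly identical summary statistics can be checked directly. The sketch below is one possible way to do it; the helper name `summarize` and the `quartet` dictionary are illustrative and not part of the original notebook, and the data values are repeated so the cell runs on its own.

```python
import numpy as np

# Same values as the data cell above, repeated so this sketch is self-contained.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Hypothetical container pairing each y with its intended x.
quartet = {
    "x, y1": (x, y1),
    "x, y2": (x, y2),
    "x, y3": (x, y3),
    "x4, y4": (x4, y4),
}


def summarize(xs, ys):
    """Return basic summary statistics and the least-squares line for one dataset."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope, intercept = np.polyfit(xs, ys, 1)
    return {
        "mean_x": xs.mean(),
        "var_x": xs.var(ddof=1),  # sample variance
        "mean_y": ys.mean(),
        "var_y": ys.var(ddof=1),
        "corr": np.corrcoef(xs, ys)[0, 1],  # Pearson correlation
        "slope": slope,
        "intercept": intercept,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(float(value), 2) for key, value in stats.items()})
```

Rounded to two decimals, the printed rows should be close to one another even though the scatter plots look nothing alike.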


In [None]:
plt.scatter(x, y1)


In [None]:
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y1, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y1)
plt.plot(x, regression_line, color='red')
plt.xlabel('x')
plt.ylabel('y1')
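
To see what `np.polyfit(x, y1, 1)` computes, the slope and intercept can also be worked out by hand. For a straight-line fit, the least-squares slope is the sum of the products of the x and y deviations from their means, divided by the sum of the squared x deviations; the intercept then makes the line pass through the point of means. The sketch below compares the two approaches; the names `m_manual` and `b_manual` are illustrative.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Closed-form least-squares estimates for the line y = m*x + b
x_mean, y_mean = x.mean(), y1.mean()
m_manual = np.sum((x - x_mean) * (y1 - y_mean)) / np.sum((x - x_mean) ** 2)
b_manual = y_mean - m_manual * x_mean

# np.polyfit returns coefficients from the highest power down: [slope, intercept]
m_fit, b_fit = np.polyfit(x, y1, 1)

print(f"manual:  m = {m_manual:.4f}, b = {b_manual:.4f}")
print(f"polyfit: m = {m_fit:.4f}, b = {b_fit:.4f}")
```

Both lines of output should agree up to floating-point rounding.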


### Your Task:

- Perform the same linear regression process for the following datasets: y2, y3, and y4.
- Modify the code to calculate and plot the regression lines for each of these datasets.
- Use distinct colors for each plot and appropriately label the axes (y2, y3, etc.).
- Discuss any differences you observe when comparing the results across all datasets.
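
As an alternative to one cell per dataset, the four fits can be drawn side by side so they are easier to compare. This is only a sketch of one possible layout: the `panels` list, the `fits` dictionary, and the color choices are assumptions made for illustration, and the data values are repeated so the cell runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# (x label, y label, x values, y values, color) for each panel
panels = [
    ("x", "y1", x, y1, "tab:blue"),
    ("x", "y2", x, y2, "tab:green"),
    ("x", "y3", x, y3, "tab:orange"),
    ("x4", "y4", x4, y4, "tab:purple"),
]

# Shared axes put all four panels on the same scale.
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
fits = {}  # y label -> (slope, intercept)

for ax, (x_name, y_name, xs, ys, color) in zip(axes.flat, panels):
    m, b = np.polyfit(xs, ys, 1)
    fits[y_name] = (m, b)

    ax.scatter(xs, ys, color=color)
    # Two endpoints are enough to draw a straight line.
    line_x = np.array([min(xs), max(xs)], dtype=float)
    ax.plot(line_x, m * line_x + b, color=color)

    ax.set_xlabel(x_name)
    ax.set_ylabel(y_name)
    ax.set_title(f"{y_name} = {m:.2f} * {x_name} + {b:.2f}")

fig.tight_layout()
```

Putting the fitted equation in each panel title makes it easy to see that the four lines are almost the same while the point patterns differ.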


In [None]:
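# Code for x and y2

# Slope (m2) and intercept (b2) of the least-squares line
m2, b2 = np.polyfit(x, y2, 1)

regression_line_2 = m2 * np.array(x) + b2

# Plot the data points and regression line
plt.scatter(x, y2)
plt.plot(x, regression_line_2, color='green')
plt.xlabel('x')
plt.ylabel('y2')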

In [None]:
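# Code for x and y3

# Slope (m3) and intercept (b3) of the least-squares line
m3, b3 = np.polyfit(x, y3, 1)

regression_line_3 = m3 * np.array(x) + b3

# Plot the data points and regression line
plt.scatter(x, y3)
plt.plot(x, regression_line_3, color='orange')
plt.xlabel('x')
plt.ylabel('y3')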

In [None]:
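# Code for x and y4
# Note: in Anscombe's Quartet, y4 is paired with x4 (see the next cell),
# so this x, y4 plot is not one of the four original datasets.

# Slope (m4) and intercept (b4) of the least-squares line
m4, b4 = np.polyfit(x, y4, 1)

regression_line_4 = m4 * np.array(x) + b4

# Plot the data points and regression line
plt.scatter(x, y4)
plt.plot(x, regression_line_4, color='purple')
plt.xlabel('x')
plt.ylabel('y4')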

In [None]:
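# Code for x4 and y4

# Slope (m5) and intercept (b5) of the least-squares line
m5, b5 = np.polyfit(x4, y4, 1)

regression_line_5 = m5 * np.array(x4) + b5

# Plot the data points and regression line
plt.scatter(x4, y4)
plt.plot(x4, regression_line_5, color='brown')
plt.xlabel('x4')
plt.ylabel('y4')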

### Reflection Question:

After visualizing the linear regression for all four datasets in Anscombe's Quartet, reflect on the following:

#### What is your reflection?
- How do the datasets visually differ despite having similar summary statistics?
- How did the outlier in the `x4, y4` dataset affect the regression line compared to the other datasets?
- Why is it important to visualize data in addition to calculating summary statistics?

Please provide your insights and discuss the importance of data visualization in understanding relationships between variables.


## ANSWER
---
- The x, y2 scatter plot follows a curve like a projectile's path (an upside-down parabola), yet its regression line still has a positive slope. In the x, y3 plot, every point but one lies along a straight line; the exception is a single outlier. Its regression line is also positive, with nearly the same slope (about 0.5) as the x, y2 line, because the outlier pulls the fit upward. The x, y4 plot (y4 plotted against the shared x values rather than x4) also has an outlier, but its points trend downward, so the regression line has a negative slope. Lastly, the x4, y4 plot is unique: every point sits on the vertical line x = 8 except the outlier at x = 19. Because the outlier also has the largest y value, the regression line slopes upward.

- In the x4, y4 dataset, the outlier alone determines the slope: the regression line is positive and passes through the outlier point. Without it, every point would share x = 8 and no slope could be fitted at all.

- A plot lets statisticians and other readers see what the data actually look like. It reveals curvature, outliers, and trends that nearly identical summary statistics hide, and it is easier on the eye than a table of numbers.
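
The influence of the outliers described above can also be checked numerically. The sketch below, with illustrative variable names, refits x, y3 after dropping its single high point and evaluates the x4, y4 line at the outlier's x value; the data values are repeated so the cell runs on its own.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# x, y3: fit with all points, then without the largest y value (the outlier)
keep = y3 != y3.max()  # boolean mask that drops the outlier
m_all, b_all = np.polyfit(x, y3, 1)
m_trim, b_trim = np.polyfit(x[keep], y3[keep], 1)
print(f"y3 with outlier:    slope = {m_all:.3f}, intercept = {b_all:.3f}")
print(f"y3 without outlier: slope = {m_trim:.3f}, intercept = {b_trim:.3f}")

# x4, y4: compare the fitted value at the outlier's x with its actual y
m4, b4 = np.polyfit(x4, y4, 1)
outlier_index = np.argmax(x4)
predicted = m4 * x4[outlier_index] + b4
print(f"x4 outlier: actual y = {y4[outlier_index]:.2f}, fitted y = {predicted:.2f}")
```

Dropping one point from x, y3 noticeably flattens the line, and the x4, y4 fit reproduces the outlier's y value, which is consistent with the line passing through that point.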