## From Simple to Multiple Regression

While simple linear regression models the relationship between a response and a single explanatory variable, **multiple regression** incorporates two or more explanatory variables. Adding relevant explanatory variables often improves predictive accuracy and can reveal how the response differs across subgroups of the data.

A common and useful case of multiple regression involves one numeric and one categorical explanatory variable. This allows us to ask a more nuanced question: "How does the numeric variable `x` relate to the response `y`, and does this relationship differ between the categories in `c`?" The parallel slopes model provides a first answer to this by assuming the core relationship (the slope) is the same across all categories, but that each category has a different starting point (the intercept).

### The Parallel Slopes Model

#### The Core Assumption

The fundamental assumption of a parallel slopes model is that the **slope** of the relationship between the numeric explanatory variable and the response variable is **the same for all categories**. The model allows the **intercept** to be different for each category.

#### The Model Equations

If we have a numeric feature `x` and a categorical feature `c` with three levels (A, B, C), the parallel slopes model is not a single equation, but a set of equations:

  * For an observation in category A: $$\text{response} = \beta_A + \beta_{\text{slope}} \times x$$
  * For an observation in category B: $$\text{response} = \beta_B + \beta_{\text{slope}} \times x$$
  * For an observation in category C: $$\text{response} = \beta_C + \beta_{\text{slope}} \times x$$

Notice that the slope term, $\beta_{\text{slope}}$, is common to all equations, while each category gets its own unique intercept ($\beta_A, \beta_B, \beta_C$).

#### The `statsmodels` Formula

This model is specified in `statsmodels` with the formula: `response ~ numeric_feature + categorical_feature + 0`.

  * `numeric_feature`: This term directs the model to estimate the common slope, $\beta_{\text{slope}}$.
  * `categorical_feature + 0`: As seen previously, including a categorical variable and explicitly removing the global intercept (`+ 0`) directs the model to estimate a separate intercept for each category.
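To see what this formula produces, one can inspect the design matrix that `statsmodels` builds before fitting. The sketch below uses a tiny made-up data frame (the column names `x`, `c`, and `y` and all values are illustrative assumptions, not part of the dataset used later). With `+ 0`, each category should get its own 0/1 indicator column, and there is no separate constant column.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny illustrative dataset (values are made up for demonstration)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "c": ["A", "B", "C", "A", "B", "C"],
    "y": [2.1, 5.0, 1.2, 5.2, 8.1, 3.9],
})

# Build the model without fitting it, just to look at the design matrix
model = smf.ols("y ~ x + c + 0", data=toy)

# model.exog holds the numeric design matrix; model.exog_names its column labels
design = pd.DataFrame(model.exog, columns=model.exog_names)
print(design)
```

Each row has exactly one indicator set to 1, so the coefficient on that indicator acts as the intercept for that row's category, while the single `x` column carries the shared slope.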


### Implementation and Coefficient Interpretation

Let's fit a parallel slopes model and interpret its output.

```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Create a generic, reproducible dataset with a parallel slopes structure
np.random.seed(42)
category = np.random.choice(['Group A', 'Group B', 'Group C'], 150)
x_numeric = np.random.uniform(10, 50, 150)
# Define different intercepts for each group but a common slope of 3.5
intercepts = {'Group A': 50, 'Group B': 80, 'Group C': 20}
y_response = np.array([intercepts[cat] for cat in category]) + 3.5 * x_numeric + np.random.normal(0, 15, 150)
df = pd.DataFrame({'x_numeric': x_numeric, 'category': category, 'y_response': y_response})

# Fit the parallel slopes model 
mdl_parallel_slopes = smf.ols("y_response ~ x_numeric + category + 0", data=df).fit()

# Interpret the coefficients 
print(mdl_parallel_slopes.params)
```

The output will look similar to this:

```text
category[Group A]    52.53
category[Group B]    81.33
category[Group C]    19.71
x_numeric             3.47
dtype: float64
```

**Interpretation**:

  * **`x_numeric` (the slope)**: The coefficient is **3.47**. This means that for *any* category, a one-unit increase in `x_numeric` is associated with an average increase of **3.47** units in `y_response`.
  * **`category[Group A]` (intercept)**: The coefficient is **52.53**. This is the predicted `y_response` for an observation in `Group A` when `x_numeric` is zero.
  * **`category[Group B]` (intercept)**: The coefficient is **81.33**. This is the predicted `y_response` for an observation in `Group B` when `x_numeric` is zero.
  * **`category[Group C]` (intercept)**: The coefficient is **19.71**. This is the predicted `y_response` for an observation in `Group C` when `x_numeric` is zero.
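The interpretation above implies that a prediction is simply "category intercept plus slope times `x_numeric`". The following sketch checks this by hand against the model's own `predict` method. It regenerates a synthetic dataset with a similar recipe (the seed, the variable name `mdl`, and the choice of `x_numeric = 30` are assumptions made for illustration), so the exact numbers will differ from the output shown earlier.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: three groups with different intercepts and a common slope of 3.5
rng = np.random.default_rng(0)
category = rng.choice(["Group A", "Group B", "Group C"], 150)
x_numeric = rng.uniform(10, 50, 150)
intercepts = {"Group A": 50, "Group B": 80, "Group C": 20}
y_response = (
    np.array([intercepts[c] for c in category])
    + 3.5 * x_numeric
    + rng.normal(0, 15, 150)
)
df = pd.DataFrame(
    {"x_numeric": x_numeric, "category": category, "y_response": y_response}
)

mdl = smf.ols("y_response ~ x_numeric + category + 0", data=df).fit()

# Manual prediction for a Group B observation at x_numeric = 30
x_new = 30
manual = mdl.params["category[Group B]"] + mdl.params["x_numeric"] * x_new

# The same prediction obtained from the fitted model
new_data = pd.DataFrame({"x_numeric": [x_new], "category": ["Group B"]})
from_model = mdl.predict(new_data).iloc[0]

print(f"manual: {manual:.2f}, predict(): {from_model:.2f}")
```

The two values should agree up to floating-point rounding, which confirms that each `category[...]` coefficient plays the role of an intercept.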


### Visualizing the Parallel Slopes Model

The best way to understand the model is to visualize its results. We plot the raw data, colored by category, and then overlay the fitted regression line for each category. Because each line shares the same slope, they will be perfectly parallel.

```python
# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# 1. Plot the raw data, colored by category
# hue_order fixes the color assignment so the lines below match the points
group_order = ['Group A', 'Group B', 'Group C']
sns.scatterplot(x="x_numeric", y="y_response", hue="category",
                hue_order=group_order, data=df, ax=ax)

# 2. Extract the coefficients
coeffs = mdl_parallel_slopes.params
slope = coeffs['x_numeric']
intercept_A = coeffs['category[Group A]']
intercept_B = coeffs['category[Group B]']
intercept_C = coeffs['category[Group C]']

# 3. Add the parallel regression lines for each category
ax.axline(xy1=(0, intercept_A), slope=slope, color=sns.color_palette()[0])
ax.axline(xy1=(0, intercept_B), slope=slope, color=sns.color_palette()[1])
ax.axline(xy1=(0, intercept_C), slope=slope, color=sns.color_palette()[2])

ax.set_title("Parallel Slopes Regression Model")
ax.set_xlabel("Numeric Explanatory Variable")
ax.set_ylabel("Response Variable")
ax.grid(True, linestyle='--')
plt.show()
```

The resulting plot provides a clear and intuitive visualization of the model's structure: three distinct starting points (intercepts) for the three groups, but a single, common trend (slope) that describes the relationship between the numeric feature and the response for all of them.
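Since the model forces a single slope on every group, it can be worth checking informally whether that looks reasonable. One rough way, sketched below, is to fit a separate straight line within each category and compare the slopes. This is only an eyeball check under stated assumptions (simulated data, a hypothetical `per_group_slopes` dictionary, and `np.polyfit` as a stand-in for a per-group regression), not a formal test.

```python
import numpy as np
import pandas as pd

# Simulated data with a true common slope of 3.5 (values are synthetic)
rng = np.random.default_rng(1)
category = rng.choice(["Group A", "Group B", "Group C"], 150)
x_numeric = rng.uniform(10, 50, 150)
intercepts = {"Group A": 50, "Group B": 80, "Group C": 20}
y_response = (
    np.array([intercepts[c] for c in category])
    + 3.5 * x_numeric
    + rng.normal(0, 15, 150)
)
df = pd.DataFrame(
    {"x_numeric": x_numeric, "category": category, "y_response": y_response}
)

# Fit an independent degree-1 polynomial (a straight line) within each group
per_group_slopes = {}
for name, grp in df.groupby("category"):
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(grp["x_numeric"], grp["y_response"], deg=1)
    per_group_slopes[name] = slope

for name, slope in per_group_slopes.items():
    print(f"{name}: slope = {slope:.2f}")
```

If the per-group slopes are close to one another, a shared slope is a plausible simplification; if they differ markedly, the parallel slopes assumption deserves a closer look.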

```python
exam_points = []
exercise_points = []

# Input loop
while True:
    # Get input from user
    user_input = input("Exam points and exercises completed: ")

    # Check if input is empty (just Enter key)
    if user_input == "":
        break  # Exit the loop if empty input

    # Split the input into two numbers and convert to integers
    # The split() method breaks a string into a list where each word is a list item
    points, exercises = map(int, user_input.split())

    # Store the data
    exam_points.append(points)
    exercise_points.append(exercises)


def calculate_exercise_points(exercises):
    # Convert exercises (0-100) to points (0-10)
    return exercises // 10


# Calculate statistics
total_students = len(exam_points)
passing_students = 0
grades = [0] * 6  # List to store count of each grade (0-5)

total_points = []  # Store total points for average calculation

# Process each student's data
for i in range(total_students):
    # Calculate exercise points
    exercise_pts = calculate_exercise_points(exercise_points[i])

    # Calculate total points
    total = exam_points[i] + exercise_pts
    total_points.append(total)

    # Determine grade
    if exam_points[i] < 10:  # Exam cutoff rule
        grade = 0
    elif total < 15:
        grade = 0
    elif total < 18:
        grade = 1
    elif total < 21:
        grade = 2
    elif total < 24:
        grade = 3
    elif total < 28:
        grade = 4
    else:
        grade = 5

    grades[grade] += 1

    if grade > 0:
        passing_students += 1

print("Statistics:")
# Guard against division by zero when no students were entered
points_average = sum(total_points) / total_students if total_students else 0.0
pass_percentage = (passing_students / total_students) * 100 if total_students else 0.0

print(f"Points average: {points_average:.1f}")
print(f"Pass percentage: {pass_percentage:.1f}")

# Print grade distribution
print("Grade distribution:")
for grade in range(5, -1, -1):  # Loop from 5 to 0
    stars = "*" * grades[grade]  # Create string of stars based on grade count
    print(f"  {grade}: {stars}")
```