# 🚀 Introduction to Multivariate Linear Regression in Python

Welcome to your first step into the world of AI and Machine Learning! In this 2-hour session, we'll explore one of the most fundamental and powerful tools in data science: **Multivariate Linear Regression**.

Don't worry if that sounds complicated. We'll break it down into simple, easy-to-understand pieces. Let's get started!

## 📘 Learning Objectives

By the end of this session, you will be able to:

1.  **Understand** what Multivariate Linear Regression is and why it's so useful.
2.  **Identify** the key components of the regression equation.
3.  **Create** your own dataset for practice using Python.
4.  **Build, Train, and Test** a simple regression model with the `scikit-learn` library.
5.  **Interpret** the results to see how well your model performs.

## Topic 1: What is Multivariate Linear Regression?

Imagine you want to predict a student's final exam score. You could try to predict it based on just one thing, like *how many hours they studied*. That's called **Simple Linear Regression**.

But what if other factors matter too? Like their *class attendance* or their *score on a previous test*? 

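When you use **multiple input factors** (like `hours_studied`, `attendance`, `previous_score`) to predict a **single output** (like `final_score`), you're doing **Multivariate Linear Regression**! (Strictly speaking, statisticians call this *multiple* linear regression and reserve "multivariate" for models that predict several outputs at once. The two names are often used interchangeably, so we'll stick with "multivariate" in this session.)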

It's a cornerstone of machine learning because it helps us understand complex relationships in data and make predictions. The goal is to find the "best fit" that describes how all the inputs work together to influence the output.

## Topic 2: The Core Equation 📜

At its heart, linear regression uses a simple mathematical equation. It might look intimidating, but it's just like a recipe!

**Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε**

Let's break it down:

-   **Y**: The value we want to predict (e.g., the final exam score).
-   **X₁, X₂, ...**: Our input features or predictors (e.g., hours studied, attendance).
-   **β₁, β₂, ...**: The **coefficients**. These are the most important numbers! Each one tells us how much the output `Y` changes when its input `X` increases by one unit while the other inputs stay the same. A positive coefficient means `Y` goes up as that input increases, and a negative one means it goes down.
-   **β₀**: The **intercept** or "baseline". It's the predicted value of `Y` if all our inputs were zero.
-   **ε**: The **error term**. No model is perfect! This represents the part of `Y` that our model can't explain.
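To make the recipe concrete, here is a small sketch that plugs numbers into the equation by hand. The coefficient values and the student's inputs are made up purely for illustration, and the error term ε is left out because we can't know it in advance.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
demo_beta_0 = 30.0                        # intercept (baseline score)
demo_betas = np.array([5.0, -3.0, 0.5])   # one coefficient per input feature

# One made-up student: 4 hours studied, attendance of 2, previous score of 80
demo_x = np.array([4.0, 2.0, 80.0])

# Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ (the error term ε is ignored here)
predicted_score = demo_beta_0 + np.dot(demo_betas, demo_x)

print(f"Predicted score: {predicted_score:.1f}")
```

Working it out by hand gives 30 + (5 × 4) + (-3 × 2) + (0.5 × 80) = 84, which is what the code prints. Each input is simply multiplied by its coefficient and everything is added to the baseline.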

## Topic 3: Setting Up Our Python Environment

To build our model, we need some powerful tools (libraries). We'll import them first.

-   `numpy`: For efficient numerical operations.
-   `pandas`: For creating and managing our data in a table-like structure called a DataFrame.
-   `scikit-learn`: The master library for machine learning in Python! We'll use specific tools from it.

In [1]:
# This cell imports all the libraries we need for this session
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


### 🎯 Practice Task

Run the code cell above by clicking on it and pressing **Shift + Enter**. If you see the "Libraries imported successfully!" message, you're ready to go! 

**Quick Question:** What does `as np` do? It creates a shorter nickname or 'alias' for the `numpy` library, so we can type `np` instead of `numpy` every time. It's a common convention among Python programmers!

## Topic 4: Creating Our Data

In a real project, you would load data from a file (like a CSV). For this lesson, we'll generate our own *synthetic* (fake) data. This gives us full control and helps us understand the process better.
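For comparison, loading data from a CSV might look roughly like the sketch below. The column names and values are made up, and the CSV text is kept in memory so the example runs without a real file. With an actual file you would pass its path to `pd.read_csv` instead (the name `students.csv` in the comment is hypothetical).

```python
import io
import pandas as pd

# A tiny made-up CSV, held in memory so the example runs without a real file
csv_text = """Hours_Studied,Attendance,Previous_Score,Score
2.5,4.0,61.0,60.2
7.0,3.5,78.0,93.1
"""

# With a real file you would write something like pd.read_csv("students.csv")
students_from_csv = pd.read_csv(io.StringIO(csv_text))

print(students_from_csv.shape)   # (rows, columns)
print(students_from_csv.head())
```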

We'll create data for 100 students, with three features: `Hours_Studied`, `Attendance`, and `Previous_Score`.

In [2]:
# We use np.random.seed(42) to make sure we all get the same 'random' numbers.
# This makes our results reproducible!
np.random.seed(42)

# Number of students (samples)
n_samples = 100

# Generate our features (the 'X' variables)
hours_studied = np.random.rand(n_samples) * 10  # Random hours between 0 and 10
attendance = np.random.rand(n_samples) * 5     # Random attendance score between 0 and 5
previous_score = np.random.rand(n_samples) * 100 # Random previous score between 0 and 100

print("Feature data generated!")

Feature data generated!
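If you're curious why the seed matters, here is a minimal sketch of the idea. It uses two separate `np.random.RandomState` generators rather than `np.random.seed`, so running it won't disturb the random numbers used in the rest of the notebook. The variable names are just for this demo.

```python
import numpy as np

# Two independent generators started from the same seed
first_run = np.random.RandomState(42).rand(3)
second_run = np.random.RandomState(42).rand(3)

# A generator started from a different seed
other_run = np.random.RandomState(7).rand(3)

print("Same seed, same numbers:", np.array_equal(first_run, second_run))
print("Different seed, same numbers:", np.array_equal(first_run, other_run))
```

The same seed always reproduces the same sequence of 'random' numbers, which is why everyone running this notebook sees the same data.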


Now, let's create the final scores (the `Y` variable). We'll define a "true" relationship using our own coefficients, and then add some random noise to make it more realistic.

In [None]:
# These are the 'true' coefficients we are setting for our data generation
# Our model will later try to 'learn' these numbers from the data!
beta_0 = 30           # Intercept (baseline score)
beta_hours = 5        # For every hour studied, score increases by 5
beta_attendance = -3  # For every point in attendance, score DECREASES by 3 (maybe a mistake in data?)
beta_prev = 0.5       # 50% of the previous score is carried over

# Add some random noise because no real-world data is perfect
errors = np.random.randn(n_samples) * 5

# Calculate the final scores using our linear equation
scores = (beta_0 +
          beta_hours * hours_studied +
          beta_attendance * attendance +
          beta_prev * previous_score +
          errors)

print("Target variable 'scores' created!")

Finally, let's put it all together in a `pandas` DataFrame, which is like a spreadsheet for our data.

In [None]:
# Create a dictionary to hold our data
data_dict = {
    'Hours_Studied': hours_studied,
    'Attendance': attendance,
    'Previous_Score': previous_score,
    'Score': scores
}

# Create the DataFrame
data = pd.DataFrame(data_dict)

# Display the first 5 rows to see what it looks like
data.head()

### 🎯 Practice Task

In the code cell below, use the `data.describe()` command to see some summary statistics (like mean, min, max) of our dataset. 

Can you find the average `Previous_Score` from the output?

In [None]:
# Your code here! Type data.describe()


## Topic 5: Preparing Data for the Model

Before we can train our model, we need to do two things:

1.  **Separate Features (X) and Target (y):** We need to tell the model which columns are the inputs (`X`) and which column is the output we want to predict (`y`).
2.  **Split Data into Training and Testing Sets:** This is a crucial step in machine learning. 
    -   **Training Set:** The model *learns* from this data (like studying a textbook).
    -   **Testing Set:** We use this data to *evaluate* how well the model learned (like giving it a final exam). The model never sees this data during training!
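To picture what a train/test split does, here is a simplified sketch that shuffles row numbers and sets some aside by hand. This is only an illustration of the idea, not how `scikit-learn` implements it, and the names (`n_rows`, `train_idx`, `test_idx`) are made up for this demo.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator so the shuffle is repeatable

n_rows = 10
shuffled = rng.permutation(n_rows)   # row numbers 0..9 in random order

n_test = int(n_rows * 0.2)           # hold out 20% of the rows for testing
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print("Training rows:", len(train_idx))
print("Testing rows:", len(test_idx))
```

Every row lands in exactly one of the two groups, so the model is never tested on a row it studied from.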

In [None]:
# X contains our three input features
X = data[['Hours_Studied', 'Attendance', 'Previous_Score']]

# y contains the single output we want to predict
y = data['Score']

print("X (features) shape:", X.shape)
print("y (target) shape:", y.shape)

In [None]:
# Split the data: 80% for training, 20% for testing
# random_state=42 ensures we get the same split every time we run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets!")
print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])

### 🎯 Practice Task

We used `X_train.shape[0]` to find out how many students are in the training set. The `.shape` attribute of a DataFrame gives its dimensions (rows, columns). 

Can you print the shape of `y_train` in the cell below? What do you notice?

In [None]:
# Your code here! Print the shape of y_train

## Topic 6: Building and Training the Model 🤖

Now for the exciting part! With `scikit-learn`, building and training a model takes just two lines of code.

1.  **Create a model object:** We'll create an instance of the `LinearRegression` class.
2.  **Fit the model:** We'll use the `.fit()` method on our training data (`X_train`, `y_train`). This is where the model learns the best coefficients (the `β` values) from the data.

In [None]:
# 1. Create the model
model = LinearRegression()

# 2. Train (fit) the model using the training data
model.fit(X_train, y_train)

print("🎉 Model trained successfully!")

✅ **Well done!** You've just trained your first multivariate linear regression model. That one line, `model.fit()`, did all the complex Ordinary Least Squares (OLS) math for us behind the scenes!
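If you'd like a peek at that math, the sketch below solves a least-squares problem directly with NumPy and compares the answer with `LinearRegression`. It uses its own small made-up dataset and separate names (`X_demo`, `y_demo`, `demo_model`) so it won't overwrite anything in the notebook. Treat it as an illustration of the idea rather than a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small made-up dataset: 50 rows, 3 features, with a little noise
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.7, 3.0]) + rng.normal(scale=0.1, size=50)

# Least squares by hand: a column of ones lets the first weight act as the intercept
X_with_ones = np.column_stack([np.ones(len(X_demo)), X_demo])
weights, *_ = np.linalg.lstsq(X_with_ones, y_demo, rcond=None)

# The same problem solved by scikit-learn
demo_model = LinearRegression().fit(X_demo, y_demo)

print("Intercepts match:", np.allclose(weights[0], demo_model.intercept_))
print("Coefficients match:", np.allclose(weights[1:], demo_model.coef_))
```

Both approaches look for the coefficients that make the sum of squared prediction errors as small as possible.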

## Topic 7: Evaluating the Model

Training the model is easy, but how do we know if it's any good? We need to evaluate its performance.

1.  **Check the Coefficients:** We'll look at the coefficients (`model.coef_`) and the intercept (`model.intercept_`) that the model learned. We can compare them to the "true" coefficients we set when we created the data.
2.  **Make Predictions:** We'll use the trained model to make predictions on the `X_test` data (the data it has never seen before).
3.  **Check the R-squared (R²) Score:** This is a very common metric. It tells us the *proportion of variance* in the target that is predictable from the features. In simple terms, it's a score that tells you how well your model explains the data: 1 is a perfect fit, 0 means the model does no better than always predicting the average, and on test data it can even drop below 0 for a very poor model. **Closer to 1 is better!**
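To see where the R-squared number comes from, here is a small sketch that computes it by hand and checks it against `r2_score`. The actual and predicted values are made up for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted scores, for illustration only
actual = np.array([60.0, 75.0, 82.0, 90.0])
predicted = np.array([62.0, 73.0, 85.0, 88.0])

ss_res = np.sum((actual - predicted) ** 2)       # variation the model fails to explain
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the average

r2_by_hand = 1 - ss_res / ss_tot

print(f"By hand:  {r2_by_hand:.3f}")
print(f"r2_score: {r2_score(actual, predicted):.3f}")
```

The smaller the unexplained variation is compared with the total variation, the closer the score gets to 1.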

In [None]:
# Print the learned coefficients
print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")

# Print the learned intercept
print(f"\nIntercept: {model.intercept_:.2f}")

💡 **Fun Fact:** Look at the coefficients the model learned! How close are they to the `beta` values we set? 
- True `beta_hours`: 5.0  -> Model learned: ~4.88
- True `beta_attendance`: -3.0 -> Model learned: ~-3.07
- True `beta_prev`: 0.5    -> Model learned: ~0.50
- True `beta_0` (Intercept): 30.0 -> Model learned: ~30.60

They are very close! This tells us our model did a great job learning the underlying patterns from the data.

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)

print(f"R-squared Score: {r2:.2f}")
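We also imported `mean_squared_error` at the start but haven't used it yet. Here is a minimal sketch of how it could complement R-squared, using made-up values so the example runs on its own. In the notebook you could pass `y_test` and `y_pred` instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted scores, for illustration only
actual_scores = np.array([60.0, 75.0, 82.0, 90.0])
predicted_scores = np.array([62.0, 73.0, 85.0, 88.0])

mse = mean_squared_error(actual_scores, predicted_scores)   # average squared error
rmse = np.sqrt(mse)                                         # back in the same units as the score

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

The RMSE is handy because it reads as a typical prediction error in score points, while R-squared has no units.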

### 🎯 Practice Task

Our R-squared score is about **0.88**. In plain English, what does this mean? 

*(Hint: It means our model can explain about 88% of the variation in student scores based on our three features! That's pretty good!)*

🧪 **Experiment:** Go back to the cell where we defined the `beta` coefficients. Change `beta_attendance` from `-3` to `3`. Then, re-run all the cells from that point downwards by clicking `Cell -> Run All Below`. What happens to the learned coefficient for Attendance and the final R-squared score?

---

## 🎓 Final Revision Assignment (For Home Practice)

Congratulations on completing the session! You've learned the entire workflow of building a basic machine learning model. Now, let's put your new skills to the test with a new problem.

**Scenario:** You are a real estate agent trying to predict house prices. You have data on `Square_Footage`, `Num_Bedrooms`, and `Age_of_House`.

Follow the tasks below to build a house price predictor!

**Task 1: Setup**

The necessary libraries (`numpy`, `pandas`, `scikit-learn`) should already be imported from the beginning of the notebook. If not, import them in the cell below.

**Task 2: Create Data**

Run the cell below to generate a new synthetic dataset for house prices.

In [None]:
np.random.seed(123) # New seed for a new problem!
num_houses = 200

square_footage = np.random.randint(800, 3500, num_houses)
num_bedrooms = np.random.randint(1, 6, num_houses)
age_of_house = np.random.randint(0, 50, num_houses)

true_price = 50000 + (square_footage * 150) + (num_bedrooms * 25000) - (age_of_house * 1000) + np.random.randn(num_houses) * 20000

print("🏡 House data generated!")

**Task 3: Create and Explore DataFrame**

Put the new housing data into a pandas DataFrame called `houses_df`. Then, display the first 5 rows using the `.head()` method.

In [None]:
# Your code here
houses_df = pd.DataFrame({
    'Square_Footage': square_footage,
    'Num_Bedrooms': num_bedrooms,
    'Age_of_House': age_of_house,
    'Price': true_price
})

# Display the first 5 rows


**Task 4: Define Features (X) and Target (y)**

Create your feature matrix `X_houses` and your target vector `y_houses` from the `houses_df` DataFrame.

In [None]:
# Your code here
# X_houses = ...
# y_houses = ...


**Task 5: Split the Data**

Split the data into training and testing sets. This time, use a `test_size` of `0.3` (30% for testing).

In [None]:
# Your code here
# X_train_h, X_test_h, y_train_h, y_test_h = ...


**Task 6: Build and Train a New Model**

Create a new `LinearRegression` model and `fit` it to your new housing training data.

In [None]:
# Your code here
# house_model = ...
# house_model.fit(...)


**Task 7: Check the Coefficients**

Print the intercept and the coefficients for the house price model. 

-   Which feature has the biggest positive impact on price?
-   Which feature has a negative impact on price?

In [None]:
# Your code here


**Task 8: Make Predictions and Evaluate**

Finally, use your trained `house_model` to make predictions on the test set and print the R-squared score. How well did your new model perform?

In [None]:
# Your code here
