<a href="https://colab.research.google.com/github/palekars/Data-Scientist-Course-2025/blob/main/Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
sns.set()

In [3]:
df = pd.read_csv("/content/1.02.+Multiple+linear+regression.csv")
display(df.head())

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |


In [4]:
df.describe()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


In [5]:
# Define the dependent and independent variables
y = df['GPA']
x = df[['SAT', 'Rand 1,2,3']]

# Add a constant to the independent variables
x = sm.add_constant(x)

# Create and fit the model
model = sm.OLS(y, x).fit()

# Print the summary of the model
display(model.summary())

| | | | |
|---|---|---|---|
| Dep. Variable: | GPA | R-squared: | 0.407 |
| Model: | OLS | Adj. R-squared: | 0.392 |
| Method: | Least Squares | F-statistic: | 27.76 |
| Date: | Sat, 30 Aug 2025 | Prob (F-statistic): | 6.58e-10 |
| Time: | 07:41:12 | Log-Likelihood: | 12.72 |
| No. Observations: | 84 | AIC: | -19.44 |
| Df Residuals: | 81 | BIC: | -12.15 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2960 | 0.417 | 0.710 | 0.480 | -0.533 | 1.125 |
| SAT | 0.0017 | 0.000 | 7.432 | 0.000 | 0.001 | 0.002 |
| Rand 1,2,3 | -0.0083 | 0.027 | -0.304 | 0.762 | -0.062 | 0.046 |

| | | | |
|---|---|---|---|
| Omnibus: | 12.992 | Durbin-Watson: | 0.948 |
| Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 16.364 |
| Skew: | -0.731 | Prob(JB): | 0.00028 |
| Kurtosis: | 4.594 | Cond. No. | 33300.0 |


### Explanation of R-squared and Adjusted R-squared

In the context of a regression model, R-squared and Adjusted R-squared are metrics used to evaluate how well the independent variables explain the variation in the dependent variable.

**R-squared (R²)**

*   **Definition**: R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by the independent variables in a regression model.
*   **Interpretation**: It ranges from 0 to 1 (or 0% to 100%). A higher R-squared value indicates that more of the variation in the dependent variable can be explained by the independent variables. For example, an R-squared of 0.407 means that approximately 40.7% of the variation in 'GPA' can be explained by 'SAT' and 'Rand 1,2,3' in this model.
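*   **Limitation**: R-squared can be misleading when comparing models with different numbers of independent variables. Adding an independent variable never decreases R-squared in an OLS model with an intercept, and it almost always increases it at least slightly, even when the variable is unrelated to the dependent variable. Here, for example, 'Rand 1,2,3' has a p-value of 0.762 yet still contributes to the reported R-squared.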

**Adjusted R-squared**

*   **Definition**: Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. It penalizes the addition of unnecessary independent variables.
*   **Interpretation**: It is never higher than R-squared. When a new term is added, Adjusted R-squared increases only if the term improves the fit by more than would be expected by chance (equivalently, if the absolute value of its t-statistic exceeds 1); otherwise it decreases. This makes it a better measure for comparing models with different numbers of independent variables. In this case, the Adjusted R-squared of 0.392 is slightly lower than the R-squared of 0.407, reflecting the penalty for using two independent variables.
*   **Usefulness**: Adjusted R-squared is particularly useful when building multiple regression models where you are considering several potential independent variables. It helps in selecting a model that provides the best fit without being overly complex.

In summary, while R-squared tells you how much of the variation in the dependent variable is explained by your model, Adjusted R-squared provides a more accurate picture, especially when comparing models with different numbers of predictors, by accounting for the number of independent variables used.
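The relationship between the two metrics can be checked with a few lines of arithmetic. The sketch below applies the standard formula, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), to the values printed in the model summary; the variable names are chosen here for illustration.

```python
# Values read from the model summary above
r_squared = 0.407      # R-squared
n_obs = 84             # No. Observations
n_predictors = 2       # Df Model: SAT and "Rand 1,2,3" (the constant is not counted)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(f"Adjusted R-squared: {adj_r_squared:.3f}")
```

The result rounds to 0.392, matching the `Adj. R-squared` entry in the summary.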

### Assumptions of Linear Regression

Linear regression models make several assumptions about the data to ensure the validity and reliability of the results. It's important to check these assumptions before interpreting the model.

Here are the key assumptions:

1.  **Linearity**: The relationship between the independent variables and the dependent variable is linear. This means the change in the dependent variable is proportional to the change in the independent variables. You can check this assumption by plotting the dependent variable against each independent variable.
2.  **Independence of Errors (No Autocorrelation)**: The errors (residuals) of the model are independent of each other. This means that the error for one observation does not influence the error for another observation. This assumption is particularly important in time series data. The Durbin-Watson statistic in the model summary can help detect autocorrelation.
3.  **Homoscedasticity (Constant Variance of Errors)**: The variance of the errors is constant across all levels of the independent variables. In other words, the spread of the residuals is roughly the same throughout the range of the predicted values. You can check this assumption by plotting the residuals against the predicted values. Heteroscedasticity (non-constant variance) can affect the standard errors of the coefficients.
4.  **Normality of Errors**: The errors (residuals) of the model are normally distributed. This assumption is important for hypothesis testing and confidence intervals. You can check this assumption by looking at a histogram or a Q-Q plot of the residuals.
5.  **No Multicollinearity**: The independent variables are not highly correlated with each other. High multicollinearity can make it difficult to determine the individual effect of each independent variable on the dependent variable and can lead to unstable coefficient estimates. You can check for multicollinearity by examining the correlation matrix of your independent variables or calculating Variance Inflation Factors (VIF).

Checking these assumptions helps ensure that your linear regression model is appropriate for your data and that the results you obtain are valid.

### Linearity Assumption Explained in Detail

The linearity assumption is one of the fundamental assumptions of linear regression. It states that there must be a **linear relationship** between the independent variables and the dependent variable.

**What does a Linear Relationship Mean?**

A linear relationship means that the change in the dependent variable is directly proportional to the change in the independent variable(s). If you were to plot the dependent variable against an independent variable, the points should roughly form a straight line (or a hyperplane in the case of multiple independent variables).

Mathematically, in a simple linear regression with one independent variable ($X$) and a dependent variable ($Y$), the model is represented as:

$Y = \beta_0 + \beta_1X + \epsilon$

Where:
*   $Y$ is the dependent variable.
*   $X$ is the independent variable.
*   $\beta_0$ is the y-intercept (the value of Y when X is 0).
*   $\beta_1$ is the slope of the line (the change in Y for a one-unit change in X).
*   $\epsilon$ is the error term (the part of Y that the model cannot explain).

In a multiple linear regression with multiple independent variables ($X_1, X_2, ..., X_k$), the model is:

$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \epsilon$

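The linearity assumption implies that the $\beta$ coefficients are constant and do not depend on the values of the independent variables. Strictly, the model must be linear in the coefficients, not necessarily in the raw variables: a term such as $X^2$ or $\log X$ can enter as a predictor and the model is still a linear regression.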
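To make the multiple regression equation concrete, here is a minimal sketch that simulates data from it and recovers the coefficients by least squares. The data are synthetic and the "true" coefficients are arbitrary values chosen for this illustration; NumPy's `lstsq` stands in for the `sm.OLS` call used earlier in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical "true" coefficients, chosen only for this illustration
beta_true = np.array([1.0, 2.0, -0.5])   # beta_0, beta_1, beta_2

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.3, size=n)      # error term

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])
y = X @ beta_true + eps

# Least-squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

With this much data and this little noise, the estimates should land close to the values in `beta_true`.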

**Why is Linearity Important?**

If the relationship between the variables is not linear, fitting a linear model will not accurately capture the underlying pattern in the data. This can lead to:

*   **Biased Coefficient Estimates**: The estimated $\beta$ coefficients may not accurately reflect the true relationship between the variables.
*   **Inaccurate Predictions**: The model's predictions for the dependent variable will be unreliable, especially for values of the independent variables outside the range of the observed data.
*   **Misleading Interpretations**: The interpretations of the model's results (e.g., the effect of an independent variable on the dependent variable) can be incorrect.

**How to Check the Linearity Assumption?**

You can check the linearity assumption using several methods:

1.  **Scatter Plots**: Plot the dependent variable against each independent variable individually. Look for a roughly linear pattern. If the relationship appears curved or non-linear, the assumption is violated.
2.  **Residual Plots**: Plot the residuals (the differences between the observed and predicted values of the dependent variable) against the predicted values or the independent variables. If the linearity assumption holds, the residuals should be randomly scattered around zero with no discernible pattern. A curved pattern in the residual plot suggests non-linearity.
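3.  **Component-Plus-Residual Plots (Partial Residual Plots)**: For one independent variable at a time, these plots add that variable's fitted contribution back to the residuals and plot the result against the variable, so they show its relationship with the dependent variable after accounting for the other independent variables in the model. A curved pattern suggests non-linearity in that variable. (Partial regression plots, also called added-variable plots, are a related but different diagnostic that is better suited to spotting influential points.)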
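The residual-plot idea can also be expressed numerically. The sketch below, on synthetic data, fits a straight line to one truly linear and one curved relationship, then measures how strongly the residuals track the squared, centered predictor. The helper names and the "curvature signal" are illustrative devices introduced here, not a standard test.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 10, size=n)

def straight_line_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return the residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def curvature_signal(x, resid):
    """Correlation between the residuals and the squared, centered predictor.
    Near 0 when a straight line is adequate; far from 0 when residuals bend."""
    return np.corrcoef((x - x.mean()) ** 2, resid)[0, 1]

y_linear = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)       # linear in x
y_curved = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=n)    # quadratic in x

signal_linear = curvature_signal(x, straight_line_residuals(x, y_linear))
signal_curved = curvature_signal(x, straight_line_residuals(x, y_curved))
print(f"linear data: {signal_linear:+.2f}, curved data: {signal_curved:+.2f}")
```

A value near zero is what a patternless residual plot looks like in numbers; a value near one corresponds to the U-shaped residual plot that signals a missed curve.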

**What to Do if the Linearity Assumption is Violated?**

If you find that the linearity assumption is violated, you have a few options:

1.  **Transform the Variables**: You can try transforming one or both variables to create a more linear relationship. Common transformations include taking the logarithm, square root, or reciprocal of the variables.
2.  **Add Polynomial Terms**: If the relationship is curved, you can add polynomial terms (e.g., $X^2$, $X^3$) to the model to capture the non-linear pattern.
3.  **Use Non-Linear Regression**: If the relationship is inherently non-linear and transformations or polynomial terms don't work, you might need to use a non-linear regression model that is more appropriate for the data.
4.  **Consider Other Models**: Linear regression might not be the right model for your data. You might need to consider other types of models that can handle non-linear relationships, such as generalized additive models (GAMs) or tree-based models.
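As an illustration of the first remedy, the sketch below generates a hypothetical exponential relationship and compares a straight-line fit before and after log-transforming the dependent variable. The data and the `r_squared` helper are assumptions made for this example. Note that R-squared values computed on different scales of the dependent variable are not strictly comparable; the comparison is only meant to show that the transformed relationship is much closer to a straight line.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 5, size=n)
# Hypothetical exponential relationship with multiplicative noise
y = 3.0 * np.exp(0.8 * x) * np.exp(rng.normal(scale=0.2, size=n))

def r_squared(x, y):
    """R^2 of the straight-line least-squares fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

r2_raw = r_squared(x, y)           # straight line through the curved data
r2_log = r_squared(x, np.log(y))   # straight line after log-transforming y

print(f"raw y: {r2_raw:.3f}, log(y): {r2_log:.3f}")
```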

In summary, the linearity assumption is crucial for the validity of linear regression results. It's essential to check this assumption and address any violations to ensure that your model accurately represents the relationship between the variables and provides reliable insights.

### Normality of Errors Assumption Explained in Detail

The normality assumption in linear regression states that the **errors (residuals)** of the model are **normally distributed**.

**What does Normally Distributed Errors Mean?**

Normally distributed errors mean that if you were to plot the frequency distribution of the residuals, it would approximate a bell-shaped curve (a normal distribution). In a normal distribution, the errors are symmetrically distributed around a mean of zero, with most errors close to zero and fewer errors further away.

Mathematically, this assumption can be expressed as $\epsilon \sim N(0, \sigma^2)$, where $\epsilon$ represents the error term, $N$ denotes a normal distribution, 0 is the mean of the errors, and $\sigma^2$ is the variance of the errors.

**Why is Normality of Errors Important?**

The normality of errors assumption is important for **hypothesis testing** and **confidence intervals** in linear regression.

*   **Hypothesis Testing**: The p-values associated with the coefficients in the regression summary (which help determine the statistical significance of the independent variables) are calculated based on the assumption that the errors are normally distributed. If this assumption is violated, the p-values may not be accurate, leading to incorrect conclusions about the significance of the predictors.
*   **Confidence Intervals**: The confidence intervals for the coefficients are also derived under the assumption of normally distributed errors. If the assumption is not met, the confidence intervals may be wider or narrower than they should be, affecting the precision of the coefficient estimates.

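It's worth noting that normality is **not required** for the OLS coefficient estimates (the $\beta$ values) to be unbiased. In large samples, the Central Limit Theorem also makes the sampling distribution of the estimates approximately normal, so p-values and confidence intervals remain approximately valid even when the errors are not normal. The assumption matters most for statistical inference in small samples.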

**How to Check the Normality of Errors Assumption?**

You can check the normality of errors assumption using several methods:

1.  **Histogram of Residuals**: Plot a histogram of the residuals. Visually inspect if the distribution is roughly symmetric and bell-shaped.
2.  **Q-Q Plot (Quantile-Quantile Plot)**: A Q-Q plot compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. If the residuals are normally distributed, the points on the Q-Q plot should fall approximately along a straight line.
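3.  **Statistical Tests**: Formal tests of normality include the Shapiro-Wilk, Anderson-Darling, and Kolmogorov-Smirnov tests. The model summary above already reports two others, Omnibus and Jarque-Bera; their p-values (0.002 and 0.00028) are well below 0.05, so normality of the residuals is rejected here, consistent with the negative skew (-0.731) and the kurtosis of 4.594 (versus 3 for a normal distribution). Keep in mind that these tests are sensitive to sample size: in large samples they flag departures too small to matter, and in small samples they may miss real ones, so pair them with a visual check.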
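The Jarque-Bera entry in the model summary can be reproduced, approximately, from the reported skew and kurtosis. The sketch below uses the standard formula JB = n/6 · (S² + (K − 3)²/4) and the fact that a chi-square distribution with 2 degrees of freedom has upper-tail probability exp(−x/2). Because the inputs are the rounded values from the summary, a small discrepancy from the printed statistic is expected.

```python
import math

# Values read from the model summary above
n_obs = 84
skew = -0.731
kurtosis = 4.594   # a normal distribution has kurtosis 3

# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)
jb = n_obs / 6 * (skew**2 + (kurtosis - 3) ** 2 / 4)

# Under normality, JB is approximately chi-square with 2 degrees of freedom,
# whose upper-tail probability has the closed form exp(-JB / 2).
p_value = math.exp(-jb / 2)

print(f"JB = {jb:.2f}, p-value = {p_value:.5f}")
```

The result is within rounding of the reported `Jarque-Bera (JB)` of 16.364 and `Prob(JB)` of 0.00028.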

**What to Do if the Normality of Errors Assumption is Violated?**

If you find that the normality of errors assumption is violated, you have a few options:

1.  **Transform the Dependent Variable**: Transforming the dependent variable (e.g., using a logarithmic or square root transformation) can sometimes help normalize the distribution of the errors.
2.  **Identify and Address Outliers**: Outliers in the data can significantly affect the normality of residuals. Identifying and appropriately handling outliers (e.g., removing them if they are data errors, or using robust regression methods) can improve normality.
3.  **Use Non-Parametric Methods**: If the errors are severely non-normal and transformations don't work, you might consider using non-parametric regression methods that do not rely on the assumption of normality.
4.  **Consider Other Models**: If the underlying data generating process is not linear, a different type of model might be more appropriate, which could also address the non-normality of errors.

In summary, while the normality of errors is primarily important for valid statistical inference (p-values and confidence intervals), checking and addressing violations of this assumption is good practice to ensure the reliability of your linear regression results.

### Homoscedasticity (Constant Variance of Errors) Assumption Explained in Detail

The homoscedasticity assumption in linear regression states that the **variance of the errors (residuals)** is **constant** across all levels of the independent variables. The opposite of homoscedasticity is **heteroscedasticity**, where the variance of the errors is not constant.

**What does Constant Variance of Errors Mean?**

Constant variance of errors means that the spread or dispersion of the residuals is roughly the same for all values of the independent variables and for all predicted values of the dependent variable. If you were to plot the residuals against the predicted values or an independent variable, the points should be scattered randomly around zero with a consistent width or band.

Mathematically, this assumption can be expressed as $Var(\epsilon_i | X_i) = \sigma^2$ for all observations $i$, where $Var(\epsilon_i | X_i)$ is the variance of the error term for observation $i$ given the independent variables $X_i$, and $\sigma^2$ is a constant variance.

**Why is Homoscedasticity Important?**

Homoscedasticity is important because it affects the **efficiency and validity of the coefficient estimates and standard errors**.

*   **Efficiency of Estimates**: When homoscedasticity holds, the Ordinary Least Squares (OLS) method provides the most efficient estimates of the regression coefficients (they have the lowest variance among all linear unbiased estimators).
*   **Validity of Standard Errors and Statistical Tests**: The standard errors of the coefficients, which are used to calculate t-statistics and p-values, are based on the assumption of constant error variance. If heteroscedasticity is present, the standard errors will be biased (either too large or too small), leading to incorrect t-statistics and p-values. This can result in incorrect conclusions about the statistical significance of the independent variables.
*   **Confidence Intervals**: Similar to hypothesis testing, the confidence intervals for the coefficients will also be incorrect if homoscedasticity is violated.

In simple terms, heteroscedasticity means that the model's predictions are less precise for some ranges of the independent variables than for others.

**How to Check the Homoscedasticity Assumption?**

You can check the homoscedasticity assumption using several methods:

1.  **Residual Plots**: The most common way to check for homoscedasticity is to plot the residuals against the predicted values of the dependent variable (or against each independent variable).
    *   **Homoscedasticity**: The residuals should be randomly scattered around zero with no discernible pattern and a roughly constant width.
    *   **Heteroscedasticity**: Look for patterns in the residual plot, such as a fanning-out shape (where the spread of residuals increases with the predicted values) or a fanning-in shape (where the spread decreases).
2.  **Statistical Tests**: Several statistical tests can formally test for heteroscedasticity, such as the Breusch-Pagan test, the White test, or the Goldfeld-Quandt test. These tests provide a p-value to help determine if there is significant evidence of heteroscedasticity.
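To show what such a test does under the hood, here is a hand-rolled sketch of the studentized (Koenker) form of the Breusch-Pagan test on synthetic data: regress the squared residuals on the predictors and use n · R² as the test statistic. The helper functions are written for this example, and the chi-square tail formula used is specific to one degree of freedom (one predictor). In practice, statsmodels provides `het_breuschpagan` in `statsmodels.stats.diagnostic`.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])

def breusch_pagan_lm(X, y):
    """Studentized Breusch-Pagan LM statistic: n * R^2 from regressing
    the squared OLS residuals on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_sq = (y - X @ beta) ** 2
    gamma, *_ = np.linalg.lstsq(X, resid_sq, rcond=None)
    aux_resid = resid_sq - X @ gamma
    r2_aux = 1 - aux_resid @ aux_resid / np.sum((resid_sq - resid_sq.mean()) ** 2)
    return len(y) * max(r2_aux, 0.0)

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

y_const = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)       # constant spread
y_fan = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)     # spread grows with x

p_const = chi2_sf_1df(breusch_pagan_lm(X, y_const))
p_fan = chi2_sf_1df(breusch_pagan_lm(X, y_fan))
print(f"constant variance: p = {p_const:.3g}, fanning out: p = {p_fan:.3g}")
```

A very small p-value for the fanning-out data is the numerical counterpart of the funnel shape described for the residual plot.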

**What to Do if the Homoscedasticity Assumption is Violated (Heteroscedasticity)?**

If you find that the homoscedasticity assumption is violated, you have a few options:

1.  **Transform the Dependent Variable**: Transforming the dependent variable (e.g., using a logarithmic or square root transformation) can sometimes stabilize the variance of the errors.
2.  **Use Weighted Least Squares (WLS)**: Weighted Least Squares is a regression method that can be used when heteroscedasticity is present. It assigns different weights to observations based on their variance, giving less weight to observations with higher variance.
3.  **Use Robust Standard Errors**: Some statistical software packages can calculate robust standard errors (also known as heteroscedasticity-consistent standard errors). These standard errors are valid even in the presence of heteroscedasticity and allow for correct inference about the coefficients. This is often a simpler approach than WLS.
4.  **Consider Other Models**: If the heteroscedasticity is severe and cannot be addressed by transformations or robust methods, it might indicate that a different type of model is more appropriate for the data.
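The robust standard error option can be sketched directly from its formula. The example below computes White's HC0 "sandwich" estimator by hand on synthetic data whose error spread grows with the predictor, and compares it with the classical standard errors. This is an illustration of the idea rather than a replacement for library code; in statsmodels, robust standard errors can be requested with, for example, `sm.OLS(y, X).fit(cov_type="HC3")`.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
# Hypothetical data whose error spread grows strongly with x
y = 1.0 + 2.0 * x + rng.normal(scale=0.1 * x**2, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical covariance: sigma^2 * (X'X)^-1, valid only under constant variance
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) sandwich: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", np.round(se_classical, 4))
print("robust SE:   ", np.round(se_robust, 4))
```

The coefficient estimates are identical in both cases; only the standard errors, and therefore the t-statistics and p-values, change.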

In summary, homoscedasticity is important for obtaining efficient coefficient estimates and valid statistical inferences. Checking residual plots and using statistical tests can help identify heteroscedasticity, and various methods are available to address this issue if it is present.

### Independence of Errors (No Autocorrelation) Assumption Explained in Detail

The independence of errors assumption in linear regression states that the **errors (residuals)** of the model are **independent of each other**. This means that the error for one observation does not influence the error for another observation. **Autocorrelation** (also known as serial correlation) occurs when the errors are not independent, and there is a pattern in the residuals over time or space.

**What does Independence of Errors Mean?**

Independence of errors means that knowing the value of the error for one data point does not give you any information about the value of the error for another data point. The errors are random and not systematically related to each other.

Mathematically, this assumption means that the covariance between any two error terms ($\epsilon_i$ and $\epsilon_j$ for $i \neq j$) is zero: $Cov(\epsilon_i, \epsilon_j) = 0$.

**Why is Independence of Errors Important?**

The independence of errors assumption is particularly important in **time series data** or data where the order of observations matters. Violations of this assumption (autocorrelation) can lead to several problems:

*   **Biased Standard Errors**: Autocorrelation causes the standard errors of the regression coefficients to be biased. Positive autocorrelation (where consecutive errors are positively correlated) leads to underestimated standard errors, making the coefficients appear more statistically significant than they actually are (inflated t-statistics and smaller p-values). Negative autocorrelation (where consecutive errors are negatively correlated) leads to overestimated standard errors, making the coefficients appear less significant.
*   **Inefficient Estimates**: While the coefficient estimates themselves remain unbiased in the presence of autocorrelation, they are no longer the most efficient estimates (they do not have the minimum variance).
*   **Invalid Statistical Inference**: Due to the biased standard errors, hypothesis tests and confidence intervals for the coefficients become unreliable.

In essence, if autocorrelation is present, the model is not fully capturing the systematic pattern in the data, and some of that pattern is left in the residuals.

**How to Check the Independence of Errors Assumption?**

You can check the independence of errors assumption using several methods:

1.  **Residual Plots (against time or order)**: If your data has a time component or a natural order, plot the residuals against time or the order of observations. Look for patterns in the residuals, such as a clear trend or a cyclical pattern. A random scatter of residuals around zero suggests independence.
2.  **Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) Plots**: These plots are commonly used in time series analysis. The ACF plot shows the correlation between a time series and lagged versions of itself. Significant spikes at different lags indicate autocorrelation.
3.  **Durbin-Watson Statistic**: This is a common statistic reported in regression summaries (like the one you generated). It tests for the presence of autocorrelation in the residuals, usually first-order autocorrelation (correlation between consecutive errors).
    *   The Durbin-Watson statistic ranges from 0 to 4.
    *   A value around 2 suggests no autocorrelation.
    *   Values significantly below 2 suggest positive autocorrelation.
    *   Values significantly above 2 suggest negative autocorrelation.
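    *   The interpretation of the Durbin-Watson statistic depends on the sample size and the number of independent variables, and critical values can be found in statistical tables. In your model summary, the Durbin-Watson statistic is 0.948, which is well below 2 and suggests positive correlation between the residuals of consecutive rows. Because the statistic is computed in row order, this is only evidence of autocorrelation if the order of the rows is meaningful (for example, time); for cross-sectional data such as student records, a low value may simply reflect how the file is sorted.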
4.  **Breusch-Godfrey Test**: This is a more general test for autocorrelation that can detect higher-order autocorrelation as well.
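The Durbin-Watson statistic is simple enough to compute by hand, which also makes its dependence on row order visible. The sketch below uses synthetic, independent residuals and evaluates the statistic twice: once in random order and once after sorting the same values. It also applies the large-sample approximation DW ≈ 2(1 − ρ), where ρ is the lag-1 correlation of the residuals, to the value from the summary. statsmodels offers the same statistic as `durbin_watson` in `statsmodels.stats.stattools`.

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(5)
resid = rng.normal(size=500)                 # independent residuals (synthetic)

dw_shuffled = durbin_watson(resid)           # rows in random order
dw_sorted = durbin_watson(np.sort(resid))    # same values, rows sorted

# DW is roughly 2 * (1 - rho), so the reported 0.948 implies a lag-1
# correlation of about:
rho_implied = 1 - 0.948 / 2

print(f"random order: {dw_shuffled:.2f}, sorted: {dw_sorted:.4f}")
print(f"implied lag-1 correlation for DW = 0.948: {rho_implied:.3f}")
```

The same residuals produce a value near 2 in random order and a value near 0 once sorted, which is why the statistic should only be read as evidence of autocorrelation when the row order carries meaning.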

**What to Do if the Independence of Errors Assumption is Violated (Autocorrelation)?**

If you find that the independence of errors assumption is violated, you have a few options:

1.  **Include Lagged Variables**: In time series data, you can include lagged values of the dependent variable or the independent variables as predictors in the model to capture the autocorrelation.
2.  **Use Time Series Models**: If the autocorrelation is significant and the data is a time series, more specialized time series models (like ARIMA models) might be more appropriate.
3.  **Transform the Variables**: Sometimes, transforming the dependent variable or using differencing (calculating the difference between consecutive observations) can help remove autocorrelation.
4.  **Use Feasible Generalized Least Squares (FGLS)**: This is a regression method that can be used when autocorrelation is present. It estimates the correlation structure of the errors and uses that information to transform the data and obtain more efficient estimates.
5.  **Use Robust Standard Errors (HAC or Clustered Standard Errors)**: For serially correlated errors, heteroscedasticity-and-autocorrelation-consistent (HAC, also known as Newey-West) standard errors can provide valid inference without changing the coefficient estimates. When the correlation is instead confined within groups of observations, clustered standard errors serve the same purpose.
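As a rough sketch of how FGLS works, the example below runs a single iteration of a Cochrane-Orcutt style procedure on synthetic data with AR(1) errors: fit OLS, estimate the lag-1 correlation of the residuals, quasi-difference the data with that estimate, and refit. The data-generating values (an AR coefficient of 0.7, an intercept of 1, a slope of 2) and the helper functions are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
x = rng.normal(size=n)

# Hypothetical AR(1) errors: u_t = 0.7 * u_{t-1} + e_t
u = np.zeros(n)
shocks = rng.normal(size=n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + shocks[t]
y = 1.0 + 2.0 * x + u

def ols(X, y):
    """Return least-squares coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def durbin_watson(resid):
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Step 1: plain OLS, then estimate rho from the lag-1 residual regression
X = np.column_stack([np.ones(n), x])
beta_ols, resid_ols = ols(X, y)
rho_hat = (resid_ols[1:] @ resid_ols[:-1]) / (resid_ols[:-1] @ resid_ols[:-1])

# Step 2: quasi-difference y and every column of X with rho_hat, then refit
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
beta_fgls, resid_fgls = ols(X_star, y_star)

dw_before = durbin_watson(resid_ols)
dw_after = durbin_watson(resid_fgls)
print(f"rho_hat = {rho_hat:.2f}, DW before = {dw_before:.2f}, DW after = {dw_after:.2f}")
```

Because the intercept column is quasi-differenced along with the other columns, the refitted coefficients are on the same scale as the original ones and can be compared directly.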

In summary, the independence of errors assumption is vital, particularly for time series data, as its violation can lead to biased standard errors and incorrect statistical inferences. Checking for autocorrelation using residual plots, the Durbin-Watson statistic, or other tests is important, and various methods are available to address this issue if it is present. Based on your model's Durbin-Watson statistic of 0.948, there appears to be some positive autocorrelation in the residuals, which you might want to investigate further.

### No Multicollinearity Assumption Explained in Detail

The no multicollinearity assumption in linear regression states that the **independent variables should not be highly correlated with each other**. **Multicollinearity** occurs when two or more independent variables in a regression model are highly linearly related.

**What does No Multicollinearity Mean?**

No multicollinearity means that each independent variable provides unique information to the model that is not already explained by the other independent variables. While some degree of correlation between independent variables is expected, high correlation can cause problems.

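Mathematically, **perfect** multicollinearity exists when one independent variable can be expressed exactly as a linear combination of the other independent variables. In practice the concern is usually near-perfect multicollinearity, where such a relationship holds approximately.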

**Why is No Multicollinearity Important?**

Multicollinearity does **not** affect the overall predictive power of the model (e.g., the R-squared value). However, it does significantly impact the **interpretation and stability of the individual regression coefficients**:

*   **Unstable Coefficient Estimates**: In the presence of high multicollinearity, the estimated coefficients for the correlated independent variables can be unstable and highly sensitive to small changes in the data. This means that the sign or magnitude of a coefficient could change dramatically if you add or remove a few data points.
*   **Difficulty in Interpreting Individual Effects**: It becomes difficult to determine the individual effect of each correlated independent variable on the dependent variable because their effects are intertwined. The model can tell you that the group of correlated variables is important, but not which specific variable within that group is driving the effect.
*   **Inflated Standard Errors**: Multicollinearity inflates the standard errors of the coefficients for the correlated variables. This leads to smaller t-statistics and larger p-values, making it difficult to determine if the individual variables are statistically significant, even if the overall model is significant.
*   **Wide Confidence Intervals**: Due to the inflated standard errors, the confidence intervals for the coefficients of the correlated variables become very wide, reflecting the uncertainty in their estimated values.

In extreme cases of perfect multicollinearity (where one independent variable is a perfect linear combination of others), the regression model cannot be estimated at all.

**How to Check the No Multicollinearity Assumption?**

You can check for multicollinearity using several methods:

1.  **Correlation Matrix**: Calculate the correlation matrix of your independent variables. Look for high correlation coefficients (typically above 0.7 or 0.8 in absolute value) between pairs of independent variables. This is a good initial check, but it only identifies pairwise correlations, not relationships among three or more variables.
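2.  **Variance Inflation Factor (VIF)**: The VIF measures how much the variance of an estimated regression coefficient is inflated by that variable's correlation with the other independent variables. For variable $j$ it is $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the R-squared from regressing variable $j$ on all the other independent variables. A VIF of 1 means the variable is uncorrelated with the others; values above 1 indicate some correlation, which only becomes a concern when the VIF is large.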
    *   Commonly used thresholds for VIF are 5 or 10. A VIF above 5 or 10 is often considered an indication of significant multicollinearity.
    *   You can calculate the VIF for each independent variable.
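3.  **Eigenvalues and Condition Index**: Some statistical software provides eigenvalues of the correlation matrix and a condition index. A high condition index (e.g., above 15 or 30) can indicate multicollinearity. A condition number computed from unscaled predictors is sensitive to their units, however, so a large value such as the `Cond. No.` of 33300 in the summary above can reflect the scale of SAT (in the thousands) rather than correlation among the predictors.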
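The VIF can be computed by hand from the auxiliary regressions it is built on. The sketch below implements VIF_j = 1 / (1 − R_j²) with NumPy on synthetic predictors, two of which are nearly copies of each other. The `vif` helper is written for this example; statsmodels provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which expects the design matrix including the constant column.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two near-duplicate predictors should show VIFs far above the usual thresholds of 5 or 10, while the unrelated predictor stays close to 1.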

**What to Do if the No Multicollinearity Assumption is Violated?**

If you find that multicollinearity is present, you have a few options:

1.  **Remove One of the Correlated Variables**: If two or more independent variables are highly correlated, you can consider removing one of them from the model. Choose the variable that is less theoretically important or has a weaker relationship with the dependent variable.
2.  **Combine Correlated Variables**: You can create a new variable that is a composite of the correlated variables (e.g., by averaging them or using principal component analysis).
3.  **Collect More Data**: Multicollinearity can sometimes be reduced by collecting more data, especially if the current data set has limited variability in the independent variables.
4.  **Use Ridge Regression or Lasso Regression**: These are penalized regression methods that can handle multicollinearity by shrinking the coefficient estimates.
5.  **Standardize Variables**: While standardizing variables (subtracting the mean and dividing by the standard deviation) does not eliminate multicollinearity, it can sometimes help with the interpretation of coefficients when multicollinearity is present.

In summary, multicollinearity can make it difficult to interpret the individual effects of independent variables and can lead to unstable and unreliable coefficient estimates and standard errors. Checking for multicollinearity using correlation matrices and VIFs is important, and various methods are available to address this issue if it is present.
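One practical caveat when reading the `Cond. No.` in a regression summary is that the condition number of a design matrix depends on the units of its columns. The sketch below builds a design matrix from a constant and a single synthetic SAT-like predictor (mean and spread loosely matching the `df.describe()` output above, but otherwise made up) and compares the condition number before and after standardizing. With only one predictor there is nothing for it to be collinear with, so any large value is purely a scaling effect.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 84
# Synthetic SAT-like scores; not the notebook's actual data
sat = rng.normal(loc=1845, scale=105, size=n)

X_raw = np.column_stack([np.ones(n), sat])

# Standardize: subtract the mean and divide by the standard deviation
sat_std = (sat - sat.mean()) / sat.std()
X_std = np.column_stack([np.ones(n), sat_std])

cond_raw = np.linalg.cond(X_raw)   # large, purely because of the column scales
cond_std = np.linalg.cond(X_std)   # close to 1: the two columns are orthogonal

print(f"raw: {cond_raw:.0f}, standardized: {cond_std:.2f}")
```

This is why VIFs, which are unaffected by rescaling a predictor, are usually the more direct check for multicollinearity.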

### Dummy Variables in Regression Explained in Detail

Dummy variables are a way to include **categorical independent variables** (variables that represent groups or categories, like gender, region, or experimental condition) in a linear regression model. Linear regression models require all independent variables to be numerical, so dummy variables are used to convert categorical data into a numerical format that the model can understand.

**What is a Dummy Variable?**

A dummy variable is a binary variable, meaning it can only take on two values, typically **0** or **1**. These values are used to represent the presence or absence of a specific category.

For a categorical variable with *k* categories, you typically create *k-1* dummy variables. The category that does not have a dummy variable assigned to it is called the **reference category** (or base category). The coefficients of the dummy variables are then interpreted in comparison to this reference category.

**How are Dummy Variables Created?**

Let's say you have a categorical variable called "City" with three categories: "New York", "London", and "Paris". To include this in a regression model, you would create *k-1 = 3-1 = 2* dummy variables.

You could create two dummy variables:

1.  **`City_London`**: This variable would be 1 if the observation is from London, and 0 otherwise (New York or Paris).
2.  **`City_Paris`**: This variable would be 1 if the observation is from Paris, and 0 otherwise (New York or London).

In this case, "New York" would be the reference category.
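In pandas, this encoding can be produced with `pd.get_dummies`. The sketch below uses a handful of made-up observations. Declaring the column as a categorical with "New York" listed first is a choice made here so that `drop_first=True` drops "New York" and makes it the reference category; without it, pandas would order the categories alphabetically and drop "London" instead.

```python
import pandas as pd

# Hypothetical observations for the "City" example
cities = pd.Series(
    pd.Categorical(
        ["New York", "London", "Paris", "London", "New York"],
        categories=["New York", "London", "Paris"],  # first category = reference
    ),
    name="City",
)

# drop_first=True omits the first category, leaving k - 1 = 2 dummy columns
dummies = pd.get_dummies(cities, prefix="City", drop_first=True, dtype=int)
print(dummies)
```

Rows for New York have 0 in both columns, which is how the reference category is represented.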

**How are Dummy Variables Used in Regression?**

When you include these dummy variables in your regression model, the model equation might look something like this (assuming 'Income' is the dependent variable and 'Years of Experience' is another independent variable):

`Income = β₀ + β₁ * Years of Experience + β₂ * City_London + β₃ * City_Paris + ε`

**Interpretation of Coefficients with Dummy Variables:**

*   **β₀**: This is the intercept. It represents the expected value of the dependent variable (Income) when all independent variables (including the dummy variables) are zero. In this example, it would represent the expected income for someone with zero years of experience who is in the reference category ("New York").
*   **β₁**: This is the coefficient for 'Years of Experience'. It represents the expected change in Income for a one-unit increase in Years of Experience, holding City constant.
*   **β₂**: This is the coefficient for `City_London`. It represents the **difference** in expected Income between someone in London and someone in the reference category ("New York"), holding Years of Experience constant. If β₂ is positive, it means people in London are expected to earn β₂ more than people in New York, on average, with the same years of experience.
*   **β₃**: This is the coefficient for `City_Paris`. It represents the **difference** in expected Income between someone in Paris and someone in the reference category ("New York"), holding Years of Experience constant.
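A small numerical sketch may help with these interpretations. The coefficients below are hypothetical values invented for illustration, not estimates from any data; the point is only that, at a fixed number of years of experience, the gap between a city and the reference category equals that city's dummy coefficient.

```python
# Hypothetical coefficients, chosen only to illustrate the interpretation
b0, b1, b2, b3 = 30.0, 2.0, 5.0, 8.0   # intercept, experience, London, Paris

def expected_income(years, city):
    """Evaluate the regression equation for one person."""
    london = 1 if city == "London" else 0
    paris = 1 if city == "Paris" else 0
    return b0 + b1 * years + b2 * london + b3 * paris

years = 4
income_ny = expected_income(years, "New York")                  # reference category
gap_london = expected_income(years, "London") - income_ny       # equals b2
gap_paris = expected_income(years, "Paris") - income_ny         # equals b3

print(income_ny, gap_london, gap_paris)
```

Changing `years` shifts all three expected incomes by the same amount and leaves the gaps unchanged, which is what "holding Years of Experience constant" means here.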

**Why Use k-1 Dummy Variables? (The Dummy Variable Trap)**

If you were to create a dummy variable for all *k* categories (e.g., `City_New York`, `City_London`, `City_Paris`) in a model that also has an intercept, you would fall into the **dummy variable trap**. This leads to perfect multicollinearity because the sum of the dummy variables for all categories always equals 1 (e.g., `City_New York + City_London + City_Paris = 1`), which is exactly the constant column used for the intercept. This perfect linear relationship between the columns makes it impossible for the regression model to estimate the coefficients uniquely. By excluding one category and using it as the reference, you avoid this issue.
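The trap can be seen directly in the rank of the design matrix. The sketch below builds, for six made-up observations, one design matrix with a constant plus all three city dummies and another with a constant plus two dummies, then compares each matrix's rank with its number of columns.

```python
import numpy as np

# Hypothetical city labels for six observations
cities = np.array(["New York", "London", "Paris", "London", "Paris", "New York"])
levels = ["New York", "London", "Paris"]

const = np.ones(len(cities))
all_dummies = np.column_stack([(cities == lv).astype(float) for lv in levels])

X_trap = np.column_stack([const, all_dummies])        # constant + k dummies
X_ok = np.column_stack([const, all_dummies[:, 1:]])   # constant + (k - 1) dummies

rank_trap = np.linalg.matrix_rank(X_trap)   # fewer than the 4 columns
rank_ok = np.linalg.matrix_rank(X_ok)       # equal to the 3 columns

print(f"trap: rank {rank_trap} of {X_trap.shape[1]} columns")
print(f"k - 1 dummies: rank {rank_ok} of {X_ok.shape[1]} columns")
```

A rank below the number of columns means one column is redundant, so there is no unique least-squares solution.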

**Choosing the Reference Category:**

The choice of the reference category is arbitrary and does not affect the overall fit of the model or the predictions. However, it does affect the interpretation of the dummy variable coefficients. It's often helpful to choose a reference category that is meaningful for comparison (e.g., a baseline group or the most common category).

**In Summary:**

Dummy variables are a technique to incorporate categorical independent variables into linear regression models. By converting categories into binary (0/1) variables, you can estimate the effect of each category relative to a chosen reference category. Understanding how to create and interpret dummy variables is essential when working with datasets that contain both numerical and categorical predictors.

In [6]:
# Compute in-sample predictions (fitted values) for the rows used to fit the model
predictions = model.predict(x)

# Display the first few predictions
print("Predicted GPA values:")
display(predictions.head())

Predicted GPA values:


| | Predicted GPA |
|---|---|
| 0 | 3.121933 |
| 1 | 3.022717 |
| 2 | 3.181457 |
| 3 | 3.057441 |
| 4 | 3.078939 |
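As a rough cross-check of the first prediction, the fitted equation can be evaluated by hand with the coefficients printed in the summary. This is only a sketch: the summary rounds the coefficients to four decimals, and with SAT values near 1,700 that rounding is enough to shift the result, so the hand calculation is not expected to reproduce the displayed value exactly.

```python
# Coefficients as printed (rounded) in the model summary above
const, b_sat, b_rand = 0.2960, 0.0017, -0.0083

# First row of the dataset: SAT = 1714, "Rand 1,2,3" = 1
sat, rand = 1714, 1

approx_gpa = const + b_sat * sat + b_rand * rand
reported_gpa = 3.121933            # first value displayed by model.predict(x)
gap = approx_gpa - reported_gpa

print(f"hand calculation: {approx_gpa:.4f}, reported: {reported_gpa}, gap: {gap:.4f}")
```

The gap of roughly 0.08 GPA points comes from rounding the SAT coefficient; for exact values, use the unrounded coefficients stored in `model.params`.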
