In [None]:
1. What does R-squared represent in a regression model
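R-squared measures the proportion of the variance in the dependent variable explained by the model. For an ordinary least squares fit with an intercept it lies between 0 and 1 on the training data; on held-out data it can be negative when the model predicts worse than simply using the mean.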
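
As a minimal sketch of this definition, R-squared can be computed by hand as 1 - SS_res / SS_tot. The toy values below are hypothetical and chosen only for illustration.

```python
# Hypothetical toy values, used only to illustrate the definition.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.2, 6.8, 9.1]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: variation the model fails to explain.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: variation around the mean of the target.
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```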

2. What are the assumptions of linear regression

Linearity

Independence of errors

Homoscedasticity

Normality of residuals

No multicollinearity

3. What is the difference between R-squared and Adjusted R-squared

R-squared: Increases with more features, even if not useful.

Adjusted R-squared: Penalizes unnecessary features; adjusts R-squared based on model complexity.

4. Why do we use Mean Squared Error (MSE)
MSE measures the average squared difference between predicted and actual values. It's a common loss function for regression models.

5. What does an Adjusted R-squared value of 0.85 indicate
It means 85% of the variance in the target is explained by the model, adjusted for the number of predictors.

6. How do we check for normality of residuals in linear regression

Histogram or Q-Q plot

Shapiro-Wilk test or Kolmogorov-Smirnov test
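
A short sketch of the Shapiro-Wilk check using `scipy.stats.shapiro`. The residuals here are hypothetical (simulated from a normal distribution); in practice they would come from a fitted model.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals, simulated for illustration only.
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
statistic, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {statistic:.3f}, p-value: {p_value:.3f}")

# A small p-value (for example, below 0.05) is evidence against normality.
if p_value > 0.05:
    print("No evidence against normality at the 5% level")
else:
    print("Residuals deviate from normality at the 5% level")
```

With large samples the test can flag tiny, harmless deviations, so it is usually read alongside a histogram or Q-Q plot rather than on its own.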

7. What is multicollinearity, and how does it impact regression
Multicollinearity occurs when predictors are highly correlated. It makes coefficient estimates unstable and inflates standard errors.

8. What is Mean Absolute Error (MAE)
MAE is the average of absolute differences between predictions and actual values. It's less sensitive to outliers than MSE.

9. What are the benefits of using an ML pipeline

Organizes steps in a workflow

Reduces errors

Ensures reproducibility

Simplifies hyperparameter tuning and deployment
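
To illustrate the hyperparameter tuning point, here is a minimal sketch that wraps a scaler and a Ridge model in a pipeline and tunes it with `GridSearchCV`. The dataset, step names, and alpha grid are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical, for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Scaling and the model live in one object, so the scaler is re-fit
# on each training fold and never sees the validation fold.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Step parameters are addressed as <step name>__<parameter name>.
alphas = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(pipe, param_grid={"ridge__alpha": alphas}, cv=5, scoring="r2")
search.fit(X, y)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Best cross-validated R-squared:", round(search.best_score_, 3))
```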

10. Why is RMSE considered more interpretable than MSE
RMSE is in the same units as the target variable, while MSE is in squared units.

11. What is pickling in Python, and how is it useful in ML
Pickling is a way to serialize (save) Python objects, such as trained models, for reuse.

12. What does a high R-squared value mean
It indicates that the model explains a large proportion of the variance in the target variable.

13. What happens if linear regression assumptions are violated
It can lead to biased or inefficient estimates, invalid hypothesis tests, and unreliable predictions.

14. How can we address multicollinearity in regression

Remove correlated features

Use regularization (Ridge, Lasso)

Apply dimensionality reduction (PCA)

15. How can feature selection improve model performance in regression analysis
By removing irrelevant or redundant features, we reduce overfitting and improve interpretability.
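
One possible way to do this is univariate selection with `SelectKBest`. The sketch below assumes a synthetic dataset in which only 3 of 10 features carry signal; the choice of `k=3` is likewise an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 10 features, only 3 of which are informative (assumed setup).
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0
)

# Keep the 3 features with the highest univariate F-statistic against y.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)

print("Original shape:", X.shape)
print("Reduced shape:", X_selected.shape)
print("Selected feature indices:", selected)
```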

16. How is Adjusted R-squared calculated

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where n = number of observations and p = number of predictors.
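
The formula translates directly into a small helper. The function name `adjusted_r2` and the sample inputs are hypothetical, chosen only to show the calculation.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("Need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical inputs: R-squared of 0.90 from 50 observations and 5 predictors.
value = adjusted_r2(r2=0.90, n=50, p=5)
print(f"Adjusted R-squared: {value:.4f}")
```

Because (n - 1) / (n - p - 1) is greater than 1 whenever p > 0, the adjusted value is always below the raw R-squared when R-squared is less than 1.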

17. Why is MSE sensitive to outliers
MSE squares the errors, so large errors (from outliers) have a disproportionate effect.
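
A small sketch of this effect, using hypothetical error values: replacing one error of -1 with -10 changes MSE far more than MAE.

```python
def mse(errors):
    """Mean of squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)


def mae(errors):
    """Mean of absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical prediction errors; the second list swaps in one outlier.
clean = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -10.0]

print("MSE clean:", mse(clean), "| with outlier:", mse(with_outlier))
print("MAE clean:", mae(clean), "| with outlier:", mae(with_outlier))
```

Here the single outlier raises MSE from 1.0 to 25.75 but MAE only from 1.0 to 3.25.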

18. What is the role of homoscedasticity in linear regression
Homoscedasticity means constant variance of residuals. Its absence (heteroscedasticity) can lead to biased standard errors.

19. What is Root Mean Squared Error (RMSE)
RMSE is the square root of MSE. It represents the typical error in the units of the target variable.

20. Why is pickling considered risky
Pickled files can execute arbitrary code during loading, posing security risks if the source is untrusted.
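
The mechanism can be shown with a harmless sketch. An object's `__reduce__` method tells pickle which callable to invoke at load time; the hypothetical `Payload` class below uses `print`, but an attacker could name any callable.

```python
import pickle


class Payload:
    """Hypothetical class showing that unpickling can invoke a callable."""

    def __reduce__(self):
        # pickle stores "call print(...) on load" instead of object state.
        return (print, ("This ran during pickle.loads",))


data = pickle.dumps(Payload())

# Loading the bytes calls print; no Payload instance is ever rebuilt.
result = pickle.loads(data)
print("Value returned by loads:", result)
```

This is why pickle files should only be loaded from sources you trust.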

21. What alternatives exist to pickling for saving ML models

Joblib (efficient for objects holding large NumPy arrays; it builds on pickle, so only load files from trusted sources)

ONNX (for model interoperability)

HDF5 (Keras models)

TorchScript for PyTorch models

SavedModel for TensorFlow

22. What is heteroscedasticity, and why is it a problem
Heteroscedasticity is when residuals have non-constant variance, leading to inefficient estimates and unreliable hypothesis tests.

23. How can interaction terms enhance a regression model's predictive power
Interaction terms capture combined effects of variables that aren't explained by their individual contributions. This can improve model accuracy.
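
A minimal sketch of the idea, assuming synthetic data in which the target really does depend on the product of two features. The additive model misses that effect; adding the product column recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed setup): y depends on x1, x2, and their product.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = x1 + x2 + 2.0 * x1 * x2 + rng.normal(scale=0.1, size=300)

# Model A: main effects only. Model B: main effects plus interaction term.
X_additive = np.column_stack([x1, x2])
X_interaction = np.column_stack([x1, x2, x1 * x2])

r2_additive = LinearRegression().fit(X_additive, y).score(X_additive, y)
r2_interaction = LinearRegression().fit(X_interaction, y).score(X_interaction, y)

print(f"R-squared without interaction: {r2_additive:.3f}")
print(f"R-squared with interaction:    {r2_interaction:.3f}")
```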



In [None]:
1. Visualize Residuals (Diamonds Dataset)

import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load diamonds dataset
diamonds = sns.load_dataset('diamonds')
X = diamonds[['carat', 'depth', 'table']]
y = diamonds['price']

# Fit model
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
residuals = model.resid

# Plot residuals
sns.histplot(residuals, kde=True)
plt.title("Residuals Distribution")
plt.show()
2. Calculate MSE, MAE, RMSE for Linear Regression

from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.2, 6.8, 9.1]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MSE: {mse}, MAE: {mae}, RMSE: {rmse}")
3. Check Linear Regression Assumptions

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load dataset
tips = sns.load_dataset('tips')
X = tips[['total_bill', 'size']]
y = tips['tip']

# Linear model
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
residuals = model.resid
fitted = model.fittedvalues

# Linearity
sns.scatterplot(x=fitted, y=y)
plt.title("Linearity Check")
plt.show()

# Homoscedasticity
sns.scatterplot(x=fitted, y=residuals)
plt.title("Residuals vs Fitted")
plt.show()

# Multicollinearity
corr = tips[['total_bill', 'size', 'tip']].corr()
print(corr)
4. ML Pipeline with Feature Scaling and Regression
# Note: load_boston was removed in scikit-learn 1.2. The bundled diabetes
# dataset is swapped in here so the example runs on current versions.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipelines = {
    'linear': Pipeline([('scaler', StandardScaler()), ('lr', LinearRegression())]),
    'rf': Pipeline([('scaler', StandardScaler()), ('rf', RandomForestRegressor())])
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"{name} R2: {r2_score(y_test, y_pred):.2f}")
5. Simple Linear Regression

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression().fit(X, y)
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("R-squared:", model.score(X, y))
6. Linear Regression (Tips Dataset)

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

tips = sns.load_dataset('tips')
X = tips[['total_bill']]
y = tips['tip']

model = LinearRegression().fit(X, y)

# Plot
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.plot(tips['total_bill'], model.predict(X), color='red')
plt.show()
7. Simple Linear Regression (Synthetic Data)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.rand(50, 1) * 10
y = 2 * X.flatten() + 5 + np.random.randn(50)

model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='red')
plt.show()
8. Pickle a Linear Regression Model

import pickle
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
model = LinearRegression().fit(X, y)

with open('linear_model.pkl', 'wb') as f:
    pickle.dump(model, f)
9. Polynomial Regression (Degree 2)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * X.flatten()**2 + 2*X.flatten() + 5 + np.random.randn(50)*10

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)

plt.scatter(X, y)
plt.plot(X, model.predict(X_poly), color='red')
plt.show()
10. Generate Synthetic Data & Linear Regression

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 1) * 10
y = 3 * X.flatten() + 7 + np.random.randn(100)

model = LinearRegression().fit(X, y)
print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)
11. Compare Polynomial Regression Models

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = X.flatten()**3 - 5*X.flatten()**2 + X.flatten() + np.random.randn(50)*10

for degree in [1, 2, 3, 4]:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)
    print(f"Degree {degree} R2: {r2_score(y, y_pred):.2f}")
12. Multiple Linear Regression with Two Features

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 2) * 10
y = 3 * X[:, 0] + 5 * X[:, 1] + 7 + np.random.randn(100)

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("R-squared:", model.score(X, y))
13. Visualize Linear Regression Line (Synthetic Data)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.flatten() + 3 + np.random.randn(50)

model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='red')
plt.title("Linear Regression Fit")
plt.show()

In [None]:
14. Write a Python script that uses the Variance Inflation Factor (VIF) to check for multicollinearity in a dataset with multiple features.


import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(5)])

# Calculate VIF
vif = pd.DataFrame()
vif['Feature'] = df.columns
vif['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
print(vif)
15. Write a Python script that generates synthetic data for a polynomial relationship (degree 4), fits a polynomial regression model, and plots the regression curve.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate data
np.random.seed(42)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = 5 + 2*X + 0.5*X**2 - 0.3*X**3 + 0.1*X**4 + np.random.randn(100, 1) * 10

# Polynomial Regression
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X_poly), color='red')
plt.title('Polynomial Regression (Degree 4)')
plt.show()
16. Write a Python script that creates a machine learning pipeline with data standardization and a multiple linear regression model, and prints the R-squared score.


from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

pipeline.fit(X_train, y_train)
# Named r2 (not r2_score) to avoid shadowing sklearn.metrics.r2_score
r2 = pipeline.score(X_test, y_test)
print("R-squared Score:", r2)
17. Write a Python script that performs polynomial regression (degree 3) on a synthetic dataset and plots the regression curve.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, size=X.shape[0])

poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)

plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X_poly), color='red')
plt.title('Polynomial Regression (Degree 3)')
plt.show()
18. Write a Python script that performs multiple linear regression on a synthetic dataset with 5 features. Print the R-squared score and model coefficients.


from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
model = LinearRegression().fit(X, y)
print("R-squared:", model.score(X, y))
print("Coefficients:", model.coef_)
19. Write a Python script that generates synthetic data, fits a linear regression model, and visualizes the data points along with the regression line.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.rand(50, 1) * 10
y = 3 * X.squeeze() + 7 + np.random.randn(50) * 2
model = LinearRegression().fit(X, y)

plt.scatter(X, y)
plt.plot(X, model.predict(X), color='red')
plt.title('Linear Regression Line')
plt.show()
20. Create a synthetic dataset with 3 features and perform multiple linear regression. Print the model's R-squared score and coefficients.


from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)
model = LinearRegression().fit(X, y)
print("R-squared:", model.score(X, y))
print("Coefficients:", model.coef_)
21. Write a Python script that demonstrates how to serialize and deserialize machine learning models using joblib instead of pickling.


from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import joblib

X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
model = LinearRegression().fit(X, y)

# Serialize
joblib.dump(model, 'linear_model.joblib')

# Deserialize
loaded_model = joblib.load('linear_model.joblib')
print("R-squared:", loaded_model.score(X, y))
22. Write a Python script to perform linear regression with categorical features using one-hot encoding. Use the Seaborn 'tips' dataset.


import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

tips = sns.load_dataset('tips')
X = tips[['total_bill', 'sex', 'smoker', 'day']]
y = tips['tip']

preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), ['sex', 'smoker', 'day'])
], remainder='passthrough')

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

pipeline.fit(X, y)
print("R-squared:", pipeline.score(X, y))
23. Compare Ridge Regression with Linear Regression on a synthetic dataset and print the coefficients and R-squared score.


from sklearn.linear_model import LinearRegression, Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)

lr = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Linear Regression R-squared:", lr.score(X, y))
print("Ridge Regression R-squared:", ridge.score(X, y))
print("Linear Coefficients:", lr.coef_)
print("Ridge Coefficients:", ridge.coef_)
24. Write a Python script that uses cross-validation to evaluate a Linear Regression model on a synthetic dataset.


from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-validated R-squared scores:", scores)
print("Mean R-squared:", scores.mean())
25. Write a Python script that compares polynomial regression models of different degrees and prints the R-squared score for each.


import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze()**2 + 3 * X.squeeze() + 5 + np.random.randn(100) * 10

for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)
    print(f'Degree {degree} R-squared:', r2_score(y, y_pred))
