
Load the `student_scores_v1.csv` dataset into a pandas DataFrame.


In [4]:
import pandas as pd

df = pd.read_csv('student_scores_v1.csv')
df.head()

   Hours_Studied  Score
0            1.0     10
1            2.0     25
2            3.0     35
3            4.0     50
4            5.0     65


## Data preprocessing

Check for missing values, separate the feature (X) from the target (y), and split the data into training and testing sets.


In [5]:

from sklearn.model_selection import train_test_split

# Count missing values per column
print(df.isnull().sum())

# Feature matrix (2-D) and target vector (1-D)
X = df[['Hours_Studied']]
y = df['Score']

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining data shape:", X_train.shape, y_train.shape)
print("Testing data shape:", X_test.shape, y_test.shape)

Hours_Studied    0
Score            0
dtype: int64

Training data shape: (8, 1) (8,)
Testing data shape: (2, 1) (2,)
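
The check above reports no missing values, so this dataset needs no further handling. If a similar file did contain gaps, one possible approach is sketched below on a small made-up frame (the values are illustrative assumptions, not rows from `student_scores_v1.csv`): drop rows with a missing target, then fill missing feature values with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; not taken from student_scores_v1.csv
demo = pd.DataFrame({
    "Hours_Studied": [1.0, 2.0, np.nan, 4.0],
    "Score": [10.0, np.nan, 35.0, 50.0],
})

# A row without a target cannot be used for supervised training, so drop it
cleaned = demo.dropna(subset=["Score"]).copy()

# Fill missing feature values with the median of the remaining rows
median_hours = cleaned["Hours_Studied"].median()
cleaned["Hours_Studied"] = cleaned["Hours_Studied"].fillna(median_hours)

print(cleaned.isnull().sum())
```

In a full workflow, the fill value would usually be computed on the training split only and then applied to the test split, so that no information leaks from the test data.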


## Polynomial regression implementation

Expand the training and testing features into polynomial terms with `PolynomialFeatures`, train a `LinearRegression` model on the expanded training features, and evaluate it on the expanded testing features using R-squared and Mean Squared Error. Because the test set holds only two samples, treat the resulting metrics as a rough indication rather than a reliable estimate.


In [6]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

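# Expand each input x into the terms [1, x, x^2]
poly_features = PolynomialFeatures(degree=2)

# Fit the transformer on the training data only, then reuse it for the test data
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Ordinary least squares on the expanded features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test_poly)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"R-squared: {r2}")
print(f"Mean Squared Error: {mse}")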

R-squared: 0.9994730767518012
Mean Squared Error: 0.6271703961686215


## MLOps integration

Use an experiment-tracking library such as MLflow to log parameters, metrics, and the trained model, so that runs can be versioned and reproduced.


In [8]:
%pip install mlflow


In [9]:
import mlflow
import mlflow.sklearn

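with mlflow.start_run():
    # Hyperparameter of the feature expansion
    mlflow.log_param("polynomial_degree", 2)

    # Test-set metrics computed above
    mlflow.log_metric("r2_score", r2)
    mlflow.log_metric("mean_squared_error", mse)

    # Store the fitted regression model as a run artifact
    mlflow.sklearn.log_model(model, "polynomial_regression_model")

# The `with` block ends the run on exit, so no explicit mlflow.end_run() is needed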



## Prediction

Use the trained model to make predictions on new data.


In [10]:
import numpy as np

new_hours = np.array([[11.0], [12.0], [13.0]])
new_hours_df = pd.DataFrame(new_hours, columns=['Hours_Studied'])

new_hours_poly = poly_features.transform(new_hours_df)

predictions = model.predict(new_hours_poly)

print("Predictions for new hours studied:")
print(predictions)

Predictions for new hours studied:
[101.97943754 103.70371679 104.03865178]
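
The three predictions rise by shrinking steps (about 1.7, then 0.3), which suggests the fitted quadratic is flattening at these inputs. If the new hours lie beyond the range seen in training, the values are extrapolations and should be read with caution. The helper below is a hypothetical sketch (its name and the demo values are assumptions) for flagging such inputs before predicting.

```python
import numpy as np


def outside_training_range(train_values, new_values):
    """Return a boolean mask marking new values outside the training [min, max]."""
    train_values = np.asarray(train_values, dtype=float).ravel()
    new_values = np.asarray(new_values, dtype=float).ravel()
    low, high = train_values.min(), train_values.max()
    return (new_values < low) | (new_values > high)


# Illustrative training range; in the notebook it would come from X_train
demo_train_hours = np.arange(1.0, 11.0)
demo_new_hours = np.array([5.0, 11.0, 12.0])

mask = outside_training_range(demo_train_hours, demo_new_hours)
print(mask)
```

In the notebook, the same check could be run as `outside_training_range(X_train['Hours_Studied'], new_hours_df['Hours_Studied'])`.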
