Tutorial 5

Multiple Linear Regression Model

Dataset: std_marks_data.csv

1. Load the data and inspect it before preprocessing.

In [9]:
import pandas as pd

In [10]:
std_marks = pd.read_csv("std_marks_data.csv")
print("First five rows of the dataset:\n", std_marks.head())

First five rows of the dataset:
    hours  age  internet  marks
0   6.84   15         0  78.64
1   6.56   20         1  88.80
2    NaN   21         1  88.90
3   8.67   22         1  98.99
4   7.55   17         1  92.34


In [11]:
std_marks

     hours  age  internet  marks
0     6.84   15         0  78.64
1     6.56   20         1  88.80
2      NaN   21         1  88.90
3     8.67   22         1  98.99
4     7.55   17         1  92.34
..     ...  ...       ...    ...
295   2.99   25         0  43.45
296   6.55   15         1  77.74
297   0.00   20         1  75.76
298   9.90   22         0  99.99


2. Count the missing values in each column.

In [12]:
missing_values = std_marks.isnull().sum()
print("\nMissing values in each column:\n", missing_values)


Missing values in each column:
 hours       12
age          0
internet     0
marks        0
dtype: int64


3. Replace NaN values with the column mean so that missing entries do not distort the model.

In [13]:
# Assign the filled column back to the DataFrame. Calling fillna(..., inplace=True)
# on a column selection raises a FutureWarning and will stop working in pandas 3.0.
std_marks['hours'] = std_marks['hours'].fillna(std_marks['hours'].mean())

print("\nMissing values after filling:\n", std_marks.isnull().sum())


Missing values after filling:
 hours       0
age         0
internet    0
marks       0
dtype: int64
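
As a minimal sketch of how mean imputation behaves, the cell below fills one missing value in a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only and are not taken from std_marks_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the tutorial's dataset): one missing value in 'hours'.
toy = pd.DataFrame({
    "hours": [2.0, np.nan, 4.0, 6.0],
    "marks": [40.0, 55.0, 60.0, 80.0],
})

# Series.mean() skips NaN by default, so the mean is taken over 2.0, 4.0 and 6.0.
hours_mean = toy["hours"].mean()

# Assign the filled column back instead of using inplace=True on a column selection.
toy["hours"] = toy["hours"].fillna(hours_mean)

print("Mean used for filling:", hours_mean)
print("Missing values left:", int(toy["hours"].isnull().sum()))
```

Because the mean is computed only from the observed values, the filled entry does not change the column mean, although it does reduce the column's variance.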


4. Separate the input features (X) from the target variable (y).

In [14]:
X = std_marks[['hours', 'age', 'internet']]
y = std_marks['marks']
print("\nFeatures (X):\n", X.head())
print("\nTarget Variable (y):\n", y.head())


Features (X):
       hours  age  internet
0  6.840000   15         0
1  6.560000   20         1
2  5.494514   21         1
3  8.670000   22         1
4  7.550000   17         1

Target Variable (y):
 0    78.64
1    88.80
2    88.90
3    98.99
4    92.34
Name: marks, dtype: float64


5. Split the data into training and test sets and train the model.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print("\nModel training completed.")


Model training completed.
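
A fitted multiple linear regression model is an intercept plus one coefficient per feature. The sketch below shows how these can be read from `LinearRegression` and used to reproduce a prediction by hand. It uses hypothetical noise-free data (`X_demo`, `y_demo`, `demo_model` are illustrative names), not the student marks dataset, so the recovered coefficients are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 5 + 3*x1 + 2*x2.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 5.0 + 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# The learned parameters should be close to the values used to generate the data.
print("Intercept:", demo_model.intercept_)
print("Coefficients:", demo_model.coef_)

# A prediction is the intercept plus the dot product of features and coefficients.
sample = np.array([[1.0, 2.0]])
manual_prediction = demo_model.intercept_ + sample @ demo_model.coef_
print("Manual prediction:", manual_prediction[0])
print("model.predict:   ", demo_model.predict(sample)[0])
```

Applied to the tutorial's `model`, the same attributes (`model.intercept_` and `model.coef_`) would give the weight learned for each of hours, age and internet, in the column order of `X`.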


6. Evaluate the model on the test set, then predict marks for new input data.

In [16]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = model.predict(X_test)

print("\nModel Evaluation:")
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test, y_pred)))

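# Predict on new data. The model was fitted on a DataFrame, so wrap the new row
# in a DataFrame with the same column names to avoid a feature-name warning.
new_data = [[8.0, 18, 1]]
new_df = pd.DataFrame(new_data, columns=X.columns)
predicted_marks = model.predict(new_df)
print("\nPredicted marks for input", new_data, ":", predicted_marks[0])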


Model Evaluation:
Mean Absolute Error: 15.093878830955454
Mean Squared Error: 296.9278739472545
Root Mean Squared Error: 17.2315952235205

Predicted marks for input [[8.0, 18, 1]] : 77.49013903730197
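
To make the three error metrics concrete, the sketch below computes them by hand with NumPy on a few made-up values. The arrays `y_true` and `y_hat` are assumptions for illustration and are unrelated to the results above.

```python
import numpy as np

# Hypothetical true and predicted marks (illustrative values only).
y_true = np.array([80.0, 90.0, 70.0, 60.0])
y_hat = np.array([78.0, 94.0, 70.0, 66.0])

errors = y_true - y_hat            # [2, -4, 0, -6]

mae = np.mean(np.abs(errors))      # (2 + 4 + 0 + 6) / 4 = 3.0
mse = np.mean(errors ** 2)         # (4 + 16 + 0 + 36) / 4 = 14.0
rmse = np.sqrt(mse)                # square root of the MSE, in the same units as marks

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

MAE and RMSE are both expressed in marks, so they can be read directly as a typical prediction error. RMSE weights large errors more heavily than MAE because the errors are squared before averaging.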


