## Hamoye Week 2, Project code

In [29]:
# importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge


First, I check the data for null values to confirm whether there are any cleaning tasks to do before working with it.

In [19]:
data = pd.read_csv("energydata_complete.csv")
data.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

The output of the previous cell shows that none of the attributes in the dataset contain null values.
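
The per-column listing can also be collapsed into a single check. The following is a minimal sketch on a small made-up frame; the `demo` DataFrame and its values are invented for illustration and are not taken from the energy dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (one missing value in T1)
demo = pd.DataFrame({"T1": [19.9, 20.1, np.nan], "RH_1": [47.6, 46.7, 46.3]})

# Total number of missing cells across the whole frame
total_missing = int(demo.isnull().sum().sum())

# Names of the columns that contain at least one missing value
columns_with_nulls = demo.columns[demo.isnull().any()].tolist()

print(total_missing)
print(columns_with_nulls)
```

On the real data, a total of zero would match the all-zero listing printed above.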

Let's go through the questions and attempt each one.

### Question 1

From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value in two d.p.?

In [23]:
# Select the independent variable (T2) and the dependent variable (T6)
X = data['T2']
y = data['T6']

# Add a constant term to the independent variable
X = sm.add_constant(X)

# Fit the linear model(ordinary least squares (OLS) regression)
model = sm.OLS(y, X).fit()

# Get the R-squared value
r_squared = model.rsquared

# Print the R-squared value rounded to 2 decimal places
print(f"R-squared: {r_squared:.2f}")

R-squared: 0.64
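
To make the R-squared value more concrete, here is a rough sketch of how it can be computed by hand as 1 - RSS/TSS. It uses synthetic data rather than the T2 and T6 columns, and the names `x`, `y`, `y_hat`, `r2_manual` and `r2_from_corr` are illustrative only. For a simple regression with an intercept, the same number equals the squared Pearson correlation between the two variables.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the energy dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Squared Pearson correlation between x and y
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(f"{r2_manual:.4f} {r2_from_corr:.4f}")
```

The two printed values should agree up to floating-point error.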


### Questions 2,3,4,5
Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:

What is the 
1. Mean Absolute Error (in two decimal places)?
2. Residual Sum of Squares (in two decimal places)?
3. Root Mean Squared Error (in three decimal places)?
4. Coefficient of Determination (in two decimal places)?

In [25]:
# Remove columns and select target variable
columns_to_remove = ["date", "lights"]
target_variable = "Appliances"
data_filtered = data.drop(columns=columns_to_remove)
X = data_filtered.drop(columns=target_variable)
y = data_filtered[target_variable]

# Normalize the dataset using MinMaxScaler
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, test_size=0.3, random_state=42
)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Calculate Residual Sum of Squares (RSS)
rss = mean_squared_error(y_test, y_pred) * len(y_test)

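# Calculate Root Mean Squared Error (RMSE)
# np.sqrt is used because the `squared=False` argument was deprecated in scikit-learn 1.4 and later removed
rmse = np.sqrt(mean_squared_error(y_test, y_pred))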

# Calculate Coefficient of Determination (R-squared)
r_squared = r2_score(y_test, y_pred)

# Print the results (RMSE to 3 decimal places, the others to 2)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Residual Sum of Squares: {rss:.2f}")
print(f"Root Mean Squared Error: {rmse:.3f}")
print(f"Coefficient of Determination (R-squared): {r_squared:.2f}")

Mean Absolute Error: 53.64
Residual Sum of Squares: 51918501.21
Root Mean Squared Error: 93.640
Coefficient of Determination (R-squared): 0.15
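
The four metrics are closely related, which the sketch below tries to show on a tiny hand-made example. The arrays `y_true` and `y_hat` are invented values, not predictions from the model above. RSS is the sum of squared residuals, RMSE is the square root of RSS divided by the number of samples, and R-squared is 1 - RSS/TSS.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented targets and predictions, for illustration only
y_true = np.array([60.0, 50.0, 90.0, 110.0, 40.0])
y_hat = np.array([70.0, 45.0, 80.0, 100.0, 55.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))            # mean of |residuals| = 10.0
rss = np.sum(residuals ** 2)                # sum of squared residuals = 550.0
rmse = np.sqrt(rss / len(y_true))           # square root of the mean squared error
tss = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares around the mean
r2 = 1 - rss / tss

# The hand-computed values should match scikit-learn's metric functions
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(rss, mean_squared_error(y_true, y_hat) * len(y_true)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

This is also why the cell above can obtain RSS by multiplying the mean squared error by `len(y_test)`.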


### Question 6
Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

In [27]:
# Get the feature weights
feature_weights = pd.DataFrame({"Feature": X.columns, "Weight": model.coef_})

# Sort the feature weights
feature_weights = feature_weights.sort_values("Weight")

# Extract the feature with the lowest and highest weights
lowest_weight_feature = feature_weights.iloc[0]["Feature"]
highest_weight_feature = feature_weights.iloc[-1]["Feature"]

# Print the features with the lowest and highest weights
print("Feature with the lowest weight:", lowest_weight_feature)
print("Feature with the highest weight:", highest_weight_feature)


Feature with the lowest weight: RH_2
Feature with the highest weight: RH_1
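
As an alternative to sorting a DataFrame, the weights can be held in a `pd.Series` indexed by feature name, so that `idxmin` and `idxmax` return the feature names directly. This is only a sketch on synthetic, noise-free data; `X_demo`, `y_demo`, `demo_model` and the column names `a`, `b`, `c` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features with known weights: a -> 3.0, b -> -2.0, c -> 0.5
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(size=(100, 3)), columns=["a", "b", "c"])
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["b"] + 0.5 * X_demo["c"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Index the coefficients by feature name and sort from lowest to highest
weights = pd.Series(demo_model.coef_, index=X_demo.columns).sort_values()

print("Lowest weight:", weights.idxmin())
print("Highest weight:", weights.idxmax())
```

Note that "lowest" here means the most negative weight, matching the sort used in the cell above, and not the weight with the smallest magnitude.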


### Question 7
Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [30]:
# Train the Ridge regression model with alpha=0.4
model = Ridge(alpha=0.4)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

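# Calculate the RMSE on the test set
# np.sqrt is used because the `squared=False` argument was deprecated in scikit-learn 1.4 and later removed
rmse = np.sqrt(mean_squared_error(y_test, y_pred))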

# Print the RMSE
print("Root Mean Squared Error (RMSE):", rmse)


Root Mean Squared Error (RMSE): 93.66122703951946
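
In this run the ridge RMSE (93.661 to three decimal places) differs only slightly from the linear regression RMSE reported earlier (93.640). One way to see what the `alpha` penalty does is to compare the size of the coefficient vectors. The sketch below does this on synthetic data generated with `make_regression`; the dataset and the names `ols`, `ridge`, `ols_norm` and `ridge_norm` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=0.4).fit(X_demo, y_demo)

# The L2 penalty pulls the ridge coefficients towards zero,
# so their Euclidean norm is smaller than that of the unpenalized fit
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(f"OLS coefficient norm:   {ols_norm:.4f}")
print(f"Ridge coefficient norm: {ridge_norm:.4f}")
```

With a small `alpha` such as 0.4 the shrinkage is mild, which is consistent with the small difference in RMSE seen above.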


### Question 8 
Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [31]:
# Train the Lasso regression model with alpha=0.001
model = Lasso(alpha=0.001)
model.fit(X_train, y_train)

# Get the feature weights
feature_weights = model.coef_

# Count the number of features with non-zero weights
non_zero_features = np.count_nonzero(feature_weights)

# Print the number of features with non-zero weights
print("Number of features with non-zero weights:", non_zero_features)


Number of features with non-zero weights: 25
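
Lasso can set some weights to exactly zero, and how many depends on `alpha`. The following sketch illustrates this on synthetic data; the dataset and the names `alpha_max`, `dense_count` and `sparse_count` are made up for the example. It assumes scikit-learn's Lasso objective, under which every weight is zero once `alpha` reaches the largest absolute covariance between a centered feature and the centered target, divided by the number of samples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic problem where only 3 of the 10 features are informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Smallest alpha for which all lasso weights are zero (with an intercept fitted)
X_centered = X_demo - X_demo.mean(axis=0)
y_centered = y_demo - y_demo.mean()
alpha_max = np.max(np.abs(X_centered.T @ y_centered)) / len(y_demo)

# A tiny penalty keeps weights non-zero; a penalty above alpha_max removes them all
dense_count = np.count_nonzero(Lasso(alpha=0.001).fit(X_demo, y_demo).coef_)
sparse_count = np.count_nonzero(Lasso(alpha=2 * alpha_max).fit(X_demo, y_demo).coef_)

print("Non-zero weights with alpha=0.001:", dense_count)
print("Non-zero weights with alpha=2*alpha_max:", sparse_count)
```

With `alpha=0.001`, as in the cell above, the penalty is weak, so only a few weights, if any, are expected to be dropped.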


### Question 9
What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

In [32]:
# Train the Lasso regression model with alpha=0.001
model = Lasso(alpha=0.001)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

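# Calculate the RMSE on the test set
# np.sqrt is used because the `squared=False` argument was deprecated in scikit-learn 1.4 and later removed
rmse = np.sqrt(mean_squared_error(y_test, y_pred))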

# Print the RMSE
print("Root Mean Squared Error (RMSE): {:.3f}".format(rmse))


Root Mean Squared Error (RMSE): 93.641
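
The three test-set RMSE values obtained in this notebook are 93.640 (linear), 93.661 (ridge) and 93.641 (lasso). Since the same fit, predict and score steps are repeated for each model, they could be wrapped in a loop. Below is a minimal sketch of that idea on synthetic data; the dataset and the names `candidates` and `rmse_by_model` are hypothetical, and the printed numbers will not match the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the normalized features and target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Same three estimators and alpha values as used in the questions
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.4),
    "lasso": Lasso(alpha=0.001),
}

rmse_by_model = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, estimator.predict(X_te))
    # Taking the square root of the MSE gives the RMSE
    rmse_by_model[name] = float(np.sqrt(mse))

for name, value in rmse_by_model.items():
    print(f"{name}: {value:.3f}")
```

Reusing one split for all models keeps the comparison fair, as every estimator is scored on the same test rows.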
