 1) Download and read the "data.csv" dataset. Split the dataset into train and test sets (use your choice of splitting). Train a linear regression model for predicting "Apparent Temperature (C)" and report the performance (use your choice of at least four performance metrics).

In [2]:
import pandas as pd

dataframe = pd.read_csv("data_6.csv")
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Precip Type               95936 non-null  object 
 1   Temperature (C)           96453 non-null  float64
 2   Apparent Temperature (C)  96453 non-null  float64
 3   Humidity                  96453 non-null  float64
 4   Wind Speed (km/h)         96453 non-null  float64
 5   Wind Bearing (degrees)    96453 non-null  int64  
 6   Visibility (km)           96453 non-null  float64
 7   Pressure (millibars)      96453 non-null  float64
dtypes: float64(6), int64(1), object(1)
memory usage: 5.9+ MB


In [3]:
dataframe["Precip Type"].unique()

array(['rain', 'snow', nan], dtype=object)

In [4]:
dataframe.dropna(subset=["Precip Type"], inplace=True)

In [5]:
dataframe.isnull().values.sum()

0

In [6]:
# With the nulls removed, encode the class label as an integer: rain -> 0, snow -> 1.
dataframe["Precip Type"] = dataframe["Precip Type"].map({'rain':0, 'snow':1})
dataframe.head()

   Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)
0            0         9.472222                  7.388889      0.89            14.1197                     251          15.8263               1015.13
1            0         9.355556                  7.227778      0.86            14.2646                     259          15.8263               1015.63
2            0         9.377778                  9.377778      0.89             3.9284                     204          14.9569               1015.94
3            0         8.288889                  5.944444      0.83            14.1036                     269          15.8263               1016.41
4            0         8.755556                  6.977778      0.83            11.0446                     259          15.8263               1016.51


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Standardize every numeric column (the target included); the encoded class label is left untouched.
numeric_cols = dataframe.columns.difference(['Precip Type'])
dataframe[numeric_cols] = scaler.fit_transform(dataframe[numeric_cols])
dataframe.head()



   Precip Type  Temperature (C)  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Pressure (millibars)
0            0        -0.257951                 -0.324102  0.792748           0.478964                0.591157         1.309107              0.102152
1            0        -0.270141                 -0.339134   0.63947           0.499902                0.665655         1.309107              0.106415
2            0        -0.267819                 -0.138532  0.792748           -0.99362                0.153478         1.100806              0.109058
3            0        -0.381594                 -0.458873  0.486192           0.476638                0.758778         1.309107              0.113066
4            0        -0.332833                  -0.36246  0.486192            0.03463                0.665655         1.309107              0.113919


In [8]:
# Split into training and test sets (80/20), stratified on 'Precip Type' so that
# both sets keep the same rain/snow proportion.
y = dataframe["Apparent Temperature (C)"]
X = dataframe[dataframe.columns.difference(['Precip Type', 'Apparent Temperature (C)'])]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=dataframe['Precip Type'], random_state=13)
# Check the proportion of each 'Precip Type' value in the original, training, and test data
print("Original distribution:")
print(dataframe['Precip Type'].value_counts(normalize=True))

print("\nTraining distribution:")
print(dataframe.loc[X_train.index, 'Precip Type'].value_counts(normalize=True))

print("\nTest distribution:")
print(dataframe.loc[X_test.index, 'Precip Type'].value_counts(normalize=True))


Original distribution:
0    0.888342
1    0.111658
Name: Precip Type, dtype: float64

Training distribution:
0    0.888336
1    0.111664
Name: Precip Type, dtype: float64

Test distribution:
0    0.888368
1    0.111632
Name: Precip Type, dtype: float64


In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, median_absolute_error

lin = LinearRegression()
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
median = median_absolute_error(y_test, y_pred)

# Print metrics to the screen

print(f"Median Absolute Error: {median:.5f}")
print(f"Mean Absolute Error (MAE): {mae:.5f}")
print(f"Mean Squared Error (MSE): {mse:.5f}")
print(f"R-squared (R2): {r2:.5f}")


Median Absolute Error: 0.06560
Mean Absolute Error (MAE): 0.07911
Mean Squared Error (MSE): 0.01012
R-squared (R2): 0.98988
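
Because the target column was standardized together with the features in In [7], the errors above are expressed in standard deviations of apparent temperature rather than in degrees Celsius. The following is a minimal sketch, on synthetic data rather than data_6.csv, of how the scaling could be confined to the features and fitted on the training split only, so that the metrics come out in the target's original units. The feature meanings, coefficients, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather data (assumption: not the real dataset).
rng = np.random.default_rng(13)
n = 2000
temperature = rng.normal(12.0, 9.0, n)
humidity = rng.uniform(0.2, 1.0, n)
wind = rng.gamma(2.0, 5.0, n)
X_demo = np.column_stack([temperature, humidity, wind])
# Hypothetical target in degrees C: close to temperature, lowered by wind.
y_demo = temperature - 0.1 * wind + rng.normal(0.0, 0.5, n)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13
)

# The scaler sits inside the pipeline, so it is fitted on the training
# split only, and the target is left unscaled.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(Xd_train, yd_train)
yd_pred = model.predict(Xd_test)

mae_demo = mean_absolute_error(yd_test, yd_pred)
rmse_demo = np.sqrt(mean_squared_error(yd_test, yd_pred))
r2_demo = r2_score(yd_test, yd_pred)
print(f"MAE:  {mae_demo:.3f} degrees C")
print(f"RMSE: {rmse_demo:.3f} degrees C")
print(f"R2:   {r2_demo:.3f}")
```

R2 is unaffected by rescaling the target, so only the error metrics change units under this arrangement.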


 2) Apply PCA on the dataset and select the first three principal components. Split the dataset into train and test using the same method used in Q1. Compare the performance of this model with the performance obtained in Q1.  Explain the outcome.

In [10]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.fit(X)
X_PCA = pca.transform(X)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_PCA, y, test_size=0.2, stratify=dataframe['Precip Type'], random_state=13)


lin_PCA = LinearRegression()
lin_PCA.fit(X_train_pca, y_train_pca)
y_pred_PCA = lin_PCA.predict(X_test_pca)

mae_pca = mean_absolute_error(y_test_pca, y_pred_PCA)
mse_pca = mean_squared_error(y_test_pca, y_pred_PCA)
r2_pca = r2_score(y_test_pca, y_pred_PCA)
median_pca = median_absolute_error(y_test_pca, y_pred_PCA)

print(f"Median Absolute Error: {median_pca:.5f}")
print(f"Mean Absolute Error (MAE): {mae_pca:.5f}")
print(f"Mean Squared Error (MSE): {mse_pca:.5f}")
print(f"R-squared (R2): {r2_pca:.5f}")

Median Absolute Error: 0.35722
Mean Absolute Error (MAE): 0.42736
Mean Squared Error (MSE): 0.29454
R-squared (R2): 0.70542


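Model quality drops markedly when the six features are reduced to three principal components: R2 falls from 0.98988 to 0.70542 and MSE rises from 0.01012 to 0.29454. PCA is unsupervised: it keeps the directions of largest variance in the features without looking at the target, so the three discarded components can still carry information the regression needs. Here the full model presumably relies heavily on Temperature (C), which tracks Apparent Temperature (C) closely in the rows shown earlier, and part of that signal is lost once it is mixed with the other features and truncated.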
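
The point that retained variance is not the same as retained predictive information can be illustrated with a deliberately extreme synthetic case. This is a sketch under assumptions, not a reproduction of the result above: the six features are built as three strongly correlated pairs, and the hypothetical target depends only on the small difference within one pair, a direction PCA ranks last.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
n = 5000
# Three latent factors, each driving a pair of strongly correlated features.
z = rng.normal(size=(n, 3))
X_toy = np.repeat(z, 2, axis=1) + 0.3 * rng.normal(size=(n, 6))
# Hypothetical target: the low-variance difference within the last pair.
y_toy = X_toy[:, 4] - X_toy[:, 5] + 0.05 * rng.normal(size=n)

Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=13
)

# Fit the scaler and PCA on the training split only.
scaler_toy = StandardScaler().fit(Xt_train)
Xt_train_s = scaler_toy.transform(Xt_train)
Xt_test_s = scaler_toy.transform(Xt_test)

# Linear regression on all six standardized features.
full = LinearRegression().fit(Xt_train_s, yt_train)
r2_full = r2_score(yt_test, full.predict(Xt_test_s))

# Linear regression on the first three principal components.
pca_toy = PCA(n_components=3).fit(Xt_train_s)
reduced = LinearRegression().fit(pca_toy.transform(Xt_train_s), yt_train)
r2_reduced = r2_score(yt_test, reduced.predict(pca_toy.transform(Xt_test_s)))

kept = pca_toy.explained_variance_ratio_.sum()
print(f"Feature variance kept by 3 components: {kept:.3f}")
print(f"R2 with all 6 features:               {r2_full:.3f}")
print(f"R2 with 3 principal components:       {r2_reduced:.3f}")
```

In this construction the three components keep most of the feature variance yet almost none of the signal. The weather data is a milder case of the same effect, and `pca.explained_variance_ratio_` on the real features would show how much variance the three components retain there.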

 3) Load the "data.csv" dataset and follow this link for the data description (features ['Temperature (C)','Humidity','Wind Speed (km/h)','Wind Bearing (degrees)','Visibility (km)','Pressure (millibars)'] and Precip Type as target). Apply PCA on the dataset and select the first three principal components. Split the dataset into train and test sets (use your choice of splitting). Train a logistic regression model and report the performance (use your choice of at least 4 performance metrics).

In [11]:
dataframe_2 = pd.read_csv("data_6.csv")
dataframe_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Precip Type               95936 non-null  object 
 1   Temperature (C)           96453 non-null  float64
 2   Apparent Temperature (C)  96453 non-null  float64
 3   Humidity                  96453 non-null  float64
 4   Wind Speed (km/h)         96453 non-null  float64
 5   Wind Bearing (degrees)    96453 non-null  int64  
 6   Visibility (km)           96453 non-null  float64
 7   Pressure (millibars)      96453 non-null  float64
dtypes: float64(6), int64(1), object(1)
memory usage: 5.9+ MB


In [12]:
dataframe_2.dropna(subset=["Precip Type"], inplace=True)
dataframe_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95936 entries, 0 to 96452
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Precip Type               95936 non-null  object 
 1   Temperature (C)           95936 non-null  float64
 2   Apparent Temperature (C)  95936 non-null  float64
 3   Humidity                  95936 non-null  float64
 4   Wind Speed (km/h)         95936 non-null  float64
 5   Wind Bearing (degrees)    95936 non-null  int64  
 6   Visibility (km)           95936 non-null  float64
 7   Pressure (millibars)      95936 non-null  float64
dtypes: float64(6), int64(1), object(1)
memory usage: 6.6+ MB


In [13]:
from sklearn.linear_model import LogisticRegression

# Drop the column that is not among the Q3 features
filtered_dataframe = dataframe_2.drop(columns=['Apparent Temperature (C)'])
filtered_dataframe.info()

# Build the target and features from filtered_dataframe (not from the Q1 dataframe)
feature_cols = filtered_dataframe.columns.difference(['Precip Type'])
y2 = filtered_dataframe['Precip Type'].map({'rain': 0, 'snow': 1})
# Standardize the features once before PCA
X2 = StandardScaler().fit_transform(filtered_dataframe[feature_cols])
# Apply PCA
pca = PCA(n_components=3)
X2 = pca.fit_transform(X2)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, stratify=filtered_dataframe['Precip Type'], random_state=13)
# Train a logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X2_train, y2_train)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95936 entries, 0 to 96452
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Precip Type             95936 non-null  object 
 1   Temperature (C)         95936 non-null  float64
 2   Humidity                95936 non-null  float64
 3   Wind Speed (km/h)       95936 non-null  float64
 4   Wind Bearing (degrees)  95936 non-null  int64  
 5   Visibility (km)         95936 non-null  float64
 6   Pressure (millibars)    95936 non-null  float64
dtypes: float64(5), int64(1), object(1)
memory usage: 5.9+ MB


In [14]:
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score
y_pred_log= log_reg.predict(X2_test)

acc = accuracy_score(y2_test, y_pred_log)
prec = precision_score(y2_test, y_pred_log)
recall = recall_score(y2_test, y_pred_log)
f1 = f1_score(y2_test, y_pred_log)

print(f"Accuracy: {acc:.5f}")
print(f"Precision: {prec:.5f}")
print(f"Recall: {recall:.5f}")
print(f"F1 Score: {f1:.5f}")

Accuracy: 0.90911
Precision: 0.65284
Recall: 0.39683
F1 Score: 0.49361
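
With about 89% of the rows labelled rain (class 0), accuracy alone is a weak signal: a model that always predicts rain would already score close to 0.89, which is why the recall and F1 values above are more informative than the 0.909 accuracy. The sketch below illustrates this on synthetic labels with a similar class ratio (an assumption; it does not use data_6.csv), with scikit-learn's DummyClassifier as a majority-class baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

rng = np.random.default_rng(13)
n = 10000
# Synthetic labels with roughly the class ratio reported above (about 11% snow = 1).
y_imb = (rng.random(n) < 0.11).astype(int)
# Features are irrelevant to a majority-class baseline, so a dummy column suffices.
X_imb = np.zeros((n, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
y_base = baseline.predict(X_imb)

acc_base = accuracy_score(y_imb, y_base)
rec_base = recall_score(y_imb, y_base, zero_division=0)
f1_base = f1_score(y_imb, y_base, zero_division=0)
bal_base = balanced_accuracy_score(y_imb, y_base)

print(f"Baseline accuracy:          {acc_base:.3f}")
print(f"Baseline recall (snow):     {rec_base:.3f}")
print(f"Baseline F1 (snow):         {f1_base:.3f}")
print(f"Baseline balanced accuracy: {bal_base:.3f}")
```

The baseline never predicts snow, so its recall and F1 for that class are zero even though its accuracy looks high. Comparing a trained model against such a baseline gives a fairer reading of the accuracy figure.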


 4) Apply L1 regulariser on the logistic regression model developed using the same train and test data used in Q3 and calculate performance of the new model. Compare performance of this model with the performance reported in Q3. Explain the outcome.

In [16]:
# Same train/test data as Q3, now with an L1 penalty (the saga solver supports L1)
log_reg_l1 = LogisticRegression(penalty='l1', solver='saga')
log_reg_l1.fit(X2_train, y2_train)
y_pred_log_l1 = log_reg_l1.predict(X2_test)

acc_l1 = accuracy_score(y2_test, y_pred_log_l1)
prec_l1 = precision_score(y2_test, y_pred_log_l1)
recall_l1 = recall_score(y2_test, y_pred_log_l1)
f1_l1 = f1_score(y2_test, y_pred_log_l1)

print(f"Accuracy: {acc_l1:.5f}")
print(f"Precision: {prec_l1:.5f}")
print(f"Recall: {recall_l1:.5f}")
print(f"F1 Score: {f1_l1:.5f}")

Accuracy: 0.90911
Precision: 0.65284
Recall: 0.39683
F1 Score: 0.49361


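In our case the L1-regularised model produces exactly the same metrics as the Q3 model. Note that scikit-learn's LogisticRegression already applies an L2 penalty by default (penalty='l2', C=1.0), so this comparison is L1 versus L2 at the same strength rather than regularised versus unregularised. With only three uncorrelated PCA components and roughly 76,700 training rows, a penalty of this strength is small relative to the data term, so both models most likely end up with almost the same coefficients and therefore the same test predictions. This suggests the Q3 model was not overfitting; a smaller C would be needed for the L1 penalty to have a visible effect.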
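
The effect of the penalty strength can be sketched on synthetic data. The example below is an illustration under assumptions (three made-up features, the third being pure noise, and hypothetical coefficients), not a rerun of the Q3/Q4 models. It compares the default L2 model with an L1 model at the same C, and then with an L1 model at a much smaller C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(13)
n = 5000
X_syn = rng.normal(size=(n, 3))  # third column is pure noise
# Hypothetical true model: only the first two features matter.
logit = 2.0 * X_syn[:, 0] - 1.5 * X_syn[:, 1] - 2.0
y_syn = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=13
)

# The default LogisticRegression already uses an L2 penalty with C=1.0.
l2_default = LogisticRegression().fit(Xs_train, ys_train)
l1_default = LogisticRegression(
    penalty="l1", solver="saga", max_iter=5000
).fit(Xs_train, ys_train)
# A smaller C means a stronger penalty.
l1_strong = LogisticRegression(
    penalty="l1", solver="saga", C=0.01, max_iter=5000
).fit(Xs_train, ys_train)

# Fraction of test rows on which the L2 and L1 models (both C=1.0) agree.
agreement = np.mean(l2_default.predict(Xs_test) == l1_default.predict(Xs_test))
print(f"L2 vs L1 prediction agreement at C=1.0: {agreement:.3f}")
print("L2 coefficients (C=1.0): ", l2_default.coef_.round(3))
print("L1 coefficients (C=1.0): ", l1_default.coef_.round(3))
print("L1 coefficients (C=0.01):", l1_strong.coef_.round(3))
```

At C=1.0 the two penalties should give nearly identical predictions, while at C=0.01 the L1 penalty is expected to shrink the coefficients and drive the noise feature's weight to zero. Sweeping C in the same way on X2_train would show whether L1 changes anything for the weather model.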