## Machine Learning (ML) and Deep Learning (DL) Assignments

### Assignment 1: Predictive Analysis with the Titanic Dataset

**Instructions:**

- **Objective:** Predict whether a passenger survived the Titanic disaster using logistic regression.
- **Dataset:** The Titanic dataset, available on Kaggle, contains passenger records from the Titanic disaster and is commonly used to predict survival outcomes. Its fields are:
  - **PassengerId:** A unique number for each passenger.
  - **Survived:** The label to predict (0 = did not survive, 1 = survived).
  - **Pclass:** The ticket class (1st, 2nd, or 3rd class).
  - **Name:** The passenger's name.
  - **Sex:** The passenger's gender.
  - **Age:** The passenger's age in years.
  - **SibSp:** Number of siblings or spouses on board.
  - **Parch:** Number of parents or children on board.
  - **Ticket:** The passenger's ticket number.
  - **Fare:** How much the ticket cost.
  - **Cabin:** The cabin number where the passenger stayed.
  - **Embarked:** Where the passenger got on the ship (C = Cherbourg, Q = Queenstown, S = Southampton).
- **Tasks:**
  - Load the dataset and perform exploratory data analysis.
  - Split the data into training and testing sets.
  - **Preprocess the data: handle missing values, convert categorical data to numerical, etc.**
  - Build a logistic regression model to predict survival.
  - Create a submission file containing the **'PassengerId'** and **'Survived'** columns with the test predictions and save it to a CSV file.
  - Evaluate the model's performance using accuracy, precision, and recall metrics.
- **Hints:**
  - Pay attention to columns such as **'Age'**, **'Sex'**, **'Pclass'**, and **'Fare'**.
  - Use libraries such as Pandas for data manipulation and Scikit-learn for logistic regression.

**Python code:**

In [None]:
# Install the required packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas numpy matplotlib seaborn scikit-learn scipy

In [None]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, precision_recall_curve, auc

In [None]:
# Load the datasets
train_data = pd.read_csv('TitanicData/titanic_train.csv')
test_data = pd.read_csv('TitanicData/titanic_test.csv')

In [None]:
# Define numeric and categorical features
numeric_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Embarked', 'Pclass']

**Titanic dataset preprocessing steps:**

- **Define Feature Sets:**
  - **numeric_features:** Lists the numeric columns, here **'Age'** and **'Fare'**.
  - **categorical_features:** Lists the categorical columns, here **'Sex'**, **'Embarked'**, and **'Pclass'**.
- **Numeric Data Setup:** **numeric_transformer** is a two-step pipeline for the numeric features.
  - **Imputer:** Replaces missing values with the column median.
  - **Scaler:** Standardizes each column to zero mean and unit variance so the features share a common scale.
- **Category Data Setup:** **categorical_transformer** is a two-step pipeline for the categorical features.
  - **Imputer:** Replaces missing values with the placeholder category **'missing'**.
  - **OneHot:** Encodes each category as its own 0/1 indicator column; categories not seen during fitting are ignored.
- **Combine Steps:**
  - A **'ColumnTransformer'** applies the numeric pipeline to the numeric columns and the categorical pipeline to the categorical columns.
  - The combined output is a single numeric feature matrix ready for the model.
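
To make the effect of these steps concrete, here is a minimal sketch that runs the same kind of pipeline on a four-row toy table. The rows and the name `toy_preprocessor` are invented for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (invented for illustration) with one missing value per column type
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 58.0],
    'Fare': [7.25, 71.28, 8.05, 26.55],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

numeric = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
toy_preprocessor = ColumnTransformer([
    ('num', numeric, ['Age', 'Fare']),
    ('cat', categorical, ['Sex', 'Embarked']),
])

toy_matrix = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns + 2 'Sex' categories + 3 'Embarked' categories
# ('C', 'S', and the 'missing' placeholder)
print(toy_matrix.shape)
```

The missing 'Embarked' entry becomes its own indicator column, so no row has to be dropped before modeling.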

In [None]:
# Preprocessing
numeric_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='median')),
  ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
  transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

In [None]:
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fit the preprocessor to X_train and transform both X_train and X_val
X_train_transformed = preprocessor.fit_transform(X_train)
X_val_transformed = preprocessor.transform(X_val)

In [None]:
# Build the logistic regression model
model = LogisticRegression()
model.fit(X_train_transformed, y_train)

In [None]:
# Make predictions on the validation set
val_predictions = model.predict(X_val_transformed)

In [None]:
# Evaluate the model's performance on the validation set
accuracy = accuracy_score(y_val, val_predictions)
precision = precision_score(y_val, val_predictions)
recall = recall_score(y_val, val_predictions)

print(f'Validation Accuracy: {accuracy}')
print(f'Validation Precision: {precision}')
print(f'Validation Recall: {recall}')
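
As a worked illustration of what the three scores measure, the sketch below recomputes them by hand from a confusion matrix. The labels are invented toy values, not results from the Titanic model.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels invented for illustration (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision_manual = tp / (tp + fp)                  # predicted survivors who did survive
recall_manual = tp / (tp + fn)                     # actual survivors the model found

print(accuracy_manual, accuracy_score(y_true, y_pred))
print(precision_manual, precision_score(y_true, y_pred))
print(recall_manual, recall_score(y_true, y_pred))
```

Each manual value matches the corresponding Scikit-learn function, which shows that precision and recall differ only in which kind of error (false positives or false negatives) they penalize.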

In [None]:
# Preprocess the test data
X_test = test_data.copy()
X_test_transformed = preprocessor.transform(X_test)

In [None]:
# Make predictions on the test data
test_predictions = model.predict(X_test_transformed)

In [None]:
# Save the predictions to a CSV file
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': test_predictions})
submission.to_csv('TitanicData/titanic_predictions.csv', index=False)

In [None]:
# Calculate and plot the precision-recall curve
# (separate names keep the scalar precision and recall scores from above intact)
probs = model.predict_proba(X_val_transformed)[:, 1]
pr_precision, pr_recall, _ = precision_recall_curve(y_val, probs)
auc_score = auc(pr_recall, pr_precision)

plt.figure(figsize=(8, 6))
plt.plot(pr_recall, pr_precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AUC={auc_score:.2f})')
plt.grid(True)
plt.show()

- **In this run, the area under the precision-recall curve (AUC) was 0.87. Values closer to 1 indicate a better trade-off between precision and recall across decision thresholds.**
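
One way to put such a score in context is to compare it with the share of positive labels, which is roughly the precision a random ranking would achieve. The sketch below does this on four invented data points; the names `pr_auc` and `baseline` are illustrative only.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Toy labels and predicted probabilities, invented for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pr_precision, pr_recall, thresholds = precision_recall_curve(y_true, scores)
pr_auc = auc(pr_recall, pr_precision)

# A model that ranks samples at random has a precision close to the share
# of positive labels, which makes that share a useful reference point
baseline = y_true.mean()

print(f'PR AUC: {pr_auc:.2f}, baseline: {baseline:.2f}')
```

A score well above the baseline suggests the model ranks positives ahead of negatives more often than chance would.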

### Assignment 2: Time Series Forecasting with the Air Quality Dataset using LSTM

**Instructions:**

- **Objective:** Forecast future air pollution levels (e.g., NO2 concentration) using time series analysis.
- **Dataset:** The Air Quality dataset from the UCI Machine Learning Repository provides hourly air quality measurements recorded over about one year (March 2004 to February 2005), suitable for time series analysis and forecasting. Its fields are:
  - **Date:** When the data was recorded.
  - **Time:** Time of day for the data.
  - **CO(GT):** Carbon Monoxide level.
  - **PT08.S1(CO):** Sensor response for CO.
  - **NMHC(GT):** Non-Methane Hydrocarbons level.
  - **C6H6(GT):** Benzene level.
  - **PT08.S2(NMHC):** Sensor response for NMHC.
  - **NOx(GT):** Nitrogen Oxides level.
  - **PT08.S3(NOx):** Sensor response for NOx.
  - **NO2(GT):** Nitrogen Dioxide level.
  - **PT08.S4(NO2):** Sensor response for NO2.
  - **PT08.S5(O3):** Sensor response for Ozone.
  - **T:** Temperature.
  - **RH:** Relative Humidity.
  - **AH:** Absolute Humidity.
- **Tasks:**
  - Load the dataset and perform initial exploratory data analysis focused on time series aspects.
  - **Handle missing values and preprocess the data for time series analysis.**
  - Visualize the time series data to understand trends, seasonality, and noise.
  - Use a time series forecasting method, such as an LSTM network, to predict future pollution levels.
  - Evaluate the model's forecasting accuracy.
- **Hints:**
  - Investigate how NO2 levels change over time.
  - Consider resampling the data (e.g., daily averages) if working with high granularity data.
  - Utilize libraries such as Pandas for data manipulation, TensorFlow and Keras for building LSTM models, and statsmodels for statistical analysis.
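
The resampling hint can be sketched as follows. The hourly series here is synthetic (a simple counter standing in for NO2 readings), so only the mechanics carry over to the real data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for NO2 readings (values are invented)
hours = pd.date_range('2004-03-11', periods=72, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(72, dtype=float), index=hours, name='NO2(GT)')

# Collapse the hourly readings into one mean value per calendar day
daily = hourly.resample('D').mean()
print(daily)
```

Daily averaging smooths out hour-to-hour noise at the cost of having far fewer samples to train on, so it is a trade-off rather than a required step.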

In [None]:
# Install the required package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install tensorflow

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.regularizers import L1L2
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [None]:
# Load the dataset
data = pd.read_csv('AirQualityUCI.csv', delimiter=';', parse_dates=['Date'], dayfirst=True)

In [None]:
# Remove "Unnamed" columns
data = data.loc[:, ~data.columns.str.contains('^Unnamed')]

# Remove rows where "NO2(GT)" is negative; the dataset marks missing readings with -200
data = data[data['NO2(GT)'] >= 0].copy()

# Set the datetime index and drop any remaining rows with missing values
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
data.set_index('Date', inplace=True)
data.dropna(inplace=True)
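
Dropping rows is one option for missing readings; filling them from neighboring values is another. The sketch below shows that alternative on a tiny invented sample. It assumes day/month/year dates, dot-separated times, and -200 as the missing-value marker, so check those assumptions against the actual file before reusing it.

```python
import numpy as np
import pandas as pd

# Tiny invented sample in the layout assumed here: day/month/year dates,
# dot-separated times, and -200 where a reading is missing
raw = pd.DataFrame({
    'Date': ['10/03/2004'] * 4,
    'Time': ['18.00.00', '19.00.00', '20.00.00', '21.00.00'],
    'NO2(GT)': [110.0, -200.0, 130.0, 120.0],
})

# Build one timestamp per row from the separate Date and Time columns
raw.index = pd.to_datetime(raw['Date'] + ' ' + raw['Time'],
                           format='%d/%m/%Y %H.%M.%S')

# Turn the -200 marker into NaN, then fill gaps from neighboring readings
no2 = raw['NO2(GT)'].replace(-200, np.nan).interpolate(method='time')
print(no2)
```

Interpolation keeps the series evenly spaced, which matters when each window of past values is assumed to cover consecutive hours.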

In [None]:
# Select the NO2 column
no2_values = data['NO2(GT)'].values.reshape(-1, 1)

In [None]:
# Scale the data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_no2 = scaler.fit_transform(no2_values)

In [None]:
# Split the data into train and test sets
train_size = int(len(scaled_no2) * 0.6)
test_size = len(scaled_no2) - train_size
train, test = scaled_no2[0:train_size,:], scaled_no2[train_size:len(scaled_no2),:]

In [None]:
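# Convert an array of values into input windows (X) and next-step targets (Y)
def create_dataset(dataset, look_back=1):
  X, Y = [], []
  # Stop at len(dataset) - look_back so the final value is still used as a target
  for i in range(len(dataset)-look_back):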
    a = dataset[i:(i+look_back), 0]
    X.append(a)
    Y.append(dataset[i + look_back, 0])
  return np.array(X), np.array(Y)

In [None]:
# Use the previous 7 time steps as input (X) and the next time step as the target (Y)
look_back = 7
X_train, Y_train = create_dataset(train, look_back)
X_test, Y_test = create_dataset(test, look_back)
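
The sliding-window idea is easier to see on a short sequence. The sketch below uses a hypothetical helper, `make_windows`, written only for this illustration; it follows the same windowing principle on the integers 0 to 9.

```python
import numpy as np

def make_windows(series, look_back):
    """Pair each run of `look_back` values with the value that follows it."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)

toy_series = np.arange(10)  # 0, 1, ..., 9
X_toy, y_toy = make_windows(toy_series, look_back=3)

print(X_toy[0], y_toy[0])  # first window and its target
print(X_toy.shape)         # (samples, time steps)

# An LSTM layer expects a third axis for the number of features per time step
X_toy_3d = X_toy.reshape(X_toy.shape[0], 3, 1)
print(X_toy_3d.shape)
```

Each row of `X_toy` holds three consecutive values and the matching entry of `y_toy` holds the value that comes next, which is the input/target pairing the LSTM learns from.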

In [None]:
# Reshape input to be [samples, time steps, features]
X_train = X_train.reshape(X_train.shape[0], look_back, -1)
X_test = X_test.reshape(X_test.shape[0], look_back, -1)

In [None]:
# Create and fit the LSTM network with Dropout
model = Sequential()
model.add(LSTM(50, input_shape=(look_back, 1), return_sequences=True, 
               kernel_regularizer=L1L2(l1=1e-5, l2=1e-4)))
model.add(Dropout(0.3))
model.add(LSTM(25, return_sequences=False, 
               kernel_regularizer=L1L2(l1=1e-5, l2=1e-4)))
model.add(Dropout(0.3))
model.add(Dense(1))

# Compile the model
model.compile(loss='mean_squared_error', optimizer='adam')

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=7)

# Train the model
model.fit(X_train, Y_train, epochs=100, batch_size=32, verbose=2, callbacks=[early_stopping], validation_split=0.4)

In [None]:
# Make predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

- **scaler.inverse_transform:** Converts the model's predictions and the target values back to their original units, reversing the scaling applied during preprocessing.
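
A minimal round trip shows what the scaler does in each direction. The readings and the name `toy_scaler` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented NO2-like readings, shaped as one column because the scaler
# expects a 2-D array of (samples, features)
readings = np.array([[40.0], [90.0], [140.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
scaled = toy_scaler.fit_transform(readings)      # minimum -> 0, maximum -> 1
restored = toy_scaler.inverse_transform(scaled)  # back to the original units

print(scaled.ravel())
print(restored.ravel())
```

Because error metrics computed on scaled values are hard to interpret, inverting the scaling first lets RMSE and MAE be read in the same units as the measurements.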

In [None]:
# Invert predictions
train_predict = scaler.inverse_transform(train_predict)
Y_train = scaler.inverse_transform([Y_train])
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform([Y_test])

**Evaluation metrics:**
- **RMSE (Root Mean Squared Error):** The square root of the average squared difference between predictions and actual values. Squaring makes large errors count more heavily.
- **MAE (Mean Absolute Error):** The average magnitude of the errors in a set of predictions, without considering their direction.

**Comparing training and test metrics for model evaluation:**
- **Training RMSE and MAE:** Indicate how well the model fits the training data.
- **Test RMSE and MAE:** Show how well the model is expected to perform on unseen data.
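
The two metrics can be computed by hand in a few lines, which makes their difference visible. The values below are invented for illustration.

```python
import numpy as np

# Invented actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

errors = y_pred - y_true              # [-1, 0, 2, 0]
mae = np.mean(np.abs(errors))         # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # square root of the average squared error

print(f'MAE: {mae:.3f}')
print(f'RMSE: {rmse:.3f}')
```

RMSE comes out larger than MAE here because the single error of 2 is squared before averaging. A wide gap between the two on real data usually points to a few large misses rather than uniformly moderate ones.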

In [None]:
# Calculate evaluation metrics
train_rmse = np.sqrt(mean_squared_error(Y_train[0], train_predict[:,0]))
train_mae = mean_absolute_error(Y_train[0], train_predict[:,0])
test_rmse = np.sqrt(mean_squared_error(Y_test[0], test_predict[:,0]))
test_mae = mean_absolute_error(Y_test[0], test_predict[:,0])

# Print evaluation metrics
print(f'Training RMSE: {train_rmse:.2f}')
print(f'Training MAE: {train_mae:.2f}')
print(f'Test RMSE: {test_rmse:.2f}')
print(f'Test MAE: {test_mae:.2f}')

In [None]:
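# Plot actual values and predictions, placing each prediction at the time step it targets
actual = scaler.inverse_transform(scaled_no2)
train_plot = np.full_like(actual, np.nan)
train_plot[look_back:look_back + len(train_predict)] = train_predict
test_plot = np.full_like(actual, np.nan)
test_start = train_size + look_back
test_plot[test_start:test_start + len(test_predict)] = test_predict

plt.figure(figsize=(12,6))
plt.plot(actual, label='Actual')
plt.plot(train_plot, label='Predicted (train)')
plt.plot(test_plot, label='Predicted (test)')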
plt.legend()
plt.xlabel('Time Step')
plt.ylabel('NO2 Levels')
plt.title('Comparison of Actual and Predicted NO2 Levels')
plt.show()