## Machine Learning (ML) and Deep Learning (DL) Assignments

### Assignment 1: Predictive Analysis with the Titanic Dataset

**Instructions:**

- **Objective:** Predict whether a passenger survived the Titanic disaster using logistic regression.
- **Dataset:** The Titanic dataset is a classic dataset available on Kaggle. It includes passenger information from the Titanic disaster and can be used to predict survival outcomes. The Titanic dataset fields:
  - **PassengerId:** A unique number for each passenger.
  - **Survived:** Whether the passenger survived (0 = No, 1 = Yes); this is the target variable.
  - **Pclass:** The ticket class (1st, 2nd, or 3rd class).
  - **Name:** The passenger's name.
  - **Sex:** The passenger's gender.
  - **Age:** The passenger's age in years.
  - **SibSp:** Number of siblings or spouses on board.
  - **Parch:** Number of parents or children on board.
  - **Ticket:** The passenger's ticket number.
  - **Fare:** How much the ticket cost.
  - **Cabin:** The cabin number where the passenger stayed.
  - **Embarked:** Where the passenger got on the ship (C = Cherbourg, Q = Queenstown, S = Southampton).
- **Tasks:**
  - Load the dataset and perform exploratory data analysis.
  - **Preprocess the data: handle missing values, convert categorical data to numerical, etc.**
  - Split the data into training and testing sets.
  - Build a logistic regression model to predict survival.
  - Evaluate the model's performance using accuracy, precision, and recall metrics.
- **Hints:**
  - Pay attention to columns such as **'Age'**, **'Sex'**, **'Pclass'**, and **'Fare'**.
  - Use libraries such as Pandas for data manipulation and Scikit-learn for logistic regression.

**Python code:**

In [None]:
# Install the required packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install pandas numpy matplotlib seaborn scikit-learn scipy

In [None]:
# Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# Load the datasets
train_data = pd.read_csv('Titanic/titanic_train.csv')
test_data = pd.read_csv('Titanic/titanic_test.csv')
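
For the exploratory step, a first pass might count missing values and compare survival rates across groups. The sketch below uses a small hand-made frame (`toy`, with made-up values) in place of `train_data` so that it runs on its own. The column names follow the field list above, and `Survived` is assumed to be the 0/1 target column used in the later cells.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data; the values are made up for illustration
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 3, 3, 2, 1],
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Age':      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Mean of a 0/1 column is the fraction of ones, i.e. the survival rate per group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```

On the real data, the same two calls applied to `train_data` show which columns need imputation and which features separate the classes.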

In [None]:
# Preprocess the data
# TODO: Handle missing values, convert categorical data to numerical
# Example: train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())

In [None]:
# TODO: Select features and target variable
# Example: X = train_data[['Pclass', 'Age']]
# Example: y = train_data['Survived']
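
One way to approach the two TODO cells above, sketched on a hypothetical toy frame rather than the real data: fill missing ages by assigning the result back (recent pandas versions warn when `fillna(..., inplace=True)` is called on a column selection), map `Sex` to 0/1, then pick the features and target. The mapping and the chosen columns are illustrative assumptions, not the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass':   [3, 1, 3, 2],
    'Sex':      ['male', 'female', 'female', 'male'],
    'Age':      [20.0, 40.0, np.nan, 30.0],
})

# Fill missing ages with the mean of the observed ages (assign back instead of inplace)
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Encode the categorical 'Sex' column as 0/1
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# Select features and target
X = toy[['Pclass', 'Sex', 'Age']]
y = toy['Survived']
print(X)
```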

**Explain the Titanic dataset preprocessing steps:**

- **Define Feature Sets:**
  - **Numeric features:** The columns that hold numeric data, e.g., **'Age'**, **'Fare'**, etc.
  - **Categorical features:** The columns that hold categorical data, e.g., **'Pclass'**, **'Sex'**, etc.
- **Numeric Data Setup:** **'numeric_transformer'** is a pipeline that transforms numeric features in two steps.
  - **Imputer:** Fills in missing numeric values with the column median.
  - **Scaler:** Standardizes each numeric column to zero mean and unit variance so that all features are on a comparable scale.
- **Category Data Setup:** **'category_transformer'** is a pipeline that transforms categorical features in two steps.
  - **Imputer:** Fills in missing categorical values with the label **'missing'**.
  - **OneHot:** Converts each category into binary indicator columns that the model can use.
- **Combine Steps:**
  - Uses a **'ColumnTransformer'** to apply each pipeline to its own set of columns.
  - The output is a single numeric feature matrix, ready for the model.

In [None]:
# Preprocessing
# Column lists (the four columns named in the hints; extend as needed)
numeric_features = ['Age', 'Fare']
categorical_features = ['Pclass', 'Sex']

numeric_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='median')),
  ('scaler', StandardScaler())])

category_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Each pipeline is applied only to the columns listed for it; all other columns are dropped
preprocessor = ColumnTransformer(
  transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', category_transformer, categorical_features)])

X = train_data.drop('Survived', axis=1)
y = train_data['Survived']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Build the logistic regression model
# Fit the preprocessor on the training split only, then train on the transformed features
X_train_preprocessed = preprocessor.fit_transform(X_train)
model = LogisticRegression()
model.fit(X_train_preprocessed, y_train)

In [None]:
# Preprocess the test data (use 'transform' not 'fit_transform')
# Stored under a new name so the held-out split X_test is not overwritten
X_new = test_data.drop('PassengerId', axis=1)  # Assuming 'PassengerId' is the only non-feature column
X_new_preprocessed = preprocessor.transform(X_new)

# Predict on the test data
test_predictions = model.predict(X_new_preprocessed)

In [None]:
# Predict and evaluate the model on the held-out split
predictions = model.predict(preprocessor.transform(X_test))
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
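
The three metrics answer different questions, which a small worked example makes concrete. The labels below are made up: out of 10 predictions there are 2 true positives, 1 false positive, 2 false negatives, and 5 true negatives, so accuracy is 7/10, precision is 2/3, and recall is 2/4.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (1 = survived, 0 = did not survive)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# confusion_matrix rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
manual_precision = tp / (tp + fp)                  # share of predicted survivors who survived
manual_recall = tp / (tp + fn)                     # share of actual survivors who were found

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_precision, precision_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
```

Each line prints the hand-computed value next to the Scikit-learn result, and the two agree.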

### Assignment 2: Time Series Forecasting with the Air Quality Dataset using LSTM

**Instructions:**

- **Objective:** Forecast future air pollution levels (e.g., NO2 concentration) using time series analysis.
- **Dataset:** The Air Quality dataset from the UCI Machine Learning Repository provides roughly one year of hourly air quality readings (March 2004 to February 2005) for time series analysis and forecasting. The Air Quality dataset fields:
  - **Date:** When the data was recorded.
  - **Time:** Time of day for the data.
  - **CO(GT):** Carbon Monoxide level.
  - **PT08.S1(CO):** Sensor response for CO.
  - **NMHC(GT):** Non-Methane Hydrocarbons level.
  - **C6H6(GT):** Benzene level.
  - **PT08.S2(NMHC):** Sensor response for NMHC.
  - **NOx(GT):** Nitrogen Oxides level.
  - **PT08.S3(NOx):** Sensor response for NOx.
  - **NO2(GT):** Nitrogen Dioxide level.
  - **PT08.S4(NO2):** Sensor response for NO2.
  - **PT08.S5(O3):** Sensor response for Ozone.
  - **T:** Temperature.
  - **RH:** Relative Humidity.
  - **AH:** Absolute Humidity.
- **Tasks:**
  - Load the dataset and perform initial exploratory data analysis focused on time series aspects.
  - **Handle missing values and preprocess the data for time series analysis.**
  - Visualize the time series data to understand trends, seasonality, and noise.
  - Use a time series forecasting method, such as an LSTM, to predict future pollution levels.
  - Evaluate the model's forecasting accuracy.
- **Hints:**
  - Investigate how NO2 levels change over time.
  - Consider resampling the data (e.g., daily averages) if working with high-granularity data.
  - Use libraries such as Pandas for data manipulation, statsmodels for time series analysis, and TensorFlow/Keras for the LSTM.
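
The resampling hint can be sketched as follows. The hourly series here is synthetic (made-up values and dates); with the real data, the same `resample` call would apply once the frame has a `DatetimeIndex`.

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly readings: day 1 is all 10.0, day 2 is all 20.0
index = pd.date_range('2004-03-11', periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.repeat([10.0, 20.0], 24), index=index, name='NO2(GT)')

# Daily averages; NaN readings would be skipped by mean()
daily = hourly.resample('D').mean()
print(daily)
```

Daily averaging smooths the within-day cycle and shortens the series by a factor of 24, which also makes LSTM training faster.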


In [None]:
# Install the required package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install tensorflow

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [None]:
# Load the dataset (this file uses ';' as the field separator and ',' as the decimal mark)
data = pd.read_csv('AirQualityUCI.csv', delimiter=';', decimal=',')

In [None]:
# Preprocess the data
# TODO: Handle missing values and set the datetime index
# Note: this dataset marks missing readings with -200
# Example: data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
# data.set_index('Date', inplace=True)
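
A sketch of this step on a toy frame. It assumes the conventions commonly documented for this dataset: dates as `DD/MM/YYYY`, times as `HH.MM.SS`, and `-200` as the missing-value marker. Check these against your copy of the CSV before relying on them. Linear interpolation is one possible way to fill gaps, chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the assumed layout of the CSV
toy = pd.DataFrame({
    'Date': ['10/03/2004', '10/03/2004', '10/03/2004'],
    'Time': ['18.00.00', '19.00.00', '20.00.00'],
    'NO2(GT)': [113.0, -200.0, 114.0],
})

# Combine Date and Time into a single datetime index
toy['Datetime'] = pd.to_datetime(toy['Date'] + ' ' + toy['Time'],
                                 format='%d/%m/%Y %H.%M.%S')
toy = toy.drop(columns=['Date', 'Time']).set_index('Datetime')

# Treat the -200 marker as missing, then fill the gap by linear interpolation
toy['NO2(GT)'] = toy['NO2(GT)'].replace(-200, np.nan).interpolate()
print(toy)
```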

In [None]:
# Select the NO2 column
no2_values = data['NO2(GT)'].values.reshape(-1, 1)

In [None]:
# Scale the data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_no2 = scaler.fit_transform(no2_values)

In [None]:
# Split the data into train and test sets
# Keep time order: the first 67% is used for training, the remainder for testing
train_size = int(len(scaled_no2) * 0.67)
test_size = len(scaled_no2) - train_size
train, test = scaled_no2[:train_size], scaled_no2[train_size:]

In [None]:
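# Convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
  """Build (X, Y) pairs: X holds look_back consecutive values, Y the value that follows."""
  X, Y = [], []
  # Stop at len(dataset) - look_back so the last value is still used as a target
  for i in range(len(dataset) - look_back):
    X.append(dataset[i:i + look_back, 0])
    Y.append(dataset[i + look_back, 0])
  return np.array(X), np.array(Y)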
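
To see what the windowing produces, here is a self-contained variant run on a tiny series. `make_windows` is a hypothetical helper written for this illustration; it follows the same sliding-window idea but takes a 1-D array.

```python
import numpy as np

def make_windows(series, look_back=1):
    """Return (X, y): each row of X holds look_back consecutive values, y the value that follows."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)

series = np.arange(6, dtype=float)  # 0, 1, 2, 3, 4, 5
X, y = make_windows(series, look_back=2)
print(X)  # rows: [0, 1], [1, 2], [2, 3], [3, 4]
print(y)  # [2, 3, 4, 5]
```

Each row of `X` is the model input and the matching entry of `y` is the value the model should learn to predict.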

In [None]:
# Reshape into X=t and Y=t+1
look_back = 1
X_train, Y_train = create_dataset(train, look_back)
X_test, Y_test = create_dataset(test, look_back)

In [None]:
# Reshape input to be [samples, time steps, features]
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
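
Keras LSTM layers expect 3-D input of shape `(samples, time steps, features)`. The cell above treats each window as one time step with `look_back` features. A common alternative treats it as `look_back` time steps with one feature each. Both layouts are shown below on a dummy array. Which one suits the data is a modeling choice, and the second would need `input_shape=(look_back, 1)` in the model.

```python
import numpy as np

look_back = 3
# Dummy windows: 4 samples, each holding 3 lagged values
windows = np.arange(12, dtype=float).reshape(4, look_back)

# Layout used above: 1 time step, look_back features
as_features = windows.reshape(windows.shape[0], 1, windows.shape[1])

# Alternative: look_back time steps, 1 feature per step
as_steps = windows.reshape(windows.shape[0], windows.shape[1], 1)

print(as_features.shape)  # (4, 1, 3)
print(as_steps.shape)     # (4, 3, 1)
```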

In [None]:
# Create and fit the LSTM network
model = Sequential()
model.add(LSTM(50, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, Y_train, epochs=100, batch_size=1, verbose=2)

In [None]:
# Make predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

In [None]:
# Invert predictions
train_predict = scaler.inverse_transform(train_predict)
Y_train = scaler.inverse_transform([Y_train])
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform([Y_test])

In [None]:
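# Plot baseline and predictions, each aligned with the time step it predicts
train_plot = np.full_like(scaled_no2, np.nan)
train_plot[look_back:look_back + len(train_predict)] = train_predict
test_plot = np.full_like(scaled_no2, np.nan)
test_plot[train_size + look_back:train_size + look_back + len(test_predict)] = test_predict
plt.figure(figsize=(12,6))
plt.plot(scaler.inverse_transform(scaled_no2))
plt.plot(train_plot)
plt.plot(test_plot)
plt.show()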
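
For the evaluation task, root mean squared error (RMSE) in the original units is a common choice for forecasts. Below is a minimal sketch with made-up numbers. With the cells above, the inputs would be the inverse-transformed targets and predictions. The helper name `rmse` is hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized arrays (flattened to 1-D)."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up values: errors are -3, 4, 0, 0, so the mean squared error is 25 / 4
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [103.0, 106.0, 120.0, 130.0]
print(rmse(actual, predicted))  # 2.5
```

Comparing the train and test RMSE gives a quick check for overfitting: a test error far above the train error suggests the model has memorized the training window.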