---
<center>

# **I - DATA COLLECTION / ELT** 
By: Jay Menarco.

</center>

**PRELIMINARY STEP: GITHUB REPOSITORY SETUP**

- GitHub repository: jays-codes/team24
- Description: main branch, development branch, and release branch.
- Each team member forked the repository (all branches) and worked (pushing and pulling changes) on the development branch.

**DATA COLLECTION / ELT**

- Perform ELT (Extract, Load, Transform):
    - Extract the following dataset and save it to the SQLite database and the GitHub repository:
        - TTC Streetcar Delay, FY2023 and YTD-09-2024 (https://open.toronto.ca/dataset/ttc-streetcar-delay-data/): extracted directly to GitHub.
    
    - Create the following datasets and save them to the SQLite database and/or Python dataframes and the GitHub repository:
        - Ontario Public Holiday, 2023 and 2024 (https://excelnotes.com/holidays-ontario-2023/ and https://excelnotes.com/holidays-ontario-2024): no file is available, only information online, so we manually created the dataset as a .csv file and saved it to GitHub.
        - Line route (https://www.ttc.ca/routes-and-schedules/listroutes/streetcar): no file is available, only information online, so we manually created the dataset as a .csv file and saved it to GitHub.
    
    - Load:
        - Load the data into the SQLite database.
        - From the SQLite database, load the data into a pandas DataFrame.
   
    - Transform:
        - Join the datasets to prepare for analysis.
        - Perform some feature engineering to prepare for analysis.
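
The extract, load, and transform steps listed above can be sketched end to end as follows. This is a minimal illustration, not the project's actual loading script: the CSV content is made-up sample data, and an in-memory SQLite database stands in for the project's .db file.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical stand-in for a manually created holiday .csv file (sample rows only)
holiday_csv = io.StringIO(
    "date,holidayType\n"
    "2023-01-01,New Year's Day\n"
    "2023-07-01,Canada Day\n"
)

# Extract: read the raw file into a DataFrame
holidays = pd.read_csv(holiday_csv)

# Load: write the raw rows to SQLite (in-memory here; the project uses a .db file)
conn = sqlite3.connect(":memory:")
try:
    holidays.to_sql("Date", conn, if_exists="replace", index=False)

    # Read the table back into pandas before transforming
    date_df = pd.read_sql("SELECT * FROM Date", conn)
finally:
    conn.close()

# Transform: parse the text dates so they can be joined on later
date_df["date"] = pd.to_datetime(date_df["date"])
print(date_df.dtypes)
```

Loading the raw rows first and parsing dates afterwards is what makes this ELT rather than ETL: the database keeps the data as collected, and the transformations happen in pandas.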
   


In [1]:
# Import necessary libraries for this notebook: 

# Read from SQLite database and load to a pandas dataframe
import os
import sqlite3
import pandas as pd

# For using arrays 
import numpy as np

# For ML work (data preprocessing, hyperparameter tuning, Random Forest Classifier, training & testing sets, and stratified sampling)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder 
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# For model evaluation, including explainability:  
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
import statsmodels.api as sm
import shap

# For data visualization 
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic

# For saving the model into a pkl file
import joblib





# *DATA COLLECTION & ELT*

**LOADING**

In [2]:
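# Function to load a full table from a SQLite database into a DataFrame
def load_from_db(db_name, table_name):
    conn = sqlite3.connect(db_name)
    try:
        # table_name is interpolated directly, so pass only trusted table names
        query = f'SELECT * FROM {table_name}'
        df = pd.read_sql(query, conn)
    finally:
        # Close the connection even if the query fails
        conn.close()

    return df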
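
Because the loader places the table name directly into the SQL string, one optional safeguard is to confirm that the table exists before querying it. The sketch below is an assumption about how that could look, not part of the original notebook; `load_table_checked` and the demo database are hypothetical.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def load_table_checked(db_name, table_name):
    """Load a whole table, but only if it exists in the database (hypothetical helper)."""
    conn = sqlite3.connect(db_name)
    try:
        # Look the name up with a bound parameter (exact, case-sensitive match)
        found = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table_name,),
        ).fetchone()
        if found is None:
            raise ValueError(f"Table not found: {table_name}")
        # Quote the verified name before placing it in the query
        return pd.read_sql(f'SELECT * FROM "{found[0]}"', conn)
    finally:
        conn.close()


# Demo on a throwaway database with one sample row
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE Line (lineId INTEGER, lineName TEXT)")
conn.execute("INSERT INTO Line VALUES (504, 'King')")
conn.commit()
conn.close()

demo_df = load_table_checked(demo_path, "Line")
print(demo_df)
```

A misspelled table name then fails with a clear `ValueError` instead of a database error raised from inside pandas.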

In [3]:
# Set working directory to the notebook's directory
os.chdir(r"C:\Users\DELL\OneDrive\Desktop\SCHOOL\team24_ly\team24_ly")

# Now define your base directory relative to this location
base_dir = os.path.abspath(os.path.join(os.getcwd(), 'data'))
db_name = os.path.join(base_dir, 'streetcardelaydb2.db')

print(f"Database path: {db_name}")

# Check if the database file exists
if not os.path.exists(db_name):
    raise FileNotFoundError(f"Database file not found: {db_name}")

# Load data
table_name = 'Streetcar_Delay_Data'
df = load_from_db(db_name, table_name)


Database path: C:\Users\DELL\OneDrive\Desktop\SCHOOL\team24_ly\team24_ly\data\streetcardelaydb2.db
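
The cell above depends on an absolute path that exists on only one machine. As a possible alternative, the sketch below searches the current directory and its parents for a `data` folder containing the database. The helper `find_db` is hypothetical, and the temporary folders exist only to make the example runnable.

```python
import tempfile
from pathlib import Path


def find_db(filename, start=None, data_dir="data"):
    """Search start and its parents for data_dir/filename (hypothetical helper)."""
    start = Path(start or Path.cwd()).resolve()
    for folder in (start, *start.parents):
        candidate = folder / data_dir / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"{filename} not found in a '{data_dir}' folder at or above {start}"
    )


# Demo: a throwaway project layout with data/ next to notebooks/
root = Path(tempfile.mkdtemp())
(root / "data").mkdir()
(root / "data" / "streetcardelaydb2.db").touch()
(root / "notebooks").mkdir()

found = find_db("streetcardelaydb2.db", start=root / "notebooks")
print(found)
```

With a helper like this, each team member could run the notebook from their own fork without editing the path.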


**TRANSFORMING**

In [4]:
# Convert incident_date to datetime format
df['incident_date'] = pd.to_datetime(df['incident_date'])

# Load Date table to get holidayType columns
date_table_name = 'Date'
conn = sqlite3.connect(db_name)
date_df = pd.read_sql_query(f'SELECT * FROM {date_table_name}', conn)
date_df['date'] = pd.to_datetime(date_df['date'])

# Merge Date table with Streetcar_Delay_Data table on incident_date
df = df.merge(date_df[['date', 'holidayType']], left_on='incident_date', right_on='date', how='left')
df.drop(columns=['date'], inplace=True)

# Load Season table to get seasonType column
season_table_name = 'Season'
season_df = pd.read_sql_query(f'SELECT * FROM {season_table_name}', conn)
season_df['date'] = pd.to_datetime(season_df['date'])

# Merge Season table with Streetcar_Delay_Data table on incident_date
df = df.merge(season_df[['date', 'season']], left_on='incident_date', right_on='date', how='left')
df.rename(columns={'season': 'seasonType'}, inplace=True)
df.drop(columns=['date'], inplace=True)

# Load Line table to get lineId and lineName (no lineType)
line_table_name = 'Line'
line_df = pd.read_sql_query(f'SELECT * FROM {line_table_name}', conn)

# Merge the dataframes on lineId
df = df.merge(line_df[['lineId', 'lineName']], left_on='line', right_on='lineId', how='left')

# Load Delay table to get delayType
delay_table_name = 'Delay'
delay_df = pd.read_sql_query(f'SELECT * FROM {delay_table_name}', conn)

# Function to determine delayType
def get_delay_type(min_delay):
    for _, row in delay_df.iterrows():
        if row['delayFrom'] <= min_delay <= row['delayTo']:
            return row['delayId']
    return None

# Apply the function to determine delayType
df['delayType'] = df['min_delay'].apply(get_delay_type)

# Close the database connection
conn.close()

# Display the DataFrame
df.head()

| Unnamed: 0 | incident_date | line | incident_time | day_of_week | location | incident | min_delay | min_gap | bound | vehicle | holidayType | seasonType | lineId | lineName | delayType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 505 | 02:40 | Sunday | BROADVIEW AND GERRARD | Held By | 15 | 25 | W | 4460 | New Year's Day | Winter 2023 | 505 | Dundas | 2 |
| 1 | 2023-01-01 | 504 | 02:52 | Sunday | KING AND BATHURST | Cleaning - Unsanitary | 10 | 20 | W | 4427 | New Year's Day | Winter 2023 | 504 | King | 2 |
| 2 | 2023-01-01 | 504 | 02:59 | Sunday | KING AND BATHURST | Held By | 25 | 35 | E | 4560 | New Year's Day | Winter 2023 | 504 | King | 3 |
| 3 | 2023-01-01 | 510 | 05:38 | Sunday | SPADINA AND DUNDAS | Security | 15 | 30 | S | 4449 | New Year's Day | Winter 2023 | 510 | Spadina | 2 |
| 4 | 2023-01-01 | 506 | 06:35 | Sunday | OSSINGTON STATION | Security | 10 | 20 |  | 8706 | New Year's Day | Winter 2023 | 506 | Carlton | 2 |
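
The delay-type lookup in the cell above scans the Delay table once per row. For larger datasets, a vectorized version could use a pandas `IntervalIndex`, as in the sketch below. The Delay bounds and sample delays are invented for illustration (the real bounds live in the SQLite database), and this approach assumes the ranges do not overlap; with overlapping ranges `get_indexer` raises an error.

```python
import pandas as pd

# Hypothetical Delay table; the real bounds are stored in the SQLite database
delay_df = pd.DataFrame({
    "delayId": [1, 2, 3],
    "delayFrom": [0, 6, 21],
    "delayTo": [5, 20, 60],
})
df = pd.DataFrame({"min_delay": [15, 10, 25, 3, 500]})

# Closed on both sides to mirror delayFrom <= min_delay <= delayTo
bins = pd.IntervalIndex.from_arrays(
    delay_df["delayFrom"], delay_df["delayTo"], closed="both"
)

# get_indexer returns the position of the matching interval, or -1 for no match
pos = bins.get_indexer(df["min_delay"])
matched = pd.Series(pos >= 0, index=df.index)

# Map positions to delayId, then blank out rows that matched no interval
ids = delay_df["delayId"].to_numpy()
df["delayType"] = pd.Series(ids[pos], index=df.index, dtype="Int64").where(matched)
print(df)
```

Rows whose delay falls outside every range end up with a missing `delayType`, which matches the `None` returned by the row-by-row function.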


In [7]:
# Save df as "df_prelim" parquet file in 'data' folder
relative_path = os.path.join("data", "df_prelim.parquet")
df.to_parquet(relative_path, index=False)


---