# Random Forest Training on NYC Turnstile Data using PySpark & MongoDB

This notebook loads raw turnstile entry/exit counts from a MongoDB collection (`mta_db.raw_turnstile`), performs feature engineering (timestamp parsing, station indexing), and trains two separate Random Forest regression models to predict:

- `ENTRIES` for each combination of station, hour of day, and day of week
- `EXITS` for each combination of station, hour of day, and day of week

The trained models are saved locally and later used in streaming inference to predict foot traffic in real time.


In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import Pipeline

## Initialize Spark Session & Load Raw Turnstile Data from MongoDB

In [4]:
# Step 1: Create Spark Session
spark = SparkSession.builder \
    .appName("Train_MTA_RF_Model") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .getOrCreate()

# Step 2: Load raw (non-aggregated) turnstile data from MongoDB
raw_df = spark.read \
    .format("mongo") \
    .option("uri", "mongodb://localhost:27017/mta_db.raw_turnstile") \
    .load()

# Normalize label column names: rename ENTRY_COUNT/EXIT_COUNT to ENTRIES/EXITS when both are present
if "ENTRY_COUNT" in raw_df.columns and "EXIT_COUNT" in raw_df.columns:
    raw_df = raw_df.withColumnRenamed("ENTRY_COUNT", "ENTRIES").withColumnRenamed("EXIT_COUNT", "EXITS")


- Creates a Spark session with the MongoDB Spark connector so we can load training data directly from the `raw_turnstile` collection.

- Reads subway turnstile data, ingested in real time, from MongoDB (`mta_db.raw_turnstile`). The collection contains raw entry/exit counts per station.


## Feature Engineering

In [7]:
from pyspark.sql.functions import hour, dayofweek, to_timestamp, concat_ws

# Combine DATE and TIME into a single timestamp column
raw_df = raw_df.withColumn(
    "datetime", 
    to_timestamp(concat_ws(' ', raw_df["DATE"], raw_df["TIME"]), "MM/dd/yyyy HH:mm:ss")
)

# Extract hour of day and day of week from timestamp
raw_df = raw_df.withColumn("hour", hour("datetime"))
raw_df = raw_df.withColumn("day_of_week", dayofweek("datetime"))
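
To make the two derived columns concrete, here is a minimal pure-Python sketch of what the timestamp step is expected to produce for a single row. It assumes the `MM/dd/yyyy HH:mm:ss` pattern used above and mirrors Spark's `dayofweek` convention, in which 1 is Sunday and 7 is Saturday. The helper name `spark_style_features` and the sample date are hypothetical and not part of the notebook.

```python
from datetime import datetime

def spark_style_features(date_str: str, time_str: str) -> dict:
    """Hypothetical helper: reproduce `hour` and `day_of_week` for one row."""
    # strptime equivalent of Spark's "MM/dd/yyyy HH:mm:ss" pattern
    ts = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M:%S")
    # isoweekday(): Monday=1 ... Sunday=7
    # Spark dayofweek(): Sunday=1 ... Saturday=7
    day_of_week = ts.isoweekday() % 7 + 1
    return {"datetime": ts, "hour": ts.hour, "day_of_week": day_of_week}

# Made-up sample row: 7 January 2024 was a Sunday
sample = spark_style_features("01/07/2024", "16:00:00")
print(sample["hour"], sample["day_of_week"])
```

The Sunday-first numbering matters when reading predictions later: a `day_of_week` of 2 means Monday, not Tuesday.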


In [8]:
# Step 3: Feature Engineering
indexer = StringIndexer(inputCol="STATION", outputCol="station_index")
assembler = VectorAssembler(inputCols=["station_index", "hour", "day_of_week"], outputCol="features")

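`StringIndexer` converts each station name into a numeric index (`station_index`), and `VectorAssembler` combines `station_index`, `hour`, and `day_of_week` into a single `features` vector. Both stages are only defined here; they are fitted later as part of each training pipeline.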
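
As a rough illustration of the indexing step, the sketch below mimics `StringIndexer`'s default `frequencyDesc` ordering in plain Python: the most frequent station receives index 0. The station names and rows are made up, `index_by_frequency` is a hypothetical helper, and the alphabetical tie-break is an assumption about recent Spark versions rather than something shown in this notebook.

```python
from collections import Counter

def index_by_frequency(labels):
    """Hypothetical helper: map each label to an index, most frequent first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties (assumed)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up station column
stations = ["TIMES SQ", "CANAL ST", "TIMES SQ", "BOWERY", "CANAL ST", "TIMES SQ"]
mapping = index_by_frequency(stations)

# Assemble [station_index, hour, day_of_week], the same order as the assembler
rows = [("TIMES SQ", 8, 2), ("BOWERY", 17, 6)]
features = [[mapping[s], hour, dow] for s, hour, dow in rows]

print(mapping)
print(features)
```

One thing to watch, stated as general Spark ML behavior rather than something observed in this run: the indexed station column is treated as categorical by the tree learner, so `maxBins` (default 32) must be at least the number of distinct stations, otherwise fitting fails.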

In [10]:

# Step 5: Train model for ENTRIES
if "ENTRIES" in raw_df.columns:
    rf_entries = RandomForestRegressor(featuresCol="features", labelCol="ENTRIES")
    pipeline_entries = Pipeline(stages=[indexer, assembler, rf_entries])
    model_entries = pipeline_entries.fit(raw_df)
    rf_model_entries = model_entries.stages[-1]
    print("ENTRIES Model:")
    print(f" - Number of Trees: {rf_model_entries.getNumTrees}")
    print(f" - Feature Importances: {rf_model_entries.featureImportances}")
    print(f" - Tree Weights: {rf_model_entries.treeWeights}")
    model_entries.write().overwrite().save("/Users/gopalakrishnaabba/mta_rf_entries_model")
else:
    print("Column 'ENTRIES' not found in dataset.")

# Step 6: Train model for EXITS
if "EXITS" in raw_df.columns:
    rf_exits = RandomForestRegressor(featuresCol="features", labelCol="EXITS")
    pipeline_exits = Pipeline(stages=[indexer, assembler, rf_exits])
    model_exits = pipeline_exits.fit(raw_df)
    rf_model_exits = model_exits.stages[-1]
    print("\nEXITS Model:")
    print(f" - Number of Trees: {rf_model_exits.getNumTrees}")
    print(f" - Feature Importances: {rf_model_exits.featureImportances}")
    print(f" - Tree Weights: {rf_model_exits.treeWeights}")
    model_exits.write().overwrite().save("/Users/gopalakrishnaabba/mta_rf_exits_model")
else:
    print("Column 'EXITS' not found in dataset.")

ENTRIES Model:
 - Number of Trees: 20
 - Feature Importances: (3,[0,1],[0.6889107997005658,0.3110892002994342])
 - Tree Weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

EXITS Model:
 - Number of Trees: 20
 - Feature Importances: (3,[0,1],[0.7179336153302223,0.2820663846697776])
 - Tree Weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
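
The feature importances above are printed as a sparse vector in the form `(size, [indices], [values])`. The sketch below hard-codes the values printed for the ENTRIES model and maps them back to feature names in the order given to `VectorAssembler`. The helper `expand_sparse` is hypothetical and only illustrates how to read the output; in PySpark, `featureImportances.toArray()` gives the dense form directly.

```python
# Feature order as passed to VectorAssembler in the notebook
FEATURE_NAMES = ["station_index", "hour", "day_of_week"]

def expand_sparse(size, indices, values):
    """Hypothetical helper: expand (size, indices, values) into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Values copied from the ENTRIES model output above
entries_importances = expand_sparse(
    3, [0, 1], [0.6889107997005658, 0.3110892002994342]
)
by_name = dict(zip(FEATURE_NAMES, entries_importances))

for name, score in by_name.items():
    print(f"{name}: {score:.3f}")
```

Read this way, `day_of_week` has an importance of 0 in both models, meaning no tree split on it. One possible explanation (an assumption, not verified here) is that the column is constant or null in the training data, which is worth checking before relying on the models.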
