# Unsupervised ML

This notebook loads the data and then runs time-series k-means clustering on pickup counts for the following:

1. Pickups in Chicago
2. Pickups in Hyde Park (pre-program)
3. Pickups in Hyde Park (program)

Here's the Apache Spark documentation I'll be drawing on:

https://spark.apache.org/docs/latest/ml-clustering.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html

And here are the articles that helped me out:

https://www.influxdata.com/blog/why-use-k-means-for-time-series-data-part-one/
https://www.influxdata.com/blog/why-use-k-means-for-time-series-data-part-two/
https://www.influxdata.com/blog/why-use-k-means-for-time-series-data-part-three/
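
To make "time-series k-means by count" concrete, here is a minimal sketch of one possible way to build the series: one 24-slot vector of pickup counts per area, which k-means could then cluster. The function name `hourly_count_vectors`, the toy records, and the hourly granularity are all assumptions for illustration; this notebook does not fix a granularity.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_count_vectors(pickups):
    """Turn (area, timestamp) records into one 24-slot pickup-count vector per area."""
    counts = defaultdict(Counter)
    for area, ts in pickups:
        counts[area][ts.hour] += 1
    # Counter returns 0 for hours with no pickups, so every vector has 24 entries.
    return {area: [by_hour[h] for h in range(24)] for area, by_hour in counts.items()}


# Toy records with arbitrary area ids (hypothetical, not taken from the rideshare data).
pickups = [
    (41, datetime(2019, 5, 1, 8, 15)),
    (41, datetime(2019, 5, 1, 8, 45)),
    (41, datetime(2019, 5, 2, 17, 30)),
    (8, datetime(2019, 5, 1, 23, 5)),
]
vectors = hourly_count_vectors(pickups)
```

Each vector is a fixed-length "shape of the day" for one area, so areas with similar daily demand patterns end up close together under Euclidean distance.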

In [1]:
# Import packages and create the Spark environment
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import matplotlib.pyplot as plt
%matplotlib inline
import geopandas as gpd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName('unsupervised').getOrCreate()

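# Change configuration settings on Spark.
# Note: memory and core settings applied after the session has already started
# may not resize the running application; set them on the builder to be safe.
conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
    ('spark.cores.max', '4'),
    ('spark.driver.memory', '4g'),
])

# Print Spark configuration settings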
spark.sparkContext.getConf().getAll()

:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-224c96f9-880e-4e05-9ecd-ae5053fbd4b8;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;4.4.0 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	found com.google.code.gson#gson;2.3 in central
	found it.unimi.dsi#fastutil;7.0.12 in central
	found org.projectlombok#lombok;1.16.8 in central
	found com.google.cloud#google-cloud-storage;2.16.0 in central
	found com.google.guava#guava;31.1-jre in centra

[('spark.stage.maxConsecutiveAttempts', '10'),
 ('spark.dynamicAllocation.minExecutors', '1'),
 ('spark.eventLog.enabled', 'true'),
 ('spark.submit.pyFiles',
  '/root/.ivy2/jars/com.johnsnowlabs.nlp_spark-nlp_2.12-4.4.0.jar,/root/.ivy2/jars/graphframes_graphframes-0.8.2-spark3.1-s_2.12.jar,/root/.ivy2/jars/com.typesafe_config-1.4.2.jar,/root/.ivy2/jars/org.rocksdb_rocksdbjni-6.29.5.jar,/root/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.828.jar,/root/.ivy2/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar,/root/.ivy2/jars/com.google.cloud_google-cloud-storage-2.16.0.jar,/root/.ivy2/jars/com.navigamez_greex-1.0.jar,/root/.ivy2/jars/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.4.4.jar,/root/.ivy2/jars/it.unimi.dsi_fastutil-7.0.12.jar,/root/.ivy2/jars/org.projectlombok_lombok-1.16.8.jar,/root/.ivy2/jars/com.google.guava_guava-31.1-jre.jar,/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar,/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-

### Reading in cleaned data, partitioning

In [2]:
# Read in rideshare data for each year and concatenate; partitioning is checked and fixed below.
# 2020 is excluded because COVID-era ridership would distort the model.

df_2018 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2018.csv", inferSchema=True, header=True)
df_2019 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2019.csv", inferSchema=True, header=True)
df_2021 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2021.csv", inferSchema=True, header=True)
df_2022 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2022.csv", inferSchema=True, header=True)
df_2023 = spark.read.csv("gs://msca-bdp-student-gcs/bdp-rideshare-project/rideshare/processed_data/rides_2023.csv", inferSchema=True, header=True)

# Drop the columns that only exist in the 2023 file so its schema lines up with the other years for union()
df_2023 = df_2023.drop('Shared Trip Match','Percent Time Chicago','Percent Distance Chicago')

df_all = df_2018.union(df_2019).union(df_2021).union(df_2022).union(df_2023)
df_all.show(5)


+--------------------+-------------------+-------------------+-------+-----+------------+-------------+-----------+------------+----+---+-----+-------------+--------------+-------------+--------------+-----+------------+----+---+
|                  ID|    start_timestamp|      end_timestamp|seconds|miles|pickup_tract|dropoff_tract|pickup_area|dropoff_area|Fare|Tip|total|   pickup_lat|    pickup_lon|  dropoff_lat|   dropoff_lon|month|day_of_month|hour|day|
+--------------------+-------------------+-------------------+-------+-----+------------+-------------+-----------+------------+----+---+-----+-------------+--------------+-------------+--------------+-----+------------+----+---+
|625e77ae6e0ff7191...|2018-11-06 19:00:00|2018-11-06 19:15:00|   1142|  5.8| 17031063400|  17031010400|          6|           1|12.5|  0| 15.0|41.9346591566|-87.6467297286| 42.004764559| -87.659122427|   11|           6|  19|  3|
|62945fdb2e70957f0...|2018-11-06 19:00:00|2018-11-06 19:00:00|    341|  1.2| 170

In [3]:
# Display the number of records in each partition
def displaypartitions(df):
    """Print the partition count, then the record count per partition (ascending)."""
    num = df.rdd.getNumPartitions()
    print("Partitions:", num)
    df.withColumn("partitionId", F.spark_partition_id())\
        .groupBy("partitionId")\
        .count()\
        .orderBy(F.asc("count"))\
        .show(num)

displaypartitions(df_all)

Partitions: 534




+-----------+------+
|partitionId| count|
+-----------+------+
|         33|152646|
|        233|328837|
|        232|328975|
|        231|329131|
|        230|329163|
|        229|329209|
|        227|329245|
|        228|329263|
|        225|329263|
|        224|329311|
|        226|329315|
|        222|329332|
|        223|329344|
|        221|329373|
|        218|329389|
|        219|329390|
|        217|329399|
|        215|329410|
|        216|329410|
|        214|329418|
|        220|329427|
|        213|329428|
|        210|329461|
|        212|329481|
|        211|329505|
|        207|329507|
|        208|329513|
|        209|329519|
|        206|329523|
|        204|329533|
|        203|329555|
|        205|329574|
|        201|329587|
|        202|329591|
|        198|329607|
|        200|329623|
|        196|329624|
|        199|329630|
|        197|329633|
|        195|329646|
|        192|329654|
|        194|329673|
|        193|329678|
|        184|329704|
|        191|


In [4]:
# Repartition to 600 partitions; the counts below show the partitions are now balanced.
df_all = df_all.repartition(600)
displaypartitions(df_all)



Partitions: 600




+-----------+------+
|partitionId| count|
+-----------+------+
|         26|362149|
|         27|362149|
|         24|362149|
|         25|362149|
|         28|362150|
|         33|362150|
|         30|362150|
|         35|362150|
|         31|362151|
|         29|362151|
|         32|362151|
|         64|362151|
|         34|362151|
|         63|362152|
|         57|362152|
|         62|362152|
|         56|362152|
|         37|362152|
|         65|362152|
|         74|362152|
|         55|362152|
|         36|362152|
|         49|362153|
|         20|362153|
|         54|362153|
|         73|362153|
|         19|362153|
|         75|362153|
|         44|362153|
|         58|362153|
|         71|362153|
|         61|362153|
|         66|362153|
|         72|362153|
|         67|362153|
|         59|362153|
|         39|362153|
|         40|362154|
|         69|362154|
|         45|362154|
|         50|362154|
|         47|362154|
|        160|362154|
|         70|362154|
|         48|
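
To put a number on "balanced", here is a small sketch that compares the smallest and largest partition. The helper `balance_ratio` is hypothetical, and the two lists hold only a few of the counts visible in the truncated outputs above, so the ratios are indicative rather than exact for all partitions.

```python
def balance_ratio(counts):
    """Smallest partition size divided by the largest (1.0 means perfectly even)."""
    return min(counts) / max(counts)


# A few counts copied from the visible rows of the two outputs above.
before_sample = [152646, 328837, 328975, 329131]  # 534 partitions
after_sample = [362149, 362150, 362153, 362154]   # 600 partitions

ratio_before = balance_ratio(before_sample)
ratio_after = balance_ratio(after_sample)
```

On these samples the ratio goes from under 0.5 to essentially 1.0, which matches the visual impression from the tables.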


In [5]:
# The model will need a year column, derived from start_timestamp:
df_all = df_all.withColumn('year', F.year(df_all.start_timestamp))

## Next steps

In [6]:
# Check packages:
%pip freeze

access @ file:///home/conda/feedstock_root/build_artifacts/access_1696558639912/work
affine @ file:///home/conda/feedstock_root/build_artifacts/affine_1674245120525/work
aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1696765416168/work
aiosignal @ file:///home/conda/feedstock_root/build_artifacts/aiosignal_1667935791922/work
alabaster @ file:///home/conda/feedstock_root/build_artifacts/alabaster_1673645646525/work
alembic @ file:///home/conda/feedstock_root/build_artifacts/alembic_1698347477885/work
amply @ file:///home/conda/feedstock_root/build_artifacts/amply_1687675480808/work
ansiwrap==0.8.4
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1688651106312/work/dist
appdirs @ file:///home/conda/feedstock_root/build_artifacts/appdirs_1603108395799/work
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1692818318753/work
argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_16953865480

In [7]:
df_all.printSchema()

root
 |-- ID: string (nullable = true)
 |-- start_timestamp: timestamp (nullable = true)
 |-- end_timestamp: timestamp (nullable = true)
 |-- seconds: integer (nullable = true)
 |-- miles: double (nullable = true)
 |-- pickup_tract: long (nullable = true)
 |-- dropoff_tract: long (nullable = true)
 |-- pickup_area: integer (nullable = true)
 |-- dropoff_area: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Tip: integer (nullable = true)
 |-- total: double (nullable = true)
 |-- pickup_lat: double (nullable = true)
 |-- pickup_lon: double (nullable = true)
 |-- dropoff_lat: double (nullable = true)
 |-- dropoff_lon: string (nullable = true)
 |-- month: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- year: integer (nullable = true)



## Clustering Analysis:
### First, we run time-series k-means clustering on the entire City of Chicago.

In [8]:
df_all = df_all.na.drop()

In [9]:
# Cluster by pickup area to find the most popular spots in the city to call a rideshare, and their locations:
feature_cols = ["pickup_area", "pickup_lat", "pickup_lon"]

# Step 1: Vector assembly (a single assembler writes straight to "features"):
feature_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Step 2: Normalization (scale each feature to unit standard deviation, no centering):
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=False)

# Step 3: K-means clustering on the scaled features:
kmeans = KMeans(k=3, seed=1, featuresCol="scaled_features", predictionCol="prediction")

# Step 4: Model training:
pipeline = Pipeline(stages=[feature_assembler, scaler, kmeans])
model = pipeline.fit(df_all)

# Step 5: Prediction:
predictions = model.transform(df_all)

# Evaluate clustering by computing the Silhouette score.
# Note: ClusteringEvaluator reads the "features" column by default, so this score is
# computed on the unscaled features, while k-means was fit on "scaled_features".
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

23/11/24 23:44:31 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
23/11/24 23:44:32 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

Silhouette with squared euclidean distance = 0.786760389836245
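
For intuition about what this number measures, here is a minimal pure-Python sketch of the textbook silhouette coefficient with squared Euclidean distance. It is not Spark's implementation (Spark computes the score from precomputed cluster statistics), and the toy points and labels are made up for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(points, labels):
    """Mean silhouette over all points; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(sq_dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(toy_points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette(toy_points, [0, 1, 0, 1])   # clusters that mix the two groups
```

Scores near 1 mean points sit much closer to their own cluster than to any other; scores near or below 0 mean the assignment is no better than a neighbouring cluster.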



In [10]:
# Print the cluster centers (in scaled-feature units, since k-means was fit on scaled_features)
centers = model.stages[-1].clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[ 1.13637426e+00  8.69572453e+02 -1.58903761e+03]
[    4.19114642   871.29986067 -1593.60698384]
[    4.19114642   871.2953187  -1593.62644601]
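
These centers are in scaled units because k-means ran on `scaled_features`. Since the scaler uses `withMean=False`, a center can be mapped back to the original units by multiplying each coordinate by that feature's standard deviation. The sketch below shows the arithmetic; `unscale_center` is a hypothetical helper and the standard deviations in the example are placeholders, not values from this dataset.

```python
def unscale_center(scaled_center, std):
    """Undo std-only scaling (no centering): original = scaled * std, per feature."""
    return [c * s for c, s in zip(scaled_center, std)]


# Placeholder numbers for illustration only. In this notebook the real std vector
# should be available from the fitted scaler stage, e.g. model.stages[-2].std
# (assumption: the scaler is the second-to-last pipeline stage).
example_center = [2.0, 3.0]
example_std = [0.5, 2.0]
original_units = unscale_center(example_center, example_std)
```

Applied to the real centers, this would turn the latitude and longitude coordinates back into degrees, which makes the clusters much easier to sanity-check on a map.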


In [11]:
# Select the columns needed to inspect the cluster assignments
chicago_clustering = predictions.select("pickup_area", "pickup_lat", "pickup_lon", "features", "scaled_features", "prediction")

In [12]:
# Show the first 20 rows of the results:
chicago_clustering.show()



+-----------+-------------+--------------+--------------------+--------------------+----------+
|pickup_area|   pickup_lat|    pickup_lon|            features|     scaled_features|prediction|
+-----------+-------------+--------------+--------------------+--------------------+----------+
|          1|42.0016981937|-87.6735740325|[1.0,42.001698193...|[0.05514666339728...|         0|
|          6| 41.936159071|-87.6612652184|[6.0,41.936159071...|[0.33087998038371...|         0|
|         23|41.9066839592|-87.7103539349|[23.0,41.90668395...|[1.26837325813755...|         0|
|          6| 41.942577185|-87.6470785093|[6.0,41.942577185...|[0.33087998038371...|         0|
|          7|41.9217781876|-87.6510618838|[7.0,41.921778187...|[0.38602664378099...|         0|
|          3|41.9724370811|-87.6711095263|[3.0,41.972437081...|[0.16543999019185...|         0|
|         31|41.8561441046|-87.6489783241|[31.0,41.85614410...|[1.70954656531584...|         0|
|         32|41.8809944707|-87.632746488


In [13]:
# About 147.9 million rows remain after dropping the NAs:
print((chicago_clustering.count(), len(chicago_clustering.columns)))


(147900469, 6)


In [15]:
# TODO:

# Define the program area (from EDA)
# Run clustering on Hyde Park for the pre-program and program periods
# Grid search to find the optimal k (and features?) -- Ridhi
# Plot the results for all three
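
For the grid-search item, here is a rough sketch of one way it could look: fit the same scaling and k-means pipeline for each candidate k and keep the k with the highest silhouette. The names `silhouette_for_k` and `best_k` are hypothetical, and the toy scores are invented to show the selection step, not measured results. In this sketch the evaluator is pointed at `scaled_features` so the score is measured in the same space k-means clustered in; that is a choice of the sketch, not something the notebook has settled.

```python
def silhouette_for_k(df, k, feature_cols):
    """Fit scaling + k-means for one k and return its silhouette (requires PySpark)."""
    # Imported here so the selection helper below can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    stages = [
        VectorAssembler(inputCols=feature_cols, outputCol="features"),
        StandardScaler(inputCol="features", outputCol="scaled_features"),
        KMeans(k=k, seed=1, featuresCol="scaled_features"),
    ]
    predictions = Pipeline(stages=stages).fit(df).transform(df)
    return ClusteringEvaluator(featuresCol="scaled_features").evaluate(predictions)


def best_k(score_fn, candidate_ks):
    """Score every candidate k and return (best k, all scores)."""
    scores = {k: score_fn(k) for k in candidate_ks}
    return max(scores, key=scores.get), scores


# Invented scores, only to demonstrate the selection logic.
toy_scores = {2: 0.61, 3: 0.79, 4: 0.72}
chosen_k, all_scores = best_k(toy_scores.get, [2, 3, 4])
```

With Spark available, the call might look like `best_k(lambda k: silhouette_for_k(df_all.sample(fraction=0.01, seed=1), k, feature_cols), range(2, 9))`. Sampling is my suggestion to keep each fit cheap on ~148 million rows, and the fraction is arbitrary.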