# 🎵 Spotify + 🚙 Uber Data Connection

I have been a full-time Uber driver in Chicago for the last year, and during that time the music I play with Spotify has kept me going and entertained my passengers. When I found out that I could get [my listening history from Spotify](https://www.spotify.com/us/account/privacy/) for the past year, I wondered if I could discover what music I played for particular pickup or dropoff areas of the city, at particular times of day and days of the week.

The first step was to request the past year's listening history from Spotify, which I did at the link above. I had to wait 4 days to get mine. Each streaming history item has four fields. Here's an example:

```
  {
    "endTime" : "2023-09-12 23:59",
    "artistName" : "Aphex Twin",
    "trackName" : "Ageispolis",
    "msPlayed" : 96352
  },
```

The second step was to get my Uber trip data. I had previously solved this by reverse engineering how the Uber Drivers page is built, simulating the browser's data request in Postman, and then building a scraper around the sample Python code that Postman generates for the request. Here's an example of a trip:

```
{
    "uuid": "UUID_REDACTED",
    "recognizedAt": 1673809557,
    "activityTitle": "Comfort",
    "formattedTotal": "$10.72",
    "routing": {
        "webviewUrl": "https://drivers.uber.com/earnings/trips/UUID_REDACTED",
        "deeplinkUrl": null
    },
    "breakdownDetails": {
        "formattedTip": "$1.00",
        "formattedSurge": null
    },
    "tripMetaData": {
        "formattedDuration": "15 min 56 sec",
        "formattedDistance": "3.9 mi",
        "pickupAddress": "N REDACTED, Chicago, IL 60614-1101, US",
        "dropOffAddress": "W REDACTED, Chicago, IL 60612, US",
        "mapUrl": "https://static-maps.uber.com/map?width=360&height=100&marker=REDACTED"
    },
    "type": "TRIP",
    "status": "COMPLETED"
}
```

After some cleanup (parsing `recognizedAt` as a UNIX timestamp, converting the formatted strings to numbers, and dropping unnecessary fields), I ended up with records like this:

```
{
    "uuid": "UUID_REDACTED",
    "date": "2023-01-15",
    "time": "13:05:57",
    "timestamp": 1673809557,
    "day": "Sunday",
    "day_of_week": 6,
    "sortable_day_of_week": "6 - Sunday",
    "season": "Winter",
    "type": "Comfort",
    "earnings": 10.72,
    "tip": 1.0,
    "surge": 0.0,
    "duration": 956,
    "distance": 3.9,
    "pickup_address": "N REDACTED, Chicago, IL 60614-1101, US",
    "dropoff_address": "W REDACTED, Chicago, IL 60612, US",
    "earnings-surge": 10.72,
    "earnings/second": 0.011213389121338914,
    "earnings/mile": 2.7487179487179487,
    "pickup_zipcode": "60614",
    "dropoff_zipcode": "60612"
}
```
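
The cleanup code itself isn't shown in this post. As a rough sketch of what such a step could look like, the hypothetical `clean_trip` helper below derives a subset of the fields above from one raw trip. The helper names, the `hr` duration unit, and the fixed UTC-6 offset are all assumptions for illustration; a real script would more likely use `zoneinfo.ZoneInfo('America/Chicago')` so that daylight saving time is handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: a fixed Central Standard Time offset (ignores daylight saving).
CHICAGO_WINTER = timezone(timedelta(hours=-6))

def parse_money(text):
    """'$10.72' -> 10.72; a missing (null) amount becomes 0.0."""
    return float(text.replace('$', '').replace(',', '')) if text else 0.0

def parse_duration(text):
    """'15 min 56 sec' -> 956 seconds. The 'hr' unit is a guess at the format."""
    seconds_per_unit = {'hr': 3600, 'min': 60, 'sec': 1}
    parts = re.findall(r'(\d+)\s*(hr|min|sec)', text)
    return sum(int(number) * seconds_per_unit[unit] for number, unit in parts)

def parse_zipcode(address):
    """Pull the 5-digit ZIP that follows the two-letter state code."""
    match = re.search(r'\b[A-Z]{2} (\d{5})(?:-\d{4})?\b', address)
    return match.group(1) if match else None

def clean_trip(raw, tz=CHICAGO_WINTER):
    """Flatten one raw trip record into analysis-friendly fields."""
    local = datetime.fromtimestamp(raw['recognizedAt'], tz=tz)
    meta = raw['tripMetaData']
    earnings = parse_money(raw['formattedTotal'])
    duration = parse_duration(meta['formattedDuration'])
    return {
        'uuid': raw['uuid'],
        'date': local.strftime('%Y-%m-%d'),
        'time': local.strftime('%H:%M:%S'),
        'timestamp': raw['recognizedAt'],
        'day': local.strftime('%A'),
        'day_of_week': local.weekday(),  # Monday is 0, Sunday is 6
        'type': raw['activityTitle'],
        'earnings': earnings,
        'tip': parse_money(raw['breakdownDetails']['formattedTip']),
        'surge': parse_money(raw['breakdownDetails']['formattedSurge']),
        'duration': duration,
        'distance': float(meta['formattedDistance'].split()[0]),
        'pickup_zipcode': parse_zipcode(meta['pickupAddress']),
        'dropoff_zipcode': parse_zipcode(meta['dropOffAddress']),
        'earnings/second': earnings / duration if duration else None,
    }

# The sample trip from above, trimmed to the fields the sketch reads.
raw_trip = {
    'uuid': 'UUID_REDACTED',
    'recognizedAt': 1673809557,
    'activityTitle': 'Comfort',
    'formattedTotal': '$10.72',
    'breakdownDetails': {'formattedTip': '$1.00', 'formattedSurge': None},
    'tripMetaData': {
        'formattedDuration': '15 min 56 sec',
        'formattedDistance': '3.9 mi',
        'pickupAddress': 'N REDACTED, Chicago, IL 60614-1101, US',
        'dropOffAddress': 'W REDACTED, Chicago, IL 60612, US',
    },
}
cleaned = clean_trip(raw_trip)
print(cleaned['date'], cleaned['time'], cleaned['duration'])
```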

In [65]:
import glob
from pprint import pprint
import uuid
import buckaroo
import pandas as pd

def get_spotify_history():
    history_files = sorted(glob.glob('spotify/StreamingHistory_music_*.json'))
    history_dfs = [pd.read_json(file) for file in history_files]
    spotify_df = pd.concat(history_dfs, ignore_index=True)
    spotify_df['endTime'] = pd.to_datetime(spotify_df['endTime'])
    spotify_df['startTime'] = spotify_df['endTime'] - pd.to_timedelta(spotify_df['msPlayed'] / 1000, unit='s')
    spotify_df['uuid'] = spotify_df.apply(lambda _: uuid.uuid4(), axis=1)
    return spotify_df

def get_uber_history():
    trips_df = pd.read_json('uber/trips.json')
    # Change to using the same terminology as in the Spotify data, for readability
    trips_df.rename(columns={'timestamp': 'startTime'}, inplace=True)
    trips_df['endTime'] = trips_df['startTime'] + pd.to_timedelta(trips_df['duration'], unit='s')
    return trips_df

spotify = get_spotify_history()
uber = get_uber_history()

# Filter the uber trip data to be only the trips that I have Spotify history for
uber = uber[
    (uber['startTime'] > spotify['startTime'].min()) & 
    (uber['endTime'] < spotify['endTime'].max())
]

print(f"Retrieved {len(spotify)} songs from Spotify.")
print(f"Retrieved {len(uber)} trips from Uber.")

Retrieved 61710 songs from Spotify.
Retrieved 3326 trips from Uber.


In [66]:
spotify


In [67]:
uber


## Finding Songs per Trip


Let's see what this would look like. Taking the first Uber trip as a sample, I'll just brute-force search through all songs and find every song that started during this trip.

In [68]:
sample_trip = uber.head(1)
print(f"Trip picked up at {sample_trip.iloc[0]['startTime']}, dropped off at {sample_trip.iloc[0]['endTime']}")
print(f"Trip time: {sample_trip.iloc[0]['duration'] / 3600} hours")
print(f"Earned ${sample_trip.iloc[0]['earnings']}")

filtered_songs = spotify[
    (spotify['startTime'] >= sample_trip.iloc[0]['startTime']) & 
    (spotify['startTime'] < sample_trip.iloc[0]['endTime'])
]

filtered_songs

Trip picked up at 2023-09-17 02:56:09, dropped off at 2023-09-17 04:11:09
Trip time: 1.25 hours
Earned $71.52



## Songs for Every Trip

Even though I'm only doing this calculation once, I still like to do things efficiently. I don't want to just brute force the song search for every trip.

For simplicity's sake, I'm going to assume that each Spotify history item (a play of a particular song at a particular time) can correspond to at most one Uber trip. There are edge cases where a song started during one Uber ride and was still playing during the next, but we don't have to be exact here. I'll take the Spotify history item's `startTime` and say that if it falls after an Uber trip's `startTime` and before that trip's `endTime`, then it corresponds to that trip.

There will certainly be Spotify songs that don't correspond to an Uber trip. And there will certainly be some Uber rides that have no Spotify history, because I can recall passengers who requested silence. There will be no overlapping Uber trips, since I could only do one at a time.

Given these preconditions, we can calculate a single `trip_uuid` that a song may have, or it will be None.
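
Under these assumptions (in particular, trips that never overlap), the rule can also be expressed with `pandas.merge_asof`, which matches each song to the most recent trip that started at or before it. This is only a sketch on made-up toy data, not the approach used in the rest of this notebook; the `assign_trip` helper and the sample rows are hypothetical.

```python
import pandas as pd

def assign_trip(songs, trips):
    """Annotate each song with the uuid of the trip it started in, if any."""
    trip_cols = trips.rename(columns={
        'uuid': 'trip_uuid', 'startTime': 'tripStart', 'endTime': 'tripEnd',
    })[['trip_uuid', 'tripStart', 'tripEnd']].sort_values('tripStart')

    # For every song, find the latest trip that started at or before the song.
    merged = pd.merge_asof(
        songs.sort_values('startTime'), trip_cols,
        left_on='startTime', right_on='tripStart', direction='backward',
    )

    # Keep the match only if the song started before that trip ended.
    started_in_trip = merged['startTime'] < merged['tripEnd']
    merged['trip_uuid'] = merged['trip_uuid'].where(started_in_trip)
    return merged.drop(columns=['tripStart', 'tripEnd'])

# Toy data: one trip, and songs before, during, and after it.
toy_songs = pd.DataFrame({
    'trackName': ['before', 'during', 'after'],
    'startTime': pd.to_datetime([
        '2023-09-17 02:50', '2023-09-17 03:00', '2023-09-17 04:30',
    ]),
})
toy_trips = pd.DataFrame({
    'uuid': ['trip-1'],
    'startTime': pd.to_datetime(['2023-09-17 02:56']),
    'endTime': pd.to_datetime(['2023-09-17 04:11']),
})

matched = assign_trip(toy_songs, toy_trips)
print(matched[['trackName', 'trip_uuid']])
```

Only the song labelled `during` ends up with a `trip_uuid`; the other two are left empty. Once trips can overlap, a single backward match per song is no longer enough.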

Finding where the Spotify song play intervals overlap with the Uber trip intervals is an Interval Overlap problem and can be solved efficiently using a line sweep algorithm. 

I'll construct a single array of all the interval start and end events, along with the type of interval each one belongs to ('spotify' or 'uber') and the UUID of the record it represents. I'll sort this array by event time, putting start events ahead of end events when two share a timestamp. Then I'll step through the array, keeping track of which Uber interval is open, and annotate each song with that Uber trip's UUID.

In [69]:
events = []
for song in spotify.to_dict(orient='records'):
    events.append({
        'event_type': 'start',
        'uuid': song['uuid'],
        'type': 'spotify',
        'time': song['startTime'].to_pydatetime()
    })
    events.append({
        'event_type': 'end',
        'uuid': song['uuid'],
        'type': 'spotify',
        'time': song['endTime'].to_pydatetime()
    })

for trip in uber.to_dict(orient='records'):
    events.append({
        'event_type': 'start',
        'uuid': trip['uuid'],
        'type': 'uber',
        'time': trip['startTime'].to_pydatetime()
    })
    events.append({
        'event_type': 'end',
        'uuid': trip['uuid'],
        'type': 'uber',
        'time': trip['endTime'].to_pydatetime()
    })

events.sort(key=lambda v: (v['time'], v['event_type'] == 'end'))
# What's this look like?
pprint(events[:4])

[{'event_type': 'start',
  'time': datetime.datetime(2023, 9, 12, 23, 57, 23, 648000),
  'type': 'spotify',
  'uuid': UUID('a62a9e19-5f90-477c-8463-03aa6ac72328')},
 {'event_type': 'end',
  'time': datetime.datetime(2023, 9, 12, 23, 59),
  'type': 'spotify',
  'uuid': UUID('a62a9e19-5f90-477c-8463-03aa6ac72328')},
 {'event_type': 'start',
  'time': datetime.datetime(2023, 9, 13, 23, 0, 12, 534000),
  'type': 'spotify',
  'uuid': UUID('bec69c6d-156c-4228-a465-099eea2c1eae')},
 {'event_type': 'end',
  'time': datetime.datetime(2023, 9, 13, 23, 1),
  'type': 'spotify',
  'uuid': UUID('bec69c6d-156c-4228-a465-099eea2c1eae')}]


## 🪢 Uber Data Internal Overlap

Looking quickly at this Uber data, I see that there is some overlap that makes no sense. I shouldn't be able to start a trip before ending another one, but the data suggests that's what's happening. Let's see the full scope of the interval overlap in the Uber data.

In [70]:
trip_intervals = []
for trip in uber.to_dict(orient='records'):
    trip_intervals.append(('start', trip['startTime'], trip))
    trip_intervals.append(('end', trip['endTime'], trip))
trip_intervals = sorted(trip_intervals, key=lambda e: e[1])

# Traverse the events in time order and record each overlapping pair of trips
active_intervals = []
overlapping_intervals = []

for event_type, time, interval in trip_intervals:
    if event_type == 'start':
        # Check for overlaps with currently active intervals
        for active_interval in active_intervals:
            if active_interval['endTime'] > interval['startTime']:
                overlapping_intervals.append((active_interval, interval))
        # Add the current interval to active intervals
        active_intervals.append(interval)
    elif event_type == 'end':
        # Remove the interval from active intervals
        active_intervals.remove(interval)

print(f"There are {len(overlapping_intervals)} overlapping Uber trips.")    

There are 459 overlapping Uber trips.
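
Note that the loop above counts overlapping pairs, so one long trip that spans several short ones contributes more than once. As a quick cross-check, the number of individual trips that begin before some earlier trip has ended can be counted with a cumulative maximum. This is a sketch on toy data; `count_trips_starting_during_another` is a hypothetical helper, not part of the analysis above.

```python
import pandas as pd

def count_trips_starting_during_another(trips):
    """Count trips whose start falls before the latest end seen so far."""
    ordered = trips.sort_values('startTime')
    # Latest endTime among all trips that started earlier (NaT for the first).
    latest_end_so_far = ordered['endTime'].cummax().shift()
    return int((ordered['startTime'] < latest_end_so_far).sum())

# Toy data: B starts while A is still running; C starts exactly as B ends.
toy_trips = pd.DataFrame({
    'uuid': ['A', 'B', 'C'],
    'startTime': pd.to_datetime([
        '2023-09-17 10:00', '2023-09-17 10:20', '2023-09-17 10:40',
    ]),
    'endTime': pd.to_datetime([
        '2023-09-17 10:30', '2023-09-17 10:40', '2023-09-17 11:00',
    ]),
})

overlapping_starts = count_trips_starting_during_another(toy_trips)
print(overlapping_starts)
```

On the toy data only trip B is counted, because back-to-back trips that merely touch are not treated as overlapping here.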


## RUHROH 🫗⛓️‍💥🚨

I don't know how Uber trips could overlap, but that's what the data shows. Given this picture of the data, here are the updated requirements:

* **Uber trips can overlap**; there may be multiple trips active at the same time.
* **Songs can be associated with multiple Uber trips**; each song may overlap with zero or more trips.
* **Songs and trips have start and end times**, and we need to find all Uber trips that overlap with each song.
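
To pin down what "overlap" means in these requirements: two intervals overlap when each one starts before the other ends. A brute-force version of that test, useful as a reference to check a faster implementation against on a small sample, might look like the following. The `brute_force_overlaps` helper and the toy records are hypothetical, and the inclusive comparisons assume that intervals which merely touch count as overlapping, which is how a sweep that processes start events before end events at equal timestamps behaves.

```python
from datetime import datetime

def brute_force_overlaps(songs, trips):
    """Map each song uuid to the set of trip uuids whose interval it overlaps."""
    overlaps = {}
    for song in songs:
        overlaps[song['uuid']] = {
            trip['uuid']
            for trip in trips
            if song['startTime'] <= trip['endTime']
            and trip['startTime'] <= song['endTime']
        }
    return overlaps

def at(hhmm):
    """Build a timestamp on a fixed toy date from an 'HH:MM' string."""
    return datetime.strptime(f"2023-09-17 {hhmm}", "%Y-%m-%d %H:%M")

# Two trips that overlap each other between 03:20 and 03:30.
toy_trips = [
    {'uuid': 'trip-1', 'startTime': at('03:00'), 'endTime': at('03:30')},
    {'uuid': 'trip-2', 'startTime': at('03:20'), 'endTime': at('03:50')},
]
toy_songs = [
    {'uuid': 'song-a', 'startTime': at('02:50'), 'endTime': at('02:55')},
    {'uuid': 'song-b', 'startTime': at('03:22'), 'endTime': at('03:26')},
    {'uuid': 'song-c', 'startTime': at('03:40'), 'endTime': at('03:44')},
]

toy_overlaps = brute_force_overlaps(toy_songs, toy_trips)
for song_uuid, trip_uuids in toy_overlaps.items():
    print(song_uuid, sorted(trip_uuids))
```

This compares every song with every trip, which is fine for a handful of records but is exactly the quadratic work a sweep avoids on the full dataset.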

I need the sweep line algorithm to handle several active trips at once and to record, for each song, every trip it overlaps with.

In [71]:
active_songs = set()
active_trips = set()
song_overlaps = {song['uuid']: set() for song in spotify.to_dict(orient='records')}

for event in events:
    if event['event_type'] == 'start':
        if event['type'] == 'spotify':
            # Add song to active songs
            active_songs.add(event['uuid'])
            # Record overlaps with trips that are already active
            for trip_uuid in active_trips:
                song_overlaps[event['uuid']].add(trip_uuid)
        elif event['type'] == 'uber':
            # Add trip to active trips
            active_trips.add(event['uuid'])
            # Record overlaps with active songs
            for song_uuid in active_songs:
                song_overlaps[song_uuid].add(event['uuid'])
    elif event['event_type'] == 'end':
        if event['type'] == 'spotify':
            # Remove song from active songs
            active_songs.remove(event['uuid'])
        elif event['type'] == 'uber':
            # Remove trip from active trips
            active_trips.remove(event['uuid'])

spotify['trip_uuids'] = spotify['uuid'].map(song_overlaps)
spotify['trip_count'] = spotify['trip_uuids'].map(len)
# Then let's throw out all songs that are not associated with a trip
spotify = spotify[spotify['trip_count'] > 0]

spotify


In [72]:
trips_with_songs = set.union(*(v['trip_uuids'] for v in spotify.to_dict(orient='records')))
print(f"There are {len(trips_with_songs)} trips with songs associated.")

There are 3266 trips with songs associated.
