# Get Uber Ride Data

As a full-time Uber driver, I want to analyze my past rides so I can make informed choices about future driving and optimize for higher earnings. However, when I log in to the [Uber Drivers](https://drivers.uber.com/earnings/activities) page, I can only view my past rides a week at a time, and that weekly view is frustratingly paginated and lacks relevant details. Being a programmer, I naturally thought of downloading the data through an API. As of August 24, 2024, access to the [Uber Drivers API](https://developer.uber.com/docs/drivers/introduction) is "limited", and there's only a vague message on the info page about applying for access. So that's not a solution for my personal data needs.

**Caveat**: This is likely against the Uber Drivers TOS, and I'm doing this at my own risk of potentially having my account limited or banned. I really want this data, though, and I'm requesting it in a reasonable manner. **Proceed at your own risk.**

## Reverse Engineering the Uber Drivers page

By going into the Google Chrome Developer Console Network tab when loading the Uber Drivers page, I can see that the weekly paginated rides data is initially retrieved through an HTTP POST request to `https://drivers.uber.com/earnings/api/getWebActivityFeed?localeCode=en` with the request payload:

```
{"startDateIso":"2024-05-13","endDateIso":"2024-05-20","paginationOption":{}}
```

This retrieves a JSON object including my rides data that's far richer than what's actually displayed on the page. Jackpot! 

If there are more pages of data, then the JSON object (which I'll refer to as `data`) has `data['data']['pagination']['hasMoreData']` set to `True`. In that case a pagination cursor is available in `data['data']['pagination']['nextCursor']`, and the next page can be requested with a similar HTTP POST request to the same URL, `https://drivers.uber.com/earnings/api/getWebActivityFeed?localeCode=en`, with the request payload below (the `cursor` value is shown as a Python expression to be replaced by the actual cursor string):

```
{"startDateIso":"2024-05-13","endDateIso":"2024-05-20","paginationOption":{"cursor": data['data']['pagination']['nextCursor']}}
```

Repeating this request until `data['data']['pagination']['hasMoreData']` is `False` will retrieve all data for that time period. 
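
Here's a minimal sketch of that pagination loop, written so it can be tried without touching Uber at all. The names `build_payload`, `fetch_all_pages`, `fetch_page`, and `fake_fetch_page` are my own placeholders, and the fake responses only mimic the fields described above; the real response contains much more.

```python
import json


def build_payload(start_date_iso, end_date_iso, cursor=None):
    """Build the request body; the first page sends an empty paginationOption."""
    pagination = {"cursor": cursor} if cursor else {}
    return json.dumps({
        "startDateIso": start_date_iso,
        "endDateIso": end_date_iso,
        "paginationOption": pagination,
    })


def fetch_all_pages(fetch_page, start_date_iso, end_date_iso):
    """Collect activities from every page for one date range.

    fetch_page is a hypothetical callable that POSTs the payload string
    and returns the parsed JSON response as a dict.
    """
    activities = []
    cursor = None
    while True:
        data = fetch_page(build_payload(start_date_iso, end_date_iso, cursor))
        activities.extend(data["data"].get("activities") or [])
        pagination = data["data"]["pagination"]
        if not pagination.get("hasMoreData"):
            return activities
        cursor = pagination["nextCursor"]


def fake_fetch_page(payload):
    """Offline stand-in for the real endpoint: two pages, one activity each."""
    cursor = json.loads(payload)["paginationOption"].get("cursor")
    if cursor is None:
        return {"data": {"activities": [{"uuid": "a"}],
                         "pagination": {"hasMoreData": True, "nextCursor": "page-2"}}}
    return {"data": {"activities": [{"uuid": "b"}],
                     "pagination": {"hasMoreData": False}}}


all_activities = fetch_all_pages(fake_fetch_page, "2024-05-13", "2024-05-20")
print([activity["uuid"] for activity in all_activities])
```

Passing the fetcher in as an argument keeps the loop logic separate from the authenticated HTTP call, which makes it easy to check the cursor handling before making real requests.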

Performing this entire series of requests for all weekly date ranges from the date I started driving Uber (2023-01-09) to now will get me all of my ride data. 
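
The weekly windows can be sketched as a small generator. `weekly_ranges` is a hypothetical helper name. Consecutive windows share a boundary date, mirroring the Monday-to-Monday range in the request I captured; whether the API treats the end date as inclusive is something I haven't verified.

```python
from datetime import date, timedelta


def weekly_ranges(start, end, days=7):
    """Yield (start_iso, end_iso) pairs covering start..end in fixed-size windows."""
    current = start
    while current <= end:
        next_date = current + timedelta(days=days)
        yield current.isoformat(), next_date.isoformat()
        current = next_date


ranges = list(weekly_ranges(date(2023, 1, 9), date(2023, 1, 30)))
for start_iso, end_iso in ranges:
    print(start_iso, "-", end_iso)
```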

Importantly, this is all occurring inside an authenticated HTTPS session. In my initial discovery and testing, I used Postman to perform the requests. Through the Network tab in the Google Chrome Developer Console, I right-clicked the POST request, chose `Copy > Copy as cURL`, and then imported that into Postman using `File > Import...`. 

After confirming it worked, I migrated to Python to request everything programmatically. To generate starter code for Python requests, in Postman on this particular request I chose the `Code` option in the pane on the right side of the window, then chose "Python - Requests" from the drop-down menu. The generated code includes the full authentication tokens in plaintext under the `cookie` key of the headers dictionary.
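
Since the cookie is effectively a password, one option is to keep it out of the notebook and read it from an environment variable. This is only a sketch: `build_headers` and the variable name `UBER_COOKIE` are names I made up, and the headers shown are a subset of what the generated code contains.

```python
import os


def build_headers(cookie_env_var="UBER_COOKIE"):
    """Return request headers with the session cookie read from the environment."""
    cookie = os.environ.get(cookie_env_var)
    if not cookie:
        raise RuntimeError(
            f"Set the {cookie_env_var} environment variable to your session cookie string"
        )
    return {
        "content-type": "application/json",
        "cookie": cookie,  # never commit this value to version control
    }
```

The returned dictionary could then be merged into the headers copied from Postman before any request is made.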

With all of the data downloaded, I extract the rides data specifically and save this to a local JSON file so I don't have to re-request the data. I then perform some data cleaning, parsing, and enrichment to ultimately produce CSV files that I can easily analyze with other programs and import into a spreadsheet. 

In [2]:
import csv
from datetime import datetime, timedelta
import json
import re
import browsercookie
import pandas as pd
import requests

In [5]:
class UberDriver:
    def __init__(self):
        self.headers = {
          'accept': '*/*',
          'accept-language': 'en-US,en;q=0.9',
          'content-type': 'application/json',
          # Paste the 'cookie' value from the Postman-generated code here
          # 'cookie': '',
          'origin': 'https://drivers.uber.com',
          'priority': 'u=1, i',
          'referer': 'https://drivers.uber.com/earnings/activities',
          'sec-ch-ua': '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
          'sec-ch-ua-mobile': '?0',
          'sec-ch-ua-platform': '"macOS"',
          'sec-fetch-dest': 'empty',
          'sec-fetch-mode': 'cors',
          'sec-fetch-site': 'same-origin',
          'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
          'x-csrf-token': 'x',
          'x-uber-earnings-seed': '84fe34091ffc6c333a1a8a4f93ac6d21'
        }
    
    def get_rides(self, start_date_iso, end_date_iso):        
        url = "https://drivers.uber.com/earnings/api/getWebActivityFeed?localeCode=en"
    
        payload = json.dumps({
          "startDateIso": start_date_iso,
          "endDateIso": end_date_iso,
          "paginationOption": {}
        })
        
        response = requests.request("POST", url, headers=self.headers, data=payload, timeout=30)
        response.raise_for_status()  # Surface HTTP errors early instead of failing on JSON parsing
        data = response.json()

        # 'activities' can come back empty/null, so always start from a list
        rides = data['data']['activities'] or []
        while data['data']['pagination']['hasMoreData']:
            # Request the next page using the cursor from the previous response
            payload = json.dumps({
              "startDateIso": start_date_iso,
              "endDateIso": end_date_iso,
              "paginationOption": {"cursor": data['data']['pagination']['nextCursor']}
            })
            response = requests.request("POST", url, headers=self.headers, data=payload, timeout=30)
            response.raise_for_status()
            data = response.json()
            rides += data['data']['activities'] or []
        return rides

    def get_ride_detail(self, ride_uuid):
        # This is only helpful to get additional fare breakdown from Uber, if we wanted to analyze how much
        # Uber is taking from each fare.
        url = f"https://drivers.uber.com/earnings/trips/{ride_uuid}"
        response = requests.request("GET", url, headers=self.headers)
        return response.text

In [6]:
uber = UberDriver()

# This is the date I started working as an Uber driver; modify for your start date
start_date = datetime.strptime("2023-01-09", "%Y-%m-%d")
end_date = datetime.today()
current_date = start_date

rides = []
while current_date <= end_date:
    next_date = current_date + timedelta(days = 7)
    start_date_iso = current_date.strftime('%Y-%m-%d')
    end_date_iso = next_date.strftime('%Y-%m-%d')
    print(f"Getting rides for {start_date_iso} - {end_date_iso}...")
    new_rides = uber.get_rides(start_date_iso, end_date_iso)
    print(f"Retrieved {len(new_rides)} rides.")
    rides += new_rides
    current_date = next_date

# Let's dump all of these rides to a JSON file so we can reference this data outside of the script if need be, 
# or simply not have to retrieve from Uber again.
with open("rides.json", "w") as file:
    json.dump(rides, file)

print(f"Retrieved {len(rides)} rides total.")

Getting rides for 2023-01-09 - 2023-01-16...
Retrieved 42 rides.
Getting rides for 2023-01-16 - 2023-01-23...
Retrieved 9 rides.
Getting rides for 2023-01-23 - 2023-01-30...
Retrieved 13 rides.
Getting rides for 2023-01-30 - 2023-02-06...
Retrieved 0 rides.
Getting rides for 2023-02-06 - 2023-02-13...
Retrieved 26 rides.
Getting rides for 2023-02-13 - 2023-02-20...
Retrieved 26 rides.
Getting rides for 2023-02-20 - 2023-02-27...
Retrieved 7 rides.
Getting rides for 2023-02-27 - 2023-03-06...
Retrieved 29 rides.
Getting rides for 2023-03-06 - 2023-03-13...
Retrieved 9 rides.
Getting rides for 2023-03-13 - 2023-03-20...
Retrieved 51 rides.
Getting rides for 2023-03-20 - 2023-03-27...
Retrieved 59 rides.
Getting rides for 2023-03-27 - 2023-04-03...
Retrieved 26 rides.
Getting rides for 2023-04-03 - 2023-04-10...
Retrieved 0 rides.
Getting rides for 2023-04-10 - 2023-04-17...
Retrieved 0 rides.
Getting rides for 2023-04-17 - 2023-04-24...
Retrieved 0 rides.
Getting rides for 2023-04-24 - 2
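
Because each weekly window ends on the same date the next one starts, a ride on a boundary date could in principle be returned twice. I haven't confirmed whether that actually happens, so the following is just a defensive sketch that drops repeats by `uuid` while keeping the original order; `dedupe_by_uuid` is a hypothetical helper.

```python
def dedupe_by_uuid(rides):
    """Return rides with duplicate 'uuid' values removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for ride in rides:
        if ride["uuid"] in seen:
            continue
        seen.add(ride["uuid"])
        unique.append(ride)
    return unique


sample = [{"uuid": "a"}, {"uuid": "b"}, {"uuid": "a"}]
print(len(dedupe_by_uuid(sample)))
```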

In [6]:
# Let's re-load rides using what's in the JSON file to confirm that it's what we need.
# We can also re-run from this point onward without having to hit the Uber site again.
with open("rides.json", "r") as file:
    rides = json.load(file)

rides_df = pd.read_json('rides.json')

# Let's see what the ride data looks like
print(rides_df)
print(rides_df.head(1))
print(rides_df['type'].unique())
print(rides_df['activityTitle'].unique())

# The rest of the code was already written to process a list of dicts, not a dataframe
rides = rides_df.to_dict(orient='records')

                                      uuid  recognizedAt   activityTitle  \
0     d3096d6c-02bd-4f8e-855b-117588b27910    1673809557         Comfort   
1     89a3e777-2000-44af-8509-765b157dfe9e    1673808331           UberX   
2     78616fc5-6214-4f31-89a0-4229e35f0c5c    1673807621           UberX   
3     b6457299-8e46-44ec-b151-398649205336    1673806661           UberX   
4     ae817bfc-76a2-4a47-8622-d85cc9b5dd8e    1673802749           UberX   
...                                    ...           ...             ...   
4244  eeb105cb-7abd-43d3-8987-7d5bf5bceadc    1724101706           UberX   
4245  e56b225b-9b01-4fc1-b2dc-a8b5de54519a    1724098587           UberX   
4246  b9d51f86-f612-4853-84ba-0f98885103dd    1724096867           UberX   
4247  56012ec9-9d96-42ad-a0e7-1260af304ada    1724095014           UberX   
4248  91178c7b-91c5-5c46-914c-7618c524f208    1724058000  {0} Trip Quest   

     formattedTotal                                            routing  \
0            

In [5]:
def parse_time_to_seconds(time_str):
    matches = re.findall(r'(\d+)\s*(hr|min|sec)', time_str)
    unit_to_seconds = {'hr': 3600, 'min': 60, 'sec': 1}
    return sum(int(value) * unit_to_seconds[unit] for value, unit in matches)

def parse_miles(miles_str):
    match = re.search(r'(\d+\.?\d*)\s*mi', miles_str)
    return float(match.group(1))

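def parse_currency_to_float(currency_str):
    # Strip the dollar sign and any thousands separators (e.g. '$1,234.56') before converting
    clean_str = currency_str.replace('$', '').replace(',', '').strip()
    return float(clean_str)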

def parse_season(date):
    """Return the season for a given datetime object."""
    seasons = {
        'Spring': (3, 21, 6, 20),
        'Summer': (6, 21, 9, 20),
        'Fall': (9, 21, 12, 20),
        'Winter': (12, 21, 3, 20)
    }
    month = date.month
    day = date.day
    for season, (start_month, start_day, end_month, end_day) in seasons.items():
        if start_month <= end_month:
            if start_month <= month <= end_month:
                if (month == start_month and day >= start_day) or (month == end_month and day <= end_day) or (start_month < month < end_month):
                    return season
        else:
            if month > start_month or month < end_month or (month == start_month and day >= start_day) or (month == end_month and day <= end_day):
                return season

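def extract_zipcode(address):
    """Return the last 5-digit group in the address, or None if there is none."""
    # Use the last match so a 5-digit street number isn't mistaken for the ZIP code
    matches = re.findall(r'\b\d{5}\b', address or '')
    return matches[-1] if matches else None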

cleaned_rides = []
for ride in rides:
    if ride.get('breakdownDetails'):
        tip = ride['breakdownDetails']['formattedTip'] or '$0.00'
        surge = ride['breakdownDetails']['formattedSurge'] or '$0.00'
    else:
        tip = '$0.00'
        surge = '$0.00'
    if ride.get('tripMetaData'):
        duration = parse_time_to_seconds(ride['tripMetaData']['formattedDuration'])
        distance = parse_miles(ride['tripMetaData']['formattedDistance'])
        pickup_address = ride['tripMetaData']['pickupAddress']
        dropoff_address = ride['tripMetaData']['dropOffAddress']
    else:
        duration = None
        distance = None
        pickup_address = None
        dropoff_address = None
    when = datetime.fromtimestamp(ride['recognizedAt'])
    ride_clean = {
        'uuid': ride['uuid'],
        'date': when.strftime('%Y-%m-%d'),
        'time': when.strftime('%H:%M:%S'),
        'timestamp': ride['recognizedAt'],
        'day': when.strftime('%A'),
        'day_of_week': when.weekday(),
        'sortable_day_of_week': f"{when.weekday()} - {when.strftime('%A')}",
        'season': parse_season(when),
        'type': ride['activityTitle'],
        'earnings': parse_currency_to_float(ride['formattedTotal']),
        'tip': parse_currency_to_float(tip),
        'surge': parse_currency_to_float(surge),
        'duration': duration,
        'distance': distance,
        'pickup_address': pickup_address,
        'dropoff_address': dropoff_address,
        'status': ride['status'],
        'note': ride['type']
    }
    cleaned_rides.append(ride_clean)

# Let's filter the data to only include completed passenger trips (excluding non-trip entries such as quests)
filtered_rides = [ride for ride in cleaned_rides if ride['status'] == 'COMPLETED' 
                                                 and ride['note'] == 'TRIP' 
                                                 and ride['type'] in ['Comfort', 'UberX', 'UberXL', 'UberX Share', 
                                                                      'UberX Priority', 'Uber Pet', 'Business Comfort']]

# Let's add some calculated columns now to skip the manual processing in a spreadsheet
enriched_rides = []
for ride in filtered_rides:
    ride = ride.copy()
    ride['earnings-surge'] = ride['earnings'] - ride['surge']
    ride['earnings/second'] = ride['earnings'] / ride['duration']
    ride['earnings/mile'] = ride['earnings'] / ride['distance']
    ride['pickup_zipcode'] = extract_zipcode(ride['pickup_address'])
    ride['dropoff_zipcode'] = extract_zipcode(ride['dropoff_address'])
    del ride['status']
    del ride['note']
    enriched_rides.append(ride)

with open('enriched_rides.json', 'w') as f:
    f.write(json.dumps(enriched_rides))
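
To make the string parsing above more concrete, here is a small standalone sketch of the duration and distance parsing with a few sample inputs. The sample strings are my guess at the format based on the regular expressions, not values copied from Uber's response, and `duration_to_seconds` and `miles_to_float` are hypothetical variants that return `None` instead of raising when nothing matches.

```python
import re

UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def duration_to_seconds(text):
    """Convert strings like '1 hr 5 min 30 sec' to seconds; None if nothing matches."""
    matches = re.findall(r"(\d+)\s*(hr|min|sec)", text or "")
    if not matches:
        return None
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in matches)


def miles_to_float(text):
    """Convert strings like '4.5 mi' to a float; None if nothing matches."""
    match = re.search(r"(\d+\.?\d*)\s*mi", text or "")
    return float(match.group(1)) if match else None


print(duration_to_seconds("1 hr 5 min 30 sec"))
print(duration_to_seconds("17 min 13 sec"))
print(miles_to_float("4.5 mi"))
print(miles_to_float(""))
```

Returning `None` for unparseable input means the per-second and per-mile calculations would need to skip those rides rather than divide by `None`.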

In [6]:
rides_df = pd.DataFrame(enriched_rides)
print(rides_df.describe())

# Compute summary statistics for each column, including handling None values
stats = rides_df.describe(include='all')

# Count None values per column
null_count = rides_df.isnull().sum()

# Display the statistics and None count
print(stats)
print("\nCount of None values per column:\n", null_count)

# Additional information on string handling
print("\nAdditional Info:")
for column in rides_df.columns:
    if rides_df[column].dtype == 'object':  # Handling for strings and mixed types
        unique_strings = rides_df[column].dropna().unique()
        print(f"Unique values in column '{column}' (#{len(unique_strings)}): {unique_strings}")

          timestamp  day_of_week     earnings          tip        surge  \
count  3.657000e+03  3657.000000  3657.000000  3657.000000  3657.000000   
mean   1.706210e+09     3.135630    11.447104     1.509284     0.714315   
std    1.197818e+07     1.739357     6.950630     2.568006     1.557651   
min    1.673391e+09     0.000000     2.860000     0.000000     0.000000   
25%    1.699843e+09     2.000000     6.760000     0.000000     0.000000   
50%    1.707350e+09     3.000000     9.890000     0.000000     0.000000   
75%    1.715110e+09     5.000000    14.040000     3.000000     1.000000   
max    1.724459e+09     6.000000   105.860000    28.780000    14.000000   

          duration     distance  earnings-surge  earnings/second  \
count  3657.000000  3657.000000     3657.000000      3657.000000   
mean   1033.619087     4.547990       10.732789         0.012302   
std     664.837903     4.038747        6.430917         0.005502   
min      87.000000     0.200000        2.800000     

In [7]:
# newline='' is what the csv module expects; without it, blank rows can appear on Windows
with open("Uber Rides.csv", 'w', newline='') as file:
    dw = csv.DictWriter(file, fieldnames=enriched_rides[0].keys())
    dw.writeheader()
    dw.writerows(enriched_rides)

with open("Uber All Ride Data.csv", 'w', newline='') as file:
    dw = csv.DictWriter(file, fieldnames=cleaned_rides[0].keys())
    dw.writeheader()
    dw.writerows(cleaned_rides)
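
As one example of the kind of question the CSV can answer, here is a rough sketch that computes earnings per hour for each day of the week. It assumes rows shaped like the enriched rides above (`day`, `earnings`, and `duration` in seconds); `earnings_per_hour_by_day` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def earnings_per_hour_by_day(rows):
    """Return {day: earnings per driving hour}, given rows with day/earnings/duration."""
    totals = defaultdict(lambda: [0.0, 0.0])  # day -> [total earnings, total seconds]
    for row in rows:
        # float() also handles the string values produced by csv.DictReader
        totals[row["day"]][0] += float(row["earnings"])
        totals[row["day"]][1] += float(row["duration"])
    return {
        day: round(earned * 3600 / seconds, 2)
        for day, (earned, seconds) in totals.items()
        if seconds > 0
    }


sample_rows = [
    {"day": "Monday", "earnings": "10.00", "duration": "1800"},
    {"day": "Monday", "earnings": "20.00", "duration": "1800"},
    {"day": "Friday", "earnings": "15.00", "duration": "3600"},
]
print(earnings_per_hour_by_day(sample_rows))
```

The same function should work on rows read back from `Uber Rides.csv` with `csv.DictReader`, since it converts the string fields to floats itself. Note that this only counts time spent on trips, not time waiting between them.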