# Data Collection

## Racing Profile

This section outlines the features that go into a single observation.

* Observation unit: driver (one row per driver per race weekend)
* Dependent variable: Grand Prix finishing position, or whether the driver reached the podium

Independent variables:

* Features from Practice 1-3 and Qualifying (computed separately for each of the 4 sessions)
    * Min/Max/Avg lap times
    * Number of stints (max stint_number)
    * Number of laps (max lap_number)
    * Car data summary stats for the fastest lap
        * max brake, min/max/avg rpm, max/avg throttle, min/max/avg speed
    * Avg pit duration
    * Number of pit stops
    * Weather
        * rain, max wind speed, avg air and track temperatures


## Import libraries

In [2]:
#| label: import
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
import pandas as pd
import json
from datetime import datetime
import time
import signal

query_base = "https://api.openf1.org/v1/"
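
Every query in this notebook follows the same pattern: open the URL, decode the body, and parse it as JSON. A minimal sketch of a reusable helper is shown below. The name `fetch_json`, its retry and backoff parameters, and the injectable `opener` and `sleep` arguments are all hypothetical and are not used elsewhere in this notebook.

```python
import json
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_json(url, retries=3, backoff=1.0, opener=urlopen, sleep=time.sleep):
    """Hypothetical helper: GET `url` and return the parsed JSON body.

    Retries on HTTPError/URLError, waiting backoff * attempt seconds
    between tries, and re-raises the error after the last attempt.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return json.loads(response.read().decode("utf-8"))
        except (HTTPError, URLError):
            if attempt == retries:
                raise  # give up after the final attempt
            sleep(backoff * attempt)
```

With a helper like this, a call such as `fetch_json(query_base + "meetings?year>2022")` could stand in for the three-line urlopen/read/loads sequence; the `opener` argument exists only so the function can be exercised without network access.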

## Meeting Query 

Obtain the list of race weekends, or meetings. 

In [5]:
query_meetings = query_base+"meetings?year>2022"

response = urlopen(query_meetings)
data = json.loads(response.read().decode('utf-8'))
meetings_df = pd.json_normalize(data)

print(meetings_df)

                 meeting_name  \
0          Pre-Season Testing   
1          Bahrain Grand Prix   
2    Saudi Arabian Grand Prix   
3       Australian Grand Prix   
4       Azerbaijan Grand Prix   
5            Miami Grand Prix   
6           Monaco Grand Prix   
7          Spanish Grand Prix   
8         Canadian Grand Prix   
9         Austrian Grand Prix   
10         British Grand Prix   
11       Hungarian Grand Prix   
12         Belgian Grand Prix   
13           Dutch Grand Prix   
14         Italian Grand Prix   
15       Singapore Grand Prix   
16        Japanese Grand Prix   
17           Qatar Grand Prix   
18   United States Grand Prix   
19     Mexico City Grand Prix   
20       São Paulo Grand Prix   
21       Las Vegas Grand Prix   
22       Abu Dhabi Grand Prix   
23         Bahrain Grand Prix   
24   Saudi Arabian Grand Prix   
25      Australian Grand Prix   
26        Japanese Grand Prix   
27         Chinese Grand Prix   
28           Miami Grand Prix   
29  Emilia

Some race weekends have a different number of practice and qualifying sessions, and some include a Sprint in addition to the Grand Prix. We are interested only in the race weekends that have three practice sessions and one qualifying session.

From this, we build a DataFrame of all the sessions with their corresponding meeting_key and session type (Practice, Qualifying, or Race).
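
The weekend filter described above can also be expressed directly on the decoded JSON, before any DataFrame is built. The sketch below assumes each session record carries a `session_type` field, as in the output shown in this notebook; the function name `is_standard_weekend` is hypothetical.

```python
from collections import Counter


def is_standard_weekend(sessions):
    """Return True for exactly 3 Practice, 1 Qualifying, and 1 Race session.

    `sessions` is the decoded JSON list for one meeting; only the
    'session_type' field of each record is inspected.
    """
    counts = Counter(s["session_type"] for s in sessions)
    return (counts["Practice"] == 3
            and counts["Qualifying"] == 1
            and counts["Race"] == 1)
```

A `Counter` returns 0 for missing keys, so a meeting with no Race session fails the check instead of raising.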

In [6]:
# convert meetings to a list
meeting_list = meetings_df['meeting_key'].to_list()

# create meeting session list
valid_meeting_sessions = []

# loop through each meeting
for meeting in meeting_list:
    query_sessions = query_base + "sessions?meeting_key=" + str(meeting)

    response = urlopen(query_sessions)
    data = json.loads(response.read().decode('utf-8'))
    sessions_df = pd.json_normalize(data)
    
    # collect the sessions of each type; we need 3 Practice, 1 Qualifying, and 1 Race
    practice_sessions = [session for session in sessions_df['session_type'] if session == 'Practice']
    qualifying_sessions = [session for session in sessions_df['session_type'] if session == 'Qualifying']
    race_sessions = [session for session in sessions_df['session_type'] if session == 'Race']

    # check if the meeting has exactly 3 practice sessions, 1 qualifying, and 1 race
    if len(practice_sessions) == 3 and len(qualifying_sessions) == 1 and len(race_sessions) == 1:

        # loop through the valid sessions and add to a list
        for session in sessions_df.itertuples():
            valid_meeting_sessions.append({
                'meeting_key': meeting,
                'session_key': session.session_key,
                'session_type': session.session_type })
            
    # add sleep time to not overload requests
    time.sleep(0.5)

# convert to a DataFrame
valid_sessions_df = pd.DataFrame(valid_meeting_sessions)

print(valid_sessions_df)

     meeting_key  session_key session_type
0           1141         7765     Practice
1           1141         7766     Practice
2           1141         7767     Practice
3           1141         7768   Qualifying
4           1141         7953         Race
..           ...          ...          ...
175         1256         9999     Practice
176         1256        10000     Practice
177         1256        10001     Practice
178         1256        10002   Qualifying
179         1256        10006         Race

[180 rows x 3 columns]


Next, we split the Race sessions into their own DataFrame.

We then turn their meeting_keys into a list to loop through later.

In [7]:
## create a data frame of all the session_keys that are the actual races
# these will be used to get the label (final race position) later
race_session_df = valid_sessions_df[valid_sessions_df['session_type']=="Race"]

print(race_session_df.head())
print(len(race_session_df))

# now get the list of viable meetings to loop through
valid_meeting_list = race_session_df['meeting_key'].to_list()

print(valid_meeting_list)
print(len(valid_meeting_list))

    meeting_key  session_key session_type
4          1141         7953         Race
9          1142         7779         Race
14         1143         7787         Race
19         1208         9078         Race
24         1210         9094         Race
36
[1141, 1142, 1143, 1208, 1210, 1211, 1212, 1214, 1215, 1217, 1218, 1219, 1220, 1223, 1225, 1226, 1229, 1230, 1231, 1232, 1235, 1236, 1237, 1238, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1248, 1250, 1252, 1254, 1256]
36


## Racing Profile

This block defines a function that, given a meeting_key, generates all the features for each driver in the Race.

Each driver has 82 values: 20 features for each of the four sessions, plus the driver's number and the meeting_key (to be used for merging later).

The function returns a list of lists, one for each driver (20 lists for a full grid).
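
One thing to watch with fixed-width rows: when a session is skipped with `continue`, the driver's list ends up shorter than 82 values and the later sessions shift left. A minimal sketch of one way to keep rows aligned is to substitute a block of NaN placeholders for any skipped session. The helper `build_row` and the constant `FEATURES_PER_SESSION` are hypothetical and not part of the function below.

```python
import math

FEATURES_PER_SESSION = 20  # per the text: 20 features for each of the 4 sessions


def build_row(meeting_key, driver_number, session_feature_lists):
    """Assemble one fixed-width observation.

    `session_feature_lists` holds one entry per session (P1, P2, P3, Q).
    Each entry is a list of 20 values, or None if the session was skipped.
    """
    row = [meeting_key, driver_number]
    for feats in session_feature_lists:
        if feats is None:
            # placeholder block keeps later sessions in their own columns
            feats = [math.nan] * FEATURES_PER_SESSION
        if len(feats) != FEATURES_PER_SESSION:
            raise ValueError(
                f"expected {FEATURES_PER_SESSION} features, got {len(feats)}"
            )
        row.extend(feats)
    return row
```

Every row built this way has 2 + 4 × 20 = 82 values, so it lines up with an 82-name column list regardless of which sessions returned data.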

In [8]:

## build one feature row per driver for a single meeting
def get_data(meeting):
    # initialize list to append data to
    data_list = []  

    # get session numbers for practice and qualifying
    query_sessions = query_base+"sessions?meeting_key="+str(meeting)

    response = urlopen(query_sessions)
    data = json.loads(response.read().decode('utf-8'))
    sessions_df = pd.json_normalize(data)

    # get just the session keys 
    sessions = list(sessions_df['session_key'])
    # get race session num
    race_session_num = sessions[4]

    # remove the real race
    del sessions[-1]
    # print(sessions)
    # we will loop through the sessions later

    ## query the drivers for each race session so that we know they raced in the grand prix
    query_drivers = query_base+"drivers?session_key="+str(race_session_num)

    response = urlopen(query_drivers)
    data = json.loads(response.read().decode('utf-8'))
    drivers_df = pd.json_normalize(data)

    # get all the driver numbers to loop through later
    drivers = list(drivers_df['driver_number'])

    # print(drivers)

    # add sleep time to not overload requests
    time.sleep(0.5)

    ## Loop through the drivers
    for driver in drivers:
        # create list to hold the observation
        driver_feats = [meeting, driver]

        driver_number = str(driver)

        ## Loop through all the sessions
        for session in sessions:
            # create list to store all data for this session
            session_feats = []

            # LAPS QUERY
            query_laps = query_base+"laps?driver_number="+driver_number+"&session_key="+str(session)

            try:
                # Call API and convert to DataFrame
                response = urlopen(query_laps)
                data = json.loads(response.read().decode('utf-8'))
                laps_df = pd.json_normalize(data)
                
                # Check if the DataFrame is empty (no laps data returned)
                if laps_df.empty:
                    print(f"No lap data returned for driver {driver} and session {session}. Skipping.")
                    continue  # Skip to the next session


                # extract lap info for current session
                min_lap = laps_df['lap_duration'].min()
                max_lap = laps_df['lap_duration'].max()
                avg_lap = float(round(laps_df['lap_duration'].mean(),3))
                num_laps = laps_df['lap_number'].max()

                # PARSE THE LAPS DATA BY TIME
                # laps_times_df will be used for car_data queries
                lap_times = laps_df[['lap_number','date_start','lap_duration']].copy()

                # strip the time zone since its the same for all sessions
                lap_times['date_start'] = lap_times['date_start'].str.replace(r':\+.*$', '', regex=True)

                # Convert date_start to datetime if it's not already in datetime format
                lap_times['date_start'] = pd.to_datetime(lap_times['date_start'], errors='coerce')

                # use the next lap start as the end time, except for the last lap, which is calculated from the lap duration
                lap_times['date_end'] = lap_times['date_start'].shift(-1).fillna(lap_times['date_start'] + pd.to_timedelta(lap_times['lap_duration'], unit='s'))

                # convert back to string
                lap_times['date_start'] = lap_times['date_start'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + lap_times['date_start'].dt.strftime('%z').str[:3] + ':' + lap_times['date_start'].dt.strftime('%z').str[3:]
                lap_times['date_end'] = lap_times['date_end'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + lap_times['date_end'].dt.strftime('%z').str[:3] + ':' + lap_times['date_end'].dt.strftime('%z').str[3:]


                # find the lap number for their best lap
                min_lap_num = laps_df[laps_df['lap_duration']==min_lap]['lap_number'].to_list()[0]

            except (HTTPError, URLError) as e:
                # Handle the error if the API call fails
                print(f"Error occurred in lap data for driver {driver} and session {session}: {e}. Skipping this session.")
                continue  # Skip to the next session so stale lap values are not reused

            # CONDUCT A CAR DATA QUERY ON THE MINIMUM LAP
            # base car data query for time filter to be added to
            query_car_base = query_base+"car_data?driver_number="+driver_number+"&session_key="+str(session)

            # create start and end time for car data query
            start_time = lap_times[lap_times['lap_number']==min_lap_num]['date_start'].to_list()[0]
            end_time = lap_times[lap_times['lap_number']==min_lap_num]['date_end'].to_list()[0]

            # query for lap specific times
            query_car = query_car_base + "&date>="+str(start_time)+"&date<="+str(end_time)

            try:
                # call api for car data with lap time query
                response = urlopen(query_car)
                data = json.loads(response.read().decode('utf-8'))
                car_df = pd.json_normalize(data)
                
                # Check if the DataFrame is empty (no car data returned)
                if car_df.empty:
                    print(f"No car data returned for driver {driver} and session {session}. Skipping.")
                    continue  # Skip to the next session

                # get summary stats for the lap
                max_brake = car_df['brake'].max()
                max_rpm = car_df['rpm'].max()
                min_rpm = car_df['rpm'].min()
                avg_rpm = round(car_df['rpm'].mean())
                max_throttle = car_df['throttle'].max()
                avg_throttle = float(round(car_df['throttle'].mean()))
                min_speed = car_df['speed'].min()
                max_speed = car_df['speed'].max()
                avg_speed = round(car_df['speed'].mean())

                # create list of car_data stats per lap
                min_lap_stats = [max_brake, min_rpm, max_rpm, avg_rpm, max_throttle, avg_throttle, min_speed, max_speed, avg_speed]

            except (HTTPError, URLError) as e:
                # Handle the error if the API call fails
                print(f"Error occurred in car data for driver {driver} and session {session}: {e}. Skipping this session.")
                continue  # Skip to the next session so stale car stats are not reused

            # STINTS QUERY FOR NUMBER OF STINTS
            query_stints = query_base + "stints?driver_number="+driver_number+"&session_key="+str(session)

            try:
                # call api and convert to df
                response = urlopen(query_stints)
                data = json.loads(response.read().decode('utf-8'))
                stints_df = pd.json_normalize(data)

                if stints_df.empty:
                    print(f"No stints data returned for driver {driver} and session {session}. Skipping.")
                    continue  # Skip to the next session

                # extract max stint number
                num_stints = stints_df['stint_number'].max()

            except (HTTPError, URLError) as e:
                # Handle the error if the API call fails
                print(f"Error occurred for stints data for driver {driver} and session {session}: {e}. Skipping this session.")
                continue  # Skip to the next session so a stale stint count is not reused


            # PITS QUERY
            query_pits = query_base + "pit?driver_number="+driver_number+"&session_key="+str(session)

            try:
                # call api and convert to df
                response = urlopen(query_pits)
                data = json.loads(response.read().decode('utf-8'))
                pits_df = pd.json_normalize(data)

                if pits_df.empty:
                    print(f"No pit data returned for driver {driver} and session {session}. Skipping.")
                    continue  # Skip to the next session

                # extract num of pits and avg pit duration
                num_pits = len(pits_df)
                avg_pit_time = float(round(pits_df['pit_duration'].mean(),1))

            except (HTTPError, URLError) as e:
                # Handle the error if the API call fails
                print(f"Error occurred for pit data for driver {driver} and session {session}: {e}. Skipping this session.")
                continue  # Skip to the next session so stale pit values are not reused

            # WEATHER QUERY
            query_wx = query_base + "weather?session_key="+str(session)

            try:
                # call api and convert to df
                response = urlopen(query_wx)
                data = json.loads(response.read().decode('utf-8'))
                weather_df = pd.json_normalize(data)

                if weather_df.empty:
                    print(f"No weather data returned for driver {driver} and session {session}. Skipping.")
                    continue  # Skip to the next session

                ### parse weather data
                did_rain = weather_df['rainfall'].max()
                max_wind = weather_df['wind_speed'].max()
                avg_air_temp = float(round(weather_df['air_temperature'].mean(),3))
                avg_track_temp = float(round(weather_df['track_temperature'].mean(),3))

                wx_stats = [did_rain, max_wind, avg_air_temp, avg_track_temp]

            except (HTTPError, URLError) as e:
                # Handle the error if the API call fails
                print(f"Error occurred for weather data for driver {driver} and session {session}: {e}. Skipping this session.")
                continue  # Skip to the next session so stale weather values are not reused

            # AFTER ALL QUERIES PER SESSION
            # append to driver features list
            session_feats = [min_lap, max_lap, avg_lap, num_laps, num_stints] + [num_pits, avg_pit_time] + wx_stats + min_lap_stats 

            driver_feats.extend(session_feats)
        
        # print(driver_feats)
        data_list.append(driver_feats)
        # add sleep time to not overload requests
        time.sleep(2)
    return data_list
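
The lap-window step inside `get_data` (use the next lap's start as the end time, and fall back to start plus duration for the last lap) can be isolated and checked on its own. The sketch below uses only the standard library and assumes every lap record has a non-null `date_start` in ISO 8601 form with a UTC offset; the function name `lap_window` is hypothetical.

```python
from datetime import datetime, timedelta


def lap_window(laps, lap_number):
    """Return (start, end) ISO 8601 strings for one lap.

    `laps` is a list of dicts with 'lap_number', 'date_start' (ISO 8601
    with a UTC offset) and 'lap_duration' in seconds, ordered by lap number.
    """
    for i, lap in enumerate(laps):
        if lap["lap_number"] != lap_number:
            continue
        start = datetime.fromisoformat(lap["date_start"])
        if i + 1 < len(laps):
            # the next lap's start marks the end of this lap
            end = datetime.fromisoformat(laps[i + 1]["date_start"])
        else:
            # last lap: derive the end from the recorded duration
            end = start + timedelta(seconds=lap["lap_duration"])
        return start.isoformat(), end.isoformat()
    raise KeyError(f"lap {lap_number} not found")
```

Because the parsed datetimes stay timezone-aware, `isoformat()` writes the offset back out (for example `+00:00`) without any manual string slicing.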


The next cell defines the column names for the DataFrame and a signal handler used to enforce a time limit on each meeting_key.

Note: I used ChatGPT for the timeout function and for help with the try/except blocks used for error handling. 

In [9]:
# per-session feature names, in the same order that get_data builds session_feats:
# lap/stint/pit stats, then weather stats, then car-data stats for the fastest lap
session_cols = ['min_lap', 'max_lap', 'avg_lap', 'num_laps', 'num_stints', 'num_pits', 'avg_pit_time',
                'did_rain', 'max_wind', 'avg_air_temp', 'avg_track_temp',
                'max_brake', 'min_rpm', 'max_rpm', 'avg_rpm', 'max_throttle', 'avg_throttle',
                'min_speed', 'max_speed', 'avg_speed']

# one block of 20 columns per session: practice 1-3 and qualifying
colnames = ['meeting_key', 'driver_num'] + [f'{col}_{suffix}'
                                            for suffix in ('p1', 'p2', 'p3', 'q')
                                            for col in session_cols]

# Function to handle the timeout
def handler(signum, frame):
    raise TimeoutError("Timeout exceeded")
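
As a sketch of how the alarm-based time limit could be packaged, the context manager below arms a timer on entry and always cancels it on exit. It relies on `SIGALRM`, so it works only on Unix-like systems and only in the main thread. The name `time_limit` is hypothetical, and using `signal.setitimer` rather than `signal.alarm` is a choice made here to allow fractional seconds.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Hypothetical wrapper: raise TimeoutError if the block runs too long."""
    def _raise_timeout(signum, frame):
        raise TimeoutError("Timeout exceeded")

    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # cancel any pending alarm, whether the block finished or failed
        signal.setitimer(signal.ITIMER_REAL, 0)
        if previous is not None:
            signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

Usage would look like `with time_limit(300): d_list = get_data(meeting_key)`. Putting the cancellation in `finally` means an exception inside the block cannot leave a timer running.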

## Collecting Data

This block loops through the valid meeting_keys, calls the function, converts the returned list to a dataframe, and then saves the dataframe as a csv.

In [163]:
sub_valid_meeting_list = valid_meeting_list[0:16]

# Setting up the signal handler
signal.signal(signal.SIGALRM, handler)

# Define timeout (in seconds)
timeout = 300

for meeting_key in sub_valid_meeting_list:
    try:
        # Start the timer
        signal.alarm(timeout)
        
        # Attempt to get the data
        d_list = get_data(meeting_key)
        df = pd.DataFrame(d_list, columns=colnames)

        # Reset the timer (if successful)
        signal.alarm(0)

        # Save the data to CSV
        file_name = f'data/racing_profiles_{meeting_key}.csv'
        df.to_csv(file_name, index=False)

        print(f"Saved file for meeting_key {meeting_key}")

    except TimeoutError:
        print(f"Timeout occurred for meeting_key {meeting_key}, skipping.")
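    except Exception as e:
        print(f"An error occurred for meeting_key {meeting_key}: {e}")
    finally:
        # Always cancel the timer, even after an error, so a stale alarm cannot fire later
        signal.alarm(0)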

No data returned for driver 1 and session 7765. Skipping.
No data returned for driver 1 and session 7766. Skipping.
No data returned for driver 1 and session 7767. Skipping.
No data returned for driver 1 and session 7768. Skipping.
No data returned for driver 2 and session 7765. Skipping.
No data returned for driver 2 and session 7766. Skipping.
No data returned for driver 2 and session 7767. Skipping.
No data returned for driver 2 and session 7768. Skipping.
No data returned for driver 4 and session 7765. Skipping.
No data returned for driver 4 and session 7766. Skipping.
No data returned for driver 4 and session 7767. Skipping.
No data returned for driver 4 and session 7768. Skipping.


KeyboardInterrupt: 

After the initial run, the following meetings were not saved: 1232, 1240, 1242, 1243, 1246, 1250, 1254.

I will go back and run those individually.
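
Rather than tracking the unsaved meetings by hand, the list can be derived from the files on disk. The sketch below assumes the `data/racing_profiles_<meeting_key>.csv` naming used in the collection loop; the function name `missing_meetings` is hypothetical.

```python
from pathlib import Path


def missing_meetings(meeting_keys, data_dir="data"):
    """Return the meeting keys that have no saved CSV in `data_dir`.

    Assumes files are named racing_profiles_<meeting_key>.csv.
    """
    folder = Path(data_dir)
    return [key for key in meeting_keys
            if not (folder / f"racing_profiles_{key}.csv").exists()]
```

Calling `missing_meetings(valid_meeting_list)` after a run would give the keys to retry, in their original order.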

In [None]:
## For Meeting 1232

meeting_key = 1232

# run the function
d_list = get_data(meeting_key)
df = pd.DataFrame(d_list, columns=colnames)

# Save the data to CSV
file_name = f'data/racing_profiles_{meeting_key}.csv'
df.to_csv(file_name, index=False)

No data returned for driver 1 and session 9489. Skipping.
No data returned for driver 1 and session 9490. Skipping.


KeyboardInterrupt: 


## Individual Driver Profile

A standalone loop that builds the observation for a single driver, without the error handling used in `get_data`.

In [164]:
# set the driver number
driver_number = str(1)

# create list of features for an individual driver
driver_feats = [driver_number]

# list of sessions
session_sub = [7765]


for session in session_sub:
    # create list to store all data for this session
    session_feats = []

    # LAPS QUERY
    query_laps = query_base+"laps?driver_number="+driver_number+"&session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_laps)
    data = json.loads(response.read().decode('utf-8'))
    laps_df = pd.json_normalize(data)

    # extract lap info for current session
    min_lap = laps_df['lap_duration'].min()
    max_lap = laps_df['lap_duration'].max()
    avg_lap = float(round(laps_df['lap_duration'].mean(),3))
    num_laps = laps_df['lap_number'].max()

    # PARSE THE LAPS DATA BY TIME
    # laps_times_df will be used for car_data queries
    lap_times = laps_df[['lap_number','date_start','lap_duration']].copy()

    # strip the time zone since its the same for all sessions
    lap_times['date_start'] = lap_times['date_start'].str.replace(r':\+.*$', '', regex=True)

    # Convert date_start to datetime if it's not already in datetime format
    lap_times['date_start'] = pd.to_datetime(lap_times['date_start'], errors='coerce')

    # use the next lap start as the end time, except for the last lap, which is calculated from the lap duration
    lap_times['date_end'] = lap_times['date_start'].shift(-1).fillna(lap_times['date_start'] + pd.to_timedelta(lap_times['lap_duration'], unit='s'))

    # convert back to string
    lap_times['date_start'] = lap_times['date_start'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + lap_times['date_start'].dt.strftime('%z').str[:3] + ':' + lap_times['date_start'].dt.strftime('%z').str[3:]
    lap_times['date_end'] = lap_times['date_end'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + lap_times['date_end'].dt.strftime('%z').str[:3] + ':' + lap_times['date_end'].dt.strftime('%z').str[3:]


    # find the lap number for their best lap
    min_lap_num = laps_df[laps_df['lap_duration']==min_lap]['lap_number'].to_list()[0]

    # CONDUCT A CAR DATA QUERY ON THE MINIMUM LAP
    # base car data query for time filter to be added to
    query_car_base = query_base+"car_data?driver_number="+driver_number+"&session_key="+str(session)

    # create start and end time for car data query
    start_time = lap_times[lap_times['lap_number']==min_lap_num]['date_start'].to_list()[0]
    end_time = lap_times[lap_times['lap_number']==min_lap_num]['date_end'].to_list()[0]

    # query for lap specific times
    query_car = query_car_base + "&date>="+str(start_time)+"&date<="+str(end_time)

    # call api for car data with lap time query
    response = urlopen(query_car)
    data = json.loads(response.read().decode('utf-8'))
    car_df = pd.json_normalize(data)


    # get summary stats for the lap
    max_brake = car_df['brake'].max()
    max_rpm = car_df['rpm'].max()
    min_rpm = car_df['rpm'].min()
    avg_rpm = round(car_df['rpm'].mean())
    max_throttle = car_df['throttle'].max()
    avg_throttle = float(round(car_df['throttle'].mean()))
    min_speed = car_df['speed'].min()
    max_speed = car_df['speed'].max()
    avg_speed = round(car_df['speed'].mean())

    # create list of car_data stats per lap
    min_lap_stats = [max_brake, min_rpm, max_rpm, avg_rpm, max_throttle, avg_throttle, min_speed, max_speed, avg_speed]


    # STINTS QUERY FOR NUMBER OF STINTS
    query_stints = query_base + "stints?driver_number="+driver_number+"&session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_stints)
    data = json.loads(response.read().decode('utf-8'))
    stints_df = pd.json_normalize(data)

    # extract max stint number
    num_stints = stints_df['stint_number'].max()


    # PITS QUERY
    query_pits = query_base + "pit?driver_number="+driver_number+"&session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_pits)
    data = json.loads(response.read().decode('utf-8'))
    pits_df = pd.json_normalize(data)

    # extract num of pits and avg pit duration
    num_pits = len(pits_df)
    avg_pit_time = float(round(pits_df['pit_duration'].mean(),1))

    # WEATHER QUERY
    query_wx = query_base + "weather?session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_wx)
    data = json.loads(response.read().decode('utf-8'))
    weather_df = pd.json_normalize(data)

    ### parse weather data

    did_rain = weather_df['rainfall'].max()
    max_wind = weather_df['wind_speed'].max()
    avg_air_temp = float(round(weather_df['air_temperature'].mean(),3))
    avg_track_temp = float(round(weather_df['track_temperature'].mean(),3))

    wx_stats = [did_rain, max_wind, avg_air_temp, avg_track_temp]

    # AFTER ALL QUERIES PER SESSION
    # append to driver features list
    session_feats = [min_lap, max_lap, avg_lap, num_laps, num_stints] + [num_pits, avg_pit_time] + wx_stats + min_lap_stats 

    driver_feats.extend(session_feats)



print(driver_feats)


KeyError: 'pit_duration'
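
The `KeyError` above is consistent with the pit query returning an empty list: `pd.json_normalize([])` yields a DataFrame with no columns, so `pits_df['pit_duration']` has nothing to look up. A minimal sketch of a guard that works on the decoded JSON directly is shown below. The function name `summarize_pits` is hypothetical, and returning NaN for the average when there are no durations is an assumption about how missing pit data should be encoded.

```python
import math


def summarize_pits(pit_records):
    """Return (num_pits, avg_pit_duration) from the decoded pit JSON list.

    An empty list, or records without a usable 'pit_duration', gives
    (count, nan) instead of raising.
    """
    durations = [r["pit_duration"] for r in pit_records
                 if r.get("pit_duration") is not None]
    if not durations:
        return len(pit_records), math.nan
    return len(pit_records), round(sum(durations) / len(durations), 1)
```

With a guard like this, a session with no pit stops contributes a count of 0 and a missing average instead of stopping the loop.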

In [None]:
# set the driver number
driver_number = str(16)

# create list of features for an individual driver
driver_feats = [driver_number]

# list of sessions
session_sub = [9473, 9474, 9475, 9476]


for session in session_sub:
    # create list to store all data for this session
    session_feats = []

    # LAPS QUERY
    query_laps = query_base+"laps?driver_number="+driver_number+"&session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_laps)
    data = json.loads(response.read().decode('utf-8'))
    laps_df = pd.json_normalize(data)

    # extract lap info for current session
    min_lap = laps_df['lap_duration'].min()
    max_lap = laps_df['lap_duration'].max()
    avg_lap = float(round(laps_df['lap_duration'].mean(),3))
    num_laps = laps_df['lap_number'].max()

    # PARSE THE LAPS DATA BY TIME
    # laps_times_df will be used for car_data queries
    lap_times = laps_df[['lap_number','date_start','lap_duration']].copy()

    # strip the time zone since its the same for all sessions
    lap_times['date_start'] = lap_times['date_start'].str.replace(r':\+.*$', '', regex=True)

    # Convert date_start to datetime if it's not already in datetime format
    lap_times['date_start'] = pd.to_datetime(lap_times['date_start'], errors='coerce')

    # use the next lap start as the end time, except for the last lap, which is calculated from the lap duration
    lap_times['date_end'] = lap_times['date_start'].shift(-1).fillna(lap_times['date_start'] + pd.to_timedelta(lap_times['lap_duration'], unit='s'))

    # convert back to string
    lap_times['date_start'] = lap_times['date_start'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + lap_times['date_start'].dt.strftime('%z').str[:3] + ':' + lap_times['date_start'].dt.strftime('%z').str[3:]
    lap_times['date_end'] = lap_times['date_end'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f') + lap_times['date_end'].dt.strftime('%z').str[:3] + ':' + lap_times['date_end'].dt.strftime('%z').str[3:]


    # find the lap number for their best lap
    min_lap_num = laps_df[laps_df['lap_duration']==min_lap]['lap_number'].to_list()[0]

    # CONDUCT A CAR DATA QUERY ON THE MINIMUM LAP
    # base car data query for time filter to be added to
    query_car_base = query_base+"car_data?driver_number="+driver_number+"&session_key="+str(session)

    # create start and end time for car data query
    start_time = lap_times[lap_times['lap_number']==min_lap_num]['date_start'].to_list()[0]
    end_time = lap_times[lap_times['lap_number']==min_lap_num]['date_end'].to_list()[0]

    # query for lap specific times
    query_car = query_car_base + "&date>="+str(start_time)+"&date<="+str(end_time)

    # call api for car data with lap time query
    response = urlopen(query_car)
    data = json.loads(response.read().decode('utf-8'))
    car_df = pd.json_normalize(data)


    # get summary stats for the lap
    max_brake = car_df['brake'].max()
    max_rpm = car_df['rpm'].max()
    min_rpm = car_df['rpm'].min()
    avg_rpm = round(car_df['rpm'].mean())
    max_throttle = car_df['throttle'].max()
    avg_throttle = float(round(car_df['throttle'].mean()))
    min_speed = car_df['speed'].min()
    max_speed = car_df['speed'].max()
    avg_speed = round(car_df['speed'].mean())

    # create list of car_data stats per lap
    min_lap_stats = [max_brake, min_rpm, max_rpm, avg_rpm, max_throttle, avg_throttle, min_speed, max_speed, avg_speed]


    # STINTS QUERY FOR NUMBER OF STINTS
    query_stints = query_base + "stints?driver_number="+driver_number+"&session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_stints)
    data = json.loads(response.read().decode('utf-8'))
    stints_df = pd.json_normalize(data)

    # extract max stint number
    num_stints = stints_df['stint_number'].max()


    # PITS QUERY
    query_pits = query_base + "pit?driver_number="+driver_number+"&session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_pits)
    data = json.loads(response.read().decode('utf-8'))
    pits_df = pd.json_normalize(data)

    # extract num of pits and avg pit duration
    num_pits = len(pits_df)
    avg_pit_time = float(round(pits_df['pit_duration'].mean(),1))

    # WEATHER QUERY
    query_wx = query_base + "weather?session_key="+str(session)

    # call api and convert to df
    response = urlopen(query_wx)
    data = json.loads(response.read().decode('utf-8'))
    weather_df = pd.json_normalize(data)

    ### parse weather data

    did_rain = weather_df['rainfall'].max()
    max_wind = weather_df['wind_speed'].max()
    avg_air_temp = float(round(weather_df['air_temperature'].mean(),3))
    avg_track_temp = float(round(weather_df['track_temperature'].mean(),3))

    wx_stats = [did_rain, max_wind, avg_air_temp, avg_track_temp]

    # AFTER ALL QUERIES PER SESSION
    # append to driver features list
    session_feats = [min_lap, max_lap, avg_lap, num_laps, num_stints] + [num_pits, avg_pit_time] + wx_stats + min_lap_stats 

    driver_feats.extend(session_feats)



print(driver_feats)


['16', 90.03, 832.062, 162.989, 24, 4, 4, 370.5, 0, 6.8, 26.058, 35.126, 100, 7509, 11950, 10832, 100, 81.0, 84, 329, 241, 89.18, 712.401, 155.104, 25, 5, 5, 271.7, 0, 4.1, 25.371, 30.597, 100, 6701, 12149, 10848, 100, 81.0, 87, 331, 247, 88.608, 982.957, 207.677, 16, 4, 4, 719.1, 0, 5.8, 25.984, 39.559, 100, 6715, 12238, 10921, 100, 82.0, 92, 335, 250, 87.791, 636.95, 179.428, 23, 6, 6, 422.9, 0, 2.9, 25.106, 30.786, 100, 6750, 12228, 10902, 100, 83.0, 89, 330, 250]
