<a href="https://colab.research.google.com/github/hinafarooq21/F1-Lap-Predictor--Capstone/blob/main/Collecting_the_Capstone_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Lap Times

This is notebook 1 of x in the Predicting Lap Times project.

This notebook will run through all the steps taken to gather the data.

In [1]:
# Importing all the libraries
# Establish website connection using the requests library
import requests
import pandas as pd
import numpy as np
# the BeautifulSoup library for scraping from the bs4 package
from bs4 import BeautifulSoup
# RegEx for pattern matching
import re

# Web scraping
> Requesting access to the page "https://en.wikipedia.org/wiki/List_of_Formula_One_circuits" to gather the following information:

1. List of all the circuits
2. Length of the circuits
3. Number of corners
4. Type of circuit
5. Direction of circuit


#### Issues faced

**Issue encountered while scraping data:**

While attempting to scrape data from the webpage, I faced the following challenge:

1. **BeautifulSoup not recognising the table class:**
   The initial code used to locate the table was:
   ```python
   # Finding the correct table
   table = soup.find('table', class_="wikitable sortable jquery-tablesorter").text
   table
   ```
   This code was intended to identify the table and verify its existence. Although it worked for other tables on the webpage, it failed to recognise the required table.

   **Alternative approach:**
   I then tried the following code, which attempts to extract the table's data and handles the case where the table is not found:
   ```python
   # Extract table information
   # Check if the table is found before proceeding

   table = soup.find('table', class_="wikitable sortable jquery-tablesorter")

   if table:
       # Extract column names (key)
       key = [x.text.strip() for x in table.find_all('th')]
       # Skip the header row and get all data rows
       rows = table.find_all('tr')[1:]
       # Extract data for each row
       value = [[td.text.strip() for td in row.find_all('td')] for row in rows]
   else:
       print("Table not found on the page.")
   ```
   This code worked for other tables on the webpage but still failed to recognise the specific table I needed, even though the table exists on the webpage and the class name `wikitable sortable jquery-tablesorter` was verified as correct. A likely explanation is that the `jquery-tablesorter` class is added by JavaScript after the page loads in a browser, so it is absent from the HTML that `requests` downloads; the raw markup printed later in this notebook shows the table's class as `wikitable sortable`.

   **Conclusion:**
   Since the BeautifulSoup class lookup did not find the required table, an alternative method was necessary to extract its data.
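
The class lookup issue can be reproduced on a small, hand-written HTML snippet. The sketch below is an illustration under assumptions, not the method used in this notebook: the snippet only mimics the table markup shown later, and the CSS selector `table.wikitable.sortable` is one possible way to match a table by its individual classes instead of by the full class string.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded page (assumption: the real page
# contains far more markup; only the class attribute matters here)
html = """
<table class="wikitable sortable">
  <caption>Formula One circuits</caption>
  <tbody><tr><th>Circuit</th></tr>
  <tr><td>Adelaide Street Circuit</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class string must equal the class attribute exactly,
# so including 'jquery-tablesorter' finds nothing in the static HTML
missing = soup.find("table", class_="wikitable sortable jquery-tablesorter")
print(missing)

# A CSS selector matches on individual classes regardless of extra ones
found = soup.select_one("table.wikitable.sortable")
print(found.caption.get_text(strip=True))
```

Matching on individual classes keeps working whether or not extra classes are present, which is why it may be a more forgiving choice than the exact string.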

#### Extracting The Data

##### **Extracting links and circuit names**

In [2]:
# Requesting access to website
url = "https://en.wikipedia.org/wiki/List_of_Formula_One_circuits"
response = requests.get(url)

# Checking response status - 200 means OK
if response.status_code == 200:
  print("Success")
else:
  print("Error")

Success


In [3]:
# Initialising BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Formula One circuits - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-

In [4]:
# Checking whether the soup.find method can locate the table
target_table = soup.find('table', {'class': 'wikitable sortable jquery-tablesorter'})
if target_table:
    print("Table found!")
else:
    print("Table not found!")

Table not found!


In [5]:
# Alternative Approach

# Get all tables from the page
tables = pd.read_html(url)

# Check the number of tables found and print the third one, as that is the table we need
# Note that selecting by position makes the code fragile: it breaks if the index of the table changes
print(f"Number of tables found: {len(tables)}")
if len(tables) >= 3:
    print(tables[2])  # Print the third table
else:
    print("Third table not found!")

Number of tables found: 4
                          Circuit  Map            Type       Direction  \
0         Adelaide Street Circuit  NaN  Street circuit       Clockwise   
1                Ain-Diab Circuit  NaN    Road circuit       Clockwise   
2    Aintree Motor Racing Circuit  NaN    Road circuit       Clockwise   
3           Albert Park Circuit *  NaN  Street circuit       Clockwise   
4   Algarve International Circuit  NaN    Race circuit       Clockwise   
..                            ...  ...             ...             ...   
72                TI Circuit Aida  NaN    Race circuit       Clockwise   
73        Valencia Street Circuit  NaN  Street circuit       Clockwise   
74     Watkins Glen International  NaN    Race circuit       Clockwise   
75           Yas Marina Circuit *  NaN    Race circuit  Anti-clockwise   
76               Zeltweg Airfield  NaN    Road circuit       Clockwise   

        Location               Country     Last length used Turns  \
0       Adelaide
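
Because picking the table by position is fragile, one alternative is to pick it by the columns it contains. The sketch below is a hedged illustration: `find_table_with_columns` is a hypothetical helper, and the two small DataFrames stand in for the list that `pd.read_html(url)` returns (the `Key`/`Meaning` table is invented for the example). The column names `Circuit`, `Direction` and `Turns` are taken from the output above.

```python
import pandas as pd


def find_table_with_columns(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        # Convert column labels to strings so the comparison is uniform
        if required.issubset(set(map(str, table.columns))):
            return table
    return None


# Stand-ins for the list returned by pd.read_html(url)
example_tables = [
    pd.DataFrame({"Key": ["*"], "Meaning": ["made-up legend row"]}),
    pd.DataFrame({
        "Circuit": ["Adelaide Street Circuit"],
        "Direction": ["Clockwise"],
        "Turns": [16],
    }),
]

circuits = find_table_with_columns(example_tables, ["Circuit", "Direction", "Turns"])
print(circuits)
```

If no table has the required columns the helper returns `None`, which plays the same role as the "Third table not found!" branch above.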

The code above gives us the table as a DataFrame, but in order to gather more data through it, we need to access the HTML for the third table.

In [6]:
# Assigning the table html to a variable in order to gather more data

# Find all the tables
circuit_tables = soup.find_all('table')

# Assign table 3 to a variable
target_table = circuit_tables[2]  # Index starts at 0
print(target_table)

<table class="wikitable sortable">
<caption>Formula One circuits
</caption>
<tbody><tr>
<th scope="col">Circuit
</th>
<th class="unsortable" scope="col">Map
</th>
<th scope="col">Type
</th>
<th scope="col">Direction
</th>
<th scope="col">Location
</th>
<th scope="col">Country
</th>
<th scope="col">Last length used
</th>
<th>Turns
</th>
<th scope="col">Grands Prix
</th>
<th scope="col">Season(s)
</th>
<th scope="col">Grands Prix held
</th></tr>
<tr>
<td><a href="/wiki/Adelaide_Street_Circuit" title="Adelaide Street Circuit">Adelaide Street Circuit</a>
</td>
<td><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:Adelaide_(long_route).svg"><img class="mw-file-element" data-file-height="983" data-file-width="1424" decoding="async" height="104" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/49/Adelaide_%28long_route%29.svg/150px-Adelaide_%28long_route%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/49/Adelaide_%28long_route%29.svg/225px-Adelai

Now that we have successfully located the table, we can gather more information through the links it contains.

In [7]:
# Gather all the "a" tags in the table and assign them to a variable
# This can be used to get the links
circuit_links = target_table.find_all('a')
circuit_links

[<a href="/wiki/Adelaide_Street_Circuit" title="Adelaide Street Circuit">Adelaide Street Circuit</a>,
 <a class="mw-file-description" href="/wiki/File:Adelaide_(long_route).svg"><img class="mw-file-element" data-file-height="983" data-file-width="1424" decoding="async" height="104" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/49/Adelaide_%28long_route%29.svg/150px-Adelaide_%28long_route%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/49/Adelaide_%28long_route%29.svg/225px-Adelaide_%28long_route%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/49/Adelaide_%28long_route%29.svg/300px-Adelaide_%28long_route%29.svg.png 2x" width="150"/></a>,
 <a href="/wiki/Adelaide" title="Adelaide">Adelaide</a>,
 <a href="/wiki/Australia" title="Australia"><img alt="Australia" class="mw-file-element" data-file-height="256" data-file-width="512" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/88/Flag_of_Australia_%28converted

In [8]:
# Getting the circuit names and links as lists

# Create empty lists to store the circuit names and links
circuit_names_data = []
circuit_links_data = []

# Find all rows in the target table (excluding the header row)
rows = target_table.find_all('tr')[1:]  # Skip the first row (header)
for row in rows:
    first_column = row.find_all('td')[0]  # Get the first column
    link_tag = first_column.find('a')  # Find the <a> tag in the first column
    if link_tag:
        # Append the full Wikipedia link to the list
        circuit_links_data.append(f"https://en.wikipedia.org{link_tag['href']}")
        circuit_names_data.append(link_tag.text.strip())

# Output the list of links
print(circuit_links_data)
print(circuit_names_data)

['https://en.wikipedia.org/wiki/Adelaide_Street_Circuit', 'https://en.wikipedia.org/wiki/Ain-Diab_Circuit', 'https://en.wikipedia.org/wiki/Aintree_Motor_Racing_Circuit', 'https://en.wikipedia.org/wiki/Albert_Park_Circuit', 'https://en.wikipedia.org/wiki/Algarve_International_Circuit', 'https://en.wikipedia.org/wiki/Aut%C3%B3dromo_do_Estoril', 'https://en.wikipedia.org/wiki/Aut%C3%B3dromo_Hermanos_Rodr%C3%ADguez', 'https://en.wikipedia.org/wiki/Aut%C3%B3dromo_Internacional_Nelson_Piquet', 'https://en.wikipedia.org/wiki/Autodromo_Internazionale_del_Mugello', 'https://en.wikipedia.org/wiki/Imola_Circuit', 'https://en.wikipedia.org/wiki/Interlagos_circuit', 'https://en.wikipedia.org/wiki/Autodromo_Nazionale_di_Monza', 'https://en.wikipedia.org/wiki/Aut%C3%B3dromo_Oscar_y_Juan_G%C3%A1lvez', 'https://en.wikipedia.org/wiki/AVUS', 'https://en.wikipedia.org/wiki/Bahrain_International_Circuit', 'https://en.wikipedia.org/wiki/Baku_City_Circuit', 'https://en.wikipedia.org/wiki/Brands_Hatch', 'http
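
The relative `href` values were turned into full links by prefixing the domain in an f-string. A possible alternative, sketched here with the standard library only, is `urllib.parse.urljoin`, which resolves a relative link against the URL of the page it came from. The two `href` values are copied from the table output above.

```python
from urllib.parse import urljoin

# Page the links were scraped from
base_url = "https://en.wikipedia.org/wiki/List_of_Formula_One_circuits"

# Relative links as they appear in the table's <a> tags
hrefs = ["/wiki/Adelaide_Street_Circuit", "/wiki/Ain-Diab_Circuit"]

# urljoin keeps the scheme and host of base_url and swaps in the new path
absolute_links = [urljoin(base_url, href) for href in hrefs]
print(absolute_links)
```

This gives the same result as the f-string for links that start with `/wiki/`, and it also copes with links that are already absolute.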

In [None]:
# Extracting the fastest lap times from the circuit links

# Create a list that holds the fastest lap time and all the lap time data
fastest_lap_time = []
lap_time_data = []

# Loop through all the circuit links that we extracted earlier
for link in circuit_links_data:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Locate the infobox (not used below; the search runs on the first <tbody> of the page)
    table = soup.find('table', {'class': 'infobox vcard'})
    # Get the circuit name from the page heading
    # We don't need this but it helps verify that the information is correct
    circuit_name = soup.find('h1', {'id': 'firstHeading'}).text.strip()
    # Collect the text of every row that starts with 'Race lap'
    records = [x.td.text.strip() for x in soup.find('tbody').find_all('tr') if x.text.startswith('Race lap')]
    # Keep the raw records for inspection
    fastest_lap_time.append(records)
    # Pick the first F1 record; fall back to 'N/A' when none is found
    fastest_lap = next((record for record in records if "F1)" in record), 'N/A')
    # Append to the list
    lap_time_data.append({'Circuit Name': circuit_name, 'Fastest Lap Time': fastest_lap})

df = pd.DataFrame(lap_time_data)
print(df)


                     Circuit Name  \
0         Adelaide Street Circuit   
1                Ain-Diab Circuit   
2    Aintree Motor Racing Circuit   
3             Albert Park Circuit   
4   Algarve International Circuit   
..                            ...   
72  Okayama International Circuit   
73        Valencia Street Circuit   
74     Watkins Glen International   
75             Yas Marina Circuit   
76               Zeltweg Air Base   

                                     Fastest Lap Time  
0    1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)  
1   2:22.5 ( Stirling Moss, Vanwall VW 5, 1958[1],...  
2                                                 N/A  
3   1:19.813 ( Charles Leclerc, Ferrari SF-24, 202...  
4   1:18.750 (Lewis Hamilton, Mercedes W11, 2020, F1)  
..                                                ...  
72  1:14.023 ( Michael Schumacher, Benetton B194, ...  
73     1:38.683 ( Timo Glock, Toyota TF109, 2009, F1)  
74   1:34.068 ( Alan Jones, Williams FW07B, 1980, F
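
Some circuits come back as `N/A`. One possible cause, which has not been checked against the affected pages, is that `soup.find('tbody')` returns only the first `<tbody>` on a page, so a lap record that sits in a later table is never seen. The sketch below illustrates this on invented HTML (the `sidebar` table is made up) and shows a hypothetical helper, `lap_records`, that searches inside the infobox instead.

```python
from bs4 import BeautifulSoup

# Invented page: an unrelated table comes before the infobox
html = """
<table class="sidebar"><tbody>
  <tr><th>Navigation</th><td>Unrelated box</td></tr>
</tbody></table>
<table class="infobox vcard"><tbody>
  <tr><th>Race lap record</th><td>1:15.381 (Damon Hill, Williams FW15C, 1993, F1)</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")


def lap_records(container):
    """Return the cell text of every row whose text starts with 'Race lap'."""
    return [
        row.td.get_text(strip=True)
        for row in container.find_all("tr")
        if row.td is not None and row.get_text(strip=True).startswith("Race lap")
    ]


# Searching the first <tbody> misses the record in this layout
first_tbody_records = lap_records(soup.find("tbody"))

# Searching inside the infobox finds it
infobox = soup.find("table", class_="infobox")
infobox_records = lap_records(infobox) if infobox else []

print(first_tbody_records)
print(infobox_records)
```

Whether this explains the missing rows in the real data would need to be confirmed by inspecting one of the affected circuit pages.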

In [None]:
df_sorted = df.sort_values(by='Circuit Name', ascending=True)

In [None]:
df.head(50)

Unnamed: 0,Circuit Name,Fastest Lap Time
0,Adelaide Street Circuit,"1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)"
1,Ain-Diab Circuit,"2:22.5 ( Stirling Moss, Vanwall VW 5, 1958[1],..."
2,Aintree Motor Racing Circuit,
3,Albert Park Circuit,"1:19.813 ( Charles Leclerc, Ferrari SF-24, 202..."
4,Algarve International Circuit,"1:18.750 (Lewis Hamilton, Mercedes W11, 2020, F1)"
5,Circuito do Estoril,"1:22.446 ( David Coulthard, Williams FW16B, 19..."
6,Autódromo Hermanos Rodríguez,"1:17.774 ( Valtteri Bottas, Mercedes W12, 2021..."
7,Autódromo Internacional Nelson Piquet,"1:32.507 ( Riccardo Patrese, Williams FW12C, 1..."
8,Mugello Circuit,"1:18.833 ( Lewis Hamilton, Mercedes W11, 2020,..."
9,Imola Circuit,"1:15.484 ( Lewis Hamilton, Mercedes W11, 2020,..."


In [None]:
df.isnull().sum()

In [None]:
# The check above reports no null values in the dataframe.
# However, we know that some lap times are missing.
# This is because the placeholder 'N/A' is stored as a string, not as a null value.
# Check the value counts to see how many circuits have 'N/A' in their fastest lap time column
df['Fastest Lap Time'].value_counts()

Unnamed: 0_level_0,count
Fastest Lap Time,Unnamed: 1_level_1
,23
"1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)",1
"1:34.876 ( Lando Norris, McLaren MCL38, 2024, F1)",1
"1:23.135 (Heinz-Harald Frentzen, Williams FW19, 1997, F1)",1
"1:15.467 ( Alan Jones, Williams FW07B, 1980, F1)",1
"1:40.464 ( Ayrton Senna, Lotus 99T, 1987, F1)",1
"0:57.221 (Marijn van Kalmthout, Benetton B197, 2011, F1)",1
"1:18.426 ( Felipe Massa, Ferrari F2008, 2008, F1)",1
"1:13.780 ( Kimi Räikkönen, McLaren MP4-19B, 2004, F1)",1
"1:16.627 ( Lewis Hamilton, Mercedes W11, 2020, F1)",1

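The difference between a placeholder string and a real missing value can be seen on a two-row example. This is a minimal sketch on sample data, assuming the placeholder is the string `'N/A'` as in the scraping loop; `Series.mask` is used here as one way to turn the placeholder into a proper missing value.

```python
import pandas as pd

sample = pd.DataFrame({
    "Circuit Name": ["Adelaide Street Circuit", "Aintree Motor Racing Circuit"],
    "Fastest Lap Time": ["1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)", "N/A"],
})
lap_times = sample["Fastest Lap Time"]

# The string 'N/A' is ordinary text, so it is not counted as missing
print(lap_times.isna().sum())

# Replace the placeholder with a real missing value, then count again
cleaned = lap_times.mask(lap_times == "N/A")
print(cleaned.isna().sum())
```

After the replacement, `isna()` and `info()` report the gaps directly, without needing `value_counts()`.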

##### **Adding missing values manually**
For reasons unknown to me, the lap record was not extracted for certain circuits during web scraping. As the dataset is small and only 23 values are missing, I decided to enter these values manually.

In [None]:
# Aintree
df.loc[2, ['Fastest Lap Time']] = ['1:51.8 (Jim Clark, Lotus 25, 1963, F1)']
# Brands Hatch
df.loc[16, ['Fastest Lap Time']] = ['1:09.593 (Nigel Mansell, Williams FW11, 1986, F1)']
# Caesars Palace
df.loc[19, ['Fastest Lap Time']] = [' 1:19.639 (Michele Alboreto	Tyrrell 011, 1982)']
# Bremgarten
df.loc[21, ['Fastest Lap Time']] = ['2:39.7	(Juan Manuel Fangio,	Mercedes-Benz W196, 1954)']
# Nevers Magny-Cours
df.loc[24, ['Fastest Lap Time']] = ['1:15.377 (Michael Schumacher, Ferrari F2004, 2004)']
# Gilles Villeneuve
df.loc[29, ['Fastest Lap Time']] = ['1:13.078 (Valtteri Bottas, Mercedes W10, 2019)']
# Circuit Mont-Tremblant
df.loc[30, ['Fastest Lap Time']] = ['1:32.200	(Clay Regazzoni,	Ferrari 312B, 1970)']
# Fair park
df.loc[39, ['Fastest Lap Time']] = ['1:45.353 (Niki Lauda,	McLaren-TAG, 1984)']
# Istanbul park
df.loc[46, ['Fastest Lap Time']] = ['1:24.770 (Juan Pablo Montoya, McLaren MP4-20, 2005)']
# Long beach
df.loc[51, ['Fastest Lap Time']] = ['1:28.330	(Niki Lauda	McLaren MP4/1C, 1983)']
# Montjuic circuit
df.loc[55, ['Fastest Lap Time']] =['1:23.800 (Ronnie Peterson, Lotus 72E, 1973']
# Canadian tire
df.loc[56, ['Fastest Lap Time']] = ['1:13.299	(Mario Andretti, Lotus 78, 1977']
# Nivelles-Baulers
df.loc[57, ['Fastest Lap Time']] = ['1:11.310 (Denny Hulme, McLaren M23, 1974']
# Nürburgring
df.loc[58, ['Fastest Lap Time']] = ['1:28.139 (Max Verstappen, Red Bull Racing RB16, 2020)']
# Pescara Circuit
df.loc[59, ['Fastest Lap Time']] = ['9:44.600 (Stirling Moss, Vanwall VW 5, 1957)']
# Phoenix
df.loc[60, ['Fastest Lap Time']] = ['1:21.434 (Ayrton Senna, McLaren MP4/6, 1991)']
# Prince George
df.loc[61, ['Fastest Lap Time']] = ['1:27.600	(Jim Clark, Lotus Climax, 1965)']
# Red Bull Ring
df.loc[62, ['Fastest Lap Time']] = ['1:05.619 (Carlos Sainz Jr., McLaren MCL35, 2020)']
# Riverside
df.loc[63, ['Fastest Lap Time']] = ['1:56.3 (Jack Brabham, Cooper Climax, 1960)']
# Sebring
df.loc[66, ['Fastest Lap Time']] = ['3:05.0 (Maurice Trintignant, Cooper T51, 1959)']
# Silverstone
df.loc[69, ['Fastest Lap Time']] = ['1:27.097 (Max Verstappen, Red Bull RB16, 2020)']
# Zeltweg
df.loc[76, ['Fastest Lap Time']] = ['1:10.560	(Dan Gurney, Brabham BT7,	1964)']
# Sarthe: Bugatti
df.loc[18, ['Fastest Lap Time']] = ['1:36.700	(Graham Hill,	Lotus 49, 1967']
# Donington Park
df.loc[41] =['Donington Park', '1:18.029 (Ayrton Senna,	McLaren MP4/8, 1993)']

In [None]:
df.head(50)

Unnamed: 0,Circuit Name,Fastest Lap Time
0,Adelaide Street Circuit,"1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)"
1,Ain-Diab Circuit,"2:22.5 ( Stirling Moss, Vanwall VW 5, 1958[1],..."
2,Aintree Motor Racing Circuit,"1:51.8 (Jim Clark, Lotus 25, 1963, F1)"
3,Albert Park Circuit,"1:19.813 ( Charles Leclerc, Ferrari SF-24, 202..."
4,Algarve International Circuit,"1:18.750 (Lewis Hamilton, Mercedes W11, 2020, F1)"
5,Circuito do Estoril,"1:22.446 ( David Coulthard, Williams FW16B, 19..."
6,Autódromo Hermanos Rodríguez,"1:17.774 ( Valtteri Bottas, Mercedes W12, 2021..."
7,Autódromo Internacional Nelson Piquet,"1:32.507 ( Riccardo Patrese, Williams FW12C, 1..."
8,Mugello Circuit,"1:18.833 ( Lewis Hamilton, Mercedes W11, 2020,..."
9,Imola Circuit,"1:15.484 ( Lewis Hamilton, Mercedes W11, 2020,..."


##### **Extracting information into separate columns**

> The Fastest Lap Time column combines several values: lap time, driver name, car and year, plus a category (such as F1) in some rows.
> To be able to use this information, we must separate the values into their own columns.



In [None]:
# Work on a copy of the dataframe.
# This ensures that no changes are made to the original dataframe.
extracted_data = df.copy()

In [None]:
# Some of the values contain a tab character ("\t") between or inside the key values.
# To be able to extract the information, we must first remove these discrepancies
extracted_data['Fastest Lap Time'] = extracted_data['Fastest Lap Time'].str.replace('\t', ' ', regex=False)

# There are also discrepancies in the manually entered data: some values have no closing bracket at the end
extracted_data['Fastest Lap Time'] = extracted_data['Fastest Lap Time'].apply(lambda x: x + ')' if not x.endswith(')') else x)

# Use str.extract with a regular expression to split the columns
extracted_data[['lap_time', 'driver', 'car', 'year', 'category']] = extracted_data['Fastest Lap Time'].str.extract(r'(?P<lap_time>\d{1,2}:\d{2,3}\.\d{3}) \(\s*(?P<driver>.*?)\s*,\s*(?P<car>.*?)\s*,\s*(?P<year>\d{4})(?:\[\d+\])?\s*(?:,\s*(?P<category>\w+))?\)')


# Check if it worked
print(extracted_data)

                     Circuit Name  \
0         Adelaide Street Circuit   
1                Ain-Diab Circuit   
2    Aintree Motor Racing Circuit   
3             Albert Park Circuit   
4   Algarve International Circuit   
..                            ...   
72  Okayama International Circuit   
73        Valencia Street Circuit   
74     Watkins Glen International   
75             Yas Marina Circuit   
76               Zeltweg Air Base   

                                     Fastest Lap Time  lap_time  \
0    1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)  1:15.381   
1   2:22.5 ( Stirling Moss, Vanwall VW 5, 1958[1],...       NaN   
2              1:51.8 (Jim Clark, Lotus 25, 1963, F1)       NaN   
3   1:19.813 ( Charles Leclerc, Ferrari SF-24, 202...  1:19.813   
4   1:18.750 (Lewis Hamilton, Mercedes W11, 2020, F1)  1:18.750   
..                                                ...       ...   
72  1:14.023 ( Michael Schumacher, Benetton B194, ...  1:14.023   
73     1:38.683 ( T

In [None]:
extracted_data.head(60)

Unnamed: 0,Circuit Name,Fastest Lap Time,lap_time,driver,car,year,category
0,Adelaide Street Circuit,"1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)",1:15.381,Damon Hill,Williams FW15C,1993.0,F1
1,Ain-Diab Circuit,"2:22.5 ( Stirling Moss, Vanwall VW 5, 1958[1],...",,,,,
2,Aintree Motor Racing Circuit,"1:51.8 (Jim Clark, Lotus 25, 1963, F1)",,,,,
3,Albert Park Circuit,"1:19.813 ( Charles Leclerc, Ferrari SF-24, 202...",1:19.813,Charles Leclerc,Ferrari SF-24,2024.0,F1
4,Algarve International Circuit,"1:18.750 (Lewis Hamilton, Mercedes W11, 2020, F1)",1:18.750,Lewis Hamilton,Mercedes W11,2020.0,F1
5,Circuito do Estoril,"1:22.446 ( David Coulthard, Williams FW16B, 19...",1:22.446,David Coulthard,Williams FW16B,1994.0,F1
6,Autódromo Hermanos Rodríguez,"1:17.774 ( Valtteri Bottas, Mercedes W12, 2021...",1:17.774,Valtteri Bottas,Mercedes W12,2021.0,F1
7,Autódromo Internacional Nelson Piquet,"1:32.507 ( Riccardo Patrese, Williams FW12C, 1...",1:32.507,Riccardo Patrese,Williams FW12C,1989.0,F1
8,Mugello Circuit,"1:18.833 ( Lewis Hamilton, Mercedes W11, 2020,...",1:18.833,Lewis Hamilton,Mercedes W11,2020.0,F1
9,Imola Circuit,"1:15.484 ( Lewis Hamilton, Mercedes W11, 2020,...",1:15.484,Lewis Hamilton,Mercedes W11,2020.0,F1


In [None]:
# The dataframe still has some null values after the regex extraction.
# On closer inspection, these come from rows that do not fit the pattern,
# for example lap times with fewer than three decimal places (such as 2:22.5)
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Circuit Name      77 non-null     object
 1   Fastest Lap Time  77 non-null     object
 2   lap_time          68 non-null     object
 3   driver            68 non-null     object
 4   car               68 non-null     object
 5   year              68 non-null     object
 6   category          51 non-null     object
dtypes: object(7)
memory usage: 4.3+ KB
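
Several of the rows that failed to parse can be traced to the pattern itself: it requires exactly three digits after the decimal point, so a lap time such as `1:51.8` does not match. The sketch below reproduces this with the standard `re` module on two strings from the table above. The relaxed variant (one to three decimal digits) and the helper `build_pattern` are assumptions for illustration, not the pattern used in this notebook.

```python
import re


def build_pattern(decimals):
    """Build the lap-record pattern with a configurable decimals part."""
    return re.compile(
        r'(?P<lap_time>\d{1,2}:\d{2,3}\.' + decimals + r') \(\s*(?P<driver>.*?)\s*,'
        r'\s*(?P<car>.*?)\s*,\s*(?P<year>\d{4})(?:\[\d+\])?\s*(?:,\s*(?P<category>\w+))?\)'
    )


# Same pattern as in the cell above: exactly three decimal digits
STRICT = build_pattern(r'\d{3}')
# Hypothetical relaxed variant: one to three decimal digits
RELAXED = build_pattern(r'\d{1,3}')

samples = [
    '1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)',
    '1:51.8 (Jim Clark, Lotus 25, 1963, F1)',
]

for text in samples:
    strict_match = STRICT.search(text)
    relaxed_match = RELAXED.search(text)
    print(text)
    print('  strict :', strict_match.groupdict() if strict_match else None)
    print('  relaxed:', relaxed_match.groupdict() if relaxed_match else None)
```

A relaxed pattern would not fix rows where a tab replaced a comma between the driver and the car, so some manual corrections would likely still be needed.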


In [None]:
# Manually updating the rows with appropriate values, as it is the most time-efficient way
extracted_data.loc[1, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '2:22.5', 'Stirling Moss', 'Vanwall VW 5', 1958, 'F1']

extracted_data.loc[2, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '1:51.8', 'Jim Clark', 'Lotus 25', 1963, 'F1']

extracted_data.loc[11, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '1:21.046', 'Rubens Barrichello', 'Ferrari F2004', 2004, 'F1']

extracted_data.loc[19, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '1:19.639', 'Michele Alboreto', 'Tyrrell 011', 1982, 'F1']

extracted_data.loc[21, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '2:39.7', 'Juan Manuel Fangio', 'Mercedes-Benz W196', 1954, 'F1']

extracted_data.loc[36, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '2:05.07', 'Stirling Moss', 'Cooper T51', 1959, 'F1']

extracted_data.loc[51, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '1:28.330', 'Niki Lauda', 'McLaren MP4/1C', 1983, 'F1']

extracted_data.loc[63, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '1:56.3', 'Jack Brabham', 'Cooper Climax', 1960, 'F1']

extracted_data.loc[66, ['lap_time', 'driver', 'car', 'year', 'category']] = [
    '3:05.0', 'Maurice Trintignant', 'Cooper T51', 1959, 'F1']


In [None]:
extracted_data.tail(60)

Unnamed: 0,Circuit Name,Fastest Lap Time,lap_time,driver,car,year,category
17,Buddh International Circuit,"1:27.249 ( Sebastian Vettel, Red Bull RB7, 201...",1:27.249,Sebastian Vettel,Red Bull RB7,2011,F1
18,Circuit de la Sarthe,"1:36.700 (Graham Hill, Lotus 49, 1967)",1:36.700,Graham Hill,Lotus 49,1967,
19,Caesars Palace Grand Prix,"1:19.639 (Michele Alboreto Tyrrell 011, 1982)",1:19.639,Michele Alboreto,Tyrrell 011,1982,F1
20,Circuit de Charade,"2:53.900 ( Chris Amon, Matra MS120D, 1972, F1)",2:53.900,Chris Amon,Matra MS120D,1972,F1
21,Circuit Bremgarten,"2:39.7 (Juan Manuel Fangio, Mercedes-Benz W196...",2:39.7,Juan Manuel Fangio,Mercedes-Benz W196,1954,F1
22,Circuit de Barcelona-Catalunya,"1:16.330 (Max Verstappen, Red Bull Racing RB19...",1:16.330,Max Verstappen,Red Bull Racing RB19,2023,F1
23,Circuit de Monaco,"1:12.909 ( Lewis Hamilton, Mercedes W12, 2021,...",1:12.909,Lewis Hamilton,Mercedes W12,2021,F1
24,Circuit de Nevers Magny-Cours,"1:15.377 (Michael Schumacher, Ferrari F2004, 2...",1:15.377,Michael Schumacher,Ferrari F2004,2004,
25,Pedralbes Circuit,"2:20.400 ( Alberto Ascari, Lancia D50, 1954, F1)",2:20.400,Alberto Ascari,Lancia D50,1954,F1
26,Reims-Gueux,"2:41.000 ( Juan Manuel Fangio, Maserati A6GCM,...",2:41.000,Juan Manuel Fangio,Maserati A6GCM,1953,F1


In [None]:
# Null values in the category column are due to missing data and not formatting errors
# This is not an issue as the column is redundant and can be removed in the future
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Circuit Name      77 non-null     object
 1   Fastest Lap Time  77 non-null     object
 2   lap_time          77 non-null     object
 3   driver            77 non-null     object
 4   car               77 non-null     object
 5   year              77 non-null     object
 6   category          60 non-null     object
dtypes: object(7)
memory usage: 4.3+ KB


# Combining the Datasets

In [None]:
# The table wasn't saved to a dataframe during web scraping, so we do it now in order to combine the 2 dataframes.

# Print the number of tables found
print(f"Number of tables found: {len(tables)}")

circuits_table = tables[2]

# Display the DataFrame
circuits_table

Number of tables found: 4


Unnamed: 0,Circuit,Map,Type,Direction,Location,Country,Last length used,Turns,Grands Prix,Season(s),Grands Prix held
0,Adelaide Street Circuit,,Street circuit,Clockwise,Adelaide,Australia,3.780 km (2.349 mi),16,Australian Grand Prix,1985–1995,11
1,Ain-Diab Circuit,,Road circuit,Clockwise,Casablanca,Morocco,7.618 km (4.734 mi),18,Moroccan Grand Prix,1958,1
2,Aintree Motor Racing Circuit,,Road circuit,Clockwise,Aintree,United Kingdom,4.828 km (3.000 mi),12,British Grand Prix,"1955, 1957, 1959, 1961–1962",5
3,Albert Park Circuit *,,Street circuit,Clockwise,Melbourne,Australia,5.278 km (3.280 mi),16,Australian Grand Prix,"1996–2019, 2022–2024",27
4,Algarve International Circuit,,Race circuit,Clockwise,Portimão,Portugal,4.653 km (2.891 mi),15,Portuguese Grand Prix,2020–2021,2
...,...,...,...,...,...,...,...,...,...,...,...
72,TI Circuit Aida,,Race circuit,Clockwise,Mimasaka,Japan,3.703 km (2.301 mi),11,Pacific Grand Prix,1994–1995,2
73,Valencia Street Circuit,,Street circuit,Clockwise,Valencia,Spain,5.419 km (3.367 mi),25,European Grand Prix,2008–2012,5
74,Watkins Glen International,,Race circuit,Clockwise,Watkins Glen,United States,5.430 km (3.374 mi),10,United States Grand Prix,1961–1980,20
75,Yas Marina Circuit *,,Race circuit,Anti-clockwise,Abu Dhabi,United Arab Emirates,5.281 km (3.281 mi),15,Abu Dhabi Grand Prix,2009–2024,16


In [None]:
combined_df = pd.concat([circuits_table, extracted_data], axis=1)
combined_df

Unnamed: 0,Circuit,Map,Type,Direction,Location,Country,Last length used,Turns,Grands Prix,Season(s),Grands Prix held,Circuit Name,Fastest Lap Time,lap_time,driver,car,year,category
0,Adelaide Street Circuit,,Street circuit,Clockwise,Adelaide,Australia,3.780 km (2.349 mi),16,Australian Grand Prix,1985–1995,11,Adelaide Street Circuit,"1:15.381 ( Damon Hill, Williams FW15C, 1993, F1)",1:15.381,Damon Hill,Williams FW15C,1993,F1
1,Ain-Diab Circuit,,Road circuit,Clockwise,Casablanca,Morocco,7.618 km (4.734 mi),18,Moroccan Grand Prix,1958,1,Ain-Diab Circuit,"2:22.5 ( Stirling Moss, Vanwall VW 5, 1958[1],...",2:22.5,Stirling Moss,Vanwall VW 5,1958,F1
2,Aintree Motor Racing Circuit,,Road circuit,Clockwise,Aintree,United Kingdom,4.828 km (3.000 mi),12,British Grand Prix,"1955, 1957, 1959, 1961–1962",5,Aintree Motor Racing Circuit,"1:51.8 (Jim Clark, Lotus 25, 1963, F1)",1:51.8,Jim Clark,Lotus 25,1963,F1
3,Albert Park Circuit *,,Street circuit,Clockwise,Melbourne,Australia,5.278 km (3.280 mi),16,Australian Grand Prix,"1996–2019, 2022–2024",27,Albert Park Circuit,"1:19.813 ( Charles Leclerc, Ferrari SF-24, 202...",1:19.813,Charles Leclerc,Ferrari SF-24,2024,F1
4,Algarve International Circuit,,Race circuit,Clockwise,Portimão,Portugal,4.653 km (2.891 mi),15,Portuguese Grand Prix,2020–2021,2,Algarve International Circuit,"1:18.750 (Lewis Hamilton, Mercedes W11, 2020, F1)",1:18.750,Lewis Hamilton,Mercedes W11,2020,F1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,TI Circuit Aida,,Race circuit,Clockwise,Mimasaka,Japan,3.703 km (2.301 mi),11,Pacific Grand Prix,1994–1995,2,Okayama International Circuit,"1:14.023 ( Michael Schumacher, Benetton B194, ...",1:14.023,Michael Schumacher,Benetton B194,1994,F1
73,Valencia Street Circuit,,Street circuit,Clockwise,Valencia,Spain,5.419 km (3.367 mi),25,European Grand Prix,2008–2012,5,Valencia Street Circuit,"1:38.683 ( Timo Glock, Toyota TF109, 2009, F1)",1:38.683,Timo Glock,Toyota TF109,2009,F1
74,Watkins Glen International,,Race circuit,Clockwise,Watkins Glen,United States,5.430 km (3.374 mi),10,United States Grand Prix,1961–1980,20,Watkins Glen International,"1:34.068 ( Alan Jones, Williams FW07B, 1980, F1)",1:34.068,Alan Jones,Williams FW07B,1980,F1
75,Yas Marina Circuit *,,Race circuit,Anti-clockwise,Abu Dhabi,United Arab Emirates,5.281 km (3.281 mi),15,Abu Dhabi Grand Prix,2009–2024,16,Yas Marina Circuit,"1:25.637 ( Kevin Magnussen, Haas VF-24, 2024, F1)",1:25.637,Kevin Magnussen,Haas VF-24,2024,F1
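
`pd.concat(..., axis=1)` places the two DataFrames side by side by matching their row index, so the result is only correct if both are in the same circuit order. The sketch below shows one way to sanity-check this on sample rows; the regular expression that strips the trailing ` *` marker and the `mismatches` check are assumptions for illustration.

```python
import pandas as pd

# Two small stand-ins for circuits_table and extracted_data
left = pd.DataFrame({"Circuit": ["Adelaide Street Circuit", "Albert Park Circuit *"]})
right = pd.DataFrame({"Circuit Name": ["Adelaide Street Circuit", "Albert Park Circuit"]})

# Rows are paired by index label, not by circuit name
combined = pd.concat([left, right], axis=1)

# Strip the trailing ' *' marker before comparing the two name columns
normalised = combined["Circuit"].str.replace(r"\s*\*$", "", regex=True)
mismatches = combined[normalised != combined["Circuit Name"]]
print(mismatches)
```

On the real data some rows will differ legitimately, since the two sources name a few circuits differently (for example `TI Circuit Aida` and `Okayama International Circuit`), so the list of mismatches is something to review by eye, not an error in itself.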


In [None]:
# Dropping the columns that I deem unnecessary
# There are 2 columns with circuit names; although both aren't necessary, I have left them in as they differ slightly.
combined_df.drop(columns=['Fastest Lap Time', 'Map', 'category'], inplace=True)

In [None]:
combined_df.head(50)

Unnamed: 0,Circuit,Type,Direction,Location,Country,Last length used,Turns,Grands Prix,Season(s),Grands Prix held,Circuit Name,lap_time,driver,car,year
0,Adelaide Street Circuit,Street circuit,Clockwise,Adelaide,Australia,3.780 km (2.349 mi),16,Australian Grand Prix,1985–1995,11,Adelaide Street Circuit,1:15.381,Damon Hill,Williams FW15C,1993
1,Ain-Diab Circuit,Road circuit,Clockwise,Casablanca,Morocco,7.618 km (4.734 mi),18,Moroccan Grand Prix,1958,1,Ain-Diab Circuit,2:22.5,Stirling Moss,Vanwall VW 5,1958
2,Aintree Motor Racing Circuit,Road circuit,Clockwise,Aintree,United Kingdom,4.828 km (3.000 mi),12,British Grand Prix,"1955, 1957, 1959, 1961–1962",5,Aintree Motor Racing Circuit,1:51.8,Jim Clark,Lotus 25,1963
3,Albert Park Circuit *,Street circuit,Clockwise,Melbourne,Australia,5.278 km (3.280 mi),16,Australian Grand Prix,"1996–2019, 2022–2024",27,Albert Park Circuit,1:19.813,Charles Leclerc,Ferrari SF-24,2024
4,Algarve International Circuit,Race circuit,Clockwise,Portimão,Portugal,4.653 km (2.891 mi),15,Portuguese Grand Prix,2020–2021,2,Algarve International Circuit,1:18.750,Lewis Hamilton,Mercedes W11,2020
5,Autódromo do Estoril,Race circuit,Clockwise,Estoril,Portugal,4.360 km (2.709 mi),13,Portuguese Grand Prix,1984–1996,13,Circuito do Estoril,1:22.446,David Coulthard,Williams FW16B,1994
6,Autódromo Hermanos Rodríguez *,Race circuit,Clockwise,Mexico City,Mexico,4.304 km (2.674 mi),17,"Mexican Grand Prix, Mexico City Grand Prix","1963–1970, 1986–1992, 2015–2019, 2021–2024",24,Autódromo Hermanos Rodríguez,1:17.774,Valtteri Bottas,Mercedes W12,2021
7,Autódromo Internacional do Rio de Janeiro,Race circuit,Anti-clockwise,Rio de Janeiro,Brazil,5.031 km (3.126 mi),11,Brazilian Grand Prix,"1978, 1981–1989",10,Autódromo Internacional Nelson Piquet,1:32.507,Riccardo Patrese,Williams FW12C,1989
8,Autodromo Internazionale del Mugello,Race circuit,Clockwise,Scarperia e San Piero,Italy,5.245 km (3.259 mi),14,Tuscan Grand Prix,2020,1,Mugello Circuit,1:18.833,Lewis Hamilton,Mercedes W11,2020
9,Autodromo Internazionale Enzo e Dino Ferrari *,Race circuit,Anti-clockwise,Imola,Italy,4.909 km (3.050 mi),17,"Italian Grand Prix, San Marino Grand Prix, Emi...","1980–2006, 2020–2022, 2024",31,Imola Circuit,1:15.484,Lewis Hamilton,Mercedes W11,2020


In [None]:
combined_df.to_csv('formula_one_circuits.csv', index=False)