In [None]:
# SOURCE
SOURCE = "https://www.tesladeaths.com"


In [None]:
from requests_html import HTMLSession, Element

session = HTMLSession()
r = session.get(SOURCE)


Getting the `table` element that houses the data.



In [None]:
table: Element = r.html.find("#dttable", first=True)
table


Finding all column headers using the `th` tag and keeping the first 12.



In [None]:
table_headers = table.find("th")[:12]
table_headers = [row.text for row in table_headers]
table_headers


Collecting all data rows, skipping the first `tr` since it holds the headings. The last 350 rows contain no usable data; they are trimmed off further down, once everything is in a DataFrame.



In [None]:
table_data = table.find("tr")[1:]
# The first tr has all the headings


In [None]:
rows = [row.text.split("\n")[:12] for row in table_data]
for row in rows:
    print(row)


There were some challenges arranging the data in the right way.

-   ~~The URLs in the table are truncated, which means simply finding the text would not suffice.~~ Changed approach to `tr` instead of `td`, and extracted the first 12 columns, which is all we really need.

-   ~~There is more than **one** URL per row. Some numbers in the table are _hyperlinked_, so there are more URLs than rows, which makes it difficult to simply find all the `a` tags and plug them into the table at the right index while looping through all the elements in `table_data`. Speaking of which...~~

-   ~~The collection of elements in `table_data` is simply a dump of the table, not grouped by row. There are 23 columns in the table, so every 23 consecutive elements make up one row.~~
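To illustrate why working per `tr` keeps cells aligned with their row, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for `requests_html`. The `SAMPLE_HTML` snippet and the `RowCollector` class are made up for this example and are not part of the scraper above.

```python
from html.parser import HTMLParser

# Hypothetical miniature table; the real page has many more columns.
SAMPLE_HTML = """
<table id="dttable">
  <tr><th>Case #</th><th>Year</th><th>Date</th></tr>
  <tr><td>2</td><td>2014</td><td><a href="https://example.com/a">7/4/2014</a></td></tr>
  <tr><td>1</td><td>2013</td><td>4/2/2013</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of each cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")  # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (such as links) still lands in the open cell.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = RowCollector()
parser.feed(SAMPLE_HTML)
header, *data_rows = parser.rows
data_rows = [row[:2] for row in data_rows]  # keep only the leading columns
print(header)
print(data_rows)
```

Because every cell is appended to the row that is currently open, hyperlinked values never shift the columns, which is the same property the `tr`-based approach relies on.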



Mandatory conversion to DataFrame 😅



In [None]:
import pandas as pd

df = pd.DataFrame(rows, columns=table_headers)
# Converting necessary columns from str to int values
int_value_columns = df.columns[6:12]
df[int_value_columns] = (
    df[int_value_columns].apply(pd.to_numeric, errors="coerce").fillna(0).astype("int")
)

df.dtypes
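As a small aside on the `errors="coerce"` step, the sketch below shows what happens to values that cannot be parsed as numbers. The sample values are invented for illustration and do not come from the scraped table.

```python
import pandas as pd

# Hypothetical values standing in for one of the count columns.
counts = pd.Series(["1", "-", "n/a", "3"])

# Unparseable strings become NaN, which fillna(0) turns into 0
# before the cast to int.
cleaned = pd.to_numeric(counts, errors="coerce").fillna(0).astype("int")
print(cleaned.tolist())
```

Note that this treats "unknown" and "zero" as the same thing, which may or may not be what you want for a given column.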


In [None]:
df.head()


In [None]:
df.tail(15)


In [None]:
cutoff_point = df[df["Case #"] == "1"].index.values[0] + 1


In [None]:
df = df.iloc[:cutoff_point].copy()  # copy so later column assignments do not warn
df
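The cutoff logic can be tried on a toy frame. This sketch assumes, as the table appears to, that cases are listed in descending order so that case `1` is the last useful row; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical frame: three real cases followed by junk rows.
toy = pd.DataFrame({"Case #": ["3", "2", "1", "", "Total"]})

# Position of the row holding case "1", plus one so the slice includes it.
cutoff = toy.index[toy["Case #"] == "1"][0] + 1
trimmed = toy.iloc[:cutoff].copy()
print(trimmed["Case #"].tolist())
```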


Played around with the new library `Polars` 🐻‍❄️, which is supposed to be [`faaaast`](https://www.youtube.com/shorts/6E7ZGCfruaw). It is indeed, or should be in theory. `Polars` stores DataFrames in the Apache Arrow _columnar_ memory format, whereas _Pandas_ 🐼 has traditionally built its columns on NumPy arrays.



In [None]:
# import polars as pl

# pl_rows = pl.DataFrame(rows)
# pl_rows


In [None]:
df.dtypes


Converting the date strings from `mm/dd/yyyy` to `yyyy-mm-dd`. Once they are in that form, they can be converted into `datetime` objects.



In [None]:
df["Date"]


In [None]:
from datetime import datetime
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s: %(message)s", level=logging.INFO)


def convert_datestring(date_string):
    """Convert an mm/dd/yyyy string to yyyy-mm-dd, patching non-numeric parts."""
    try:
        date_object = datetime.strptime(date_string, "%m/%d/%Y")
        return date_object.strftime("%Y-%m-%d")
    except ValueError as e:
        logging.warning(f"Could not convert {date_string} into datetime object: {e}")
        date_parts = date_string.split("/")
        for index, date_part in enumerate(date_parts):
            try:
                int(date_part)
            except ValueError:
                # Replace the first non-numeric part with a placeholder and retry.
                logging.info("Replacing non-numeric date part with 12.")
                date_parts[index] = "12"
                return convert_datestring("/".join(date_parts))
        # Every part is numeric but the date is still invalid; give up.
        return None


df["Date"] = df["Date"].apply(convert_datestring)


In [None]:
df["Date"]


Now let's convert the dates into `datetime` objects.



In [None]:
df["Date"] = pd.to_datetime(df["Date"])
df["Date"]
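For comparison, `pd.to_datetime` can also parse the original `mm/dd/yyyy` strings directly. The sketch below is an alternative to the approach used above, not a drop-in equivalent: instead of substituting a placeholder, it marks unparseable dates as `NaT`. The sample strings are invented.

```python
import pandas as pd

# Hypothetical raw values; the middle one has a non-numeric month.
raw_dates = pd.Series(["1/15/2020", "?/3/2019", "12/31/2018"])

# errors="coerce" yields NaT for anything that does not match the format.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")
print(parsed)
```

Whether a placeholder date or an explicit missing value is preferable depends on how the dates are used downstream.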


`Holland` and `Netherlands` appear in the list of countries, although they are the same! Replacing the former with the latter.



In [None]:
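# Assign the result back; calling replace(..., inplace=True) on a selected
# column is unreliable in recent pandas versions.
df["Country"] = df["Country"].replace({"Holland": "Netherlands"})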


Finally! Write the data out to a `.csv` file.



In [None]:
df.to_csv("./data.csv", index=False)


The `dtypes` change when reading back from `csv`, because `read_csv` infers a data type for each column. For example, the year becomes `int64`. The inferred types can be overridden by passing the `dtype` argument to `read_csv` as a dictionary that maps column names to the desired `dtype`.



In [None]:
df = pd.read_csv("./data.csv")
df.dtypes


In [None]:
df = pd.read_csv("./data.csv", dtype={"Case #": str, "Year": str, "Date": str})
df.dtypes
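The difference between inferred and explicit types can be reproduced without touching the disk. This is a minimal sketch using an in-memory CSV; the `csv_text` content is hypothetical, and `parse_dates` is shown as one optional way to get the dates back as `datetime64` rather than strings.

```python
import io

import pandas as pd

# Hypothetical CSV content with the same kind of columns as data.csv.
csv_text = "Case #,Year,Date\n2,2014,2014-07-04\n1,2013,2013-04-02\n"

# Let pandas infer the types: Year comes back as an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Override the inference for selected columns and parse the dates.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Case #": str, "Year": str},
    parse_dates=["Date"],
)
print(inferred.dtypes)
print(explicit.dtypes)
```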


In [None]:
# Last updated on
import re

# Raw string so that \d is passed to the regex engine unchanged.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
html = r.html.find("em")

last_updated_on = []
for em in html:
    match = pattern.search(em.text)
    if match:
        last_updated_on.append(match.group())

last_updated_on
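The date pattern can be checked in isolation. The strings in `em_texts` below are made up to resemble what an `em` element might contain; they are not taken from the page.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical <em> texts: two contain an ISO-style date, one does not.
em_texts = ["Updated: 2023-01-05", "Tesla Deaths", "as of 2022-12-31 (archived)"]

# Keep the first date found in each text, skipping texts without a match.
found = [m.group() for text in em_texts if (m := DATE_PATTERN.search(text))]
print(found)
```

The pattern only checks the shape of the string, so something like `9999-99-99` would also match; parsing the result with `datetime.strptime` would be one way to validate it.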
