# **SM64 Speedruns - A Python Test**

\** **Ignore any redundant / unnecessary comments, they're mainly just notes for me because I am a noob :]**

I'm learning Python for data analysis / data science and wanted to test what I've learned so far on a personal project, using a dataset from Kaggle (https://www.kaggle.com/code/mcpenguin/super-mario-64-speedruns-data-collection).
\
\
I haven't used git in a while either, so this is also sort of a guinea pig for re-learning that.

### **SQLite Connection**

Everything from lines 22 - 48 is there to make sure that re-running the notebook doesn't append the data to what is already present in the table.

In [24]:
import pandas as pd
import seaborn as sns
import numpy as np
# Learned why you import an aliased [module].[interface] and then the [module] itself under a different alias:
# each alias gives access to a different part of the library, e.g. mpl for changing the graph's style and plt for the plotting functions.
import matplotlib as mpl
import matplotlib.pyplot as plt

import csv, openpyxl
import sqlite3
import os, glob
import dateutil, datetime

# Connection Object to establish connection to a sqlite3 database.
connObj = sqlite3.connect('SPEEDRUNS.db')
cursorObj = connObj.cursor()

%reload_ext sql
%sql sqlite:///SPEEDRUNS.db

In [25]:
# Assigning the datasets location for SM64 Speedruns to a variable called "repo"
repo = r'./Data/'
full_path_to_dataset = os.path.join(repo, 'ALL_CATEGORIES.csv')

# Checking if there is data already in the table (mainly for testing + it's just a sqlite table)
result = %sql SELECT COUNT(*) FROM ALL_CAT_SPEEDRUNS

# Extract the count from the result set
count_result = result[0][0] if result is not None and len(result) > 0 else 0

print(count_result)
print(full_path_to_dataset)

cursorObj.execute('''CREATE TABLE IF NOT EXISTS ALL_CAT_SPEEDRUNS (
    'RUN_ID' INTEGER PRIMARY KEY,
    'Unnamed: 1' INTEGER,
    'id' VARCHAR(50),
    'place' INTEGER,
    'speedrun_link' VARCHAR(200),
    'submitted_date' DATETIME,
    'primary_time_seconds' FLOAT,
    'real_time_seconds' FLOAT,
    'player_id' VARCHAR(50),
    'player_name' VARCHAR(50),
    'player_country' VARCHAR(50),
    'platform' CHAR(6),
    'verified' BOOL
    )''')
connObj.commit()

# If there is data, delete it
if count_result > 0:
    if os.path.isfile(full_path_to_dataset):
        os.remove(full_path_to_dataset)
        %sql DELETE FROM ALL_CAT_SPEEDRUNS

 * sqlite:///SPEEDRUNS.db
Done.
0
./Data/ALL_CATEGORIES.csv
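
The same "start from an empty table" idea can be tried out with the `sqlite3` module on its own. This is only a minimal sketch, not the notebook's actual cell: it assumes an in-memory database and a hypothetical two-column table called `DEMO_RUNS`, and it creates the table before counting so the count never runs against a table that doesn't exist yet.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
# (assumption: the real notebook works against SPEEDRUNS.db on disk).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()


def reset_table(cur):
    """Create the hypothetical DEMO_RUNS table if needed, empty it,
    and return how many rows were in it before the reset."""
    cur.execute(
        "CREATE TABLE IF NOT EXISTS DEMO_RUNS "
        "(RUN_ID INTEGER PRIMARY KEY, PLACE INTEGER)"
    )
    cur.execute("SELECT COUNT(*) FROM DEMO_RUNS")
    existing = cur.fetchone()[0]
    if existing > 0:
        cur.execute("DELETE FROM DEMO_RUNS")
    return existing


first_reset = reset_table(cur)   # table was just created, nothing to delete
cur.executemany("INSERT INTO DEMO_RUNS (PLACE) VALUES (?)", [(1,), (2,), (3,)])
second_reset = reset_table(cur)  # removes the three rows inserted above

cur.execute("SELECT COUNT(*) FROM DEMO_RUNS")
remaining = cur.fetchone()[0]
conn.commit()
```

Calling `reset_table` any number of times leaves the table empty, which is the property the cell above is after.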


### **Merging Separate .CSV Files Into A Single Pandas DataFrame.**

**Gathering Datasets (.CSV)**

In [27]:
# Lists files within the specified directory, in this case "repo".
files_in_repo = os.listdir(repo)

# Looping through files_in_repo and adding each file to the csv_files list only if its name ends with .csv.
csv_files = [f for f in files_in_repo if f.endswith('.csv')]

# List to hold the list of dataframes / csv files.
df_list = []
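
Since `glob` is already imported, the same file gathering could also be done with a pattern match. A rough sketch, assuming a hypothetical helper `find_source_csvs` and made-up file names: it also skips the merged output file, because `ALL_CATEGORIES.csv` is written to the same folder as the source files and would otherwise be picked up as a source on a re-run.

```python
import glob
import os
import tempfile


def find_source_csvs(folder, output_name="ALL_CATEGORIES.csv"):
    """Return the sorted .csv paths in `folder`, skipping the merged output file.

    `find_source_csvs` and `output_name` are illustrative names, not part of
    the notebook.
    """
    pattern = os.path.join(folder, "*.csv")
    return sorted(
        path for path in glob.glob(pattern)
        if os.path.basename(path) != output_name
    )


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("120_star.csv", "16_star.csv", "ALL_CATEGORIES.csv", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    found = [os.path.basename(p) for p in find_source_csvs(demo_dir)]
```

Unlike `os.listdir`, `glob.glob` returns paths that already include the folder, so no extra `os.path.join` is needed before reading each file.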



---



**Appending Separate .CSV DataFrames to the "df_list" List via For Loop**

Found this online because I wasn't sure of the syntax on doing the loop, and the error handling is good, too.

It all makes sense though - here's my walkthrough:

1.   Looping through the list of `csv_files` within `repo`, assigning each to "`csv`".
2.   The path to the file is created by joining the `repo` path and the .csv filename.
3.   Creating a DataFrame (for each iteration of `csv_files`) using the `read_csv()` function and the `file_path` variable.
4.   The DataFrame is appended to the DataFrame List `df_list`.
5.   `try` / `except` = error handling on the encoding of the files: if reading a file as UTF-8 fails, it is retried as tab-separated UTF-16.

In [28]:
for csv in csv_files:
    file_path = os.path.join(repo, csv)
    try:
        # Try reading the file using default UTF-8 encoding
        df = pd.read_csv(file_path)
        df_list.append(df)
    except UnicodeDecodeError:
        try:
            # If UTF-8 fails, try reading the file using UTF-16 encoding with tab separator
            df = pd.read_csv(file_path, sep='\t', encoding='utf-16')
            df_list.append(df)
        except Exception as e:
            # Learned that "f" before a string allows the use of variables (wrapped in curly braces)
            print(f"Could not read file {csv} because of error: {e}")
    except Exception as e:
        print(f"Could not read file {csv} because of error: {e}")
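
The encoding fallback in the loop can also be pulled out into a small function, which makes it easy to try on its own. This is just a sketch under a few assumptions: `read_csv_with_fallback` is a made-up helper name, and the two demo files are generated on the fly rather than taken from the Kaggle dataset.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path):
    """Read a CSV as UTF-8; on a decode error retry as tab-separated UTF-16.

    Hypothetical helper mirroring the try / except in the loop above.
    """
    try:
        return pd.read_csv(path)
    except UnicodeDecodeError:
        return pd.read_csv(path, sep="\t", encoding="utf-16")


# Build one file of each kind in a throwaway folder (made-up contents).
with tempfile.TemporaryDirectory() as demo_dir:
    utf8_path = os.path.join(demo_dir, "utf8.csv")
    utf16_path = os.path.join(demo_dir, "utf16.csv")

    with open(utf8_path, "w", encoding="utf-8") as fh:
        fh.write("place,player_name\n1,Karin\n")
    with open(utf16_path, "w", encoding="utf-16") as fh:
        fh.write("place\tplayer_name\n2\tmarlene\n")

    utf8_df = read_csv_with_fallback(utf8_path)
    utf16_df = read_csv_with_fallback(utf16_path)

utf8_names = utf8_df["player_name"].tolist()
utf16_names = utf16_df["player_name"].tolist()
utf16_columns = utf16_df.columns.tolist()
```

With a helper like this, the loop body would shrink to one call plus one `append`, and any other exception would still need its own handling as in the original cell.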



---



**Concatenating the DataFrames and Saving to a Single .CSV File**

In [29]:
# Concat all data into a single DataFrame
complete_df = pd.concat(df_list, ignore_index=True)
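
To see what `ignore_index=True` changes, here is a small sketch with two made-up frames (not the real category files). Without it, each frame keeps its own 0-based index, so index labels repeat in the combined result.

```python
import pandas as pd

# Two tiny stand-in frames; the values are illustrative only.
a = pd.DataFrame({"place": [1, 2], "player_name": ["Karin", "marlene"]})
b = pd.DataFrame({"place": [1, 2], "player_name": ["Liam", "Weegee"]})

kept = pd.concat([a, b])                           # original indexes are kept
renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 index

kept_index = kept.index.tolist()        # [0, 1, 0, 1]
new_index = renumbered.index.tolist()   # [0, 1, 2, 3]
```

One thing to watch for: `pd.concat` raises a `ValueError` when given an empty list, so if every file fails to load, this step fails too.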

---

### **Cleansing / Restructuring Data With *Pandas***

In [30]:
# Save the final result to a new .csv file (appears in the G Drive folder after 15-30 sec).
complete_df.to_csv(full_path_to_dataset, index=False)

# Reading the merged ALL_CATEGORIES.csv file back in and storing it in a dataframe.
df = pd.read_csv(full_path_to_dataset)

In [31]:
# Dropping columns I do not need. The if statement only checks for one of the columns instead of listing them all; it's just there so the cell doesn't error when re-run during testing.
if 'speedrun_link' in df:
    cols_to_drop = ['Unnamed: 1',
                    'id',
                    'player_id',
                    'speedrun_link',
                    'primary_time_seconds']
else:
    cols_to_drop = []

df.drop(cols_to_drop, inplace=True, axis=1)

# Renaming the columns a bit.
df = df.rename(columns={'RUN_ID': 'ID', 'player_name': 'PLAYER_NAME', 'player_country': 'COUNTRY', 'real_time_seconds': 'RUN_TIME', 'submitted_date': 'SUBMISSION_DATE', 'place': 'PLACE', 'platform': 'PLATFORM', 'verified': 'VERIFIED'})

# Displaying the columns in an order I like. The result isn't assigned back, so df itself keeps its original column order.
df.iloc[:,[0,4,5,3,2,1,6,7]]

Unnamed: 0,ID,PLAYER_NAME,COUNTRY,RUN_TIME,SUBMISSION_DATE,PLACE,PLATFORM,VERIFIED
0,1,Karin,Japan,5808.0,2023-10-21T08:50:34Z,1,N64,Yes
1,2,marlene,,5823.0,2023-10-22T09:45:36Z,2,N64,Yes
2,3,Liam,United States,5837.0,2023-11-23T15:29:50Z,3,N64,Yes
3,4,puncayshun,United States,5854.0,2023-10-14T00:11:51Z,4,N64,Yes
4,5,Weegee,United States,5855.0,2022-11-18T22:01:40Z,5,N64,Yes
...,...,...,...,...,...,...,...,...
2511,2008,disrespectless,Germany,3186.0,2018-09-07T13:43:48Z,497,N64,Yes
2512,2009,YUKING,Japan,3187.0,2019-09-29T05:59:03Z,498,N64,Yes
2513,2010,asandal,United States,3187.0,2022-04-16T10:09:01Z,498,N64,Yes
2514,2011,Oxelput,Germany,3188.0,2023-06-10T21:33:10Z,500,N64,Yes
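
A quick sketch of the difference between displaying a reordered selection and actually changing the column order, using a made-up one-row frame called `df_demo`. Selecting with `iloc` returns a new DataFrame, so the original only changes if the result is assigned back.

```python
import pandas as pd

# One-row stand-in frame; df_demo is an illustrative name.
df_demo = pd.DataFrame({"ID": [1], "PLACE": [1], "PLAYER_NAME": ["Karin"]})

# Selecting columns by position returns a new object ...
reordered = df_demo.iloc[:, [0, 2, 1]]
reordered_cols = reordered.columns.tolist()

# ... while the original frame still has its old column order.
cols_after_display = df_demo.columns.tolist()

# Assigning the selection back is what makes the new order stick.
df_demo = df_demo[["ID", "PLAYER_NAME", "PLACE"]]
cols_after_assign = df_demo.columns.tolist()
```

Selecting by column name, as in the last step, is usually easier to read than positional indexes and doesn't break if a column is added or dropped earlier on.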


In [32]:
# Loading the dataframe into the SPEEDRUNS database using the connObj. Specified a table name of "ALL_CAT_SPEEDRUNS".
df.to_sql('ALL_CAT_SPEEDRUNS', connObj, if_exists='replace', index=False)

2516
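
The `if_exists` argument decides what happens when the table is already there. Below is a small sketch comparing `'replace'` and `'append'` across two runs, assuming an in-memory database and hypothetical table names `DEMO_REPLACE` and `DEMO_APPEND`.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
runs = pd.DataFrame({"ID": [1, 2, 3], "RUN_TIME": [5808.0, 5823.0, 5837.0]})


def row_count(conn, table):
    """Count rows in a table; `table` is only ever one of the demo names below."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Simulate running the notebook cell twice.
for _ in range(2):
    runs.to_sql("DEMO_REPLACE", conn, if_exists="replace", index=False)
    runs.to_sql("DEMO_APPEND", conn, if_exists="append", index=False)

replace_rows = row_count(conn, "DEMO_REPLACE")  # stays at 3
append_rows = row_count(conn, "DEMO_APPEND")    # doubles to 6
```

Because `'replace'` drops and recreates the table, the row count stays stable on re-runs, but the table's column types then come from the DataFrame rather than from an earlier `CREATE TABLE` statement.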

In [33]:
# Fetching each run's ID and RUN_TIME (in seconds) so the RUN_TIME column can be updated with the "real" time in H:MM:SS format
cursorObj.execute('''SELECT COUNT(*) FROM ALL_CAT_SPEEDRUNS''')
row_count = cursorObj.fetchone()[0]

cursorObj.execute('''SELECT ID, RUN_TIME FROM ALL_CAT_SPEEDRUNS''')
run_times = cursorObj.fetchall()

In [34]:
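for rt in run_times:
    # Convert the run time from seconds to an H:MM:SS string
    real_time = str(datetime.timedelta(seconds = rt[1]))
    run_id = rt[0]
    # Using ? placeholders so sqlite3 handles the quoting, rather than building the SQL with an f-string
    cursorObj.execute('''UPDATE ALL_CAT_SPEEDRUNS SET RUN_TIME = ? WHERE ID = ?;''', (real_time, run_id))
connObj.commit()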

df = pd.read_sql('SELECT * FROM ALL_CAT_SPEEDRUNS ORDER BY ID ASC', connObj)
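
A minimal sketch of the same seconds-to-`H:MM:SS` conversion, batched with `executemany` and `?` placeholders. It assumes an in-memory database, a hypothetical `DEMO_RUNS` table, and a made-up helper `format_run_time`; the two run times are the first two values shown in the earlier output.

```python
import datetime
import sqlite3


def format_run_time(seconds):
    """Turn a number of seconds into an H:MM:SS string via timedelta."""
    return str(datetime.timedelta(seconds=seconds))


conn = sqlite3.connect(":memory:")
# RUN_TIME is declared without a type so it can hold numbers, then text.
conn.execute("CREATE TABLE DEMO_RUNS (ID INTEGER PRIMARY KEY, RUN_TIME)")
conn.executemany(
    "INSERT INTO DEMO_RUNS VALUES (?, ?)",
    [(1, 5808.0), (2, 5823.0)],
)

# Build (new_value, id) pairs, then apply all updates in one call.
rows = conn.execute("SELECT ID, RUN_TIME FROM DEMO_RUNS").fetchall()
updates = [(format_run_time(seconds), run_id) for run_id, seconds in rows]
conn.executemany("UPDATE DEMO_RUNS SET RUN_TIME = ? WHERE ID = ?", updates)
conn.commit()

formatted = conn.execute(
    "SELECT ID, RUN_TIME FROM DEMO_RUNS ORDER BY ID"
).fetchall()
```

Note that once the column holds strings like `1:36:48`, it can no longer be averaged or compared as a number, so keeping the raw seconds in a separate column may be worth considering for later analysis.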

In [35]:
df

Unnamed: 0,ID,PLACE,SUBMISSION_DATE,RUN_TIME,PLAYER_NAME,COUNTRY,PLATFORM,VERIFIED
0,1,1,2023-10-21T08:50:34Z,1:36:48,Karin,Japan,N64,Yes
1,2,2,2023-10-22T09:45:36Z,1:37:03,marlene,,N64,Yes
2,3,3,2023-11-23T15:29:50Z,1:37:17,Liam,United States,N64,Yes
3,4,4,2023-10-14T00:11:51Z,1:37:34,puncayshun,United States,N64,Yes
4,5,5,2022-11-18T22:01:40Z,1:37:35,Weegee,United States,N64,Yes
...,...,...,...,...,...,...,...,...
2511,2508,497,2020-12-05T15:12:03Z,2:03:49,Linkx2,Germany,N64,Yes
2512,2509,498,2014-12-17T03:22:18Z,2:03:51,TPositive,United States,VC,Yes
2513,2510,499,2022-06-16T09:32:31Z,2:04:04,NaturallyAllen,United States,EMU,Yes
2514,2511,500,2019-02-11T01:26:52Z,2:04:05,meowmix_fan,United States,VC,Yes
