#  **SM64 Speedruns - A Python Test**

\** **Ignore any redundant / unnecessary comments; they're mainly just notes for me because I am a noob :]**

I'm learning Python for data analysis / science and wanted to test what I've learned so far on a personal project, using a dataset from Kaggle (https://www.kaggle.com/code/mcpenguin/super-mario-64-speedruns-data-collection).
\
\
I also haven't used git in a while, so this doubles as a guinea pig for re-learning that.

## **Gather and Cleanse Data**

### **SQLite Connection**

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
# Learned why matplotlib gets imported twice under different aliases: once as the package itself and once as its pyplot interface.
# -- Mainly to access separate operations of the module, i.e. changing the graph's style with mpl and accessing the plotting operations with plt.
import matplotlib as mpl
import matplotlib.pyplot as plt

import csv, openpyxl
import sqlite3
import os, glob
import dateutil, datetime

# Connection Object to establish connection to a sqlite3 database.
connObj = sqlite3.connect('SPEEDRUNS.db')
cursorObj = connObj.cursor()

%load_ext sql
%sql sqlite:///SPEEDRUNS.db

In [2]:
# Assigning the datasets location for SM64 Speedruns to a variable called "repo"
repo = r'./Data/'
full_path_to_dataset = os.path.join(repo, 'ALL_CATEGORIES.csv')

# Creating the table first so the COUNT(*) below doesn't fail on a fresh database.
cursorObj.execute('''CREATE TABLE IF NOT EXISTS ALL_CAT_SPEEDRUNS (
    'run_id' INTEGER PRIMARY KEY,
    'Category' VARCHAR(20),
    'id' VARCHAR(50),
    'place' INTEGER,
    'speedrun_link' VARCHAR(200),
    'submitted_date' DATETIME,
    'primary_time_seconds' FLOAT,
    'real_time_seconds' FLOAT,
    'player_id' VARCHAR(50),
    'player_name' VARCHAR(50),
    'player_country' VARCHAR(50),
    'platform' CHAR(6),
    'verified' BOOL
    )''')
connObj.commit()

# Checking if there is data already in the table (mainly for testing + it's just a sqlite table)
result = %sql SELECT COUNT(*) FROM ALL_CAT_SPEEDRUNS

# Extract the count from the result set
count_result = result[0][0] if result is not None and len(result) > 0 else 0

# If there is data, remove the previously merged .csv (so it isn't picked up by the merge below) and clear the table.
if count_result > 0:
    if os.path.isfile(full_path_to_dataset):
        os.remove(full_path_to_dataset)
        %sql DELETE FROM ALL_CAT_SPEEDRUNS

 * sqlite:///SPEEDRUNS.db
Done.
 * sqlite:///SPEEDRUNS.db
2274 rows affected.
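
The same check-then-clear idea can be sketched with plain `sqlite3`, without the `%sql` magic. This is only a minimal sketch under assumptions: it uses an in-memory database and a trimmed-down, made-up table called `DEMO_RUNS` rather than the real `ALL_CAT_SPEEDRUNS`.

```python
import sqlite3

# Assumption: an in-memory database and a hypothetical DEMO_RUNS table,
# used only to illustrate the create -> count -> delete order.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create the table first so the COUNT(*) below always has something to query.
cur.execute(
    "CREATE TABLE IF NOT EXISTS DEMO_RUNS ("
    "run_id INTEGER PRIMARY KEY, player_name TEXT)"
)
cur.executemany(
    "INSERT INTO DEMO_RUNS (player_name) VALUES (?)",
    [("player_a",), ("player_b",)],
)
conn.commit()

# fetchone() returns a single tuple such as (2,), so [0] is the count itself.
row_count = cur.execute("SELECT COUNT(*) FROM DEMO_RUNS").fetchone()[0]

# Only clear the table when it actually holds rows.
if row_count > 0:
    cur.execute("DELETE FROM DEMO_RUNS")
    conn.commit()

remaining = cur.execute("SELECT COUNT(*) FROM DEMO_RUNS").fetchone()[0]
print(row_count, remaining)
```

Creating the table before counting means the count never runs against a table that doesn't exist yet.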


### **Merging Separate .CSV Files Into A Single Pandas DataFrame.**

**Gathering Datasets (.CSV)**

In [3]:
# Lists files within the specified directory, in this case "repo".
files_in_repo = os.listdir(repo)

# Looping through files_in_repo and adding a file to the csv_files list only if it ends with .csv.
csv_files = [f for f in files_in_repo if f.endswith('.csv')]

# List to hold the list of dataframes / csv files.
df_list = []



---



**Appending Separate .CSV DataFrames to the "df_list" List via For Loop**

Found this online because I wasn't sure of the syntax on doing the loop, and the error handling is good, too.

It all makes sense though - here's my walkthrough:

1.   Looping through the list of `csv_files` within `repo`, assigning each to "`csv`".
2.   The path to the file is created by joining the `repo` path and the .csv filename.
3.   Creating a DataFrame (for each iteration of `csv_files`) using the `read_csv()` function and the `file_path` variable.
4.   The DataFrame is appended to the DataFrame List `df_list`.
5.   `try` / `except` = error handling on the encoding types for the files.

In [4]:
for csv in csv_files:
    file_path = os.path.join(repo, csv)
    try:
        # Try reading the file using default UTF-8 encoding
        df = pd.read_csv(file_path)
        df_list.append(df)
    except UnicodeDecodeError:
        try:
            # If UTF-8 fails, try reading the file using UTF-16 encoding with tab separator
            df = pd.read_csv(file_path, sep='\t', encoding='utf-16')
            df_list.append(df)
        except Exception as e:
            # Learned that "f" before a string allows the use of variables (wrapped in curly braces)
            print(f"Could not read file {csv} because of error: {e}")
    except Exception as e:
        print(f"Could not read file {csv} because of error: {e}")
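
As a side note, the encoding fallback can be pulled out into a small helper. This is only a sketch under assumptions: `read_csv_with_fallback` is a hypothetical name (not something from pandas), and the two sample files are made up and written to a temporary folder just to exercise both branches.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(file_path):
    """Hypothetical helper: try UTF-8 first, then tab-separated UTF-16."""
    try:
        return pd.read_csv(file_path)
    except UnicodeDecodeError:
        return pd.read_csv(file_path, sep="\t", encoding="utf-16")


with tempfile.TemporaryDirectory() as tmp_dir:
    utf8_path = os.path.join(tmp_dir, "utf8.csv")
    utf16_path = os.path.join(tmp_dir, "utf16.csv")

    # Made-up sample rows, one file per encoding.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write("player_name,place\nplayer_a,1\n")
    with open(utf16_path, "w", encoding="utf-16") as f:
        f.write("player_name\tplace\nplayer_b,2\n".replace(",", "\t"))

    # Each file becomes one DataFrame, regardless of which branch read it.
    frames = [read_csv_with_fallback(p) for p in (utf8_path, utf16_path)]

print(len(frames))
```

Keeping the fallback in one function means the loop body only has to call it and append the result.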



---



**Concatenating the DataFrames and Saving to a Single .CSV File**

In [5]:
# Concat all data into a single DataFrame
complete_df = pd.concat(df_list, ignore_index=True)

---

### **Cleansing / Restructuring Data**

In [6]:
# Save the final result to a new .csv file.
complete_df.to_csv(full_path_to_dataset, index=False)

# Reading the merged ALL_CATEGORIES.csv file back in and storing it into a dataframe.
df = pd.read_csv(full_path_to_dataset)

In [7]:
# Dropping columns I do not need. errors='ignore' skips any column that is already gone, so this doesn't error when re-running during testing.
cols_to_drop = ['id',
                'player_id',
                'speedrun_link',
                'primary_time_seconds']

df = df.drop(columns=cols_to_drop, errors='ignore')

# Renaming the columns a bit.
df = df.rename(columns={'run_id': 'ID', 'Category': 'CATEGORY', 'player_name': 'PLAYER_NAME', 'player_country': 'COUNTRY', 'real_time_seconds': 'RUN_TIME', 'submitted_date': 'SUBMISSION_DATE', 'place': 'PLACE', 'platform': 'PLATFORM', 'verified': 'VERIFIED'})

# Some players don't have their country set up on speedrun.com, so pandas reads these as NaN. I'd rather they be blank.
# Doing this on the dataframe instead of with an UPDATE, because to_sql below replaces the whole table anyway.
df['COUNTRY'] = df['COUNTRY'].fillna('')

df.to_sql('ALL_CAT_SPEEDRUNS', connObj, if_exists='replace', index=False)

2274

In [8]:
#---------------------------[0]----[1]
cursorObj.execute('''SELECT ID, RUN_TIME FROM ALL_CAT_SPEEDRUNS''')

# Stored as a list of (ID, RUN_TIME) tuples, one per row.
run_times = cursorObj.fetchall()

for rt in run_times:
    # Set the "real_time" to a formatted time (as string). rt[1] is the RUN_TIME from the table.
    real_time = str(datetime.timedelta(seconds = rt[1]))
    # Set the "run_id" to the rt[0] value in the run_times list. Same as the above, just a different index.
    run_id = rt[0]
    # Run an UPDATE statement for each run time and update the RUN_TIME using the variables set previously.
    # The ? placeholders let sqlite3 handle the quoting instead of building the SQL string by hand.
    cursorObj.execute('''UPDATE ALL_CAT_SPEEDRUNS SET RUN_TIME = ? WHERE ID = ?;''', (real_time, run_id))
connObj.commit()

# Reordering columns in the dataframe
df = pd.read_sql('SELECT ID, CATEGORY, PLACE, PLAYER_NAME, RUN_TIME, PLATFORM, COUNTRY, SUBMISSION_DATE, VERIFIED FROM ALL_CAT_SPEEDRUNS ORDER BY ID ASC', connObj)

**Date / Time Cleanup**

I was originally going to remove the first two chars as well (the `0:` hours part) for the 0 / 1 Star runs, but then realized that there are times over an hour.

In [9]:
# I only want to do this to times that carry fractional seconds (mainly 0 and 1 Star): it drops the last three digits, leaving milliseconds.
def TrimLastThree(value):
    if '.' in value:
        return value[:-3]
    else:
        return value

# Converting 'SUBMISSION_DATE' to datetime, then formatting it back into a 'YYYY-MM-DD' string.
df['SUBMISSION_DATE'] = pd.to_datetime(df['SUBMISSION_DATE']).dt.strftime("%Y-%m-%d")

# Applying the TrimLastThree function to each value in the 'RUN_TIME' column. Learned how to do this and it's cool, I also know somewhat of lambda functions.
# Assigning the result straight back to the column updates the df dataframe in one step.
df['RUN_TIME'] = df['RUN_TIME'].apply(TrimLastThree)
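
To see why trimming three characters works, here is a small standalone sketch of what `str(datetime.timedelta(...))` produces. The two sample values in seconds are picked for illustration, and `trim_last_three` is just a local stand-in for `TrimLastThree`.

```python
import datetime


def trim_last_three(value):
    # Stand-in for TrimLastThree: only touch values with a fractional part.
    return value[:-3] if "." in value else value


# A time with fractional seconds is rendered with six microsecond digits.
short_run = str(datetime.timedelta(seconds=376.6))
# A whole number of seconds has no fractional part at all.
long_run = str(datetime.timedelta(seconds=7429))

trimmed_short = trim_last_three(short_run)
trimmed_long = trim_last_three(long_run)

print(short_run, "->", trimmed_short)  # 0:06:16.600000 -> 0:06:16.600
print(long_run, "->", trimmed_long)    # 2:03:49 -> 2:03:49
```

Dropping the last three digits turns microseconds into milliseconds, and whole-second times pass through unchanged.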

In [10]:
# Updating the table with our latest updates to the dataframe.
df.to_sql('ALL_CAT_SPEEDRUNS', connObj, if_exists='replace', index=False)

2274

## **Analyze Data (mainly a fun SQL test I made)**

*Need to figure out what meaning can be obtained from it so we can interpret it later using Matplotlib & Seaborn.*

In [11]:
# Baseline table post-cleanse. This is what is currently in ALL_CAT_SPEEDRUNS.
df

Unnamed: 0,ID,CATEGORY,PLACE,PLAYER_NAME,RUN_TIME,PLATFORM,COUNTRY,SUBMISSION_DATE,VERIFIED
0,1,0 Star,1,Suigi,0:06:16.600,N64,Canada,2023-10-27,Yes
1,2,0 Star,2,KANNO,0:06:27.380,N64,,2022-02-12,Yes
2,3,0 Star,3,cjrokokomero,0:06:28.130,N64,Italy,2023-06-19,Yes
3,4,0 Star,4,Parsee02,0:06:30.650,N64,Japan,2023-07-11,Yes
4,5,0 Star,5,Dowsky,0:06:32.150,N64,United States,2020-09-19,Yes
...,...,...,...,...,...,...,...,...,...
2269,2266,120 Star,497,Linkx2,2:03:49,N64,Germany,2020-12-05,Yes
2270,2267,120 Star,498,TPositive,2:03:51,VC,United States,2014-12-17,Yes
2271,2268,120 Star,499,NaturallyAllen,2:04:04,EMU,United States,2022-06-16,Yes
2272,2269,120 Star,500,meowmix_fan,2:04:05,VC,United States,2019-02-11,Yes


#### **Q1: Filter for only verified runs. How many non-verified runs are there?**

In [12]:
%%sql  
/*Only verified runs + the total count of all verified and non-verified runs.*/

/*
    VERIFIED: 2097
    NON-VERIFIED: 177
*/
    
SELECT *,
    (SELECT COUNT(*) FROM ALL_CAT_SPEEDRUNS WHERE VERIFIED = 'Yes') AS TOTAL_VERIFIED_COUNT,
    (SELECT COUNT(*) FROM ALL_CAT_SPEEDRUNS WHERE VERIFIED = 'No') AS TOTAL_NON_VERIFIED_COUNT
FROM ALL_CAT_SPEEDRUNS
WHERE VERIFIED = 'Yes'
ORDER BY ID DESC
LIMIT 5;

 * sqlite:///SPEEDRUNS.db
Done.


ID,CATEGORY,PLACE,PLAYER_NAME,RUN_TIME,PLATFORM,COUNTRY,SUBMISSION_DATE,VERIFIED,TOTAL_VERIFIED_COUNT,TOTAL_NON_VERIFIED_COUNT
2270,120 Star,500,Kosmic,2:04:05,N64,United States,2023-03-17,Yes,2097,177
2269,120 Star,500,meowmix_fan,2:04:05,VC,United States,2019-02-11,Yes,2097,177
2268,120 Star,499,NaturallyAllen,2:04:04,EMU,United States,2022-06-16,Yes,2097,177
2267,120 Star,498,TPositive,2:03:51,VC,United States,2014-12-17,Yes,2097,177
2266,120 Star,497,Linkx2,2:03:49,N64,Germany,2020-12-05,Yes,2097,177
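
If only the two totals are needed, a `GROUP BY` on `VERIFIED` is one possible alternative to repeating both counts on every row. This is a minimal sketch, assuming a throwaway in-memory table (`DEMO_RUNS` and its rows are made up for illustration, not taken from the dataset).

```python
import sqlite3

# Assumption: hypothetical DEMO_RUNS table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEMO_RUNS (ID INTEGER PRIMARY KEY, VERIFIED TEXT)")
conn.executemany(
    "INSERT INTO DEMO_RUNS (VERIFIED) VALUES (?)",
    [("Yes",), ("Yes",), ("No",), ("Yes",)],
)

# One row per VERIFIED value, each with its own count.
rows = conn.execute(
    "SELECT VERIFIED, COUNT(*) FROM DEMO_RUNS GROUP BY VERIFIED"
).fetchall()

# Turn [(status, count), ...] into a dict for easy lookup.
counts = dict(rows)
print(counts)
```

The result has one entry per status, so the totals can be read without scanning the full result set.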


#### **Q2: Find players who appear in more than one category.**

In [77]:
%%sql

/*
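    Using COUNT(PLAYER_NAME) with the GROUP BY so each player collapses into one row with a count.
    I originally thought any function would do here (I was just using scalar functions before), but it takes an aggregate to summarize each group.
    Note: this counts rows per player, which matches the category count only if a player has at most one run per category.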

    (going the extra mile here for more fun - will do later)
    Num of players with a RUN_COUNT of:
        5: 
        4: 
        3: 
        2: 
        1: 
*/

SELECT PLAYER_NAME, COUNT(PLAYER_NAME) AS CAT_COUNT
FROM ALL_CAT_SPEEDRUNS
GROUP BY PLAYER_NAME
HAVING COUNT(PLAYER_NAME) > 1
ORDER BY CAT_COUNT DESC;

 * sqlite:///SPEEDRUNS.db
Done.


PLAYER_NAME,CAT_COUNT
zach,5
turara32767,5
thags15,5
tanepota,5
taciturn,5
spener1122,5
smc_,5
sevenyoshi,5
scoagogo,5
sanj,5
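
For the "number of players per count" breakdown, one possible approach is to wrap the grouped query in an outer `GROUP BY`. This is only a sketch under assumptions: the in-memory `DEMO_RUNS` table, the player names, and the rows are all made up, and it uses `COUNT(DISTINCT CATEGORY)` so repeated runs in one category are not double counted.

```python
import sqlite3

# Assumption: hypothetical DEMO_RUNS table with made-up players and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEMO_RUNS (PLAYER_NAME TEXT, CATEGORY TEXT)")
conn.executemany(
    "INSERT INTO DEMO_RUNS VALUES (?, ?)",
    [
        ("player_a", "0 Star"),
        ("player_a", "16 Star"),
        ("player_a", "120 Star"),
        ("player_b", "0 Star"),
        ("player_b", "70 Star"),
        ("player_c", "1 Star"),
        ("player_d", "120 Star"),
    ],
)

# Inner query: categories per player. Outer query: players per category count.
distribution = conn.execute(
    """
    SELECT CAT_COUNT, COUNT(*) AS NUM_PLAYERS
    FROM (
        SELECT PLAYER_NAME, COUNT(DISTINCT CATEGORY) AS CAT_COUNT
        FROM DEMO_RUNS
        GROUP BY PLAYER_NAME
    )
    GROUP BY CAT_COUNT
    ORDER BY CAT_COUNT DESC
    """
).fetchall()
print(distribution)
```

Each tuple reads as (number of categories, number of players with that many categories).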


#### **Q3: Find the top 3 players in each category from each country.**

#### **Q4: How do times compare across different countries?**

#### **Q5: Which countries have the best leaderboard rankings on average?**

#### **Q6: Find the difference in times for each platform (N64, EMU, VC)**

#### **Q7: What is the greatest time gap between first and second place of each category? A.K.A., Who held the record the longest?**