# Exploration of Top 8
In this notebook we want to:
- Filter out tournaments whose sets do not use the canonical sets_df['location_names'] labels.
- Label the top 8 sets of a tournament.
- Determine the bracket, i.e. which loser of a winners-side set plays which winner of a losers-side set.

We are also interested in:
- How often does a grand finals reset occur?
- How often does the winner of the loser's finals win the tournament?
- How often does a player coming into the top 8 from losers win the tournament?
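
As a rough illustration of the first question, here is a minimal sketch of how a grand finals reset rate could be computed. It runs on a small made-up DataFrame (`toy_sets` is hypothetical, not the real `sets_df`) and assumes that a reset shows up as a set whose first location name is `'GFR'`.

```python
import pandas as pd

# Toy data only: three tournaments, one of which has a grand finals reset.
toy_sets = pd.DataFrame({
    "tournament_key": ["t1", "t1", "t2", "t3", "t3"],
    "location_names": [
        ["GF", "Grand Final", "Grand Final"],
        ["GFR", "GF Reset", "Grand Final Reset"],
        ["GF", "Grand Final", "Grand Final"],
        ["GF", "Grand Final", "Grand Final"],
        ["WF", "Winners Final", "Winners Final"],
    ],
})

# Take the short label (first entry) of each location_names list.
short_name = toy_sets["location_names"].apply(lambda names: names[0])

# For each tournament, check whether any of its sets is a GF reset.
has_reset = (short_name == "GFR").groupby(toy_sets["tournament_key"]).any()

# Share of tournaments with a reset.
reset_rate = has_reset.mean()
print(f"{reset_rate:.1%} of toy tournaments had a grand finals reset")
```

The same pattern should carry over to the real data once the tournaments have been filtered, provided the `'GFR'` label is used consistently.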

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from collections import defaultdict
import datetime

import sqlite3
import sys
import time
from tqdm.auto import tqdm
import pickle
import joblib
import os

if os.path.exists('/workspace/data'):
    # Use the workspace data directory when it exists
    data_path = '/workspace/data/'
else:
    data_path = '../data/'


## Loading SQLite Database into Pandas DataFrames

The following code connects to an SQLite database (`melee_player_database.db`) and converts each table within the database into a pandas DataFrame. The DataFrames will be stored in a dictionary, where each key corresponds to the table name with `_df` appended, and the values are the respective DataFrames.

### Steps:

1. **Database Connection**: We use the `sqlite3` library to connect to the SQLite database file.
2. **Retrieve Table Names**: A query retrieves all the table names in the database.
3. **Convert Tables to DataFrames**: For each table:
   - The table is loaded into a pandas DataFrame using `pd.read_sql()`.
   - We check each column to see if any data is JSON-formatted (lists or dictionaries). If so, we convert these columns from strings into their corresponding Python objects using `json.loads()`.
4. **Store DataFrames**: The DataFrames are stored in a dictionary, where the key is the table name with a `_df` suffix, and the value is the DataFrame.
5. **Database Connection Closed**: Once all tables are loaded into DataFrames, the database connection is closed.

### Example:
If the database contains a table named `players`, the corresponding DataFrame will be stored in the dictionary with the key `players_df`, and can be accessed as:

```python
players_df = dfs['players_df']
```


In [69]:
# Function to get the table names
def get_table_names(conn):
    query = "SELECT name FROM sqlite_master WHERE type='table';"
    return pd.read_sql(query, conn)['name'].tolist()

# Function to load tables into DataFrames
def load_tables_to_dfs(conn):
    table_names = get_table_names(conn)
    dataframes = {}
    
    for table in table_names:
        # Load table into a DataFrame
        df = pd.read_sql(f"SELECT * FROM {table}", conn)
        
        # Detect and convert JSON formatted columns (if any)
        for col in df.columns:
            # Only attempt JSON parsing when every entry in the column is a string
            if df[col].apply(lambda x: isinstance(x, str)).all():
                try:
                    # Try parsing the column as JSON
                    df[col] = df[col].apply(lambda x: json.loads(x) if pd.notnull(x) else x)
                except (json.JSONDecodeError, TypeError):
                    # If it fails, skip the column
                    pass
        
        # Store the DataFrame with table name + '_df'
        dataframes[f"{table}_df"] = df
        
    return dataframes

if os.path.exists(data_path + 'dfs_dict.pkl'):
    cell_has_run = True
    # Load the dictionary of DataFrames from the pickle
    with open(data_path + 'dfs_dict.pkl', 'rb') as f:
        dfs = pickle.load(f)
# Check if the flag variable exists in the global scope so that this code does not run twice
if 'cell_has_run' not in globals():
    path = data_path + "melee_player_database.db"
    
    # Connect to the database
    conn = sqlite3.connect(path)

    # Convert each table into a DataFrame
    dfs = load_tables_to_dfs(conn)

    # Close the connection
    conn.close()

    # Now, you have a dictionary 'dfs' where each key is the table name with '_df' suffix and value is the corresponding DataFrame.
    # For example, to access the DataFrame for a table called 'players':
    # players_df = dfs['players_df']

    dfs['tournament_info_df']['start'] = pd.to_datetime(dfs['tournament_info_df']['start'], unit='s')
    dfs['tournament_info_df']['end'] = pd.to_datetime(dfs['tournament_info_df']['end'], unit='s')

    
    # Set the flag to indicate that the cell has been run
    cell_has_run = True

### Here we adjust the data types of the dataframes so that they are the correct type. (This will be updated as needed.)

In [70]:
dfs['sets_df']['best_of'] = dfs['sets_df']['best_of'].fillna(0).astype(int) 

In [71]:
# # Save the dictionary of DataFrames as a pickle
# with open(data_path + 'dfs_dict.pkl', 'wb') as f:
#     pickle.dump(dfs, f)

### Here we make dataframes that we will use and print the head.

The integers in 'characters' count the number of games the player has played as that character. (We verify this for Zain below.)
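
Since each 'characters' entry maps a character key to a game count, a player's most-played character can be read off directly. The helper below is a hypothetical sketch (`most_played` is not part of the notebook), and the example dictionary only uses the two counts visible in Zain's row.

```python
def most_played(characters):
    """Return the character key with the highest game count, or None."""
    # Some players have no character data, so guard against missing values.
    if not isinstance(characters, dict) or not characters:
        return None
    return max(characters, key=characters.get)

# Illustrative dictionary; the real entry contains more characters.
toy_characters = {"melee/marth": 1065, "melee/pichu": 1}
print(most_played(toy_characters))
print(most_played(None))
```

Applied to the real data, this could be used as `players_df['characters'].apply(most_played)`.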

In [72]:
players_df = dfs['players_df']
players_df.head()


Unnamed: 0,game,player_id,tag,all_tags,prefixes,social,country,state,region,c_country,c_state,c_region,placings,characters,alias
0,melee,Rishi,Rishi,[Rishi],[],{'twitter': []},,,,,,,[{'key': 'mdva-invitational-2017-(challonge-mi...,,
1,melee,15634,lloD,"[lloD, VGz | lloD, Llod]",[],{'twitter': ['lloD74']},United States,VA,,US,CA,Laurel,[{'key': 'mdva-invitational-2017-(challonge-mi...,"{'melee/peach': 1089, 'melee/falco': 1, 'melee...",
2,melee,6126,Zain,"[Zain, DontTestMe]",[PG],{'twitter': ['PG_Zain']},United States,VA,,US,CA,Los Angeles,[{'key': 'mdva-invitational-2017-(challonge-mi...,"{'melee/marth': 1065, 'melee/pichu': 1, 'melee...",DontTestMe
3,melee,Chu,Chu,[Chu],[],{'twitter': []},,,,,,,[{'key': 'mdva-invitational-2017-(challonge-mi...,,
4,melee,5620,Junebug,"[Junebug, LS | VGz Junebug]",[],{'twitter': ['arJunebug']},United States,VA,,US,VA,Richmond,[{'key': 'mdva-invitational-2017-(challonge-mi...,"{'melee/sheik': 46, 'melee/falco': 4, 'melee/g...",


In [73]:
ranking_df = dfs['ranking_df']
ranking_df.head()

Unnamed: 0,game,ranking_name,priority,region,seasons,tournaments,icon
0,melee,SSBMRank,0,world,"[2015, 2016, 2017, 2018, 2019]",[],miom


In [74]:
ranking_seasons_df = dfs['ranking_seasons_df']
ranking_seasons_df.head()

Unnamed: 0,game,ranking_name,season,start,end,total,by_id,by_placing,final,name
0,melee,SSBMRank,2015,1420070400,1451606399,100,"{'6189': 1, '1004': 2, '4465': 3, '1000': 4, '...","{'1': '6189', '2': '1004', '3': '4465', '4': '...",0,
1,melee,SSBMRank,2016,1451606400,1483228799,100,"{'6189': 1, '1004': 2, '1000': 3, '1003': 4, '...","{'1': '6189', '2': '1004', '3': '1000', '4': '...",0,
2,melee,SSBMRank,2017,1483228800,1514764799,100,"{'1004': 1, '6189': 2, '1000': 3, '1003': 4, '...","{'1': '1004', '2': '6189', '3': '1000', '4': '...",0,
3,melee,SSBMRank,2018,1514793600,1546329600,100,"{'1004': 1, '6189': 2, '4465': 3, '15990': 4, ...","{'1': '1004', '2': '6189', '3': '4465', '4': '...",0,
4,melee,SSBMRank,2019,1546329600,1577836800,100,"{'1004': 1, '4465': 2, '1000': 3, '16342': 4, ...","{'1': '1004', '2': '4465', '3': '1000', '4': '...",0,


In [75]:
sets_df = dfs['sets_df']
print(f"{sets_df[sets_df['game_data'].apply(lambda x: len(x) > 0)].shape[0] / sets_df.shape[0]:.1%} of sets have some game data")
sets_df.shape



32.9% of sets have some game data


(1795681, 14)

In [76]:
tournament_info_df = dfs['tournament_info_df']
print(tournament_info_df.shape)
tournament_info_df.head()


(39675, 20)


Unnamed: 0,game,key,cleaned_name,source,tournament_name,tournament_event,season,rank,start,end,country,state,city,entrants,placings,losses,bracket_types,online,lat,lng
0,melee,mdva-invitational-2017-(challonge-mirror),MDVA Invitational 2017 (Challonge Mirror),challonge,https://challonge.com/mdva_invitational_2017,,17,,2017-11-26 08:05:11,2017-11-26 08:48:09,US,VA,Fall's Church,10,"[[Rishi, 1], [15634, 3], [6126, 4], [Chu, 8], ...",{},b'{}',0,,
1,melee,s@sh7,S@SH7,challonge,https://challonge.com/sash7,,17,,2017-06-13 10:27:01,2017-06-13 10:27:01,US,MI,Ann Arbor,92,[],{},b'{}',0,,
2,melee,slippi-champions-league-week-1__melee-singles,Slippi Champions League Week 1,pgstats,slippi-champions-league-week-1,melee-singles,20,,2020-10-11 14:00:00,2020-10-11 14:00:00,,,,20,"[[1000, 1], [6126, 2], [4107, 3], [19554, 3], ...",{},b'{}',1,0.0,0.0
3,melee,slippi-champions-league-week-2__melee-singles,Slippi Champions League Week 2,pgstats,slippi-champions-league-week-2,melee-singles,20,,2020-10-18 14:00:00,2020-10-18 14:00:00,,,,20,"[[6126, 1], [4107, 2], [1000, 3], [19554, 3], ...",{},b'{}',1,0.0,0.0
4,melee,slippi-champions-league-week-3__melee-singles,Slippi Champions League Week 3,pgstats,slippi-champions-league-week-3,melee-singles,20,,2020-10-25 14:00:00,2020-10-25 14:00:00,,,,20,"[[6126, 1], [3359, 2], [19554, 3], [4107, 3], ...",{},b'{}',1,0.0,0.0


## Filter out some tournaments
We start by looking at which sets_df['location_names'] values are the most common.

In [77]:
# We use .to_string() so that we print out all the values.
print(sets_df['location_names'].value_counts().to_string())

location_names
[W1, Winners 1, Winners Round 1]                                                              218928
[L2, Losers 2, Losers Round 2]                                                                178053
[W2, Winners 2, Winners Round 2]                                                              176575
[WQF, Winners Quarters, Winners Quarter-Final]                                                171715
[L1, Losers 1, Losers Round 1]                                                                163507
[L3, Losers 3, Losers Round 3]                                                                111521
[WSF, Winners Semis, Winners Semi-Final]                                                       89587
[LQF, Losers Quarters, Losers Quarter-Final]                                                   83806
[R1, Round 1, Round 1]                                                                         60476
[R2, Round 2, Round 2]                                                      

From the value counts we see that several sets_df['location_names'] values correspond to the finals of a tournament:
- ['GF', 'Grand Final', 'Grand Final']              35523
- ['F', 'Final', 'Final']                           615
- ['Grand Finals', 'Grand Finals', 'Grand Finals']  7
- ['Grand Final', 'Grand Final', 'Grand Final']     1

We will keep only the tournaments that have a set with ['GF', 'Grand Final', 'Grand Final'] as its location names. That way the location names of all the sets in a tournament should be consistent.

In [None]:
# Filter the rows where 'location_names' exactly matches ['GF', 'Grand Final', 'Grand Final']
gf_sets_df = sets_df[sets_df['location_names'].apply(lambda x: x == ['GF', 'Grand Final', 'Grand Final'])]

# Extract the tournament keys for the Grand Finals
gf_tournament_keys = list(gf_sets_df['tournament_key'])

# Filter the sets_df to include only the sets from tournaments that had Grand Finals
valid_tournament_sets_df = sets_df[sets_df['tournament_key'].isin(gf_tournament_keys)]


location_names
[W1, Winners 1, Winners Round 1]                       213439
[L2, Losers 2, Losers Round 2]                         173671
[W2, Winners 2, Winners Round 2]                       172018
[WQF, Winners Quarters, Winners Quarter-Final]         167356
[L1, Losers 1, Losers Round 1]                         159329
[L3, Losers 3, Losers Round 3]                         108786
[WSF, Winners Semis, Winners Semi-Final]                87795
[LQF, Losers Quarters, Losers Quarter-Final]            82289
[L4, Losers 4, Losers Round 4]                          55326
[R1, Round 1, Round 1]                                  47676
[R2, Round 2, Round 2]                                  47401
[R3, Round 3, Round 3]                                  47179
[R4, Round 4, Round 4]                                  38438
[R5, Round 5, Round 5]                                  37896
[WF, Winners Final, Winners Final]                      37839
[LSF, Losers Semis, Losers Semi-Final]                 

Here is the structure of a typical top 8 bracket.
![alt text](top_8.png "Top 8 Bracket")
We need to figure out what location names correspond to which positions.



I suspect that the location names of the top 8 games are the following:
- [f"L{n}", f"Losers {n}", f"Losers Round {n}"], where n is the largest such round number in the tournament.
- ['WSF', 'Winners Semis', 'Winners Semi-Final'],
- ['LQF', 'Losers Quarters', 'Losers Quarter-Final'],
- ['WF', 'Winners Final', 'Winners Final'],
- ['LSF', 'Losers Semis', 'Losers Semi-Final'],
- ['LF', 'Losers Final', 'Losers Final'],
- ['GF', 'Grand Final', 'Grand Final'],
- ['GFR', 'GF Reset', 'Grand Final Reset']

We will test that hypothesis.
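
The `L{n}` entry is the only location name in the list that depends on the tournament. As a minimal sketch of how the largest `n` per tournament could be found, the example below uses made-up data. The `short_name` column is an assumption standing in for the first entry of `location_names`.

```python
import pandas as pd

# Toy data only: short_name stands in for location_names[0].
toy_sets = pd.DataFrame({
    "tournament_key": ["t1", "t1", "t1", "t2", "t2", "t2"],
    "short_name": ["L1", "L2", "WSF", "L3", "L4", "LQF"],
})

# Pull the round number out of labels of the form L<number>.
# Labels such as WSF or LQF do not match and become NaN.
round_no = pd.to_numeric(
    toy_sets["short_name"].str.extract(r"^L(\d+)$", expand=False)
)

# Largest numbered losers round in each tournament (NaN values are skipped).
max_losers_round = round_no.groupby(toy_sets["tournament_key"]).max()
print(max_losers_round)
```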

In [79]:
# For now we ignore the L{n} location name.
top_8_locations = [                                   
        ['WSF', 'Winners Semis', 'Winners Semi-Final'],
        ['LQF', 'Losers Quarters', 'Losers Quarter-Final'],
        ['WF', 'Winners Final', 'Winners Final'],
        ['LSF', 'Losers Semis', 'Losers Semi-Final'],
        ['LF', 'Losers Final', 'Losers Final'],
        ['GF', 'Grand Final', 'Grand Final'],
        ['GFR', 'GF Reset', 'Grand Final Reset']
    ] 

valid_tournament_sets_df[valid_tournament_sets_df['location_names'].isin(top_8_locations)]['location_names'].value_counts()

location_names
[WSF, Winners Semis, Winners Semi-Final]        87795
[LQF, Losers Quarters, Losers Quarter-Final]    82289
[WF, Winners Final, Winners Final]              37839
[LSF, Losers Semis, Losers Semi-Final]          37771
[LF, Losers Final, Losers Final]                35660
[GF, Grand Final, Grand Final]                  35523
[GFR, GF Reset, Grand Final Reset]              10817
Name: count, dtype: int64

If our hypothesis were correct, there should be the same number of sets with location_names WF, LF, and GF, because the grand final is played between the winner of the losers final and the winner of the winners final. But the counts in our filtered dataset do not match: 37839 WF, 35660 LF, and 35523 GF sets.
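
To see which tournaments cause the mismatch, one option is to count the WF, LF, and GF sets per tournament and flag every tournament that does not have exactly one of each. The sketch below does this on made-up data. As before, `toy_sets` and its `short_name` column are assumptions for illustration.

```python
import pandas as pd

# Toy data only: t1 is a regular bracket, t2 has two losers finals.
toy_sets = pd.DataFrame({
    "tournament_key": ["t1", "t1", "t1", "t2", "t2", "t2", "t2"],
    "short_name": ["WF", "LF", "GF", "WF", "LF", "LF", "GF"],
})

finals = ["WF", "LF", "GF"]

# One row per tournament, one column per finals label, values are set counts.
counts = (
    toy_sets[toy_sets["short_name"].isin(finals)]
    .groupby(["tournament_key", "short_name"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=finals, fill_value=0)
)

# Tournaments where any of the three counts differs from 1.
irregular = counts[(counts != 1).any(axis=1)]
print(irregular)
```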

In [80]:
# Note: gf_tournament_keys has one entry per GF set, so a tournament with several GF sets is counted more than once.
print('The number of tournaments in our filtered dataset is', len(gf_tournament_keys))
print()
# Display the value counts of the remaining location names.
print(valid_tournament_sets_df['location_names'].value_counts().to_string())

The number of tournaments in our filtered dataset is 35523

location_names
[W1, Winners 1, Winners Round 1]                       213439
[L2, Losers 2, Losers Round 2]                         173671
[W2, Winners 2, Winners Round 2]                       172018
[WQF, Winners Quarters, Winners Quarter-Final]         167356
[L1, Losers 1, Losers Round 1]                         159329
[L3, Losers 3, Losers Round 3]                         108786
[WSF, Winners Semis, Winners Semi-Final]                87795
[LQF, Losers Quarters, Losers Quarter-Final]            82289
[L4, Losers 4, Losers Round 4]                          55326
[R1, Round 1, Round 1]                                  47676
[R2, Round 2, Round 2]                                  47401
[R3, Round 3, Round 3]                                  47179
[R4, Round 4, Round 4]                                  38438
[R5, Round 5, Round 5]                                  37896
[WF, Winners Final, Winners Final]                      37

It seems that only a few tournaments have more than one LF set. Let's take a look at some of the tournaments with more than one losers final.

In [81]:
lf_df = valid_tournament_sets_df[valid_tournament_sets_df['location_names'].apply(lambda x: x == ['LF', 'Losers Final', 'Losers Final'])]
print(lf_df['tournament_key'].value_counts().to_string())

tournament_key
ceo-2016__melee-singles                                                                                                               33
full-bloom-3__melee-singles                                                                                                           17
smash-conference-lxix-3__melee-singles                                                                                                13
the-quarantine-series__pound-online-melee-singles                                                                                      9
wisdom-melee-2__melee-singles                                                                                                          9
pax-prime-2015__melee-singles                                                                                                          8
scrub-summit-2__melee-singles-east                                                                                                     7
scrub-summit__east-coast-m

In [90]:
# This tournament has 1 LF game

temp_df = valid_tournament_sets_df[valid_tournament_sets_df['tournament_key']=='ggs-too-sleepy-25__sleepy-singles']
temp_df[['bracket_name']].value_counts()

bracket_name
Bracket         48
Top 8           10
Name: count, dtype: int64

In [91]:
# This tournament has 2 LF games
temp_df = valid_tournament_sets_df[valid_tournament_sets_df['tournament_key']=='b-town-beatdown-90__melee-singles']
temp_df[['bracket_name']].value_counts()

bracket_name 
Final Bracket    26
Amatuers         16
Name: count, dtype: int64

In the next tournament, the bracket of interest is the Top 64.

In [92]:
# This tournament has 33 LF games
temp_df = valid_tournament_sets_df[valid_tournament_sets_df['tournament_key']=='ceo-2016__melee-singles']
temp_df[['bracket_name']].value_counts()


bracket_name
Pools           1260
Top 64            94
Name: count, dtype: int64

Let's simply keep only the tournaments in which exactly two winners semi-final sets were played.

In [105]:
tqdm.pandas()

# Define the condition for the 'WSF' sequence in 'location_names'
wsf_condition = valid_tournament_sets_df['location_names'].progress_apply(
    lambda x: x == ['WSF', 'Winners Semis', 'Winners Semi-Final']
)

# Count occurrences of the 'WSF' sequence per tournament
wsf_counts = (
    valid_tournament_sets_df[wsf_condition]
    .groupby('tournament_key')
    .size()
    .reindex(valid_tournament_sets_df['tournament_key'].unique(), fill_value=0)
)

# Keep only the tournaments with exactly 2 WSF sets
tournaments_2_WSF = wsf_counts[wsf_counts == 2].index.tolist()

# Filter the sets to include only those from tournaments with exactly 2 WSF sets
tournament_sets_2_WSF_df = valid_tournament_sets_df[valid_tournament_sets_df['tournament_key'].isin(tournaments_2_WSF)]
print('We have', tournament_sets_2_WSF_df.shape[0], 'sets remaining after filtering.')
print()
print(tournament_sets_2_WSF_df['location_names'].value_counts().to_string())

  0%|          | 0/1689680 [00:00<?, ?it/s]

We have 1363019 sets remaining after filtering.

location_names
[W1, Winners 1, Winners Round 1]                       163568
[W2, Winners 2, Winners Round 2]                       135146
[L2, Losers 2, Losers Round 2]                         131681
[WQF, Winners Quarters, Winners Quarter-Final]         129377
[L1, Losers 1, Losers Round 1]                         119095
[L3, Losers 3, Losers Round 3]                          81588
[WSF, Winners Semis, Winners Semi-Final]                67696
[LQF, Losers Quarters, Losers Quarter-Final]            65524
[R1, Round 1, Round 1]                                  44349
[R2, Round 2, Round 2]                                  44105
[R3, Round 3, Round 3]                                  43933
[L4, Losers 4, Losers Round 4]                          41620
[R4, Round 4, Round 4]                                  35463
[R5, Round 5, Round 5]                                  35013
[GF, Grand Final, Grand Final]                          33853
[WF, Wi

In [None]:
# There should be twice as many winners semi-final sets as winners final sets
print('|WF|*2 - |WSF| =', 33850 * 2 - 67696)
print('|GF| - |WF| =', 33853 - 33850)


|WF|*2 - |WSF| = 4
|GF| - |WF| = 3


In [112]:
len(tournaments_2_WSF)


33848

In [115]:
print(tournament_sets_2_WSF_df['bracket_name'].value_counts().to_string())

bracket_name
Bracket                                                 888738
Pools                                                   111682
Round Robin Pools                                        38166
Round Robin                                              25874
Main Bracket                                             22761
Top 8                                                    22534
Elimination Bracket                                      20680
Final Bracket                                            17179
Double Elimination Bracket                               13035
bracket                                                   9284
Melee Singles                                             8350
round robin pools                                         8121
RR Pools                                                  6884
Top 16                                                    6714
Singles Bracket                                           6712
Round 1 Pools                             