# Label Majors
In this notebook we will label the major tournaments based on ``https://liquipedia.net/smash/Major_Tournaments/Melee``

In [None]:
import os
import re
from datetime import datetime

import pandas as pd

if os.path.exists('/workspace/data_2'):
    # Use the workspace data directory when it exists
    data_path = '/workspace/data_2/'
else:
    data_path = '../data/'


### Load data
Load some data we extracted in ``jaspar_label_0_extract_data.ipynb``


In [None]:
tournament_info_df = pd.read_pickle(data_path + 'tournament_info_df.pkl')
print(tournament_info_df.shape)
tournament_info_df.head()


We copied the information from Liquipedia into a spreadsheet and saved it as a CSV, which we load as a dataframe.

In [None]:
majors_df = pd.read_csv('melee_majors.csv')

# Ignore the most recent tournaments that are missing from our dataset.
majors_df = majors_df.iloc[6:]

print(f"There are {majors_df.shape[0]} major tournaments to label.")
majors_df.head()

### Clean Up the Tournament Names
We can see from the head of ``majors_df`` that the tournament names in ``majors_df['Tournament']`` contain duplicated phrases. We need to clean the tournament names to remove the duplicated phrases.

In [None]:
tournament_list = list(majors_df['Tournament'])

# Function to remove duplicate phrases
def remove_duplicate_phrases(name):
    # Split the name into words
    words = name.split()
    # If the first i words are immediately repeated, drop the first copy
    for i in range(1, len(words)):
        if words[:i] == words[i:2*i]:
            return ' '.join(words[i:])
    return name

# Clean the tournament names
cleaned_tournament_list = [remove_duplicate_phrases(name) for name in tournament_list]

print("Cleaned Tournament Names:")
for original, cleaned in zip(tournament_list, cleaned_tournament_list):
    print(f"Original: {original}")
    print(f"Cleaned: {cleaned}")
    print()
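
To make the behaviour of ``remove_duplicate_phrases`` concrete, here is a small self-contained sketch that repeats the function and runs it on three made-up names. The input strings are hypothetical and are not rows from ``melee_majors.csv``.

```python
def remove_duplicate_phrases(name):
    """Drop the first copy of a phrase that is repeated at the start of name."""
    words = name.split()
    for i in range(1, len(words)):
        # Compare the first i words with the i words that follow them
        if words[:i] == words[i:2 * i]:
            return ' '.join(words[i:])
    return name

# Hypothetical inputs, chosen only to illustrate the three cases
doubled = remove_duplicate_phrases('Tipped Off 15 Tipped Off 15')
prefix_doubled = remove_duplicate_phrases('GENESIS GENESIS X')
untouched = remove_duplicate_phrases('Smash Summit 14')

print(doubled)
print(prefix_doubled)
print(untouched)
```

Note that a name which legitimately begins with a repeated word would also be shortened, so the printed original/cleaned pairs are worth checking by eye.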

In [None]:
# Function to clean tournament names
def clean_tournament_name(name):
    # Remove special characters, convert to lowercase, remove extra spaces
    # if ':' in name:
    #     name = name.split(":")[0]
    # if '-' in name:
    #     name = name.split("-")[0]
    name = re.sub(r'[^a-zA-Z0-9\s]', '', name)
    name = name.lower()
    name = re.sub(r'\s+', ' ', name).strip()
    return name

# Clean the major tournament names
major_tournaments_cleaned = [clean_tournament_name(t) for t in cleaned_tournament_list]

# Clean the 'cleaned_name' column in your DataFrame
tournament_info_df['cleaned_name_cleaned'] = tournament_info_df['cleaned_name'].apply(clean_tournament_name)

# Create the 'major' column
tournament_info_df['major'] = tournament_info_df['cleaned_name_cleaned'].isin(major_tournaments_cleaned)

# Verify the results
majors_in_df = tournament_info_df[tournament_info_df['major']]
print(f"We found {majors_in_df.shape[0]} majors and are missing (at least) {majors_df.shape[0]-majors_in_df.shape[0]}.")
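
The matching above relies on both name sources normalising to exactly the same string. The sketch below repeats ``clean_tournament_name`` and applies it to made-up names (assumptions, not values from either dataset) to show one case that matches and one that does not.

```python
import re

def clean_tournament_name(name):
    """Strip non-alphanumeric characters, lowercase, and collapse whitespace."""
    name = re.sub(r'[^a-zA-Z0-9\s]', '', name)
    name = name.lower()
    return re.sub(r'\s+', ' ', name).strip()

# Hypothetical spellings of the same event
with_punctuation = clean_tournament_name('Riptide: 2023')
with_extra_spaces = clean_tournament_name('  RIPTIDE   2023 ')
hyphenated_slug = clean_tournament_name('riptide-2023')

print(with_punctuation == with_extra_spaces)  # punctuation and spacing differences vanish
print(hyphenated_slug)                        # the hyphen is deleted, so the words are fused
```

Because punctuation is deleted rather than replaced by a space, a hyphen-separated name will not match its space-separated counterpart. This is one plausible reason some majors are not found by name alone.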


## Remove non-majors
Going through the list on the website and comparing it to the majors we found, we remove the ones that were mislabelled.

In [None]:
not_actually_majors = [
    36389,  # battle-of-bc-6-7__lowtier-bracket-melee
    16526,  # ludwig-smash-invitational__melee-singles-lcq
]

tournament_info_df.loc[not_actually_majors, 'major'] = False

print(f"We managed to find {tournament_info_df[tournament_info_df['major']==True].shape[0]} out of {majors_df.shape[0]}.")

In [None]:
missing_majors = [major for major in major_tournaments_cleaned if not tournament_info_df['cleaned_name_cleaned'].isin([major]).any()]

print(f"We are missing {len(missing_majors)} majors.\n")
for major in missing_majors:
    print(major)
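
As an optional aid (not the method used in this notebook), approximate string matching from the standard library can suggest candidate names for a missing major. The sketch below uses ``difflib.get_close_matches``; the helper name ``suggest_matches``, the cutoff of 0.4, and the candidate names are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical cleaned names standing in for tournament_info_df['cleaned_name_cleaned']
known_names = [
    'get on my level x canadian fighting game championships',
    'riptide 2023',
    'tipped off 15 connected',
]

def suggest_matches(missing_name, candidates, n=3, cutoff=0.4):
    """Return up to n candidates whose similarity to missing_name is at least cutoff."""
    return get_close_matches(missing_name, candidates, n=n, cutoff=cutoff)

suggestions = suggest_matches('get on my level x', known_names)
print(suggestions)
```

Any suggestion would still need to be confirmed against the date, city, and entrant count before labelling it as a major.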

### Search for missing majors
We missed some tournaments because the tournament names found on Liquipedia do not match the ones in ``tournament_info_df``. We now go through the list of missing majors one by one and search for them in ``tournament_info_df``. Each tournament in ``majors_df`` has the date, the city, and the number of entrants. Some of those values match what is in ``tournament_info_df`` and some do not. Our strategy is to filter ``tournament_info_df`` down to as few rows as possible for each missing major and then identify the major by inspection. We demonstrate this for ``tipped off 15``. As we find each missing major in ``tournament_info_df``, we collect its index value in ``missing_majors``.

In [None]:
temp_df = tournament_info_df.copy()

# Look up the information about the missing tournament (I used the website rather than majors_df)
print(majors_df.loc[6])

# Filter tournaments based on date.
year = 2024
temp_df = temp_df[temp_df['start']>=datetime(year,6,15)]
temp_df = temp_df[temp_df['start']<datetime(year,6,16)]

# Filter the tournaments based on location (applied to temp_df so the date filter is kept).
temp_df = temp_df[temp_df['city']=='Marietta']

# Filter tournaments based on entrants.
temp_df = temp_df[temp_df['entrants']==513]   

print(f"We have filtered down to {temp_df.shape[0]} tournament(s).")
temp_df
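
Since the same three filters are repeated for every missing major, they could be wrapped in a small helper. The following is a minimal sketch, assuming ``start`` holds datetimes and that ``city`` and ``entrants`` are plain columns as used above; the function name ``find_candidates`` and the toy rows are hypothetical.

```python
from datetime import datetime

import pandas as pd

def find_candidates(df, start, end, city=None, entrants=None):
    """Return rows whose 'start' lies in [start, end) and that match the optional filters."""
    mask = (df['start'] >= start) & (df['start'] < end)
    if city is not None:
        mask &= df['city'] == city
    if entrants is not None:
        mask &= df['entrants'] == entrants
    return df[mask]

# Toy stand-in for tournament_info_df; all values are made up for illustration
toy_df = pd.DataFrame(
    {
        'cleaned_name': ['weekly 12', 'big event', 'other event'],
        'start': [datetime(2024, 6, 14), datetime(2024, 6, 15), datetime(2024, 6, 15)],
        'city': ['Marietta', 'Marietta', 'Toronto'],
        'entrants': [40, 513, 513],
    },
    index=[101, 102, 103],
)

candidates = find_candidates(
    toy_df, datetime(2024, 6, 15), datetime(2024, 6, 16), city='Marietta'
)
print(list(candidates.index))
```

Leaving ``city`` or ``entrants`` as ``None`` skips that filter, which is convenient when the Liquipedia value does not agree with the one in ``tournament_info_df``.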

## Add missing majors

In [None]:
missing_majors = [
    39443,  # Tipped off 15
    38456,  # Get on my level X
    28389,  # riptide 2023
    26646,  # Get on my level 2023
    26137,  # ludwig 2023 main event
    24918,  # Tipped off 14
    22595,  # back in blood major upset
    17129,  # smash summit 14
    15764,  # lost tech city 2022
    12948,  # double down 2022
    12779,  # get on my level 2022
    11293,  # smash summit 13
    7532,  # smash_world tour
    6377,  # SWT 2021 NA east regional finals
    5168,  # riptide 2021
    1233,  # Galint Melee Open: Spring Edition
    2,  # Slippi Champions League Week 1
    3,  # Slippi Champions League Week 2
    4,  # Slippi Champions League Week 3
    5,  # Slippi Champions League Week 4
    667,  # Get on my line 2020
    167,  # GameTyrant Expo 2018
    30,  # EVO 2018
    41,  # Enthusiast Gaming Live Expo 2018
    51,  # GameTyrant Expo 2017
    # genesis fuse doubles circuit finals
    58,  # EVO 2017
    # Shine 2016
    26,  # EVO 2016
    141,  # Super Smash Con
    25,  # EVO 2015
    # WTFox
    165,  # FC Smash 15XR: Return
    14,  # paragon 2015
]

tournament_info_df.loc[missing_majors, 'major'] = True

print(f"We now have {tournament_info_df[tournament_info_df['major']==True].shape[0]} out of {majors_df.shape[0]}.")

In [None]:
major_tournament_info_df = tournament_info_df[tournament_info_df['major']==True]
major_tournament_info_df

In [None]:
major_tournament_info_df.to_pickle(data_path + 'major_tournament_info_df.pkl')