# Mini Project 
## Web Scraping the Super Smash Brothers Melee Top 100 Players of All Time Wikipage

### The goal of this project is to scrape the SSBM Top 100 Players of All Time (PAT) page and perform some exploratory data analysis on the data

**First we must import the necessary packages**

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

**Now we store the URL of the SSBM Top 100 PAT page and make a soup object for web scraping**

In [2]:
url = 'https://www.ssbwiki.com/Top_100_Melee_Players_of_All_Time'

data = requests.get(url)

soup = BeautifulSoup(data.text, 'html.parser')
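
**A slightly more defensive version of the request step is sketched below. The helpers fetch_html and make_soup and the 10-second timeout are illustrative assumptions, not part of the original notebook. Calling raise_for_status makes a failed request raise an error instead of silently parsing an error page. The sketch only parses a tiny inline document, so it runs without network access.**

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.text


def make_soup(html):
    """Parse HTML text with Python's built-in parser (hypothetical helper)."""
    return BeautifulSoup(html, 'html.parser')


# Offline check on a made-up inline document
sample_soup = make_soup('<html><body><table><tr><td>1</td></tr></table></body></html>')
print(sample_soup.find('td').text)
```

**In the notebook itself, the equivalent call would be make_soup(fetch_html(url)).**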

**The table of information that we want is contained within a tbody tag in the HTML**

In [3]:
tbody = soup.select('table')[1].find('tbody')
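
**Whether a tbody tag is present depends on the page's markup, because html.parser does not insert one when the source omits it. The sketch below, on made-up inline HTML with two tables, picks the second table and falls back to the table itself when there is no tbody. The names soup_demo, target, and body are illustrative.**

```python
from bs4 import BeautifulSoup

# Two tiny tables standing in for the wiki page; the second one is the target
html = """
<table><tr><td>other table</td></tr></table>
<table>
  <tbody>
    <tr><th>Rank</th><th>Player</th></tr>
    <tr><td>1</td><td>Mango</td></tr>
  </tbody>
</table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

tables = soup_demo.select('table')
target = tables[1]
# find() returns None when there is no tbody, so fall back to the table itself
body = target.find('tbody') or target
rows = body.find_all('tr')
print(len(tables), len(rows))
```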

**We are extracting the player rank and player name from the soup object**

In [4]:
rows = []   # collect one dict per player, then build the DataFrame once

for row in tbody.find_all('tr'):   # each tr tag is one table row
    columns = row.find_all('td')   # each tr tag has multiple td tags that hold the player rank and player name

    if columns:                    # the first tr tag (the header) has no td tags, so we skip it
        rank = columns[0].text.strip()    # the first td tag holds the player rank
        player = columns[1].text.strip()  # the second td tag holds the player name
        rows.append({'rank': rank, 'player': player})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list instead
rank_player_df = pd.DataFrame(rows, columns=['rank', 'player'])

**Taking a brief look at our output**

In [5]:
rank_player_df.head()

Unnamed: 0,rank,player
0,1,Mango
1,2,Armada
2,3,Hungrybox
3,4,Ken
4,5,Mew2King
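
**Because the rank is read from the cell text, it is stored as a string. If numeric sorting or comparisons are needed later, it can be converted as sketched below. The frame rank_player_demo is an illustrative copy of the five rows shown above, not the full scraped table.**

```python
import pandas as pd

# First rows of the scraped table, with rank still stored as text
rank_player_demo = pd.DataFrame({
    'rank': ['1', '2', '3', '4', '5'],
    'player': ['Mango', 'Armada', 'Hungrybox', 'Ken', 'Mew2King'],
})

# Convert the text ranks to integers so they sort and compare numerically
rank_player_demo['rank'] = pd.to_numeric(rank_player_demo['rank'])
print(rank_player_demo.dtypes)
```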


**Now we need the characters the players use. The character names are stored in the title attribute of a tags (hyperlink tags), so we must alter the code to accommodate this. We will make a dataframe for the main, secondary, third, and fourth characters that each player uses, then combine the dataframes.
First we will focus on getting the main characters each player uses**

In [6]:
main_rows = []

for tr in tbody.find_all('tr'):
    for td in tbody.find_all('td'):
        character = td.find_all('a')    # characters are inside a tags, so find all the a tags in each td

        if character:
            try:
                main = character[0]['title']  # the main character is in the title of the first a tag
            except (IndexError, KeyError):    # no such a tag, or the a tag has no title attribute
                main = None

            main_rows.append({'main': main})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list instead
main_df = pd.DataFrame(main_rows, columns=['main'])

**Now we need the secondary characters**

In [7]:
secondary_rows = []

for tr in tbody.find_all('tr'):
    for td in tbody.find_all('td'):
        character = td.find_all('a')

        if character:
            try:
                secondary = character[1]['title']  # the secondary character is in the title of the second a tag (if it exists)
            except (IndexError, KeyError):
                secondary = None        # some players don't have secondaries

            secondary_rows.append({'secondary': secondary})

secondary_df = pd.DataFrame(secondary_rows, columns=['secondary'])

**Now we need the third characters**

In [8]:
third_rows = []

for tr in tbody.find_all('tr'):
    for td in tbody.find_all('td'):
        character = td.find_all('a')

        if character:
            try:
                third = character[2]['title']  # the third character is in the title of the third a tag (if it exists)
            except (IndexError, KeyError):
                third = None   # some players don't have third characters

            third_rows.append({'third': third})

third_df = pd.DataFrame(third_rows, columns=['third'])

**Getting the fourth characters**

In [9]:
fourth_rows = []

for tr in tbody.find_all('tr'):
    for td in tbody.find_all('td'):
        character = td.find_all('a')

        if character:
            try:
                fourth = character[3]['title']  # the fourth character is in the title of the fourth a tag (if it exists)
            except (IndexError, KeyError):
                fourth = None  # some players don't have fourth characters

            fourth_rows.append({'fourth': fourth})

fourth_df = pd.DataFrame(fourth_rows, columns=['fourth'])
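
**The four character cells differ only in which a tag they index, so the same result could be gathered in one pass. The sketch below assumes the link titles of each cell have already been collected into a list. The helper pad_titles and the frame characters_demo are hypothetical names, and the sample titles are taken from rows that appear in this notebook's output.**

```python
import pandas as pd


def pad_titles(titles, width=4):
    """Return the first `width` titles, padded with None (hypothetical helper)."""
    padded = list(titles[:width])
    return padded + [None] * (width - len(padded))


# Link titles as they would come from one table cell per player
cells = [
    ['Fox (SSBM)', 'Falco (SSBM)', 'Jigglypuff (SSBM)', 'Captain Falcon (SSBM)'],
    ['Peach (SSBM)', 'Fox (SSBM)', 'Young Link (SSBM)'],
    ['Jigglypuff (SSBM)'],
]

# One row per cell, one column per character slot
characters_demo = pd.DataFrame(
    [pad_titles(titles) for titles in cells],
    columns=['main', 'secondary', 'third', 'fourth'],
)
print(characters_demo)
```

**Padding every row to the same width avoids the try/except blocks, since no index can be out of range.**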

**We now have a dataframe containing each player and their rank, as well as individual dataframes containing their main, secondary, third, and fourth characters. We will make a new dataframe, all_characters, and combine the character dataframes into it**

In [10]:
all_characters = pd.DataFrame()

In [11]:
all_characters['main'] = main_df['main']
all_characters['secondary'] = secondary_df['secondary']
all_characters['third'] = third_df['third']
all_characters['fourth'] = fourth_df['fourth']

**Looking at current results**

In [12]:
print(all_characters.head())
print(all_characters.info())

           main          secondary              third                 fourth
0           USA      Smasher:Mango               None                   None
1    Fox (SSBM)       Falco (SSBM)  Jigglypuff (SSBM)  Captain Falcon (SSBM)
2        Sweden     Smasher:Armada               None                   None
3  Peach (SSBM)         Fox (SSBM)  Young Link (SSBM)                   None
4           USA  Smasher:Hungrybox               None                   None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20200 entries, 0 to 20199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   main       20200 non-null  object
 1   secondary  15554 non-null  object
 2   third      3232 non-null   object
 3   fourth     1212 non-null   object
dtypes: object(4)
memory usage: 631.4+ KB
None


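**From the table above, we can see that the rows alternate: each even row holds a player's country (in the main column) and player page (in the secondary column), and the odd row directly below it holds that player's characters. We can also see that we have far too many entries: 20,200 instead of 200. This is because the inner loop iterates over tbody.find_all('td') rather than tr.find_all('td'), so all 200 linked cells in the table are collected again for each of the 101 tr rows (the header row plus 100 players). The first 200 rows are one complete pass, so we keep those and remove the rest**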
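
**The effect of looping over tbody.find_all('td') inside the row loop, as the character cells above do, can be reproduced on a tiny table. The inline HTML below is a made-up stand-in for the wiki table (a header row plus two player rows), and the names tbody_demo, repeated, and once are illustrative. Scaled up, 101 rows times 200 linked cells gives the 20,200 entries reported by info().**

```python
from bs4 import BeautifulSoup

# A header row plus two player rows; each player row has two cells
html = """
<table><tbody>
  <tr><th>Rank</th><th>Player</th></tr>
  <tr><td>1</td><td>Mango</td></tr>
  <tr><td>2</td><td>Armada</td></tr>
</tbody></table>
"""
tbody_demo = BeautifulSoup(html, 'html.parser').find('tbody')

# Inner loop over tbody: every cell is visited once per row (3 rows x 4 cells)
repeated = [td.text for tr in tbody_demo.find_all('tr')
            for td in tbody_demo.find_all('td')]

# Inner loop over the current row: every cell is visited exactly once
once = [td.text for tr in tbody_demo.find_all('tr')
        for td in tr.find_all('td')]

print(len(repeated), len(once))
```

**Switching the inner loop to tr.find_all('td') in the character cells should therefore yield 200 rows directly, without any trimming afterwards.**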

In [13]:
all_characters = all_characters.loc[0:199]  # keep the first 200 rows (labels 0 through 199; .loc slices include the end label)

**Looking at the output, we now have 200 entries for the top 100 players: each player's row is followed directly by a row with their characters**

In [14]:
all_characters.head()

Unnamed: 0,main,secondary,third,fourth
0,USA,Smasher:Mango,,
1,Fox (SSBM),Falco (SSBM),Jigglypuff (SSBM),Captain Falcon (SSBM)
2,Sweden,Smasher:Armada,,
3,Peach (SSBM),Fox (SSBM),Young Link (SSBM),
4,USA,Smasher:Hungrybox,,


In [15]:
all_characters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   main       200 non-null    object
 1   secondary  154 non-null    object
 2   third      32 non-null     object
 3   fourth     12 non-null     object
dtypes: object(4)
memory usage: 6.4+ KB


**We want the characters, which are in the rows with an odd index, so we filter the all_characters dataframe for the odd indexes only**

In [16]:
number_list = list(np.arange(1,200,2))

In [17]:
all_characters = all_characters.loc[number_list]

In [18]:
all_characters.reset_index(drop = True, inplace = True)
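
**The same odd-row filter can be written as a positional slice, without building a list of indexes first. The sketch below uses the first four rows from the output above; pairs_demo and characters_only are illustrative names.**

```python
import pandas as pd

# First four rows of the combined output: player rows alternate with character rows
pairs_demo = pd.DataFrame({
    'main': ['USA', 'Fox (SSBM)', 'Sweden', 'Peach (SSBM)'],
    'secondary': ['Smasher:Mango', 'Falco (SSBM)', 'Smasher:Armada', 'Fox (SSBM)'],
})

# Positional slice: start at row 1 and take every second row
characters_only = pairs_demo.iloc[1::2].reset_index(drop=True)
print(characters_only)
```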

**We now have a dataframe of just characters, matching the rank of the players**

In [19]:
all_characters.head()

Unnamed: 0,main,secondary,third,fourth
0,Fox (SSBM),Falco (SSBM),Jigglypuff (SSBM),Captain Falcon (SSBM)
1,Peach (SSBM),Fox (SSBM),Young Link (SSBM),
2,Jigglypuff (SSBM),,,
3,Marth (SSBM),Fox (SSBM),,
4,Marth (SSBM),Sheik (SSBM),Fox (SSBM),Peach (SSBM)


In [20]:
all_characters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   main       100 non-null    object
 1   secondary  54 non-null     object
 2   third      32 non-null     object
 3   fourth     12 non-null     object
dtypes: object(4)
memory usage: 3.2+ KB


**Now we add rank and player to the all_characters dataframe. Since its rows are already in rank order, we can just add the columns from rank_player_df**

In [21]:
all_characters['rank'] = rank_player_df['rank']

In [22]:
all_characters['player'] = rank_player_df['player']

In [23]:
melee_top_100 = all_characters[['rank','player','main','secondary','third','fourth']]  #reordering the columns

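**Looking at the output below, we want to remove the '(SSBM)' from each character. We will change the datatype of the character columns to str, then apply a lambda function element-wise over the character columns to remove it. Note that the conversion to str turns each missing value into the text 'None', which later shows up as its own category in the counts**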

In [24]:
melee_top_100.head()

Unnamed: 0,rank,player,main,secondary,third,fourth
0,1,Mango,Fox (SSBM),Falco (SSBM),Jigglypuff (SSBM),Captain Falcon (SSBM)
1,2,Armada,Peach (SSBM),Fox (SSBM),Young Link (SSBM),
2,3,Hungrybox,Jigglypuff (SSBM),,,
3,4,Ken,Marth (SSBM),Fox (SSBM),,
4,5,Mew2King,Marth (SSBM),Sheik (SSBM),Fox (SSBM),Peach (SSBM)


In [25]:
melee_top_100 = melee_top_100.astype({'main' : str, 'secondary' : str, 'third': str, 'fourth' : str})  #changing datatype to string

In [26]:
#Applying lambda function to the character columns to remove '(SSBM)'
melee_top_100[['main','secondary','third','fourth']] = melee_top_100[['main','secondary','third','fourth']].applymap(lambda x: x.replace('(SSBM)', ' ').strip())
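
**An alternative that skips the str conversion is sketched below: the pandas string methods work column by column and leave missing values as missing. The frame chars_demo is an illustrative three-row sample based on the rows shown above, and regex=False makes the parentheses match literally.**

```python
import pandas as pd

# Character columns for Mango, Armada, and Hungrybox; None marks a missing character
chars_demo = pd.DataFrame({
    'main': ['Fox (SSBM)', 'Peach (SSBM)', 'Jigglypuff (SSBM)'],
    'secondary': ['Falco (SSBM)', 'Fox (SSBM)', None],
})

# Strip the literal suffix column by column; missing values stay missing
cleaned = chars_demo.apply(
    lambda col: col.str.replace('(SSBM)', '', regex=False).str.strip()
)
print(cleaned)
print(cleaned['secondary'].isna().sum())
```

**With this approach, value_counts() would skip the empty slots by default instead of counting them under 'None'.**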

**This is the data that we want, formatted how we want it**

In [27]:
melee_top_100

Unnamed: 0,rank,player,main,secondary,third,fourth
0,1,Mango,Fox,Falco,Jigglypuff,Captain Falcon
1,2,Armada,Peach,Fox,Young Link,
2,3,Hungrybox,Jigglypuff,,,
3,4,Ken,Marth,Fox,,
4,5,Mew2King,Marth,Sheik,Fox,Peach
...,...,...,...,...,...,...
95,96,ARMY,Ice Climbers,,,
96,97,Rob$,Falco,Sheik,,
97,98,Tai,Marth,,,
98,99,Rishi,Marth,Fox,Donkey Kong,


**What character is the most popular to main in the top 100 melee players of all time?**

In [28]:
melee_top_100['main'].value_counts()

Fox               29
Sheik             13
Peach             10
Marth             10
Falco              9
Captain Falcon     8
Ice Climbers       6
Jigglypuff         4
Samus              3
Luigi              3
Ganondorf          2
Pikachu            1
Yoshi              1
Roy                1
Name: main, dtype: int64

**What character is the most popular secondary in the top 100 melee players of all time?**

In [29]:
melee_top_100['secondary'].value_counts()

None              46
Fox               14
Sheik             11
Falco              8
Marth              5
Captain Falcon     4
Peach              4
Dr. Mario          2
Samus              1
Young Link         1
Ganondorf          1
Zelda              1
Jigglypuff         1
Mewtwo             1
Name: secondary, dtype: int64

**What character is the most popular third in the top 100 melee players of all time?**

In [30]:
melee_top_100['third'].value_counts()

None              68
Marth              9
Fox                5
Sheik              5
Falco              3
Captain Falcon     2
Jigglypuff         1
Young Link         1
Pikachu            1
Ganondorf          1
Peach              1
Dr. Mario          1
Ice Climbers       1
Donkey Kong        1
Name: third, dtype: int64

**What character is the most popular fourth in the top 100 melee players of all time?**

In [31]:
melee_top_100['fourth'].value_counts()

None              88
Marth              3
Peach              2
Young Link         2
Captain Falcon     1
Falco              1
Luigi              1
Fox                1
Dr. Mario          1
Name: fourth, dtype: int64
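
**The four counts above look at one slot at a time. To see how often a character is used in any slot, the columns can be reshaped into one long column first. The sketch below uses only the first five rows of the table, so its counts are not the full top 100 totals; top5, long_form, and usage are illustrative names.**

```python
import pandas as pd

# First five rows of the cleaned table; None marks an empty slot
top5 = pd.DataFrame({
    'player': ['Mango', 'Armada', 'Hungrybox', 'Ken', 'Mew2King'],
    'main': ['Fox', 'Peach', 'Jigglypuff', 'Marth', 'Marth'],
    'secondary': ['Falco', 'Fox', None, 'Fox', 'Sheik'],
    'third': ['Jigglypuff', 'Young Link', None, None, 'Fox'],
    'fourth': ['Captain Falcon', None, None, None, 'Peach'],
})

# Reshape to one row per (player, slot) pair, drop empty slots, then count characters
long_form = top5.melt(id_vars='player', var_name='slot', value_name='character')
usage = long_form['character'].dropna().value_counts()
print(usage)
```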

**Writing the data to a CSV file**

In [32]:
melee_top_100.to_csv('SSBM Top 100 Players of All Time.csv', index = False)
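
**A quick way to check the export is to read the file back and compare it with what was written. The sketch below does this in a temporary directory with two complete rows from the final table. The file name melee_sample.csv and the names sample and restored are illustrative.**

```python
import os
import tempfile

import pandas as pd

# Two complete rows from the final table
sample = pd.DataFrame({
    'rank': [1, 5],
    'player': ['Mango', 'Mew2King'],
    'main': ['Fox', 'Marth'],
    'secondary': ['Falco', 'Sheik'],
    'third': ['Jigglypuff', 'Fox'],
    'fourth': ['Captain Falcon', 'Peach'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'melee_sample.csv')  # hypothetical file name
    sample.to_csv(path, index=False)   # index=False keeps the row labels out of the file
    restored = pd.read_csv(path)

print(restored)
```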