# Exploring Player Data on Stratus: A Web Scraping and EDA Project
---
- **Made by:** *Philip Sinnott*
- **GitHub:** [@philipsinnott](https://github.com/philipsinnott)
- **Blog:** [sinnott.netlify.app](https://sinnott.netlify.app)


# 1. Background
---
In this project, I explore player data from the Minecraft PvP server [Stratus](https://stratus.network) through web scraping and exploratory data analysis (EDA). By gathering data from the leaderboards and individual player pages on the website, I aim to provide insights into the performance and statistics of the top players on the server.

# 2. Scope and Delimitations
---
One limitation of the data scraped from the leaderboards page is that it lacks variables such as games played and total number of losses, which are needed to calculate win-loss ratios and other metrics for the EDA. To fill this gap, I created a list of the top 100 players from the leaderboards dataset and then scraped each player's individual profile page. This complemented the leaderboards dataset not only with the missing metrics, but also with others such as best winstreak and number of double losses.
<br>
<br>
I will primarily focus on statistics from seasons 14 and 15.
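
To make the point about win-loss ratios concrete, here is a minimal sketch of how such metrics could be derived once `games`, `wins` and `losses` are available. The two rows are copied from the season 15 profile output further down in this notebook; the column names `win_rate` and `wl_ratio` are my own and purely illustrative.

```python
import pandas as pd

# Two rows copied from the season 15 player profile output in section 3.3
profile = pd.DataFrame(
    {
        "Player": ["scolt", "TheTroller1369"],
        "games": [189, 150],
        "wins": [86, 70],
        "losses": [89, 75],
    }
)

# Win rate as a share of all games played (hypothetical column name)
profile["win_rate"] = profile["wins"] / profile["games"]

# Win-loss ratio; rows with zero losses become NaN instead of inf
profile["wl_ratio"] = profile["wins"] / profile["losses"].where(profile["losses"] > 0)

print(profile[["Player", "win_rate", "wl_ratio"]].round(3))
```

Note that `games` can be larger than `wins + losses` (189 versus 175 for scolt), so the two metrics are not interchangeable.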

# 3. Web scraping
---
I used [requests](https://pypi.org/project/requests) to retrieve the data from the website, [BeautifulSoup4](https://pypi.org/project/beautifulsoup4) to easily navigate through the HTML structure of the content and manipulate the data as needed, and [pandas](https://pandas.pydata.org) to store the data in a dataframe for further data manipulation.
<br>
<br>
It's also worth mentioning that transforming the leaderboards data into plottable form required significantly more effort than the player profile data.
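
As a rough illustration of how the three libraries fit together, the sketch below parses a small HTML table into a dataframe. The markup is a heavily simplified, hypothetical stand-in for the real leaderboard page (which uses Material-UI class names), and no network request is made.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified stand-in for the leaderboard markup
html = """
<table>
  <tr><th>#</th><th>Player</th><th>Wins</th></tr>
  <tr><td>#1</td><td>scolt</td><td>89</td></tr>
  <tr><td>#2</td><td>legrandmystic</td><td>87</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells give the column names
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

# Every row that contains <td> cells is treated as a data row
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
    if tr.find("td")
]

df_demo = pd.DataFrame(rows, columns=headers)
print(df_demo)
```

All values are still strings at this point, which is why a separate conversion step is needed before plotting.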

## 3.1 Acquire data
---

In [1]:
# Imports
from bs4 import BeautifulSoup
import requests
import pprint
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import json
import re
from itertools import chain
from concurrent.futures import ThreadPoolExecutor
import threading
import itertools

In [2]:
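def get_data_as_soup(url):
    response = requests.get(url, timeout=30) # fail instead of hanging on a stalled connection
    response.raise_for_status() # raise on HTTP errors rather than parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    return soup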

url = "https://stratus.network/leaderboard/stratus15,*,*,stratus-ranked/wins/1"
soup = get_data_as_soup(url)

In [3]:
def scrape_lb_data(soup):
    # Get names of each column header
    t_head = soup.find_all("tr", "MuiTableRow-root MuiTableRow-head")

    # Append each header to a list
    column_headers = []
    for i in t_head:
        for j in i.find_all("th"):
            column_headers.append(j.text)
    
    # Get data from top 100 players, one list of cell values per table row
    rows = soup.find_all("tr", class_="MuiTableRow-root")
    players_data = []
    for row in rows:
        current_player = []
        # NOTE: "jss41" looks like a generated class name and may change when the site is rebuilt
        rank = row.find("td", class_="MuiTableCell-root MuiTableCell-body jss41 MuiTableCell-alignRight")
        if rank:
            current_player.append(rank.text)
        for span in row.find_all("span"):
            current_player.append(span.text.strip())
        # Rows without any data are skipped
        if current_player:
            players_data.append(current_player)
    return [players_data, column_headers]

# Scrape once and unpack body data and header
players_data, column_headers = scrape_lb_data(soup)

In [4]:
players_data

[['#1',
  'scolt',
  '89',
  '2,065',
  '2,140',
  '0.96',
  '36.8k',
  '10.81/game',
  '21.5\u2006/min',
  '2.05',
  '43%',
  '19',
  '3,045',
  '52.1'],
 ['#2',
  'legrandmystic',
  '87',
  '2,101',
  '2,476',
  '0.85',
  '36.3k',
  '9.91/game',
  '19.5\u2006/min',
  '3.77',
  '36%',
  '9',
  '3,387',
  '-776.6'],
 ['#3',
  'iWeeiZzz',
  '85',
  '1,917',
  '1,873',
  '1.02',
  '32.5k',
  '12.37/game',
  '23\u2006/min',
  '2.77',
  '41%',
  '22',
  '3,014',
  '225.6'],
 ['#4',
  'Skyxzzz',
  '84',
  '1,487',
  '1,650',
  '0.90',
  '26.7k',
  '10.85/game',
  '22\u2006/min',
  '3.15',
  '42%',
  '16',
  '2,608',
  '-146.2'],
 ['#5',
  'coz',
  '83',
  '2,005',
  '2,107',
  '0.95',
  '35.7k',
  '11.59/game',
  '23\u2006/min',
  '2.77',
  '42%',
  '6',
  '2,846',
  '167.8'],
 ['#6',
  'BallondOr',
  '83',
  '1,602',
  '1,742',
  '0.92',
  '25.7k',
  '10.82/game',
  '19.5\u2006/min',
  '4.54',
  '41%',
  '14',
  '2,788',
  '-421.2'],
 ['#7',
  'ezrs',
  '82',
  '1,557',
  '1,972',
  '0.79'

## 3.2 Data Conversion & Preprocessing
---

In [5]:
def data_conversion_and_preprocessing(data, cols):   
    # Creating the df
    df = pd.DataFrame(data, columns=cols)
    # Rename ranking col
    df.rename(columns={"#":"Ranking"}, inplace=True)
    # Remove unwanted chars so we eventually can store all vars in numeric form
    df["Kills per Game"] = df["Kills per Game"].str.removesuffix("/game")
    df["Damage Dealt per Minute"] = df["Damage Dealt per Minute"].str.removesuffix("\u2006/min")
    df["Ranking"] = df["Ranking"].str.replace("#", "")
    df["Bow Accuracy"] = df["Bow Accuracy"].str.replace("%", "")

    # Remove commas
    for col in df.columns:
        df[col] = df[col].str.replace(",", "")

    # Transform the specified cols into numerical
    for col in df.columns[[0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13]]:
        #print(col)
        df[col] = pd.to_numeric(df[col], errors='coerce')

    # Store bow accuracy metric in float instead of percentage
    df["Bow Accuracy"] = df["Bow Accuracy"].apply(lambda x: x / 100)

    # Show values in full length, e.g. "1.5M" --> "1500000" and "500k" --> "500000" for visualization purposes
    # also converts values to ints
    df["Damage Dealt"] = df["Damage Dealt"].apply(lambda x: pd.to_numeric(x.replace('M',''))*1000000 if 'M' in x
                                                  else pd.to_numeric(x.replace('k',''))*1000 if 'k' in x else pd.to_numeric(x))
    df["Damage Dealt"] = df["Damage Dealt"].astype(int)
    return df

df_lb = data_conversion_and_preprocessing(players_data, column_headers)

# Print info and make sure all columns except for "Player" are numeric dt
print(df_lb.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Ranking                  100 non-null    int64  
 1   Player                   100 non-null    object 
 2   Wins                     100 non-null    int64  
 3   Kills                    100 non-null    int64  
 4   Deaths                   100 non-null    int64  
 5   KDR                      100 non-null    float64
 6   Damage Dealt             100 non-null    int32  
 7   Kills per Game           100 non-null    float64
 8   Damage Dealt per Minute  100 non-null    float64
 9   Melee/Bow Damage Ratio   100 non-null    float64
 10  Bow Accuracy             100 non-null    float64
 11  Flags                    100 non-null    int64  
 12  Golden Apples Eaten      100 non-null    int64  
 13  Rating                   100 non-null    float64
dtypes: float64(6), int32(1), in
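
The abbreviated "Damage Dealt" values are the trickiest part of this step, so here is a small standalone sketch of the same idea as a named helper. `expand_abbreviation` and `SUFFIXES` are hypothetical names of my own; the function is not part of the pipeline above, and the "M" suffix is assumed rather than observed in the scraped data.

```python
import pandas as pd

# Multipliers for the "k" suffix seen on the leaderboard and an assumed "M" suffix
SUFFIXES = {"k": 1_000, "M": 1_000_000}

def expand_abbreviation(value: str) -> int:
    """Turn strings such as '36.8k' or '1.5M' into plain integers."""
    value = value.strip().replace(",", "")
    if value[-1] in SUFFIXES:
        number, multiplier = value[:-1], SUFFIXES[value[-1]]
    else:
        number, multiplier = value, 1
    # round() guards against floating point noise from the multiplication
    return round(float(number) * multiplier)

damage = pd.Series(["36.8k", "1.5M", "950"])
print(damage.map(expand_abbreviation).tolist())
```

Rounding before converting to an integer avoids values being truncated downwards when the multiplication is not exact.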

## 3.3 Complementing the Leaderboards Dataset with Player Profile Data
---

In [6]:
#players_data = []
base_url = "https://stratus.network/" # base url

def extract_player_data(season, df):
    players = df["Player"].to_list() # get all player names from lb list
    threads = []
    players_data = []
    for player in players:
        thread = threading.Thread(target=process_player, args=(player, season, players_data))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

    #players_data = list(itertools.chain.from_iterable(players_data))
    return players_data

def process_player(player, season, players_data):
    url = f"{base_url}{player}" # craft url for player
    html = requests.get(url).text # get html in raw text format
    soup = BeautifulSoup(html, "html.parser") # create soup object
    # find the script tag that holds the stats; the "t and" guard skips script tags without text
    script = soup.find('script', string=lambda t: t and t.startswith('window.initialReactProps='))
    raw = script.string.split("=", 1)[1][:-1] # drop the variable assignment and the trailing character
    data = json.loads(json.loads(raw)) # the payload is JSON wrapped in a JSON string, so decode twice
    data = data["props"]["ranked"] # list with one dict of ranked stats per season
    data = [{'Name': player, **d} for d in data] # add a name key-value pair to every season dict
    player_data = list(filter(lambda x: x["season"] == (f"{season}"), data)) # keep only the requested season
    players_data.append(player_data)
    
df_profile = extract_player_data(15, df_lb)
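
The notebook imports `ThreadPoolExecutor` without using it. As a sketch of an alternative to starting one thread per player, a pool caps the number of simultaneous requests and returns results in input order. `fetch_profile_stub` is a hypothetical stand-in for `process_player` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile_stub(player: str) -> dict:
    """Hypothetical stand-in for process_player; a real version would call requests.get."""
    return {"Name": player, "name_length": len(player)}

players = ["scolt", "legrandmystic", "iWeeiZzz"]

# max_workers caps concurrent requests; map() yields results in the order of `players`
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(fetch_profile_stub, players))

print([r["Name"] for r in results])
```

With the thread-per-player approach above, the order of `players_data` depends on which request finishes first, which is harmless here only because the later merge matches on the player name.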

In [7]:
df_profile

[[{'Name': 'scolt',
   'best_elo': 495,
   'games': 189,
   'wins': 86,
   'winstreak': 0,
   'best_winstreak': 5,
   'losses': 89,
   'kills': 0,
   'deaths': 0,
   'premium_games': 0,
   'premium_wins': 0,
   'premium_losses': 0,
   'premium_kills': 0,
   'premium_deaths': 0,
   'double_losses': 1,
   'elo': 315,
   'discord': '555189998735982613',
   'season': '15',
   'baseSeason': '15',
   'discordName': 'Stratus',
   'name': 'Stratus Season 15',
   'wellFormed': True,
   'rank': {'min': 300,
    'name': 'Silver+',
    'image': '/static/img/icons/ranked/silver.svg',
    'color': '#B2B2B2'}}],
 [{'Name': 'TheTroller1369',
   'best_elo': 500,
   'games': 150,
   'wins': 70,
   'winstreak': 0,
   'best_winstreak': 6,
   'losses': 75,
   'kills': 0,
   'deaths': 0,
   'premium_games': 0,
   'premium_wins': 0,
   'premium_losses': 0,
   'premium_kills': 0,
   'premium_deaths': 0,
   'double_losses': 1,
   'elo': 240,
   'discord': '555189998735982613',
   'season': '15',
   'baseSeason
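
To show what the double `json.loads` in `process_player` is doing, here is a sketch on a made-up script body. It assumes, based only on the parsing code above, that the page assigns a JSON document wrapped in a JSON string to `window.initialReactProps` and ends the statement with a semicolon. The season 14 figure is invented for illustration.

```python
import json

# Hypothetical payload with the same shape as data["props"]["ranked"]
inner = {"props": {"ranked": [{"season": "14", "games": 120},
                              {"season": "15", "games": 189}]}}

# Assumed script body: a JSON document encoded once more as a JSON string
script_text = "window.initialReactProps=" + json.dumps(json.dumps(inner)) + ";"

prefix = "window.initialReactProps="
payload = script_text[len(prefix):].rstrip().rstrip(";")

# First loads() yields a str, second loads() yields the dict
decoded = json.loads(json.loads(payload))

season_15 = [row for row in decoded["props"]["ranked"] if row["season"] == "15"]
print(season_15)
```

Slicing off the known prefix, rather than splitting on every "=", keeps the payload intact even if it contains an equals sign.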

### 3.3.1 Data Conversion & Preprocessing

In [8]:
# Flatten list of lists and convert to df
def process_player_data(data):
    flat_data = list(chain.from_iterable(data)) # flatten list of lists
    df_player_profile = pd.DataFrame.from_records(flat_data) # convert to df
    df_player_profile.rename(columns={"Name":"Player"}, inplace=True) # rename col to suit lb dataset
    return df_player_profile

df_profile = process_player_data(df_profile)
df_profile.describe() # show metrics

| | best_elo | games | wins | winstreak | best_winstreak | losses | kills | deaths | premium_games | premium_wins | premium_losses | premium_kills | premium_deaths | double_losses | elo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 435.6 | 78.22 | 40.3 | 1.01 | 5.84 | 34.13 | 0.0 | 0.0 | 9.34 | 4.7 | 4.12 | 111.94 | 111.25 | 0.7 | 380.1 |
| std | 205.749774 | 43.763963 | 23.191996 | 1.514209 | 2.28619 | 20.290643 | 0.0 | 0.0 | 23.151333 | 11.473729 | 10.581363 | 278.014672 | 278.845469 | 1.0 | 206.593315 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 320.0 | 48.75 | 24.0 | 0.0 | 5.0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 248.75 |
| 50% | 410.0 | 67.0 | 32.5 | 0.0 | 6.0 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 357.5 |
| 75% | 516.25 | 108.75 | 53.25 | 2.0 | 7.0 | 44.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 477.5 |
| max | 1085.0 | 213.0 | 93.0 | 8.0 | 12.0 | 113.0 | 0.0 | 0.0 | 106.0 | 45.0 | 57.0 | 1321.0 | 1281.0 | 4.0 | 1085.0 |


### 3.3.2 Merge Datasets

In [9]:
# Merge extended df with already existing df
def merge_lb_player_datasets(df1, df2):
    df_merged = pd.merge(df1, df2, on='Player')
    return df_merged

df_merged = merge_lb_player_datasets(df_lb, df_profile)
# drop irrelevant cols
def drop_cols(df, cols, axis):
    df = df.drop(columns=cols, axis=axis)
    return df

df = drop_cols(df_merged, cols=["discord", "season", "baseSeason", "discordName", "name", "wellFormed", "rank", "kills", "deaths"], axis=1)
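
`pd.merge` performs an inner join by default, so a leaderboard player whose profile could not be scraped would silently disappear from the merged dataset. The sketch below shows one way to surface such rows with `how="left"` and `indicator=True`. The frames are tiny stand-ins; the `games` value for coz is invented.

```python
import pandas as pd

# Small stand-ins for df_lb and df_profile
lb = pd.DataFrame({"Player": ["scolt", "coz", "ezrs"], "Wins": [89, 83, 82]})
prof = pd.DataFrame({"Player": ["scolt", "coz"], "games": [189, 170]})

# Keep every leaderboard row and record where each row came from
merged = pd.merge(lb, prof, on="Player", how="left", indicator=True)

# Players present on the leaderboard but missing from the profile data
missing = merged.loc[merged["_merge"] == "left_only", "Player"].tolist()
print(missing)
```

Checking `missing` before dropping the `_merge` column makes it clear whether the merged dataset still covers all 100 players.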

In [10]:
# Create non_prem df for all columns except for those related to premium queue
def create_non_prem_df(df, cols, axis):
    df_non_prem = df.drop(columns=cols, axis=axis)
    return df_non_prem

df_non_prem = create_non_prem_df(df, cols=["premium_games", "premium_wins", "premium_losses", "premium_kills", "premium_deaths"], axis=1)

**Observe** that we now have two main datasets: *df* and *df_non_prem*. *df* contains all metrics, including those related to the premium queue, whilst *df_non_prem* excludes the premium queue metrics.

In [26]:
def create_df_per_season(season):
    base_url_lb = "https://stratus.network/leaderboard/"
    soup = get_data_as_soup(f"{base_url_lb}stratus{season},*,*,stratus-ranked/wins/1")
    players_data = scrape_lb_data(soup)[0]
    column_headers = scrape_lb_data(soup)[1]
    df = data_conversion_and_preprocessing(players_data, column_headers)
    
    season_data = extract_player_data(season, df)
    
    df_player_profile = process_player_data(season_data)
    df_merged = merge_lb_player_datasets(df, df_player_profile)
    df = drop_cols(df_merged, cols=["discord", "baseSeason", "discordName", "name", "wellFormed", "rank", "kills", "deaths"], axis=1)
    df_non_prem = create_non_prem_df(df, cols=["premium_games", "premium_wins", "premium_losses", "premium_kills", "premium_deaths"], axis=1)
    return [df, df_non_prem]

In [29]:
dfs = []
dfs_non_prem = []
for season in range(13, 16):
    df_season, df_non_prem_season = create_df_per_season(season)
    # Keep the full and the non-premium frames in separate lists
    dfs.append(df_season)
    dfs_non_prem.append(df_non_prem_season)

In [25]:
dfs[2] # season 15 (index 0 is season 13)

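| | Ranking | Player | Wins | Kills | Deaths | KDR | Damage Dealt | Kills per Game | Damage Dealt per Minute | Melee/Bow Damage Ratio | ... | winstreak | best_winstreak | losses | premium_games | premium_wins | premium_losses | premium_kills | premium_deaths | double_losses | elo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | scolt | 89 | 2065 | 2140 | 0.96 | 36800 | 10.81 | 21.5 | 2.05 | ... | 0 | 5 | 89 | 0 | 0 | 0 | 0 | 0 | 1 | 315 |
| 1 | 2 | legrandmystic | 87 | 2101 | 2476 | 0.85 | 36300 | 9.91 | 19.5 | 3.77 | ... | 0 | 6 | 113 | 0 | 0 | 0 | 0 | 0 | 4 | 175 |
| 2 | 3 | iWeeiZzz | 85 | 1917 | 1873 | 1.02 | 32500 | 12.37 | 23.0 | 2.77 | ... | 2 | 8 | 65 | 0 | 0 | 0 | 0 | 0 | 0 | 735 |
| 3 | 4 | Skyxzzz | 84 | 1487 | 1650 | 0.90 | 26700 | 10.85 | 22.0 | 3.15 | ... | 0 | 7 | 51 | 0 | 0 | 0 | 0 | 0 | 0 | 775 |
| 4 | 5 | coz | 83 | 2005 | 2107 | 0.95 | 35700 | 11.59 | 23.0 | 2.77 | ... | 1 | 7 | 85 | 0 | 0 | 0 | 0 | 0 | 0 | 605 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | ontas | 19 | 519 | 570 | 0.91 | 9300 | 11.80 | 24.5 | 4.11 | ... | 0 | 4 | 24 | 0 | 0 | 0 | 0 | 0 | 2 | 135 |
| 96 | 97 | Sofonyas | 19 | 253 | 220 | 1.15 | 4800 | 11.00 | 23.5 | 1.77 | ... | 2 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 435 |
| 97 | 98 | amberdeilyes | 19 | 515 | 496 | 1.04 | 8200 | 11.70 | 22.5 | 3.56 | ... | 2 | 4 | 21 | 0 | 0 | 0 | 0 | 0 | 1 | 140 |
| 98 | 99 | SH0CI | 19 | 481 | 420 | 1.15 | 7900 | 12.33 | 22.5 | 3.14 | ... | 0 | 4 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 260 |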
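
Once there is one dataframe per season, comparing seasons is easier if they are stacked into a single frame with a season label. The following is a minimal sketch using `pd.concat` with a dict of frames; the player names and win counts are made up, and the `season` level name is my own choice.

```python
import pandas as pd

# Hypothetical per-season frames, keyed by season number
season_frames = {
    14: pd.DataFrame({"Player": ["a", "b"], "Wins": [10, 8]}),
    15: pd.DataFrame({"Player": ["a", "c"], "Wins": [12, 9]}),
}

# The dict keys become an outer index level, which is then moved into a column
combined = pd.concat(season_frames, names=["season", "row"]).reset_index(level="season")

# Example aggregate: total wins per season
wins_by_season = combined.groupby("season")["Wins"].sum()
print(wins_by_season)
```

A long format like this also works directly with seaborn's `hue` argument when plotting seasons side by side.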


# 4. Exploratory Data Analysis (EDA)
---

In [None]:
df.describe().plot()
df_non_prem.plot()

## 4.1 Overview
---

In [None]:
# Use the pairplot function from seaborn to create a scatterplot matrix
sns.pairplot(df_non_prem)

# Show the plot
plt.show()

In [None]:
# Plot the correlation matrix of the numeric columns using seaborn's heatmap function
sns.heatmap(df_non_prem.select_dtypes("number").corr())
plt.show()

## 4.2 Damage Dealt per Minute

In [None]:
# create scatter plot with a linear regression line (regplot draws both)
sns.regplot(x='Kills per Game', y='Damage Dealt per Minute', data=df,
            scatter_kws={'s': 20, 'color': 'red'})

# add x and y labels that match the plotted columns
plt.xlabel('Kills per Game')
plt.ylabel('Damage Dealt per Minute')

# add a title
plt.title('Correlation between Kills per Game and Damage Dealt per Minute')

# add grid
plt.grid()

# show plot
plt.show()

In [None]:
# Locate players with most damage dealt per minute
top_n_players = df.nlargest(20, 'Damage Dealt per Minute')
print(top_n_players[['Player', 'Damage Dealt per Minute']])

In [None]:
# Plot the data
sns.barplot(x="Player", y="Damage Dealt per Minute", data=top_n_players)

# add x and y labels
plt.xlabel('Player')
plt.ylabel('Damage Dealt per Minute')

# add a title
plt.title('Damage Dealt per Minute')

# rotate x-axis labels by 80 degrees
plt.xticks(rotation=80)

# show plot
plt.show()

## 4.3 Rating

In [None]:
# Sort the dataframe in descending order of rating and renumber the index,
# so that index labels match row positions in the cells below
df = df.sort_values(by='Rating',ascending=False).reset_index(drop=True)

# Get the 10 players with the lowest rating
worst_players = df.nsmallest(10, 'Rating')

# Get the 10 players with the highest rating
best_players = df.nlargest(10, 'Rating')

# Create a figure with 2 subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))

# Create a bar chart for the top 10 players with the worst rating
sns.barplot(x="Player", y="Rating", data=worst_players,ax=ax1)
ax1.set_title("Worst Players")
ax1.set_xlabel("Player")
ax1.set_ylabel("Rating")
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)

# Create a bar chart for the top 10 players with the best rating
sns.barplot(x="Player", y="Rating", data=best_players,ax=ax2)
ax2.set_title("Best Players")
ax2.set_xlabel("Player")
ax2.set_ylabel("Rating")
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=30)

# Show plot
plt.show()

## 4.4 Bow Accuracy / Flags

In [None]:
top_players = df.sort_values(by='Bow Accuracy', ascending=False).head(30)
top_players

In [None]:
non_eu_players = ['Caqs','sillyguy47','serti','Crine','MiniAnht','SoulSand','Levier','Kirikoupen','Algerie','AdamChen','thxchase'
                 ,'ReflexeZ', 'Blaszczak', 'Ceive', 'ssharpy', 'Sktchi', 'Gental', 'baob', 'PlayHigh', 'silent']
#print(len(non_eu_players))
non_eu_df = df[df['Player'].isin(non_eu_players)]
eu_df = df[~df['Player'].isin(non_eu_players)]
top_30_non_eu_players_bow_acc = non_eu_df.nlargest(20,'Bow Accuracy')
top_30_eu_players_bow_acc = eu_df.nlargest(20,'Bow Accuracy')

top_30_non_eu_players_flags = non_eu_df.nlargest(20,'Flags')
top_30_eu_players_flags = eu_df.nlargest(20,'Flags')
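
Besides the top-20 lists, the two groups can also be compared on averages. As a sketch, a region label can be derived from the same kind of name list and fed to `groupby`. The player names, the values and the `Region` column are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Player": ["p1", "p2", "p3", "p4"],
        "Bow Accuracy": [0.43, 0.36, 0.41, 0.42],
        "Flags": [19, 9, 22, 16],
    }
)
non_eu = ["p2", "p4"]  # hypothetical list of non-EU players

# Label every player, then average the metrics per region
demo["Region"] = demo["Player"].isin(non_eu).map({True: "Non-EU", False: "EU"})
summary = demo.groupby("Region")[["Bow Accuracy", "Flags"]].mean()
print(summary)
```

Because the two groups differ in size (20 non-EU names against the rest of the top 100), group means are a fairer comparison than raw totals.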

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,10))
sns.barplot(x="Player", y="Flags", data=top_30_eu_players_flags, ax=ax[0])
ax[0].set_title('Top EU Players by Flags')
ax[0].set_xlabel('Player')
ax[0].set_ylabel('Flags')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=80)

sns.barplot(x="Player", y="Flags", data=top_30_non_eu_players_flags, ax=ax[1])
ax[1].set_title('Top Non-EU Players by Flags')
ax[1].set_xlabel('Player')
ax[1].set_ylabel('Flags')
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=80)

plt.show()

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,10))
sns.barplot(x="Player", y="Bow Accuracy", data=top_30_eu_players_bow_acc, ax=ax[0])
ax[0].set_title('Top EU Players by Bow Accuracy')
ax[0].set_xlabel('Player')
ax[0].set_ylabel('Bow Accuracy')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=80)

sns.barplot(x="Player", y="Bow Accuracy", data=top_30_non_eu_players_bow_acc, ax=ax[1])
ax[1].set_title('Top Non-EU Players by Bow Accuracy')
ax[1].set_xlabel('Player')
ax[1].set_ylabel('Bow Accuracy')
ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=80)

plt.show()

## 4.5 Misc

In [None]:
# Create correlation matrix to determine strongest correlation 
df_num = df.loc[:, df.columns != "Player"]
plt.figure(figsize=(15,8))
plt.title("Correlation matrix")
sns.heatmap(df_num.corr(),annot=True, fmt='0.2f', cmap='YlGnBu')
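
Reading the strongest correlations off a large heatmap is error-prone, so they can also be ranked programmatically. The sketch below keeps the upper triangle of the correlation matrix, so that each pair appears once, and sorts by absolute value. The data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data
demo = pd.DataFrame(
    {
        "Kills": [10, 20, 30, 40, 50],
        "Deaths": [12, 22, 29, 41, 52],
        "Flags": [5, 1, 4, 2, 3],
    }
)

corr = demo.corr()

# Upper triangle only (k=1 drops the diagonal), so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute strength, keeping the sign in the values
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest)
```

Applied to the notebook's data, `demo` would be replaced by `df_num`.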

In [None]:
# Create the scatter plot
plt.scatter(df['Rating'], df['Golden Apples Eaten'])

# Add x and y labels
plt.xlabel('Rating')
plt.ylabel('Golden Apples Eaten')

# Add a title
plt.title('Rating vs Golden Apples Eaten')

# Show plot
plt.show()


In [None]:
sns.lmplot(x='Golden Apples Eaten', y='Rating', data=df)
plt.title('Rating vs Golden Apples Eaten')
plt.xlabel('Golden Apples Eaten')
plt.ylabel('Rating')
plt.show()

## 4.6 Distributions

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15,10))

sns.boxplot(x=df['Rating'], ax=ax[0,0], showfliers=True)
ax[0,0].set_title('Distribution of Rating')
ax[0,0].set_xlabel('Rating')

sns.boxplot(x=df['KDR'], ax=ax[0,1], showfliers=True)
ax[0,1].set_title('Distribution of KDR')
ax[0,1].set_xlabel('KDR')

sns.boxplot(x=df['Melee/Bow Damage Ratio'], ax=ax[1,0], showfliers=True)
ax[1,0].set_title('Distribution of Melee/Bow Damage Ratio')
ax[1,0].set_xlabel('Melee/Bow Damage Ratio')

sns.boxplot(x=df['Kills per Game'], ax=ax[1,1], showfliers=True)
ax[1,1].set_title('Distribution of Kills per Game')
ax[1,1].set_xlabel('Kills per Game')

plt.show()
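
The points drawn as fliers in these box plots follow matplotlib's default whisker rule of 1.5 times the interquartile range, which is a different criterion from the z-score threshold used in the cells below. As a sketch on made-up numbers, the fences can be computed by hand like this:

```python
import pandas as pd

# Hypothetical ratings with one extreme value
ratings = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as fliers with the default whis=1.5
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

fliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]
print(lower_fence, upper_fence, fliers.tolist())
```

A player can therefore show up as a flier in the box plot without exceeding a z-score of 3, and vice versa.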

In [None]:
# calculate the z-scores
z = np.abs(stats.zscore(df['Rating']))

# set a threshold for the z-scores
threshold = 3

# detect and store the indices of the outliers
outlier_indices = np.where(z > threshold)

# print the player names of the outliers (a boolean mask is independent of index order)
print(df.loc[z > threshold, 'Player'])
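
One pitfall with this approach is that `np.where` returns row positions, while `df.loc` looks rows up by index label, and the two stop matching once a dataframe has been sorted. The sketch below reproduces the issue on made-up data and uses a plain pandas z-score with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: 19 ordinary players and one extreme value
demo = pd.DataFrame(
    {
        "Player": [f"player_{i}" for i in range(20)],
        "Rating": [0.0] * 19 + [100.0],
    }
)

# Sorting moves the extreme row to position 0 but keeps its index label 19
demo = demo.sort_values(by="Rating", ascending=False)

# Population z-score (ddof=0)
z = (demo["Rating"] - demo["Rating"].mean()) / demo["Rating"].std(ddof=0)

# A boolean mask stays aligned with the rows regardless of index order
outliers = demo.loc[z.abs() > 3, "Player"].tolist()

# Positions from np.where must be used with iloc, not loc
positions = np.where(z.abs().to_numpy() > 3)[0]
by_position = demo["Player"].iloc[positions].tolist()  # correct row
by_label = demo["Player"].loc[positions].tolist()      # wrong row after sorting

print(outliers, by_position, by_label)
```

With only 100 players, a threshold of 3 is fairly strict, so it is not surprising if few or no players are flagged.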

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15,10))

sns.boxplot(x=df['Rating'], ax=ax[0,0])
ax[0,0].set_title('Rating')

sns.boxplot(x=df['KDR'], ax=ax[0,1])
ax[0,1].set_title('KDR')

sns.boxplot(x=df['Melee/Bow Damage Ratio'], ax=ax[1,0])
ax[1,0].set_title('Melee/Bow Damage Ratio')

sns.boxplot(x=df['Kills per Game'], ax=ax[1,1])
ax[1,1].set_title('Kills per Game')

plt.show()

In [None]:
# calculate the z-scores
z = np.abs(stats.zscore(df['Rating']))

# set a threshold for the z-scores
threshold = 3

# detect and store the indices of the outliers
outlier_indices = np.where(z > threshold)

# print the player names of the outliers (a boolean mask is independent of index order)
print(df.loc[z > threshold, 'Player'])

# create a box plot of the distribution of ratings
plt.boxplot(df['Rating'], showfliers=False) # set showfliers=True to show the outliers
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.show()


In [None]:
# increase the size of the plot
plt.figure(figsize=(10, 6))

# select the outlier rows with a boolean mask, which is independent of index order
z = np.abs(stats.zscore(df['Rating']))
threshold = 3
outliers = df[z > threshold]

# create a box plot and overlay the outliers, one row per player name
sns.boxplot(x=df['Rating'], color='skyblue', showfliers=False)
if not outliers.empty:
    sns.stripplot(x='Rating', y='Player', data=outliers, jitter=True,
                  linewidth=1, size=8, color='black')

# add a title and labels to the plot
plt.title("Distribution of Rating with Outlier Player Names", fontsize=18)
plt.xlabel("Rating", fontsize=14)
plt.ylabel("Player", fontsize=14)

# show the plot
plt.show()


In [None]:
# Create a horizontal, notched box plot and keep the returned artists for styling
fig, ax = plt.subplots(figsize=(10, 5))
bp = ax.boxplot(df['Rating'], patch_artist=True, notch=True, vert=False)
ax.set_title('Distribution of Rating')
ax.set_xlabel('Rating')
ax.set_yticks([])

# Colour the box blue with some transparency
for patch in bp['boxes']:
    patch.set_facecolor('#0000FF')
    patch.set_alpha(0.7)

# Add gridlines
ax.grid(visible=True, axis='x', linestyle='--', color='gray')

# detect and store the row positions of the outliers
z = np.abs(stats.zscore(df['Rating']))
threshold = 3
outlier_positions = np.where(z > threshold)[0]

# Annotate the player names of the outliers (positions, so use iloc)
for i in outlier_positions:
    ax.annotate(df['Player'].iloc[i], (df['Rating'].iloc[i], 0.5), 
                xytext=(10, 0), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='red'),
                bbox=dict(fc='white', ec='black', lw=2))

plt.show()

In [None]:
# Create the box plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot(df['Rating'])
ax.set_title('Distribution of Rating')

# detect and store the indices of the outliers
z = np.abs(stats.zscore(df['Rating']))
threshold = 3
outlier_indices = np.where(z > threshold)

# Annotate the player names of the outliers (np.where returns positions, so use iloc)
for i in outlier_indices[0]:
    ax.annotate(df['Player'].iloc[i], (1, df['Rating'].iloc[i]), 
                xytext=(10, -5), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='red'))
    
# Annotate the player names of the top 5 highest ratings
top5_highest_indices = df['Rating'].nlargest(5).index
for i in top5_highest_indices:
    ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']), 
                xytext=(-100, 10), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='green'))
    
# Annotate the player names of the top 5 lowest ratings
top5_lowest_indices = df['Rating'].nsmallest(5).index
for i in top5_lowest_indices:
    ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']), 
                xytext=(50, -10), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='blue'))
plt.show()


In [None]:
# Create the box plot
fig, ax = plt.subplots(figsize=(30, 23))
ax.boxplot(df['Rating'])
ax.set_title('Distribution of Rating', fontsize=20)
ax.set_ylabel('Rating', fontsize=15)
ax.set_xlabel('Players', fontsize=15)

# detect and store the indices of the outliers
z = np.abs(stats.zscore(df['Rating']))
threshold = 3
outlier_indices = np.where(z > threshold)

# Annotate the player names of the outliers (disabled)
#for i in outlier_indices[0]:
#    ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']),
#        xytext=(10, -5), textcoords='offset points', fontsize=20,
#        arrowprops=dict(arrowstyle='->', color='red'))

# Annotate the player names of the 6 highest ratings
highest_indices = df['Rating'].nlargest(6).index
for i in highest_indices:
    ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']),
        xytext=(-100, 10), textcoords='offset points', fontsize=14,
        arrowprops=dict(arrowstyle='->', color='green'))

# Annotate the player names of the 6 lowest ratings
lowest_indices = df['Rating'].nsmallest(6).index
for i in lowest_indices:
    ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']),
        xytext=(50, -10), textcoords='offset points', fontsize=14,
        arrowprops=dict(arrowstyle='->', color='blue'))
plt.show()

In [None]:
from adjustText import adjust_text

# Create the box plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot(df['Rating'])
ax.set_title('Distribution of Rating')

# detect and store the indices of the outliers
z = np.abs(stats.zscore(df['Rating']))
threshold = 3
outlier_indices = np.where(z > threshold)

# Annotate the player names of the outliers (np.where returns positions, so use iloc)
outliers = [ax.annotate(df['Player'].iloc[i], (1, df['Rating'].iloc[i]), 
                xytext=(10, -5), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='red')) for i in outlier_indices[0]]

# Annotate the player names of the top 5 highest ratings
top5_highest_indices = df['Rating'].nlargest(5).index
highest_annotations = [ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']), 
                xytext=(-100, 10), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='green')) for i in top5_highest_indices]

# Annotate the player names of the top 5 lowest ratings
top5_lowest_indices = df['Rating'].nsmallest(5).index
lowest_annotations = [ax.annotate(df.loc[i, 'Player'], (1, df.loc[i, 'Rating']), 
                xytext=(50, -10), textcoords='offset points',
                arrowprops=dict(arrowstyle='->', color='blue')) for i in top5_lowest_indices]

# adjust the positions of the text so that they don't overlap
texts = outliers + highest_annotations + lowest_annotations
adjust_text(texts, arrowprops=dict(arrowstyle="-", color='black', lw=0.5))

plt.show()

# 5. To Be Added
- Scrape data from the "Stats" tab on player profiles to get all-time data that isn't included in the leaderboard scrape (such as games and losses).