# Data Mining Project - WNBA Playoffs Prediction - G24

## Business Understanding

#### Our data
Basketball tournaments are usually split into two parts. First, all teams play each other, aiming to win as many games as possible. Then, at the end of this regular season, a predetermined number of teams with the most wins qualify for the playoffs, where they play a series of knockout matches for the trophy.

Over 10 seasons, data on players, teams, coaches, games and several other metrics were gathered and arranged in this dataset. The goal is to use this data to predict which teams will qualify for the playoffs in the next season.



#### Competition Format
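The 12 teams in the WNBA are split into an Eastern Conference and a Western Conference. WNBA fixtures begin with preseason games in May before each team plays 20 home games and 20 road games during the regular season. Note that the seasons covered by our dataset were shorter: the sample rows of the teams table shown later in this notebook list 32 or 34 games per team (GP).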

The aim for every team is to qualify for the Playoffs, which begin in September each year.

The eight WNBA teams with the best regular season records qualify for the Playoffs, regardless of conference. Higher seeds match up with lower seeds, so the top seed faces the eighth seed, the second seed faces the seventh seed and so on.

The first round is a best-of-three series. The semifinals and final are both best-of-five, meaning teams need three wins to claim victory in the series.
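
The seeding rule and series lengths described above can be sketched in a few lines. The helper names `playoff_matchups` and `wins_needed` are hypothetical and not part of the project code; the eight-team bracket is simply the one described in this section.

```python
def playoff_matchups(seeds):
    """Pair the highest remaining seed with the lowest one (1 vs 8, 2 vs 7, ...)."""
    n = len(seeds)
    return [(seeds[i], seeds[n - 1 - i]) for i in range(n // 2)]


def wins_needed(best_of):
    """Number of wins required to clinch a best-of-N series."""
    return best_of // 2 + 1


first_round = playoff_matchups([1, 2, 3, 4, 5, 6, 7, 8])
print(first_round)     # [(1, 8), (2, 7), (3, 6), (4, 5)]
print(wins_needed(3))  # 2
print(wins_needed(5))  # 3
```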

## Database Connection

We used a free service to host our database, which runs on PostgreSQL.

In [30]:
import json
import psycopg2
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [29]:
# DB Credentials

with open("config.json") as config_file:
    config = json.load(config_file)

host = config["db_host"]
user = config["db_user"]
password = config["db_password"]
database = config["db_database"]
schema = config["db_schema"]

In [50]:
connection = psycopg2.connect(
    host=host,
    user=user,
    password=password,
    database=database
)

cursor = connection.cursor()

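def execute(query):
    """Run a statement and commit; return rows only if the statement produced any."""
    cursor.execute(query)
    connection.commit()
    # cursor.description is None for statements without a result set (e.g. a plain INSERT),
    # in which case fetchall() would raise psycopg2.ProgrammingError
    return cursor.fetchall() if cursor.description is not None else None

def fetch(query):
    """Run a read-only query and return all rows."""
    cursor.execute(query)
    return cursor.fetchall()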

SELECT = "SELECT * FROM " + schema + "." # + table_name 
INSERT = "INSERT INTO " + schema + "." # + table_name + " VALUES " + values
UPDATE = "UPDATE " + schema + "." # + table_name + " SET " + column_name + " = " + value
DELETE = "DELETE FROM " + schema + "."  # + table_name + " WHERE " + column_name + " = " + value

# Test data, year 10
TEST_DATA = "year = 10"
TRAIN_DATA = "year < 10"

The data about the players, teams and coaches consists of the following relations:

- awards_players (96 objects): each record describes an award or prize received by a player during the 10 seasons
- coaches (163 objects): each record describes a coach who managed a team during the period
- players (894 objects): each record contains the details of a player
- players_teams (1877 objects): each record describes the performance of a player for a team they played for
- series_post (71 objects): each record describes the result of a playoff series
- teams (143 objects): each record describes the performance of a team in a season
- teams_post (81 objects): each record describes the results of a team in the post-season


In [32]:
awards_players = fetch(SELECT + "awards_players") # awards and prizes received by players across 10 seasons,
coaches = fetch(SELECT + "coaches") # all coaches who've managed the teams during the time period,
players = fetch(SELECT + "players") # details of all players,
players_teams = fetch(SELECT + "players_teams") # performance of each player for each team they played,
series_post = fetch(SELECT + "series_post") # series' results,
teams = fetch(SELECT + "teams") # performance of the teams for each season,
teams_post = fetch(SELECT + "teams_post") # results of each team at the post-season.

In [33]:
#save the data in a dataframe
awards_players_df = pd.DataFrame(awards_players, columns=['playerID', 'award', 'year', 'lgID'])
coaches_df = pd.DataFrame(coaches, columns=['coachID', 'year', 'tmID', 'lgID', 'stint', 'won', 'lost', 'post_wins', 'post_losses'])
players_df = pd.DataFrame(players, columns=['bioID', 'pos', 'firstseason', 'lastseason', 'height', 'weight', 'college', 'collegeOther', 'birthDate', 'deathDate'])
players_teams_df = pd.DataFrame(players_teams, columns=['playerID', 'year', 'stint', 'tmID', 'lgID', 'GP', 'GS', 'minutes', 'points', 'oRebounds', 'dRebounds', 'rebounds', 'assists', 'steals', 'blocks', 'turnovers', 'PF', 'fgAttempted', 'fgMade', 'ftAttempted', 'ftMade', 'threeAttempted', 'threeMade', 'dq', 'PostGP', 'PostGS', 'PostMinutes', 'PostPoints', 'PostoRebounds', 'PostdRebounds', 'PostRebounds', 'PostAssists', 'PostSteals', 'PostBlocks', 'PostTurnovers', 'PostPF', 'PostfgAttempted', 'PostfgMade', 'PostftAttempted', 'PostftMade', 'PostthreeAttempted', 'PostthreeMade', 'PostDQ'])
series_post_df = pd.DataFrame(series_post, columns=['year', 'round', 'series', 'tmIDWinner', 'lgIDWinner', 'tmIDLoser', 'lgIDLoser', 'W', 'L'])
teams_df = pd.DataFrame(teams, columns=['year', 'lgID', 'tmID', 'franchID', 'confID', 'divID', 'rank', 'playoff', 'seeded', 'firstRound', 'semis', 'finals', 'name', 'o_fgm', 'o_fga', 'o_ftm', 'o_fta', 'o_3pm', 'o_3pa', 'o_oreb', 'o_dreb', 'o_reb', 'o_asts', 'o_pf', 'o_stl', 'o_to', 'o_blk', 'o_pts', 'd_fgm', 'd_fga', 'd_ftm', 'd_fta', 'd_3pm', 'd_3pa', 'd_oreb', 'd_dreb', 'd_reb', 'd_asts', 'd_pf', 'd_stl', 'd_to', 'd_blk', 'd_pts', 'tmORB', 'tmDRB', 'tmTRB', 'opptmORB', 'opptmDRB', 'opptmTRB', 'won', 'lost', 'GP', 'homeW', 'homeL', 'awayW', 'awayL', 'confW', 'confL', 'min', 'attend', 'arena'])
teams_post_df = pd.DataFrame(teams_post, columns=['year', 'tmID', 'lgID', 'W', 'L'])

#make a dictionary with all the dataframes
dfs = {'awards_players_df': awards_players_df, 'coaches_df': coaches_df, 'players_df': players_df, 'players_teams_df': players_teams_df, 'series_post_df': series_post_df, 'teams_df': teams_df, 'teams_post_df': teams_post_df}
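
The column lists above are typed by hand, so they can silently drift from the database schema. A DB-API cursor exposes the column names of the last query in `cursor.description`, which allows reading them instead of hardcoding them. The sketch below uses an in-memory SQLite database as a stand-in for the PostgreSQL instance so that it runs without credentials; the helper `fetch_with_columns` and the one-row sample table are assumptions made for illustration.

```python
import sqlite3


def fetch_with_columns(cursor, query):
    """Run a query and return (column_names, rows)."""
    cursor.execute(query)
    # Per DB-API 2.0, description holds one sequence per column; item 0 is the name
    columns = [col[0] for col in cursor.description]
    return columns, cursor.fetchall()


# Stand-in data: the notebook would pass its psycopg2 cursor instead
demo_connection = sqlite3.connect(":memory:")
demo_cursor = demo_connection.cursor()
demo_cursor.execute("CREATE TABLE teams_post (year INTEGER, tmID TEXT, W INTEGER, L INTEGER)")
demo_cursor.execute("INSERT INTO teams_post VALUES (1, 'HOU', 6, 0)")

columns, rows = fetch_with_columns(demo_cursor, "SELECT * FROM teams_post")
print(columns)  # ['year', 'tmID', 'W', 'L']
print(rows)     # [(1, 'HOU', 6, 0)]
demo_connection.close()
```

The result can be passed straight to `pd.DataFrame(rows, columns=columns)`. PostgreSQL folds unquoted identifiers to lower case, so the names returned there may differ in case from the hand-written lists.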

## Data Understanding

### First approach

We started with an Exploratory Data Analysis. There are 7 tables, which differ in both number of rows and number of columns.

In [34]:
#for each table, print the table name, number of rows and columns
for df in dfs:
    print(df)
    print(dfs[df].shape,'\n')

awards_players_df
(95, 4) 

coaches_df
(162, 9) 

players_df
(893, 10) 

players_teams_df
(1876, 43) 

series_post_df
(70, 9) 

teams_df
(142, 61) 

teams_post_df
(80, 5) 



Before jumping into the data analysis, one of the first things we noticed is that most tables have a "League ID" attribute (lgID, or lgIDWinner and lgIDLoser in series_post). As the WNBA is the only league we are covering, there is no variability in this column, so we can drop it, along with any other column that always holds the same value.

In [35]:
#Drop columns whose values are always the same
for df in dfs:
    for col in dfs[df].columns:
        if len(dfs[df][col].unique()) == 1:
            print(df, col)
            dfs[df].drop(col, inplace=True, axis=1)

awards_players_df lgID
coaches_df lgID
players_df firstseason
players_df lastseason
players_teams_df lgID
series_post_df lgIDWinner
series_post_df lgIDLoser
teams_df lgID
teams_df divID
teams_df seeded
teams_df tmORB
teams_df tmDRB
teams_df tmTRB
teams_df opptmORB
teams_df opptmDRB
teams_df opptmTRB
teams_post_df lgID


Before continuing, we also noted that there are four deceased players in the players table. We should take that into account when doing the analysis.

In [36]:
#print dead players
print(players_df[players_df['deathDate'] != '0000-00-00'])

          bioID pos  height  weight            college collegeOther  \
225  dydekma01w   C     9.0     223                                   
605  perroki01w   G    65.0     130       SW Louisiana                
625  priceka01w   G    70.0     148  Stephen F. Austin                
881  yasenco01w   G    72.0     160             Purdue                

      birthDate   deathDate  
225  1974-04-28  2011-05-27  
605  1967-01-18  1999-08-19  
625  1975-12-03  1999-01-18  
881  1973-12-05  2001-05-12  


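Also, some entries in the players table never appear in players_teams, i.e. they did not play in any of the given seasons. There are 338 such entries, and we should take that into account when doing the analysis. Some of them, such as adubari99w, also appear as a coachID in the coaches table.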

In [51]:
#players with no row in players_teams, i.e. who did not play in any of the 10 seasons
fetch("SELECT p.bioid FROM wnba.players p WHERE p.bioid not in (select pt.playerid  from wnba.players_teams pt)")

[('abrahta01w',),
 ('adairje01w',),
 ('adamsda01w',),
 ('adamsmi01w',),
 ('adubari99w',),
 ('aglerbr99w',),
 ('alberma01w',),
 ('alexaer01w',),
 ('allenso99w',),
 ('angelyv01w',),
 ('appelja01w',),
 ('artiska01w',),
 ('ayimmi01w',),
 ('bakerla01w',),
 ('bassmi01w',),
 ('becenry01w',),
 ('beckan99wc',),
 ('bellje01w',),
 ('berggas01w',),
 ('berrysh01w',),
 ('berubca01w',),
 ('bibbyhe01w',),
 ('bishoab01w',),
 ('bjedoni01w',),
 ('bjorkan01w',),
 ('bladerh01w',),
 ('boguemu01w',),
 ('boldeba01w',),
 ('bookeka01w',),
 ('bosweca01w',),
 ('bouceje01w',),
 ('bouchke01w',),
 ('boyerli99w',),
 ('bradlki01w',),
 ('brancli01w',),
 ('branzal01w',),
 ('branzge01w',),
 ('braxtja01w',),
 ('brcanra01w',),
 ('brelaje01w',),
 ('brownci01w',),
 ('brownde01w',),
 ('brownla01w',),
 ('brownle01w',),
 ('brucegr99w',),
 ('bryanjo01w',),
 ('burgehe01w',),
 ('burgehe02w',),
 ('byrdla01w',),
 ('cabezli01w',),
 ('cainke01w',),
 ('camba01w',),
 ('cartede01w',),
 ('cartesy01w',),
 ('cebriel01w',),
 ('chacoke01w',),
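
The same check can be done on the DataFrames without another round trip to the database. This is a minimal sketch on tiny stand-in frames rather than the real tables: the IDs are taken from the outputs in this notebook, while the rest of the frame contents is made up.

```python
import pandas as pd

players_demo = pd.DataFrame({"bioID": ["abrahta01w", "abrossv01w", "adairje01w"]})
players_teams_demo = pd.DataFrame({"playerID": ["abrossv01w", "abrossv01w"], "year": [2, 3]})

# Keep the players whose ID never appears in players_teams
never_played = players_demo[~players_demo["bioID"].isin(players_teams_demo["playerID"])]
print(never_played["bioID"].tolist())  # ['abrahta01w', 'adairje01w']
```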

#### Describing the data

For starters, we printed the head of each table to get a sense of the data.

In [37]:
# do a head of each table
for df in dfs:
    print(df)
    display(dfs[df].head())
    

awards_players_df


Unnamed: 0,playerID,award,year
0,thompti01w,All-Star Game Most Valuable Player,1
1,leslili01w,All-Star Game Most Valuable Player,2
2,leslili01w,All-Star Game Most Valuable Player,3
3,teaslni01w,All-Star Game Most Valuable Player,4
4,swoopsh01w,All-Star Game Most Valuable Player,6


coaches_df


Unnamed: 0,coachID,year,tmID,stint,won,lost,post_wins,post_losses
0,adamsmi01w,5,WAS,0,17,17,1,2
1,adubari99w,1,NYL,0,20,12,4,3
2,adubari99w,2,NYL,0,21,11,3,3
3,adubari99w,3,NYL,0,18,14,4,4
4,adubari99w,4,NYL,0,16,18,0,0


players_df


Unnamed: 0,bioID,pos,height,weight,college,collegeOther,birthDate,deathDate
0,abrahta01w,C,74.0,190,George Washington,,1975-09-27,0000-00-00
1,abrossv01w,F,74.0,169,Connecticut,,1980-07-09,0000-00-00
2,adairje01w,C,76.0,197,George Washington,,1986-12-19,0000-00-00
3,adamsda01w,F-C,73.0,239,Texas A&M,Jefferson College (JC),1989-02-19,0000-00-00
4,adamsjo01w,C,75.0,180,New Mexico,,1981-05-24,0000-00-00


players_teams_df


Unnamed: 0,playerID,year,stint,tmID,GP,GS,minutes,points,oRebounds,dRebounds,...,PostBlocks,PostTurnovers,PostPF,PostfgAttempted,PostfgMade,PostftAttempted,PostftMade,PostthreeAttempted,PostthreeMade,PostDQ
0,abrossv01w,2,0,MIN,26,23,846,343,43,131,...,0,0,0,0,0,0,0,0,0,0
1,abrossv01w,3,0,MIN,27,27,805,314,45,101,...,0,0,0,0,0,0,0,0,0,0
2,abrossv01w,4,0,MIN,30,25,792,318,44,97,...,1,8,8,22,6,8,8,7,3,0
3,abrossv01w,5,0,MIN,22,11,462,146,17,57,...,2,3,7,23,8,4,2,8,2,0
4,abrossv01w,6,0,MIN,31,31,777,304,29,78,...,0,0,0,0,0,0,0,0,0,0


series_post_df


Unnamed: 0,year,round,series,tmIDWinner,tmIDLoser,W,L
0,1,FR,A,CLE,ORL,2,1
1,1,FR,B,NYL,WAS,2,0
2,1,FR,C,LAS,PHO,2,0
3,1,FR,D,HOU,SAC,2,0
4,1,CF,E,HOU,LAS,2,0


teams_df


Unnamed: 0,year,tmID,franchID,confID,rank,playoff,firstRound,semis,finals,name,...,GP,homeW,homeL,awayW,awayL,confW,confL,min,attend,arena
0,9,ATL,ATL,EA,7,N,,,,Atlanta Dream,...,34,1,16,3,14,2,18,6825,141379,Philips Arena
1,10,ATL,ATL,EA,2,Y,L,,,Atlanta Dream,...,34,12,5,6,11,10,12,6950,120737,Philips Arena
2,1,CHA,CHA,EA,8,N,,,,Charlotte Sting,...,32,5,11,3,13,5,16,6475,90963,Charlotte Coliseum
3,2,CHA,CHA,EA,4,Y,W,W,L,Charlotte Sting,...,32,11,5,7,9,15,6,6500,105525,Charlotte Coliseum
4,3,CHA,CHA,EA,2,Y,L,,,Charlotte Sting,...,32,11,5,7,9,12,9,6450,106670,Charlotte Coliseum


teams_post_df


Unnamed: 0,year,tmID,W,L
0,1,HOU,6,0
1,1,ORL,1,2
2,1,CLE,3,3
3,1,WAS,0,2
4,1,NYL,4,3


Having an idea of what the data looks like, we now want a more detailed description of each table.

In [None]:
# do a describe of each table
for df in dfs:
    print(df)
    display(dfs[df].describe(include='all'))
    

In [None]:
# show the info of each table
for df in dfs:
    print(df)
    # info() prints its report itself and returns None, so it is not wrapped in print()
    dfs[df].info()
    print()

With the info and describe outputs we understand that:
- There are no null entries (although there are values that are simply an empty string), as we will also confirm next.
- There are some columns with the dtype "object", most of them being strings.
- There are two-valued objects in the 'teams' table (confID, a conference code such as "EA", and playoff, with the values "Y" or "N") that could be replaced by a binary variable, as well as ternary objects (firstRound, semis and finals in the 'teams' table, with the values "W", "L" or "") that could also be transformed.
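
One possible way to carry out the transformation mentioned in the last point is a plain value mapping. This is only a sketch on a made-up three-row frame; in particular, coding an empty round result as -1 is an assumption, not a decision taken in this project.

```python
import pandas as pd

teams_demo = pd.DataFrame({
    "playoff": ["N", "Y", "Y"],
    "firstRound": ["", "L", "W"],
})

# Binary flag: 1 if the team reached the playoffs, 0 otherwise
teams_demo["playoff"] = teams_demo["playoff"].map({"Y": 1, "N": 0})

# Ternary result: 1 = won the round, 0 = lost it, -1 = did not play it (assumed coding)
teams_demo["firstRound"] = teams_demo["firstRound"].map({"W": 1, "L": 0, "": -1})

print(teams_demo["playoff"].tolist())     # [0, 1, 1]
print(teams_demo["firstRound"].tolist())  # [-1, 0, 1]
```

Any value missing from the mapping becomes NaN, which makes unexpected codes easy to spot afterwards.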

In [None]:
# do a isnull of each table to see if there are null values
for df in dfs:
    print(df)
    print(dfs[df].isnull().sum(), '\n')

We wanted to see what were the different values for the objects and check if there were more objects that could be transformed into binary and ternary variables, so we printed the objects unique values and their frequencies.

In [None]:
# check the value counts for each column of type object
for df in dfs:
    print(df)
    for col in dfs[df].columns:
        if dfs[df][col].dtype == 'object':
            print(dfs[df][col].value_counts(), '\n')

- We conclude that the previously identified variables are the only objects with 2 or 3 unique values (confID and playoff in the 'teams' table, with two values each, and firstRound, semis and finals in the 'teams' table, with the values "W", "L" or "").
- There are players with no position and no college assigned ("").
- There are players with no date of birth in the record (0000-00-00).

We moved on to the numerical variables, trying to find outliers using boxplots (as we have a lot of variables, we ran them only for the ones we thought had outliers).

In [None]:
for df in dfs:
    # draw box plots only for the numerical columns we suspect of having outliers
    for col in dfs[df].columns:
        if col in ('height', 'weight') and dfs[df][col].dtype != 'object':
            plt.figure(figsize=(10, 10))
            sns.boxplot(x=dfs[df][col])
            plt.show()

- Looking at the boxplots, we can see that the height and weight variables contain default 0 values, which should be treated as null values.
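
The placeholders found so far (empty strings, the all-zero date and the 0 heights and weights) could be turned into real missing values so that `isnull()` reports them. A minimal sketch on a made-up frame that reuses the column names of the players table:

```python
import pandas as pd

players_demo = pd.DataFrame({
    "pos": ["C", "", "G"],
    "height": [74.0, 0.0, 72.0],
    "birthDate": ["1975-09-27", "0000-00-00", "1973-12-05"],
})

# mask() replaces the entries where the condition holds with NaN
players_demo["pos"] = players_demo["pos"].mask(players_demo["pos"] == "")
players_demo["birthDate"] = players_demo["birthDate"].mask(players_demo["birthDate"] == "0000-00-00")
players_demo["height"] = players_demo["height"].mask(players_demo["height"] == 0)

# Count the missing values per column as plain Python integers
missing = {col: int(n) for col, n in players_demo.isnull().sum().items()}
print(missing)  # {'pos': 1, 'height': 1, 'birthDate': 1}
```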

In [None]:
#wins and losses by team, sorted by number of wins: stacked bar chart
#aggregate both columns in one groupby so that wins and losses stay aligned per team
wins_by_team = teams_df.groupby('name')[['won', 'lost']].sum()
wins_by_team = wins_by_team.sort_values(by='won', ascending=False)
wins_by_team.plot(kind='bar', stacked=True, figsize=(20, 10))
plt.show()

- The number of games played by each team differs (there may be teams that are no longer playing), so we can't compare the number of wins and losses directly. We need to calculate the percentage of wins and losses for each team.

In [None]:
#win percentage by team, sorted by win percentage: bar chart with color gradient, with a horizontal line at the league average
wins_by_team['win_percentage'] = wins_by_team['won'] / (wins_by_team['won'] + wins_by_team['lost'])
wins_by_team = wins_by_team.sort_values(by='win_percentage', ascending=False)
plt.figure(figsize=(20, 10))
sns.barplot(x=wins_by_team.index, y=wins_by_team['win_percentage'], palette='rocket')
plt.axhline(wins_by_team['win_percentage'].mean(), color='black')
plt.xticks(rotation=90)
plt.show()

- In terms of win percentage, it seems like a competitive league: more than half of the teams have a win percentage of 50% or more, with the better teams taking advantage of the weakest ones, and only one team is below 40%.

As we saw that some teams played more games than others, we want to see which teams played in which seasons, to check whether there are teams that are no longer playing.

In [None]:
#Enumerate the seasons a team played in the league
seasons_by_team = teams_df.groupby('name')['year']
print(seasons_by_team.unique())
print("-----------------------")
#print number of seasons played
print(seasons_by_team.nunique())
print("-----------------------")
#filter only teams that have played in the last season
teams_last_season = teams_df[teams_df['year'] == 10]
print(teams_last_season['name'].unique())
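
Instead of comparing the printed lists by eye, the teams that are no longer playing can be derived directly. A small sketch on a stand-in frame (two teams with seasons taken from the table head shown earlier); the names `LAST_SEASON` and `inactive_teams` are introduced here for illustration.

```python
import pandas as pd

teams_demo = pd.DataFrame({
    "name": ["Charlotte Sting", "Charlotte Sting", "Atlanta Dream", "Atlanta Dream"],
    "year": [1, 2, 9, 10],
})
LAST_SEASON = 10

# Last season in which each team appears
last_year_by_team = teams_demo.groupby("name")["year"].max()
# Teams whose last appearance is before the final season are no longer active
inactive_teams = sorted(last_year_by_team[last_year_by_team < LAST_SEASON].index)
print(inactive_teams)  # ['Charlotte Sting']
```

Grouping by `name` treats a renamed or relocated franchise as a new team; grouping by `franchID` instead may be preferable if that distinction matters.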


### Going deeper

Some tables have a lot of attributes (players_teams has 43 and teams has 61, for example). We will evaluate the correlations and delete the attributes that belong to the most correlated pairs.

In [None]:
MAX_CORRELATION = 0.8

In [None]:
def delete_most_correlated(df):
    """Plot the correlation matrix of df and return a copy without its highly correlated columns."""
    df_copy = df.copy()
    correlation_matrix = df_copy.corr()

    plt.figure(figsize=(20, 20))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    # Keep only the upper triangle (k=1 skips the diagonal) so that each pair
    # appears once, instead of relying on the sort order to skip (b, a) after (a, b)
    upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
    sorted_correlations = upper.stack().sort_values(ascending=False)

    # Get the pairs of attributes with the highest correlation values
    most_correlated_pairs = sorted_correlations[sorted_correlations > MAX_CORRELATION]
    print(most_correlated_pairs)

    # Drop both attributes of every highly correlated pair
    to_drop = {col for pair in most_correlated_pairs.index for col in pair}
    df_copy = df_copy.drop(columns=list(to_drop))

    # Print the attributes that remain
    print(df_copy.columns)
    return df_copy

Let's run the function for all the tables.

In [None]:
for df in dfs:
    print(df)
    #select only the numerical columns
    #the returned frame is not assigned, so the tables in dfs stay unchanged for now
    delete_most_correlated(dfs[df].select_dtypes(include=np.number))
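
The function above drops both attributes of each highly correlated pair. A common variant, shown here only as an alternative and not as the approach used in this notebook, keeps one attribute per pair and also counts strong negative correlations by using the absolute value. The function name and the demo data are assumptions.

```python
import numpy as np
import pandas as pd


def drop_one_of_each_correlated_pair(df, threshold=0.8):
    """Return df without the later column of every pair whose |correlation| > threshold."""
    corr = df.corr().abs()
    # Upper triangle without the diagonal: each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


demo = pd.DataFrame({
    "won": [20, 10, 15, 5],
    "lost": [12, 22, 17, 27],       # won + lost is constant, so the two are perfectly anti-correlated
    "attend": [100, 140, 90, 120],  # only moderately correlated with the other two
})
reduced = drop_one_of_each_correlated_pair(demo)
print(list(reduced.columns))  # ['won', 'attend']
```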

With that we end our understanding phase.
Our main takeaways are:
- There are deceased players that should be removed.
- There are no null entries (although there are values that are simply an empty string).
- There are some columns with the dtype "object", most of them being strings.
- There are two-valued objects in the 'teams' table (confID, a conference code such as "EA", and playoff, with the values "Y" or "N") that could be replaced by a binary variable, as well as ternary objects (firstRound, semis and finals in the 'teams' table, with the values "W", "L" or "") that could also be transformed.
- There are players with no position and no college assigned ("").
- There are players with no date of birth in the record (0000-00-00).
- The height and weight variables have default 0 values and should be treated as null values.
- The number of games played by each team differs (there may be teams that are no longer playing), so we can't compare the number of wins and losses directly. Win percentage should be used.
- In terms of win percentage, it seems like a competitive league: more than half of the teams have a win percentage of 50% or more, with the better teams taking advantage of the weakest ones, and only one team is below 40%.
- There are teams that are no longer playing.
- There are a lot of highly correlated variables.

## Merging Tables (earlier steps are still missing, but I will see whether I can start thinking about this)

Merged tables:
- teams with teams_post: some values are NaN when the team didn't participate in the playoffs
- players with players_teams: some values are NaN when the player didn't play for any team; should such cases be removed?
- awards_players with players

- Next ideas: lists of aggregated values, or 0/1 columns when there are only a few distinct values
- Is series_post even relevant?
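
Before settling how to handle the NaN values, it can help to make the merge report where they come from. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on made-up stand-in rows (the post-season W and L values in particular are invented); filling missing post-season results with 0 is one possible choice, not a decision taken here.

```python
import pandas as pd

teams_demo = pd.DataFrame({"tmID": ["ATL", "ATL", "CHA"], "year": [9, 10, 1], "won": [4, 18, 8]})
teams_post_demo = pd.DataFrame({"tmID": ["ATL"], "year": [10], "W": [0], "L": [2]})

merged = teams_demo.merge(
    teams_post_demo.rename(columns={"W": "W_post", "L": "L_post"}),
    on=["tmID", "year"],
    how="left",
    validate="one_to_one",  # raise if a (tmID, year) key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
# Teams that missed the playoffs have no post-season row; treat that as 0 wins and 0 losses
merged[["W_post", "L_post"]] = merged[["W_post", "L_post"]].fillna(0).astype(int)

print(merged["_merge"].tolist())  # ['left_only', 'both', 'left_only']
print(merged["L_post"].tolist())  # [0, 2, 0]
```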

In [None]:


# rename the post-season columns so they stay recognisable after the merge, then join on team and season
teams_post_df = teams_post_df.rename(columns={'W': 'W_post', 'L': 'L_post'})
teams_df = teams_df.merge(teams_post_df, on=['tmID', 'year'], how='left')

teams_df.head()

In [None]:
# join players (bioID) with players_teams (playerID)

players_teams_df = players_teams_df.rename(columns={'playerID': 'bioID'})
# the left join keeps players without any players_teams row (their stats become NaN)
players_df = players_df.merge(players_teams_df, on=['bioID'], how='left')
players_df = players_df.rename(columns={'bioID': 'playerID'})
players_df.head()

In [None]:
# merge players with awards on playerID and year
# left join: players without an award in a given year get NaN in the award column
players_df = players_df.merge(awards_players_df, on=['playerID', 'year'], how='left')
#players_df.boxplot(column='points', by='award')
players_df['award'].value_counts()





In [47]:
connection.close()