# Dataset exploration

In this notebook, we build a single dataset that combines the relevant information from all the provided files.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import math
import matplotlib.pyplot as plt

Mounted at /content/drive


Let's load all files into pandas dataframes for manipulation.

In [2]:
awards_players = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/awards_players.csv")
awards_players.designation = 'awards_players'

coaches = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/coaches.csv")
coaches.designation = 'coaches'

players = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/players.csv")
players.designation = 'players'

players_teams = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/players_teams.csv")
players_teams.designation = 'players_teams'

series_post = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/series_post.csv")
series_post.designation = 'series_post'

teams = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/teams.csv")
teams.designation = 'teams'

teams_post = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/teams_post.csv")
teams_post.designation = 'teams_post'

teams_11 = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/Season_11/teams.csv")
teams_11.designation = 'teams_11'

coaches_11 = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/Season_11/coaches.csv")
coaches_11.designation = 'coaches_11'

players_teams_11 = pd.read_csv("/content/drive/Shareddrives/ML 2024/dataset/Season_11/players_teams.csv")
players_teams_11.designation = 'players_teams_11'

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/Shareddrives/ML 2024/dataset/awards_players.csv'

# Cleaning the dataset

First off, let's do some basic cleaning, starting with a check for duplicate rows.

In [None]:
def executer(df, fun):
  # Apply `fun` to `df` and report the row count before and after.
  # `fun` must modify `df` in place, otherwise its result is discarded.
  print(df.designation)
  print(f'Rows before = {df.shape[0]}')
  fun(df)
  print(f'Rows after = {df.shape[0]}\n')


executer(awards_players, lambda x : x.drop_duplicates(inplace=True))
executer(coaches, lambda x : x.drop_duplicates(inplace=True))
executer(players, lambda x : x.drop_duplicates(inplace=True))
executer(players_teams, lambda x : x.drop_duplicates(inplace=True))
executer(series_post, lambda x : x.drop_duplicates(inplace=True))
executer(teams, lambda x : x.drop_duplicates(inplace=True))
executer(teams_post, lambda x : x.drop_duplicates(inplace=True))
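As a small illustration of why the result of `drop_duplicates` has to be either captured or applied in place, here is a sketch on a made-up three-row frame (the data is purely illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative data only).
toy = pd.DataFrame({"playerID": ["a", "a", "b"], "year": [1, 1, 1]})

deduped = toy.drop_duplicates()      # returns a new frame; `toy` is untouched
print(len(toy), len(deduped))        # 3 2

toy.drop_duplicates(inplace=True)    # modifies `toy` itself
print(len(toy))                      # 2
```

Calling `drop_duplicates()` without keeping the returned frame leaves the original row count unchanged, so a "rows before / rows after" report would always show the same number.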

The year 11 data does not contain duplicates, so we skipped that check for it. The following steps need it to be part of the main tables, though, so let's concatenate.

In [None]:
teams = pd.concat([teams,teams_11], ignore_index=True)
coaches = pd.concat([coaches,coaches_11], ignore_index=True)
players_teams = pd.concat([players_teams,players_teams_11], ignore_index=True)

Now, to be able to train a model on this dataset, we are going to add a variable to each year's row which states whether the team went to the playoffs in the following year.

For the final two years, of course, this "label" variable will be undefined. This is fine for year 10, since that is the value we seek to predict.

It is also fine for year 11, since this year only provides additional information for guessing the "label" of year 10, that is, whether the teams went to the playoffs in year 11.

In [None]:
main_df = teams.sort_values(by=['franchID', 'year'])
# Take the next row's 'playoff' value as this row's label...
main_df['label'] = main_df['playoff'].shift(-1)
# ...but only keep it when the next row belongs to the same franchise
main_df['next_team'] = main_df['franchID'].shift(-1)
op = lambda row : row['label'] if row['next_team'] == row['franchID'] else None
main_df['label'] = main_df.apply(op, axis=1)
main_df = main_df.drop('next_team', axis=1)
main_df.head()
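The "next year's playoff result as label" idea can also be sketched with a grouped shift, which never crosses franchise boundaries. This is only an alternative sketch on made-up data (the franchise codes and results are illustrative), and like the cell above it assumes each franchise's rows are consecutive seasons:

```python
import pandas as pd

# Illustrative data: two made-up franchises with their playoff results.
toy = pd.DataFrame({
    "franchID": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year":     [1, 2, 3, 1, 2],
    "playoff":  ["Y", "N", "Y", "N", "Y"],
})

toy = toy.sort_values(["franchID", "year"])
# Within each franchise, pull the following season's result up by one row.
toy["label"] = toy.groupby("franchID")["playoff"].shift(-1)
print(toy)
```

The last season of each franchise ends up with a missing label, which matches the behaviour described for the final year.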

Let's drop the 'lgID' column (it is the same for every team), 'arena' and 'name'.

Eventually we will drop the 'tmID' column, because there are cases where the same team is denoted by different 'tmID's over time, rather than 'franchID', which is constant. But meanwhile it will be useful to relate different tables.

In [None]:
del main_df['lgID']
del main_df['arena']
del main_df['name']

It seems that 'seeded', 'divID', and maybe 'GP' and 'confID' have the same value in every row. Let's confirm.

In [None]:
print(main_df['seeded'].unique())
print(main_df['divID'].unique())
print(main_df['GP'].unique())
print(main_df['confID'].unique())
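A compact way to run the same check over many columns at once is to count distinct values per column. The sketch below uses a made-up frame; the column names mirror the ones above, but the values (including the 'EA' conference code) are assumptions for illustration:

```python
import pandas as pd

# Illustrative frame; values are invented, 'EA' is an assumed conference code.
toy = pd.DataFrame({
    "seeded": [0, 0, 0],
    "divID":  [None, None, None],
    "GP":     [32, 32, 34],
    "confID": ["EA", "WE", "EA"],
})

# Count distinct values per column, treating missing values as a value too.
distinct = toy.nunique(dropna=False)
constant_cols = distinct[distinct <= 1].index.tolist()
print(constant_cols)   # ['seeded', 'divID']
```

Columns listed in `constant_cols` carry no information and are candidates for removal.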

'seeded' and 'divID' do. Let's drop them.

'confID' only has two values. We can encode them into 0 and 1 in order to avoid non-numeric values.

Our guess about 'GP' was incorrect, so we are going to keep it. In later years the number of games goes from 32 to 34, so in cases where 'GP' isn't defined (namely in the 11th year), we'll use the value 34.

In [None]:
del main_df['seeded']
del main_df['divID']

main_df['confID'] = main_df['confID'].apply(lambda x : 0 if x == 'WE' else 1)

Let's also encode 'playoff' into 0 and 1.

In [None]:
main_df['playoff'] = main_df['playoff'].apply(lambda x : 0 if x == 'N' else 1)

Let's also encode 'firstRound', 'semis' and 'finals'. 0 for losses and NaN, and 1 for wins.

There can be NaN values in these columns in the cases where a team doesn't make it to the playoffs at all, where these values don't make sense.

In [None]:
main_df['firstRound'] = main_df['firstRound'].apply(lambda x : 1 if x == 'W' else 0)
main_df['semis'] = main_df['semis'].apply(lambda x : 1 if x == 'W' else 0)
main_df['finals'] = main_df['finals'].apply(lambda x : 1 if x == 'W' else 0)
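The same encoding can be sketched without `apply`, using a vectorised comparison. This is only an equivalent formulation shown on a made-up series, not a change to the cell above:

```python
import pandas as pd

# Illustrative round results: win, loss, and a team that never reached the round.
rounds = pd.Series(["W", "L", None, "W"])

# True where the value is exactly 'W'; losses and missing values become False.
encoded = rounds.eq("W").astype(int)
print(encoded.tolist())   # [1, 0, 0, 1]
```

Note that this, like the original lambdas, makes a loss and "did not play this round" indistinguishable.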

In the WNBA there are two conferences, East and West. Four teams from each conference qualify for the Playoffs, where they are pitted against each other and, eventually, against the opposing conference's winning team. <br>
So the top 4 teams in each conference's ranking automatically take part in the Playoffs. We'll look further into this soon.

In [None]:
main_df.sort_values(by=["year","rank"], inplace=True)
main_df.head(24)

### Checking for outliers

When checking for outliers in the dataset, the teams dataframe will be ignored, as its outliers are simply underperforming or overperforming teams: these are legitimate results rather than errors, so we have no grounds for altering them. Their other statistics should also reflect the same under- or overperformance.

Let's define a function to check for outliers, since it will be repeated many times.

In [None]:
def check_outliers(df, q1=0.25, q3=0.75):
  numerical_columns = df.select_dtypes(include=['number']).columns
  outlier_summary = {}
  for column in numerical_columns:
    Q1 = df[column].quantile(q1)
    Q3 = df[column].quantile(q3)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

    outlier_summary[column] = {
        'Outliers_Count': len(outliers),
        'Outliers_Entries': outliers
    }

  outlier_counts = {col: details['Outliers_Count'] for col, details in outlier_summary.items()}
  print(pd.DataFrame(list(outlier_counts.items()), columns=['Column', 'Outliers']))

  for column, details in outlier_summary.items():
    if outlier_summary[column]['Outliers_Count'] > 0:
      print(f"\nOutliers in '{column}':")
      print(details['Outliers_Entries'])
      print(details['Outliers_Entries'][column])
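To make the bounds used by `check_outliers` concrete, here is a minimal worked sketch on a made-up series. The helper name `iqr_bounds` is hypothetical and only restates the bound computation from the function above:

```python
import pandas as pd

def iqr_bounds(s, q1=0.25, q3=0.75):
    """Return (lower, upper) fences: quantiles widened by 1.5 * their spread."""
    low_q, high_q = s.quantile(q1), s.quantile(q3)
    spread = high_q - low_q
    return low_q - 1.5 * spread, high_q + 1.5 * spread

points = pd.Series([1, 2, 3, 4, 5, 100])   # illustrative values only

lower, upper = iqr_bounds(points)           # quartiles 2.25 and 4.75
flagged = points[(points < lower) | (points > upper)]
print(lower, upper, flagged.tolist())       # -1.5 8.5 [100]

# Wider quantiles give much wider fences, so fewer points are flagged.
wide_lower, wide_upper = iqr_bounds(points, 0.05, 0.95)
wide_flagged = points[(points < wide_lower) | (points > wide_upper)]
print(wide_flagged.tolist())                # []
```

With the default quartiles the value 100 is flagged; with the 0.05 and 0.95 quantiles the fences are wide enough that nothing is flagged.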


We'll start by checking the players table.

In [None]:
check_outliers(players)

In the players table, there are a lot of outliers. This is because the table also holds coaches, whose height, weight, and position are not recorded. We'll need this table with all its information later on, and these fields will not be relevant for training the models or rating the players, so we keep the rows as they are.

Next, we'll check for outliers in the players_teams table, which holds the actual performance data of each player.

In [None]:
check_outliers(players_teams)

In the players_teams table, which links players to their stats, we find the same situation as in the teams table. Here, though, many players get flagged as outliers only because one of their specific stats falls outside the usual range. This is normal in a game such as basketball, where roles differ by position: a center will naturally collect far more rebounds than a point guard, for example.

So, instead, we'll check for extreme outliers.

In [None]:
check_outliers(players_teams, 0.05, 0.95)

Even then, no value looks like anything other than a very good or a very bad performance.

Finally, let's check for outliers in the coaches table. The other tables won't be checked since they all relate to the teams and, as stated before, those simply contain bad or great performances, nothing unusual.

In [None]:
check_outliers(coaches)

The outliers likely come from the coaches of the same teams that had bad or good performances in the teams table. The values add up correctly, nothing unusual.

### Feature Extraction

In this section, we will check for highly correlated features, which add redundancy and can hurt model accuracy during training. For this purpose, we'll define a function to check how our changes affect the correlations.

In [None]:
def check_correlations(table):
  corr_table = table.copy()

  del corr_table['franchID']
  del corr_table['tmID']

  corr_table = corr_table.drop(corr_table[(corr_table['label'].isna()) | (corr_table['label']=="None")].index)

  cols = list(corr_table.columns)

  plot_cols = []

  for i, col1 in enumerate(cols):
      if col1 == 'label':
          continue
      for col2 in cols[i::]:
          if col1 == col2 or col2 == 'label':
              continue
          if math.fabs(corr_table[col1].corr(corr_table[col2])) > 0.85:
              plot_cols.append([col1,col2])

  # Plot the cols with high correlation

  plt.figure(figsize=(25,25))
  plt.subplots_adjust(left=0.1,bottom=0.1,right=0.9,top=0.9,wspace=0.4,hspace=0.4)

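  # Plot each highly correlated pair, using the same rows the correlations
  # were computed on (at most 64 panels fit in the 8x8 grid)
  for i, pair in enumerate(plot_cols):
      plt.subplot(8,8,i+1)
      plt.scatter(corr_table[pair[0]],corr_table[pair[1]],s=10,c='red',alpha=0.4)
      plt.xlabel(f"{pair[0]}",fontsize=12)
      plt.ylabel(f"{pair[1]}",fontsize=12)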
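For a quick textual view of the same information, the pairs above a threshold can be read off the correlation matrix directly. The helper `high_corr_pairs` below is a hypothetical sketch, demonstrated on made-up columns, and is not part of the notebook's pipeline:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.85):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (a column with itself) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold]

# Illustrative data: 'b' is exactly 2 * 'a'; 'c' is only weakly related.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 1, 4, 2, 3]})
result = high_corr_pairs(toy)
print(result)
```

Only the pair (`a`, `b`) is reported here, since their correlation is 1 while `a` and `c` correlate at about -0.3.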

With this information, we'll extract new features, get rid of useless ones, and reduce the size of the dataset.

In [None]:
check_correlations(main_df)

As speculated before, the teams with ranks 1 to 4 are always in the Playoffs, which causes a high correlation between the two features. We can therefore get rid of the "playoff" column: the "rank" column gives us the same information, with the added benefit of telling us which teams did better than others.

In [None]:
del main_df["playoff"]

Usually 32 games are played per year, but sometimes more. To avoid losing relevant information while reducing the number of correlated columns, the "won" column will be replaced with the fraction of games won out of games played. Both the "lost" and "GP" columns will then be deleted, as they will no longer be needed.

In [None]:
main_df['won'] = main_df['won']/main_df['GP']
del main_df['lost']
del main_df['GP']

Looking at "homeW", "homeL", "awayW" and "awayL": these are related to one another (so "homeW" could once again be turned into a proportion), and we also consider them of little use, on the assumption that where a game is played has little bearing on who wins it.

In [None]:
del main_df['homeW']
del main_df['homeL']
del main_df['awayW']
del main_df['awayL']

Another case similar to the above is "confW" and "confL", which hold the wins and losses a team has had within its conference. However, we cannot simply get rid of this information. Playoff spots are awarded per conference, so this feature tells us how a team fares against the teams it directly competes with for a place in the playoffs.

The features "confW" and "confL", however, are highly correlated, so we decided to replace them with a win percentage.

In [None]:
main_df['confW'] = main_df['confW'] / (main_df['confW'] + main_df['confL'])
del main_df['confL']

Let's check the correlations again, after the initial phase of feature extraction.

In [None]:
check_correlations(main_df)

Big improvement, but we're just getting started.

Let's look towards the frequent "m" and "a" features. These features are shorthand for "made" and "attempted", so, for example, "3pm" and "3pa" are three pointers made and three pointers attempted.

These two are always highly correlated, because every made basket also counts as an attempt, so we'll follow the same pattern and replace them with a success rate. We can't get rid of the attempts, however, since the rate means little by itself. Even so, this removes a lot of correlated features.

In [None]:
main_df['o_3prate'] = main_df['o_3pm'] / main_df['o_3pa']
main_df['d_3prate'] = main_df['d_3pm'] / main_df['d_3pa']
main_df['o_ftrate'] = main_df['o_ftm'] / main_df['o_fta']
main_df['d_ftrate'] = main_df['d_ftm'] / main_df['d_fta']
main_df['o_fgrate'] = main_df['o_fgm'] / main_df['o_fga']
main_df['d_fgrate'] = main_df['d_fgm'] / main_df['d_fga']

del main_df['o_3pm']
del main_df['d_3pm']
del main_df['o_ftm']
del main_df['d_ftm']
del main_df['o_fgm']
del main_df['d_fgm']
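The made/attempted pattern repeats for every shot type, so it can also be expressed as a loop. The following is a sketch on a made-up frame that follows the same column naming; the numbers are illustrative:

```python
import pandas as pd

# Illustrative totals for two team-seasons (values are invented).
toy = pd.DataFrame({
    "o_3pm": [50, 60],  "o_3pa": [150, 200],
    "o_ftm": [80, 90],  "o_fta": [100, 120],
})

for stat in ["3p", "ft"]:
    # Success rate = made / attempted; the attempts column is kept.
    toy[f"o_{stat}rate"] = toy[f"o_{stat}m"] / toy[f"o_{stat}a"]
    toy = toy.drop(columns=f"o_{stat}m")

print(toy)
```

After the loop only the attempts and the rates remain, which is the same shape the cell above produces for the real columns.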

Next are "component features": there are many cases of features which are portions of others. Naturally, they'll be highly correlated, following the same logic of "increasing one increases the other".

One such case is "o_oreb", "o_dreb", and "o_reb", and the equivalent "d_" columns. Here we simply get rid of the feature which is the sum. We lose no information by doing so since, as stated, the portions already carry it; we are only decorrelating the dataset and reducing its size.

In [None]:
del main_df['o_reb']
del main_df['d_reb']
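Before dropping a total, it can be worth confirming that it really is the sum of its components. A minimal sketch of such a check, on made-up numbers (whether the identity holds for every row of the real table is not verified here):

```python
import pandas as pd

# Illustrative rebound counts; column names follow the teams table.
toy = pd.DataFrame({
    "o_oreb": [300, 320],
    "o_dreb": [700, 690],
    "o_reb":  [1000, 1010],
})

# True only if offensive + defensive rebounds equal the total in every row.
is_sum = bool((toy["o_oreb"] + toy["o_dreb"]).eq(toy["o_reb"]).all())
print(is_sum)   # True
```

If the check fails on the real data, the total would carry information that the components do not, and dropping it would no longer be lossless.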

The same is true for "d_pts" and "o_pts": they're reflected by all the features for free throws, three-pointers, etc.

In [None]:
del main_df['d_pts']
del main_df['o_pts']

In [None]:
check_correlations(main_df)

The remaining highly correlated features don't leave much room for manipulation. Previously it was stated that conference wins and rank are related, but we feel some information would be lost by deleting either.

Then we have features which, intuitively, shouldn't be correlated, such as "d_fga" and "o_fga". This probably happens because a team spends roughly half of the game defending and the other half attacking. It's not something we can remove without losing information.

### Dealing with awards

Let's make the awards_players table cumulative. For this purpose, we will build a new table called "awards" that holds the total number of awards a player has received up to that year (inclusive).

This table contains the awards of both the coaches and the players. In this notebook we'll use it for the coaches; the players' awards are explored in another notebook.

In [None]:
awards = awards_players.groupby(['playerID', 'year'])['award'].count().reset_index()

years = range(1, 12)

player_list = players['playerID'].unique()

temp = pd.MultiIndex.from_product([player_list, years], names=['playerID', 'year']).to_frame(index=False)

awards = temp.merge(awards, on=['playerID', 'year'], how='left')

awards['award'] = awards['award'].fillna(0)

awards['cumulative_awards'] = awards.groupby('playerID')['award'].cumsum()
del awards['award']

awards['cumulative_awards'] = awards['cumulative_awards'].astype(int)

awards.head(25)
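To see what the cumulative table looks like, here is the same sequence of steps on a tiny made-up awards list (the IDs, years and award names are illustrative):

```python
import pandas as pd

# Illustrative awards: p1 wins in years 1 and 3, p2 wins in year 2.
awards_toy = pd.DataFrame({
    "playerID": ["p1", "p1", "p2"],
    "year":     [1, 3, 2],
    "award":    ["MVP", "MVP", "Coach of the Year"],
})

per_year = awards_toy.groupby(["playerID", "year"])["award"].count().reset_index()

# Full (player, year) grid so that years without awards are present as zeros.
grid = pd.MultiIndex.from_product(
    [["p1", "p2"], range(1, 4)], names=["playerID", "year"]
).to_frame(index=False)

out = grid.merge(per_year, on=["playerID", "year"], how="left")
out["award"] = out["award"].fillna(0)
out["cumulative_awards"] = out.groupby("playerID")["award"].cumsum().astype(int)
print(out[["playerID", "year", "cumulative_awards"]])
```

For `p1` the running total is 1, 1, 2 over years 1 to 3, and for `p2` it is 0, 1, 1, so an award counts from the year it was received onward.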


## Expanding the coaches
The coaches are an integral part of the team, and this dataset has much information on them, so we'll prepare them to be added to the main dataframe.

Let's start by adding the awards to the "coaches" table.

In [None]:
awards = awards.rename(columns={'playerID': 'coachID'})
coaches = coaches.merge(awards, on=['coachID', 'year'], how='left')

In [None]:
coaches = coaches.sort_values(by=['coachID', 'year'])  # sort_values returns a new frame, so keep the result
coaches.head(200)

Let's determine, for each coachID and year, the total wins and losses up to that year. We'll keep it in a percentage in order to avoid high correlations.

In [None]:
coach_record = coaches.groupby(['coachID', 'year'])[['won', 'lost']].sum().reset_index()

coach_record['cumulative_won'] = coach_record.groupby('coachID')['won'].cumsum()
coach_record['cumulative_lost'] = coach_record.groupby('coachID')['lost'].cumsum()

coach_record['cumulative_won_rate'] = coach_record['cumulative_won'] / (coach_record['cumulative_won'] + coach_record['cumulative_lost'])
coach_record['cumulative_total_games'] = coach_record['cumulative_won'] + coach_record['cumulative_lost']

coach_record['games_played'] = coach_record['won'] + coach_record['lost']

del coach_record['won']
del coach_record['lost']
del coach_record['cumulative_won']
del coach_record['cumulative_lost']

# Add teamId back again

coach_record = coach_record.merge(coaches[['coachID', 'tmID', 'year', 'cumulative_awards']], on=['coachID','year'], how='left')

# Assume the coaches of the last year played in all games
coach_record.loc[(coach_record['year'] == 11) & (coach_record['games_played'] == 0), 'games_played'] = 34
coach_record.loc[(coach_record['year'] == 11) & (coach_record['games_played'] == 34), 'cumulative_total_games'] += 34

coach_record.head(200)
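The cumulative win rate can be checked by hand on a small example. The sketch below uses made-up records for two hypothetical coaches and shorter column names than the cell above:

```python
import pandas as pd

# Illustrative season records (invented numbers).
rec = pd.DataFrame({
    "coachID": ["c1", "c1", "c2"],
    "year":    [1, 2, 1],
    "won":     [20, 10, 8],
    "lost":    [12, 24, 24],
})

by_coach = rec.groupby("coachID")
rec["cum_won"] = by_coach["won"].cumsum()
rec["cum_games"] = by_coach["won"].cumsum() + by_coach["lost"].cumsum()
rec["cum_rate"] = rec["cum_won"] / rec["cum_games"]
print(rec[["coachID", "year", "cum_games", "cum_rate"]])
```

Coach `c1` goes from 20/32 = 0.625 after the first season to 30/66 (about 0.455) after the second, which shows how one poor season pulls the career rate down.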

Let's relate coaches to respective team in each year. Keep in mind that, since we're using the data from year X to predict year X+1, it only makes sense that we place the coach of year X+1 in year X with the information of that coach in year X, since the coach of year X+1 could be different.

However, we first need to guarantee that, when we use the information of a coach in the previous year, that coach's information actually exists.

We want to handle cases in which, for example, a coach is active in year 6 and only appears again in year 11. In cases like this, the coach's information from year 6 should be used, so we are going to "stretch" the last information we have on a coach until they appear once again.

In [None]:
coach_record_copy = coach_record.copy()
coach_record_copy.sort_values(by=['coachID', 'year'], inplace=True)
coach_record_copy.reset_index(drop=True, inplace=True)

In [None]:
for i in coach_record_copy.index:
  if i == len(coach_record_copy) - 1:
    coach = coach_record_copy.loc[i]
    for year in range(coach['year'] + 1, 11):
      coach_copy = coach.copy()
      coach_copy['year'] = year
      coach_copy['tmID'] = None
      coach_record_copy.loc[len(coach_record_copy.index)] = coach_copy
    break
  coach1 = coach_record_copy.loc[i]
  coach2 = coach_record_copy.loc[i+1]
  if coach1['coachID'] == coach2['coachID'] and coach1['year'] != coach2['year'] - 1:
    for year in range(coach1['year'] + 1, coach2['year']):
      coach_copy = coach1.copy()
      coach_copy['year'] = year
      coach_copy['tmID'] = None
      coach_record_copy.loc[len(coach_record_copy.index)] = coach_copy
  if coach1['coachID'] != coach2['coachID'] and coach1['year'] < 10:
    for year in range(coach1['year'] + 1, 11):
      coach_copy = coach1.copy()
      coach_copy['year'] = year
      coach_copy['tmID'] = None
      coach_record_copy.loc[len(coach_record_copy.index)] = coach_copy
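The "stretching" idea can also be sketched with a reindex and forward fill per coach. This is an alternative formulation under stated assumptions, not the loop above: the helper `stretch`, the column `rate` and the data are made up, and the last filled year is assumed to be 10, as in the loop:

```python
import pandas as pd

# Illustrative records: c1 is missing years 7, 8 and 10; c2 is missing 9 and 10.
rec = pd.DataFrame({
    "coachID": ["c1", "c1", "c2"],
    "year":    [6, 9, 8],
    "tmID":    ["T1", "T2", "T3"],
    "rate":    [0.5, 0.6, 0.4],
})
last_year = 10  # assumed final year to fill up to

def stretch(group):
    """Add the missing years for one coach, carrying the last known stats forward."""
    years = pd.Index(range(group["year"].min(), last_year + 1), name="year")
    out = group.set_index("year").reindex(years)
    # Forward-fill the coach's identity and stats, but leave tmID empty on the
    # synthetic rows, since the coach had no team in those years.
    out[["coachID", "rate"]] = out[["coachID", "rate"]].ffill()
    return out.reset_index()

filled = pd.concat([stretch(g) for _, g in rec.groupby("coachID")],
                   ignore_index=True)
print(filled)
```

Each coach ends up with one row per year from their first appearance to year 10, and the rows that were added have no team.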

In [None]:
coach_record_copy.sort_values(by=['coachID', 'year'], inplace=True)
coach_record_copy.reset_index(drop=True, inplace=True)

In [None]:
coach_record = coach_record_copy

Now, let's manipulate the coaches to act as stated above, with the information of year X for the coach of year X+1.

In [None]:
coach_record = coach_record.sort_values(by=['tmID', 'year'])

# We need to shift values up so that each year corresponds to the following year's coach, but this will mess up in cases with multiple coaches, so we will group multiple entries
# But we will save the original DF so we can then get the stats of a coach without being aggregated to another coach
coach_save = coach_record.copy()
del coach_save['tmID'] # we don't care what team the coach has, only what stats they have in a year
coach_record = coach_record.groupby(['tmID', 'year'], as_index=False).agg({'coachID': ','.join})

# For the games_played we'll want the values of year X+1, not year X, so we'll get rid of those columns on coach_save and save them separately and reduce the year by 1 in every entry
games_played = coach_save[['coachID', 'year', 'games_played']].copy() # explicit copy, so that editing it does not touch coach_save
games_played['year'] -= 1
del coach_save['games_played']

coach_record['next_team'] = coach_record['tmID'].shift(-1) # sanity check column, won't be kept
coach_record['next_coach'] = coach_record['coachID'].shift(-1)

# Apply conditions to ensure data only applies if the team remains the same
# I.e. a team doesn't get another team's coach info due to the shift
coach_record['next_coach'] = coach_record['next_coach'].where(coach_record['next_team'] == coach_record['tmID'])
coach_record = coach_record.drop('next_team', axis=1)

# This makes sure that a coach in year X has the info of that coach in year X, regardless of their team at the time
# Now let's merge the info back into the dataframe
# We must deaggregate the coach_record next_coach when there are two coaches so we can find their stats
coach_record['next_coach'] = coach_record['next_coach'].str.split(',')
coach_record = coach_record.explode('next_coach').reset_index(drop=True)
coach_save.rename(columns = {'coachID': 'next_coach'}, inplace=True)
coach_record = coach_record.merge(coach_save, on=['next_coach', 'year'], how='left')
del coach_record['coachID']
coach_record.rename(columns={'next_coach':'coachID', 'cumulative_won_rate_next': 'cumulative_won_rate', 'cumulative_total_games_next': 'cumulative_total_games', 'cumulative_awards_next': 'cumulative_awards'}, inplace=True)

# Now let's begin the weighted average
coach_record = coach_record.merge(games_played, on=['coachID', 'year'], how='left')
coach_record['cumulative_won_rate'] = coach_record['cumulative_won_rate'] * coach_record['games_played']
coach_record['cumulative_total_games'] = coach_record['cumulative_total_games'] * coach_record['games_played']
coach_record['cumulative_awards'] = coach_record['cumulative_awards'] * coach_record['games_played']

# Now we can reaggregate knowing that the stats correspond to the coach in that year and the coach corresponds to the coach of the next year
coach_record = coach_record.groupby(['tmID', 'year']).sum().reset_index()

# We can divide the values by games played now and we get the weighted average for each team and we will no longer have repeated entries per coach
coach_record['cumulative_won_rate'] = coach_record['cumulative_won_rate'] / coach_record['games_played']
coach_record['cumulative_total_games'] = coach_record['cumulative_total_games'] / coach_record['games_played']
coach_record['cumulative_awards'] = coach_record['cumulative_awards'] / coach_record['games_played']

coach_record.head(200)



Unnamed: 0,tmID,year,coachID,cumulative_won_rate,cumulative_total_games,cumulative_awards,games_played
0,ATL,9,meadoma99w,0.117647,34.000000,0.0,34.0
1,ATL,10,meadoma99w,0.323529,68.000000,1.0,34.0
2,ATL,11,0,,,,0.0
3,CHA,1,donovan99w,0.281250,32.000000,0.0,32.0
4,CHA,2,donovan99w,0.421875,64.000000,0.0,32.0
...,...,...,...,...,...,...,...
149,WAS,7,adubari99wrollitr01w,0.063771,25.176471,0.0,34.0
150,WAS,8,kenlaje99wrollitr01w,0.345098,19.411765,0.0,34.0
151,WAS,9,plankju99w,0.000000,0.000000,0.0,34.0
152,WAS,10,laceytr99w,0.402174,92.000000,0.0,34.0
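The games-weighted average used when a team has more than one coach in a season can be illustrated on a made-up case (team codes, rates and game counts are invented for the example):

```python
import pandas as pd

# Illustrative stints: team T1 had two coaches in year 7, team T2 had one.
stints = pd.DataFrame({
    "tmID":         ["T1", "T1", "T2"],
    "year":         [7, 7, 7],
    "win_rate":     [0.6, 0.3, 0.5],
    "games_played": [24, 10, 34],
})

# Weight each coach's rate by the number of games they coached...
stints["weighted"] = stints["win_rate"] * stints["games_played"]
totals = stints.groupby(["tmID", "year"])[["weighted", "games_played"]].sum()
# ...and divide by the team's total games to get the weighted average.
totals["team_win_rate"] = totals["weighted"] / totals["games_played"]
print(totals)
```

For T1 this gives (0.6 * 24 + 0.3 * 10) / 34, about 0.512, so the coach who led more games has more influence on the team-level value.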


Now we can merge.

In [None]:
main_df = main_df.merge(coach_record, on=['tmID', 'year'], how='left')
main_df.head(200)

Unnamed: 0,year,tmID,franchID,confID,rank,firstRound,semis,finals,o_fga,o_fta,...,d_3prate,o_ftrate,d_ftrate,o_fgrate,d_fgrate,coachID,cumulative_won_rate,cumulative_total_games,cumulative_awards,games_played
0,1,LAS,LAS,0,1.0,1,0,0,1956.0,693.0,...,0.295400,0.786436,0.715318,0.440184,0.395313,coopemi01w,0.87500,32.0,1.0,32.0
1,1,NYL,NYL,1,1.0,1,1,0,1815.0,567.0,...,0.276190,0.756614,0.740678,0.436364,0.406696,adubari99w,0.62500,32.0,0.0,32.0
2,1,CLE,CLE,1,2.0,1,0,0,1828.0,570.0,...,0.323370,0.747368,0.780446,0.442560,0.439523,hugheda99w,0.53125,32.0,0.0,32.0
3,1,HOU,HOU,0,2.0,1,1,1,1894.0,634.0,...,0.296651,0.821767,0.728346,0.470433,0.405573,chancva99w,0.84375,32.0,0.0,32.0
4,1,ORL,CON,1,3.0,0,0,0,1911.0,546.0,...,0.340000,0.727106,0.763636,0.435897,0.433299,peckca99wc,0.50000,32.0,0.0,32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,11,NYL,NYL,1,,0,0,0,,,...,,,,,,0,,,,0.0
150,11,PHO,PHO,0,,0,0,0,,,...,,,,,,0,,,,0.0
151,11,SAS,SAS,0,,0,0,0,,,...,,,,,,0,,,,0.0
152,11,SEA,SEA,0,,0,0,0,,,...,,,,,,0,,,,0.0


Let's change the column names to reflect that they belong to the coaches, so that they do not get lost in the main dataframe.

In [None]:
main_df.rename(columns={"cumulative_won_rate": "coach_won_rate", "cumulative_total_games": "coach_total_games", "cumulative_awards": "cumulative_coach_awards", "games_played" : "coach_games_played"}, inplace=True)

main_df.head(200)

Unnamed: 0,year,tmID,franchID,confID,rank,firstRound,semis,finals,o_fga,o_fta,...,d_3prate,o_ftrate,d_ftrate,o_fgrate,d_fgrate,coachID,coach_won_rate,coach_total_games,cumulative_coach_awards,coach_games_played
0,1,LAS,LAS,0,1.0,1,0,0,1956.0,693.0,...,0.295400,0.786436,0.715318,0.440184,0.395313,coopemi01w,0.87500,32.0,1.0,32.0
1,1,NYL,NYL,1,1.0,1,1,0,1815.0,567.0,...,0.276190,0.756614,0.740678,0.436364,0.406696,adubari99w,0.62500,32.0,0.0,32.0
2,1,CLE,CLE,1,2.0,1,0,0,1828.0,570.0,...,0.323370,0.747368,0.780446,0.442560,0.439523,hugheda99w,0.53125,32.0,0.0,32.0
3,1,HOU,HOU,0,2.0,1,1,1,1894.0,634.0,...,0.296651,0.821767,0.728346,0.470433,0.405573,chancva99w,0.84375,32.0,0.0,32.0
4,1,ORL,CON,1,3.0,0,0,0,1911.0,546.0,...,0.340000,0.727106,0.763636,0.435897,0.433299,peckca99wc,0.50000,32.0,0.0,32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,11,NYL,NYL,1,,0,0,0,,,...,,,,,,0,,,,0.0
150,11,PHO,PHO,0,,0,0,0,,,...,,,,,,0,,,,0.0
151,11,SAS,SAS,0,,0,0,0,,,...,,,,,,0,,,,0.0
152,11,SEA,SEA,0,,0,0,0,,,...,,,,,,0,,,,0.0


The coach's ID is no longer needed; it only served for bookkeeping.

In [None]:
del main_df['coachID']

## New features based on previous years

Let's add the win percentage from each of the 5 previous years (one feature per year). If a value is NaN, we replace it with the average win percentage.

In [None]:
# Calculate the average win percentage
avg_win_percentage = main_df['won'].mean()

# Loop through the 5 previous years
for i in range(1, 6):
    # Shift the 'won' column by i years
    main_df[f'won_percentage_year_{i}'] = main_df.groupby('franchID')['won'].shift(i)

    # Fill NaN values with the average win percentage
    main_df[f'won_percentage_year_{i}'] = main_df[f'won_percentage_year_{i}'].fillna(avg_win_percentage)
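The lag features can be traced on a small made-up table. The sketch below uses two lags instead of five, shorter column names, and rows ordered by year, as in `main_df`:

```python
import pandas as pd

# Illustrative win rates for two made-up franchises, ordered by year.
toy = pd.DataFrame({
    "franchID": ["AAA", "BBB", "AAA", "BBB", "AAA"],
    "year":     [1, 1, 2, 2, 3],
    "won":      [0.25, 0.75, 0.5, 0.5, 0.625],
})

fallback = toy["won"].mean()   # 0.525 for this toy data

for lag in (1, 2):
    # Value of 'won' from `lag` seasons earlier within the same franchise;
    # seasons with no such history fall back to the overall mean.
    toy[f"won_lag_{lag}"] = toy.groupby("franchID")["won"].shift(lag).fillna(fallback)

print(toy)
```

In year 3, franchise AAA gets 0.5 as its one-year lag and 0.25 as its two-year lag, while every row without enough history receives the fallback value.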

In [None]:
main_df.loc[main_df['franchID'] == 'DET'][['won', 'won_percentage_year_1', 'won_percentage_year_2', 'won_percentage_year_3', 'won_percentage_year_4', 'won_percentage_year_5']]

Unnamed: 0,won,won_percentage_year_1,won_percentage_year_2,won_percentage_year_3,won_percentage_year_4,won_percentage_year_5
8,0.4375,0.5,0.5,0.5,0.5,0.5
28,0.3125,0.4375,0.5,0.5,0.5,0.5
46,0.28125,0.3125,0.4375,0.5,0.5,0.5
48,0.735294,0.28125,0.3125,0.4375,0.5,0.5
66,0.5,0.735294,0.28125,0.3125,0.4375,0.5
81,0.470588,0.5,0.735294,0.28125,0.3125,0.4375
90,0.676471,0.470588,0.5,0.735294,0.28125,0.3125
102,0.705882,0.676471,0.470588,0.5,0.735294,0.28125
115,0.647059,0.705882,0.676471,0.470588,0.5,0.735294
133,0.529412,0.647059,0.705882,0.676471,0.470588,0.5


In [None]:
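# Rolling mean of the win rate over the current season and the two before it (fewer when the franchise has less history)
main_df['avg_win_last_3_years'] = main_df.groupby('franchID')['won'].rolling(window=3, min_periods=1).mean().reset_index(0, drop=True)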

avg_win_percentage = main_df['won'].mean()
main_df['avg_win_last_3_years'] = main_df['avg_win_last_3_years'].fillna(avg_win_percentage)
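It may help to see exactly which seasons enter the rolling window. The sketch below, on made-up values, contrasts a window that includes the current season with a hypothetical variant that only looks at earlier seasons; the variant is shown for comparison and is not what the cell above computes:

```python
import pandas as pd

# Illustrative win rates for one made-up franchise over four seasons.
toy = pd.DataFrame({"franchID": ["AAA"] * 4,
                    "won": [0.25, 0.5, 0.75, 1.0]})

# Window ending at the current season (same construction as the cell above).
current = (toy.groupby("franchID")["won"]
              .rolling(window=3, min_periods=1).mean()
              .reset_index(level=0, drop=True))

# Variant: shift first, so only strictly earlier seasons are averaged.
previous = toy.groupby("franchID")["won"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(current.tolist())
print(previous.tolist())
```

The first series starts at the season's own value (0.25), whereas the second is missing for the first season because there is no earlier season to average.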

In [None]:
main_df.head()

Unnamed: 0,year,tmID,franchID,confID,rank,firstRound,semis,finals,o_fga,o_fta,...,coach_won_rate,coach_total_games,cumulative_coach_awards,coach_games_played,won_percentage_year_1,won_percentage_year_2,won_percentage_year_3,won_percentage_year_4,won_percentage_year_5,avg_win_last_3_years
0,1,LAS,LAS,0,1.0,1,0,0,1956.0,693.0,...,0.875,32.0,1.0,32.0,0.5,0.5,0.5,0.5,0.5,0.875
1,1,NYL,NYL,1,1.0,1,1,0,1815.0,567.0,...,0.625,32.0,0.0,32.0,0.5,0.5,0.5,0.5,0.5,0.625
2,1,CLE,CLE,1,2.0,1,0,0,1828.0,570.0,...,0.53125,32.0,0.0,32.0,0.5,0.5,0.5,0.5,0.5,0.53125
3,1,HOU,HOU,0,2.0,1,1,1,1894.0,634.0,...,0.84375,32.0,0.0,32.0,0.5,0.5,0.5,0.5,0.5,0.84375
4,1,ORL,CON,1,3.0,0,0,0,1911.0,546.0,...,0.5,32.0,0.0,32.0,0.5,0.5,0.5,0.5,0.5,0.5


Last but not least, with the future model evaluation methods in mind, it's best that the ```label``` is a numerical value (0/1) rather than Y/N. Let's make that change.

In [None]:
main_df['label'] = main_df['label'].apply(lambda x: 1 if x == 'Y' else (0 if x == 'N' else float('nan')))
main_df.head(200)

Unnamed: 0,year,tmID,franchID,confID,rank,firstRound,semis,finals,o_fga,o_fta,...,o_3prate,d_3prate,o_ftrate,d_ftrate,o_fgrate,d_fgrate,coach_won_rate,coach_total_games,cumulative_coach_awards,coach_games_played
0,1,LAS,LAS,0,1.0,1,0,0,1956.0,693.0,...,0.331858,0.295400,0.786436,0.715318,0.440184,0.395313,0.87500,32.0,1.0,32.0
1,1,NYL,NYL,1,1.0,1,1,0,1815.0,567.0,...,0.340909,0.276190,0.756614,0.740678,0.436364,0.406696,0.62500,32.0,0.0,32.0
2,1,CLE,CLE,1,2.0,1,0,0,1828.0,570.0,...,0.346437,0.323370,0.747368,0.780446,0.442560,0.439523,0.53125,32.0,0.0,32.0
3,1,HOU,HOU,0,2.0,1,1,1,1894.0,634.0,...,0.350305,0.296651,0.821767,0.728346,0.470433,0.405573,0.84375,32.0,0.0,32.0
4,1,ORL,CON,1,3.0,0,0,0,1911.0,546.0,...,0.341981,0.340000,0.727106,0.763636,0.435897,0.433299,0.50000,32.0,0.0,32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,11,NYL,NYL,1,,0,0,0,,,...,,,,,,,,,,0.0
150,11,PHO,PHO,0,,0,0,0,,,...,,,,,,,,,,0.0
151,11,SAS,SAS,0,,0,0,0,,,...,,,,,,,,,,0.0
152,11,SEA,SEA,0,,0,0,0,,,...,,,,,,,,,,0.0


## Exporting the dataset

Let's now export the processed dataset.

In [None]:
main_df.to_csv("/content/drive/Shareddrives/ML 2024/tables/teams&coaches.csv")
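One detail worth knowing when the exported file is read back: `to_csv` writes the row index by default, and it reappears as an `Unnamed: 0` column on reload. A minimal sketch of the difference, using an in-memory buffer and a made-up frame instead of the Drive path:

```python
import io
import pandas as pd

toy = pd.DataFrame({"year": [1, 2], "won": [0.5, 0.625]})   # illustrative data

# Default export keeps the index, which comes back as an unnamed column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# Passing index=False writes only the data columns.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))      # ['Unnamed: 0', 'year', 'won']
print(list(without_index.columns))   # ['year', 'won']
```

Whether to drop the index on export depends on whether downstream notebooks expect that extra column.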