## Extracting player awards - **NEED FIXING**
This notebook extracts awards and notable accomplishments for each college player.  
These are scraped from the 'Awards' section of each player's profile.  
For this to work, you first need an up-to-date `cbb_player` table, since the code uses the `player_id` to look up each profile.  

16-08-22: Added clean up of player_award table, adding new columns

In [2]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
from lxml import etree, html
import os
import numpy as np
from datetime import datetime
from sportsipy.ncaab.teams import Teams
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 200)

**Get awards for each player**  
First, load all the player IDs from the `cbb_player` table.

In [27]:
cbb_player = pd.read_csv('02_database/cbb_player.csv') 

In [273]:
url_root = 'https://www.sports-reference.com/cbb/players/'

# list of all player IDs to loop through
player_id = cbb_player.player_id.tolist()

# empty list to hold results
# each entry has the form player_id::award::award category on website
awards = []
print(datetime.now().strftime("%d/%m/%Y %H:%M:%S"))
for id_ in player_id:
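    url = url_root+id_+'.html'
    response = requests.get(url)
    # strip comment markers so that markup inside HTML comments is parsed too
    # (named page_html so the imported lxml `html` module is not shadowed)
    page_html = response.text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(page_html, 'lxml')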
    leaderboard = soup.find_all('div', {'id':re.compile('leaderboard_')})
    
    # check if player has awards
    if len(leaderboard)>0:
        for l in leaderboard[1:]:
            for i in l.find_all('td',{'class':'single'}):
                awards.append(id_ + '::' +i.text +'::' + l.find('caption',{'class':'poptip'}).text)
    else:
        awards.append(id_ + '::' +'no_awards' +'::' +'no_awards')

# save raw file to csv        
pd.Series(awards).to_csv('01_raw_csv_files/cbb_awards/cbb_awards_'+ datetime.now().strftime("%d%m%Y") + '.csv',index=False)

19/08/2022 13:05:30
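
The loop above issues one request per player with no timeout, status check, or pause between requests. A minimal sketch of a more defensive fetch step is shown here. `fetch_player_html`, its `pause` and `timeout` parameters, and the choice to return `None` on a non-200 response are illustrative assumptions, not part of the original notebook. The `session` argument is assumed to be a `requests.Session` or any object with a compatible `get` method.

```python
import time

URL_ROOT = 'https://www.sports-reference.com/cbb/players/'


def fetch_player_html(player_id, session, pause=3.0, timeout=30):
    """Return one player's profile HTML with comment markers stripped, or None on failure.

    Hypothetical helper: `session` is assumed to behave like requests.Session.
    """
    url = f'{URL_ROOT}{player_id}.html'
    response = session.get(url, timeout=timeout)
    time.sleep(pause)  # wait between requests so the site is not hit too quickly
    if response.status_code != 200:
        return None
    # same comment-stripping step as the original loop
    return response.text.replace('<!--', '').replace('-->', '')
```

A caller could collect the IDs for which this returns `None` and retry them later, rather than recording those players as `no_awards`.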


**UPDATED AWARDS CLEANING: 19-08-22**  
Awards were extracted from each player's profile.  
The awards section of the profile has two parts: awards and notable statistical rankings.  
The initial script extracted both parts, but I will only be using the awards part.  
The statistical rankings are difficult to clean up because of the way they are stored in the HTML, and doing so would take a long time.  
I also already have the players' complete stats, so I can rank the players myself if needed.  

*Awards*  
After trying various regex methods to split out the relevant information on each award, I concluded that the most straightforward way is a simple 'if else' approach: if the award name contains some specific text, give it a label.  
* award_name: For a national award, this is just the name of the award. For an award for making a team, it also includes the team number.
* award_type: A flag indicating whether the award is individual (given across all conferences) or given within each conference.
* award_conf: The conference in which the award is given out
* award_year: The year/season in which the award was given out
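
As a rough illustration of how one raw record breaks down into these fields, the sketch below splits a single `player_id::award::category` string. `parse_award_record` and the key names `award_full` and `award_text` are hypothetical. It assumes the year/season is the first token of the award text and that it may be separated from the rest by a non-breaking space, as the `\xa0` replacement in the cleaning code suggests.

```python
def parse_award_record(record):
    """Split one 'player_id::award::category' string into labelled fields.

    Hypothetical helper; assumes the award text starts with the year/season.
    """
    player_id, award, category = record.split('::')
    # normalise non-breaking spaces before splitting off the year
    award = award.replace('\xa0', ' ')
    award_year, _, award_text = award.partition(' ')
    return {
        'player_id': player_id,
        'award_full': award,
        'award_year': award_year,
        'award_text': award_text,
        'category': category,
    }


example = parse_award_record('aaron-johnson-2::2010-11\xa0CUSA Player of the Year::Awards')
print(example['award_year'], '|', example['award_text'])
# 2010-11 | CUSA Player of the Year
```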

In [68]:
# read csv file if necessary
awards0 = pd.read_csv('01_raw_csv_files/cbb_awards/cbb_awards_14082022.csv')
awards1 = pd.read_csv('01_raw_csv_files/cbb_awards/cbb_awards_19082022.csv')
awards = pd.concat([awards0,awards1]).drop_duplicates()['0'].tolist()

df_all_raw = pd.Series(awards).str.split('::',expand=True)
df_all = df_all_raw.copy()

# stat ranking - you can still extract stat ranking, but I'm not doing this at the moment

In [69]:
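df_all = df_all[df_all[2]=='Awards'].copy() # keep only rows from the 'Awards' section; .copy() avoids SettingWithCopyWarning
# adding new columns to be populated
df_all['award_name'] = np.nan
# df_all['award_type'] = np.nan
df_all['award_conf'] = np.nan

# award_sub_name: text after the first regular space; award_year: leading year/season
# note: award_year normalises non-breaking spaces (\xa0) before splitting, award_sub_name does not
df_all['award_sub_name'] = df_all[1].str.split(' ',expand = True, n=1)[1]
df_all['award_year'] = df_all[1].str.replace(u'\xa0',u' ').str.split(' ',expand = True, n=1)[0]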

In [70]:
# clean up final DF
df_all.rename(columns = {0:'player_id',1:'award_orig_name'},inplace = True)
cbb_player_awards = df_all.loc[:,['player_id','award_orig_name','award_sub_name','award_conf','award_year']]
cbb_player_awards.head(3)

Unnamed: 0,player_id,award_orig_name,award_sub_name,award_conf,award_year
1146,aaron-johnson-2,2010-11 CUSA Player of the Year,Player of the Year,,2010-11
1308,ovie-soko-1,2013-14 All-A-10 - 3rd Team,3rd Team,,2013-14
1667,jamychal-green-1,2008-09 SEC All-Freshman,All-Freshman,,2008-09


**Old award_type - DONT NEED**

In [295]:
# # National Awards
# df_all.loc[df_all[1].str.contains('ap player of the year', case = False),['award_type']]= 'AP Player of the Year'
# df_all.loc[df_all[1].str.contains('ap pre', case = False),['award_type']]= 'AP Preseason All-American'
# df_all.loc[df_all[1].str.contains('NABC Player of the Year', case = False),['award_type']]= 'NABC Player of the Year'
# df_all.loc[df_all[1].str.contains('NABC Defensive Player of the Year', case = False),['award_type']]= 'NABC Defensive Player of the Year'
# df_all.loc[df_all[1].str.contains('Naismith Award Semifinalists', case = False),['award_type']]= 'Naismith Award Semifinalists'
# df_all.loc[df_all[1].str.contains('Naismith Award Finalists', case = False),['award_type']]= 'Naismith Award Finalists'
# df_all.loc[df_all[1].str.endswith('Naismith Award'),['award_type']]= 'Naismith Award'
# df_all.loc[df_all[1].str.contains('NCAA All-Tournament', case = False),['award_type']]= 'NCAA All-Tournament'
# df_all.loc[df_all[1].str.contains('NCAA Tournament Most Outstanding Player', case = False),['award_type']]= 'NCAA Tournament Most Outstanding Player'
# df_all.loc[df_all[1].str.contains('NIT Most Valuable Player', case = False),['award_type']]= 'NIT Most Valuable Player'
# df_all.loc[df_all[1].str.contains('Rupp Trophy', case = False),['award_type']]= 'Rupp Trophy'
# df_all.loc[df_all[1].str.contains('Sporting News Player of the Year', case = False),['award_type']]= 'Sporting News Player of the Year'
# df_all.loc[df_all[1].str.contains('The Bob Cousy Award', case = False),['award_type']]= 'The Bob Cousy Award'
# df_all.loc[df_all[1].str.contains('The Frances Pomeroy Naismith Award', case = False),['award_type']]= 'The Frances Pomeroy Naismith Award'
# df_all.loc[df_all[1].str.contains('USBWA Player of the Year', case = False),['award_type']]= 'USBWA Player of the Year'
# df_all.loc[df_all[1].str.contains('USBWA Freshman of the Year', case = False),['award_type']]= 'USBWA Freshman of the Year'
# df_all.loc[df_all[1].str.contains('USBWA Player of the Year Finalists', case = False),['award_type']]= 'USBWA Player of the Year Finalists'
# df_all.loc[df_all[1].str.contains('Wooden Award - Finalists', case = False),['award_type']]= 'Wooden Award - Finalists'
# df_all.loc[df_all[1].str.contains('Wooden Award - National Ballot', case = False),['award_type']]= 'Wooden Award - National Ballot'
# df_all.loc[df_all[1].str.contains('Wooden Award - Preseason', case = False),['award_type']]= 'Wooden Award - Preseason'
# df_all.loc[df_all[1].str.endswith('Wooden Award'),['award_type']]= 'Wooden Award'
# df_all.loc[df_all[1].str.contains('Wooden Award - Midseason', case = False),['award_type']]= 'Wooden Award - Midseason'
# df_all.loc[df_all[1].str.contains('Wooden Award - Late Season', case = False),['award_type']]= 'Wooden Award - Late Season'
# df_all.loc[df_all[1].str.contains('The Kareem Abdul-Jabbar Award', case = False),['award_type']]= 'The Kareem Abdul-Jabbar Award'
# df_all.loc[df_all[1].str.contains('Consensus All-America', case = False),['award_type']]= 'Consensus'

# ## Conferences
# df_all.loc[df_all[1].str.contains('A-10', case = False),['award_conf']]= 'Atlantic-10'
# df_all.loc[df_all[1].str.contains('Big Sky', case = False),['award_conf']]= 'Big Sky'
# df_all.loc[df_all[1].str.contains('Pac-12', case = False),['award_conf']]= 'Pac-12'
# df_all.loc[df_all[1].str.contains('Big South', case = False),['award_conf']]= 'Big South'
# df_all.loc[df_all[1].str.contains('Big Ten', case = False),['award_conf']]= 'Big Ten'
# df_all.loc[df_all[1].str.contains('Big 12', case = False),['award_conf']]= 'Big 12'
# df_all.loc[df_all[1].str.contains('AEC', case = False),['award_conf']]= 'AEC'
# df_all.loc[df_all[1].str.contains('SEC', case = False),['award_conf']]= 'SEC'
# df_all.loc[df_all[1].str.contains('ACC', case = False),['award_conf']]= 'ACC'
# df_all.loc[df_all[1].str.contains('Big East', case = False),['award_conf']]= 'Big East'
# df_all.loc[df_all[1].str.contains('WAC', case = False),['award_conf']]= 'WAC'
# df_all.loc[df_all[1].str.contains('A-Sun', case = False),['award_conf']]= 'Atlantic Sun'
# df_all.loc[df_all[1].str.contains('MAC', case = False),['award_conf']]= 'Mid-American Conf'
# df_all.loc[df_all[1].str.contains('CAA', case = False),['award_conf']]= 'Colonial Athletic Assoc'
# df_all.loc[df_all[1].str.contains('WCC', case = False),['award_conf']]= 'West Coast Conf'
# df_all.loc[df_all[1].str.contains('Sun Belt', case = False),['award_conf']]= 'Sun Belt'
# df_all.loc[df_all[1].str.contains('MWC', case = False),['award_conf']]= 'Mountain West Conf'
# df_all.loc[df_all[1].str.contains('Ivy', case = False),['award_conf']]= 'Ivy'
# df_all.loc[df_all[1].str.contains('Patriot', case = False),['award_conf']]= 'Patriot'
# df_all.loc[df_all[1].str.contains('Pac-10', case = False),['award_conf']]= 'Pac-10'
# df_all.loc[df_all[1].str.contains('MAAC', case = False),['award_conf']]= 'Metro Atlantic Athletic Conf'
# df_all.loc[df_all[1].str.contains('OVC', case = False),['award_conf']]= 'Ohio Valley Conf'
# df_all.loc[df_all[1].str.contains('Summit', case = False),['award_conf']]= 'Summit League'
# df_all.loc[df_all[1].str.contains('Southern', case = False),['award_conf']]= 'Southern Conf'
# df_all.loc[df_all[1].str.contains('AAC', case = False),['award_conf']]= 'AAC'
# df_all.loc[df_all[1].str.contains('MEAC', case = False),['award_conf']]= 'MEAC'
# df_all.loc[df_all[1].str.contains('Horizon', case = False),['award_conf']]= 'Horizon'
# df_all.loc[df_all[1].str.contains('Big West', case = False),['award_conf']]= 'Big West'
# df_all.loc[df_all[1].str.contains('MVC', case = False),['award_conf']]= 'MVC'
# df_all.loc[df_all[1].str.contains('CUSA', case = False),['award_conf']]= 'CUSA'
# df_all.loc[df_all[1].str.contains('Southland', case = False),['award_conf']]= 'Southland'
# df_all.loc[df_all[1].str.contains('GWC', case = False),['award_conf']]= 'GWC'
# df_all.loc[df_all[1].str.contains('NEC', case = False),['award_conf']]= 'NEC'

# ## remaining awards go towards all conferences (national)
# df_all.award_conf.fillna('National',inplace = True)

# # award_type: conference name awards

# df_all.loc[(df_all[1].str.contains('All', case = False))&(df_all['award_type'].isna()),['award_type']]= 'All-Conf'
# df_all.loc[(df_all[1].str.contains('Player of the Year', case = False))&(df_all['award_type'].isna()),['award_type']]= 'Conf-POY'
# df_all.loc[(df_all[1].str.contains('Tournament MVP', case = False))&(df_all['award_type'].isna()),['award_type']]= 'Conf-Tourney-MVP'
# df_all.loc[(df_all[1].str.contains('Rookie of the Year', case = False))&(df_all['award_type'].isna()),['award_type']]= 'Conf-ROY'
# df_all.loc[(df_all[1].str.contains('Sixth Man of the Year', case = False))&(df_all['award_type'].isna()),['award_type']]= 'Conf-6MOY'
# df_all.loc[(df_all[1].str.contains('Most Improved Player', case = False))&(df_all['award_type'].isna()),['award_type']]= 'Conf-MIP'

### Creating `award_name` & `award_type` columns - `award_type` NOT NEEDED  
* award_name: actual name of the award, without the year
* award_type: national, conference, Consensus All-America, NCAA Tournament All-Region
* award_conf: the conference that the award belongs to

**Note:** `award_type` is not strictly needed, but it is used as a filter in places. Keep it for now so that the analysis part can proceed.

#### **National Awards** 

In [107]:
national_awards = [
    'AP Player of the Year',
    'AP Preseason All-American',
    'Helms Foundation Player of the Year',
    'NABC Defensive Player of the Year',
    'NABC Player of the Year',
    'Naismith Award',
    'Naismith Award Finalists',
    'Naismith Award Semifinalists',
    'NCAA All-Tournament',
    'NCAA Tournament Most Outstanding Player',
    'NIT Most Valuable Player',
    'Rupp Trophy',
    'Sporting News Player of the Year',
    'The Bob Cousy Award',
    'The Frances Pomeroy Naismith Award',
    'The Jerry West Award',
    'The Julius Erving Award',
    'The Kareem Abdul-Jabbar Award',
    'The Karl Malone Award',
    'UPI Player of the Year',
    'USBWA Freshman of the Year',
    'USBWA Player of the Year',
    'USBWA Player of the Year Finalists',
    'Wooden Award',
    'Wooden Award - Finalists',
    'Wooden Award - Late Season',
    'Wooden Award - Midseason',
    'Wooden Award - National Ballot',
    'Wooden Award - Preseason',
]

for award in national_awards:
    # later, more specific entries (e.g. 'Wooden Award - Finalists') overwrite earlier, shorter matches
    mask = cbb_player_awards.award_orig_name.str.contains(award)
    cbb_player_awards.loc[mask,['award_name']] = award
    cbb_player_awards.loc[mask,['award_type']] = 'national'
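
The loop above depends on list order: a short name such as 'Wooden Award' has to appear before longer names that contain it, so that the later assignment wins. An order-independent alternative is to pick the longest candidate that occurs in the original award string. The sketch below illustrates that idea in plain Python; `longest_match` is a hypothetical helper, not something the notebook defines.

```python
def longest_match(award_orig_name, candidates):
    """Return the longest candidate contained in award_orig_name, or None if none match."""
    hits = [c for c in candidates if c in award_orig_name]
    return max(hits, key=len) if hits else None


# deliberately listed with the longer name first to show that order does not matter
candidates = ['Wooden Award - Finalists', 'Wooden Award', 'Naismith Award']
print(longest_match('2010-11 Wooden Award - Finalists', candidates))
# Wooden Award - Finalists
```

The same function could be applied to a column with `Series.map`, at the cost of being slower than vectorised `str.contains` on large tables.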

#### Creating `award_lookup` table.  
This table is used for conference awards only

In [108]:
# read in the two lookup tables. These help to create the search strings necessary to identify the awards
award_conf = pd.read_csv('award_conf.csv')
award_text = pd.read_csv('award_text.csv')

# convert award_conf table wide-to-long
award_conf_long = pd.melt(award_conf, id_vars = ['conf_full','conf_abbr'],
                          value_vars = ['award1','award2','award3','award4','award5',
                                        'award6','award7','award8','award9','award10'])
# remove non-breaking spaces from the award names
award_conf_long['value'] = award_conf_long.value.str.replace('\xa0','')

# merge the two tables
award_lookup = award_conf_long.merge(award_text, how = 'inner', left_on='value', right_on='award_name')

# create award_lookup column that will be used as a search string to identify awards
# the part (prefix, middle or postfix) that holds the placeholder 'conf_abbr' is replaced by the conference abbreviation
award_lookup.loc[award_lookup.award_prefix == 'conf_abbr','award_lookup'] = award_lookup.conf_abbr +' '+award_lookup.award_postfix
award_lookup.loc[award_lookup.award_middle == 'conf_abbr','award_lookup'] = award_lookup.award_prefix + award_lookup.conf_abbr + ' '+award_lookup.award_postfix
award_lookup.loc[award_lookup.award_postfix == 'conf_abbr','award_lookup'] = award_lookup.award_prefix + award_lookup.conf_abbr

# drop columns
award_lookup.drop(columns = ['variable','value','award_prefix','award_middle','award_postfix'],inplace= True)
# store as CSV
award_lookup.to_csv('02_database/award_lookup.csv',index = False)

award_lookup.sample(2)

Unnamed: 0,conf_full,conf_abbr,award_name,award_lookup
61,Southwest Athletic Conference,SWAC,All-Conf Tourney,All-SWAC Tournament
2,Atlantic 10 Conference,A-10,All-Conf,All-A-10
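
To make the template logic behind the `award_lookup` column easier to follow, here is a minimal single-row sketch of the same three rules. `build_award_lookup` is a hypothetical helper, and the template values in the usage lines are assumptions chosen to reproduce the two sample rows above; the actual contents of `award_text.csv` are not shown in this notebook. It also assumes that exactly one of the three parts holds the `conf_abbr` placeholder.

```python
PLACEHOLDER = 'conf_abbr'


def build_award_lookup(conf_abbr, prefix, middle, postfix):
    """Build the search string for one conference award from its template parts."""
    if prefix == PLACEHOLDER:
        # e.g. '<abbr> Player of the Year'
        return f'{conf_abbr} {postfix}'
    if middle == PLACEHOLDER:
        # e.g. 'All-<abbr> Tournament'
        return f'{prefix}{conf_abbr} {postfix}'
    if postfix == PLACEHOLDER:
        # e.g. 'All-<abbr>'
        return f'{prefix}{conf_abbr}'
    return None  # no placeholder found in this template


# assumed template rows, for illustration only
print(build_award_lookup('SWAC', 'All-', 'conf_abbr', 'Tournament'))
# All-SWAC Tournament
print(build_award_lookup('A-10', 'All-', '', 'conf_abbr'))
# All-A-10
```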


#### Conference awards  

In [109]:
conference_awards = award_lookup.award_lookup.tolist()
conference_abbr = award_lookup.conf_abbr.tolist()
for award,abbr in zip(conference_awards,conference_abbr):
    mask = cbb_player_awards.award_orig_name.str.contains(award)
    cbb_player_awards.loc[mask,['award_name']] = award
    cbb_player_awards.loc[mask,['award_type']] = 'conference'
    cbb_player_awards.loc[mask,['award_conf']] = abbr

# Two awards have very similar names: All-Conf & All-Conf Tournament. The loop above cannot reliably distinguish the two.
# This was meant as a manual overwrite; with the ' Tournament' suffix commented out it currently leaves award_name unchanged.
tourney_mask = (cbb_player_awards.award_sub_name.str.contains('Tournament', na=False)) & (cbb_player_awards.award_type=='conference')
cbb_player_awards.loc[tourney_mask,['award_name']] = cbb_player_awards.loc[tourney_mask,['award_name']].values#+' Tournament'

In [110]:
cbb_player_awards.loc[(~cbb_player_awards.award_name.isna())&cbb_player_awards.award_name.str.contains('All-Big East Tournament',regex = False),:].head(4)

Unnamed: 0,player_id,award_orig_name,award_sub_name,award_conf,award_year,award_name,award_type,award_year_main
17544,yancy-gates-1,2012 All-Big East Tournament - 1st Team,East Tournament - 1st Team,Big East,2012,All-Big East Tournament,conference,2012
17605,cashmere-wright-1,2012 All-Big East Tournament - 1st Team,East Tournament - 1st Team,Big East,2012,All-Big East Tournament,conference,2012
21283,kemba-walker-1,2011 All-Big East Tournament - 1st Team,East Tournament - 1st Team,Big East,2011,All-Big East Tournament,conference,2011
32060,greg-monroe-1,2010 All-Big East Tournament - 1st Team,East Tournament - 1st Team,Big East,2010,All-Big East Tournament,conference,2010


#### Consensus All-America award

In [111]:
consensus_awards = ['Consensus All-America']
for award in consensus_awards:
    cbb_player_awards.loc[cbb_player_awards.award_orig_name.str.contains(award),['award_name']] = award
    cbb_player_awards.loc[cbb_player_awards.award_orig_name.str.contains(award),['award_type']] = 'Consensus All-America'
    cbb_player_awards.loc[cbb_player_awards.award_orig_name.str.contains(award),['award_conf']] = 'Consensus'

#### NCAA Tournament All-Region

In [112]:
ncaa_tourney_all_region_awards = ['NCAA Tournament All-Region']
for award in ncaa_tourney_all_region_awards:
    cbb_player_awards.loc[cbb_player_awards.award_orig_name.str.contains(award),['award_name']] = award
    cbb_player_awards.loc[cbb_player_awards.award_orig_name.str.contains(award),['award_type']] = 'NCAA Tournament All-Region'
    cbb_player_awards.loc[cbb_player_awards.award_orig_name.str.contains(award),['award_conf']] = 'NCAA'

#### Adding `award_year_main`
Convert the season of the award to the year in which that season ends.  
For example, if the season is 2013-14, then the award year is 2014.

In [113]:
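# split the season label, e.g. '2013-14' -> ['2013', '14']; single-year labels such as '2012' leave column 1 empty
x = cbb_player_awards.award_year.str.split(('-'),expand = True, regex = False)
# two-digit end year -> four digits (prefixing '20' assumes every season ends in 2000 or later)
x.loc[~x[1].isna(),[1]] = '20'+x.loc[~x[1].isna(),[1]].astype(str)
# single-year labels: the award year is the year itself
x.loc[x[1].isna(),[1]] = x.loc[x[1].isna()][0]
cbb_player_awards['award_year_main'] = x[1]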
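
The `'20'` prefix in the cell above only gives the right answer for seasons that end in 2000 or later; a season such as 1998-99 would become 2099. A century-safe variant can derive the end year from the start year instead. The sketch below is one way to do that; `season_end_year` is a hypothetical helper, and it assumes a hyphenated label always spans two consecutive calendar years.

```python
def season_end_year(award_year):
    """Convert a season label such as '2013-14' to 2014.

    A single-year label such as '2012' is returned as that year.
    """
    start, sep, _ = award_year.partition('-')
    if not sep:
        return int(start)
    # the season ends in the calendar year after it starts
    return int(start) + 1


print(season_end_year('2013-14'))  # 2014
print(season_end_year('1998-99'))  # 1999
print(season_end_year('2012'))     # 2012
```

Applied to the table, this could look like `cbb_player_awards.award_year.map(season_end_year)`. Note that it yields integers, whereas the cell above produces strings.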

In [114]:
# save to csv
cbb_player_awards.to_csv('02_database/cbb_player_awards.csv',index = False)

In [116]:
cbb_player_awards[cbb_player_awards.award_name.str.contains('Wood', na=False)].head(4)

Unnamed: 0,player_id,award_orig_name,award_sub_name,award_conf,award_year,award_name,award_type,award_year_main
3409,derrick-williams-2,2010-11 Wooden Award - Finalists,Award - Finalists,,2010-11,Wooden Award - Finalists,national,2011
3410,derrick-williams-2,2010-11 Wooden Award - National Ballot,Award - National Ballot,,2010-11,Wooden Award - National Ballot,national,2011
6562,cory-jefferson-1,2013-14 Wooden Award - Preseason,Award - Preseason,,2013-14,Wooden Award - Preseason,national,2014
9502,jimmer-fredette-1,2009-10 Wooden Award - National Ballot,Award - National Ballot,,2009-10,Wooden Award - National Ballot,national,2010


In [106]:
cbb_player_awards[cbb_player_awards.award_orig_name.str.contains('Wood')].head(2)

Unnamed: 0,player_id,award_orig_name,award_sub_name,award_conf,award_year,award_name,award_type,award_year_main
3409,derrick-williams-2,2010-11 Wooden Award - Finalists,Award - Finalists,,2010-11,,,2011
3410,derrick-williams-2,2010-11 Wooden Award - National Ballot,Award - National Ballot,,2010-11,,,2011
