### My Project - Moneyball for College Tennis:
- I initially intended to build a system to identify promising junior tennis players whose rankings do not reflect their upside potential, with the thought that this could be a powerful recruiting tool for college coaches.
- Currently, the junior rankings are based on the best six singles results plus 25% of the best six doubles results over the trailing 52 weeks (rankings are updated weekly).  Players are assigned points based on the tournament level (GA, G1, G2, G3, G4, G5) and how deep they go in a tournament (quarterfinals, semifinals, final, winner).
- No consideration is given to quality wins, bad losses or close losses against players whose rankings are significantly higher.  Additionally, adding a weight for doubles results is nonsensical for a singles ranking (it was done to encourage players to enter both singles and doubles events).  Furthermore, it's expensive for players to travel and play the tournaments that offer the most ranking points (i.e., the highest levels).
- This creates a disadvantage for players who don't have the financial means to travel all over the world.
- My plan was to scrape all the junior and college results, add ranking features and build a predictive model to identify attributes of players with more upside than other players with similar rankings.  A weighted logistic regression model was one approach to classifying players who are "undervalued".  I care more about being right when the model flags a player as having upside than about the opportunity cost of missing players with upside (that is, precision matters more than recall).
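
The ranking rule described above (best six singles results plus 25% of the best six doubles results) can be sketched in a few lines. This is only an illustration: the function name `ranking_points` and the point values are hypothetical, and the inputs are assumed to already be limited to the trailing 52 weeks.

```python
def ranking_points(singles_points, doubles_points, best_n=6, doubles_weight=0.25):
    """Best N singles results plus a fraction of the best N doubles results."""
    best_singles = sorted(singles_points, reverse=True)[:best_n]
    best_doubles = sorted(doubles_points, reverse=True)[:best_n]
    return sum(best_singles) + doubles_weight * sum(best_doubles)


# Hypothetical point totals, not real ITF values.
singles = [250, 180, 150, 100, 80, 60, 40, 30]
doubles = [120, 100, 80, 60, 40, 20, 10]

total = ranking_points(singles, doubles)
print(total)
```

Note that nothing in this calculation depends on who the opponent was or how close the match was, which is the gap the features below try to fill.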

- I scraped all results for every Division I college tennis player since 2012 (prior to 2012, the website with the data was inconsistent).
- I then scraped all results for the top 500 ranked junior tennis players in the world.  This turned out to be quite challenging, as the ITF website blocks requests coming from non-browser clients (i.e., scripts).  Additionally, I needed to interact with the website (i.e., click buttons, etc.).
- I used Selenium, an open-source browser automation tool that allows you to make calls to (and run operations on) a browser.  This solution addressed both of my problems.
- I was ultimately able to scrape all historical results for all current players ranked in the top 500.  However, the website does not provide point-in-time rankings.  In other words, I had no way to get the rankings of players at the time they played their matches.  I would have to calculate those rankings myself.  This entails scraping players who are no longer ranked (because they aged out of juniors), and that data is in another format.  At this point I made the strategic decision to put my project on hold and perform data analysis on what I had scraped, with the full intention of finishing my ultimate project at a later date.
- The data analysis I perform includes: adding the features mentioned above, assigning a value to those features and then identifying current players whose rankings appear to be understated.

bonus points features:
- number of sets (3 means close)
- games won
- games lost
- Outcome (if the score has a letter in it, e.g. RET, W/O or DEF, the match does not count for bonus points)
- qualityWin (ranking difference for wins over higher-ranked players)
- qualityLoss (ranking difference for close losses to higher-ranked players: three-set losses, or losses by no more than three games)
- badLoss (ranking difference for losses to lower-ranked players)
- number of matches played (excludes withdrawals and defaults)
- win pct
- alpha:  (sum of qualityWin + 50% x sum of qualityLoss) / ranking
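
As a worked example of the alpha formula above, the sketch below uses the three quality wins of the player ranked 341 from the trailing-quarter output shown later in this notebook (wins over opponents ranked 168, 139 and 111, with no quality losses). The function name `alpha` and its `loss_weight` parameter are only illustrative.

```python
def alpha(rank, quality_wins, quality_losses, loss_weight=0.5):
    """(sum of qualityWin + 50% x sum of qualityLoss) / ranking."""
    return (sum(quality_wins) + loss_weight * sum(quality_losses)) / rank


rank = 341
# ranking gaps for wins over higher-ranked opponents: 173, 202 and 230
quality_wins = [rank - 168, rank - 139, rank - 111]
quality_losses = []

player_alpha = alpha(rank, quality_wins, quality_losses)
print(round(player_alpha, 2))
```

Dividing by the player's own ranking scales the result, so the same ranking gap counts for more when it is earned by a player who is already highly ranked.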

findings:
- of the top 500 ranked junior tennis players, I've identified 26 whose rankings meaningfully understate the quality of their wins and losses (alpha of at least 2.0).
- an additional 31 players have rankings that also understate the quality of their wins and losses, though less extremely.

Next Steps:
- scrape year-end junior rankings for 2013-2017.
- from the year-end junior rankings, identify players whose results have not been scraped... and scrape them.
- now we have all players' historical results, and we can recreate point-in-time rankings for every week.
- we can then assign each opponent's point-in-time ranking as of the date the match was played.
- now we can backtest and see if players with excess alpha actually perform better in college than comparably ranked players who had lower alpha.
- scrape professional results (most juniors in the top 500 play some professional tournaments and some have unusually good success).
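
Once weekly rankings have been recreated, looking up an opponent's ranking as of a match date could work roughly as follows. This is a minimal sketch under assumptions: the snapshot structure, the function name `rank_as_of` and the sample values are all hypothetical, since the point-in-time data does not exist yet.

```python
from bisect import bisect_right
from datetime import date


def rank_as_of(snapshots, match_date):
    """Return the most recent rank on or before match_date, or None.

    snapshots is a list of (ranking_date, rank) tuples sorted by date.
    """
    dates = [d for d, _ in snapshots]
    idx = bisect_right(dates, match_date)
    if idx == 0:
        return None  # the player was not yet ranked on that date
    return snapshots[idx - 1][1]


# Hypothetical weekly snapshots for one player.
history = [
    (date(2018, 1, 1), 410),
    (date(2018, 1, 8), 395),
    (date(2018, 1, 15), 341),
]

print(rank_as_of(history, date(2018, 1, 10)))
```

A binary search keeps each lookup cheap, which matters when every scraped match needs its own lookup.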

#### Below, I'll take you through my steps, code and findings.

In [1]:
# request gets the data.  soup structures the data.
import requests
from bs4 import BeautifulSoup
requests.urllib3.disable_warnings()
import pandas as pd
import numpy as np
from collections import Counter
import re

import os
import sys
sys.path.append('P:/python/packages')
sys.path.append('P:/python/packages/geckodriver')
from selenium import webdriver

In [2]:
driver = webdriver.Firefox()
player_rankings_url='https://www.itftennis.com/juniors/rankings/player-rankings.aspx'

# using selenium, open a browser window and load the ITF website, and select all current players with rankings
def get_rankings_page(driver, url):
    driver.implicitly_wait(30)
    driver.get(url)
    
    #driver.find_element_by_xpath("//select[@id='ddlRange']/option[text()='Top 500']").click()
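    # find_element("xpath", ...) is available in Selenium 3 and 4 (the find_element_by_* helpers were removed in Selenium 4.3)
    driver.find_element("xpath", "//select[@id='ddlRange']/option[text()='All']").click()
    driver.find_element("xpath", '//*[@id="btnSearch"]').click()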
    return driver
    
driver = get_rankings_page(driver, player_rankings_url)

In [3]:
# load all ranked players and their info into a dataframe

# function that returns general stats for every player row in the rankings table
def get_players_data(players_data, col_names):
    rows = []

    for player in players_data:
        player_dict = {}
        # skip the first cell and map the remaining cells onto the column names, in order
        for idx, d in enumerate(player.find_all('td')[1:]):
            col_name = col_names[idx]
            player_dict[col_name] = d.text.lstrip()
        # the player ID is embedded in the profile link
        player_dict["Player ID"] = str(player).split('a href="/juniors/players/player/profile.aspx?playerid=')[-2].split('"')[0]
        rows.append(player_dict)
    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    return pd.DataFrame(rows, columns=col_names)
        
def get_columns(column_text):
    return column_text.split()
    
# function that uses beautifulsoup to parse and load each player's info into a dataframe
def extract_rankings(driver):
    html_source = driver.page_source
    soup=BeautifulSoup(driver.page_source, 'lxml')
    
    ranking_table=soup.find('div', {'id': 'divRankingResults'}).find('table').find('tbody').find_all('tr')
    col_names = get_columns(ranking_table[0].text)
    col_names.append("Player ID")
    players=ranking_table[1:]
    
    return get_players_data(players[0:], col_names)
    
df=extract_rankings(driver)

In [4]:
# add first name and last name columns
addNames=pd.DataFrame(df.Player.str.split(',',n=1).tolist(),columns = ['Last Name','First Name'])
df=df.join(addNames)

In [5]:
# save to csv
#df.to_csv('./output/playerRankings.csv', encoding='utf-8', index=False)

In [6]:
# functions to simplify the process

def get_player_page(driver, url):
    driver.implicitly_wait(30)
    driver.get(url)
    
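    # find_element("xpath", ...) is available in Selenium 3 and 4 (the find_element_by_* helpers were removed in Selenium 4.3)
    driver.find_element("xpath", "/html/body/form/div[3]/div/div[2]/div[2]/div[1]/div[2]/div[1]/span[2]/span/span/a/span").click()
    driver.find_element("xpath", '//*[@id="btnViewAll"]').click()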
    return driver

def extract_player_results(driver):
    html_source = driver.page_source
    soup=BeautifulSoup(driver.page_source, 'lxml')
    ranking_table=soup.find('div', {'id': 'ActivityDiv'}).text[:10]
    #ranking_table=soup.find('div', {'id': 'ActivityDiv'}).find('table').find('tbody').find_all('tr')
    #col_names = get_columns(ranking_table[0].text)
    #players=ranking_table[1:]
    
    #return get_players_data(players[0:], col_names)

In [7]:
# player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100286164'
# player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100282321'
# player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100274778'
# player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100204396'
# player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100282506'
# player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100197802'

In [8]:
# function getEventResultsTable
# function getCurrentPlayerInfo
# function getPlayerRecord


def getEventResultsTable(currEvent, eventNumber, a_match, a_match2):
    eventNumber2_column=[]
    eventRound_column=[]
    eventResults_column=[]
    eventOpponent_column=[]
    eventOpponentID_column=[]
    eventCountry_column=[]
    eventScore_column=[]
    
    stop_flag=False;
    for a in currEvent.find_all('tr'):
        tourney_row=a.find_all('td')
        num_lines=str(tourney_row).count("\n")+1    
        if str(tourney_row).find('Partnering')!=-1:
            stop_flag=True;
        if (num_lines>15) and (stop_flag==False):
            # look for "Partnering" in text.  if exists, then done
            A=str(tourney_row[0])
            A=re.findall(a_match,A)
            if len(A)==1:
                A=A[0].strip()
            else:
                A='';
            B=str(tourney_row[1])
            B=re.findall(a_match,B)
            #print("B = ",B)
            if len(B)==1:
                B=B[0].strip()
            else:
                B='';
            C=str(tourney_row[2])
            C=''
            D=str(tourney_row[3])
            D=re.findall(a_match,D)
            if len(D)>=7:  # D[5] and D[6] are both read below
                D1=D[5].strip()
                D3=D1.split("playerid=")[1]
                D3=D3.split('"')[0]
                D1=re.findall(a_match2,D1)[0]
                D2=D[6].strip()
                D2=D2.replace('(','')
                D2=D2.replace(')','')
            else:
                D1='';
                D2='';
                D3='';
            E=str(tourney_row[4])
            E=re.findall(a_match,E)
            if len(E)==1:
                E=E[0].strip()
            else:
                E='';
            #print(A,"; ",B,"; ",D1,"; ",D2,"; ",E)

            eventNumber2_column.append(eventNumber)
            eventRound_column.append(A)
            eventResults_column.append(B)
            eventOpponent_column.append(D1)
            eventOpponentID_column.append(D3)
            eventCountry_column.append(D2)
            eventScore_column.append(E)

    return pd.DataFrame(
        {'eventNumber': eventNumber2_column,
         'eventRound': eventRound_column,
         'eventResults': eventResults_column,
         'eventOpponent': eventOpponent_column,
         'eventOpponentID': eventOpponentID_column,
         'eventOpponentCountry': eventCountry_column,
         'eventScore': eventScore_column,
        })

def getCurrentPlayerInfo(player_ID):
    # get player name & country
    playername=soup.text[0:300].split('Player Profile - ')[1].split('\n')[0]
    playername=playername.replace(')','')
    playername=playername.split('(')
    fullname_list=playername[0].split(',')
    fullname_column=playername[0].strip()
    player_country_column=playername[1].strip()
    lastname_column=fullname_list[0].strip()
    firstname_column=fullname_list[1].strip()

    # get general event info for this player
    eventNumber_column=[]
    eventName_column=[]

    tourneyName_table=soup.find('div', {'id': 'ActivityDiv'}).find_all('h3')
    eventNumber=0
    for j in tourneyName_table[:]:
        eventNumber=eventNumber+1;
        tourney=j.find('a')
        tourney_name=str(tourney).split('</a>')[-2].split('>')[-1]
        eventNumber_column.append(eventNumber)
        eventName_column.append(tourney_name)

    return pd.DataFrame(
        {'eventNumber': eventNumber_column,
         'eventName': eventName_column,
         'playerName': fullname_column,
         'player_ID': player_ID,
         'playerCountry': player_country_column,
         'playerLast': lastname_column,
         'playerFirst': firstname_column,
        })


def getPlayerRecord():
    # get specific event info for this player
    metaData_table=soup.find('div', {'id': 'ActivityDiv'}).find_all('li',{'class':'pn-Gradient'})

    a_match = re.compile(r'(?<=\n)(.*)(?=\n)')
    a_match2 = re.compile(r'(?<=>)(.*)(?=</a>)')
    
    all_events = pd.DataFrame(
        {'eventNumber': [],
        'eventVenue': [],
        'eventDates': [],
        'eventCategory': [],
        'eventGrade': [],
        'eventEntry': [],
        'eventSurface': [],
        'eventIndoorOutdoor': [],
        'eventRound': [],
        'eventResults': [],
        'eventOpponent': [],
        'eventOpponentID': [],
        'eventOpponentCountry': [],
        'eventScore': [],
        })
    
    eventNumber=0
    for currEvent in metaData_table[:]:
        eventNumber=eventNumber+1;
        venue=str(currEvent.find('li',{'class':'liVenue'})).split('</li>')[-2].split('\n')[-1].strip()
        dates=str(currEvent.find('li',{'class':'liDates'})).split('</li>')[-2].split('\n')[-1].strip()
        event_info=currEvent.find('ul',{'class':'ulMatch'})
        event_category=str(event_info.find_all('li')[0]).split('</li>')[-2].split('\n')[-1].strip()
        event_grade=str(event_info.find_all('li')[1]).split('</li>')[-2].split('\n')[-1].strip()
        event_entry=str(event_info.find_all('li')[2]).split('</li>')[-2].split('\n')[-1].strip()
        event_surface=str(event_info.find_all('li')[3]).split('</li>')[-2].split('\n')[-1].strip()
        event_surface=event_surface.replace('Surface: ','')
        event_indoor_outdoor=str(event_info.find_all('li')[4]).split('</li>')[-2].split('\n')[-1].strip()
        event_indoor_outdoor=event_indoor_outdoor.replace('(','')
        event_indoor_outdoor=event_indoor_outdoor.replace(')','')

        
        temp_df = pd.DataFrame(
        {'eventNumber': [eventNumber],
         'eventVenue': [venue],
         'eventDates': [dates],
         'eventCategory': [event_category],
         'eventGrade': [event_grade],
         'eventEntry': [event_entry],
         'eventSurface': [event_surface],
         'eventIndoorOutdoor': [event_indoor_outdoor],
        })
        
        event_results = getEventResultsTable(currEvent,eventNumber, a_match, a_match2)
        #print(event_results.head())
        merged_df = pd.merge(temp_df, event_results, how='left', left_on=['eventNumber'], right_on=['eventNumber'])
        all_events = pd.concat([all_events, merged_df])
    return all_events
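
To make the two regular expressions used in `getPlayerRecord` and `getEventResultsTable` easier to follow, here is a small standalone illustration. The HTML snippets are hypothetical and only modelled on the profile links that appear elsewhere in this notebook. The real page markup may differ.

```python
import re

a_match = re.compile(r'(?<=\n)(.*)(?=\n)')      # text sitting on its own line
a_match2 = re.compile(r'(?<=>)(.*)(?=</a>)')    # text between '>' and '</a>'

# Hypothetical table cell whose value sits on its own line.
round_cell = '<td>\n      SF\n</td>'
round_name = a_match.findall(round_cell)[0].strip()

# Hypothetical opponent link.
link = '<a href="/juniors/players/player/profile.aspx?playerid=100274778">TSENG, Chun Hsin</a>'
opponent_name = a_match2.findall(link)[0]
opponent_id = link.split('playerid=')[1].split('"')[0]

print(round_name, opponent_name, opponent_id)
```

Both patterns use lookbehind and lookahead, so the delimiters themselves (`\n`, `>` and `</a>`) are not part of the captured text.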

In [9]:
# my loop to scrape all the data and write to file

start=0
last=3
inc=3
while start < last:

    end = start + inc

    playerResults = pd.DataFrame({'eventNumber': [],'eventRound': [],'eventResults': [],'eventOpponent': [],'eventOpponentCountry': [],
    'eventScore': [],'eventName': [],'playerName': [],'playerCountry': [],'playerLast': [],'playerFirst': [],
    'eventVenue': [],'eventDates': [],'eventCategory': [],'eventGrade': [],'eventEntry': [],'eventSurface': [],'eventIndoorOutdoor': []})

    playerCnt=start
    for i in df["Player ID"][start:end]:
        player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid='+i
        print("player count = ",playerCnt,"; ",player_rankings_url)

        # load this player's data into firefox browser
        driver = get_player_page(driver, player_rankings_url)
        extract_player_results(driver)

        # load this player's data into a soup object
        html_source = driver.page_source
        soup=BeautifulSoup(driver.page_source, 'lxml')

        CurrentPlayerInfo=getCurrentPlayerInfo(i)
        PlayerRecord=getPlayerRecord()
        PlayerInfo = pd.merge(PlayerRecord, CurrentPlayerInfo, how='left', left_on=['eventNumber'], right_on=['eventNumber'])
    #     output='./output/player_info_test' + str(start+1) + '.csv'
    #     PlayerInfo.to_csv(output, encoding='utf-8', index=False)
        playerResults=pd.concat([playerResults,PlayerInfo],sort=False)
        playerCnt=playerCnt+1

    playerResults=playerResults[playerResults.eventCategory!='Doubles']

    output='./output/player_stats' + str(start+1) + '-' + str(end) + '.csv'
    #output='./output/player_stats_test.csv'
    playerResults.to_csv(output, encoding='utf-8', index=False)

    start = start + inc

print('done!')

player count =  0 ;  https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100274778
player count =  1 ;  https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100282506
player count =  2 ;  https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100291609
done!
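
The batching in the loop above (`start`, `last`, `inc`) can also be expressed as a small generator. This is a sketch only: the helper names `batches` and `batch_filename` are hypothetical. Unlike the loop above, it clamps the final batch so the file name stays accurate when the total is not a multiple of the batch size.

```python
def batches(total, size, start=0):
    """Yield (start, end) index pairs covering range(start, total)."""
    while start < total:
        end = min(start + size, total)
        yield start, end
        start = end


def batch_filename(start, end):
    # same naming scheme as the loop above: 1-based first player, inclusive last player
    return './output/player_stats{}-{}.csv'.format(start + 1, end)


for start, end in batches(7, 3):
    print(start, end, batch_filename(start, end))
```

Writing one file per batch means a failed request only costs the current batch, not the whole scrape.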


In [10]:
# for debugging a single player
i='100226751'

playerResults = pd.DataFrame({'eventNumber': [],'eventRound': [],'eventResults': [],'eventOpponent': [],'eventOpponentCountry': [],
'eventScore': [],'eventName': [],'playerName': [],'playerCountry': [],'playerLast': [],'playerFirst': [],
'eventVenue': [],'eventDates': [],'eventCategory': [],'eventGrade': [],'eventEntry': [],'eventSurface': [],'eventIndoorOutdoor': []})

player_rankings_url='https://www.itftennis.com/juniors/players/player/profile.aspx?playerid='+i
print(player_rankings_url)

# load this player's data into firefox browser
driver = get_player_page(driver, player_rankings_url)
extract_player_results(driver)

# load this player's data into a soup object
html_source = driver.page_source
soup=BeautifulSoup(driver.page_source, 'lxml')

CurrentPlayerInfo=getCurrentPlayerInfo(i)
PlayerRecord=getPlayerRecord()
PlayerInfo = pd.merge(PlayerRecord, CurrentPlayerInfo, how='left', left_on=['eventNumber'], right_on=['eventNumber'])
playerResults=pd.concat([playerResults,PlayerInfo],sort=False)
playerResults=playerResults[playerResults.eventCategory!='Doubles']

#playerResults.to_csv('./output/player_stats_A', encoding='utf-8', index=False)

https://www.itftennis.com/juniors/players/player/profile.aspx?playerid=100226751


In [2]:
# read in all player results and current player rankings

output_path='X:/Risk/python/general_assembly_course/FinalProject/output/'
output_path='./output/'

# find files in given directory, NOT subdirectories
def find_files(given_path):
    files = [f for f in os.listdir(given_path)
             if os.path.isfile(os.path.join(given_path, f))]
    return files
output_files = find_files(output_path)

output_frames = []
for f in output_files: 
    if 'player_stats' in f:
        output_frames.append(pd.read_csv(output_path + f))

results = pd.concat(output_frames, sort=False)

playerRankings=pd.read_csv(output_path + 'playerRankings.csv')

# join the data, add features and write to disk

# take a copy so the in-place rename below does not act on a slice of playerRankings
activePlayerJoin=playerRankings[['Player','Rank','DOB','Event','Points','Player ID','Last Name','First Name']].copy()
activePlayerJoin.rename(columns={'Event':'Events'},inplace=True)
opponentPlayerJoin=activePlayerJoin.copy()
opponentPlayerJoin.rename(columns={'Rank':'RankOpponent','Last Name':'Last Name Opponent','First Name':'First Name Opponent','Player ID':'Player ID Opponent'},inplace=True)
opponentPlayerJoin.drop(columns=['DOB','Events','Points'],inplace=True)

results=pd.merge(results, activePlayerJoin, how='left', left_on=['player_ID'], right_on=['Player ID'])
results=pd.merge(results, opponentPlayerJoin, how='left', left_on=['eventOpponentID'], right_on=['Player ID Opponent'])
results.drop(columns=['Player_x','Player_y','player_ID','Last Name','First Name'],inplace=True)

# start: convert eventDates to eventStartDate and eventEndDate
test=results.eventDates

a=pd.DataFrame(test.str.split('-').tolist(),columns = ['eventStartDate','eventEndDate'])
a['year'] = a['eventEndDate'].str[-4:]
a['eventStartDate'] = a['eventStartDate'] + a['year']
a['eventStartDate'] = pd.to_datetime(a['eventStartDate'])
a['eventEndDate'] = pd.to_datetime(a['eventEndDate'])
#a.drop(columns=['year'],inplace=True)
results=results.join(a)
results.drop(columns=['eventDates','year'],inplace=True)
# end: convert eventDates to eventStartDate and eventEndDate

results.fillna({'eventScore':''}, inplace=True)

# relabel retirements (RET), walkovers (W/O) and defaults (DEF), plus result codes A, N and O, as 'U'
# so they can be excluded from completed matches; .loc avoids the chained-assignment warning
incomplete = (results['eventScore'].str.contains('RET|W/O|DEF')
              | results['eventResults'].isin(['A', 'N', 'O']))
results.loc[incomplete, 'eventResults'] = 'U'

results['eventScore']=results['eventScore'].str.replace(' RET','')
results['eventScore']=results['eventScore'].str.replace(' W/O','')
results['eventScore']=results['eventScore'].str.replace(' DEF','')

results.to_csv('./output/all player stats.csv', encoding='utf-8', index=False)
#results[results.Rank==1]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)

In [97]:
# add some more features and save the trailing three months of data
results=pd.read_csv('./output/all player stats.csv')
playerRankings=pd.read_csv(output_path + 'playerRankings.csv')

results["sets"]=results['eventScore'].str.count('-')
results_qtr=results[results["eventEndDate"]>='2018-06-01']
results_qtr=results_qtr[['Rank','RankOpponent','eventResults','eventScore','sets']].copy()

# get rid of tiebreak details in parentheses, e.g. 7-6(5); matching one bracket at a time
# keeps the sets between two tiebreaks from being deleted
results_qtr['eventScore'] = results_qtr['eventScore'].str.replace(r"\([^)]*\)", "", regex=True)
# make new column that doesn't have the RET
results_qtr['eventScores'] = results_qtr['eventScore'].str.replace('RET', '')

# the formulas below work fine except for when there's one match, so we flip it temporarily
# NOTE: len() here is the number of rows in the column, not the length of each score string,
# so this line only changes anything when the frame has exactly three rows
results_qtr['eventScores'] = np.where(len(results_qtr['eventScores']) == 3, results_qtr['eventScores'][::-1], results_qtr['eventScores'])

# add everything to left/right of - respectively
results_qtr['playerGames'] = results_qtr.eventScores.str.findall(r'(\d+)-').apply(pd.Series, dtype=float).sum(axis=1)
results_qtr['opponentGames'] = results_qtr.eventScores.str.findall(r'-(\d+)').apply(pd.Series, dtype=float).sum(axis=1)

# drop the new column since we don't need it
results_qtr.drop(columns=['eventScores'], inplace=True)
# join with the forfeited games
# results_qtr = pd.concat([results_qtr, forfeit], sort=False)

results_qtr.to_csv('./output/all player stats trailing qtr.csv', encoding='utf-8', index=False)
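
The score features built in the cell above (sets, games won, games lost, and whether the match was completed) can be illustrated on single score strings without pandas. This is a minimal sketch: `parse_score` is a hypothetical helper, and it assumes scores are written as space-separated sets such as `7-6(5) 4-6 6-3`, optionally followed by RET, W/O or DEF.

```python
import re


def parse_score(score):
    """Return (sets, player_games, opponent_games, completed) for one score string."""
    completed = not any(flag in score for flag in ('RET', 'W/O', 'DEF'))
    cleaned = re.sub(r'\([^)]*\)', '', score)       # drop tiebreak details such as (5)
    pairs = re.findall(r'(\d+)-(\d+)', cleaned)     # one (player, opponent) pair per set
    player_games = sum(int(p) for p, _ in pairs)
    opponent_games = sum(int(o) for _, o in pairs)
    return len(pairs), player_games, opponent_games, completed


print(parse_score('1-6 6-4 7-6'))
print(parse_score('7-6(5) 6-7(3) 6-4'))
print(parse_score('6-3 2-1 RET'))
```

The first score is the three-set win from the rank-341 output further down, and it reproduces that row's 3 sets, 14 games won and 16 games lost.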

In [95]:
#results_qtr.info()
#results[results.eventResults.isnull()].groupby('Rank').count()
#results[results.Rank==353]
#75, 207, 353, 355

In [104]:
#summary stats:
#number of matches played (excludes withdraws and defaults)
#win pct
#alpha: (sum of qualityWin + 50% x sum of qualityLoss) / ranking   (badLoss is tracked but not part of alpha)

results_qtr=pd.read_csv('./output/all player stats trailing qtr.csv')
results_qtr['goodWin']=0
results_qtr['goodLoss']=0
results_qtr['badLoss']=0

# build the masks once and assign with .loc, which avoids the chained-assignment warning
ranked_opponent=results_qtr.RankOpponent.notnull()
won=results_qtr.eventResults=='W'
lost=results_qtr.eventResults=='L'
opponent_higher=results_qtr.Rank>results_qtr.RankOpponent
opponent_lower=results_qtr.Rank<results_qtr.RankOpponent
# a close loss is a three-set loss or a loss by at most three games
close=(results_qtr.sets==3) | (results_qtr.playerGames+3>=results_qtr.opponentGames)

results_qtr.loc[ranked_opponent & won & opponent_higher,'goodWin']=results_qtr.Rank-results_qtr.RankOpponent
results_qtr.loc[ranked_opponent & lost & opponent_higher & close,'goodLoss']=results_qtr.Rank-results_qtr.RankOpponent
results_qtr.loc[ranked_opponent & lost & opponent_lower,'badLoss']=results_qtr.RankOpponent-results_qtr.Rank

# need to do groupby each of the statements below along with a count() of total wins & losses; win pct
# then append these to each player's value
#results_qtr[results_qtr['goodWin']>0]
#results_qtr[(results_qtr['goodLoss']>0) & (results_qtr.sets<3)]
#results_qtr[results_qtr['badLoss']>0]

winLossQuality=results_qtr.groupby('Rank',as_index=False)[['goodWin','goodLoss','badLoss']].sum()
completed_matches=results_qtr[results_qtr.eventResults!='U'].groupby(['Rank'],as_index=False)['eventResults'].count()
wins=results_qtr[results_qtr.eventResults=='W'].groupby(['Rank'],as_index=False)['eventResults'].count()

winLossQuality.rename(columns={'goodWin':'goodWinSum','goodLoss':'goodLossSum','badLoss':'badLossSum'},inplace=True)
completed_matches.rename(columns={'eventResults':'totalMatches'},inplace=True)
wins.rename(columns={'eventResults':'totalWins'},inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [105]:
tmp=playerRankings[['Rank','Player','DOB','Player ID']][playerRankings.Rank<=500]
summary_qtr = pd.merge(tmp, winLossQuality, how='left', left_on=['Rank'], right_on=['Rank'])
summary_qtr = pd.merge(summary_qtr, completed_matches, how='left', left_on=['Rank'], right_on=['Rank'])
summary_qtr = pd.merge(summary_qtr, wins, how='left', left_on=['Rank'], right_on=['Rank'])

summary_qtr['winPct']=summary_qtr.totalWins/summary_qtr.totalMatches
summary_qtr['winScalar']=(summary_qtr.goodWinSum+summary_qtr.goodLossSum*0.5)/summary_qtr.Rank

summary_qtr.to_csv('./output/summary_qtr.csv', encoding='utf-8', index=False)

In [109]:
results_qtr[results_qtr.Rank==256]
results_qtr[results_qtr.Rank==97]
results_qtr[results_qtr.Rank==490]
results_qtr[results_qtr.Rank==341]

Unnamed: 0,Rank,RankOpponent,eventResults,eventScore,sets,playerGames,opponentGames,goodWin,goodLoss,badLoss
3000,341,672.0,W,6-3 7-6,2.0,13.0,9.0,0,0,0
3001,341,168.0,W,6-1 6-3,2.0,12.0,4.0,173,0,0
3002,341,94.0,L,4-6 1-6,2.0,5.0,12.0,0,0,0
3003,341,2412.0,W,6-1 6-2,2.0,12.0,3.0,0,0,0
3004,341,171.0,L,4-6 4-6,2.0,8.0,12.0,0,0,0
3005,341,739.0,L,4-6 4-6,2.0,8.0,12.0,0,0,398
3006,341,1519.0,W,6-2 6-1,2.0,12.0,3.0,0,0,0
3007,341,139.0,W,1-6 6-4 7-6,3.0,14.0,16.0,202,0,0
3008,341,286.0,U,2-6,1.0,2.0,6.0,0,0,0
3009,341,111.0,W,6-1 6-4,2.0,12.0,5.0,230,0,0


Unnamed: 0,Rank,Player,DOB,Player ID
0,1,"TSENG, Chun Hsin",08 Aug 2001,100274778
1,2,"BAEZ, Sebastian",28 Dec 2000,100282506
2,3,"KORDA, Sebastian",05 Jul 2000,100291609
3,4,"GASTON, Hugo",26 Sep 2000,100197802
4,5,"MUSETTI, Lorenzo",03 Mar 2002,100255913
5,6,"MEJIA, Nicolas",11 Feb 2000,100226751
6,7,"ANDREEV, Adrian",12 May 2001,100204396
7,8,"SEYBOTH WILD, Thiago",10 Mar 2000,100286164
8,9,"DRAPER, Jack",22 Dec 2001,100227974
9,10,"NAKASHIMA, Brandon",03 Aug 2001,100281928


In [None]:
html_source = driver.page_source
soup=BeautifulSoup(driver.page_source, 'lxml')

ranking_table=soup.find('div', {'id': 'divRankingResults'}).find('table').find('tbody').find_all('tr')
ranking_table[1].text.split()


In [26]:

tournaments_url_base = 'https://www.itftennis.com/juniors/tournaments/tournament/info.aspx?tournamentid'
tournament_ids=['1100042183', '1100041293']
tournament_ids=['1100042183']

In [32]:
def get_tournament(driver, tournament_id, base_url='https://www.itftennis.com/juniors/tournaments/tournament/info.aspx?tournamentid'):
    url = '{}={}'.format(base_url, tournament_id)
    print(url)
    driver.implicitly_wait(30)
    driver.get(url)
    
    
    driver.find_element("xpath", '//*[@id="__tab_tabResults"]/span').click()
    soup_level1=BeautifulSoup(driver.page_source, 'lxml')
    # XPath of a results link to click next, e.g.: //*[@id="lnkLinkToResults1100336899"]
    
    
driver = webdriver.Firefox()
for tournament_id in tournament_ids:
    get_tournament(driver, tournament_id)



https://www.itftennis.com/juniors/tournaments/tournament/info.aspx?tournamentid=1100042183


In [31]:
tags=soup.find_all('a')
for tag in tags:
    print(tag['href'])

In [None]:
url = 'https://www.itftennis.com/juniors/tournaments/tournament/info.aspx?tournamentid=1100042329'
driver.implicitly_wait(30)
driver.get(url)
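# find_element("xpath", ...) is available in Selenium 3 and 4 (the find_element_by_* helpers were removed in Selenium 4.3)
driver.find_element("xpath", '//*[@id="__tab_tabResults"]/span').click()
driver.find_element("xpath", '//*[@id="lnkLinkToResults1100336896"]').click()
#print(driver.page_source)
print(driver.find_element("xpath", '//*[@id="lnkPlayerW1"]').text)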

In [11]:
soup = BeautifulSoup(data, 'html.parser')

In [4]:
response

<Response [200]>

In [5]:
soup = BeautifulSoup(response.text, 'html.parser')

In [12]:
soup.text

'\n\n\n\n\n\n'

In [57]:
soup.find('div', {'id': 'divTourResults'})
#soup.find_all('td')
#soup.find_all('table')[3]

In [58]:
soup.find('div', {'id': 'divTourResults'})

In [7]:
soup.find_all('table')[3].find_all('td')[0]

<td style=" text-align:left; padding-left:3px; width:10%;">
                                        Date
                                    </td>

In [8]:
soup.find_all('table')[3].find_all('td')[7]

<td style=" text-align:left; padding-left:3px; width:10%;">
                                        5/24/2018
                                    </td>

In [36]:
soup.find_all('table')[3].find_all('td')[14]

<td style=" text-align:left; padding-left:3px; width:10%;">
                                        5/23/2018
                                    </td>

In [43]:
# for loop. load dates (increment by 7 until)

In [65]:
len(soup.find_all('table')[3].find_all('td'))

308

In [154]:
date_column=[]
idx=0
tag_cnt=len(soup.find_all('table')[3].find_all('td'))
while idx<tag_cnt:
    my_output=soup.find_all('table')[3].find_all('td')[idx]
    my_output=str(my_output).split('\r\n')[1].strip()
    idx=idx+7
    date_column.append(my_output)
    #print(my_output)

In [155]:
date_column

['Date',
 '5/24/2018',
 '5/23/2018',
 '5/18/2018',
 '5/12/2018',
 '5/11/2018',
 '4/28/2018',
 '4/22/2018',
 '4/21/2018',
 '4/15/2018',
 '4/14/2018',
 '3/16/2018',
 '3/14/2018',
 '3/12/2018',
 '3/3/2018',
 '2/18/2018',
 '2/17/2018',
 '2/16/2018',
 '2/11/2018',
 '2/3/2018',
 '1/28/2018',
 '1/27/2018',
 '1/20/2018',
 '1/15/2018',
 '1/14/2018',
 '1/13/2018',
 '11/2/2017',
 '11/1/2017',
 '10/23/2017',
 '10/22/2017',
 '10/21/2017',
 '10/21/2017',
 '10/20/2017',
 '10/20/2017',
 '10/6/2017',
 '10/5/2017',
 '9/24/2017',
 '9/23/2017',
 '9/22/2017',
 '9/21/2017',
 '9/21/2017',
 '9/17/2017',
 '9/16/2017',
 '9/15/2017']

In [157]:
event_column=[]
idx=1
tag_cnt=len(soup.find_all('table')[3].find_all('td'))
while idx<tag_cnt:
    my_output=soup.find_all('table')[3].find_all('td')[idx]
    my_output=str(my_output).split('</font></a>')[0].strip()
    my_output=my_output.split('>')[-1].strip()
    idx=idx+7
    event_column.append(my_output)
    #print(my_output)

In [162]:
event_column[26]

'2017 Oracle ITA National Fall Championships'
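
The two loops above walk a flat list of `td` cells in steps of 7 (one step per table row). The same idea can be written with slicing, as in the sketch below. The helper names and most of the sample cells are hypothetical placeholders. Only the `Date` header, the two dates and the first event name come from the outputs above.

```python
def column(cells, col, n_cols=7):
    """Every n_cols-th cell starting at position col, i.e. one table column."""
    return cells[col::n_cols]


def rows(cells, n_cols=7):
    """Regroup the flat cell list into rows of n_cols cells."""
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


# Flat list of cell texts: a header row followed by two result rows.
cells = (
    ['Date', 'Event', 'c2', 'c3', 'c4', 'c5', 'c6']
    + ['5/24/2018', 'NCAA Division I Individual Championships', '', '', '', '', '']
    + ['5/23/2018', 'Event B', '', '', '', '', '']
)

print(column(cells, 0))
print(column(cells, 1))
```

Extracting the cell texts once and slicing them avoids calling `soup.find_all('table')[3].find_all('td')` again on every iteration.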

In [142]:
soup.find_all('table')[3].find_all('td')[8+7*0]

<td style=" text-align:left; padding-left:3px; width:20%;">
<a id="ContentPlaceHolder1_rptDisplayPlayerSingleResult_lnkEvent_0"><font color="#10425D">NCAA Division I Individual Championships</font></a>
</td>

In [123]:
# code all columns
# load into dataframe
# add meta-data
# loop thru every player

In [124]:
import re
contactInfo = 'Doe, John: 555-1212'
re.search(r'\w+, \w+: \S+', contactInfo)

<_sre.SRE_Match object; span=(0, 19), match='Doe, John: 555-1212'>

In [125]:
match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)

In [127]:
match.group(1)

'Doe'

In [128]:
match.group(2)

'John'

In [129]:
match.group(3)

'555-1212'

In [132]:
match.group(0)

'Doe, John: 555-1212'

In [136]:
nick=re.match(r'dog', 'dog cat dog')
nick.group(0)

'dog'

IndexError: no such group