# Clustering Week to Week Fantasy Football Projections

In this notebook we explore how to categorize player performances based on how many weeks into the fantasy football season we are. The notebook also serves as a walkthrough of the entirety of the data science pipeline, and I document my thoughts and ideas about the project along the way.

First, let us consider what questions this analysis and data have to answer. In industry, this is akin to understanding the business requirements and what the stakeholders want out of your work. The scenario here is slightly different because this is a personal project, but imagine that the work is being done for somebody else who is not technical. How can we take what they've told us and turn it into a list of requirements?

The outcome of this analysis should include the following:

* A dataset with player results for the current year.
* A dataset with week-to-week fantasy results for players in completed seasons.
* An analysis of how trends look overall and week to week.
* A paradigm that allows me to compare players up until the current week.

In [256]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import time
from sklearn.cluster import KMeans

## Data Collection

The first part of the data science pipeline is data collection. In this step we need to understand what type of data we want and where to get it, and then develop scripts to access it. Since the data we are looking for is not in a database I can access, we need to look on the internet for results that match what we need. Luckily, I found a website that has all the data I need; now it's just a matter of whether they will let me access it.

The library that I am going to use is called BeautifulSoup. It parses the HTML of a web page (which we download with `requests`) so that you can pull out the pieces you want, such as a table, and then put them into a dataframe format.
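
As a minimal sketch of that idea, the snippet below parses a small hard-coded HTML table and turns its rows into a dataframe. The HTML, player names, and scores are made up for illustration (a stand-in for a downloaded page, so no network request is needed); they are not data from the site.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page; a real page would come
# from requests.get(url).text.
html = """
<table>
  <tr><th>Player</th><th>FPts</th></tr>
  <tr><td>Player A</td><td>21.5</td></tr>
  <tr><td>Player B</td><td>14.0</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table').find_all('tr')

# The first row holds the header cells, the remaining rows hold the data cells.
header = [cell.get_text(strip=True) for cell in rows[0].find_all('th')]
records = [[cell.get_text(strip=True) for cell in row.find_all('td')] for row in rows[1:]]

example_df = pd.DataFrame(records, columns=header)
example_df['FPts'] = example_df['FPts'].astype(float)  # scraped cells arrive as strings
print(example_df)
```

Real pages are rarely this tidy, which is why the scraper in this notebook has to index into nested tables by hand.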

In [2]:
LINK_PREF = 'http://fftoday.com/stats/playerstats.php?'
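
Each page's address is this prefix followed by a query string. A hedged sketch of that step using the standard library's `urlencode`: the helper name `build_link` is hypothetical, while the parameter names and ids are the ones `getData` uses.

```python
from urllib.parse import urlencode

LINK_PREF = 'http://fftoday.com/stats/playerstats.php?'

def build_link(prefix, year, week, pos_id, league_id):
    # Hypothetical helper: appends the query parameters to the prefix.
    params = {'Season': year, 'GameWeek': week, 'PosID': pos_id, 'LeagueID': league_id}
    return prefix + urlencode(params)

# Week 3 of 2017 for quarterbacks (PosID 10) in the PPR league (LeagueID 107644).
example_link = build_link(LINK_PREF, 2017, 3, '10', '107644')
print(example_link)
```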

In [195]:
def getData(link,year,week,pos,scoring='PPR'):
    '''
    Used to extract data from the website regarding fantasy stats per week for the top players in the scoring format.
    '''
    time.sleep(2) # to make sure we don't time out
    posId ={
        'QB':'10', #ids correspond to ids in link
        'RB':'20',
        'WR':'30',
        'TE':'40',
        'K':'80',
        'DST':'99'
    }
    scoringID = {
        'PPR':'107644'
    }
    full_link = link + 'Season=' + str(year) + '&GameWeek=' + str(week) + '&PosID=' + posId[pos] + '&LeagueID=' + scoringID[scoring]
    response = requests.get(full_link)
    html = response.text.encode()
    soup = BeautifulSoup(html,'html.parser') # use beautifulsoup to parse through html and find specific html tags that correspond to the table
    body = soup.find('td',{'class':'bodycontent'})
    table = body.find_all('table')[5].find_all('tr')[2:] # subset arrays to find actual table (hard-coded)
    table = [[y for y in x.text.split('\n') if y!=''] for x in table] # to later turn into dataframe
    if pos != 'DST':
        table[0] = [table[0][0]] + table[0][4:]
    return table
def parseFantasyData(link):
    '''
    Wrapper function that iterates through all positions, all years and all weeks.
    Returns a dictionary of positions with a list of dataframes.
    '''
    fantasyData = {}
    for pos in ['QB','RB','WR','TE','K','DST']:
        fantasyData[pos]=[]
        for year in range(2010,2018): # seasons 2010 through 2017
            for week in range(1,18): # weeks 1 through 17
                #print(pos,year,week)
                data=getData(link,year,week,pos)
                data = pd.DataFrame(columns=data[0],data=data[1:])
                data['POS'] = pos
                data['year']=year
                data['week']=week
                if pos == 'DST':
                    data=data.rename(columns={'Team':'Player'})
                fantasyData[pos].append(data[['Player','POS','year','week','FPts']])
    return fantasyData

def fixNames(dataframe):
    '''
    Removes weird formatting and weekly rank so that dataframe's Player column only has the first and last name of a player.
    Also includes Jr. if applicable.
    '''
    player=[' '.join(x[1:]) for x in dataframe['Player'].str.split(' ').values] # drop the first token (weekly rank and stray formatting), keep the name parts
    dataframeCopy = dataframe.copy(deep=True).drop('Player',axis=1)
    dataframeCopy['Player']=player
    return dataframeCopy

def pivotData(dataframe):
    '''
    pivots the data and does some data cleaning (fixes names and transforms FPts to a float.)
    '''
    dataframeNamesFixed = fixNames(dataframe)
    dataframeNamesFixed['FPts'] = dataframeNamesFixed['FPts'].astype(float)
    dataframeNamesFixed = pd.pivot_table(dataframeNamesFixed,values='FPts',index=['Player','POS','year'],columns=['week'],fill_value=0).reset_index()
    return dataframeNamesFixed

In [102]:
data=parseFantasyData(LINK_PREF)

In [197]:
dataMerged = {pos:pivotData(pd.concat(data[pos])) for pos in ['QB','RB','WR','TE','K','DST']}
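
To make the reshaping concrete, here is a small sketch of the same `pd.pivot_table` call that `pivotData` uses, applied to a made-up long-format frame (the players and scores are invented for illustration). Each output row is one player-season, each week becomes a column, and weeks with no record are filled with 0.

```python
import pandas as pd

# Hypothetical long-format data: one row per player per week.
long_df = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player B'],
    'POS': ['QB', 'QB', 'QB'],
    'year': [2017, 2017, 2017],
    'week': [1, 2, 1],
    'FPts': [20.0, 15.5, 8.0],
})

# One row per (Player, POS, year); one column per week; missing weeks become 0.
wide_df = pd.pivot_table(long_df, values='FPts', index=['Player', 'POS', 'year'],
                         columns=['week'], fill_value=0).reset_index()
print(wide_df)
```

Note that a filled 0 cannot be told apart from a week in which a player actually scored nothing, which matters when comparing players by distance.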

## Data Analysis

To get the data into the format we needed, we first gathered it from the website and then pivoted it so that each row holds one player-season with one column per week.
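
The comparison functions in this section measure similarity with Euclidean distance. Writing $p_w$ for the current player's points in week $w$ and $q_w$ for a past player-season's points in the same week (symbols introduced here only for illustration), the distance through week $W$ is:

$$
d(p, q) = \sqrt{\sum_{w=1}^{W} (p_w - q_w)^2}
$$

Smaller distances mean the two seasons started more similarly.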

In [302]:

def getComparison(playerStats,dataDictionary,player,pos,year,week):
    '''
    Takes a player's weekly points up to the current week of the season and ranks all player-seasons
    from earlier years at the same position by euclidean distance (closest first).
    '''
    weeks = list(range(1,week+1))
    posData = dataDictionary[pos]
    data = posData.loc[posData['year']<year,['Player','POS','year']+weeks].copy(deep=True)

    # euclidean (L2) distance between the player's weeks and each past player-season
    diffs = data[weeks].values.astype(float) - np.asarray(playerStats,dtype=float)
    data['distance'] = np.sqrt(np.square(diffs).sum(axis=1))

    return data.sort_values('distance')[['Player','year']]

def getForecastedAverage(current_data,dataDictionary,player,pos,year,week,n=5):
    '''
    Finds the n past player-seasons closest to the player's season so far and averages
    their points over the next three weeks.
    '''
    # every position has the same layout after pivotData, so select the week columns by label
    playerStatsPast = current_data.loc[current_data['Player']==player,list(range(1,week+1))].values[0]
    similarPlayers = getComparison(playerStatsPast,dataDictionary,player,pos,year,week).iloc[:n]
    nextWeeks = list(range(week+1,week+4))
    similarPlayers = pd.merge(similarPlayers,dataDictionary[pos][['Player','POS','year']+nextWeeks])[nextWeeks]
    return similarPlayers.mean()

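def getPlayerFuture(current_data,player,week):
    '''
    Returns the player's actual points for the three weeks after the given week.
    '''
    # select by week label so no position-specific offset is needed; skip weeks past the end of the season
    nextWeeks = [w for w in range(week+1,week+4) if w in current_data.columns]
    playerStatsFuture = current_data.loc[current_data['Player']==player,nextWeeks].values[0]
    return playerStatsFuture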

def testSim(dataDictionary,n=5):
    for pos in ['QB','RB','WR','TE','K']:
        for year in [2015,2016,2017]:
            testList = dataDictionary[pos].loc[(dataDictionary[pos]['year']==year)]
            print(testList)
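
The approach in the functions above can be sketched on a tiny made-up table: rank past player-seasons by Euclidean distance over the weeks played so far, then average what the closest ones scored in the following week. All names and numbers below are invented for illustration, and this is a simplified sketch rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical past seasons: weeks 1-2 are "so far", week 3 is the "future".
past = pd.DataFrame({
    'Player': ['Past A', 'Past B', 'Past C'],
    'year': [2015, 2016, 2016],
    1: [20.0, 5.0, 18.0],
    2: [22.0, 4.0, 25.0],
    3: [30.0, 2.0, 10.0],
})
current = np.array([19.0, 23.0])  # current player's points in weeks 1-2

# Euclidean distance between the current player and each past season.
played = [1, 2]
past['distance'] = np.sqrt(np.square(past[played].values - current).sum(axis=1))

# Keep the two closest seasons and average their week-3 points as the forecast.
nearest = past.sort_values('distance').head(2)
forecast = nearest[3].mean()
print(nearest[['Player', 'year', 'distance']])
print(forecast)
```

Here the two nearest seasons are "Past A" and "Past C", so the forecast is the mean of their week-3 scores.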

In [291]:
getForecastedAverageError(dataMerged['QB'].loc[dataMerged['QB']['year']==2017],dataMerged,'Alex Smith','QB',2017,3)

week
4     9.86
5    16.58
6     1.52
dtype: object

In [303]:
testSim(dataMerged)

week               Player POS  year     1     2     3     4     5     6     7  \
index                                                                           
1             AJ McCarron  QB  2015   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
8           Aaron Rodgers  QB  2015  25.0  22.8  38.3  18.5  24.0  22.2   0.0   
16             Alex Smith  QB  2015  25.7  11.1  21.8  21.8  15.2  19.1  17.1   
19            Alex Tanney  QB  2015   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
23            Andrew Luck  QB  2015  22.2  18.9  23.1   0.0   0.0  31.1  31.4   
29            Andy Dalton  QB  2015  21.6  23.7  38.2  21.7  32.4  24.0   0.0   
33           Austin Davis  QB  2015   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
35           B.J. Daniels  QB  2015   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
41     Ben Roethlisberger  QB  2015  21.6  30.4   9.6   0.0   0.0   0.0   0.0   
50         Blaine Gabbert  QB  2015   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
54          Blake Bortles  Q

[107 rows x 20 columns]
week                    Player POS  year     1     2     3     4     5     6  \
index                                                                          
0                   A.J. Derby  TE  2016   0.0   0.0   0.0   0.0   0.0   0.0   
7                   Alan Cross  TE  2016   0.0   0.0   0.0   0.0   1.5   0.0   
9                   Alex Ellis  TE  2016   0.0   0.0   0.0   0.0   0.0   0.0   
29              Anthony Fasano  TE  2016   0.0   0.0   0.0   0.0   2.0   8.5   
40               Antonio Gates  TE  2016   5.0  10.5   0.0   0.0  13.0   3.6   
43               Austin Hooper  TE  2016   2.4  11.4   0.0  11.2   2.4   0.0   
47     Austin Seferian-Jenkins  TE  2016  10.0   3.4   0.0   0.0   3.7   0.0   
55              Ben Braunecker  TE  2016   0.0   0.0   0.0   0.0   0.0   0.0   
58                  Ben Koyack  TE  2016   0.0   0.0   0.0   0.0   0.0   0.0   
73                  Blake Bell  TE  2016   0.0   1.6   0.0   0.0   1.4   0.0   
77              

In [282]:
dataMerged['QB'].loc[(dataMerged['QB']['year']==2017) & (dataMerged['QB']['Player']=='Alex Smith')]

| index | Player | POS | year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | Alex Smith | QB | 2017 | 34.7 | 18.7 | 16.7 | 30.3 | 30.1 | 17.6 | 29.1 | 17.4 | 23.1 | 0.0 | 14.2 | 17.5 | 41.3 | 15.8 | 20.9 | 20.5 | 0.0 |
