# Project 2

## Predicting NFL All-Pros

In [1]:
import pandas as pd

Here we collect the All-Pro data that www.pro-football-reference.com holds for every season from 1970 (the year of the AFL-NFL merger)
through 2021 (the most recently completed season).

In [2]:
df_ap = pd.DataFrame()

for year in range(1970, 2022):
    url = 'https://www.pro-football-reference.com/years/' + str(year) + '/allpro.htm'

    df_url = pd.read_html(url)[0]

    df_url.drop(columns = ['Yrs', 'Cmp', 'Att', 'Yds', 'TD', 'Int', 'Att.1', 'Yds.1', 'TD.1', 
                           'Rec', 'Yds.2', 'TD.2', 'Solo', 'Sk', 'Int.1'], inplace = True)
    
    df_url['Year'] = year

    df_temp = [df_ap, df_url]

    df_ap = pd.concat(df_temp)
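
The scraping loops in this notebook all fetch one table per year and stack the results, so the pattern can be captured in a small helper. This is a minimal sketch, not the original code: `scrape_years`, `fetch`, and `delay` are hypothetical names, and the pause between requests is an assumption (the site may throttle rapid automated requests).

```python
import time
import pandas as pd

BASE_URL = 'https://www.pro-football-reference.com/years/{year}/{page}.htm'

def scrape_years(years, page, fetch=pd.read_html, delay=3.0):
    """Fetch the first table of `page` for each year and stack the results.

    `fetch` is injectable so the function can be exercised without network
    access; `delay` (seconds) is an assumed courtesy pause between requests.
    """
    frames = []
    for i, year in enumerate(years):
        if i > 0:
            time.sleep(delay)
        table = fetch(BASE_URL.format(year=year, page=page))[0].copy()
        table['Year'] = year
        frames.append(table)
    # One concat at the end avoids re-copying the accumulated frame each year.
    return pd.concat(frames)
```

Under those assumptions, `scrape_years(range(1970, 2022), 'allpro')` would return the stacked All-Pro tables with a `Year` column, before any columns are dropped.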
    


In [3]:
df_ap.rename(columns = {'All-pro teams': 'all_pro'}, inplace = True)

Dropping everything but QB since we are focusing on quarterbacks.

In [4]:
df_ap = df_ap[df_ap['Pos'] == 'QB']

Dropping all of the rows where a player did not make an AP (Associated Press) team.

In [5]:
df_ap_trimmed = df_ap.loc[df_ap['all_pro'].str.contains('AP')].copy()

Creating one-hot columns for the AP 1st and 2nd Teams.

In [6]:
df_ap_trimmed['ap_1st'] = df_ap_trimmed['all_pro'].apply(lambda teams: int('AP: 1st Tm' in [t.strip() for t in teams.split(',')]))
df_ap_trimmed['ap_2nd'] = df_ap_trimmed['all_pro'].apply(lambda teams: int('AP: 2nd Tm' in [t.strip() for t in teams.split(',')]))
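
To see how the flags are derived, here is a sketch on made-up data. It assumes the `all_pro` strings are comma-separated entries of the form `Voter: Nth Tm`, and it strips whitespace around each entry so the match does not depend on where the AP entry falls in the list. The sample rows and the helper name `has_team` are hypothetical.

```python
import pandas as pd

def has_team(teams, label):
    """Return 1 if `label` is one of the comma-separated entries in `teams`."""
    entries = [entry.strip() for entry in str(teams).split(',')]
    return int(label in entries)

# Made-up rows standing in for the scraped All-Pro table.
sample = pd.DataFrame({
    'player': ['QB A', 'QB B', 'QB C'],
    'all_pro': ['AP: 1st Tm, PFWA: 1st Tm', 'PFWA: 1st Tm, AP: 2nd Tm', 'PFWA: 2nd Tm'],
})
sample['ap_1st'] = sample['all_pro'].apply(has_team, label='AP: 1st Tm')
sample['ap_2nd'] = sample['all_pro'].apply(has_team, label='AP: 2nd Tm')
```

Matching whole entries rather than substrings keeps `AP: 1st Tm` from matching anything that merely contains that text.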



Dropping everyone who wasn't on AP 1st or 2nd team.

In [7]:
df_ap = df_ap_trimmed[(df_ap_trimmed['ap_1st'] + df_ap_trimmed['ap_2nd']) > 0].copy()

Dropping the 'all_pro' column because we don't need it anymore.

In [8]:
df_ap.drop(columns = ['all_pro'], inplace = True)


In [9]:
df_ap.columns = df_ap.columns.str.lower()

Pulling the passing data from www.pro-football-reference.com for the same seasons.

In [10]:
df_pass = pd.DataFrame()

for year in range(1970, 2022):
    url = 'https://www.pro-football-reference.com/years/' + str(year) + '/passing.htm'

    df_url = pd.read_html(url)[0]
    
    df_url = df_url[df_url['Pos'].str.upper() == 'QB'].copy()

    df_url['year'] = year

    df_temp = [df_pass, df_url]

    df_pass = pd.concat(df_temp)

Removing the QBR and 1D columns, because they don't exist for every year.

In [11]:
df_pass.drop(columns = ['QBR', '1D'], inplace = True)

In [12]:
df_pass.columns = df_pass.columns.str.lower()

Splitting 'qbrec' into wins, losses, and ties, then dropping 'qbrec'.

In [13]:
df_pass[['win', 'loses', 'ties']] = df_pass['qbrec'].str.split('-', expand = True)

In [14]:
df_pass.drop(columns = 'qbrec', inplace = True)
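
A sketch of the same split on made-up records, for illustration only. The pieces produced by `str.split` are strings, so this version also converts them to numbers; that conversion is an addition, not something the original cells do.

```python
import pandas as pd

# Made-up rows with a win-loss-tie record in the same 'W-L-T' shape.
records = pd.DataFrame({'player': ['QB A', 'QB B'], 'qbrec': ['10-5-1', '3-8-0']})

parts = records['qbrec'].str.split('-', expand=True)
parts.columns = ['win', 'loses', 'ties']
# The split pieces are strings, so convert them before doing arithmetic.
parts = parts.apply(pd.to_numeric)
records = pd.concat([records.drop(columns='qbrec'), parts], axis=1)
```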

Capitalizing all of 'pos' so it is consistent.

In [15]:
df_pass['pos'] = df_pass['pos'].apply(str.upper)

Replacing NaNs with 0, because a missing value means the player recorded none of that stat.

In [16]:
df_pass.fillna(0, inplace = True)

Renaming columns, mostly to differentiate them from the rushing stats later.

In [17]:
df_pass.rename(columns = {'att': 'pass_att', 'yds': 'pass_yds', 'td': 'pass_td', 'lng': 'pass_lng', 'y/a': 'pass_y/a',
                           'y/g': 'pass_y/g', 'yds.1': 'yards_lost_sack', 'rk': 'pass_rk'}, inplace = True)

In [18]:
df_rush = pd.DataFrame()

for year in range(1970, 2022):
    url = 'https://www.pro-football-reference.com/years/' + str(year) + '/rushing.htm'

    df_url = pd.read_html(url)[0]

    df_url = df_url[df_url[('Unnamed: 4_level_0', 'Pos')].str.upper() == 'QB'].copy()

    df_url['Year'] = year

    df_temp = [df_rush, df_url]

    df_rush = pd.concat(df_temp)

The rushing table comes back with a two-level header, so I'm renaming all the columns by hand.

In [19]:
df_rush.columns = ['rush_rk', 'player', 'tm', 'age', 'pos', 'g', 'gs', 'rush_att', 'rush_yds', 
                   'rush_td', 'rush_lng', 'rush_y/a', 'rush_y/g', 'fmb.1', 'year', 'drop', 'fmb.2']

Dropping the column named 'drop' above, which holds 1D (first downs), because it was not always tracked.

In [20]:
df_rush.drop(columns = ['drop'], inplace = True)
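
An alternative to renaming by hand is to flatten the two-level header of each year's table before concatenating. This is a sketch on a made-up frame, assuming the bottom-level names are unique within a table; that may not hold for every column of the real rushing table.

```python
import pandas as pd

# Made-up stand-in for a table parsed with a two-row header.
raw = pd.DataFrame(
    [[1, 'QB A', 'QB', 50, 300], [2, 'RB B', 'RB', 200, 900]],
    columns=pd.MultiIndex.from_tuples([
        ('Unnamed: 0_level_0', 'Rk'),
        ('Unnamed: 1_level_0', 'Player'),
        ('Unnamed: 2_level_0', 'Pos'),
        ('Rushing', 'Att'),
        ('Rushing', 'Yds'),
    ]),
)

# Keep only the bottom header row, then lower-case it.
flat = raw.copy()
flat.columns = flat.columns.get_level_values(-1).str.lower()
qbs = flat[flat['pos'].str.upper() == 'QB'].copy()
```

With flat names in place, the position filter can use `'pos'` directly instead of the `('Unnamed: 4_level_0', 'Pos')` tuple.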

All other stats should be 0 if they don't exist.

In [21]:
df_rush.fillna(0, inplace = True)

Because of the table's odd formatting, fumbles ended up in two different columns. Each year appears to have
used only one of the two, so adding them together recovers a single fumbles column.

In [22]:
df_rush['fmb'] = df_rush['fmb.1'].astype('int') + df_rush['fmb.2'].astype('int')

In [23]:
df_rush.drop(columns = ['fmb.1', 'fmb.2'], inplace = True)
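
The sketch below repeats the fumble fix on made-up rows and adds a guard for the assumption that no row has a value in both columns. The data and the intermediate name `filled` are hypothetical.

```python
import pandas as pd

# Made-up rows: each row has a value in at most one of the two fumble columns.
rush = pd.DataFrame({
    'player': ['QB A', 'QB B', 'QB C'],
    'fmb.1': ['3', None, None],
    'fmb.2': [None, '5', None],
})

filled = rush[['fmb.1', 'fmb.2']].apply(pd.to_numeric).fillna(0).astype(int)
# Guard the assumption that no row has a value in both columns.
assert ((filled['fmb.1'] == 0) | (filled['fmb.2'] == 0)).all()
rush['fmb'] = filled.sum(axis=1)
rush = rush.drop(columns=['fmb.1', 'fmb.2'])
```

If the guard ever failed, adding the columns would double count, and the two sources would need a closer look.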

In [24]:
df_rush['pos'] = df_rush['pos'].apply(str.upper)

Using an outer join to combine the passing and rushing tables, so a quarterback who appears in only one of them is still kept.

In [25]:
df_p_r = pd.merge(df_pass, df_rush, how = 'outer', on = ['player', 'year', 'tm', 'age', 'pos', 'g', 'gs'])
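
To check how many rows matched in an outer join, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. This is a sketch on made-up frames with a reduced key list; the real merge above also joins on `tm`, `age`, `pos`, `g`, and `gs`.

```python
import pandas as pd

# Made-up passing and rushing rows; 'QB A' is the only player in both.
passing = pd.DataFrame({'player': ['QB A', 'QB B'], 'year': [2020, 2020], 'pass_yds': [4000, 3500]})
rushing = pd.DataFrame({'player': ['QB A', 'QB C'], 'year': [2020, 2020], 'rush_yds': [300, 40]})

merged = pd.merge(passing, rushing, how='outer', on=['player', 'year'], indicator=True)
# Rows found in only one table get NaN for the other table's stats.
merged[['pass_yds', 'rush_yds']] = merged[['pass_yds', 'rush_yds']].fillna(0)
counts = merged['_merge'].value_counts()
```

A large `left_only` or `right_only` count on the real data could point to join keys that disagree between the two tables, for example a key stored as text in one table and as a number in the other.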

In [27]:
df_p_r[['age', 'g', 'gs']] = df_p_r[['age', 'g', 'gs']].astype(int)

In [28]:
df = pd.merge(df_p_r, df_ap, how = "left", on = ['player', 'year', 'tm', 'age', 'pos', 'g', 'gs'] )

In [37]:
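# The joins leave NaN where a player has no All-Pro or fumble entry; treat those as 0 before casting.
df[['ap_1st', 'ap_2nd', 'fmb']] = df[['ap_1st', 'ap_2nd', 'fmb']].fillna(0).astype(int)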
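
A left join keeps every row of the stats table and leaves the All-Pro flags missing for players who did not make a team. Here is a sketch on made-up frames, with a reduced key list, of how those gaps become 0 labels.

```python
import pandas as pd

# Made-up stats and All-Pro rows; 'QB C' made no team.
stats = pd.DataFrame({'player': ['QB A', 'QB B', 'QB C'], 'year': [2020, 2020, 2020]})
all_pros = pd.DataFrame({'player': ['QB A', 'QB B'], 'year': [2020, 2020],
                         'ap_1st': [1, 0], 'ap_2nd': [0, 1]})

labeled = pd.merge(stats, all_pros, how='left', on=['player', 'year'])
# Unmatched rows come back as NaN, which also turns the flag columns into floats.
flags = ['ap_1st', 'ap_2nd']
labeled[flags] = labeled[flags].fillna(0).astype(int)
```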

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3818 entries, 0 to 3817
Data columns (total 42 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   pass_rk          3818 non-null   object
 1   player           3818 non-null   object
 2   tm               3818 non-null   object
 3   age              3818 non-null   int64 
 4   pos              3818 non-null   object
 5   g                3818 non-null   int64 
 6   gs               3818 non-null   int64 
 7   cmp              3818 non-null   object
 8   pass_att         3818 non-null   object
 9   cmp%             3818 non-null   object
 10  yds              3818 non-null   object
 11  pass_td          3818 non-null   object
 12  td%              3818 non-null   object
 13  int              3818 non-null   object
 14  int%             3818 non-null   object
 15  pass_lng         3818 non-null   object
 16  pass_y/a         3818 non-null   object
 17  ay/a             3818 non-null   
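
Most stat columns in the listing above have dtype `object`, which usually means the values are stored as strings. If numeric columns are wanted for modelling, one option is `pd.to_numeric`. This is a sketch on made-up rows, not a step from the original notebook.

```python
import pandas as pd

# Made-up rows; scraped tables often arrive as strings (dtype object).
df_demo = pd.DataFrame({
    'player': ['QB A', 'QB B'],
    'cmp': ['350', '280'],
    'cmp%': ['65.4', '61.0'],
})

stat_cols = ['cmp', 'cmp%']
# errors='coerce' turns anything unparseable into NaN instead of raising.
df_demo[stat_cols] = df_demo[stat_cols].apply(pd.to_numeric, errors='coerce')
```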

In [34]:
df.fillna(0, inplace = True)

In [None]:
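# Unfinished, never-executed cell: the column list is incomplete, so running it drops every column not listed.
df = df[['year', 'player', 'tm', 'age', 'pos', 'g', 'gs', 'pass_rk', 'cmp', 'pass_att', 'cmp%']]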

In [38]:
df.to_csv('nfl_qb.csv', index = False)