# Data Collection

In this notebook, we collect and organize the source data for the project. All of the data comes from Kaggle.
We will take a first look at the data and perform some basic exploratory data analysis to understand its structure and quality.
We will also perform some data cleaning and data wrangling to prepare the data for the next steps in the project.

# Imports and loads

In [168]:
### Imports

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [196]:
#Load Data
injury_data = pd.read_csv('./src/injuries.csv')
team_data = pd.read_csv('./src/team_rosters.csv')
#Combine data is loaded so that we have columns for measuring player stats (using it is optional)
player_data = pd.read_csv('./src/combine_data.csv')

print(f'Injury Shape: {injury_data.shape}, Team Shape: {team_data.shape}, Player Shape: {player_data.shape}')

Injury Shape: (5682, 16), Team Shape: (46163, 37), Player Shape: (1797, 18)


# Data Discovery

## Injury Data

In [170]:
injury_data.head()

Unnamed: 0,season,game_type,team,week,gsis_id,position,full_name,first_name,last_name,report_primary_injury,report_secondary_injury,report_status,practice_primary_injury,practice_secondary_injury,practice_status,date_modified
0,2022,REG,ARI,1,00-0027993,C,Rodney Hudson,Rodney,Hudson,,,,Not injury related - resting player,,Did Not Participate In Practice,2022-09-07 21:10:03+00:00
1,2022,REG,ARI,1,00-0028946,LS,Aaron Brewer,Aaron,Brewer,,,,Ankle,,Full Participation in Practice,2022-09-09 19:55:06+00:00
2,2022,REG,ARI,1,00-0032127,LB,Markus Golden,Markus,Golden,,,,Toe,,Full Participation in Practice,2022-09-09 19:55:29+00:00
3,2022,REG,ARI,1,00-0034490,LB,Ezekiel Turner,Ezekiel,Turner,,,,Shoulder,,Full Participation in Practice,2022-09-09 19:55:44+00:00
4,2022,REG,ARI,1,00-0035924,RB,Jonathan Ward,Jonathan,Ward,,,,Shoulder,,Full Participation in Practice,2022-09-09 19:55:49+00:00


In [171]:
print(injury_data.dtypes)

season                        int64
game_type                    object
team                         object
week                          int64
gsis_id                      object
position                     object
full_name                    object
first_name                   object
last_name                    object
report_primary_injury        object
report_secondary_injury      object
report_status                object
practice_primary_injury      object
practice_secondary_injury    object
practice_status              object
date_modified                object
dtype: object


In [172]:
injury_data.describe()

Unnamed: 0,season,week
count,5682.0,5682.0
mean,2022.0,10.288631
std,0.0,5.344045
min,2022.0,1.0
25%,2022.0,6.0
50%,2022.0,10.0
75%,2022.0,15.0
max,2022.0,22.0


In [173]:
injury_data.isnull().sum()

season                          0
game_type                       0
team                            0
week                            0
gsis_id                         0
position                        0
full_name                       0
first_name                      0
last_name                       0
report_primary_injury        2937
report_secondary_injury      5483
report_status                2937
practice_primary_injury        27
practice_secondary_injury    5277
practice_status                 0
date_modified                   0
dtype: int64

## Team Data

In [174]:
team_data.head()

Unnamed: 0,season,team,position,depth_chart_position,jersey_number,status,player_name,first_name,last_name,birth_date,...,status_description_abbr,football_name,esb_id,gsis_it_id,smart_id,entry_year,rookie_year,draft_club,draft_number,age
0,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.279
1,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.432
2,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.355
3,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.413
4,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.202


In [175]:
team_data.sample(10)

Unnamed: 0,season,team,position,depth_chart_position,jersey_number,status,player_name,first_name,last_name,birth_date,...,status_description_abbr,football_name,esb_id,gsis_it_id,smart_id,entry_year,rookie_year,draft_club,draft_number,age
37808,2022,CHI,DB,CB,39.0,ACT,Josh Blackwell,Josh,Blackwell,1999-04-05,...,A01,Josh,BLA350385,54807,3200424c-4135-0385-1bfd-b61dfb5f22f3,2022,2022.0,,,23.524
15960,2022,CIN,DB,CB,35.0,ACT,Jalen Davis,Jalen,Davis,1996-02-02,...,A01,Jalen,DAV420337,46630,32004441-5642-0337-d407-c1457eea63dc,2018,2018.0,,,26.971
42638,2022,NO,LB,LB,45.0,DEV,Nephi Sewell,Nephi,Sewell,1998-12-19,...,P01,Nephi,SEW440460,55132,32005345-5744-0460-78e0-ad2f28ed6cef,2022,2022.0,,,23.767
1069,2022,NO,RB,RB,22.0,ACT,Mark Ingram,Mark,Ingram,1989-12-21,...,A01,Mark,ING656964,37101,3200494e-4765-6964-19af-4fbe814368d2,2011,2011.0,NO,28.0,32.799
40810,2022,CAR,QB,QB,,CUT,Davis Cheek,Davis,Cheek,1999-02-26,...,A01,Davis,CHE108754,54945,32004348-4510-8754-c5f5-5f95e72cc1db,2022,2022.0,,,23.54
12604,2022,CLE,OL,C,55.0,ACT,Ethan Pocic,Ethan,Pocic,1995-08-05,...,A01,Ethan,POC303385,44870,3200504f-4330-3385-f54c-3031b4454d04,2017,2017.0,SEA,58.0,27.217
42720,2022,LA,RB,RB,30.0,DEV,Ronnie Rivers,Ronnie,Rivers,1999-01-31,...,P01,Ronnie,RIV673384,55145,32005249-5667-3384-6c78-b7eef7619467,2022,2022.0,,,23.671
24699,2022,DET,DL,DT,94.0,ACT,Benito Jones,Benito,Jones,1997-11-27,...,A01,Benito,JON078068,52743,32004a4f-4e07-8068-1dcd-55aeb628d977,2020,2020.0,,,24.961
37311,2022,SEA,LB,LB,92.0,RES,Tyreke Smith,Tyreke,Smith,2000-02-14,...,R01,Tyreke,SMI766470,54623,3200534d-4976-6470-1b88-87c8583e4086,2022,2022.0,SEA,158.0,22.727
39265,2022,HOU,RB,RB,31.0,RES,Dameon Pierce,Dameon,Pierce,2000-02-19,...,R01,Dameon,PIE245478,54572,32005049-4524-5478-115c-8ae11dbeefa6,2022,2022.0,HOU,107.0,22.828


In [176]:
print(team_data.dtypes)

season                       int64
team                        object
position                    object
depth_chart_position        object
jersey_number              float64
status                      object
player_name                 object
first_name                  object
last_name                   object
birth_date                  object
height                     float64
weight                     float64
college                     object
player_id                   object
espn_id                    float64
sportradar_id               object
yahoo_id                   float64
rotowire_id                float64
pff_id                     float64
pfr_id                      object
fantasy_data_id            float64
sleeper_id                 float64
years_exp                    int64
headshot_url                object
ngs_position                object
week                         int64
game_type                   object
status_description_abbr     object
football_name       

In [177]:
team_data.describe()

Unnamed: 0,season,jersey_number,height,weight,espn_id,yahoo_id,rotowire_id,pff_id,fantasy_data_id,sleeper_id,years_exp,week,gsis_it_id,entry_year,rookie_year,draft_number,age
count,46163.0,46018.0,46117.0,46161.0,29914.0,26254.0,32527.0,28350.0,24679.0,32527.0,46163.0,46163.0,46163.0,46163.0,46162.0,29563.0,45638.0
mean,2022.0,49.418793,74.136089,243.614935,3185857.0,30410.416927,13083.695637,34922.637354,19178.279185,5503.382052,3.360743,9.929359,48864.554383,2018.639257,2018.647892,112.274465,26.555763
std,0.0,29.247307,2.659819,48.06996,1298371.0,3744.692389,2418.622945,23191.863649,3062.447674,2212.790241,3.098262,5.657399,5398.989262,3.098262,3.102625,73.357222,3.094677
min,2022.0,1.0,66.0,0.0,2330.0,5228.0,1350.0,698.0,430.0,13.0,0.0,1.0,25511.0,2000.0,2000.0,1.0,21.002
25%,2022.0,24.0,72.0,204.0,3040031.0,29399.0,11833.0,11101.0,18024.5,4080.0,1.0,5.0,44949.0,2017.0,2017.0,48.0,24.323
50%,2022.0,49.0,74.0,233.0,3144988.0,31166.0,13572.0,39517.0,19931.0,5970.0,3.0,10.0,48456.0,2019.0,2019.0,104.0,25.889
75%,2022.0,75.0,76.0,290.0,4039505.0,32673.0,15053.0,49699.0,21187.0,7412.0,5.0,15.0,53637.0,2021.0,2021.0,172.0,28.145
max,2022.0,99.0,81.0,1794.0,4820589.0,33891.0,16648.0,143793.0,22477.0,8928.0,22.0,22.0,55611.0,2022.0,2022.0,262.0,45.454


In [178]:
team_data.isnull().sum()

season                         0
team                           0
position                       0
depth_chart_position           0
jersey_number                145
status                         0
player_name                    0
first_name                     0
last_name                      0
birth_date                   373
height                        46
weight                         2
college                       27
player_id                     27
espn_id                    16249
sportradar_id              13636
yahoo_id                   19909
rotowire_id                13636
pff_id                     17813
pfr_id                     24618
fantasy_data_id            21484
sleeper_id                 13636
years_exp                      0
headshot_url                1277
ngs_position               24768
week                           0
game_type                      0
status_description_abbr        0
football_name                  0
esb_id                        91
gsis_it_id

## Player Data

In [179]:
player_data.head()

Unnamed: 0,season,draft_year,draft_team,draft_round,draft_ovr,pfr_id,cfb_id,player_name,pos,school,ht,wt,forty,bench,vertical,broad_jump,cone,shuttle
0,2018,2018.0,San Francisco 49ers,2.0,44.0,PettDa00,dante-pettis-1,Dante Pettis,WR,Washington,6-0,186.0,,,,,,
1,2018,2018.0,Indianapolis Colts,2.0,52.0,TuraKe00,kemoko-turay-1,Kemoko Turay,EDGE,Rutgers,6-5,253.0,4.65,,,,,
2,2018,,,,,AdamJo03,josh-adams-2,Josh Adams,RB,Notre Dame,6-2,213.0,,18.0,,,,
3,2018,,,,,,,Ola Adeniyi,EDGE,Toledo,6-2,248.0,4.83,26.0,31.5,,7.21,4.28
4,2018,2018.0,Houston Texans,3.0,98.0,AkinJo00,jordan-akins-1,Jordan Akins,TE,Central Florida,6-3,249.0,,,,,,


In [180]:
player_data.dtypes

season           int64
draft_year     float64
draft_team      object
draft_round    float64
draft_ovr      float64
pfr_id          object
cfb_id          object
player_name     object
pos             object
school          object
ht              object
wt             float64
forty          float64
bench          float64
vertical       float64
broad_jump     float64
cone           float64
shuttle        float64
dtype: object

In [181]:
player_data.describe()

Unnamed: 0,season,draft_year,draft_round,draft_ovr,wt,forty,bench,vertical,broad_jump,cone,shuttle
count,1797.0,1075.0,1075.0,1075.0,1773.0,1429.0,1126.0,1392.0,1357.0,982.0,1027.0
mean,2020.057874,2019.99814,3.822326,117.219535,239.948675,4.740364,19.787744,33.109052,117.100221,7.279511,4.436407
std,1.382757,1.414871,1.897039,71.239506,45.413863,0.297032,6.292731,4.213948,8.974989,0.3959,0.254253
min,2018.0,2018.0,1.0,1.0,144.0,4.23,4.0,19.5,82.0,6.28,3.94
25%,2019.0,2019.0,2.0,56.0,203.0,4.51,15.0,30.5,112.0,7.0,4.26
50%,2020.0,2020.0,4.0,113.0,228.0,4.65,19.5,33.5,118.0,7.19,4.39
75%,2021.0,2021.0,5.0,174.0,271.0,4.92,24.0,36.0,123.0,7.5,4.58
max,2022.0,2022.0,7.0,262.0,384.0,5.85,44.0,46.5,141.0,8.82,5.38


In [182]:
player_data.isnull().sum()

season           0
draft_year     722
draft_team     722
draft_round    722
draft_ovr      722
pfr_id         229
cfb_id         116
player_name      0
pos              0
school           0
ht              29
wt              24
forty          368
bench          671
vertical       405
broad_jump     440
cone           815
shuttle        770
dtype: int64

# Preprocessing Data

## Injury Data

Many of the key metrics needed for our modeling are present in the injury data. We will need to do some data wrangling to get the data into a format that is usable for our modeling.

Some recommendations for columns:
- 'season'
- 'team'
- 'week'
- 'gsis_id'
- 'full_name'
- 'position'
- 'report_status'
- 'report_primary_injury'
- 'report_secondary_injury'
- 'practice_status'
- 'practice_primary_injury'
- 'practice_secondary_injury'
- 'date_modified'

### Missing Values

#### Report Primary Injury

In [183]:
print(injury_data['report_primary_injury'].value_counts(dropna=False))
print('---')
print(injury_data['report_primary_injury'].value_counts(dropna=False, normalize=True))
print('---')
print()
print("NOTE:  I'm thinking we can rename the NAN values to 'No Injury'")

report_primary_injury
NaN                    2937
Knee                    432
Ankle                   415
Hamstring               326
Concussion              170
                       ... 
right Quadricep           1
right Groin               1
Appendix                  1
Hernia                    1
toe, pec, knee, hip       1
Name: count, Length: 66, dtype: int64
---
report_primary_injury
NaN                    0.516895
Knee                   0.076030
Ankle                  0.073038
Hamstring              0.057374
Concussion             0.029919
                         ...   
right Quadricep        0.000176
right Groin            0.000176
Appendix               0.000176
Hernia                 0.000176
toe, pec, knee, hip    0.000176
Name: proportion, Length: 66, dtype: float64
---

NOTE:  I'm thinking we can rename the NAN values to 'No Injury'
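The renaming suggested in the note could look roughly like the sketch below. It uses a small hand-made frame as a stand-in for `injury_data` (the sample rows and the `report_primary_injury_filled` column name are illustrative, not taken from the file), and it writes the result to a new column so the original values stay available.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for injury_data; the real frame is loaded from ./src/injuries.csv.
sample = pd.DataFrame({
    "full_name": ["Player A", "Player B", "Player C"],
    "report_primary_injury": ["Knee", np.nan, "Ankle"],
})

# Replace missing report injuries with an explicit label.
# Assumption: a missing value means the player was not listed on the game report.
sample["report_primary_injury_filled"] = sample["report_primary_injury"].fillna("No Injury")

print(sample["report_primary_injury_filled"].value_counts())
```

Keeping the filled values in a separate column makes it easy to check later whether the "missing means no injury" assumption actually holds.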


#### Report Secondary Injury

In [184]:
# print(injury_data['report_secondary_injury'].value_counts(dropna=False))
print('---')
print(injury_data['report_secondary_injury'].value_counts(dropna=False, normalize=True))
print('---')
print()
print("NOTE:  I'm thinking we can drop the Secondary Injury column since it's mostly NaN values")

---
report_secondary_injury
NaN                                     0.964977
Ankle                                   0.005456
Knee                                    0.004928
Illness                                 0.003872
Hip                                     0.001760
Back                                    0.001760
Shoulder                                0.001408
Not injury related - resting player     0.001408
Foot                                    0.001056
Wrist                                   0.001056
Hamstring                               0.000880
Neck                                    0.000880
Abdomen                                 0.000704
Achilles                                0.000704
Calf                                    0.000704
Quadricep                               0.000704
Hand                                    0.000704
Not injury related - personal matter    0.000704
Heel                                    0.000528
Pectoral                                0

#### Report Status

In [185]:
print(injury_data['report_status'].value_counts(dropna=False))
print('---')
print(injury_data['report_status'].value_counts(dropna=False, normalize=True))
print('---')
print()
print("NOTE:  I'm thinking we can leave the status column as is.\n There is a direct correlation to the NAN and the 'No Injury' values.")

report_status
NaN             2937
Questionable    1511
Out             1078
Doubtful         156
Name: count, dtype: int64
---
report_status
NaN             0.516895
Questionable    0.265927
Out             0.189722
Doubtful        0.027455
Name: proportion, dtype: float64
---

NOTE:  I'm thinking we can leave the status column as is.
 There is a direct correlation to the NAN and the 'No Injury' values.


#### Practice Status

In [186]:
# Count practice participation levels, including any missing values
print(injury_data['practice_status'].value_counts(dropna=False))


#### Practice Primary Injury

In [187]:
print(injury_data['practice_primary_injury'].value_counts(dropna=False))
print('---')
print(injury_data['practice_primary_injury'].value_counts(dropna=False, normalize=True))
print('---')
print()
# print("NOTE:  I'm thinking we can leave the status column as is.\n There is a direct correlation to the NAN and the 'No Injury' values.")

practice_primary_injury
Knee                                   787
Ankle                                  692
Not injury related - resting player    636
Hamstring                              467
Illness                                362
                                      ... 
lower leg cramps                         1
left Knee                                1
ankle, knee, elbow                       1
shoulder, biceps, hand                   1
Ankles                                   1
Name: count, Length: 81, dtype: int64
---
practice_primary_injury
Knee                                   0.138508
Ankle                                  0.121788
Not injury related - resting player    0.111932
Hamstring                              0.082189
Illness                                0.063710
                                         ...   
lower leg cramps                       0.000176
left Knee                              0.000176
ankle, knee, elbow                     0.000176
shoul

#### Practice Secondary Injury

In [188]:
print(injury_data['practice_secondary_injury'].value_counts(dropna=False))
print('---')
print(injury_data['practice_secondary_injury'].value_counts(dropna=False, normalize=True))
print('---')
print()
# print("NOTE:  I'm thinking we can leave the status column as is.\n There is a direct correlation to the NAN and the 'No Injury' values.")

practice_secondary_injury
NaN                               5277
Knee                                64
Ankle                               62
Not injury related - resting p      44
Illness                             28
Back                                21
Shoulder                            20
Hip                                 15
Hamstring                           10
Rib                                  9
Quadricep                            9
Wrist                                9
Ribs                                 9
Elbow                                9
Foot                                 8
Toe                                  8
Groin                                7
Neck                                 7
Abdomen                              6
Heel                                 6
Not injury related - personal        6
Achilles                             5
Glute                                4
Biceps                               4
Calf                                 4

### Recommended Columns 

In [199]:
injury_cols = ['season', 'team', 'week', 'gsis_id', 'full_name', 'position', 'report_status', 'report_primary_injury', 'report_secondary_injury', 'practice_status', 'practice_primary_injury', 'practice_secondary_injury', 'date_modified']

print('I would like to rename the columns to make them easier to work with.\nI would also like to change the dtype of the date_modified column to datetime64.')


injury_data[injury_cols].head()

I would like to rename the columns to make them easier to work with.
I would also like to change the dtype of the date_modified column to datetime64.


Unnamed: 0,season,team,week,gsis_id,full_name,position,report_status,report_primary_injury,report_secondary_injury,practice_status,practice_primary_injury,practice_secondary_injury,date_modified
0,2022,ARI,1,00-0027993,Rodney Hudson,C,,,,Did Not Participate In Practice,Not injury related - resting player,,2022-09-07 21:10:03+00:00
1,2022,ARI,1,00-0028946,Aaron Brewer,LS,,,,Full Participation in Practice,Ankle,,2022-09-09 19:55:06+00:00
2,2022,ARI,1,00-0032127,Markus Golden,LB,,,,Full Participation in Practice,Toe,,2022-09-09 19:55:29+00:00
3,2022,ARI,1,00-0034490,Ezekiel Turner,LB,,,,Full Participation in Practice,Shoulder,,2022-09-09 19:55:44+00:00
4,2022,ARI,1,00-0035924,Jonathan Ward,RB,,,,Full Participation in Practice,Shoulder,,2022-09-09 19:55:49+00:00
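A minimal sketch of the two follow-up steps mentioned in this cell, renaming columns and converting `date_modified` to a datetime dtype. The sample rows are copied from the `head()` output above; the new column names in `rename_map` are hypothetical placeholders, not names the project has settled on.

```python
import pandas as pd

# Illustrative stand-in for injury_data[injury_cols]; values copied from the head() output.
sample = pd.DataFrame({
    "gsis_id": ["00-0027993", "00-0028946"],
    "full_name": ["Rodney Hudson", "Aaron Brewer"],
    "date_modified": ["2022-09-07 21:10:03+00:00", "2022-09-09 19:55:06+00:00"],
})

# Hypothetical shorter names; adjust to whatever naming the team agrees on.
rename_map = {"gsis_id": "player_id", "full_name": "player_name"}
sample = sample.rename(columns=rename_map)

# Parse the timestamp strings into timezone-aware datetime values (UTC).
sample["date_modified"] = pd.to_datetime(sample["date_modified"], utc=True)

print(sample.dtypes)
```

Passing `utc=True` keeps the `+00:00` offset information instead of silently dropping it.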


## Team Data
Many columns contain ID information that is not needed for this project. I recommend we drop those columns and keep the following:

---
Data Modeling Columns:
- season
- team
- position
- **player_id** (if this is the foreign key for gsis_id in the injury data)

---
Data Visualization Columns (Streamlit):
- jersey_number
- status
- player_name
- weight
- height
- college
- years_exp
- headshot_url
- age
- team




### Missing Values

#### Jersey Number

In [190]:
print(team_data['jersey_number'].value_counts(dropna=False))
print(team_data['status'].value_counts(dropna=False, normalize=True))
team_data.sort_values(by= 'jersey_number', na_position='first').sample(25)


jersey_number
91.0    616
17.0    602
31.0    588
26.0    585
23.0    585
       ... 
12.0    344
62.0    341
61.0    314
40.0    313
NaN     145
Name: count, Length: 100, dtype: int64
status
ACT    0.592682
DEV    0.185083
RES    0.113338
INA    0.077790
CUT    0.019280
RET    0.011503
TRC    0.000173
TRD    0.000087
TRT    0.000022
SUS    0.000022
EXE    0.000022
Name: proportion, dtype: float64


Unnamed: 0,season,team,position,depth_chart_position,jersey_number,status,player_name,first_name,last_name,birth_date,...,status_description_abbr,football_name,esb_id,gsis_it_id,smart_id,entry_year,rookie_year,draft_club,draft_number,age
33005,2022,NYJ,DL,DE,51.0,DEV,Marquiss Spencer,Marquiss,Spencer,1997-07-16,...,P01,Marquiss,SPE570569,53682,32005350-4557-0569-e3dd-86dd8400ca87,2021,2021.0,DEN,253.0,25.366
19175,2022,JAX,LB,ILB,54.0,ACT,Ty Summers,Ty,Summers,1995-12-31,...,A01,Ty,SUM606964,48009,32005355-4d60-6964-28e0-f107fd758851,2019,2019.0,GB,226.0,26.754
5938,2022,DAL,LB,OLB,56.0,ACT,Dante Fowler,Dante Fowler Jr.,Fowler,1994-08-03,...,A01,Dante,FOW382785,42346,3200464f-5738-2785-f81a-f6d367ce1686,2015,2015.0,JAX,3.0,28.279
2206,2022,DAL,P,P,5.0,ACT,Bryan Anger,Bryan,Anger,1988-10-06,...,A01,Bryan,ANG280911,38600,3200414e-4728-0911-cad2-e8c7ca434956,2012,2012.0,JAX,70.0,34.215
18889,2022,LV,DL,DE,98.0,ACT,Maxx Crosby,Maxx,Crosby,1997-08-22,...,A01,Maxx,CRO371007,47889,32004352-4f37-1007-6c0e-db2de3f9d742,2019,2019.0,OAK,106.0,25.092
5967,2022,TB,WR,WR,16.0,ACT,Breshad Perriman,Breshad,Perriman,1993-09-10,...,A01,Breshad,PER440170,42369,32005045-5244-0170-6920-f60eb457c1fb,2015,2015.0,BAL,26.0,29.155
20188,2022,CLE,DB,CB,29.0,DEV,Herb Miller,Herb,Miller,1997-11-11,...,P06,Herb,MIL297240,48388,32004d49-4c29-7240-92aa-2e17f1270d18,2019,2019.0,,,25.139
38369,2022,BAL,DB,DB,33.0,CUT,David Vereen,David,Vereen,1997-10-09,...,W03,David,VER279342,54866,32005645-5227-9342-24eb-0e3d8dacbec7,2022,2022.0,,,24.923
37554,2022,GB,DL,DT,99.0,INA,Jonathan Ford,Jonathan,Ford,1998-09-29,...,A01,Jonathan,FOR256596,54699,3200464f-5225-6596-8618-ffea8d51b3cf,2022,2022.0,GB,234.0,24.047
15578,2022,NYJ,QB,QB,5.0,ACT,Mike White,Michael,White,1995-03-25,...,A01,Mike,WHI367864,46240,32005748-4936-7864-3ac7-015862958e61,2018,2018.0,DAL,171.0,27.485


In [191]:
no_jersey = team_data[team_data['jersey_number'].isnull()]
no_jersey['status'].value_counts(dropna=False, normalize=True)

#Figure out what these players are doing on the team if they don't have a jersey number
#Should they be dropped from the data set?

status
RET    0.468966
CUT    0.427586
DEV    0.068966
ACT    0.027586
RES    0.006897
Name: proportion, dtype: float64
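One possible answer to the question in the comments above is to drop only those jersey-less rows whose status suggests the player is no longer with the team. The sketch below is an assumption-laden illustration: the sample rows are made up, and treating `RET` and `CUT` as "safe to drop" is a hypothetical rule that still needs to be confirmed against the status definitions.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for team_data; names and values are made up.
sample = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C", "Player D"],
    "status": ["ACT", "RET", "CUT", "DEV"],
    "jersey_number": [12.0, np.nan, np.nan, np.nan],
})

# Hypothetical rule: drop rows that lack a jersey number AND carry a status
# indicating the player has left the team.
inactive_statuses = ["RET", "CUT"]
drop_mask = sample["jersey_number"].isna() & sample["status"].isin(inactive_statuses)
kept = sample[~drop_mask].reset_index(drop=True)

print(kept)
```

Under this rule the `DEV` and `ACT` rows without a jersey number are kept, since those players may still appear in the injury data.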

#### player_id [a foreign key?]

In [192]:
# print(team_data['player_id'].value_counts(dropna=False, normalize=True))

print('So the lack of player_id is most likely explained by the fact that these players have not played in a game yet.\nI think we can drop these players from the data set.')

team_data.sort_values(by = 'player_id', na_position='first', ascending=True).head(25)

So the lack of player_id is most likely explained by the fact that these players have not played in a game yet.
I think we can drop these players from the data set.


Unnamed: 0,season,team,position,depth_chart_position,jersey_number,status,player_name,first_name,last_name,birth_date,...,status_description_abbr,football_name,esb_id,gsis_it_id,smart_id,entry_year,rookie_year,draft_club,draft_number,age
46136,2022,ATL,WR,WR,80.0,DEV,Josh Ali,Josh,Ali,,...,P01,Josh,ALI103821,55557,,2022,2022.0,,,
46137,2022,ATL,RB,FB,,DEV,Clint Ratkovich,Clint,Ratkovich,,...,P01,Clint,RAT463569,55611,,2022,2022.0,,,
46138,2022,ATL,WR,WR,80.0,DEV,Josh Ali,Josh,Ali,,...,P01,Josh,ALI103821,55557,,2022,2022.0,,,
46139,2022,ATL,WR,WR,,DEV,Emeka Emezie,Chukwuemeka,Emezie,,...,P02,Emeka,EME675714,55353,,2022,2022.0,,,
46140,2022,CAR,WR,WR,,DEV,Emeka Emezie,Chukwuemeka,Emezie,,...,P01,Emeka,EME675714,55353,,2022,2022.0,,,
46141,2022,JAX,QB,QB,,RES,Nathan Rourke,Nathan,Rourke,,...,R23,Nathan,,54152,,2020,,,,
46142,2022,ATL,RB,FB,,DEV,Clint Ratkovich,Clint,Ratkovich,,...,P01,Clint,RAT463569,55611,32005241-5446-3569-9eae-a162edf4a49e,2022,2022.0,,,
46143,2022,ATL,WR,WR,,DEV,Emeka Emezie,Chukwuemeka,Emezie,,...,P01,Emeka,EME675714,55353,,2022,2022.0,,,
46144,2022,ATL,WR,WR,80.0,DEV,Josh Ali,Josh,Ali,,...,P01,Josh,ALI103821,55557,,2022,2022.0,,,
46145,2022,ATL,WR,WR,80.0,DEV,Josh Ali,Josh,Ali,,...,P01,Josh,ALI103821,55557,,2022,2022.0,,,


#### headshot_url

In [193]:
team_data.head()


Unnamed: 0,season,team,position,depth_chart_position,jersey_number,status,player_name,first_name,last_name,birth_date,...,status_description_abbr,football_name,esb_id,gsis_it_id,smart_id,entry_year,rookie_year,draft_club,draft_number,age
0,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.279
1,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.432
2,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.355
3,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.413
4,2022,TB,QB,QB,12.0,ACT,Tom Brady,Tom,Brady,1977-08-03,...,A01,Tom,BRA371156,25511,32004252-4137-1156-7ed0-8b9e44948f13,2000,2000.0,NE,199.0,45.202


In [194]:
#Display player headshots inline using the headshot_url column.
#Ref: https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o

from IPython.display import Image, display

#Single image
# url = team_data['headshot_url'].iloc[0]
# image_width = 210
# image_height = 150
# display(Image(url=url, width=image_width, height=image_height))

#Multiple images: one headshot per player, skipping rows without a URL
image_width = 210
image_height = 150
headshot_urls = team_data.drop_duplicates(subset=['player_name'])['headshot_url'].dropna()
for url in headshot_urls.head(15):
    display(Image(url=url, width=image_width, height=image_height))

#### Column contents

In [195]:
print(team_data['status'].value_counts(dropna=False).head(7))
print(team_data['player_name'].value_counts(dropna=False).head(7))
print(team_data['years_exp'].value_counts(dropna=False).head(7))


status
ACT    27360
DEV     8544
RES     5232
INA     3591
CUT      890
RET      531
TRC        8
Name: count, dtype: int64
player_name
Jaylon Moore       39
Josh Allen         38
Jonah Williams     37
Anthony Brown      37
Michael Thomas     37
Kyle Fuller        36
Connor McGovern    36
Name: count, dtype: int64
years_exp
0    9099
2    6601
1    6135
3    5789
4    5034
5    3811
6    2816
Name: count, dtype: int64


### Recommended Columns

In [198]:
team_col = ['season', 'team', 'player_id', 'player_name', 'position', 'status', 'jersey_number', 'height', 'weight', 'years_exp', 'college', 'headshot_url', 'age', ]

print('I would like to make sure the player_id is in line with the gsis_id in the injury_data.\nI would also like to change the dtypes of the height, weight, and age columns to int64.\nMight want to remove cut/ret/developing players from the data set, and only include ACT and .')

team_data[team_col].sample(10)

I would like to make sure the player_id is in line with the gsis_id in the injury_data.
I would also like to change the dtypes of the height, weight, and age columns to int64.


Unnamed: 0,season,team,player_id,player_name,position,status,jersey_number,height,weight,years_exp,college,headshot_url,age
7314,2022,NE,00-0032415,Matt Judon,LB,ACT,9.0,75.0,261.0,6,Grand Valley State,https://static.www.nfl.com/image/private/f_aut...,30.226
15041,2022,TB,00-0034363,Deadrin Senat,DL,ACT,95.0,73.0,305.0,4,South Florida,https://static.www.nfl.com/image/private/f_aut...,28.266
26015,2022,BAL,00-0036081,Broderick Washington,DL,ACT,96.0,74.0,305.0,2,Texas Tech,https://static.www.nfl.com/image/private/f_aut...,26.114
28716,2022,ARI,00-0036318,Lachavious Simmons,OL,DEV,73.0,77.0,315.0,2,Tennessee State,https://static.www.nfl.com/image/private/f_aut...,26.152
28578,2022,CIN,00-0036310,Hakeem Adeniji,OL,ACT,77.0,76.0,300.0,2,Kansas,https://static.www.nfl.com/image/private/f_aut...,25.043
4891,2022,NO,00-0031545,Kevin White,WR,ACT,17.0,75.0,216.0,7,West Virginia,https://static.www.nfl.com/image/private/f_aut...,30.423
31717,2022,SF,00-0036567,Elijah Mitchell,RB,RES,25.0,70.0,221.0,1,Louisiana-Lafayette,https://static.www.nfl.com/image/private/f_aut...,24.422
8653,2022,LA,00-0032943,Riley Dixon,P,ACT,11.0,76.0,221.0,6,Syracuse,https://static.www.nfl.com/image/private/f_aut...,29.24
40575,2022,CLE,00-0037335,Travell Harris,WR,CUT,83.0,69.0,185.0,0,Washington State,,
8422,2022,MIA,00-0032807,Clayton Fejedelem,DB,ACT,42.0,73.0,205.0,6,Illinois,https://static.www.nfl.com/image/private/f_aut...,29.602


## Player Data
This data may not be necessary for our modeling. If we decide not to use it, we can drop it from the project.
If we keep it, I recommend we keep the following columns:

**NOTE**: There is no `player_id` or `gsis_id` column in this data, so we will need to create a `player_id` column to use as a foreign key to the team data.

---
- season
- ht
- wt (for comparison with `weight` in the team data)
- pos (for comparison with `position` in the team data)
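One way the missing key might be created: both the combine data and the team data list a `pfr_id` column in their dtypes, so it could serve as a bridge to `player_id`. This is only a sketch under that assumption. The `pfr_id` values come from the `head()` output above, but the `player_id` values are hypothetical placeholders, and combine rows without a `pfr_id` would still need another matching strategy (for example name and position).

```python
import pandas as pd

# Illustrative stand-ins; the player_id values here are hypothetical.
team_sample = pd.DataFrame({
    "player_id": ["00-0000001", "00-0000001", "00-0000002"],
    "pfr_id": ["PettDa00", "PettDa00", "AkinJo00"],
})
combine_sample = pd.DataFrame({
    "player_name": ["Dante Pettis", "Ola Adeniyi", "Jordan Akins"],
    "pfr_id": ["PettDa00", None, "AkinJo00"],
})

# One row per pfr_id, since the roster data repeats players across weeks.
id_lookup = (
    team_sample.dropna(subset=["pfr_id", "player_id"])
    .drop_duplicates(subset="pfr_id")[["pfr_id", "player_id"]]
)

# Left merge keeps every combine row; players without a match get a missing player_id.
combine_keyed = combine_sample.merge(id_lookup, on="pfr_id", how="left")

print(combine_keyed)
```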

### Missing Values

#### Height

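The `ht` values are strings in feet-inches form (e.g. `6-2`), so they will need to be converted to total inches as a float to be comparable with `height` in the team data.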
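A minimal sketch of that conversion, assuming every non-missing value follows the `feet-inches` pattern seen in the data. `height_to_inches` is a hypothetical helper name, and the sample values are illustrative.

```python
import pandas as pd

def height_to_inches(ht):
    """Convert a 'feet-inches' string such as '6-2' to total inches as a float."""
    if pd.isna(ht):
        return float("nan")
    feet, inches = str(ht).split("-")
    return float(int(feet) * 12 + int(inches))

# Illustrative values in the same format as player_data['ht'].
sample_ht = pd.Series(["6-0", "6-5", None, "5-9"])
ht_inches = sample_ht.map(height_to_inches)

print(ht_inches.tolist())
```

On the real data this would be applied as `player_data['ht'].map(height_to_inches)`. Missing heights stay missing rather than being filled with a guess.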

In [205]:
player_data['ht'].value_counts(dropna=False).tail(10)

ht
5-9    54
5-8    35
NaN    29
6-7    28
6-8    14
5-7    11
5-6     4
6-9     1
5-5     1
5-4     1
Name: count, dtype: int64

In [203]:
team_data['height'].value_counts(dropna=False).head(10)

height
76.0    6189
75.0    6102
74.0    5812
73.0    5807
77.0    5078
72.0    4669
71.0    3434
78.0    2879
70.0    2515
69.0    1375
Name: count, dtype: int64

In [207]:
print('All missing values are from players who have not played in a game yet.\nI think we can drop these players from the data set.')
player_data.sort_values(by='ht', na_position='first').head(30)


All missing values are from players who have not played in a game yet.
I think we can drop these players from the data set.


Unnamed: 0,season,draft_year,draft_team,draft_round,draft_ovr,pfr_id,cfb_id,player_name,pos,school,ht,wt,forty,bench,vertical,broad_jump,cone,shuttle
1012,2021,,,,,,zayne-anderson-1,Zayne Anderson,LB,BYU,,,4.46,11.0,34.0,120.0,6.78,4.22
1035,2021,2021.0,New England Patriots,6.0,188.0,BledJo00,joshuah-bledsoe-1,Joshuah Bledsoe,S,Missouri,,,,,,,,
1077,2021,2021.0,Carolina Panthers,3.0,70.0,ChriBr00,brady-christensen-1,Brady Christensen,OL,BYU,,,,30.0,34.0,124.0,,
1078,2021,2021.0,Jacksonville Jaguars,3.0,65.0,CiscAn00,andre-cisco-1,Andre Cisco,DB,Syracuse,,,,17.0,,,,
1103,2021,,,,,,brady-davis-2,Brady Davis,QB,Memphis,,,4.65,,31.0,114.0,7.06,4.2
1115,2021,,,,,,keith-duncan-1,Keith Duncan,K,Iowa,,176.0,,,,,,
1139,2021,,,,,,thomas-fletcher-1,Thomas Fletcher,LS,Alabama,,,,,,,,
1142,2021,,,,,ForrMi00,miller-forristall-1,Miller Forristall,TE,Alabama,,,,,,,,
1148,2021,2021.0,Philadelphia Eagles,5.0,150.0,GainKe00,kenny-gainwell-1,Kenneth Gainwell,RB,Memphis,,201.0,4.42,21.0,,,,
1152,2021,2021.0,Las Vegas Raiders,4.0,143.0,GillTy00,tyree-gillespie-1,Tyree Gillespie,S,Missouri,,,4.38,15.0,,,7.06,


#### Weight

In [209]:
player_data['wt'].value_counts(dropna=False).head(10)

wt
205.0    39
195.0    27
197.0    27
201.0    27
208.0    26
198.0    25
215.0    25
NaN      24
209.0    24
202.0    23
Name: count, dtype: int64

In [212]:
print('All missing values are from players who have not played in a game yet.\nI think we can drop these players from the data set.')
player_data.sort_values(by='wt', na_position='first').head(30)

All missing values are from players who have not played in a game yet.
I think we can drop these players from the data set.


Unnamed: 0,season,draft_year,draft_team,draft_round,draft_ovr,pfr_id,cfb_id,player_name,pos,school,ht,wt,forty,bench,vertical,broad_jump,cone,shuttle
1012,2021,,,,,,zayne-anderson-1,Zayne Anderson,LB,BYU,,,4.46,11.0,34.0,120.0,6.78,4.22
1035,2021,2021.0,New England Patriots,6.0,188.0,BledJo00,joshuah-bledsoe-1,Joshuah Bledsoe,S,Missouri,,,,,,,,
1077,2021,2021.0,Carolina Panthers,3.0,70.0,ChriBr00,brady-christensen-1,Brady Christensen,OL,BYU,,,,30.0,34.0,124.0,,
1078,2021,2021.0,Jacksonville Jaguars,3.0,65.0,CiscAn00,andre-cisco-1,Andre Cisco,DB,Syracuse,,,,17.0,,,,
1103,2021,,,,,,brady-davis-2,Brady Davis,QB,Memphis,,,4.65,,31.0,114.0,7.06,4.2
1139,2021,,,,,,thomas-fletcher-1,Thomas Fletcher,LS,Alabama,,,,,,,,
1142,2021,,,,,ForrMi00,miller-forristall-1,Miller Forristall,TE,Alabama,,,,,,,,
1152,2021,2021.0,Las Vegas Raiders,4.0,143.0,GillTy00,tyree-gillespie-1,Tyree Gillespie,S,Missouri,,,4.38,15.0,,,7.06,
1177,2021,,,,,HazeDa00,damon-hazelton-jr-1,Damon Hazelton,WR,Missouri,,,4.6,,37.5,,,
1195,2021,2021.0,San Francisco 49ers,5.0,180.0,HufaTa00,talanoa-hufanga-1,Talanoa Hufanga,S,USC,,,4.65,,,,,


### Recommended Columns

In [213]:
player_col = ['season', 'player_name', 'pos', 'ht', 'wt']

print('I would rename the ht/wt columns with beg_ht and beg_wt.\nI would also like to change the dtypes of the ht/wt columns to int64.')

player_data[player_col].head()

I would rename the ht/wt columns with beg_ht and beg_wt.
I would also like to change the dtypes of the ht/wt columns to int64.


Unnamed: 0,season,player_name,pos,ht,wt
0,2018,Dante Pettis,WR,6-0,186.0
1,2018,Kemoko Turay,EDGE,6-5,253.0
2,2018,Josh Adams,RB,6-2,213.0
3,2018,Ola Adeniyi,EDGE,6-2,248.0
4,2018,Jordan Akins,TE,6-3,249.0
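The rename and dtype change proposed in this cell could be sketched as follows. Two assumptions are worth flagging: the sketch presumes `ht` has already been converted from `feet-inches` strings to inches, and it uses pandas' nullable `Int64` dtype rather than plain `int64`, because plain `int64` cannot hold the missing values present in these columns. The last sample row is a made-up player.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for player_data[player_col].
sample = pd.DataFrame({
    "player_name": ["Dante Pettis", "Kemoko Turay", "Example Player"],
    "ht": [72.0, 77.0, np.nan],    # assumed already converted to inches
    "wt": [186.0, 253.0, np.nan],
})

# Rename to mark these as the measurements taken at the start of a career.
sample = sample.rename(columns={"ht": "beg_ht", "wt": "beg_wt"})

# Nullable Int64 keeps missing values as <NA> instead of raising an error.
sample = sample.astype({"beg_ht": "Int64", "beg_wt": "Int64"})

print(sample.dtypes)
```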
