%run ../initialize.ipynb

## Feature Engineering Strategy
* start simple -- get quick wins
* use model pipeline to get feedback
* run POCs on feature types; don't get stuck in rabbit holes
* size of net to cast depends on data
 * if you have 10 positive examples, develop 1 million features, and then perform feature selection, you will inevitably find some features that fit the data purely by chance.
 * if you have 1M positive examples, you can cast a very wide net.
 * the 2007-2017 data has 2937 games and roughly 1450 positive labels, so care must be taken not to overfit.
 * keeping a holdout set that is touched only rarely is one way to guard against this, at the cost of training on fewer positive labels.
* the same or similar features can work for regression or classification, but predicting game outcomes vs. predicting scores will need different features
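
The overfitting risk from casting a wide net on few examples can be illustrated with a small simulation. This is a minimal sketch, not part of the project pipeline: the function name `best_spurious_accuracy` and the sample sizes are made up for illustration. Every feature is pure noise, yet the best of them looks predictive when examples are scarce.

```python
import random


def best_spurious_accuracy(n_examples, n_features, seed=0):
    """Best training accuracy among purely random binary features."""
    rng = random.Random(seed)
    labels = [rng.random() < 0.5 for _ in range(n_examples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is a coin flip, so it carries no real signal.
        feature = [rng.random() < 0.5 for _ in range(n_examples)]
        accuracy = sum(f == y for f, y in zip(feature, labels)) / n_examples
        best = max(best, accuracy)
    return best


# Hypothetical sizes chosen only to keep the demo fast.
small = best_spurious_accuracy(n_examples=20, n_features=2000)
large = best_spurious_accuracy(n_examples=1000, n_features=2000)
print(f"best noise feature with 20 examples:   {small:.2f}")
print(f"best noise feature with 1000 examples: {large:.2f}")
```

With few examples, selecting the best of many noise features yields a high apparent accuracy; with more examples the best noise feature stays close to 0.5.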

## Things to predict
* won
* covered spread
* over
* score differential
* team points
* score

## Tables
* Source
 * labels.over_under_labels
 * labels.team_game_line_labels

* Derived

## Features

### General

### Priorities Legend
* &#x1F6A8; - high priority
* &#x23F8; - medium priority
* &#x270B; - low priority
* &#x23F3; - waiting on data
* &#x1F6A7; - started
* &#x2714; - completed
* &#x1F44E; - rejected (not used)

### Rankings
* Dimensions
 * Overall
 * Offense
   * passing
   * rushing
 * Defense
 * Special Teams
* &#x23F8; DVOA rankings comparisons between 2 teams

### Stats
* &#x23F8; TDs
* &#x23F8; points
* &#x23F8; yards, ypa, ypc, etc.
* &#x23F8; turnovers
* &#x23F8; time of possession
* &#x270B; yards
* &#x270B; kicker FG%/PAT% recently vs. that season's average
* &#x23F3; advanced

### Time
* &#x2714; time of day -- raw (tree can split on primetime)
* &#x2714; day of week -- raw
 * very sparse other than Sunday. Sunday/not Sunday?
* &#x2714; is playoffs
* &#x2714; week # -- raw
* &#x23F8; days since last game
* &#x23F8; timezone change for each team
* &#x270B; team's body clock start time (start time relative to last week's timezone)
* &#x270B; patterns, e.g. team 2nd straight week of 10am body time
* &#x1F44E; season as a feature: rejected. A season-level value summarizes games played after the one being predicted, so it leaks outcomes: if one season has a 60% positive rate and another has 50%, knowing a game falls in the 60% season already tells the model to expect 60%. The initial exploration also shows few time trends, so season is not used.
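
The rest, timezone, and body-clock items above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and timezones are simplified to integer UTC offsets (ignoring DST edge cases) rather than taken from the project's tables.

```python
from datetime import datetime


def days_since_last_game(prev_kickoff, kickoff):
    """Whole calendar days between two kickoffs."""
    return (kickoff.date() - prev_kickoff.date()).days


def timezone_change(game_utc_offset, body_utc_offset):
    """Hours the team shifts east (+) or west (-) relative to its body clock."""
    return game_utc_offset - body_utc_offset


def body_clock_hour(local_kickoff_hour, game_utc_offset, body_utc_offset):
    """Kickoff hour as felt in the timezone the team is acclimated to."""
    shift = timezone_change(game_utc_offset, body_utc_offset)
    return (local_kickoff_hour - shift) % 24


# A Pacific-time team (UTC-8) playing a 1pm Eastern (UTC-5) kickoff.
print(body_clock_hour(13, game_utc_offset=-5, body_utc_offset=-8))
print(days_since_last_game(datetime(2017, 9, 10, 13), datetime(2017, 9, 14, 20)))
```

Passing last week's game timezone as `body_utc_offset` gives the "relative to last week's timezone" variant; passing the home timezone gives a home-relative one.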

### Teams (both teams)
* &#x2714; winrate last 5, YTD, last 16, last 3 years, etc.
* &#x2714; cover rate last 5, YTD, ...
* &#x23F8; winrate as favorite/underdog last 5, YTD, etc.
* &#x23F8; recent scores (lot of shootouts recently?)
* &#x23F8; smoothed winrates/cover rates
 * this regresses small-sample records such as 1-0, 2-0, or 0-1 toward 0.5
* &#x270B; fanbase size, average attendance, etc.
* &#x270B; ages (median, mean)
* &#x270B; how does age interact with week # (do older teams tire?)
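
One way to build the trailing and smoothed rates above is to add pseudo-games at the prior rate and to use only games played before the one being predicted. This is a sketch under assumptions: the function names and the `prior_games=4` strength are illustrative choices, not values from the project.

```python
def smoothed_rate(wins, games, prior_rate=0.5, prior_games=4):
    """Shrink a raw rate toward prior_rate by adding pseudo-games."""
    return (wins + prior_rate * prior_games) / (games + prior_games)


def trailing_winrates(results, window=5, prior_games=4):
    """Smoothed winrate over the previous `window` games for each game.

    `results` is a chronological list of 1 (win) / 0 (loss). The game being
    predicted is excluded from its own feature to avoid leakage.
    """
    features = []
    for i in range(len(results)):
        past = results[max(0, i - window):i]
        features.append(smoothed_rate(sum(past), len(past), prior_games=prior_games))
    return features


print(smoothed_rate(1, 1))   # a 1-0 record, pulled toward 0.5
print(smoothed_rate(0, 1))   # a 0-1 record, pulled toward 0.5
print(trailing_winrates([1, 1, 0, 1]))
```

A larger `prior_games` shrinks harder toward 0.5; the same function applies to cover rates by passing covers instead of wins.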

### Standings
* &#x270B; playoff situation
* &#x270B; is one team tanking?

### Matchup
* &#x2714; is intra-division
* &#x2714; is intra-conference
* &#x2714; head-to-head record recent history
* &#x270B; recent record vs. similar team
* &#x270B; TE vs. TE coverage, net rating
* &#x23F3; *eventually get scheme data*

### Coach
* &#x270B; coach head-to-head
* &#x270B; coach record
* &#x270B; team vs. coach

### Travel
* &#x2714; travel from last game
* &#x2714; distance from home stadium
* &#x2714; travel distance from last game, decayed by time (cross-country in 10 days is not as bad as 5)
* &#x23F8; sum of travel over the last 3 and 5 games; sum of distance from home; sum of timezone changes
* &#x23F8; sum of consecutive travel distances (multiplied by the number of repeats?), to weight up multiple trips in a row
* Convolution of travel
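
The distance and time-decay features above could be computed along these lines. This is a sketch only: the haversine great-circle formula stands in for whatever distance source the project uses, the exponential half-life form and `half_life_days=5.0` are assumptions, and the coordinates are approximate city locations rather than stadium data.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def decayed_travel(distance_miles, days_since_travel, half_life_days=5.0):
    """Travel distance discounted by recovery time (halves every half_life_days)."""
    return distance_miles * 0.5 ** (days_since_travel / half_life_days)


# Approximate Seattle -> Miami trip, felt after 5 days vs. 10 days of rest.
trip = haversine_miles(47.6, -122.3, 25.8, -80.2)
print(round(trip), round(decayed_travel(trip, 5)), round(decayed_travel(trip, 10)))
```

Summing `decayed_travel` over a team's recent trips gives one simple form of the "convolution of travel" idea: each past trip contributes less the longer ago it happened.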

### Home/Visitor
* &#x1F6A8; binary H/V
* &#x1F6A8; consecutive home/away games
* &#x1F6A8; road games last 4 weeks
* &#x1F6A8; home/away record
* &#x1F6A8; home/away record ATS
* &#x270B; strength of this team's home field, e.g. its recent net home/visitor record
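
The streak and recent-road-game features above can be derived from a team's chronological home/away flags. This is a minimal sketch with hypothetical function names; it counts the last 4 games rather than the last 4 calendar weeks, so bye weeks are ignored.

```python
def home_away_streaks(is_home):
    """Signed streak entering each game.

    +n means the previous n games were all at home, -n means the previous n
    were all on the road, 0 means no prior games.
    """
    streaks, current = [], 0
    for home in is_home:
        streaks.append(current)  # record the streak before this game is played
        if home:
            current = current + 1 if current > 0 else 1
        else:
            current = current - 1 if current < 0 else -1
    return streaks


def road_games_last_n(is_home, n=4):
    """Number of road games among the previous n games, for each game."""
    return [
        sum(1 for home in is_home[max(0, i - n):i] if not home)
        for i in range(len(is_home))
    ]


schedule = [True, True, False, False, False, True]
print(home_away_streaks(schedule))
print(road_games_last_n(schedule))
```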

### Injuries
* &#x23F3; key injuries by position/snaps played
* &#x23F3; number of snaps missing?
* &#x23F3; new injuries (serious injuries could counter features indicating recent success)

### Players
* &#x23F3; need additional datasets for most of these
* &#x23F3; *QB's QBR/DVOA last 5, 16, 32 games, etc*
* &#x23F3; position players' rankings

### Spread
* &#x1F6A8; raw spread
* &#x23F8; team's records in games with roughly this spread
* &#x23F8; historic cover rate at this spread
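
For the spread-conditioned rates above, one option is to bucket spreads and track the cover rate of earlier games in the same bucket. This is a sketch under assumptions: the 3-point bucket width and the function names are illustrative, and the input is a plain chronological list rather than the project's label tables.

```python
import math
from collections import defaultdict


def spread_bucket(spread, width=3.0):
    """Bucket index for a spread; width=3 groups e.g. [-6, -3) together."""
    return math.floor(spread / width)


def historic_cover_rate_by_spread(games, width=3.0):
    """Cover rate of earlier games in the same spread bucket, per game.

    `games` is a chronological list of (spread, covered) pairs. Returns None
    where no earlier game shares the bucket.
    """
    seen = defaultdict(lambda: [0, 0])  # bucket -> [covers, games]
    rates = []
    for spread, covered in games:
        covers, count = seen[spread_bucket(spread, width)]
        rates.append(covers / count if count else None)
        # Update only after emitting the feature, so no game sees its own label.
        seen[spread_bucket(spread, width)][0] += int(covered)
        seen[spread_bucket(spread, width)][1] += 1
    return rates


history = [(-3.5, True), (-4.0, False), (2.5, True), (-5.5, True)]
print(historic_cover_rate_by_spread(history))
```

Sparse buckets give noisy rates, so these values are natural candidates for the same shrink-toward-0.5 smoothing used for records.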

### Weather
* &#x1F6A8; temperature
* &#x1F6A8; is_precipitation
* &#x1F6A8; is_snow
* &#x23F8; others?


### Questions
* how to handle the "this team" vs. "opponent" aspect, e.g. rankings
 * net difference? 
 * keep them separate?
 * proxy for the net difference, e.g. how they fared recently
* how to get differences for things
 * maybe the difference from 1-16 is bigger than 17-32
 * could do some sort of binning of similar matchups
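
The options in these questions are not mutually exclusive: a feature row can carry both teams' raw values, their net difference, and a binned matchup, leaving the model to pick what helps. A hedged sketch, with hypothetical names and an arbitrary tier size of 8 ranks:

```python
def rank_tier(rank, tier_size=8):
    """Zero-based tier for a 1-based rank (ranks 1-8 -> 0, 9-16 -> 1, ...)."""
    return (rank - 1) // tier_size


def matchup_features(team_rank, opp_rank, tier_size=8):
    """Raw ranks, net difference, and a binned matchup for one game row."""
    return {
        "team_rank": team_rank,
        "opp_rank": opp_rank,
        # Positive when this team is ranked better (lower number) than the opponent.
        "rank_diff": opp_rank - team_rank,
        # Tier pair lets a model treat e.g. top-8 vs. bottom-8 as its own case.
        "tier_matchup": (rank_tier(team_rank, tier_size), rank_tier(opp_rank, tier_size)),
    }


print(matchup_features(3, 20))
```

The tier pair is one way to capture non-linear gaps between ranks that a single difference would flatten.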
