
Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [0]:
#use first principles to figure out end data frame
#what are the foundational differences, keep it small, build from there.

#due to the many data frames, my goal is to build one comprehensive data frame
#this one data frame will be used to try to predict who will win each game.

#target
#'outcome' column from the game csv

#regression
#win, loss, or tie will be numerically represented

#target distribution
#need to explore

#I will train/test/val on each season separately.
#If this doesn't work, I will try multiple seasons at a time. 
#Perhaps a 'returning_players' parameter giving a % of how many players stay YoY could be useful
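
The "need to explore" step for the target distribution could start with something like the sketch below. It assumes every `outcome` string follows the `<side> win <REG|OT|SO>` pattern seen in the sample rows of game.csv; the column names `home_win` and `settled_in` are hypothetical, not part of the dataset.

```python
import pandas as pd

# Outcome strings copied from the first rows of game.csv shown in this notebook.
sample = pd.DataFrame({
    "outcome": ["home win OT", "away win REG", "home win OT",
                "home win REG", "away win REG"],
})

# Split e.g. "home win OT" into the winning side and how the game was settled.
parts = sample["outcome"].str.split(" ", expand=True)
sample["home_win"] = (parts[0] == "home").astype(int)  # hypothetical numeric target
sample["settled_in"] = parts[2]                        # REG, OT, or SO

# Share of each class; the largest share is the majority class frequency.
class_share = sample["home_win"].value_counts(normalize=True)
majority_share = class_share.max()
print(class_share)
print("majority class share:", majority_share)
```

On the full data frame the same `value_counts(normalize=True)` call would show whether the majority class falls inside the 50% to 70% range mentioned in the assignment.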



In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#upload local file
from google.colab import files
import io

In [0]:
#choose file button
uploaded = files.upload()

In [0]:
#uploading the game df to start exploration
df = pd.read_csv(io.BytesIO(uploaded['game.csv']))

In [8]:
df.head()

Unnamed: 0,game_id,season,type,date_time,date_time_GMT,away_team_id,home_team_id,away_goals,home_goals,outcome,home_rink_side_start,venue,venue_link,venue_time_zone_id,venue_time_zone_offset,venue_time_zone_tz
0,2011030221,20112012,P,2012-04-29,2012-04-29T19:00:00Z,1,4,3,4,home win OT,right,Wells Fargo Center,/api/v1/venues/null,America/New_York,-4,EDT
1,2011030222,20112012,P,2012-05-01,2012-05-01T23:30:00Z,1,4,4,1,away win REG,right,Wells Fargo Center,/api/v1/venues/null,America/New_York,-4,EDT
2,2011030223,20112012,P,2012-05-03,2012-05-03T23:30:00Z,4,1,3,4,home win OT,left,Prudential Center,/api/v1/venues/null,America/New_York,-4,EDT
3,2011030224,20112012,P,2012-05-06,2012-05-06T23:30:00Z,4,1,2,4,home win REG,left,Prudential Center,/api/v1/venues/null,America/New_York,-4,EDT
4,2011030225,20112012,P,2012-05-08,2012-05-08T23:30:00Z,1,4,3,1,away win REG,right,Wells Fargo Center,/api/v1/venues/null,America/New_York,-4,EDT


In [35]:
#bruins
#if a model works for one team, it *could* work broadly for others
#this may turn out to be a bad assumption
#this filter returns the df for the Bruins
#or statement for home and away games

bruins_id_condition = (df['home_team_id'] == 6) | (df['away_team_id'] == 6)
df_bruins = df[bruins_id_condition]
df_bruins.head()

Unnamed: 0,game_id,season,type,date_time,date_time_GMT,away_team_id,home_team_id,away_goals,home_goals,outcome,home_rink_side_start,venue,venue_link,venue_time_zone_id,venue_time_zone_offset,venue_time_zone_tz
11,2010030311,20102011,P,2011-05-15,2011-05-15T00:00:00Z,14,6,5,2,away win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
12,2010030312,20102011,P,2011-05-18,2011-05-18T00:00:00Z,14,6,5,6,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
13,2010030313,20102011,P,2011-05-20,2011-05-20T00:00:00Z,6,14,2,0,away win REG,left,St. Pete Times Forum,/api/v1/venues/null,America/New_York,-4,EDT
14,2010030314,20102011,P,2011-05-21,2011-05-21T17:30:00Z,6,14,3,5,home win REG,left,St. Pete Times Forum,/api/v1/venues/null,America/New_York,-4,EDT
15,2010030315,20102011,P,2011-05-24,2011-05-24T00:00:00Z,14,6,1,3,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
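
The home-or-away filter above can be wrapped in a helper so the same logic works for any team. This is only a sketch: `team_games`, `is_home`, and `team_won` are hypothetical names, and the win flag assumes the `outcome` string always starts with `home` or `away`, as it does in the rows shown here.

```python
import pandas as pd

def team_games(games: pd.DataFrame, team_id: int) -> pd.DataFrame:
    """Return the games a team played, with two hypothetical helper columns."""
    played = (games["home_team_id"] == team_id) | (games["away_team_id"] == team_id)
    out = games[played].copy()
    # True when the team was the home side.
    out["is_home"] = out["home_team_id"] == team_id
    # The team won if the winning side matches the side it played on.
    home_won = out["outcome"].str.startswith("home")
    out["team_won"] = home_won == out["is_home"]
    return out

# Rows copied from the outputs above (team 6 is the Bruins), plus one game without them.
sample = pd.DataFrame({
    "away_team_id": [14, 14, 6, 6, 14, 1],
    "home_team_id": [6, 6, 14, 14, 6, 4],
    "outcome": ["away win REG", "home win REG", "away win REG",
                "home win REG", "home win REG", "home win OT"],
})
bruins_sample = team_games(sample, 6)
print(bruins_sample)
```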


In [0]:
#seasons
#the NHL season runs from early October to early April, with playoffs through mid-June
#the masks below cover 9 seasons (2010-11 through 2018-19)
#I'll split the set by season,
#as some years are very different from others for certain teams

#each window runs September 1 to July 1, in case a season started early one year
#2010 to 2011 season
season_10to11 = (df['date_time'] > '2010-09-01') & (df['date_time'] < '2011-07-01')

#2011 to 2012
season_11to12 = (df['date_time'] > '2011-09-01') & (df['date_time'] < '2012-07-01')

#2012 to 2013
season_12to13 = (df['date_time'] > '2012-09-01') & (df['date_time'] < '2013-07-01')

#2013 to 2014
season_13to14 = (df['date_time'] > '2013-09-01') & (df['date_time'] < '2014-07-01')

#2014 to 2015
season_14to15 = (df['date_time'] > '2014-09-01') & (df['date_time'] < '2015-07-01')

#2015 to 2016
season_15to16 = (df['date_time'] > '2015-09-01') & (df['date_time'] < '2016-07-01')

#2016 to 2017
season_16to17 = (df['date_time'] > '2016-09-01') & (df['date_time'] < '2017-07-01')

#2017 to 2018
season_17to18 = (df['date_time'] > '2017-09-01') & (df['date_time'] < '2018-07-01')

#2018 to 2019
season_18to19 = (df['date_time'] > '2018-09-01') & (df['date_time'] < '2019-07-01')

#need to try to implement a function here
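
One possible shape for that function is sketched below. It keeps the same September 1 to July 1 window and the same string comparison on `date_time` used above; `season_mask` and `season_masks` are hypothetical names.

```python
import pandas as pd

def season_mask(games, start_year):
    """Boolean mask for games after Sept 1 of start_year and before July 1 of the next year."""
    start = f"{start_year}-09-01"
    end = f"{start_year + 1}-07-01"
    # ISO-formatted date strings compare correctly as plain strings.
    return (games["date_time"] > start) & (games["date_time"] < end)

# Dates copied from rows shown elsewhere in this notebook.
sample = pd.DataFrame({
    "date_time": ["2010-10-07", "2011-05-15", "2011-06-14",
                  "2012-04-29", "2012-05-01"],
})

# One mask per season, keyed like the variables above ("10to11", "11to12", ...).
season_masks = {
    f"{year % 100}to{(year + 1) % 100}": season_mask(sample, year)
    for year in range(2010, 2019)
}
print({name: int(mask.sum()) for name, mask in season_masks.items()})
```

The data frame also carries a `season` column (for example 20102011), so grouping on that column may turn out to be a simpler alternative to date windows.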

In [62]:
#pass the season condition, sort season from beginning to end
df_10to11 = df[season_10to11].sort_values(by=['date_time'])
df_10to11

Unnamed: 0,game_id,season,type,date_time,date_time_GMT,away_team_id,home_team_id,away_goals,home_goals,outcome,home_rink_side_start,venue,venue_link,venue_time_zone_id,venue_time_zone_offset,venue_time_zone_tz
10725,2010020002,20102011,R,2010-10-07,2010-10-07T23:00:00Z,4,5,3,2,away win REG,left,CONSOL Energy Center,/api/v1/venues/null,America/New_York,-4,EDT
9625,2010020003,20102011,R,2010-10-07,2010-10-07T16:00:00Z,12,30,4,3,away win REG,right,Hartwall Areena,/api/v1/venues/null,America/Chicago,-5,CDT
11078,2010020001,20102011,R,2010-10-07,2010-10-07T23:00:00Z,8,10,2,3,home win REG,right,Air Canada Centre,/api/v1/venues/null,America/Toronto,-4,EDT
10295,2010020008,20102011,R,2010-10-08,2010-10-08T16:00:00Z,30,12,1,2,home win SO,right,Hartwall Areena,/api/v1/venues/null,America/New_York,-4,EDT
9472,2010020011,20102011,R,2010-10-08,2010-10-08T23:30:00Z,15,11,2,4,home win REG,,Philips Arena,/api/v1/venues/null,America/New_York,-4,EDT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,2010030413,20102011,P,2011-06-07,2011-06-07T00:00:00Z,23,6,1,8,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
149,2010030414,20102011,P,2011-06-09,2011-06-09T00:00:00Z,23,6,0,4,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
150,2010030415,20102011,P,2011-06-11,2011-06-11T00:00:00Z,6,23,0,1,home win REG,right,Rogers Arena,/api/v1/venues/null,America/Vancouver,-7,PDT
151,2010030416,20102011,P,2011-06-14,2011-06-14T00:00:00Z,23,6,2,5,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
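
With a season sorted by date, a time-based train/validate/test split within that season could look like the sketch below. The 60/20/20 fractions, the function name `chronological_split`, and the ten made-up rows are all assumptions for illustration, not choices made in this project.

```python
import pandas as pd

def chronological_split(season_games, train_frac=0.6, val_frac=0.2):
    """Split one season into train/val/test by date, earliest games first."""
    ordered = season_games.sort_values("date_time")
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]

# Ten made-up games on consecutive days, deliberately out of order.
dates = pd.date_range("2010-10-07", periods=10, freq="D").strftime("%Y-%m-%d")
toy_season = pd.DataFrame({
    "game_id": range(10),              # placeholder ids, not real game ids
    "date_time": list(reversed(dates)),
})

train, val, test = chronological_split(toy_season)
print(len(train), len(val), len(test))
```

Because later games never appear in the training portion, this kind of split avoids using future results to predict earlier games.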


In [57]:
#bruins 2010 to 2011, won the Stanley Cup this year
bruins_id_condition = (df_10to11['home_team_id'] == 6) | (df_10to11['away_team_id'] == 6)
df_bruins_10to11 = df_10to11[bruins_id_condition]
df_bruins_10to11.tail() #last game, won in June - Stanley Cup win

Unnamed: 0,game_id,season,type,date_time,date_time_GMT,away_team_id,home_team_id,away_goals,home_goals,outcome,home_rink_side_start,venue,venue_link,venue_time_zone_id,venue_time_zone_offset,venue_time_zone_tz
148,2010030413,20102011,P,2011-06-07,2011-06-07T00:00:00Z,23,6,1,8,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
149,2010030414,20102011,P,2011-06-09,2011-06-09T00:00:00Z,23,6,0,4,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
150,2010030415,20102011,P,2011-06-11,2011-06-11T00:00:00Z,6,23,0,1,home win REG,right,Rogers Arena,/api/v1/venues/null,America/Vancouver,-7,PDT
151,2010030416,20102011,P,2011-06-14,2011-06-14T00:00:00Z,23,6,2,5,home win REG,left,TD Garden,/api/v1/venues/null,America/New_York,-4,EDT
152,2010030417,20102011,P,2011-06-16,2011-06-16T00:00:00Z,6,23,4,0,away win REG,right,Rogers Arena,/api/v1/venues/null,America/Vancouver,-7,PDT
