# Model Building

Now that we've explored the grouped team data, it's time to start building some logistic regressions using the 'result' column as our binary outcome variable.

First, we'll attempt this regression on all features to get a baseline for how well the data can predict the outcome without any transformations. We'll start by importing the sklearn modules that will be used for performing and evaluating the logistic regression.

In [54]:
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

### Splitting The Data

We'll split the teamDF dataframe into four arrays: first into predictors and the outcome, then into training and testing sets. Note that `.values` returns NumPy arrays, so the original dataframe index is not carried along.

In [56]:
reg_teamDF = teamDF.copy()
regressors = list(set(list(reg_teamDF)) - set(['result']))
X = reg_teamDF.loc[:, regressors].values
y = reg_teamDF.loc[:,'result'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state=10)

### Fitting The Model

Next, we'll fit the logistic regression model.

In [57]:
try:
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
except Exception as ex:
    print(ex)

could not convert string to float: 'Red'


The fit fails. The error shown complains about a string value, but our inputs also contain NaN's, which LogisticRegression cannot fit on either. We'll handle the missing values first.

### Handling Missing Data

First, we know from earlier calls to .info() that there are a few columns in teamDF that are all NaN (these are all columns that have values in playerDF but not in teamDF). We won't be needing any of these columns. This and the following NaN transformations will be done on a duplicate dataframe.

In [58]:
print('Columns with all NaN values --', teamDF.columns[teamDF.isnull().all()].tolist())

reg_teamDF = teamDF.copy()
reg_teamDF.dropna(axis = 'columns', how = 'all', inplace = True)
print('Columns with all NaN values --', reg_teamDF.columns[reg_teamDF.isnull().all()].tolist())

Columns with all NaN values -- ['doubles', 'triples', 'quadras', 'pentas', 'dmgshare', 'earnedgoldshare']
Columns with all NaN values -- []


That takes care of entirely NaN columns, but we still have to deal with columns containing some NaN values. We can either completely drop columns containing NaN's or impute missing values. Since so many columns have at least a few missing values, there would be very few predictors remaining if we went with the former, so we'll go with imputing.

We'd like to impute either by median values or mean values. Mean is preferable to median, provided the mean is not being influenced by a large number of very large outliers.

**Assumption - Note:** *very large outliers* in this case will refer to observations more than three standard deviations from the mean, while *large number* will mean roughly 1% of observations (the code below uses a cutoff of 100 rows).

In [59]:
null_cols = reg_teamDF._get_numeric_data().columns[reg_teamDF._get_numeric_data().isnull().any()].tolist()
mean = reg_teamDF[null_cols].mean()
sd = reg_teamDF[null_cols].std()

outliers = {c:{'mean':mean[c], 'sd':sd[c],
               'values':[v for v in reg_teamDF[c] if abs(v - mean[c]) > 3*sd[c]]} for c in null_cols}

for k in outliers.keys():
    if len(outliers[k]['values']) >= 100:
        print('Feature: ', k)
        print('Mean: {} and St Dev: {}, Outlier Count: {}'.format(outliers[k]['mean'],
                                                                  outliers[k]['sd'], len(outliers[k]['values'])), '\n')

Feature:  date
Mean: 39655.90773846363 and St Dev: 11001.897307010122, Outlier Count: 694 

Feature:  fbtime
Mean: 6.559774005912805 and St Dev: 3.852554686429645, Outlier Count: 128 

Feature:  dmgtochamps
Mean: 72155.69578713969 and St Dev: 29205.11791982077, Outlier Count: 122 

Feature:  wardkills
Mean: 50.67151744056876 and St Dev: 21.10930304233139, Outlier Count: 110 

Feature:  visionwards
Mean: 31.480949041608227 and St Dev: 14.538736402749127, Outlier Count: 146 

Feature:  goldspent
Mean: 59821.31729490022 and St Dev: 15237.27190880845, Outlier Count: 101 



Okay, all of those features look like reasonable choices for median imputation. All the other features containing NaN's will be corrected with mean imputation.

In [60]:
median_cols = [o for o in outliers.keys() if len(outliers[o]['values']) >= 100]
mean_cols = list(set(null_cols) - set(median_cols))

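# Compute the fill values only on the columns being imputed, so non-numeric columns are never aggregated
reg_teamDF[mean_cols] = reg_teamDF[mean_cols].fillna(reg_teamDF[mean_cols].mean())
reg_teamDF[median_cols] = reg_teamDF[median_cols].fillna(reg_teamDF[median_cols].median())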

if not reg_teamDF.isnull().values.any():
    print("NaN's successfully imputed")
else:
    print("NaN's remain in: ", reg_teamDF.columns[reg_teamDF.isnull().any()].tolist())

NaN's remain in:  ['url', 'ban1', 'ban2', 'ban3', 'ban4', 'ban5', 'aggression']


Since we still have NaN's in the non-numeric columns, let's see how serious a problem it is.

In [61]:
reg_teamDF.isnull().sum().sort_values(ascending = False)[:7]/len(reg_teamDF)

ban5          0.448062
ban4          0.445465
url           0.163606
ban1          0.002875
ban3          0.002690
ban2          0.001020
aggression    0.000742
dtype: float64

Okay, there are very few observations with missing data in features 'ban1', 'ban2', 'ban3', and 'aggression', so it shouldn't affect the regression too strongly if we outright drop those observations.

'url' refers to the url where the complete match data for each observation is being stored, so while this feature may be useful for adding more features to the dataset at some point, it has no benefit as a regressor and can be dropped.

In [62]:
reg_teamDF.dropna(axis='rows',subset=['ban1','ban2','ban3','aggression'], how = 'any', inplace=True)
reg_teamDF.drop(['url'], axis='columns', errors = 'ignore', inplace= True)

The same cannot be said for 'ban4' and 'ban5', which have missing data in nearly half their observations. This is because a 4th and 5th ban in the champion selection stage of matches was only added to the game in the last year. There are several ways to handle this issue. We could split the dataframe into observations that have values for those features and observations that don't, and then perform separate regressions on each. We could impute the most frequent entry (the mode) of each column, since both are categorical and have no median. Or, we could drop the features.
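The imputation option can be sketched with the most frequent value (the mode), which is the usual stand-in for a median on categorical data. This is a minimal illustration on a toy frame, not the notebook's data; `toy` and the `impute_mode` helper are hypothetical names introduced here.

```python
import pandas as pd

def impute_mode(df, columns):
    """Return a copy of df with NaN's in `columns` replaced by each column's most frequent value."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)
        if not modes.empty:  # an all-NaN column has no mode; leave it untouched
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Toy stand-in for the ban columns (values are illustrative only)
toy = pd.DataFrame({
    'ban4': ['Syndra', None, 'Syndra', 'Ryze', None],
    'ban5': [None, 'LeBlanc', 'LeBlanc', None, 'Zac'],
})
filled = impute_mode(toy, ['ban4', 'ban5'])
print(filled)
```

With nearly half the rows missing, this kind of fill would make one champion look far more banned than it really was, which is one reason dropping the features is a reasonable baseline choice.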

For this baseline regression, we'll go ahead and drop the features, but we'll revisit this problem as we fine-tune our model later on.

In [63]:
reg_teamDF.drop(['ban4','ban5'], axis='columns', errors = 'ignore', inplace= True)

if not reg_teamDF.isnull().values.any():
    print("NaN's successfully imputed")
else:
    print("NaN's remain in: ", reg_teamDF.columns[reg_teamDF.isnull().any()].tolist())

NaN's successfully imputed


### Fitting The Model (Again!)

Now that all the NaN's have been imputed or dropped, let's re-attempt our logistic regression.

In [64]:
regressors = list(set(list(reg_teamDF)) - set(['result']))
X = reg_teamDF.loc[:, regressors].values
y = reg_teamDF.loc[:,'result'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state=10)

try:
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
except Exception as ex:
    print(ex)

could not convert string to float: 'Blue'


In addition to NaN's, a logistic regression is also flustered by non-numeric features. For categorical features, we can handle this by creating dummy variables (which we'll get to later). Less meaningful features, like team names, can simply be dropped from the regression.
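As a preview of the dummy-variable approach, here is a minimal sketch using `pd.get_dummies` on a toy frame. The frame and its values are made up for illustration; only the column names 'side', 'pace', and 'result' come from the dataset.

```python
import pandas as pd

# Toy frame with two categorical predictors and the binary outcome
toy = pd.DataFrame({
    'side': ['Blue', 'Red', 'Blue', 'Red'],
    'pace': ['Walk', 'Amble', 'Sprint', 'Walk'],
    'result': [1, 0, 0, 1],
})

# drop_first=True drops one level per feature so the dummies are not
# perfectly collinear with the model's intercept
dummies = pd.get_dummies(toy, columns=['side', 'pace'], drop_first=True)
print(dummies.columns.tolist())
```

Each categorical feature with k levels becomes k - 1 indicator columns, which is what makes high-cardinality features expensive to encode this way.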

In [65]:
non_nums = list(reg_teamDF.select_dtypes(exclude=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']))
print(non_nums)
print(type(non_nums))

['gameid', 'league', 'split', 'week', 'game', 'patchno', 'playerid', 'side', 'position', 'player', 'team', 'champion', 'ban1', 'ban2', 'ban3', 'quality', 'pace', 'aggression', 'opp_quality']
<class 'list'>


First, let's attempt a regression only on the numeric columns, then we'll expand to include the categorical ones.

### Fitting The Model (Again, again)

We'll fit a model using only the numeric features of reg_teamDF.

In [66]:
regressors = list(set(list(reg_teamDF)) - set(non_nums) - set(['result']))
X = reg_teamDF.loc[:, regressors].values
y = reg_teamDF.loc[:,'result'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state=10)

try:
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
except Exception as ex:
    print(ex)

### Evaluating the regression

Success! Now we'll see how well this basic logistic regression on the numeric columns worked as a classifier.

In [67]:
# Store the result under a new name so the imported confusion_matrix function isn't overwritten
cm = confusion_matrix(y_test, pred)
cm

array([[1348,   16],
       [  23, 1294]])

That worked remarkably well. Suspiciously well actually. The confusion matrix says we correctly predicted 1348 losses as losses, and 1294 wins as wins. Let's look at the classification report as well.

In [68]:
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1364
          1       0.99      0.98      0.99      1317

avg / total       0.99      0.99      0.99      2681
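To connect the confusion matrix to the report, the sketch below recomputes accuracy, precision, and recall for the win class directly from the four counts shown above. It assumes sklearn's layout (rows are true labels, columns are predicted labels); the variable names are introduced here for illustration.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm = np.array([[1348, 16],
               [23, 1294]])

# For a binary problem, ravel() yields: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_win = tp / (tp + fp)   # of the predicted wins, how many were wins
recall_win = tp / (tp + fn)      # of the actual wins, how many were found

print(round(accuracy, 3), round(precision_win, 3), round(recall_win, 3))
```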



## Tuning the regression
***

The model was extremely successful in predicting wins and losses. However, it was provided with features that contained a mix of in-game and pre-game information. Since the goal of this analysis is to determine pre-game strategies teams can apply to increase their chances of winning, we'd like to make the model as independent of in-game information as possible.

Let's start by looking at the coefficients our model produced, and seeing which features had the strongest effect.

In [103]:
cfs = sorted(list(zip(lr.coef_[0], regressors)))
print("Strongest negative effects: ", *cfs[:10], sep ='\n')
print("\n Strongest positive effects: ", *cfs[-10:][::-1], sep='\n')

Strongest negative effects: 
(-0.66931419759105637, 'opptowerkills')
(-0.16104913974775251, 'gamelength')
(-0.1435244897245736, 'cspm')
(-0.11474658547703745, 'd')
(-0.11474658547703745, 'teamdeaths')
(-0.044145857595901346, 'oppdragkills')
(-0.023262376361339872, 'oppelementals')
(-0.021080678926604583, 'oppbaronkills')
(-0.020506763494470448, 'wpm')
(-0.014549505567118775, 'fdtime')

 Strongest positive effects: 
(0.56865153607190344, 'teamtowerkills')
(0.045068380266909708, 'a')
(0.033136257858742899, 'teamkills')
(0.033136257858742899, 'k')
(0.033110621435705825, 'elementals')
(0.031951266522651706, 'teamdragkills')
(0.023003349751880391, 'visionwards')
(0.019448053021457603, 'oppelders')
(0.016993247633173919, 'monsterkillsownjungle')
(0.016967567848715413, 'fbarontime')


We should examine the more easily interpretable odds ratios of the coefficients. Each value below is an odds ratio minus one, so it represents the proportional change in the odds (not the probability) of a win for a 1 unit increase in its respective predictor with all other predictors held constant.

In [130]:
ecfs = sorted(zip(np.exp(lr.coef_[0]) - 1, regressors))
print("Strongest negative effects: ", *ecfs[:10], sep ='\n')
print("\n Strongest positive effects: ", *ecfs[-10:][::-1], sep='\n')
print("\n Intercept: ", np.exp(lr.intercept_[0]))

Strongest negative effects: 
(-0.48794037087596698, 'opptowerkills')
(-0.14874976014268981, 'gamelength')
(-0.13370041551075995, 'cspm')
(-0.10840794234332196, 'd')
(-0.10840794234332196, 'teamdeaths')
(-0.043185611351744591, 'oppdragkills')
(-0.022993893166194268, 'oppelementals')
(-0.020860034578686681, 'oppbaronkills')
(-0.02029792975742517, 'wpm')
(-0.014444172975482283, 'fdtime')

 Strongest positive effects: 
(0.76588421404583751, 'teamtowerkills')
(0.046099390018568664, 'a')
(0.033691378219675938, 'teamkills')
(0.033691378219675938, 'k')
(0.033664878409895849, 'elementals')
(0.032467188362731658, 'teamdragkills')
(0.023269967241973388, 'visionwards')
(0.019638398350893205, 'oppelders')
(0.017138454210096965, 'monsterkillsownjungle')
(0.017112334649202143, 'fbarontime')

 Intercept:  0.99947259893
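To make the link between the two listings concrete, the sketch below converts two of the raw coefficients into the values shown in the second listing, then shows what such a change in odds does to a probability. The helper names are hypothetical, and the starting point of even odds is an assumption chosen only for illustration.

```python
import math

# Raw (log-odds) coefficients copied from the first listing
coef_teamtowerkills = 0.56865153607190344
coef_opptowerkills = -0.66931419759105637

def change_in_odds(coef):
    """Proportional change in the odds of a win for a 1 unit increase."""
    return math.exp(coef) - 1

def odds_to_prob(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)

print(change_in_odds(coef_teamtowerkills))
print(change_in_odds(coef_opptowerkills))

# Assumed starting point: even odds (probability 0.5). One extra tower kill
# multiplies the odds by exp(coef); the probability does not rise by 77%.
new_prob = odds_to_prob(1.0 * math.exp(coef_teamtowerkills))
print(new_prob)
```

The change in probability depends on where the odds start, which is why these values are read as changes in odds rather than changes in probability.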


**Quick Insights:**
***
- Tower kills are the strongest determinants (and we observed this correlation earlier in the analysis).
- A death is more damaging than a kill is beneficial, so 'trade' kills are not recommended.
- Gamelength should not have any effect, because there should be an equal number of wins and losses with each game length. Something strange is going on, likely with rows that were dropped with missing values.
- The negative effect of 'cspm' is also surprising. Why would teams that are more efficiently farming gold be less likely to win? Something strange is going on here.

Before addressing the problem of feature selection for a pre-game focused regression, let's take a quick detour to see what's going on with 'gamelength' and 'cspm' in the model.

**Correlations of cspm & gamelength to result:**

Let's check if the coefficients for cspm and gamelength in the regression can be validated by their correlations to result. Remembering that we stored the correlation matrix as `corr` earlier on:

In [85]:
corr.result[['cspm', 'gamelength']]

cspm          0.26258
gamelength    0.00000
Name: result, dtype: float64

As should be expected, gamelength has absolutely no correlation with result, while cspm has a weak-to-moderate **positive** correlation.

**Means of cspm & gamelength grouped by result:**

We can also examine the mean values for each of these features when grouped by result.

In [88]:
teamDF[['cspm', 'gamelength']].groupby(teamDF['result']).mean()

Unnamed: 0_level_0,cspm,gamelength
result,Unnamed: 1_level_1,Unnamed: 2_level_1
0,30.029181,36.314276
1,31.331264,36.314276


Again as expected, gamelength has an identical mean for wins and losses, while cspm is **higher** for wins than losses.

**Interpretation**

There are two possibilities to consider:

1. The regression was influenced by the rows containing NaN's that were removed from reg_teamDF. In such a case, if a disproportionate number of losses with short game lengths and/or wins with long game lengths were removed, the model would unfairly believe that longer games meant a greater likelihood of defeat. The same thinking applies to cspm.

2. The model is using gamelength and cspm to correct against the coefficient of another predictor.

The first possibility is fairly simple to test by applying the same evaluations to reg_teamDF as we did to teamDF.

In [89]:
reg_teamDF[['cspm', 'gamelength']].groupby(reg_teamDF['result']).mean()

Unnamed: 0_level_0,cspm,gamelength
result,Unnamed: 1_level_1,Unnamed: 2_level_1
0,30.139954,36.33699
1,31.224691,36.345179


In [90]:
reg_teamDF.corr().result[['cspm', 'gamelength']]

cspm          0.239161
gamelength    0.000519
Name: result, dtype: float64

Interestingly enough, the modified dataframe the model was built from not only has a similar correlation between cspm and result, but the correlation between gamelength and result is also positive (albeit negligibly so). This suggests that the dropped rows are not the cause of these coefficients; a relationship between the predictors is.

Something is happening where the coefficients for gamelength and cspm are being constructed as negative to counteract the effects of the other predictors. With gamelength, this is more easily interpretable because so many of the model's predictors typically have higher observed values in longer games (kills, deaths, gold earned, etc).

It is not as clear why cspm, a per-minute rate, is also being used as a balancing mechanism. My theory is that this is an indicator of a lack of independence among the predictors, and the correlation matrix heatmap generated earlier can confirm this. What cannot be confirmed or rejected from that heatmap, though, is the presence of multicollinearity among the predictors (instances where one predictor can be closely approximated by a linear combination of other predictors). This will be investigated later on.
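One common way to probe for multicollinearity is the variance inflation factor (VIF): regress each predictor on all the others and compute 1 / (1 - R²). The sketch below is a NumPy-only version run on synthetic data, not the notebook's method. The `vif` helper and the synthetic columns are assumptions made for illustration. The duplicated column mimics pairs like 'k' and 'teamkills', whose identical coefficients above suggest they may carry the same information.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # include an intercept
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        ss_res = np.sum((target - design @ beta) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res; a perfect fit gives infinity
        out[j] = ss_tot / ss_res if ss_res > 0 else np.inf
    return out

rng = np.random.default_rng(0)
kills = rng.normal(size=500)    # synthetic stand-ins, not real match data
deaths = rng.normal(size=500)

independent = np.column_stack([kills, deaths])
with_duplicate = np.column_stack([kills, deaths, kills])

print(vif(independent))
print(vif(with_duplicate))
```

Values near 1 indicate little overlap with the other predictors, while very large or infinite values flag a predictor that the others can reproduce almost exactly.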

### Logistic Regression w/ Categoricals

Before we get into feature selection to tune the model, we'd first like to examine a model that includes as predictors the categorical features that were dropped earlier on.

In [142]:
print(teamDF.describe(include = ['category']))

       league   split   week   game   patchno  playerid   side position  \
count   10782   10782  10782  10782  10782.00     10782  10782    10782   
unique     21      14     99     31     34.00         2      2        1   
top       LCK  2017-2     SF      2      7.04       200    Red     Team   
freq     1928    2538    448   3738    550.00      5391   5391    10782   

       player           team champion     ban1     ban2   ban3    ban4  \
count   10782          10782    10782    10751    10771  10753    5979   
unique      1            161        1      107      113    116     111   
top      Team  SK Telecom T1           LeBlanc  LeBlanc   Ryze  Syndra   
freq    10782            261    10782      659      474    476     297   

          ban5 quality    pace aggression opp_quality  
count     5951   10782   10782      10774       10782  
unique     115       5       5          5           5  
top     Syndra   Great  Sprint     Frisky       Great  
freq       308    3455    216

The summary above lists 20 categorical variables. Some of these won't be needed (like player, which is 'Team' for all entries, and champion, which is empty for all). However, some columns are also going to need to be added. Because the rows taken from the original dataframe to form teamDF don't have any champion-specific data, and we're ultimately trying to build a model that relies exclusively on pre-game decisions, we're going to need the five champion selections as additional columns.

**NOTE:** Bans and champion selections have fairly high cardinality (each of the 8-10 features has over 100 values). This means that adding dummy variables to represent all of these possible choices in the regression will add over 1000 columns to the dataframe. This may exceed the memory limits of the machine performing this analysis.
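A quick back-of-the-envelope count for the ban columns alone, using the `unique` row of the summary above. The champion pick columns are not counted because their cardinalities don't appear in that output.

```python
# Unique value counts copied from the describe() output above
ban_unique = {'ban1': 107, 'ban2': 113, 'ban3': 116, 'ban4': 111, 'ban5': 115}

# One dummy column per level
ban_dummies = sum(ban_unique.values())

# One fewer per feature if a reference level is dropped from each
ban_dummies_drop_first = sum(n - 1 for n in ban_unique.values())

print(ban_dummies, ban_dummies_drop_first)
```

If the five pick columns turn out to have similar cardinality (an assumption), the total would roughly double, which is in line with the estimate of over 1000 columns. Passing `sparse=True` to `pd.get_dummies` is one option worth trying if memory becomes a problem.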

## Work/Struggle In Progress

So, I'm running into something that I can't seem to figure out. I spent a couple of hours searching for and attempting solutions, but I can't figure out how to pivot the champion selections out of playerDF and concat them to teamDF without creating duplicate indexes to handle columns with missing values.

In other words, and as you'll see in the sub-slides, pivoting from playerDF on the position column with champion values gives a dataframe that has only one valid entry per row (the champion corresponding to the position value at that index in the original dataframe). I can't figure out how to collapse those rows, at least not elegantly. I can think of brute force techniques (writing a custom function for fillna that fills all entries of a gameid/team combination in a column with the only valid entry found, and then dropping duplicate rows from the final df), but I know there has to be a simple way to do this that I can't seem to suss out.

In [154]:
test = teamDF.copy()
test.set_index(['gameid', 'team'], inplace = True)
test.sort_index(inplace = True)
test.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,url,league,split,date,week,game,patchno,playerid,side,position,...,gdat15,xpat10,oppxpat10,xpdat10,winRate,totalGames,quality,pace,aggression,opp_quality
gameid,team,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1000029,Jin Air Green Wings,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42431.125278,8.1,1,6.04,200,Red,Team,...,1255.0,17648.0,16363.0,1285.0,0.447368,190,Fair,Amble,Moderate,Bad
1000029,SBENU Sonicboom,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42431.125278,8.1,1,6.04,100,Blue,Team,...,-1255.0,16363.0,17648.0,-1285.0,0.204545,44,Bad,Amble,Moderate,Fair
1000116,CJ Entus,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42433.124398,8.3,1,6.04,100,Blue,Team,...,2245.0,15909.0,13971.0,1938.0,0.336634,101,Poor,Walk,Moderate,Bad
1000116,SBENU Sonicboom,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42433.124398,8.3,1,6.04,200,Red,Team,...,-2245.0,13971.0,15909.0,-1938.0,0.204545,44,Bad,Walk,Moderate,Poor
1000306,Kongdoo Monster,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42438.124271,9.1,1,6.04,200,Red,Team,...,-1855.0,18789.0,19384.0,-595.0,0.293578,109,Bad,Amble,Moderate,Good
1000306,Longzhu Gaming,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42438.124271,9.1,1,6.04,100,Blue,Team,...,1855.0,19384.0,18789.0,595.0,0.511111,180,Good,Amble,Moderate,Bad
1000316,Kongdoo Monster,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42438.167106,9.1,2,6.04,100,Blue,Team,...,-12.0,17982.0,18337.0,-355.0,0.293578,109,Bad,Walk,Moderate,Good
1000316,Longzhu Gaming,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42438.167106,9.1,2,6.04,200,Red,Team,...,12.0,18337.0,17982.0,355.0,0.511111,180,Good,Walk,Moderate,Bad
1000320,Jin Air Green Wings,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42438.208461,9.1,1,6.04,100,Blue,Team,...,-618.0,19417.0,19337.0,80.0,0.447368,190,Fair,Walk,Alert,Great
1000320,KT Rolster,http://matchhistory.na.leagueoflegends.com/en/...,LCK,2016-1,42438.208461,9.1,1,6.04,200,Red,Team,...,618.0,19337.0,19417.0,-80.0,0.626126,222,Great,Walk,Alert,Fair


In [220]:
temp = playerDF.copy()
temp = temp.set_index(['gameid', 'team'])
temp.sort_index(inplace=True)
test = pd.concat([playerDF[['gameid', 'team']], playerDF.pivot(columns = 'position', values = 'champion')], axis = 'columns')
print(playerDF[['gameid','team','position','champion']].head(20))
test.head(20)

     gameid            team position    champion
0   1160150      KT Rolster      Top        Gnar
1   1160150      KT Rolster   Jungle      Gragas
2   1160150      KT Rolster   Middle       Varus
3   1160150      KT Rolster      ADC     Kog'Maw
4   1160150      KT Rolster  Support     Alistar
5   1160150  Longzhu Gaming      Top       Yasuo
6   1160150  Longzhu Gaming   Jungle     Hecarim
7   1160150  Longzhu Gaming   Middle      Viktor
8   1160150  Longzhu Gaming      ADC       Sivir
9   1160150  Longzhu Gaming  Support  Tahm Kench
12  1160184        ESC Ever      Top   Gangplank
13  1160184        ESC Ever   Jungle       Elise
14  1160184        ESC Ever   Middle       Karma
15  1160184        ESC Ever      ADC        Jhin
16  1160184        ESC Ever  Support       Janna
17  1160184   SK Telecom T1      Top        Gnar
18  1160184   SK Telecom T1   Jungle     Hecarim
19  1160184   SK Telecom T1   Middle     Taliyah
20  1160184   SK Telecom T1      ADC        Ashe
21  1160184   SK Tel

Unnamed: 0,gameid,team,ADC,Jungle,Middle,Support,Top
0,1160150,KT Rolster,,,,,Gnar
1,1160150,KT Rolster,,Gragas,,,
2,1160150,KT Rolster,,,Varus,,
3,1160150,KT Rolster,Kog'Maw,,,,
4,1160150,KT Rolster,,,,Alistar,
5,1160150,Longzhu Gaming,,,,,Yasuo
6,1160150,Longzhu Gaming,,Hecarim,,,
7,1160150,Longzhu Gaming,,,Viktor,,
8,1160150,Longzhu Gaming,Sivir,,,,
9,1160150,Longzhu Gaming,,,,Tahm Kench,


In [225]:
# test.set_index(['gameid', 'team'],inplace=True)
# test.sort_index(inplace=True)

# for c in ['Top', 'ADC', 'Jungle', 'Middle', 'Support']:
#     test[c].cat.add_categories([''], inplace=True)
# test[['Top', 'ADC', 'Jungle', 'Middle', 'Support']] = test[['Top', 'ADC', 'Jungle', 'Middle', 'Support']].fillna('')
temp = teamDF[['gameid','team', 'split', 'cspm', 'result']].copy()
temp = temp.set_index(['gameid','team'])
temp = temp.sort_index()
other = temp.join(test, how = 'inner')
other.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,split,cspm,result,ADC,Jungle,Middle,Support,Top
gameid,team,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1000029,Jin Air Green Wings,2016-1,28.921042,1,,,,,Trundle
1000029,Jin Air Green Wings,2016-1,28.921042,1,,Elise,,,
1000029,Jin Air Green Wings,2016-1,28.921042,1,,,LeBlanc,,
1000029,Jin Air Green Wings,2016-1,28.921042,1,Ashe,,,,
1000029,Jin Air Green Wings,2016-1,28.921042,1,,,,Alistar,
1000029,SBENU Sonicboom,2016-1,29.516329,0,,,,,Poppy
1000029,SBENU Sonicboom,2016-1,29.516329,0,,Nidalee,,,
1000029,SBENU Sonicboom,2016-1,29.516329,0,,,Jayce,,
1000029,SBENU Sonicboom,2016-1,29.516329,0,Caitlyn,,,,
1000029,SBENU Sonicboom,2016-1,29.516329,0,,,,Braum,
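One possible way to collapse those rows is to make gameid and team part of the index before reshaping, so each gameid/team pair becomes a single row. Calling `pivot` without an index keeps the original row index, which is why every player row survives as its own mostly empty row. The sketch below uses a toy frame built from the first ten rows printed above. The `players`, `teams`, `wide`, and `combined` names are hypothetical, and the 'result' values are placeholders. It assumes each gameid/team/position combination appears exactly once.

```python
import pandas as pd

# Toy stand-in for playerDF, copied from the rows printed above
players = pd.DataFrame({
    'gameid': [1160150] * 10,
    'team': ['KT Rolster'] * 5 + ['Longzhu Gaming'] * 5,
    'position': ['Top', 'Jungle', 'Middle', 'ADC', 'Support'] * 2,
    'champion': ['Gnar', 'Gragas', 'Varus', "Kog'Maw", 'Alistar',
                 'Yasuo', 'Hecarim', 'Viktor', 'Sivir', 'Tahm Kench'],
})

# Move position from the index into columns: one row per (gameid, team)
wide = (players
        .set_index(['gameid', 'team', 'position'])['champion']
        .unstack('position'))

# Toy stand-in for teamDF, indexed the same way
teams = pd.DataFrame({
    'gameid': [1160150, 1160150],
    'team': ['KT Rolster', 'Longzhu Gaming'],
    'result': [1, 0],  # placeholder outcomes, not taken from the data
}).set_index(['gameid', 'team'])

combined = teams.join(wide, how='inner')
print(combined)
```

If a combination were duplicated in the real data, `unstack` would raise an error, so it may be worth checking for duplicates first.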


## Questions for Alex:

1. The site I was using for data updated and now has some kind of inherent scraping protection. I'm trying to get around that by changing headers for urllib Request() but am getting a ValueError "Must explicitly set engine if not passing in buffer or path for io."

    I've included the code in a sub-slide.

2. Narrative voice -- I've been going back and forth w/o any rhyme or reason between first person singular, first person plural, and third person. As I continue working and editing, I'll clean that up for consistency. Do you have any best practice recommendations on which style to adopt among those three?

3. Should I be scaling and centering numerical predictors in logistic regression so that I can interpret the odds ratio coefficients more comparably? For instance, I feel like it's not very useful to compare the odds ratios for a 1 unit increase in minions killed vs towers killed, since they are on such largely different scales.
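For question 3, a minimal sketch of what centering and scaling would look like, using NumPy only. The `standardize` helper and the toy numbers (a minions-like column and a towers-like column) are assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Center each column at 0 and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant columns at zero instead of dividing by zero
    return (X - mean) / sd

# Toy predictors on very different scales (values are made up)
toy = np.array([[950.0, 3.0],
                [1100.0, 9.0],
                [1020.0, 6.0],
                [870.0, 2.0]])
scaled = standardize(toy)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After this transformation, a coefficient describes the effect of a one standard deviation increase rather than a one unit increase, which puts predictors on a comparable footing. Since sklearn's LogisticRegression applies L2 regularization by default, scaling also changes how that penalty is shared across predictors.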

In [112]:
# List of excel file url's to read and join together
root_url = "http://oracleselixir.com/gamedata/"
urls = ["2016-spring/", "2017-spring/", "2017-summer/"]
urls = [root_url + url for url in urls]

from urllib.request import Request, urlopen

reqs = [urlopen(Request(url, headers={'User-Agent': 'Chrome'})).read() for url in urls]
partial_dfs = [pd.read_excel(req, 'Sheet1') for req in reqs]
df = pd.concat(partial_dfs, ignore_index = True)

ValueError: Must explicitly set engine if not passing in buffer or path for io.
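A likely cause of this error is that `urlopen(...).read()` returns raw bytes, while `pd.read_excel` expects a path or a file-like buffer. Wrapping the bytes in `io.BytesIO` is one thing to try. The sketch below demonstrates the wrapping offline, with CSV bytes and `pd.read_csv` standing in for the downloaded Excel files. The `frames_from_bytes` helper is a hypothetical name, and whether the site's scraping protection accepts the request is a separate question this does not address.

```python
import io
import pandas as pd

def frames_from_bytes(payloads, reader, **reader_kwargs):
    """Wrap each raw bytes payload in a BytesIO buffer and pass it to `reader`."""
    return [reader(io.BytesIO(raw), **reader_kwargs) for raw in payloads]

# Offline stand-in: CSV bytes instead of downloaded Excel files
payloads = [b"gameid,result\n1,0\n", b"gameid,result\n2,1\n"]
partial_dfs = frames_from_bytes(payloads, pd.read_csv)
df = pd.concat(partial_dfs, ignore_index=True)
print(df)

# In the notebook, the equivalent call would presumably be:
# partial_dfs = frames_from_bytes(reqs, pd.read_excel, sheet_name='Sheet1')
```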