Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv("E:/NotesAssignments/Unit-2/DS-Unit-2-Applied-Modeling/data/project-data/LoL-Ranked-Data.csv")

In [6]:
df.head()

,gameId,creationTime,gameDuration,seasonId,winner,firstBlood,firstTower,firstInhibitor,firstBaron,firstDragon,...,t2_towerKills,t2_inhibitorKills,t2_baronKills,t2_dragonKills,t2_riftHeraldKills,t2_ban1,t2_ban2,t2_ban3,t2_ban4,t2_ban5
0,3326086514,1504279457970,1949,9,1,2,1,1,1,1,...,5,0,0,1,1,114,67,43,16,51
1,3229566029,1497848803862,1851,9,1,1,1,1,0,1,...,2,0,0,0,0,11,67,238,51,420
2,3327363504,1504360103310,1493,9,1,2,1,1,1,2,...,2,0,0,1,0,157,238,121,57,28
3,3326856598,1504348503996,1758,9,1,1,1,1,1,1,...,0,0,0,0,0,164,18,141,40,51
4,3330080762,1504554410899,2094,9,1,2,1,1,1,1,...,3,0,0,1,0,86,11,201,122,18


In [8]:
df.set_index('gameId', inplace=True)

In [9]:
df.head()

gameId,creationTime,gameDuration,seasonId,winner,firstBlood,firstTower,firstInhibitor,firstBaron,firstDragon,firstRiftHerald,...,t2_towerKills,t2_inhibitorKills,t2_baronKills,t2_dragonKills,t2_riftHeraldKills,t2_ban1,t2_ban2,t2_ban3,t2_ban4,t2_ban5
3326086514,1504279457970,1949,9,1,2,1,1,1,1,2,...,5,0,0,1,1,114,67,43,16,51
3229566029,1497848803862,1851,9,1,1,1,1,0,1,1,...,2,0,0,0,0,11,67,238,51,420
3327363504,1504360103310,1493,9,1,2,1,1,1,2,0,...,2,0,0,1,0,157,238,121,57,28
3326856598,1504348503996,1758,9,1,1,1,1,1,1,0,...,0,0,0,0,0,164,18,141,40,51
3330080762,1504554410899,2094,9,1,2,1,1,1,1,0,...,3,0,0,1,0,86,11,201,122,18


In [10]:
target = "winner"

My problem is binary classification: either Team 1 wins or Team 2 wins.

In [12]:
df['winner'].value_counts(normalize=True)

1    0.506448
2    0.493552
Name: winner, dtype: float64

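My target is close to evenly distributed: Team 1 wins about 50.6% of games and Team 2 about 49.4%.
Because the majority class frequency falls between 50% and 70%, I can use accuracy as my evaluation metric.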
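
As a quick check on that decision, the majority-class frequency is also the accuracy of a baseline that always predicts the most common winner. Below is a minimal sketch; the `winner` series is a small made-up stand-in for `df['winner']`, so its numbers are illustrative and are not results from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df['winner']; replace with the real column.
winner = pd.Series([1, 2, 1, 1, 2, 2, 1, 2, 1, 1], name="winner")

# Relative frequency of each class.
class_freq = winner.value_counts(normalize=True)

# Always predicting the majority class yields this accuracy.
majority_class = class_freq.idxmax()
baseline_accuracy = class_freq.max()

# Rule of thumb from the assignment: accuracy alone is reasonable
# when the majority class frequency is >= 50% and < 70%.
accuracy_is_reasonable = 0.50 <= baseline_accuracy < 0.70

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"accuracy alone is reasonable: {accuracy_is_reasonable}")
```

Any model trained later should beat `baseline_accuracy` on the validation set to be worth keeping.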

In [13]:
df.columns

Index(['creationTime', 'gameDuration', 'seasonId', 'winner', 'firstBlood',
       'firstTower', 'firstInhibitor', 'firstBaron', 'firstDragon',
       'firstRiftHerald', 't1_champ1id', 't1_champ1_sum1', 't1_champ1_sum2',
       't1_champ2id', 't1_champ2_sum1', 't1_champ2_sum2', 't1_champ3id',
       't1_champ3_sum1', 't1_champ3_sum2', 't1_champ4id', 't1_champ4_sum1',
       't1_champ4_sum2', 't1_champ5id', 't1_champ5_sum1', 't1_champ5_sum2',
       't1_towerKills', 't1_inhibitorKills', 't1_baronKills', 't1_dragonKills',
       't1_riftHeraldKills', 't1_ban1', 't1_ban2', 't1_ban3', 't1_ban4',
       't1_ban5', 't2_champ1id', 't2_champ1_sum1', 't2_champ1_sum2',
       't2_champ2id', 't2_champ2_sum1', 't2_champ2_sum2', 't2_champ3id',
       't2_champ3_sum1', 't2_champ3_sum2', 't2_champ4id', 't2_champ4_sum1',
       't2_champ4_sum2', 't2_champ5id', 't2_champ5_sum1', 't2_champ5_sum2',
       't2_towerKills', 't2_inhibitorKills', 't2_baronKills', 't2_dragonKills',
       't2_riftHeraldKills', 't2_ban1', 't2_ban2', 't2_ban3', 't2_ban4',
       't2_ban5'],
      dtype='object')

In [14]:
features = ['firstBlood',
           'firstTower',
           'firstInhibitor',
           'firstBaron',
           'firstDragon',
           'firstRiftHerald',
           'gameDuration']

I will train my model on the features listed above, and I will split my data randomly.
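
A random split could look roughly like the sketch below. The 60/20/20 proportions, the `random_state` value, and the `toy` frame with its `toy_features` list are all assumptions made for illustration; the real notebook would use `df`, `features`, and `target` instead. scikit-learn's `train_test_split` is another common way to do the same thing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "firstBlood": rng.integers(0, 3, size=100),
    "firstTower": rng.integers(0, 3, size=100),
    "gameDuration": rng.integers(900, 3000, size=100),
    "winner": rng.integers(1, 3, size=100),
})

# Shuffle the rows once, then cut into 60% train, 20% validation, 20% test.
shuffled = toy.sample(frac=1, random_state=42)
n = len(shuffled)
train_end = int(n * 0.6)
val_end = int(n * 0.8)
train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

# Separate the feature matrix from the target vector for each subset.
toy_features = ["firstBlood", "firstTower", "gameDuration"]
toy_target = "winner"
X_train, y_train = train[toy_features], train[toy_target]
X_val, y_val = val[toy_features], val[toy_target]
X_test, y_test = test[toy_features], test[toy_target]

print(len(train), len(val), len(test))
```

Shuffling before slicing keeps the three subsets disjoint while still covering every row exactly once.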

In [15]:
df['firstBlood'].value_counts(normalize=True)

1    0.507147
2    0.482074
0    0.010779
Name: firstBlood, dtype: float64

In [16]:
df['firstTower'].value_counts(normalize=True)

1    0.502253
2    0.474189
0    0.023558
Name: firstTower, dtype: float64

In [17]:
df['firstInhibitor'].value_counts(normalize=True)

1    0.447737
2    0.430375
0    0.121888
Name: firstInhibitor, dtype: float64

While exploring the data, I found some games in which no tower or inhibitor was taken (the rows with a value of 0). Destroying these objectives is a prerequisite for victory in any game. Every game in the data has a winner (when I looked at the target, there were no stalemates: either Team 1 wins or Team 2 wins), so a game with a winner but none of these objectives taken suggests that either 1) the data is wrong, or 2) the losing team surrendered, for whatever reason, before the opposing team took any of these objectives.
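
One way to test the surrender explanation is to compare game length for these games against the rest. The sketch below assumes, as the paragraph above does, that a 0 means the objective was never taken; the `toy` rows are invented for illustration and are not real results.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, use df instead.
toy = pd.DataFrame({
    "firstTower":     [1, 2, 0, 1, 0, 2],
    "firstInhibitor": [1, 2, 0, 0, 0, 2],
    "gameDuration":   [1949, 1851, 190, 1493, 240, 2094],
    "winner":         [1, 2, 1, 1, 2, 2],
})

# Assumption: 0 means neither team took that objective during the game.
no_objectives = (toy["firstTower"] == 0) & (toy["firstInhibitor"] == 0)
n_no_objectives = int(no_objectives.sum())

# If these games are early surrenders, they should be much shorter
# than games in which objectives were taken.
median_without = toy.loc[no_objectives, "gameDuration"].median()
median_with = toy.loc[~no_objectives, "gameDuration"].median()

print("games with no tower or inhibitor taken:", n_no_objectives)
print("median duration without objectives:", median_without)
print("median duration with objectives:", median_with)
```

If the real data shows a similar gap in `gameDuration`, surrender is the more likely explanation; if not, the rows may simply be bad data and are candidates for exclusion.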

I do not believe any of these features leak, because none of them records whether the nexus was destroyed.
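
A rough way to probe that belief is to see how strongly a single feature determines the target by itself; a feature that predicts the winner almost perfectly deserves a closer look. The sketch below uses made-up values for `firstInhibitor` and `winner`, so the printed rates are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, use df instead.
toy = pd.DataFrame({
    "firstInhibitor": [1, 1, 1, 2, 2, 2, 0, 1, 2, 0],
    "winner":         [1, 1, 1, 2, 2, 2, 1, 2, 2, 2],
})

# Share of each winner within each firstInhibitor value (each row sums to 1).
win_rates = pd.crosstab(
    toy["firstInhibitor"], toy["winner"], normalize="index"
)

print(win_rates)
```

Rows close to 0 or 1 mean the feature nearly decides the outcome by itself, which is worth weighing against when that information would actually be available.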