Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70%? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

In [4]:
import pandas as pd

df = pd.read_csv('data/steam.csv')

In [5]:
df.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [15]:
# Set the target to 'owners'
target = 'owners'

In [16]:
df[target].value_counts()

0-20000                18596
20000-50000             3059
50000-100000            1695
100000-200000           1386
200000-500000           1272
500000-1000000           513
1000000-2000000          288
2000000-5000000          193
5000000-10000000          46
10000000-20000000         21
20000000-50000000          3
50000000-100000000         2
100000000-200000000        1
Name: owners, dtype: int64

In [None]:
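# 'owners' is recorded as 13 ordered brackets, so this will be a multiclass classification problem.
# The classes are heavily imbalanced: about 69% of games (18596 of 27075) fall in the lowest
# bracket (0-20000), with a long tail toward the larger brackets (right-skewed, not left-tailed).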
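
The imbalance can be summarized as a single number, the majority class share, which is also the accuracy of a baseline that always predicts the most common class. The following is a minimal sketch; `majority_class_share` and `toy_owners` are hypothetical names, and the toy counts are made up for illustration rather than taken from `steam.csv`.

```python
import pandas as pd


def majority_class_share(target: pd.Series) -> float:
    """Return the relative frequency of the most common class."""
    # normalize=True turns counts into proportions that sum to 1
    return float(target.value_counts(normalize=True).max())


# Toy stand-in for df['owners']; these counts are illustrative only.
toy_owners = pd.Series(
    ["0-20000"] * 7 + ["20000-50000"] * 2 + ["50000-100000"]
)

share = majority_class_share(toy_owners)
print(f"majority class share (baseline accuracy): {share:.0%}")
```

On the real data the same call would be `majority_class_share(df[target])`.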

In [None]:
# train, test, and val should be split by time using 'release_date'

In [47]:
# Features that should be dropped include:
# 'appid', 'name', and also 'categories' and 'steamspy_tags' (both nearly the same as 'genres')
# 'genres' needs to be encoded so that each genre value gets its own column

# Thankfully, with the target I've selected, it looks like there will be no leakage of
# future information ("time travel") in my model
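
One way to carry out the drop-and-encode step described above is sketched here. It assumes that `genres` uses `;` as its separator, like `platforms` and `steamspy_tags` do in the output above. `prepare_features` and the toy frame are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop redundant columns and one-hot encode the 'genres' column."""
    drop_cols = ["appid", "name", "categories", "steamspy_tags"]
    # errors="ignore" keeps the function usable if a column is already gone
    out = df.drop(columns=drop_cols, errors="ignore")
    # Assumption: multiple genres are separated by ';'
    genre_dummies = out["genres"].str.get_dummies(sep=";").add_prefix("genre_")
    return pd.concat([out.drop(columns="genres"), genre_dummies], axis=1)


# Toy rows for illustration only, not real records from steam.csv.
toy = pd.DataFrame({
    "appid": [10, 20],
    "name": ["Game A", "Game B"],
    "categories": ["Single-player", "Multi-player"],
    "steamspy_tags": ["Action;FPS", "Indie;Casual"],
    "genres": ["Action", "Indie;Casual"],
    "price": [7.19, 3.99],
})

features = prepare_features(toy)
print(features)
```

Each distinct genre becomes its own 0/1 column, so a game tagged with several genres has a 1 in each of them.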

In [18]:
df['required_age'].value_counts()

0     26479
18      308
16      192
12       73
7        12
3        11
Name: required_age, dtype: int64

In [21]:
df['steamspy_tags'].value_counts()

Action;Indie;Casual                                 845
Action;Adventure;Indie                              714
Early Access;Action;Indie                           507
Adventure;Indie;Casual                              442
Indie;Casual                                        378
                                                   ... 
Walking Simulator;Adventure;Psychological Horror      1
Indie;RPGMaker;Female Protagonist                     1
Strategy;Turn-Based Strategy;Indie                    1
Indie;Casual;Walking Simulator                        1
Early Access;Sandbox;Simulation                       1
Name: steamspy_tags, Length: 6423, dtype: int64

In [22]:
df['release_date'].value_counts()

2018-07-13    64
2018-11-16    56
2019-01-31    56
2016-04-05    56
2018-05-31    55
              ..
2012-11-06     1
2011-10-27     1
2008-11-07     1
2016-06-04     1
2006-05-10     1
Name: release_date, Length: 2619, dtype: int64

In [29]:
test1 = pd.to_datetime(df['release_date'])
test1.dt.year.value_counts()

2018    8160
2017    6357
2016    4361
2015    2597
2019    2213
2014    1555
2013     418
2012     320
2009     305
2011     239
2010     238
2008     145
2007      93
2006      48
2005       6
2004       6
2001       4
2003       3
1999       2
2000       2
2002       1
1997       1
1998       1
Name: release_date, dtype: int64

In [46]:
# Explicit parentheses make the precedence clear: (2018 and month >= 8) or 2019
(((test1.dt.year == 2018) & (test1.dt.month >= 8)) | (test1.dt.year == 2019)).value_counts()

False    21489
True      5586
Name: release_date, dtype: int64

In [43]:
((test1.dt.year == 2018) & (test1.dt.month <= 7)).value_counts()

False    22288
True      4787
Name: release_date, dtype: int64

In [44]:
(test1.dt.year < 2018).value_counts()

True     16702
False    10373
Name: release_date, dtype: int64

In [None]:
# Further analysis of 'release_date' suggests a time-based train/val/test split:
# test: Aug 2018 onward, including 2019 (3373 + 2213 = 5586)
# val: Jan 2018 through Jul 2018 (4787)
# train: everything before 2018 (16702)
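
The split described above could be written roughly as follows. This is a sketch under the stated cutoffs (train before 2018, validation from January through July 2018, test from August 2018 onward). `time_split` is a hypothetical helper, and the toy dates are illustrative only.

```python
import pandas as pd


def time_split(df: pd.DataFrame, date_col: str = "release_date"):
    """Split rows into train/val/test by release date."""
    dates = pd.to_datetime(df[date_col])
    val_start = pd.Timestamp("2018-01-01")
    test_start = pd.Timestamp("2018-08-01")

    train = df[dates < val_start]
    val = df[(dates >= val_start) & (dates < test_start)]
    test = df[dates >= test_start]
    return train, val, test


# Toy dates for illustration only.
toy = pd.DataFrame({
    "release_date": [
        "2017-05-01", "2018-03-15", "2018-07-31", "2018-08-01", "2019-01-31",
    ],
})

train, val, test = time_split(toy)
print(len(train), len(val), len(test))
```

Using date cutoffs rather than separate year and month masks keeps the three subsets mutually exclusive and exhaustive by construction.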

In [None]:
# Though this is a classification problem, there is still a lot of interest in predicting
# the success of a game in terms of sales
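
Because the `owners` brackets are ordered, one option for connecting the classes back to a sales-like quantity is to extract the lower bound of each bracket as a number. The sketch below assumes every value has the form `low-high`, as in the value counts above. `owners_lower_bound` is a hypothetical helper, and ownership is only a rough proxy for sales.

```python
import pandas as pd


def owners_lower_bound(owners: pd.Series) -> pd.Series:
    """Parse the lower bound of each 'low-high' ownership bracket."""
    # Assumption: every value looks like '20000-50000'
    return owners.str.split("-").str[0].astype(int)


# A few bracket labels copied from the value counts above.
toy_owners = pd.Series(["0-20000", "20000-50000", "10000000-20000000"])

lower = owners_lower_bound(toy_owners)
print(lower.tolist())
```

The numeric bound makes it possible to sort the classes or to check how far off a predicted bracket is from the true one.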