# SHOUT! Global Data Science Competition 2016

Hi! My name is Guy Rapaport, and I'm a data scientist working for **Turi** (formerly known as **Dato**).

In this notebook, I'll use **GraphLab Create**, our scalable machine learning Python library, to create a simple baseline for the competition. 

Turi assists TalkingData in running this competition. For the duration of this competition, Turi will supply Kaggle users with **free GraphLab Create licenses!**

You can get GraphLab Create from Turi's website: http://www.turi.com/download

Register using your email address, and you will get 1 month of free trial usage.

Send the email address you registered with to me at [guy+kaggle@turi.com](mailto:guy+kaggle@turi.com), and I'll extend your license until the end of the competition.

<img src="http://cdn2.hubspot.net/hub/426799/hubfs/dato_to_turi.gif"></img>

## Loading the Data

### Assert all files are available

I am assuming all competition data is located in the directory **`BASEDIR`**, which sits in the same directory as this notebook.

In [1]:
BASEDIR = "csvs"
import os
print "Expecting data (csv files) in: %s" % (os.path.join(os.getcwd(), BASEDIR))

Expecting data (csv files) in: /Users/dato/Documents/projects/tendcloud_competition/kaggle/csvs


In [2]:
from glob import glob
from os.path import basename
required_csvs = ['app_events.csv',
 'app_labels.csv',
 'events.csv',
 'gender_age_test.csv',
 'gender_age_train.csv',
 'label_categories.csv',
 'phone_brand_device_model.csv',
 'sample_submission.csv']

available_csvs = set(map(basename, glob(os.path.join(BASEDIR, "*.csv"))))

assert all([r in available_csvs for r in required_csvs])
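
A bare `assert` tells you that something is missing, but not what. Here is a minimal plain-Python sketch (no GraphLab needed) that lists the missing files instead. The helper name `find_missing_csvs` and the throwaway demo directory are my own illustration, not part of the competition kit.

```python
import os
import tempfile

def find_missing_csvs(basedir, required):
    """Return the required file names that are not present in basedir."""
    available = set(os.listdir(basedir)) if os.path.isdir(basedir) else set()
    return sorted(name for name in required if name not in available)

# Demonstration on a throwaway directory with hypothetical contents.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "events.csv"), "w").close()

missing = find_missing_csvs(demo_dir, ["events.csv", "app_events.csv"])
print("Missing files: %s" % missing)
```

In the notebook itself you could call it as `find_missing_csvs(BASEDIR, required_csvs)` and assert that the result is empty.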

### Import GraphLab and Load the CSVs

For this part, I am assuming you have GraphLab Create installed. Otherwise, [get it from here](https://get.turi.com)!

In the following lines we will simply import GraphLab (`as gl`) and use it to load the competition data into GraphLab's data structure - the `SFrame`. This is a tabular data structure which can be described as *pandas on steroids*: if you normally use pandas, you can [find your way around SFrame here](https://turi.com/learn/translator/).

SFrame really shines when it has to *scale over a single machine*: it is **multicore** (all cores are used in the different SFrame operations) and **out-of-core** (meaning it is limited by the size of your disk and not of your RAM). While the data for this contest is rather small and fits in memory, your SFrames can be fed into the machine learning algorithms of GraphLab Create, as you will soon see.

Let's load each dataset and inspect the first 3 rows.

In [3]:
import graphlab as gl

In [4]:
app_events = gl.SFrame.read_csv('csvs/app_events.csv')
app_events.head(3)

[INFO] graphlab.cython.cy_server: GraphLab Create v2.0.1 started. Logging: /tmp/graphlab_server_1468272895.log



------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


event_id,app_id,is_installed,is_active
2,5927333115845830913,1,1
2,-5720078949152207372,1,0
2,-1633887856876571208,1,0


In [5]:
app_labels = gl.SFrame.read_csv('csvs/app_labels.csv')
app_labels.head(3)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


app_id,label_id
7324884708820027918,251
-4494216993218550286,251
6058196446775239644,406


In [6]:
events = gl.SFrame.read_csv('csvs/events.csv')
events.head(3)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


event_id,device_id,timestamp,longitude,latitude
1,29182687948017175,2016-05-01 00:55:25,121.38,31.24
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97
3,-4833982096941402721,2016-05-01 00:08:05,106.6,29.7


The timestamp column was treated as a string column. I'll convert it to a datetime object before we continue.

In [7]:
events["timestamp"] = events["timestamp"].str_to_datetime(str_format="%Y-%m-%dT%H:%M:%S")
events.head(3)

event_id,device_id,timestamp,longitude,latitude
1,29182687948017175,2016-05-01 00:55:25,121.38,31.24
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97
3,-4833982096941402721,2016-05-01 00:08:05,106.6,29.7


While we're at it, let's check the time span the data covers.

In [8]:
min_ts = events["timestamp"].min()
max_ts = events["timestamp"].max()
print "Data duration is %s (%s->%s)" % (
    str(max_ts - min_ts),
    str(min_ts),
    str(max_ts)
)

Data duration is 7 days, 0:07:44 (2016-04-30 23:52:24->2016-05-08 00:00:08)
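
If you'd like to sanity-check that figure without GraphLab, the same arithmetic works with the standard library. This is only a sketch: the two timestamps are copied from the output above, and the format string assumes the space-separated layout seen in `events.csv`. The `_demo` names are mine.

```python
from datetime import datetime

# Format of the raw timestamp strings as they appear in events.csv
TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_ts(text):
    """Parse one timestamp string into a datetime object."""
    return datetime.strptime(text, TS_FORMAT)

min_ts_demo = parse_ts("2016-04-30 23:52:24")
max_ts_demo = parse_ts("2016-05-08 00:00:08")

# Subtracting two datetimes yields a timedelta
duration_demo = max_ts_demo - min_ts_demo
print("Data duration is %s" % duration_demo)
```

So the events cover almost exactly one week.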


In [9]:
gender_age_train = gl.SFrame.read_csv('csvs/gender_age_train.csv')
gender_age_train

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


device_id,gender,age,group
-8076087639492063270,M,35,M32-38
-2897161552818060146,M,35,M32-38
-8260683887967679142,M,35,M32-38
-4938849341048082022,M,30,M29-31
245133531816851882,M,30,M29-31
-1297074871525174196,F,24,F24-26
236877999787307864,M,36,M32-38
-8098239495777311881,M,38,M32-38
176515041953473526,M,33,M32-38
1596610250680140042,F,36,F33-42


In [10]:
gender_age_test = gl.SFrame.read_csv('csvs/gender_age_test.csv')
gender_age_test

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


device_id
1002079943728939269
-1547860181818787117
7374582448058474277
-6220210354783429585
-5893464122623104785
-7560708697029818408
289797889702373958
-402874006399730161
5751283639860028129
-848943298935149395


It's best practice to ensure we don't have any duplicate IDs in the test set.

In [11]:
duplicate_ids = gender_age_test.groupby(
    "device_id", {
        "count": gl.aggregate.COUNT()}).filter_by(
            [1], "count", exclude=True)

duplicate_ids.materialize()
duplicate_ids

device_id,count
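
The empty table means no `device_id` appears more than once. For comparison, here is the same check as a plain-Python sketch using `collections.Counter`; the IDs in the demo are made up.

```python
from collections import Counter

def find_duplicates(ids):
    """Return {id: count} for every id that appears more than once."""
    counts = Counter(ids)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical IDs: 101 appears three times, 102 twice.
demo_ids = [101, 102, 103, 102, 104, 101, 101]
duplicates_demo = find_duplicates(demo_ids)
print(duplicates_demo)
```

An empty dictionary here corresponds to the empty SFrame above.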


In [12]:
label_categories = gl.SFrame.read_csv('csvs/label_categories.csv')
label_categories

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


label_id,category
1,
2,game-game type
3,game-Game themes
4,game-Art Style
5,game-Leisure time
6,game-Cutting things
7,game-Finding fault
8,game-stress reliever
9,game-pet
10,game-Answer


In [13]:
phone_brand_device_model = gl.SFrame.read_csv('csvs/phone_brand_device_model.csv')
phone_brand_device_model

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


device_id,phone_brand,device_model
-8890648629457979026,小米,红米
1277779817574759137,小米,MI 2
5137427614288105724,三星,Galaxy S4
3669464369358936369,SUGAR,时尚手机
-5019277647504317457,三星,Galaxy Note 2
3238009352149731868,华为,Mate
-3883532755183027260,小米,MI 2S
-2972199645857147708,华为,G610S
-5827952925479472594,小米,MI One Plus
-8262508968076336275,vivo,S7I


In [14]:
sample_submission = gl.SFrame.read_csv('csvs/sample_submission.csv')
sample_submission

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,float,float,float,float,float,float,float,float,float,float,float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


device_id,F23-,F24-26,F27-28,F29-32,F33-42,F43+,M22-,M23-26,M27-28,M29-31
1002079943728939269,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
-1547860181818787117,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
7374582448058474277,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
-6220210354783429585,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
-5893464122623104785,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
-7560708697029818408,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
289797889702373958,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
-402874006399730161,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
5751283639860028129,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833
-848943298935149395,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833,0.0833

M32-38,M39+
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833
0.0833,0.0833


## Feature Engineering - Preparing the Data for Model Training

For this benchmark, I'd like to classify each event into its owner's gender-age group. However, our data points are not events, but the application statuses (`is_installed` and `is_active`) per event. Let's look at the data again, as a reminder. I'll use `event_id==2` as the example event for each of the following steps.

In [15]:
app_events.filter_by([2], "event_id").head(3)

event_id,app_id,is_installed,is_active
2,5927333115845830913,1,1
2,-5720078949152207372,1,0
2,-1633887856876571208,1,0


GraphLab Create can handle sparse vectors (Python `dict` objects) and expand them to columns, so I don't have to represent my data like this. To begin with, I'll completely ignore the `is_installed` column, since all the sampled apps are of course installed. You can verify this by checking how many unique values are in the `is_installed` column:

In [16]:
app_events["is_installed"].unique()

dtype: int
Rows: 1
[1]

Now I'll run a SQL-like groupby operation that will create a dictionary stating whether each app in the event is active or not.

Let's run the operation and check the result:

In [17]:
grouped_app_events = app_events.groupby("event_id", {"active": gl.aggregate.CONCAT("app_id", "is_active")})
grouped_app_events.filter_by([2], "event_id")

event_id,active
2,"{5927333115845830913: 1, -1758857579862594461: 0, ..."
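
To make the `CONCAT` aggregation less magical, here is roughly what it computes, written as a plain-Python sketch. The first three rows are the `event_id==2` rows shown earlier; the fourth row and the function name `concat_by_event` are made up for illustration.

```python
from collections import defaultdict

def concat_by_event(rows):
    """Group (event_id, app_id, is_active) rows into {event_id: {app_id: is_active}}."""
    grouped = defaultdict(dict)
    for event_id, app_id, is_active in rows:
        grouped[event_id][app_id] = is_active
    return dict(grouped)

demo_rows = [
    (2, 5927333115845830913, 1),
    (2, -5720078949152207372, 0),
    (2, -1633887856876571208, 0),
    (6, 5927333115845830913, 1),  # hypothetical row for a second event
]
grouped_demo = concat_by_event(demo_rows)
print(grouped_demo[2])
```

Each event ends up with a single dictionary, which is the sparse representation GraphLab Create can consume directly.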


We can then split the `active` column in two: apps that are installed but not active go into a new `installed` column, and are then trimmed off the `active` column.

In [18]:
grouped_app_events["installed"] = grouped_app_events["active"].apply(lambda d:{k:1 for (k, v) in d.iteritems() if v == 0})

In [19]:
grouped_app_events["active"] = grouped_app_events["active"].dict_trim_by_values(1)

In [20]:
# Shorthand syntax for the filter_by operation
grouped_app_events[grouped_app_events["event_id"] == 2].head(1)

event_id,active,installed
2,"{5927333115845830913: 1, 4775896950989639373: 1, ...","{-1758857579862594461: 1, 3717049149426646122: 1, ..."
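
The two cells above boil down to partitioning one dictionary by its values. A plain-Python sketch of the same idea, using the three `event_id==2` apps shown earlier (the function name is my own):

```python
def split_active_installed(app_states):
    """Split {app_id: is_active} into (active apps, installed-but-inactive apps)."""
    active = {app: 1 for app, flag in app_states.items() if flag == 1}
    installed_only = {app: 1 for app, flag in app_states.items() if flag == 0}
    return active, installed_only

demo_states = {
    5927333115845830913: 1,
    -5720078949152207372: 0,
    -1633887856876571208: 0,
}
active_demo, installed_demo = split_active_installed(demo_states)
print("active: %d, installed only: %d" % (len(active_demo), len(installed_demo)))
```

Every app lands in exactly one of the two dictionaries, and the value is always 1, so both columns behave like sparse indicator vectors.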


Now each event can be classified on its own. Let's join the grouped app events with the events metadata SFrame.

In [21]:
joined = events.join(grouped_app_events, on="event_id")
joined[joined["event_id"] == 2].head(1)

event_id,device_id,timestamp,longitude,latitude,active
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97,"{5927333115845830913: 1, 4775896950989639373: 1, ..."

installed
"{-1758857579862594461: 1, 3717049149426646122: 1, ..."


Note that each device may still have multiple events associated with it. I'll classify each event separately and then aggregate the results; you can choose a different path, especially if you want to beat the competition `;)`

Here's an example of other events by the device from our favourite `event_id==2`:

In [22]:
joined.filter_by([-6401643145415154744], "device_id").head(3)

event_id,device_id,timestamp,longitude,latitude,active
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97,"{5927333115845830913: 1, 4775896950989639373: 1, ..."
65269,-6401643145415154744,2016-05-01 00:51:34,103.65,30.97,"{5927333115845830913: 1, 4775896950989639373: 1, ..."
212044,-6401643145415154744,2016-05-01 22:17:00,103.62,30.99,"{5927333115845830913: 1, 7167114343576723123: 1, ..."

installed
"{-1758857579862594461: 1, 3717049149426646122: 1, ..."
"{-1758857579862594461: 1, 3717049149426646122: 1, ..."
{}


Now let's use our train and test datasets. We'll run a join operation between the joined events data and the train/test datasets. SFrame's default join type is `inner`, which means we'll only take the events that have a corresponding `device_id` in the train/test SFrame. This will effectively split our joined events data between train and test sets, according to the device IDs.

In [23]:
train = joined.join(gender_age_train, on="device_id")
train.head(3)

event_id,device_id,timestamp,longitude,latitude,active
6,1476664663289716375,2016-05-01 00:27:21,0.0,0.0,"{-3467200097934864127: 1, 5927333115845830913: 1, ..."
29,7166563712658305181,2016-05-01 00:31:40,117.96,28.47,"{-9129109839652417461: 1, 5099453940784075687: 1, ..."
35,-3449419341168524142,2016-05-01 00:25:41,0.0,0.0,"{6666573793632706850: 1, -6159006306231678365: 1, ..."

installed,gender,age,group
"{-1633873313139722876: 1, 5729517255058371973: 1, ...",M,19,M22-
"{5927333115845830913: 1, -5791102939414640022: 1, ...",M,60,M39+
"{5927333115845830913: 1, 7499170796297973860: 1, ...",M,28,M27-28


In [24]:
test = joined.join(gender_age_test, on="device_id")
test.head(3)

event_id,device_id,timestamp,longitude,latitude,active
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97,"{5927333115845830913: 1, 4775896950989639373: 1, ..."
7,5990807147117726237,2016-05-01 00:15:13,113.73,23.0,"{33792862810792679: 1, 487766649788038994: 1, ..."
9,-2073340001552902943,2016-05-01 00:15:33,0.0,0.0,"{-332792151099088795: 1, 5099453940784075687: 1, ..."

installed
"{-1758857579862594461: 1, 3717049149426646122: 1, ..."
"{-1633933922436094199: 1, -607956450007961078: 1, ..."
"{5927333115845830913: 1, 6279509966423362439: 1, ..."


Again, just to make sure, let's assert there is no intersection between the train devices and the test devices.

In [25]:
intersection = set(train["device_id"].unique()) & set(test["device_id"].unique())
assert len(intersection) == 0
intersection

set()
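
To spell out why two inner joins act as a train/test split, here is a small plain-Python sketch. All IDs and the helper name `split_by_device` are hypothetical; the point is only that events whose device is in neither set silently drop out.

```python
def split_by_device(event_rows, train_labels, test_ids):
    """Mimic two inner joins on device_id: one against train, one against test."""
    train_rows, test_rows = [], []
    for row in event_rows:
        device = row["device_id"]
        if device in train_labels:
            # The train join also brings in the target column
            train_rows.append(dict(row, group=train_labels[device]))
        if device in test_ids:
            test_rows.append(dict(row))
    return train_rows, test_rows

demo_events = [
    {"event_id": 1, "device_id": 10},
    {"event_id": 2, "device_id": 20},
    {"event_id": 3, "device_id": 10},
    {"event_id": 4, "device_id": 30},  # device unknown to both sets: dropped
]
demo_train_labels = {10: "M32-38"}
demo_test_ids = {20}

train_demo, test_demo = split_by_device(demo_events, demo_train_labels, demo_test_ids)
print("train events: %d, test events: %d" % (len(train_demo), len(test_demo)))
```

The reverse also holds: a device without any events never shows up on either side, which matters later when we build the submission.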

## Machine Learning Time! Classifying using GraphLab Create

Time for training some models! But first, let's shuffle our train data and create some folds of it so that we can do k-fold cross validation.

In [26]:
train = gl.cross_validation.shuffle(train, random_seed=0)
folds = gl.cross_validation.KFold(train, num_folds=5)
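
If k-fold cross validation is new to you: the rows are shuffled, cut into `k` parts, and each part serves once as the validation set while the rest is used for training. The sketch below shows the idea in plain Python on row indices. It is my own illustration and not necessarily how GraphLab's `KFold` partitions the data internally.

```python
import random

def k_fold_indices(n_rows, num_folds, seed=0):
    """Return a list of (train_indices, validation_indices) pairs."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = []
    for k in range(num_folds):
        # Every num_folds-th shuffled row goes to validation fold k
        val = indices[k::num_folds]
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        folds.append((train, val))
    return folds

folds_demo = k_fold_indices(n_rows=10, num_folds=5)
for fold_number, (train_idx, val_idx) in enumerate(folds_demo, 1):
    print("fold %d: %d train rows, %d validation rows" % (fold_number, len(train_idx), len(val_idx)))
```

Every row is used for validation exactly once, which is what lets us compare scores across folds later on.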

We are about to train a model, so we need to define our target column and our feature columns. Let's look at the column names again:

In [27]:
train.column_names()

['event_id',
 'device_id',
 'timestamp',
 'longitude',
 'latitude',
 'active',
 'installed',
 'gender',
 'age',
 'group']

* `event_id` - Obviously useless: it's just an index column.
* `device_id` - Also a useless index column.
* `timestamp` - Might be useful, but only after we break it down into hours of the day and turn the dates into weekdays/weekends. Even so, this effort would be worthwhile only if we had many events for many devices. We'll keep it out for this notebook.
* `longitude`, `latitude` - Can we derive the gender and age of a person from their location? Perhaps there are locations visited more by men/women, or by children/adults. Let's keep these two. Note that our longitude/latitude values have at most 2 digits after the decimal point, [indicating an accuracy of ~1.1 kilometers](http://gis.stackexchange.com/a/8674).
* `active`, `installed` - feature columns we worked hard to put our hands on: definitely in!
* `gender`, `age`, `group` - our potential **target** columns.
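
I'm leaving `timestamp` out, but if you want to try it, the breakdown mentioned in the list could look like the plain-Python sketch below. The feature names are my own suggestion; the example timestamp is the one from the first row of `events.csv`.

```python
from datetime import datetime

def time_features(ts):
    """Derive simple calendar features from a datetime object."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),               # Monday is 0, Sunday is 6
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
    }

demo_features = time_features(datetime(2016, 5, 1, 0, 55, 25))
print(demo_features)
```

May 1st, 2016 was a Sunday, so this event counts as a weekend event shortly after midnight.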

In this notebook I'll use **`group`** as the target column. A different solution may fuse two different models, one for gender, another for age.

Let's use the first fold to demonstrate how easy it is to train a model using GraphLab Create. We'll use a **boosted decision trees classifier** model, and we'll feed our fold's SFrames to it.
You can read
[more on how to use this model in our API docs](https://turi.com/products/create/docs/generated/graphlab.boosted_trees_classifier.create.html#graphlab.boosted_trees_classifier.create).

Note: I'll use the validation part of the first fold as the `validation_set` for the model. This means that after each training iteration, GraphLab will print the performance over the validation set. It does not affect the training process at all. If we didn't supply it (and didn't pass `None` for it either), GraphLab would use a random 5% sample of the training data (`fold_train`) for this purpose.

In [28]:
fold_train, fold_val = folds[0]

target = "group"
features = [
 'longitude',
 'latitude',
 'active',
 'installed'
]

model = gl.classifier.boosted_trees_classifier.create(fold_train,
                                                      target=target, features=features,
                                                      validation_set=fold_val)

What a fancy logloss! I wonder how well the model would perform on the test data :)

Meanwhile, let's see how stable this training method is over the other folds:

In [29]:
log_loss_values = []

for fold_train, fold_val in folds:
    model = gl.classifier.boosted_trees_classifier.create(fold_train,
                                                      target=target, features=features,
                                                      validation_set=None, verbose=False)
    log_loss_values.append(model.evaluate(fold_val, metric="log_loss")["log_loss"])

for i, ll in enumerate(log_loss_values):
    print "log loss for fold %d: %f" %(i+1, ll)

log loss for fold 1: 1.911203
log loss for fold 2: 1.914789
log loss for fold 3: 1.913967
log loss for fold 4: 1.912135
log loss for fold 5: 1.912335


Looks pretty stable to me!
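
To put these numbers in context, here is a sketch of the metric itself, assuming the standard multiclass log loss definition: the average of minus the log of the probability assigned to the true class. The helper is my own plain-Python illustration, not GraphLab's implementation. A model that predicts all 12 groups with equal probability, like the sample submission, scores ln(12), about 2.485.

```python
import math

def multiclass_log_loss(true_labels, prob_rows, classes, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    index = {label: i for i, label in enumerate(classes)}
    total = 0.0
    for label, probs in zip(true_labels, prob_rows):
        # Clip to avoid log(0) on overconfident predictions
        p = min(max(probs[index[label]], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)

demo_classes = ["F23-", "F24-26", "F27-28", "F29-32", "F33-42", "F43+",
                "M22-", "M23-26", "M27-28", "M29-31", "M32-38", "M39+"]
uniform = [1.0 / len(demo_classes)] * len(demo_classes)

baseline = multiclass_log_loss(["M39+", "F23-"], [uniform, uniform], demo_classes)
print("uniform baseline log loss: %.4f" % baseline)
```

So a validation log loss of about 1.91 is a real improvement over guessing uniformly, at least for the devices that have events.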

Note that by training method I mean not only our choice of features and feature engineering, but also how we tune our boosted trees. There are many default values being used for different aspects of the training, as you can see from the API docs:

```python
graphlab.boosted_trees_classifier.create(dataset, target, features=None,
    max_iterations=10, validation_set='auto', class_weights=None,
    max_depth=6, step_size=0.3, min_loss_reduction=0.0, min_child_weight=0.1,
    row_subsample=1.0, column_subsample=1.0, verbose=True,
    random_seed=None, metric='auto', **kwargs)
```

Before we create our submission, let's check which features were most useful for this model. In the boosted decision trees context, that means the features that appear most often in the trees' nodes (as split points).

In [30]:
model.get_feature_importance()

name,index,count
longitude,,144
latitude,,97
active,3.4332896017370127e+18,60
active,-2.320783822570583e+18,48
active,6.284164581582112e+18,46
active,5.516228268441718e+18,44
active,-5.368809411346729e+18,43
active,5.927333115845832e+18,43
active,6.95669958242614e+18,40
active,-3.9552127334851e+18,38


This result applies to the last trained `model`, but I assure you `longitude` and `latitude` are always prominent. What can we learn from this? Perhaps that in different areas, people behave differently, and require a different model for predicting their age and gender. Who knows - that would require a deeper inspection.

Another interesting point is that for some app IDs, an active app says a lot about your group. Let's write a function for obtaining the categories of apps given their IDs.

In [31]:
def get_categories_for_apps(*app_ids):
    app_ids = map(int, app_ids)
    return app_labels.filter_by(app_ids, "app_id").join(label_categories, on="label_id")["category"].unique()

Note that we converted the `app_ids` back to `int` values using `map`.

This is required because even though we packed those `int` app_ids into the dictionaries of `active` and `installed` apps, they are treated as categorical values in the training. Thus, they are stored in the `model.get_feature_importance()` results as strings and need to be converted back to `int`.

In [32]:
for colname, coltype in zip(
    model.get_feature_importance().column_names(),
    model.get_feature_importance().column_types()):
    print colname, coltype

name <type 'str'>
index <type 'str'>
count <type 'int'>


So for the top 5 app IDs, let's see their respective categories:

In [33]:
for app_id in model.get_feature_importance()["index"][2:7]:
    print "app_id=%s, categories=%s" % (app_id, str(get_categories_for_apps(app_id)))

app_id=3433289601737013244, categories=['Industry tag', 'Property Industry 1.0']
app_id=-2320783822570582843, categories=['weibo']
app_id=6284164581582112235, categories=['Industry tag', 'Property Industry 2.0', 'Services 1']
app_id=5516228268441717785, categories=['P2P net loan', 'Debit and credit', 'Pay']
app_id=-5368809411346728624, categories=['Industry tag', 'Property Industry 1.0', 'Taxi']


## Creating a Submission

Let's use our trained model to predict the probability for each of our target classes. We'll use the `gender_age_test` joined data we created earlier. Here's a reminder of what it looks like - exactly like the train data, *sans* the target columns.

In [34]:
test.head(3)

event_id,device_id,timestamp,longitude,latitude,active
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97,"{5927333115845830913: 1, 4775896950989639373: 1, ..."
7,5990807147117726237,2016-05-01 00:15:13,113.73,23.0,"{33792862810792679: 1, 487766649788038994: 1, ..."
9,-2073340001552902943,2016-05-01 00:15:33,0.0,0.0,"{-332792151099088795: 1, 5099453940784075687: 1, ..."

installed
"{-1758857579862594461: 1, 3717049149426646122: 1, ..."
"{-1633933922436094199: 1, -607956450007961078: 1, ..."
"{5927333115845830913: 1, 6279509966423362439: 1, ..."


Now I'll use the model to predict a probability vector for each of the events in the `test` SFrame. This poses a problem, as we now have multiple predictions for each `device_id` that has more than one event.

In [35]:
results = gl.SFrame(test[["device_id"]])
results["predict"] = model.predict(test, output_type="probability_vector")
results.sort("device_id")

device_id,predict
-9222661944218806987,"[0.0406686961651, 0.0444855690002, ..."
-9222661944218806987,"[0.0413026362658, 0.0451790057123, ..."
-9222661944218806987,"[0.0397630110383, 0.0474216490984, ..."
-9222661944218806987,"[0.040388636291, 0.0441792234778, ..."
-9222661944218806987,"[0.0413026362658, 0.0451790057123, ..."
-9222661944218806987,"[0.0433587320149, 0.0474280789495, ..."
-9222661944218806987,"[0.044090859592, 0.0482289195061, ..."
-9222661944218806987,"[0.0413026362658, 0.0451790057123, ..."
-9222399302879214035,"[0.0341429449618, 0.0373473539948, ..."
-9222399302879214035,"[0.034290291369, 0.0375085324049, ..."


I'll choose to aggregate the results per `device_id` by averaging the probability vectors.

In [36]:
results_agg = results.groupby("device_id", {
        "predict": gl.aggregate.AVG("predict"),
    })
results_agg.head(3)

device_id,predict
-4455230124644815790,"[0.0341725138715, 0.0373797060456, ..."
6613624870840984795,"[0.0431418456137, 0.0471908301115, ..."
2660355718062911483,"[0.0407662391663, 0.0445922724903, ..."
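
Averaging vectors per key is a plain element-wise mean within each group. Here is a small plain-Python sketch of what I assume the `AVG` aggregation does for list columns; the device IDs and two-class vectors are made up to keep the example short.

```python
def average_vectors(device_ids, vectors):
    """Element-wise mean of the probability vectors of each device."""
    sums, counts = {}, {}
    for device, vec in zip(device_ids, vectors):
        if device not in sums:
            sums[device] = [0.0] * len(vec)
            counts[device] = 0
        sums[device] = [s + v for s, v in zip(sums[device], vec)]
        counts[device] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sums}

# Hypothetical predictions: device 1 has two events, device 2 has one.
demo_devices = [1, 1, 2]
demo_vectors = [[0.25, 0.75], [0.75, 0.25], [0.5, 0.5]]

averaged_demo = average_vectors(demo_devices, demo_vectors)
print(averaged_demo)
```

A nice property of this choice: if every input vector sums to 1, so does the average, so the result is still a valid probability vector.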


Now it's time to unpack the probability vectors according to the class labels. I'll pick the class labels from the `sample_submission` SFrame, where all classes are represented. The probability vector is ordered according to the natural ordering (read: Python's `sorted()` result) on the class labels.

In [37]:
classes = sorted(sample_submission.column_names())
classes.remove("device_id")

With this knowledge, it's easy to do the unpacking ourselves:

In [38]:
for i, label in enumerate(classes):
    # Bind i as a default argument so each lambda keeps its own column index
    results_agg[label] = results_agg["predict"].apply(lambda lst, i=i: lst[i])

submission = results_agg[["device_id"] + classes]
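
The unpacking relies on one assumption worth making explicit: position `i` of the probability vector belongs to the `i`-th class label in sorted order. A plain-Python sketch of that mapping (the helper name and the three-class demo are mine):

```python
def unpack_probabilities(prob_vector, labels):
    """Pair each probability with its class label, using sorted label order."""
    ordered = sorted(labels)
    if len(ordered) != len(prob_vector):
        raise ValueError("expected %d probabilities, got %d" % (len(ordered), len(prob_vector)))
    return dict(zip(ordered, prob_vector))

all_classes = ["F23-", "F24-26", "F27-28", "F29-32", "F33-42", "F43+",
               "M22-", "M23-26", "M27-28", "M29-31", "M32-38", "M39+"]

# Three-class toy example with made-up probabilities
unpacked_demo = unpack_probabilities([0.2, 0.3, 0.5], ["M22-", "F23-", "F43+"])
print(unpacked_demo)
```

Conveniently, the column order of the sample submission is already the sorted order of the class labels, so no reordering is needed for the final file.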

Let's look at the submission.

In [39]:
submission.head(3)

device_id,F23-,F24-26,F27-28,F29-32,F33-42
-4455230124644815790,0.0341725138715,0.0373797060456,0.0357922750991,0.049911536742,0.0609928842168
6613624870840984795,0.0431418456137,0.0471908301115,0.0451867468655,0.0525502227247,0.075110450387
2660355718062911483,0.0407662391663,0.0445922724903,0.0426985360682,0.0595421716571,0.0769132003188

F43+,M22-,M23-26,M27-28,M29-31,M32-38
0.057933524251,0.0745691508055,0.158295459114,0.0604897357989,0.0836647781543,0.147407517768
0.0253774523735,0.0793024078012,0.157966941595,0.10948343575,0.117965810001,0.148551955819
0.0691120326519,0.0749356374145,0.117746852338,0.0721614807844,0.0998082384467,0.160627260804

M39+
0.199390940368
0.0981718748808
0.141096070409


Looks like the sample submission! Nice, isn't it?

I'll spoil it in advance: actually, no!

The size of our submission is way smaller than it should be:

In [40]:
print "len() of submission: %d" % len(submission)
print "len() for gender_age_test: %d" % len(gender_age_test)

len() of submission: 35172
len() for gender_age_test: 112071


This happens because we relied heavily on app_events data, which many devices in our dataset simply don't have.

In [41]:
print "Number of devices in gender_age_test that have events: %d" % len(set(gender_age_test["device_id"]) & set(events["device_id"]))

Number of devices in gender_age_test that have events: 35194


It seems that we also lost some data on the way (35172 < 35194). One way or the other, we have to pad our submission with some default values for the missing device IDs. Let's take those rows from the sample submission.

In [42]:
def pad_submission_sf(submission):
    return submission.append(sample_submission.filter_by(submission["device_id"], "device_id", exclude=True))
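
In other words: keep our prediction wherever we have one, and fall back to the sample submission row otherwise. The same logic as a plain-Python sketch over dictionaries, with made-up device IDs and two-class rows:

```python
def pad_rows(predicted, defaults):
    """Both arguments map device_id -> probability list; predictions win."""
    padded = dict(defaults)   # start from the default row of every device
    padded.update(predicted)  # overwrite with our predictions where available
    return padded

demo_defaults = {1: [0.5, 0.5], 2: [0.5, 0.5], 3: [0.5, 0.5]}
demo_predicted = {2: [0.9, 0.1]}

padded_demo = pad_rows(demo_predicted, demo_defaults)
print(padded_demo)
```

The padded result always has exactly one row per device in the defaults, which is the property the submission needs.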

In [43]:
my_submission_sf = pad_submission_sf(submission)
my_submission_sf.head(3)

device_id,F23-,F24-26,F27-28,F29-32,F33-42
-4455230124644815790,0.0341725138715,0.0373797060456,0.0357922750991,0.049911536742,0.0609928842168
6613624870840984795,0.0431418456137,0.0471908301115,0.0451867468655,0.0525502227247,0.075110450387
2660355718062911483,0.0407662391663,0.0445922724903,0.0426985360682,0.0595421716571,0.0769132003188

F43+,M22-,M23-26,M27-28,M29-31,M32-38
0.057933524251,0.0745691508055,0.158295459114,0.0604897357989,0.0836647781543,0.147407517768
0.0253774523735,0.0793024078012,0.157966941595,0.10948343575,0.117965810001,0.148551955819
0.0691120326519,0.0749356374145,0.117746852338,0.0721614807844,0.0998082384467,0.160627260804

M39+
0.199390940368
0.0981718748808
0.141096070409


Let's also verify that we have the required number of device IDs.

In [44]:
assert set(my_submission_sf["device_id"].unique()) == set(gender_age_test["device_id"].unique())

Now we can save our submission SFrame. Note that we have several options to do it:

* `my_submission_sf.save("benchmark_submission.csv")` - would save the SFrame as a CSV file.

* `my_submission_sf.save("benchmark_submission.csv.gz")` - would save the SFrame as a compressed (gzipped) CSV file. This is the option I'll use as it means a smaller upload to Kaggle.

* `my_submission_sf.save("benchmark_submission")` - would create a directory called `benchmark_submission` where the SFrame data is saved in our binary format. This is not as small as a gzipped file, but is definitely smaller than the raw CSV, and loads faster into GraphLab.

Note that the SFrame project has been open sourced, so you can use SFrame for feature engineering for free. Check out [our SFrame repository on GitHub](https://github.com/turi-code/SFrame) or download it using pip:

    `pip install -U sframe`

In [45]:
my_submission_sf.save("benchmark_submission.csv.gz")

This submission scores ~2.42 on Kaggle. Not very good, considering we padded the results for most of the devices!

To demonstrate the power of SFrame and the convenience of GraphLab Create, I chose to focus on one fairly involved piece of feature engineering (joining and grouping by the active application IDs). I also created just one model - the boosted decision trees. However, there is much more you can do to win this competition - whether you choose to use GraphLab Create for this task or not.

## Where do we go from here?

Naturally, to where we came from. Let's recall which data sources are available for us. We stored the list of CSVs in the `required_csvs` list in the beginning of the notebook.

In [46]:
required_csvs

['app_events.csv',
 'app_labels.csv',
 'events.csv',
 'gender_age_test.csv',
 'gender_age_train.csv',
 'label_categories.csv',
 'phone_brand_device_model.csv',
 'sample_submission.csv']

By now we already know that for many devices, we don't have any events. That means no timestamps, no longitude/latitude values, and no lists of active and/or installed apps. That is, nothing like this:

In [47]:
joined.head(1)

event_id,device_id,timestamp,longitude,latitude,active
2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97,"{5927333115845830913: 1, 4775896950989639373: 1, ..."

installed
"{-1758857579862594461: 1, 3717049149426646122: 1, ..."


We do seem to have the phone brands and device models for all of our devices:

In [48]:
len(set(gender_age_test["device_id"].unique()) & set(phone_brand_device_model["device_id"].unique()))

112071

Perhaps we can make use of this information somehow. Seems unfair? This is real-world data, and these are the hardships the TalkingData data scientists tackle. Sometimes, only some of the data is available for some of the devices. Let's see what we can get out of it.

Another thing we only glimpsed at is the category space of the apps. Perhaps converting the list of apps to a list of categories would give us a deeper understanding of our device owners' interests?
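
As a starting point, here is one way the app-to-category conversion could look, sketched in plain Python. Everything in the demo (IDs, labels, categories, and the function name) is made up; with the real data you would build the two lookup tables from `app_labels` and `label_categories`.

```python
from collections import Counter

def category_counts(app_ids, app_to_labels, label_to_category):
    """Count how often each category occurs among the given apps."""
    counts = Counter()
    for app in app_ids:
        for label in app_to_labels.get(app, ()):
            category = label_to_category.get(label)
            if category:  # skip labels with a missing or empty category
                counts[category] += 1
    return counts

# Hypothetical lookup tables
demo_app_to_labels = {1001: [10, 20], 1002: [20], 1003: [30]}
demo_label_to_category = {10: "game", 20: "finance", 30: ""}

counts_demo = category_counts([1001, 1002, 1003], demo_app_to_labels, demo_label_to_category)
print(dict(counts_demo))
```

A dictionary like this per event (or per device) is far denser than the raw app IDs, so it may generalize better across devices that share few apps.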

There surely are many ways to handle this data - and with the easy-to-learn, easy-to-use API of GraphLab Create's many algorithms, I hope you'll win the competition!

# Good Luck!!!

## Bonus Round: Visualizing the Geolocation Data
Looking at the model's feature importance, we noticed that the `longitude` and `latitude` features contribute the most to the model's results. Let's visualize them in the hope of getting some insight into how to partition the coordinate pairs.

In [49]:
# A reminder
model.get_feature_importance()[:2]

name,index,count
longitude,,144
latitude,,97


The following cell requires the **`folium`** package, and **may take a long time to execute!**

You can get `folium` by using `pip install -U folium`.

In [50]:
try:
    from folium import Map, CircleMarker
    
    def show_coordinates_on_map():
        mean_lon = events["longitude"][events["longitude"] > 1].mean()
        mean_lat = events["latitude"][events["latitude"] > 1].mean()
        cloc = (mean_lat, mean_lon)
        map_1 = Map(location=cloc, zoom_start=3)
        for row in events[["latitude", "longitude"]].unique():
            map_1.add_children(CircleMarker((row["latitude"], row["longitude"])))

        map_1.lat_lng_popover()
        return map_1

except ImportError:
    def show_coordinates_on_map():
        print "Please install the folium Python package to make this cell usable."

In [51]:
# Uncomment the following line and run the cell if you're interested in the visualization.

#show_coordinates_on_map()