[View in Colaboratory](https://colab.research.google.com/github/jagatfx/turicreate-colab/blob/master/turicreate_sframes_intro.ipynb)

# Introduction to Turi Create SFrames

https://github.com/apple/turicreate/blob/master/userguide/sframe/sframe-intro.md

SFrames are the primary data structure for extracting data from other sources for use in Turi Create.

They are similar to pandas DataFrames but do not need to be loaded into RAM as a whole, so they are not constrained by the memory of the machine running the code. This makes the SFrame a scalable data structure that supports out-of-core processing. Each column is a size-immutable SArray, while the SFrame itself is column-mutable: columns can be added and removed.
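
The idea behind out-of-core processing can be illustrated with a minimal standard-library sketch. It is not how SFrames are implemented; the function name `streaming_mean` and the sample rows are hypothetical, and an in-memory `io.StringIO` stands in for a file on disk. The point is only that an aggregate can be computed while holding a single row in memory at a time.

```python
import csv
import io


def streaming_mean(lines, column):
    """Average one CSV column while keeping only one row in memory at a time."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Hypothetical sample data standing in for a large file on disk.
sample = io.StringIO("song_id,year\nA,2003\nB,1995\nC,2006\n")
mean_year = streaming_mean(sample, "year")
print(mean_year)
```

Passing an open file handle instead of the `StringIO` object would stream the rows from disk in the same way.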

SFrames can import data from the following sources:

*   CSV
*   JSON
*   SQL databases

# Google Drive Access

You will be asked to click a link to generate a secret key (verification code) for accessing your Google Drive.

Copy the secret key and paste it into the input field provided in the notebook.

In [2]:
# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse
!apt-get update -qq 2>&1 > /dev/null
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse

Preconfiguring packages ...
Selecting previously unselected package cron.
(Reading database ... 18408 files and directories currently installed.)
Preparing to unpack .../00-cron_3.0pl1-128ubuntu5_amd64.deb ...
Unpacking cron (3.0pl1-128ubuntu5) ...
Selecting previously unselected package libapparmor1:amd64.
Preparing to unpack .../01-libapparmor1_2.11.0-2ubuntu17.1_amd64.deb ...
Unpacking libapparmor1:amd64 (2.11.0-2ubuntu17.1) ...
Selecting previously unselected package libdbus-1-3:amd64.
Preparing to unpack .../02-libdbus-1-3_1.10.22-1ubuntu1_amd64.deb ...
Unpacking libdbus-1-3:amd64 (1.10.22-1ubuntu1) ...
Selecting previously unselected package dbus.
Preparing to unpack .../03-dbus_1.10.22-1ubuntu1_amd64.deb ...
Unpacking dbus (1.10.22-1ubuntu1) ...
Selecting previously unselected package dirmngr.
Preparing to unpack .../04-dirmngr_2.1.15-1ubuntu8.1_amd64.deb ...
Unpacking dirmngr (2.1.15-1ubuntu8.1) ...
Selecting previously unselected package distro-info-data.
Preparing to unpack .

In [0]:
# Generate auth tokens for Colab
from google.colab import auth
auth.authenticate_user()

In [4]:
# Generate creds for the Drive FUSE library.
from google.colab import output
from oauth2client.client import GoogleCredentials
import time
creds = GoogleCredentials.get_application_default()
import getpass
# Determine if Drive Fuse credential setup is already complete.
fuse_credentials_configured = False
with output.temporary():
  !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1
  # _exit_code is set to the result of the last "!" command.
  fuse_credentials_configured = _exit_code == 0

# Sleep for a short period to ensure that the previous output has been cleared.
time.sleep(1)
  
if fuse_credentials_configured:
  print('Drive FUSE credentials already configured!')
else:
  # Work around misordering of STREAM and STDIN in Jupyter.
  # https://github.com/jupyter/notebook/issues/3159
  prompt = !google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
  vcode = getpass.getpass(prompt[0] + '\n\nEnter verification code: ')
  !echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}


Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force

Enter verification code: ··········
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
Please enter the verification code: Access token retrieved correctly.


In [0]:
# Create a directory and mount Google Drive using that directory.
!mkdir -p drive
!google-drive-ocamlfuse drive

In [1]:
!ls


10000.txt      loc-gowalla_totalCheckins.txt  song_data.csv
10000.txt.zip  __MACOSX			      song_data.csv.zip
adc.json       music_usage_data.sframe	      wget-log
drive	       sample_data


# Fetch Data

In [0]:
!if [ -f "/content/drive/Colab Notebooks/data/song_data.csv.zip" ]; then echo "already downloaded song data, copying to workspace" && cp "/content/drive/Colab Notebooks/data/song_data.csv.zip" . && unzip song_data.csv.zip; else echo "downloading song data..." && mkdir -p "/content/drive/Colab Notebooks/data" && wget "https://static.turi.com/datasets/millionsong/song_data.csv"; fi

In [0]:
!if [ -f "/content/drive/Colab Notebooks/data/10000.txt.zip" ]; then echo "already downloaded 10000 data, copying to workspace" && cp "/content/drive/Colab Notebooks/data/10000.txt.zip" . && unzip 10000.txt.zip; else echo "downloading 10000 data..." && mkdir -p "/content/drive/Colab Notebooks/data" && wget "https://static.turi.com/datasets/millionsong/10000.txt"; fi

In [1]:
!if [ -f "/content/drive/Colab Notebooks/data/loc-gowalla_totalCheckins.txt.gz" ]; then echo "already downloaded loc-gowalla data, copying to workspace" && cp "/content/drive/Colab Notebooks/data/loc-gowalla_totalCheckins.txt.gz" . && gunzip loc-gowalla_totalCheckins.txt.gz; else echo "downloading loc-gowalla data..." && mkdir -p "/content/drive/Colab Notebooks/data" && wget "https://snap.stanford.edu/data/loc-gowalla_totalCheckins.txt.gz" && gunzip loc-gowalla_totalCheckins.txt.gz; fi

already downloaded loc-gowalla data, copying to workspace
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors


# Setup Turi Create

In [16]:
!apt install libnvrtc8.0
!pip uninstall -y mxnet-cu80 && pip install mxnet-cu80==1.1.0
!pip install turicreate

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  libnvrtc8.0
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 6,225 kB of archives.
After this operation, 28.3 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu artful/multiverse amd64 libnvrtc8.0 amd64 8.0.61-1 [6,225 kB]
Fetched 6,225 kB in 1s (6,187 kB/s)
Selecting previously unselected package libnvrtc8.0:amd64.
(Reading database ... 19845 files and directories currently installed.)
Preparing to unpack .../libnvrtc8.0_8.0.61-1_amd64.deb ...
Unpacking libnvrtc8.0:amd64 (8.0.61-1) ...
Setting up libnvrtc8.0:amd64 (8.0.61-1) ...
Processing triggers for libc-bin (2.26-0ubuntu2.1) ...
Skipping mxnet-cu80 as it is not installed.
Collecting mxnet-cu80==1.1.0
  Downloading https://files.pythonhosted.org/packages/9c/55/bcfd26fd408a4bab27bca1ef5dc1df42954509c904699a6c371d5a4c23ab/

In [0]:
import mxnet as mx
import turicreate as tc

In [0]:
# Use all GPUs (default)
tc.config.set_num_gpus(-1)

# Use only 1 GPU
#tc.config.set_num_gpus(1)

# Use CPU
#tc.config.set_num_gpus(0)

# Sample Data

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.

https://labrosa.ee.columbia.edu/millionsong/

The first table contains metadata about each song in the database. Here's how we load it into an SFrame:

In [20]:
songs = tc.SFrame.read_csv("./song_data.csv")

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [21]:
songs.head()

song_id,title,release,artist_name,year
SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions ...,Der Mystic,0
SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie ...",Berwald: Symphonies Nos. 1/2/3/4 ...,David Montgomery,0
SOQVRHI12A6D4FB2D7,We Have Got Love,Strictly The Best Vol. 34,Sasha / Turbulence,0
SOEYRFT12AB018936C,2 Da Beat Ch'yall,Da Bomb,Kris Kross,1993
SOPMIYT12A6D4F851E,Goodbye,Danny Boy,Joseph Locke,0
SOJCFMH12A8C13B0C2,Mama_ mama can't you see ? ...,March to cadence with the US marines ...,The Sun Harbor's Chorus- Documentary Recordings ...,0


No options are needed for the simplest case, because the SFrame parser infers the column types. There are, however, many options you may need to specify when importing a CSV file. Some of the more common ones come into play when we load the usage data of users listening to these songs online:

In [22]:
usage_data = tc.SFrame.read_csv("./10000.txt",
                                header=False,
                                delimiter='\t',
                                column_type_hints={'X3':int})

In [23]:
usage_data.head()

X1,X2,X3
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1


The header and delimiter options are needed because this particular file does not provide column names in its first line, and its values are separated by tabs, not commas. The column_type_hints option tells the SFrame CSV parser which type to use for a column instead of inferring it, which is what the parser does by default; here it fixes column X3 as int. For a full list of options for parsing CSV files, see the [API Reference](https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.read_csv.html#turicreate.SFrame.read_csv).
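
To make the three options concrete, here is a minimal standard-library sketch of what they mean. It is an illustration only, not the Turi Create parser: `read_delimited` is a hypothetical helper, the user ID `u1` is made up, and columns without a hint are simply kept as strings rather than inferred.

```python
import csv
import io


def read_delimited(lines, delimiter=",", header=True, type_hints=None):
    """Parse delimited text into a list of dict rows.

    Columns are named X1, X2, ... when header is False. type_hints maps a
    column name to a callable (for example int) applied to that column.
    """
    rows = list(csv.reader(lines, delimiter=delimiter))
    if not rows:
        return []
    if header:
        names, rows = rows[0], rows[1:]
    else:
        names = ["X%d" % (i + 1) for i in range(len(rows[0]))]
    hints = type_hints or {}
    return [
        {name: hints.get(name, str)(value) for name, value in zip(names, row)}
        for row in rows
    ]


# Two tab-separated rows without a header line (the user ID is hypothetical).
sample = io.StringIO("u1\tSOAKIMP12A8C130995\t1\nu1\tSOBBMDR12A8C13253B\t2\n")
parsed = read_delimited(sample, delimiter="\t", header=False,
                        type_hints={"X3": int})
print(parsed[1])
```

Unlike an SFrame, this sketch reads every row into memory, so it is only suitable for small inputs.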

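Here we might want to rename columns from the default names. The result is assigned back to usage_data so that the new names persist whether rename modifies the SFrame in place or returns a new one:

In [24]:
usage_data = usage_data.rename({'X1':'user_id', 'X2':'song_id', 'X3':'listen_count'})
usage_data.head()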

user_id,song_id,listen_count
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1


SFrames can be saved as a CSV file or in the SFrame binary format. An SFrame saved in binary format does not have to be parsed again when it is loaded. Binary is the default format for save, and we supply the name of a directory, which will be created to hold the binary files:

In [0]:
usage_data.save('./music_usage_data.sframe')

Loading is then very fast:

In [0]:
same_usage_data = tc.load_sframe('./music_usage_data.sframe')

# Data Types

Each column of an SFrame holds values of a single type. The following data types are supported:

*   int (signed 64-bit integer)
*   float (double-precision floating point)
*   str (string)
*   array.array (1-D array of doubles)
*   list (list of arbitrary elements)
*   dict (arbitrary dictionary of elements)
*   datetime.datetime (datetime with microsecond precision)
*   image (image)
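
As a quick orientation, the sketch below builds one plain Python value for each of the types listed above, using only the standard library. The values are hypothetical, and the image type is left out because it requires Turi Create itself.

```python
import array
import datetime

# One illustrative Python value per column type (image omitted).
example_row = {
    "int": 2**63 - 1,  # largest value that fits in a signed 64-bit integer
    "float": 0.1,
    "str": "Silent Night",
    "array.array": array.array("d", [1.0, 2.5]),  # typecode "d" means double
    "list": [1, "two", 3.0],
    "dict": {"plays": 5},
    "datetime.datetime": datetime.datetime(2010, 10, 19, 23, 55, 27),
}

for name, value in example_row.items():
    print(name, type(value).__name__)
```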

# Memory Intensive Example

https://blog.usejournal.com/python-for-big-data-computation-on-a-single-computer-c232046df3c3

The data for our experiment comes from the now-defunct Gowalla social networking site. Two data sets from this site are available at the link below. We will look at the larger one, which contains the event log of "check-ins" by Gowalla users at a set of locations. This data set contains 6.44 million records, each describing a single check-in in just a few columns, of which we will use only three: user_id, location_id and checkin_ts (the check-in timestamp, with one-second resolution).

https://snap.stanford.edu/data/loc-gowalla.html

## The problem and its (theoretical) solution

We will use Turi Create to attack what could be termed the “stalker-stalkee detection problem” on this data set. In this problem, we are asked to identify pairs of users (E, R) that maximize the ‘stalking measure between E and R’. The stalking measure between E and R is defined as the number of distinct locations where there was ever a check-in by user E (the stalkEE) followed by a check-in by user R (the stalkER).

The first step is to bring together the check-ins that share a location_id (a single location_id usually refers to many rows). In the code below this is done by joining the check-ins with themselves on location_id.

Then comes the tricky part: for each location we want to consider all pairs of check-ins in which the check-in timestamp of the first user strictly precedes that of the second user. So we generate chin_ps, an SFrame containing all pairs of check-ins for the same location, and then filter it to enforce the conditions just described, producing pairs_filtered.

However, running a naive pandas solution on a laptop or PC with a typical amount of RAM (say 16 GB) results in a MemoryError exception. With Turi Create and SFrames we do not run into this problem.
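
Before running anything on the full data set, the definition of the stalking measure can be checked on a toy example. The sketch below uses only the standard library and follows the plain definition given above (distinct locations where E checked in strictly before R). The function name `stalking_measure` and the toy check-ins are hypothetical, and requiring E and R to be different users is an assumption that matches the filter used later in this notebook.

```python
from collections import defaultdict
from itertools import permutations


def stalking_measure(checkins):
    """Count, per (stalkee, stalker) pair, the distinct locations where the
    stalkee checked in strictly before the stalker.

    checkins is an iterable of (user_id, location_id, timestamp) tuples.
    """
    # Group the check-ins by location.
    by_location = defaultdict(list)
    for user, location, ts in checkins:
        by_location[location].append((user, ts))

    # Examine every ordered pair of check-ins at the same location.
    locations_per_pair = defaultdict(set)
    for location, visits in by_location.items():
        for (ee, ts_ee), (er, ts_er) in permutations(visits, 2):
            if ee != er and ts_ee < ts_er:
                locations_per_pair[(ee, er)].add(location)
    return {pair: len(locs) for pair, locs in locations_per_pair.items()}


# Hypothetical toy data: (user_id, location_id, timestamp).
toy = [
    ("a", 1, 10), ("b", 1, 20),  # b follows a at location 1
    ("a", 2, 5), ("b", 2, 7),    # b follows a at location 2
    ("b", 3, 1), ("a", 3, 9),    # a follows b at location 3
    ("a", 1, 30),                # a returns to location 1 after b
]
measure = stalking_measure(toy)
print(measure)
```

The pairwise loop is quadratic in the number of check-ins per location, which is exactly why the real data set needs a structure that is not limited by RAM.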

In [6]:
checkins = ( tc.SFrame.read_csv( 'loc-gowalla_totalCheckins.txt',                  
                                 delimiter='\t', header=False )
                .rename( {'X1': 'user_id', 'X2' : 'checkin_ts',
                          'X3': 'lat', 'X4' : 'lon',
                          'X5': 'location_id'} )
  [["user_id", "location_id", "checkin_ts"]] )

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,float,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [7]:
checkins.head()

user_id,location_id,checkin_ts
0,22847,2010-10-19T23:55:27Z
0,420315,2010-10-18T22:17:43Z
0,316637,2010-10-17T23:42:03Z
0,16516,2010-10-17T19:26:05Z
0,5535878,2010-10-16T18:50:42Z
0,15372,2010-10-12T23:58:03Z
0,21714,2010-10-12T22:02:11Z
0,420315,2010-10-12T19:44:40Z
0,153505,2010-10-12T15:57:20Z
0,420315,2010-10-12T15:19:03Z


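Next, generate the pairs of check-ins that satisfy the conditions of our detection algorithm. In addition to the ordering condition, the filter used here keeps only pairs of distinct users whose check-ins are less than one day apart.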

In [0]:
import datetime
import dateutil.parser

In [0]:
chin_ps = ( checkins.join(checkins, on='location_id').rename( {'checkin_ts': 'checkin_ts_ee', 'checkin_ts.1': 'checkin_ts_er', 'user_id': 'stalkee' , 'user_id.1': 'stalker' } ) )

In [0]:
chin_ps['time_diff'] = (chin_ps['checkin_ts_er'].apply(dateutil.parser.parse) - chin_ps['checkin_ts_ee'].apply(dateutil.parser.parse)) / 86400
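
The division by 86400 converts a difference in seconds into days. The same arithmetic can be checked for a single pair of timestamps with the standard library. This is a sketch: `days_between` is a hypothetical helper, and it assumes timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form shown in the check-in data.

```python
from datetime import datetime


def days_between(ts_earlier, ts_later):
    """Return the gap between two timestamps of the form
    YYYY-MM-DDTHH:MM:SSZ, expressed in days."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(ts_later, fmt) - datetime.strptime(ts_earlier, fmt)
    return delta.total_seconds() / 86400  # 86400 seconds per day


gap = days_between("2010-10-18T20:24:42Z", "2010-10-18T22:17:43Z")
print(gap)
```

A value between 0 and 1 means the second check-in happened after the first one and within the following 24 hours.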

In [11]:
# pairs_filtered = chin_ps[ (chin_ps['checkin_ts_ee'] < chin_ps['checkin_ts_er']) & (chin_ps['stalkee'] != chin_ps['stalker']) ]
pairs_filtered = chin_ps[ (chin_ps['time_diff'] > 0.0) & (chin_ps['time_diff'] < 1.0) & (chin_ps['stalkee'] != chin_ps['stalker']) ]
pairs_filtered.head()

stalkee,location_id,checkin_ts_ee,stalker,checkin_ts_er,time_diff
7,420315,2010-10-18T20:24:42Z,0,2010-10-18T22:17:43Z,0.0784837962963
7,420315,2010-10-18T15:08:58Z,0,2010-10-18T22:17:43Z,0.297743055556
31,420315,2010-10-18T14:00:53Z,0,2010-10-18T22:17:43Z,0.345023148148
66,420315,2010-10-18T18:59:11Z,0,2010-10-18T22:17:43Z,0.13787037037
327,420315,2010-10-18T21:21:12Z,0,2010-10-18T22:17:43Z,0.0392476851852
327,420315,2010-10-18T14:05:59Z,0,2010-10-18T22:17:43Z,0.341481481481
342,420315,2010-10-18T14:10:40Z,0,2010-10-18T22:17:43Z,0.338229166667
350,420315,2010-10-18T19:28:34Z,0,2010-10-18T22:17:43Z,0.117465277778
456,420315,2010-10-18T16:00:08Z,0,2010-10-18T22:17:43Z,0.262210648148
515,420315,2010-10-18T11:42:06Z,0,2010-10-18T22:17:43Z,0.441400462963


In [0]:
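import turicreate.aggregate as agg

# Count the distinct locations per (stalkee, stalker) pair and keep the top 5.
final_result = ( pairs_filtered[['stalkee', 'stalker', 'location_id']]
                    .unique()
                    .groupby( ['stalkee', 'stalker'], {"location_count": agg.COUNT() })
                    .topk( 'location_count', k=5 ) )
# Force the lazy pipeline to run; called separately so final_result stays an SFrame.
final_result.materialize()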

In [0]:
print( final_result )

# Articles, Repositories, etc

*   https://medium.com/@nilotic2/a-guide-to-turi-create-a72f53f26721
*   https://blog.usejournal.com/python-for-big-data-computation-on-a-single-computer-c232046df3c3
*   https://github.com/onmyway133/Avengers