<center><h1>  Google's Quickdraw Datset</h1></center>

## The Data
The raw data are [`ndjson`](http://ndjson.org/) files that are seperated by category, in the following format: 

| Key          | Type                   | Description                                  |
| ------------ | -----------------------| -------------------------------------------- |
| key_id       | 64-bit unsigned integer| A unique identifier across all drawings.     |
| word         | string                 | Category the player was prompted to draw.    |
| recognized   | boolean                | Whether the word was recognized by the game. |
| timestamp    | datetime               | When the drawing was created.                |
| countrycode  | string                 | A two letter country code ([ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2)) of where the player was located. |
| drawing      | string                 | A JSON array representing the vector drawing |  



The format of the drawing array is as following:
 
```javascript
[ 
  [  // First stroke 
    [x0, x1, x2, x3, ...],
    [y0, y1, y2, y3, ...],
    [t0, t1, t2, t3, ...]
  ],
  [  // Second stroke
    [x0, x1, x2, x3, ...],
    [y0, y1, y2, y3, ...],
    [t0, t1, t2, t3, ...]
  ],
  ... // Additional strokes
]
```
Where `x` and `y` are the pixel coordinates, and `t` is the time in milliseconds since the first point. `x` and `y` are real-valued while `t` is an integer. The raw drawings can have vastly different bounding boxes and number of points due to the different devices used for display and input.

## Data Wrangling

In [1]:
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing
# Plotting libraries
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
from glob import glob
files = glob("data/full/*.ndjson")
files

['data/full\\cannon.ndjson',
 'data/full\\door.ndjson',
 'data/full\\face.ndjson',
 'data/full\\nail.ndjson',
 'data/full\\pear.ndjson',
 'data/full\\piano.ndjson',
 'data/full\\radio.ndjson',
 'data/full\\spider.ndjson',
 'data/full\\star.ndjson',
 'data/full\\sword.ndjson']

In [5]:
df_door = pd.read_json('data/full\\cannon.ndjson', lines=True)
df_door.shape

(141394, 6)

In [6]:
df_door.head()

Unnamed: 0,countrycode,drawing,key_id,recognized,timestamp,word
0,US,"[[[275, 272, 272, 271, 270, 270, 269, 269, 269...",5429533144514560,True,2017-03-07 02:28:18.148170,cannon
1,CZ,"[[[154.33299255371094, 149.375, 141.3139953613...",6738271738527744,True,2017-03-19 12:58:59.089310,cannon
2,DE,"[[[640, 640, 640, 640, 640, 640, 640, 640, 640...",6068795355430912,True,2017-03-26 07:47:54.888710,cannon
3,US,"[[[692.8319702148438, 687.8109741210938, 683.2...",5683221763194880,False,2017-03-12 05:47:58.073700,cannon
4,US,"[[[255, 261, 265, 270, 275, 281, 286, 291, 298...",6136843659640832,False,2017-03-03 13:50:42.610660,cannon


## Feature Engineering

**Normalization:**
1. X = normalized X Ranges between 0 to 1
2. Y = Y values normalized using X. Ranges between 0 and 1.5

**Features Created:**
1. stroke_count: Number of strokes the doodle took to create
2. draw_time: The total time the doodle create to draw
3. total_datapoints: The total number of datapoints that exist in an image
4. total_time_of_stroke = time spent on each stroke
5. datapoints_per_stroke = number of datapoints exist within each stroke

**Filtering applied:**
1. Removing Unrecognized Images:
    a. Having unrecognized images in the dataset may reduce prediction accuracy
2. Removing data where final time is greater than 20000
    a. Quickdraw ask users to draw in 20sec (20000ms) and some users have final times higher than that limit.
3. Removing data where y_max is greater than 1.5.
    a. We need this to maintain 2:3 X:Y ratio of the images




### Normalization

In [7]:
def array_normalizer(df):
    X = {}
    X_strokes = {}
    y = {}
    y_strokes = {}
    y_max = {}
    for index in df.index:
        # store X / Y of the stroke in a list for EACH stroke
        X_stroke_lists = [df.loc[index,'drawing'][stroke][0] for stroke in range(df['stroke_count'][index])]
        y_stroke_lists = [df.loc[index,'drawing'][stroke][1] for stroke in range(df['stroke_count'][index])]
        
        # Make a flattened list of all X / Y info of an image
        X_list = [item for sublist in X_stroke_lists for item in sublist]
        y_list = [item for sublist in y_stroke_lists for item in sublist]

        #normalizing X and Y
        X_min = np.min(X_list)
        X_max = np.max(X_list)
        y_min = np.min(y_list)
    
        #runs user defined function array_normalizer to normalize
        X_norm = (np.array(X_list)-np.array([X_min]*len(X_list)))/float(X_max-X_min)
        y_norm = (np.array(y_list)-np.array([y_min]*len(y_list)))/float(X_max-X_min)
        
        #Filtering out data where Ymax is greater than 1.5. We need this filter to maintain 2:3 X:Y ratio of all images
        y_max[index] = np.max(y_norm)
        
        #store X,Y and time info from each stroke as a list
        X_strokes[index] = [list((np.array(X_stroke_lists[stroke])-np.array([X_min]*len(X_stroke_lists[stroke])))/float(X_max-X_min)) for stroke in range(len(X_stroke_lists))]
        y_strokes[index] = [list((np.array(y_stroke_lists[stroke])-np.array([X_min]*len(y_stroke_lists[stroke])))/float(X_max-X_min)) for stroke in range(len(y_stroke_lists))]

        
        X[index] = X_norm
        y[index] = y_norm
        
    df['X'] = pd.Series(X)
    df['y'] = pd.Series(y)
    df['X_per_stroke'] = pd.Series(X_strokes)
    df['Y_per_stroke'] = pd.Series(y_strokes)
    df['y_max'] = pd.Series(y_max)
    df = df[df['y_max']<=1.5]
    
    return(df)

## Filtering Applied

In [20]:
def filters(df):
    #setting boolean and changing recognized features to 1 and 0.
    recoginzed_boolean = {True: 1, False:0}
    df['recognized'] = df['recognized'].map(recoginzed_boolean)

    # Remove data that was not recognized
    df = df[(df['recognized']=='True')]
    
    # Remove draw times that exceeded the max time
    df = df[(df['draw_time']<=20000)]

## Creating Features

In [9]:
def stroke_count(df):
    '''
    Calculates the number of strokes
    
    New Feature: stroke_count [int]
    '''
    df['stroke_count']=df['drawing'].str.len()

In [10]:
def draw_time(df):
    '''
    Calcualtes total time taken to draw the image
    
    New Feature: draw_time [int]
    '''
    df['draw_time'] = [df.loc[index,'drawing'][df.loc[index,'stroke_count']-1][2][-1] for index in df.index]

In [17]:
def total_datapoints(df):
    '''
    Calculates total number of datapoints and total for each stroke
    
    New Features: total_datapoints [int]; datapoints_per_strokes [List]
    '''
    dpps_dict = {}
    for index in df.index:
        total_points = [len(df.loc[index,'drawing'][stroke][0]) for stroke in range(df['stroke_count'][index])]
        df.loc[index, 'total_datapoints'] = sum(total_points)
        # Points per Stroke
        datapoints_per_strokes = [len(df.loc[index,'drawing'][stroke][0]) for stroke in range(df['stroke_count'][index])]
        dpps_dict[index] = datapoints_per_strokes
    df['datapoints_per_strokes'] = pd.Series(dpps_dict)

In [12]:
def stroke_times(df):
    '''
    Calculates the total time each stroke takes 
    
    New Feature: stroke_times [List]
    '''
    st_dict = {}
    
    for index in df.index:
        stroke_times = [(df.loc[index,'drawing'][stroke][2][-1] - df.loc[index,'drawing'][stroke][2][0]) 
                                       for stroke in range(df['stroke_count'][index])]
        st_dict[index] = stroke_times
    df['stroke_times'] = pd.Series(st_dict)

# Loading all datasets

In [13]:
def feature_engineering(files):
    '''
    Loads a subset of 3000 rows for the entire dataset of doodles for each category
    and then perform engineering with all the user-defined functions
    
    Returns: Dataframe
    '''
    
    # Load up to 3000 rows of data
    rows = ''
    with open(files,'r') as f:
        for line in f.readlines()[0:3000]:
            rows += line
    df = pd.read_json(rows,lines=True)
    
    # Creates new feature "stroke_count" which calcualtes the total number of strokes
    stroke_count(df)
    
    # Creates new feature "draw_time" which calcualtes the total draw time
    draw_time(df)
    
    # Creates 5 new features which are the normalizes X and Y data, then filters out data to keep 2:3 X:Y ratio 
    df = array_normalizer(df)
    
    # Create 2 new features "total_datapoints" & "datapoints_per_stroke" which 
    # calcualtes total datapoints in each doodle and each stroke
    total_datapoints(df)
    
    # Creates new feature "stroke_time" which calcualtes time taken for each stroke
    stroke_times(df)
    
    print(df.shape)
    return df

In [18]:
cannon = feature_engineering('data/full/cannon.ndjson')
door = feature_engineering('data/full/door.ndjson')
face = feature_engineering('data/full/face.ndjson')
nail = feature_engineering('data/full/nail.ndjson')
pear = feature_engineering('data/full/pear.ndjson')
piano = feature_engineering('data/full/piano.ndjson')
radio = feature_engineering('data/full/radio.ndjson')
spider = feature_engineering('data/full/spider.ndjson')
star = feature_engineering('data/full/star.ndjson')
sword = feature_engineering('data/full/sword.ndjson')

(2978, 16)
(1568, 16)
(2879, 16)
(1212, 16)
(1379, 16)
(2988, 16)
(2943, 16)
(2971, 16)
(2908, 16)
(1105, 16)


In [21]:
cannon = feature_engineering('data/full/cannon.ndjson')

(2978, 16)


In [19]:
cannon.head()

Unnamed: 0,countrycode,drawing,key_id,recognized,timestamp,word,stroke_count,draw_time,X,y,X_per_stroke,Y_per_stroke,y_max,total_datapoints,datapoints_per_strokes,stroke_times
0,US,"[[[275, 272, 272, 271, 270, 270, 269, 269, 269...",5429533144514560,True,2017-03-07 02:28:18.148170,cannon,7,6627,"[0.892682926829, 0.878048780488, 0.87804878048...","[0.0, 0.0292682926829, 0.0634146341463, 0.0926...","[[0.892682926829, 0.878048780488, 0.8780487804...","[[0.965853658537, 0.99512195122, 1.02926829268...",0.839024,153.0,"[28, 62, 27, 22, 5, 5, 4]","[862, 1585, 588, 536, 137, 116, 55]"
1,CZ,"[[[154.33299255371094, 149.375, 141.3139953613...",6738271738527744,True,2017-03-19 12:58:59.089310,cannon,5,5820,"[0.373322584494, 0.355768067261, 0.32722686972...","[0.699869671647, 0.694027620459, 0.68520787036...","[[0.373322584494, 0.355768067261, 0.3272268697...","[[0.820627083151, 0.814785031963, 0.8059652818...",0.716621,130.0,"[47, 53, 15, 10, 5]","[1135, 1506, 586, 363, 322]"
2,DE,"[[[640, 640, 640, 640, 640, 640, 640, 640, 640...",6068795355430912,True,2017-03-26 07:47:54.888710,cannon,4,7744,"[0.497363796134, 0.497363796134, 0.49736379613...","[0.235500878735, 0.2460456942, 0.254833040422,...","[[0.497363796134, 0.497363796134, 0.4973637961...","[[-0.181019332162, -0.170474516696, -0.1616871...",0.435852,121.0,"[28, 44, 31, 18]","[1804, 2408, 779, 465]"
3,US,"[[[692.8319702148438, 687.8109741210938, 683.2...",5683221763194880,False,2017-03-12 05:47:58.073700,cannon,13,17961,"[0.417519288799, 0.412606828573, 0.40810431728...","[0.194627886393, 0.1958410725, 0.201505921896,...","[[0.417519288799, 0.412606828573, 0.4081043172...","[[-0.0091283356945, -0.00791514958687, -0.0022...",0.593593,673.0,"[63, 121, 143, 69, 14, 53, 5, 10, 7, 5, 134, 7...","[620, 1092, 2689, 1090, 496, 924, 104, 208, 93..."
4,US,"[[[255, 261, 265, 270, 275, 281, 286, 291, 298...",6136843659640832,False,2017-03-03 13:50:42.610660,cannon,4,13609,"[0.01982160555, 0.0257680872151, 0.02973240832...","[0.291377601586, 0.296333002973, 0.30128840436...","[[0.01982160555, 0.0257680872151, 0.0297324083...","[[0.145688800793, 0.15064420218, 0.15559960356...",0.33003,285.0,"[70, 121, 35, 59]","[1876, 4591, 584, 1172]"
