# Data Engineering & <br> Datetime Feature Engineering <br> of the Google Store Analytics Dataset

## Using <u> Full Kaggle Test Dataset </u> - Data Engineering/Feature Engineering - v1
This dataset is provided by the Kaggle competition.  
https://www.kaggle.com/c/ga-customer-revenue-prediction

Many columns hold JSON-formatted data, which needs to be split out into separate columns.

We also want to add further features based on the date and time of day.

In [1]:
import pandas as pd
import numpy as np

from datetime import datetime, timezone, timedelta
# import holidays
import pytz

import requests
import os
gmaps_api_key = os.environ.get('googlemaps_api_key')
timezonedb_api_key = os.environ.get('timezonedb_api_key')

from time import sleep

# Import the Full Kaggle Test Dataset


In [2]:
df = pd.read_csv('data/test.csv', low_memory=False, dtype={'fullVisitorId':str}) #full Kaggle test dataset

# df = pd.read_pickle('data/train_sample.pkl') #top 10,000 rows of the training dataset
print(df.shape)
print(df.columns)
df.head(3)

(804684, 12)
Index(['channelGrouping', 'date', 'device', 'fullVisitorId', 'geoNetwork',
       'sessionId', 'socialEngagementType', 'totals', 'trafficSource',
       'visitId', 'visitNumber', 'visitStartTime'],
      dtype='object')


Unnamed: 0,channelGrouping,date,device,fullVisitorId,geoNetwork,sessionId,socialEngagementType,totals,trafficSource,visitId,visitNumber,visitStartTime
0,Organic Search,20171016,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",6167871330617112363,"{""continent"": ""Asia"", ""subContinent"": ""Southea...",6167871330617112363_1508151024,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""4"", ""pageviews"": ""4""}","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1508151024,2,1508151024
1,Organic Search,20171016,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",643697640977915618,"{""continent"": ""Europe"", ""subContinent"": ""South...",0643697640977915618_1508175522,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""5"", ""pageviews"": ""5"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1508175522,1,1508175522
2,Organic Search,20171016,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",6059383810968229466,"{""continent"": ""Europe"", ""subContinent"": ""Weste...",6059383810968229466_1508143220,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""7"", ""pageviews"": ""7"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1508143220,1,1508143220
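
The `dtype={'fullVisitorId': str}` argument in the read above matters because some visitor IDs begin with a zero. A minimal sketch of the difference, using a made-up two-row CSV rather than the competition file:

```python
import io
import pandas as pd

# Tiny illustrative CSV; the second ID starts with a zero.
csv_text = "fullVisitorId,visitNumber\n6167871330617112363,2\n0643697640977915618,1\n"

as_number = pd.read_csv(io.StringIO(csv_text))
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": str})

# Parsed as an integer, the leading zero is gone; parsed as text, it survives.
print(as_number["fullVisitorId"].iloc[1])
print(as_string["fullVisitorId"].iloc[1])
```

Since the final prediction is keyed on fullVisitorId, keeping the column as text avoids silently changing IDs.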


# Data Engineering - Split out All Columns that Have JSON/Dictionary Info into Individual Columns
Some of the columns in the dataset hold JSON/dictionary data. To be able to graph the data and break out the variables, we want to split this data into separate columns, one column per dictionary key.

The columns that need to be split out are:
['device', 'geoNetwork', 'totals', 'trafficSource']
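
As a point of comparison, here is a minimal sketch of the same split done with `json.loads` on a toy frame. This is an alternative to the eval-based cell in this notebook, not what the notebook runs: `json.loads` turns JSON `true`/`false` into real Python booleans, whereas the cell in this notebook stores them as the strings 'true'/'false'. The toy data and the name `toy` are made up for illustration.

```python
import json
import pandas as pd

# Toy stand-in for the real frame: one JSON-string column, keys differ by row.
toy = pd.DataFrame({
    "device": [
        '{"browser": "Chrome", "isMobile": false}',
        '{"browser": "Safari", "isMobile": true, "deviceCategory": "mobile"}',
    ]
})

# json.loads understands true/false/null natively, so no string patching is needed.
parsed = toy["device"].map(json.loads)

# One column per key; keys missing from a row become NaN.
expanded = pd.DataFrame(parsed.tolist(), index=toy.index).add_prefix("device_")

# Swap the original JSON column for the expanded columns.
toy = toy.drop(columns="device").join(expanded)
print(toy.columns.tolist())
```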

In [3]:
#print out the shape of the dataframe and the names of all its columns to see which columns get added as we iterate through
print('original df shape: ', df.shape)
print('original df columns: ', df.columns)

#iterate through each of the column names that have to be split out into individual columns
for column in ['device', 'geoNetwork', 'totals', 'trafficSource']:

    #print out the name of the column you are processing in this for loop
    print('\n----------------------------------------\nsplitting out column: ', column)

    
    #some of the dictionaries hold the JSON literals false/true, which python's eval can't read because it
    #expects False/True - instead of mapping them to booleans, wrap them in quotes so they become the text
    #strings 'false'/'true' before evaluating to dictionary (note: this replaces every occurrence of the substring)
    df[column] = df[column].map(lambda x: x.replace("false", "'false'"))
    df[column] = df[column].map(lambda x: x.replace("true", "'true'"))
  

    #the JSON/dictionary columns are imported as string types - need to convert to python dictionaries using eval
    df[column] = df[column].map(lambda x: eval(x))

    
    #figure out all the possible unique keys for the dictionaries in this column - need this to make sure we
    #iterate through all of the dictionary data and store into different columns
    unique_column_keys = set()
    #iterate through each row of the column dictionaries
    for row_dictionary in list(df[column]):
        #iterate through each key of the row's dictionary and add to the unique column keys set (use add instead of append for sets)
        for row_key in list(row_dictionary.keys()):
            unique_column_keys.add(row_key)
    print('max length of column dictionaries is: ', len(unique_column_keys))
    print('unique column keys are: ', unique_column_keys)

    
    #use the set of unique_column_keys to add a new column for each of these keys
    for column_key in unique_column_keys:
        #create new column names of format:  originalcolumnname_columnkey, eg 'device_isMobile'
        df[f"{column}_{column_key}"] = df[column].map(lambda x: x.get(column_key, np.nan))
    
    print('updated df shape: ', df.shape)
    print('updated df columns: ', df.columns, '\n')

#drop original columns that had dictionary/jsons so don't keep repetitive data (cuts file size down by almost 40%)
print('\n----------------------------------------\n')
df.drop(['device', 'geoNetwork', 'totals', 'trafficSource'], axis=1, inplace=True)
print('final df shape after dropping original columns: ', df.shape)
    
#show the final df head
df.head(3)

original df shape:  (804684, 12)
original df columns:  Index(['channelGrouping', 'date', 'device', 'fullVisitorId', 'geoNetwork',
       'sessionId', 'socialEngagementType', 'totals', 'trafficSource',
       'visitId', 'visitNumber', 'visitStartTime'],
      dtype='object')

----------------------------------------
splitting out column:  device
max length of column dictionaries is:  16
unique column keys are:  {'mobileDeviceBranding', 'browser', 'operatingSystem', 'mobileInputSelector', 'mobileDeviceModel', 'language', 'deviceCategory', 'operatingSystemVersion', 'flashVersion', 'browserVersion', 'screenColors', 'isMobile', 'browserSize', 'mobileDeviceMarketingName', 'mobileDeviceInfo', 'screenResolution'}
updated df shape:  (804684, 28)
updated df columns:  Index(['channelGrouping', 'date', 'device', 'fullVisitorId', 'geoNetwork',
       'sessionId', 'socialEngagementType', 'totals', 'trafficSource',
       'visitId', 'visitNumber', 'visitStartTime',
       'device_mobileDeviceBranding

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_mobileDeviceBranding,device_browser,...,totals_visits,totals_bounces,trafficSource_keyword,trafficSource_medium,trafficSource_adContent,trafficSource_adwordsClickInfo,trafficSource_referralPath,trafficSource_isTrueDirect,trafficSource_campaign,trafficSource_source
0,Organic Search,20171016,6167871330617112363,6167871330617112363_1508151024,Not Socially Engaged,1508151024,2,1508151024,not available in demo dataset,Chrome,...,1,,(not provided),organic,,{'criteriaParameters': 'not available in demo ...,,True,(not set),google
1,Organic Search,20171016,643697640977915618,0643697640977915618_1508175522,Not Socially Engaged,1508175522,1,1508175522,not available in demo dataset,Chrome,...,1,,(not provided),organic,,{'criteriaParameters': 'not available in demo ...,,,(not set),google
2,Organic Search,20171016,6059383810968229466,6059383810968229466_1508143220,Not Socially Engaged,1508143220,1,1508143220,not available in demo dataset,Chrome,...,1,,(not provided),organic,,{'criteriaParameters': 'not available in demo ...,,,(not set),google


##### The test data does not include the transactionRevenue column (it is a blind final prediction for the Kaggle competition).  One other column from the train data is also missing from the test data, 'campaignCode', so make sure not to use 'campaignCode' in any machine learning models.
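
One way to guard against this is to build the feature list from the columns the two datasets share. A minimal sketch, with hypothetical column lists standing in for `train_df.columns` and `test_df.columns` (the exact names, including `trafficSource_campaignCode`, are assumed from the naming pattern used in this notebook):

```python
# Hypothetical column lists; in practice these would come from the
# train and test dataframes after the JSON split.
train_columns = ["channelGrouping", "totals_hits", "totals_transactionRevenue",
                 "trafficSource_campaignCode", "trafficSource_source"]
test_columns = ["channelGrouping", "totals_hits", "trafficSource_source"]

test_column_set = set(test_columns)

# Columns a model may use: present in both datasets, kept in training order.
shared_features = [c for c in train_columns if c in test_column_set]

# Columns that exist only in the training data (the target plus anything to exclude).
train_only = [c for c in train_columns if c not in test_column_set]

print(shared_features)
print(train_only)
```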

# Datetime Feature Engineering
### (Add in Additional Date/Time Features into our Data to Help with Analysis - day of week, holiday, local timezone, etc.)
1. Use visitStartTime, which is a POSIX timestamp, to get the UTC date and hour together in one datetime object (rather than converting just the date column, which doesn't have any hour info).  A POSIX timestamp converts directly to a python datetime object - for example, a timestamp of 1472830385 becomes datetime.datetime(2016, 9, 2, 15, 33, 5) in UTC, which is in format year, month, day, hour, minute, second.
2. Then format the UTC datetime object as a string, datetime_iso_utc (which can be used for time series analysis based on purchase time).  (iso format is: 'YYYY-MM-DD HH:MM:SS')
    - datetime_iso_utc
3. Convert to local timezone datetime object (this will require an in-depth sub-process to identify the timezone and look it up)
4. Using the local timezone datetime object, extract multiple new datetime features:
    - datetime_iso_local
    - year_local
    - month_local
    - day_local
    - yearday_local (overall day of the year, 1 to 365 or 366 in a leap year, which captures the month and day values)
    - weekday_local (values of 1 to 7 - 1 is Monday)
    - hour_local
5. Try to add a column to determine whether the date is a holiday or not.
    - holiday_local
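
The steps above can be sketched for a single timestamp with only the standard library. The offset of -7 hours is an assumed value chosen for illustration; in the notebook the offset comes from the timezone lookup.

```python
from datetime import datetime, timedelta, timezone

timestamp = 1472830385          # POSIX seconds, the example timestamp from step 1
offset_hours = -7.0             # assumed local offset, for illustration only

# Steps 1-2: timestamp -> UTC datetime -> ISO-style string
utc_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)

# Step 3: shift into a fixed-offset local timezone
local_dt = utc_dt.astimezone(timezone(timedelta(hours=offset_hours)))

# Step 4: pull the individual local features
features = {
    "datetime_iso_utc": utc_dt.isoformat(sep=" "),
    "datetime_iso_local": local_dt.isoformat(sep=" "),
    "year_local": local_dt.year,
    "month_local": local_dt.month,
    "day_local": local_dt.day,
    "yearday_local": local_dt.timetuple().tm_yday,
    "weekday_local": local_dt.isoweekday(),   # 1 = Monday ... 7 = Sunday
    "hour_local": local_dt.hour,
}
print(features)
```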


### Figure Out Local Timezones for Each Location 
- Combine City/Country info given to us
- Get a rough lat/lng for that location by using Gmaps Geocoding API
- Use timezonedb API to send each location lat/lng to find out the local timezone
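
The reason for building a unique set of locations first is that each distinct location only needs one API call, and the result can then be mapped back onto every session. A minimal sketch of that pattern, with a stub function `fake_geocode` and made-up coordinates standing in for the real API:

```python
# Stand-in for a real geocoding call; the coordinates are made up for the sketch.
def fake_geocode(place):
    table = {"Singapore": (1.35, 103.82), "Zaragoza Spain": (41.65, -0.89)}
    return table.get(place, (float("nan"), float("nan")))

# One entry per session, with heavy repetition as in the real data.
session_places = ["Singapore", "Zaragoza Spain", "Singapore", "(not set)", "Singapore"]

# Resolve each distinct place once, skipping the '(not set)' placeholder...
lookup = {place: fake_geocode(place)
          for place in set(session_places) if place != "(not set)"}

# ...then map the cached result back onto every session.
nan_pair = (float("nan"), float("nan"))
session_lat_lng = [lookup.get(place, nan_pair) for place in session_places]

print(len(lookup), "lookups for", len(session_places), "sessions")
```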

In [5]:
###CREATE NEW CITY_COUNTRY COLUMN THAT COMBINES CITY/COUNTRY TOGETHER###

#the geographic information almost always has a country listed (for a very small percentage the country is "(not set)")
#note - there are some strange, incorrect combinations, like New York Canada or Mountain View Germany, so a small percentage of the data will be rough
#the geographic information sometimes has a city (more common for larger countries, but not consistent)

df['city_country'] = df.apply(lambda row: 
                              row.geoNetwork_city + " " + row.geoNetwork_country 
                              if row.geoNetwork_city not in ["not available in demo dataset", "(not set)"] 
                              else row.geoNetwork_country,
                              axis='columns')

In [10]:
#check length of the city_country_set to know how many unique combinations need to get run through the APIs
print(f"# of unique city_country combinations that has to be run through the APIs is: {len(set(df.city_country))}")

# of unique city_country combinations that has to be run through the APIs is: 1520


In [6]:
###GET LAT, LNG OF ALL CITY_COUNTRY COMBINATIONS (best approximation for location with sample data)###

#create a unique set of all city_country combinations to cycle through
city_country_set = set(df.city_country)
#drop the '(not set)' label from the set for the few cases where '(not set)' was in country
#(discard, unlike remove, does not raise an error if the label is absent)
city_country_set.discard('(not set)')

#get Lat/Lng for our locations by using Google Geocoding and store info in a dictionary
city_country_lat_lng = {}

for city_country in city_country_set:
    #Google Geocoding works by taking an address as input and returning a lat, lng
    #https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=YOUR_API_KEY
    #every data point at least has a country

    base_geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json?'
    geocode_parameters = {'address': city_country,
                          'key': gmaps_api_key}

    json_geocode_response = requests.get(base_geocode_url, params=geocode_parameters).json()
    #only read the results when the API reports success; any other status (ZERO_RESULTS,
    #OVER_QUERY_LIMIT, REQUEST_DENIED, etc.) comes back without a usable result
    if json_geocode_response.get('status') == 'OK':
        location = json_geocode_response['results'][0]['geometry']['location']
        lat = location.get('lat')
        lng = location.get('lng')
    else:
        lat = np.nan
        lng = np.nan
    
    city_country_lat_lng[city_country] = (lat, lng)


#add a column to the dataframe that has each city_country approximate lat/lng
df['lat_lng'] = df['city_country'].map(lambda x: city_country_lat_lng.get(x, (np.nan, np.nan)))

In [13]:
#check how many non nan lat_lng we got from the city_country list
print(f"# of non-NaN lat/lngs we successfully got from the Google Geocode API is: {len(city_country_lat_lng)}")
print(f"% of non-Nan lat/lngs we successfully got from the Google Geocode API compared to city_country list is: {len(city_country_lat_lng)/len(set(df.city_country))*100}")

# of non-NaN lat/lngs we successfully got from the Google Geocode API is: 1519
% of non-Nan lat/lngs we successfully got from the Google Geocode API compared to city_country list is: 99.9342105263158
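
Note that `len(city_country_lat_lng)` counts every key in the dictionary, including any location stored as `(nan, nan)` after a failed lookup. A minimal sketch of counting only the resolved locations, using illustrative entries in the same shape as the real dictionary ("Nowhere Land" is a made-up failed lookup):

```python
import math

# Illustrative lookup results in the same shape as city_country_lat_lng.
city_country_lat_lng = {
    "Singapore": (1.352083, 103.819836),
    "Mountain View United States": (37.3860517, -122.0838511),
    "Nowhere Land": (float("nan"), float("nan")),   # a failed lookup
}

# Keep only the entries whose latitude is a real number.
resolved = {place: pair for place, pair in city_country_lat_lng.items()
            if not math.isnan(pair[0])}

print(f"{len(resolved)} of {len(city_country_lat_lng)} places have coordinates")
```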


In [35]:
###GET APPROXIMATE TIMEZONE FOR EACH CITY_COUNTRY COMBINATIONS (using timezonedb API)###

#store timezone for all of our city_country locations in a dictionary
city_country_timezone = {}

#city_country_lat_lng is a dictionary with city_country: (lat, lng)
#cycle through each city_country lat,lng and call the timezonedb API
for city_country, lat_lng in city_country_lat_lng.items():
    #get local timezone info for each city/country combination in our dataset by looking up their lat,lng
    #http://api.timezonedb.com/v2.1/get-time-zone (has a 1 second rate limit)
    base_timezonedb_url = 'http://api.timezonedb.com/v2.1/get-time-zone?'
    timezonedb_parameters = {'key': timezonedb_api_key,
                             'format': 'json',
                             'by': 'position',
                             'lat': lat_lng[0],
                             'lng': lat_lng[1]}

    #timeout=None was tried because the API responded slowly on the day this was run (this was not a
    #problem when the API was run on the kaggle training set, which had a similar number of city_country combos)
    json_timezone_response = requests.get(base_timezonedb_url, timezonedb_parameters).json()  #, timeout=None).json()
    
    if json_timezone_response.get('status') != 'FAILED':
        #get the gmt timezone offset and the abbreviation for the timezone - gmtOffset is the time offset in seconds
        #so will need to divide it by 3600 to get the usable offset in hours
        timezone_abbreviation = json_timezone_response.get('abbreviation', np.nan)
        timezone_gmt_offset = json_timezone_response.get('gmtOffset', np.nan)/3600
    else:
        timezone_abbreviation = np.nan
        timezone_gmt_offset = np.nan
    
    city_country_timezone[city_country] = (timezone_abbreviation, timezone_gmt_offset)
    
    #sleep between requests because the api has a 1 request/second rate limit (2.1 seconds leaves extra headroom)
    sleep(2.1)

#add a column to the dataframe that has each city_country approximate timezone
df['timezone'] = df['city_country'].map(lambda x: city_country_timezone.get(x, (np.nan, np.nan)))
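
Since the API was slow to respond on some runs, one option is to wrap each request in a small retry helper. The sketch below is an assumption about how that could look, not part of the original run; `call_with_retry` and `flaky_lookup` are hypothetical names, and the stub simulates two failures before a successful answer.

```python
import time

def call_with_retry(call, attempts=3, delay_seconds=2.0):
    """Run call(); on an exception wait and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:   # in real use, narrow this to request errors
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error

# Demonstration with a stub that fails twice before answering.
state = {"calls": 0}

def flaky_lookup():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated slow response")
    return {"status": "OK", "gmtOffset": 28800}

response = call_with_retry(flaky_lookup, attempts=3, delay_seconds=0.01)
print(response["gmtOffset"] / 3600)
```

In the loop above, the wrapped call would be something like `lambda: requests.get(base_timezonedb_url, params=timezonedb_parameters, timeout=10).json()`, where the 10 second timeout is an arbitrary choice.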

In [36]:
#check how many non nan timezones we got from the city_country list
print(f"# of non-NaN timezones we successfully got from the TimezoneDB API is: {len(city_country_timezone)}")
print(f"% of non-Nan timezones we successfully got from the TimezoneDB API compared to city_country list is: {len(city_country_timezone)/len(set(df.city_country))*100}")

# of non-NaN timezones we successfully got from the TimezoneDB API is: 1519
% of non-Nan timezones we successfully got from the TimezoneDB API compared to city_country list is: 99.9342105263158


### Extract the Different Datetime Features

- datetime_iso_utc   ('YYYY-MM-DD HH:MM:SS')
- datetime_iso_local  ('YYYY-MM-DD HH:MM:SS')
- year_local
- month_local
- day_local
- yearday_local (overall day of the year, 1 to 365 or 366 in a leap year, which captures the month and day values)
- weekday_local  (values of 1 to 7 where 1 is Monday)
- hour_local
- holiday_local
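
The holiday_local feature is the one item in this list that needs outside data. A minimal sketch of the idea, assuming a hand-built table of fixed-date holidays per country (`FIXED_HOLIDAYS` and `is_holiday_local` are hypothetical names; a real table would need per-country, per-year rules because many holidays move):

```python
from datetime import date

# Hypothetical lookup table: country -> set of (month, day) fixed-date holidays.
FIXED_HOLIDAYS = {
    "United States": {(1, 1), (7, 4), (12, 25)},
    "France": {(1, 1), (7, 14), (12, 25)},
}

def is_holiday_local(country, local_date):
    """Return True/False for known countries, None when the country is not in the table."""
    days = FIXED_HOLIDAYS.get(country)
    if days is None:
        return None
    return (local_date.month, local_date.day) in days

print(is_holiday_local("France", date(2017, 7, 14)))
print(is_holiday_local("United States", date(2017, 10, 16)))
```

Returning None for unknown countries keeps "not a holiday" separate from "no information", which matters when most countries are missing from the table.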

In [37]:
def extract_datetime_info(timestamp, gmt_offset_hours):
    #first convert the timestamp to a datetime object set to utc
    datetime_object_utc = datetime.fromtimestamp(timestamp, tz=pytz.UTC)
    #convert to isoformat for timeseries analysis in Tableau, etc. (and ease of viewing)
    #isoformat() gives a string like '2017-10-16T10:50:24+00:00', swap the middle T for a space
    datetime_iso_utc = datetime_object_utc.isoformat().replace("T", " ")  #'YYYY-MM-DD HH:MM:SS+00:00'
    
    #the timestamps are in UTC; for better analysis we want the datetime in the approximate local
    #timezone based on country/city, since we would assume consumer behavior is more similar
    #between consumers based on the local time, rather than UTC time
    #some of the data doesn't have a timezone, so only the datetime_iso_utc is returned when gmt_offset_hours is nan
    if not np.isnan(gmt_offset_hours): 
        #shift the datetime into a fixed-offset timezone built from the gmtOffset
        datetime_object_local = datetime_object_utc.astimezone(timezone(timedelta(hours=gmt_offset_hours)))

        #same isoformat conversion as above, now carrying the local offset, e.g. '2017-10-16 18:50:24+08:00'
        datetime_iso_local = datetime_object_local.isoformat().replace("T", " ")  #'YYYY-MM-DD HH:MM:SS+HH:MM'

        #extract year, month, day of month, day of year, day of the week (using isoweekday), and hour into different variables
        year_local = datetime_object_local.year #YYYY (as int)
        month_local = datetime_object_local.month #MM (as int)
        day_local = datetime_object_local.day #DD (as int) - this is the day of the month, max 31
        yearday_local = datetime_object_local.timetuple().tm_yday #DDD (as int) - day of the year, 1 to 365 (366 in a leap year), which gives context of the month-day in one value
        weekday_local = datetime_object_local.isoweekday() #integer values of 1 to 7 where 1 is Monday
        hour_local = datetime_object_local.hour #HH (as int)

        return datetime_iso_utc, datetime_iso_local, year_local, month_local, day_local, yearday_local, weekday_local, hour_local
    
    else:
        return datetime_iso_utc, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan

In [38]:
#use the extract_datetime_info function to add columns of all of the additional datetime features to the df
df[['datetime_iso_utc', 'datetime_iso_local',
    'year_local', 'month_local','day_local',
    'yearday_local', 'weekday_local', 'hour_local']] = df[['visitStartTime', 'timezone']].apply(lambda row:
                                                        extract_datetime_info(row.visitStartTime, row.timezone[1]),
                                                        axis='columns',
                                                        result_type='expand')

In [39]:
df.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_mobileDeviceBranding,device_browser,...,lat_lng,timezone,datetime_iso_utc,datetime_iso_local,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
0,Organic Search,20171016,6167871330617112363,6167871330617112363_1508151024,Not Socially Engaged,1508151024,2,1508151024,not available in demo dataset,Chrome,...,"(1.352083, 103.819836)","(+08, 8.0)",2017-10-16 10:50:24+00:00,2017-10-16 18:50:24+08:00,2017.0,10.0,16.0,289.0,1.0,18.0
1,Organic Search,20171016,643697640977915618,0643697640977915618_1508175522,Not Socially Engaged,1508175522,1,1508175522,not available in demo dataset,Chrome,...,"(41.6488226, -0.8890853)","(CEST, 2.0)",2017-10-16 17:38:42+00:00,2017-10-16 19:38:42+02:00,2017.0,10.0,16.0,289.0,1.0,19.0
2,Organic Search,20171016,6059383810968229466,6059383810968229466_1508143220,Not Socially Engaged,1508143220,1,1508143220,not available in demo dataset,Chrome,...,"(46.227638, 2.213749)","(CEST, 2.0)",2017-10-16 08:40:20+00:00,2017-10-16 10:40:20+02:00,2017.0,10.0,16.0,289.0,1.0,10.0
3,Organic Search,20171016,2376720078563423631,2376720078563423631_1508193530,Not Socially Engaged,1508193530,1,1508193530,not available in demo dataset,Safari,...,"(37.3860517, -122.0838511)","(PDT, -7.0)",2017-10-16 22:38:50+00:00,2017-10-16 15:38:50-07:00,2017.0,10.0,16.0,289.0,1.0,15.0
4,Organic Search,20171016,2314544520795440038,2314544520795440038_1508217442,Not Socially Engaged,1508217442,1,1508217442,not available in demo dataset,Safari,...,"(37.3382082, -121.8863286)","(PDT, -7.0)",2017-10-17 05:17:22+00:00,2017-10-16 22:17:22-07:00,2017.0,10.0,16.0,289.0,1.0,22.0
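
The row-by-row apply above works, but on 800k+ rows a vectorized version is usually faster. A minimal sketch of the same local-time features with pandas datetime accessors, assuming the offset has been pulled into its own numeric column (`gmt_offset_hours` is a hypothetical column name; the notebook stores the offset inside the `timezone` tuple):

```python
import numpy as np
import pandas as pd

# Two rows shaped like the notebook's data; the second has no known offset.
toy = pd.DataFrame({
    "visitStartTime": [1508151024, 1508175522],
    "gmt_offset_hours": [8.0, np.nan],
})

utc = pd.to_datetime(toy["visitStartTime"], unit="s", utc=True)

# Shift the UTC clock by the offset, then drop the UTC label so the values read
# as local wall-clock time. Rows with a NaN offset become NaT.
shifted = utc + pd.to_timedelta(toy["gmt_offset_hours"], unit="h")
local = shifted.dt.tz_localize(None)

toy["hour_local"] = local.dt.hour
toy["weekday_local"] = local.dt.dayofweek + 1    # pandas counts Monday as 0
toy["yearday_local"] = local.dt.dayofyear
print(toy)
```

Unlike the function above, this sketch does not keep the offset in the formatted string; it only derives the numeric features.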


# Clean and Save Engineered Dataframe to Pickle for Easier Access Later

In [40]:
#there are many columns in our sample data that only have 'not available in demo dataset' values,
#so want to drop those columns as they are not useful
columns_to_drop = ['device_language', 'device_screenColors', 'device_mobileDeviceMarketingName', 'device_mobileInputSelector', 
'device_mobileDeviceBranding', 'device_mobileDeviceInfo', 'device_browserSize', 'device_browserVersion', 'device_mobileDeviceModel',
'device_flashVersion', 'device_screenResolution', 'device_operatingSystemVersion', 'geoNetwork_cityId', 'geoNetwork_latitude',
'geoNetwork_networkLocation', 'geoNetwork_longitude', 'trafficSource_adwordsClickInfo']

df.drop(columns_to_drop, axis=1, inplace=True)

In [41]:
# #dropping one row in the FULL TRAINING DATASET (v1) that was causing an error when trying to save data to a csv for Tableau processing
# df.drop(564203, inplace=True)

In [42]:
#verify the size of the df we are saving (the training dataset should be about 100k rows on partial
#and 900k rows on full, with 43 good columns)
print(df.shape)

(804684, 42)


In [43]:
# #save the dataframe to pickle and a csv - PARTIAL DATASET
# df.to_pickle('data/train_partial_data_split.pkl')
# df.to_csv('data/train_partial_data_split.csv')

#save the dataframe to pickle and a csv - FULL TEST DATASET (v1)
df.to_pickle('data/test_v1_full_data_split.pkl')
df.to_csv('data/test_v1_full_data_split.csv')
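
One thing to be aware of with the CSV copy: `to_csv` writes the row index by default, and it comes back as an `Unnamed: 0` column when the file is read again. A minimal sketch of the difference, using an in-memory buffer and a made-up one-row frame instead of the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"fullVisitorId": ["0643697640977915618"], "hour_local": [19.0]})

# Default to_csv writes the row index as an unnamed first column...
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index, dtype={"fullVisitorId": str})

# ...while index=False writes only the real columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, dtype={"fullVisitorId": str})

print(reloaded_default.columns.tolist())
print(reloaded_clean.columns.tolist())
```

The pickle copy does not have this issue because it stores the index and dtypes as they are.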

# Initial Exploration - Determine the Value Counts of each Column
To get a better grasp of the data that we have, look at value counts of each column.  Some will have lots of different values (like fullVisitorId, which will be unique to every individual).  Others will have binary info.

Cycle through and print the top 10 value_counts for each column.

In [44]:
#iterate through every single column in the dataset (that is not a dictionary, can't do value_counts on a dictionary)
for column in df.columns:

    #print out the name of the column you are processing in this for loop
    print('\n----------------------------------------\nCOLUMN: ', column)
    
    #if column has dictionaries, it can't do value_counts so flag this in output
    if isinstance(df[column][0], dict):
        print('column contains dictionaries')
        print('example dictionary from 0th row: \n', df[column][0])
    
    else:
        #print out the length of the value_counts for the column
        print('# of Unique Values in Column: ', len(df[column].value_counts()))

        #print out the top ten value_counts for the column
        #normalize=True provides relative frequency of the value (instead of just count), so 0.473 is 47.3%
        print('Top 10 Value Counts for Column: \n', df[column].value_counts(normalize=True).head(10))



----------------------------------------
COLUMN:  channelGrouping
# of Unique Values in Column:  8
Top 10 Value Counts for Column: 
 Organic Search    0.444152
Direct            0.161688
Social            0.160130
Referral          0.132312
Display           0.055949
Paid Search       0.025229
Affiliates        0.020520
(Other)           0.000021
Name: channelGrouping, dtype: float64

----------------------------------------
COLUMN:  date
# of Unique Values in Column:  272
Top 10 Value Counts for Column: 
 20171212    0.011475
20171213    0.011347
20171004    0.006365
20170920    0.006064
20170921    0.005859
20171005    0.005815
20180321    0.005510
20171017    0.005292
20180327    0.005253
20171127    0.005212
Name: date, dtype: float64

----------------------------------------
COLUMN:  fullVisitorId
# of Unique Values in Column:  617242
Top 10 Value Counts for Column: 
 7282998257608986241    0.000338
3884810646891698298    0.000299
0348420048060754000    0.000220
74776385937944847

# of Unique Values in Column:  229
Top 10 Value Counts for Column: 
 1     0.518601
2     0.123709
3     0.079575
4     0.047759
5     0.040697
6     0.028185
7     0.023138
8     0.017688
9     0.014962
10    0.012012
Name: totals_hits, dtype: float64

----------------------------------------
COLUMN:  totals_visits
# of Unique Values in Column:  1
Top 10 Value Counts for Column: 
 1    1.0
Name: totals_visits, dtype: float64

----------------------------------------
COLUMN:  totals_bounces
# of Unique Values in Column:  1
Top 10 Value Counts for Column: 
 1    1.0
Name: totals_bounces, dtype: float64

----------------------------------------
COLUMN:  trafficSource_keyword
# of Unique Values in Column:  2415
Top 10 Value Counts for Column: 
 (not provided)                     0.837397
(User vertical targeting)          0.062277
(automatic matching)               0.044371
6qEhsCssdK0z36ri                   0.015039
(Remarketing/Content targeting)    0.011942
1hZbAqLCbjwfgOH7            

 347.0    0.011768
346.0    0.010510
278.0    0.006051
264.0    0.005985
277.0    0.005841
263.0    0.005830
80.0     0.005370
312.0    0.005257
86.0     0.005256
290.0    0.005160
Name: yearday_local, dtype: float64

----------------------------------------
COLUMN:  weekday_local
# of Unique Values in Column:  7
Top 10 Value Counts for Column: 
 3.0    0.167805
2.0    0.163201
4.0    0.156990
1.0    0.151792
5.0    0.142406
7.0    0.109985
6.0    0.107820
Name: weekday_local, dtype: float64

----------------------------------------
COLUMN:  hour_local
# of Unique Values in Column:  24
Top 10 Value Counts for Column: 
 15.0    0.068287
14.0    0.066132
16.0    0.064744
11.0    0.063758
13.0    0.061299
12.0    0.060588
17.0    0.058794
10.0    0.057410
20.0    0.055084
19.0    0.054889
Name: hour_local, dtype: float64


#### Look more closely at fullVisitorId
##### (this is the id we need to assign final revenue to)

In [45]:
#for fullVisitorId check out actual value counts (instead of percentage) - this is the id we need to assign final revenue to
print('\n----------------------------------------\nCOLUMN: ', 'fullVisitorId', '\n')

print('# of Unique Visitors: ', len(df['fullVisitorId'].value_counts()), '\n')

print('# of Unique Visitors Tracked More than Once: ', len(df['fullVisitorId'].value_counts()[df['fullVisitorId'].value_counts()>1]), '\n')

print('Top 10 Visitors/Counts: \n', df['fullVisitorId'].value_counts().head(10), '\n')

print('Last 5 Visitors/Counts: \n', df['fullVisitorId'].value_counts().tail(5))


----------------------------------------
COLUMN:  fullVisitorId 

# of Unique Visitors:  617242 

# of Unique Visitors Tracked More than Once:  91418 

Top 10 Visitors/Counts: 
 7282998257608986241    272
3884810646891698298    241
0348420048060754000    177
7477638593794484792    173
460252456180441002     162
7122741899604173060    154
1322101426801959631    153
8839221334461540297    152
0603203541488487946    148
0827807801897731454    127
Name: fullVisitorId, dtype: int64 

Last 5 Visitors/Counts: 
 2699757157163905734    1
0734830396372291137    1
610789661660376250     1
1741042414750140468    1
3843809844673445121    1
Name: fullVisitorId, dtype: int64
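
Because the final prediction is per visitor and many visitors have more than one session, the session rows will eventually need to be rolled up to one row per fullVisitorId. A minimal sketch of that roll-up on toy data, assuming the hit counts have already been converted to numbers (for example with `pd.to_numeric`, since the JSON split leaves them as strings); the aggregate column names are made up for illustration:

```python
import pandas as pd

# Toy session log: visitor "A" has three sessions, "B" has one.
sessions = pd.DataFrame({
    "fullVisitorId": ["A", "B", "A", "A"],
    "totals_hits": [4, 5, 7, 2],
    "visitNumber": [1, 1, 2, 3],
})

# One row per visitor with a few summary features.
per_visitor = sessions.groupby("fullVisitorId").agg(
    session_count=("visitNumber", "count"),
    total_hits=("totals_hits", "sum"),
    max_visit_number=("visitNumber", "max"),
)
print(per_visitor)
```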


In [49]:
#df.totals_transactionRevenue.value_counts().head(10)

In [None]:
16990000