# Data Engineering & <br> Datetime Feature Engineering <br> of the Google Store Analytics Dataset

## Using <u> Full Kaggle Train Dataset </u> - Data Engineering/Feature Engineering - v1
This dataset is provided by the Kaggle competition.  
https://www.kaggle.com/c/ga-customer-revenue-prediction

Many of the columns hold JSON-formatted data, which needs to be split out into separate columns.

We also want to add further features based on the date and time of day.

In [1]:
import pandas as pd
import numpy as np

from datetime import datetime, timezone, timedelta
# import holidays
import pytz

import requests
import os
gmaps_api_key = os.environ.get('googlemaps_api_key')
timezonedb_api_key = os.environ.get('timezonedb_api_key')

from time import sleep

# Import the Full Kaggle Train Dataset


In [2]:
df = pd.read_csv('data/train.csv', low_memory=False, dtype={'fullVisitorId':str}) #full training dataset

# df = pd.read_pickle('data/train_sample.pkl') #top 10,000 rows of the training dataset
print(df.shape)
print(df.columns)
df.head(3)

(903653, 12)
Index(['channelGrouping', 'date', 'device', 'fullVisitorId', 'geoNetwork',
       'sessionId', 'socialEngagementType', 'totals', 'trafficSource',
       'visitId', 'visitNumber', 'visitStartTime'],
      dtype='object')


Unnamed: 0,channelGrouping,date,device,fullVisitorId,geoNetwork,sessionId,socialEngagementType,totals,trafficSource,visitId,visitNumber,visitStartTime
0,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",1131660440785968503,"{""continent"": ""Asia"", ""subContinent"": ""Western...",1131660440785968503_1472830385,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472830385,1,1472830385
1,Organic Search,20160902,"{""browser"": ""Firefox"", ""browserVersion"": ""not ...",377306020877927890,"{""continent"": ""Oceania"", ""subContinent"": ""Aust...",377306020877927890_1472880147,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472880147,1,1472880147
2,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",3895546263509774583,"{""continent"": ""Europe"", ""subContinent"": ""South...",3895546263509774583_1472865386,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472865386,1,1472865386


# Data Engineering - Split out All Columns that Have JSON/Dictionary Info into Individual Columns
Some of the columns in the dataset hold JSON/dictionary data. To graph the data and break out the variables, we want to split each of these into separate columns, where each dictionary key becomes its own column.

The columns that need to be split out are:
['device', 'geoNetwork', 'totals', 'trafficSource']
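
As an alternative to evaluating the strings as Python code, the same split can be sketched with the standard `json` module, which reads JSON's lowercase `true`/`false` directly. This is a minimal sketch and not the notebook's method: the helper name `flatten_json_column` and the two sample rows are made up for illustration, and booleans come back as Python `True`/`False` rather than the text `'true'`/`'false'`.

```python
import json

def flatten_json_column(json_strings, prefix):
    """Parse JSON strings and return one flat dict per row, keyed as prefix_key.

    Hypothetical helper for illustration; keys missing from a row are filled with None.
    """
    parsed_rows = [json.loads(text) for text in json_strings]
    # collect every key that appears in any row, in first-seen order
    all_keys = []
    for row in parsed_rows:
        for key in row:
            if key not in all_keys:
                all_keys.append(key)
    return [
        {f"{prefix}_{key}": row.get(key) for key in all_keys}
        for row in parsed_rows
    ]

# made-up sample rows shaped like the 'device' column
sample_device = [
    '{"browser": "Chrome", "isMobile": false}',
    '{"browser": "Safari", "isMobile": true, "deviceCategory": "mobile"}',
]
flat_rows = flatten_json_column(sample_device, "device")
print(flat_rows[0])
```

The resulting list of flat dicts can be passed to `pd.DataFrame(...)` and joined back onto `df` by index.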

In [4]:
import ast

#print out shape of dataframe and names of all columns of the dataframe to see what columns are getting added as we iterate through
print('original df shape: ', df.shape)
print('original df columns: ', df.columns)

#iterate through each of the column names that have to be split out into individual columns
for column in ['device', 'geoNetwork', 'totals', 'trafficSource']:

    #print out the name of the column you are processing in this for loop
    print('\n----------------------------------------\nsplitting out column: ', column)

    
    #some of the dictionaries have JSON values of false/true, which python can't read because it expects
    #False/True - instead of creating these boolean names, just convert them to the text strings
    #'false'/'true' before evaluating to a dictionary
    #(note: this plain text replace also touches any other lowercase "false"/"true" inside the string)
    df[column] = df[column].map(lambda x: x.replace("false", "'false'"))
    df[column] = df[column].map(lambda x: x.replace("true", "'true'"))
  

    #the JSON/dictionary columns are imported as string types - need to convert to python dictionaries
    #ast.literal_eval only accepts literals, so unlike eval it cannot run arbitrary code found in the data
    df[column] = df[column].map(ast.literal_eval)

    
    #figure out all the possible unique keys for the dictionaries in this column - need this to make sure we
    #iterate through all of the dictionary data and store into different columns
    unique_column_keys = set()
    #iterate through each row of the column dictionaries
    for row_dictionary in list(df[column]):
        #iterate through each key of the row's dictionary and add to the unique column keys set (use add instead of append for sets)
        for row_key in list(row_dictionary.keys()):
            unique_column_keys.add(row_key)
    print('max length of column dictionaries is: ', len(unique_column_keys))
    print('unique column keys are: ', unique_column_keys)

    
    #use the set of unique_column_keys to add a new column for each of these keys
    for column_key in unique_column_keys:
        #create new column names of format:  originalcolumnname_columnkey, eg 'device_isMobile'
        df[f"{column}_{column_key}"] = df[column].map(lambda x: x.get(column_key, np.nan))
    
    print('updated df shape: ', df.shape)
    print('updated df columns: ', df.columns, '\n')

#drop original columns that had dictionary/jsons so don't keep repetitive data (cuts file size down by almost 40%)
print('\n----------------------------------------\n')
df.drop(['device', 'geoNetwork', 'totals', 'trafficSource'], axis=1, inplace=True)
print('final df shape after dropping original columns: ', df.shape)
    
#show the final df head
df.head(3)

original df shape:  (903653, 12)
original df columns:  Index(['channelGrouping', 'date', 'device', 'fullVisitorId', 'geoNetwork',
       'sessionId', 'socialEngagementType', 'totals', 'trafficSource',
       'visitId', 'visitNumber', 'visitStartTime'],
      dtype='object')

----------------------------------------
splitting out column:  device
max length of column dictionaries is:  16
unique column keys are:  {'language', 'deviceCategory', 'mobileDeviceMarketingName', 'browserSize', 'screenResolution', 'browser', 'mobileDeviceModel', 'mobileDeviceBranding', 'screenColors', 'isMobile', 'mobileDeviceInfo', 'flashVersion', 'browserVersion', 'operatingSystem', 'mobileInputSelector', 'operatingSystemVersion'}
updated df shape:  (903653, 28)
updated df columns:  Index(['channelGrouping', 'date', 'device', 'fullVisitorId', 'geoNetwork',
       'sessionId', 'socialEngagementType', 'totals', 'trafficSource',
       'visitId', 'visitNumber', 'visitStartTime', 'device_language',
       'device_d

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_language,device_deviceCategory,...,totals_transactionRevenue,trafficSource_adwordsClickInfo,trafficSource_isTrueDirect,trafficSource_keyword,trafficSource_source,trafficSource_adContent,trafficSource_medium,trafficSource_referralPath,trafficSource_campaign,trafficSource_campaignCode
0,Organic Search,20160902,1131660440785968503,1131660440785968503_1472830385,Not Socially Engaged,1472830385,1,1472830385,not available in demo dataset,desktop,...,,{'criteriaParameters': 'not available in demo ...,,(not provided),google,,organic,,(not set),
1,Organic Search,20160902,377306020877927890,377306020877927890_1472880147,Not Socially Engaged,1472880147,1,1472880147,not available in demo dataset,desktop,...,,{'criteriaParameters': 'not available in demo ...,,(not provided),google,,organic,,(not set),
2,Organic Search,20160902,3895546263509774583,3895546263509774583_1472865386,Not Socially Engaged,1472865386,1,1472865386,not available in demo dataset,desktop,...,,{'criteriaParameters': 'not available in demo ...,,(not provided),google,,organic,,(not set),


# Datetime Feature Engineering
### (Add in Additional Date/Time Features into our Data to Help with Analysis - day of week, holiday, local timezone, etc.)
1. Use visitStartTime, which is a POSIX timestamp, to get the UTC date and hour together in one datetime object (rather than converting just the date column, which has no hour info).  A POSIX timestamp converts directly to a Python datetime object: for example, a timestamp of 1472830385 becomes datetime.datetime(2016, 9, 2, 15, 33, 5) in UTC, which is in the order year, month, day, hour, minute, second.
2. Then format the UTC datetime object as a string, datetime_iso_utc (which can be used for time series analysis based on purchase time).  The format is 'YYYY-MM-DD HH:MM:SS+00:00': ISO format with the middle 'T' replaced by a space and the UTC offset kept at the end.
    - datetime_iso_utc
3. Convert to a local-timezone datetime object (this requires an in-depth sub-process to identify the timezone of each location).
4. Using the local timezone datetime object, extract multiple new datetime features:
    - datetime_iso_local
    - year_local
    - month_local
    - day_local
    - yearday_local (overall day of the year, 1 to 365, or 366 in a leap year such as 2016; captures the month and day values in one number)
    - weekday_local (values of 1 to 7 - 1 is Monday)
    - hour_local
5. Try to add a column to determine whether the date is a holiday or not.
    - holiday_local
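
Steps 1 to 4 above can be traced by hand for a single row. The sketch below uses only the standard library and the timestamp of the first row; the +3 hour offset is an example value standing in for whatever the timezone lookup returns, and the variable names are illustrative.

```python
from datetime import datetime, timezone, timedelta

timestamp = 1472830385   # visitStartTime of the first row
offset_hours = 3.0       # example offset for a UTC+3 location (assumed for illustration)

# steps 1-2: POSIX timestamp -> timezone-aware UTC datetime -> string
dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
datetime_iso_utc = dt_utc.isoformat(sep=" ")

# step 3: shift the same instant into the local fixed offset
dt_local = dt_utc.astimezone(timezone(timedelta(hours=offset_hours)))

# step 4: pull the individual features off the local datetime
features = {
    "datetime_iso_local": dt_local.isoformat(sep=" "),
    "year_local": dt_local.year,
    "month_local": dt_local.month,
    "day_local": dt_local.day,
    "yearday_local": dt_local.timetuple().tm_yday,
    "weekday_local": dt_local.isoweekday(),   # 1 is Monday, 7 is Sunday
    "hour_local": dt_local.hour,
}

print(datetime_iso_utc)                 # 2016-09-02 15:33:05+00:00
print(features["datetime_iso_local"])   # 2016-09-02 18:33:05+03:00
```

Passing `sep=" "` to `isoformat` produces the space-separated string directly, without a separate replace of the `T`.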


### Figure Out Local Timezones for Each Location 
- Combine City/Country info given to us
- Get a rough lat/lng for that location by using Gmaps Geocoding API
- Use timezonedb API to send each location lat/lng to find out the local timezone
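
The lookup flow above makes one request per distinct location rather than one per row. A minimal sketch of that idea, with the response handling pulled into a small function that can be tested without network access: the function names, the `fetch` callable, and the hand-written payload (shaped like a Geocoding JSON response, using coordinates that appear in the output further down) are all assumptions for illustration.

```python
import math

def parse_geocode_response(payload):
    """Return (lat, lng) from a Geocoding-style payload, or (nan, nan) for any non-OK status."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return (math.nan, math.nan)
    location = payload["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])

def geocode_unique(locations, fetch):
    """Call fetch once per distinct location and return {location: (lat, lng)}.

    fetch is any callable that takes a location string and returns a parsed JSON payload.
    """
    return {loc: parse_geocode_response(fetch(loc)) for loc in sorted(set(locations))}

# hand-written stand-in for the API, so the sketch runs offline
fake_payloads = {
    "Spain": {
        "status": "OK",
        "results": [{"geometry": {"location": {"lat": 40.4167754, "lng": -3.7037902}}}],
    },
}
calls = []

def fake_fetch(location):
    calls.append(location)
    return fake_payloads.get(location, {"status": "ZERO_RESULTS", "results": []})

lookup = geocode_unique(["Spain", "Spain", "Nowhere"], fake_fetch)
print(lookup["Spain"], len(calls))
```

In real use, `fetch` would wrap the `requests.get(...).json()` call; keeping it injectable makes the parsing logic easy to check against saved responses.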

In [5]:
###CREATE NEW CITY_COUNTRY COLUMN THAT COMBINES CITY/COUNTRY TOGETHER###

#the geographic information provided almost always has a country listed (for a very small percent of rows the country is "(not set)")
#note - there are some strange, incorrect combinations, like New York Canada or Mountain View Germany, so a small percentage of the location data will be rough
#the geographic information only sometimes has a city (more common for larger countries, but not consistent)

df['city_country'] = df.apply(lambda row: 
                              row.geoNetwork_city + " " + row.geoNetwork_country 
                              if row.geoNetwork_city not in ["not available in demo dataset", "(not set)"] 
                              else row.geoNetwork_country,
                              axis='columns')

In [24]:
###GET LAT, LNG OF ALL CITY_COUNTRY COMBINATIONS (best approximation for location with sample data)###

#create a unique set of all city_country combinations to cycle through
city_country_set = set(df.city_country)
#remove the '(not set)' label from the set for the few cases where '(not set)' was in country
#(discard does nothing if the label is absent, whereas remove would raise a KeyError)
city_country_set.discard('(not set)')

#get Lat/Lng for our locations by using Google Geocoding and store info in a dictionary
city_country_lat_lng = {}

for city_country in city_country_set:
    #Google Geocoding takes an address as input and returns a lat, lng
    #https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=YOUR_API_KEY
    #every data point at least has a country

    base_geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json?'
    geocode_parameters = {'address': city_country,
                          'key': gmaps_api_key}

    json_geocode_response = requests.get(base_geocode_url, params=geocode_parameters).json()
    #only read the location when the API reports success - any other status (ZERO_RESULTS,
    #OVER_QUERY_LIMIT, REQUEST_DENIED, etc.) comes back without a result to index into
    if json_geocode_response.get('status') == 'OK':
        location = json_geocode_response['results'][0]['geometry']['location']
        lat = location.get('lat', np.nan)
        lng = location.get('lng', np.nan)
    else:
        lat = np.nan
        lng = np.nan
    
    city_country_lat_lng[city_country] = (lat, lng)


#add a column to the dataframe that has each city_country approximate lat/lng
df['lat_lng'] = df['city_country'].map(lambda x: city_country_lat_lng.get(x, (np.nan, np.nan)))

In [25]:
###GET APPROXIMATE TIMEZONE FOR EACH CITY_COUNTRY COMBINATIONS (using timezonedb API)###

#store timezone for all of our city_country locations in a dictionary
city_country_timezone = {}

#city_country_lat_lng is a dictionary with city_country: (lat, lng)
#cycle through each city_country lat,lng and call the timezonedb API
for city_country, lat_lng in city_country_lat_lng.items():
    #get local timezone info for each city/country combination in our dataset by looking up their lat,lng
    #http://api.timezonedb.com/v2.1/get-time-zone (has a 1 second rate limit)
    base_timezonedb_url = 'http://api.timezonedb.com/v2.1/get-time-zone?'
    timezonedb_parameters = {'key': timezonedb_api_key,
                             'format': 'json',
                             'by': 'position',
                             'lat': lat_lng[0],
                             'lng': lat_lng[1]}

    json_timezone_response = requests.get(base_timezonedb_url, timezonedb_parameters).json()
    
    if json_timezone_response.get('status') != 'FAILED':
        #get the gmt timezone offset and the abbreviation for the timezone - gmtOffset is the time offset in seconds
        #so will need to divide it by 3600 to get the usable offset in hours
        #note: this stores one fixed offset per location, so daylight-saving changes over the year are not captured
        timezone_abbreviation = json_timezone_response.get('abbreviation', np.nan)
        timezone_gmt_offset = json_timezone_response.get('gmtOffset', np.nan)/3600
    else:
        timezone_abbreviation = np.nan
        timezone_gmt_offset = np.nan
    
    city_country_timezone[city_country] = (timezone_abbreviation, timezone_gmt_offset)
    
    #need to sleep for 1 second because api says it has a 1 request/second rate limit
    sleep(1.1)

#add a column to the dataframe that has each city_country approximate timezone
df['timezone'] = df['city_country'].map(lambda x: city_country_timezone.get(x, (np.nan, np.nan)))

### Extract the Different Datetime Features

- datetime_iso_utc   ('YYYY-MM-DD HH:MM:SS+00:00')
- datetime_iso_local  ('YYYY-MM-DD HH:MM:SS' followed by the local UTC offset, e.g. '+03:00')
- year_local
- month_local
- day_local
- yearday_local (overall day of the year, 1 to 365, or 366 in a leap year such as 2016; captures the month and day values in one number)
- weekday_local  (values of 1 to 7 where 1 is Monday)
- hour_local
- holiday_local
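
The holiday_local feature is listed above, but none of the cells shown here compute it (the `holidays` import is commented out). One possible sketch, assuming a small hand-maintained table of holiday dates per country: the table contents, the function name `is_holiday_local`, and the country key are illustrative assumptions, not a complete calendar.

```python
from datetime import date, datetime, timezone, timedelta

# illustrative, deliberately incomplete table: country -> set of holiday dates
HOLIDAYS_BY_COUNTRY = {
    "United States": {date(2016, 11, 24), date(2016, 12, 25), date(2017, 1, 1)},
}

def is_holiday_local(timestamp, gmt_offset_hours, country):
    """Return True/False for countries in the table, None when the country is not covered."""
    known_dates = HOLIDAYS_BY_COUNTRY.get(country)
    if known_dates is None:
        return None
    # use the local calendar date, not the UTC date, for the lookup
    local_tz = timezone(timedelta(hours=gmt_offset_hours))
    local_date = datetime.fromtimestamp(timestamp, tz=local_tz).date()
    return local_date in known_dates

# 2016-11-25 02:00 UTC is still the evening of 2016-11-24 at UTC-6
late_evening_utc = int(datetime(2016, 11, 25, 2, 0, tzinfo=timezone.utc).timestamp())
print(is_holiday_local(late_evening_utc, -6.0, "United States"))
```

Looking up the local date rather than the UTC date matters for visits near midnight, as the example timestamp shows.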

In [26]:
def extract_datetime_info(timestamp, gmt_offset_hours):
    #first convert the timestamp to a timezone-aware datetime object set to utc
    datetime_object_utc = datetime.fromtimestamp(timestamp, tz=pytz.UTC)
    #convert timestamp to isoformat for timeseries analysis in Tableau, etc. (and ease of viewing)
    #initially gives string format of '2016-09-02T15:33:05+00:00', replace the middle T with a space so it looks like '2016-09-02 15:33:05+00:00'
    datetime_iso_utc = datetime_object_utc.isoformat().replace("T", " ")  #'YYYY-MM-DD HH:MM:SS+00:00'
    
    #the raw timestamps aren't adjusted for timezone; for better analysis we want the datetime in the
    #approximate local timezone based on country/city, since we assume consumer behavior is more
    #similar between consumers at the same local time than at the same UTC time
    #some of the data doesn't have a timezone, so only the datetime_iso_utc is returned when gmt_offset_hours is nan
    if not np.isnan(gmt_offset_hours):
        #shift the datetime into the local timezone using the gmt offset
        datetime_object_local = datetime_object_utc.astimezone(timezone(timedelta(hours=gmt_offset_hours)))

        #same formatting as the utc string, but carrying the local offset, e.g. '2016-09-02 18:33:05+03:00'
        datetime_iso_local = datetime_object_local.isoformat().replace("T", " ")

        #extract year, month, day, day of the year, day of the week (using isoweekday), and hour into different variables
        year_local = datetime_object_local.year #YYYY (as int)
        month_local = datetime_object_local.month #MM (as int)
        day_local = datetime_object_local.day #DD (as int) - this is the day of the month, max 31
        yearday_local = datetime_object_local.timetuple().tm_yday #DDD (as int) - day of the year, 1 to 365 (366 in a leap year), which gives the month-day context in one value
        weekday_local = datetime_object_local.isoweekday() #integer values of 1 to 7 where 1 is Monday
        hour_local = datetime_object_local.hour #HH (as int)

        return datetime_iso_utc, datetime_iso_local, year_local, month_local, day_local, yearday_local, weekday_local, hour_local
    
    else:
        return datetime_iso_utc, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan

In [27]:
#use the extract_datetime_info function to add columns of all of the additional datetime features to the df
df[['datetime_iso_utc', 'datetime_iso_local',
    'year_local', 'month_local','day_local',
    'yearday_local', 'weekday_local', 'hour_local']] = df[['visitStartTime', 'timezone']].apply(lambda row:
                                                        extract_datetime_info(row.visitStartTime, row.timezone[1]),
                                                        axis='columns',
                                                        result_type='expand')

In [28]:
df.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,sessionId,socialEngagementType,visitId,visitNumber,visitStartTime,device_language,device_deviceCategory,...,lat_lng,timezone,datetime_iso_utc,datetime_iso_local,year_local,month_local,day_local,yearday_local,weekday_local,hour_local
0,Organic Search,20160902,1131660440785968503,1131660440785968503_1472830385,Not Socially Engaged,1472830385,1,1472830385,not available in demo dataset,desktop,...,"(38.423734, 27.142826)","(+03, 3.0)",2016-09-02 15:33:05+00:00,2016-09-02 18:33:05+03:00,2016.0,9.0,2.0,246.0,5.0,18.0
1,Organic Search,20160902,377306020877927890,377306020877927890_1472880147,Not Socially Engaged,1472880147,1,1472880147,not available in demo dataset,desktop,...,"(-25.274398, 133.775136)","(ACST, 9.5)",2016-09-03 05:22:27+00:00,2016-09-03 14:52:27+09:30,2016.0,9.0,3.0,247.0,6.0,14.0
2,Organic Search,20160902,3895546263509774583,3895546263509774583_1472865386,Not Socially Engaged,1472865386,1,1472865386,not available in demo dataset,desktop,...,"(40.4167754, -3.7037902)","(CEST, 2.0)",2016-09-03 01:16:26+00:00,2016-09-03 03:16:26+02:00,2016.0,9.0,3.0,247.0,6.0,3.0
3,Organic Search,20160902,4763447161404445595,4763447161404445595_1472881213,Not Socially Engaged,1472881213,1,1472881213,not available in demo dataset,desktop,...,"(-0.789275, 113.921327)","(WIB, 7.0)",2016-09-03 05:40:13+00:00,2016-09-03 12:40:13+07:00,2016.0,9.0,3.0,247.0,6.0,12.0
4,Organic Search,20160902,27294437909732085,27294437909732085_1472822600,Not Socially Engaged,1472822600,2,1472822600,not available in demo dataset,mobile,...,"(55.378051, -3.435973)","(BST, 1.0)",2016-09-02 13:23:20+00:00,2016-09-02 14:23:20+01:00,2016.0,9.0,2.0,246.0,5.0,14.0


# Clean and Save Engineered Dataframe to Pickle for Easier Access Later

In [29]:
#there are many columns in our sample data that only have 'not available in demo dataset' values,
#so want to drop those columns as they are not useful
columns_to_drop = ['device_language', 'device_screenColors', 'device_mobileDeviceMarketingName', 'device_mobileInputSelector', 
'device_mobileDeviceBranding', 'device_mobileDeviceInfo', 'device_browserSize', 'device_browserVersion', 'device_mobileDeviceModel',
'device_flashVersion', 'device_screenResolution', 'device_operatingSystemVersion', 'geoNetwork_cityId', 'geoNetwork_latitude',
'geoNetwork_networkLocation', 'geoNetwork_longitude', 'trafficSource_adwordsClickInfo']

df.drop(columns_to_drop, axis=1, inplace=True)

In [30]:
#dropping one row in the FULL TRAINING DATASET (v1) that was causing an error when trying to save data to a csv for Tableau processing
df.drop(564203, inplace=True)

In [31]:
#verify the size of the df we are saving (should be about 100k rows for the partial dataset
#and about 900k rows for the full training dataset, with 43 good columns)
print(df.shape)

(903652, 44)


In [32]:
# #save the dataframe to pickle and a csv - PARTIAL DATASET
# df.to_pickle('data/train_partial_data_split.pkl')
# df.to_csv('data/train_partial_data_split.csv')

#save the dataframe to pickle and a csv - FULL TRAINING DATASET (v1)
df.to_pickle('data/train_v1_full_data_split.pkl')
df.to_csv('data/train_v1_full_data_split.csv')

# Initial Exploration - Determine the Value Counts of each Column
To get a better grasp of the data that we have, look at value counts of each column.  Some will have lots of different values (like fullVisitorId, which will be unique to every individual).  Others will have binary info.

Cycle through and print the top 10 value_counts for each column.

In [33]:
#iterate through every single column in the dataset (that is not a dictionary, can't do value_counts on a dictionary)
for column in df.columns:

    #print out the name of the column you are processing in this for loop
    print('\n----------------------------------------\nCOLUMN: ', column)
    
    #if column has dictionaries, it can't do value_counts so flag this in output
    if type(df[column][0]) == dict:
        print('column contains dictionaries')
        print('example dictionary from 0th row: \n', df[column][0])
    
    else:
        #print out the length of the value_counts for the column
        print('# of Unique Values in Column: ', len(df[column].value_counts()))

        #print out the top ten value_counts for the column
        #normalize=True provides relative frequency of the value (instead of just count), so 0.473 is 47.3%
        print('Top 10 Value Counts for Column: \n', df[column].value_counts(normalize=True).head(10))



----------------------------------------
COLUMN:  channelGrouping
# of Unique Values in Column:  8
Top 10 Value Counts for Column: 
 Organic Search    0.422242
Social            0.250226
Direct            0.158276
Referral          0.116016
Paid Search       0.028026
Affiliates        0.018152
Display           0.006930
(Other)           0.000133
Name: channelGrouping, dtype: float64

----------------------------------------
COLUMN:  date
# of Unique Values in Column:  366
Top 10 Value Counts for Column: 
 20161128    0.005320
20161115    0.005185
20161114    0.004942
20161130    0.004908
20161026    0.004841
20161129    0.004799
20161116    0.004796
20161004    0.004783
20161205    0.004720
20170426    0.004674
Name: date, dtype: float64

----------------------------------------
COLUMN:  fullVisitorId
# of Unique Values in Column:  714166
Top 10 Value Counts for Column: 
 1957458976293878100    0.000308
0824839726118485274    0.000282
3608475193341679870    0.000222
18567491479157725

# of Unique Values in Column:  274
Top 10 Value Counts for Column: 
 1     0.494386
2     0.152661
3     0.077908
4     0.046969
5     0.034238
6     0.026468
7     0.021599
8     0.017135
9     0.014341
10    0.011774
Name: totals_hits, dtype: float64

----------------------------------------
COLUMN:  totals_newVisits
# of Unique Values in Column:  1
Top 10 Value Counts for Column: 
 1    1.0
Name: totals_newVisits, dtype: float64

----------------------------------------
COLUMN:  totals_pageviews
# of Unique Values in Column:  213
Top 10 Value Counts for Column: 
 1     0.500825
2     0.159116
3     0.081716
4     0.050016
5     0.036977
6     0.027323
7     0.021555
8     0.016902
9     0.013928
10    0.011183
Name: totals_pageviews, dtype: float64

----------------------------------------
COLUMN:  totals_visits
# of Unique Values in Column:  1
Top 10 Value Counts for Column: 
 1    1.0
Name: totals_visits, dtype: float64

----------------------------------------
COLUMN:  totals_tra

Name: yearday_local, dtype: float64

----------------------------------------
COLUMN:  weekday_local
# of Unique Values in Column:  7
Top 10 Value Counts for Column: 
 3.0    0.163354
2.0    0.161772
4.0    0.157551
1.0    0.154041
5.0    0.145967
6.0    0.109172
7.0    0.108142
Name: weekday_local, dtype: float64

----------------------------------------
COLUMN:  hour_local
# of Unique Values in Column:  24
Top 10 Value Counts for Column: 
 15.0    0.068968
14.0    0.067092
16.0    0.065810
11.0    0.063796
13.0    0.063113
12.0    0.061855
17.0    0.059081
10.0    0.057753
18.0    0.053047
20.0    0.051983
Name: hour_local, dtype: float64


#### Look more closely at fullVisitorID
##### (this is the id we need to assign final revenue to)

In [34]:
#for fullVisitorID check out actual value counts (instead of percentage) - this is the id we need to assign final revenue to
print('\n----------------------------------------\nCOLUMN: ', 'fullVisitorId', '\n')

print('# of Unique Visitors: ', len(df['fullVisitorId'].value_counts()), '\n')

print('# of Unique Visitors Tracked More than Once: ', len(df['fullVisitorId'].value_counts()[df['fullVisitorId'].value_counts()>1]), '\n')

print('Top 10 Visitors/Counts: \n', df['fullVisitorId'].value_counts().head(10), '\n')

print('Last 5 Visitors/Counts: \n', df['fullVisitorId'].value_counts().tail(5))


----------------------------------------
COLUMN:  fullVisitorId 

# of Unique Visitors:  714166 

# of Unique Visitors Tracked More than Once:  93492 

Top 10 Visitors/Counts: 
 1957458976293878100    278
0824839726118485274    255
3608475193341679870    201
1856749147915772585    199
3269834865385146569    155
0720311197761340948    153
7634897085866546110    148
4038076683036146727    138
0232377434237234751    135
3694234028523165868    129
Name: fullVisitorId, dtype: int64 

Last 5 Visitors/Counts: 
 7799550081967585551    1
3024357404412117556    1
6447802399108460222    1
9049947554290937684    1
3248437177151980488    1
Name: fullVisitorId, dtype: int64


In [36]:
df.totals_transactionRevenue.value_counts().head(10)

16990000    256
18990000    189
33590000    187
44790000    170
13590000    135
55990000    122
19990000    116
15990000     98
15190000     93
19190000     92
Name: totals_transactionRevenue, dtype: int64
