# Using geoNetwork information to infer local times for visits 

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Standard_World_Time_Zones.png/1024px-Standard_World_Time_Zones.png)

## Introduction

Perhaps people visiting the GStore at lunchtime from work have a greater propensity to make a purchase than a visitor who is at home in the evening. Or perhaps not. But testing any hypothesis that the time of day might influence propensity to visit or purchase requires that we know what the time was for the visitor when they made their visit.

In the competition data the `visitStartTime` parameter records the time at which each visit begins as POSIX, which as I understand it is UTC, for all visits. This won’t necessarily reflect the time on the visitor’s clock. For example, if I am here in Japan and I visit the GStore at 1 PM (JST) this would be recorded as 4 AM (UTC) in the `visitStartTime` parameter.  
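To make the Japan example concrete, here is a minimal sketch of the conversion. The helper name `utc_and_local` and the timestamp are my own illustration (the timestamp is not taken from the competition data); it assumes `visitStartTime` holds POSIX seconds.

```python
from datetime import datetime, timezone

import pytz


def utc_and_local(posix_seconds, zone_name):
    """Return (UTC datetime, local datetime) for a POSIX timestamp in seconds."""
    utc_dt = datetime.fromtimestamp(posix_seconds, tz=timezone.utc)
    local_dt = utc_dt.astimezone(pytz.timezone(zone_name))
    return utc_dt, local_dt


# 1500004800 is an illustrative timestamp: 04:00 UTC on 14 July 2017
utc_dt, jst_dt = utc_and_local(1500004800, 'Asia/Tokyo')
print(utc_dt.strftime('%H:%M'), 'UTC is', jst_dt.strftime('%H:%M'), 'in Tokyo')
```

The same instant reads as 4 AM on the UTC clock and 1 PM on the visitor's clock in Japan.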

Fortunately, the data also includes some information about the visitor’s location. This might allow us to work out what the visitor’s wall clock time was at the time of their visit. I've had issues in the past with the reliability of location information that is derived from IP addresses (as I believe these are). Will the location data available be of sufficient quality to allow us to complement `visitStartTime` with local time information?

The objectives for this notebook are:

 1. To make an assessment of the quality of the location data 
 2. Explore how to use the location information to calculate the local time for a visit
 3. Make a visual comparison of the effect of looking at visits by local time, rather than UTC, for all visitors.
 
I'll be relying on the `pytz` and `geopy` packages (and using Google's location API.) I'm also using `seaborn` to draw some heatmaps.

(Because of the use of Google’s API, I’ll include the code in comments and supply the data I fetched from the API in an additional data file).

In [None]:
# imports and declarations
import pickle
import json
import pandas as pd
import numpy as np
from IPython.core.display import display, HTML
import geopy
import pytz
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('ggplot')
g = geopy.GoogleV3("...")   #os.getenv('GOOGLE_API_KEY'))

## Expand the geoNetwork data

First task is to load in the data. The `geoNetwork` field provides information about the location of the visitor for each visit. It is stored as JSON. Let's flatten that out into the columns of a DataFrame.

We'll also drop the columns that contain no useful information (i.e. 'not available in demo dataset') and save the resulting DataFrame back to disk.


In [None]:
# read both train and test files in and concatenate
df1 = pd.read_csv('../input/ga-customer-revenue-prediction/train.csv',usecols=['visitId','fullVisitorId','visitStartTime','geoNetwork'],dtype=str)
df2 = pd.read_csv('../input/ga-customer-revenue-prediction/test.csv',usecols=['visitId','fullVisitorId','visitStartTime','geoNetwork'],dtype=str)
df1['dsrc']='train'; df2['dsrc']='test' # remember which was which!
df=pd.concat([df1,df2]); del df1,df2; df.reset_index(inplace=True,drop=True);
regular_columns=set(df.columns)-set(['geoNetwork'])

# 'expand' the json in geoNetwork into columns
expandedCols=pd.io.json.json_normalize(df['geoNetwork'].apply(json.loads))
expandedCols.columns=['geoNetwork'+'.'+x for x in expandedCols.columns]
        
# The raw data contains the placeholder text 'not available in demo dataset'.
# It takes up an unnecessary amount of space, so remove columns that contain nothing else
orig_cols = expandedCols.columns
expandedCols=expandedCols.applymap(lambda v: None if v=='not available in demo dataset' else v).dropna(axis=1,how='all')
dropped_cols = set(orig_cols) - set(expandedCols.columns)
print(f'Dropping {dropped_cols} because not available in demo dataset')
          
# save this result to disk

fdf=pd.concat([df[list(regular_columns)],expandedCols],axis=1)

with open('../input/exgeo_time_train_test.pickle', 'wb') as fh:
     pickle.dump(fdf, fh, protocol=pickle.HIGHEST_PROTOCOL)
        
del df,expandedCols,fdf

In [None]:
# load up the 'expanded' geoNetwork and visit data that was created in previous step
with open('../input/exgeo_time_train_test.pickle', 'rb') as fh:
     df=pickle.load(fh)

## Examine the contents of geoNetwork information

Do some counting of missing values and show the result.

In [None]:
summary_df={}
for c in ['geoNetwork.continent','geoNetwork.subContinent','geoNetwork.country','geoNetwork.region','geoNetwork.city','geoNetwork.metro','geoNetwork.networkDomain']:
    n_missing=df[df[c].isnull() | (df[c]=='(not set)')].shape[0]
    summary_df[c]={'n_missing':int(n_missing),'p_missing':n_missing/df[c].shape[0],'n_unique':df[c].nunique()}
    
display(HTML(pd.DataFrame(summary_df).T[['n_missing','p_missing','n_unique']].style.format({'n_missing':"{:,.0f}",'n_unique':"{:,.0f}",'p_missing':'{:.1%}'}).render()))

# A visit has no location information if every field is null or a placeholder.
# pd.isnull is used because comparing with np.nan via == is always False.
def no_info(v): return all(pd.isnull(e) or e in ('(not set)','unknown.unknown') for e in v)
df['no_info']=df[['geoNetwork.country','geoNetwork.city','geoNetwork.region']].apply(no_info,axis=1)

print(f"{df['no_info'].sum():,} visits with no location information")


In summary:

 * There are 2,406 visits (0.07%) without any location information. We would not be able to add any time zone information for these visits, and would have to default to UTC or remove them if that bothered us.
 
 * Nearly all visits have `country`, `continent` and `subContinent` information (these look to be derived from the same source, as they all appear together or not at all).
     * The continents are `Asia, Europe, Americas, Africa, Oceania, (not set)`.

* `region`, `city` and `metro` are much less reliably present: only 40% have `city` and/or `region` information and only 20% have `metro`.

* `metro` is [The Designated Market Area (DMA) from where traffic arrived.](https://developers.google.com/analytics/devguides/reporting/core/dimsmets#view=detail&group=geo_network&jump=ga_metro) [Metro areas are the same as DMAs (Designated Market Areas) created by Nielsen Media Research](https://www.simpleviewinc.com/blog/post/2017/31/City-v-Metro-in-Google-Analytics-What-s-the-Difference-/980/)
 
* `region` [In U.S., a region is a state, New York, for example.](https://developers.google.com/analytics/devguides/reporting/core/dimsmets#view=detail&group=geo_network&jump=ga_region) but also contains other regions like `England` or `Kanto_JP`.

* `networkDomain` is [The domain name of users ISP, derived from the domain name registered to the ISPs IP address.](https://developers.google.com/analytics/devguides/reporting/core/dimsmets#view=detail&group=geo_network&jump=ga_networkdomain) [Sometimes it might be a known company offices IP address range and resolve to their company domain](https://webmasters.stackexchange.com/questions/105762/google-analytics-what-is-not-set-under-network-domain). Unlikely to make use of networkDomain in deriving a time zone.




## Converting to local time using `pytz`

`pytz` can be used to localise a UTC time to the wallclock time for another time zone as follows:


In [None]:
udt = datetime.now(pytz.utc) # timezone-aware datetime for the current UTC time (datetime.now() alone returns the machine's local time)
ldt = udt.astimezone(pytz.timezone('Asia/Tokyo'))
print(f'E.g. {udt.strftime("%H:%M on %d %b %Y")} in UTC ---> {ldt.strftime("%H:%M on %d %b %Y")} in Tokyo')

The time zone above was specified using a string of the form `Area/Location`, where `Area` is the name of a continent, an ocean, or `Etc`, and `Location` is the name of a specific location within the area, usually a city or small island.

Interestingly:
> [Country names are not used in this scheme, primarily because they would not be robust, owing to frequent political and boundary changes. The names of large cities tend to be more permanent.](https://en.wikipedia.org/wiki/Tz_database)

So we have a few problems with constructing time zone names directly from the geographical information in the data:

 * While almost all the visits have a `continent` assigned (only 0.2% do not), we don't have the same set of continents _and oceans_ as in the `pytz` database.

* Only 40% of visits have any `city` information, and we can't be sure that the cities we have in the data match the cities chosen for time zone names ([ref](https://en.wikipedia.org/wiki/Tz_database)).
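The city mismatch can be checked roughly as follows. This is only a sketch: the helper `tz_names_for_city` is hypothetical, and matching on the final component of the name is my own simplification of how tz database names are formed.

```python
import pytz


def tz_names_for_city(city):
    """Return tz database names whose final component matches the city name."""
    target = city.strip().replace(' ', '_').lower()
    return [name for name in pytz.all_timezones
            if name.split('/')[-1].lower() == target]


print(tz_names_for_city('Tokyo'))          # a city used as a zone name
print(tz_names_for_city('Mountain View'))  # a city that is not a zone name
```

A city such as Mountain View appears in the visit data but is not one of the cities chosen to name a zone, so a direct lookup returns nothing.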

We're going to have to do something else.


## Using Google's APIs to get time zone

An alternative approach would be to use a geocoding service to transform the location data we do have in the data into the standard time zone names. A combination of Google's Geocoding and Timezone APIs can help here. 

My first approach was to make the most detailed location string from the data we have available for each visit (labelled `geoString`) and then pass this to the geocoding API. 

This gives back (hopefully) a location (lat/lon) that we can send to the Timezone API and get back the timezone information we need.
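In outline, the two-step lookup has the shape sketched below. The function `lookup_timezones` and the two stand-in callables are hypothetical and exist only so the sketch runs without an API key; they are not the Google API. Looking up each distinct string once keeps the number of API calls down.

```python
def lookup_timezones(geo_strings, geocode, timezone_at):
    """Map each distinct, non-empty location string to a time zone name (or None)."""
    results = {}
    for s in set(geo_strings):
        if not s:
            continue  # nothing to geocode for visits with no location
        point = geocode(s)  # assumed to return (lat, lon) or None
        results[s] = timezone_at(point) if point is not None else None
    return results


# Stand-ins for the real geocoding and time zone calls (illustrative values only)
fake_points = {'Tokyo, Kanto': (35.68, 139.69)}


def fake_geocode(s):
    return fake_points.get(s)


def fake_timezone_at(point):
    return 'Asia/Tokyo'


tz_by_string = lookup_timezones(
    ['Tokyo, Kanto', 'Tokyo, Kanto', '', 'Nowhere'], fake_geocode, fake_timezone_at)
print(tz_by_string)
```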


In [None]:
# try to make up the best string from city, region and country information.
def geo_string(r):
    # keep only fields that carry real information (pd.isnull catches None and NaN reliably)
    fr={k: v for k, v in r.items() if not (pd.isnull(v) or v in ['(not set)','nan'])}
    # if you have city or region, drop country.
    if fr.get('geoNetwork.city',None) or fr.get('geoNetwork.region',''):
        fr.pop('geoNetwork.country', None)
        
    return(', '.join(list(filter(lambda v: None if v=='' else v,[fr.get('geoNetwork.city',None), fr.get('geoNetwork.region',''), fr.get('geoNetwork.country','')]))))
    
df['geoString']=df[['geoNetwork.city','geoNetwork.country','geoNetwork.region','geoNetwork.metro']].apply(lambda r: geo_string(r), axis=1)


In [None]:
# fetch the timezone information based on the geoString we encoded.
### careful - this makes calls to Google APIs --> could mean real $$
## commented out here

# results = []
# errors=[]
                                                                  
# for c,code in enumerate(df['geoString'].unique()):
#     if c> -1:   # was used for manual hackery to start at offset
#         print(f'{c}. {code}') 
#         try:
#             gx = g.geocode(code)
#             if gx:
#                 result = {code:gx.raw}
#                 result[code]['timezone'] = g.timezone(gx.point)
#                 result[code]['src']=code
#                 results.append(result[code])
#             else:
#                 gx = g.geocode(code.split('/')[0])
#                 if gx:
#                     result = {code:gx.raw}
#                     result[code]['timezone'] = g.timezone(gx.point)
#                     result[code]['src']=code
#                     results.append(result[code])

#         except:
#             print(f'ERROR fetching {c}. {code}')
#             errors.append([code])

# print(f'There were {len(errors)} errors.')
# print(errors)

# with open('geocode_tz.pickle', 'wb') as handle:
#      pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)



## Checking the results of the geocoding

### Erroneous country information

First passes of the script produced a number of errors where a time zone could not be found for a location string. Looking at the output it seemed that the country information in the data was erroneous and was causing confusion. 
e.g.:

    Mountain View, California, Japan, ERROR
    Mountain View, California, China, ERROR
    Mexico City, Mexico City, Brazil, ERROR
    London, England, United States, ERROR
    ...
 
After trying a few hacks, I modified the creation of the `geoString` to leave out `country` information if we already had more detailed information (`city` or `region`) and re-ran the geocoding script.

After the re-run, roughly as many visits lacked a time zone as had no location information at all, with a handful of other cases on top:

| geoString (input to geocoding API) | missing timezone objects |
|----|----|
| Riyadh, Riyadh Province | 1328 |
| Guatemala City, Guatemala Department | 54 |
| Hung Yen Province | 25 |
| Managua, Managua Department | 15 |
| Tay Ninh Province | 13 |
| Kobe, Hyogo Prefecture | 6 |
| Micronesia | 1 |

Some of these look like they should have been picked up. Riyadh, for example, accounts for a thousand or so visits. Because it was annoying me, I hand-crafted some of the `geoString` values and tried them again. The code below merges the results together. Nothing like a bit of hackery...


In [None]:
# execute this to load pre-fetched results and apply timezone to each visit
with open('../input/ga-support-geocode/geocode_tz.pickle', 'rb') as handle:
     gtzinfo=pickle.load(handle)

mappings = {
    'Riyadh':'Riyadh, Riyadh Province',
    'Guatemala City':'Guatemala City, Guatemala Department',
    'Hung Yen':'Hung Yen Province',
    'Tay Ninh':'Tay Ninh Province',
    'Kobe, Japan':'Kobe, Hyogo Prefecture',
    'Managua':'Managua, Managua Department',
    'Micronesia':'Micronesia'
        }

with open('../input/ga-support-geocode/geocode_tz_handcrafted.pickle', 'rb') as handle:
     gtzinfo_hc=pickle.load(handle)

# use `rec` rather than `g` so the geocoder object `g` defined earlier is not overwritten
for rec in gtzinfo_hc:
    rec['src']=mappings[rec['src']] # restore to what geoString would have been before handcrafting

# gtzinfo is an array of dict objects. 'src' field gives the geoString string that
# we used to obtain location/timezone information and provides key back to original data
# filter all the other information out for now.
filt_gtzinfo={}
for r in gtzinfo+gtzinfo_hc:
    if r['src']=='':
        continue
    filt_gtzinfo[r['src']]=r.get('timezone')
    

## Create timezone-aware datetimes from POSIX visitStartTime

Finally, the meat of the issue: we have time zone information for each visit, so we can go on to convert `visitStartTime` to local times.

In [None]:
# join this timezone onto the visits data
df['tz']=df['geoString'].apply(lambda s: filt_gtzinfo.get(s,pytz.utc))

# convert POSIX to pandas datetime object
df['visitStartTime_dt_utc'] = pd.to_datetime(df['visitStartTime'],unit='s',utc=True)

# convert UTC to local time
df['visitStartTime_dt_local']=df[['visitStartTime_dt_utc','tz']].apply(lambda r: r['visitStartTime_dt_utc'].astimezone(r['tz']),axis=1)

# make a note of the offset from utc in hours (for analysis)
df['utcoffset']=df['visitStartTime_dt_local'].apply(lambda t: int(t.strftime('%z')[:-2]))

# let's have a shufty
display(df[['geoString','visitStartTime','visitStartTime_dt_utc','visitStartTime_dt_local','tz','utcoffset']].sample(10))
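
The row-by-row `apply` above is easy to read but slow on a couple of million visits. A possible alternative, sketched here under the assumption that the time zones are held as tz database name strings rather than `pytz` objects, is to convert one zone at a time with the vectorised `.dt.tz_convert`. The helper `local_hour_by_zone` is hypothetical and not part of the notebook.

```python
import pandas as pd


def local_hour_by_zone(utc_times, zone_names):
    """Local hour of day for each UTC timestamp, converting one zone at a time."""
    parts = [grp.dt.tz_convert(zone).dt.hour
             for zone, grp in utc_times.groupby(zone_names)]
    # stitch the per-zone pieces back together in the original row order
    return pd.concat(parts).reindex(utc_times.index)


# illustrative timestamps: both are 02:40 UTC on 14 July 2017
times = pd.Series(pd.to_datetime([1500000000, 1500000000], unit='s', utc=True))
zones = pd.Series(['Asia/Tokyo', 'UTC'])
hours = local_hour_by_zone(times, zones)
print(hours.tolist())
```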


## O-oh... Multi-time-zone countries...

If we only have country-level information and that country spans multiple time zones (like the United States), we get back a time zone for a 'generic' location in that country. For example, searching for United States yields the time zone for Chicago.

    `[('United States', 'America/Chicago')]`
    
Which seems reasonable behaviour to me, but if the visitor was in New York, San Francisco, or Alaska then we will get their local time wrong by up to 3 hours.

Maybe this is not such a big deal. But let's understand the scale of the problem.

According to Wikipedia, these countries span more than one time zone:

    Russia, USA, Canada, Brazil, Mexico, Indonesia, Kiribati, DRC, Micronesia, Kazakhstan, Mongolia, Papua New Guinea, Ukraine
    
For each of these countries, how many times do we only have country-level location information? And as a consequence what will be our assignment error?
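
As a quick cross-check of which countries span several zones, `pytz` can list the zones it knows for an ISO country code. The helper `offsets_for_country` and the reference date below are my own choices for illustration; offsets depend on the date because of daylight saving.

```python
from datetime import datetime

import pytz


def offsets_for_country(iso_code, when):
    """Sorted distinct UTC offsets (in hours) in use in a country at a given instant."""
    offsets = set()
    for name in pytz.country_timezones(iso_code):
        local = when.astimezone(pytz.timezone(name))
        offsets.add(local.utcoffset().total_seconds() / 3600)
    return sorted(offsets)


when = datetime(2018, 7, 1, 12, tzinfo=pytz.utc)  # arbitrary summer date
print('JP', offsets_for_country('JP', when))
print('US', offsets_for_country('US', when))
```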

In [None]:
mtz_countries={'Russia':[2,12],'United States':[-9,-4],'Brazil':[-5,-2],'Mexico':[-8,-5],'Indonesia':[7,9],'Kiribati':[12,14], 'DRC':[1,2],'Micronesia':[10,11], 'Kazakhstan':[5,6], 'Mongolia':[7,8], 'Papua New Guinea':[10,11], 'Ukraine':[2,3]}

# In the Google data we have 'Congo - Kinshasa' for DRC and 'Congo - ??' for Republic of Congo
# just straighten that out to match our list of multi zone countries.
def replace_congo(v):
    if 'Congo' in v:
        if 'Kinshasa' in v:
            return 'DRC'
        else:
            return 'Republic of Congo'
    else:
        return v
    
df['geoNetwork.country']=df['geoNetwork.country'].apply(lambda v: replace_congo(v))

mtz_summary=[]

def is_nullv(v):
    # pd.isnull catches None and any NaN; the string '(not set)' is Google's placeholder
    return pd.isnull(v) or v=='(not set)'

def test_notnull(r):
    t=[not is_nullv(e) for e in r]
    return pd.Series(t).any()
           
for c in mtz_countries.keys():
    cv=df[(df["geoNetwork.country"]==c)]
    cs={}
    cs['country']=c
    cs['visits']=cv.shape[0]
    cs['better_than_country']=cv[["geoNetwork.city","geoNetwork.metro","geoNetwork.region"]].apply(lambda r: test_notnull(r),axis=1).sum()
    if cv.shape[0]>0:
        cs['with_region_p']=cv["geoNetwork.region"].apply(lambda v:  not is_nullv(v)).sum()/cv.shape[0]
        cs['with_metro_p']=cv["geoNetwork.metro"].apply(lambda v:  not is_nullv(v)).sum()/cv.shape[0]
        cs['with_city_p']=cv["geoNetwork.city"].apply(lambda v:  not is_nullv(v)).sum()/cv.shape[0]
        cs['better_than_country_p']=cv[["geoNetwork.city","geoNetwork.metro","geoNetwork.region"]].apply(lambda r: test_notnull(r),axis=1).sum()/cv.shape[0]
    
        mtz_summary.append(cs)
    
col_order=['country', 'visits', 'better_than_country_p',
       'with_city_p', 'with_metro_p', 'with_region_p']


In [None]:
x=pd.DataFrame(mtz_summary)[col_order].set_index('country')
display(HTML(x.style.format({'visits':'{:,.0f}','better_than_country_p':'{:,.1%}','with_city_p':'{:,.1%}','with_metro_p':'{:,.1%}','with_region_p':'{:,.1%}'}).render()))  

Perhaps what stands out most here is that we only have better than country level data for 50% of visits from the USA, which accounts for some 700k+ visits. So by assigning a Chicago time zone to these visitors, what error might we have introduced?

The worst case is that they were all in Alaska (UTC - 8) and we assign Chicago (UTC - 5), which would put us 3 hours out for half of our US-based visits.

Maybe the US visits we do have better-than-country level information for can tell us more about how the non-labelled visits are distributed. This makes a biggish assumption that there is nothing systematic about when visits are not labelled with city or region information.
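
The idea is a probability-weighted average of the offset error. A toy version, with made-up zone shares that are not taken from the data and a hypothetical helper `expected_offset_error`, looks like this:

```python
def expected_offset_error(zone_shares, assigned_offset):
    """Share-weighted mean of (true offset - assigned offset), in hours."""
    total = sum(zone_shares.values())
    return sum(share / total * (offset - assigned_offset)
               for offset, share in zone_shares.items())


# made-up shares of visits per UTC offset, purely for illustration
example_shares = {-7: 0.5, -5: 0.25, -4: 0.25}
error = expected_offset_error(example_shares, assigned_offset=-5)
print(f'{error:.2f} hours')
```

With these invented shares, half the visits are 2 hours behind the assigned offset and a quarter are 1 hour ahead, so the errors partly cancel in the mean.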


In [None]:
us_visits=df[(df['geoNetwork.country']=='United States') & (df.tz.apply(lambda tz: tz.zone.split('/')[0])=='America') & (df.tz.apply(lambda tz: tz.zone not in ['America/Sao_Paulo','America/Santiago','America/Buenos_Aires']))].copy().reset_index()
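# treat nulls and '(not set)' as missing; the second test looks at city (the original tested region twice)
def _missing(s): return s.isnull() | (s=='(not set)')
has_region_or_city = ~_missing(us_visits['geoNetwork.region']) | ~_missing(us_visits['geoNetwork.city'])
us_visits_with_region_or_city = us_visits[has_region_or_city].reset_index(drop=True)
us_visits_without_region_or_city = us_visits[~has_region_or_city].reset_index(drop=True).copy()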

maxutc=us_visits_with_region_or_city.utcoffset.max();minutc=us_visits_with_region_or_city.utcoffset.min()
prob_each_time_zone = pd.DataFrame(us_visits_with_region_or_city.utcoffset.value_counts()/us_visits_with_region_or_city.shape[0]).reset_index()
prob_each_time_zone.columns=['utc_offset','prob']
prob_each_time_zone.sort_values(by='utc_offset',inplace=True)
prob_each_time_zone['error']=prob_each_time_zone['utc_offset']-(-5)
display(prob_each_time_zone)
weighted_offset = prob_each_time_zone.apply(lambda r: r.prob * r.utc_offset, axis=1).sum()
weighted_error = prob_each_time_zone.apply(lambda r: r.prob * r.error, axis=1).sum()
print(f'The weighted average utc offset for United States is {weighted_offset:.1f} hours.')
print(f'The error by assigning to Chicago for unknown region/city is on average {weighted_error:.1f} hours.') 


Just as another way to prove the same thing to myself:
 * Take the US visits that only have country information.
 * To each visit, assign a US time zone with probability equal to its share in the better-than-country labelled data (as in the cell above).
 * Take the difference between this and Chicago (-5), then take the mean to get the average error from the assignment.

Again, this does assume there is nothing systematic about the missing geoNetwork data.

In [None]:
tzs=us_visits_with_region_or_city['tz'].value_counts().index
p=(us_visits_with_region_or_city['tz'].value_counts()/us_visits_with_region_or_city.shape[0]).values
N=us_visits_without_region_or_city.shape[0]

us_visits_without_region_or_city['tz']=pd.Series(np.random.choice(tzs, N, p=p))

# convert UTC to local time
us_visits_without_region_or_city['visitStartTime_dt_local']=us_visits_without_region_or_city[['visitStartTime_dt_utc','tz']].apply(lambda r: r['visitStartTime_dt_utc'].astimezone(r['tz']),axis=1)

# make a note of the offset from utc in hours (for analysis)
us_visits_without_region_or_city['utcoffset']=us_visits_without_region_or_city['visitStartTime_dt_local'].apply(lambda t: int(t.strftime('%z')[:-2]))

us_visits_without_region_or_city['off_error']=us_visits_without_region_or_city['utcoffset']+5

print(f'Mean error by assigning Chicago: {us_visits_without_region_or_city.off_error.mean():.2f} hours.')
print(f'Std Dev of error by assigning Chicago: {us_visits_without_region_or_city.off_error.std():.2f} hours.')
print(f'Median error by assigning Chicago: {us_visits_without_region_or_city.off_error.median():.2f} hours.')


## Visual comparison of the effect of applying local time

I'm going to use a heatmap to plot a 2D histogram of the number of visits by hour of the day and day of the week, both for `visitStartTime` in UTC and in the visitor's local time.

There will be a heatmap that groups together all visits for each `continent` described in the Google Store data (rather than from the time zone description). Note that this doesn't account for the erroneous labelling of country and continent noticed in the previous analysis.

In [None]:
# extract hour of day and day of week for utc and local time for analysis
df['dow_utc']=df['visitStartTime_dt_utc'].dt.dayofweek
df['hod_utc']=df['visitStartTime_dt_utc'].dt.hour
df['dow_loc']=pd.Series([ts.dayofweek for ts in df['visitStartTime_dt_local']])
df['hod_loc']=pd.Series([ts.hour for ts in df['visitStartTime_dt_local']])

tp_tz=df['tz'].apply(lambda z: z.zone.split('/')[0]).unique()

df['visits']=1

continents=df['geoNetwork.continent'].unique()
fig,ax=plt.subplots(2,len(continents),figsize=(16,6),sharey=True)
for i,c in enumerate(continents):
    x=df[df['geoNetwork.continent']==c][['dow_utc','hod_utc','visits']].groupby(['dow_utc','hod_utc']).sum().reset_index().pivot(index='hod_utc',columns='dow_utc',values='visits')
    sns.heatmap(x,ax=ax[0,i])
    ax[0,i].set_title(c,fontsize=20)
    ax[0,i].set_xlabel('day of week',fontsize=16)
    if i==0:
        ax[0,i].set_ylabel('hour of day (UTC)',fontsize=16)
    else:
        ax[0,i].set_ylabel('')
 
for i,c in enumerate(continents):
    x=df[df['geoNetwork.continent']==c][['dow_loc','hod_loc','visits']].groupby(['dow_loc','hod_loc']).sum().reset_index().pivot(index='hod_loc',columns='dow_loc',values='visits')
    sns.heatmap(x,ax=ax[1,i], cmap="bone")
    #ax[1,i].set_title(c,fontsize=20)
    ax[1,i].set_xlabel('day of week',fontsize=16)
    if i==0:
        ax[1,i].set_ylabel('hour of day (local)',fontsize=16)
    else:
        ax[1,i].set_ylabel('')

#tmp=fig.text(.01, -.2, "These heatmaps compare the number of visits in each hour of the day and day of week across the different continents as labelled in the geoNetwork field.\nThe top row are for visitStartTime in UTC, the bottom row is once the visitStartTime has been converted into the visitor's local time\nWould visitors from the Americas really be more likely to visit in the evenings, while Europeans favour lunchtimes and Asians the morning?", ha='left',fontsize=20,linespacing=1.5)
plt.tight_layout()    

These heatmaps compare the number of visits in each hour of the day and day of week across the different continents as labelled in the geoNetwork field.

The top row is for `visitStartTime` in UTC; the bottom row is after `visitStartTime` has been converted into the visitor's local time. Note that each heatmap has its own color scale, scaled to the number of visits in that group (I'm interested in the distribution of visits here, not the absolute number).

Looking at the UTC maps, does it seem reasonable that visitors from the Americas would really be more likely to visit in the evenings, while Europeans favour lunchtimes and Asians the morning? The bottom row seems more intuitive, with the highest number of visits clustered in daytime hours on weekdays.

## Conclusion

Converting to local time seems to have worked reasonably well, at least at the aggregate level: the comparison charts look as one might intuitively expect.

The level of missing and erroneous data will mean that some individual visits will not have been labelled with precision. This will need to be taken into account in any subsequent analysis. It might be best to group up the time of day of the visits into parts of the day such as morning, afternoon and evening and investigate that as a predictor rather than the precise hour.
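
Grouping hours into parts of the day could be sketched as below. The bin edges and labels are my own assumption of a reasonable split, not something derived from the data, and `part_of_day` is a hypothetical helper.

```python
import pandas as pd


def part_of_day(hours):
    """Bucket hour of day (0-23) into coarse, left-closed parts of the day."""
    bins = [0, 6, 12, 18, 24]
    labels = ['night', 'morning', 'afternoon', 'evening']
    return pd.cut(hours, bins=bins, labels=labels, right=False)


parts = part_of_day(pd.Series([0, 6, 13, 23]))
print(list(parts))
```

A local time that is an hour or two out will often still land in the right bucket, which is the attraction of the coarser feature.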

