We would like to examine the data that various media companies have collected on individuals. 

We will start with Google, and look at one particular aspect of their data collection: it appears that Google scrapes financial transaction data (purchases and reservations) from some emails sent to Gmail addresses.

The code below will help you extract and examine this information for your own perusal.

In [1]:
import pandas as pd
import glob
import json
import copy
from collections import defaultdict
from datetime import datetime

Having imported some useful libraries, we add a function to aid in converting timestamps.

In [2]:
def timestamp_sec_to_datetime(ts):
    """
    Input:  Unix timestamp (int or string), in seconds or microseconds.
    Output: utcfromtimestamp of the first 10 chars of the timestamp.
    Motivation: Google's transaction records store timestamps in microseconds.
        We merely remove the microseconds (if they are present).
    """
    assert int(ts) > 0, "timestamp must be a positive int"
    sts = str(ts)
    return datetime.utcfromtimestamp(int(sts[:10]))
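
The string slicing above relies on the seconds part of the timestamp having exactly ten digits. As a minimal sketch of an alternative, assuming the value is always given in microseconds (as the field name usecSinceEpochUtc suggests), integer division does the same job without depending on the digit count. The function name usec_to_datetime is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone

def usec_to_datetime(usec):
    """Convert microseconds since the Unix epoch (int or str) to an aware UTC datetime."""
    # Assumption: the input is always in microseconds, never in seconds.
    seconds, micros = divmod(int(usec), 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)

print(usec_to_datetime("1500000000000000").strftime("%Y-%m-%d"))
```

Unlike the helper above, this returns a timezone-aware datetime, so it should not be mixed with naive datetimes in comparisons.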

Download your Google data by following the directions here:
    https://support.google.com/accounts/answer/3024190?hl=en

Then, decompress the archive (it will likely be large) and set purchase_dir below to the correct folder.

In [3]:
# My purchase data was in "Takeout 2". Yours may differ.
purchase_dir = "./Takeout 2/Purchases _ Reservations/"

Each transaction is in its own JSON file.

There are several kinds of transactions recorded, including but not limited to:

    * item purchase
    * concert ticket purchase
    * flight reservation
    * hotel reservation
    * item shipped
    
We have not (yet) parsed all the different types of transactions, but we can already read a fair amount of data from the overall collection.

The first task is to get all transactions into a single dictionary.

In [4]:
# For each order file, load the JSON and store it in a dictionary keyed by order number.
json_file_list = glob.glob(purchase_dir + "order_*.json")
kill_front = len(purchase_dir + "order_")  # number of characters before the order number
kill_back  = len(".json")  # yes, 5, but let's be consistent

order_dict = {}

for fn in json_file_list:
    with open(fn, "r") as f:
        order = json.load(f)
        order_number = fn[kill_front:-kill_back]
        order_dict[order_number] = copy.deepcopy(order)
#        print(f"Added order_dict[{order_number}].")
        # make a dict specifically to track which keys are in each 
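
To get a feel for which fields Google records, one option is to count how often each top-level key appears across the orders. The sketch below is only an illustration: the helper name count_top_level_keys is hypothetical, and the sample_orders entries are invented stand-ins so the block runs on its own. In the notebook you would pass the real order_dict instead.

```python
from collections import Counter

def count_top_level_keys(orders):
    """Count how many orders contain each top-level key."""
    counts = Counter()
    for order in orders.values():
        counts.update(order.keys())
    return counts

# Hypothetical stand-in for order_dict; real records will have more fields.
sample_orders = {
    "1": {"creationTime": {"usecSinceEpochUtc": "1500000000000000"}, "priceline": []},
    "2": {"creationTime": {"usecSinceEpochUtc": "1500000100000000"}},
}

for key, n in count_top_level_keys(sample_orders).most_common():
    print(f"{key}: {n}")
```

Keys that appear in only a few orders are a reasonable hint that those orders belong to a different transaction type.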

Once this is done, we can examine the dollar amounts that Google has determined you spent on certain transactions.

To find these amounts, we look for the "priceline" key in each transaction. We'll restrict our initial examination to amounts under $100,000 in value (I doubt this is a problem for most people).

To further examine these transactions, we may later look into the "lineItem" key.

In [5]:
# Extract the times of these transactions and their total cost.

# $100,000 is way higher than any transaction that would occur here.
# TODO remove this threshold
threshold = 100000

time_count, total_count = 0, 0
totals = defaultdict(float)  # dollar amount per price-line type
summed = defaultdict(int)    # number of price-line entries per type

for k, v in order_dict.items():

    # If "creationTime" exists in the keys, get the time.
    # Orders without one fall back to timestamp 1, i.e. '1970-01-01'.
    order_ts = v["creationTime"]["usecSinceEpochUtc"] if "creationTime" in v else 1
    order_date = timestamp_sec_to_datetime(order_ts).strftime('%Y-%m-%d')

    # If "priceline" exists in the keys, get the subtotal and tax (if any).
    if "priceline" in v:
        # Add each entry to its respective total.
        for v2 in v["priceline"]:
            # TODO remember that all the values are in micro-units. We'll assume USD for now.
            val = round(int(v2['amount']['amountMicros']) / 1000000, 2)
            if abs(val) < threshold:
                totals[v2['type']] += val
                summed[v2['type']] += 1
                # print(f"{order_date}: added ${val} to {v2['type']}.")

Looking at the summed and totals dictionaries compiled with the above code, we can see:

    * summed: number of price entries logged, by type
    * totals: total dollar amount logged, by type (assuming USD)

We can do a finer-grained analysis; this is merely a first look at what's in Google's hands about us as consumers.
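
As one possible finer-grained view, the amounts can be broken down by year as well as by type. The sketch below is an illustration under stated assumptions: the helper name totals_by_year is hypothetical, the sample records are invented, and it assumes the same "creationTime" and "priceline" layout used in the loop above, with all amounts in USD micro-units.

```python
from collections import defaultdict
from datetime import datetime, timezone

def totals_by_year(orders):
    """Sum price-line amounts per (year, type), skipping orders without a timestamp."""
    result = defaultdict(float)
    for order in orders.values():
        if "creationTime" not in order:
            continue
        # Assumption: usecSinceEpochUtc is always in microseconds.
        usec = int(order["creationTime"]["usecSinceEpochUtc"])
        year = datetime.fromtimestamp(usec // 1_000_000, tz=timezone.utc).year
        for line in order.get("priceline", []):
            amount = int(line["amount"]["amountMicros"]) / 1_000_000
            result[(year, line["type"])] += amount
    return dict(result)

# Invented sample records, shaped like the fields read in the loop above.
sample_orders = {
    "a": {
        "creationTime": {"usecSinceEpochUtc": "1500000000000000"},
        "priceline": [
            {"type": "SUBTOTAL", "amount": {"amountMicros": "10000000"}},
            {"type": "TAX", "amount": {"amountMicros": "800000"}},
        ],
    },
    "b": {
        "creationTime": {"usecSinceEpochUtc": "1600000000000000"},
        "priceline": [{"type": "SUBTOTAL", "amount": {"amountMicros": "5500000"}}],
    },
}

result = totals_by_year(sample_orders)
for (year, kind), amount in sorted(result.items()):
    print(year, kind, round(amount, 2))
```

Unlike the loop above, this sketch skips orders with no "creationTime" rather than assigning them to 1970, which keeps undated records out of the yearly totals.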

In [6]:
summed

defaultdict(int,
            {'SUBTOTAL': 390, 'TAX': 365, 'DELIVERY': 109, 'OTHER_COSTS': 86})

In [7]:
totals

defaultdict(float,
            {'SUBTOTAL': 11471.009999999975,
             'TAX': 727.8799999999999,
             'DELIVERY': 292.55999999999995,
             'OTHER_COSTS': 169.35999999999996})
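
The long trailing digits in totals (for example 11471.009999999975) are binary floating-point artifacts from repeatedly adding float values. A minimal sketch of one way to avoid them, assuming amounts are kept as integer micro-units until display: sum the integers and convert once at the end with Decimal. The helper name sum_micros and the sample entries are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

def sum_micros(price_lines):
    """Sum amountMicros per type as integers, then convert to exact decimal units."""
    totals_micros = defaultdict(int)
    for line in price_lines:
        totals_micros[line["type"]] += int(line["amount"]["amountMicros"])
    return {t: Decimal(m) / Decimal(1_000_000) for t, m in totals_micros.items()}

# Invented entries, shaped like the items under the "priceline" key.
sample_lines = [
    {"type": "SUBTOTAL", "amount": {"amountMicros": "10990000"}},
    {"type": "SUBTOTAL", "amount": {"amountMicros": "1020000"}},
    {"type": "TAX", "amount": {"amountMicros": "880000"}},
]

exact_totals = sum_micros(sample_lines)
for kind, amount in exact_totals.items():
    print(kind, amount)
```

For a quick look, rounding the float totals at display time, e.g. round(totals['SUBTOTAL'], 2), is a simpler alternative.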