# Pandas Dataframe to Google Firestore

On my recent Hacker News project I ended up building a few nested data views in my pandas dataframe. These views are for plotting the user's comment sentiment over time, and for their top 50 saltiest comments.

I needed somewhere to host this data for the front-end app, and after looking over the offerings across AWS, Heroku, and Google Cloud I settled on Google Cloud's Cloud Firestore.

The price and usability were hard to beat, but the system lacks a simple `csv` upload. Since the nested values in my dataframe were stored as `str(dict)`s, I wanted to un-nest them and take advantage of Firestore's document-based structure.

This un-nesting string method would work for MongoDB as well. 

The primary advantage of using Firestore is that I won't have to worry about building a scalable API, and the cost of storage/querying is very low. 

An added advantage is that my front-end dev partners can write all the code they'd like in JS, and I can write and interact with the database using Python. That's a major win.

## Set up the Google Cloud Environment

I recommend using the [Firestore Quickstart Tutorials](https://cloud.google.com/firestore/docs/quickstart-servers) to get started. You will need to get registered on Google Cloud Platform, and create a Google Cloud Platform project.

Follow the instructions to create your database and download the authentication credential (`.json`).

Then you will need to set your environment variable as well. If you're using Jupyter like me, you can put the key in the same folder as your notebook (**ADD THE FILE NAME TO YOUR `.gitignore`**) and then follow the steps below to add the key to your Anaconda env.

Via Bash:
```
cd $CONDA_PREFIX
ls
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh
nano ./etc/conda/activate.d/env_vars.sh
```
add these lines: 
```
#!/bin/sh
export GOOGLE_APPLICATION_CREDENTIALS="yourkey.json"
```
then ctrl+x, y.
```
nano ./etc/conda/deactivate.d/env_vars.sh
```
add these lines:
```
#!/bin/sh
unset GOOGLE_APPLICATION_CREDENTIALS
```

*Then you're ready to launch your Jupyter notebook.*

## Load dependencies

In [2]:
#!pip install --upgrade google-cloud-firestore
import pandas as pd
import numpy as np
import json
import datetime
from google.cloud import firestore
from tqdm import tqdm_pandas
from tqdm import tqdm_notebook as tqdm
# Load TQDM
tqdm_pandas(tqdm())

## Check for Env variable

In [3]:
!if [ -z ${GOOGLE_APPLICATION_CREDENTIALS+x} ]; then echo "GOOGLE_APPLICATION_CREDENTIALS is unset"; else echo "GOOGLE_APPLICATION_CREDENTIALS is set to '$GOOGLE_APPLICATION_CREDENTIALS'";fi

GOOGLE_APPLICATION_CREDENTIALS is set to 'winterrose-nlp-7d9d80973d77.json'
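The same check can be done from Python, which also lets you confirm the key file is actually reachable from the notebook's working directory. This is a minimal sketch; the helper name `credentials_status` is my own, not part of any Google library.

```python
import os


def credentials_status(env=None):
    """Describe whether the credentials variable is set and points at a file.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is unset"
    if not os.path.isfile(path):
        # A relative path only resolves if the notebook runs in the key's folder.
        return ("GOOGLE_APPLICATION_CREDENTIALS is set to '%s', "
                "but no file exists there" % path)
    return "GOOGLE_APPLICATION_CREDENTIALS is set to '%s'" % path


print(credentials_status())
```

Because the variable in this walkthrough holds a bare file name, the file check depends on where Jupyter was launched; an absolute path avoids that.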


## Initialize Cloud Firestore

In [4]:
db = firestore.Client()

## I created the collection `commentor_stats` manually using the Firestore dashboard.

In [5]:
users_ref = db.collection('commentor_stats')

## A few notes about creating new documents on Firestore

 [The Google Cloud Tutorial for Uploading Data](https://cloud.google.com/firestore/docs/manage-data/add-data)

#### Add a document with a specified document id using set  
```db.collection(u'cities').document(u'new-city-id').set(data)```

#### Let Firestore create the id using the `.add` method.
```db.collection(u'cities').add(city.to_dict())```

#### The data structure for creating proper imports. 
```
data = {
  u'stringExample': u'Hello, World!',
  u'booleanExample': True,
  u'numberExample': 3.14159265,
  u'dateExample': datetime.datetime.now(),  # pd.Timestamp works too.
  u'arrayExample': [5, True, u'hello'],
  u'nullExample': None,
  u'objectExample': {
    u'a': 5,
    u'b': True
  }
}
```

#### Timestamps need to conform to RFC 3339, pd.Timestamp works.
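As a small illustration, here is one way to turn a Unix epoch value into a timezone-aware UTC datetime using only the standard library. The helper name `epoch_to_utc` is an assumption for this sketch; the epoch value is the `comment_time` from the sample comment shown later in this notebook.

```python
import datetime


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)


ts = epoch_to_utc(1541102045)
# isoformat() on an aware datetime yields an RFC 3339 style string.
print(ts.isoformat())  # 2018-11-01T19:54:05+00:00
```

Keeping the timezone explicit avoids the ambiguity of naive datetimes, which would otherwise be interpreted relative to whatever machine runs the upload.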

## Now it's time to upload all my data.  First I need to import it to the notebook.

In [6]:
df = pd.read_csv("hn_commentor_summary.csv")

In [7]:
### These are my columns. I'm going to drop a few. 
[["commentor", # Name of commentor.
  "time_cmnt_lst", # Most recent comment time in Dataset.
  "time_cmnt_fst", # First HN comment time.
  "cnt_cmnts_oall", # Count of total number of comments. 
  "sum_slt_oall", # Total Salt Score Overall. (All Salty + NonSalty Scores added up.)
  "avg_slt_oall", # Average Comment Salt Score across all comments. 
  "cnt_slt_s", # Count of JUST salty comments. 
  "sum_slt_s", # Total Salt Score of Salty Comments
  "avg_slt_s", # Average Salt Score of Salty Comments
  "rank_lt_amt_slt", # Rank: Lifetime Salt Scores Total of Salty comments only.
  "rank_lt_qty_sc", # Rank: Lifetime quantity of "Salty Comments" contributed.
  "rank_oall_slt", # Rank: Lifetime overall "Salt Score" total of All Salty + NonSalty comments. 
  "rank_slt_trolls", # Rank: *ONLY TROLL ACCOUNTS* Lifetime overall "Salt Score" total. (Troll accounts are accounts that *ONLY* have salty posts.)
  "top_cmnts_s", # List of 50 Top Salty Comments. 
  "monthly_plot"]];# List of every month of activity for plotting.

### I'll make a copy in case something happens. 

In [8]:
test = df.copy()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388120 entries, 0 to 388119
Data columns (total 15 columns):
commentor          388120 non-null object
time_cmnt_lst      388120 non-null int64
time_cmnt_fst      388120 non-null int64
cnt_cmnts_oall     388120 non-null int64
sum_slt_oall       388120 non-null float64
avg_slt_oall       388120 non-null float64
cnt_slt_s          171854 non-null float64
sum_slt_s          171854 non-null float64
avg_slt_s          171854 non-null float64
rank_lt_amt_slt    171854 non-null float64
rank_lt_qty_sc     171854 non-null float64
rank_oall_slt      818 non-null float64
rank_slt_trolls    25055 non-null float64
top_cmnts_s        171854 non-null object
monthly_plot       388120 non-null object
dtypes: float64(9), int64(3), object(3)
memory usage: 44.4+ MB


### Convert Unix-Epoch times (currently `int`) to Pandas Timestamps. 

In [9]:
# Convert Unix time to Timestamps
test["time_cmnt_lst"] = test["time_cmnt_lst"].apply(lambda x: pd.Timestamp(x, unit='s').tz_localize('UTC'))
test["time_cmnt_fst"] = test["time_cmnt_fst"].apply(lambda x: pd.Timestamp(x, unit='s').tz_localize('UTC'))
test.head(5)

Unnamed: 0,commentor,time_cmnt_lst,time_cmnt_fst,cnt_cmnts_oall,sum_slt_oall,avg_slt_oall,cnt_slt_s,sum_slt_s,avg_slt_s,rank_lt_amt_slt,rank_lt_qty_sc,rank_oall_slt,rank_slt_trolls,top_cmnts_s,monthly_plot
0,0-,2014-03-14 12:02:05+00:00,2014-03-14 12:02:05+00:00,1,0.12,0.12,,,,,,,,,"[{'y_m': '14_03', 't_s': 0.0, 't_h': 0.12, 'c_..."
1,0--__-_-__--0,2018-11-01 19:54:05+00:00,2018-11-01 19:54:05+00:00,1,-0.006803,-0.006803,1.0,-0.006803,-0.006803,165340.0,171854.0,,23113.0,"[{'commentor': '0--__-_-__--0', 'comment_time'...","[{'y_m': '18_11', 't_s': -0.01, 't_h': 0.0, 'c..."
2,0-0,2009-12-03 19:15:12+00:00,2009-12-03 19:15:12+00:00,1,-0.28,-0.28,1.0,-0.28,-0.28,73460.0,147254.0,,2776.0,"[{'commentor': '0-0', 'comment_time': 12598677...","[{'y_m': '09_12', 't_s': -0.28, 't_h': 0.0, 'c..."
3,0-4,2010-11-01 21:20:10+00:00,2010-10-29 23:19:31+00:00,12,0.753139,0.062762,2.0,-0.009702,-0.004851,162410.0,98565.0,,,"[{'commentor': '0-4', 'comment_time': 12883946...","[{'y_m': '10_10', 't_s': -0.01, 't_h': 0.71, '..."
4,0-9,2018-04-21 15:31:34+00:00,2016-10-28 15:46:02+00:00,2,0.05625,0.028125,,,,,,,,,"[{'y_m': '16_10', 't_s': 0.0, 't_h': 0.0, 'c_s..."


In [10]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388120 entries, 0 to 388119
Data columns (total 15 columns):
commentor          388120 non-null object
time_cmnt_lst      388120 non-null datetime64[ns, UTC]
time_cmnt_fst      388120 non-null datetime64[ns, UTC]
cnt_cmnts_oall     388120 non-null int64
sum_slt_oall       388120 non-null float64
avg_slt_oall       388120 non-null float64
cnt_slt_s          171854 non-null float64
sum_slt_s          171854 non-null float64
avg_slt_s          171854 non-null float64
rank_lt_amt_slt    171854 non-null float64
rank_lt_qty_sc     171854 non-null float64
rank_oall_slt      818 non-null float64
rank_slt_trolls    25055 non-null float64
top_cmnts_s        171854 non-null object
monthly_plot       388120 non-null object
dtypes: datetime64[ns, UTC](2), float64(9), int64(1), object(3)
memory usage: 44.4+ MB


## I'll use this function to unpack the messy list of comments from the `top_cmnts_s` field. 

In [11]:
test_cmnts = test.iloc[1]["top_cmnts_s"]
test_cmnts

'[{\'commentor\': \'0--__-_-__--0\', \'comment_time\': 1541102045, \'comment_saltiness\': -0.0068033854, \'comment_polarity\': -0.01375, \'comment_subjectivity\': 0.4947916667, \'subjectivity_spectrum\': 0.0104166667, \'is_salty\': True, \'is_subjective\': False, \'is_negative\': True, \'parent_type\': \'story\', \'parent_author\': \'macbookaries\', \'parent_title\': \'People who refuse to drink water, no matter what\', \'cleaned_comment\': "This doesn\'t seem that weird. From what I understand it\'s very rare to drink water in Chinese culture because it was historically necessary to boil it. You drink tea, booze and soup, but not water. If someone with first hand experience can chime in on this I\'d appreciate it. Same in historical Europe and America - alcoholic beverages were preferred over water for health reasons.", \'comment_rank\': 0.0, \'comment_id\': 18357789, \'parent_id\': 18356809}]'

In [12]:
def dict_top_comments(top_cmnt_obj):
    """Unpack the top comments and turn it into a good dict.
    
    Args:
        top_cmnt_obj, a str of dicts, from df.top_cmnts_s.
    
    Evaluates the string, json.dumps it, and reads it back into a dataframe.
    Drops all unnecessary columns. 
    Creates a named index for each comment. 
    Turns df into an indexed dict. 
    
    Returns: 
        temp, a properly formed nested dict. 
    """
    try:
        temp = pd.read_json(json.dumps(eval(top_cmnt_obj)))
        temp = temp.drop(columns=["subjectivity_spectrum", "is_negative",
                                  "is_salty", "is_subjective", "comment_polarity",
                                  "comment_subjectivity", "commentor"]).reset_index()
        temp["c_id"] = temp["index"].apply(lambda x: "c_" + str(x))
        temp["comment_time"] = temp["comment_time"].apply(lambda x: pd.Timestamp(x, unit='s').tz_localize('UTC'))
        temp.drop(columns = ["index"], inplace = True)
        temp = temp.set_index("c_id").to_dict("index")
    except:
        temp = np.NaN
    return temp


# Preview the dict after unpacking & cleaning. 
dict_top_comments(test_cmnts)

{'c_0': {'cleaned_comment': "This doesn't seem that weird. From what I understand it's very rare to drink water in Chinese culture because it was historically necessary to boil it. You drink tea, booze and soup, but not water. If someone with first hand experience can chime in on this I'd appreciate it. Same in historical Europe and America - alcoholic beverages were preferred over water for health reasons.",
  'comment_id': 18357789,
  'comment_rank': 0,
  'comment_saltiness': -0.0068033854000000005,
  'comment_time': Timestamp('2018-11-01 19:54:05+0000', tz='UTC'),
  'parent_author': 'macbookaries',
  'parent_id': 18356809,
  'parent_title': 'People who refuse to drink water, no matter what',
  'parent_type': 'story'}}

## I'll use this function to unpack / repack my `monthly_plot`

In [13]:
test_plts = test.iloc[1]["monthly_plot"]
test_plts

"[{'y_m': '18_11', 't_s': -0.01, 't_h': 0.0, 'c_s': 1.0, 'c_h': 0.0}]"

In [14]:
def dict_monthly_plot(monthly_plot_obj):
    """Turns the list of dicts into a nested dict w/ indexes.
    
    Args:
        monthly_plot_obj, an array of dicts.
    
    Returns: 
        temp, a dict of dicts. 
    """
    try:
        temp = pd.DataFrame.from_dict(eval(monthly_plot_obj)).set_index("y_m").to_dict("index")
    except:
        temp = np.NaN
    return temp

dict_monthly_plot(test_plts)

{'18_11': {'c_h': 0.0, 'c_s': 1.0, 't_h': 0.0, 't_s': -0.01}}
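Both unpacking functions call `eval` on strings read from a CSV, which will execute whatever expression the string contains. If that is a concern, `ast.literal_eval` parses only Python literals. The sketch below is an alternative, not the method used in this notebook: the name `monthly_plot_to_dict` is hypothetical, it skips pandas entirely, and it returns `None` instead of `np.NaN` on failure.

```python
import ast


def monthly_plot_to_dict(raw):
    """Parse a str(list-of-dicts) and re-key the rows by their 'y_m' value.

    Returns None when `raw` is not a parseable literal (for example NaN).
    """
    try:
        rows = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return None
    # Use 'y_m' as the outer key and keep the remaining fields as the value.
    return {
        row["y_m"]: {k: v for k, v in row.items() if k != "y_m"}
        for row in rows
    }


sample = "[{'y_m': '18_11', 't_s': -0.01, 't_h': 0.0, 'c_s': 1.0, 'c_h': 0.0}]"
print(monthly_plot_to_dict(sample))
```

Catching specific exceptions also means an unexpected bug (a missing `y_m` key, say) raises loudly instead of silently becoming a missing value.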

## Finally, I'll use this function to process all the data for upload. 
This may take a while. :) 

I'll turn my df into a list of dicts (one per row), then apply the function to each. 

In [16]:
upload = test.iloc[0:].to_dict("records")

In [17]:
#!pip install joblib
from joblib import Parallel, delayed
import multiprocessing
num_cores = multiprocessing.cpu_count()
num_cores

8

In [18]:
x_dict = upload
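def replace_and_upload2(x):
    """Replace the two str(dict) fields of one record with nested dicts.

    Despite the name, nothing is uploaded here; the upload happens later.
    """
    x["monthly_plot"] = dict_monthly_plot(x["monthly_plot"])
    x["top_cmnts_s"] = dict_top_comments(x["top_cmnts_s"])
    print("processed ", x["commentor"])
    return x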
results =[]
results.append(Parallel(n_jobs=7)(delayed(replace_and_upload2)(x) for x in tqdm(x_dict)))
print("done")

done


## I had to restart this a few times due to random errors. Added error handling and fixed it, but this is where my process had to begin again. 

In [95]:
print(results[0][304849:304851])

[{'commentor': 'searchencrypt', 'time_cmnt_lst': Timestamp('2018-10-24 20:31:28+0000', tz='UTC'), 'time_cmnt_fst': Timestamp('2018-01-03 14:49:33+0000', tz='UTC'), 'cnt_cmnts_oall': 19, 'sum_slt_oall': 1.1310191256191158, 'avg_slt_oall': 0.05952732240100609, 'cnt_slt_s': 2.0, 'sum_slt_s': -0.5180357142857143, 'avg_slt_s': -0.25901785714285713, 'rank_lt_amt_slt': 55279.0, 'rank_lt_qty_sc': 79543.0, 'rank_oall_slt': nan, 'rank_slt_trolls': nan, 'top_cmnts_s': {'c_0': {'cleaned_comment': 'Bezos says, “If you make customers unhappy in the physical world, they might each tell six friends. If you make customers unhappy on the Internet, they can each tell 6,000.” Tracking people makes them :(', 'comment_id': 16643222, 'comment_rank': 0, 'comment_saltiness': -0.3586607143, 'comment_time': Timestamp('2018-03-21 22:56:33+0000', tz='UTC'), 'parent_author': 'dwyerm', 'parent_id': 16642584, 'parent_title': 'Another Comment', 'parent_type': 'comment'}, 'c_1': {'cleaned_comment': "It's news because p

In [100]:
final = results[0][-:]
print(len(final))
print(final[0])

3
{'commentor': 'zzzzzzzzzzz', 'time_cmnt_lst': Timestamp('2018-10-13 14:18:32+0000', tz='UTC'), 'time_cmnt_fst': Timestamp('2018-10-13 14:18:32+0000', tz='UTC'), 'cnt_cmnts_oall': 1, 'sum_slt_oall': 0.0, 'avg_slt_oall': 0.0, 'cnt_slt_s': nan, 'sum_slt_s': nan, 'avg_slt_s': nan, 'rank_lt_amt_slt': nan, 'rank_lt_qty_sc': nan, 'rank_oall_slt': nan, 'rank_slt_trolls': nan, 'top_cmnts_s': nan, 'monthly_plot': {'18_10': {'c_h': 1.0, 'c_s': 0.0, 't_h': 0.0, 't_s': 0.0}}}
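The record printed above still carries `nan` for every missing field. If you would rather store those as nulls, so front-end code sees `null` instead of a NaN double, a recursive cleanup pass is one option. This is a sketch of that design choice, not a step from the original run; `nan_to_none` is a hypothetical helper.

```python
import math


def nan_to_none(value):
    """Return a copy of `value` with every float NaN replaced by None.

    Walks nested dicts and lists; all other values are returned unchanged.
    """
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value


record = {"commentor": "zzzzzzzzzzz", "cnt_slt_s": float("nan"),
          "monthly_plot": {"18_10": {"c_h": 1.0, "t_s": 0.0}}}
print(nan_to_none(record))
```

The function builds new containers rather than mutating in place, so the original `results` list stays available if the upload needs to be restarted.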


In [93]:
batch_no

76

## And here's the upload function. 

Notice how it batches the records into groups of 500 then submits them. The submission step was having the occasional timeout but adding the `try` worked great. 

In [97]:
from time import sleep
x = 1
batch_no = 1
for entry in tqdm(final):
    if x == 1:
        # Do this part the first time.
        batch = db.batch()
        print("set batch")
        
    # Do this part for every single one. 
    #print ("added %s to batch" % x)
    batch.set(db.collection(u'commentor_stats').document(), entry)
    
    if x % 500 == 0:
        #Do this part every 500th time.
        #Had to add a try/except for this pesky submission error.
        try:
            batch.commit()
        except:
            print("Commit of batch %s failed... reattempting." % batch_no)
            sleep(5) # Wait 5 seconds, then retry. 
            batch.commit()
        print("sent batch %s" % batch_no)
        batch = db.batch()
        batch_no += 1
    x += 1

# One last batch commit to send the last non-500 docsize batch.
batch.commit()

set batch
sent batch 1
sent batch 2
sent batch 3
sent batch 4
sent batch 5
sent batch 6
sent batch 7
sent batch 8
sent batch 9
sent batch 10
sent batch 11
sent batch 12
sent batch 13
sent batch 14
sent batch 15
sent batch 16
sent batch 17
sent batch 18
sent batch 19
sent batch 20
sent batch 21
sent batch 22
sent batch 23
sent batch 24
sent batch 25
sent batch 26
sent batch 27
sent batch 28
sent batch 29
sent batch 30
sent batch 31
sent batch 32
sent batch 33
sent batch 34
sent batch 35
sent batch 36
sent batch 37
sent batch 38
sent batch 39
sent batch 40
sent batch 41
sent batch 42
sent batch 43
sent batch 44
sent batch 45
sent batch 46
sent batch 47
sent batch 48
sent batch 49
sent batch 50
sent batch 51
sent batch 52
sent batch 53
sent batch 54
sent batch 55
sent batch 56
sent batch 57
sent batch 58
sent batch 59
sent batch 60
sent batch 61
sent batch 62
sent batch 63
sent batch 64
sent batch 65
sent batch 66
sent batch 67
sent batch 68
sent batch 69
sent batch 70
sent batch 71
sent 

In [None]:
# The basic outline for batch uploading. 
#batch = db.batch()
#batch.set(db.collection(u'commentor_stats').document(),{u'commentor': u'ZTESTZZZZZ'})
#batch.commit()
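The batching loop above can also be sketched as two small helpers: one that slices the records into groups of 500, and one that retries a failed commit more than once. Both names (`chunked`, `commit_with_retry`) and their parameters are assumptions for illustration, and the Firestore wiring is left as a comment because it needs a live client.

```python
import time
from itertools import islice


def chunked(items, size=500):
    """Yield consecutive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def commit_with_retry(commit, attempts=3, wait_seconds=5.0, sleep=time.sleep):
    """Call `commit()`, retrying on any exception with a growing wait."""
    for attempt in range(1, attempts + 1):
        try:
            return commit()
        except Exception:
            if attempt == attempts:
                raise  # Out of retries: surface the real error.
            sleep(wait_seconds * attempt)


# Hypothetical wiring against the notebook's `db` and `final` (not run here):
# for chunk in chunked(final, 500):
#     batch = db.batch()
#     for entry in chunk:
#         batch.set(db.collection("commentor_stats").document(), entry)
#     commit_with_retry(batch.commit)
```

Chunking up front removes the counter bookkeeping and the trailing "one last commit", and it copes with an empty input without touching an undefined `batch`.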