# BigQuery - Inserting Data in a DataFrame into a Table

This notebook demonstrates how to stream data from a Pandas dataframe into a BigQuery table.

### In this notebook you will
* Create a Pandas dataframe containing some stock price data
* Learn how to make some adjustments to a Pandas dataframe to include the index as a column and make sure columns have the right types
* Infer a BigQuery table schema from the adjusted Pandas dataframe
* Create a BigQuery table with the inferred schema
* Upload the records from the Pandas dataframe to the BigQuery table

Related Links:

* [BigQuery](https://cloud.google.com/bigquery/)
* Python [Pandas](http://pandas.pydata.org/) for data analysis

----

NOTE:

* If you're new to notebooks, or need an introduction to using BigQuery, check out the full [list](..) of notebooks.


In [1]:
import datetime
import gcp.bigquery as bq
import pandas.io.data as web
import time

# Sample Data

First we need some data. We can easily get a pandas dataframe containing Google stock price data (using the Google Finance APIs, via the pandas `DataReader` function):

In [2]:
start_date = datetime.datetime(2013, 1, 1)
end_date = datetime.datetime(2015, 1, 30)
df = web.DataReader('GOOGL', data_source='google', start=start_date, end=end_date)
df[:5]

Date,Open,High,Low,Close,Volume
2013-01-02,360.07,363.86,358.63,361.99,2542268
2013-01-03,362.83,366.33,360.72,362.2,2318140
2013-01-04,365.03,371.11,364.2,369.35,2763552
2013-01-07,368.09,370.06,365.66,367.74,1655967
2013-01-08,368.14,368.52,362.58,367.02,1676740
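
If the `DataReader` call is not available in your environment, a frame with the same shape can be built by hand. The following is a minimal sketch that uses only the five rows shown above; the name `sample_df` is an assumption introduced for illustration and is not part of the original notebook.

```python
import pandas as pd

# The five rows displayed above, one tuple per trading day.
rows = [
    ("2013-01-02", 360.07, 363.86, 358.63, 361.99, 2542268),
    ("2013-01-03", 362.83, 366.33, 360.72, 362.20, 2318140),
    ("2013-01-04", 365.03, 371.11, 364.20, 369.35, 2763552),
    ("2013-01-07", 368.09, 370.06, 365.66, 367.74, 1655967),
    ("2013-01-08", 368.14, 368.52, 362.58, 367.02, 1676740),
]
columns = ["Date", "Open", "High", "Low", "Close", "Volume"]

# sample_df is a hypothetical stand-in for the df returned by DataReader.
sample_df = pd.DataFrame(rows, columns=columns)

# Parse the date strings and use them as the index, so the frame has
# the same layout as the one displayed above.
sample_df["Date"] = pd.to_datetime(sample_df["Date"])
sample_df = sample_df.set_index("Date")

print(sample_df.dtypes)
```

Assigning `sample_df` to `df` should let the remaining cells be followed on this small sample.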


# Preparing the Pandas DataFrame

We are going to need to create a BigQuery table with an appropriate schema. We can create a schema ourselves, but it is easier to just derive the schema from the dataframe.

## Pandas Schema

In [3]:
df.dtypes

Open      float64
High      float64
Low       float64
Close     float64
Volume      int64
dtype: object

The types look reasonable, but notice that the date column is not included. That is because it is the index for the DataFrame. We want to include the index, which we can do by converting it to a column:

In [4]:
df = df.reset_index(drop=False)
df.dtypes

Date      datetime64[ns]
Open             float64
High             float64
Low              float64
Close            float64
Volume             int64
dtype: object

As a result, you'll notice the DataFrame has a Date column and the index is now simply an auto-numbered sequence.

In [5]:
df[:3]

,Date,Open,High,Low,Close,Volume
0,2013-01-02,360.07,363.86,358.63,361.99,2542268
1,2013-01-03,362.83,366.33,360.72,362.2,2318140
2,2013-01-04,365.03,371.11,364.2,369.35,2763552
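
To make the effect of the `drop` argument concrete, here is a small self-contained sketch. The frame `prices` and the names `kept` and `dropped` are illustrative and not part of the notebook.

```python
import pandas as pd

# A two-row frame indexed by date, mirroring the layout used above.
prices = pd.DataFrame(
    {"Close": [361.99, 362.20]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03"]),
)
prices.index.name = "Date"

# drop=False (the default) moves the index into a regular 'Date' column.
kept = prices.reset_index(drop=False)

# drop=True discards the index values entirely.
dropped = prices.reset_index(drop=True)

print(list(kept.columns))     # ['Date', 'Close']
print(list(dropped.columns))  # ['Close']
```

Because the dates need to reach BigQuery as a column, `drop=False` is the variant that fits this notebook.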


You may need to rename columns if you want your BigQuery table to use different column names, or if a column name in the DataFrame contains characters that are not allowed in BigQuery column names (as is true in the case of the Date column in this sample).

In [6]:
df.columns[0]

'Date'

In [7]:
df.rename(columns={df.columns[0]: 'Date'}, inplace=True)
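
When several columns need cleaning, renaming them one at a time gets tedious. The sketch below assumes the conservative rule that a column name may contain only letters, digits, and underscores, and must not start with a digit. The helper `to_bigquery_column_name` is hypothetical and is not part of `gcp.bigquery` or pandas.

```python
import re


def to_bigquery_column_name(name):
    """Hypothetical helper: coerce a label into a conservative column name."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", str(name))
    # Prefix names that are empty or that start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


print(to_bigquery_column_name("Adj Close"))  # Adj_Close
print(to_bigquery_column_name("52w-high"))   # _52w_high
```

Since `DataFrame.rename` accepts a function for `columns`, the helper could be applied to every column at once with `df.rename(columns=to_bigquery_column_name, inplace=True)`. Note that two different labels may map to the same cleaned name, so the result is worth checking for duplicates.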

## Missing Values

Although it is not necessary in this example, missing values can be filled with a default value:

In [8]:
df.fillna(value=0, inplace=True)
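
A single default is not always appropriate for every column. As a sketch, `fillna` also accepts a dictionary that maps column names to fill values, which leaves unlisted columns untouched. The frame `quotes` below is made-up illustration data containing gaps, not data from the notebook.

```python
import pandas as pd

nan = float("nan")

# Illustrative frame with one missing value in each column.
quotes = pd.DataFrame({
    "Close": [361.99, nan, 369.35],
    "Volume": [2542268.0, 2318140.0, nan],
})

# Fill only 'Volume' with 0; gaps in 'Close' are left as NaN so they
# are not mistaken for a real price of zero.
filled = quotes.fillna(value={"Volume": 0})

print(filled)
```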

# Creating a BigQuery Table Schema

Now we want to create a schema for the table. We can infer one from the dataframe as follows:

In [9]:
schema = bq.Schema.from_data(df)
schema
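
To give a feel for what schema inference involves, the sketch below maps pandas column dtypes to BigQuery type names. It is not the implementation of `bq.Schema.from_data`; the mapping table `KIND_TO_BQ_TYPE` and the helper `infer_schema` are assumptions made for illustration.

```python
import pandas as pd

# Assumed mapping from numpy dtype "kind" codes to BigQuery type names.
KIND_TO_BQ_TYPE = {
    "i": "INTEGER",    # signed integers
    "u": "INTEGER",    # unsigned integers
    "f": "FLOAT",      # floating point
    "b": "BOOLEAN",    # booleans
    "M": "TIMESTAMP",  # datetime64
}


def infer_schema(frame):
    """Hypothetical helper: one {'name', 'type'} dict per column."""
    # Any dtype kind not listed above falls back to STRING.
    return [
        {"name": str(name), "type": KIND_TO_BQ_TYPE.get(dtype.kind, "STRING")}
        for name, dtype in frame.dtypes.items()
    ]


example = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-02"]),
    "Close": [361.99],
    "Volume": [2542268],
})

for field in infer_schema(example):
    print(field)
```

This also shows why the index had to be converted to a column first: only entries in `dtypes` are visited, and the index is not one of them.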

# Creating the BigQuery Table

Now we can create the table with the schema that was just created.

For the purpose of this example, if the table already exists we'll recreate it (with the `overwrite=True` parameter). We'll also create the DataSet that will contain the table.

In [10]:
bq.DataSet('samples').create()



In [11]:
table = bq.Table('samples.stock').create(schema, overwrite=True)

# Inserting Data into BigQuery

Finally, we can populate the table with data from the dataframe. This uses the BigQuery streaming insert API to send the rows of the pandas dataframe to BigQuery.

In [12]:
table.insert_data(df)
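
Streaming inserts send rows as JSON-like records, and large frames are typically sent in several requests. How `insert_data` batches rows internally is not shown in this notebook, so the following is only a sketch of the general idea. The helper `row_batches` and the default batch size of 500 are assumptions.

```python
import pandas as pd


def row_batches(frame, batch_size=500):
    """Hypothetical helper: yield lists of row dicts, batch_size rows at a time."""
    for start in range(0, len(frame), batch_size):
        chunk = frame.iloc[start:start + batch_size]
        # orient="records" gives one {column: value} dict per row.
        yield chunk.to_dict(orient="records")


example = pd.DataFrame({
    "Close": [361.99, 362.20, 369.35],
    "Volume": [2542268, 2318140, 2763552],
})

batches = list(row_batches(example, batch_size=2))
print([len(batch) for batch in batches])  # [2, 1]
```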

To confirm the insert, we can sample the newly created and populated table.

Note that it can take a while for BigQuery to process the newly inserted data and make it available for querying. You may need to wait and re-run this cell a few times before seeing the results.

In [13]:
table.sample()
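
Instead of re-running the cell by hand, the wait can be scripted. The sketch below polls a callable until it returns a non-empty result. The helper `wait_for_rows` is hypothetical, and `fetch` is assumed to be any zero-argument function that returns a collection supporting `len()`; whether the result of `table.sample()` supports `len()` is not verified here. The example uses a stand-in in place of a real query.

```python
import time


def wait_for_rows(fetch, attempts=5, delay_seconds=2.0):
    """Hypothetical helper: call fetch() until it returns a non-empty result."""
    rows = []
    for attempt in range(attempts):
        rows = fetch()
        if len(rows) > 0:
            return rows
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    # Still empty after all attempts; return the last (empty) result.
    return rows


# Stand-in for a query: empty twice, then one row becomes visible.
responses = iter([[], [], [{"Date": "2013-01-02", "Close": 361.99}]])
result = wait_for_rows(lambda: next(responses), attempts=5, delay_seconds=0.01)
print(len(result))  # 1
```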