<h1> Experiment with BigQuery API</h1>

In this notebook, you will try out BigQuery and its Application Programming Interface (API) to analyze a dataset of roughly 10 billion rows of Wikipedia article data from 2014. To be precise, the table consists of 10,677,046,566 rows, or 692.58 GB of data. Each row includes the date/timestamp of the last modification to the article, as well as its title, language, and total number of views at that timestamp.

---
Before you start, **make sure that you are logged in with your student account**. Otherwise, you may incur Google Cloud charges for using this notebook.

---


To confirm that you are logged in with your student account, check that you see a letter S in a circle in the upper right-hand corner of this notebook, like in the screenshot below. Of course, the color around your letter S might not be exactly the same ;)

If you are not sure whether you are logged in with your student account, close all your private (i.e. incognito/anonymous) windows, open a new one, log in [here](https://console.cloud.google.com) using the student credentials you got earlier, and then return to this page to continue with the notebook.

![](https://i.imgur.com/PfSHVAb.png)


In [0]:
import datetime
import numpy as np
import pandas as pd
import seaborn as sns

import shutil
from google.cloud import bigquery

#@markdown Copy-paste your GCP Project ID in the following field:

PROJECT = "" #@param {type: "string"}

#@markdown Next, use Shift-Enter to run this cell and to complete authentication.

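try:
  from google.colab import auth
  auth.authenticate_user()
  print("AUTHENTICATED")
except Exception as e:
  #catch Exception instead of using a bare except, so that interrupting the cell still works
  print("FAILED to authenticate:", e)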

<h3> Extract sample data from BigQuery </h3>

Here's a SQL query that returns 10 rows of data: the 10 most viewed combinations of language and article title among the articles with "Google" in the title. The SQL statement uses the LIMIT keyword to restrict the result to 10 rows. Because the ORDER BY clause is applied before LIMIT, the rows you receive are the ones with the highest view totals, in descending order by the number of article views. Notice that the code in the next cell uses the BigQuery API to execute the SQL statement and then stores the response in a Pandas dataframe variable named `trips`.

In [0]:
#create a BigQuery API client
bq = bigquery.Client(project=PROJECT)

#configure the query job to ignore cached results ;)
job_config = bigquery.QueryJobConfig()
job_config.use_query_cache = False

#start the query timer
start = datetime.datetime.now()

#send the SQL statement to BigQuery,
#get the result back, and prepare it
#as a nice looking Python Pandas data frame
trips = bq.query('''
  SELECT 
      language, title, SUM(views) as views 

  FROM `bigquery-samples.wikipedia_benchmark.Wiki10B` 

  WHERE 
   title LIKE '%Google%'

  GROUP BY language, title

  ORDER BY views DESC 

  LIMIT 10
  ''', job_config=job_config).to_dataframe()

#print how much time passed
print(datetime.datetime.now() - start)

#render the result
trips
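
If you would like to see what the GROUP BY, ORDER BY, and LIMIT combination in the query above does without spending any BigQuery resources, here is a minimal sketch that runs a query of the same shape against a tiny in-memory SQLite table. Everything in it is an assumption made for illustration: the `wiki` table, its five toy rows, and the `total_views` alias are hypothetical and are not part of the Wikipedia dataset. Keep in mind that SQLite is not BigQuery, so treat this only as a way to check your intuition about the SQL.

```python
import sqlite3

# Hypothetical toy rows standing in for the real table: (language, title, views)
rows = [
    ("en", "Google", 50),
    ("en", "Google", 70),
    ("de", "Google", 40),
    ("en", "Google Maps", 30),
    ("en", "Python", 500),
]

# Build a throwaway in-memory database (nothing is written to disk)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki (language TEXT, title TEXT, views INTEGER)")
conn.executemany("INSERT INTO wiki VALUES (?, ?, ?)", rows)

# Same shape as the notebook query: filter, aggregate, sort, then cut off.
# Note: SQLite's LIKE ignores ASCII case by default, while BigQuery's LIKE
# is case-sensitive; the toy data avoids mixed-case titles for that reason.
top = conn.execute("""
    SELECT language, title, SUM(views) AS total_views
    FROM wiki
    WHERE title LIKE '%Google%'
    GROUP BY language, title
    ORDER BY total_views DESC
    LIMIT 2
""").fetchall()
conn.close()

print(top)
```

The two "en"/"Google" rows are summed into a single group before sorting, the "Python" row is filtered out by the WHERE clause even though it has the most views, and LIMIT keeps only the top of the sorted list.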

<h3> Recap </h3>

In this sample notebook, you ran a benchmark query against the BigQuery serverless data warehouse. After the BigQuery API returned the results, you saw how long the query took to execute and you looked at the most viewed articles about Google. You also learned how to use the BigQuery Python API client and how to enable or disable the query result cache. The following screenshot illustrates the result you should have received in the previous step.

![](https://i.imgur.com/ljEu9yS.png)
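
The timing pattern from the query cell (record a start time, run the work, print the difference) can also be packaged as a small reusable helper. The sketch below is one possible way to do it using only the Python standard library. The names `timed` and `t` are hypothetical, and the toy workload merely stands in for a call such as `bq.query(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Measure and print the wall-clock time spent inside the with-block."""
    result = {}
    start = time.perf_counter()
    try:
        # Hand a dict to the caller so the elapsed time can be read afterwards
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} s")

# Toy workload standing in for a BigQuery call
with timed("toy workload") as t:
    total = sum(range(1_000_000))
```

After the block finishes, `t["seconds"]` holds the elapsed time, so you can compare runs, for example with the query cache enabled and disabled.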


Copyright 2019 Counter Factual .AI LLC.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.