<a href="https://colab.research.google.com/github/prof-rossetti/intro-to-python/blob/main/notebooks/applied-ds/Processing_Big_Data_in_Google_BigQuery_with_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Setup

Complete these prerequisite setup steps before moving on:

  1. Log in to the [Google Cloud console](https://console.cloud.google.com) with your Google account.

  2. [Create a new project](https://console.cloud.google.com/projectcreate), and note its name / identifier (i.e. the `PROJECT_ID`).

  3. From the "APIs and Services" menu, search for and enable the "BigQuery API".



## Authorization

In [4]:
from google.colab import auth
auth.authenticate_user()

In [1]:
# use your own project_id (press enter to accept the default):
PROJECT_ID = input("Google Cloud Project Name / ID: ") or "intro-to-python-2021"
print("GOOGLE CLOUD PROJECT:", PROJECT_ID)

Google Cloud Project Name / ID: 
GOOGLE CLOUD PROJECT: intro-to-python-2021


## Fetching Data

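We can use the pandas `read_gbq` function to execute an SQL query against a specified Google BigQuery database, and return the result as a pandas `DataFrame` object. (As of pandas 2.2, this function is deprecated in favor of `pandas_gbq.read_gbq`, which accepts the same `project_id` argument used here.)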

Be aware that every time we execute a query, it might use up some of your credits, based on the amount of data processed and/or returned.
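
One way to gauge a query before spending credits is a dry run, which asks BigQuery to validate the query and report how many bytes it would process without executing it. The sketch below assumes the `google-cloud-bigquery` client library is available (it is typically preinstalled in Colab). The helper names `estimate_bytes_processed` and `format_bytes` are hypothetical, introduced only for illustration.

```python
def estimate_bytes_processed(sql, project_id):
    """Dry-run `sql` and return the bytes BigQuery reports it would process."""
    # Imported lazily so format_bytes() below works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    # dry_run=True validates the query and returns statistics without running it.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    query_job = client.query(sql, job_config=job_config)
    return query_job.total_bytes_processed


def format_bytes(num_bytes):
    """Format a byte count using binary units (B, KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ["B", "KiB", "MiB", "GiB", "TiB"]:
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Example (requires authentication and a valid PROJECT_ID):
# n_bytes = estimate_bytes_processed(sql, PROJECT_ID)
# print(format_bytes(n_bytes))
```

The estimate covers data scanned, which is not necessarily the same as the number of rows returned.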

However, after we have the data in memory, we can work with the dataframe as much as we want.
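
To avoid paying for the same query again after a notebook restart, the result can also be saved to disk and reloaded. This is a minimal sketch, not part of the original notebook: `cached_result` is a hypothetical helper that pickles whatever the `fetch` callable returns, and the file name in the usage example is an assumption.

```python
import pickle
from pathlib import Path


def cached_result(cache_path, fetch):
    """Return the pickled result at `cache_path`, calling `fetch()` only on a cache miss."""
    path = Path(cache_path)
    if path.exists():
        # cache hit: reuse the saved result instead of re-running the query
        with path.open("rb") as f:
            return pickle.load(f)
    # cache miss: run the (possibly billable) fetch once and save its result
    result = fetch()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Example (requires authentication and a valid PROJECT_ID):
# trends_df = cached_result(
#     "trends.pkl", lambda: read_gbq(sql, project_id=PROJECT_ID)
# )
```

The cache never expires on its own, so delete the file whenever fresh data is needed. Only load pickle files you created yourself.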

### Google Trends

For this example, we will use a public dataset in the BigQuery environment called `google_trends`, which holds information about top weekly search trends across a variety of US media markets (i.e. designated market areas, or DMAs).

Counting the number of rows in the top terms table:

In [8]:
from pandas import read_gbq

sql = f"""
    SELECT count(*) as row_count
    FROM `bigquery-public-data.google_trends.top_terms`
"""
results_df = read_gbq(sql, project_id=PROJECT_ID)
results_df #> 43,623,983

Unnamed: 0,row_count
0,43623983


Reading data from the top terms table:

In [7]:
from pandas import read_gbq

TERMS_LIMIT = 100

sql = f"""
    SELECT *
    FROM `bigquery-public-data.google_trends.top_terms`
    -- ORDER BY week, dma_name, score
    LIMIT {int(TERMS_LIMIT)}
"""

trends_df = read_gbq(sql, project_id=PROJECT_ID)
trends_df.head()

Unnamed: 0,week,score,rank,refresh_date,dma_name,dma_id,term
0,2018-07-01,45,1,2023-05-17,Portland-Auburn ME,500,Lakers
1,2019-06-30,31,1,2023-05-17,Portland-Auburn ME,500,Lakers
2,2020-08-16,30,1,2023-05-17,Portland-Auburn ME,500,Lakers
3,2020-09-13,31,1,2023-05-17,Portland-Auburn ME,500,Lakers
4,2020-09-20,49,1,2023-05-17,Portland-Auburn ME,500,Lakers


#### Markets

Unique markets:

In [17]:
from pandas import read_gbq

sql = f"""
    SELECT DISTINCT dma_name, dma_id
    FROM `bigquery-public-data.google_trends.top_terms`
    ORDER BY dma_id
"""

markets_df = read_gbq(sql, project_id=PROJECT_ID)
print(len(markets_df))
markets_df.head()

210
['Abilene-Sweetwater TX', 'Albany GA', 'Albany-Schenectady-Troy NY', 'Albuquerque-Santa Fe NM', 'Alexandria LA', 'Alpena MI', 'Amarillo TX', 'Anchorage AK', 'Atlanta GA', 'Augusta GA', 'Austin TX', 'Bakersfield CA', 'Baltimore MD', 'Bangor ME', 'Baton Rouge LA', 'Beaumont-Port Arthur TX', 'Bend OR', 'Billings MT', 'Biloxi-Gulfport MS', 'Binghamton NY', 'Birmingham (Ann and Tusc) AL', 'Bluefield-Beckley-Oak Hill WV', 'Boise ID', 'Boston MA-Manchester NH', 'Bowling Green KY', 'Buffalo NY', 'Burlington VT-Plattsburgh NY', 'Butte-Bozeman MT', 'Casper-Riverton WY', 'Cedar Rapids-Waterloo-Iowa City & Dubuque IA', 'Champaign & Springfield-Decatur IL', 'Charleston SC', 'Charleston-Huntington WV', 'Charlotte NC', 'Charlottesville VA', 'Chattanooga TN', 'Cheyenne WY-Scottsbluff NE', 'Chicago IL', 'Chico-Redding CA', 'Cincinnati OH', 'Clarksburg-Weston WV', 'Cleveland-Akron (Canton) OH', 'Colorado Springs-Pueblo CO', 'Columbia SC', 'Columbia-Jefferson City MO', 'Columbus GA', 'Columbus OH', '

Unnamed: 0,dma_name,dma_id
0,Portland-Auburn ME,500
1,New York NY,501
2,Binghamton NY,502
3,Macon GA,503
4,Philadelphia PA,504


In [19]:
markets_df[ markets_df["dma_name"].str.contains("Washington DC") ]

Unnamed: 0,dma_name,dma_id
11,Washington DC (Hagerstown MD),511
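
By default, `str.contains` treats its argument as a regular expression, which matters here because some market names include parentheses. The sketch below wraps the lookup in a hypothetical `find_markets` helper that matches literal text and ignores case. The small `sample_df` is a stand-in built from rows shown in this notebook, so the example runs without querying BigQuery.

```python
import pandas as pd


def find_markets(markets_df, text):
    """Return rows whose dma_name contains `text` (literal match, case-insensitive)."""
    # regex=False treats characters such as "(" literally rather than as regex syntax
    mask = markets_df["dma_name"].str.contains(text, case=False, regex=False)
    return markets_df[mask]


# stand-in for markets_df, using rows shown in this notebook
sample_df = pd.DataFrame({
    "dma_name": ["Portland-Auburn ME", "New York NY", "Washington DC (Hagerstown MD)"],
    "dma_id": [500, 501, 511],
})

matches = find_markets(sample_df, "washington dc (")
print(matches["dma_id"].tolist())
```

With the default `regex=True`, the same search text would raise an error because of the unbalanced parenthesis.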


#### Weeks

Unique weeks, most recent first:

In [33]:
from pandas import read_gbq

sql = f"""
    SELECT DISTINCT week
    FROM `bigquery-public-data.google_trends.top_terms`
    ORDER BY week DESC
"""

weeks_df = read_gbq(sql, project_id=PROJECT_ID)
weeks_df

Unnamed: 0,week
0,2023-06-11
1,2023-06-04
2,2023-05-28
3,2023-05-21
4,2023-05-14
...,...
260,2018-06-17
261,2018-06-10
262,2018-06-03
263,2018-05-27


In [49]:
latest_week = weeks_df["week"].max()
earliest_week = weeks_df["week"].min()

print(earliest_week, "...", latest_week)

2018-05-20 ... 2023-06-11
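
Given the earliest and latest weeks printed above, a quick sanity check is to count how many weekly dates should exist in that range and look for gaps. This is a sketch using only the standard library. `expected_weeks` and `missing_weeks` are hypothetical helpers, and the commented usage assumes the `week` column holds `datetime.date` values.

```python
from datetime import date, timedelta


def expected_weeks(earliest, latest):
    """List every weekly date from `earliest` to `latest`, inclusive."""
    n_steps = (latest - earliest).days // 7
    return [earliest + timedelta(weeks=i) for i in range(n_steps + 1)]


def missing_weeks(observed, earliest, latest):
    """Return the expected weekly dates that do not appear in `observed`."""
    observed_set = set(observed)
    return [w for w in expected_weeks(earliest, latest) if w not in observed_set]


all_weeks = expected_weeks(date(2018, 5, 20), date(2023, 6, 11))
print(len(all_weeks))

# Example with the dataframe above:
# print(missing_weeks(weeks_df["week"], earliest_week, latest_week))
```

An empty list from `missing_weeks` would indicate that every week in the range is present.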


#### Refresh Dates

In [46]:
from pandas import read_gbq

sql = f"""
    SELECT DISTINCT refresh_date
    FROM `bigquery-public-data.google_trends.top_terms`
    ORDER BY refresh_date DESC
"""

refresh_df = read_gbq(sql, project_id=PROJECT_ID)
refresh_df.head()

Unnamed: 0,refresh_date
0,2023-06-15
1,2023-06-14
2,2023-06-13
3,2023-06-12
4,2023-06-11


In [47]:
latest_refresh_date = refresh_df["refresh_date"].max()
print(latest_refresh_date)

2023-06-15


#### Top Terms

Top terms in a given market during a given week:

In [54]:
from pandas import read_gbq

dma_id = 511 # update as desired

sql = f"""
    SELECT week, refresh_date, rank, score, dma_name, dma_id, term
    FROM `bigquery-public-data.google_trends.top_terms`
    WHERE dma_id = {int(dma_id)}
        AND week = '{latest_week}'
        AND refresh_date = '{latest_refresh_date}'
    ORDER BY week DESC, dma_id, rank
"""

trends_df = read_gbq(sql, project_id=PROJECT_ID)
trends_df.index = trends_df["rank"]
trends_df.head()

rank,week,refresh_date,rank,score,dma_name,dma_id,term
1,2023-06-11,2023-06-15,1,,Washington DC (Hagerstown MD),511,Eminem's daughter Alaina
2,2023-06-11,2023-06-15,2,10.0,Washington DC (Hagerstown MD),511,Nations League
3,2023-06-11,2023-06-15,3,,Washington DC (Hagerstown MD),511,White House press secretary
4,2023-06-11,2023-06-15,4,,Washington DC (Hagerstown MD),511,Bradley Beal
5,2023-06-11,2023-06-15,5,,Washington DC (Hagerstown MD),511,Rachel Maddow
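
The query above interpolates values directly into the SQL string. An alternative is to use named query parameters, passed through the `configuration` argument of `read_gbq`. The sketch below builds that configuration in the JSON format used by the BigQuery REST API, as I understand the pandas-gbq documentation to describe it. `named_param` and `named_query_config` are hypothetical helpers, and the parameter values are the ones shown in this notebook.

```python
def named_param(name, type_name, value):
    """Describe one named query parameter in BigQuery's REST (JSON) format."""
    return {
        "name": name,
        "parameterType": {"type": type_name},
        # the REST format carries scalar values as strings
        "parameterValue": {"value": str(value)},
    }


def named_query_config(params):
    """Wrap parameter descriptions in a query configuration dictionary."""
    return {"query": {"parameterMode": "NAMED", "queryParameters": list(params)}}


# @name placeholders replace the f-string interpolation
sql = """
    SELECT week, refresh_date, rank, score, dma_name, dma_id, term
    FROM `bigquery-public-data.google_trends.top_terms`
    WHERE dma_id = @dma_id
        AND week = @week
        AND refresh_date = @refresh_date
    ORDER BY rank
"""

query_config = named_query_config([
    named_param("dma_id", "INT64", 511),
    named_param("week", "DATE", "2023-06-11"),
    named_param("refresh_date", "DATE", "2023-06-15"),
])

# Example (requires authentication and a valid PROJECT_ID):
# trends_df = read_gbq(sql, project_id=PROJECT_ID, configuration=query_config)
```

The dates are declared as `DATE` parameters because a `STRING` parameter is not implicitly converted when compared against a `DATE` column.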


In [55]:
len(trends_df)

25

In [57]:
trends_df["term"]

rank
1        Eminem's daughter Alaina
2                  Nations League
3     White House press secretary
4                    Bradley Beal
5                   Rachel Maddow
6                 Secret Invasion
7                       DJ Khaled
8                     Adam Schiff
9                     Hepatitis A
10                   Trevor Bauer
11                Yankees vs Mets
12                 Olivia Rodrigo
13                    John Romita
14                          Trump
15           Vegas Golden Knights
16                Cormac McCarthy
17                       Flag Day
18    Dwyane Wade Gabrielle Union
19                      Ja Morant
20                Denver shooting
21                      Joe Biden
22                    Ezra Miller
23                   Stefon Diggs
24                    Tyler Perry
25                    Phil Kessel
Name: term, dtype: object