# Purpose of script

The purpose of this script is to get to know the Google Trends API.

I'll be using a Python wrapper package (`pytrends`) to interface with the API:

https://pypi.org/project/pytrends/

https://github.com/GeneralMills/pytrends

Note: `pytrends` is not an official Google Trends API, so it may be out of date. Google frequently updates its backend, so these scripts will likely need to change over time.

Another generally helpful resource: https://www.holisticseo.digital/python-seo/google-trends/


In [47]:
import os
import pandas as pd
import pytrends
from pytrends.request import TrendReq
import datetime
import boto3

### Initialize API

In [2]:
pytrends_API = TrendReq()

### Look for particular keywords

Some terms to look up

1. covid/covid19
2. coronavirus
3. lockdown
4. quarantine
5. shutdown
6. vaccine

Note: The API only accepts 5 keywords per request, so `shutdown` is dropped from the query.
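If we ever want to query more than 5 terms, one option is to split the list into batches and send one request per batch. Here's a minimal sketch; `chunk_keywords` is a hypothetical helper name, not part of `pytrends`. As far as I understand, values are scaled within each request, so numbers from different batches would not be directly comparable.

```python
def chunk_keywords(keywords, batch_size=5):
    """Split a keyword list into batches of at most `batch_size` items."""
    return [keywords[i:i + batch_size] for i in range(0, len(keywords), batch_size)]


# the full list of terms from above, including both spellings of covid
all_keywords = ["covid", "covid19", "coronavirus", "lockdown",
                "quarantine", "shutdown", "vaccine"]

batches = chunk_keywords(all_keywords)
for batch in batches:
    print(batch)
```

Each batch could then be passed as `kw_list` to its own `build_payload` call.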

In [3]:
keywords = ["covid", "coronavirus", "lockdown", "quarantine", "vaccine"]

In [4]:
pytrends_API.build_payload(kw_list = keywords, geo = "US")

#### Look up prevalence of keywords, by region

The values are on a scale from 0 to 100. A value of 100 marks the location where the term is most popular as a fraction of total searches in that location, and a value of 50 marks a location where it is half as popular. A value of 0 marks a location without enough data for the term.

In [5]:
interest_by_state = pytrends_API.interest_by_region()

In [6]:
interest_by_state.head(20)

geoName,covid,coronavirus,lockdown,quarantine,vaccine
Alabama,40,50,1,2,7
Alaska,50,40,0,2,8
Arizona,43,46,1,2,8
Arkansas,43,47,1,2,7
California,41,47,2,2,8
Colorado,45,45,1,2,7
Connecticut,43,45,1,3,8
Delaware,41,47,1,2,9
District of Columbia,39,46,1,3,11
Florida,40,48,1,2,9
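One thing worth checking in this output is how the five keyword columns relate to each other within a state. The sketch below copies the first five rows of the table into a plain dictionary and sums each row. Each of these rows adds up to 100, which suggests that when several keywords are compared, the values may be each keyword's share within a state rather than a state-to-state ranking. That reading is an inference from this output, not something I've confirmed in the `pytrends` documentation.

```python
# first five rows of the interest_by_region output above
# column order: covid, coronavirus, lockdown, quarantine, vaccine
rows = {
    "Alabama": [40, 50, 1, 2, 7],
    "Alaska": [50, 40, 0, 2, 8],
    "Arizona": [43, 46, 1, 2, 8],
    "Arkansas": [43, 47, 1, 2, 7],
    "California": [41, 47, 2, 2, 8],
}

# total across the five keywords for each state
row_sums = {state: sum(values) for state, values in rows.items()}
print(row_sums)
```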


In [27]:
pytrends_API.interest_over_time()

date,covid,coronavirus,lockdown,quarantine,vaccine,isPartial
2016-02-14,0,0,0,0,0,False
2016-02-21,0,0,0,0,0,False
2016-02-28,0,0,0,0,0,False
2016-03-06,0,0,0,0,0,False
2016-03-13,0,0,0,0,0,False
...,...,...,...,...,...,...
2021-01-10,23,3,0,0,10,False
2021-01-17,22,3,0,0,10,False
2021-01-24,22,3,0,0,10,False
2021-01-31,19,2,0,0,10,False


### Modify timeframe of search

It seems we can also restrict the search to a time period via the `timeframe` parameter of `build_payload`.

https://github.com/GeneralMills/pytrends/issues/211

Example trends link: https://trends.google.com/trends/explore?date=2021-01-10%202021-01-11&geo=US&q=%2Fm%2F0dl567,%2Fm%2F0261x8t
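The `timeframe` value used in this notebook is just two ISO dates separated by a space. Here's a small sketch for building that string from `datetime.date` objects; `make_timeframe` is a hypothetical helper name, not part of `pytrends`.

```python
import datetime


def make_timeframe(start, end):
    """Return 'YYYY-MM-DD YYYY-MM-DD' for two datetime.date values."""
    if end < start:
        raise ValueError("end date must not be before start date")
    return f"{start.isoformat()} {end.isoformat()}"


timeframe = make_timeframe(datetime.date(2020, 3, 20), datetime.date(2020, 3, 22))
print(timeframe)  # 2020-03-20 2020-03-22
```

Passing the same date twice gives a single-day timeframe such as `"2020-03-20 2020-03-20"`.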

Example: Look for searches between 2020-03-20 and 2020-03-22

In [7]:
pytrends_API.build_payload(kw_list = keywords, geo = "US", timeframe = "2020-03-20 2020-03-22")

In [8]:
pytrends_API.interest_over_time()

date,covid,coronavirus,lockdown,quarantine,vaccine,isPartial
2020-03-20,24,99,10,5,1,False
2020-03-21,23,96,5,4,1,False
2020-03-22,25,100,6,4,1,False


In [9]:
pytrends_API.interest_by_region().head()

geoName,covid,coronavirus,lockdown,quarantine,vaccine
Alabama,19,77,2,2,0
Alaska,26,68,3,3,0
Arizona,18,75,4,3,0
Arkansas,20,73,4,2,1
California,18,70,8,4,0


### Can we get both location and time?

We can get a breakdown of search trends by region/state and by time, but can we do both at the same time? For example, can we see the day-by-day breakdown of searches in, say, California?

In [11]:
pytrends_API.build_payload(kw_list = keywords, geo = "US")

Can we do this approach for one day at a time?

In [21]:
pytrends_API.build_payload(kw_list = keywords, geo = "US", timeframe = "2020-03-20 2020-03-20")

In [23]:
pytrends_API.interest_over_time()

date,covid,coronavirus,lockdown,quarantine,vaccine,isPartial
2020-03-20,25,100,10,5,1,False


In [22]:
pytrends_API.interest_by_region()

geoName,covid,coronavirus,lockdown,quarantine,vaccine
Alabama,19,76,2,3,0
Alaska,28,66,3,3,0
Arizona,19,72,6,3,0
Arkansas,20,74,2,3,1
California,18,65,13,4,0
Colorado,21,70,5,4,0
Connecticut,16,77,5,2,0
Delaware,16,73,7,4,0
District of Columbia,22,72,3,2,1
Florida,15,77,5,3,0


### General approach 

We can adopt a looping approach: query one day at a time and read each state's value from the `interest_by_region` call for that day.

Let's start our search at `2020-03-01` (March 1st, 2020) and go up to `2021-02-01` (February 1st, 2021). Note that the date loop in this notebook stops one day short of the end date, since `range` excludes its endpoint. In the actual script, we'd just change the endpoint. This link explains how to create a range of dates: https://www.kite.com/python/answers/how-to-create-a-range-of-dates-in-python

For each query, we can save the result as a DataFrame and then upload it to AWS.
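Once the per-day region tables exist, they can be stacked into one table indexed by both date and state, which is the "location and time" view we were after. This is a minimal sketch, assuming each day's `interest_by_region` result is a DataFrame indexed by `geoName`; `combine_daily_regions` is a hypothetical helper name. The sample values are the Alabama and Alaska rows from the 2020-03-20 output above.

```python
import pandas as pd


def combine_daily_regions(frames_by_date):
    """Stack {date string: region DataFrame} into one frame indexed by (date, geoName)."""
    return pd.concat(frames_by_date, names=["date", "geoName"])


# two rows copied from the single-day interest_by_region output
march_20 = pd.DataFrame(
    {
        "covid": [19, 28],
        "coronavirus": [76, 66],
        "lockdown": [2, 3],
        "quarantine": [3, 3],
        "vaccine": [0, 0],
    },
    index=pd.Index(["Alabama", "Alaska"], name="geoName"),
)

combined = combine_daily_regions({"2020-03-20": march_20})
print(combined)
```

With more days in the dictionary, `combined.xs("Alaska", level="geoName")` would give the day-by-day values for a single state.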

#### Initialize API, get keywords

In [38]:
pytrends_API = TrendReq()

In [39]:
keywords = ["covid", "coronavirus", "lockdown", "quarantine", "vaccine"]

#### Get array of dates

In [32]:
start_date = datetime.date(2020, 3, 1)
end_date = datetime.date(2021, 2, 1)
total_num_dates = (end_date - start_date).days

In [35]:
dates_arr = []

In [36]:
for day in range(total_num_dates):
    new_date = (start_date + datetime.timedelta(days = day)).isoformat()
    dates_arr.append(new_date)
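The same list can be built in one expression. The sketch below also adds an optional `inclusive` flag in case we want the end date included; `date_range` is a hypothetical helper name, and the flag is my addition rather than something the loop above does.

```python
import datetime


def date_range(start, end, inclusive=False):
    """Return ISO date strings from `start` up to `end` (end excluded by default)."""
    num_days = (end - start).days + (1 if inclusive else 0)
    return [
        (start + datetime.timedelta(days=offset)).isoformat()
        for offset in range(num_days)
    ]


dates = date_range(datetime.date(2020, 3, 1), datetime.date(2021, 2, 1))
print(len(dates), dates[0], dates[-1])  # 337 2020-03-01 2021-01-31
```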

#### Get payload

Here's an example script detailing how to get the payload, with a custom set of keywords and a single date

`pytrends_API.build_payload(kw_list = keywords, geo = "US", timeframe = "2020-03-20 2020-03-20")`


In [41]:
pytrends_API.build_payload(kw_list = keywords, geo = "US", timeframe = "2020-03-20 2020-03-20")

In [44]:
pytrends_API.interest_over_time()

date,covid,coronavirus,lockdown,quarantine,vaccine,isPartial
2020-03-20,25,100,10,5,1,False


In [45]:
pytrends_API.interest_by_region()

geoName,covid,coronavirus,lockdown,quarantine,vaccine
Alabama,19,76,2,3,0
Alaska,28,66,3,3,0
Arizona,19,72,6,3,0
Arkansas,20,74,2,3,1
California,18,65,13,4,0
Colorado,21,70,5,4,0
Connecticut,16,77,5,2,0
Delaware,16,73,7,4,0
District of Columbia,22,72,3,2,1
Florida,15,77,5,3,0


#### Create script

Let's loop through all the dates, query the API for each date, get the `interest_over_time` and `interest_by_region` values, and export them to AWS.

Log into AWS

In [48]:
AWS_BUCKET = os.environ["AWS_BUCKET"]
AWS_ACCESS = os.environ["AWS_ACCESS"]
AWS_SECRET = os.environ["AWS_SECRET"]

In [72]:
s3 = boto3.client('s3',
                  aws_access_key_id=AWS_ACCESS,
                  aws_secret_access_key=AWS_SECRET)

Get paths to save file locally and in AWS

In [70]:
FRESH_SCRAPES_FILENAME = ""
INTEREST_OVER_TIME_FILENAME = "interest_over_time_"
INTEREST_BY_REGION_FILENAME = "interest_by_region_"

In [71]:
LOCAL_SCRAPES_DIR = "../../../tweets/google_API_scrapes/"
AWS_SCRAPES_DIR = "google_API_scrapes/"

LOCAL_TIME_FILENAME = LOCAL_SCRAPES_DIR + INTEREST_OVER_TIME_FILENAME
LOCAL_REGION_FILENAME = LOCAL_SCRAPES_DIR + INTEREST_BY_REGION_FILENAME

AWS_TIME_FILENAME = AWS_SCRAPES_DIR + INTEREST_OVER_TIME_FILENAME
AWS_REGION_FILENAME = AWS_SCRAPES_DIR + INTEREST_BY_REGION_FILENAME
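Each date gets its own file, named by appending the date and `.csv` to these prefixes. Here's a small sketch of that naming scheme; `scrape_paths` is a hypothetical helper name, and its default directories simply repeat the values defined above.

```python
def scrape_paths(prefix, date,
                 local_dir="../../../tweets/google_API_scrapes/",
                 aws_dir="google_API_scrapes/"):
    """Return (local path, S3 key) for one date's CSV file."""
    filename = f"{prefix}{date}.csv"
    return local_dir + filename, aws_dir + filename


local_path, aws_key = scrape_paths("interest_by_region_", "2020-03-01")
print(local_path)
print(aws_key)
```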


#### Create loop

In [67]:
pytrends_API = TrendReq()

In [68]:
keywords = ["covid", "coronavirus", "lockdown", "quarantine", "vaccine"]

In [None]:
for date in dates_arr:

    # local paths and S3 keys for this date's two CSV files
    local_time_path = LOCAL_TIME_FILENAME + date + ".csv"
    local_region_path = LOCAL_REGION_FILENAME + date + ".csv"
    aws_time_key = AWS_TIME_FILENAME + date + ".csv"
    aws_region_key = AWS_REGION_FILENAME + date + ".csv"

    try:

        print(f"Getting Google API trends data for {date}")

        # start a fresh session for each request
        pytrends_API = TrendReq()

        # build payload for a single day
        pytrends_API.build_payload(kw_list = keywords, geo = "US", timeframe = f"{date} {date}")

        # get trends over time and by region, and export locally
        pytrends_API.interest_over_time().to_csv(local_time_path)
        pytrends_API.interest_by_region().to_csv(local_region_path)

        # upload to AWS
        s3.upload_file(local_time_path, AWS_BUCKET, aws_time_key)
        s3.upload_file(local_region_path, AWS_BUCKET, aws_region_key)

        # delete local versions of the files
        os.remove(local_time_path)
        os.remove(local_region_path)

        print(f"Finished getting Google API trends data for {date}")

    except Exception as e:
        # log the failure and move on to the next date
        print("Error in getting Google API trends data")
        print(e)
        print(f"Error occurred for the following date: {date}")

Getting Google API trends data for 2020-03-01
Finished getting Google API trends data for 2020-03-01
Getting Google API trends data for 2020-03-02
Finished getting Google API trends data for 2020-03-02
Getting Google API trends data for 2020-03-03
Finished getting Google API trends data for 2020-03-03
Getting Google API trends data for 2020-03-04
Finished getting Google API trends data for 2020-03-04
Getting Google API trends data for 2020-03-05
Finished getting Google API trends data for 2020-03-05
Getting Google API trends data for 2020-03-06
Finished getting Google API trends data for 2020-03-06
Getting Google API trends data for 2020-03-07
Finished getting Google API trends data for 2020-03-07
Getting Google API trends data for 2020-03-08
Finished getting Google API trends data for 2020-03-08
Getting Google API trends data for 2020-03-09
Finished getting Google API trends data for 2020-03-09
Getting Google API trends data for 2020-03-10
Finished getting Google API trends data for 2