### Getting Data from Google Trends
This notebook outlines the steps to download the Google Trends data.

Note: The `pytrends` Python package is required to run this notebook. The package and its dependencies could not be installed on this JupyterHub, so the notebook will not run as expected here.

In [1]:
from pytrends.request import TrendReq, exceptions
import pandas as pd
import time

First, create a list of the keywords we want to query on Google Trends.

In [2]:
keywords = ['dengue', 
           'dengue fever', 
           'bone pain', 
           'rain', 
           'mosquito bite', 
           'fever', 
           'rashes', 
           'rash', 
           'mosquito']

We also create a list of timeframes that represent the start and end of each epidemiological year from 2012 to 2022. 

Dates were referenced from the Central Massachusetts Mosquito Control Project: https://www.cmmcp.org/mosquito-surveillance-data/pages/epi-week-calendars-2008-2023

In [3]:
timeframes = ['2022-01-02 2022-12-31',
              '2021-01-03 2022-01-01',
              '2019-12-29 2021-01-02',
              '2018-12-30 2019-12-28',
              '2017-12-31 2018-12-29',
              '2017-01-01 2017-12-30',
              '2016-01-03 2016-12-31',
              '2015-01-04 2016-01-02',
              '2013-12-29 2015-01-03',
              '2012-12-30 2013-12-28',
              '2012-01-01 2012-12-29']
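
The same timeframes can also be derived rather than typed by hand. The sketch below assumes the MMWR convention, in which epidemiological weeks run Sunday to Saturday and week 1 is the first week with at least four days in the calendar year. The helper names `epi_year_start` and `epi_year_timeframe` are ours, not part of any library.

```python
from datetime import date, timedelta

def epi_year_start(year):
    """Return the Sunday that starts epidemiological week 1 of `year`.

    Assumes the MMWR rule: week 1 is the Sunday-to-Saturday week containing 4 January.
    """
    jan4 = date(year, 1, 4)
    # weekday() is Monday=0 ... Sunday=6, so this counts days since the last Sunday
    days_since_sunday = (jan4.weekday() + 1) % 7
    return jan4 - timedelta(days=days_since_sunday)

def epi_year_timeframe(year):
    """Return 'YYYY-MM-DD YYYY-MM-DD' covering the whole epidemiological year."""
    start = epi_year_start(year)
    end = epi_year_start(year + 1) - timedelta(days=1)  # Saturday before next year's week 1
    return f'{start.isoformat()} {end.isoformat()}'

# Most recent year first, matching the order of the hand-written list
generated_timeframes = [epi_year_timeframe(y) for y in range(2022, 2011, -1)]
```

Under this rule, `generated_timeframes` reproduces the eleven entries listed above, including the two 53-week years (2014 and 2020).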

Now we are ready to use the `pytrends` package to query Google Trends. Google Trends has several "quirks" worth keeping in mind (some of which we found out along the way):

1. Querying by specific dates that are about a year apart will result in weekly Trends results
2. Querying for a particular time period gives the relative popularity of the search term on a scale of 0 to 100, not an absolute number of queries
3. Querying more than one search term for a particular time period gives the relative popularity across *all* search terms throughout the time period
4. However, only up to 5 terms can be searched in one go
5. Google Trends will limit the rate at which API calls/requests can be made

As such, we decided to (1) query each epidemiological year separately to obtain weekly results, and (2) query each search term separately, so that each term is scaled against its own peak rather than against the other terms in a batch of up to 5.
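
To see why querying terms together can flatten the less popular ones, here is a minimal sketch with made-up counts. It assumes a simplified model in which values are scaled so that the highest point equals 100. The actual Google Trends normalisation involves more steps, so treat this only as an illustration of the scaling effect.

```python
import pandas as pd

# Hypothetical raw weekly search counts (invented numbers, for illustration only)
raw = pd.DataFrame({'fever': [800, 1000, 900],
                    'bone pain': [8, 10, 4]})

# Joint query: every term is scaled by the single highest value across all terms
joint = (raw / raw.max().max() * 100).round().astype(int)

# Separate queries: each term is scaled by its own peak
separate = (raw / raw.max() * 100).round().astype(int)

print(joint)
print(separate)
```

In the joint version the rarer term collapses to values near 0, while in the separate version its week-to-week pattern is preserved. The trade-off is that separately scaled terms can no longer be compared with each other in magnitude.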

In [4]:
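def get_trends_data(timeframe, keyword):
    """Query one keyword over one timeframe, retrying until a result is returned."""
    res = None
    while res is None:
        try:
            pytrends = TrendReq(hl='en-SG')  # new session per attempt so as not to get rate limited by Trends
            # build_payload may itself fail when rate limited, so keep it inside the try
            pytrends.build_payload([keyword], cat=0, geo='SG', timeframe=timeframe)
            res = pytrends.interest_over_time()
        except Exception:  # unlike a bare except, this still lets Ctrl+C interrupt the loop
            time.sleep(3)  # pause before retrying
    return res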
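
The loop above retries indefinitely with a fixed pause. A bounded alternative is sketched below: a generic helper (the name `retry_with_backoff` and its parameters are hypothetical, not part of `pytrends`) that doubles the wait after each failure and gives up after a set number of attempts. It takes any zero-argument callable, so it can be tried out without network access.

```python
import time

def retry_with_backoff(fetch, max_attempts=5, base_delay=3.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The wait doubles after each failure: base_delay, 2 * base_delay, and so on.
    `sleep` is injectable so the helper can be tested without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Example with a stand-in function that fails twice before succeeding
attempts = {'count': 0}

def flaky_fetch():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('simulated rate limit')
    return 'ok'

recorded_waits = []
result = retry_with_backoff(flaky_fetch, sleep=recorded_waits.append)
```

A single Trends request could be wrapped in a small function or lambda and passed in as `fetch`. Whether exponential backoff actually helps with Google Trends rate limits is something we did not test.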

In [5]:
yearly = []
for i, timeframe in enumerate(timeframes):
    for j, keyword in enumerate(keywords):
        print(f'Timeframe {i+1}/{len(timeframes)} @ Keyword {j+1}/{len(keywords)}', end='\r', flush=True)
        yearly.append(get_trends_data(timeframe, keyword))

Timeframe 11/11 @ Keyword 9/9

After obtaining the raw data from Google Trends, we combine the per-keyword results for each epidemiological year side by side, then stack the years into a single dataframe.

In [6]:
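dfs = []
n_keywords = len(keywords)  # results arrive in blocks of one timeframe x all keywords
for i in range(0, len(yearly), n_keywords):
    dfs.append(pd.concat(yearly[i:i+n_keywords], axis=1).drop('isPartial', axis=1))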

df = pd.concat(dfs).reset_index()
df

Unnamed: 0,date,dengue,dengue fever,bone pain,rain,mosquito bite,fever,rashes,rash,mosquito
0,2022-01-02,10,0,70,85,81,54,27,66,39
1,2022-01-09,19,0,91,36,31,60,100,63,38
2,2022-01-16,11,18,0,44,23,64,76,73,48
3,2022-01-23,16,0,0,32,70,65,53,69,47
4,2022-01-30,11,0,49,52,69,66,75,75,45
...,...,...,...,...,...,...,...,...,...,...
569,2012-11-25,53,48,21,40,0,92,45,61,57
570,2012-12-02,37,34,0,52,50,90,60,48,70
571,2012-12-09,43,48,77,49,80,94,58,75,46
572,2012-12-16,53,45,0,59,100,86,56,62,59
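
The reshaping step can be checked on a small scale without querying Google Trends. The sketch below builds stand-in frames with the same layout we observed from `interest_over_time()` (a `date` index, one keyword column, and an `isPartial` column). All values are invented and `toy_frame` is a hypothetical helper.

```python
import pandas as pd

def toy_frame(keyword, dates, values):
    """Build a stand-in for one query result (invented values)."""
    idx = pd.DatetimeIndex(dates, name='date')
    return pd.DataFrame({keyword: values, 'isPartial': False}, index=idx)

toy_keywords = ['dengue', 'rain']
weeks_a = ['2022-01-02', '2022-01-09']
weeks_b = ['2021-01-03', '2021-01-10']

# Same ordering as the download loop: all keywords for one timeframe, then the next
toy_yearly = [toy_frame('dengue', weeks_a, [1, 2]),
              toy_frame('rain', weeks_a, [3, 4]),
              toy_frame('dengue', weeks_b, [5, 6]),
              toy_frame('rain', weeks_b, [7, 8])]

n = len(toy_keywords)
wide = []
for i in range(0, len(toy_yearly), n):
    block = [frame.drop(columns='isPartial') for frame in toy_yearly[i:i + n]]
    wide.append(pd.concat(block, axis=1))   # keywords side by side for one timeframe

toy_df = pd.concat(wide).reset_index()      # timeframes stacked on top of each other
print(toy_df)
```

Each row of `toy_df` is one week and each keyword has its own column, which mirrors the structure of the full table above.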


### Save the downloaded data as a CSV

Saving the dataframe lets us easily read it back for subsequent cleaning.

In [8]:
df.to_csv('../../data/trends_monthly.csv', index=False)
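
When the file is read back, the `date` column comes in as plain text unless it is parsed explicitly. A minimal round-trip sketch, using an in-memory buffer and a two-row stand-in dataframe with invented values instead of the real file:

```python
import io
import pandas as pd

# Stand-in for the real dataframe (invented values)
toy = pd.DataFrame({'date': pd.to_datetime(['2022-01-02', '2022-01-09']),
                    'dengue': [1, 2]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # same call as above, but writing to memory
buffer.seek(0)

# parse_dates turns the 'date' column back into datetimes on load
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```

For the real data, the same `parse_dates=['date']` argument can be passed to `pd.read_csv` along with the path used above.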