# Lab 1: Playing with Google Trends


The goal of this lab is to collect Google Trends data using [PyTrends](https://pypi.org/project/pytrends/).

This lab was written by Dr. Jisun AN (jisunan@smu.edu.sg) and Dr. Haewoon KWAK (hkwak@smu.edu.sg).

# Install

In [None]:
!pip install pytrends

In [None]:
!pip install matplotlib

In [None]:
!pip install plotly

# Set logger

This lets us see what is happening inside the third-party library.

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                    datefmt='%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)
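
With the root logger at DEBUG, every library that logs will print its messages. If that turns out to be too noisy, one option is to raise the level for a single library while keeping DEBUG elsewhere. The sketch below assumes the chatty logger is `urllib3` (the HTTP layer underneath `requests`); the helper name `set_library_log_level` is made up for illustration.

```python
import logging


def set_library_log_level(library_name, level):
    """Set the log level for one library's logger and return that logger."""
    library_logger = logging.getLogger(library_name)
    library_logger.setLevel(level)
    return library_logger


# Assumption: the HTTP chatter comes from urllib3; swap in whichever
# logger name shows up in the %(name)s column of your own output.
quiet_logger = set_library_log_level("urllib3", logging.WARNING)
print(quiet_logger.name, logging.getLevelName(quiet_logger.level))
```

Calling the helper again with `logging.DEBUG` restores the detailed output.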

# Connect to Google

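We use language `en-US` and timezone offset `-480` for Singapore. Following Google's convention, `tz` is the offset from UTC in minutes with the sign reversed, so UTC+8 becomes -480.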

In [None]:
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=-480)
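
The sign convention is easy to get wrong, so here is a minimal sketch of the conversion from a UTC offset in hours to the `tz` value. The function name `google_trends_tz` is hypothetical and not part of PyTrends.

```python
def google_trends_tz(utc_offset_hours):
    """Convert a UTC offset in hours (e.g. +8 for Singapore) to the minutes value used for `tz`."""
    # Sign is reversed: zones ahead of UTC get a negative value.
    return int(round(-utc_offset_hours * 60))


print(google_trends_tz(8))   # Singapore, UTC+8
print(google_trends_tz(-6))  # US Central Standard Time, UTC-6
```

The result can be passed straight to `TrendReq(hl='en-US', tz=...)`.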

# Collect the Google Trends query response using PyTrends

We collect all the data that is accessible through the web interface, which is:

1. Interest over time
2. Interest by city (region)
3. Related topics
4. Related queries

![page.png](attachment:page.png)

https://trends.google.com/trends/explore?date=2020-12-04%202021-01-03&geo=SG&q=new%20year&hl=en

## Setting common parameters

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='SG', timeframe='2020-12-04 2021-01-03', cat=0)
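
The `timeframe` argument above is a start date and an end date separated by a space. When the dates come from variables rather than being typed by hand, a small helper keeps the format consistent. This is only a sketch; `make_timeframe` is a made-up name.

```python
from datetime import date


def make_timeframe(start, end):
    """Build a 'YYYY-MM-DD YYYY-MM-DD' string from two date objects."""
    if start > end:
        raise ValueError("start date must not be after end date")
    return f"{start.isoformat()} {end.isoformat()}"


timeframe = make_timeframe(date(2020, 12, 4), date(2021, 1, 3))
print(timeframe)
```

The resulting string can be used as `timeframe=timeframe` in `build_payload`.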

## 1. Interest over time

In [None]:
df = pytrends.interest_over_time()
df.head(n=10)

In [None]:
df.to_csv('1-over-time.csv')
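
Once the file is saved, a quick sanity check is to find the day with the highest interest. The sketch below uses only the standard library and assumes the CSV has a `date` column plus one column named after the keyword, which is how the DataFrame above is expected to be written out. `peak_row` is a hypothetical helper.

```python
import csv
import os


def peak_row(lines, keyword, date_column="date"):
    """Return (date, value) for the row with the highest value, or None if there are no rows."""
    rows = list(csv.DictReader(lines))
    if not rows:
        return None
    best = max(rows, key=lambda row: float(row[keyword]))
    return best[date_column], float(best[keyword])


csv_path = "1-over-time.csv"
if os.path.exists(csv_path):  # only runs after the cell above has saved the file
    with open(csv_path, newline="", encoding="utf-8") as handle:
        print(peak_row(handle, "new year"))
```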

### Exercise 1. Change the country from Singapore to three other countries that you are curious about

Two-letter country codes (ISO 3166-1 alpha-2) are listed at https://en.wikipedia.org/wiki/ISO_3166-2#:~:text=It%20was%20first%20published%20in,form%20than%20their%20full%20names.

Don't forget to replace XX, YY, and ZZ in both `geo` and the filename with your chosen countries.

In [None]:
keywords = ["new year"]

In [None]:
pytrends.build_payload(keywords, geo='XX', timeframe='2020-12-04 2021-01-03', cat=0)
df = pytrends.interest_over_time()

df.to_csv('1-1-XX.csv')

In [None]:
pytrends.build_payload(keywords, geo='YY', timeframe='2020-12-04 2021-01-03', cat=0)
df = pytrends.interest_over_time()

df.to_csv('1-2-YY.csv')

In [None]:
pytrends.build_payload(keywords, geo='ZZ', timeframe='2020-12-04 2021-01-03', cat=0)
df = pytrends.interest_over_time()

df.to_csv('1-3-ZZ.csv')
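
The three cells above differ only in the country code, so they can also be written as a loop. The following is a sketch rather than the expected solution; `save_interest_by_country` is a made-up helper that takes the `pytrends` client as an argument.

```python
def save_interest_by_country(client, keywords, country_codes, timeframe, cat=0):
    """Run the same query once per country and save each result to its own CSV file."""
    saved_files = []
    for position, code in enumerate(country_codes, start=1):
        client.build_payload(keywords, geo=code, timeframe=timeframe, cat=cat)
        df = client.interest_over_time()
        filename = f"1-{position}-{code}.csv"  # same naming scheme as the cells above
        df.to_csv(filename)
        saved_files.append(filename)
    return saved_files


# Usage in the notebook (replace the placeholder codes first):
# save_interest_by_country(pytrends, ["new year"], ["XX", "YY", "ZZ"], "2020-12-04 2021-01-03")
```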

## 2. Interest by city

Unfortunately, Google does not provide a fine-grained subregion view for Singapore, so let's try Australia instead.

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='AU', timeframe='2020-12-04 2021-01-03', cat=0)

In [None]:
df = pytrends.interest_by_region(resolution='REGION', inc_low_vol=True, inc_geo_code=False)
df.head()

In [None]:
df.to_csv('2-by-region.csv')

## Exercise 2. Read the API doc and try a different resolution

Check https://github.com/GeneralMills/pytrends#interest-by-region and try another resolution by changing XX to one of the other options.

In [None]:
df = pytrends.interest_by_region(resolution='XX', inc_low_vol=True, inc_geo_code=False)
df.head()

### Bugs
You might notice that PyTrends does not return city-level data for AU. When the `resolution` parameter is set to `CITY`, the returned data is still region-level rather than city-level, which is incorrect.

This is caused by the following source code in the PyTrends library.

    # make the request
    region_payload = dict()
    if self.geo == '':
        self.interest_by_region_widget['request']['resolution'] = resolution
    elif self.geo == 'US' and resolution in ['DMA', 'CITY', 'REGION']:
        self.interest_by_region_widget['request']['resolution'] = resolution

See https://github.com/GeneralMills/pytrends/blob/master/pytrends/request.py#L273

From the above code, it is clear that the `resolution` argument is written into the request only when `geo` is either '' or 'US'.
Since `geo == 'AU'` in our setting, the request never asks for the 'CITY' resolution.
I therefore fixed the code and uploaded it to https://github.com/haewoon/pytrends.
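
To see the effect of that branch in isolation, here is a simplified stand-in that mirrors only the two conditions quoted above. It is not the library code: the function name `requested_resolution` is hypothetical, and the `"REGION"` default is an assumption chosen to match the region-level output observed for AU.

```python
def requested_resolution(geo, resolution, widget_default="REGION"):
    """Return the resolution that ends up in the request under the quoted branching logic."""
    # Assumption: the widget already carries its own default resolution.
    request = {"resolution": widget_default}
    if geo == "":
        request["resolution"] = resolution
    elif geo == "US" and resolution in ["DMA", "CITY", "REGION"]:
        request["resolution"] = resolution
    # No branch covers other countries, so the caller's choice is dropped.
    return request["resolution"]


for geo in ["", "US", "AU"]:
    print(repr(geo), "->", requested_resolution(geo, "CITY"))
```

Only the last case ignores the requested `CITY` value, which is the behavior described above.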

If you wish to retrieve city-level data for AU, carry out the following steps:
1. Download the 'pytrends' folder from the above GitHub link. <img align="center" src="https://docs.google.com/uc?id=1Vqi2cFkLFLCxWIJMtdD4MHC9ji4BfclG"  style="height: 250px;"/>
2. Place the 'pytrends' folder in the same folder as the current Python notebook file.
3. In the Jupyter Notebook menu, select Kernel -> Restart to restart the notebook kernel. (After the restart, the notebook can recognize the newly added 'pytrends' folder.)
4. Rerun the `interest_by_region` API call with `resolution` set to `CITY`.

You should now see city-level data for AU, because `pytrends.interest_by_region` now uses the source code in the newly added 'pytrends' folder instead of the installed 'pytrends' library.
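
If the output still looks region-level, it may help to check which copy of `pytrends` Python actually resolves. A minimal sketch using the standard library follows; `module_location` is a made-up helper name. If the local folder is picked up, the printed path should point into the notebook's folder rather than `site-packages`.

```python
import importlib.util


def module_location(module_name):
    """Return the file path Python would load `module_name` from, or None if it cannot be found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


print(module_location("pytrends"))
```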


In [None]:
# save the resolution data into a csv file
df.to_csv('2-by-XX.csv')

## 3. Related topics
Okay. Back to Singapore. 
Users searching for a search term (here 'new year') also searched for these topics. 

Google Trends provides two options:
* Top - The most popular topics. Scoring is on a relative scale where a value of 100 is the most commonly searched topic and a value of 50 is a topic searched half as often as the most popular term, and so on.

* Rising - Related topics with the biggest increase in search frequency since the last time period. Results marked "Breakout" had a tremendous increase, probably because these topics are new and had few (if any) prior searches.
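
The "relative scale" used for Top can be illustrated with a little arithmetic. The sketch below only demonstrates the 100 and 50 relationship described above; it is not Google's actual normalization, and the counts are placeholders invented for the example.

```python
def to_relative_scale(raw_counts):
    """Rescale counts so the largest becomes 100 and the rest are proportional to it."""
    if not raw_counts:
        return {}
    peak = max(raw_counts.values())
    if peak <= 0:
        return {name: 0 for name in raw_counts}
    return {name: round(100 * count / peak) for name, count in raw_counts.items()}


# Placeholder counts, invented purely to illustrate the arithmetic.
print(to_relative_scale({"topic A": 200, "topic B": 100, "topic C": 50}))
```

Here "topic B" is searched half as often as "topic A", so it lands at 50.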

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='SG', timeframe='2020-12-04 2021-01-03', cat=0)

In [None]:
related_topics = pytrends.related_topics()

### Access to rising related topics

In [None]:
related_topics['new year']['rising']

### Access to top related topics

In [None]:
related_topics['new year']['top']

## Exercise 3. Compare related topics between different periods.

Related topics also change over time.
Compare the related topics of 'covid-19' in the **United Kingdom** for two periods: (1) 2020/11/1 to 2020/12/1 and (2) 2020/12/1 to 2021/1/1.

Tip: The United Kingdom's country code is not UK.
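
One simple way to compare two periods is to treat each result as a set of topic titles and look at what is shared and what is unique. The sketch below assumes the titles have already been pulled out of the result tables (for example from a `topic_title` column, which is an assumption about the table layout). The helper name and the titles are placeholders.

```python
def compare_periods(first_period, second_period):
    """Split two collections of titles into those unique to each period and those in both."""
    first, second = set(first_period), set(second_period)
    return {
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
        "both": sorted(first & second),
    }


# Placeholder titles, not real Google Trends output.
result = compare_periods(["topic A", "topic B"], ["topic B", "topic C"])
print(result)
```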


## 4. Related queries

Users searching for a term (here 'new year') also searched for these queries. 

Similarly, Google provides two options:
* Top - The most popular search queries. Scoring is on a relative scale where a value of 100 is the most commonly searched query, 50 is a query searched half as often as the most popular query, and so on.

* Rising - Queries with the biggest increase in search frequency since the last time period. Results marked "Breakout" had a tremendous increase, probably because these queries are new and had few (if any) prior searches.

In [None]:
keywords = ["new year"]
pytrends.build_payload(keywords, geo='SG', timeframe='2020-12-04 2021-01-03', cat=0)

In [None]:
related_queries = pytrends.related_queries()

### Access to rising related queries

In [None]:
related_queries['new year']['rising']

### Access to top related queries

In [None]:
related_queries['new year']['top']

## Exercise 4. Compare related queries between different periods.

Related queries also change over time.
Compare the related queries of 'covid-19' in the **United States** for two periods: (1) 2020/11/1 to 2020/12/1 and (2) 2020/12/1 to 2021/1/1.
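
Because Top scores are relative within each request, comparing positions in the two lists is often easier to interpret than comparing the scores themselves. The following sketch reports how many places each shared query moved; `rank_shifts` is a hypothetical helper and the queries are placeholders.

```python
def rank_shifts(first_period, second_period):
    """Map each query found in both ranked lists to its move: positive is up, negative is down."""
    first_rank = {query: index for index, query in enumerate(first_period)}
    shifts = {}
    for new_rank, query in enumerate(second_period):
        if query in first_rank:
            shifts[query] = first_rank[query] - new_rank
    return shifts


# Placeholder queries ordered from most to least popular, not real output.
shifts = rank_shifts(["query A", "query B", "query C"], ["query B", "query A", "query D"])
print(shifts)
```

Queries that appear in only one period are left out here; the set comparison idea covers those.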
