# Learning goal
In this notebook you will learn how to retrieve Google Trends' search interest data reliably and at scale. This comes in handy when you need to query thousands of keywords to build a dataset.

# Problem

Using Google Trends to see the search interest for a keyword over the last five years works well for a small number of queries. With more queries, Google's servers start refusing requests: they return "too many requests" errors, report that the rate limit has been exceeded, or blacklist your IP. A suggested workaround is to use proxies. However, proxies often cause other errors later on when they go out of service or become inaccessible for other reasons, so you would have to reconfigure them whenever you revisit your code.

To alleviate this problem, I rely on timeouts. Sufficiently long intervals between queries minimize the risk of request errors. In addition, I provide a fallback procedure in case of errors: it stores the data collected so far and starts another attempt. The procedure takes longer, especially for thousands of queries, but the computer can work while you sleep, enjoy life, or make big plans for the next project that involves Google Trends. After reading this article, you will know how to make it work.
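
To get a feeling for how long such a run takes, the total waiting time can be estimated up front. The following is only a rough sketch: `estimate_runtime_hours` is a hypothetical helper that is not part of this notebook, it assumes one timeout before every query, and it ignores the time the requests themselves take. The example values (30 firms, 30 topics, 5 keywords per query, 45 seconds) are the ones used for the test subset further down.

```python
from math import ceil

def estimate_runtime_hours(n_firms, n_topics, batch_size=5, sec_sleep=45):
    """Lower bound for the total waiting time in hours (hypothetical helper)."""
    # keywords are batched per firm, so every firm needs ceil(n_topics / batch_size) queries
    n_queries = n_firms * ceil(n_topics / batch_size)
    # one timeout of sec_sleep seconds before every query
    return n_queries * sec_sleep / 3600

# 30 firms x 30 topics -> 180 queries x 45 s
print(estimate_runtime_hours(30, 30))  # 2.25
```

The estimate grows linearly with the number of firms, so a full index with several hundred firms keeps the computer busy for more than a day at the same timeout.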

# Challenges and limits of Google Trends

1. Denial of service, exceeding rate limits, being blacklisted
2. Maximum of 5 keywords per query
3. Relative measures and scalability (solved by @Carrie Fowle in https://towardsdatascience.com/using-google-trends-at-scale-1c8b902b6bfa)


While we focus on the first issue, there will be workarounds included for all three along the way.   
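
For the third issue, one commonly used idea is to include the same "anchor" keyword in every query and to rescale each batch so that the anchor series line up. The sketch below illustrates that idea with made-up numbers. It is an assumption on my side and not necessarily the procedure of the linked article; `rescale_to_anchor` and the column names are hypothetical.

```python
import pandas as pd

def rescale_to_anchor(df_reference, df_batch, anchor):
    """Rescale df_batch so that its anchor column peaks at the same value as in df_reference.

    Assumes that the anchor keyword has non-zero search interest in both queries.
    """
    # ratio between the anchor's peak in the reference query and in this batch
    factor = df_reference[anchor].max() / df_batch[anchor].max()
    return df_batch * factor

# made-up numbers for illustration only
df_a = pd.DataFrame({"anchor": [50, 100], "kw1": [10, 20]})
df_b = pd.DataFrame({"anchor": [25, 50], "kw2": [100, 80]})

df_b_scaled = rescale_to_anchor(df_a, df_b, "anchor")
print(df_b_scaled)
```

Note that the anchor occupies one of the five keyword slots, so only four new keywords fit into each query.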




# Implementation plan



1. Look through the code and judge where to refactor
    1. Rely on helper functions
    2. Define `keyword_constructor()`
2. Wrap the Google Trends query in a function
    1. Save the dataset along the way
    2. Simulate an error and retry at the last index
    3. If the retry is unsuccessful, increase the timeout until no exception occurs
        1. Abort after 10 unsuccessful increases with the message "wait for a bit and define a longer timeout"
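
The retry steps of this plan could look roughly like the sketch below. It is not the implementation used later in this notebook: `query_with_increasing_timeout`, its parameters, and the step of 15 seconds per increase are assumptions for illustration. The `wait` argument only exists so that the pause can be replaced in a demonstration.

```python
from time import sleep

def query_with_increasing_timeout(query, sec_sleep=45, max_increases=10,
                                  increase_by=15, wait=sleep):
    """Call query() until it succeeds, waiting longer before every new attempt.

    query: function without arguments that performs one request
    wait: function used for pausing (time.sleep by default)
    """
    timeout = sec_sleep
    for _ in range(max_increases + 1):
        wait(timeout)
        try:
            return query()
        except Exception as error:
            print("Query failed after waiting {} sec.: {}".format(timeout, error))
            timeout += increase_by
    raise RuntimeError("wait for a bit and define a longer timeout")

# demonstration with a simulated error instead of a real request
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("too many requests")
    return "data"

waited = []  # records the timeouts instead of actually sleeping
result = query_with_increasing_timeout(flaky_query, wait=waited.append)
print(result, waited)  # data [45, 60, 75]
```

With pytrends, `query` would be a small function that calls `build_payload()` and returns `interest_over_time()` for one keyword batch.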

    
## Data sources

- search interest: Google Trends
- search interest+: Google autocompletion
- news coverage: Google News
- ESG scores, financial data, sector: Yahoo! Finance



### NEXT
* check out cleanco to strip the legal entity type from company names: https://pypi.org/project/cleanco/


In [139]:
# data preparation
import pandas as pd
import numpy as np
import re 
from math import ceil
from math import exp
import os

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

# data collection 
from yahooquery import Ticker
from pytrends.request import TrendReq 
from time import sleep 

py_version = !python --version
print(py_version[0])
print("Pandas version",pd.__version__)

Python 3.7.6
Pandas version 1.0.3


# Helper functions

In [138]:
## include in helper functions
# KEYWORD GENERATOR HELPERS 
def regex_strip_legalname(raw_names):
    """Removes legal entity, technical description or firm type from firm name
    
    Input
        raw_names: list of strings with firm names
        
    Return
        list of strings: firm names without legal description 
    
    """
    
    pattern = r"(,\s)?(LLC|Inc|Corp\w*|\(?Class \w+\)?|Group|Company|\WCo(\s|\.)|plc|Ltd|Int'l\.|Holdings)\.?\W?"
    stripped_names = [re.sub(pattern,'', n) for n in raw_names]
    
    return stripped_names

def batch(lst, n=5):
    """Yield successive n-sized chunks from list lst
    
    adapted from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    
    Input
        lst: list 
        n: selected batch size
        
    Yield 
        list: consecutive slices of lst with n elements each (the last one may be shorter)
    """
    
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
        
def flatten_list(nested_list):
    """Flattens nested list"""
    
    return [element for sublist in nested_list for element in sublist]

def list_remove_duplicates(l):
    """Removes duplicates from list elements whilst preserving element order
    adapted from 
    https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order
    
    Input
        list with string elements
    
    Return 
        List without duplicates, elements in their original order (not sorted)
    
    """
    seen = set()
    seen_add = seen.add
    return [x for x in l if not (x in seen or seen_add(x))]


# PYTREND HELPERS
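def pytrends_sleep_init(seconds):
    """Timeout for certain seconds and re-initialize pytrends
    
    Input
        seconds: int with seconds for timeout
        
    Return
        None (the module-level pytrends object `pt` is replaced)
    
    """
    # without the global statement the new TrendReq object would only be bound
    # to a local name and be discarded as soon as the function returns
    global pt
    print("TIMEOUT for {} sec.".format(seconds))
    sleep(seconds)
    pt = TrendReq()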
    
def make_x_y_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    
    # create dir if nonexistent
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    # merge df
    y = pd.DataFrame(y)
    x = pd.DataFrame(x)
    
    # export to csv
    pd.concat([y, x], axis=1).to_csv(os.path.join(data_dir, filename), 
                                     header=False, 
                                     index=False)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))
    
def make_csv(x, filename, data_dir, append=False, header=False, index=False):
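    '''Converts the data into a dataframe and writes it to a csv file.
       :param x: Data to store (anything pd.DataFrame accepts)
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       :param append: If True, append to an existing file instead of overwriting it
       :param header: Whether to write the column names
       :param index: Whether to write the row index
       '''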
    
    # create dir if nonexistent
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    # make sure its a df
    x = pd.DataFrame(x)
    
    # export to csv (append to an existing file or overwrite it)
    x.to_csv(os.path.join(data_dir, filename),
             mode='a' if append else 'w',
             header=header,
             index=index)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

# Engineer search keywords from firm names and topic

**TODO:** def construct_search_keywords()
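
A possible shape for this function is sketched below. It is a draft under assumptions, not the final helper: the signature is hypothetical, and it simply bundles the two steps that the cells further down perform inline (pairwise "topic firm" combinations, then batches of at most five keywords per firm).

```python
def construct_search_keywords(firm_names, topics, batch_size=5):
    """Combine every firm with every topic and split the result into batches.

    Returns a list of keyword batches. Each batch holds at most batch_size
    keywords, and all keywords of a batch belong to the same firm.
    """
    keyword_batches = []
    for firm in firm_names:
        # pairwise combinations "topic firm" for one firm
        keywords = [topic + ' ' + firm for topic in topics]
        # split into chunks that respect the keyword limit per query
        for i in range(0, len(keywords), batch_size):
            keyword_batches.append(keywords[i:i + batch_size])
    return keyword_batches

batches = construct_search_keywords(['3M', 'Apple'], ['scandal', 'fraud', 'tax'], batch_size=2)
print(batches)
# [['scandal 3M', 'fraud 3M'], ['tax 3M'], ['scandal Apple', 'fraud Apple'], ['tax Apple']]
```

Keeping the batches firm by firm makes it easy to reassemble one dataframe per firm after the queries.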

## S&P 500 listings

In [147]:
# retrieve S&P 500 listings from Wikipedia
table=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df_sp500 = table[0]

## retrieve firm information from table
# ticker
ticker = list(df_sp500.Symbol)
# sector
sector = df_sp500.loc[:,'GICS Sector']
# firm names
firm_names_raw = list(df_sp500.Security)

### Preprocess firm names: Remove legal suffix

TODO
* remove class A/B
* remove .com
* apply list_remove_duplicates()
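
The three open points could be handled along the lines of the sketch below. `clean_firm_names` and its regular expressions are assumptions that I only checked against the examples shown, and `dict.fromkeys` stands in for `list_remove_duplicates()` to keep the example self-contained.

```python
import re

def clean_firm_names(raw_names):
    """Remove share-class labels and '.com', then drop duplicates (order preserved)."""
    cleaned = []
    for name in raw_names:
        # remove share class labels such as "(Class A)" or "Class B"
        name = re.sub(r"\s*\(?Class \w+\)?", "", name)
        # remove the ".com" part of the name
        name = re.sub(r"\.com\b", "", name, flags=re.IGNORECASE)
        cleaned.append(name.strip())
    # dict keys keep their insertion order (Python 3.7+), so the first occurrence wins
    return list(dict.fromkeys(cleaned))

names = ["Alphabet Inc. (Class A)", "Alphabet Inc. (Class C)", "Amazon.com Inc."]
print(clean_firm_names(names))  # ['Alphabet Inc.', 'Amazon Inc.']
```

Dropping duplicates shortens the list, so the result no longer lines up index by index with `ticker` and `sector`.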

In [175]:
# removes legal suffix from firm names
from cleanco import prepare_terms, basename

terms = prepare_terms()

# stored as firm_names because the cells below refer to that name
firm_names = [basename(name, terms) for name in firm_names_raw]

## previous solution with regex 
# remove legal taxonomy and firm type
# TODO: add list_remove_duplicates()
# firm_names = regex_strip_legalname(firm_names_raw)

In [137]:




# esg keywords (negative exclusion criteria)
topics = ['scandal', 'greenwashing', 'corruption', 'fraud', 'bribe', 'tax', 'forced', 'harassment', 'violation', 
          'human rights', 'conflict', 'weapons', 'arms trade', 'pollution', 'CO2', 'emission', 'fossil fuel',
          'gender inequality', 'discrimination', 'sexism', 'racist', 'intransparent', 'data privacy', 'lawsuit', 
          'unfair', 'bad', 'problem', 'hate', 'issues', 'controversial']

# store lists as csv for retrieval
make_csv(topics, 'topics.csv', 'data', append=False, header=True)
make_csv(firm_names, 'firm_names.csv', 'data', append=False, header=True)


In [60]:
############################
# DEFINE PARAMETERS 
n_firms = 30
batch_size = 5
n_keywords = int(n_firms*len(topics))
n_query = int(n_keywords/batch_size)
n_topics = len(topics)
sec_sleep = 45
############################


# create search keywords as pairwise combinations of firm names + topics
search_keywords = [[j+' '+i for j in topics] for i in firm_names]

# print("{} topic keywords for {} firm each ---> {} pairwise combinations"\
#       .format(n_topics, n_firms, n_keywords))
# print()

# Subset for test purposes
print(">>>>>>> Subset for testing purposes")
keywords_sample = search_keywords[:n_firms]
print("Generated {} keywords for {} firms and {} topics each".format(n_keywords,n_firms,n_topics))
print("Resulting in {} queries with {} keywords each (=batch)".format(n_query, batch_size))

## generate keyword batches (= query)
# flatten list
keyword_batches = flatten_list([list(batch(keywords_sample[i], batch_size)) for i in range(n_firms)])

print("\nExample keyword batch:\n{}".format(keyword_batches[0]))

>>>>>>> Subset for testing purposes
Generated 900 keywords for 30 firms and 30 topics each
Resulting in 180 queries with 5 keywords each (=batch)

Example keyword batch:
['scandal 3M ', 'greenwashing 3M ', 'corruption 3M ', 'fraud 3M ', 'bribe 3M ']


## Query Google

In [None]:
## retrieve Google trends across time

# initialize pytrends
pt = TrendReq()

# store DFs for later concat
df_list = []
index_batch_error = []

# number of weekly rows per query in the original run (five-year window)
n_weeks = 261

# create csv to store intermediate results
make_csv(pd.DataFrame(), filename='googletrends.csv', data_dir='data', append=False)

# the loop variable is called keyword_batch so that it does not shadow the batch() helper
for i, keyword_batch in enumerate(keyword_batches):
    
    # retrieve interest over time
    try:
        # re-init pytrends and wait (sleep/timeout)
        pytrends_sleep_init(sec_sleep)
        
        # pass keywords to pytrends API
        pt.build_payload(kw_list=keyword_batch) 
        print("Payload built for batch {}".format(i))
        df_search_result = pt.interest_over_time()
        
    except Exception as e:
        print(e)
        print("Query {} of {}".format(i, n_query))
        # store index at which error occurred
        index_batch_error.append(i)
        
        # re-init pytrends and wait (sleep/timeout)
        pytrends_sleep_init(sec_sleep)
        
        # retry once; if this fails as well, the loop stops here and the
        # results collected so far remain in data/googletrends.csv
        print("RETRY for batch {}".format(i))
        pt.build_payload(kw_list=keyword_batch) 
        df_search_result = pt.interest_over_time()
        
    # check for non-empty df
    if df_search_result.shape[0] != 0:
        
        # reset index for consistency (to call pd.concat later with empty dfs)
        df_search_result.reset_index(inplace=True)
        df_list.append(df_search_result)
        
    # no search result for any keyword
    else:        
        # create df containing 0s, one column per keyword of this batch
        df_search_result = pd.DataFrame(np.zeros((n_weeks, len(keyword_batch))), columns=keyword_batch)
        df_list.append(df_search_result)
        
    make_csv(df_search_result, filename='googletrends.csv', data_dir='data',
             append=True,
             header=True)
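
The plan items "save the dataset along the way" and "retry at the last index" can also be combined by writing one file per batch and skipping batches whose file already exists. The sketch below shows this pattern with a stand-in for the pytrends request. `collect_with_checkpoints`, the file naming scheme, and `fake_query` are hypothetical and not part of the loop above.

```python
import os
import tempfile
import pandas as pd

def collect_with_checkpoints(keyword_batches, query, data_dir):
    """Query every batch once and store each result in its own csv file.

    Batches whose file already exists are skipped, so the function can be
    called again after an error and continues where the last run stopped.
    """
    os.makedirs(data_dir, exist_ok=True)
    for i, keywords in enumerate(keyword_batches):
        path = os.path.join(data_dir, "batch_{:05d}.csv".format(i))
        if os.path.exists(path):
            continue  # already collected in an earlier run
        query(keywords).to_csv(path, index=False)

# demonstration with a stand-in for the pytrends request
queried = []

def fake_query(keywords):
    queried.append(keywords)
    return pd.DataFrame({kw: [0, 0] for kw in keywords})

data_dir = tempfile.mkdtemp()
batches = [["scandal 3M", "fraud 3M"], ["tax 3M"]]
collect_with_checkpoints(batches[:1], fake_query, data_dir)  # first run ends after batch 0
collect_with_checkpoints(batches, fake_query, data_dir)      # second run only queries batch 1
print(queried)  # [['scandal 3M', 'fraud 3M'], ['tax 3M']]
```

One file per batch avoids repeated header rows in a single appended csv and makes the last completed index visible in the file names.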

In [None]:
# combine query results to df
import time

drop_cols = ['isPartial', 'date']

# number of consecutive batches that belong to one firm (30 topics / 5 keywords = 6)
batches_per_firm = ceil(len(topics) / batch_size)

# index of the first firm contained in df_list; the original run resumed an
# earlier collection and used 272 here, use 0 when df_list starts at the first firm
firm_offset = 0

# index df
df_clean_list = []
for i, x in enumerate(range(0, len(df_list), batches_per_firm)):

    # map "topic firm" column names back to the bare topic
    map_colnames = dict(zip(search_keywords[i + firm_offset], topics))
    
    ## create firm-level df
    # drop isPartial and date columns if present (all-zero placeholder dfs have neither)
    df_firm = pd.concat(df_list[x:x + batches_per_firm], axis=1).drop(columns=drop_cols, errors='ignore')

    # rename columns
    df_firm = df_firm.rename(columns=map_colnames)
    
    # add firm column
    df_firm['firm'] = firm_names[i + firm_offset]
    
    df_clean_list.append(df_firm)

# df (long format) with time dimension      
df_time = pd.concat(df_clean_list)

# Store query results so far
print('Index batch error:',index_batch_error)

# get timestamp
timestr = time.strftime("%Y%m%d-%H%M%S")
# 1633 is a hard-coded batch index from the original run, adjust it to your own run
df_filename = 'df_time_{}_idxbatch_{}.csv'.format(timestr, 1633)
print(df_filename)

# Store df_time
make_csv(df_time, filename=df_filename, data_dir='data', append=False, header=False)

# Google News

In [74]:
import pandas as pd
from GoogleNews import GoogleNews
from datetime import date

In [126]:
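# search window: the last year (approximated as 365 days), as MM/DD/YYYY strings
from datetime import timedelta
date_today = date.today().strftime("%m/%d/%Y")
date_1year_ago = (date.today() - timedelta(days=365)).strftime("%m/%d/%Y")

googlenews=GoogleNews(start=date_1year_ago,end=date_today)
googlenews.search('Microsoft scandal')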

until_page = 5
for p in range(1, until_page+1):
    print("Page:", p)
    googlenews.getpage(p)

result = googlenews.result()
df=pd.DataFrame(result)
print(df.shape)
df.tail()

Page: 1
Page: 2
Page: 3
Page: 4
Page: 5
(60, 6)


| | title | media | date | desc | link | img |
|---|---|---|---|---|---|---|
| 55 | Microsoft pulls its smaller investments in fac... | Engadget | 28.03.2020 | Although Microsoft is less likely to be embroi... | https://www.engadget.com/2020-03-28-microsoft-... | data:image/gif;base64,R0lGODlhAQABAIAAAP//////... |
| 56 | Bill Gates Gives to the Rich (Including Himself) | The Nation | 17.03.2020 | In speeches delivered at the American Enterpri... | https://www.thenation.com/article/society/bill... | data:image/gif;base64,R0lGODlhAQABAIAAAP//////... |
| 57 | Coronavirus: Microsoft offers behind-the-scene... | ComputerWeekly.com | 17.06.2020 | ICO acknowledges GDPR concerns over A-level re... | https://www.computerweekly.com/news/252484794/... | data:image/gif;base64,R0lGODlhAQABAIAAAP//////... |
| 58 | Microsoft fixes 26 critical vulnerabilities in... | ComputerWeekly.com | 11.03.2020 | Microsoft has fixed 115 CVE-numbered vulnerabi... | https://www.computerweekly.com/news/252479868/... | data:image/gif;base64,R0lGODlhAQABAIAAAP//////... |
| 59 | Microsoft adds more AMD-powered Azure VMs, whi... | TechRepublic | 07.11.2019 | Microsoft adds more AMD-powered Azure VMs, whi... | https://www.techrepublic.com/article/microsoft... | data:image/gif;base64,R0lGODlhAQABAIAAAP//////... |


In [132]:
df.title[0]

'Microsoft Security Shocker As 250 Million Customer Records Exposed Online'

In [133]:
df.drop_duplicates().count()

title    50
media    50
date     50
desc     50
link     50
img      50
dtype: int64

In [134]:
googlenews.clear()

In [None]:
import pandas as pd
from GoogleNews import GoogleNews
from datetime import date

# helper to shift a date by a number of years (negative values go back in time)
def add_years(d, years):
    """Return a date that's `years` years after the date (or datetime)
    object `d`. Return the same calendar date (month and day) in the
    destination year, if it exists, otherwise use the following day
    (thus changing February 29 to March 1).
    
    from https://stackoverflow.com/a/15743908

    """
    try:
        return d.replace(year = d.year + years)
    except ValueError:
        return d + (date(d.year + years, 1, 1) - date(d.year, 1, 1))
    


def get_news(keyword, until_page=20):
    """Retrieve news for keyword from the first until_page result pages,
        restricted to the period from 1 year ago until today
        
    Input
        keyword: search term to look up news for
        until_page: number of result pages to retrieve
    
    Return
        dataframe with one row per news item
    """
    
    ## define 1 year timespan with datestrings 
    # today's date
    date_today = date.today().strftime("%m/%d/%Y")
    # date 1 year ago
    date_1year_ago = add_years(date.today(), -1).strftime("%m/%d/%Y")
    
    # init googlenews object
    googlenews = GoogleNews(lang='en', start=date_1year_ago, end=date_today)
    # search news for keyword
    googlenews.search(keyword)
    
    # get results for each page 
    for p in range(1, until_page + 1):
        googlenews.getpage(p)
    
    # collect the accumulated results and reset the object for the next search
    df = pd.DataFrame(googlenews.result())
    print(df.shape)
    googlenews.clear()
    
    return df