Pytrends implementation

In [1]:
from pytrends.request import TrendReq
import pandas as pd
import os
from dateutil.relativedelta import *
from dateutil.rrule import *
from datetime import date

Timeframe setup. A timeframe over 9 months will be condensed into weekly data points, which is too coarse for our use. The windows built below are 8 months long (interval=8 in the rrule calls), so they stay under that limit and keep their daily data points.

In [2]:
use_date = date(2016,1,1)
today = date.today()
#start date of each 8-month window: 2016-01-01, 2016-09-01, ...
begin = [d.date() for d in rrule(freq=MONTHLY, interval=8, count=10, dtstart=use_date)]

#step back to the last day of the month before the start date (2015-12-31)
use_date = use_date+relativedelta(months=-1)
use_date = use_date+relativedelta(day=31)

#month-end dates every 8 months; end[i+1] closes the window opened by begin[i]
end = [d.date() for d in rrule(freq=MONTHLY, interval=8, count=11, dtstart=use_date, bymonthday=(-1,))]

#adds time periods to a list
#if the end date is later than today's date, today's date will be used
dates = []
for i in range(len(begin)):
    if today >= begin[i] and today >= end[i+1]:
        full = (str(begin[i]), str(end[i+1]))
        dates.append(full)
    elif today >= begin[i] and today <= end[i+1]:
        recent = (str(begin[i]), str(today))
        dates.append(recent)
# for i in dates:
#     print(i)
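As a cross-check on the window logic above, here is a minimal sketch that builds the same kind of windows using only the standard library and prints how many days each one spans. The helper names (`add_months`, `month_end`, `build_windows`) are my own and not part of dateutil or pytrends, and the sketch assumes the first start date falls on the first day of a month.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Return the first day of the month that lies `months` months after d."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

def month_end(d):
    """Return the last day of d's month."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def build_windows(first_start, today, months=8):
    """Build (start, end) ISO-string pairs, clipping the last window at today."""
    windows = []
    start = first_start
    while start <= today:
        end = month_end(add_months(start, months - 1))
        windows.append((start.isoformat(), min(end, today).isoformat()))
        start = add_months(start, months)
    return windows

# Fixed "today" so the example is reproducible.
windows = build_windows(date(2016, 1, 1), date(2017, 6, 15))
spans = []
for start, end in windows:
    span = (date.fromisoformat(end) - date.fromisoformat(start)).days + 1
    spans.append(span)
    print(start, end, span)
```

Every window stays well under nine months, and the last one is cut off at the supplied "today" date, which mirrors what the `dates` list is meant to contain.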

Data processing. The keywords are read from 'keywords.txt'. Pytrends only accepts up to 5 keywords each time a payload is built, so a 2D list is constructed: each element of the list is another list containing up to 5 keywords. The data for every time window is later stored in CSV files ('data0.csv', 'data1.csv', ...), one per keyword list, while maintaining its daily data points.

In [3]:
filename = 'keywords.txt'
with open(filename) as f:
    content = f.readlines()
for i in range(len(content)):
    content[i] = content[i].strip()
fullLists = len(content)//5
remainder = len(content)%5
kw = []
a = 0
b = 5
for i in range(0,fullLists):
    kw.append(content[a:b])
    a += 5
    b += 5
if remainder > 0:
    kw.append(content[b-5:len(content)])
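The same grouping can be sketched more compactly with slicing. The version below is only an illustration: `chunk_keywords` is a hypothetical helper name, the sample lines are made up, and unlike the cell above it also drops blank lines, which would otherwise end up as empty keywords.

```python
def chunk_keywords(lines, size=5):
    """Strip whitespace, drop blank lines, and split into lists of at most `size`."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return [cleaned[i:i + size] for i in range(0, len(cleaned), size)]

# Made-up lines standing in for the contents of 'keywords.txt'.
sample = ["apple\n", "banana\n", "\n", "cherry\n", "grape\n",
          "lemon\n", "mango\n", "  peach  \n"]
groups = chunk_keywords(sample)
print(groups)
```

With 7 non-blank keywords this gives one full list of 5 and a second list with the remaining 2.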

Constructing the payload with the keyword lists. After the payload is built for a keyword list, the results are written to a CSV file named 'data_.csv', where _ is the index of that keyword list. The outer loop iterates through the keyword lists and the inner loop through each time window, so every request is short enough to keep its daily data points.

In [4]:
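pytrend = TrendReq()
csvname = 'data'
allcsv = []
for count, keywords in enumerate(kw):
    csvfile = csvname + str(count) + '.csv'
    allcsv.append(csvfile)
    for start, stop in dates:
        #the timeframe is passed as 'YYYY-MM-DD YYYY-MM-DD'
        timef = start + ' ' + stop
        pytrend.build_payload(keywords, timeframe=timef)
        df = pytrend.interest_over_time()
        if df.empty:
            #no data came back for this window, so there is nothing to write
            continue
        df = df.drop(['isPartial'], axis=1)
        #the first window creates the file with a header, later windows are appended
        #note: re-running this cell appends again, so delete old data*.csv files first
        if os.path.exists(csvfile) and os.path.getsize(csvfile) > 0:
            df.to_csv(csvfile, mode='a', header=False)
        else:
            df.to_csv(csvfile, index=True)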
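One possible variation on the cell above is to collect all windows for a keyword list in memory and write the file once, so that re-running the notebook overwrites the CSV instead of appending to it. This is only a sketch: `collect_windows` and `fake_fetch` are hypothetical names, and `fetch` stands in for a small wrapper that calls `build_payload` followed by `interest_over_time`. The fake fetch exists only so the example runs without network access.

```python
import pandas as pd

def collect_windows(keywords, windows, fetch):
    """Fetch every window for one keyword list and return a single DataFrame.

    `fetch` is a hypothetical callable (keywords, timeframe) -> DataFrame.
    """
    frames = []
    for start, stop in windows:
        df = fetch(keywords, start + ' ' + stop)
        if df.empty:
            continue  # skip windows that returned no data
        frames.append(df.drop(columns=['isPartial'], errors='ignore'))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames)

def fake_fetch(keywords, timeframe):
    """Stand-in for a real request: one row per day with dummy values."""
    start, stop = timeframe.split(' ')
    index = pd.date_range(start, stop, freq='D', name='date')
    data = {k: list(range(len(index))) for k in keywords}
    data['isPartial'] = [False] * len(index)
    return pd.DataFrame(data, index=index)

combined = collect_windows(
    ['a', 'b'],
    [('2016-01-01', '2016-01-03'), ('2016-01-04', '2016-01-05')],
    fake_fetch,
)
print(combined)
```

In the notebook, `combined.to_csv(csvfile)` would then replace the append logic, at the cost of holding all windows for one keyword list in memory at once.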


All the separate csvs are then concatenated into one big csv. Extra date columns are removed as they are redundant.

In [6]:
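allData = []
for count, file in enumerate(allcsv):
    df = pd.read_csv(file)
    #keep the date column from the first file only
    if count > 0:
        df = df.drop('date', axis=1)
    allData.append(df)
#columns are joined by row position, so every file must hold the same dates in the same order
concatDf = pd.concat(allData, axis=1)
concatDf.to_csv('aggregate.csv', index=False)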
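Since the concatenation above lines the files up by row position, a more defensive option is to align them on the date column instead. The sketch below assumes each CSV has a 'date' column with unique values. `combine_on_date` is a hypothetical helper name, and the in-memory CSV strings are made-up stand-ins for the 'data_.csv' files.

```python
import io
import pandas as pd

def combine_on_date(sources):
    """Read each CSV with 'date' as the index and join the columns on that index."""
    frames = [pd.read_csv(src, index_col='date', parse_dates=True) for src in sources]
    return pd.concat(frames, axis=1)

# Two small made-up files; the second one lists its dates in a different order.
first = io.StringIO("date,a,b\n2016-01-01,10,20\n2016-01-02,30,40\n")
second = io.StringIO("date,c\n2016-01-02,5\n2016-01-01,7\n")

merged = combine_on_date([first, second])
print(merged)
```

Here the value of `c` for 2016-01-01 ends up next to the matching row of `a` and `b` even though the row order differs between the two inputs. Writing the result with `merged.to_csv('aggregate.csv')` keeps a single date column.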

Some issues: <br/>
- I noticed that the data was rather inconsistent when the timeframe was shifted around, i.e. if one timeframe is set to 2020-01-01 to 2020-03-01 and another to 2020-02-01 to 2020-03-01, the data from 02-01 to 03-01 is different between the two datasets.<br/>
https://support.google.com/google-ads/thread/8389370?msgid=26184434<br/>
https://support.google.com/google-ads/thread/8632416?hl=en<br/>
I believe that this won't be an issue, since the timeframes I chose don't overlap and none of them is a subset of another. This way, the differences can just be interpreted as fluctuation. 
- As far as I can tell, Pytrends only includes data up to two days ago. I'm not sure whether this is a problem.<br/>
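One plausible contributor to the first issue is that Google Trends reports values relative to the highest point in the requested window (that point becomes 100), so the same day can get a different number depending on what else is in the window. The sketch below is a simplified model of that rescaling only: the raw volumes are made up, `scale_window` is a hypothetical helper, and any sampling Google applies on top is ignored.

```python
def scale_window(raw_counts):
    """Rescale raw values so the largest value in the window becomes 100."""
    peak = max(raw_counts)
    return [round(100 * value / peak) for value in raw_counts]

# Made-up raw search volumes for four consecutive days (illustrative only).
raw = [50, 200, 80, 40]

long_window = scale_window(raw)       # all four days in one request
short_window = scale_window(raw[2:])  # only the last two days

print(long_window[2:])  # the last two days as seen in the long window
print(short_window)     # the same two days requested on their own
```

Under this model the last two days come out as 40 and 20 in the long window but 100 and 50 in the short one. If that is what is happening, values from different windows are each on their own scale, which may be worth keeping in mind when the windows are stitched together.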
