# Google Trends
After the three topics for each article were identified, the goal was to create a metric for how relevant each topic was on the date the article was published. This is done under the assumption that people read articles that are relevant at the current time. If a topic is not 'popular' at the time of release, the article will not be well received by end users, and user engagement will decrease. <br>
An example of this is Covid-19. If you had written an article about Covid-19 back in February 2020, chances are that most people would have been very interested in it, whereas today hardly anybody wants to read more about Covid-19.  <br>
This section provides each topic with a metric of how relevant it was at the time of release. 

To determine a topic's relevance, `pytrends` is used, which is an unofficial API for Google Trends. Google Trends is a great tool for mapping what people are searching for in real time. 

In [2]:
# Install pytrends
# !pip install pytrends

In [39]:
import pandas as pd
import time
import itertools
import plotly.express as px  # used for the line charts in the Reflection section
from pytrends.request import TrendReq
pytrends = TrendReq()

In [40]:
df = pd.read_csv("Data/data_topic.csv")

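# Necessary date preprocessing step: reduce the timestamp to YYYY-MM
df['date'] = df['published_at'].str.split('T', expand=True)[0]
df['date'] = df['date'].str.rsplit('-', n=1, expand=True)[0]
# df = df.iloc[[0,-2]]
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df.head(2), df.tail(2)])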
# df
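
The date shortening above can also be sketched with pandas' datetime parsing instead of string splitting. This is a minimal sketch, assuming `published_at` holds ISO 8601 timestamps with a `T` separator; the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from Data/data_topic.csv.
sample = pd.DataFrame({
    "published_at": ["2019-09-03T12:30:00Z", "2019-10-01T08:00:00Z"]
})

# Parse the timestamp and keep only the year and month.
sample["date"] = pd.to_datetime(sample["published_at"]).dt.strftime("%Y-%m")
print(sample["date"].tolist())  # ['2019-09', '2019-10']
```

Parsing has the side benefit of failing loudly on malformed timestamps, whereas string splitting would silently pass them through.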

In order to limit the computation required, the three topic columns are combined and the `unique()` function is used to find every unique topic. This limits the number of requests made to the Google Trends API, since many of the topics appear in more than one topic column. <br>

`pytrends` is in many ways an easy and great tool, but it comes with many limitations. The timeframe over which a topic is investigated cannot reach more than 10 years back from today. Luckily, the earliest date in this data set is from September 2019. The date resolution is also very restrictive, as values are only returned for every seventh day. A decision was made to shorten the date format, which originally was `YYYY-MM-DD`, to a simple `YYYY-MM` format. This reduced the number of dates to two (2019-09 and 2019-10). That is why the function takes only a single date string as input. <br>

A final remark about the function is that the topic "date" is not accepted as input by `pytrends`, which is why it was simply removed. 

In [9]:
## September 2019
df_09 = df[df['date']=='2019-09']
topic_09_list = []
topic_09_list.append(df_09['Topic1'].unique())
topic_09_list.append(df_09['Topic2'].unique())
topic_09_list.append(df_09['Topic3'].unique())

topic_09_list = list(itertools.chain.from_iterable(topic_09_list))
topic_09_list = list(set(topic_09_list))
# topic_09_list.remove('date')

## October 2019
df_10 = df[df['date']=='2019-10']
topic_10_list = []
topic_10_list.append(df_10['Topic2'].unique())
topic_10_list.append(df_10['Topic3'].unique())
topic_10_list.append(df_10['Topic1'].unique())

topic_10_list = list(itertools.chain.from_iterable(topic_10_list))
topic_10_list = list(set(topic_10_list))
# topic_10_list.remove('date')
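
The two blocks above repeat the same steps for each month. As a rough sketch, the deduplication could be wrapped in one helper; `unique_topics` and its `exclude` parameter are hypothetical names introduced here, and the demo frame is made up.

```python
import pandas as pd

def unique_topics(frame, columns=("Topic1", "Topic2", "Topic3"), exclude=("date",)):
    """Return the sorted unique, non-missing topics found across the given columns."""
    values = pd.unique(frame[list(columns)].to_numpy().ravel())
    return sorted(v for v in values if pd.notna(v) and v not in exclude)

demo = pd.DataFrame({
    "Topic1": ["brexit", "flight"],
    "Topic2": ["flight", "date"],
    "Topic3": [None, "brexit"],
})
print(unique_topics(demo))  # ['brexit', 'flight']
```

With such a helper, the monthly lists would reduce to calls like `unique_topics(df[df['date'] == '2019-09'])`.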

When it comes to the actual value returned by the API, `pytrends` again comes with its limitations. Lazarina Stoy (October 2021) describes the returned value as follows:

> Values are calculated on a scale from 0 to 100, where 100 is the location with the most popularity as a fraction of total searches in that location, a value of 50 indicates a location that is half as popular, and a value of 0 indicates a location where the term was less than 1% as popular as the peak. (Source: https://lazarinastoy.com/the-ultimate-guide-to-pytrends-google-trends-api-with-python/)
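
To make the scale concrete, the sketch below rescales a list of made-up raw values so that the peak becomes 100. It only illustrates the idea in the quote; Google's actual normalization is not public in detail, so the truncation to whole numbers and the function name `scale_to_peak` are assumptions.

```python
def scale_to_peak(raw_values):
    """Rescale values so that the largest becomes 100 (illustrative only)."""
    peak = max(raw_values)
    if peak == 0:
        return [0 for _ in raw_values]
    # Truncate to whole numbers, so anything below 1% of the peak becomes 0.
    return [int(100 * value / peak) for value in raw_values]

print(scale_to_peak([20, 50, 200, 1]))  # [10, 25, 100, 0]
```

The key consequence is that a score only has meaning relative to the peak of the same query, not as an absolute search volume.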

The following function looks at the period from five years ago to today. For the selected year and month, it returns the mean value of the topic for that specific month. The value is averaged because `pytrends` returns one data point for every seventh day. This project assumes that the interest in a topic is constant throughout a month.

Notice that the function does not consider region as an input.

In [10]:
def Topic_Value(date, topic):
    # Request five years of interest data for a single topic
    pytrends.build_payload([topic], cat=0, timeframe='today 5-y')
    data = pytrends.interest_over_time()
    data = data.reset_index()

    # Group to only see year and month
    data['YearMonth'] = pd.to_datetime(data['date']).dt.strftime('%Y-%m')
    # Average the topic column for the entire month
    monthly = data.groupby('YearMonth')[topic].mean()
    # Look up the value for the requested month by label
    value = monthly.loc[date]

    # Pause between requests; should be increased if not run in Google Colab.
    time.sleep(3)
    return value
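
To see what the monthly averaging does without calling the API, here is a small sketch on made-up weekly values. It assumes the shape returned by `interest_over_time()`: a datetime index named `date` and one column per keyword.

```python
import pandas as pd

# Made-up weekly values, one row per week.
weekly = pd.DataFrame(
    {"brexit": [40, 30, 50, 20, 60]},
    index=pd.to_datetime(
        ["2019-09-01", "2019-09-08", "2019-09-15", "2019-09-22", "2019-10-06"]
    ).rename("date"),
)

# Group the weekly rows by their YYYY-MM label and average each month.
monthly = weekly.groupby(weekly.index.strftime("%Y-%m"))["brexit"].mean()
print(monthly.loc["2019-09"])  # 35.0
```

The four September weeks collapse into a single value, which is exactly the information loss accepted by the constant-interest assumption.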

The next code chunk takes the lists of topics and uses the `Topic_Value` function to pull a value for each topic, saving the values in a list in the same order as the topics. 

In [11]:
topic_value_oct = []
topic_value_sep = []

# Topics and values from September 2019 
for i in topic_09_list:
  topic_value_sep.append(Topic_Value('2019-09',i))
  # print(i,';', topic_value_sep[-1])

# Topics and values from October 2019 
for i in topic_10_list:
  topic_value_oct.append(Topic_Value('2019-10',i))
  # print(i,';', topic_value_oct[-1])
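
Because every topic triggers a request, a long loop can run into rate limiting. One possible safeguard is a small retry helper like the sketch below; `call_with_retry` is a hypothetical name, and catching every `Exception` is a simplification rather than the way `pytrends` reports errors.

```python
import time

def call_with_retry(func, *args, retries=3, wait_seconds=1.0):
    """Call func(*args); on failure, wait and try again, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise
            # Wait longer after each failed attempt.
            time.sleep(wait_seconds * attempt)
```

In the loops above, a call such as `call_with_retry(Topic_Value, '2019-09', i, wait_seconds=60)` would then replace the direct call to `Topic_Value`.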

The next code chunk combines the topics and values for the different months and merges them into the dataframe.

In [12]:
df_topics_1 = pd.DataFrame({'date': '2019-09','Topic1': topic_09_list,'Topic 1 Score': topic_value_sep})
df_topics_2 = pd.DataFrame({'date': '2019-09','Topic2': topic_09_list,'Topic 2 Score': topic_value_sep})
df_topics_3 = pd.DataFrame({'date': '2019-09','Topic3': topic_09_list,'Topic 3 Score': topic_value_sep})

df_topics_1_10 = pd.DataFrame({'date': '2019-10','Topic1': topic_10_list,'Topic 1 Score': topic_value_oct})
df_topics_2_10 = pd.DataFrame({'date': '2019-10','Topic2': topic_10_list,'Topic 2 Score': topic_value_oct})
df_topics_3_10 = pd.DataFrame({'date': '2019-10','Topic3': topic_10_list,'Topic 3 Score': topic_value_oct})

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df_topics_1 = pd.concat([df_topics_1, df_topics_1_10])
df_topics_2 = pd.concat([df_topics_2, df_topics_2_10])
df_topics_3 = pd.concat([df_topics_3, df_topics_3_10])

df = pd.merge(df, df_topics_1,  how='left', left_on=['date','Topic1'], right_on = ['date','Topic1'])
df = pd.merge(df, df_topics_2,  how='left', left_on=['date','Topic2'], right_on = ['date','Topic2'])
df = pd.merge(df, df_topics_3,  how='left', left_on=['date','Topic3'], right_on = ['date','Topic3'])
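
The same result can be reached with one long score table and a loop over the topic columns. The following is a minimal sketch on made-up data; the `articles` and `scores` frames and their values are illustrative only.

```python
import pandas as pd

articles = pd.DataFrame({
    "date": ["2019-09", "2019-10"],
    "Topic1": ["brexit", "flight"],
    "Topic2": ["flight", "brexit"],
})
# One long table of (date, topic, score); the scores are made up.
scores = pd.DataFrame({
    "date": ["2019-09", "2019-09", "2019-10", "2019-10"],
    "topic": ["brexit", "flight", "brexit", "flight"],
    "score": [37.8, 86.2, 45.0, 80.0],
})

for n in (1, 2):
    # Rename the generic columns so they line up with the n-th topic column.
    lookup = scores.rename(columns={"topic": f"Topic{n}", "score": f"Topic {n} Score"})
    articles = articles.merge(lookup, how="left", on=["date", f"Topic{n}"])

print(articles)
```

A left merge keeps every article, so topics without a score end up as `NaN` in the corresponding score column.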

In [14]:
# df.to_csv(data_path + 'dataset.csv')

A decision has been made to set the score to 0 wherever the topic itself is missing (`NaN`), as a score for a missing topic does not serve the desired purpose. 

In [15]:
df.loc[df['Topic1'].isnull(), 'Topic 1 Score'] = 0
df.loc[df['Topic2'].isnull(), 'Topic 2 Score'] = 0
df.loc[df['Topic3'].isnull(), 'Topic 3 Score'] = 0

## Reflection
The reflection on the Google Trends API starts by interpreting the values returned by the `Topic_Value` function.

In [23]:
print("Google Trend value for the topic 'Brexit' from September 2019:",Topic_Value('2019-09','brexit'))

Google Trend value for the topic 'Brexit' from September 2019: 37.8


As mentioned before, the Google Trends API returns a value between 0 and 100. A value of 37.8 for the topic Brexit is not particularly high. Looking at the graph below, which visualizes the interest in the topic Brexit over the last five years, it seems that even though Brexit might not have been the hottest topic, it was still very relevant at the time. 

In [38]:
topic = ['brexit']
pytrends.build_payload(topic, cat=0, timeframe='today 5-y') 
data = pytrends.interest_over_time() 
data = data.reset_index() 
fig = px.line(data, x="date", y=topic[0], title='Brexit Web Search Interest Over Time')
fig.show() 

In [25]:
print("Google Trend value for the topic 'flight' from September 2019:",Topic_Value('2019-09','flight'))

Google Trend value for the topic 'flight' from September 2019: 86.2


Looking at the value for the topic 'flight', it is much higher than for 'brexit'. Again, a visualization of the interest over the last five years is displayed below. However, the value is high because 'flight' is always a highly searched topic. The big decrease in searches in March 2020 could be due to Covid-19, which made people unable to travel on vacation or business trips. As Covid-19 slowly disappears from our lives, the demand for flights increases again.

In [28]:
topic = ['flight']
pytrends.build_payload(topic, cat=0, timeframe='today 5-y') 
data = pytrends.interest_over_time() 
data = data.reset_index() 
fig = px.line(data, x="date", y=topic[0], title='Flight Web Search Interest Over Time')
fig.show() 

The two topic examples raise a question about the usefulness of `pytrends`, because the metric suggests, according to our assumption, that an article about 'flight' will be more relevant than an article about 'brexit'. Neither `pytrends` nor Google Trends provides the actual number of searches for a topic, making it difficult to determine whether a value is high because the topic truly is relevant for the time period or because it is always a highly searched topic. However, while Google Trends has many limitations and makes it difficult to compare topic values, it is still included in the project as a way to enrich the data set, because the value still gives some indication of whether the topic is relevant or not. 
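
One possible way to separate "always searched" from "unusually searched", which is not part of this project, is to compare a month's score with the topic's own average over the whole period. The sketch below uses made-up monthly scores, and `relative_interest` is a hypothetical helper.

```python
import pandas as pd

def relative_interest(monthly_scores, month):
    """Ratio of one month's score to the topic's own average over the whole period."""
    baseline = monthly_scores.mean()
    return monthly_scores.loc[month] / baseline

# Made-up monthly scores: one steadily popular topic and one that spikes.
steady = pd.Series({"2019-07": 85.0, "2019-08": 86.0, "2019-09": 87.0})
spiking = pd.Series({"2019-07": 10.0, "2019-08": 12.0, "2019-09": 38.0})

print(round(relative_interest(steady, "2019-09"), 2))   # 1.01
print(round(relative_interest(spiking, "2019-09"), 2))  # 1.9
```

Under this view the steady topic scores close to 1 despite its high raw value, while the spiking topic stands out even though its raw value is lower.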

A final reflection on Google Trends, and possible further work, would be to look into the sources (news agencies). Many of the sources target a specific segment. For instance, The Irish Times would not be relevant for the population of Denmark, but highly relevant for Irish people. `pytrends` has an input field for region specification. Using it would make the value given by Google Trends more specific to the chosen region. The downside, however, is that many sources, such as Reuters, which is an international news agency, cannot be reduced to a single area. 
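
As a rough sketch of that further work, the region could be passed through the `geo` argument of `build_payload`. The source identifiers and the mapping below are hypothetical; an empty string means worldwide, which is the `pytrends` default.

```python
# Hypothetical mapping from news source to a Google Trends region code.
SOURCE_TO_GEO = {
    "the-irish-times": "IE",
    "reuters": "",  # international agency, so no single region
}

def geo_for_source(source):
    """Return the region code for a source, falling back to worldwide."""
    return SOURCE_TO_GEO.get(source, "")

def regional_interest(topic, source, timeframe="today 5-y"):
    """Query Google Trends for one topic, restricted to the source's region."""
    # Imported here so that the mapping above can be used without pytrends installed.
    from pytrends.request import TrendReq
    pytrends = TrendReq()
    pytrends.build_payload([topic], cat=0, timeframe=timeframe,
                           geo=geo_for_source(source))
    return pytrends.interest_over_time()
```

A call such as `regional_interest('brexit', 'the-irish-times')` would then return interest within Ireland only, while unmapped or international sources fall back to the worldwide figures.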