# Final Notebook - Advanced business analytics

In [3]:
!pip install pytrends
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
from matplotlib import cm
from PIL import Image
from IPython.display import Image
import os
import PIL
import glob
import time
import itertools
from pytrends.request import TrendReq
pytrends = TrendReq()

tqdm.pandas()
df = pd.read_csv("Data/articles_data.csv")
df=df.drop("Unnamed: 0",axis=1)
df=df.dropna()

df['date'] = df['published_at'].str.split('T', expand=True)[0]
df['date'] = df['date'].str.rsplit('-', n=1, expand=True)[0]

1. Intro
2. Data sources
- Kaggle dataset
- Scrape
- Google Trends

3. Exploratory
4. Topic modelling
 - Google Trends
5. Deep learning
6. Discussion
7. Conclusion

## Introduction


![alt text](Data/cover_pic.jpg "Title")

<a class="anchor" id="first-bullet"></a>
Social media platforms have had a significant impact on how news outlets distribute their articles. About half of U.S. adults (53%) say they get news from social media “often” or “sometimes,” and this use is spread out across a number of different sites, according to a Pew Research Center survey conducted Aug. 31-Sept. 7, 2020. Among 11 social media sites asked about as a regular source of news, Facebook sits at the top, with about a third (36%) of Americans getting news there regularly.[1] But which articles are shown to which users?

The Facebook algorithm, which decides what appears in your news feed, has long been a hot topic. The specific details of how it works are not known; however, there are supposedly three main ranking parameters.
* 1. Who posted it? Content from people or businesses affiliated with you will be prioritized.
* 2. The type of content. If you are more prone to click on videos, videos will be shown to you.
* 3. Interaction with the post. The feed will prioritize posts with a lot of engagement, especially from people you interact with a lot.[2]

So by the time an article is shown to the user, a long selection process has picked that exact one. But what determines whether the user will interact with the article: the cover picture, the title, the news outlet, or a combination of all three?

This notebook explores what makes people interact with news articles on Facebook. This can be useful for newspaper editors who want to ensure that the articles they choose to share are the ones that will gain the most traction.

We will try to build several models to predict user engagement, with a special focus on the article headline, the cover picture and the news outlet. To achieve this, several methods have been used: web scraping, topic modelling and deep learning. The main data file for this project is obtained from https://www.kaggle.com/szymonjanowski/internet-articles-data-with-users-engagement, and web scraping and API usage have been carried out to supplement the data set.

[1] - https://www.pewresearch.org/journalism/2021/01/12/news-use-across-social-media-platforms-in-2020/ \
[2] - https://blog.hootsuite.com/facebook-algorithm/

## Data Sources

In this project three data sources are used.

- A data set containing 10437 rows representing unique articles from September 2019 to October 2019. The data set contains 15 columns, of which 11 hold information about the article and 4 are metrics for different types of engagement with the article.

## Exploratory Analysis

## Topic modeling with BERT

The purpose of topic modeling on our dataset is to find keywords/topics for each article that can be put into the Google Trends API, to get an insight into how popular certain events were at a given time.

For topic modeling we will use BERTopic, made by Maarten Grootendorst, which is a Bidirectional Encoder Representations from Transformers (BERT) based topic modeling technique. Documentation can be found here: https://github.com/MaartenGr/BERTopic

The first step of BERTopic is converting the documents to numerical data. This is where BERT is used to extract embeddings based on the context of each word, using a pretrained language model. The second step is to reduce the dimensionality of the resulting embeddings to optimize the clustering process. BERTopic does this with the **UMAP** algorithm. After reducing the dimensionality of the document embeddings, BERTopic clusters the documents with **HDBSCAN**. Lastly, topic representations are extracted from the clusters of documents using a custom class-based variation of TF-IDF (c-TF-IDF).

We use BERTopic because it extracts embeddings based on the context of each word, in contrast to LDA's bag-of-words approach, which we believe can be important when building topics for articles.
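To make the last step more concrete, the following is a minimal sketch of the c-TF-IDF idea, assuming the formula given in the BERTopic documentation (term frequency within a topic, normalized by the topic's word count, times `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` the frequency of the term across all topics). The function name `c_tf_idf` and the toy documents are hypothetical; this is a simplified illustration, not BERTopic's own implementation.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Simplified c-TF-IDF. docs_per_topic maps a topic id to a list of documents."""
    # Treat all documents of a topic as one long "class document" and count its words
    class_counts = {t: Counter(" ".join(docs).split()) for t, docs in docs_per_topic.items()}

    # Frequency of every word across all topics
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per topic
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for t, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[t] = {
            word: (freq / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, freq in counts.items()
        }
    return scores

# Hypothetical toy corpus: "election" appears in both topics, the other words in one each
toy_docs = {
    0: ["trump election", "trump election"],
    1: ["brexit election", "brexit election"],
}
scores = c_tf_idf(toy_docs)
print(max(scores[0], key=scores[0].get))  # trump
print(max(scores[1], key=scores[1].get))  # brexit
```

Within each toy topic, "election" is as frequent as the topic-specific word, but it scores lower because it is common across topics. This is what lets the top words describe what is distinctive about a cluster.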

In [None]:
from bertopic import BERTopic
import pandas as pd
import numpy as np
import re, nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from nltk.stem import WordNetLemmatizer
import itertools
from typing import List
import plotly.graph_objects as go
from plotly.subplots import make_subplots

First we want to remove stopwords from our data and then lemmatize it. This will be done with the nltk library.

In [None]:
# Load data
df = pd.read_csv ('articles_data.csv')

# Drop rows with nan-values in specific columns
df = df.dropna(subset = ['Unnamed: 0', 'source_id', 'source_name', 'author', 'title',
       'description', 'url', 'published_at',
       'top_article', 'engagement_reaction_count', 'engagement_comment_count',
       'engagement_share_count', 'engagement_comment_plugin_count', 'content'])
# Reset index
df = df.reset_index(drop=True)


In [None]:
# Function for cleaning text
def cleaned_text(text, source_name):
    try:
        # Reuters content starts with a dateline such as "WASHINGTON (Reuters) - ";
        # keep only the text after the first hyphen
        if source_name == 'Reuters':
            text = re.split('-', text, maxsplit=1)[1]

        clean = re.sub(r"\n", " ", text)
        clean = clean.lower()
        clean = re.sub(r"[~.,%/:;?_&+*=!-]", " ", clean)
        clean = re.sub(r"\[.*?\]", "", clean)    # remove bracketed text
        clean = re.sub(r"[^a-z]", " ", clean)    # keep letters only
        clean = clean.lstrip()
        clean = re.sub(r"\s{2,}", " ", clean)    # collapse repeated whitespace
        clean = re.sub(r"\b\w\b", "", clean)     # drop single-character tokens
    except (TypeError, IndexError):
        # Non-string content, or a Reuters article without a hyphen
        clean = np.nan
    return clean

In [None]:
# Cleaning text
df["cleaned_content"] = df.apply(lambda x : cleaned_text(x['content'], x['source_name']), axis=1)

# Drop rows with nan in column cleaned_content
df = df.dropna(subset = ['cleaned_content'])
df = df.reset_index(drop=True)

In [None]:
stop=stopwords.words('english')
stop.append("say")
# Remove stopwords
df["stop_removed_content"]=df["cleaned_content"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [None]:
# Tokenize
df["tokenized"]=df["stop_removed_content"].apply(lambda x: nltk.word_tokenize(x))

In [None]:
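# Function for lemmatizing a list of tokens
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def word_lemmatizer(text):
    # pos='v' lemmatizes every token as a verb (e.g. "said" -> "say")
    lem_text = [lemmatizer.lemmatize(i, pos='v') for i in text]
    return lem_text

# Lemmatize
df["lemmatized"]=df["tokenized"].apply(lambda x: word_lemmatizer(x))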
df["lemmatize_joined"]=df["lemmatized"].apply(lambda x: ' '.join(x))

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count,cleaned_content,stop_removed_content,tokenized,lemmatized,lemmatize_joined
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0,the national transportation safety board said ...,national transportation safety board said tues...,"[national, transportation, safety, board, said...","[national, transportation, safety, board, say,...",national transportation safety board say tuesd...
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0,the states jobless rate fell to per cent last ...,states jobless rate fell per cent last month a...,"[states, jobless, rate, fell, per, cent, last,...","[state, jobless, rate, fell, per, cent, last, ...",state jobless rate fell per cent last month ac...
2,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0,han kwang song the first north korean football...,han kwang song first north korean footballer s...,"[han, kwang, song, first, north, korean, footb...","[han, kwang, song, first, north, korean, footb...",han kwang song first north korean footballer s...
3,5,abc-news,ABC News,The Associated Press,'This Tender Land' is an affecting story about...,"""This Tender Land"" by William Kent Krueger is ...",https://abcnews.go.com/Entertainment/wireStory...,,2019-09-03T15:56:49Z,"""This Tender Land: a Novel"" (Atria Books), by ...",0.0,0.0,0.0,0.0,0.0,this tender land novel atria books by william...,tender land novel atria books william kent kru...,"[tender, land, novel, atria, books, william, k...","[tender, land, novel, atria, book, william, ke...",tender land novel atria book william kent krue...
4,6,reuters,Reuters,Reuters Editorial,EU wants to see if lawmakers will block Brexit...,The European Union is waiting to see if Britis...,https://www.reuters.com/article/us-britain-eu-...,https://s2.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:25:41Z,LONDON (Reuters) - The European Union is waiti...,0.0,0.0,0.0,817.0,0.0,the european union is waiting to see if britis...,european union waiting see british lawmakers b...,"[european, union, waiting, see, british, lawma...","[european, union, wait, see, british, lawmaker...",european union wait see british lawmakers bloc...


We will now create our topic model with the BERTopic library. We will use a pre-trained embedding model named `all-MiniLM-L6-v2`, which is a general purpose model trained on more than 1 billion training pairs. Many other pre-trained embedding models are available at https://www.sbert.net/docs/pretrained_models.html. We chose `all-MiniLM-L6-v2` because it has a good trade-off between performance and speed.

In [None]:
# create model 

#model = BERTopic(verbose=True, embedding_model = 'all-MiniLM-L6-v2', calculate_probabilities = True, nr_topics="auto")
 
#topics, probabilities = model.fit_transform(df['lemmatize_joined'])

# Save model

#model.save("topics_model")

In [None]:
# Load model

model = BERTopic.load("topics_model")

topics, probabilities = model.transform(df['lemmatize_joined'])

Batches:   0%|          | 0/254 [00:00<?, ?it/s]

2022-04-29 13:06:51,624 - BERTopic - Reduced dimensionality with UMAP
2022-04-29 13:06:52,052 - BERTopic - Predicted clusters with HDBSCAN
2022-04-29 13:06:59,596 - BERTopic - Calculated probabilities with HDBSCAN


We can see that the model finds 114 topics, including topic -1, which collects all documents that were not assigned a topic. So, 2483 articles have not been assigned a topic.

BERTopic uses the clustering algorithm HDBSCAN (https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html), and a trait of this algorithm is that it doesn't force every document/article into a cluster. If no cluster can be found for a document, it is simply treated as an outlier.

In [None]:
print(f'Number of topics the model find: {len(model.get_topic_freq())}')
print(f'Number of articles with an assigned topic: {model.get_topic_freq().Count[1:].sum()}')
model.get_topic_freq()

Number of topics the model find: 114
Number of articles with an assigned topic: 5637


Unnamed: 0,Topic,Count
0,-1,2483
1,0,453
2,1,386
3,2,345
4,3,306
...,...,...
109,108,11
110,109,11
111,110,10
112,111,10


We could "force" outlier articles into topics by looking at the probability of each article belonging to each topic. That way, we can select, for each article, the topic with the highest probability. Thus, although our BERTopic model generates an outlier class, we can still assign these articles to an actual topic. However, after testing we could not justify the trade-off: with a probability threshold of 10%, only 396 outlier articles would be moved into a topic.

In [None]:
probability_threshold = 0.1
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probabilities]
# Calculate the number of new articles with topics
model.get_topic_freq().Count[0].sum()-new_topics.count(-1)
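The threshold rule used above can be checked on a small example. The sketch below uses a made-up probability matrix (`toy_probabilities` is hypothetical, not output from our model) to show which articles stay outliers and which are assigned a topic.

```python
import numpy as np

# Hypothetical topic probabilities: 3 articles x 3 topics
toy_probabilities = np.array([
    [0.02, 0.05, 0.01],   # no topic reaches the threshold -> stays an outlier (-1)
    [0.60, 0.10, 0.05],   # clearly topic 0
    [0.03, 0.04, 0.15],   # weak, but above the threshold -> topic 2
])

probability_threshold = 0.1
toy_topics = [
    int(np.argmax(prob)) if prob.max() >= probability_threshold else -1
    for prob in toy_probabilities
]
print(toy_topics)  # [-1, 0, 2]
```

Lowering the threshold assigns more articles, but to topics they only weakly belong to. This is the trade-off discussed above.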

If we look at the 12 largest topics, we can see some patterns. Topic 0 is about Donald Trump and topic 7 is about the protests in Hong Kong. However, there are also topics which aren't as clear and well defined as those two. The top 3 words of topic 10 are food, li and restaurant, which doesn't give a clear picture of an event at a given moment.

In [1]:
# Defining new visualize_barchart function 
# Modification of code from https://github.com/MaartenGr/BERTopic/blob/master/bertopic/plotting/_barchart.py
def visualize_barchart2(topic_model,
                       topics: List[int] = None,
                       top_n_topics: int = 12,
                       n_words: int = 5,
                       width: int = 250,
                       height: int = 250) -> go.Figure:
    colors = itertools.cycle(["#D55E00", "#0072B2", "#CC79A7", "#E69F00", "#56B4E9", "#009E73", "#F0E442"])

    # Select topics based on top_n and topics args
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = topic_model.get_topic_freq().Topic.to_list()[1:top_n_topics + 1]
    else:
        topics = topic_model.get_topic_freq().Topic.to_list()[1:7]

    # Initialize figure
    subplot_titles = [f"Topic {topic}" for topic in topics]
    columns = 4
    rows = int(np.ceil(len(topics) / columns))
    fig = make_subplots(rows=rows,
                        cols=columns,
                        shared_xaxes=False,
                        x_title = 'c-TF-IDF score',
                        y_title = 'top 5 words',
                        horizontal_spacing=.1,
                        vertical_spacing=.4 / rows if rows > 1 else 0,
                        subplot_titles=subplot_titles)

    # Add barchart for each topic
    row = 1
    column = 1
    for topic in topics:
        words = [word + "  " for word, _ in topic_model.get_topic(topic)][:n_words][::-1]
        scores = [score for _, score in topic_model.get_topic(topic)][:n_words][::-1]

        fig.add_trace(
            go.Bar(x=scores,
                   y=words,
                   orientation='h',
                   marker_color=next(colors)),
            row=row, col=column)

        if column == columns:
            column = 1
            row += 1
        else:
            column += 1

    # Stylize graph
    fig.update_layout(
        template="plotly_white",
        showlegend=False,
        title={
            'text': "<b>Topic Word Scores",
            'x': .5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
        },
        width=width*4,
        height=height*rows if rows > 1 else height * 1.3,
        hoverlabel=dict(
            bgcolor="white",
            font_size=16,
            font_family="Rockwell"
        ),
    )
    
    
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)
    
    return fig


In [None]:
visualize_barchart2(model)

Now we will give each article 3 keywords that can be put into the Google Trends API: the 3 most important words of the topic the article is clustered into.

In [None]:
# Making topic column
df['Topic'] = topics

def get_3_topic(data, model):
    topic1 = []
    topic2 = []
    topic3 = []
    
    for i in data:
        if i < 0:
            topic1.append(np.nan)
            topic2.append(np.nan)
            topic3.append(np.nan)
        else:
            topic1.append(model.get_topic(i)[0][0])
            topic2.append(model.get_topic(i)[1][0])
            topic3.append(model.get_topic(i)[2][0])

    return topic1, topic2, topic3

In [None]:
topic1, topic2, topic3 = get_3_topic(df['Topic'], model)

df['Topic1'] = topic1
df['Topic2'] = topic2
df['Topic3'] = topic3

In [None]:
df[['Topic1','Topic2','Topic3']].head()

Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,...,engagement_comment_plugin_count,cleaned_content,stop_removed_content,tokenized,lemmatized,lemmatize_joined,Topic,Topic1,Topic2,Topic3
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,...,0.0,the national transportation safety board said ...,national transportation safety board said tues...,"[national, transportation, safety, board, said...","[national, transportation, safety, board, say,...",national transportation safety board say tuesd...,5,crash,plane,flight
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,...,0.0,the states jobless rate fell to per cent last ...,states jobless rate fell per cent last month a...,"[states, jobless, rate, fell, per, cent, last,...","[state, jobless, rate, fell, per, cent, last, ...",state jobless rate fell per cent last month ac...,-1,,,
2,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",...,0.0,han kwang song the first north korean football...,han kwang song first north korean footballer s...,"[han, kwang, song, first, north, korean, footb...","[han, kwang, song, first, north, korean, footb...",han kwang song first north korean footballer s...,51,club,league,madrid
3,5,abc-news,ABC News,The Associated Press,'This Tender Land' is an affecting story about...,"""This Tender Land"" by William Kent Krueger is ...",https://abcnews.go.com/Entertainment/wireStory...,,2019-09-03T15:56:49Z,"""This Tender Land: a Novel"" (Atria Books), by ...",...,0.0,this tender land novel atria books by william...,tender land novel atria books william kent kru...,"[tender, land, novel, atria, books, william, k...","[tender, land, novel, atria, book, william, ke...",tender land novel atria book william kent krue...,-1,,,
4,6,reuters,Reuters,Reuters Editorial,EU wants to see if lawmakers will block Brexit...,The European Union is waiting to see if Britis...,https://www.reuters.com/article/us-britain-eu-...,https://s2.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:25:41Z,LONDON (Reuters) - The European Union is waiti...,...,0.0,the european union is waiting to see if britis...,european union waiting see british lawmakers b...,"[european, union, waiting, see, british, lawma...","[european, union, wait, see, british, lawmaker...",european union wait see british lawmakers bloc...,1,brexit,johnson,boris


### Partial conclusion

We obtained good topics for 5637 of our articles. However, 2483 articles were not assigned a topic, which is not ideal for modelling, so we need to assign a value to the articles without keywords. We also saw that forcing articles into topics did not give a large enough gain.

Another problem with assigning 3 keywords based on the top 3 words of the assigned topic is that for some topics the words' c-TF-IDF scores differ significantly. For instance, in topic 11 the c-TF-IDF score of "school" is twice as large as that of the other two words (students, parents), whereas the top words of topic 1 have almost the same c-TF-IDF score. Our model will not know this, which introduces some uncertainty into our data.

A third problem lies in the automation of assigning keywords to each article. If we look at the article "North Korean footballer Han joins Italian giants Juventus" and its assigned keywords, we can see that it is about football, which the keywords reflect. However, the player does not join Real Madrid, which the third keyword refers to, but Juventus.

In [None]:
print(df['title'][2])
# Keywords assigned to the same article (row 2)
print(df.loc[2, ['Topic1','Topic2', 'Topic3']])

## Google Trend
After the three topics for each article were identified, the goal was to create a metric for how relevant each topic was on the date the article was published. This is done under the assumption that people read articles that are relevant at the current time. If the topic is not 'popular' at the time of release, the article will not be well received by the end users, and user engagement will be lower. <br>
An example of this could be the topic of Covid-19. If you had written an article about Covid-19 back in February 2020, chances are that most people would have been very interested in it, whereas today far fewer people want to read more about Covid-19. <br>
This next section will provide each topic with a metric of how relevant it is at the time of release.

In order to determine a topic's relevance, `pytrends` is used, which is an unofficial API for Google Trends. Google Trends is a great tool for mapping what people are searching for, in real time.

In order to limit the computation needed, the three topic columns are combined and the `unique()` function is used to find every unique topic. This limits the number of requests made to the Google Trends API, since many of the topics appear in more than one topic column. <br>

`pytrends` is in many ways an easy and great tool, but it comes with many limitations. The timeframe in which a topic is investigated cannot reach back more than 10 years from today. Luckily, the oldest articles in this data set are from September 2019. The date format is also very restrictive, as values are only available for every seventh day. A decision was made to shorten the date format from the original `YYYY-MM-DD` to simply `YYYY-MM`. This reduced the number of dates to two (2019-09 and 2019-10). That is why the input of the function is only a single date string. <br>

A final remark about the function is that the topic "date" was not accepted as input to `pytrends`, which is why it was simply removed.

In [None]:
## September 2019
df_09 = df[df['date']=='2019-09']
# Unique keywords across the three topic columns; NaN (articles without a topic) is dropped
topic_09_list = pd.concat([df_09['Topic1'], df_09['Topic2'], df_09['Topic3']]).dropna().unique().tolist()
# topic_09_list.remove('date')

## October 2019
df_10 = df[df['date']=='2019-10']
# Unique keywords across the three topic columns; NaN (articles without a topic) is dropped
topic_10_list = pd.concat([df_10['Topic1'], df_10['Topic2'], df_10['Topic3']]).dropna().unique().tolist()
# topic_10_list.remove('date')

When it comes to the actual value returned by the API, `pytrends` again comes with its limitations. Lazarina Stoy (October 2021) describes the returned value as follows:

> Values are calculated on a scale from 0 to 100, where 100 is the location with the most popularity as a fraction of total searches in that location, a value of 50 indicates a location that is half as popular, and a value of 0 indicates a location where the term was less than 1% as popular as the peak. (Source: https://lazarinastoy.com/the-ultimate-guide-to-pytrends-google-trends-api-with-python/)

The following function looks at a time period from five years ago to today. Depending on the month and year selected, it returns the mean value of the topic for that specific month. The values are averaged because `pytrends` only returns a value for every seventh day. This project assumes that interest in a topic is constant throughout a month.

In [None]:
def Topic_Value(date, topic):
    # Initialize pytrends API request (the 5-year window is relative to the day the code is run)
    pytrends.build_payload([topic], cat=0, timeframe='today 5-y')
    data = pytrends.interest_over_time()
    data = data.reset_index()

    # Group to only see year and month
    data['YearMonth'] = pd.to_datetime(data['date']).dt.strftime('%Y-%m')
    # Average the topic's values over the entire month
    monthly = data.groupby('YearMonth')[topic].mean()
    # Find value for the topic in the requested month
    value = monthly.loc[date]

    # Pause between requests. Should be increased if not run in Google Colab.
    time.sleep(3)
    return value
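The monthly averaging step of `Topic_Value` can be tried offline, without sending requests to Google Trends. The sketch below assumes a made-up weekly series for the keyword `brexit` (the values are hypothetical and only mimic the shape of the data returned by `interest_over_time()`).

```python
import pandas as pd

# Hypothetical weekly interest values, one row for every seventh day
weekly = pd.DataFrame({
    "date": pd.date_range("2019-09-01", periods=8, freq="7D"),
    "brexit": [40, 60, 80, 100, 90, 70, 50, 30],
})

# Reduce each date to year and month, then average the weekly values per month
weekly["YearMonth"] = weekly["date"].dt.strftime("%Y-%m")
monthly = weekly.groupby("YearMonth")["brexit"].mean()

print(monthly.loc["2019-09"])  # 74.0 (mean of 40, 60, 80, 100, 90)
print(monthly.loc["2019-10"])  # 50.0 (mean of 70, 50, 30)
```

Selecting only the keyword column before calling `mean()` keeps non-numeric columns out of the aggregation, so the lookup with `.loc` returns a single number per month.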

The next code chunk takes the lists of topics and uses the `Topic_Value` function to pull a value for each topic, saving the values in a list in the same order as the topics.

In [None]:
topic_value_oct = []
topic_value_sep = []

# Topics and values from September 2019 
for i in topic_09_list:
  topic_value_sep.append(Topic_Value('2019-09',i))
  # print(i,';', topic_value_sep[-1])

# Topics and values from October 2019 
for i in topic_10_list:
  topic_value_oct.append(Topic_Value('2019-10',i))
  # print(i,';', topic_value_oct[-1])

The next code chunk combines the topics and their values for the two months and merges them into the dataframe.

In [None]:
df_topics_1 = pd.DataFrame({'date': '2019-09','Topic1': topic_09_list,'Topic 1 Score': topic_value_sep})
df_topics_2 = pd.DataFrame({'date': '2019-09','Topic2': topic_09_list,'Topic 2 Score': topic_value_sep})
df_topics_3 = pd.DataFrame({'date': '2019-09','Topic3': topic_09_list,'Topic 3 Score': topic_value_sep})

df_topics_1_10 = pd.DataFrame({'date': '2019-10','Topic1': topic_10_list,'Topic 1 Score': topic_value_oct})
df_topics_2_10 = pd.DataFrame({'date': '2019-10','Topic2': topic_10_list,'Topic 2 Score': topic_value_oct})
df_topics_3_10 = pd.DataFrame({'date': '2019-10','Topic3': topic_10_list,'Topic 3 Score': topic_value_oct})

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two months instead
df_topics_1 = pd.concat([df_topics_1, df_topics_1_10], ignore_index=True)
df_topics_2 = pd.concat([df_topics_2, df_topics_2_10], ignore_index=True)
df_topics_3 = pd.concat([df_topics_3, df_topics_3_10], ignore_index=True)

# Attach the score of each keyword to the articles, matched on month and keyword
df = pd.merge(df, df_topics_1, how='left', on=['date','Topic1'])
df = pd.merge(df, df_topics_2, how='left', on=['date','Topic2'])
df = pd.merge(df, df_topics_3, how='left', on=['date','Topic3'])
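To illustrate what the left merge does to articles without keywords, here is a small sketch with made-up frames (`articles` and `scores` and their values are hypothetical). An article whose keyword is missing keeps its row and simply receives a missing score.

```python
import pandas as pd

# Hypothetical articles: the third one has no assigned keyword
articles = pd.DataFrame({
    "date": ["2019-09", "2019-10", "2019-09"],
    "Topic1": ["brexit", "brexit", None],
})

# Hypothetical lookup table with one score per (month, keyword) pair
scores = pd.DataFrame({
    "date": ["2019-09", "2019-10"],
    "Topic1": ["brexit", "brexit"],
    "Topic 1 Score": [74.0, 50.0],
})

merged = articles.merge(scores, how="left", on=["date", "Topic1"])
print(merged)
```

The first two rows receive the score for their month, while the row without a keyword gets `NaN` in `Topic 1 Score`. These missing scores have to be handled before modelling.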

### Partial conclusion

## Deep Learning & Predictions

## Discussion

## Conclusion