# Pipeline 1: Google-To-DW pipeline

The aim of this notebook is to answer the question: is DW covering what customers want?

Approach: extract trending topics from Google and compare them with what DW covers.

<img src="../reports/illustrations/pipeline1.png" width=800 />

We tried 2 different approaches:

**Approach 1**: we used pre-trained models such as ChatGPT and zero-shot learning. \
This approach was less effective overall. Our attempts can be found in `pipeline2_playground_approach1_*.ipynb`

**Approach 2**: we trained our own models. \
The best-performing models are summarised here. Our other attempts can be found in `pipeline2_playground_approach2.ipynb`

<img src="../reports/illustrations/pipeline2_approaches.png" width=800 />

In [1]:
# Import useful libraries
import pandas as pd
import os
import sys

# Import functions from source folder
sys.path.append('../src/') 
from data.preprocess_keywords import make_cleaned_keywords_df
from data.make_datasets import get_data, get_daily_trending_searches

In [2]:
# Specify wanted time range
start_date = '2019-01-01'
end_date = '2019-02-01'

# Where data files will be stored
path_to_data_files = '../data/interim/'

# Extract trending topics from Google

In [9]:
# Extract trending topics from Google if the file does not exist; otherwise load it
# If requests fail with a 429 error (too many requests), change the header in make_datasets.py
# (https://stackoverflow.com/questions/50571317/pytrends-the-request-failed-google-returned-a-response-with-code-429#:~:text=I%20am%20trustworthy.-,Solution,Visit%20the%20Google%20Trend%20page%20and%20perform%20a%20search%20for,-a%20trend%3B%20it)

google_file = path_to_data_files + start_date + '_' + end_date + '_World_daily_trending_searches.json'

if not os.path.isfile(google_file):
    df_google = get_daily_trending_searches(path_to_data_files, start_date, end_date=end_date)
else:
    df_google = pd.read_json(google_file, orient='split', compression='infer')
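
The load-or-create pattern used in this cell (and again for the DW data) could be factored into a small helper. The following is a minimal sketch, not part of the project's source: `load_or_create` and its `build` callback are hypothetical names, and it assumes the builder returns a DataFrame that should be cached in the same `orient='split'` JSON layout that the cell above reads.

```python
import os
from typing import Callable

import pandas as pd


def load_or_create(path: str, build: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the DataFrame cached at `path`, building and caching it first if needed."""
    if not os.path.isfile(path):
        # Cache miss: build the data once and write it in 'split' orientation
        df = build()
        df.to_json(path, orient='split')
    # Always read back from disk so both code paths return identically typed data
    return pd.read_json(path, orient='split', compression='infer')
```

Under these assumptions, the builder would be a zero-argument function (for example a `lambda`) wrapping whatever call produces the data. Note that the project's own functions may already write their output files themselves, in which case the `to_json` call here would be redundant.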

# Load DW data

In [15]:
# Clean data file in specific date range
clean_data_file = '../data/interim/clean_keywords_' + start_date + '_' + end_date + '.json'

# Generates the clean data file if it does not exist
if not os.path.isfile(clean_data_file):

    # Path to raw data
    data_file = '../data/raw/CMS_2010_to_June_2022_ENGLISH.json'

    # Load and extract data within time range
    df_subset = get_data(data_file, start_date, end_date)

    # Cleans keywords and saves data as a dataframe
    make_cleaned_keywords_df(df_subset, start_date, end_date)


# Loads the clean data file
df_dw = pd.read_json(clean_data_file, orient='split', compression='infer')

# Remove rows with no category
df_dw.dropna(subset=['cleanFocusCategory'], inplace=True)
df_dw.reset_index(drop=True, inplace=True)

# Models: map google keywords to DW category

In [13]:
# data from DW
df_dw.head()

| | id | lastModifiedDate | Date | keywordStrings | cleanFocusParentCategory | cleanFocusCategory | teaser | keywordStringsCleanAfterFuzz |
|---|---|---|---|---|---|---|---|---|
| 0 | 46912921 | 2019-01-01T03:57:28.904Z | 2019-01-01 | [NASA, OSIRIS-REx, Bennu, asteroid] | Science | Science | The OSIRIS-REx spacecraft had arrived at the l... | [nasa, osiris-rex, bennu, asteroid] |
| 1 | 46911356 | 2019-01-01T06:11:50.527Z | 2019-01-01 | [English Channel, migration, boats, illegal im... | Law and Justice | Law and Justice | The UK is withdrawing patrol ships from overse... | [english channel, migration, boats, illegal im... |
| 2 | 46909694 | 2019-01-01T06:14:35.563Z | 2019-01-01 | [Brazil, Jair Bolsonaro, Chicago economics, Ha... | Politics | Politics | Brazil is inaugurating President Jair Bolsonar... | [brazil, jair bolsonaro, chicago economics, ha... |
| 3 | 46912694 | 2019-01-01T08:26:11.599Z | 2019-01-01 | [Japan, Tokyo, Harajuku, attack] | Law and Justice | Crime | A man with an "intent to murder" has driven a ... | [japan, tokyo, harajuku, attack] |
| 4 | 46910092 | 2019-01-01T09:05:00.736Z | 2019-01-01 | [Asia, Bangladesh, elections, Kamal Hossain, S... | Politics | Politics | In an exclusive interview with DW, Kamal Hossa... | [asia, bangladesh, elections, kamal hossain, s... |


In [16]:
# Data from Google
df_google.head()

| | value | formattedValue | link | topic_mid | topic_title | topic_type | date | location |
|---|---|---|---|---|---|---|---|---|
| 0 | 174300 | Breakout | /trends/explore?q=/m/02vxn&date=2019-01-02+201... | /m/02vxn | Film | Topic | 2019-01-02 | World |
| 24 | 39500 | Breakout | /trends/explore?q=/m/014dgf&date=2019-01-02+20... | /m/014dgf | Sales | Topic | 2019-01-02 | World |
| 23 | 39700 | Breakout | /trends/explore?q=/m/0jg24&date=2019-01-02+201... | /m/0jg24 | Image | Topic | 2019-01-02 | World |
| 22 | 39750 | Breakout | /trends/explore?q=/m/0mgkg&date=2019-01-02+201... | /m/0mgkg | Amazon.com | E-commerce company | 2019-01-02 | World |
| 21 | 39900 | Breakout | /trends/explore?q=/m/0glpjll&date=2019-01-02+2... | /m/0glpjll | Instagram | Social networking service | 2019-01-02 | World |
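
Before training a model, one simple baseline for mapping a Google topic to a DW category is a keyword lookup: assign each cleaned DW keyword to the category it most often appears under, then look up the lower-cased Google `topic_title`. This is an illustrative sketch only, not one of the models from this project. `build_keyword_lookup` and `map_topic` are hypothetical helpers, and the three toy rows stand in for `df_dw`; only the column names `keywordStringsCleanAfterFuzz` and `cleanFocusCategory` come from the DW data shown above.

```python
from collections import Counter, defaultdict

import pandas as pd


def build_keyword_lookup(df: pd.DataFrame) -> dict:
    """Map each cleaned keyword to the category it most often appears under."""
    counts = defaultdict(Counter)
    for keywords, category in zip(df['keywordStringsCleanAfterFuzz'], df['cleanFocusCategory']):
        for keyword in keywords:
            counts[keyword][category] += 1
    # Keep only the most frequent category for each keyword
    return {keyword: c.most_common(1)[0][0] for keyword, c in counts.items()}


def map_topic(topic_title: str, lookup: dict):
    """Return the category for a Google topic title, or None if no keyword matches."""
    return lookup.get(topic_title.lower())


# Toy stand-in for df_dw (assumption: real rows have the same two columns)
toy_dw = pd.DataFrame({
    'keywordStringsCleanAfterFuzz': [['nasa', 'asteroid'], ['brazil', 'jair bolsonaro'], ['nasa', 'moon']],
    'cleanFocusCategory': ['Science', 'Politics', 'Science'],
})
lookup = build_keyword_lookup(toy_dw)
print(map_topic('NASA', lookup))  # Science
print(map_topic('Film', lookup))  # None
```

An exact-match lookup like this leaves every topic without a matching keyword unmapped, as the `Film` example shows, which is one reason a learned model may be preferable.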


In [None]:
## Model 1

In [None]:
## Model 2

In [None]:
# Model comparison ?

# Compare trending topics and DW covered categories
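
One possible way to frame this comparison is to contrast the share of trending Google topics per category with the share of DW articles per category. The sketch below is a hedged illustration rather than the project's method: `coverage_gap` is a hypothetical helper, the two toy series are made up, and it assumes each Google topic has already been mapped to a DW category.

```python
import pandas as pd


def coverage_gap(google_categories: pd.Series, dw_categories: pd.Series) -> pd.DataFrame:
    """Share of Google topics and DW articles per category, plus their difference."""
    shares = pd.DataFrame({
        'google_share': google_categories.value_counts(normalize=True),
        'dw_share': dw_categories.value_counts(normalize=True),
    }).fillna(0.0)  # a category missing on one side has a share of 0
    shares['gap'] = shares['google_share'] - shares['dw_share']
    return shares.sort_values('gap', ascending=False)


# Toy inputs (assumption: one category label per trending topic / per article)
google_cats = pd.Series(['Culture', 'Culture', 'Sports', 'Politics'])
dw_cats = pd.Series(['Politics', 'Politics', 'Science', 'Culture'])
print(coverage_gap(google_cats, dw_cats))
```

In this framing, a positive `gap` marks a category that trends on Google more than DW covers it, and a negative `gap` marks the opposite.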