# Newspaper classification: ETL

This ETL notebook scrapes article titles from the websites of two major German newspapers, Süddeutsche Zeitung and Welt, for later feature engineering and classification.

## Setup

Relevant libraries are loaded:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from datetime import datetime
from project_lib import Project

## Scraping newspapers

Titles are scraped separately for each newspaper using the *BeautifulSoup* and *requests* libraries.

### Süddeutsche Zeitung (SZ)

In [2]:
# Scrape website
sz_url = "https://www.sueddeutsche.de/"
sz_html_page = requests.get(sz_url)
sz_soup = BeautifulSoup(sz_html_page.content, "html.parser")

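# Extract titles, skipping teasers that contain no title element
sz_teasers = sz_soup.find_all(class_="sz-teaser__text-container--s")
sz_titles = []
for teaser in sz_teasers:
    title_element = teaser.find("h3", class_="sz-teaser__title")
    # find() returns None when nothing matches, so guard before reading .text
    if title_element is not None:
        sz_titles.append(title_element.text)
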
# Save as df and initialise newspaper name and date
sz_df = pd.DataFrame(sz_titles, columns=['Title'])
sz_df['Newspaper'] = "SZ"
sz_df['DateTime'] = datetime.now()

sz_df.head(3)

Unnamed: 0,Title,Newspaper,DateTime
0,Wie die Bundeswehr nun helfen soll,SZ,2021-08-16 11:10:29.425187
1,Was die Rückkehr der Taliban bedeutet,SZ,2021-08-16 11:10:29.425187
2,"Söder: ""Man kann nicht sagen, dass alles perfe...",SZ,2021-08-16 11:10:29.425187


### Welt

In [3]:
# Scrape website
welt_url = "https://www.welt.de/"
welt_html_page = requests.get(welt_url)
welt_soup = BeautifulSoup(welt_html_page.content, "html.parser")

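# Extract titles, skipping teasers that contain no title link
welt_teasers = welt_soup.find_all(class_="c-teaser-default__body o-teaser__body")
welt_titles = []

for teaser in welt_teasers:
    title_element = teaser.find(attrs={'data-qa': 'Teaser.Link'})
    # find() returns None when nothing matches, so guard before reading .text
    if title_element is not None:
        welt_titles.append(title_element.text)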

# Save as df and initialise newspaper name and date
welt_df = pd.DataFrame(welt_titles, columns=['Title'])
welt_df['Newspaper'] = "Welt"
welt_df['DateTime'] = datetime.now()

welt_df.head(3)

Unnamed: 0,Title,Newspaper,DateTime
0,"„Biden sagte: ,Sch*** drauf, das ist doch nic...",Welt,2021-08-16 11:10:29.814123
1,Deutsche Botschaft warnte offenbar lange erfo...,Welt,2021-08-16 11:10:29.814123
2,Man steht ohnmächtig vor dieser kolossalen Ni...,Welt,2021-08-16 11:10:29.814123
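
Both scraping cells follow the same pattern: find the teaser containers, then pull one title element out of each. As a minimal sketch, that pattern could be factored into a helper that works on an HTML string, which also makes it easy to try without a network request. The function name `extract_titles`, its parameters, and the sample markup below are hypothetical and not part of the original notebook; the class names in the sample are made up and do not mirror either website.

```python
from bs4 import BeautifulSoup


def extract_titles(html, container_class, title_tag, title_class):
    """Return the title text of every teaser that contains a title element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for teaser in soup.find_all(class_=container_class):
        title_element = teaser.find(title_tag, class_=title_class)
        # Teasers without a matching element yield None and are skipped
        if title_element is not None:
            titles.append(title_element.get_text(strip=True))
    return titles


# Hypothetical markup used only for illustration
sample_html = """
<div class="teaser"><h3 class="teaser__title"> First headline </h3></div>
<div class="teaser"><p>No title here</p></div>
<div class="teaser"><h3 class="teaser__title">Second headline</h3></div>
"""

titles = extract_titles(sample_html, "teaser", "h3", "teaser__title")
print(titles)
```

The sketch matches by tag and class, as the SZ cell does. The Welt cell matches on a `data-qa` attribute instead, so it would need a variant that passes `attrs=` to `find`.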


## Concatenation

The data from *SZ* and *Welt* is combined into a single DataFrame:

In [4]:
df = pd.concat([sz_df, welt_df])
df.head()

Unnamed: 0,Title,Newspaper,DateTime
0,Wie die Bundeswehr nun helfen soll,SZ,2021-08-16 11:10:29.425187
1,Was die Rückkehr der Taliban bedeutet,SZ,2021-08-16 11:10:29.425187
2,"Söder: ""Man kann nicht sagen, dass alles perfe...",SZ,2021-08-16 11:10:29.425187
3,Rein in etablierte Machtstrukturen,SZ,2021-08-16 11:10:29.425187
4,"Neuer Impfstoff, neue Ängste",SZ,2021-08-16 11:10:29.425187


## Updating Data

The newly scraped data is combined with the data saved in earlier runs:

In [5]:
# The code was removed by Watson Studio for sharing.

In [6]:
project = Project(project_id = project_id, project_access_token = project_token)

In [7]:
old_file = project.get_file("Newspaper_Data.csv")
old_file.seek(0)
old_df = pd.read_csv(old_file)

In [8]:
df = pd.concat([df, old_df])
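
One thing to watch when combining the two sources: the freshly scraped frame holds `datetime` objects, whereas `pd.read_csv` returns the `DateTime` column as plain strings unless told otherwise. The sketch below, which uses a small in-memory CSV as a hypothetical stand-in for `Newspaper_Data.csv`, shows one possible way to keep the column's type consistent by passing `parse_dates`, and to get a clean running index with `ignore_index=True`. The sample rows are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the stored CSV file
old_csv = io.StringIO(
    "Title,Newspaper,DateTime\n"
    "Example headline,SZ,2021-08-15 09:00:00.000000\n"
)
# parse_dates converts the column to datetimes instead of strings
old_df = pd.read_csv(old_csv, parse_dates=["DateTime"])

# Hypothetical stand-in for the newly scraped data
new_df = pd.DataFrame({
    "Title": ["Another headline"],
    "Newspaper": ["Welt"],
    "DateTime": [pd.Timestamp("2021-08-16 11:10:29")],
})

# New rows first, old rows second; ignore_index renumbers the rows 0..n-1
combined = pd.concat([new_df, old_df], ignore_index=True)
print(combined.dtypes)
```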

Titles that appear more than once for the same newspaper (i.e., that were scraped on multiple occasions) are dropped. Because the older data is concatenated after the new data, `keep='last'` retains the oldest entry:

In [9]:
df = df.drop_duplicates(subset=['Title', 'Newspaper'], keep='last')
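
To illustrate how the row order interacts with `keep='last'`, here is a small sketch on invented data. The frames and headlines are hypothetical; only the `concat` order and the `drop_duplicates` call mirror the cells above.

```python
import pandas as pd

# Hypothetical newly scraped rows
new_df = pd.DataFrame({
    "Title": ["Headline A", "Headline B"],
    "Newspaper": ["SZ", "SZ"],
    "DateTime": ["2021-08-16 11:10:29", "2021-08-16 11:10:29"],
})
# Hypothetical previously saved rows; "Headline A" was already seen a day earlier
old_df = pd.DataFrame({
    "Title": ["Headline A"],
    "Newspaper": ["SZ"],
    "DateTime": ["2021-08-15 09:00:00"],
})

# New rows come first, so the last occurrence of a duplicate is the older one
combined = pd.concat([new_df, old_df], ignore_index=True)
deduped = combined.drop_duplicates(subset=["Title", "Newspaper"], keep="last")
print(deduped)
```

Note that this relies on the concatenation order. If the frames were combined the other way round, `keep='first'` would be needed to retain the older entry.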

## Saving

Data is saved for feature engineering and other downstream processing:

In [10]:
project.save_data(
    file_name="Newspaper_Data.csv",
    data=df.to_csv(index=False),
    set_project_asset=True,
    overwrite=True,
)

{'file_name': 'Newspaper_Data.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'newspaperclassification-donotdelete-pr-htkved6zqhsjcj',
 'asset_id': '87fae567-81e7-4f99-954e-a18d2da8c26e'}