# A tutorial on several methods of saving data while using an automatic scraper with GitHub Actions

We'll start by downloading the English-language news page of https://www.lrt.lt (English because I do not know Lithuanian).

In [1]:
import requests
from bs4 import BeautifulSoup

In [6]:
response = requests.get("https://www.lrt.lt/en/news-in-english")
# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
doc = BeautifulSoup(response.content, "html.parser")

In [7]:
items = doc.select(".news")
len(items)

42

In [9]:
#items[0]

In [22]:
articles = []
for item in items:
    headline = item.select_one('h3').text
    url = item.select_one('a')['href']
    img = item.select_one('img').get('data-src', None)
    article = {
        'url': url,
        'headline': headline,
        'img_path': img
    }
    articles.append(article)
len(articles)

42
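The loop above assumes every `.news` item contains an `h3`, an `a` and an `img`. If one of them is missing, `select_one` returns `None` and the next attribute access raises an error. A minimal sketch of a more forgiving version, run against made-up HTML rather than the real page (the markup below is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the second item has no <img>
html = """
<div class="news"><h3>First</h3><a href="/a"></a><img data-src="/a.jpg"></div>
<div class="news"><h3>Second</h3><a href="/b"></a></div>
"""
doc = BeautifulSoup(html, "html.parser")

articles = []
for item in doc.select(".news"):
    headline = item.select_one("h3")
    link = item.select_one("a")
    img = item.select_one("img")
    # select_one returns None when nothing matches, so check before using it
    articles.append({
        "url": link.get("href") if link is not None else None,
        "headline": headline.get_text(strip=True) if headline is not None else None,
        "img_path": img.get("data-src") if img is not None else None,
    })

print(articles)
```

Missing pieces end up as `None` in the dictionary, which pandas will store as empty cells instead of stopping the scraper.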

In [23]:
import pandas as pd

df = pd.DataFrame(articles)
df.head()

| | url | headline | img_path |
|---|---|---|---|
| 0 | /en/news-in-english/19/2050571/lithuanian-pm-v... | Lithuanian PM voices confidence in defence min... | /img/2023/02/18/1451044-637891-150x84.jpg |
| 1 | /en/news-in-english/19/2050524/lithuania-deems... | Lithuania deems 1,164 Belarusian and Russian n... | /img/2022/03/01/1207094-733403-150x84.jpg |
| 2 | /en/news-in-english/19/2050519/vilnius-ex-mayo... | Vilnius ex-mayor Šimašius returns to private s... | /img/2023/04/17/1491828-404400-150x84.jpg |
| 3 | /en/news-in-english/19/2050432/lithuania-s-sup... | Lithuania's support to Ukraine includes helico... | /img/2023/07/20/1555673-67448-150x84.jpg |
| 4 | /en/news-in-english/19/2050380/latvia-to-ask-t... | Latvia to ask thousands of Russian citizens to... | /img/2019/08/05/485338-154587-150x84.jpg |
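The `url` and `img_path` values in the output above are site-relative paths. If you want links that work outside the site, one option is the standard library's `urljoin`. This is only a sketch: the `make_absolute` helper is a hypothetical name, the example row is made up, and it assumes `https://www.lrt.lt` is the right base for both links and images.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.lrt.lt"  # assumption: links and images share this base


def make_absolute(path, base=BASE_URL):
    """Return an absolute URL, leaving missing values (None) untouched."""
    if path is None:
        return None
    return urljoin(base, path)


# Made-up row in the same shape as the scraped dictionaries
article = {
    "url": "/en/news-in-english/19/0000000/example-story",
    "headline": "Example headline",
    "img_path": None,
}
article["url"] = make_absolute(article["url"])
article["img_path"] = make_absolute(article["img_path"])
print(article["url"])
```

`urljoin` leaves values that are already absolute URLs unchanged, so it is safe to apply to a mixed column.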


In [19]:
# items[0]

## What to do with the results?

### Approach 1: Save it to a single CSV file

Just like https://github.com/jsoma/automatic-scraper-bbc or https://github.com/laurabejder/european_energy_prices

I want a CSV that always holds the most current list of headlines. If the file is committed on every run, Git can also show me a diff of what changed each time (time travel).

In [28]:
df.to_csv("current-headlines.csv", index=False)
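Git will show line-by-line diffs of the CSV between commits. If you would rather see the changes from inside Python, a rough sketch is to compare the previous file with the new scrape on the `url` column. The `new_rows` helper and the two tiny dataframes below are made up for illustration:

```python
import pandas as pd


def new_rows(previous, current, key="url"):
    """Rows of `current` whose key value does not appear in `previous`."""
    return current[~current[key].isin(previous[key])]


# Made-up stand-ins for the last saved CSV and the fresh scrape
previous = pd.DataFrame({"url": ["/a", "/b"], "headline": ["A", "B"]})
current = pd.DataFrame({"url": ["/b", "/c"], "headline": ["B", "C"]})

added = new_rows(previous, current)
print(added)
```

In the real scraper, `previous` would come from `pd.read_csv("current-headlines.csv")` before the file is overwritten.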

### Approach 2: Save a different file every time we run the scraper

This works well for data you collect weekly, monthly, or even daily, as long as you don't mind having hundreds of files in a single folder. It makes it easy to browse to a specific date (or date and time) in your dataset.

If Laura also wanted to save daily data for https://github.com/laurabejder/european_energy_prices she could do this (judging by her filenames, I think she has done it a couple of times).

> **WARNING:** You should probably only do this one with daily or a-few-times-a-day scripts, not every minute. That would create too many files!

In [51]:
import os

# Try to create a folder called 'data'
# and if it exists DON'T THROW AN ERROR
os.makedirs("data", exist_ok=True)

In [52]:
from datetime import datetime

# This would keep track down to the second
datetime.now().strftime("%Y-%m-%d_%H.%M.%S")

# This only does the day
date_string = datetime.now().strftime("%Y-%m-%d")
filepath = f"data/{date_string}.csv"

df.to_csv(filepath, index=False)
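The two `strftime` formats above can be wrapped in one small function so the scraper can switch between daily and per-run filenames. This is a sketch; `snapshot_path` and its parameters are hypothetical names, not part of the original notebook:

```python
from datetime import datetime
from pathlib import Path


def snapshot_path(now, folder="data", include_time=False):
    """Build data/<date>.csv, or data/<date>_<time>.csv when include_time is True."""
    fmt = "%Y-%m-%d_%H.%M.%S" if include_time else "%Y-%m-%d"
    return Path(folder) / f"{now.strftime(fmt)}.csv"


# A fixed moment so the example is repeatable; use datetime.now() in the scraper
moment = datetime(2023, 8, 4, 9, 30, 0)
daily = snapshot_path(moment)
per_run = snapshot_path(moment, include_time=True)
print(daily)
print(per_run)
```

Keep in mind that with the date-only name, a second run on the same day overwrites the first file. The name that includes the time avoids that, at the cost of more files.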

### Approach 3: Appending to an existing CSV file

Use this when each scrape is a snapshot of a point in time and you want to keep every snapshot together in one file. It is a terrible fit for headlines, but a great fit for Kelly's project here: https://github.com/kellywaldro/processing-times/

In [56]:
# Add a new column for today's date

# You could just use datetime.now() and get the entire datetime
# or another strftime but this is fine
df['scrape_date'] = datetime.now().strftime("%Y-%m-%d")
df.head()

| | url | headline | img_path | scrape_date |
|---|---|---|---|---|
| 0 | /en/news-in-english/19/2050571/lithuanian-pm-v... | Lithuanian PM voices confidence in defence min... | /img/2023/02/18/1451044-637891-150x84.jpg | 2023-08-04 |
| 1 | /en/news-in-english/19/2050524/lithuania-deems... | Lithuania deems 1,164 Belarusian and Russian n... | /img/2022/03/01/1207094-733403-150x84.jpg | 2023-08-04 |
| 2 | /en/news-in-english/19/2050519/vilnius-ex-mayo... | Vilnius ex-mayor Šimašius returns to private s... | /img/2023/04/17/1491828-404400-150x84.jpg | 2023-08-04 |
| 3 | /en/news-in-english/19/2050432/lithuania-s-sup... | Lithuania's support to Ukraine includes helico... | /img/2023/07/20/1555673-67448-150x84.jpg | 2023-08-04 |
| 4 | /en/news-in-english/19/2050380/latvia-to-ask-t... | Latvia to ask thousands of Russian citizens to... | /img/2019/08/05/485338-154587-150x84.jpg | 2023-08-04 |


Check what already exists

In [59]:
# If it exists, open it
# If it doesn't, just make a blank dataframe
# could also use os.path.exists to check if the file exists
# but honestly try/except is the easiest route to go here
try:
    existing_df = pd.read_csv("always-updated.csv")
except FileNotFoundError:
    # Only a missing file means "start fresh"; other errors should still surface
    existing_df = pd.DataFrame([])
existing_df.head()
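For comparison, the `os.path.exists` route mentioned in the comments might look like the sketch below. The `load_existing` helper is a hypothetical name, and the example writes to a temporary folder so it does not touch a real `always-updated.csv`:

```python
import os
import tempfile

import pandas as pd


def load_existing(path):
    """Read the CSV if it is there, otherwise start from an empty dataframe."""
    if os.path.exists(path):
        return pd.read_csv(path)
    return pd.DataFrame([])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "always-updated.csv")
    first = load_existing(path)   # file is missing, so we get an empty frame
    pd.DataFrame({"url": ["/a"]}).to_csv(path, index=False)
    second = load_existing(path)  # file exists now, so we get its one row

print(len(first), len(second))
```

Both routes behave the same on the first run. The explicit check simply makes the "no file yet" case visible in the code.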

In [61]:
# Combine our new dataframe and our old dataframe
# ignore_index=True renumbers the rows from 0 instead of keeping each frame's own index
combined = pd.concat([df, existing_df], ignore_index=True)
combined.head()

| | url | headline | img_path | scrape_date |
|---|---|---|---|---|
| 0 | /en/news-in-english/19/2050571/lithuanian-pm-v... | Lithuanian PM voices confidence in defence min... | /img/2023/02/18/1451044-637891-150x84.jpg | 2023-08-04 |
| 1 | /en/news-in-english/19/2050524/lithuania-deems... | Lithuania deems 1,164 Belarusian and Russian n... | /img/2022/03/01/1207094-733403-150x84.jpg | 2023-08-04 |
| 2 | /en/news-in-english/19/2050519/vilnius-ex-mayo... | Vilnius ex-mayor Šimašius returns to private s... | /img/2023/04/17/1491828-404400-150x84.jpg | 2023-08-04 |
| 3 | /en/news-in-english/19/2050432/lithuania-s-sup... | Lithuania's support to Ukraine includes helico... | /img/2023/07/20/1555673-67448-150x84.jpg | 2023-08-04 |
| 4 | /en/news-in-english/19/2050380/latvia-to-ask-t... | Latvia to ask thousands of Russian citizens to... | /img/2019/08/05/485338-154587-150x84.jpg | 2023-08-04 |


In [63]:
combined.to_csv("always-updated.csv", index=False)
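One thing to watch with appending: if the scraper runs twice on the same day, the same rows get added twice. A possible guard, sketched here with made-up dataframes, is to drop rows that repeat the same `url` and `scrape_date` before saving. Whether those two columns are the right definition of "duplicate" depends on your data, so treat that choice as an assumption.

```python
import pandas as pd

# Made-up stand-ins for the saved file and the fresh scrape
existing_df = pd.DataFrame({
    "url": ["/a", "/b"],
    "headline": ["A", "B"],
    "scrape_date": ["2023-08-03", "2023-08-04"],
})
df = pd.DataFrame({
    "url": ["/b", "/c"],
    "headline": ["B", "C"],
    "scrape_date": ["2023-08-04", "2023-08-04"],
})

combined = pd.concat([df, existing_df], ignore_index=True)

# Keep the first copy of each (url, scrape_date) pair and renumber the rows
deduped = combined.drop_duplicates(
    subset=["url", "scrape_date"], ignore_index=True
)
print(deduped)
```

Here `/b` was scraped twice on 2023-08-04, so only one of those rows survives, while `/a` from the day before is kept.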