How do the raw HTML files in selkouutiset-scrape get cleaned up into nice, easy-to-read Markdown files? This repository is how, using a combination of shell scripts and (as the needs of the project grew) Python as well.
Here I'm going to outline how to do each of these steps by hand. If you want to run the whole process, you can just run the shell script `run.sh` (which is what I do).
```fish
# Safe to run over and over! Try it out!
cd /tmp && rm -rf selkouutiset-scrape-cleaned/
git clone https://github.com/hiAndrewQuinn/selkouutiset-scrape-cleaned.git
cd selkouutiset-scrape-cleaned

fish update.fish
rm .hash
rm -rf 2024/02  # does NOT work with all earlier versions out of the box, sadly.
fish create-markdown-versions.fish

cat languages.txt | while read -l lang
    fd '.*.fi.md$' | python translation-code/markdown2json.py --target-lang=$lang
end

# ☁️ gcloud + Google Translate API ONLY; the `curl` is just to test.
set GCP_SELKOUUTISET_ARCHIVE_PROJECT 'andrews-selkouutiset-archive'

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "x-goog-user-project: $GCP_SELKOUUTISET_ARCHIVE_PROJECT" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @2023/11/11/_request.fi.en.json \
    "https://translation.googleapis.com/language/translate/v2" | jq .

# Now we actually send the requests to the cloud.
fish translation-code/generate-translations.fish $GCP_SELKOUUTISET_ARCHIVE_PROJECT

fd '_response\...\...\.json' | python translation-code/json2markdown.py
```
The bulk of this Git repo is contained in the `YYYY/MM/DD` folders. Each `YYYY/MM/DD` folder contains `2*n` things, where `n` is the number of languages we are translating to (currently just Finnish and English).

- `index.source.md` is the Selkouutiset article, converted to Markdown, and translated into the appropriate language if necessary. `source` is the language the article is in, as defined by the ISO 639-1 standard.
  - `2023/10/25/index.fi.md` is the Finnish-language version of the Selkouutiset articles from 2023.10.25.
  - `2023/11/11/index.en.md` is the English-language version of the Selkouutiset articles from 2023.11.11.
- `_request.source.target.json` is a JSON file, generated from `index.source.md` by a Python script, which contains the JSON request we send to the Google Translation API. `source` should match `jq '.source'` for the JSON file; `target` should match `jq '.target'`.
  - As an example, this file would be called `_request.fi.en.json`:

```json
{
  "source": "fi",
  "target": "en",
  "q": [
    "Tämä on testi."
  ],
  "format": "text"
}
```
- `_response.source.target.json` is a JSON file, which contains the JSON response we get back from sending `_request.source.target.json` to the Google Translation API. The `source` and `target` in the filename match those of the corresponding `_request` file.
  - As an example, this file would be called `_response.fi.en.json`:

```json
{
  "data": {
    "translations": [
      {
        "translatedText": "This is a test."
      }
    ]
  }
}
```
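Taken together, the request and response shapes above are easy to produce and consume in a few lines of Python. This is a simplified sketch with hypothetical helper names, not the repo's actual `markdown2json.py` or `json2markdown.py`:

```python
import json

def build_request(markdown_text: str, source: str, target: str) -> dict:
    # Sketch only: each non-empty Markdown line becomes one "q" entry, and
    # "format": "text" tells the API not to treat the input as HTML.
    lines = [line for line in markdown_text.splitlines() if line.strip()]
    return {"source": source, "target": target, "q": lines, "format": "text"}

def extract_translations(response: dict) -> list[str]:
    # Walk the v2 response shape shown above and pull out the translated strings.
    return [t["translatedText"] for t in response["data"]["translations"]]

request = build_request("Tämä on testi.\n", "fi", "en")
print(json.dumps(request, ensure_ascii=False))

response = {"data": {"translations": [{"translatedText": "This is a test."}]}}
print(extract_translations(response))  # ['This is a test.']
```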
`index.source.md` doesn't actually clue you in as to which translation was used to generate it. For this simple project that's not a big deal, because I'm not interested in running my JSON requests through `fi`, then `ar`, then `es`, then `fr`, then `en`, just to mangle `index.en.md` up. But if you're doing something more complicated, you might want to keep track of this.
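If you did want to keep track, one lightweight approach (a sketch, not something this repo does) is to stamp the chain of languages into the file's front matter:

```python
def with_translation_chain(markdown: str, chain: list[str]) -> str:
    # Prepend minimal front matter recording e.g. ["fi", "en"], so a reader
    # can see which languages the text passed through on its way here.
    front_matter = "---\ntranslation-chain: " + " -> ".join(chain) + "\n---\n\n"
    return front_matter + markdown

print(with_translation_chain("This is a test.\n", ["fi", "en"]))
```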
`selkouutiset-scrape-cleaned` uses `selkouutiset-scrape` as a Git submodule. So the first thing to do on a fresh clone is to run

```fish
fish update.fish
```

while in the root of the `cleaned` repo. This will both initialize and update the submodules for us.

`update.fish` defaults to doing nothing if the Git repo in question has any uncommitted work, so it's pretty safe to run on a loop.
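That guard can be sketched in Python, assuming the check is something like `git status --porcelain` (hypothetical helpers, not the actual `update.fish`):

```python
import subprocess

def worktree_is_clean(repo_path: str) -> bool:
    # `git status --porcelain` prints one line per modified or untracked file,
    # so empty output means the worktree is clean and safe to update.
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == ""

def update_submodules_if_clean(repo_path: str) -> bool:
    # Mirrors the guard described above: refuse to touch a dirty worktree.
    if not worktree_is_clean(repo_path):
        return False
    subprocess.run(
        ["git", "submodule", "update", "--init"],
        cwd=repo_path, check=True,
    )
    return True
```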
Once we have a fresh set of HTML, we can then run

```fish
fish create-markdown-versions.fish
```

again while in the root of the `cleaned` repo, to run all of our HTML files through the `pandoc` and `sed` filters that eventually produce our nice and clean `index.fi.md` files.
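To give a flavor of what those filters do, here's a hypothetical regex-based cleanup pass in Python. The rules below are illustrative assumptions about the kind of thing pandoc output needs, not the repo's actual `sed` expressions:

```python
import re

# Hypothetical cleanup passes: pandoc's HTML-to-Markdown output tends to
# carry attribute blocks, backslash-escaped punctuation, and runs of blank
# lines that we don't want in the final files.
CLEANUP_RULES = [
    (r"\{[#.][^}]*\}", ""),    # strip attribute blocks like {#id .class}
    (r"\\([<>\"'])", r"\1"),   # unescape punctuation pandoc backslash-escapes
    (r"[ \t]+\n", "\n"),       # drop trailing whitespace left behind
    (r"\n{3,}", "\n\n"),       # collapse runs of blank lines
]

def clean_markdown(text: str) -> str:
    for pattern, replacement in CLEANUP_RULES:
        text = re.sub(pattern, replacement, text)
    return text.strip() + "\n"

print(clean_markdown("# Otsikko {#main .title}\n\n\n\nTeksti\n"))
```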
In an automated environment, I usually run `create-markdown-versions.fish` and then immediately commit the changes:

```fish
git add -A
set timestamp (date -u)
git commit -m "Latest data: $timestamp" || exit 0
git push
```
Experience has taught me not to put this `git commit` code into `create-markdown-versions.fish` itself. ;)

Like `update.fish`, this is also safe to run on a loop.
Alright, here's where things get a bit tricky. We have a bunch of Finnish-language Markdown files, but we want to create English-language versions of them (or Spanish-, or Farsi-, or what have you). As a former cloud guy, I like working with any of the Big Three, and in this case I decided to go with Google. So, in order to do the translations yourself, you need to have a Google Cloud account and a Google Cloud project with the Google Translation API enabled. Here are the API docs if that sounds fun to you.
There are two Python files in `translation-code/`: `markdown2json.py` and `json2markdown.py`. The easiest way to use them is by piping in the names of the files you wish to transform with the `fd` command:

```fish
cat languages.txt | while read -l lang
    fd '.*.fi.md$' | python translation-code/markdown2json.py --target-lang=$lang
end
```
To test whether your `gcloud` CLI is set up properly, you can run the following command:

```fish
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "x-goog-user-project: andrews-selkouutiset-archive" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @2023/11/11/_request.fi.en.json \
    "https://translation.googleapis.com/language/translate/v2"
```
If you get back something that looks like JSON-wrapped translated text, you're in the clear! Run

```fish
fish translation-code/generate-translations.fish $GCP_SELKOUUTISET_ARCHIVE_PROJECT
```

to send all of the JSON requests to the cloud and save the responses. (This is also safe to run on a loop: if e.g. a `_request.fi.en.json` file in a given `YYYY/MM/DD` folder has already been processed, it won't be sent to the cloud again, and you won't be charged.)
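That idempotency guard can be sketched as follows. The specific rule here, skipping any request whose matching `_response` file already sits next to it, is my assumption about how `generate-translations.fish` decides what to skip:

```python
from pathlib import Path

def requests_to_send(root: str) -> list:
    # Collect request files that still need a response: a
    # _request.<source>.<target>.json is only worth sending if the matching
    # _response.<source>.<target>.json doesn't exist beside it yet.
    pending = []
    for request in sorted(Path(root).rglob("_request.*.*.json")):
        response = request.with_name(
            request.name.replace("_request.", "_response.", 1)
        )
        if not response.exists():
            pending.append(request)
    return pending
```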
The grand finale. Take all of those `_response`s you just generated and run them through the opposite script, `json2markdown.py`:

```fish
fd '_response\...\...\.json' | python translation-code/json2markdown.py
```