Commit

add aoc-harvester project
mgtezak committed May 13, 2024
1 parent 286be6a commit 383f631
Showing 7 changed files with 311 additions and 9 deletions.
17 changes: 17 additions & 0 deletions de/projekte.html
@@ -68,6 +68,23 @@ <h2><a href="projekte/wettervorhersage.html">Wettervorhersage<br /> in Australie
<!-- Posts -->
<section class="posts">

<article>
<header>
<span class="date">13. Mai 2024</span>
<h2><a href="projekte/aoc-harvester.html">Advent of Code Data Harvester
</a></h2>
<p>
Automatische Erfassung und Speicherung der öffentlichen Daten von
<a href="https://adventofcode.com">Advent of Code</a>.
Cron Jobs, GitHub Actions, Requests, BeautifulSoup, SQLite
</p>
</header>
<a href="projekte/aoc-harvester.html" class="image fit"><img src="../images/aoc-harvester.jpg" alt="image: aoc-harvester" /></a>
<ul class="actions special">
<li><a href="projekte/aoc-harvester.html" class="button">Mehr</a></li>
</ul>
</article>

<article>
<header>
<span class="date">25. April 2024</span>
144 changes: 144 additions & 0 deletions de/projekte/aoc-harvester.html
@@ -0,0 +1,144 @@
<!DOCTYPE HTML>
<!--
Massively by HTML5 UP
html5up.net | @ajlkn
Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)
-->
<html>
<head>
<title>Michael Tezak Portfolio</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<link rel="stylesheet" href="../../assets/css/main.css" />
<link rel="icon" type="image/svg+xml" href="../../assets/icons/favicon.svg" />
<noscript><link rel="stylesheet" href="../../assets/css/noscript.css" /></noscript>
</head>

<body class="is-preload">

<!-- Wrapper -->
<div id="wrapper">

<!-- Header -->
<header id="header"></header>

<!-- Nav -->
<nav id="nav">
<ul class="links">
<li class="active"><a href="../projekte.html">Projekte</a></li>
<li><a href="../qualifikationen.html">Qualifikationen</a></li>
<li><a href="../weitere-interessen.html">Weitere-Interessen</a></li>
</ul>
<ul class="icons">
<li><a href="https://github.com/mgtezak" target="_blank" class="icon brands fa-github"><span class="label">GitHub</span></a></li>
<li><a href="https://www.linkedin.com/in/mgtezak/" target="_blank" class="icon brands fa-linkedin"><span class="label">LinkedIn</span></a></li>
</ul>
</nav>

<!-- Main -->
<div id="main">

<!-- Post -->
<section class="post">
<header class="major">
<!-- <span class="date">1. November 2023</span> -->
<h1>Advent of Code Data Harvester</h1>
<p>
Automatische Erfassung und Speicherung der öffentlichen Daten von
<a href="https://adventofcode.com">Advent of Code</a>.
<br/>
Cron Jobs, GitHub Actions, Requests, BeautifulSoup, SQLite
</p>
</header>
<div class="image main"><img src="../../images/aoc-harvester.jpg" alt="image: aoc-harvester" /></div>

<p>
Der AoC Data Harvester besteht aus zwei automatisierten Skripten,
die zu verschiedenen Zeiten laufen und verschiedene Arten von Daten sammeln.
Beide sind so konfiguriert, dass sie den
<a href="https://old.reddit.com/r/adventofcode/wiki/faqs/automation/" target="_blank">Web-Scraping-Richtlinien von AoC</a>
entsprechen.
Konkret bedeutet dies, dass aufeinanderfolgende Requests im Abstand von 15 Minuten gesendet werden
und jeder Request einen Header enthält, der zu meinem
<a href="https://github.com/mgtezak/AoC_Data_Harvester" target="_blank">öffentlichen Repository</a>
zurückverlinkt, um maximale Transparenz zu gewährleisten.
Die Automatisierung wird durch <i>GitHub Actions</i> ermöglicht, was überraschend einfach einzurichten ist.
</p>
<h4>1. Titel- und Bestenlisten-Scraper</h4>
<p>
Dieses Skript läuft täglich um 7 Uhr (UTC) vom 1. bis zum 25. Dezember,
also immer 2 Stunden nachdem ein neues Puzzle veröffentlicht wird.
In Cron-Syntax wird dies zu <code>0 7 1-25 12 *</code>.
Zu diesem Zeitpunkt wird die Bestenliste mit sehr hoher Wahrscheinlichkeit gefüllt sein.
Sobald sie für ein bestimmtes Rätsel gefüllt ist, ändert sie sich danach nicht mehr,
was bedeutet, dass diese Daten zusammen mit dem Rätseltitel nur einmal erfasst werden müssen.
</p>
<h4>2. Abschlussraten-Scraper</h4>
<p>
Da die Anzahl der Personen, die ein bestimmtes Puzzle lösen,
mit der Zeit zunimmt, ist das Erfassen dieser Abschlussraten keine einmalige Aufgabe,
sondern eine fortlaufende.
Daten dieser Art eignen sich gut für Zeitreihenanalysen.
Ich habe beschlossen, dass das Skript das ganze Jahr über täglich um 12 Uhr mittags (UTC) laufen wird.
In Cron-Syntax entspricht dies <code>0 12 * * *</code>.
Glücklicherweise werden die Abschlusszahlen jährlich gruppiert,
sodass ich nur einen Request pro Jahr benötige, in dem AoC bisher stattfand.
Derzeit sind das (2024 - 2015 =) 9 Requests pro täglichem Durchlauf.
Mit 15 Minuten zwischen den einzelnen Requests ergibt dies 2 Stunden, die das Skript für einen Durchlauf braucht.
</p>
<p>
Alle gesammelten Daten werden in einer SQLite-Datenbank gespeichert und in vordefinierte Tabellen eingefügt:
<code>puzzles</code> für die Titel (1 Zeile pro Rätsel),
<code>leaderboard</code> für die Benutzernamen und Abschlusszeiten der 100 schnellsten Teilnehmer
bei jedem Rätselteil (200 Zeilen pro Rätsel) und schließlich <code>stats</code>
für die Abschlussraten (1 Zeile pro Rätsel pro Zeitscheibe).
Falls du dich fragst, was ich mit diesen Daten anfange, schau dir mein
<a href="aoc-analytics.html">AoC Datenanalyse</a>-Projekt an.
Da ich jedoch erst kürzlich begonnen habe, die Abschlussraten täglich zu aktualisieren, wird noch einiges folgen.



</p>

<ul class="actions special">
<li><a href="https://github.com/mgtezak/AoC_Data_Harvester" target="_blank" class="button primary">Zum GitHub Repo</a></li>
</ul>
</section>

<footer>
<div class="pagination">
<!-- <a href="pomodino.html" class="previous">Prev</a> -->
<a href="pomodino.html" class="next">Next</a>
</div>
</footer>

</div>

<!-- Footer -->
<footer id="footer">
<section class="split contact">
<section>
<h3>Kontakt</h3>
<p><a>mgtezak@gmail.com</a></p>
<p><span class="language-toggle"><a href="../../en/projects/aoc-harvester.html">English version</a></span></p>
</section>
</section>
</footer>

<!-- Copyright -->
<div id="copyright"></div>

</div>

<!-- Scripts -->
<script src="../../assets/js/jquery.min.js"></script>
<script src="../../assets/js/jquery.scrollex.min.js"></script>
<script src="../../assets/js/jquery.scrolly.min.js"></script>
<script src="../../assets/js/browser.min.js"></script>
<script src="../../assets/js/breakpoints.min.js"></script>
<script src="../../assets/js/util.js"></script>
<script src="../../assets/js/main.js"></script>


</body>
</html>
1 change: 1 addition & 0 deletions de/projekte/pomodino.html
@@ -78,6 +78,7 @@ <h1>PomoDino Timer</h1>

<footer>
<div class="pagination">
<a href="aoc-harvester.html" class="previous">Prev</a>
<a href="aoc-analytics.html" class="next">Next</a>
</div>
</footer>
19 changes: 10 additions & 9 deletions en/projects.html
@@ -68,20 +68,21 @@ <h2><a href="projekte/wettervorhersage.html">Wettervorhersage<br /> in Australie
<!-- Posts -->
<section class="posts">

<!-- <article>
<article>
<header>
<span class="date">15. January 2024</span>
<h2><a href="projects/aoc-solver.html">Advent of Code Data Analytics</a></h2>
<span class="date">13. May 2024</span>
<h2><a href="projects/aoc-harvester.html">Advent of Code Data Harvester</a></h2>
<p>
Collecting, analyzing and visualizing public statistics from <a href="https://adventofcode.com/", target="_blank">Advent of Code</a>
BeautifulSoup, Streamlit, Pandas, Numpy, Matplotlib, Seaborn
Automatic collection and storage of
<a href="https://adventofcode.com">Advent of Code</a>'s public data.
Cron Jobs, GitHub Actions, Requests, BeautifulSoup, SQLite
</p>
</header>
<a href="projects/aoc-solver.html" class="image fit"><img src="../images/aoc_user_info.png" alt="image: aoc-solver" /></a>
<a href="projects/aoc-harvester.html" class="image main"><img src="../images/aoc-harvester.jpg" alt="image: aoc-harvester" /></a>
<ul class="actions special">
<li><a href="projects/aoc-solver.html" class="button">More</a></li>
</ul>
</article> -->
<li><a href="projects/aoc-harvester.html" class="button">More</a></li>
</ul>
</article>

<article>
<header>
138 changes: 138 additions & 0 deletions en/projects/aoc-harvester.html
@@ -0,0 +1,138 @@
<!DOCTYPE HTML>
<!--
Massively by HTML5 UP
html5up.net | @ajlkn
Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)
-->
<html>
<head>
<title>Michael Tezak Portfolio</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<link rel="stylesheet" href="../../assets/css/main.css" />
<link rel="icon" type="image/svg+xml" href="../../assets/icons/favicon.svg" />
<noscript><link rel="stylesheet" href="../../assets/css/noscript.css" /></noscript>
</head>

<body class="is-preload">

<!-- Wrapper -->
<div id="wrapper">

<!-- Header -->
<header id="header"></header>

<!-- Nav -->
<nav id="nav">
<ul class="links">
<li class="active"><a href="../projects.html">Projects</a></li>
<li><a href="../technical-skills.html">Technical-Skills</a></li>
<li><a href="../other-interests.html">Other-Interests</a></li>
</ul>
<ul class="icons">
<li><a href="https://github.com/mgtezak" target="_blank" class="icon brands fa-github"><span class="label">GitHub</span></a></li>
<li><a href="https://www.linkedin.com/in/mgtezak/" target="_blank" class="icon brands fa-linkedin"><span class="label">LinkedIn</span></a></li>
</ul>
</nav>

<!-- Main -->
<div id="main">

<!-- Post -->
<section class="post">
<header class="major">
<!-- <span class="date">1. November 2023</span> -->
<h1>Advent of Code Data Harvester</h1>
<p>
Automatic collection and storage of
<a href="https://adventofcode.com">Advent of Code</a>'s
public data.
<br/>
Cron Jobs, GitHub Actions, Requests, BeautifulSoup, SQLite
</p>
</header>
<div class="image main"><img src="../../images/aoc-harvester.jpg" alt="image: aoc-harvester" /></div>

<!-- <span class="image right"><img src="../../images/aoc_public_completions2.png" alt="" /></span> -->

<p>
The AoC Data Harvester consists of two automated scripts,
which run at different times and collect different types of data.
Both are configured to comply with
<a href="https://old.reddit.com/r/adventofcode/wiki/faqs/automation/" target="_blank">AoC's web scraping guidelines</a>,
meaning that consecutive requests are spaced 15 minutes apart
and each request includes a header linking back to my
<a href="https://github.com/mgtezak/AoC_Data_Harvester" target="_blank">public repo</a> for maximum transparency.
The automation runs on <i>GitHub Actions</i>, which is surprisingly easy to set up.
</p>
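<p>
A minimal sketch of how such polite requests could be prepared. It assumes the link back to the repo is sent as the <code>User-Agent</code> header and uses the standard library instead of Requests so that it runs without extra packages. The helper names <code>build_request</code> and <code>throttled</code> are hypothetical and not taken from the repo.
</p>

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator

REPO_URL = "https://github.com/mgtezak/AoC_Data_Harvester"
DELAY_SECONDS = 15 * 60  # 15 minutes between consecutive requests


def build_request(url: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that identifies the scraper.

    Assumption: the repo link goes into the User-Agent header.
    """
    return urllib.request.Request(url, headers={"User-Agent": REPO_URL})


def throttled(
    urls: Iterable[str],
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[urllib.request.Request]:
    """Yield one prepared request per URL, pausing `delay` seconds between them."""
    for index, url in enumerate(urls):
        if index > 0:  # no pause before the very first request
            sleep(delay)
        yield build_request(url)
```

<p>
Passing <code>sleep</code> as a parameter keeps the pause logic testable: a test can record the waits instead of actually sleeping for 15 minutes.
</p>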
<h4>1. Title &amp; Leaderboard Scraper</h4>
<p>
This script runs daily at 7 a.m. (<i>UTC</i>) from the 1st to the 25th of December,
which is 2 hours after each new puzzle is released.
In Cron syntax, this translates to
<code>0 7 1-25 12 *</code>.
By that point the leaderboard will almost certainly have filled up.
Once it has filled up for a given puzzle, it won't ever change again, meaning that these data,
along with the puzzle title, need to be scraped only once.
</p>
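<p>
To make the cron expression concrete, here is a small sketch that checks whether a given UTC timestamp matches <code>0 7 1-25 12 *</code> and derives the release time of the puzzle that run would scrape. The function names are hypothetical; the actual scheduling is done by GitHub Actions, not by code like this.
</p>

```python
from datetime import datetime, timedelta, timezone


def matches_leaderboard_cron(now: datetime) -> bool:
    """True if `now` (expected in UTC) matches the schedule `0 7 1-25 12 *`."""
    return (
        now.minute == 0              # field 1: minute 0
        and now.hour == 7            # field 2: hour 7
        and now.day in range(1, 26)  # field 3: days 1 through 25
        and now.month == 12          # field 4: December
    )                                # field 5 (`*`): any day of the week


def release_time(run: datetime) -> datetime:
    """The puzzle scraped by a run was released 2 hours before that run."""
    return run - timedelta(hours=2)


first_run = datetime(2024, 12, 1, 7, 0, tzinfo=timezone.utc)
print(matches_leaderboard_cron(first_run), release_time(first_run).hour)
```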
<h4>2. Completion Stats Scraper</h4>
<p>
Since the number of people who complete a given puzzle increases over time,
scraping these completion rates is not a one-time job, but a continuous one.
This type of data lends itself well to time-series analysis later on.
I decided that the script will run at noon
(<i>UTC</i>) throughout the entire year.
In Cron syntax, this translates to <code>0 12 * * *</code>.
Luckily, the completion numbers are grouped annually,
so I only need one request per event year since 2015, which is when AoC started to collect the data.
Currently that's 9 requests per daily run (2015 through 2023); with a 15-minute pause between consecutive requests, the 8 pauses add up to 2 hours in total.
</p>
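<p>
The arithmetic behind the 2 hours can be sketched as follows. The URL pattern <code>https://adventofcode.com/{year}/stats</code> and both function names are assumptions made for illustration, and the year range assumes the run happens before the current year's event has started.
</p>

```python
FIRST_YEAR = 2015
DELAY_MINUTES = 15


def stats_urls(current_year: int) -> list[str]:
    """One stats page per past event year, from 2015 up to last year.

    Assumption: the current year's event has not started yet.
    """
    return [
        f"https://adventofcode.com/{year}/stats"
        for year in range(FIRST_YEAR, current_year)
    ]


def run_duration_minutes(n_requests: int, delay: int = DELAY_MINUTES) -> int:
    """Pauses only occur between requests, so n requests need n - 1 pauses."""
    return max(n_requests - 1, 0) * delay


urls = stats_urls(2024)
print(len(urls), run_duration_minutes(len(urls)))
```

<p>
For 2024 this gives 9 URLs and 8 pauses of 15 minutes, which is 120 minutes of waiting, not counting the time the requests themselves take.
</p>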
<p>
All collected data is inserted into predefined tables in an SQLite database file:
<code>puzzles</code> for the titles (1 row per puzzle),
<code>leaderboard</code> for the user names and completion times of the 100 fastest competitors on each puzzle part (200 rows per puzzle)
and finally <code>stats</code> for the completion rates (1 row per puzzle per time slice).
If you're wondering what I do with this data, check out my <a href="aoc-analytics.html">AoC Data Analytics</a>
project. However, since I've only just started to update the completion rates daily, there is more to come.
</p>
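<p>
A rough sketch of what such a three-table layout could look like with Python's built-in <code>sqlite3</code> module. Only the table names come from the description above; every column name and the <code>init_db</code> helper are hypothetical and may differ from the actual schema in the repo.
</p>

```python
import sqlite3

# Hypothetical columns; only the three table names are from the project description.
SCHEMA = """
CREATE TABLE IF NOT EXISTS puzzles (
    year INTEGER, day INTEGER, title TEXT,
    PRIMARY KEY (year, day)                 -- 1 row per puzzle
);
CREATE TABLE IF NOT EXISTS leaderboard (
    year INTEGER, day INTEGER, part INTEGER, place INTEGER,
    username TEXT, finish_time TEXT,
    PRIMARY KEY (year, day, part, place)    -- 2 parts x 100 places per puzzle
);
CREATE TABLE IF NOT EXISTS stats (
    year INTEGER, day INTEGER, scraped_on TEXT,
    both_parts INTEGER, first_part_only INTEGER,
    PRIMARY KEY (year, day, scraped_on)     -- 1 row per puzzle per time slice
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the predefined tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

<p>
The composite primary keys mirror the row counts mentioned above, so inserting the same leaderboard entry or time slice twice would be rejected instead of silently duplicated.
</p>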
<ul class="actions special">
<li><a href="https://github.com/mgtezak/AoC_Data_Harvester" target="_blank" class="button primary">See GitHub Repo</a></li>
</ul>
</section>
<footer>
<div class="pagination">
<!-- <a href="pomodino.html" class="previous">Prev</a> -->
<a href="pomodino.html" class="next">Next</a>
</div>
</footer>
</div>

<!-- Footer -->
<footer id="footer">
<section class="split contact">
<section>
<h3>Contact</h3>
<p><a>mgtezak@gmail.com</a></p>
<p><span class="language-toggle"><a href="../../de/projekte/aoc-harvester.html">Deutsche Version</a></span></p>

</section>
</section>
</footer>

<!-- Copyright -->
<div id="copyright"></div>

</div>

<!-- Scripts -->
<script src="../../assets/js/jquery.min.js"></script>
<script src="../../assets/js/jquery.scrollex.min.js"></script>
<script src="../../assets/js/jquery.scrolly.min.js"></script>
<script src="../../assets/js/browser.min.js"></script>
<script src="../../assets/js/breakpoints.min.js"></script>
<script src="../../assets/js/util.js"></script>
<script src="../../assets/js/main.js"></script>


</body>
</html>
1 change: 1 addition & 0 deletions en/projects/pomodino.html
@@ -75,6 +75,7 @@ <h1>PomoDino Timer</h1>
</section>
<footer>
<div class="pagination">
<a href="aoc-harvester.html" class="previous">Prev</a>
<a href="aoc-analytics.html" class="next">Next</a>
</div>
</footer>
Binary file added images/aoc-harvester.jpg