Project submission schedule:
project | task | deadline |
---|---|---|
data preparation | EDA (:earth_africa:, 1, 2) | |
course credit | Aggregation Pipeline (3) | 23.04.2017 |
exam | MapReduce (4) | 21.05.2017 |
Add the link to your private repository with the task solutions to the appropriate place in the projects.md file.
Link to the repository template.
- Git Tips – most commonly used git tips and tricks
- Awesome Public Datasets
- Emoji cheat sheet – for use in the documentation.
More data:
- Watching Our World Unfold:
- NOAA – National Centers for Environmental Information (US)
Provide public access to scripts, runs, and results:
- Version control all custom scripts:
  - avoid writing code where a standard tool will do
  - write thin scripts that chain standard tools and UNIX commands together
- Avoid manual data manipulation steps:
  - use a build system, for example make, and have all results produced automatically by build targets
  - if it’s not automated, it’s not part of the project: if you have an idea for a graph or an analysis, automate its generation
- Use a markup language, for example Markdown or AsciiDoc (processed with Asciidoctor), to create reports for analysis and presentation output products.
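The "thin script" rule can be illustrated with a short pipeline. This is only a sketch: the input file, its `borough,cuisine` layout, and the output name `by_borough.csv` are made up for the example and do not come from the course data.

```shell
#!/bin/sh
# Thin script: no custom parsing code, only standard UNIX tools chained with pipes.
set -eu

workdir=$(mktemp -d)

# Hypothetical sample input: one "borough,cuisine" record per line.
cat > "$workdir/restaurants.csv" <<'EOF'
Manhattan,Italian
Brooklyn,Pizza
Manhattan,Pizza
Queens,Chinese
Manhattan,Italian
EOF

# Count records per borough, most frequent first.
# The awk step assumes single-word borough names, as in the sample above.
cut -d, -f1 "$workdir/restaurants.csv" \
  | sort \
  | uniq -c \
  | sort -rn \
  | awk '{ print $2 "," $1 }' > "$workdir/by_borough.csv"

cat "$workdir/by_borough.csv"
```

Each stage does one job, so any stage can be swapped or inspected on its own by cutting the pipeline at that point.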
Plus two more rules:
- Record all intermediate results, when possible in standardized formats.
- Connect textual statements to underlying results.
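One possible reading of these two rules, sketched under assumptions: the intermediate result is written to a tab-separated file, and the sentence in the report is generated from that file instead of being typed by hand. The author names and the file names `comments_per_author.tsv` and `report.md` are hypothetical.

```shell
#!/bin/sh
set -eu

outdir=$(mktemp -d)

# Hypothetical raw input: one comment author per line.
printf '%s\n' alice bob alice carol alice bob > "$outdir/authors.txt"

# Intermediate result, recorded in a standard format (tab-separated values with a header).
sort "$outdir/authors.txt" | uniq -c | sort -rn \
  | awk -v OFS='\t' 'BEGIN { print "author", "comments" } { print $2, $1 }' \
  > "$outdir/comments_per_author.tsv"

# The statement in the report is derived from the recorded file, so it cannot drift from the data.
awk -F'\t' 'NR == 2 { printf "The most active author is %s with %d comments (source: comments_per_author.tsv).\n", $1, $2 }' \
  "$outdir/comments_per_author.tsv" > "$outdir/report.md"

cat "$outdir/report.md"
```

Rerunning the script after the input changes regenerates both the table and the sentence that cites it.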
Three more links:
- The Quartz guide to bad data – an exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
- How to share data with a statistician – it is critical that you include the rawest form of the data that you have access to.
- A sample (and interesting) project in SQL: Georgios Gousios. Assessment of the pull based development model, as implemented by Github
The compressed file RC_2015-01.bz2 takes up 5_452_413_560 B on disk, i.e. about 5.5 GB. Each line of the file is one JSON object: a Reddit comment with its text, author, etc. There should be 53_851_542 comments/JSON objects in total.
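Because every line holds exactly one JSON object, the document count can be checked by counting lines in the decompressed stream, without unpacking the archive to disk. A minimal sketch, assuming `bzip2` and `bunzip2` are installed; `sample.bz2` and its three records are made up and stand in for RC_2015-01.bz2.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
sample="$workdir/sample.bz2"   # hypothetical stand-in for RC_2015-01.bz2

# Three JSON objects, one per line, compressed with bzip2.
printf '%s\n' \
  '{"author":"a","body":"first"}' \
  '{"author":"b","body":"second"}' \
  '{"author":"c","body":"third"}' \
  | bzip2 --stdout > "$sample"

expected=3
# Decompress to stdout and count lines; the unpacked data never touches the disk.
actual=$(bunzip2 --stdout "$sample" | wc -l | tr -d ' ')

if [ "$actual" -eq "$expected" ]; then
  echo "OK: $actual documents"
else
  echo "MISMATCH: expected $expected, got $actual" >&2
  exit 1
fi
```

For the real dump, `expected` would be set to 53851542 and `sample` would point at RC_2015-01.bz2.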
bunzip2 --stdout RC_2015-01.bz2 | head -1 | jq .
time bunzip2 --stdout RC_2015-01.bz2 | rl --count 1000 > RC_2015-01_1000.json
# real ∞ s
# user ∞ s
# sys 0m12 s
time bunzip2 -c RC_2015-01.bz2 | mongoimport --drop --host 127.0.0.1 -d test -c reddit
# 2015-10-09T19:49:35.698+0200 test.reddit 29.5 GB
# 2015-10-09T19:49:35.698+0200 imported 53851542 documents
# real 38m40.629s
# user 56m37.200s
# sys 1m17.074s
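For the sampling step (`rl --count 1000`), one alternative that is not part of the original notes is reservoir sampling: a single pass over the stream that keeps only k lines in memory. The sketch below implements it in awk; the function name `reservoir_sample` and the optional seed argument are assumptions made for this example, and the quality of the sample depends on the `rand()` of the awk in use.

```shell
#!/bin/sh
set -eu

# reservoir_sample K [SEED]: print K lines chosen uniformly at random from stdin.
# Only K lines are held in memory, so the input can be an arbitrarily long stream.
reservoir_sample() {
  awk -v k="$1" -v seed="${2:-0}" '
    BEGIN { srand(seed) }
    NR <= k { keep[NR] = $0; next }   # fill the reservoir with the first k lines
    {
      j = int(rand() * NR) + 1        # uniform integer in 1..NR
      if (j <= k) keep[j] = $0        # replace a kept line with probability k/NR
    }
    END {
      n = (NR < k) ? NR : k
      for (i = 1; i <= n; i++) print keep[i]
    }
  '
}

# Usage on a made-up stream of 1000 numbered lines.
sample=$(seq 1 1000 | reservoir_sample 5 42)
printf '%s\n' "$sample"
```

Applied to the Reddit dump, the call would look like `bunzip2 --stdout RC_2015-01.bz2 | reservoir_sample 1000 > RC_2015-01_1000.json`.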
The file primer-dataset.json contains information about restaurants in New York City.
wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
gzip --stdout primer-dataset.json > primer-dataset.json.gz
rm primer-dataset.json
gunzip -c primer-dataset.json.gz | shuf -n 1 # on macOS: brew install coreutils, then use gshuf
gunzip -c primer-dataset.json.gz | rl -c 1 # on macOS: brew install randomize-lines
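Since the shuffle command is called `shuf` on Linux and `gshuf` when coreutils comes from Homebrew on macOS, a script can pick whichever is present. A minimal sketch; `sample.json.gz` and its three documents are made up and stand in for primer-dataset.json.gz.

```shell
#!/bin/sh
set -eu

# Pick whichever shuffle command exists: shuf (GNU coreutils on Linux)
# or gshuf (coreutils installed through Homebrew on macOS).
if command -v shuf >/dev/null 2>&1; then
  SHUF=shuf
elif command -v gshuf >/dev/null 2>&1; then
  SHUF=gshuf
else
  echo "neither shuf nor gshuf found" >&2
  exit 1
fi

workdir=$(mktemp -d)
# Hypothetical stand-in for primer-dataset.json.gz: three one-line JSON documents.
printf '%s\n' '{"name":"A"}' '{"name":"B"}' '{"name":"C"}' \
  | gzip -c > "$workdir/sample.json.gz"

# Draw one random document straight from the compressed file.
picked=$(gunzip -c "$workdir/sample.json.gz" | "$SHUF" -n 1)
echo "$picked"
```

The same `$SHUF` variable can then replace the hard-coded `shuf` or `gshuf` in any later pipeline.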
Avoid saving files to disk: pipe the data straight into the import.
# curl -s 'https://inf.ug.edu.pl/plan/?format=json' \
# | mongoimport --drop --jsonArray -c plan
# curl -s 'https://inf.ug.edu.pl/plan/?format=json' \
# | jq -c '.[]' \
# | mongoimport --drop -c plan
curl -s https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json \
| mongoimport --drop -c restaurants
# use `shuf -n 100` on Linux
curl -s https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json \
| gshuf -n 100 \
| mongoimport --drop -c restaurants
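The two commented-out variants above differ in the input format they hand to `mongoimport`: `--jsonArray` accepts a single JSON array, whereas `jq -c '.[]'` turns the array into one compact document per line. The sketch below shows only that conversion. It assumes `jq` is installed, uses a made-up two-element array with hypothetical field names, and replaces `mongoimport` with a line count so that it runs without a database.

```shell
#!/bin/sh
set -eu

# Made-up two-element array standing in for the JSON returned by the server.
array='[{"subject":"NoSQL","room":"A1"},{"subject":"Algorithms","room":"B2"}]'

# .[] emits each array element separately; -c prints each one on a single line.
ndjson=$(printf '%s\n' "$array" | jq -c '.[]')
printf '%s\n' "$ndjson"

# Stand-in for mongoimport: count the documents that would be imported.
count=$(printf '%s\n' "$ndjson" | wc -l | tr -d ' ')
echo "documents: $count"
```

One document per line is also the layout that line-based tools such as `shuf -n 100` need in order to sample whole documents.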