This repo contains scripts for exploring online data from sources such as Twitter, online news outlets and online forums. The main aim is to generate ideas for ASF projects that draw on the opinions and thoughts shared online by different types of users.
Our data collection pipeline contains functions to collect data from Twitter's API v2 recent search endpoint and from The Guardian Open Platform content endpoint.
Both require developer credentials: a bearer token for Twitter and an API key for The Guardian.
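As a rough illustration of what requests to these two endpoints look like, the sketch below builds (but does not send) one request for each, using only the Python standard library. The function names, the example query and the default result sizes are assumptions made for illustration; they are not the functions defined in this repo's pipeline.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public endpoint URLs for the two data sources.
TWITTER_RECENT_SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"
GUARDIAN_CONTENT_URL = "https://content.guardianapis.com/search"


def build_twitter_request(query, bearer_token, max_results=10):
    """Build a GET request for Twitter's API v2 recent search endpoint.

    Hypothetical helper: the bearer token is sent in the Authorization header.
    """
    params = urlencode({"query": query, "max_results": max_results})
    return Request(
        f"{TWITTER_RECENT_SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


def build_guardian_request(query, api_key, page_size=10):
    """Build a GET request for The Guardian Open Platform content endpoint.

    Hypothetical helper: the API key is sent as the `api-key` query parameter.
    """
    params = urlencode({"q": query, "api-key": api_key, "page-size": page_size})
    return Request(f"{GUARDIAN_CONTENT_URL}?{params}")


if __name__ == "__main__":
    # Example query chosen for illustration only.
    print(build_twitter_request("heat pumps", "YOUR_BEARER_TOKEN").full_url)
    print(build_guardian_request("heat pumps", "test").full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without credentials or network access.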
Coming soon...
Coming soon...
- Meet the data science cookiecutter requirements, in brief:
  - Install: `direnv` and `conda`
- Run `make install` to configure the development environment:
  - Set up the conda environment
  - Configure `pre-commit`
- Run `direnv allow`
- Activate the conda environment: `conda activate asf_online_data_exploration`
- Set your credentials as environment variables:
  - `export BEARER_TOKEN="ADD_YOUR_BEARER_TOKEN_HERE"`, replacing `ADD_YOUR_BEARER_TOKEN_HERE` with your bearer token credentials.
  - `export GUARDIAN_API_KEY="ADD_YOUR_API_KEY_HERE"`, replacing `ADD_YOUR_API_KEY_HERE` with your API key credentials. Alternatively, set `export GUARDIAN_API_KEY="test"`.
- Run `conda install -c conda-forge vega-cli vega-lite-cli`. If that doesn't work, follow the instructions here
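Once the variables are exported, Python code can read them from the environment. A minimal sketch, assuming a hypothetical helper `load_credentials` (not part of this repo) that fails early when a variable is missing:

```python
import os

# Names of the environment variables set in the steps above.
REQUIRED_VARIABLES = ("BEARER_TOKEN", "GUARDIAN_API_KEY")


def load_credentials(environ=None):
    """Return the credentials as a dict, or raise if any are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes testing easy.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_VARIABLES}
```

Calling `load_credentials()` at the start of a script surfaces a missing credential immediately, before any request is made.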
asf_online_data_exploration/
├─ analysis/
├─ config/
│ ├─ base.yaml - file paths info
│ ├─ data_collection_parameters.py - parameters for data collection
├─ getters/
├─ notebooks/
├─ pipeline/
│ ├─ data_collection/
│ │ ├─ recent_search_twitter.py - functions to collect data from Twitter's recent search endpoint
│ │ ├─ the_guardian.py - functions to collect data from The Guardian Open Platform content endpoint
│ │ ├─ tests/
│ │ │ ├─ testing_recent_search_twitter.py - functions to test data collection pipeline for Twitter's recent search endpoint
│ │ │ ├─ testing_the_guardian.py - functions to test data collection pipeline for The Guardian Open Platform content endpoint
├─ utils/
│ ├─ data_collection_utils.py - utility functions for retrieving and uploading data
inputs/
outputs/
Technical and working style guidelines
Project based on Nesta's data science project template (Read the docs here).