TweetSets

Twitter datasets for research and archiving.

  • Create your own Twitter dataset from existing datasets.
  • Conforms with Twitter policies.

TweetSets allows users to (1) select from existing datasets; (2) limit the dataset by querying on keywords, hashtags, and other parameters; (3) generate and download dataset derivatives such as the list of tweet ids and mention nodes/edges.

Modes

TweetSets can be run in different modes. The mode determines which datasets are available and what types of dataset derivatives can be generated.

  • public mode: Source datasets that are marked as local only are excluded. Dataset derivatives that include the text of the tweet cannot be generated.
  • local mode: All source datasets are included, including those that are marked as local only. All dataset derivatives can be generated, including those that include the text of the tweet.
  • both mode: For configured network IP ranges, the user is placed in local mode. Otherwise, the user is placed in public mode.

These modes make it possible to conform with the Twitter policy that prohibits sharing complete tweets with third parties.

Modes are configured in the .env file as described below.
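
For illustration only, a mode setting in the .env file might look like the following. The variable names here are hypothetical; the annotated example.env is the authoritative reference for the actual keys.

    # Hypothetical excerpt; see the annotated example.env for the real variable names.
    MODE=both
    # IP ranges treated as local when running in both mode (name and format are assumptions).
    LOCAL_IP_RANGES=192.168.0.0/16,10.0.0.0/8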

Installing

Prerequisites

  • Docker
  • Docker Compose

Installation

  1. Create data directories on a volume with adequate storage:

     mkdir -p /tweetset_data/redis
     mkdir -p /tweetset_data/datasets
     mkdir -p /tweetset_data/elasticsearch/esdata1
     mkdir -p /tweetset_data/elasticsearch/esdata2
     chown -R 1000:1000 /tweetset_data/elasticsearch
    

Note:

  • Create an esdata<number> directory for each Elasticsearch container.
  • On OS X, the redis and esdata<number> directories must be ugo+rwx.

  2. Clone or download this repository:

     git clone https://github.com/justinlittman/TweetSets.git
    
  3. Change to the docker directory:

     cd docker
    
  4. Copy the example docker files:

     cp example.docker-compose.yml docker-compose.yml
     cp example.env .env
    
  5. Edit .env. This file is annotated to help you select appropriate values.

  6. Create dataset_list_msg.txt. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty:

     touch dataset_list_msg.txt
    
  7. Bring up the containers:

     docker-compose up -d
    

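Once the containers are up, you can check their status and follow the logs with standard Docker Compose commands (run from the docker directory):

    docker-compose ps
    docker-compose logs -f
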
For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.
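
The commented-out service in docker-compose.yml is authoritative; as a rough sketch only, and assuming the jwilder/nginx-proxy image, an HTTPS-terminating proxy service might take a shape like this:

    # Hypothetical sketch; uncomment and adapt the service provided in docker-compose.yml instead.
    nginx-proxy:
      image: jwilder/nginx-proxy
      ports:
        - "80:80"
        - "443:443"
      volumes:
        - /var/run/docker.sock:/tmp/docker.sock:ro
        - /path/to/certs:/etc/nginx/certs:ro  # certificates named <hostname>.crt / <hostname>.key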

Cluster installation

Clusters must have at least a primary node and two additional nodes.

Primary node

  1. Create data directories on a volume with adequate storage:

     mkdir -p /tweetset_data/redis
     mkdir -p /tweetset_data/datasets
     mkdir -p /tweetset_data/elasticsearch
     chown -R 1000:1000 /tweetset_data/elasticsearch
    
  2. Clone or download this repository:

     git clone https://github.com/justinlittman/TweetSets.git
    
  3. Change to the docker directory:

     cd docker
    
  4. Copy the example docker files:

     cp example.cluster-primary.docker-compose.yml docker-compose.yml
     cp example.env .env
    
  5. Edit .env. This file is annotated to help you select appropriate values.

  6. Create dataset_list_msg.txt. The contents of this file will be displayed on the dataset list page. It can be used to list other datasets that are available but not yet loaded. To leave the file empty:

     touch dataset_list_msg.txt
    
  7. Bring up the containers:

     docker-compose up -d
    

For HTTPS support, uncomment and configure the nginx-proxy container in docker-compose.yml.

Cluster node

  1. Create data directories on a volume with adequate storage:

     mkdir -p /tweetset_data/elasticsearch
     chown -R 1000:1000 /tweetset_data/elasticsearch
    
  2. Clone or download this repository:

     git clone https://github.com/justinlittman/TweetSets.git
    
  3. Change to the docker directory:

     cd docker
    
  4. Copy the example docker files:

     cp example.cluster-node.docker-compose.yml docker-compose.yml
     cp example.cluster-node.env .env
    
  5. Edit .env. This file is annotated to help you select appropriate values. Note that two cluster nodes must have MASTER set to true.

  6. Bring up the containers:

     docker-compose up -d
    
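
To confirm that a node has joined the Elasticsearch cluster, you can query Elasticsearch's cat and cluster health APIs from the primary node. This assumes the default port 9200 is exposed; adjust the host and port to match your compose configuration.

    curl http://localhost:9200/_cat/nodes?v
    curl http://localhost:9200/_cluster/health?pretty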

Loading a source dataset

Prepping the source dataset

  1. Create a dataset directory within the dataset filepath configured in your .env.
  2. Place tweet files in the directory. The tweet files can be line-oriented JSON (.json) or gzip compressed line-oriented JSON (.json.gz).
  3. Create a dataset description file named dataset.json in the directory. See example.dataset.json for the format of the file; an illustrative directory layout is sketched below.
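
For orientation, a prepared dataset directory might look like the following. The path and file names are illustrative; the parent directory comes from the dataset filepath configured in your .env.

    /tweetset_data/datasets/example-dataset/
        dataset.json          # dataset description; see example.dataset.json
        tweets-000.json.gz    # gzip-compressed line-oriented JSON
        tweets-001.json.gz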

Loading

  1. Start and connect to a loader container:

     docker-compose run --rm loader /bin/bash
    
  2. Invoke the loader:

     python tweetset_loader.py create /dataset/path/to
    

To see other loader commands:

    python tweetset_loader.py

Note that tweets are never added to an existing index. When using the reload command, a new index is created for the dataset and then swapped in for the existing index. Because the swap happens only after the new index has been fully created, users are not affected by reloading.
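
For example, assuming reload takes the dataset path the same way create does, reloading would look something like:

    python tweetset_loader.py reload /dataset/path/to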

Loading with Apache Spark

When using the Spark loader, the dataset files must be located at the dataset filepath on all nodes (e.g., by having separate copies or using a network share such as NFS).

In general, using Spark within Docker is tricky because the Spark driver, Spark master, and Spark nodes all need to be able to communicate with each other, and the ports are dynamically selected. (Some of the ports can be fixed, but supporting multiple simultaneous loaders requires leaving some dynamic.) This doesn't play well with Docker's port mapping, since the hostnames and ports that Spark advertises internally must match what is reachable through Docker. Further complicating this, host networking (which is used to support the dynamic ports) does not work correctly on macOS.

Cluster mode

  1. Start and connect to a loader container:

     docker-compose -f loader.docker-compose.yml run --rm loader /bin/bash
    
  2. Invoke the loader:

     spark-submit \
     --jars elasticsearch-hadoop.jar \
     --master spark://$SPARK_MASTER_HOST:7101 \
     --py-files dist/TweetSets-0.1-py3.6.egg,dependencies.zip \
     --conf spark.driver.bindAddress=0.0.0.0 \
     --conf spark.driver.host=$SPARK_DRIVER_HOST \
     tweetset_loader.py spark-create /dataset/path/to
    

Kibana

Elastic's Kibana is a general-purpose framework for exploring, analyzing, and visualizing data. Since the tweets are already indexed in Elasticsearch, they are ready to be used from Kibana.

To enable Kibana, uncomment the Kibana service in your docker-compose.yml. By default, Kibana will run on port 5601.
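
The commented-out service in docker-compose.yml is authoritative; as a rough sketch only, a Kibana service might look like the following (the image version and the Elasticsearch service name are assumptions):

    # Hypothetical sketch; uncomment and adapt the service provided in docker-compose.yml instead.
    kibana:
      image: docker.elastic.co/kibana/kibana:6.2.2
      ports:
        - "5601:5601"
      environment:
        - ELASTICSEARCH_URL=http://elasticsearch:9200  # Elasticsearch service name is an assumption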

A few notes about Kibana:

  • When starting Kibana, the first step is to select an index pattern. Each index represents a dataset; the index name has the format tweets-<dataset id>. The dataset id is available under the dataset details when selecting source datasets in TweetSets.
  • The time period of the tweets displayed is controlled by the date picker at the top right of the Kibana screen. By default the time period is very short; you will probably want to adjust it to cover a longer period.

Citing

Please cite TweetSets as:

    Justin Littman. (2018). TweetSets. Zenodo. https://doi.org/10.5281/zenodo.1289426

Kibana TODO

  • Consider multiple Kibana users.
  • Consider persistence.
  • Provide a default dashboard.
  • Consider approaches to index patterns.

TweetSets TODO

  • Loading:
    • Hydration of tweet id lists.
  • Limiting:
    • Limit by mention user ids
    • Limit by user ids
    • Limit by verified users
  • Scroll additional sample tweets
  • Dataset derivatives:
    • Additional top derivatives:
      • URL
      • Quotes/retweets
    • Options to limit top derivatives by:
      • Top number (e.g., top 500)
      • Count greater than (e.g., more than 5 mentions)
    • Additional nodes/edges derivatives:
      • Replies
      • Quotes/retweets
    • Provide nodes/edges in additional formats such as Gephi.
  • Separate counts of tweets available for public / local on home page.