

refs #919. Clarifies warc iters in releasing datasets docs.
Justin Littman committed May 30, 2018
1 parent 3df8932 commit 1658a73
Showing 2 changed files with 6 additions and 4 deletions.
4 changes: 2 additions & 2 deletions docs/processing.rst
@@ -115,8 +115,8 @@ to supply ``--api-base-url``.

Sync scripts
============
-Sync scripts will extract social media data from WARC files for a collection and write it
-to line-oriented JSON files. It is called a "sync script" because it will
+Sync scripts will extract Twitter data from WARC files for a collection and write tweets
+to line-oriented JSON files and tweet ids to text files. It is called a "sync script" because it will
skip WARCs that have already been processed.

Sync scripts are parallelized, allowing for faster processing.
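As a rough illustration only (not the sync script shipped with the project), the core idea can be sketched in shell, assuming a hypothetical output directory ``/sfm-processing/`` and WARCs under ``/sfm-data/collection/``::

    # Loop over the collection's WARCs, skipping any that were already processed.
    for warc in /sfm-data/collection/*.warc.gz; do
        base=$(basename "$warc" .warc.gz)
        json="/sfm-processing/$base.json"     # line-oriented tweet JSON
        ids="/sfm-processing/$base.txt"       # tweet ids
        [ -s "$json" ] && continue            # "sync" behavior: skip WARCs already handled
        twitter_stream_warc_iter.py "$warc" > "$json"
        jq -r '.id_str' "$json" > "$ids"
    done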
6 changes: 4 additions & 2 deletions docs/releasing_datasets.rst
@@ -37,9 +37,11 @@ Note that the Women's March dataset is a single (SFM) collection. For an example
::
time parallel -j 3 -a source.lst -a dest.lst --xapply "twitter_stream_warc_iter.py {1} | jq -r '.id_str' > {2}"

-This command executes a Twitter Stream WARC iterator to extract the tweets from the WARC files and jq to extract the tweet ids. Parallel is used to perform this process in parallel (using multiple processors), using WARC files from `source.lst` and text files from `dest.lst`.
+This command executes a Twitter Stream WARC iterator to extract the tweets from the WARC files and jq to extract the tweet ids. This shows using `twitter_stream_warc_iter.py` for a Twitter stream collection. For a Twitter REST collection, use `twitter_rest_warc_iter.py`.

-- Note: `-j 3` limits parallel to 3 processors. Make sure to select an appropriate number for your server.
+Parallel is used to perform this process in parallel (using multiple processors), using WARC files from `source.lst` and text files from `dest.lst`. `-j 3` limits parallel to 3 processors. Make sure to select an appropriate number for your server.

+An alternative to steps 1 and 2 is to use a sync script to write tweet id text files and tweet JSON files in one step. (See :doc:`processing`)

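One way to build ``source.lst`` and ``dest.lst`` for the parallel command above (the ``/sfm-data/collection/`` path below is only a placeholder for wherever the collection's WARCs actually live)::

    # source.lst: one WARC path per line.
    ls /sfm-data/collection/*.warc.gz > source.lst
    # dest.lst: a matching tweet id text file per WARC, written to the current directory.
    sed 's|.*/||; s|\.warc\.gz$|.txt|' source.lst > dest.lst
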
4. Combine multiple files into large files:
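For example, assuming the per-WARC tweet id files from the previous step all end in ``.txt``::

    # Write the combined file outside the working directory so it is not swept up by the glob.
    cat *.txt > ../tweet-ids-combined.txt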

