
T140 path to extracts: #140 #153

Merged 22 commits into master from t140-path-to-extracts on Sep 9, 2021

Conversation

dolsysmith
Contributor

Features

  • Updates UI to display full extracts created by Spark
  • Reuses existing full_datasets path (assuming this is or can be configured as an NFS mount)
  • Adds aggregate users extract type
  • Copies JSON files from dataset_loading to tweetsets-data/full_datasets (or equivalent paths as defined in .env)
  • Uses repartitioning to optimize loading of multiple files
  • Coalesces extracts into a smaller number of files, using the max file size .env variable
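The repartitioning and coalescing described above come down to turning the two `.env` size settings into partition counts. A minimal sketch of that arithmetic (the function names and parsing logic here are illustrative assumptions, not the loader's actual code):

```python
# Hypothetical sketch: turning the SPARK_MAX_FILE_SIZE / SPARK_PARTITION_SIZE
# .env settings into Spark partition counts. The real loader code may differ.

UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(size: str) -> int:
    """Convert a size string like '2g' or '128m' to bytes."""
    size = size.strip().lower()
    if size[-1] in UNITS:
        return int(float(size[:-1]) * UNITS[size[-1]])
    return int(size)

def repartition_count(total_bytes: int, partition_size: str) -> int:
    """Input partitions targeting the configured partition size (for loading)."""
    return max(1, -(-total_bytes // parse_size(partition_size)))  # ceiling division

def coalesce_count(total_bytes: int, max_file_size: str) -> int:
    """Output files so that each stays under the configured max file size."""
    return max(1, -(-total_bytes // parse_size(max_file_size)))

# Example: a 980 MB dataset with the suggested .env values.
total = 980 * 1024 ** 2
print(repartition_count(total, "128m"))  # 8 partitions for loading
print(coalesce_count(total, "2g"))       # 1 output file per extract
```

This is why smaller datasets end up with a single file per extract: anything under `SPARK_MAX_FILE_SIZE` coalesces to one partition.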

Setup

  1. The full_datasets folder must be a shared NFS mount available to all nodes in the Spark cluster. (On my VM, I moved the tweetsets_data folder to /storage on both the primary and secondary nodes, then mapped the full_datasets folder on the primary to the same location on the secondary VM.)
    • Note: you don't want to share the entire tweetsets_data folder, as that will likely cause problems for Elasticsearch.
    • I initially tried mapping /storage/tweetsets_data/full_datasets (VM 1) to a folder on VM 2 in /home/dsmith, but that did not seem to work.
  2. Update your .env files accordingly with the new paths, if necessary.
  3. On your non-primary nodes, update docker-compose.yml as follows:
    • Add the following line to the spark-worker section, under volumes: ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
  4. On the primary node, update loader.docker-compose.yml as follows:
    • Under volumes, add ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
    • Under environment, add
      • SPARK_MAX_FILE_SIZE
      • SPARK_PARTITION_SIZE
    • Optional: add the following (to expose the Spark jobs UI):

      ports:
        - 4040:4040
    
  5. To your primary node's .env, add the following:
    • SPARK_MAX_FILE_SIZE=2g
    • SPARK_PARTITION_SIZE=128m
  6. For testing, the server-flaskrun and loader containers should be built locally. Make sure you rebuild the images before restarting the containers.
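The NFS share in step 1 could be set up roughly as follows on typical Linux hosts. The subnet, hostnames, and export options here are assumptions for illustration; adapt them to your cluster and consult your distribution's NFS documentation:

```shell
# On the primary node (NFS server): export the full_datasets directory.
# Example /etc/exports entry (subnet and options are assumptions):
#   /storage/tweetsets_data/full_datasets  10.0.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On each secondary node (NFS client): mount it at the SAME path,
# so the paths in docker-compose.yml resolve identically on all nodes.
sudo mkdir -p /storage/tweetsets_data/full_datasets
sudo mount -t nfs primary-node:/storage/tweetsets_data/full_datasets \
    /storage/tweetsets_data/full_datasets
```

Mounting to the same absolute path on every node is what lets a single `TWEETSETS_DATA_PATH` value in `.env` work cluster-wide.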

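Putting steps 3 and 4 together, the compose changes might look like the fragment below. Service names and YAML layout are illustrative, based on the lines quoted in the steps above; merge them into your existing files rather than replacing them:

```yaml
# Non-primary nodes, docker-compose.yml (spark-worker service):
spark-worker:
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"

# Primary node, loader.docker-compose.yml:
loader:
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
  environment:
    - SPARK_MAX_FILE_SIZE
    - SPARK_PARTITION_SIZE
  ports:
    - "4040:4040"  # optional: exposes the Spark jobs UI
```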
Testing

  1. Load a dataset.
  2. Verify that full extracts are created and available in the UI.
  3. Verify that extracts are downloadable and that, for all extracts except the full-tweet JSON, a small number of files are created. (For smaller datasets, each extract should have one file.)
  4. Verify that the number of (non-header) rows in the tweet-ids extract matches the number of tweets in the UI.
  5. Create custom extracts and verify that these are downloadable and created correctly.

Benchmarks

The following metrics were obtained using a subset of the Summer Olympics collection.

| Metric | Value |
| --- | --- |
| Number of workers | 1 |
| Number of cores | 2 |
| Number of tweets | 1,048,637 |
| Size on disk | 980M |
| Number of gzipped files | 32 |

| Operation | Time |
| --- | --- |
| RDD -> Elasticsearch | 12 min |
| tweet-ids | 50 sec |
| tweet-csv | 3.8 min |
| tweet-mentions/nodes | 1.1 min |
| tweet-mentions/edges | 55 sec |
| tweet-mentions/agg | 1 min |
| tweet-users | 1 min |

@dolsysmith dolsysmith linked an issue Sep 7, 2021 that may be closed by this pull request
lwrubel (Collaborator) commented Sep 8, 2021

Would you add to example.env the new variables and suggested values?
SPARK_MAX_FILE_SIZE=2g
SPARK_PARTITION_SIZE=128m

And then could you also add something to the README about the full_datasets directory needing to be set up as a shared NFS mount, available to all nodes in the Spark cluster? That will be necessary for setting up future dev environments using the Spark loader (and for reconfiguring any current dev instances). It's not something that was described well in our current README to begin with.

lwrubel (Collaborator) commented Sep 9, 2021

Reviewed the updated documentation and looks good!

@dolsysmith dolsysmith merged commit 13cd127 into master Sep 9, 2021
@dolsysmith dolsysmith deleted the t140-path-to-extracts branch September 9, 2021 12:31
Successfully merging this pull request may close these issues.

Update path to full extracts