
T140 path to extracts: #140 #153

Merged 22 commits into master from t140-path-to-extracts on Sep 9, 2021

Conversation

dolsysmith
Contributor

Features

  • Updates UI to display full extracts created by Spark
  • Reuses existing full_datasets path (assuming this is or can be configured as an NFS mount)
  • Adds aggregate users extract type
  • Copies JSON files from dataset_loading to tweetsets-data/full_datasets (or equivalent paths as defined in .env)
  • Uses repartitioning to optimize loading of multiple files
  • Coalesces extracts into a smaller number of files, using the max file size .env variable
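The repartitioning and coalescing described above come down to turning the two `.env` size settings into partition counts. A minimal sketch of that arithmetic (the function names and parsing logic here are illustrative assumptions, not the loader's actual code):

```python
# Hypothetical sketch: turning the SPARK_MAX_FILE_SIZE / SPARK_PARTITION_SIZE
# .env settings into Spark partition counts. The real loader code may differ.

UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(size: str) -> int:
    """Convert a size string like '2g' or '128m' to bytes."""
    size = size.strip().lower()
    if size[-1] in UNITS:
        return int(float(size[:-1]) * UNITS[size[-1]])
    return int(size)

def repartition_count(total_bytes: int, partition_size: str) -> int:
    """Input partitions targeting the configured partition size (for loading)."""
    return max(1, -(-total_bytes // parse_size(partition_size)))  # ceiling division

def coalesce_count(total_bytes: int, max_file_size: str) -> int:
    """Output files so that each stays under the configured max file size."""
    return max(1, -(-total_bytes // parse_size(max_file_size)))

# Example: a 980 MB dataset with the suggested .env values.
total = 980 * 1024 ** 2
print(repartition_count(total, "128m"))  # 8 partitions for loading
print(coalesce_count(total, "2g"))       # 1 output file per extract
```

This is why smaller datasets end up with a single file per extract: anything under `SPARK_MAX_FILE_SIZE` coalesces to one partition.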

Setup

  1. The full_datasets folder must be a shared NFS mount available to all nodes in the Spark cluster. (On my VM, I moved the tweetsets_data folder to /storage on both the primary and secondary nodes, then mapped the full_datasets folder on the primary to the same location on the secondary VM.)
    • Note: you don't want to share the entire tweetsets_data folder, as that will likely cause problems for Elasticsearch.
    • I initially tried mapping /storage/tweetsets_data/full_datasets (VM 1) to a folder on VM 2 in /home/dsmith, but that did not seem to work.
  2. Update your .env files accordingly with the new paths, if necessary.
  3. On your non-primary nodes, update docker-compose.yml as follows:
    • Add the following line to the spark-worker section, under volumes: ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
  4. On the primary node, update loader.docker-compose.yml as follows:
    • Under volumes, add ${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets
    • Under environment, add
      • SPARK_MAX_FILE_SIZE
      • SPARK_PARTITION_SIZE
    • Optional: add the following (to expose the Spark jobs UI):

      ports:
        - 4040:4040
    
  5. To your primary node's .env, add the following:
    • SPARK_MAX_FILE_SIZE=2g
    • SPARK_PARTITION_SIZE=128m
  6. For testing, the server-flaskrun and loader containers should be built locally. Make sure you rebuild the images before restarting the containers.
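The NFS share in step 1 could be set up roughly as follows on typical Linux hosts. The subnet, hostnames, and export options here are assumptions for illustration; adapt them to your cluster and consult your distribution's NFS documentation:

```shell
# On the primary node (NFS server): export the full_datasets directory.
# Example /etc/exports entry (subnet and options are assumptions):
#   /storage/tweetsets_data/full_datasets  10.0.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On each secondary node (NFS client): mount it at the SAME path,
# so the paths in docker-compose.yml resolve identically on all nodes.
sudo mkdir -p /storage/tweetsets_data/full_datasets
sudo mount -t nfs primary-node:/storage/tweetsets_data/full_datasets \
    /storage/tweetsets_data/full_datasets
```

Mounting to the same absolute path on every node is what lets a single `TWEETSETS_DATA_PATH` value in `.env` work cluster-wide.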

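Putting steps 3 and 4 together, the compose changes might look like the fragment below. Service names and YAML layout are illustrative, based on the lines quoted in the steps above; merge them into your existing files rather than replacing them:

```yaml
# Non-primary nodes, docker-compose.yml (spark-worker service):
spark-worker:
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"

# Primary node, loader.docker-compose.yml:
loader:
  volumes:
    - "${TWEETSETS_DATA_PATH}/full_datasets:/tweetsets_data/full_datasets"
  environment:
    - SPARK_MAX_FILE_SIZE
    - SPARK_PARTITION_SIZE
  ports:
    - "4040:4040"  # optional: exposes the Spark jobs UI
```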
Testing

  1. Load a dataset.
  2. Verify that full extracts are created and available in the UI.
  3. Verify that extracts are downloadable and that, for all extracts except the full-tweet JSON, a small number of files are created. (For smaller datasets, each extract should have one file.)
  4. Verify that the number of (non-header) rows in the tweet-ids extract matches the number of tweets in the UI.
  5. Create custom extracts and verify that these are downloadable and created correctly.

Benchmarks

The following metrics were obtained using a subset of the Summer Olympics collection.

| Metric | Value |
| --- | --- |
| Number of workers | 1 |
| Number of cores | 2 |
| Number of tweets | 1,048,637 |
| Size on disk | 980M |
| Number of gzipped files | 32 |

| Operation | Time |
| --- | --- |
| RDD -> Elasticsearch | 12 min |
| tweet-ids | 50 sec |
| tweet-csv | 3.8 min |
| tweet-mentions/nodes | 1.1 min |
| tweet-mentions/edges | 55 sec |
| tweet-mentions/agg | 1 min |
| tweet-users | 1 min |

@dolsysmith dolsysmith linked an issue Sep 7, 2021 that may be closed by this pull request
lwrubel (Collaborator) commented Sep 8, 2021

Would you add to example.env the new variables and suggested values?
SPARK_MAX_FILE_SIZE=2g
SPARK_PARTITION_SIZE=128m

And then could you also add something to the README about the full_datasets directory needing to be set up as a shared NFS mount, available to all nodes in the Spark cluster? That will be necessary for setting up future dev environments using the Spark loader (and for reconfiguring any current dev instances). It's not something that was described well in our current README to begin with.

lwrubel (Collaborator) commented Sep 9, 2021

Reviewed the updated documentation and looks good!

@dolsysmith dolsysmith merged commit 13cd127 into master Sep 9, 2021
@dolsysmith dolsysmith deleted the t140-path-to-extracts branch September 9, 2021 12:31
Successfully merging this pull request may close these issues.

Update path to full extracts