Datasets with different date formatter have different contents #4
Comments
@kpgaffney thanks for raising this issue! It seems I made a mistake in the generation: most data sets using [...]. The output of [...]:
1701 social_network-csv_basic-longdateformatter-sf0.1/dynamic/person_0_0.csv
1529 social_network-csv_composite-longdateformatter-sf0.1/dynamic/person_0_0.csv
1701 social_network-csv_composite_merge_foreign-longdateformatter-sf0.1/dynamic/person_0_0.csv
1701 social_network-csv_merge_foreign-longdateformatter-sf0.1/dynamic/person_0_0.csv
3901 social_network-csv_basic-longdateformatter-sf0.3/dynamic/person_0_0.csv
3515 social_network-csv_composite-longdateformatter-sf0.3/dynamic/person_0_0.csv
3901 social_network-csv_composite_merge_foreign-longdateformatter-sf0.3/dynamic/person_0_0.csv
3901 social_network-csv_merge_foreign-longdateformatter-sf0.3/dynamic/person_0_0.csv
11001 social_network-csv_basic-longdateformatter-sf1/dynamic/person_0_0.csv
9893 social_network-csv_composite-longdateformatter-sf1/dynamic/person_0_0.csv
11001 social_network-csv_composite_merge_foreign-longdateformatter-sf1/dynamic/person_0_0.csv
11001 social_network-csv_merge_foreign-longdateformatter-sf1/dynamic/person_0_0.csv
27001 social_network-csv_basic-longdateformatter-sf3/dynamic/person_0_0.csv
24329 social_network-csv_composite-longdateformatter-sf3/dynamic/person_0_0.csv
27001 social_network-csv_composite_merge_foreign-longdateformatter-sf3/dynamic/person_0_0.csv
27001 social_network-csv_merge_foreign-longdateformatter-sf3/dynamic/person_0_0.csv
I'll regenerate the data sets and upload them to a new SURF repository. However, this is a long process: the generation is slow, the data sets then have to be transferred to SURF, where they are copied to tape, and the summer holidays have started. The new data sets will therefore only be available in the autumn. In the meantime, you may use the Hadoop Datagen to generate the correct data sets.
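A check like the listing above can be scripted. The following is only a minimal sketch: the helper names `count_lines` and `variants_agree` are hypothetical, and it assumes that the same file should have the same number of lines in every serializer variant of a given scale factor, which is what the listing suggests but is not confirmed here.

```shell
#!/bin/bash
# Sketch (hypothetical helpers): compare the line count of one file
# across several data set variants of the same scale factor.

# count_lines FILE -> prints the number of lines in FILE, without padding
count_lines() {
    wc -l < "$1" | tr -d '[:space:]'
}

# variants_agree FILE... -> returns 0 if all files have the same line count;
# otherwise reports the first mismatch on stderr and returns 1
variants_agree() {
    local first="" n f
    for f in "$@"; do
        n=$(count_lines "$f")
        if [ -z "$first" ]; then
            first=$n
        elif [ "$n" != "$first" ]; then
            echo "mismatch: $f has $n lines, expected $first" >&2
            return 1
        fi
    done
    return 0
}
```

It could be called, for instance, as `variants_agree social_network-csv_*-longdateformatter-sf0.1/dynamic/person_0_0.csv`, relying on the directory naming shown in the listing.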
No problem, thanks for looking into that!
I think I reconstructed what happened: I first generated the [...]
I'll start generating the correct data sets with the following script. This will take 3-5 days:
#!/bin/bash
export HADOOP_CLIENT_OPTS="-Xmx960G"
set -eu

mkdir -p ../datagen-graphs
mkdir -p ../datagen-graphs-compressed

for SF in 0.1 0.3 1 3 10 30 100 300 1000; do
    echo "=> SF: $SF"
    for SERIALIZER in CsvBasic CsvCompositeMergeForeign CsvMergeForeign; do
        # Map the serializer class prefix to the data set variant name
        case $SERIALIZER in
            CsvBasic)
                VARIANT=csv-basic-longdateformatter
                ;;
            CsvCompositeMergeForeign)
                VARIANT=csv-composite-merge-foreign-longdateformatter
                ;;
            CsvMergeForeign)
                VARIANT=csv-merge-foreign-longdateformatter
                ;;
        esac
        echo "---> SERIALIZER: ${SERIALIZER} a.k.a. ${VARIANT}"

        # Write the Datagen configuration for this scale factor and serializer
        echo > params.ini
        echo "ldbc.snb.datagen.generator.scaleFactor:snb.interactive.${SF}" >> params.ini
        echo "ldbc.snb.datagen.serializer.dateFormatter:ldbc.snb.datagen.util.formatter.LongDateFormatter" >> params.ini
        echo "ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.${SERIALIZER}DynamicActivitySerializer" >> params.ini
        echo "ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.${SERIALIZER}DynamicPersonSerializer" >> params.ini
        echo "ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.${SERIALIZER}StaticSerializer" >> params.ini

        # Clear leftover Hadoop temporary files, then run the generator
        rm -rf /tmp/hadoop*
        ./run.sh

        # Move the generated data set into place and compress it with zstd
        mv social_network "../datagen-graphs/social_network-${VARIANT}-sf${SF}"
        cd ../datagen-graphs
        time tar --use-compress-program=zstdmt -cf "../datagen-graphs-compressed/social_network-${VARIANT}-sf${SF}.tar.zst" "social_network-${VARIANT}-sf${SF}"
        cd ../ldbc_snb_datagen_hadoop
    done
done
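The `case` mapping and the `params.ini` lines in the script above follow a regular pattern, so they could be factored into two small helpers. This is only a sketch of that idea, not part of the script that was actually run: `variant_name` and `write_params` are hypothetical names, the variant name is derived from the CamelCase serializer prefix instead of a `case` table, and the leading blank line that `echo > params.ini` produces is omitted.

```shell
#!/bin/bash
# Sketch (hypothetical helpers) of the naming and configuration steps.

# variant_name SERIALIZER -> e.g. CsvMergeForeign -> csv-merge-foreign-longdateformatter
variant_name() {
    # Insert a hyphen at each lower-to-upper case boundary, then lowercase
    printf '%s\n' "$1" \
        | sed -E 's/([a-z])([A-Z])/\1-\2/g' \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/$/-longdateformatter/'
}

# write_params SF SERIALIZER [OUTFILE] -> writes the Datagen parameters
# used in the script above to OUTFILE (default: params.ini)
write_params() {
    local sf=$1 serializer=$2 out=${3:-params.ini}
    local prefix=ldbc.snb.datagen.serializer
    cat > "$out" <<EOF
ldbc.snb.datagen.generator.scaleFactor:snb.interactive.${sf}
${prefix}.dateFormatter:ldbc.snb.datagen.util.formatter.LongDateFormatter
${prefix}.dynamicActivitySerializer:${prefix}.snb.csv.dynamicserializer.activity.${serializer}DynamicActivitySerializer
${prefix}.dynamicPersonSerializer:${prefix}.snb.csv.dynamicserializer.person.${serializer}DynamicPersonSerializer
${prefix}.staticSerializer:${prefix}.snb.csv.staticserializer.${serializer}StaticSerializer
EOF
}
```

Inside the loop, `VARIANT=$(variant_name "$SERIALIZER")` followed by `write_params "$SF" "$SERIALIZER"` would then stand in for the `case` block and the `echo` lines.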
Hello, I'm happy to report that this has been fixed by the release of the updated Interactive data sets.
Datasets with the same scale factor and serializer but different date formatter have different contents.
For example, the csv_basic dataset with the string date formatter has 14074 knows edges, but the csv_basic dataset with the long date formatter has 18075 knows edges.
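One way to check whether two such data sets contain the same knows edges, regardless of how the dates are formatted, is to compare only the endpoint columns. The sketch below rests on assumptions not stated in this thread: that the edge file is pipe-separated with a header row, and that the two person IDs are in the first two columns. The helper names `edge_pairs` and `same_edges` are hypothetical.

```shell
#!/bin/bash
# Sketch (hypothetical helpers): compare the edge sets of two CSV files
# while ignoring the date column. Assumes '|' as separator, a header row,
# and the two endpoint IDs in columns 1 and 2.

# edge_pairs FILE -> sorted, de-duplicated "src|dst" pairs, header skipped
edge_pairs() {
    tail -n +2 "$1" | cut -d'|' -f1,2 | sort -u
}

# same_edges A B -> returns 0 if both files contain the same set of pairs
same_edges() {
    local a b rc=0
    a=$(mktemp)
    b=$(mktemp)
    edge_pairs "$1" > "$a"
    edge_pairs "$2" > "$b"
    cmp -s "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

If `same_edges` fails for the string and long date formatter variants of the same serializer and scale factor, the difference is in the graph itself and not just in the date representation.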