
Datasets with different date formatter have different contents #4

Closed
kpgaffney opened this issue Jul 29, 2022 · 4 comments

Comments

@kpgaffney

Datasets with the same scale factor and serializer but different date formatters have different contents.

For example, the csv_basic dataset with the string date formatter has 14074 knows edges, while the csv_basic dataset with the long date formatter has 18075:

% wc -l social_network-csv_basic-sf0.1/dynamic/person_knows_person_0_0.csv 
   14074 social_network-csv_basic-sf0.1/dynamic/person_knows_person_0_0.csv
% wc -l social_network-csv_basic-longdateformatter-sf0.1/dynamic/person_knows_person_0_0.csv
   18075 social_network-csv_basic-longdateformatter-sf0.1/dynamic/person_knows_person_0_0.csv
@szarnyasg
Member

@kpgaffney thanks for raising this issue! It seems I made a mistake in the generation: most data sets using longdateformatter have 100% of the graph serialized in the initial data set, instead of the correct 90%. The practical implication of serializing 100% of the graph is that the updates will not work, as the nodes/edges to be inserted are already present in the graph.

The output of wc -l social_network-csv_*-sf{0.1,0.3,1,3}/dynamic/person_0_0.csv shows this:

1701 social_network-csv_basic-longdateformatter-sf0.1/dynamic/person_0_0.csv
1529 social_network-csv_composite-longdateformatter-sf0.1/dynamic/person_0_0.csv
1701 social_network-csv_composite_merge_foreign-longdateformatter-sf0.1/dynamic/person_0_0.csv
1701 social_network-csv_merge_foreign-longdateformatter-sf0.1/dynamic/person_0_0.csv

3901 social_network-csv_basic-longdateformatter-sf0.3/dynamic/person_0_0.csv
3515 social_network-csv_composite-longdateformatter-sf0.3/dynamic/person_0_0.csv
3901 social_network-csv_composite_merge_foreign-longdateformatter-sf0.3/dynamic/person_0_0.csv
3901 social_network-csv_merge_foreign-longdateformatter-sf0.3/dynamic/person_0_0.csv

11001 social_network-csv_basic-longdateformatter-sf1/dynamic/person_0_0.csv
9893 social_network-csv_composite-longdateformatter-sf1/dynamic/person_0_0.csv
11001 social_network-csv_composite_merge_foreign-longdateformatter-sf1/dynamic/person_0_0.csv
11001 social_network-csv_merge_foreign-longdateformatter-sf1/dynamic/person_0_0.csv

27001 social_network-csv_basic-longdateformatter-sf3/dynamic/person_0_0.csv
24329 social_network-csv_composite-longdateformatter-sf3/dynamic/person_0_0.csv
27001 social_network-csv_composite_merge_foreign-longdateformatter-sf3/dynamic/person_0_0.csv
27001 social_network-csv_merge_foreign-longdateformatter-sf3/dynamic/person_0_0.csv

I'll regenerate the data sets and upload them to a new SURF repository. However, this is a long process (the generation is slow, then I transfer the data sets to SURF where they are copied to tape, and the summer holidays are now underway), so they will only be available in the autumn. In the meantime, you may use the Hadoop Datagen to generate the correct data sets.

@kpgaffney
Author

No problem, thanks for looking into that!

@szarnyasg
Member

szarnyasg commented Jul 30, 2022

I think I reconstructed what happened: I first generated the composite data set. This generation produced both the initial data set and the updates. Thinking I would save time by not generating the updates for the other three serializers (basic, merge-foreign, composite-merge-foreign), I turned off update generation. (This is possible because the serialization format of the update streams is always the same.) However, turning off the updates also changed the graph: the entire graph was now serialized into the initial data set.
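
One way this could have been caught earlier: the initial person counts of all serializer variants at the same scale factor should agree. Below is a minimal sketch of such a check, fed the sf0.1 counts observed above (all_equal is a hypothetical helper, not part of Datagen):

```shell
# Report whether all given counts are identical.
all_equal() {
    local first=$1 c
    for c in "$@"; do
        if [ "$c" -ne "$first" ]; then
            echo "inconsistent"
            return
        fi
    done
    echo "consistent"
}

# sf0.1 person counts: basic, composite, composite-merge-foreign, merge-foreign.
all_equal 1701 1529 1701 1701   # prints "inconsistent"
```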

I'll start generating the correct data sets with the following script. This will take 3-5 days:

#!/bin/bash

set -eu

# Give the Hadoop client a large heap; the bigger scale factors need it.
export HADOOP_CLIENT_OPTS="-Xmx960G"

mkdir -p ../datagen-graphs
mkdir -p ../datagen-graphs-compressed

for SF in 0.1 0.3 1 3 10 30 100 300 1000; do
    echo "=> SF: $SF"

    # CsvComposite is omitted: its data sets were generated correctly
    # (with updates) the first time around.
    for SERIALIZER in CsvBasic CsvCompositeMergeForeign CsvMergeForeign; do
        case $SERIALIZER in
            CsvBasic)
                VARIANT=csv-basic-longdateformatter
                ;;
            CsvCompositeMergeForeign)
                VARIANT=csv-composite-merge-foreign-longdateformatter
                ;;
            CsvMergeForeign)
                VARIANT=csv-merge-foreign-longdateformatter
                ;;
        esac

        echo "---> SERIALIZER: ${SERIALIZER} a.k.a. ${VARIANT}"

        # Write the Datagen configuration for this scale factor and serializer.
        echo > params.ini
        echo "ldbc.snb.datagen.generator.scaleFactor:snb.interactive.${SF}" >> params.ini
        echo "ldbc.snb.datagen.serializer.dateFormatter:ldbc.snb.datagen.util.formatter.LongDateFormatter" >> params.ini
        echo "ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.${SERIALIZER}DynamicActivitySerializer" >> params.ini
        echo "ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.${SERIALIZER}DynamicPersonSerializer" >> params.ini
        echo "ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.${SERIALIZER}StaticSerializer" >> params.ini

        # Clear Hadoop's temporary files from the previous run, then generate.
        rm -rf /tmp/hadoop*
        ./run.sh

        # Move the output aside and archive it with multithreaded zstd.
        mv social_network ../datagen-graphs/social_network-${VARIANT}-sf${SF}
        cd ../datagen-graphs
        time tar --use-compress-program=zstdmt -cf ../datagen-graphs-compressed/social_network-${VARIANT}-sf$SF.tar.zst social_network-${VARIANT}-sf${SF}
        cd ../ldbc_snb_datagen_hadoop
    done
done
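
For illustration, the params.ini this script writes for the CsvBasic serializer at scale factor 0.1 expands to the following (after the initial blank line produced by the first echo):

```
ldbc.snb.datagen.generator.scaleFactor:snb.interactive.0.1
ldbc.snb.datagen.serializer.dateFormatter:ldbc.snb.datagen.util.formatter.LongDateFormatter
ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.CsvBasicDynamicActivitySerializer
ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.CsvBasicDynamicPersonSerializer
ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.CsvBasicStaticSerializer
```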

@szarnyasg
Member

Hello, I'm happy to report that this has been fixed by the release of the updated Interactive data sets.
