
Look for invalid ?o in Freebase data #70

Closed
paulhoule opened this issue Oct 22, 2013 · 11 comments

@paulhoule (Owner)

We have some cases, such as the notable type, where ?o mids appear that never show up as mids in the ?s field.

A key point is that we want a list of the unique ?p ?o pairs that are affected, because my understanding of the Freebase dump creation process is that errors are often idiosyncratic to particular predicates. A reasonable approach is to:

  1. produce a set of unique ?s
  2. produce a set of unique ?o
  3. join the two sets above to find invalid ?o's (these could be produced with the node as the join key and the values being "this was an ?s" / "this was an ?o" markers; highly scalable because we have at most two records in a bucket)
  4. produce a set of unique ?p ?o pairs
  5. join the invalid ?o's against the ?p ?o pairs. If we can get the "invalid" marker to sort ahead of the ?p ?o pairs, then we can do this streaming with no memory consumption.

We could do it in fewer stages, but I think the memory consumption would be higher and it would be less scalable.
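The five-stage flow above maps to Hadoop jobs, but the set logic can be sketched in memory. A minimal Python sketch, assuming triples are (s, p, o) string tuples; the function name is illustrative, not from the codebase, and the real flow additionally restricts ?o to internal mid URIs:

```python
def find_invalid_po_pairs(triples):
    """In-memory sketch of the five-stage flow over (s, p, o) tuples."""
    subjects = {s for s, p, o in triples}                 # stage 1: unique ?s
    objects = {o for s, p, o in triples}                  # stage 2: unique ?o
    invalid = objects - subjects                          # stage 3: ?o never seen as ?s
    po_pairs = {(p, o) for s, p, o in triples}            # stage 4: unique ?p ?o
    return {(p, o) for p, o in po_pairs if o in invalid}  # stage 5: join
```

For example, with triples [("a", "p1", "b"), ("b", "p2", "c")], only "c" never appears as a subject, so the invalid pair set is {("p2", "c")}.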

paulhoule added a commit that referenced this issue Nov 12, 2013
… objects from triples so we can extract the set of all distinct objects
paulhoule added a commit that referenced this issue Nov 12, 2013
…cript to extract all unique object links
paulhoule added a commit that referenced this issue Nov 12, 2013
…bject URIs (filter out if not rdf.basekb.com) and refactor UniqTool out to simplify development of various uniq tools
paulhoule added a commit that referenced this issue Nov 12, 2013
…perators for subject, object and predicate ready for integration testing
@paulhoule (Owner, Author)

To get a "definition of done", this will be the construction of a flow that does all of the steps required to perform this analysis (extract unique subjects, objects, predicates, then run the set differences)

@paulhoule (Owner, Author)

Here is the command I am running to do the diff

haruhi run job -clusterId smallAwsCluster \
   setDifference -r 4 \
   s3n://basekb-sandbox/2013-11-17-00-00/uniqInternalURIObjects \
   s3n://basekb-sandbox/2013-11-17-00-00/uniqURISubjects2 \
   s3n://basekb-sandbox/2013-11-17-00-00/missingObjects2
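The setDifference job consumes inputs that are already sorted (they are MapReduce output), so the difference can be computed as a streaming merge. A sketch of that merge under the assumption that both inputs arrive sorted ascending; this is my reconstruction, not the project's actual implementation:

```python
def set_difference(sorted_a, sorted_b):
    """Yield items of sorted_a that are absent from sorted_b.
    Both iterables must be sorted ascending; runs in O(1) memory."""
    b = iter(sorted_b)
    cur = next(b, None)
    for a in sorted_a:
        # advance b until it catches up with a
        while cur is not None and cur < a:
            cur = next(b, None)
        if cur != a:
            yield a
```

This is the same two-pointer merge a sort-based join uses, which is why the job scales without holding either set in memory.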

@paulhoule (Owner, Author)

If we wind up with a bunch of objects hanging in space it will be harder to characterize them because, unlike the predicates, these will be twisty little mids that all look alike.

One way to characterize them would be a join: given the set S of missing ?o's, retrieve all triples (?s ?p ?o1) where ?o1 is in S.

For the join, the ?o is the key, and the tag is 1 if the record is one of the missing ?o's and 2 if it is a triple. The "value" of a tag-1 record is not material, but the value of a tag-2 record is just a Text representation of the triple. If a row tagged 1 comes up first for a given key, we write that key's 2's to the output; otherwise we discard them.

Once we get the actual offending triples it ought to be obvious what we are dealing with.
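With the tags sorted so that a 1 precedes any 2's for the same ?o key, the reducer needs no buffering at all. A hedged sketch of that reduce step; the function name and value shapes are mine, not the project's classes:

```python
def reduce_matching_objects(key, tagged_values):
    """Reduce one ?o key. tagged_values is an iterable of (tag, value)
    pairs sorted so tag 1 (missing-object marker) precedes tag 2 (triple)."""
    it = iter(tagged_values)
    first = next(it, None)
    if first is None or first[0] != 1:
        return []  # no marker: this ?o resolves normally, skip its triples
    # marker seen first: every remaining tag-2 value is an offending triple
    return [value for tag, value in it if tag == 2]
```

In a real Hadoop job the tag ordering would come from a secondary sort on a composite key; here it is simply assumed.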

@paulhoule (Owner, Author)

OK, we get 2,136,121 loose objects, and almost all of them are mids. Here are the 30 non-mid values that show up:

$ zcat *.gz | grep -v /m.
<http://rdf.basekb.com/ns/.user.xandr.webscrapper.ads.ads_kind>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%14%30%42%30:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%1A%3E%3C%3D%30%42%4B:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.ads_kind>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.price>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.street>
<http://rdf.basekb.com/ns/wikipedia.en.Bibliography_of_Jorge_Luis_Borges>
<http://rdf.basekb.com/ns/.user.xandr.webscrapper.ads.ads_topic>
<http://rdf.basekb.com/ns/.user.xandr.webscrapper.ads.email>
<http://rdf.basekb.com/ns/emql.metacritic>
<http://rdf.basekb.com/ns/user.lbwelch>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%12%30%3D%3D%4B%35+%3A%3E%3C%3D%30%42%4B:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.ads_topic>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.email>
<http://rdf.basekb.com/ns/wikipedia.en.Brothers_Grimm>
<http://rdf.basekb.com/ns/.user.xandr.webscrapper.ads.geo_city>
<http://rdf.basekb.com/ns/en.death_valley_national_park>
<http://rdf.basekb.com/ns/en>
<http://rdf.basekb.com/ns/user.pinworm27.afdb>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%14%3E%3C%30%48%3D%38%35+%36%38%32%3E%42%3D%4B%35:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%1A%3E%3C%38%41%41%38%4F+%31%40%3E%3A%35%40%30:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%1C%35%41%42%3F%3E%3B%3E%36%35%3D%38%35:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%22%3E%47%3D%4B%39+%30%34%40%35%41::>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%42%35%3B%35%44%3E%3D:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.geo_city>
<http://rdf.basekb.com/ns/wikipedia.en.Liaden_universe>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%1C%35%31%3B%38%40%3E%32%30%3D%3D%30%4F:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.%1C%35%42%40%30%36:>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.phone>
<http://rdf.basekb.com/ns/user.xandr.webscrapper.ads.zone>
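The percent-encoded segments above look like Cyrillic key names whose U+04xx code points lost their 0x04 high byte somewhere in the pipeline; under that reading, %14%30%42%30 would be «Дата» (Russian for "Date"), which fits a classified-ads scraper. This is my guess, not documented Freebase behavior; a hedged sketch of that decoding:

```python
import re

def decode_suspect_key(key: str) -> str:
    """Decode a suspect key segment, assuming each %XX escape is the
    Cyrillic code point U+04XX with its 0x04 high byte dropped;
    '+' means space, everything else passes through unchanged."""
    decoded = re.sub(r"%([0-9A-Fa-f]{2})",
                     lambda m: chr(0x0400 + int(m.group(1), 16)),
                     key)
    return decoded.replace("+", " ")
```

Under this assumption, %1A%3E%3C%3D%30%42%4B: decodes to «Комнаты:» ("Rooms:"), again plausible for ad-listing field names.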

paulhoule added a commit that referenced this issue Nov 20, 2013
@paulhoule (Owner, Author)

To do tomorrow: write tests for MatchingKeyReducer and FetchTriplesWithMatchingObjectsMapper.

paulhoule added a commit that referenced this issue Nov 21, 2013
paulhoule added a commit that referenced this issue Nov 21, 2013
paulhoule added a commit that referenced this issue Nov 21, 2013
@paulhoule (Owner, Author)

Here is the command that is supposed to match up the missing objects with the triples responsible for them

haruhi run job -clusterId smallAwsCluster fetchWithMatchingObjects3 -r 4 \
   s3n://basekb-sandbox/2013-11-17-00-00/missingObjects2 \
   s3n://basekb-now/2013-11-17-00-00/sieved/a \
   s3n://basekb-now/2013-11-17-00-00/sieved/description \
   s3n://basekb-now/2013-11-17-00-00/sieved/key \
   s3n://basekb-now/2013-11-17-00-00/sieved/keyNs \
   s3n://basekb-now/2013-11-17-00-00/sieved/label \
   s3n://basekb-now/2013-11-17-00-00/sieved/links \
   s3n://basekb-now/2013-11-17-00-00/sieved/name \
   s3n://basekb-now/2013-11-17-00-00/sieved/notability \
   s3n://basekb-now/2013-11-17-00-00/sieved/notableForPredicate \
   s3n://basekb-now/2013-11-17-00-00/sieved/other \
   s3n://basekb-now/2013-11-17-00-00/sieved/text \ 
   s3n://basekb-now/2013-11-17-00-00/sieved/webpages \
   s3n://basekb-sandbox/2013-11-17-00-00/triplesWithMissingObjects

@paulhoule (Owner, Author)

Note: when I ran the command above, I found that none of the arguments after /sieved/text were parsed properly because there was a space after the backslash! Shades of Python in a shell script, yikes.

@paulhoule (Owner, Author)

I fixed that problem, then ran it on a mediumAwsCluster, but I forgot to bump -r 4 up and started running out of heap. More reducers might have helped, but many of the segments up there (description, name) contain only literals, and those could be removed from the input.

To speed up the debug cycle I'm going to make a tiny synthetic test case against just one file, labels-m-00001.gz

@paulhoule (Owner, Author)

ok, actually I ran it against a-m-00000.gz because we're really interested in URI objects.

When I grepped both the output triples and the input triples, I got exactly the same counts for type.property and measurement_unit.dated_percentage. So it looks like the algorithm is sound, but somehow we're creating a GIGO situation.

Here is the new command:

haruhi run job -clusterId mediumAwsCluster fetchWithMatchingObjects3 -r 11 \
   s3n://basekb-sandbox/2013-11-17-00-00/missingObjects2 \
   s3n://basekb-now/2013-11-17-00-00/sieved/a \
   s3n://basekb-now/2013-11-17-00-00/sieved/links \
   s3n://basekb-now/2013-11-17-00-00/sieved/notability \
   s3n://basekb-now/2013-11-17-00-00/sieved/webpages \
   s3n://basekb-sandbox/2013-11-17-00-00/triplesWithMissingObjects

@paulhoule (Owner, Author)

At this point there is still a problem: the job above gives empty output. On the other hand, in the missingObjects2 file I see the following node

<http://rdf.basekb.com/ns/m.01000h3>

and in links/links-m-00017.nt.gz I see the triple

<http://rdf.basekb.com/ns/m.01dftfd>  <http://rdf.basekb.com/ns/m.0j2r8t8>    <http://rdf.basekb.com/ns/m.01000h3>    .

so somehow this fact is getting lost.
