JSTOR dfr and Paper Machines #41

Open
brekhusr opened this issue Jul 15, 2013 · 4 comments

@brekhusr

I see that in your development of Paper Machines you included a data set from JSTOR DFR. I'm wondering how you were able to do that. I requested a couple of data sets from JSTOR DFR in CSV format (was that the right choice?), but it looks like, instead of full text, they have links to the full text in one of the fields.

Two questions:

  1. How do I import the CSV file into Zotero, or should I request XML instead? How would I import that set of records into Zotero?
  2. Will Zotero pull the full text automatically from JSTOR, or are there additional steps needed to give Paper Machines some actual text to analyze?
@erickpeirson
Contributor

Hi @brekhusr -- regarding (2): DfR doesn't yield full text documents, only N-grams. Still figuring out what's going on with this feature, but it looks like PM uses unigrams (wordcounts), and you do indeed want CSV format. The "links" that you're seeing are so that you can identify the precise document in the JSTOR database from which the corresponding N-grams were extracted. When you make a dataset request on DfR, be sure to check the box for "wordcounts".

Re (1): you can select the DfR dataset at modeling time. No need to import it into Zotero.

@brekhusr
Author

No need to import the dataset into Zotero? But I thought Paper Machines worked by extracting text from PDFs linked to items in Zotero. That's the thrust of the whole README page for PM (https://github.com/chrisjr/papermachines) as far as I can tell. I see I am even more confused than I thought I was. If you have time to back up a minute... maybe I just need a step-by-step on how to get PM to interact with JSTOR DFR directly, if having the text in Zotero is not necessary.

@corajr
Contributor

corajr commented Feb 11, 2014

Sorry for the long silence on this. In general, PM does work by taking full-text documents stored in Zotero (PDFs/web snapshots) and converting them into plain text, upon which various processes can be run. However, DfR data doesn't contain full text, but only the word counts from each document, as Erick noted above.

Using DfR output in Paper Machines is therefore something of a hack -- the DfR word counts are converted back into "full text" by simply repeating each word the requisite number of times. This is good enough for the purposes of topic modeling, which doesn't take word order into account. However, it wouldn't work for several of the other processes, and thus was never fully integrated into the rest of the program.
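To make the "repeat each word" idea concrete, here is a minimal sketch in Python. It is not Paper Machines' actual code: the function name `counts_to_text` is hypothetical, and the two-column layout with a `WORDCOUNTS,WEIGHT` header row is an assumption about what a DfR wordcounts CSV looks like, so check it against your own download.

```python
import csv
import io


def counts_to_text(csv_text: str) -> str:
    """Rebuild a bag-of-words "document" from word,count rows.

    Hypothetical helper: assumes each data row is `word,count`.
    Rows whose second column is not a whole number (such as a
    header row) are skipped.
    """
    words = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # ignore blank or malformed rows
        word, count = row[0].strip(), row[1].strip()
        if not count.isdigit():
            continue  # skips the assumed header row
        # Repeat the word as many times as it was counted.
        words.extend([word] * int(count))
    return " ".join(words)


# Made-up sample data, not taken from a real DfR dataset.
sample = "WORDCOUNTS,WEIGHT\nthe,3\narchive,2\nzotero,1\n"
print(counts_to_text(sample))  # the the the archive archive zotero
```

The word order in the result is arbitrary (it simply follows the rows in the file), which is why text rebuilt this way is only suitable for order-insensitive methods such as topic modeling.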

Another quirk of the implementation is that you have to run the "Topic Modeling -> By Time (With JSTOR DfR)" process on a Zotero collection with at least one full text, even if you just want to work with DfR data. The idea was that the program would combine the recreated texts from DfR with a collection from Zotero before training the topic model. Ultimately, this turned out to be a far less common use case than working with DfR data alone, but the interface design hasn't caught up.

In order to get a topic model from DfR that doesn't include extraneous documents from Zotero, you can create a collection with a single note in it (I usually put the web address of the query from JSTOR here), and run "Extract Text" to fool Paper Machines into thinking the collection has content. Then, running a topic model "With JSTOR DfR" and selecting the DfR zip will do the trick.

I've been slowly working on a redesign that loosens PM's ties to Zotero. In the new version, you'd be able to load texts from Zotero collections, JSTOR DfR, or even a folder of documents with a spreadsheet cataloging its contents, and combine these data sources as you see fit. That is still a ways off, sadly, but I hope it's a little clearer what can be done with DfR data at the moment.

@brekhusr
Author

Thanks for responding, Chris! Yes, this information is indeed helpful. I could see getting one JSTOR article into Zotero, and then creating a set of documents for DfR, with that one article subtracted. When I get a chance I’ll give that a try, and I’ll look forward to the new version, whenever it comes. Roughly, are you thinking of weeks, months, or years?

Rachel

