JSTOR dfr and Paper Machines #41
Comments
Hi @brekhusr -- regarding (2): DfR doesn't yield full-text documents, only N-grams. Still figuring out what's going on with this feature, but it looks like PM uses unigrams (wordcounts), and you do indeed want CSV format. The "links" that you're seeing are so that you can identify the precise document in the JSTOR database from which the corresponding N-grams were extracted. When you make a dataset request on DfR, be sure to check the box for "wordcounts". Re (1): you can select the DfR dataset at modeling time. No need to import it into Zotero.
No need to import the dataset into Zotero? But I thought Paper Machines worked by extracting text from PDFs linked to items in Zotero. That's the thrust of the whole readme page for PM (https://github.com/chrisjr/papermachines) as far as I can tell. I see I am even more confused than I thought I was. If you have time to back up a minute... maybe I just need a step-by-step on how to get PM to interact with JSTOR DfR directly, if having the text in Zotero is not necessary.
Sorry for the long silence on this. In general, PM does work by taking full-text documents stored in Zotero (PDFs/web snapshots) and converting them into plain text, upon which various processes can be run. However, DfR data doesn't contain full text, but only the word counts from each document, as Erick noted above.

Using DfR output in Paper Machines is therefore something of a hack -- the DfR word counts are converted back into "full text" by simply repeating each word the requisite number of times. This is good enough for the purposes of topic modeling, which doesn't take word order into account. However, it wouldn't work for several of the other processes, and thus was never fully integrated into the rest of the program.

Another quirk of the implementation is that you have to run the "Topic Modeling -> By Time (With JSTOR DfR)" on a Zotero collection with at least one full text, even if you just want to work with DfR data. The idea was that the program would combine the recreated texts from DfR with a collection from Zotero before training the topic model. Ultimately, this turned out to be far less common a use case than working with DfR data alone, but the interface design hasn't caught up.

In order to get a topic model from DfR that doesn't include extraneous documents from Zotero, you can create a collection with a single note in it (I usually put the web address of the query from JSTOR here), and run "Extract Text" to fool Paper Machines into thinking the collection has content. Then, running a topic model "With JSTOR DfR" and selecting the DfR zip will do the trick.

I've been slowly working on a redesign that loosens PM's ties to Zotero. In the new version, you'd be able to load texts from Zotero collections, JSTOR DfR, or even a folder of documents with a spreadsheet cataloging its contents, and combine these data sources as you see fit. That is still a ways off, sadly, but I hope it's a little clearer what can be done with DfR data at the moment.
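For anyone curious what "repeating each word the requisite number of times" looks like in practice, here is a minimal Python sketch. It is not Paper Machines' actual code: the function name `rebuild_text` is made up for illustration, and it assumes a wordcounts CSV with a header row followed by `word,count` rows (the `WORDCOUNTS,WEIGHT` header in the sample is an assumption about the DfR export).

```python
import csv
import io


def rebuild_text(wordcounts_csv: str) -> str:
    """Turn 'word,count' rows into a bag-of-words pseudo-text.

    Hypothetical helper: assumes the first row is a header and each
    later row holds a word followed by an integer count.
    """
    reader = csv.reader(io.StringIO(wordcounts_csv))
    next(reader, None)  # skip the header row (assumed to be present)
    words = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        word, count = row[0], int(row[1])
        words.extend([word] * count)  # repeat the word `count` times
    return " ".join(words)


# Made-up sample data in the assumed format.
sample = "WORDCOUNTS,WEIGHT\nthe,3\narchive,2\nhistory,1\n"
print(rebuild_text(sample))  # the the the archive archive history
```

The word order in the result is meaningless, which is fine for a topic model that only looks at counts but is why this trick can't feed processes that depend on real sentences.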
Thanks for responding, Chris! Yes, this information is indeed helpful. I could see getting one JSTOR article into Zotero, and then creating a set of documents for DfR, with that one article subtracted. When I get a chance I’ll give that a try, and I’ll look forward to the new version, whenever it comes. Roughly, are you thinking of weeks, months, or years? Rachel
I see that in your development of Paper Machines you included a data set from JSTOR DfR. I'm wondering how you were able to do that. I requested a couple of data sets from JSTOR DfR in CSV format (was that the right choice?), but it looks like, instead of full text, they have links to full text in one of the fields.
Two questions: