
Where can I find topics of reuters dataset #12072

Closed
hadifar opened this issue Jan 18, 2019 · 8 comments
Labels
Good first issue: Issues which are good for first-time contributors. Usually easy to fix.
stat:contributions welcome: A pull request to fix this issue would be welcome.
type:docs: Need to modify the documentation

Comments

@hadifar
Contributor

hadifar commented Jan 18, 2019

In the Keras Reuters dataset there are 11228 instances, while on the dataset's webpage there are 21578. Even in the reference paper there are more than 11228 examples after pruning.

Unfortunately, there is no information about the Reuters dataset in the Keras documentation. Is it possible to clarify how this dataset was gathered and what the topic labels are? It is mentioned that there are 46 topics, but what is the category, e.g., for topic number 32?

@gabrieldemarmiesse added the labels stat:contributions welcome, type:docs, and Good first issue on Jan 19, 2019
@SteffenBauer

SteffenBauer commented Jan 28, 2019

Update: As this topic gained some traction in internet discussions and was even referenced from the official Keras documentation (https://keras.io/api/datasets/reuters/), I collected all code and data from this investigation and put it here:

https://github.com/SteffenBauer/KerasTools/tree/master/Reuters_Analysis


In case it might be useful: I wrote a small library with some tools for Keras that I built for my personal deep learning explorations over the last year. I was interested in the exact mappings for all the Keras datasets; you can find the corresponding dataset decoding module from my library here:

https://github.com/SteffenBauer/KerasTools/blob/master/KerasTools/datasets/decode.py
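
For readers who just want the idea without the library, here is a minimal sketch of the same kind of decoding using only the public Keras API (exact output depends on the Keras version and on load_data()'s default arguments):

from keras.datasets import reuters

# load_data() shifts word indices by 3 by default, reserving
# 0 (padding), 1 (start-of-sequence) and 2 (out-of-vocabulary).
(x_train, y_train), (x_test, y_test) = reuters.load_data()

word_index = reuters.get_word_index()
index_to_word = {idx + 3: word for word, idx in word_index.items()}
index_to_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})

# Transcribe the first newswire back to (roughly) human-readable form.
print(" ".join(index_to_word.get(i, "<unk>") for i in x_train[0]))
print("integer label:", y_train[0])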

I got the Reuters topic mapping by transcribing the Reuters entries back to human-readable form, sorting by topic label frequency, and matching them with the topic labels found here:

https://martin-thoma.com/nlp-reuters/

Using this, I got these label mappings for the Keras Reuters dataset:

"reuters":
  ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
   'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
   'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
   'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
   'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead'],
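
A small usage sketch, assuming this reverse-engineered mapping is correct: position i in the list is the proposed name for integer label i.

from keras.datasets import reuters

reuters_topics = ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
                  'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
                  'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
                  'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
                  'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead']

(x_train, y_train), _ = reuters.load_data()
# Prints the integer label of the first training example and its proposed topic name.
print(y_train[0], "->", reuters_topics[y_train[0]])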

@hadifar
Contributor Author

hadifar commented Jan 30, 2019

@SteffenBauer Thanks for the reply.

I already saw the https://martin-thoma.com blog post, but the number of examples in each topic doesn't match the Keras Reuters dataset. For example, the class named earn in that blog post has 2877 instances, but in Keras Reuters the most dominant topic has 3159.
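
For reference, a quick sketch for reproducing these per-topic counts from the Keras data (whether the counts are taken over the training split only or over train+test obviously changes the numbers):

import numpy as np
from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data()

# Per-topic example counts, for the training split and for the full dataset.
train_counts = np.bincount(y_train, minlength=46)
total_counts = np.bincount(np.concatenate([y_train, y_test]), minlength=46)

# Show the most frequent topics first.
for label in np.argsort(total_counts)[::-1][:5]:
    print(f"label {label}: {train_counts[label]} train / {total_counts[label]} total")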

@SteffenBauer

@hadifar: Yes, I have never found any instance of the Reuters dataset with 11228 entries anywhere other than in Keras. When the Keras dataset was produced, there must have been some kind of pre-processing / pruning. As a result, a direct matching between the Keras set and the one at martin-thoma.com is not possible.

My topic mapping list is therefore only a kind of 'reverse engineering' result. It needed a lot of manual matching. I used the number of entries as a hint for where to look deeper, and then I directly inspected several re-transcribed entries visually, trying to figure out which category they match best. After some iterations, I ended up with the above result, which should match the real categories.

Here is the jupyter notebook that I used in identifying the categories:

https://github.com/SteffenBauer/KerasTools/blob/master/Notebooks/09%201b%20Reuters%20Dataset.ipynb

But yes, I would also be very interested in more detailed information on how the Keras Reuters set was generated.

@SteffenBauer

SteffenBauer commented Jan 30, 2019

I just browsed the commit history of datasets.reuters, and it looks like older versions indeed contained the code which was used to parse the Reuters-21578 dataset into the reuters.pkl file, but it was removed 3 years ago:

71952f2#diff-4e341a06492281a7032f4fe4ecf6a3f7

So it should be possible to investigate further how the Keras reuters dataset was derived from the official data.

@SteffenBauer

SteffenBauer commented Jan 30, 2019

Looks like the old make_reuters_dataset function is the key here. I applied it to the Reuters-21578 dataset from UCI. It indeed parses the dataset into 11228 entries, and when I print the topic mapping dictionary topic_indexes, I get this:

{'copper': 6, 'livestock': 28, 'gold': 25, 'money-fx': 19, 'ipi': 30, 'trade': 11, 'cocoa': 0, 'iron-steel': 31, 'reserves': 12, 'tin': 26, 'zinc': 37, 'jobs': 34, 'ship': 13, 'cotton': 14, 'alum': 23, 'strategic-metal': 27, 'lead': 45, 'housing': 7, 'meal-feed': 22, 'gnp': 21, 'sugar': 10, 'rubber': 32, 'dlr': 40, 'veg-oil': 2, 'interest': 20, 'crude': 16, 'coffee': 9, 'wheat': 5, 'carcass': 15, 'lei': 35, 'gas': 41, 'nat-gas': 17, 'oilseed': 24, 'orange': 38, 'heat': 33, 'wpi': 43, 'silver': 42, 'cpi': 18, 'earn': 3, 'bop': 36, 'money-supply': 8, 'hog': 44, 'acq': 4, 'pet-chem': 39, 'grain': 1, 'retail': 29}

(The only change to the original code was that I needed to sort the filename list, so that parsing starts with reut2-000.sgm.)

So this could really be the real topic mapping, directly derived from the original data.
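
For convenience, the printed topic_indexes dict can be inverted into a label-to-name lookup; a small sketch (the dict below is the same mapping as above, just re-sorted by index):

# topic_indexes maps topic name -> integer label (copied from the output above).
topic_indexes = {
    'cocoa': 0, 'grain': 1, 'veg-oil': 2, 'earn': 3, 'acq': 4, 'wheat': 5,
    'copper': 6, 'housing': 7, 'money-supply': 8, 'coffee': 9, 'sugar': 10,
    'trade': 11, 'reserves': 12, 'ship': 13, 'cotton': 14, 'carcass': 15,
    'crude': 16, 'nat-gas': 17, 'cpi': 18, 'money-fx': 19, 'interest': 20,
    'gnp': 21, 'meal-feed': 22, 'alum': 23, 'oilseed': 24, 'gold': 25,
    'tin': 26, 'strategic-metal': 27, 'livestock': 28, 'retail': 29,
    'ipi': 30, 'iron-steel': 31, 'rubber': 32, 'heat': 33, 'jobs': 34,
    'lei': 35, 'bop': 36, 'zinc': 37, 'orange': 38, 'pet-chem': 39,
    'dlr': 40, 'gas': 41, 'silver': 42, 'wpi': 43, 'hog': 44, 'lead': 45,
}

# Invert to label -> name; listing names in label order should reproduce
# the mapping posted earlier in this thread.
index_to_topic = {idx: name for name, idx in topic_indexes.items()}
print([index_to_topic[i] for i in range(len(index_to_topic))])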

@SteffenBauer

SteffenBauer commented Jan 30, 2019

If you are interested in the code, I created a gist:

https://gist.github.com/SteffenBauer/2444afea5ea844119b3985685e6aac29

Download reuters21578.tar.gz from https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/, unpack it into a directory reuters21578/, then run parse_reuters.py.
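
The same steps in Python, for anyone who prefers a script (a convenience sketch; the URL and directory name are the ones given above):

import os
import tarfile
import urllib.request

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "reuters21578-mld/reuters21578.tar.gz")

# Download the archive and unpack it into reuters21578/ as described above.
archive, _ = urllib.request.urlretrieve(URL, "reuters21578.tar.gz")
os.makedirs("reuters21578", exist_ok=True)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("reuters21578")

# Then run parse_reuters.py from the gist against this directory.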

@hadifar
Contributor Author

hadifar commented Jan 30, 2019

@SteffenBauer Very nice investigation 👍

@hadifar hadifar closed this as completed Jan 30, 2019
@SteffenBauer

> For example, class name earn in this blog post has 2877 instances but in Keras Reuters the most dominant topic has 3159.

A last remark: this discrepancy is probably explained by martin-thoma using a different percentage for the test set than Keras. Keras splits off 20% for the test set, while martin-thoma uses percentages between ~20% and ~30%; for earn, 27% is used there for the test set.
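
A quick check of the Keras side of that claim (load_data() uses test_split=0.2 by default, so roughly 20% of the 11228 entries should land in the test set):

from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data()
total = len(x_train) + len(x_test)
print("total entries:", total)                 # expected: 11228
print("test fraction:", len(x_test) / total)   # expected: ~0.2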

copybara-service bot pushed a commit that referenced this issue Mar 10, 2023
Imported from GitHub PR #17635

From discussions and references from:

- #12072 (comment)
- https://martin-thoma.com/nlp-reuters/

Add documentation:

- Explain the word indices returned from `keras.datasets.reuters.get_word_index`.
- Add a helper function to return `ylabels` for label data.

Copybara import of the project:

--
3c2ac2d by Kevin Hu <hxy9243@gmail.com>:

update documentation to keras reuters dataset

--
d08241c by Kevin Hu <hxy9243@gmail.com>:

format the code

--
b1fcf1b by Kevin Hu <hxy9243@gmail.com>:

address PR reviews on formatting

--
d85556e by Kevin Hu <hxy9243@gmail.com>:

fix lint errors

--
d29df56 by Kevin Hu <hxy9243@gmail.com>:

address PR review

Merging this change closes #17635

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17635 from hxy9243:master d29df56
PiperOrigin-RevId: 515713085
copybara-service bot pushed a commit that referenced this issue Mar 11, 2023
(Same commit message as above, imported from GitHub PR #17635.)