
Where can I find topics of reuters dataset #12072

Closed
hadifar opened this issue Jan 18, 2019 · 8 comments
Labels
Good first issue: Issues which are good for first-time contributors. Usually easy to fix.
stat:contributions welcome: A pull request to fix this issue would be welcome.
type:docs: Need to modify the documentation

Comments

@hadifar
Contributor

hadifar commented Jan 18, 2019

In the Keras Reuters dataset there are 11228 instances, while on the dataset's webpage there are 21578. Even in the reference paper there are more than 11228 examples after pruning.

Unfortunately, there is no information about the Reuters dataset in the Keras documentation. Is it possible to clarify how this dataset was gathered and what the topic labels are? It is mentioned that there are 46 topics, but what is the category, e.g., for topic number 32?

@gabrieldemarmiesse added the labels stat:contributions welcome, type:docs, and Good first issue on Jan 19, 2019
@SteffenBauer

SteffenBauer commented Jan 28, 2019

Update: As this topic gained some traction in internet discussions and was even referenced from the official Keras documentation (https://keras.io/api/datasets/reuters/), I collected all code and data from this investigation and put it here:

https://github.com/SteffenBauer/KerasTools/tree/master/Reuters_Analysis


In case it might be useful: I wrote a small library with some tools for Keras that I built for my personal deep learning explorations over the last year. I was interested in the exact mappings for all the Keras datasets; you can find the corresponding dataset decoding module from my library here:

https://github.com/SteffenBauer/KerasTools/blob/master/KerasTools/datasets/decode.py
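
For readers who just want the idea without the library, here is a minimal sketch of the same kind of decoding using only the public Keras API (exact output depends on the Keras version and on load_data()'s default arguments):

from keras.datasets import reuters

# load_data() shifts word indices by 3 by default, reserving
# 0 (padding), 1 (start-of-sequence) and 2 (out-of-vocabulary).
(x_train, y_train), (x_test, y_test) = reuters.load_data()

word_index = reuters.get_word_index()
index_to_word = {idx + 3: word for word, idx in word_index.items()}
index_to_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})

# Transcribe the first newswire back to (roughly) human-readable form.
print(" ".join(index_to_word.get(i, "<unk>") for i in x_train[0]))
print("integer label:", y_train[0])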

I got the Reuters topic mapping by transcribing the Reuters entries back to human-readable form, sorting by topic label frequency, and matching them with the topic labels found here:

https://martin-thoma.com/nlp-reuters/

Using this, I got these label mappings for the Keras Reuters dataset:

"reuters":
  ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
   'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
   'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
   'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
   'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead'],
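
A small usage sketch, assuming this reverse-engineered mapping is correct: position i in the list is the proposed name for integer label i.

from keras.datasets import reuters

reuters_topics = ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
                  'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
                  'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
                  'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
                  'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead']

(x_train, y_train), _ = reuters.load_data()
# Prints the integer label of the first training example and its proposed topic name.
print(y_train[0], "->", reuters_topics[y_train[0]])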

@hadifar
Contributor Author

hadifar commented Jan 30, 2019

@SteffenBauer Thanks for the reply.

I already saw the https://martin-thoma.com blog post, but the number of examples in each topic doesn't match the Keras Reuters dataset. For example, the class named earn in that blog post has 2877 instances, but in Keras Reuters the most dominant topic has 3159.
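
For reference, a quick sketch for reproducing these per-topic counts from the Keras data (whether the counts are taken over the training split only or over train+test obviously changes the numbers):

import numpy as np
from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data()

# Per-topic example counts, for the training split and for the full dataset.
train_counts = np.bincount(y_train, minlength=46)
total_counts = np.bincount(np.concatenate([y_train, y_test]), minlength=46)

# Show the most frequent topics first.
for label in np.argsort(total_counts)[::-1][:5]:
    print(f"label {label}: {train_counts[label]} train / {total_counts[label]} total")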

@SteffenBauer

@hadifar: Yes, I have never found any instance of the Reuters dataset with 11228 entries anywhere other than in Keras. When the Keras dataset was produced, there must have been some kind of pre-processing / pruning. As a result, a direct matching between the Keras set and the one at martin-thoma.com is not possible.

My topic mapping list is therefore only a kind of 'reverse engineering' result. It needed a lot of manual matching. I used the number of entries as a hint for where to look deeper, and then I directly inspected several re-transcribed entries visually, trying to figure out which category they match best. After some iterations, I ended up with the above result, which should match the real categories.

Here is the jupyter notebook that I used in identifying the categories:

https://github.com/SteffenBauer/KerasTools/blob/master/Notebooks/09%201b%20Reuters%20Dataset.ipynb

But yes, I would also be very interested in more detailed information on how the Keras Reuters set was generated.

@SteffenBauer

SteffenBauer commented Jan 30, 2019

I just browsed the commit history of datasets.reuters, and it looks like older versions indeed contained the code which was used to parse the Reuters-21578 dataset into the reuters.pkl file, but it was removed 3 years ago:

71952f2#diff-4e341a06492281a7032f4fe4ecf6a3f7

So it should be possible to investigate further how the Keras reuters dataset was derived from the official data.

@SteffenBauer

SteffenBauer commented Jan 30, 2019

Looks like the old make_reuters_dataset function is the key here. I applied it to the Reuters-21578 dataset from UCI. It indeed parses the dataset into 11228 entries, and when I print the topic mapping dictionary topic_indexes, I get this:

{'copper': 6, 'livestock': 28, 'gold': 25, 'money-fx': 19, 'ipi': 30, 'trade': 11, 'cocoa': 0, 'iron-steel': 31, 'reserves': 12, 'tin': 26, 'zinc': 37, 'jobs': 34, 'ship': 13, 'cotton': 14, 'alum': 23, 'strategic-metal': 27, 'lead': 45, 'housing': 7, 'meal-feed': 22, 'gnp': 21, 'sugar': 10, 'rubber': 32, 'dlr': 40, 'veg-oil': 2, 'interest': 20, 'crude': 16, 'coffee': 9, 'wheat': 5, 'carcass': 15, 'lei': 35, 'gas': 41, 'nat-gas': 17, 'oilseed': 24, 'orange': 38, 'heat': 33, 'wpi': 43, 'silver': 42, 'cpi': 18, 'earn': 3, 'bop': 36, 'money-supply': 8, 'hog': 44, 'acq': 4, 'pet-chem': 39, 'grain': 1, 'retail': 29}

(The only change to the original code was that I needed to sort the filename list, so that parsing starts with reut2-000.sgm.)

So this could really be the real topic mapping, directly derived from the original data.
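
For convenience, the printed topic_indexes dict can be inverted into a label-to-name lookup; a small sketch (the dict below is the same mapping as above, just re-sorted by index):

# topic_indexes maps topic name -> integer label (copied from the output above).
topic_indexes = {
    'cocoa': 0, 'grain': 1, 'veg-oil': 2, 'earn': 3, 'acq': 4, 'wheat': 5,
    'copper': 6, 'housing': 7, 'money-supply': 8, 'coffee': 9, 'sugar': 10,
    'trade': 11, 'reserves': 12, 'ship': 13, 'cotton': 14, 'carcass': 15,
    'crude': 16, 'nat-gas': 17, 'cpi': 18, 'money-fx': 19, 'interest': 20,
    'gnp': 21, 'meal-feed': 22, 'alum': 23, 'oilseed': 24, 'gold': 25,
    'tin': 26, 'strategic-metal': 27, 'livestock': 28, 'retail': 29,
    'ipi': 30, 'iron-steel': 31, 'rubber': 32, 'heat': 33, 'jobs': 34,
    'lei': 35, 'bop': 36, 'zinc': 37, 'orange': 38, 'pet-chem': 39,
    'dlr': 40, 'gas': 41, 'silver': 42, 'wpi': 43, 'hog': 44, 'lead': 45,
}

# Invert to label -> name; listing names in label order should reproduce
# the mapping posted earlier in this thread.
index_to_topic = {idx: name for name, idx in topic_indexes.items()}
print([index_to_topic[i] for i in range(len(index_to_topic))])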

@SteffenBauer

SteffenBauer commented Jan 30, 2019

If you are interested in the code, I created a gist:

https://gist.github.com/SteffenBauer/2444afea5ea844119b3985685e6aac29

Download reuters21578.tar.gz from https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/, unpack it into a directory reuters21578/, then run parse_reuters.py.
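
The same steps in Python, for anyone who prefers a script (a convenience sketch; the URL and directory name are the ones given above):

import os
import tarfile
import urllib.request

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "reuters21578-mld/reuters21578.tar.gz")

# Download the archive and unpack it into reuters21578/ as described above.
archive, _ = urllib.request.urlretrieve(URL, "reuters21578.tar.gz")
os.makedirs("reuters21578", exist_ok=True)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("reuters21578")

# Then run parse_reuters.py from the gist against this directory.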

@hadifar
Contributor Author

hadifar commented Jan 30, 2019

@SteffenBauer Very nice investigation 👍

@hadifar hadifar closed this as completed Jan 30, 2019
@SteffenBauer

> For example, class name earn in this blog post has 2877 instances but in Keras Reuters the most dominant topic has 3159.

A last remark: this discrepancy is probably explained by martin-thoma using a different percentage for the test set than Keras. Keras splits off 20% for the test set, while martin-thoma uses percentages between ~20% and ~30%; for earn, 27% is used there for the test set.
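
A quick check of the Keras side of that claim (load_data() uses test_split=0.2 by default, so roughly 20% of the 11228 entries should land in the test set):

from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data()
total = len(x_train) + len(x_test)
print("total entries:", total)                 # expected: 11228
print("test fraction:", len(x_test) / total)   # expected: ~0.2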

copybara-service bot pushed a commit that referenced this issue Mar 10, 2023
Imported from GitHub PR #17635

From discussions and references from:

- #12072 (comment)
- https://martin-thoma.com/nlp-reuters/

Add documentation:

- Explain the word indices returned from `keras.datasets.reuters.get_word_index`.
- Add a helper function to return `ylabels` for label data.

Copybara import of the project:

--
3c2ac2d by Kevin Hu <hxy9243@gmail.com>:

update documentation to keras reuters dataset

--
d08241c by Kevin Hu <hxy9243@gmail.com>:

format the code

--
b1fcf1b by Kevin Hu <hxy9243@gmail.com>:

address PR reviews on formatting

--
d85556e by Kevin Hu <hxy9243@gmail.com>:

fix lint errors

--
d29df56 by Kevin Hu <hxy9243@gmail.com>:

address PR review

Merging this change closes #17635

FUTURE_COPYBARA_INTEGRATE_REVIEW=#17635 from hxy9243:master d29df56
PiperOrigin-RevId: 515713085
copybara-service bot pushed a commit that referenced this issue Mar 11, 2023
(Same commit message as above, imported from GitHub PR #17635.)