This directory provides two datasets. The first, the Ground Truth Corpus, is a small dataset for NLP purposes. The second, the Patent Figure Dataset, is for CV purposes.
The Ground Truth Corpus contains figure captions for design patents from the USPTO database. Objects and aspects are highlighted, and the annotation follows a BIO schema.
This dataset is text-only. It can be used to train models in the NLP domain.
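As a rough illustration of how BIO annotations are typically read, the sketch below groups tagged tokens back into labeled spans. The tag names (B-OBJ, I-OBJ, B-ASP, I-ASP) and the sample caption are assumptions made up for this example; the actual labels and file layout of the corpus may differ.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span; close any open one first.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag continues the open span of the same label.
            words.append(token)
        else:
            # "O" (or a stray I- tag) ends the open span, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Made-up caption and tags, for illustration only.
tokens = ["FIG.", "1", "is", "a", "front", "view", "of", "a", "coffee", "maker"]
tags = ["O", "O", "O", "O", "B-ASP", "I-ASP", "O", "O", "B-OBJ", "I-OBJ"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Here the aspect span covers "front view" and the object span covers "coffee maker".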
The Patent Figure Dataset contains 66417 design patent figures along with their corresponding visual descriptors and metadata.
The figures total 3 GB and can be downloaded from Google Drive: https://drive.google.com/file/d/1Zc3ApBMtFh-Avk1PcZGFSc44mr-SLuUB/view?usp=sharing
Figures are in PNG format.
Visual descriptors and metadata are in a txt file which can be found in this directory. This file gives the following information:
patentID: This is the patent ID in the USPTO database. Each patent has a unique ID.
patentdate: This is the date the patent was released.
figid: This is the index for figures within a patent. A patent may contain many figures.
caption: This is the figure caption.
object: This is the object shown in the figure.
aspect: This is the aspect (view) of the object presented in the figure.
figure_file: This is the file name for a figure. It can be used to match figures in the dataset.
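The fields above could be loaded and used to match figures roughly as follows. This is only a sketch under assumptions: it supposes the txt file is tab-separated, has no header row, and lists the seven fields in the order given above. The sample rows and file names are made up; check the actual file before relying on any of this.

```python
import csv
import io
from collections import defaultdict

# Field order is assumed to follow the list in this README.
FIELDS = ["patentID", "patentdate", "figid", "caption",
          "object", "aspect", "figure_file"]


def read_metadata(handle, delimiter="\t"):
    """Read metadata rows into a list of dicts keyed by field name."""
    reader = csv.DictReader(handle, fieldnames=FIELDS,
                            delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return list(reader)


def figures_by_patent(records):
    """Map each patentID to the figure file names that belong to it."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["patentID"]].append(row["figure_file"])
    return dict(grouped)


# Made-up rows standing in for the real metadata file.
sample = io.StringIO(
    "D0000001\t2000-01-01\t1\tFIG. 1 is a front view of a chair.\t"
    "chair\tfront view\tD0000001-fig1.png\n"
    "D0000001\t2000-01-01\t2\tFIG. 2 is a side view thereof.\t"
    "chair\tside view\tD0000001-fig2.png\n"
)
records = read_metadata(sample)
grouped = figures_by_patent(records)
print(grouped)
```

With the real file, `sample` would be replaced by an opened file handle, and each `figure_file` value would be joined with the folder holding the downloaded PNG figures.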