Skip to content
No description, website, or topics provided.
Python
Branch: master
Clone or download

Latest commit

Fetching latest commit…
Cannot retrieve the latest commit at this time.

Files

Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
config
utils
.gitignore
README.md
parse_patent.py

README.md

Parse USPTO

Step 1: Download XML patents: https://bulkdata.uspto.gov/

Download under the header:

Patent Application Full Text Data (No Images) (MAR 15, 2001 - PRESENT)
Contains the full text of each patent application (non-provisional utility and plant)
published weekly (Thursdays) from March 15, 2001 to present (excludes images/drawings).
Subset of the Patent Application Full Text Data with Embedded TIFF Images.

The current script(s) only work on version 4.0 or higher of the XML (2005 - Present).

It also will skip all documents with DNA sequences.

Step 2: Extract the ziped file: *.xml

Step 3: Individual patents (or directories) can then be parsed with:

parse_patent.py <filename.xml> <filename.xml> <directory>

You can edit the filename variable in the python file parse_patent.py to match the unzipped file. Inside that file are typically thousands of patents which can be parsed for the given week.

Using the parse_patent.py if you add a it will load all the .xml files.

Download all Files

For 2005 to Today, you can download all the zip files for a given year using the following format:

wget -r -np -l1 -nd -A zip https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/<year>/

This will download all the .zip files off that page, which contain all the patents for the given year (as a zipped .xml file).

If the -nd command is dropped, a directory will be created in the form:

bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/<year>/<filename>.zip`

It's recommended, to create a folder such as: patent/<year>, then execute the command to download all the .zip files.

When all the files are downloaded it's possible to unzip all the directory: unzip \*.zip

Storing in Database

In addition, it's possible to save the parsed data in a database. In this repository, we provide documentation in config/README to configure PostgreSQL to store and search the patents.

This reduces the size of a file ipa200109.xml of 734MB to 154MB (in the database).

In terms of overall size, the XML files are 367Gb, the parsed files (in the database) are

In terms of decompressed data (by year) & unerrored parsed patent documents:

  • 2005 - 11Gb - 157,822
  • 2006 - 15Gb - 196,485
  • 2007 - 14Gb - 182,968
  • 2008 - 14Gb - 185,249
  • 2009 - 16Gb - 192,045
  • 2010 - 20Gb - 244,589
  • 2011 - 21Gb - 248,091
  • 2012 - 24Gb - 277,264
  • 2013 - 28Gb - 303,641
  • 2014 - 31Gb - 327,014
  • 2015 - 32Gb - 326,969
  • 2016 - 33Gb - 334,674
  • 2017 - 36Gb - 352,547
  • 2018 - 35Gb - 341,104
  • 2019 - 42Gb - 392,618

Get patent count by year (in PostgreSQL):

SELECT date_trunc('year', publication_date), count(*) from uspto_patents group by date_trunc('year', publication_date) ORDER BY date_trunc('year', publication_date);
You can’t perform that action at this time.