This repository makes available the data produced for our paper:
Mann, Katja and Lukas Püttmann (2017). "Benign Effects of Automation: New Evidence from Patent Texts". Unpublished manuscript.
Please cite us if you use our data.
We keep the following datasets here:
|Name|Format|Level|Years|Rows|Approx. size (zipped)|
|---|---|---|---|---|---|
| | |Industries (SIC 4 digit)|1976-2014|451,620|12 MB|
| | |Commuting zones|1976-2014|281,580|3.4 MB|
What you don't find here
- The code for the initial parsing and classification of the patents.
- The code for going from one dataset to another.
- A replication kit for our paper. You won't find the data and code needed to reproduce all the figures and tables in our paper.
All datasets cover the United States for the period 1976 to 2014.
1. Patent level dataset
Includes all US utility patents and records, for every patent, whether we classify it as an automation patent.
The assignee and citation variables are constructed from Fung Institute data (Lai et al., 2011), which stop in 2010.
year: Grant year.
week: Identifies the weekly files that the patent was published in.
patent: Patent number (7 characters)
automat: Classification as automation patent after excluding patents (see paper for details).
raw_automat: Classification as automation patent before excluding patents (see paper for details).
excl: Excluded patents. We exclude many chemical or pharmaceutical patents in our empirical analysis, see paper for details.
post_yes: The posterior probability that a patent is an automation patent.
post_no: The posterior probability that a patent is not an automation patent.
hjt1: Hall-Jaffe-Trajtenberg top-level categories
hjt2: Hall-Jaffe-Trajtenberg subcategories by name
hjt2_num: Hall-Jaffe-Trajtenberg subcategories by number
uspc_primary: Every patent is assigned one or several USPC (United States Patent Classification) numbers. This reports the first USPC number written in the patent documents. We use this number to assign Hall-Jaffe-Trajtenberg categories. This is not the classification we use to match patents to industries: We use the complete list of patents' IPC (International Patent Classification) numbers for this (not contained in this dataset).
length_pattext: Length of the patent text, measured as the number of lines in Google's text files. The number is missing for the last patent in each weekly file.
cts: Number of citations using Fung Institute data.
cts_wt: Number of weighted citations. See paper for explanation.
assignee: The group of patent assignee ("US firm", "foreigners", "governments", "universities" or missing/"NA"). We use the Fung Institute data to identify US firms, foreigners and governments and our own coding to find universities and public research institutes.
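As a minimal sketch of how the patent level variables can be combined, the Python snippet below computes the share of automation patents per grant year. The rows are made up for illustration, the patent numbers are placeholders, and the coding of automat as 1 (automation) or 0 (not) is an assumption that should be checked against the actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the layout described above (placeholder patent numbers).
# Assumption: automat is coded 1 for automation patents and 0 otherwise.
SAMPLE = """year,patent,automat
1976,0000001,0
1976,0000002,1
1977,0000003,1
1977,0000004,1
"""


def automation_share_by_year(rows):
    """Return {year: share of patents with automat == 1}."""
    totals = defaultdict(int)
    automation = defaultdict(int)
    for row in rows:
        year = int(row["year"])
        totals[year] += 1
        automation[year] += int(row["automat"])
    return {year: automation[year] / totals[year] for year in totals}


shares = automation_share_by_year(csv.DictReader(io.StringIO(SAMPLE)))
print(shares)
```

To run this on the real data, replace the in-memory sample with an open file handle passed to csv.DictReader.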
2. Industry level dataset
We distribute all patents probabilistically to industries where they are created ("sector of manufacture") and where they are likely to be used ("industry of use"). See paper and the links above for explanations. Industries are defined according to SIC 1987.
sic1: First digit of SIC number
sic_div: Name of SIC division ("Agriculture", "Mining" and so on)
sic: Four-digit SIC number (1987 SIC classification).
nb: Our classification of patents as either "automation" or non-automation ("rest") according to the Naive Bayes algorithm.
affil: Two options ("sector of manufacture" and "industry of use")
weight: Uses either no weights ("none") or weighs patents by the number of their citations.
assignee: Four options ("foreigners", "governments", "universities" and "other")
patents: Number of patent (equivalents).
Be careful when you use this dataset, as some variables define subsets of the dataset and some offer alternative datasets:
- affil and weight are options you can choose.
- nb and assignee contain the values for subsets of patents. So if, for example, you want to know how many automation patents in some industry are owned by any entity, you need to sum across the values of assignee.
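The summing step described above can be sketched in Python as follows. The sample rows are made up, including the SIC code and the patent counts, and the function name total_patents is hypothetical. The sketch fixes one value each for nb, affil and weight and then adds up patents over all assignee groups within an industry. If the file carries further identifying columns, such as a year, they would need to be added to the grouping key.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; the SIC code and the counts are made up for illustration.
SAMPLE = """sic,nb,affil,weight,assignee,patents
3571,automation,industry of use,none,foreigners,2.5
3571,automation,industry of use,none,governments,0.5
3571,automation,industry of use,none,universities,1.0
3571,automation,industry of use,none,other,6.0
3571,rest,industry of use,none,other,4.0
3571,automation,sector of manufacture,none,other,9.0
"""


def total_patents(rows, nb, affil, weight):
    """Sum patents over all assignee groups, by industry, for one chosen
    combination of nb, affil and weight."""
    totals = defaultdict(float)
    for row in rows:
        if (row["nb"], row["affil"], row["weight"]) == (nb, affil, weight):
            totals[row["sic"]] += float(row["patents"])
    return dict(totals)


totals = total_patents(
    csv.DictReader(io.StringIO(SAMPLE)),
    nb="automation",
    affil="industry of use",
    weight="none",
)
print(totals)
```

Filtering on affil and weight first matters because these variables describe alternative versions of the same counts, so summing across them would double count patents.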
3. Commuting zone level dataset
Includes the number of patents that can be used in each US commuting zone.
cz: Commuting zone
year: Grant year
type: Whether the patent measure has been constructed using levels of patents or logs as described in the paper.
assignee: Group to which the patent is assigned, as described above. The value "all" contains all groups.
weight: Citation weights as described above.
autopats: Automation patents
nonautopats: All other (non-automation) patents.
How to use
Use the maptile program to create maps:
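    * Load the commuting zone dataset
    import delim data/czone_dataset.csv
    * Keep patent levels for all assignee groups, without citation weights
    keep if type == "level" & assignee == "all" & weight == "none"
    * Map automation patents in 1976 by commuting zone
    maptile autopats if year==1976, geography(cz) conus nquantiles(4) ///
        savegraph(figures\map_1976.png) replace legdecimals(0) resolution(0.5) ///
        twopt(title(Automation patents: 1976))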
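For readers who do not use Stata, the same selection can be sketched in Python. The rows below are made up, the value "log" for type is an assumption, and the function name automation_share is hypothetical. The sketch keeps the level, all-assignee, unweighted rows for one year and computes the automation share of patents per commuting zone instead of drawing a map.

```python
import csv
import io

# Hypothetical rows in the layout of the commuting zone dataset.
# Assumption: the non-level value of type is spelled "log".
SAMPLE = """cz,year,type,assignee,weight,autopats,nonautopats
100,1976,level,all,none,3.0,9.0
100,1976,log,all,none,1.4,2.3
100,1976,level,universities,none,1.0,2.0
200,1976,level,all,none,1.0,1.0
200,1977,level,all,none,2.0,6.0
"""


def automation_share(rows, year):
    """Return {cz: autopats / (autopats + nonautopats)} for one year,
    using levels, all assignee groups and no citation weights."""
    shares = {}
    for row in rows:
        if (row["type"], row["assignee"], row["weight"]) != ("level", "all", "none"):
            continue
        if int(row["year"]) != year:
            continue
        auto = float(row["autopats"])
        total = auto + float(row["nonautopats"])
        if total > 0:  # skip commuting zones without any patents
            shares[row["cz"]] = auto / total
    return shares


shares_1976 = automation_share(csv.DictReader(io.StringIO(SAMPLE)), 1976)
print(shares_1976)
```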
Our data and code are released under the MIT license. This means that you can use everything as you please for research or commercial purposes, as long as you refer back to us.
If you find irregularities or bugs, please open an issue here.
Lai, R., A. D’Amour, A. Yu, Y. Sun, D. M. Doolin and L. Fleming (2011). "Disambiguation and coauthorship networks of the U.S. patent inventor database (1975-2010)". Fung Institute.