Library and Directory Setup

In [1]:
library(data.table)
library(dplyr)
library(readr)
library(purrr)
library(tidyr)
library(lubridate)
library(tidyverse)
library(stringr)
library(rio)

"package 'dplyr' was built under R version 3.6.3"
Attaching package: 'dplyr'

The following objects are masked from 'package:data.table':

    between, first, last

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

"package 'purrr' was built under R version 3.6.3"
Attaching package: 'purrr'

The following object is masked from 'package:data.table':

    transpose

"package 'lubridate' was built under R version 3.6.3"
Attaching package: 'lubridate'

The following objects are masked from 'package:data.table':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following object is masked from 'package:base':

    date

"package 'tidyverse' was built under R version 3.6.3"

ERROR: Error: package or namespace load failed for 'tidyverse' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
 there is no package called 'nlme'


In [2]:
setwd("F:/Thesis/Source_Data/WIPO_AI_Patents/WIPO_Patenscope-Transnational_Patents")
getwd()

In [None]:
Directory

## 1.1) Queries Consolidation
The PATENTSCOPE AI Index developed by WIPO allows a maximum of 10 thousand patent applications to be downloaded per query, in Microsoft Excel 97-2003 (.xls) format. In total, 66 queries were necessary to retrieve all the relevant patent applications, so the data is distributed across 66 different files.

The first step is therefore to convert the files from xls to csv. This only needs to be run if the files have not yet been converted. At the time of submission of the thesis, the csv files had already been created and included in the WIPO_Patenscope-Transnational_Patents folder.

In [None]:
# Convert every .xls export to .csv with rio::convert() (only needed once)
#xls <- list.files(pattern = "\\.xls$", recursive = TRUE, full.names = TRUE)
#created <- mapply(convert, xls, sub("\\.xls$", ".csv", xls))
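When building the output names, only the file extension should change. A pattern such as `gsub("xls", "csv", path)` would also rewrite any `xls` that appears elsewhere in the path. A minimal sketch of the difference, using made-up paths (the folder `xls_exports` is hypothetical and not part of the thesis data):

```r
# Hypothetical input paths; the first folder name deliberately contains "xls".
xls_paths <- c("./xls_exports/Field_Applications/Agriculture-Agriculture-All.xls",
               "./Techniques/Search Methods-Search Methods-All.xls")

# Replace only a trailing ".xls" extension.
csv_paths <- sub("\\.xls$", ".csv", xls_paths)
csv_paths

# For comparison: replacing every "xls" also alters the folder name.
naive_paths <- gsub("xls", "csv", xls_paths)
naive_paths
```

With the anchored pattern the folder name is left untouched, which matters because `convert()` writes to exactly the path it is given.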

Now we proceed to open and combine all files. Each query has been saved in the following format:

Category/Field-Subfield-TimeSpan.csv

In [3]:
# `pattern` is a regular expression, not a glob: match names ending in ".csv"
DirectoryFiles <- list.files(pattern="\\.csv$", recursive = TRUE)

#Number of Queries: 66
length(DirectoryFiles)
head(DirectoryFiles,3)
tail(DirectoryFiles,3)

Combine all files into one table: "RawPatents"  

In [4]:
# tibble() replaces the deprecated data_frame(); each file is read from line 6 onward
RawPatents <- tibble(File = DirectoryFiles) %>%
      mutate(FileContent = 
                   map(DirectoryFiles, ~ fread(., header=TRUE, skip=5, na.strings=c(""),
                                               encoding = "UTF-8", stringsAsFactors=FALSE)))
# unnest() requires the list column to be named explicitly
RawPatents <- data.table(unnest(RawPatents, cols = c(FileContent)))

#Dimension of combined dataset
dim(RawPatents)
n_rows0 <- nrow(RawPatents)

The data collected from the WIPO PATENTSCOPE AI Index contains almost 264 thousand entries. This total includes duplicates, because a patent can be allocated to several fields and subfields.
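For readers who prefer to stay within data.table, the same consolidation can be sketched with `rbindlist()` and its `idcol` argument instead of `unnest()`. This is an alternative, not the approach used in this notebook. The two small files written below are stand-ins with invented content, and the five preamble lines are an assumption based on `skip = 5`.

```r
library(data.table)

# Build two small stand-in export files in a temporary folder (invented content).
tmp <- file.path(tempdir(), "patentscope_demo")
unlink(tmp, recursive = TRUE)
dir.create(file.path(tmp, "Field_Applications"), recursive = TRUE)
dir.create(file.path(tmp, "Techniques"), recursive = TRUE)

write_demo <- function(path, ids) {
  writeLines(c(paste("metadata line", 1:5),   # five preamble lines to skip
               "Application Id,Country",      # header row
               paste0(ids, ",WO")),           # data rows
             path)
}
write_demo(file.path(tmp, "Field_Applications", "Agriculture-Agriculture-All.csv"),
           c("WO1", "WO2"))
write_demo(file.path(tmp, "Techniques", "Search Methods-Search Methods-All.csv"),
           "WO3")

# Read every csv, name each table by its relative path, then stack them.
files  <- list.files(tmp, pattern = "\\.csv$", recursive = TRUE)
tables <- lapply(file.path(tmp, files), fread, header = TRUE, skip = 5)
names(tables) <- files
combined <- rbindlist(tables, idcol = "File")
combined
```

The `File` column produced by `idcol` plays the same role as the `File` column in RawPatents: it records which query each row came from.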

## 1.1) Columns Preparation

In [5]:
colnames(RawPatents)

### 1.1.1. Replace spaces in column headers with underscores

In [6]:
colnames(RawPatents) <- gsub(" ","_", names(RawPatents))
colnames(RawPatents)
dim(RawPatents)
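A small illustration of the renaming step on a plain character vector. The raw header names below are an assumption, reconstructed by reversing the underscores in the cleaned names that appear later in this notebook.

```r
# Assumed raw headers (reconstructed from the cleaned names, not read from a file).
raw_names <- c("Application Id", "Publication Date", "I P C", "National Phase Entries")

# Replace every literal space with an underscore.
clean_names <- gsub(" ", "_", raw_names, fixed = TRUE)
clean_names
```

Names without spaces make the columns usable with `$` and in data.table expressions without backticks.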

### 1.1.2. Assign Category, Field and Subfield to Patent

The "File" column indicates the file from which each entry in RawPatents was obtained. From this column we can extract the Category, Field and Subfield of each patent record.

In [7]:
head(RawPatents$File)

In [8]:
# Split the path on "/", "-" and "."; the fifth piece (the "csv" extension) is discarded via NA
RawPatents <- separate(RawPatents, File,
                       into = c("Category","Field", "Subfield", "Query_Period", NA), sep="[/.\\-]")

colnames(RawPatents)
dim(RawPatents)

"Expected 5 pieces. Additional pieces discarded in 6324 rows [158862, 158863, 158864, 158865, 158866, 158867, 158868, 158869, 158870, 158871, 158872, 158873, 158874, 158875, 158876, 158877, 158878, 158879, 158880, 158881, ...]."

In [9]:
head(RawPatents,1)
tail(RawPatents,1)

Category,Field,Subfield,Query_Period,Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Abstract,I_P_C,Applicants,Inventors,Priorities_Data,National_Phase_Entries
Field_Applications,Agriculture,Agriculture,All,WO2019217152,PCT/US2019/029989,30.04.2019,WO/2019/217152,14.11.2019,WO,DIGITAL VISUALIZATION OF PERIODICALLY UPDATED IN-SEASON AGRICULTURAL FERTILITY PRESCRIPTIONS,"In an embodiment, a computer-implemented data processing method comprises: receiving digital input specifying a request to display a map image of a specified agricultural field for a particular day; in response to receiving the input, calculating an interpolated digital image of the specified agricultural field with a plurality of different field properties, by: dividing a digital map of the specified field into a plurality of grids each having a same size and a same area; obtaining, from digital storage, a plurality of data for the different field properties and assigning the data as covariates; grouping the grids into a specified number of clusters based on values of the covariates; pseudo-randomly selecting a specified number of one or more sample values in each of the clusters; evaluating a digital fertility model using the sample values and storing a plurality of output values from the digital fertility model.",A01B 79/00; A01B 79/02; G06F 17/00; G06Q 50/02,THE CLIMATE CORPORATION,"SANGIREDDY, Harish; DZOTSI, Kofikuma; ARRIAZA, Juan Lopez; GATES, John B.","62/670,707 11.05.2018 US; 16/048,062 27.07.2018 US",


Category,Field,Subfield,Query_Period,Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Abstract,I_P_C,Applicants,Inventors,Priorities_Data,National_Phase_Entries
Techniques,Search Methods,Search Methods,All,WO2018132614,PCT/US2018/013391,11.01.2018,WO/2018/132614,19.07.2018,WO,RULES-BASED NAVIGATION,"In one embodiment, a navigation system (100) for a host vehicle (200) may comprise, at least one processing device (110). The processing device (110) may be programmed to receive a plurality of images representative of an environment of the host vehicle (200). The processing device (110) may also be programmed to analyze the plurality of images to identify at least one navigational state of the host vehicle (200). The processing device (110) may also be programmed to identify a jurisdiction based on at least one indicator of a location of the host vehicle (200), the at least one indicator based at least in part on an analysis of the plurality of images. The processing device (110) may also be programmed to determine at least one navigational rule specific to the identified jurisdiction. The processing device (110) may also be programmed to cause a navigational change based on the identified navigational state and based on the determined at least one navigational rule.",G01C 21/36; G05D 1/02,MOBILEYE VISION TECHNOLOGIES LTD.,"SHALEV-SHWARTZ, Shai; SHASHUA, Amnon; STEIN, Gideon; SHAMMAH, Shaked; TAIEB, Yoav","62/445,500 12.01.2017 US; 62/546,343 16.08.2017 US; 62/565,244 29.09.2017 US; 62/582,687 07.11.2017 US",KR-1020197023466; CN-201880013023.2; EP-2018701918; JP-2019533493
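The `separate()` call in this subsection warns that additional pieces were discarded for some rows. One plausible cause, assumed here and not verified against the actual file names, is a subfield whose name itself contains a hyphen. The sketch below splits two paths on the same separators with base `strsplit()`. The first path is reconstructed from the first output row and the second is purely hypothetical.

```r
files <- c(
  "Field_Applications/Agriculture-Agriculture-All.csv",      # reconstructed from the output
  "Techniques/Machine Learning-Multi-task Learning-All.csv"  # hypothetical, extra hyphen
)

# Split on "/", "." or "-" (an equivalent character class to the one used in separate()).
pieces <- strsplit(files, "[/.-]")
lengths(pieces)
pieces[[2]]
```

In the second case the path yields six pieces instead of five, so the Subfield and Query_Period values would be shifted. That is harmless for this analysis only because both columns are dropped in the next step.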


### 1.1.3 Drop Irrelevant Features

Subfield will also be removed: to reduce the number of duplicates, the data was collected at the field level by default rather than at the subfield level. Duplicates will be identified and handled at the field level.


In [10]:
DropFeatures <- c("Subfield","Query_Period", "I_P_C", "Priorities_Data", "National_Phase_Entries", "Abstract")
# all_of() selects columns by the names stored in a character vector
RawPatents <- select(RawPatents, -all_of(DropFeatures))
colnames(RawPatents)
dim(RawPatents)

### 1.1.4. Adjust Data Type

In [11]:
sapply(RawPatents, class)

In [12]:
RawPatents$Application_Date <- dmy(RawPatents$Application_Date)
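For reference, the same day.month.year conversion can be sketched in base R with `as.Date()` and an explicit format string. The two input strings are taken from the first record shown earlier. Note that only Application_Date is converted in this notebook, while Publication_Date uses the same layout but remains a character column.

```r
# Dates as they appear in the export: day.month.year
raw_dates <- c("30.04.2019", "14.11.2019")

# Base-R counterpart of lubridate::dmy() for this layout.
parsed <- as.Date(raw_dates, format = "%d.%m.%Y")
class(parsed)
format(parsed)
```

An explicit format has the advantage that malformed strings become `NA` instead of being guessed at.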

In [13]:
sapply(RawPatents, class)

In [14]:
head(RawPatents,5)

Category,Field,Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Applicants,Inventors
Field_Applications,Agriculture,WO2019217152,PCT/US2019/029989,2019-04-30,WO/2019/217152,14.11.2019,WO,DIGITAL VISUALIZATION OF PERIODICALLY UPDATED IN-SEASON AGRICULTURAL FERTILITY PRESCRIPTIONS,THE CLIMATE CORPORATION,"SANGIREDDY, Harish; DZOTSI, Kofikuma; ARRIAZA, Juan Lopez; GATES, John B."
Field_Applications,Agriculture,WO2020049182,PCT/EP2019/073911,2019-09-08,WO/2020/049182,12.03.2020,WO,COGNITIVE COMPUTING METHODS AND SYSTEMS BASED ON BIOLOGICAL NEURAL NETWORKS,ALPVISION S.A.,"JORDAN, Frédéric; KUTTER, Martin; DELACRETAZ, Yves"
Field_Applications,Agriculture,WO2020061193,PCT/US2019/051732,2019-09-18,WO/2020/061193,26.03.2020,WO,METHOD AND SYSTEM FOR EXECUTING MACHINE LEARNING ALGORITHMS,THE CLIMATE CORPORATION,"ALVAREZ, Francisco; ALI, Mir; MELCHING, Jeff; HOCHMUTH, Erich"
Field_Applications,Agriculture,WO2012047834,PCT/US2011/054689,2011-10-04,WO/2012/047834,12.04.2012,WO,"A SYSTEM AND METHOD OF PROVIDING AGRICULTURAL PEDIGREE FOR AGRICULTURAL PRODUCTS THROUGHOUT PRODUCTION AND DISTRIBUTION AND USE OF THE SAME FOR COMMUNICATION, REAL TIME DECISION MAKING, PREDICTIVE MODELING, RISK SHARING AND SUSTAINABLE AGRICULTURE","BAYER CROPSCIENCE LP; KLAVINS, Maris","KLAVINS, Maris"
Field_Applications,Agriculture,EP194755451,16189806,2016-09-21,3151169,05.04.2017,EP,METHODS AND SYSTEMS FOR OPTIMIZING HIDDEN MARKOV MODEL BASED LAND CHANGE PREDICTION,TATA CONSULTANCY SERVICES LTD,LADHA SHAMSUDDIN NASIRUDDIN; YADAV PIYUSH; DESHPANDE SHAILESH SHANKAR


## 2) Duplicates elimination and data transformation to show a unique patent application per row

### 2.1 Summary entries per category and field

In [15]:
#Category
table((RawPatents$Category))

#Consistency Check
sum(table((RawPatents$Category)))- n_rows0



     Field_Applications Functional_Applications              Techniques 
                 133398                   80293                   50297 

In [16]:
#Field
table((RawPatents$Field))

#Consistency Check
sum(table((RawPatents$Field)))- n_rows0


                            Agriculture                     Arts and Humanities 
                                    852                                    6504 
                    Banking and Finance                                Business 
                                   4573                                    6385 
                            Cartography                         Computer Vision 
                                   3158                                   45560 
                Computing in Government                         Control Methods 
                                   3840                                    1818 
    Distributed Artificial Intelligence Document Management and Text Processing 
                                    474                                    4916 
                              Education                       Energy Management 
                                   1532                                     733 
                          E

### 2.2  Duplicates Elimination

Duplicates based on the Application ID are searched for at the field level. Any existing duplicate stems from queries run at the subfield level. PATENTSCOPE returns an error for queries that match an excessively large number of patents, so some field-level queries failed. For these fields the data was retrieved at the subfield level. Nonetheless, the scope of the study is set at the field level.

Any patent assigned to a subfield is naturally also part of the parent field. A patent can be assigned to multiple subfields pertaining to the same field, in which case it appears more than once within that field. These are the duplicates to be removed.
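The effect of deduplicating on the (Field, Application_Id) pair can be seen on a toy table. The application ID and the subfield labels are invented for illustration.

```r
library(data.table)

# Toy records: patent A1 was returned by two subfield queries of the same field
# and by one query of a different field.
toy <- data.table(
  Field          = c("Computer Vision", "Computer Vision", "Robotics"),
  Subfield       = c("Subfield X", "Subfield Y", "Robotics"),  # hypothetical labels
  Application_Id = c("A1", "A1", "A1")
)

# Keep the first row for every (Field, Application_Id) pair.
deduped <- unique(toy, by = c("Field", "Application_Id"))
deduped
nrow(toy) - nrow(deduped)  # number of entries removed
```

The patent survives once per field, so its assignment to two different fields is preserved while the within-field repetition is dropped.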


In [17]:
RawPatents <- unique(RawPatents, by=c("Field", "Application_Id"))
n_rows2 <- nrow(RawPatents)
n_rows2

#Number entries removed
n_rows0-n_rows2 

dim(RawPatents)

### 2.3 Modify Dataset Structure: From Multiple Entries to One Patent Application per Row

Because a patent can be assigned to multiple fields, the dataset contains multiple entries for the same patent application.

In [18]:
#Number of entries in the dataset
n_rows2

#Number of unique patent applications
length(unique(RawPatents$Application_Id))


To obtain one row per patent application, the Field and Category columns are cast horizontally into dummy variables, so that each Field and each Category becomes a column in the dataset. If a patent has been assigned to a given field or category, the value in the corresponding column is 1, and 0 otherwise.
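A minimal sketch of the casting idea on invented data. Unlike the notebook cell, which keeps all remaining columns on the left-hand side via `...`, this toy formula uses only Application_Id, so it yields one row per application directly.

```r
library(data.table)

# Long format: one row per (application, field) assignment. IDs are invented.
long <- data.table(
  Application_Id = c("A1", "A1", "A2"),
  Field          = c("Security", "Telecommunications", "Security"),
  Dummy          = 1L
)

# Wide format: one column per field, 1 if assigned and 0 otherwise.
wide <- dcast(long, Application_Id ~ Field, value.var = "Dummy", fill = 0)
wide
```

Application A2 has no Telecommunications entry in the long table, so `fill = 0` supplies the 0 in that column.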

In [19]:
## Map field and category into dummy features
RawPatents$Dummy <- as.integer(1)
RawPatents$Dummy1 <- as.integer(1) 

RawPatents <- dcast(RawPatents, ... ~ Field, value.var =  "Dummy", fill=0)
RawPatents <- dcast(RawPatents, ... ~ Category, value.var = "Dummy1", fill=0 )
n_rows3 <- nrow(RawPatents)

#Output
n_rows3
dim(RawPatents)
colnames(RawPatents)

In [20]:
head(RawPatents,5)

Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Applicants,Inventors,Agriculture,...,Publishing,Robotics,Search Methods,Security,Speech Processing,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques
EP105446266,13198348,2013-12-19,2746975,25.06.2014,EP,CAD drawing notes manager,SIKORSKY AIRCRAFT CORP,MARCHESSEAULT BRIAN DAVID,0,...,0,0,0,0,0,0,0,0,1,0
EP105446266,13198348,2013-12-19,2746975,25.06.2014,EP,CAD drawing notes manager,SIKORSKY AIRCRAFT CORP,MARCHESSEAULT BRIAN DAVID,0,...,0,0,0,0,0,1,0,1,0,0
EP105446280,12199334,2012-12-24,2746988,25.06.2014,EP,Method for identifying a user manipulating a touchscreen device,ORANGE,JONCZYK MACIEJ; KARPINSKI MICHAL; STAROSZCZYK TOMASZ,0,...,0,0,0,0,0,0,0,0,1,0
EP105446280,12199334,2012-12-24,2746988,25.06.2014,EP,Method for identifying a user manipulating a touchscreen device,ORANGE,JONCZYK MACIEJ; KARPINSKI MICHAL; STAROSZCZYK TOMASZ,0,...,0,0,0,1,0,1,1,1,0,0
EP105446307,14151134,2012-02-23,2747013,25.06.2014,EP,System and Method for Analyzing Messages in a Network or Across Networks,BOTTLENOSE INC,SPIVACK NOVA; TER HEIDE DOMINIEK,0,...,0,0,0,0,0,0,0,0,0,1


Aggregate all dummy values per unique Application_Id by summing the Field and Category columns across the rows of each application.
The first Field column is column 10.
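The aggregation can be illustrated on a toy table that mimics the intermediate structure, where an application still occupies one row per category. The IDs and titles are invented, and the dummy columns are a small subset chosen for the example.

```r
library(data.table)

# A1 appears twice (once per category); A2 appears once.
toy <- data.table(
  Application_Id          = c("A1", "A1", "A2"),
  Title                   = c("Title one", "Title one", "Title two"),
  Security                = c(0L, 1L, 0L),
  Telecommunications      = c(0L, 1L, 0L),
  Field_Applications      = c(0L, 1L, 0L),
  Functional_Applications = c(1L, 0L, 1L)
)
dummy_cols <- names(toy)[3:ncol(toy)]

# Step 1: overwrite each dummy with its per-application sum (rows stay duplicated).
toy[, (dummy_cols) := lapply(.SD, sum), by = Application_Id, .SDcols = dummy_cols]

# Step 2: keep a single row per application.
one_per_patent <- unique(toy, by = "Application_Id")
one_per_patent
```

After step 1 both A1 rows carry identical dummy values, which is why dropping all but the first row in step 2 loses no information.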

In [21]:
## Transform data type of dummy variables
RawPatents[, 10:ncol(RawPatents) := lapply(.SD, as.integer),
           .SDcols = 10:ncol(RawPatents)]
sapply(RawPatents, class)

In [22]:
# Aggregate all dummy values per unique Application_Id
RawPatents <-RawPatents[, 10:ncol(RawPatents) := lapply(.SD, sum),
                        by = Application_Id,
                        .SDcols = 10:ncol(RawPatents)]
dim(RawPatents)

In [23]:
#Drop duplicates
RawPatents <- unique(RawPatents, by=c("Application_Id"))
n_rows4 <- nrow(RawPatents)
n_rows4

### 2.4) Check Completeness of Data Processing

In [26]:
#Split Columns into Groups
Features <- colnames(RawPatents)
GeneralFeatures <- Features[1:9]
Categories <- Features[(length(Features)-2):length(Features)]
Fields <- Features[10:(length(Features)-3)]

#Output
Categories
Fields

In [27]:
length(Categories)
length(Fields)

In [28]:
# Patent count by Field
Count_Fields <- RawPatents[, lapply(.SD, sum),
                           .SDcols= Fields]
#Consistency Check
sum(Count_Fields)
sum(Count_Fields) - n_rows2

In [29]:
# Patent count by Category
Count_Categories <- RawPatents[, lapply(.SD, sum),
                               .SDcols= Categories]
#Consistency Check
sum(Count_Categories)
sum(Count_Categories) - n_rows3
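The consistency checks above appear to rely on a simple invariant: every row of the long table contributes exactly one 1 to the dummy columns, so the dummies must sum to the number of long rows and the difference should be 0. A base-R sketch on invented data:

```r
# Long format: one row per (application, field) assignment. IDs are invented.
long <- data.frame(
  Application_Id = c("A1", "A1", "A2"),
  Field          = c("Security", "Telecommunications", "Security")
)

# Cross-tabulate into a 0/1 matrix: one row per application, one column per field.
wide <- unclass(table(long$Application_Id, long$Field))
wide

# The check: total of all dummies minus the number of long rows should be 0.
difference <- sum(wide) - nrow(long)
difference
```

A non-zero difference would indicate that rows were lost or double counted during casting or aggregation.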

## Output File: WIPO_Patents

In [30]:
fwrite(RawPatents, file="F:/Thesis/Working_Data/Final/WIPO_Patents.csv", col.names = TRUE)