Library and Directory Setup

In [2]:
options(warn=-1)

In [3]:
library(data.table)
library(dplyr)
library(readr)
library(purrr)
library(tidyr)
library(lubridate)
library(stringr)
library(rio)


In [4]:
setwd("F:/Thesis/Working_Data/Final")
getwd()

# 1) Merge WIPO_Patents with HAN_Dataset

## 1.1) Open Both Datasets

Warning: Opening HAN_Dataset may take some time due to its size.

In [5]:
HAN_Dataset <- fread("F:/Thesis/Working_Data/Final/HAN_Dataset.txt", header = TRUE, sep = "|", encoding = "UTF-8")
sapply(HAN_Dataset, class)

In [6]:
WIPO_Patents <- fread("F:/Thesis/Working_Data/Final/WIPO_Patents.csv", stringsAsFactors = FALSE, na.strings = "", encoding = "UTF-8")
sapply(WIPO_Patents, class)

In [7]:
head(HAN_Dataset[Publn_auth=="EP",],1)
head(HAN_Dataset[Publn_auth=="WO",],1)

HARM_ID,HAN_ID,Appln_id,Publn_auth,Patent_number,Clean_name,Person_ctry_code
27,27,56203289,EP,EP2044418,002134761 ONTARIO LTD,CA


HARM_ID,HAN_ID,Appln_id,Publn_auth,Patent_number,Clean_name,Person_ctry_code
14,14,336903179,WO,WO2011112122,«FEDERAL GRID CO UNIFIED ENERGY SYSTEMS» JOINT STOCK CO «FGC UES» JSC,RU


In [8]:
head(WIPO_Patents[Country=="EP",], 1)
head(WIPO_Patents[Country=="WO",], 1)

Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Applicants,Inventors,Agriculture,...,Publishing,Robotics,Search Methods,Security,Speech Processing,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques
EP105446266,13198348,2013-12-19,2746975,25.06.2014,EP,CAD drawing notes manager,SIKORSKY AIRCRAFT CORP,MARCHESSEAULT BRIAN DAVID,0,...,0,0,0,0,0,1,0,1,1,0


Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Applicants,Inventors,Agriculture,...,Publishing,Robotics,Search Methods,Security,Speech Processing,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques
WO1983001574,PCT/US1982/001494,1982-10-21,WO/1983/001574,11.05.1983,WO,THRESHOLD PENILE RIGIDITY MEASURING DEVICE,DACOMED CORPORATION,"TIMM, Gerald, W.; BRADLEY, William, E.",0,...,0,0,0,1,0,0,1,1,1,1


## 1.2) Prepare Data Before Merging

The key columns used to match the two tables are:
+ HAN_Dataset: "Patent_number"
+ WIPO_Patents: depends on the patent office:
    + EP Office: "Publication_Number", which lacks the "EP" prefix needed to match the HAN_Dataset
    + WO Office: "Application_Id"

As is, WIPO_Patents has two key columns, and which one applies depends on the office. To resolve this, a new "Patent_number" column is created in WIPO_Patents that takes "Application_Id" for WO patents and "Publication_Number" for EP patents, with "EP" prepended to the latter.

In [9]:
WIPO_Patents[, Patent_number := ifelse(Country == "EP",
                                       paste0("EP", Publication_Number),
                                       Application_Id)]

#Check Last Column
head(WIPO_Patents[Country=="EP",], 1)
head(WIPO_Patents[Country=="WO",], 1)

Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Applicants,Inventors,Agriculture,...,Robotics,Search Methods,Security,Speech Processing,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques,Patent_number
EP105446266,13198348,2013-12-19,2746975,25.06.2014,EP,CAD drawing notes manager,SIKORSKY AIRCRAFT CORP,MARCHESSEAULT BRIAN DAVID,0,...,0,0,0,0,1,0,1,1,0,EP2746975


Application_Id,Application_Number,Application_Date,Publication_Number,Publication_Date,Country,Title,Applicants,Inventors,Agriculture,...,Robotics,Search Methods,Security,Speech Processing,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques,Patent_number
WO1983001574,PCT/US1982/001494,1982-10-21,WO/1983/001574,11.05.1983,WO,THRESHOLD PENILE RIGIDITY MEASURING DEVICE,DACOMED CORPORATION,"TIMM, Gerald, W.; BRADLEY, William, E.",0,...,0,0,1,0,0,1,1,1,1,WO1983001574


## 1.3) Merge Tables

In [10]:
setkey(WIPO_Patents, Patent_number)
setkey(HAN_Dataset, Patent_number)
Patent_HAN <- HAN_Dataset[WIPO_Patents]

head(Patent_HAN,5)

HARM_ID,HAN_ID,Appln_id,Publn_auth,Patent_number,Clean_name,Person_ctry_code,Application_Id,Application_Number,Application_Date,...,Publishing,Robotics,Search Methods,Security,Speech Processing,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques
2850861.0,2850861.0,16428843.0,EP,EP0012777,SYSTRAN INSTITUT GES FUR FORSCHUNG & ENTWICKLUNG MASCHINELLER SPRACHUBERSETZUNGSSYSTEME MBH,DE,EP11269342,78101879,1978-12-30,...,0,0,0,0,0,0,1,1,1,1
1318206.0,1318206.0,16463303.0,EP,EP0039393,IBM CORP,US,EP11305954,81101634,1981-03-06,...,0,0,0,0,0,0,0,1,1,1
,,,,EP0052207,,,EP11348653,81107831,1981-10-02,...,0,0,0,0,0,0,0,1,0,0
2931262.0,2931262.0,16487476.0,EP,EP0059929,STEULER INDUSTRIEWERKE GMBH,DE,EP11373001,82101617,1982-03-03,...,0,0,0,1,0,0,0,1,0,0
3324113.0,698993.0,16500043.0,EP,EP0060671,UBE IND LTD,JP,EP11366509,82301198,1982-03-09,...,0,0,0,0,0,0,0,1,0,0
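
In data.table, `HAN_Dataset[WIPO_Patents]` keeps every row of WIPO_Patents and fills the HAN columns with NA where no match exists; a patent with several HAN entries yields one row per entry. The sketch below illustrates both effects on toy data, using base R's `merge(..., all.y = TRUE)` as a stand-in for the keyed join. The table contents are invented for illustration.

```r
# Hypothetical toy tables; patent numbers and IDs are invented.
han <- data.frame(
  Patent_number = c("EP0000001", "EP0000001", "WO2000000002"),
  HAN_ID        = c(10, 11, 12),
  stringsAsFactors = FALSE
)
wipo <- data.frame(
  Patent_number = c("EP0000001", "WO2000000002", "EP0000003"),
  Title         = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Keep every WIPO row, whether or not it has a HAN match.
merged <- merge(han, wipo, by = "Patent_number", all.y = TRUE)

n_rows      <- nrow(merged)                          # rows after the join
n_unmatched <- sum(is.na(merged$HAN_ID))             # patents without a HAN match
n_patents   <- length(unique(merged$Patent_number))  # distinct patents
c(rows = n_rows, unmatched = n_unmatched, patents = n_patents)
```

The row count exceeds the number of distinct patents because the patent with two HAN entries appears twice.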


### 1.3.1) Explore non-matched patent applications

In [11]:
# Number of patents without a HAN identifier
sum(is.na(Patent_HAN$HAN_ID)) 

In [12]:
#By year
Year_NO_HAN <- table(year(Patent_HAN[is.na(Patent_HAN$HAN_ID),Application_Date]))
Year_NO_HAN


1981 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 
   1    2    1    1    3    7    7    5    4   13    7   13   12   13   23   28 
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 
  22   25   25   24   37   43   38   27   26   35   31   40   53   63   74  172 
2018 2019 
 974 2596 

2019 is clearly the year with the most patent applications missing from the HAN_Dataset.

In [13]:
#Share of 2019 from all years
Year_NO_HAN["2019"]/sum(Year_NO_HAN)

How many patent applications from 2019 were matched with the HAN_Dataset?

In [14]:
Year_Patents <- table(year(Patent_HAN[!is.na(Patent_HAN$HAN_ID),Application_Date]))
Year_Patents


1978 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 
   1    1    9    3   13   11   13   15   35   79  138  162  142  175  170  165 
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
 167  190  239  309  399  487  522  467  522  542  523  498  521  506  502  501 
2012 2013 2014 2015 2016 2017 2018 2019 
 596  770  971 1250 2126 3090 3358  122 

Only 122 patent applications from 2019 were matched. To avoid misinterpretation, the study excludes 2019 patent applications.
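
The two tables above can be combined into a match rate per year, which makes the 2019 gap explicit. The sketch below re-enters the counts for 2017 to 2019 from the printed outputs; the variable names are illustrative.

```r
# Counts copied from the two tables above (2017, 2018, 2019).
matched   <- c("2017" = 3090, "2018" = 3358, "2019" = 122)
unmatched <- c("2017" = 172,  "2018" = 974,  "2019" = 2596)

# Share of each year's applications that received a HAN identifier.
match_rate <- matched / (matched + unmatched)
round(match_rate, 3)
```

With these counts, roughly 95% of the 2017 and 78% of the 2018 applications are matched, against under 5% for 2019.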

### 1.3.2) Drop 2019 Patent Applications

In [15]:
Patent_HAN <- Patent_HAN[year(Application_Date)<2019,]
Year_Patents <- table(year(Patent_HAN[!is.na(Patent_HAN$HAN_ID),Application_Date]))

Year_Patents


1978 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 
   1    1    9    3   13   11   13   15   35   79  138  162  142  175  170  165 
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
 167  190  239  309  399  487  522  467  522  542  523  498  521  506  502  501 
2012 2013 2014 2015 2016 2017 2018 
 596  770  971 1250 2126 3090 3358 

# 2) Data Processing

## 2.1) Drop Irrelevant Columns

In [16]:
DropFeatures2 <- c("Application_Id","Application_Number", "Publication_Number", "Publication_Date",
                   "HARM_ID", "Appln_id", "Publn_auth")
Patent_HAN[, (DropFeatures2) := NULL]

n_rows_HAN <- nrow(Patent_HAN)
dim(Patent_HAN)

Patent_HAN contains 22,037 rows, more than the 22,010 unique patents previously identified. This is because a patent can have more than one applicant, so the HAN_Dataset often holds multiple entries for the same patent application.

## 2.2) Country Selection

Due to limitations in subsequent data preparation steps, the study is restricted to the 20 countries with the most patent applications, as given by "Person_ctry_code" in the HAN_Dataset.

In [17]:
#Identify Top 20 Countries
sort(table(Patent_HAN$Person_ctry_code)) 


  BS   CR   HK   ID   LI   LV   PA   CL   CO   AN   LT   SI   SK   CY   IS   EE 
   1    1    1    1    1    1    1    2    2    3    3    3    3    4    4    5 
  GR   HR   VG   CZ   LU   KY   PT   ZA   HU   SA   TR   BB   PL   TW   NZ   MY 
   5    6    6    9   14   15   15   15   17   21   21   22   22   25   27   28 
  NO   BR   RU   AT   IE   BE   DK   ES   SG   IN   IT   FI   SE   AU   CH   IL 
  30   31   37   51  129  135  137  138  138  159  202  232  260  267  322  405 
  CA   FR   NL   KR   GB   DE   CN   JP   US 
 498  531  594  684  993 1091 1552 1750 9518 

In [18]:
#Substitute Country code with Country name
Patent_HAN <- Patent_HAN[, Person_ctry_code := Person_ctry_code %>%
                               gsub("^US", "UNITED STATES",.) %>%
                               gsub("^JP", "JAPAN",.) %>%
                               gsub("^GB", "UNITED KINGDOM",.) %>%
                               gsub("^DE", "GERMANY",.) %>%
                               gsub("^CH", "SWITZERLAND",.) %>%
                               gsub("^CN", "CHINA",.) %>%
                               gsub("^CA", "CANADA",.) %>%
                               gsub("^NL", "NETHERLANDS",.) %>%
                               gsub("^IL", "ISRAEL",.) %>%
                               gsub("^KR", "SOUTH KOREA",.) %>%
                               gsub("^FR", "FRANCE",.) %>%
                               gsub("^AU", "AUSTRALIA",.) %>%
                               gsub("^FI", "FINLAND",.) %>%
                               gsub("^IT", "ITALY",.) %>%
                               gsub("^SE", "SWEDEN",.) %>%
                               gsub("^IN", "INDIA",.) %>%
                               gsub("^BE", "BELGIUM",.) %>%
                               gsub("^SG", "SINGAPORE",.) %>%
                               gsub("^ES", "SPAIN",.) %>%
                               gsub("^DK", "DENMARK",.)]

unique(Patent_HAN$Person_ctry_code)
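
The same substitution can also be written as a lookup table, which removes any dependence on the order of the gsub calls (for instance, "CHINA" itself begins with "CH"). A minimal sketch, assuming a named vector called `country_names` (a hypothetical helper, not part of the pipeline above) and a small vector of sample codes:

```r
# Hypothetical lookup table: ISO code -> country name for the top 20 countries.
country_names <- c(
  US = "UNITED STATES", JP = "JAPAN", GB = "UNITED KINGDOM", DE = "GERMANY",
  CH = "SWITZERLAND", CN = "CHINA", CA = "CANADA", NL = "NETHERLANDS",
  IL = "ISRAEL", KR = "SOUTH KOREA", FR = "FRANCE", AU = "AUSTRALIA",
  FI = "FINLAND", IT = "ITALY", SE = "SWEDEN", IN = "INDIA",
  BE = "BELGIUM", SG = "SINGAPORE", ES = "SPAIN", DK = "DENMARK"
)

# Sample codes, including one outside the top 20 and one missing value.
codes <- c("US", "JP", "IS", "DK", NA)

# Replace known codes by name; leave everything else untouched.
mapped <- unname(ifelse(codes %in% names(country_names),
                        country_names[codes],
                        codes))
mapped
```

Codes outside the table pass through unchanged, so the later `%in% Top_Countries` filter would behave as before.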

In [19]:
#Select only patent applications related to top countries

Top_Countries <- c("UNITED STATES", "JAPAN","UNITED KINGDOM", "GERMANY", "CHINA", "CANADA", "NETHERLANDS",
                   "ISRAEL", "SOUTH KOREA", "FRANCE", "SWITZERLAND", "AUSTRALIA", "FINLAND", "ITALY",
                   "SWEDEN", "INDIA", "BELGIUM", "SINGAPORE", "SPAIN", "DENMARK")

Patent_HAN <- Patent_HAN[Person_ctry_code %in% Top_Countries,]

#Number of remaining patent applications
n_rows_HAN2 <- nrow(Patent_HAN)
n_rows_HAN2

#Number of dropped patent applications
n_rows_HAN - n_rows_HAN2

Country Breakdown

In [20]:
Country_Breakdown <- round(prop.table(sort(table(Patent_HAN$Person_ctry_code))),2)
Country_Breakdown


       BELGIUM        DENMARK      SINGAPORE          SPAIN          INDIA 
          0.01           0.01           0.01           0.01           0.01 
         ITALY        FINLAND         SWEDEN      AUSTRALIA    SWITZERLAND 
          0.01           0.01           0.01           0.01           0.02 
        ISRAEL         CANADA         FRANCE    NETHERLANDS    SOUTH KOREA 
          0.02           0.03           0.03           0.03           0.03 
UNITED KINGDOM        GERMANY          CHINA          JAPAN  UNITED STATES 
          0.05           0.06           0.08           0.09           0.49 

## 2.3) Identification of Type of Applicant

In [21]:
colnames(Patent_HAN)

Based on data exploration, three types of applicants are identified:
+ "Inventor": the 'Inventors' and 'Applicants' names are identical. Normally the applicant is the organization "sponsoring" the patent application, while the inventor is the individual behind the invention. When the two fields are identical, it is assumed that an individual rather than an organization is the ultimate beneficiary of the invention's protection.

+ "Research Institution": the applicant is an institution but not a for-profit company, for example a university or a foundation. These are identified through a series of key words in both the harmonized name ('Clean_name') and the original applicant name ('Applicants').

+ "Enterprise": for-profit organizations.

The result is recorded in a new column called "Applicant_Type" using two nested ifelse statements:

 + If the 'Applicants' and 'Inventors' columns match --> "Inventor"
     + else: if 'Clean_name' **or** 'Applicants' contains any of the key words --> "Research Institution"
         + else: "Enterprise"
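
The nested rule can be sketched on toy data as follows. The keyword vector is a subset of the list used in this notebook, and the function name `classify_applicant` and the sample rows are invented for illustration. The pattern is assembled with `paste()` on the assumption that no whitespace should be part of it: a pattern string continued across several source lines keeps the line breaks and indentation inside the regex, so an alternative that follows a break only matches text containing that whitespace.

```r
# Subset of the key words, joined into one pattern with no embedded whitespace.
keywords <- c("UNIVERSITY", "COLLEGE", "INSTITUTE", "INSTITUT",
              "FOUNDATION", "ACADEMY", "STIFTUNG")
pattern <- paste(keywords, collapse = "|")

# Hypothetical helper applying the nested rule described above.
classify_applicant <- function(applicants, inventors, clean_name) {
  is_research <- grepl(pattern, applicants) | grepl(pattern, clean_name)
  ifelse(inventors == applicants, "Inventor",
         ifelse(is_research, "Research Institution", "Enterprise"))
}

# Invented sample rows; the last one has a missing applicant name.
applicants <- c("DOE, JANE", "EXAMPLE TECH", "ACME CORP", NA)
inventors  <- c("DOE, JANE", "ROE, RICHARD", "ROE, RICHARD", "ROE, RICHARD")
clean_name <- c("DOE JANE", "EXAMPLE INSTITUTE OF TECHNOLOGY",
                "ACME CORP", "SAMPLE INSTITUT GES")

applicant_type <- classify_applicant(applicants, inventors, clean_name)
applicant_type

# A pattern literal continued on a new line keeps the newline and indentation.
broken <- "INSTITUTE|
           INSTITUT"
broken_hit  <- grepl(broken, "SAMPLE INSTITUT GES")
pattern_hit <- grepl(pattern, "SAMPLE INSTITUT GES")
c(broken = broken_hit, pasted = pattern_hit)
```

The last toy row comes back as NA because the comparison `inventors == applicants` is NA when either side is missing.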


In [22]:
Patent_HAN <- Patent_HAN[, Applicant_Type := 
                               ifelse(Patent_HAN$Inventors==Patent_HAN$Applicants, "Inventor",
                                      ifelse(grepl("UNIVERSITY|COLLEGE|INSTITUTE|FOUNDATION|ECOLE|FOUND
                                                   |UNIV|FUNDACIO|UNIVERSIDAD|UNIVERSITÄT|INSTITUT|
                                                   CONSERVAT|ACADEMY|HOCHSCHULE|STIFTUNG|UNIVERSITAT|
                                                   UNIVERSITAET|UNIVERSITIT|SCOLA",
                                                   Applicants) | grepl("UNIVERSITY|COLLEGE|INSTITUTE|FOUNDATION|
                                                                  ECOLE|FOUND|UNIV|FUNDACIO|UNIVERSIDAD|UNIVERSITÄT|
                                                                  INSTITUT|CONSERVAT|ACADEMY|HOCHSCHULE|STIFTUNG|
                                                                  UNIVERSITAT|UNIVERSITAET|UNIVERSITIT|SCOLA", Clean_name),
                                             "Research Institution","Enterprise"))]
#Applicant Type Breakdown
table(Patent_HAN[,Applicant_Type])

#Consistency Check
sum(table(Patent_HAN[,Applicant_Type]))-n_rows_HAN2


          Enterprise             Inventor Research Institution 
               16768                   63                 2733 

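The difference arises because the applicant type could not be determined for some applications: ifelse returns NA whenever the comparison of 'Inventors' and 'Applicants' is NA, which happens when either field is missing.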

In [23]:
sum(is.na(Patent_HAN[,Applicant_Type]))

In [24]:
#These are assumed to be enterprise
Patent_HAN <- Patent_HAN[, Applicant_Type := ifelse(is.na(Applicant_Type),"Enterprise", Applicant_Type)]

table(Patent_HAN$Applicant_Type)
sum(is.na(Patent_HAN$Applicant_Type))


          Enterprise             Inventor Research Institution 
               16810                   63                 2733 

## 2.4) Removal of Legal Designation and Non-Alphanumeric Characters

Create a new column "APPLICANT_MATCH_NAME" to store the edited name, so that the original harmonized names are preserved.

In [25]:
Patent_HAN <- Patent_HAN[, APPLICANT_MATCH_NAME := Clean_name]

length(unique(Patent_HAN$APPLICANT_MATCH_NAME))

Remove non-alphanumeric characters

In [26]:
Patent_HAN[, APPLICANT_MATCH_NAME:= APPLICANT_MATCH_NAME %>%
                 iconv(., from = "UTF-8", to="ASCII//TRANSLIT") %>%
                 gsub("\\.$","",.) %>%
                 gsub("[^[:alnum:][:blank:].,&]","",.) %>%
                 trimws()]

length(unique(Patent_HAN$APPLICANT_MATCH_NAME))
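
To see what the character filter keeps and drops, the sketch below applies the two gsub steps to a pair of illustrative strings. The `iconv` transliteration step is left out here because its output can vary by platform.

```r
# Illustrative applicant strings.
raw <- c("STEULER- INDUSTRIEWERKE GMBH", "UBE INDUSTRIES, LTD.")

clean <- gsub("\\.$", "", raw)                        # drop a trailing dot
clean <- gsub("[^[:alnum:][:blank:].,&]", "", clean)  # keep letters, digits, blanks, dots, commas, &
clean <- trimws(clean)
clean
```

Dots, commas and ampersands inside a name survive this step, which is why the country-specific patterns that follow include variants such as ",LTD".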

Legal designation was removed on a country basis. 
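
Each country block follows the same mechanism: a list of suffix patterns, each anchored to the end of the name and removed in turn. A minimal sketch of that mechanism, assuming a hypothetical helper `strip_suffixes` and invented company names. It also shows why a literal dot in a designation has to be escaped in the pattern.

```r
# Hypothetical helper: remove each suffix pattern in turn when it ends the name.
strip_suffixes <- function(x, suffixes) {
  for (s in suffixes) {
    x <- sub(paste0(s, "$"), "", x)
  }
  trimws(x)
}

# Italian designations with literal dots escaped.
italy_suffixes <- c("S\\.P\\.A", " SPA", ",SPA", "S\\.R\\.L", " SRL", ",SRL")

# Invented example names.
stripped <- strip_suffixes(c("ALFA S.P.A", "BETA SRL", "GAMMA"), italy_suffixes)
stripped

# An unescaped dot matches any character, so "S.A$" also removes "SLA".
unescaped <- sub("S.A$", "", "TESLA")
escaped   <- sub("S\\.A$", "", "TESLA")
c(unescaped = unescaped, escaped = escaped)
```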

In [27]:
#United States
Patent_HAN[Person_ctry_code=="UNITED STATES", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("INCORPORATED$","",.) %>%
                 gsub(" INC$","",.) %>%
                 gsub(",INC$","",.) %>%
                 gsub("LIMITED$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 gsub(" LLC$","",.) %>%
                 gsub(",LLC$","",.) %>%
                 gsub("L\\.L\\.C$","",.) %>%
                 gsub(" LLP$","",.) %>%
                 gsub(",LLP$","",.) %>%
                 gsub("CORPORATION$","",.) %>%
                 gsub(" CORP$","",.) %>%
                 gsub(",CORP$","",.) %>%
                 gsub(" CO INC$","",.) %>%
                 gsub(" CO LLC$","",.) %>%
                 gsub(" CO$","",.) %>%
                 gsub(",CO$","",.) %>%
                 gsub(" LP$","",.) %>%
                 gsub(",LP$","",.) %>%
                 trimws()]                 

In [28]:
#China
Patent_HAN[Person_ctry_code=="CHINA", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("CO\\.,LTD$","",.) %>%
                 gsub("CO\\., LTD$","",.) %>%
                 gsub("CO\\. LTD$","",.) %>%
                 gsub("CO\\.LTD$","",.) %>%
                 gsub("CORPORATION$","",.) %>%
                 gsub(" CORP$","",.) %>%
                 gsub(",CORP$","",.) %>%
                 gsub(" CO$","",.) %>%
                 gsub(",CO$","",.) %>%
                 gsub(" CORP LTD$","",.) %>%
                 gsub(",CORP LTD$","",.) %>%
                 gsub("LIMITED$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 gsub("INCORPORATED$","",.) %>%
                 gsub(" INC$","",.) %>%
                 gsub(",INC$","",.) %>%
                 trimws()]

In [29]:
#United Kingdom 
Patent_HAN[Person_ctry_code=="UNITED KINGDOM", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("LIMITED$", "",.) %>%
                 gsub(" LTD$", "",.) %>%
                 gsub(",LTD$", "",.) %>%
                 gsub(" LLC$", "",.) %>%
                 gsub(",LLC$", "",.) %>%
                 gsub(" PLC$", "",.) %>%
                 gsub(",PLC$", "",.) %>%
                 gsub(" INC$", "",.) %>%
                 gsub(",INC$", "",.) %>%
                 trimws()]

In [30]:
#France
Patent_HAN[Person_ctry_code=="FRANCE", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("SÀRL$","",.) %>%
                 gsub(" SARL$","",.) %>%
                 gsub(",SARL$","",.) %>%
                 gsub(" SASU$","",.) %>%
                 gsub(",SASU$","",.) %>%
                 gsub(" SRL$","",.) %>%
                 gsub(",SRL$","",.) %>%
                 gsub("S\\.A\\.S$","",.) %>%
                 gsub(" SAS$","",.) %>%
                 gsub(",SAS$","",.) %>%
                 gsub("S\\.A$","",.) %>%
                 gsub(" SA$","",.) %>%
                 gsub(",SA$","",.) %>%
                 gsub(" SE$","",.) %>%
                 gsub(",SE$","",.) %>%
                 gsub("SOCIÉTÉ ANONYME$","",.) %>%
                 gsub("^SAS ","",.) %>%
                 trimws()]

In [31]:
#Israel
Patent_HAN[Person_ctry_code=="ISRAEL", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("LIMITED$","",.) %>%
                 gsub(" LTD$", "",.) %>%
                 gsub(",LTD$", "",.) %>%
                 gsub("INCORPORATED$","",.) %>%
                 gsub(" INC$", "",.) %>%
                 gsub(",INC$", "",.) %>%
                 gsub("CO LTD$","",.) %>%
                 trimws()]

In [32]:
#Canada
Patent_HAN[Person_ctry_code=="CANADA", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("INCORPORATED$","",.) %>%
                 gsub(" INC$", "",.) %>%
                 gsub(",INC$", "",.) %>%
                 gsub("CORPORATION$", "",.) %>%
                 gsub(" CORP$", "",.) %>%
                 gsub(",CORP$", "",.) %>%
                 gsub("LIMITED$", "",.) %>%
                 gsub(" LTD$", "",.) %>%
                 gsub(",LTD$", "",.) %>%
                 gsub(" ULC$","",.) %>%
                 gsub(",ULC$","",.) %>%
                 gsub(" LP$","",.) %>%
                 gsub(",LP$","",.) %>%
                 trimws()]

In [33]:
#Japan
Patent_HAN[Person_ctry_code=="JAPAN", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("CO\\.,LTD$","",.) %>%
                 gsub("CO\\., LTD$","",.) %>%
                 gsub("CO\\. LTD$","",.) %>%
                 gsub("CO\\.LTD$","",.) %>%
                 gsub("CORPORATION$","",.) %>%
                 gsub(" CORP$","",.) %>%
                 gsub(",CORP$","",.) %>%
                 gsub("CO INC$","",.) %>%
                 gsub(" CO$","",.) %>%
                 gsub(",CO$","",.) %>%
                 gsub("LIMITED$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 gsub("INCORPORATED$","",.) %>%
                 gsub(" INC$","",.) %>%
                 gsub(",INC$","",.) %>%
                 gsub("K\\.K$","",.) %>%
                 gsub(" KK$","",.) %>%
                 gsub(",KK$","",.) %>%
                 gsub("^CO LTD","",.) %>%
                 trimws()]

In [34]:
#India
Patent_HAN[Person_ctry_code=="INDIA", APPLICANT_MATCH_NAME :=  APPLICANT_MATCH_NAME %>%
                 gsub("PRIVATE LIMITED$","",.) %>%
                 gsub("PRIVATE LTD$","",.) %>%
                 gsub("PVT\\.LTD$","",.) %>%
                 gsub("PVT\\. LTD$","",.) %>%
                 gsub("PVT LTD$","",.) %>%
                 gsub(" PVT$","",.) %>%
                 gsub(",PVT$","",.) %>%
                 gsub(" INC$","",.) %>%
                 gsub(",INC$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 trimws()]

In [35]:
#Germany
Patent_HAN[Person_ctry_code=="GERMANY", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub(" SLU GMBH$", "",.) %>%
                 gsub("GMBH & CO KG$", "",.) %>%
                 gsub("GMBH & CO\\. KG$", "",.) %>%
                 gsub("GMBH & CO$", "",.) %>%
                 gsub(" KG$", "",.) %>%
                 gsub(",KG$", "",.) %>%
                 gsub("GMBH$", "",.) %>%
                 gsub("MBH$", "",.) %>%
                 gsub(" AG$", "",.) %>%
                 gsub(",AG$", "",.) %>%
                 gsub(" SE$", "",.) %>%
                 gsub(",SE$", "",.) %>%
                 trimws()]

In [36]:
#Singapore
Patent_HAN[Person_ctry_code=="SINGAPORE", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("PRIVATE LIMITED$", "",.) %>%
                 gsub("PTE\\. LTD$", "",.) %>%
                 gsub("PTE LTD$", "",.) %>%
                 gsub("LIMITED$", "",.) %>%
                 gsub(" LTD$", "",.) %>%
                 gsub(",LTD$", "",.) %>%
                 gsub("CORP LTD$","",.) %>%
                 trimws()]

In [37]:
#Australia
Patent_HAN[Person_ctry_code=="AUSTRALIA", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("PTY\\. LTD$","",.) %>%
                 gsub("PTY\\.LTD$","",.) %>%
                 gsub("PTY LTD$","",.) %>%
                 gsub("PTY$","",.) %>%
                 gsub("PTY LIMITED$","",.) %>%
                 gsub("LIMITED$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 gsub("CORPORATION$","",.) %>%
                 gsub(" CORP$","",.) %>%
                 gsub(" CO$","",.) %>%
                 gsub(",CO$","",.) %>%
                 trimws()]

In [38]:
#Sweden
Patent_HAN[Person_ctry_code=="SWEDEN", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub(" PUBL$","",.) %>%   # parentheses were already stripped in the cleaning step
                 gsub("SE AB$","",.) %>% 
                 gsub(" AB$","",.) %>%
                 gsub(",AB$","",.) %>%
                 gsub("^AB ","",.) %>%
                 gsub(" CORP$","",.) %>%
                 gsub(",CORP$","",.) %>%
                 trimws()]

In [39]:
#Spain
Patent_HAN[Person_ctry_code=="SPAIN", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub(" SL$","",.) %>%
                 gsub(",SL$","",.) %>%
                 gsub("S\\.L$","",.) %>%
                 gsub(", S\\.L$","",.) %>%
                 gsub(" SA$","",.) %>%
                 gsub(",SA$","",.) %>%
                 gsub("S\\.A$","",.) %>%
                 gsub(", S\\.A$","",.) %>%
                 gsub("S\\.L\\.U$","",.) %>%
                 gsub(" SLU$","",.) %>%
                 gsub(",SLU$","",.) %>%
                 trimws()]

In [40]:
#Switzerland
Patent_HAN[Person_ctry_code=="SWITZERLAND", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("SÀRL$","",.) %>%
                 gsub(" AG$","",.) %>% 
                 gsub(",AG$","",.) %>% 
                 gsub("GMBH$","",.) %>%
                 gsub("S\\.A$","",.) %>%
                 gsub(" SA$","",.) %>%
                 gsub(",SA$","",.) %>%
                 gsub(" SARL$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 trimws()]

In [41]:
#Belgium
Patent_HAN[Person_ctry_code=="BELGIUM", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub(" NV SA$","",.) %>%
                 gsub(" NV$","",.) %>%
                 gsub(" SA$","",.) %>%
                 gsub(" VZW$","",.) %>%
                 gsub(" BVBA$","",.) %>%
                 gsub(" SPRL$","",.) %>%
                 trimws()]

In [42]:
#Netherlands
Patent_HAN[Person_ctry_code=="NETHERLANDS", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("N\\.V$","",.) %>%
                 gsub(" NV$","",.) %>%
                 gsub(",NV$","",.) %>%
                 gsub("B\\.V$","",.) %>%
                 gsub(" BV$","",.) %>%
                 gsub(",BV$","",.) %>%
                 trimws()]

In [43]:
#South Korea
Patent_HAN[Person_ctry_code=="SOUTH KOREA", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("CO\\.,LTD$","",.) %>%
                 gsub("CO\\., LTD$","",.) %>%
                 gsub("CO\\. LTD$","",.) %>%
                 gsub("CO\\.LTD$","",.) %>%
                 gsub("CORPORATION$","",.) %>%
                 gsub("CORP$","",.) %>%
                 gsub(" CO$","",.) %>%
                 gsub(",CO$","",.) %>%
                 gsub("LIMITED$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(",LTD$","",.) %>%
                 gsub("INCORPORATED$","",.) %>%
                 gsub(" INC$","",.) %>%
                 gsub(", INC$","",.) %>%
                 trimws()]

In [44]:
#Italy
Patent_HAN[Person_ctry_code=="ITALY", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("S\\.P\\.A$","",.) %>%
                 gsub(" SPA$","",.) %>%
                 gsub(",SPA$","",.) %>%
                 gsub("S\\.R\\.L$","",.) %>%
                 gsub(" SRL$","",.) %>%
                 gsub(",SRL$","",.) %>%
                 trimws()]

In [45]:
#Finland
Patent_HAN[Person_ctry_code=="FINLAND", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("OYJ$","",.) %>%
                 gsub("OY AB$","",.) %>%
                 gsub("OY$","",.) %>%
                 gsub(" INC$","",.) %>%
                 gsub(",INC$","",.) %>%
                 gsub(" LTD$","",.) %>%
                 gsub(", LTD$","",.) %>%
                 gsub(" CORP$","",.) %>%
                 gsub(" AB$","",.) %>%
                 gsub(" AB LTD$","",.) %>%
                 gsub(" PLC$","",.) %>%
                 gsub("^OY ","",.) %>%
                 trimws()]

In [46]:
#Denmark
Patent_HAN[Person_ctry_code=="DENMARK", APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub(" APS$","",.) %>%
                 gsub(" AS$","",.) %>%
                 trimws()]

In [47]:
#Remove trailing dots and commas, and a trailing " TECH"
Patent_HAN[, APPLICANT_MATCH_NAME := APPLICANT_MATCH_NAME %>%
                 gsub("\\.$","",.) %>%
                 gsub(",$","",.) %>%
                 gsub(" TECH$","",.) %>%      
                 trimws()]


In [48]:
head(Patent_HAN$APPLICANT_MATCH_NAME,20)

## 2.5) Insert Key Word for Subsequent Patent-Industrial Dataset Matching

To guide the Convoluted Fuzzy Match algorithm, the first word of the applicant's name is stored in a new column ("Applicant_Key_Word").

In [49]:
Patent_HAN[, Applicant_Key_Word := stringr::word(APPLICANT_MATCH_NAME,1)]
head(Patent_HAN$Applicant_Key_Word,20)

## 2.6) Identify Patent Applications that have Multiple Applicants

In [50]:
Duplicated_Applicant <- Patent_HAN[duplicated(Patent_HAN$Patent_number),Patent_number]
length(Duplicated_Applicant)

Add a column to Patent_HAN labeling patent applications with multiple applicants

In [51]:
Patent_HAN[, Multiple_Applicants := Patent_number %in% Duplicated_Applicant]

#Overview
table(Patent_HAN[, Multiple_Applicants])


FALSE  TRUE 
14949  4657 
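
`duplicated()` flags only the second and later occurrences of a value, which is why the flag is built in two steps: collect the repeated patent numbers, then mark every row whose number is in that set. A small sketch with invented patent numbers:

```r
# Invented patent numbers: EP2 has two applicants, WO3 has three.
patent_number <- c("EP1", "EP2", "EP2", "WO3", "WO3", "WO3")

# Step 1: numbers that occur more than once (first occurrences are not flagged).
dup_ids <- patent_number[duplicated(patent_number)]

# Step 2: mark every row of a repeated patent, including its first occurrence.
multiple_applicants <- patent_number %in% dup_ids

dup_ids
multiple_applicants
```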

## 2.7) GPT Market Cluster Mapping

In [52]:
Patent_HAN <- Patent_HAN[, GPT_Scope:= as.factor(ifelse(Field_Applications==1, "Applied_AI", "Core_AI"))]
table(Patent_HAN$GPT_Scope)


Applied_AI    Core_AI 
     18693        913 

In [53]:
head(Patent_HAN)

HAN_ID,Patent_number,Clean_name,Person_ctry_code,Application_Date,Country,Title,Applicants,Inventors,Agriculture,...,Telecommunications,Transportation,Field_Applications,Functional_Applications,Techniques,Applicant_Type,APPLICANT_MATCH_NAME,Applicant_Key_Word,Multiple_Applicants,GPT_Scope
2850861,EP0012777,SYSTRAN INSTITUT GES FUR FORSCHUNG & ENTWICKLUNG MASCHINELLER SPRACHUBERSETZUNGSSYSTEME MBH,GERMANY,1978-12-30,EP,Method using a programmed digital computer system for translation between natural languages.,SYSTRAN INST,TOMA PETER DR,0,...,0,1,1,1,1,Enterprise,SYSTRAN INSTITUT GES FUR FORSCHUNG & ENTWICKLUNG MASCHINELLER SPRACHUBERSETZUNGSSYSTEME,SYSTRAN,False,Applied_AI
1318206,EP0039393,IBM CORP,UNITED STATES,1981-03-06,EP,TEXT RECORDER WITH AUTOMATIC WORD ENDING AND METHOD OF OPERATING THE SAME,INTERNATIONAL BUSINESS MACHINES CORPORATION,"HANFT, ROY FRANCIS; PECHANEK, GERALD GEORGE",0,...,0,0,1,1,1,Enterprise,IBM,IBM,False,Applied_AI
2931262,EP0059929,STEULER INDUSTRIEWERKE GMBH,GERMANY,1982-03-03,EP,A METHOD FOR PRODUCING LINING PANELS WHICH CAN BE REMOVABLY ATTACHED TO A SURFACE,STEULER- INDUSTRIEWERKE GMBH,"ROHRINGER, ERNST",0,...,0,0,1,0,0,Enterprise,STEULER INDUSTRIEWERKE,STEULER,False,Applied_AI
698993,EP0060671,UBE IND LTD,JAPAN,1982-03-09,EP,PROCESS FOR CONTINUOUSLY PRODUCING A POLYMERIC LAMINATE TAPE HAVING A PLURALITY OF METAL WIRES EMBEDDED THEREWITHIN,"UBE INDUSTRIES, LTD.","HAYASHI, MASUMI C/O UBE INDUSTRIES LTD.; OGAWA, KAZUO C/O UBE INDUSTRIES LTD.; KIMURA, KATSUMI C/O UBE INDUSTRIES LTD.; ISHII, HIROSHI C/O UBE INDUSTRIES LTD.; BANDAI, SATOSHI C/O UBE INDUSTRIES LTD.",0,...,0,0,1,0,0,Enterprise,UBE IND,UBE,False,Applied_AI
558027,EP0091317,CO LTD TOSHIBA,JAPAN,1983-04-06,EP,SYNTAX ANALYZING METHOD AND APPARATUS,TOKYO SHIBAURA DENKI KABUSHIKI KAISHA,"AMANO, SHIN-YA; HIRAKAWA, HIDEKI",0,...,0,0,1,1,0,Enterprise,TOSHIBA,TOSHIBA,False,Applied_AI
422610,EP0096712,NCR CORP,UNITED STATES,1982-12-07,EP,A SYSTEM AND METHOD FOR RECOGNIZING SPEECH,NCR CORPORATION,"AVERY, JAMES MARTIN; HOYER, ELMER AUGUST",0,...,0,0,1,1,1,Enterprise,NCR,NCR,False,Applied_AI


# Output

In [54]:
fwrite(Patent_HAN, file = "F:/Thesis/Working_Data/Final/Patent_Dataset.csv", col.names = TRUE)