# Demo - Exclusion List

### Getting started

To run this demonstration, we first need to install and load a few R packages. This can be done by running the following lines:

In [None]:
required.packages <- c("rio", "dplyr", "tm", "RecordLinkage")

# installed.packages() returns a matrix; package names are its row names
is.not.installed <- !(required.packages %in% rownames(installed.packages()))
if (any(is.not.installed)) {
    install.packages(required.packages[is.not.installed])
}
invisible(lapply(required.packages, library, character.only = TRUE))


The following URL gives access to the data used for the demo:


In [111]:
    
path.to.github <- "https://raw.github.com/olerkhan/Company-Name-Record-Linkage/master/"

# (For Swiss Re only)
Sys.setenv(https_proxy = "gate-zrh.swissre.com:9443")


## 1. Looking at the data

We first download some mock data on which to base our demo.

Note that, most of the time, portfolio and target list information can be obtained in tabular format, from Excel/CSV files or from SQL databases. We will thus assume such a format is available in our case and use it for the purpose of this presentation. <br>

<br>

#### Portfolio Data

First, let us retrieve and have a look at our portfolio *P*:

In [None]:

toy.portfolio <- import(paste0(path.to.github, "Fake_Portfolio.xlsx"))
toy.portfolio <- as_tibble(toy.portfolio)


We can inspect the content of *toy.portfolio* to see that it contains a list of companies, which are given by their names, together with some other information (country, headquarters). Furthermore, *toy.portfolio* can also typically contain some insurance-specific information such as premium, limit, attachment point, etc.:

In [None]:
toy.portfolio


<br>

#### Target (exclusion) list data

Let us now have a look at the target list of 'bad' companies we will need to monitor:

In [None]:

toy.target.list <- import(paste0(path.to.github, "Fake_Target_List.xlsx"))
toy.target.list <- as_tibble(toy.target.list)
toy.target.list


<br>

## 2. A first naive approach - direct matching

We first try the most simplistic approach (which is bound to fail): a direct comparison between the names.

In [None]:
IsInList <- function(insured.names, target.list.names) {
  is.in.list <- insured.names %in% target.list.names
  return(is.in.list)
}

IsInList(
  insured.names = c("a", "b"), # dummy insured names
  target.list.names = c("aa", "aaa", "", "b", "not-a") # dummy target list names
)


*a* does not appear in the target list, but *b* does, hence the FALSE and TRUE.
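
To make the limitation of this comparison concrete, here is a minimal sketch in base R. The names are made up for illustration and are not taken from the demo data. It shows that `%in%` compares strings character by character, so a difference in capitalisation, spacing or legal suffix is enough to prevent a match.

```r
# Hypothetical names, for illustration only (not taken from the demo data)
target.names <- c("Acme Holding")

candidates <- c(
  "Acme Holding",       # identical
  "ACME HOLDING",       # different capitalisation
  "Acme  Holding",      # two spaces instead of one
  "Acme Holding GmbH"   # extra legal suffix
)

# %in% performs an exact, case-sensitive string comparison
exact.match <- candidates %in% target.names
exact.match
```

Only the first candidate is reported as a match, although a human reader would consider all four to be the same company.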

So, as we can see, there is a match if and only if the names agree perfectly. Let us now apply this to our portfolio:

In [None]:

FlagPortfolioEntries <- function(portfolio, target.list) {
  is.in.list <- IsInList(portfolio$Insured_Names, target.list$Company_Names)
  portfolio[["IsInList"]] <- ifelse(is.in.list, "Yes", "No")
  return(portfolio)
}

FlagPortfolioEntries(toy.portfolio, toy.target.list) %>% filter(IsInList == "Yes")


<br>

The result above shows that only one entry (*Id = 6*) of our portfolio belongs to the target list. However, it is clear to the human eye that this is not enough. The following elements from the portfolio clearly should have been caught too:

In [None]:

toy.portfolio %>% filter(Id %in% c(7, 8, 9, 10)) %>% pull("Insured_Names") %>% sort()


since they are obviously the same as:

In [None]:

toy.target.list %>% filter(Id %in% c("C", "E", "F", "H")) %>% pull("Company_Names") %>% sort()



They were not captured because of small differences in spelling, which our 'hard' direct comparison cannot tolerate.

<br>

## 3. A less naive approach - direct matching + cleaning

The previous example showed that we are missing some matches because of differences that are irrelevant to the human eye but matter for a machine: differences in capitalisation, white space, and words which carry 'no useful discriminative power', such as GmbH.

This suggests we can be much better at capturing the companies of interest by doing some pre-processing, or cleaning, of our data.

We present here a very simple example of it:

In [None]:

CleanCompanyNames <- function(names) {
  
  # Stopwords must be lower case, since names are lower-cased before removal
  stopwords <- c("gmbh", "kgaa", "inc", "ag", "se", "ltd")
  
  cleaned.names <- names %>%
    tolower() %>%               
    removePunctuation() %>%     
    removeWords(stopwords) %>%
    stripWhitespace() %>%         # N.b. remove duplicated whitespaces
    trimws()                      # N.b. remove leading and trailing whitespaces
    
  return(cleaned.names)
}

examples <- c("Some   name GmBh", "Look ...*-. here Ag", " heey ltd   ")
cleaned.examples <- CleanCompanyNames(examples)
cleaned.examples


Typically, the stopwords can be derived either from a 'top-down approach' (for instance, a list of known legal suffixes, see e.g. [wikipedia](https://en.wikipedia.org/wiki/List_of_legal_entity_types_by_country)) or from a 'bottom-up approach', e.g. doing a frequency analysis on some corpus of company names.
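
As a rough illustration of the 'bottom-up approach', the following sketch counts how often each word occurs in a small corpus. The corpus is invented and the cut-off of two occurrences is arbitrary. Only base R is used.

```r
# Hypothetical corpus of company names (illustration only)
corpus <- c("Alpha Trading GmbH", "Beta Logistics GmbH", "Gamma Foods Ltd",
            "Delta Trading AG", "Epsilon Steel GmbH")

# Lower-case the names, split them on whitespace and count each token
tokens <- unlist(strsplit(tolower(corpus), "\\s+"))
token.counts <- sort(table(tokens), decreasing = TRUE)

# Tokens appearing in at least two names are candidate stopwords
# (the cut-off of 2 is an arbitrary choice for this toy corpus)
candidate.stopwords <- names(token.counts)[as.vector(token.counts) >= 2]
candidate.stopwords
```

Such a list still needs a manual review: here "gmbh" is a legal suffix, whereas "trading" may or may not be worth removing depending on the portfolio.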

Let us now try to use this new 'feature' in our attempt to match the two lists. For this, we will now 'update' our *FlagPortfolioEntries* function:


In [None]:

FlagPortfolioEntries_v2.0 <- function(portfolio, target.list) {
  
  cleaned.insured.names <- CleanCompanyNames(portfolio$Insured_Names)
  cleaned.list.names    <- CleanCompanyNames(target.list$Company_Names)
  
  is.in.list <- IsInList(cleaned.insured.names, cleaned.list.names)
  portfolio[["IsInList"]] <- ifelse(is.in.list, "Yes", "No")
  
  return(portfolio)
}

FlagPortfolioEntries_v2.0(toy.portfolio, toy.target.list) 



This is already much better, and encouraging. Let us now see in the next section how to go even further.

<br>


## 4. Probabilistic Record Linkage - towards a robust solution

*Note that record linkage is also sometimes known as __entity linkage__, and there is debate whether the two names cover exactly the same discipline. We will assume here that they do, and furthermore consider the tasks of 'linking records' and 'identifying records with a unique identifier' to be the same.*

What we have seen so far is *direct matching*: two records are linked only if they are identical or, more precisely, if some cleaned or pre-processed versions of these records are identical. We will now look at the so-called *fuzzy-matching* approach, also known as 'probabilistic record linkage', which attempts to match records that are not perfectly identical but only very close.

<br>
  
### Fuzzy-matching

Sometimes, two records refer to the same entity but contain spelling variations or typos, so that an exact match fails. Compare for instance:

In [None]:
toy.portfolio %>% filter(Id %in%  c(8, 7)) %>% pull("Insured_Names") %>% sort()


together with 


In [None]:
toy.target.list %>% filter(Id %in% c("F", "H")) %>% pull("Company_Names") %>% sort()


To the human eye, these records are clearly the same, but not necessarily to a machine.

<br>

#### A simple string similarity measure - Levenshtein

How can we approach this problem? A standard way to address it is to consider a 'similarity measure' between strings, the most prominent one being the Levenshtein edit (pseudo-)distance:


In [None]:

some.example <- c("coca-cola", "coca cola" , "cocacola UK", "koco-mola", "Coca Cola World")

levenshteinDist("coca-cola", some.example) # Edit distance: number of single-character insertions, deletions or substitutions needed


Note that in practice a *normalised distance* is needed, since an edit distance *d = 1* obviously carries different implications for a one-letter word than for a 20-letter word. For this reason, string distances are in practice usually normalised between 0 and 1 (in the case of Levenshtein, this is customarily done by dividing by the length of the longer of the two strings being compared). A distance of 0 means the two strings are identical, and the closer the distance is to 1, the more different they are.
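
The effect of the normalisation can be sketched with base R only, using `adist()` from the `utils` package for the raw edit distance. The helper name `NormalisedLevenshtein` and the example strings are our own and are not part of the demo code.

```r
# Normalised Levenshtein distance: raw edit distance divided by the length
# of the longer string. `a` is a single string, `b` may be a vector.
NormalisedLevenshtein <- function(a, b) {
  raw.distance <- as.vector(adist(a, b))   # adist() returns a matrix
  raw.distance / pmax(nchar(a), nchar(b))
}

# Same raw distance (one substitution), very different normalised distance
short.pair <- NormalisedLevenshtein("ab", "ac")
long.pair  <- NormalisedLevenshtein("international trading",
                                    "international trazing")
c(short = short.pair, long = long.pair)
```

A single substitution amounts to half of the short word but to less than 5% of the long one, which is why the raw distance alone is hard to interpret.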

Furthermore, it is often customary to use 'similarity' instead of distance, where similarity is simply defined as *1 - distance*. Hence a similarity of 1 means perfect identity, and a similarity of 0 implies total dissimilarity between the two strings. With these notions defined, we can look at the same example again:

``` {r}

some.example <- c("coca-cola", "coca cola" , "cocacola UK", "koco-mola", "Coca Cola World")
levenshteinSim("coca-cola", some.example)  # Normalised between 0 and 1


``` 

As announced above, we see here that a similarity of 1 corresponds to a perfect match, and the closer to 0, the worse the match. In order to use similarity measures to match two lists against each other, we will of course need to introduce a score threshold.


In [None]:

IsInlist_Fuzyy <- function(insured.names, list.names, threshold) {

  is.in.list <- vector(mode = "logical", length = length(insured.names))
  
  for (i in seq_along(insured.names)) { # N.b. this is a pedagogical but not very efficient implementation!
    score <- levenshteinSim(insured.names[i], list.names)
    best.score <- max(score)
    is.in.list[i] <- best.score >= threshold  # seq_along() also handles an empty input safely
  }

  return(is.in.list)
}


IsInlist_Fuzyy(
  insured.names = c("coca-cola", "coca cola" , "cocacola UK", "koco-mola", "fanta Inc", "ZZZ Corp") ,
  list.names = c("coca-cola", "fanta Inc."), 
  threshold = 1
)


Note that imposing a threshold of 1 is identical to direct matching. Note also that false positives will inevitably appear when playing with the threshold value!
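
To get a feeling for this trade-off, one can count how many names are flagged for a few candidate thresholds. The sketch below uses a base-R stand-in for the similarity (`SimilarityToList` is a hypothetical helper, assumed to behave like a normalised Levenshtein similarity, i.e. 1 minus the distance divided by the longer length), and the threshold values are arbitrary.

```r
# Base-R stand-in for a normalised Levenshtein similarity (assumption:
# similarity = 1 - edit distance / length of the longer string)
SimilarityToList <- function(name, list.names) {
  1 - as.vector(adist(name, list.names)) / pmax(nchar(name), nchar(list.names))
}

insured.names <- c("coca-cola", "coca cola", "koco-mola", "ZZZ Corp")
list.names    <- c("coca-cola")

# Best similarity of each insured name against the list
best.scores <- sapply(insured.names,
                      function(n) max(SimilarityToList(n, list.names)))

# Number of flagged names for a few candidate thresholds
thresholds <- c(1, 0.8, 0.6)
flagged.counts <- sapply(thresholds, function(t) sum(best.scores >= t))
data.frame(threshold = thresholds, flagged = flagged.counts)
```

Lowering the threshold from 1 to 0.8 recovers "coca cola", but going down to 0.6 also flags "koco-mola", which is most likely a false positive.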


In [None]:

IsInlist_Fuzyy(
  insured.names = c("coca-cola", "coca cola" , "cocacola UK", "koco-mola", "fanta Inc", "ZZZ Corp") ,
  list.names = c("coca-cola", "fanta Inc."), 
  threshold = 0.65
)


Last but not least, it should be noted that using fuzzy-matching, as we did, for cases such as the following:

In [None]:
toy.portfolio %>% filter(Id %in%  c(7)) %>% pull("Insured_Names") 
toy.target.list %>% filter(Id %in% c("H")) %>% pull("Company_Names") 

is not the only way to go. Differences caused by special characters, like the umlaut here, could typically also have been dealt with during the 'cleaning' phase.
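
As an illustration of such a cleaning step, here is a minimal sketch of a transliteration helper. `ReplaceUmlauts` is a hypothetical function that is not part of the demo code, and the mapping it uses (e.g. "ü" to "ue") is only one convention among several, so it should be adapted to the spellings actually found in the data.

```r
# Hypothetical helper (not part of the demo code): transliterate German
# umlauts and sharp s. Unicode escapes are used so that the code does not
# depend on the encoding of the source file.
ReplaceUmlauts <- function(company.names) {
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to   <- c("ae",     "oe",     "ue",     "Ae",     "Oe",     "Ue",     "ss")
  for (i in seq_along(from)) {
    company.names <- gsub(from[i], to[i], company.names, fixed = TRUE)
  }
  company.names
}

# "M\u00fcller" is "Müller" written with a Unicode escape
ReplaceUmlauts(c("M\u00fcller Maschinenbau", "Mueller Maschinenbau"))
```

After this step both spellings are identical, so they would already be caught by direct matching on the cleaned names.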

<br>

#### Wrapping it up: Levenshtein + name cleaning

Finally, we can now flag our portfolio using everything we have seen so far!



In [None]:

FlagPortfolioEntries_v3.0 <- function(portfolio, target.list, threshold = 0.9) {

  cleaned.insured.names <- CleanCompanyNames(portfolio$Insured_Names)
  cleaned.list.names    <- CleanCompanyNames(target.list$Company_Names)

  is.in.list <- IsInlist_Fuzyy(cleaned.insured.names, cleaned.list.names, threshold)
  portfolio[["IsInList"]] <- ifelse(is.in.list, "Yes", "No")

  return(portfolio)
}

FlagPortfolioEntries_v3.0(toy.portfolio, toy.target.list)


We see that we now obtain a decent result!

<br>

### An important practical aspect - what did I match against exactly?

So far our *FlagPortfolioEntries* function was only telling us which records in our portfolio were present in the target list. However, in practice, what is needed most of the time is to know *to which record of the target list* each portfolio entry was matched. This is especially true when the goal is also to do data enrichment.

The good news is that it is very straightforward to modify the previous code to achieve this. First we adapt our fuzzy matching function so that it returns, for each portfolio record, the row index of the target list entry it was matched against (or NA if there was no match):


In [None]:

FindFuzzyMatches <- function(insured.names, list.names, threshold) {

  matches.location <- vector(mode = "integer", length = length(insured.names))
  
  for (i in seq_along(insured.names)) { # N.b. this is a pedagogical but not very efficient implementation!
    score <- levenshteinSim(insured.names[i], list.names)
    best.score <- max(score)
    best.match.index <- which.max(score)  # on ties, the first best match is kept
    matches.location[i] <- if (best.score >= threshold) best.match.index else NA_integer_
  }

  return(matches.location)
}

FindFuzzyMatches(
  insured.names = c("coca-cola", "coca cola" , "cocacola UK", "koco-mola", "fanta Inc", "ZZZ Corp") ,
  list.names = c("coca-cola", "fanta Inc."), 
  threshold = 0.6
)


Notice that we can visualise this result in a 'friendlier' way as follows:

In [None]:
matches.location <- FindFuzzyMatches(
  insured.names = c("coca-cola", "coca cola" , "cocacola UK", "koco-mola", "fanta Inc", "ZZZ Corp") ,
  list.names = c("coca-cola", "fanta Inc."), 
  threshold = 0.6
)

c("coca-cola", "fanta Inc.")[matches.location]


Finally, we can use this slightly modified form to perform a left join and reach the desired result:


In [None]:

FlagPortfolioEntries_v4.0 <- function(portfolio, target.list, threshold = 0.9) {

  cleaned.insured.names <- CleanCompanyNames(portfolio$Insured_Names)
  cleaned.list.names    <- CleanCompanyNames(target.list$Company_Names)

  matches.locations <- FindFuzzyMatches(cleaned.insured.names, cleaned.list.names, threshold)
  portfolio$matches.id <- target.list$Id[matches.locations]
  
  matched.portfolio <- left_join(portfolio, target.list, by = c("matches.id" = "Id"))
  
  return(matched.portfolio)
}

FlagPortfolioEntries_v4.0(toy.portfolio, toy.target.list)


## 5. To go further

### Tackle some practical problem

How should the threshold be chosen? What should be done when there are ties in the fuzzy-matching scores?
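
Regarding ties, a small sketch with made-up scores shows the behaviour to be aware of: `which.max()` silently returns the first maximum only, whereas comparing against `max()` returns every tied candidate.

```r
# Hypothetical scores of one insured name against four target-list entries
score <- c(0.70, 0.92, 0.92, 0.40)

# which.max() returns the index of the first maximum only
first.best <- which.max(score)

# All entries sharing the best score, e.g. to send them to manual review
tied.indices <- which(score == max(score))

first.best
tied.indices
```

Whether to keep the first match, keep all of them, or break the tie with another attribute is a design decision that depends on the use case.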

### Using different string metrics

Jaccard, Jaro-Winkler, ...
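
As a taste of an alternative metric, here is a minimal base-R sketch of a Jaccard similarity computed on the sets of words of two names. The helper `JaccardSimilarity` is our own illustration, and splitting on whitespace is an assumption (character n-grams are another common choice). It does not handle empty strings.

```r
# Jaccard similarity between the word sets of two names:
# |intersection| / |union|, from 0 (no shared word) to 1 (same word set)
JaccardSimilarity <- function(a, b) {
  tokens.a <- unique(strsplit(tolower(a), "\\s+")[[1]])
  tokens.b <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(tokens.a, tokens.b)) / length(union(tokens.a, tokens.b))
}

partial.overlap <- JaccardSimilarity("Coca Cola World", "coca cola")  # 2 shared words out of 3
swapped.words   <- JaccardSimilarity("cola coca", "coca cola")        # word order is ignored
c(partial.overlap, swapped.words)
```

Unlike Levenshtein, this measure is insensitive to word order but does not tolerate typos inside a word, so the two metrics tend to catch different kinds of variation.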

### Using several dimensions

We very often have information other than the company names, for instance the headquarter country,
the industry the company is active in, etc. All these dimensions can also be used for the matching.

Note that this can be done with direct matching and with fuzzy matching, or with a combination of both!
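
A possible combination is sketched below with invented records: fuzzy matching on the name and direct matching on the country. The column names, the threshold of 0.9 and the base-R similarity are assumptions made for this illustration.

```r
# Hypothetical records (illustration only): name plus headquarter country
portfolio <- data.frame(
  name    = c("acme holding", "acme holding", "globex"),
  country = c("DE", "US", "DE"),
  stringsAsFactors = FALSE
)
target <- data.frame(name = "acme holdings", country = "DE",
                     stringsAsFactors = FALSE)

# Fuzzy criterion on the name (base-R normalised Levenshtein similarity)
name.similarity <- 1 - as.vector(adist(portfolio$name, target$name)) /
  pmax(nchar(portfolio$name), nchar(target$name))

# Direct criterion on the country
same.country <- portfolio$country == target$country

# A record is flagged only if both criteria are satisfied
portfolio$flagged <- name.similarity >= 0.9 & same.country
portfolio
```

Only the first record is flagged: the second has a similar name but a different country, and the third shares the country but not the name.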

### Machine Learning

We do not have time to cover this here, but a few keywords for further reading:

Training set, test set, tf-idf, boost using revenue, etc.







