# Record Linkage for NBA Player Databases


TL;DR: This notebook walks through an example of using simple machine learning to perform record linkage for two NBA player databases (www.stats.nba.com and www.basketball-reference.com). I will show why this is (maybe) necessary, explain the general steps in the record linkage pipeline, and then execute this pipeline on the aforementioned data. If you have any questions/comments, or simply would like to see this done on more datasets, please feel free to reach out via email (hw.chase.17@gmail.com) or twitter (@aabsblog) - more than happy to chat.

A lot of the code has been factored out into helper files and then imported here, mainly to make this notebook more readable. The factored-out code may not be commented/explained as well as the code in here, but if you are looking at it and have questions, feel free to reach out and I can try to add better documentation.

I am also hoping to collect some feedback on this post and on my ideas for next steps. I would LOVE it if you spent 3 minutes filling out the forms after reading:

[Give Feedback](https://forms.gle/R4mgQZS3oshG9vRGA)

[Suggest Next Steps](https://forms.gle/PBBwh3519RuXpXsUA)

# Introduction

#### What is record linkage?

Record linkage is the process of matching a record from one data source to a record in another data source. This is not a new idea - [here is a paper](https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=2572&context=cstech) from 2003 that describes what is going on. At a high level, this paper describes three steps:

1. Searching Methods (also called *Blocking*): find somewhat similar records across two datasets
1. Comparison Functions (also called *Modeling*): compare the pairs of records you found in Step 1 in a more detailed way
1. Decision Models (also called *Postprocessing*): make decisions based on Step 2
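
The three steps above can be sketched on toy data. This is only an illustration, not the pipeline used later in this notebook: the records, the `block`/`compare`/`decide` function names, the token-overlap blocking rule, the `difflib` similarity score, and the 0.75 threshold are all assumptions chosen to keep the example small.

```python
from difflib import SequenceMatcher

# Toy records: (id, name). Ids and the choice of names are made up for illustration.
source_a = [("a1", "lebron james"), ("a2", "michael jordan"), ("a3", "jim ard")]
source_b = [("b1", "lebron james"), ("b2", "mike jordan"), ("b3", "james harden")]

def block(a_records, b_records):
    """Step 1 (blocking): keep only pairs that share at least one name token."""
    pairs = []
    for a_id, a_name in a_records:
        for b_id, b_name in b_records:
            if set(a_name.split()) & set(b_name.split()):
                pairs.append(((a_id, a_name), (b_id, b_name)))
    return pairs

def compare(a_name, b_name):
    """Step 2 (comparison): score a candidate pair between 0 and 1."""
    return SequenceMatcher(None, a_name, b_name).ratio()

def decide(scored_pairs, threshold=0.75):
    """Step 3 (decision): accept pairs whose score clears the threshold."""
    return [(a_id, b_id) for a_id, b_id, score in scored_pairs if score >= threshold]

candidates = block(source_a, source_b)
scored = [(a[0], b[0], compare(a[1], b[1])) for a, b in candidates]
links = decide(scored)
print(links)
```

Blocking keeps the number of detailed comparisons small, the comparison step can be as simple or as elaborate as needed, and the decision step is where the tolerance for false positives is set.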

#### Why should I care about record linkage?

There is so much data on the internet these days. Even just considering basketball stats, I would say there are three main sites people go to for data: www.stats.nba.com, www.basketball-reference.com, and www.espn.com. Each of these sites contains its own datasets, with similar but perhaps slightly different information. While it is possible to get away with just using one of these sites, one can easily imagine that it might be desirable to pull in information from all three. Since all three sites have their own unique identifiers (or uuids) for each player, you would need to link these uuids to each other.

#### Why is it hard to link players from one datasource to another? Can't I just join on their name?

The reason it is not trivial is that a player's information may be slightly different across the two sites. Names can be misspelled, nicknames could be used, years active could be different - all very small, but incredibly annoying things that make a simple join impossible - at least for all players.

#### But surely it would be sufficient for most players?

This is true. As I'll show in this notebook, a trivial join can link roughly 85% of players.

#### If a trivial join works so well, is this linking exercise really necessary?

To be honest - no. If my goal was to simply link these two datasets I would use a trivial join, knock out roughly 85% of the players, and then link the rest by hand - that would probably be faster than writing all this code.

#### So then why did you do this?
Two reasons. First, I think linking is an incredibly powerful tool that is useful in a ton of situations. In most situations you won't have a trivial rule to join on like you do here, so this process is more necessary. I wanted to give an example of this process, and I really like working with sports data, so I thought this would be a fun dataset. Second, having done this, I can now reuse it for other tasks. Most simply, I could use this same pipeline, maybe even the same model, to link ESPN data to basketball-reference data. With more work, I could probably reuse a lot of this pipeline to link football players, soccer players, etc.

# Getting Data

The first step in linking data is to get said data. For this I have scraped data from www.stats.nba.com and www.basketball-reference.com.

I've included the code for that as well as the results, so if you want to replicate the rest of this pipeline you can just use the saved data.

### Basketball reference

In [1]:
# Imports
import lxml
import requests
from bs4 import BeautifulSoup
from pysportsref.parsing import get_table_from_soup, get_table_soup
from pysportsref.scraping import get_soup
import string
import pandas as pd

In [2]:
# Scraping
player_dfs = []
for letter in string.ascii_lowercase:
    url = f"https://www.basketball-reference.com/players/{letter}/"
    soup = get_soup(url)

    table_str = get_table_soup(soup, 'players')

    bbref_df = get_table_from_soup(table_str, get_url=True)
    player_dfs.append(bbref_df)

bbref_df = pd.concat(player_dfs)

In [3]:
# Saving
bbref_df.to_csv('bbref_df.csv', index=False)

### NBA.com

In [4]:
# Imports
from nba_api.stats.static import players
from nba_api.stats.endpoints.commonallplayers import CommonAllPlayers

In [5]:
# Scraping
all_players = CommonAllPlayers().get_dict()
data = all_players['resultSets'][0]['rowSet']
columns = all_players['resultSets'][0]['headers']
nba_df = pd.DataFrame(data, columns=columns)

In [6]:
# Saving
nba_df.to_csv('nba_df.csv', index=False)

# Explore the data

Let's first explore the data a bit. Several things I want to look at:
1. How many players in each dataset?
1. What peculiarities exist in each dataset?
1. How many can be matched by just joining on name and start year?

In [1]:
import pandas as pd

In [2]:
bbref_df = pd.read_csv('bbref_df.csv')
nba_df = pd.read_csv('nba_df.csv')

Q: How many players in each dataset? 

A: We can see from the below that there are 4865 players in the basketball reference dataset, and only 4589 in the nba.com dataset. This already suggests some differences in the datasets.

In [3]:
bbref_df.shape

(4865, 11)

In [4]:
nba_df.shape

(4589, 14)

Q: What peculiarities exist in each dataset?

A: The main difference that exists is that the year accounting system used by nba.com is off by 1 from the accounting system used by basketball-reference. We can quickly adjust it so it is comparable.

In [5]:
nba_df['FROM_YEAR'] +=1
nba_df['TO_YEAR'] += 1

Q: How many can be matched by just joining on name and start year?

A: As shown below, we can create a `trivial_uuid` that is just name and start year and then do a simple join on that. This matches 3906 players. To be honest, this is pretty good already - over 85% of the players in the nba.com dataset can be matched this way. The remaining number is small enough that it may not be terrible to do it manually. However, let's pretend for educational purposes that we want to do this in an automated fashion.

In [6]:
nba_df['trivial_uuid'] = nba_df['DISPLAY_FIRST_LAST'] + ' - ' + nba_df['FROM_YEAR'].astype(str)
bbref_df['trivial_uuid'] = bbref_df['player'] + ' - ' + bbref_df['year_min'].astype(int).astype(str)

In [7]:
merged = nba_df.merge(bbref_df, on='trivial_uuid', how='inner')

In [8]:
merged.shape

(3906, 26)
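
The inner join only shows what matched. To see what did not match on each side, one option is an outer join with pandas' `indicator=True`, which adds a `_merge` column. The sketch below uses two tiny made-up frames (`left` and `right` are stand-ins, not the real tables) so it runs on its own.

```python
import pandas as pd

# Toy stand-ins for the two tables; the rows are illustrative only.
left = pd.DataFrame(
    {"trivial_uuid": ["Jim Ard - 1975", "LeBron James - 2004", "Nate Darling - 2021"]}
)
right = pd.DataFrame({"trivial_uuid": ["Jim Ard - 1971", "LeBron James - 2004"]})

# how="outer" keeps unmatched rows from both sides; indicator=True adds a
# `_merge` column whose values are "both", "left_only", or "right_only".
audit = left.merge(right, on="trivial_uuid", how="outer", indicator=True)

matched = int((audit["_merge"] == "both").sum())
left_only = int((audit["_merge"] == "left_only").sum())
right_only = int((audit["_merge"] == "right_only").sum())

# Share of the left table that found a partner through the exact join.
match_rate = matched / len(left)
print(matched, left_only, right_only, round(match_rate, 3))
```

With the real tables the same ratio is 3906 / 4589, or about 85% of the nba.com players.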

# Preprocessing

Let's do some preprocessing to get the data clean and ready for record linkage.

In [9]:
from helpers import clean_name, preprocessing

In [10]:
bbref_df['clean_name'] = clean_name(bbref_df['player'])

In [11]:
nba_df['clean_name'] = clean_name(nba_df['DISPLAY_FIRST_LAST'])

In [12]:
bbref_df = preprocessing(bbref_df)
nba_df = preprocessing(nba_df)

In [13]:
bbref_df['uuid'] = bbref_df['player_url']
nba_df['uuid'] = nba_df['PERSON_ID']

In [14]:
bbref_df = bbref_df.rename(columns={'year_min': 'start_year', 'year_max': 'end_year'})
nba_df = nba_df.rename(columns={'FROM_YEAR': 'start_year', 'TO_YEAR': 'end_year'})
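
The helper code is not shown here, so the following is only a guess at the kind of normalization `clean_name` and `preprocessing` might perform, based on the columns that show up later (`clean_name`, `clean_first_name`, `clean_last_name`, `suffix`). The function names `clean_name_sketch` and `split_name_sketch` and the suffix list are hypothetical, and the sketch works on a single string rather than a pandas Series.

```python
import re
import unicodedata

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}  # assumed list, not taken from helpers.py

def clean_name_sketch(raw):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove everything except lowercase letters, digits, and spaces.
    no_punct = re.sub(r"[^a-z0-9 ]", "", no_accents.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def split_name_sketch(clean):
    """Split a cleaned name into (first, last, suffix)."""
    tokens = clean.split()
    # Only treat the last token as a suffix if a first and last name remain.
    suffix = tokens.pop() if len(tokens) > 2 and tokens[-1] in SUFFIXES else ""
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    return first, last, suffix

print(clean_name_sketch("R.B. Lynam"))
print(clean_name_sketch("Da'Sean Butler"))
print(split_name_sketch("glen rice jr"))
```

The first two calls reproduce the cleaned values visible in the tables further down (`rb lynam`, `dasean butler`); whether the real helper gets there the same way is an assumption.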

# Blocking

Now let's do blocking. We do this by creating pairs for all players that have either the same first name OR the same last name.
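
Without blocking we would have to score 4865 × 4589, roughly 22.3 million, pairs. The helpers `get_buckets` and `get_pairs` are not shown, so here is a minimal sketch of how a first-name-or-last-name rule could be implemented. The `_sketch` function names, the tuple-based record format, and the third toy row are assumptions for illustration.

```python
from collections import defaultdict

def get_buckets_sketch(records):
    """Map each first name and each last name to the set of uuids that carry it.

    `records` is an iterable of (uuid, first_name, last_name) tuples.
    """
    buckets = defaultdict(set)
    for uuid, first, last in records:
        buckets[("first", first)].add(uuid)
        buckets[("last", last)].add(uuid)
    return buckets

def get_pairs_sketch(buckets_1, buckets_2):
    """Pair every uuid in a bucket with every uuid in the same bucket on the other side."""
    pairs = set()  # a set removes duplicates when both names match
    for key in buckets_1.keys() & buckets_2.keys():
        for uuid_1 in buckets_1[key]:
            for uuid_2 in buckets_2[key]:
                pairs.add((uuid_1, uuid_2))
    return sorted(pairs)

# Toy rows; the uuid 1 / "jim smith" entry is made up.
bbref = [("/players/a/ardji01.html", "jim", "ard"),
         ("/players/j/jordami01.html", "michael", "jordan")]
nba = [(76055, "jim", "ard"), (893, "michael", "jordan"), (1, "jim", "smith")]

toy_pairs = get_pairs_sketch(get_buckets_sketch(bbref), get_buckets_sketch(nba))
print(toy_pairs)
```

A player with a unique first name and a unique last name on the other side ends up in no pair at all, which is why it is worth checking for unblocked players afterwards.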

In [15]:
from helpers import get_buckets, get_pairs

In [16]:
bbref_buckets = get_buckets(bbref_df)
nba_buckets = get_buckets(nba_df)

In [17]:
pairs = get_pairs(bbref_buckets, nba_buckets)

In [18]:
pair_df = pd.DataFrame(pairs, columns=['uuid_1', 'uuid_2'])

Of course, the above blocking logic is not necessarily perfect. Let's take a look at whether any players have not been blocked at all. We see that we are missing a few from each database. Of course, this is not necessarily bad - some players may only be in one database (in fact, we know there must be some players like this, since there are a differing number of players in each database).

We can see that the unblocked players in the nba.com database are mostly ones who started this year, while the unblocked players in the basketball reference database are historical. This is somewhat in line with expectations - nba.com is generally updated more quickly, while basketball reference includes ABA and other historical players from different leagues, and nba.com only contains those who played in the NBA proper.

In [19]:
missing_bbref = set(bbref_df['uuid']).difference(pair_df['uuid_1'].unique())
len(missing_bbref)

14

In [20]:
missing_nba = set(nba_df['uuid']).difference(pair_df['uuid_2'].unique())
len(missing_nba)

6

In [21]:
nba_df[nba_df['uuid'].isin(missing_nba)]

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,start_year,end_year,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,...,TEAM_CODE,GAMES_PLAYED_FLAG,OTHERLEAGUE_EXPERIENCE_CH,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid
346,202392,"Blakely, Marqus",Marqus Blakely,0,2011,2011,marqus_blakely,0,,,...,,Y,1,Marqus Blakely - 2011,marqus blakely,marqus,blakely,,marqus blakely,202392
998,1629603,"Diakite, Mamadi",Mamadi Diakite,1,2021,2021,mamadi_diakite,1610612749,Milwaukee,Bucks,...,bucks,Y,0,Mamadi Diakite - 2021,mamadi diakite,mamadi,diakite,,mamadi diakite,1629603
1587,1630204,"Hagans, Ashton",Ashton Hagans,1,2021,2021,ashton_hagans,1610612750,Minnesota,Timberwolves,...,timberwolves,Y,0,Ashton Hagans - 2021,ashton hagans,ashton,hagans,,ashton hagans,1630204
3071,1630168,"Okongwu, Onyeka",Onyeka Okongwu,1,2021,2021,onyeka_okongwi,1610612737,Atlanta,Hawks,...,hawks,N,0,Onyeka Okongwu - 2021,onyeka okongwu,onyeka,okongwu,,onyeka okongwu,1630168
3510,202375,"Rolle, Magnum",Magnum Rolle,0,2011,2011,magnum_rolle,0,,,...,,Y,1,Magnum Rolle - 2011,magnum rolle,magnum,rolle,,magnum rolle,202375
3747,1629686,"Sirvydis, Deividas",Deividas Sirvydis,1,2021,2021,deividas_sirvydis,1610612765,Detroit,Pistons,...,pistons,Y,0,Deividas Sirvydis - 2021,deividas sirvydis,deividas,sirvydis,,deividas sirvydis,1629686


In [22]:
bbref_df[bbref_df['uuid'].isin(missing_bbref)]

Unnamed: 0,player,start_year,end_year,pos,height,weight,birth_date,colleges,player_url,birth_date_url,colleges_url,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid
436,Tommie Bowens,1968.0,1970.0,F-C,6-8,220.0,"July 7, 1940",Grambling State University,/players/b/bowento01.html,/friv/birthdays.cgi?month=7&day=7,/friv/colleges.fcgi?college=grambling,Tommie Bowens - 1968,tommie bowens,tommie,bowens,,tommie bowens,/players/b/bowento01.html
439,Orbie Bowling,1968.0,1968.0,C,6-10,215.0,"March 21, 1939",Tennessee,/players/b/bowlior01.html,/friv/birthdays.cgi?month=3&day=21,/friv/colleges.fcgi?college=tennessee,Orbie Bowling - 1968,orbie bowling,orbie,bowling,,orbie bowling,/players/b/bowlior01.html
685,Darel Carrier,1968.0,1973.0,G,6-3,185.0,"October 26, 1940",Western Kentucky,/players/c/carrida01.html,/friv/birthdays.cgi?month=10&day=26,/friv/colleges.fcgi?college=wkentucky,Darel Carrier - 1968,darel carrier,darel,carrier,,darel carrier,/players/c/carrida01.html
1946,Carroll Hooser,1968.0,1968.0,F,6-7,230.0,"March 5, 1944",SMU,/players/h/hooseca01.html,/friv/birthdays.cgi?month=3&day=5,/friv/colleges.fcgi?college=smethodist,Carroll Hooser - 1968,carroll hooser,carroll,hooser,,carroll hooser,/players/h/hooseca01.html
2543,Leary Lentz,1968.0,1969.0,F,6-6,200.0,"February 23, 1945",Houston,/players/l/lentzle01.html,/friv/birthdays.cgi?month=2&day=23,/friv/colleges.fcgi?college=houston,Leary Lentz - 1968,leary lentz,leary,lentz,,leary lentz,/players/l/lentzle01.html
2590,Riney Lochmann,1968.0,1970.0,F,6-6,215.0,"May 26, 1944",Kansas,/players/l/lochmri01.html,/friv/birthdays.cgi?month=5&day=26,/friv/colleges.fcgi?college=kansas,Riney Lochmann - 1968,riney lochmann,riney,lochmann,,riney lochmann,/players/l/lochmri01.html
2640,R.B. Lynam,1968.0,1968.0,G,6-1,190.0,,Oklahoma Baptist University,/players/l/lynamrb01.html,,/friv/colleges.fcgi?college=okbaptist,R.B. Lynam - 1968,rb lynam,rb,lynam,,rb lynam,/players/l/lynamrb01.html
2915,Dewitt Menyard,1968.0,1968.0,C,6-10,210.0,"May 24, 1944",Utah,/players/m/menyade01.html,/friv/birthdays.cgi?month=5&day=24,/friv/colleges.fcgi?college=utah,Dewitt Menyard - 1968,dewitt menyard,dewitt,menyard,,dewitt menyard,/players/m/menyade01.html
3488,Marlbert Pradd,1968.0,1969.0,G,6-3,170.0,"November 17, 1944",Dillard University,/players/p/praddma01.html,/friv/birthdays.cgi?month=11&day=17,/friv/colleges.fcgi?college=dillard,Marlbert Pradd - 1968,marlbert pradd,marlbert,pradd,,marlbert pradd,/players/p/praddma01.html
3811,Glynn Saulters,1969.0,1969.0,G,6-2,175.0,"February 10, 1945",Louisiana-Monroe,/players/s/saultgl01.html,/friv/birthdays.cgi?month=2&day=10,/friv/colleges.fcgi?college=ulamo,Glynn Saulters - 1969,glynn saulters,glynn,saulters,,glynn saulters,/players/s/saultgl01.html


# Feature Engineering

Now we do a bunch of feature engineering to create features to feed into our model that will predict whether two players are a match. I've put all of this code in the helper file to abstract it out a bit, but this is the main part of the code that requires any data science knowledge/intuition. Coming up with all these features was a combination of trying stuff out that I intuitively thought would be helpful, as well as doing error inspection on an earlier version of the model and then adding more features to hopefully "fix" those errors. Still, even this part isn't terribly complicated.

In [23]:
from helpers import create_features

In [24]:
columns_to_use = ['clean_name', 'basic_name', 'clean_first_name', 'clean_last_name', 'start_year', 'end_year', 'uuid', 'suffix']

In [25]:
feature_df = pair_df.merge(bbref_df[columns_to_use].add_suffix('_1'), how='left', on='uuid_1')
feature_df = feature_df.merge(nba_df[columns_to_use].add_suffix('_2'), how='left', on='uuid_2')

In [26]:
feature_df, features = create_features(feature_df)
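
Since `create_features` lives in the helper file, here is a rough sketch of the kind of pairwise features it could compute from the columns selected above. The feature names, the use of `difflib.SequenceMatcher` as the similarity measure, and the dict-based record format are all assumptions; the real helper may compute something quite different.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()

def pair_features_sketch(rec_1, rec_2):
    """Build a feature dict for one candidate pair of player records (dicts)."""
    return {
        "full_name_sim": similarity(rec_1["clean_name"], rec_2["clean_name"]),
        "first_name_sim": similarity(rec_1["clean_first_name"], rec_2["clean_first_name"]),
        "last_name_sim": similarity(rec_1["clean_last_name"], rec_2["clean_last_name"]),
        "start_year_diff": abs(rec_1["start_year"] - rec_2["start_year"]),
        "end_year_diff": abs(rec_1["end_year"] - rec_2["end_year"]),
        "same_suffix": float(rec_1["suffix"] == rec_2["suffix"]),
    }

# Jim Ard as he appears in each dataset later in this notebook.
bbref_rec = {"clean_name": "jim ard", "clean_first_name": "jim", "clean_last_name": "ard",
             "start_year": 1971, "end_year": 1978, "suffix": ""}
nba_rec = {"clean_name": "jim ard", "clean_first_name": "jim", "clean_last_name": "ard",
           "start_year": 1975, "end_year": 1978, "suffix": ""}

feats = pair_features_sketch(bbref_rec, nba_rec)
print(feats)
```

Here the names agree exactly but the start years differ by four, which is exactly the kind of pair an exact join on name and start year would miss.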

# Getting labels
Since we will be using supervised modeling techniques to train our model, we need some labels. How do we get those labels? We do this in two parts. First, we do some manual labeling. This mainly consists of labeling POSITIVE examples, i.e. two records that are a link. Then, since we know those are a link, and we are assuming no duplicate links are allowed, we can infer that every other player is NOT a link with those two.

Practically, we start by providing only TWO labeled examples - LeBron James and Michael Jordan. We then take all the other players they were blocked with, and those are our negative examples. That sampling happens inside the helper function `get_label_df`.

In [27]:
initial_pos_labels = [
    ('/players/j/jamesle01.html', 2544),
    ('/players/j/jordami01.html', 893),
]

In [28]:
from helpers import get_label_df

In [29]:
label_df = get_label_df(initial_pos_labels, pair_df)
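
The inference of negative examples can be sketched as follows. This is an assumption about the logic, not the actual `get_label_df`: the function name `get_label_rows_sketch` is hypothetical, it returns plain tuples rather than a DataFrame, and it keeps every inferred negative, whereas the real helper may sample them. The candidate pairs below are made up except for the LeBron James link.

```python
def get_label_rows_sketch(positive_pairs, candidate_pairs):
    """Return (uuid_1, uuid_2, target) rows for candidate pairs that touch a labeled entity."""
    positives = set(positive_pairs)
    labeled_1 = {uuid_1 for uuid_1, _ in positives}
    labeled_2 = {uuid_2 for _, uuid_2 in positives}
    rows = []
    for uuid_1, uuid_2 in candidate_pairs:
        if (uuid_1, uuid_2) in positives:
            rows.append((uuid_1, uuid_2, 1))
        elif uuid_1 in labeled_1 or uuid_2 in labeled_2:
            # One side already has a known link, so this pair cannot also be a link.
            rows.append((uuid_1, uuid_2, 0))
        # Pairs that touch no labeled entity stay unlabeled and are skipped.
    return rows

toy_positives = [("/players/j/jamesle01.html", 2544)]
toy_candidates = [
    ("/players/j/jamesle01.html", 2544),  # the known link
    ("/players/j/jamesle01.html", 101),   # hypothetical other candidate for the same bbref player
    ("/players/x/example01.html", 2544),  # hypothetical other candidate for the same nba.com player
    ("/players/x/example01.html", 101),   # touches no labeled entity
]
toy_labels = get_label_rows_sketch(toy_positives, toy_candidates)
print(toy_labels)
```

Each positive label therefore yields many negatives for free, which is why two hand-labeled players are enough to fit a first model.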

# Modeling

Now for the "machine learning" part! For this I am going to keep it extremely simple and just use Logistic Regression. Although you can use fancier modeling techniques, this is a relatively simple dataset/problem, so there is no real need to. Don't let people tell you that you need to use neural networks for no good reason! Keep it simple :)

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [31]:
def get_mod(label_df, feature_df, features):
    mod = make_pipeline(StandardScaler(), LogisticRegression())
    df = pd.merge(label_df, feature_df, how='left', on=['uuid_1', 'uuid_2'])
    mod.fit(df[features], df['target'])
    return mod

In [32]:
mod = get_mod(label_df, feature_df, features)

# Let's do it for real!

Now let's do it for real. The above model will not be amazing, since it is only trained on two positive examples. If we want to get a better model, the easiest way to do that at the moment is to just add more training examples.

We can do this in two ways. First, we can use our basic way of linking players (join on name and year) and use some of the links produced via that method as training examples. Of course, if we only do that then our model will just learn that rule, and that's kinda pointless. So I've also augmented our positive labels with some matches that that rule does NOT pick up. How did I come up with these? I trained a model based on the initial labels, then found players that the model did not find a good match for, labeled those manually, and added those to the training dataset. Hopefully the labels we are adding are representative of other failures and therefore help us learn some rules to link those!

In [33]:
import numpy as np

In [34]:
# Manually labeled points
initial_pos_labels = [
    ('/players/j/jamesle01.html', 2544),
    ('/players/j/jordami01.html', 893),
]
initial_pos_labels.extend([
    ('/players/l/lockro01.html', 77398),
    ('/players/w/weathcl01.html', 221),
    ('/players/e/ellisbo01.html', 76658),
    ('/players/c/coxch01.html', 76463),
    ('/players/p/paxsoji02.html', 77819),
    ('/players/p/paxsoji01.html', 77818),
    ('/players/p/paxsojo01.html', 77820),
    ('/players/j/jackslu02.html', 2739),
    ('/players/j/jackslu01.html', 77103),
    ('/players/s/silasja01.html', 78150),
    ('/players/e/elmorle01.html', 600010),
    ('/players/a/armstcu01.html', 76060),
    ('/players/j/johnsra01.html', 77139),
    ('/players/v/vaughch01.html', 78409),
    ('/players/t/taylofa01.html', 78302),
    ('/players/w/willidu01.html', 78545),
    ('/players/w/wallare01.html', 78443),
    ('/players/s/schafbo01.html', 78072),
    ('/players/d/davisch02.html', 76518),
    ('/players/p/portemi01.html', 1629008),
    ('/players/g/gillege01.html', 76813),
    ])

In [35]:
# Adding some labels from our trivial join
nba_df['clean_uuid'] = nba_df['clean_name'] + ' - ' + nba_df['start_year'].astype(str)
bbref_df['clean_uuid'] = bbref_df['clean_name'] + ' - ' + bbref_df['start_year'].astype(int).astype(str)
merged = nba_df.merge(bbref_df, on='clean_uuid', how='inner')
trivial_merged = merged.dropna(subset=['PERSON_ID', 'player_url'])
# These players can actually NOT be trivially merged
bad_uuids = {'tony mitchell - 2014', 'bill bradley - 1968', 'mark jones - 1984'}
trivial_merged = trivial_merged[~trivial_merged['clean_uuid'].isin(bad_uuids)]
trivial_matches = [(y, int(x)) for x,y in trivial_merged[['PERSON_ID', 'player_url']].values]

In [36]:
# Let's not add all these trivial matches to our training dataset, or else they will overwhelm the other examples. 
np.random.shuffle(trivial_matches)
initial_pos_labels.extend(trivial_matches[:400])

In [37]:
label_df = get_label_df(initial_pos_labels, pair_df)

In [38]:
mod = get_mod(label_df, feature_df, features)

# Predictions

Now that we have a model, we can make predictions for all our pairs! For ease of postprocessing (the next step), we will transform these to the logit space.

In [39]:
from scipy.special import logit
preds = mod.predict_proba(feature_df[features])
feature_df['probas'] = logit(preds[:, 1])

# Postprocessing

Now we have predictions for each pair. But we don't want our final result to be predictions for each pair. Rather, we want it to be pairs of entities that we know are a match. So we need to perform some postprocessing to arrive there.

Specifically, we can split the labels into three groups:

1. pairs of entities that we think are a match
1. entities from each database that we don't think have a match at all
1. entities (or pairs of entities) that we aren't sure about

What makes us put entities in one group versus another?

1. In order to be sure that two entities are a match, they should be matched with high confidence and neither one should have any other matches with close to their confidence.
1. In order to be sure that an entity has no match, they should have no matches with high or medium confidence
1. Entities that we aren't sure about consist of entities that have a medium confidence match at best, or multiple high confidence matches.

The reason I transformed predictions to the logit space is that I found it simpler/more intuitive to do this postprocessing in the logit space for this dataset. However, postprocessing can often be very dataset/project specific, and depends on how much tolerance you have for false positives/false negatives, and therefore should be tuned carefully.

In [40]:
def get_score_df(df):
    max_2 = df.groupby('uuid_2')['probas'].max().to_frame('max_2').reset_index()
    max_1 = df.groupby('uuid_1')['probas'].max().to_frame('max_1').reset_index()
    dif_2 = df.groupby('uuid_2').apply(_dif).to_frame('dif_2').reset_index()
    dif_1 = df.groupby('uuid_1').apply(_dif).to_frame('dif_1').reset_index()
    res_df = df.merge(max_2).merge(max_1).merge(dif_1).merge(dif_2)
    res_df['top_pick'] = (res_df['max_2'] == res_df['probas']) & (res_df['max_1'] == res_df['probas'])
    return res_df[['max_2', 'max_1', 'dif_1', 'dif_2', 'top_pick', 'uuid_1', 'uuid_2']]

In [41]:
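def _dif(df):
    # Gap between the best score in the group and the runner-up.
    # With a single candidate there is no runner-up, so the gap is infinite.
    if df.shape[0] == 1:
        return np.inf
    return (df['probas'].max() - df['probas']).sort_values().iloc[1]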

In [42]:
res_df = get_score_df(feature_df)

In [43]:
high_confidence_matches = res_df[res_df['top_pick'] & (res_df['dif_1'] > 3) & (res_df['dif_2'] > 3) & (res_df['max_2'] > 1)]

In [44]:
no_matches_1 = res_df.groupby('uuid_1')['max_1'].max()[lambda x: x < -1].index
no_matches_2 = res_df.groupby('uuid_2')['max_2'].max()[lambda x: x < -1].index

In [45]:
medium_uuids1 = set(bbref_df['uuid']).difference(no_matches_1).difference(high_confidence_matches['uuid_1'])
medium_uuids2 = set(nba_df['uuid']).difference(no_matches_2).difference(high_confidence_matches['uuid_2'])
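
The cutoffs used above (a score above 1 for a confident match, below -1 for no match, and a gap of more than 3 to the runner-up) are in logit units. A small worked conversion, using only the standard definition of the logit and its inverse, shows what they mean as probabilities and odds.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Probability corresponding to a log-odds value x."""
    return 1.0 / (1.0 + math.exp(-x))

high_cutoff = inv_logit(1)    # about 0.73
low_cutoff = inv_logit(-1)    # about 0.27
odds_ratio = math.exp(3)      # about 20

print(round(high_cutoff, 3), round(low_cutoff, 3), round(odds_ratio, 1))
```

So a confident match needs a predicted probability above roughly 0.73, a non-match means the best candidate is below roughly 0.27, and a gap of 3 means the top candidate's odds are about 20 times those of the runner-up. A fixed gap in logit space is a fixed odds ratio, which is harder to express with raw probabilities near 0 or 1.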

# Examine results 

What follows is some simple examination of the results and sanity checking. 

### Found matches

Examining found matches is perhaps the easiest. We can first perform a simple sanity check and see how many of our "trivial matches" (joining based on unique name and start year) were linked. We would hope this to be 100%.

In [46]:
high_confidence_match_pairs = list(zip(high_confidence_matches['uuid_1'], high_confidence_matches['uuid_2']))

In [47]:
set(trivial_matches).difference(high_confidence_match_pairs)

set()

Yay! We get all the trivial matches. Verifying whether the other found matches are correct is similarly easy. Let's take a look at some of the matches that are NOT trivial matches and see what we are learning.

In [48]:
non_trivial_results = high_confidence_matches[~high_confidence_matches['uuid_1'].isin([pair[0] for pair in trivial_matches])]

In [49]:
non_trivial_results.head()

Unnamed: 0,max_2,max_1,dif_1,dif_2,top_pick,uuid_1,uuid_2
993,9.192834,9.192834,21.097861,19.671124,True,/players/r/robisda01.html,77998
1042,9.171952,9.171952,22.577403,20.471072,True,/players/t/twardda01.html,78387
2791,3.210996,3.210996,10.800137,13.746216,True,/players/j/johnsch01.html,77133
4408,2.244283,2.244283,9.654801,10.372107,True,/players/j/johnsch02.html,77159
6414,3.454891,3.454891,10.070342,9.523362,True,/players/j/johnsra01.html,77139


In [50]:
bbref_df[bbref_df['uuid'] == '/players/a/ardji01.html']

Unnamed: 0,player,start_year,end_year,pos,height,weight,birth_date,colleges,player_url,birth_date_url,colleges_url,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid,clean_uuid
129,Jim Ard,1971.0,1978.0,F-C,6-8,215.0,"September 19, 1948",Cincinnati,/players/a/ardji01.html,/friv/birthdays.cgi?month=9&day=19,/friv/colleges.fcgi?college=cincy,Jim Ard - 1971,jim ard,jim,ard,,jim ard,/players/a/ardji01.html,jim ard - 1971


In [51]:
nba_df[nba_df['uuid'] == 76055]

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,start_year,end_year,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,...,GAMES_PLAYED_FLAG,OTHERLEAGUE_EXPERIENCE_CH,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid,clean_uuid
123,76055,"Ard, Jim",Jim Ard,0,1975,1978,HISTADD_jim_ard,0,,,...,Y,0,Jim Ard - 1975,jim ard,jim,ard,,jim ard,76055,jim ard - 1975


We can see that in this example, the start years of the players do not match up. This is just one of many small but annoying differences in the datasets that we are trying to "learn" through machine learning. I encourage you to look through the other links, and if you see any that are off, let me know (or add them back into the training data and run again!).

### No matches found

Verifying the results of no matches found is a bit trickier, since in order to confirm that an entity really does have no match we would basically have to look at all other entities in the dataset and confirm that they are not a match. Looking at that many pairs would take far too long, so we don't want to do that...

First let's check out the players we are missing from each dataset.

In [52]:
bbref_df[bbref_df['uuid'].isin(no_matches_1)]

Unnamed: 0,player,start_year,end_year,pos,height,weight,birth_date,colleges,player_url,birth_date_url,colleges_url,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid,clean_uuid
18,George Adams,1973.0,1975.0,F-G,6-5,210.0,"May 15, 1949",Gardner-Webb University,/players/a/adamsge01.html,/friv/birthdays.cgi?month=5&day=15,/friv/colleges.fcgi?college=gardwebb,George Adams - 1973,george adams,george,adams,,george adams,/players/a/adamsge01.html,george adams - 1973
34,Matthew Aitch,1968.0,1968.0,F,6-7,230.0,"September 21, 1944",Michigan State,/players/a/aitchma01.html,/friv/birthdays.cgi?month=9&day=21,/friv/colleges.fcgi?college=michiganst,Matthew Aitch - 1968,matthew aitch,matthew,aitch,,matthew aitch,/players/a/aitchma01.html,matthew aitch - 1968
57,Bill Allen,1968.0,1968.0,C-F,6-8,205.0,,New Mexico State,/players/a/allenbi01.html,,/friv/colleges.fcgi?college=nmstate,Bill Allen - 1968,bill allen,bill,allen,,bill allen,/players/a/allenbi01.html,bill allen - 1968
69,Willie Allen,1972.0,1972.0,F,6-6,230.0,"February 8, 1949",Miami (FL),/players/a/allenwi01.html,/friv/birthdays.cgi?month=2&day=8,/friv/colleges.fcgi?college=miamifl,Willie Allen - 1972,willie allen,willie,allen,,willie allen,/players/a/allenwi01.html,willie allen - 1972
85,Andrew Anderson,1968.0,1970.0,G,6-2,184.0,"July 6, 1945",Canisius,/players/a/anderan01.html,/friv/birthdays.cgi?month=7&day=6,/friv/colleges.fcgi?college=canisius,Andrew Anderson - 1968,andrew anderson,andrew,anderson,,andrew anderson,/players/a/anderan01.html,andrew anderson - 1968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4801,Willie Worsley,1969.0,1969.0,G,5-9,175.0,"November 13, 1945",Texas-El Paso,/players/w/worslwi01.html,/friv/birthdays.cgi?month=11&day=13,/friv/colleges.fcgi?college=utep,Willie Worsley - 1969,willie worsley,willie,worsley,,willie worsley,/players/w/worslwi01.html,willie worsley - 1969
4813,Howie Wright,1971.0,1972.0,G,6-3,185.0,"February 22, 1947",Austin Peay State University,/players/w/wrighho01.html,/friv/birthdays.cgi?month=2&day=22,/friv/colleges.fcgi?college=austinpeay,Howie Wright - 1971,howie wright,howie,wright,,howie wright,/players/w/wrighho01.html,howie wright - 1971
4817,Leroy Wright,1968.0,1969.0,F,6-9,215.0,"May 6, 1938",University of the Pacific,/players/w/wrighle01.html,/friv/birthdays.cgi?month=5&day=6,/friv/colleges.fcgi?college=pacific,Leroy Wright - 1968,leroy wright,leroy,wright,,leroy wright,/players/w/wrighle01.html,leroy wright - 1968
4818,Lonnie Wright,1968.0,1972.0,G,6-2,205.0,"January 23, 1945",Colorado State,/players/w/wrighlo01.html,/friv/birthdays.cgi?month=1&day=23,/friv/colleges.fcgi?college=coloradost,Lonnie Wright - 1968,lonnie wright,lonnie,wright,,lonnie wright,/players/w/wrighlo01.html,lonnie wright - 1968


In [53]:
nba_df[nba_df['uuid'].isin(no_matches_2)]

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,start_year,end_year,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,...,GAMES_PLAYED_FLAG,OTHERLEAGUE_EXPERIENCE_CH,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid,clean_uuid
363,1629129,"Bluiett, Trevon",Trevon Bluiett,0,2019,2019,trevon_bluiett,0,,,...,Y,1,Trevon Bluiett - 2019,trevon bluiett,trevon,bluiett,,trevon bluiett,1629129,trevon bluiett - 2019
586,202221,"Butch, Brian",Brian Butch,0,2010,2010,brian_butch,0,,,...,Y,1,Brian Butch - 2010,brian butch,brian,butch,,brian butch,202221,brian butch - 2010
590,202364,"Butler, Da'Sean",Da'Sean Butler,0,2011,2011,da'sean_butler,0,,,...,N,1,Da'Sean Butler - 2011,dasean butler,dasean,butler,,dasean butler,202364,dasean butler - 2011
906,1630268,"Darling, Nate",Nate Darling,1,2021,2021,nate_darling,1610612766,Charlotte,Hornets,...,Y,0,Nate Darling - 2021,nate darling,nate,darling,,nate darling,1630268,nate darling - 2021
1293,1630235,"Forrest, Trent",Trent Forrest,1,2021,2021,trent_forrest,1610612762,Utah,Jazz,...,Y,0,Trent Forrest - 2021,trent forrest,trent,forrest,,trent forrest,1630235,trent forrest - 2021
1348,202070,"Gaffney, Tony",Tony Gaffney,0,2010,2010,tony_gaffney,0,,,...,Y,1,Tony Gaffney - 2010,tony gaffney,tony,gaffney,,tony gaffney,202070,tony gaffney - 2010
1673,1630223,"Harris, Jalen",Jalen Harris,1,2021,2021,jalen_harris,1610612761,Toronto,Raptors,...,Y,0,Jalen Harris - 2021,jalen harris,jalen,harris,,jalen harris,1630223,jalen harris - 2021
1692,202238,"Hasbrouck, Kenny",Kenny Hasbrouck,0,2010,2010,kenny_hasbrouck,0,,,...,Y,1,Kenny Hasbrouck - 2010,kenny hasbrouck,kenny,hasbrouck,,kenny hasbrouck,202238,kenny hasbrouck - 2010
1794,201195,"Hill, Herbert",Herbert Hill,0,2008,2008,herbert_hill,0,,,...,Y,1,Herbert Hill - 2008,herbert hill,herbert,hill,,herbert hill,201195,herbert hill - 2008
1944,1628402,"Jackson, Frank",Frank Jackson,1,2018,2021,frank_jackson,1610612765,Detroit,Pistons,...,Y,1,Frank Jackson - 2018,frank jackson,frank,jackson,,frank jackson,1628402,frank jackson - 2018


What do we see? We see that for basketball reference players, we are missing links for older players. This is not a huge surprise - as stated before, basketball reference has a lot of old ABA players that nba.com is missing.

For nba.com, we see that we are missing a lot of players that only played a single year, and a lot of those are new this year. Again, not terribly surprising given that as stated before I have noticed nba.com to be slightly better at updating their player lists for recent players.

Still, to be honest I haven't gone over each and every one of these. There is definitely the possibility for errors. If you find some, you know what to do - let me know, or just add it to the training data and create a feature or two!
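
One cheap way to spot-check an unmatched player without reviewing every pair is to list only the few closest names from the other dataset and eyeball those. The sketch below uses `difflib.get_close_matches` from the standard library; the function name `nearest_names_sketch`, the `k` and `cutoff` values, and the candidate name list are assumptions for illustration, not results from this notebook.

```python
from difflib import get_close_matches

def nearest_names_sketch(name, other_names, k=3, cutoff=0.6):
    """Return up to k names from other_names that look most like `name`, best first."""
    return get_close_matches(name, other_names, n=k, cutoff=cutoff)

# `unmatched` is a cleaned name from the no-match list; the candidates are illustrative.
unmatched = "herbert hill"
other_side = ["herb hill", "grant hill", "tyrone hill", "jim ard"]

suggestions = nearest_names_sketch(unmatched, other_side)
print(suggestions)
```

If nothing plausible shows up in the short list, the no-match verdict is more believable; if something does, that pair is a good candidate for a new manual label.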

### Unsure entities

The following are entities that we are not sure about. For these, in a production setting I would like these to be hand labeled. Rather than labeling all of them, if you sense that there are common themes you could label some, retrain the model, and hope that this group shrinks.

In [54]:
bbref_df[bbref_df['uuid'].isin(medium_uuids1)]

Unnamed: 0,player,start_year,end_year,pos,height,weight,birth_date,colleges,player_url,birth_date_url,colleges_url,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid,clean_uuid
143,Bob Arnzen,1970.0,1974.0,F,6-5,205.0,"November 3, 1947",Notre Dame,/players/a/arnzebo01.html,/friv/birthdays.cgi?month=11&day=3,/friv/colleges.fcgi?college=notredame,Bob Arnzen - 1970,bob arnzen,bob,arnzen,,bob arnzen,/players/a/arnzebo01.html,bob arnzen - 1970
287,Arthur Becker,1968.0,1973.0,F,6-7,205.0,"January 12, 1942",Arizona State,/players/b/beckear01.html,/friv/birthdays.cgi?month=1&day=12,/friv/colleges.fcgi?college=arizonast,Arthur Becker - 1968,arthur becker,arthur,becker,,arthur becker,/players/b/beckear01.html,arthur becker - 1968
436,Tommie Bowens,1968.0,1970.0,F-C,6-8,220.0,"July 7, 1940",Grambling State University,/players/b/bowento01.html,/friv/birthdays.cgi?month=7&day=7,/friv/colleges.fcgi?college=grambling,Tommie Bowens - 1968,tommie bowens,tommie,bowens,,tommie bowens,/players/b/bowento01.html,tommie bowens - 1968
439,Orbie Bowling,1968.0,1968.0,C,6-10,215.0,"March 21, 1939",Tennessee,/players/b/bowlior01.html,/friv/birthdays.cgi?month=3&day=21,/friv/colleges.fcgi?college=tennessee,Orbie Bowling - 1968,orbie bowling,orbie,bowling,,orbie bowling,/players/b/bowlior01.html,orbie bowling - 1968
457,Bill Bradley,1968.0,1968.0,G,5-11,165.0,"June 16, 1941",,/players/b/bradlbi02.html,/friv/birthdays.cgi?month=6&day=16,,Bill Bradley - 1968,bill bradley,bill,bradley,,bill bradley,/players/b/bradlbi02.html,bill bradley - 1968
685,Darel Carrier,1968.0,1973.0,G,6-3,185.0,"October 26, 1940",Western Kentucky,/players/c/carrida01.html,/friv/birthdays.cgi?month=10&day=26,/friv/colleges.fcgi?college=wkentucky,Darel Carrier - 1968,darel carrier,darel,carrier,,darel carrier,/players/c/carrida01.html,darel carrier - 1968
993,Mickey Davis,1972.0,1977.0,F-G,6-7,195.0,"June 16, 1950",Duquesne,/players/d/davismi02.html,/friv/birthdays.cgi?month=6&day=16,/friv/colleges.fcgi?college=duquesne,Mickey Davis - 1972,mickey davis,mickey,davis,,mickey davis,/players/d/davismi02.html,mickey davis - 1972
1679,Robert Hahn,1950.0,1950.0,C,6-10,240.0,"August 25, 1925",NC State,/players/h/hahnro01.html,/friv/birthdays.cgi?month=8&day=25,/friv/colleges.fcgi?college=ncstate,Robert Hahn - 1950,robert hahn,robert,hahn,,robert hahn,/players/h/hahnro01.html,robert hahn - 1950
1765,Billy Harris,1975.0,1975.0,G,6-2,185.0,"November 12, 1951",Northern Illinois,/players/h/harribi01.html,/friv/birthdays.cgi?month=11&day=12,/friv/colleges.fcgi?college=nillinois,Billy Harris - 1975,billy harris,billy,harris,,billy harris,/players/h/harribi01.html,billy harris - 1975
1946,Carroll Hooser,1968.0,1968.0,F,6-7,230.0,"March 5, 1944",SMU,/players/h/hooseca01.html,/friv/birthdays.cgi?month=3&day=5,/friv/colleges.fcgi?college=smethodist,Carroll Hooser - 1968,carroll hooser,carroll,hooser,,carroll hooser,/players/h/hooseca01.html,carroll hooser - 1968


In [55]:
nba_df[nba_df['uuid'].isin(medium_uuids2)]

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,start_year,end_year,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,...,GAMES_PLAYED_FLAG,OTHERLEAGUE_EXPERIENCE_CH,trivial_uuid,clean_name,clean_first_name,clean_last_name,suffix,basic_name,uuid,clean_uuid
137,76064,"Arnzen, Bob",Bob Arnzen,0,1971,1971,HISTADD_bob_arnzen,0,,,...,Y,0,Bob Arnzen - 1971,bob arnzen,bob,arnzen,,bob arnzen,76064,bob arnzen - 1971
312,1630189,"Bey, Tyler",Tyler Bey,1,2021,2021,tyler_bey,1610612742,Dallas,Mavericks,...,Y,0,Tyler Bey - 2021,tyler bey,tyler,bey,,tyler bey,1630189,tyler bey - 2021
346,202392,"Blakely, Marqus",Marqus Blakely,0,2011,2011,marqus_blakely,0,,,...,Y,1,Marqus Blakely - 2011,marqus blakely,marqus,blakely,,marqus blakely,202392,marqus blakely - 2011
929,76522,"Davis, Edward",Edward Davis,0,1973,1977,HISTADD_mickey_davis,0,,,...,Y,0,Edward Davis - 1973,edward davis,edward,davis,,edward davis,76522,edward davis - 1973
998,1629603,"Diakite, Mamadi",Mamadi Diakite,1,2021,2021,mamadi_diakite,1610612749,Milwaukee,Bucks,...,Y,0,Mamadi Diakite - 2021,mamadi diakite,mamadi,diakite,,mamadi diakite,1629603,mamadi diakite - 2021
1587,1630204,"Hagans, Ashton",Ashton Hagans,1,2021,2021,ashton_hagans,1610612750,Minnesota,Timberwolves,...,Y,0,Ashton Hagans - 2021,ashton hagans,ashton,hagans,,ashton hagans,1630204,ashton hagans - 2021
1588,76914,"Hahn, Bob",Bob Hahn,0,1950,1950,HISTADD_bob_hahn,0,,,...,Y,0,Bob Hahn - 1950,bob hahn,bob,hahn,,bob hahn,76914,bob hahn - 1950
2125,1630222,"Jones, Mason",Mason Jones,1,2021,2021,mason_jones,1610612745,Houston,Rockets,...,Y,0,Mason Jones - 2021,mason jones,mason,jones,,mason jones,1630222,mason jones - 2021
2142,77203,"Jones, Willie",Willie Jones,0,1983,1984,HISTADD_hutch_jones,0,,,...,Y,0,Willie Jones - 1983,willie jones,willie,jones,,willie jones,77203,willie jones - 1983
2436,1024,"Llamas, Horacio",Horacio Llamas,0,1997,2003,horacio_llamas,0,,,...,Y,0,Horacio Llamas - 1997,horacio llamas,horacio,llamas,,horacio llamas,1024,horacio llamas - 1997


# Saving Results

Honestly, a whole separate article could be dedicated to how to properly store and update links between entities over time. The main gist is that the links we've generated are *probabilistic*, and I would actually *expect* that a good number of them are wrong. Therefore, when storing them, we need to make sure to store them in a way that lets us update and overwrite them if necessary.
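
To make "update and overwrite" a bit more concrete, here is a minimal sketch of one possible storage scheme, using Python's built-in `sqlite3`. This is not what the notebook does. The `links` table, its columns, the `save_link` helper, and the rule that a hand-verified link beats a model link are all assumptions for illustration, and the ids are placeholders.

```python
import sqlite3

# In-memory database for the sketch; a real setup would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE links (
        bbref_id TEXT PRIMARY KEY,  -- one current link per basketball-reference player
        nba_id   TEXT NOT NULL,
        score    REAL NOT NULL,     -- model probability, or 1.0 for a hand label
        source   TEXT NOT NULL      -- 'model' or 'manual'
    )
    """
)


def save_link(conn, bbref_id, nba_id, score, source):
    """Insert or overwrite a link; a manual link is never replaced by a model link."""
    row = conn.execute(
        "SELECT source FROM links WHERE bbref_id = ?", (bbref_id,)
    ).fetchone()
    if row is not None and row[0] == "manual" and source == "model":
        return False  # keep the hand-verified link
    conn.execute(
        "INSERT OR REPLACE INTO links (bbref_id, nba_id, score, source) "
        "VALUES (?, ?, ?, ?)",
        (bbref_id, nba_id, score, source),
    )
    conn.commit()
    return True


# Placeholder ids: a model guess, then a hand label, then a later model run.
save_link(conn, "bbref_A", "nba_1", 0.48, "model")   # stored
save_link(conn, "bbref_A", "nba_1", 1.0, "manual")   # overwrites the model row
save_link(conn, "bbref_A", "nba_2", 0.55, "model")   # ignored, manual link wins
```

Keeping the score and the source next to each link is what makes it possible to decide later which links are safe to overwrite.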

Still, that is a separate concept. For this notebook, I will just save the successfully linked entities in CSV form. NOTE: there are still a bunch of entities not in this saved CSV (the ones that have no match, and the ones we weren't sure about).

In [56]:
# Keep only the two id columns, give them source-specific names, and write them to disk.
(
    high_confidence_matches[['uuid_1', 'uuid_2']]
    .rename(columns={'uuid_1': 'bbref_id', 'uuid_2': 'nba_id'})
    .to_csv('high_confidence_matches.csv', index=False)
)

# Conclusion

I am also hoping to collect some feedback on this post and on my ideas for next steps. I would LOVE it if you spent 3 minutes filling out the forms after reading:

[Give Feedback](https://forms.gle/R4mgQZS3oshG9vRGA)

[Suggest Next Steps](https://forms.gle/PBBwh3519RuXpXsUA)