# `gambit` - Name Disambiguation for Version Control Systems

In this tutorial, you will learn how to use `gambit` to disambiguate a given list of aliases, for example, those of commit authors in *git* repositories.

An *alias* of an author consists of two pieces of information: (i) a name and (ii) an email.
However, one author may appear under several aliases, and different authors may use almost identical aliases.
To disambiguate aliases using `gambit`, provide them as a `pandas` DataFrame with two columns: `alias_name` and `alias_email`.
For example:

In [1]:
import pandas as pd

aliases = pd.DataFrame({'alias_name': ['hello',
                                       'world',
                                       'test'],
                        'alias_email': ['hello@world',
                                        'hello@world',
                                        'test@test']})

display(aliases)

| | alias_name | alias_email |
|---|---|---|
| 0 | hello | hello@world |
| 1 | world | hello@world |
| 2 | test | test@test |
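In practice, the aliases usually come from a repository's history rather than from a hand-written list. The following is a minimal sketch of one way to build such a DataFrame from the text that `git log --format="%an%x09%ae"` prints (author name and author email, separated by a tab). The sample log text, the author names, and the variable name `git_aliases` are illustrative assumptions, and dropping duplicate aliases is a design choice made here, not something `gambit` is known to require.

```python
import pandas as pd

# Hypothetical sample of what `git log --format="%an%x09%ae"` prints:
# one commit per line, author name and author email separated by a tab.
# In a real repository, this text could be obtained with, e.g.,
# subprocess.run(["git", "log", "--format=%an%x09%ae"],
#                capture_output=True, text=True, check=True).stdout
git_log_output = (
    "Jane Doe\tjane@example.org\n"
    "J. Doe\tjane@example.org\n"
    "Jane Doe\tjane@example.org\n"
    "John Smith\tjsmith@example.com\n"
)

# Split each non-empty line into a [name, email] pair.
rows = [line.split("\t", 1) for line in git_log_output.splitlines() if line]

# Keep each distinct (name, email) pair once, using the column names
# described above: `alias_name` and `alias_email`.
git_aliases = (
    pd.DataFrame(rows, columns=["alias_name", "alias_email"])
    .drop_duplicates()
    .reset_index(drop=True)
)

print(git_aliases)
```

The resulting `git_aliases` DataFrame has the same two columns as the hand-written `aliases` example, with one row per distinct alias.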


The `aliases` DataFrame can then be passed to the function `disambiguate_aliases`.

In [2]:
import gambit

gambit.disambiguate_aliases(aliases)

author identity disambiguation: 100%|██████████| 3/3 [00:00<00:00, 1624.02it/s]


| | alias_name | alias_email | name | email | first_name | last_name | penultimate_name | email_base | author_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | hello | hello@world | hello | hello@world | hello | hello | | hello | 0 |
| 1 | world | hello@world | world | hello@world | world | world | | hello | 0 |
| 2 | test | test@test | test | test@test | test | test | | test | 1 |


The returned DataFrame contains an `author_id` column: aliases that share the same `author_id` are attributed to the same author.
In our example, two authors are detected:
1. `author_id == 0`, appearing under two aliases: (a) name `hello` and email `hello@world`, and (b) name `world` and email `hello@world`.
2. `author_id == 1`, appearing under one alias: name `test` and email `test@test`.

The returned DataFrame also contains additional columns, such as `last_name` and `email_base`, which are used internally for the disambiguation.
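Once every alias carries an `author_id`, a common next step is to collect the aliases of each author. The sketch below uses only `pandas`; so that it runs without `gambit` installed, the `result` DataFrame is rebuilt by hand from the relevant columns of the output shown above. The names `result` and `aliases_per_author` are illustrative.

```python
import pandas as pd

# Relevant columns copied from the disambiguation output shown above.
result = pd.DataFrame({
    "alias_name": ["hello", "world", "test"],
    "alias_email": ["hello@world", "hello@world", "test@test"],
    "author_id": [0, 0, 1],
})

# Collect all alias names and emails belonging to each disambiguated author.
aliases_per_author = (
    result.groupby("author_id")[["alias_name", "alias_email"]]
    .agg(list)
)

# Compare the number of aliases with the number of distinct authors.
n_aliases = len(result)
n_authors = result["author_id"].nunique()

print(aliases_per_author)
print(f"{n_aliases} aliases -> {n_authors} authors")
```

With the real return value of `gambit.disambiguate_aliases(aliases)` in place of the hand-built `result`, the same grouping applies unchanged, since it relies only on the `alias_name`, `alias_email`, and `author_id` columns.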