# `git2net` - Extracting and analysing co-editing relationships from *git* repositories

In this tutorial you will learn the basic steps required to obtain co-editing relationships from a git repository using `git2net`.

## Prerequisites

This tutorial assumes you have `git2net` installed. In addition, it is recommended to create a folder for this tutorial as additional files will be downloaded to your local directory (if not specified otherwise).

## Repository Mining

To start, you will need to select and clone a git repository that you are interested in analysing. For the purpose of this tutorial, we will analyse the repository behind `git2net` itself, aiming to finally find a solution to the well-known chicken-and-egg problem.

The following lines will clone the `git2net` repository to your current working directory. To change this location, you can edit the path to the local directory stored in `local_directory`. The folder name of the repository is the name of the repository, which we store in `repo_name`.

In [1]:
import git
import os
import shutil

repo_url = 'https://github.com/gotec/git2net.git'
local_directory = '.'
repo_name = 'git2net'

if os.path.exists(repo_name):
    shutil.rmtree(repo_name)

git.Git(local_directory).clone(repo_url)

''

Now that we have obtained a local copy of the repository, we can use `git2net` to obtain a database containing information on all commits and edits made to obtain the current state of the repository.

To do so, we use the `mine_git_repo` function. This function takes two required inputs as well as a number of optional inputs, some of which we will explore later in this tutorial. Let's start with the required inputs. First, we need to supply a path to the git repository that will be analysed. Below, this is done with the variable `repo_name`. Second, `git2net` requires a path to the *sqlite* database that will be filled during the mining process. This path is provided as `sqlite_db_file`.

Note that if no database exists at the supplied path, `git2net` will create a new one. If a database exists, `git2net` will check whether it was mined with the same settings and on the same repository, and then resume the mining process from wherever it left off.

Let's try this out. Below we import `git2net` and point it to the path to which we cloned the repository. In addition, we specify the location of the database file in which the results of the mining process will be stored and ensure that the database does not currently exist. We then run the `mine_git_repo` function with the optional argument `max_modifications = 1`. With this setting, only commits in which at most one file was modified are mined.

In [2]:
import git2net

sqlite_db_file = 'git2net.db'

# Remove database if exists
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)

max_modifications = 1
    
git2net.mine_git_repo(repo_name, sqlite_db_file, max_modifications=max_modifications)

Found no database on provided path. Starting from scratch.


Parallel (8 processes):   1%|          | 1/117 [00:00<00:21,  5.46it/s]

Commit exceeding max_modifications:  bd0ad7b12500239321a8b7c6ba547f6111c781bb


Parallel (8 processes):   8%|▊         | 9/117 [00:01<00:15,  7.14it/s]

Commit exceeding max_modifications:  eed200119f675f2abc69a5f72c3505e903b82fd2


Parallel (8 processes):  22%|██▏       | 26/117 [00:03<00:14,  6.31it/s]

Commit exceeding max_modifications:  40cc53f783aeb835fbec20f4d5e165af4e24fd32
Commit exceeding max_modifications:  9a042c9c7c6a99733d1b94bf0d440f5d22389a79


Parallel (8 processes):  27%|██▋       | 32/117 [00:04<00:09,  9.16it/s]

Commit exceeding max_modifications:  c657e752b411caf531e3fff8fc0ea8e0b756ed43
Commit exceeding max_modifications:  a2d25731c924765db4f21fa3afa7d263b7c9e79d


Parallel (8 processes):  30%|██▉       | 35/117 [00:04<00:07, 11.14it/s]

Commit exceeding max_modifications:  87c4d8f3206b400785602de03bdf87f109a65008


Parallel (8 processes):  32%|███▏      | 37/117 [00:04<00:06, 12.44it/s]

Commit exceeding max_modifications:  95eb238eeb60d6f4d1eee5acdba1d195d6e0cf70
Commit exceeding max_modifications:  16b2226a47e2747a3de9ff07f5fec0ad1abd8e0c
Commit exceeding max_modifications:  eb40bbb2e7c68c7ab73bef6d91b41d2376581907


Parallel (8 processes):  34%|███▍      | 40/117 [00:04<00:05, 13.71it/s]

Commit exceeding max_modifications:  f8e0c813a4a3049725b4c65a69bf0b487a685276
Commit exceeding max_modifications:  64701617d0d468bba66760046d5519c54c7f3371


Parallel (8 processes):  37%|███▋      | 43/117 [00:04<00:04, 16.09it/s]

Commit exceeding max_modifications:  c81b190fe260050fcd7ff86a7e947b47cf8f8085
Commit exceeding max_modifications:  e75736eaf9bd01e6f410c4dc51d9e58dcf20eacb
Commit exceeding max_modifications:  91d5d98881c6289f42f30508c4b26d3fa7baf6ca


Parallel (8 processes):  45%|████▌     | 53/117 [00:05<00:03, 17.87it/s]

Commit exceeding max_modifications:  240a13c3b87558cb85963d3cda415a63b54a8cbf


Parallel (8 processes):  50%|█████     | 59/117 [00:05<00:04, 13.00it/s]

Commit exceeding max_modifications:  a3213cd995e850c8966355755c4ac2ff61f65503


Parallel (8 processes):  58%|█████▊    | 68/117 [00:06<00:05,  8.69it/s]

Commit exceeding max_modifications:  9ef69d206d7cedb82b12d68a39445b2e936cd15f


Parallel (8 processes):  60%|█████▉    | 70/117 [00:06<00:04, 10.39it/s]

Commit exceeding max_modifications:  71b1cd496f6dc800acd7e59260d86b647cc58291
Commit exceeding max_modifications:  806fc44d2250c316c75692601362aecabc63d137


Parallel (8 processes):  62%|██████▏   | 73/117 [00:06<00:03, 12.69it/s]

Commit exceeding max_modifications:  9e72df61bf300b42c3fbc16d94153e8edbbe6dd6


Parallel (8 processes):  64%|██████▍   | 75/117 [00:07<00:03, 12.69it/s]

Commit exceeding max_modifications:  090c00c342283134a23900f85c1d232499617365
Commit exceeding max_modifications:  509e1394637f74a357ef2bf0c567dc6520a80eb6


Parallel (8 processes):  69%|██████▉   | 81/117 [00:07<00:02, 16.50it/s]

Commit exceeding max_modifications:  cf51fa8ddf40c85645cf9e6e7fb5c64b322a20ef
Commit exceeding max_modifications:  73e2b77a786cf19ec4a04e0a95ae4a0f93c45c54


Parallel (8 processes):  72%|███████▏  | 84/117 [00:07<00:01, 18.59it/s]

Commit exceeding max_modifications:  1504d68a4daf1e7529c6ac1a192794da765da9d2
Commit exceeding max_modifications:  b3b8e33bd6ae43ba9ff50f4b84cc2c6c897fe92b
Commit exceeding max_modifications:  2294efe5bf28560eb11437f54e18c4ff710e2bd1


Parallel (8 processes):  79%|███████▉  | 93/117 [00:08<00:03,  7.90it/s]

Commit exceeding max_modifications:  7e7a8bd30a12628028234308ae6c7e2f5b5ec2b2
Commit exceeding max_modifications:  0a8bc07dfd7c481b8936fddd99e7a8a8aac74dfe


Parallel (8 processes): 100%|██████████| 117/117 [00:17<00:00,  1.55it/s]


While mining, `git2net` provides information about the current progress. The first line shows that no database was found at the current path and mining will be started from scratch. This is totally expected, as we deliberately deleted any existing database before the run.

Subsequently, progress updates on the mining process are printed. The first item shows the number of processes `git2net` runs on. `git2net` is highly parallelised and, by default, detects the number of threads of your CPU and uses all of them during operation. To reduce this load, set the number of processes explicitly with the `no_of_processes` option of the `mine_git_repo` function.

The remaining output shows the number of commits processed so far and the total number of commits to be mined in this run, as well as the elapsed time and an estimate of the remaining time to finish.

If a commit is skipped, the reason and the commit hash are printed. Currently, there are three cases in which a commit can be skipped. Firstly, as seen above, a commit can exceed the maximum number of modifications set by `max_modifications`. Secondly, processing the commit can take longer than the maximum time defined by the `timeout` option. Thirdly, a commit can be skipped due to an error occurring while processing it. In this last case, please report the repository and commit hash in a new issue on github.com/gotec/git2net.
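
As a minimal sketch of capping the load, the snippet below derives a process count from the number of logical CPUs reported by Python's `os.cpu_count()`. Using half of the available CPUs is an arbitrary choice for illustration, not a recommendation made by `git2net`.

```python
import os

# os.cpu_count() may return None if the count cannot be determined.
available = os.cpu_count() or 1

# Use roughly half of the logical CPUs, but never fewer than one process.
no_of_processes = max(1, available // 2)

print(f"Using {no_of_processes} of {available} logical CPUs")

# The value can then be passed to the option described above, e.g.
# git2net.mine_git_repo(repo_name, sqlite_db_file, no_of_processes=no_of_processes)
```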
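
If you save the printed output to a string or file, the skipped commits can be collected for later inspection. The following is a small sketch that only recognises the `Commit exceeding max_modifications:` message shown in the output above. The helper name `skipped_commits` is my own, and the messages for timeouts and errors are not covered because their exact wording does not appear in this tutorial.

```python
import re

# Matches the skip message printed above, followed by a 40-character commit hash.
SKIP_PATTERN = re.compile(r"^Commit exceeding max_modifications:\s+([0-9a-f]{40})$")


def skipped_commits(log_text):
    """Return the hashes of commits skipped for exceeding max_modifications.

    Hashes are returned in order of first appearance, without duplicates.
    """
    hashes = []
    for line in log_text.splitlines():
        match = SKIP_PATTERN.match(line.strip())
        if match and match.group(1) not in hashes:
            hashes.append(match.group(1))
    return hashes


# Two lines copied from the output above, surrounded by a progress bar line.
sample_log = """Parallel (8 processes):   1%|          | 1/117 [00:00<00:21,  5.46it/s]

Commit exceeding max_modifications:  bd0ad7b12500239321a8b7c6ba547f6111c781bb

Commit exceeding max_modifications:  eed200119f675f2abc69a5f72c3505e903b82fd2
"""

for commit_hash in skipped_commits(sample_log):
    print(commit_hash)
```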

Let's resume the mining process while increasing the maximum number of modifications to 5!

In [3]:
max_modifications = 5

git2net.mine_git_repo(repo_name, sqlite_db_file, max_modifications=max_modifications)

Found a matching database on provided path. Skipping 87 (74.36%) of 117 commits. 30 commits remaining.


Parallel (8 processes):  27%|██▋       | 8/30 [00:04<00:14,  1.48it/s]

Commit exceeding max_modifications:  a3213cd995e850c8966355755c4ac2ff61f65503


Parallel (8 processes):  43%|████▎     | 13/30 [00:06<00:06,  2.44it/s]

Commit exceeding max_modifications:  9e72df61bf300b42c3fbc16d94153e8edbbe6dd6
Commit exceeding max_modifications:  090c00c342283134a23900f85c1d232499617365


Parallel (8 processes): 100%|██████████| 30/30 [00:13<00:00,  1.01it/s]


As you can see from the output above, the process was resumed from the old database, skipping the already processed commits in the repository.
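
The numbers in the resume message follow directly from comparing the commits in the repository with those already in the database. The sketch below reproduces this bookkeeping with placeholder commit identifiers. It illustrates the arithmetic only and is not `git2net`'s internal implementation; the function name `resume_plan` is hypothetical.

```python
def resume_plan(all_commits, mined_commits):
    """Return (remaining, skipped, share) for a resumed mining run.

    remaining: commits not yet in the database, in their original order
    skipped:   number of commits that do not need to be processed again
    share:     skipped commits as a fraction of all commits
    """
    mined = set(mined_commits)
    remaining = [commit for commit in all_commits if commit not in mined]
    skipped = len(all_commits) - len(remaining)
    share = skipped / len(all_commits) if all_commits else 0.0
    return remaining, skipped, share


# Placeholder identifiers standing in for the 117 commits of the repository,
# of which 87 were mined in the first run.
all_commits = [f"commit-{i}" for i in range(117)]
already_mined = all_commits[:87]

remaining, skipped, share = resume_plan(all_commits, already_mined)
print(f"Skipping {skipped} ({share:.2%}) of {len(all_commits)} commits. "
      f"{len(remaining)} commits remaining.")
```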

Great, we made some progress and a large share of the commits in the repository are already mined and in the database! But what about the other ones? The `mining_state_summary` function gives us more information on the commits missing from the database. Like `mine_git_repo`, it requires the paths to the repository and the database.

In [4]:
git2net.mining_state_summary(repo_name, sqlite_db_file)

100%|██████████| 3/3 [00:00<00:00, 83.19it/s]

114 / 117 (97.44%) of commits were successfully mined.





| | hash | is_merge | modifications | author_name | author_email | author_date |
|---|---|---|---|---|---|---|
| 0 | a3213cd995e850c8966355755c4ac2ff61f65503 | False | 18 | Christoph Gote | cgote@ethz.ch | 2019-02-15 15:39:38 |
| 1 | 9e72df61bf300b42c3fbc16d94153e8edbbe6dd6 | False | 7 | Christoph Gote | cgote@ethz.ch | 2019-04-10 17:40:01 |
| 2 | 090c00c342283134a23900f85c1d232499617365 | False | 9 | Christoph Gote | cgote@ethz.ch | 2019-04-10 18:26:54 |


The function again provides a summary of the mining state, as well as details on all missing commits. Let's assume we are very interested in commit *090c00c342283134a23900f85c1d232499617365* but want to avoid crawling the other missing commits. While this is unnecessary for small repositories such as `git2net`, it can become highly relevant for larger projects such as `linux`, where individual commits can change thousands of files and therefore require significant computational resources to analyse. This is particularly important for merge commits, as all files included in the diffs to both parent commits need to be considered. Therefore, for larger projects I generally recommend running `git2net` with `max_modifications = 1000`, subsequently increasing this number if required.
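
As an aside, the `modifications` column of the summary can also be used to pick missing commits by size rather than by hand. The sketch below copies the three rows from the summary above into a plain list. The helper `commits_within_budget` and the budget of 10 are illustrative assumptions.

```python
# (hash, modifications) pairs copied from the mining state summary above.
missing = [
    ("a3213cd995e850c8966355755c4ac2ff61f65503", 18),
    ("9e72df61bf300b42c3fbc16d94153e8edbbe6dd6", 7),
    ("090c00c342283134a23900f85c1d232499617365", 9),
]


def commits_within_budget(rows, budget):
    """Return the hashes of all commits modifying at most `budget` files."""
    return [commit_hash for commit_hash, modifications in rows
            if modifications <= budget]


selected = commits_within_budget(missing, budget=10)
print(selected)
```

A list like `selected` has the same shape as the list of hashes accepted by the `commits` option of `mine_git_repo`.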

But now back to mining specifically commit *090c00c342283134a23900f85c1d232499617365*, which can be done with the `commits` option in `mine_git_repo`. We also set the number of processes to 1, enabling serial mode, which can be very helpful for debugging as significantly more information is printed.

In [5]:
# mine_git_repo takes list of commits
commits = ['090c00c342283134a23900f85c1d232499617365']

git2net.mine_git_repo(repo_name, sqlite_db_file, commits=commits, no_of_processes=1)

Serial:   0%|          | 0/1 [00:00<?, ?it/s]
	090c00c mods:   0%|          | 0/9 [00:00<?, ?it/s]

Found a matching database on provided path. Skipping 114 (97.44%) of 117 commits. 3 commits remaining.

090c00c edits 1/1: 0it [00:00, ?it/s]
090c00c edits 1/1: 23it [00:00, 226.76it/s]
090c00c edits 1/1: 45it [00:00, 224.02it/s]
090c00c edits 1/1: 67it [00:00, 221.61it/s]
090c00c edits 1/1: 89it [00:00, 220.11it/s]
090c00c edits 1/1: 113it [00:00, 223.95it/s]
090c00c edits 1/1: 137it [00:00, 226.50it/s]

	090c00c mods:  11%|█         | 1/9 [00:01<00:08,  1.09s/it]

090c00c edits 1/1: 0it [00:00, ?it/s]
090c00c edits 1/1: 24it [00:00, 232.40it/s]

	090c00c mods:  22%|██▏       | 2/9 [00:01<00:06,  1.16it/s]

090c00c edits 1/1: 0it [00:00, ?it/s]
090c00c edits 1/1: 24it [00:00, 235.81it/s]
090c00c edits 1/1: 47it [00:00, 233.67it/s]
090c00c edits 1/1: 70it [00:00, 231.89it/s]

	090c00c mods:  33%|███▎      | 3/9 [00:02<00:04,  1.28it/s]

090c0

Congratulations, you have now mined your first git repository using `git2net`! Note, though, that not all commits have been mined at this point. This will be done at a later stage of this tutorial.

## Visualisation and Analysis

You can now use the database to query various information on different commits or edits. In addition, `git2net` also provides the functionality to generate various network projections of the data.
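
Since the database is a regular *sqlite* file, it can be inspected with Python's built-in `sqlite3` module. The sketch below makes no assumptions about the schema written by `git2net`: it lists whatever tables exist and counts their rows. It is demonstrated on a throwaway in-memory database whose table name `example_commits` is a placeholder.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_rows(connection, table):
    """Return the number of rows in `table`.

    Table names cannot be bound as SQL parameters, so the name is checked
    against the tables that actually exist before it is used in the query.
    """
    if table not in list_tables(connection):
        raise ValueError(f"unknown table: {table}")
    return connection.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Demonstration on an in-memory database with a placeholder table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE example_commits (hash TEXT)")
demo.executemany("INSERT INTO example_commits VALUES (?)", [("a",), ("b",)])

for table in list_tables(demo):
    print(table, count_rows(demo, table))
```

To inspect the mined database instead, open it with `sqlite3.connect(sqlite_db_file)` and pass that connection to the same functions.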

To start, let's try to obtain a co-editing network for our project. This is as simple as calling the `get_coediting_network` function and providing the database we just mined.

In [6]:
t, node_info, edge_info = git2net.get_coediting_network(sqlite_db_file)
t

The function returns a `pathpy` temporal network object as well as two dictionaries which can be used to look up properties of nodes and edges. At the time of writing, not all of them are used, but they serve as placeholders for future versions of `git2net`.

A `pathpy` temporal network object can be visualised by itself as shown above. In addition, we can aggregate the network by dropping the order of events, yielding a standard network object. Let's do this next.

In [7]:
import pathpy as pp
pp.Network.from_temporal_network(t) 

In both the temporal and aggregated network, a node represents an author, whereas edges point from the person changing a line of code to the person who was the original author.
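
To make the edge direction and the aggregation step concrete, here is a small sketch in plain Python that does not use `pathpy`. The events and author names are made up for illustration. Each event records who changed a line and who originally wrote it, and aggregation simply drops the timestamps and counts how often each directed edge occurs.

```python
from collections import Counter

# Hypothetical co-editing events: (timestamp, editor, original_author).
# Each edge points from the person changing a line to its original author.
events = [
    (1, "alice", "bob"),
    (2, "bob", "alice"),
    (3, "alice", "bob"),
    (4, "carol", "alice"),
]


def aggregate(events):
    """Drop the timestamps and count how often each directed edge occurs."""
    weights = Counter()
    for _timestamp, editor, original_author in events:
        weights[(editor, original_author)] += 1
    return weights


edge_weights = aggregate(events)
for (editor, original_author), weight in sorted(edge_weights.items()):
    print(f"{editor} -> {original_author}: {weight}")
```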

Next, we can ask which files the authors collaborated on. To answer this, we can plot a bipartite network containing both files and authors as nodes.

In [8]:
t, node_info, edge_info = git2net.get_bipartite_network(sqlite_db_file)
n = pp.Network.from_temporal_network(t)
n

For this network, `node_info` contains the class of each node in the network, i.e. whether it is an author or a file. These classes can be used, for example, to colour nodes as shown below.

In [9]:
colour_map = {'author': '#73D2DE', 'file': '#2E5EAA'}
node_color = {node: colour_map[node_info['class'][node]] for node in n.nodes}
pp.visualisation.plot(n, node_color=node_color)

The projection of this network that links authors editing the same file is the co-authorship network.

In [15]:
n, node_info, edge_info = git2net.get_coauthorship_network(sqlite_db_file)
n

Note that it looks similar; however, all information on the direction of interactions is lost.

If we are interested in e.g. more recently edited files, we can filter the database by providing the `time_from` and `time_to` options. Let's check the files edited since May 2019.
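
The projection from the bipartite network to the co-authorship network can be sketched in a few lines of plain Python. The author and file names below are hypothetical, and the function is a generic one-mode projection rather than the code `git2net` uses internally. Two authors are linked whenever they edited at least one common file, and the resulting pairs are unordered, which is why the direction is lost.

```python
from collections import defaultdict
from itertools import combinations


def project_to_coauthorship(author_file_edges):
    """Link every pair of authors who edited at least one common file.

    Returns a set of (author_a, author_b) tuples with author_a < author_b,
    so each undirected pair appears exactly once.
    """
    authors_by_file = defaultdict(set)
    for author, file_name in author_file_edges:
        authors_by_file[file_name].add(author)

    pairs = set()
    for authors in authors_by_file.values():
        for author_a, author_b in combinations(sorted(authors), 2):
            pairs.add((author_a, author_b))
    return pairs


# Hypothetical bipartite edges: (author, file).
bipartite_edges = [
    ("alice", "a.py"),
    ("bob", "a.py"),
    ("bob", "b.py"),
    ("carol", "b.py"),
    ("alice", "c.py"),
]

coauthorship = project_to_coauthorship(bipartite_edges)
print(sorted(coauthorship))
```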

In [11]:
from datetime import datetime
time_from = datetime(2019, 5, 1)
t, node_info, edge_info = git2net.get_bipartite_network(sqlite_db_file, time_from=time_from)
n = pp.Network.from_temporal_network(t)
colour_map = {'author': '#73D2DE', 'file': '#2E5EAA'}
node_color = {node: colour_map[node_info['class'][node]] for node in n.nodes}
pp.visualisation.plot(n, node_color=node_color)
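
The effect of a time window can be illustrated without `git2net`. The sketch below filters the three author dates from the mining state summary earlier in this tutorial. Treating both bounds as inclusive is an assumption made for this example and has not been checked against `git2net`'s behaviour.

```python
from datetime import datetime


def in_window(timestamp, time_from=None, time_to=None):
    """Return True if `timestamp` lies within the optional bounds.

    A bound of None means the window is open on that side.
    Both bounds are treated as inclusive (an assumption for this sketch).
    """
    if time_from is not None and timestamp < time_from:
        return False
    if time_to is not None and timestamp > time_to:
        return False
    return True


# Abbreviated hashes and author dates taken from the mining state summary.
commit_dates = [
    ("a3213cd", datetime(2019, 2, 15, 15, 39, 38)),
    ("9e72df6", datetime(2019, 4, 10, 17, 40, 1)),
    ("090c00c", datetime(2019, 4, 10, 18, 26, 54)),
]

since_april = [commit for commit, date in commit_dates
               if in_window(date, time_from=datetime(2019, 4, 1))]
print(since_april)
```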

`git2net` allows the extraction of editing paths on the level of individual lines. That is, we are able to track consecutive changes made to a single line over time, even if the line moves up or down in a file, or even across files. This is very powerful, as it allows us to determine editing sequences as well as find lines that require more editing than others. These could either be very difficult lines to implement or contain very important information, such as the version number in an `__init__.py` file.

To extract these paths, we can use the `get_line_editing_paths` function. As these networks tend to be very large, we limit the analysis to a very small file for this tutorial. To only look at a specific set of file paths, we can use the `file_paths` option.
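
The idea of a line editing path can be sketched with a simple predecessor mapping. In this simplified model, which is my own and not `git2net`'s data structure, every version of a line replaces at most one earlier version. The real editing graph is a directed acyclic graph in which lines can also be merged or split. The version labels below are placeholders.

```python
def editing_paths(predecessor):
    """Reconstruct editing paths from a predecessor mapping.

    predecessor maps each line version to the version it replaced,
    or to None if the line was newly added. Each returned path lists
    the versions of one line from oldest to newest.
    """
    replaced = {old for old in predecessor.values() if old is not None}
    # The newest versions are those that were never replaced themselves.
    tips = [version for version in predecessor if version not in replaced]

    paths = []
    for tip in tips:
        path = [tip]
        # Walk backwards until we reach a version without a known predecessor.
        while predecessor.get(path[-1]) is not None:
            path.append(predecessor[path[-1]])
        paths.append(list(reversed(path)))
    return paths


# Placeholder versions: one line edited twice and one line never edited.
history = {"v1": None, "v2": "v1", "v3": "v2", "w1": None}

for path in editing_paths(history):
    print(" -> ".join(path), f"({len(path) - 1} edits)")
```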

In [16]:
paths, dag, node_info, edge_info = git2net.get_line_editing_paths(sqlite_db_file,
                                                                  file_paths=['git2net/__init__.py'])
pp.visualisation.plot(dag, node_color=node_info['colors'])

Searching for aliases


 15%|█▌        | 2/13 [00:00<00:01, 10.30it/s]

Querying commits
Querying edits


100%|██████████| 13/13 [00:01<00:00, 11.47it/s]
100%|██████████| 37/37 [00:00<00:00, 132527.11it/s]

2019-09-23 12:20:27 [Severity.INFO]	Creating paths from directed acyclic graph
2019-09-23 12:20:27 [Severity.INFO]	Expanding Subpaths
2019-09-23 12:20:27 [Severity.INFO]	Calculating sub path statistics ... 
2019-09-23 12:20:27 [Severity.INFO]	finished.





As you can see in the output above, the function first looks for aliases. These are other names of the files in the repository that can occur through renaming or moving the file. To follow the edits made to specific lines, we need to be aware of these renamings to track lines across these files.
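
A simplified way to think about alias resolution is to collect all names connected to a file through a chain of renames. The sketch below does this for a hypothetical rename history. It treats renames as undirected links and ignores the order of events, so it is an illustration of the idea rather than `git2net`'s actual alias search.

```python
def collect_aliases(path, renames):
    """Return all names linked to `path` through a chain of renames.

    renames is a list of (old_path, new_path) tuples.
    """
    aliases = {path}
    changed = True
    while changed:
        changed = False
        for old_path, new_path in renames:
            # A rename extends the alias set if exactly one side is known.
            if (old_path in aliases) != (new_path in aliases):
                aliases.update((old_path, new_path))
                changed = True
    return aliases


# Hypothetical rename history; the second entry is unrelated to the query.
renames = [
    ("__init__.py", "git2net/__init__.py"),
    ("README.md", "docs/README.md"),
]

print(sorted(collect_aliases("git2net/__init__.py", renames)))
```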

Notice further that, despite only looking at a single file, the network shown above is not connected. This is because our database is not yet complete. Let's fix this now and try again.

In [13]:
git2net.mine_git_repo(repo_name, sqlite_db_file)

Found a matching database on provided path. Skipping 115 (98.29%) of 117 commits. 2 commits remaining.


Parallel (8 processes): 100%|██████████| 2/2 [00:26<00:00, 14.33s/it]


In [14]:
paths, dag, node_info, edge_info = git2net.get_line_editing_paths(sqlite_db_file,
                                                                  file_paths=['git2net/__init__.py'])
pp.visualisation.plot(dag, node_color=node_info['colors'])

Searching for aliases


  8%|▊         | 1/13 [00:00<00:01,  9.53it/s]

Querying commits
Querying edits


100%|██████████| 13/13 [00:01<00:00,  8.27it/s]
100%|██████████| 37/37 [00:00<00:00, 121336.39it/s]

2019-09-23 12:20:16 [Severity.INFO]	Creating paths from directed acyclic graph
2019-09-23 12:20:16 [Severity.INFO]	Expanding Subpaths
2019-09-23 12:20:16 [Severity.INFO]	Calculating sub path statistics ... 
2019-09-23 12:20:16 [Severity.INFO]	finished.





As mentioned before, these networks get very large very quickly. Therefore, it is often more useful to work with the `pathpy` path object that is also returned by the function. It contains all paths and subpaths contained in the network shown above. More information regarding this object can be found in the documentation on [pathpy.net](http://www.pathpy.net/).

This concludes this tutorial, which I hope you found useful. Enjoy using `git2net` and best of luck for your research! If you find any bugs with the code please let me know on [github.com](https://github.com/gotec/git2net).

`git2net` has been developed as an open source project. This means your ideas and inputs are highly welcome. Feel free to share the project and contribute yourself. You can immediately get started on the repository you just downloaded!