In [None]:
!git clone https://github.com/gotec/git2net-tutorials
import os
os.chdir('git2net-tutorials')
!pip install -r requirements.txt
os.chdir('..')
!git clone https://github.com/gotec/git2net git2net4analysis

In [None]:
import git2net
import os
import sqlite3
import pandas as pd

# Repository Mining

The first tutorial showed how to clone a repository and prepare it for analysis with `git2net`.
In some examples, we even started mining the resulting repositories, because `git2net` provides `mine_github()`, a function that clones and mines a repository in a single step.

This tutorial will focus on the options you have when mining a repository with `git2net`.
During this process, we will focus on the function `mine_git_repo()`, `git2net`'s primary function for repository mining.
However, you can also pass all options discussed here to `mine_github()`, which calls `mine_git_repo()` internally.

## 1 - Introduction to `mine_git_repo()`

The function `mine_git_repo()` takes two required and several optional inputs, which we will further explore later in this tutorial.
Let's start with the required inputs.
We need to supply a path to the git repository that we want to analyse. Below, this is done with the variable `git_repo_dir`.
In addition, you need to provide a path to the *SQLite* database that `git2net` will write the results to during the mining process.
This path is provided as `sqlite_db_file`.

Note that if no database exists on the supplied path, `git2net` will create a new database.
If a database exists, `git2net` will check if the database was mined with the same setting and on the same repository and subsequently resume the mining process from wherever it was left off.
For this example, we remove any existing database before calling `mine_git_repo()` to ensure the mining starts from scratch.

Let's try this out.
Below we point `git2net` to the path where we cloned the repository.
In addition, we specify the location of the database file in which the results of the mining process will be stored and ensure the database does not currently exist.
We then run the `mine_git_repo()` function.

In [None]:
# We assume a clone of git2net's repository exists in the folder below following the first tutorial.
git_repo_dir = 'git2net4analysis'

# Here, we specify the database in which we will store the results of the mining process.
sqlite_db_file = 'git2net4analysis.db'

# Remove database if exists
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)
    
git2net.mine_git_repo(git_repo_dir, sqlite_db_file)

Congratulations, you have now finished mining the repository using `git2net`!

## 2 - Skipping large commits while mining

In the previous section, we mined all commits existing in the repository.
When you start working with projects bigger than `git2net`, you might find that some commits are extremely large and might take very long to mine.
From our experience, these are often commits in which many files are changed at once.
Whether these commits are relevant to you depends on your analysis.
However, even if these commits need to be mined eventually, you might be interested in starting with smaller commits and obtaining a preliminary database.
You can then develop your downstream processing pipeline based on this database while the remaining commits are being mined.
 
### Number of modified files
 
One way to exclude commits with many changed files from the mining is to call `mine_git_repo()` with the optional argument `max_modifications`. In the example below, we use `max_modifications=3` to only mine commits in which at most three files were modified.

In [None]:
# We assume a clone of git2net's repository exists in the folder below following the first tutorial.
git_repo_dir = 'git2net4analysis'

# Here, we specify the database in which we will store the results of the mining process.
sqlite_db_file = 'git2net4analysis.db'

# Remove database if exists
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)

max_modifications = 3
    
git2net.mine_git_repo(git_repo_dir, sqlite_db_file, max_modifications=max_modifications)

While mining, `git2net` provides information about its progress.
The first line shows that it found no database at the current path and therefore started mining from scratch.
This behaviour is expected, as we deliberately deleted any existing database before the run.

Subsequently, progress updates on the mining process are printed. The first information denotes the number of processes `git2net` spawns and runs on.
The other output shows the number of commits mined and the total number of commits to be mined in this run, as well as the elapsed time and an estimate of the remaining time to finish.

### Timeout

If a commit is skipped, `git2net` prints the reason for skipping it and the commit's hash.
Currently, there are three cases in which `git2net` can skip a commit:
First, as seen above, a commit can exceed the maximum number of modifications set by `max_modifications`.
Second, you can skip commits if processing them takes too long using the option `timeout`.
By default, the maximum processing time is set to `timeout=0`, which is equivalent to never stopping the processing of a commit due to a timeout.
However, e.g., by setting `timeout=5`, `git2net` will stop processing a commit if it takes longer than 5 seconds.
Third, `git2net` may skip commits if an error occurs during their processing.
In these cases, please report the repository and commit hash in a new issue on [github.com/gotec/git2net](https://github.com/gotec/git2net).

Let's resume the mining process while increasing the maximum number of modifications to 5!

In [None]:
max_modifications = 5

git2net.mine_git_repo(git_repo_dir, sqlite_db_file, max_modifications=max_modifications)

As you can see from the output above, the process was resumed from the old database, skipping the already processed commits in the repository.

Great, we made progress, and many of the repository's commits are already mined and in the database!
But what about the other ones?
We can obtain additional information on the commits missing from the database from the function `git2net.mining_state_summary()`.
Similar to `mine_git_repo()`, this function also requires the paths to the repository and the database.

In [None]:
git2net.mining_state_summary(git_repo_dir, sqlite_db_file)

Calling the function yields a summary of the mining state and details on all missing commits.
Let's assume we are very interested in commit *090c00c342283134a23900f85c1d232499617365* but want to avoid crawling the other missing commits.
While this is unnecessary for small repositories such as `git2net`, this might become highly relevant for larger projects such as `Linux`, where individual commits can make changes to thousands of files that require significant computational resources (days or weeks) to analyse.
Merge commits in particular can be very computationally expensive, as all files included in the diffs to both parent commits need to be considered.
Therefore, for larger projects, we generally recommend running `git2net` with `max_modifications = 1000`, subsequently increasing this number if required.
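The incremental strategy described in this section, mining small commits first and raising the limit later, can be sketched as a small loop. The helper `mine_in_stages` and its parameter names are hypothetical and not part of `git2net`; the mining function is passed in as an argument, so the sketch only assumes that it accepts `max_modifications` as a keyword, as `mine_git_repo()` does in the cells above.

```python
def mine_in_stages(mine, git_repo_dir, sqlite_db_file, limits):
    """Call `mine` once per limit, from the smallest to the largest.

    `mine` is any callable that can be called like git2net.mine_git_repo
    (hypothetical helper, not part of git2net). The database is never
    deleted between calls, so each call can resume from the previous one.
    """
    for max_modifications in sorted(limits):
        mine(git_repo_dir, sqlite_db_file, max_modifications=max_modifications)
```

In this notebook, a call could look like `mine_in_stages(git2net.mine_git_repo, git_repo_dir, sqlite_db_file, limits=[3, 5, 1000])`. After each stage, the database already contains all commits up to that limit and can be used for downstream work.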

## 3 - Specifying which commits to mine

But now back to mining specifically commit *090c00c342283134a23900f85c1d232499617365*, which can be done with the `commits` option in `mine_git_repo()`.

In [None]:
# mine_git_repo takes list of commits
commits = ['090c00c342283134a23900f85c1d232499617365']

git2net.mine_git_repo(git_repo_dir, sqlite_db_file, commits=commits)

## 4 - Excluding files and binary files

### Excluding files
With the `exclude` option, you can exclude a list of files while mining.
This option is useful, for example, when you have large files that are irrelevant to your analysis because they are maintained in another repository.

### Binary files
In addition to the files provided using `exclude`, `git2net` automatically excludes binary files while processing using the list of file extensions provided in [`sindresorhus/binary-extensions`](https://github.com/sindresorhus/binary-extensions).
Binary files, such as image files, are usually very long files that are not humanly readable and, hence, need to be interpreted by a program (e.g., an image viewing application).
These files are considerably different from the code files `git2net` is developed to analyse and should therefore be considered separately.

### Files with binary sections - text entropy
In addition to purely binary files, there are also files that contain both code and binary sections.
A prime example of this type of file is a Jupyter notebook, like the one you are currently working with.
`git2net` does not automatically exclude these files.
However, with the `text_entropy` measure, we provide a way to distinguish binary lines, which generally have a higher entropy, from code lines, allowing you to exclude them after the mining.
The text entropy is recorded for all lines in the resulting database.
For more details on the text entropy measure, we refer to our [original publication](https://arxiv.org/abs/1903.10180).
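To give an intuition for why entropy separates binary from code lines, here is a minimal sketch of a character-level Shannon entropy. This is a generic textbook definition under our own assumptions; the exact measure that `git2net` stores (for example, its normalisation) is defined in the publication and may differ. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(line):
    """Return the Shannon entropy of a string in bits per character."""
    if not line:
        return 0.0
    counts = Counter(line)
    total = len(line)
    # Sum -p * log2(p) over the relative frequency p of each character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A line consisting of one repeated character has an entropy of zero, whereas a line in which many different characters appear equally often, as is typical for encoded binary data, has a high entropy.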

## 5 - A closer look at code changes

### Text extraction
To save storage space, `git2net` does not store the content of modified lines by default.
However, you can enable this by providing the option `extract_text=True` to `mine_git_repo()`.
Doing so will add the information of the line content before and after all edits to the database.

### Code complexity
Using the library [`terryyin/lizard`](https://github.com/terryyin/lizard), `git2net` can further compute the cyclomatic complexity and the number of lines of code (NLOC) while mining.
As this is computationally expensive, this option is disabled by default.
To enable it, set `extract_complexity=True`.

### Code blocks
When we modify code, we usually do not think about the code as consisting of individual lines but rather as functional blocks of code.
Using the option `use_blocks=True`, `git2net` also provides the option to process changes on the level of such blocks rather than individual lines (see our [original publication](https://arxiv.org/abs/1903.10180) for more details).
The main drawback of this approach is that the multiple lines changed within a block might have different authors.
Currently, for block changes, `git2net` only records the data of the first block line, even if the code of multiple authors is modified.
Therefore, the processing in blocks is disabled by default.

## 6 - Parallelisation

### Specifying the number of parallel operations

`git2net` is highly parallelised and will automatically detect the number of threads of your CPU, fully utilising all of them during operation.
However, there might be cases, e.g. when other operations are running simultaneously, in which you may want to reduce this load.
You can do so by explicitly setting the number of processes with the `no_of_processes` option of the `mine_git_repo()` function.
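One simple way to choose a reduced value is to leave a fixed number of CPU threads free for other work. The helper `processes_to_use` below is a hypothetical convenience function, not part of `git2net`; it only relies on `os.cpu_count()` from the standard library.

```python
import os


def processes_to_use(reserved=1):
    """Return a process count that leaves `reserved` CPU threads free.

    The result is never smaller than 1 (hypothetical helper).
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available - reserved)
```

The result could then be passed on as `no_of_processes=processes_to_use()` when calling `mine_git_repo()`.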

### Disabling parallelisation

While parallelisation is excellent for performance, it can make debugging challenging as error messages are collected inside the individual parallel workers.
To mitigate this, `git2net` features a serial mode that disables all parallelisation during the mining.
You can enable the serial mode by setting `no_of_processes=1`.
Let's try this out now.
As you can see below, `git2net` will indicate that it is working in serial mode in the progress bar during mining.

In [None]:
# Remove database if exists
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)

git2net.mine_git_repo(git_repo_dir, sqlite_db_file, no_of_processes=1)

### Parallelisation options

Parallelisation in `git2net` is based on the individual processing of independent commits.
By default, each parallel worker is provided with a single commit to mine.
Once mining the commit is complete, the results are reported back, and the worker is assigned a new commit.
In some cases, it might be advantageous to provide the worker with multiple commits at once.
You can do so by setting the option `chunksize` in `mine_git_repo()`.
Providing multiple commits at once has the benefit of requiring fewer I/O operations.
However, one should note that commits are only saved to the database&mdash;and hence skipped in a subsequent run&mdash;once the workers return them.
This behaviour might be suboptimal if, e.g., you plan to pause your mining operations intermittently.
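The trade-off can be illustrated with a short sketch of how a list of commits is grouped for a given `chunksize`. This is an illustration under our own assumptions and not the code `git2net` uses internally; the function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(commits, chunksize):
    """Group commits into consecutive chunks of at most `chunksize` items."""
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    return [commits[i:i + chunksize] for i in range(0, len(commits), chunksize)]
```

With five commits and `chunksize=2`, the workers would receive three chunks. If mining is interrupted while a worker is still busy with a chunk, none of the commits in that chunk have been saved yet, so larger chunks mean more repeated work after a pause.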

## 7 - Merges

A unique feature of `git2net` is its ability to mine merge commits.
These commits are challenging to process as the content of two different versions of a repository needs to be compared.
The changes leading to the two versions have already been mined with the commits generating the different versions.
However, the merge commit's author can also make changes during the merge that need to be distinguished from them.
Further, when performing a merge&mdash;particularly for conflicts&mdash;the merge's author needs to decide which of the conflicting versions to keep.
These challenges can make the mining of merges computationally expensive.

By default, `git2net` will extract merges.
If your analysis does not require the information from merges, you can disable their mining by setting `extract_merges=False` in `mine_git_repo()`.

Usually, merged versions differ only in a few lines and are otherwise identical.
However, as mentioned above, only one version is kept during a merge, and git records the second version as deleted.
This behaviour is problematic for two reasons:
First, merge commits result in large amounts of lines that git considers as deleted even though the identical line from the second version continues to exist.
In other words, lines that persist are recorded as deleted.
Second, for these identical lines, it is (mostly) arbitrary which version persists and which is deleted.
Overall, merge commits can therefore result in large amounts of lines that git considers as deleted but for which an (almost) identical clone that only differs in the last commit hash remains in the repository.
Due to these challenges, deletions in merge commits are not recorded by default when mining a repository with `git2net`.
However, you can change this behaviour and include merge deletions in your resulting database by setting `extract_merge_deletions=True` in `mine_git_repo()`.


## 8 - Parameterising `git blame`

For all modified files of a commit, `git2net` calls the git operation `git blame` to determine the previous authors of all lines.
While this sounds simple, precisely specifying what we mean when we say *previous author* can be challenging.
Generally, we want to know who is the author of the line that we are just changing.
But who is this author?
Does adding a whitespace character (e.g., a line break at the end of the line) make me the new author?
Am I the author if I copied the line from another file during refactoring?

The answer to these questions depends on the overarching question you aim to address.
Therefore, we cannot give you a final answer here.
However, we can give you the option to decide for yourself.
`git blame` comes with various options that allow you to specify its exact behaviour.
We have set the default options in `git2net` to conform to the default options of `git blame`.
However, especially the options `-C` concerning copied and pasted lines from other files and `-w` concerning whitespaces might be critical for your analysis.
Therefore, `git2net` provides the options `blame_C` and `blame_w` that allow you to specify them during mining.
We refer to the [git blame documentation](https://git-scm.com/docs/git-blame) for details on both options.

## 9 - Branches of repositories

Finally, `git2net` allows you to analyse different branches of repositories or, with the option `all_branches=True`, even mine all branches at once.
However, to do so, you will need to have tracked these branches when cloning your repository before the analysis.
For details on how you can do this, please refer to the first tutorial.

# The resulting database

Mining a repository with `git2net` yields an SQLite database that contains three tables: `_metadata`, `commits`, and `edits`.
The table `_metadata` contains information such as the path to the original repository, the time it was mined, and the settings for `git2net` that were used.

In [None]:
with sqlite3.connect(sqlite_db_file) as con:
    metadata = pd.read_sql_query("SELECT * FROM _metadata", con)

metadata.tail()

The table `commits` stores all information related to the commits themselves, e.g., information regarding the author and time of a commit, the branch(es) in which it appears, and its parent commit(s).
All commits are uniquely identified by their hash.

In [None]:
with sqlite3.connect(sqlite_db_file) as con:
    commit_data = pd.read_sql_query("SELECT * FROM commits", con)

commit_data.tail()
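Before writing more specific queries, it can help to check which columns a table actually contains. The sketch below uses SQLite's `PRAGMA table_info` through the standard `sqlite3` module; the helper name `list_columns` is hypothetical and not part of `git2net`.

```python
import sqlite3


def list_columns(con, table):
    """Return the column names of `table` for an open SQLite connection."""
    # PRAGMA statements cannot take bound parameters, so the table name is
    # checked against the existing tables before it is interpolated.
    tables = {
        row[0]
        for row in con.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    if table not in tables:
        raise ValueError(f"unknown table: {table}")
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]
```

With the database from this tutorial, one could open a connection with `sqlite3.connect(sqlite_db_file)` and call `list_columns(con, 'commits')` or `list_columns(con, 'edits')`.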

Each commit may contain multiple files, each including numerous changes to lines.
These changes are reflected in the `edits` table, which contains detailed information about all modifications of the files contained in the commits.
Here `git2net` distinguishes between different file-level modifications (`modification_type`) and the corresponding line-level edits within files (`edit_type`).

A file included in a commit may be an added (`ADD`), deleted (`DELETE`), or modified (`MODIFY`) file.
In addition, changes to individual lines are saved for each file.
These can be newly added (`addition`), changed (`replacement`) or deleted (`deletion`) lines.
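The two levels described above can be summarised by counting how often each combination of `modification_type` and `edit_type` occurs. The following is a minimal sketch using only the standard `sqlite3` module. It assumes an `edits` table with these two columns, as described in this section; the helper name `count_edit_types` is hypothetical.

```python
import sqlite3


def count_edit_types(con):
    """Count rows of `edits` per (modification_type, edit_type) pair."""
    query = """
        SELECT modification_type, edit_type, COUNT(*) AS n
        FROM edits
        GROUP BY modification_type, edit_type
        ORDER BY n DESC, modification_type, edit_type
    """
    return con.execute(query).fetchall()
```

Each returned tuple holds the file-level modification type, the line-level edit type, and the number of matching rows, sorted so that the most frequent combination comes first.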

In [None]:
with sqlite3.connect(sqlite_db_file) as con:
    edit_data = pd.read_sql_query("SELECT * FROM edits", con)

edit_data.tail()

With this, we conclude the second part of the tutorial for `git2net`.
In this part, we have explained all options of `git2net`'s primary mining function `mine_git_repo()`, which should enable you to go ahead and make appropriate choices for your application of `git2net`.
Subsequently, you can call `git2net.mine_git_repo()` with your selected options and obtain an SQLite database that we will use as the basis of the next part of this tutorial.