
# Working with PyDriller


PyDriller is a Python framework for mining software repositories. With PyDriller you can easily extract information from any Git repository, such as commits, developers, modifications, diffs, and source code, and quickly export it to CSV files.

## Installation and import

In [None]:
!pip install pydriller==2.0



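We import `Repository` under the alias `RepositoryMining` (the class name used before PyDriller 2.0), together with `pandas` to handle tabular data and `datetime` to work with dates: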

In [None]:
from pydriller import Repository as RepositoryMining
import pandas as pd
from datetime import datetime

## Commit Object
A `Commit` object exposes the information stored in a Git commit, plus some derived fields:
```
hash (str): hash of the commit
msg (str): commit message
author (Developer): commit author (name, email)
author_date (datetime): authored date
author_timezone (int): author timezone (offset from UTC, in seconds)
committer (Developer): commit committer (name, email)
committer_date (datetime): commit date
committer_timezone (int): commit timezone (offset from UTC, in seconds)
branches (List[str]): List of branches that contain this commit
in_main_branch (Bool): True if the commit is in the main branch
merge (Bool): True if the commit is a merge commit
modified_files (List[ModifiedFile]): list of modified files in the commit (see ModifiedFile)
parents (List[str]): list of the commit parents
project_name (str): project name
project_path (str): project path
```
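
The timezone fields are plain integers, so they need converting before they can be used with `datetime`. Here is a minimal sketch, assuming the value follows GitPython's convention of seconds west of UTC. That sign convention is an assumption worth checking against your PyDriller version, and `offset_to_timezone` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def offset_to_timezone(seconds_west_of_utc: int) -> timezone:
    """Convert a Git-style offset (seconds west of UTC) to a tzinfo object."""
    # Assumption: positive values lie west of UTC, so the sign is flipped
    # to obtain the conventional "UTC+hh:mm" offset.
    return timezone(timedelta(seconds=-seconds_west_of_utc))


# Hypothetical value for a commit authored at UTC+01:00.
tz = offset_to_timezone(-3600)
local_time = datetime(2018, 3, 21, 15, 34, 21, tzinfo=timezone.utc).astimezone(tz)
print(local_time.isoformat())  # 2018-03-21T16:34:21+01:00
```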

## Mining single project commits

In [None]:
project_url = 'https://github.com/ishepard/pydriller.git'

repo_commits = list()
for commit in RepositoryMining(project_url).traverse_commits():
    repo_commits.append({
        'sha': commit.hash, 
        'author': commit.author.name, 
        'date': commit.author_date
        })

In [None]:
pd.DataFrame(repo_commits)

index,sha,author,date
0,ab36bf45859a210b0eae14e17683f31d19eea041,ishepard,2018-03-21 16:34:21+01:00
1,fdf671856b260aca058e6595a96a7a0fba05454b,ishepard,2018-03-22 11:07:31+01:00
2,90ca34ebfe69629cb7f186a1582fc38a73cc572e,ishepard,2018-03-22 12:53:52+01:00
3,71e053f61fc5d31b3e31eccd9c79df27c31279bf,ishepard,2018-03-26 13:13:27+02:00
4,205f6fb09734667b0c1842fd3c317013640189ce,ishepard,2018-03-27 16:34:02+02:00
...,...,...,...
790,fe503b6ec4327f6516bb63df04ffe71e43564295,cmtg,2023-02-19 15:14:36-03:00
791,e7a0a923110a0d9d784d5d23f9fd115f5f274882,cmtg,2023-02-20 09:41:50-03:00
792,6567abee773bdcba3ef1c969195982c0f36d90b3,Davide Spadini,2023-02-20 15:06:46+00:00
793,39987b7200726eca928d136155f075a416449def,Finn Kalvelage,2023-02-28 10:33:22+00:00
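
The introduction mentions exporting CSV files. Here is a minimal sketch using the standard-library `csv` module. `commits_to_csv` is a hypothetical helper name, and the sample rows stand in for `repo_commits` (values copied from the output above, with dates kept as strings). With pandas, `pd.DataFrame(repo_commits).to_csv(...)` does the same job.

```python
import csv
import io


def commits_to_csv(rows, stream):
    """Write a list of commit dicts to an open text stream as CSV."""
    if not rows:
        return
    # The keys of the first row define the column order.
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Stand-in for `repo_commits`, so the example runs without cloning anything.
sample = [
    {"sha": "ab36bf45859a210b0eae14e17683f31d19eea041",
     "author": "ishepard", "date": "2018-03-21 16:34:21+01:00"},
    {"sha": "fdf671856b260aca058e6595a96a7a0fba05454b",
     "author": "ishepard", "date": "2018-03-22 11:07:31+01:00"},
]

buffer = io.StringIO()  # replace with open("commits.csv", "w", newline="") to write a file
commits_to_csv(sample, buffer)
print(buffer.getvalue())
```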


## Mining multiple project commits

In [None]:
repos_url = [ "https://github.com/TheAlgorithms/Java.git", "https://github.com/apache/netbeans" ]

commits = list()
for commit in RepositoryMining(path_to_repo=repos_url).traverse_commits():
    commits.append({
        'sha': commit.hash, 
        'author': commit.author.name, 
        'msg': commit.msg
        })

In [None]:
pd.DataFrame(commits)

index,sha,author,msg
0,40d42574e065e8078b242d201e0fc1455c430c71,Anup Kumar Panwar,Initial commit
1,4ba958863b0cd2212b681598969bf92450c13b71,Anup Kumar Panwar,Bubble Sort
2,12d7c48ee4b7f415f19dda4a889032263cc3529a,Anup Kumar Panwar,insertion sort
3,9661bb3df62929cad320ce576c7e156a7c91a748,Anup Kumar Panwar,Binary Search
4,73bb72b8d0ab1e3a8706394782c399d4e20f3308,Anup Kumar Panwar,Renamed
...,...,...,...
10226,5fc81d9194c96bddff2ebcd72a751135ee36387d,Matthias Bläsing,Merge pull request #5756 from matthiasblaesing...
10227,ca2b81262cdf4afe31e8c2f63a9f529097be45bc,Matthias Bläsing,Merge pull request #5716 from matthiasblaesing...
10228,163c2d525543e7ac1f9fd5fbdea68440b173888e,Matthias Bläsing,Merge pull request #5694 from matthiasblaesing...
10229,0ab330adb9d07c7646354e4ea450ad4f41f1ea5f,Benjamin Asbach,[NETBEANS-5479] improve maven multithreaded ex...
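
When several repositories are mined in one pass, the rows above no longer say which project a commit came from. One option is to also store `commit.project_name` (listed among the `Commit` fields) and count rows per project. The sketch below assumes each row carries a hypothetical `'project'` key filled that way. The sample rows are illustrative stand-ins, not real mining output.

```python
from collections import Counter


def commits_per_project(rows):
    """Count collected commit rows per project name."""
    return Counter(row["project"] for row in rows)


# Illustrative rows; in the loop above one would add
# 'project': commit.project_name to each dict.
sample = [
    {"sha": "40d42574e065e8078b242d201e0fc1455c430c71", "project": "Java"},
    {"sha": "4ba958863b0cd2212b681598969bf92450c13b71", "project": "Java"},
    {"sha": "0ab330adb9d07c7646354e4ea450ad4f41f1ea5f", "project": "netbeans"},
]

counts = commits_per_project(sample)
for project, n in counts.most_common():
    print(project, n)
```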


## Get modifications
Each entry in `commit.modified_files` is a `ModifiedFile` object with the following fields:
```
old_path: old path of the file (can be None if the file is added)
new_path: new path of the file (can be None if the file is deleted)
change_type: type of the change, a ModificationType value such as ADD, DELETE, MODIFY, or RENAME.
diff: diff of the file as Git presents it (e.g., starting with @@ xx,xx @@).
source_code: source code of the file (can be None if the file is deleted)
source_code_before: source code of the file before the change (can be None if the file is added)
added_lines: number of lines added
deleted_lines: number of lines removed
nloc: Lines Of Code (LOC) of the file
complexity: Cyclomatic Complexity of the file
token_count: Number of Tokens of the file
methods: list of methods of the file. The list might be empty if the programming language is not supported or if the file is not a source code file.
```

To iterate over the modified files of each `Commit` object:

In [None]:
mod_commits = list()
for commit in RepositoryMining('https://github.com/TheAlgorithms/Java.git').traverse_commits():
    for m in commit.modified_files:
        mod_commits.append({
            'author': commit.author.name,
            'modified_file': m.filename,
            'change_type': m.change_type.name,
            'cyclomatic_complexity': m.complexity
            })

In [None]:
pd.DataFrame(mod_commits)

index,author,modified_file,change_type,cyclomatic_complexity
0,Anup Kumar Panwar,README.md,ADD,
1,Anup Kumar Panwar,BubbleSort.java,ADD,6.0
2,Anup Kumar Panwar,InsertionSort.java,ADD,6.0
3,Anup Kumar Panwar,SelectionSort.java,ADD,6.0
4,Anup Kumar Panwar,BinarySearch.java,ADD,7.0
...,...,...,...,...
4316,SwargaRajDutta,StringCompressionTest.java,ADD,1.0
4317,Isak Einberg,Volume.java,MODIFY,8.0
4318,JarZombie,update_directory.yml,MODIFY,
4319,duyuanch,CombSort.java,MODIFY,8.0
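
The empty complexity cells above belong to files that are not source code (for example `README.md`), so any aggregate has to skip them. Here is a minimal sketch of averaging complexity per change type over rows shaped like `mod_commits`. `mean_complexity_by_change_type` is a hypothetical helper, and the sample rows reuse a few values from the output above.

```python
from collections import defaultdict
from statistics import mean


def mean_complexity_by_change_type(rows):
    """Average cyclomatic complexity per change type, ignoring missing values."""
    grouped = defaultdict(list)
    for row in rows:
        value = row["cyclomatic_complexity"]
        if value is not None:  # non-source files carry no complexity
            grouped[row["change_type"]].append(value)
    return {change_type: mean(values) for change_type, values in grouped.items()}


# Stand-in for `mod_commits`.
sample = [
    {"modified_file": "README.md", "change_type": "ADD", "cyclomatic_complexity": None},
    {"modified_file": "BubbleSort.java", "change_type": "ADD", "cyclomatic_complexity": 6.0},
    {"modified_file": "BinarySearch.java", "change_type": "ADD", "cyclomatic_complexity": 7.0},
    {"modified_file": "Volume.java", "change_type": "MODIFY", "cyclomatic_complexity": 8.0},
    {"modified_file": "update_directory.yml", "change_type": "MODIFY", "cyclomatic_complexity": None},
]

summary = mean_complexity_by_change_type(sample)
print(summary)  # {'ADD': 6.5, 'MODIFY': 8.0}
```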


## Filter by commit

In [None]:
url = "https://github.com/ishepard/pydriller.git"
for commit in RepositoryMining(url, single='05526fad873c3fc83e40bcbc424bd1b3e5393dd5').traverse_commits():
    print('Hash {}, author {}'.format(commit.hash, commit.author.name))

Hash 05526fad873c3fc83e40bcbc424bd1b3e5393dd5, author ishepard


## Filter by date

In [None]:
filtered_commits = list()
for commit in RepositoryMining(url, since=datetime(2020, 1, 1, 1, 0, 0)).traverse_commits():
    filtered_commits.append({
        'sha': commit.hash, 
        'author': commit.author.name, 
        'msg': commit.msg
        })

In [None]:
pd.DataFrame(filtered_commits)

index,sha,author,msg
0,c69e50b5d68b42c19639ac81c37f039581e149ad,stefanodallapalma,Added metric to count for devs who contributed...
1,9baf4fd9e1cb84546ae9fe6864e158b3a1c01080,stefanodallapalma,Added process metric to count the number of ne...
2,22573d99f7135d37d0aab9cd8fee0ae9ec1b6c49,stefanodallapalma,Added two process metrics
3,9be1a6f6e420ae19303b6a94caccf93397068d02,stefanodallapalma,Added metric to count normalized number of add...
4,8e379834929c14f2da5fb2cba04e2326f6ef0a3a,stefanodallapalma,Added metric to count the normalized number of...
...,...,...,...
385,fe503b6ec4327f6516bb63df04ffe71e43564295,cmtg,Added since_as_filter
386,e7a0a923110a0d9d784d5d23f9fd115f5f274882,cmtg,Fixed size issue with test-repos.zip
387,6567abee773bdcba3ef1c969195982c0f36d90b3,Davide Spadini,Merge pull request #256 from cmtg/since_as_fil...
388,39987b7200726eca928d136155f075a416449def,Finn Kalvelage,Update modifiedfile.rst\n\nRewrote the first p...
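
Commit dates are timezone-aware, while the `since` value above is a naive `datetime`. If you prefer to filter rows you have already collected, the sketch below shows one way to compare them safely. It reads naive bounds as UTC, which is a choice made for this example and not a statement about PyDriller's internals. `as_utc` and `within` are hypothetical helper names.

```python
from datetime import datetime, timezone


def as_utc(dt):
    """Attach UTC to a naive datetime; leave aware datetimes unchanged."""
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)


def within(rows, since=None, to=None, key="date"):
    """Keep rows whose aware datetime under `key` lies in [since, to]."""
    since = as_utc(since) if since is not None else None
    to = as_utc(to) if to is not None else None
    return [
        row for row in rows
        if (since is None or row[key] >= since) and (to is None or row[key] <= to)
    ]


# Stand-in rows with aware dates, as stored by the first mining example.
sample = [
    {"sha": "ab36bf45859a210b0eae14e17683f31d19eea041",
     "date": datetime.fromisoformat("2018-03-21T16:34:21+01:00")},
    {"sha": "39987b7200726eca928d136155f075a416449def",
     "date": datetime.fromisoformat("2023-02-28T10:33:22+00:00")},
]

recent = within(sample, since=datetime(2020, 1, 1, 1, 0, 0))
print([row["sha"] for row in recent])
```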


## Other options
PyDriller comes with a set of common commit filters that you can apply:
```
only_in_branch (str): only analyses commits that belong to this branch.
only_no_merge (bool): only analyses commits that are not merge commits.
only_authors (List[str]): only analyses commits that are made by these authors. The check is made on the username, NOT the email.
only_commits (List[str]): only these commits will be analyzed.
only_releases (bool): only analyses commits that are tagged ("release" is a GitHub term and does not exist in Git itself).
filepath (str): only commits that modified this file will be analyzed.
only_modifications_with_file_types (List[str]): only analyses commits in which at least one modified file has one of these file types, e.g., if you pass ".java", it visits only commits that modified at least one Java file and skips all others.
```

In [None]:
# Each call below only builds a lazy generator; no commits are read until it is iterated.
# Only commits in branch1 and no merges
RepositoryMining('path/to/the/repo', only_in_branch='branch1', only_no_merge=True).traverse_commits()

# Only commits of author "ishepard"
RepositoryMining('path/to/the/repo', only_authors=['ishepard']).traverse_commits()

# Only commits that modified a java file
RepositoryMining('path/to/the/repo', only_modifications_with_file_types=['.java']).traverse_commits()

<generator object Repository.traverse_commits at 0x7f17bd17f190>
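
The output above is a generator object: `traverse_commits()` does no work until it is iterated. The sketch below shows a small helper that consumes only the first few commits, which is handy when trying out filters on a large repository. `collect_commits` and `FakeRepository` are hypothetical names. The fake class only mimics the `traverse_commits()` interface so the example runs without PyDriller or network access.

```python
from itertools import islice
from types import SimpleNamespace


def collect_commits(repo, limit=None):
    """Return up to `limit` rows from `repo.traverse_commits()` (all if None)."""
    rows = []
    for commit in islice(repo.traverse_commits(), limit):
        rows.append({
            "sha": commit.hash,
            "author": commit.author.name,
            "msg": commit.msg,
        })
    return rows


class FakeRepository:
    """Hypothetical stand-in exposing a lazy `traverse_commits()` generator."""

    def __init__(self):
        self.visited = 0  # counts how many commits were actually produced

    def traverse_commits(self):
        for i in range(5):
            self.visited += 1
            yield SimpleNamespace(
                hash=f"{i:040x}",
                author=SimpleNamespace(name="dev"),
                msg=f"commit {i}",
            )


repo = FakeRepository()
rows = collect_commits(repo, limit=2)
print(len(rows), repo.visited)  # 2 2
```

With PyDriller installed, the corresponding call would be `collect_commits(RepositoryMining('path/to/the/repo', only_no_merge=True), limit=10)`.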

## Resources

- PyDriller docs: [online documentation](https://pydriller.readthedocs.io/) and [PDF version](https://readthedocs.org/projects/pydriller/downloads/pdf/latest/)
- PyDriller source code: [GitHub repository](https://github.com/ishepard/pydriller)
- PyDriller paper: [PDF](https://www.sback.it/publications/fse2018td.pdf)
- Adapted from: [Pydriller Tool Demo on Kaggle](https://www.kaggle.com/sayedmohsin/pydriller-tool-demo-by-sayed-mohsin-reza#Pydriller-Tool-Demo)