# Microtask0

>Microtask 0: Use this notebook implementing the [Code_Changes metric](https://github.com/chaoss/wg-gmd/blob/master/implementations/Code_Changes-Git.ipynb) ([see it in MyBinder](https://hub.mybinder.org/user/chaoss-wg-gmd-lfs79xw9/notebooks/implementations/Code_Changes-Git.ipynb)) as an example of how to collect the data, producing a single JSON file per data source, with all items (commits, issues, pull/merge requests) in it. Produce one notebook per data source (git, GitHub/GitLab issues, GitHub pull requests / GitLab merge requests) showing a summary of the contents of that file (number of items in it, and number of different identities in it counting authors/committers for git, submitters for issues and pull/merge requests). This microtask is mandatory, to show that you can retrieve data and produce a notebook showing it. In each notebook, include also the list of repositories retrieved, and the date of retrieval, using data available in the JSON file.

In this notebook, I analyze the [tensorflow/datasets](https://github.com/tensorflow/datasets) repository, which has 772 commits, around 150 issues (open + closed) and around 220 pull requests.

## Perceval
Perceval is used to retrieve the data - [link](https://chaoss.github.io/grimoirelab-tutorial/perceval/intro.html) to the tutorial.
>[Perceval](https://github.com/chaoss/grimoirelab-perceval) is a Python module for retrieving data from repositories related to software development. It works with many data sources, from git repositories and GitHub projects to mailing lists, Gerrit or StackOverflow.

In [1]:
from access_token import ACCESS_TOKEN
import datetime
import json
import pandas as pd
from pprint import pprint
owner, repo = 'tensorflow', 'datasets'

### Using Perceval

We will use Perceval's **Git backend** to retrieve data about commits from git repositories, and its **GitHub backend** to retrieve data about issues and pull requests through the [GitHub API](https://developer.github.com/).

Perceval can be used:
- As a Python module - [git-module documentation](https://perceval.readthedocs.io/en/latest/perceval.backends.core.html#module-perceval.backends.core.git) and [github-module documentation](https://perceval.readthedocs.io/en/latest/perceval.backends.core.html#module-perceval.backends.core.github).
- As a program from the command line.

A GitHub access token is required for authenticated access to the GitHub API, which raises the API request rate limit.
See [Generating the access token](https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line).
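
As a rough sketch of the Python-module route, the cell below wraps the Git backend and writes one JSON document per line, similar in spirit to what `--json-line` gives on the command line. The helper names `fetch_git_items` and `write_json_lines` and the `gitpath` location are assumptions made for illustration; `Git(uri=..., gitpath=...)` and its `fetch()` generator are taken from the git-module documentation linked above, and details may differ between Perceval versions.

```python
import json


def fetch_git_items(uri, gitpath):
    """Return an iterator of Perceval items (dicts), one per commit at `uri`.

    Perceval is imported lazily so this cell can be defined without it installed.
    `gitpath` is the local directory where Perceval keeps its clone.
    """
    from perceval.backends.core.git import Git
    return Git(uri=uri, gitpath=gitpath).fetch()


def write_json_lines(items, path, mode='w'):
    """Write each item as one JSON document per line; return the number written."""
    count = 0
    with open(path, mode) as out:
        for item in items:
            out.write(json.dumps(item) + '\n')
            count += 1
    return count


# Hypothetical usage (needs Perceval installed and network access):
# write_json_lines(fetch_git_items('https://github.com/tensorflow/datasets',
#                                  '/tmp/tf-datasets.git'),
#                  'tf_analysis.json')
```

Passing `mode='a'` would append further items (for example issues and pull requests) to the same file.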

### Using Perceval from the command line - retrieving data for the tensorflow/datasets repository and storing it as JSON
(I removed the output from here after getting the tf_analysis.json file and converted the cell to markdown so the notebook doesn't look cluttered.)

In [2]:
# !perceval git --json-line https://github.com/tensorflow/datasets > tf_analysis.json
# !perceval github -t $ACCESS_TOKEN --json-line --sleep-for-rate --category issue $owner $repo >> tf_analysis.json
# !perceval github -t $ACCESS_TOKEN --json-line --sleep-for-rate --category pull_request $owner $repo >> tf_analysis.json


### Getting the date and time of data retrieval using the retrieved data.


In [3]:
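with open('tf_analysis.json') as f:
    first_item = json.loads(f.readline())  # parse the first line of the file into a dict
# 'timestamp' is a Unix timestamp; build a timezone-aware datetime so %Z prints "UTC"
retrieval_time = datetime.datetime.fromtimestamp(first_item['timestamp'], tz=datetime.timezone.utc)
print('The date and time of data retrieval : ', retrieval_time.strftime('%Y-%m-%d %H:%M:%S %Z'))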

The date and time of data retrieval :  2019-03-24 01:58:23 UTC 


The `datetime` Python module is used throughout this notebook to convert between strings and date/datetime objects; its documentation can be found [here](https://docs.python.org/3/library/datetime.html).
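
The same file also carries what is needed to list the repositories retrieved: each Perceval item has an `origin` field next to `category` and `timestamp`. A minimal sketch, assuming one JSON document per line as produced above; the helper name `summarize_origins` is hypothetical:

```python
import datetime
import json
from collections import Counter


def summarize_origins(lines):
    """Return {origin: {'items': Counter by category, 'retrieved': latest datetime}}."""
    summary = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        item = json.loads(raw)
        entry = summary.setdefault(item['origin'],
                                   {'items': Counter(), 'latest': 0})
        entry['items'][item['category']] += 1
        entry['latest'] = max(entry['latest'], item['timestamp'])
    # Convert the latest Unix timestamp per origin into a UTC datetime
    for entry in summary.values():
        entry['retrieved'] = datetime.datetime.fromtimestamp(
            entry.pop('latest'), tz=datetime.timezone.utc)
    return summary


# Usage with the file produced above:
# with open('tf_analysis.json') as f:
#     for origin, info in summarize_origins(f).items():
#         print(origin, dict(info['items']), info['retrieved'].isoformat())
```

Taking the latest timestamp per origin is a design choice of this sketch; the cell above reports the timestamp of the first item instead.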

## The rest of the notebook is divided into three parts: Commits, Pull_requests and Issues

### 1) Commits 
### Class to summarize commits

In [4]:
class Code_Changes:
    """Class for Code_Changes for Git repositories.
    
    Objects are instantiated by specifying a file with the
    commits obtained by Perceval from a set of repositories.
        
    :param path: Path to file with one Perceval JSON document per line
    """

    @staticmethod
    def _summary(repo, cdata):
        """Compute a summary of a commit, suitable as a row in a dataframe"""
        
        summary = {
            'repo': repo,
            'hash': cdata['commit'],
            'author': cdata['Author'],
            'author_date': datetime.datetime.strptime(cdata['AuthorDate'],
                                                      "%a %b %d %H:%M:%S %Y %z"),
            'commit': cdata['Commit'],
            'commit_date': datetime.datetime.strptime(cdata['CommitDate'],
                                                      "%a %b %d %H:%M:%S %Y %z"),
            'files_no': len(cdata['files'])
        }
        # Count the files that record an action (added, modified, deleted, ...)
        summary['files_action'] = sum(1 for file in cdata['files'] if 'action' in file)
        # A commit is treated as a merge when its data has a 'Merge' field
        summary['merge'] = 'Merge' in cdata
        return summary
    
    def __init__(self, path):
        """Initializes self.df, the dataframe with one row per commit.
        """

        columns = ['hash', 'author', 'author_date',
                   'commit', 'commit_date',
                   'files_no', 'files_action',
                   'merge', 'repo']
        commits = []
        with open(path) as commits_file:
            for line in commits_file:
                item = json.loads(line)
                # The file mixes commits, issues and pull requests; keep commits only
                if item['category'] == "commit":
                    commits.append(self._summary(repo=item['origin'], cdata=item['data']))

        # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
        self.df = pd.DataFrame(commits, columns=columns)
        self.df['author_date'] = pd.to_datetime(self.df['author_date'], utc=True)
        self.df['commit_date'] = pd.to_datetime(self.df['commit_date'], utc=True)
        
    def total_count(self):
        
        return len(self.df.index)
    
    def count(self, since = None, until = None, empty=True, merge=True, date='author_date'):
        """Count number of commits
        
        :param since: Period start
        :param until: Period end
        :param empty: Include empty commits
        :param merge: Include merge commits
        :param  date: Kind of date ('author_date' or 'commit_date')
        """
        
        df = self.df
        if since:
            df = df[df[date] >= since]
        if until:
            df = df[df[date] < until]
        if not empty:
            df = df[df['files_action'] != 0]
        if not merge:
            df = df[df['merge'] == False]
        return df['hash'].nunique()
    
    def by_month(self):
        
        return self.df['author_date'] \
            .groupby([self.df.author_date.dt.year.rename('year'),
                      self.df.author_date.dt.month.rename('month')]) \
            .agg('count')
            
    def unique_users(self):
        return self.df.author.nunique()

### Using the Code_Changes class to print some commit statistics

In [5]:
changes = Code_Changes('tf_analysis.json')
print("Code changes total count:", changes.total_count())
print("Code changes count (whole period):", changes.count())
print("Code changes count from 2018-09-12 to 2019-03-10:",
      changes.count(since="2018-09-12", until="2019-03-10"))
print("Code changes count from 2018-09-12 to 2019-03-10 (no merge commits):",
      changes.count(since="2018-09-12", until="2019-03-10", merge=False))
print("Code changes count from 2018-09-12 to 2019-03-10 (no empty commits):",
      changes.count(since="2018-09-12", until="2019-03-10", empty=False))

Code changes total count: 696
Code changes count (whole period): 696
Code changes count from 2018-09-12 to 2019-03-10: 617
Code changes count from 2018-09-12 to 2019-03-10 (no merge commits): 600
Code changes count from 2018-09-12 to 2019-03-10 (no empty commits): 614


In [6]:
changes.by_month()

year  month
2018  9         10
      10        28
      11       107
      12       137
2019  1        122
      2        148
      3        144
Name: author_date, dtype: int64

### The commits dataframe created by the class

In [7]:
changes.df.head()

| | hash | author | author_date | commit | commit_date | files_no | files_action | merge | repo |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 680c6bda1b7d6db9d74f9c1373ebe737938926a7 | Ryan Sepassi <rsepassi@google.com> | 2018-09-12 21:23:17+00:00 | Ryan Sepassi <rsepassi@google.com> | 2018-09-12 21:23:17+00:00 | 1 | 1 | False | https://github.com/tensorflow/datasets |
| 1 | 87fd7f9fc4e36aa917726f3e8f89b7c89c99fb66 | Ryan Sepassi <rsepassi@google.com> | 2018-09-11 19:10:03+00:00 | Ryan Sepassi <rsepassi@google.com> | 2018-09-12 21:28:48+00:00 | 11 | 11 | False | https://github.com/tensorflow/datasets |
| 2 | fb0f20383bb2a83477770517767b53ad97ec0840 | Ryan Sepassi <rsepassi@google.com> | 2018-09-12 20:12:22+00:00 | Ryan Sepassi <rsepassi@google.com> | 2018-09-12 21:28:57+00:00 | 2 | 2 | False | https://github.com/tensorflow/datasets |
| 3 | ca9ffcccb2ae3409c4210475b6407b871ad51ba9 | Ryan Sepassi <rsepassi@google.com> | 2018-09-13 19:53:54+00:00 | Ryan Sepassi <rsepassi@google.com> | 2018-09-13 19:59:39+00:00 | 5 | 5 | False | https://github.com/tensorflow/datasets |
| 4 | 3f650ac05ea4e5d1eeef55a9d3a70a75b1cb5843 | Dustin Tran <trandustin@google.com> | 2018-09-20 18:59:48+00:00 | Copybara-Service <copybara-piper@google.com> | 2018-09-21 23:28:12+00:00 | 2 | 2 | False | https://github.com/tensorflow/datasets |


### Number of different users involved in the commits

In [8]:
# unique_users() counts distinct authors only (committers are not included)
print('Number of different identities : ', changes.unique_users())

Number of different identities :  42
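
The count above covers authors only, while the microtask asks for authors and committers. A minimal sketch of counting both, assuming a dataframe with the `author` and `commit` columns built by `Code_Changes` (the helper name `count_identities` is hypothetical):

```python
import pandas as pd


def count_identities(df):
    """Count distinct identities that appear as author or as committer."""
    # Stack both columns into one series so each identity is counted once
    identities = pd.concat([df['author'], df['commit']], ignore_index=True)
    return identities.dropna().nunique()


# Usage with the dataframe built above:
# print('Authors and committers:', count_identities(changes.df))
```

Identities are compared as raw `Name <email>` strings here, so the same person using two e-mail addresses would be counted twice.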


### Number of commits that make changes in the master branch

In [9]:
# Create a dict having all commits
commits = {}
with open('tf_analysis.json') as commits_file:
    for line in commits_file:
        line = json.loads(line)
        if(line['category']=="commit"):
            commit = line
            commits[commit['data']['commit']] = commit
            
# Find commits in master branch.
# Start by adding the head of master to an empty todo set. Then loop until the
# todo set is empty: for each commit in the todo set, add it to the master set,
# and go backwards (finding parents), adding them to the todo set.
todo = set()
for commit_hash, commit in commits.items():
    if 'HEAD -> refs/heads/master' in commit['data']['refs']:
        todo.add(commit_hash)

        
master = set()
while len(todo) > 0:
    current = todo.pop()
    master.add(current)
    for parent in commits[current]['data']['parents']:
        if parent not in master:
            todo.add(parent)
    
code_commits = len(master)
    
print("Code Commits (master branch):", code_commits)

Code Commits (master branch): 677
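
To make the traversal easier to follow, here is the same idea on a toy history. This is only an illustrative sketch: the function name `reachable_commits` and the toy hashes are made up, and `parents` stands in for the `parents` lists found in each commit's data.

```python
def reachable_commits(parents, heads):
    """Return the set of commits reachable from `heads` by following parent links.

    `parents` maps each commit hash to the list of its parent hashes.
    """
    todo = set(heads)
    seen = set()
    while todo:
        current = todo.pop()
        seen.add(current)
        # Queue only parents not visited yet, so shared ancestors are walked once
        todo.update(p for p in parents.get(current, []) if p not in seen)
    return seen


# Toy history: c4 merges c2 and c3; x1 sits on a side branch that was never merged.
toy = {'c1': [], 'c2': ['c1'], 'c3': ['c1'], 'c4': ['c2', 'c3'], 'x1': ['c1']}
print(sorted(reachable_commits(toy, ['c4'])))  # ['c1', 'c2', 'c3', 'c4']
```

The side-branch commit `x1` is left out because nothing reachable from the head lists it as a parent, which is why the master count can be lower than the total number of commits.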


### 2) Pull_requests
### Visualizing the json structure of a pull_request

In [None]:
with open('./tf_analysis.json') as pr_file:
    for line in pr_file:
        line = json.loads(line)
        # Stop at the first item retrieved with the pull_request category
        if line['category'] == "pull_request":
            pr = line
            break
pprint(pr)

### Class to summarize pull_requests

In [11]:
class pr_statistics:
    
    @staticmethod
    def _summary(pr_data):
        """Compute a summary of a pull_request, suitable as a row in a dataframe"""
        
        summary = {
            'base_repo': pr_data['base']['label'],
            'title': pr_data['title'],
            'state': pr_data['state'],
            'commits':pr_data['commits'],
            'commits_data': pr_data['commits_data'],
            'comments':pr_data['comments'],
            'changed_files': pr_data['changed_files'],
            'additions': pr_data['additions'],
            'deletions': pr_data['deletions'],
            'created_at': pr_data['created_at'],
            'closed_at': pr_data['closed_at'],
            'user': pr_data['user_data']['login']
        }
        # merged_at is only kept for pull requests that were actually merged
        summary['merged_at'] = pr_data['merged_at'] if pr_data['merged'] else None
        return summary
    
    def __init__(self, path):
        """
           Initializes self.df, the dataframe with one row per pull_request.
        """

        # Note the comma after 'created_at': without it, Python joins the two
        # adjacent string literals into a single 'created_atclosed_at' column
        columns = ['base_repo', 'title', 'state', 'commits', 'commits_data',
                   'comments', 'changed_files', 'additions', 'deletions', 'created_at',
                   'closed_at', 'user', 'merged_at']
        pull_requests = []
        with open(path) as pr_file:
            for line in pr_file:
                item = json.loads(line)
                if item['category'] == "pull_request":
                    pull_requests.append(self._summary(pr_data=item['data']))

        # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
        self.df = pd.DataFrame(pull_requests, columns=columns)
        
    def total_count(self):
        
        return len(self.df.index)
    
    def open_prs(self):
        return len(self.df.index[self.df['state']=='open'])
    
    def closed_prs(self):
        return len(self.df.index[self.df['state']=='closed'])
    
    def unique_users(self):
        return self.df.user.nunique()


### Number of unique contributors for pull_requests, and number of open and closed pull_requests

In [12]:
pull_reqs = pr_statistics('./tf_analysis.json')
print('Number of open prs', pull_reqs.open_prs())
print('Number of closed prs', pull_reqs.closed_prs())
print('Number of unique users', pull_reqs.unique_users())

Number of open prs 60
Number of closed prs 117
Number of unique users 36


### 3) Issues
Note: items with category `issue` also include pull requests, because in the GitHub API every pull request is also an issue.
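
If plain issues need to be told apart from pull requests, one possible approach is sketched below. It rests on an assumption worth checking against the retrieved data: as in the GitHub REST API, an issue that is really a pull request carries a `pull_request` key in its data. The helper name `split_issue_items` is hypothetical.

```python
import json


def split_issue_items(lines):
    """Split Perceval 'issue' items into (plain issues, pull requests).

    Assumption: an issue that is really a pull request has a
    'pull_request' key in item['data'], as in the GitHub REST API.
    """
    plain, pulls = [], []
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank lines
        item = json.loads(raw)
        if item['category'] != 'issue':
            continue  # skip commits and pull_request items
        if 'pull_request' in item['data']:
            pulls.append(item)
        else:
            plain.append(item)
    return plain, pulls


# Usage with the file produced above:
# with open('tf_analysis.json') as f:
#     plain, pulls = split_issue_items(f)
# print(len(plain), 'plain issues,', len(pulls), 'pull requests listed as issues')
```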

### Class to summarize issues (similar to the pr_statistics class above)

In [13]:
class issue_statistics:
    
    @staticmethod
    def _summary(issue_data):
        """Compute a summary of an issue, suitable as a row in a dataframe"""
        
        summary = {
            'title': issue_data['title'],
            'state': issue_data['state'],
            'comments': issue_data['comments'],
            'created_at': issue_data['created_at'],
            'closed_at': issue_data['closed_at'],
            'created_by': issue_data['user_data']['login']
        }
        return summary
    
    def __init__(self, path):
        """
           Initializes self.df, the dataframe with one row per issue.
        """

        columns = ['title', 'state', 'created_at', 'closed_at', 'comments', 'created_by']
        issues = []
        with open(path) as issue_file:
            for line in issue_file:
                item = json.loads(line)
                if item['category'] == "issue":
                    issues.append(self._summary(issue_data=item['data']))

        # Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
        self.df = pd.DataFrame(issues, columns=columns)
        
    def total_count(self):        
        return len(self.df.index)
    
    def open_issues(self):
        return len(self.df.index[self.df['state']=='open'])
    
    def closed_issues(self):
        return len(self.df.index[self.df['state']=='closed'])
    
    def unique_users(self):
        return self.df.created_by.nunique()

### The number of unique issue creators, and the number of open and closed issues

In [14]:
issues = issue_statistics('./tf_analysis.json')
print('Number of open issues', issues.open_issues())
print('Number of closed issues', issues.closed_issues())
print('Number of unique issue creators', issues.unique_users())

Number of open issues 152
Number of closed issues 171
Number of unique issue creators 80


### Printing the issue dataframe created by this class

In [15]:
issues.df.head()

| | title | state | created_at | closed_at | comments | created_by |
|---|---|---|---|---|---|---|
| 0 | Example in README not working | closed | 2018-09-17T18:42:13Z | 2018-09-17T21:56:55Z | 2 | piyush-kgp |
| 1 | Fixes bug - datasets.load was not implemented | closed | 2018-09-17T19:57:28Z | 2018-09-17T21:56:03Z | 1 | piyush-kgp |
| 2 | error on pip install | closed | 2018-09-26T12:32:21Z | 2018-10-01T07:16:31Z | 1 | tiaguinho |
| 3 | Import error on Windows | closed | 2018-09-14T10:09:00Z | 2018-10-01T07:17:00Z | 3 | LoSealL |
| 4 | [Question] - Would it be okay to generate tfre... | closed | 2018-11-28T19:33:58Z | 2018-11-29T19:11:58Z | 1 | ksachdeva |
