# GitHub - Data Extraction

The file `../data/RPackage-Repositories-150101-150601.csv` lists GitHub repositories that are candidates for hosting an R package. The candidates were collected from GitHub activity between January and June 2015, and each one contains a `DESCRIPTION` file at the root of the repository.

We ran `git clone` on each of those repositories. This notebook parses the cloned repositories and extracts the content of the `DESCRIPTION` file at each commit that changed it.

In [9]:
import pandas
from datetime import date

We will make use of the following commands:
 - `git clone <url> <path>`, where `<url>` is the URL of the repository and `<path>` is the location in which to store it.
 - `git log --format="%H/%ci" -- <path>`, where `<path>` will be `DESCRIPTION`. The output is one `<commit>/<date>` line for each commit that changed this file.
 - `git show <commit>:<path>`, where `<commit>` is the commit under consideration and `<path>` will be `DESCRIPTION`. This command outputs the content of the file at the given commit.
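The output of the `git log` command can be split into (commit, date) pairs with a few lines of Python. The sketch below is illustrative only: `parse_git_log` is a hypothetical helper name, and the sample text uses placeholder hashes rather than data from the collected repositories.

```python
def parse_git_log(output):
    """Split `git log --format=%H/%ci` output into (sha, date) pairs."""
    pairs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. after the trailing newline
        # %H is hexadecimal and %ci contains no '/', so one split is enough
        sha, date = line.split('/', 1)
        pairs.append((sha.strip(), date.strip()))
    return pairs


# Placeholder output (hypothetical hashes), newest commit first
sample = (
    "1111111111111111111111111111111111111111/2015-04-16 20:02:44 +0000\n"
    "2222222222222222222222222222222222222222/2015-04-16 19:42:01 +0000\n"
)
commits = parse_git_log(sample)
```

Each pair can then be fed to `git show <commit>:DESCRIPTION` to retrieve the file content at that commit.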

In [10]:
github = pandas.DataFrame.from_csv('../data/RPackage-Repositories-150101-150601.csv')
repositories = github[['owner.login', 'name']].rename(columns={'owner.login': 'owner', 'name': 'repositories'})

FILENAME = '../data/github-raw-150601.csv'

# Root of the directory where the repositories were collected
GIT_DIR = '/data/github/' 

We will retrieve a lot of data, so we can benefit from IPython's parallel computing tools.

**To use this notebook, you need either to configure your IPController or to start a cluster of IPython nodes, using `ipcluster start -n 4` for example.** See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.

It seems that recent versions of IPython Notebook can start a cluster directly from the web interface, under the *Cluster* tab.
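For readers without an IPython cluster at hand, the same pattern (hand each repository to whichever worker is free and collect results as they complete) can be tried with the standard library. This is a swapped-in technique, not what this notebook runs: it assumes Python 3 and uses `concurrent.futures` instead of `IPython.parallel`, and `fake_extract` is a hypothetical stand-in for the real extraction function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_extract(item):
    """Hypothetical stand-in: returns one record per (owner, repository)."""
    owner, repository = item
    return [{'Owner': owner, 'Repository': repository}]


def run_unordered(func, items, workers=4):
    """Apply func to every item, gathering results in completion order."""
    collected = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(func, item) for item in items]
        for future in as_completed(futures):
            collected.extend(future.result())
    return collected


# Placeholder items, not real repositories
items = [('owner-a', 'repo-1'), ('owner-b', 'repo-2'), ('owner-c', 'repo-3')]
records = run_unordered(fake_extract, items)
```

One caveat if adapting this: `os.chdir` changes the working directory of the whole process, so a thread-based variant should pass `cwd=` to the `subprocess` calls rather than changing directory.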

In [11]:
from IPython import parallel
clients = parallel.Client()
clients.block = False # asynchronous computations
print 'Clients:', str(clients.ids)

Clients: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]


In [12]:
def get_data_from((owner, repository)):
    # Move to target directory
    try:
        os.chdir(os.path.join(GIT_DIR, owner, repository))
    except OSError as e: 
        # Should happen when directory does not exist
        return []
    
    data_list = []
    
    # Get commits for DESCRIPTION
    try:
        commits = subprocess.check_output(['git', 'log', '--format=%H/%ci', '--', 'DESCRIPTION'])
    except subprocess.CalledProcessError as e:
        # Should not happen!?
        raise Exception(owner + ' ' + repository + '/ log : ' + e.output)
        
    for commit in [x for x in commits.split('\n') if len(x.strip())!=0]:
        commit_sha, date = map(lambda x: x.strip(), commit.split('/'))
        
        # Get file content
        try:
            content = subprocess.check_output(['git', 'show', '{id}:{path}'.format(id=commit_sha, path='DESCRIPTION')])
        except subprocess.CalledProcessError as e:
            # Happens when DESCRIPTION does not exist at this commit (e.g. the commit deleted it). Silently ignore
            continue
        
        try:
            metadata = deb822.Deb822(content.split('\n'))
        except Exception as e: 
            # I don't know which exceptions Deb822 may throw, so catch them all
            continue # Go further
            
        data = {}
        
        for md in ['Package', 'Version', 'License', 'Imports', 'Suggests', 'Depends', 'Author', 'Authors', 'Maintainer']:
            data[md] = metadata.get(md, '')
        
        data['CommitDate'] = date
        data['Owner'] = owner
        data['Repository'] = repository
        data_list.append(data)

    # Return to root directory
    os.chdir(GIT_DIR)
    return data_list
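`deb822.Deb822` works here because R's `DESCRIPTION` files follow the Debian control file layout: `Field: value` lines, with indented lines continuing the previous field. To illustrate what is being extracted, here is a simplified standard-library sketch of that parsing. It is not the `python-debian` implementation and ignores edge cases; `parse_description` and the sample text are hypothetical.

```python
def parse_description(text):
    """Parse 'Field: value' lines; indented lines continue the last field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in ' \t' and current is not None:
            # Continuation line: append to the previous field's value
            fields[current] += ' ' + line.strip()
        elif ':' in line:
            current, value = line.split(':', 1)
            current = current.strip()
            fields[current] = value.strip()
    return fields


# Made-up DESCRIPTION content, for illustration only
sample = """Package: examplepkg
Version: 0.1.0
Depends: R (>= 2.6.0),
    stats
License: GPL (>= 2)
"""
meta = parse_description(sample)
wanted = ['Package', 'Version', 'License', 'Imports', 'Depends']
# Missing fields default to '', mirroring metadata.get(md, '') above
record = {name: meta.get(name, '') for name in wanted}
```

Defaulting absent fields to an empty string keeps every record the same shape, which makes the later conversion to a DataFrame straightforward.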

In [13]:
data = []

clients[:].execute('import subprocess, os')
clients[:].execute('from debian import deb822')
clients[:]['GIT_DIR'] = GIT_DIR

balanced = clients.load_balanced_view()

items = [(owner, repo) for idx, (owner, repo) in repositories.iterrows()]

print len(items), 'items'
    
res = balanced.map(get_data_from, items, ordered=False, timeout=15)

import time
while not res.ready():
    time.sleep(5)
    print res.progress, ' ', 
    
for result in res.result:
    data.extend(result)

6533 items
818   1229   1865   2408   2983   3536   4123   4685   5212   5825   6406   6533  


In [14]:
df = pandas.DataFrame.from_records(data)
df.to_csv(FILENAME, encoding='utf-8')
print len(df), 'items'
print len(df.drop_duplicates(['Package'])), 'packages'
print len(df.drop_duplicates(['Owner', 'Repository'])), 'repositories'
print len(df.drop_duplicates(['Package', 'Version'])), 'pairs (package, version)'

101208 items
6293 packages
6513 repositories
68833 pairs (package, version)


In [15]:
df

Unnamed: 0,Author,Authors,CommitDate,Depends,Imports,License,Maintainer,Owner,Package,Repository,Suggests,Version
0,"Yann Ruffieux, contributions from Debjani Bhow...",,2015-04-16 20:02:44 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.35.0
1,"Yann Ruffieux, contributions from Debjani Bhow...",,2015-04-16 19:42:01 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.34.0
2,"Yann Ruffieux, contributions from Debjani Bhow...",,2014-10-13 21:47:41 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.33.0
3,"Yann Ruffieux, contributions from Debjani Bhow...",,2014-10-13 21:38:33 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.32.0
4,"Yann Ruffieux, contributions from Debjani Bhow...",,2014-04-11 21:21:21 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.31.0
5,"Yann Ruffieux, contributions from Debjani Bhow...",,2014-04-11 21:07:21 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.30.0
6,"Yann Ruffieux, contributions from Debjani Bhow...",,2014-03-04 22:12:21 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.29.2
7,"Yann Ruffieux, contributions from Debjani Bhow...",,2013-10-18 19:59:46 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.29.1
8,"Yann Ruffieux, contributions from Debjani Bhow...",,2013-10-14 21:47:19 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.29.0
9,"Yann Ruffieux, contributions from Debjani Bhow...",,2013-10-14 21:29:21 +0000,"R (>= 2.6.0),stats","Biobase, graphics, grDevices, methods, stats, ...",GPL (>= 2),Yann Ruffieux <yann.ruffieux@epfl.ch>,Bioconductor-mirror,lapmix,lapmix,,1.28.0
