# Check for occurrence of CFF files in repositories linked from JOSS

1. Get the list of DOIs from JOSS
2. Use code-cite to find occurrences of GitHub repos (or write new code if the JOSS data is structured)
3. Check each repo to find out whether it has a CFF file

In [1]:
# JOSS papers are here: https://github.com/openjournals/joss-papers

from github import Github
import xml.etree.ElementTree as ET
from xml.dom import minidom
import base64
import pickle
from urllib.parse import urlparse


# Github Token
gh_token = 'secrets/github_token'

with open(gh_token, 'r') as f:
    github_token = f.read().rstrip()

g = Github(github_token)
repo = g.get_repo('openjournals/joss-papers')

verbose = True

In [2]:
# GitHub helper functions

def repo_url_to_name(url):
    """Convert a GitHub repository URL to "owner/name", or return None."""
    try:
        o = urlparse(url)
    except ValueError:
        print("Problem parsing GitHub repository URL: ", url)
        return None
    # hostname is lower-cased and excludes any port; it is None for URLs without a host
    if (o.hostname or "") not in ("github.com", "www.github.com"):
        print("Not a GitHub repository URL: ", url)
        return None
    # Drop empty segments so that a trailing slash does not count as a name
    p = [segment for segment in o.path.split("/") if segment]
    if len(p) < 2:
        print("Malformed GitHub repository URL: ", url)
        return None
    owner, name = p[0], p[1]
    # Strip only a trailing ".git"; split(".git") would also cut names like "foo.github.io"
    if name.endswith(".git"):
        name = name[:-len(".git")]
    return owner + "/" + name

In [None]:
urls = ["https://github.com/foo/bar",
        "git://www.github.com",
        "http://localhost:8888/notebooks/CheckJOSSforCFF.ipynb",
        "https://www.github.com/lots/of/fun.git",
        "https://www.github.com/real/fine.git",
        "https://www.github.com/real/fine.git.git",
        "https://www.github.com/real/fine.py"]

for url in urls:
    print(repo_url_to_name(url))

In [3]:
contents = repo.get_dir_contents('/')

In [4]:
# For each ContentFile that has path = "joss.*" and is a directory, load the XML and get the software repository

joss_repos = []

joss_prefix = 'joss.'

for content in contents:
    if content.type == "dir" and content.name.startswith(joss_prefix):
        if verbose: print("Analysing ", content.name)
        joss_id = content.name[len(joss_prefix):]
        try:
            paperdir_contents = repo.get_dir_contents(content.name)
        except Exception:
            if verbose: print("Problem getting contents of ", content.name)
            paperdir_contents = []
        for file in paperdir_contents:
            if file.name.endswith(joss_id + '.xml'):
                paper_path = content.name + '/' + file.name
                if verbose: print("Paper path ", paper_path)
                try:
                    paper = repo.get_file_contents(paper_path)
                    paper_xml = base64.b64decode(paper.content)
                    mydoc = minidom.parseString(paper_xml)
                    items = mydoc.getElementsByTagName('software_repository')
                    for item in items:
                        # Note: if more than one repository is associated with a paper, this
                        # collects all of them, but currently doesn't associate them with the paper
                        joss_repos.append(item.firstChild.data)
                        if verbose: print("Software repository: ",item.firstChild.data)
                except Exception:
                    # Also reached when a software_repository element is empty (firstChild is None)
                    if verbose: print("Problem retrieving file contents")

with open('joss_repos.txt', 'wb') as fp:
    pickle.dump(joss_repos, fp)

Analysing  joss.00011
Paper path  joss.00011/10.21105.joss.00011.xml
Software repository:  https://github.com/diana-hep/carl
Analysing  joss.00012
Paper path  joss.00012/10.21105.joss.00012.xml
Software repository:  http://github.com/jakevdp/mst_clustering
Analysing  joss.00016
Paper path  joss.00016/10.21105.joss.00016.xml
Software repository:  https://github.com/harpolea/r3d2
Analysing  joss.00017
Paper path  joss.00017/10.21105.joss.00017.xml
Software repository:  https://github.com/applicationskeleton/Skeleton
Analysing  joss.00018
Paper path  joss.00018/10.21105.joss.00018.xml
Software repository:  http://github.com/genomematt/xenomapper/
Analysing  joss.00020
Paper path  joss.00020/10.21105.joss.00020.xml
Software repository:  https://github.com/PeteHaitch/GenomicTuples
Analysing  joss.00021
Paper path  joss.00021/10.21105.joss.00021.xml
Software repository:  https://github.com/jtauber/pyuca
Analysing  joss.00023
Paper path  joss.00023/10.21105.joss.00023.xml
Software repository:

Software repository:  https://github.com/bmcfee/resampy
Analysing  joss.00133
Paper path  joss.00133/10.21105.joss.00133.xml
Software repository:  https://github.com/openeventdata/petrarch2
Analysing  joss.00135
Paper path  joss.00135/10.21105.joss.00135.xml
Software repository:  https://github.com/mllg/batchtools
Analysing  joss.00137
Paper path  joss.00137/10.21105.joss.00137.xml
Software repository:  https://github.com/baccuslab/pyret
Analysing  joss.00139
Paper path  joss.00139/10.21105.joss.00139.xml
Software repository:  https://github.com/biosustain/optlang
Analysing  joss.00140
Paper path  joss.00140/10.21105.joss.00140.xml
Software repository:  https://github.com/HERA-Team/pyuvdata
Analysing  joss.00141
Paper path  joss.00141/10.21105.joss.00141.xml
Software repository:  https://github.com/jeff-goldsmith/vbvs.concurrent
Analysing  joss.00142
Paper path  joss.00142/10.21105.joss.00142.xml
Software repository:  https://github.com/camillescott/shmlast
Analysing  joss.00146
Paper 

Software repository:  https://github.com/rasbt/biopandas
Analysing  joss.00280
Paper path  joss.00280/10.21105.joss.00280.xml
Software repository:  https://github.com/barbagroup/AmgXWrapper
Analysing  joss.00281
Paper path  joss.00281/10.21105.joss.00281.xml
Software repository:  https://github.com/OpenSpace/OpenSpace
Analysing  joss.00282
Paper path  joss.00282/10.21105.joss.00282.xml
Software repository:  https://github.com/bede/kindel
Analysing  joss.00289
Paper path  joss.00289/10.21105.joss.00289.xml
Software repository:  https://github.com/TACC/launcher
Analysing  joss.00291
Paper path  joss.00291/10.21105.joss.00291.xml
Software repository:  https://bitbucket.org/cloopsy/android/
Analysing  joss.00293
Paper path  joss.00293/10.21105.joss.00293.xml
Software repository:  https://github.com/ben-aaron188/netanos
Analysing  joss.00295
Paper path  joss.00295/10.21105.joss.00295.xml
Software repository:  https://github.com/nhejazi/biotmle
Analysing  joss.00296
Paper path  joss.00296/10

Paper path  joss.00431/10.21105.joss.00431.xml
Software repository:  https://github.com/hawk31/pyGPGO
Analysing  joss.00432
Paper path  joss.00432/10.21105.joss.00432.xml
Software repository:  https://github.com/mdbloice/Augmentor
Analysing  joss.00433
Paper path  joss.00433/10.21105.joss.00433.xml
Software repository:  https://github.com/ljvmiranda921/pyswarms
Analysing  joss.00436
Paper path  joss.00436/10.21105.joss.00436.xml
Software repository:  https://github.com/Edinburgh-Imaging/Masks2Metrics
Analysing  joss.00437
Paper path  joss.00437/10.21105.joss.00437.xml
Software repository:  https://github.com/iljah/hdintegrator
Analysing  joss.00440
Paper path  joss.00440/10.21105.joss.00440.xml
Software repository:  https://github.com/sns-chops/multiphonon
Analysing  joss.00441
Paper path  joss.00441/10.21105.joss.00441.xml
Software repository:  https://github.com/UniNE-CHYN/hytool
Analysing  joss.00448
Paper path  joss.00448/10.21105.joss.00448.xml
Software repository:  http://www.uni

Paper path  joss.00584/10.21105.joss.00584.xml
Software repository:  https://github.com/WinVector/vtreat/
Analysing  joss.00588
Paper path  joss.00588/10.21105.joss.00588.xml
Software repository:  https://github.com/pynucastro/pynucastro
Analysing  joss.00592
Paper path  joss.00592/10.21105.joss.00592.xml
Software repository:  https://github.com/edoddridge/aronnax
Analysing  joss.00593
Paper path  joss.00593/10.21105.joss.00593.xml
Software repository:  https://github.com/iferres/phylen
Analysing  joss.00596
Paper path  joss.00596/10.21105.joss.00596.xml
Software repository:  https://github.com/casics/nostril
Analysing  joss.00597
Paper path  joss.00597/10.21105.joss.00597.xml
Software repository:  https://github.com/FastNFT/FNFT
Analysing  joss.00598
Paper path  joss.00598/10.21105.joss.00598.xml
Software repository:  https://github.com/JuliaDynamics/DynamicalSystems.jl
Analysing  joss.00602
Paper path  joss.00602/10.21105.joss.00602.xml
Software repository:  https://github.com/FluxML
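
The loop above collects every repository URL into one flat list, so the link between a paper and its repositories is lost. A minimal sketch of one way to keep that link is shown here, assuming a dictionary keyed by the JOSS ID is acceptable. The helper name `repos_from_paper_xml`, the dictionary `repos_by_paper`, the ID `00000` and the sample XML are all hypothetical; the real JOSS XML files contain more than this simplified stand-in.

```python
from xml.dom import minidom


def repos_from_paper_xml(paper_xml):
    """Return the non-empty software_repository values found in one paper's XML."""
    doc = minidom.parseString(paper_xml)
    urls = []
    for item in doc.getElementsByTagName("software_repository"):
        # An empty element has no child node, so guard before reading .data
        if item.firstChild is None:
            continue
        text = item.firstChild.data.strip()
        if text:
            urls.append(text)
    return urls


# Hypothetical, simplified stand-in for a JOSS paper XML file
sample_xml = b"""<paper>
  <software_repository>https://github.com/foo/bar</software_repository>
  <software_repository>
  </software_repository>
  <software_repository/>
</paper>"""

# Map each (hypothetical) JOSS ID to the repositories named in its XML
repos_by_paper = {"00000": repos_from_paper_xml(sample_xml)}
print(repos_by_paper)
```

Stripping whitespace also avoids entries that consist only of a line break, which would otherwise be stored as if they were URLs.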

In [5]:
joss_repos

['https://github.com/diana-hep/carl',
 'http://github.com/jakevdp/mst_clustering',
 'https://github.com/harpolea/r3d2',
 'https://github.com/applicationskeleton/Skeleton',
 'http://github.com/genomematt/xenomapper/',
 'https://github.com/PeteHaitch/GenomicTuples',
 'https://github.com/jtauber/pyuca',
 'https://github.com/phenoscape/scowl',
 'https://github.com/dfm/corner.py',
 'https://github.com/genenetwork/genenetwork2',
 'https://github.com/conradsnicta/armadillo/',
 'https://github.com/dib-lab/sourmash/',
 'https://github.com/SimonGreenhill/phylogemetric',
 'https://github.com/ctjacobs/git-rdm',
 'https://github.com/cMadan/prism',
 'https://github.com/zoran-cuckovic/QGIS-visibility-analysis',
 'https://github.com/EducationalTestingService/rsmtool',
 'https://github.com/msmbuilder/osprey',
 'https://github.com/jiemakel/las',
 'https://github.com/michaellevy/gwdegree',
 'https://github.com/juliasilge/tidytext',
 'https://github.com/DoSOCSv2/DoSOCSv2',
 'https://github.com/sbogutzky/P

In [6]:
with open('joss_repos.txt', 'wb') as fp:
    pickle.dump(joss_repos, fp)

In [7]:
with open('joss_repos.txt', 'rb') as fp:
    joss_repos = pickle.load(fp)
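
The cells above store the list with `pickle`, so `joss_repos.txt` is a binary file despite its extension. If a human-readable file is preferred, a plain list of strings can also be saved as JSON. This is only a sketch of that alternative: the file name `joss_repos.json`, the temporary directory and the placeholder URLs are assumptions made for illustration.

```python
import json
import os
import tempfile

# Placeholder URLs standing in for the real joss_repos list
repos = ["https://github.com/foo/bar", "https://github.com/real/fine.git"]

# Hypothetical file name, written to a temporary directory for this example
path = os.path.join(tempfile.mkdtemp(), "joss_repos.json")

with open(path, "w", encoding="utf-8") as fp:
    json.dump(repos, fp, indent=2)

with open(path, "r", encoding="utf-8") as fp:
    loaded = json.load(fp)

print(loaded == repos)
```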

In [8]:
# Now check if these repos have a CITATION.cff file!

for joss_repo_url in joss_repos:
    if verbose: print("Checking ", joss_repo_url)

    joss_repo_name = repo_url_to_name(joss_repo_url)
    if joss_repo_name is None:
        if verbose: print("Skipping ", joss_repo_url)
    else:
        try:
            # Use separate names so the JOSS papers `repo` and `contents` are not overwritten
            joss_repo = g.get_repo(joss_repo_name)
            repo_contents = joss_repo.get_dir_contents('/')
            for content in repo_contents:
                # Matches any file whose name starts with CITATION, not only CITATION.cff
                if content.type == "file" and content.name.startswith('CITATION'):
                    print(content.name)
        except Exception:
            if verbose: print("Problem analysing ", joss_repo_url)

Checking  https://github.com/diana-hep/carl
Checking  http://github.com/jakevdp/mst_clustering
Checking  https://github.com/harpolea/r3d2
Checking  https://github.com/applicationskeleton/Skeleton
CITATION
Checking  http://github.com/genomematt/xenomapper/
Checking  https://github.com/PeteHaitch/GenomicTuples
Checking  https://github.com/jtauber/pyuca
Checking  https://github.com/phenoscape/scowl
Checking  https://github.com/dfm/corner.py
Checking  https://github.com/genenetwork/genenetwork2
Checking  https://github.com/conradsnicta/armadillo/
Checking  https://github.com/dib-lab/sourmash/
Checking  https://github.com/SimonGreenhill/phylogemetric
Checking  https://github.com/ctjacobs/git-rdm
Checking  https://github.com/cMadan/prism
CITATION
Checking  https://github.com/zoran-cuckovic/QGIS-visibility-analysis
Checking  https://github.com/EducationalTestingService/rsmtool
Checking  https://github.com/msmbuilder/osprey
Checking  https://github.com/jiemakel/las
Checking  https://github.com

Checking  https://github.com/SebastianoF/bruker2nifti
Checking  https://github.com/ropensci/visdat
Checking  https://github.com/adrn/schwimmbad
Checking  https://github.com/ropensci/iheatmapr
Checking  https://github.com/pacificclimate/ClimDown
Checking  https://github.com/danilofreire/prisonbrief
Checking  https://github.com/Chilipp/psyplot.git
Checking  https://github.com/conradsnicta/armadillo-gmm
Checking  https://github.com/ornl-ndav/django-remote-submission
Checking  https://savannah.nongnu.org/projects/complot/
Not a GitHub repository URL:  https://savannah.nongnu.org/projects/complot/
Skipping  https://savannah.nongnu.org/projects/complot/
Checking  https://github.com/bjmorgan/bsym
CITATION.cff
Checking  https://github.com/GabrieleRovigatti/prodest
Checking  https://github.com/go-hep/hep
Checking  https://github.com/array-split/array_split
Checking  https://github.com/jakobbossek/mcMST
Checking  https://github.com/raamana/hiwenet
Checking  https://github.com/raamana/pyradigm
Ch
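
The output above contains both plain `CITATION` files and a `CITATION.cff` file, because the check matches every file name that starts with `CITATION`. To count only CFF files, the names can be filtered further. The sketch below assumes a case-insensitive comparison against `CITATION.cff` is wanted; the helper name `find_cff_files` and the sample file names are hypothetical.

```python
def find_cff_files(filenames):
    """Return the names that match CITATION.cff, ignoring case."""
    return [name for name in filenames if name.lower() == "citation.cff"]


# Hypothetical root directory listings
print(find_cff_files(["README.md", "CITATION", "setup.py"]))
print(find_cff_files(["README.md", "CITATION.cff", "LICENSE"]))
```

In the checking loop this could be applied to the list of `content.name` values of the files in a repository's root directory.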

Have to create a library that takes any sort of Git URL and cleans it up!
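
As a starting point for such a library, here is a minimal sketch of a normaliser that is not tied to GitHub. It assumes that the first two path segments are the owner and the repository name, which holds for GitHub and Bitbucket URLs but not for every host (for example, a Savannah project URL would be split into `projects` and the project name). The function name `normalise_git_url` and the handling of scp-style remotes such as `git@host:owner/name.git` are assumptions; no such URLs appear in the JOSS data shown above.

```python
import re
from urllib.parse import urlparse

# scp-style remotes (git@host:owner/name.git) have no scheme, so match them separately
SCP_STYLE = re.compile(r"^(?:[\w.-]+@)?(?P<host>[\w.-]+):(?P<path>[^/].*)$")


def normalise_git_url(url):
    """Return (host, owner, name) for a repository URL, or None if it cannot be parsed."""
    url = url.strip()
    if "://" in url:
        try:
            parsed = urlparse(url)
        except ValueError:
            return None
        host, path = parsed.hostname or "", parsed.path
    else:
        match = SCP_STYLE.match(url)
        if match is None:
            return None
        host, path = match.group("host"), match.group("path")

    host = host.lower()
    if host.startswith("www."):
        host = host[len("www."):]

    # Drop empty segments produced by leading, trailing, or doubled slashes
    parts = [segment for segment in path.split("/") if segment]
    if not host or len(parts) < 2:
        return None

    owner, name = parts[0], parts[1]
    # Remove a single trailing ".git" without touching names such as "foo.github.io"
    if name.endswith(".git"):
        name = name[:-len(".git")]
    if not name:
        return None
    return host, owner, name


for example in ["https://github.com/foo/bar",
                "http://github.com/genomematt/xenomapper/",
                "git@github.com:foo/bar.git",
                "git://www.github.com"]:
    print(example, "->", normalise_git_url(example))
```

Returning the host separately leaves it to the caller to decide which hosts to query, so the GitHub check can simply skip tuples whose host is not `github.com`.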