# Check for occurrence of CFF files in repositories linked from JOSS

1. Get list of DOIs from JOSS
2. Use code-cite to find occurrences of GitHub repos (or create new code if JOSS is structured)
3. Check each repo to find out whether it has a CFF file
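Step 3 comes down to a file-name test on each repository's root listing. The sketch below is a minimal illustration, assuming the Citation File Format convention that the file is named `CITATION.cff`. The helper name `find_cff` and the case-insensitive comparison are assumptions made for this example, not part of the notebook.

```python
def find_cff(filenames):
    """Return the entries of `filenames` that look like a CFF file.

    Assumption: a CFF file is named CITATION.cff. The comparison is
    case-insensitive so that variants such as CITATION.CFF are also caught.
    """
    return [name for name in filenames if name.lower() == "citation.cff"]


# Hypothetical root listings, for illustration only
print(find_cff(["README.md", "CITATION.cff", "setup.py"]))
print(find_cff(["README.md", "CITATION"]))
```

A plain `CITATION` file does not count as CFF here, so a prefix test such as `startswith('CITATION')` reports more files than this helper does.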

In [55]:
# JOSS papers are here: https://github.com/openjournals/joss-papers

from github import Github
import xml.etree.ElementTree as ET
from xml.dom import minidom
import base64
import pickle

# Path to the file holding the GitHub access token
gh_token = 'secrets/github_token'

with open(gh_token, 'r') as f:
    github_token = f.read().rstrip()

g = Github(github_token)
repo = g.get_repo('openjournals/joss-papers')

verbose = True

In [30]:
contents = repo.get_dir_contents('/')

In [56]:
# For each ContentFile that has path = "joss.*" and is a directory, load the XML and get the software repository

joss_repos = []

joss_prefix = 'joss.'

for content in contents:
    if content.type == "dir" and content.name.startswith(joss_prefix):
        if verbose: print("Analysing ", content.name)
        joss_id = content.name[len(joss_prefix):]
        paperdir_contents = repo.get_dir_contents(content.name)
        for file in paperdir_contents:
            if file.name.endswith(joss_id + '.xml'):
                paper_path = content.name + '/' + file.name
                if verbose: print("Paper path ", paper_path)
                paper = repo.get_file_contents(paper_path)
                paper_xml = base64.b64decode(paper.content)
                mydoc = minidom.parseString(paper_xml)
                items = mydoc.getElementsByTagName('software_repository')
                for item in items:
                    joss_repos.append(item.firstChild.data)
                    if verbose: print("Software repository: ",item.firstChild.data)
                        
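# Cache the repository list so that later cells can reload it without querying the GitHub API again
with open('joss_repos.txt', 'wb') as fp:
    pickle.dump(joss_repos, fp)
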
Analysing  joss.00011
Paper path  joss.00011/10.21105.joss.00011.xml
Software repository:  https://github.com/diana-hep/carl
Analysing  joss.00012
Paper path  joss.00012/10.21105.joss.00012.xml
Software repository:  http://github.com/jakevdp/mst_clustering
Analysing  joss.00016
Paper path  joss.00016/10.21105.joss.00016.xml
Software repository:  https://github.com/harpolea/r3d2
Analysing  joss.00017
Paper path  joss.00017/10.21105.joss.00017.xml
Software repository:  https://github.com/applicationskeleton/Skeleton
Analysing  joss.00018
Paper path  joss.00018/10.21105.joss.00018.xml
Software repository:  http://github.com/genomematt/xenomapper/
Analysing  joss.00020
Paper path  joss.00020/10.21105.joss.00020.xml
Software repository:  https://github.com/PeteHaitch/GenomicTuples
Analysing  joss.00021
Paper path  joss.00021/10.21105.joss.00021.xml
Software repository:  https://github.com/jtauber/pyuca
Analysing  joss.00023
Paper path  joss.00023/10.21105.joss.00023.xml
Software repository:

Software repository:  https://github.com/bmcfee/resampy
Analysing  joss.00133
Paper path  joss.00133/10.21105.joss.00133.xml
Software repository:  https://github.com/openeventdata/petrarch2
Analysing  joss.00135
Paper path  joss.00135/10.21105.joss.00135.xml
Software repository:  https://github.com/mllg/batchtools
Analysing  joss.00137
Paper path  joss.00137/10.21105.joss.00137.xml
Software repository:  https://github.com/baccuslab/pyret
Analysing  joss.00139
Paper path  joss.00139/10.21105.joss.00139.xml
Software repository:  https://github.com/biosustain/optlang
Analysing  joss.00140
Paper path  joss.00140/10.21105.joss.00140.xml
Software repository:  https://github.com/HERA-Team/pyuvdata
Analysing  joss.00141
Paper path  joss.00141/10.21105.joss.00141.xml
Software repository:  https://github.com/jeff-goldsmith/vbvs.concurrent
Analysing  joss.00142
Paper path  joss.00142/10.21105.joss.00142.xml
Software repository:  https://github.com/camillescott/shmlast
Analysing  joss.00146
Paper 

Software repository:  https://github.com/rasbt/biopandas
Analysing  joss.00280
Paper path  joss.00280/10.21105.joss.00280.xml
Software repository:  https://github.com/barbagroup/AmgXWrapper
Analysing  joss.00281
Paper path  joss.00281/10.21105.joss.00281.xml
Software repository:  https://github.com/OpenSpace/OpenSpace
Analysing  joss.00282
Paper path  joss.00282/10.21105.joss.00282.xml
Software repository:  https://github.com/bede/kindel
Analysing  joss.00289
Paper path  joss.00289/10.21105.joss.00289.xml
Software repository:  https://github.com/TACC/launcher
Analysing  joss.00291
Paper path  joss.00291/10.21105.joss.00291.xml
Software repository:  https://bitbucket.org/cloopsy/android/
Analysing  joss.00293
Paper path  joss.00293/10.21105.joss.00293.xml
Software repository:  https://github.com/ben-aaron188/netanos
Analysing  joss.00295
Paper path  joss.00295/10.21105.joss.00295.xml
Software repository:  https://github.com/nhejazi/biotmle
Analysing  joss.00296
Paper path  joss.00296/10

ExpatError: not well-formed (invalid token): line 37, column 13
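The `ExpatError` above stops the whole loop at the first paper whose XML is not well-formed. One possible way around this, sketched below under stated assumptions, is to parse each paper in a small helper that returns `None` for malformed XML and skips empty `software_repository` elements. The helper name `extract_software_repos` and the sample XML strings are made up for illustration. Only the `software_repository` tag name comes from the notebook.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def extract_software_repos(xml_bytes):
    """Return the software_repository values of one paper XML.

    Returns None if the XML is not well-formed, so the caller can log
    the paper and carry on with the next one.
    """
    try:
        doc = minidom.parseString(xml_bytes)
    except ExpatError:
        return None
    repos = []
    for item in doc.getElementsByTagName("software_repository"):
        node = item.firstChild
        # An empty element has no child node; only plain text is kept
        if node is not None and node.nodeType == node.TEXT_NODE:
            text = node.data.strip()
            if text:
                repos.append(text)
    return repos


# Hypothetical minimal documents, not real JOSS paper XML
good = b"<paper><software_repository>https://github.com/diana-hep/carl</software_repository></paper>"
bad = b"<paper><software_repository>a & b</software_repository></paper>"
print(extract_software_repos(good))
print(extract_software_repos(bad))
```

In the loop, a `None` result could be reported with the paper path and then skipped, so one malformed file no longer hides all later papers.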

In [57]:
joss_repos

['https://github.com/diana-hep/carl',
 'http://github.com/jakevdp/mst_clustering',
 'https://github.com/harpolea/r3d2',
 'https://github.com/applicationskeleton/Skeleton',
 'http://github.com/genomematt/xenomapper/',
 'https://github.com/PeteHaitch/GenomicTuples',
 'https://github.com/jtauber/pyuca',
 'https://github.com/phenoscape/scowl',
 'https://github.com/dfm/corner.py',
 'https://github.com/genenetwork/genenetwork2',
 'https://github.com/conradsnicta/armadillo/',
 'https://github.com/dib-lab/sourmash/',
 'https://github.com/SimonGreenhill/phylogemetric',
 'https://github.com/ctjacobs/git-rdm',
 'https://github.com/cMadan/prism',
 'https://github.com/zoran-cuckovic/QGIS-visibility-analysis',
 'https://github.com/EducationalTestingService/rsmtool',
 'https://github.com/msmbuilder/osprey',
 'https://github.com/jiemakel/las',
 'https://github.com/michaellevy/gwdegree',
 'https://github.com/juliasilge/tidytext',
 'https://github.com/DoSOCSv2/DoSOCSv2',
 'https://github.com/sbogutzky/P

In [58]:
with open('joss_repos.txt', 'wb') as fp:
    pickle.dump(joss_repos, fp)

In [59]:
with open('joss_repos.txt', 'rb') as fp:
    joss_repos = pickle.load(fp)

In [67]:
# Now check whether these repos have a CITATION.cff file.
# Note: the prefix test below also matches other files, such as a plain CITATION.

for joss_repo in joss_repos:
    print(joss_repo)
    if joss_repo.startswith('https://github.com/'):
        jr = joss_repo[len('https://github.com/'):]
    elif joss_repo.startswith('http://github.com/'):
        jr = joss_repo[len('http://github.com/'):]
    else:
        # Not a GitHub URL (e.g. Bitbucket): skip it rather than reuse the previous jr
        continue
    # Drop a trailing slash so that only 'owner/name' remains
    jr2 = jr.rstrip('/')
    print(jr2)
    repo = g.get_repo(jr2)
    contents = repo.get_dir_contents('/')
    for content in contents:
        if content.type == "file" and content.name.startswith('CITATION'):
            print(content.name)

https://github.com/diana-hep/carl
diana-hep/carl
http://github.com/jakevdp/mst_clustering
jakevdp/mst_clustering
https://github.com/harpolea/r3d2
harpolea/r3d2
https://github.com/applicationskeleton/Skeleton
applicationskeleton/Skeleton
CITATION
http://github.com/genomematt/xenomapper/
genomematt/xenomapper
https://github.com/PeteHaitch/GenomicTuples
PeteHaitch/GenomicTuples
https://github.com/jtauber/pyuca
jtauber/pyuca
https://github.com/phenoscape/scowl
phenoscape/scowl
https://github.com/dfm/corner.py
dfm/corner.py
https://github.com/genenetwork/genenetwork2
genenetwork/genenetwork2
https://github.com/conradsnicta/armadillo/
conradsnicta/armadillo
https://github.com/dib-lab/sourmash/
dib-lab/sourmash
https://github.com/SimonGreenhill/phylogemetric
SimonGreenhill/phylogemetric
https://github.com/ctjacobs/git-rdm
ctjacobs/git-rdm
https://github.com/cMadan/prism
cMadan/prism
CITATION
https://github.com/zoran-cuckovic/QGIS-visibility-analysis
zoran-cuckovic/QGIS-visibility-analysis
htt

UnknownObjectException: 404 {'message': 'Not Found', 'documentation_url': 'https://developer.github.com/v3/repos/contents/#get-contents'}

To do: create a library that takes any sort of Git URL and cleans it up.
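As a starting point for that clean-up step, here is a minimal sketch that reduces a repository URL to the `owner/name` form that `g.get_repo` expects. It is not a complete solution. The function name `github_full_name` is hypothetical, only GitHub hosts are recognised, and it does not check whether the repository still exists, so a 404 for a renamed or deleted repository would still need to be handled separately.

```python
from urllib.parse import urlparse


def github_full_name(url):
    """Return 'owner/name' for a GitHub URL, or None for anything else."""
    url = url.strip()
    ssh_prefix = "git@github.com:"
    if url.startswith(ssh_prefix):
        # scp-style SSH form: git@github.com:owner/name.git
        path = url[len(ssh_prefix):]
    else:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host not in ("github.com", "www.github.com"):
            return None
        path = parsed.path
    # Keep the first two non-empty path segments; ignore /tree/..., trailing slashes, etc.
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        return None
    owner, name = parts[0], parts[1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    if not name:
        return None
    return f"{owner}/{name}"


for url in [
    "http://github.com/genomematt/xenomapper/",
    "https://github.com/dfm/corner.py",
    "https://bitbucket.org/cloopsy/android/",
]:
    print(url, "->", github_full_name(url))
```

Returning `None` for non-GitHub hosts lets the calling loop skip those entries explicitly instead of reusing the name from the previous iteration.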