# Parse the requirements file
The previous notebook, 'PyPi_Metadata.ipynb', parsed the requirements out of every package on the PyPI server. Its output is a file that looks like this:

```
packages/astrodbkit-0.2.0

packages/astrodendro-0.1.0
aplpy
astropy
matplotlib
numpy

packages/astroid-1.4.4

packages/astroimtools-0.1
git+http://github.com/astropy/astropy.git#egg=astropy
astropy-helpers
cython>=0.23.4
distribute==0.0
matplotlib
numpy
```

Each block starts with the name and version of a Python package, followed by the dependencies I was able to parse. Many blocks list no dependencies; for now I will assume that is correct, even though I know it is not always true. Any package that defines its requirements programmatically in setup.py and has no requirements file shows up here with no dependencies.

The purpose of this notebook is largely just to parse that output file into a pandas dataframe.
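
As a rough illustration of the file layout described above, the sketch below groups blank-line-separated lines into `(header, requirements)` pairs using only the standard library. The helper name `iter_blocks` and the `SAMPLE` string are made up for this example; it keeps the raw requirement strings and does not use the `requirements` parser that the notebook relies on.

```python
# Hypothetical sample in the same layout as the real requirements file
SAMPLE = """\
packages/astrodbkit-0.2.0

packages/astrodendro-0.1.0
aplpy
astropy

packages/astroid-1.4.4
"""


def iter_blocks(text):
    """Yield (header_line, [requirement_lines]) for each blank-line-separated block."""
    block = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            # A blank line closes the current block
            yield block[0], block[1:]
            block = []
    if block:
        # The last block may not be followed by a blank line
        yield block[0], block[1:]


blocks = list(iter_blocks(SAMPLE))
for header, reqs in blocks:
    print(header, reqs)
```

Flushing the final block after the loop matters because the file is not guaranteed to end with a blank line.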


In [146]:
import pandas as pd
from collections import defaultdict
import os
import numpy as np
import requirements
import xmlrpclib

# I need this to separate the package name from its version
client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
packages = client.list_packages()

## 1: Parse the requirements for each package

In [36]:
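datadict = defaultdict(list)
seen_packages = set()  # packages that already have at least one row
package_name = None    # stays None until the first header line is read


def add_empty_row(name):
    # Record a package that has no parsed requirements as a single NaN row
    if name is not None and name not in seen_packages:
        datadict['package'].append(name)
        datadict['requirement'].append(np.nan)
        seen_packages.add(name)


with open('requirements.txt', 'r') as infile:
    new_package = True
    for line in infile:
        if line.strip() == '':
            # A blank line ends the current block
            new_package = True
            add_empty_row(package_name)
            continue

        if new_package:
            # If this is the case, the current line gives the name of the package
            package_name = os.path.basename(line).strip()
            new_package = False
        else:
            # This line gives a requirement for the current package
            try:
                for req in requirements.parse(line.strip()):
                    datadict['package'].append(package_name)
                    datadict['requirement'].append(req.name)
                    seen_packages.add(package_name)
            except ValueError:
                # Skip lines that the requirements parser cannot handle
                pass

# The file may not end with a blank line, so flush the last package too
add_empty_row(package_name)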

# Convert to dataframe
df = pd.DataFrame(data=datadict)
df.head()

Unnamed: 0,package,requirement
0,02exercicio-1.0.0,
1,0x10c-asm-0.0.2,
2,115wangpan-0.7.6,beautifulsoup4
3,115wangpan-0.7.6,homura
4,115wangpan-0.7.6,humanize


## 2: Get the base package name from the package string
The package column of the dataframe currently contains the name of the package together with its version string. I need to separate the two. For that, I will again use the package list from PyPI itself: for every known package and release, I build the string 'name-version' and look for it in the dataframe.

In [57]:
df['package_name'] = np.nan
df['package_version'] = np.nan
for i, package in enumerate(packages):
    if i % 100 == 0:
        print('Package {}: {}'.format(i+1, package))
    for release in client.package_releases(package):
        pkg_str = '{}-{}'.format(package, release)
        idx = df.loc[df.package == pkg_str].index
        if len(idx) > 0:
            df.loc[idx, 'package_name'] = package
            df.loc[idx, 'package_version'] = release
df.head()

Package 1: 0-._.-._.-._.-._.-._.-._.-0
Package 101: ABBYY
Package 201: acidfile
Package 301: addhrefs
Package 401: adspygoogle
Package 501: agd_tools
Package 601: aima
Package 701: aiorequests
Package 801: ajenti.plugin.datetime
Package 901: aldryn-cms
Package 1001: aliyun-oss
Package 1101: altered.states
Package 1201: ampl
Package 1301: android_sms_exporter
Package 1401: ansiblereporter
Package 1501: anybox.testing.datetime
Package 1601: apidev-djaloha
Package 1701: Appium-Python-Client
Package 1801: apycot
Package 1901: archiwe
Package 2001: argumentsprocessor
Package 2101: arrowhead
Package 2201: Ashser_AthleteList
Package 2301: assert_tools
Package 2401: asv_seo
Package 2501: AthleteClass
Package 2601: AttendanceTracker
Package 2701: authgoogle-middleware
Package 2801: automate
Package 2901: avatarsio
Package 3001: awsjump
Package 3101: azure-servicemanagement-legacy
Package 3201: backquotes
Package 3301: bambu-analytics
Package 3401: bard
Package 3501: basil_daq
Package 3601: bcbi

Unnamed: 0,package,requirement,package_name,package_version
0,02exercicio-1.0.0,,02exercicio,1.0.0
1,0x10c-asm-0.0.2,,0x10c-asm,0.0.2
2,115wangpan-0.7.6,beautifulsoup4,115wangpan,0.7.6
3,115wangpan-0.7.6,homura,115wangpan,0.7.6
4,115wangpan-0.7.6,humanize,115wangpan,0.7.6
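
Querying PyPI for every release of every package is slow. As a possible offline cross-check, the sketch below splits a 'name-version' string at the last hyphen that is followed by a digit. This is only a heuristic and not what the notebook does: the function name `split_name_version` is made up for this example, and the rule will mis-split strings whose version itself contains a hyphen followed by a digit (for example a hypothetical 'pkg-1.0-2').

```python
import re

# Greedy match: the name runs up to the LAST hyphen that is followed by a digit
_NAME_VERSION = re.compile(r'^(.*)-(\d.*)$')


def split_name_version(pkg_str):
    """Heuristically split 'name-version' into (name, version).

    Returns (pkg_str, None) when no hyphen followed by a digit is found.
    """
    match = _NAME_VERSION.match(pkg_str)
    if match is None:
        return pkg_str, None
    return match.group(1), match.group(2)


print(split_name_version('115wangpan-0.7.6'))   # ('115wangpan', '0.7.6')
print(split_name_version('0x10c-asm-0.0.2'))    # ('0x10c-asm', '0.0.2')
```

The PyPI lookup above remains the authoritative answer; a heuristic like this is mainly useful for spotting rows where the lookup found no match.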


In [58]:
# Save to file
df.to_csv('requirements.csv', index=False)

In [59]:
print(df.loc[df.requirement.notnull(), 'package'].unique().size)

20642


# Base dependencies

I have now parsed the formal dependencies for 20642 python packages. However, some of those dependencies have dependencies of their own. Let's go ahead and find the base dependencies: I will look up the requirements of each requirement, and keep going until no new dependencies turn up.

## Difficulties:

1. Cyclic dependencies: astropy requires wcsaxes, which itself requires astropy. A naive recursive solution would therefore never terminate. I use a Tree class that keeps track of what has already been searched to avoid infinite loops.

In [147]:
class Tree(object):
    def __init__(self, name):
        self.name = name
        self.children = []
        return

    def __contains__(self, obj):
        return obj == self.name or any([obj in c for c in self.children])
    
    def add(self, obj):
        if not self.__contains__(obj):
            self.children.append(Tree(obj))
            return True
        return False
    
    def get_base_requirements(self):
        base = []
        for child in self.children:
            if len(child.children) == 0:
                base.append(child.name)
            else:
                # Recurse into the child itself so that leaves at any depth are collected
                # (children is a list, so it must not be called)
                base.extend(child.get_base_requirements())
        return np.unique(base)
    

def get_requirements(package):
    return df.loc[(df.package_name == package) & (df.requirement.notnull()), 'requirement'].values


def get_dependency_tree(package, tree):
    reqs = get_requirements(package)
    for req in reqs:
        #print(req)
        flg = tree.add(req)
        if not flg:
            continue
        # Recurse with this same function; the shared tree guards against cycles
        tree = get_dependency_tree(req, tree)
    return tree

    

In [152]:
p = '115wangpan'
p = 'astroquery'
get_dependency_tree(p, Tree(p)).get_base_requirements()

array(['.', 'astropy', 'astropy-helpers', 'astropy_helpers', 'cython',
       'decorator', 'flask', 'httpbin', 'itsdangerous', 'markupsafe',
       'matplotlib', 'numpy', 'py', 'pytest', 'pytest-cov',
       'pytest-httpbin', 'pyyaml', 'regendoc', 'requests', 'six', 'sphinx',
       'sphinx-py3doc-enhanced-theme', 'wcsaxes', 'wheel'], 
      dtype='|S28')
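
The cycle problem can also be handled without a tree, by keeping a set of already-visited packages. The sketch below is a minimal, self-contained illustration of that idea and not the code used in this notebook: `transitive_requirements` and `toy_graph` are hypothetical names, and the graph is a hand-written toy that includes the astropy/wcsaxes cycle, not data from the dataframe.

```python
def transitive_requirements(package, graph):
    """Return every package reachable from `package`, excluding `package` itself.

    `graph` maps a package name to a list of its direct requirements.
    The `seen` set ensures each package is expanded once, so cycles terminate.
    """
    seen = {package}
    stack = [package]
    while stack:
        current = stack.pop()
        for req in graph.get(current, ()):
            if req not in seen:
                seen.add(req)
                stack.append(req)
    seen.discard(package)
    return sorted(seen)


# Toy dependency graph (made up for illustration), with a cycle between
# astropy and wcsaxes
toy_graph = {
    'astroquery': ['astropy', 'requests'],
    'astropy': ['numpy', 'wcsaxes'],
    'wcsaxes': ['astropy', 'matplotlib'],
    'matplotlib': ['numpy'],
}

print(transitive_requirements('astroquery', toy_graph))
# ['astropy', 'matplotlib', 'numpy', 'requests', 'wcsaxes']
```

An explicit stack is used here so that very deep dependency chains cannot hit Python's recursion limit.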

In [140]:
datadict = defaultdict(list)
for i, package in enumerate(df.package_name.unique()):
    if i % 100 == 0:
        print('Package {}: {}'.format(i+1, package))
    try:
        deptree = get_dependency_tree(package, Tree(package))
    except Exception:
        print('Failure getting base dependencies for {}'.format(package))
        # Re-raise the original error so its type and traceback are preserved
        raise
    for dependency in deptree.get_base_requirements():
        datadict['package_name'].append(package)
        datadict['requirements'].append(dependency)

base_df = pd.DataFrame(data=datadict)
base_df.head()

Package 1: 02exercicio
Package 101: aboutyou
Package 201: adagios
Package 301: Aduro
Package 401: agon-ratings
Package 501: aiohttp_mako
Package 601: ajenti.plugin.auth-users
Package 701: aldryn-tours
Package 801: allmychanges
Package 901: amitgroup
Package 1001: AndroidResR
Package 1101: antidogpiling
Package 1201: apidev-sanza
Package 1301: appomatic_autocomplete
Package 1401: archetypes.recurringdate
Package 1501: argvee
Package 1601: arts
Package 1701: aspose_words_java_for_python
Package 1801: asymm-enum
Package 1901: atomisator.outputs
Package 2001: authentic2-idp-freshdesk
Package 2101: autoprefixer
Package 2201: awesome-package
Package 2301: azure-batch-apps
Package 2401: badgermole
Package 2501: bancommons
Package 2601: bash-toolbelt
Package 2701: bcdoc
Package 2801: Beautils
Package 2901: berry_module
Package 3001: bibos_utils
Package 3101: biocma
Package 3201: bitfile
Package 3301: blender-bam
Package 3401: bluebream
Package 3501: Boodler
Package 3601: BotParse
Package 3701:

Unnamed: 0,package_name,requirements
0,115wangpan,.
1,115wangpan,beautifulsoup4
2,115wangpan,bottle
3,115wangpan,decorator
4,115wangpan,flaky


In [141]:
base_df.to_csv('base_requirements.csv', index=False)

In [144]:
tmp = pd.read_csv('requirements.csv')
print(len(tmp.package_name.unique()))

56232


In [145]:
print(len(tmp.loc[tmp.requirement.notnull(), 'package_name'].unique()))

20522
