# Which Python packages and objects are used the most?

Anyone interested in learning Python can easily find hundreds of tutorials on the web. However, deciding what to learn, and in which order, remains difficult. In particular, which packages are popular right now? To a beginner, for instance, the choice between urllib and requests is not obvious.

Similarly, most package documentation is organized alphabetically. That makes it easy to look up a function you already know, but harder to tell which functions to learn first in order to use the package efficiently.

We want to help answer these questions by looking at which packages and objects top Python users rely on today.

# Method
Specifically, we use the GitHub API to get the Python scripts of the 40 most starred repositories created in each month between November 2016 and October 2018. We limit ourselves to smaller projects (at most 20 .py files). We then parse the scripts (3800 files from 460 repositories) to locate the imports and their uses.

# Results
The top packages, and the objects most often used from each of them, are combined in the CSV file in this repository (popular_python_packages.csv).

For instance, we see that urllib was still used slightly more than requests over the past two years. We also see that deep learning packages are extremely popular among highly rated projects.


# Requirements
Running this notebook requires Python 3 with the requests and pandas packages, plus a GitHub API username and token, both of which are available for free on the GitHub website.

In [1]:
import requests
import json
import base64
import datetime
import re
from collections import Counter
import pandas as pd
import os

In [2]:
username = '<your github username>'
token = '<your github token>'
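
To avoid pasting credentials into the notebook itself, one option is to read them from environment variables. The sketch below is only a suggestion: the helper name load_github_credentials and the variable names GITHUB_USERNAME and GITHUB_TOKEN are assumptions made for this example, not something GitHub or this notebook prescribes.

```python
import os


def load_github_credentials(env=os.environ):
    """Return (username, token) read from environment variables.

    GITHUB_USERNAME and GITHUB_TOKEN are hypothetical names chosen
    for this sketch; use whatever names suit your setup.
    """
    username = env.get('GITHUB_USERNAME')
    token = env.get('GITHUB_TOKEN')
    if not username or not token:
        raise RuntimeError('Set GITHUB_USERNAME and GITHUB_TOKEN first')
    return username, token


# usage, once both variables are set in your shell:
# username, token = load_github_credentials()
```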

In [3]:
# create a function to flatten a list - useful in a few places below
def flatten(nested_list):
    """Flatten one level of nesting: [[1, 2], [3]] -> [1, 2, 3]"""
    flat_list = []
    for sublist in nested_list:
        for item in sublist:
            flat_list.append(item)
    return flat_list
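
For reference, the standard library offers the same one-level flattening through itertools.chain.from_iterable. The function name flatten_chain below is made up for this comparison; the notebook itself keeps using flatten.

```python
from itertools import chain


def flatten_chain(nested_list):
    """Flatten one level of nesting using the standard library."""
    return list(chain.from_iterable(nested_list))


print(flatten_chain([[1, 2], [3], []]))  # [1, 2, 3]
```

Both versions flatten a single level only, which is all the cells below need.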

### Step 1 - Get a list of python scripts

First, let's create a function to get the owners and repo names of top repositories in Github for a specific period of time and a specific language.

In [4]:
def get_most_followed_repos(start,stop,language,repo_count):
    """
    Get the top x starred github repos in a certain language/time

    Args:
        start: string yyyy-mm-dd - minimum repo creation date
        stop: string yyyy-mm-dd - maximum repo creation date
        language: string - main language of the repo
        repo_count: count of repos in the period (max=100)

    Returns:
        a list of dictionaries containing owner and repo name
        [{'owner':'abc','repo_name':'def'}, { } ...]
    """
    
    repo_count = min(repo_count,100)  # up to 100 results/page allowed by API

    response = requests.get(
            'https://api.github.com/search/repositories',
            params = {
                    'q': 'language:'+language+' created:'+start+'..'+stop,
                    'is': 'public',
                    'sort':'stars',
                    'order':'desc',  # the API expects 'asc' or 'desc'
                    'per_page':repo_count,
                    'page':1
                    },
            auth = (username,token)
                    )
    # fail with a clear HTTP error (bad credentials, rate limit...)
    # rather than with a KeyError on 'items' further down
    response.raise_for_status()
    data = response.json()

    owners_and_names = [item['full_name'].split("/") for item in data['items']]
    owners_and_names = [{'owner':x[0],'repo_name':x[1]} for x in owners_and_names]

    return owners_and_names
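
The q parameter above is a single string of space-separated search qualifiers. As a small offline illustration, the hypothetical helper build_search_query below (not part of the notebook) rebuilds the same string so it can be inspected without calling the API.

```python
def build_search_query(language, start, stop):
    """Build the `q` value used for the repository search.

    start and stop are 'yyyy-mm-dd' strings; the 'a..b' syntax
    selects repositories created between the two dates.
    """
    return 'language:{} created:{}..{}'.format(language, start, stop)


print(build_search_query('python', '2016-11-01', '2016-11-30'))
# language:python created:2016-11-01..2016-11-30
```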

Now let's create a function that gets the paths of the Python files within a repository.

In [5]:
def get_py_files_in_repos(owner,repo):
    """
    Get paths of all python files in a github repository

    Args:
        owner: string - the username of the repo owner
        repo: string - the repository name

    Returns:
        a list of file paths, relative to the repository root
    """
    # note: this assumes the repository has a branch named "master"
    response = requests.get(
        'https://api.github.com/repos/'+owner+'/'+repo+'/git/trees/master',
        params = {'recursive':1},
        auth = (username,token)
    )
    data = json.loads(response.content)

    # keep files only ('blob'), not directories ('tree')
    all_files = [x['path'] for x in data['tree'] if x['type']== 'blob']
    py_files = [x for x in all_files if x.endswith('.py')]

    return py_files
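
To make the filtering step concrete, here is the same logic applied to a hand-written response. The sample_tree payload is a made-up, heavily reduced stand-in for what the tree endpoint returns (real entries carry more fields), and python_paths is a hypothetical name used only in this sketch.

```python
# made-up payload, reduced to the two fields the function reads
sample_tree = {
    'tree': [
        {'path': 'README.md', 'type': 'blob'},
        {'path': 'src', 'type': 'tree'},        # a directory
        {'path': 'src/app.py', 'type': 'blob'},
        {'path': 'setup.py', 'type': 'blob'},
    ]
}


def python_paths(tree_response):
    """Keep the paths of files (blobs) that end with .py."""
    return [entry['path'] for entry in tree_response.get('tree', [])
            if entry['type'] == 'blob' and entry['path'].endswith('.py')]


print(python_paths(sample_tree))  # ['src/app.py', 'setup.py']
```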

Finally, let's iterate over the months of the study period to collect the owners, repo names, and Python file paths of the most starred repositories.
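
The cell that follows derives each month's first and last day by adding 40 days and snapping back to day 1. An alternative sketch, assuming nothing beyond the standard library, uses calendar.monthrange to get the month length directly; month_ranges is a hypothetical helper name.

```python
import calendar
import datetime


def month_ranges(year, month, count):
    """Return `count` consecutive (first_day, last_day) pairs as
    'yyyy-mm-dd' strings, starting at the given year and month."""
    ranges = []
    for _ in range(count):
        last_day = calendar.monthrange(year, month)[1]  # days in month
        first = datetime.date(year, month, 1)
        last = datetime.date(year, month, last_day)
        ranges.append((first.isoformat(), last.isoformat()))
        # move on to the next month, rolling over the year if needed
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return ranges


ranges = month_ranges(2016, 11, 24)
print(ranges[0])   # ('2016-11-01', '2016-11-30')
print(ranges[-1])  # ('2018-10-01', '2018-10-31')
```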

In [7]:
# create a list of months beginning and ends
study_starts = []
study_stops = []
start = datetime.date(year=2016, month=11, day=1)
months = 24
for i in range(months):
    study_starts.append(datetime.datetime.strftime(start,'%Y-%m-%d'))
    start = start + datetime.timedelta(days=40)
    start = start.replace(day=1)
    stop = start + datetime.timedelta(days=-1)
    study_stops.append(datetime.datetime.strftime(stop,'%Y-%m-%d'))
    
# get the top repositories
most_followed_repos = []   
for start, stop in zip(study_starts,study_stops):
    repos = get_most_followed_repos(start,stop,'python',40)
    most_followed_repos = most_followed_repos + repos
    
# add the python files paths
for repo in most_followed_repos:
    repo_name = repo['repo_name']
    repo_owner = repo['owner']
    try: 
        repo['files'] = get_py_files_in_repos(repo_owner,repo_name)
    except Exception: # some repositories seem to have additional restrictions
        repo['files'] = []
        
# limit ourselves to repositories with 1 to 20 .py files
most_followed_repos = [x for x in most_followed_repos if 0<len(x['files'])<=20]

py_files = flatten([r['files'] for r in most_followed_repos])
print('repositories count', len(most_followed_repos))
print('files count',len(py_files))


Last step in this section: we want to see which packages are most often imported, so we should exclude local imports. Let's add to each repository a list of exceptions (imports to disregard) built from its local file and folder names.

In [None]:
# add an exception list to each dictionary in most_followed_repos
for repo in  most_followed_repos:
    repo['exceptions'] = flatten([x[:-3].split('/') for x in repo['files']])

most_followed_repos[0]
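
As a quick illustration of what the exception list contains, here is the same expression applied to two made-up file paths (sample_files is invented for this example).

```python
sample_files = ['app.py', 'utils/helpers.py']

# drop the '.py' suffix, then split each path into folder and file names
parts = [path[:-3].split('/') for path in sample_files]
exceptions = [name for sublist in parts for name in sublist]

print(exceptions)  # ['app', 'utils', 'helpers']
```

One side effect worth keeping in mind: a repository that contains a file named, say, json.py would also have 'json' in its exception list, so imports of the standard json module would be skipped for that repository.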

### Step 2 - parse the extracted files to see which imports are most common

We now have a list of popular, small, recent Python repositories. We still have to parse them to see which imports are most common.
Still using the github API, let's create a function that uses the repo name, owner, and file paths to get the text of a given script.

In [None]:
def parse_files_in_repos(owner,repo,filename):
    """
    Get the raw text of a specific file in a specific GitHub repo

    Args:
        owner: string - the username of the repo owner
        repo: string - the repository name
        filename: path to the script from repository root

    Returns:
        the text of the script as a string ('' if not available)
    """

    response = requests.get(
            'https://api.github.com/repos/'
            + owner
            + '/'
            + repo
            + '/contents/'
            + filename,
            auth = (username,token)
            )

    data = json.loads(response.content)
    if 'content' in data: # not available for one liner files
        # the file body is returned base64-encoded
        text = base64.b64decode(data['content'])
        text = text.decode('utf-8')
    else:
        text = ''
   
    return text
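
The decoding step can be tried offline. In the sketch below, sample_response is a made-up, reduced stand-in for the API response (only the 'encoding' and 'content' fields are kept), and decode_contents is a hypothetical helper name.

```python
import base64

# made-up response; encodebytes inserts line breaks like the real API does
sample_response = {
    'encoding': 'base64',
    'content': base64.encodebytes(
        b'import os\nprint(os.getcwd())\n').decode('ascii'),
}


def decode_contents(response):
    """Return the decoded file text, or '' if no content is present."""
    if 'content' not in response:
        return ''
    # b64decode ignores the embedded line breaks by default
    return base64.b64decode(response['content']).decode('utf-8')


print(decode_contents(sample_response))
```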

Let's now create a function that goes through a script and uses some regular expressions to find the imports. Local imports, irrelevant to the study, are excluded. This function is fairly simple. In some very specific cases it may return wrong results (for instance if the word 'import' is used at the beginning of a line in a docstring). However, it does a good job of catching package imports while skipping local ones, and the few mistakes should have little effect on the totals given how many files we parse.

In [None]:
def is_local_import(from_clause,import_clause,exceptions):
    """
    Checks whether an import statement refers to a local module

    Args:
        from_clause: string - x in "from x import y" ('' if absent)
        import_clause: list of strings - y in "from x import y"
        exceptions: list of strings. filenames to disregard
   
    Returns:
        Boolean
    """
    
    # relative imports ("from . import x", "from ..a import b") are local
    if from_clause.startswith('.'):
        return True
    
    from_clause = from_clause.split('.')
    import_clause = [x.split('.') for x in import_clause]
    imports = from_clause + flatten(import_clause)
    return any(x in exceptions for x in imports)
  
def parse_imports(text,exceptions=()):
    """
    Identifies the imports in a python script

    Args:
        text: any string. Meant to be raw data of python file
        exceptions: list of strings. filenames of imports to disregard
   
    Returns:
        a dictionary containing import details
        keys: modules/submodules/functions imported
        values: how the import will appear in the script
        
    Remark: not 100% accurate.
    A docstring line starting with "import" would be counted, for instance

    # direct module import
    >>> parse_imports('import os')
    {'os': 'os'}

    # direct module import and renaming
    >>> parse_imports('import numpy as np')
    {'numpy': 'np'}

    # import of several objects from a module
    >>> parse_imports('from os import listdir,chdir')
    {'os.listdir': 'listdir', 'os.chdir': 'chdir'}

    # non identifiable imports
    >>> parse_imports('from os import *')
    {'os.*': 'unidentifiable'}

    # excluded local import
    >>> parse_imports('from a.utils import script',exceptions=['script'])
    {}
    """
    
    results = {}
    statements = text.split('\n')
    for statement in statements:
        # check if there is a from or import at start of line
        if re.search(r'^from |^import ',statement):

            # identify and remove "as clause" at end of statement
            as_clause = re.findall(
                    r'. as ([a-zA-Z0-9\._]+)', statement)
            if as_clause:
                statement = statement.split(' as ')[0]

            # identify and remove "from clause" at start of statement
            from_clause = re.findall(
                    r'^from ([a-zA-Z0-9\._]+) import', statement)
            
            if from_clause:
                from_clause = from_clause[0]
                statement = statement.split(' import ')[1]
            elif 'import ' in statement:
                statement = statement.split('import ')[1]
                from_clause = ''
            else:
                # line starts with "from" but is not an import statement
                continue

            # identify "import clause" at end of statement
            import_clause = re.split(r' *,+',statement)
            import_clause = [x.strip() for x in import_clause]

            # skip local imports
            if not is_local_import(from_clause,import_clause,exceptions):
                if import_clause[0] == '*':
                    as_clause = ['unidentifiable']

                # add from clause to each import - if applicable
                if from_clause:
                    imported = [from_clause + '.' + imp for imp in import_clause]
                else:
                    imported = import_clause

                # add as to imports whose names changed - if applicable
                if as_clause:
                    imported_as = as_clause
                else:
                    imported_as = import_clause

                for imp,imp_as in zip(imported,imported_as):
                    results[imp.strip()] = imp_as.strip()
 
    return results
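
For comparison, imports can also be collected with the standard ast module rather than regular expressions. This is a different technique from the one used in this notebook, sketched here under the assumption that the file parses with the running Python version; ast.parse raises SyntaxError otherwise, which would be common for Python 2 scripts. The function name imports_via_ast is hypothetical.

```python
import ast


def imports_via_ast(source):
    """Map each imported name to the name it appears under in the script."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found[alias.name] = alias.asname or alias.name
        elif isinstance(node, ast.ImportFrom):
            if node.level > 0:  # relative import, hence local
                continue
            for alias in node.names:
                key = node.module + '.' + alias.name
                if alias.name == '*':
                    found[key] = 'unidentifiable'
                else:
                    found[key] = alias.asname or alias.name
    return found


sample = (
    '"""Notes:\n'
    'import this_is_only_text\n'
    '"""\n'
    'import numpy as np, pandas as pd\n'
    'from os import listdir, chdir\n'
    'from . import helpers\n'
    'from collections import *\n'
)
print(imports_via_ast(sample))
```

In this sample the docstring line is ignored and both aliases on the multi-name import line are picked up. Unlike parse_imports, the sketch does not filter on an exceptions list.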

Let's create another function that goes through a script, and uses the imports to see how often and how they are used. 

In [None]:
def parse_script(imports, text):
    """
    Identifies how imports in a python script are used

    Args:
        imports: a dictionary - key is the import, value how it will appear
        text: a string. Meant to be raw data of python file

    Returns:
        a Counter object with import usage
        keys: calls found in the script, prefixed with the full import path
        values: number of occurrences of each call

    # function from module
    >>> parse_script({'os':'os'},"os.listdir('.')")
    Counter({'os.listdir()': 1})

    # function
    >>> parse_script({'flask.render_template':'render_template'},
    ...              "render_template('main.html')")
    Counter({'flask.render_template()': 1})

    # instance
    >>> parse_script({'sklearn.linear_model':'linear_model'},
    ...              "regr = linear_model.LinearRegression()")
    Counter({'sklearn.linear_model.LinearRegression()': 1})

    # non identifiable import
    >>> parse_script({'os.*':'unidentifiable'},"listdir('.')")
    Counter()
    """
    
    results = Counter()

    for key in imports:
        appears_as = imports[key]
        if appears_as != 'unidentifiable':
            # the name must start the text or follow whitespace or an
            # operator; the hyphen is escaped so that "+-=" is not a range
            regex_string = (r'(?:^|[\s+\-=*/])' + re.escape(appears_as)
                            + r'([\.\w]*\()')
            matches = re.findall(regex_string,text)
            matches = [key + match + ')' for match in matches]
            results = results + Counter(matches)
        
    return results
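
A note on character classes in patterns like the one above: inside square brackets, a hyphen placed between two characters denotes a range unless it is escaped or placed first or last. The short check below (pattern names are made up for this illustration) shows how much wider the class becomes when the hyphen between '+' and '=' is left unescaped.

```python
import re

unescaped = re.compile(r'[+-=]')   # every character from '+' to '='
escaped = re.compile(r'[+\-=]')    # exactly '+', '-' and '='

for ch in ['+', '-', '=', '.', ',', '7']:
    print(repr(ch),
          bool(unescaped.fullmatch(ch)),
          bool(escaped.fullmatch(ch)))
```

The unescaped class also accepts '.', ',' and digits, because they sit between '+' and '=' in character order.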

We can now iterate over our list of repos and screen each file for most common imports. The results will be stored in a Counter object.

In [None]:
results = Counter()
for d in most_followed_repos:
    for script in d['files']:
        try:
            raw = parse_files_in_repos(d['owner'],d['repo_name'],script)
            imports = parse_imports(raw,exceptions=d['exceptions'])
            import_use = parse_script(imports, raw)
            results = results + import_use
        except Exception: # skip files that cannot be fetched or parsed
            pass
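
This loop sends one request per file, so it may run into the API's rate limits. A possible safeguard, sketched below, is to read the X-RateLimit-Remaining and X-RateLimit-Reset response headers and pause when the quota is exhausted. The helper seconds_until_reset is hypothetical and is not wired into the loop above.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next request.

    headers: mapping of response headers (e.g. response.headers)
    now: current Unix time; defaults to time.time()
    """
    now = time.time() if now is None else now
    remaining = int(headers.get('X-RateLimit-Remaining', 1))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset holds the reset time in Unix epoch seconds
    reset_at = int(headers.get('X-RateLimit-Reset', now))
    return max(0.0, reset_at - now)


# usage inside a request loop (sketch):
# time.sleep(seconds_until_reset(response.headers))
```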

We can now use the collected data to see which packages are used the most:

In [None]:
most_common_packages = []
for key in results:
    most_common_packages = most_common_packages + [key.split('.')[0]]*results[key]
most_common_packages = Counter(most_common_packages)
most_common_packages.most_common(20)
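
The same per-package totals can be obtained without building an intermediate list, by adding each count to its package directly. The sample_results Counter below is invented for illustration and does not reflect the study's figures.

```python
from collections import Counter

# made-up usage counts, keyed like the entries of `results`
sample_results = Counter({
    'os.listdir()': 3,
    'os.path.join()': 2,
    'requests.get()': 4,
})

package_counts = Counter()
for key, count in sample_results.items():
    # the package is the part of the key before the first dot
    package_counts[key.split('.')[0]] += count

print(package_counts.most_common(2))  # [('os', 5), ('requests', 4)]
```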

Or, within each package, which submodules or functions are most commonly used:

In [None]:
df = pd.DataFrame()
for package in most_common_packages.most_common(20):
    p_name = package[0]
    package_functions= [x for x in results.elements() if x.split('.')[0]==p_name]
    package_functions = Counter(package_functions)
    df[p_name] = ([x[0] for x in package_functions.most_common(10)] + ['']*10)[0:10]

df.to_csv('popular_python_packages.csv')
df.head(10)