
Top pandas, numpy and scipy functions and modules used in github repos

Introduction

Github data on Google BigQuery

GitHub recently made its files queryable through Google BigQuery, a SQL-like distributed query engine. See the announcement from GitHub. You can execute the BigQuery queries at the BigQuery console.

What are numpy, scipy and pandas, and why are top functions useful?

numpy, scipy and pandas are popular python packages for data analysis. They are quite big, so it may be hard to distinguish between functions you would use all the time and functions that you use once in a while.

When looking at the results in this post, my reaction to the majority of entries was “well, duh, obviously”, but a few entries showed me that I had been doing things the wrong way. For example, I must shamefully admit that I somehow missed the pandas .iloc indexer and had been calling .reset_index instead to get the n-th row.

It would also be very cool to sort code completions for python packages by how often each function is used on github, and I may hack something together for Emacs soon.

Methodology

Using the new github data on BigQuery I calculated the most popular numpy, scipy and pandas functions and modules used in github repos.

Results are approximate. I first look for python files that import pandas (or numpy/scipy in their respective sections). Then I extract matches of the regular expression:

r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)'

It captures every character that follows pd. or pandas., up to the first character that belongs to the negated character group.
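As a rough illustration, the same pattern can be tried locally with Python's re module. This is only a sketch: the helper name extract_name and the sample lines are made up for the example, and re.search stands in for BigQuery's REGEXP_EXTRACT (both return the first match on a line).

```python
import re

# Same pattern as in the post, compiled with Python's re module.
PANDAS_CALL = re.compile(r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)')


def extract_name(line):
    """Return the first captured name on the line, or None (hypothetical helper)."""
    match = PANDAS_CALL.search(line)
    return match.group(1) if match else None


print(extract_name("df = pd.read_csv('data.csv')"))                # read_csv
print(extract_name("idx = pandas.MultiIndex.from_tuples(pairs)"))  # MultiIndex.from_tuples
print(extract_name("total = spd.sum()"))                           # None: "pd" follows a letter
```

Note that the pattern requires a non-letter character before pd or pandas, so a call that starts at the very first column of a line is not matched.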

It may include modules or some false positives

When constructing the regular expression, my priority was to avoid false negatives, so there are some false positives like pandas.pydata.org. It may also pick up modules from import lines. For example, it extracts “io” from import scipy.io.
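The false positives can be reproduced with a few sample lines. The sketch below assumes Python's re module behaves like BigQuery's regex functions for this pattern; the helper first_capture and the input lines are invented for illustration.

```python
import re

NAME = r'([^",\(\,\`) \':\[\]/={}]*)'
PANDAS_RE = re.compile(r'[^a-zA-Z](?:pd|pandas)\.' + NAME)
SCIPY_RE = re.compile(r'[^a-zA-Z](?:sp|scipy)\.' + NAME)


def first_capture(pattern, line):
    """Return the first captured group on the line, or None (hypothetical helper)."""
    match = pattern.search(line)
    return match.group(1) if match else None


# A URL inside a docstring is counted as if it were an attribute access.
print(first_capture(PANDAS_RE, "    See http://pandas.pydata.org/pandas-docs/stable/ for details."))  # pydata.org
print(first_capture(SCIPY_RE, "Documentation lives at www.scipy.org"))  # org

# Import lines yield module names rather than functions.
print(first_capture(SCIPY_RE, "import scipy.io"))                      # io
print(first_capture(SCIPY_RE, "from scipy.sparse import csr_matrix"))  # sparse
```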

Data source

I use the contents_py table from https://bigquery.cloud.google.com/dataset/fh-bigquery:github_extracts. It is a smaller BigQuery table that contains only python files stored on github.

Link to example usage

The URL to the example usage may sometimes be broken, because it assumes that the file lives on the master branch. Other tables, like sample_contents, include the branch in a sample_ref field, which would let me generate the correct link.

contents_py only lists the repository and the file path. Until the request in https://twitter.com/kozikow/status/749016021852418048 is resolved, I generate the link on the assumption that all files are on the master branch.

I could potentially also link to the exact line number, but with the current table schema that would be a lot of work. It would be much easier if the content were exposed as a repeated field of (line contents, line number). I asked the Google BigQuery team about it in https://twitter.com/kozikow/status/749896018381144064 .
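The link construction amounts to simple string concatenation. A minimal sketch in Python follows; the function name example_url and its ref parameter are hypothetical, with ref standing in for a branch name that contents_py does not provide.

```python
def example_url(repo_name, path, ref="master"):
    """Build a link to a file on github.

    `ref` defaults to "master" because contents_py does not record the branch;
    this default is exactly the assumption that can produce broken links.
    """
    return "https://github.com/{}/blob/{}/{}".format(repo_name, ref, path)


print(example_url("konchris/RunMeas", "RunMeas/Buffer.py"))
# https://github.com/konchris/RunMeas/blob/master/RunMeas/Buffer.py
```

If a table exposes the branch, passing it as ref would avoid the master-branch guess.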
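To show what a (line contents, line number) pairing would allow, here is a small Python sketch that works on a file's content outside of BigQuery. The helper names are hypothetical; the only outside fact relied on is that github file URLs accept a #L<number> anchor.

```python
import re


def numbered_lines(content):
    """Pair every line with its 1-based line number."""
    return list(enumerate(content.split("\n"), start=1))


def link_to_first_match(file_url, content, pattern):
    """Append a #L<n> anchor for the first line matching `pattern`.

    Falls back to the plain file URL when nothing matches.
    """
    for number, text in numbered_lines(content):
        if re.search(pattern, text):
            return "{}#L{}".format(file_url, number)
    return file_url


source = "import pandas as pd\n\ndf = pd.DataFrame()"
print(link_to_first_match("https://github.com/a/b/blob/master/c.py",
                          source, r'[^a-zA-Z](?:pd|pandas)\.'))
# https://github.com/a/b/blob/master/c.py#L3
```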

This post on github

Revision history of this blog post is stored on github.

Top pandas functions and modules

Results are approximate and based on the heuristic assumption that people usually prefix python pandas functions with “pd.” or “pandas.”.

SELECT
  REGEXP_EXTRACT(line,
        r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/",
        FIRST(sample_repo_name),
        "/blob/master/",
        FIRST(sample_path)) AS example_url,
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import pandas"
  HAVING
    NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:pd|pandas)\.') )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
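For readers without BigQuery access, the logic of the query can be approximated in plain Python on a handful of in-memory files. This is a sketch rather than a faithful port: the function top_functions and the sample data are invented, str.lstrip strips any whitespace where LTRIM may only strip spaces, and ties are broken by name to keep the output deterministic.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r'[^a-zA-Z](?:pd|pandas)\.([^",\(\,\`) \':\[\]/={}]*)')


def top_functions(files):
    """files: iterable of (repo_name, path, content) tuples.

    Returns rows of (function, count_distinct_repos, count_total),
    most widely used first.
    """
    repos = defaultdict(set)
    totals = defaultdict(int)
    for repo_name, _path, content in files:
        if "import pandas" not in content:        # WHERE content CONTAINS ...
            continue
        for line in content.split("\n"):          # SPLIT(content, '\n')
            if line.lstrip().startswith("#"):     # skip comment lines
                continue
            match = PATTERN.search(line)          # REGEXP_EXTRACT: first match only
            if match:
                repos[match.group(1)].add(repo_name)
                totals[match.group(1)] += 1
    rows = [(name, len(repos[name]), totals[name]) for name in repos]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


sample = [
    ("alice/a", "a.py", "import pandas as pd\ndf = pd.read_csv('x.csv')\n# y = pd.Series()\nz = pd.DataFrame(df)\n"),
    ("bob/b", "b.py", "import pandas\nframe = pandas.DataFrame()\nother = pandas.DataFrame()\n"),
    ("carol/c", "c.py", "import numpy as np\nx = pd.DataFrame()\n"),
]
print(top_functions(sample))
# [('DataFrame', 2, 3), ('read_csv', 1, 1)]
```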

Full result list in google docs. Top 20 results:

function | count_distinct_repos | count_total | example_url
DataFrame | 5486 | 47478 | https://github.com/konchris/RunMeas/blob/master/RunMeas/Buffer.py
read_csv | 4056 | 17567 | https://github.com/fcollman/MakeAT/blob/master/make_make_file.py
Series | 2248 | 19124 | https://github.com/AllenDowney/ThinkBayes2/blob/master/code/thinkplot.py
concat | 1869 | 7456 | https://github.com/mhallsmoore/qstrader/blob/master/price_handler/price_handler.py
to_datetime | 774 | 3176 | https://github.com/cbyn/bitpredict/blob/master/model/features.py
merge | 650 | 2642 | https://github.com/dmnfarrell/mirnaseq/blob/master/mirdeep2.py
date_range | 548 | 3233 | https://github.com/and2egg/philharmonic/blob/master/philharmonic/simulator/environment.py
read_table | 499 | 1683 | https://github.com/cdeboever3/cdpybio/blob/master/cdpybio/express.py
util.testing | 477 | 1856 | https://github.com/sauloal/cnidaria/blob/master/scripts/venv/lib/python2.7/site-packages/pandas/tseries/tests/test_timeseries_legacy.py
isnull | 468 | 1459 | https://github.com/Weissger/ext2rdf/blob/master/src/RDFConverter/TripleStructureConverter.py
DataFrame.from_dict | 399 | 1455 | https://github.com/mdbartos/vic_utils/blob/master/deprecated/mohseni_reg.py
Timestamp | 387 | 7029 | https://github.com/paulperry/quant/blob/master/vti_agg_7030.py
DatetimeIndex | 336 | 1629 | https://github.com/readevalprint/zipline/blob/master/zipline/utils/tradingcalendar.py
Index | 322 | 2772 | https://github.com/caseyclements/dask/blob/master/dask/dataframe/shuffle.py
read_excel | 302 | 946 | https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Lab/SPF_forecasts.py
notnull | 284 | 713 | https://github.com/DataViva/dataviva-scripts/blob/master/scripts/secex_monthly/_rdo_temp.py
DataFrame.from_csv | 265 | 802 | https://github.com/idbedead/RNA-sequence-tools/blob/master/RNA_Seq_analysis/make_monocle_data_js.py
HDFStore | 251 | 783 | https://github.com/konchris/TDMS2HDF5/blob/master/TDMS2HDF5/tdms2hdf5.py
DataFrame.from_records | 249 | 534 | https://github.com/phaustin/A405/blob/master/notebooks/python/dropgrowC.py
MultiIndex.from_tuples | 237 | 744 | https://github.com/ZoomerAnalytics/xlwings/blob/master/xlwings/tests/test_xlwings.py
rolling_mean | 233 | 651 | https://github.com/Ernestyj/PyStudy/blob/master/finance/DaysTest/DaysDataPrepare.py

Top pandas data frame functions

Results are again approximate and based on the heuristic assumption that data frames are usually named with the suffix “df”. To filter out noise, only files containing “import pandas” and matching the regexp r".*df\s=.*(?:pandas|pd)\." are included.

SELECT
  REGEXP_EXTRACT(line, r"df\.([a-zA-Z-_\.]+)") AS pandas_function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  CONCAT("https://github.com/",
          FIRST(sample_repo_name),
          "/blob/master/",
          FIRST(sample_path)) AS example_url
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import pandas"
    AND REGEXP_MATCH(content, r".*df\s=.*(?:pandas|pd)\.")
  HAVING
    line CONTAINS "df.")
GROUP BY 1
HAVING LENGTH(pandas_function) > 1
ORDER BY 2 DESC
LIMIT 1000;
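The two heuristics in this query (the file-level filter and the per-line extraction) can be sketched in Python as follows. The function dataframe_methods and the sample file are hypothetical, and Python's re module is assumed to behave like BigQuery's regex functions for these patterns.

```python
import re

# File-level filter: some variable ending in "df" is assigned from pandas.
FILE_FILTER = re.compile(r".*df\s=.*(?:pandas|pd)\.")
# Per-line extraction: whatever follows "df." (dots included, so chains are kept).
METHOD = re.compile(r"df\.([a-zA-Z-_\.]+)")


def dataframe_methods(content):
    """Return the names extracted after "df." for one file's content."""
    if "import pandas" not in content or not FILE_FILTER.search(content):
        return []
    names = []
    for line in content.split("\n"):
        if "df." not in line:                       # HAVING line CONTAINS "df."
            continue
        match = METHOD.search(line)
        if match and len(match.group(1)) > 1:       # HAVING LENGTH(...) > 1
            names.append(match.group(1))
    return names


sample = (
    "import pandas as pd\n"
    "df = pd.read_csv('x.csv')\n"
    "print(df.shape)\n"
    "df.groupby('a').sum()\n"
    "top = df.iloc[0]\n"
    "cols = df.columns.tolist()\n"
)
print(dataframe_methods(sample))
# ['shape', 'groupby', 'iloc', 'columns.tolist']
```

Because the character class allows dots, a chained access such as df.columns.tolist() is counted as its own entry rather than as columns.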

Full result list in Google Docs. Top 20 results:

pandas_function | count_distinct_repos | example_url
columns | 1290 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
index | 958 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
to_csv | 945 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
loc | 729 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
groupby | 614 | https://github.com/fepz/AyCC/blob/master/process_results.py
set_index | 571 | https://github.com/LinJM/clothesDetection/blob/master/caffe-fast-rcnn/python/detect.py
drop | 473 | https://github.com/lukassnoek/skbold/blob/master/skbold/exp_model/parse_presentation_logfile.py
ix | 450 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
iloc | 418 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
shape | 387 | https://github.com/sdpython/ensae_projects/blob/master/_unittests/ut_data/test_data_helper.py
iterrows | 348 | https://github.com/rmhyman/DataScience/blob/master/Lesson1/titanic_data_heuristic1.py
sort | 341 | https://github.com/CGATOxford/cgat/blob/master/scripts/data2spike.py
append | 340 | https://github.com/MadsJensen/CAA/blob/master/calc_ali.py
copy | 298 | https://github.com/wavelets/lifelines/blob/master/tests/test_estimation.py
rename | 288 | https://github.com/Kirubaharan/hydrology/blob/master/Lake_bathymetry/dt_bathymetry/bathymetry_gps_merge.py
reset_index | 283 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
apply | 278 | https://github.com/lukovkin/ufcnn-keras/blob/master/models/UFCNN_predict.py
dropna | 273 | https://github.com/nelsonag/openmc/blob/master/openmc/filter.py
head | 263 | https://github.com/Kirubaharan/hydrology/blob/master/Lake_bathymetry/dt_bathymetry/bathymetry_gps_merge.py
values | 259 | https://github.com/fialhorenato/Vermont_V2_ViewER_MutatiON_Tool/blob/master/LSCWeb/venv/lib/python2.7/site-packages/pandas/io/tests/test_parsers.py
fillna | 228 | https://github.com/thesgc/cbh_chembl_ws_extension/blob/master/cbh_chembl_ws_extension/serializers.py
plot | 203 | https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/bootcamp_pandas-input.py

Top numpy functions and modules

Results are again approximate. The query is the pandas version with a simple string replacement.

SELECT
  REGEXP_EXTRACT(line,
        r'[^a-zA-Z](?:np|numpy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/",
        FIRST(sample_repo_name),
        "/blob/master/",
        FIRST(sample_path)) AS example_url,
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import numpy"
  HAVING
    NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:np|numpy)\.') )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;
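Since the pandas, numpy and scipy queries differ only in the alias and package name, the pattern can be generated instead of copied. The Python sketch below uses a hypothetical helper usage_pattern; it illustrates the string replacement and is not how the queries in this post were produced.

```python
import re

# Capture group shared by all three queries.
NAME_CHARS = r'([^",\(\,\`) \':\[\]/={}]*)'


def usage_pattern(alias, package):
    """Build the extraction pattern for a given import alias and package name."""
    prefix = r'[^a-zA-Z](?:' + re.escape(alias) + '|' + re.escape(package) + r')\.'
    return re.compile(prefix + NAME_CHARS)


numpy_re = usage_pattern("np", "numpy")
scipy_re = usage_pattern("sp", "scipy")

print(numpy_re.search("x = np.zeros((3, 3))").group(1))                  # zeros
print(scipy_re.search("from scipy.sparse.linalg import svds").group(1))  # sparse.linalg
```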

Full result list in Google docs. Top 20 results:

function | count_distinct_repos | count_total | example_url
array | 23877 | 604263 | https://github.com/AlexBourassa/Generic_UI/blob/master/Widgets/GraphWidget/Fitter.py
zeros | 19406 | 280579 | https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
arange | 13587 | 158705 | https://github.com/jamesp/jpy/blob/master/jpy/maths/derive.py
sqrt | 10297 | 77810 | https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
ones | 10028 | 80998 | https://github.com/iamtrask/keras/blob/master/keras/models.py
sum | 9829 | 85793 | https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
mean | 9773 | 56402 | https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
linspace | 8769 | 62970 | https://github.com/Titan-C/learn-dmft/blob/master/examples/plot_ipt_coex.py
asarray | 7745 | 82563 | https://github.com/ratnania/caid/blob/master/caid-gui/viewer.py
ndarray | 7617 | 71141 | https://github.com/eirikgje/healpy/blob/master/healpy/pixelfunc.py
dot | 7386 | 90422 | https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
exp | 6979 | 42446 | https://github.com/pkgw/pwkit/blob/master/pwkit/dulk_models.py
abs | 6979 | 43168 | https://github.com/eirikgje/healpy/blob/master/healpy/pixelfunc.py
where | 6781 | 56778 | https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
empty | 6632 | 51718 | https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
max | 6533 | 31860 | https://github.com/live-clones/dolfin-adjoint/blob/master/tests_dolfin/mantle_convection/retrieve_demo.py
concatenate | 6425 | 36532 | https://github.com/Messaoud-Boudjada/dipy/blob/master/dipy/tracking/local/localtracking.py
log | 5742 | 33105 | https://github.com/pkgw/pwkit/blob/master/pwkit/dulk_models.py
sin | 5302 | 25481 | https://github.com/jamesp/jpy/blob/master/jpy/maths/derive.py
vstack | 5251 | 25913 | https://github.com/buzz/sniegabuda-raspi/blob/master/transformations.py
min | 5064 | 21231 | https://github.com/gwpy/seismon/blob/master/seismon/psd.py

Top scipy functions and modules

Results are again approximate. The query is the numpy version with a simple string replacement.

SELECT
  REGEXP_EXTRACT(line,
        r'[^a-zA-Z](?:sp|scipy)\.([^",\(\,\`) \':\[\]/={}]*)') AS function,
  COUNT(DISTINCT(sample_repo_name)) AS count_distinct_repos,
  COUNT(*) AS count_total,
  CONCAT("https://github.com/",
        FIRST(sample_repo_name),
        "/blob/master/",
        FIRST(sample_path)) AS example_url,
FROM (
  SELECT
    SPLIT(content, '\n') AS line,
    sample_path,
    sample_repo_name
  FROM
    [fh-bigquery:github_extracts.contents_py]
  WHERE
    content CONTAINS "import scipy"
  HAVING
    NOT LEFT(LTRIM(line),1)='#'
    AND REGEXP_MATCH(line, r'[^a-zA-Z](?:sp|scipy)\.') )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500;

Full result list in google docs. Top 20 results:

function | count_distinct_repos | count_total | example_url
stats | 2281 | 5717 | https://github.com/geophysics/mtpy/blob/master/mtpy/modeling/occam2d.py
sparse | 1706 | 6500 | https://github.com/tscholak/smbkmeans/blob/master/tfidf_smbkmeans.py
optimize | 1531 | 2788 | https://github.com/cni/t1fit/blob/master/t1_fitter.py
io | 1218 | 3079 | https://github.com/wojtekwalczak/FB_datalab/blob/master/lib/most_distinctive.py
linalg | 1199 | 3047 | https://github.com/lesteve/scikit-learn/blob/master/sklearn/utils/arpack.py
interpolate | 972 | 2022 | https://github.com/geophysics/mtpy/blob/master/mtpy/modeling/occam2d.py
special | 968 | 1792 | https://github.com/liberatorqjw/scikit-learn/blob/master/sklearn/utils/fixes.py
signal | 915 | 1883 | https://github.com/garibaldu/radioblobs/blob/master/code/code_1d/old_and_extra/score_GA.py
ndimage | 864 | 2196 | https://github.com/cni/t1fit/blob/master/t1_fitter.py
misc | 650 | 1135 | https://github.com/sillvan/hyperspy/blob/master/hyperspy/drawing/_markers/point.py
integrate | 574 | 986 | https://github.com/kleskjr/scipy/blob/master/scipy/stats/tests/test_distributions.py
sparse.linalg | 495 | 1056 | https://github.com/lesteve/scikit-learn/blob/master/sklearn/utils/arpack.py
spatial.distance | 469 | 721 | https://github.com/wjchen84/rapprentice/blob/master/rapprentice/registration.py
spatial | 420 | 766 | https://github.com/delmic/odemis/blob/master/src/odemis/acq/align/coordinates.py
io.loadmat | 414 | 1501 | https://github.com/jdsika/TUM_SmartCardLab/blob/master/DPA/benchmark.py
sparse.csr_matrix | 401 | 1305 | https://github.com/waterponey/scikit-learn/blob/master/scikits/learn/svm/tests/test_sparse.py
org | 369 | 894 | https://github.com/chiotlune/ext/blob/master/gnuradio-3.7.0.1/gr-filter/examples/fir_filter_ccc.py
csr_matrix | 361 | 2541 | https://github.com/tscholak/smbkmeans/blob/master/tfidf_smbkmeans.py
array | 352 | 3873 | https://github.com/PMBio/limix/blob/master/limix/deprecated/io/data_util.py
issparse | 334 | 2309 | https://github.com/thilbern/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py

Attribution

The regular expression used to extract functions was improved by Felipe in a comment.
