# Python training for data engineers
## 06 Data Filtering

In [4]:
# Read the XML data from the previous notebook
import pandas as pd
xmldf = pd.read_pickle('xml_dataframe_notebook_05.pickle')
xmldf.head()

Unnamed: 0,filename,package_name,python_version,release_type,size,uploaded_on,unit,size_in_bytes
0,watson_machine_learning_client-1.0.83-py3-none...,watson-machine-learning-client,py3,python wheel,552,2018-04-10,KB,565248
1,watson_machine_learning_client-1.0.83.tar.gz ...,watson-machine-learning-client,,source,211,2018-04-10,KB,216064
2,azure-mgmt-machinelearningcompute-0.4.0.zip ...,azure-mgmt-machinelearningcompute,,source,50,2018-01-02,KB,51200
3,azure_mgmt_machinelearningcompute-0.4.0-py2.py...,azure-mgmt-machinelearningcompute,py2.py3,python wheel,38,2018-01-02,KB,38912
4,machineLearningStanford-0.0.tar.gz (md5),machineLearningStanford,,source,3,2015-07-08,KB,3072


Get all unique release types:

In [6]:
xmldf['release_type'].unique()

[python wheel, source, h2o, fast scalable machine learning, for python, python egg, a package to aid in metalearning, ..., integration tools for running scikit-learn on ..., extendable command line utility for sysadmins, mailgun library to extract message quotations ..., trusted analytics toolkit, waterworks: because everyone has their own uti...]
Length: 28
Categories (28, object): [python wheel, source, h2o, fast scalable machine learning, for python, python egg, ..., extendable command line utility for sysadmins, mailgun library to extract message quotations ..., trusted analytics toolkit, waterworks: because everyone has their own uti...]
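
When a column holds unexpected values, as `release_type` does here, it can help to see how often each value occurs, not just which values exist. The sketch below uses a small made-up dataframe (`sample` and its values are hypothetical, not taken from `xmldf`) to compare `unique()` with `value_counts()`.

```python
import pandas as pd

# Hypothetical miniature of xmldf; the values are made up for illustration.
sample = pd.DataFrame({
    "release_type": ["python wheel", "source", "source", "python egg", "source"],
})

# unique() lists each distinct value once, in order of first appearance.
distinct_types = sample["release_type"].unique()

# value_counts() also reports how many rows carry each value,
# sorted from most to least frequent.
type_counts = sample["release_type"].value_counts()

print(distinct_types)
print(type_counts)
```

On the real data, rare values in `value_counts()` would likely point at rows where the release type was parsed incorrectly.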

Count the rows per Python version (each row is one file, so a package with several files is counted several times):

In [10]:
xmldf.groupby(['python_version'])['package_name'].count()

python_version
                                                       1064
2.4                                                       1
2.6                                                       1
2.7                                                      56
3.2                                                       1
3.4                                                      10
3.5                                                      21
3.6                                                      32
any                                                       2
cp27                                                     70
cp33                                                      5
cp34                                                     27
cp35                                                     49
cp36                                                     52
gpu,PY=35/cmake_build/tf_python/dist/tf_nightly_gpu       1
gpu,PY=36/cmake_build/tf_python/dist/tf_nightly_gpu       1
py2                      

Or just get the top 5:

In [14]:
xmldf.groupby(['python_version'])['package_name'].count().nlargest(5)

python_version
           1064
py2.py3     166
py3         111
cp27         70
py2          58
Name: package_name, dtype: int64
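
Note that `count()` counts rows, and every row in this dataset is a file. If the goal is the number of distinct packages per Python version, `nunique()` is one option. The following is a minimal sketch on a made-up dataframe (`sample` and the `pkg-*` names are hypothetical); the empty string stands in for files without a Python version, as in the output above.

```python
import pandas as pd

# Hypothetical data: six files belonging to three packages.
sample = pd.DataFrame({
    "package_name": ["pkg-a", "pkg-a", "pkg-b", "pkg-c", "pkg-c", "pkg-c"],
    "python_version": ["py3", "py3", "py3", "", "", "cp27"],
})

grouped = sample.groupby("python_version")["package_name"]

# Number of rows (files) per Python version.
files_per_version = grouped.count()

# Number of distinct package names per Python version.
packages_per_version = grouped.nunique()

# The two versions with the most files, largest first.
top_two = files_per_version.nlargest(2)

print(files_per_version)
print(packages_per_version)
print(top_two)
```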

Keep only the files whose `release_type` is `source` and sort them by size, largest first.

In [11]:
xmldf[xmldf['release_type'] == 'source'].sort_values('size_in_bytes', ascending=False)

Unnamed: 0,filename,package_name,python_version,release_type,size,uploaded_on,unit,size_in_bytes
1359,bob.bio.pericrosseye_competition-1.0.2.zip ...,bob.bio.pericrosseye_competition,,source,56,2017-04-10,MB,58720256
508,upsilon-1.2.7.tar.gz (md5),upsilon,,source,56,2017-11-28,MB,58720256
811,GraphLab_Create-2.1-py2.7.tar.gz (md5),GraphLab-Create,,source,49,2016-07-22,MB,51380224
52,h2o_pysparkling_2.2-2.2.11.tar.gz (md5),h2o_pysparkling_2.2,,source,48,2018-03-29,MB,50331648
51,h2o_pysparkling_2.1-2.1.25.tar.gz (md5),h2o_pysparkling_2.1,,source,48,2018-03-29,MB,50331648
50,h2o_pysparkling_2.0-2.0.26.tar.gz (md5),h2o_pysparkling_2.0,,source,48,2018-03-29,MB,50331648
48,h2o_pysparkling_2.3-2.3.0.tar.gz (md5),h2o-pysparkling-2.3,,source,48,2018-03-29,MB,50331648
1739,python-rdm-0.1.8a.tar.gz (md5),python-rdm,,source,43,2017-02-06,MB,45088768
317,imbalanced-learn-0.3.3.tar.gz (md5),imbalanced-learn,,source,39,2018-02-22,MB,40894464
102,skbold-0.3.3.tar.gz (md5),skbold,,source,39,2017-07-31,MB,40894464
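
The filter above works by building a boolean mask and using it to select rows. The sketch below spells out the two steps on a made-up dataframe (`sample` and its file names are hypothetical) and limits the result with `head()`.

```python
import pandas as pd

# Hypothetical data with the same column names as xmldf.
sample = pd.DataFrame({
    "filename": ["a.tar.gz", "b.whl", "c.zip", "d.tar.gz"],
    "release_type": ["source", "python wheel", "source", "source"],
    "size_in_bytes": [3072, 38912, 51200, 216064],
})

# Step 1: a boolean Series, True for every row that is a source release.
is_source = sample["release_type"] == "source"

# Step 2: select the matching rows, sort by size (largest first)
# and keep only the first two.
largest_sources = (
    sample.loc[is_source]
    .sort_values("size_in_bytes", ascending=False)
    .head(2)
)

print(largest_sources)
```

Naming the mask makes it easier to combine several conditions later with `&` and `|`.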


Use two columns to group the data:

In [13]:
xmldf.groupby(['python_version', 'release_type'])['package_name'].count()

python_version                                       release_type        
                                                     source                  1064
2.4                                                  python egg                 1
2.6                                                  python egg                 1
2.7                                                  ms windows installer       2
                                                     python egg                23
                                                     python wheel              31
3.2                                                  python egg                 1
3.4                                                  python egg                 5
                                                     python wheel               5
3.5                                                  ms windows installer       3
                                                     python egg                 7
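
Grouping by two columns returns a Series with a two-level index, which can be hard to read in long form. One way to turn it into a table is `unstack()`, sketched here on a made-up dataframe (`sample` and the `pkg-*` names are hypothetical).

```python
import pandas as pd

# Hypothetical data: five files with a Python version and a release type.
sample = pd.DataFrame({
    "package_name": ["pkg-a", "pkg-b", "pkg-c", "pkg-d", "pkg-e"],
    "python_version": ["2.7", "2.7", "2.7", "3.5", "3.5"],
    "release_type": ["python egg", "python wheel", "python wheel",
                     "python egg", "python egg"],
})

# Series indexed by (python_version, release_type).
counts = sample.groupby(["python_version", "release_type"])["package_name"].count()

# Move release_type from the index into the columns;
# combinations that never occur are filled with 0.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```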
                        

Create a new dataframe containing the number of files for each package.

In [18]:
count_by_package = xmldf.groupby('package_name')['filename'].count().to_frame()
count_by_package

package_name,filename
ActionML,1
Augmentor,2
Azimuth,1
BayesSets,1
Biofuel-MyProject,1
Bis-Miner,1
BlackBoxAuditing,1
Boruta,1
Braid,2
Braindecode,1


Which packages have the most files?

In [19]:
count_by_package.sort_values('filename', ascending=False)

package_name,filename
trustedanalytics,157
talon,38
pyhacrf-datamade,21
scikit-learn,21
dask-ml,18
jubatus,17
dedupe,17
kemlglearn,14
tensorflow,12
nnabla,11


### API

In [5]:
jsondf = pd.read_pickle('json_dataframe_notebook_05.pickle')
jsondf.head()

Unnamed: 0,selenium-webdriver,appium,firefox,python-2.7,testing,node.js,phpunit,angularjs,nightwatch.js,selenium-firefoxdriver,...,model,django-managers,django-validation,django-serializer,django-1.7,django-testing,django-users,django-database,django-migrations,django-signals
algorithm,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
arrays,0.0,0.0,0.0,0.0,0.0,2339.0,0.0,3426.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beautifulsoup,83.0,0.0,0.0,1186.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
c++,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
class,0.0,0.0,0.0,572.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
