# Python training for data engineers
## 06 Data Filtering

### Goal
Apply basic filtering techniques to a pandas DataFrame:
* `count()`
* `sort_values()`
* `nlargest()`

### XML

In [1]:
# Read the XML data from the previous notebook
import pandas as pd
xmldf = pd.read_pickle('xml_dataframe_notebook_05.pickle')
xmldf.head()

,file_type,filename_size_hash,package_name,python_version,uploaded_on,filename,size,unit,size_in_bytes
0,wheel,scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int...,scikit-learn,cp27,2017-10-23,scikit_learn-0.19.1-cp27-cp27m-macosx_10_6_int...,0,MB,0
1,wheel,scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686...,scikit-learn,cp27,2017-10-23,scikit_learn-0.19.1-cp27-cp27m-manylinux1_i686...,4,MB,4194304
2,wheel,scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_...,scikit-learn,cp27,2017-10-23,scikit_learn-0.19.1-cp27-cp27m-manylinux1_x86_...,2,MB,2097152
3,wheel,scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68...,scikit-learn,cp27,2017-10-23,scikit_learn-0.19.1-cp27-cp27mu-manylinux1_i68...,4,MB,4194304
4,wheel,scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86...,scikit-learn,cp27,2017-10-23,scikit_learn-0.19.1-cp27-cp27mu-manylinux1_x86...,2,MB,2097152


Get all unique file types:

In [2]:
xmldf['file_type'].unique()

[wheel, source, windows installer]
Categories (3, object): [wheel, source, windows installer]
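
As a small sketch of related calls, `nunique()` returns how many distinct values there are and `value_counts()` returns how often each one occurs. The `files` frame below is a hypothetical miniature, not the data from the previous notebook.

```python
import pandas as pd

# Hypothetical miniature of the file listing; values are illustrative only.
files = pd.DataFrame({
    "file_type": ["wheel", "source", "wheel", "windows installer", "wheel"],
})

unique_types = files["file_type"].unique()       # distinct values, in order of first appearance
n_types = files["file_type"].nunique()           # number of distinct values
type_counts = files["file_type"].value_counts()  # rows per value, most frequent first

print(unique_types, n_types)
print(type_counts)
```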

Get the number of files per Python version (each row is one file, so counting `package_name` counts files):

In [3]:
xmldf.groupby(['python_version'])['package_name'].count()

python_version
2.7         3
3.5         2
None        1
cp27       21
cp33        5
cp34       10
cp35       15
cp36       14
py2.py3     4
py3         1
Name: package_name, dtype: int64
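
Note that `count()` only counts non-missing values in the selected column, while `size()` counts all rows in each group. A minimal sketch on a hypothetical toy frame with one missing `package_name`:

```python
import pandas as pd

# Hypothetical toy data; one package_name is deliberately missing.
toy = pd.DataFrame({
    "python_version": ["cp27", "cp27", "cp36", "cp36", "cp36"],
    "package_name": ["a", None, "b", "c", "d"],
})

grouped = toy.groupby("python_version")["package_name"]
non_null = grouped.count()  # skips the missing value in the cp27 group
all_rows = grouped.size()   # counts every row, missing or not

print(non_null)
print(all_rows)
```

If the counted column can contain missing values, the two results differ, so it is worth picking the one that matches the question being asked.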

Or get just the top 5:

In [4]:
xmldf.groupby(['python_version'])['package_name'].count().nlargest(5)

python_version
cp27    21
cp35    15
cp36    14
cp34    10
cp33     5
Name: package_name, dtype: int64
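
When there are no ties, `nlargest(n)` gives the same result as sorting in descending order and taking the first `n` entries. The sketch below rebuilds a few of the counts shown above by hand to compare both approaches and to show `nsmallest()`.

```python
import pandas as pd

# A few of the counts from the output above, entered by hand for illustration.
counts = pd.Series({"cp27": 21, "cp35": 15, "cp36": 14, "cp34": 10, "cp33": 5, "py3": 1})

top3 = counts.nlargest(3)                                 # three largest values
top3_sorted = counts.sort_values(ascending=False).head(3)  # same result via sorting
bottom2 = counts.nsmallest(2)                             # two smallest values

print(top3)
print(bottom2)
```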

Filter the files by `file_type` and sort them by size:

In [5]:
xmldf[xmldf['file_type'] == 'source'].sort_values('size_in_bytes', ascending=False)

,file_type,filename_size_hash,package_name,python_version,uploaded_on,filename,size,unit,size_in_bytes
22,source,scikit-learn-0.19.1.tar.gz ...,scikit-learn,,2017-10-23,scikit-learn-0.19.1.tar.gz ...,5,MB,5242880
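
Boolean masks can be stored in variables and combined with `&` (and), `|` (or) and `~` (not). The following is a sketch on a hypothetical toy frame; the column names mirror the ones above, but the rows are made up.

```python
import pandas as pd

# Hypothetical toy data using the same column names as xmldf.
files = pd.DataFrame({
    "file_type": ["wheel", "source", "wheel", "windows installer"],
    "python_version": ["cp27", None, "cp36", "2.7"],
    "size_in_bytes": [4194304, 5242880, 2097152, 3145728],
})

is_wheel = files["file_type"] == "wheel"
is_large = files["size_in_bytes"] > 3 * 1024**2  # larger than 3 MB

# Both conditions must hold.
large_wheels = files[is_wheel & is_large]

# Everything that is not a wheel, largest first.
not_wheels = files[~is_wheel].sort_values("size_in_bytes", ascending=False)

print(large_wheels)
print(not_wheels)
```

When writing the conditions inline, each comparison needs its own parentheses, because `&` binds more tightly than `==` and `>`.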


Use two columns to group the data:

In [6]:
xmldf.groupby(['python_version', 'file_type'])['package_name'].count()

python_version  file_type        
2.7             wheel                 1
                windows installer     2
3.5             wheel                 1
                windows installer     1
None            source                1
cp27            wheel                21
cp33            wheel                 5
cp34            wheel                10
cp35            wheel                15
cp36            wheel                14
py2.py3         wheel                 4
py3             wheel                 1
Name: package_name, dtype: int64
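
Grouping by two columns returns a Series with a two-level index. One way to turn it into a table is `unstack()`, which moves the inner level into columns. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; only the column names match xmldf.
toy = pd.DataFrame({
    "python_version": ["2.7", "2.7", "2.7", "cp27", "cp27"],
    "file_type": ["wheel", "windows installer", "windows installer", "wheel", "wheel"],
    "package_name": ["a", "a", "b", "a", "b"],
})

counts = toy.groupby(["python_version", "file_type"])["package_name"].count()

# Move file_type from the index into columns; missing combinations become 0.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```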

Create a new dataframe containing the number of files for each package:

In [7]:
count_by_package = xmldf.groupby('package_name')['filename'].count().to_frame()
count_by_package

package_name,filename
ninja,26
scikit-build,1
scikit-chem,2
scikit-ci,1
scikit-ci-addons,1
scikit-cycling,17
scikit-learn,26
scikit-optimize,1
scikit-ribo,1
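
`to_frame()` keeps `package_name` as the index. If a flat table with two regular columns is preferred, `reset_index(name=...)` is one alternative. The sketch below uses hypothetical toy data, and the column name `file_count` is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the column names match xmldf.
toy = pd.DataFrame({
    "package_name": ["ninja", "ninja", "scikit-ci", "ninja"],
    "filename": ["n1.whl", "n2.whl", "ci.tar.gz", "n3.whl"],
})

count_series = toy.groupby("package_name")["filename"].count()

as_frame = count_series.to_frame()                  # package_name stays in the index
flat = count_series.reset_index(name="file_count")  # package_name becomes a column

print(as_frame)
print(flat)
```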


Which packages have the most files?

In [8]:
count_by_package.sort_values('filename', ascending=False)

package_name,filename
ninja,26
scikit-learn,26
scikit-cycling,17
scikit-chem,2
scikit-build,1
scikit-ci,1
scikit-ci-addons,1
scikit-optimize,1
scikit-ribo,1
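
In the output above, `ninja` and `scikit-learn` are tied at 26 files. By default `nlargest(1, ...)` would return only one of them. As a sketch, either `keep="all"` or a comparison against `max()` returns every tied row. The frame below re-enters some of the counts above by hand.

```python
import pandas as pd

# Some of the counts from the output above, entered by hand for illustration.
count_by_package = pd.DataFrame(
    {"filename": [26, 1, 2, 17, 26]},
    index=pd.Index(
        ["ninja", "scikit-build", "scikit-chem", "scikit-cycling", "scikit-learn"],
        name="package_name",
    ),
)

first_only = count_by_package.nlargest(1, "filename")             # a single row, ties dropped
with_ties = count_by_package.nlargest(1, "filename", keep="all")  # keeps all tied rows

# Alternative: boolean filter against the maximum value.
is_max = count_by_package["filename"] == count_by_package["filename"].max()
leaders = count_by_package[is_max]

print(with_ties)
print(leaders)
```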


### API

In [9]:
jsondf = pd.read_pickle('json_dataframe_notebook_05.pickle')
jsondf.head()

,django,python-3.x,pandas,python-2.7,numpy,list,matplotlib,dictionary,regex,flask,...,output,urllib3,login,get,https,httprequest,http-post,http-headers,multipartform-data,grequests
algorithm,0.0,0.0,0.0,0.0,0.0,870.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
arrays,0.0,0.0,0.0,0.0,9475.0,4291.0,0.0,1931.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beautifulsoup,123.0,1411.0,168.0,1188.0,0.0,75.0,0.0,49.0,421.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
c++,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
class,0.0,971.0,0.0,572.0,0.0,981.0,0.0,468.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Get the rows where `firefox` has a value higher than ten:

In [10]:
jsondf[jsondf['firefox'] > 10]

,django,python-3.x,pandas,python-2.7,numpy,list,matplotlib,dictionary,regex,flask,...,output,urllib3,login,get,https,httprequest,http-post,http-headers,multipartform-data,grequests
html,4296.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5027.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
javascript,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26714.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
selenium,0.0,973.0,0.0,895.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
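
With this many columns, the filtered rows are mostly zeros. One possible follow-up is to select only the column of interest with `.loc`, or to drop the columns that are zero in every remaining row. The `tags` frame below is a hypothetical, much smaller stand-in for `jsondf`.

```python
import pandas as pd

# Hypothetical stand-in for jsondf: rows and columns are tags, values are counts.
tags = pd.DataFrame(
    {
        "django": [4296.0, 0.0, 0.0],
        "firefox": [35.0, 0.0, 120.0],
        "regex": [0.0, 12.0, 0.0],
    },
    index=["html", "algorithm", "selenium"],
)

mask = tags["firefox"] > 10

rows = tags[mask]                          # all columns for the matching rows
firefox_only = tags.loc[mask, ["firefox"]]  # matching rows, one column

# Keep only the columns that have at least one non-zero value in the filtered rows.
nonzero_cols = rows.loc[:, (rows != 0).any(axis=0)]

print(firefox_only)
print(nonzero_cols)
```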


Get the ten columns with the highest maximum value:

In [11]:
jsondf.max().nlargest(10)

jquery        500680.0
css           320479.0
html          298409.0
javascript    298409.0
php           209689.0
angularjs     114721.0
sql           100893.0
django         86864.0
python         86864.0
ajax           84942.0
dtype: float64
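
`max()` works column by column here, so the result shows the highest value per column but not the row it came from. As a sketch, `idxmax()` returns the row label of each column's maximum, and both can be combined into one summary. The `tags` frame is again a hypothetical small stand-in for `jsondf`.

```python
import pandas as pd

# Hypothetical stand-in for jsondf: rows and columns are tags, values are counts.
tags = pd.DataFrame(
    {
        "django": [4296.0, 0.0, 0.0],
        "firefox": [35.0, 0.0, 120.0],
        "regex": [0.0, 12.0, 0.0],
    },
    index=["html", "algorithm", "selenium"],
)

col_max = tags.max()       # highest value in each column
top = col_max.nlargest(2)  # the two columns with the highest maximum
where = tags.idxmax()      # row label where each column reaches its maximum

# Combine value and row label, then keep the two largest.
summary = pd.DataFrame({"max": col_max, "row": where}).nlargest(2, "max")

print(top)
print(summary)
```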