# Introduction to Pandas

**Pandas allows us to handle tables that combine different data types**. NumPy arrays cannot do this, yet it is a necessity for most biological data analysis, which is why we will not dwell on NumPy here.


In [5]:
import pandas as pd
import numpy as np

## Pandas Data Structures

### Series

A **Series** is the equivalent of the 1D NumPy array in Pandas. Each element of a series is labelled with an **index**.

In [9]:
length = pd.Series([484, 493, 511, 462], dtype = np.float32)
length

0    484.0
1    493.0
2    511.0
3    462.0
dtype: float32

The `values` of a Series are stored as a NumPy array.

In [10]:
length.values

array([484., 493., 511., 462.], dtype=float32)

When `index` is not specified, a default sequence of integers from `0` to `len(data) - 1` is assigned as the index. A Pandas _index_ has a dedicated type.

In [14]:
length.index

RangeIndex(start=0, stop=4, step=1)

Whenever possible, we should assign meaningful labels to the index:

In [15]:
protein = pd.Series([484, 493, 511, 462], 
    index=['Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 'Gibbon_aa'])

protein

Human_aa         484
Chimpanzee_aa    493
Gorilla_aa       511
Gibbon_aa        462
dtype: int64

These index labels can be used to access the values in the _Series_.

In [16]:
protein['Gorilla_aa']

511

In [18]:
protein[[name for name in protein.index if name.endswith('_aa')]]

Human_aa         484
Chimpanzee_aa    493
Gorilla_aa       511
Gibbon_aa        462
dtype: int64

In [21]:
[name for name in protein.index if name.startswith('G')]

['Gorilla_aa', 'Gibbon_aa']

Note that while indexing, the association between the values and the corresponding indices was maintained.

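We may also use positional indexing, as we did with Python sequences. The `.iloc` indexer makes the positional lookup explicit (a bare `protein[-1]` on a labelled index is deprecated in recent pandas versions):

In [19]:
protein.iloc[-1]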

462
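
Label-based and position-based access can both be written explicitly with the `.loc` and `.iloc` indexers, which avoids ambiguity when the labels are themselves integers. The following is a minimal sketch that re-creates the `protein` Series from above; the variable names `by_label`, `by_position`, `label_slice` and `position_slice` are only illustrative.

```python
import pandas as pd

protein = pd.Series([484, 493, 511, 462],
                    index=['Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 'Gibbon_aa'])

# .loc selects by index label
by_label = protein.loc['Gorilla_aa']

# .iloc selects by integer position (negative positions count from the end)
by_position = protein.iloc[-1]

# Slices work with both indexers; a label slice includes its end label,
# a positional slice excludes its end position
label_slice = protein.loc['Human_aa':'Gorilla_aa']
position_slice = protein.iloc[0:2]
```

Note the difference between the two slices: the label slice returns three entries, the positional slice two.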

We may give both the array of values and the index meaningful labels too:

In [29]:
protein.name = 'length'
protein.index.name = 'lineage'

In [30]:
protein

lineage
Human_aa         484
Chimpanzee_aa    493
Gorilla_aa       511
Gibbon_aa        462
Name: length, dtype: int64

#### We can apply NumPy's **math functions and other operations** to a Series without losing the data structure (a NumPy operation that does not reduce the data returns a `Series`).

In [32]:
np.log(protein)

lineage
Human_aa         6.182085
Chimpanzee_aa    6.200509
Gorilla_aa       6.236370
Gibbon_aa        6.135565
Name: length, dtype: float64

#### We can **filter** based on the values in the _Series_:

In [34]:
protein>500

lineage
Human_aa         False
Chimpanzee_aa    False
Gorilla_aa        True
Gibbon_aa        False
Name: length, dtype: bool

In [35]:
protein[protein>500]

lineage
Gorilla_aa    511
Name: length, dtype: int64

One can think of a _Series_ as an ordered key-value store. Also, **we can create a _series_ from a _dictionary_**:

In [36]:
protein_dict = {'Human_aa': 484, 'Chimpanzee_aa': 493, 'Gorilla_aa': 511,
                 'Gibbon_aa': 462}
pd.Series(protein_dict)

Human_aa         484
Chimpanzee_aa    493
Gorilla_aa       511
Gibbon_aa        462
dtype: int64

Note that the _Series_ keeps the order in which the keys appear in the dictionary (older pandas versions sorted the keys instead).

When we pass a custom index to _Series_, it will select the corresponding values from the dictionary, and treat indices without corresponding values as missing.

In [37]:
protein2 = pd.Series(protein_dict, 
                      index=['Bonobo_aa','Human_aa',
                             'Chimpanzee_aa','Gorilla_aa'])
protein2

Bonobo_aa          NaN
Human_aa         484.0
Chimpanzee_aa    493.0
Gorilla_aa       511.0
dtype: float64

#### Pandas uses the floating-point value _NaN_ ("not a number") for missing values.

In [39]:
protein2.isnull()

Bonobo_aa         True
Human_aa         False
Chimpanzee_aa    False
Gorilla_aa       False
dtype: bool

In [40]:
# `isna()` and `isnull()` are aliases and return the same result.
protein2.isna()

Bonobo_aa         True
Human_aa         False
Chimpanzee_aa    False
Gorilla_aa       False
dtype: bool
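
Once missing values are identified, they can be counted, dropped or replaced. The sketch below rebuilds a Series with the same content as `protein2`; replacing _NaN_ by `0` is only an illustration, and whether that is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

protein2 = pd.Series([np.nan, 484, 493, 511],
                     index=['Bonobo_aa', 'Human_aa', 'Chimpanzee_aa', 'Gorilla_aa'])

n_missing = protein2.isna().sum()   # count the missing entries
observed = protein2.dropna()        # keep only the entries that are not NaN
filled = protein2.fillna(0)         # or replace NaN by a chosen value
mean_length = protein2.mean()       # reductions skip NaN by default
```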

### The labels are used to **combine data** in operations involving other _Series_ objects:

In [41]:
protein

lineage
Human_aa         484
Chimpanzee_aa    493
Gorilla_aa       511
Gibbon_aa        462
Name: length, dtype: int64

In [42]:
protein2

Bonobo_aa          NaN
Human_aa         484.0
Chimpanzee_aa    493.0
Gorilla_aa       511.0
dtype: float64

In [43]:
protein + protein2

Bonobo_aa           NaN
Chimpanzee_aa     986.0
Gibbon_aa           NaN
Gorilla_aa       1022.0
Human_aa          968.0
dtype: float64

In [56]:
NewSer = protein + protein2

In [57]:
NewSer

Bonobo_aa           NaN
Chimpanzee_aa     986.0
Gibbon_aa           NaN
Gorilla_aa       1022.0
Human_aa          968.0
dtype: float64

Thus, while `numpy.array`s of the same length combine values element-wise, `Series` combine values with the same labels. Furthermore, missing values are propagated by addition.

In [59]:
NewSer[['Gibbon_aa', 'Bonobo_aa']]

Gibbon_aa   NaN
Bonobo_aa   NaN
dtype: float64
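
If a label that is present in only one of the two Series should not produce _NaN_, the `add` method accepts a `fill_value` that stands in for the missing operand. A minimal sketch, using two Series with the same content as `protein` and `protein2` (here named `a` and `b` for brevity):

```python
import numpy as np
import pandas as pd

a = pd.Series([484, 493, 511, 462],
              index=['Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 'Gibbon_aa'])
b = pd.Series([np.nan, 484, 493, 511],
              index=['Bonobo_aa', 'Human_aa', 'Chimpanzee_aa', 'Gorilla_aa'])

# fill_value replaces a missing operand when the other Series has a value;
# a label that is missing on both sides still gives NaN
total = a.add(b, fill_value=0)
```

Here `Gibbon_aa` keeps its value from `a`, while `Bonobo_aa` remains _NaN_ because neither Series holds a number for it.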

## DataFrame
A _DataFrame_ has a tabular data structure and stores multiple Series as separate columns, like data in a spreadsheet.

In [44]:
MyDf = pd.DataFrame({'length':[484, 493, 511, 462, 1102, 1130, 1078, 1121],
                     'ortholog':[1, 1, 1, 1, 2, 2, 2, 2],
                     'lineage':['Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 
    'Gibbon_aa', 'Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 'Gibbon_aa']})
MyDf

Unnamed: 0,length,ortholog,lineage
0,484,1,Human_aa
1,493,1,Chimpanzee_aa
2,511,1,Gorilla_aa
3,462,1,Gibbon_aa
4,1102,2,Human_aa
5,1130,2,Chimpanzee_aa
6,1078,2,Gorilla_aa
7,1121,2,Gibbon_aa


Note that the columns of the _DataFrame_ keep the order in which the dictionary keys were given (older pandas versions sorted them by column name). You can change the order by indexing in the order of preference:

In [45]:
MyDf[['lineage','length','ortholog']]

Unnamed: 0,lineage,length,ortholog
0,Human_aa,484,1
1,Chimpanzee_aa,493,1
2,Gorilla_aa,511,1
3,Gibbon_aa,462,1
4,Human_aa,1102,2
5,Chimpanzee_aa,1130,2
6,Gorilla_aa,1078,2
7,Gibbon_aa,1121,2


In _DataFrames_ the columns are represented as the second index:

In [47]:
MyDf.columns

Index(['length', 'ortholog', 'lineage'], dtype='object')

The **dtypes** attribute reveals the data type for each column in our DataFrame. 

- **int**: integer values
- **object**: strings (or any mix of Python objects)
- **float**: floating-point values

In [48]:
MyDf.dtypes

length       int64
ortholog     int64
lineage     object
dtype: object

#### We access columns
1. by dictionary-like indexing:

In [49]:
MyDf['ortholog']

0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: ortholog, dtype: int64

2. by attribute:

In [50]:
MyDf.ortholog

0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: ortholog, dtype: int64

In [51]:
MyDf.length

0     484
1     493
2     511
3     462
4    1102
5    1130
6    1078
7    1121
Name: length, dtype: int64

In [52]:
MyDf[['length']]

Unnamed: 0,length
0,484
1,493
2,511
3,462
4,1102
5,1130
6,1078
7,1121


In [53]:
type(MyDf[['length']]), type(MyDf['length'])

(pandas.core.frame.DataFrame, pandas.core.series.Series)


#### To access a row in a _DataFrame_, we index its **loc** attribute.

In [54]:
MyDf

Unnamed: 0,length,ortholog,lineage
0,484,1,Human_aa
1,493,1,Chimpanzee_aa
2,511,1,Gorilla_aa
3,462,1,Gibbon_aa
4,1102,2,Human_aa
5,1130,2,Chimpanzee_aa
6,1078,2,Gorilla_aa
7,1121,2,Gibbon_aa
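
Row access through `loc` can be sketched as follows. The example builds a shortened stand-in for `MyDf` (named `df` here, an illustrative name) so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'length': [484, 493, 511, 462],
                   'ortholog': [1, 1, 1, 1],
                   'lineage': ['Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 'Gibbon_aa']})

row = df.loc[2]              # one row label returns the row as a Series
rows = df.loc[[0, 3]]        # a list of row labels returns a DataFrame
cell = df.loc[2, 'length']   # row label and column label together return one value
by_pos = df.iloc[-1]         # .iloc selects rows by position instead of by label
```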


In [41]:
MyDf.set_index(["lineage", MyDf.index])

Unnamed: 0_level_0,Unnamed: 1_level_0,length,ortholog
lineage,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Human_aa,0,484,1
Chimpanzee_aa,1,493,1
Gorilla_aa,2,511,1
Gibbon_aa,3,462,1
Human_aa,4,1102,2
Chimpanzee_aa,5,1130,2
Gorilla_aa,6,1078,2
Gibbon_aa,7,1121,2


### Exercise

Check these commands:

- MyDf.head()
- MyDf.tail(3)
- MyDf.shape

#### _DataFrames_ can also be created with a **list of dictionaries**:

In [None]:
MyDf = pd.DataFrame([{'ortholog': 1, 'lineage': 'Human_aa', 'length': 484},
                    {'ortholog': 1, 'lineage': 'Chimpanzee_aa', 'length': 493},
                    {'ortholog': 1, 'lineage': 'Gorilla_aa', 'length': 511},
                    {'ortholog': 1, 'lineage': 'Gibbon_aa', 'length': 462},
                    {'ortholog': 2, 'lineage': 'Human_aa', 'length': 1102},
                    {'ortholog': 2, 'lineage': 'Chimpanzee_aa', 'length': 1130},
                    {'ortholog': 2, 'lineage': 'Gorilla_aa', 'length': 1078},
                    {'ortholog': 2, 'lineage': 'Gibbon_aa', 'length': 1121}])

In [63]:
MyDf

Unnamed: 0,length,ortholog,lineage
0,484,1,Human_aa
1,493,1,Chimpanzee_aa
2,511,1,Gorilla_aa
3,462,1,Gibbon_aa
4,1102,2,Human_aa
5,1130,2,Chimpanzee_aa
6,1078,2,Gorilla_aa
7,1121,2,Gibbon_aa


### View and copy

To operate on a `Series` that is part of a `DataFrame` without modifying the original values in the `DataFrame`, we must take a copy first:

In [64]:
length_copy = MyDf.length.copy()
length_copy

0     484
1     493
2     511
3     462
4    1102
5    1130
6    1078
7    1121
Name: length, dtype: int64

In [65]:
# Update length_copy[5]
length_copy[5] = 0

In [66]:
# Updated Series
length_copy

0     484
1     493
2     511
3     462
4    1102
5       0
6    1078
7    1121
Name: length, dtype: int64

In [67]:
# Original dataframe is unmodified
MyDf.length

0     484
1     493
2     511
3     462
4    1102
5    1130
6    1078
7    1121
Name: length, dtype: int64

### **How not to do it**

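It's important to remember that the `Series` returned when a `DataFrame` is indexed can be a **view** on the DataFrame rather than a copy of the data, as it is in the pandas version used here. So we must remain cautious while manipulating this data. (With Copy-on-Write enabled, which the pandas documentation announces as the default from version 3.0, such a modification no longer propagates to the DataFrame.)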

In [68]:
# Take a **view** on the Series
length_view = MyDf.length
length_view

0     484
1     493
2     511
3     462
4    1102
5    1130
6    1078
7    1121
Name: length, dtype: int64

In [69]:
# Update length_view[5]
length_view[5] = 0
length_view

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  length_view[5] = 0


0     484
1     493
2     511
3     462
4    1102
5       0
6    1078
7    1121
Name: length, dtype: int64

In [70]:
# Now the original dataframe is updated as well. This may cause unintentional side effects.
MyDf.length

0     484
1     493
2     511
3     462
4    1102
5       0
6    1078
7    1121
Name: length, dtype: int64
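
Because the view-or-copy behaviour depends on the pandas version and its settings, it is safer to state the intention explicitly. The sketch below assumes a small stand-in DataFrame `df` and a hypothetical working copy called `scratch`.

```python
import pandas as pd

df = pd.DataFrame({'length': [484, 493, 511], 'ortholog': [1, 1, 1]})

# To change the DataFrame itself, assign through .loc on the DataFrame
df.loc[1, 'length'] = 0

# To work on the values without touching the DataFrame, take a copy first
scratch = df['length'].copy()
scratch.loc[2] = -1
```

After these statements `df` holds the `0` that was assigned through `.loc`, while the `-1` exists only in `scratch`.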

#### We can modify columns by assignment:

In [71]:
MyDf.loc[[4,3,6],"length"] = [15,22,6]
MyDf

Unnamed: 0,length,ortholog,lineage
0,484,1,Human_aa
1,493,1,Chimpanzee_aa
2,511,1,Gorilla_aa
3,22,1,Gibbon_aa
4,15,2,Human_aa
5,0,2,Chimpanzee_aa
6,6,2,Gorilla_aa
7,1121,2,Gibbon_aa


In [42]:
NewDF=pd.DataFrame(MyDf.length)

In [72]:
MyDf.length[[3,4,6]] = [14, 21, 5]
MyDf

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  MyDf.length[[3,4,6]] = [14, 21, 5]


Unnamed: 0,length,ortholog,lineage
0,484,1,Human_aa
1,493,1,Chimpanzee_aa
2,511,1,Gorilla_aa
3,14,1,Gibbon_aa
4,21,2,Human_aa
5,0,2,Chimpanzee_aa
6,5,2,Gorilla_aa
7,1121,2,Gibbon_aa


#### We can create columns by assignment:

In [73]:
MyDf['build'] = 2020
MyDf

Unnamed: 0,length,ortholog,lineage,build
0,484,1,Human_aa,2020
1,493,1,Chimpanzee_aa,2020
2,511,1,Gorilla_aa,2020
3,14,1,Gibbon_aa,2020
4,21,2,Human_aa,2020
5,0,2,Chimpanzee_aa,2020
6,5,2,Gorilla_aa,2020
7,1121,2,Gibbon_aa,2020


**However, we cannot employ the attribute indexing method to add a new column**:

In [74]:
MyDf.exons = 1
MyDf

Unnamed: 0,length,ortholog,lineage,build
0,484,1,Human_aa,2020
1,493,1,Chimpanzee_aa,2020
2,511,1,Gorilla_aa,2020
3,14,1,Gibbon_aa,2020
4,21,2,Human_aa,2020
5,0,2,Chimpanzee_aa,2020
6,5,2,Gorilla_aa,2020
7,1121,2,Gibbon_aa,2020


In [75]:
MyDf.exons

1
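
The reason is that attribute assignment with a new name sets an ordinary Python attribute on the object and leaves the table untouched. A minimal sketch with a stand-in DataFrame `df` (the names `attribute_is_column` and `bracket_is_column` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'length': [484, 493, 511]})

df.exons = 1                                   # sets a Python attribute, not a column
attribute_is_column = 'exons' in df.columns    # False

df['exons'] = 1                                # bracket assignment creates the column
bracket_is_column = 'exons' in df.columns      # True
```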

### Exercise

From the _MyDf_ DataFrame above, create an index to return all rows for which the lineage name ends in "_aa" and the length is greater than 500.

In [80]:
# Write your answer here 
# Boolean mask: True where the lineage name ends in '_aa'
a = MyDf.lineage.str.endswith('_aa')
# Boolean mask: True where the length is greater than 500
b = MyDf.length > 500
MyDf[a & b].index

Int64Index([2, 7], dtype='int64')

#### Specifying a _Series_ as a new column causes its values to be aligned according to the _DataFrame_'s index:

In [81]:
subs = pd.Series([0]*2 +[1]*4 + [2]*2)
subs

0    0
1    0
2    1
3    1
4    1
5    1
6    2
7    2
dtype: int64

In [82]:
MyDf['subs'] = subs
MyDf

Unnamed: 0,length,ortholog,lineage,build,subs
0,484,1,Human_aa,2020,0
1,493,1,Chimpanzee_aa,2020,0
2,511,1,Gorilla_aa,2020,1
3,14,1,Gibbon_aa,2020,1
4,21,2,Human_aa,2020,1
5,0,2,Chimpanzee_aa,2020,1
6,5,2,Gorilla_aa,2020,2
7,1121,2,Gibbon_aa,2020,2
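
The alignment becomes visible when the new Series covers only some of the row labels, or lists them in a different order. A minimal sketch, assuming a stand-in DataFrame `df` and a hypothetical Series `partial`:

```python
import pandas as pd

df = pd.DataFrame({'length': [484, 493, 511, 462]})

# The Series covers only the labels 3 and 1, given in that order
partial = pd.Series([30, 10], index=[3, 1])

# Values are matched by label; rows without a matching label receive NaN
df['subs'] = partial
```

Row `1` receives `10` and row `3` receives `30`, regardless of the order in `partial`; rows `0` and `2` are filled with _NaN_.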


#### However, Python data structures without an index can only be added if they have the same length as the _DataFrame_:

In [83]:
chromosome = [1, 1, 14, 3]
MyDf['chromosome'] = chromosome

ValueError: Length of values (4) does not match length of index (8)

In [84]:
MyDf['chromosome'] = [1]*len(MyDf)
MyDf

Unnamed: 0,length,ortholog,lineage,build,subs,chromosome
0,484,1,Human_aa,2020,0,1
1,493,1,Chimpanzee_aa,2020,0,1
2,511,1,Gorilla_aa,2020,1,1
3,14,1,Gibbon_aa,2020,1,1
4,21,2,Human_aa,2020,1,1
5,0,2,Chimpanzee_aa,2020,1,1
6,5,2,Gorilla_aa,2020,2,1
7,1121,2,Gibbon_aa,2020,2,1


The **drop** method removes rows or columns; by default it drops rows. We can state explicitly whether to remove rows or columns with the **axis** argument:

- axis=0 : row
- axis=1 : column

In [85]:
MyDf.drop('chromosome', axis=1, inplace=True)
MyDf

Unnamed: 0,length,ortholog,lineage,build,subs
0,484,1,Human_aa,2020,0
1,493,1,Chimpanzee_aa,2020,0
2,511,1,Gorilla_aa,2020,1
3,14,1,Gibbon_aa,2020,1
4,21,2,Human_aa,2020,1
5,0,2,Chimpanzee_aa,2020,1
6,5,2,Gorilla_aa,2020,2
7,1121,2,Gibbon_aa,2020,2
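
Without `inplace=True`, `drop` leaves the original untouched and returns a new DataFrame. The `columns` and `index` keywords are an alternative to `axis` that some find easier to read. A minimal sketch with a stand-in DataFrame `df`:

```python
import pandas as pd

df = pd.DataFrame({'length': [484, 493, 511],
                   'chromosome': [1, 1, 1]})

without_column = df.drop(columns='chromosome')   # same as df.drop('chromosome', axis=1)
without_rows = df.drop(index=[0, 2])             # same as df.drop([0, 2], axis=0)
```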


The underlying data can be extracted as a two-dimensional `numpy.array` by accessing the `values` attribute:

In [86]:
MyDf.values

array([[484, 1, 'Human_aa', 2020, 0],
       [493, 1, 'Chimpanzee_aa', 2020, 0],
       [511, 1, 'Gorilla_aa', 2020, 1],
       [14, 1, 'Gibbon_aa', 2020, 1],
       [21, 2, 'Human_aa', 2020, 1],
       [0, 2, 'Chimpanzee_aa', 2020, 1],
       [5, 2, 'Gorilla_aa', 2020, 2],
       [1121, 2, 'Gibbon_aa', 2020, 2]], dtype=object)

Because the columns mix strings and integers, the data type of this array is _object_.

The data type is automatically chosen to be the most general kind that can accommodate all the columns.

In [94]:
Df2 = pd.DataFrame({'x': [1,2,3], 'y':[5, -1.7, 3.8]})
Df2.values

array([[ 1. ,  5. ],
       [ 2. , -1.7],
       [ 3. ,  3.8]])

In [93]:
Df2.values.dtype

dtype('float64')
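
The choice of the common data type can be checked directly. The sketch below uses `to_numpy()`, which returns the same kind of array as `values`; the DataFrames `numeric` and `mixed` are illustrative.

```python
import pandas as pd

numeric = pd.DataFrame({'x': [1, 2, 3], 'y': [5, -1.7, 3.8]})
mixed = pd.DataFrame({'x': [1, 2, 3], 'name': ['a', 'b', 'c']})

numeric_dtype = numeric.to_numpy().dtype   # integers and floats share float64
mixed_dtype = mixed.to_numpy().dtype       # numbers and strings share only object
```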

#### Index objects are immutable:

In [95]:
MyDf.index[0] = 87

TypeError: Index does not support mutable operations

#### But they can be reassigned.

In [96]:
protein2

Bonobo_aa          NaN
Human_aa         484.0
Chimpanzee_aa    493.0
Gorilla_aa       511.0
dtype: float64

In [97]:
protein2.index = protein.index

In [98]:
protein2

lineage
Human_aa           NaN
Chimpanzee_aa    484.0
Gorilla_aa       493.0
Gibbon_aa        511.0
dtype: float64
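
Note that assigning to `index` only renames the labels position by position, which is why the values above ended up next to different lineages. When the values should stay attached to their labels, `reindex` is the appropriate tool. A minimal sketch, with a Series `s` holding the same content as `protein2` before the reassignment:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 484.0, 493.0, 511.0],
              index=['Bonobo_aa', 'Human_aa', 'Chimpanzee_aa', 'Gorilla_aa'])
new_labels = ['Human_aa', 'Chimpanzee_aa', 'Gorilla_aa', 'Gibbon_aa']

# reindex keeps each value attached to its label; unknown labels become NaN
aligned = s.reindex(new_labels)

# assigning to .index only renames the labels, position by position
relabelled = s.copy()
relabelled.index = new_labels
```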

## Importing data

Pandas allows convenient import of tabular data directly into a _DataFrame_ object. It also has multiple options for indexing, parsing, iterating and cleaning as the data is imported.

Let's start with some more protein alignment data, stored in a delimited text file.

In [99]:
!head ./data/CommonDf.tsv

Id	HS_gene_id	Gene	Description	HS	GG	NL	PT	HS_aa	PT_aa	GG_aa	NL_aa	overlap	AbsId	Subs	HS_Subs	PT_Subs	GG_Subs	NL_Subs	#1#_Subs	%AbsId	%Subs	%HS_Subs	%PT_Subs	%GG_Subs	%NL_Subs	%#1#_Subs	%NoId
ENST00000000412	ENSG00000003056	M6PR	mannose-6-phosphate receptor%2C cation dependent 	ENST00000000412	ENSGGOT00000059917	ENSNLET00000034947	ENSPTRT00000008593	277	277	277	267	265	262	3	0	0	1	2	0	98.8679245283019	1.13207547169812	0	0	0.377358490566038	0.754716981132076	0	0
ENST00000000442	ENSG00000173153	ESRRA	estrogen related receptor alpha 	ENST00000000442	ENSGGOT00000001342	ENSNLET00000006350	ENSPTRT00000007149	423	422	422	422	422	416	6	0	0	5	1	0	98.5781990521327	1.4218009478673	0	0	1.18483412322275	0.23696682464455	0	0
ENST00000001008	ENSG00000004478	FKBP4	FKBP prolyl isomerase 4 	ENST00000001008	ENSGGOT00000010515	ENSNLET00000003652	ENSPTRT00000008389	459	459	453	424	424	422	2	1	0	0	1	0	99.5283018867925	0.471698113207552	0.235849056603774	0	0	0.235849056603774	0	0
ENST00000001146	ENSG00000003

This table can be read into a DataFrame using **read_csv**:

In [100]:
CommonDf = pd.read_csv("./data/CommonDf.tsv")
CommonDf

Unnamed: 0,Id\tHS_gene_id\tGene\tDescription\tHS\tGG\tNL\tPT\tHS_aa\tPT_aa\tGG_aa\tNL_aa\toverlap\tAbsId\tSubs\tHS_Subs\tPT_Subs\tGG_Subs\tNL_Subs\t#1#_Subs\t%AbsId\t%Subs\t%HS_Subs\t%PT_Subs\t%GG_Subs\t%NL_Subs\t%#1#_Subs\t%NoId
0,ENST00000000412\tENSG00000003056\tM6PR\tmannos...
1,ENST00000000442\tENSG00000173153\tESRRA\testro...
2,ENST00000001008\tENSG00000004478\tFKBP4\tFKBP ...
3,ENST00000001146\tENSG00000003137\tCYP26B1\tcyt...
4,ENST00000002125\tENSG00000003509\tNDUFAF7\tNAD...
...,...
4995,ENST00000335765\tENSG00000109103\tUNC119\tunc-...
4996,ENST00000335783\tENSG00000031691\tCENPQ\tcentr...
4997,ENST00000335790\tENSG00000166407\tLMO1\tLIM do...
4998,ENST00000335852\tENSG00000160781\tPAQR6\tproge...


So the default separator of the *read_csv* function, a comma, does not parse our data into columns. Looking at the first line, it is clear that tab ("\t") was used as the separator.

We can use the `sep` argument to accommodate arbitrary separators. We pass the escape sequence **'\t'** to specify a tab: 
    
    sep='\t'

In [101]:
CommonDf = pd.read_csv("./data/CommonDf.tsv", sep='\t')
CommonDf

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,...,2.0,0.0,98.867925,1.132075,0.000000,0.000000,0.377358,0.754717,0.000000,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,...,1.0,0.0,98.578199,1.421801,0.000000,0.000000,1.184834,0.236967,0.000000,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,...,1.0,0.0,99.528302,0.471698,0.235849,0.000000,0.000000,0.235849,0.000000,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,...,1.0,0.0,99.414062,0.585938,0.390625,0.000000,0.000000,0.195312,0.000000,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441,441,...,5.0,3.0,97.716895,2.283105,0.228311,0.000000,0.228311,1.141553,0.684932,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,ENST00000335765,ENSG00000109103,UNC119,unc-119 lipid binding chaperone,ENST00000335765,ENSGGOT00000006200,ENSNLET00000002952,ENSPTRT00000016454,240,240,...,2.0,0.0,96.153846,3.846154,1.282051,0.000000,1.709402,0.854701,0.000000,0.0
4996,ENST00000335783,ENSG00000031691,CENPQ,centromere protein Q,ENST00000335783,ENSGGOT00000024187,ENSNLET00000049470,ENSPTRT00000104685,268,271,...,3.0,0.0,94.871795,5.128205,0.854701,1.282051,1.282051,1.282051,0.000000,0.0
4997,ENST00000335790,ENSG00000166407,LMO1,LIM domain only 1,ENST00000335790,ENSGGOT00000027338,ENSNLET00000022345,ENSPTRT00000006309,156,156,...,0.0,0.0,100.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
4998,ENST00000335852,ENSG00000160781,PAQR6,progestin and adipoQ receptor family member 6,ENST00000335852,ENSGGOT00000010962,ENSNLET00000055316,ENSPTRT00000103250,351,351,...,2.0,1.0,96.385542,3.614458,0.000000,0.000000,0.000000,2.409639,1.204819,0.0


You are likely to come across tables that use a variable amount of whitespace as column separator. We advise against writing such files yourself; when you do come across such data, pass the following regular expression (as a raw string) to `sep`: 

    sep=r'\s+'
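
As a minimal sketch of the whitespace case, the example below parses a small hypothetical table from a string instead of a file. The content of `text` is invented for illustration and is not part of the course data.

```python
import io
import pandas as pd

# Hypothetical table with an uneven number of spaces between the columns
text = """lineage      length  ortholog
Human_aa        484   1
Chimpanzee_aa   493   1
Gorilla_aa   511      1
"""

# The raw string keeps the backslash of the regular expression intact
table = pd.read_csv(io.StringIO(text), sep=r'\s+')
```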

Note that _read_csv_ automatically designated the first row in the file to be a header row.

We can override the default behavior of _read_csv_ by customising `header`, `names` and `index_col` arguments.

In [102]:
pd.read_csv("./data/CommonDf.tsv", sep='\t', header=None).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
1,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,...,2,0,98.8679245283019,1.13207547169812,0,0,0.377358490566038,0.754716981132076,0,0
2,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,...,1,0,98.5781990521327,1.4218009478673,0,0,1.18483412322275,0.23696682464455,0,0
3,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,...,1,0,99.5283018867925,0.471698113207552,0.235849056603774,0,0,0.235849056603774,0,0
4,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,...,1,0,99.4140625,0.5859375,0.390625,0,0,0.1953125,0,0


We can make the first column the DataFrame's `index` by passing its column index (0) in the file to the `index_col` argument of `read_csv()`:

In [103]:
CommonDf = pd.read_csv("./data/CommonDf.tsv", sep='\t', index_col=0)
CommonDf.head()

Unnamed: 0_level_0,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,GG_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,277,...,2.0,0.0,98.867925,1.132075,0.0,0.0,0.377358,0.754717,0.0,0.0
ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,422,...,1.0,0.0,98.578199,1.421801,0.0,0.0,1.184834,0.236967,0.0,0.0
ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,453,...,1.0,0.0,99.528302,0.471698,0.235849,0.0,0.0,0.235849,0.0,0.0
ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,512,...,1.0,0.0,99.414062,0.585938,0.390625,0.0,0.0,0.195312,0.0,0.0
ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441,441,441,...,5.0,3.0,97.716895,2.283105,0.228311,0.0,0.228311,1.141553,0.684932,0.0


We can make the first column the index by passing its label: 

In [104]:
CommonDf = pd.read_csv("./data/CommonDf.tsv", sep='\t', index_col='Id')
CommonDf.head()

Unnamed: 0_level_0,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,GG_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,277,...,2.0,0.0,98.867925,1.132075,0.0,0.0,0.377358,0.754717,0.0,0.0
ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,422,...,1.0,0.0,98.578199,1.421801,0.0,0.0,1.184834,0.236967,0.0,0.0
ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,453,...,1.0,0.0,99.528302,0.471698,0.235849,0.0,0.0,0.235849,0.0,0.0
ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,512,...,1.0,0.0,99.414062,0.585938,0.390625,0.0,0.0,0.195312,0.0,0.0
ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441,441,441,...,5.0,3.0,97.716895,2.283105,0.228311,0.0,0.228311,1.141553,0.684932,0.0


We can specify the first two columns to make a combined index. This can be especially useful when one column does not provide a unique index for each row.

In [105]:
CommonDf = pd.read_csv("./data/CommonDf.tsv", sep='\t', index_col=['Id','HS_gene_id'])
CommonDf.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,GG_aa,NL_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
Id,HS_gene_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,277,267,...,2.0,0.0,98.867925,1.132075,0.0,0.0,0.377358,0.754717,0.0,0.0
ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,422,422,...,1.0,0.0,98.578199,1.421801,0.0,0.0,1.184834,0.236967,0.0,0.0
ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,453,424,...,1.0,0.0,99.528302,0.471698,0.235849,0.0,0.0,0.235849,0.0,0.0
ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,512,512,...,1.0,0.0,99.414062,0.585938,0.390625,0.0,0.0,0.195312,0.0,0.0
ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441,441,441,441,...,5.0,3.0,97.716895,2.283105,0.228311,0.0,0.228311,1.141553,0.684932,0.0


This is called a **hierarchical** index. Its type is `MultiIndex`.

In [106]:
type(CommonDf.index)

pandas.core.indexes.multi.MultiIndex
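
Rows of a hierarchically indexed DataFrame can be selected with one label per level, or on a single level only. The sketch below uses a three-row stand-in for the alignment table, with identifiers copied from the first rows shown above; `df`, `gene` and `by_gene_id` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Id': ['ENST00000000412', 'ENST00000000442', 'ENST00000001008'],
                   'HS_gene_id': ['ENSG00000003056', 'ENSG00000173153', 'ENSG00000004478'],
                   'Gene': ['M6PR', 'ESRRA', 'FKBP4'],
                   'HS_aa': [277, 423, 459]}).set_index(['Id', 'HS_gene_id'])

n_levels = df.index.nlevels   # number of index levels

# A tuple with one label per level selects a single row
gene = df.loc[('ENST00000000442', 'ENSG00000173153'), 'Gene']

# xs selects on one level only and drops that level from the result
by_gene_id = df.xs('ENSG00000004478', level='HS_gene_id')
```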

If we have parts of data that we do not wish to import, we can use the `skiprows` argument:

In [109]:
pd.read_csv("./data/CommonDf.tsv", sep='\t', skiprows=[1,2,3]).head()

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
0,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,...,1.0,0.0,99.414062,0.585938,0.390625,0.0,0.0,0.195312,0.0,0.0
1,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441,441,...,5.0,3.0,97.716895,2.283105,0.228311,0.0,0.228311,1.141553,0.684932,0.0
2,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467,465,...,6.0,0.0,98.494624,1.505376,0.0,0.0,0.0,1.290323,0.0,0.0
3,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1,ENST00000002596,ENSGGOT00000015678,ENSNLET00000020683,ENSPTRT00000029720,307,307,...,14.0,0.0,94.736842,5.263158,0.0,0.0,0.328947,4.605263,0.0,0.0
4,ENST00000002829,ENSG00000001617,SEMA3F,semaphorin 3F,ENST00000002829,ENSGGOT00000052676,ENSNLET00000008877,ENSPTRT00000106552,785,785,...,4.0,1.0,98.455598,1.544402,0.3861,0.0,0.514801,0.514801,0.1287,0.0


If we only need to look at a small number of rows from a rather large data file we can use _nrows_:

In [110]:
pd.read_csv("./data/CommonDf.tsv", sep='\t', nrows=4)

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,...,2,0,98.867925,1.132075,0.0,0,0.377358,0.754717,0,0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,...,1,0,98.578199,1.421801,0.0,0,1.184834,0.236967,0,0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,...,1,0,99.528302,0.471698,0.235849,0,0.0,0.235849,0,0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,...,1,0,99.414062,0.585938,0.390625,0,0.0,0.195312,0,0


We may also process our data in chunks. With the _chunksize_ argument, _read_csv_ returns an iterable object that can be used in a loop.

For example, our alignment table can be organized by orthologs, with 15 orthologs represented in each chunk:

In [63]:
pd.read_csv("./data/CommonDf.tsv", sep='\t', chunksize=15)

<pandas.io.parsers.TextFileReader at 0x7f28a578b100>

In [65]:
data_chunks = pd.read_csv("./data/CommonDf.tsv", sep='\t', chunksize=15)

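In [67]:
# Each call to next() returns the following chunk as a DataFrame
# (the rows shown below begin at index 15, where the second chunk starts)
next(data_chunks)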

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
15,ENST00000005284,ENSG00000006116,CACNG3,calcium voltage-gated channel auxiliary subuni...,ENST00000005284,ENSGGOT00000002794,ENSNLET00000015190,ENSPTRT00000014550,315,315,...,0,0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0
16,ENST00000005286,ENSG00000006118,TMEM132A,transmembrane protein 132A,ENST00000005286,ENSGGOT00000008704,ENSNLET00000017437,ENSPTRT00000006947,1024,1024,...,16,1,97.183099,2.816901,0.704225,0.301811,0.100604,1.609658,0.100604,0
17,ENST00000005340,ENSG00000004975,DVL2,dishevelled segment polarity protein 2,ENST00000005340,ENSGGOT00000033070,ENSNLET00000010284,ENSPTRT00000015976,736,736,...,4,0,99.297753,0.702247,0.0,0.0,0.140449,0.561798,0.0,0
18,ENST00000005386,ENSG00000005175,RPAP3,RNA polymerase II associated protein 3,ENST00000005386,ENSGGOT00000026642,ENSNLET00000022858,ENSPTRT00000008998,665,665,...,7,0,97.596154,2.403846,0.480769,0.320513,0.480769,1.121795,0.0,0
19,ENST00000005558,ENSG00000006652,IFRD1,interferon related developmental regulator 1,ENST00000005558,ENSGGOT00000030418,ENSNLET00000014451,ENSPTRT00000047096,451,452,...,3,0,98.642534,1.357466,0.452489,0.0,0.226244,0.678733,0.0,0
20,ENST00000006275,ENSG00000007255,TRAPPC6A,trafficking protein particle complex 6A,ENST00000006275,ENSGGOT00000008723,ENSNLET00000057243,ENSPTRT00000085516,173,173,...,6,0,96.531792,3.468208,0.0,0.0,0.0,3.468208,0.0,0
21,ENST00000006658,ENSG00000006282,SPATA20,spermatogenesis associated 20,ENST00000006658,ENSGGOT00000003400,ENSNLET00000035659,ENSPTRT00000100460,802,802,...,14,1,97.214854,2.785146,0.265252,0.132626,0.265252,1.856764,0.132626,0
22,ENST00000006777,ENSG00000005486,RHBDD2,rhomboid domain containing 2,ENST00000006777,ENSGGOT00000016990,ENSNLET00000054238,ENSPTRT00000035734,364,364,...,2,0,96.610169,3.389831,0.0,0.0,0.0,3.389831,0.0,0
23,ENST00000007390,ENSG00000007520,TSR3,TSR3 ribosome maturation factor,ENST00000007390,ENSGGOT00000025322,ENSNLET00000051255,ENSPTRT00000014004,312,312,...,13,1,91.878173,8.121827,0.0,1.015228,0.0,6.598985,0.507614,0
24,ENST00000007414,ENSG00000006025,OSBPL7,oxysterol binding protein like 7,ENST00000007414,ENSGGOT00000015402,ENSNLET00000002261,ENSPTRT00000045163,842,842,...,8,0,98.322581,1.677419,0.258065,0.258065,0.129032,1.032258,0.0,0



In [111]:
next(data_chunks)

<pandas.io.parsers.TextFileReader at 0x7fa1b2442700>

In [112]:
mean_alignment_overlap = pd.Series({chunk.Id.values[0]: chunk.overlap.mean() for chunk in data_chunks})

mean_alignment_overlap

ENST00000000412    582.533333
ENST00000005284    483.800000
ENST00000009041    562.333333
ENST00000020926    872.333333
ENST00000040877    468.600000
                      ...    
ENST00000334571    651.400000
ENST00000334815    439.200000
ENST00000335146    576.333333
ENST00000335420    503.200000
ENST00000335765    308.600000
Length: 334, dtype: float64
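
As a minimal sketch of the same chunk-wise pattern, the example below uses a small in-memory table in place of `CommonDf.tsv`. The `Id` and `overlap` values are made up for illustration, and `chunksize=2` is chosen only to keep the output short.

```python
import io
import pandas as pd

# Toy tab-separated table standing in for the alignment file (hypothetical values).
tsv = (
    "Id\toverlap\n"
    "T00\t100\n"
    "T01\t101\n"
    "T02\t102\n"
    "T03\t103\n"
    "T04\t104\n"
    "T05\t105\n"
)

# With chunksize=2, read_csv returns an iterator that yields DataFrames of up to 2 rows.
reader = pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=2)

# Key each chunk by its first Id and store the mean overlap of that chunk.
chunk_means = pd.Series(
    {chunk["Id"].iloc[0]: chunk["overlap"].mean() for chunk in reader}
)
print(chunk_means)
```

The reader is an iterator, so it can be consumed only once; any chunk already taken with `next()` will not be seen again by a later loop.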

Pandas can recognise and parse certain missing data indicators, such as _NA_ and _NULL_, by default.

In [113]:
!cat ./data/Common_missing_Df.tsv

Id	HS_gene_id	Gene	Description	HS	GG	NL	PT	HS_aa	PT_aa	GG_aa	NL_aa	overlap	AbsId	Subs	HS_Subs	PT_Subs	GG_Subs	NL_Subs	#1#_Subs	Convergent_Subs	OnlyInGpId	OnlyOutGpId	OneInOutId	NoId	%AbsId	%Subs	%HS_Subs	%PT_Subs	%GG_Subs	%NL_Subs	%#1#_Subs	%Convergent_Subs	%OnlyInGpId	%OnlyOutGpId	%OneInOutId	%NoId
ENST00000000412	ENSG00000003056	M6PR	mannose-6-phosphate receptor%2C cation dependent 	ENST00000000412	ENSGGOT00000059917	ENSNLET00000034947	ENSPTRT00000008593	277	277	277	267	265	262	3	0	0	1	2	0	0	0	0	0	0	98.8679245283019	1.13207547169812	0	0	0.377358490566038	0.754716981132076	0	0	0	0	0	0
ENST00000000442	ENSG00000173153	ESRRA	estrogen related receptor alpha 	ENST00000000442	ENSGGOT00000001342	ENSNLET00000006350	ENSPTRT00000007149	423	422	422	422	422	416	6	0	0	5	1	0	0	0	0	0	0	98.5781990521327	1.4218009478673	0	0	1.18483412322275	0.23696682464455	0	0	0	0	0	0
ENST00000001008	ENSG00000004478	FKBP4	FKBP prolyl isomerase 4 	ENST00000001008	ENSGGOT00000010515	ENSNLET00000003652	ENSPTRT0000000838

In [114]:
pd.read_csv("./data/Common_missing_Df.tsv", sep='\t')

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277.0,277.0,...,0.0,0.0,0.377358,0.754717,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423.0,422.0,...,0.0,0.0,1.184834,0.236967,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459.0,459.0,...,0.235849,0.0,0.0,0.235849,0.0,0.0,0.0,0.0,0.0,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512.0,512.0,...,0.390625,0.0,0.0,0.195312,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441.0,441.0,...,0.228311,0.0,0.228311,1.141553,0.684932,0.0,0.0,0.0,0.0,0.0
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467.0,465.0,...,0.0,0.0,0.0,1.290323,0.0,0.0,0.215054,0.0,0.0,0.0
6,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1,ENST00000002596,ENSGGOT00000015678,ENSNLET00000020683,ENSPTRT00000029720,307.0,307.0,...,,,,,,,,,,
7,ENST00000002829,ENSG00000001617,SEMA3F,semaphorin 3F,,,,,,,...,,,,,,,,,,
8,ENST00000003084,ENSG00000001626,CFTR,CF transmembrane conductance regulator,,,,,,,...,,,,,,,,,,
9,ENST00000003302,ENSG00000048028,USP28,ubiquitin specific peptidase 28,,,,,,,...,,,,,,,,,,


Here, Pandas recognised _NA_ and blank fields as missing data.

In [115]:
pd.isnull(pd.read_csv("./data/Common_missing_Df.tsv", sep='\t'))

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,True
7,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
8,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
9,False,False,False,False,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
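
A boolean table like the one above is often easier to read when summarised. The sketch below, on a made-up three-row table, counts the missing values per column and flags the rows that contain at least one missing value; the column names are borrowed from the alignment table, but the values are hypothetical.

```python
import io
import pandas as pd

# "NA", "NULL" and empty fields are all treated as missing by default.
tsv = "Id\tHS_aa\tSubs\nT1\t277\t3\nT2\tNA\t\nT3\tNULL\t6\n"
toy = pd.read_csv(io.StringIO(tsv), sep="\t")

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# True for every row that has at least one missing value.
rows_with_gaps = toy.isna().any(axis=1)

print(missing_per_column)
print(rows_with_gaps)
```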


Sometimes missing data are labelled inconsistently. Here, a question mark "?" and a large negative number are used instead of _NA_. We can pass these additional symbols to the **na_values** argument:
   

In [116]:
MisDf = pd.read_csv("./data/Common_missing_Df.tsv", sep='\t', na_values=['?', -99999])
MisDf

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277.0,277.0,...,0.0,0.0,0.377358,0.754717,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423.0,422.0,...,0.0,0.0,1.184834,0.236967,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459.0,459.0,...,0.235849,0.0,0.0,0.235849,0.0,0.0,0.0,0.0,0.0,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512.0,512.0,...,0.390625,0.0,0.0,0.195312,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441.0,441.0,...,0.228311,0.0,0.228311,1.141553,0.684932,0.0,0.0,0.0,0.0,0.0
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467.0,465.0,...,0.0,0.0,0.0,1.290323,0.0,0.0,0.215054,0.0,0.0,0.0
6,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1,ENST00000002596,ENSGGOT00000015678,ENSNLET00000020683,ENSPTRT00000029720,307.0,307.0,...,,,,,,,,,,
7,ENST00000002829,ENSG00000001617,SEMA3F,semaphorin 3F,,,,,,,...,,,,,,,,,,
8,ENST00000003084,ENSG00000001626,CFTR,CF transmembrane conductance regulator,,,,,,,...,,,,,,,,,,
9,ENST00000003302,ENSG00000048028,USP28,ubiquitin specific peptidase 28,,,,,,,...,,,,,,,,,,


We can also specify *na_values* for each column by passing an appropriate dict as the argument for *na_values*.
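
A minimal sketch of the per-column form, assuming a made-up table in which "?" marks missing data only in `HS_aa` and `-99999` only in `Subs` (the values are hypothetical):

```python
import io
import pandas as pd

# Toy table: -99999 appears in both columns, but is a missing-data marker only in Subs.
tsv = "Gene\tHS_aa\tSubs\nM6PR\t277\t3\nESRRA\t?\t-99999\nFKBP4\t-99999\t6\n"

# Keys are column names; values list the extra markers for that column only.
per_column = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    na_values={"HS_aa": ["?"], "Subs": ["-99999"]},
)
print(per_column)
```

In this sketch the `-99999` in `HS_aa` is kept as an ordinary number, because that marker was declared only for `Subs`.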

# Pandas Functionality

This section introduces some key functionality of Pandas.

For this we will look into some football data.

In [117]:
football = pd.read_csv("./data/football.csv", index_col='id')
football.head()

Unnamed: 0_level_0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
7,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
3,A_Mooy,30,Midfielder,Huddersfield Town,Australia,2327,1190,1137,29,15,...,1.78,51,80,582,2327,0.15,144,233,-1,3
11,A_Ramsey,30,Midfielder,Arsenal,Wales,1327,689,638,28,14,...,0.81,111,47,0,221,0.0,69,8,-1,5
14,A_Rowe,20,Forward,Huddersfield Town,England,69,14,55,2,1,...,1.3,69,35,0,0,0.0,-1,-1,-1,31


In [118]:
football.columns

Index(['name', 'age', 'position', 'Current Club', 'nationality',
       'minutes_played_overall', 'minutes_played_home', 'minutes_played_away',
       'appearances_overall', 'appearances_home', 'appearances_away',
       'goals_overall', 'goals_home', 'goals_away', 'assists_overall',
       'assists_home', 'assists_away', 'penalty_goals', 'penalty_misses',
       'clean_sheets_overall', 'clean_sheets_home', 'clean_sheets_away',
       'conceded_overall', 'conceded_home', 'conceded_away',
       'yellow_cards_overall', 'red_cards_overall',
       'goals_involved_per_90_overall', 'assists_per_90_overall',
       'goals_per_90_overall', 'goals_per_90_home', 'goals_per_90_away',
       'min_per_goal_overall', 'conceded_per_90_overall',
       'min_per_conceded_overall', 'min_per_match', 'min_per_card_overall',
       'min_per_assist_overall', 'cards_per_90_overall',
       'rank_in_league_top_attackers', 'rank_in_league_top_midfielders',
       'rank_in_league_top_defenders', 'rank_in_club

It seems that the `name` column by itself might not be unique. Let's try to make a unique index by combining `name` and `age`:

In [119]:
player_id = football.name + " " + football.age.astype(str)
football_newind = football.copy()
football_newind.index = player_id
football_newind.head()

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
A_Cresswell 31,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
A_Lennon 33,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
A_Mooy 30,A_Mooy,30,Midfielder,Huddersfield Town,Australia,2327,1190,1137,29,15,...,1.78,51,80,582,2327,0.15,144,233,-1,3
A_Ramsey 30,A_Ramsey,30,Midfielder,Arsenal,Wales,1327,689,638,28,14,...,0.81,111,47,0,221,0.0,69,8,-1,5
A_Rowe 20,A_Rowe,20,Forward,Huddersfield Town,England,69,14,55,2,1,...,1.3,69,35,0,0,0.0,-1,-1,-1,31


This looks okay, but let's check:

In [120]:
football_newind.index.is_unique

False

So, we can assign non-unique indices. Our choice was not unique because at least two players have the same name and age.

In [121]:
pd.Series(football_newind.index).value_counts()

J_Murphy 25                     2
D_Rice 21                       1
D_Solanke 23                    1
L_Bonatini_Lohner_Maia 26       1
M_Kilman 23                     1
                               ..
D_José_Teixeira_da_Silva 24     1
J_Ruddy 34                      1
S_Long 33                       1
J_McCarthy 30                   1
G-Kevin_N'Koudou_Mbida 25       1
Length: 571, dtype: int64

Because the index is not unique, indexing by label may return multiple rows for the same label:

In [122]:
football_newind.loc['J_Murphy 25']

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
J_Murphy 25,J_Murphy,25,Midfielder,Newcastle United,England,300,167,133,9,3,...,2.1,43,33,0,300,0.0,301,21,-1,17
J_Murphy 25,J_Murphy,25,Forward,Cardiff City,England,1825,1123,702,29,17,...,1.33,68,63,913,913,0.1,119,147,-1,6
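
The behaviour above can be reproduced on a three-row toy table. This is only a sketch: the `club` column name and the rows are a simplified stand-in for the football data.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "name": ["J_Murphy", "J_Murphy", "D_Rice"],
        "age": [25, 25, 21],
        "club": ["Newcastle United", "Cardiff City", "West Ham United"],
    }
)
players.index = players["name"] + " " + players["age"].astype(str)

# The combined label is shared by two rows, so the index is not unique.
print(players.index.is_unique)

# List the labels that occur more than once.
dupes = players.index[players.index.duplicated(keep=False)].unique()

# A repeated label returns a DataFrame with all matching rows;
# a label that occurs once returns a single row as a Series.
rows = players.loc["J_Murphy 25"]
single = players.loc["D_Rice 21"]
```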


We can create a truly unique index by combining `name`, `age` and `Current Club`:

In [126]:
player_unique = football.name + " " + football.age.astype(str) + " " + football['Current Club']
football_newind = football.copy()
football_newind.index = player_unique
football_newind.head()

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
A_Cresswell 31 West Ham United,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
A_Lennon 33 Burnley,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
A_Mooy 30 Huddersfield Town,A_Mooy,30,Midfielder,Huddersfield Town,Australia,2327,1190,1137,29,15,...,1.78,51,80,582,2327,0.15,144,233,-1,3
A_Ramsey 30 Arsenal,A_Ramsey,30,Midfielder,Arsenal,Wales,1327,689,638,28,14,...,0.81,111,47,0,221,0.0,69,8,-1,5
A_Rowe 20 Huddersfield Town,A_Rowe,20,Forward,Huddersfield Town,England,69,14,55,2,1,...,1.3,69,35,0,0,0.0,-1,-1,-1,31


In [124]:
football_newind.index.is_unique

True

We can easily create meaningful indices using a hierarchical index; for now, let's stick with the numeric `id` field as our index.
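
As a brief sketch of what a hierarchical index could look like here, the example below builds one with `set_index` on a toy table. The `club` and `goals` column names and all values are assumptions made for illustration.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "name": ["J_Murphy", "J_Murphy", "D_Rice"],
        "age": [25, 25, 21],
        "club": ["Newcastle United", "Cardiff City", "West Ham United"],
        "goals": [1, 4, 0],
    }
)

# Each listed column becomes one level of a MultiIndex; sorting keeps lookups efficient.
hier = players.set_index(["name", "age", "club"]).sort_index()

print(hier.index.is_unique)

# A partial key selects every row under that first-level label.
murphys = hier.loc["J_Murphy"]

# A full key identifies exactly one row.
cardiff_goals = hier.loc[("J_Murphy", 25, "Cardiff City"), "goals"]
```

Compared with concatenating strings, this keeps each component in its own level, so rows can still be selected by name alone.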

### Index manipulation

**Reindexing** allows us to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index and, optionally, fills in missing data.

We may use `reindex` to alter the order of the rows:

In [131]:
football.head()

Unnamed: 0_level_0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
7,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
3,A_Mooy,30,Midfielder,Huddersfield Town,Australia,2327,1190,1137,29,15,...,1.78,51,80,582,2327,0.15,144,233,-1,3
11,A_Ramsey,30,Midfielder,Arsenal,Wales,1327,689,638,28,14,...,0.81,111,47,0,221,0.0,69,8,-1,5
14,A_Rowe,20,Forward,Huddersfield Town,England,69,14,55,2,1,...,1.3,69,35,0,0,0.0,-1,-1,-1,31


In [130]:
football.reindex(football.index[::-1]).head()

Unnamed: 0_level_0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1297,Łukasz_Fabiański,35,Goalkeeper,West Ham United,Poland,3420,1710,1710,38,19,...,1.26,71,90,0,0,0.0,391,343,81,19
1256,Z_Steven_Sessegnon,20,Defender,Fulham,England,0,0,0,0,0,...,0.0,0,0,0,0,0.0,-1,-1,-1,-1
1243,Z_Medley,20,Defender,Arsenal,England,0,0,0,0,0,...,0.0,0,0,0,0,0.0,-1,-1,-1,-1
1231,Y_Bissouma,24,Midfielder,Brighton & Hove Albion,Mali,1769,747,1022,28,13,...,1.53,59,63,354,0,0.25,402,293,-1,17
1223,Y_Tielemans,23,Midfielder,Leicester City,Belgium,1092,575,517,13,7,...,1.07,84,84,546,273,0.16,80,13,-1,4


Notice that the index is not sequential. If we want the table to contain every possible `id` value, we can create an index that ranges from the first to the last `id` number in the database, and Pandas will fill in the missing data with `NaN` values:

In [132]:
# range() excludes its stop value, so add 1 to include the last id
id_range = range(football.index.values.min(), football.index.values.max() + 1)
football.reindex(id_range).head()

Unnamed: 0_level_0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3,A_Mooy,30.0,Midfielder,Huddersfield Town,Australia,2327.0,1190.0,1137.0,29.0,15.0,...,1.78,51.0,80.0,582.0,2327.0,0.15,144.0,233.0,-1.0,3.0
4,,,,,,,,,,,...,,,,,,,,,,
5,A_Cresswell,31.0,Defender,West Ham United,England,1589.0,888.0,701.0,20.0,11.0,...,1.25,72.0,79.0,1589.0,1589.0,0.06,290.0,191.0,80.0,20.0
6,,,,,,,,,,,...,,,,,,,,,,
7,A_Lennon,33.0,Midfielder,Burnley,England,1217.0,487.0,730.0,16.0,7.0,...,1.48,61.0,76.0,1217.0,1217.0,0.07,196.0,187.0,-1.0,10.0


Missing values can be filled either with selected values, or by rule:

In [135]:
football.reindex(id_range, columns=['name', 'age']).ffill().head()

Unnamed: 0_level_0,name,age
id,Unnamed: 1_level_1,Unnamed: 2_level_1
3,A_Mooy,30.0
4,A_Mooy,30.0
5,A_Cresswell,31.0
6,A_Cresswell,31.0
7,A_Lennon,33.0


In [136]:
football.reindex(id_range, fill_value='NotTested', columns=['name']).head()

Unnamed: 0_level_0,name
id,Unnamed: 1_level_1
3,A_Mooy
4,NotTested
5,A_Cresswell
6,NotTested
7,A_Lennon


Remember that `reindex` raises an error if the DataFrame's own index contains duplicate labels.
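
The reindexing options discussed in this section can be compared side by side on a toy table. This is a sketch with three made-up rows; `small` and `dup` are hypothetical names.

```python
import pandas as pd

small = pd.DataFrame(
    {"name": ["A_Mooy", "A_Cresswell", "A_Lennon"]}, index=[3, 5, 7]
)

# range() excludes its stop value, so add 1 to include the last id.
full_ids = range(small.index.min(), small.index.max() + 1)

with_gaps = small.reindex(full_ids)                          # ids 4 and 6 become NaN
forward = small.reindex(full_ids).ffill()                    # copy the previous row down
labelled = small.reindex(full_ids, fill_value="NotTested")   # fixed placeholder

# A DataFrame whose own index has duplicate labels cannot be reindexed.
dup = pd.DataFrame({"name": ["x", "y"]}, index=[1, 1])
try:
    dup.reindex([1, 2])
except ValueError as err:
    print("reindex failed:", err)
```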

We can remove rows or columns via the `drop` method:

In [137]:
football.drop([3, 5])

Unnamed: 0_level_0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
11,A_Ramsey,30,Midfielder,Arsenal,Wales,1327,689,638,28,14,...,0.81,111,47,0,221,0.00,69,8,-1,5
14,A_Rowe,20,Forward,Huddersfield Town,England,69,14,55,2,1,...,1.30,69,35,0,0,0.00,-1,-1,-1,31
20,A_Wan-Bissaka,23,Midfielder,Crystal Palace,England,3135,1605,1530,35,18,...,1.18,76,90,523,1045,0.17,312,160,-1,22
15,A_Sabiri,24,Midfielder,Huddersfield Town,Morocco,49,0,49,2,0,...,5.51,16,25,0,0,0.00,-1,-1,-1,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1223,Y_Tielemans,23,Midfielder,Leicester City,Belgium,1092,575,517,13,7,...,1.07,84,84,546,273,0.16,80,13,-1,4
1231,Y_Bissouma,24,Midfielder,Brighton & Hove Albion,Mali,1769,747,1022,28,13,...,1.53,59,63,354,0,0.25,402,293,-1,17
1243,Z_Medley,20,Defender,Arsenal,England,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1
1256,Z_Steven_Sessegnon,20,Defender,Fulham,England,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1


In [140]:
football.drop(['age','position'], axis=1)

Unnamed: 0_level_0,name,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,appearances_away,goals_overall,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,A_Cresswell,West Ham United,England,1589,888,701,20,11,9,0,...,1.25,72,79,1589,1589,0.06,290,191,80,20
7,A_Lennon,Burnley,England,1217,487,730,16,7,9,1,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
3,A_Mooy,Huddersfield Town,Australia,2327,1190,1137,29,15,14,3,...,1.78,51,80,582,2327,0.15,144,233,-1,3
11,A_Ramsey,Arsenal,Wales,1327,689,638,28,14,14,4,...,0.81,111,47,0,221,0.00,69,8,-1,5
14,A_Rowe,Huddersfield Town,England,69,14,55,2,1,1,0,...,1.30,69,35,0,0,0.00,-1,-1,-1,31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1223,Y_Tielemans,Leicester City,Belgium,1092,575,517,13,7,6,3,...,1.07,84,84,546,273,0.16,80,13,-1,4
1231,Y_Bissouma,Brighton & Hove Albion,Mali,1769,747,1022,28,13,15,0,...,1.53,59,63,354,0,0.25,402,293,-1,17
1243,Z_Medley,Arsenal,England,0,0,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1
1256,Z_Steven_Sessegnon,Fulham,England,0,0,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1
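
Note that `drop` returns a new object and leaves the original untouched. A minimal sketch on a toy table (the values are illustrative only):

```python
import pandas as pd

small = pd.DataFrame(
    {
        "name": ["A_Mooy", "A_Cresswell", "A_Lennon"],
        "age": [30, 31, 33],
        "position": ["Midfielder", "Defender", "Midfielder"],
    },
    index=[3, 5, 7],
)

# Drop rows by index label.
rows_dropped = small.drop([3, 5])

# Drop columns; columns=[...] is equivalent to passing the labels with axis=1.
cols_dropped = small.drop(columns=["age", "position"])

# The original DataFrame still has all rows and columns.
print(small.shape)
```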


## Indexing and Selection
#### Indexing works the same as in NumPy arrays.

In [141]:
# Series
goals = football_newind.goals_overall
goals

A_Cresswell 31 West Ham United           0
A_Lennon 33 Burnley                      1
A_Mooy 30 Huddersfield Town              3
A_Ramsey 30 Arsenal                      4
A_Rowe 20 Huddersfield Town              0
                                        ..
Y_Tielemans 23 Leicester City            3
Y_Bissouma 24 Brighton & Hove Albion     0
Z_Medley 20 Arsenal                      0
Z_Steven_Sessegnon 20 Fulham             0
Łukasz_Fabiański 35 West Ham United      0
Name: goals_overall, Length: 572, dtype: int64

In [142]:
# Numpy-style indexing
goals[:3]

A_Cresswell 31 West Ham United    0
A_Lennon 33 Burnley               1
A_Mooy 30 Huddersfield Town       3
Name: goals_overall, dtype: int64

#### We can also use the labels in the `Index` object to extract values.

In [144]:
# Indexing by label
goals[['Y_Tielemans 23 Leicester City','A_Rowe 20 Huddersfield Town']]

Y_Tielemans 23 Leicester City    3
A_Rowe 20 Huddersfield Town      0
Name: goals_overall, dtype: int64

#### We can also **slice** with the labels, since they are intrinsically ordered within the Index. Unlike positional slices, label slices include the end label:

In [147]:
goals['A_Cresswell 31 West Ham United':'A_Rowe 20 Huddersfield Town']

A_Cresswell 31 West Ham United    0
A_Lennon 33 Burnley               1
A_Mooy 30 Huddersfield Town       3
A_Ramsey 30 Arsenal               4
A_Rowe 20 Huddersfield Town       0
Name: goals_overall, dtype: int64
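
The difference between positional and label-based slicing is easy to miss, so here is a small sketch on a toy Series. The shortened labels are made up; `iloc` and `loc` are used to make the two slicing modes explicit.

```python
import pandas as pd

goals = pd.Series(
    [0, 1, 3, 4, 0],
    index=["A_Cresswell", "A_Lennon", "A_Mooy", "A_Ramsey", "A_Rowe"],
)

# Positional slice: the stop position is excluded, as in NumPy.
by_position = goals.iloc[0:2]

# Label slice: the end label is included.
by_label = goals.loc["A_Cresswell":"A_Mooy"]

print(len(by_position), len(by_label))
```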

#### Similar to the rows, we can also select columns:

1. **One or more at a time**.

In [148]:
football_newind[['name']]

Unnamed: 0,name
A_Cresswell 31 West Ham United,A_Cresswell
A_Lennon 33 Burnley,A_Lennon
A_Mooy 30 Huddersfield Town,A_Mooy
A_Ramsey 30 Arsenal,A_Ramsey
A_Rowe 20 Huddersfield Town,A_Rowe
...,...
Y_Tielemans 23 Leicester City,Y_Tielemans
Y_Bissouma 24 Brighton & Hove Albion,Y_Bissouma
Z_Medley 20 Arsenal,Z_Medley
Z_Steven_Sessegnon 20 Fulham,Z_Steven_Sessegnon


In [149]:
football_newind[['name','age']]

Unnamed: 0,name,age
A_Cresswell 31 West Ham United,A_Cresswell,31
A_Lennon 33 Burnley,A_Lennon,33
A_Mooy 30 Huddersfield Town,A_Mooy,30
A_Ramsey 30 Arsenal,A_Ramsey,30
A_Rowe 20 Huddersfield Town,A_Rowe,20
...,...,...
Y_Tielemans 23 Leicester City,Y_Tielemans,23
Y_Bissouma 24 Brighton & Hove Albion,Y_Bissouma,24
Z_Medley 20 Arsenal,Z_Medley,20
Z_Steven_Sessegnon 20 Fulham,Z_Steven_Sessegnon,20


2.  **Based on column values**.

In [150]:
football_newind[football_newind.age>30]

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
A_Cresswell 31 West Ham United,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
A_Lennon 33 Burnley,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
A_David_Lallana 32 Liverpool,A_David_Lallana,32,Midfielder,Liverpool,England,465,189,276,13,6,...,0.39,233,36,465,0,0.19,379,344,-1,18
A_Mariappa 34 Watford,A_Mariappa,34,Defender,Watford,Jamaica,1921,841,1080,26,12,...,1.36,66,74,640,0,0.14,396,414,94,21
A¡n_San_Miguel_del_Castillo 34 West Ham United,A¡n_San_Miguel_del_Castillo,34,Goalkeeper,West Ham United,Spain,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
W_Hennessey 33 Crystal Palace,W_Hennessey,33,Goalkeeper,Crystal Palace,Wales,1575,675,900,18,8,...,1.31,68,88,0,0,0.00,321,404,87,13
W_Morgan 36 Leicester City,W_Morgan,36,Defender,Leicester City,Jamaica,1926,783,1143,22,9,...,1.12,80,88,385,0,0.23,125,300,48,6
W_Daniel_Caballero 39 Chelsea,W_Daniel_Caballero,39,Goalkeeper,Chelsea,Argentina,180,90,90,2,1,...,0.00,0,90,0,0,0.00,-1,-1,-1,19
W 32 Chelsea,W,32,Forward,Chelsea,Brazil,2108,833,1275,32,15,...,0.98,92,66,1054,351,0.09,132,36,-1,8


We can also use the `query` method to perform selection on a `DataFrame`; it accepts a string that describes what to select.

In [155]:
football_newind.query('age > 30')

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
A_Cresswell 31 West Ham United,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
A_Lennon 33 Burnley,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
A_David_Lallana 32 Liverpool,A_David_Lallana,32,Midfielder,Liverpool,England,465,189,276,13,6,...,0.39,233,36,465,0,0.19,379,344,-1,18
A_Mariappa 34 Watford,A_Mariappa,34,Defender,Watford,Jamaica,1921,841,1080,26,12,...,1.36,66,74,640,0,0.14,396,414,94,21
A¡n_San_Miguel_del_Castillo 34 West Ham United,A¡n_San_Miguel_del_Castillo,34,Goalkeeper,West Ham United,Spain,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
W_Hennessey 33 Crystal Palace,W_Hennessey,33,Goalkeeper,Crystal Palace,Wales,1575,675,900,18,8,...,1.31,68,88,0,0,0.00,321,404,87,13
W_Morgan 36 Leicester City,W_Morgan,36,Defender,Leicester City,Jamaica,1926,783,1143,22,9,...,1.12,80,88,385,0,0.23,125,300,48,6
W_Daniel_Caballero 39 Chelsea,W_Daniel_Caballero,39,Goalkeeper,Chelsea,Argentina,180,90,90,2,1,...,0.00,0,90,0,0,0.00,-1,-1,-1,19
W 32 Chelsea,W,32,Forward,Chelsea,Brazil,2108,833,1275,32,15,...,0.98,92,66,1054,351,0.09,132,36,-1,8


The `DataFrame.index` and `DataFrame.columns` exist in the query namespace by default. 

However, if we want to refer to a variable in the current namespace, we can prefix the variable with `@`:

In [152]:
min_age = 30

In [153]:
football_newind.query('age > @min_age')

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
A_Cresswell 31 West Ham United,A_Cresswell,31,Defender,West Ham United,England,1589,888,701,20,11,...,1.25,72,79,1589,1589,0.06,290,191,80,20
A_Lennon 33 Burnley,A_Lennon,33,Midfielder,Burnley,England,1217,487,730,16,7,...,1.48,61,76,1217,1217,0.07,196,187,-1,10
A_David_Lallana 32 Liverpool,A_David_Lallana,32,Midfielder,Liverpool,England,465,189,276,13,6,...,0.39,233,36,465,0,0.19,379,344,-1,18
A_Mariappa 34 Watford,A_Mariappa,34,Defender,Watford,Jamaica,1921,841,1080,26,12,...,1.36,66,74,640,0,0.14,396,414,94,21
A¡n_San_Miguel_del_Castillo 34 West Ham United,A¡n_San_Miguel_del_Castillo,34,Goalkeeper,West Ham United,Spain,0,0,0,0,0,...,0.00,0,0,0,0,0.00,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
W_Hennessey 33 Crystal Palace,W_Hennessey,33,Goalkeeper,Crystal Palace,Wales,1575,675,900,18,8,...,1.31,68,88,0,0,0.00,321,404,87,13
W_Morgan 36 Leicester City,W_Morgan,36,Defender,Leicester City,Jamaica,1926,783,1143,22,9,...,1.12,80,88,385,0,0.23,125,300,48,6
W_Daniel_Caballero 39 Chelsea,W_Daniel_Caballero,39,Goalkeeper,Chelsea,Argentina,180,90,90,2,1,...,0.00,0,90,0,0,0.00,-1,-1,-1,19
W 32 Chelsea,W,32,Forward,Chelsea,Brazil,2108,833,1275,32,15,...,0.98,92,66,1054,351,0.09,132,36,-1,8
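As a minimal, self-contained sketch of the same idea, the toy table below shows a column name, the index, and local variables (prefixed with `@`) used inside `query` strings. The `players` frame, its values, and the `wanted` list are made up for illustration and are not part of the football dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate `query`
players = pd.DataFrame(
    {"age": [31, 24, 35, 29], "goals": [2, 7, 0, 4]},
    index=["A_One", "B_Two", "C_Three", "D_Four"],
)

min_age = 30
wanted = ["B_Two", "D_Four"]

# Column names are visible inside the query string; local variables need `@`
older = players.query("age > @min_age")

# The index is available under the name `index`
subset = players.query("index in @wanted and goals > 5")

print(older)
print(subset)
```

Only rows for which the whole expression is true are returned, so `subset` keeps a single player here.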


In [156]:
football.columns

Index(['name', 'age', 'position', 'Current Club', 'nationality',
       'minutes_played_overall', 'minutes_played_home', 'minutes_played_away',
       'appearances_overall', 'appearances_home', 'appearances_away',
       'goals_overall', 'goals_home', 'goals_away', 'assists_overall',
       'assists_home', 'assists_away', 'penalty_goals', 'penalty_misses',
       'clean_sheets_overall', 'clean_sheets_home', 'clean_sheets_away',
       'conceded_overall', 'conceded_home', 'conceded_away',
       'yellow_cards_overall', 'red_cards_overall',
       'goals_involved_per_90_overall', 'assists_per_90_overall',
       'goals_per_90_overall', 'goals_per_90_home', 'goals_per_90_away',
       'min_per_goal_overall', 'conceded_per_90_overall',
       'min_per_conceded_overall', 'min_per_match', 'min_per_card_overall',
       'min_per_assist_overall', 'cards_per_90_overall',
       'rank_in_league_top_attackers', 'rank_in_league_top_midfielders',
       'rank_in_league_top_defenders', 'rank_in_club

### **loc**[*selection of row labels*, *selection of column labels*]
`loc` selects subsets of rows and columns by their labels.

In [157]:
football_newind.loc['A_Ramsey 30 Arsenal', ['goals_overall', 'assists_overall', 'penalty_misses', 
                                          'clean_sheets_overall',  'conceded_overall',  'yellow_cards_overall', 'red_cards_overall',]]

goals_overall            4
assists_overall          6
penalty_misses           0
clean_sheets_overall     7
conceded_overall        12
yellow_cards_overall     0
red_cards_overall        0
Name: A_Ramsey 30 Arsenal, dtype: object

In [158]:
football_newind.loc[:'A_Ramsey 30 Arsenal', 'minutes_played_overall']

A_Cresswell 31 West Ham United    1589
A_Lennon 33 Burnley               1217
A_Mooy 30 Huddersfield Town       2327
A_Ramsey 30 Arsenal               1327
Name: minutes_played_overall, dtype: int64

### **iloc**[*selection of row positions*, *selection of column positions*]

Pandas also permits indexing by **position** with the `iloc` attribute.

Thus rows and columns can be selected by absolute position:

In [159]:
football_newind.iloc[:5, 5:8]

Unnamed: 0,minutes_played_overall,minutes_played_home,minutes_played_away
A_Cresswell 31 West Ham United,1589,888,701
A_Lennon 33 Burnley,1217,487,730
A_Mooy 30 Huddersfield Town,2327,1190,1137
A_Ramsey 30 Arsenal,1327,689,638
A_Rowe 20 Huddersfield Town,69,14,55
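One difference between the two indexers is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as with Python sequences. The sketch below illustrates this on a small hypothetical table (the `toy` frame and its values are for illustration only).

```python
import pandas as pd

# Hypothetical toy data, not the football dataset
toy = pd.DataFrame(
    {"minutes": [1589, 1217, 2327, 1327], "goals": [0, 1, 3, 4]},
    index=["A_Cresswell", "A_Lennon", "A_Mooy", "A_Ramsey"],
)

# Label slice: the end label 'A_Mooy' IS included
by_label = toy.loc[:"A_Mooy", "minutes"]

# Position slice: the end position 2 is NOT included
by_position = toy.iloc[:2, 0]

print(by_label)
print(by_position)
```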


### Exercise

You can use the `isin` method to query a DataFrame based upon a list of values as follows: 

    MyDf['lineage'].isin(['Human_aa', 'Gibbon_aa'])

Use `isin` to find all players that played for `Arsenal` or `Liverpool`. How many records contain these values?

In [None]:
# Write your answer here
football[football['Current Club'].isin(['Arsenal','Liverpool'])]

In [163]:
football[football['Current Club'].isin(['Arsenal','Liverpool'])]['Current Club'].value_counts()

Arsenal      31
Liverpool    25
Name: Current Club, dtype: int64
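The Boolean mask returned by `isin` can also be counted directly or negated with `~`. A minimal sketch on a hypothetical `clubs` Series (values chosen for illustration):

```python
import pandas as pd

# Hypothetical club names
clubs = pd.Series(["Arsenal", "Chelsea", "Liverpool", "Arsenal", "Burnley"])

# True where the value is one of the listed clubs
mask = clubs.isin(["Arsenal", "Liverpool"])

selected = clubs[mask]        # rows that match
others = clubs[~mask]         # rows that do not match
n_selected = int(mask.sum())  # True counts as 1, so the sum is the number of matches

print(n_selected)
```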

## Operations

`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.

For example, we can perform arithmetic on the elements of two objects, such as combining football statistics across positions. First, let's (artificially) construct a Series of goals for each of the four positions:

In [None]:
football.position.value_counts()

In [166]:
Midf = football.loc[football.position=='Midfielder', 'goals_overall']
Midf.index = football.name[football.position=='Midfielder']

Defn = football.loc[football.position=='Defender', 'goals_overall']
Defn.index = football.name[football.position=='Defender']

Forw = football.loc[football.position=='Forward', 'goals_overall']
Forw.index = football.name[football.position=='Forward']

Keep = football.loc[football.position=='Goalkeeper', 'goals_overall']
Keep.index = football.name[football.position=='Goalkeeper']


In [167]:
Forw

name
A_Rowe                  0
A_Kamara                3
A_Peñaranda_Maestre     0
A_Diakhaby              0
A_Lookman               0
                       ..
T_Deeney                9
V_Gyökeres              0
V_Janssen               0
W                       3
Y_Muto                  1
Name: goals_overall, Length: 114, dtype: int64

Now, let's add the goals scored by either a forward or a defender:

In [168]:
Total_goals = Defn + Forw
Total_goals

name
A_Barnes                    NaN
A_Barreca                   NaN
A_Carroll                   NaN
A_Christensen               NaN
A_Cresswell                 NaN
                             ..
Y_Fernando_Mina_González    NaN
Y_Muto                      NaN
Y_Valery                    NaN
Z_Medley                    NaN
Z_Steven_Sessegnon          NaN
Name: goals_overall, Length: 303, dtype: float64

Pandas aligns the two Series on their labels and places `NaN` wherever a label is present in only one of them. In our dataset the two Series share no labels, so every sum is `NaN`.

In [169]:
Total_goals[Total_goals.notnull()]

Series([], Name: goals_overall, dtype: float64)

While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with `NaN`. We can use the `add` method to calculate total goals per player, with the `fill_value` argument inserting a zero for goals where labels do not overlap:

In [170]:
Forw.add(Defn, fill_value=0)

name
A_Barnes                     12.0
A_Barreca                     0.0
A_Carroll                     0.0
A_Christensen                 0.0
A_Cresswell                   0.0
                             ... 
Y_Fernando_Mina_González      1.0
Y_Muto                        1.0
Y_Valery                      2.0
Z_Medley                      0.0
Z_Steven_Sessegnon            0.0
Name: goals_overall, Length: 303, dtype: float64
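The alignment behaviour is easier to see on a small example. The sketch below uses two hypothetical Series (`forwards` and `defenders`, with made-up labels) that share exactly one label, and compares `+` with `add(..., fill_value=0)`.

```python
import pandas as pd

# Hypothetical goal counts for two groups that share one label ('C')
forwards = pd.Series({"A": 3, "B": 9, "C": 1})
defenders = pd.Series({"C": 2, "D": 0})

# NaN wherever a label is missing on one side
plain_sum = forwards + defenders

# A label missing on one side is treated as 0 before adding
filled_sum = forwards.add(defenders, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Both results are indexed by the union of the labels; only the treatment of one-sided labels differs.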

Operations can be **broadcast** between rows or columns.

For example, if we subtract the maximum number of goals from the `goals_overall` column, we get how many fewer goals than the maximum each player scored:

In [174]:
football.goals_overall - football.goals_overall.max()

id
5      -22
7      -21
3      -19
11     -18
14     -22
        ..
1223   -19
1231   -22
1243   -22
1256   -22
1297   -22
Name: goals_overall, Length: 572, dtype: int64

Or, looking at things row-wise, we can see how a particular player compares with the rest of the group with respect to important statistics:

In [175]:
football.loc[1223, ["name", "Current Club"]]

name               Y_Tielemans
Current Club    Leicester City
Name: 1223, dtype: object

In [176]:
stats = football[football["Current Club"] == 'Leicester City'][[
    'goals_overall', 'penalty_misses', 'clean_sheets_overall', 
    'yellow_cards_overall', 'red_cards_overall',]]
diff = stats - stats.loc[1223]
diff

Unnamed: 0_level_0,goals_overall,penalty_misses,clean_sheets_overall,yellow_cards_overall,red_cards_overall
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
30,-3,0,-2,-2,0
90,-3,0,7,2,0
110,-3,0,-1,-1,0
217,-3,0,-3,-2,0
233,-3,0,-1,0,0
240,-3,0,0,-2,0
241,-3,0,-3,-2,0
258,1,0,5,0,0
296,-3,0,-3,-2,0
415,-3,0,1,-2,0
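Subtracting a row from a `DataFrame` works because the row is a Series indexed by column name, so Pandas aligns it with the columns and repeats the operation for every row. To broadcast a Series that is indexed by row label instead, the arithmetic methods accept an `axis` argument. A minimal sketch on hypothetical data (`stats_toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical statistics for three players, indexed by a made-up id
stats_toy = pd.DataFrame(
    {"goals": [3, 0, 4], "yellow_cards": [2, 0, 4]},
    index=[10, 20, 30],
)

# Row broadcast: the Series stats_toy.loc[10] is aligned on the column names
vs_player_10 = stats_toy - stats_toy.loc[10]

# Column broadcast: align the Series on the row labels instead
vs_own_goals = stats_toy.sub(stats_toy["goals"], axis=0)

print(vs_player_10)
print(vs_own_goals)
```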


We can also apply functions to each column or row of a `DataFrame`:

In [180]:
stats.apply(np.mean)

goals_overall           1.777778
penalty_misses          0.074074
clean_sheets_overall    5.074074
yellow_cards_overall    2.407407
red_cards_overall       0.185185
dtype: float64

In [184]:
def range_calc(x):
    return x.max() - x.min()

In [181]:
stat_range = lambda x: x.max() - x.min()
stats.apply(stat_range)

goals_overall           18
penalty_misses           1
clean_sheets_overall    10
yellow_cards_overall     8
red_cards_overall        2
dtype: int64

In [188]:
stats.apply(range_calc)

goals_overall           18
penalty_misses           1
clean_sheets_overall    10
yellow_cards_overall     8
red_cards_overall        2
dtype: int64

Let's use `apply` to calculate a meaningful football statistic, the "Goals impact per minute of play time":

$$Impact = \frac{goals + assists + clean}{time}$$

And just for fun, we will format the resulting estimate.

In [189]:
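def impact(x):
    # Players who never played have no defined impact
    if x['minutes_played_overall'] == 0:
        return np.nan
    # Goals, assists and clean sheets per minute played
    impact_value = (x['goals_overall'] + x['assists_overall'] + x['clean_sheets_overall']) / x['minutes_played_overall']
    return impact_value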

football.apply(impact, axis=1).round(3)

id
5       0.003
7       0.005
3       0.003
11      0.013
14      0.000
        ...  
1223    0.009
1231    0.003
1243      NaN
1256      NaN
1297    0.002
Length: 572, dtype: float64
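Row-wise `apply` calls a Python function once per row, which can be slow on large tables. The same statistic can usually be written as plain column arithmetic. The sketch below compares both approaches on a hypothetical three-row table that reuses the column names of the football data; `impact_row` and `toy` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that reuse the column names of the football table
toy = pd.DataFrame({
    "goals_overall": [4, 0, 0],
    "assists_overall": [6, 1, 0],
    "clean_sheets_overall": [7, 3, 0],
    "minutes_played_overall": [1327, 1589, 0],
})

def impact_row(x):
    # Same logic as the row-wise function above
    if x["minutes_played_overall"] == 0:
        return np.nan
    return (x["goals_overall"] + x["assists_overall"]
            + x["clean_sheets_overall"]) / x["minutes_played_overall"]

by_apply = toy.apply(impact_row, axis=1)

# Column arithmetic: zero minutes become NaN, so the division yields NaN there
minutes = toy["minutes_played_overall"].replace(0, np.nan)
by_columns = (toy["goals_overall"] + toy["assists_overall"]
              + toy["clean_sheets_overall"]) / minutes

print(by_columns.round(3))
```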

## Sorting and Ranking

Methods for re-ordering data.

In [190]:
football_newind.sort_index().head()

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
A_Barnes 31 Burnley,A_Barnes,31,Forward,Burnley,England,2400,1307,1093,37,19,...,1.28,71,65,300,1200,0.3,27,172,-1,1
A_Barreca 25 Newcastle United,A_Barreca,25,Defender,Newcastle United,Italy,4,0,4,1,0,...,0.0,0,4,0,0,0.0,-1,-1,-1,16
A_Becker 28 Liverpool,A_Becker,28,Goalkeeper,Liverpool,Brazil,3420,1710,1710,38,19,...,0.5,180,90,3420,0,0.03,417,387,3,16
A_Begović 33 AFC Bournemouth,A_Begović,33,Goalkeeper,AFC Bournemouth,Bosnia and Herzegovina,2160,1080,1080,24,12,...,1.83,49,90,0,0,0.0,358,273,154,19
A_Boruc 40 AFC Bournemouth,A_Boruc,40,Goalkeeper,AFC Bournemouth,Poland,1080,540,540,12,6,...,1.42,64,90,540,0,0.17,361,394,100,20


In [191]:
football_newind.sort_index(ascending=False).head()

Unnamed: 0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
Åukasz_FabiaÅ„ski 35 West Ham United,Åukasz_FabiaÅ„ski,35,Goalkeeper,West Ham United,Poland,3420,1710,1710,38,19,...,1.26,71,90,0,0,0.0,391,343,81,19
Ä°lkay_GÃ¼ndoÄŸan 30 Manchester City,Ä°lkay_GÃ¼ndoÄŸan,30,Midfielder,Manchester City,Germany,2135,985,1150,31,15,...,0.63,142,69,712,712,0.13,78,120,-1,8
Z_Steven_Sessegnon 20 Fulham,Z_Steven_Sessegnon,20,Defender,Fulham,England,0,0,0,0,0,...,0.0,0,0,0,0,0.0,-1,-1,-1,-1
Z_Medley 20 Arsenal,Z_Medley,20,Defender,Arsenal,England,0,0,0,0,0,...,0.0,0,0,0,0,0.0,-1,-1,-1,-1
Y_Valery 21 Southampton,Y_Valery,21,Defender,Southampton,France,1715,1070,645,23,13,...,1.36,66,75,343,1715,0.26,162,214,95,7


Try sorting the **columns** instead of the rows, in ascending order:

In [192]:
football_newind.sort_index(axis=1).head()

Unnamed: 0,Current Club,age,appearances_away,appearances_home,appearances_overall,assists_away,assists_home,assists_overall,assists_per_90_overall,cards_per_90_overall,...,nationality,penalty_goals,penalty_misses,position,rank_in_club_top_scorer,rank_in_league_top_attackers,rank_in_league_top_defenders,rank_in_league_top_midfielders,red_cards_overall,yellow_cards_overall
A_Cresswell 31 West Ham United,West Ham United,31,9,11,20,0,1,1,0.06,0.06,...,England,0,0,Defender,20,290,80,191,0,1
A_Lennon 33 Burnley,Burnley,33,9,7,16,0,1,1,0.07,0.07,...,England,0,0,Midfielder,10,196,-1,187,0,1
A_Mooy 30 Huddersfield Town,Huddersfield Town,30,14,15,29,1,0,1,0.04,0.15,...,Australia,1,0,Midfielder,3,144,-1,233,0,4
A_Ramsey 30 Arsenal,Arsenal,30,14,14,28,1,5,6,0.41,0.0,...,Wales,0,0,Midfielder,5,69,-1,8,0,0
A_Rowe 20 Huddersfield Town,Huddersfield Town,20,1,1,2,0,0,0,0.0,0.0,...,England,0,0,Forward,31,-1,-1,-1,0,0


We can also use `sort_values` to sort a `Series` by value, rather than by label.

In [194]:
football.goals_overall.sort_values(ascending=False)

id
1056    22
1012    22
1090    22
1105    21
453     18
        ..
931      0
933      0
934      0
936      0
5        0
Name: goals_overall, Length: 572, dtype: int64

For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`:

In [195]:
football[['name','Current Club','nationality']].sort_values(ascending=[False,True], 
                                           by=['Current Club', 'nationality']).head(10)

Unnamed: 0_level_0,name,Current Club,nationality
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
937,L_Dendoncker,Wolverhampton Wanderers,Belgium
943,L_Bonatini_Lohner_Maia,Wolverhampton Wanderers,Brazil
107,B_Enobakhare,Wolverhampton Wanderers,England
117,C_John,Wolverhampton Wanderers,England
224,C_Coady,Wolverhampton Wanderers,England
481,J_Ruddy,Wolverhampton Wanderers,England
929,K_Hause,Wolverhampton Wanderers,England
997,M_Kilman,Wolverhampton Wanderers,England
1013,M_Gibbs-White,Wolverhampton Wanderers,England
1082,R_Bennett,Wolverhampton Wanderers,England
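When `by` lists several columns, `ascending` may be a list of the same length, giving one sort direction per column. A minimal sketch on hypothetical rows (the `toy` frame and player names are made up):

```python
import pandas as pd

# Hypothetical rows
toy = pd.DataFrame({
    "name": ["P1", "P2", "P3", "P4"],
    "club": ["Arsenal", "Watford", "Arsenal", "Watford"],
    "nationality": ["Wales", "Jamaica", "England", "England"],
})

# Clubs in descending order; within each club, nationalities in ascending order
ordered = toy.sort_values(by=["club", "nationality"], ascending=[False, True])

print(ordered)
```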


**Ranking** does not re-arrange data, but instead returns a Series of ranks, one for each value relative to the others in the Series.

In [196]:
football.goals_overall.rank(ascending=False)

id
5       420.0
7       215.5
3       108.5
11       82.0
14      420.0
        ...  
1223    108.5
1231    420.0
1243    420.0
1256    420.0
1297    420.0
Name: goals_overall, Length: 572, dtype: float64

Ties are assigned the mean value of the tied ranks, which may result in decimal values.

In [197]:
pd.Series([100,100]).rank()

0    1.5
1    1.5
dtype: float64

Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:

In [198]:
football.goals_overall.rank(method='first', ascending=False)

id
5       268.0
7       164.0
3        92.0
11       73.0
14      269.0
        ...  
1223    125.0
1231    569.0
1243    570.0
1256    571.0
1297    572.0
Name: goals_overall, Length: 572, dtype: float64
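To compare tie-breaking rules side by side, the sketch below ranks a small hypothetical Series (`goals`, with made-up values containing one tie) using four of the `method` options of `rank`.

```python
import pandas as pd

# Hypothetical goal counts; 'a' and 'c' are tied for first place
goals = pd.Series([22, 18, 22, 0], index=["a", "b", "c", "d"])

ranks = pd.DataFrame({
    # ties share the mean of their ranks (the default)
    "average": goals.rank(ascending=False),
    # ties share the lowest rank of the group
    "min": goals.rank(method="min", ascending=False),
    # like 'min', but the next rank follows without a gap
    "dense": goals.rank(method="dense", ascending=False),
    # ties are broken by order of appearance
    "first": goals.rank(method="first", ascending=False),
})

print(ranks)
```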

Calling the `DataFrame`'s `rank` method results in the ranks of all columns:

In [199]:
football.rank(ascending=False).head()

Unnamed: 0_level_0,name,age,position,Current Club,nationality,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,565.0,142.0,478.0,40.5,361.0,235.0,210.0,252.0,276.0,257.5,...,262.5,210.5,180.0,32.0,58.0,336.0,130.0,229.0,96.0,109.0
7,547.0,71.0,106.5,468.0,361.0,275.0,310.0,244.0,323.5,342.0,...,172.5,298.0,210.0,51.0,77.0,324.5,224.0,233.0,374.0,308.5
3,537.0,189.0,106.5,297.5,552.5,131.0,123.0,137.0,160.0,141.5,...,87.0,382.5,169.5,164.0,26.0,210.5,276.0,187.0,374.0,448.5
11,526.0,189.0,106.5,525.0,5.0,263.0,250.0,267.0,177.5,174.5,...,417.0,55.0,385.0,471.5,247.0,471.5,351.0,412.0,374.0,408.5
14,524.0,556.0,326.5,297.5,361.0,463.0,462.5,454.5,469.5,465.5,...,244.0,229.0,428.0,471.5,417.5,471.5,496.0,496.0,374.0,1.0


In [200]:
football[['goals_overall', 'assists_overall']].rank(ascending=False).head()

Unnamed: 0_level_0,goals_overall,assists_overall
id,Unnamed: 1_level_1,Unnamed: 2_level_1
5,420.0,209.0
7,215.5,209.0
3,108.5,209.0
11,82.0,27.5
14,420.0,417.5


### Exercise

Find the **top goal scorer** for each team.

In [None]:
# Write your answer here

## Missing data

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which integrate missing data handling seamlessly so that it can be dealt with easily, and in the manner required by the analysis at hand.

Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy).

In [201]:
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo

0       NaN
1        -3
2      None
3    foobar
dtype: object

In [202]:
foo.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Missing values may be dropped or indexed out:

In [203]:
protein2

lineage
Human_aa           NaN
Chimpanzee_aa    484.0
Gorilla_aa       493.0
Gibbon_aa        511.0
dtype: float64

In [206]:
protein2.dropna(inplace=True)

In [207]:
protein2

lineage
Chimpanzee_aa    484.0
Gorilla_aa       493.0
Gibbon_aa        511.0
dtype: float64

In [208]:
protein2.isnull()

lineage
Chimpanzee_aa    False
Gorilla_aa       False
Gibbon_aa        False
dtype: bool

In [209]:
protein2[protein2.notnull()]

lineage
Chimpanzee_aa    484.0
Gorilla_aa       493.0
Gibbon_aa        511.0
dtype: float64

By default, `dropna` drops entire rows in which one or more values are missing.

In [210]:
MisDf.dropna()

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277.0,277.0,...,0.0,0.0,0.377358,0.754717,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423.0,422.0,...,0.0,0.0,1.184834,0.236967,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459.0,459.0,...,0.235849,0.0,0.0,0.235849,0.0,0.0,0.0,0.0,0.0,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512.0,512.0,...,0.390625,0.0,0.0,0.195312,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441.0,441.0,...,0.228311,0.0,0.228311,1.141553,0.684932,0.0,0.0,0.0,0.0,0.0
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467.0,465.0,...,0.0,0.0,0.0,1.290323,0.0,0.0,0.215054,0.0,0.0,0.0


This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value.

In [211]:
MisDf.dropna(how='all')

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277.0,277.0,...,0.0,0.0,0.377358,0.754717,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423.0,422.0,...,0.0,0.0,1.184834,0.236967,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459.0,459.0,...,0.235849,0.0,0.0,0.235849,0.0,0.0,0.0,0.0,0.0,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512.0,512.0,...,0.390625,0.0,0.0,0.195312,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441.0,441.0,...,0.228311,0.0,0.228311,1.141553,0.684932,0.0,0.0,0.0,0.0,0.0
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467.0,465.0,...,0.0,0.0,0.0,1.290323,0.0,0.0,0.215054,0.0,0.0,0.0
6,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1,ENST00000002596,ENSGGOT00000015678,ENSNLET00000020683,ENSPTRT00000029720,307.0,307.0,...,,,,,,,,,,
7,ENST00000002829,ENSG00000001617,SEMA3F,semaphorin 3F,,,,,,,...,,,,,,,,,,
8,ENST00000003084,ENSG00000001626,CFTR,CF transmembrane conductance regulator,,,,,,,...,,,,,,,,,,
9,ENST00000003302,ENSG00000048028,USP28,ubiquitin specific peptidase 28,,,,,,,...,,,,,,,,,,


This can be customized further with the `thresh` argument, which specifies the minimum number of non-missing values a row must contain in order to be kept.

In [228]:
MisDf.dropna(thresh=8)

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277.0,277.0,...,0.0,0.0,0.377358,0.754717,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423.0,422.0,...,0.0,0.0,1.184834,0.236967,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459.0,459.0,...,0.235849,0.0,0.0,0.235849,0.0,0.0,0.0,0.0,0.0,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512.0,512.0,...,0.390625,0.0,0.0,0.195312,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441.0,441.0,...,0.228311,0.0,0.228311,1.141553,0.684932,0.0,0.0,0.0,0.0,0.0
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467.0,465.0,...,0.0,0.0,0.0,1.290323,0.0,0.0,0.215054,0.0,0.0,0.0
6,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1,ENST00000002596,ENSGGOT00000015678,ENSNLET00000020683,ENSPTRT00000029720,307.0,307.0,...,,,,,,,,,,


This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.
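The three variants of `dropna` shown above can be compared on a table small enough to check by eye. The sketch below uses a hypothetical frame (`toy`, with made-up values) whose rows contain three, two, one and zero non-missing values respectively.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0..3 hold 3, 2, 1 and 0 non-missing values
toy = pd.DataFrame({
    "gene": ["M6PR", "ESRRA", "CFTR", None],
    "HS_aa": [277.0, 423.0, np.nan, np.nan],
    "PT_aa": [277.0, np.nan, np.nan, np.nan],
})

any_missing = toy.dropna()            # keeps only rows without any missing value
all_missing = toy.dropna(how="all")   # drops only rows where every value is missing
at_least_two = toy.dropna(thresh=2)   # keeps rows with at least 2 non-missing values

print(list(any_missing.index), list(all_missing.index), list(at_least_two.index))
```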

### Exercise

Try using the `axis` argument to drop columns with missing values:

In [232]:
# Write your answer here
MisDf.dropna(axis=1, thresh=8)

Unnamed: 0,Id,HS_gene_id,Gene,Description
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2
6,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1
7,ENST00000002829,ENSG00000001617,SEMA3F,semaphorin 3F
8,ENST00000003084,ENSG00000001626,CFTR,CF transmembrane conductance regulator
9,ENST00000003302,ENSG00000048028,USP28,ubiquitin specific peptidase 28


Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the `fillna` method.

In [237]:
MisDf.fillna({'HS': 0, 'PT':-99})

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
0,ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277.0,277.0,...,0.0,0.0,0.377358,0.754717,0.0,0.0,0.0,0.0,0.0,0.0
1,ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423.0,422.0,...,0.0,0.0,1.184834,0.236967,0.0,0.0,0.0,0.0,0.0,0.0
2,ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459.0,459.0,...,0.235849,0.0,0.0,0.235849,0.0,0.0,0.0,0.0,0.0,0.0
3,ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512.0,512.0,...,0.390625,0.0,0.0,0.195312,0.0,0.0,0.0,0.0,0.0,0.0
4,ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441.0,441.0,...,0.228311,0.0,0.228311,1.141553,0.684932,0.0,0.0,0.0,0.0,0.0
5,ENST00000002165,ENSG00000001036,FUCA2,alpha-L-fucosidase 2,ENST00000002165,ENSGGOT00000011305,ENSNLET00000019262,ENSPTRT00000034475,467.0,465.0,...,0.0,0.0,0.0,1.290323,0.0,0.0,0.215054,0.0,0.0,0.0
6,ENST00000002596,ENSG00000002587,HS3ST1,heparan sulfate-glucosamine 3-sulfotransferase 1,ENST00000002596,ENSGGOT00000015678,ENSNLET00000020683,ENSPTRT00000029720,307.0,307.0,...,,,,,,,,,,
7,ENST00000002829,ENSG00000001617,SEMA3F,semaphorin 3F,0,,,-99,,,...,,,,,,,,,,
8,ENST00000003084,ENSG00000001626,CFTR,CF transmembrane conductance regulator,0,,,-99,,,...,,,,,,,,,,
9,ENST00000003302,ENSG00000048028,USP28,ubiquitin specific peptidase 28,0,,,-99,,,...,,,,,,,,,,


**Note that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or `DataFrame` in place**.

In [238]:
MisDf.tail()

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
18,ENST00000005386,ENSG00000005175,RPAP3,RNA polymerase II associated protein 3,,,,,,,...,,,,,,,,,,
19,ENST00000005558,ENSG00000006652,IFRD1,interferon related developmental regulator 1,,,,,,,...,,,,,,,,,,
20,ENST00000006275,ENSG00000007255,TRAPPC6A,trafficking protein particle complex 6A,,,,,,,...,,,,,,,,,,
21,ENST00000006658,ENSG00000006282,SPATA20,spermatogenesis associated 20,,,,,,,...,,,,,,,,,,
22,ENST00000006777,ENSG00000005486,RHBDD2,rhomboid domain containing 2,,,,,,,...,,,,,,,,,,


We can alter values in-place using `inplace=True`.

In [239]:
MisDf.fillna({'HS': 0, 'PT':0}, inplace=True)
MisDf.tail()

Unnamed: 0,Id,HS_gene_id,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,...,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%Convergent_Subs,%OnlyInGpId,%OnlyOutGpId,%OneInOutId,%NoId
18,ENST00000005386,ENSG00000005175,RPAP3,RNA polymerase II associated protein 3,0,,,0,,,...,,,,,,,,,,
19,ENST00000005558,ENSG00000006652,IFRD1,interferon related developmental regulator 1,0,,,0,,,...,,,,,,,,,,
20,ENST00000006275,ENSG00000007255,TRAPPC6A,trafficking protein particle complex 6A,0,,,0,,,...,,,,,,,,,,
21,ENST00000006658,ENSG00000006282,SPATA20,spermatogenesis associated 20,0,,,0,,,...,,,,,,,,,,
22,ENST00000006777,ENSG00000005486,RHBDD2,rhomboid domain containing 2,0,,,0,,,...,,,,,,,,,,


Missing values can also be interpolated, using any one of a variety of methods:

In [244]:
protein

lineage
Human_aa           NaN
Chimpanzee_aa    493.0
Gorilla_aa       511.0
Gibbon_aa        462.0
Name: length, dtype: float64

In [243]:
protein.fillna(method='bfill')

lineage
Human_aa         493.0
Chimpanzee_aa    493.0
Gorilla_aa       511.0
Gibbon_aa        462.0
Name: length, dtype: float64
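Recent Pandas releases deprecate the `method` argument of `fillna` in favor of the dedicated `bfill` and `ffill` methods, and `interpolate` estimates a value from its neighbours rather than copying one. The sketch below compares the three on a hypothetical Series (`lengths`, with two values deliberately set to `NaN` for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical protein lengths with two missing values
lengths = pd.Series(
    [np.nan, 493.0, np.nan, 462.0],
    index=["Human_aa", "Chimpanzee_aa", "Gorilla_aa", "Gibbon_aa"],
)

back = lengths.bfill()          # copy the next valid value backward
forward = lengths.ffill()       # copy the previous valid value forward
linear = lengths.interpolate()  # linear interpolation between valid neighbours

print(pd.DataFrame({"bfill": back, "ffill": forward, "interpolate": linear}))
```

Note that `ffill` and the default `interpolate` leave the first value missing, because there is no earlier observation to work from.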

## Data summarization

We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures.

In [245]:
impact = ['goals_overall', 'assists_overall', 'clean_sheets_overall']
football[impact].sum()

goals_overall           1040
assists_overall          742
clean_sheets_overall    2847
dtype: int64

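Clearly, `sum` is more meaningful for some columns than others. For methods like `mean`, for which application to string variables is not just meaningless but impossible, older Pandas releases automatically excluded these columns, whereas from version 2.0 onward you need to pass `numeric_only=True` or select the numeric columns first, as we do here: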

In [246]:
football[impact].mean()

goals_overall           1.818182
assists_overall         1.297203
clean_sheets_overall    4.977273
dtype: float64

**The important difference between NumPy's functions and Pandas' methods is that the latter have built-in support for handling missing data.**

In [247]:
protein

lineage
Human_aa           NaN
Chimpanzee_aa    493.0
Gorilla_aa       511.0
Gibbon_aa        462.0
Name: length, dtype: float64

In [248]:
protein.mean()

488.6666666666667

Sometimes we may not want to ignore missing values, but rather allow the `nan` to propagate.

In [251]:
protein.mean(skipna=False)

nan
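For comparison, the sketch below computes the mean of the same kind of data with Pandas and with NumPy. The values mirror the `protein` Series above; the variable names are illustrative. NumPy's plain `np.mean` propagates `NaN`, and a separate NaN-aware function, `np.nanmean`, has to be chosen explicitly.

```python
import numpy as np
import pandas as pd

values = [np.nan, 493.0, 511.0, 462.0]
series = pd.Series(values)
array = np.array(values)

pandas_mean = series.mean()                # NaN is skipped by default
pandas_strict = series.mean(skipna=False)  # NaN propagates
numpy_mean = np.mean(array)                # NaN propagates
numpy_nanmean = np.nanmean(array)          # NumPy's NaN-aware variant

print(pandas_mean, pandas_strict, numpy_mean, numpy_nanmean)
```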

Passing `axis=1` will summarize over rows instead of columns, which only makes sense in certain situations.

In [252]:
total_impact = football[impact].sum(axis=1)
total_impact.sort_values(ascending=False)

id
1012    51
1105    48
269     46
1060    45
1090    43
        ..
107      0
1031     0
1027     0
945      0
486      0
Length: 572, dtype: int64

A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is `describe`:

In [253]:
football.describe()

Unnamed: 0,age,minutes_played_overall,minutes_played_home,minutes_played_away,appearances_overall,appearances_home,appearances_away,goals_overall,goals_home,goals_away,...,conceded_per_90_overall,min_per_conceded_overall,min_per_match,min_per_card_overall,min_per_assist_overall,cards_per_90_overall,rank_in_league_top_attackers,rank_in_league_top_midfielders,rank_in_league_top_defenders,rank_in_club_top_scorer
count,572.0,572.0,572.0,572.0,572.0,572.0,572.0,572.0,572.0,572.0,...,572.0,572.0,572.0,572.0,572.0,572.0,572.0,572.0,572.0,572.0
mean,27.833916,1313.013986,656.692308,656.321678,18.321678,9.155594,9.166084,1.818182,1.005245,0.812937,...,1.206154,63.416084,56.318182,473.501748,462.798951,0.139773,153.561189,153.561189,26.229021,11.316434
std,4.653158,1097.063878,557.817278,549.53464,12.879531,6.53177,6.500333,3.474473,2.148997,1.615915,...,0.953734,48.152577,30.764232,610.155728,732.959698,0.232466,139.549368,139.549368,49.669362,8.473894
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0
25%,25.0,208.25,90.0,97.5,5.75,3.0,3.0,0.0,0.0,0.0,...,0.6625,40.0,34.75,0.0,0.0,0.0,-1.0,-1.0,-1.0,4.0
50%,28.0,1103.5,540.0,557.0,19.0,9.5,9.0,0.0,0.0,0.0,...,1.21,62.5,67.0,327.5,0.0,0.1,133.5,133.5,-1.0,11.0
75%,31.0,2172.5,1126.25,1116.5,30.0,15.0,15.0,2.0,1.0,1.0,...,1.58,80.0,83.0,670.75,681.5,0.2,276.25,276.25,32.25,18.0
max,41.0,3420.0,1710.0,1710.0,38.0,19.0,19.0,22.0,18.0,11.0,...,8.57,353.0,90.0,3420.0,3420.0,4.09,419.0,419.0,175.0,31.0


`describe` can detect non-numeric data and sometimes yield useful information about it.

In [254]:
football.name.describe()

count        572
unique       568
top       B_Sako
freq           2
Name: name, dtype: object

We can also calculate summary statistics *across* multiple columns, for example, correlation and covariance.

$$cov(x,y) = \frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$$

In [255]:
football.appearances_overall.cov(football.minutes_played_overall)

13405.417559673248

$$corr(x,y) = \frac{cov(x,y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

Here $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.

In [256]:
football.goals_overall.corr(football.minutes_played_overall)

0.4462266267806529
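As a sanity check, the two statistics can be computed by hand and compared with the pandas methods. This is a minimal sketch on made-up numbers (the series `x` and `y` are hypothetical). It relies on pandas dividing the sum of cross-products by `n - 1` in `cov`, and on `std` using the same `n - 1` normalization by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only so the arithmetic is easy to follow
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(x)

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of cross-products divided by n - 1
cov_manual = (dx * dy).sum() / (n - 1)

# Pearson correlation: covariance scaled by both sample standard deviations
corr_manual = cov_manual / (x.std() * y.std())

print(cov_manual, x.cov(y))
print(corr_manual, x.corr(y))
```

Because the correlation is rescaled by the standard deviations, it is unitless and always lies between -1 and 1, whereas the covariance depends on the units of both variables.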

Try running `corr` on the entire `football` DataFrame to see what is returned:

In [None]:
# Write answer here

If we have a `DataFrame` with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:

In [257]:
CommonDf.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Gene,Description,HS,GG,NL,PT,HS_aa,PT_aa,GG_aa,NL_aa,...,NL_Subs,#1#_Subs,%AbsId,%Subs,%HS_Subs,%PT_Subs,%GG_Subs,%NL_Subs,%#1#_Subs,%NoId
Id,HS_gene_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
ENST00000000412,ENSG00000003056,M6PR,mannose-6-phosphate receptor%2C cation dependent,ENST00000000412,ENSGGOT00000059917,ENSNLET00000034947,ENSPTRT00000008593,277,277,277,267,...,2.0,0.0,98.867925,1.132075,0.0,0.0,0.377358,0.754717,0.0,0.0
ENST00000000442,ENSG00000173153,ESRRA,estrogen related receptor alpha,ENST00000000442,ENSGGOT00000001342,ENSNLET00000006350,ENSPTRT00000007149,423,422,422,422,...,1.0,0.0,98.578199,1.421801,0.0,0.0,1.184834,0.236967,0.0,0.0
ENST00000001008,ENSG00000004478,FKBP4,FKBP prolyl isomerase 4,ENST00000001008,ENSGGOT00000010515,ENSNLET00000003652,ENSPTRT00000008389,459,459,453,424,...,1.0,0.0,99.528302,0.471698,0.235849,0.0,0.0,0.235849,0.0,0.0
ENST00000001146,ENSG00000003137,CYP26B1,cytochrome P450 family 26 subfamily B member 1,ENST00000001146,ENSGGOT00000004600,ENSNLET00000013223,ENSPTRT00000109607,512,512,512,512,...,1.0,0.0,99.414062,0.585938,0.390625,0.0,0.0,0.195312,0.0,0.0
ENST00000002125,ENSG00000003509,NDUFAF7,NADH:ubiquinone oxidoreductase complex assembl...,ENST00000002125,ENSGGOT00000011414,ENSNLET00000039964,ENSPTRT00000022034,441,441,441,441,...,5.0,3.0,97.716895,2.283105,0.228311,0.0,0.228311,1.141553,0.684932,0.0
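One way to summarize by index level is to group on that level and then aggregate. The sketch below uses a small made-up table; the level names `species` and `gene` and the column `length_aa` are hypothetical, not the actual levels of `CommonDf`. It uses `groupby(level=...)`, which should work in both older and recent pandas releases.

```python
import pandas as pd

# Hypothetical two-level index: (species, gene)
idx = pd.MultiIndex.from_tuples(
    [("HS", "geneA"), ("HS", "geneB"), ("PT", "geneA"), ("PT", "geneB")],
    names=["species", "gene"],
)
lengths = pd.DataFrame({"length_aa": [277, 423, 277, 422]}, index=idx)

# Mean over all genes, separately for each species (outer level)
mean_by_species = lengths.groupby(level="species").mean()

# Mean over all species, separately for each gene (inner level)
mean_by_gene = lengths.groupby(level="gene").mean()

print(mean_by_species)
print(mean_by_gene)
```

The same pattern applies to other aggregations such as `sum`, `max`, or `describe`.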


## Writing Data to Files

As well as reading several input formats, Pandas can export data to a variety of storage formats. Here we focus on the .csv format.

In [258]:
CommonDf.to_csv("./new_CommonDf.csv", sep='\t', index=True, header=True)

The `to_csv` method writes a `DataFrame` to a delimited text file, comma-separated by default. You can specify a custom delimiter (via the `sep` argument; the call above uses a tab), how missing values are written (via the `na_rep` argument), whether the index is written (via the `index` argument), and whether the header is included (via the `header` argument), among other options.
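The effect of these arguments is easiest to see on a tiny table. The following sketch writes to an in-memory buffer instead of a file so that the output can be inspected directly; the `toy` DataFrame, its labels, and the `"NA"` marker are all hypothetical choices for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy table with one missing value
toy = pd.DataFrame(
    {"Gene": ["geneA", "geneB"], "Subs": [3.0, np.nan]},
    index=pd.Index(["id1", "id2"], name="Id"),
)

# Write tab-separated text, marking missing values as "NA"
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", na_rep="NA", index=True, header=True)
text = buffer.getvalue()
print(text)

# Read it back with the same delimiter and missing-value marker
restored = pd.read_csv(
    io.StringIO(text), sep="\t", index_col="Id", na_values=["NA"]
)
print(restored)
```

Reading the file back with the same `sep` is important: a tab-separated file read with the default comma delimiter ends up as a single column.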

### Advanced Exercise: Alignment data

The `data/CommonDf.csv` file contains alignment information for protein orthologs in 4 ape species. The `data/ColumnsInDf.txt` file describes the meaning of each column.
HS, PT, GG, and NL stand for human, chimpanzee, gorilla, and gibbon respectively.

1. Find the gene names (`Gene`) for the longest and shortest proteins, based on their lengths (`_aa`), in each species.
2. Find the gene names (`Gene`) for the orthologs with the longest and shortest final alignment (`overlap`).
3. Find the gene name (`Gene`) with the highest `Subs`, i.e. the total number of substituted sites within the alignment, and its `%Subs` value.
4. Does the gene that has the highest `Subs` value also have the highest `%Subs` value?
5. Find the number of orthologs with zero substitutions (`Subs`).
6. What is the overall substitution rate, defined as the total number of substituted sites multiplied by 100 and divided by the total number of aligned sites?

In [None]:
# Write your answer here

## References

[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) Wes McKinney

[Advanced Statistical Computing at Vanderbilt University's Department of Biostatistics](http://mybinder.org:/repo/fonnesbeck/bios8366)