## Pandas  (from <a href="https://github.com/noise42/datastructures/blob/master/materials/python-data-science-handbook.pdf">here</a> and <a href="https://python4bioinformaticsblog.wordpress.com/index/python-bits/pandas/">here</a>)

[Pandas](https://pandas.pydata.org/) is a package built on top of NumPy, and provides an efficient implementation of a ``DataFrame``. ``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

### The Pandas Series Object
A Pandas ``Series`` is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [None]:
import pandas as pd

data = pd.Series(['RNA', 'gene', 'protein'])
data

The ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes. The ``values`` are simply a NumPy array, while the ``index`` is an array-like object of type ``pd.Index``. Note that the index need not be an integer, but can consist of values of any desired type:

In [None]:
data = pd.Series(
    ['RNA', 'gene', 'protein'],
    index=['ENST', 'ENSG', 'ENSP']  # Ensembl ID prefixes: transcript, gene, protein
)
data

The last line of the output, the `dtype`, is not one of the elements you get when iterating over the values of the ``Series``; it is just another attribute of the object.
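
As a small sketch of the attributes described above, the following cell builds a ``Series`` like the one above and inspects ``values``, ``index`` and ``dtype`` separately. The loop variable name ``element`` is only illustrative.

```python
import pandas as pd

data = pd.Series(
    ['RNA', 'gene', 'protein'],
    index=['ENST', 'ENSG', 'ENSP']
)

print(data.values)   # the stored values, as an array
print(data.index)    # the labels, as a pd.Index
print(data.dtype)    # the element type, an attribute of the whole Series

# Iterating over the Series yields only the values, never the dtype
for element in data:
    print(element)
```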

We can also construct a ``Series`` from a dictionary, and accessing its values works much like accessing the values of a dictionary:

In [None]:
map_dict = {'ENST': 'RNA', 'ENSG': 'gene', 'ENSP': 'protein'}
data = pd.Series(map_dict)

data['ENSG']

``Series`` support slicing just like other arrays:

In [None]:
data['ENSG':]
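
One detail worth checking when slicing by label is that, unlike ordinary Python slices, both end points are included. A minimal sketch, assuming the same dictionary-based ``Series`` as above (the names ``by_label`` and ``by_position`` are illustrative):

```python
import pandas as pd

data = pd.Series({'ENST': 'RNA', 'ENSG': 'gene', 'ENSP': 'protein'})

# Label-based slice: both 'ENST' and 'ENSG' are included
by_label = data['ENST':'ENSG']
print(by_label)

# Position-based slice: the stop position (1) is excluded
by_position = data.iloc[0:1]
print(by_position)
```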

### The Pandas DataFrame Object
The ``DataFrame`` can be thought of as a generalization of both a NumPy array and a dictionary. It can be constructed from two or more dictionaries with the same keys (or from two or more ``Series`` with the same index).

In [None]:
count_dict = {'ENST': 3300, 'ENSG': 18435, 'ENSP': 12034}
groups_dict = {'ENST': 13, 'ENSG': 42, 'ENSP': 157}
 
df = pd.DataFrame({'mapping type': map_dict, 'counts': count_dict, 'classes': groups_dict})
df
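
The construction from ``Series`` mentioned above could look roughly like this. It is only a sketch: the names ``counts``, ``classes`` and ``df_from_series`` are illustrative, and the second ``Series`` is deliberately written in a different order to show that rows are matched by label, not by position.

```python
import pandas as pd

counts = pd.Series({'ENST': 3300, 'ENSG': 18435, 'ENSP': 12034})
classes = pd.Series({'ENSP': 157, 'ENST': 13, 'ENSG': 42})  # different order

# Each Series becomes a column; values are aligned on the index labels
df_from_series = pd.DataFrame({'counts': counts, 'classes': classes})
print(df_from_series)
```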

We can access the index labels with the ``DataFrame`` attribute ``index``. Additionally, the ``DataFrame`` has a ``columns`` attribute, which holds the labels for all columns.

In [None]:
df.index, df.columns

We can access a column like a dictionary or, in the Pandas way, as an attribute:

In [None]:
df['counts']  # like a dictionary

In [None]:
df.counts  # The Pandas way

The main difference is that the dictionary way also supports labels containing spaces or special characters, as well as labels that clash with existing ``DataFrame`` attributes or methods:

In [None]:
df['mapping type']
# df.mapping type  # SyntaxError: attribute access cannot handle a label containing a space
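
A further case where attribute access does not reach the column is a label that coincides with a ``DataFrame`` method. The sketch below uses a hypothetical frame called ``demo`` with a column named ``count``, which is also the name of a ``DataFrame`` method:

```python
import pandas as pd

# Hypothetical frame: 'count' is both a column label and a DataFrame method
demo = pd.DataFrame({
    'count': [1, 2, 3],
    'mapping type': ['RNA', 'gene', 'protein'],
})

print(callable(demo.count))      # attribute access finds the method, not the column
print(demo['count'].tolist())    # dictionary-style access finds the column
```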

This dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [None]:
df['averages'] = df['counts'] / df['classes']
df

### Indexers: loc, iloc
Slicing and indexing conventions can be a source of confusion. For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data[1])   # explicit index when indexing
print(data[1:3]) # implicit index when slicing

Pandas provides some special *indexer* attributes that explicitly expose certain indexing/slicing schemes.

The ``loc`` attribute allows indexing and slicing that always references the explicit index.

In [None]:
print(data.loc[1])
print(data.loc[1:3])

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index.

In [None]:
print(data.iloc[1])
print(data.iloc[1:3])
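
To make the difference concrete, the following sketch restates the two cells above as assertions on the same ``Series``, so the expected results are visible without running them:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc uses the labels 1, 3, 5; a label slice includes its end point
assert data.loc[1] == 'a'
assert data.loc[1:3].tolist() == ['a', 'b']

# iloc uses the positions 0, 1, 2; a position slice excludes its end point
assert data.iloc[1] == 'b'
assert data.iloc[1:3].tolist() == ['b', 'c']
```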

A little tip:
> # explicit is better than implicit.

#### DataFrame as two-dimensional array
We can also view the ``DataFrame`` as an enhanced two-dimensional array. We can examine the raw underlying data array using the ``values`` attribute:

In [None]:
df.values

Using the ``iloc`` indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [None]:
df.iloc[:3, :2]

Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
df.loc[:'ENSG', :'classes']

#### Additional indexing conventions
While indexing a ``DataFrame`` with a single label refers to columns, slicing refers to rows:

In [None]:
df[:'ENSG']

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [None]:
df[df.counts > 5000]
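
A boolean mask can also be combined with a column selection through ``loc``. This is a minimal sketch on a reduced copy of the frame used above; the names ``mask`` and ``selected`` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'counts': {'ENST': 3300, 'ENSG': 18435, 'ENSP': 12034},
    'classes': {'ENST': 13, 'ENSG': 42, 'ENSP': 157},
})

mask = df['counts'] > 5000             # boolean Series, one entry per row
print(mask)

selected = df.loc[mask, ['classes']]   # rows where mask is True, 'classes' column only
print(selected)
```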

## Handling Missing Data
To indicate the presence of missing data in a table, Pandas uses a *sentinel value* that indicates a missing entry. In particular, it uses two already-existing Python null values: the special floating-point ``NaN`` value, and the Python ``None`` object.

#### ``None``: Pythonic missing data
``None`` is a Python singleton object, so it cannot be used in an arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects):

In [None]:
import numpy as np

vals1 = np.array([1, None, 3, 4])
vals1

This ``dtype=object`` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. The use of Python objects in an array means that if you perform aggregations like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally get an error.
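
The error mentioned above can be observed with a short sketch such as the following, where the exception is caught so that the cell still runs to completion (the variable ``error_message`` is illustrative):

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])
print(vals1.dtype)   # object

error_message = None
try:
    vals1.sum()
except TypeError as error:
    # Adding an int and None is not defined in Python
    error_message = str(error)
    print('Aggregation failed:', error_message)
```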

#### ``NaN``: Missing numerical data
``NaN`` (acronym for *Not a Number*) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [None]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

Notice that NumPy chose a native floating-point type for this array, but be aware that the results of arithmetic with ``NaN`` will be another ``NaN``:

In [None]:
vals2.sum(), vals2.min(), vals2.max()

NumPy does provide some special aggregations that will ignore these missing values:

In [None]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
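
In Pandas itself, the two sentinels behave consistently for numerical data: ``None`` is converted to ``NaN``, and aggregations skip missing values unless told otherwise. A small sketch, assuming a purely numerical ``Series`` named ``s``:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, None])

print(s.dtype)                # float64: None was converted to NaN
print(s.isna().tolist())      # which entries are missing
print(s.sum())                # NaN values are skipped by default (skipna=True)
print(s.sum(skipna=False))    # nan, as in plain NumPy arithmetic
```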

## Operating on Null Values

In [None]:
df.loc['ENSG', 'counts'] = np.nan
df.loc['ENSG', 'classes'] = np.nan
df.loc['ENST', 'classes'] = np.nan
df
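
Before dropping or filling anything, it can help to check where the null values are. The sketch below uses ``isna()`` on a reduced copy of the frame above, with the same missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'counts': {'ENST': 3300, 'ENSG': np.nan, 'ENSP': 12034},
    'classes': {'ENST': np.nan, 'ENSG': np.nan, 'ENSP': 157},
})

print(df.isna())         # True where a value is missing
print(df.isna().sum())   # number of missing values per column
```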

### Dropping null values
We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns.

By default, ``dropna()`` will drop all rows in which *any* null value is present.

In [None]:
df.dropna()  # Returns a new DataFrame without the 'ENST' and 'ENSG' rows

Alternatively, you can drop NA values along a different axis; ``axis=1`` (or ``axis='columns'``) drops all columns containing a null value.

In [None]:
df.dropna(axis=1)  # Returns a new DataFrame without the 'counts' and 'classes' columns

You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values. For finer-grained control, the ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [None]:
df.dropna(axis=1, thresh=2)  # The 'classes' column is dropped because it doesn't have at least 2 non-null values
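
The ``how='all'`` option and a row-wise ``thresh`` can be sketched on a small frame with the same pattern of missing values; the name ``demo`` is illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'counts': [3300, np.nan, 12034],
    'classes': [np.nan, np.nan, 157],
}, index=['ENST', 'ENSG', 'ENSP'])

# Only 'ENSG' has every value missing, so only that row is dropped
print(demo.dropna(how='all'))

# thresh=2 keeps rows with at least two non-null values: only 'ENSP'
print(demo.dropna(thresh=2))
```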

### Imputing (filling) null values
Since removing rows or columns can discard too much data for some problems, imputing missing values is a valid alternative. The word 'imputing' refers to using a model to replace missing values. For example, we can replace missing data with:

- a constant value

In [None]:
df.fillna(0)  # Equivalent to df.replace(np.nan, 0); both return a new DataFrame

- the mean, median, or mode of the column to which the missing value belongs

In [None]:
df.fillna(df.mean(numeric_only=True))  # Fill with the means of the numeric columns; returns a new DataFrame
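
Different columns may call for different replacement values. One possible approach, sketched here on an illustrative frame ``demo``, is to pass ``fillna`` a dictionary that maps each column to its own fill value (a median for ``counts`` and a constant for ``classes``):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'counts': [3300, np.nan, 12034],
    'classes': [np.nan, np.nan, 157],
}, index=['ENST', 'ENSG', 'ENSP'])

# One fill value per column: the column median for 'counts', a constant for 'classes'
fill_values = {'counts': demo['counts'].median(), 'classes': 0}
filled = demo.fillna(fill_values)
print(filled)
```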

## Load Data from CSV Files
CSV (comma-separated value) and TSV (tab-separated value) files are common file formats for transferring and storing data.

As an example, we have a file where the values are tab-separated, the first row specifies the column names, and the first column contains the ids.

In [None]:
!head brca_transcripts.txt

This type of file can be loaded into a Pandas ``DataFrame`` using the ``read_csv`` function:

In [None]:
brca1_df = pd.read_csv('brca_transcripts.txt', sep='\t', index_col=0, header=0)
brca1_df

- ``sep`` specifies the delimiter to use (the tab);
- ``index_col`` specifies the column to use as the row labels of the ``DataFrame`` (the first column);
- ``header`` specifies the row number to use as the column names (the first row).
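
The effect of these three parameters can be tried without the file itself. The sketch below reads tab-separated text from memory; the transcript ids and values are invented for illustration, and only the column names ``biotype``, ``bp`` and ``aa`` are taken from the cells of this notebook.

```python
import io
import pandas as pd

# Hypothetical file content: ids and values are invented for illustration
tsv_text = (
    "id\tbiotype\tbp\taa\n"
    "T1\tprotein_coding\t3000\t900\n"
    "T2\tretained_intron\t1500\t0\n"
)

demo_df = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col=0, header=0)
print(demo_df)
```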

## Aggregation and Grouping
Pandas ``Series`` and ``DataFrame``s include a method ``describe()`` that computes several common aggregates for each column and returns the result.

In [None]:
brca1_df.describe()

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.

In [None]:
print(type(brca1_df.groupby('biotype')))

brca1_df.groupby('biotype').describe()

The ``GroupBy`` object supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object.

In [None]:
brca1_df.groupby('biotype')['bp'].mean()
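
To see what the grouped mean does, it can be compared with the same computation written by hand. This is a sketch on an invented frame ``demo_df``; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data with two biotypes
demo_df = pd.DataFrame({
    'biotype': ['protein_coding', 'protein_coding', 'retained_intron'],
    'bp': [3000, 5000, 1500],
})

grouped_mean = demo_df.groupby('biotype')['bp'].mean()
print(grouped_mean)

# The same result by hand: select the rows of each biotype, then average 'bp'
manual = {
    biotype: demo_df.loc[demo_df['biotype'] == biotype, 'bp'].mean()
    for biotype in demo_df['biotype'].unique()
}
print(manual)
```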

### apply
The ``apply()`` method lets you apply a function to the group results.

In [None]:
brca1_df.groupby('biotype')[['bp', 'aa']].apply(np.sum)

More generally, the ``apply()`` method lets you apply a function along an axis of a ``DataFrame``. The objects passed to the function are ``Series`` whose index is:
- either the ``DataFrame``’s index (``axis=0``)
- or the columns (``axis=1``).

In [None]:
brca1_df[['bp', 'aa']].apply(np.sum)            # Total nucleotides and total amino acids

In [None]:
brca1_df[['bp', 'aa']].apply(np.sum, axis=1)    # Nucleotides + amino acids for each transcript
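
The two axes can be compared on a small invented frame. In this sketch the helper ``value_range`` and the data in ``demo_df`` are hypothetical; the function receives one column at a time with ``axis=0`` and one row at a time with ``axis=1``.

```python
import pandas as pd

# Hypothetical transcript lengths
demo_df = pd.DataFrame(
    {'bp': [3000, 5000, 1500], 'aa': [900, 1600, 0]},
    index=['T1', 'T2', 'T3'],
)

def value_range(series):
    """Difference between the largest and smallest value of a Series."""
    return series.max() - series.min()

per_column = demo_df.apply(value_range)          # axis=0: one result per column
per_row = demo_df.apply(value_range, axis=1)     # axis=1: one result per row
print(per_column)
print(per_row)
```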

We can also define an arbitrary function:

In [None]:
def function(row, value):
    """Return 'High' if the row's 'bp' is at least `value`, otherwise 'Low'."""
    if row['bp'] >= value:
        return 'High'
    return 'Low'

## apply() passes only one argument (the row or column) to the function; additional arguments can be supplied through "args"

In [None]:
bp_mean = brca1_df['bp'].mean()
print('bp mean:', bp_mean)

brca1_df['transcript_length'] = brca1_df.apply(function, args = (bp_mean,), axis = 1)
brca1_df
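
For a simple threshold like this one, a vectorized alternative to row-wise ``apply()`` is NumPy's ``np.where``. This is a different technique from the one used above, shown only as a sketch on invented data; it produces the same kind of 'High'/'Low' labels without calling a Python function for every row.

```python
import numpy as np
import pandas as pd

# Hypothetical transcript lengths
demo_df = pd.DataFrame({'bp': [3000, 5000, 1500]}, index=['T1', 'T2', 'T3'])
bp_mean = demo_df['bp'].mean()

# Compare the whole column at once, then pick a label for each element
demo_df['transcript_length'] = np.where(demo_df['bp'] >= bp_mean, 'High', 'Low')
print(demo_df)
```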



### Lambda function

Python <strong>lambdas</strong> are small anonymous functions, subject to a more restrictive but more concise syntax than regular Python functions. An anonymous function is a function defined without a name.

The ``def`` keyword is used to define regular functions, while the ``lambda`` keyword is used to create anonymous functions, with the syntax ``lambda arguments: expression``. A lambda can take any number of arguments but <strong>only one</strong> expression, which is evaluated and returned.

In [None]:
brca1_df['protein_length'] = brca1_df.apply(
    lambda row, value: 'High' if row['aa'] > value else 'Low',
    args=(brca1_df['aa'].mean(),),  # threshold: mean protein length
    axis=1
)
brca1_df

Note that a lambda definition does not include a ``return`` statement; it always consists of a single expression whose value is returned.
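
The correspondence between the two forms can be sketched side by side. The names ``label_def``, ``label_lambda`` and the sample ``lengths`` are hypothetical; both functions compute the same label.

```python
import pandas as pd

# Regular function: explicit name and return statement
def label_def(length, threshold):
    return 'High' if length > threshold else 'Low'

# Equivalent lambda: a single expression, returned implicitly
label_lambda = lambda length, threshold: 'High' if length > threshold else 'Low'

lengths = pd.Series([900, 1600, 0])   # invented protein lengths
threshold = lengths.mean()

# Series.apply passes each value as the first argument, and `args` supplies the rest
labels = lengths.apply(label_lambda, args=(threshold,))
print(labels.tolist())
```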