## Load Data from CSV Files
CSV (comma-separated values) and TSV (tab-separated values) files are common formats for storing and transferring tabular data.

As an example, consider a file in which the values are tab-separated, the first row contains the column names, and the first column contains the transcript IDs.

In [29]:
!head brca_transcripts.txt

transcript_id	biotype	bp	aa
ENST00000352993.7	Protein coding	3668	721
ENST00000354071.7	Protein coding	4497	1399
ENST00000461221.5	Nonsense mediated decay	5693	63
ENST00000461574.1	Protein coding	726	242
ENST00000461798.5	Nonsense mediated decay	582	63


Files of this type can be loaded into a Pandas ``DataFrame`` with the ``read_csv`` function:

In [30]:
brca1_df = pd.read_csv('brca_transcripts.txt', sep = '\t', index_col = 0, header = 0)
brca1_df

transcript_id,biotype,bp,aa
ENST00000352993.7,Protein coding,3668,721
ENST00000354071.7,Protein coding,4497,1399
ENST00000461221.5,Nonsense mediated decay,5693,63
ENST00000461574.1,Protein coding,726,242
ENST00000461798.5,Nonsense mediated decay,582,63


- ``sep`` specifies the delimiter to use (the tab);
- ``index_col`` specifies the column to use as the row labels of the ``DataFrame`` (the first column);
- ``header`` specifies the row number to use as the column names (the first row).
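The three parameters can be tried without touching the file system. The sketch below is only an illustration: it assumes a small in-memory copy of the first rows of the file shown above, and `tsv_text` and `sample_df` are names introduced here for the example.

```python
import io
import pandas as pd

# A small tab-separated sample mirroring the first rows of the file shown above.
tsv_text = (
    "transcript_id\tbiotype\tbp\taa\n"
    "ENST00000352993.7\tProtein coding\t3668\t721\n"
    "ENST00000354071.7\tProtein coding\t4497\t1399\n"
    "ENST00000461221.5\tNonsense mediated decay\t5693\t63\n"
)

# read_csv accepts file-like objects, so StringIO stands in for a file path.
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0, header=0)

print(sample_df.index.name)                      # column used as row labels
print(list(sample_df.columns))                   # remaining column names
print(sample_df.loc["ENST00000354071.7", "bp"])  # label-based lookup by id
```

Because the first column became the index, rows can be looked up directly by transcript id with ``loc``.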

## Aggregation and Grouping
Pandas ``Series`` and ``DataFrame`` objects include a ``describe()`` method that computes several common aggregates and returns the result. For a ``DataFrame``, the aggregates are computed by default for each numeric column.

In [31]:
brca1_df.describe()

,bp,aa
count,5.0,5.0
mean,3033.2,497.6
std,2288.655216,571.29572
min,582.0,63.0
25%,726.0,63.0
50%,3668.0,242.0
75%,4497.0,721.0
max,5693.0,1399.0


Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.

In [32]:
print(type(brca1_df.groupby('biotype')))

brca1_df.groupby('biotype').describe()

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>


,bp,bp,bp,bp,bp,bp,bp,bp,aa,aa,aa,aa,aa,aa,aa,aa
biotype,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Nonsense mediated decay,2.0,3137.5,3614.022759,582.0,1859.75,3137.5,4415.25,5693.0,2.0,63.0,0.0,63.0,63.0,63.0,63.0,63.0
Protein coding,3.0,2963.666667,1981.709952,726.0,2197.0,3668.0,4082.5,4497.0,3.0,787.333333,581.345279,242.0,481.5,721.0,1060.0,1399.0


The ``GroupBy`` object supports column indexing in the same way as a ``DataFrame``. Selecting a column returns a new ``GroupBy`` object restricted to that column, on which aggregations such as ``mean()`` can be called.

In [33]:
brca1_df.groupby('biotype')['bp'].mean()

biotype
Nonsense mediated decay    3137.500000
Protein coding             2963.666667
Name: bp, dtype: float64
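To make the split, apply and combine steps behind ``groupby`` more concrete, here is a minimal sketch. It assumes a hand-built ``DataFrame`` named `df` holding the same five transcripts; `manual_means` and `summary` are names introduced only for this illustration.

```python
import pandas as pd

# Hand-built copy of the five transcripts used in this section.
df = pd.DataFrame(
    {
        "biotype": [
            "Protein coding", "Protein coding", "Nonsense mediated decay",
            "Protein coding", "Nonsense mediated decay",
        ],
        "bp": [3668, 4497, 5693, 726, 582],
        "aa": [721, 1399, 63, 242, 63],
    }
)

grouped = df.groupby("biotype")

# Split: iterating over a GroupBy yields (label, sub-DataFrame) pairs.
# Apply: compute the mean of 'bp' in each sub-DataFrame by hand.
manual_means = {label: group["bp"].mean() for label, group in grouped}

# The same split-apply-combine done in one call.
bp_means = grouped["bp"].mean()

# Several aggregates at once for the selected column.
summary = grouped["bp"].agg(["min", "max", "mean"])

print(manual_means)
print(bp_means)
print(summary)
```

The dictionary built by hand and the ``Series`` returned by ``mean()`` hold the same values; the ``GroupBy`` call simply performs the loop and the recombination internally.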

### apply
The ``apply()`` method lets you apply an arbitrary function to each group and combines the results.

In [34]:
brca1_df.groupby('biotype')[['bp', 'aa']].apply(np.sum)

biotype,bp,aa
Nonsense mediated decay,6275,126
Protein coding,8891,2362


More generally, the ``apply()`` method of a ``DataFrame`` applies a function along one axis of the ``DataFrame``. The objects passed to the function are ``Series`` objects whose index is:
- either the ``DataFrame``’s index (``axis=0``, the default: one ``Series`` per column)
- or the columns (``axis=1``: one ``Series`` per row).

In [35]:
brca1_df[['bp', 'aa']].apply(np.sum)            # Total nucleotides and total amino acids

bp    15166
aa     2488
dtype: int64

In [36]:
brca1_df[['bp', 'aa']].apply(np.sum, axis=1)    # Nucleotides + amino acids for each transcript

transcript_id
ENST00000352993.7    4389
ENST00000354071.7    5896
ENST00000461221.5    5756
ENST00000461574.1     968
ENST00000461798.5     645
dtype: int64
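To see what ``apply()`` actually hands to the function on each axis, the following sketch may help. It assumes a two-row toy ``DataFrame``; `describe_input`, `by_column` and `by_row` are hypothetical names used only here.

```python
import pandas as pd

# Two transcripts are enough to show the difference between the axes.
df = pd.DataFrame(
    {"bp": [3668, 726], "aa": [721, 242]},
    index=["ENST00000352993.7", "ENST00000461574.1"],
)

def describe_input(s):
    # Report the type of the object received and the labels of its index.
    return f"{type(s).__name__} indexed by {list(s.index)}"

by_column = df.apply(describe_input)         # axis=0: one call per column
by_row = df.apply(describe_input, axis=1)    # axis=1: one call per row

print(by_column["bp"])
print(by_row["ENST00000352993.7"])
```

With ``axis=0`` each column arrives as a ``Series`` labeled by the transcript ids, while with ``axis=1`` each row arrives as a ``Series`` labeled by the column names.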

We can also define an arbitrary function:

In [37]:
def function(row, value):
    # Label a row 'High' if its 'bp' value is at least `value`, otherwise 'Low'.
    if row['bp'] >= value:
        status = 'High'
    else:
        status = 'Low'

    return status

By default, ``apply()`` passes a single argument (here, the row) to the function. Additional positional arguments can be supplied through the ``args`` parameter:

In [None]:
bp_mean = brca1_df['bp'].mean()
print('bp mean:', bp_mean)

brca1_df['transcript_length'] = brca1_df.apply(function, args = (bp_mean,), axis = 1)
brca1_df
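Extra arguments can also be passed by keyword, since ``DataFrame.apply()`` forwards additional keyword arguments to the function. The sketch below assumes a toy ``DataFrame`` with only the ``bp`` column; `label_length` and `threshold` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Toy frame with the transcript lengths used in this section.
df = pd.DataFrame({"bp": [3668, 4497, 5693, 726, 582]})

def label_length(row, value):
    # 'High' when the transcript is at least as long as the threshold.
    return "High" if row["bp"] >= value else "Low"

threshold = df["bp"].mean()

# Positional extras go through `args` (note the trailing comma of the 1-tuple).
via_args = df.apply(label_length, args=(threshold,), axis=1)

# Keyword extras are forwarded to the function as they are.
via_kwargs = df.apply(label_length, axis=1, value=threshold)

print(via_args.tolist())
print(via_args.equals(via_kwargs))
```

Both calls produce the same labels; the keyword form can be easier to read when the function takes several extra parameters.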



### Lambda function

Python <strong>lambdas</strong> are small anonymous functions, with a more restrictive but more concise syntax than regular Python functions. An anonymous function is a function defined without a name.

The ``def`` keyword defines regular functions, while the ``lambda`` keyword creates anonymous functions with the syntax ``lambda arguments: expression``. A lambda can take any number of arguments but contains <strong>only one</strong> expression, which is evaluated and returned.

In [None]:
brca1_df['protein_length'] = brca1_df.apply(
    lambda row, value: 'High' if row['aa'] > value else 'Low',
    args = (brca1_df['aa'].mean(),),    # threshold: mean protein length (aa), not mean bp
    axis = 1
)
brca1_df

Note that a lambda definition does not include a ``return`` statement: its body is a single expression, whose value is returned.
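As a small illustration of the correspondence between the two forms, the sketch below writes the same labeling rule once with ``def`` and once with ``lambda``. The names `label_def`, `label_lambda`, `aa_lengths` and `threshold` are introduced only for this example, and plain Python lists are used instead of a ``DataFrame`` to keep it short.

```python
# Regular function: needs an explicit return statement.
def label_def(value, threshold):
    if value > threshold:
        return "High"
    return "Low"

# Equivalent lambda: the single expression is the return value.
label_lambda = lambda value, threshold: "High" if value > threshold else "Low"

aa_lengths = [721, 1399, 63, 242, 63]
threshold = sum(aa_lengths) / len(aa_lengths)   # mean protein length

labels_def = [label_def(v, threshold) for v in aa_lengths]
labels_lambda = [label_lambda(v, threshold) for v in aa_lengths]

print(labels_def)
print(labels_lambda)
print(label_def.__name__, label_lambda.__name__)
```

Both versions return the same labels. The only visible difference is the function name, which for a lambda is the placeholder ``<lambda>``.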