# Lesson 2: Intro to Pandas - 3

---





In [None]:
# Import the packages that will be useful for this lesson
import pandas as pd
import numpy as np

---

### Descriptive statistics of a DataFrame

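The results contain only the columns to which a function applies. For example, *mean()* is defined only for numeric columns; in pandas 2.0 and later, pass `numeric_only=True` so that non-numeric columns are skipped instead of raising an error.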
 
| Function 	| Description                                	|
|----------	|--------------------------------------------	|
| count    	| Number of non-null observations            	|
| sum      	| Sum of values                              	|
| mean     	| Mean of values                             	|
| mad      	| Mean absolute deviation (removed in pandas 2.0) 	|
| median   	| Arithmetic median of values                	|
| min      	| Minimum                                    	|
| max      	| Maximum                                    	|
| mode     	| Mode                                       	|
| abs      	| Absolute Value                             	|
| prod     	| Product of values                          	|
| std      	| Bessel-corrected sample standard deviation 	|
| var      	| Unbiased variance                          	|
| sem      	| Standard error of the mean                 	|
| skew     	| Sample skewness (3rd moment)               	|
| kurt     	| Sample kurtosis (4th moment)               	|
| unique   	| Unique elements (Series method)            	|
| quantile 	| Sample quantile (value at %)               	|
| cumsum   	| Cumulative sum                             	|
| cumprod  	| Cumulative product                         	|
| cummax   	| Cumulative maximum                         	|
| cummin   	| Cumulative minimum                         	|
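
As a minimal sketch of a few functions from the table, the cell below uses a small made-up DataFrame. The name `toy`, the gene names and all values are assumptions for illustration and are not taken from `ecoli.txt`. It shows `mean` restricted to numeric columns, `count` applied to every column, and `unique` called on a single column.

```python
import pandas as pd

# Hypothetical toy data: column names mimic the lesson's table, values are made up
toy = pd.DataFrame({
    "Locus": ["geneA", "geneB", "geneC", "geneD"],
    "Strand": ["+", "-", "+", "-"],
    "Start": [100, 400, 900, 1500],
    "Length": [200, 350, 500, 150],
})

# numeric_only=True restricts the calculation to the numeric columns
means = toy.mean(numeric_only=True)
print(means)

# count() works on every column, numeric or not
counts = toy.count()
print(counts)

# unique() is called on one column (a Series), not on the whole DataFrame
strands = toy["Strand"].unique()
print(strands)
```

Most reductions in the table accept `numeric_only` in the same way, so the pattern should carry over to `sum`, `median`, `std` and similar functions.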

In [None]:
infile = '../data/ecoli.txt'
df = pd.read_csv(infile, sep='\t')
df.head()

In [None]:
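# numeric_only=True skips the text columns (pandas 2.0 and later raise an error otherwise)
df.mean(numeric_only=True)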

In [None]:
df.count()

**Remember: Use the *describe()* method to generate a simple table report of the DataFrame**

In [None]:
# Default: Pandas will determine which columns are numeric and only describe those
df.describe()

# Compare to when include="all"
df.describe(include="all")
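
To make the difference between the two calls concrete, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration). Both reports are printed explicitly, because a notebook cell displays only the value of its last expression.

```python
import pandas as pd

# Hypothetical toy data with two text columns and one numeric column
toy = pd.DataFrame({
    "Locus": ["geneA", "geneB", "geneC", "geneD"],
    "Strand": ["+", "-", "+", "-"],
    "Length": [200, 350, 500, 150],
})

# Default: only numeric columns are summarised
numeric_report = toy.describe()
print(numeric_report)

# include="all": text columns are summarised too (count, unique, top, freq)
full_report = toy.describe(include="all")
print(full_report)
```

In the full report, statistics that do not apply to a column (for example `mean` for `Strand`) appear as `NaN`.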

### Accessing elements in a DataFrame (Indexing and Selection)

| Operation                      	| Syntax        	    | Result    	|
|--------------------------------	|---------------	    |-----------	|
| Select column                  	| df[col] **or** df.col | Series    	|
| Select row by label            	| df.loc[label] 	    | Series    	|
| Select row by integer location 	| df.iloc[loc]  	    | Series    	|
| Slice rows                     	| df[5:10]      	    | DataFrame 	|
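
The result types in the table can be checked on a small example. The sketch below uses a made-up DataFrame (`toy`, its labels and its values are assumptions for illustration) and prints the type returned by each operation.

```python
import pandas as pd

# Hypothetical toy data indexed by made-up gene names
toy = pd.DataFrame(
    {"Start": [100, 400, 900], "Stop": [300, 750, 1400], "Strand": ["+", "-", "+"]},
    index=["geneA", "geneB", "geneC"],
)

column = toy["Start"]            # select column            -> Series
row_by_label = toy.loc["geneB"]  # select row by label      -> Series
row_by_pos = toy.iloc[1]         # select row by position   -> Series
row_slice = toy[0:2]             # slice rows by position   -> DataFrame

for name, obj in [("column", column), ("row_by_label", row_by_label),
                  ("row_by_pos", row_by_pos), ("row_slice", row_slice)]:
    print(name, type(obj).__name__)
```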

In [None]:
# Reload the DataFrame
df = pd.read_csv(infile, sep='\t')
df.rename(columns={"Locus tag":"Locus Tag", "Protein product":"Protein Product"}, inplace=True)
df = df.drop(labels=["Replicon Name", "COG(s)", "Protein name"], axis=1) # Axis 1 = Columns

df

# Change the index to be Locus
df.index = df["Locus"]
df.head()
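
Assigning a column to `df.index` keeps that column in the table as well. A possible alternative is `set_index`, sketched here on made-up data (`toy` and its values are assumptions for illustration). With `drop=False` the column is kept, which matters if later cells still refer to it; with the default `drop=True` it is moved into the index.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Locus": ["geneA", "geneB"], "Length": [200, 350]})

# Same approach as in the lesson: assign the column to the index
assigned = toy.copy()
assigned.index = assigned["Locus"]

# set_index with drop=False keeps the "Locus" column
kept = toy.set_index("Locus", drop=False)

# Default behaviour: the "Locus" column is moved into the index
dropped = toy.set_index("Locus")

print(kept)
print(dropped)
```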

**Column slicing**

In [None]:
# Preferred Method
df["Start"]

In [None]:
df.Start

In [None]:
df[["Start", "Stop"]]
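
The three forms above return different things, which the following sketch illustrates on made-up data (`toy` and the identifiers in it are assumptions for illustration). It also hints at why the bracket form is usually preferred: it works for any column name, whereas `df.Protein Product` would not even be valid Python.

```python
import pandas as pd

# Hypothetical toy data; the protein identifiers are made up
toy = pd.DataFrame({
    "Start": [100, 400],
    "Stop": [300, 750],
    "Protein Product": ["NP_0001", "NP_0002"],
})

single = toy["Start"]             # one label             -> Series
several = toy[["Start", "Stop"]]  # list of labels        -> DataFrame
one_col_frame = toy[["Start"]]    # one-element list      -> still a DataFrame

# Bracket syntax handles names that are not valid Python identifiers
products = toy["Protein Product"]

# Attribute access gives the same Series for identifier-like names
same = toy.Start.equals(toy["Start"])
print(same)
```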

**Index slicing**


*   `df.loc[]`: Slice by index label (or Boolean)
*   `df.iloc[]`: Slice by integer (selection by position)
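
One difference between the two indexers is easy to miss: a label slice with `loc` includes both end points, while a position slice with `iloc` excludes the stop position, as in ordinary Python slicing. A minimal sketch on made-up data (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data indexed by made-up gene names
toy = pd.DataFrame(
    {"Length": [200, 350, 500, 150]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

by_label = toy.loc["geneB":"geneC"]      # both "geneB" and "geneC" are included
by_position = toy.iloc[1:3]              # positions 1 and 2; position 3 is excluded
by_mask = toy.loc[toy["Length"] > 300]   # loc also accepts a Boolean Series

print(by_label)
print(by_position)
print(by_mask)
```

Label slices rely on the labels being locatable unambiguously; with duplicated labels in an unsorted index, `loc` slicing can raise a `KeyError`.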



In [None]:
df.head()
df.loc["rluA"]

In [None]:
df.loc["rpsJ":"rpsF"]

In [None]:
df.iloc[1]

In [None]:
df.iloc[10:12]

**Combine column and index slicing to select a specific item or a range of items**

In [None]:
## Multiple ways to do the same thing
df.loc["rluB", "Protein Product"]

In [None]:
df.loc["rluB"]["Protein Product"]

In [None]:
df.iloc[10,7]

In [None]:
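# Use .iloc on the row as well: plain [7] on a label-indexed Series is deprecated in recent pandas
df.iloc[10].iloc[7]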
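
The label-based and position-based forms can be related through `Index.get_loc`, which returns the integer position of a label. The sketch below uses made-up data (`toy` and its values are assumptions for illustration). It also shows that passing row and column to a single indexer is the dependable form for assignment; chained indexing may operate on a temporary copy.

```python
import pandas as pd

# Hypothetical toy data indexed by made-up gene names
toy = pd.DataFrame(
    {"Strand": ["+", "-", "+"], "Length": [200, 350, 500]},
    index=["geneA", "geneB", "geneC"],
)

# Single call with row and column labels
by_label = toy.loc["geneB", "Length"]

# Translate the labels into integer positions, then use iloc
row_pos = toy.index.get_loc("geneB")
col_pos = toy.columns.get_loc("Length")
by_position = toy.iloc[row_pos, col_pos]

# Chained form: first select the row (a Series), then the item
chained = toy.loc["geneB"]["Length"]
print(by_label, by_position, chained)

# For assignment, give row and column to one indexer
toy.loc["geneB", "Length"] = 360
print(toy)
```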

### Select specific elements based on their values (Boolean indexing)

**With a single condition**

In [None]:
df[df["Strand"] == "-"]

In [None]:
df.query("Strand == '-'")

In [None]:
df[df["Length"].isin([200, 201, 202])]

In [None]:
df[df["Locus"].str.startswith('rlu')]

**With multiple conditions**

In [None]:
df[(df["Strand"] == '+') & (df["Length"] > 350)]

In [None]:
df[(df["Strand"] == '+') | (df["Length"] > 350)]

In [None]:
df.query("Strand == '+' and Length > 350")
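
Each condition produces a Boolean Series, and these are combined element-wise with `&` (and), `|` (or) and `~` (not). The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==` and `>`; the Python keywords `and` / `or` do not work on Series outside of `query` strings. A minimal sketch on made-up data (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Locus": ["geneA", "geneB", "geneC", "geneD"],
    "Strand": ["+", "-", "+", "-"],
    "Length": [200, 400, 500, 150],
})

# Naming the masks keeps long filters readable
is_plus = toy["Strand"] == "+"
is_long = toy["Length"] > 350

both = toy[is_plus & is_long]        # rows satisfying both conditions
either = toy[is_plus | is_long]      # rows satisfying at least one condition
neither = toy[~(is_plus | is_long)]  # rows satisfying none

# The same "and" filter written as a query string
via_query = toy.query("Strand == '+' and Length > 350")

print(both)
print(either)
print(neither)
```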

The `apply` method
----

Use custom functions on a `groupby` result

In [None]:
df = pd.read_csv(infile, sep='\t')

In [None]:
def count_operons(values):
    # Count the distinct name prefixes obtained by dropping the last character
    operons = set()
    for gene_name in values:
        operon = gene_name[:-1]  # e.g. "rluA" -> "rlu"
        operons.add(operon)
    return len(operons)

In [None]:
count_operons(df['Locus'])

In [None]:
df.groupby('Strand')['Locus'].apply(count_operons)

In [None]:
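def find_operons(values):
    # Collect the distinct three-character prefixes of the gene names.
    # Note: count_operons drops the last character instead, so the two
    # functions agree only for four-character names.
    operons = set()
    for gene_name in values:
        operon = gene_name[:3]  # e.g. "rluA" -> "rlu"
        operons.add(operon)
    return operons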

In [None]:
find_operons(df['Locus'])

In [None]:
df.groupby('Strand')['Locus'].apply(find_operons)
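
To see what `apply` does with a custom function, it can help to run the same pattern on a table small enough to check by hand. In the sketch below, the `toy` frame, its values and the helper `count_prefixes` are assumptions for illustration. The function is called once per group and receives that group's `Locus` values as a Series.

```python
import pandas as pd

# Hypothetical toy data: three names on "+", two names on "-"
toy = pd.DataFrame({
    "Locus": ["abcA", "abcB", "xyzA", "abcC", "abcD"],
    "Strand": ["+", "+", "+", "-", "-"],
})

def count_prefixes(values):
    """Count the distinct three-character prefixes in an iterable of names."""
    return len({name[:3] for name in values})

# One call per strand; the results are collected into a Series indexed by Strand
per_strand = toy.groupby("Strand")["Locus"].apply(count_prefixes)
print(per_strand)

# For functions that return a single value per group, agg gives the same result
via_agg = toy.groupby("Strand")["Locus"].agg(count_prefixes)
print(via_agg)
```

Here the `+` group contains the prefixes `abc` and `xyz`, and the `-` group contains only `abc`.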

---


Exercises
---------

Using the data at https://evocellnet.github.io/ecoref/data/conditions.tsv, answer the following questions:

* How many unique strains are there? What about conditions?
* Is each strain present in all conditions, and vice versa?
* How many growth defect phenotypes are present in each condition?
    * What is the proportion?
* How many growth defect phenotypes does each strain have?
    * What is the proportion?
* Can you filter the table to keep only those conditions with no phenotype?
* Can you filter the table to keep only those entries with corrected-p-value < 0.01?
* Can you reshape the table to have strains as rows, conditions as columns, and s-scores as values?