# Lesson 2: Intro to Pandas - 1

### Summary
* Flexible and expressive data structures: **Series** (1D) and **DataFrame** (2D)
* High-level building block for doing **practical, real world data analysis**
* **Nearly as fast as C** for many operations: built on top of NumPy, with extensive use of Cython
* **Robust IO tools** for loading and parsing data from text files, Excel files and databases.

### Additional Resources

*   Getting Started Guide: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
*   Complete User-Guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide
*   Geeks for Geeks: https://www.geeksforgeeks.org/pandas-tutorial/



---





# Imports

**Importing the packages you need at the top of your script/notebook is good programming practice.**

In [None]:
# Import the packages that will be useful for this lesson
import pandas as pd
import numpy as np

---

# pandas Objects
Typically, DataFrame and Series objects are created by loading data from a CSV or Excel file (covered in a later section). Sometimes, though, it is useful to create an object directly within your script.

## Pandas Series
*   1-D array
*   Can contain data of any type (integer, string, float, python objects, etc.)
*   Axis labels (*i.e.* row names) are called the index

**TL;DR a Series is a column in an Excel sheet**

### Creating Pandas Series

**From a List**

In [None]:
# Values and their index labels, given as two lists of equal length
Base = ['A','T','C','G','N']
Freq = [0.21, 0.24, 0.27, 0.25, 0.03]
bases = pd.Series(data=Freq, index=Base)
bases

**From a Dictionary**

In [None]:
d = {'A':0.21, 'T':0.24, 'C':0.27, 'G':0.25, 'N':0.03}
bases_2 = pd.Series(d)
bases_2
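
The two constructions above should give the same result, and the index labels can be used to look values up. The following is a minimal sketch of that idea; the variable names `from_list` and `from_dict` are chosen only for this example.

```python
import pandas as pd

# Same data as above, built two ways
base = ['A', 'T', 'C', 'G', 'N']
freq = [0.21, 0.24, 0.27, 0.25, 0.03]
from_list = pd.Series(data=freq, index=base)
from_dict = pd.Series({'A': 0.21, 'T': 0.24, 'C': 0.27, 'G': 0.25, 'N': 0.03})

# Check whether both constructions hold the same values and index
print(from_list.equals(from_dict))

# Look up a single value by its index label
print(from_list['C'])

# Look up several labels at once (this returns a new Series)
print(from_list[['A', 'T']])

# The index itself is available as an attribute
print(list(from_list.index))
```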

## Pandas DataFrames
*   2D tabular data 
*   labeled axes (rows and columns)
*   size-mutable (can add/remove data)
*   potentially heterogeneous data types

### Creating Pandas DataFrames
**From a pandas Series**

In [None]:
d = {'A':0.21, 'T':0.24, 'C':0.27, 'G':0.25, 'N':0.03}
bases_2 = pd.Series(d)
pd.DataFrame(bases_2, columns=["Percent"])

**From a Dictionary**

In [None]:
d = {'Protein':['YFP', 'GFP', 'RFP', 'BFP'],
     'Ex':[514, 488, 555, 383],
     'Em':[527, 510, 584, 445]}
df = pd.DataFrame(d) 
df
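
To illustrate the "size-mutable" and "heterogeneous data types" properties listed at the start of this section, here is a small sketch that rebuilds the same table, inspects its column types, and then adds and removes a column. The column name `StokesShift` and the variable `smaller` are assumptions made for this example only.

```python
import pandas as pd

df = pd.DataFrame({'Protein': ['YFP', 'GFP', 'RFP', 'BFP'],
                   'Ex': [514, 488, 555, 383],
                   'Em': [527, 510, 584, 445]})

# Heterogeneous types: text in one column, integers in the others
print(df.dtypes)

# Size-mutable: add a column computed from existing ones
# ('StokesShift' is a name chosen for this example)
df['StokesShift'] = df['Em'] - df['Ex']

# Remove a column; drop() returns a new DataFrame and leaves df unchanged
smaller = df.drop(columns=['Ex'])

print(df.shape)
print(smaller.shape)
```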

---

# Loading Data

### Lesson 1 Recap
Last time, we learned how to open a text file for reading or writing.

```
f = open('mysequence.fasta', 'r')
new_file = open('thesis.txt', 'w')
```

This is useful for text-based data, such as FASTA files or text documents.

But what about **tabular data**, such as data from your flow experiment, or time-course data from the plate-reader?


### Using Pandas to Load Data
Pandas offers many functions for loading data. Choose one based on your file type, and adjust its parameters accordingly.

**Load from a .csv file using `pd.read_csv()`**
```
# Comma-delimited text
data = pd.read_csv('new_data.csv')

# Tab-delimited text
data = pd.read_csv('newer_data.txt', sep='\t')

# From a URL
data = pd.read_csv('https://raw.githubusercontent.com/FBosler/you-datascientist/master/happiness_with_continent.csv')
```

**Load from a .xls/.xlsx file using `pd.read_excel()`**
```
data = pd.read_excel('thesis_data.xlsx')
```

#### Useful Parameters for Loading Data


*   **sep**: separator for the columns (*e.g.* ',' or '\t')
*   **header**: row number that contains the column headers (*e.g.* `header=2` uses the third row and skips everything above it; `header=None` means the file has no header row)
*   **index_col**: which column should be the index (row names)
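
The effect of these three parameters can be seen without any file on disk by reading from an in-memory string. This is only a sketch: the table contents below are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# A small tab-delimited table held in memory (made-up values)
text = "GeneID\tName\tCount\ng1\talpha\t10\ng2\tbeta\t20\n"

# sep: the columns are separated by tabs
with_header = pd.read_csv(io.StringIO(text), sep='\t')

# index_col: use the 'GeneID' column as the row labels
indexed = pd.read_csv(io.StringIO(text), sep='\t', index_col='GeneID')

# header=None: treat the first line as data, so columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(text), sep='\t', header=None)

print(with_header.shape)
print(indexed.shape)
print(no_header.shape)
```

Note that `index_col` removes the chosen column from the data columns, and `header=None` adds one row because the header line is now counted as data.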


In [None]:
## Example: Loading Data 
infile = '../data/ecoli.txt'
data = pd.read_csv(infile, sep='\t')

In [None]:
data

---

# Inspecting & Describing Data

Once data has been created or loaded using pandas, it is useful to look at it and check that it was read in properly. Rather than looking at the entire table, you can use simple functions to take a peek at the top, the bottom, or a random sample of the rows.

### Functions to View Data

*   `data.head(n)`: display the first n rows (default: 5)
*   `data.tail(n)`: display the last n rows (default: 5)
*   `data.sample(n)`: display n random rows (default: 1)



In [None]:
## Example: Viewing Data
# Display the first 5 rows
data.head()

In [None]:
# Display the last 10 rows
data.tail(10)

In [None]:
# Display a random sampling of 7 rows
data.sample(7)

In [None]:
## Example: Loading & Viewing Data #2
## After viewing the table, we realize we want to use "GeneID" as the row names instead of numbers
new_data = pd.read_csv(infile, sep='\t', index_col='GeneID')
new_data.head()

### Functions to Describe Data

*   `data.shape`: dimensions of the data (# rows, # columns). This is an attribute, not a function, so do not add `()`.
*   `data.info()`: data type and number of non-null values for each column
*   `data.describe()`: statistical information about the numerical columns



In [None]:
## Example: Describing Data
data.shape

In [None]:
data.info()

In [None]:
data.describe()
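
The results of these calls are ordinary Python and pandas objects, so they can be stored and reused rather than only displayed. The sketch below uses a made-up table (`toy`, with hypothetical columns `sample` and `od600`) to show one way of doing this.

```python
import pandas as pd

# A made-up table with one text column and one numeric column
toy = pd.DataFrame({'sample': ['a', 'b', 'c', 'd'],
                    'od600': [0.10, 0.20, 0.30, 0.40]})

# shape is a plain tuple, so it can be unpacked
n_rows, n_cols = toy.shape

# describe() returns a DataFrame: rows are statistics,
# columns are the numeric columns of the original table
stats = toy.describe()
mean_from_describe = stats.loc['mean', 'od600']

# The same statistic is available directly from the column
mean_direct = toy['od600'].mean()

print(n_rows, n_cols)
print(mean_from_describe, mean_direct)
```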

---


Exercises
---------

Load a dataframe from this file hosted online: https://evocellnet.github.io/ecoref/data/phenotypic_data.tsv

**Hint:** the file extension might indicate which field separator to expect

How big is the dataframe? How many rows and columns are there?

How many non-null rows are there for each column? What is the type of each column?

What is the mean value for the `s-scores` column?

In [11]:
df.columns

Index(['condition', 'strain', 's-scores', 'corrected-p-values',
       'growth-defect-phenotype'],
      dtype='object')

In [15]:
df.pivot_table(index='strain', columns='condition', values='s-scores')

condition,A22.0P5,A22.0P5UM,A22.1,A22.1UM,A22.2,A22.2P5UM,A22.5UM,A22.CCCP,A22.CEFSULODIN,A22.CEPHALEXIN,...,TRIMETHOPRIM.A22,TRIMETHOPRIM.CEPHALEXIN,TRIMETHOPRIM.CHLORAMPHENICOL,TRIMETHOPRIM.FOSFOMYCIN,TRIMETHOPRIM.IMIPENEM,TRIMETHOPRIM.PYOCYANIN,TYLOSIN.50,UREA.50MM,VANCOMYCIN.40,VANILLIN.200
ATC14028,0.461853,,,-0.855892,,0.648539,0.320179,0.416647,0.240657,-0.198294,...,,,,,,,,-1.690149,,
LT2,0.634869,,,0.129073,,0.267326,0.455675,1.804888,1.069904,0.387530,...,,,,,,,,-0.385880,,
NT12001,0.973035,-1.556593,-0.746300,-1.197644,-0.580609,-0.409577,,-0.794547,0.094425,0.183347,...,-1.072894,-0.630742,-0.714179,-0.643048,-0.007615,0.020477,0.586603,0.534589,-0.799052,0.700619
NT12002,0.993044,-0.915693,-0.889290,-1.469211,0.538173,0.707219,,0.380621,0.634499,1.311746,...,-0.556301,-0.181908,-0.171593,-0.637145,-0.571075,0.044591,0.785431,2.091638,-0.810707,0.630754
NT12003,-0.043496,0.032552,0.341063,0.282494,0.094331,0.242300,,0.460456,-0.517764,0.047581,...,-1.726552,-1.077383,-0.844963,-1.085437,-1.732050,-0.341149,0.244013,0.861176,-1.401596,0.797904
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NT12903,-0.575041,,,0.993674,-0.602798,,,0.880353,0.499751,-0.010904,...,,,,,,,,0.851299,,
NT12904,-0.759054,,,0.510368,-1.165594,,,-1.418671,-1.150296,-1.316660,...,,,,,,,,-0.696250,,
NT12905,-3.613819,,,-1.520593,-1.703222,,,-3.227510,-3.642508,-3.711046,...,,,,,,,,1.273073,,
NT12906,1.063293,,,0.560884,0.624274,,,-0.481714,2.086753,1.373061,...,,,,,,,,1.479300,,
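
The `pivot_table` call above reshapes a "long" table (one row per strain and condition) into a "wide" one (one row per strain, one column per condition). A minimal sketch on made-up data, reusing the same column names, shows where the `NaN` entries come from: any strain and condition pair that is absent from the long table has no value to fill in. The variable names `long` and `wide` and the values are assumptions for this example.

```python
import pandas as pd

# Long format: one row per (strain, condition) measurement (made-up values)
long = pd.DataFrame({
    'strain':    ['s1', 's1', 's2'],
    'condition': ['c1', 'c2', 'c1'],
    's-scores':  [0.5, -1.0, 2.0],
})

# Wide format: strains become rows, conditions become columns.
# If a pair occurs more than once, pivot_table averages the values by default.
wide = long.pivot_table(index='strain', columns='condition', values='s-scores')

print(wide)

# 's2' was never measured in 'c2', so that cell is missing
print(wide.isna().sum().sum())
```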
