# Module 2.1: Pandas

In [2]:
import pandas as pd
import numpy as np

Pandas is a Python package whose name derives from 'panel data'. It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built *on top* of the NumPy package, and its key data structure is called the `DataFrame`. A `DataFrame` lets you store and manipulate tabular data in rows of observations and columns of variables.

### Pandas Data Structures

#### Series
In Pandas, a `Series` is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy's) and an associated array of data labels, called its `index`. The simplest `Series` is formed from an array of data alone, in which case the index defaults to the integers 0, 1, 2, and so on:


In [7]:
ser = pd.Series(np.random.rand(10))

print(ser)

0    0.595659
1    0.689792
2    0.456661
3    0.621448
4    0.172698
5    0.115973
6    0.060628
7    0.104663
8    0.900528
9    0.615383
dtype: float64
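
The index does not have to be the default integers. As a minimal sketch (the labels and values here are hypothetical, chosen only for illustration), a `Series` can be given explicit labels and then queried either by label or by position:

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
labelled = pd.Series([0.25, 0.50, 0.75], index=["a", "b", "c"])

print(labelled.index.tolist())  # the labels: ['a', 'b', 'c']
print(labelled["b"])            # look up a value by label: 0.5
print(labelled.iloc[0])         # look up a value by position: 0.25
```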



#### DataFrame
A Pandas DataFrame is a tabular data structure composed of rows and columns, akin to a spreadsheet. You can also think of a DataFrame as a group of Series objects (the columns) that share an index (the row labels). Unlike NumPy arrays, which hold a single data type, a DataFrame can hold a different data type in each column.
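
One way to see the "group of Series" picture is to build a DataFrame from several Series that share the same index. The sketch below uses made-up gene labels and values, not data from this module:

```python
import pandas as pd

# Hypothetical gene labels and measurements, for illustration only
genes = ["g1", "g2", "g3"]
expression = pd.Series([1.5, 0.0, 3.2], index=genes)
is_marker = pd.Series([True, False, True], index=genes)
symbol = pd.Series(["Abc1", "Def2", "Ghi3"], index=genes)

# Each Series becomes a column; the shared index supplies the row labels
df = pd.DataFrame({"expression": expression, "is_marker": is_marker, "symbol": symbol})

print(df.dtypes)         # one dtype per column
print(df.loc["g2"])      # a row, selected by its index label
print(df["expression"])  # a column, returned as a Series
```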

### Creating DataFrames

### Reading and writing data

### Indexing and selecting data

### Filtering data

### Sorting data

### Summarizing data

### Grouping data

### Merging data

## Tidy data
An important concept in data science is the idea of 'tidy data'. Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or 'tidy' depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data, each measured variable is a column, each observation is a row, and each type of observational unit is a table.

### Tidy data in Python
The `pandas` package provides a number of functions to help with the process of tidying data. 

The `melt` function is particularly useful. Called as `pd.melt`, it takes a DataFrame as its first argument and the names of the columns to be used as identifiers (`id_vars`) as its second argument; the equivalent method `DataFrame.melt` takes `id_vars` directly. The remaining columns are then treated as 'measured variables' and 'melted' into a single column.

By default, `melt` returns a DataFrame with a new column called `variable` that contains the names of the columns that were melted, and a new column called `value` that contains the values of the melted columns.
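
To make the reshaping concrete, here is a small sketch on a made-up wide table (the column names `sampleA` and `sampleB` are hypothetical):

```python
import pandas as pd

# Hypothetical wide table: one row per gene, one column per sample
wide = pd.DataFrame({
    "gene_id": ["g1", "g2"],
    "sampleA": [1.0, 2.0],
    "sampleB": [3.0, 4.0],
})

# gene_id is kept as the identifier; the other columns are melted
long = wide.melt(id_vars=["gene_id"])

print(long.columns.tolist())  # ['gene_id', 'variable', 'value']
print(long.shape)             # (4, 3): 2 genes x 2 melted columns
```

If the default names are not descriptive enough, `melt` also accepts `var_name` and `value_name` to rename the two new columns.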

In [None]:
import numpy as np
import pandas as pd

data = pd.read_csv('data/GSE63482_Expression_matrix.tsv', sep='\t')

data

In [None]:
# Describe the summary statistics for columns of this dataset
data.describe()

In [None]:

# Preview the first 10 rows of this dataset
data.head(10)

# Summary statistics for each numeric column, computed over the first 10 rows only
data.head(10).describe()

In [None]:
# Melt the data into a long format using the `gene_id` column as the id variable;
# it is important here to explicitly tell `melt` which columns are the id variables
data_melted = data.melt(id_vars=['gene_id'])

data_melted

In [None]:
# Split the 'variable' column into two columns for 'age' and 'celltype'
# (this assumes every label has the form '<age>_<celltype>', with exactly one underscore)
data_melted[['age', 'celltype']] = data_melted['variable'].str.split('_', expand=True)

data_melted
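
The split step can be tried in isolation on a toy table. This is only a sketch: the labels below are hypothetical and are assumed to follow an `<age>_<celltype>` pattern. Passing `n=1` splits on the first underscore only, so the result has at most two columns even if a cell-type name itself contains an underscore:

```python
import pandas as pd

# Hypothetical melted table; labels are assumed to look like '<age>_<celltype>'
melted = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2"],
    "variable": ["E15_cpn", "P1_subcereb", "E15_cpn"],
    "value": [1.0, 2.0, 3.0],
})

# n=1 splits on the first underscore only, giving at most two columns
parts = melted["variable"].str.split("_", n=1, expand=True)
melted[["age", "celltype"]] = parts

print(melted[["age", "celltype"]])
```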