# Module 2.1: Numpy and Pandas

## Numpy
Numpy (pronounced *num-pie*) is a Python package whose name is short for 'Numerical Python'. It is a library that provides a multidimensional array object (`ndarray`), various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Numpy itself is not a high-level analysis package, but it is the fundamental building block for many other packages, and arguably the foundation of the entire scientific Python ecosystem.

### Ndarray
The `ndarray` is the core object that can be used to store **homogeneous** data. It is a table of elements (usually numbers), **all of the same type**, indexed by a tuple of non-negative integers. In Numpy, dimensions are called `axes`. The number of axes is given by the `ndim` attribute (older documentation calls this the `rank`), and the `shape` of an array is a tuple of integers giving the size of the array along each axis.

We can initialize Numpy arrays from Python lists, and access elements using square brackets:

In [3]:
import numpy as np # import numpy library

data = [5, 3.0, 1, 2.75, 4.11, 6, 7, 8.2, 9, 10]
print(data)

# Create a numpy array from the list `data`
arr = np.array(data)
arr

[5, 3.0, 1, 2.75, 4.11, 6, 7, 8.2, 9, 10]


array([ 5.  ,  3.  ,  1.  ,  2.75,  4.11,  6.  ,  7.  ,  8.2 ,  9.  ,
       10.  ])

Nested lists can be converted to multidimensional arrays using the same `array` function. For example, the following code produces a two-dimensional array:


In [4]:
# Nested (list of lists) to 2D array
data = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr = np.array(data)
arr

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

The `shape` attribute returns the number of rows and columns of the array.

In [6]:
arr.shape

(2, 4)

In this 2D example, `arr.shape` returns a tuple with two elements: the first is the number of rows and the second is the number of columns.

We can use the `ndim` attribute to get the number of axes (dimensions) of the array.

In [7]:
arr.ndim

2

When creating an `ndarray`, we can specify the type of the elements using the `dtype` parameter. If we don't specify the type, Numpy will try to guess the type of the data when the array is created.

The `dtype` attribute returns the type of the elements in the array.

In [8]:
arr.dtype

dtype('int64')

You can specify the `dtype` of the array when creating it.

In [19]:
arr = np.array([1, 2, 3], dtype=np.float64)
print(arr)
print(arr.dtype)

[1. 2. 3.]
float64

You can change the `dtype` of an existing array (cast to another `dtype`) using the `astype` method. The `astype` method creates a new array (a copy of the data), and does not change the original array itself.

In [21]:
int_arr = arr.astype(np.int64)

print(arr.dtype)

print(int_arr.dtype)

float64
int64
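As a small sketch of what casting does to the values themselves (the array names here are illustrative, not from the cells above): converting floats to integers with `astype` drops the fractional part, and the original array is left untouched because `astype` returns a copy.

```python
import numpy as np

float_arr = np.array([1.7, -2.9, 3.2])

# astype returns a new array; casting float -> int discards the fractional
# part (truncation toward zero), it does not round
truncated = float_arr.astype(np.int64)

print(truncated)   # integer values 1, -2, 3
print(float_arr)   # the original floats are unchanged
```

If rounding is what you want, one option is to call `np.round` on the array before casting.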


Similar to the built-in `range` function, Numpy has an `arange` function that returns an array containing evenly spaced values within a given interval.

Like `range`, the values are generated within the half-open interval [`start`, `stop`). The `start` value is inclusive, while the `stop` value is exclusive.

In [24]:
arr = np.arange(0,10)
print(arr)

arr = np.arange(1, 101)
print(arr)

[0 1 2 3 4 5 6 7 8 9]
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100]
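Like `range`, `arange` also accepts an optional third `step` argument. The following is a minimal sketch (variable names are illustrative) showing that the half-open interval rule still applies when a step is given:

```python
import numpy as np

# Even numbers from 0 up to (but not including) 10
evens = np.arange(0, 10, 2)
print(evens)

# A negative step counts down; `stop` (0 here) is still excluded
countdown = np.arange(5, 0, -1)
print(countdown)
```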


We can also change the shape of an array using the `reshape` method.

In [25]:
# Create a 10x10 array of integers from 1 to 100
arr = np.arange(1, 101).reshape(10, 10)
print(arr)

[[  1   2   3   4   5   6   7   8   9  10]
 [ 11  12  13  14  15  16  17  18  19  20]
 [ 21  22  23  24  25  26  27  28  29  30]
 [ 31  32  33  34  35  36  37  38  39  40]
 [ 41  42  43  44  45  46  47  48  49  50]
 [ 51  52  53  54  55  56  57  58  59  60]
 [ 61  62  63  64  65  66  67  68  69  70]
 [ 71  72  73  74  75  76  77  78  79  80]
 [ 81  82  83  84  85  86  87  88  89  90]
 [ 91  92  93  94  95  96  97  98  99 100]]
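One thing to keep in mind with `reshape` is that the new shape must account for exactly the same number of elements as the original array. The sketch below (with illustrative names) shows a valid reshape, the use of `-1` to let Numpy infer one dimension, and the error raised when the sizes do not match:

```python
import numpy as np

flat = np.arange(12)               # 12 elements

grid = flat.reshape(3, 4)          # 3 rows x 4 columns = 12 elements
inferred = flat.reshape(2, -1)     # -1 lets Numpy work out this dimension (6)

print(grid.shape)
print(inferred.shape)

# A shape that needs a different number of elements raises a ValueError
try:
    flat.reshape(5, 5)             # would need 25 elements, we only have 12
except ValueError as err:
    print("reshape failed:", err)
```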


### Creating 'placeholder' arrays
It is sometimes useful to create arrays with pre-defined values, for example an array of zeros, an array of ones, or an array with a range of values. Numpy provides a number of functions to create such arrays:

In [15]:
# Create a 1D array of zeros
arr = np.zeros(10)
print(arr)

# Create a 2D array, 4x10 of ones
arr = np.ones((4, 10))
print(arr)

# Create a 3D array, 4x4x4 filled with a specific value (999)
arr = np.full((4, 4, 4), 999)
print(arr)

# Create a 2D array, 3x3, of random integers from 0 to 9 (the upper bound, 10, is exclusive)
arr = np.random.randint(0, 10, (3, 3))
print(arr)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
[[[999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]]

 [[999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]]

 [[999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]]

 [[999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]
  [999 999 999 999]]]
[[8 0 3]
 [7 4 8]
 [3 8 2]]


### Array indexing and slicing
Numpy arrays can be indexed and sliced in a manner similar to Python lists. 

1D arrays are indexed and sliced with _exactly_ the same syntax as lists.

In [38]:
arr = np.linspace(1, 10, 10)
print(arr)

# Get the 3rd element
print(arr[2])

# Get elements in the range [0, 3) (the first three elements)
print(arr[0:3])

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
3.0
[1. 2. 3.]
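The list-style syntax also covers negative indices and slice steps. One difference from lists worth knowing about is that a basic slice of a Numpy array is a *view* onto the same data rather than a copy. A minimal sketch, using illustrative variable names:

```python
import numpy as np

values = np.arange(10)

last = values[-1]            # negative indices count from the end
every_other = values[::2]    # start:stop:step, here every second element

# Unlike a list slice, an array slice is a view: writing to it
# also changes the original array
first_three = values[:3]
first_three[0] = 99

print(last)
print(every_other)
print(values[0])             # now 99

# Use .copy() when an independent array is needed
independent = values[:3].copy()
independent[1] = -1          # leaves `values` untouched
```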


Multi-dimensional arrays are indexed using a *comma-separated tuple of indices*.

To get the first row of the array we can use the following code:


In [40]:
# Create a 2D array, 4x5 of evenly spaced values between 1-20
arr = np.linspace(1, 20, 20).reshape(4, 5)
print(arr)

# Get the first row of the array
arr[0,:]

[[ 1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10.]
 [11. 12. 13. 14. 15.]
 [16. 17. 18. 19. 20.]]


array([1., 2., 3., 4., 5.])

To get the first column of the array `arr` we use a colon `:` to indicate that we want all rows, and the index `0` to indicate that we want the first column:

In [30]:
# Get the first column of the array
arr[:, 0]

array([ 1.,  6., 11., 16.])
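Row and column indices can be combined to pick out single elements or rectangular blocks. The sketch below uses a small integer array (named `grid` here purely for illustration) with the same 4x5 layout as above:

```python
import numpy as np

grid = np.arange(1, 21).reshape(4, 5)

# A single element: row index 1, column index 2
element = grid[1, 2]

# A sub-block: rows 1-2 and columns 2-3 (stop indices are exclusive)
block = grid[1:3, 2:4]

# The last column, using a negative index
last_col = grid[:, -1]

print(element)
print(block)
print(last_col)
```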


#### Boolean indexing
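We can use boolean indexing to select elements from an array based on a condition. For example, we can get all elements in the array `arr` that are greater than a given value. Comparing an array to a value produces a boolean array of the same shape, which can then be used as an index.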


In [42]:
arr = np.arange(1, 26).reshape(5, 5)

# Identify all elements in the array that are greater than 10
arr > 10

array([[False, False, False, False, False],
       [False, False, False, False, False],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

In [44]:
thresh_idx = arr > 10

arr[thresh_idx]

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25])

In [46]:
# Add 100 to every element greater than 10 (this modifies `arr` in place)
arr[thresh_idx] += 100

arr

array([[  1,   2,   3,   4,   5],
       [  6,   7,   8,   9,  10],
       [111, 112, 113, 114, 115],
       [116, 117, 118, 119, 120],
       [121, 122, 123, 124, 125]])
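Boolean masks can also be combined. The following sketch (with illustrative names) uses the element-wise operators `&` (and) and `~` (not), and `np.where` to build a modified copy instead of changing the array in place:

```python
import numpy as np

grid = np.arange(1, 26).reshape(5, 5)

# Combine conditions with & (and) or | (or); each comparison needs
# its own parentheses because & and | bind more tightly than > and <=
in_range = (grid > 10) & (grid <= 15)
selected = grid[in_range]

# ~ inverts a mask: everything NOT in the range
outside = grid[~in_range]

# np.where(condition, a, b) returns a new array, taking `a` where the
# condition is True and `b` elsewhere; `grid` itself is not modified
capped = np.where(grid > 20, 20, grid)

print(selected)
print(outside.size)
print(capped.max(), grid.max())
```

Note that Python's `and` / `or` keywords do not work element-wise on arrays, which is why the `&` / `|` operators are used here.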



### Array operations


### Math and summary statistics

### Linear algebra




## Pandas
Pandas is a Python package whose name derives from 'panel data'. It is a library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built *on top* of the Numpy package and its key data structure is called the `DataFrame`. A `DataFrame` allows you to store and manipulate tabular data in rows of observations and columns of variables.

### Pandas Data Structures

#### Series
In Pandas, a `Series` is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its `index`. The simplest `Series` is formed from only an array of data, in which case Pandas assigns a default integer index starting at 0.
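A minimal sketch of both cases follows. The gene labels and values are made up for illustration:

```python
import pandas as pd

# A Series built from data alone gets a default integer index (0, 1, 2, ...)
s = pd.Series([4.2, 7.1, -1.5])
print(s)

# A Series with explicit labels as its index
expr = pd.Series([36.3, 12.1, 6.4], index=["geneA", "geneB", "geneC"])

by_label = expr["geneB"]     # look up a value by its label
by_position = expr.iloc[0]   # look up a value by its integer position

print(by_label, by_position)
```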

#### DataFrame
A Pandas DataFrame is a tabular data structure comprised of rows and columns, akin to spreadsheet data. You can also think of a DataFrame as a group of Series objects (one per column) that share an index (the row labels). Unlike NumPy arrays, which must contain only a single data type, Pandas DataFrames can hold a different data type in each column.
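As a small illustration, the sketch below builds a DataFrame from a dictionary of columns with made-up values. Each column has its own data type, and selecting a single column gives back a `Series` that carries the DataFrame's index:

```python
import pandas as pd

df = pd.DataFrame({
    "gene_id": ["geneA", "geneB", "geneC"],   # text
    "expression": [36.3, 12.1, 6.4],          # floating point
    "detected": [True, True, False],          # boolean
})

# One data type per column
print(df.dtypes)

# A single column is a Series sharing the DataFrame's row index
col = df["expression"]
print(type(col))
print(col.index.equals(df.index))
```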

### Creating DataFrames

### Reading and writing data

### Indexing and selecting data

### Filtering data

### Sorting data

### Summarizing data

### Grouping data

### Merging data

## Tidy data
An important concept in data science is the idea of 'tidy data'. Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or 'tidy' depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data, each measured variable is a column, each observation is a row, and each type of observational unit is a table.

### Tidy data in Python
The `pandas` package provides a number of functions to help with the process of tidying data. 

The `melt` function is particularly useful. It takes a DataFrame as its first argument, and the names of the columns to be used as identifiers as its second argument. The remaining columns are then treated as 'measured variables' and 'melted' into a single column. 

The `melt` function returns a DataFrame with a new column called `variable` that contains the names of the columns that were melted, and a new column called `value` that contains the values of the melted columns.
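Before applying this to a real dataset, here is a minimal sketch on a tiny made-up table. The column names mimic an `age_celltype` pattern, but the values are invented for illustration:

```python
import pandas as pd

# A 'wide' table: one row per gene, one column per sample
wide = pd.DataFrame({
    "gene_id": ["geneA", "geneB"],
    "E15_cpn": [36.3, 12.1],
    "P1_cpn": [59.9, 16.3],
})

# 'gene_id' identifies each row; all other columns are melted
long = wide.melt(id_vars=["gene_id"])
print(long)

# The default 'variable' / 'value' column names can be overridden
named = wide.melt(id_vars=["gene_id"], var_name="sample", value_name="expression")
print(named.columns.tolist())
```

The result has one row per (gene, sample) pair: 2 genes x 2 samples gives 4 rows.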

In [65]:
import numpy as np
import pandas as pd

data = pd.read_csv('data/GSE63482_Expression_matrix.tsv', sep='\t')

data

Unnamed: 0,gene_id,E15_cpn,E15_corticothal,E15_subcereb,E16_cpn,E16_corticothal,E16_subcereb,E18_cpn,E18_corticothal,E18_subcereb,P1_cpn,P1_corticothal,P1_subcereb
0,0610007C21Rik,3.628850e+01,29.698300,3.340490e+01,39.377400,3.443750e+01,30.097900,42.217700,3.985680e+01,32.034800,59.940700,58.409400,54.186200
1,0610007L01Rik,1.206950e+01,10.966100,1.092010e+01,10.353700,1.100810e+01,12.301300,10.785700,1.141970e+01,11.606800,16.325000,17.054100,14.946100
2,0610007P08Rik,6.412380e+00,7.046340,7.641080e+00,6.971550,7.232610e+00,6.708510,6.204450,6.601510e+00,4.911680,3.993170,3.806410,3.238020
3,0610007P14Rik,2.089430e+01,13.672500,1.484450e+01,23.519400,1.624610e+01,28.872900,29.619700,2.117910e+01,36.838800,29.359400,22.103000,32.700000
4,0610007P22Rik,2.080070e+01,19.686500,2.199760e+01,18.090600,1.825500e+01,19.137600,18.087300,1.806960e+01,17.483200,22.254400,23.984300,24.621500
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25774,vesl-2,2.127080e-16,0.027159,1.526700e-25,0.025714,1.418250e-13,0.013786,0.024311,1.682800e-18,0.033038,0.020868,0.140264,0.060259
25775,wdp103,4.076470e-01,0.275522,3.704000e-01,0.640581,5.337540e-01,0.486473,0.441163,4.654600e-01,0.572070,0.564045,0.434918,0.402615
25776,wdr4,5.548660e+00,5.873140,6.091040e+00,5.836460,5.501650e+00,7.061410,6.971030,6.340210e+00,8.746420,11.419700,7.108320,10.236700
25777,wiz,6.625390e+00,6.581460,8.413580e+00,6.937630,6.275750e+00,7.135460,5.894330,4.287110e+00,6.076040,4.390600,3.832160,4.180550


In [66]:
# Describe the summary statistics for columns of this dataset
data.describe()

Unnamed: 0,E15_cpn,E15_corticothal,E15_subcereb,E16_cpn,E16_corticothal,E16_subcereb,E18_cpn,E18_corticothal,E18_subcereb,P1_cpn,P1_corticothal,P1_subcereb
count,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0,25779.0
mean,14.237358,13.660884,14.01689,13.954744,13.029715,14.124343,13.792249,13.316392,14.52966,15.147968,15.036684,17.138852
std,62.050938,52.792316,56.553113,63.584769,49.179222,67.603854,62.733385,51.208908,75.162498,72.975062,63.279192,99.209949
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.374059,0.466004,0.387307,0.459725,0.504652,0.412668,0.509617,0.524263,0.502233,0.527402,0.535642,0.483513
75%,11.3166,11.2476,11.4849,11.15395,11.19725,11.3193,11.2916,11.3623,11.3389,10.9316,11.07125,10.96805
max,4596.47,3149.32,3755.35,5389.12,3355.66,6461.55,5450.83,3714.89,7337.0,6701.26,4788.77,9825.18


In [69]:

# Summary statistics for each column, computed over only the first 10 rows
data.head(10).describe()

Unnamed: 0,E15_cpn,E15_corticothal,E15_subcereb,E16_cpn,E16_corticothal,E16_subcereb,E18_cpn,E18_corticothal,E18_subcereb,P1_cpn,P1_corticothal,P1_subcereb
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,16.135965,13.373856,14.345738,15.997099,14.42257,15.343884,16.572042,16.060723,15.743985,19.289689,19.679083,18.698364
std,11.876138,9.38481,10.34811,12.421713,10.801148,11.019936,13.515714,12.905686,12.528457,18.349713,18.329739,17.446516
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7.82666,8.02628,8.460835,7.817087,7.987858,7.875108,7.346488,7.411968,6.57026,4.675873,4.575328,4.907575
50%,17.4025,12.7071,14.2387,14.91165,13.6271,15.3856,14.4365,14.74465,14.545,19.2897,19.57855,14.5753
75%,21.2855,19.666925,20.775225,23.832825,21.280575,24.261525,24.589875,21.80895,22.27345,28.1467,29.384975,30.680375
max,36.2885,29.6983,33.4049,39.3774,34.4375,30.0979,42.2177,39.8568,36.8388,59.9407,58.4094,54.1862


In [72]:
# Melt the data into a long format using the `gene_id` column as the id variable
# it is important here to explicitly let `melt` know what the id variables are,
# otherwise `gene_id` would be melted along with the measurement columns
data_melted = data.melt(id_vars=['gene_id'])

data_melted

Unnamed: 0,gene_id,variable,value
0,0610007C21Rik,E15_cpn,36.288500
1,0610007L01Rik,E15_cpn,12.069500
2,0610007P08Rik,E15_cpn,6.412380
3,0610007P14Rik,E15_cpn,20.894300
4,0610007P22Rik,E15_cpn,20.800700
...,...,...,...
309343,vesl-2,P1_subcereb,0.060259
309344,wdp103,P1_subcereb,0.402615
309345,wdr4,P1_subcereb,10.236700
309346,wiz,P1_subcereb,4.180550


In [73]:
# Split the 'variable' column into two columns for 'age' and 'celltype'
data_melted[['age', 'celltype']] = data_melted['variable'].str.split('_', expand=True)

data_melted

Unnamed: 0,gene_id,variable,value,age,celltype
0,0610007C21Rik,E15_cpn,36.288500,E15,cpn
1,0610007L01Rik,E15_cpn,12.069500,E15,cpn
2,0610007P08Rik,E15_cpn,6.412380,E15,cpn
3,0610007P14Rik,E15_cpn,20.894300,E15,cpn
4,0610007P22Rik,E15_cpn,20.800700,E15,cpn
...,...,...,...,...,...
309343,vesl-2,P1_subcereb,0.060259,P1,subcereb
309344,wdp103,P1_subcereb,0.402615,P1,subcereb
309345,wdr4,P1_subcereb,10.236700,P1,subcereb
309346,wiz,P1_subcereb,4.180550,P1,subcereb
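One payoff of this layout is that each variable (`age`, `celltype`) now lives in its own column, so summaries by variable become short expressions. The sketch below repeats the split on a tiny made-up table and then averages by age. The use of `groupby` here is an illustration, not a step from the cells above:

```python
import pandas as pd

# A tiny tidy-style table with invented values
tidy = pd.DataFrame({
    "gene_id": ["geneA", "geneA", "geneA", "geneA"],
    "variable": ["E15_cpn", "E15_subcereb", "P1_cpn", "P1_subcereb"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# Split 'variable' on '_' into two new columns, as above
tidy[["age", "celltype"]] = tidy["variable"].str.split("_", expand=True)

# Mean value for each age, across cell types
mean_by_age = tidy.groupby("age")["value"].mean()
print(mean_by_age)
```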
