## Comprehensions, Generators, NumPy, Pandas
### BIOINF 575 - Fall 2020

### For loop RECAP

### for: the repetitive control structure with a known number of steps

To loop through a sequence of elements is to iterate

```python
for var in sequence:
    statements
```

___ 

### Python Comprehension Statements
Courtesy of Marcurs Sherman - partly adapted

First, the **purpose** of comprehensions:
> "\[...\] comprehensions provide a more concise way to create \[iterables\] in situations where `map()` and `filter()` and/or nested loops would currently be used" - Barry Warsaw, [PEP 202](https://www.python.org/dev/peps/pep-0202/)

Comprehensions are what we call "_syntactic sugar_". 
This means that they do not do anything you could not have done already. But, with them, you can do some operations more easily.
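To make the "syntactic sugar" point concrete, here is a minimal sketch that builds the same list three ways: with an explicit loop, with `map()` and `filter()`, and with a comprehension. The sequences are example data chosen only for illustration.

```python
# Example data (illustrative only)
sequences = ["ACTTG", "AAAGTC", "CCTAC", "AAACCT"]

# 1. Explicit for loop
lengths_loop = []
for seq in sequences:
    if seq.startswith("AA"):
        lengths_loop.append(len(seq))

# 2. map() and filter()
lengths_map = list(map(len, filter(lambda s: s.startswith("AA"), sequences)))

# 3. List comprehension
lengths_comp = [len(seq) for seq in sequences if seq.startswith("AA")]

print(lengths_loop, lengths_map, lengths_comp)
```

All three produce the same list; the comprehension simply states the result in one expression.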

<img src="../images/venn_diagram2.png" width=400 />

---
### Comprehension Syntax

#### Legend

<img src="../images/legendary.png" width=250 />

#### Examples
<img src="../images/comprehensions.png" width=500 />

#### Alternate syntax of a comprehension

<center><img src="http://python-3-patterns-idioms-test.readthedocs.io/en/latest/_images/listComprehensions.gif" width = "500"/></center>

---
#### The Comprehension Categories
1. `list` comprehensions - create a list
2. `dict`ionary comprehensions - create dictionaries
3. `set` comprehensions - create sets
4. `tuple`? comprehensions

In [2]:
sequences = ["ACTTG", "AAAGTC", "CCTAC", "AAACCT"]

In [3]:
# list comprehensions

counts_list = []
for seq in sequences:
    counts_list.append(len(seq))
counts_list

[5, 6, 5, 6]

In [4]:
[len(seq) for seq in sequences]

[5, 6, 5, 6]

In [5]:
[len(seq)*2 for seq in sequences if seq.upper().startswith("AA")]

[12, 12]

In [7]:
# dictionary comprehensions
{sequences[i]:counts_list[i] for i in (range(len(sequences)))}


{'ACTTG': 5, 'AAAGTC': 6, 'CCTAC': 5, 'AAACCT': 6}

In [None]:
{sequences[i]:counts_list[i]**2 for i in (range(len(sequences)))}

In [9]:
{sequences[i]:counts_list[i]**2 for i in (range(len(sequences))) if counts_list[i]%2}

{'ACTTG': 25, 'CCTAC': 25}

In [10]:
# set comprehensions

{count for count in counts_list}


{5, 6}

In [15]:
# string comprehension - they add complexity but can be done
long_sequence = "ACTTGAAT ACTTAG cggat"

"-".join([character.upper() for character in long_sequence if character.upper() in ("C","G")])



'C-G-C-G-C-G-G'

In [16]:
"".join([character.upper() for character in long_sequence if character.upper() in ("C","G")])



'CGCGCGG'

### Some pros of comprehensions
1. Concise - their use can easily distill multiple lines of code into a single, concise statement
1. Efficient (time and other resources) - _slightly_ more performant than regular loops
1. Flexible output - list, set, dictionary ...

### Some cons of comprehensions
1. The unusual syntax - the parts are written in a different order than in the equivalent `for` loop (the output expression comes first)
1. Readability - comprehension statements get more unreadable as complexity is added

### RESOURCES

In [17]:
# Now, try to make a `tuple` comprehension
(number * 2 for number in range(10))

<generator object <genexpr> at 0x11ced1890>

In [18]:
tuple((number * 2 for number in range(10)))

(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

In [19]:
number_gen_ex = (number * 2 for number in range(10))

In [20]:
next(number_gen_ex)

0

In [21]:
next(number_gen_ex)

2

In [22]:
next(number_gen_ex)

4

### Python Generators
Courtesy of Marcurs Sherman - partly adapted

#### The parenthesized form tried above as a "tuple comprehension" is actually called a "generator expression".

<img src="http://nvie.com/img/relationships.png" width=600 align='middle'/>

___
#### Functions RECAP

```python

# DEFINITION - creating a function

def function_name(arg1, arg2, darg=None):
    # instructions to compute result
    return result

# CALL - running a function

function_result = function_name(val1, val2, dval)
```

___


A generator is just a special case of a function. The main difference is how it gives its output. 

How do you make a function give a result?

In [23]:
def number_one():
    number = 1
    return number

In [24]:
number_one()

1

In [25]:
# create a generator for an infinite sequence of numbers
# uses yield instead of return

def infinite_sequence():
    number = 0
    while True:
        yield number
        number += 1

In [26]:
numbers_seq_gen = infinite_sequence()

In [27]:
numbers_seq_gen

<generator object infinite_sequence at 0x11cefd900>

In [28]:
next(numbers_seq_gen)

0

In [40]:
next(numbers_seq_gen)

12

#### and we can do next again and again ...

In [42]:
# a generator for a finite sequence of numbers
# this starts to look like range

def finite_sequence(limit):
    number = 0
    while number < limit:
        yield number
        number += 1

In [43]:
numbers_seq_gen = finite_sequence(3)

In [44]:
numbers_seq_gen

<generator object finite_sequence at 0x11ced99e0>

In [45]:
next(numbers_seq_gen)

0

In [46]:
next(numbers_seq_gen)

1

In [47]:
list(finite_sequence(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [48]:
list(finite_sequence(5))

[0, 1, 2, 3, 4]

In [49]:
next(numbers_seq_gen)

2

In [50]:
next(numbers_seq_gen)

StopIteration: 

In [54]:
x = range(4,10)

In [55]:
next(x)

TypeError: 'range' object is not an iterator
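The error arises because a `range` is an iterable but not an iterator: it can hand out iterators, but it does not itself support `next()`. A minimal sketch of getting an iterator from it with the built-in `iter()`:

```python
x = range(4, 10)

# iter() asks the iterable for a fresh iterator, which does support next()
x_iter = iter(x)
first = next(x_iter)
second = next(x_iter)
print(first, second)

# The range itself is not consumed: it can be iterated again from the start
print(list(x))
```

This is also what a `for` loop does behind the scenes: it calls `iter()` once and then `next()` repeatedly until `StopIteration` is raised.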

In [52]:
dir(range)

['__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'count',
 'index',
 'start',
 'step',
 'stop']

In [56]:
dir(numbers_seq_gen)

['__class__',
 '__del__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__name__',
 '__ne__',
 '__new__',
 '__next__',
 '__qualname__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'close',
 'gi_code',
 'gi_frame',
 'gi_running',
 'gi_yieldfrom',
 'send',
 'throw']

In [57]:
def zip_2sequences(seq1, seq2):
    pos = 0
    n = min(len(seq1),len(seq2))
    while  pos < n:
        yield seq1[pos], seq2[pos]
        pos += 1

In [58]:
zip_gen = zip_2sequences([1,2,3], ["A", "B", "C", "D"])

In [59]:
next(zip_gen)

(1, 'A')

In [60]:
next(zip_gen)

(2, 'B')

In [61]:
list(zip_2sequences([1,2,3], ["A", "B", "C", "D"]))

[(1, 'A'), (2, 'B'), (3, 'C')]

In [62]:
x = (i for i in range(10))

In [65]:
next(x)

2

In [66]:
help(min)

Help on built-in function min in module builtins:

min(...)
    min(iterable, *[, default=obj, key=func]) -> value
    min(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its smallest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the smallest argument.



In [67]:
min(3,4)

3

In [68]:
l1 = [1,2,3,4]
l2 = [4,5,6,7,6,76,7]

min(len(l1),len(l2))

4

In [69]:
dir(x)

['__class__',
 '__del__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__name__',
 '__ne__',
 '__new__',
 '__next__',
 '__qualname__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'close',
 'gi_code',
 'gi_frame',
 'gi_running',
 'gi_yieldfrom',
 'send',
 'throw']

---
# Conclusion
Generators and generator expressions should be a standard tool in every bioinformaticist's tool belt. 

1. Generator expressions can compress simple for loops down to a single line
1. List comprehensions tend to be more efficient than standard for loops when the data is sufficiently large
1. The same syntax to make a list comprehension can be used to make dictionaries, sets, and generators
1. Generators are iterators that lazily evaluate the next value and `yield` it back
1. Once a generator (or any iterator) is exhausted, it is consumed and cannot be restarted; create a new one to iterate again
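The last point can be sketched with the `finite_sequence` generator function from earlier, repeated here so the example runs on its own:

```python
def finite_sequence(limit):
    """Yield the integers 0, 1, ..., limit - 1 one at a time."""
    number = 0
    while number < limit:
        yield number
        number += 1

gen = finite_sequence(3)
first_pass = list(gen)    # consumes every value
second_pass = list(gen)   # the generator is already exhausted

print(first_pass)
print(second_pass)

# To iterate again, call the generator function again
fresh_pass = list(finite_sequence(3))
print(fresh_pass)
```

The second pass yields an empty list rather than an error, which can hide bugs when a generator is accidentally reused.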

### Some pros of generators
1. Lazy evaluation: does not produce all the data at one time
1. Maintains state between steps: does not forget where it left off
1. Easily handles data of any size

### Some cons of generators
1. Hard to explain to someone who does not use Python
1. When the data is sufficiently small, the trade-off is not worth it

#### RESOURCES 



#### File read and write - RECAP

A file is a named location on disk used to store related information.    
It stores data permanently in non-volatile memory (e.g. a hard disk).<br>
https://www.programiz.com/python-programming/file-operation    


#### Open a file for reading or writing   
 
https://docs.python.org/3/library/functions.html#open

```python
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
 
fileObj = open(fileName, 'r') # open file for reading ('r+' for reading and writing)
fileObj = open(fileName, 'w') # open file for writing ('w+' for writing and reading)
fileObj = open(fileName, 'a') # open file for appending ('a+' for appending and reading)
```
(Note: fileName must be a string or reference to one)

The file object is iterable by line

```python
for line in fileObj:
    print(line)
```


#### RESOURCES 

https://www.tutorialspoint.com/python/python_files_io.htm  
https://www.tutorialspoint.com/python/file_methods.htm


In [71]:
# help(open)

<b>Write</b>

In [72]:
# open the file, write a line, then close it - the content is only guaranteed to be on disk after closing
test_file = open("test.txt", mode = "a")
test_file.write("Adding some more text.\n")
test_file.close()

In [74]:
#dir(test_file)

In [75]:
# writing a list of sequences and their length to a file

sequences = ["ACTTG", "AAAGTCA", "CCTACTTG", "AAACCT"]

with open("test_seq_w.txt", mode = "w") as seq_file:
    for seq in sequences:
        seq_file.write(seq + " " + str(len(seq)) + "\n")


<b>Read</b>

### <font color = "red">Exercise</font>:   

Open the file test_seq_w.txt and read file contents line by line - building a dictionary with the sequence as the key and the length as value.



In [None]:
# open file and read file contents line by line - build a dictionary
res_dict = {}

with open("test_seq_w.txt", mode = "r") as seq_file:
    
    pass # here we should read lines from file and build the dictionary

res_dict

In [82]:
res_dict = {}

with open("test_seq_w.txt", mode = "r") as seq_file:
    for line in seq_file:
        key, value = line.split()
        print(key)
        print(value)

res_dict

ACTTG
5
AAAGTCA
7
CCTACTTG
8
AAACCT
6


{}

In [83]:
res_dict = {}

with open("test_seq_w.txt", mode = "r") as seq_file:
    for line in seq_file:
        key, value = line.split()
        res_dict[key] = value

res_dict

{'ACTTG': '5', 'AAAGTCA': '7', 'CCTACTTG': '8', 'AAACCT': '6'}
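In the result above the lengths are strings, because everything read from a text file is text. A minimal sketch of converting them with `int()` inside a dictionary comprehension follows; `io.StringIO` is used here as a stand-in for the file (an assumption made so the example runs without `test_seq_w.txt` on disk).

```python
import io

# Stand-in for test_seq_w.txt so the example runs without the file on disk
seq_file = io.StringIO("ACTTG 5\nAAAGTCA 7\nCCTACTTG 8\nAAACCT 6\n")

# Each line is "<sequence> <length>"; int() converts the length text to a number
res_dict = {key: int(value) for key, value in (line.split() for line in seq_file)}
print(res_dict)
```

The inner parenthesized expression is a generator expression, so the lines are split one at a time as the dictionary is built.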

___
Note: We could use a generator to get lines from really large files.
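As a rough sketch of that idea, the generator function below yields one parsed record per line, so only the current line is held in memory. The name `read_records` and the temporary file are hypothetical, introduced only to keep the example runnable.

```python
import os
import tempfile

def read_records(path):
    """Yield (sequence, length) pairs one line at a time (hypothetical helper)."""
    with open(path, mode="r") as handle:
        for line in handle:          # the file object itself is a lazy line iterator
            fields = line.split()
            if fields:               # skip blank lines
                yield fields[0], int(fields[1])

# Small temporary file so the sketch is runnable; a real file could be far larger
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as tmp:
    tmp.write("ACTTG 5\nAAAGTCA 7\nCCTACTTG 8\n")
    tmp_path = tmp.name

records = read_records(tmp_path)
first_record = next(records)         # only the first line has been read so far
remaining = list(records)            # read the rest
os.remove(tmp_path)

print(first_record)
print(remaining)
```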

_____


### NumPy - Numerical Python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width = "100">

NumPy (np) is the premier Python package for scientific computing

https://numpy.org

Its power comes from the <b>N-dimensional array object</b>

np is a *lower*-level numerical computing library. 

This means that, while you can use it directly, most of its power comes from the packages built on top of np:
* Pandas (*pan*el *da*ta)
* Scikit-learn (machine learning)
* Scikit-image (image processing)
* OpenCV (computer vision)
* more...

<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" alt="data structures" width="500">

<b>NumPy basics</b>

Arrays are designed to:
* handle vectorized operations; lists are not
    * if you apply a function it is performed on every item in the array, rather than on the whole array object
* store multiple items <b>of the same data type</b>
* have 0-based indexing

* Missing values can be represented using the `np.nan` object
    * the object `np.inf` represents infinity
* Array size cannot be changed in place; to resize, create a new array
* An equivalent numpy array occupies much less space than a python list of lists
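A short sketch of two of the points above, vectorized operations and the single shared data type, using made-up values:

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

# Multiplying a list repeats it; multiplying an array acts on every element
doubled_list = values_list * 2
doubled_array = values_array * 2
print(doubled_list)
print(doubled_array)

# Mixing integers with a missing value forces one common dtype (float)
with_missing = np.array([1, np.nan, 3])
print(with_missing.dtype)
print(np.isnan(with_missing))
```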

<b>Create Array</b><br>
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html

<b>Importing NumPy<br>
Convention: use np alias</b>

In [84]:
import numpy as np

In [87]:
help(np.array)

Help on built-in function array in module numpy:

array(...)
    array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
    
    Create an array.
    
    Parameters
    ----------
    object : array_like
        An array, any object exposing the array interface, an object whose
        __array__ method returns an array, or any (nested) sequence.
    dtype : data-type, optional
        The desired data-type for the array.  If not given, then the type will
        be determined as the minimum type required to hold the objects in the
        sequence.
    copy : bool, optional
        If true (default), then the object is copied.  Otherwise, a copy will
        only be made if __array__ returns a copy, if obj is a nested sequence,
        or if a copy is needed to satisfy any of the other requirements
        (`dtype`, `order`, etc.).
    order : {'K', 'A', 'C', 'F'}, optional
        Specify the memory layout of the array. If object is not an array, the
        newly crea

In [88]:
# Build array from Python list
vector = np.array([1,2,3])
vector

array([1, 2, 3])

<b>Basic array attributes:</b>
* shape: size of the array along each dimension
* size: number of elements in the array
* ndim: number of array dimensions (len(arr.shape))
* dtype: data type of the array elements

In [89]:
np.array([1,"2",3])

array(['1', '2', '3'], dtype='<U21')

In [90]:
np.array([1,[2,3,4],3])

array([1, list([2, 3, 4]), 3], dtype=object)

In [91]:
dir(vector)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__

In [92]:
# length of array
vector.size

3

In [93]:
# shape tells us the size along each dimension and, implicitly, the number of dimensions
vector.shape

(3,)

In [94]:
# help(np.zeros)

In [95]:
# matrix with zeros 
np.zeros((3,4), dtype = int)

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

In [None]:
# matrix with 1s
np.ones((3,4), dtype=int)

In [None]:
# matrix with a constant value
value = 20
np.full((3,4,2), value)

In [None]:
# Create a 4x4 identity matrix
np.eye(4)        

In [None]:
# arange - numpy range
np.arange(10, 30, 2)

In [None]:
# evenly spaced numbers over a specified interval
ev_array = np.linspace(1, 10, 20)
print(ev_array)
ev_array.shape

<b>Random data</b><br>
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html

In [None]:
help(np.random.random)

In [None]:
# Create an array filled with random values 
# Results are from the "continuous uniform" distribution over the half-open interval [0.0, 1.0).

np.random.random((3,4))        

In [None]:
# Create an array filled with random values from the standard normal distribution
np.random.randn(3,4)    

In [None]:
# Generate the same random numbers every time
# Set seed
np.random.seed(10)

In [None]:
np.random.randn(3,4)

In [None]:
np.random.seed(10)
print(np.random.randn(3,4))

print(np.random.randn(3,4))

np.random.seed(100)
print(np.random.randn(3,4))
                

```python
# Create the random state - for reproducible results
rs = np.random.RandomState(100)
```
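NumPy 1.17 and later also offer the `Generator` interface, created with `np.random.default_rng`. The sketch below uses it to show the reproducibility idea: two generators created with the same seed return the same numbers. The seed value 100 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(100)
rng_b = np.random.default_rng(100)

sample_a = rng_a.standard_normal((3, 4))
sample_b = rng_b.standard_normal((3, 4))

print(np.array_equal(sample_a, sample_b))
```

Unlike `np.random.seed`, which changes global state, each generator object carries its own state and can be passed to the functions that need it.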

In [None]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix

In [None]:
# let's check them out 
matrix.shape

In [None]:
matrix.size

In [None]:
matrix = np.array([[[1,2],[2,3],[3,4]],[[4,5],[4,6],[6,7]]])

In [None]:
matrix.ndim

In [None]:
matrix.dtype

In [None]:
matrix.T

In [None]:
matrix

<b>Reshaping</b>

In [None]:
matrix

In [None]:
# Reshaping
matrix_reshaped = matrix.reshape(2,6)
matrix_reshaped

<b>Indexing/Slicing</b>

In [None]:
# List-like
matrix_reshaped[1][1]

In [None]:
# Using both rows and columns indices

matrix_reshaped[1,3]

In [None]:
matrix_reshaped[1,:3]

In [None]:
matrix_reshaped[:2,:3]

In [None]:
# iterating ... let's print the elements of matrix_reshaped
nrows = matrix_reshaped.shape[0]
ncols = matrix_reshaped.shape[1]

for i in range(nrows):
    for j in range(ncols):
        print(matrix_reshaped[i,j])



In [None]:
# Fun arrays - display a checkers_board list
checkers_board = np.zeros((8,8),dtype=int)
checkers_board[1::2,::2] = 1
checkers_board[::2,1::2] = 1
print(checkers_board)

Create a 2d array with 1 on the border and 0 inside

In [None]:
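boarder_array = np.zeros((8,8),dtype=int)
boarder_array[0,:] = 1    # top row
boarder_array[-1,:] = 1   # bottom row
boarder_array[:,0] = 1    # left column
boarder_array[:,-1] = 1   # right column

boarder_array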

In [None]:
boarder_array = np.ones((8,8),dtype=int)
boarder_array[1:-1,1:-1] = 0
boarder_array

In [None]:
boarder_array[:,-1]

<b>Performance</b>


In [None]:
test_list = list(range(int(1e6)))
test_vector = np.array(test_list)

In [None]:
%%timeit
sum(test_list)

In [None]:
%%timeit
np.sum(test_vector)

https://numpy.org/devdocs/user/quickstart.html#universal-functions

<b>Matrix operations</b>

https://www.tutorialspoint.com/matrix-manipulation-in-python<br>
Arithmetic operators on arrays apply elementwise. <br> 
A new array is created and filled with the result.


<b>Array broadcasting</b><br>

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html<br>
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. <br>
Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.

<img src = "https://www.tutorialspoint.com/numpy/images/array.jpg" height=10/>


https://www.tutorialspoint.com/numpy/numpy_broadcasting.htm
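The compatibility rule can be sketched as follows: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1. The arrays below are small examples chosen for illustration.

```python
import numpy as np

matrix = np.arange(1, 13).reshape(4, 3)      # shape (4, 3)
row = np.array([10, 20, 30])                 # shape (3,)
col = np.array([1, 2, 3, 4]).reshape(4, 1)   # shape (4, 1)

# (4, 3) with (3,)   -> the row is applied to every row of the matrix
# (4, 3) with (4, 1) -> the column is applied to every column of the matrix
print((matrix + row).shape)
print((matrix + col).shape)

# A length-4 vector does not line up with the last dimension (3)
try:
    matrix + np.array([1, 2, 3, 4])
except ValueError as err:
    print("cannot broadcast:", err)
```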

In [None]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix


In [None]:
# create array of length 4 and reshape it to make it a column
col_vec = np.array([1,2,3,4]).reshape(4,1)
col_vec

In [None]:
# addition with a data column
matrix + col_vec

In [None]:
##########

matrix

In [None]:
# addition with a data row
matrix + np.array([1,2,3])

In [None]:
##########

matrix

In [None]:
col_vec

In [None]:
# multiplication with a data column
matrix * col_vec

In [None]:
##########

matrix

In [None]:
# create 4x3 matrix
matrix2 = np.array([[1,2,3],[5,6,7],[1,1,1],[2,2,2]])
matrix2

In [None]:
# multiplication with a matrix of the same shape
matrix * matrix2

In [None]:
##########

matrix

In [None]:
# matrix multiplication
col_vec = np.array([1,2,3]).reshape(3,1)
matrix.dot(col_vec)

In [None]:
# matrix multiplication - more recently
matrix@(np.array([1,2,3]).reshape(3,1))

In [None]:
##########

matrix

In [None]:
matrix2

In [None]:
# stacking arrays together - vertically
np.vstack((matrix,matrix2))

In [None]:
# stacking arrays together - horizontally
np.hstack((matrix,matrix2))

In [None]:
##########

matrix

In [None]:
# splitting arrays 
np.vsplit(matrix,2)

In [None]:
##########

matrix

In [None]:
np.hsplit(matrix,(2,3))

<b>Copy</b>

In [None]:
matrix

In [None]:
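# assignment makes no copy - matrix_copy is just another name for the same array
matrix_copy = matrix
# view() makes a shallow copy - a new array object that looks at the same data
matrix_copy1 = matrix.view()
print(matrix_copy)
print(matrix_copy1)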

In [None]:
print(matrix)

print(matrix_copy)

print(matrix_copy1)

In [None]:
matrix_copy1[0,0] = 5

In [None]:
# deep copy
matrix_copy2 = matrix.copy()
print(matrix_copy2)

In [None]:
matrix_copy2[0,0] = 7

In [None]:
print(matrix)

print(matrix_copy)

print(matrix_copy1)

print(matrix_copy2)
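The difference between an assignment, a view, and a deep copy can also be checked directly. This is a small self-contained sketch using `np.shares_memory`; the variable names are chosen for the example.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

alias = matrix            # same object under a second name
view = matrix.view()      # new array object, same underlying data
deep = matrix.copy()      # new array object with its own data

print(alias is matrix, view is matrix, deep is matrix)
print(np.shares_memory(matrix, view), np.shares_memory(matrix, deep))

view[0, 0] = 99           # visible through matrix and alias, not through deep
print(matrix[0, 0], alias[0, 0], deep[0, 0])
```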

#### <b>More matrix computation</b>

#### conditional subsetting - use array of booleans to subset array with only the elements where the bool array is True

In [None]:
# conditional subsetting
matrix[(6 < matrix[:,0])]

In [None]:
matrix[(4 <= matrix[:,0]) & (matrix[:,0] <= 7)
       & (2 <= matrix[:,1]) & (matrix[:,1] <= 7),]

In [None]:
matrix

#### Use the axis argument to compute mean for each column or row
#### axis = 0 - columns
#### axis = 1 - rows

In [None]:
# col mean 
matrix.mean(axis = 0)

In [None]:
# row mean
matrix.mean(axis = 1)

In [None]:
# unique values and counts
matrix = np.array([[ 5,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
uvals, counts = np.unique(matrix, return_counts=True)
print(uvals,counts)

https://www.w3resource.com/python-exercises/numpy/index.php


Create a matrix of 5 rows and 6 columns with numbers from 1 to 30.
Add 2 to the odd values of the array.

In [None]:
matrix = np.arange(1,31).reshape(5,6)
matrix[matrix%2==1] +=  2 
matrix

Normalize the values in the matrix. Subtract the mean and divide by the standard deviation.

In [None]:
mat_mean = np.mean(matrix)
mat_std = np.std(matrix)
matrix_norm = (matrix - mat_mean)/mat_std
matrix_norm

In [None]:
matrix

Create a random array (5 by 3) and compute: 
   * the sum of all elements 
   * the sum of the rows  
   * the sum of the columns

In [None]:
matrix = np.random.rand(5,3)
print(matrix)
print(matrix.sum())   # sum of all elements
print(matrix.sum(1))  # sum of each row
print(matrix.sum(0))  # sum of each column

In [None]:
#Given a set of Gene Ontology (GO) terms and the genes that are associated with these terms find the gene 
#that is associated with the most GO terms

go_terms=np.array(["cellular response to nicotine",
                   "cellular response to hypoxia",
                   "cellular response to lipid"])
genes=np.array(["BAD","KCNJ11","MSX1","CASR","ZFP36L1"])

assoc_matrix = np.array([[1,1,0,1,0],[1,0,0,1,1],[1,0,0,0,0]])

print(assoc_matrix)

In [None]:
max_gono = max(assoc_matrix.sum(0))
max_gono

In [None]:
sum_array = assoc_matrix.sum(0)
sum_array

In [None]:
np.where(sum_array == max_gono) # returns a tuple of index arrays

In [None]:
# help(np.where)

In [None]:
pos = np.where(sum_array == max_gono)[0][0]
pos

In [None]:
genes[pos]
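An alternative sketch of the same lookup uses `np.argmax`, which returns the position of the first maximum directly. The data is repeated so the example runs on its own; the names `terms_per_gene`, `best_pos` and `top_genes` are introduced only for this illustration.

```python
import numpy as np

genes = np.array(["BAD", "KCNJ11", "MSX1", "CASR", "ZFP36L1"])
assoc_matrix = np.array([[1, 1, 0, 1, 0],
                         [1, 0, 0, 1, 1],
                         [1, 0, 0, 0, 0]])

terms_per_gene = assoc_matrix.sum(axis=0)   # column sums: GO terms per gene
best_pos = np.argmax(terms_per_gene)        # position of the first maximum
print(genes[best_pos], terms_per_gene[best_pos])

# If several genes could tie, compare against the maximum to keep all of them
top_genes = genes[terms_per_gene == terms_per_gene.max()]
print(top_genes)
```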

_____

### Pandas

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

#### How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array holds elements of a single data type.
* A pandas DataFrame can hold many types. 
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
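A small sketch of this difference, with made-up gene values: the NumPy array coerces mixed input to one common type, while the DataFrame keeps a separate type for each column.

```python
import numpy as np
import pandas as pd

# A NumPy array coerces mixed input to one common type (here, strings)
mixed_array = np.array(["BRAF", 6.7, True])
print(mixed_array.dtype)

# A DataFrame keeps a separate type per column (example values are made up)
df = pd.DataFrame({
    "gene": ["EGFR", "IL6", "BRAF"],
    "expression": [2.5, 10.2, 6.7],
    "n_go_terms": [3, 4, 5],
})
print(df.dtypes)
```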

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Pandas is built on top of NumPy, so it is useful, but not necessary, to import NumPy at the same time.

```python
import numpy as np
import pandas as pd


```

#### 1. `pd.Series`

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']
 
 
 Methods
 
 ['abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'asfreq',
 'asof',
 'astype',
 'at_time',
 'autocorr',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'duplicated',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'head',
 'hist',
 'idxmax',
 'idxmin',
 'infer_objects',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'shift',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'var',
 'view',
 'where',
 'xs']

#### From a Python list

In [None]:
import numpy as np
import pandas as pd

In [None]:
labels = ["gene","protein","miRNA","metabolites"]
values = [3,4,5,6]
series_named_val = pd.Series(data = values, index=labels)


#### From dictionary

In [None]:
dict_var = dict(zip(labels, values))
pd.Series(dict_var)

In [None]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
# Create new series
new_series = pd.Series(data = dict_var)
new_series

In [None]:
#help(new_series.idxmax)

In [None]:
# Return the index of the row with the max value
new_series.idxmax()

In [None]:
# generate descriptive statistics
new_series.describe()

In [None]:
# check for missing values
new_series.isna()

#### 2. `pd.DataFrame`

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

In [None]:
correlation_array = np.arange(40,52).reshape(3,4)
genes_rows = ["HER2","PIK3CA", "BRAF"]
genes_cols = ["HER1","EGFR", "IL6", "INSR"]
df_gene_correlation = pd.DataFrame(correlation_array, genes_rows, genes_cols)
df_gene_correlation

In [None]:
# Explore DataFrame attributes and methods

df_gene_correlation.T

In [None]:
df_gene_correlation.sort_values(by='EGFR',ascending=False)

In [None]:
df_gene_correlation.aggregate(np.mean, 1)

In [None]:
df_gene_correlation.size

In [None]:
df_gene_correlation.index

In [None]:
df_gene_correlation.dtypes

In [None]:
'''
Create a 4 by 5 array with values from 20 to 80 going with a step of 3 
Create a list with row names: Gene1, Gene2 ...
Create a list with column names: GO_Term1, GO_Term2 ...
Create a DataFrame from the array created with the respective 
row names and column names from the lists
'''
values_array = np.arange(20,80,3).reshape(4,5)

#genes = ["Gene1","Gene2","Gene3","Gene4"]
#genes = ["Gene"+str(i+1) for i in range(values_array.shape[0])]
genes = []
for i in range(values_array.shape[0]):
    genes.append("Gene"+str(i+1))
genes

go_terms = ["GO_Term"+str(i+1) for i in range(values_array.shape[1])]
go_terms
df_gene_go = pd.DataFrame(values_array,genes,go_terms)
df_gene_go

#### From `pd.Series`

In [None]:
# Create pd.Series from the list with the go-terms names and set the name "new_row"
numbers_list = list(range(4,9))
numbers_series = pd.Series(numbers_list, index = go_terms, name = "new_row")

#### Row-wise (`append`)

In [None]:
# Now add on a row
df_gene_go.append(numbers_series)
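Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above raises an `AttributeError` on recent versions. A minimal sketch of the same row-wise addition with `pd.concat` follows; the example objects are rebuilt here so it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the example objects so the sketch is self-contained
values_array = np.arange(20, 80, 3).reshape(4, 5)
genes = ["Gene" + str(i + 1) for i in range(4)]
go_terms = ["GO_Term" + str(i + 1) for i in range(5)]
df_gene_go = pd.DataFrame(values_array, genes, go_terms)
numbers_series = pd.Series(list(range(4, 9)), index=go_terms, name="new_row")

# to_frame().T turns the Series into a one-row DataFrame labeled "new_row"
df_with_row = pd.concat([df_gene_go, numbers_series.to_frame().T])
print(df_with_row)
```

As with `append`, the original DataFrame is not modified; `pd.concat` returns a new one.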

#### Column-wise (`join`/`concat`)

#### `join`

In [None]:
df_gene_go

In [None]:
numbers_series1 = pd.Series([1,2,3], index = ["Gene1", "Gene2", "Gene3"], name = "new_column")


In [None]:
#different size
df_gene_go.join(numbers_series1)

#### `concat`

In [None]:
# Same size
numbers_series2 = pd.Series([1,2,3,4], index = genes, name = "new_column1")
pd.concat([df_gene_go, numbers_series2], axis=1)

In [None]:
# Unequal size
pd.concat([df_gene_go, numbers_series1], axis=1)

#### I/O in Pandas

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas handles much of this for you.

#### CSV Files

#### Output

You can easily save your `DataFrames`

In [None]:
df_gene_go.to_csv('dataframe_data.csv')

In [None]:
# help(df_gene_go.to_csv)

In [None]:
df_gene_go.to_csv('dataframe_data.csv', index = True)

#### Input

You can easily bring data from a file into a `DataFrame`

In [None]:
pd.read_csv('dataframe_data.csv', index_col = 0)

In [None]:
# help(pd.read_csv)

#### Excel Files

In [None]:
# Output
df_gene_go.to_excel('excel_output.xlsx')
# Input
pd.read_excel('excel_output.xlsx')

#### TSV Files

In [None]:
# Output 
df_gene_go.to_csv('tsv_output.tsv', sep="\t")
# Input
pd.read_csv('tsv_output.tsv', sep="\t").tail()

#### Clipboard

#### Copy

In [None]:
df_gene_go.to_clipboard()

In [None]:
# Paste here


#### Paste

In [None]:
pd.read_clipboard()

#### Indexing/Exploring/Manipulating in Pandas

Standard `'[]'` indexing/slicing can be used, as well as `'.'` access by column name.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label/name-based
2. `.iloc` -> primarily integer-based
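One practical difference between the two is how slices end: `.loc` label slices include the stop label, while `.iloc` position slices exclude the stop position, as ordinary Python slices do. A small sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(4, 3),
                  index=["row0", "row1", "row2", "row3"],
                  columns=["col0", "col1", "col2"])

by_label = df.loc["row1":"row2", "col0":"col1"]   # label slices include the stop
by_position = df.iloc[1:3, 0:2]                   # position slices exclude the stop

print(by_label)
print(by_label.equals(by_position))
```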

In [None]:
# Create some data to work with
row_labels = ["row"+str(i) for i in range(10)]
col_labels = ["col"+str(i) for i in range(6)]

""" 
Create a DataFrame from a 10 by 6 array with values from 1 to 60, 
add the row_labels and col_labels we just created 
"""
data_array = np.arange(1,61).reshape(10,6)
data_array
df_example = pd.DataFrame(data_array,row_labels,col_labels)
df_example


Additionally, Pandas allows you to do random sampling from the dataframe

In [None]:
df_small = df_example.sample(n=5)
df_small

In [None]:
### 

df_example

#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
df_example[:3]

#### `'.'` operators and a column name can select a **specific named** column

In [None]:
df_example.col1

`'.'` operator selected columns are now just a `pd.Series` and can be `'[]'` sliced on further

In [None]:
df_example.col1[:3]

However, if the column name does not work as a `'.'` attribute (for example, it contains a space or clashes with a method name), you can use `'[]'` selection as well

In [None]:
df_example["col1"][:3]

In [None]:
### 

df_example

Named rows can be selected by a range of the names

In [None]:
df_example['row1':'row3']

#### Selection <b>BY NAME</b>: the `.loc` method

```python
# .loc syntax
df.loc[row indexer, column indexer]
```

<b>A slice of specific items (based on label) - start and stop included</b>

In [None]:
df_example.loc['row3':'row5', 'col2':'col4']

#### Boolean indexing

In [None]:
df_example.loc[df_example.col2 < 30]

#### Selection <b>BY POSITION</b>: the `.iloc` method

<b>A slice of specific items (based on position)</b>

In [None]:
df_example.iloc[:3,2]

In [None]:
# we can use a list of indices

df_example.iloc[:3,[0,1,3]]

#### Quick Exploration of the data

In [None]:
df_example.col1.describe()

In [None]:
df_example.col1.aggregate(sum)


In [None]:
df_example[df_example > 50] = np.nan

In [None]:
df_example

In [None]:
print('Any missing values?')
# Checks missing values on a column (pd.Series)
df_example.col1.hasnans


#### Object Manipulation

In [None]:
df_example

In [None]:
df_example.loc[df_example.col2 > 30, ['col2',"col4"]] = 0 


In [None]:
df_example

Replace all the 0 values in df_example with 200.
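One possible approach, sketched on a small stand-in DataFrame (the data and the name `df_demo` are made up for this example), is `DataFrame.replace`; boolean-mask assignment gives the same result in place.

```python
import pandas as pd

# Small stand-in for df_example (values are made up)
df_demo = pd.DataFrame({"col2": [3, 0, 0], "col4": [5, 0, 0], "col5": [6, 12, 18]})

# replace() returns a new DataFrame; the original is left unchanged
df_replaced = df_demo.replace(0, 200)
print(df_replaced)

# Boolean-mask assignment changes the DataFrame in place
df_demo[df_demo == 0] = 200
print(df_demo)
```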

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris

Answer the following questions by writing code:
* How many rows and columns does the dataset have?
* How many flowers with petal length > 4 and petal width > 2 are there?
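A sketch of how these questions might be approached, using a tiny stand-in DataFrame instead of the downloaded file. The column names `petal_length` and `petal_width` are assumptions; check `df_iris.columns` for the actual names.

```python
import pandas as pd

# Tiny stand-in for df_iris; the column names are assumed, check df_iris.columns
df_demo = pd.DataFrame({
    "petal_length": [1.4, 4.7, 5.1, 6.0],
    "petal_width": [0.2, 1.4, 2.3, 2.5],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# Combine conditions with & and wrap each one in parentheses
mask = (df_demo["petal_length"] > 4) & (df_demo["petal_width"] > 2)
n_matching = int(mask.sum())
print(n_matching)
```

Summing a boolean Series counts its `True` values, which avoids building the filtered DataFrame when only the count is needed.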



https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
