### NumPy - Numerical Python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width="100">

https://numpy.org

Its power comes from the <b>N-dimensional array object</b>.

NumPy (conventionally imported as `np`) is a *lower*-level numerical computing library.

While you can use it directly, much of its power comes from the packages built on top of it:
* Pandas (the name derives from *Pan*el *Da*ta)
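
To make the N-dimensional array object a little more concrete, here is a minimal sketch of a 2-dimensional array and a few of its basic attributes. The names `arr` and `doubled` are illustrative only.

```python
import numpy as np

# A 2-D array: 2 rows, 3 columns, one dtype shared by every element
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)   # number of dimensions
print(arr.shape)  # size along each dimension
print(arr.dtype)  # element type (the integer width depends on the platform)

# Arithmetic is applied element-wise, without an explicit Python loop
doubled = arr * 2
print(doubled)
```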


#### NumPy examples

In [4]:
import numpy as np
matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
print(matrix)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [5]:
# col mean
matrix.mean(axis = 0)

array([5.5, 6.5, 7.5])

In [6]:
# row mean
matrix.mean(axis = 1)

array([ 2.,  5.,  8., 11.])

In [7]:
# unique values and counts
matrix = np.array([[ 5,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
uvals, counts = np.unique(matrix, return_counts=True)
print(uvals,counts)

[ 2  3  4  5  6  7  8  9 10 11 12] [1 1 1 2 1 1 1 1 1 1 1]


https://www.w3resource.com/python-exercises/numpy/index.php


Create a matrix of 5 rows and 9 columns with numbers from 1 to 45.
Add 3 to the even values of the array.

In [12]:
matrix = np.arange(1,46).reshape(5,9)
#matrix[matrix%2 == 0] = matrix[matrix%2 == 0] + 3
matrix[matrix%2 == 0] += 3


matrix

array([[ 1,  5,  3,  7,  5,  9,  7, 11,  9],
       [13, 11, 15, 13, 17, 15, 19, 17, 21],
       [19, 23, 21, 25, 23, 27, 25, 29, 27],
       [31, 29, 33, 31, 35, 33, 37, 35, 39],
       [37, 41, 39, 43, 41, 45, 43, 47, 45]])

Normalize the values in the matrix. Divide by the maximum.

In [15]:
matrix/matrix.max()
matrix/np.max(matrix)

array([[0.0212766 , 0.10638298, 0.06382979, 0.14893617, 0.10638298,
        0.19148936, 0.14893617, 0.23404255, 0.19148936],
       [0.27659574, 0.23404255, 0.31914894, 0.27659574, 0.36170213,
        0.31914894, 0.40425532, 0.36170213, 0.44680851],
       [0.40425532, 0.4893617 , 0.44680851, 0.53191489, 0.4893617 ,
        0.57446809, 0.53191489, 0.61702128, 0.57446809],
       [0.65957447, 0.61702128, 0.70212766, 0.65957447, 0.74468085,
        0.70212766, 0.78723404, 0.74468085, 0.82978723],
       [0.78723404, 0.87234043, 0.82978723, 0.91489362, 0.87234043,
        0.95744681, 0.91489362, 1.        , 0.95744681]])

In [16]:
matrix

array([[ 1,  5,  3,  7,  5,  9,  7, 11,  9],
       [13, 11, 15, 13, 17, 15, 19, 17, 21],
       [19, 23, 21, 25, 23, 27, 25, 29, 27],
       [31, 29, 33, 31, 35, 33, 37, 35, 39],
       [37, 41, 39, 43, 41, 45, 43, 47, 45]])

Create a random array (4 by 7) and compute: 
   * the mean of all elements 
   * the mean of the rows  
   * the mean of the columns

In [30]:
matrix = np.random.random((4,7))
#print(matrix)
matrix_mean = np.mean(matrix)
matrix_mean = matrix.mean()
matrix_mean
matrix_rows_mean = np.mean(matrix, 1)
matrix_rows_mean        
matrix_cols_mean = np.mean(matrix, 0)
matrix_cols_mean  

array([0.61195723, 0.65998198, 0.50729199, 0.41262225, 0.35839196,
       0.52819673, 0.3594848 ])

In [37]:
#Given the following 2-dimensional array, add 1 to the first two rows, 2 to the next two rows, and 3 to the last row.
matrix = np.arange(1,46).reshape(5,9)
print(matrix)
matrix[0,] += 1
matrix[1,] += 1
matrix[2,] += 2
matrix[3,] += 2
matrix[4,] += 3
matrix



[[ 1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18]
 [19 20 21 22 23 24 25 26 27]
 [28 29 30 31 32 33 34 35 36]
 [37 38 39 40 41 42 43 44 45]]


array([[ 2,  3,  4,  5,  6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15, 16, 17, 18, 19],
       [21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38],
       [40, 41, 42, 43, 44, 45, 46, 47, 48]])

In [48]:
matrix[[1,2],] 

array([[10, 11, 12, 13, 14, 15, 16, 17, 18],
       [19, 20, 21, 22, 23, 24, 25, 26, 27]])

In [43]:
#Given the following 2-dimensional array, add 1 to the first two rows, 2 to the next two rows, and 3 to the last row.
matrix = np.arange(1,46).reshape(5,9)
array_to_add = np.array([1,1,2,2,3]).reshape(5,1)
array_to_add.shape
array_to_add
matrix + array_to_add

array([[ 2,  3,  4,  5,  6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15, 16, 17, 18, 19],
       [21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38],
       [40, 41, 42, 43, 44, 45, 46, 47, 48]])
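
The addition above works because of NumPy broadcasting: an array of shape (5, 1) is stretched across the 9 columns of the (5, 9) matrix. As a rough sketch, the same result can be built by doing the stretching explicitly with `np.repeat`. The names `broadcast_sum` and `explicit` are illustrative only.

```python
import numpy as np

matrix = np.arange(1, 46).reshape(5, 9)                  # shape (5, 9)
array_to_add = np.array([1, 1, 2, 2, 3]).reshape(5, 1)   # shape (5, 1)

# Broadcasting stretches the single column across all 9 columns
broadcast_sum = matrix + array_to_add

# The same result with the stretching done explicitly
explicit = matrix + np.repeat(array_to_add, 9, axis=1)

print(np.array_equal(broadcast_sum, explicit))  # True
```

Broadcasting avoids materializing the repeated array, which matters for large inputs.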

In [49]:
#Given a set of Gene Ontology (GO) terms and the genes that are associated with these terms find the gene 
#that is associated with the most GO terms

go_terms=np.array(["cellular response to nicotine",
                   "cellular response to hypoxia",
                   "cellular response to lipid"])
genes=np.array(["BAD","KCNJ11","MSX1","CASR","ZFP36L1"])

assoc_matrix = np.array([[1,1,0,1,0],[1,0,0,1,1],[1,0,0,0,0]])

print(assoc_matrix)

[[1 1 0 1 0]
 [1 0 0 1 1]
 [1 0 0 0 0]]


In [60]:
col_sum = assoc_matrix.sum(axis = 0)
max_col_sum = max(col_sum)
index = np.where(col_sum == max_col_sum)[0]

#print(genes[index])

np.where(assoc_matrix.sum(axis = 0) == max(assoc_matrix.sum(axis = 0)))

new_array = np.arange(1,6)
new_array[new_array > 2]
genes[new_array > 2]

array(['MSX1', 'CASR', 'ZFP36L1'], dtype='<U7')
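
One possible way to finish the Gene Ontology exercise, sketched with `argmax` and a boolean mask. The variable names `terms_per_gene`, `top_gene`, and `tied_genes` are illustrative and not part of the original notebook.

```python
import numpy as np

genes = np.array(["BAD", "KCNJ11", "MSX1", "CASR", "ZFP36L1"])
assoc_matrix = np.array([[1, 1, 0, 1, 0],
                         [1, 0, 0, 1, 1],
                         [1, 0, 0, 0, 0]])

# Number of GO terms per gene (sum down each column)
terms_per_gene = assoc_matrix.sum(axis=0)

# argmax returns the position of the first maximum
top_gene = genes[terms_per_gene.argmax()]

# A boolean mask keeps every gene tied for the maximum
tied_genes = genes[terms_per_gene == terms_per_gene.max()]

print(top_gene, tied_genes)
```

The mask version is the safer choice when several genes may share the highest count, since `argmax` reports only the first one.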

### Pandas

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

#### How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array holds elements of a single type.
* A Pandas DataFrame can hold many types.
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
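
A small sketch of the difference, assuming made-up data: NumPy settles on one common dtype for the whole array, while a DataFrame tracks a dtype per column. The names `mixed` and `table` are illustrative only.

```python
import numpy as np
import pandas as pd

# Mixing numbers and text forces NumPy to pick one common dtype (a string type here)
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype.kind)  # 'U' means unicode string

# A DataFrame keeps a separate dtype per column
table = pd.DataFrame({"count": [1, 2, 3],
                      "score": [0.5, 1.5, 2.5],
                      "label": ["a", "b", "c"]})
print(table.dtypes)
```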

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Pandas is built on top of NumPy, so it is useful, though not necessary, to import NumPy at the same time.

In [61]:
import numpy as np
import pandas as pd

#### Basic data structure overview

For a more thorough dive into the different data structures, see [the pandas data structures documentation](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro).

The data structures of interest are:
1. `pd.Series`
2. `pd.DataFrame`

#### Series

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: NumPy/Python-based data types
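
To illustrate the note that index labels do not have to be unique, here is a minimal sketch with invented values. The name `expr` and the numbers are purely illustrative.

```python
import pandas as pd

# Two measurements share the label "EGFR"
expr = pd.Series([2.5, 3.1, 10.2], index=["EGFR", "EGFR", "IL6"])

print(expr.index.is_unique)  # False
print(expr.loc["EGFR"])      # a Series holding both EGFR values
print(expr.loc["IL6"])       # a single scalar value
```

A repeated label returns every matching row, so code that expects a scalar may want to check `index.is_unique` first.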

#### DataFrame

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types

#### Series

#### From a Python list

In [63]:
list_var = [4,5,6,7]
series_var = pd.Series(list_var)
series_var

0    4
1    5
2    6
3    7
dtype: int64

In [64]:
dir(series_var)

['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_prepare__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 ...]

```python
import types
def get_attr_list(obj):
    attributes = []
    for i in dir(obj):
        if not i.startswith('_'):
            if not isinstance(getattr(obj,i), types.MethodType):
                attributes.append(i)
    return attributes

get_attr_list(series_var)
```

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']

We can quickly build new data structures from the attributes of existing ones (for example `.values` and `.index`), as long as those attributes are appropriate.

#### Assigning a meaningful index

In [65]:
labels = ["gene","protein","miRNA","metabolites"]
series_var_named = pd.Series(series_var.values, index=labels)
series_var_named

gene           4
protein        5
miRNA          6
metabolites    7
dtype: int64

#### From dictionary

In [66]:
dict_var = dict(zip(series_var_named.index, series_var_named.values))
dict_var

{'gene': 4, 'protein': 5, 'miRNA': 6, 'metabolites': 7}

In [67]:
series_var2 = pd.Series(dict_var)
series_var2

gene           4
protein        5
miRNA          6
metabolites    7
dtype: int64

In [68]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
new_series = pd.Series(dict_var)
new_series

EGFR     2.5
IL6     10.2
BRAF     6.7
ABL      5.3
dtype: float64

#### Setting the dtype

In [69]:
print(series_var2)
series_var3 = pd.Series(series_var2, dtype=np.float16)
series_var3

gene           4
protein        5
miRNA          6
metabolites    7
dtype: int64


gene           4.0
protein        5.0
miRNA          6.0
metabolites    7.0
dtype: float16

#### From numpy array

In [70]:
series_var4 = pd.Series(np.array([1,2,3,4]), index=series_var3.index, dtype=np.float64)
series_var4

gene           1.0
protein        2.0
miRNA          3.0
metabolites    4.0
dtype: float64

#### Naming the series

In [71]:
series_var5 = pd.Series(series_var4, name='OmicsSeries')
series_var5

gene           1.0
protein        2.0
miRNA          3.0
metabolites    4.0
Name: OmicsSeries, dtype: float64

#### DataFrame

A DataFrame is a collection of `pd.Series` that share an index, and it works in much the same way as a `pd.Series`. <br>
Like `pd.Series`, it is built on top of `np.ndarray`.

In [72]:
names = ["Cristina", "Ana", "Dana"]
scores = [80, 90, 78]
df_grades = pd.DataFrame({'name':names, 'score':scores})
df_grades

|   | name | score |
|---|------|-------|
| 0 | Cristina | 80 |
| 1 | Ana | 90 |
| 2 | Dana | 78 |


In [73]:
df_grades.sort_values(by='score',ascending=False).iloc[:]

|   | name | score |
|---|------|-------|
| 1 | Ana | 90 |
| 0 | Cristina | 80 |
| 2 | Dana | 78 |


#### From list

In [74]:
list_var = [4,5,6,7]

list2_var = [list_var, list_var[::-1]]
list2_var

[[4, 5, 6, 7], [7, 6, 5, 4]]

In [75]:
df_list = pd.DataFrame(list2_var)
df_list

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 4 | 5 | 6 | 7 |
| 1 | 7 | 6 | 5 | 4 |


In [78]:
df_list.size

8

Again, let's look at the attributes of a dataframe

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

#### Add an index

In [80]:
df_list_rn = pd.DataFrame(df_list.values, index=["sample1","sample2"])
df_list_rn

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| sample1 | 4 | 5 | 6 | 7 |
| sample2 | 7 | 6 | 5 | 4 |


#### Add column names

Because this is a dataframe, we can add both an index ***and*** column names

In [82]:
labels = ["gene","protein","miRNA","metabolites"]
df_var2 = pd.DataFrame(df_list_rn.values, index=df_list_rn.index, columns=labels)
df_var2

|   | gene | protein | miRNA | metabolites |
|---|------|---------|-------|-------------|
| sample1 | 4 | 5 | 6 | 7 |
| sample2 | 7 | 6 | 5 | 4 |


In [87]:
df_list_rn.index

Index(['sample1', 'sample2'], dtype='object')

In [83]:
df_var2.dtypes

gene           int64
protein        int64
miRNA          int64
metabolites    int64
dtype: object

#### Add a dtype

In [84]:
df_var3 = pd.DataFrame(df_var2, dtype=np.float64)
df_var3

|   | gene | protein | miRNA | metabolites |
|---|------|---------|-------|-------------|
| sample1 | 4.0 | 5.0 | 6.0 | 7.0 |
| sample2 | 7.0 | 6.0 | 5.0 | 4.0 |


In [85]:
df_var3.dtypes

gene           float64
protein        float64
miRNA          float64
metabolites    float64
dtype: object

As seen, if you set a `dtype` for the `DataFrame`, you set it for ***all*** of the elements. <br>
You can also set the column dtypes individually.

In [86]:
df_var3.dtypes
df_var3.gene = df_var3.gene.astype('int64')
df_var3.dtypes



gene             int64
protein        float64
miRNA          float64
metabolites    float64
dtype: object
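
As a hedged sketch of setting several column dtypes in one call: `astype` also accepts a dict that maps column names to dtypes. The names `df_example` and `converted` are illustrative and stand in for the notebook's own DataFrame.

```python
import pandas as pd

df_example = pd.DataFrame({"gene": [4.0, 7.0],
                           "protein": [5.0, 6.0],
                           "miRNA": [6.0, 5.0]})

# astype returns a new DataFrame; columns not listed keep their dtype
converted = df_example.astype({"gene": "int64", "protein": "float32"})
print(converted.dtypes)
```

Because `astype` returns a copy, the original `df_example` keeps its `float64` columns unless the result is assigned back.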

#### From a dictionary

In [94]:
numbers = [1, 2, 3, 4]
letters = ['A', 'B', 'C', 'D']
# The second argument is the index; iterating over the dict yields its keys (1 to 4)
pd.DataFrame(letters, index=dict(zip(numbers,letters)))

|   | 0 |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |


In [91]:
dict2_var = dict(zip(df_var3.index,list2_var))
dict2_var

{'sample1': [4, 5, 6, 7], 'sample2': [7, 6, 5, 4]}

In [92]:
df_var4 = pd.DataFrame(dict2_var, index = df_var3.columns).T
df_var4

|   | gene | protein | miRNA | metabolites |
|---|------|---------|-------|-------------|
| sample1 | 4 | 5 | 6 | 7 |
| sample2 | 7 | 6 | 5 | 4 |


In [95]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3, "MYC":5.5}
series_var2 = pd.Series(dict_var)
series_var2

EGFR     2.5
IL6     10.2
BRAF     6.7
ABL      5.3
MYC      5.5
dtype: float64

#### From numpy array

In [96]:
# Create a 2 by 4 array with the even values from 2 to 16 (step of 2)

arr_var = np.arange(2,17,2).reshape(2,4)
arr_var

array([[ 2,  4,  6,  8],
       [10, 12, 14, 16]])

In [99]:
df_var5 = pd.DataFrame(arr_var)
df_var5

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 2 | 4 | 6 | 8 |
| 1 | 10 | 12 | 14 | 16 |


In [100]:
df_var5 = pd.DataFrame(arr_var, index = df_var4.index, columns = df_var4.columns)
df_var5

|   | gene | protein | miRNA | metabolites |
|---|------|---------|-------|-------------|
| sample1 | 2 | 4 | 6 | 8 |
| sample2 | 10 | 12 | 14 | 16 |


#### From `pd.Series`

#### Row-wise (`append`)

In [None]:
# Create pd.Series
list_var = [4,5,6,7]
series_var = pd.Series(list_var)

In [None]:
series_var = series_var.rename('original')
print(series_var)
series_var_rev = pd.Series(series_var.values[::-1], name='reversed', dtype=np.float16)
series_var_rev

In [None]:
# Create reversed pd.Series
series_var_rev = pd.Series(series_var.values[::-1], name='reversed', dtype=np.float16)


In [None]:
# create original data frame
df_rows = pd.DataFrame(series_var).T


In [None]:
# Now add on a row
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
pd.concat([df_rows, series_var_rev.to_frame().T])

#### Column-wise (`join`/`concat`)

In [None]:
# Create a pd.DataFrame from a 4 by 3 np.array with values from 4 up to (but not including) 16
df_columns = pd.DataFrame(np.arange(4, 16).reshape(4, 3))



In [None]:
df_columns.columns = ["col1", "col2", "col3"]


#### `join`

In [None]:
df_join = df_columns.join(series_var_rev)


In [None]:
df_join.dtypes

#### `concat`

In [None]:
# Same size
pd.concat([df_columns, series_var_rev], axis=1)

In [None]:
# Unequal size
pd.concat([df_columns, series_var_rev[:3]], axis=1)
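
A minimal sketch of what happens in the unequal-size case: `pd.concat` with `axis=1` aligns on the index, and rows without a partner are filled with `NaN`. The names `df_example`, `short`, and `combined` are illustrative stand-ins for the notebook's own objects.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame(np.arange(4, 16).reshape(4, 3),
                          columns=["col1", "col2", "col3"])
short = pd.Series([7.0, 6.0, 5.0], name="reversed")  # only index 0, 1, 2

combined = pd.concat([df_example, short], axis=1)
print(combined)

# Row 3 has no partner in `short`, so it is filled with NaN
print(combined["reversed"].isna().tolist())
```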

#### I/O in Pandas

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas handles much of this for you.

#### CSV Files

#### Output

Saving a `DataFrame` to disk is just as easy as reading one.

In [None]:
df_columns.to_csv('dataframe_data.csv')

#### Input

In [None]:
pd.read_csv('dataframe_data.csv')
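
One detail worth knowing about the round trip: `to_csv` writes the index as the first column, so a plain `read_csv` brings it back as an extra unnamed column. The sketch below uses an in-memory buffer instead of a file, and the names `df_example`, `plain`, and `restored` are illustrative only.

```python
import io
import pandas as pd

df_example = pd.DataFrame({"col1": [4, 7], "col2": [5, 8]},
                          index=["row0", "row1"])

# With no path given, to_csv returns the CSV text as a string
csv_text = df_example.to_csv()

# Without index_col, the saved index comes back as an extra column
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# index_col=0 restores the first column as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.index.tolist())
```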

#### Excel Files

#### Output

In [None]:
df_columns.to_excel('excel_output.xlsx')

#### Input

In [None]:
pd.read_excel('excel_output.xlsx').head()

#### TSV Files

#### Output

In [None]:
df_columns.to_csv('tsv_output.tsv', sep="\t")

#### Input

In [None]:
pd.read_csv('tsv_output.tsv', sep="\t").tail()

#### Clipboard

#### Copy

In [None]:
df_columns.to_clipboard()

In [None]:
# Paste here


#### Paste

In [None]:
pd.read_clipboard()

#### Indexing/Exploring/Manipulating in Pandas

Standard `'[]'` indexing/slicing can be used, as well as `'.'` attribute access.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label-based
2. `.iloc` -> primarily integer-based

Additionally, Pandas allows you to draw a random sample of rows from a DataFrame with `.sample`.

In [None]:
# Create some data to work with
row_labels = ["row"+str(i) for i in range(10)]
col_labels = ["col"+str(i) for i in range(6)]

index_example = pd.DataFrame(np.arange(1,61).reshape(10,6), index = row_labels, columns = col_labels)

small_idx = index_example.sample(n=5)
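
Because `sample` is random, the rows in `small_idx` change from run to run. A hedged sketch of making the sample reproducible with the `random_state` parameter, using the illustrative names `df_example`, `first`, `second`, and `half`:

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame(np.arange(1, 61).reshape(10, 6))

# The same seed gives the same rows on every run
first = df_example.sample(n=5, random_state=0)
second = df_example.sample(n=5, random_state=0)
print(first.equals(second))

# frac samples a proportion of the rows instead of a fixed count
half = df_example.sample(frac=0.5, random_state=0)
print(len(half))
```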


#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
small_idx[:3]

#### `'.'` operators and a column name can select a **specific named** column

In [None]:
small_idx.col1

`'.'` operator selected columns are now just a `pd.Series` and can be `'[]'` sliced on further

In [None]:
small_idx.col1[:3]

If a column name does not work as an attribute (for example, it contains spaces or clashes with a method name), use `'[]'` selection instead

In [None]:
small_idx['col1'][:3]

Rows can also be sliced by their labels; unlike positional slices, label slices include the stop label

In [None]:
index_example[:4]

In [None]:
index_example['row1':'row3']

#### Selection by label: the `.loc` method

```python
# .loc syntax
small_idx.loc[row indexer, column indexer]
```

#### A slice of specific items (based on label) - start and stop included

In [None]:
index_example.loc['row3':'row5', 'col2':'col4']

#### Boolean indexing

In [None]:
index_example.loc[index_example.col2 < 30]
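
Boolean conditions can also be combined. A minimal sketch, assuming the same 10 by 6 layout as `index_example`: each condition goes in its own parentheses and they are joined with `&` (and) or `|` (or). The names `df_example`, `mask`, and `selected` are illustrative.

```python
import numpy as np
import pandas as pd

row_labels = ["row" + str(i) for i in range(10)]
col_labels = ["col" + str(i) for i in range(6)]
df_example = pd.DataFrame(np.arange(1, 61).reshape(10, 6),
                          index=row_labels, columns=col_labels)

# Parentheses are required because & binds tighter than < and >
mask = (df_example.col2 < 30) & (df_example.col4 > 10)

# Rows come from the mask, columns from the list of labels
selected = df_example.loc[mask, ["col2", "col4"]]
print(selected)

# Summing a boolean Series counts the rows that match
print(mask.sum())
```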

#### Selection by position: the `.iloc` method

#### A slice of specific items (based on position)

In [None]:
small_idx.iloc[:3,2]

#### A slice of rows and a list of specific columns (based on position)

In [None]:
small_idx.iloc[:3,[0,1,3]]
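
A short side-by-side sketch of the two indexers, assuming the same layout as `index_example`: a `.loc` slice includes its stop label, while an `.iloc` slice excludes its stop position, as in ordinary Python slicing. The names `df_example`, `by_label`, and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

row_labels = ["row" + str(i) for i in range(10)]
col_labels = ["col" + str(i) for i in range(6)]
df_example = pd.DataFrame(np.arange(1, 61).reshape(10, 6),
                          index=row_labels, columns=col_labels)

# Label-based: 'row5' and 'col4' are included
by_label = df_example.loc["row3":"row5", "col2":"col4"]

# Position-based: positions 5 and 4 are excluded
by_position = df_example.iloc[3:5, 2:4]

print(by_label.shape)     # (3, 3)
print(by_position.shape)  # (2, 2)
```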

### Quick Exploration of the data

In [None]:
index_example.col1.describe()

In [None]:
print('SUM: {}'.format(index_example.col1.aggregate('sum')))
print('Any missing values: {}'.format(index_example.col1.hasnans))

### Object Manipulation

In [None]:
small_idx

In [None]:
small_idx.loc[small_idx.col2 > 40, 'col2'] = 0 


In [None]:
small_idx

Replace all the 0 values in small_idx with 12.

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris

Answer the following questions by writing code:
* How many rows and columns does the dataset have?
* How many flowers with petal length greater than 4 and petal width greater than 2 are there?



https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
