### NumPy - Numerical Python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width="100">

https://numpy.org

Its power comes from the <b>N-dimensional array object</b>
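
To make the N-dimensional array object concrete, here is a minimal sketch. The array name `arr` and its values are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# A small 2-D array; the values are arbitrary and chosen only for illustration.
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.ndim)   # number of dimensions
print(arr.shape)  # size along each dimension
print(arr.dtype)  # one data type shared by every element

# Arithmetic is applied element-wise, without an explicit Python loop.
doubled = arr * 2
print(doubled)
```

The same attributes (`ndim`, `shape`, `dtype`) are available on arrays of any number of dimensions.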

NumPy (conventionally imported as `np`) is a *lower*-level numerical computing library.

While you can use it directly, it is often used through the packages built on top of it:
* Pandas (from *pan*el *da*ta)


#### NumPy examples

In [None]:
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])


In [None]:
# column means (axis=0 collapses the rows, leaving one value per column)
matrix.mean(axis=0)

In [None]:
# row means (axis=1 collapses the columns, leaving one value per row)
matrix.mean(axis=1)

In [None]:
# unique values and counts
matrix = np.array([[ 5,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
uvals, counts = np.unique(matrix, return_counts=True)
print(uvals,counts)

https://www.w3resource.com/python-exercises/numpy/index.php


Create a matrix of 5 rows and 6 columns with numbers from 1 to 30.
Add 2 to the odd values of the array.

Normalize the values in the matrix: subtract the mean and divide by the standard deviation.

In [None]:
matrix

Create a random array (5 by 3) and compute: 
   * the sum of all elements 
   * the sum of the rows  
   * the sum of the columns

In [None]:
# Given a set of Gene Ontology (GO) terms and the genes that are associated with these terms,
# find the gene that is associated with the most GO terms.

go_terms=np.array(["cellular response to nicotine",
                   "cellular response to hypoxia",
                   "cellular response to lipid"])
genes=np.array(["BAD","KCNJ11","MSX1","CASR","ZFP36L1"])

assoc_matrix = np.array([[1,1,0,1,0],[1,0,0,1,1],[1,0,0,0,0]])

print(assoc_matrix)

### Pandas

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

#### How does pandas work?

Pandas is built on [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array generally holds elements of a single type.
* A pandas DataFrame can hold many types.
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
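
The difference between the two can be sketched as follows. The variable and column names are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Mixing numbers and text in one NumPy array forces a single common dtype:
# every element ends up stored as a string.
mixed = np.array([1, 2.5, "BRAF"])
print(mixed.dtype)

# A DataFrame keeps one dtype per column (column names here are made up).
df = pd.DataFrame({"gene": ["EGFR", "IL6"],
                   "score": [2.5, 10.2],
                   "count": [3, 7]})
print(df.dtypes)
```

Printing `df.dtypes` lists a separate type for each column, whereas `mixed.dtype` reports one type for the whole array.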

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Because pandas is built on NumPy, it is useful (but not necessary) to import NumPy at the same time.

In [None]:
import numpy as np
import pandas as pd

#### Basic data structure overview

For a more thorough dive into the different data structures, feel free to read [this](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) documentation.

The data structures of interest are:
1. `pd.Series`
2. `pd.DataFrame`

#### Series

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types
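
As a small sketch of the three parameters working together (the labels and values below are arbitrary, chosen only to show a repeated index label):

```python
import numpy as np
import pandas as pd

# data: a list; index: labels (one is repeated on purpose); dtype: 32-bit float.
s = pd.Series([4, 5, 6], index=["a", "b", "a"], dtype=np.float32)
print(s)

# A repeated label returns every matching row.
print(s.loc["a"])

# Without an index, pandas numbers the rows 0, 1, 2, ...
default_index = pd.Series([4, 5, 6]).index.tolist()
print(default_index)
```

Selecting the repeated label `"a"` returns a two-element `pd.Series` rather than a single value, which is what a non-unique index means in practice.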

#### DataFrame

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types
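
A minimal sketch that passes all four parameters at once. The row and column labels are placeholders invented for this example.

```python
import numpy as np
import pandas as pd

# data: a 2 by 3 NumPy array holding 0..5
data = np.arange(6).reshape(2, 3)

df = pd.DataFrame(data,
                  index=["row1", "row2"],       # row identifiers
                  columns=["x", "y", "z"],      # column identifiers
                  dtype=np.float64)             # applied to every column
print(df)
print(df.shape)
```

Here `dtype` converts every column to `float64`; leaving it out would keep the integer type of the input array.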

#### Series

#### From a Python list

In [None]:
list_var = [4,5,6,7]
series_var = pd.Series(list_var)


In [None]:
dir(series_var)

```python
import types
def get_attr_list(obj):
    attributes = []
    for i in dir(obj):
        if not i.startswith('_'):
            if not isinstance(getattr(obj,i), types.MethodType):
                attributes.append(i)
    return attributes

get_attr_list(series_var)
```

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']

We can quickly generate additional data structures by using the attributes of existing data structures (so long as they are appropriate)

#### Assigning a meaningful index

In [None]:
labels = ["gene","protein","miRNA","metabolites"]
series_var_named = pd.Series(series_var.values, index=labels)
series_var_named

#### From dictionary

In [None]:
dict_var = dict(zip(series_var_named.index, series_var_named.values))


In [None]:
series_var2 = pd.Series(dict_var)


In [None]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}


#### Setting the dtype

In [None]:
series_var3 = pd.Series(series_var2, dtype=np.float16)


#### From numpy array

In [None]:
series_var4 = pd.Series(np.array([1,2,3,4]), index=series_var3.index, dtype=np.float64)


#### Naming the series

In [None]:
series_var5 = pd.Series(series_var4, name='OmicsSeries')


#### DataFrame

DataFrames are collections of `pd.Series` that share an index, and they work in much the same way as a `pd.Series`. <br>
Like `pd.Series`, a DataFrame builds on `np.ndarray`.

In [None]:
names = ["Cristina", "Ana", "Dana"]
scores = [80, 90, 78]
df_grades = pd.DataFrame({'name':names, 'score':scores})


In [None]:
df_grades.sort_values(by='score',ascending=False).iloc[:]

#### From list

In [None]:
list_var = [4,5,6,7]

list2_var = [list_var, list_var[::-1]]


In [None]:
df_list = pd.DataFrame(list2_var)


Again, let's look at the attributes of a dataframe

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

#### Add an index

In [None]:
df_list_rn = pd.DataFrame(df_list.values, index=["row1","row2"])


#### Add column names

Because this is a dataframe, we can add both an index ***and*** column names

In [None]:
labels = ["gene","protein","miRNA","metabolites"]
df_var2 = pd.DataFrame(df_list_rn.values, index=df_list_rn.index, columns=labels)


In [None]:
df_var2.dtypes

#### Add a dtype

In [None]:
df_var3 = pd.DataFrame(df_var2, dtype=np.float64)


In [None]:
df_var3.dtypes

As seen, if you set a `dtype` for the `DataFrame`, you set it for ***all*** of the elements. <br>
You can also set the column dtypes individually.

In [None]:
df_var3.dtypes
df_var3.gene = df_var3.gene.astype('int64')


#### From a dictionary

In [None]:
numbers = [1, 2, 3, 4]
letters = ['A', 'B', 'C', 'D']
dict(zip(numbers,letters))

In [None]:
dict2_var = dict(zip(df_var3.index,list2_var))


In [None]:
df_var4 = pd.DataFrame(dict2_var, index = df_var3.columns).T


In [None]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3, "MYC":5.5}
series_var2 = pd.Series(dict_var)


#### From numpy array

In [None]:
# Create a 2 by 4 array with even values from 2 to 17 
#arr_var = 


In [None]:
df_var5 = pd.DataFrame(arr_var, index = df_var4.index, columns = df_var4.columns)


#### From `pd.Series`

#### Row-wise (`append`/`concat`)

In [None]:
# Create pd.Series
list_var = [4,5,6,7]
# series_var = 

In [None]:
series_var = series_var.rename('original')
print(series_var)
series_var_rev = pd.Series(series_var.values[::-1], name='reversed', dtype=np.float16)
series_var_rev

In [None]:
# Create reversed pd.Series
series_var_rev = pd.Series(series_var.values[::-1], name='reversed', dtype=np.float16)


In [None]:
# create original data frame
df_rows = pd.DataFrame(series_var).T


In [None]:
# Now add on a row (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
pd.concat([df_rows, series_var_rev.to_frame().T])

#### Column-wise (`join`/`concat`)

In [None]:
# Create a pd.DataFrame from a 4 by 3 np.array with values from 4 to 16 (16 not included)
# df_columns = 



In [None]:
df_columns.columns = ["col1", "col2", "col3"]


#### `join`

In [None]:
df_join = df_columns.join(series_var_rev)


In [None]:
df_join.dtypes

#### `concat`

In [None]:
# Same size
pd.concat([df_columns, series_var_rev], axis=1)

In [None]:
# Unequal size
pd.concat([df_columns, series_var_rev[:3]], axis=1)
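
To see what happens in the unequal-size case, here is a self-contained sketch. The names `left`, `short`, and `extra` are invented for this example and stand in for the notebook's own objects.

```python
import numpy as np
import pandas as pd

# A 4-row frame with the default index 0..3
left = pd.DataFrame(np.arange(4, 16).reshape(4, 3),
                    columns=["col1", "col2", "col3"])

# A shorter Series whose index only covers rows 0, 1, 2
short = pd.Series([7.0, 6.0, 5.0], name="extra")

# Rows are matched on their index labels; labels missing from `short` become NaN.
combined = pd.concat([left, short], axis=1)
print(combined)
```

Because the alignment is done on index labels rather than on position, the last row of the `extra` column is filled with `NaN`.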

#### I/O in Pandas

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas deals with a lot of this.

#### CSV Files

#### Output

You can just as easily save your `DataFrame`s.

In [None]:
df_columns.to_csv('dataframe_data.csv')

#### Input

In [None]:
pd.read_csv('dataframe_data.csv')
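
One detail worth knowing about the round trip: `to_csv` writes the index as the first column, so a plain `read_csv` brings it back as an ordinary column. The sketch below uses an in-memory buffer instead of a file, and the small frame is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"col1": [4, 7], "col2": [5, 8]}, index=["r1", "r2"])

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
df.to_csv(buffer)            # the index is written as an unnamed first column
csv_text = buffer.getvalue()
print(csv_text)

# Without index_col, the saved index comes back as an ordinary column.
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# index_col=0 restores it as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

Passing `index=False` to `to_csv` is the alternative when the index carries no information worth saving.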

#### Excel Files

#### Output

In [None]:
df_columns.to_excel('excel_output.xlsx')

#### Input

In [None]:
pd.read_excel('excel_output.xlsx').head()

#### TSV Files

#### Output

In [None]:
df_columns.to_csv('tsv_output.tsv', sep="\t")

#### Input

In [None]:
pd.read_csv('tsv_output.tsv', sep="\t").tail()

#### Clipboard

#### Copy

In [None]:
df_columns.to_clipboard()

In [None]:
# Paste here


#### Paste

In [None]:
pd.read_clipboard()

#### Indexing/Exploring/Manipulating in Pandas

Standard `'[]'` indexing/slicing can be used, as well as `'.'` methods.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label-based
2. `.iloc` -> primarily integer-based

Additionally, Pandas allows you to do random sampling from the dataframe

In [None]:
# Create some data to work with
row_labels = ["row"+str(i) for i in range(10)]
col_labels = ["col"+str(i) for i in range(6)]

index_example = pd.DataFrame(np.arange(1,61).reshape(10,6), index = row_labels, columns = col_labels)

small_idx = index_example.sample(n=5)


#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
small_idx[:3]

#### `'.'` operators and a column name can select a **specific named** column

In [None]:
small_idx.col1

A column selected with the `'.'` operator is just a `pd.Series` and can be sliced further with `'[]'`

In [None]:
small_idx.col1[:3]

However, if the column name does not work as a `'.'` name (for example, it contains spaces), you can use `'[]'` selection as well

In [None]:
small_idx['col1'][:3]

Named rows can be selected by their names

In [None]:
index_example[:4]

In [None]:
index_example['row1':'row3']

#### Selection by label: the `.loc` method

```python
# .loc syntax
small_idx.loc[row_indexer, column_indexer]
```

#### A slice of specific items (based on label) - start and stop included

In [None]:
index_example.loc['row3':'row5', 'col2':'col4']
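
The "start and stop included" behaviour is specific to label slices. The sketch below contrasts it with a positional slice on a small frame built only for this comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(4, 3),
                  index=["row0", "row1", "row2", "row3"],
                  columns=["col0", "col1", "col2"])

# Label slices include both end labels.
by_label = df.loc["row1":"row2", "col0":"col1"]

# Positional slices follow the usual Python rule: the stop is excluded.
by_position = df.iloc[1:2, 0:1]

print(by_label.shape)
print(by_position.shape)
```

The label slice returns two rows and two columns, while the positional slice with the "same" numbers returns a single cell.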

#### Boolean indexing

In [None]:
index_example.loc[index_example.col2 < 30]
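
Boolean indexing works by first building a True/False value for every row and then keeping the rows marked True. A minimal sketch, using a made-up frame with columns `a` and `b`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 9], "b": [10, 20, 30]})

# The comparison alone produces a boolean Series, one value per row.
mask = df.a > 2
print(mask.tolist())

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
both = df.loc[(df.a > 2) & (df.b < 30)]
print(both)
```

Python's `and`/`or` keywords do not work element-wise on a `pd.Series`, which is why `&` and `|` are used here.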

#### Selection by position: the `.iloc` method

#### A slice of specific items (based on position)

In [None]:
small_idx.iloc[:3,2]

#### A slice of rows with a list of specific columns (based on position)

In [None]:
small_idx.iloc[:3,[0,1,3]]

### Quick Exploration of the data

In [None]:
index_example.col1.describe()

In [None]:
print('SUM: {}'.format(index_example.col1.aggregate('sum')))
print('Any missing values: {}'.format(index_example.col1.hasnans))

### Object Manipulation

In [None]:
small_idx

In [None]:
small_idx.loc[small_idx.col2 > 40, 'col2'] = 0 


In [None]:
small_idx

Replace all the 0 values in small_idx with 12.

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris

Answer the following questions by writing code:
* How many rows and columns does the dataset have?
* How many flowers with petal length greater than 4 and petal width greater than 2 are there?



https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
