### Pandas

[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

#### How does pandas work?

Pandas is built on [NumPy](http://www.numpy.org/) and therefore leverages NumPy's C-level speed for its data analysis.

* A plain NumPy array holds elements of a single data type.
* A pandas `DataFrame` can hold many types, one per column.
* Think of a table where each column can be whatever type you want, so long as every item in the column is that same type.
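
The difference between the two can be sketched with a small example. The names `mixed_array` and `table` and the values in them are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array has exactly one dtype; mixed numeric input is
# coerced to a common type (here the int values become floats).
mixed_array = np.array([1, 2.5, 3])
print(mixed_array.dtype)  # float64

# A DataFrame keeps one dtype per column, so text, integers, and floats
# can sit side by side in the same table.
table = pd.DataFrame({
    "gene": ["EGFR", "IL6", "BRAF"],
    "count": [3, 4, 5],
    "score": [2.5, 10.2, 6.7],
})
print(table.dtypes)
```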

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Because pandas is built on NumPy, it is useful, though not required, to import NumPy at the same time.

In [None]:
import numpy as np
import pandas as pd

#### 1. `pd.Series`

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types
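
As a minimal sketch of how the three arguments interact (the variable names and values below are illustrative, not part of the exercises):

```python
import pandas as pd

# data + index + dtype: the integer values are stored as floats
expression = pd.Series(data=[3, 4, 5],
                       index=["gene", "protein", "miRNA"],
                       dtype="float64")
print(expression["protein"])  # 4.0

# A scalar is broadcast to every label in the index
constant = pd.Series(data=0, index=["a", "b", "c"])

# Index labels do not have to be unique; selecting a repeated label
# returns every matching entry as a Series
repeated = pd.Series([1, 2, 3], index=["a", "a", "b"])
print(repeated["a"])
```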

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']

#### From a Python list

In [None]:
labels = ["gene","protein","miRNA","metabolites"]
values = [3,4,5,6]
series_named_val = pd.Series(data = values, index=labels)


#### From dictionary

In [None]:
dict_var = dict(zip(labels, values))
pd.Series(dict_var)

In [None]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
# Create new series
# new_series =


In [None]:
# idxmax takes no value argument; it returns the label of the largest value
new_series.idxmax()
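
One possible way to work through the cells above, shown as a sketch. The name `expression_levels` is hypothetical and stands in for `new_series`; `idxmax()` is called without a value and returns the label of the largest entry.

```python
import pandas as pd

# Build a Series straight from a dictionary: keys become the index
expression_levels = pd.Series({"EGFR": 2.5, "IL6": 10.2, "BRAF": 6.7, "ABL": 5.3})

top_gene = expression_levels.idxmax()   # label of the maximum value
top_value = expression_levels.max()     # the maximum value itself
print(top_gene, top_value)  # IL6 10.2

# Boolean filtering works on a Series as well
above_six = expression_levels[expression_levels > 6]
print(above_six)
```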

In [None]:
# Explore Series attributes and methods



#### 2. `pd.DataFrame`

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types
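
A short sketch of the four arguments used together; the sample and gene labels are invented for the example.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    data=np.array([[1, 2], [3, 4], [5, 6]]),    # 3 rows x 2 columns
    index=["sample1", "sample2", "sample3"],    # row labels
    columns=["geneA", "geneB"],                 # column labels
    dtype="float64",                            # force every column to float
)
print(counts.shape)                      # (3, 2)
print(counts.loc["sample2", "geneB"])    # 4.0
```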

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

In [None]:
correlation_array = np.arange(40,52).reshape(3,4)
genes_rows = ["HER2","PIK3CA", "BRAF"]
genes_cols = ["HER1","EGFR", "IL6", "INSR"]
df_gene_correlation = pd.DataFrame(correlation_array, genes_rows, genes_cols)


In [None]:
# Explore DataFrame attributes and methods

df_gene_correlation.T

In [None]:
df_gene_correlation.sort_values(by='EGFR',ascending=False)

In [None]:
# Row-wise mean (axis=1); the string name avoids warnings in recent pandas
df_gene_correlation.aggregate("mean", axis=1)

In [None]:
df_gene_correlation.size

In [None]:
df_gene_correlation.index

In [None]:
df_gene_correlation.dtypes

In [None]:
'''
Create a 4 by 5 array of evenly spaced values from 20 up to (but not including) 80, with a step of 3
Create a list with row names: Gene1, Gene2 ...
Create a list with column names: GO_Term1, GO_Term2 ...
Create a DataFrame from the array created with the respective 
row names and colnames from the lists
'''

df_gene_go =
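
One possible solution, offered as a sketch rather than the only answer. It assumes the range is half-open (80 excluded), which gives exactly the 20 values a 4 by 5 array needs. The name `df_gene_go_sketch` is used so the exercise variable is left for you to define.

```python
import numpy as np
import pandas as pd

# 20, 23, ..., 77 -> 20 values, reshaped to 4 rows x 5 columns
go_array = np.arange(20, 80, 3).reshape(4, 5)

row_names = ["Gene" + str(i) for i in range(1, 5)]      # Gene1 .. Gene4
col_names = ["GO_Term" + str(i) for i in range(1, 6)]   # GO_Term1 .. GO_Term5

df_gene_go_sketch = pd.DataFrame(go_array, index=row_names, columns=col_names)
print(df_gene_go_sketch)
```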

#### From `pd.Series`

In [None]:
# Create pd.Series from the list and set the name "new_row"
numbers_list = list(range(4,9))
numbers_series =

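#### Row-wise (`pd.concat`, formerly `append`)

In [None]:
# Now add on a row. DataFrame.append was removed in pandas 2.0,
# so turn the Series into a one-row frame and concatenate instead.
pd.concat([df_gene_go, numbers_series.to_frame().T])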
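
Adding a row works cleanly when the Series index matches the DataFrame's column labels. The sketch below uses a tiny made-up frame (`small_go`) and `pd.concat`, which is available in all current pandas versions.

```python
import pandas as pd

small_go = pd.DataFrame({"GO_Term1": [20, 35], "GO_Term2": [23, 38]},
                        index=["Gene1", "Gene2"])

# The Series index lines up with the columns; its name becomes the row label
new_row = pd.Series([4, 5], index=small_go.columns, name="new_row")

# to_frame().T converts the Series into a 1-row DataFrame
extended = pd.concat([small_go, new_row.to_frame().T])
print(extended)
```

If the Series index does not match the columns, pandas aligns on labels and creates new columns filled with `NaN` instead.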

#### Column-wise (`join`/`concat`)

#### `join`

In [None]:
df_gene_go

In [None]:
numbers_series1 = numbers_series.rename("new_column")


In [None]:
#different size
df_gene_go.join(numbers_series1)

#### `concat`

In [None]:
# Same size
pd.concat([df_gene_go, numbers_series1[:-1]], axis=1)

In [None]:
# Unequal size
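
For the unequal-size case, a sketch on made-up data: `pd.concat` with `axis=1` aligns on the row index, so rows missing from one input are filled with `NaN` by default, or dropped when `join="inner"` is passed.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["r1", "r2", "r3"])
right = pd.Series([10, 20], index=["r1", "r2"], name="b")

# Default outer join: row r3 is kept and gets NaN in column b
combined = pd.concat([left, right], axis=1)
print(combined)

# Inner join: only rows present in both inputs survive
inner = pd.concat([left, right], axis=1, join="inner")
print(inner)
```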


#### I/O in Pandas

One of the most common reasons people use pandas is to load data without having to deal with file I/O, delimiters, and type conversion by hand. Pandas handles most of this for you.

#### CSV Files

#### Output

You can easily save a `DataFrame` to disk.

In [None]:
df_columns.to_csv('dataframe_data.csv')

#### Input

You can just as easily load data from a file into a `DataFrame`.

In [None]:
pd.read_csv('dataframe_data.csv')
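
One detail worth knowing about the round trip: `to_csv` writes the row index as the first column, and `read_csv` does not treat it as an index unless asked. The sketch below uses an in-memory buffer instead of a file and a small made-up frame.

```python
import io
import pandas as pd

frame = pd.DataFrame({"col0": [1, 2], "col1": [3, 4]}, index=["row0", "row1"])

buffer = io.StringIO()
frame.to_csv(buffer)          # the index is written as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)   # index comes back as a column named "Unnamed: 0"
print(naive.columns.tolist())

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # use the first column as the index
print(restored)
```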

#### Excel Files

In [None]:
# Output
df_columns.to_excel('excel_output.xlsx')
# Input
pd.read_excel('excel_output.xlsx')

#### TSV Files

In [None]:
# Output 
df_columns.to_csv('tsv_output.tsv', sep="\t")
# Input
pd.read_csv('tsv_output.tsv', sep="\t").tail()

#### Clipboard

#### Copy

In [None]:
df_columns.to_clipboard()

In [None]:
# Paste here


#### Paste

In [None]:
pd.read_clipboard()

#### Indexing/Exploring/Manipulating in Pandas

Standard `'[]'` indexing/slicing can be used, as well as `'.'` attribute access.

There are 2 pandas-specific methods for indexing:
1. `.loc` -> primarily label-based
2. `.iloc` -> primarily integer-based

In [None]:
# Create some data to work with
row_labels = ["row"+str(i) for i in range(10)]
col_labels = ["col"+str(i) for i in range(6)]

""" 
Create a DataFrame from a 10 by 6 array with values from 1 to 60, 
add the row_labels and col_labels we just created 
"""
df_example = 


Additionally, pandas lets you draw a random sample of rows from a DataFrame.

In [None]:
df_small = df_example.sample(n=5)
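
Because `sample` is random, each run returns different rows. A sketch of making the draw reproducible with the `random_state` argument, on a made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(1, 21).reshape(10, 2), columns=["col0", "col1"])

# The same seed always yields the same rows
first = frame.sample(n=5, random_state=42)
second = frame.sample(n=5, random_state=42)
print(first.equals(second))  # True

# frac samples a proportion of the rows instead of a fixed count
half = frame.sample(frac=0.5, random_state=0)
print(len(half))  # 5
```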

#### `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
df_small[:3]

#### `'.'` operators and a column name can select a **specific named** column

In [None]:
df_small.col1

Columns selected with the `'.'` operator are just `pd.Series` objects and can be sliced further with `'[]'`.

In [None]:
df_small.col1[:3]

However, if it is a named column that doesn't fit well as a `'.'` name, you can use `'[]'` selection as well

In [None]:
df_small["col3"][:3]
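
Two typical cases where a column name does not work with `'.'` access are names containing spaces and names that collide with an existing DataFrame attribute. A sketch with invented column names:

```python
import pandas as pd

frame = pd.DataFrame({"col1": [1, 2], "gene name": [3, 4], "size": [5, 6]})

# A name with a space is not a valid attribute, so '[]' is the only option
print(frame["gene name"])

# "size" is also a DataFrame attribute (the number of cells), and the
# attribute wins over the column when '.' is used
print(frame.size)     # 6
print(frame["size"])  # the actual column
```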

Named rows can be selected by slicing on their labels.

In [None]:
df_example

In [None]:
df_example['row1':'row3']

#### Selection by label: the `.loc` method

```python
# .loc syntax
df.loc[row indexer, column indexer]
```

<b>A slice of specific items (based on label) - start and stop included</b>

In [None]:
df_example.loc['row3':'row5', 'col2':'col4']
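
The inclusive stop is the main difference from ordinary Python slicing, and from `.iloc`, which excludes the stop position. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(1, 13).reshape(4, 3),
                    index=["row0", "row1", "row2", "row3"],
                    columns=["col0", "col1", "col2"])

# Label slices include both endpoints: rows row1 and row2, columns col0 and col1
by_label = grid.loc["row1":"row2", "col0":"col1"]
print(by_label.shape)  # (2, 2)

# Position slices exclude the stop: only row position 1 and column position 0
by_position = grid.iloc[1:2, 0:1]
print(by_position.shape)  # (1, 1)
```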

#### Boolean indexing

In [None]:
df_example.loc[df_example.col2 < 30]
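
Several conditions can be combined with `&` (and) or `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The frame below is made up for the sketch.

```python
import pandas as pd

measurements = pd.DataFrame({"col1": [5, 15, 25, 35],
                             "col2": [10, 20, 30, 40]})

# Boolean Series: True where both conditions hold
mask = (measurements.col1 > 10) & (measurements.col2 < 40)

selected = measurements.loc[mask]   # the matching rows
n_matches = mask.sum()              # True counts as 1, so this counts rows
print(selected)
print(n_matches)  # 2
```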

#### Selection by position: the `.iloc` method

<b>A slice of specific items (based on position)</b>

In [None]:
df_small.iloc[:3,2]

In [None]:
# we can use a list of indices

df_small.iloc[:3,[0,1,3]]

#### Quick Exploration of the data

In [None]:
df_example.col1.describe()

In [None]:
print('SUM: {}'.format(df_example.col1.aggregate(sum)))


In [None]:
df_example[df_example > 50] = np.nan

In [None]:
print('Any missing values: {}'.format(df_example.col1.hasnans))
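
As an alternative to assigning `np.nan` in place, `DataFrame.mask` returns a new frame and leaves the original untouched. This is a swapped-in technique shown on made-up data, not what the cells above do. Note that integer columns become floats once they contain `NaN`.

```python
import pandas as pd

values = pd.DataFrame({"col1": [10, 55, 30], "col2": [60, 20, 40]})

# Replace every cell where the condition is True with NaN
masked = values.mask(values > 50)

print(masked.col1.hasnans)    # True
print(masked.isna().sum())    # number of missing values per column
print(masked.dtypes)          # float64, since NaN is a float
```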


#### Object Manipulation

In [None]:
df_small

In [None]:
df_small.loc[df_small.col2 > 30, ['col2',"col4"]] = 0 


In [None]:
df_small

Replace all the 0 values in df_small with 12.
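
One way to approach this, sketched on a small stand-in frame (`small` is a hypothetical name, not the `df_small` from the cells above): `DataFrame.replace` swaps every occurrence of a value and returns a new frame.

```python
import pandas as pd

small = pd.DataFrame({"col2": [0, 7], "col4": [0, 9]})

# Every 0 becomes 12; other values are left alone
replaced = small.replace(0, 12)
print(replaced)
```

The same result can be reached with boolean indexing, for example `small[small == 0] = 12`, which modifies the frame in place.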

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')


Answer the following questions by writing code:
* How many rows and columns does the dataset have?
* How many flowers have a petal length greater than 4 and a petal width greater than 2?



[Pandas Cheat Sheet (PDF)](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)