## NumPy, Pandas & Visualization - examples
### BIOINF 575 - Fall 2020



_____

<img src = "https://blog.thedataincubator.com/wp-content/uploads/2018/02/Numpypandas.png" width = 300/>


____
#### Array product - vectorized operations

In [None]:
import numpy as np
import pandas as pd

In [None]:
mat1 = np.arange(1,7).reshape(2,3)
mat2 = np.array([[10, 11], [20, 21], [30, 31]])

In [None]:
mat1.T

In [None]:
mat2

In [None]:
mat1.T * mat2

_____
#### Matrix multiplication

<img src = "https://miro.medium.com/max/1400/1*YGcMQSr0ge_DGn96WnEkZw.png" width = 370/>

In [None]:
# matrix multiplication 

mat1.dot(mat2)


In [None]:
# matrix multiplication with the @ operator (available since Python 3.5)

mat1@mat2
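
The element-wise product and matrix multiplication follow different shape rules, which is easy to miss when only the results are displayed. The sketch below reuses `mat1` and `mat2` from the cells above and prints the shape of each result; the names `elementwise` and `matprod` are introduced here only for illustration.

```python
import numpy as np

mat1 = np.arange(1, 7).reshape(2, 3)              # shape (2, 3)
mat2 = np.array([[10, 11], [20, 21], [30, 31]])   # shape (3, 2)

# Element-wise product: operands need the same (or broadcastable) shape
elementwise = mat1.T * mat2                       # (3, 2) * (3, 2) -> (3, 2)

# Matrix multiplication: inner dimensions must match, (2, 3) @ (3, 2) -> (2, 2)
matprod = mat1 @ mat2

print(elementwise.shape, matprod.shape)
print(matprod)
```

`mat1.dot(mat2)` and `mat1 @ mat2` give the same result for these two-dimensional arrays.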


___

#### Combining arrays into a larger array - vstack, hstack, vsplit, hsplit

In [None]:
# Display the two arrays again

mat1

In [None]:
mat2

In [None]:
# stacking arrays together - vertically
vmatrix = np.vstack((mat1, mat2.T))
vmatrix

In [None]:
# stacking arrays together - horizontally
hmatrix = np.hstack((mat1.T,mat2))
hmatrix
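
The heading also mentions `vsplit` and `hsplit`, which undo the stacking. A minimal sketch, assuming we want to split each stacked array back into two equal parts (the names `top`, `bottom`, `left` and `right` are illustrative):

```python
import numpy as np

mat1 = np.arange(1, 7).reshape(2, 3)
mat2 = np.array([[10, 11], [20, 21], [30, 31]])

vmatrix = np.vstack((mat1, mat2.T))    # shape (4, 3)
hmatrix = np.hstack((mat1.T, mat2))    # shape (3, 4)

# Split into two equal parts along the same direction used for stacking
top, bottom = np.vsplit(vmatrix, 2)    # two arrays of shape (2, 3)
left, right = np.hsplit(hmatrix, 2)    # two arrays of shape (3, 2)

print(np.array_equal(top, mat1), np.array_equal(bottom, mat2.T))
print(np.array_equal(left, mat1.T), np.array_equal(right, mat2))
```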

#### <b>More matrix computation</b> - basic aggregate functions are available - min, max, sum, mean, std

In [None]:
# Let's look at our matrix again
mat1

#### Use the axis argument to aggregate over each column or each row
- `axis = 0` - one result per column
- `axis = 1` - one result per row

In [None]:
# compute max for each column 
# using the array max method (np.ndarray.max)

mat1.max(axis = 0)


In [None]:
# compute the sum of each row using the np.sum function

np.sum(mat1, axis = 1)
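
The effect of `axis` is easiest to see by comparing result shapes. The sketch below applies `mean` to `mat1` in three ways; the variable names are illustrative only.

```python
import numpy as np

mat1 = np.arange(1, 7).reshape(2, 3)   # 2 rows, 3 columns

col_means = mat1.mean(axis=0)   # one value per column -> length 3
row_means = mat1.mean(axis=1)   # one value per row    -> length 2
overall = mat1.mean()           # no axis: aggregate over all elements

print(col_means.shape, row_means.shape)
print(col_means, row_means, overall)
```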

#### RESOURCES

http://scipy-lectures.org/intro/numpy/array_object.html#what-are-numpy-and-numpy-arrays   
https://www.python-course.eu/numpy.php   
https://numpy.org/devdocs/user/quickstart.html#universal-functions   
https://www.geeksforgeeks.org/python-numpy/

____ 

#### [Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

#### How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array stores elements of a single data type (dtype).
* A pandas table can hold many types, one per column.
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
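
A small sketch of this difference, using made-up values: NumPy coerces mixed input to one common dtype, while a DataFrame keeps a separate dtype for each column.

```python
import numpy as np
import pandas as pd

# Mixed input: NumPy converts every element to one common type (here, strings)
mixed = np.array([1, 2.5, "EGFR"])
print(mixed.dtype)

# A DataFrame keeps one dtype per column
table = pd.DataFrame({
    "gene": ["EGFR", "IL6"],    # text
    "count": [4, 2],            # integers
    "score": [0.5, 1.25],       # floats
})
print(table.dtypes)
```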

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)
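
As an illustration of the second point, the sketch below reads the same toy table from tab-separated and comma-separated text. The values are made up, and `io.StringIO` stands in for files on disk.

```python
import io
import pandas as pd

# Made-up content standing in for a TSV file and a CSV file
tsv_text = "Symbol\t0\t3\nEGFR\t7.1\t7.4\nIL6\t5.0\t6.2\n"
csv_text = "Symbol,0,3\nEGFR,7.1,7.4\nIL6,5.0,6.2\n"

# Same reader, different separator; the first column becomes the row index
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(from_tsv)
print(from_tsv.equals(from_csv))
```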

___
#### 1. `pd.Series` - One-dimensional labeled array (or vector)

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```
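
A minimal sketch of this syntax with concrete arguments; the name `gene_counts` is chosen here for illustration and the values are the same toy numbers used in the cells below.

```python
import pandas as pd

gene_counts = pd.Series(
    data=[4, 2, 3, 2],
    index=["EGFR", "IL6", "BRAF", "ABL"],   # labels for each element
    dtype="float64",                        # force a floating-point dtype
)

print(gene_counts["BRAF"])    # access by label
print(gene_counts.iloc[0])    # access by position
print(gene_counts.dtype)
```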

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']
 
 
 Methods
 
 ['abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'asfreq',
 'asof',
 'astype',
 'at_time',
 'autocorr',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'duplicated',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'head',
 'hist',
 'idxmax',
 'idxmin',
 'infer_objects',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'shift',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'var',
 'view',
 'where',
 'xs']

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Create series from dictionary
labels = ["EGFR","IL6","BRAF","ABL"]
values = [4,2,3,2]
dict_var = dict(zip(labels, values))
new_series = pd.Series(dict_var)
new_series

In [None]:
# check the first few elements - head, the last few - tail
new_series.head(2)

#### describe() - Generate descriptive statistics



In [None]:
# generate descriptive statistics
new_series.describe()

_____

#### 2. `pd.DataFrame` - Two-dimensional labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```
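
A minimal sketch of this syntax, assuming a toy table of two genes and three time points (`toy_expr` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy_expr = pd.DataFrame(
    data=np.arange(6).reshape(2, 3),        # 2 rows x 3 columns of values
    index=["EGFR", "IL6"],                  # row labels
    columns=["Hour0", "Hour3", "Hour6"],    # column labels
    dtype="float64",
)

print(toy_expr.shape)
print(toy_expr.loc["IL6", "Hour3"])
```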

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

___
#### <font color = "green">Example</font>

The file "GSE22955_small_gene_table.txt" contains tab-separated, normalized gene expression data for about 1200 genes, measured every 3 h for 45 h to assess the effect of a HER2 inhibitor.   
This is a filtered and processed subset of the data available in the Gene Expression Omnibus:   
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22955


In [None]:
file_name = "GSE22955_small_gene_table.txt"
expression_data = pd.read_csv(file_name, sep = "\t", comment = "#", index_col= 0)
expression_data

In [None]:
expression_data.shape

In [None]:
# index by column name to get a column
expression_data["6"]

In [None]:
# subset by a range to get specific rows 
expression_data[3:7]

In [None]:
# change column names
expression_data.columns = "Hour" + expression_data.columns

In [None]:
# dot notation to get columns and vectorized operations to subtract one column from another

expr_diff = expression_data.Hour45 - expression_data.Hour0

expr_diff

In [None]:
# np.where function to find the positions of values that satisfy a certain condition

pos = np.where(abs(expr_diff) == max(abs(expr_diff)))


In [None]:
# pos holds integer positions, so select with .iloc rather than by label
expr_diff.iloc[pos[0]]
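
A possible alternative to the `np.where` approach is to ask pandas directly for the label of the largest absolute value with `idxmax`. The sketch below uses a made-up series, `toy_diff`, in place of `expr_diff`.

```python
import pandas as pd

# Made-up differences standing in for expr_diff
toy_diff = pd.Series([0.4, -2.1, 1.3], index=["EGFR", "IL6", "BRAF"])

# Label of the element with the largest absolute value
top_gene = toy_diff.abs().idxmax()

print(top_gene, toy_diff[top_gene])
```

Note that `idxmax` returns only the first label when several values tie for the maximum, whereas `np.where` returns all matching positions.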

In [None]:
# row standard deviation
gene_sd = expression_data.std(axis = 1)

In [None]:
expression_data[gene_sd > 1.25]

In [None]:
# add a new column using join - the column has to have a name

gene_sd.name = "Gene_sd"
expr_sd_data = expression_data.join(gene_sd)
expr_sd_data

#### There are 2 pandas-specific methods for indexing:
####   1.  ```.loc``` - primarily label/name-based
####   2.  `.iloc` - primarily integer/position-based
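
The two indexers can be compared side by side on a small table. This is a sketch with made-up values; `toy` and the other variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Hour0": [7.0, 5.0, 6.5], "Hour3": [7.5, 6.0, 6.4], "Gene_sd": [0.35, 0.71, 0.07]},
    index=["EGFR", "IL6", "BRAF"],
)

by_label = toy.loc["IL6", "Hour3"]     # row and column names
by_position = toy.iloc[1, 1]           # row and column positions

# Label slices include both end points; position slices exclude the end
label_slice = toy.loc["EGFR":"IL6", "Hour0":"Hour3"]
pos_slice = toy.iloc[0:2, 0:2]

# .loc also accepts a boolean condition for the rows
variable = toy.loc[toy.Gene_sd > 0.3, "Hour0":]

print(by_label, by_position)
print(label_slice.equals(pos_slice))
print(variable)
```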

In [None]:
# subset dataframe using conditional subsetting and column names
gene_var = expr_sd_data.loc[expr_sd_data.Gene_sd > 1.25,"Hour30":]
gene_var

In [None]:
gene_var.to_csv("gene_var.txt", sep = "\t", header = True, index = True)

In [None]:
expr_data_small = expression_data.iloc[40:45,1:]
expr_data_small

In [None]:
expr_data_small.loc["ALPL"].plot()

In [None]:
expr_data_small.T.plot.box()

#### RESOURCES

https://www.python-course.eu/pandas.php    
https://www.python-course.eu/numpy.php    
https://scipy-lectures.org/packages/statistics/index.html?highlight=pandas  
https://www.geeksforgeeks.org/pandas-tutorial/

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


____
_____
## Data Visualization

____

#### `matplotlib` - powerful basic plotting library - pandas plots are matplotlib plots
https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html

`matplotlib.pyplot` is a collection of command-style functions that make matplotlib work like MATLAB. <br>
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, or decorates the plot with labels.

`matplotlib.pyplot` preserves various states across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes.<br>
(Note that "axes" in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis.)


https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533
https://matplotlib.org
https://matplotlib.org/tutorials/    
https://github.com/rougier/matplotlib-tutorial     
https://www.tutorialspoint.com/matplotlib/matplotlib_pyplot_api.htm    
https://realpython.com/python-matplotlib-guide/    
https://github.com/matplotlib/AnatomyOfMatplotlib    
https://www.w3schools.com/python/matplotlib_pyplot.asp   
http://scipy-lectures.org/intro/matplotlib/index.html

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

Call signatures:
```
    plot([x], y, [fmt], data=None, **kwargs)
    plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
```

The two `plt` functions used most often are `plot()`, which draws the data, and `show()`, which displays the figure.
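
Because pyplot keeps track of the current figure and axes, calls such as `plt.plot()` and `plt.xlabel()` act on whichever axes are current. The sketch below makes that state visible with `plt.gcf()` and `plt.gca()`; the data points are made up.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()                # this figure and axes become "current"

plt.plot([0, 3, 6], [1.0, 2.5, 2.0])    # drawn on the current axes
plt.xlabel("Time")                      # labels the current axes

# The pyplot calls above changed the same objects we hold in fig and ax
print(plt.gcf() is fig, plt.gca() is ax)
print(len(ax.lines), ax.get_xlabel())
```

In a notebook, a final `plt.show()` would display the figure.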

In [None]:
# Plot the expression of each gene over time, add axis labels

plt.plot(expr_data_small.T, marker = "s")
plt.xlabel("Time")
plt.ylabel("Expression")
plt.legend(expr_data_small.index)
plt.xticks(rotation = 90)
plt.show()

`matplotlib` can use *format strings* to quickly declare the type of plots you want. Here are *some* of those formats:

|**Character**|**Description**|
|:-----------:|:--------------|
|'--'|Dashed line|
|':'|Dotted line|
|'o'|Circle marker|
|'^'|Upwards triangle marker|
|'b'|Blue|
|'c'|Cyan|
|'g'|Green|

In [None]:
plt.plot(expr_data_small.loc["AMPH"], '^b--', linewidth=3, markersize=12)
plt.xticks(rotation = 90)

plt.show()

In [None]:
plt.plot(expr_data_small.loc["AMPH"], color='blue', marker='^', linestyle='dashed', linewidth=0.5, markersize=5)
plt.xticks(rotation = 90)


plt.show()

In [None]:
plt.plot(expr_data_small.loc["AMPH"], '^m--', expr_data_small.loc["ALPL"], 'sg-')
plt.xticks(rotation = 90)

plt.show()

In [None]:
# Making a figure -  grid layout


plt.figure(figsize=(16, 12))

plt.subplot(221)
plt.bar(expr_data_small.index, expr_data_small.mean(axis = 1))
plt.xticks(rotation = 90)

plt.subplot(222)
plt.scatter(expr_data_small.columns, expr_data_small.loc["AMPH"])
plt.scatter(expr_data_small.columns, expr_data_small.loc["ALPL"])
plt.legend(["AMPH","ALPL"])
plt.xticks(rotation = 90)

plt.subplot(223)
plt.hist(expr_data_small.loc["AMPH"])
plt.hist(expr_data_small.loc["ALPL"])
plt.legend(["AMPH","ALPL"])

axs = plt.subplot(224)
# violinplot draws one violin per column, so transpose to get one violin per gene
axs.violinplot(expr_data_small.T.to_numpy())
axs.set_xticks(range(1,6))
axs.set_xticklabels(expr_data_small.index)
plt.xticks(rotation = 90)



plt.suptitle('Cool data summary')
plt.show()

In [None]:
# help(plt.bar)

#### Multiple Plots

In [None]:
expr_data_small.T.AMPH.plot(kind='density')
expr_data_small.T.ALPL.plot(kind='density')
plt.legend()
plt.show()

____________

### `seaborn` - dataset-oriented plotting

Seaborn is a library that specializes in making *prettier* `matplotlib` plots of statistical data. <br>
It is built on top of matplotlib and closely integrated with pandas data structures.

https://seaborn.pydata.org/introduction.html<br>
https://python-graph-gallery.com/seaborn/<br>
https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html<br>
https://seaborn.pydata.org/tutorial/distributions.html

In [None]:
import seaborn as sns

`seaborn` lets users *style* their plotting environment.

In [None]:
sns.set(style='whitegrid')

In [None]:
#dir(sns)

In [None]:
colors = ["<= Hour15","<= Hour15","<= Hour15","<= Hour15","<= Hour15",
         "<= Hour30","<= Hour30","<= Hour30","<= Hour30","<= Hour30",
         "<= Hour45","<= Hour45","<= Hour45","<= Hour45","<= Hour45"]

In [None]:
# hue argument allows you to color dots by category

sns.scatterplot(x='AMPH',y='ALPL', hue = colors, data=expr_data_small.T)
plt.show()
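
The `colors` list passed to `hue` is written out by hand above. One possible way to build the same list programmatically, assuming three groups of five time points each, is `np.repeat`:

```python
import numpy as np

# Repeat each group label five times, in order
colors = np.repeat(["<= Hour15", "<= Hour30", "<= Hour45"], 5).tolist()

print(len(colors))
print(colors[0], colors[5], colors[10])
```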

In [None]:
sns.relplot(x="AMPH", y="ALPL", data=expr_data_small.T, hue = colors)
plt.show()

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris.head()

In [None]:
sns.relplot(x="petal_length", y="petal_width", col="species",
            hue="species", style="species", size="species",
            data=df_iris)
plt.show()

In [None]:
sns.heatmap(expression_data[gene_sd>1.1], center = 12, cmap = "Oranges")
plt.show()

____

### `plotnine` - grammar of graphics - R ggplot2 in python

plotnine is an implementation of a grammar of graphics in Python, based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.

Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and then create, while simple plots remain simple.
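
Mapping data to visual objects is simplest when each variable is a column, which is why the cells below reshape the expression table with `pd.melt` before plotting. A sketch of that reshaping on a made-up table (`wide` and its values are illustrative; the index name `Symbol` mirrors the column name used in the plot below):

```python
import pandas as pd

# Made-up genes-by-time table, then transposed so that genes are columns
genes = pd.Index(["AMPH", "ALPL"], name="Symbol")
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=genes, columns=["Hour3", "Hour6"]).T

# melt stacks the gene columns into two columns: gene name and value
long = pd.melt(wide)

print(long)
```

After melting, `Symbol` can be mapped to the x axis and `value` to the y axis.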



https://plotnine.readthedocs.io/en/stable/   
http://cmdlinetips.com/2018/05/plotnine-a-python-library-to-use-ggplot2-in-python/  
https://plotnine.readthedocs.io/en/stable/tutorials/miscellaneous-altering-colors.html   
https://datascienceworkshops.com/blog/plotnine-grammar-of-graphics-for-python/   
https://realpython.com/ggplot-python/

In [None]:
# !pip install plotnine

In [None]:
from plotnine import *

In [None]:
pd.melt(expr_data_small.T)

In [None]:
ggplot(data=pd.melt(expr_data_small.T)) + geom_boxplot(aes(x = "Symbol", y = "value"))

In [None]:
# add transparency - to avoid over plotting - alpha argument and change point size 
# more parameters - scale_x_log10 - transform x axis values to log scale, xlab - add label to x axis

ggplot(data=df_iris) +aes(x='petal_length',y='petal_width',color="species") + \
    geom_point(size=0.7,alpha=0.7) + facet_wrap('~species',nrow=3) + \
    theme(figure_size=(7,7)) + ggtitle("Plot of iris dataset") + \
    scale_x_log10() + xlab("Petal Length") + ylab("Petal Width")


In [None]:
# Set width of bar for histogram and color for the bar line and bar fill color

p = ggplot(data=df_iris) + aes(x='petal_length') + geom_histogram(binwidth=1,color='black',fill='grey')
p

In [None]:
# Create a linear regression line that uses the petal length to predict the petal width of the flower
# These are broken down in 3 categories by species
# The grey area is the 95% confidence level interval for predictions from a linear model ("lm")

p = ggplot(df_iris, aes('petal_length', 'petal_width', color='species')) \
 + geom_point() \
 + stat_smooth(method='lm')
# + facet_wrap('~species')
p

In [None]:
# Save the plot to a file

p.save(filename='iris_linear_model.png')



https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
