## NumPy, Pandas & Visualization
### BIOINF 575 - Fall 2020



_____


### RECAP & new info - NumPy - Numeric python <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" width = "100">

____
#### The list contains references to each of the values.
#### The array refers to a block of memory containing all values one after the other.
- <b>that is why we need to know the size of the array in advance, and why the array size cannot change</b><br>


<img src = "https://www.python-course.eu/images/list_structure.png" width = 300 /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src = "https://www.python-course.eu/images/array_structure.png" width = 350 />
____

#### Arrays of different dimensions (`shape` gives the number of elements on each dimension):

<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" alt="data structures" width="450">  

_____


#### <b>NumPy basics</b>

Arrays are designed to:
* <b>handle vectorized operations (lists do not)</b>
    - an operation or function is applied to every item in the array, rather than to the array object as a whole
    - both arrays and lists have 0-based indexing
* <b>store multiple items of the same data type</b>
* <b>handle missing values </b>
    - missing numerical values are represented using the `np.nan` object (not a number)
    - the object `np.inf` represents infinity  
* <b>have an unchangeable size</b>
    - array size cannot be changed; to get a different size, create a new array
    - you know when you create the array how much space you need for it and that will not change  
* <b>have efficient memory usage</b>
    - an equivalent numpy array occupies much less space than a python list of lists
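
The properties listed above can be checked with a small sketch. The variable names (`values_list`, `values_array`, `mixed`, `with_missing`) are illustrative and not part of the course material.

```python
import numpy as np

values_list = [1, 2, 3, 4]
values_array = np.array(values_list)

# Vectorized operation: the multiplication is applied to every element
doubled_array = values_array * 2          # array([2, 4, 6, 8])

# The same operator on a list repeats the list instead
doubled_list = values_list * 2            # [1, 2, 3, 4, 1, 2, 3, 4]

# Same data type: mixing ints and a float gives an array of floats
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)                        # float64

# Missing values: np.nan is a float and propagates through computations
with_missing = np.array([1.0, np.nan, 3.0])
print(np.isnan(with_missing))             # [False  True False]
print(with_missing.sum(), np.nansum(with_missing))   # nan 4.0

# Memory: nbytes is the size of the data block
# (bytes per element * number of elements)
print(values_array.itemsize, values_array.nbytes)
```

The integer size printed on the last line depends on the platform, so no expected value is shown for it.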

#### <b>Importing NumPy</b>
The recommended convention to import numpy is to use the <b>np</b> alias:

In [None]:
import numpy as np

#### <b>Documentation and help</b>
https://numpy.org/doc/

In [None]:
# np.lookfor('sum') 

In [None]:
np.me*?

In [None]:
# np.mean?

In [None]:
# help(np.mean)

#### <b>Motivating example</b> - change weight from grams to ounces

In [None]:
weight_list_g = [20, 5, 30, 100]

In [None]:
# using lists we need a comprehension to apply the formula to each element of the list
weight_list_oz = [weightg*0.03527396195 for weightg in weight_list_g]
weight_list_oz

In [None]:
# using arrays we can apply the formula directly to the array and it will be applied to each element

np.array(weight_list_g)

In [None]:
weight_array_oz = np.array(weight_list_g)*0.03527396195
weight_array_oz

In [None]:
# plot the values
# the following command shows the plot in the notebook, not in a pop-up window
%matplotlib inline   

# the following command imports the module used for plotting
import matplotlib.pyplot as plt  

plt.plot(weight_array_oz)
plt.show()

#### <b>Functions for creating arrays</b>
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html

In [None]:
# Creating arrays see the different functions used to create arrays:

matrix_list = np.array([[1,2,3,4], [40,60,70,80], [101, 202, 303, 404]])
print("2D array from a list of lists: \n")
print("np.array([[1,2,3,4], [40,60,70,80], [101, 202, 303, 404]]) \n")
print(matrix_list)
print("_______________________________________________________________\n")


vector_range = np.arange(3, 18, 3) 
print("Vector of evenly spaced values from a range (arange) given by start, stop and step: \n")
print("np.arange(3, 18, 3) \n")
print(vector_range)
print("_______________________________________________________________\n")

vector_lin = np.linspace(0, 1.5, 7)
print("Vector of evenly spaced values (known number, linspace) given by start, stop and number of points: \n")
print("np.linspace(0, 1.5, 7) \n")
print(vector_lin)
print("_______________________________________________________________\n")


matrix_zeros = np.zeros((5,4), dtype = int)
print("2D array of zeros: \n")
print("np.zeros((5,4), dtype = int) \n")
print(matrix_zeros)
print("_______________________________________________________________\n")


matrix_ones = np.ones((4,5,3), dtype = int)
print("3D array of ones: \n")
print("np.ones((4,5,3), dtype = int) \n")
print(matrix_ones)
print("_______________________________________________________________\n")


val = 42
matrix_val = np.full((4,5), val, dtype = int)
print("2D array filled with a given value: \n")
print("val = 42 ")
print("np.full((4,5), val, dtype = int) \n")
print(matrix_val)
print("_______________________________________________________________\n")



matrix_id = np.identity(4)
print("2D square array filled with 1 on the diagonal: \n")
print("np.identity(4) \n")
print(matrix_id)
print("_______________________________________________________________\n")



# Create a 4x6 matrix with 1s on a diagonal - does not need to be square - k is the offset of that diagonal (k = 1 is the first diagonal above the main one)
matrix_eye = np.eye(4, 6, k = 1) 
print("Create a 4x6 matrix with 1s on a diagonal - does not need to be square, the diagonal offset is given by k: \n")
print("np.eye(4, 6, k = 1)  \n")
print(matrix_eye)
print("_______________________________________________________________\n")


# Create a 2D array with the given 1D array on the k-th diagonal (k = 0 is the main diagonal)
# help(np.diag)
vector_diag = np.array([1,2,3])
matrix_diag =  np.diag(vector_diag, k = 1)
print("Create a 2D array with the given 1D array on the k-th diagonal: \n")
print("vector_diag = np.array([1,2,3])")
print("np.diag(vector_diag, k = 1) \n")
print(matrix_diag)
print("_______________________________________________________________\n")


# Extract the k-th diagonal from a given matrix (k = 0 is the main diagonal)
matrix_list = np.array([[1,2,3,4], [40,60,70,80], [101, 202, 303, 404]])
vec_diag = np.diag(matrix_list, k = 0)
print("Extract the k-th diagonal from a given matrix: \n")
print("matrix_list = np.array([[1,2,3,4], [40,60,70,80], [101, 202, 303, 404]])")
print("matrix_list: \n", matrix_list, "\n" )
print("np.diag(matrix_list, k = 0) \n")
print(vec_diag)
print("_______________________________________________________________\n")




_____

#### Random data - important for generating synthetic data to test your code
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html
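
Scaling standard normal values to another mean and standard deviation, and reproducing them with a seed, can be sketched as follows. This is a minimal example that assumes NumPy 1.17 or later, where `np.random.default_rng` is available as an alternative to the global `np.random.seed`; the names `rng`, `mean`, `sd` and `sample` are illustrative.

```python
import numpy as np

# A seeded generator gives reproducible random numbers
rng = np.random.default_rng(10)

# Standard normal values (mean 0, standard deviation 1)
standard = rng.standard_normal((3, 4))

# Multiply by the standard deviation and add the mean
# to get a normal distribution that is not standard
mean, sd = 50, 5
sample = mean + sd * standard

# A new generator created with the same seed reproduces the same values
rng_again = np.random.default_rng(10)
same = rng_again.standard_normal((3, 4))
print(np.array_equal(standard, same))   # True
```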

In [None]:
# help(np.random.random)

test_matrix = np.random.random((3,4)) 
print("An array filled with random values from the continuous uniform distribution over the half-open interval [0, 1): \n")
print("np.random.random((3,4)) \n")
print(test_matrix, "\n")
print("_______________________________________________________________\n")


# Multiply by the standard deviation and add the mean if the normal distribution is not standard 
test_matrix = np.random.randn(3,4) 
print("An array filled with random values from the standard normal distribution (mean = 0, standard deviation = 1): \n")
print("np.random.randn(3,4) \n")
print(test_matrix, "\n")
print("_______________________________________________________________\n")


# Set a seed to get the same random numbers
np.random.seed(10)
test_matrix = np.random.randn(3,4) 
print("Set a seed to get the same random numbers: \n")
print("np.random.randn(3,4) \n")
print(test_matrix, "\n")
print("_______________________________________________________________\n")


np.random.seed(10)
test_matrix = np.random.randn(3,4) 
print("\nSet a seed to get the same random numbers: \n")
print("np.random.randn(3,4) \n")
print(test_matrix, "\n")
print("_______________________________________________________________\n")



_____
#### <b>Basic array attributes (important to understand the data you are working with):</b>
* <b>shape</b>: Array dimensions (number of elements along each dimension)
* <b>size</b>: Number of elements in the array
* <b>ndim</b>: Number of array dimensions (len(arr.shape))
* <b>dtype</b>: Data-type of the array

In [None]:
# nested lists give us multi dimensional arrays

matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
# dir(matrix)
print("Create matrix array from lists of lists: \n")
print("np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]) \n")
print(matrix)
print("_______________________________________________________________\n")


print("Size of the array (total number of elements): \n")
print("matrix.size \n")
print(matrix.size)
print("_______________________________________________________________\n")

print("Shape of the array (tuple of number of elements on each dimension): \n")
print("matrix.shape \n")
print(matrix.shape)
print("_______________________________________________________________\n")

print("Type of the array data (type of the array elements): \n")
print("Examples include int64, int32 (int and the number of bits it is represented on), <U5 (unicode and number of characters) \n")
print("matrix.dtype \n")
print(matrix.dtype)
print("_______________________________________________________________\n")

____
#### <b>Reshaping</b> - changing the numbers of rows and columns - data and size stay the same
#### Important for matrix operations used to compute statistics from your data
* <b>T</b>: transpose of a 2D array (rows and columns switched)
* <b>reshape(rowno, colno)</b>: change the number of rows and columns, but NOT the data or the size


In [None]:
print("matrix: \n")
print(matrix)
print("_______________________________________________________________\n")


print("Transpose of the array (rows and columns switched): \n")
print("matrix.T \n")
print(matrix.T)
print("_______________________________________________________________\n")

print("Matrix reshaped from 4 by 3 to 6 by 2 (notice the product is the same = 12 so we keep the same size) \n")
print("matrix.reshape(6, 2) \n")
print(matrix.reshape(6, 2))
print("_______________________________________________________________\n")

# help(np.reshape)


____
#### <b>Indexing/Slicing(subsetting)</b> (important to work with subsets of your data): 
* <b>list-like</b>: [start:stop:step][start:stop:step] 
* <b>rows and cols index/range</b>: [start:stop:step, start:stop:step]

* <b>change array type</b> - vector = vector.<b>astype</b>(new_type)
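
The difference between the two indexing styles matters once slices are involved. A minimal sketch, using an illustrative array `m` that is not part of the course data:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# For a single element both styles give the same value
print(m[1][2], m[1, 2])        # 6 6

# With slices they differ: the second [] of the list-like style
# is applied to the result of the first one, so it slices the rows again
chained = m[:2][:3]            # first 2 rows, all 4 columns
combined = m[:2, :3]           # first 2 rows and first 3 columns

print(chained.shape)           # (2, 4)
print(combined.shape)          # (2, 3)

# astype returns a new array with the requested type
as_float = m.astype(float)
print(as_float.dtype)          # float64
```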
___


In [None]:
print("Create a matrix with 6 rows and 6 columns \n")
a_row = np.arange(6)
a = np.zeros((6,6)) + a_row + (10 * a_row).reshape(6,1)
# change array type
a = a.astype(int)

print("a_row = np.arange(6)")
print("a = np.zeros((6,6)) + a_row + (10 * a_row).reshape(6,1)")
print("a = a.astype(int) \n")
print(a)
print("_______________________________________________________________\n")

print("a[0, 3:5] \n")
print(a[0, 3:5])
print("_______________________________________________________________\n")



___
Each colored expression on the left selects the elements highlighted in the same color in the matrix on the right: 

<img src = "http://scipy-lectures.org/_images/numpy_indexing.png" width = 400/>

____
#### <font color = "red">Exercise</font>
Try the subsetting examples above.



In [None]:
print("Create a matrix with 5 rows and 8 columns with numbers from 0 to 40 (40 not included) \n")
matrix = np.arange(40).reshape(5,8)

print("matrix = np.arange(40).reshape(5,8) \n")
print(matrix)
print("_______________________________________________________________\n")

print("")



In [None]:
# List-like indexing - element on the second row and second column (indices start from 0)

matrix[1][1]


In [None]:
# Using both rows and columns indices to get a value

matrix[1,1]


In [None]:
# Using both rows and columns indices to get a sub-matrix
# List-like indexing will not work in this case 

matrix[:2, :3]


#### Array of indices subsetting - use an array of indices to subset an array, keeping only the elements given by the indices

In [None]:
#________

matrix

In [None]:
# Take only the second and fourth rows and the third, fourth and fifth columns. 
# (indices start from 0)

matrix[[1,3],2:5]



In [None]:
matrix

____

#### <font color = "red">Exercise</font>
From the above matrix take only the first, fourth and fifth row and the second, third and sixth column




#### Conditional subsetting - use array of booleans to subset array with only the elements where the bool array is True

In [None]:
matrix

In [None]:
# Using a list of boolean values will select the rows where the list is True
# The list should have the same number of elements as the number of rows

matrix[[False, True,  True,  False,  True]]

In [None]:
# Applying a conditional operator on a 1D array returns a 1D array of True False values

matrix[:,0] < 10 

In [None]:
# Conditional subsetting - select only rows that meet the condition
# Select rows that have numbers < 10 in the first column [:,0] => the first two rows (indices 0 and 1)

matrix[matrix[:,0] < 10]

In [None]:
cond = matrix[:,0] < 10
matrix[cond]

In [None]:
# Select columns that have numbers > 10 in the second row [1,:] => the fourth column onward (indices 3 to 7)

cond1 = matrix[1,] > 10
matrix[:,cond1]

In [None]:
# Put the two conditions together

matrix[cond,][:,cond1]


In [None]:
# Copy an array - plain assignment does not make a copy, 
# so any changes would be reflected in both variables
matrix1 = matrix.copy()
matrix1


In [None]:
matrix

In [None]:
# We can put conditions on all elements of a matrix to select a subset of elements
# For compound conditions wrap each condition in () and combine them with the element-wise operators & (and), | (or)
(matrix1 > 10) & (matrix1 < 30)

In [None]:
matrix1[(matrix1 > 10) & (matrix1 < 30)]

In [None]:
# We can change a subset of values through conditional subsetting and ASSIGNMENT
# Change all values that meet the condition to 100

matrix1[(matrix1 > 10) & (matrix1 < 30)] = 100


In [None]:
matrix1

In [None]:
# due to copy the first matrix is not changed:

matrix


____

## RECAP END - new information  

#### <b>Matrix operations</b>

https://www.tutorialspoint.com/matrix-manipulation-in-python<br>
Arithmetic operators on arrays apply element-wise. <br> 
A new array is created and filled with the result.


#### <b>Array broadcasting</b> - for compatible arrays

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html<br>
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. <br>
Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
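
The compatibility rule can be sketched with shapes alone. In this minimal example the arrays `m`, `col` and `row` are illustrative; shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.

```python
import numpy as np

m = np.zeros((5, 8))

col = np.arange(5).reshape(5, 1)    # shape (5, 1): stretched across the 8 columns
row = np.arange(8)                  # shape (8,):   stretched across the 5 rows

print((m + col).shape)              # (5, 8)
print((m + row).shape)              # (5, 8)

# A 1D array of length 5 is NOT compatible:
# its last (only) dimension is 5, which does not match 8
try:
    m + np.arange(5)
    compatible = True
except ValueError as err:
    compatible = False
    print("ValueError:", err)
```

Reshaping the length-5 array to a column with `reshape(5, 1)` is what makes it compatible with the 5 by 8 matrix.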

<img src = "https://www.tutorialspoint.com/numpy/images/array.jpg" height=10/>


https://www.tutorialspoint.com/numpy/numpy_broadcasting.htm

In [None]:
# Let's look at our matrix again

matrix


In [None]:
# Adding a number - it is added to all elements of a matrix
# - The same applies for any arithmetic operation

matrix + 100



In [None]:
########
# Let's look at our matrix again

matrix


In [None]:
# Create array of length 5 that starts from 1 and reshape it to make it a column
# The new column is compatible with the matrix columns (same number of elements)

col_vec = np.arange(1,6).reshape(5,1)
col_vec

##### Addition between an array and a compatible column vector (shape (5, 1))
##### The elements are added one by one for each column
- The same applies for all arithmetic operations


`Example for the first column (the same way all columns are updated):
 0 + 1 = 1  
 8 + 2 = 10  
16 + 3 = 19   
24 + 4 = 28   
32 + 5 = 37`  


In [None]:
# Addition between an array and a compatible column vector
# The elements are added one by one for each column
print(matrix, "\n")
print(col_vec, "\n")
matrix + col_vec

In [None]:
# Addition with a data row - adds each element of the new row 
# to the respective element of each row of the matrix

print(matrix, "\n")
print(np.arange(8), "\n")
matrix + np.arange(8)

In [None]:
##########
# For compatible matrices (same number of rows and columns) 
# the arithmetic operations will be done element by element

print("matrix: \n", matrix, "\n")
print("matrix + matrix: \n", matrix + matrix, "\n" )
print("matrix * matrix: \n", matrix * matrix )

_____
#### Matrix multiplication
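
Each element of the product is the sum of the products of one row of the first matrix with one column of the second. A minimal sketch, reusing the values of the `mat1` and `mat2` arrays from the cell that follows, contrasts this with the element-wise `*` operator:

```python
import numpy as np

mat1 = np.arange(1, 7).reshape(2, 3)               # [[1 2 3], [4 5 6]]
mat2 = np.array([[10, 11], [20, 21], [30, 31]])    # shape (3, 2)

# (2 x 3) @ (3 x 2) -> (2 x 2): the inner dimensions must match
product = mat1 @ mat2

# First element by hand: first row of mat1 times first column of mat2
first = 1 * 10 + 2 * 20 + 3 * 30                   # 140
print(product[0, 0] == first)                      # True
print(product)
# [[140 146]
#  [320 335]]

# The * operator is element-wise and needs compatible shapes instead
elementwise = mat1 * mat2.T                        # [[10 40 90], [44 105 186]]
```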

<img src = "https://miro.medium.com/max/1400/1*YGcMQSr0ge_DGn96WnEkZw.png" width = 370/>

In [None]:
# matrix multiplication
mat1 = np.arange(1,7).reshape(2,3)
mat2 = np.array([[10, 11], [20, 21], [30, 31]])
mat1.dot(mat2)

In [None]:
# matrix multiplication - using the @ operator (available since Python 3.5)
mat1@mat2

___

#### Combining and splitting arrays - vstack, hstack, vsplit, hsplit

In [None]:
##########

mat1

In [None]:
mat2

In [None]:
# stacking arrays together - vertically
vmatrix = np.vstack((mat1, mat2.T))
vmatrix

In [None]:
# stacking arrays together - horizontally
hmatrix = np.hstack((mat1.T,mat2))
hmatrix

In [None]:
##########

vmatrix


In [None]:
# splitting arrays - results in a list of arrays

np.vsplit(vmatrix,2)


In [None]:
##########

hmatrix


In [None]:
# splitting arrays - results in a list of arrays

np.hsplit(hmatrix,2)


#### <b>More matrix computation</b> - basic aggregate functions are available - min, max, sum, mean, std

In [None]:
# Let's look at our matrix again
matrix

#### Use the axis argument to compute an aggregate for each column or row
- axis = 0 - one value per column
- axis = 1 - one value per row
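
A small sketch of the `axis` argument on an illustrative 2 by 3 array `m` (not the course matrix), including the `keepdims` option of the aggregate functions:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2 rows, 3 columns

# axis = 0 collapses the rows: one result per column
print(m.sum(axis=0))               # [5 7 9]

# axis = 1 collapses the columns: one result per row
print(m.sum(axis=1))               # [ 6 15]

# Without axis the aggregate is computed over all elements
print(m.sum())                     # 21

# keepdims=True keeps the collapsed axis with length 1,
# so the result can be broadcast against the original array
row_means = m.mean(axis=1, keepdims=True)
print(row_means.shape)             # (2, 1)
```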

In [None]:
# compute sum for each column 
# using the array sum method (np.ndarray.sum)

matrix.sum(axis = 0)


In [None]:
# using the np.sum function

np.sum(matrix, axis = 0)

In [None]:
####
matrix 

In [None]:
# Compute the row mean

matrix.mean(axis = 1)
 

#### Compute unique values and counts for matrix elements



In [None]:
# Compute unique values and counts
# The result is a tuple of two arrays: the unique values and their corresponding counts

test_matrix = np.array([[ 5,  2,  3],
       [ 4,  5,  6],
       [ 3,  3,  2],
       [4, 2, 3]])
uvals, counts = np.unique(test_matrix, return_counts=True)
print(uvals,counts)

https://www.w3resource.com/python-exercises/numpy/index.php


___
#### <font color = "red">Exercise</font>

Subtract 4 from the values of the matrix array that are divisible by 4.
- Hint: use conditional subsetting to get the required values, then assignment to subtract 4 from those values

In [None]:
matrix 

___
#### <font color = "red">Exercise</font>

Normalize the matrix values.    
For each row, subtract the mean and divide by the standard deviation.

In [None]:
matrix

____
#### <font color = "red">Exercise</font>

Create a random array (5 by 7) from the standard normal distribution and compute: 
   * the mean of all elements 
   * the max of the rows  
   * the sum of the columns

#### Find indices of elements that meet condition - where function

In [None]:
# the where function finds the indices of elements that meet a condition
# it returns a tuple of n arrays (n = number of dimensions)
# in our example the arrays in the tuple contain the row indices 
# and, respectively, the column indices of the elements

positions = np.where(matrix  < 20) 
positions

In [None]:
matrix[positions]

In [None]:
# help(np.where)

In [None]:
pos = np.where(matrix == 3)
pos

In [None]:
matrix[pos]

#### RESOURCES

http://scipy-lectures.org/intro/numpy/array_object.html#what-are-numpy-and-numpy-arrays   
https://www.python-course.eu/numpy.php   
https://numpy.org/devdocs/user/quickstart.html#universal-functions   
https://www.geeksforgeeks.org/python-numpy/

_____

<img src = "https://blog.thedataincubator.com/wp-content/uploads/2018/02/Numpypandas.png" width = 300/>

### Pandas



[Pandas](https://pandas.pydata.org/) is a high-performance library that makes familiar data structures, like `data.frame` from R, and appropriate data analysis tools available to Python users.

#### How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

* A NumPy array stores values of a single type.
* A pandas DataFrame can store many types, one per column. 
* Think of a table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
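
The one-type-per-column idea can be sketched with a small table. The column names and the `n_samples` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Each column holds one type, but different columns can have different types
df_types = pd.DataFrame({
    "gene": ["EGFR", "IL6", "BRAF"],        # text
    "expression": [2.5, 10.2, 6.7],         # floats
    "n_samples": [12, 8, 15],               # integers
})

# One data type is reported per column
print(df_types.dtypes)

# A numeric column can be retrieved as a NumPy array of a single type
expression_values = df_types["expression"].to_numpy()
print(type(expression_values), expression_values.dtype)
```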

#### Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)

#### Importing

Pandas is built on top of NumPy, so it is useful to import NumPy at the same time, although this is not required.

```python
import numpy as np
import pandas as pd


```

___
#### 1. `pd.Series`

#### **One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types
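
A minimal sketch of the three arguments, with illustrative values; it also shows that index labels do not have to be unique.

```python
import pandas as pd

# data from a list, labels as index, and an explicit dtype
s = pd.Series(data=[3, 4, 5], index=["gene", "protein", "gene"], dtype=float)

# Index labels do not have to be unique: selecting "gene" returns both rows
print(s["gene"])

# Without an index, the row number (0, 1, 2, ...) is used
s_default = pd.Series([3, 4, 5])
print(list(s_default.index))      # [0, 1, 2]

# A scalar value is repeated for every index label
s_scalar = pd.Series(0, index=["a", "b"])
print(s_scalar.tolist())          # [0, 0]
```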

Attributes 

['T',
 'array',
 'at',
 'axes',
 'base',
 'data',
 'dtype',
 'dtypes',
 'empty',
 'flags',
 'ftype',
 'ftypes',
 'hasnans',
 'iat',
 'iloc',
 'imag',
 'index',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'itemsize',
 'ix',
 'loc',
 'name',
 'nbytes',
 'ndim',
 'plot',
 'real',
 'shape',
 'size',
 'strides',
 'timetuple',
 'values']
 
 
 Methods
 
 ['abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'asfreq',
 'asof',
 'astype',
 'at_time',
 'autocorr',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'combine',
 'combine_first',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'duplicated',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'head',
 'hist',
 'idxmax',
 'idxmin',
 'infer_objects',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'iteritems',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'shift',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'var',
 'view',
 'where',
 'xs']

#### Create Series/data column from a Python list

In [None]:
import numpy as np
import pandas as pd

In [None]:
labels = ["gene","protein","miRNA","metabolites"]
values = [3,4,5,6]
series_named_val = pd.Series(data = values, index=labels)


#### Create Series/data column from dictionary

In [None]:
dict_var = dict(zip(labels, values))
pd.Series(dict_var)

In [None]:
dict_var = {"EGFR":2.5, "IL6":10.2, "BRAF":6.7, "ABL":5.3}
# Create new series
new_series = pd.Series(data = dict_var)
new_series

#### idxmax() - Return the index of the row with the max value



In [None]:
# help(new_series.idxmax)

In [None]:
# Return the index of the row with the max value
new_series.idxmax()

#### describe() - Generate descriptive statistics



In [None]:
# generate descriptive statistics
new_series.describe()

#### isna() - check for missing values
#### other similar functions: isnull(), notna(), isin()
#### np.info(np.isnan)
#### another useful function: dropna() - remove rows with None or missing values from the column
#### numpy objects: np.nan, np.inf
#### help(np.nan) 
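
Since `new_series` has no missing values, the functions above are easier to see on a Series that does. A minimal sketch with an illustrative Series `expr`, where two of the values are deliberately missing:

```python
import numpy as np
import pandas as pd

# Illustrative values: None and np.nan are both treated as missing
expr = pd.Series([2.5, np.nan, 6.7, None],
                 index=["EGFR", "IL6", "BRAF", "ABL"])

print(expr.isna())                   # True where the value is missing
print(expr.notna())                  # the opposite of isna()
print(expr.isna().sum())             # number of missing values: 2

# dropna returns a new Series without the missing rows
print(expr.dropna())

# isin checks membership in a list of values (it is not a missing-value test)
print(expr.isin([2.5, 6.7]))

# np.inf is not considered missing
print(pd.Series([1.0, np.inf]).isna().tolist())   # [False, False]
```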

In [None]:
# np.info(np.isnan)
# help(pd.isna)

In [None]:
# check for missing values
new_series.isna()

_____

#### 2. `pd.DataFrame`

#### **Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types

Attributes

['T',
 'at',
 'axes',
 'columns',
 'dtypes',
 'empty',
 'ftypes',
 'iat',
 'iloc',
 'index',
 'ix',
 'loc',
 'ndim',
 'plot',
 'shape',
 'size',
 'style',
 'timetuple',
 'values']

In [None]:
# Labeled 3 x 4 matrix of gene (expression) correlation (%)
# - rows and columns are genes and the values in the matrix are 
# the correlations (in %) between their expression in a given timecourse dataset 

correlation_array = np.arange(40,52).reshape(3,4) # the data = data argument
genes_rows = ["HER2","PIK3CA", "BRAF"] # row labels = index argument
genes_cols = ["HER1","EGFR", "IL6", "INSR"] # column labels = columns argument
df_gene_correlation = pd.DataFrame(correlation_array, genes_rows, genes_cols)
df_gene_correlation


In [None]:
# Explore DataFrame attributes and methods

# Transpose - switch columns and rows

df_gene_correlation.T


In [None]:
# Sort the data based on a specific column

print(df_gene_correlation, "\n")
df_gene_correlation.sort_values(by = 'EGFR', ascending = False)


In [None]:
# Aggregate the data based on an aggregate function and the axis (col = 0, row = 1)
# can apply any aggregate function: np.sum, np.min, np.mean, np.std, np.max, np.median ....

print(df_gene_correlation, "\n")
df_gene_correlation.aggregate(np.mean, 1)


In [None]:
# Number of elements in the dataframe (total)

df_gene_correlation.size


In [None]:
# index = row names

df_gene_correlation.index


In [None]:
# dtypes - data types for each Series/column

df_gene_correlation.dtypes


___
#### <font color = "red">Exercise</font>


Create a 7 by 5 array with values from 20 to 90 going with a step of 2   
Create a list with row names: Gene1, Gene2 ...  
Create a list with column names: GO1, GO2 ...  
Create a DataFrame from the array created with the row names and column names from the respective lists  


#### From `pd.Series`

In [None]:
# Let's look at the gene correlation matrix again 

df_gene_correlation


In [None]:
# Create a Series with 4 values, labeled with the df_gene_correlation column names, to be used as a new row

new_row = pd.Series([1,2,3,4], df_gene_correlation.columns, name = "New Gene")
new_row

_____

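#### Combining dataframes 
#### Row-wise (`concat`)

In [None]:
# Now add a row to the correlation matrix, it has to have a name
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
# to_frame().T turns the Series into a one-row DataFrame

pd.concat([df_gene_correlation, new_row.to_frame().T])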

#### Column-wise (`join`/`concat`)

#### `join`

In [None]:
df_gene_correlation

In [None]:
# Create a new column (Series), provide rownames (index) and name

new_col = pd.Series([1,2], index = ["HER2", "BRAF"], name = "Gene Col")
new_col


In [None]:
# join a column with fewer rows - NaN is added for the missing values

df_gene_correlation.join(new_col)


#### `concat`

In [None]:
# Add column with the same size

new_col1 = pd.Series([1,2,3], index = df_gene_correlation.index, name = "New Gene Col")
pd.concat([df_gene_correlation, new_col1], axis = 1)


In [None]:
# Unequal size (remove the last element from the column and try adding it)
# It will add NaN for the missing value

pd.concat([df_gene_correlation, new_col1[:-1]], axis = 1)


____ 

#### <b>Reading and writing files (I/O) with Pandas</b>

One of the most common reasons people use pandas is to bring data in without having to deal with file I/O, delimiters, and type conversion. Pandas deals with a lot of this.
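
A round trip through CSV text shows what pandas handles here. This minimal sketch writes to an in-memory buffer (`io.StringIO`) instead of a file on disk, so it leaves no files behind; the small table `df_demo` is illustrative.

```python
import io
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(6, dtype="int64").reshape(2, 3),
                       index=["HER2", "BRAF"],
                       columns=["EGFR", "IL6", "INSR"])

# Write to an in-memory text buffer; row names become the first column
buffer = io.StringIO()
df_demo.to_csv(buffer)
print(buffer.getvalue())

# Read it back: index_col = 0 turns the first column back into row names,
# the delimiter is handled and the integer type of the values is inferred
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=0)
print(df_back)
print(df_back.dtypes)
```

The file-based cells below follow the same pattern, with a file name in place of the buffer.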

#### CSV Files

#### Output - write to file

You can easily save your `DataFrame` to a file.

In [None]:
# write the dataframe to the file

df_gene_correlation.to_csv('dataframe_data.csv')


In [None]:
# help(df_gene_correlation.to_csv)

In [None]:
# write the dataframe to the file - include row names (index = True is the default)

df_gene_correlation.to_csv('dataframe_data.csv', index = True)


In [None]:
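# df_gene_go is assumed to be the DataFrame created in the Gene/GO exercise above; define it before running this cell
df_gene_go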

#### Input - read from file

You can easily bring data from a file into a `DataFrame`.

In [None]:
# read dataframe from file and also mention the column where the rownames are (0 - first column)

pd.read_csv('dataframe_data.csv', index_col = 0)


In [None]:
# help(pd.read_csv)

#### Excel Files

In [None]:
# Output
df_gene_correlation.to_excel('excel_output.xlsx')
# Input
pd.read_excel('excel_output.xlsx')

#### TSV Files

In [None]:
# Output 
df_gene_correlation.to_csv('tsv_output.tsv', sep="\t")
# Input
pd.read_csv('tsv_output.tsv', sep="\t").tail()

#### Clipboard

#### Copy

In [None]:
df_gene_correlation.to_clipboard()

In [None]:
# Paste here


#### Paste

In [None]:
pd.read_clipboard()

___
### <b>Indexing/Exploring/Manipulating dataframes in Pandas</b>

Standard `'[]'` indexing/slicing can be used, as well as `'.'` attribute access.

#### There are 2 pandas-specific methods for indexing:
####   1.  ```.loc``` - primarily label/name-based
####   2.  `.iloc` - primarily integer/position-based
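
The contrast between the two is easiest to see when the labels are integers that differ from the positions. A minimal sketch with an illustrative Series `letters`:

```python
import pandas as pd

# Integer labels that differ from the positions make the contrast visible
letters = pd.Series(["a", "b", "c", "d"], index=[10, 20, 30, 40])

print(letters.loc[20])               # label 20    -> b
print(letters.iloc[2])               # position 2  -> c

# Label slices include the stop label; position slices exclude the stop position
print(letters.loc[10:30].tolist())   # ['a', 'b', 'c']
print(letters.iloc[0:2].tolist())    # ['a', 'b']
```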

In [None]:
# Create some data to work with
row_labels = ["row"+str(i) for i in range(10)]
col_labels = ["col"+str(i) for i in range(8)]

""" 
Create a DataFrame from a 10 by 8 array with values from 1 to 80, 
add the row_labels and col_labels we just created 
"""
data_array = np.arange(1,81).reshape(10,8)
data_array
df_example = pd.DataFrame(data_array,row_labels,col_labels)
df_example


#### Pandas allows you to do random sampling from the dataframe

In [None]:
# Sample 5 random rows (all columns are kept)

df_small = df_example.sample(n=5)
df_small

In [None]:
### 

df_example

_____
#### Slicing `'[]'` on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
df_example[4:8]

___

#### `'.'` operators and a column name can select a **specific named** column

In [None]:
### 

df_example

In [None]:
df_example.col4

`'.'` operator selected columns are now just a `pd.Series` and can be `'[]'` sliced on further

In [None]:
### 

df_example

In [None]:
df_example.col4[4:8]

However, if it is a named column that doesn't fit well as a `'.'` name, you can use `'[]'` selection as well

In [None]:
### 

df_example

In [None]:
df_example["col4"][:3]

In [None]:
### 

df_example

Named rows can be selected by a range of the names

In [None]:
df_example['row2':'row5']

_____

#### Selection <b>BY NAME</b>: the `.loc` method

```python
# .loc syntax
df.loc[row indexer, column indexer]
```

<b>A slice of specific items (based on label) - start and stop included</b>

In [None]:
### 

df_example

In [None]:
df_example.loc['row2':'row5', 'col1':'col6']

#### Boolean indexing

In [None]:
### 

df_example

In [None]:
df_example.loc[df_example.col2 < 30]

___

#### Selection <b>BY POSITION</b>: the `.iloc` method

```python
# .iloc syntax
df.iloc[row indices, column indices]
```

<b>A slice of specific items (based on position)</b>

In [None]:
### 

df_example

In [None]:
df_example.iloc[3:8,2]

In [None]:
# we can use a list of indices

df_example.iloc[3:8,[2,4,6]]

____
#### Quick <b>Exploration</b> of the data

In [None]:
### 

df_example

In [None]:
# describe() - summary statistics for a column

df_example.col1.describe()


In [None]:
# aggregate - Aggregate the values on a given column using a specific function

df_example.col1.aggregate(sum)


#### Object Manipulation - use conditional subsetting and assignment to change values in a dataframe

In [None]:
df_example

In [None]:
# some plotting

df_example.col5.plot()

In [None]:
# Series/column of random numbers for each day for 1000 days
ts = pd.Series(np.random.randn(1000),
                  index=pd.date_range('1/1/2000', periods=1000))

# Check the first 5 rows/elements
ts.head()

In [None]:
# Check the last 5 rows/elements
ts.tail()

In [None]:
ts = ts.cumsum()
ts.plot()

In [None]:
# some plotting

# Dataframe of random numbers for each day for 1000 days for 4 categories/columns ABCD
df = pd.DataFrame(np.random.randn(1000, 4),
                      index=ts.index, columns=list('ABCD'))
 

df = df.cumsum()

# df.plot() creates its own figure and draws one line per column
df.plot();

In [None]:
# Select only the rows that meet the condition (col2 > 30) and,
# in those rows, set the values of the selected columns (col2 and col4) to 0

df_example.loc[df_example.col2 > 30, ['col2',"col4"]] = 0 


In [None]:
df_example

____
#### <font color = "red">Exercise</font>

Replace all the 0 values in df_example with 200.

In [None]:
# Reset the example dataframe

df_example = pd.DataFrame(data_array, index=row_labels, columns=col_labels)
df_example


In [None]:
df_example[(df_example >= 20) & (df_example <= 60) & (df_example%2 == 0)] = np.nan

In [None]:
df_example

In [None]:
# hasnans - Checks whether a column (pd.Series) contains missing values

df_example.col1.hasnans
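
Beyond checking a single column, missing values can be counted, filled, or dropped. A sketch on a small made-up frame (`demo` is an assumption):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                     "col2": [np.nan, np.nan, 6.0]})

print(demo.col1.hasnans)             # True

nan_per_column = demo.isna().sum()   # number of missing values in each column
filled = demo.fillna(0)              # new frame with NaN replaced by 0
complete_rows = demo.dropna()        # new frame keeping only rows without NaN
```

Both `fillna` and `dropna` return new objects and leave `demo` unchanged.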


___
#### Melt a dataframe - convert columns data to row data
##### Useful to apply statistics on the long format of the data where the categories of a variable are in values of a column

In [None]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 11, 1: 33, 2: 55},
                   'C': {0: 22, 1: 44, 2: 66}})
df

In [None]:
# melt - Useful function to change a DataFrame into a format where one or more columns are identifier variables (id_vars), 
# while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, 
# leaving just two non-identifier columns, ‘variable’ and ‘value’.

pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
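
As a sketch of why the long format is convenient for statistics, the melted frame can be grouped by the `variable` column, and `pivot` takes it back to the wide format. The frame below has the same values as the example above:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"], "B": [11, 33, 55], "C": [22, 44, 66]})
long_df = pd.melt(df, id_vars=["A"], value_vars=["B", "C"])

# 3 rows x 2 measured columns become 6 rows: columns A, variable, value
print(long_df.shape)                 # (6, 3)

# Per-category statistics are a single groupby on the 'variable' column
means = long_df.groupby("variable")["value"].mean()

# pivot() reverses the melt: long format back to wide format
wide_again = long_df.pivot(index="A", columns="variable", values="value")
```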


____
#### Load dataframe from online file 



In [None]:
# The Iris dataset describes 150 iris flowers from 3 species (50 flowers per species),
# with 4 measured characteristics per flower: sepal length, sepal width, petal length and petal width

df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris


#### <font color = "red">Exercise</font>

Answer the following questions by writing code:
* How many rows and columns does the dataset have?
* How many flowers with petal length > 4 and petal width > 2 are there?



#### RESOURCES

https://www.python-course.eu/pandas.php  
https://www.python-course.eu/numpy.php    
https://scipy-lectures.org/packages/statistics/index.html?highlight=pandas  
https://www.geeksforgeeks.org/pandas-tutorial/

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

____
_____
## Data Visualization

____

#### `matplotlib` - powerful basic plotting library
https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html

`matplotlib.pyplot` is a collection of command style functions that make matplotlib work like MATLAB. <br>
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In `matplotlib.pyplot` various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes.<br>
(Note that "axes" here and in most places in the documentation refers to the axes part of a figure, not the strict mathematical term for more than one axis.)


https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533
https://matplotlib.org
https://matplotlib.org/tutorials/    
https://github.com/rougier/matplotlib-tutorial     
https://www.tutorialspoint.com/matplotlib/matplotlib_pyplot_api.htm    
https://realpython.com/python-matplotlib-guide/    
https://github.com/matplotlib/AnatomyOfMatplotlib    
https://www.w3schools.com/python/matplotlib_pyplot.asp   
http://scipy-lectures.org/intro/matplotlib/index.html

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

Call signatures:
```
    plot([x], y, [fmt], data=None, **kwargs)
    plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
```

Quick plot

The main usage of `plt` is the `plot()` and `show()` functions

In [None]:
plt.plot()
plt.show()
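
The "current figure" and "current axes" that pyplot keeps track of can be inspected with `plt.gcf()` and `plt.gca()`. A minimal sketch (the `Agg` backend line is only there so the code also runs as a plain script; it is an assumption about the environment and can be dropped inside a notebook):

```python
import matplotlib
matplotlib.use("Agg")              # headless backend for scripts; omit inside Jupyter
import matplotlib.pyplot as plt

fig = plt.figure()                 # becomes the current figure
ax = plt.subplot(1, 1, 1)          # becomes the current axes

plt.plot([1, 2, 3])                # drawn on the current axes
plt.ylabel("numbers")              # label set on the current axes

print(plt.gcf() is fig, plt.gca() is ax)   # True True
print(len(ax.lines), ax.get_ylabel())      # 1 numbers
plt.close(fig)
```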

Plotting a single list: the values are used as y, and the x values default to the indices 0, 1, 2, ...

In [None]:
plt.plot([8, 24, 27, 42])
plt.ylabel('numbers')
plt.show()

In [None]:
# Plot the two lists, add axes labels
x=[4,5,6,7]
y=[2,5,1,7]
plt.plot(x,y)
plt.xlabel("x numerical values")
plt.ylabel("y numerical values")
plt.show()

`matplotlib` can use *format strings* to quickly declare the type of plots you want. Here are *some* of those formats:

|**Character**|**Description**|
|:-----------:|:--------------|
|'--'|Dashed line|
|':'|Dotted line|
|'o'|Circle marker|
|'^'|Upwards triangle marker|
|'s'|Square marker|
|'b'|Blue|
|'c'|Cyan|
|'g'|Green|
|'r'|Red|
|'m'|Magenta|

In [None]:
plt.plot([3, 4, 9, 20], 'gs--')
plt.axis([-1, 4, 0, 25])
plt.show()

In [None]:
plt.plot([3, 4, 9, 20], '^b--', linewidth=2, markersize=12)
plt.show()

In [None]:
plt.plot([3, 4, 9, 20], color='blue', marker='^', linestyle='dashed', linewidth=2, markersize=12)
plt.show()
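
A format string is shorthand for the `color`, `marker` and `linestyle` keyword arguments. The sketch below draws the same line both ways and compares the resulting line properties (the `Agg` backend line is an assumption for running outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")              # headless backend for scripts; omit inside Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# plot() returns a list of Line2D objects, one per plotted line
(short,) = ax.plot([3, 4, 9, 20], "gs--")
(spelled_out,) = ax.plot([3, 4, 9, 20], color="g", marker="s", linestyle="--")

for line in (short, spelled_out):
    print(line.get_color(), line.get_marker(), line.get_linestyle())
plt.close(fig)
```

Keyword arguments are longer to type but easier to read, and they are the only way to set properties that have no format character, such as `linewidth` or `markersize`.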

In [None]:
#import numpy as np

# Plot a list with 10 numbers with a magenta dotted line and circles for points.
#numbers = [2,25,16,10,4,5,6,22,19,23]
numbers = np.random.rand(10)
plt.plot(numbers, "mo:")
plt.show()

In [None]:
# help(plt.plot)

In [None]:
#import numpy as np

# evenly sampled time 
time = np.arange(0, 7, 0.3)
# gene expression
ge = np.arange(1, 8, 0.3)

# red dashes, blue squares, magenta squares joined by a dotted line, and green triangles
plt.plot(time, ge, 'r--', time, ge**2, 'bs', time, ge**2.5, 'ms:', time, ge**3, 'g^')
plt.show()

Line styles accepted by the `linestyle` (or `ls`) argument include `'-'`, `'--'`, `'-.'` and `':'`.

In [None]:
# Categorical data plotting using categories on the x axis
# We also use the figure function to create a more complex figure (figsize = (width, height) in inches)
# and subplot to place multiple sub-plots at different positions in the figure
# subplot(131) means nrows=1, ncols=3, index=1 (the first of three side-by-side plots)
# Different types of plots: bar, scatter, and histogram
 
names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(100)

plt.figure(figsize=(12, 3))

plt.subplot(131)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.hist(values1)
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# help(plt.subplot)

In [None]:
# Add another subplot with another color

names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(10000)

plt.figure(figsize=(15, 3))

plt.subplot(141)
plt.bar(names, values)
plt.subplot(142)
plt.scatter(names, values)
plt.subplot(143)
plt.hist(values1)
plt.subplot(144)
plt.hist(values2, color = "green")
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# Changing the grid layout

names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(10000)

plt.figure(figsize=(9, 6))

plt.subplot(221)
plt.bar(names, values)
plt.subplot(222)
plt.scatter(names, values)
plt.subplot(223)
plt.hist(values1)
plt.subplot(224)
plt.hist(values2, color = "green")
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# help(plt.bar)

In [None]:
import pandas as pd

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris.head()

In [None]:
x1 = df_iris.petal_length
y1 = df_iris.petal_width

x2 = df_iris.sepal_length
y2 = df_iris.sepal_width

# Plot petal width vs. petal length as green triangles and sepal width vs. sepal length as blue squares

plt.plot(x1, y1, 'g^', x2, y2, 'bs')
plt.show()

#### Histogram

In [None]:
# help(plt.hist)

In [None]:
n, bins, patches = plt.hist(df_iris.petal_length, bins=20,facecolor='#8303A2', alpha=0.8, rwidth=.8, align='mid')
print(n)
# Add a title
plt.title('Iris dataset petal length')

# Add axis labels
plt.ylabel('number of plants')
plt.xlabel('petal length')


plt.show()
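
The three values returned by `hist` are the counts per bin, the bin edges, and the drawn bars. A sketch with made-up data (the `lengths` array is a random stand-in, not the iris column; the `Agg` line is an assumption for running outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")              # headless backend for scripts; omit inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
lengths = rng.normal(loc=4.0, scale=1.5, size=150)   # hypothetical measurements

fig, ax = plt.subplots()
n, bins, patches = ax.hist(lengths, bins=20)

print(len(n), len(bins), len(patches))   # 20 21 20
print(int(n.sum()))                      # 150
plt.close(fig)
```

There is always one more bin edge than there are bins, and the counts add up to the number of observations.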

#### Boxplot

In [None]:
# help(plt.boxplot)

In [None]:
plt.boxplot(df_iris.petal_length)

# Add a title
plt.title('Iris dataset petal length')

# Add y axis label
plt.ylabel('petal length')

The biggest issue with `matplotlib` isn't its lack of power...it is that it is too much power. With great power, comes great responsibility. When you are quickly exploring data, you don't want to have to fiddle around with axis limits, colors, figure sizes, etc. Yes, you *can* make good figures with `matplotlib`, but you probably won't.

https://python-graph-gallery.com/matplotlib/

Pandas plotting is built on `matplotlib` by default. You can start visualizing dataframes and series with a single command.

#### Using pandas `.plot()`

Pandas abstracts away some of those initial issues with data visualization. However, the result is still a `matplotlib` plot.<br><br>
Every plot that is returned from `pandas` is subject to `matplotlib` modification.
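
As a sketch of what "subject to `matplotlib` modification" means: the pandas plotting call returns a matplotlib `Axes`, which can then be customized with the usual methods. The `demo` frame is made up for illustration, and the `Agg` line is an assumption for running outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")              # headless backend for scripts; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

ax = demo.plot(x="x", y="y", kind="scatter")   # pandas returns a matplotlib Axes
ax.set_title("y = x squared")                  # regular matplotlib calls work on it
ax.set_xlabel("x value")
ax.axhline(8, color="red", linestyle="--")     # add a horizontal reference line

print(type(ax).__module__.startswith("matplotlib"))   # True
plt.close(ax.figure)
```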

In [None]:
df_iris.plot.box()
plt.show()

In [None]:
df_iris.head()

In [None]:
# Plot the histogram of the petal lengths
df_iris.petal_length.plot.hist()
plt.show()



In [None]:
# Plot the histograms of all 4 numerical characteristics in one plot
df_iris.plot.hist()
plt.show()

In [None]:
df_iris.groupby("species")['petal_length'].mean().plot(kind='bar')
plt.show()

In [None]:
df_iris.groupby("species")['sepal_length'].sum().plot(kind='bar',color = "green")
plt.show()

In [None]:
df_iris.plot(x='petal_length', y='petal_width', kind = "scatter")
plt.savefig('output.png')

In [None]:
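# With %matplotlib inline the figure is closed at the end of the cell that drew it,
# so calling savefig here would overwrite output.png with an empty figure.
# Keep savefig in the same cell as the plot.
# plt.savefig('output.png')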
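
One way to avoid depending on which figure is "current" is to save through the `Figure` object itself. A sketch under stated assumptions: the `demo` frame and the output file name are hypothetical, and the `Agg` line is only for running outside a notebook:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # headless backend for scripts; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

ax = demo.plot(x="x", y="y", kind="scatter")
fig = ax.get_figure()              # the Figure that owns this Axes

# Hypothetical output location inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "scatter_demo.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")

print(os.path.getsize(out_path) > 0)   # True
plt.close(fig)
```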

https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533

#### Multiple Plots

In [None]:
df_iris.petal_length.plot(kind='density')
df_iris.sepal_length.plot(kind='density')
df_iris.petal_width.plot(kind='density')
plt.show()

`matplotlib` allows users to define the regions of their plotting canvas. If the user intends to create a canvas with multiple plots, they would use the `subplot()` function. The `subplot` function sets the number of rows and columns the canvas will have **AND** sets the current index of where the next subplot will be rendered.
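
An alternative to calling `subplot` repeatedly is `plt.subplots`, which creates the whole grid at once and returns every `Axes`; pandas plots can then be sent to a specific cell with the `ax` argument. A sketch with made-up data (the `demo` frame is an assumption, and histograms are used here instead of density plots to keep the example free of extra dependencies):

```python
import matplotlib
matplotlib.use("Agg")              # headless backend for scripts; omit inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# subplot(3, 1, 2) and subplot(312) both mean: 3 rows, 1 column, second cell.
# plt.subplots builds the same 3x1 grid in one call.
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(6, 8), sharex=True)

for ax, column in zip(axes, demo.columns):
    demo[column].plot(kind="hist", ax=ax, title=column)   # ax= picks the target cell

fig.tight_layout()
print(axes.shape)                  # (3,)
plt.close(fig)
```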

In [None]:
plt.figure(1)
# Plot all three columns from df in different subplots
# Rows first index (top-left)
plt.subplot(3, 1, 1)
df_iris.petal_length.plot(kind='density')
plt.subplot(3, 1, 2)
df_iris.sepal_length.plot(kind='density')
plt.subplot(3, 1, 3)
df_iris.petal_width.plot(kind='density')
# Some plot configuration
plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
plt.show()

In [None]:
# Temporary styles
with plt.style.context(('ggplot')):
    plt.figure(1)

    # Plot all three columns from df in different subplots
    # Rows first index (top-left)
    plt.subplot(3, 1, 1)
    df_iris.petal_length.plot(kind='density')
    plt.subplot(3, 1, 2)
    df_iris.sepal_length.plot(kind='density')
    plt.subplot(3, 1, 3)
    df_iris.petal_width.plot(kind='density')
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
    plt.show()

In [None]:
# Plot the histograms of the petal length and width and sepal length and width 
# Display them on the columns of a figure with 2X2 subplots
# color them red, green, blue and yellow, respectively


with plt.style.context(('ggplot')):
    plt.figure(1)

    # Plot each of the columns from the df in different subplots
    # Rows first index (top-left)
    plt.subplot(2, 2, 1)
    df_iris.petal_length.plot(kind='hist', color = "red")
    plt.xlabel("petal length")
    plt.subplot(2, 2, 2)
    df_iris.sepal_length.plot.hist(color = "blue")
    plt.xlabel("sepal length")
    plt.subplot(2, 2, 3)
    df_iris.petal_width.plot(kind='hist', color = "green")
    plt.xlabel("petal width")
    plt.subplot(2, 2, 4)
    df_iris.sepal_width.plot.hist(color = "yellow")
    plt.xlabel("sepal width")
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.001, left=.1, right=.95, hspace=.30, wspace=.35)
    plt.show()

In [None]:
# Adjusting the plot configuration

with plt.style.context(('ggplot')):
    plt.figure(1)

    # Plot each of the columns from the df in different subplots
    # Rows first index (top-left)
    plt.subplot(2, 2, 1)
    df_iris.petal_length.plot(kind='box', color = "red")
    plt.subplot(2, 2, 2)
    df_iris.sepal_length.plot.box(color = "blue")
    plt.subplot(2, 2, 3)
    df_iris.petal_width.plot(kind='hist', color = "green")
    plt.xlabel("petal width")
    plt.subplot(2, 2, 4)
    df_iris.sepal_width.plot.hist(color = "yellow")
    plt.xlabel("sepal width")
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
    plt.show()

In [None]:
# dir(df_iris.petal_length.plot)

____________

### `seaborn` - dataset-oriented plotting

Seaborn is a library that specializes in making *prettier* `matplotlib` plots of statistical data. <br>
It is built on top of matplotlib and closely integrated with pandas data structures.

https://seaborn.pydata.org/introduction.html<br>
https://python-graph-gallery.com/seaborn/   
https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
https://seaborn.pydata.org/tutorial/distributions.html

In [None]:
import seaborn as sns

`seaborn` lets users *style* their plotting environment.

In [None]:
sns.set(style='whitegrid')

However, you can always use `matplotlib`'s `plt.style`

In [None]:
#dir(sns)

In [None]:
sns.scatterplot(x='petal_length',y='petal_width',data=df_iris)
plt.show()

In [None]:
# hue argument allows you to color dots by category

sns.scatterplot(x='petal_length',y='petal_width', hue = "species", data=df_iris)
plt.show()

#### Violin plot

Fancier box plot that gets rid of the need for 'jitter' to show the inherent distribution of the data points

In [None]:
columns = ['petal_length', 'petal_width', 'sepal_length']

fig, axes = plt.subplots(figsize=(5, 5))
sns.violinplot(data=df_iris.loc[:,columns], ax=axes)
axes.set_ylabel('number')
axes.set_xlabel('columns')
plt.show()

#### Distplot

In [None]:
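# A distplot plots a univariate distribution of observations.
# The distplot() function combines the matplotlib hist function with the seaborn kdeplot() and rugplot() functions.
# Note: distplot() is deprecated since seaborn 0.11 in favor of histplot(), kdeplot() and displot().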

sns.set(style='darkgrid', palette='muted')

# 4 rows, 1 column - all have the same x axis
f, axes = plt.subplots(4,1, figsize=(10,10), sharex=True)
sns.despine(left=True)

# Regular distplot (histogram plus kernel density estimate)
sns.distplot(df_iris.petal_length, ax=axes[0])

# Change the color
sns.distplot(df_iris.petal_width, kde=False, ax=axes[1], color='orange')

# Show the Kernel density estimate
sns.distplot(df_iris.sepal_width, hist=False, kde_kws={'shade':True}, ax=axes[2], color='purple')

# Show the rug
sns.distplot(df_iris.sepal_length, hist=False, rug=True, ax=axes[3], color='green')

#### FacetGrid

In [None]:
sns.set()
columns = ['species', 'petal_length', 'petal_width']
facet_column = 'species'
g = sns.FacetGrid(df_iris.loc[:,columns], col=facet_column, hue=facet_column, col_wrap=5)
g.map(plt.scatter, 'petal_length', 'petal_width')

In [None]:
sns.relplot(x="petal_length", y="petal_width", col="species",
            hue="species", style="species", size="species",
            data=df_iris)
plt.show()

____

### `plotnine` - grammar of graphics - R ggplot2 in python

plotnine is an implementation of a grammar of graphics in Python, based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.

Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and then create, while simple plots remain simple.



https://plotnine.readthedocs.io/en/stable/   
http://cmdlinetips.com/2018/05/plotnine-a-python-library-to-use-ggplot2-in-python/  
https://plotnine.readthedocs.io/en/stable/tutorials/miscellaneous-altering-colors.html   
https://datascienceworkshops.com/blog/plotnine-grammar-of-graphics-for-python/   
https://realpython.com/ggplot-python/



### Uncomment and run the following line to install the library

In [None]:
# !pip install plotnine

In [None]:
from plotnine import *

In [None]:
ggplot(data=df_iris) + geom_point(aes(x="petal_length", y = "petal_width"))

In [None]:
# add transparency - to avoid over plotting - alpha argument
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point(alpha=0.7)

In [None]:
# change point size 
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point(size = 0.7, alpha=0.7)

In [None]:
# more parameters - scale_x_log10 - transform x axis values to log scale, xlab - add label to x axis
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") \
    + geom_point() + scale_x_log10() + xlab("Petal Length")

In [None]:
n = "3"
ft = "length and width"
title = 'species : ' + n + ', petal : ' + ft  

ggplot(data=df_iris) +aes(x='petal_length',y='petal_width',color="species") + \
    geom_point(size=0.7,alpha=0.7) + facet_wrap('~species',nrow=3) + \
    theme(figure_size=(9,5)) + ggtitle(title)


In [None]:
# Set width of bar for histogram and color for the bar line and bar fill color

p = ggplot(data=df_iris) + aes(x='petal_length') + geom_histogram(binwidth=1,color='black',fill='grey')
p

In [None]:
# Save the plot to a file using the save() method of the ggplot object

p.save(filename='hist_plot_with_plotnine.png')


#### Quick analysis on the `mtcars` dataset:

| variable | description                              |
|----------|------------------------------------------|
| mpg      | Miles/(US) gallon                        |
| cyl      | Number of cylinders                      |
| disp     | Displacement (cu.in.)                    |
| hp       | Gross horsepower                         |
| drat     | Rear axle ratio                          |
| wt       | Weight (lb/1000)                         |
| qsec     | 1/4 mile time                            |
| vs       | Engine (0 = V-shaped, 1 = straight)      |
| am       | Transmission (0 = automatic, 1 = manual) |
| gear     | Number of forward gears                  |
| carb     | Number of carburetors                    |

In [None]:
# Fit a linear regression line that uses the weight of the car to explain/predict the miles per gallon
# The data are split into 3 panels by number of gears
# The grey area is the 95% confidence interval for predictions from a linear model ("lm")

from plotnine.data import mtcars

(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
 + geom_point()
 + stat_smooth(method='lm')
 + facet_wrap('~gear'))


https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

#### <font color = "red">Exercise</font>

* Use ggplot to plot the sepal_length in boxplots separated by species, add new axes labels and make the y axis values log10.

* Write a function that takes a row of the dataframe as a parameter and, depending on the species, returns:
    - the petal_length for setosa
    - the petal_width for versicolor
    - the sepal_length for virginica

* Apply this function to every row in the dataset and save the result in an array.
* Use ggplot to make a histogram of the values.