# Introduction to Scientific Computing II

Last week, we did a gentle introduction to manipulating n-dimensional arrays using the NumPy library.

This week, we're learning about a couple more important scientific computing libraries:

[**SciPy**](https://docs.scipy.org/doc/scipy/reference/) - SciPy is a collection of mathematical algorithms and utility functions built on the NumPy library. We will look at some, but not all, of the SciPy subpackages including: `spatial`, `sparse`, `stats`, and `linalg`. 

[**Pandas**](https://pandas.pydata.org/docs/user_guide/index.html) - From the website: *fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.* It's a powerful tool for doing "real world" data munging and analysis.

#### Import the Packages

In [None]:
import scipy  # subpackages are imported explicitly where they are used
import pandas as pd
import numpy as np

## SciPy

First up, SciPy!

### Spatial `scipy.spatial`

#### Distance Computations `scipy.spatial.distance`

Let's import the distance functions we'll need!

In [None]:
from scipy.spatial.distance import pdist, cdist, cityblock, euclidean, cosine, jensenshannon, minkowski

You know what I love about SciPy? Its wonderful documentation!

Using the documentation, we can find the equations for these distances!

##### `scipy.spatial.distance.cityblock` - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cityblock.html

![](assets/cityblock.jpg)

##### `scipy.spatial.distance.euclidean` - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html

![](assets/euclidean.jpg)

##### `scipy.spatial.distance.cosine` - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html

![](assets/cosine.jpg)

##### `scipy.spatial.distance.jensenshannon` - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html

![](assets/jensenshannon.jpg)

##### `scipy.spatial.distance.minkowski` - https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html

![](assets/minkowski.jpg)

In [17]:
# Magical Testing Area
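
As a quick experiment for the testing area, here is a minimal sketch that applies the imported distance functions to small made-up inputs. The vectors `u` and `v`, the distributions `p` and `q`, and the `points` array are illustrative values chosen for easy arithmetic, not data from the lecture.

```python
import numpy as np
from scipy.spatial.distance import (
    pdist, cdist, cityblock, euclidean, cosine, jensenshannon, minkowski
)

# Two illustrative vectors (assumed values, picked so the math is easy to check)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 8.0])

d_city = cityblock(u, v)        # |1-4| + |2-6| + |3-8| = 12
d_euc = euclidean(u, v)         # sqrt(9 + 16 + 25) = sqrt(50)
d_mink1 = minkowski(u, v, p=1)  # p=1 reduces to the cityblock distance
d_mink2 = minkowski(u, v, p=2)  # p=2 reduces to the Euclidean distance

# Cosine distance is 1 - cos(angle); orthogonal vectors give 1
d_cos = cosine([1.0, 0.0], [0.0, 1.0])

# Jensen-Shannon distance compares two probability distributions
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
d_js = jensenshannon(p, q)

# pdist: all pairwise distances within one set of points (condensed form)
# cdist: distances between every point in one set and every point in another
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
condensed = pdist(points)         # pairs (0,1), (0,2), (1,2)
pairwise = cdist(points, points)  # full 3x3 distance matrix

print(d_city, d_euc, d_mink1, d_mink2, d_cos, d_js)
print(condensed)
print(pairwise)
```

`pdist` and `cdist` default to the Euclidean metric; both accept a `metric` argument (for example `metric="cityblock"`) if you want one of the other distances.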

### Sparse Matrices `scipy.sparse`

Consider the following example: You go shopping at Wal-Mart and only have 3 small items you plan on buying. Would it be better to take a large cart or just a small basket? You're right! The small basket, because then you save space!

Similarly, with sparse matrices, we store only what we need to: our non-zero values!

![](assets/sparsevsdense.png)

There are multiple sparse matrix formats (for example CSR, CSC, and COO), each suited to different operations, but for our purposes the choice is not that big of a deal.
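
To make the "store only the non-zero values" idea concrete, here is a small sketch using the CSR format. The `dense` array is a made-up example, and CSR is just one of the available formats.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix (illustrative values)
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [4, 0, 0]])

sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (non-zero) entries
print(sparse.data)     # the stored values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # where each row starts and ends inside `data`

# Converting back recovers the original dense array
restored = sparse.toarray()
```

Only two of the nine entries are kept, along with the bookkeeping needed to place them back in the right position.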

Here are some things you can do with sparse matrices:

- Create graph representations and compute graph characteristics
- Linear Algebra \[covering later\]

In [22]:
from scipy.sparse.csgraph import dijkstra

dijkstra([[0, 4, 0, 0, 0, 0, 0, 8, 0], 
        [4, 0, 8, 0, 0, 0, 0, 11, 0], 
        [0, 8, 0, 7, 0, 4, 0, 0, 2], 
        [0, 0, 7, 0, 9, 14, 0, 0, 0], 
        [0, 0, 0, 9, 0, 10, 0, 0, 0], 
        [0, 0, 4, 14, 10, 0, 2, 0, 0], 
        [0, 0, 0, 0, 0, 2, 0, 1, 6], 
        [8, 11, 0, 0, 0, 0, 1, 0, 7], 
        [0, 0, 2, 0, 0, 0, 6, 7, 0] 
        ])

array([[ 0.,  4., 12., 19., 21., 11.,  9.,  8., 14.],
       [ 4.,  0.,  8., 15., 22., 12., 12., 11., 10.],
       [12.,  8.,  0.,  7., 14.,  4.,  6.,  7.,  2.],
       [19., 15.,  7.,  0.,  9., 11., 13., 14.,  9.],
       [21., 22., 14.,  9.,  0., 10., 12., 13., 16.],
       [11., 12.,  4., 11., 10.,  0.,  2.,  3.,  6.],
       [ 9., 12.,  6., 13., 12.,  2.,  0.,  1.,  6.],
       [ 8., 11.,  7., 14., 13.,  3.,  1.,  0.,  7.],
       [14., 10.,  2.,  9., 16.,  6.,  6.,  7.,  0.]])
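
The call above returns the shortest distance between every pair of nodes. As a hedged sketch, the same function can also start from a single node and report the path itself. The 4-node graph below is a made-up example, stored as a sparse matrix with each undirected edge listed once.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

# Illustrative undirected graph: entry [i, j] is the weight of edge i-j
graph = csr_matrix(np.array([[0, 1, 5, 0],
                             [0, 0, 2, 0],
                             [0, 0, 0, 1],
                             [0, 0, 0, 0]]))

# indices=0 restricts the search to paths that start at node 0;
# directed=False lets each stored edge be traversed in both directions
dist, pred = dijkstra(graph, directed=False, indices=0,
                      return_predecessors=True)

# Walk the predecessor array backwards from node 3 to the source.
# Nodes without a predecessor are marked with a negative value.
path = [3]
while pred[path[-1]] >= 0:
    path.append(int(pred[path[-1]]))
path.reverse()

print(dist)  # shortest distance from node 0 to every node
print(path)  # nodes visited on the shortest route from 0 to 3
```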

### Statistical Functions `scipy.stats`

Yay! So this is the module I'm most familiar with!

##### Statistics \*cues shrieks of terror\*

Just kidding! Statistics is great(-ish)!

SciPy provides INNUMERABLE statistical functions, ranging anywhere from random number generation to statistical analysis. I personally know the most about random number generation, so that's where I'm going to focus.

Some of the most common random number functions I use are
`scipy.stats.bernoulli` and `scipy.stats.maxwell`

SciPy makes this convenient: every distribution object exposes the same set of methods.

Ex: `scipy.stats.bernoulli.rvs(p, size=())`

Every continuous distribution has these methods:
`rvs`, `pdf`, `logpdf`, `cdf`, `logcdf`, `mean`, `std`, `var`, etc. Discrete distributions such as `bernoulli` provide `pmf` and `logpmf` in place of `pdf` and `logpdf`.

Each of these methods has its own purpose. As described by the documentation itself:

![](assets/rv_staticmethods.jpg)

In [23]:
# Some statistical distributions to have fun with!
from scipy.stats import bernoulli, boltzmann, norm, multivariate_normal
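
Here is a minimal sketch of the common distribution methods in action, using `bernoulli` and `norm`. The parameter values (`0.3`, `loc=2`, `scale=3`) and the seed passed as `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import bernoulli, norm

# Draw 1000 coin flips that come up 1 with probability 0.3
flips = bernoulli.rvs(0.3, size=1000, random_state=0)
print(flips.mean())  # should land near 0.3

# Discrete distributions use pmf (probability mass function)
print(bernoulli.pmf(1, 0.3))  # probability of drawing a 1
print(bernoulli.mean(0.3), bernoulli.var(0.3))

# Continuous distributions use pdf (probability density function)
density_at_zero = norm.pdf(0.0)  # peak of the standard normal curve
prob_below_zero = norm.cdf(0.0)  # half the mass lies below the mean

# loc and scale shift and stretch the distribution
draws = norm.rvs(loc=2, scale=3, size=5, random_state=0)
print(norm.mean(loc=2, scale=3), norm.std(loc=2, scale=3))
print(draws)
```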

SciPy has innumerable random number generation functions, but it also has some other amazing functions!

In [24]:
# Other amazing functions to demonstrate

# scipy.stats.chisquare
# scipy.stats.zscore
# scipy.stats.cumfreq
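
A short sketch of the three functions named above, run on small made-up inputs. The `observed` counts and the `values` array are illustrative only.

```python
import numpy as np
from scipy.stats import chisquare, zscore, cumfreq

# chisquare: do observed counts differ from a uniform expectation?
observed = np.array([16, 18, 16, 14, 12, 12])
chi = chisquare(observed)
print(chi.statistic, chi.pvalue)  # a large p-value means no evidence against uniform

# zscore: rescale values to have mean 0 and standard deviation 1
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = zscore(values)
print(z)

# cumfreq: cumulative histogram (running total of counts per bin)
cum = cumfreq(values, numbins=5)
print(cum.cumcount)  # running totals, ending at len(values)
print(cum.lowerlimit, cum.binsize)
```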

### Linear Algebra `scipy.linalg`

We'll be doing a linear algebra lecture in a few weeks, so we'll introduce this subpackage then!

## Pandas

A pandas `DataFrame` is a 2-dimensional data structure whose columns can store different types of values. It is very similar to a spreadsheet:

<img alt="Structure of a pandas DataFrame" src="https://pandas.pydata.org/docs/_images/01_table_dataframe1.svg" width="500px"/>

Now, let's manually create a dataframe and fill it with some data.

In [19]:
df = pd.DataFrame({
    "integers" : [1, 0, 2, 0], 
    "floats" : np.random.rand(4), 
    "strings" : ["first", "second", "third", "fourth"]})

In [20]:
df

| | integers | floats | strings |
|---|---|---|---|
| 0 | 1 | 0.301707 | first |
| 1 | 0 | 0.096990 | second |
| 2 | 2 | 0.545391 | third |
| 3 | 0 | 0.289979 | fourth |


Our dataframe has three columns, each with a unique label: `integers`, `floats`, `strings`. Notice how a single dataframe can contain different data types.

### Dataframe attributes and functions

There are various useful attributes of a dataframe that we will explore below:

In [25]:
df.dtypes

integers      int64
floats      float64
strings      object
dtype: object

In [30]:
df.shape

(4, 3)

In [27]:
df.columns

Index(['integers', 'floats', 'strings'], dtype='object')

To preview the first rows of a dataframe, use `head`. Here we ask for only the first row:

In [35]:
df.head(1)

| | integers | floats | strings |
|---|---|---|---|
| 0 | 1 | 0.301707 | first |


### Selecting subsets of a Dataframe

If you're only interested in one column of a dataframe, you can select it by its column label:

In [37]:
df["floats"]

0    0.301707
1    0.096990
2    0.545391
3    0.289979
Name: floats, dtype: float64

The statement above does not modify the original dataframe; it only displays the column `floats`. To keep a subset of columns and overwrite the original dataframe, select them with a list of labels and assign the result back to `df`:

In [43]:
df = df[["floats", "strings"]]
df

| | floats | strings |
|---|---|---|
| 0 | 0.301707 | first |
| 1 | 0.096990 | second |
| 2 | 0.545391 | third |
| 3 | 0.289979 | fourth |


We've now overwritten the original dataframe by selecting the `floats` and `strings` columns to keep. The `integers` column has been removed from the dataframe!
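
One detail worth seeing side by side: a single label returns a `Series`, while a list of labels returns a `DataFrame`. The sketch below rebuilds a small dataframe with fixed values (copied from the output shown earlier, so the result does not depend on `np.random.rand`) and uses a separate name, `example`, so it does not touch `df`.

```python
import pandas as pd

# Fixed values so the example is reproducible
example = pd.DataFrame({
    "integers": [1, 0, 2, 0],
    "floats": [0.301707, 0.096990, 0.545391, 0.289979],
    "strings": ["first", "second", "third", "fourth"],
})

one_column = example["floats"]             # single label -> Series
subset = example[["floats", "strings"]]    # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(subset).__name__)

# Selecting does not change the original until you assign the result back
print(list(example.columns))
print(list(subset.columns))
```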

Let's look at selecting rows based on a conditional expression:

In [45]:
df["floats"] > 0.2

0     True
1    False
2     True
3     True
Name: floats, dtype: bool

The condition checks for rows in `floats` that have a value greater than 0.2. The output is a series of `boolean` values. We can then use the series of `boolean` values to select only rows where the condition is True:

In [46]:
df[df["floats"] > 0.2]

| | floats | strings |
|---|---|---|
| 0 | 0.301707 | first |
| 2 | 0.545391 | third |
| 3 | 0.289979 | fourth |
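
The boolean series can also be stored in a variable, combined with other conditions, or used to pick out a single column. This is a minimal sketch on a rebuilt dataframe with fixed values; the names `example` and `mask` and the upper bound `0.5` are illustrative assumptions.

```python
import pandas as pd

# Fixed values matching the output shown above
example = pd.DataFrame({
    "floats": [0.301707, 0.096990, 0.545391, 0.289979],
    "strings": ["first", "second", "third", "fourth"],
})

# Store the condition so it can be reused
mask = example["floats"] > 0.2
filtered = example[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses
in_range = example[(example["floats"] > 0.2) & (example["floats"] < 0.5)]

# .loc takes a row selector and a column selector together
names = example.loc[mask, "strings"]

print(filtered)
print(in_range)
print(names)
```

Note that Python's `and` and `or` keywords do not work element-wise on a series, which is why `&` and `|` are used here.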


### Reading from a spreadsheet

So far, we've worked with a manually created dataframe. But we can also create dataframes from a spreadsheet in the comma-separated values (CSV) format using `pd.read_csv`. Let's read in a comma-separated file of Titanic passenger data.

In [50]:
pd.read_csv("https://github.com/pandas-dev/pandas/blob/master/doc/data/titanic.csv?raw=1")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
