# Basic Analysis with Python: NumPy and Pandas

## NumPy
* A module for doing math in Python with versatile array objects.
* Great for statistics and data science, and widely used by other modules in those fields!

### Let's make a table with Python's default tools.

Say your engineer friend needs a reference for areas of circles.

In [None]:
# First, we'll need our radii.

radii = list(range(10)) # You can turn a range directly into a list.

print("radii:\n" + str(radii))

In [None]:
# Now let's define how to change radii to areas.


import math # "math" is a module that comes with Python by default.
            # You can load a module with the "import" statement.

from math import pi # If you only need part of a module, you can import just that part.

def area(R):
    return pi*(R**2)

In [None]:
# Exercise: How does "return" work?
# Put an appropriate value in place of "#" in line 4 and run this box.

x = area(#)

print(x)

In [None]:
# Now let's calculate the areas.

areas = [] # an empty list to fill in

for radius in radii:
    areas += [area(radius)] # "x += 1" means "x = x + 1." A convenient shorthand.
                            # This operator works for concatenation, too!
    
print("areas:\n" + str(areas))
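
The same loop can also be written as a list comprehension, which builds the list in one expression. This is a minimal sketch that redefines `radii` and `area` as above so that it runs on its own:

```python
from math import pi

def area(R):
    return pi * (R ** 2)

radii = list(range(10))

# Build the list of areas in a single expression.
areas = [area(radius) for radius in radii]

print("areas:\n" + str(areas))
```

Either form still visits the radii one at a time; the comprehension is only shorter to write.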

### Can we operate directly on a list?

That would be nice, and it would take fewer lines of code.

In [None]:
areas = pi*(radii**2) # (Spoiler: this will produce an error!)

### Nope! Instead, let's try doing that with NumPy's ndarray object.

Lists don't work like that. You need to iterate through their contents.
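
To see what goes wrong, here is a small sketch (the name `doubled` is just for illustration). The power operator raises a TypeError on a list, and multiplying a list by an integer repeats it instead of scaling each item:

```python
radii = list(range(3))

try:
    radii ** 2
except TypeError as error:
    # Lists do not define the power operator.
    print("TypeError:", error)

# Multiplying a list by an integer repeats it rather than scaling each item.
doubled = radii * 2
print(doubled)  # [0, 1, 2, 0, 1, 2]
```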

NumPy's *arange* function is one way to create a powerful and flexible object it provides: the **ndarray** (short for "n-dimensional array").


In [None]:
# Unlike the math module, NumPy doesn't come with Python by default.
# You can install it with a package manager such as "pip":

!pip install numpy

In [None]:
# with the package installed, we can import it:

import numpy

# and use it:

radii = numpy.arange(10)
areas = pi*(radii**2)

print("areas:\n" + str(areas))

In [None]:
# numpy's "arange" function creates the module's useful and powerful object: the ndarray.
type(radii)

### *arange* takes similar parameters to *range*:

In [None]:
print(numpy.arange(10))

print(numpy.arange(10, dtype='float'))

print(numpy.arange(0, 1, 0.1)) # arguments: (start, stop, step-size)
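
Like *range*, *arange* leaves out the stop value. When you would rather state how many points you want than the step size, NumPy's *linspace* is an alternative. A short sketch (the names `by_step` and `by_count` are illustrative):

```python
import numpy

# arange excludes the stop value, so this runs from 0.0 up to 0.9.
by_step = numpy.arange(0, 1, 0.1)

# linspace takes the number of points instead of a step size,
# and includes the stop value by default.
by_count = numpy.linspace(0, 1, 11)

print(by_step)
print(by_count)
```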

In [None]:
# Exercise: can you replace "#" in line 3 with an argument to count from 0 to 5 by increments of 0.5?

count_by_halves = numpy.arange(#)
    
print(count_by_halves)

### Broadcasting: an ndarray property

Mathematical operations on ndarrays work element-by-element. When the operands have different shapes, NumPy stretches the smaller one to fit, which it calls **broadcasting.**

In the circle-area example, the single value *pi* was broadcast across every element of the squared radii.

For 1D ndarrays of the same length, the operation simply pairs up the elements:

In [None]:
X = numpy.arange(1, 4, dtype='float')
Y = numpy.arange(4, 7, dtype='float')

print("1D Broadcasting:\n")
print("X:\t" + str(X)) # "\t" = tab
print("Y:\t" + str(Y) + "\n")

print("X + Y:\t" + str(X + Y))
print("X - Y:\t" + str(X - Y))
print("X * Y:\t" + str(X * Y))
print("X / Y:\t" + str(X / Y))
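
Broadcasting also covers operands whose shapes differ: NumPy stretches the smaller operand to match the larger one. A minimal sketch, redefining X as above; the names `column` and `table` are illustrative:

```python
import numpy

X = numpy.arange(1, 4, dtype='float')   # shape (3,)

# A scalar is stretched to match every element of X.
scaled = X * 10
print(scaled)

# A column of shape (3, 1) combined with a row of shape (3,)
# broadcasts to a 3x3 result: a small multiplication table.
column = X.reshape(3, 1)
table = column * X
print(table.shape)
print(table)
```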

### You can nest ndarrays.

In [None]:
X = numpy.arange(10)
Y = numpy.arange(10, 20)
Z = numpy.vstack((X,Y)) # vstack is one of many functions that assemble multidimensional ndarrays.

print("X:\n" + str(X))
print("Y:\n" + str(Y))
print("Z:\n" + str(Z) + "\n")

print("dimensions of Z:", Z.ndim)
print("shape of Z:",      Z.shape)
print("Z[0,4]: " + str(Z[0,4]) + "\n")

print("That's how you get the n > 1 in 'ndarray!'")
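
Besides single elements such as Z[0,4], you can pull out whole rows, columns, or blocks with slices. A brief sketch that rebuilds Z as above (the names `first_row`, `fifth_column`, and `corner` are illustrative):

```python
import numpy

Z = numpy.vstack((numpy.arange(10), numpy.arange(10, 20)))

first_row = Z[0]        # the whole first row
fifth_column = Z[:, 4]  # ":" selects every row; 4 selects the fifth column
corner = Z[:, :3]       # both rows, first three columns

print(first_row)
print(fifth_column)
print(corner)
```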

### Other constructors:

In [None]:
X = numpy.ones(10)
Y = numpy.zeros((10,10)) # (10,10) is a tuple. The inner parentheses are required here.
Y_but_ones = numpy.ones_like(Y)

print("numpy.ones(10):\n" + str(X) + "\n")
print("numpy.zeros((10,10)):\n" + str(Y) + "\n")
print("numpy.ones_like(above):\n" + str(Y_but_ones) + "\n")
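
A few more ways to build an ndarray, shown as a quick sketch (the variable names are illustrative): *array* converts nested lists, *full* fills a shape with one value, and *reshape* rearranges an existing array.

```python
import numpy

from_list = numpy.array([[1, 2], [3, 4]])   # build directly from nested lists
sevens = numpy.full((2, 3), 7.0)            # shape (2, 3), every entry 7.0
grid = numpy.arange(6).reshape(2, 3)        # six values rearranged into 2 rows

print(from_list)
print(sevens)
print(grid)
```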

There are other constructors for ndarrays I haven't touched on. NumPy has great tutorials and documentation available for them! https://docs.scipy.org/doc/numpy-dev/user/index.html

### 2D Broadcasting:

In [None]:
# Here, we define a 2D ndarray.

X = numpy.ones((2,2))
X[0,1] = 2
X[1,0] = 3
X[1,1] = 4

print("X:\n" + str(X))

In [None]:
# And another: the 2x2 identity matrix.

Y = numpy.identity(2)

print("Y:\n" + str(Y))

In [None]:
print("2D Broadcasting:\n")

# The * operator is still element-by-element for 2D ndarrays.

YX = Y * X

print("Y * X (broadcasting):\n" + str(YX) + "\n")

In [None]:
# Let's look at numpy's "dot" function:

YdX = numpy.dot(Y,X)

print("numpy.dot(Y,X):\n" + str(YdX) + "\n")

# For these 2D arrays, it performs matrix multiplication, but its behavior varies with other shapes.
# Explore its general behavior with help(numpy.dot)
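
Because Y is the identity matrix, the difference between the two products is easy to miss. The sketch below uses two illustrative matrices, A and B, where the results clearly differ. It also uses the `@` operator, which performs matrix multiplication and matches numpy.dot for 2D arrays:

```python
import numpy

A = numpy.array([[1.0, 2.0], [3.0, 4.0]])
B = numpy.array([[0.0, 1.0], [1.0, 0.0]])   # swaps the columns of A in A @ B

elementwise = A * B        # multiplies matching entries
matrix_product = A @ B     # rows of A times columns of B

print("A * B:\n" + str(elementwise))
print("A @ B:\n" + str(matrix_product))
```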

### Loading data into an ndarray

You will often want to construct ndarrays by importing data. One function for this is **loadtxt**:

In [None]:
# load an ndarray from a file (located where this notebook is):

X = numpy.loadtxt('some_numbers.txt', delimiter=',')
print(X)
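
If you don't have a file handy, you can try *loadtxt* on text held in memory. This sketch assumes made-up contents (two rows of comma-separated numbers) and wraps them in io.StringIO, which *loadtxt* reads like a file:

```python
import io
import numpy

# Hypothetical file contents: two rows of comma-separated numbers.
text = io.StringIO("1,2,3\n4,5,6\n")

X = numpy.loadtxt(text, delimiter=',')

print(X)
print(X.shape)   # one row per line, one column per delimited value
print(X.dtype)   # loadtxt produces floats unless told otherwise
```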

### Moving on...

With broadcasting and the many other operations available for the ndarray, doing algebra on large, multidimensional arrays becomes straightforward with practice.

You can then move on with these objects to other modules for analysis and plotting, such as SciPy, Matplotlib, and...

## Pandas

* Provides "fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive." (pandas.pydata.org)
* A tool for **analysis** and **manipulation** of tabular data, such as delimited text files.

In [None]:
!pip install pandas

In [None]:
# First, we import it. We can also use "as" to give it a nickname!
import pandas as pd

In [None]:
# Now we can use read_csv to look at comma-delimited text and create Pandas's powerful object:
# the DataFrame!

stooges = pd.read_csv('stooges.csv')
type(stooges)

In [None]:
# You can best display a DataFrame without using print():
stooges

### Let's specify field names.

By default, *read_csv* assumes the first row is the header, as you see above.

The file we read doesn't contain field names, so its first row of data was used as the header instead. Let's supply the names with the parameter *names*.

In [None]:
stooges = pd.read_csv('stooges.csv',
                      names = ['Stooge', 'First Appearance', 'Final Appearance'])

stooges
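
To see the effect of *names* in isolation, here is a sketch on made-up, headerless text rather than the stooges file. The data and the column names `label`, `low`, and `high` are hypothetical:

```python
import io
import pandas as pd

# Hypothetical headerless data, made up for this sketch.
text = "alpha,1,2\nbeta,3,4\n"

# Without names, the first data row is consumed as the header.
inferred = pd.read_csv(io.StringIO(text))
print(list(inferred.columns))
print(len(inferred))

# With names, every row is kept as data.
named = pd.read_csv(io.StringIO(text), names=['label', 'low', 'high'])
print(list(named.columns))
print(len(named))
```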

### Head and Tail

For larger sets, we can peek at the start or end to check on the data without displaying the whole table:


In [None]:
stooges.head(2)

In [None]:
stooges.tail(1)

### Filtering

Comparing a column to a value gives a Series of booleans, which you can use to filter for certain rows.


In [None]:
type(stooges['First Appearance'] == 1930)

In [None]:
# filter for the original three stooges:

stooges[stooges['First Appearance'] == 1930]

In [None]:
# return the field "Final Appearance" where the field "Stooge" is "Moe Howard"

stooges['Final Appearance'][stooges['Stooge'] == 'Moe Howard']
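
Pandas also offers *.loc*, which takes the row filter and the column name in a single step. A minimal sketch on a made-up table (the data and column names are hypothetical, not the stooges file):

```python
import pandas as pd

# Hypothetical table, made up for this sketch.
table = pd.DataFrame({'label': ['alpha', 'beta', 'gamma'],
                      'low':   [1, 3, 5],
                      'high':  [2, 4, 6]})

mask = table['label'] == 'beta'   # boolean Series, one entry per row

# .loc takes the row filter and the column name together.
selected = table.loc[mask, 'high']

print(selected)
```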

### Quality Control

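The "Final Appearance" is 2183 for Curly Joe and missing for Larry Fine. We'd rather not carry incorrect or missing information into further analysis, so let's filter those rows out. The comparison below also drops the missing value, because a comparison with a missing value is never true.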

In [None]:
stooges_fixed = stooges[stooges['Final Appearance'] < 2019]
stooges_fixed
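
The sketch below shows why one comparison removes both kinds of bad rows, using a made-up table with one missing and one implausible value (the data and names are hypothetical). It also uses *isna*, which reports missing entries explicitly:

```python
import numpy
import pandas as pd

# Hypothetical table with one missing value and one implausible value.
table = pd.DataFrame({'label': ['alpha', 'beta', 'gamma'],
                      'year':  [1975.0, numpy.nan, 2183.0]})

# Comparisons with NaN evaluate to False, so the missing row is dropped too.
kept = table[table['year'] < 2019]
print(kept)

# isna() marks the missing entries with True.
missing = table['year'].isna()
print(missing)
```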

### Saving DataFrames

Now that we've performed quality control, let's save the file with **to_csv**.

In [None]:
stooges_fixed.to_csv('stooges_fixed.csv', header=True, index=False)
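
To check what *to_csv* writes without creating a file, you can call it with no path, in which case it returns the CSV text. A small sketch on a made-up table (the data and names are hypothetical):

```python
import io
import pandas as pd

# Hypothetical table, made up for this sketch.
table = pd.DataFrame({'label': ['alpha', 'beta'], 'year': [1975, 1980]})

# With no path, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the row numbers.
text = table.to_csv(index=False)
print(text)

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(text))
print(restored)
```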

## Thank you!

I only scratched the surface of what both of these modules are capable of.

The following are both sources for this tutorial and recommended reading if you want to practice:
* https://docs.scipy.org/doc/numpy/reference/
* https://pandas.pydata.org/pandas-docs/stable/tutorials.html