<center><h1>NumPy Package</h1></center>
<center><h3>Paul Stey</h3></center>
<center><h3>2022-02-22</h3></center>


# NumPy Package

The NumPy package is one of the most widely used tools for scientific computing in Python. NumPy (short for Numerical Python) is a package filled with data structures and algorithms for working on mathematical problems.

* Provides support for working with large, multi-dimensional arrays and matrices. It is built for numerical and scientific computing tasks.

* Provides a range of mathematical functions that can be applied to arrays, such as trigonometric, logarithmic, and exponential functions. It also provides linear algebra operations and random number generation.

* Allows for efficient array operations because it is implemented in C 
  - This makes it faster than performing the same operations in pure Python code

* Is the foundation for many other scientific computing packages in Python, such as Pandas, SciPy, and scikit-learn. 

* Is open source software and is widely used in both academia and industry. 
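
As a rough illustration of the capabilities listed above, the sketch below applies a few element-wise math functions, a couple of linear algebra operations, and a random draw. The array values and the seed are arbitrary choices made for this example, and `np.random.default_rng()` is used here only as one of several ways to generate random numbers in NumPy.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])

# Element-wise mathematical functions
exp_x = np.exp(x)             # e raised to each element
sin_x = np.sin(x)             # sine of each element
log_x = np.log(x + 1.0)       # natural log, shifted by 1 to avoid log(0)

# Linear algebra on a small diagonal matrix (values chosen for illustration)
m = np.array([[2.0, 0.0],
              [0.0, 4.0]])
v = np.array([1.0, 1.0])
mv = m @ v                    # matrix-vector product
m_inv = np.linalg.inv(m)      # matrix inverse

# Random number generation with a seeded generator (the seed is arbitrary)
rng = np.random.default_rng(42)
draws = rng.normal(0, 1, 5)   # five draws from a standard normal

print(exp_x, sin_x, log_x)
print(mv)
print(m_inv)
print(draws)
```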

## NumPy Arrays

In [None]:
import numpy as np          # import numpy and alias it as `np`

In [None]:
a = np.array([2, 3, 45])    # create numpy array

print(a)

In [None]:
print(type(a))              # shows type of `a`

print(a.shape)              # prints "(3,)"

## Array Indexing

NumPy's `array` indexing works much like it does with lists.

In [None]:
print(a[0], a[1], a[2])        # prints "2 3 45"

In [None]:
a[0] = 9999                    # change an element in place

print(a)                       # `a` now holds 9999, 3, 45

In [None]:
v = np.random.normal(0, 1, 5)  # draws from normal dist'n with mean 0, and SD 1

print(v)

In [None]:
v[0:3]                         # get first 3 elements

### More Array Indexing

In [None]:
print(v)

In [None]:
v[-1]                   # get the last element

In [None]:
v[-2]                   # get the second-to-last element

In [None]:
v[2:]                   # get the third element and everything after 

## Two-Dimensional Arrays (i.e., Matrices)

In [None]:
b = np.array([[1,2,3],
              [4,5,6]])            # create 2-d array (i.e., matrix)

print(b)                     

In [None]:
print(b[0, 0], b[0, 1], b[1, 0])   # prints "1 2 4"

In [None]:
a = np.zeros((2,2))                # create an array of all zeros

print(a)

### More Matrices

In [None]:
b = np.ones((1, 5))    # create an array of all ones

print(b)              

In [None]:
c = np.full((2, 2), 7)  # 2x2 array of 7s

print(c)               

In [None]:
d = np.full((2, 4), "potato")

print(d)

<center><h1>Challenge Problem</h1></center>

In this exercise our goal is to normalize a 2D array. We will do the following: 

* Create a 2D NumPy array with the following values: `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`
* Normalize the array so that each column has zero mean and unit variance. To do this, subtract the mean of each column from each element in that column, then divide each element in that column by the standard deviation of that column.

* Print the normalized array to the console.

The NumPy module has built-in functions for computing the mean (i.e., `np.mean()`) as well as the standard deviation (i.e., `np.std()`). Let's use these in our computation.

**Hint**: Note that we want to use these functions to operate on the _columns_. This is controlled by the `axis` argument on the `np.mean()` and `np.std()` functions. By default, the axis argument in NumPy functions is set to None, which means that the function will calculate the mean over the entire array. However, you can set axis to an integer value to specify which axis to perform the calculation on. For example, if you have a two-dimensional array and set axis to 0, the function will calculate the mean of each column, while setting axis to 1 will calculate the mean of each row.
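
To make the `axis` argument concrete, here is a small sketch on a different array from the one in the exercise. The 2x3 array `m` and the variable names are made up for this illustration.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

overall_mean = np.mean(m)         # axis=None (default): mean of all six elements
col_means = np.mean(m, axis=0)    # one mean per column
row_means = np.mean(m, axis=1)    # one mean per row
col_sds = np.std(m, axis=0)       # one standard deviation per column

print(overall_mean)   # 3.5
print(col_means)      # values 2.5, 3.5, 4.5
print(row_means)      # values 2.0, 5.0
print(col_sds)        # values 1.5, 1.5, 1.5
```

Note that the per-column results have one entry per column (three here), while the per-row results have one entry per row (two here).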

## Why NumPy Arrays?

The `array` type is superficially similar to the `list` type in Python. However, the major strength of the NumPy `array` is that the implementation of the data structure is _very highly optimized_ for speed.

In [None]:
a_arr = np.random.normal(0, 1, 10_000_000)  # random draws from normal dist'n

a_list = list(a_arr)                        # cast as a list

In [None]:
%timeit np.sum(a_list)

In [None]:
%timeit np.sum(a_arr)

# Vectorized Operations with NumPy

* Vectorized operations perform a computation on an entire array, rather than on individual elements of the array

* Vectorized operations can be performed much, _much_ faster than operations that are performed on individual elements of the array using loops or list comprehensions

* NumPy provides a wide range of vectorized operations, including mathematical operations such as addition, subtraction, multiplication, and division, as well as logical operations such as AND, OR, and NOT

* Using vectorized operations with NumPy can make code much faster and easier to read. Avoiding explicit loops and list comprehensions makes code more concise and easier to debug.
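
The arithmetic and logical operations mentioned above can be sketched as follows. The arrays `a` and `b` are made-up values for this illustration; each operation acts on every element at once, with no explicit loop.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise arithmetic
total = a + b
product = a * b
ratio = b / a

# Element-wise comparisons produce boolean arrays
is_even = a % 2 == 0
is_big = b > 15

# Element-wise logical operations on boolean arrays
both = np.logical_and(is_even, is_big)     # AND
either = np.logical_or(is_even, is_big)    # OR
not_even = np.logical_not(is_even)         # NOT

print(total, product, ratio)
print(both, either, not_even)
```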


## Vectorized Addition Example

In [None]:
a = np.array([0, 2, 4, 6])      # create NumPy array

a_new = a + 1                   # add 1 to all elements

print(a_new)

In [None]:
a2 = [0, 2, 4, 6]               # create list

a2_new = [x + 1 for x in a2]    # add 1 to all elements

print(a2_new)

### Performance of Vectorization

Taking advantage of vectorization in NumPy will typically lead to a performance improvement; often, the improvement will be substantial.

In [None]:
b_arr = np.random.normal(0, 1, 10_000_000)  # random draws from normal dist'n

b_list = list(b_arr)                        # cast as a list


In [None]:
%timeit [x + 1 for x in b_list]             # list comprehension adds 1 to all elements

In [None]:
%timeit b_arr + 1                           # add 1 to each element

<center><h1>Pandas DataFrames</h1></center>


# Pandas Overview

* Hugely popular Python package for the analysis of data. The name derives from "_panel data_", and is also a play on "_Python data analysis_".

* Provides powerful data manipulation and analysis capabilities. It is built on top of NumPy and provides a high-level interface for working with structured data, such as tables and time series data.

* Provides two key data structures: `Series` and `DataFrame`. A Series is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional table of data with rows and columns.

* Provides a wide range of functions and methods for manipulating and analyzing data, including filtering, sorting, grouping, merging, reshaping, and pivoting. 

* Pandas is especially useful for working with structured data, such as data from CSV files, Excel spreadsheets, SQL databases, and other data sources.
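
As a minimal sketch of the two data structures described above, the example below builds a `Series` and a `DataFrame` by hand. The names and values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, labeled values
s = pd.Series([3.5, 2.0, 7.25], name="score")

# A DataFrame: two-dimensional table with named columns
df = pd.DataFrame({"name":  ["a", "b", "c"],
                   "score": [3.5, 2.0, 7.25]})

col = df["score"]     # selecting a single column gives back a Series

print(type(s).__name__, type(df).__name__, type(col).__name__)
print(df.shape)       # (number of rows, number of columns)
print(s.mean())
```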

In [None]:
import pandas as pd          # import pandas and alias it as "pd"

## Pandas `DataFrame`

As a reminder, a `DataFrame` is a tabular data structure. It is superficially similar to a table in Excel.

In [None]:
# Use pd.DataFrame() to construct a dataframe

df = pd.DataFrame({"x1": [412, 5, 6, 67],
                   "x2": ["soup", "shoe", "cat", "potato"]})

df

## Reading in Data

In [None]:
arrests_df = pd.read_csv("data/pvd_arrests_2021-10-03.csv")   # read in data

In [None]:
arrests_df.head()                                             # show first few lines

## Indexing and Slicing `DataFrame`s

For better or worse, there are several ways to do indexing and slicing using Pandas `DataFrame` objects. 

We will discuss the `[]` operator and the `loc` and `iloc` indexers.

In [None]:
arrests_df["age"]       # get entire column

### Using `loc` for Indexing and Slicing

In [None]:
arrests_df.loc[:, ["age", "gender", "statute_desc"]]      # `loc` is for label-based indexing

### Using `iloc` for Indexing and Slicing

In [None]:
arrests_df.iloc[0:3, 0:5]     # `iloc` is for integer-based indexing

### Indexing with Booleans

In [None]:
arrests_df["age"] > 40        # series of boolean

In [None]:
arrests_df.loc[arrests_df["age"] < 19, ["age", "gender", "statute_desc"]]
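
The difference between label-based, integer-based, and boolean indexing is easier to see on a small table with non-numeric row labels. The sketch below uses a made-up `toy_df` (not the arrests data), with hypothetical row labels `"w"` through `"z"`.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"age":    [17, 25, 42, 18],
     "gender": ["F", "M", "F", "M"]},
    index=["w", "x", "y", "z"],
)

# `loc` slices by label, and the slice includes both endpoints
by_label = toy_df.loc["x":"y", ["age"]]

# `iloc` slices by integer position, and the stop position is excluded
by_position = toy_df.iloc[1:3, 0:1]

# A boolean Series can be passed to `loc` to keep only matching rows
mask = toy_df["age"] < 19
minors = toy_df.loc[mask, ["age", "gender"]]

print(by_label)
print(by_position)
print(minors)
```

Here `by_label` and `by_position` select the same two rows, even though one slice is written with labels and the other with positions.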


<center><h1>Challenge Problem</h1></center>

Pandas `DataFrame`s can be built from NumPy arrays (among other kinds of objects).

Use two NumPy arrays to create a Pandas `DataFrame` with two columns, `x` and `y`. The first should be 100 random numbers drawn from the standard normal distribution, and the second should be 100 random numbers from the uniform distribution on the interval from `0.0` (inclusive) to `1.0` (exclusive).

**_Hint_**: We can use the `random.normal()` function in the NumPy module to take draws from Normal distributions. And similarly, we can use `random.uniform()` to take draws from Uniform distributions.

<center><h1>Grouped DataFrames</h1></center>

* Powerful feature in Pandas that allows you to group data by one or more columns in a `DataFrame`, and then apply a function to each group. This is useful for tasks like calculating summary statistics and performing aggregations.

* Created using the `groupby()` method, which groups data by one or more columns in the `DataFrame`. The resulting object is a `GroupBy` object. It is not itself a `DataFrame`, but it holds the grouping and allows you to apply functions to each group.

* Support a wide range of functions and methods for performing operations on each group, including aggregations like `mean()`, `sum()`, and `count()`, as well as transformations like `apply()`, `transform()`, and `filter()`. These functions can be used to create new columns, filter data, and perform calculations on each group.

* Can be used to create pivot tables, which are a powerful tool for summarizing and analyzing data.

* Are a key feature in data analysis and data science, and are widely used in a variety of applications, such as finance, marketing, and scientific research. They provide a flexible and efficient way to work with large datasets and perform complex calculations on each group.
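
The operations named in the list above (aggregation, transformation, filtering, and pivot tables) can be sketched on a small table. The `scores_df` data and its column names are made up for this example.

```python
import pandas as pd

scores_df = pd.DataFrame({
    "team":  ["red", "red", "blue", "blue", "blue"],
    "round": [1, 2, 1, 2, 3],
    "score": [10, 20, 30, 40, 50],
})

grouped = scores_df.groupby("team")

# Aggregation: one value per group
team_means = grouped["score"].mean()

# Transformation: result has the same length as the original column,
# so it can be combined with it row by row
centered = scores_df["score"] - grouped["score"].transform("mean")

# Filtering: keep only the rows belonging to groups with more than two rows
large_teams = grouped.filter(lambda g: len(g) > 2)

# Pivot table: teams as rows, rounds as columns, mean score in each cell
pivot = scores_df.pivot_table(index="team", columns="round",
                              values="score", aggfunc="mean")

print(team_means)
print(centered)
print(large_teams)
print(pivot)
```

In the pivot table, the cell for team `red` in round 3 is missing (shown as `NaN`) because that team has no row for that round.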

# Grouping in Pandas

In addition to the `DataFrame` object, the Pandas package in Python also supports a huge variety of grouping, sub-setting, and aggregation functions. These can provide some functionality similar to the _dplyr_ package in R.

In [None]:
# create sample data
student_df = pd.DataFrame({
    "name":    ["Alice", "Bob", "Charlie", "Dave", "Eve", "Lee", "Isabel"],
    "subject": ["Math", "English", "Math", "English", "Math", "English", "Math"],
    "grade":   [80, 90, 81, 75, 95, 93, 81]
})

grouped_df = student_df.groupby("subject")    # group by subject


grouped_df["grade"].mean()                    # compute mean by subject

In [None]:
student_df.groupby("subject").describe()      # verbose summary

### Grouped DataFrame

In [None]:
animal_df = pd.DataFrame({"iq":     [75, 4, 55, 63, 44, 65, 59, 6],
                          "age":    [2, 33, 4, 12, 8, 4, 7, 9],
                          "animal": ["llama", "shoe", "cat", "llama", "cat", "llama", "cat", "shoe"]}) 

In [None]:
grp_df = animal_df.groupby("animal")       # save grouped dataframe obj

In [None]:
grp_df.groups                              # see the groups we have

In [None]:
grp_df.get_group("cat")                    # get dataframe of single group

## Using Arrest Data

In [None]:
arrests_df = pd.read_csv("data/pvd_arrests_2021-10-03.csv")   # read in data

arrests_df.shape

In [None]:
grp_df = arrests_df.groupby("id")

In [None]:
grp_df['case_number'].size()

In [None]:
grp_df.get_group("pvd9871727665829819099")

<center><h1>Challenge Problem</h1></center>

Let's write some code to determine the most frequent statute violations. In particular, let's use the `groupby()` method on the `"statute_desc"` column to create a grouped dataframe. Then let's use the `size()` method on that grouped dataframe to get the total counts of violations for that statute. 

Finally, let's use the `sort_values()` method on the resulting object, while passing that method the argument `ascending=False` to get the violations sorted in descending order.
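
As a warm-up, the same count-then-sort pattern can be tried on a tiny made-up table before applying it to the arrests data. The `toy_df` frame and its `category` column are hypothetical and exist only for this sketch.

```python
import pandas as pd

toy_df = pd.DataFrame({"category": ["b", "a", "b", "c", "b", "a"]})

counts = toy_df.groupby("category").size()     # number of rows in each group
ranked = counts.sort_values(ascending=False)   # largest counts first

print(ranked)
```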
