### Numpy
`numpy` is a Python library built around the `ndarray`, an n-dimensional array data structure.

In [None]:
import numpy as np
np.array([1,2,3])

`numpy` arrays support many of the same operations as Python lists, such as `len()` and in-place sorting:


In [None]:
a = np.array([1, 3, 2])
len(a)

In [None]:
a.sort()
a

### Python's range() and Numpy's arange()
NumPy's `arange()` is very similar to Python's built-in `range()` function. Here's Python's `range()` first:


In [None]:
r = range(17)
r

In [None]:
list(r)

### Numpy's arange() 

In [None]:
a = np.arange(17)
a

In [None]:
nums = list(range(20))

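### Numpy arrays vs Normal Python Lists
NumPy supports more compact and faster array processing than plain Python lists, including linear algebra operations. With a list, multiplying every element by 3 takes a list comprehension or a loop: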

In [None]:
[3 * i for i in r]


With NumPy, we can simply multiply our `a` array object from above to get the same result, with no need for a list comprehension or loop:

In [None]:
3 * a

In [None]:
# can also apply other math operators
a**2
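As a minimal sketch of the comparison above, the following cell checks that the vectorized product matches the list comprehension. The variable names `tripled_list` and `tripled_array` are illustrative only. It also shows a common pitfall: multiplying a plain list by 3 repeats it rather than scaling it.

```python
import numpy as np

r = range(17)
a = np.arange(17)

tripled_list = [3 * i for i in r]   # element by element, in Python
tripled_array = 3 * a               # one vectorized NumPy operation

# Both approaches should hold the same numbers
print(tripled_array.tolist() == tripled_list)

# Pitfall: `3 * some_list` concatenates three copies of the list
print(len(3 * list(r)))
```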

### Numpy arrays support faster array processing than Python lists

In [None]:
%%timeit
[i**2 for i in range(1000)] 

In [None]:
%%timeit
import numpy as np
np.arange(1000)**2  
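The `%%timeit` magic only works inside Jupyter/IPython. As a rough sketch of the same comparison that also runs as a plain script, the standard library's `timeit` module can be used. The repeat count of 200 is an arbitrary choice, and the actual timings will depend on your machine.

```python
import timeit
import numpy as np

# Time 200 runs of each approach (200 is an arbitrary choice)
list_time = timeit.timeit(lambda: [i**2 for i in range(1000)], number=200)
array_time = timeit.timeit(lambda: np.arange(1000)**2, number=200)

print(f"list comprehension: {list_time:.4f} s for 200 runs")
print(f"NumPy array:        {array_time:.4f} s for 200 runs")
```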

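### Numpy array slicing. Run the following in code cells, checking your array at each step
Two-index slicing needs a two-dimensional array, so first rebuild `a` as a 2-D array, e.g. `a = np.arange(16).reshape((4, 4))` (the 1-D `a` from above would raise an `IndexError`).
1. a[0, :]            # first row
2. a[:, 0]            # first column
3. A = a[0:2, 0:2]    # combining colons with number ranges for subarrays
4. A = a[:2, :2]      # more simply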
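The slicing steps above can be sketched on a small 2-D array. The 4 × 4 array built with `reshape()` is an assumption for illustration; any two-dimensional array would do.

```python
import numpy as np

# Assumed example array: the numbers 0..15 arranged in 4 rows of 4
a = np.arange(16).reshape((4, 4))

first_row = a[0, :]      # every column of row 0
first_col = a[:, 0]      # every row of column 0
A = a[:2, :2]            # top-left 2 x 2 subarray

print(first_row)
print(first_col)
print(A)
```

Note that a slice such as `A` is a view onto `a`, so modifying `A` in place also changes `a`.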

### Numpy constants, functions and linear spacing -- test out the following
1. import math
2. math.e
3. np.e
4. math.e == np.e
5. np.pi   # numpy also defines pi but not tau, though you can get tau from the math module:
6. math.tau

Run and test these out in code cells!
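As a quick sketch of what these checks show, the cell below compares the constants from the two modules. All three comparisons should print `True`, since tau is simply 2π.

```python
import math
import numpy as np

print(math.e == np.e)          # same constant in both modules
print(math.pi == np.pi)
print(math.tau == 2 * np.pi)   # tau is defined as 2 * pi
```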


### Numpy trig & log functions -- test these out
1. np.cos(math.tau)
2. np.sin(math.tau)
3. np.log(np.e)
4. np.log10(np.e)  # NB: numpy uses log() to denote the natural log and log10 for base 10!
5. np.pi

Run and check these out documenting your understanding in Markdown cells
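One thing worth documenting: floating-point results are not always exact, so `np.sin(math.tau)` comes out as a tiny number rather than exactly 0. A minimal sketch using `np.isclose()` to compare against the expected values:

```python
import math
import numpy as np

cos_tau = np.cos(math.tau)
sin_tau = np.sin(math.tau)   # tiny, but not exactly zero
ln_e = np.log(np.e)          # natural log
log10_e = np.log10(np.e)     # base-10 log

print(cos_tau, sin_tau, ln_e, log10_e)
print(np.isclose(sin_tau, 0.0))   # close enough to zero
```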


### linspace function
The `linspace()` function creates an array of evenly spaced points between two endpoints, and is often used to get much finer spacing by using a larger number of points. For example, 100 angles from 0 to tau, suitable for plotting cos x, can be obtained with the following code lines:
1. angles = np.linspace(0, math.tau, 100)
2. angles

Test this out!
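A short sketch of what to look for when testing this: unlike `arange()`, `linspace()` takes the number of points rather than a step size, and it includes both endpoints. The name `step` is illustrative only.

```python
import math
import numpy as np

angles = np.linspace(0, math.tau, 100)

print(len(angles))                 # number of points requested
print(angles[0], angles[-1])       # both endpoints are included

# 100 points means 99 equal gaps between them
step = angles[1] - angles[0]
print(np.isclose(step, math.tau / 99))
```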

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-2, 2, 100)
fig, ax = plt.subplots(1, 1)
ax.plot(x, x*x)   # plot the parabola y = x^2
plt.show()

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from math import tau
x = np.linspace(0, tau, 100)
fig, ax = plt.subplots()

# To make a plot of the cosine function, we can then call the ax object’s plot() method with x
# (x-axis) values equal to our 100 linearly spaced points and y-axis values as: np.cos on x:
ax.plot(x, np.cos(x))
plt.show()


In [None]:
fig, ax = plt.subplots()
ax.set_xticks([0, tau/4, tau/2, 3*tau/4, tau])
ax.set_yticks([-1, -1/2, 0, 1/2, 1])
plt.grid(True)

ax.plot(x, np.cos(x))
plt.show()

# Scatter Plots
The plot above introduced some of the key ideas of Matplotlib, and from this point there are a million possible ways to go. In this section and the next, we’ll focus on two kinds of visualizations especially important in data science: scatter plots and histograms. 
A scatter plot just plots a bunch of discrete function values against the corresponding points, which is a great way to get an overall sense of what relationships the function values might satisfy. Let’s take a look at a concrete example to see what this means.

We’ll begin by generating some random points chosen from the standard normal distribution, which is a normal distribution (or “bell curve”) with an average value (mean) of 0 and a spread (standard deviation) of 1. We can obtain these values using NumPy’s random library, which provides a function called `default_rng()` that returns the default random number generator:


In [None]:
from numpy.random import default_rng
rng = default_rng()
n_pts = 50
x = rng.standard_normal(n_pts)
x


With those x values in hand, let’s create a set of y values by multiplying x by a constant (the slope) of 5 and adding another random term:

In [None]:
y = 5*x + rng.standard_normal(n_pts)

This broadly follows the pattern of the equation for a line, y = mx + b, only with random values for x and b. Because the functional form of y is essentially linear, a plot of y vs. x should look roughly like a line, which we can confirm with a scatter plot as follows:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from math import tau
fig, ax = plt.subplots()
ax.scatter(x, y)
plt.show()
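As an optional numerical check on the "roughly a line" claim, one could fit a straight line to the points and see whether the slope comes out near 5. This is a sketch rather than part of the original walkthrough: it uses `np.polyfit()` for the least-squares fit, and the fixed seed of 42 is an assumption added only to make the run repeatable.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(42)     # fixed seed (an assumption) for repeatability
n_pts = 50
x = rng.standard_normal(n_pts)
y = 5*x + rng.standard_normal(n_pts)

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With only 50 noisy points, the fitted slope should land close to 5 and the intercept close to 0, but neither will match exactly.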


## Histograms
Finally, let’s apply some of the same ideas from above to visualize the distribution of 1000 random values drawn from the standard normal distribution:

In [None]:
values = rng.standard_normal(1000)

A common way to get a sense of what these values look like is to make a fixed number of “bins” and plot how many values fit into each bin. The resulting plot is known as a **histogram**, and can be generated automatically using Matplotlib’s hist() method: (The result should be a good approximation of a “bell curve”)

In [None]:
fig, ax = plt.subplots()
ax.hist(values)
plt.show()
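To see the numbers behind the bars, the binning can be sketched with NumPy's `histogram()` function, which by default uses 10 equal-width bins (the same default bin count that Matplotlib's `hist()` normally uses). The seed of 0 is an assumption for repeatability.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(0)      # fixed seed (an assumption) for repeatability
values = rng.standard_normal(1000)

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(values)

print(counts)
print(counts.sum())       # every value lands in exactly one bin
```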


# Hand-crafted Examples
The first steps to getting started are nearly always to import numpy as np and pandas as pd, along with matplotlib.pyplot as plt:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The core data structures of pandas are Series and DataFrame. The DataFrame is key, but it’s built up from Series, and both carry an Index object that holds the labels for their entries.

## Series
A Series is essentially a fancy array with elements of arbitrary types (much like a list), each of which has an axis label. For example, the following command defines a Series of numbers and strings, plus a special (and commonly encountered) value known as NaN (“Not a Number”):

In [None]:
pd.Series([1, 2, 3, "foo", np.nan, "bar"])

### Using dropna() to clean the data
The dropna() method drops any “Not Available” values, such as None, NaN, or NaT (“Not a Time”).
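A minimal sketch of `dropna()` on the Series defined above (the names `s` and `cleaned` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, "foo", np.nan, "bar"])
cleaned = s.dropna()      # returns a new Series; s itself is unchanged

print(cleaned)
print(len(s), len(cleaned))
```

The surviving elements keep their original labels, so label 4 (where the NaN was) is simply missing from the result.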
By default, Series axis labels are numbered just like array indices (0–5 for the six-element Series above). The set of axis labels is known as the index of the Series:

In [None]:
pd.Series([1, 2, 3, "foo", np.nan, "bar"]).index 

We can also define our own axis labels (indices), which must have the same number of elements as the Series:

In [None]:
from numpy.random import default_rng
rng = default_rng()
s = pd.Series(rng.standard_normal(5), index=["a", "b", "c", "d", "e"])
s
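To illustrate the length requirement, here is a sketch of what happens when the number of labels doesn't match the number of values: pandas raises a `ValueError`. The variable `mismatch_error` is just an illustrative name.

```python
import pandas as pd
from numpy.random import default_rng

rng = default_rng()
mismatch_error = None
try:
    # Five values but only three labels
    pd.Series(rng.standard_normal(5), index=["a", "b", "c"])
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```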

We can use our index labels like dictionary keys:

In [None]:
s['c']

Indeed, calling the `keys()` method on `s` proves the point:

In [None]:
s.keys()
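The dictionary analogy goes a bit further, as this small sketch suggests. The values here are fixed numbers rather than random ones, purely so the output is predictable.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

print("b" in s)                    # membership tests the labels, like dict keys
print(s.keys().equals(s.index))    # keys() is just another name for the index
print(s.get("z", 0.0))             # like dict.get, with a fallback value
```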

### Series Methods
Series come equipped with a wealth of methods, including plotting methods that use Matplotlib under the hood. For example, here’s a histogram for a Series with 1000 values generated with the standard normal distribution:

In [None]:
s = pd.Series(rng.standard_normal(1000))
s.hist()
plt.show()


## DataFrame
The DataFrame object is the heart of Python data analysis. A DataFrame can be thought of as a two-dimensional grid of cells containing arbitrary data types, roughly equivalent to an Excel worksheet. Let's see how to create a simple DataFrame by hand just to get a sense of how they work, although, as you know already, in most cases DataFrame objects are created by importing data from files that you typically download from the web.

There are many ways to initialize or build a DataFrame, each suited to different circumstances. For example, we can initialize a DataFrame with a Python dictionary:


In [None]:
from math import tau
from numpy.random import default_rng
rng = default_rng()
df = pd.DataFrame({
  "Number": 1.0,
  "String": "foo",
  "Angles": np.linspace(0, tau, 5),
  "Random": pd.Series(rng.standard_normal(5)),
  "Timestamp": pd.Timestamp("20221020"),
  "Size": pd.Categorical(["tiny", "small", "mid", "big", "huge"])
})
df

Here we’ve applied NumPy's `linspace()` function and two new pandas classes, `Timestamp` (just what it sounds like) and `Categorical` (which contains values of a categorical variable). Scalar values such as 1.0 and "foo" are repeated in every row. The result is a set of labeled rows and columns with a heterogeneous set of data. We can access a DataFrame column using the column name as a key, as in the code cell below:


In [None]:
df["Size"]
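To see how the dictionary entries turn into columns, here is a reduced sketch with just three of the columns from above. It checks the shape and per-column types, and confirms that selecting a column gives back a Series.

```python
from math import tau
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Number": 1.0,                      # scalar, repeated in every row
    "Angles": np.linspace(0, tau, 5),   # array, sets the number of rows
    "Size": pd.Categorical(["tiny", "small", "mid", "big", "huge"]),
})

print(df.shape)            # (rows, columns)
print(df.dtypes)           # one data type per column
print(type(df["Size"]))    # a single column is a Series
```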

We can also calculate statistics, such as the mean value of the Random column:

In [None]:
df["Random"].mean()

One useful pandas method for getting a general overview of numeric data is `describe()`:

In [None]:
df.describe()

This automatically displays the total count, mean, standard deviation, minimum, and maximum values, along with the three quartiles (25%, 50%, and 75%), of each **numeric column**. These values won’t always be meaningful (the standard deviation of the linearly spaced angles, for example, doesn’t really tell us anything useful), but `describe()` is often helpful as a first step in an analysis. Two other useful summary methods are `head()` and `info()`, which you already know.

Another useful method is `map()`, which we can use to map categorical values to numbers. Suppose, for example, that **"Size"** corresponds to drink sizes in ounces, which we can represent as a `sizes` dictionary. Using `map()` on the "Size" column should then produce the desired result, as below:

In [None]:
sizes = {"tiny": 4, "small": 8, "mid": 12, "big": 16, "huge": 24}
df["Size"].map(sizes)
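One detail worth knowing about `map()` with a dictionary: values that have no matching key come back as NaN. A small sketch, using a plain Series of strings and a made-up label `"venti"` that is deliberately absent from the dictionary:

```python
import pandas as pd

sizes = {"tiny": 4, "small": 8, "mid": 12, "big": 16, "huge": 24}

# "venti" is a hypothetical label with no entry in `sizes`
drinks = pd.Series(["tiny", "mid", "huge", "venti"])
ounces = drinks.map(sizes)

print(ounces)
```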


### The Example Nobel Laureates Dataset
As you know already, the most common practice is to load data from external files and take the analysis from there. The most common input format is the CSV file (for “comma-separated values”). Your typical first step is to download your dataset. For example, you can download the winners of the Nobel Prize, who are known as laureates (a reference to the ancient practice of using wreaths from a laurel tree to honor great accomplishments). You can do this with the command-line tool curl, saving the file as laureates.csv so that it matches the filename used below: `curl -L -o laureates.csv http://api.nobelprize.org/v1/laureate.csv`. We can then read the data using pandas' `read_csv()` function:


In [None]:
nobel = pd.read_csv("laureates.csv")

The statistics for the numeric columns aren’t very meaningful, so describe() doesn’t tell us much:

In [None]:
nobel.describe()

Let's investigate the `head()` or top entries of the Nobel Prize data. 

In [None]:
nobel.head()

Here it shows the top 5 rows, but if you want more you can call `head(n)`, substituting for n the number of rows you want to see! We can get more useful info using `info()`:

In [None]:
nobel.info()

This produces a complete list of the column names, together with the number of non-null values for each one.

## Locating Data
One of the most useful tasks in pandas is locating data that satisfies desired criteria. For example, we can locate a Nobel laureate with a particular surname. Physicist Richard Feynman (pronounced “FINE-man”) did groundbreaking work in theoretical physics (especially quantum electrodynamics and its associated Feynman diagrams). Feynman is also known for The Feynman Lectures on Physics, which covers the elementary physics curriculum (mechanics, thermal physics, electrodynamics, etc.) in an unusually entertaining and insightful way. Let’s use square brackets and a boolean criterion on the "surname" column to find Feynman’s record in the laureates data:

In [None]:
nobel[nobel["surname"] == "Feynman"]

This array-style notation returns the full record, which allows us to determine the year Feynman won his Nobel Prize:


In [None]:
nobel[nobel["surname"] == "Feynman"].year

By the way, the syntax `nobel[nobel["surname"] == "Feynman"]` can be a little confusing, since it might not be clear why we have to refer to `nobel` twice. The answer is that the inner part of the syntax returns a Series (as we saw before) consisting of boolean values for every laureate, with True if the surname is equal to "Feynman" and False otherwise:


In [None]:
nobel["surname"] == "Feynman"

By using the correct index (i.e., 86), we can confirm that the value in that case is True:

In [None]:
(nobel["surname"] == "Feynman")[86]

This confirms that the following will select only the values of nobel where `nobel["surname"] == "Feynman"` is True.

In [None]:
nobel[nobel["surname"] == "Feynman"]
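The two-step logic can be sketched on a tiny hand-made table, which runs without downloading anything. The DataFrame `toy` is a stand-in invented for illustration and contains only three rows, not the real file.

```python
import pandas as pd

# A tiny hand-made stand-in for the laureates data (not the real file)
toy = pd.DataFrame({
    "firstname": ["Marie", "Pierre", "Richard P."],
    "surname": ["Curie", "Curie", "Feynman"],
    "year": [1903, 1903, 1965],
})

mask = toy["surname"] == "Feynman"   # step 1: boolean Series, one value per row
print(mask.tolist())

selected = toy[mask]                 # step 2: keep only rows where mask is True
print(selected)
```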

Another method for getting the year is by specifying the column along with the boolean criterion, using the loc (location) attribute (loc is a special kind of attribute created using a property decorator):

In [None]:
nobel.loc[nobel["surname"] == "Feynman", "year"]

This returns just the row's index label (in this case, 86) and the column of interest. The `loc` attribute can be used in place of brackets in many places and is generally a more flexible way to pull out data items of interest.
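As a sketch of that flexibility, `loc` accepts a row criterion and a column selection together. Asking for one column gives a Series, while asking for a list of columns gives a DataFrame. The `toy` table is again a hand-made stand-in, not the real data.

```python
import pandas as pd

# Hand-made stand-in for the laureates data (not the real file)
toy = pd.DataFrame({
    "firstname": ["Marie", "Pierre", "Richard P."],
    "surname": ["Curie", "Curie", "Feynman"],
    "year": [1903, 1903, 1965],
})
is_feynman = toy["surname"] == "Feynman"

year = toy.loc[is_feynman, "year"]                    # one column: a Series
name = toy.loc[is_feynman, ["firstname", "surname"]]  # several columns: a DataFrame

print(year)
print(name)
```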

Finding Curies in the laureates.csv dataset: one of the most famous Nobel laureates is Marie Curie, but her husband and daughter won prizes too! Let's find all Nobel laureates whose surname contains “Curie”:

In [None]:
curies = nobel.loc[nobel["surname"].str.contains("Curie", na=False)]
curies
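The `na=False` argument matters because some rows may have no surname at all. A minimal sketch with a hand-made Series that includes a missing value: with `na=False`, the missing entry is treated as "no match", so the result is a clean boolean mask. Without it, the missing entry may come back as NaN (depending on the pandas version), which cannot be used for selecting rows.

```python
import numpy as np
import pandas as pd

# Hand-made surnames, including one missing value
surnames = pd.Series(["Curie", "Joliot-Curie", "Feynman", np.nan])

mask = surnames.str.contains("Curie", na=False)   # missing values count as False
print(mask.tolist())
print(surnames[mask].tolist())
```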


Want just the first and last names of these laureates? Pass a list of column names:

In [None]:
curies[["firstname", "surname"]]

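Finding winners of multiple Nobel Prizes: group the rows by laureate, count the rows in each group, and keep the groups with more than one row.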

In [None]:
laureates = nobel.groupby(["id", "firstname", "surname"])
sizes = laureates.size()
sizes[sizes > 1]
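The group-and-count idea can be sketched on a small hand-made table in which one laureate appears twice. The `id` values here are made up for illustration and are not the ids used in the real file.

```python
import pandas as pd

# Hand-made stand-in: one row per prize, with made-up ids
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "firstname": ["Marie", "Marie", "Pierre", "Richard P."],
    "surname": ["Curie", "Curie", "Curie", "Feynman"],
    "year": [1903, 1911, 1903, 1965],
})

counts = toy.groupby(["id", "firstname", "surname"]).size()   # rows per laureate
repeat_winners = counts[counts > 1]                            # more than one prize
print(repeat_winners)
```

Grouping by `id` as well as the names guards against two different people who happen to share a name being counted as one.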


### What does the following compute? Which laureates does it find?

In [None]:
nobel.loc[(nobel["born"].astype('string').str.contains("06-28")) & (nobel["category"] == "physics")]


**Some Practice Exercises**
1.	What happens if the dimensions in reshape() don’t match the array size (e.g., np.arange(16).reshape((4, 17)))?
2.	Confirm that A = np.random.rand(5, 5) lets you define a 5 × 5 random matrix.
3.	Find the inverse Ainv of the 5 × 5 matrix in the previous exercise. (Calculating the inverse of a 2 × 2 matrix as in Section 11.2.2 is fairly simple by hand, but the task rapidly gets harder as the matrix size increases, in which case a computational system like NumPy is indispensable.)
4.	What is the matrix product I = A @ Ainv of the matrices in the previous two exercises? Use the same isclose() trick from Listing 11.7 to zero out the elements of I close to zero and confirm that the resulting matrix is indeed the 5 × 5 identity matrix.
5.	Research how to add a title and axis labels to your plots
6.	One common plotting task is including multiple subplots in the same figure. Show that the code in Listing 11.10 creates vertically stacked subplots. (Here the suptitle() method produces a “supertitle” that sits above both plots. See the Matplotlib documentation on subplots for other ways to create multiple subplots.)
