# Python for everyone -- 11 NumPy, plotting

<a href="https://classroom-40p3.onrender.com/" target="_blank">Classroom sign-in</a>

## Data analysis in Python

Python is the most popular programming language for data science. The main reason is that it is easy to write and read, so you get quick results (and easily reusable code). In a self-reinforcing way, this popularity has also produced a large number of packages for data analysis and scientific use.

The main packages in Python's scientific ecosystem:
* **NumPy**: Provides fast numerical calculations and an intuitive syntax for writing mathematical expressions. It does the heavy lifting for many other data analysis packages.
* **Matplotlib**: The default plotting package, very flexible and has many options. Sometimes it gets a bit complicated to produce customized figures.
* **Pandas**: Extremely useful package for loading and wrangling (cleaning and re-structuring) data. Good for exploratory data analysis.
* **SciPy**: Scientific Python package; provides optimizers, differential equation solvers, some statistics, etc.
* **scikit-learn**: Machine learning library.
* **PyTorch**: deep learning library, supports the use of GPUs. ChatGPT is mostly implemented in Python and uses PyTorch.

Other plotting libraries:
* **Seaborn**: Builds on Matplotlib. Creates nice-looking plots that are common in data analysis with fewer instructions.
* **Plotly**: Higher-level alternative to Matplotlib. Good for, among other things, interactive visualizations that are displayed in your browser.

Other tools:
* **Anaconda environment manager**: Separate environment for each project. Makes installing data analysis libraries easy.
* **Jupyter Notebooks**: Popular for data analysis, because results display next to the code.



## 01 NumPy

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=2aebaf08-1ba6-41e4-89d2-acb600e0ecf2)

From NumPy's documentation:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

 * a powerful N-dimensional array object
 * sophisticated (broadcasting) functions
 * tools for integrating C/C++ and Fortran code
 * useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

Traditionally, everyone imports NumPy as:

In [None]:
import numpy as np

The central object of NumPy is the N-dimensional array (or `ndarray`):

In [None]:
l = [1,2,3,4,5]
print(type(l),l,l[0],l[2:4])

In [None]:
a = np.array(l) #convert list to array
print(type(a),a,a[0],a[2:4])

Indexing and slicing work similarly to lists.

If used right, NumPy arrays are much more efficient for numerical calculations than lists, because array operations avoid Python `for` loops and offload the heavy lifting to compiled C code (more on this later).

### Dimension and shape

A one-dimensional array is like a list: it is a sequence of values that can be accessed by a single index.

In [None]:
a1 = np.array([1,1,4,-2])
a1

In [None]:
a1[2]

The shape of a one-dimensional array is its length:

In [None]:
a1.shape

A two-dimensional array is like a table with rows and columns, or like a list of lists.

In [None]:
a2 = np.array([[11,12,13,14],
               [21,22,23,24],
               [31,32,33,34]])
a2

The elements of a 2-dimensional array can be accessed with two indices:

In [None]:
a2[2][1] # first index: row, second index: column

In [None]:
a2[2,1]
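
The comma form also accepts slices, which makes it easy to pull out a whole row or column. A minimal sketch, using a small table like the one above (the name `table` is only for illustration):

```python
import numpy as np

table = np.array([[11, 12, 13, 14],
                  [21, 22, 23, 24],
                  [31, 32, 33, 34]])

print(table[2, 1])      # single element: row 2, column 1
print(table[1])         # whole row 1 (same as table[1, :])
print(table[:, 1])      # whole column 1
print(table[:2, 1:3])   # first two rows, columns 1 and 2
```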

The shape of a 2-dimensional array is a tuple with two integers:

In [None]:
a2.shape # number of rows and columns

You can also create 3- and higher dimensional arrays.
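
As a small sketch of what that looks like, the snippet below builds a 3-dimensional array with `np.zeros` (the name `a3` and the chosen shape are arbitrary). The `shape` tuple simply gets one entry per dimension, and indexing takes one index per dimension:

```python
import numpy as np

# a 3-dimensional array: 2 blocks, each with 3 rows and 4 columns
a3 = np.zeros((2, 3, 4))

print(a3.ndim)       # number of dimensions
print(a3.shape)      # length along each dimension
print(a3[1, 2, 3])   # one index per dimension
```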

### Data type

As opposed to lists, arrays contain a single type of data, which is usually numeric.

In [None]:
a

In [None]:
a.dtype

In [None]:
a_float = np.array( [1.,2.5,3.3])

In [None]:
a_float.dtype

Arrays are mutable, i.e., you can change the values of their elements:

In [None]:
a

In [None]:
a[1]=42
a

But the type cannot change. Assigning a float to an element of an integer array stores only its integer part:

In [None]:
a[0]=12.34

In [None]:
a
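
If you do need fractional values, one option is to convert the whole array first. The sketch below uses `astype`, which returns a new array of the requested type; the names `ints` and `floats` are just for illustration:

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 12.34              # the fractional part is dropped silently
print(ints, ints.dtype)

floats = ints.astype(float)  # new array with a floating point type
floats[0] = 12.34            # now the full value is stored
print(floats, floats.dtype)
```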

### Array operations and functions

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=86512763-8ed4-4b45-b2d0-acb600e71194)

Most array operations and mathematical functions implemented in NumPy are applied to arrays elementwise: 

In [None]:
a = np.ones(3) 
b = np.arange(3, dtype=float) # creates the float array [0. 1. 2.]
print(a,"+",b,"=",a+b)
print(a,"-",b,"=",a-b)
print('sin(b)=',np.sin(b))
print("b**3 =", np.power(b,3))
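
This is a real difference from lists, where `+` and `*` mean concatenation and repetition. A short side-by-side sketch (all variable names here are made up for the comparison):

```python
import numpy as np

list_x = [1, 2, 3]
list_y = [10, 20, 30]
print(list_x + list_y)   # lists: + concatenates
print(list_x * 2)        # lists: * repeats the list

arr_x = np.array(list_x)
arr_y = np.array(list_y)
print(arr_x + arr_y)     # arrays: elementwise addition
print(arr_x * 2)         # arrays: every element is multiplied by 2
print(arr_x * arr_y)     # arrays: elementwise product
```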

And there are many useful functions that calculate summaries of arrays:

In [None]:
a = np.arange(4)
print("a =", a)
print("sum =", np.sum(a))
print("mean =", np.mean(a))
print("std =", np.std(a))
print("median =", np.median(a))
print("min =", np.min(a))
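
For arrays with more than one dimension, these summary functions accept an `axis` argument that selects the direction in which to summarize. A minimal sketch on a small made-up 2-dimensional array `m`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))           # no axis: sum of all elements
print(np.sum(m, axis=0))   # sum down the rows: one value per column
print(np.sum(m, axis=1))   # sum across the columns: one value per row
print(m.mean(axis=0))      # most summaries also exist as array methods
```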


### Loops vs array operations (important!)

When working with arrays, you should vectorize your algorithm, meaning that you replace explicit `for` loops with array operations. Array operations move the looping from Python into compiled C code, leading to large performance improvements.

Consider:

In [None]:
#%%timeit
n = 100000000
total = 0
for i in range(n):
    total += i
total

Same thing with vectorized code:

In [None]:
#%%timeit
a = np.arange(n)
np.sum(a)

Note: using the built-in `sum()` function is also slow:

In [None]:
sum(a)

This is because `sum` cannot take advantage of the properties of a NumPy array, while `np.sum` can.
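
If you are not in a notebook (where `%%timeit` is available), a rough way to compare the three approaches is the standard library's `timeit` module. The sketch below uses a smaller, arbitrarily chosen size `n_small` so that it finishes quickly; the exact timings depend on your machine:

```python
import timeit
import numpy as np

n_small = 1_000_000                        # smaller than above, so this runs quickly
arr = np.arange(n_small, dtype=np.int64)

def loop_sum():
    """Sum the integers below n_small with an explicit Python loop."""
    total = 0
    for i in range(n_small):
        total += i
    return total

# run each version three times and keep the best time
t_loop = min(timeit.repeat(loop_sum, number=1, repeat=3))
t_builtin = min(timeit.repeat(lambda: sum(arr), number=1, repeat=3))
t_numpy = min(timeit.repeat(lambda: np.sum(arr), number=1, repeat=3))

print(f"for loop:       {t_loop:.4f} s")
print(f"built-in sum(): {t_builtin:.4f} s")
print(f"np.sum():       {t_numpy:.4f} s")
```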

## 02 Plotting with matplotlib

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3c096e75-27e6-4fbd-a120-acbb00cf049e)

Just like NumPy, matplotlib is a fundamental part of Python's scientific stack. It is the plotting tool most people default to. It has a wide range of options, it is very flexible and customizable, and even more is available through third party [extensions](https://matplotlib.org/thirdpartypackages/index.html). A number of modules, such as pandas, Seaborn, and NetworkX, make use of matplotlib. 

Just looking at the documentation can be daunting due to the high number of options and various ways to interact with the module. The recommended strategy to learn matplotlib is to familiarize yourself with plotting via [tutorials](https://matplotlib.org/tutorials/introductory/pyplot.html) and [examples](https://matplotlib.org/gallery/). Kind of like what we will be doing here!

Traditionally, everyone imports matplotlib as:

In [None]:
import matplotlib.pyplot as plt

`pyplot` is the submodule that contains the functions that we use to interface with the module.

Next, let's take a look at parts of a figure:
<img src="https://matplotlib.org/_images/anatomy.png" width="50%">
<font size=.1>(image borrowed from matplotlib's official tutorial)</font>

With matplotlib you can customize every part of this and more. Before going into the myriad of options, we should internalize two important objects: 
* `Figure`: the object that contains everything (canvas, title, axes,...); this is what you first create when you get ready to plot something
* `Axes`: Defines a region on your figure where you actually plot data using x-y coordinates. A `Figure` can have any number of them.



In our first figure, we will plot the $x^2$ function. So before that we have to create the data:

In [None]:
x = np.arange(0,10,.1) 
y = x**2.
x[:30],y[:30]

Side note: matplotlib is designed to plot NumPy arrays, but similar objects typically work (for example lists).

It adds to the sometimes confusing syntax of matplotlib that there are two styles to create figures: object oriented and pyplot style.

**Object oriented** is when you explicitly create the `Figure` and the `Axes`, and explicitly specify which `Axes` you are working with:

In [None]:
fig  = plt.figure(figsize=(3,3))
ax   = plt.axes()

ax.plot(x, y, label="x**2") # first two positions: x and y data; label: a string that will appear in the legend

ax.set_xlabel('x')
ax.set_ylabel('y')


ax.legend() # add legend

ax.set_title('A title') # add title


Or you can use the **pyplot style**, which is meant to resemble plotting with MATLAB. It automatically creates the `Figure` and `Axes`, and keeps track of a current `Figure` and `Axes` in the background. The same figure:

In [None]:
plt.plot(x, y, label="x**2") # first two positions: x and y data; label: a string that will appear in the legend


plt.xlabel('x')
plt.ylabel('y')

plt.legend() # add legend

plt.title('A nice quadratic function'); # add title

Both styles are equally powerful; you can use either or even mix them. In this notebook, we will use both: pyplot for simple one-plot figures, and object oriented style for more complicated cases.
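
A common shortcut for the object oriented style is `plt.subplots()`, which creates the `Figure` and one `Axes` in a single call. The sketch below also checks that the pyplot "current" `Figure` and `Axes` are exactly the objects just created, which is why the two styles can be mixed (the name `xs` is only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.arange(0, 10, .1)

# with no grid arguments, plt.subplots() returns a Figure and a single Axes
fig, ax = plt.subplots(figsize=(3, 3))
ax.plot(xs, xs**2, label="x**2")
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()

# pyplot functions act on the current Figure and Axes
print(plt.gcf() is fig)
print(plt.gca() is ax)
```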

In our first plot we used the default style and color. How can we customize it? Through the arguments of the `plot` function. But again, there are two typical ways:
* format string for quick formatting
* separate attributes for more explicit (and complete) formatting

In [None]:
x = np.linspace(0,2*np.pi,20)


plt.plot(x, np.sin(x), 'ro--', linewidth=3., label="sin") # o: circle marker; --: dashed line; r:red color

plt.plot(x, np.cos(x), color='green', marker='s', linestyle='solid', label="cos") 
plt.xlabel('x')
plt.ylabel('y')

plt.legend() # add legend

plt.title('Adding some style'); # add title

A format string requires less typing but is less flexible: it only has three parts (you can change their order).
```python
fmt = '[marker][line][color]'
```
Try out these format strings: 
* `'b'`: blue markers with default shape
* `'-g'`: green solid line
* `'^k:'`: black triangle_up markers connected by a dotted line

For a full list of options click [here](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot), and scroll all the way down to notes.
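
To see that the two ways of formatting describe the same thing, the sketch below draws one line with the format string `'ro--'` and one with the equivalent keyword arguments, then reads the resulting style back from the returned line objects (the names `line_fmt` and `line_kw` are made up for this example):

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 2*np.pi, 20)
fig, ax = plt.subplots()

# the same style written twice: as a format string and as keyword arguments
line_fmt, = ax.plot(xs, np.sin(xs), 'ro--')
line_kw,  = ax.plot(xs, np.cos(xs), color='r', marker='o', linestyle='--')

for line in (line_fmt, line_kw):
    print(line.get_color(), line.get_marker(), line.get_linestyle())
```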

### Exercise

Plot the first three powers of $x$, i.e., $x$, $x^2$ and $x^3$, on the same figure. Create a legend too. Try out the default formatting and try to modify it.

Advanced: Use a `for` loop to plot the powers of $x$ between 1 and 5. 

<details><summary><u>Solution.</u></summary>
<p>
    
```python
x= np.linspace(0,1.1,20)

plt.plot(x,x,label="x")
plt.plot(x,x**2,label="x**2")
plt.plot(x,x**3,label="x**3")

plt.xlabel('x')
plt.ylabel('y')
plt.legend();
    
# advanced:
    
x= np.linspace(0,1.1,20)
for k in range(1,6):
    plt.plot(x,x**k,label="x**%d"%k)
plt.xlabel('x')
plt.ylabel('y')
plt.legend();
```
    
</p>
</details>

### Subplots


You can add multiple `Axes` to a `Figure` to create subplots. The easiest way of doing this is using the `plt.subplots` function, which creates a grid of plots. For example:

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(6,4)) # returns a Figure and an array of Axes
print("shape of axes array:",axes.shape)

#plotting in subplots
x = np.arange(0.1,10,.1)

axes[0][0].plot(x,np.exp(x),':', linewidth=2)

axes[0][1].plot(x,np.log(x),'--', color="#aaaa00",linewidth=8)

axes[1][0].plot(x,np.tan(x),'b-', markersize=10.,linewidth=2)

axes[1][1].plot(x,x**2,'x', linewidth=8, markersize=2.) 


# let's label all axes the same way
for ax in axes.reshape(-1): #we reshape the axes to be one dimensional, try iterating over it without reshaping
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    
plt.tight_layout()

(Why do you see those uneven peaks for the `tan` function? Try changing the increment in the `np.arange(.1,10,.1)`.)

### Histograms

The `hist(data)` function can both calculate and plot a histogram from a sample `data`.

It has a `bins` parameter that works just like the one in NumPy's `histogram` function: you can specify the number of evenly spaced bins or the edges of the bins.
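
The two ways of specifying bins can be tried without any plotting by calling `np.histogram` directly. The sketch below uses a handful of made-up numbers; passing the same `bins` value to `plt.hist` should group the data in the same way:

```python
import numpy as np

data = np.array([0.5, 1.5, 1.7, 2.2, 2.8, 3.1, 3.9, 4.0])  # made-up sample

# an integer: that many equal-width bins between data.min() and data.max()
counts_n, edges_n = np.histogram(data, bins=4)
print(counts_n, edges_n)

# a sequence: the bin edges themselves (values outside the edges are ignored)
counts_e, edges_e = np.histogram(data, bins=[0, 2, 4])
print(counts_e, edges_e)
```

Note that 4 bins are described by 5 edges, so the array of edges is always one element longer than the array of counts.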

In [None]:
#generate a large sample of normally distributed numbers with mean 2 and standard deviation 1
x = np.random.normal(2,1,10000)
plt.hist(x, 50, facecolor='g',alpha=1.) #try changing facecolor and alpha
plt.xlabel('x')
plt.ylabel('count')


### 🔴 Exercise

Download a csv file from [here](https://posfaim.github.io/dnds5027/data/weight-height-clean.csv). This file contains the weight, height, age and gender of a sample of people.

Run the following code to open the csv file and save the weight of the people. Don't forget to adjust the file path to reflect the location of the file on your machine.

In [None]:
weights = []
with open("data/weight-height-clean.csv") as f:
    print(f.readline()) #print out the header
    for line in f:
        data = line.split(',')
        weights.append(float(data[0]))

weights = np.array(weights) # convert from list to numpy array
weights[:5]

Use numpy functions to calculate the minimum, the maximum and the average weight.

<details><summary><u>Solution.</u></summary>
<p>
    
```python
print(f"min: {weights.min()}, max: {weights.max()}")
print(f"mean weight:{weights.mean():.2f}")
```
    
</p>
</details>

Plot the histogram of the weights.

<details><summary><u>Solution.</u></summary>
<p>
    
```python
plt.hist(weights, bins=50)
plt.xlabel('weight [kg]')
plt.ylabel('count')
```
    
</p>
</details>

### Scatter plots

A scatter plot is useful to visualize the relationship between two (or more) variables.

For example, it is intuitive that taller people tend to be heavier. To confirm this intuition, let's first load the data again, now saving all data.

In [None]:
weights = []
heights = []
ages    = []
genders = []
with open("data/weight-height-clean.csv") as f:
    print(f.readline()) #print out the header
    for line in f:
        data = line.split(',')
        weights.append(float(data[0]))
        heights.append(float(data[1]))
        ages.append(float(data[2]))
        genders.append(int(data[3]))


weights = np.array(weights) # convert from list to numpy array
heights = np.array(heights)
ages = np.array(ages)
genders = np.array(genders)
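
As a possible shortcut, NumPy can parse simple numeric CSV files itself with `np.loadtxt`. The sketch below reads from an in-memory string instead of the real file; the header names and the three rows are invented placeholders that only mimic the column order (weight, height, age, gender) used above:

```python
import io
import numpy as np

# made-up sample, NOT the real data; only the column order matches the file
sample = io.StringIO(
    "weight,height,age,gender\n"
    "70.5,175.0,34,0\n"
    "58.2,162.5,27,1\n"
    "81.0,180.3,45,0\n"
)

# skiprows=1 skips the header line; unpack=True returns one array per column
w, h, age, g = np.loadtxt(sample, delimiter=',', skiprows=1, unpack=True)
g = g.astype(int)   # loadtxt reads every column as float by default

print(w, h, age, g)
```

Assuming every line after the header in the downloaded file is purely numeric, passing the file path in place of `sample` should load it the same way.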

We can simply use the `plot` function to create a scatter plot by setting the style to have a marker but no line: 

In [None]:
plt.plot(weights,heights,'o', markersize=1.)
plt.xlabel('weight[kg]')
plt.ylabel('height[cm]')

However, matplotlib also has a function specifically for scatter plots: 

In [None]:
plt.scatter(weights,heights, s=1.)
plt.xlabel('weight[kg]')
plt.ylabel('height[cm]')

So how is this different? `scatter` allows you to change the size and color of the markers individually depending on data.

In [None]:
plt.scatter(weights,heights, s=ages/3, c=genders)
plt.xlabel('weight[kg]')
plt.ylabel('height[cm]')


You can add a legend for the colors:

In [None]:
scatter =  plt.scatter(weights,heights, s=ages/3, c=genders)
plt.xlabel('weight[kg]')
plt.ylabel('height[cm]')
plt.legend(handles=scatter.legend_elements()[0], labels=['male','female'])