# Python Beginners Workshop - Day 2

# Session 1: Numpy

## Learning Goals:

- What is a NumPy array, and why use one? (motivation)
- Importing and Generating Data
- Getting insight about the Data (type, dimension, size, etc.)
- Manipulating the array (arithmetic operations, transpose, etc.)
- Slicing and Masking
- Combining arrays
- Saving data

---

## Motivation

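NumPy arrays look a bit like Python lists, but they behave very differently. For example, adding a number to a list raises an error, while adding a number to an array adds it to every element.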

In [None]:
my_list = [11.7, 21.2, 13.5, 17.0, 19.9]
my_list + 1  # raises TypeError: lists do not support element-wise arithmetic

In [1]:
import numpy as np

my_list = [11.7, 21.2, 13.5, 17.0, 19.9]
my_arr = np.array(my_list)

my_arr + 1

array([12.7, 22.2, 14.5, 18. , 20.9])
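
For comparison, here is a minimal sketch of what the same operation takes with a plain list. The list needs an explicit loop (a list comprehension here), while the array applies the addition to every element at once. The names `list_plus_one` and `arr_plus_one` are only illustrative.

```python
import numpy as np

my_list = [11.7, 21.2, 13.5, 17.0, 19.9]

# A plain list has no element-wise arithmetic, so we loop explicitly.
list_plus_one = [value + 1 for value in my_list]

# A NumPy array applies the operation to every element at once.
arr_plus_one = np.array(my_list) + 1

# Both approaches give the same numbers.
print(np.allclose(list_plus_one, arr_plus_one))
```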

<div style="text-align:center">
    <img src ="../images/numpy-diagram.png" height="600" width="600"/>
</div>

<br>

## Import Data (from text file)

Creating arrays with the help of initial placeholders or with some example data is a great way of getting started with `numpy`. But when you want to get started with data analysis, you’ll need to load data from text files.

For that, NumPy provides dedicated functions such as `loadtxt()` and `genfromtxt()`.

In [None]:
x = np.loadtxt('data/data1.txt')

In [None]:
x

<br>

In the code above, `loadtxt()` loads the data into your environment. The only argument the function takes here is the path to the text file, `data/data1.txt`, and it returns the data as a 2D array. However, there are other arguments that give us more freedom in defining how we want to import the data. For instance, we might want to store each column in its own variable: to do that, we can add `unpack=True` to the function call. Since we have three columns, we should also provide three variable names:

In [None]:
x, y, z = np.loadtxt('data/data1.txt', unpack=True)

In [None]:
x, y, z

<br>

If your data is comma-delimited, or if you want to specify the data type, you can pass the `delimiter` and `dtype` arguments to `loadtxt()`.

In [None]:
x = np.loadtxt('data/data2.txt', delimiter=',')
x
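
If the data files are not at hand, the effect of `delimiter` and `dtype` can be tried on an in-memory stand-in. This is only a sketch: the content of `fake_file` is made up for illustration and is not one of the workshop files.

```python
import io
import numpy as np

# Stand-in for a comma-delimited text file (hypothetical content).
fake_file = io.StringIO("1,2,3\n4,5,6\n")

# delimiter says how the columns are separated; dtype sets the element type.
table = np.loadtxt(fake_file, delimiter=',', dtype=int)

print(table.shape)  # (2, 3)
print(table.dtype)  # an integer type; the exact width depends on the platform
```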

<br>

Now, let's try `data3.txt`

In [None]:
x = np.loadtxt('data/data3.txt', delimiter=',')
x

<br>

Have a look at the data file and try to figure out what happened.

<br>

What is happening is that the data file contains values of different types, and `loadtxt()` can only handle a single data type. Instead of `loadtxt()` we can use `genfromtxt()`. This function replaces anything it cannot convert to a number with `nan` (i.e., Not a Number).

In [None]:
x = np.genfromtxt('data/data3.txt', delimiter=',')  # same comma delimiter as in the loadtxt() call above
x
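
The `nan` behaviour can be checked on a small in-memory example. The sketch below assumes a comma-delimited layout; the content of `fake_file` is invented for illustration.

```python
import io
import numpy as np

# Stand-in for a file that mixes numbers and text (hypothetical content).
fake_file = io.StringIO("1.0,2.0,abc\n4.0,oops,6.0\n")

# Entries that cannot be converted to a float become nan.
mixed = np.genfromtxt(fake_file, delimiter=',')

print(mixed)
print(np.isnan(mixed))  # True where the original entry was not a number
```

`np.isnan()` returns a boolean array, which is handy for counting or locating the entries that could not be converted.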

## Create a numpy array

- **zeros** : Returns a new array filled with zeros.
- **ones** : Returns a new array filled with ones.
- **random.random** : Returns a new array of random values in the interval [0, 1).
- **empty** : Returns a new uninitialized array (its values are arbitrary).
- **full** : Returns an array with the given shape, with all elements set to the given scalar value (i.e., `fill_value`).
- **full_like** : Returns a new array with the shape of the input (another array), filled with the given value.
- **eye** : Returns a 2D array with ones on the diagonal and zeros elsewhere.
- **identity** : Returns the identity matrix (always square, so you specify only one dimension).
- **linspace** : Returns evenly spaced numbers over a given interval.

In [None]:
my_list = [1,2,3,4,5,6,7]
my_arr = np.array(my_list)

my_arr

In [None]:
my_range = np.arange(10)
my_linspace = np.linspace(1, 10, 10)

In [None]:
my_range

In [None]:
my_linspace

In [None]:
zeros_arr = np.zeros(10)  # pass a tuple such as (10, 10) for a 2D array
ones_arr = np.ones((2, 10))
random_arr = np.random.random((2, 10))
empty_arr = np.empty((2, 10))
full_arr = np.full((2, 10), fill_value=10)
full_like_arr = np.full_like(full_arr, 0)
eye_arr = np.eye(4)
identity_arr = np.identity(4)
linspace_arr = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1 (example arguments)
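
Two pairs of these functions are easy to mix up: `arange` versus `linspace`, and `eye` versus `identity`. The sketch below uses arbitrary example arguments to show the differences.

```python
import numpy as np

# arange(start, stop, step): the stop value is excluded.
by_step = np.arange(0, 10, 2)

# linspace(start, stop, num): the stop value is included by default.
by_count = np.linspace(0, 10, 5)

print(by_step)   # [0 2 4 6 8]
print(by_count)  # five values: 0, 2.5, 5, 7.5 and 10

# eye can build a non-square array; identity is always square.
wide_eye = np.eye(2, 3)
square = np.identity(3)

print(wide_eye.shape, square.shape)  # (2, 3) (3, 3)
```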

## Data Inspection

In [13]:
data = np.random.random((4, 3))

In [14]:
data

array([[0.07504302, 0.23317936, 0.53365493],
       [0.2683568 , 0.26192259, 0.07487019],
       [0.72555485, 0.48568916, 0.84214848],
       [0.66204963, 0.84893406, 0.72823758]])

In [15]:
data.dtype

dtype('float64')

In [16]:
data.ndim

2

In [17]:
data.shape

(4, 3)

In [18]:
data.size

12

## Aggregation

In [None]:
data.sum()

In [None]:
data.sum(axis=0)  # compute the sum of each column

In [None]:
data.sum(axis=1)  # compute the sum of each row

In [None]:
data.min()  # these methods also accept an axis argument
data.max()
data.mean()
data.std()

In [None]:
data.cumsum()

In [None]:
x = np.arange(11)
x, x.cumsum()
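
Because `data` holds random values, the effect of `axis` is easier to verify on a small array with known entries. A minimal sketch (the array `small` is just an example):

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(small.sum())        # 21, all elements
print(small.sum(axis=0))  # [5 7 9], one sum per column
print(small.sum(axis=1))  # [ 6 15], one sum per row
print(small.max(axis=0))  # [4 5 6], largest value in each column
print(small.cumsum())     # [ 1  3  6 10 15 21], running total over the flattened array
```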

## Array Transformation

Data transformation is essentially the application of any kind of operation to your data so that you transform it from one representation to another, making it ready for the upcoming analysis. NumPy provides a number of functions that can be used to transform the data. Let's go through some of them.

Let's start by creating a 2D array:

In [None]:
arr = [[1, 2, 3, 4], 
       [5, 6, 7, 8], 
       [9, 10, 11, 12], 
       [13, 14, 15, 16], 
       [17, 18, 19, 20]]

data = np.array(arr)
data

In [None]:
data.shape

<br>

It seems we have 5 rows and 4 columns. What if we wanted to change our 2D array so that the rows become columns and the columns become rows (i.e., transpose it)?

In [None]:
data.T

<br>

We can also change the shape of the whole array using `reshape`:

In [None]:
data

In [None]:
data.reshape(2, 10)

<br>

We can do more with `reshape`. Let's say you do not know the exact dimensions of your data, but you want a fixed number of rows and you don't care about the number of columns (or the other way around). Passing `-1` for a dimension lets NumPy work out its size:

In [None]:
data.reshape(10, -1)  # here we fix the number of rows and let NumPy work out the number of columns
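
The `-1` placeholder works for either dimension, as long as the total number of elements can be divided evenly. A short sketch, rebuilding the same 5 x 4 array of the numbers 1 to 20:

```python
import numpy as np

data = np.arange(1, 21).reshape(5, 4)

# Fix the number of rows and let NumPy infer the columns.
two_rows = data.reshape(2, -1)
print(two_rows.shape)  # (2, 10)

# Fix the number of columns and let NumPy infer the rows.
five_cols = data.reshape(-1, 5)
print(five_cols.shape)  # (4, 5)

# 20 elements cannot be arranged in 3 equal rows, so this raises an error.
try:
    data.reshape(3, -1)
except ValueError as error:
    print("reshape failed:", error)
```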

<br>

With `reshape` we preserve all the data points in our array. What if we only want the first N elements (8 in this example), arranged in a specific shape?

In [None]:
np.resize(data, (4, 2))
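
`np.resize()` also works in the other direction: if the requested shape needs more elements than the array has, it starts again from the beginning of the data. A small sketch on the same 5 x 4 array:

```python
import numpy as np

data = np.arange(1, 21).reshape(5, 4)

# Fewer elements than the original: keeps the first 8 values.
smaller = np.resize(data, (4, 2))
print(smaller)

# More elements than the original: the data is repeated from the start.
larger = np.resize(data, (5, 5))
print(larger)
```

In `larger`, the last row holds the values 1 to 5 again, because the 20 original values ran out after the fourth row.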

<br>

What if we want to add a new dimension to our array? This is useful when a system accepts only input with specific dimensions - a clear application of this is in deep learning!

In [None]:
np.expand_dims(data, axis=2).shape
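
The same result can be written with indexing and `np.newaxis`, and the position of the new axis is up to you. A minimal sketch (the "batch" interpretation in the comments is only an example of how such an axis is commonly used):

```python
import numpy as np

data = np.arange(20).reshape(5, 4)

# Add a trailing axis: (5, 4) becomes (5, 4, 1).
trailing = np.expand_dims(data, axis=2)

# The same thing written with indexing.
trailing_alt = data[:, :, np.newaxis]

# Add a leading axis, e.g. to treat one array as a batch of size 1.
leading = np.expand_dims(data, axis=0)

print(trailing.shape, trailing_alt.shape, leading.shape)  # (5, 4, 1) (5, 4, 1) (1, 5, 4)
```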

In [None]:
np.ravel(data)  # flattens the input

In [None]:
data.flatten()  # flatten is an array method; there is no np.flatten function
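
Both calls return a 1D array, but they differ in whether the result shares memory with the original. The sketch below uses a small contiguous array, for which `ravel` returns a view; for other memory layouts `ravel` may have to copy as well.

```python
import numpy as np

data = np.arange(6).reshape(2, 3)

flat_copy = data.flatten()  # always an independent copy
flat_view = data.ravel()    # a view here, since data is contiguous

flat_copy[0] = 100          # does not touch data
print(data[0, 0])           # 0

flat_view[0] = -1           # changes data as well
print(data[0, 0])           # -1
```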

<br>

### Exercise

Let's try and explore the `np.pad()` function. <br>
1. Go through the documentation (docstring) of the `np.pad()` function and try to understand how it works.
2. Create the following matrices using NumPy's `pad()` function.

<div style="text-align:center">
    <img src ="../images/padding_ex_1.png" height="300" width="300" style="border:0px;margin:50px"/>
    <img src ="../images/padding_ex_2.png" height="200" width="200" style="border:0px;margin:90px"/>
</div>


You have 10 minutes to solve this task. After that we'll take a 10-minute break.

In [None]:
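import time

mins = .1  # countdown length in minutes; set to 10 for the actual exercise
finish_time = time.time() + mins * 60
print("Start time: " + time.strftime("%H:%M:%S") + "\n")
while time.time() < finish_time:
    # overwrite the same line with the current time
    print("\rCurrent time: " + time.strftime("%H:%M:%S"), end="")
    time.sleep(0.5)  # pause between updates instead of printing continuously

print("\n\n====Done====")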

### Your Answer

<br><br>

## Data Transformation

Besides changing the shape of our data, we can also work at the level of individual elements (i.e., applying functions to them).

In [None]:
a = np.arange(11, 21)
b = np.arange(1, 11)

print("a is", a)
print("b is", b)

In [None]:
a_plus_b = np.add(a, b) # a + b
a_minus_b = np.subtract(a, b) # a - b
a_mult_b = np.multiply(a, b) # a * b
a_div_b = np.divide(a, b) # a / b

In [None]:
a_remain_b = np.remainder(a, b)

In [None]:
a_remain_b = np.remainder(a[1:], b[1:])
a_remain_b

In [None]:
b_exp = np.exp(b)
b_exp

In [None]:
np.log(b_exp)
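
The functions above are the spelled-out forms of the usual arithmetic operators, and all of them work element by element. A quick check, re-creating `a` and `b` so the block runs on its own:

```python
import numpy as np

a = np.arange(11, 21)
b = np.arange(1, 11)

# Each function matches the corresponding operator.
print(np.array_equal(np.add(a, b), a + b))
print(np.array_equal(np.subtract(a, b), a - b))
print(np.array_equal(np.multiply(a, b), a * b))
print(np.array_equal(np.divide(a, b), a / b))
print(np.array_equal(np.remainder(a, b), a % b))

# log undoes exp, up to floating-point rounding.
print(np.allclose(np.log(np.exp(b)), b))
```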

## Slicing

<div style="text-align:center">
    <img src ="../images/indexing.png" height="700" width="700"/>
</div>

In [None]:
a = np.arange(11)
a

In [None]:
a[0]

In [None]:
a[1]

In [None]:
a[-1]

In [None]:
a[-2]

In [None]:
a[0:3]  # returns the elements at indices 0, 1, and 2

In [None]:
a[:3]

In [None]:
a[1:3]  # returns the elements at indices 1 and 2

In [None]:
a[1:]

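Note that so far we have always moved through the array one index at a time. A third value in the slice sets the step: the example below starts from index 1, stops before index 9, and takes every second element.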

In [None]:
a[1:9:2]
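
The general form is `a[start:stop:step]`, and any of the three parts can be left out. A few more variations as a sketch:

```python
import numpy as np

a = np.arange(11)

print(a[1:9:2])  # [1 3 5 7]
print(a[::2])    # every second element of the whole array
print(a[::-1])   # a negative step walks backwards, so this reverses the array
print(a[-3:])    # the last three elements
```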

<br><br><br><br>

## Masking

The basic idea of masking is to select elements not by listing their index values explicitly, but by using another array of boolean values that marks which elements to keep. We can also think of this as conditional slicing.

In [None]:
a[[True, False, False, False, True, False, False, False, False, False, False]]

In [None]:
a > 5

In [None]:
a[a > 5]
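
Conditions can be combined with `&` (and) and `|` (or), and a mask can also be used to assign values. A minimal sketch; note the parentheses around each condition, which are needed because `&` and `|` bind more tightly than the comparisons.

```python
import numpy as np

a = np.arange(11)

# Keep values that are greater than 2 AND smaller than 8.
middle = a[(a > 2) & (a < 8)]
print(middle)  # [3 4 5 6 7]

# Keep values that are even OR greater than 8.
even_or_large = a[(a % 2 == 0) | (a > 8)]
print(even_or_large)  # values 0, 2, 4, 6, 8, 9, 10

# A mask on the left-hand side changes only the selected elements.
capped = a.copy()
capped[capped > 5] = 5
print(capped)
```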

<br><br><br><br><br>

## Combining Arrays

<div style="text-align:center">
    <img src ="../images/split_stack.png" height="600" width="600"/>
</div>

In [None]:
a = np.arange(11, 21)  # re-create a and b so that both have 10 elements
b = np.arange(1, 11)
a, b

In [None]:
np.append(a, b)  # also accepts an axis argument

In [None]:
np.vstack((a, b)).shape

In [None]:
np.hstack((a, b)).shape
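
Stacking only works when the shapes are compatible, so this sketch uses two 1D arrays of equal length and two 2D arrays with the same number of columns. `np.concatenate()` is the general function behind both stacking helpers; the arrays `top` and `bottom` are made-up examples.

```python
import numpy as np

a = np.arange(11, 21)
b = np.arange(1, 11)

stacked_rows = np.vstack((a, b))   # each 1D array becomes a row
joined = np.hstack((a, b))         # the arrays are joined end to end
print(stacked_rows.shape, joined.shape)  # (2, 10) (20,)

# For 1D inputs, concatenate gives the same result as hstack.
print(np.array_equal(np.concatenate((a, b)), joined))

# For 2D inputs, axis selects the direction of the join.
top = np.ones((2, 3))
bottom = np.zeros((1, 3))
tall = np.concatenate((top, bottom), axis=0)  # same as np.vstack((top, bottom))
print(tall.shape)  # (3, 3)
```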

In [None]:
data

In [None]:
data_left, data_right = np.hsplit(data, 2)

In [None]:
data_left

In [None]:
data_right

In [None]:
data_up, data_down = np.array_split(data, 2)  # data has 5 rows, so np.vsplit(data, 2) would fail; array_split allows unequal parts

In [None]:
data

In [None]:
np.vsplit(data, (3, 3))  # the indices mark where to split: rows before index 3, rows from 3 up to 3 (empty), and rows from index 3 to the end

In [13]:
data.shape

(5, 5)

In [18]:
np.vsplit(np.random.rand(6, 3), (2, -3))

[array([[0.09697354, 0.00805065, 0.09368983],
        [0.01914853, 0.24804644, 0.44401373]]),
 array([[0.66060835, 0.16932486, 0.66936688]]),
 array([[0.16883278, 0.39683218, 0.47557304],
        [0.46253381, 0.90007149, 0.65264813],
        [0.15504371, 0.88808471, 0.45665059]])]
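
When the second argument is a sequence of indices, each index marks the row where a new piece starts. The sketch below shows this on a 5 x 4 array of the numbers 1 to 20, together with `np.array_split()`, which accepts a number of pieces even when the rows cannot be divided equally.

```python
import numpy as np

data = np.arange(1, 21).reshape(5, 4)

# Split before row 2 and before row 4: rows [0:2], [2:4] and [4:].
top, middle, bottom = np.vsplit(data, (2, 4))
print(top.shape, middle.shape, bottom.shape)  # (2, 4) (2, 4) (1, 4)

# 5 rows cannot be split into 2 equal parts; array_split allows uneven pieces.
first, second = np.array_split(data, 2)
print(first.shape, second.shape)  # (3, 4) (2, 4)
```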

<br><br>

### Exercise 

Create a function that takes any 2D array and performs either a vertical or a horizontal split (dividing it in half) depending on the mode ('horizontal' or 'vertical').

Here is how your function definition should look:

``` python
def split_2d(arr, mode):
    
    ######################

    ### Your code here ###

    ######################

    return array_1, array_2
```

Here is how one will use the function:

``` python
array_up, array_down = split_2d(example_arr, mode='vertical')
array_left, array_right = split_2d(example_arr, mode='horizontal')
```

### Your Answer

<br><br><br><br>

## Saving the Array

- **save()**: Saves a single array in .npy format
- **savez()**: Saves several arrays into an uncompressed .npz archive
- **savez_compressed()**: Saves several arrays into a compressed .npz archive
- **savetxt()**: Saves the data as text in the given format (e.g., txt, csv, etc.)

And you probably want to load the data again as well? For `.npy` and `.npz` files we can use `np.load()`.

In [None]:
# save() example
x = np.arange(10)

outfile = 'test_save'
np.save(outfile, x)

In [None]:
# import the .npy file
np.load(outfile + '.npy')

<br><br><br><br><br>

In [None]:
# savez() example
x = np.arange(10)
y = np.exp(x)

outfile = 'test_savez'
np.savez(outfile, x, y)

In [None]:
# import the .npz file
npzfile = np.load(outfile + '.npz')

In [None]:
npzfile.files

In [None]:
npzfile['arr_0']

In [None]:
npzfile['arr_1']
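
The default names `arr_0` and `arr_1` come from passing the arrays positionally. Passing them as keyword arguments stores them under names of your choice. The sketch below writes to a temporary directory so it leaves no files behind; the file name `named_arrays.npz` is arbitrary.

```python
import os
import tempfile
import numpy as np

x = np.arange(10)
y = np.exp(x)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'named_arrays.npz')

    # Keyword arguments become the names inside the archive.
    np.savez(path, x=x, y=y)

    # Using np.load as a context manager closes the file afterwards.
    with np.load(path) as npzfile:
        names = sorted(npzfile.files)
        x_loaded = npzfile['x']
        y_loaded = npzfile['y']

print(names)  # ['x', 'y']
```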

<br><br><br><br><br>

In [None]:
# savez_compressed() example
x = np.arange(10)
y = np.exp(x)

outfile = 'test_savez_compressed'
np.savez_compressed(outfile, x, y)

Note that this file is typically smaller than the file we saved with `np.savez()`.
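
The size difference is easiest to see with a larger, repetitive array. The following sketch compares the two file sizes in a temporary directory; the array of zeros is chosen because it compresses very well, so real data will usually shrink less.

```python
import os
import tempfile
import numpy as np

big = np.zeros(100_000)  # highly repetitive data, an easy case for compression

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, 'plain.npz')
    packed_path = os.path.join(tmp_dir, 'packed.npz')

    np.savez(plain_path, big)
    np.savez_compressed(packed_path, big)

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

print("savez:", plain_size, "bytes")
print("savez_compressed:", packed_size, "bytes")
```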

In [None]:
# import the .npz file
npzfile = np.load(outfile + '.npz')

In [None]:
npzfile.files

In [None]:
npzfile['arr_0']

<br><br><br><br><br>

In [None]:
# savetxt() example
x = np.arange(10)

outfile = 'test_savetxt.txt'
np.savetxt(outfile, x, delimiter=',')  # saves the data in the given format (e.g., txt, csv, etc.)

In [None]:
np.loadtxt(outfile)

Note that since we are dealing with a text file, we have to use `np.loadtxt()` rather than `np.load()`.
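
For a 2D array the `delimiter` separates the columns, and the optional `fmt` argument controls how each number is written. A small round-trip sketch with made-up values and an arbitrary file name:

```python
import os
import tempfile
import numpy as np

table = np.array([[1.5, 2.0],
                  [3.25, 4.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'table.csv')

    # Write comma-separated values with two decimal places.
    np.savetxt(path, table, delimiter=',', fmt='%.2f')

    # The same delimiter is needed to read the file back.
    restored = np.loadtxt(path, delimiter=',')

print(restored)
```

Keep in mind that `fmt` rounds the values on the way out, so a text file only preserves as many digits as you write.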

<br><br><br><br>

---

### References

- https://www.datacamp.com/community/tutorials/python-numpy-tutorial#visualize
- the [cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)