# Data manipulation in Python

In real-world projects, once we get our hands on a dataset, we usually want to perform some of the following steps on the data:
- **Inspect the data** to get an understanding of the number of datapoints, summary statistics, missing values, etc.
- **Transform the data** to prepare it for downstream analysis
- **Export/save the data** to keep the final version or to share it with someone else

**Numpy** is a Python package that provides many **functions** for exactly this purpose. These functions operate on an object called an **array**, which is also defined by the Numpy library.

![Numpy diagram](./imgs/numpy_diagram.png)

#### But... why do we need arrays when we have lists?

In [1]:
a_list = [1, 2, 3, 4]
a_list + 2

TypeError: can only concatenate list (not "int") to list

Now let's do the same thing with numpy:
- import the numpy library
- create a numpy array
- try incrementing each element by 2

In [2]:
import numpy as np

my_arr = np.array([1, 2, 3, 4])  # build an array from a list
my_arr

array([1, 2, 3, 4])

In [3]:
my_arr + 2

array([3, 4, 5, 6])
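For comparison, getting the same result from a plain list takes an explicit loop or comprehension. The sketch below puts the two approaches side by side; the variable names are only for illustration.

```python
import numpy as np

a_list = [1, 2, 3, 4]

# With a list, each element has to be visited explicitly.
incremented_list = [x + 2 for x in a_list]

# With an array, the operation is applied to every element at once.
incremented_arr = np.array(a_list) + 2

print(incremented_list)  # [3, 4, 5, 6]
print(incremented_arr)   # [3 4 5 6]
```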

## Arrays in Numpy

**array** objects are yet another data collection type, provided by the **Numpy** library and designed to work with (mostly) numerical data. Arrays are very similar to lists, with one exception:
  - **all elements in the array must be of the same data type (e.g. int, float, bool)**
  
Despite that limitation, arrays are extremely useful for data analysis, and we'll be taking advantage of their many features throughout the course. So let's start by learning how to easily generate different patterns of data with arrays!
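To see the single-data-type rule in action, here is a small sketch that checks the `dtype` attribute (covered in more detail in the Data Inspection section) of a few arrays. When the input mixes integers and floats, Numpy converts every element to a float.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # one float makes every element a float
flags = np.array([True, False])

print(ints.dtype)   # an integer type (the exact width depends on the platform)
print(mixed.dtype)  # float64
print(mixed)        # [1.  2.5 3. ]
print(flags.dtype)  # bool
```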

### Building Arrays

Let's generate some arrays using Numpy functions! Some commonly-used functions are **arange()**, **linspace()**, **zeros()**, and the random number generation functions in the **random** sub-module.

| Function | Purpose | Example |
| :-----------: | :-------------: | :-------------: |
| **np.array()** | Turns a list into an array | np.array([2, 5, 3]) |
| **np.arange()** | Makes an array with all the integers from a start value up to (but not including) a stop value | np.arange(2, 7) |
| **np.linspace()** | Makes an array of a given length, evenly spaced between two values (both included) | np.linspace(2, 3, 10) |
| **np.zeros()** | Makes an array of all zeros | np.zeros(5) |
| **np.ones()** | Makes an array of all ones | np.ones(3) |
| **np.random.random()** | Makes an array of uniformly-distributed random numbers between 0 and 1 | np.random.random(100) |
| **np.random.randn()** | Makes an array of normally-distributed random numbers | np.random.randn(100) |
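As a quick illustration, the sketch below runs the examples from the table and prints what each one produces. The names `uniform` and `normal` are only for illustration.

```python
import numpy as np

print(np.arange(2, 7))        # [2 3 4 5 6]  (the stop value 7 is excluded)
print(np.linspace(2, 3, 10))  # 10 evenly spaced values from 2 to 3, both included
print(np.zeros(5))            # [0. 0. 0. 0. 0.]
print(np.ones(3))             # [1. 1. 1.]

uniform = np.random.random(100)  # values in the interval [0, 1)
normal = np.random.randn(100)    # drawn from a standard normal distribution
print(uniform.shape, normal.shape)  # (100,) (100,)
```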


### Exercise

Solve the following exercises using the functions provided by the numpy package.

1. Import the numpy package and call it `np` so we can use `np.` to access its functions later on.

In [4]:
import numpy as np

2. Make an array from the following list.

In [5]:
my_list = [1, 2, 3, 4]

In [6]:
np.array(my_list)

array([1, 2, 3, 4])

3. Make an array containing the integers from 1 to 15.

In [7]:
np.arange(1, 16)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

4. Make an array of only 6 numbers starting from 1 and ending at 10, with evenly-spaced elements.

In [8]:
np.linspace(1, 10, 6)

array([ 1. ,  2.8,  4.6,  6.4,  8.2, 10. ])

5. Make an array of the values from 2 up to (but not including) 6, spaced 0.5 from each other.

In [9]:
np.arange(2, 6, .5)

array([2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5])
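Note that `np.arange` leaves out the stop value, which is why 6 is missing from the output above. If the endpoint is wanted, one option (a sketch, not the only way) is `np.linspace`, where you give the number of points instead of the step size.

```python
import numpy as np

without_end = np.arange(2, 6, 0.5)  # stops before 6
with_end = np.linspace(2, 6, 9)     # 9 points, both 2 and 6 included

print(without_end[-1])  # 5.5
print(with_end[-1])     # 6.0
```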

6. Make an array containing 20 zeros.

In [10]:
np.zeros(20)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

7. Generate an array of 10 **uniformly-distributed** random numbers.

In [11]:
np.random.rand(10)

array([0.20535792, 0.96784474, 0.73312936, 0.38704906, 0.08971233,
       0.93373462, 0.70065325, 0.58807924, 0.85387331, 0.64360375])
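Random numbers differ every time the cell runs, so your output will not match the one above. If you need repeatable numbers, one option is a seeded generator. The sketch below uses `np.random.default_rng`, which belongs to Numpy's newer random API (NumPy 1.17 and later) and is not used elsewhere in this notebook; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.random(10)  # 10 uniformly-distributed values in [0, 1)
sample_b = rng_b.random(10)

print(np.array_equal(sample_a, sample_b))  # True
```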

### From vectors to matrices

Numpy arrays are quite flexible. You can define an array that:
- contains a single value: `np.array(4)`
- contains many values in the form of a *vector*: `np.array([1, 2, 3, 4])`
- contains many values in the form of a *matrix*: 
    ```
    np.array([[1, 2], 
              [3, 4]])
    ```
<br>

These definitions result in arrays that have different numbers of *dimensions*. For instance, a vector is a 1-dimensional array and a matrix is a 2-dimensional array.
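The number of dimensions can be checked directly. The sketch below uses the `ndim` and `shape` attributes, which are covered in the Data Inspection section; the variable names are only for illustration.

```python
import numpy as np

scalar = np.array(4)
vector = np.array([1, 2, 3, 4])
matrix = np.array([[1, 2],
                   [3, 4]])

print(scalar.ndim, scalar.shape)  # 0 ()
print(vector.ndim, vector.shape)  # 1 (4,)
print(matrix.ndim, matrix.shape)  # 2 (2, 2)
```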

So far we have practiced creating 1-d arrays, but how do we create a 2-d array? Well, a 1-d array was built from a list, and a 2-d array is built from a list of lists.

Let's create some 2-d arrays.

In [12]:
my_list = [[1, 2], 
           [3, 4]]
np.array(my_list)

array([[1, 2],
       [3, 4]])

In [13]:
np.ones((2, 2))

array([[1., 1.],
       [1., 1.]])

In [14]:
help(np.ones)

Help on function ones in module numpy:

ones(shape, dtype=None, order='C', *, like=None)
    Return a new array of given shape and type, filled with ones.
    
    Parameters
    ----------
    shape : int or sequence of ints
        Shape of the new array, e.g., ``(2, 3)`` or ``2``.
    dtype : data-type, optional
        The desired data-type for the array, e.g., `numpy.int8`.  Default is
        `numpy.float64`.
    order : {'C', 'F'}, optional, default: C
        Whether to store multi-dimensional data in row-major
        (C-style) or column-major (Fortran-style) order in
        memory.
    like : array_like, optional
        Reference object to allow the creation of arrays which are not
        NumPy arrays. If an array-like passed in as ``like`` supports
        the ``__array_function__`` protocol, the result will be defined
        by it. In this case, it ensures the creation of an array object
        compatible with that passed in via this argument.
    
        .. versionad

### Exercise (quick)

1. Create a 2d array of zeros with 3 rows and 4 columns.

In [15]:
np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

2. Create a 2d array of normally-distributed random numbers with 5 rows and 2 columns.

In [16]:
np.random.randn(5, 2)

array([[ 2.00425342, -0.42572986],
       [ 1.1217087 ,  1.09941148],
       [ 0.64399524,  2.93876544],
       [-1.60275123, -0.94295057],
       [-0.45667054, -0.24335174]])

---

## Data Inspection

Data inspection usually involves getting an understanding of what the data looks like:
- What type of values are we dealing with? `int` or `float`?
- How big is it? How many values are in the data?
- How many dimensions does it have? Is it a 1-d array or a 2-d array?
- If 1-d, how many elements does it have?
- If 2-d, how many rows and columns does it have?

### Aggregation

Or we might want to know some statistics of the data:
- What are the minimum and maximum values?
- What is the mean?
- What is the standard deviation?
- etc.

Numpy arrays come with useful methods that allow us to inspect the data.

In [17]:
data = np.random.rand(4, 3)

In [18]:
data

array([[0.33934215, 0.03360066, 0.71505202],
       [0.97470958, 0.11236965, 0.96309194],
       [0.77823698, 0.60840807, 0.95004025],
       [0.43719263, 0.52097667, 0.38335807]])

In [19]:
data.dtype

dtype('float64')

In [20]:
type(data)

numpy.ndarray

In [21]:
data.size

12

In [22]:
data.ndim

2

In [23]:
data.shape

(4, 3)

In [None]:
data.sum()

6.816378678512074

In [None]:
data.min()

0.03360066037049725

In [None]:
data.mean()

0.5680315565426729

#### Statistics per dimension

In [None]:
data

array([[0.33934215, 0.03360066, 0.71505202],
       [0.97470958, 0.11236965, 0.96309194],
       [0.77823698, 0.60840807, 0.95004025],
       [0.43719263, 0.52097667, 0.38335807]])

In [None]:
data.mean(axis=0) # across rows (i.e. per column)

array([0.63237034, 0.31883876, 0.75288557])

In [None]:
data.mean(axis=1) # across columns (i.e. per row)

array([0.36266494, 0.68339039, 0.7788951 , 0.44717579])

In [None]:
data.shape

(4, 3)
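The `axis` argument is easier to follow on a small array whose totals can be checked by hand. A minimal sketch, using a made-up array called `small`:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(small.sum(axis=0))  # [5 7 9]  one value per column
print(small.sum(axis=1))  # [ 6 15]  one value per row
print(small.sum())        # 21       the whole array
```

The result always loses the axis you aggregate over: a (2, 3) array gives 3 values for `axis=0` and 2 values for `axis=1`.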

**Note**: many of the aggregation functions are provided both as a function from the numpy package and as a method of the array itself. You can choose whichever you prefer.

```
np.mean(my_arr)
# or
my_arr.mean()
```

### Exercise (quick)

1. Create a Numpy array with shape 3 by 4 that contains only 1s.

In [24]:
dd = np.ones((3, 4))

2. Compute the sum of the whole array.

In [25]:
dd.sum()

12.0

3. Compute the sum for each column.

In [26]:
dd.sum(axis=0)

array([3., 3., 3., 3.])

4. Compute the sum for each row. 

In [27]:
dd.sum(axis=1)

array([4., 4., 4.])

## Data transformation

Data transformation is essentially the application of any kind of operation that takes your data from one representation to another, making it ready for the upcoming analysis. Numpy provides a number of functions that can be used to transform the data. Let's go through some of them.

In [28]:
arr = [[1, 2, 3, 4], 
       [5, 6, 7, 8], 
       [9, 10, 11, 12], 
       [13, 14, 15, 16], 
       [17, 18, 19, 20]]

data = np.array(arr)
data

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17, 18, 19, 20]])

In [29]:
data.shape

(5, 4)

We can flip the rows and columns using the transpose operation:

In [30]:
data.T

array([[ 1,  5,  9, 13, 17],
       [ 2,  6, 10, 14, 18],
       [ 3,  7, 11, 15, 19],
       [ 4,  8, 12, 16, 20]])

We can also change the shape of the whole array using the `reshape` method:

In [34]:
data.reshape(2, 10)

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]])
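Two details about `reshape` are worth seeing in code: the new shape must hold exactly the same number of elements, and reshaping is not the same as transposing. The sketch below rebuilds the 5 x 4 array from above; passing `-1` for one dimension asks Numpy to work out that size itself.

```python
import numpy as np

data = np.arange(1, 21).reshape(5, 4)  # same values as the array above

print(data.T.shape)               # (4, 5)
print(data.reshape(4, -1).shape)  # (4, 5), the -1 is inferred as 5

# Same shape, but the elements are arranged differently.
print(np.array_equal(data.T, data.reshape(4, 5)))  # False

try:
    data.reshape(3, 7)  # 21 slots for 20 elements
except ValueError as err:
    print("reshape failed:", err)
```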

There are also other useful functions, such as `log`, `exp`, `sin`, `cos`, etc. Each of them is applied to every element of the array.

In [None]:
np.exp(data)

array([[2.71828183e+00, 7.38905610e+00, 2.00855369e+01, 5.45981500e+01],
       [1.48413159e+02, 4.03428793e+02, 1.09663316e+03, 2.98095799e+03],
       [8.10308393e+03, 2.20264658e+04, 5.98741417e+04, 1.62754791e+05],
       [4.42413392e+05, 1.20260428e+06, 3.26901737e+06, 8.88611052e+06],
       [2.41549528e+07, 6.56599691e+07, 1.78482301e+08, 4.85165195e+08]])
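These functions can also be chained. The sketch below, using a made-up array called `values`, applies a few of them element by element and checks that `log` undoes `exp`.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

print(np.sqrt(values))  # [1. 2. 3.]
print(np.exp(values))   # e raised to each element
print(np.allclose(np.log(np.exp(values)), values))  # True
```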

## Indexing and slicing

Numpy arrays can be indexed and sliced the same way we learned for lists, and this works in every dimension.

![Indexing](./imgs/indexing.png)
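A short side-by-side sketch of list and array indexing, with made-up example values. The main difference shows up in 2-d: a nested list needs one pair of brackets per level, while an array takes one index per dimension inside a single pair of brackets.

```python
import numpy as np

a_list = [10, 20, 30, 40, 50]
an_arr = np.array(a_list)

print(a_list[0], an_arr[0])    # 10 10
print(a_list[-1], an_arr[-1])  # 50 50
print(a_list[1:4])             # [20, 30, 40]
print(an_arr[1:4])             # [20 30 40]

nested = [[1, 2, 3], [4, 5, 6]]
grid = np.array(nested)
print(nested[1][2])  # 6
print(grid[1, 2])    # 6
```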

### Exercise

1. How many dimensions does the following Numpy array have?

In [35]:
my_arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [36]:
my_arr.ndim

2

2. How many elements does `my_arr` have in each dimension (i.e., what is the shape of it)?

In [37]:
my_arr.shape

(2, 3)

3. What is the data type of the elements of `my_arr`?

In [38]:
my_arr.dtype

dtype('float64')

4. Generate a 3 x 10 array of random integers from 1 up to (but not including) 4 using **`np.random.randint`**.

In [39]:
np.random.randint(1, 4, (3, 10))

array([[3, 3, 2, 1, 2, 3, 1, 3, 3, 2],
       [3, 2, 3, 2, 1, 1, 1, 1, 3, 2],
       [3, 1, 3, 1, 1, 2, 3, 2, 2, 1]])
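Notice that the output contains only 1, 2 and 3: the upper bound passed to `np.random.randint` is excluded, just like the stop value of `np.arange`. A small sketch of how to include 4, if that is what you want:

```python
import numpy as np

draws = np.random.randint(1, 4, (3, 10))      # possible values: 1, 2, 3
with_four = np.random.randint(1, 5, (3, 10))  # possible values: 1, 2, 3, 4

print(draws.min() >= 1, draws.max() <= 3)          # True True
print(with_four.min() >= 1, with_four.max() <= 4)  # True True
```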

5. Make a flat (i.e. 1-d) array with all the integer values between 0 and 11, and then reshape it into a 3 x 4 matrix using **`array.reshape()`**.

In [40]:
arr = np.arange(0, 12)
arr = arr.reshape(3, 4)
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

...Reshape the previous array into a 4 x 3 matrix...

In [41]:
arr.reshape(4, 3)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

...Reshape that array into a 2 x 6 matrix...

In [42]:
arr.reshape(2, 6)

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])

...Then flatten it (meaning: make it 1-d). **Hint**: the array object has a method for this. 

In [43]:
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
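One detail worth knowing: `flatten()` returns a copy, so changing the flattened array leaves the original untouched. A minimal sketch (the name `flat` is only for illustration):

```python
import numpy as np

arr = np.arange(0, 12).reshape(3, 4)
flat = arr.flatten()

flat[0] = 99      # changes the copy only
print(flat[0])    # 99
print(arr[0, 0])  # 0
```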

Numpy arrays work the same way as other sequences, but they can have multiple dimensions (i.e. rows, columns, etc.) over which we can index/slice the array.

```
data = np.array([[0, 1, 2,  3],
                 [4, 5, 6,  7],
                 [8, 9, 10, 11]]
               )
second_row = data[1, :]
third_column = data[:, 2]
```
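Running the example shows what each selection returns. The sketch below repeats the same `data` array and adds two more selections for illustration: a block of rows and columns, and a single element.

```python
import numpy as np

data = np.array([[0, 1, 2,  3],
                 [4, 5, 6,  7],
                 [8, 9, 10, 11]])

print(data[1, :])      # [4 5 6 7]    second row
print(data[:, 2])      # [ 2  6 10]   third column
print(data[0:2, 1:3])  # first two rows, second and third columns
print(data[1, 2])      # 6            a single element
```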

Using the example variable `scores`, select only the described elements from the array:

In [44]:
import numpy as np
scores = np.arange(1, 49).reshape(6, 8)
scores

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16],
       [17, 18, 19, 20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29, 30, 31, 32],
       [33, 34, 35, 36, 37, 38, 39, 40],
       [41, 42, 43, 44, 45, 46, 47, 48]])

6. The third-through-fourth columns.

In [45]:
scores[:, 2:4]

array([[ 3,  4],
       [11, 12],
       [19, 20],
       [27, 28],
       [35, 36],
       [43, 44]])

7. The third-through-fifth columns.

In [46]:
scores[:, 2:5]

array([[ 3,  4,  5],
       [11, 12, 13],
       [19, 20, 21],
       [27, 28, 29],
       [35, 36, 37],
       [43, 44, 45]])

8. The 2nd through 5th score, in the 6th column.

In [47]:
scores[1:5, 5]

array([14, 22, 30, 38])

9. Change the 3rd column to all 10s.

In [48]:
scores[:, 2] = 10
scores

array([[ 1,  2, 10,  4,  5,  6,  7,  8],
       [ 9, 10, 10, 12, 13, 14, 15, 16],
       [17, 18, 10, 20, 21, 22, 23, 24],
       [25, 26, 10, 28, 29, 30, 31, 32],
       [33, 34, 10, 36, 37, 38, 39, 40],
       [41, 42, 10, 44, 45, 46, 47, 48]])

10. Change the 4th row to 0s.

In [50]:
scores[3, :] = 0
scores

array([[ 1,  2, 10,  4,  5,  6,  7,  8],
       [ 9, 10, 10, 12, 13, 14, 15, 16],
       [17, 18, 10, 20, 21, 22, 23, 24],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [33, 34, 10, 36, 37, 38, 39, 40],
       [41, 42, 10, 44, 45, 46, 47, 48]])
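Assigning through a slice changes the array in place, and the same holds when the slice is first stored in a variable: a basic slice is a view into the original data, not a copy. The sketch below illustrates this on a fresh `scores` array; use `.copy()` when you need an independent array. The names `first_row` and `safe_row` are only for illustration.

```python
import numpy as np

scores = np.arange(1, 49).reshape(6, 8)

first_row = scores[0, :]  # a view into scores
first_row[0] = -1
print(scores[0, 0])       # -1, the original changed too

safe_row = scores[1, :].copy()  # an independent copy
safe_row[0] = -1
print(scores[1, 0])             # 9, the original is untouched
```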
