In [None]:
!python --version

Python 3.8.16


<img src="Data/UP Data Science Society Logo 2.png" width=700>

# [2] Numpy
# [2.2] Data Manipulation with Numpy (Processing Data, etc)
**Prepared by:**

- Dexter To
- Jeremiah Marimon


**Topics to cover:**

- Basic Data Manipulation (checking, substituting missing values in Ndarrays, reshaping and removing values from Ndarrays)
- Other Data Manipulation techniques (Sorting, Shuffling, Casting, and Stacking Ndarrays)

**Weekly Objectives:**

- Gain a solid understanding of how to check for missing values, substitute them with appropriate values, remove them, and reshape arrays

- Perform other essential data manipulation techniques like sorting, shuffling, casting, and stacking using numpy


**References:**
- [Python documentation](https://docs.python.org/3/)
- [(Ivezic, Connolly, Vanderplas, Gray) Statistics, Data Mining, and Machine Learning in Astronomy](https://press.princeton.edu/books/hardcover/9780691198309/statistics-data-mining-and-machine-learning-in-astronomy)
- [W3Schools: Python Data Types](https://www.w3schools.com/python/python_datatypes.asp)

# I. Basic Data Manipulation in NumPy

Data taken from different sources may have missing values due to mistakes in the gathering of data.

Missing values can introduce bias and affect the accuracy of data analysis.

They can also adversely affect machine learning or statistical models.


In [None]:
import numpy as np

## A. Checking for Missing Values

Checking for missing values helps ensure the integrity and reliability of the analysis results and model performance.

- Let's create a numpy array with no nan values.

In [None]:
array1 = np.array([-1, 5, 0, 3])
array1

array([-1,  5,  0,  3])

- To check for missing or NaN values, we use NumPy's [`np.isnan()`](https://numpy.org/doc/stable/reference/generated/numpy.isnan.html) function.

- It returns a boolean array stating whether an item is a missing value or not.

In [None]:
np.isnan(array1)

array([False, False, False, False])

- We can add `.any()` to check if there is at least one missing value.

In [None]:
np.isnan(array1).any()

False

Let's create a numpy array with a NaN value.

In [None]:
array2 = np.array([-9, 5, np.nan, 5, 7])
array2

array([-9.,  5., nan,  5.,  7.])

`np.isnan()` returns `True` for every NaN element. Note that the array is stored as floats, because NaN is a floating-point value.

- Let's check that `.any()` now evaluates to `True`, since at least one item is NaN.

In [None]:
print(np.isnan(array2).any())
print(np.isnan(array2))

True
[False False  True False False]
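The boolean mask from `np.isnan()` can also be used to count and locate missing values. The following is a minimal sketch; the array `values` and the other variable names are made up for illustration. It also shows why a plain `==` comparison is not a substitute for `np.isnan()`: NaN never compares equal to anything, including itself.

```python
import numpy as np

# Hypothetical data with two missing entries
values = np.array([-9, 5, np.nan, 5, np.nan])

nan_mask = np.isnan(values)                # True where the item is NaN
nan_count = int(nan_mask.sum())            # each True counts as 1
nan_positions = np.flatnonzero(nan_mask)   # indices of the NaN items

# Comparing with == does not detect NaN, because NaN != NaN
naive_mask = (values == np.nan)

print(f"Number of NaN values: {nan_count}")
print(f"Positions of NaN values: {nan_positions}")
print(f"Any NaN found with ==: {naive_mask.any()}")
```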


Why is dealing with missing values important?
- Some operations and algorithms raise an error or return a misleading result when NaN values are present. For example, the sum and max functions below both return `nan`.

In [None]:
array3 = np.array([-1, 9, 3, -4, np.nan, 100, np.nan])
print(f"Sum of array3: {array3.sum()}")
print(f"Maximum value of array3: {array3.max()}")

Sum of array3: nan
Maximum value of array3: nan


NaN values also propagate through element-wise arithmetic.

Suppose we have two arrays, `array4` and `array5`. Let's add them and see what happens.

In [None]:
array4 = np.array([[-1, 3, 5],
                   [2, 1, 0],
                   [8, 7, np.nan]])
array5 = np.array([[12, -4, 3],
                   [90, np.nan, -1],
                   [3, 3, 1]])
array4 + array5

array([[11., -1.,  8.],
       [92., nan, -1.],
       [11., 10., nan]])

As you can see, every position where either array holds a NaN becomes NaN in the result, even if the other array has a valid value there.

## B. Substituting/Removing Missing Values
One way of handling missing values is to substitute them with specific values (such as $0$), statistical measures (such as mean and median) or through interpolation.

Another way of handling missing values is to remove them entirely by deleting the rows or columns containing them. However, this must be done cautiously since it will affect the dataset's representativeness and sample size.
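Removal can be done with boolean masks built from `np.isnan()`. The sketch below is one possible approach, not the only one; the arrays `flat` and `table` are made-up examples.

```python
import numpy as np

# 1D case: keep only the items that are not NaN
flat = np.array([-1, 9, 3, -4, np.nan, 100, np.nan])
flat_clean = flat[~np.isnan(flat)]

# 2D case: drop every row (or column) that contains at least one NaN
table = np.array([[8, np.nan, 8],
                  [8, 8, 8],
                  [8, 8, np.nan]])
row_has_nan = np.isnan(table).any(axis=1)   # one flag per row
col_has_nan = np.isnan(table).any(axis=0)   # one flag per column

complete_rows = table[~row_has_nan]
complete_cols = table[:, ~col_has_nan]

print(flat_clean)
print(complete_rows)
print(complete_cols)
```

Only one row and one column survive here, which illustrates how quickly removal can shrink a small dataset.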

We use [`np.nan_to_num()`](https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html) to replace NaN values in NumPy. By default, each NaN is replaced with 0.

In [None]:
array4 = np.array([-1, 9, 3, -4, np.nan, 100, np.nan])
np.nan_to_num(array4)

array([ -1.,   9.,   3.,  -4.,   0., 100.,   0.])

- Replacing NaN with 1

In [None]:
array5 = np.array([[8, np.nan, 8],
                   [8, 8, 8],
                   [8, 8, np.nan]])
np.nan_to_num(array5, nan=1)

array([[8., 1., 8.],
       [8., 8., 8.],
       [8., 8., 1.]])

Other functions can ignore NaN values instead of replacing them. We use these to compute statistics of an array without the NaN values affecting the result.

- Let's compare `np.mean()`, which includes the NaN values, with `np.nanmean()`, which ignores them.

In [None]:
array6 = np.array([[7, np.nan, 7],
                   [7, 7, 7],
                   [7, 7, np.nan]])

print(np.mean(array6))

print(np.nanmean(array6))

nan
7.0
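NumPy provides several other NaN-aware functions, such as `np.nansum()` and `np.nanmax()`, and they accept an `axis` argument like their regular counterparts. A short sketch with a made-up array `scores`:

```python
import numpy as np

# Hypothetical 2x3 array with one missing value in each row
scores = np.array([[7.0, np.nan, 1.0],
                   [3.0, 5.0, np.nan]])

total = np.nansum(scores)                 # sum of the non-NaN items
largest = np.nanmax(scores)               # maximum of the non-NaN items
col_means = np.nanmean(scores, axis=0)    # per-column mean, ignoring NaN

print(total)
print(largest)
print(col_means)
```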


- We can then use the mean of `array6` to replace its NaN values through the argument `nan=np.nanmean(array6)`.

In [None]:
print(f"Unreplaced array6 with nan values:\n {array6}")

print(f"Replaced NaN values of array6:\n {np.nan_to_num(array6, nan=np.nanmean(array6))}")

Unreplaced array6 with nan values:
 [[ 7. nan  7.]
 [ 7.  7.  7.]
 [ 7.  7. nan]]
Replaced NaN values of array6:
 [[7. 7. 7.]
 [7. 7. 7.]
 [7. 7. 7.]]


- Replacing using [`np.nanmedian()`](https://numpy.org/doc/stable/reference/generated/numpy.nanmedian.html)

In [None]:
s = np.array([[1, 2, 4, 7, 8],
              [-1, np.nan, np.nan, np.nan, 10],
              [np.nan, 5, 6, 3, 3]])

np.nan_to_num(s, nan=np.nanmedian(s))

array([[ 1.,  2.,  4.,  7.,  8.],
       [-1.,  4.,  4.,  4., 10.],
       [ 4.,  5.,  6.,  3.,  3.]])
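The start of this section also lists interpolation as a way to fill missing values. One possible way to do this for a 1D array is linear interpolation with `np.interp()`, using the positions of the known values as reference points. This is a sketch under that assumption; the array `readings` is made up, and linear interpolation only makes sense when the order of the items is meaningful (for example, evenly spaced measurements).

```python
import numpy as np

# Hypothetical evenly spaced measurements with gaps
readings = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

positions = np.arange(readings.size)   # 0, 1, 2, ...
missing = np.isnan(readings)

# Estimate each missing item from its known neighbors
filled = readings.copy()
filled[missing] = np.interp(positions[missing],    # where to estimate
                            positions[~missing],   # known positions
                            readings[~missing])    # known values

print(filled)
```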

## C. Reshaping arrays
Lastly, reshaping arrays allows you to convert a 1-dimensional array into a 2-dimensional array or vice versa. This is particularly useful when you want to perform operations that require a specific shape of the array.

Reshaping arrays is often required to feed data into machine learning algorithms. Many machine learning libraries expect input data to be in specific shapes, such as 2D arrays with samples in rows and features in columns.

Reshaping arrays can be useful for visualizing data in different formats.

For reshaping arrays, we use [`np.reshape()`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)

- Creating a 1D array and reshaping it into a 2D array

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6])

reshaped_arr = np.reshape(arr, (3, 2))
reshaped_arr

array([[1, 2],
       [3, 4],
       [5, 6]])

This reshaped the array to a 2D array with 3 rows and 2 columns.

We can also go the other way and reshape a 2D array into a 1D array.

- Creating a 2D array

In [None]:
arr = np.array([[1, 2],
                [3, 4],
                [5, 6]])

- Reshaping the array to a 1D array

In [None]:
reshaped_arr = np.reshape(arr, (6,))
reshaped_arr

array([1, 2, 3, 4, 5, 6])
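When reshaping, one dimension can be given as `-1` so that NumPy infers it from the number of elements. The total number of elements must stay the same, otherwise a `ValueError` is raised. A minimal sketch with made-up variable names:

```python
import numpy as np

arr = np.arange(1, 13)            # 12 elements: 1 to 12

two_cols = arr.reshape(-1, 2)     # rows inferred: 12 / 2 = 6
column = arr.reshape(-1, 1)       # a single column with 12 rows
flat = two_cols.reshape(-1)       # back to a 1D array

# 12 elements cannot fill a 5x3 array, so this fails
try:
    arr.reshape(5, 3)
    error_raised = False
except ValueError:
    error_raised = True

print(two_cols.shape, column.shape, flat.shape, error_raised)
```

The `(-1, 1)` form is a common way to turn a 1D array into the "samples in rows, features in columns" layout mentioned earlier.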

Another useful function related to reshaping is [`np.transpose()`](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html), which interchanges the rows and columns of an array. This is especially important when the task involves matrix operations.

- Transposing a 2D array

In [None]:
arr = np.array([[161, 30, 43], [45, 512, 67]])

np.transpose(arr)

array([[161,  45],
       [ 30, 512],
       [ 43,  67]])

- Creating a 3D array

In [None]:
arr = np.array([[[10, 90, 89],
                 [3, 43, 5]],
                [[12, 54, 89],
                 [71, 8, 90]]])

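- Transposing a 3D array (by default, the order of the axes is reversed, so an array of shape (2, 2, 3) becomes one of shape (3, 2, 2))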

In [None]:
transposed_arr = np.transpose(arr)

print(transposed_arr)

[[[10 12]
  [ 3 71]]

 [[90 54]
  [43  8]]

 [[89 89]
  [ 5 90]]]
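For arrays with more than two dimensions, `np.transpose()` also accepts an explicit order of axes. The sketch below uses a made-up array so that the shapes are easy to tell apart, and also shows the `.T` shorthand for 2D arrays.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)          # shape (2, 3, 4)

reversed_axes = np.transpose(arr)             # default: axes reversed
swapped_last = np.transpose(arr, (0, 2, 1))   # swap only the last two axes

print(reversed_axes.shape)
print(swapped_last.shape)

# For 2D arrays, the .T attribute gives the same result as np.transpose()
matrix = np.array([[161, 30, 43], [45, 512, 67]])
same_result = np.array_equal(matrix.T, np.transpose(matrix))
print(same_result)
```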


# II. Other data manipulation techniques

### A. Sorting
- Sorting allows for analyzing data in a specific order (least to greatest, greatest to least, etc.).
- By sorting arrays, you can identify patterns, detect outliers, and gain insights into the distribution and characteristics of the data. It can also help facilitate efficient searching and indexing operations in algorithms.

We use [`np.sort()`](https://numpy.org/doc/stable/reference/generated/numpy.sort.html) for sorting.

- Creating a 1D array

In [None]:
arr = np.array([29, 13, 9, -1, 8])

- Sorting the array in ascending order

In [None]:
sorted_arr = np.sort(arr)
sorted_arr

array([-1,  8,  9, 13, 29])

Note that there are two primary ways of sorting NumPy arrays: [`np.sort()`](https://numpy.org/doc/stable/reference/generated/numpy.sort.html) and [`np.ndarray.sort()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.sort.html).

The difference is that the former returns a new sorted array and leaves the original array unchanged. The latter sorts the array in-place (it modifies the original array itself and does not return anything).

For the most part, `np.sort()` is used more often because it keeps the original data intact and returns a copy that can be used directly in an expression. Neither function has a descending option, but the sorted copy can simply be reversed with `[::-1]`. Depending on the use case, `np.ndarray.sort()` may be preferable since it edits the original array directly and avoids creating a copy.

A comparison between the two is presented below:

- Creating the array

In [None]:
arr = np.array([9, 3, 7, 1, 5])

- Using `np.sort()` to get a sorted copy

In [None]:
np.sort(arr)

array([1, 3, 5, 7, 9])

- Using `np.ndarray.sort()` to sort the array in-place

In [None]:
arr.sort()
arr

array([1, 3, 5, 7, 9])
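The difference between the two becomes visible when the original array is inspected after each call. The following sketch also shows two related idioms: reversing the sorted copy to get descending order, and `np.argsort()`, which returns the indices that would sort the array. Variable names are made up for illustration.

```python
import numpy as np

arr = np.array([9, 3, 7, 1, 5])

sorted_copy = np.sort(arr)          # new array
after_np_sort = arr.copy()          # snapshot: arr is still unsorted here

descending = np.sort(arr)[::-1]     # reverse the ascending result
order = np.argsort(arr)             # indices that would sort arr

result = arr.sort()                 # in-place; the method returns None

print(after_np_sort)   # original order
print(descending)
print(order)
print(result)
print(arr)             # now sorted
```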

### B. Shuffling
- Shuffling randomizes the order of the elements, which is useful for cross-validation or creating diverse datasets. It can help reduce bias that may be present in the current ordering of data.

We use [`np.random.shuffle()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html) for shuffling numpy arrays.

- Creating a simple 1D array

In [None]:
data = np.array([5, 6, 7, 8, 9, 10])

- Shuffling the array using `np.random.shuffle()`, which works in-place (the resulting order varies from run to run unless a seed is set)

In [None]:
np.random.shuffle(data)
data

array([10,  7,  9,  6,  8,  5])
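To make a shuffle reproducible, a seeded random generator can be used. The sketch below uses `np.random.default_rng()` (the seed `13` is arbitrary) and a permutation of indices, which is one common way to shuffle two related arrays in the same order. The arrays `features` and `labels` are made-up examples.

```python
import numpy as np

rng = np.random.default_rng(13)            # arbitrary seed for repeatability

# In-place shuffle of a 1D array, similar to np.random.shuffle()
deck = np.array([5, 6, 7, 8, 9, 10])
rng.shuffle(deck)

# Hypothetical dataset: 5 samples with 2 features each, plus one label each
features = np.arange(10).reshape(5, 2)
labels = np.array([0, 1, 0, 1, 1])

# One random ordering of row indices, applied to both arrays
order = rng.permutation(len(features))
shuffled_features = features[order]
shuffled_labels = labels[order]

print(deck)
print(order)
print(shuffled_features)
print(shuffled_labels)
```

Because both arrays are indexed with the same `order`, each row of features stays paired with its label.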

### C. Casting
- Casting allows for transforming the datatype of arrays, which is beneficial when working with mixed data types or converting data to a format compatible with specific operations. Casting can also reduce memory usage and enhance computational efficiency by converting to a more memory-efficient data type (e.g., from float64 to float32). This is helpful especially for large datasets.

We use the [`ndarray.astype()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html) method for casting NumPy arrays.

- Creating a float array and checking its datatype

In [None]:
float_array = np.array([1.4, 0.4, 1.2, -0.3])
print(float_array)
print(float_array.dtype)

[ 1.4  0.4  1.2 -0.3]
float64


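- Casting the array to integers using the `astype()` method. The fractional part is dropped rather than rounded, and the integer size that `int` maps to (`int32` in the output below) depends on the platform.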

In [None]:
int_array = float_array.astype(int)
print(int_array)
print(int_array.dtype)

[1 0 1 0]
int32


- Casting a float64 array to float32

In [None]:
print(float_array)
print(float_array.dtype, "\n")

float32_array = float_array.astype(np.float32, casting="same_kind")
print(float32_array)
print(float32_array.dtype)

[ 1.4  0.4  1.2 -0.3]
float64 

[ 1.4  0.4  1.2 -0.3]
float32
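A few details of casting are worth checking directly: the memory saved by a smaller data type, the difference between truncating and rounding, and what the `casting="same_kind"` rule allows. The following sketch uses made-up arrays and an explicit `np.int64` so the result does not depend on the platform.

```python
import numpy as np

float_array = np.array([1.4, 0.4, 1.2, -0.3, 2.7])

# astype() truncates toward zero; round first if rounding is intended
truncated = float_array.astype(np.int64)
rounded = np.rint(float_array).astype(np.int64)

# float32 uses half the memory of float64 for the same number of items
big = np.zeros(1000, dtype=np.float64)
small = big.astype(np.float32)

# "same_kind" permits float64 -> float32 but not float64 -> int64
float_to_float_ok = np.can_cast(np.float64, np.float32, casting="same_kind")
float_to_int_ok = np.can_cast(np.float64, np.int64, casting="same_kind")

print(truncated)
print(rounded)
print(big.nbytes, small.nbytes)
print(float_to_float_ok, float_to_int_ok)
```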


### D. Stacking Ndarrays
- Stacking arrays vertically or horizontally allows for combining multiple arrays into a single array, which can be valuable for merging datasets, creating feature matrices, or building complex models. Reshaping and reorganizing data in a desired format is useful especially for handling multi-dimensional data.

We use [`np.stack()`](https://numpy.org/doc/stable/reference/generated/numpy.stack.html) for stacking, i.e. joining multiple NumPy arrays along a new axis.

All of the arrays being stacked must have the same shape.

In [None]:
exam1_scores = np.array([85, 92, 78, 80])
exam2_scores = np.array([90, 88, 94, 82])

- Stacking at `axis=0`

In [None]:
np.stack((exam1_scores, exam2_scores), axis=0)

array([[85, 92, 78, 80],
       [90, 88, 94, 82]])

- Stacking at `axis=1`

In [None]:
np.stack((exam1_scores, exam2_scores), axis=1)

array([[85, 90],
       [92, 88],
       [78, 94],
       [80, 82]])

[`np.vstack()`](https://numpy.org/doc/stable/reference/generated/numpy.vstack.html) and [`np.hstack()`](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html) can also be used, although their results can differ from those of `np.stack()`: `np.stack()` always joins the arrays along a new axis, while `np.vstack()` and `np.hstack()` join them along an existing axis (`np.vstack()` first treats 1D arrays as rows). Take note of the difference in the outputs of these functions, especially when dealing with multi-dimensional arrays.

- Stacking vertically

In [None]:
np.vstack((exam1_scores, exam2_scores))

array([[85, 92, 78, 80],
       [90, 88, 94, 82]])

- Stacking horizontally

In [None]:
np.hstack((exam1_scores, exam2_scores))

array([85, 92, 78, 80, 90, 88, 94, 82])
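The difference between these functions is easier to see with 2D inputs, where the resulting shapes no longer coincide. A minimal sketch with two made-up 2x2 arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

stacked = np.stack((a, b), axis=0)   # new axis: result is 3D
vertical = np.vstack((a, b))         # joined along rows
horizontal = np.hstack((a, b))       # joined along columns

print(stacked.shape)
print(vertical.shape)
print(horizontal.shape)

# For 2D inputs, vstack matches np.concatenate() along axis 0
same_as_concat = np.array_equal(vertical, np.concatenate((a, b), axis=0))
print(same_as_concat)
```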

### End of tutorial.


---



# Sample exercises

Try to solve these exercises if you like.

1.
- Set the random seed to 13 (you can use `np.random.seed()`)
- Generate 12 random numbers (you can use `np.random.rand()`)
- Reshape it as a 4x3 array.
- Shuffle it once using `np.random.shuffle()`
- Stack the array with the given `my_array` using `np.hstack()`
- Substitute the missing values with the mean of the array (computed while ignoring the missing values)
- Multiply the array by 10
- Cast it to an integer datatype
- Reshape the array as a 1x16 array
- Sort it in ascending order
- Calculate the sum.
- Add to the sum: the last element of the array multiplied by 17.3
- What is the resulting final sum?

In [None]:
import numpy as np

my_array = np.array([[np.nan], [10], [np.nan], [9]])

# insert your code here