<a href="https://colab.research.google.com/github/glassresearch/PLT/blob/master/Python%20colab%20Georgia%20Tech/numpy_argsort_sort_FA25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Numpy argsort() and Numpy sort()

In [2]:
import numpy as np
import random
import pandas as pd

### Why are we covering this topic?

#### Historically, students have had difficulty understanding what argsort() does and how to apply it.

So we introduce it here, to help with that understanding.

We will not be doing complex exercises here, but simply introducing the concept.

# argsort() function

Documentation link:  https://numpy.org/doc/stable/reference/generated/numpy.argsort.html

### The np.argsort() function is used to return the indices that would sort an array.

#### So the function returns an array of indexes of the same shape as `a` that index data along the given axis in sorted order (from the documentation).

OK, so what does this mean, in practice?

The function returns an integer array with the same shape as the source array. Its values are the index locations of the source array's elements, listed in the order those elements would take if they were sorted. `argsort()` does not sort the values themselves; it tells us where each sorted value sits in the original array.
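One way to check this description is sketched below with a small made-up array (the names `values` and `order` are illustrative only). Indexing the source array with the result of `np.argsort()` gives the same result as `np.sort()`, and the source array is left unchanged.

```python
import numpy as np

values = np.array([30, 10, 20])   # hypothetical data, not from the homework
order = np.argsort(values)        # index locations, smallest value first

print(order.tolist())             # [1, 2, 0]
print(values[order].tolist())     # [10, 20, 30]  (same as np.sort(values))
print(values.tolist())            # [30, 10, 20]  (original is untouched)
```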

### Admittedly, argsort() is useful only in fairly specific situations, and it is not a function that students will encounter often.

However, in those use cases, it is absolutely the right function/method to use.

Students will encounter a couple of use cases in Homework Notebook 11, and we encourage students to understand what those use cases are, and why argsort() is best for them.

Additionally, a few previous exams (Exam Practice Notebooks) present scenarios in which argsort() is a good solution. It is not the only function that can solve those exercises, but it fits well into a solution.

### A simple example of when argsort() is most appropriate would be when there are two same-sized arrays, call them `A` and `B`, in which the value at each location in `A` matches directly to the value at the corresponding location in `B`. The values themselves in `A` and `B` would have different meanings, in the context of the exercise.

#### In this class, we will generally go from a numpy array to be sorted to a pandas dataframe with the corresponding values.

An example might be `average income` in array `A` and corresponding `zip code` in dataframe `B`.

#### The requirement for the exercise might be to provide the zip code value(s) from dataframe `B` that correspond to the `x` number of highest/lowest income values in array `A`.

To meet this requirement, we would want to extract the index values of the `x` highest/lowest values in array `A` and apply those indices to dataframe `B`, to return the values from dataframe `B`.
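As a minimal sketch of that idea using two plain arrays, the income figures and zip codes below are invented for illustration, and `A`, `B`, and `order` simply follow the naming used above:

```python
import numpy as np

# Hypothetical data: A[i] is the average income for the zip code in B[i].
A = np.array([52000, 91000, 47000, 78000])
B = np.array(['30301', '30305', '30310', '30318'])

order = np.argsort(A)                 # indices of A, lowest income first

lowest_two = B[order[:2]]             # zip codes for the 2 lowest incomes
highest_two = B[order[::-1][:2]]      # zip codes for the 2 highest incomes

print(lowest_two.tolist())            # ['30310', '30301']
print(highest_two.tolist())           # ['30305', '30318']
```

The indices come from `A`, but they are applied to `B`. That only works because the two arrays are aligned position by position.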

There is an example of this in Homework NB11, Exercises 2 and 4. The example code below would be similar to steps in solving those exercises.

### Let's look at a few simple examples for understanding how argsort() functions.

In [3]:
a = np.array([5, 3, 2, 0, 1, 4])
np.argsort(a)

array([3, 4, 2, 1, 5, 0])

OK, so what is this array telling us?

1. The element at index = 3 is the first element in the sorted order (0 is the lowest value).
2. The element at index = 4 is the second element in the sorted order (1 is the next lowest value).
3. The element at index = 2 is the third element in the sorted order (2 is the next lowest value).
4. The element at index = 1 is the fourth element in the sorted order (3 is the next lowest value).
5. The element at index = 5 is the fifth element in the sorted order (4 is the next lowest value).
6. The element at index = 0 is the last element in the sorted order (5 is the highest value).
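The same reading of the result can be produced in code. This is only a sketch; the names `order`, `rank`, and `idx` are chosen here for illustration.

```python
import numpy as np

a = np.array([5, 3, 2, 0, 1, 4])
order = np.argsort(a)

# Walk through the argsort result: each entry is an index into `a`,
# and its position in `order` is that element's place in the sorted order.
for rank, idx in enumerate(order, start=1):
    print(f"sorted position {rank}: index {idx} holds value {a[idx]}")
```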

Does it sort float values in the same manner?

In [4]:
b = np.array([5.0, 3.0, 2.0, 0.0, 1.0, 4.0])
np.argsort(b)

array([3, 4, 2, 1, 5, 0])

What about strings?

In [5]:
c = np.array(['p','m','x','h','a','t'])
np.argsort(c)

array([4, 3, 1, 0, 5, 2])
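For strings, the ordering is lexicographic and based on character codes, so uppercase letters sort before lowercase ones. The short sketch below illustrates this. The array `mixed` is made up for the example.

```python
import numpy as np

c = np.array(['p', 'm', 'x', 'h', 'a', 't'])
print(c[np.argsort(c)].tolist())      # ['a', 'h', 'm', 'p', 't', 'x']

# Hypothetical mixed-case data: 'A' sorts before 'a', which sorts before 'b'.
mixed = np.array(['b', 'A', 'a'])
print(np.argsort(mixed).tolist())     # [1, 2, 0]
```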

Now let's look at a simple example. While it may seem fairly straightforward conceptually, this is the type of exercise that you will see in the homework notebooks and on the exams.

**Requirement:**

What are the three largest values in an array?

Return a numpy array with these three values.

In [6]:
# initialize an array
incomes = np.array([68000,43000,21000,10000,54000,50000,120000,76000,23000,37000])
incomes

array([ 68000,  43000,  21000,  10000,  54000,  50000, 120000,  76000,
        23000,  37000])

In [7]:
np.max(incomes)

np.int64(120000)

Using visual inspection, what are the three largest values?

1. Value = 120000, at index 6.
2. Value = 76000, at index 7.
3. Value = 68000, at index 0.

In [8]:
# using argsort, get indices of the values, arranged in ascending order
np.argsort(incomes)

array([3, 2, 8, 9, 1, 5, 4, 0, 7, 6])

#### Recall slicing of arrays, for the cells below.

We use square brackets to access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:

**x[start:stop:step]**

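If any of these are unspecified, they default to the values `start=0`, `stop=size of dimension`, `step=1`. When `step` is negative, the defaults for `start` and `stop` are swapped, so the slice begins at the last element and runs backward.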
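The slice patterns used in the rest of this notebook can be tried on a simple array first. This is a small sketch; `x` is a made-up array whose values equal their own index positions, which makes the slices easy to read.

```python
import numpy as np

x = np.arange(10)               # [0, 1, 2, ..., 9]

print(x[-3:].tolist())          # last three elements: [7, 8, 9]
print(x[::-1].tolist())         # whole array reversed: [9, 8, ..., 0]
print(x[-1:-4:-1].tolist())     # last three, walking backward: [9, 8, 7]
print(x[::-1][:3].tolist())     # reverse, then take the first three: [9, 8, 7]
```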


Good reference, go down about 1/3rd of the page:  https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html

#### Below are three ways of using slicing notation to return the index values of the three largest values in the array.

None of the three are inherently better than the others, just three different ways of doing it, with one key difference.

The difference is that the **first method** returns the indices in the order that they appear in the `argsort()` result. Because that result runs from the smallest value to the largest, the first method returns the three indices ordered from the smallest to the largest of the top three values.

The **second and third methods** return the indices in descending order of value, largest first.

In [12]:
print(incomes)
np.argsort(incomes)

[ 68000  43000  21000  10000  54000  50000 120000  76000  23000  37000]


array([3, 2, 8, 9, 1, 5, 4, 0, 7, 6])

In [10]:
# return the indices of the three highest values in the array
# In the slice notation, we are telling it to return the last three values of the sort array,
# which are the indexes of the three largest values in the original array.

np.argsort(incomes)[-3::]

array([0, 7, 6])

In [16]:
incomes

array([ 68000,  43000,  21000,  10000,  54000,  50000, 120000,  76000,
        23000,  37000])

In [14]:
np.argsort(incomes)[-1:-4:-1]

array([6, 7, 0])

In [19]:
np.argsort(incomes)[::-1][:3:]   # reverse the ascending order first, then keep the first three (the top 3)

array([6, 7, 0])

### Now let's arrange the indices of the top three in descending order of value.

We are taking the array returned by the first method and using slice notation to reverse the order of the indexes (step = -1).

Note that we are still returning the indexes from the original array.

#### Also note that we don't need to change the second or third method, because they already return the indices in descending order of value.

In [15]:
# array([3, 2, 8, 9, 1, 5, 4, 0, 7, 6])
np.argsort(incomes)[-3::][::-1]


array([6, 7, 0])

In [16]:
np.argsort(incomes)[-1:-4:-1]

array([6, 7, 0])

In [23]:
np.argsort(incomes)[::-1][:3:]

array([6, 7, 0])

#### Finally, let's return the 3 highest values from the original array

Remember from the last step that we are returning, in sorted order, the indexes of the top three values.

So all we are doing now is returning the values at those indexes.

In [24]:
# incomes             = [ 68000,  43000,  21000,  10000,  54000,  50000, 120000,  76000,  23000,  37000]
# np.argsort(incomes) = [3, 2, 8, 9, 1, 5, 4, 0, 7, 6]
# incomes[np.argsort(incomes)[-3::][::-1]]         # gives the same result as the line below
incomes[np.argsort(incomes)[::-1][:3:]]

array([120000,  76000,  68000])

In [28]:
incomes[np.argsort(incomes)[-1:-4:-1]]


array([120000,  76000,  68000])

In [29]:
# All of these are the same as:
incomes[[6,7,0]]

array([120000,  76000,  68000])
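The steps above can be wrapped in a small function. This is only a sketch: `top_n_indices` is a hypothetical helper written for this notebook, not a NumPy function.

```python
import numpy as np

def top_n_indices(values, n):
    """Hypothetical helper: indices of the n largest values, largest first."""
    # argsort gives ascending order; reverse it, then keep the first n.
    return np.argsort(values)[::-1][:n]

incomes = np.array([68000, 43000, 21000, 10000, 54000,
                    50000, 120000, 76000, 23000, 37000])

idx = top_n_indices(incomes, 3)
print(idx.tolist())             # [6, 7, 0]
print(incomes[idx].tolist())    # [120000, 76000, 68000]
```

The reverse-then-slice form is used here because it reads the same for any `n`, including `n = 0`, which returns an empty array.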

## So how would we map these index locations to an array or dataframe?

Let's say that we have a list called `zip_array`, with a series of zip codes.

Our array `incomes` represents the average income in each zip code area.

We want to know the three zip codes with the highest average incomes.

In [30]:
zip_array = ['12345','23456','34567','45678','56789','67890','78901','89012','90123','01234']
column = ['zip_code']
zip_df = pd.DataFrame(data = zip_array,
                  columns = column)

In [31]:
zip_df

  zip_code
0    12345
1    23456
2    34567
3    45678
4    56789
5    67890
6    78901
7    89012
8    90123
9    01234


In [32]:
# positions of the three highest incomes, largest first
positions = np.argsort(incomes)[-1:-4:-1]
positions

array([6, 7, 0])

In [33]:
top_3_zips = zip_df.iloc[positions]
top_3_zips

  zip_code
6    78901
7    89012
0    12345
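To see the zip codes next to the incomes they correspond to, one option is sketched below. The column name `avg_income` is an assumption made for this example and is not part of the homework data.

```python
import numpy as np
import pandas as pd

incomes = np.array([68000, 43000, 21000, 10000, 54000,
                    50000, 120000, 76000, 23000, 37000])
zip_df = pd.DataFrame({'zip_code': ['12345', '23456', '34567', '45678', '56789',
                                    '67890', '78901', '89012', '90123', '01234']})

# Positions of the three highest incomes, largest first.
positions = np.argsort(incomes)[-1:-4:-1]

# .iloc selects rows by position, so the argsort output can be used directly.
# 'avg_income' is a hypothetical column name added for display.
top_3 = zip_df.iloc[positions].assign(avg_income=incomes[positions])
print(top_3)
```

Because `.iloc` is position-based, this works even when the dataframe's index labels are not the default 0, 1, 2, ... sequence, as long as the rows are in the same order as `incomes`.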


### What are your questions on argsort()?

## So what if all you want to do is return the three highest values in an array, and you don't care about their index location, or you don't need to map their index locations to the values in another array?

### Use the Numpy function `sort()` directly on the array, along with the appropriate slicing notation, to return an array of the selected values.

https://numpy.org/doc/2.1/reference/generated/numpy.sort.html

https://www.w3schools.com/python/numpy/numpy_array_sort.asp

#### What are the three largest values in the array, sorted in order from highest to lowest?

In [36]:
np.sort(incomes)[-3::][::-1]

array([120000,  76000,  68000])

In [37]:
np.sort(incomes)[-1:-4:-1]

array([120000,  76000,  68000])

In [38]:
np.sort(incomes)[::-1][:3:]

array([120000,  76000,  68000])
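One detail worth knowing: `np.sort()` returns a sorted copy and leaves the original array alone, while the array method `.sort()` sorts in place and returns `None`. A short sketch, using a copy named `work` (an illustrative name) so that `incomes` stays intact:

```python
import numpy as np

incomes = np.array([68000, 43000, 21000, 10000, 54000,
                    50000, 120000, 76000, 23000, 37000])

sorted_copy = np.sort(incomes)      # new array; incomes is unchanged
print(int(incomes[0]))              # 68000
print(sorted_copy[-3:][::-1].tolist())   # [120000, 76000, 68000]

work = incomes.copy()
result = work.sort()                # sorts `work` in place
print(result)                       # None
print(work[:3].tolist())            # [10000, 21000, 23000]
```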

## What are your questions on argsort() and sort()?