<img src="numpy_logo.png" width="150" class="center" >

**In this very mini tutorial** we are going to talk about how efficient **NumPy** is.

- Should we use NumPy?
- Why should we use NumPy?
- Should we use it instead of Python lists for data preprocessing, or even in our data pipelines?
- How does NumPy work behind the scenes?

My approach is to learn why NumPy is efficient by comparing it with another data structure.
Then we test both and see how they behave.
In a very simple example we'll see the miracle of **NumPy**.

---

# Curving The Test Grades

Imagine you are a university teacher or TA. You gave a test, the students did badly, and the grades came out far below the average you expected for its difficulty.
You need to curve the grades so that their average moves close to that expected average.

**So we are going to do this in two ways, with a Python list and with a NumPy array, and figure out which one is more efficient.**

In [34]:
import numpy as np
GRADES_SIZE = 100000

---

## List Version

First we create a random list of grades.
In real-world projects we deal with thousands or millions of rows of data, so we don't want to run our comparison on 50 or 100 grades. That is why I create a list of 100,000 grades. (I know that 100,000 students is impossible for a class.)

In [39]:
grades_list = [round((np.random.rand()*100), 2) for i in range(GRADES_SIZE)]

In [40]:
np.mean(grades_list)

50.05780529999999

I define a function that takes the **grades list** and **the curve center, our desired difficulty average**.
It returns the new list of grades and the new average.

In [41]:
def curve_grades_list(grades, curve_center):
    """
    This function takes a list of grades and a curve center, 
    and returns the new grades list and new average after curving to the center.
    """
    average = sum(grades) / len(grades)
    change = curve_center - average
    new_grades = [i+change for i in grades]
    
    # set minimum grade to 0 and the maximum grade to 100
    for i in range(len(new_grades)):
        if new_grades[i] > 100:
            new_grades[i] = 100
        elif new_grades[i] < 0:
            new_grades[i] = 0
        
    new_avg = sum(new_grades) / len(new_grades) 
    
    return new_grades, new_avg

In [42]:
%timeit curve_grades_list(grades=grades_list, curve_center=75)

14.9 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


---

## NumPy Version

In [43]:
grades_array = np.random.randint(0,101, GRADES_SIZE)

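def curve(grades, curve_center):
    """Shift grades (a NumPy array) toward curve_center and cap them at 100."""
    average = grades.mean()
    change = curve_center - average
    new_grades = grades + change  # the scalar is broadcast to every element

    # Lower bound is the original grade, so curving never lowers a score
    # (the list version clamps at 0 instead); upper bound is 100.
    return np.clip(new_grades, grades, 100)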

In [44]:
%timeit curve(grades=grades_array, curve_center=75)

529 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


That is what I wanted to see.
**NumPy is about 28 times faster than the Python list (14.9 ms vs. 529 µs).**
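
A speed comparison only means something if both versions compute the same thing. The sketch below is one way to check that: it runs a compact list version and a compact array version on the same input and compares the results. The names `curve_list` and `curve_array` are hypothetical rewrites for illustration, not the functions timed above; both clamp to the range 0 to 100, and the seed and input size are arbitrary.

```python
import numpy as np

def curve_list(grades, curve_center):
    # Pure-Python version: shift every grade, then clamp to [0, 100].
    change = curve_center - sum(grades) / len(grades)
    return [min(max(g + change, 0), 100) for g in grades]

def curve_array(grades, curve_center):
    # NumPy version: the same shift and clamp, without an explicit loop.
    change = curve_center - grades.mean()
    return np.clip(grades + change, 0, 100)

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
grades = rng.uniform(0, 100, size=1000)  # one shared input for both versions

from_list = curve_list(grades.tolist(), 75)
from_array = curve_array(grades, 75)

# allclose tolerates the tiny floating-point differences between the two sums.
same = np.allclose(from_list, from_array)
print(same)  # True
```

Feeding both versions the same data also removes a source of noise from the timing, since a list of floats and an array of integers are not quite the same workload.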

We take advantage of two important concepts at once:

- Vectorization
- Broadcasting

**Vectorization** is the process of performing the same operation in the same way for each element in an array. This removes `for` loops from your code but achieves the same result.
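
As a small illustration of vectorization, the sketch below adds a bonus to a few grades twice: once with an explicit loop and once with a single array expression. The values and the name `bonus` are made up for the example.

```python
import numpy as np

grades = np.array([40.0, 55.5, 62.0, 71.25])
bonus = 5.0

# Explicit loop: visit each element in Python and fill the result one by one.
looped = np.empty_like(grades)
for i in range(len(grades)):
    looped[i] = grades[i] + bonus

# Vectorized: one expression; the loop runs inside NumPy's compiled code.
vectorized = grades + bonus

print(vectorized.tolist())                 # [45.0, 60.5, 67.0, 76.25]
print(np.array_equal(looped, vectorized))  # True
```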

**Broadcasting** is the process of extending two arrays of different shapes and figuring out how to perform a vectorized calculation between them. Remember, `grades` is an array of numbers of shape `(100000,)` and `change` is a **scalar**, a single number with shape `()`. In this case, NumPy adds the scalar to each item in the array and returns a new array with the results.
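
A minimal sketch of broadcasting on a tiny array, with made-up values. The first part mirrors the array-plus-scalar case from the curve function; the second part, with the hypothetical `per_exam_bonus` column, shows two arrays of different shapes being combined.

```python
import numpy as np

grades = np.array([40, 55, 62, 71])
change = 12.5  # a plain Python float, i.e. a scalar

print(grades.shape)      # (4,)
print(np.shape(change))  # ()

# The scalar is stretched to match the array, element by element.
curved = grades + change
print(curved.shape)      # (4,)
print(curved.tolist())   # [52.5, 67.5, 74.5, 83.5]

# Two arrays with different shapes: (3, 1) and (4,) broadcast to (3, 4).
per_exam_bonus = np.array([[0.0], [2.0], [5.0]])
table = grades + per_exam_bonus
print(table.shape)       # (3, 4)
```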

Because of these two important features we get **fewer loops**, **clearer code**, **faster execution times**, **less memory use**, and so on.
So if you can use NumPy instead of a list, use it.
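
The memory point can be checked roughly with the sketch below. It assumes CPython, where a list holds pointers to separate float objects, and it uses `sys.getsizeof`, which gives an estimate rather than an exact accounting. The data is made up.

```python
import sys
import numpy as np

n = 100_000
grades_list = [float(i % 101) for i in range(n)]
grades_array = np.array(grades_list)  # dtype float64, 8 bytes per element

# A list stores pointers; each float is a separate Python object on top of that.
list_bytes = sys.getsizeof(grades_list) + sum(sys.getsizeof(g) for g in grades_list)
array_bytes = grades_array.nbytes     # size of the raw data buffer only

print(array_bytes)               # 800000
print(list_bytes > array_bytes)  # True
```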

Use **NumPy's built-in methods** and don't get creative by reimplementing them yourself.
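
As an example of that advice, the sketch below clamps a few made-up grades to the range 0 to 100 twice: with a hand-written loop and with `np.clip`. Both give the same result; the built-in is shorter and keeps the work inside NumPy.

```python
import numpy as np

grades = np.array([-5.0, 30.0, 99.5, 120.0])

# Hand-written clamp: correct, but every element is handled in Python.
manual = grades.copy()
for i in range(len(manual)):
    if manual[i] > 100:
        manual[i] = 100
    elif manual[i] < 0:
        manual[i] = 0

# Built-in: the same clamp as a single call.
clipped = np.clip(grades, 0, 100)

print(clipped.tolist())                 # [0.0, 30.0, 99.5, 100.0]
print(np.array_equal(manual, clipped))  # True
```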