# Meet NumPy

1. **NumPy** is an open-source library that is the universal standard for working with numerical data in Python, and forms the foundation of other libraries like Pandas
2. **Pandas DataFrames** are built on NumPy arrays and can leverage NumPy functions

# NumPy Arrays

**NumPy arrays** are fixed-size containers of items that are more efficient than Python lists or tuples for data processing
1. They only store a single data type (mixed inputs are converted to one common type, e.g. numbers mixed with text are all stored as strings)
2. They can be one-dimensional or multi-dimensional
3. Array elements can be modified, but the array size cannot change
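
To illustrate the single-data-type rule, here is a minimal sketch of how `np.array` settles on one common type when the inputs are mixed. The variable names are made up for this example.

```python
import numpy as np

# Integers mixed with floats are upcast to float
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)        # float64

# Numbers mixed with text are all stored as strings
mixed_text = np.array([1, "two", 3.0])
print(mixed_text.dtype.kind)      # U (a unicode string type)
print(mixed_text)
```

Once the numbers have become strings, arithmetic on them no longer works, which is a good reason to keep each array to one kind of data.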

In [1]:
import numpy as np

In [2]:
sales = [0, 5, 155, 0, 518, 0, 1827, 616, 317, 325]
sales

[0, 5, 155, 0, 518, 0, 1827, 616, 317, 325]

In [3]:
type(sales)

list

> NumPy's **array** function converts Python lists into NumPy arrays

In [4]:
sales_array = np.array(sales)
sales_array

array([   0,    5,  155,    0,  518,    0, 1827,  616,  317,  325])

In [5]:
type(sales_array)

numpy.ndarray

# Array Properties

NumPy arrays have these key properties:

- **ndim** - the number of dimensions (axes) in the array
- **shape** - the size of the array for each dimension
- **size** - the total number of elements in the array
- **dtype** - the data type of the elements in the array

In [6]:
sales = [
    [0, 5, 155, 0, 518],
    [0, 1827, 616, 317, 325]
]

sales

[[0, 5, 155, 0, 518], [0, 1827, 616, 317, 325]]

In [7]:
type(sales)

list

In [8]:
sales_array = np.array(sales)
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [9]:
print(f"ndim: {sales_array.ndim}")
print(f"shape: {sales_array.shape}")
print(f"size: {sales_array.size}")
print(f"dtype: {sales_array.dtype}")

ndim: 2
shape: (2, 5)
size: 10
dtype: int32


> - **ndim: 2**        The sales_array has 2 dimensions
> - **shape: (2, 5)**  The first dimension (rows) has a size of 2 and the second (columns) has a size of 5
> - **size: 10**       It has 10 elements total
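> - **dtype: int32**   The elements are stored as 32-bit integers (the default integer width depends on the platform and NumPy version, so you may see int64 instead)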
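
The `int32` in the output above reflects the machine that produced it. If the integer width matters, one option is to request it explicitly rather than rely on the default. This is a small sketch, and the names `sales_int64` and `sales_float` are illustrative.

```python
import numpy as np

sales = [[0, 5, 155, 0, 518], [0, 1827, 616, 317, 325]]

# Ask for a specific integer width instead of the platform default
sales_int64 = np.array(sales, dtype=np.int64)
print(sales_int64.dtype)    # int64

# astype returns a converted copy; the original array keeps its dtype
sales_float = sales_int64.astype(float)
print(sales_float.dtype)    # float64
print(sales_int64.dtype)    # int64
```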

# EXERCISE: ARRAY BASICS

#### NEW MESSAGE: 
- From: RossRetail (Head of Analytics)
- Subject: NumPy?

`Hi there, welcome to DataCrafters LLC!`

`Your resume mentions you have basic Python experience. Our finance team has been asking us to cut software costs - can you help us dig into Python as an analysis tool?`

`I know NumPy is foundational, but not much beyond that.`

`Can you convert a Python list into a NumPy array and help me get familiar with their properties?`

`Thanks!`

In [10]:
my_list = [x*10 for x in range(1, 11)]
my_array = np.array(my_list)
my_array

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [11]:
print(f"ndim: {my_array.ndim}")
print(f"shape: {my_array.shape}")
print(f"size: {my_array.size}")
print(f"dtype: {my_array.dtype}")

ndim: 1
shape: (10,)
size: 10
dtype: int32


# Array Creation

As an alternative to converting lists, you can **create arrays** using functions:

1. **ones**:     Creates an array of ones of a given size, as float by default
2. **zeros**:    Creates an array of zeros of a given size, as float by default
3. **arange**:   Creates an array of integers with given start & stop values, and a step size (only stop is required and is not inclusive)
4. **linspace**: Creates an array of floats with given start & stop values, with n elements, separated by a consistent step size (stop is inclusive)
5. **reshape**:  Changes an array into the specified dimensions, if compatible

> `np.ones((rows, cols), dtype)`

In [12]:
np.ones(4)

array([1., 1., 1., 1.])

> `np.zeros((rows, cols), dtype)`

In [13]:
np.zeros((2, 5), dtype=int)

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

> `np.arange(start, stop, step) `

In [14]:
np.arange(10, 101, 10)

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

> `np.linspace(start, stop, n)`

In [15]:
np.linspace(0, 100, 5)

array([  0.,  25.,  50.,  75., 100.])

> `array.reshape(rows, cols)`

In [16]:
np.arange(1, 9, 2)

array([1, 3, 5, 7])

In [17]:
np.arange(1, 9, 2).reshape(2, 2)

array([[1, 3],
       [5, 7]])
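
What "if compatible" means for `reshape` can be sketched as follows: the new shape must hold exactly the same number of elements as the original. The variable names here are illustrative.

```python
import numpy as np

values = np.arange(1, 9, 2)       # 4 elements: 1, 3, 5, 7

# -1 asks NumPy to infer that dimension from the array's size
inferred = values.reshape(2, -1)
print(inferred.shape)             # (2, 2)

# 4 elements cannot fill a 3x3 grid, so reshape raises a ValueError
try:
    values.reshape(3, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```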

# Random Number Arrays

You can create **random number arrays** from a variety of distributions using NumPy functions and methods (*great for sampling and simulation!*)

- **default_rng** : Creates a random number generator (*the seed is for reproducibility*)
- **random**      : Returns n random numbers from a uniform distribution between 0 and 1
- **normal**      : Returns n random numbers from a normal distribution with a given mean and standard deviation

> - First, we are creating a random number generator with a seed of 12345 and assigning it to `rng` using `default_rng`
> - Then we are using the `random` method on `rng` to return an array with 10 random numbers

In [18]:
from numpy.random import default_rng

rng = default_rng(12345)

random_array = rng.random(10)
random_array

array([0.22733602, 0.31675834, 0.79736546, 0.67625467, 0.39110955,
       0.33281393, 0.59830875, 0.18673419, 0.67275604, 0.94180287])

> Here we are using the `normal` method on `rng` to return an array with 10 random numbers from a normal distribution with a mean of 5 and a standard deviation of 1.

In [19]:
rng = default_rng(12345)
mean, stddev = 5, 1
random_normal = rng.normal(mean, stddev, size=10)
random_normal

array([3.57617496, 6.26372846, 4.12933826, 4.74082677, 4.92465669,
       4.25911535, 3.6322073 , 5.6488928 , 5.36105811, 3.04713694])

> **PRO TIP**: Even though it's optional, make sure to **set a seed** when generating random numbers to ensure you and others can recreate the work you have done
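
A quick way to see what the seed buys you is to compare two generators built from the same seed. This is a minimal sketch, and the variable names are made up for the example.

```python
import numpy as np
from numpy.random import default_rng

# Two generators created with the same seed produce the same sequence
first_draw = default_rng(12345).random(5)
second_draw = default_rng(12345).random(5)
print(np.array_equal(first_draw, second_draw))   # True

# Without a seed, the generator is seeded from fresh entropy,
# so these values differ from run to run
unseeded_draw = default_rng().random(5)
print(unseeded_draw)
```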

# EXERCISE: ARRAY CREATION

#### NEW MESSAGE: 
- From: RossRetail (Head of Analytics)
- Subject: Array Creation

`Thanks for your help last time – I’m starting to understand
NumPy Arrays!`

`Are there any NumPy functions that can create arrays so we
don’t have to convert from a Python list? Recreate the array
from the first assignment but make it 5 rows by 2 columns.`

`Once you’ve done that, create an array of random numbers
between 0 and 1 in a 3x3 shape. One of our data scientists has
been asking about this so I want to make sure it’s possible.
Thanks!`

In [20]:
my_array = np.linspace(10, 100, 10).reshape(5, 2)
my_array

array([[ 10.,  20.],
       [ 30.,  40.],
       [ 50.,  60.],
       [ 70.,  80.],
       [ 90., 100.]])

In [21]:
from numpy.random import default_rng

rng = default_rng(2024)

random_array = rng.random(9).reshape(3, 3)
random_array

array([[0.67583134, 0.2143232 , 0.30945203],
       [0.7994661 , 0.9958021 , 0.14223182],
       [0.07872553, 0.18082381, 0.35964689]])
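
As an alternative to drawing 9 values and reshaping, the `random` method also accepts a shape tuple. The sketch below compares the two approaches using the same seed as the cell above. The names `direct` and `reshaped` are illustrative.

```python
import numpy as np
from numpy.random import default_rng

# Passing a shape tuple draws the 3x3 array directly
direct = default_rng(2024).random((3, 3))

# Drawing 9 values and reshaping, as in the exercise solution
reshaped = default_rng(2024).random(9).reshape(3, 3)

print(direct.shape)                       # (3, 3)
print(np.array_equal(direct, reshaped))   # True
```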

# Indexing & Slicing Arrays

Indexing & slicing one-dimensional arrays is the same as base Python

- `array[index]`: indexing to access a single element (*0-indexed*)
- `array[start:stop:step_size]`: slicing to access a series of elements (*stop is not inclusive*)

In [22]:
product_array = np.array(['fruits', 'vegetables', 'cereal', 'dairy', 'eggs', 'snacks', 'beverages', 'coffee', 'tea', 'spices'])
product_array

array(['fruits', 'vegetables', 'cereal', 'dairy', 'eggs', 'snacks',
       'beverages', 'coffee', 'tea', 'spices'], dtype='<U10')

In [23]:
print(product_array[1]) # This grabs the second element of product_array

vegetables


In [24]:
print(product_array[-1]) # This grabs the last element of product_array

spices


In [25]:
product_array[:5]  # This grabs the first five elements of product_array

array(['fruits', 'vegetables', 'cereal', 'dairy', 'eggs'], dtype='<U10')

In [26]:
product_array[5::2] # This starts at the sixth element and grabs every other element until the end of product_array

array(['snacks', 'coffee', 'spices'], dtype='<U10')

> Indexing and slicing two-dimensional arrays requires an extra index or slice
> - `array[row index, column index]` - indexing to access a single element (*0-indexed*)
> - `array[start:stop:step_size, start:stop:step_size]` - slicing to access a series of elements

In [27]:
product_array2D = product_array.reshape(2, 5)
product_array2D

array([['fruits', 'vegetables', 'cereal', 'dairy', 'eggs'],
       ['snacks', 'beverages', 'coffee', 'tea', 'spices']], dtype='<U10')

In [28]:
product_array2D[1, 2] # This goes to the second row and grabs the third element

'coffee'

In [29]:
product_array2D[:, 2:] # This goes to all rows and grabs all the elements starting from the third in each row

array([['cereal', 'dairy', 'eggs'],
       ['coffee', 'tea', 'spices']], dtype='<U10')

In [30]:
product_array2D[1:, :] # This takes every row from the second onward (here only the second row) and grabs all its elements

array([['snacks', 'beverages', 'coffee', 'tea', 'spices']], dtype='<U10')
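
Notice the double brackets in the output above. Slicing a row with `1:` keeps the result two-dimensional, while indexing it with `1` drops that dimension. Here is a small sketch using a stand-in array named `grid`, which is not part of the notebook.

```python
import numpy as np

grid = np.arange(10).reshape(2, 5)   # stand-in for any 2x5 array

row_by_index = grid[1, :]    # an integer index drops the row dimension
row_by_slice = grid[1:, :]   # a slice keeps it, even when one row is selected

print(row_by_index.shape)    # (5,)
print(row_by_slice.shape)    # (1, 5)
```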

# EXERCISE: ARRAY ACCESS

#### NEW MESSAGE: 
- From: RossRetail (Head of Analytics)
- Subject: Indexing & Slicing Arrays

`Ok, last ‘theoretical’ exercise before we start working with
real data.`
 
`I am familiar with indexing and slicing in base Python but have
no idea how it works in multiple dimensions.`

`I’ve provided a few different ‘cuts’ of the data in the notebook
– can you slice and dice the random array we created in the
last exercise?`

`Thanks!`

In [31]:
from numpy.random import default_rng

rng = default_rng(2024)

random_array = rng.random(9).reshape(3, 3)
random_array

array([[0.67583134, 0.2143232 , 0.30945203],
       [0.7994661 , 0.9958021 , 0.14223182],
       [0.07872553, 0.18082381, 0.35964689]])

> First two `rows`

In [32]:
random_array[:2,:]

array([[0.67583134, 0.2143232 , 0.30945203],
       [0.7994661 , 0.9958021 , 0.14223182]])

> First `column`

In [33]:
random_array[:, 0]

array([0.67583134, 0.7994661 , 0.07872553])

> Second number in third `row`

In [34]:
random_array[2, 1]

0.18082381369685463

# Array Operations

Arithmetic operators can be used to perform **array operations**

In [35]:
sales = [
    [0, 5, 155, 0, 518],
    [0, 1827, 616, 317, 325]
]

sales_array = np.array(sales)
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [36]:
sales_array + 2  # This adds 2 to every element in the array

array([[   2,    7,  157,    2,  520],
       [   2, 1829,  618,  319,  327]])

In [37]:
quantity = sales_array[0, :]  # This assigns all the elements in the first row to 'quantity'

In [38]:
price = sales_array[1, :]     # This assigns all the elements in the second row to 'price'

In [39]:
quantity * price

array([     0,   9135,  95480,      0, 168350])
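
For comparison, here is a sketch of the same element-wise multiplication written with plain Python lists. The array operator does the pairing for you, while the list version needs an explicit loop. The names `revenue_array` and `revenue_list` are illustrative.

```python
import numpy as np

quantity = np.array([0, 5, 155, 0, 518])
price = np.array([0, 1827, 616, 317, 325])

# Element-wise multiplication with the array operator
revenue_array = quantity * price

# The same calculation with plain Python lists and a comprehension
revenue_list = [q * p for q, p in zip(quantity.tolist(), price.tolist())]

print(revenue_array)
print(revenue_list)    # [0, 9135, 95480, 0, 168350]
```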

# EXERCISE: ARRAY OPERATIONS

#### NEW MESSAGE: 
- From: RossRetail (Head of Analytics)
- Subject: Random Discount

`Ok, so now that we’ve gotten the basics down, we can start
using NumPy for our first tasks. As part of a promotion, we
want to apply a random discount to surprise our customers
and generate social media buzz.`

`First, add a flat shipping cost of 5 to our prices to get the
‘total’ amount owed.`

`The numbers in the random array represent ‘discount
percent’. To get the ‘percent owed’, subtract the first 6
numbers in the random array from 1, then multiply ‘percent
owed’ by ‘total’ to get the final amount owed.`

`Thanks!`

In [40]:
prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

In [41]:
total = prices + 5
total

array([ 10.99,  11.99,  27.49, 104.99,   9.99,  54.99])

In [42]:
random_array

array([[0.67583134, 0.2143232 , 0.30945203],
       [0.7994661 , 0.9958021 , 0.14223182],
       [0.07872553, 0.18082381, 0.35964689]])

In [43]:
random_array = random_array.flatten()

In [44]:
discount_percent = random_array[:6]
discount_percent

array([0.67583134, 0.2143232 , 0.30945203, 0.7994661 , 0.9958021 ,
       0.14223182])

In [45]:
percent_owed = 1 - discount_percent

In [46]:
percent_owed

array([0.32416866, 0.7856768 , 0.69054797, 0.2005339 , 0.0041979 ,
       0.85776818])

In [47]:
final_owed = percent_owed * total  # The message asks for 'percent owed' times 'total', which includes shipping
final_owed.round(2)

array([ 3.56,  9.42, 18.98, 21.05,  0.04, 47.17])

In [48]:
final_owed.round()

array([ 4.,  9., 19., 21.,  0., 47.])

# Filtering Arrays

You can **filter arrays** by indexing them with a logical test.
- Only the array elements in positions where the logical test returns True are returned

In [49]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

> Performing a logical test on a NumPy array
returns a **Boolean array** with the results of the
logical test on each array element

In [50]:
sales_array != 0

array([[False,  True,  True, False,  True],
       [False,  True,  True,  True,  True]])

> Indexing an array with a Boolean array returns an array with
the elements where the Boolean value is **True**

In [51]:
sales_array[sales_array != 0]

array([   5,  155,  518, 1827,  616,  317,  325])

> You can filter arrays with **multiple logical tests**
> - Use `|` for **or** conditions and `&` for **and** conditions

In [52]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [53]:
sales_array[(sales_array == 616) | (sales_array < 100)] # Returns an array with elements equal to 616 or less than 100

array([  0,   5,   0,   0, 616])

In [54]:
sales_array[(sales_array > 100) & (sales_array < 500)] # Returns an array with elements greater than 100 and less than 500

array([155, 317, 325])

In [55]:
mask = (sales_array > 100) & (sales_array < 500)
sales_array[mask]

array([155, 317, 325])

> **PRO TIP**: Store complex filtering criteria in a variable (known as a Boolean mask)
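
Two details about Boolean masks are easy to trip over, so here is a short sketch of both. It assumes the same `sales_array` values used above, and the other names are illustrative. The `~` operator inverts a mask, and Python's `and` keyword cannot stand in for `&`.

```python
import numpy as np

sales_array = np.array([[0, 5, 155, 0, 518], [0, 1827, 616, 317, 325]])

mask = (sales_array > 100) & (sales_array < 500)

# ~ inverts the mask: every element NOT between 100 and 500
outside_range = sales_array[~mask]
print(outside_range)

# `and` tries to reduce each array to a single True/False, which fails
try:
    (sales_array > 100) and (sales_array < 500)
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)   # True
```

The parentheses around each test matter too, because `&` and `|` bind more tightly than comparison operators.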

> You can filter arrays based on **values in other arrays**
> - Use the Boolean array returned from the other array to index the array you want to filter

In [56]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [57]:
sales_array.flatten()

array([   0,    5,  155,    0,  518,    0, 1827,  616,  317,  325])

In [58]:
product_array

array(['fruits', 'vegetables', 'cereal', 'dairy', 'eggs', 'snacks',
       'beverages', 'coffee', 'tea', 'spices'], dtype='<U10')

In [59]:
product_array[sales_array.flatten() > 0] # This returns the elements from product_array where values in sales_array are greater than 0

array(['vegetables', 'cereal', 'eggs', 'beverages', 'coffee', 'tea',
       'spices'], dtype='<U10')

# Modifying Array Values

You can **modify array values** by assigning new ones

In [60]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [61]:
sales_1D = sales_array.flatten()

In [62]:
sales_1D[0] = 25  # This assigns a single value via indexing
sales_1D

array([  25,    5,  155,    0,  518,    0, 1827,  616,  317,  325])

In [63]:
sales_1D[sales_1D == 0] = 5  # This filters the zero values in sales_1D and assigns them a new value of 5
sales_1D

array([  25,    5,  155,    5,  518,    5, 1827,  616,  317,  325])
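
The assignments above change `sales_1D` but not `sales_array`, because `flatten()` returns a copy. A slice behaves differently: it is a view onto the same data. The sketch below shows the difference, using illustrative names `sales_copy` and `first_row`.

```python
import numpy as np

sales_array = np.array([[0, 5, 155, 0, 518], [0, 1827, 616, 317, 325]])

# flatten() returns a copy, so editing it leaves the 2D array untouched
sales_copy = sales_array.flatten()
sales_copy[0] = 25
print(sales_array[0, 0])   # 0

# A slice is a view, so editing it changes the original array as well
first_row = sales_array[0, :]
first_row[0] = 25
print(sales_array[0, 0])   # 25
```

If you want to modify a slice without touching the original, calling `.copy()` on the slice first is the usual approach.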

# The WHERE Function

The **`where()`** NumPy function performs a logical test and returns a given value if the test is True, or another if the test is False

`np.where(logical test, value if True, value if False)`

In [64]:
inventory_array = np.array([12, 102, 18, 0, 0])

In [65]:
small_product_array = product_array[:5]
small_product_array

array(['fruits', 'vegetables', 'cereal', 'dairy', 'eggs'], dtype='<U10')

> If inventory is zero or negative, assign *Out of Stock*, otherwise assign *In Stock*

In [66]:
np.where(inventory_array <=0, "Out of Stock", "In Stock")

array(['In Stock', 'In Stock', 'In Stock', 'Out of Stock', 'Out of Stock'],
      dtype='<U12')

> If inventory is zero or negative, assign *Out of Stock*, otherwise assign the matching value from small_product_array

In [67]:
np.where(inventory_array <= 0, "Out of Stock", small_product_array)

array(['fruits', 'vegetables', 'cereal', 'Out of Stock', 'Out of Stock'],
      dtype='<U12')
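
When there are more than two outcomes, one possible approach is to nest `np.where` calls. The sketch below adds a "Low Stock" label. The threshold of 15 units is a made-up rule for illustration, not something from the notebook.

```python
import numpy as np

inventory_array = np.array([12, 102, 18, 0, 0])

# Hypothetical rule: fewer than 15 units (but more than 0) counts as low stock
stock_status = np.where(
    inventory_array <= 0,
    "Out of Stock",
    np.where(inventory_array < 15, "Low Stock", "In Stock"),
)
print(stock_status)
```

The inner `np.where` is only consulted where the outer test is False, so the conditions read from top to bottom like an if / elif / else chain.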

# EXERCISE: FILTERING ARRAYS

#### NEW MESSAGE: 
- From: RossRetail (Head of Analytics)
- Subject: Subsetting Our Data

`Hey there,`

`We’re working on some more promotions. Can you filter our
product list to only include prices greater than 25?`

`Once you’ve done that, modify your logic to force cola into
the list. Call this array ‘fancy_feast_special’.`

`Finally, we need to modify our shipping logic. Create a new
shipping cost array, but this time if price is greater than 20,
shipping cost is 0, otherwise shipping cost is 5.`

`Thanks!`

In [68]:
products = np.array(
    ["salad", "bread", "mustard", "rare tomato", "cola", "gourmet ice cream"]
)

products

array(['salad', 'bread', 'mustard', 'rare tomato', 'cola',
       'gourmet ice cream'], dtype='<U17')

In [69]:
prices

array([ 5.99,  6.99, 22.49, 99.99,  4.99, 49.99])

In [70]:
products[prices > 25]

array(['rare tomato', 'gourmet ice cream'], dtype='<U17')

In [71]:
mask = (prices > 25) | (products == 'cola')

fancy_feast_special = products[mask]
fancy_feast_special

array(['rare tomato', 'cola', 'gourmet ice cream'], dtype='<U17')

In [72]:
shipping_cost = np.where(prices > 20, 0, 5)
shipping_cost

array([5, 5, 0, 0, 5, 0])

# Array Aggregation Methods

**Array aggregation methods** let you calculate metrics like sum, mean, and max

In [73]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [74]:
sales_array.sum() # Returns the sum of all values in an array

3763

In [75]:
sales_array.max() # Returns the largest value in an array

1827

In [76]:
sales_array.mean() # Returns the average of the values in an array

376.3

In [77]:
sales_array.min() # Returns the smallest value in an array

0

> You can also aggregate across **rows** or **columns**

In [78]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [79]:
sales_array.sum()

3763

In [80]:
sales_array.sum(axis=0) # Collapses the rows, returning one sum per column

array([   0, 1832,  771,  317,  843])

In [81]:
sales_array.sum(axis=1) # Collapses the columns, returning one sum per row

array([ 678, 3085])
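
A helpful way to remember the `axis` argument is that it names the dimension that disappears from the result. This sketch checks the shapes before and after. The names `column_totals` and `row_totals` are illustrative.

```python
import numpy as np

sales_array = np.array([[0, 5, 155, 0, 518], [0, 1827, 616, 317, 325]])

column_totals = sales_array.sum(axis=0)   # the 2 rows are collapsed
row_totals = sales_array.sum(axis=1)      # the 5 columns are collapsed

print(sales_array.shape)     # (2, 5)
print(column_totals.shape)   # (5,)
print(row_totals.shape)      # (2,)
```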

# Array Functions

**Array functions** let you perform other aggregations like median and percentiles

> You can also return a sorted array of the **unique** values or the **square root** of each number

In [82]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [83]:
np.unique(sales_array)

array([   0,    5,  155,  317,  325,  518,  616, 1827])

In [84]:
np.sqrt(sales_array)

array([[ 0.        ,  2.23606798, 12.4498996 ,  0.        , 22.75961335],
       [ 0.        , 42.74342055, 24.81934729, 17.80449381, 18.02775638]])

In [85]:
np.median(sales_array) # Returns the median value in an array

236.0

In [86]:
np.percentile(sales_array, 50)

236.0

In [87]:
np.percentile(sales_array, 90)

737.0999999999996
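
`np.unique` can also report how often each value occurs through its `return_counts` argument. The sketch below applies it to the same sales figures. The names `values` and `counts` are illustrative.

```python
import numpy as np

sales_array = np.array([[0, 5, 155, 0, 518], [0, 1827, 616, 317, 325]])

# return_counts=True also returns how many times each unique value appears
values, counts = np.unique(sales_array, return_counts=True)

for value, count in zip(values, counts):
    print(value, count)
```

Here the zero appears three times and every other value once, which is a quick way to spot repeated entries.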

# Sorting Arrays

The `sort()` method will **sort arrays** in place
- Use the axis argument to specify the dimension to sort by

In [88]:
sales_array

array([[   0,    5,  155,    0,  518],
       [   0, 1827,  616,  317,  325]])

In [89]:
sales_array.sort()  # axis=-1 (the last axis) by default, so each row of a 2D array is sorted
sales_array

array([[   0,    0,    5,  155,  518],
       [   0,  317,  325,  616, 1827]])

In [90]:
sales_array.sort(axis=0)  # axis=0 sorts the values within each column; nothing moves here because every column is already in order
sales_array

array([[   0,    0,    5,  155,  518],
       [   0,  317,  325,  616, 1827]])

In [91]:
sales_array = np.array([
        [0,    5,  616,    0,  518],
        [   0, 1827,  155,  317,  325]
])

In [92]:
sales_array

array([[   0,    5,  616,    0,  518],
       [   0, 1827,  155,  317,  325]])

In [93]:
sales_array.sort(axis=0)  # axis=0 sorts the values within each column
sales_array

array([[   0,    5,  155,    0,  325],
       [   0, 1827,  616,  317,  518]])
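
Because the `sort()` method overwrites the array, it can be useful to know two alternatives that leave the original order alone. The sketch below uses `np.sort` and `np.argsort`, with the product and price values taken from the exercises in this notebook. The other names are illustrative.

```python
import numpy as np

products = np.array(
    ["salad", "bread", "mustard", "rare tomato", "cola", "gourmet ice cream"]
)
prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

# np.sort returns a sorted copy; prices keeps its original order
sorted_prices = np.sort(prices)
print(prices[0])              # 5.99

# Reversing the sorted copy gives descending order
descending_prices = sorted_prices[::-1]
print(descending_prices[0])   # 99.99

# argsort returns the positions that would sort prices,
# which can then reorder a related array such as products
order = np.argsort(prices)
print(products[order])
```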

# EXERCISE: SORTING AND AGGREGATING

#### NEW MESSAGE: 
- From: RossRetail (Head of Analytics)
- Subject: Top Tier Products

`Hey there,`

`Thanks for all your hard work. I know we’re working with
small sample sizes, but we’re proving that analysis can be
done in Python!`

`Can you calculate the mean, min, max, and median of our 3
most expensive product prices? Sorting the array first should
help!`

`Then, calculate the number of unique price tiers we have.
Thanks!`

In [94]:
prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

prices

array([ 5.99,  6.99, 22.49, 99.99,  4.99, 49.99])

In [95]:
prices.sort()

In [96]:
prices

array([ 4.99,  5.99,  6.99, 22.49, 49.99, 99.99])

In [97]:
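high_priced = prices[:2:-1]  # Steps backward from the last element and stops before index 2, giving the top 3 prices in descending order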
high_priced

array([99.99, 49.99, 22.49])

In [98]:
print("High Prices Mean:", np.mean(high_priced))
print("High Prices Min:", np.min(high_priced))
print("High Prices Max:", np.max(high_priced))
print("High Prices Median:", np.median(high_priced))

High Prices Mean: 57.49
High Prices Min: 22.49
High Prices Max: 99.99
High Prices Median: 49.99


In [99]:
price_tiers = np.array(["budget", "budget", "mid-tier", "luxury", "mid-tier", "luxury"])

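In [100]:
np.unique(price_tiers)  # The distinct tiers, sorted alphabetically

array(['budget', 'luxury', 'mid-tier'], dtype='<U8')

In [101]:
np.unique(price_tiers).size  # The number of unique price tiers requested in the message

3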