<a href="https://colab.research.google.com/github/mupungijose-hue/Data-Analysis-Projects/blob/main/section01_NumPy_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Package Imports

## NumPy Cheat Sheet

This cheat sheet summarizes the NumPy concepts and operations demonstrated in the notebook.

### 1. Array Creation

*   **From Python List**: Convert a Python list into a NumPy array.
    ```python
    import numpy as np
    my_list = [10, 20, 30]
    my_array = np.array(my_list)
    # Output: array([10, 20, 30])
    ```

*   **Using `arange()`**: Create an array with a range of evenly spaced values.
    ```python
    # np.arange(start, stop, step).reshape(rows, cols); stop is excluded, hence 101
    my_array = np.arange(10, 101, 10).reshape(5, 2)
    # Output:
    # array([[ 10,  20],
    #        [ 30,  40],
    #        [ 50,  60],
    #        [ 70,  80],
    #        [ 90, 100]])
    ```

*   **Using `linspace()`**: Create an array with a specified number of evenly spaced values over an interval.
    ```python
    # np.linspace(start, stop, num_elements).reshape(rows, cols); stop is included
    my_array = np.linspace(10, 100, 10).reshape(5, 2)
    # Output:
    # array([[ 10.,  20.],
    #        [ 30.,  40.],
    #        [ 50.,  60.],
    #        [ 70.,  80.],
    #        [ 90., 100.]])
    ```

*   **Random Arrays**: Generate arrays of random numbers.
    ```python
    rng = np.random.default_rng(2022) # Set a random seed for reproducibility
    random_array = rng.random(9).reshape(3, 3)
    # Output: 3x3 array of random floats between 0 and 1
    ```
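
The difference between `arange()` and `linspace()` is easy to miss, so here is a minimal sketch comparing the two on the same values used above. The variable names `from_arange` and `from_linspace` are illustrative only.

```python
import numpy as np

# arange excludes the stop value, so stop must lie past the last value wanted
from_arange = np.arange(10, 101, 10)

# linspace includes the stop value by default and returns floats
from_linspace = np.linspace(10, 100, 10)

print(np.arange(10, 100, 10)[-1])  # 90: a stop of 100 would drop the last value
print(from_arange.dtype)           # an integer dtype (int64 on most platforms)
print(from_linspace.dtype)         # float64
print(np.array_equal(from_arange, from_linspace))  # True: same values, different dtypes
```

In short, `arange()` is driven by the step size and `linspace()` by the number of elements.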

### 2. Array Properties

*   **`.ndim`**: Number of array dimensions.
    ```python
    my_array.ndim # For a 5x2 array, Output: 2
    ```

*   **`.shape`**: Tuple of array dimensions (rows, columns, etc.).
    ```python
    my_array.shape # For a 5x2 array, Output: (5, 2)
    ```

*   **`.size`**: Total number of elements in the array.
    ```python
    my_array.size # For a 5x2 array, Output: 10
    ```

*   **`.dtype`**: Data type of the elements in the array.
    ```python
    my_array.dtype # Output: dtype('int64') if built with arange, dtype('float64') if built with linspace
    ```
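
As a quick illustration of how these properties relate, the sketch below reshapes a 1-D array and compares the properties before and after. The names `flat` and `grid` are illustrative, not from the notebook.

```python
import numpy as np

flat = np.arange(10, 101, 10)  # 1-D array of 10 elements
grid = flat.reshape(5, 2)      # same data arranged as 5 rows x 2 columns

print(flat.ndim, flat.shape, flat.size)  # 1 (10,) 10
print(grid.ndim, grid.shape, grid.size)  # 2 (5, 2) 10
print(grid.astype(float).dtype)          # float64
```

Reshaping changes `.ndim` and `.shape` but never `.size`, which is why the target shape must multiply out to the original number of elements.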

### 3. Accessing and Slicing Arrays

Using a 3x3 `random_array` as an example:

*   **Grab the first two rows**:
    ```python
    random_array[:2, :]
    # Output: The first two rows of the array
    ```

*   **Grab the entire first column**:
    ```python
    random_array[:, 0]
    # Output: The first column of the array
    ```

*   **Grab a specific element (e.g., second element of the third row)**:
    ```python
    random_array[2, 1] # Remember 0-based indexing
    # Output: A single scalar value
    ```
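
Because the values in `random_array` are not easy to eyeball, here is a small sketch of the same three selections on a predictable array. The array `grid` is a stand-in chosen for illustration.

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)  # rows: [0 1 2], [3 4 5], [6 7 8]

first_two_rows = grid[:2, :]  # 2-D result with shape (2, 3)
first_column = grid[:, 0]     # 1-D result with shape (3,)
one_element = grid[2, 1]      # scalar: third row, second column

print(first_two_rows.shape)  # (2, 3)
print(first_column)          # [0 3 6]
print(one_element)           # 7
```

A slice keeps the dimension it is applied to, while a single integer index drops it.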

### 4. Arithmetic Operations

NumPy arrays support element-wise arithmetic operations.

*   **Addition (broadcasting)**:
    ```python
    prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])
    total = prices + 5 # Add 5 to each element
    # Output: array([ 10.99,  11.99,  27.49, 104.99,   9.99,  54.99])
    ```

*   **Multiplication (element-wise)**:
    ```python
    discount_pct = random_array[:2, :].reshape(6) # First six random values as a 1-D array
    pct_owed = 1 - discount_pct # Subtract from 1 element-wise
    # Output: array([0.75257394, 0.90700994, 0.38823663, 0.93933793, 0.33896657, 0.24484222])

    final_owed = total * pct_owed # Element-wise multiplication
    # Output (rounded to 2 decimals): array([ 8.27, 10.88, 10.67, 98.62,  3.39, 13.46])
    ```

*   **Rounding**:
    ```python
    final_owed.round(2)
    # Output: array([ 8.27, 10.88, 10.67, 98.62,  3.39, 13.46])
    ```
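
The sketch below shows, under simplified assumptions, what broadcasting a scalar means and how two same-shaped arrays combine. The `discounts` values are hypothetical and are not the random discounts from the notebook.

```python
import numpy as np

prices = np.array([5.99, 6.99, 22.49])

# A scalar is broadcast across the array, as if it were repeated per element
with_scalar = prices + 5
with_array = prices + np.array([5, 5, 5])
print(np.array_equal(with_scalar, with_array))  # True

# Arrays of the same shape are combined position by position
discounts = np.array([0.2, 0.1, 0.25])  # hypothetical discount rates
owed = (with_scalar * (1 - discounts)).round(2)
print(owed)  # approximately 8.79, 10.79, 20.62
```

No Python loop is needed in either case; NumPy applies the operation to every element at once.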

### 5. Filtering Arrays (Boolean Indexing)

*   **Filter based on a condition**:
    ```python
    products = np.array(['salad', 'bread', 'mustard', 'rare tomato', 'cola', 'gourmet ice cream'])
    prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

    expensive_products = products[prices > 25]
    # Output: array(['rare tomato', 'gourmet ice cream'], dtype='<U17')
    ```

*   **Combine multiple conditions (OR operator `|`)**:
    ```python
    mask = (prices > 25) | (products == "cola")
    fancy_feast_special = products[mask]
    # Output: array(['rare tomato', 'cola', 'gourmet ice cream'], dtype='<U17')
    ```

*   **Conditional assignment with `np.where()`**:
    ```python
    # np.where(condition, value_if_true, value_if_false)
    shipping = np.where(prices > 20, 0, 5)
    # Output: array([5, 5, 0, 0, 5, 0])
    ```
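
Beyond `|`, masks can also be combined with `&` (AND) and inverted with `~` (NOT). The following is a minimal sketch on the same `products` and `prices` arrays; the names `mid_range` and `not_expensive` are illustrative.

```python
import numpy as np

products = np.array(["salad", "bread", "mustard", "rare tomato", "cola", "gourmet ice cream"])
prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

# AND (&): both conditions must hold; each comparison needs its own parentheses
mid_range = products[(prices > 5) & (prices < 50)]
print(mid_range)  # salad, bread, mustard, gourmet ice cream

# NOT (~): invert a Boolean mask
not_expensive = products[~(prices > 25)]
print(not_expensive)  # salad, bread, mustard, cola
```

The parentheses matter because `&`, `|`, and `~` bind more tightly than comparison operators in Python.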

### 6. Aggregating and Sorting Arrays

*   **Sorting an array in-place**:
    ```python
    prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])
    prices.sort()
    # prices is now: array([ 4.99,  5.99,  6.99, 22.49, 49.99, 99.99])
    ```

*   **Slicing after sorting to get top N values**:
    ```python
    top3 = prices[-3:] # Grab the last 3 (highest) elements
    # Output: array([22.49, 49.99, 99.99])
    ```

*   **Aggregation functions**:
    ```python
    top3.mean()    # Output: 57.49
    top3.min()     # Output: 22.49
    top3.max()     # Output: 99.99
    np.median(top3) # Output: 49.99 (arrays have no .median() method, so use np.median)
    ```

*   **Unique values**: Get the unique elements from an array.
    ```python
    price_tiers = np.array(["budget", "budget", "mid-tier", "luxury", "mid-tier", "luxury"])
    np.unique(price_tiers)
    # Output: array(['budget', 'luxury', 'mid-tier'], dtype='<U8')
    ```
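
Two related variations may be useful here: `np.sort()` returns a sorted copy instead of sorting in place, and `np.unique()` can also count occurrences through its `return_counts` parameter. This is a sketch with illustrative variable names.

```python
import numpy as np

prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

sorted_copy = np.sort(prices)      # new sorted array; prices keeps its order
print(prices[0])                   # 5.99
top3_desc = sorted_copy[::-1][:3]  # highest three, largest first
print(top3_desc)                   # 99.99, 49.99, 22.49

price_tiers = np.array(["budget", "budget", "mid-tier", "luxury", "mid-tier", "luxury"])
tiers, counts = np.unique(price_tiers, return_counts=True)
print(len(tiers))                                  # 3 unique tiers
print(dict(zip(tiers.tolist(), counts.tolist())))  # {'budget': 2, 'luxury': 2, 'mid-tier': 2}
```

Use the in-place `.sort()` when the original order is no longer needed, and `np.sort()` when other arrays still rely on it for alignment.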

### 7. Combining Concepts (Example: Filtering, Sampling, and Categorization)

*   **Filtering by condition**: Select specific rows based on another array's values.
    ```python
    # Assuming family_array and sales_array are defined
    produce_array = sales_array[family_array == "PRODUCE"]
    ```

*   **Random sampling**: Select elements from an array based on random numbers.
    ```python
    rng = np.random.default_rng(2022)
    random_numbers = rng.random(len(produce_array))
    sampled_array = produce_array[random_numbers < 0.5]
    # Reporting mean and median after sampling
    # sampled_array.mean()
    # np.median(sampled_array)
    ```

*   **Categorizing elements based on multiple conditions**: Nested `np.where()` can create categories.
    ```python
    mean_val = sampled_array.mean()
    median_val = np.median(sampled_array)
    categories = np.where(
        sampled_array < median_val,
        "below_both",
        np.where(sampled_array > mean_val, "above_both", "above_median")
    )
    # Output: An array with 'above_both', 'above_median', or 'below_both'
    ```
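
Nested `np.where()` calls become hard to read as categories grow. One alternative, which the notebook does not use, is `np.select()`. The sketch below compares the two on a small hypothetical `sample` array rather than the real sales data.

```python
import numpy as np

sample = np.array([120.0, 300.0, 950.0, 45.0, 610.0])  # hypothetical sales
mean_val = sample.mean()        # 405.0
median_val = np.median(sample)  # 300.0

# Nested np.where, as in the cheat sheet
nested = np.where(
    sample < median_val,
    "below_both",
    np.where(sample > mean_val, "above_both", "above_median"),
)

# np.select checks conditions in order and uses the first one that matches
selected = np.select(
    [sample < median_val, sample > mean_val],
    ["below_both", "above_both"],
    default="above_median",
)

print(nested)
print(np.array_equal(nested, selected))  # True
```

Both forms assume the mean is at least as large as the median, as it is here; otherwise a value could sit above the mean yet below the median and be labelled `below_both`.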

In [None]:
import numpy as np

# Assignment 1: Array Basics

Hi there,

Can you import NumPy and convert the following list comprehension (I just learned about comprehensions in an awesome course by Maven) into an array?

Once you've done that, report the following about the array:
* The number of dimensions
* The shape
* The number of elements in the array
* The type of data contained inside

In [None]:
my_list = [x * 10 for x in range(1, 11)]

my_array = np.array(my_list).reshape(5, 2)

my_array

array([[ 10,  20],
       [ 30,  40],
       [ 50,  60],
       [ 70,  80],
       [ 90, 100]])

In [None]:
my_array.ndim

2

In [None]:
my_array.shape

(5, 2)

In [None]:
my_array.size

10

In [None]:
my_array.dtype

dtype('int64')

# Assignment 2: Array Creation

Thanks for your help with the first piece - I'm starting to understand some of the key differences between base Python data types and NumPy arrays.

Does NumPy have anything like the range() function from base Python?

If so:
* Create the same array from Assignment 1 using a NumPy function.
* Make it 5 rows and 2 columns.
* It's ok if the datatype is float or int.

In [None]:
# With arange

my_array = np.arange(10, 101, 10).reshape(5, 2)

my_array

array([[ 10,  20],
       [ 30,  40],
       [ 50,  60],
       [ 70,  80],
       [ 90, 100]])

In [None]:
# With linspace

my_array = np.linspace(10, 100, 10).reshape(5, 2)

my_array

array([[ 10.,  20.],
       [ 30.,  40.],
       [ 50.,  60.],
       [ 70.,  80.],
       [ 90., 100.]])

In [None]:
# For fun: Use Array math to create multiples of 10 from single digit integers

my_array = (np.arange(1, 11) * 10).reshape(5, 2)

my_array

array([[ 10,  20],
       [ 30,  40],
       [ 50,  60],
       [ 70,  80],
       [ 90, 100]])

Looking good so far! One of our data scientists asked about random number generation in NumPy.

Can you create a 3x3 array of random numbers between 0 and 1? Use a random state of 2022.

Store the random array in a variable called `random_array`.

In [None]:
rng = np.random.default_rng(2022)

random_array = rng.random(9).reshape(3, 3)

random_array

array([[0.24742606, 0.09299006, 0.61176337],
       [0.06066207, 0.66103343, 0.75515778],
       [0.1108689 , 0.04305584, 0.41441747]])

# Assignment 3: Accessing Array Data


Slice and index the `random_array` we created in the previous exercise. Perform the following:

* Grab the first two 'rows' of the array
* Grab the entire first column
* Finally, grab the second element of the third row.

Thanks!


In [None]:
random_array

array([[0.24742606, 0.09299006, 0.61176337],
       [0.06066207, 0.66103343, 0.75515778],
       [0.1108689 , 0.04305584, 0.41441747]])

In [None]:
random_array[:2, :]

array([[0.24742606, 0.09299006, 0.61176337],
       [0.06066207, 0.66103343, 0.75515778]])

In [None]:
random_array[:, 0]

array([0.24742606, 0.06066207, 0.1108689 ])

In [None]:
random_array[2, 1]

0.04305584439252108

# Assignment 4: Arithmetic Operations

The creativity of our marketing team knows no bounds!

They've asked us to come up with a simple algorithm to provide a random discount to our list of prices below.

Before we do that,

* Add a 5 dollar shipping fee to each price. Call this array `total`.

Once we have that, we want to apply the random numbers in `random_array` (created in Assignment 2) to the 6 prices as discounts.

* Grab the first 6 numbers from `random_array` and reshape them to one dimension. Call this `discount_pct`.
* Subtract `discount_pct` FROM 1 and store the result in `pct_owed`.
* Multiply `pct_owed` by `total` to get the final amount owed.

In [None]:
prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

total = prices + 5

total

array([ 10.99,  11.99,  27.49, 104.99,   9.99,  54.99])

In [None]:
discount_pct = random_array[:2, :].reshape(6)

pct_owed = 1 - discount_pct

final_owed = total * pct_owed

final_owed.round(2)


array([ 8.27, 10.88, 10.67, 98.62,  3.39, 13.46])

In [None]:
((1 - (random_array[:2, :].reshape(6))) * total).round(2)

array([ 8.27, 10.88, 10.67, 98.62,  3.39, 13.46])

In [None]:
print(discount_pct)
print(pct_owed)
print(final_owed.round(2))

[0.24742606 0.09299006 0.61176337 0.06066207 0.66103343 0.75515778]
[0.75257394 0.90700994 0.38823663 0.93933793 0.33896657 0.24484222]
[ 8.27 10.88 10.67 98.62  3.39 13.46]


# Assignment 5: Filtering Arrays

Filter the product array to only include those with prices greater than 25.

Modify your logic to include cola, despite it not having a price greater than 25.
Store the elements returned in an array called `fancy_feast_special`.

Next, create a shipping cost array where the cost is 0 if price is greater than 20, and 5 if not.

In [None]:
products = np.array(
    ["salad", "bread", "mustard", "rare tomato", "cola", "gourmet ice cream"]
)

products

array(['salad', 'bread', 'mustard', 'rare tomato', 'cola',
       'gourmet ice cream'], dtype='<U17')

In [None]:
products[prices > 25]

array(['rare tomato', 'gourmet ice cream'], dtype='<U17')

In [None]:
mask = (prices > 25) | (products == "cola")

fancy_feast_special = products[mask]

fancy_feast_special

array(['rare tomato', 'cola', 'gourmet ice cream'], dtype='<U17')

In [None]:
shipping = np.where(prices > 20, 0, 5)

shipping

array([5, 5, 0, 0, 5, 0])

# Assignment 6: Aggregating and Sorting Arrays

First, grab the top 3 highest priced items in our list.

Then, calculate the mean, min, max, and median of the top three prices.

Finally, calculate the number of unique price tiers in our `price_tiers` array.

In [None]:
prices = np.array([5.99, 6.99, 22.49, 99.99, 4.99, 49.99])

prices.sort()

In [None]:
prices

array([ 4.99,  5.99,  6.99, 22.49, 49.99, 99.99])

In [None]:
top3 = prices[-3:]

In [None]:
print(f"Mean: {top3.mean()}")
print(f"Min: {top3.min()}")
print(f"Max: {top3.max()}")
print(f"Median: {np.median(top3)}")

Mean: 57.49
Min: 22.49
Max: 99.99
Median: 49.99


In [None]:
price_tiers = np.array(["budget", "budget", "mid-tier", "luxury", "mid-tier", "luxury"])

In [None]:
np.unique(price_tiers)

array(['budget', 'luxury', 'mid-tier'], dtype='<U8')

# Assignment 7: Bringing it All Together

Ok, final NumPy task - let's read in some data with the help of Pandas.

Our data scientist provided the code to read in a CSV file as a Pandas DataFrame, and has converted the two columns of interest to arrays.

* Filter `sales_array` down to only sales where the product family was produce.

* Then, randomly sample roughly half (random number < .5) of the produce sales and report the mean and median sales. Use a random seed of 2022.

* Finally, create a new array with the values 'above_both', 'above_median', and 'below_both', depending on whether each sale was above both the median and mean of the sample, above only the median, or below both.

In [None]:
import pandas as pd
import numpy as np

retail_df = pd.read_csv(
    "../retail/retail_2016_2017.csv", skiprows=range(1, 11000), nrows=1000
)

family_array = np.array(retail_df["family"])
sales_array = np.array(retail_df["sales"])

In [None]:
produce_array = sales_array[family_array == "PRODUCE"]

In [None]:
rng = np.random.default_rng(2022)

# One random number per produce sale, so the mask always matches produce_array
random_array = rng.random(len(produce_array))

sampled_array = produce_array[random_array < 0.5]

In [None]:
mean = sampled_array.mean()

mean

2268.102470588235

In [None]:
median = np.median(sampled_array)

median

1272.755

In [None]:
np.where(
    sampled_array < median,
    "below_both",
    np.where(sampled_array > mean, "above_both", "above_median"),
)

array(['above_median', 'below_both', 'below_both', 'below_both',
       'above_both', 'below_both', 'below_both', 'above_both',
       'below_both', 'above_median', 'above_both', 'above_both',
       'below_both', 'above_median', 'above_both', 'below_both',
       'above_both'], dtype='<U12')

In [None]:
sampled_array

array([1662.394,  447.064,  962.866, 1077.44 , 3404.531,  962.96 ,
       1089.319, 7860.031,  446.038, 1272.755, 2775.771, 2339.906,
        722.333, 1567.843, 2458.456,  673.885, 8834.15 ])
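
As a sanity check on the categories, the sketch below rebuilds `sampled_array` from the printed values above (so it runs without the CSV file) and counts how many sales land in each category. The names `mean_val`, `median_val`, `labels`, and `counts` are illustrative.

```python
import numpy as np

# Values copied from the printed sampled_array output above
sampled_array = np.array([
    1662.394, 447.064, 962.866, 1077.44, 3404.531, 962.96,
    1089.319, 7860.031, 446.038, 1272.755, 2775.771, 2339.906,
    722.333, 1567.843, 2458.456, 673.885, 8834.15,
])

mean_val = sampled_array.mean()
median_val = np.median(sampled_array)

categories = np.where(
    sampled_array < median_val,
    "below_both",
    np.where(sampled_array > mean_val, "above_both", "above_median"),
)

# Count how many sales fall into each category
labels, counts = np.unique(categories, return_counts=True)
for label, count in zip(labels, counts):
    print(label, count)
# above_both 6
# above_median 3
# below_both 8
```

The mean sits well above the median here because a few very large sales pull it up, which is why only three values fall between the two.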