# Introduction to NumPy

NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.

**Official Documentation:** https://numpy.org/doc/stable/

## Why NumPy?

### Advantages:
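1. **Performance**: NumPy arrays are typically much faster than Python lists for numerical work (speedups of up to about 50x are commonly cited; the actual gain depends on the operation and the array size)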
2. **Memory Efficient**: Uses less memory than lists
3. **Convenience**: Rich functionality for mathematical operations
4. **Foundation**: Core library for pandas, matplotlib, scikit-learn, and more
5. **Vectorization**: Perform operations on entire arrays without loops

### Key Features:
- Multi-dimensional array object (ndarray)
- Broadcasting capabilities
- Linear algebra operations
- Random number generation
- Fourier transforms

## Python Lists vs NumPy Arrays

Before diving into NumPy, let's understand why we need it by comparing it with Python lists.

### Python Lists - Flexible but Slow

In [None]:
# Python lists: flexible but inefficient for numerical operations
my_list = [1, 2, 3, 4, 5]
my_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print("List:", my_list)
print("Matrix (nested list):\n", my_matrix)

# To multiply a list by 10, we need a loop
result = [x * 10 for x in my_list]
print("\nList * 10 (requires loop):", result)

# For matrices, even more complex
matrix_result = [[element * 10 for element in row] for row in my_matrix]
print("\nMatrix * 10 (nested loops):")
for row in matrix_result:
    print(row)

### NumPy Arrays - Fast and Convenient

With NumPy, the same operations become simple and much faster:

In [None]:
import numpy as np

# NumPy arrays: optimized for numerical operations
np_array = np.array([1, 2, 3, 4, 5])
np_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print("NumPy array:", np_array)
print("NumPy matrix:\n", np_matrix)

# Vectorized operations - no loops needed!
print("\nArray * 10 (vectorized):", np_array * 10)
print("\nMatrix * 10 (vectorized):\n", np_matrix * 10)

# Element-wise operations are equally simple
print("\nArray squared:", np_array ** 2)
print("\nMatrix + 100:\n", np_matrix + 100)

### Key Differences Summary

| Feature | Python Lists | NumPy Arrays |
|---------|-------------|--------------|
| **Speed** | Slow for numerical operations | Very fast (optimized C code) |
| **Memory** | More memory per element | Less memory (fixed type) |
| **Operations** | Need loops for element-wise ops | Vectorized operations |
| **Data Types** | Can mix types | Single type (more efficient) |
| **Multi-dimensional** | Nested lists (awkward) | Native support |
| **Mathematical ops** | Not built-in | Rich library of functions |

**Bottom line:** Use lists for general purposes, use NumPy for numerical data!
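
The "Data Types" and "Memory" rows of the table can be checked directly. The following is a rough sketch, assuming CPython (where `sys.getsizeof` reports per-object sizes); the variable names are illustrative and the exact byte counts will vary between Python versions.

```python
import sys
import numpy as np

# A list can hold mixed types; an array coerces everything to one dtype.
mixed_array = np.array([1, 2.5, 3])
print(mixed_array.dtype)  # float64: the integers were upcast to floats

# Approximate memory footprint of 100,000 integers.
n = 100_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # explicit dtype: 8 bytes per element

# A list stores pointers to separate int objects, so count both parts.
list_total = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

print("List (container + int objects):", list_total, "bytes")
print("Array data buffer:             ", np_arr.nbytes, "bytes")
```

The array stores raw 8-byte values side by side, while the list stores a pointer per element plus a full Python object for each integer.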

## Installation and Import

First, install NumPy if you haven't already:
```bash
pip install numpy
```

In [None]:
# Import NumPy with the standard alias
import numpy as np

# Check NumPy version
print("NumPy version:", np.__version__)

## Creating NumPy Arrays

There are multiple ways to create NumPy arrays:

**Documentation:** https://numpy.org/doc/stable/user/basics.creation.html

In [None]:
# 1. From Python lists
list_1d = [1, 2, 3, 4, 5]
array_1d = np.array(list_1d)
print("1D Array:", array_1d)
print("Type:", type(array_1d))

# 2. From nested lists (2D array)
list_2d = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
array_2d = np.array(list_2d)
print("\n2D Array:\n", array_2d)

# 3. Using arange (like range)
array_range = np.arange(0, 10, 2)  # start, stop, step
print("\nArange:", array_range)

# 4. Using linspace (evenly spaced values)
array_linspace = np.linspace(0, 1, 5)  # start, stop, num_points
print("\nLinspace:", array_linspace)
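
`np.arange` and `np.linspace` look similar but treat the stop value differently. A small sketch of the difference (the variable names are only for illustration):

```python
import numpy as np

# arange: fixed step size, stop value is excluded
by_step = np.arange(0, 1, 0.25)
print(by_step)    # 4 values: 0, 0.25, 0.5, 0.75

# linspace: fixed number of points, stop value is included by default
by_count = np.linspace(0, 1, 5)
print(by_count)   # 5 values: 0, 0.25, 0.5, 0.75, 1

# endpoint=False makes linspace exclude the stop value as well
half_open = np.linspace(0, 1, 4, endpoint=False)
print(half_open)  # same 4 values as by_step
```

As a rule of thumb, `arange` fits integer steps well, while `linspace` is usually the safer choice when you need an exact number of floating-point samples.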

In [None]:
# Special array creation functions

# Array of zeros
zeros = np.zeros((3, 4))  # 3 rows, 4 columns
print("Zeros:\n", zeros)

# Array of ones
ones = np.ones((2, 3))
print("\nOnes:\n", ones)

# Identity matrix
identity = np.eye(4)
print("\nIdentity Matrix:\n", identity)

# Array with a constant value
full = np.full((2, 3), 7)
print("\nFull Array:\n", full)

# Empty array (uninitialized values)
empty = np.empty((2, 2))
print("\nEmpty Array:\n", empty)

## Array Attributes

NumPy arrays have important attributes that describe their properties:

**Documentation:** https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-attributes

In [None]:
# Create a sample array
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print("Array:\n", arr)
print("\nShape (dimensions):", arr.shape)  # (rows, columns)
print("Number of dimensions:", arr.ndim)
print("Size (total elements):", arr.size)
print("Data type:", arr.dtype)
print("Item size (bytes):", arr.itemsize)
print("Total bytes:", arr.nbytes)

## Data Types in NumPy

NumPy supports various data types for efficient storage:

**Documentation:** https://numpy.org/doc/stable/user/basics.types.html

In [None]:
# Integer types
int_array = np.array([1, 2, 3], dtype=np.int32)
print("Int32 array:", int_array, "- dtype:", int_array.dtype)

# Float types
float_array = np.array([1.5, 2.7, 3.9], dtype=np.float64)
print("Float64 array:", float_array, "- dtype:", float_array.dtype)

# Boolean type
bool_array = np.array([True, False, True], dtype=np.bool_)
print("Boolean array:", bool_array, "- dtype:", bool_array.dtype)

# Converting data types
converted = int_array.astype(np.float64)
print("Converted to float:", converted, "- dtype:", converted.dtype)
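
Converting in the other direction, from float to integer, deserves some care: `astype` drops the fractional part rather than rounding. A short sketch with made-up values:

```python
import numpy as np

values = np.array([1.9, -1.9, 2.4])

# astype to an integer type truncates toward zero
truncated = values.astype(np.int32)
print(truncated)  # [ 1 -1  2]

# Round first if nearest-integer behaviour is what you want
rounded = np.round(values).astype(np.int32)
print(rounded)    # [ 2 -2  2]

# astype returns a new array; the original is unchanged
print(values.dtype)  # float64
```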

## Array Indexing and Slicing

Access elements using indices, similar to Python lists but with more power:

**Documentation:** https://numpy.org/doc/stable/user/basics.indexing.html

In [None]:
# 1D array indexing
arr_1d = np.array([10, 20, 30, 40, 50])
print("Array:", arr_1d)
print("First element:", arr_1d[0])
print("Last element:", arr_1d[-1])
print("Slice [1:4]:", arr_1d[1:4])
print("Every other element:", arr_1d[::2])

# 2D array indexing
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D Array:\n", arr_2d)
print("Element at [1, 2]:", arr_2d[1, 2])  # row 1, column 2
print("First row:", arr_2d[0, :])
print("Second column:", arr_2d[:, 1])
print("Subarray:\n", arr_2d[0:2, 1:3])
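
One difference from Python lists is worth knowing early: basic slicing of a NumPy array returns a view that shares memory with the original, not a copy. A minimal sketch (the names `grid`, `top_left`, and `independent` are illustrative):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Basic slicing returns a view onto the same data
top_left = grid[0:2, 0:2]
top_left[0, 0] = 100
print(grid[0, 0])  # 100: the original changed too

# Call .copy() when an independent array is needed
independent = grid[0:2, 0:2].copy()
independent[0, 0] = -1
print(grid[0, 0])  # still 100
```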

In [None]:
# Boolean indexing
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print("Array:", arr)

# Create boolean mask
mask = arr > 5
print("Mask (arr > 5):", mask)

# Apply mask
filtered = arr[mask]
print("Filtered array:", filtered)

# Direct boolean indexing
even_numbers = arr[arr % 2 == 0]
print("Even numbers:", even_numbers)

# Multiple conditions (combine with & and wrap each condition in parentheses)
result = arr[(arr > 3) & (arr < 8)]
print("Numbers strictly between 3 and 8:", result)

## Conditional Operations with np.where()

`np.where()` is a powerful function for applying conditional logic to arrays. It's like a vectorized if-else statement.

**Documentation:** https://numpy.org/doc/stable/reference/generated/numpy.where.html

In [None]:
# Basic np.where() usage: np.where(condition, value_if_true, value_if_false)
scores = np.array([85, 92, 78, 95, 88, 67, 73, 90])

# Assign grades based on scores
grades = np.where(scores >= 90, 'A', 'B')
print("Scores:", scores)
print("Grades:", grades)

# Multiple conditions with nested np.where()
grades_detailed = np.where(scores >= 90, 'A',
                  np.where(scores >= 80, 'B',
                  np.where(scores >= 70, 'C', 'F')))
print("\nDetailed grades:", grades_detailed)

# Numeric transformations
# Replace negative values with 0, keep positive values
data = np.array([-5, 3, -2, 8, -1, 6])
cleaned = np.where(data < 0, 0, data)
print("\nOriginal data:", data)
print("Cleaned data (negatives → 0):", cleaned)
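
Nested `np.where()` calls become hard to read beyond two or three levels. One alternative is `np.select()`, which takes a list of conditions and a matching list of choices. This is a sketch of the same grading rule, not a replacement for the cell above:

```python
import numpy as np

scores = np.array([85, 92, 78, 95, 88, 67, 73, 90])

# Conditions are checked in order; the first one that matches wins
conditions = [scores >= 90, scores >= 80, scores >= 70]
choices = ['A', 'B', 'C']

# default is used where no condition matches
grades_select = np.select(conditions, choices, default='F')
print(grades_select)
```

Because the first matching condition wins, the thresholds must be listed from highest to lowest, just as in the nested version.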

In [None]:
# Practical example: Categorize data
temperatures = np.array([15, 25, 30, 10, 35, 28, 18, 22])

# Categorize temperatures
categories = np.where(temperatures < 20, 'Cold',
              np.where(temperatures < 30, 'Warm', 'Hot'))

print("Temperatures:", temperatures)
print("Categories:", categories)

# Apply different calculations based on condition
sales = np.array([1000, 1500, 800, 2000, 1200])

# Apply 10% discount if sales > 1000, otherwise no discount
discounted = np.where(sales > 1000, sales * 0.9, sales)
print("\nOriginal sales:", sales)
print("After discount:", discounted)
print("Savings:", sales - discounted)

In [None]:
# np.where() with 2D arrays
matrix = np.array([[1, -2, 3],
                   [-4, 5, -6],
                   [7, -8, 9]])

print("Original matrix:")
print(matrix)

# Replace negative values with their absolute values
abs_matrix = np.where(matrix < 0, -matrix, matrix)
print("\nAbsolute value matrix:")
print(abs_matrix)

# Create a mask matrix: 1 for positive values, 0 otherwise
sign_matrix = np.where(matrix > 0, 1, 0)
print("\nSign matrix (1=positive, 0=otherwise):")
print(sign_matrix)

In [None]:
# Real-world example: Flag outliers in data
np.random.seed(42)
data = np.random.normal(100, 15, 50)  # mean=100, std=15

# Calculate bounds (mean ± 2 standard deviations)
mean = np.mean(data)
std = np.std(data)
lower_bound = mean - 2 * std
upper_bound = mean + 2 * std

# Flag outliers
flags = np.where((data < lower_bound) | (data > upper_bound), 'OUTLIER', 'NORMAL')

print("Data statistics:")
print(f"Mean: {mean:.2f}")
print(f"Std: {std:.2f}")
print(f"Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"\nNumber of outliers: {np.sum(flags == 'OUTLIER')}")
print(f"First 10 data points: {data[:10]}")
print(f"First 10 flags: {flags[:10]}")

In [None]:
# np.where() to find indices (without replacement values)
arr = np.array([10, 25, 30, 15, 40, 22, 35])

# Find indices where condition is True
indices_above_25 = np.where(arr > 25)
print("Array:", arr)
print("Indices where value > 25:", indices_above_25[0])
print("Values at those indices:", arr[indices_above_25])

# Find indices in 2D array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Find row and column indices where value > 5
rows, cols = np.where(matrix > 5)
print("\nMatrix:")
print(matrix)
print(f"Elements > 5 are at positions: {list(zip(rows, cols))}")
print(f"Values: {matrix[rows, cols]}")

## Comparison: Boolean Indexing vs np.where()

Both are useful but serve different purposes:

- **Boolean indexing**: Filter/select elements that meet a condition
  ```python
  arr[arr > 5]  # Returns only elements > 5
  ```

- **np.where()**: Transform elements based on a condition
  ```python
  np.where(arr > 5, 'high', 'low')  # Returns array of same size with labels
  ```

- **np.where() for indices**: Find positions of elements
  ```python
  np.where(arr > 5)  # Returns indices where condition is True
  ```

In [None]:
# Side-by-side comparison
data = np.array([3, 7, 2, 9, 5, 1, 8])

print("Original array:", data)
print("\n1. Boolean indexing (filter):")
print("   Values > 5:", data[data > 5])

print("\n2. np.where() (transform):")
print("   Label high/low:", np.where(data > 5, 'HIGH', 'LOW'))

print("\n3. np.where() (indices):")
print("   Indices where > 5:", np.where(data > 5)[0])

print("\n4. Combined approach:")
indices = np.where(data > 5)[0]
print(f"   Found {len(indices)} values > 5 at positions {indices}")
print(f"   Those values are: {data[indices]}")
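
The single-argument form `np.where(condition)` returns the same result as `np.nonzero(condition)`: a tuple with one index array per dimension. For 1D data, `np.flatnonzero()` gives the index array directly. A small sketch:

```python
import numpy as np

data = np.array([3, 7, 2, 9, 5, 1, 8])
mask = data > 5

idx_where = np.where(mask)[0]    # tuple of arrays, so take the first entry
idx_flat = np.flatnonzero(mask)  # plain 1D array of indices

print(idx_where)  # [1 3 6]
print(idx_flat)   # [1 3 6]
```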

## Array Operations

NumPy supports vectorized operations (element-wise operations without loops):

**Documentation:** https://numpy.org/doc/stable/user/basics.ufuncs.html

In [None]:
# Arithmetic operations
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("a:", a)
print("b:", b)
print("\nAddition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
print("Power:", a ** 2)
print("Square root:", np.sqrt(a))

# Operations with scalars
print("\na + 10:", a + 10)
print("a * 2:", a * 2)
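
Operations with scalars are the simplest case of broadcasting, where NumPy stretches the smaller operand to match the larger one. The same mechanism works between arrays whose shapes are compatible. A minimal sketch with illustrative arrays:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])    # shape (2, 3)
row = np.array([10, 20, 30])      # shape (3,)
col = np.array([[100], [200]])    # shape (2, 1)

# The row is applied to every row of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is applied to every column of the matrix
print(matrix + col)
# [[101 102 103]
#  [204 205 206]]
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or one of them is 1.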

In [None]:
# Matrix operations
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

print("Matrix A:\n", matrix_a)
print("\nMatrix B:\n", matrix_b)

# Element-wise multiplication
print("\nElement-wise multiplication:\n", matrix_a * matrix_b)

# Matrix multiplication (dot product)
print("\nMatrix multiplication (dot):\n", np.dot(matrix_a, matrix_b))
# or
print("\nMatrix multiplication (@):\n", matrix_a @ matrix_b)

# Transpose
print("\nTranspose of A:\n", matrix_a.T)

## Statistical Operations

NumPy provides many statistical functions:

**Documentation:** https://numpy.org/doc/stable/reference/routines.statistics.html

In [None]:
data = np.array([12, 15, 18, 21, 24, 27, 30, 33])

print("Data:", data)
print("\nMean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data))
print("Variance:", np.var(data))
print("Min:", np.min(data))
print("Max:", np.max(data))
print("Sum:", np.sum(data))
print("Cumulative sum:", np.cumsum(data))

# For 2D arrays, can specify axis
data_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\n2D Data:\n", data_2d)
print("Mean of each column:", np.mean(data_2d, axis=0))
print("Mean of each row:", np.mean(data_2d, axis=1))
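
By default, `np.std()` and `np.var()` divide by N (the population formulas). For the sample versions, which divide by N - 1, pass `ddof=1`. A short sketch using the same data as above:

```python
import numpy as np

data = np.array([12, 15, 18, 21, 24, 27, 30, 33])

# Default ddof=0: divide by N (population variance)
print(np.var(data))          # 47.25

# ddof=1: divide by N - 1 (sample variance)
print(np.var(data, ddof=1))  # 54.0

# The same parameter applies to the standard deviation
print(np.std(data, ddof=1))
```

This matters when comparing results with other tools, since some of them use the sample formula by default.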

### Understanding the `axis` Parameter

The `axis` parameter is crucial but often confusing. Here's how to think about it:

- **`axis=0`**: Aggregate down the rows, collapsing the row dimension (one result per column)
- **`axis=1`**: Aggregate across the columns, collapsing the column dimension (one result per row)

**Think of it as:** "Which dimension do we collapse/aggregate?"

In [None]:
# Visual example of axis parameter
sales_data = np.array([[100, 150, 200],   # Week 1: Mon, Tue, Wed
                       [120, 180, 210],   # Week 2: Mon, Tue, Wed
                       [110, 160, 190]])  # Week 3: Mon, Tue, Wed

print("Sales data (3 weeks x 3 days):")
print(sales_data)
print("\nRows = weeks, Columns = days (Mon, Tue, Wed)")

# axis=0: aggregate across weeks (for each day)
print("\n--- axis=0: Total sales per DAY (across all weeks) ---")
print("Total per day:", np.sum(sales_data, axis=0))
print("Average per day:", np.mean(sales_data, axis=0))
# Result: 3 values (one per column/day)

# axis=1: aggregate across days (for each week)
print("\n--- axis=1: Total sales per WEEK (across all days) ---")
print("Total per week:", np.sum(sales_data, axis=1))
print("Average per week:", np.mean(sales_data, axis=1))
# Result: 3 values (one per row/week)

# No axis: aggregate everything
print("\n--- No axis: Total across everything ---")
print("Grand total:", np.sum(sales_data))
print("Overall average:", np.mean(sales_data))
# Result: 1 value (entire array)

**Memory trick:** 
- `axis=0` → think "**down** the rows" → one result per column
- `axis=1` → think "**across** the columns" → one result per row
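
Checking the result's shape is a quick way to confirm which dimension was collapsed. The sketch below also uses `keepdims=True`, which keeps the collapsed axis with length 1 so the result can be combined with the original array (the array `table` is made up for illustration):

```python
import numpy as np

table = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

print(table.sum(axis=0).shape)  # (4,): rows collapsed, one value per column
print(table.sum(axis=1).shape)  # (3,): columns collapsed, one value per row

# keepdims=True keeps the collapsed axis, giving shape (3, 1)
row_means = table.mean(axis=1, keepdims=True)
print(row_means.shape)          # (3, 1)

# Subtract each row's mean from that row
centered = table - row_means
print(centered.mean(axis=1))    # every row now averages to zero
```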

### Working with Missing Data (NaN)

In real-world data analysis, you'll often encounter missing values. NumPy provides special functions that ignore NaN (Not a Number) values.

**NaN-safe Functions Table:**

| Regular Function | NaN-safe Version | Description |
|-----------------|------------------|-------------|
| `np.sum()` | `np.nansum()` | Sum, ignoring NaN |
| `np.mean()` | `np.nanmean()` | Mean, ignoring NaN |
| `np.median()` | `np.nanmedian()` | Median, ignoring NaN |
| `np.std()` | `np.nanstd()` | Standard deviation, ignoring NaN |
| `np.var()` | `np.nanvar()` | Variance, ignoring NaN |
| `np.min()` | `np.nanmin()` | Minimum, ignoring NaN |
| `np.max()` | `np.nanmax()` | Maximum, ignoring NaN |
| `np.argmin()` | `np.nanargmin()` | Index of min, ignoring NaN |
| `np.argmax()` | `np.nanargmax()` | Index of max, ignoring NaN |
| `np.percentile()` | `np.nanpercentile()` | Percentile, ignoring NaN |

**Documentation:** https://numpy.org/doc/stable/reference/routines.statistics.html

In [None]:
# Example with missing data
data_with_missing = np.array([10, 20, np.nan, 30, 40, np.nan, 50])

print("Data with missing values:", data_with_missing)

# Regular functions will return NaN if there's any NaN in the data
print("\nUsing regular mean:", np.mean(data_with_missing))

# NaN-safe functions ignore the missing values
print("Using nanmean:", np.nanmean(data_with_missing))

# More examples
print("\nRegular sum:", np.sum(data_with_missing))
print("NaN-safe sum:", np.nansum(data_with_missing))

print("\nRegular max:", np.max(data_with_missing))
print("NaN-safe max:", np.nanmax(data_with_missing))

# Works with 2D arrays too
data_2d_missing = np.array([[1, 2, np.nan],
                             [4, np.nan, 6],
                             [7, 8, 9]])

print("\n2D data with missing values:")
print(data_2d_missing)

print("\nColumn means (ignoring NaN):")
print(np.nanmean(data_2d_missing, axis=0))

print("\nRow means (ignoring NaN):")
print(np.nanmean(data_2d_missing, axis=1))

**Important:** In data analysis, you'll use these NaN-safe functions frequently when working with real datasets that have missing values!
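
Before aggregating, it often helps to know how many values are missing. NaN never compares equal to anything, so use `np.isnan()` rather than `==` to find it. A minimal sketch with illustrative names:

```python
import numpy as np

readings = np.array([10, 20, np.nan, 30, 40, np.nan, 50])

# NaN is not equal to anything, including itself
print(np.nan == np.nan)  # False

# np.isnan builds a boolean mask of the missing entries
missing = np.isnan(readings)
print("Missing count:", missing.sum())  # 2

# Invert the mask to keep only the valid values
valid = readings[~missing]
print("Valid mean:", valid.mean())      # 30.0
```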

## Common NumPy Functions Reference

### Mathematical Functions:
- `np.abs()` - Absolute value
- `np.sqrt()` - Square root
- `np.exp()` - Exponential
- `np.log()` - Natural logarithm
- `np.sin()`, `np.cos()`, `np.tan()` - Trigonometric functions
- `np.round()` - Round to a given number of decimals (default 0; exact halves round to the nearest even value)

**Documentation:** https://numpy.org/doc/stable/reference/routines.math.html
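
`np.round()` behaves a little differently from the rounding taught in school: it accepts an optional number of decimals, and values exactly halfway between two candidates go to the nearest even one. A short sketch:

```python
import numpy as np

# Second argument: number of decimals to keep
two_decimals = np.round(3.14159, 2)
print(two_decimals)  # 3.14

# Exact halves round to the nearest even value
halves = np.round(np.array([0.5, 1.5, 2.5]))
print(halves)        # [0. 2. 2.]

# Negative decimals round to tens, hundreds, and so on
hundreds = np.round(1234.5, -2)
print(hundreds)      # 1200.0
```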

### Statistical Functions:
- `np.mean()` - Average
- `np.median()` - Median value
- `np.std()` - Standard deviation
- `np.var()` - Variance
- `np.percentile()` - Percentiles
- `np.corrcoef()` - Correlation coefficient

**Documentation:** https://numpy.org/doc/stable/reference/routines.statistics.html

### Aggregate Functions:
- `np.sum()` - Sum of elements
- `np.prod()` - Product of elements
- `np.min()`, `np.max()` - Minimum and maximum
- `np.argmin()`, `np.argmax()` - Index of min/max
- `np.cumsum()` - Cumulative sum

**Documentation:** https://numpy.org/doc/stable/reference/routines.math.html#sums-products-differences

# 📝 NumPy Exercises for Data Analytics

These exercises are designed to help you practice **NumPy** for common **data analytics tasks**: statistics, array operations, filtering, and conditional logic.

Make sure you **import NumPy** at the start of your code:

```python
import numpy as np
```

---

## Exercise 1: Descriptive Analysis

**Scenario:**
You have a dataset of **customer ages**. Understanding basic statistics is crucial in analytics to describe the data.

**Task:**

1. Create a NumPy array with the following customer ages:

```python
[18, 22, 25, 30, 35, 40, 45, 50]
```

2. Calculate the following statistics:

   * Mean
   * Median
   * Standard deviation

3. Identify which ages are **above the average**.

> Tip: Use `np.mean()`, `np.median()`, `np.std()` and boolean indexing.

---

## Exercise 2: Group Comparison

**Scenario:**
You are comparing **sales for two months** to see daily trends.

**Task:**

1. Create two NumPy arrays representing sales for **5 days**:

```python
month1 = [200, 220, 250, 240, 260]
month2 = [210, 230, 240, 260, 280]
```

2. Calculate the **daily difference** in sales between month2 and month1.
3. Calculate the **percentage change** for each day.

> Tip: Use element-wise subtraction and division. Multiply by 100 for percentage.

---

## Exercise 3: 2D Dataset (Table-like Data)

**Scenario:**
You have weekly sales data for **3 products over 4 weeks**. Aggregation will help you analyze trends.

**Task:**

1. Create a 2D NumPy array representing sales:

```python
sales_matrix = [
    [120, 130, 140, 150],
    [200, 210, 220, 230],
    [90, 100, 110, 120]
]
```

2. Calculate:

   * Total sales **per product**
   * Total sales **per week**
   * The **best-selling product** (the product with the highest total sales)

> Tip: Use `np.sum()` with the `axis` parameter and `np.argmax()`.

---

## Exercise 4: Analytical Filtering

**Scenario:**
You have a list of **product prices** and want to filter for specific price ranges.

**Task:**

1. Create a NumPy array of prices:

```python
prices = [30, 60, 90, 120, 150, 180]
```

2. Filter the prices that are:

   * Greater than 100
   * Between 50 and 150 (inclusive)

> Tip: Use boolean indexing with conditions like `prices > 100` or `(prices >= 50) & (prices <= 150)`.

---

## Exercise 5: Using `np.where` (Customer Segmentation)

**Scenario:**
You want to classify customers into groups based on their age. This is a common task in analytics for segmentation.

**Task:**

1. Create a NumPy array of **customer ages**:

```python
ages = [18, 22, 25, 30, 35, 40, 45, 50]
```

2. Create a new array `age_group` with the following rules:

   * `"Young"` if age < 30
   * `"Adult"` if age >= 30

3. Count how many customers are `"Young"` and how many are `"Adult"`.

> Tip: Use `np.where()` for conditional assignment, and `np.sum()` for counting.

---

**End of Exercises**
These exercises will help you practice key **NumPy skills for data analytics**, including:

* Array creation and manipulation
* Statistical analysis
* Filtering and conditional logic
* Working with 2D arrays and aggregations



## Summary

NumPy is essential for:
- Fast numerical computations
- Working with large datasets
- Scientific and statistical analysis
- Foundation for data science libraries

**Key Takeaways:**
- NumPy arrays are faster and more memory efficient than Python lists
- Use vectorized operations instead of loops
- Broadcasting allows operations on arrays of different shapes
- Rich set of mathematical and statistical functions
- Foundation for pandas, matplotlib, scikit-learn, and more

**Next Steps:**
- Practice with real datasets
- Learn pandas (built on NumPy)
- Explore matplotlib for visualization
- Study linear algebra with NumPy

**Additional Resources:**
- Official NumPy Tutorial: https://numpy.org/doc/stable/user/quickstart.html
- NumPy for Absolute Beginners: https://numpy.org/doc/stable/user/absolute_beginners.html
- NumPy API Reference: https://numpy.org/doc/stable/reference/index.html
- NumPy for MATLAB Users: https://numpy.org/doc/stable/user/numpy-for-matlab-users.html