# Lecture 4 

February 24, 2025

# Outline

1. Recap of NumPy arrays
2. Practice with an example
3. Saving NumPy arrays to files
4. Quiz


# Part 1. Recap NumPy Arrays

## 1. Performance (Speed)
- NumPy arrays are typically much **faster** than Python lists for numerical work because the core routines are implemented in C and operate on contiguous memory.
- Operations on NumPy arrays are vectorized: the loop over the elements runs in compiled code rather than in the Python interpreter.
- NumPy leverages efficient memory access patterns and CPU optimizations (e.g., SIMD instructions).
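
A rough way to see the speed difference is to time the same computation both ways. The sketch below is illustrative only: the array size `n`, the function names, and the use of `timeit` are choices made here, and the measured times depend on the machine.

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

def loop_sum_of_squares():
    # Explicit Python loop: every multiplication goes through the interpreter
    total = 0
    for x in py_list:
        total += x * x
    return total

def numpy_sum_of_squares():
    # Vectorized: the element-wise product and the sum run in compiled code
    return int(np.sum(arr * arr))

loop_time = timeit.timeit(loop_sum_of_squares, number=10)
numpy_time = timeit.timeit(numpy_sum_of_squares, number=10)
print(f"Python loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
```

Both functions return the same number; only the time taken differs.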

## 2. Memory Efficiency
- NumPy arrays consume **less memory** compared to Python lists because they store elements of the same data type in a contiguous block of memory.
- Lists in Python store references to objects, adding overhead, whereas NumPy uses a compact data structure.
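
The memory claim can be checked with a small sketch. The variable names and the size `n` are assumptions made for illustration, and the exact byte counts for the list depend on the Python version.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores references; each integer is a separate Python object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An int64 array stores the raw values: 8 bytes per element
array_bytes = arr.nbytes

print(f"list:  about {list_bytes} bytes")
print(f"array: {array_bytes} bytes of data")
```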

## 3. Vectorization (Element-wise Operations)
- NumPy allows **vectorized operations**, meaning mathematical operations are applied to entire arrays without explicit loops.
- Example:





In [None]:
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
c = a + b  # [11, 22, 33, 44] (element-wise addition)

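- In contrast, adding two Python lists element by element requires a loop or list comprehension, because the `+` operator concatenates lists. The cells below apply the same `addition` function to scalars, lists, and NumPy arrays.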



In [None]:
def addition(x, y):
    return x+y

In [None]:
# integer, float variables

k, l = 10, 20
k_plus_l = addition(k,l)
print(k_plus_l)

In [None]:
# list objects:
la = [1, 2, 3]
lb = [4, 5, 6]
lsum = addition(la, lb)
print(lsum)
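
For lists, `+` means concatenation, so the result above has six elements. A minimal sketch of the element-wise sum for plain lists, using a list comprehension (the names `concatenated` and `elementwise` are chosen here for illustration):

```python
la = [1, 2, 3]
lb = [4, 5, 6]

concatenated = la + lb                         # joins the two lists
elementwise = [x + y for x, y in zip(la, lb)]  # adds matching elements

print(concatenated)  # [1, 2, 3, 4, 5, 6]
print(elementwise)   # [5, 7, 9]
```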

In [None]:
# numpy array
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
a_plus_b = addition(a,b)
print(a_plus_b)

In [None]:
vectorized_addition = np.vectorize(addition)

In [None]:
a_plus_b_v2 = vectorized_addition(a,b)
print(a_plus_b_v2)

In [None]:
from math import factorial

def factorial_calculation( n):
    return factorial(n)

In [None]:
print(factorial_calculation(3))

In [None]:
# This call raises a TypeError: math.factorial expects a single integer, not an array
print(factorial_calculation(a))

In [None]:
vectorized_factorial = np.vectorize(factorial_calculation)

print(vectorized_factorial(a))
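
A note on `np.vectorize`: it is a convenience wrapper that calls the Python function once per element, so it makes a scalar-only function accept arrays but does not give the speed of a true vectorized operation. The sketch below, with names chosen for illustration, shows the failure on an array, the wrapped version, and the equivalent explicit loop.

```python
from math import factorial
import numpy as np

def factorial_calculation(n):
    return factorial(n)

a = np.array([1, 2, 3, 4])

# math.factorial only accepts a single integer, so an array is rejected
try:
    factorial_calculation(a)
    failed_on_array = False
except TypeError:
    failed_on_array = True

# np.vectorize applies the function element by element
vectorized_factorial = np.vectorize(factorial_calculation)
result = vectorized_factorial(a)

# Equivalent explicit loop, written as a list comprehension
loop_result = np.array([factorial(int(n)) for n in a])

print(failed_on_array)  # True
print(result)           # [ 1  2  6 24]
```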

## 4. Broadcasting
- NumPy supports broadcasting, which allows operations between arrays of different shapes without explicit loops.
- Example:

In [None]:
a = np.array([1, 2, 3])
b = 2  # Scalar
c = a * b  # [2, 4, 6] (scalar is broadcasted)
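
Broadcasting also works between arrays, not just between an array and a scalar. A small sketch, with array names chosen for illustration: a `(2, 3)` array combined with a row of shape `(3,)` and with a column of shape `(2, 1)`.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The row is repeated for every row of the matrix
plus_row = matrix + row
# The column is repeated for every column of the matrix
plus_column = matrix + column

print(plus_row)     # [[11 22 33] [14 25 36]]
print(plus_column)  # [[101 102 103] [204 205 206]]
```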


## 5. Built-in Mathematical and Statistical Functions
- NumPy provides a rich set of mathematical functions (np.mean, np.sum, np.std, etc.).
- Example:

In [None]:
arr = np.array([1, 2, 3, 4, 5])
print(np.mean(arr))  # 3.0
print(np.std(arr))   # 1.414...
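
Two details are worth knowing about these functions. By default `np.std` computes the population standard deviation (`ddof=0`); pass `ddof=1` for the sample standard deviation. Aggregations also accept an `axis` argument for multi-dimensional arrays. A short sketch, with the array `table` made up for illustration:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
population_std = np.std(arr)        # divides by N
sample_std = np.std(arr, ddof=1)    # divides by N - 1

table = np.array([[1, 2, 3],
                  [4, 5, 6]])
column_sums = np.sum(table, axis=0)  # collapse the rows: one sum per column
row_sums = np.sum(table, axis=1)     # collapse the columns: one sum per row

print(population_std, sample_std)
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
```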


## 6. Multidimensional Arrays & Matrix Operations
- Unlike Python lists, NumPy supports multi-dimensional arrays (e.g., 2D matrices, 3D tensors).
- Example:


In [None]:
matrix = np.array([[1, 2], [3, 4]])
print(matrix.T)  # Transpose of the matrix
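
One point that often causes confusion: `*` multiplies element by element, while `@` (or `np.dot`) performs matrix multiplication. A minimal sketch using the same `matrix`:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

elementwise_product = matrix * matrix   # each entry squared
matrix_product = matrix @ matrix        # rows times columns

print(matrix.shape)          # (2, 2)
print(elementwise_product)   # [[ 1  4] [ 9 16]]
print(matrix_product)        # [[ 7 10] [15 22]]
```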


## 7. Interoperability with Other Libraries
- NumPy is the foundation for many scientific computing and machine learning libraries, such as **Pandas, SciPy, TensorFlow, PyTorch**.

## 8. Indexing and Slicing
- NumPy arrays support advanced **indexing and slicing**, enabling efficient selection and manipulation of subsets of data.
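
A few common indexing patterns, sketched on a small made-up array (the name `grid` is chosen here for illustration):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

second_row = grid[1, :]        # one row
second_column = grid[:, 1]     # one column
top_left = grid[:2, :2]        # a 2x2 block
large_values = grid[grid > 8]  # boolean indexing returns a 1D array

print(second_row)     # [4 5 6 7]
print(second_column)  # [1 5 9]
print(large_values)   # [ 9 10 11]
```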

## Summary: Must-Know Operations

| Operation          | Example                                      |
|--------------------|----------------------------------------------|
| **Creating arrays**  | `np.array([1,2,3])`, `np.zeros((3,3))`, `np.random.rand(3,3)` |
| **Reshaping**       | `arr.reshape(3,3)`                         |
| **Indexing & Slicing** | `arr[1, :]`, `arr[:, 1]`              |
| **Math operations** | `arr + 2`, `np.dot(A, B)`, `A @ B`         |
| **Aggregation**     | `np.mean(arr)`, `np.sum(arr, axis=0)`      |
| **Boolean Indexing** | `arr[arr > 10]`                         |
| **Stacking**        | `np.vstack((A,B))`, `np.hstack((A,B))`    |
| **Broadcasting**    | `A + B` (different shapes)                |
| **Linear Algebra**  | `np.linalg.inv(A)`, `np.linalg.eig(A)`     |
| **File I/O**        | `np.loadtxt('data.csv', delimiter=',')`    |


# Part 2. Practice
1. Represent this data table with a NumPy array
2. Calculate GDP per capita and population density

| State      | Population (Million) | Area (Thousand sq mi) | GDP (Billion USD) |
|------------|----------------------|-----------------------|-------------------|
| California | 39.2                 | 163.7                 | 4,080.2          |
| Texas      | 31.3                 | 268.6                 | 2,694.5          |
| New York   | 19.0                 | 54.6                  | 2,284.4          |
| Florida    | 22.0                 | 65.8                  | 1,695.3          |
| Washington | 7.8                  | 71.3                  | 808.0            |



### Step 1: Create a Numpy array representation of this data set

In [None]:

# Data: [Population (Million), Area (Thousand sq mi), GDP (Billion USD)]
data = np.array([
    [39.2, 163.7, 4080.2],  # California
    [31.3, 268.6, 2694.5],  # Texas
    [19.0,  54.6, 2284.4],  # New York
    [22.0,  65.8, 1695.3],  # Florida
    [7.8,   71.3,  808.0]   # Washington
])

# Print the NumPy array
print(data)


### Step 2: Build a helper function to break down key info of a numpy array

In [None]:
### Let's check basic information of this array

import numpy as np

def check_np(arr):
    """
    Function to print the dimension, shape, and size of a NumPy array.
    Raises an exception if the input is not a NumPy array.

    Parameters:
    arr (numpy.ndarray): The input NumPy array.
    """
    if not isinstance(arr, np.ndarray):
        raise TypeError("Input must be a NumPy array.")

    print(f"Dimension: {arr.ndim}")
    print(f"Shape: {arr.shape}")
    print(f"Size: {arr.size}")
    print(f"Array printout: \n {arr}")


In [None]:
check_np(data)

### Step 3: Calculate the ratio of GDP to Population with "vectorized" (element-wise) operations

I want to calculate GDP per capita, which is defined as the GDP of a state divided by its population:

#### **GDP per capita = GDP / Population**

Using a NumPy array, I can retrieve an array containing the GDP values for all five states and another array containing their respective populations. By performing element-wise division of the GDP array by the population array, I obtain an array of GDP per capita values. Below is an example:



In [None]:
GDP = data[:,2]
population = data[:,0]
GDP_per_capita = GDP / population

In [None]:
check_np(GDP_per_capita)

### Step 4: Correct the unit by broadcasting

These numbers are in units of billion dollars per million people, which is not yet "per capita". We therefore convert them to dollars per person:

#### **GDP_per_capita_with_proper_unit = GDP_per_capita$\times \frac{10^9}{10^6}$**

Because the same factor applies to every entry of the `GDP_per_capita` array, we can rely on NumPy broadcasting: multiply the whole array by the factor once, and NumPy applies it to each element. There is no need to loop over the elements individually.

In [None]:
GDP_per_capita_with_proper_unit = GDP_per_capita*(1e9)/(1e6)
check_np(GDP_per_capita_with_proper_unit)

### Step 5: Do all of these in one line

#### Of course, you can do this entire operation in one go
- there is no need to create intermediate arrays

In [None]:
GDP_percapita = data[:,2]/data[:,0]*1e9/1e6
check_np(GDP_percapita)

### Step 6: Add the per capita GDP back to the original numpy array

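#### What needs to be done if I want to add the array `GDP_percapita` back to the original array `data`?

In [None]:
# This first attempt fails: np.hstack expects a single tuple of arrays,
# and the arrays must also have the same number of dimensions (see Step 6.1)
data_modified = np.hstack(data,GDP_percapita)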

#### Step 6.1 Reshape the numpy array to the right shape
- for `hstack` and `vstack`, the arrays to be merged must have the same number of dimensions
- for `hstack` of 2D arrays, the size of axis 0 (the number of rows) must be the same for both arrays
- for `vstack` of 2D arrays, the size of axis 1 (the number of columns) must be the same for both arrays
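
These shape rules can be tried out on placeholder arrays. The sketch below uses made-up arrays (`table`, `extra`, `new_row`) rather than the state data, and also shows `np.column_stack`, which accepts a 1D array as a column directly.

```python
import numpy as np

table = np.zeros((5, 3))              # 2D array, shape (5, 3)
extra = np.arange(5.0)                # 1D array, shape (5,)

# Turn the 1D array into a column so both arrays are 2D with 5 rows
extra_column = extra.reshape(-1, 1)   # shape (5, 1); -1 lets NumPy infer the 5
wider = np.hstack((table, extra_column))

# np.column_stack treats 1D inputs as columns, so no reshape is needed
same_result = np.column_stack((table, extra))

# vstack needs matching column counts
new_row = np.ones((1, 3))
taller = np.vstack((table, new_row))

print(wider.shape)   # (5, 4)
print(taller.shape)  # (6, 3)
```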

In [None]:
GDP_percapita = np.reshape(GDP_percapita, (5,1))

In [None]:
check_np(GDP_percapita)

In [None]:
data_modified = np.hstack((data,GDP_percapita))
check_np(data_modified)

### Step 7. Practice time

Can you follow the example to calculate the population density, defined as the number of people per square mile? Add this quantity to the `data_modified` array so that it has a shape of (5,5): five states as rows (axis 0) and five quantities as columns (axis 1).

In [None]:
# Your code

# Part 3. Saving NumPy arrays

There are a number of ways to save NumPy arrays to a file for later use.

### 1. Writing it into a csv file

A **CSV (Comma-Separated Values) file** is a plain text format used to store **tabular data**, where each row represents a data entry and columns are separated by commas. It is widely used for data exchange between applications like spreadsheets, databases, and programming languages such as Python. CSV files are simple, lightweight, and easily readable by both humans and computers, making them a popular choice for storing structured data in a universally accessible format.


In [None]:
data = data_modified

In [None]:
# Save to CSV
np.savetxt("data.csv", data, delimiter=",", header="Population,Area,GDP,GDP_per_Capita", comments='')

#### **Here is how you can print out the contents of the CSV file in a terminal**
The exclamation mark at the beginning of a line tells Jupyter to run the rest of the line as a shell command on the operating system the notebook is running on. The commands below assume a Unix-like system.

In [None]:
!ls -ltr
!cat data.csv

#### **Here’s a simple Python script to print the contents of a CSV file when you don’t know what’s inside:**



In [None]:
import csv

# Open and read the CSV file
with open("data.csv", "r") as file:
    reader = csv.reader(file)
    
    # Print each row
    for row in reader:
        print(row)


**Getting the numpy array back from the csv**

In [None]:
# Load NumPy array from the CSV file (skipping the header)
data_loaded = np.loadtxt("data.csv", delimiter=",", skiprows=1)

# Display the loaded NumPy array
check_np(data_loaded)

In [None]:
# Another method

data_loaded_v2 = np.genfromtxt("data.csv", delimiter=",", skip_header=1)

check_np(data_loaded_v2)
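
It is good practice to confirm that what comes back from the file matches what was saved. The sketch below is a self-contained round trip on a two-row subset of the table; the temporary directory and the file name `roundtrip.csv` are assumptions made so the example does not overwrite `data.csv`.

```python
import os
import tempfile
import numpy as np

sample = np.array([[39.2, 163.7, 4080.2],
                   [31.3, 268.6, 2694.5]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    np.savetxt(path, sample, delimiter=",",
               header="Population,Area,GDP", comments="")

    # The first line of the file is the header written by savetxt
    with open(path, "r") as file:
        header_line = file.readline().strip()

    loaded = np.loadtxt(path, delimiter=",", skiprows=1)

round_trip_ok = np.allclose(sample, loaded)
print(header_line)    # Population,Area,GDP
print(round_trip_ok)  # True
```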

### 2. Writing it into a (HDF5) h5 file

An H5 file, or HDF5 (Hierarchical Data Format version 5), is a binary file format designed to store and organize large amounts of numerical data efficiently. It supports a hierarchical structure similar to a file system, allowing datasets, metadata, and groups to be stored within a single file. HDF5 is widely used in scientific computing, machine learning, and high-performance computing due to its ability to handle complex data structures and large-scale datasets efficiently. Libraries such as `h5py` in Python provide easy access for reading and writing HDF5 files.


In [None]:
# Install h5py first if it is not already available
! pip install h5py

In [None]:
import h5py

# Save to HDF5
with h5py.File("data.h5", "w") as hf:
    hf.create_dataset("dataset", data=data)


In [None]:
!ls -ltr

**You can get the numpy array back from an h5 file**

In [None]:
# Load the data from the HDF5 file
with h5py.File("data.h5", "r") as hf:
    mydata = np.array(hf["dataset"])  # Read the dataset into a NumPy array

# Print the retrieved data
check_np(mydata)

**If you don't know what is inside the h5 file, you can list its contents:**

In [None]:

# Open the HDF5 file in read mode
with h5py.File("data.h5", "r") as hf:
    print("HDF5 file structure:")
    hf.visit(print)  # Lists all groups and datasets inside the file


In [None]:
with h5py.File("data.h5", "r") as hf:
    print("Datasets available:", list(hf.keys()))


In [None]:
with h5py.File("data.h5", "r") as hf:
    for name in hf.keys():
        dataset = hf[name]
        print(f"Dataset Name: {name}")
        print(f" - Shape: {dataset.shape}")
        print(f" - Data Type: {dataset.dtype}")
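
The hierarchical structure and metadata mentioned in the description of HDF5 can be sketched with a group and an attribute. The group name `states`, the dataset name `economy`, the attribute name `columns`, and the temporary file are all made up for this illustration; it assumes `h5py` is installed.

```python
import os
import tempfile
import h5py
import numpy as np

economy = np.array([[39.2, 163.7, 4080.2],
                    [31.3, 268.6, 2694.5]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")

    with h5py.File(path, "w") as hf:
        group = hf.create_group("states")                 # like a folder
        dset = group.create_dataset("economy", data=economy)
        dset.attrs["columns"] = "Population,Area,GDP"     # metadata on the dataset

    with h5py.File(path, "r") as hf:
        dset = hf["states/economy"]      # path-like access into the hierarchy
        columns = dset.attrs["columns"]
        first_row = dset[0, :]           # reads only the requested slice
        everything = dset[:]             # reads the full dataset into a NumPy array

print(columns)
print(first_row)
```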


### 3. Write the numpy array into a Pandas data frame and then save it as a `pickle` file

A Pandas DataFrame is a two-dimensional, mutable data structure in Python that organizes data in a tabular format with labeled rows and columns. It provides powerful functionality for data manipulation, analysis, and visualization, making it a fundamental tool in data science and machine learning. DataFrames support operations such as filtering, grouping, merging, and handling missing values, and they can be created from various data sources, including CSV, Excel, SQL databases, and dictionaries.

A Pickle file (`.pkl`) is a serialized binary format used to save Python objects, including Pandas DataFrames, lists, dictionaries, and custom objects. The `pickle` module in Python enables efficient storage and retrieval of objects, preserving their structure and data types. This format is useful for saving intermediate computations, caching results, and sharing complex objects across sessions without requiring reprocessing.


In [None]:
import pandas as pd

# Convert NumPy array to Pandas DataFrame
columns = ["Population", "Area", "GDP", "GDP_per_Capita"]
rows = ["CA", "TX", "NY", "FL", "WA"]  # same order as the rows of the data array
df = pd.DataFrame(data, columns=columns, index=rows)

# Save as a Pickle file
df.to_pickle("data.pkl")


**Retrieving entries in dataframes**

In [None]:
print(df["GDP"])  # Get GDP column


In [None]:
print(df)

In [None]:
print( df.loc["CA" , "Population"])

In [None]:
df.loc["CA","Population"] = 40.0
print(df)

In [None]:
print(df.describe())  # Summary statistics (mean, min, max, std)


#### Reading pandas dataframe from pickle file

In [None]:
# Load DataFrame from the pickle file
df_v2 = pd.read_pickle("data.pkl")

# Display DataFrame
print(df_v2)
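
Unlike a CSV file, a pickle file preserves the row labels, column names, and data types of the DataFrame. A minimal round-trip sketch on a two-row example; the temporary directory and the file name `demo.pkl` are assumptions made so the example does not overwrite `data.pkl`.

```python
import os
import tempfile
import numpy as np
import pandas as pd

values = np.array([[39.2, 163.7],
                   [31.3, 268.6]])
small_df = pd.DataFrame(values, columns=["Population", "Area"], index=["CA", "TX"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    small_df.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, labels, and dtypes
identical = small_df.equals(restored)
print(identical)             # True
print(list(restored.index))  # ['CA', 'TX']
```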


# **Comparison of CSV vs HDF5 vs Pandas/Pickle**

| Feature            | **CSV (Comma-Separated Values)** | **HDF5 (Hierarchical Data Format)** | **Pandas/Pickle** |
|--------------------|--------------------------------|------------------------------------|--------------------|
| **File Type**      | Plain text                     | Binary (structured format)         | Binary (Python-specific) |
| **Human-Readable?** | ✅ Yes (can be opened in Notepad, Excel) | ❌ No (binary format) | ❌ No (binary format) |
| **Supports Structured Data?** | ❌ No (flat table only) | ✅ Yes (supports hierarchies & datasets) | ✅ Yes (Pandas DataFrame) |
| **Supports Missing Data?** | ⚠️ Only by convention (empty fields or placeholders like "NA") | ⚠️ Partly (e.g., `NaN` for floating-point data) | ✅ Yes (preserves the DataFrame's missing values) |
| **Storage Efficiency** | ❌ Large file sizes (repetitive text) | ✅ Highly efficient for large data | ✅ Compact, but larger than HDF5 |
| **Read/Write Speed** | ⏳ Slow (text parsing required) | ⚡ Fast (optimized for large datasets) | ⚡ Fast (stores DataFrame directly) |
| **Supports Random Access?** | ❌ No (must read entire file) | ✅ Yes (supports chunk-based access) | ❌ No (the whole object is loaded into memory) |
| **Best For?** | Simple tabular data exchange | Large-scale numerical datasets | Storing Pandas DataFrames |
| **Python Compatibility** | ✅ Supported (with `csv`, `pandas`, `numpy`) | ✅ Supported (with `h5py`, `pandas`) | ✅ Pandas-native (`to_pickle`, `read_pickle`) |
| **Usable Outside Python?** | ✅ Yes (universally supported) | ✅ Yes (widely used in HPC & machine learning) | ❌ No (Pickle is Python-specific) |
| **Ideal Use Case** | Sharing small datasets | Storing large structured data | Fast loading of Pandas objects |

---

## **Summary: When to Use Each Format**
- **Use CSV** if you need a simple, universally readable format for **small datasets**.
- **Use HDF5** when dealing with **large, structured, or hierarchical datasets** and need **fast random access**.
- **Use Pandas/Pickle** when working **within Python** and need to quickly **save/load a DataFrame** efficiently.


