# 🚀 From Loops to Vectorisation

**DS105W W04 LECTURE – Data for Data Science (Winter Term 2025/2026)**

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; color: #212121; border-left: 8px solid #ED9255; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:600px;">

**Lecture Demonstration Notebook**

- 📅 Date: 12 February 2026
- 👤 Instructor: Dr Jon Cardoso-Silva
- 🎯 Purpose: Learn efficient data operations with NumPy and Pandas

<span style="display:block;line-height:1.15em;color:#666666;font-size:0.9em;">

🥅 **Learning Goals**

 i) Understand vectorisation and why it matters,
 ii) Use NumPy arrays for fast numerical operations,
 iii) Create Pandas DataFrames from collected data.

</span>

</div>

## What You'll Learn Today

You completed W04 Practice using loops and conditionals to detect heatwaves. Today we introduce **vectorisation**: operations that work on entire arrays at once, without explicit loops. We move from Python loops to NumPy arrays, then Pandas DataFrames. Vectorised code scales without rewriting; the same operations work whether you have 35 years of data or 100 cities.

💭 **Personal Reflection:**

As we go through today's lecture, use this space to write your own notes, questions, or observations.

- [*Write your notes here*]

⚙️ **Importing libraries**

We'll use three libraries today:

In [2]:
import json
import numpy as np
import pandas as pd

print("✅ Libraries loaded successfully!")

✅ Libraries loaded successfully!


📖 **Learn more:**
- [NumPy Data Types](https://www.w3schools.com/python/numpy/numpy_data_types.asp) (W3Schools)
- [Pandas Data Types and Performance](https://notes.dsc80.com/content/02/data-types.html) (UCSD DSC80)
- [Pandas Series Reference](https://pandas.pydata.org/docs/reference/series.html) (Official Docs)
- [Pandas DataFrame Reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (Official Docs)

## Section 1: Loading Your W04 Practice Data

We just built this data in the practice solution. Same file, same structure. Now we load it here to work with new tools.

In [3]:
# Load the JSON file you created in W04 Practice
with open('data/london_temperatures_1990_2025.json', 'r') as f:
    weather_data = json.load(f)

# Extract dates and temperatures
dates = weather_data['daily']['time']
temps = weather_data['daily']['temperature_2m_max']

print(f"✅ Loaded {len(dates)} days of data")
print(f"📅 From {dates[0]} to {dates[-1]}")
print(f"🌡️  Temperature range: {min(temps):.1f}°C to {max(temps):.1f}°C")

✅ Loaded 13078 days of data
📅 From 1990-01-01 to 2025-10-21
🌡️  Temperature range: -4.2°C to 38.0°C


## Your Loop Approach (Reference)

You built this in the practice solution. Here it is for comparison with what comes next.

In [5]:
# Your W04 Practice approach
is_hot = []

for temp in temps:
    if temp >= 28:
        is_hot.append(True)
    else:
        is_hot.append(False)

In [6]:
# Check the result
print(f"Total hot days: {sum(is_hot)}")

Total hot days: 153


## Section 2: Vectorisation - Operations on Entire Arrays

**Vectorisation** means applying operations to entire arrays at once. The loop still happens, but it's handled by fast C code inside NumPy and Pandas.

<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border-left: 5px solid #ff9800;">

**Key Idea:** Instead of writing `for item in data: do_something(item)`, you write `do_something(data)` and it works on ALL items at once.

</div>
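To make the key idea concrete, here is a minimal sketch that runs the same comparison both ways. The five temperatures in `toy_temps` are made-up values for illustration, not the London dataset.

```python
import numpy as np

# Toy temperatures in °C (made-up values, not the London dataset)
toy_temps = [12.5, 28.0, 30.1, 19.4, 27.9]

# Loop version: one Python-level comparison per element
hot_loop = []
for t in toy_temps:
    hot_loop.append(t >= 28)

# Vectorised version: a single expression; NumPy does the looping internally
toy_array = np.array(toy_temps)
hot_vectorised = toy_array >= 28

print(hot_loop)                 # [False, True, True, False, False]
print(hot_vectorised.tolist())  # [False, True, True, False, False]
```

Both versions produce the same answers; the vectorised one simply moves the loop out of your code.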

## Section 3: NumPy Arrays - Fast Numerical Operations

NumPy arrays are like Python lists, but all elements must be the same type. This restriction allows NumPy to optimize operations dramatically.

📖 **Learn more about NumPy data types:** [https://www.w3schools.com/python/numpy/numpy_data_types.asp](https://www.w3schools.com/python/numpy/numpy_data_types.asp)
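The "same type" rule is easy to see with a small experiment. The sketch below uses made-up toy lists and shows how NumPy picks one common type for the whole array when you give it mixed values.

```python
import numpy as np

# Toy inputs (illustrative only)
int_array = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # the integers are upcast to float
mixed_text = np.array([1, "two", 3.0])  # everything is converted to a string

print(int_array.dtype)        # an integer type (exact width depends on platform)
print(mixed_numbers.dtype)    # float64
print(mixed_text.dtype.kind)  # U  (a Unicode string type)
print(mixed_text.tolist())
```

Once an array has become a string array, arithmetic such as `mixed_text * 2` no longer works, which is one reason to keep numeric data in its own array.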

### 3.1 Creating a NumPy Array

In [7]:
# Convert Python list to NumPy array
temps_array = np.array(temps)

In [13]:
dates_array = np.array(dates)

In [8]:
# Inspect the array
temps_array

array([ 5.9,  6.9,  5.9, ..., 15.3, 14.7, 15.4], shape=(13078,))

In [14]:
dates_array

array(['1990-01-01', '1990-01-02', '1990-01-03', ..., '2025-10-19',
       '2025-10-20', '2025-10-21'], shape=(13078,), dtype='<U10')

In [15]:
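# Boolean indexing: keep only the dates where the temperature is strictly above 28
# (note this uses > rather than the >= 28 threshold used elsewhere)
dates_array[temps_array > 28]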

array(['1990-07-20', '1990-07-21', '1990-08-01', '1990-08-02',
       '1990-08-03', '1990-08-04', '1994-07-12', '1994-07-24',
       '1995-06-30', '1995-07-20', '1995-07-31', '1995-08-01',
       '1995-08-02', '1995-08-03', '1995-08-11', '1995-08-12',
       '1995-08-16', '1995-08-19', '1996-06-07', '1996-07-22',
       '1997-08-10', '1997-08-11', '1997-08-19', '1998-08-10',
       '1998-08-11', '1999-07-31', '1999-08-01', '1999-08-02',
       '2001-07-29', '2002-07-29', '2003-07-15', '2003-08-06',
       '2003-08-08', '2003-08-09', '2003-08-10', '2003-08-11',
       '2003-08-12', '2004-08-08', '2005-06-19', '2005-06-23',
       '2005-06-24', '2005-08-31', '2006-07-02', '2006-07-03',
       '2006-07-04', '2006-07-17', '2006-07-18', '2006-07-19',
       '2006-07-20', '2006-07-21', '2006-07-25', '2006-07-26',
       '2011-06-27', '2012-08-19', '2013-07-13', '2013-07-16',
       '2013-07-17', '2013-07-22', '2013-08-01', '2014-07-18',
       '2015-07-01', '2016-07-19', '2016-07-20', '2016-

In [9]:
# Check the properties
print(f"Type: {type(temps_array)}")
print(f"Data type: {temps_array.dtype}")
print(f"Shape: {temps_array.shape}")

Type: <class 'numpy.ndarray'>
Data type: float64
Shape: (13078,)


In [10]:
# Look at first 10 values
temps_array[:10]

array([ 5.9,  6.9,  5.9,  6.5,  9.6, 10.6,  9. ,  8.8,  9.8, 10.4])

💭 **Personal Reflection:**

What does `dtype: float64` mean? Why is `shape` a tuple?

- [*Write your notes here*]

### 3.2 Vectorised Comparison

In [11]:
# No loop needed! Operation happens to ALL elements at once
is_hot_vectorised = temps_array >= 28

In [12]:
# What did we get?
is_hot_vectorised

array([False, False, False, ..., False, False, False], shape=(13078,))

In [16]:
# Check properties
print(f"Type: {type(is_hot_vectorised)}")
print(f"Data type: {is_hot_vectorised.dtype}")

Type: <class 'numpy.ndarray'>
Data type: bool


In [17]:
# See first 20 results
is_hot_vectorised[:20]

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False])

In [18]:
# Count hot days
print(f"Total hot days: {is_hot_vectorised.sum()}")

Total hot days: 153


<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #03a9f4; border-left: 5px solid #03a9f4;">

The expression `temps_array >= 28` tells NumPy to compare every temperature to 28 and produce a boolean array, without any loop in your code. That is vectorisation.

</div>
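Counting with `.sum()` works on a boolean array because `True` is treated as 1 and `False` as 0. A minimal sketch with made-up toy values:

```python
import numpy as np

# Toy temperatures in °C (illustrative values only)
toy_array = np.array([12.5, 28.0, 30.1, 19.4, 27.9])
mask = toy_array >= 28            # [False, True, True, False, False]

print(mask.sum())                 # 2   -> number of True values
print(np.count_nonzero(mask))     # 2   -> same count, more explicit name
print(mask.mean())                # 0.4 -> fraction of values that are True
print(toy_array[mask].tolist())   # [28.0, 30.1] -> the values that passed the test
```

The same mask can therefore answer "how many?", "what share?" and "which ones?" without any loop.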

💭 **Personal Reflection:**

Compare the loop version and vectorised version. Which is easier to read?

- [*Write your notes here*]

### 3.3 Comparison: Loop vs Vectorised

In [19]:
print("APPROACH 1: Python Loop")
print(f"Result: {sum(is_hot)} hot days")

APPROACH 1: Python Loop
Result: 153 hot days


In [24]:
# Mean temperature with an explicit Python loop
sum_temp = 0
for temp in temps:
    sum_temp += temp
print(sum_temp / len(temps))

# The same calculation, vectorised with NumPy
print(temps_array.mean())

14.423780394555774
14.423780394555743
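The two means printed above agree to about 13 decimal places but differ in the last digits. This is most likely a floating-point rounding effect rather than a bug: the Python loop adds the values strictly one after another, while NumPy's documentation notes that its summation often uses a pairwise scheme, so the additions happen in a different order. The sketch below uses a made-up toy list to show that the order of floating-point additions matters and how to compare such results safely.

```python
import math
import numpy as np

# Toy data: ten copies of 0.1 (illustrative only)
values = [0.1] * 10

# Sequential addition, exactly like the loop in the cell above
loop_total = 0.0
for v in values:
    loop_total += v

print(loop_total == 1.0)                        # False: rounding error has accumulated
print(math.isclose(loop_total, 1.0))            # True: equal within a tolerance
print(np.isclose(loop_total, np.sum(values)))   # True: loop and NumPy agree within tolerance
```

When comparing floating-point results from two methods, prefer `math.isclose()` or `np.isclose()` over `==`.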


In [25]:
print("\nAPPROACH 2: NumPy Vectorisation")
print(f"Result: {is_hot_vectorised.sum()} hot days")


APPROACH 2: NumPy Vectorisation
Result: 153 hot days


In [26]:
print(f"✨ Same result? {sum(is_hot) == is_hot_vectorised.sum()}")

✨ Same result? True


### 3.4 More Vectorised Operations

In [27]:
# Convert ALL temperatures to Fahrenheit at once
temps_fahrenheit = temps_array * 9/5 + 32

In [28]:
# Check the first few
print("First 10 temperatures:")
for i in range(10):
    print(f"{temps_array[i]:>5.1f}°C = {temps_fahrenheit[i]:>5.1f}°F")

First 10 temperatures:
  5.9°C =  42.6°F
  6.9°C =  44.4°F
  5.9°C =  42.6°F
  6.5°C =  43.7°F
  9.6°C =  49.3°F
 10.6°C =  51.1°F
  9.0°C =  48.2°F
  8.8°C =  47.8°F
  9.8°C =  49.6°F
 10.4°C =  50.7°F


In [29]:
# Find extreme heat days (> 30°C)
extreme_heat = temps_array > 30

In [30]:
print(f"Days above 30°C: {extreme_heat.sum()}")

Days above 30°C: 49


In [58]:
# Calculate statistics instantly
print("Temperature Statistics:")
print(f"Mean: {temps_array.mean():.2f}°C")
print(f"Median: {np.median(temps_array):.2f}°C")
print(f"Std deviation: {temps_array.std():.2f}°C")

Temperature Statistics:
Mean: 14.42°C
Median: 14.30°C
Std deviation: 6.13°C


💭 **Personal Reflection:**

How would you find all days with temperatures between 20°C and 25°C?

- [*Write your notes here*]

### 3.5 Simple Categorization with `np.where()`

In [None]:
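# Simple categorization: "Hot" or "Not Hot"
# np.where(condition, value_if_true, value_if_false) is the element-wise version of:
#     "Hot" if temp >= 28 else "Not Hot"
# applied to every temperature in the array
temp_category = np.where(temps_array >= 28, "Hot", "Not Hot")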

In [34]:
# Look at the result
temp_category

array(['Not Hot', 'Not Hot', 'Not Hot', ..., 'Not Hot', 'Not Hot',
       'Not Hot'], shape=(13078,), dtype='<U7')

In [35]:
# See first 20 categories with temperatures
print("First 20 categories:")
for i in range(20):
    print(f"{dates[i]}: {temps_array[i]:>5.1f}°C → {temp_category[i]}")

First 20 categories:
1990-01-01:   5.9°C → Not Hot
1990-01-02:   6.9°C → Not Hot
1990-01-03:   5.9°C → Not Hot
1990-01-04:   6.5°C → Not Hot
1990-01-05:   9.6°C → Not Hot
1990-01-06:  10.6°C → Not Hot
1990-01-07:   9.0°C → Not Hot
1990-01-08:   8.8°C → Not Hot
1990-01-09:   9.8°C → Not Hot
1990-01-10:  10.4°C → Not Hot
1990-01-11:  10.8°C → Not Hot
1990-01-12:   9.8°C → Not Hot
1990-01-13:   7.4°C → Not Hot
1990-01-14:   8.1°C → Not Hot
1990-01-15:  12.0°C → Not Hot
1990-01-16:  11.2°C → Not Hot
1990-01-17:  10.8°C → Not Hot
1990-01-18:   8.3°C → Not Hot
1990-01-19:   9.7°C → Not Hot
1990-01-20:  10.8°C → Not Hot


## Section 4: The Limits of NumPy - Complex Conditions

NumPy is great for simple operations. But complex logic requires **nested** `np.where()` calls, which become hard to read.

### 4.1 Multiple Categories

In [36]:
# Three categories: Hot, Warm, Cool
temp_category_complex = np.where(
    temps_array >= 28,
    "Hot",
    np.where(
        temps_array >= 20,
        "Warm",
        "Cool"
    )
)

In [37]:
# See the result
temp_category_complex

array(['Cool', 'Cool', 'Cool', ..., 'Cool', 'Cool', 'Cool'],
      shape=(13078,), dtype='<U4')

In [38]:
# Show first 20 categorized temperatures
print("First 20 categorized:")
for i in range(20):
    print(f"{dates[i]}: {temps_array[i]:>5.1f}°C → {temp_category_complex[i]}")

First 20 categorized:
1990-01-01:   5.9°C → Cool
1990-01-02:   6.9°C → Cool
1990-01-03:   5.9°C → Cool
1990-01-04:   6.5°C → Cool
1990-01-05:   9.6°C → Cool
1990-01-06:  10.6°C → Cool
1990-01-07:   9.0°C → Cool
1990-01-08:   8.8°C → Cool
1990-01-09:   9.8°C → Cool
1990-01-10:  10.4°C → Cool
1990-01-11:  10.8°C → Cool
1990-01-12:   9.8°C → Cool
1990-01-13:   7.4°C → Cool
1990-01-14:   8.1°C → Cool
1990-01-15:  12.0°C → Cool
1990-01-16:  11.2°C → Cool
1990-01-17:  10.8°C → Cool
1990-01-18:   8.3°C → Cool
1990-01-19:   9.7°C → Cool
1990-01-20:  10.8°C → Cool


💭 **Personal Reflection:**

Look at the nested `np.where()` code above. Can you easily read it? How would you add a fourth category?

- [*Write your notes here*]

### 4.2 Even More Complex - Multiple Conditions

Imagine you also have rainfall data and want categories like "Hot & Dry", "Hot & Wet", "Warm & Dry", etc. The nested `np.where()` becomes unreadable.

In [39]:
# Create fake rainfall data for demonstration
np.random.seed(42)
rainfall = np.random.uniform(0, 10, len(temps_array))

In [40]:
# Look at first 10 values
rainfall[:10]

array([3.74540119, 9.50714306, 7.31993942, 5.98658484, 1.5601864 ,
       1.5599452 , 0.58083612, 8.66176146, 6.01115012, 7.08072578])

In [41]:
# This works but look how hard it is to read!
weather_type_numpy = np.where(
    temps_array >= 28,
    np.where(rainfall < 1, "Hot & Dry", "Hot & Wet"),
    np.where(
        temps_array >= 20,
        np.where(rainfall < 1, "Warm & Dry", "Warm & Wet"),
        "Cool"
    )
)

In [42]:
# See the result
weather_type_numpy

array(['Cool', 'Cool', 'Cool', ..., 'Cool', 'Cool', 'Cool'],
      shape=(13078,), dtype='<U10')

In [43]:
# Show first 15 classifications
print("First 15 weather classifications:")
for i in range(15):
    print(f"{dates[i]}: {temps_array[i]:>5.1f}°C, {rainfall[i]:>5.2f}mm → {weather_type_numpy[i]}")

First 15 weather classifications:
1990-01-01:   5.9°C,  3.75mm → Cool
1990-01-02:   6.9°C,  9.51mm → Cool
1990-01-03:   5.9°C,  7.32mm → Cool
1990-01-04:   6.5°C,  5.99mm → Cool
1990-01-05:   9.6°C,  1.56mm → Cool
1990-01-06:  10.6°C,  1.56mm → Cool
1990-01-07:   9.0°C,  0.58mm → Cool
1990-01-08:   8.8°C,  8.66mm → Cool
1990-01-09:   9.8°C,  6.01mm → Cool
1990-01-10:  10.4°C,  7.08mm → Cool
1990-01-11:  10.8°C,  0.21mm → Cool
1990-01-12:   9.8°C,  9.70mm → Cool
1990-01-13:   7.4°C,  8.32mm → Cool
1990-01-14:   8.1°C,  2.12mm → Cool
1990-01-15:  12.0°C,  1.82mm → Cool


<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #e91e63; border-left: 5px solid #e91e63;">

The nested `np.where()` code above is hard to read, and adding wind speed or other conditions would make it worse. Pandas offers a clearer approach for this kind of logic.

</div>

💭 **Personal Reflection:**

Imagine you found a bug in the classification above. How easy would it be to fix?

- [*Write your notes here*]

---

**🎯 INSTRUCTOR NOTE:** Demonstrate debugging here: how would you find an error in nested `np.where()`? No `print()` inside the expression!
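One possible way to debug a nested `np.where()` is to pull each piece out into a named variable, so that every intermediate array can be inspected on its own. This is a sketch rather than the only approach; `toy_temps` and `toy_rain` are made-up values, and the variable names are chosen here for illustration.

```python
import numpy as np

# Toy data (illustrative only): temperatures in °C and rainfall in mm
toy_temps = np.array([30.0, 22.0, 10.0, 28.0])
toy_rain = np.array([0.2, 3.0, 0.0, 1.0])

# Step 1: name each condition so it can be checked separately
is_hot = toy_temps >= 28
is_warm = toy_temps >= 20
is_dry = toy_rain < 1

# Step 2: name each inner np.where() result
hot_label = np.where(is_dry, "Hot & Dry", "Hot & Wet")
warm_label = np.where(is_dry, "Warm & Dry", "Warm & Wet")

# Step 3: combine, using the same logic as the nested version
result = np.where(is_hot, hot_label, np.where(is_warm, warm_label, "Cool"))

# Each piece can now be printed and sanity-checked
print(is_hot.sum(), is_warm.sum(), is_dry.sum())  # 2 3 2
print(result.tolist())  # ['Hot & Dry', 'Warm & Wet', 'Cool', 'Hot & Wet']
```

If a count looks wrong (for example, rainfall of exactly 1.0 mm is not "dry" under `< 1`), you know which condition to revisit.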

## Section 5: Enter Pandas - Adding Structure to Your Data

Pandas builds on NumPy but adds crucial features: column names, row labels, and mixed data types. Think of it as spreadsheets for Python.

📖 **Learn more:**
- [Pandas Series Tutorial](https://www.w3schools.com/python/pandas/pandas_series.asp) (W3Schools)
- [Pandas DataFrame Tutorial](https://www.w3schools.com/python/pandas/pandas_dataframes.asp) (W3Schools)
- [Official Series Reference](https://pandas.pydata.org/docs/reference/series.html)
- [Official DataFrame Reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
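The three features named above (column names, row labels, and mixed data types) can be seen in a tiny DataFrame. The values below are made up for illustration and are not taken from the London dataset.

```python
import pandas as pd

# Toy table (illustrative values only)
toy_df = pd.DataFrame({
    "date": ["2025-07-01", "2025-07-02", "2025-07-03"],  # text
    "temp": [29.1, 31.4, 24.0],                          # floats
    "is_hot": [True, True, False],                       # booleans
})

print(toy_df.columns.tolist())  # ['date', 'temp', 'is_hot'] -> column names
print(toy_df.index.tolist())    # [0, 1, 2]                  -> row labels
print(toy_df.dtypes)            # one data type per column
```

Each column keeps its own type (the text column's reported type name depends on your pandas version), whereas a single NumPy array would have to convert everything to one type.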

### 5.1 Creating a DataFrame

In [44]:
# Combine your data into a table structure
weather_df = pd.DataFrame({
    'date': dates,
    'temp': temps
})

In [45]:
# Look at the DataFrame
weather_df

| | date | temp |
|---|---|---|
| 0 | 1990-01-01 | 5.9 |
| 1 | 1990-01-02 | 6.9 |
| 2 | 1990-01-03 | 5.9 |
| 3 | 1990-01-04 | 6.5 |
| 4 | 1990-01-05 | 9.6 |
| ... | ... | ... |
| 13073 | 2025-10-17 | 14.8 |
| 13074 | 2025-10-18 | 14.7 |
| 13075 | 2025-10-19 | 15.3 |
| 13076 | 2025-10-20 | 14.7 |


In [46]:
# See first 10 rows
weather_df.head(10)

| | date | temp |
|---|---|---|
| 0 | 1990-01-01 | 5.9 |
| 1 | 1990-01-02 | 6.9 |
| 2 | 1990-01-03 | 5.9 |
| 3 | 1990-01-04 | 6.5 |
| 4 | 1990-01-05 | 9.6 |
| 5 | 1990-01-06 | 10.6 |
| 6 | 1990-01-07 | 9.0 |
| 7 | 1990-01-08 | 8.8 |
| 8 | 1990-01-09 | 9.8 |
| 9 | 1990-01-10 | 10.4 |


In [47]:
# Basic information
print(f"Shape: {weather_df.shape}")
print(f"Columns: {weather_df.columns.tolist()}")

Shape: (13078, 2)
Columns: ['date', 'temp']


<div style="background-color: #fcfcfc; width:80%; margin-left: 1em; color: #212121; padding: 1em; border-radius: 0.5em; border: 1px solid #4caf50; border-left: 5px solid #4caf50;">

You just created `dates` and `temps` as Python lists in the practice solution. Now you are putting them into a DataFrame, which gives you column names, row labels, and built-in operations.

</div>

💭 **Personal Reflection:**

How is this DataFrame different from the lists we had before?

- [*Write your notes here*]

### 5.2 Accessing Columns - Pandas Series

In [48]:
# A single column is a Pandas Series
temp_series = weather_df['temp']

In [49]:
# Look at it
temp_series

0         5.9
1         6.9
2         5.9
3         6.5
4         9.6
         ... 
13073    14.8
13074    14.7
13075    15.3
13076    14.7
13077    15.4
Name: temp, Length: 13078, dtype: float64

In [50]:
# Check properties
print(f"Type: {type(temp_series)}")
print(f"Data type: {temp_series.dtype}")

Type: <class 'pandas.Series'>
Data type: float64


In [51]:
# See first 10 values
temp_series.head(10)

0     5.9
1     6.9
2     5.9
3     6.5
4     9.6
5    10.6
6     9.0
7     8.8
8     9.8
9    10.4
Name: temp, dtype: float64

📖 **Pandas Series** can do almost everything a one-dimensional NumPy array can, plus more. See all operations: [https://pandas.pydata.org/docs/reference/series.html](https://pandas.pydata.org/docs/reference/series.html)
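As a rough illustration of "NumPy-style operations, plus more", the sketch below builds a small Series from the first four temperatures shown earlier and tries a few operations. The name `toy_series` is introduced here for illustration only.

```python
import pandas as pd

# First four daily maximum temperatures from the notebook, in °C
toy_series = pd.Series([5.9, 6.9, 5.9, 6.5], name="temp")

# NumPy-style vectorised arithmetic works unchanged on a Series
fahrenheit = (toy_series * 9 / 5 + 32).round(1)
print(fahrenheit.tolist())  # [42.6, 44.4, 42.6, 43.7]

# Extras that a plain NumPy array does not offer as methods
print(toy_series.value_counts().loc[5.9])  # 2 -> how often 5.9 appears
print(toy_series.describe())               # count, mean, std, min, quartiles, max

# The underlying NumPy array is still available when you need it
print(type(toy_series.to_numpy()))
```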

### 5.3 Vectorised Operations Still Work

In [52]:
# Create a new column with vectorised comparison
weather_df['is_hot'] = weather_df['temp'] >= 28

In [53]:
# See the result
weather_df.head(15)

| | date | temp | is_hot |
|---|---|---|---|
| 0 | 1990-01-01 | 5.9 | False |
| 1 | 1990-01-02 | 6.9 | False |
| 2 | 1990-01-03 | 5.9 | False |
| 3 | 1990-01-04 | 6.5 | False |
| 4 | 1990-01-05 | 9.6 | False |
| 5 | 1990-01-06 | 10.6 | False |
| 6 | 1990-01-07 | 9.0 | False |
| 7 | 1990-01-08 | 8.8 | False |
| 8 | 1990-01-09 | 9.8 | False |
| 9 | 1990-01-10 | 10.4 | False |


In [54]:
# Count hot days
print(f"Total hot days: {weather_df['is_hot'].sum()}")

Total hot days: 153


💭 **Personal Reflection:**

We just added a new column to the DataFrame. How is this easier than creating a separate list?

- [*Write your notes here*]

### 5.4 Filtering DataFrames

In [55]:
# Get only hot days
hot_days_df = weather_df[weather_df['is_hot']]

In [56]:
# See what we got
hot_days_df

| | date | temp | is_hot |
|---|---|---|---|
| 200 | 1990-07-20 | 28.4 | True |
| 201 | 1990-07-21 | 28.5 | True |
| 212 | 1990-08-01 | 29.5 | True |
| 213 | 1990-08-02 | 30.9 | True |
| 214 | 1990-08-03 | 33.7 | True |
| ... | ... | ... | ... |
| 12976 | 2025-07-12 | 28.8 | True |
| 12982 | 2025-07-18 | 29.4 | True |
| 13006 | 2025-08-11 | 29.7 | True |
| 13007 | 2025-08-12 | 31.1 | True |


In [57]:
print(f"Found {len(hot_days_df)} hot days")
print("\nFirst 10:")
hot_days_df.head(10)

Found 153 hot days

First 10:


| | date | temp | is_hot |
|---|---|---|---|
| 200 | 1990-07-20 | 28.4 | True |
| 201 | 1990-07-21 | 28.5 | True |
| 212 | 1990-08-01 | 29.5 | True |
| 213 | 1990-08-02 | 30.9 | True |
| 214 | 1990-08-03 | 33.7 | True |
| 215 | 1990-08-04 | 32.6 | True |
| 1653 | 1994-07-12 | 30.3 | True |
| 1665 | 1994-07-24 | 28.8 | True |
| 1671 | 1994-07-30 | 28.0 | True |
| 2006 | 1995-06-30 | 29.2 | True |
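Filters can also combine several conditions. The sketch below uses made-up toy values; the key details are that each condition needs its own parentheses and that `&` (and), `|` (or) and `~` (not) replace Python's `and`, `or` and `not` when working with whole columns.

```python
import pandas as pd

# Toy table (illustrative values only)
toy_df = pd.DataFrame({
    "date": ["2025-07-01", "2025-07-02", "2025-07-03", "2025-07-04"],
    "temp": [29.1, 31.4, 24.0, 33.0],
})

# Two conditions at once: at least 28°C but below 32°C
hot_not_extreme = toy_df[(toy_df["temp"] >= 28) & (toy_df["temp"] < 32)]
print(hot_not_extreme["date"].tolist())  # ['2025-07-01', '2025-07-02']

# .loc filters rows and selects a column in a single step
dates_30_plus = toy_df.loc[toy_df["temp"] >= 30, "date"]
print(dates_30_plus.tolist())            # ['2025-07-02', '2025-07-04']
```

Note that the filtered rows keep their original row labels, just as `hot_days_df` above keeps labels such as 200 and 201.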


## Summary

Today you saw:

- Vectorisation: applying operations to entire arrays without writing loops
- NumPy arrays for fast numerical comparisons
- `np.where()` for simple categorisation, and its limits when logic gets complex
- Pandas DataFrames for structured data with named columns
- Boolean column creation and DataFrame filtering

Your loop-based approach from the practice assignment was correct. You now have tools that do the same work with less code. Tomorrow's lab will let you experience both approaches side by side.

💭 **Final Reflection:**

What was the most surprising thing you learned today? What questions do you still have?

- [*Write your notes here*]

## Tomorrow's Lab

In tomorrow's 💻 **W04 Lab**, you'll:
- Work in pairs (🧑‍✈️ Pilot + 🙋 Copilot)
- Classify summer weather data
- Try nested `np.where()` first, then Pandas with `np.select()` and boolean columns

The contrast between the two approaches makes the value of Pandas clear.
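As a preview, the second approach might look roughly like the sketch below. It is an assumption about the general shape of the technique, not the lab solution: the toy values, the column names (`is_hot`, `is_warm`, `is_dry`) and the labels are chosen here to mirror the nested `np.where()` example from Section 4.2.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): temperature in °C and rainfall in mm
toy_df = pd.DataFrame({
    "temp": [30.0, 29.0, 22.0, 21.0, 10.0],
    "rain": [0.2, 5.0, 0.5, 3.0, 0.0],
})

# Step 1: one readable boolean column per idea
toy_df["is_hot"] = toy_df["temp"] >= 28
toy_df["is_warm"] = (toy_df["temp"] >= 20) & ~toy_df["is_hot"]
toy_df["is_dry"] = toy_df["rain"] < 1

# Step 2: a flat list of conditions and a matching list of labels
conditions = [
    toy_df["is_hot"] & toy_df["is_dry"],
    toy_df["is_hot"] & ~toy_df["is_dry"],
    toy_df["is_warm"] & toy_df["is_dry"],
    toy_df["is_warm"] & ~toy_df["is_dry"],
]
labels = ["Hot & Dry", "Hot & Wet", "Warm & Dry", "Warm & Wet"]

# Step 3: np.select picks the label of the first condition that is True
toy_df["weather_type"] = np.select(conditions, labels, default="Cool")

print(toy_df["weather_type"].tolist())
# ['Hot & Dry', 'Hot & Wet', 'Warm & Dry', 'Warm & Wet', 'Cool']
```

Adding a new category here means adding one condition and one label, rather than another level of nesting.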

## Mini-Project 1

**Released tomorrow, due Week 06 Thursday 8pm**

Your first graded assignment (20% of final grade) will use these Pandas skills.

Covered today: creating DataFrames, adding columns with vectorised operations, and filtering. Next week adds custom functions and `.apply()`.

---

📖 **Additional Resources:**
- [NumPy Data Types](https://www.w3schools.com/python/numpy/numpy_data_types.asp)
- [Pandas Performance Tips](https://notes.dsc80.com/content/02/data-types.html)
- [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)