# Part 1: Data Ingestion & Exploration

In this notebook, we fetch, inspect, visualize, and export historical hourly weather data for San Francisco International Airport (SFO).

**Goals:**
1.  **Fetch Data**: Use the `meteostat` library to download hourly weather data.
2.  **Clean & Inspect**: Check for missing values and understand the dataset structure.
3.  **Visualize**: Explore the data through time series plots and distributions to understand the local climate.
4.  **Export**: Save the processed data for use in forecasting models.


## 1. Setup & Imports


In [None]:
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from meteostat import Point, stations, hourly, config
import pandas as pd
import numpy as np
import os

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = [12, 6]
plt.rcParams['font.family'] = "monospace"

# Create data directories
os.makedirs('../data/processed', exist_ok=True)

# Allow large requests
config.block_large_requests = False


## 2. Station Selection
We will target San Francisco International Airport (SFO) specifically.


In [None]:
lat = 37.6213 # San Francisco International Airport (SFO)
lon = -122.3790
point = Point(lat, lon)

# Find the nearest weather stations
print("Finding nearest station to SFO...")
nearby = stations.nearby(point)

# Keep the five nearest stations so we can check their names
stations_df = nearby.head(5)
print("Top 5 nearby stations:")
print(stations_df[['name', 'country', 'region']])

Next, select the station whose name contains "San Francisco International". The DataFrame index holds the station ID.

In [None]:
station = stations_df[stations_df['name'].str.contains("San Francisco International", case=False)].iloc[0]
station_id = station.name # In Meteostat DataFrame, the index is the ID
print(f"\nSelected Station: {station['name']} (ID: {station_id})")
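The name filter above raises an `IndexError` if none of the listed stations match. As a minimal sketch of a more defensive selection, the helper below falls back to the first row when no name matches. The `pick_station` helper, the demo table, and the station IDs are hypothetical; only the table layout (ID as index, a `name` column) is taken from the cell above, and treating the first row as the nearest station is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the nearby-stations table: indexed by station ID
# with a 'name' column. The IDs and two of the names are made up.
stations_demo = pd.DataFrame(
    {"name": ["Example Field", "San Francisco International Airport", "Example Harbor"]},
    index=pd.Index(["X0001", "X0002", "X0003"], name="id"),
)


def pick_station(table: pd.DataFrame, pattern: str) -> str:
    """Return the ID of the first station whose name contains `pattern`.

    Falls back to the first row (assumed here to be the nearest station)
    when no name matches.
    """
    mask = table["name"].str.contains(pattern, case=False, na=False, regex=False)
    matches = table[mask]
    chosen = matches if not matches.empty else table
    return chosen.index[0]


demo_id = pick_station(stations_demo, "San Francisco International")
fallback_id = pick_station(stations_demo, "No Such Station")
print(f"Matched: {demo_id}, fallback: {fallback_id}")
```

Whether a silent fallback is appropriate depends on the use case; raising a descriptive error may be the better choice when the exact station matters.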

## 3. Data Fetching
We request hourly data from January 1, 2022 to December 31, 2026.


In [None]:
# Define time period
start = datetime(2022, 1, 1)
end = datetime(2026, 12, 31)

# Fetch hourly data
print("Fetching data...")
data = hourly(station_id, start, end)
df = data.fetch()

print(f"Fetched {len(df)} hourly records.")
df.head()
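A record count alone does not show whether every hour in the range is present. The sketch below, which uses a small synthetic frame rather than the fetched data, compares the index against a complete hourly grid and lists the missing timestamps. It assumes the frame has a single `DatetimeIndex`; the `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the fetched frame: 48 hourly rows with three hours dropped.
full_index = pd.date_range("2022-01-01 00:00", periods=48, freq="h")
demo = pd.DataFrame({"temp": np.linspace(10.0, 15.0, 48)}, index=full_index)
demo = demo.drop(full_index[[5, 6, 30]])

# Build the expected hourly grid and list timestamps absent from the data.
expected = pd.date_range(demo.index.min(), demo.index.max(), freq="h")
missing_hours = expected.difference(demo.index)

# Reindexing onto the grid makes the gaps explicit as NaN rows.
demo_regular = demo.reindex(expected)

print(f"{len(missing_hours)} missing hours out of {len(expected)} expected")
```

Reindexing before any resampling or lag-feature step keeps the time axis regular, which many forecasting models assume.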


## 4. Initial Data Inspection
Let's look at the data structure and summary statistics.


In [None]:
print("Dataset Info:")
df.info()

In [None]:
print("\nSummary Statistics:")
df.describe()

## 5. Data Visualization

### Missing Values Analysis


In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')

# Check percentage of missing values
missing_percent = df.isnull().mean() * 100
print("\nPercentage of missing values per column:")
print(missing_percent[missing_percent > 0])
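Once the missing-value percentages are known, one possible follow-up is to drop columns with no observations and fill only short gaps. The sketch below illustrates this on synthetic data; the two-hour limit, the `snow` column, and the values are assumptions chosen for the example, not settings used in this notebook.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", periods=12, freq="h")
demo = pd.DataFrame(
    {
        "temp": [10.0, np.nan, 12.0, 13.0, np.nan, np.nan,
                 np.nan, np.nan, 18.0, 19.0, 20.0, 21.0],
        "snow": [np.nan] * 12,  # a column with no observations at all
    },
    index=idx,
)

# Drop columns that are entirely missing.
cleaned = demo.dropna(axis=1, how="all")

# Time-weighted interpolation, filling at most 2 consecutive NaNs per gap.
# Note: pandas fills the first 2 values of a longer gap and leaves the rest NaN.
cleaned = cleaned.interpolate(method="time", limit=2, limit_area="inside")

print(cleaned["temp"].tolist())
```

Because longer gaps are only partially filled, it can be worth checking the remaining NaN count afterwards and deciding separately how to treat those stretches.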


### Variable Distributions


In [None]:
# Plot distributions of key variables
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Temperature
sns.histplot(df['temp'], kde=True, ax=axes[0, 0], color='orange')
axes[0, 0].set_title('Temperature Distribution (°C)')

# Relative Humidity
sns.histplot(df['rhum'], kde=True, ax=axes[0, 1], color='blue')
axes[0, 1].set_title('Relative Humidity Distribution (%)')

# Wind Speed
sns.histplot(df['wspd'], kde=True, ax=axes[1, 0], color='green')
axes[1, 0].set_title('Wind Speed Distribution (km/h)')

# Pressure
sns.histplot(df['pres'], kde=True, ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Pressure Distribution (hPa)')

plt.tight_layout()

### Time Series


In [None]:
# Full Time Series for Temperature
plt.figure(figsize=(15, 6))
df['temp'].plot()
plt.title('Temperature Over Time (2022-2026)')
plt.ylabel('Temperature (°C)')
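Several years of hourly values produce a dense plot in which the daily cycle hides slower trends. A common remedy, sketched here on synthetic data, is to resample to daily means and smooth with a rolling window. The 7-day window and the synthetic signal are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures: a daily cycle on a slow upward trend (30 days).
idx = pd.date_range("2022-01-01", periods=24 * 30, freq="h")
hours = np.arange(len(idx))
temp = 12 + 0.005 * hours + 4 * np.sin(2 * np.pi * (hours % 24) / 24)
demo = pd.DataFrame({"temp": temp}, index=idx)

# Average each calendar day, which removes the within-day cycle.
daily_mean = demo["temp"].resample("D").mean()

# Smooth further with a 7-day rolling mean (shorter windows at the start).
weekly_trend = daily_mean.rolling(window=7, min_periods=1).mean()

print(daily_mean.head())
```

Both `daily_mean` and `weekly_trend` are ordinary Series, so they can be drawn with `.plot()` in the same way as the hourly temperature above.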

## 6. Export Data


In [None]:
output_path = '../data/processed/sfo_hourly_2022_2026.csv'
df.to_csv(output_path)
print(f"Saved to {output_path}")
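When the CSV is loaded again in a later notebook, the timestamp index comes back as plain strings unless it is parsed explicitly. The sketch below shows one way to verify the round trip, using a temporary file and a small synthetic frame. It assumes the timestamps are stored in the first CSV column; the file name and values are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", periods=6, freq="h", name="time")
demo = pd.DataFrame({"temp": [10.0, 10.5, np.nan, 11.2, 11.0, 10.8]}, index=idx)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_hourly.csv")
    demo.to_csv(path)
    # Use the first column as the index and parse it as datetimes.
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

# Compare timestamps and values; NaNs are treated as equal to each other.
round_trip_ok = (
    len(restored) == len(demo)
    and bool((restored.index == demo.index).all())
    and bool(np.allclose(restored["temp"], demo["temp"], equal_nan=True))
)
print(f"Round trip OK: {round_trip_ok}")
```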
