# Data Handling with Pandas

## Learning Objectives
- Understand DataFrames and Series
- Load and save data from various formats
- Clean and manipulate datasets
- Group and aggregate data
- Handle time series data

## Prerequisites
- Python Basics
- Working with Arrays (NumPy)

---

## 1. Introduction to Pandas

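Pandas provides two core data structures for structured data: the `Series`, a one-dimensional labeled array, and the `DataFrame`, a two-dimensional table whose columns are Series sharing one index.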

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary
ocean_data = {
    'station': ['A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
    'latitude': [45.2, 45.5, 45.8, 46.1, 46.4, 46.7],
    'longitude': [-125.3, -125.1, -124.9, -124.7, -124.5, -124.3],
    'temperature': [15.2, 14.8, 16.1, 15.9, 14.5, 16.3],
    'salinity': [33.8, 33.9, 34.1, 34.0, 33.7, 34.2],
    'depth': [50, 75, 100, 60, 80, 95]
}

df = pd.DataFrame(ocean_data)
print("Ocean Station Data:")
print(df)
print(f"\nDataFrame shape: {df.shape}")
print(f"Column names: {list(df.columns)}")
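
The learning objectives mention Series as well as DataFrames, so a short sketch of how the two relate may help. The `stations` table here is a small stand-in that reuses three rows of the station data above; the variable names are illustrative only.

```python
import pandas as pd

# Small stand-in for the station table (values copied from the first three rows)
stations = pd.DataFrame({
    'station': ['A1', 'A2', 'A3'],
    'temperature': [15.2, 14.8, 16.1],
})

# Selecting a single column returns a Series: values plus an index
temps = stations['temperature']
print(type(temps).__name__)

# A Series can carry meaningful labels instead of the default 0, 1, 2, ...
temps_by_station = stations.set_index('station')['temperature']
print(temps_by_station['A2'])
```

With station names as the index, a value can be looked up by label rather than by position.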

## 2. Data Exploration and Info

Before any analysis, check the column types, non-null counts, and summary statistics of your dataset.

In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None, so no print() is needed

print("\nBasic Statistics:")
print(df.describe())

print("\nFirst few rows:")
print(df.head(3))

print("\nLast few rows:")
print(df.tail(3))

## 3. Data Selection and Filtering

Access specific rows, columns, and subsets of data.

In [None]:
# Select columns
print("Station temperatures:")
print(df['temperature'])

# Select multiple columns
print("\nTemperature and Salinity:")
print(df[['station', 'temperature', 'salinity']])

# Filter rows based on conditions
warm_stations = df[df['temperature'] > 15.5]
print("\nWarm stations (>15.5°C):")
print(warm_stations)

# Multiple conditions
shallow_warm = df[(df['temperature'] > 15.5) & (df['depth'] < 80)]
print("\nShallow and warm stations:")
print(shallow_warm)
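
Bracket indexing with a boolean mask is one of several ways to select data. The sketch below shows the label-based `.loc`, the position-based `.iloc`, and `.isin` on a reduced copy of the station table; the `stations` frame and result names are illustrative.

```python
import pandas as pd

# Reduced copy of the station table (first four stations)
stations = pd.DataFrame({
    'station': ['A1', 'A2', 'A3', 'B1'],
    'temperature': [15.2, 14.8, 16.1, 15.9],
    'depth': [50, 75, 100, 60],
})

# .loc selects by label or boolean mask and can pick columns in the same call
warm = stations.loc[stations['temperature'] > 15.5, ['station', 'depth']]

# .iloc selects by integer position: first two rows, first two columns
first_two = stations.iloc[:2, :2]

# .isin keeps rows whose value appears in a list
a_line = stations[stations['station'].isin(['A1', 'A2', 'A3'])]

print(warm)
print(first_two)
print(a_line)
```

Using `.loc` with both a row mask and a column list avoids chaining two separate selections.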

## 4. Data Manipulation

Add new columns and modify existing data.

In [None]:
# Add new calculated columns
df['temp_kelvin'] = df['temperature'] + 273.15

# Calculate density with a simplified linear formula (for illustration only; not a real seawater equation of state)
df['density'] = 1000 + 0.8 * df['salinity'] - 0.2 * df['temperature'] + 0.005 * df['depth']

# Create categorical column; bins are right-inclusive: (0, 60], (60, 90], (90, 200]
df['depth_category'] = pd.cut(df['depth'], 
                              bins=[0, 60, 90, 200], 
                              labels=['shallow', 'medium', 'deep'])

# Round numerical columns
df['density'] = df['density'].round(2)

print("Enhanced dataset:")
print(df)
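
Cleaning is listed in the learning objectives, and real station data often has gaps. The following is a minimal sketch of two common ways to handle missing values. The `raw` table and its gaps are hypothetical and are not part of the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps (NaN marks a missing measurement)
raw = pd.DataFrame({
    'station': ['A1', 'A2', 'A3', 'B1'],
    'temperature': [15.2, np.nan, 16.1, 15.9],
    'salinity': [33.8, 33.9, np.nan, 34.0],
})

# Count missing values per column
print(raw.isna().sum())

# Option 1: drop every row that has at least one missing value
complete_rows = raw.dropna()

# Option 2: fill temperature gaps with the column mean; salinity is left untouched
filled = raw.fillna({'temperature': raw['temperature'].mean()})

print(complete_rows)
print(filled)
```

Dropping rows is the conservative choice; filling keeps the row but replaces the gap with an estimate, which should be documented wherever the data is used.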

## 5. Grouping and Aggregation

Summarize data by groups.

In [None]:
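# Group by depth category
# observed=True keeps only categories present in the data; passing it explicitly
# also avoids the FutureWarning that pandas 2.1+ raises for categorical groupers
depth_groups = df.groupby('depth_category', observed=True)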

print("Statistics by depth category:")
print(depth_groups[['temperature', 'salinity', 'density']].mean())

print("\nCount by depth category:")
print(depth_groups.size())

# Multiple aggregations
print("\nMultiple statistics:")
agg_stats = depth_groups['temperature'].agg(['mean', 'std', 'min', 'max'])
print(agg_stats)
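
When different columns need different statistics, named aggregation gives each result column an explicit name. This sketch uses a stand-in table with the same temperature, salinity, and depth-category values as the stations above; the output column names are illustrative.

```python
import pandas as pd

# Stand-in table: depth categories written out as plain strings
stations = pd.DataFrame({
    'depth_category': ['shallow', 'medium', 'deep', 'shallow', 'medium', 'deep'],
    'temperature': [15.2, 14.8, 16.1, 15.9, 14.5, 16.3],
    'salinity': [33.8, 33.9, 34.1, 34.0, 33.7, 34.2],
})

# Named aggregation: new_column=(source_column, function)
summary = stations.groupby('depth_category').agg(
    mean_temp=('temperature', 'mean'),
    max_salinity=('salinity', 'max'),
    n_stations=('temperature', 'count'),
)
print(summary)
```

Because the categories here are plain strings, the groups appear in alphabetical order rather than in shallow-to-deep order.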

## 6. Working with Time Series

Using dates as the index enables date-based selection and resampling to other time intervals.

In [None]:
# Create time series data
dates = pd.date_range('2024-01-01', periods=30, freq='D')
np.random.seed(42)
sea_level = 1000 + 50 * np.sin(np.arange(30) * 2 * np.pi / 14) + 10 * np.random.randn(30)

time_series = pd.DataFrame({
    'date': dates,
    'sea_level_mm': sea_level
})

# Set date as index
time_series = time_series.set_index('date')

print("Sea level time series:")
print(time_series.head(10))

# Summary statistics over the whole record
print(f"\nMean sea level: {time_series['sea_level_mm'].mean():.1f} mm")
print(f"Sea level range: {time_series['sea_level_mm'].max() - time_series['sea_level_mm'].min():.1f} mm")

# Resample to weekly means
weekly_mean = time_series.resample('W').mean()
print("\nWeekly averages:")
print(weekly_mean)
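
Two other operations that rely on a date index are selecting a date range by string and computing a rolling mean. The sketch below uses a simple ramp of placeholder values (0, 1, 2, ...) instead of the simulated sea level so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Placeholder daily series: values 0.0 to 29.0 on 30 consecutive days
dates = pd.date_range('2024-01-01', periods=30, freq='D')
series = pd.Series(np.arange(30, dtype=float), index=dates, name='sea_level_mm')

# Date-string slicing on a DatetimeIndex includes both endpoints
second_week = series.loc['2024-01-08':'2024-01-14']

# 7-day rolling mean; the first 6 values are NaN because the window is incomplete
rolling = series.rolling(window=7).mean()

print(second_week)
print(rolling.head(8))
```

Unlike resampling, which produces one value per week, a rolling mean keeps the daily resolution and smooths each point using the preceding days.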

## 7. Data Import/Export

Work with external data files.

In [None]:
# Save to CSV
df.to_csv('ocean_stations.csv', index=False)
print("Data saved to 'ocean_stations.csv'")

# Read from CSV
loaded_data = pd.read_csv('ocean_stations.csv')
print("\nLoaded data:")
print(loaded_data.head())

# Save time series to CSV with proper date handling
time_series.to_csv('sea_level_data.csv')
print("\nTime series saved to 'sea_level_data.csv'")

# Read time series with date parsing
loaded_ts = pd.read_csv('sea_level_data.csv', index_col='date', parse_dates=True)
print("\nLoaded time series:")
print(loaded_ts.head())
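
CSV stores only text, so some column types do not survive a save and reload. The sketch below checks this for a categorical column like `depth_category`, writing to an in-memory buffer rather than a file; the small `stations` table is a stand-in.

```python
import io
import pandas as pd

# Stand-in table with a categorical column built the same way as depth_category
stations = pd.DataFrame({'depth': [50, 75, 100]})
labels = ['shallow', 'medium', 'deep']
stations['depth_category'] = pd.cut(stations['depth'], bins=[0, 60, 90, 200], labels=labels)

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
stations.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded column holds plain strings and is no longer categorical
was_categorical = isinstance(reloaded['depth_category'].dtype, pd.CategoricalDtype)
print(was_categorical)

# Restore the ordered categorical type explicitly if later code relies on it
reloaded['depth_category'] = pd.Categorical(
    reloaded['depth_category'], categories=labels, ordered=True)
print(reloaded['depth_category'].dtype)
```

The same idea applies to dates, which is why the time series above is read back with `parse_dates=True`.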

## 8. Exercise: CTD Data Analysis

Analyze a simulated CTD (Conductivity, Temperature, Depth) profile.

In [None]:
# Create simulated CTD profile
np.random.seed(123)
depths = np.arange(0, 201, 5)  # 0 to 200m, every 5m
n_points = len(depths)

# Simulate a simplified ocean profile: temperature falls and salinity rises linearly with depth, plus random noise
temperatures = 20 - depths * 0.08 + np.random.normal(0, 0.5, n_points)
salinities = 34 + depths * 0.01 + np.random.normal(0, 0.1, n_points)
conductivities = salinities * 4.2  # Simplified relationship

ctd_data = pd.DataFrame({
    'depth_m': depths,
    'temperature_c': temperatures,
    'salinity_psu': salinities,
    'conductivity': conductivities
})

print("CTD Profile Data:")
print(ctd_data.head(10))

# Your analysis tasks:
# 1. Calculate water density at each depth
# 2. Find the thermocline (depth of maximum temperature gradient)
# 3. Identify water masses (group by temperature/salinity ranges)
# 4. Calculate average properties for upper ocean (0-50m) vs deep ocean (>50m)

# Example solution for task 1:
ctd_data['density'] = (1000 + 0.8 * ctd_data['salinity_psu'] - 
                      0.2 * ctd_data['temperature_c'] + 
                      0.005 * ctd_data['depth_m'])

print("\nWith density calculated:")
print(ctd_data.head(10))

# Save the CTD data for future use
ctd_data.to_csv('ctd_profile.csv', index=False)
print("\nCTD data saved to 'ctd_profile.csv'")
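
One possible approach to tasks 2 and 4 is sketched below. It uses a hypothetical noise-free profile with a sharp temperature drop centred at 60 m, because the simulated profile above is linear and its steepest gradient is set by the noise. The `profile` table and the layer labels are assumptions made for illustration, not the expected solution.

```python
import numpy as np
import pandas as pd

depths = np.arange(0, 201, 5)
# Hypothetical profile: warm surface layer, cold deep layer, transition near 60 m
temperature = 12 + 4 * np.tanh((60 - depths) / 10)
profile = pd.DataFrame({'depth_m': depths, 'temperature_c': temperature})

# Task 2: vertical temperature gradient in degrees C per metre
profile['dT_dz'] = np.gradient(profile['temperature_c'].to_numpy(),
                               profile['depth_m'].to_numpy())

# Thermocline depth: where temperature falls fastest (most negative gradient)
thermocline_depth = profile.loc[profile['dT_dz'].idxmin(), 'depth_m']
print(f"Thermocline depth: {thermocline_depth} m")

# Task 4: label each sample as upper (0-50 m) or deep (>50 m), then average
profile['layer'] = np.where(profile['depth_m'] <= 50, 'upper', 'deep')
layer_means = profile.groupby('layer')['temperature_c'].mean()
print(layer_means)
```

On noisy data such as the simulated CTD profile, smoothing the temperature first (for example with a rolling mean) would likely give a more stable gradient.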

## Summary

In this module, you learned:
- Creating and manipulating DataFrames
- Data selection, filtering, and indexing
- Adding calculated columns
- Grouping and aggregating data
- Working with time series
- Reading and writing data files
- Applying pandas to oceanographic data analysis

## Next Steps

Continue to "Basic Plotting" to learn how to visualize your data with matplotlib.

## Additional Resources

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Python for Data Analysis](https://wesmckinney.com/book/) by Wes McKinney