[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/00-introduction/introduction.ipynb)

# Module 00: Introduction

Welcome to Data Science and Machine Learning in Chemical Engineering!

## Learning Objectives

By the end of this module, you will:
1. Understand the scope and goals of the course
2. Set up your Python environment
3. Review essential Python concepts
4. Understand the machine learning workflow

## Why Data Science in Chemical Engineering?

Chemical engineers increasingly work with data:

- **Process data**: Sensor readings, control systems, quality measurements
- **Experimental data**: Lab results, catalyst testing, reaction optimization
- **Simulation data**: CFD, molecular dynamics, process modeling
- **Literature data**: Published properties, kinetic parameters, correlations

Machine learning provides tools to:
1. **Find patterns** in complex, high-dimensional data
2. **Build predictive models** when explicit physical equations are unavailable or impractical to solve
3. **Optimize processes** based on data-driven insights
4. **Quantify uncertainty** in predictions and measurements

## Course Overview

### Data Foundations (Modules 01-03)
- NumPy for numerical computing
- Pandas for data manipulation
- Visualization with Matplotlib

### Core Machine Learning (Modules 04-08)
- Dimensionality reduction (PCA, t-SNE)
- Regression (linear, regularized, nonlinear)
- Ensemble methods (Random Forests, Gradient Boosting)

### Advanced Topics (Modules 09-11)
- Clustering for unsupervised learning
- Uncertainty quantification
- Model interpretability

## Environment Setup

### Option 1: uv (Recommended)

[uv](https://docs.astral.sh/uv/) is a fast Python package manager. Install it first:

```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Then set up the project:

```bash
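# Clone the course repository and cd into it
# (repository path taken from the Colab badge link at the top of this page)
git clone https://github.com/jkitchin/s26-06642.git
cd s26-06642
uv sync  # Creates virtual environment and installs all dependencies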

# Activate the environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Start JupyterLab
jupyter lab
```

### Option 2: Google Colab

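Open notebooks directly in Colab using the rocket icon at the top of each page, or the "Open In Colab" badge at the top of the notebook. Colab runs in the browser, so no local Python installation is needed.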

## Verify Installation

Run the following cell to check that everything is installed correctly:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import pycse
import shap
import xgboost

print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scikit-learn: {sklearn.__version__}")
print(f"XGBoost: {xgboost.__version__}")
print("\nAll packages imported successfully!")
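
If one of the imports above fails, Python stops at the first missing package and says nothing about the rest. The sketch below is one way to list everything that is missing in a single pass. It relies on the standard-library function `importlib.util.find_spec`; the helper name `check_packages` is hypothetical and not part of the course materials.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


# Import names used by the verification cell above
required = ["numpy", "pandas", "matplotlib", "sklearn", "pycse", "shap", "xgboost"]

status = check_packages(required)
missing = [name for name, found in status.items() if not found]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Note that `find_spec` only checks whether a package can be located. It does not import the package, so it will not catch a broken installation.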

## Python Refresher

### Lists and Loops

In [None]:
# List of experimental temperatures (Kelvin)
temperatures = [300, 350, 400, 450, 500]

# Convert to Celsius using a loop
temps_celsius = []
for T in temperatures:
    temps_celsius.append(T - 273.15)

print("Celsius:", temps_celsius)

In [None]:
# Better: List comprehension
temps_celsius = [T - 273.15 for T in temperatures]
print("Celsius:", temps_celsius)
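
As a preview of the NumPy module, the same conversion can be written without any loop. This is a minimal sketch that stores the temperatures in a NumPy array, where arithmetic applies to every element at once.

```python
import numpy as np

# Same temperatures as above, stored as a NumPy array (Kelvin)
temperatures = np.array([300, 350, 400, 450, 500])

# Arithmetic on an array applies element by element, so no loop is needed
temps_celsius = temperatures - 273.15

print("Celsius:", temps_celsius)
```

For five values the difference is cosmetic, but the array form stays short and fast when the data grows to thousands of sensor readings.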

### Dictionaries

In [None]:
# Store experiment parameters
experiment = {
    'temperature': 400,  # K
    'pressure': 101.325,  # kPa
    'catalyst': 'Pt/Al2O3',
    'conversion': 0.85
}

print(f"At {experiment['temperature']} K, conversion was {experiment['conversion']:.1%}")
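
Real projects usually involve many experiments, and a list of dictionaries is a common way to hold them before moving to Pandas. The sketch below uses made-up runs (the values and the `Pd/Al2O3` catalyst are illustrative assumptions, not course data) to show looping over key-value pairs, filtering, and picking the best record.

```python
# Hypothetical runs, each stored with the same kind of keys as the dictionary above
experiments = [
    {'temperature': 350, 'catalyst': 'Pt/Al2O3', 'conversion': 0.62},
    {'temperature': 400, 'catalyst': 'Pt/Al2O3', 'conversion': 0.85},
    {'temperature': 450, 'catalyst': 'Pd/Al2O3', 'conversion': 0.78},
]

# Loop over the key-value pairs of a single record
for key, value in experiments[0].items():
    print(f"{key}: {value}")

# Keep only the runs with conversion above 70%
high_conversion = [run for run in experiments if run['conversion'] > 0.70]

# Find the run with the highest conversion
best = max(experiments, key=lambda run: run['conversion'])
print(f"Best run: {best['temperature']} K on {best['catalyst']} ({best['conversion']:.0%})")
```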

### Functions

In [None]:
import numpy as np


def arrhenius_rate(T, A=1e13, Ea=80000, R=8.314):
    """
    Calculate reaction rate constant using Arrhenius equation.

    Parameters
    ----------
    T : float
        Temperature in Kelvin
    A : float
        Pre-exponential factor (1/s)
    Ea : float
        Activation energy (J/mol)
    R : float
        Gas constant (J/mol/K)

    Returns
    -------
    float
        Rate constant k (1/s)
    """
    return A * np.exp(-Ea / (R * T))

# Calculate rate at different temperatures
for T in [300, 400, 500]:
    k = arrhenius_rate(T)
    print(f"k({T} K) = {k:.2e} 1/s")
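
The Arrhenius equation also works in reverse: given rate constants at several temperatures, the activation energy can be estimated from a straight-line fit of ln(k) against 1/T. The sketch below assumes noise-free data generated from the same default parameters as above, so the fit should return values very close to the inputs. With real measurements the estimates would carry uncertainty.

```python
import numpy as np

R = 8.314  # gas constant, J/mol/K

# Generate noise-free rate constants from the same default parameters as above
A_true, Ea_true = 1e13, 80000
T = np.array([300.0, 350.0, 400.0, 450.0, 500.0])  # K
k = A_true * np.exp(-Ea_true / (R * T))

# Linearized form: ln(k) = ln(A) - (Ea / R) * (1 / T)
slope, intercept = np.polyfit(1 / T, np.log(k), 1)

Ea_fit = -slope * R        # J/mol
A_fit = np.exp(intercept)  # 1/s

print(f"Ea = {Ea_fit:.0f} J/mol")
print(f"A  = {A_fit:.2e} 1/s")
```

The slope of the fitted line is -Ea/R and the intercept is ln(A), which is why the code multiplies the slope by -R and exponentiates the intercept.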

## The Machine Learning Workflow

Most ML projects follow this pattern:

```
1. Define the problem
   ↓
2. Collect and explore data
   ↓
3. Prepare data (cleaning, feature engineering)
   ↓
4. Choose and train models
   ↓
5. Evaluate and validate
   ↓
6. Interpret and communicate
   ↓
7. Deploy (if applicable)
```

This course covers steps 2-6, with emphasis on chemical engineering applications.

## Example: Predicting Reaction Yield

Let's preview what we'll be able to do by the end of the course.

**Problem**: Predict reaction yield from experimental conditions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate synthetic experiment data
np.random.seed(42)
n_experiments = 100

# Experimental conditions
temperature = np.random.uniform(300, 500, n_experiments)  # K
pressure = np.random.uniform(1, 10, n_experiments)  # atm
catalyst_loading = np.random.uniform(0.01, 0.10, n_experiments)  # wt%

# "True" yield model (unknown to us in practice)
yield_true = (
    50 + 
    0.1 * (temperature - 400) + 
    2 * pressure + 
    100 * catalyst_loading +
    np.random.normal(0, 3, n_experiments)  # noise
)
yield_true = np.clip(yield_true, 0, 100)  # Yield between 0-100%

# Create DataFrame
data = pd.DataFrame({
    'temperature': temperature,
    'pressure': pressure,
    'catalyst_loading': catalyst_loading,
    'yield': yield_true
})

data.head(10)

In [None]:
# Quick visualization
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].scatter(data['temperature'], data['yield'], alpha=0.5)
axes[0].set_xlabel('Temperature (K)')
axes[0].set_ylabel('Yield (%)')

axes[1].scatter(data['pressure'], data['yield'], alpha=0.5)
axes[1].set_xlabel('Pressure (atm)')
axes[1].set_ylabel('Yield (%)')

axes[2].scatter(data['catalyst_loading'], data['yield'], alpha=0.5)
axes[2].set_xlabel('Catalyst Loading (wt%)')
axes[2].set_ylabel('Yield (%)')

plt.tight_layout()
plt.show()

In [None]:
# Build a simple model (we'll learn this properly later!)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data
X = data[['temperature', 'pressure', 'catalyst_loading']]
y = data['yield']

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}%")
print(f"R²: {r2:.3f}")
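
Because the synthetic yield above was generated from a linear formula, a linear model is a natural baseline to compare against the random forest. The sketch below regenerates comparable data with `np.random.default_rng`, so its numbers differ from the cell above, and it omits the clipping step. The temperature and pressure coefficients should land near the true values of 0.1 and 2. The catalyst-loading coefficient is determined less tightly because that input spans a narrow range relative to the noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 100

# Same ranges and yield formula as the synthetic data above
temperature = rng.uniform(300, 500, n)         # K
pressure = rng.uniform(1, 10, n)               # atm
catalyst_loading = rng.uniform(0.01, 0.10, n)
y = (50 + 0.1 * (temperature - 400) + 2 * pressure
     + 100 * catalyst_loading + rng.normal(0, 3, n))

X = np.column_stack([temperature, pressure, catalyst_loading])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear baseline and score it on the held-out data
baseline = LinearRegression().fit(X_train, y_train)
r2_baseline = r2_score(y_test, baseline.predict(X_test))

for name, coef in zip(['temperature', 'pressure', 'catalyst_loading'], baseline.coef_):
    print(f"{name}: {coef:.2f}")
print(f"Baseline R²: {r2_baseline:.3f}")
```

Comparing a flexible model against a simple baseline like this is a quick check on whether the extra complexity is earning its keep.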

In [None]:
# Visualize predictions
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
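# Draw the parity line across the full range of the data
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, 'r--', label='Perfect prediction')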
plt.xlabel('Actual Yield (%)')
plt.ylabel('Predicted Yield (%)')
plt.title('Model Performance')
plt.legend()
plt.axis('equal')
plt.show()

## What's Next

In the upcoming modules, we'll learn:

1. **NumPy** - How to work with numerical data efficiently
2. **Pandas** - How to load, clean, and manipulate data
3. **Visualization** - How to explore and present data
4. **Regression** - How to build and validate predictive models
5. **Advanced methods** - How to handle complex, nonlinear relationships
6. **Uncertainty** - How to quantify confidence in predictions
7. **Interpretability** - How to understand what the model learned

Let's get started with NumPy in the next module!

## Summary

- Data science and ML are increasingly important in chemical engineering
- This course covers practical tools: NumPy, Pandas, scikit-learn, and more
- We follow a standard ML workflow: data → model → evaluation → interpretation
- All examples will use chemical engineering applications