[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/assignments/project.ipynb)

# Course Project: Machine Learning for Chemical Engineering

**Weight:** 50% of final grade

**Due:** End of semester

## Overview

Apply machine learning techniques to a chemical engineering problem of your choice. This project demonstrates your ability to:

1. Formulate a meaningful problem
2. Collect and preprocess data
3. Apply appropriate ML methods
4. Evaluate and interpret results
5. Communicate findings clearly

In [None]:
! pip install -q pycse
from pycse.colab import pdf

## Project Options

### Option A: Bring Your Own Data
Use data from your research, internship, or a published paper. This is the preferred option because it connects the project to your own work.

### Option B: Public Dataset
Use a publicly available dataset relevant to chemical engineering:
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [Kaggle Datasets](https://www.kaggle.com/datasets)
- [Materials Project](https://materialsproject.org/)
- [NIST Chemistry WebBook](https://webbook.nist.gov/chemistry/)

### Example Topics
- Predicting material properties from composition
- Process optimization using surrogate models
- Fault detection in manufacturing processes
- Reaction yield prediction
- Catalyst performance modeling
- Polymer property prediction

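## Deliverables

The percentages below are shares of the project grade (they sum to 100%), not of the final course grade.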

### 1. Project Proposal (Week 4, 10%)
- Problem description (1 paragraph)
- Data source and description
- Proposed methods
- Expected outcomes

### 2. Progress Report (Week 8, 15%)
- Data exploration and preprocessing
- Initial modeling results
- Challenges encountered
- Updated plan

### 3. Final Report (End of semester, 50%)
- Complete analysis notebook
- Written report (see template below)

### 4. Presentation (End of semester, 25%)
- 10-minute presentation
- 5-minute Q&A

## Report Template

Your final report should include:

### 1. Introduction (1-2 pages)
- Problem motivation
- Background and prior work
- Objectives

### 2. Data (1-2 pages)
- Data source and collection
- Feature descriptions
- Exploratory data analysis
- Preprocessing steps

### 3. Methods (2-3 pages)
- Model selection rationale
- Hyperparameter tuning approach
- Validation strategy

### 4. Results (2-3 pages)
- Model performance metrics
- Comparison of methods
- Feature importance / interpretability
- Uncertainty quantification (if applicable)

### 5. Discussion (1-2 pages)
- Key findings
- Limitations
- Future work

### 6. Conclusions (0.5 page)
- Summary of contributions

## Grading Rubric

| Criterion | Description |
|-----------|-------------|
| Problem Formulation | Clear, relevant, well-motivated |
| Data Analysis | Thorough EDA, appropriate preprocessing |
| Methodology | Appropriate methods, proper validation |
| Results | Clear presentation, proper metrics |
| Interpretation | Domain insights, meaningful conclusions |
| Communication | Clear writing, good visualizations |

## Project Notebook Template

Use the sections below as a starting point for your analysis.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Add other imports as needed

### Data Loading and Exploration

In [None]:
# Load your data
# df = pd.read_csv('your_data.csv')

# Basic info
# print(df.shape)
# df.head()

In [None]:
# Summary statistics
# df.describe()

In [None]:
# Missing values
# df.isna().sum()

In [None]:
# Visualizations
# Add plots to explore your data
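
As a starting point for the exploration cells above, here is a minimal sketch of a first look at a dataset: per-column histograms, a correlation table, and one feature-vs-target scatter plot. The DataFrame `demo` and its column names (`temperature_K`, `pressure_bar`, `yield_pct`) are synthetic placeholders invented for this example; replace them with your own `df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for your dataset (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "temperature_K": rng.uniform(300, 500, n),
    "pressure_bar": rng.uniform(1, 10, n),
})
demo["yield_pct"] = (
    0.1 * demo["temperature_K"]
    + 2.0 * demo["pressure_bar"]
    + rng.normal(0, 2, n)
)

# Histograms show the range and skew of each column.
demo.hist(bins=20, figsize=(9, 3), layout=(1, 3))

# Pairwise Pearson correlations flag strongly related columns.
corr = demo.corr()
print(corr.round(2))

# Scatter plot of one feature against the target.
fig, ax = plt.subplots()
ax.scatter(demo["temperature_K"], demo["yield_pct"], s=10)
ax.set_xlabel("temperature_K")
ax.set_ylabel("yield_pct")
plt.show()
```

Correlations only capture linear relationships, so the scatter plots are worth inspecting even when a coefficient is near zero.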

### Data Preprocessing

In [None]:
# Handle missing values
# Feature engineering
# Train/test split
# Scaling (fit the scaler on the training set only, then apply it to the test set)
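
The preprocessing steps listed above could be sketched as follows. This is one possible arrangement, not a required one: it assumes median imputation is acceptable for your data, and it uses a synthetic DataFrame with hypothetical column names. The point it illustrates is that the imputer and scaler are fitted on the training split only, so no information from the test split leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your dataset (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "temperature_K": rng.uniform(300, 500, n),
    "pressure_bar": rng.uniform(1, 10, n),
})
demo["yield_pct"] = (
    0.1 * demo["temperature_K"]
    + 2.0 * demo["pressure_bar"]
    + rng.normal(0, 2, n)
)
# Remove a few feature values to mimic missing measurements.
missing_rows = demo.sample(10, random_state=0).index
demo.loc[missing_rows, "pressure_bar"] = np.nan

X = demo[["temperature_K", "pressure_bar"]]
y = demo["yield_pct"]

# Split first so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training data, then apply the same transforms to the test data.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_train_s = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_s = scaler.transform(imputer.transform(X_test))

print(X_train_s.shape, X_test_s.shape)
```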

### Model Training

In [None]:
# Train multiple models
# Cross-validation
# Hyperparameter tuning
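
One way to fill in the training cell above is sketched here: two candidate models are compared with 5-fold cross-validation, then one of them is tuned with a small grid search. The data are synthetic, and the model choices (`Ridge`, `RandomForestRegressor`) and grid values are illustrative assumptions rather than recommendations for your problem.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: one linear effect, one nonlinear effect,
# and one irrelevant feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Putting the scaler inside a pipeline refits it within each CV fold.
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Small illustrative grid; widen it for a real study.
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print(f"best CV R^2: {search.best_score_:.3f}")
```

The test split is deliberately left untouched here, so that it can give an unbiased estimate once the model and hyperparameters are fixed.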

### Results and Evaluation

In [None]:
# Performance metrics
# Model comparison
# Predicted vs. actual (parity) plots
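
For a regression project, the evaluation cell above might look like the following sketch. It computes RMSE, MAE, and R² on a held-out test set and draws a predicted-vs-actual plot with a `y = x` reference line. The data and the `LinearRegression` model are synthetic placeholders; classification projects would need different metrics.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics computed on the held-out test set only.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, R^2 = {r2:.3f}")

# Points on the dashed line are perfect predictions.
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(y_test, y_pred, s=15)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, "k--", label="y = x")
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.legend()
plt.show()
```

Reporting RMSE and MAE in the units of the target makes the errors easier to judge against measurement uncertainty than R² alone.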

### Interpretability

In [None]:
# Feature importance
# SHAP analysis
# Domain validation (do the important features and trends make physical sense?)
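
As a sketch of the feature-importance step above, the example below uses permutation importance from scikit-learn. This is a model-agnostic alternative to SHAP, swapped in here because it needs no extra dependency; it is not a SHAP analysis. The data and feature names are synthetic assumptions, with `noise_feature` deliberately unrelated to the target so that the ranking can be sanity-checked.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on the first feature,
# weakly on the second, and not at all on the third.
rng = np.random.default_rng(0)
feature_names = ["temperature", "pressure", "noise_feature"]
X = rng.uniform(0, 1, size=(300, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on the test set and record the drop in R^2.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0, scoring="r2"
)

# Rank features from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    mean = result.importances_mean[i]
    std = result.importances_std[i]
    print(f"{feature_names[i]}: {mean:.3f} +/- {std:.3f}")
```

If a feature you expect to matter physically ranks near the bottom, or an implausible one ranks near the top, that is a prompt to revisit the data and the model before drawing conclusions.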

### Conclusions

Summarize your findings here.

## Tips for Success

1. **Start early** - Data collection and cleaning take time
2. **Iterate** - Don't expect perfect results on the first try
3. **Document everything** - Keep notes on what you tried
4. **Validate results** - Check whether findings make physical sense
5. **Ask questions** - Office hours are there to help
6. **Focus on insight** - A simple model with good interpretation beats a complex black box