# 2.1 AI-Assisted Coding with Codex

## Course 3: Advanced Classification Models for Student Success

## Introduction

### What Is Codex?

**Codex** is OpenAI's AI coding agent, built into ChatGPT. It can:

- Write, run, and debug Python code in a cloud sandbox
- Install packages (`scikit-learn`, `xgboost`, `pandas`, etc.)
- Read and analyze your uploaded data files
- Generate complete Jupyter-style analyses from natural language prompts

### Why Learn This Now?

You've just learned regularized logistic regression in Module 1. Before diving into
tree-based models (Module 3) and model comparison (Module 4), we pause to learn
**how to use AI as a coding partner**. This skill will accelerate everything that follows.

### What Is "Vibecoding"?

**Vibecoding** = using natural language prompts to direct an AI tool to write,
run, and iterate on code. Instead of writing every line yourself, you:

1. **Describe** what you want in plain English
2. **Review** the code the AI generates
3. **Iterate** by refining your prompts
4. **Validate** the results against your domain knowledge

> *Think of it as pair programming — you bring the institutional knowledge,
> Codex brings the coding speed.*

### Learning Objectives

1. Understand what Codex is and how it works
2. Learn effective prompting strategies for data science tasks
3. Practice the prompt → review → iterate workflow
4. Apply vibecoding to replicate analyses from this course

## 1. How to Access Codex

### Getting Started

1. Go to [chatgpt.com](https://chatgpt.com)
2. Sign in with a ChatGPT Plus, Pro, Team, or Enterprise account; Codex is available to these subscribers
3. Look for the **Codex** option in the model/tool selector
4. Upload your data file(s) and start prompting

### The Codex Workflow

```
┌─────────────────────────────────────────────┐
│  1. Upload data file (CSV, Excel, etc.)     │
│  2. Write a prompt describing your analysis │
│  3. Codex writes and runs Python code       │
│  4. Review output (tables, charts, metrics) │
│  5. Refine prompt if needed                 │
│  6. Download results                        │
└─────────────────────────────────────────────┘
```

## 2. Prompting Strategies for Data Science

### The CRISP Prompting Framework

| Letter | Meaning | Example |
|:-------|:--------|:--------|
| **C** | Context | "I have a dataset of 5,000 college students..." |
| **R** | Role | "Act as an institutional research analyst..." |
| **I** | Instructions | "Build a logistic regression model to predict..." |
| **S** | Specifics | "Use L2 regularization with C=0.1, scale features..." |
| **P** | Product | "Show me the AUC, confusion matrix, and top 5 coefficients" |
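
The five CRISP parts can also be assembled programmatically, which helps when you reuse the same context and role across many prompts. The sketch below is only an illustration: `CrispPrompt` is a hypothetical helper (not part of Codex or any library), and the completed example sentences are filled in for demonstration.

```python
from dataclasses import dataclass


@dataclass
class CrispPrompt:
    """Hypothetical container for the five CRISP parts (not a Codex API)."""
    context: str
    role: str
    instructions: str
    specifics: str
    product: str

    def render(self) -> str:
        # Join the parts in C-R-I-S-P order, one part per line.
        parts = [self.context, self.role, self.instructions,
                 self.specifics, self.product]
        return "\n".join(part.strip() for part in parts)


# Example wording is illustrative; adapt each part to your own data.
prompt = CrispPrompt(
    context="I have a dataset of 5,000 college students.",
    role="Act as an institutional research analyst.",
    instructions="Build a logistic regression model to predict DEPARTED.",
    specifics="Use L2 regularization with C=0.1 and scale the features.",
    product="Show me the AUC, confusion matrix, and top 5 coefficients.",
)
print(prompt.render())
```

Pasting the rendered text into Codex gives you a prompt that covers all five parts in a consistent order.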

### Example Prompts

#### Prompt 1: Exploratory Data Analysis
```
I've uploaded a CSV of student records with columns for HS_GPA, UNITS_ATTEMPTED,
GPA_1, DFW_RATE_1, and DEPARTED (0/1). Please:
1. Show summary statistics
2. Plot the distribution of each variable
3. Show a correlation heatmap
4. Identify the top 3 features most correlated with DEPARTED (by absolute correlation)
```
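
When you review what Codex returns for a prompt like this, it helps to know roughly what the code should look like. The following is a minimal sketch of steps 1, 3, and 4, assuming pandas and NumPy are available. The data is synthetic (an assumption made so the example runs without a file), and plotting is omitted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the uploaded CSV (assumption: with real data you
# would load the file with pd.read_csv instead).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "HS_GPA": rng.normal(3.2, 0.4, n).clip(0, 4),
    "UNITS_ATTEMPTED": rng.integers(6, 19, n),
    "GPA_1": rng.normal(2.9, 0.7, n).clip(0, 4),
    "DFW_RATE_1": rng.uniform(0, 1, n),
})
# Purely illustrative: departure is made more likely when DFW_RATE_1 is high.
df["DEPARTED"] = (rng.uniform(0, 1, n) < 0.1 + 0.5 * df["DFW_RATE_1"]).astype(int)

summary = df.describe()   # step 1: summary statistics
corr = df.corr()          # step 3: correlation matrix behind the heatmap

# Step 4: rank features by absolute correlation with the target.
top3 = (
    corr["DEPARTED"]
    .drop("DEPARTED")
    .abs()
    .sort_values(ascending=False)
    .head(3)
)
print(summary.round(2))
print(top3.round(3))
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.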

#### Prompt 2: Build a Classification Model
```
Using the same data, build a regularized logistic regression:
- Target: DEPARTED
- Features: all numeric columns except DEPARTED
- Hold out a stratified 20% test set and report all metrics on it
- Use StandardScaler
- Use L2 regularization with C=0.1
- Show: AUC, F1, precision, recall, confusion matrix
- Plot the ROC curve
- List the top 10 most important features by absolute coefficient
```
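
For reference while reviewing, here is a minimal sketch of the kind of scikit-learn code this prompt might produce. Several things are assumptions: the data is synthetic, the 80/20 stratified split is one reasonable choice rather than something the prompt dictates, and the ROC plot is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; replace with your uploaded file).
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "HS_GPA": rng.normal(3.2, 0.4, n),
    "GPA_1": rng.normal(2.9, 0.7, n),
    "DFW_RATE_1": rng.uniform(0, 1, n),
})
logit = -1.0 + 3.0 * (X["DFW_RATE_1"] - 0.5) - 1.0 * (X["GPA_1"] - 2.9)
y = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out 20% for evaluation (assumed split; the prompt does not fix one).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The pipeline fits the scaler on training data only. L2 is scikit-learn's
# default penalty, and a smaller C means stronger regularization.
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)
auc = roc_auc_score(y_test, proba)
cm = confusion_matrix(y_test, pred)
print(f"AUC={auc:.3f}",
      f"F1={f1_score(y_test, pred, zero_division=0):.3f}",
      f"precision={precision_score(y_test, pred, zero_division=0):.3f}",
      f"recall={recall_score(y_test, pred, zero_division=0):.3f}")
print(cm)

# Coefficients on standardized features, ordered by absolute size.
coefs = pd.Series(model.named_steps["logisticregression"].coef_[0], index=X.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).round(3))
```

Things to check in Codex's version: that the scaler is fit only on the training split, and that AUC is computed from predicted probabilities rather than from hard 0/1 predictions.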

#### Prompt 3: Compare Models
```
Now build a Random Forest and XGBoost model on the same data.
Compare all three models (logistic, RF, XGBoost) in a table showing:
AUC, F1, Precision, Recall, Accuracy.
Also plot all three ROC curves on the same chart.
```
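
A comparison like this usually comes down to a loop over models that fills one table. The sketch below shows that pattern with scikit-learn only. It is an illustration on synthetic data, and XGBoost is deliberately left out so the example has no extra dependency; an `XGBClassifier` could be added to the `models` dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; replace with your uploaded file).
rng = np.random.default_rng(7)
n = 1000
X = pd.DataFrame({
    "GPA_1": rng.normal(2.9, 0.7, n),
    "DFW_RATE_1": rng.uniform(0, 1, n),
})
logit = -1.0 + 3.0 * (X["DFW_RATE_1"] - 0.5) - 1.0 * (X["GPA_1"] - 2.9)
y = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

models = {
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(C=0.1, max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, max_depth=12,
                                            random_state=7),
}

# Every model is trained and scored on the same split for a fair comparison.
rows = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = model.predict(X_te)
    rows[name] = {
        "AUC": roc_auc_score(y_te, proba),
        "F1": f1_score(y_te, pred, zero_division=0),
        "Precision": precision_score(y_te, pred, zero_division=0),
        "Recall": recall_score(y_te, pred, zero_division=0),
        "Accuracy": accuracy_score(y_te, pred),
    }

comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(3))
```

When reviewing Codex's output, confirm that all models were scored on the same held-out rows; otherwise the table is not a like-for-like comparison.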

## 3. Best Practices

### DO ✓

- **Be specific** about what metrics, charts, and outputs you want
- **Upload your actual data** so Codex can run real code
- **Iterate** — your first prompt rarely produces the perfect result
- **Validate** — check that the code does what you think it does
- **Use domain knowledge** — tell Codex about your institutional context

### DON'T ✗

- **Don't trust blindly** — always review the generated code
- **Don't skip validation** — check metrics against your expectations
- **Don't use vague prompts** — "analyze my data" is too broad
- **Don't ignore errors** — ask Codex to explain and fix them
- **Don't share sensitive data** — check your institution's data governance policies
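
One concrete way to check metrics against your expectations is to compare a model's accuracy with the accuracy of always predicting the majority class. The sketch below is a minimal illustration in plain Python. The function names, the 85/15 class split, and the 0.02 margin are all hypothetical choices, not course requirements.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class (0/1 labels)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def beats_baseline(model_accuracy, labels, margin=0.02):
    """True if the model clears the majority baseline by at least `margin`."""
    return model_accuracy >= majority_baseline_accuracy(labels) + margin


# Hypothetical cohort: 85 retained students (0) and 15 departed students (1).
labels = [0] * 85 + [1] * 15
print(majority_baseline_accuracy(labels))  # 0.85
print(beats_baseline(0.86, labels))        # False
print(beats_baseline(0.90, labels))        # True
```

With imbalanced outcomes such as departure, a model reporting 86% accuracy may be doing little better than guessing "retained" for everyone, which is why AUC, precision, and recall deserve a look as well.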

### The 80/20 Rule of Vibecoding

> Codex gets you 80% of the way there in 20% of the time.
> Your job is the remaining 20% — validation, interpretation, and institutional context.

## 4. Mapping Course Content to Codex Prompts

Here's how each module in this course maps to a Codex prompt:

| Module | Task | Codex Prompt Starter |
|:-------|:-----|:--------------------|
| 1. Regularization | L1/L2 Logistic | "Build a logistic regression with an L2 penalty and C=0.1..." |
| 3. Tree-Based | Random Forest | "Build a Random Forest with 200 trees, max_depth=12..." |
| 3. Tree-Based | XGBoost | "Build an XGBoost classifier with learning_rate=0.1..." |
| 4. Comparison | Head-to-head | "Compare logistic, RF, and XGBoost on AUC, F1..." |
| 5. Unsupervised | K-Means + PCA | "Cluster students using K-Means with k=4, show PCA..." |
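
As an example of what one of these starters might expand into, here is a minimal sketch for the K-Means + PCA row. It assumes scikit-learn is available and uses random synthetic features in place of student data, so the clusters themselves carry no meaning; only the sequence of steps is the point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for five numeric student features (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))

# Scale first: K-Means is distance-based, so feature scale affects the result.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1)
labels = kmeans.fit_predict(X_scaled)

# Project to two principal components for a 2-D view of the clusters.
coords = PCA(n_components=2).fit_transform(X_scaled)

for k in range(4):
    members = labels == k
    print(k, int(members.sum()), coords[members].mean(axis=0).round(2))
```

The per-cluster sizes and 2-D centroids printed at the end are the numbers a scatter plot of the PCA projection would visualize.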

### Practice Exercise

Open Codex and try this prompt with the course's `training.csv`:

```
I've uploaded training.csv with student records. The target variable is
SEM_3_STATUS where 'E' = enrolled (not departed) and anything else = departed.

Please:
1. Create a binary DEPARTED column (1 if not 'E', 0 otherwise)
2. Use these features: HS_GPA, UNITS_ATTEMPTED_1, UNITS_COMPLETED_1,
   DFW_UNITS_1, GPA_1, DFW_RATE_1
3. Build a logistic regression with L2 regularization (C=0.1) on standardized features
4. Show AUC, F1, precision, recall
5. Plot the ROC curve
6. List the top coefficients
```
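
Step 1 is the part most worth verifying by hand, because it defines the outcome every later metric depends on. A minimal pandas sketch of that step follows. The five example rows and the status codes other than `'E'` are hypothetical, chosen only to show how non-enrolled and missing values are treated.

```python
import pandas as pd

# Hypothetical rows; with the real file you would use pd.read_csv("training.csv").
df = pd.DataFrame({"SEM_3_STATUS": ["E", "W", "E", None, "D"]})

# 1 if the status is anything other than 'E', 0 if enrolled.
# Note that a missing status also compares as "not 'E'" and is coded as departed.
df["DEPARTED"] = (df["SEM_3_STATUS"] != "E").astype(int)

print(df)
print(df["DEPARTED"].value_counts().sort_index())
```

If a missing `SEM_3_STATUS` should not count as a departure at your institution, say so explicitly in the prompt and check that Codex's code handles those rows the way you intend.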

## Summary

### Key Takeaways

1. **Codex** is an AI coding agent that writes and runs Python code from natural language
2. **Vibecoding** = describe → review → iterate → validate
3. **Good prompts** are specific about context, method, and desired output
4. **You are still the analyst** — Codex is the coding accelerator, you bring domain expertise
5. **Every analysis in this course** can be replicated and extended using Codex

### Next Module

**Proceed to:** Module 3 — Tree-Based Models (Decision Trees, Random Forest, XGBoost)