# Program 2: School Success

**Due date: Friday, March 7, 2024 at 12:00pm. No late submissions allowed.**

**Learning Objectives**
- Build competency with pandas for storing and cleaning data
- Feature engineering, including categorical (one-hot) encoding
- Understand train-test splits in practice
- Train 1-dimensional linear models
- Validate models and choose the best one among multiple candidates

**Available Libraries**: Python 3.9+ and pandas.

## Grading Rubric
```
Code linting -- 10 points
Section 1: Data ingestion and feature engineering -- 40 points
Section 2: Training a linear regressor -- 30 points
Section 3: Model evaluation -- 20 points
```

## Dataset
We'll use NYC school data from:
- [2021 DOE High School Directory](https://data.cityofnewyork.us/Education/2021-DOE-High-School-Directory/8b6c-7uty)
- [2020 DOE High School Directory](https://data.cityofnewyork.us/Education/2020-DOE-High-School-Directory/23z9-6uk9)
- [2019 DOE High School Directory](https://data.cityofnewyork.us/Education/2019-DOE-High-School-Directory/uq7m-95z8)

Sample datasets included in repo:
- `2021_DOE_High_School_Directory_SI.csv`: Staten Island schools
- `2020_DOE_High_School_Directory_late_start.csv`: Schools with 9am start times

## Section 0: Setup
Function templates:

```python
import pandas as pd

def import_data(file_name):
    """Read and filter relevant data from CSV file."""
    pass

def impute_numeric_cols_median(df):
    """Impute missing numeric values with column median."""
    pass

def compute_item_count(df, col):
    """Count comma-separated items in specified column."""
    pass

def encode_categorical_col(x):
    """One-hot encode categorical column."""
    pass

def split_test_train(df, x_col_names, y_col_name, frac=0.25, random_state=922):
    """Split data into train/test subsets."""
    pass
```
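
The operations named in the templates can be sketched on a toy frame. This is a minimal illustration, not the graded solution: the data values are hypothetical, the exact return types expected by the autograder are not stated here, and treating a missing list as zero items is an assumption.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset's naming.
df = pd.DataFrame({
    "borocode": ["R", "R", "K", "M"],
    "attendance_rate": [0.90, None, 0.80, 0.85],
    "language_classes": ["French, Spanish", None, "Spanish", "Italian, Latin, Spanish"],
})

# Median imputation: fill gaps in each numeric column with that column's median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Item count: number of comma-separated entries (assumption: missing means 0).
df["language_count"] = df["language_classes"].fillna("").apply(
    lambda s: len([part for part in s.split(",") if part.strip()])
)

# One-hot encoding: one indicator column per distinct category value.
boro_dummies = pd.get_dummies(df["borocode"], prefix="boro")

# Train/test split: sample the test rows, then drop them to get the training rows.
test = df.sample(frac=0.25, random_state=922)
train = df.drop(test.index)

print(df[["attendance_rate", "language_count"]])
print(boro_dummies.columns.tolist())
print(len(train), len(test))
```

Sampling the test rows and dropping them by index keeps the two subsets disjoint, and fixing `random_state` makes the split reproducible.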

## Section 1: Data Ingestion & Feature Engineering
### Example Usage
Load Staten Island data:

```python
# file_name = '2021_DOE_High_School_Directory_SI.csv'
# si_df = import_data(file_name)
# print(f'There are {len(si_df.columns)} columns:')
# print(si_df.columns)
# print('The dataframe is:')
# print(si_df)
```

**Expected Output**:
```
There are 10 columns:
Index(['dbn', 'school_name', 'borocode', 'NTA', 'graduation_rate', 'pct_stu_safe',
       'attendance_rate', 'college_career_rate', 'language_classes',
       'advancedplacement_courses'],
      dtype='object')
The dataframe is:
      dbn                                        school_name  ...
1  31R047          CSI High School for International Studies  ...
2  31R064        Gaynor McCown Expeditionary Learning School  ...
[10 rows x 10 columns]
```
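
The expected output lists ten columns, so part of `import_data` is presumably column selection. The sketch below shows only that step, reading from an in-memory CSV instead of the real file. The extra `phone_number` column and the placeholder row are hypothetical, and any row filtering the assignment requires is not shown because it is not specified here.

```python
import io

import pandas as pd

# The ten columns shown in the expected output.
COLS = [
    "dbn", "school_name", "borocode", "NTA", "graduation_rate",
    "pct_stu_safe", "attendance_rate", "college_career_rate",
    "language_classes", "advancedplacement_courses",
]

# Build a one-row stand-in CSV with a hypothetical extra column.
header = COLS + ["phone_number"]
row = ["x"] * len(header)
raw_csv = ",".join(header) + "\n" + ",".join(row) + "\n"

# usecols limits what is parsed; indexing with COLS fixes the column order.
df = pd.read_csv(io.StringIO(raw_csv), usecols=COLS)[COLS]
print(f"There are {len(df.columns)} columns:")
print(df.columns)
```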

## Section 2: Training a Linear Regressor
Function templates:

```python
def compute_lin_reg(x, y):
    """Calculate linear regression coefficients."""
    pass

def predict(x, theta_0, theta_1):
    """Make predictions using linear model."""
    pass

def mse_loss(y_actual, y_estimate):
    """Calculate Mean Squared Error."""
    pass

def rmse_loss(y_actual, y_estimate):
    """Calculate Root Mean Squared Error."""
    pass

def compute_loss(y_actual, y_estimate, loss_fnc=mse_loss):
    """Compute specified loss function."""
    pass
```
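
One way to fill in these templates is the closed-form least-squares fit for a single feature. The sketch assumes that `theta_0` is the intercept and `theta_1` is the slope, and that inputs are pandas Series. Neither assumption is confirmed by the templates, and the toy data is made up so that the fit is exact.

```python
import pandas as pd


def compute_lin_reg(x, y):
    """Least-squares fit of y = theta_0 + theta_1 * x for one feature."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    theta_1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # Intercept: makes the line pass through the point of means.
    theta_0 = y_mean - theta_1 * x_mean
    return theta_0, theta_1


def predict(x, theta_0, theta_1):
    """Apply the fitted line to each value of x."""
    return theta_0 + theta_1 * x


def mse_loss(y_actual, y_estimate):
    """Mean of the squared residuals."""
    return ((y_actual - y_estimate) ** 2).mean()


def rmse_loss(y_actual, y_estimate):
    """Square root of the MSE, in the same units as y."""
    return mse_loss(y_actual, y_estimate) ** 0.5


def compute_loss(y_actual, y_estimate, loss_fnc=mse_loss):
    """Delegate to whichever loss function is passed in."""
    return loss_fnc(y_actual, y_estimate)


# Toy data lying exactly on y = 1 + 2x.
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([3.0, 5.0, 7.0, 9.0])
theta_0, theta_1 = compute_lin_reg(x, y)
loss = compute_loss(y, predict(x, theta_0, theta_1), loss_fnc=rmse_loss)
print(theta_0, theta_1, loss)
```

The slope formula divides by the spread of `x`, so a column with a single repeated value would need separate handling.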

## Section 3: Model Evaluation
### Example Model Comparison
```
                     train_loss  test_loss
language_count         0.015100   0.010848
ap_count               0.015115   0.009148
pct_stu_safe           0.013934   0.005853
attendance_rate        0.008011   0.006838
college_career_rate    0.005345   0.001625
```
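
With the losses collected in a DataFrame like the one above, choosing among candidates can be a one-line lookup. The sketch below rebuilds the example table and picks the feature with the lowest test loss. Using test loss as the selection criterion is an assumption about what the assignment intends.

```python
import pandas as pd

# Values copied from the example comparison table.
results = pd.DataFrame(
    {
        "train_loss": [0.015100, 0.015115, 0.013934, 0.008011, 0.005345],
        "test_loss": [0.010848, 0.009148, 0.005853, 0.006838, 0.001625],
    },
    index=[
        "language_count",
        "ap_count",
        "pct_stu_safe",
        "attendance_rate",
        "college_career_rate",
    ],
)

# idxmin returns the row label (feature name) with the smallest test loss.
best_feature = results["test_loss"].idxmin()
# Sorting gives the full ranking from best to worst on held-out data.
ranked = results.sort_values("test_loss")

print(best_feature)
print(ranked)
```

For this table, `best_feature` is `college_career_rate`, which has the lowest loss on both the training and test subsets.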

## Visualization Example
```python
import matplotlib.pyplot as plt


def graph_data(df, col, coeff):
    """Plot actual and predicted graduation rates against one feature column."""
    # coeff[col] holds the (theta_0, theta_1) pair passed to predict()
    plt.scatter(df[col], df['graduation_rate'], label='Actual')
    predict_grad = predict(df[col], coeff[col][0], coeff[col][1])
    plt.scatter(df[col], predict_grad, label='Predicted')
    plt.title(f'{col} vs graduation_rate')
    plt.ylabel('graduation_rate')
    plt.xlabel(col)
    plt.legend()
    plt.show()

# graph_data(late_df, 'college_career_rate', coeff)
```
The expected output is a scatter plot of actual and predicted graduation rates against the chosen column.

## Submission Notes
- Submit via Gradescope as `.ipynb` or `.py`
- All code outside functions must be commented out or wrapped in `main()`
- Test locally with sample datasets before submitting
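
The `main()` requirement can be met with the standard entry-point guard, sketched below. The body of `main()` is a hypothetical placeholder. The point is that nothing runs when the autograder imports the file.

```python
def main():
    """Hypothetical driver: exploratory calls go here, not at module level."""
    print("Running local checks on the sample datasets...")
    # e.g. si_df = import_data('2021_DOE_High_School_Directory_SI.csv')
    return 0


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```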