# Intro

This workbook converts the `schoolList.json` file produced by the web scraping process into a Prisma seed file, which is used to populate the database with the school data.

Part of this conversion is **manual** because the schools listed on the SFUSD website are not a perfect match for the data submitted to the CA Department of Education. The workbook attempts to match records on a combination of zipcode and school name, but the algorithm is not perfect. For low-quality matches, the `casid` (short for CA School ID) can be confirmed by looking at the SARC document for the school (located in the Google Drive folders).
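
To give a feel for the matching step, here is a minimal sketch of name matching restricted to a zipcode, using `difflib.SequenceMatcher` from the standard library. It is an illustration only: the real logic lives in `matcher.school_matcher` and may work quite differently. The `Candidate` type, the `normalize` and `best_match` helpers, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical stand-in for a CA Department of Education record.
    school_code: str
    school_name: str
    zip_code: str


def normalize(name: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not lower the score.
    return " ".join(name.lower().split())


def best_match(
    name: str, zip_code: str, candidates: list[Candidate]
) -> Optional[tuple[Candidate, float]]:
    # Only compare against candidates in the same zipcode, then keep the highest ratio.
    same_zip = [c for c in candidates if c.zip_code == zip_code]
    if not same_zip:
        return None
    scored = [
        (c, SequenceMatcher(None, normalize(name), normalize(c.school_name)).ratio())
        for c in same_zip
    ]
    return max(scored, key=lambda pair: pair[1])


# Made-up records for illustration; these are not real school codes.
candidates = [
    Candidate("0000001", "Example Elementary", "94110"),
    Candidate("0000002", "Sample Middle", "94110"),
    Candidate("0000003", "Example Elementary", "94112"),
]

result = best_match("Example Elementary School", "94110", candidates)
no_result = best_match("Example Elementary School", "99999", candidates)
```

Filtering by zipcode first keeps the comparison cheap, but it also means that a wrong zipcode on either side produces a poor match or none at all. That is one reason a manual review is still needed.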

This workbook also removes the schools that are not in the California Department of Education databases. These schools are typically preschools or non-degree-granting institutions.

In [1]:
import json
import csv
from matcher.school_matcher import School, SchoolMatcher

matcher = SchoolMatcher.from_db()

# Read the JSON file
with open("schoolList.json", "r", encoding="utf-8") as file:
    data = json.load(file)

csv_data = []
for school in data:
    zip_code = f"no zip for {school['schoolLabel']}"
    if "geolocations" in school:
        location = school["geolocations"][0]["addressDetails"]
        if "PostalCode" in location:
            zip_code = location["PostalCode"].split("-")[0]
            sfusd_name = school["schoolLabel"]
            sfusd_hash = School.generate_school_hash(sfusd_name, zip_code)

            best = matcher.best_match(sfusd_name, zip_code)
            match_school_id = best.match.school_code if best else ""
            match_school_name = best.match.school_name if best else ""
            match_pct: int = int(best.ratio * 100) if best else 0

            csv_data.append(
                [
                    sfusd_hash,
                    sfusd_name,
                    match_school_name,
                    match_school_id,
                    zip_code,
                    match_pct,
                ]
            )
            school["school_hash"] = sfusd_hash
            school["casid"] = match_school_id
        else:
            print(f"no zipcode for {school['schoolLabel']}")
    else:
        print(f"no geolocations for {school['schoolLabel']}")

# Write the CSV, sorted by match_pct so the weakest matches come first
header = ["sfusd_hash", "sfusd_name", "matched_school_name", "matched_school_id", "sfusd_zip_code", "match_pct"]
sorted_csv = sorted(csv_data, key=lambda row: row[5])

with open("matches.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(header)
    writer.writerows(sorted_csv)

### Fix the bad school_codes

Unfortunately, this is a manual part of the workflow. The school zipcodes are not always accurate, which leads to bad matches, and the matching algorithm itself is not perfect. Where a match looks incorrect, the `casid` can be confirmed by looking at the SARC document for the school (located in the Google Drive folders).

The suggested workflow is to open the `matches.csv` file produced above in Excel or another spreadsheet package and review the matches. Be careful not to let Excel remove the zero-padding from the identifiers.
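
If a spreadsheet does strip the leading zeros, they can be restored afterwards. The sketch below assumes that a `casid` is always seven digits, which matches the examples in this notebook but should be checked against the CA Department of Education data. The `restore_padding` helper and the sample rows are hypothetical.

```python
import csv
import io

CASID_WIDTH = 7  # Assumption: casids are seven digits, as in this notebook's examples.


def restore_padding(casid: str, width: int = CASID_WIDTH) -> str:
    # Re-add leading zeros that a spreadsheet stripped.
    # Blank and non-numeric values are returned unchanged (apart from trimming).
    value = casid.strip()
    return value.zfill(width) if value.isdigit() else value


# Hypothetical rows as they might look after a round trip through a spreadsheet.
exported = io.StringIO("hash_a,102103,DELETE\nhash_b,6040752,KEEP\nhash_c,,DELETE\n")

# csv.reader yields every field as a string, so no padding is lost while reading.
rows = [[h, restore_padding(c), i] for h, c, i in csv.reader(exported)]
```

Importing the identifier columns as text in the spreadsheet package avoids the problem in the first place.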

After reviewing the matches, you need to create another CSV file with the correct casids. The file should have the following columns with no header row:

- `school_hash`: The hash of the school name and zipcode (from the `matches.csv` file)
- `casid`: The correct casid for the school
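- `instruction`: What to do with the record: `KEEP`, `DELETE`, or `UPDATE`. The code below only treats `DELETE` specially; for any other value it applies the `casid` from this file.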

For example:

```csv
797c2568e4c71628d72d86379d7282ccc8a65a5aa4f13ea28958ae619f1736e8,6089569,UPDATE
98f4714e5aade9a9465c99ccb07582aa0309787319074631df4b81f73a499242,6040752,DELETE
e372d38bbc0d25217f22c4568fe6688df8f6ce9a296799d196b727469846a47b,6040752,KEEP
19c4204704adb9868eb23780ee585451e596feb8346812a18382a2a108a991be,0102103,DELETE
f948b5c086e43e7f885e93a8bd9d3cb8bb7f50388fd0751ab42e10ee76febfb0,3831765,KEEP
e5ad922242f5110e8656a050a9affc0e595fcc42cbbbc6f7a5bc2bb7f6f30edb,6102479,KEEP
```

This file should be saved as `actions.csv` in the same directory as this notebook. It is used in the next step to apply the corrections to the data loaded from `schoolList.json`.
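
Because `actions.csv` is edited by hand, it can be worth checking it before running the next cell. The sketch below is one possible check, not part of the original workflow. It assumes that hashes are 64 lowercase hex characters (as in the example above) and that the only valid instructions are the three listed. The `validate_actions` helper and the sample rows are hypothetical.

```python
import csv
import io
import re

# Assumptions: hashes are 64 lowercase hex characters and the allowed
# instructions are the three listed in this notebook.
HASH_RE = re.compile(r"^[0-9a-f]{64}$")
ALLOWED = {"KEEP", "DELETE", "UPDATE"}


def validate_actions(rows):
    # Return a list of (row_number, message) tuples, one per problem found.
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        if len(row) != 3:
            problems.append((number, f"expected 3 columns, found {len(row)}"))
            continue
        school_hash, casid, instruction = (cell.strip() for cell in row)
        if not HASH_RE.match(school_hash):
            problems.append((number, "school_hash is not 64 hex characters"))
        if instruction not in ALLOWED:
            problems.append((number, f"unknown instruction {instruction!r}"))
        if instruction != "DELETE" and not casid:
            problems.append((number, "casid is empty"))
        if school_hash in seen:
            problems.append((number, "duplicate school_hash"))
        seen.add(school_hash)
    return problems


# Hypothetical sample rows; for the real file, pass a csv.reader over actions.csv.
sample = io.StringIO(
    "a" * 64 + ",6040752,KEEP\n"
    + "b" * 64 + ",,UPDATE\n"
    + "not-a-hash,6040752,REMOVE\n"
)
problems = validate_actions(csv.reader(sample))
```

An empty `problems` list means the file passed these checks; it does not mean the `casid` values themselves are correct.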

In [2]:
# This section reads the actions.csv file and applies the changes to the data loaded from
# schoolList.json: schools marked DELETE are removed, and the casid of every other school
# is set from actions.csv. The result is then written to schoolList_corrected.json.

with open("actions.csv", mode="r", newline="", encoding="utf-8") as file:
    # Skip blank lines so a stray empty row does not cause an IndexError below
    actions = [row for row in csv.reader(file) if row]

# Hashes of the schools marked for deletion
del_hashes = {action[0] for action in actions if action[2] == "DELETE"}
# Schools skipped in the first cell have no "school_hash", so use .get() to avoid a KeyError
filtered = [school for school in data if school.get("school_hash") not in del_hashes]

# Create a dictionary of school_hash to casid
casid_lookup = {action[0]: action[1] for action in actions if action[2] != "DELETE"}

for school in filtered:
    school_hash = school.get("school_hash")
    if school_hash not in casid_lookup:
        print(f"no action for {school['schoolLabel']}")
    else:
        school["casid"] = casid_lookup[school_hash]

# Write the JSON
with open("schoolList_corrected.json", "w", encoding="utf-8") as file:
    json.dump(filtered, file, indent=2)