In [None]:
from sqlalchemy import create_engine

# Replace these placeholders with your own credentials
yourusername = 'yourandrewid'
yourdatabase = 'yourandrewid'
yourpassword = "yourpassword"
db_url = f"postgresql://{yourusername}:{yourpassword}@debprodserver.postgres.database.azure.com:5432/{yourdatabase}"
engine = create_engine(db_url)

In [None]:
%load_ext sql
%sql engine
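
If the password contains characters such as `@` or `/`, placing it directly in the connection URL can yield a string that is parsed incorrectly. The following is a minimal sketch of one way to guard against that using only the standard library. The helper name `build_db_url` and the sample password are hypothetical and not part of the project setup.

```python
from urllib.parse import quote_plus


def build_db_url(username, password, database,
                 host="debprodserver.postgres.database.azure.com", port=5432):
    """Assemble a PostgreSQL URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
example_url = build_db_url("yourandrewid", "p@ss/word", "yourandrewid")
print(example_url)
```

The resulting string can be passed to `create_engine` in place of a hand-concatenated URL.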

### Part 1: Designing the database

In Part 1 of the project, your team must design a table schema for the data. By “table schema” I mean the CREATE TABLE statements necessary to create database tables that fit the data. You should follow the design principles in Section 4.1 to build a normalized structure for the database that minimizes redundant information. Include primary keys, foreign keys, column types, and any appropriate constraints. It is up to you to decide how many tables you need, their names, and their contents.

Write your CREATE TABLE statements in a notebook. Test them out on Azure to ensure they work correctly. You do not need to load any real data into the database yet.

In the notebook, write comments explaining the following: What are the basic entities in your schema? (In Example 4.1, entities were things like songs, record labels, and albums, that each had their own database table.) How did you choose them and what did you do to ensure there is not redundant information in your database?
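
As a rough illustration of what "entities" and "no redundant information" can mean here, the sketch below keeps facts that rarely change in one table and year-specific measurements in another, linked by a foreign key. It uses Python's built-in `sqlite3` as a local stand-in for PostgreSQL (an assumption made only so the example runs without a server). The table names `institution` and `annual_stat` and all sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE institution (
    unitid INTEGER PRIMARY KEY,
    instnm TEXT NOT NULL            -- stored once, not repeated every year
);
CREATE TABLE annual_stat (
    unitid   INTEGER NOT NULL REFERENCES institution(unitid),
    year     INTEGER NOT NULL,
    adm_rate REAL CHECK (adm_rate BETWEEN 0 AND 1),
    PRIMARY KEY (unitid, year)      -- one row per institution per year
);
""")

# Made-up sample rows
conn.execute("INSERT INTO institution VALUES (1, 'Example University')")
conn.execute("INSERT INTO annual_stat VALUES (1, 2021, 0.42)")


def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A yearly row for an institution that does not exist
orphan_rejected = rejected("INSERT INTO annual_stat VALUES (999, 2021, 0.5)")
# An admission rate outside the range 0 to 1
bad_rate_rejected = rejected("INSERT INTO annual_stat VALUES (1, 2022, 1.5)")
# A second row for the same institution and year
duplicate_rejected = rejected("INSERT INTO annual_stat VALUES (1, 2021, 0.3)")
```

Running statements that should fail is a quick way to confirm that the constraints do what you intended before any real data is loaded.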

The College Scorecard data files cover dozens of variables per institution; we won’t be interested in all of them here. The Data Dictionary lists all variables, their column names (VARIABLE NAME), a human-readable description, and the meaning of each value. We care about the following columns:

* UNITID: the institution ID
* ACCREDAGENCY: the institutional accreditor, to facilitate analysis by accreditor
* PREDDEG: the predominant undergraduate award, i.e. the type of award that the institution primarily confers
* HIGHDEG: the highest award level conferred at the institution
* CONTROL: whether the institution's governance structure is public, private nonprofit, or private for-profit
* REGION: the geographic region in which the institution is located
* CCBASIC: the basic Carnegie Classification (TODO: this is missing before 2022; prefer IPEDS version?)
* ADM_RATE: the admission rate at each campus
* TUITIONFEE_IN: estimated tuition and required fees for in-state students
* TUITIONFEE_OUT: estimated tuition and required fees for out-of-state students
* TUITIONFEE_PROG: estimated tuition and required fees for program-year institutions
* TUITFTE: the net tuition revenue per full-time equivalent (FTE) student, computed as tuition revenue minus discounts and allowances, divided by the number of FTE undergraduate and graduate students
* AVGFACSAL: the average faculty salary per month, computed by dividing the total salary outlays by the number of months worked for all full-time, nonmedical instructional staff
* CDR2 and CDR3: the two-year and three-year cohort default rates; institutions with high default rates may lose access to federal financial aid

Scroll through the list (there are nearly 3,500 variables!) and pick some additional variables that could be interesting to analyze.
The data is updated annually. Your database should be able to store each year’s data for each university, so you can quickly look up statistics for a university in any particular year.
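
One way to support per-year lookups is to key the yearly table on the pair (institution, year), which typically also gives the database an index for that lookup. The sketch below again uses `sqlite3` as a local stand-in for PostgreSQL. The table name `annual_stat`, the helper `tuition_for`, and the tuition figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annual_stat (
        unitid        INTEGER NOT NULL,
        year          INTEGER NOT NULL,
        tuitionfee_in INTEGER,
        PRIMARY KEY (unitid, year)
    )
""")

# Made-up values for two institutions across two years
conn.executemany(
    "INSERT INTO annual_stat VALUES (?, ?, ?)",
    [(1, 2020, 10000), (1, 2021, 10500), (2, 2021, 8000)],
)


def tuition_for(unitid, year):
    """Return the stored tuition for one institution in one year, or None."""
    row = conn.execute(
        "SELECT tuitionfee_in FROM annual_stat WHERE unitid = ? AND year = ?",
        (unitid, year),
    ).fetchone()
    return row[0] if row else None


print(tuition_for(1, 2021))
print(tuition_for(2, 2020))  # no row was stored for this year
```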

You will also be using supplementary data from the Integrated Postsecondary Education Data System (IPEDS); specifically, the directory information files. These provide annual directory information and other statistics based on surveys of institutions. Again, since this is updated annually, you must be able to track each version of the data and the dates it applies to.
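
Tracking "each version of the data and the dates it applies to" could be done in several ways. One option, sketched below, stores a validity interval with each version of a directory record. This is only an illustration: `sqlite3` stands in for PostgreSQL, the names `directory_version`, `valid_from`, `valid_to`, and `name_on` are hypothetical, and the dates are made up rather than taken from the IPEDS release schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE directory_version (
        unitid     INTEGER NOT NULL,
        valid_from TEXT NOT NULL,   -- ISO date the version takes effect
        valid_to   TEXT,            -- ISO date it is superseded; NULL if current
        instnm     TEXT NOT NULL,
        PRIMARY KEY (unitid, valid_from)
    )
""")

# Two made-up versions of the same institution's directory entry
conn.executemany(
    "INSERT INTO directory_version VALUES (?, ?, ?, ?)",
    [
        (1, "2020-07-01", "2021-07-01", "Example College"),
        (1, "2021-07-01", None, "Example University"),
    ],
)


def name_on(unitid, day):
    """Return the institution name in effect on the ISO date `day`, or None."""
    row = conn.execute(
        """
        SELECT instnm FROM directory_version
        WHERE unitid = ? AND valid_from <= ?
          AND (valid_to IS NULL OR ? < valid_to)
        """,
        (unitid, day, day),
    ).fetchone()
    return row[0] if row else None
```

ISO-formatted date strings sort in chronological order, which is why plain string comparison works in this sketch; in PostgreSQL a `DATE` column would be the more natural choice.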

Links to Data Dictionary:
* https://collegescorecard.ed.gov/assets/InstitutionDataDocumentation.pdf
* https://collegescorecard.ed.gov/assets/FieldOfStudyDataDocumentation.pdf



From the IPEDS data, obtain:

* All information about the institution’s name, location, address, and similar
* All Carnegie Classification 2021 variables
* The Census identifiers that apply to it: Core Based Statistical Area (CBSA) and its type, the Combined Statistical Area (CSA), and the county FIPS code
* Latitude and longitude of the institution.


In [None]:
%%sql
-- Institution Table
CREATE TABLE Institutions(
    UNITID INTEGER PRIMARY KEY, -- ID comes from the source data, so it is not auto-generated
    INSTNM VARCHAR(255) UNIQUE NOT NULL,
    PREDDEG INTEGER NOT NULL,
    HIGHDEG INTEGER NOT NULL,
    CONTROL INTEGER NOT NULL,
    REGION INTEGER NOT NULL,
    CCBASIC INTEGER NOT NULL,
    ST_FIPS INTEGER NOT NULL,
    ADDR VARCHAR(255) NOT NULL,
    CITY VARCHAR(100) NOT NULL,
    STABBR TEXT NOT NULL,
    ZIP VARCHAR(10) NOT NULL,
    LATITUDE FLOAT CHECK(LATITUDE BETWEEN -90 AND 90),
    LONGITUDE FLOAT CHECK(LONGITUDE BETWEEN -180 AND 180)
);

-- Annual Records Table
CREATE TABLE AnnualRecords(
    RECORDID SERIAL PRIMARY KEY,
    -- Deleting an institution also deletes its annual records
    UNITID INTEGER REFERENCES Institutions(UNITID) ON DELETE CASCADE,
    YEAR INTEGER CHECK (YEAR <= EXTRACT(YEAR FROM CURRENT_DATE)),
    ADM_RATE FLOAT CHECK(ADM_RATE BETWEEN 0 AND 1),
    TUITIONFEE_IN INTEGER NOT NULL,
    TUITIONFEE_OUT INTEGER NOT NULL,
    TUITIONFEE_PROG INTEGER NOT NULL,
    TUITFTE INTEGER NOT NULL,
    AVGFACSAL INTEGER CHECK(AVGFACSAL > 0),
    C100_4 FLOAT CHECK(C100_4 BETWEEN 0 AND 1),
    C100_L4 FLOAT CHECK(C100_L4 BETWEEN 0 AND 1),
    UNIQUE(UNITID, YEAR) -- at most one record per institution per year
);

-- Demographic Table
CREATE TABLE StudentDemographic(
    DEMOID SERIAL PRIMARY KEY,
    INSTID INTEGER REFERENCES Institutions(UNITID) ON DELETE CASCADE,
    YEAR INTEGER CHECK (YEAR <= EXTRACT(YEAR FROM CURRENT_DATE)),
    UGDS INTEGER CHECK(UGDS >= 0),
    UGDS_MEN FLOAT CHECK(UGDS_MEN BETWEEN 0 AND 1),
    UGDS_WOMEN FLOAT CHECK(UGDS_WOMEN BETWEEN 0 AND 1)
);

-- Financial Table -- FSA
CREATE TABLE Financial(
    FINANCIALID SERIAL PRIMARY KEY,
    INSTID INTEGER REFERENCES Institutions(UNITID) ON DELETE CASCADE,
    YEAR INTEGER CHECK (YEAR <= EXTRACT(YEAR FROM CURRENT_DATE)),
    -- Default rates are proportions, and a rate of exactly 0 is valid
    CDR2 FLOAT CHECK(CDR2 BETWEEN 0 AND 1),
    CDR3 FLOAT CHECK(CDR3 BETWEEN 0 AND 1)
);

-- PostGraduation Table
CREATE TABLE PostGraduation(
    PGID SERIAL PRIMARY KEY,
    INSTID INTEGER REFERENCES Institutions(UNITID) ON DELETE CASCADE,
    YEAR INTEGER CHECK (YEAR <= EXTRACT(YEAR FROM CURRENT_DATE)),
    COUNT_NWNE_3YR INTEGER CHECK(COUNT_NWNE_3YR >= 0),
    -- The original listed COUNT_NWNE_3YR twice; COUNT_WNE_3YR is assumed to be the intended second column
    COUNT_WNE_3YR INTEGER CHECK(COUNT_WNE_3YR >= 0),
    CNTOVER150_3YR INTEGER CHECK(CNTOVER150_3YR >= 0)
);