Data Source:
"Common Core of Data School District Finance Survey (F-33), FY 2020." National Center for Education Statistics, U.S. Department of Education. 
Accessed from: 
- [NCES Website](https://nces.ed.gov/programs/edge/Geographic/SchoolLocations)
- [Zip File Download from NCES](https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_PUBLICSCH_1920.zip)


# Local Education Agency Finance Survey – School District Data 2019 – 2020  
Local Education Agency will be abbreviated as LEA

In [271]:
# If you are running this notebook on your local machine and do not have the necessary packages installed,
# uncomment the packages you need and run this cell first.
# Once they are installed, comment the lines out again and run the remainder of the notebook.

#!pip install --upgrade pip
#!pip install pandas
#!pip install numpy
#!pip install seaborn
#!pip install matplotlib
#!pip install sqlalchemy
#!pip install pandas sqlalchemy psycopg2-binary
#!pip install scikit-learn

### Import Packages

In [272]:
import pandas as pd
import numpy as np
import json
from sqlalchemy import create_engine
import seaborn as sns # import is for upcoming use
import matplotlib.pyplot as plt # import is for upcoming use
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

### Importing the Dataset

When importing the dataset, follow these steps for best practices and to ensure accuracy:

1. **Locate the Dataset:**
   Ensure that the dataset file is present in the project directory. This is where the import function will look for the file.

2. **Understand the File Format:**
   Our dataset is in a TAB-delimited format. When using `pandas.read_csv` or similar functions, specify the delimiter with `delimiter='\t'` to correctly parse the file.

3. **Verify the Import:**
   After importing, it's crucial to do a quick check of the DataFrame:
   - Use `df.head()` to preview the first few rows.

```python
df = pd.read_csv('sdf20_1a.txt', delimiter='\t')

# Preview the first few rows of the DataFrame
df.head()
```

In [273]:
df = pd.read_csv('sdf20_1a.txt', delimiter= '\t')

df.head()


Unnamed: 0,LEAID,CENSUSID,FIPST,CONUM,CSA,CBSA,NAME,STNAME,STABBR,SCHLEV,...,FL_AR3,FL_AR4,FL_AR5,FL_AR6,FL_AE1,FL_AE2,FL_AE3,FL_AE4,FL_AE5,FL_AE6
0,100002,N,1,1073,142,13820,Alabama Youth Services,Alabama,AL,N,...,M,M,M,M,M,M,M,M,M,M
1,100005,01504840100000,1,1095,N,10700,Albertville City,Alabama,AL,03,...,R,R,R,R,R,R,R,R,R,R
2,100006,01504800100000,1,1095,N,10700,Marshall County,Alabama,AL,03,...,R,R,R,R,R,R,R,R,R,R
3,100007,01503740100000,1,1073,142,13820,Hoover City,Alabama,AL,03,...,R,R,R,R,R,R,R,R,R,R
4,100008,01504530100000,1,1089,290,26620,Madison City,Alabama,AL,03,...,R,R,R,R,R,R,R,R,R,R
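
Identifiers such as `LEAID` and `FIPST` are codes rather than quantities, so it can help to read them as strings so that any leading zeros survive the import. The sketch below is a minimal illustration on a two-row, in-memory sample. The sample text and the `id_columns` list are assumptions made for this example and are not part of the original notebook; the same `dtype` argument could be passed when reading `sdf20_1a.txt`.

```python
import io
import pandas as pd

# Tiny tab-delimited sample standing in for the real file (illustrative values only)
sample = "LEAID\tFIPST\tNAME\n0100005\t01\tAlbertville City\n0100006\t01\tMarshall County\n"

# Hypothetical list of identifier columns to keep as text
id_columns = ['LEAID', 'FIPST']

sample_df = pd.read_csv(io.StringIO(sample), delimiter='\t',
                        dtype={col: str for col in id_columns})

print(sample_df.shape)               # number of rows and columns read
print(sample_df['FIPST'].tolist())   # leading zeros are preserved as text
```

Checking `shape` alongside `head()` is a cheap way to confirm that the delimiter was parsed as expected.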


### Import the Column Mapping
I prepared an Excel file that lists each original column name, its new name, its expected data type in the database, the table it belongs to, and a description. This file makes it quick to rename the columns with less code, and it doubles as a data dictionary for the columns.

In [274]:
# Import Column Map
column_mapping_df = pd.read_excel('LEA Local Finance Survey – School District Data 2019 – 2020 – Column Mapping.xlsx', 
                                  sheet_name='Column Mapping')

# Remove White Spaces from Column Names
column_mapping_df['Original Name'] = column_mapping_df['Original Name'].str.strip()
column_mapping_df['New Name'] = column_mapping_df['New Name'].str.strip()

column_mapping_df.head()

Unnamed: 0,Original Name,New Name,Type,Table,Description
0,LEAID,lea_id,VARCHAR(7),entity,National Center For Education Statistics (NCES...
1,CENSUSID,census_id,VARCHAR(14),all,Census Bureau 14-Digit Government Id
2,FIPST,ansi_state_code,VARCHAR(2),entity,American National Standards Institute (ANSI) S...
3,CONUM,ansi_county_code,VARCHAR(7),entity,American National Standards Institute (ANSI) C...
4,CSA,csa,VARCHAR(3),entity,Consolidated Statistical Area


In [275]:
# Create Dictionary to Map New Column Names
column_map_dict = column_mapping_df.set_index('Original Name')['New Name'].to_dict()

# Rename Columns
df.rename(columns=column_map_dict, inplace=True)
df.columns = df.columns.str.strip()
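
Because `rename` silently leaves any column without a mapping entry unchanged, it may be worth listing the unmapped columns before renaming. The following is a minimal sketch on toy data; `toy_mapping`, `toy_df`, and the `EXTRA` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy mapping table with the same two columns used above (values are illustrative)
toy_mapping = pd.DataFrame({
    'Original Name': ['LEAID', 'NAME '],
    'New Name': ['lea_id', ' lea_name'],
})
toy_mapping['Original Name'] = toy_mapping['Original Name'].str.strip()
toy_mapping['New Name'] = toy_mapping['New Name'].str.strip()
toy_map = toy_mapping.set_index('Original Name')['New Name'].to_dict()

toy_df = pd.DataFrame({'LEAID': ['100005'], 'NAME': ['Albertville City'], 'EXTRA': [1]})

# Columns with no entry in the mapping keep their original names after rename()
unmapped = [col for col in toy_df.columns if col not in toy_map]
toy_df = toy_df.rename(columns=toy_map)

print(unmapped)               # ['EXTRA']
print(list(toy_df.columns))   # ['lea_id', 'lea_name', 'EXTRA']
```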

In [276]:
df.dtypes

lea_id                                                       object
census_id                                                    object
ansi_state_code                                               int64
ansi_county_code                                              int64
csa                                                          object
                                                              ...  
cares_act_expenditure_instructional_flag                     object
cares_act_expenditure_support_services_flag                  object
cares_act_expenditure_capital_outlay_flag                    object
cares_act_expenditure_tech_related_supplies_services_flag    object
cares_act_expenditure_tech_related_equipment_flag            object
Length: 302, dtype: object

After renaming the columns, I used `df.dtypes` to confirm that the column names were correct and to get an idea of which columns may need cleaning to achieve a certain data type.

## Exclusion of Non-Government Entities from Analysis

The Census Bureau has specific criteria to determine if a Local Education Agency (LEA) qualifies as a government entity. These criteria include the LEA's power to:

- Levy taxes
- Independently manage its own budget
- Appoint its school board members without oversight from other local government bodies

An LEA that satisfies these conditions is considered a government entity and is assigned a unique `census_id`. This identifier signals eligibility for federal, state, and local funding, which is often dependent on an LEA's tax authority and fiscal independence.

However, LEAs that do not meet these criteria are assigned an 'N' for their `census_id`. This indicates that they are not recognized as government entities by the Census Bureau and, consequently, are not typically eligible for the tax-based funding that our analysis focuses on. Therefore, these LEAs are excluded from our dataset to maintain a focus on entities eligible for such funding.

By removing rows where `census_id` is 'N', we ensure that our analysis only includes LEAs that have the potential to receive and manage federal, state, and local funding in line with our research objectives.


In [277]:
df = df[df['census_id'] != 'N']

In [278]:
df['census_id'].duplicated().any()

False

After removing the Census IDs that had the 'N' placeholder, I wanted to confirm that there were no duplicate Census IDs. This is in preparation for this being the Primary Key within the database table keeping the LEA Entity information. This will serve as a Foreign Key in subsequent tables to link records to the Entity.  
#### Expected Result : False
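
The filter and the primary-key check can also be combined into one small validation step. The sketch below uses a three-row toy DataFrame and reports how many rows were dropped; the variable names are assumptions for illustration, and `Series.is_unique` is used as an alternative to `duplicated().any()`.

```python
import pandas as pd

toy = pd.DataFrame({
    'census_id': ['N', '01504840100000', '01504800100000'],
    'lea_name': ['Alabama Youth Services', 'Albertville City', 'Marshall County'],
})

rows_before = len(toy)
toy = toy[toy['census_id'] != 'N']
rows_dropped = rows_before - len(toy)

# Series.is_unique is True only when no value repeats
key_is_unique = toy['census_id'].is_unique

print(rows_dropped)    # 1
print(key_is_unique)   # True
```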

## Column Removal for Database Normalization

As part of the data normalization process for database insertion, we target columns starting with 'total_' for removal. These columns are presumed to contain aggregate data that may not be suitable for the normalized database structure. Prior to their removal, the content of these columns is preserved by transferring it to a separate DataFrame. This precaution ensures that the aggregate data remains accessible for any future analysis or reference requirements.


In [279]:
# Collect every column whose name starts with 'total_'
total_columns = [col for col in df.columns if col.startswith('total_')]

total_columns

['total_revenue',
 'total_federal_revenue',
 'total_state_revenue',
 'total_local_revenue',
 'total_expenditures',
 'total_curr_expenditures_pri_sec_ed',
 'total_curr_expenditures_instruction',
 'total_curr_expenditures_support_services',
 'total_current_expenditures_other_prim_sec',
 'total_non_prim_sec_expenditures',
 'total_capital_outlay_expenditures',
 'total_salaries',
 'total_employee_benefits',
 'total_salaries_flag',
 'total_employee_benefits_flag']

In [280]:
column_totals = df[total_columns].copy()
column_totals

Unnamed: 0,total_revenue,total_federal_revenue,total_state_revenue,total_local_revenue,total_expenditures,total_curr_expenditures_pri_sec_ed,total_curr_expenditures_instruction,total_curr_expenditures_support_services,total_current_expenditures_other_prim_sec,total_non_prim_sec_expenditures,total_capital_outlay_expenditures,total_salaries,total_employee_benefits,total_salaries_flag,total_employee_benefits_flag
1,63333000,7605000,40121000,15607000,54630000,50454000,29888000,17241000,3325000,436000,1933000,27371000,10808000,R,R
2,66333000,9259000,42131000,14943000,65302000,61190000,34095000,23096000,3999000,704000,2796000,33442000,13337000,R,R
3,196210000,8918000,82689000,104603000,181862000,161809000,101645000,56471000,3693000,5561000,9313000,97962000,37736000,R,R
4,139137000,7055000,77294000,54788000,145036000,114802000,70337000,40309000,4156000,1000000,21842000,68070000,26019000,R,R
6,25777000,2474000,13591000,9712000,23732000,20534000,11546000,8101000,887000,585000,302000,11776000,4436000,R,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19661,61964000,2469000,3810000,55685000,60205000,56931000,35490000,20133000,1308000,1000,2724000,34681000,14692000,R,R
19662,6749000,305000,4812000,1632000,7249000,6221000,3277000,2749000,195000,52000,976000,3088000,1588000,R,R
19663,25592000,1842000,18138000,5612000,24630000,22101000,13620000,7605000,876000,0,1784000,12572000,6014000,R,R
19664,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,-2,N,N


In [281]:
df.drop(columns= total_columns, inplace= True)

In [282]:
df['year'] = df['year'].astype(str)
df['year'] = '20' + df['year']
df['year']

1        2020
2        2020
3        2020
4        2020
6        2020
         ... 
19661    2020
19662    2020
19663    2020
19664    2020
19665    2020
Name: year, Length: 14473, dtype: object

## Casting Data Types
In the above cell, I am converting year to a String so I can add '20' to the year in order to have the correct format to convert to datetime.
In the below cells:
- ansi_state_code and ansi_county_code are converted to Strings because these will not be aggregated at any point. 
- year is being converted to datetime.
- ccd_nonfiscal_match and census_fiscal_match are being converted to booleans to match database data type requirements.

In [283]:
df['ansi_state_code'] = df['ansi_state_code'].astype(str)
df['ansi_county_code'] = df['ansi_county_code'].astype(str)
df['year'] = pd.to_datetime(df['year'].astype(str), format='%Y')
df['ccd_nonfiscal_match'] = df['ccd_nonfiscal_match'].astype(bool)
df['census_fiscal_match'] = df['census_fiscal_match'].astype(bool)

In [284]:
df[['ansi_state_code', 'ansi_county_code', 'year', 'ccd_nonfiscal_match', 'census_fiscal_match']].dtypes

ansi_state_code                object
ansi_county_code               object
year                   datetime64[ns]
ccd_nonfiscal_match              bool
census_fiscal_match              bool
dtype: object
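
One caveat when casting codes from integers to strings is that leading zeros are not restored: a state code read as `1` becomes `'1'` rather than `'01'`. If fixed-width codes are wanted, they could be zero-padded as in the sketch below. The widths (two digits for the state code, five for the combined state and county code) are an assumption based on the usual ANSI/FIPS convention and are not taken from the notebook.

```python
import pandas as pd

codes = pd.DataFrame({'ansi_state_code': [1, 56], 'ansi_county_code': [1095, 56039]})

# Assumed widths: 2 digits for state, 5 digits for county (state + 3-digit county)
codes['ansi_state_code'] = codes['ansi_state_code'].astype(str).str.zfill(2)
codes['ansi_county_code'] = codes['ansi_county_code'].astype(str).str.zfill(5)

print(codes['ansi_state_code'].tolist())    # ['01', '56']
print(codes['ansi_county_code'].tolist())   # ['01095', '56039']
```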

In [285]:
df.describe(include='all')

Unnamed: 0,lea_id,census_id,ansi_state_code,ansi_county_code,csa,cbsa,lea_name,state,st_abbr,school_level_code,...,education_stabilization_fund_esf_rwp_grant_flag,education_stabilization_fund_esf_rem_grant_flag,project_serv_flag,coronavirus_relief_fund_flag,cares_act_expenditure_curr_flag,cares_act_expenditure_instructional_flag,cares_act_expenditure_support_services_flag,cares_act_expenditure_capital_outlay_flag,cares_act_expenditure_tech_related_supplies_services_flag,cares_act_expenditure_tech_related_equipment_flag
count,14473.0,14473.0,14473.0,14473.0,14473,14473,14473,14473,14473,14473.0,...,14473,14473,14473,14473,14473,14473,14473,14473,14473,14473
unique,14473.0,14473.0,51.0,3126.0,173,925,14142,51,51,7.0,...,3,3,3,5,3,3,4,4,4,4
top,100005.0,1504840100000.0,6.0,17031.0,N,N,Jefferson County,California,CA,3.0,...,R,R,R,R,R,R,R,R,R,R
freq,1.0,1.0,1114.0,164.0,6858,3740,5,1114,1114,10416.0,...,7807,7790,7979,10368,10351,9799,9770,10223,8522,8458
mean,,,,,,,,,,,...,,,,,,,,,,
min,,,,,,,,,,,...,,,,,,,,,,
25%,,,,,,,,,,,...,,,,,,,,,,
50%,,,,,,,,,,,...,,,,,,,,,,
75%,,,,,,,,,,,...,,,,,,,,,,
max,,,,,,,,,,,...,,,,,,,,,,


## Data Cleaning Notes

**Handling Special Placeholders in Financial Data:**

The dataset uses special placeholder values to indicate non-standard entries for financial data: 
- “-1” indicates missing data, which may arise in situations where zero values are ambiguous.
- “-2” and “-3” could similarly indicate other forms of non-standard or suppressed data, such as revised figures or privacy-related omissions.

To keep statistical calculations from being distorted, we replace these placeholder values with `NaN`. The replacement is applied to the whole DataFrame, so any column containing the values -1, -2, or -3 is affected, not only the financial fields. Each financial field is paired with a corresponding "flag" column, and these flag columns point to the documentation that explains how each value, including the placeholders, was classified.

The purpose of this cleaning step is not to discard the detail encoded by these placeholders but to create a dataset that can be analyzed quantitatively without sentinel values being mistaken for real amounts. The flag columns remain intact for any case-by-case examination where the context behind the numeric values is necessary, ensuring transparency and traceability in our dataset.

This approach ensures that while the dataset is primed for quantitative analysis, the integrity and comprehensiveness of the data are maintained for more qualitative assessments.


In [286]:
# Replace all three placeholder values with NaN in a single pass
df.replace([-3, -2, -1], np.nan, inplace=True)
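
If the replacement should touch only the financial fields described above, it could be limited to an explicit list of columns. This is a minimal sketch under that assumption: `amount_columns` is a hypothetical list (in practice it might be derived from the column mapping), and the toy values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'census_id': ['01504840100000', '51500340100000'],
    'title_I_thru_state': [1775000, -2],
    'title_I_flag': ['R', 'N'],
})

# Hypothetical list of financial columns to clean
amount_columns = ['title_I_thru_state']
placeholders = [-1, -2, -3]

for col in amount_columns:
    # mask() sets the value to NaN wherever the condition is True
    toy[col] = toy[col].mask(toy[col].isin(placeholders))

print(toy['title_I_thru_state'].isna().tolist())   # [False, True]
print(toy['title_I_flag'].tolist())                # ['R', 'N'] (flags untouched)
```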

In [287]:
df.describe()

Unnamed: 0,year,fall_membership,fall_membership_school_univ,title_I_thru_state,indiv_with_disabilities_thru_state,voc_tech_education_thru_state,effective_instruction_support_thru_state,student_support_academic_enrich_thru_state,21st_century_learning_centers_thru_state,rural_low_income_school_program_thru_state,...,education_stabilization_fund_esf_rem_grant,project_serv,coronavirus_relief_fund,cares_act_expenditure_curr,cares_act_expenditure_instructional,cares_act_expenditure_support_services,cares_act_expenditure_capital_outlay,cares_act_expenditure_tech_related_supplies_services,cares_act_expenditure_tech_related_equipment,weight
count,14473,13318.0,13675.0,14134.0,14134.0,14134.0,14134.0,14134.0,14134.0,14134.0,...,14134.0,14134.0,14134.0,14134.0,14134.0,14134.0,14134.0,14134.0,14134.0,14473.0
mean,2020-01-01 00:00:00.000000256,3601.779622,3500.249141,997604.9,820512.9,41984.58,79281.73,29413.33,27358.71,3501.698,...,0.0,0.0,17263.12,131353.5,77660.18,36372.01,9777.841,9217.277,1809.325,1.0
min,2020-01-01 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2020-01-01 00:00:00,401.0,368.0,62000.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2020-01-01 00:00:00,1101.0,1049.0,196000.0,148000.0,0.0,5000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,2020-01-01 00:00:00,2931.75,2834.0,584000.0,593000.0,16000.0,52000.0,15000.0,0.0,0.0,...,0.0,0.0,0.0,15000.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2020-01-01 00:00:00,956634.0,934580.0,603376000.0,304835000.0,9657000.0,22926000.0,16476000.0,18330000.0,3227000.0,...,0.0,0.0,21535000.0,72115000.0,72115000.0,20764000.0,66963000.0,9680000.0,1314000.0,1.0
std,,14269.09774,13972.640772,7471489.0,4018264.0,199345.5,411694.1,221999.1,270668.0,32572.54,...,0.0,0.0,313285.2,1201275.0,1047666.0,350332.5,569909.9,142715.8,27298.67,0.0


### Entity Schema Tables  

#### Create entity DataFrame

In [288]:
# Initialize an empty list for storing column names
entity_columns = []

# Iterate over each row in the mapping DataFrame
for index, row in column_mapping_df.iterrows():
    # Check if the table is 'entity' or 'all', and the column name is not 'year'
    if row['Table'] in ['entity', 'all'] and row['New Name'] != 'year':
        if row['New Name'] not in total_columns:
            # Add the new column name to the list
            entity_columns.append(row['New Name'])

# Create a new DataFrame with only the selected columns
entity = df[entity_columns].copy()
entity


Unnamed: 0,lea_id,census_id,ansi_state_code,ansi_county_code,csa,cbsa,lea_name,state,st_abbr
1,100005,01504840100000,1,1095,N,10700,Albertville City,Alabama,AL
2,100006,01504800100000,1,1095,N,10700,Marshall County,Alabama,AL
3,100007,01503740100000,1,1073,142,13820,Hoover City,Alabama,AL
4,100008,01504530100000,1,1089,290,26620,Madison City,Alabama,AL
6,100011,01503710100000,1,1073,142,13820,Leeds City,Alabama,AL
...,...,...,...,...,...,...,...,...,...
19661,5605830,51502000200000,56,56039,N,27220,Teton County School District #1,Wyoming,WY
19662,5606090,51502300200000,56,56045,N,23940,Weston County School District #7,Wyoming,WY
19663,5606240,51502200400000,56,56043,N,N,Washakie County School District #1,Wyoming,WY
19664,5680180,51500340100000,56,56005,N,23940,Northeast Wyoming BOCES,Wyoming,WY


#### Create annual_stats DataFrame

In [289]:
# Initialize an empty list for storing column names
annual_stats_columns = []

# Iterate over each row in the mapping DataFrame
for index, row in column_mapping_df.iterrows():
    # Check if the table is 'annual_stats' or 'all'
    if row['Table'] in ['annual_stats', 'all']:
        if row['New Name'] not in total_columns:
            # Add the new column name to the list
            annual_stats_columns.append(row['New Name'])

# Create a new DataFrame with only the selected columns
annual_stats = df[annual_stats_columns].copy()

# If 'year' is not the last column, move it to the end
if 'year' in annual_stats.columns and annual_stats.columns[-1] != 'year':
    # Get a list of all columns except 'year'
    cols = [col for col in annual_stats.columns if col != 'year']
    # Add 'year' at the end of the list
    cols.append('year')
    # Reorder the DataFrame
    annual_stats = annual_stats[cols]

annual_stats

Unnamed: 0,census_id,school_level_code,agency_charter_code,ccd_nonfiscal_match,census_fiscal_match,low_grade_offered,high_grade_offered,fall_membership,fall_membership_school_univ,fall_membership_flag,fall_membership_school_univ_flag,year
1,01504840100000,03,3,True,True,PK,12,5824.0,5824.0,R,R,2020-01-01
2,01504800100000,03,3,True,True,PK,12,5764.0,5764.0,R,R,2020-01-01
3,01503740100000,03,3,True,True,PK,12,14061.0,14061.0,R,R,2020-01-01
4,01504530100000,03,3,True,True,PK,12,11695.0,11695.0,R,R,2020-01-01
6,01503710100000,03,3,True,True,PK,12,2076.0,2076.0,R,R,2020-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...
19661,51502000200000,03,3,True,True,KG,12,2869.0,2869.0,R,R,2020-01-01
19662,51502300200000,03,3,True,True,KG,12,235.0,235.0,R,R,2020-01-01
19663,51502200400000,03,3,True,True,KG,12,1244.0,1244.0,R,R,2020-01-01
19664,51500340100000,07,3,True,True,PK,12,,1.0,A,R,2020-01-01


### Expenses & Revenue Schema Tables

#### `melt_df()`
The purpose of this function is to convert the data from a wide format to a long format, which is better suited to normalization in relational databases and to data visualization.

In [290]:
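def melt_df(df: pd.DataFrame, schema: str, table: str, column_mapping_df: pd.DataFrame, total_columns: list) -> pd.DataFrame:
    """Select the columns mapped to `table` (or 'all') and melt them from wide to long format.

    `schema` is 'expenses' or 'revenue' and determines the names of the melted columns.
    """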
    # Initialize an empty list for storing column names
    columns_to_use = []
    new_columns = []
    
    # Iterate over each row in the mapping DataFrame
    for _, row in column_mapping_df.iterrows():
        # Check if the table is the specified one or 'all'
        if row['Table'] in [table, 'all']:
            if row['New Name'] not in total_columns:
                # Add the new column name to the list
                columns_to_use.append(row['New Name'])
    
    # Create a new DataFrame with only the selected columns
    new_df = df[columns_to_use].copy()
    
    # Select id_vars for the melt function
    id_vars = ['census_id', 'year'] + [col for col in new_df.columns if col.endswith('_flag')]
    
    # Melt the DataFrame
    if schema == 'expenses':
        new_df = pd.melt(new_df, id_vars=id_vars, var_name='expenditure_title', value_name='amount')
        # The new columns will be 'expenditure_title' and 'amount'
        new_columns = ['expenditure_title', 'amount']
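    elif schema == 'revenue':
        new_df = pd.melt(new_df, id_vars=id_vars, var_name='revenue_title', value_name='revenue')
        # The new columns will be 'revenue_title' and 'revenue'
        new_columns = ['revenue_title', 'revenue']
    else:
        # Fail loudly instead of returning an unmelted DataFrame
        raise ValueError(f"Unknown schema {schema!r}; expected 'expenses' or 'revenue'")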
    
    # Ensure the new columns are at indexes 2 and 3
    # Get the list of id_vars that don't include the new columns
    remaining_columns = [col for col in id_vars if col not in new_columns]

    # Reorder columns such that new columns are at index 2 and 3
    ordered_columns = remaining_columns[:2] + new_columns + remaining_columns[2:]
    
    # Reassign the DataFrame with the ordered columns
    new_df = new_df[ordered_columns]
    
    return new_df
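
To make the wide-to-long conversion concrete, here is a minimal sketch of `pd.melt` on a toy two-row DataFrame. The column names echo the state revenue table, but the `special_education_programs` amounts are made up for illustration, and flag columns are left out to keep the example small.

```python
import pandas as pd

wide = pd.DataFrame({
    'census_id': ['01504840100000', '01504800100000'],
    'year': [2020, 2020],
    'general_formula_assistance': [31112000.0, 32252000.0],
    'special_education_programs': [100.0, 200.0],  # illustrative amounts
})

# Each non-id column becomes a (revenue_title, revenue) pair per original row
long = pd.melt(wide, id_vars=['census_id', 'year'],
               var_name='revenue_title', value_name='revenue')

print(long.shape)                       # (4, 4)
print(long['revenue_title'].tolist())
```

Two rows with two value columns become four rows, one per district and revenue title.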


#### Create expenditures DataFrame

In [291]:
expenditures = melt_df(df,'expenses', 'expenditures', column_mapping_df, total_columns)
expenditures.head()

Unnamed: 0,census_id,year,expenditure_title,amount,curr_expenditures_instruction_flag,payments_private_schools_flag,payments_charter_schools_flag,support_services_pupils_flag,support_services_instructional_staff_flag,support_services_general_admin_flag,...,special_education_expenditure_instructional_flag,special_education_expenditure_pupil_support_flag,special_education_expenditure_instructional_staff_support_flag,special_education_expenditure_student_transportation_support_flag,cares_act_expenditure_curr_flag,cares_act_expenditure_instructional_flag,cares_act_expenditure_support_services_flag,cares_act_expenditure_capital_outlay_flag,cares_act_expenditure_tech_related_supplies_services_flag,cares_act_expenditure_tech_related_equipment_flag
0,1504840100000,2020-01-01,curr_expenditures_instruction,29888000.0,R,R,M,R,R,R,...,R,R,R,R,R,R,R,R,R,R
1,1504800100000,2020-01-01,curr_expenditures_instruction,34095000.0,R,R,M,R,R,R,...,R,R,R,R,R,R,R,R,R,R
2,1503740100000,2020-01-01,curr_expenditures_instruction,101645000.0,R,R,M,R,R,R,...,R,R,R,R,R,R,R,R,R,R
3,1504530100000,2020-01-01,curr_expenditures_instruction,70337000.0,R,R,M,R,R,R,...,R,R,R,R,R,R,R,R,R,R
4,1503710100000,2020-01-01,curr_expenditures_instruction,11546000.0,R,R,M,R,R,R,...,R,R,R,R,R,R,R,R,R,R


#### Create local DataFrame

In [292]:
local = melt_df(df,'revenue', 'local_revenue', column_mapping_df, total_columns)
local.head()

Unnamed: 0,census_id,year,revenue_title,revenue,parent_government_contributions_flag,propery_taxes_flag,general_sales_tax_flag,public_utility_taxes_flag,individual_corporate_income_tax_flag,all_other_taxes_flag,...,district_activity_receipts_flag,students_fees_nonspecified_flag,other_sales_and_services_flag,rents_and_royalties_flag,sale_of_property_flag,interest_earnings_flag,fines_and_forfeits_flag,private_contributions_local_flag,misc_local_flag,nces_local_and_census_state_rev_flag
0,1504840100000,2020-01-01,parent_government_contributions,,N,R,M,M,M,R,...,R,R,R,R,R,R,R,R,R,R
1,1504800100000,2020-01-01,parent_government_contributions,,N,R,M,M,M,R,...,R,R,R,R,R,R,R,R,R,R
2,1503740100000,2020-01-01,parent_government_contributions,,N,R,M,M,M,R,...,R,R,R,R,R,R,R,R,R,R
3,1504530100000,2020-01-01,parent_government_contributions,,N,R,M,M,M,R,...,R,R,R,R,R,R,R,R,R,R
4,1503710100000,2020-01-01,parent_government_contributions,,N,R,M,M,M,R,...,R,R,R,R,R,R,R,R,R,R


#### Create state DataFrame

In [293]:
state = melt_df(df,'revenue', 'state_revenue', column_mapping_df, total_columns)
state.head()

Unnamed: 0,census_id,year,revenue_title,revenue,general_formula_assistance_flag,staff_improvement_programs_flag,special_education_programs_flag,compensatory_basic_skills_programs_flag,bilingual_education_state_flag,gifted_talented_programs_flag,vocational_education_programs_flag,school_lunch_programs_flag,capital_outlay_debit_services_programs_flag,transportation_programs_flag,other_programs_state_flag,nonspecified_state_flag,employee_benefits_state_flag,not_employee_benefits_state_flag
0,1504840100000,2020-01-01,general_formula_assistance,31112000.0,R,R,R,R,R,M,M,M,R,R,R,R,M,M
1,1504800100000,2020-01-01,general_formula_assistance,32252000.0,R,R,R,R,R,M,M,M,R,R,R,R,M,M
2,1503740100000,2020-01-01,general_formula_assistance,66762000.0,R,R,R,R,R,M,M,M,R,R,R,R,M,M
3,1504530100000,2020-01-01,general_formula_assistance,60055000.0,R,R,R,R,R,M,M,M,R,R,R,R,M,M
4,1503710100000,2020-01-01,general_formula_assistance,10406000.0,R,R,R,R,R,M,M,M,R,R,R,R,M,M


#### Create federal DataFrame

In [294]:
federal = melt_df(df,'revenue', 'federal_revenue', column_mapping_df, total_columns)
federal.head()

Unnamed: 0,census_id,year,revenue_title,revenue,title_I_flag,indiv_with_disabilities_flag,voc_tech_education_flag,effective_instruction_support_flag,student_support_academic_enrich_flag,21st_century_learning_centers_flag,...,impact_aid_direct_flag,indian_education_direct_flag,small_rural_school_achievement_program_direct_flag,other_direct_fed_rev_flag,esser_fund_flag,geer_fund_flag,education_stabilization_fund_esf_rwp_grant_flag,education_stabilization_fund_esf_rem_grant_flag,project_serv_flag,coronavirus_relief_fund_flag
0,1504840100000,2020-01-01,title_I_thru_state,1775000.0,R,R,R,R,R,R,...,R,R,M,R,R,R,R,R,R,R
1,1504800100000,2020-01-01,title_I_thru_state,2594000.0,R,R,R,R,R,R,...,R,R,M,R,R,R,R,R,R,R
2,1503740100000,2020-01-01,title_I_thru_state,1047000.0,R,R,R,R,R,R,...,R,R,M,R,R,R,R,R,R,R
3,1504530100000,2020-01-01,title_I_thru_state,773000.0,R,R,R,R,R,R,...,R,R,M,R,R,R,R,R,R,R
4,1503710100000,2020-01-01,title_I_thru_state,379000.0,R,R,R,R,R,R,...,R,R,M,R,R,R,R,R,R,R


## Create Database Mapping
Each key of the dictionary is a table name within the database. Each value is a two-item list:
- Index 0 = Schema name
- Index 1 = The DataFrame to export to that table

In [295]:
database_map = {'entity' : ['entity', entity],
                'annual_stats' : ['entity', annual_stats],
                'expenditures' : ['expenses', expenditures],
                'federal_revenue' : ['revenue', federal],
                'state_revenue' : ['revenue', state],
                'local_revenue' : ['revenue', local]}

## Database Initialization with Mapped Data
The code snippet enclosed within the conditional block is designed for the initial population of the database. As this project evolves, we will enhance this section with more sophisticated logic and additional functionality to support incremental updates and data management requirements.

In [296]:
# use_database = input("Enter 'y' to use database script. Else enter 'n'")
use_database = 'n'
if use_database == 'y':
    
    # Read in database credentials from JSON file
    with open('LEA_Finance_Survey_DB.json') as infile:
        credentials = json.load(infile)
    
    # Assign Credentials to Variables
    database_name = credentials['database']
    username = credentials['user']
    password = credentials['password']
    host = credentials['host']
    port = credentials['port']

    # Create a database connection using SQLAlchemy engine
    engine = create_engine(f'postgresql://{username}:{password}@{host}:{port}/{database_name}')
    
    populate_new_tables = 'n'

    if populate_new_tables == 'y':
        # Iterate over the database_map to insert each DataFrame
        for table_name, [schema_name, df_to_export] in database_map.items():
            df_to_export.to_sql(table_name, engine, schema=schema_name, if_exists='append', index=False)

    engine.dispose()
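
The export loop can be tried without a PostgreSQL server by pointing it at an in-memory SQLite database. This is only a dry-run sketch under stated assumptions: `toy_map` is a hypothetical stand-in for `database_map`, and because SQLite has no PostgreSQL-style schemas, the schema name is folded into the table name instead of being passed as `schema=`.

```python
import sqlite3
import pandas as pd

# Hypothetical, cut-down version of database_map with one-row DataFrames
toy_map = {
    'entity': ['entity', pd.DataFrame({'census_id': ['01504840100000'],
                                       'lea_name': ['Albertville City']})],
    'state_revenue': ['revenue', pd.DataFrame({'census_id': ['01504840100000'],
                                               'revenue': [31112000.0]})],
}

conn = sqlite3.connect(':memory:')
for table_name, (schema_name, df_to_export) in toy_map.items():
    # Schema is folded into the table name because SQLite lacks schemas
    df_to_export.to_sql(f'{schema_name}_{table_name}', conn, if_exists='append', index=False)

row_count = conn.execute('SELECT COUNT(*) FROM entity_entity').fetchone()[0]
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
conn.close()

print(row_count)     # 1
print(table_names)   # ['entity_entity', 'revenue_state_revenue']
```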


### Data Normalization for Visualization

In preparing our dataset for visualization in Tableau, we employ normalization techniques on specific columns to ensure that our visualizations are not biased by the scale of the data:

#### Min-Max Scaling
- **Purpose**: To transform the data into a fixed range of 0 to 1, making it easier to visualize different variables on the same scale without distorting the distribution of values. This is particularly important when creating comparative visualizations, such as heatmaps or line charts, where relative scales matter.
- **Applied to**: The `amount` column of the expenditures table and the `revenue` column of the federal, state, and local revenue tables, where the relative spacing of the values must be preserved for accurate visual comparison.

#### Z-Score Standardization
- **Purpose**: To standardize values so that they have a mean of zero and a standard deviation of one. This normalization is useful for visualizations that compare the relative standing of data points within a distribution, such as histograms or scatter plots.
- **Applied to**: The same `amount` and `revenue` columns, where it helps to show how many standard deviations each value lies from the mean, which makes outliers and the spread of the distribution easier to interpret.

By normalizing the data before visualization, we aim to create clear and meaningful visualizations in Tableau that accurately represent the underlying data without the distortion that can come from varying scales.
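
For reference, both transformations reduce to simple formulas. The sketch below computes them directly with pandas on a three-value toy series. It is an illustration rather than the notebook's method; the population standard deviation (`ddof=0`) is used for the Z-score because that is what scikit-learn's `StandardScaler` uses.

```python
import pandas as pd

amounts = pd.Series([0.0, 5.0, 10.0], name='revenue')

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
min_max = (amounts - amounts.min()) / (amounts.max() - amounts.min())

# Z-score: (x - mean) / std, using the population standard deviation (ddof=0)
z_score = (amounts - amounts.mean()) / amounts.std(ddof=0)

print(min_max.tolist())   # [0.0, 0.5, 1.0]
print(z_score.round(4).tolist())
```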

In [301]:
scaler = MinMaxScaler()

expenditures["amount (Min/Max Scale)"] = scaler.fit_transform(expenditures[["amount"]])
federal["revenue (Min/Max Scale)"] = scaler.fit_transform(federal[["revenue"]])
state["revenue (Min/Max Scale)"] = scaler.fit_transform(state[["revenue"]])
local["revenue (Min/Max Scale)"] = scaler.fit_transform(local[["revenue"]])

In [302]:
scaler = StandardScaler()

expenditures["amount (Z-Score Std)"] = scaler.fit_transform(expenditures[["amount"]])
federal["revenue (Z-Score Std)"] = scaler.fit_transform(federal[["revenue"]])
state["revenue (Z-Score Std)"] = scaler.fit_transform(state[["revenue"]])
local["revenue (Z-Score Std)"] = scaler.fit_transform(local[["revenue"]])

### Code Commentary on Exporting DataFrames

The following cell exports the DataFrames to CSV files. These files include the derived min-max scaled and Z-score columns, which supports further analysis in data visualization tools like Tableau. The derived values are kept out of the database for flexibility and to follow normalization best practices, but they are included in the CSV exports for exploratory analysis outside the database environment.

In [303]:
entity.to_csv('LEA Finance Survey – Entity Data.csv')
annual_stats.to_csv('LEA Finance Survey – Entity – Annual Stats Data.csv')
expenditures.to_csv('LEA Finance Survey – Expenditures Data.csv')
federal.to_csv('LEA Finance Survey – Federal Revenue Data.csv')
state.to_csv('LEA Finance Survey – State Revenue Data.csv')
local.to_csv('LEA Finance Survey – Local Revenue Data.csv')
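
Note that `to_csv` writes the DataFrame index as an unnamed first column by default, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted in Tableau, `index=False` could be passed. The sketch below shows the difference on a one-row toy DataFrame written to in-memory buffers; the names are illustrative only.

```python
import io
import pandas as pd

toy = pd.DataFrame({'census_id': ['01504840100000'],
                    'lea_name': ['Albertville City']}, index=[1])

with_index = io.StringIO()
toy.to_csv(with_index)                    # default: index written as first column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)    # index omitted

cols_with = pd.read_csv(io.StringIO(with_index.getvalue()), dtype=str).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue()), dtype=str).columns.tolist()

print(cols_with)      # ['Unnamed: 0', 'census_id', 'lea_name']
print(cols_without)   # ['census_id', 'lea_name']
```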