# Team 82 Data Cleanup

In [1]:
import os
import pandas as pd
import numpy as np

## Census Data

### Current Spending of Public Elementary-Secondary School Systems by State_2012-2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Renamed the following columns:
    * `Year (YEAR)`: removed the parentheses
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Spending"
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
* Simplified the values of the `Description` column by renaming the value `Elementary-secondary education school system total current expenditures` to "total" and removing the text "Elementary-secondary education school system current expenditures"
* Updated `Spending` column datatype to int64

### Per Pupil Amounts for Current Spending of Public Elementary-Secondary School Systems-US and State-2012 - 2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Renamed the following columns:
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Year (YEAR)`: removed the parentheses
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Spending"
* Simplified the values of the `Description` column by renaming the value `Elementary-secondary education school system total current expenditures` to "total" and removing the text "Elementary-secondary education school system current expenditures"
* Updated `Spending` column datatype to int64

### Percentage Distribution of Public Elementary-Secondary School System Revenue by Source-US and State-2012 - 2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Renamed the following columns:
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Year (YEAR)`: removed the parentheses
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Percentage"
* Simplified the values of the `Description` column by removing the text "Revenue from " and replacing the text " sources " with "-"
* Replaced the percentage values for DC where the value was "X" with zero
* Updated `Percentage` datatype to float64

### Revenue from Federal Sources for Public Elementary-Secondary School Systems-US and States-2012 - 2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Renamed the following columns:
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Year (YEAR)`: removed the parentheses
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Revenue"
* Simplified the values of the `Description` column by renaming the value `Elementary-secondary education school system revenue from Federal sources` to "total" and removing the prefix "Elementary-secondary education school system revenue from Federal sources - " from the remaining values
* Replaced "N" values in the `Revenue` column with zero
* Updated `Revenue` column datatype to int64

### Revenue from State Sources for Public Elementary-Secondary School Systems-US and State-2012 - 2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Removed D.C. from the dataset because it does not receive any state funding
* Renamed the following columns:
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Year (YEAR)`: removed the parentheses
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Revenue"
* Simplified the values of the `Description` column by renaming the value `Elementary-secondary education school system revenue from state sources` to "total" and removing the prefix "Elementary-secondary education school system revenue from state sources - " from the remaining values
* Updated `Revenue` column datatype to int64

### Revenue from Local Sources for Public Elementary-Secondary School Systems-US and State-2012 - 2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Renamed the following columns:
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Year (YEAR)`: removed the parentheses
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Revenue"
* Simplified the values of the `Description` column by renaming the value `Elementary-secondary education school system revenue from local sources` to "total" and removing the prefix "Elementary-secondary education school system revenue from local sources - " from the remaining values
* Replaced "X" values in the `Revenue` column with zero
* Updated `Revenue` column datatype to int64

### Summary of Public Elementary-Secondary School System Finances-US and States-2012-2018
Survey Component: Annual Survey of School System Finance
<br />
Type of Government (GOVTYPE_LABEL): State and Local

#### Cleanup Methodology
* Removed the following columns:
    * The `Survey Component (SVY_COMP_LABEL)` column because it contains the same value, `Annual Survey of School System Finance`, for all rows.
    * The `Aggregate Description (AGG_DESC)` column because its values are not human readable and represent the same data as the `Meaning of Aggregate Description (AGG_DESC_LABEL)` column.
    * The `Type of Government (GOVTYPE_LABEL)` column because it contains the same value, `State and Local`, for all rows.
* Renamed the following columns:
    * `Geographic Area Name (NAME)`: removed the parentheses and renamed to "State"
    * `Year (YEAR)`: removed the parentheses
    * `Meaning of Aggregate Description (AGG_DESC_LABEL)`: renamed to "Description"
    * `Amount Formatted (AMOUNT_FORMATTED)`: removed the parentheses and renamed to "Revenue"
* Simplified the values of the `Description` column by removing the text "Elementary-secondary education school system"
* Replaced "X" values in the `Revenue` column with zero
* Updated `Revenue` column datatype to int64
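
The Census tables above share one cleanup pattern (drop, rename, replace placeholder codes, cast), which can be sketched as a single helper. This is a minimal sketch under stated assumptions, not the notebook's own implementation: the function name `clean_census_table`, its parameters, and the two-row sample frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Columns that hold one constant value or duplicate another column
COLUMNS_TO_DROP = [
    'Survey Component (SVY_COMP_LABEL)',
    'Aggregate Description (AGG_DESC)',
    'Type of Government (GOVTYPE_LABEL)',
]

def clean_census_table(df, amount_name, placeholder=None, dtype='int64'):
    """Hypothetical helper: drop, rename, replace a placeholder code, cast."""
    renames = {
        'Year (YEAR)': 'Year',
        'Geographic Area Name (NAME)': 'State',
        'Meaning of Aggregate Description (AGG_DESC_LABEL)': 'Description',
        'Amount Formatted (AMOUNT_FORMATTED)': amount_name,
    }
    cleaned = df.drop(columns=COLUMNS_TO_DROP).rename(columns=renames)
    if placeholder is not None:
        # Swap the placeholder code (for example 'X' or 'N') for '0'
        is_placeholder = cleaned[amount_name] == placeholder
        cleaned[amount_name] = cleaned[amount_name].mask(is_placeholder, '0')
    cleaned[amount_name] = cleaned[amount_name].astype(dtype)
    return cleaned

# Made-up sample rows, not values from the Census files
sample = pd.DataFrame({
    'Year (YEAR)': [2018, 2018],
    'Geographic Area Name (NAME)': ['Alabama', 'District of Columbia'],
    'Survey Component (SVY_COMP_LABEL)': ['Annual Survey of School System Finance'] * 2,
    'Aggregate Description (AGG_DESC)': ['A', 'A'],
    'Type of Government (GOVTYPE_LABEL)': ['State and Local'] * 2,
    'Meaning of Aggregate Description (AGG_DESC_LABEL)': ['desc', 'desc'],
    'Amount Formatted (AMOUNT_FORMATTED)': ['100', 'X'],
})
cleaned_sample = clean_census_table(sample, 'Revenue', placeholder='X')
print(cleaned_sample)
```

Because the helper returns a new frame instead of modifying its input, the same raw frame can be cleaned again with different arguments while experimenting.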

In [2]:
spending_by_state = pd.read_csv('./data_sets/US Census/Current Spending of Public Elementary-Secondary School Systems by State_2012-2018.csv')

In [3]:
# dropping columns that are not useful
columns_to_drop = ['Survey Component (SVY_COMP_LABEL)', 'Aggregate Description (AGG_DESC)', 'Type of Government (GOVTYPE_LABEL)']
spending_by_state.drop(columns=columns_to_drop, inplace=True)

In [4]:
# renaming columns with names that are more useful
columns_to_rename = {
    'Year (YEAR)': 'Year',
    'Geographic Area Name (NAME)': 'State',
    'Amount Formatted (AMOUNT_FORMATTED)': 'Spending',
    'Meaning of Aggregate Description (AGG_DESC_LABEL)': 'Description'
}
spending_by_state.rename(columns=columns_to_rename, inplace=True)

In [5]:
# update description values
def renameDescriptions(desc, text_to_remove_p1, text_to_remove_p2='', replacement1='', replacement2=''):
    return desc.replace(text_to_remove_p1, replacement1).replace(text_to_remove_p2, replacement2).strip()
text1 = 'Elementary-secondary education school system'
text2 = 'current expenditures'
spending_by_state['Description'] = spending_by_state['Description'].map(lambda desc: renameDescriptions(desc, text1, text2))
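
The description cleanup can also be expressed with pandas' vectorized string methods instead of a per-row `map`. The sketch below is illustrative only: the second description string is an assumed example of a non-total category, not a value copied from the data file.

```python
import pandas as pd

# Illustrative description strings (the second one is an assumption)
descriptions = pd.Series([
    'Elementary-secondary education school system total current expenditures',
    'Elementary-secondary education school system current expenditures - Instruction',
])

# Remove both fragments literally (regex=False), then trim whitespace
simplified = (
    descriptions
    .str.replace('Elementary-secondary education school system', '', regex=False)
    .str.replace('current expenditures', '', regex=False)
    .str.strip()
)
print(simplified.tolist())  # ['total', '- Instruction']
```

Note that removing the two fragments leaves a leading "- " on the non-total value, so a further `str.lstrip('- ')` may be wanted if bare category names are preferred.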

In [6]:
# update spending type
spending_by_state['Spending'] = spending_by_state['Spending'].astype('int64')

In [7]:
per_pupil_spending = pd.read_csv('./data_sets/US Census/Per Pupil Amounts for Current Spending of Public Elementary-Secondary School Systems- US and State- 2012 - 2018.csv')

In [8]:
# dropping columns that are not useful
per_pupil_spending.drop(columns=columns_to_drop, inplace=True)

# renaming columns with names that are more useful
columns_to_rename['Amount Formatted (AMOUNT_FORMATTED)'] = 'Spending'
per_pupil_spending.rename(columns=columns_to_rename, inplace=True)

# update description values
per_pupil_spending['Description'] = per_pupil_spending['Description'].map(lambda desc: renameDescriptions(desc, text1, text2))

# update spending type
per_pupil_spending['Spending'] = per_pupil_spending['Spending'].astype('int64')

In [9]:
revenue_distribution = pd.read_csv('./data_sets/US Census/Percentage Distribution of Public Elementary-Secondary School System Revenue by Source- US and State- 2012 - 2018.csv')

In [10]:
# dropping columns that are not useful
revenue_distribution.drop(columns=columns_to_drop, inplace=True)

# renaming columns with names that are more useful
columns_to_rename['Amount Formatted (AMOUNT_FORMATTED)'] = 'Percentage'
revenue_distribution.rename(columns=columns_to_rename, inplace=True)

# update description values
revenue_distribution['Description'] = revenue_distribution['Description'].map(lambda desc: renameDescriptions(desc, 'Revenue from ', ' sources ', replacement2='-'))

# update 'X' values in Percentage column with 0 and update type to float
revenue_distribution['Percentage'] = revenue_distribution['Percentage'].apply(lambda x: 0 if x == 'X' else x).astype('float64')
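
The placeholder replacement can be written without a per-row lambda. A minimal sketch on made-up values, assuming the column arrives as strings with `'X'` marking a suppressed value:

```python
import pandas as pd

# Made-up percentage strings; 'X' stands in for a suppressed value
raw_percentages = pd.Series(['45.2', 'X', '9.8'])

# Replace the placeholder with '0', then cast the whole column in one step
percentages = raw_percentages.mask(raw_percentages == 'X', '0').astype('float64')
print(percentages.tolist())  # [45.2, 0.0, 9.8]
```

If a missing value is preferable to zero, `pd.to_numeric(raw_percentages, errors='coerce')` yields NaN for the placeholder instead.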

In [11]:
revenue_from_fed = pd.read_csv('./data_sets/US Census/Revenue from Federal Sources for Public Elementary-Secondary School Systems- US and States- 2012 - 2018.csv')

In [12]:
# dropping columns that are not useful
revenue_from_fed.drop(columns=columns_to_drop, inplace=True)

# renaming columns with names that are more useful
columns_to_rename['Amount Formatted (AMOUNT_FORMATTED)'] = 'Revenue'
revenue_from_fed.rename(columns=columns_to_rename, inplace=True)

# update description values
text1 = 'Elementary-secondary education school system revenue from Federal sources'
text2 = text1 + ' - '
revenue_from_fed['Description'] = revenue_from_fed['Description'].map(lambda desc: renameDescriptions(desc, text2, text1, replacement2='total'))

# replace revenue values with 0 where they equal 'N' and update type to int64
revenue_from_fed['Revenue'] = revenue_from_fed['Revenue'].apply(lambda x: 0 if x == 'N' else x).astype('int64')

In [13]:
revenue_from_state = pd.read_csv('./data_sets/US Census/Revenue from State Sources for Public Elementary-Secondary School Systems- US and State- 2012 - 2018.csv')

In [14]:
# dropping columns that are not useful
revenue_from_state.drop(columns=columns_to_drop, inplace=True)

# renaming columns with names that are more useful
columns_to_rename['Amount Formatted (AMOUNT_FORMATTED)'] = 'Revenue'
revenue_from_state.rename(columns=columns_to_rename, inplace=True)

# dropping D.C. rows because all values for Revenue are 'X'
rows_to_remove = revenue_from_state[(revenue_from_state['State'] == 'District of Columbia') & (revenue_from_state['Revenue'] == 'X')].index
revenue_from_state.drop(index=rows_to_remove, inplace=True)

# update description values
text1 = 'Elementary-secondary education school system revenue from state sources'
text2 = text1 + ' - '
revenue_from_state['Description'] = revenue_from_state['Description'].map(lambda desc: renameDescriptions(desc, text2, text1, replacement2='total'))

# update revenue type
revenue_from_state['Revenue'] = revenue_from_state['Revenue'].astype('int64')

In [15]:
revenue_from_local = pd.read_csv('./data_sets/US Census/Revenue from Local Sources for Public Elementary-Secondary School Systems- US and State - 2012 - 2018.csv')

In [16]:
# dropping columns that are not useful
revenue_from_local.drop(columns=columns_to_drop, inplace=True)

# renaming columns with names that are more useful
columns_to_rename['Amount Formatted (AMOUNT_FORMATTED)'] = 'Revenue'
revenue_from_local.rename(columns=columns_to_rename, inplace=True)

# update description values
text1 = 'Elementary-secondary education school system revenue from local sources'
text2 = text1 + ' - '
revenue_from_local['Description'] = revenue_from_local['Description'].map(lambda desc: renameDescriptions(desc, text2, text1, replacement2='total'))

# replace revenue values with 0 where they equal 'X' and update type to int64
revenue_from_local['Revenue'] = revenue_from_local['Revenue'].apply(lambda x: 0 if x == 'X' else x).astype('int64')

In [17]:
finances_summary = pd.read_csv('./data_sets/US Census/Summary of Public Elementary-Secondary School System Finances-US and States-2012-2018.csv')

In [18]:
# dropping columns that are not useful
finances_summary.drop(columns=columns_to_drop, inplace=True)

# renaming columns with names that are more useful
columns_to_rename['Amount Formatted (AMOUNT_FORMATTED)'] = 'Revenue'
finances_summary.rename(columns=columns_to_rename, inplace=True)

# update description values
text1 = 'Elementary-secondary education school system'
finances_summary['Description'] = finances_summary['Description'].map(lambda desc: renameDescriptions(desc, text1))

# replace revenue values with 0 where they equal 'X' and update type to int64
finances_summary['Revenue'] = finances_summary['Revenue'].apply(lambda x: 0 if x == 'X' else x).astype('int64')

# Team 82 Data Cleanup - NCES Parent Demographic Data

## NCES Parent Demographic Data

### Education Attainment of Parent
Education Demographic and Geographic Estimates (EDGE) 2014-2018
<br />
Geography: All Districts
<br />
Population Group: Parents of Relevant Children
* PDP 02 Selected Social Characteristics on Parents in the United States
    * `PDP02.5 Attainment`


#### Cleanup Methodology
* Removed the following columns:
    * Columns that end in `moe` (margin of error) because this data will not be used for this project:
        * `'PDP02.5_29moe'`
        * `'PDP02.5_30pctmoe'`
        * `'PDP02.5_31pctmoe'`
        * `'PDP02.5_32pctmoe'`
        * `'PDP02.5_33pctmoe'`
        * `'PDP02.5_34pctmoe'`
        * `'PDP02.5_35pctmoe'`
        * `'PDP02.5_36pctmoe'`
        * `'PDP02.5_37pctmoe'`
        * `'PDP02.5_38pctmoe'`
        * `'PDP02.5_30moe'`
        * `'PDP02.5_31moe'`
        * `'PDP02.5_32moe'`
        * `'PDP02.5_33moe'`
        * `'PDP02.5_34moe'`
        * `'PDP02.5_35moe'`
        * `'PDP02.5_36moe'`
        * `'PDP02.5_37moe'`
        * `'PDP02.5_38moe'`
    
* Renamed the following columns to readable names and updated the data type to int64 (`POP` = population 25 years or older; `pc` = percent):
    * `PDP02.5_30est` to `num_Educational_Attain_POP_LT9th`
    * `PDP02.5_31est` to `num_Educational_Attain_POP_9th-12th`
    * `PDP02.5_32est` to `num_Educational_Attain_POP_HS_GRAD`
    * `PDP02.5_33est` to `num_Educational_Attain_POP_SomeColl`
    * `PDP02.5_34est` to `num_Educational_Attain_POP_AssocDeg`
    * `PDP02.5_35est` to `num_Educational_Attain_POP_BacDeg`
    * `PDP02.5_36est` to `num_Educational_Attain_POP_GradProf`
    * `PDP02.5_37pct` to `pct_Educational_Attain_POP_HS_Grad_higher`
    * `PDP02.5_29est` to `num_Educational_Attain_POP`
    * `PDP02.5_30pct` to `pc_ Educational_Attain_POP_LT9th`
    * `PDP02.5_31pct` to `pc_Educational_Attain_POP_9th-12th`
    * `PDP02.5_32pct` to `pc_Educational_Attain_POP_HS_GRAD`
    * `PDP02.5_33pct` to `pc_Educational_Attain_POP_SomeColl`
    * `PDP02.5_34pct` to `pc_Educational_Attain_POP_AssocDeg`
    * `PDP02.5_35pct` to `pc_Educational_Attain_POP_BacDeg`
    * `PDP02.5_36pct` to `pc_Educational_Attain_POP_GradProf`
    * `PDP02.5_38pct` to `pct_Educational_Attain_BS_Deg_higher`
    



In [19]:
parent_social_by_district = pd.read_csv('./data_sets/EDGE_Export_122152216654_Social Demographics parents 2014_18/PDP02.5_202_USSchoolDistrictAll_12215227826.txt', sep='|')

In [20]:
parent_social_by_district

Unnamed: 0,GeoId,Geography,LEAID,Year,Iteration,PDP02.5_29est,PDP02.5_29moe,PDP02.5_30est,PDP02.5_30moe,PDP02.5_30pct,...,PDP02.5_36pct,PDP02.5_36pctmoe,PDP02.5_37est,PDP02.5_37moe,PDP02.5_37pct,PDP02.5_37pctmoe,PDP02.5_38est,PDP02.5_38moe,PDP02.5_38pct,PDP02.5_38pctmoe
0,97000US2700106,A.C.G.C. Public School District,2700106,2014-2018,202,905,79,15,14,1.7,...,4.4,1.3,855,78,94.5,2.6,180,33,19.9,3.2
1,97000US4500690,Abbeville County School District,4500690,2014-2018,202,2965,336,50,38,1.7,...,8.4,3.1,2740,326,92.4,3.2,715,165,24.1,4.8
2,97000US5500030,Abbotsford School District,5500030,2014-2018,202,670,113,90,39,13.4,...,0.6,0.5,545,104,81.3,6.0,80,25,11.9,3.7
3,97000US4807380,Abbott Independent School District,4807380,2014-2018,202,210,61,0,13,0.0,...,11.9,6.4,205,61,97.6,3.5,105,39,50.0,11.6
4,97000US5300030,Aberdeen School District,5300030,2014-2018,202,2895,321,330,161,11.4,...,9.2,3.2,2410,295,83.2,6.1,540,141,18.7,4.6
5,97000US2800360,Aberdeen School District,2800360,2014-2018,202,1260,218,105,74,8.3,...,3.2,3.3,1155,214,91.7,5.8,190,108,15.1,8.0
6,97000US1600030,Aberdeen School District 58,1600030,2014-2018,202,575,151,95,61,16.5,...,7.8,5.4,420,124,73.0,12.1,95,51,16.5,8.2
7,97000US4807410,Abernathy Independent School District,4807410,2014-2018,202,755,149,35,36,4.6,...,4.0,3.7,680,129,90.1,7.1,175,73,23.2,8.1
8,97000US4807440,Abilene Independent School District,4807440,2014-2018,202,13660,659,525,151,3.8,...,6.5,1.3,12175,689,89.1,1.9,2615,388,19.1,2.5
9,97000US2003180,Abilene Unified School District 435,2003180,2014-2018,202,1390,174,75,49,5.4,...,7.2,4.6,1270,168,91.4,3.9,410,94,29.5,6.7


In [21]:
#debug look at columns
#parent_social_by_district.columns

In [22]:
# dropping moe - margin of error
df = parent_social_by_district
columns_to_drop = [
    'PDP02.5_29moe', 'PDP02.5_30pctmoe', 'PDP02.5_31pctmoe', 'PDP02.5_32pctmoe',
    'PDP02.5_33pctmoe', 'PDP02.5_34pctmoe', 'PDP02.5_35pctmoe', 'PDP02.5_36pctmoe',
    'PDP02.5_30moe', 'PDP02.5_31moe', 'PDP02.5_32moe', 'PDP02.5_33moe',
    'PDP02.5_34moe', 'PDP02.5_35moe', 'PDP02.5_36moe', 'PDP02.5_37moe',
    'PDP02.5_37pctmoe', 'PDP02.5_38moe', 'PDP02.5_38pctmoe',
]
parent_social_by_district.drop(columns=columns_to_drop, inplace=True)
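
Since every margin-of-error column name ends in `moe`, the drop list could also be derived from the column names instead of typed out. A minimal sketch on a made-up frame that only mimics the EDGE naming pattern (the `GeoId` value is hypothetical):

```python
import pandas as pd

# Made-up frame with the same naming pattern as the EDGE export
frame = pd.DataFrame({
    'GeoId': ['97000US0000001'],
    'PDP02.5_29est': [905],
    'PDP02.5_29moe': [79],
    'PDP02.5_30pct': [1.7],
    'PDP02.5_30pctmoe': [1.5],
})

# Collect every column whose name ends in 'moe', then drop them together
moe_columns = [col for col in frame.columns if col.endswith('moe')]
trimmed = frame.drop(columns=moe_columns)
print(list(trimmed.columns))
```

Deriving the list this way avoids missed or duplicated entries, at the cost of also dropping any unrelated column that happens to end in `moe`.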

In [23]:
parent_social_by_district.columns

Index(['GeoId', 'Geography', 'LEAID', 'Year', 'Iteration', 'PDP02.5_29est',
       'PDP02.5_30est', 'PDP02.5_30pct', 'PDP02.5_31est', 'PDP02.5_31pct',
       'PDP02.5_32est', 'PDP02.5_32pct', 'PDP02.5_33est', 'PDP02.5_33pct',
       'PDP02.5_34est', 'PDP02.5_34pct', 'PDP02.5_35est', 'PDP02.5_35pct',
       'PDP02.5_36est', 'PDP02.5_36pct', 'PDP02.5_37est', 'PDP02.5_37pct',
       'PDP02.5_38est', 'PDP02.5_38pct'],
      dtype='object')

In [24]:
# renaming columns with names that are more useful
df2 = parent_social_by_district
columns_to_rename = {
    'PDP02.5_30est': 'num_Educational_Attain_POP_LT9th',
    'PDP02.5_31est': 'num_Educational_Attain_POP_9th-12th',
    'PDP02.5_32est': 'num_Educational_Attain_POP_HS_GRAD',
    'PDP02.5_33est': 'num_Educational_Attain_POP_SomeColl',
    'PDP02.5_34est': 'num_Educational_Attain_POP_AssocDeg',
    'PDP02.5_35est': 'num_Educational_Attain_POP_BacDeg',
    'PDP02.5_36est': 'num_Educational_Attain_POP_GradProf',
    'PDP02.5_37pct': 'pct_Educational_Attain_POP_HS_Grad_higher',
    'PDP02.5_29est': 'num_Educational_Attain_POP',
    'PDP02.5_30pct': 'pc_ Educational_Attain_POP_LT9th',
    'PDP02.5_31pct': 'pc_Educational_Attain_POP_9th-12th',
    'PDP02.5_32pct': 'pc_Educational_Attain_POP_HS_GRAD',
    'PDP02.5_33pct': 'pc_Educational_Attain_POP_SomeColl',
    'PDP02.5_34pct': 'pc_Educational_Attain_POP_AssocDeg',
    'PDP02.5_35pct': 'pc_Educational_Attain_POP_BacDeg',
    'PDP02.5_36pct': 'pc_Educational_Attain_POP_GradProf',
    'PDP02.5_38pct': 'pct_Educational_Attain_BS_Deg_higher'
}
parent_social_by_district.rename(columns=columns_to_rename, inplace=True)
parent_social_by_district

Unnamed: 0,GeoId,Geography,LEAID,Year,Iteration,num_Educational_Attain_POP,num_Educational_Attain_POP_LT9th,pc_ Educational_Attain_POP_LT9th,num_Educational_Attain_POP_9th-12th,pc_Educational_Attain_POP_9th-12th,...,num_Educational_Attain_POP_AssocDeg,pc_Educational_Attain_POP_AssocDeg,num_Educational_Attain_POP_BacDeg,pc_Educational_Attain_POP_BacDeg,num_Educational_Attain_POP_GradProf,pc_Educational_Attain_POP_GradProf,PDP02.5_37est,pct_Educational_Attain_POP_HS_Grad_higher,PDP02.5_38est,pct_Educational_Attain_BS_Deg_higher
0,97000US2700106,A.C.G.C. Public School District,2700106,2014-2018,202,905,15,1.7,35,3.9,...,175,19.3,140,15.5,40,4.4,855,94.5,180,19.9
1,97000US4500690,Abbeville County School District,4500690,2014-2018,202,2965,50,1.7,175,5.9,...,580,19.6,460,15.5,250,8.4,2740,92.4,715,24.1
2,97000US5500030,Abbotsford School District,5500030,2014-2018,202,670,90,13.4,30,4.5,...,65,9.7,75,11.2,4,0.6,545,81.3,80,11.9
3,97000US4807380,Abbott Independent School District,4807380,2014-2018,202,210,0,0.0,4,1.9,...,45,21.4,80,38.1,25,11.9,205,97.6,105,50.0
4,97000US5300030,Aberdeen School District,5300030,2014-2018,202,2895,330,11.4,155,5.4,...,255,8.8,275,9.5,265,9.2,2410,83.2,540,18.7
5,97000US2800360,Aberdeen School District,2800360,2014-2018,202,1260,105,8.3,4,0.3,...,160,12.7,150,11.9,40,3.2,1155,91.7,190,15.1
6,97000US1600030,Aberdeen School District 58,1600030,2014-2018,202,575,95,16.5,60,10.4,...,45,7.8,50,8.7,45,7.8,420,73.0,95,16.5
7,97000US4807410,Abernathy Independent School District,4807410,2014-2018,202,755,35,4.6,40,5.3,...,30,4.0,145,19.2,30,4.0,680,90.1,175,23.2
8,97000US4807440,Abilene Independent School District,4807440,2014-2018,202,13660,525,3.8,955,7.0,...,1480,10.8,1725,12.6,890,6.5,12175,89.1,2615,19.1
9,97000US2003180,Abilene Unified School District 435,2003180,2014-2018,202,1390,75,5.4,45,3.2,...,145,10.4,305,21.9,100,7.2,1270,91.4,410,29.5


### Update Population Type

In [25]:
# update parent population type
df3 = parent_social_by_district
c = parent_social_by_district.columns[6:]
for j in c:
    parent_social_by_district[j] = parent_social_by_district[j].astype('int64')
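
One thing to keep in mind with this cast is that `astype('int64')` drops the fractional part of float values, which matters for the percentage columns. The sketch below uses two columns with values taken from the first two displayed rows; restricting the cast to the `num_` count columns is an assumed alternative, not what the notebook does.

```python
import pandas as pd

# Two of the renamed columns, with values from the first two displayed rows
frame = pd.DataFrame({
    'num_Educational_Attain_POP': [905, 2965],
    'pct_Educational_Attain_BS_Deg_higher': [19.9, 24.1],
})

# Casting a float column to int64 truncates toward zero
truncated = frame['pct_Educational_Attain_BS_Deg_higher'].astype('int64')
print(truncated.tolist())  # [19, 24]

# Assumed alternative: cast only the count columns (prefix 'num_')
count_columns = [col for col in frame.columns if col.startswith('num_')]
frame[count_columns] = frame[count_columns].astype('int64')
print(frame.dtypes)
```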


## NCES Parent Economic Demographics Data

### Parent Income Below the Poverty Level
Education Demographic and Geographic Estimates (EDGE) 2014-2018
<br />
Geography: All Districts
<br />
Population Group: Parents of Relevant Children
* PDP 03 Selected Economic Characteristics on Parents in the United States
    * `PDP03.8 Percentage of People Whose Income in the Past 12 Months Is Below the Poverty Level`


#### Cleanup Methodology
* Removed the following columns:
    * Columns that end in `moe` (margin of error) because this data will not be used for this project:
        * `'PDP03.8_72pctmoe'`
        * `'PDP03.8_73pctmoe'`
        * `'PDP03.8_74pctmoe'`
        * `'PDP03.8_75pctmoe'`
    
* Renamed the following columns to readable names and converted them to a numeric data type (float64); `pc` values that are not available are set to NaN:
    * `'PDP03.8_72pct'` to `'pc_Below PovLvL_All_Ages'`
    * `'PDP03.8_73pct'` to `'pc_Below PovLvL_Age_gte_18'`
    * `'PDP03.8_74pct'` to `'pc_Below PovLvL_Age_18_64'`
    * `'PDP03.8_75pct'` to `'pc_Below PovLvL_Age_gte_65'`
    

In [26]:
parent_econ_by_district = pd.read_csv("./data_sets/EDGE_Export_122153919246_Economic Demographics Parents 2014-2018/PDP03.8_202_USSchoolDistrictAll_122153917324.txt", sep='|')
parent_econ_by_district.columns
#parent_econ_by_district 

Index(['GeoId', 'Geography', 'LEAID', 'Year', 'Iteration', 'PDP03.8_72pct',
       'PDP03.8_72pctmoe', 'PDP03.8_73pct', 'PDP03.8_73pctmoe',
       'PDP03.8_74pct', 'PDP03.8_74pctmoe', 'PDP03.8_75pct',
       'PDP03.8_75pctmoe'],
      dtype='object')

In [27]:
# dropping moe - margin of error
df4 = parent_econ_by_district
columns_to_drop = ['PDP03.8_72pctmoe', 'PDP03.8_73pctmoe', 'PDP03.8_74pctmoe', 'PDP03.8_75pctmoe']
parent_econ_by_district.drop(columns=columns_to_drop, inplace=True)

In [28]:
# rename columns
df5 = parent_econ_by_district
columns_to_rename = {
    'PDP03.8_72pct': 'pc_Below PovLvL_All_Ages',
    'PDP03.8_73pct': 'pc_Below PovLvL_Age_gte_18',
    'PDP03.8_74pct': 'pc_Below PovLvL_Age_18_64',
    'PDP03.8_75pct': 'pc_Below PovLvL_Age_gte_65'
}

parent_econ_by_district.rename(columns=columns_to_rename, inplace=True)
#debug 
parent_econ_by_district

Unnamed: 0,GeoId,Geography,LEAID,Year,Iteration,pc_Below PovLvL_All_Ages,pc_Below PovLvL_Age_gte_18,pc_Below PovLvL_Age_18_64,pc_Below PovLvL_Age_gte_65
0,97000US2700106,A.C.G.C. Public School District,2700106,2014-2018,202,11.8,11.8,11.8,-
1,97000US4500690,Abbeville County School District,4500690,2014-2018,202,21.9,21.9,21.9,0.0
2,97000US5500030,Abbotsford School District,5500030,2014-2018,202,16.0,16.0,16.1,0.0
3,97000US4807380,Abbott Independent School District,4807380,2014-2018,202,0.0,0.0,0.0,-
4,97000US5300030,Aberdeen School District,5300030,2014-2018,202,18.3,18.3,18.4,0.0
5,97000US2800360,Aberdeen School District,2800360,2014-2018,202,14.9,14.9,15.1,0.0
6,97000US1600030,Aberdeen School District 58,1600030,2014-2018,202,16.0,16.0,16.0,0.0
7,97000US4807410,Abernathy Independent School District,4807410,2014-2018,202,6.9,6.9,6.9,-
8,97000US4807440,Abilene Independent School District,4807440,2014-2018,202,16.7,16.7,16.8,0.0
9,97000US2003180,Abilene Unified School District 435,2003180,2014-2018,202,2.3,2.3,2.3,0.0


### Update Poverty Percentage Type

In [29]:
# update Percent Poverty type; force non-numbers to NaN
df6 = parent_econ_by_district
c = parent_econ_by_district.columns[5:]

for j in c:
    parent_econ_by_district[j] = pd.to_numeric(parent_econ_by_district[j], errors='coerce')
#debug
#parent_econ_by_district
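
The effect of the coercion step can be checked on a few values shaped like the `pc_Below PovLvL_Age_gte_65` column in the output above, where `-` marks unavailable data. This is a small illustrative sketch, assuming the column is read in as strings:

```python
import pandas as pd

# Values shaped like the 65-and-over poverty column; '-' means not available
raw = pd.Series(['-', '0.0', '16.7'])

# errors='coerce' turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
```

The result is a float64 column, because NaN and fractional percentages cannot be stored as int64.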

