# Educational Attainment Census Data

- This notebook explores educational attainment data for Los Angeles County (ACS 2022 5-year estimates) and compiles a few tables and maps to help visualize it.

By Katie Greenler

## Importing the Data

In [1]:
# import Python libraries
import pandas as pd

In [2]:
# read the CSV file into a DataFrame
df = pd.read_csv('Data/EducAttainment.csv')

## Inspecting the Data
First, a quick snapshot of the data: its dimensions and its first few rows.

In [3]:
df.shape

(2498, 120)

In [4]:
df.head()

Unnamed: 0,GEOID,NAME,S1501_C01_001E,S1501_C01_002E,S1501_C01_003E,S1501_C01_004E,S1501_C01_005E,S1501_C01_006E,S1501_C01_007E,S1501_C01_008E,...,S1501_C02_045E,S1501_C02_046E,S1501_C02_047E,S1501_C02_048E,S1501_C02_049E,S1501_C02_050E,S1501_C02_051E,S1501_C02_052E,S1501_C02_053E,S1501_C02_054E
0,6037101110,Census Tract 1011.10; Los Angeles County; Cali...,300,40,169,60,31,3119,122,133,...,-,(X),76.2,12.8,(X),93.4,31.2,(X),81.1,14.6
1,6037101122,Census Tract 1011.22; Los Angeles County; Cali...,367,8,137,214,8,3132,467,37,...,-,(X),63.0,18.5,(X),75.7,24.7,(X),63.3,35.5
2,6037101220,Census Tract 1012.20; Los Angeles County; Cali...,325,40,112,149,24,2560,344,251,...,-,(X),55.2,0.0,(X),48.6,15.6,(X),51.1,13.9
3,6037101221,Census Tract 1012.21; Los Angeles County; Cali...,380,58,43,237,42,2682,348,346,...,-,(X),59.7,8.7,(X),50.0,20.3,(X),54.3,7.4
4,6037101222,Census Tract 1012.22; Los Angeles County; Cali...,206,0,98,82,26,2090,132,163,...,0,(X),100.0,0.0,(X),79.8,4.8,(X),79.5,10.3


## Data Types
Now I want to further investigate the data types.

In [5]:
# look at the data types: verbose=True lists every column, show_counts=True adds the non-null counts
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2498 entries, 0 to 2497
Data columns (total 120 columns):
 #    Column          Non-Null Count  Dtype 
---   ------          --------------  ----- 
 0    GEOID           2498 non-null   int64 
 1    NAME            2498 non-null   object
 2    S1501_C01_001E  2498 non-null   int64 
 3    S1501_C01_002E  2498 non-null   int64 
 4    S1501_C01_003E  2498 non-null   int64 
 5    S1501_C01_004E  2498 non-null   int64 
 6    S1501_C01_005E  2498 non-null   int64 
 7    S1501_C01_006E  2498 non-null   int64 
 8    S1501_C01_007E  2498 non-null   int64 
 9    S1501_C01_008E  2498 non-null   int64 
 10   S1501_C01_009E  2498 non-null   int64 
 11   S1501_C01_010E  2498 non-null   int64 
 12   S1501_C01_011E  2498 non-null   int64 
 13   S1501_C01_012E  2498 non-null   int64 
 14   S1501_C01_013E  2498 non-null   int64 
 15   S1501_C01_014E  2498 non-null   int64 
 16   S1501_C01_015E  2498 non-null   int64 
 17   S1501_C01_016E  2498 non-null  
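With 120 columns, the full `info()` listing is long. As a rough complement, the dtypes can be summarized and the non-numeric columns listed directly. The sketch below uses a small made-up frame (the values and the choice of columns are assumptions for illustration); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df with a few of its columns
sample = pd.DataFrame({
    "GEOID": [6037101110, 6037101122],
    "NAME": ["Tract A", "Tract B"],
    "S1501_C01_001E": [300, 367],
    "S1501_C02_045E": ["-", "-"],  # placeholder text keeps this column non-numeric
})

# How many columns fall under each dtype
dtype_counts = sample.dtypes.value_counts()

# Names of the columns that were not parsed as numbers
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(text_columns)
```

Columns that should be numeric but show up in `text_columns` are worth a closer look, since they may contain placeholder symbols.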

In [6]:
# I want to check my GEOID field to see how it looks
df.GEOID.head()

0    6037101110
1    6037101122
2    6037101220
3    6037101221
4    6037101222
Name: GEOID, dtype: int64

In [7]:
# the leading zero is dropped when GEOID is read as an integer, so I re-import the data with GEOID specified as a string
df = pd.read_csv(
    'Data/EducAttainment.csv',
    dtype={'GEOID': str},
)
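To illustrate why the `dtype` argument matters, here is a minimal sketch on an in-memory CSV. The two sample rows and the `sample_csv` name are made up for the demonstration and assume the file stores the GEOID with its leading zero.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Data/EducAttainment.csv
sample_csv = "GEOID,NAME\n06037101110,Tract A\n06037101122,Tract B\n"

# Default parsing infers an integer column, which drops the leading zero
as_int = pd.read_csv(io.StringIO(sample_csv))

# Forcing GEOID to str keeps the identifier exactly as written in the file
as_str = pd.read_csv(io.StringIO(sample_csv), dtype={"GEOID": str})

print(as_int["GEOID"].tolist())
print(as_str["GEOID"].tolist())
```

Reading as `str` only preserves what is in the file; it cannot restore a zero that was already lost before the CSV was written.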

In [8]:
# check to see if GEOID displays properly now
df.head()

Unnamed: 0,GEOID,NAME,S1501_C01_001E,S1501_C01_002E,S1501_C01_003E,S1501_C01_004E,S1501_C01_005E,S1501_C01_006E,S1501_C01_007E,S1501_C01_008E,...,S1501_C02_045E,S1501_C02_046E,S1501_C02_047E,S1501_C02_048E,S1501_C02_049E,S1501_C02_050E,S1501_C02_051E,S1501_C02_052E,S1501_C02_053E,S1501_C02_054E
0,6037101110,Census Tract 1011.10; Los Angeles County; Cali...,300,40,169,60,31,3119,122,133,...,-,(X),76.2,12.8,(X),93.4,31.2,(X),81.1,14.6
1,6037101122,Census Tract 1011.22; Los Angeles County; Cali...,367,8,137,214,8,3132,467,37,...,-,(X),63.0,18.5,(X),75.7,24.7,(X),63.3,35.5
2,6037101220,Census Tract 1012.20; Los Angeles County; Cali...,325,40,112,149,24,2560,344,251,...,-,(X),55.2,0.0,(X),48.6,15.6,(X),51.1,13.9
3,6037101221,Census Tract 1012.21; Los Angeles County; Cali...,380,58,43,237,42,2682,348,346,...,-,(X),59.7,8.7,(X),50.0,20.3,(X),54.3,7.4
4,6037101222,Census Tract 1012.22; Los Angeles County; Cali...,206,0,98,82,26,2090,132,163,...,0,(X),100.0,0.0,(X),79.8,4.8,(X),79.5,10.3
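If the GEOID values still appear without a leading zero after reading them as strings, the zero is probably missing from the CSV itself. One possible fix, sketched here on a made-up Series, is to left-pad the strings to 11 characters, the standard length of a census tract GEOID (2-digit state, 3-digit county, 6-digit tract).

```python
import pandas as pd

# Hypothetical GEOIDs already read as strings but missing the leading zero
geoids = pd.Series(["6037101110", "6037101122"], name="GEOID")

# Left-pad with zeros to the 11-character tract GEOID length
padded = geoids.str.zfill(11)

print(padded.tolist())
```

In the notebook this would correspond to something like `df['GEOID'] = df['GEOID'].str.zfill(11)`.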


In [9]:
# now I want to re-check the data types for the whole file
# verbose shows all the columns, show_counts shows the number of non-null values
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2498 entries, 0 to 2497
Data columns (total 120 columns):
 #    Column          Non-Null Count  Dtype 
---   ------          --------------  ----- 
 0    GEOID           2498 non-null   object
 1    NAME            2498 non-null   object
 2    S1501_C01_001E  2498 non-null   int64 
 3    S1501_C01_002E  2498 non-null   int64 
 4    S1501_C01_003E  2498 non-null   int64 
 5    S1501_C01_004E  2498 non-null   int64 
 6    S1501_C01_005E  2498 non-null   int64 
 7    S1501_C01_006E  2498 non-null   int64 
 8    S1501_C01_007E  2498 non-null   int64 
 9    S1501_C01_008E  2498 non-null   int64 
 10   S1501_C01_009E  2498 non-null   int64 
 11   S1501_C01_010E  2498 non-null   int64 
 12   S1501_C01_011E  2498 non-null   int64 
 13   S1501_C01_012E  2498 non-null   int64 
 14   S1501_C01_013E  2498 non-null   int64 
 15   S1501_C01_014E  2498 non-null   int64 
 16   S1501_C01_015E  2498 non-null   int64 
 17   S1501_C01_016E  2498 non-null  

In [10]:
# no column appears to be entirely null, but I'll confirm by listing any columns where every value is missing
df.columns[df.isna().all()].tolist()

[]
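Note that the `df.head()` output above contains `-` and `(X)` entries. pandas reads these as ordinary text rather than as missing values, so `isna()` does not flag them. A minimal sketch of one way to handle this, assuming those two symbols are the only placeholders in the file, is to pass them to `na_values` on import. The sample rows below reuse values shown in the output above; `sample_csv` is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt with the placeholder symbols seen in df.head()
sample_csv = (
    "GEOID,S1501_C02_045E,S1501_C02_046E,S1501_C02_047E\n"
    "6037101110,-,(X),76.2\n"
    "6037101122,-,(X),63.0\n"
    "6037101222,0,(X),100.0\n"
)

# Treat the placeholder symbols as missing values while reading
cleaned = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"GEOID": str},
    na_values=["-", "(X)"],
)

# Columns where every value is now missing
all_missing = cleaned.columns[cleaned.isna().all()].tolist()
print(all_missing)
```

With the placeholders converted, a column that holds only `(X)` shows up in the all-null check, and the remaining columns can be parsed as numbers.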

## Data Dictionary
Now I want to properly separate and name the columns I will be focusing on, specifically these:
- 