# Data Download and Exploration

This cell makes the notebook re-import your source code in `src` whenever it is edited (by default, modules are not re-imported, because most are assumed not to change during a session).  It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2

The next code cell allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
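
As a possible alternative to counting `pardir`s by hand, the sketch below walks upward from a starting directory until it finds one that contains `src`. The helper names `find_project_root` and `add_to_sys_path` are hypothetical and not part of this project; the only assumption is that the project root is the nearest ancestor holding a `src` directory, as in the tree above.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(path: Path) -> None:
    """Append `path` to sys.path once, mirroring the membership check used above."""
    as_text = str(path)
    if as_text not in sys.path:
        sys.path.append(as_text)


# Hypothetical usage from a notebook in notebooks/exploratory:
# add_to_sys_path(find_project_root(Path.cwd()))
```

This keeps working if a notebook is later moved to a different depth, at the cost of a slightly longer preamble.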

The code to download all of the data and load it into a SQL database is in the `data` module within the `src` module.  You'll only need to run `download_data_and_load_into_sql` one time for the duration of the project.

In [3]:
from src.data import data_collection

This line may take 10-20 minutes, depending on your network connection and computer specs.

In [4]:
#data_collection.download_data_and_load_into_sql()

In [5]:
pip install tabula.py

Collecting tabula.py
  Using cached tabula_py-2.0.4-py3-none-any.whl (10.4 MB)
Installing collected packages: tabula.py
Successfully installed tabula.py
Note: you may need to restart the kernel to use updated packages.


Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd
import numpy as np
from tabula import read_pdf
pd.set_option('max_colwidth', 80)

In [7]:
DBNAME = "opportunity_youth"

In [8]:
conn = psycopg2.connect(dbname=DBNAME)

In [9]:
df =pd.read_sql("""
SELECT  
        A.serialno, 
        A.puma, 
        A.pwgtp totalnumber,
        CASE WHEN A.sch ='1' AND (A.esr='3' OR A.esr='6') THEN 'Opportunity Youth'
             WHEN (A.esr ='1' or A.esr='2' OR A.esr='4' OR A.esr='5') AND A.schl < '16' THEN 'Working without Diploma'
             ELSE 'Not Opportunity Youth' 
             END AS youthtype,
        CASE WHEN A.agep BETWEEN 16 AND 18 THEN '16-18'
             WHEN A.agep BETWEEN 19 AND 21 THEN '19-21'
             WHEN A.agep BETWEEN 22 AND 24 THEN '22-24'
             END AS age,
        CASE WHEN A.sex = '1' THEN 'Male'
             WHEN A.sex = '2' THEN 'Female'
             ELSE 'Other'
             END AS sex,
        CASE WHEN A.schl BETWEEN '01' AND '15' THEN 'No Diploma'
             WHEN A.schl BETWEEN '16' AND '17' THEN 'HS Diploma or GED'
             WHEN A.schl BETWEEN '18' AND '19' THEN 'Some College, No Degree'
             ELSE 'Degree (Associate or Higher)'
             END as educationattainment
    
FROM  pums_2017 AS A
      
WHERE 
          puma BETWEEN '11610' AND '11615'
          AND agep BETWEEN 16 and 24
          AND rt = 'P'  

;
""", conn)
#  CASE WHEN a.rac1p = '1' AND HISP ='01' THEN 'White'
#              WHEN a.rac1p = '2' AND HISP ='01' THEN 'Black or African American alone'
#              WHEN a.rac1p = '3' AND HISP ='01' THEN 'American Indian alone'
#              WHEN a.rac1p = '4' AND HISP ='01' THEN 'Alaska Native alone'
#              WHEN a.rac1p = '5' AND HISP ='01' THEN ''
#              WHEN a.rac1p = '6' AND HISP ='01' THEN 'Asian alone'
#              WHEN a.rac1p = '7' AND HISP ='01' THEN 'Native Hawaiian or Other Pacific Islander alone'
#              WHEN a.rac1p = '8' AND HISP ='01' THEN 'Some other Race alone'
#              WHEN a.rac1p = '9' THEN 'Two or More Races'
#               ELSE 'Other Races'
#               END AS race,
#LEFT JOIN (SELECT DISTINCT * FROM ct_puma_xwalk WHERE statefp='53')AS K ON K.puma5ce =A.puma 
#LEFT JOIN (SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk WHERE cty='53033') AS C ON K.tractce = RIGHT(trct, 6)
#LEFT JOIN (SELECT DISTINCT puma, puma_name FROM puma_names_2010 WHERE state_fips='53') AS D on A.puma=D.puma



Unnamed: 0,serialno,puma,st,agep,sch,fesrp,cty,ctyname,puma_name,blklondd,blklatdd
0,2017000004988,11615,53,21.0,1,0,53033,"King County, WA ...","King County (Southeast)--Maple Valley, Covington & Enumclaw Cities ...",-122.092623,47.367345
1,2017000004988,11615,53,21.0,1,0,53033,"King County, WA ...","King County (Southeast)--Maple Valley, Covington & Enumclaw Cities ...",-122.102270,47.360446
2,2017000004988,11615,53,21.0,1,0,53033,"King County, WA ...","King County (Southeast)--Maple Valley, Covington & Enumclaw Cities ...",-122.096710,47.370312
3,2017000004988,11615,53,21.0,1,0,53033,"King County, WA ...","King County (Southeast)--Maple Valley, Covington & Enumclaw Cities ...",-122.094064,47.364661
4,2017000004988,11615,53,21.0,1,0,53033,"King County, WA ...","King County (Southeast)--Maple Valley, Covington & Enumclaw Cities ...",-122.095817,47.371737
...,...,...,...,...,...,...,...,...,...,...,...
646667,2017001530818,11613,53,23.0,1,0,53033,"King County, WA ...",King County (Southwest Central)--Kent City ...,-122.152117,47.359073
646668,2017001530818,11613,53,23.0,1,0,53033,"King County, WA ...",King County (Southwest Central)--Kent City ...,-122.130259,47.360500
646669,2017001530818,11613,53,23.0,1,0,53033,"King County, WA ...",King County (Southwest Central)--Kent City ...,-122.147096,47.374415
646670,2017001530818,11613,53,23.0,1,0,53033,"King County, WA ...",King County (Southwest Central)--Kent City ...,-122.154309,47.364714
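
The `youthtype` CASE expression in the query above can be mirrored in plain Python, which may help when checking the rule against individual records. This is only an illustrative sketch: `classify_youth` is a hypothetical helper, it assumes `sch`, `esr`, and `schl` arrive as the same string codes the query compares against (with `schl` zero-padded, e.g. `'09'`), and it does not reproduce SQL's NULL handling.

```python
def classify_youth(sch: str, esr: str, schl: str) -> str:
    """Mirror the youthtype CASE expression; branches are checked in the same order."""
    # First WHEN: sch = '1' and esr is '3' or '6'
    if sch == "1" and esr in {"3", "6"}:
        return "Opportunity Youth"
    # Second WHEN: esr is '1', '2', '4', or '5' and schl sorts below '16'
    # (string comparison, as in the SQL, so schl must be zero-padded)
    if esr in {"1", "2", "4", "5"} and schl < "16":
        return "Working without Diploma"
    # ELSE branch
    return "Not Opportunity Youth"


# Hypothetical records, invented purely to exercise each branch.
for record in [("1", "3", "16"), ("2", "1", "12"), ("2", "1", "18")]:
    print(record, "->", classify_youth(*record))
```

Because the branches are evaluated in order, a record that satisfies the first condition is never labelled 'Working without Diploma', which matches how SQL evaluates CASE.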


In [10]:
df.head()

Unnamed: 0,serialno,puma,totalnumber,youthtype,age,sex,educationattainment
0,2013000056099,11613,16.0,Not Opportunity Youth,22-24,Male,HS Diploma or GED
1,2013000057563,11611,20.0,Opportunity Youth,19-21,Male,HS Diploma or GED
2,2013000058010,11614,45.0,Opportunity Youth,16-18,Female,No Diploma
3,2013000059060,11610,19.0,Opportunity Youth,22-24,Male,HS Diploma or GED
4,2013000065045,11611,27.0,Not Opportunity Youth,22-24,Female,"Some College, No Degree"


In [11]:
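# Sum the person weights by youth type and age group, with row and column totals
nd =pd.pivot_table(df,
               index=['youthtype'],
               values=['totalnumber'],
               columns=['age'],
               aggfunc=np.sum,
               fill_value=0,
               margins=True,
               margins_name='Total'
              )['totalnumber']
nd.columns.rename('', inplace=True)
# Insert each age group's percentage of its 'Total' row to the left of the counts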
newc =(nd['16-18']/nd['16-18']['Total'])*100 
nd.insert(0, '(%)16-18', newc)
newd =(nd['19-21']/nd['19-21']['Total'])*100 
nd.insert(2, '(%)19-21', newd)
newe =(nd['22-24']/nd['22-24']['Total'])*100 
nd.insert(4, '(%)22-24', newe)
newf =(nd['Total']/nd['Total']['Total'])*100 
nd.insert(6, '(%)Total', newf)
new_index1=['Total', 'Opportunity Youth', 'Working without Diploma', 'Not Opportunity Youth']
total_youth_2017=nd.reindex(new_index1)
total_youth_2017

Unnamed: 0_level_0,(%)16-18,16-18,(%)19-21,19-21,(%)22-24,22-24,(%)Total,Total
youthtype,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Total,100.0,30141,100.0,25486,100.0,30256,100.0,85883
Opportunity Youth,6.0217,1815,15.3104,3902,16.1852,4897,12.3587,10614
Working without Diploma,14.5217,4377,6.39567,1630,5.63525,1705,8.97966,7712
Not Opportunity Youth,79.4566,23949,78.294,19954,78.1795,23654,78.6617,67557
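
The four near-identical `insert` steps in the cell above could be folded into one helper. The sketch below is one way to do that, not the notebook's own method: `add_percent_columns` is a hypothetical name, it assumes the table has a row labelled `Total`, and the toy counts are invented for illustration rather than taken from the PUMS data.

```python
import pandas as pd


def add_percent_columns(counts: pd.DataFrame, total_label: str = "Total") -> pd.DataFrame:
    """Return a new frame with a '(%)<col>' column placed before each count column."""
    out = pd.DataFrame(index=counts.index)
    for col in counts.columns:
        # Share of the column total, expressed as a percentage
        out[f"(%){col}"] = counts[col] / counts.loc[total_label, col] * 100
        out[col] = counts[col]
    return out


# Toy counts, invented for illustration only.
toy = pd.DataFrame(
    {"16-18": [10, 30, 40], "19-21": [20, 20, 40]},
    index=["Group A", "Group B", "Total"],
)
toy_with_pct = add_percent_columns(toy)
print(toy_with_pct)
```

Building a new frame avoids inserting columns into the table while looping over it, so the column positions do not have to be tracked by hand.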


In [12]:
total_youth_2017.to_csv('total_youth_2017.csv')

These tables hold a large amount of data, and **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but filtering and aggregating in the database is usually much faster than pulling every row into memory first.  Nevertheless, Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.  While exploring, adding a `LIMIT` clause to a query keeps the result small.
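
To illustrate what doing the work in SQL can look like, the sketch below aggregates with `GROUP BY` so that only the summary rows reach Python. It is a stand-in under stated assumptions: it uses an in-memory SQLite database from the standard library instead of the project's PostgreSQL database, and the table name `youth` and its rows are invented toy data.

```python
import sqlite3

# Toy rows (youthtype, age group, person weight), invented for illustration only.
toy_rows = [
    ("Opportunity Youth", "16-18", 10.0),
    ("Opportunity Youth", "16-18", 5.0),
    ("Opportunity Youth", "19-21", 20.0),
    ("Not Opportunity Youth", "16-18", 30.0),
]

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE youth (youthtype TEXT, age TEXT, totalnumber REAL)")
demo.executemany("INSERT INTO youth VALUES (?, ?, ?)", toy_rows)

# The database sums the weights; Python only receives one row per group.
summary = demo.execute("""
    SELECT youthtype, age, SUM(totalnumber) AS weighted_total
    FROM youth
    GROUP BY youthtype, age
    ORDER BY youthtype, age
""").fetchall()
demo.close()

for row in summary:
    print(row)
```

The same `SELECT ... GROUP BY` statement could be passed to `pd.read_sql` with the project connection, in which case the resulting DataFrame would already be aggregated.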

In [13]:
pd.read_csv('total_youth_2017.csv')

Unnamed: 0,youthtype,(%)16-18,16-18,(%)19-21,19-21,(%)22-24,22-24,(%)Total,Total
0,Total,100.0,30141.0,100.0,25486.0,100.0,30256.0,100.0,85883.0
1,Opportunity Youth,6.021698,1815.0,15.310366,3902.0,16.185219,4897.0,12.358674,10614.0
2,Working without Diploma,14.521748,4377.0,6.395668,1630.0,5.635246,1705.0,8.979658,7712.0
3,Not Opportunity Youth,79.456554,23949.0,78.293965,19954.0,78.179535,23654.0,78.661668,67557.0


In [14]:
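# Restrict to Opportunity Youth, then sum the weights by education level and age group
df_oy =df[df['youthtype']=='Opportunity Youth']
md =pd.pivot_table(df_oy,
               index=['educationattainment'],
               values=['totalnumber'],
               columns=['age'],
               aggfunc=np.sum,
               fill_value=0,
               margins=True,
               margins_name='Total'
              )['totalnumber']
md.columns.rename('', inplace=True)

# Insert each column's percentage of its 'Total' row just before the counts;
# iterate over a fixed list because md.columns changes as columns are inserted
for i, c in enumerate(list(md.columns)):
    mewc=(md[c]/md[c]['Total'])*100
    md.insert(i*2, '%'+str(c), mewc)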

new_index2=pd.Index(['Total', 'No Diploma', 'HS Diploma or GED', 'Some College, No Degree', 'Degree (Associate or Higher)'], 
                  name='educationattainment')
opportunity_youth_2017=md.reindex(new_index2)
opportunity_youth_2017



# #md['educationattainment']= pd.Categorical(["Total", "No Diploma", "HS Diploma or GED", "Some College, No Degree", "Degree(Associate or Higher)"], ordered =True)
# md

Unnamed: 0_level_0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
educationattainment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Total,100.0,1815,100.0,3902,100.0,4897,100.0,10614.0
No Diploma,50.46832,916,28.498206,1112,27.547478,1349,31.816469,3377.0
HS Diploma or GED,43.030303,781,55.766274,2176,43.598121,2135,47.974373,5092.0
"Some College, No Degree",6.501377,118,13.352127,521,20.420666,1000,15.441869,1639.0
Degree (Associate or Higher),0.0,0,2.383393,93,8.433735,413,4.767288,506.0


In [15]:
opportunity_youth_2017.to_csv('opportunity_youth_2017')

In [16]:
pd.read_csv('opportunity_youth_2017')

Unnamed: 0,educationattainment,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
0,Total,100.0,1815,100.0,3902,100.0,4897,100.0,10614.0
1,No Diploma,50.46832,916,28.498206,1112,27.547478,1349,31.816469,3377.0
2,HS Diploma or GED,43.030303,781,55.766274,2176,43.598121,2135,47.974373,5092.0
3,"Some College, No Degree",6.501377,118,13.352127,521,20.420666,1000,15.441869,1639.0
4,Degree (Associate or Higher),0.0,0,2.383393,93,8.433735,413,4.767288,506.0


In [33]:
df = read_pdf('https://roadmapproject.org/wp-content/uploads/2018/09/Opportunity-Youth-2016-Data-Brief-v2.pdf', 
              pages='9', output_format='DataFrame')
df=df[0]
df.columns =['16-18', '19-21', '22-24', 'Total']
# Copy the slices so later column assignments do not write into a view of df
total_2016= df[0:3].copy()
oy_2016= df[6:].copy()

#df=pd.DataFrame(df)
#tabula.convert_into_by_batch('https://roadmapproject.org/wp-content/uploads/2018/09/Opportunity-Youth-2016-Data-Brief-v2.pdf', output_format='csv', PAGES='9')


In [29]:
total_2016=total_2016.rename(index={0: 'Opportunity Youth', 1: 'Working Without Diploma', 2:'Not Youth Opportunity'})

for i in total_2016:    
    total_2016[i]=total_2016[i].str.replace(',','').astype('float')
total_2016.loc['Total']=total_2016.sum(axis=0)
#total_2016_sum.values.reshape(-1,4)

total_2016=total_2016.reindex(['Total', 'Opportunity Youth', 'Working Without Diploma', 'Not Youth Opportunity'])

for i, c in enumerate(total_2016):
    num=(total_2016[c]/total_2016[c]['Total'])*100
    total_2016.insert(i*2, '%'+str(c), num)
total_2016

Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,50053.0,100.0,41651.0,100.0,48031.0,100.0,139735.0
Opportunity Youth,5.60406,2805.0,17.488176,7284.0,18.171598,8728.0,13.466204,18817.0
Working Without Diploma,1.172757,587.0,4.91945,2049.0,5.989882,2877.0,3.945325,5513.0
Not Youth Opportunity,93.223183,46661.0,77.592375,32318.0,75.838521,36426.0,82.588471,115405.0
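
The comma-stripping loop above converts counts that the PDF reports as text such as `1,200`. A minimal sketch of the same idea as a reusable function follows; `parse_counts` is a hypothetical name, the input values are invented, and it assumes every column holds strings with commas as thousands separators.

```python
import pandas as pd


def parse_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with comma-formatted strings converted to numbers."""
    cleaned = frame.copy()
    for col in cleaned.columns:
        # regex=False treats the comma as a literal character
        without_commas = cleaned[col].str.replace(",", "", regex=False)
        cleaned[col] = pd.to_numeric(without_commas)
    return cleaned


# Invented example values, not taken from the data brief.
raw = pd.DataFrame({"16-18": ["1,200", "300"], "19-21": ["2,500", "1,000"]})
parsed = parse_counts(raw)
print(parsed.sum(axis=0))
```

Working on a copy leaves the frame returned by `read_pdf` untouched, so the cell can be re-run without re-reading the PDF.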


In [30]:
total_2016.to_csv('total_youth_2016')

In [31]:
pd.read_csv('total_youth_2016')

Unnamed: 0.1,Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
0,Total,100.0,50053.0,100.0,41651.0,100.0,48031.0,100.0,139735.0
1,Opportunity Youth,5.60406,2805.0,17.488176,7284.0,18.171598,8728.0,13.466204,18817.0
2,Working Without Diploma,1.172757,587.0,4.91945,2049.0,5.989882,2877.0,3.945325,5513.0
3,Not Youth Opportunity,93.223183,46661.0,77.592375,32318.0,75.838521,36426.0,82.588471,115405.0


In [34]:
oy_2016=oy_2016.rename(index={6:'No Diploma', 7: 'HS Diploma or GED', 8: 'Some College, No Degree', 9:'Degree (Associate or Higher)'})

for i in oy_2016:    
    oy_2016[i]=oy_2016[i].str.replace(',','').astype('float')
oy_2016.loc['Total']=oy_2016.sum(axis=0)
oy_2016=oy_2016.reindex(['Total', 'No Diploma', 'HS Diploma or GED', 'Some College, No Degree', 'Degree (Associate or Higher)'])

for i, c in enumerate(oy_2016):
    num=(oy_2016[c]/oy_2016[c]['Total'])*100
    oy_2016.insert(i*2, '%'+str(c), num)
oy_2016

In [22]:
#df[(df['agep']>= 16) & (df['agep']<=24)]['cow']
#cow =9 (unemployed)
#division = 9 (pacific)
#region =4 (west)
#st =53 (Washington)
#sch = 1 (school enrollment: has not attended in the last 3 months) page 42
#schl: school level page 42
#esr page 59 employment status recode
#fesrp= 0 page 121 employment status 0=no
#rac1p race recoded page 103
# weights page 126
#hisp =01 is non-hispanic

#df_set=df[['serialno', 'puma','division', 'region', 'st', 'agep', 'esr', 'nwab']]

In [23]:
pd.read_sql("""
SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk AS A 


WHERE st='53' AND ctyname like 'King%'

;""", conn)

#(blklondd > -122.530396 AND blklondd < -120.084384) AND (blklatdd > 47.072060 AND blklatdd < 47.788447) 

Unnamed: 0,trct,cty,ctyname,blklondd,blklatdd
0,53033000100,53033,"King County, WA ...",-122.295678,47.727200
1,53033000100,53033,"King County, WA ...",-122.295498,47.725603
2,53033000100,53033,"King County, WA ...",-122.295482,47.720286
3,53033000100,53033,"King County, WA ...",-122.295476,47.723796
4,53033000100,53033,"King County, WA ...",-122.295191,47.731046
...,...,...,...,...,...
35826,53033990100,53033,"King County, WA ...",-122.346140,47.363858
35827,53033990100,53033,"King County, WA ...",-122.343965,47.373841
35828,53033990100,53033,"King County, WA ...",-122.342136,47.376844
35829,53033990100,53033,"King County, WA ...",-122.341212,47.399598


In [12]:
pd.read_sql("""SELECT DISTINCT puma, puma_name
FROM puma_names_2010  AS A
WHERE state_name ='Washington' 
AND puma between '11601' AND '11615'

ORDER BY puma, puma_name 
;""", conn).head()


In [11]:
pd.read_sql("""
SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk AS A 


WHERE st='53' AND ctyname like 'King%'

;""", conn).head()

#(blklondd > -122.530396 AND blklondd < -120.084384) AND (blklatdd > 47.072060 AND blklatdd < 47.788447) 

In [None]:
pd.read_sql("SELECT * FROM ct_puma_xwalk WHERE statefp = '53';", conn).head()

In [None]:
# CREATE TABLE returns no rows, so run it through a cursor rather than pd.read_sql
cur = conn.cursor()
cur.execute("""
    CREATE TABLE oy_status_by_age (
    PopulationType varchar(255) PRIMARY KEY,
    sixteen int,
    nineteen int,
    twentytwo int
    )
""")
conn.commit()


In [None]:
cur.execute("""
    INSERT INTO oy_status_by_age (PopulationType, sixteen, nineteen, twentytwo)
    VALUES
    ('Opportunity Youth', 2805, 7284, 8728),
    ('Working without diploma', 587, 2049, 2877),
    ('Not an Opportunity Youth', 46661, 32318, 36426),
    ('No diploma', 1610, 2048, 1981),
    ('HS diploma or GED', 985, 3349, 3067),
    ('Some college, no degree', 179, 1666, 1763),
    ('Degree (Associate or higher)', 31, 221, 1917)
""")
conn.commit()
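
Rows like these can also be inserted with placeholders instead of being written into the SQL text. The sketch below shows the pattern with `executemany`, using an in-memory SQLite database as a stand-in for the project database; that substitution is an assumption made so the example runs anywhere. SQLite uses `?` placeholders, whereas psycopg2 uses `%s`. The three rows are the status-by-age counts from the cell above.

```python
import sqlite3

# Status-by-age counts copied from the INSERT statement above.
rows = [
    ("Opportunity Youth", 2805, 7284, 8728),
    ("Working without diploma", 587, 2049, 2877),
    ("Not an Opportunity Youth", 46661, 32318, 36426),
]

demo_conn = sqlite3.connect(":memory:")
with demo_conn:  # commits if the block succeeds, rolls back if it raises
    demo_conn.execute(
        "CREATE TABLE oy_status_by_age ("
        "PopulationType TEXT PRIMARY KEY, sixteen INTEGER, nineteen INTEGER, twentytwo INTEGER)"
    )
    # One parameterized statement, executed once per row
    demo_conn.executemany("INSERT INTO oy_status_by_age VALUES (?, ?, ?, ?)", rows)

total_sixteen = demo_conn.execute("SELECT SUM(sixteen) FROM oy_status_by_age").fetchone()[0]
demo_conn.close()
print(total_sixteen)
```

Passing values as parameters lets the driver handle quoting, which avoids the quote-character problems that hand-written string literals are prone to.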

In [None]:
import geopandas as gpd
shapefile = 'ne_10m_populated_places.shp'
#Read shapefile using Geopandas
gdf = gpd.read_file(shapefile)[['ADMIN', 
#Rename columns.
gdf.columns = ['country', 'country_code', 'geometry']
gdf.head()

Make sure you close the DB connection when you are done using it.

In [14]:
conn.close()
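
If remembering to call `close()` is a concern, `contextlib.closing` can do it automatically when a block exits. The sketch below demonstrates the pattern with an in-memory SQLite connection as a stand-in for the project database (an assumption, so that it runs without PostgreSQL); `closing` works with any object that has a `close()` method. Note that psycopg2's own `with conn:` block manages a transaction rather than closing the connection.

```python
import sqlite3
from contextlib import closing

# The connection is closed when the with-block exits, even if a query raises.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    answer = demo_conn.execute("SELECT 1 + 1").fetchone()[0]

# Using the connection after the block should fail because it is closed.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)
```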