# Data Download and Exploration

The cell below makes the notebook re-import your source code in `src` whenever it is edited (by default, Python does not re-import a module that is already loaded, because most modules are assumed not to change during a session).  It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2
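
Autoreload automates a step that can also be done by hand with `importlib.reload`. The following is only a rough sketch of that manual step, not of how the extension is implemented; the module name `demo_settings` and its contents are made up for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical module standing in for a file under src/
    module_file = Path(tmp) / "demo_settings.py"
    module_file.write_text("VALUE = 1\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()  # make sure the new file is discovered
    import demo_settings

    first = demo_settings.VALUE

    # Edit the source on disk, as you would in an editor
    module_file.write_text("VALUE = 42\n")
    stale = demo_settings.VALUE      # the loaded module has not changed yet

    importlib.reload(demo_settings)  # what autoreload does for you
    fresh = demo_settings.VALUE

    # Clean up so the temporary module does not linger
    sys.path.remove(tmp)
    sys.modules.pop("demo_settings", None)

print(first, stale, fresh)
```

With `%autoreload 2` active, the explicit `importlib.reload` call is unnecessary because modules are reloaded before each cell runs.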

The next snippet lets the notebook import from the `src` package.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
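
To see what "two pardirs up" resolves to, here is a small sketch using a made-up notebook location (the path `/home/user/project/...` is hypothetical, not taken from this project). It uses `posixpath` and `PurePosixPath` so the result is the same on any operating system.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical directory containing this notebook
notebook_dir = "/home/user/project/notebooks/exploratory"

# Same idea as os.path.join(os.pardir, os.pardir), resolved from notebook_dir
via_os_path = posixpath.normpath(
    posixpath.join(notebook_dir, posixpath.pardir, posixpath.pardir)
)

# Equivalent with pathlib: parents[0] is notebooks/, parents[1] is the project root
via_pathlib = PurePosixPath(notebook_dir).parents[1]

print(via_os_path)
print(via_pathlib)
```

Both expressions point at the project root, which is the directory that contains `src`, so `from src.data import ...` can be resolved once that directory is on `sys.path`.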

The code to download all of the data and load it into a SQL database is in the `data` module within the `src` module.  You'll only need to run `download_data_and_load_into_sql` one time for the duration of the project.

In [3]:
from src.data import data_collection

This call may take as long as 10-20 minutes, depending on your network connection and computer specs.

In [4]:
#data_collection.download_data_and_load_into_sql()
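
Because the download only needs to happen once, one option is to guard it instead of commenting the call in and out. The helper below is a hypothetical sketch (`run_once` and the marker file name are not part of `src`); it records that the step has run by creating an empty marker file.

```python
import tempfile
from pathlib import Path


def run_once(marker: Path, action) -> bool:
    """Run action() only if the marker file is absent; return True if it ran."""
    if marker.exists():
        return False
    action()
    marker.touch()  # remember that the step has completed
    return True


# Demonstration with a stand-in action instead of the real download
calls = []
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / ".data_loaded"
    first = run_once(marker, lambda: calls.append("download"))
    second = run_once(marker, lambda: calls.append("download"))

print(first, second, calls)
```

In the notebook, the action passed in would be `data_collection.download_data_and_load_into_sql`, with the marker stored somewhere that persists between sessions.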

In [5]:
pip install tabula.py

Collecting tabula.py
  Using cached tabula_py-2.0.4-py3-none-any.whl (10.4 MB)
Installing collected packages: tabula.py
Successfully installed tabula.py
Note: you may need to restart the kernel to use updated packages.


Now it's time to explore the data!

In [31]:
import psycopg2
import pandas as pd
import numpy as np
from tabula import read_pdf
pd.set_option('max_colwidth', 80)

In [32]:
DBNAME = "opportunity_youth"

In [33]:
conn = psycopg2.connect(dbname=DBNAME)

In [34]:
df = pd.read_sql("""
SELECT
        A.serialno,
        A.puma,
        A.pwgtp AS totalnumber,
        CASE WHEN A.sch = '1' AND A.esr IN ('3', '6') THEN 'Opportunity Youth'
             WHEN A.esr IN ('1', '2', '4', '5') AND A.schl < '16' THEN 'Working without Diploma'
             ELSE 'Not Opportunity Youth'
             END AS youthtype,
        CASE WHEN A.agep BETWEEN 16 AND 18 THEN '16-18'
             WHEN A.agep BETWEEN 19 AND 21 THEN '19-21'
             WHEN A.agep BETWEEN 22 AND 24 THEN '22-24'
             END AS age,
        CASE WHEN A.sex = '1' THEN 'Male'
             WHEN A.sex = '2' THEN 'Female'
             ELSE 'Other'
             END AS sex,
        CASE WHEN A.schl BETWEEN '01' AND '15' THEN 'No Diploma'
             WHEN A.schl BETWEEN '16' AND '17' THEN 'HS Diploma or GED'
             WHEN A.schl BETWEEN '18' AND '19' THEN 'Some College, No Degree'
             ELSE 'Degree (Associate or Higher)'
             END AS educationattainment,
        -- hisp = '01' means non-Hispanic. IN (...) avoids the AND/OR precedence
        -- problem of "x OR y OR z AND hisp = '01'", where AND binds tighter than OR.
        CASE WHEN A.rac1p = '1' AND A.hisp = '01' THEN 'White'
             WHEN A.rac1p = '2' AND A.hisp = '01' THEN 'Black or African American alone'
             WHEN A.rac1p IN ('3', '4', '5') AND A.hisp = '01' THEN 'American Indian/Alaska Native'
             WHEN A.rac1p = '6' AND A.hisp = '01' THEN 'Asian alone'
             WHEN A.rac1p = '7' AND A.hisp = '01' THEN 'Native Hawaiian or Other Pacific Islander alone'
             WHEN A.rac1p = '8' AND A.hisp = '01' THEN 'Some other Race alone'
             WHEN A.rac1p = '9' AND A.hisp = '01' THEN 'Two or More Races'
             ELSE 'Hispanic'
             END AS race

FROM  pums_2017 AS A

WHERE
          A.puma BETWEEN '11610' AND '11615'
          AND A.agep BETWEEN 16 AND 24
          AND A.rt = 'P'

;
""", conn)

#LEFT JOIN (SELECT DISTINCT * FROM ct_puma_xwalk WHERE statefp='53')AS K ON K.puma5ce =A.puma 
#LEFT JOIN (SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk WHERE cty='53033') AS C ON K.tractce = RIGHT(trct, 6)
#LEFT JOIN (SELECT DISTINCT puma, puma_name FROM puma_names_2010 WHERE state_fips='53') AS D on A.puma=D.puma



In [35]:
df.head()

Unnamed: 0,serialno,puma,totalnumber,youthtype,age,sex,educationattainment,race
0,2013000056099,11613,16.0,Not Opportunity Youth,22-24,Male,HS Diploma or GED,White
1,2013000057563,11611,20.0,Opportunity Youth,19-21,Male,HS Diploma or GED,White
2,2013000058010,11614,45.0,Opportunity Youth,16-18,Female,No Diploma,American Indian/Alaska Native
3,2013000059060,11610,19.0,Opportunity Youth,22-24,Male,HS Diploma or GED,White
4,2013000065045,11611,27.0,Not Opportunity Youth,22-24,Female,"Some College, No Degree",Black or African American alone
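
The `youthtype` CASE expression in the query above can be read as a small decision rule. The sketch below restates it in plain Python so the rule can be tried on individual values; the function name `classify_youth` is hypothetical, and the code meanings in the comments reflect my reading of the PUMS variables and should be checked against the data dictionary.

```python
def classify_youth(sch: str, esr: str, schl: str) -> str:
    """Mirror the youthtype CASE expression for one person record."""
    # sch '1': not enrolled in school; esr '3'/'6': unemployed / not in labor force
    if sch == "1" and esr in {"3", "6"}:
        return "Opportunity Youth"
    # esr '1','2','4','5': employed (civilian or armed forces);
    # schl below '16' (compared as zero-padded strings, as in the SQL): no diploma
    if esr in {"1", "2", "4", "5"} and schl < "16":
        return "Working without Diploma"
    return "Not Opportunity Youth"


examples = [
    classify_youth(sch="1", esr="6", schl="16"),  # not in school, not working
    classify_youth(sch="1", esr="1", schl="12"),  # working, no diploma
    classify_youth(sch="2", esr="6", schl="19"),  # enrolled in school
]
print(examples)
```

As in SQL, the first matching branch wins, so a person who is out of school and not working is counted as Opportunity Youth regardless of education level.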


In [36]:
df.to_csv('main_table_2017.csv')

In [37]:
nd = pd.pivot_table(df,
               index=['youthtype'],
               values=['totalnumber'],
               columns=['age'],
               aggfunc='sum',
               fill_value=0,  # numeric fill keeps every column numeric
               margins=True,
               margins_name='Total'
              )['totalnumber']
nd.columns.rename('', inplace=True)

# Insert a percentage column in front of each count column.
# Iterate over a copy of the labels because insert() changes nd.columns.
for i, c in enumerate(list(nd.columns)):
    newc = (nd[c] / nd.loc['Total', c]) * 100
    nd.insert(i * 2, '%' + str(c), newc)

new_index1 = ['Total', 'Opportunity Youth', 'Working without Diploma', 'Not Opportunity Youth']
total_youth_2017 = nd.reindex(new_index1)
total_youth_2017

youthtype,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,30141,100.0,25486,100.0,30256,100.0,85883
Opportunity Youth,6.0217,1815,15.3104,3902,16.1852,4897,12.3587,10614
Working without Diploma,14.5217,4377,6.39567,1630,5.63525,1705,8.97966,7712
Not Opportunity Youth,79.4566,23949,78.294,19954,78.1795,23654,78.6617,67557
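
Each percentage column in the table above is the count divided by the `Total` row of the same column. As a quick arithmetic check using the `Total` column counts shown above (plain Python, no pandas needed):

```python
# Weighted counts copied from the 'Total' column of the table above
counts = {
    "Opportunity Youth": 10614,
    "Working without Diploma": 7712,
    "Not Opportunity Youth": 67557,
}

total = sum(counts.values())

# Share of each group as a percentage of all youth aged 16-24
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

The three counts add up to the `Total` row, and the Opportunity Youth share matches the `%Total` value in the table.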


In [38]:
total_youth_2017.to_csv('total_youth_2017.csv')
pd.read_csv('total_youth_2017.csv')

Unnamed: 0,youthtype,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
0,Total,100.0,30141.0,100.0,25486.0,100.0,30256.0,100.0,85883.0
1,Opportunity Youth,6.021698,1815.0,15.310366,3902.0,16.185219,4897.0,12.358674,10614.0
2,Working without Diploma,14.521748,4377.0,6.395668,1630.0,5.635246,1705.0,8.979658,7712.0
3,Not Opportunity Youth,79.456554,23949.0,78.293965,19954.0,78.179535,23654.0,78.661668,67557.0

These tables hold a large amount of data, and **your goal is to use SQL to create your main query, not pandas**.  pandas can technically do everything that you need to do, but it will be much slower and less efficient.  Nevertheless, pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
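
To illustrate the advice about doing the heavy lifting in SQL, the sketch below filters and aggregates inside the database so that only the summary rows reach Python. It uses an in-memory SQLite database as a stand-in for the project's PostgreSQL database, and the table `pums_demo` and its rows are invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pums_demo (agep INTEGER, esr TEXT, sch TEXT, pwgtp REAL)"
)
conn.executemany(
    "INSERT INTO pums_demo VALUES (?, ?, ?, ?)",
    [
        (17, "6", "1", 10.0),  # not in school, not in labor force
        (20, "3", "1", 20.0),  # not in school, unemployed
        (23, "1", "2", 35.0),  # employed and enrolled
        (45, "6", "1", 99.0),  # outside the 16-24 age range
    ],
)

# Filtering (WHERE) and aggregation (GROUP BY) both happen in the database
query = """
SELECT CASE WHEN sch = '1' AND esr IN ('3', '6') THEN 'Opportunity Youth'
            ELSE 'Other' END AS youthtype,
       SUM(pwgtp) AS totalnumber
FROM pums_demo
WHERE agep BETWEEN 16 AND 24
GROUP BY youthtype
ORDER BY youthtype
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

Only two summary rows come back, no matter how many person records the table holds.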


In [40]:
df_oy = df[df['youthtype'] == 'Opportunity Youth']
md = pd.pivot_table(df_oy,
               index=['educationattainment'],
               values=['totalnumber'],
               columns=['age'],
               aggfunc='sum',
               fill_value=0,
               margins=True,
               margins_name='Total'
              )['totalnumber']
md.columns.rename('', inplace=True)

# Insert a percentage column in front of each count column.
# Iterate over a copy of the labels because insert() changes md.columns.
for i, c in enumerate(list(md.columns)):
    newc = (md[c] / md.loc['Total', c]) * 100
    md.insert(i * 2, '%' + str(c), newc)

new_index2 = pd.Index(['Total', 'No Diploma', 'HS Diploma or GED', 'Some College, No Degree', 'Degree (Associate or Higher)'],
                  name='educationattainment')
opportunity_youth_2017 = md.reindex(new_index2)
opportunity_youth_2017





educationattainment,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,1815,100.0,3902,100.0,4897,100.0,10614.0
No Diploma,50.46832,916,28.498206,1112,27.547478,1349,31.816469,3377.0
HS Diploma or GED,43.030303,781,55.766274,2176,43.598121,2135,47.974373,5092.0
"Some College, No Degree",6.501377,118,13.352127,521,20.420666,1000,15.441869,1639.0
Degree (Associate or Higher),0.0,0,2.383393,93,8.433735,413,4.767288,506.0


In [16]:
opportunity_youth_2017.to_csv('opportunity_youth_2017.csv')

In [17]:
pd.read_csv('opportunity_youth_2017.csv')

Unnamed: 0,educationattainment,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
0,Total,100.0,1815,100.0,3902,100.0,4897,100.0,10614.0
1,No Diploma,50.46832,916,28.498206,1112,27.547478,1349,31.816469,3377.0
2,HS Diploma or GED,43.030303,781,55.766274,2176,43.598121,2135,47.974373,5092.0
3,"Some College, No Degree",6.501377,118,13.352127,521,20.420666,1000,15.441869,1639.0
4,Degree (Associate or Higher),0.0,0,2.383393,93,8.433735,413,4.767288,506.0


In [53]:
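df = read_pdf('https://roadmapproject.org/wp-content/uploads/2018/09/Opportunity-Youth-2016-Data-Brief-v2.pdf', 
              pages='9', output_format='dataframe')
df = df[0]  # read_pdf returns a list of DataFrames; keep the first table
df.columns = ['16-18', '19-21', '22-24', 'Total']
# .copy() so that later column assignments do not operate on a slice of df
total_2016 = df[0:3].copy()
oy_2016 = df[6:].copy()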

In [54]:
total_2016=total_2016.rename(index={0: 'Opportunity Youth', 1: 'Working without Diploma', 2:'Not Opportunity Youth'})

for i in total_2016:    
    total_2016[i]=total_2016[i].str.replace(',','').astype('float')
total_2016.loc['Total']=total_2016.sum(axis=0)

total_2016=total_2016.reindex(['Total', 'Opportunity Youth', 'Working without Diploma', 'Not Opportunity Youth'])

for i, c in enumerate(total_2016):
    num=(total_2016[c]/total_2016[c]['Total'])*100
    total_2016.insert(i*2, '%'+str(c), num)
total_2016

Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,50053.0,100.0,41651.0,100.0,48031.0,100.0,139735.0
Opportunity Youth,5.60406,2805.0,17.488176,7284.0,18.171598,8728.0,13.466204,18817.0
Working without Diploma,1.172757,587.0,4.91945,2049.0,5.989882,2877.0,3.945325,5513.0
Not Opportunity Youth,93.223183,46661.0,77.592375,32318.0,75.838521,36426.0,82.588471,115405.0


In [55]:
total_2016.to_csv('total_youth_2016.csv')
pd.read_csv('total_youth_2016.csv')

In [22]:
oy_2016=oy_2016.rename(index={6:'No Diploma', 7: 'HS Diploma or GED', 8: 'Some College, No Degree', 9:'Degree (Associate or Higher)'})

for i in oy_2016:    
    oy_2016[i]=oy_2016[i].str.replace(',','').astype('float')
oy_2016.loc['Total']=oy_2016.sum(axis=0)
oy_2016=oy_2016.reindex(['Total', 'No Diploma', 'HS Diploma or GED', 'Some College, No Degree', 'Degree (Associate or Higher)'])

for i, c in enumerate(oy_2016):
    num=(oy_2016[c]/oy_2016[c]['Total'])*100
    oy_2016.insert(i*2, '%'+str(c), num)
oy_2016

Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,2805.0,100.0,7284.0,100.0,8728.0,100.0,18817.0
No Diploma,57.397504,1610.0,28.11642,2048.0,22.697067,1981.0,29.967583,5639.0
HS Diploma or GED,35.115865,985.0,45.977485,3349.0,35.13978,3067.0,39.331456,7401.0
"Some College, No Degree",6.381462,179.0,22.872048,1666.0,20.199358,1763.0,19.174151,3608.0
Degree (Associate or Higher),1.105169,31.0,3.034047,221.0,21.963795,1917.0,11.526811,2169.0


In [23]:
oy_2016.to_csv('opportunity_youth_2016.csv')
pd.read_csv('opportunity_youth_2016.csv')

Unnamed: 0.1,Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
0,Total,100.0,2805.0,100.0,7284.0,100.0,8728.0,100.0,18817.0
1,No Diploma,57.397504,1610.0,28.11642,2048.0,22.697067,1981.0,29.967583,5639.0
2,HS Diploma or GED,35.115865,985.0,45.977485,3349.0,35.13978,3067.0,39.331456,7401.0
3,"Some College, No Degree",6.381462,179.0,22.872048,1666.0,20.199358,1763.0,19.174151,3608.0
4,Degree (Associate or Higher),1.105169,31.0,3.034047,221.0,21.963795,1917.0,11.526811,2169.0


In [24]:
#df[(df['agep']>= 16) & (df['agep']<=24)]['cow']
#cow = 9 (unemployed)
#division = 9 (Pacific)
#region = 4 (West)
#st = 53 (Washington)
#sch = 1 (school enrollment: has not attended in the last 3 months), page 42
#schl: school level, page 42
#esr: employment status recode, page 59
#fesrp = 0: employment status, 0 = no, page 121
#rac1p: race recoded, page 103
#weights: page 126
#hisp = 01 is non-Hispanic
#df_set=df[['serialno', 'puma','division', 'region', 'st', 'agep', 'esr', 'nwab']]

In [25]:
pd.read_sql("""SELECT DISTINCT puma, puma_name
FROM puma_names_2010  AS A
WHERE state_name ='Washington' 
AND puma between '11601' AND '11615'

ORDER BY puma, puma_name 
;""", conn).head()


Unnamed: 0,puma,puma_name
0,11601,Seattle City (Northwest) ...
1,11602,Seattle City (Northeast) ...
2,11603,Seattle City (Downtown)--Queen Anne & Magnolia ...
3,11604,Seattle City (Southeast)--Capitol Hill ...
4,11605,Seattle City (West)--Duwamish & Beacon Hill ...


In [26]:
pd.read_sql("""
SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk AS A 


WHERE st='53' AND ctyname like 'King%'

;""", conn).head()

#(blklondd > -122.530396 AND blklondd < -120.084384) AND (blklatdd > 47.072060 AND blklatdd < 47.788447) 

Unnamed: 0,trct,cty,ctyname,blklondd,blklatdd
0,53033000100,53033,"King County, WA ...",-122.295678,47.7272
1,53033000100,53033,"King County, WA ...",-122.295498,47.725603
2,53033000100,53033,"King County, WA ...",-122.295482,47.720286
3,53033000100,53033,"King County, WA ...",-122.295476,47.723796
4,53033000100,53033,"King County, WA ...",-122.295191,47.731046


In [27]:
pd.read_sql("SELECT * FROM ct_puma_xwalk WHERE statefp = '53';", conn).head()

Unnamed: 0,statefp,countyfp,tractce,puma5ce
0,53,1,950100,10600
1,53,1,950200,10600
2,53,1,950300,10600
3,53,1,950400,10600
4,53,1,950500,10600


In [56]:
# import geopandas as gpd
# shapefile = 'ne_10m_populated_places.shp'
# #Read shapefile using Geopandas
# gdf = gpd.read_file(shapefile)[['ADMIN', 
# #Rename columns.
# gdf.columns = ['country', 'country_code', 'geometry']
# gdf.head()

Make sure you close the DB connection when you are done using it.

In [None]:
conn.close()
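
An alternative to remembering the explicit `close()` is `contextlib.closing`, which closes the connection even if a query raises. The sketch below uses an in-memory SQLite connection as a stand-in for the `psycopg2` connection (an assumption made so the example runs without a database server). As far as I know, using a `psycopg2` connection directly in a `with` statement ends the transaction but does not close the connection, which is why `closing` is used here.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even on an exception
with closing(sqlite3.connect(":memory:")) as conn:
    value = conn.execute("SELECT 1").fetchone()[0]

# After the block the connection can no longer be used
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```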

In [59]:
df = read_pdf('https://roadmapproject.org/wp-content/uploads/2018/09/Opportunity-Youth-2016-Data-Brief-v2.pdf', 
              pages='9', output_format='DataFrame')
df
# df.columns =['16-18', '19-21', '22-24', 'Total']
# total_2016= df[0:3]
# oy_2016= df[6:]

Got stderr: Feb 13, 2020 5:45:11 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font BCDEEE+Calibri-Light are not implemented in PDFBox and will be ignored



[       60     34     37     33     28
 0     585    513    477    530    462
 1      68     58     58     51     49
 2     490    385    342    248    259
 3     115    109    105    112    137
 4     717    561    510    486    446
 5     277    247    192    170    170
 6     NaN    NaN    NaN    NaN    NaN
 7    1321   1134   1026    959    928
 8     992    773    695    671    623
 9     NaN    NaN    NaN    NaN    NaN
 10    669    594    572    501    471
 11    NaN    NaN    NaN    NaN    NaN
 12    957    965    927    867    807
 13    421    304    284    331    277
 14    NaN    NaN    NaN    NaN    NaN
 15    353    302    320    256    274
 16  1,960  1,605  1,401  1,374  1,277
 17    NaN    NaN    NaN    NaN    NaN
 18    131    141    134     95    110
 19  2,182  1,766  1,587  1,535  1,441
 20    NaN    NaN    NaN    NaN    NaN
 21    775    647    564    607    546
 22  1,538  1,260  1,157  1,023  1,005
 23    NaN    NaN    NaN    NaN    NaN
 24    386    269    219 