# Data Download and Exploration

The cell below makes the notebook re-import your source code in `src` whenever it is edited. (By default, modules are not re-imported, because most are assumed not to change over time.) It's a good idea to include it in any exploratory notebook that uses `src` code.

In [2]:
%load_ext autoreload
%autoreload 2
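
For intuition about what autoreload automates, here is a minimal sketch of reloading an edited module by hand with the standard library's `importlib.reload`. The module name `demo_module` and its `answer` function are hypothetical, written to a temporary directory purely for illustration. They are not part of `src`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so this demo always recompiles from source
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical name) to a temporary directory
tmp_dir = tempfile.mkdtemp()
module_file = Path(tmp_dir) / "demo_module.py"
module_file.write_text("def answer():\n    return 1\n")

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
import demo_module

before = demo_module.answer()

# "Edit" the source file, then reload it explicitly
module_file.write_text("def answer():\n    return 2\n")
importlib.reload(demo_module)
after = demo_module.answer()

print(before, after)
```

Without the explicit `importlib.reload` call, the second `answer()` would still return the old value, which is the situation `%autoreload 2` is meant to avoid.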

This snippet allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [3]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
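
The two-`pardir` approach depends on how deep the notebook sits in the tree. As a hedged alternative, the sketch below walks upward until it finds a directory containing `src`. The helper names `find_project_root` and `add_to_sys_path` are hypothetical and not part of this project. The demo at the end builds a throwaway layout that mimics the tree above.

```python
import sys
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Append `root` to sys.path once, mirroring the check in the cell above."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.append(root_str)


# Demo on a temporary layout: <project>/src and <project>/notebooks/exploratory
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve()
    (project / "src").mkdir()
    notebook_dir = project / "notebooks" / "exploratory"
    notebook_dir.mkdir(parents=True)
    found = find_project_root(notebook_dir)
    matches = (found == project)

# In a notebook you might call: add_to_sys_path(find_project_root(Path.cwd()))
```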

The code to download all of the data and load it into a SQL database is in the `data_collection` module, inside the `data` package within `src`.  You'll only need to run `download_data_and_load_into_sql` one time for the duration of the project.

In [4]:
from src.data import data_collection

This line may take as long as 10-20 minutes, depending on your network connection and computer specs.

In [5]:
#data_collection.download_data_and_load_into_sql()

Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd
pd.set_option('max_colwidth', 80)

In [7]:
DBNAME = "opportunity_youth"

In [8]:
conn = psycopg2.connect(dbname=DBNAME)

In [9]:
df = pd.read_sql("""
SELECT  
        A.serialno, 
        A.puma, 
        A.pwgtp totalnumber,
        CASE WHEN A.sch ='1' AND (A.esr='3' OR A.esr='6') THEN 'Opportunity Youth'
             WHEN (A.esr ='1' or A.esr='2' OR A.esr='4' OR A.esr='5') AND A.schl < '16' THEN 'Working without Diploma'
             ELSE 'Not Opportunity Youth' 
             END AS youthtype,
        CASE WHEN A.agep BETWEEN 16 AND 18 THEN '16-18'
             WHEN A.agep BETWEEN 19 AND 21 THEN '19-21'
             WHEN A.agep BETWEEN 22 AND 24 THEN '22-24'
             END AS age,
        CASE WHEN A.sex = '1' THEN 'Male'
             WHEN A.sex = '2' THEN 'Female'
             ELSE 'Other'
             END AS sex,
        CASE WHEN A.schl BETWEEN '01' AND '15' THEN 'No Diploma'
             WHEN A.schl BETWEEN '16' AND '17' THEN 'HS Diploma or GED'
             WHEN A.schl BETWEEN '18' AND '19' THEN 'Some College, No Degree'
             ELSE 'Degree (Associate or Higher)'
             END as educationattainment
    
FROM  pums_2017 AS A
      
WHERE 
          puma BETWEEN '11610' AND '11615'
          AND agep BETWEEN 16 and 24
          AND rt = 'P'  

;
""", conn)
#  CASE WHEN a.rac1p = '1' AND HISP ='01' THEN 'White'
#              WHEN a.rac1p = '2' AND HISP ='01' THEN 'Black or African American alone'
#              WHEN a.rac1p = '3' AND HISP ='01' THEN 'American Indian alone'
#              WHEN a.rac1p = '4' AND HISP ='01' THEN 'Alaska Native alone'
#              WHEN a.rac1p = '5' AND HISP ='01' THEN ''
#              WHEN a.rac1p = '6' AND HISP ='01' THEN 'Asian alone'
#              WHEN a.rac1p = '7' AND HISP ='01' THEN 'Native Hawaiian or Other Pacific Islander alone'
#              WHEN a.rac1p = '8' AND HISP ='01' THEN 'Some other Race alone'
#              WHEN a.rac1p = '9' THEN 'Two or More Races'
#               ELSE 'Other Races'
#               END AS race,
#LEFT JOIN (SELECT DISTINCT * FROM ct_puma_xwalk WHERE statefp='53') AS K ON K.puma5ce = A.puma
#LEFT JOIN (SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd FROM wa_geo_xwalk WHERE cty='53033') AS C ON K.tractce = RIGHT(trct, 6)
#LEFT JOIN (SELECT DISTINCT puma, puma_name FROM puma_names_2010 WHERE state_fips='53') AS D ON A.puma = D.puma



In [10]:
df.head()

Unnamed: 0,serialno,puma,totalnumber,youthtype,age,sex,educationattainment
0,2013000056099,11613,16.0,Not Opportunity Youth,22-24,Male,HS Diploma or GED
1,2013000057563,11611,20.0,Opportunity Youth,19-21,Male,HS Diploma or GED
2,2013000058010,11614,45.0,Opportunity Youth,16-18,Female,No Diploma
3,2013000059060,11610,19.0,Opportunity Youth,22-24,Male,HS Diploma or GED
4,2013000065045,11611,27.0,Not Opportunity Youth,22-24,Female,"Some College, No Degree"


In [39]:
# Weighted population counts by youth type (rows) and age group (columns)
nd = pd.pivot_table(df,
                    index=['youthtype'],
                    values=['totalnumber'],
                    columns=['age'],
                    aggfunc='sum',
                    fill_value=0,  # a numeric fill keeps the percentage arithmetic valid
                    margins=True,
                    margins_name='Total'
                   )['totalnumber']
nd.columns.rename('', inplace=True)
# Insert a percent-of-column-total column to the left of each count column
for i, col in enumerate(['16-18', '19-21', '22-24', 'Total']):
    nd.insert(2 * i, '%' + col, nd[col] / nd.loc['Total', col] * 100)
nd

youthtype,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Not Opportunity Youth,79.4566,23949,78.294,19954,78.1795,23654,78.6617,67557
Opportunity Youth,6.0217,1815,15.3104,3902,16.1852,4897,12.3587,10614
Working without Diploma,14.5217,4377,6.39567,1630,5.63525,1705,8.97966,7712
Total,100.0,30141,100.0,25486,100.0,30256,100.0,85883
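
Each percentage column is the count divided by that column's `Total` row, times 100. The short check below reproduces the `%16-18` column using the counts copied from the table above. The variable names are illustrative only.

```python
# Weighted counts for ages 16-18, copied from the table above
counts_16_18 = {
    "Not Opportunity Youth": 23949,
    "Opportunity Youth": 1815,
    "Working without Diploma": 4377,
}

# The column total matches the 'Total' row (30141)
total = sum(counts_16_18.values())

# Percent of column total for each youth type
percents = {label: count / total * 100 for label, count in counts_16_18.items()}

for label, pct in percents.items():
    print(f"{label}: {pct:.4f}%")
```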


These tables hold a large amount of data, so **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but it will be much slower and less efficient than letting the database filter and aggregate.  Nevertheless, Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
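
As a rough illustration of pushing the aggregation into SQL, the sketch below sums person weights per group with `GROUP BY` instead of pivoting in Pandas. It uses the standard library's `sqlite3` with an in-memory table and made-up rows so it runs without the project database. The table name `pums_demo` and every value in it are assumptions, and the classification is simplified to two youth types.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database (made-up rows)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE pums_demo (agep INTEGER, sch TEXT, esr TEXT, pwgtp INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO pums_demo VALUES (?, ?, ?, ?)",
    [
        (17, "1", "6", 10),
        (20, "1", "3", 20),
        (20, "2", "1", 30),
        (23, "1", "6", 5),
        (23, "2", "1", 15),
    ],
)

# Classify each person, then let the database sum the weights per group
query = """
SELECT
    CASE WHEN sch = '1' AND esr IN ('3', '6') THEN 'Opportunity Youth'
         ELSE 'Not Opportunity Youth'
         END AS youthtype,
    CASE WHEN agep BETWEEN 16 AND 18 THEN '16-18'
         WHEN agep BETWEEN 19 AND 21 THEN '19-21'
         WHEN agep BETWEEN 22 AND 24 THEN '22-24'
         END AS age,
    SUM(pwgtp) AS totalnumber
FROM pums_demo
WHERE agep BETWEEN 16 AND 24
GROUP BY youthtype, age
ORDER BY youthtype, age;
"""
rows = demo_conn.execute(query).fetchall()
demo_conn.close()

for row in rows:
    print(row)
```

Only the already-aggregated rows cross from the database into Python, which is the main reason to prefer this over loading every person record into a DataFrame first.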

In [106]:
# Restrict to Opportunity Youth, then tabulate education attainment by age group
df_oy = df[df['youthtype'] == 'Opportunity Youth']
md = pd.pivot_table(df_oy,
                    index=['educationattainment'],
                    values=['totalnumber'],
                    columns=['age'],
                    aggfunc='sum',
                    fill_value=0,
                    margins=True,
                    margins_name='Total'
                   )['totalnumber']
md.columns.rename('', inplace=True)

# Insert a percent-of-column-total column to the left of each count column
for i, col in enumerate(['16-18', '19-21', '22-24', 'Total']):
    md.insert(2 * i, '%' + col, md[col] / md.loc['Total', col] * 100)

# reindex returns a new DataFrame, so assign the result; labels must match the SQL output exactly
newindex = ['Total', 'No Diploma', 'HS Diploma or GED', 'Some College, No Degree', 'Degree (Associate or Higher)']
md = md.reindex(newindex)
md

educationattainment,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,1815,100.0,3902,100.0,4897,100.0,10614.0
No Diploma,50.46832,916,28.498206,1112,27.547478,1349,31.816469,3377.0
HS Diploma or GED,43.030303,781,55.766274,2176,43.598121,2135,47.974373,5092.0
"Some College, No Degree",6.501377,118,13.352127,521,20.420666,1000,15.441869,1639.0
Degree (Associate or Higher),0.0,0,2.383393,93,8.433735,413,4.767288,506.0


In [107]:
#df[(df['agep']>= 16) & (df['agep']<=24)]['cow']
# cow = 9 (unemployed)
# division = 9 (Pacific)
# region = 4 (West)
# st = 53 (Washington)
# sch = 1 (school enrollment: has not attended in the last 3 months), page 42
# schl: school level, page 42
# esr: employment status recode, page 59
# fesrp = 0: employment status, 0 = no, page 121
# rac1p: race recoded, page 103
# weights: page 126
# hisp = 01 is non-Hispanic

#df_set=df[['serialno', 'puma','division', 'region', 'st', 'agep', 'esr', 'nwab']]

In [112]:
pd.read_sql("""SELECT DISTINCT puma, puma_name
FROM puma_names_2010  AS A
WHERE state_name ='Washington' 
AND puma between '11601' AND '11615'

ORDER BY puma, puma_name 
;""", conn).head()


Unnamed: 0,puma,puma_name
0,11601,Seattle City (Northwest) ...
1,11602,Seattle City (Northeast) ...
2,11603,Seattle City (Downtown)--Queen Anne & Magnolia ...
3,11604,Seattle City (Southeast)--Capitol Hill ...
4,11605,Seattle City (West)--Duwamish & Beacon Hill ...


In [111]:
pd.read_sql("""
SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk AS A 


WHERE st='53' AND ctyname like 'King%'

;""", conn).head()

#(blklondd > -122.530396 AND blklondd < -120.084384) AND (blklatdd > 47.072060 AND blklatdd < 47.788447) 

Unnamed: 0,trct,cty,ctyname,blklondd,blklatdd
0,53033000100,53033,"King County, WA ...",-122.295678,47.7272
1,53033000100,53033,"King County, WA ...",-122.295498,47.725603
2,53033000100,53033,"King County, WA ...",-122.295482,47.720286
3,53033000100,53033,"King County, WA ...",-122.295476,47.723796
4,53033000100,53033,"King County, WA ...",-122.295191,47.731046


In [110]:
pd.read_sql("SELECT * FROM ct_puma_xwalk WHERE statefp = '53';", conn).head()

Unnamed: 0,statefp,countyfp,tractce,puma5ce
0,53,1,950100,10600
1,53,1,950200,10600
2,53,1,950300,10600
3,53,1,950400,10600
4,53,1,950500,10600


In [95]:
# cur.execute("""
#     CREATE TABLE oy_status_by_age (
#     PopulationType varchar(255) PRIMARY KEY,
#     sixteen int,
#     nineteen int,
#     twentytwo int
#     )
# """)
# cur.execute("""
#     INSERT INTO oy_status_by_age (PopulationType, sixteen, nineteen, twentytwo)
#     VALUES
#     ('Opportunity Youth', 2805, 7284, 8728),
#     ('Working without diploma', 587, 2049, 2877),
#     ('Not an Opportunity Youth', 46661, 32318, 36426),
#     ('No diploma', 1610, 2048, 1981),
#     ('HS diploma or GED', 985, 3349, 3067),
#     ('Some college, no degree', 179, 1666, 1763),
#     ('Degree (Associate or higher)', 31, 221, 1917)
# """)

Make sure you close the DB connection when you are done using it.

In [14]:
conn.close()
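
One way to avoid forgetting the `close()` call is to wrap the connection in `contextlib.closing`, which calls `.close()` when the block exits, even if a query raises. The sketch below uses the standard library's `sqlite3` as a stand-in so it runs without a PostgreSQL server. The same pattern should apply to a `psycopg2` connection, since `closing` only relies on a `.close()` method. Note that, as far as I know, a plain `with conn:` on a psycopg2 connection manages the transaction rather than closing the connection.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for psycopg2 here so the sketch runs anywhere
with closing(sqlite3.connect(":memory:")) as demo_conn:
    value = demo_conn.execute("SELECT 1").fetchone()[0]

# After the block the connection is closed; further use raises an error
try:
    demo_conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(value, closed)
```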