# Data Download and Exploration

The cell below makes the notebook re-import your source code in `src` whenever it is edited. (By default, modules are not re-imported, because most are assumed not to change over time.) It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2

The next snippet allows the notebook to import from the `src` module. The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
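
To see why two `os.pardir` hops are needed, the path arithmetic can be checked on its own. This is a minimal sketch, assuming a hypothetical project at `/home/user/project` with the notebook running from `notebooks/exploratory`; both paths are illustrative and not taken from this project.

```python
import os
import posixpath

# Hypothetical working directory of the notebook (illustrative only).
notebook_dir = "/home/user/project/notebooks/exploratory"

# Joining two parent-directory markers and normalizing climbs two levels:
# exploratory -> notebooks -> project root (the directory that contains src/).
module_path = posixpath.normpath(
    posixpath.join(notebook_dir, os.pardir, os.pardir)
)

print(module_path)  # /home/user/project
```

A notebook sitting directly in `notebooks` would need only one `os.pardir`.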

The code to download all of the data and load it into a SQL database is in the `data` module within the `src` module.  You'll only need to run `download_data_and_load_into_sql` one time for the duration of the project.

In [3]:
from src.data import data_collection

The next cell may take as long as 10-20 minutes, depending on your network connection and computer specs. The call is commented out here; uncomment it for the one-time load.

In [4]:
#data_collection.download_data_and_load_into_sql()

Now it's time to explore the data!

In [5]:
import psycopg2
import pandas as pd
pd.set_option('display.max_colwidth', 80)

In [6]:
DBNAME = "opportunity_youth"

In [7]:
conn = psycopg2.connect(dbname=DBNAME)

In [24]:
# Person records (rt = 'P') aged 16-24 in PUMAs 11610-11615, with recoded categories.
# pwgtp is the person weight, so summing totalnumber gives population estimates.
df = pd.read_sql("""
SELECT
        A.serialno,
        A.puma,
        A.pwgtp AS totalnumber,
        CASE WHEN A.sch = '1' AND (A.esr = '3' OR A.esr = '6') THEN 'Opportunity Youth'
             WHEN (A.esr = '1' OR A.esr = '2' OR A.esr = '4' OR A.esr = '5') AND A.schl < '16' THEN 'Working without Diploma'
             ELSE 'Not Opportunity Youth'
             END AS youthtype,
        CASE WHEN A.agep BETWEEN 16 AND 18 THEN '16-18'
             WHEN A.agep BETWEEN 19 AND 21 THEN '19-21'
             WHEN A.agep BETWEEN 22 AND 24 THEN '22-24'
             END AS age,
        CASE WHEN A.sex = '1' THEN 'Male'
             WHEN A.sex = '2' THEN 'Female'
             ELSE 'Other'
             END AS sex,
        CASE WHEN A.schl BETWEEN '01' AND '15' THEN 'No Diploma'
             WHEN A.schl BETWEEN '16' AND '17' THEN 'HS Diploma or GED'
             WHEN A.schl BETWEEN '18' AND '19' THEN 'Some College, No Degree'
             ELSE 'Degree (Associate or Higher)'
             END AS educationattainment

FROM pums_2017 AS A

WHERE
        A.puma BETWEEN '11610' AND '11615'
        AND A.agep BETWEEN 16 AND 24
        AND A.rt = 'P'

;
""", conn)
#  CASE WHEN a.rac1p = '1' AND HISP ='01' THEN 'White'
#              WHEN a.rac1p = '2' AND HISP ='01' THEN 'Black or African American alone'
#              WHEN a.rac1p = '3' AND HISP ='01' THEN 'American Indian alone'
#              WHEN a.rac1p = '4' AND HISP ='01' THEN 'Alaska Native alone'
#              WHEN a.rac1p = '5' AND HISP ='01' THEN ''
#              WHEN a.rac1p = '6' AND HISP ='01' THEN 'Asian alone'
#              WHEN a.rac1p = '7' AND HISP ='01' THEN 'Native Hawaiian or Other Pacific Islander alone'
#              WHEN a.rac1p = '8' AND HISP ='01' THEN 'Some other Race alone'
#              WHEN a.rac1p = '9' THEN 'Two or More Races'
#               ELSE 'Other Races'
#               END AS race,
#LEFT JOIN (SELECT DISTINCT * FROM ct_puma_xwalk WHERE statefp='53')AS K ON K.puma5ce =A.puma 
#LEFT JOIN (SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd from wa_geo_xwalk WHERE cty='53033') AS C ON K.tractce = RIGHT(trct, 6)
#LEFT JOIN (SELECT DISTINCT puma, puma_name FROM puma_names_2010 WHERE state_fips='53') AS D on A.puma=D.puma
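
The `youthtype` CASE expression in the query above can be mirrored in plain Python to spot-check individual rows. This is only a sketch: the function name `classify_youth` is hypothetical, and it assumes the three codes arrive as non-null, zero-padded strings, as they appear in the query.

```python
def classify_youth(sch: str, esr: str, schl: str) -> str:
    """Mirror the youthtype CASE expression; all codes are strings, as in the query."""
    # First branch: sch = '1' and an employment status recode of '3' or '6'.
    if sch == "1" and esr in ("3", "6"):
        return "Opportunity Youth"
    # Like the SQL, this compares schl as text, so it relies on zero-padded codes.
    if esr in ("1", "2", "4", "5") and schl < "16":
        return "Working without Diploma"
    return "Not Opportunity Youth"


print(classify_youth("1", "3", "16"))  # Opportunity Youth
print(classify_youth("2", "1", "12"))  # Working without Diploma
print(classify_youth("2", "1", "16"))  # Not Opportunity Youth
```

Because the branches are checked in order, a row that satisfies the first condition is never counted as "Working without Diploma", which matches how SQL evaluates `CASE`.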



In [25]:
df.head()

Unnamed: 0,serialno,puma,totalnumber,youthtype,age,sex,educationattainment
0,2013000056099,11613,16.0,Not Opportunity Youth,22-24,Male,HS Diploma or GED
1,2013000057563,11611,20.0,Opportunity Youth,19-21,Male,HS Diploma or GED
2,2013000058010,11614,45.0,Opportunity Youth,16-18,Female,No Diploma
3,2013000059060,11610,19.0,Opportunity Youth,22-24,Male,HS Diploma or GED
4,2013000065045,11611,27.0,Not Opportunity Youth,22-24,Female,"Some College, No Degree"


In [46]:
df_youth = df.groupby(['youthtype', 'age']).agg({'totalnumber': 'sum'})
#df_youth.values
df_youth
#df_youth.pivot(columns='age', values=df_youth.values)
#df_youth.set_index('youthtype').unstack('youthtype')

Unnamed: 0_level_0,Unnamed: 1_level_0,totalnumber
youthtype,age,Unnamed: 2_level_1
Not Opportunity Youth,16-18,23949.0
Not Opportunity Youth,19-21,19954.0
Not Opportunity Youth,22-24,23654.0
Opportunity Youth,16-18,1815.0
Opportunity Youth,19-21,3902.0
Opportunity Youth,22-24,4897.0
Working without Diploma,16-18,4377.0
Working without Diploma,19-21,1630.0
Working without Diploma,22-24,1705.0


These tables hold a large amount of data, and **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but it will be much slower and less efficient.  While exploring, adding a `LIMIT` clause (for example `LIMIT 10`) to a query keeps the result small.  Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
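
As one way to push the aggregation into SQL, a `GROUP BY` with conditional sums can return one row per youth type and one column per age band. The sketch below is not the project's query: it uses an in-memory SQLite database from the standard library as a stand-in for PostgreSQL, a hypothetical `people` table, and made-up rows. The column names `sixteen`, `nineteen`, and `twentytwo` follow the `oy_status_by_age` table defined later in this notebook.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database; the rows are made up.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE people (youthtype TEXT, age TEXT, pwgtp INTEGER)")
conn_demo.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Opportunity Youth", "16-18", 10),
        ("Opportunity Youth", "16-18", 5),
        ("Opportunity Youth", "19-21", 7),
        ("Not Opportunity Youth", "22-24", 20),
    ],
)

# One output row per youthtype, one summed-weight column per age band.
rows = conn_demo.execute("""
    SELECT youthtype,
           SUM(CASE WHEN age = '16-18' THEN pwgtp ELSE 0 END) AS sixteen,
           SUM(CASE WHEN age = '19-21' THEN pwgtp ELSE 0 END) AS nineteen,
           SUM(CASE WHEN age = '22-24' THEN pwgtp ELSE 0 END) AS twentytwo
    FROM people
    GROUP BY youthtype
    ORDER BY youthtype
""").fetchall()
conn_demo.close()

print(rows)
# [('Not Opportunity Youth', 0, 0, 20), ('Opportunity Youth', 15, 7, 0)]
```

The same `SUM(CASE ...)` pattern is standard SQL, so it should carry over to PostgreSQL with the real table and column names substituted.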

In [35]:
df_oy = df[df['youthtype'] == 'Opportunity Youth']
df_oy_age = df_oy.groupby(['age', 'educationattainment']).agg({'totalnumber': 'sum'})
df_oy_age

Unnamed: 0_level_0,Unnamed: 1_level_0,totalnumber
age,educationattainment,Unnamed: 2_level_1
16-18,HS Diploma or GED,781.0
16-18,No Diploma,916.0
16-18,"Some College, No Degree",118.0
19-21,Degree (Associate or Higher),93.0
19-21,HS Diploma or GED,2176.0
19-21,No Diploma,1112.0
19-21,"Some College, No Degree",521.0
22-24,Degree (Associate or Higher),413.0
22-24,HS Diploma or GED,2135.0
22-24,No Diploma,1349.0


In [19]:
#df[(df['agep']>= 16) & (df['agep']<=24)]['cow']
#cow = 9 (unemployed)
#division = 9 (Pacific)
#region = 4 (West)
#st = 53 (Washington)
#sch = 1 (school enrollment: has not attended in the last 3 months), page 42
#schl: school level, page 42
#esr: employment status recode, page 59
#fesrp = 0: employment status, 0 = no, page 121
#rac1p: race recoded, page 103
#weights: page 126
#hisp = 01 is non-Hispanic

#df_set=df[['serialno', 'puma','division', 'region', 'st', 'agep', 'esr', 'nwab']]

6    3
Name: cow, dtype: object

In [22]:
pd.read_sql("""SELECT DISTINCT puma, puma_name
FROM puma_names_2010  AS A
WHERE state_name ='Washington' 
ORDER BY puma, puma_name 
;""", conn)


Unnamed: 0,puma,puma_name
0,10100,Whatcom County--Bellingham City ...
1,10200,"Skagit, Island & San Juan Counties ..."
2,10300,Chelan & Douglas Counties ...
3,10400,"Stevens, Okanogan, Pend Oreille & Ferry Counties ..."
4,10501,Spokane County (North Central)--Spokane City (North) ...
5,10502,Spokane County (South Central)--Spokane City (South) ...
6,10503,Spokane County (East Central)--Greater Spokane Valley City ...
7,10504,Spokane County (Outer)--Cheney City ...
8,10600,"Whitman, Asotin, Adams, Lincoln, Columbia & Garfield Counties ..."
9,10701,"Benton & Franklin Counties--Pasco, Richland (North) & West Richland Cities ..."


In [23]:
pd.read_sql("""
SELECT DISTINCT trct, cty, ctyname, blklondd, blklatdd
FROM wa_geo_xwalk AS A
WHERE st = '53' AND ctyname LIKE 'King%'
;""", conn)

#(blklondd > -122.530396 AND blklondd < -120.084384) AND (blklatdd > 47.072060 AND blklatdd < 47.788447) 

Unnamed: 0,trct,cty,ctyname,blklondd,blklatdd
0,53033000100,53033,"King County, WA ...",-122.295678,47.727200
1,53033000100,53033,"King County, WA ...",-122.295498,47.725603
2,53033000100,53033,"King County, WA ...",-122.295482,47.720286
3,53033000100,53033,"King County, WA ...",-122.295476,47.723796
4,53033000100,53033,"King County, WA ...",-122.295191,47.731046
...,...,...,...,...,...
35826,53033990100,53033,"King County, WA ...",-122.346140,47.363858
35827,53033990100,53033,"King County, WA ...",-122.343965,47.373841
35828,53033990100,53033,"King County, WA ...",-122.342136,47.376844
35829,53033990100,53033,"King County, WA ...",-122.341212,47.399598


In [12]:
pd.read_sql("SELECT * FROM ct_puma_xwalk WHERE statefp = '53';", conn)

Unnamed: 0,statefp,countyfp,tractce,puma5ce
0,53,001,950100,10600
1,53,001,950200,10600
2,53,001,950300,10600
3,53,001,950400,10600
4,53,001,950500,10600
...,...,...,...,...
1453,53,077,940002,10902
1454,53,077,940003,10902
1455,53,077,940004,10902
1456,53,077,940005,10902


In [149]:
# Open a cursor on the existing connection before running DDL statements.
cur = conn.cursor()
cur.execute("""
    CREATE TABLE oy_status_by_age (
    PopulationType varchar(255) PRIMARY KEY,
    sixteen int,
    nineteen int,
    twentytwo int
    )
""")

In [150]:
cur.execute("""
    INSERT INTO oy_status_by_age (PopulationType, sixteen, nineteen, twentytwo)
    VALUES
    ('Opportunity Youth', 2805, 7284, 8728),
    ('Working without diploma', 587, 2049, 2877),
    ('Not an Opportunity Youth', 46661, 32318, 36426),
    ('No diploma', 1610, 2048, 1981),
    ('HS diploma or GED', 985, 3349, 3067),
    ('Some college, no degree', 179, 1666, 1763),
    ('Degree (Associate or higher)', 31, 221, 1917)
""")
# Commit so the new table and rows persist after the connection is closed.
conn.commit()
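
Typing literal values into the SQL string invites quoting mistakes, so one alternative is to keep the rows in Python and pass them as parameters. The sketch below again uses an in-memory SQLite database as a stand-in so that it runs anywhere, and it loads only the first two rows from the cell above. With `psycopg2` the placeholder would be `%s` rather than `?`.

```python
import sqlite3

# First two rows from the INSERT cell above, kept as Python tuples.
status_rows = [
    ("Opportunity Youth", 2805, 7284, 8728),
    ("Working without diploma", 587, 2049, 2877),
]

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE oy_status_by_age ("
    "PopulationType varchar(255) PRIMARY KEY, "
    "sixteen int, nineteen int, twentytwo int)"
)
# sqlite3 uses ? placeholders; psycopg2 uses %s in the same position.
demo.executemany("INSERT INTO oy_status_by_age VALUES (?, ?, ?, ?)", status_rows)
demo.commit()

stored = demo.execute(
    "SELECT PopulationType, sixteen FROM oy_status_by_age ORDER BY sixteen"
).fetchall()
demo.close()

print(stored)
# [('Working without diploma', 587), ('Opportunity Youth', 2805)]
```

Letting the driver bind the values also means the strings never need hand-written SQL quotes.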

Make sure you close the DB connection when you are done using it.

In [14]:
conn.close()
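
If remembering the final `close()` call is a concern, `contextlib.closing` from the standard library can close a connection automatically at the end of a `with` block. This is a sketch using SQLite as a stand-in; the same wrapper should work for any object with a `close()` method. Note that in `psycopg2`, using the connection itself as a context manager ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_conn.close() when the block exits, even on an error.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    answer = demo_conn.execute("SELECT 1 + 1").fetchone()[0]

# After the with block, the connection is closed and further use raises an error.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)  # 2 False
```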