# Data Download and Exploration

This code makes the notebook re-import your source code in `src` whenever it is edited. (By default, modules are not re-imported, because most are assumed not to change over time.) It's a good idea to include it in any exploratory notebook that uses `src` code.

In [47]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [56]:
!ls

01_erh_download_and_explore_data.ipynb race2_2017.csv
README.md                              race_2017.csv
Visualization.ipynb                    tl_2017_53_puma10
cavinsNOTEBOOK.ipynb                   tl_2017_53_puma10.zip
main_table_2017.csv                    total_youth_2016.csv
opportunity_youth_2016.csv             total_youth_2017.csv
opportunity_youth_2017.csv             trend.ipynb


This snippet allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   └── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [49]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
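To see what the two `os.pardir` hops resolve to, here is a minimal sketch. The helper name `ancestor_dir` and the example path are hypothetical and exist only for illustration; `os.path.normpath` is used so that the result does not depend on the current working directory.

```python
import os


def ancestor_dir(start, levels=2):
    """Return the directory `levels` parents above `start` (hypothetical helper)."""
    hops = [os.pardir] * levels
    # normpath collapses the ".." components without touching the filesystem
    return os.path.normpath(os.path.join(start, *hops))


# Assumed layout: the notebook runs from <project>/notebooks/exploratory
example_cwd = os.path.join("project", "notebooks", "exploratory")
print(ancestor_dir(example_cwd))
```

With the layout shown above, two hops land on the project root, which is the directory that contains `src`.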

The code to download all of the data and load it into a SQL database is in the `data` module within the `src` module.  You'll only need to run `download_data_and_load_into_sql` one time for the duration of the project.

In [50]:
from src.data import data_collection

The call below may take as long as 10-20 minutes, depending on your network connection and computer specs.

In [67]:
# Uncomment for the first run only; the download and load are slow.
# data_collection.download_data_and_load_into_sql()

Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd
import numpy as np
from tabula import read_pdf
pd.set_option('max_colwidth', 80)

In [7]:
DBNAME = "opportunity_youth"

In [8]:
conn = psycopg2.connect(dbname=DBNAME)

In [62]:
df.head()

Unnamed: 0,serialno,puma,totalnumber,youthtype,age,sex,educationattainment,race,race2
0,2013000056099,11613,16.0,Not Opportunity Youth,22-24,Male,HS Diploma or GED,White,White
1,2013000057563,11611,20.0,Opportunity Youth,19-21,Male,HS Diploma or GED,White,White
2,2013000058010,11614,45.0,Opportunity Youth,16-18,Female,No Diploma,American Indian\Alaska Native,Other Races
3,2013000059060,11610,19.0,Opportunity Youth,22-24,Male,HS Diploma or GED,White,White
4,2013000065045,11611,27.0,Not Opportunity Youth,22-24,Female,"Some College, No Degree",Black of African American alone,Black of African American


In [72]:
import src.final_functions as FF
df = FF.create_df()

In [71]:
total_youth_2017 = FF.create_total_youth_2017(df)
total_youth_2017

youthtype,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,30141,100.0,25486,100.0,30256,100.0,85883
Opportunity Youth,6.0217,1815,15.3104,3902,16.1852,4897,12.3587,10614
Working without Diploma,14.5217,4377,6.39567,1630,5.63525,1705,8.97966,7712
Not Opportunity Youth,79.4566,23949,78.294,19954,78.1795,23654,78.6617,67557


In [74]:
opportunity_youth_2017 = FF.create_total_youth_2017(df)
opportunity_youth_2017

youthtype,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,30141,100.0,25486,100.0,30256,100.0,85883
Opportunity Youth,6.0217,1815,15.3104,3902,16.1852,4897,12.3587,10614
Working without Diploma,14.5217,4377,6.39567,1630,5.63525,1705,8.97966,7712
Not Opportunity Youth,79.4566,23949,78.294,19954,78.1795,23654,78.6617,67557


These tables hold a large amount of data, so cap exploratory queries (for example with `LIMIT 10`), and remember that **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but it will be much slower and less efficient.  Nevertheless, Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
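As a rough illustration of keeping the heavy lifting in SQL, the sketch below previews a few rows with `LIMIT` and then filters and aggregates inside the database. It uses the standard-library `sqlite3` with an in-memory database as a stand-in for the project's PostgreSQL database; the table `sample_pums`, its columns, and its rows are all made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_pums (puma TEXT, age INTEGER, weight REAL)")
conn.executemany(
    "INSERT INTO sample_pums VALUES (?, ?, ?)",
    [("11610", 17, 19.0), ("11611", 20, 20.0), ("11611", 23, 27.0), ("11613", 30, 16.0)],
)

# Peek at a few rows instead of pulling the whole table.
preview = conn.execute("SELECT * FROM sample_pums LIMIT 2").fetchall()

# Filter and aggregate in SQL so only the summary leaves the database.
totals = conn.execute(
    """
    SELECT puma, SUM(weight) AS total
    FROM sample_pums
    WHERE age BETWEEN 16 AND 24
    GROUP BY puma
    ORDER BY puma
    """
).fetchall()
conn.close()

print(len(preview))
print(totals)
```

The same split applies against the real database: send the `WHERE` and `GROUP BY` work to SQL, and load only the small result into Pandas for inspection.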

In [None]:
df_2016 = FF.create_basetable_2016()

In [101]:
total_youth_2016 = FF.create_total_youth_2016(df_2016)
total_youth_2016

Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,50053.0,100.0,41651.0,100.0,48031.0,100.0,139735.0
Opportunity Youth,5.60406,2805.0,17.488176,7284.0,18.171598,8728.0,13.466204,18817.0
Working without Diploma,1.172757,587.0,4.91945,2049.0,5.989882,2877.0,3.945325,5513.0
Not Opportunity Youth,93.223183,46661.0,77.592375,32318.0,75.838521,36426.0,82.588471,115405.0


In [102]:
opportunity_youth_2016 = FF.create_opportunity_youth_2016(df_2016)
opportunity_youth_2016

Unnamed: 0,%16-18,16-18,%19-21,19-21,%22-24,22-24,%Total,Total
Total,100.0,2805.0,100.0,7284.0,100.0,8728.0,100.0,18817.0
No Diploma,57.397504,1610.0,28.11642,2048.0,22.697067,1981.0,29.967583,5639.0
HS Diploma or GED,35.115865,985.0,45.977485,3349.0,35.13978,3067.0,39.331456,7401.0
"Some College, No Degree",6.381462,179.0,22.872048,1666.0,20.199358,1763.0,19.174151,3608.0
Degree (Associate or Higher),1.105169,31.0,3.034047,221.0,21.963795,1917.0,11.526811,2169.0


Make sure you close the DB connection when you are done using it.
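One way to make sure the connection is always closed is to wrap it in `contextlib.closing`, which calls `.close()` on exit even if the body raises. The sketch below uses the standard-library `sqlite3` as a stand-in so that it runs without a database server; the name `demo_conn` is made up for the example. In this notebook, the direct equivalent is calling `conn.close()` on the `psycopg2` connection. Note that a plain `with conn:` block in `psycopg2` ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

# sqlite3 is a stand-in here; contextlib.closing only needs a .close() method.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    answer = demo_conn.execute("SELECT 1 + 1").fetchone()[0]

# The connection is closed at this point, even if the body had raised.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)
```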

In [116]:
df = FF.create_df()
race_2017 = FF.create_race_2017(df)
race_2017

race,%oOpportunityYouth,Opportunity Youth,%oTotal,Total
American Indian\Alaska Native,3.26927,347,0.990883,851
Asian,11.2022,1189,15.4152,13239
Black/African American,12.3893,1315,10.0486,8630
Hawaiian and Other Pacific Islander,3.39175,360,2.11218,1814
Hispanic,20.0961,2133,18.5625,15942
Some other Race alone,0.150744,16,0.343491,295
Two or More Races,8.12135,862,7.39727,6353
White,41.3793,4392,45.13,38759
Total,100.0,10614,100.0,85883


In [118]:
# df = FF.create_df()
race2_2017 = FF.create_race2_2017(df)
race2_2017

race2,%oOpportunityYouth,Opportunity Youth,%oTotal,Total
Black of African American,12.3893,1315,10.0486,8630
Hispanic,20.0961,2133,18.5625,15942
Other Races,26.1353,2774,26.259,22552
White,41.3793,4392,45.13,38759
Total,100.0,10614,100.0,85883
