# Data Download and Exploration

This code makes the notebook re-import your source code in `src` whenever it is edited (by default, modules are not re-imported, because most are assumed not to change over time).  It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!ls

README.md          oy_project_1.ipynb


This snippet allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   └── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [3]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
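
The same idea can be wrapped in a small helper so the number of parent directories is explicit. This is only a sketch: `add_project_root` and its parameters are hypothetical names, not part of the project's `src` code, and it assumes the notebook's working directory is the folder that contains the notebook.

```python
import sys
from pathlib import Path


def add_project_root(levels_up=2, start=None):
    """Put the directory `levels_up` levels above `start` on sys.path.

    `start` defaults to the current working directory, which for a
    notebook is normally the folder that contains the .ipynb file.
    (Hypothetical helper, shown for illustration only.)
    """
    if levels_up < 1:
        raise ValueError("levels_up must be at least 1")
    base = Path(start) if start is not None else Path.cwd()
    root = base.resolve().parents[levels_up - 1]
    root_str = str(root)
    if root_str not in sys.path:  # avoid duplicates when the cell is re-run
        sys.path.append(root_str)
    return root
```

Calling `add_project_root(2)` from a notebook two levels below the project root would have the same effect as the cell above.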

The code to download all of the data and load it into a SQL database is in the `data_collection` module of the `src.data` package.  You only need to run `download_data_and_load_into_sql` once for the duration of the project.

In [4]:
from src.data import data_collection

The following call may take 10-20 minutes, depending on your network connection and computer specs.

In [5]:
#data_collection.download_data_and_load_into_sql()
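
Because the download only needs to happen once, one option is to guard the call with a marker file instead of commenting it in and out. The sketch below is an assumption about how that could look; `run_once` and the marker file name are hypothetical and do not exist in the project.

```python
from pathlib import Path


def run_once(marker, task):
    """Call `task()` only if the file `marker` does not exist yet.

    Returns True when the task ran and False when it was skipped.
    (Hypothetical helper, shown for illustration only.)
    """
    marker = Path(marker)
    if marker.exists():
        return False
    task()
    marker.touch()  # written only after the task finished without raising
    return True
```

In this notebook it might be used as `run_once(".data_loaded", data_collection.download_data_and_load_into_sql)`, where `.data_loaded` is an arbitrary file name chosen for the example.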

Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd
import numpy as np
from tabula import read_pdf
import matplotlib.pyplot as plt
pd.set_option('max_colwidth', 80)

ModuleNotFoundError: No module named 'tabula'
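
The error above means the `tabula` module is not installed in the notebook's environment (on PyPI it is distributed as the `tabula-py` package). A quick way to see which imports would fail before running the cell is sketched below; `missing_modules` is a hypothetical helper, and the list of names is simply taken from the import cell above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the cell above.
required = ["psycopg2", "pandas", "numpy", "tabula", "matplotlib"]
missing = missing_modules(required)
if missing:
    print("Install these before continuing:", ", ".join(missing))
```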

In [None]:
DBNAME = "opportunity_youth"

In [None]:
conn = psycopg2.connect(dbname=DBNAME)

In [None]:
import src.create_tables as FF
df=FF.create_df()

In [None]:
total_youth_2017=FF.create_total_youth_2017(df)
total_youth_2017

In [None]:
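# NOTE: this calls create_total_youth_2017, the same function as the previous cell,
# so the result duplicates total_youth_2017. An opportunity-youth-specific function
# (compare create_opportunity_youth_2016 in the 2016 cells) is probably intended.
opportunity_youth_2017=FF.create_total_youth_2017(df)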
opportunity_youth_2017

These tables contain a large amount of data, which is why exploratory queries should be capped with a clause such as `LIMIT 10`, and why **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but it will be much slower and less efficient.  Nevertheless, Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
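
To illustrate the difference, here is a minimal sketch of letting the database do the filtering and counting so that only a few summary rows come back to Python. It uses an in-memory SQLite database as a stand-in for the project's PostgreSQL database so that it runs anywhere; the `people` table, its columns, and the age range are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for the project's PostgreSQL database here;
# the table and column names are hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE people (age INTEGER, in_school INTEGER)")
demo.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(17, 1), (19, 0), (22, 0), (22, 1), (30, 0)],
)

# Let the database filter and aggregate, so only the summary rows
# travel back to Python.
summary = demo.execute(
    """
    SELECT age, COUNT(*) AS n
    FROM people
    WHERE age BETWEEN 16 AND 24
    GROUP BY age
    ORDER BY age
    """
).fetchall()

# A LIMIT clause caps how many raw rows an exploratory query returns.
preview = demo.execute("SELECT * FROM people ORDER BY age LIMIT 2").fetchall()

demo.close()
```

The same SQL pattern (a `WHERE` filter plus `GROUP BY`) should carry over to PostgreSQL, with the query passed through the `psycopg2` connection instead.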

In [None]:
df_2016=FF.create_basetable_2016()

In [None]:
total_youth_2016=FF.create_total_youth_2016(df_2016)
total_youth_2016

In [None]:
opportunity_youth_2016=FF.create_opportunity_youth_2016(df_2016)
opportunity_youth_2016

Make sure you close the DB connection when you are done using it.
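
One way to make sure a connection is always closed, even if a query raises, is `contextlib.closing`. The sketch below uses `sqlite3` as a stand-in so it runs without a database server; with this notebook's connection, the simplest equivalent is calling `conn.close()` in the last cell. Note that in `psycopg2`, using the connection itself in a `with` statement ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

# sqlite3 is a stand-in so the sketch runs without a database server.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    value = demo_conn.execute("SELECT 1").fetchone()[0]

# After the block exits the connection is closed; further use raises an error.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
```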

In [None]:
df=FF.create_df()
race_2017=FF.create_race_2017(df)
# reset_index() returns a new DataFrame, so assign the result back
race_2017 = race_2017.reset_index()
race_2017

In [None]:
race_2016=FF.create_race_2016()
race_2016

In [None]:
race_youthtype_2017=FF.create_race2_2017(df)
race_youthtype_2017

In [None]:
import src.create_visualization as VSZ
datafiles=VSZ.pull_data()

In [None]:
createbar_201=VSZ.create_trend_bar_age(datafiles)


In [None]:
createbar_degree_1620=VSZ.create_trend_bar_degree(datafiles)
createbar_degree_1620

In [None]:
race_bar =VSZ.create_race_bar(race_2016, race_2017)

In [None]:
# import functions to be called from separate functions.py file
import src.functions as fc

#first map: whole state of Washington
df = fc.create_df()
fig1, ax1, cmap = fc.map_creation(["Washington State", "King County"], "Washington State PUMAs");
df.plot(ax=ax1, column="kc", edgecolor="black", cmap=cmap);

In [None]:
# second map: all of King County
fig2, ax2, cmap = fc.map_creation(["North King County", "South King County"], "King County PUMAs");
df[df["kc"]==True].plot(ax=ax2, column="s_kc", edgecolor="black", cmap=cmap);

In [None]:
# third map: South King County only
fig3, ax3, cmap = fc.map_creation(["South King County"], "South King County PUMAs");
df[df["s_kc"]==True].plot(ax=ax3, column="s_kc", edgecolor="black", cmap=cmap);

In [None]:
#save three maps: 
fc.save_map(fig1, '1_wa_state.png')

In [None]:
fc.save_map(fig2, '2_king_county.png')

In [None]:
fc.save_map(fig3, '3_south_king_county.png')