### Step 1: Data gathering 

__Step goal__: Download and store the datasets used in this study.

__Step overview__:
1. Census data;
2. Geographic boundaries.

All data are __open access__ and available on the official websites. Note that the responsible agencies may update the data sets, so some discrepancies are possible: new variables may become available, or a data set may have fewer attributes.
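
Because the published files can change between downloads, it may help to compare the columns of a fresh download against the columns seen earlier. The following is a minimal sketch of such a check; the function name `compare_columns` and the column names are hypothetical and are not part of this notebook or of the census files.

```python
def compare_columns(expected, actual):
    """Return the columns that were added to or removed from a data set."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "added": sorted(actual_set - expected_set),
        "removed": sorted(expected_set - actual_set),
    }


# Hypothetical column names, used only for illustration
previous_columns = ["region_id", "region_name", "variable_a"]
current_columns = ["region_id", "region_name", "variable_b"]

changes = compare_columns(previous_columns, current_columns)
print(changes)
```

In practice, the two lists could come from the header row of an earlier download and the header row of the current one.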

Note that this notebook collects the Canadian __census data only for British Columbia, Ontario, and Quebec__. To get the data for another region, go [here](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E), select the region of interest, and change the URL in the _1.1 Census data_ code block.
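
The download URLs used in this notebook differ only in the value of the `GEONO` query parameter. As a rough sketch, the URL for a region could be assembled as below. The helper `build_census_url` is hypothetical, and the assumption that every region follows the same pattern should be verified by copying the actual link from the download page.

```python
# Common part of the download URLs used in this notebook; only GEONO changes
BASE_URL = (
    "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/"
    "download-telecharger/comp/GetFile.cfm?Lang=E&FILETYPE=CSV&GEONO="
)


def build_census_url(geono):
    """Build a download URL from a GEONO value copied from the download page."""
    return BASE_URL + geono


# GEONO value taken from the British Columbia URL in this notebook
url = build_census_url("044_BRITISH_COLUMBIA")
print(url)
```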

In [1]:
import requests, zipfile, io
from datetime import datetime
import os

## 1. Canada

#### 1.1 Census data

In [5]:
%%time
# Download the census profile data for each province of interest
urls = [
    # British Columbia
    'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/GetFile.cfm?Lang=E&FILETYPE=CSV&GEONO=044_BRITISH_COLUMBIA',
    # Ontario
    'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/GetFile.cfm?Lang=E&FILETYPE=CSV&GEONO=044_ONTARIO',
    # Quebec
    'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/GetFile.cfm?Lang=E&FILETYPE=CSV&GEONO=044_QUEBEC',
]

# Create the target directory once, before any download starts
directory = "../data/raw/census/canada/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

for url in urls:
    r = requests.get(url)
    r.raise_for_status()  # stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path=directory)
    print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Downloading date: 26-05-2020 18:38:33
Downloading date: 26-05-2020 19:00:47
Downloading date: 26-05-2020 19:16:11
Wall time: 46min 14s
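
The download cells in this notebook share one pattern: fetch a zip archive, then extract it into a target directory. A possible way to factor that pattern into reusable helpers is sketched below. The function names `extract_zip_bytes` and `download_and_extract` and the `timeout` value are assumptions made for illustration, not part of the original code.

```python
import io
import os
import zipfile


def extract_zip_bytes(content, directory):
    """Extract an in-memory zip archive into `directory`; return member names."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(path=directory)
        return archive.namelist()


def download_and_extract(url, directory, timeout=60):
    """Download a zip archive from `url` and extract it into `directory`."""
    import requests  # imported here so extract_zip_bytes works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return extract_zip_bytes(response.content, directory)
```

With `requests`, the `timeout` argument limits how long the client waits for the server to respond, not the total download time, so it should not interrupt a long but steady transfer.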


#### 1.2 Geographic boundaries

In [6]:
%%time
url = 'http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/lda_000b16a_e.zip'
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))

directory = "../data/raw/geometry/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')

z.extractall(path=directory)
print(f'Downloading date: {datetime.today().strftime("%d-%m-%Y %H:%M:%S")}')

Successfully created new directory ../data/raw/geometry/
Downloading date: 02-05-2020 11:56:23
Wall time: 2min 35s


## 2. United States

#### 2.1 Census data and geographic boundaries

In [2]:
%%time
from src.data.census.usa.prepare_us_data import retrieve_us_data

# Each city is given as [state FIPS code, place FIPS code]
# chicago = [17, 14000]
# NYC = [36, 51000]
# SF = [int('06'), 67000]
# philadelphia = [42, 60000]
# houston = [48, 35000]
boston = [25, int('07000')]
# seattle = [53, 63000]
# miami = [12, 45000]
LA = [int('06'), 44000]
# Cities still to be added (codes not yet filled in):
# san_diego, DC, portland, baltimore, sacramento, atlanta,
# minneapolis, pittsburgh, denver

cities_geo = [LA]

# Retrieve the variables of interest for the cities in cities_geo (here, Los Angeles)
LA_data = retrieve_us_data(cities_geo = cities_geo, year = 2017)

City collected successfully!
Wall time: 1h 49min 35s
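
The city definitions above store FIPS codes as integers, so `int('06')` evaluates to `6` and the leading zero is lost. State FIPS codes have two digits and place FIPS codes have five, so the padded string form can be recovered when needed. The sketch below shows one way to do this; `format_fips` is a hypothetical helper, and whether `retrieve_us_data` expects integers or padded strings is not stated in this notebook.

```python
def format_fips(state, place):
    """Return zero-padded (state, place) FIPS codes as strings."""
    return f"{int(state):02d}", f"{int(place):05d}"


# Los Angeles, defined in the notebook as [int('06'), 44000]
la_codes = format_fips(int('06'), 44000)
# Boston, defined in the notebook as [25, int('07000')]
boston_codes = format_fips(25, int('07000'))

print(la_codes)      # ('06', '44000')
print(boston_codes)  # ('25', '07000')
```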


In [4]:
# Save the raw data
city = 'los_angeles'

directory = "../../data/raw/census/united_states/"
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f'Successfully created new directory {directory}')
    
# houston_data[0].to_file(directory + f'{city.lower()}.shp', index=False)
LA_data[0].to_csv(directory + f'{city.lower()}.csv', index=False)
LA_data[0].to_file(directory + f'{city.lower()}.geojson', driver='GeoJSON')

### References
1. Statistics Canada (2020). Census Profile, 2016 Census. Retrieved from https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E
2. Statistics Canada (2020). 2016 Census - Boundary files. Retrieved from http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/lda_000b16a_e.zip
3. American Community Survey 5 Year. Retrieved from https://www.census.gov/data/developers/data-sets/acs-5year.html
4. US Census API wrapper: https://pypi.org/project/census-area/
5. US counties FIPS codes. Retrieved from https://www.nrcs.usda.gov/wps/portal/nrcs/detail/ca/home/?cid=nrcs143_013697
6. Variable Information for ACS5 census dataset. Retrieved from https://api.census.gov/data/2018/acs/acs5/variables.html