# Deep Convolutional Methods for Population Estimation
<hr>

### Objective
This project models population estimates from satellite imagery of a region.


https://deeppop.github.io/resources/robinson2017-deeppop.pdf

In [1]:
import pandas as pd
import tensorflow as tf
import os

## Data

This project introduces a novel dataset of geographic information built on U.S. Census tracts. These ~70,000 tracts divide the populated regions of the United States into administrative regions for census record-keeping. Census tracts are subdivisions of counties and completely segment the U.S. landmass.

"Census tracts average 4,000 population and may be disaggregated in census blocks and census block groups. Census tracts cover the U.S. from wall-to-wall. They are the smallest geography for which American Community Survey data are tabulated for all types subject matter tables. Their "hierarchical" structure makes it possible to aggregate subject matter data, such as the population of a certain age, to higher level geography including counties and metros (but not cities and legislative districts, in general)."

There are two components to this data: <b>images</b> and <b>census records</b>.

#### Images

<img src="Sample2.jpg" style="height: 20%">
<p style="text-align: center;"><i>Census Tract 7807, Kandiyohi County, Minnesota</i></p>

Naively acquiring satellite imagery for every census tract would be time-consuming. Fortunately, a simple workaround exists: images are scraped from the media feed of a Twitter bot (<a href="https://twitter.com/everytract">@everytract</a>) run by Neil Freeman. The bot tweets an image of a census tract every thirty minutes. So far it has tweeted ~36,000 of the ~70,000 total census tracts in the United States, providing a large and robust data set. Images are large and precisely delineated. A representative sample is shown above.

#### Census Records

The census records used for this project were manually collected from the U.S. Census Bureau's online data portal (https://www.census.gov/data.html). The latest estimates from the 2018 American Community Survey were used for all (~37,000) census tracts corresponding to the available image data set.

### Preprocessing

I aggregated census data for all tracts into a single .csv file, `Census_Combined.csv`.

In [19]:
'''
states = []
for filename in os.listdir("./Census Data"):
    print(filename)
    states.append(filename)
    
states.sort()
states.pop(0)

# Estimate!!Total housing units / DP05_0086E
# Estimate!!RACE!!Total population / DP05_0033E
# Geographic Area Name / NAME
# id / GEO_ID

alabama = pd.read_csv("./Census Data/Alabama 2018/ACSDP5Y2018.DP05_data_with_overlays_2020-11-10T102701.csv")
alabama = alabama[['GEO_ID','NAME','DP05_0033E','DP05_0086E']]
alabama.drop(index=0, inplace=True)

for state in states[1:]:
    temp = pd.read_csv("./Census Data/"+state+"/ACSDP5Y2018.csv")
    temp = temp[['GEO_ID','NAME','DP05_0033E','DP05_0086E']]
    temp.drop(index=0, inplace=True)
    alabama = pd.concat([alabama, temp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

combined = alabama
combined.to_csv('Census_Combined.csv')
'''


In [21]:
#Read Census data from Census_Combined.csv
census = pd.read_csv("Census_Combined.csv")
census

| Unnamed: 0.1 | Unnamed: 0 | GEO_ID | NAME | TOTAL POPULATION | TOTAL HOUSING UNITS |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1400000US01001020100 | Census Tract 201, Autauga County, Alabama | 1923 | 779 |
| 1 | 1 | 1400000US01001020200 | Census Tract 202, Autauga County, Alabama | 2028 | 852 |
| 2 | 2 | 1400000US01001020300 | Census Tract 203, Autauga County, Alabama | 3476 | 1397 |
| 3 | 3 | 1400000US01001020400 | Census Tract 204, Autauga County, Alabama | 3831 | 1867 |
| 4 | 4 | 1400000US01001020500 | Census Tract 205, Autauga County, Alabama | 9883 | 4488 |
| 5 | 5 | 1400000US01001020600 | Census Tract 206, Autauga County, Alabama | 3705 | 1413 |
| 6 | 6 | 1400000US01001020700 | Census Tract 207, Autauga County, Alabama | 4029 | 1448 |
| 7 | 7 | 1400000US01001020801 | Census Tract 208.01, Autauga County, Alabama | 2826 | 1224 |
| 8 | 8 | 1400000US01001020802 | Census Tract 208.02, Autauga County, Alabama | 11603 | 4288 |
| 9 | 9 | 1400000US01001020900 | Census Tract 209, Autauga County, Alabama | 6401 | 2472 |
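The `GEO_ID` column encodes the tract's place in the census hierarchy: the 11 digits after `US` are the 2-digit state code, the 3-digit county code, and the 6-digit tract code. The sketch below shows one way to split that identifier and roll tract values up to the county level, as the hierarchical structure described earlier allows. The helper names `split_geo_id` and `county_totals` and the derived column names are illustrative assumptions, not part of the original notebook; the sample rows are copied from the table above.

```python
import pandas as pd


def split_geo_id(census: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with state, county, and tract codes parsed from GEO_ID.

    Hypothetical helper: the new column names are illustrative only.
    """
    out = census.copy()
    # Everything after "US" is the 11-digit tract identifier: SS + CCC + TTTTTT.
    geoid = out["GEO_ID"].str.split("US").str[1]
    out["STATE_FIPS"] = geoid.str[:2]
    out["COUNTY_FIPS"] = geoid.str[2:5]
    out["TRACT_CODE"] = geoid.str[5:]
    return out


def county_totals(census: pd.DataFrame) -> pd.DataFrame:
    """Sum tract-level population and housing units up to the county level."""
    parsed = split_geo_id(census)
    return parsed.groupby(["STATE_FIPS", "COUNTY_FIPS"], as_index=False)[
        ["TOTAL POPULATION", "TOTAL HOUSING UNITS"]
    ].sum()


# Three rows copied from the table above (Autauga County, Alabama).
sample = pd.DataFrame(
    {
        "GEO_ID": [
            "1400000US01001020100",
            "1400000US01001020200",
            "1400000US01001020300",
        ],
        "TOTAL POPULATION": [1923, 2028, 3476],
        "TOTAL HOUSING UNITS": [779, 852, 1397],
    }
)
parsed = split_geo_id(sample)
totals = county_totals(sample)
print(totals)
```

Keeping the codes as strings preserves leading zeros (for example state `01`), which matters if the codes are later used to join against other census tables.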


### Exploratory Data Analysis

The collected census tracts have populations ranging from 0 to ~40,000, with a mean of around 4,400. Manual inspection of the imagery for very low population tracts confirmed that these values are accurate. Because the entire land mass of the United States is divided into census tracts, some tracts cover regions such as airports, national parks, and remote islands, which have very low populations.

In [40]:
import statistics
print("Population Range: ",min(census["TOTAL POPULATION"])," - ",max(census["TOTAL POPULATION"]))
print("Mean Population: ", int(statistics.mean(census["TOTAL POPULATION"])))

Population Range:  0  -  39919
Mean Population:  4442
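Manually inspecting low-population tracts is easier with a list of candidates. The following is a minimal sketch, assuming the column names shown in the table above, that computes the same summary statistics with pandas and returns the tracts below a chosen population cutoff. The function name `summarize_population` and the `threshold` parameter are hypothetical, and the cutoff of 2,000 is chosen only so that the three sample rows (copied from the table above) produce a non-empty result.

```python
import pandas as pd


def summarize_population(census: pd.DataFrame, threshold: int) -> dict:
    """Summarize tract populations and list tracts below `threshold`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    pop = census["TOTAL POPULATION"]
    # Tracts under the cutoff are candidates for manual inspection.
    low = census.loc[pop < threshold, ["NAME", "TOTAL POPULATION"]]
    return {
        "min": int(pop.min()),
        "max": int(pop.max()),
        "mean": int(pop.mean()),  # truncated, matching int(statistics.mean(...))
        "median": float(pop.median()),
        "low_population_tracts": low.sort_values("TOTAL POPULATION"),
    }


# Three rows copied from the census table above.
sample = pd.DataFrame(
    {
        "NAME": [
            "Census Tract 201, Autauga County, Alabama",
            "Census Tract 202, Autauga County, Alabama",
            "Census Tract 205, Autauga County, Alabama",
        ],
        "TOTAL POPULATION": [1923, 2028, 9883],
    }
)
summary = summarize_population(sample, threshold=2000)
print(summary["min"], summary["max"], summary["mean"], summary["median"])
print(summary["low_population_tracts"])
```

Reporting the median alongside the mean is a cheap check on skew: a handful of very large tracts can pull the mean well above the typical tract.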


## Model