# 01 - Getting and Cleaning Census Data
### by Ian Flores Siaca
##### October 2018

## Purpose of this Notebook

#### Learning
* Using the `os` library for basic operating-system tasks.
* Standardizing names in Spanish with the `unidecode` library.

#### Project
* Extract the population estimates for each municipality.
* Clean the name of the municipalities in the census dataset.

## Census Data

In 1840 the United States established a central office to oversee the census of the population of its states and territories. Today this office is known as the United States Census Bureau. It conducts a decennial census, the most recent one being in 2010. With this information it produces *PEPANNRES - Annual Estimates of the Resident Population*, a set of population estimates at different geographic levels. In March 2018 it published the 2017 estimates for municipalities, which is exactly what we need for our analysis.

## Getting the Data

The Census Bureau makes its data available through many web pages and APIs. Here we use the *FactFinder* web page and download the data manually, but it is worth noting that there are API wrappers in Python, such as:

* Census - https://github.com/datamade/census 
* Census Areas - https://github.com/datamade/census_area

If you want to get the data manually, it is available at the following [LINK](https://factfinder.census.gov/bkmk/table/1.0/en/PEP/2017/PEPANNRES/0400000US72.05000). However, it is also included as a zip file in the repository for reproducibility purposes.

## Let's code!

#### Load the libraries

* `pandas` is a library for data wrangling and for working with tabular data in general.
* `os` is a library that lets us interact with the operating system, for example to run shell commands.
* `unidecode` is a library that helps us deal with the names of the municipalities in Spanish. (Remember, PR is a Spanish-speaking nation :D)

In [1]:
import pandas as pd
import os
import unidecode

#### Extract the data

The data is stored in compressed form, so we need to extract it and save it in a place where we can use it.

Steps:
* Extract the data ---- `unzip` command
* Store it in the `data/` folder ---- `-d ../data/` option

In [2]:
os.system("unzip ../data/PEP_2017_PEPANNRES.zip -d ../data/")

0
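
The `unzip` command is not available on every system (Windows, for instance, does not ship it by default). As a hedged alternative, the sketch below performs the same extraction with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical and not part of the original notebook; the paths in the commented call are the ones used in the cell above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    target = Path(target_dir)
    # Create the destination folder if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Paths taken from the cell above; adjust them if your layout differs.
# extract_archive("../data/PEP_2017_PEPANNRES.zip", "../data/")
```

Because the function returns the list of extracted member names, the same list can be reused later to know which files to delete.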

#### Load the data

`pd.read_csv` is a function from the pandas package that allows us to load Comma Separated Values (CSV) files into Python with ease. The `head()` method lets us see the top rows of the data frame.

In [3]:
data = pd.read_csv("../data/PEP_2017_PEPANNRES.csv", encoding='latin-1')
data.head()

Unnamed: 0,GEO.id,GEO.id2,GEO.display-label,rescen42010,resbase42010,respop72010,respop72011,respop72012,respop72013,respop72014,respop72015,respop72016,respop72017
0,0500000US72001,72001,"Adjuntas Municipio, Puerto Rico",19483,19483,19472,19297,19116,19019,18798,18560,18276,17971
1,0500000US72003,72003,"Aguada Municipio, Puerto Rico",41959,41959,41913,41532,41107,40707,40135,39539,38853,38118
2,0500000US72005,72005,"Aguadilla Municipio, Puerto Rico",60949,60949,60766,59976,58978,58036,57078,55808,54525,53164
3,0500000US72007,72007,"Aguas Buenas Municipio, Puerto Rico",28659,28659,28652,28333,28052,27782,27350,26913,26382,25850
4,0500000US72009,72009,"Aibonito Municipio, Puerto Rico",25900,25900,25874,25537,25205,24879,24448,24040,23566,23108
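
The `encoding='latin-1'` argument matters because some municipality names contain accented characters. The sketch below illustrates the idea with the standard library only. It uses a hand-written sample label rather than the actual file, and it assumes the file is Latin-1 encoded, as the `read_csv` call above implies.

```python
# A label with a non-ASCII character, encoded the way a Latin-1 file stores it.
raw_bytes = "Mayagüez Municipio, Puerto Rico".encode("latin-1")

# Decoding with the matching encoding recovers the original text.
decoded = raw_bytes.decode("latin-1")

# Decoding the same bytes as UTF-8 fails: the byte 0xFC is not valid UTF-8.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded)   # Mayagüez Municipio, Puerto Rico
print(utf8_ok)   # False
```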


#### Dealing with names in Spanish

Spanish is a beautiful language. However, names in Spanish can be hard to handle when programming because of inconsistencies between files. These come from files saved with the wrong encoding, from typing some of the names in English and some in Spanish, or simply from skipping 'weird' characters. To deal with these inconsistencies, we convert the names to their English-letter (ASCII) representation and to upper case. This way we standardize the names across all the files.

Steps: 

* Extract the names of the municipalities from the strings
* Convert the names to English-letter representation
* Convert the strings to upper case format

In [4]:
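municipalities = []

# Iterate over the labels directly instead of hard-coding the 78 municipalities
for label in data['GEO.display-label']:
    # Keep only the municipality name, e.g. "Adjuntas"
    muni_raw = label.split(" Municipio, Puerto Rico")[0]
    # Transliterate accented characters to ASCII, e.g. "Añasco" -> "Anasco"
    muni_clean = unidecode.unidecode(muni_raw)
    muni_upper = muni_clean.upper()
    municipalities.append(muni_upper)

data['ResidencePlace'] = municipalities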
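
To make the three steps above easier to try outside the notebook, here is a minimal sketch that swaps `unidecode` for the standard library's `unicodedata` module. This is not the notebook's method: it only removes diacritics from characters that decompose into a base letter plus a combining mark (such as "ñ" or "ü"), whereas `unidecode` transliterates a much wider range of characters. The function name `standardize_name` and the sample labels are hypothetical.

```python
import unicodedata


def standardize_name(label, suffix=" Municipio, Puerto Rico"):
    """Strip the suffix, drop diacritics, and upper-case a municipality label."""
    name = label.split(suffix)[0]
    # NFKD splits "ñ" into "n" plus a combining tilde; the marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return ascii_name.upper()


labels = [
    "Añasco Municipio, Puerto Rico",
    "Mayagüez Municipio, Puerto Rico",
    "Aguas Buenas Municipio, Puerto Rico",
]
print([standardize_name(label) for label in labels])
# ['ANASCO', 'MAYAGUEZ', 'AGUAS BUENAS']
```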

In [5]:
data_processed = data[['ResidencePlace', 'respop72017']]
data_processed.head()

Unnamed: 0,ResidencePlace,respop72017
0,ADJUNTAS,17971
1,AGUADA,38118
2,AGUADILLA,53164
3,AGUAS BUENAS,25850
4,AIBONITO,23108


#### Save the progress | Clean our directories

Let's save the population estimates for 2017 so we can use the file later in our pipeline.

In [6]:
data_processed.to_csv("../data/census.csv")

Finally, clean the data directory so that only the files we need to reproduce this analysis and produce its output remain.

In [7]:
os.system("rm ../data/PEP_2017_PEPANNRES.csv ../data/PEP_2017_PEPANNRES_metadata.csv ../data/PEP_2017_PEPANNRES.txt ../data/aff_download_readme.txt")

0
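
Like `unzip`, the `rm` command is specific to Unix-like shells. A portable sketch of the same cleanup using `pathlib` is shown below. The helper name `remove_files` is hypothetical; the file names are the ones listed in the cell above.

```python
from pathlib import Path


def remove_files(directory, filenames):
    """Delete the listed files from directory, skipping any that are missing."""
    removed = []
    for name in filenames:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed


# File names taken from the cell above.
EXTRACTED_FILES = [
    "PEP_2017_PEPANNRES.csv",
    "PEP_2017_PEPANNRES_metadata.csv",
    "PEP_2017_PEPANNRES.txt",
    "aff_download_readme.txt",
]
# remove_files("../data", EXTRACTED_FILES)
```

Skipping missing files means the cell can be re-run without raising an error once the directory is already clean.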

## Optional Questions

1) Which municipality has the highest population? Which has the lowest?

2) Which municipality changed the most between 2016 and 2017?

3) How about between 2010 and 2017?

4) Can you find the municipality that has stayed the most stable (least variability) across all these years?

5) Are there some municipalities consistently losing people and others gaining them?

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.