<a href="https://colab.research.google.com/github/la-counts/instructables/blob/master/Instructable_9_Generating_Statistics_Using_Multiple_Data_Sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://www.latimes.com/resizer/tggkBQhqHKLehMeQ09CdgyK03H0=/800x0/www.trbimg.com/img-5cfc3cb8/turbine/la-1560034486-1beza1b9ta-snap-image)

# Generating Statistics Using Multiple Data Sources

---

Which council districts are shouldering a greater burden for managing the homeless crisis? Los Angeles County is in the midst of a [homelessness crisis](https://www.latimes.com/local/lanow/la-me-homeless-county-letter-20190608-story.html). The most recent count of people who lack reliable housing showed a statistically significant increase in many areas. Countywide, the number of people living in the streets [rose 12%](https://www.usatoday.com/story/news/nation/2019/06/04/los-angeles-homeless-population-up-12-amid-affordable-housing-crisis/1336734001/) to 58,936. In 2016, voters passed Measure HHH, a $1.2 billion initiative to build housing. However, the burden of the crisis is not shared equally throughout the city. Certain neighborhoods and council districts have disproportionately more homeless residents than others. This presents a challenge for both budgeting and the use of public space.

While programming experience helps for this instructable, it is not required. *(Please see our first two instructables for information on the tools used in this exercise)*

# Step 1: Gather and Understand Ingredients Used in This Notebook

In this instructable we'll be using data from two data sets. Specifically, we will be cross-referencing a dataset on the 2018 homeless count with a dataset on the population of the different council districts. This will give us a better idea of the per-capita burden that the crisis is placing on certain areas. 

For this exercise, you will need: 

*  Two CSV datasets that share a field you can match on (here, the council district). In this exercise we will use the [2018 homeless count](https://www.lacounts.org/dataset/homeless-count-by-council-district-2018) from [LA Counts](https://lacounts.org) and [Population of Council Districts](https://controllerdata.lacity.org/api/views/2ybs-mbdp/rows.csv), which contains demographic information about the council districts. 
*  A Jupyter Notebook like this one, hosted on [Google's Colab](http://colab.research.google.com/). 
*   Free Python libraries ([numpy](http://www.numpy.org/) and [pandas](https://pandas.pydata.org/)). These are accessible within the notebook, so you don't need to download them. 
*   Your smarts! 🧠 

# Step 2: Load and Show Homeless Count Data as a Table 

The first step is to load data into this Jupyter Notebook. A Jupyter Notebook is an open-source application that runs in your web browser. It can contain sections of live code, data visualizations, and text. 

In [0]:
import pandas as pd
import numpy as np

# This is the CSV file for the 2018 Q3 Homeless count from above. 

data1 = pd.read_csv("http://geohub.lacity.org/datasets/c8e6c2f2b6434c67a33a7b189f53f2b4_0.csv")

# 1. The below code uses the .loc method to select columns for CD (Council District) and total homeless (totPeople). 

newdata1 = data1.loc[:, ['CD','totPeople']]

# 2. the "head" method prints out the first FIVE lines of the dataframe so you can see its rows & columns. 
# .head is a good way to get a feel for the type of data you're working with! 

newdata1.head() 

Unnamed: 0,CD,totPeople
0,6,43.585
1,1,69.665
2,11,15.87
3,2,0.0
4,2,7.502
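
If the `.loc` selection above feels abstract, the following minimal sketch repeats it on a tiny hand-made table. The dataframe `toy`, its `tract` column, and all of its numbers are made up for illustration; only the column names `CD` and `totPeople` come from the real file.

```python
import pandas as pd

# Hypothetical toy data standing in for the homeless-count CSV.
# The numbers and the 'tract' column are invented for this example.
toy = pd.DataFrame({
    "CD": [1, 1, 2, 3],
    "totPeople": [10.0, 5.5, 3.0, 0.0],
    "tract": ["a", "b", "c", "d"],
})

# .loc[rows, columns]: ":" keeps every row, the list picks the columns to keep.
subset = toy.loc[:, ["CD", "totPeople"]]

print(subset.shape)    # (number of rows, number of columns)
print(subset.head(2))  # head(n) shows the first n rows; the default is 5
```

Note that one district can appear in more than one row, just as in the real data.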


# Step 3: Create Summary Data by District 

Remember that we want to create summaries of the total number of homeless people in each district. As the table above shows, a district can appear in several rows (District 2 appears twice in the first five rows alone), so those rows have to be added up. That is, we want to create a table with two columns: one for the district ('CD' above) and a second that sums the counts for each district ('totPeople' above). The code below does this and stores the result in the new variable 'sumdata'. 

In [0]:
# Group the newdata1 dataframe created above by council district, then sum totPeople within each group using the .agg method. 

sumdata = newdata1.groupby(['CD'])['totPeople'].agg('sum')
print(sumdata)

CD
1     2335.467
2     1221.844
3      535.735
4      630.993
5      695.934
6     2649.900
7     1256.473
8     1775.345
9     2640.257
10    1161.906
11    1908.471
12     591.596
13    2744.496
14    6806.621
15    1511.797
Name: totPeople, dtype: float64
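
One detail worth knowing: the result of this kind of `groupby` is a pandas Series indexed by `CD`, not a two-column dataframe. The sketch below, using made-up numbers in a hypothetical `toy` table, shows the grouping and one way to turn the result back into a dataframe with `reset_index()`.

```python
import pandas as pd

# Hypothetical toy data: district 1 appears in two rows.
toy = pd.DataFrame({"CD": [1, 1, 2, 3], "totPeople": [10.0, 5.5, 3.0, 0.0]})

# Group by district and add up the counts; the result is a Series indexed by CD.
sums = toy.groupby("CD")["totPeople"].sum()

# reset_index() moves CD out of the index and back into an ordinary column.
sums_df = sums.reset_index()
print(sums_df)

# Sanity check: grouping should not change the overall total.
print(sums.sum() == toy["totPeople"].sum())
```

Comparing the grand total before and after grouping is a quick way to confirm that no rows were lost.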


# Step 4: Load and Show Total Council Area Populations 

Remember that our goal is to create a table that relates the number of counted homeless people to the total number of residents in each district. To do this, we need to retrieve the data set for the demographics of each council district. This data is calculated from the last United States Census. We are interested in two columns: 'council_district', a number that represents the same council district as in the first file we downloaded above, and 'value', the population of that council district. Let's download that file and save it as a dataframe. 

In [0]:
data2 = pd.read_csv("https://controllerdata.lacity.org/api/views/2ybs-mbdp/rows.csv")

# 1. The below code uses the .loc method to select columns for council district (council_district) 
# and total population (value) and save them to a new variable newdata2. 

newdata2 = data2.loc[:, ['council_district','value']]

# 2. We will also need to rename the council_district column to 'CD' so it matches up with our first dataframe. 

newdata2 = newdata2.rename(columns={"council_district": "CD"})


# 3. the "head" method prints out the first FIVE lines of the dataframe so you can see its rows & columns. 
# .head is a good way to get a feel for the type of data you're working with! 

newdata2.head() 

Unnamed: 0,CD,value
0,1,236931.84
1,10,244936.64
2,11,289385.25
3,12,284395.28
4,13,252322.31
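
Before combining two tables, it can help to confirm that the shared column really lines up. The sketch below is one possible check, not part of the original exercise. The tables `counts` and `population` and their numbers are made up, and the check simply compares the data type and the set of district numbers on each side.

```python
import pandas as pd

# Hypothetical stand-ins for sumdata and the population table.
counts = pd.DataFrame({"CD": [1, 2, 3], "totPeople": [15.5, 3.0, 0.0]})
population = pd.DataFrame({
    "council_district": [1, 3, 2],
    "value": [1000.0, 3000.0, 2000.0],
})

# Rename the key column so both tables call it 'CD'.
population = population.rename(columns={"council_district": "CD"})

# The key columns should have the same type (e.g. both integers, not int vs. text)...
same_dtype = counts["CD"].dtype == population["CD"].dtype

# ...and should list the same districts; this set holds any district found in only one table.
missing = set(counts["CD"]) ^ set(population["CD"])

print(same_dtype, missing)
```

If `missing` is not empty, an inner merge would silently drop those districts.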


# Step 5: Divide Homeless Totals by Total Council Area Populations 

Okay, great! We have one table (sumdata) with each council district number and its homeless count, and a second dataframe (newdata2) with each council district and its total population. The next step is to relate the two to get an idea of the burden on each council district. To do this, we will first merge the datasets on 'CD'. Then, for each council district, we add a column to the merged dataframe that holds the total population divided by the homeless count. This gives us the number of residents in the district for each homeless person counted, so a smaller value means a heavier burden. Note that, despite its name, the 'perCapita' column is residents per homeless person, not homeless people per resident. 

In [0]:
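# 1. Merge the two tables on the shared 'CD' column; an inner merge keeps only districts found in both. 
finaldata = pd.merge(sumdata, newdata2, how='inner', on=['CD'])
# 2. Divide each district's population ('value') by its homeless count ('totPeople'). 
finaldata['perCapita'] = np.divide(finaldata['value'], finaldata['totPeople'])
print(finaldata)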

    CD  totPeople      value   perCapita
0    1   2335.467  236931.84  101.449449
1    2   1221.844  252255.69  206.454908
2    3    535.735  264356.25  493.445920
3    4    630.993  248331.03  393.555919
4    5    695.934  264851.94  380.570485
5    6   2649.900  258000.39   97.362312
6    7   1256.473  260029.70  206.952079
7    8   1775.345  250221.55  140.942493
8    9   2640.257  265957.03  100.731493
9   10   1161.906  244936.64  210.805900
10  11   1908.471  289385.25  151.631987
11  12    591.596  284395.28  480.725495
12  13   2744.496  252322.31   91.937576
13  14   6806.621  236878.34   34.801165
14  15   1511.797  275486.72  182.224677
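
As a worked example of the same arithmetic, the sketch below redoes the merge and division for three districts, with the counts and populations copied from the table above. The column names `residentsPerHomeless` and `homelessPer1000` are hypothetical names chosen for this illustration, and the `validate` argument is an optional extra check that each district appears once per table.

```python
import pandas as pd

# Values for Districts 3, 13 and 14, copied from the table above.
counts = pd.DataFrame({"CD": [3, 13, 14],
                       "totPeople": [535.735, 2744.496, 6806.621]})
population = pd.DataFrame({"CD": [14, 3, 13],
                           "value": [236878.34, 264356.25, 252322.31]})

# validate="one_to_one" makes pandas raise an error if a district is duplicated
# in either table, which would otherwise inflate the merged rows.
merged = pd.merge(counts, population, how="inner", on="CD", validate="one_to_one")

# Same ratio as 'perCapita' above: residents for each homeless person.
merged["residentsPerHomeless"] = merged["value"] / merged["totPeople"]

# The inverse view: homeless people per 1,000 residents (higher means heavier burden).
merged["homelessPer1000"] = merged["totPeople"] / merged["value"] * 1000

print(merged.round(2))
```

Both columns carry the same information. The per-1,000 rate may be easier to compare with statistics published elsewhere, since a larger number means a larger burden.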


# Step 6: What next?

Congratulations, you just learned how to analyze data across multiple data sets! From the above table, we can see that District 14 (Downtown and Boyle Heights, councilman Jose Huizar) is the most impacted, with nearly 35 residents for each homeless person. Unsurprisingly, District 14 includes the high-density homeless area referred to as "Skid Row." District 3 (the San Fernando Valley, councilman Bob Blumenfield) is the least impacted, with approximately 493 residents for each homeless person. Some questions you might ask from here include: 

* What are the underlying reasons why certain districts are more impacted than others? 
* How might you use this data to advocate for more attention to housing issues in Los Angeles? 
* Could you use this data to create a graph, like in [one of our earlier data instructables](https://colab.research.google.com/drive/1G1mSsjMeVQmmPO2Wo0qr88Uy88n9_8ae)? 
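
As one possible starting point for these questions, the sketch below ranks the districts instead of reading the table by eye. The values are the 'perCapita' figures printed in the table above, copied by hand and rounded to two decimals so that the block runs on its own. The variable names are just for illustration.

```python
import pandas as pd

# 'perCapita' values for Districts 1 through 15, rounded from the table above.
per_capita = pd.Series(
    [101.45, 206.45, 493.45, 393.56, 380.57, 97.36, 206.95, 140.94,
     100.73, 210.81, 151.63, 480.73, 91.94, 34.80, 182.22],
    index=pd.Index(range(1, 16), name="CD"),
    name="perCapita",
)

# Sort ascending: the fewest residents per homeless person (heaviest burden) come first.
ranked = per_capita.sort_values()

print(ranked.head(3))  # the three most impacted districts
print(ranked.tail(3))  # the three least impacted districts
```

A sorted Series like `ranked` is also a convenient input for a bar chart.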

Have fun with your data analysis, and come back for the next instructable! 