# This notebook explores data from the LAHSA reports on the 2019 count of homelessness in LA.

Reports are publicly available at: https://www.lahsa.org/news?article=557-2019-greater-los-angeles-homeless-count-results

Note that the data are spread across many reports, each of which is a PDF. There are also census-tract data in CSV format, but that file was generated from a four-day survey, and LAHSA notes that its figures will not add up to the totals seen in other reports.

I'll read in a full report for LA, as well as district-specific reports. Each report will be cleaned up for downstream analysis. Throughout, I'll use pandas to work with dataframes, tabula to convert PDFs to dataframes, and requests to fetch data.

Some of the items to dig into: demographics, relation of an area's wealth/rent rates to homelessness, death rates (if data exist). 



In [None]:
###Data files (examples):

# https://www.lahsa.org/documents?id=3444-2019-greater-los-angeles-homeless-count-council-district-4.pdf
# https://www.lahsa.org/documents?id=3441-2019-greater-los-angeles-homeless-count-council-district-1.pdf

# There are 15 such files, each ending w/ council-district-n.pdf

## First off, let's streamline the intake and digestion of all these PDFs

1. build a list of URLs
2. use the requests package to get the data from these URLs
3. use the tabula package to read data from the URLs into pandas dataframes







In [359]:
import requests
import pandas as pd
import tabula
import matplotlib.pyplot as plt

In [54]:
##note that the URLs listed above are the page URLs that include download links for our files. Here's what the base url
##for a downloadable file looks like:

example_url = "https://www.lahsa.org/item.ashx?id=3441-2019-greater-los-angeles-homeless-count-council-district-1.pdf&dl=true"
base_url = "https://www.lahsa.org/item.ashx?id="
base_url_2 = "-2019-greater-los-angeles-homeless-count-council-district-"
suffix = ".pdf&dl=true"

##build a list of urls to download:
tag_list = [3441, 3442, 3443, 3444, 3445, 3446, 3447, 3649, 3449, 3450, 3451, 3452, 3453, 3454, 3455]

# pair each document id with its council district number (1-15)
url_list = [
    base_url + str(tag) + base_url_2 + str(district) + suffix
    for district, tag in enumerate(tag_list, start=1)
]

url_list

['https://www.lahsa.org/item.ashx?id=3441-2019-greater-los-angeles-homeless-count-council-district-1.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3442-2019-greater-los-angeles-homeless-count-council-district-2.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3443-2019-greater-los-angeles-homeless-count-council-district-3.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3444-2019-greater-los-angeles-homeless-count-council-district-4.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3445-2019-greater-los-angeles-homeless-count-council-district-5.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3446-2019-greater-los-angeles-homeless-count-council-district-6.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3447-2019-greater-los-angeles-homeless-count-council-district-7.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3649-2019-greater-los-angeles-homeless-count-council-district-8.pdf&dl=true',
 'https://www.lahsa.org/item.ashx?id=3449-2019-greater-los-angeles-homeless-coun
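
The document ids are not strictly sequential (district 8 uses 3649), so they can't be generated from a range. Here is a minimal sketch of the same URL construction with an explicit district-to-id mapping, which avoids the index arithmetic. The names `DISTRICT_TAGS` and `district_url` are hypothetical; the ids and URL pieces are copied from the cell above.

```python
# URL pieces copied from the notebook cell above.
BASE_URL = "https://www.lahsa.org/item.ashx?id="
SLUG = "-2019-greater-los-angeles-homeless-count-council-district-"
SUFFIX = ".pdf&dl=true"

# Document ids from tag_list, keyed by council district number (hypothetical name).
DISTRICT_TAGS = {
    1: 3441, 2: 3442, 3: 3443, 4: 3444, 5: 3445,
    6: 3446, 7: 3447, 8: 3649, 9: 3449, 10: 3450,
    11: 3451, 12: 3452, 13: 3453, 14: 3454, 15: 3455,
}


def district_url(district):
    """Return the download URL for one council district report."""
    return f"{BASE_URL}{DISTRICT_TAGS[district]}{SLUG}{district}{SUFFIX}"


# One URL per district, keyed by district number.
urls = {district: district_url(district) for district in sorted(DISTRICT_TAGS)}
```

Keying by district number means a lookup such as `urls[8]` does not depend on the order of a list.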

In [27]:
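## download one pdf as a test (the pdfs/ directory must already exist):

r = requests.get(example_url)
r.raise_for_status()  # stop here if the server returned an HTTP error
with open("pdfs/test.pdf", "wb") as pdf_file:
    pdf_file.write(r.content)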

In [55]:
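#download them all, naming each file after its council district (CD1.pdf ... CD15.pdf):

for district, url in enumerate(url_list, start=1):
    r = requests.get(url)
    r.raise_for_status()  # stop here if the server returned an HTTP error
    with open("pdfs/CD" + str(district) + ".pdf", "wb") as pdf_file:
        pdf_file.write(r.content)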
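
A request can succeed and still return something other than a PDF (an HTML error page, for instance). Below is a minimal sketch of a guard that only writes content that starts with the PDF signature `%PDF-`. The helper name `save_pdf` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files begin with these bytes


def save_pdf(content, path):
    """Write content to path only if it looks like a PDF; return True if written."""
    if not content.startswith(PDF_MAGIC):
        return False
    path = Path(path)
    # create the target folder (e.g. pdfs/) if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return True
```

In the download loop this could stand in for the `open(...)` block, e.g. `save_pdf(r.content, "pdfs/CD1.pdf")`, with a `False` return flagging a district to re-download by hand.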

## confirm that we've downloaded all our files:

In [58]:
%%bash 
cd pdfs
ls

CD1.pdf
CD10.pdf
CD11.pdf
CD12.pdf
CD13.pdf
CD14.pdf
CD15.pdf
CD2.pdf
CD3.pdf
CD4.pdf
CD5.pdf
CD6.pdf
CD7.pdf
CD8.pdf
CD9.pdf
LA_City.pdf
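
`ls` sorts names as text, so CD10.pdf lands before CD2.pdf and a missing district is easy to overlook. Here is a small sketch, assuming the `CD<n>.pdf` naming used above, that reports missing districts and sorts the names numerically. The function names are hypothetical.

```python
import re


def district_number(filename):
    """Return the district number in a name like 'CD12.pdf', or None if absent."""
    match = re.fullmatch(r"CD(\d+)\.pdf", filename)
    return int(match.group(1)) if match else None


def missing_districts(filenames, expected=range(1, 16)):
    """List the expected district numbers that have no matching file."""
    found = {district_number(name) for name in filenames}
    return [number for number in expected if number not in found]


def natural_order(filenames):
    """Sort district files numerically; other files (e.g. LA_City.pdf) go last."""
    def key(name):
        number = district_number(name)
        return (number is None, number or 0, name)
    return sorted(filenames, key=key)
```

Passing the listing above (for example from `os.listdir("pdfs")`) to `missing_districts` should return an empty list when all 15 district files are present.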


## Use tabula to read in our dataframes. 

Note that tabula runs on Java, so for it to work you'll need to install a JDK (downloadable from Oracle, though you'll need to set up a developer account with them).



In [257]:
from tabula import read_pdf

cd4_df = read_pdf("pdfs/CD4.pdf", multiple_tables=True)

In [268]:
#explore the data. Note that the dataframes are a bit of a mess as they're read in. Here I'll split them into two 
#separate dataframes and clean them separately:

la_df = read_pdf("pdfs/LA_City.pdf", multiple_tables=True)


In [388]:
totals, demographics = la_df[0], la_df[1]
demographics

Unnamed: 0,Population,Sheltered,Unsheltered,Total,Prevalence,Percent_Change
0,Gender,,,,,
1,Male,4697.0,19663.0,24360.0,67%,+17% Yes
2,Female,4125.0,6720.0,10845.0,30%,+13% Yes
3,Transgender,102.0,708.0,810.0,2%,+18% No
4,Gender Non‐Conforming,20.0,130.0,150.0,0.4%,+33% No
5,Race/Ethnicity,,,,,
6,American Indian/ Alaska Native,42.0,488.0,530.0,1%,+64% No
7,Asian,53.0,246.0,299.0,1%,‐15% No
8,Black/African American,4716.0,8913.0,13629.0,38%,+11% No
9,Hispanic/ Latino,2864.0,9539.0,12403.0,34%,+15% No


In [383]:
#clean up demographics page:

column_list = ["Population", "Sheltered", "Unsheltered", "Total", "Prevalence", "Percent_Change"]
demographics.columns = column_list
demographics = demographics.dropna()
demographics = demographics.set_index('Population')
# keep only the percentage, dropping anything that follows it in the cell (e.g. "+17% Yes" -> "+17%")
demographics['Percent_Change'] = demographics['Percent_Change'].str.split().str[0]

In [384]:
demographics

Population,Sheltered,Unsheltered,Total,Prevalence,Percent_Change
Male,4697,19663,24360,67%,+17%
Female,4125,6720,10845,30%,+13%
Transgender,102,708,810,2%,+18%
Gender Non‐Conforming,20,130,150,0.4%,+33%
American Indian/ Alaska Native,42,488,530,1%,+64%
Asian,53,246,299,1%,‐15%
Black/African American,4716,8913,13629,38%,+11%
Hispanic/ Latino,2864,9539,12403,34%,+15%
Native Hawaiian/ Other Pacific Islander,36,64,100,0.3%,+30%
White,1078,7145,8223,23%,+18%
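
The Prevalence and Percent_Change columns are still strings at this point, and some of the negative values use a typographic hyphen rather than an ASCII minus sign. Here is a minimal sketch of a converter for downstream analysis. The helper `parse_percent` is hypothetical, and treating "N/A*" as missing is my assumption.

```python
# Hyphen-like characters that PDF extraction may emit instead of ASCII "-".
MINUS_VARIANTS = ("\u2010", "\u2011", "\u2012", "\u2013", "\u2212")


def parse_percent(text):
    """Convert a string such as '+17%' or '0.4%' to a float; None if not numeric."""
    cleaned = text.strip()
    for char in MINUS_VARIANTS:
        cleaned = cleaned.replace(char, "-")
    cleaned = cleaned.rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        # e.g. 'N/A*' in the district reports (assumed to mean "not available")
        return None
```

Applied to a column, this could look like `demographics['Percent_Change'].map(parse_percent)`.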


In [389]:
###build our totals dataframe:

totals = totals.dropna().reset_index()
totals['Population'] = totals[0]
# column 1 holds all five values as one space-separated string
fields = totals[1].str.split()
for position, name in enumerate(["Sheltered", "Unsheltered", "Total", "Prevalence", "Percent_Change"]):
    totals[name] = fields.str[position]
totals = totals.drop(['index', 0, 1, 2], axis=1)
totals = totals.set_index('Population')

In [390]:
totals

Population,Sheltered,Unsheltered,Total,Prevalence,Percent_Change
All Persons,8944,27221,36165,100%,+16%
Individuals (Those not in family units),4636,25993,30629,85%,+17%
Adults (Over 24),4040,24756,28796,80%,+17%
Transition Age Youth (18‐24),596,1237,1833,5%,+17%
Chronically Homeless,967,9117,10084,28%,+21%
Veterans,306,1859,2165,6%,+6%
Unaccompanied Minors (Under 18),21,33,54,0.1%,‐7%
Family Members (Those in family units),4287,1195,5482,15%,+7.3%
Adult Family Members (Over 24 Head of Household),3777,1010,4787,13%,+6%
Young Family Members (18‐24 Head of Household),510,185,695,2%,+20%


In [None]:
###generalize this cleaning up so that we can read in council-district-specific datasets:



In [360]:
cd4_df = read_pdf("pdfs/CD4.pdf", multiple_tables=True)


In [409]:
def clean_df(dataframe):
    '''Clean up the list of dataframes that tabula returns for one report.

    Takes the list returned by read_pdf (totals table first, demographics table second) and
    returns a totals table and a demographics table, each indexed by Population with the
    columns Sheltered, Unsheltered, Total, Prevalence (as a proportion of the total) and
    Percent_Change (in this case, since 2018).'''

    column_list = ["Population", "Sheltered", "Unsheltered", "Total", "Prevalence", "Percent_Change"]

    totals, demographics = dataframe[0], dataframe[1]
    totals = totals.dropna().reset_index()
    totals['Population'] = totals[0]
    # column 1 holds all five values as one space-separated string
    fields = totals[1].str.split()
    for position, name in enumerate(column_list[1:]):
        totals[name] = fields.str[position]
    totals = totals.drop(['index', 0, 1, 2], axis=1)
    totals = totals.set_index('Population')

    demographics = demographics.drop([6], axis=1)
    demographics.columns = column_list
    demographics = demographics.dropna()
    demographics = demographics.set_index('Population')
    # keep only the percentage, dropping anything that follows it in the cell
    demographics['Percent_Change'] = demographics['Percent_Change'].str.split().str[0]

    return totals, demographics


In [401]:
cd4_totals, cd4_demographics = cd4_df[0], cd4_df[1]
column_list = ["Population", "Sheltered", "Unsheltered", "Total", "Prevalence", "Percent_Change"]


In [408]:
cd4_demographics = cd4_demographics.drop([6], axis=1)
cd4_demographics

Unnamed: 0,0,1,2,3,4,5
0,Gender,,,,,
1,Male,15.0,847.0,862.0,73%,+64%
2,Female,15.0,250.0,265.0,22%,+37%
3,Transgender,16.0,37.0,53.0,4%,+10%
4,Gender Non-Conforming,5.0,2.0,7.0,0.6%,-30%
5,Race/Ethnicity,,,,,
6,American Indian/ Alaska Native,0.0,13.0,13.0,1%,-28%
7,Asian,2.0,23.0,25.0,2%,+47%
8,Black/African American,30.0,327.0,357.0,30%,+48%
9,Hispanic/ Latino,9.0,457.0,466.0,39%,+111%


In [394]:
column_list = ["Population", "Sheltered", "Unsheltered", "Total", "Prevalence", "Percent_Change"]
demographics.columns = column_list
demographics = demographics.dropna()
demographics = demographics.set_index('Population')
# keep only the percentage, dropping anything that follows it in the cell
demographics['Percent_Change'] = demographics['Percent_Change'].str.split().str[0]

Unnamed: 0,index,0,1,2,Population,Sheltered,Unsheltered
0,6,All Persons,"51 1,136 1,187 100% +53%",Yes,All Persons,51,1136
1,8,Individuals (Those not in family units),"51 1,128 1,179 99% +62%",Yes,Individuals (Those not in family units),51,1128
2,9,Adults (Over 24),"29 1,020 1,049 88% +75%",No,Adults (Over 24),29,1020
3,10,Transition Age Youth (18-24),22 108 130 11% +1%,No,Transition Age Youth (18-24),22,108
4,11,Chronically Homeless,15 270 285 24% +28%,No,Chronically Homeless,15,270
5,12,Veterans,0 25 25 2% -54%,No,Veterans,0,25
6,13,Unaccompanied Minors (Under 18),0 1 1 0.1% N/A*,No,Unaccompanied Minors (Under 18),0,1
7,14,Family Members (Those in family units),0 7 7 1% -85.4%,Yes,Family Members (Those in family units),0,7
8,15,Adult Family Members,(Over 24 Head of Household) 0 5 5 0% -89%,No,Adult Family Members,(Over,24
9,16,Young Family Members,(18-24 Head of Household) 0 2 2 0% -50%,No,Young Family Members,(18-24,Head
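
Rows 8 and 9 above show where splitting from the left goes wrong: when part of the label spills into the values column, the first tokens are "(Over" and "24" rather than counts. Here is a minimal sketch that reads the five values from the right and folds any leftover tokens back into the label. The helper `split_totals_row` is hypothetical and not used elsewhere in this notebook.

```python
def split_totals_row(label, values, n_fields=5):
    """Split a totals row into (label, fields), reading the fields from the right."""
    tokens = values.split()
    if len(tokens) < n_fields:
        raise ValueError(f"expected at least {n_fields} fields, got {tokens!r}")
    spill = tokens[:-n_fields]    # label text that landed in the values column
    fields = tokens[-n_fields:]   # Sheltered, Unsheltered, Total, Prevalence, Percent_Change
    full_label = " ".join([label, *spill])
    return full_label, fields


label, fields = split_totals_row(
    "Adult Family Members", "(Over 24 Head of Household) 0 5 5 0% -89%"
)
# label  -> 'Adult Family Members (Over 24 Head of Household)'
# fields -> ['0', '5', '5', '0%', '-89%']
```

Rows without spillover, such as "All Persons", pass through with the label unchanged.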


In [411]:
cd4_totals, cd4_demographics = clean_df(cd4_df)

In [413]:
cd4_demographics

Population,Sheltered,Unsheltered,Total,Prevalence,Percent_Change
Male,15,847,862,73%,+64%
Female,15,250,265,22%,+37%
Transgender,16,37,53,4%,+10%
Gender Non-Conforming,5,2,7,0.6%,-30%
American Indian/ Alaska Native,0,13,13,1%,-28%
Asian,2,23,25,2%,+47%
Black/African American,30,327,357,30%,+48%
Hispanic/ Latino,9,457,466,39%,+111%
Native Hawaiian/ Other Pacific Islander,0,1,1,0.1%,N/A*
White,10,286,296,25%,+13%


In [417]:
def retrieve_clean_df(council_district):
    '''Read one report from the pdfs/ folder and clean it with clean_df.

    Takes the file name without its extension (e.g. "CD4") and returns a totals table and a
    demographics table with columns=population, sheltered, unsheltered, total, prevalence
    (as a proportion of total), percent change (in this case, since 2018).'''

    file_name = "pdfs/" + str(council_district) + ".pdf"
    dataframe = read_pdf(file_name, multiple_tables=True)

    # clean_df (defined above) applies the same cleaning steps to the tabula output
    return clean_df(dataframe)

In [418]:
cd7_totals, cd7_demographics = retrieve_clean_df("CD7")

In [421]:
cd7_demographics

Population,Sheltered,Unsheltered,Total,Prevalence,Percent_Change
Male,160,531,691,74%,-14%
Female,88,143,231,25%,-44%
Transgender,4,3,7,1%,-86%
Gender Non-Conforming,2,0,2,0.2%,N/A*
American Indian/ Alaska Native,0,0,0,0%,-100%
Asian,3,0,3,0%,+50%
Black/African American,88,69,157,17%,-6%
Hispanic/ Latino,95,404,499,54%,-37%
Native Hawaiian/ Other Pacific Islander,0,0,0,0.0%,N/A*
White,64,154,218,23%,-22%
