# HUD - The U.S. Department of Housing and Urban Development

## Data set

* We used the dataset published by HUD Exchange: https://www.hudexchange.info/resources/documents/2007-2017-PIT-Counts-by-CoC.XLSX
* From it we took the Point-in-Time (PIT) homeless counts by CoC for 2007 to 2017.

Continuum of Care (CoC)
* The group of community stakeholders involved in the decision-making process is known as the “Continuum of Care”.
  It addresses the needs of homeless people, which include a full range of emergency, transitional, and permanent housing and other services.

## Data Cleanup and Consolidation
* The data set contains national estimates of homelessness by CoC Number and CoC Name, with estimates for different categories of homelessness from 2007 to 2017.
* Estimates of homeless veterans are included from 2011 onward.
* The number of column headers was not the same across worksheets.
* Year information was removed from the column headers, and a new column ‘Year’ was inserted in every worksheet using VBA.
* Data from all worksheets was consolidated with the pandas concat method, and columns containing NaN were removed.
* This left only the data that is available for every year from 2007 to 2017.
* The consolidated data (HUD_Consol_Data.csv) was used to analyse the change in homelessness since 2007.
* Because we used city_time_series from the Kaggle Zillow data set, we had to break the HUD data down by city to map it to the Zillow data.
* The State value was derived from the CoC Number (e.g. AL from AL-500).
* The County name was derived from the CoC Name.
* A new column, CityState, was introduced for mapping HUD data to Zillow data.
* 18 cities of interest were selected (HUD_Cities_Data.csv) for analyzing the impact of housing on homelessness.
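
The header clean-up described above was done in VBA, and that macro is not part of this notebook. As a rough illustration only, the same idea could be sketched in pandas. The header pattern (`"Total Homeless, 2016"`), the helper name `strip_year_from_headers`, and the toy numbers below are all assumptions made for this example, not the actual workbook contents.

```python
import re
import pandas as pd

def strip_year_from_headers(sheet: pd.DataFrame, year: int) -> pd.DataFrame:
    """Remove a trailing ', <year>' from each header and add a Year column.

    Hypothetical helper: assumes headers end with the sheet's year.
    """
    pattern = re.compile(r",?\s*" + str(year) + r"\s*$")
    cleaned = sheet.rename(columns=lambda c: pattern.sub("", str(c)).strip())
    cleaned["Year"] = year
    return cleaned

# Toy worksheets with made-up values, standing in for two yearly sheets
sheets = {
    2016: pd.DataFrame({"CoC Number": ["XX-500"],
                        "Total Homeless, 2016": [10],
                        "Homeless Veterans, 2016": [2]}),
    2007: pd.DataFrame({"CoC Number": ["XX-500"],
                        "Total Homeless, 2007": [12]}),
}

cleaned = [strip_year_from_headers(frame, year) for year, frame in sheets.items()]

# Stack the sheets; columns missing from a sheet are filled with NaN
combined = pd.concat(cleaned, ignore_index=True, sort=False)

# Keep only the columns that have a value in every row (i.e. every year)
common = combined.dropna(axis=1, how="any")
print(list(common.columns))
```

In this toy case the veterans column exists only in the 2016 sheet, so it is dropped and only the columns shared by both years remain.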

In [1]:
# dependencies
import pandas as pd
import os
import matplotlib

In [2]:
# load excel file
file_path = os.path.join('raw_data','2007-2017-PIT-HUD.xlsx')
# read every worksheet into a dict of DataFrames keyed by sheet name
df  = pd.read_excel(file_path, sheet_name=None)

# concatenate all worksheets (ignore_index is a concat option, not a read_excel one)
cdf = pd.concat(df.values(), ignore_index=True)

# display consolidated data
cdf.head()

,Children of Parenting Youth,Chronically Homeless,Chronically Homeless Individuals,Chronically Homeless People in Families,CoC Name,CoC Number,Homeless Individuals,Homeless People in Families,Homeless Unaccompanied Children (Under 18),Homeless Unaccompanied Young Adults (Age 18-24),...,Unsheltered Homeless Individuals,Unsheltered Homeless People in Families,Unsheltered Homeless Unaccompanied Children (Under 18),Unsheltered Homeless Unaccompanied Young Adults (Age 18-24),Unsheltered Homeless Unaccompanied Youth (Under 25),Unsheltered Homeless Veterans,Unsheltered Parenting Youth (Under 25),Unsheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Under 18,Year
0,21.0,116,114,2,Anchorage CoC,AK-500,848,280,8.0,107.0,...,155,0,0.0,14.0,14.0,12,0.0,0.0,0.0,2017
1,18.0,141,102,39,Alaska Balance of State CoC,AK-501,506,211,7.0,40.0,...,139,0,0.0,6.0,6.0,17,0.0,0.0,0.0,2017
2,10.0,92,89,3,"Birmingham/Jefferson, St. Clair, Shelby Counti...",AL-500,932,160,2.0,123.0,...,217,0,0.0,45.0,45.0,11,0.0,0.0,0.0,2017
3,18.0,65,65,0,Mobile City & County/Baldwin County CoC,AL-501,426,180,0.0,18.0,...,247,18,0.0,9.0,9.0,26,0.0,0.0,0.0,2017
4,0.0,24,24,0,Florence/Northwest Alabama CoC,AL-502,155,0,0.0,8.0,...,22,0,0.0,1.0,1.0,8,0.0,0.0,0.0,2017


In [3]:
# drop columns that have NaN (dropna returns a new DataFrame, so assign the result)
cdf = cdf.dropna(axis=1, how='any')

# Add State and County
cdf["State"] = cdf["CoC Number"].apply(lambda x: x.split('-')[0])
cdf["County"] = cdf["CoC Name"].apply(lambda x: x.split('CoC')[0])

cdf["County"] = cdf["County"].apply(lambda x: x.replace("County",''))
cdf["County"] = cdf["County"].apply(lambda x: x.replace("City",''))
cdf["County"] = cdf["County"].apply(lambda x: x.replace("&",''))
cdf["County"] = cdf["County"].apply(lambda x: x.replace("Continuum of Care",''))
cdf["County"] = cdf["County"].apply(lambda x: x.replace("Balance of State",''))
cdf["County"] = cdf["County"].apply(lambda x: x.replace("Metropolitan Denver Homeless Initiative",'Denver'))
cdf["County"] = cdf["County"].apply(lambda x: x.replace("Salt Lake",'Salt Lake City'))
cdf["County"] = cdf["County"].apply(lambda x: x.split('/')[0])
cdf["County"] = cdf["County"].apply(lambda x: x.strip())

cdf["CityState"] = cdf["County"] + ", " + cdf["State"]

# Select columns of Interest
cdf = cdf[[ 'Year','CityState', 'State', 'County', 'Total Homeless', 'Sheltered Homeless', 'Unsheltered Homeless',
       'Homeless Individuals', 'Sheltered Homeless Individuals', 'Unsheltered Homeless Individuals', 
       'Homeless People in Families', 'Sheltered Homeless People in Families', 'Unsheltered Homeless People in Families',
       'Chronically Homeless Individuals', 'Sheltered Chronically Homeless Individuals', 
       'Unsheltered Chronically Homeless Individuals' ]]

# set the index to Year
cdf = cdf.set_index(['Year'])

# Check record count before writing data into a csv file
cdf.count()

CityState                                       4358
State                                           4358
County                                          4358
Total Homeless                                  4358
Sheltered Homeless                              4358
Unsheltered Homeless                            4358
Homeless Individuals                            4358
Sheltered Homeless Individuals                  4358
Unsheltered Homeless Individuals                4358
Homeless People in Families                     4358
Sheltered Homeless People in Families           4358
Unsheltered Homeless People in Families         4358
Chronically Homeless Individuals                4358
Sheltered Chronically Homeless Individuals      4358
Unsheltered Chronically Homeless Individuals    4358
dtype: int64
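
To make the County derivation in cell 3 easier to follow, the same sequence of replacements can be traced on a single string. The helper name `coc_name_to_county` is introduced here for illustration only. The sample names are taken from the `cdf.head()` output above.

```python
def coc_name_to_county(coc_name: str) -> str:
    """Apply the same steps as cell 3 to one CoC Name."""
    # keep the text before 'CoC'
    county = coc_name.split("CoC")[0]
    # remove generic words, in the same order as in cell 3
    for token in ("County", "City", "&", "Continuum of Care", "Balance of State"):
        county = county.replace(token, "")
    # special cases handled in cell 3
    county = county.replace("Metropolitan Denver Homeless Initiative", "Denver")
    county = county.replace("Salt Lake", "Salt Lake City")
    # keep the first place name when several are separated by '/'
    return county.split("/")[0].strip()

samples = {
    "AK-500": "Anchorage CoC",
    "AK-501": "Alaska Balance of State CoC",
    "AL-501": "Mobile City & County/Baldwin County CoC",
    "AL-502": "Florence/Northwest Alabama CoC",
}

for coc_number, coc_name in samples.items():
    state = coc_number.split("-")[0]
    print(coc_number, "->", coc_name_to_county(coc_name) + ", " + state)
```

When a CoC covers several places, only the first name before the `/` survives, so "Mobile City & County/Baldwin County CoC" is mapped to "Mobile".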

In [4]:
cdf.to_csv("HUD_Consol_Data.csv")

In [5]:
# Filter data based on Cities of Interest
Cities = ["New York", "Philadelphia", "Boston", "Washington", "Chicago", "Minneapolis",
        "Denver", "Salt Lake City", "Seattle", "Los Angeles", "San Francisco", "Miami", 
        "Charlotte", "Atlanta", "Detroit", "Anchorage", "Honolulu", "Indianapolis"]

hud_cities_data = cdf[cdf["County"].isin(Cities)]
hud_cities_data = hud_cities_data.reset_index()

hud_cities_data = hud_cities_data.sort_values(by=["CityState","Year"])

hud_cities_data = hud_cities_data[[ 'Year','CityState', 'Total Homeless', 'Sheltered Homeless', 'Unsheltered Homeless',
       'Homeless Individuals', 'Sheltered Homeless Individuals', 'Unsheltered Homeless Individuals', 
       'Homeless People in Families', 'Sheltered Homeless People in Families', 'Unsheltered Homeless People in Families',
       'Chronically Homeless Individuals', 'Sheltered Chronically Homeless Individuals', 
       'Unsheltered Chronically Homeless Individuals' ]]

hud_cities_data = hud_cities_data.set_index(['Year'])

hud_cities_data.head()

Year,CityState,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,Sheltered Homeless People in Families,Unsheltered Homeless People in Families,Chronically Homeless Individuals,Sheltered Chronically Homeless Individuals,Unsheltered Chronically Homeless Individuals
2007,"Anchorage, AK",974,842,132,696,589,107,278,253,25,224,187,37
2008,"Anchorage, AK",1023,921,102,695,596,99,328,325,3,300,239,61
2009,"Anchorage, AK",1267,1110,157,821,689,132,446,421,25,198,152,46
2010,"Anchorage, AK",1231,1113,118,740,633,107,491,480,11,56,43,13
2011,"Anchorage, AK",1223,1082,141,794,677,117,429,405,24,112,94,18


In [6]:
hud_cities_data.to_csv("HUD_Cities_Data.csv")
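
The CityState column exists so that HUD rows can be joined to Zillow rows. The join itself is not part of this notebook, so the following is only a minimal sketch of how such a mapping might look. The `zillow_yearly` frame, its `MedianHomeValue` column, and its values are hypothetical placeholders. The two HUD rows reuse the Anchorage totals shown above.

```python
import pandas as pd

# Two HUD rows copied from the HUD_Cities_Data output above
hud = pd.DataFrame({
    "Year": [2007, 2008],
    "CityState": ["Anchorage, AK", "Anchorage, AK"],
    "Total Homeless": [974, 1023],
})

# Hypothetical Zillow frame, already aggregated to one row per city and year.
# The column name and values are placeholders, not real Zillow data.
zillow_yearly = pd.DataFrame({
    "Year": [2007, 2008],
    "CityState": ["Anchorage, AK", "Anchorage, AK"],
    "MedianHomeValue": [1.0, 2.0],
})

# Join on the shared keys; an inner join keeps only city-years present in both
merged = hud.merge(zillow_yearly, on=["CityState", "Year"], how="inner")
print(merged)
```

An inner join silently drops city-years that are missing from either side, so comparing `len(merged)` with `len(hud)` is a quick way to spot CityState values that did not match.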