# Applied Data Science Capstone

This notebook contains my study for the IBM Applied Data Science Capstone Project on Coursera. In it, I take the role of the owner of an existing, successful brick-and-mortar coffee shop near Arizona State University in Tempe, AZ, who wants to find a neighborhood in or near the City and County of Denver, CO in which to expand. Using demographic data and location data from a Location-Based Social Network (LBSN), I will use K-Means clustering to find areas in Denver that resemble my current location. I hope this will narrow my search so that I can focus on specific sites and available real estate.

## Background

I identify my shop with the small coffee shops that took root, mostly in immigrant neighborhoods, in the United States during the 1950s and '60s. While mine is in a modern, suburban area, people with some leisure time and some money to spare still visit my store to experience single-origin, house-roasted coffee. My clientele is a mixture of tourists, students and nearby professionals, along with connoisseurs who drive to my shop because they like that I'm not a corporate store and that I don't require mass appeal to stay in business.

I've been looking to expand my business outside of the Phoenix metro area. I recently met a coffee roaster based in Colorado who said time and again how important "local" is in Colorado. I left the conversation thinking that I'd love to buy his coffee but would have trouble marketing and selling it in my shop in Tempe. It got me thinking that I should explore expanding into a new market, where I could reinvent portions of my business and really emphasize sustainability and the "local" nature of Colorado-roasted coffee.

I remembered that WalletHub recently listed Denver as the fifth-fastest-growing large city and named several other cities near Denver as high-growth cities as well. Denver is also beautiful, is home to numerous universities, supports a wide variety of outdoor activities and has a rich and vibrant arts community. Numerous airlines also fly between Phoenix Sky Harbor and Denver International airports. These attributes make Denver a viable location for me to expand.

## Intended Audience

While this study is limited in scope, namely the expansion of a single business, and may not have broad appeal, it provides a working example of how data can give a business actionable intelligence. If I were the business owner, it would help me narrow my search to specific areas of Denver that might feel familiar and in which I could find similar customers, freeing me to focus on other aspects of expansion, such as staffing, real estate, and a modified business plan.


## Data Used in the Study

To solve the stated problem, I combine data from multiple sources:

* A list of target zip codes, mostly in the City and County of Denver, CO, plus the zip code for Tempe, AZ
* Demographic information for all the zip codes
* Venue and attraction data from an LBSN

The rest of this notebook relies on a few imports, which the following code puts in place. Note that `capstoneutils` is a module I wrote myself.

In [2]:
# Get imports in place. Note that I'm using a module I created for all the utility
# functions, which will be linked in the final project
import capstoneutils as csutil
import pandas as pd

### Zip Codes

There were a number of sites with lists of zip codes for the City and County of Denver. Some, like Zillow, required me to pass a CAPTCHA and are clearly not friendly to scraping. Others, on inspection, were incomplete or out of date, having been based on 2010 Census data. I settled on using the data from Zip-Codes.com for the list of zip codes in the City and County of Denver. Note that I'll also be manually including two other zip codes: `80221`, because it includes Regis University and spans both Denver and Adams County, and `80302`, because it includes most of Boulder and the University of Colorado, a university in the same NCAA conference as Arizona State in Tempe.

The following code produces the list of zip codes I will be using and displays the last few to show the additions.

In [3]:
# Scrape the page with the list of zip codes in the City and County of Denver
den_zips = csutil.scrape_zipcodes()

# Here we include our extra two zip codes
extras = pd.DataFrame([{'Zip Code': 80221},{'Zip Code': 80302}])
# DataFrame.append was removed in pandas 2.0, so concatenate instead
den_zips = pd.concat([den_zips, extras], ignore_index=True)
den_zips.tail()

| | Zip Code |
| :--- | :--- |
| 29 | 80293 |
| 30 | 80294 |
| 31 | 80299 |
| 32 | 80221 |
| 33 | 80302 |
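
The `scrape_zipcodes` helper lives in `capstoneutils` and is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below pulls five-digit zip codes out of an HTML table using only the standard library and then adds the two manual extras. The markup in `SAMPLE_HTML` and the function name `extract_zipcodes` are assumptions made for illustration; they are not the actual Zip-Codes.com page layout or the real helper.

```python
import re

# Hypothetical markup: the real page layout is not shown in this notebook
SAMPLE_HTML = """
<table>
  <tr><td>80293</td><td>Denver</td></tr>
  <tr><td>80294</td><td>Denver</td></tr>
  <tr><td>80299</td><td>Denver</td></tr>
</table>
"""


def extract_zipcodes(html):
    """Return the unique five-digit codes found as element text, in page order."""
    found = re.findall(r">(\d{5})<", html)
    # dict.fromkeys drops duplicates while preserving first-seen order
    return list(dict.fromkeys(int(z) for z in found))


zips = extract_zipcodes(SAMPLE_HTML)
# Manually add the two zip codes outside the scraped list
zips += [z for z in (80221, 80302) if z not in zips]
print(zips)  # [80293, 80294, 80299, 80221, 80302]
```

A regular expression is enough for a toy table like this one; a real page would more likely call for a proper HTML parser.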


### Demographics

I am using HomeTownLocator.com to obtain the demographics for the zip codes on my list and for Tempe, AZ. After comparing it with other sites and with sources available for purchase, I found that it had the most scraping-friendly pages and offered some interesting features, such as a *diversity index*: essentially the probability that two people chosen at random from an area belong to different racial or ethnic groups. So it's a measure of how diverse a community is, not of which ethnic groups make it up.
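
A diversity index of this kind is commonly computed as the probability that two randomly chosen residents belong to different groups, scaled to 0-100, so higher values mean more diversity. The sketch below shows that calculation on made-up group counts. The function name `diversity_index` and the sample populations are assumptions for illustration, and the site's exact methodology may differ, for example in how it defines the groups.

```python
def diversity_index(group_counts):
    """Probability (0-100) that two residents drawn at random, with
    replacement, belong to different groups: 100 * (1 - sum(p_i ** 2))."""
    total = sum(group_counts.values())
    if total == 0:
        raise ValueError("group_counts must contain at least one resident")
    # Chance that both draws land in the same group
    same_group = sum((n / total) ** 2 for n in group_counts.values())
    return 100 * (1 - same_group)


# Made-up populations, not real census figures
uniform = {"A": 1000}
even_split = {"A": 500, "B": 500}
four_way = {"A": 250, "B": 250, "C": 250, "D": 250}

print(diversity_index(uniform))     # 0.0
print(diversity_index(even_split))  # 50.0
print(diversity_index(four_way))    # 75.0
```

On this scale, a community made up of a single group scores 0, and the score rises as residents are spread more evenly across more groups.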

The site provides 19 features per zip code, divided into four categories: Population, Housing, Income and Households. The following table summarizes the features in each category.

| Category | Features |
| :--------- | :---|
| Population | Total population; population in households, families and group quarters; population density; diversity index |
| Housing | Total housing units (owner-occupied, renter-occupied and vacant) and median and average home values |
| Income | Median and average household income and per capita income |
| Households | Total households, average household size, family households and average family size |

The following code produces the sample demographics data for Tempe, AZ.

In [4]:
tempe_demog = pd.DataFrame(csutil.scrape_demographics(85281, 'arizona'), index=[0])
tempe_demog

| | ZipCode | Total Population | Population in Households | Population in Familes | Population in Group Qrtrs | Population Density2 | Diversity Index3 | Median Household Income | Average Household Income | Per Capita Income | Total Housing Units | Owner Occupied HU | Renter Occupied HU | Vacant Housing Units | Median Home Value | Average Home Value | Total Households | Average Household Size | Family Households | Average Family Size |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 85281 | 70074 | 58682 | 26833 | 11392 | 5453 | 74 | 36193 | 47853 | 20533 | 31443 | 5416 | 21935 | 4092 | 197426 | 224072 | 27351 | 2.15 | 8653 | 3 |


### Location Data

I am using a Location-Based Social Network (LBSN) called Foursquare to obtain information about the venues and attractions in each of the zip codes and how they are categorized. Combined with the demographic data, this will hopefully give me a good basis on which to cluster and find the zip codes that are most similar to Tempe, AZ.

The following code shows an example of the venue and attraction data for Tempe, AZ.

In [5]:
# Geocode Tempe using the Google API
tempe_lat, tempe_lon = csutil.get_latlon('85281')
print('85281, Tempe, AZ has coordinates {},{}'.format(tempe_lat, tempe_lon))

# Now get the top 100 venues in Tempe and show an example of that data
tempe_venues = csutil.getNearbyVenues(names=['85281'],
                                      latitudes=[tempe_lat],
                                      longitudes=[tempe_lon])
tempe_venues.head()

85281, Tempe, AZ has coordinates 33.4366655,-111.9403254


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 85281 | 33.436665 | -111.940325 | Tempe Town Lake | 33.433304 | -111.936264 | Lake |
| 1 | 85281 | 33.436665 | -111.940325 | Tempe Beach Park | 33.431625 | -111.942087 | Park |
| 2 | 85281 | 33.436665 | -111.940325 | AC Hotel by Marriott | 33.430929 | -111.937336 | Hotel |
| 3 | 85281 | 33.436665 | -111.940325 | Culinary Dropout at Farmer Arts District | 33.429122 | -111.94394 | Gastropub |
| 4 | 85281 | 33.436665 | -111.940325 | The Yard | 33.429118 | -111.943979 | Bar |
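
Before venue data like this can feed a clustering step, each zip code has to be reduced to a fixed set of numeric features. One common approach, sketched here as an assumption rather than as the method this study settles on, is to compute the share of each venue category within a zip code. The function name `category_profiles` is hypothetical, and the input uses only the five sample rows shown above.

```python
from collections import Counter, defaultdict

# (zip code, venue category) pairs taken from the sample rows above
venues = [
    ("85281", "Lake"),
    ("85281", "Park"),
    ("85281", "Hotel"),
    ("85281", "Gastropub"),
    ("85281", "Bar"),
]


def category_profiles(rows):
    """Map each zip code to {category: share of that zip code's venues}."""
    counts = defaultdict(Counter)
    for zip_code, category in rows:
        counts[zip_code][category] += 1

    profiles = {}
    for zip_code, counter in counts.items():
        total = sum(counter.values())
        profiles[zip_code] = {cat: n / total for cat, n in counter.items()}
    return profiles


profiles = category_profiles(venues)
print(profiles["85281"]["Park"])  # 0.2
```

Using shares rather than raw counts keeps zip codes with many venues comparable to those with few. With the full data in a DataFrame, the same profile could be built with `pd.get_dummies` followed by a group-by mean.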
