<h1>IBM Data Science Capstone</h1>



<i>This Jupyter Notebook contains the <b>Neighborhood Capstone Project Part I</b> for the IBM Data Science Course 9.</i>


## ABSTRACT
This capstone project aims to determine the ideal location for a new fitness venue. To do this, I must determine where current fitness centers and similar businesses are located within the neighborhoods of Arlington County, VA. I will use an open-source list of neighborhood locations, fitness venue data from the Foursquare API, and a k-means clustering algorithm to group similar venues together.


## BACKGROUND

Amazon has officially chosen Arlington, VA, for its East Coast second headquarters (HQ2), prompting an influx of young, technologically savvy white-collar workers and causing property prices (business, housing, and commercial) to skyrocket. An investor has shown interest in my fitness center brand, which targets busy but deeply connected 20-somethings, and wants to decide on a location before prices rise beyond affordability.

My fitness brand focuses on specialty classes such as yoga, cycling, and MMA, in addition to classic cardio machines (treadmills, ellipticals, stair-steppers) and free weights with personal trainers. It will also be open 24/7 with keycode access and offer an onsite cafe and smoothie bar catering to the trendiest diets (keto, vegan, raw, etc.), plus 5G with plenty of charging stations. The real draw is to make it the most Instagrammable gym in the county (complete with flattering lighting, hip workout rooms, and polished equipment).

To prove to my future business partner that this will be a worthy investment, I must determine a good location for my new fitness center. Ideally, the fitness center should be located in a densely populated area to attract as many gym members as possible. However, to minimize competition, it must not be located too close to top-rated fitness centers or in an area that is already saturated with similar venues.


## DATA DESCRIPTION
The following data sets were used for this project:

### 1. Arlington, VA neighborhood data 
This data was extracted from the list at https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Arlington_County,_Virginia using Beautiful Soup and then transformed into a pandas dataframe.  
*High-density areas were noted to be: Rosslyn, Courthouse, Ballston, Pentagon City, and Crystal City*

In [2]:
import requests

#Importing the 'BeautifulSoup' library
from bs4 import BeautifulSoup

#Downloading the webpage's html code as text
html = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Arlington_County,_Virginia').text

#Designating webpage's html code as 'soup'
soup = BeautifulSoup(html,'lxml')

In [17]:
#Selects the first list (<ul>) in the webpage
ArlingtonList = soup.find('ul')

#Prints the text of every list item (<li>) inside that list
for li in ArlingtonList.find_all('li'):
    print(li.text)

Alcova Heights
Arlington Forest
Arlington Heights
Arlington Ridge
Arlington View / Johnson's Hill
Ashton Heights
Aurora Hills
Ballston
Barcroft
Bellevue Forest
Bluemont
Bonair
Boulevard Manor
Brandon Village
Buckingham
Carlin Springs
Cherrydale
Claremont
Clarendon
Columbia Forest
Columbia Heights
Country Club Hills
Crescent Hills
Crystal City
Crystal Gateway
Dominion Hills
Donaldson Run
Douglas Park
East Falls Church
Fairlington
Forest Glen
Forest Hills
Fort Myer Heights
Glencarlyn
Garden City
Gates of Ballston
Greenbrier
High View Park / Halls Hill
Jackson Court
Lacey Forest
Lauderdale
Lee Heights
Lyon Park
Madison Manor
Maywood
New Dover
Nauck (Green Valley A.K.A. The Valley)
Old Glebe
Over Lee Knolls
Penrose
Pentagon City
Prospect House
Randolph Square
Rivercrest
Rosslyn
Shirlington Crest
Station Square
Tara
Waycroft-Woodlawn
Waverly Hills
Westmont
Westover
Willet Heights
Williamsburg
Williamsburg Village
Yorktown


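In [14]:
#Import pandas library
import pandas as pd

# Collects the text of each list item into a plain Python list
neighborhoods = [li.text.strip() for li in ArlingtonList.find_all('li')]

# Creates dataframe 'df' with the column label "Neighborhoods"
df = pd.DataFrame(neighborhoods, columns=['Neighborhoods'])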


Then, I used Google Maps to get the coordinates of each neighborhood. The original data was cleaned by updating names from the original Wikipedia list to match Google's, removing entries that turned out to be elsewhere in Virginia rather than in Arlington County, and formatting longitude coordinates to read "-77.XXXX" rather than "77.XXXX° W".
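The coordinate reformatting described above can be sketched as a small helper. This is only an illustration: the function name `to_signed_degrees` and the sample strings are assumptions, not the code or values actually used in this project.

```python
import re

def to_signed_degrees(text):
    """Convert a string such as '77.0910° W' into a signed decimal number."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*°?\s*([NSEW])?\s*", text)
    if match is None:
        raise ValueError(f"Unrecognized coordinate: {text!r}")
    value = float(match.group(1))
    hemisphere = match.group(2)
    # West longitudes and south latitudes are negative by convention.
    if hemisphere in ("W", "S"):
        value = -abs(value)
    return value

# Hypothetical sample strings, chosen only to show the two cases.
print(to_signed_degrees("77.0910° W"))
print(to_signed_degrees("38.8816° N"))
```

Values that are already signed decimals (for example "-77.0910") pass through unchanged, so the helper can be applied to a whole column without first checking which rows still need conversion.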

**NOTE:** For a more in-depth look into the nuances of gym locations, I chose to use locally recognized neighborhoods (66 originally, 53 after cleaning) rather than ZIP codes (some sources list 28, others 11). This list will hopefully provide greater insight into the distribution of gyms in the area.


### 2. Fitness center proximities and types in Arlington, VA
This data was extracted using the Foursquare API.

## HOW THE DATA WAS USED

The neighborhoods are relatively small, so I used a 1 km radius from each longitude/latitude center point while searching for fitness centers and all fitness-related venues using the Foursquare API.  
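A 1 km radius search around one center point might be assembled as in the sketch below. The notebook does not say which Foursquare endpoint or parameters were used, so everything here is an assumption: the legacy v2 `venues/search` endpoint, the `query="gym"` term, the version date, the placeholder credentials, and the sample coordinates. The sketch only builds the request URL; it does not call the API.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare v2 venue search. The source does not
# state which endpoint or parameters were actually used.
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(lat, lng, client_id, client_secret,
                     query="gym", radius_m=1000, limit=50, version="20180605"):
    """Build a search URL for venues within radius_m meters of a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",    # center point as "latitude,longitude"
        "radius": radius_m,      # search radius in meters (1 km here)
        "query": query,          # hypothetical search term
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials and illustrative coordinates.
url = build_search_url(38.8816, -77.0910, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL could then be fetched with `requests.get(url).json()` once per neighborhood. With a 1 km radius around closely spaced center points, the same venue can show up for several neighborhoods, which is worth keeping in mind when comparing counts.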

**I wanted to know**  
Which neighborhoods have the highest number of fitness venues?  
Which neighborhoods have the highest-rated fitness venues?  
How many venues of each type are in each neighborhood?  
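The first and last of these questions come down to counting rows in a table of venues. The sketch below shows one way to do that with pandas, assuming a dataframe with one row per venue and the hypothetical columns `Neighborhood` and `Venue Category`. The rows are placeholders for illustration only, not results from this project, and ratings are not covered.

```python
import pandas as pd

# Placeholder rows for illustration only; these are not results from the project.
venues = pd.DataFrame({
    "Neighborhood": ["Ballston", "Ballston", "Ballston", "Rosslyn", "Rosslyn", "Clarendon"],
    "Venue Category": ["Gym", "Yoga Studio", "Gym", "Gym", "Cycle Studio", "Yoga Studio"],
})

# Total number of fitness venues per neighborhood, largest first.
venue_counts = venues.groupby("Neighborhood").size().sort_values(ascending=False)

# Number of venues of each category in each neighborhood (one row per neighborhood).
category_counts = pd.crosstab(venues["Neighborhood"], venues["Venue Category"])

print(venue_counts)
print(category_counts)
```

A per-neighborhood table of category counts like `category_counts` is also a convenient input for k-means clustering, since each neighborhood becomes one numeric row.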

**Other assumptions**  
I did not create a choropleth map of the neighborhoods to check for overlapping areas or for parts of the county not covered by any neighborhood (that is, blank areas). I assumed that an area not represented by a neighborhood would not be an ideal place for a new fitness center, so it did not matter that such areas were missing from my list of candidate locations.

When cleaning the data, I discarded a neighborhood if it was  
1) Unrecognized by Google  
2) Mistaken for an apartment complex  
3) Mistaken for a road
