<h1 align=center><font size = 5>Segmenting and Clustering MARTA Rail Stations in Atlanta</font></h1>

## Abstract

In this report, we use the Foursquare API and the Folium library to explore and visualize data for venues around MARTA rail stations in Atlanta. First, we get the most common venues near each station. Next, we explore venue categories for each station. Finally, we use this data to group the stations into clusters using *k*-means and hierarchical clustering algorithms.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px" >

<font size = 3>

<b>[Introduction/Business Problem](#item000)</b>
<p>
<b>[1. Download and Explore MARTA Data](#item100)</b> 
<p>
    [1.0 Wikipedia Data](#item101)    
                                      
    [1.1 Folium MARTA Map](#item110)  
    
    [1.2 Foursquare Data](#item120)      
<p>
<b>[2. Methodology](#item200)</b>  
<p>
<b>[Results](#item500)</b>    
<p>            
<b>[Discussion](#item600)</b>   
<p>            
<b>[Conclusion](#item700)</b>  
</font>
</div>

## Introduction/Business Problem  <a class="anchor" id="item000"></a>


In this report, we explore the venues associated with each of the 38 MARTA rail stations in Atlanta, Georgia.  

This might help us answer the following questions.

* Why do people use MARTA?  
* What attracts them to each MARTA station and makes them go there?  
* Where do they go before/after riding MARTA?   
* Where do they spend the most time (and money)?  
* Are MARTA stations similar or dissimilar with respect to the venues near them?  
* Can we label each MARTA station with venue categories that are specific to this station?  

In our analysis, we use the Foursquare API to get the most popular venues near each MARTA station, *k*-means and hierarchical clustering to group similar stations, and the Folium library to visualize the results.

MARTA station names and their coordinates can be extracted from Wikipedia: https://en.wikipedia.org/wiki/MARTA_rail

## 1. Download and Explore MARTA Data  <a class="anchor" id="item100"></a>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed yet
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed yet
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### 1.0 Wikipedia Data <a class="anchor" id="item101"></a>   

Atlanta's MARTA rail network has a total of 38 stations. To segment and explore the stations, we need a dataset that lists the MARTA stations together with the latitude and longitude coordinates of each one. 

Luckily, this data is freely available on Wikipedia: https://en.wikipedia.org/wiki/MARTA_rail
The station table can be downloaded and converted to an Excel file.

In [2]:
stations = pd.read_excel('MARTA_stations.xlsx')
print(stations.shape)
stations.head()

(38, 8)


Unnamed: 0,Station,Code,Jurisdiction,Opened,Station Entries/Day (2013),Coordinates,Latitude,Longitude
0,Airport,S7,College Park,1988-06-18,9173,33.640758°N 84.446341°W,33.640758,-84.446341
1,Arts Center,N5,Atlanta,1982-12-18,6605,33.789705°N 84.387789°W,33.789705,-84.387789
2,Ashby,W3,Atlanta,1979-12-22,1791,33.756346°N 84.417556°W,33.756346,-84.417556
3,Avondale,E7,Decatur,1979-06-30,4327,33.775277°N 84.281903°W,33.775277,-84.281903
4,Bankhead,P4,Atlanta,1992-12-12,1903,33.77189°N 84.42884°W,33.77189,-84.42884
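The `Latitude` and `Longitude` columns above look like signed decimal versions of the `Coordinates` string. The notebook does not show how they were produced, so the following is only a minimal sketch of one way to do it; the helper name `parse_coordinates` and the regular expression are assumptions, not part of the original workflow.

```python
import re

def parse_coordinates(text):
    """Convert a string like '33.640758°N 84.446341°W' to signed (lat, lon) floats."""
    match = re.fullmatch(r"\s*([\d.]+)°([NS])\s+([\d.]+)°([EW])\s*", text)
    if match is None:
        raise ValueError(f"Unrecognized coordinate format: {text!r}")
    lat_value, lat_hemi, lon_value, lon_hemi = match.groups()
    # South and West hemispheres are represented by negative values
    latitude = float(lat_value) * (1 if lat_hemi == "N" else -1)
    longitude = float(lon_value) * (1 if lon_hemi == "E" else -1)
    return latitude, longitude

# Coordinates string copied from the Airport row of the table above
print(parse_coordinates("33.640758°N 84.446341°W"))
```

With pandas, a helper like this could be applied to the `Coordinates` column row by row, for example through `Series.apply`.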


### 1.1 Folium map of Atlanta <a class="anchor" id="item110"></a>   

Now we can create a map of Atlanta with MARTA stations superimposed on top.

We will use the geopy library to get the latitude and longitude values of Downtown Atlanta. To create an instance of the geocoder, we need to define a user_agent. We will name our agent <em>atlanta_explorer</em>, as shown below.

In [3]:
address = 'Downtown Atlanta, GA'

geolocator = Nominatim(user_agent="atlanta_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Atlanta are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Atlanta are 33.7509748, -84.3930464.


Let's visualize the MARTA stations, setting each marker's radius proportional to "Station Entries/Day (2013)".

In [4]:
# create map of Downtown Atlanta using latitude and longitude values
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label, num  in zip(stations['Latitude'], stations['Longitude'], stations['Station'], stations['Station Entries/Day (2013)']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=num / 1000,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_downtown)  
    
map_downtown

**Folium** is a great visualization library. Feel free to zoom into the above map and click on each circle marker to reveal the name of the station.
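The map above uses `radius=num / 1000`, so the marker radius grows linearly with ridership and the circle area grows quadratically. As an alternative (not what the notebook does), the sketch below scales the radius with the square root of ridership so that the area is proportional to the entry count. The function name `marker_radius` and the pixel limits are hypothetical choices.

```python
import math

def marker_radius(entries_per_day, max_entries, max_radius=12.0, min_radius=2.0):
    """Radius in pixels whose circle area grows linearly with ridership."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    share = max(entries_per_day, 0) / max_entries
    # sqrt keeps area (not radius) proportional; min_radius keeps small stations visible
    return max(min_radius, max_radius * math.sqrt(share))

# Ridership values copied from the first rows of the stations table;
# 9173 is only the largest value among these sample rows, not the network maximum.
for name, entries in [("Airport", 9173), ("Ashby", 1791), ("Bankhead", 1903)]:
    print(name, round(marker_radius(entries, max_entries=9173), 1))
```

In the Folium loop, the result could be passed as the `radius` argument of `folium.CircleMarker` in place of `num / 1000`.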

In [5]:
print(stations.shape[0])
stations.head()

38


Unnamed: 0,Station,Code,Jurisdiction,Opened,Station Entries/Day (2013),Coordinates,Latitude,Longitude
0,Airport,S7,College Park,1988-06-18,9173,33.640758°N 84.446341°W,33.640758,-84.446341
1,Arts Center,N5,Atlanta,1982-12-18,6605,33.789705°N 84.387789°W,33.789705,-84.387789
2,Ashby,W3,Atlanta,1979-12-22,1791,33.756346°N 84.417556°W,33.756346,-84.417556
3,Avondale,E7,Decatur,1979-06-30,4327,33.775277°N 84.281903°W,33.775277,-84.281903
4,Bankhead,P4,Atlanta,1992-12-12,1903,33.77189°N 84.42884°W,33.77189,-84.42884


### 1.2 Foursquare Data <a class="anchor" id="item120"></a>   

Next, we are going to start utilizing the Foursquare API to explore the stations and segment them.

#### Define Foursquare Credentials and Version

CLIENT_ID = '???????????' # your Foursquare ID  
CLIENT_SECRET = '??????????????' # your Foursquare Secret  
VERSION = '20180605' # Foursquare API version  

In [6]:
# @hidden_cell
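The credentials cell is hidden in the published notebook. One common way to keep secrets out of a shared notebook is to read them from environment variables; the sketch below assumes the hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are not defined anywhere in the original.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping of environment variables."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")          # hypothetical variable name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")  # hypothetical variable name
    missing = [name for name, value in
               [("FOURSQUARE_CLIENT_ID", client_id),
                ("FOURSQUARE_CLIENT_SECRET", client_secret)] if not value]
    if missing:
        # Fail early with a clear message instead of sending empty credentials
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return client_id, client_secret
```

Once both variables are set in the shell, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would populate the names used in the request cells.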



#### Let's explore the first station in our dataframe

Get the station's name.

In [7]:
stations.loc[0, 'Station']

'Airport'

Get the station's latitude and longitude values.

In [8]:
station_latitude = stations.loc[0, 'Latitude'] # station latitude value
station_longitude = stations.loc[0, 'Longitude'] # station longitude value
station_name = stations.loc[0, 'Station'] # station name
print('Latitude and longitude values of {} are {}, {}.'.format(station_name, 
                                                               station_latitude, 
                                                               station_longitude))

Latitude and longitude values of Airport are 33.640758, -84.446341.


Now, let's get the top 100 venues within a radius of 500 meters of the Airport station.

First, let's create the GET request URL and store it in a variable named **url**.

In [9]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    station_latitude, 
    station_longitude, 
    radius, 
    LIMIT)
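The URL above is assembled with string formatting. As a sketch of an alternative, the same query string can be built from a dictionary with the standard library's `urlencode`, which also escapes special characters. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL from named parameters instead of a long format string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude pair; the comma is percent-encoded
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials; the coordinates are those of the Airport station
url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605", 33.640758, -84.446341)
print(url)
```

When using `requests`, passing the same dictionary as `requests.get(EXPLORE_ENDPOINT, params=params)` has a similar effect without building the string by hand.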

Send the GET request and examine the results.

In [10]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d240b12f129b500253a9be1'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4b5da9dff964a520ba6529e3-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d1e0931735',
         'name': 'Coffee Shop',
         'pluralName': 'Coffee Shops',
         'primary': True,
         'shortName': 'Coffee Shop'}],
       'id': '4b5da9dff964a520ba6529e3',
       'location': {'address': 'Concourse C',
        'cc': 'US',
        'city': 'Atlanta',
        'country': 'United States',
        'crossStreet': 'at ATL Airport',
        'distance': 146,
        'formattedAddress': ['Concourse C (at ATL Airport)',
         'Atlanta, GA 30320',
         'United States'],
        'label

Now we will convert the JSON data into a *pandas* dataframe. All the information is in the *items* key. 

In [11]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [12]:
venues = results['response']['groups'][0]['items']

# flatten JSON
nearby_venues = json_normalize(venues) 

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Starbucks,Coffee Shop,33.640891,-84.444769
1,Metal Penguin,Public Art,33.640059,-84.44453
2,Zimbabwe: A Tradition in Stone,Art Gallery,33.640745,-84.442539
3,TSA PreCheck North,Airport Service,33.641276,-84.444029
4,Starbucks,Coffee Shop,33.641678,-84.44662
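Column names such as `venue.location.lat` come from flattening the nested JSON: each level of nesting is joined with a dot, while lists (like `categories`) are kept as they are, which is why a separate helper is needed to pull out the category name. The following pure-Python sketch imitates that flattening on a hand-written item; it is an illustration of the idea, not the pandas implementation, and `sample_item` is made up to mirror the response structure shown earlier.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys; lists are left intact."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the dotted prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Hand-written item that mimics one entry of the 'items' list
sample_item = {
    "referralId": "e-0-example",
    "venue": {
        "name": "Starbucks",
        "categories": [{"name": "Coffee Shop", "primary": True}],
        "location": {"lat": 33.640891, "lng": -84.444769},
    },
}

row = flatten(sample_item)
print(sorted(row))
```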


And how many venues were returned for the Airport Station by Foursquare?

In [13]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

27 venues were returned by Foursquare.
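As a sanity check on the 500-meter radius, the distance between the station and each returned venue can be estimated with the haversine formula. This is a sketch under the assumption of a spherical Earth with a mean radius of 6,371 km; the function name `haversine_m` is hypothetical. For the first venue, the estimate comes out at roughly 146 m, which appears consistent with the `'distance': 146` field in the response shown earlier.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

# Airport station and the first three venues from the table above
station = (33.640758, -84.446341)
venues = [
    ("Starbucks", 33.640891, -84.444769),
    ("Metal Penguin", 33.640059, -84.44453),
    ("Zimbabwe: A Tradition in Stone", 33.640745, -84.442539),
]

distances = {name: haversine_m(*station, lat, lng) for name, lat, lng in venues}
for name, meters in distances.items():
    print(f"{name}: {meters:.0f} m")
```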


#### Let's create a function to repeat the same process for all MARTA stations 

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            # guard against venues without a category (empty list)
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Station', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
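The function above assumes that every request succeeds and that the response always contains at least one group. A more defensive variant could separate the parsing step from the network call and tolerate missing pieces. The sketch below is one possible shape for that step; `parse_explore_payload` and the sample payloads are hypothetical and only mirror the response structure shown earlier.

```python
def parse_explore_payload(payload, station, lat, lng):
    """Turn one explore-style response dict into row tuples, tolerating gaps."""
    if payload.get("meta", {}).get("code") != 200:
        return []  # treat any non-200 status as "no venues" for this station
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        location = venue.get("location", {})
        rows.append((station, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows

# Hand-written payloads for illustration only
ok_payload = {"meta": {"code": 200}, "response": {"groups": [{"items": [
    {"venue": {"name": "Starbucks", "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 33.640891, "lng": -84.444769}}},
    {"venue": {"name": "Unlabeled Spot", "categories": [],
               "location": {"lat": 33.64, "lng": -84.44}}},
]}]}}
error_payload = {"meta": {"code": 429}, "response": {}}

rows = parse_explore_payload(ok_payload, "Airport", 33.640758, -84.446341)
print(rows)
print(parse_explore_payload(error_payload, "Airport", 33.640758, -84.446341))
```

Keeping the parsing separate from `requests.get` also makes it possible to test the row extraction without calling the API.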

#### Now we run the above function on each station and create a new dataframe called *atlanta_venues*.

In [15]:
atlanta_venues = getNearbyVenues(names=stations['Station'],
                                   latitudes=stations['Latitude'],
                                   longitudes=stations['Longitude']
                                  )

Airport
Arts Center
Ashby
Avondale
Bankhead
Brookhaven/Oglethorpe
Buckhead
Chamblee
Civic Center
College Park
Decatur
Dome/GWCC/Philips Arena/CNN Center
Doraville
Dunwoody
East Lake
East Point
Edgewood/Candler Park
Five Points*
Garnett
Georgia State
H. E. Holmes
Indian Creek
Inman Park/Reynoldstown
Kensington
King Memorial
Lakewood/Fort McPherson
Lenox
Lindbergh Center
Medical Center
Midtown
North Avenue
North Springs
Oakland City
Peachtree Center
Sandy Springs
Vine City
West End
West Lake


Let's check the size of the resulting dataframe.

In [16]:
print(atlanta_venues.shape)
atlanta_venues.head()

(1156, 7)


Unnamed: 0,Station,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Airport,33.640758,-84.446341,Starbucks,33.640891,-84.444769,Coffee Shop
1,Airport,33.640758,-84.446341,Metal Penguin,33.640059,-84.44453,Public Art
2,Airport,33.640758,-84.446341,Zimbabwe: A Tradition in Stone,33.640745,-84.442539,Art Gallery
3,Airport,33.640758,-84.446341,TSA PreCheck North,33.641276,-84.444029,Airport Service
4,Airport,33.640758,-84.446341,Starbucks,33.641678,-84.44662,Coffee Shop


Let's check how many venues were returned for each station.

In [17]:
atlanta_venues.groupby('Station').count()

Station,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Airport,27,27,27,27,27,27
Arts Center,48,48,48,48,48,48
Ashby,18,18,18,18,18,18
Avondale,21,21,21,21,21,21
Bankhead,3,3,3,3,3,3
Brookhaven/Oglethorpe,19,19,19,19,19,19
Buckhead,100,100,100,100,100,100
Chamblee,17,17,17,17,17,17
Civic Center,33,33,33,33,33,33
College Park,12,12,12,12,12,12


Let's save atlanta_venues to a backup CSV file, just in case. 

In [18]:
atlanta_venues.to_csv("MARTA_venues.csv",index=False) 
#atlanta_venues = pd.read_csv("MARTA_venues.csv")
atlanta_venues.head()

Unnamed: 0,Station,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Airport,33.640758,-84.446341,Starbucks,33.640891,-84.444769,Coffee Shop
1,Airport,33.640758,-84.446341,Metal Penguin,33.640059,-84.44453,Public Art
2,Airport,33.640758,-84.446341,Zimbabwe: A Tradition in Stone,33.640745,-84.442539,Art Gallery
3,Airport,33.640758,-84.446341,TSA PreCheck North,33.641276,-84.444029,Airport Service
4,Airport,33.640758,-84.446341,Starbucks,33.641678,-84.44662,Coffee Shop


Now we have all the data ready for our analysis.

### Thank you for reading!

This notebook was created by [Nelli Fedorova](https://www.linkedin.com/in/nelli-fedorova-7710b01a/). I hope you found this report interesting and educational. 

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).