# IBM Capstone Project

## Introduction / Business Problem

<p>In this project, we look for an automatic way to find clusters of ethnic groups in an area.  Often, only local residents know which parts of a city are home to a particular group.  For example, many cities have a Chinatown, a Koreatown, or a Japantown, but this information is not shown on the map.  If I were a new real estate agent looking for a place for a client, I would want to know which area suits that client best.  Many new homeowners prefer to live near their own cultural group.  A newcomer trying to settle in a city may find out only by trial and error that the first place he or she chose is not the ideal place to live.  This project explores a machine learning approach to surface this hidden information about where ethnic groups gather.</p>

## Methodology and Data

<p>I will use the Foursquare API to gather the venue types in an area.  Then I will use k-means to see whether clusters exist for each of the following groups:</p>

- Italian
- Chinese
- Korean
- Japanese
- Indian
- Latino

<p>The list is not exhaustive; not every ethnic group is included.  I will then plot the results on a Folium map with a heatmap layer.  I will use Los Angeles to verify my data, and then run the Bay Area, Kansas City, Atlanta, and Washington, DC to explore the distribution of ethnic groups.</p>
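
As a rough illustration of the k-means step, the sketch below clusters a handful of made-up (latitude, longitude) points in pure Python. The function name `kmeans`, the sample points, and the choice of the first `k` points as initial centroids are assumptions made for illustration only; a library implementation such as scikit-learn's would likely be the practical choice.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Very small k-means: returns (centroids, labels) for 2-D points."""
    # Assumption: seed the centroids with the first k points (deterministic, not robust).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        # Euclidean distance on raw degrees is a simplification.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Made-up venue coordinates forming two loose groups (not real Foursquare data).
sample_points = [
    (34.06, -118.24), (34.10, -118.10),
    (34.07, -118.23), (34.09, -118.11),
    (34.06, -118.25), (34.08, -118.09),
]
centroids, labels = kmeans(sample_points, k=2)
print(labels)
```

Points that share a label belong to the same cluster, so a group whose venues collapse into one or two tight clusters would hint at a concentrated neighborhood.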

In [1]:
#import library
import pandas as pd

In [2]:
#get the list of Los Angeles County
url = 'https://en.wikipedia.org/wiki/List_of_cities_in_Los_Angeles_County,_California'
df_list = pd.read_html(url)

In [4]:
LA_City = df_list[0].City

In [5]:
#import the geocoder used to look up the coordinates of California cities
from geopy.geocoders import Nominatim

In [6]:
address = 'Los Angeles, CA'
geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.2427666.


In [7]:
#append ", CA" to each city name
LA_City = LA_City + ', CA'

In [14]:
#Create a list that stores the latitude and longitude of each Los Angeles County city
city_list=[]
for city in LA_City:
    location = geolocator.geocode(city)
    latitude = location.latitude
    longitude = location.longitude
    city_list.append((city, latitude, longitude))

In [15]:
city_list

[('Agoura Hills, CA', 34.14791, -118.7657042),
 ('Alhambra, CA', 52.3429334, -114.6712396),
 ('Arcadia, CA', 43.8253822, -66.0609036),
 ('Artesia, CA', 50.8380696, -113.96947271477748),
 ('Avalon, CA', 52.0948774, -106.6560692704038),
 ('Azusa, CA', 34.1338751, -117.9056046),
 ('Baldwin Park, CA', 43.5272291, -79.87667193276754),
 ('Bell, CA', 45.326447599999995, -75.8095645993038),
 ('Bell Gardens, CA', 33.9694561, -118.1503953),
 ('Bellflower, CA', 33.8825705, -118.1167679),
 ('Beverly Hills, CA', 53.559348099999994, -113.0314111056661),
 ('Bradbury, CA', 34.1469511, -117.9708982),
 ('Burbank, CA', 34.1816482, -118.3258554),
 ('Calabasas, CA', 34.1446643, -118.6440973),
 ('Carson, CA', 45.4191995, -75.68979657728495),
 ('Cerritos, CA', 33.8644291, -118.0539323),
 ('Claremont, CA', 45.6782409, -63.9761354),
 ('Commerce, CA', 45.3633593, -73.73249956842082),
 ('Compton, CA', 45.243228, -71.827927),
 ('Covina, CA', 34.1024, -117.85486499999999),
 ('Cudahy, CA', 33.9620584, -118.1835395)
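
Several of the coordinates above fall far outside Southern California (for example, latitudes above 43 or longitudes east of -115), which suggests that the ", CA" suffix was sometimes matched to Canada rather than California. One way to catch this is a simple bounding-box check. The sketch below is an illustration only: the name `outside_bounds` and the approximate box around Los Angeles County are assumptions, not part of the original notebook.

```python
# Loose bounding box around Los Angeles County (approximate values, an assumption).
LA_BOUNDS = {"min_lat": 32.5, "max_lat": 35.0, "min_lng": -119.5, "max_lng": -117.0}

def outside_bounds(rows, bounds=LA_BOUNDS):
    """Return the (city, lat, lng) rows whose coordinates fall outside the box."""
    return [
        (city, lat, lng)
        for city, lat, lng in rows
        if not (bounds["min_lat"] <= lat <= bounds["max_lat"]
                and bounds["min_lng"] <= lng <= bounds["max_lng"])
    ]

# A few rows copied from the output above.
sample = [
    ("Agoura Hills, CA", 34.14791, -118.7657042),
    ("Alhambra, CA", 52.3429334, -114.6712396),
    ("Arcadia, CA", 43.8253822, -66.0609036),
    ("Azusa, CA", 34.1338751, -117.9056046),
]
suspect = outside_bounds(sample)
print([city for city, _, _ in suspect])
```

Flagged cities could then be geocoded again with a less ambiguous query such as "Alhambra, California, USA". Recent versions of geopy also accept a `country_codes` argument on `Nominatim.geocode`, which may be a simpler way to restrict matches to the United States.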

In [16]:
#convert city_list to dataframe
city_list_df = pd.DataFrame(city_list, columns = ['city', 'latitude', 'longitude'])

In [69]:
city_list_df[city_list_df.city == 'Los Angeles, CA']

| | city | latitude | longitude |
|---|---|---|---|
| 48 | Los Angeles, CA | 34.053691 | -118.242767 |


In [96]:
#insert Foursquare credentials (placeholders; never publish real keys)
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [71]:
#Foursquare API category IDs for each group
categoriesarray = []
categoriesarray.append(['Chinese', '4bf58dd8d48988d145941735'])
categoriesarray.append(['Japanese', '4bf58dd8d48988d111941735'])
categoriesarray.append(['Korean', '4bf58dd8d48988d113941735'])
categoriesarray.append(['Italian', '4bf58dd8d48988d110941735'])
categoriesarray.append(['Indian', '4bf58dd8d48988d10f941735'])
categoriesarray.append(['Latino', '4bf58dd8d48988d1c1941735'])

In [72]:
#convert categoriesarray to dataframe
categoriesdf = pd.DataFrame(data=categoriesarray, index=None, columns=["ethnic", "categoriesid"])

In [75]:
categoriesdf

| | ethnic | categoriesid |
|---|---|---|
| 0 | Chinese | 4bf58dd8d48988d145941735 |
| 1 | Japanese | 4bf58dd8d48988d111941735 |
| 2 | Korean | 4bf58dd8d48988d113941735 |
| 3 | Italian | 4bf58dd8d48988d110941735 |
| 4 | Indian | 4bf58dd8d48988d10f941735 |
| 5 | Latino | 4bf58dd8d48988d1c1941735 |


In [12]:
#import request library
import requests

In [99]:
def getNearbyVenues(ethnics, categoriesid, cities, latitudes, longitudes, radius=500):
    # NOTE: radius is accepted but is not sent with the request.

    venues_list=[]
    for city, lat, lng in zip(cities, latitudes, longitudes):
        print(city + ' : ' + ethnics)

        # NOTE: this loop queries every category in the global categoriesarray and
        # rebinds categoriesid, so the categoriesid argument is never used; every
        # venue found is labelled with `ethnics` below.
        for ethnic, categoriesid in categoriesarray:

            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&limit={}&categoryId={}'.format(
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION,
                lat,
                lng,
                LIMIT,
                categoriesid)

            # make the GET request
            results = requests.get(url).json()["response"]['venues']

            # return only relevant information for each nearby venue
            venues_list.append([(
                city,
                lat,
                lng,
                v['name'],
                v['location']['lat'],
                v['location']['lng'],
                ethnics) for v in results])

    # flatten the per-request lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return(nearby_venues)
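
The `radius` argument above is accepted but never placed in the request, and the URL is assembled by string formatting. As a sketch of an alternative, the query for a single category could be built with `urllib.parse.urlencode`. The helper name `build_search_url` and the placeholder credentials are hypothetical; the parameter names follow the Foursquare v2 `venues/search` endpoint used above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(client_id, client_secret, version, lat, lng,
                     category_id, radius=500, limit=50):
    """Hypothetical helper: build one search URL for a single category."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "categoryId": category_id,
        "radius": radius,  # metres; forwarded instead of being ignored
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; the category ID is the 'Chinese' entry from categoriesarray.
url = build_search_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "20180605",
                       34.0536909, -118.2427666, "4bf58dd8d48988d145941735")
query = parse_qs(urlparse(url).query)
print(query["categoryId"], query["radius"])
```

Passing the same dictionary as `requests.get(SEARCH_ENDPOINT, params=params)` would have `requests` do the encoding instead.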

In [110]:
def getVeneuesbyCategories(categoriesinfo, citydataarray):
    # collect the venues for every ethnic category into one dataframe
    tempdf = pd.DataFrame()
    for ethnic, categoryid in zip(categoriesinfo.ethnic, categoriesinfo.categoriesid):
        print(ethnic)
        tempdf2 = getNearbyVenues(ethnic, categoryid, citydataarray.city, citydataarray.latitude, citydataarray.longitude)
        print(len(tempdf2))
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        tempdf = pd.concat([tempdf, tempdf2])
        print(len(tempdf))
    return tempdf

In [111]:
dataforheatmap = getVeneuesbyCategories(categoriesdf, city_list_df[city_list_df.city == 'Los Angeles, CA'])

Chinese
Los Angeles, CA : Chinese
300
300
Japanese
Los Angeles, CA : Japanese
300
600
Korean
Los Angeles, CA : Korean
300
900
Italian
Los Angeles, CA : Italian
300
1200
Indian
Los Angeles, CA : Indian
300
1500
Latino
Los Angeles, CA : Latino
300
1800


In [113]:
#import the map library
from folium import Map
from folium.plugins import HeatMap

In [129]:
dataforheatmap[dataforheatmap['Venue Category'] == 'Chinese'].head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Los Angeles, CA | 34.053691 | -118.242767 | Today Starts Here | 34.065198 | -118.238093 | Chinese |
| 1 | Los Angeles, CA | 34.053691 | -118.242767 | Dan Modern Chinese | 34.042952 | -118.445525 | Chinese |
| 2 | Los Angeles, CA | 34.053691 | -118.242767 | Lunasia Dimsum House | 33.858196 | -118.08861 | Chinese |
| 3 | Los Angeles, CA | 34.053691 | -118.242767 | Delicious Food Corner | 34.079952 | -118.08113 | Chinese |
| 4 | Los Angeles, CA | 34.053691 | -118.242767 | China one | 33.792343 | -118.141991 | Chinese |


In [134]:
for_map = Map(location=[34.052235,-118.243683], zoom_start=15)

In [139]:
# (latitude, longitude) pairs for the venues labelled 'Chinese'
chinese_venues = dataforheatmap[dataforheatmap['Venue Category'] == 'Chinese']
gradient = {.33: 'red', .66: 'brown', 1: 'green'}
Chinese_heat = HeatMap(
    list(zip(chinese_venues['Venue Latitude'].values, chinese_venues['Venue Longitude'].values)),
    min_opacity=0.2,
    gradient=gradient,
    radius=8,
    blur=15,
    max_zoom=1
)

In [140]:
for_map.add_child(Chinese_heat)

Due to the limitations of the Foursquare API, the test is inconclusive.  For instance, El Monte and West Covina have far larger Chinese communities, but they do not show up properly on the map.  The search endpoint returns at most 50 venues per request, so the six requests made for each group yield only 300 results.
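
One possible way around a per-request cap is to query a grid of points across the region instead of a single city centre, then drop venues that appear in more than one response. The sketch below only generates the grid and removes duplicates. The function names, the bounding box, and the venue dictionaries are hypothetical and are not taken from the notebook.

```python
def grid_points(min_lat, max_lat, min_lng, max_lng, rows, cols):
    """Return rows * cols (lat, lng) cell centres covering the bounding box."""
    lat_step = (max_lat - min_lat) / rows
    lng_step = (max_lng - min_lng) / cols
    return [
        (min_lat + (r + 0.5) * lat_step, min_lng + (c + 0.5) * lng_step)
        for r in range(rows)
        for c in range(cols)
    ]

def dedupe(venues):
    """Drop repeated venues, keyed on (name, lat, lng), keeping first occurrences."""
    seen, unique = set(), []
    for v in venues:
        key = (v["name"], v["lat"], v["lng"])
        if key not in seen:
            seen.add(key)
            unique.append(v)
    return unique

# Assumed bounding box roughly covering the Los Angeles basin.
points = grid_points(33.7, 34.3, -118.7, -117.7, rows=3, cols=5)
print(len(points))
```

Each grid point would be passed as the `ll` value of one search request, so a finer grid trades more API calls for better coverage of areas such as El Monte and West Covina.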