# Capstone Project

### Introduction

In this project, I want to see if people would like my home town, Dalin, Chiayi, Taiwan.  You can think of this project as a content-based recommendation system.  The program will first ask the user to input ratings for 5 locations of his/her choice.  The program will analyze the locations, exploring nearby venues and finding the pattern.  After the program gets a hint of what the user likes, it will apply the criteria to my home town, and see if the user will like it.

### Preparation

We first install/import the necessary libraries.

In [1]:
#Install the following if you haven't
#!conda install -c conda-forge folium=0.5.0 --yes
#!conda install -c conda-forge geopy --yes 

import folium
from IPython.display import Image 
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim
import numpy as np
import pandas as pd
import json
import requests

### User's time

We would like the user to rate 5 locations of their choice, but it is error-prone to ask the user for a location name.  The name may not be specific enough for the program to locate, and all kinds of problems may cause the program to break down.  Therefore, I will ask for coordinates instead.

But how does the user know the coordinates of the places?  They can take advantage of the following cell of code, which asks for an address as input, and show the location on the map.  If the location is valid and is what the user wanted, they can use the coordinates in the next cell.

I understand that this step is not very convinent, but at this time being, I still do not have enough skills in designing a graphical user interface, where users might be able to select the locations they want on the map and load the data to other parts of the program.  This is the best design I can do to avoid failure for now.  Feel free to tell me if you have an better idea!  I would very much appreciate it :)

In [2]:
address = input("Enter a city's name: ")

geolocator = Nominatim(user_agent = "foursquare_agent")
location = geolocator.geocode(address)

print(location.latitude, location.longitude)

venues_map = folium.Map(location = [location.latitude, location.longitude], zoom_start=13) # generate map centred around the Conrad Hotel
venues_map

Enter a city's name: manhattan
40.7900869 -73.9598295


Now, the program will ask the user to input the location (in coordinates, separated by a comma) and the rating (1-10).  This process will run for five times.  The program will also do some basic checking to make sure that the input values are valid.  If the input values are not valid, the program will make that value NaN. 

In [3]:
# Generate an empty dataframe before storing values in it.
df = pd.DataFrame({"Latitude":[np.nan]*5, "Longitude":[np.nan]*5, "Rating":[np.nan]*5})

for i in range(5):
    for j in range(2):
        try:
            # Propmt for input of coordinates.
            if j == 0:
                x1, x2 = input("Enter Coordinates, separated by a comma: ").replace(" ", "").split(",")
                x1, x2 = round(float(x1), 4), round(float(x2), 4)
                
                # Check if the values are true coordinates, and if so, store them into two cells.
                if (x1 >= -90) and (x1 <= 90):
                    df.iloc[[i], [j]] = x1
                elif (x1 < -90) or (x1 > 90):
                    print("Not a true latitude~~")
                    
                if (x2 >= -180) and (x2<= 180):
                    df.iloc[[i], [j+1]] = x2
                elif (x2 < -180) or (x2 > 180):
                    print("Not a true longitude~~")


            # Prompt for input of rating (0 - 10), and store the value into a cell.
            elif j == 1:
                x = round(float(input("Enter Numerical Rating Out of Ten: ")))
                if (x <= 10) and (x >= 1):
                    df.iloc[[i], [j+1]] = float(x)
                elif x > 10:
                    print("Max rating is 10.")
                    df.iloc[[i], [j+1]] = 10.0
                elif x < 1:
                    print("Min rating is 1.")
                    df.iloc[[i], [j+1]] = 1.0

                    
        # If any thing fails (inputs seems to fail very often), ignore them.
        except:
            continue

Enter Coordinates, separated by a comma: 40.876551, -73.910660
Enter Numerical Rating Out of Ten: 8
Enter Coordinates, separated by a comma: 40.715618, -73.994279
Enter Numerical Rating Out of Ten: 9
Enter Coordinates, separated by a comma: 40.851903, -73.936900
Enter Numerical Rating Out of Ten: 8
Enter Coordinates, separated by a comma: 40.867684, -73.921210
Enter Numerical Rating Out of Ten: 10
Enter Coordinates, separated by a comma: 40.823604, -73.949688
Enter Numerical Rating Out of Ten: 7


Dropping, the NaN, we get the final data frame, which consists of the five locations' latitude, longitude, and rating.

In [4]:
# Drop nans and print the final df
df = df.dropna()
df

Unnamed: 0,Latitude,Longitude,Rating
0,40.8766,-73.9107,8.0
1,40.7156,-73.9943,9.0
2,40.8519,-73.9369,8.0
3,40.8677,-73.9212,10.0
4,40.8236,-73.9497,7.0


### Getting information from Foursquare

After the user input all the necessary information, we can start analyzing them.  Since I want to get data from Foursquare, I need to set the parameters.

In [5]:
CLIENT_ID = 'N33X1C2M2311BSIALFYZMLSNMWVKEKUTSWPYJ2TPGLXF3LN0'
CLIENT_SECRET = 'EBXLWZKEVEMH3SA2ULL0BHYK4BBKIJHJDLEGMX2B3XGQM20B'
VERSION = '20180605'
LIMIT = 100 

Now, I will define a function that helps me get all useful information from Foursquare.  Since this function has no limit on the size of the data, it can also work with more location inputs.  As for now, five locations is enough for demonstration.

In [6]:
def getNearbyVenues(latitudes, longitudes, radius=500):
    
    venues_list=[]
    for lat, lng in zip(latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']


        # return only relevant information for each nearby venue
        venues_list.append([(
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Passing in the data, we can get the venues near the locations the the user input.

In [7]:
venues = getNearbyVenues(latitudes = df['Latitude'], longitudes = df['Longitude'])
venues.head()

Unnamed: 0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,40.8766,-73.9107,Arturo's,40.874412,-73.910271,Pizza Place
1,40.8766,-73.9107,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,40.8766,-73.9107,Tibbett Diner,40.880404,-73.908937,Diner
3,40.8766,-73.9107,Starbucks,40.877531,-73.905582,Coffee Shop
4,40.8766,-73.9107,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant


### Analyzing information

To analyze the information, we want to use one hot encoding.

In [8]:
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add coordinates column back to dataframe
onehot['Neighborhood Latitude'] = venues['Neighborhood Latitude'] 
onehot['Neighborhood Longitude'] = venues['Neighborhood Longitude'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,Neighborhood Longitude,Accessories Store,American Restaurant,Arepa Restaurant,Asian Restaurant,Austrian Restaurant,Bakery,Bank,Bar,Beer Bar,...,Tennis Stadium,Trail,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Neighborhood Latitude
0,-73.9107,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40.8766
1,-73.9107,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,40.8766
2,-73.9107,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40.8766
3,-73.9107,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40.8766
4,-73.9107,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,40.8766


The information is then grouped by their locations.  Specifically, the venues nearby each locations is listed by their appearance frequency.

In [9]:
grouped = onehot.groupby(['Neighborhood Longitude', 'Neighborhood Latitude']).mean().reset_index()
grouped

Unnamed: 0,Neighborhood Longitude,Neighborhood Latitude,Accessories Store,American Restaurant,Arepa Restaurant,Asian Restaurant,Austrian Restaurant,Bakery,Bank,Bar,...,Tea Room,Tennis Stadium,Trail,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,-73.9943,40.7156,0.0,0.04,0.0,0.02,0.01,0.03,0.0,0.02,...,0.01,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.01
1,-73.9497,40.8236,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.017241,...,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.034483
2,-73.9369,40.8519,0.011905,0.011905,0.011905,0.0,0.0,0.047619,0.011905,0.011905,...,0.0,0.0,0.0,0.0,0.011905,0.0,0.011905,0.02381,0.011905,0.0
3,-73.9212,40.8677,0.0,0.034483,0.0,0.0,0.0,0.034483,0.0,0.017241,...,0.0,0.0,0.017241,0.017241,0.0,0.0,0.034483,0.017241,0.0,0.017241
4,-73.9107,40.8766,0.0,0.041667,0.0,0.0,0.0,0.0,0.041667,0.0,...,0.0,0.041667,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.041667


Finally, we can generate the user profile table by calculating the dot function of the venues' appearance frequency and the user's rating.  This gives us an understand of what the user likes.

In [10]:
userGenreTable = grouped.drop('Neighborhood Longitude', 1).drop('Neighborhood Latitude', 1)
userprofile = userGenreTable.transpose().dot(df['Rating'])
userprofile

Accessories Store        0.095238
American Restaurant      1.051732
Arepa Restaurant         0.095238
Asian Restaurant         0.160000
Austrian Restaurant      0.080000
Bakery                   1.120952
Bank                     0.386905
Bar                      0.582824
Beer Bar                 0.095238
Bike Shop                0.080000
Bistro                   0.172414
Boutique                 0.080000
Breakfast Spot           0.095238
Bubble Tea Shop          0.480000
Burger Joint             0.250411
Bus Station              0.172414
Café                     1.786535
Caribbean Restaurant     0.673235
Chinese Restaurant       1.565649
Clothing Store           0.095238
Cocktail Bar             0.725583
Coffee Shop              1.631675
Cosmetics Shop           0.235172
Deli / Bodega            1.620074
Department Store         0.386905
Dessert Shop             0.080000
Dim Sum Restaurant       0.320000
Diner                    0.559319
Discount Store           0.583333
Dog Run       

### Showing the results

We have now generated the user profile, and now we can see where the user might be interested in visiting.  We still ask for input of a location's coordinates, and our program will help the user decide whether or not he/her will like it.

In [11]:
df_user = pd.DataFrame({"Latitude":[np.nan], "Longitude":[np.nan]})

x1_user, x2_user = input("Enter Coordinates, separated by a comma: ").replace(" ", "").split(",")
x1_user, x2_user = round(float(x1_user), 4), round(float(x2_user), 4)

# Check if the values are true coordinates, and if so, store them into two cells.
if (x1_user >= -90) and (x1_user <= 90):
    df_user.iloc[[0], [0]] = x1_user
elif (x1_user < -90) or (x1_user > 90):
    print("Not a true latitude~~")

if (x2_user >= -180) and (x2_user<= 180):
    df_user.iloc[[0], [1]] = x2_user
elif (x2_user < -180) or (x2_user > 180):
    print("Not a true longitude~~")

Enter Coordinates, separated by a comma: 40.877531, -73.905582


Let's turn the input into a metric, using the same manipulations we did earlier.  The result will be the genre table of the input location.

In [12]:
venue_user = getNearbyVenues(latitudes = df_user['Latitude'], longitudes = df_user['Longitude'])

onehot_user = pd.get_dummies(venue_user[['Venue Category']], prefix="", prefix_sep="")

# add coordinates column back to dataframe
onehot_user['Neighborhood Latitude'] = venue_user['Neighborhood Latitude'] 
onehot_user['Neighborhood Longitude'] = venue_user['Neighborhood Longitude'] 

# move neighborhood column to the first column
fixed_columns = [onehot_user.columns[-1]] + list(onehot_user.columns[:-1])
onehot_user = onehot_user[fixed_columns]

grouped_user = onehot_user.groupby(['Neighborhood Longitude', 'Neighborhood Latitude']).mean().reset_index()

userGenreTable_user = grouped_user.drop('Neighborhood Longitude', 1).drop('Neighborhood Latitude', 1)

With the genre table and the user profile we calculated, the possibility that the user likes this place can be calculated.

In [13]:
rating_user = ((userGenreTable_user * userprofile).sum(axis = 1))/(userprofile.sum())
print("The possibility that you like this place is", round(rating_user.iloc[0] * 100, 2), "%")

The possibility that you like this place is 1.77 %


### Thank you