After some consideration, I have decided to look at cities in the US that are similar to my current home of Boise, ID. I have recently been looking into moving because I am considering a career change. I really like what Boise has to offer and am not excited about moving, but the job prospects elsewhere are better than in Boise. So let's take a look at some of the other "Metropolitan Areas" around the country and see how they compare to Boise. I will use a clustering algorithm again and try to come up with some interesting features along the way. I am going to draw my list of cities from the "List of United States cities by population" Wikipedia article. I will also draw data from other sources, including Foursquare, to get additional features. Let's dive in and see what we can find.

In [666]:
import bs4
import pandas as pd
import urllib3
import string
import requests
import json
import folium

In [443]:
wiki = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population"
lib = urllib3.PoolManager()
r = lib.request("GET", wiki)
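# Name the parser explicitly so the result does not depend on which parsers are installed.
wiki = bs4.BeautifulSoup(r.data, "html.parser")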



In [482]:
tables = wiki.find_all("table")[4]
rows = tables.find_all("tr")
columns = [value.text[:-1] for value in rows[0].find_all("th")]

data = []
for row in rows[1:]:
    temp = []
    for value in row.find_all("td"):
        temp.append(value.text[:-1])
    data.append(temp)
    
wiki_data = pd.DataFrame.from_records(data)
wiki_data.drop(columns = [0, 2, 5, 7, 9], inplace = True)
to_remove = ["State[c]", "Change", "2018rank"]
for item in to_remove:
    columns.remove(item)
wiki_data.columns = columns
wiki_data.head()

Unnamed: 0,City,2018estimate,2010Census,2016 land area,2016 population density,Location
0,New York[d],8398748,8175133,301.5 sq mi,"28,317/sq mi",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...
1,Los Angeles,3990456,3792621,468.7 sq mi,"8,484/sq mi",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...
2,Chicago,2705994,2695598,227.3 sq mi,"11,900/sq mi",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...
3,Houston[3],2325502,2100263,637.5 sq mi,"3,613/sq mi",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...
4,Phoenix,1660272,1445632,517.6 sq mi,"3,120/sq mi",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...


We have some data cleaning to do here. We need to drop the units and commas and extract the longitude and latitude values. Foursquare, however, might be able to handle just the name of a city, which would save us a decent amount of work. Let's take a look.

In [510]:
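# Strip thousands separators and convert to int.
removeComma = lambda x: int(x.replace(",", ""))
# Drop the trailing " sq mi" (6 characters) from the land area.
land_clean = lambda x: float(x[:-6].replace(",", ""))
# Drop the trailing "/sq mi" (6 characters) from the population density.
pop_clean = lambda x: x[:-6]
# Remove a trailing three-character footnote marker such as "[d]".
remove_hyper = lambda x: x[:-3] if x[-1] == "]" else x
def location_cleaning(x):
    # Keep the decimal-degree part between the slashes, e.g. "40.6635°N 73.9387°W".
    x = x.split("/")[1]
    x = x.split(" ")[1:3]
    new = []
    for i in x:
        # The longitude is followed by an invisible character (likely "\ufeff"),
        # so its hemisphere letter sits at index -2 rather than -1.
        if i[-1] == "S" or i[-2] == "W":
            new.append("-" + i) 
        else:
            new.append(i)
    # Slicing off three characters removes "°W" plus the invisible character from
    # the longitude, but it also drops the last decimal digit of the latitude.
    return ",".join([a[:-3] for a in new])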

# print(wiki_data.loc[wiki_data["City"] == "Boise[r]"]["Location"].values[0])
# print(location_cleaning(wiki_data.loc[wiki_data["City"] == "Boise[r]"]["Location"].values[0]))

wiki_data["2018estimate"] = wiki_data["2018estimate"].apply(removeComma)
wiki_data["2010Census"] = wiki_data["2010Census"].apply(removeComma)
wiki_data["2016 land area"] = wiki_data["2016 land area"].apply(land_clean)
wiki_data["2016 population density"] = wiki_data["2016 population density"].apply(pop_clean).apply(removeComma)
wiki_data["City"] = wiki_data["City"].apply(remove_hyper)
wiki_data["Location"] = wiki_data["Location"].apply(location_cleaning)
wiki_data.head()

Unnamed: 0,City,2018estimate,2010Census,2016 land area,2016 population density,Location
0,New York,8398748,8175133,301.5,28317,"﻿40.663,-73.9387"
1,Los Angeles,3990456,3792621,468.7,8484,"﻿34.019,-118.4108"
2,Chicago,2705994,2695598,227.3,11900,"﻿41.837,-87.6818"
3,Houston,2325502,2100263,637.5,3613,"﻿29.786,-95.3909"
4,Phoenix,1660272,1445632,517.6,3120,"﻿33.572,-112.0901"
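
The cleaning step above depends on fixed slice offsets and on an invisible character in the scraped text. As a more defensive alternative, here is a minimal sketch that pulls the decimal coordinates out with a regular expression. The names `parse_decimal_coords` and `_COORD` are hypothetical and not part of the original notebook, and the sample string only mimics the format shown in the Location column.

```python
import re

# Matches the decimal-degree pair, e.g. "40.6635°N 73.9387°W".
_COORD = re.compile(r"(\d+(?:\.\d+)?)°([NS])\s+(\d+(?:\.\d+)?)°([EW])")

def parse_decimal_coords(text):
    """Return (lat, lon) as signed floats, or None if no decimal pair is found."""
    match = _COORD.search(text)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    lat = float(lat) * (-1 if ns == "S" else 1)
    lon = float(lon) * (-1 if ew == "W" else 1)
    return lat, lon

# Sample in the same shape as the scraped column, invisible characters included.
sample = "40°39′49″N 73°56′19″W\ufeff / \ufeff40.6635°N 73.9387°W\ufeff / 40.6635; -73.9387"
print(parse_decimal_coords(sample))  # (40.6635, -73.9387)
```

Returning floats instead of a string would also remove the need for the `[1:]` slicing used later, although the rest of the notebook would have to be adapted to match.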


Let's visualize all the cities that we are working with on the map.

In [673]:
location = [37.09, -95.71]
m = folium.Map(location = location, zoom_start = 4, tiles = "Stamen Toner")
for row in range(wiki_data.shape[0]):
    ll = wiki_data.iloc[row]["Location"].split(",")
    print(ll)

['\ufeff40.663', '-73.9387']
['\ufeff34.019', '-118.4108']
['\ufeff41.837', '-87.6818']
['\ufeff29.786', '-95.3909']
['\ufeff33.572', '-112.0901']
['\ufeff40.009', '-75.1333']
['\ufeff29.472', '-98.5251']
['\ufeff32.815', '-117.1350']
['\ufeff32.793', '-96.7665']
['\ufeff37.296', '-121.8189']
['\ufeff30.303', '-97.7544']
['\ufeff30.336', '-81.6616']
['\ufeff32.781', '-97.3467']
['\ufeff39.985', '-82.9848']
['\ufeff37.727', '-123.0322']
['\ufeff35.207', '-80.8310']
['\ufeff39.776', '-86.1459']
['\ufeff47.620', '-122.3509']
['\ufeff39.761', '-104.8811']
['\ufeff38.904', '-77.0172']
['\ufeff42.332', '-71.0202']
['\ufeff31.848', '-106.4270']
['\ufeff42.383', '-83.1022']
['\ufeff36.171', '-86.7850']
['\ufeff45.537', '-122.6500']
['\ufeff35.102', '-89.9774']
['\ufeff35.467', '-97.5137']
['\ufeff36.229', '-115.2601']
['\ufeff38.165', '-85.6474']
['\ufeff39.300', '-76.6105']
['\ufeff43.063', '-87.9667']
['\ufeff35.105', '-106.6474']
['\ufeff32.153', '-110.8706']
['\ufeff36.783', '-119.7934']
...

Now we have all of our relevant data from the wiki page cleaned up and in a new DataFrame. Let's look at some temperature data and see if we can add it to our existing DataFrame.

In [511]:
temp_url = "https://www.infoplease.com/math-science/weather/climate-of-100-selected-us-cities"
temp_lib = urllib3.PoolManager()
r = temp_lib.request("GET", temp_url)
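# Name the parser explicitly so the result does not depend on which parsers are installed.
temperature = bs4.BeautifulSoup(r.data, "html.parser")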



In [707]:
data = temperature.find_all("table")[0]
rows = data.find_all("tr")[3:]
temp_data = []
for row in rows:
    temp = []
    for i in row.find_all("td"):
        temp.append(i.text)
    temp_data.append(temp)

temp_data = pd.DataFrame.from_records(temp_data)
temp_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,"Albany, N.Y.",22.2,46.6,71.1,49.3,38.6,136,64.4,57
1,"Albuquerque, N.M.",35.7,55.6,78.5,57.3,9.47,60,11.0,64
2,"Anchorage, Alaska",15.8,36.3,58.4,34.1,16.08,115,70.8,39 / 60
3,"Asheville, N.C.",35.8,54.1,73.0,55.2,47.07,126,15.3,39
4,"Atlanta, Ga.",42.7,61.6,80.0,62.8,50.2,115,2.1,69 / 65


In [708]:
temp_data.drop(columns = [8], inplace = True)
columns = ["City", "Avg Winter Temp", "Avg Spring Temp", "Avg Summer Temp", "Avg Fall Temp",
           "Average Precip", "Precip Days", "Average Snowfall"]
temp_data.columns = columns
remove_state = lambda x: x.split(",")[0]
remove_trace = lambda x: 0 if x == "trace" else float(x)
remove_hyphen = lambda x: x.split("-")[0] if len(x.split("-")) > 1 else x

temp_data["City"] = temp_data["City"].apply(remove_state)
temp_data["City"] = temp_data["City"].apply(remove_hyphen)
for col in columns[1:]:
    temp_data[col] = temp_data[col].apply(remove_trace)
temp_data

Unnamed: 0,City,Avg Winter Temp,Avg Spring Temp,Avg Summer Temp,Avg Fall Temp,Average Precip,Precip Days,Average Snowfall
0,Albany,22.2,46.6,71.1,49.3,38.60,136.0,64.4
1,Albuquerque,35.7,55.6,78.5,57.3,9.47,60.0,11.0
2,Anchorage,15.8,36.3,58.4,34.1,16.08,115.0,70.8
3,Asheville,35.8,54.1,73.0,55.2,47.07,126.0,15.3
4,Atlanta,42.7,61.6,80.0,62.8,50.20,115.0,2.1
5,Atlantic City,32.1,50.6,75.3,55.1,40.59,113.0,16.2
6,Austin,50.2,68.3,84.2,70.6,33.65,85.0,0.9
7,Baltimore,32.3,53.2,76.5,55.4,41.94,115.0,21.5
8,Baton Rouge,50.1,66.6,81.7,68.1,63.08,110.0,0.2
9,Billings,24.0,46.1,72.0,48.1,14.77,96.0,56.9
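
The `remove_trace` lambda above raises a `ValueError` on any cell that is neither "trace" nor a plain number. A minimal sketch of a more tolerant converter follows. The helper name `to_float` and its `trace_value` parameter are hypothetical, and the sample cells are made up for illustration.

```python
def to_float(cell, trace_value=0.0):
    """Convert a scraped table cell to float; return None if it is not numeric."""
    text = cell.strip().replace(",", "")
    if text.lower() == "trace":
        return trace_value
    try:
        return float(text)
    except ValueError:
        return None

# Made-up cells covering the cases the helper is meant to handle.
print([to_float(c) for c in ["22.2", "trace", "1,234.5", "n/a"]])
# [22.2, 0.0, 1234.5, None]
```

Cells that come back as `None` can then be inspected before deciding whether to drop the row.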


Now the temperature data is cleaned up. Let's try joining the two tables. We will lose some rows because not every city name matches. Hopefully Boise is still left after all this. We can then see the remaining cities on the map again.

In [709]:
joined = wiki_data.join(temp_data.set_index("City"), on = "City")
joined = joined.dropna()
joined.reset_index(drop = True, inplace = True)
joined.loc[joined["City"] == "Boise"]

Unnamed: 0,City,2018estimate,2010Census,2016 land area,2016 population density,Location,Avg Winter Temp,Avg Spring Temp,Avg Summer Temp,Avg Fall Temp,Average Precip,Precip Days,Average Snowfall
55,Boise,228790,205671,82.1,2718,"﻿43.600,-116.2317",30.2,50.6,74.7,52.8,12.19,89.0,20.6
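
Because the join matches on exact city names, it can help to list which names fail to match before dropping rows. The sketch below assumes two plain lists of names. The helper `unmatched_names` is hypothetical, and the example names are made up rather than taken from the actual mismatches in this dataset.

```python
def unmatched_names(left_names, right_names):
    """Return the names in left_names that have no exact match in right_names."""
    right = set(right_names)
    return sorted(name for name in set(left_names) if name not in right)

# Made-up lists standing in for wiki_data["City"] and temp_data["City"].
wiki_names = ["Boise", "New York", "Washington"]
temp_names = ["Boise", "New York", "Washington D.C."]
print(unmatched_names(wiki_names, temp_names))  # ['Washington']
```

Names that show up here could be fixed by hand before the join instead of being lost to `dropna()`.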


Looks like we still have Boise on the list! Now we can start gathering data about common venues in each area using the Foursquare API. Let's learn more about the JSON response that comes back, starting with Boise.

In [710]:
value = ",".join(joined.loc[joined["City"] == "Boise"]["Location"].values)[1:]

url = "https://api.foursquare.com/v2/venues/explore"
params = dict(
    client_id = 'W5CHCMS4RL2BMAQVYOORLJQCF4XLGWK42SDQWAG1XFLX2LRV',
    client_secret = 'BVUEP1F3ZHWSQGN24XBSDK0GK15CBU2OJZ53VLQIR4PQQQ4U',
    v = '20180323',
    ll = value,
    limit = 100,
    radius = 5000)

# print(params)
req = requests.get(url = url, params = params)
data = json.loads(req.text)

# data
# for i in data["response"]["groups"][0]["items"]:
#     venues = i["venue"]["categories"]
#     for v in venues:
#         print(v["shortName"])
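
Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below only builds the parameter dictionary and makes no request. The function name `foursquare_params` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions for illustration, not something Foursquare prescribes.

```python
import os

def foursquare_params(ll, env=os.environ, **extra):
    """Build the query parameters, reading credentials from environment variables."""
    # The two variable names below are hypothetical; pick whatever fits your setup.
    params = {
        "client_id": env["FOURSQUARE_CLIENT_ID"],
        "client_secret": env["FOURSQUARE_CLIENT_SECRET"],
        "v": "20180323",
        "ll": ll,
    }
    params.update(extra)
    return params

# A plain dict stands in for os.environ so the example runs anywhere.
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
params = foursquare_params("43.600,-116.2317", env=demo_env, limit=100, radius=5000)
print(sorted(params))
```

The resulting dictionary can be passed to `requests.get(url = url, params = params)` in the same way as the hand-written one.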

In [741]:
m = folium.Map(location = [37.09,-95.71], zoom_start = 4, tiles = "Stamen Toner")
for row in range(joined.shape[0]):
    row = joined.iloc[row]
    location = row["Location"][1:].split(",")
    if row["City"] == "Boise":
        color = "red"
    else:
        color = "green"
    folium.Circle(location = location, color = color, popup = row["City"], radius = 3000).add_to(m)
m

This function requests data from the Foursquare API and returns the category IDs of the venues in the area.

In [711]:
def returnPopVenue(row, limit = 75, rad = 5000):
    value = joined.iloc[row]["Location"][1:]
#     print(value)
    url = "https://api.foursquare.com/v2/venues/explore"
    params = dict(
        client_id = 'W5CHCMS4RL2BMAQVYOORLJQCF4XLGWK42SDQWAG1XFLX2LRV',
        client_secret = 'BVUEP1F3ZHWSQGN24XBSDK0GK15CBU2OJZ53VLQIR4PQQQ4U',
        v = '20180323',
        ll = value,
        limit = limit,
        radius = rad,
        section = "trending")
    
    req = requests.get(url = url, params = params)
    data = json.loads(req.text)
    ret = []
    for i in data["response"]["groups"][0]["items"]:
        venues = i["venue"]["categories"]
        for v in venues:
#             print(v["shortName"])
            ret.append(v["id"])
    return ret
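
The function above indexes straight into `data["response"]["groups"][0]["items"]`, which raises an exception if the API returns an error or an empty result. Here is a minimal sketch of a more tolerant extraction step, assuming the same response shape as the code above. The helper name `category_ids` and the sample payloads are hypothetical, and unlike the original it walks every group rather than only the first.

```python
def category_ids(payload):
    """Collect category IDs from an explore-style response, skipping missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    ids = []
    for group in groups:
        for item in group.get("items", []):
            for category in item.get("venue", {}).get("categories", []):
                ids.append(category["id"])
    return ids

# Made-up payload that mimics the shape used above; the IDs are not real.
sample = {"response": {"groups": [{"items": [
    {"venue": {"categories": [{"id": "cat-a", "shortName": "Coffee"}]}},
    {"venue": {"categories": [{"id": "cat-b", "shortName": "Park"}]}},
]}]}}
print(category_ids(sample))                   # ['cat-a', 'cat-b']
print(category_ids({"meta": {"code": 400}}))  # []
```

With this shape, a city that returns no venues yields an empty list instead of stopping the whole loop.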

Now we are going to gather all the categories that fall under the main "Parent" categories.

In [712]:
params = dict(
        client_id = 'W5CHCMS4RL2BMAQVYOORLJQCF4XLGWK42SDQWAG1XFLX2LRV',
        client_secret = 'BVUEP1F3ZHWSQGN24XBSDK0GK15CBU2OJZ53VLQIR4PQQQ4U',
        v = '20180323')
url = "https://api.foursquare.com/v2/venues/categories"
req = requests.get(url = url, params = params)
data = json.loads(req.text)["response"]
categories = []
for cat in data["categories"]:
    categories.append(cat["name"])

# Map each parent category name to the IDs of its direct subcategories.
# Deeper levels of the category tree are not included here.
id_dict = {}
for cat in data["categories"]:
    id_dict[cat["name"]] = [sub["id"] for sub in cat["categories"]]
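
The cell above only records the first level of subcategories under each parent, so a venue tagged with a deeper subcategory would not be counted. A minimal sketch of a recursive walk is shown below, assuming each node carries `id`, `name` and a `categories` list as in the code above. The helper `collect_ids` and the small example tree are hypothetical.

```python
def collect_ids(category):
    """Return the IDs of every descendant of a category node, at any depth."""
    ids = []
    for child in category.get("categories", []):
        ids.append(child["id"])
        ids.extend(collect_ids(child))
    return ids

# Made-up two-level tree in the same shape as the categories response.
tree = [
    {"id": "food", "name": "Food", "categories": [
        {"id": "asian", "name": "Asian", "categories": [
            {"id": "sushi", "name": "Sushi", "categories": []},
        ]},
        {"id": "bakery", "name": "Bakery", "categories": []},
    ]},
]
nested_id_dict = {cat["name"]: collect_ids(cat) for cat in tree}
print(nested_id_dict)  # {'Food': ['asian', 'sushi', 'bakery']}
```

Using a mapping built this way would change the category counts, so the results further down would not be directly comparable.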

This function counts how many of a city's venues fall under each parent category.

In [717]:
def cat_count(row):
    # Start every parent category at zero.
    type_count = {cat: 0 for cat in categories}
    test = returnPopVenue(row)

    for a in test:
        for cat in id_dict.keys():
            if a in id_dict[cat]:
                type_count[cat] += 1
    return type_count
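
The nested loop above scans every parent's ID list for every venue. A reverse lookup from category ID to parent does the same counting in a single pass. The sketch below uses a hypothetical helper `count_by_parent` and made-up IDs. It differs from the original in one respect: if an ID were listed under two parents, it would be counted only under the one that appears last in `id_dict`.

```python
def count_by_parent(venue_ids, id_dict):
    """Count venue category IDs per parent category using a reverse lookup."""
    parent_of = {cid: parent for parent, ids in id_dict.items() for cid in ids}
    counts = dict.fromkeys(id_dict, 0)
    for cid in venue_ids:
        if cid in parent_of:
            counts[parent_of[cid]] += 1
    return counts

# Made-up mapping and venue IDs for illustration.
demo_ids = {"Food": ["a", "b"], "Shop & Service": ["c"], "Event": []}
print(count_by_parent(["a", "a", "c", "zzz"], demo_ids))
# {'Food': 2, 'Shop & Service': 1, 'Event': 0}
```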

Finally we will run these functions on our cities and get new features to add to our dataset.

In [654]:
cat_counts = []
for row in range(joined.shape[0]):
    temp = cat_count(row)
    cat_counts.append(temp)

We divide each city's category counts by that city's total, so every row becomes a set of proportions. This normalizes the data across cities. After that, we join it to our existing dataset.

In [827]:
cat_data = pd.DataFrame(cat_counts)
# Divide every row by its own total so each city's categories sum to 1.
cat_data = cat_data.div(cat_data.sum(axis = 1), axis = 0)
cat_data["City"] = joined["City"]
final = joined.join(cat_data.set_index("City"), on = "City", how = "left")
cities = final[["City", "Location"]]
final.drop_duplicates(subset = "Location", inplace = True)
# final[final["City"] == "Portland"]
final.head()

Unnamed: 0,City,2018estimate,2010Census,2016 land area,2016 population density,Location,Avg Winter Temp,Avg Spring Temp,Avg Summer Temp,Avg Fall Temp,...,Arts & Entertainment,College & University,Event,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,New York,8398748,8175133,301.5,28317,"﻿40.663,-73.9387",32.1,52.5,76.5,56.6,...,0.085106,0.0,0.0,0.553191,0.106383,0.12766,0.021277,0.0,0.106383,0.0
1,Los Angeles,3990456,3792621,468.7,8484,"﻿34.019,-118.4108",57.1,60.8,69.3,66.9,...,0.069767,0.0,0.0,0.604651,0.0,0.069767,0.023256,0.0,0.232558,0.0
2,Chicago,2705994,2695598,227.3,11900,"﻿41.837,-87.6818",22.0,47.8,73.3,52.1,...,0.071429,0.0,0.0,0.660714,0.053571,0.107143,0.0,0.0,0.107143,0.0
3,Houston,2325502,2100263,637.5,3613,"﻿29.786,-95.3909",51.8,68.5,83.6,70.4,...,0.054545,0.0,0.0,0.690909,0.036364,0.127273,0.0,0.0,0.090909,0.0
4,Phoenix,1660272,1445632,517.6,3120,"﻿33.572,-112.0901",54.2,70.2,92.8,74.6,...,0.05,0.0,0.0,0.6,0.016667,0.083333,0.0,0.0,0.25,0.0
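
For reference, the row-wise normalization amounts to dividing each count by the row total. A minimal sketch on a plain dictionary follows. The helper `to_proportions` is hypothetical and the counts are made up. Note one design difference: the pandas version produces NaN for a city with no venues, while this sketch returns zeros.

```python
def to_proportions(counts):
    """Turn category counts into shares of the total; all zeros if the total is 0."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}

# Made-up counts for one city.
print(to_proportions({"Food": 3, "Nightlife Spot": 1, "Event": 0}))
# {'Food': 0.75, 'Nightlife Spot': 0.25, 'Event': 0.0}
```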


In [828]:
from sklearn.cluster import KMeans

k = 7

model_input = final.drop(["City", "Location"], axis = 1)

model = KMeans(n_clusters = k, random_state = 2).fit(model_input.dropna())
labels = model.labels_
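
K-means groups rows by Euclidean distance, so columns with large magnitudes (population here) can outweigh columns such as temperature or the category proportions. One common remedy is to standardize each column before clustering, which the notebook does not do. The sketch below shows the idea with the standard library only. The helper `standardize` is hypothetical, and the three rows reuse the population and winter temperature values for New York, Boise and Los Angeles from the tables above.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Scale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant columns map to 0
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# (2018 population estimate, average winter temperature)
rows = [(8398748, 32.1), (228790, 30.2), (3990456, 57.1)]
scaled = standardize(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

In a pandas and scikit-learn workflow, `sklearn.preprocessing.StandardScaler` performs the same kind of scaling. Applying it would likely change the clusters reported below.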

In [829]:
final.dropna(inplace = True)
# print(final[final["City"] == "Seattle"])
final["Labels"] = labels
m = folium.Map(location = [37.09,-95.71], zoom_start = 4, tiles = "Stamen Toner")
# One color per cluster label (k = 7).
colors = ["orange", "blue", "yellow", "pink", "lightgreen", "cadetblue", "darkred"]
for row in range(final.shape[0]):
    row = final.iloc[row]
    location = row["Location"][1:].split(",")
    color = colors[int(row["Labels"])]
    folium.Circle(location = location, color = color, popup = row["City"], radius = 3000).add_to(m)
m
# print(final[final["City"] == "Seattle"])

In [830]:
boise_label = final[final["City"] == "Boise"]["Labels"].values[0]
potential_cities = final[final["Labels"] == boise_label]

It is interesting to see the list of cities that the model says are similar to Boise. There is a surprisingly large number of cities on the East Coast and in the South. Let's run it one more time with just the cities that were placed in the same cluster as Boise and see if we can get a little more separation and narrow down our list further.

In [843]:
k = 3

model_input = potential_cities.drop(["City", "Location"], axis = 1)

model = KMeans(n_clusters = k, random_state = 2).fit(model_input)
labels = model.labels_
# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning.
potential_cities = potential_cities.copy()
potential_cities["New Labels"] = labels
# One color per cluster label (k = 3).
colors = ["orange", "blue", "yellow"]
m = folium.Map(location = [37.09,-95.71], zoom_start = 4, tiles = "Stamen Toner")
for row in range(potential_cities.shape[0]):
    row = potential_cities.iloc[row]
    location = row["Location"][1:].split(",")
    color = colors[int(row["New Labels"])]
    folium.Circle(location = location, color = color, popup = row["City"], radius = 3000).add_to(m)
m


Finally, we can see the "Top 17" cities that are similar to Boise. The results are much different than I expected. I didn't think there would be such variety in location across the US; I was expecting that pretty much all the similar cities would be in the Northwest. There are likely other features that I could add to create a more accurate representation of similar cities, or of the things that are important to me specifically. Overall, I am pretty happy with the results of this experiment, and it gives me some ideas for places to move to!

In [844]:
boise_label = potential_cities[potential_cities["City"] == "Boise"]["New Labels"].values[0]
final_cities = potential_cities[potential_cities["New Labels"] == boise_label]
final_cities

Unnamed: 0,City,2018estimate,2010Census,2016 land area,2016 population density,Location,Avg Winter Temp,Avg Spring Temp,Avg Summer Temp,Avg Fall Temp,...,Event,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport,Labels,New Labels
55,Boise,228790,205671,82.1,2718,"﻿43.600,-116.2317",30.2,50.6,74.7,52.8,...,0.0,0.622951,0.032787,0.016393,0.0,0.0,0.311475,0.0,5,0
56,Richmond,228783,204214,59.8,3732,"﻿37.531,-77.4760",36.4,57.1,77.9,58.3,...,0.0,0.685185,0.092593,0.074074,0.0,0.0,0.092593,0.0,5,0
57,Baton Rouge,221599,229493,85.9,2651,"﻿30.442,-91.1309",50.1,66.6,81.7,68.1,...,0.0,0.568966,0.12069,0.051724,0.0,0.0,0.189655,0.0,5,0
58,Spokane,219190,208916,68.7,3144,"﻿47.666,-117.4333",27.3,46.5,68.6,47.2,...,0.0,0.596491,0.157895,0.087719,0.017544,0.0,0.087719,0.0,5,0
59,Des Moines,216853,203433,88.9,2424,"﻿41.572,-93.6102",20.4,50.6,76.1,52.8,...,0.0,0.542373,0.135593,0.118644,0.016949,0.0,0.101695,0.0,5,0
60,Birmingham,209880,212237,146.1,1452,"﻿33.527,-86.7990",42.6,61.3,80.2,62.9,...,0.0,0.684211,0.087719,0.070175,0.0,0.0,0.070175,0.0,5,0
61,Salt Lake City,200591,186440,111.2,1742,"﻿40.776,-111.9310",29.2,50.0,77.0,52.5,...,0.0,0.539683,0.047619,0.031746,0.0,0.0,0.349206,0.015873,5,0
62,Grand Rapids,200217,188040,44.4,4424,"﻿42.961,-85.6556",22.4,46.3,71.4,49.9,...,0.0,0.636364,0.0,0.0,0.0,0.0,0.363636,0.0,5,0
63,Montgomery,198218,205764,159.8,1252,"﻿32.347,-86.2661",46.6,64.3,81.8,65.4,...,0.0,0.46875,0.015625,0.09375,0.015625,0.0,0.40625,0.0,5,0
64,Little Rock,197881,193524,118.7,1673,"﻿34.725,-92.3586",40.1,61.4,82.4,63.3,...,0.0,0.655738,0.0,0.0,0.0,0.0,0.327869,0.016393,5,0


In [845]:
m = folium.Map(location = [37.09,-95.71], zoom_start = 4, tiles = "Stamen Toner")
# Same palette as the previous map, indexed by cluster label.
colors = ["orange", "blue", "yellow"]
for row in range(final_cities.shape[0]):
    row = final_cities.iloc[row]
    location = row["Location"][1:].split(",")
    color = colors[int(row["New Labels"])]
    folium.Circle(location = location, color = color, popup = row["City"], radius = 3000).add_to(m)
m
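
Cluster membership only says that a city landed in the same group as Boise, not how close it is. A different technique from the one used in this notebook is to rank cities directly by their distance to Boise in feature space. The sketch below does that with the average winter and summer temperatures copied from the final table above. The helper `rank_by_distance` is hypothetical, and using only two unscaled features is a simplification for illustration.

```python
import math

def rank_by_distance(target, features):
    """Sort the other cities by Euclidean distance to the target's feature vector."""
    origin = features[target]
    others = ((math.dist(origin, vec), city)
              for city, vec in features.items() if city != target)
    return [city for _, city in sorted(others)]

# (Avg Winter Temp, Avg Summer Temp) from the final table.
features = {
    "Boise": (30.2, 74.7),
    "Spokane": (27.3, 68.6),
    "Salt Lake City": (29.2, 77.0),
    "Baton Rouge": (50.1, 81.7),
}
print(rank_by_distance("Boise", features))
# ['Salt Lake City', 'Spokane', 'Baton Rouge']
```

With more features, the columns would need to be put on comparable scales first, otherwise the largest-valued column decides the ranking.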