# This is the week 4/5 capstone course

## The Mission

My mission is to find the best locations to open a new chain of Pet Cafés in the state of Connecticut. I'll be looking for parts of the state with high levels of pet ownership, as well as parts that are underrepresented in terms of pet shops. We're not really a pet shop, although you can certainly purchase pet supplies at our stores. Imagine Spot Coffee, the chain found in the Buffalo, NY area: funky, artistic, hip, but with pets. Yes, that's right, you can enjoy a cup of your favorite coffee or tea, munch on a croissant, and have fun with your pet while socializing with other pet owners. It's a great way to meet other people who share a love of pets.

## The Data

Connecticut has a database of dog ownership, based on the number of dog licenses sold each year going back to the 1984/85 season. I'm going to trim that back to the seven most recent seasons (2009/10 through 2015/16), and then project what the dog ownership level will be for each town in 2020. I'll also work out the trend for each town. Increasing dog ownership means more business, but only if there is already a large number of pets in the town. A steeply dropping rate will be a red flag.


I used the USPS zip code directory to get the coordinates of each town. 


Once I found the coordinates of each town, I used a 5 km radius from that center point to locate all of the pet services in the town using FourSquare data. I narrowed the search to just pet services.


I wanted to get a map that showed all of the town borders. The best way to do that was to superimpose some GeoJSON data provided by the state of Connecticut. The first map shows those towns, as well as the top 32 rated towns in terms of dog ownership.


Then I did a k-means cluster analysis of those 32 towns, using their top venues, and plotted them out. The resulting map shows ideal places to start Pet Cafés. These are cluster 1, and include places like Manchester, Windsor, Plainville, Torrington, and Middletown. This makes sense because these are towns with large populations of dogs, and presumably dog lovers as well.


In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import math
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')



Libraries imported.


In [2]:
# The code was removed by Watson Studio for sharing.

# Predefined Methods

These were taken from the lab work and slightly modified

In [3]:
def getNearbyVenues(names, latitudes, longitudes, radius=5000):
    LIMIT = 100  # maximum number of venues returned per request

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL, restricted to the pet-related category IDs
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId=56aa371be4b08b9a8d573508,4bf58dd8d48988d1e5941735,5032897c91d4c4b30a586d69,4bf58dd8d48988d100951735'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        # (a town with no results simply adds nothing to the list)
        venues_list.extend([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # build the dataframe once, after every town has been queried,
    # so it is defined even when no town returns any venues
    nearby_venues = pd.DataFrame(venues_list, columns=['Town',
                      'Town Latitude',
                      'Town Longitude',
                      'Venue',
                      'Venue Latitude',
                      'Venue Longitude',
                      'Venue Category'])

    return nearby_venues
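
The request URL above is assembled with one long format string. As a hedged alternative, here is a minimal sketch that builds the same query from a dictionary using the standard library. The helper name `build_explore_url`, the placeholder credentials, and the `VERSION` date are my own assumptions for illustration; only the endpoint, the parameter names, and the four category IDs come from the function above. It makes no network call.

```python
from urllib.parse import urlencode

# Placeholder credentials: hypothetical values, not real keys.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20180605"  # assumed API version date

# The four category IDs used in getNearbyVenues.
PET_CATEGORY_IDS = [
    "56aa371be4b08b9a8d573508",
    "4bf58dd8d48988d1e5941735",
    "5032897c91d4c4b30a586d69",
    "4bf58dd8d48988d100951735",
]

def build_explore_url(lat, lng, radius=5000, limit=100):
    """Return the venues/explore URL for one town center (hypothetical helper)."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "categoryId": ",".join(PET_CATEGORY_IDS),
    }
    # urlencode percent-escapes the commas, which is valid in a query string
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Enfield's coordinates from the zip code data
enfield_url = build_explore_url(41.96, -72.56)
print(enfield_url)
```

Keeping the parameters in a dictionary makes it easier to see at a glance which radius, limit, and categories a search used.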



# Constants

In [4]:
ct_ll = [41.6032, -73.0877] #lat/lng of Connecticut

# Using a Watson Studio Data Asset

I found free information about the number of dog licenses sold in each town in my state. I downloaded that as a csv file and then uploaded it to Watson Studio. Watson Studio generated the code below to load it into a dataframe in Jupyter.

In [5]:
# The code was removed by Watson Studio for sharing.

# Using another Watson Studio Data Asset

I found free zip code and location data from the USPS, so I'm using that to get location data for each town. Then I'll clean it up and drop the columns I won't be needing. CT is interesting because counties might also be used for area searches.

In [6]:

body = client_4b16bd7d10374b51bb1e27f0ff9b6dc3.get_object(Bucket='courseracapstoneproject-donotdelete-pr-gx3hsdtocpdrrc',Key='zip_code_database.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_zipcodes = pd.read_csv(body)
df_zipcodes.head()



Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0


# Geo Data

Load the CT geodata that I pulled from the CT government website

In [7]:

# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about your possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
jsonFileStream = client_4b16bd7d10374b51bb1e27f0ff9b6dc3.get_object(Bucket='courseracapstoneproject-donotdelete-pr-gx3hsdtocpdrrc', Key='ct_geo.geojson')['Body']
# add missing __iter__ method so pandas accepts body as file-like object
if not hasattr(jsonFileStream, "__iter__"): jsonFileStream.__iter__ = types.MethodType( __iter__, jsonFileStream ) 

ct_geo = json.loads(jsonFileStream.read().decode('utf-8')) 



# Cleanse

Get rid of unwanted columns and remove leading spaces from town name

In [8]:
unwanted = ['2008/09','2007/08','2006/07','2005/06','2004/05','2003/04','2002/03','2001/02','2000/01','1999/00','1998/99','1997/98','1996/97','1995/96','1994/95','1993/94','1992/93','1991/92','1990/91','1989/90','1988/89','1987/88','1986/87','1985/86','1984/85']
dog_licenses.drop(unwanted, axis=1, inplace=True)
dog_licenses['TOWN'] = dog_licenses['TOWN'].str.strip()

dog_licenses.head()

Unnamed: 0,TOWN,2015/16,2014/15,2013/14,2012/13,2011/12,2010/11,2009/10
0,ANDOVER,281,257,261,318,332,335,377
1,ANSONIA,931,899,969,939,955,986,989
2,ASHFORD,531,539,499,527,570,602,586
3,AVON,1935,1952,1922,1918,1876,1873,1825
4,BARKHAMSTED,605,665,640,736,522,547,547


And then let's clean up the zipcode data and limit it to just CT

In [9]:

unwanted = ['zip','type','decommissioned','acceptable_cities','unacceptable_cities', 'timezone','area_codes','world_region','country']
df_zipcodes.drop(unwanted, axis=1, inplace=True)
df_ct = df_zipcodes[df_zipcodes['state'] == 'CT'].drop_duplicates(subset='primary_city')
df_ct.drop('state', axis=1, inplace=True)
df_ct.columns = ['TOWN','COUNTY','LAT','LNG','POPULATION 2015']
df_ct['TOWN'] = df_ct['TOWN'].str.upper()
df_ct.head()

Unnamed: 0,TOWN,COUNTY,LAT,LNG,POPULATION 2015
2069,AVON,Hartford County,41.8,-72.83,18380
2070,BLOOMFIELD,Hartford County,41.81,-72.73,18990
2071,WINDSOR,Hartford County,41.85,-72.65,0
2072,BRISTOL,Hartford County,41.68,-72.94,53490
2074,BURLINGTON,Hartford County,41.76,-72.96,9300


In [10]:
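years = [16, 15, 14, 13, 12, 11, 10]  # ending year of each licensing season, newest first
a_data = dog_licenses[['2015/16' ,'2014/15' ,'2013/14' ,'2012/13' ,'2011/12' ,'2010/11' ,'2009/10']]
trend = []
yhat = []
for index, data in a_data.iterrows():
    # fit a straight line: licenses = slope * year + intercept
    slope, intercept = np.polyfit(years, data.tolist(), 1)
    # yearly change as a fraction of the most recent (2015/16) count
    trend.append(slope / data.iloc[0])
    # extrapolate the fitted line to year 20, i.e. 2020
    yhat.append(int(slope * 20 + intercept))

dog_licenses['trend'] = trend
dog_licenses['2020_hat'] = yhat

# rank towns by projected 2020 licenses and scale to the largest projection
dog_licenses.sort_values(by=['2020_hat'], ascending=False, inplace=True)
#dog_licenses = dog_licenses.head(32)
yhat_max = dog_licenses.iloc[0]['2020_hat']
dog_licenses['2020_hat_norm'] = dog_licenses['2020_hat']/yhat_max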
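
To make the projection concrete, here is a small worked sketch for a single town, using Andover's license counts from the table above. It repeats the same straight-line fit with `np.polyfit` and the same extrapolation to year 20. The variable names are mine, chosen for illustration.

```python
import numpy as np

# Season-ending years and Andover's license counts, newest first,
# copied from the dog_licenses table.
years = [16, 15, 14, 13, 12, 11, 10]
andover = [281, 257, 261, 318, 332, 335, 377]

# straight-line fit: licenses = slope * year + intercept
slope, intercept = np.polyfit(years, andover, 1)

# yearly change relative to the most recent (2015/16) count
trend = slope / andover[0]

# projected license count for 2020
projected_2020 = int(slope * 20 + intercept)

print(round(slope, 2), round(trend, 6), projected_2020)
```

The slope works out to about 18 fewer licenses per year, and the projected value agrees with the `2020_hat` of 179 shown for Andover in the merged table.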

# Merge

Let's merge location data with the town data


In [11]:
df_ct2 = pd.merge(dog_licenses, df_ct, on='TOWN', how='inner')
df_ct2['TOWN'] = df_ct2['TOWN'].str.title()
df_ct2.tail()


Unnamed: 0,TOWN,2015/16,2014/15,2013/14,2012/13,2011/12,2010/11,2009/10,trend,2020_hat,2020_hat_norm,COUNTY,LAT,LNG,POPULATION 2015
151,Andover,281,257,261,318,332,335,377,-0.065455,179,0.040234,Tolland County,41.73,-72.36,3000
152,Eastford,164,176,182,198,185,192,187,-0.022648,157,0.035289,Windham County,41.9,-72.08,1310
153,Morris,242,298,311,304,339,404,423,-0.115555,135,0.030344,Litchfield County,41.68,-73.17,1940
154,Chaplin,165,150,207,227,216,203,220,-0.060606,128,0.028771,Windham County,41.8,-72.11,1920
155,Canaan,127,119,119,119,120,128,127,-0.005343,117,0.026298,Litchfield County,42.03,-73.33,2350
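
An inner merge keeps only the towns whose names match exactly in both tables, so any town spelled differently in the USPS data drops out silently. One way to see which towns were lost, sketched here on a toy example, is an outer merge with pandas' `indicator=True` option. The frames below are stand-ins for `dog_licenses` and `df_ct`; the "SAMPLE TOWN" rows and their values are made up to show a mismatch.

```python
import pandas as pd

# Toy stand-in for dog_licenses; the SAMPLE TOWN row is made up.
licenses = pd.DataFrame({
    "TOWN": ["AVON", "ANDOVER", "SAMPLE TOWN"],
    "2015/16": [1935, 281, 100],
})

# Toy stand-in for df_ct; SAMPLETOWN is a made-up spelling mismatch.
locations = pd.DataFrame({
    "TOWN": ["AVON", "ANDOVER", "SAMPLETOWN"],
    "LAT": [41.80, 41.73, 41.50],
})

# indicator=True adds a _merge column: both, left_only or right_only
check = pd.merge(licenses, locations, on="TOWN", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["TOWN", "_merge"]])
```

Running the same check on the real frames would list any towns that need a name fix before the merge.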


# Top 32 Dog Towns

For the 32 towns (out of 163 in CT) that are projected to have the highest dog ownership, let's see what doggie venues already exist!

In [12]:
df_ct3 = df_ct2.head(32)
ct_venues = getNearbyVenues(names=df_ct3['TOWN'],
                            latitudes=df_ct3['LAT'],
                            longitudes=df_ct3['LNG'])

Enfield
Manchester
New Canaan
Glastonbury
Milford
Westport
Fairfield
Norwalk
Darien
Southington
Stamford
Windsor
West Hartford
Ridgefield
Stratford
South Windsor
Newtown
New Milford
Ellington
Middletown
Southbury
Simsbury
Danbury
Stafford
Bristol
Tolland
Greenwich
Groton
Torrington
Meriden
Avon
Plainville


In [13]:
ct_venues.columns = ['Town', 't_lat','t_lng', 'Venue', 'v_lat', 'v_lng', 'cat']
ct_venues['Town'] = ct_venues['Town'].str.title()
ct_venues.head()

Unnamed: 0,Town,t_lat,t_lng,Venue,v_lat,v_lng,cat
0,Enfield,41.96,-72.56,PetSmart,41.988065,-72.582588,Pet Store
1,Enfield,41.96,-72.56,Petco,41.991216,-72.582516,Pet Store
2,Enfield,41.96,-72.56,Enfield Dog Park,41.958943,-72.550173,Dog Run
3,Enfield,41.96,-72.56,The Lowdown on Dogs,41.952102,-72.545508,Pet Service
4,Enfield,41.96,-72.56,Enfield Animal Hospital,41.962701,-72.588416,Pet Service
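
As a sanity check on the 5 km search radius, the distance from a town center to a venue can be computed with the haversine formula. This is a minimal sketch using only the standard library. The function name `haversine_km` is my own, and the coordinates are Enfield's center and the PetSmart row from the table above.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two lat/lng points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Enfield town center and the PetSmart venue returned for it
d = haversine_km(41.96, -72.56, 41.988065, -72.582588)
print(round(d, 2), d <= 5.0)
```

The same function could be applied between two town centers to see how much their 5 km search circles overlap.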


# First Map

Let's see where all of the doggie venues are today in 2019, and see where there might be
 - a lot of dog interest
 - maybe some competition
 
#### Just For Fun

Let's use some GeoJSON data to overlay the town borders.

In [14]:
# create map and display it
ct_map = folium.Map(location=ct_ll, zoom_start=9, height="100%",max_zoom=10, min_zoom=9)
spots = folium.map.FeatureGroup()

for lat, lng in zip(ct_venues.v_lat, ct_venues.v_lng):
    spots.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )
    
    
folium.GeoJson(
    ct_geo,
    name='geojson'
).add_to(ct_map)    
ct_map.add_child(spots)
    
# save the map as a standalone HTML file
ct_map.save("dogmap.html")

# display the dog map
ct_map

# Great Map

This shows us where the doggie hotspots are. These are the towns with high dog ownership, and also a number of pet venues. But look! There is room to expand. And what's going on in the Southeast corner?

Now let's start the k-means cluster analysis. To begin, let's build a table of all the types of pet venues, by town.

In [15]:
ct_onehot = pd.get_dummies(ct_venues[['cat']], prefix="", prefix_sep="")

# add the town column back to the dataframe
ct_onehot['Town'] = ct_venues['Town'] 

# move the town column to the first column
fixed_columns = [ct_onehot.columns[-1]] + list(ct_onehot.columns[:-1])
ct_onehot = ct_onehot[fixed_columns]

ct_onehot.head()


Unnamed: 0,Town,Animal Shelter,Aquarium,Dog Run,Garden Center,Hardware Store,Other Great Outdoors,Park,Pet Service,Pet Store,Plaza,Trail,Veterinarian
0,Enfield,0,0,0,0,0,0,0,0,1,0,0,0
1,Enfield,0,0,0,0,0,0,0,0,1,0,0,0
2,Enfield,0,0,1,0,0,0,0,0,0,0,0,0
3,Enfield,0,0,0,0,0,0,0,1,0,0,0,0
4,Enfield,0,0,0,0,0,0,0,1,0,0,0,0


And then normalize it by taking the mean of each venue category within each town, which gives the share of that town's venues in each category

In [16]:
ct_grouped = ct_onehot.groupby('Town').mean().reset_index()
ct_grouped


Unnamed: 0,Town,Animal Shelter,Aquarium,Dog Run,Garden Center,Hardware Store,Other Great Outdoors,Park,Pet Service,Pet Store,Plaza,Trail,Veterinarian
0,Avon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.583333,0.0,0.166667,0.0
1,Bristol,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.571429,0.357143,0.0,0.0,0.071429
2,Danbury,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.333333,0.466667,0.0,0.0,0.133333
3,Darien,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.5,0.0,0.0,0.1
4,Ellington,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.666667,0.0,0.0,0.0
5,Enfield,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.5,0.333333,0.0,0.0,0.0
6,Fairfield,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4,0.6,0.0,0.0,0.0
7,Glastonbury,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.071429,0.571429,0.0,0.0,0.285714
8,Greenwich,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.125,0.625,0.0,0.0,0.125
9,Groton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.75,0.0,0.0,0.0
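
To show what the one-hot and mean steps do, here is a small sketch for a single town. The first five rows are Enfield's venues from `ct_venues.head()`. The sixth row is an assumed placeholder that I added so the shares line up with Enfield's row in the grouped table.

```python
import pandas as pd

# Five Enfield rows from ct_venues.head(), plus one assumed sixth row
venues = pd.DataFrame({
    "Town": ["Enfield"] * 6,
    "cat": ["Pet Store", "Pet Store", "Dog Run",
            "Pet Service", "Pet Service", "Pet Service"],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[["cat"]], prefix="", prefix_sep="", dtype=float)
onehot["Town"] = venues["Town"]

# the mean of a 0/1 column is the share of venues in that category
shares = onehot.groupby("Town").mean()
print(shares.round(2))
```

Because each venue has exactly one category, the shares in a town's row always add up to 1.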


In [17]:
num_top_venues = 5

for hood in ct_grouped['Town']:
    print("----"+hood+"----")
    temp = ct_grouped[ct_grouped['Town'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Avon----
            venue  freq
0       Pet Store  0.58
1     Pet Service  0.25
2           Trail  0.17
3  Animal Shelter  0.00
4        Aquarium  0.00


----Bristol----
            venue  freq
0     Pet Service  0.57
1       Pet Store  0.36
2    Veterinarian  0.07
3  Animal Shelter  0.00
4        Aquarium  0.00


----Danbury----
            venue  freq
0       Pet Store  0.47
1     Pet Service  0.33
2    Veterinarian  0.13
3  Hardware Store  0.07
4  Animal Shelter  0.00


----Darien----
            venue  freq
0       Pet Store   0.5
1         Dog Run   0.2
2     Pet Service   0.2
3    Veterinarian   0.1
4  Animal Shelter   0.0


----Ellington----
            venue  freq
0       Pet Store  0.67
1     Pet Service  0.33
2  Animal Shelter  0.00
3        Aquarium  0.00
4         Dog Run  0.00


----Enfield----
            venue  freq
0     Pet Service  0.50
1       Pet Store  0.33
2         Dog Run  0.17
3  Animal Shelter  0.00
4        Aquarium  0.00


----Fairfield----
            

In [18]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]



Now let's look at the top ten pet venue types in each town

In [19]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Town']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 1st, 2nd and 3rd, every ordinal used here ends in "th"
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Town'] = ct_grouped['Town']

for ind in np.arange(ct_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ct_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Town,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Avon,Pet Store,Pet Service,Trail,Veterinarian,Plaza,Park,Other Great Outdoors,Hardware Store,Garden Center,Dog Run
1,Bristol,Pet Service,Pet Store,Veterinarian,Trail,Plaza,Park,Other Great Outdoors,Hardware Store,Garden Center,Dog Run
2,Danbury,Pet Store,Pet Service,Veterinarian,Hardware Store,Trail,Plaza,Park,Other Great Outdoors,Garden Center,Dog Run
3,Darien,Pet Store,Pet Service,Dog Run,Veterinarian,Trail,Plaza,Park,Other Great Outdoors,Hardware Store,Garden Center
4,Ellington,Pet Store,Pet Service,Veterinarian,Trail,Plaza,Park,Other Great Outdoors,Hardware Store,Garden Center,Dog Run


# Clustering!

So, let's try 5 clusters and see if that tells us where the best and worst places might be to open a Pet Café.

In [20]:
# set number of clusters
kclusters = 5

ct_grouped_clustering = ct_grouped.drop('Town', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ct_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# keep only the town name, coordinates and population from the merged town data
ct_merged = df_ct2.rename(columns={'TOWN': 'Town'})[['Town', 'LAT', 'LNG', 'POPULATION 2015']]

# left-join the cluster labels and top venues onto every town;
# towns outside the top 32 get NaN in the joined columns
ct_merged = ct_merged.join(neighborhoods_venues_sorted.set_index('Town'), on='Town')

ct_merged.head() # check the last columns!

Unnamed: 0,Town,LAT,LNG,POPULATION 2015,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Enfield,41.96,-72.56,36830,3.0,Pet Service,Pet Store,Dog Run,Veterinarian,Trail,Plaza,Park,Other Great Outdoors,Hardware Store,Garden Center
1,Manchester,41.78,-72.51,31370,1.0,Pet Store,Pet Service,Veterinarian,Trail,Garden Center,Plaza,Park,Other Great Outdoors,Hardware Store,Dog Run
2,New Canaan,41.14,-73.49,19690,2.0,Pet Store,Dog Run,Veterinarian,Trail,Plaza,Pet Service,Park,Other Great Outdoors,Hardware Store,Garden Center
3,Glastonbury,41.7,-72.6,27700,2.0,Pet Store,Veterinarian,Pet Service,Dog Run,Trail,Plaza,Park,Other Great Outdoors,Hardware Store,Garden Center
4,Milford,41.22,-73.06,34350,1.0,Pet Service,Pet Store,Veterinarian,Park,Trail,Plaza,Other Great Outdoors,Hardware Store,Garden Center,Dog Run
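
One way to interpret the clusters is to look at the cluster centers, which show the average venue mix of the towns in each cluster. Below is a hedged sketch on the ten towns printed in the grouped table above, keeping only the six categories that are non-zero for them. The choice of `k=3` is arbitrary for this small sample and is not the notebook's `kclusters = 5`, so the labels it prints will not match the ones in the table.

```python
import pandas as pd
from sklearn.cluster import KMeans

cols = ["Dog Run", "Hardware Store", "Pet Service",
        "Pet Store", "Trail", "Veterinarian"]

# venue shares copied from the first ten rows of ct_grouped
rows = {
    "Avon":        [0.0, 0.0, 0.25, 0.583333, 0.166667, 0.0],
    "Bristol":     [0.0, 0.0, 0.571429, 0.357143, 0.0, 0.071429],
    "Danbury":     [0.0, 0.066667, 0.333333, 0.466667, 0.0, 0.133333],
    "Darien":      [0.2, 0.0, 0.2, 0.5, 0.0, 0.1],
    "Ellington":   [0.0, 0.0, 0.333333, 0.666667, 0.0, 0.0],
    "Enfield":     [0.166667, 0.0, 0.5, 0.333333, 0.0, 0.0],
    "Fairfield":   [0.0, 0.0, 0.4, 0.6, 0.0, 0.0],
    "Glastonbury": [0.071429, 0.0, 0.071429, 0.571429, 0.0, 0.285714],
    "Greenwich":   [0.125, 0.0, 0.125, 0.625, 0.0, 0.125],
    "Groton":      [0.0, 0.0, 0.25, 0.75, 0.0, 0.0],
}
freq = pd.DataFrame.from_dict(rows, orient="index", columns=cols)

# k=3 is an assumption for this ten-town sample only
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(freq)

# each row is the average venue mix of one cluster
centers = pd.DataFrame(km.cluster_centers_, columns=cols).round(2)
print(centers)
print(pd.Series(km.labels_, index=freq.index, name="cluster"))
```

Reading the centers row by row shows, for example, whether a cluster is dominated by pet stores or by pet services, which helps explain why a given cluster looks attractive.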


In [21]:
# create map
map_clusters = folium.Map(location=ct_ll, zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# towns outside the top 32 have no cluster label (NaN) after the join
droppable = ct_merged['Cluster Labels'].isna().tolist()
keepers = [not d for d in droppable]
# integer cluster index for each town; unlabeled towns fall back to 0
lab_idx = ct_merged['Cluster Labels'].fillna(0).astype(int).tolist()
            



In [22]:
# The join leaves NaN labels for towns outside the top 32 and turns the labels into floats,
# so the map uses the integer lab_idx column instead of 'Cluster Labels'

ct_merged['keepers'] = keepers
ct_merged['droppable'] = droppable
ct_merged['lab_idx'] = lab_idx
print(ct_merged.describe())
#ct_merged = ct_merged[ct_merged['droppable']]
print(ct_merged.head())
print(ct_merged.describe())
        
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ct_merged['LAT'], ct_merged['LNG'], ct_merged['Town'], ct_merged['lab_idx']):
    

    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters.save("doggie_clusters.html")    
map_clusters


              LAT         LNG  POPULATION 2015  Cluster Labels     lab_idx
count  156.000000  156.000000       156.000000       32.000000  156.000000
mean    41.571410  -72.760513     12172.467949        1.281250    0.262821
std      0.246761    0.470256     11273.028802        1.113969    0.719428
min     41.050000  -73.630000         0.000000        0.000000    0.000000
25%     41.367500  -73.120000      3360.000000        0.000000    0.000000
50%     41.575000  -72.785000      8115.000000        1.000000    0.000000
75%     41.765000  -72.420000     18365.000000        2.000000    0.000000
max     42.030000  -71.830000     53490.000000        4.000000    4.000000
          Town    LAT    LNG  POPULATION 2015  Cluster Labels  \
0      Enfield  41.96 -72.56            36830             3.0   
1   Manchester  41.78 -72.51            31370             1.0   
2   New Canaan  41.14 -73.49            19690             2.0   
3  Glastonbury  41.70 -72.60            27700             2.0   
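
The NaN values in the summary above come from joining cluster labels for 32 towns onto all 156 towns. A hedged alternative to filling them with 0 is to drop the unlabeled towns before mapping. This sketch uses three towns from the tables above, with Andover standing in for a town outside the top 32.

```python
import pandas as pd

# Three towns from the merged data
towns = pd.DataFrame({
    "Town": ["Enfield", "Manchester", "Andover"],
    "LAT": [41.96, 41.78, 41.73],
    "LNG": [-72.56, -72.51, -72.36],
})

# Only clustered towns have a label; Andover is outside the top 32
labels = pd.DataFrame({
    "Town": ["Enfield", "Manchester"],
    "Cluster Labels": [3, 1],
})

# the left join gives Andover a NaN label, so the column becomes float
merged = towns.join(labels.set_index("Town"), on="Town")
print(merged["Cluster Labels"].tolist())

# keep only labeled towns and restore integer labels
clustered = merged.dropna(subset=["Cluster Labels"]).copy()
clustered["Cluster Labels"] = clustered["Cluster Labels"].astype(int)
print(clustered)
```

Plotting only the `clustered` rows would keep unlabeled towns from showing up on the map as if they belonged to cluster 0.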


# Final Report - Pet Café Project

## Introduction 

Pet Cafés are a new, hip, exciting idea, appealing to young and old alike. Pet Cafés are like coffee shops, but with pets. That's right, you can bring your dog or cat (I have rooms set aside for both). They can socialize. You can socialize. It's an amazing concept, and it's been proven in cities like Montreal, Canada, and Portland, Oregon.

So where is the ideal location for the next chain of Pet Cafés? There are some things to keep in mind. First, there has to be a population of pets, and hence pet owners, so I need to identify places that have lots of pets. Next, I believe that a Pet Café should be located in an area where some pet services are already available. I'm not trying to be a one-stop shop for pet supplies; I just want to be the place where you and your pet can socialize.

Ideally, I'll be able to find 5 or 6 locations in the state of Connecticut, where it will make sense to kick off this new and exciting venture.


## Data

First, I need to locate areas where there are a lot of pets. Happily, the state of Connecticut publishes statistics on things like dog licenses, and they have a lot of historical data. So to start with, I'll pull in that data and run a linear regression on it to find the top 20% of towns by dog population, where the population is growing or stable rather than shrinking dramatically.

Next, I need to find the geolocation of each town. I pull that information from the USPS zip code data to get the location of the center of each town.

Finally, FourSquare.com has information on pet-related businesses in all of these towns. So I will search a 5 km radius from the center of each high-dog town looking for venues. I used a 5 km radius because the towns are rather sparse. In previous labs I always used a much smaller radius, but I found that with a 500 meter radius I was not getting a lot of hits on venues. 5 km seemed like a good compromise, even though there could be some overlap with other towns.



## Methodology section

There were three methodologies employed:
 - linear regression
 - data visualization
 - k-means cluster analysis and mapping
 
I used linear regression to project dog ownership into the future. Connecticut's dog license data ran for more than three decades, but stopped in 2016. Linear regression allowed me to estimate what dog ownership might be like in the future in each town. I picked just the top 20% of towns, about 32 of them, as representative BIG DOG towns. That is, towns with the highest levels of dog ownership. The linear regression also showed the growth rate of dog ownership. This could have provided an important red flag, in case ownership was dropping steeply. Happily, none of the BIG DOG towns were flagged this way.

I used data visualization to show where pet venues are already concentrated. This showed some interesting results. There is definitely a corridor along the southwest part of the state, along Long Island Sound, where there are already a large number of venues. This trend then moves north along the I-84 corridor. Anecdotally, these are also high-income areas. A future study might try to correlate income levels with dog ownership.

Finally, I used k-means cluster analysis to discover towns that were similar and that might be a good fit for a first Pet Café. The cluster analysis looked at the types of venues already in each town, and identified areas where there were good pet services and also high dog ownership.



## Results section

The results were surprising, but not unusual. I was surprised that dog ownership varied so much across all 163 towns in Connecticut. That said, ownership rates over time varied only slightly from town to town. There were no big changes, other than stable growth or stable decline, one or two percent in each direction. I was able to identify those towns with high numbers of dogs, which accounted for about 20% of the towns. I call these the BIG DOG towns.

The towns with the highest dog ownership also seemed to be located in areas that were easily accessible by highway. That would be considered a plus for any kind of venue. The data visualization, in particular mapping the towns with high dog populations onto a map of Connecticut, really made this apparent.

The k-means cluster analysis showed some interesting locations for placing a Pet Café, namely Manchester, Windsor, Plainville, Torrington, and Middletown. These are locations with good support for pets already, and high pet ownership.



## Discussion section

There are clearly some ideal locations for the first Pet Cafés, namely, Manchester, Windsor, Plainville, Torrington, and Middletown. These are locations with good support for pets already, and high pet ownership. 

Connecticut already has a highly diverse, mobile population. Further recommendations would be to test the idea across a range of demographics and see how interested pet owners would be to spend time (and money) at the Pet Café. But that study is outside the scope of this work, and requires a different set of skills.


## Conclusion section

In brief, there are several locations in Connecticut that would be ideal for opening a new Pet Café. I used data from FourSquare.com to identify towns that already have good pet services, indicating a high level of interest in pet services in those towns. And I used publicly available pet ownership information published by the state of Connecticut.
