In [37]:
from credentials import FOURSQUARE_CLIENT_ID, FOURSQUARE_CLIENT_SECRET
FOURSQUARE_API_VERSION = '20191205'

In [38]:
%%html
<style> 
table td, table th, table tr {text-align:left !important;}
</style>
 
<!--
this cell fixes a bug in this jupyter notebook version, that does not allow left alignment in markdown tables
https://github.com/jupyterlab/jupyterlab/issues/6283
//-->

# Help saving the planet!
__Finding a location to set up a local second-hand store.__

---

## Introduction
### An environmental/cultural Problem
Fast fashion is the term used to describe clothing designs that move quickly from the catwalk to stores to meet new trends. The rise of fast fashion brands like Zara, Uniqlo, H&M, GAP and Topshop is accompanied with a huge increase of clothing production. According to a study of McKinsey & Company the worldwide clothing production doubled from 2000 to 2014 due to falling production costs, streamlined operations, and rising consumer spending.<sup>[1](https://www.mckinsey.com/business-functions/sustainability/our-insights/style-thats-sustainable-a-new-fast-fashion-formula)</sup> In Germany alone, 1,350,000 tons of textile waste arise yearly.<sup>[2](https://www.bvse.de/themen/geschichte-des-textilrecycling/zahlen-zur-sammlung-und-verwendung-von-altkleidern.html)</sup> 

Today 57% of German customers name sustainability as an importent buying criteria for clothing.<sup>[3](http://de.statista.com/statistik/daten/studie/955983/umfrage/umfrage-zu-wichtigen-kriterien-beim-kauf-von-bekleidung-in-deutschland)</sup> A break of the economic fast fashion trend towards sustainable and slow fashion is likely to happen in the next years.<sup>[4](https://www.fashionrevolution.org/),[5](https://www.greenpeace.de/sites/www.greenpeace.de/files/publications/s01951_greenpeace_report_konsumkollaps_fast_fashion.pdf),[6](https://fashion-week-berlin.com/en/blog/single-news/going-green-movement-on-german-catwalks.html)</sup> The reuse of already produced clothings is an approach of the Zero Waste and the Sustainable Fashion Movement to overcome the eco-unfriedly consumer behavior of the past years. Through the reuse of clothes and textiles, the pollution- and energy-intensive production of new clothing can be avoided.

With a combined 36% of the global share, the United States, the United Kingdom and Germany were the top three used clothes exporters in 2015. Because several countries in Eastern Africa and China announced to ban imports of textile waste, the purchasing prices on used clothings will most likely dramatically drop due to the collapse of the export market.<sup>[7](http://www.chinadaily.com.cn/china/2017-07/21/content_30194081.htm),[8](https://www.un.org/africarenewal/magazine/december-2017-march-2018/protectionist-ban-imported-used-clothing)</sup> 

### The Business Idea
The drop of purchasing prices for used clothings and the increase of consumer interest due to the zero waste/sustainability trend in German society, combined with a trend for vintage fashion leads to a relatively high profit margin. 

Therefore it's a great time to open up a second-hand store! Since our possible customers care about the environment and their own carbon footprint, let's decide to set up a local store instead of an online shop. Berlin has a lot of tourists and multiple fashion fairs every year, e.g. Bread&Butter, Mecedes Benz-Fashion Week. It's a fashion hotspot in Germany and the Berlin Street Style is famous among fashionistas around the globe.<sup>[8](https://www.vogue.com/slideshow/berlin-fashion-week-fall-2018)</sup> 

### The Business Problem
But what's the best address to set up our new store? In this project we will try to find an optimal location for a second-hand store. Specifically, this report will be targeted to stakeholders who are interested in opening up a second-hand store in Berlin, Germany.

Since another second-hand store nearby tends to draw the same demographic of customers and as all of our products are unique in their style, size and story they tell - we can assume, that we can benefit from another second-hand store nearby through an increase of our customer volume/walk-in traffic. 

We will use Data Science methods to generate a list of the most promissing locations based on the mentioned criteria.

## Data

For our analysis we will use mainly one data resource:

### __[Foursquare Places API:](https://developer.foursquare.com/docs/api/venues/search)__

We will use the _search_ endpoint of the _Foursquare Places API_ to fetch Data on existing local second-hand stores in Berlin, Germany. An API call will return a json object with two parts, _meta_ and _response_. The _meta_ part of the json will give us basic information on how the API handled our request, e.g. if an error occured. The _response_ part of the json contains a list of venues near the passed location, optionally matching a search term. 

An API-response will look like this:
```json
{
  "meta": {
    "code": 200,
    "requestId": "5ac51d7e6a607143d811cecb"
  },
  "response": {
    "venues": [
      {
        "id": "5642aef9498e51025cf4a7a5",
        "name": "Mr. Purple",
        "location": {
          "address": "180 Orchard St",
          "crossStreet": "btwn Houston & Stanton St",
          "lat": 40.72173744277209,
          "lng": -73.98800687282996,
          "labeledLatLngs": [
            {
              "label": "display",
              "lat": 40.72173744277209,
              "lng": -73.98800687282996
            }
          ],
          "distance": 8,
          "postalCode": "10002",
          "cc": "US",
          "city": "New York",
          "state": "NY",
          "country": "United States",
          "formattedAddress": [
            "180 Orchard St (btwn Houston & Stanton St)",
            "New York, NY 10002",
            "United States"
          ]
        },
        "categories": [
          {
            "id": "4bf58dd8d48988d1d5941735",
            "name": "Hotel Bar",
            "pluralName": "Hotel Bars",
            "shortName": "Hotel Bar",
            "icon": {
              "prefix": "https://ss3.4sqi.net/img/categories_v2/travel/hotel_bar_",
              "suffix": ".png"
            },
            "primary": true
          }
        ],
        "venuePage": {
          "id": "150747252"
        }
      }
    ]
  }
}
```
As we can see, the following attributes will be returned for each venue :

| Field | Description |
| :--- | :--- |
| id | A unique string identifier for this venue. |
| name | The best known name for this venue. |
| location | An object containing none, some, or all of address (street address), crossStreet, city, state, postalCode, country, lat, lng, and distance. All fields are strings, except for lat, lng, and distance. Distance is measured in meters. Some venues have their locations intentionally hidden for privacy reasons (such as private residences). If this is the case, the parameter isFuzzed will be set to true, and the lat/lng parameters will have reduced precision. |
| categories | An array, possibly empty, of categories that have been applied to this venue. One of the categories will have a primary field indicating that it is the primary category for the venue. For the complete category tree, see categories. |


Since we want another second-hand store nearby our location candidate, the _location_ attribute of the venues will be the most interesting attribute for our research.

## Methodology

In this project we direct our efforts on detecting areas of Berlin that have a high density of second-hand stores. 

In the first step we gather the necessary data. Then, we clean the collected data, e.g. remove irrelevant data or duplicates and possibly handle missing data. We then (visually) explore our clean data set and form a hypothesis about possible hot-spots and where would be a promissing location to open up a local second-hand store.

After that, we use DBSCAN-algorithm to cluster the locations of existing second-hand stores, to verify our hypothesis. While doing so, we take requirements of the stakeholders into consideration: At least 3 other second-hand stores nearby the location and one of them must be as close as 275 meters to our location candidate.

We then visualize our findings and present the promissing location candidates based on the given criteria. Our result should be a starting point for final 'street level' exploration and search for optimal venue location by stakeholders.

## Analysis

We start by getting our environment ready. Then, we define some function that will come in handy in the further analysis.

### Getting ready

In [39]:
# first, let's import all needed librarys
import json
import requests
import utm
import time
import random
import math
import numpy as np
import pandas as pd
import folium
from folium.plugins import MarkerCluster
import sklearn.utils
from sklearn.cluster import DBSCAN
from sklearn import metrics

In [40]:
# define functions to use in this project
def foursquare_search(payload: dict) -> requests.Response:
    """call the search endpoint of Foursquare Places API"""
    url = 'https://api.foursquare.com/v2/venues/search'    
    payload['client_id'] = FOURSQUARE_CLIENT_ID
    payload['client_secret'] = FOURSQUARE_CLIENT_SECRET
    payload['v'] = FOURSQUARE_API_VERSION
    try:
        r = requests.get(url, params=payload)
        r.raise_for_status()      
    except requests.exceptions.HTTPError as err:
        print(err, '\nServer-Response:\n', json.dumps(err.response.json(), sort_keys=True, indent=3))
    return r

def format_address(location):
    """receive a nicely formatted address out of the json-response from Foursquare API"""
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Deutschland', '')
    address = address.replace(', Germany', '')
    return address
    
def calc_xy_distance(x1, y1, x2, y2):
    """calculate distance in UTX coordinate system"""
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

def openstreetmap_reverse(latitude: float, longitude: float) -> requests.Response:
    url='http://nominatim.openstreetmap.org/reverse?'
    payload={
        'format': 'json',
        'lat': latitude,
        'lon': longitude,
        'zoom': 18,
        'addressdetails': 1
    }
    try:
        r = requests.get(url, params=payload)
        r.raise_for_status()      
    except requests.exceptions.HTTPError as err:
        print(err, '\nServer-Response:\n', err)
    return r

def jprint(obj):
    """create a formatted string of the Python JSON object"""
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

### Data Gathering

Now, we request the Foursquare Places API to respond second-hand stores located in a bounding box defined by the following coordinates:

| box-corner | latitude | longitude |
| --- | --- | --- |
| south-west | 52.370554 | 13.091972 |
| north-east | 52.670999 | 13.786405 |

The defined bounding box includes the whole area of Berlin, Germany. We take the category id from the API documentation: [here](https://developer.foursquare.com/docs/resources/categories).

In [41]:
response = foursquare_search({
            'intent': 'browse',
            'sw': '52.370554, 13.091972',
            'ne': '52.670999, 13.786405',
            'categoryId': '4bf58dd8d48988d101951735' # foursquare category of second-hand & vintage stores
        })

print(response.status_code)

200


The __code 200__, which is returned in the _meta_ part of the json object, tells us, that everything went well with our request. Now, we loop through the results of the API call to extract the venue data.

In [42]:
# create lists to buffer the values
id_foursquare, venue, latitude, longitude, address = [], [], [], [], []

# loop through all venues and extract only relevant values
for item in response.json()['response']['venues']:
    id_foursquare.append(item['id'])
    venue.append(item['name'])
    latitude.append(item['location']['lat']) 
    longitude.append(item['location']['lng'])
    address.append(format_address(item['location']))

# make a pandas dataframe from the lists
df_venues = pd.DataFrame({
    'id_foursquare': id_foursquare, 
    'venue': venue, 
    'latitude': latitude, 
    'longitude': longitude, 
    'address': address
    })

# view the first few lines of the dataframe to see if everything went well
display(df_venues.head())

# check on how many venues returned from our request
print('Number of venues: ', df_venues.shape[0])

Unnamed: 0,id_foursquare,venue,latitude,longitude,address
0,4bd070ea462cb7135250d807,Humana,52.516722,13.454311,"Frankfurter Tor 3, 10243 Berlin"
1,5016a16ee4b07c3cf3e2b3ef,Der Vorwende Laden,52.519731,13.453796,"Thaetstrasse 16, Berlin"
2,540afe74498e77fa823a120d,Repeater,52.48905,13.433036,"Pannierstraße 45 (Framstraße), Berlin"
3,4bd8278409ecb713f822487c,Humana,52.522775,13.416592,"Alexanderstr. 7 (Otto-Braun-Str.), 10178 Berlin"
4,4bbedab430c99c742a615411,Made in Berlin,52.524908,13.405949,"Neue Schönhauser Str. 19, 10178 Berlin"


Number of venues:  30


The response from the API __only__ contains __30 venues__. Did you also expect the number of second-hand stores in Berlin to be greater? The [API documentation](https://developer.foursquare.com/docs/api/venues/search) does not state any limits on the quantity of returned venues, but maybe there is an __undocumented limitation__. Let's break down the area of interest in __smaller search areas__ to see if we will get more venues by that. 

We will use the UTM coordinate system to make calculations easier.

In [43]:
# convertion of latitude and logitude of Berlin city center to UTM
u = utm.from_latlon(52.5219184, 13.4132147)
X = u[0]
Y = u[1]
print(f'X = {X}, Y = {Y}')

# check by reverse calculation:
lat, lon = utm.to_latlon(X, Y, 33, 'U')
print(f'longitude = {lon}, latitude = {lat}')

X = 392341.28017522604, Y = 5820273.243732004
longitude = 13.413214559173607, latitude = 52.52191840004866


In [44]:
# defining centers of search areas:
k = math.sqrt(3) / 2 # vertical offset for grid cells
x_min = X - 19000
x_step = 1200
y_min = Y - 6000 - (int(21/k)*k*1200 - 12000)/2
y_step = 1200 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []

for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 600 if i%2==0 else 0
    for j in range(0, 42):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(X, Y, x, y)
        if (distance_from_center <= 19501):
            lat, lon = utm.to_latlon(x, y, 33, 'U')
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

Let's take a look at the search areas we just created:

In [45]:
# visualize the search areas
berlin_map = folium.Map(location=[52.524, 13.410], zoom_start=11)
folium.Marker([52.524, 13.410], popup='City Center').add_to(berlin_map)

for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=700, color='blue', fill=False).add_to(berlin_map)

berlin_map

That look's good! The search areas overlap a little, so we do not have any 'blind spots'. We can now go ahead and request all second-hand stores within a 700 meters radius around the center of each search area, thus we make one API call per search area. 

In [46]:
# Due to a Daily Call Quota/Rate Limit, we first try to load our results of a earlier run
try:
    df_venues = pd.read_csv("df_venues.csv", index_col=0)
    print('Read venues from csv file.', df_venues.shape)

except:
    
    # list of responses
    results = []

    # request venues per search area:
    for lat, lon in zip(latitudes, longitudes):
        response = foursquare_search({
            'intent': 'browse',
            'll': f'{lat}, {lon}',
            'radius': 700,
            'categoryId': '4bf58dd8d48988d101951735' # Second-Hand & Vintage Stores
        })
        
        # add response to list of responses
        results.append(response.json()['response']['venues'])
    
    # get rid of the searches, that did not return any venue
    venues = []

    for item in results:
            if item == []:
                continue
            else:
                venues.append(item)

    # put it all together in a pandas dataframe
    id_foursquare, venue, latitude, longitude, address = [], [], [], [], []

    for item in venues:
        for i in range(len(item)):
            id_foursquare.append(item[i]['id'])
            venue.append(item[i]['name'])
            latitude.append(item[i]['location']['lat']) 
            longitude.append(item[i]['location']['lng'])
            address.append(format_address(item[i]['location']))

    df_venues = pd.DataFrame({
        'id_foursquare': id_foursquare, 
        'venue': venue, 
        'latitude': latitude, 
        'longitude': longitude, 
        'address': address
        })
    
    # drop all duplicates
    df_venues.drop_duplicates(subset='id_foursquare', inplace=True)
    
    # back up the dataframe
    df_venues.to_csv("df_venues.csv")
    print('Fetched venues from Foursquare Places API and updated local csv file. ', df_venues.shape[0], 'venues found.')
    

Read venues from csv file. (215, 5)


This number of second-hand stores is more plausible. There really seems to be an undocumented limitation to the bounding box search. We use the data gathered in the second attempt for further analysis. 

### Data Cleaning
As you can see above, we selectively extracted attributes from the json object to a pandas dataframe. Because each venue-id in the foursquare database is unique, we were able to use this column of our dataframe (id_foursquare) to drop all duplicates, that we had because the search areas overlapped. So, we now already have a nicely cleaned data set.

In [47]:
df_venues.head(5)

Unnamed: 0,id_foursquare,venue,latitude,longitude,address
0,5303982d498e53f0d90073dc,"""Rückenwind"" Möbelbörse",52.404697,13.2544,"Oderstraße 23-25, 14513 Teltow"
1,5617ced0498e04bb1f4546cd,Zwergpiraten,52.43785,13.204486,"Machnower Str., Berlin"
2,4e4d33c718a822288ddbf530,KIRSCHGRÜN,52.447884,13.305843,"HORTENSIENSTRASS 12B (ASTERNPLATZ), 12203 Berlin"
3,50d3311be4b07c38e83afa86,Zwergenparadies,52.445305,13.565679,Deutschland
4,51fa7bee498e000d45b4dba2,joleen,52.45565,13.319324,Deutschland


### Data Exploration

Now, we visualize the locations of the second-hand stores on a map of Berlin. The map is interactive. We can zoom in and out. The stores get clustered depending on the zoom level, so we can get an idea of possible hot spots. As a guidence for people who know the city we enrich the map with borderlines of the postal codes and districts of the city. We also include a layer control element in the map, so the postal codes and districts can be hidden by deselecting the corresponding layer. If we click a store marker it's name will be shown in a popup.

In [48]:
# generate a map of Berlin 
berlin_map2 = folium.Map(location=[df_venues['latitude'].mean(), df_venues['longitude'].mean()], zoom_start=11)

# create a marker cluster, to visualize where the density of 2nd-hand stores is high
mc = MarkerCluster(name='2nd-hand stores')

# use df.apply(_ ,axis=1) to iterate through every row in our dataframe and mark the locations of the venues
df_venues.apply(lambda row:
        mc.add_child(
               folium.Marker(
                    location=[row['latitude'], row['longitude']], 
                    icon=folium.Icon(color='darkblue', icon_color='white', icon='shopping-cart', angle=0, prefix='fa'),
                    popup=row['venue']
                    )
        ), axis=1)

berlin_map2.add_child(mc)

# include postal code areas in the map
def postal_codes_style(feature):
    """Style postal code areas in Folium map"""
    return  { 'color': 'crimson', 'fill': False }

berlin_zip_topo = requests.get('https://raw.githubusercontent.com/funkeinteraktiv/Berlin-Geodaten/master/berlin_postleitzahlen.topojson').json()
folium.TopoJson(
    berlin_zip_topo,
    'objects.berlin_postleitzahlen',
    style_function= postal_codes_style,
    name='Postal Codes'
).add_to(berlin_map2)

# load data to visualize districts
def districts_style(feature):
    """Style districts in Folium map"""
    return { 'color': 'blue', 'fill': False }

berlin_geo = requests.get('https://raw.githubusercontent.com/m-hoerz/berlin-shapes/master/berliner-bezirke.geojson').json()

folium.GeoJson(
    berlin_geo,
    style_function= districts_style,
    name='Districts'
).add_to(berlin_map2)

# add layer control to map
folium.LayerControl().add_to(berlin_map2)

# show map
berlin_map2

Based on our visual exploration, there seem to be at least two areas with a rather high density of second-hand stores in Berlin: In the district of 'Mitte' near 'U Weinmeisterstr.' and in the district of 'Neukölln' near 'Sonnenallee'.

### Modeling
To get robust results we should validate our findings by creating a maschine learning model that clusters the locations according to the requirements of the stakeholders. The clusters can be of arbitrary shape and we want to locate regions of high density that are separated from one another by regions of low density. That's why we should go for density-based clustering. Density, in this context, is defined as the number of stores within a specified radius. We decide to use Density-Based Spatial Clustering of Applications with Noise, in short: DBSCAN, because the alorithm can find arbitrarily shaped clusters, it has a notion of noise, and thus is robust to outliers.

Two parameters need to be set:
 + _epsilon_ determines a specified radius that if it includes enough number of points within, we call it dense area
 + _minimum samples (/minPts)_ determines the minimum number of data points within a cluster
 
These parameters will be set in consultation with the stakeholders, as it needs domain knowledge and a good understanding of the data, to specify them. In our case, epsilon is the distance a potential customer needs to walk from one store to another within same cluster. MinPts is the minimum number of stores to form a cluster. For this research we assume 275 meters to be a good distance between the points of a cluster and that a minimum of three points per cluster would be a good choice.

In [49]:
# extract coordinates from dataframe and cast them to an numpy array 
coords = df_venues.filter(['latitude', 'longitude']).to_numpy()

kms_per_radian = 6371.0088
epsilon = 0.275 / kms_per_radian

db = DBSCAN(
    eps=epsilon, 
    min_samples=3, # because we want other 2nd-hand stores nearby we set min_samples to >1
    algorithm='ball_tree', 
    metric='haversine'
    ).fit(np.radians(coords))

# add cluster per venue to dataframe
df_venues['cluster'] = db.labels_

# count generated clusters (noise is labeled with -1)
num_clusters = len(set(df_venues['cluster'])) - (1 if -1 in df_venues['cluster'] else 0)

# let's check on how much clusters we have now
print('Clustered {:,} locations down to {:,} clusters.'.format(df_venues.shape[0], num_clusters))

Clustered 215 locations down to 20 clusters.


We clustered the second-hand store locations by DBSCAN algorithm. Let's visualize the results, so we can gasp an idea of how well that worked out.

In [50]:
# generate map of berlin with a relatively dark theme
berlin_map3 = folium.Map(location=[52.524, 13.410], tiles='CartoDB dark_matter', zoom_start=12)

# dynamically generate list of colors, so any number of clusters can be visualized
df_venues['marker_color'] = pd.cut(df_venues['cluster'], bins=num_clusters, 
                              labels=['#'+''.join(random.choice('0123456789abcdef') for n in range(6)) for x in range(num_clusters)])

def add_clustered_venues(row): 
    """Function to add customized venues per cluster to map; funtion generated for easier use of pandas.DataFrame.apply()"""
    
    # make the noise (which is labeled -1) appear as smaller gray points
    if row['cluster'] == -1:
        folium.CircleMarker(
                    location=[row['latitude'], row['longitude']], 
                    radius=5,
                    color='grey',
                    fill=True,
                    fill_color='grey',
                    fill_opacity=0.6,
                    popup=folium.Popup('Cluster: '+str(row['cluster'])+'<br>'+str(row['venue'])),
                    ).add_to(berlin_map3)

    else:
        folium.CircleMarker(
                    location=[row['latitude'], row['longitude']], 
                    radius=10,
                    color=row['marker_color'],
                    fill=True,
                    fill_color=row['marker_color'],
                    fill_opacity=0.6,
                    popup=folium.Popup('Cluster: '+str(row['cluster'])+'<br>'+str(row['venue'])),
                    ).add_to(berlin_map3)

# iterate through dataframe and mark the stores in the color of their cluster
df_venues.apply(add_clustered_venues, axis=1)    

# show map
berlin_map3

The stores are well split into clusters that are clearly seperated and that are neither too small, nor too big.

Finally, we create a dataframe containing the id, the location (latitude, longitude and address) of the clusters and the amount of stores the cluster contains of. Then we view the location canditates on a map.

In [51]:
# create dataframe of clusters, excluding noise
df_clusters= df_venues[df_venues.cluster != -1].groupby(['cluster']).mean()
df_clusters['count'] = df_venues['cluster'].value_counts()

# add address of the center of each cluster to the dataframe
addresses = []

for index, row in df_clusters.iterrows():
    response = openstreetmap_reverse(row['latitude'], row['longitude'])
    addresses.append(str(response.json()['address'].get('road', '')) + ' ' + str(response.json()['address'].get('house_number', '')) + ', ' + str(response.json()['address'].get('postcode', '')) + ' ' + str(response.json()['address'].get('state', '')))

df_clusters['address'] = addresses

# sort clusters by size
df_clusters.sort_values(by='count', ascending=False, inplace=True)

# view clusters sorted by their size
display(df_clusters)

Unnamed: 0_level_0,latitude,longitude,count,address
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
14,52.526282,13.405085,18,"Steinstraße , 10119 Berlin"
16,52.540359,13.411406,17,"Kastanienallee 3, 10435 Berlin"
9,52.503323,13.308335,10,", 10629 Berlin"
2,52.487676,13.430408,10,"Weserstraße 12, 12047 Berlin"
17,52.547073,13.415942,10,"Stargarder Straße 7, 10437 Berlin"
1,52.490914,13.388463,9,"Bergmannstraße , 10961 Berlin"
12,52.509761,13.453445,8,"Kopernikusstraße , 10245 Berlin"
7,52.50019,13.430578,6,"Lausitzer Platz , 10997 Berlin"
15,52.533587,13.429032,5,"Marienburger Straße 27, 10405 Berlin"
6,52.492934,13.425214,5,"Hobrechtstraße 48, 12047 Berlin"


This concludes our analysis. We have created location candiates representing centers of zones containing locations with at least 3 second-hand stores nearby, at least one of them within 275 meters. Remember the clusters can be of any shape - their centers/addresses should be considered only as a starting point for exploring area neighborhoods in search for potential store locations. In the following map we will visualize our findings, the promissing locations, by circles.

In [52]:
# visualize location candidates
df_clusters.apply(lambda row:
                 folium.Circle(
                    location=[row['latitude'], row['longitude']], 
                    radius=275,
                    ).add_to(berlin_map3), axis=1)

berlin_map3

## Results & Discussion
We fetched data on existing second-hand stores in Berlin from the Foursquare Places API. After preparing the data, we visualized the locations of the stores in an interactive map, so we could manually explore the data visually and thus get a better understanding of the data set. To find locations that fulfill the requirements based on the domain knowledge of the stakeholders, we decided to group the existing store locations by a density-based clustering algorithm. Because the clusters were in arbitrary shapes and we wanted a algorithm that's fairly robust to outliers we decided to build a model based on DBSCAN.

That way we clustered 216 locations of existing second-hand stores down to 20 clusters, that have a size of at least 3 stores, each with at least one store of the same cluster within 275 meters distance. We then found the location (lat/lon) and address of each center of these clusters. This overview and and interactive map, that illustrates their distribution in Berlin, are the result of this analysis. Those areas won't necessarily be the optimal locations to open up a new second-hand store. They are a good starting point for more detailed analysis, that will take demographic information (area's population, income brackets, median age, ...) as well as accessibility, visibility and cost of the potential new locations into consideration. Also this further analysis should evaluate which, or which number of, other second-hand stores nearby will help the new business by drawing the same demographic of customers, and which, or which number of, other stores hurt the new business, because they are too much competition.

## Conclusion

Purpose of this project was to identify Berlin areas with multiple second-hand stores to aid stakeholders in narrowing down the search for optimal location for a new second-hand store. Based on the hypothesis that other second-hand stores nearby will help the new business by drawing the same demographic of customers and thus increase the walk-in traffic and sell figures, we found promissing locations that justify further analysis. We have done this by density-based clustering of location data gathered via Foursquare Places API. 

The collection of locations which satisfy some basic requirements regarding existing nearby stores, can be used as starting points for final exploration by stakeholders. The final decission on optimal store location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended area, after further analysis.