# Capstone Project - Business development in a growing city
   ### Applied Data Science Capstone by Matias Manavella

### Table of contents
    1 - Introduction
    2 - Data
    3 - Methodology
    4 - Results and Discussion
    5 - Conclusion

# Introduction: Business problem

In this project, we try to find an optimal location for a new food business in Rosario, Santa Fe, Argentina. Rosario is one of the largest and most important cities in Argentina, and has one of the most important ports in the country. It is a city in constant development with an ever-growing population, which gives rise to numerous startups in the entertainment and food sectors. The goal is to identify which district of the city is best suited to starting a business, and which kind of food venue we should open.

As inhabitants of the city, we know that the new business should be located in the central area, where citizens go to entertain themselves, meet with friends, party and eat. Therefore, the analysis will be restricted to 7 of the 18 districts of Rosario. 

# Data

Based on the problem to be solved in this project, we will mainly need the following information:
    1. Geographical detail and limits of each district in the city. 
    2. Geographical centroid of each district, used to get information about the nearby venues. 
    3. Amount and category of the venues in the area.

We will use the following data sources:
    1. City hall information about the districts (name, area, geoJson information)
    2. Foursquare API, to get the venues in the area.

In [1]:
# The code was removed by Watson Studio for sharing.

Libraries installed and imported!


***
First, we obtain the information on the city districts directly from the city hall web page. Using the HTML inspector, we extracted the link to download the CSV. We use the "read_csv" function from the pandas library to parse the information into a DataFrame. 
***

In [2]:
!wget -O latlon_data.csv https://datos.rosario.gob.ar/node/399/download
DistrictsRosario = pd.read_csv('latlon_data.csv')
print ('Data downloaded and saved into DataFrame!')

--2020-04-16 15:30:53--  https://datos.rosario.gob.ar/node/399/download
Resolving datos.rosario.gob.ar (datos.rosario.gob.ar)... 200.107.81.137
Connecting to datos.rosario.gob.ar (datos.rosario.gob.ar)|200.107.81.137|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://datos.rosario.gob.ar/sites/default/files/areas_barriales_json.csv [following]
--2020-04-16 15:30:54--  https://datos.rosario.gob.ar/sites/default/files/areas_barriales_json.csv
Reusing existing connection to datos.rosario.gob.ar:443.
HTTP request sent, awaiting response... 200 OK
Length: 29091 (28K) [text/csv]
Saving to: ‘latlon_data.csv’


2020-04-16 15:30:55 (168 KB/s) - ‘latlon_data.csv’ saved [29091/29091]

Data downloaded and saved into DataFrame!


***
Once the data is loaded into the DataFrame, we translate the headers to make them easier to understand, drop unnecessary columns, and end up with a table of districts and their respective GeoJSON data.
***

In [3]:
DistrictsRosario.rename(columns={"NOMBRE": "district", "DISTRITO": "area","DESCRIPCION":"name","ID_AREA_BA":"district_ID"},inplace=True)
DistrictsRosario.drop(['SE_ROW_ID'], axis=1,inplace=True)
print('Shape: ', DistrictsRosario.shape)
DistrictsRosario.head()

Shape:  (38, 6)


Unnamed: 0,GID,district,area,name,district_ID,GEOJSON
0,1,Área barrial 1,Centro,"Barrios Irigoyen, Magnano, Sindicato de la Car...",C1,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I..."
1,2,Área barrial 2,Centro,"Barrios Irigoyen, Magnano, Sindicato de la Car...",C2,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I..."
2,3,Área barrial 3,Centro,Barrios Centro y Pichincha.,C3,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I..."
3,4,Área barrial 4,Centro,Barrio República de la Sexta.,C4,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I..."
4,5,Área barrial 5,Centro,"Barrios Abasto, Corrientes, España y Hospitale...",C5,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I..."


***
Having the GeoJSON info spread across different rows made it hard to use the folium library efficiently, so we constructed a string which resembles a GeoJSON file. This data is accepted by the folium library, according to the official documentation.
***

In [4]:
tiles='{"type":"FeatureCollection","features":['

for tile in DistrictsRosario['GEOJSON']:
    tiles=tiles+tile+','
    
tiles = tiles[0:len(tiles)-1]
tiles = tiles +']}'
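
The same collection can also be assembled by parsing each row first, which additionally checks that every row holds valid JSON. The following is a minimal sketch and not the notebook's own method; `build_feature_collection` and the two sample features are hypothetical names and data introduced only for illustration.

```python
import json


def build_feature_collection(feature_strings):
    """Parse each GeoJSON Feature string and wrap the results in a FeatureCollection."""
    # json.loads raises ValueError if a row is not valid JSON
    features = [json.loads(text) for text in feature_strings]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample rows standing in for DistrictsRosario['GEOJSON']
sample_rows = [
    '{"type": "Feature", "properties": {"GID": 1}, '
    '"geometry": {"type": "Point", "coordinates": [-60.63, -32.95]}}',
    '{"type": "Feature", "properties": {"GID": 2}, '
    '"geometry": {"type": "Point", "coordinates": [-60.65, -32.95]}}',
]

collection = build_feature_collection(sample_rows)
# Serialized form, comparable to the hand-built `tiles` string
collection_text = json.dumps(collection)
```

Building a dictionary instead of concatenating text avoids having to trim the trailing comma by hand.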

***
We designed a color palette to make the choropleth map more attractive for the stakeholders reading the report. The values range between red and yellow.
***

In [5]:
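#Color palette: 39 evenly spaced shades from red (#ff0000) to yellow (#ffff00).
#Only the green channel changes. Stepping the packed 24-bit integer instead
#would spill into the blue channel and produce colors outside the red-yellow range.
n_colors = 39
colors = []
for i in range(n_colors):
    green = round(255 * i / (n_colors - 1))
    colors.append('#ff{:02x}00'.format(green))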

***
We use all this information to draw the tiles of each district over the map of Rosario. We used a black dashed line to separate each area.
***

In [6]:

Rosario_latlon=[-32.9521898,-60.7]
TestMap = folium.Map(location=Rosario_latlon,  width='70%', height='60%', zoom_start=11)

for tile in DistrictsRosario['GEOJSON']:
    folium.GeoJson(
        tile,
        style_function=lambda feature: {
        'fillColor': colors[int(feature['properties']['GID'])],
        'color': 'black',
        'weight': '2',
        'dashArray': '5, 5',
        'fillOpacity': 0.5,
        },
        name='geojson'
    ).add_to(TestMap)
    
# display map
TestMap



***
Another difficulty we encountered was selecting the geographical centroid of each district. To solve this, and taking advantage of the almost rectangular shape of the central districts, we used the latitude and longitude of every point in each district's GeoJSON outline and took the midpoint between the minimum and maximum values, that is, the center of the district's bounding box. From this point on we proceed only with the first 7 districts, which make up the central area of Rosario. 
*** 

In [7]:
latitudes=[]
longitudes=[]
#We take just the first 7 districts, which belong to the central area of Rosario.
#.copy() gives an independent DataFrame, so adding columns below does not
#trigger pandas' SettingWithCopyWarning.
DR=DistrictsRosario.head(7).copy()

for district in DR['GEOJSON']:
    data=ast.literal_eval(district)
    lats=[]
    lons=[]
    #GeoJSON stores each point as [longitude, latitude]
    for lon, lat in data['geometry']['coordinates'][0]:
        lats.append(lat)
        lons.append(lon)
    #Midpoint of the bounding box of the district's outline
    latitudes.append((max(lats)+min(lats))/2)
    longitudes.append((max(lons)+min(lons))/2)

DR['Lat']=latitudes
DR['Lon']=longitudes
DR


Unnamed: 0,GID,district,area,name,district_ID,GEOJSON,Lat,Lon
0,1,Área barrial 1,Centro,"Barrios Irigoyen, Magnano, Sindicato de la Car...",C1,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.953422,-60.629725
1,2,Área barrial 2,Centro,"Barrios Irigoyen, Magnano, Sindicato de la Car...",C2,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.949731,-60.647599
2,3,Área barrial 3,Centro,Barrios Centro y Pichincha.,C3,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.939439,-60.640969
3,4,Área barrial 4,Centro,Barrio República de la Sexta.,C4,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.964855,-60.630069
4,5,Área barrial 5,Centro,"Barrios Abasto, Corrientes, España y Hospitale...",C5,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.962478,-60.655775
5,6,Área barrial 6,Centro,"Barrios Echesortu, La República, Ntra Sra. de ...",C6,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.944616,-60.674336
6,7,Área barrial 7,Centro,"Barrios Pichincha, Luis Agote y Terminal.",C7,"{ ""type"": ""Feature"", ""properties"": { ""SE_ROW_I...",-32.936081,-60.669026
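
The bounding-box midpoint used for the Lat and Lon columns is a good estimate for nearly rectangular districts, but it can drift away from the true centroid for irregular shapes. The sketch below compares it with the area-weighted centroid given by the shoelace formula. It is an illustration under assumptions, not part of the original analysis: the function names and the L-shaped sample polygon are hypothetical, and coordinates are treated as planar, which is only reasonable for areas as small as a city district.

```python
def bbox_center(ring):
    """Midpoint of the bounding box of a ring of (lon, lat) points."""
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


def polygon_centroid(ring):
    """Area-weighted centroid of a closed ring (first point == last point)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area) = sum / (3 * twice_area)
    return (cx / (3 * twice_area), cy / (3 * twice_area))


# Hypothetical L-shaped outline, closed like a GeoJSON ring
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2), (0, 0)]

box_mid = bbox_center(l_shape)        # (1.0, 1.0)
centroid = polygon_centroid(l_shape)  # roughly (0.83, 0.83)
```

For the L-shaped sample the two estimates differ noticeably, while for a rectangle they coincide.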


In [8]:
# The code was removed by Watson Studio for sharing.

***
Finally, we used this information to draw markers on the map, each with a label identifying its district.
***

In [9]:
districts = folium.map.FeatureGroup()

for lat, lon, label in zip(DR.Lat, DR.Lon, DR.district_ID):
    districts.add_child(
        folium.CircleMarker(
            [lat, lon],
            radius=5, 
            popup=label,
            color='Blue',
            fill=True,
            fill_color='Yellow',
            fill_opacity=0.6
        )
    )
    
TestMap.add_child(districts)

***
# Methodology

In this section we direct our efforts to searching, identifying and filtering the venues in each district. We divided our analysis into the following steps:
    1. First, we obtained the information and geographical centroids of the districts, thus getting an idea of how the city is divided.
    2. Second, we obtain and organize the information about the venues in the selected areas, using the Foursquare API. 
    3. Third, we filter the venue information, keeping only the data about food-related venues. 
    4. Finally, we analyze the frequency of each kind of venue in each relevant area, and use this information for our startup.
***

***
# Analysis

First, we obtain data about all the venues, and their respective categories, that can be found in each district. 
***

In [10]:
# The code was removed by Watson Studio for sharing.

***
With this function, we extract only the part of the category information that is relevant to us: the name of the category the venue belongs to.
***

In [11]:
def GetCategorie(row):
    #Return the name of the venue's first listed category, or NaN if it has none.
    try:
        categorie = row['categories'][0]['name']
    except (KeyError, IndexError, TypeError):
        categorie = np.nan
    
    return categorie

In [12]:
#We create the DataFrame to store all the information together
Final_results = pd.DataFrame(columns = ['district','name', 'categories'])

#We start to iterate through each district
for district_lat, district_long, district in zip(DR['Lat'],DR['Lon'],DR['district_ID']):
    
    #We create the URL. The endpoint is venues/search, so we collect all the information, and request the info.
    url = f'https://api.foursquare.com/v2/venues/search?client_id={client_id}&client_secret={client_secret}&v={version}&ll={district_lat},{district_long}&radius={RADIUS}&limit={LIMIT}'
    result = requests.get(url).json()
    
    #Result is filtered to get only relevant info of venues
    venues = result['response']['venues']
    nearby_venues = json_normalize(venues)
    #Following list will be used to identify each venue's district.
    dist=[district]*nearby_venues.shape[0]
    
    #We store all the venues from the consulted lat, lon pair in the Results dataframe.
    Results = pd.DataFrame({'district': dist})
    nearby_venues['categories'] = nearby_venues.apply(GetCategorie,axis=1)
    Results = pd.concat([Results, nearby_venues[['name', 'categories']]], axis=1)
    #Finally, data from each iteration is stored in the Final_results dataframe
    Final_results = pd.concat([Final_results, Results], ignore_index = True)

print(Final_results.shape)
Final_results.head()


(661, 3)


Unnamed: 0,district,name,categories
0,C1,Freedom Coffee Lounge,Ice Cream Shop
1,C1,Alem Wellness Club,Gym / Fitness Center
2,C1,La Gallega,Supermarket
3,C1,Polleria Lo Brutto,Fried Chicken Joint
4,C1,Farmacia San Jorge,Pharmacy
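
The request URL in the loop above is assembled by hand with an f-string. As an alternative, the query string can be built from a dictionary so that every value is escaped consistently. This is only a sketch: `build_search_url` is a hypothetical helper, the credentials, version, radius and limit shown are placeholders rather than the values used in this notebook, and the parameter names are the ones that appear in the URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lon, client_id, client_secret, version, radius, limit):
    """Return the search URL for the venues around one (lat, lon) point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes reserved characters such as the comma in `ll`
    return f"{BASE_URL}?{urlencode(params)}"


# Placeholder values, for illustration only
example_url = build_search_url(
    -32.953422, -60.629725,
    client_id="YOUR_ID", client_secret="YOUR_SECRET",
    version="20200416", radius=500, limit=100,
)
# Decode the query string again to inspect what would be sent
example_query = parse_qs(urlsplit(example_url).query)
```

With the requests library, the same dictionary could instead be passed through the `params` argument of `requests.get`.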


In [13]:
#Work on a copy so that Final_results keeps the unfiltered venues
final=Final_results.copy()

***
Once we have our data, we delete all the information that is not relevant to this project. Using the categories from Foursquare, we can create a filter. We delete all the venues whose category does not contain any of the words ['bar', 'bakery', 'beer', 'cafe', 'café', 'coffee', 'food', 'ice cream', 'pizza', 'restaurant']. The comparison is case-insensitive and matches substrings, so a category such as 'Salon / Barbershop', which contains 'bar', also passes the filter.
***

In [14]:
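#Keywords that mark a category as food-related (compared in lower case)
keywords = ['bar', 'bakery', 'beer', 'cafe', 'café', 'coffee' ,'food', 'ice cream', 'pizza', 'restaurant']

def IsFoodRelated(categorie):
    #Venues without a category (NaN) are discarded
    if not isinstance(categorie, str):
        return False
    #Substring match: True if any keyword appears inside the category name
    return any(word in categorie.lower() for word in keywords)

#Keep only the food-related rows; the original index labels are preserved
final = final[final['categories'].apply(IsFoodRelated)]
print(final.shape)
final.head()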

(124, 3)


Unnamed: 0,district,name,categories
0,C1,Freedom Coffee Lounge,Ice Cream Shop
7,C1,Vittorio,Latin American Restaurant
8,C1,Raggio's,Salon / Barbershop
9,C1,Take It Easy,Tapas Restaurant
11,C1,Sablé París,Coffee Shop
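
The filtered table above still contains "Salon / Barbershop", because the keyword 'bar' is matched as a substring of "Barbershop". One possible refinement, sketched here under assumptions and not applied to the results in this report, is to match whole words only. `is_food_category` is a hypothetical helper, and 'cafeteria' is added to the keyword list because whole-word matching on 'cafe' would otherwise drop that category.

```python
import re

# Keyword list from the filter above, plus 'cafeteria' (an addition for this sketch)
KEYWORDS = ['bar', 'bakery', 'beer', 'cafe', 'café', 'cafeteria', 'coffee',
            'food', 'ice cream', 'pizza', 'restaurant']

# \b anchors each keyword at word boundaries, so 'bar' no longer matches 'Barbershop'
FOOD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_food_category(category):
    """True if the category name contains one of the keywords as a whole word."""
    return isinstance(category, str) and FOOD_PATTERN.search(category) is not None


sample_categories = ["Juice Bar", "Salon / Barbershop", "Café", "Cafeteria",
                     "Ice Cream Shop", "Supermarket"]
kept = [name for name in sample_categories if is_food_category(name)]
```

Applying a filter like this would change the row count and every frequency computed afterwards, so the tables in this report would need to be regenerated.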


***
Once we have the filtered list, we one-hot encode the categories (get the dummies). We keep the district each venue belongs to, so that we can calculate the frequency of each category by grouping the rows by district.
***

In [15]:
df2=pd.get_dummies(final['categories'])
#pd.set_option('display.max_columns', None)
df2['district']= final['district']
fixed_columns = [df2.columns[-1]] + list(df2.columns[:-1])
df2 = df2[fixed_columns]
df2.head(10)

Unnamed: 0,district,Argentinian Restaurant,Bakery,Bar,Beer Garden,Cafeteria,Café,Coffee Shop,Fast Food Restaurant,Fondue Restaurant,...,Latin American Restaurant,Mexican Restaurant,Pizza Place,Restaurant,Salon / Barbershop,South American Restaurant,Spanish Restaurant,Sushi Restaurant,Tapas Restaurant,Vegetarian / Vegan Restaurant
0,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,C1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
8,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
9,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
11,C1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
15,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
28,C1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29,C1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


***
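As said before, we group the rows of the table by district and take the mean of each category column in each group. Since the columns are 0/1 indicators, each mean is the share of the district's venues that fall in that category.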
***

In [16]:
df2_grouped = df2.groupby('district').mean().reset_index()
print(df2_grouped.shape)
df2_grouped.head(20)

(7, 29)


Unnamed: 0,district,Argentinian Restaurant,Bakery,Bar,Beer Garden,Cafeteria,Café,Coffee Shop,Fast Food Restaurant,Fondue Restaurant,...,Latin American Restaurant,Mexican Restaurant,Pizza Place,Restaurant,Salon / Barbershop,South American Restaurant,Spanish Restaurant,Sushi Restaurant,Tapas Restaurant,Vegetarian / Vegan Restaurant
0,C1,0.0,0.181818,0.045455,0.0,0.0,0.0,0.090909,0.0,0.0,...,0.045455,0.0,0.045455,0.045455,0.136364,0.0,0.0,0.045455,0.045455,0.045455
1,C2,0.058824,0.117647,0.058824,0.117647,0.0,0.0,0.0,0.058824,0.058824,...,0.058824,0.058824,0.0,0.0,0.117647,0.0,0.0,0.0,0.0,0.0
2,C3,0.076923,0.0,0.0,0.0,0.0,0.230769,0.076923,0.0,0.0,...,0.0,0.0,0.076923,0.230769,0.0,0.0,0.076923,0.0,0.0,0.076923
3,C4,0.142857,0.071429,0.071429,0.0,0.0,0.214286,0.0,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0
4,C5,0.181818,0.090909,0.045455,0.045455,0.0,0.0,0.045455,0.0,0.0,...,0.0,0.0,0.090909,0.045455,0.045455,0.045455,0.0,0.0,0.0,0.0
5,C6,0.0,0.333333,0.133333,0.0,0.0,0.066667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.066667,0.133333,0.0,0.0,0.0,0.0,0.0
6,C7,0.047619,0.095238,0.047619,0.0,0.047619,0.047619,0.142857,0.047619,0.0,...,0.0,0.047619,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0,0.0
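
Each value in the grouped table is a share of the district's filtered venues. For instance, 0.045455 in row C1 is consistent with 1 venue out of 22, and 0.181818 with 4 out of 22. The sketch below reproduces the same calculation with plain Python counters on a small made-up list of venues; `category_shares` and the sample data are hypothetical and only illustrate what the mean of the dummy columns represents.

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each district to {category: share of that district's venues}."""
    counts = defaultdict(Counter)
    for district, category in venues:
        counts[district][category] += 1

    shares = {}
    for district, counter in counts.items():
        total = sum(counter.values())
        shares[district] = {cat: n / total for cat, n in counter.items()}
    return shares


# Made-up (district, category) pairs, not taken from the Foursquare results
sample_venues = [
    ("C1", "Bakery"), ("C1", "Bakery"), ("C1", "Coffee Shop"), ("C1", "Ice Cream Shop"),
    ("C2", "Bar"), ("C2", "Bakery"),
]
shares = category_shares(sample_venues)
```

Because the values are shares rather than counts, a district with few venues can show a high frequency for a category that has only one or two venues.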


***
We finally have the data we need: which kinds of venue each district has, and how frequent each kind is. The final step is to analyze which are the most and least frequent venues in each area.

First, we see which 6 venue categories are the most frequent in each district. In the last cell we obtain the 12 least frequent. 
***

In [17]:
num_top_venues = 6
for district in df2_grouped['district']:
    print("----"+district+"----")
    temp = df2_grouped[df2_grouped['district'] == district].T.reset_index()
    
    temp.columns = ['venue','freq']
    
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----C1----
                venue  freq
0      Ice Cream Shop  0.23
1              Bakery  0.18
2  Salon / Barbershop  0.14
3         Coffee Shop  0.09
4          Food Truck  0.05
5    Tapas Restaurant  0.05


----C2----
                    venue  freq
0      Salon / Barbershop  0.12
1             Beer Garden  0.12
2                  Bakery  0.12
3  Argentinian Restaurant  0.06
4             Gaming Cafe  0.06
5      Mexican Restaurant  0.06


----C3----
                    venue  freq
0              Restaurant  0.23
1                    Café  0.23
2  Argentinian Restaurant  0.08
3       Food & Drink Shop  0.08
4      Spanish Restaurant  0.08
5             Pizza Place  0.08


----C4----
                    venue  freq
0                    Café  0.21
1  Argentinian Restaurant  0.14
2    Fast Food Restaurant  0.14
3      Salon / Barbershop  0.14
4       Food & Drink Shop  0.14
5                  Bakery  0.07


----C5----
                    venue  freq
0  Argentinian Restaurant  0.18
1    

In [18]:
num_top_venues = 12
for district in df2_grouped['district']:
    print("----"+district+"----")
    temp = df2_grouped[df2_grouped['district'] == district].T.reset_index()
    
    temp.columns = ['venue','freq']
    
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=True).reset_index(drop=True).head(num_top_venues))
    print('\n')

----C1----
                        venue  freq
0      Argentinian Restaurant   0.0
1          Spanish Restaurant   0.0
2   South American Restaurant   0.0
3          Mexican Restaurant   0.0
4                   Juice Bar   0.0
5          Italian Restaurant   0.0
6               Internet Cafe   0.0
7                 Gaming Cafe   0.0
8                  Food Court   0.0
9                   Hotel Bar   0.0
10          Fondue Restaurant   0.0
11       Fast Food Restaurant   0.0


----C2----
                            venue  freq
0   Vegetarian / Vegan Restaurant   0.0
1                Sushi Restaurant   0.0
2              Spanish Restaurant   0.0
3       South American Restaurant   0.0
4                       Cafeteria   0.0
5                            Café   0.0
6                     Coffee Shop   0.0
7                      Restaurant   0.0
8                     Pizza Place   0.0
9                      Food Court   0.0
10                     Food Truck   0.0
11               Tapas Resta

***
# Results and Discussion

We think the results obtained in this project are useful and within scope. We started from scratch, organizing the little information we had, and ended up with a clear view of how the city is divided, which venues are found in each area, and where the areas of opportunity lie in each district.

We can observe, for example, that ice cream shops are very popular in districts C1, C2, C3, C5 and C6, but that districts C4 and C7 lack them. Further analysis of this kind can be done with the data obtained. 
***

# Conclusion

The purpose of this project was to identify the areas of opportunity in each district of the central area of Rosario. By calculating the frequency of different food-related venues from Foursquare data, we identified the kinds of food business with little or no presence in each district.

Further analysis of each district, taking into consideration additional factors such as how attractive each area is or how many people pass through the district each day, may help the stakeholders make a final decision.