# Final capstone project

## Tehran venues analysis

## 1. Introduction

### 1.1 Background

Tehran is a very large city, both in its number of districts and in its population. It has 22 districts and a population of about 12 million (2018), which makes it the second most populous city in the Middle East after Istanbul. A city of this size has many neighborhoods and many venues.

### 1.2 Problem

Unfortunately, a big city like Tehran lacks a good analysis of its venues. For example, a tourist has no easy way to find the best places in each neighborhood. In this project we will cluster the neighborhoods in the 22 districts of Tehran based on the top 10 venues in each neighborhood.
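As a rough illustration of what "top venues per neighborhood" means in practice, the sketch below ranks venue categories by their share within each neighborhood. The toy rows, the column names and the helper `top_categories` are hypothetical and only show the shape of the computation, not the project's actual data.

```python
import pandas as pd

# Hypothetical toy data: one row per venue found in a neighborhood
venues = pd.DataFrame({
    "Neighborhood": ["Tajrish", "Tajrish", "Tajrish", "Vanak", "Vanak"],
    "Venue Category": ["Cafe", "Cafe", "Bazaar", "Restaurant", "Cafe"],
})

# Share of each category within each neighborhood (every row sums to 1)
freq = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")

def top_categories(row, n=10):
    """Return up to n category names with a non-zero share, most frequent first."""
    ranked = row[row > 0].sort_values(ascending=False)
    return list(ranked.index[:n])

# Map each neighborhood to its ranked list of categories
top = {name: top_categories(row) for name, row in freq.iterrows()}
print(top)
```

A table like `freq`, with one row per neighborhood and one column per category, is the kind of numeric input a clustering algorithm such as k-means can work on.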

### 1.3 Interest

The output of this project can be very helpful for tourists or anyone else who wants to find the best venues in Tehran by neighborhood.

## 2. Data acquisition

Surprisingly, Tehran doesn’t have a clean database for data scientists, so you have to collect almost everything you need yourself. My first step is to collect the neighborhood names from a Wikipedia list and put each one in a row of my DataFrame. Next, I collect the coordinates of each neighborhood with the Geopy library so that I can plot the data on a map. As the final step, I get the venue data for each neighborhood from the Foursquare API, which has a surprisingly valuable database for the city of Tehran.
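The venue lookup in the final step can be sketched as building one request URL per neighborhood. This is a minimal sketch that assumes the legacy Foursquare v2 `venues/explore` endpoint. The helper `build_explore_url`, the version date, the radius and the limit are illustrative assumptions, and the credentials are placeholders. Check the current Foursquare documentation before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; verify against the current Foursquare documentation
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lon, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a venue search URL around one coordinate (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, assumed value
        "ll": f"{lat},{lon}",    # latitude,longitude of the neighborhood
        "radius": radius,        # search radius in meters, assumed value
        "limit": limit,          # maximum number of venues, assumed value
    }
    # Keep the comma in "ll" readable instead of percent-encoding it
    return f"{EXPLORE_URL}?{urlencode(params, safe=',')}"

url = build_explore_url(35.8078, 51.4836, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The resulting URL could then be fetched with `requests.get(url).json()`, once per neighborhood.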

## 3. Data analysis explanation

To start, we import the packages we need:

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions import it from pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium
import folium

print('Libraries imported.')

Libraries imported.


### 3.1 Data gathering

Now we can start gathering data. As explained above, there is no CSV file to load into a dataframe and wrangle, so we create our dataframe directly with the pandas library. The dataframe we want consists of the columns <i> District, neighborhood, latitude, longitude </i>. For the first two columns, <b>District</b> and <b>neighborhood</b>, we take what we need from <a href="https://en.wikipedia.org/wiki/Template:Main_neighborhoods_of_Tehran">here</a> and assign each neighborhood to its district. Since the data on this page is in Farsi and is not laid out as a table, data scraping tools can't help us retrieve it, so we enter it by hand.

In [72]:
dg = [
    ('North', 'Aghdasieh'),
    ('North', 'Lavizan'),
    ('North', 'Ajodanieh'),
    ('North', 'Darakeh'),
    ('North', 'Darband'),
    ('North', 'Darus'),
    ('North', 'Mirdamad'),
    ('North', 'Doulat'),
    ('North', 'Ekhtiarieh'),
    ('North', 'Elahieh'),
    ('North', 'Farmanieh'),
    ('North', 'Gheytarieh'),
    ('North', 'Gholhak'),
    ('North', 'Jamaran'),
    ('North', 'Jordan'),
    ('North', 'Kamranieh'),
    ('North', 'Mahmoodieh'),
    ('North', 'Mehran'),
    ('North', 'Niavaran'),
    ('North', 'Pasdaran'),
    ('North', 'Shemiran'),
    ('North', 'Tajrish'),
    ('North', 'Vanak'),
    ('North', 'Valiasr'),
    ('North', 'Velenjak'),
    ('North', 'Zafaraniyeh'),
    ('West', 'Ekbatan'),
    ('West', 'Shahrak Apadana'),
    ('West', 'Bagh Feyz'),
    ('West', 'Farahzad'),
    ('West', 'Gisha'),
    ('West', 'Jannat Abad'),
    ('West', 'Punak'),
    ('West', "Sa'adat Abad"),
    ('West', 'Sadeghiyeh'),
    ('West', 'Shahrak Gharb'),
    ('West', 'Shahran'),
    ('West', 'Shahrara'),
    ('West', 'Shahr-e Ziba'),
    ('West', 'Tarasht'),
    ('West', 'Tehransar'),
    ('Central', 'Abbas Abad'),
    ('Central', 'Amir Abad'),
    ('Central', 'Baharestan'),
    ('Central', 'Enghelab Street'),
    ('Central', 'Bazar'),
    ('Central', 'Hasan Abad'),
    ('Central', 'Jomhuri'),
    ('Central', 'Keshavarz Boulevard'),
    ('Central', 'Park-e Shahr'),
    ('Central', 'Seyed Khandan'),
    ('Central', 'Toopkhaneh'),
    ('Central', 'Tehran No'),
    ('East', 'Afsariyeh'),
    ('East', 'Lavizan'),
    ('East', 'Narmak'),
    ('East', 'Tehranpars'),
    ('East', 'Tehranno'),
    ('East', 'Piroozi'),
    ('South', 'Gomrok'),
    ('South', 'Javadiyeh'),
    ('South', 'Khavaran'),
    ('South', 'Navvab'),
    ('South', 'Nazi Abad'),
    ('South', 'Rey'),
    ('South', 'Yaft Abad')
]
# Use a list for the column names: a set has no defined order
data = pd.DataFrame(dg, columns=['City side', 'Neighborhood'])

In [73]:
data.head()
print('Tehran has {} unique neighborhoods.'.format(len(data['Neighborhood'].unique())))

Tehran has 65 unique neighborhoods.
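The list above has 66 rows but only 65 unique names, because Lavizan is listed under both North and East. A quick way to spot such repeats is sketched below on a small subset of the pairs. The variable names `pairs`, `df` and `repeated` are illustrative.

```python
import pandas as pd

# A small subset of the (city side, neighborhood) pairs from the list above
pairs = [('North', 'Lavizan'), ('North', 'Vanak'), ('East', 'Lavizan'), ('East', 'Narmak')]
df = pd.DataFrame(pairs, columns=['City side', 'Neighborhood'])

# keep=False marks every occurrence of a repeated name, not only the later ones
repeated = df[df['Neighborhood'].duplicated(keep=False)]

print(len(df), 'rows,', df['Neighborhood'].nunique(), 'unique names')
print(repeated)
```

Whether a repeated name should stay under both city sides or be merged is a modeling decision. The check only makes the overlap visible.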


### Now let's find the latitude and longitude of each neighborhood using geopy.

In [74]:
lan, lon = [], []
# Create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="ir_explorer")
for neighborhood in data['Neighborhood']:
    address = str(neighborhood) + ', Tehran'

    location = geolocator.geocode(address)
    if location is None:
        print('No location for ' + address)
        # Placeholder markers that are replaced with manual coordinates later
        lan.append(neighborhood+'Q')
        lon.append(neighborhood+'E')
        continue
    latitude = location.latitude
    longitude = location.longitude

    lan.append(latitude)
    lon.append(longitude)

    print('The geographical coordinate of {} are {}, {}.'.format(address,latitude, longitude))

data['latitude'] = lan
data['longitude'] = lon

The geographical coordinate of Aghdasieh, Tehran are 35.5938365, 51.4444056.
The geographical coordinate of Lavizan, Tehran are 35.7770552, 51.5021499.
No location for Ajodanieh, Tehran
The geographical coordinate of Darakeh, Tehran are 35.8043457, 51.3827097.
The geographical coordinate of Darband, Tehran are 35.8135171, 51.429482.
The geographical coordinate of Darus, Tehran are 35.7727353, 51.4566868.
The geographical coordinate of Mirdamad, Tehran are 35.759387, 51.43502.
The geographical coordinate of Doulat, Tehran are 35.7046285, 51.3607125.
The geographical coordinate of Ekhtiarieh, Tehran are 35.7863456, 51.4611775.
The geographical coordinate of Elahieh, Tehran are 35.7898612, 51.4262023.
The geographical coordinate of Farmanieh, Tehran are 35.80325, 51.4594829.
The geographical coordinate of Gheytarieh, Tehran are 35.79030505, 51.4449767570462.
The geographical coordinate of Gholhak, Tehran are 35.7731769, 51.4442124.
The geographical coordinate of Jamaran, Tehran are 35.8198534, 51.4583
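The public Nominatim service asks clients to stay at or below about one request per second, so it can help to pause between lookups. The sketch below is one possible way to do that. The function `geocode_neighborhoods` and the stand-in `fake_geocode` are hypothetical names introduced for illustration, and the function accepts any geocoding callable so that it can be tried without network access.

```python
import time
from types import SimpleNamespace

def geocode_neighborhoods(names, geocode, city="Tehran", delay_seconds=1.0):
    """Map each name to (latitude, longitude), or None when nothing is found.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude, or None.
    """
    coords = {}
    for name in names:
        location = geocode(f"{name}, {city}")
        coords[name] = None if location is None else (location.latitude, location.longitude)
        time.sleep(delay_seconds)  # pause between requests
    return coords

# Stand-in geocoder for illustration only; it knows a single address
def fake_geocode(address):
    known = {"Darband, Tehran": SimpleNamespace(latitude=35.8135171, longitude=51.429482)}
    return known.get(address)

result = geocode_neighborhoods(["Darband", "Ajodanieh"], fake_geocode, delay_seconds=0)
print(result)
```

With geopy, the real callable would be `Nominatim(user_agent="ir_explorer").geocode`. Names that come back as `None` can then be filled in by hand.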

As you can see, geopy cannot find coordinates for some locations, because some neighborhoods are listed under old names. We therefore set the coordinates for those locations explicitly by replacing the placeholder values in our dataframe.

In [76]:
data.replace(to_replace='AjodaniehQ', value=35.8078, inplace=True)
data.replace(to_replace='AjodaniehE', value=51.4836, inplace=True)
data.replace(to_replace='ToopkhanehQ', value=35.68578, inplace=True)
data.replace(to_replace='ToopkhanehE', value=51.42018, inplace=True)
data.replace(to_replace='TehrannoQ', value=35.70979, inplace=True)
data.replace(to_replace='TehrannoE', value=51.49646, inplace=True)
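An alternative to string placeholders, sketched here under the assumption that failed lookups are stored as `NaN`, is to keep the manual coordinates in a dictionary and fill the gaps from it. This keeps both coordinate columns numeric. The names `df`, `manual`, `fallback_lat` and `fallback_lon` are illustrative, and the coordinate values are the ones used in the replacements above.

```python
import numpy as np
import pandas as pd

# Rows where geocoding failed hold NaN instead of a placeholder string
df = pd.DataFrame({
    'Neighborhood': ['Aghdasieh', 'Ajodanieh', 'Toopkhaneh'],
    'latitude': [35.5938365, np.nan, np.nan],
    'longitude': [51.4444056, np.nan, np.nan],
})

# Manual coordinates, keyed by neighborhood name
manual = {
    'Ajodanieh': (35.8078, 51.4836),
    'Toopkhaneh': (35.68578, 51.42018),
}

# Look up a fallback per row; names without a manual entry stay NaN
fallback_lat = df['Neighborhood'].map(lambda n: manual.get(n, (np.nan, np.nan))[0])
fallback_lon = df['Neighborhood'].map(lambda n: manual.get(n, (np.nan, np.nan))[1])

# fillna only touches the missing cells, so geocoded values are kept
df['latitude'] = df['latitude'].fillna(fallback_lat)
df['longitude'] = df['longitude'].fillna(fallback_lon)
print(df)
```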

In [77]:
data

| | City side | Neighborhood | latitude | longitude |
|---|---|---|---|---|
| 0 | North | Aghdasieh | 35.593837 | 51.444406 |
| 1 | North | Lavizan | 35.777055 | 51.50215 |
| 2 | North | Ajodanieh | 35.8078 | 51.4836 |
| 3 | North | Darakeh | 35.804346 | 51.38271 |
| 4 | North | Darband | 35.813517 | 51.429482 |
| 5 | North | Darus | 35.772735 | 51.456687 |
| 6 | North | Mirdamad | 35.759387 | 51.43502 |
| 7 | North | Doulat | 35.704628 | 51.360712 |
| 8 | North | Ekhtiarieh | 35.786346 | 51.461177 |
| 9 | North | Elahieh | 35.789861 | 51.426202 |
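Geocoders sometimes match a different place with a similar name, so a quick plausibility check on the coordinates can be worthwhile. In the rows above, Aghdasieh has a latitude of about 35.59 while the other North neighborhoods sit near 35.8. The sketch below flags rows whose latitude is far from the median of their city side. The threshold `MAX_DEGREES` is an assumed value, not something taken from the project.

```python
import pandas as pd

# First rows of the table above
df = pd.DataFrame({
    'City side': ['North'] * 5,
    'Neighborhood': ['Aghdasieh', 'Lavizan', 'Ajodanieh', 'Darakeh', 'Darband'],
    'latitude': [35.593837, 35.777055, 35.8078, 35.804346, 35.813517],
    'longitude': [51.444406, 51.50215, 51.4836, 51.38271, 51.429482],
})

MAX_DEGREES = 0.1  # assumed threshold, roughly 11 km of latitude

# Median latitude of each city side, repeated for every row of that side
median_lat = df.groupby('City side')['latitude'].transform('median')

# Rows that lie further from their group's median than the threshold
suspect = df[(df['latitude'] - median_lat).abs() > MAX_DEGREES]
print(suspect)
```

A flagged row is only a hint that the geocoder result deserves a manual look. It does not prove the coordinate is wrong.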
