# IBM Advanced Data Science Project Report
Feel free to [contact me](https://www.linkedin.com/in/leonardo-iheme/) if you have any questions.  
September, 2019  
**By:**  
### Leonardo O. Iheme

# 1 Introduction

## 1.1 Background
In this report, I will outline the steps I took to discover similar and dissimilar coffee-neighborhoods in Istanbul locals' favorite districts.

Istanbul is one of the biggest and most populous cities in the world, and it spans two continents. The two parts of the city are divided by the Bosphorus strait. Two districts loved by Istanbul residents are Beşiktaş (be-shik-tash) and Kadıköy (ka-di-koy), on the European and Asian sides respectively. While these districts have a lot in common, they have their fair share of differences as well, the surge of coffee shops for one. In fact, according to [Foursquare](https://foursquare.com/top-places/istanbul/best-places-coffee), 8 of the 15 best coffee shops in Istanbul are located in Beşiktaş and Kadıköy.

There is a fierce debate among residents about which neighborhood is best for enjoying a cup of coffee. This report addresses the issue by providing insights drawn from data.  


## 1.2 Problem Statement
After tea, coffee is the next most consumed beverage in the country and that is for good reason. In modern Istanbul, to escape the tourist traps and mingle with locals, the best places to have a coffee are Beşiktaş and Kadıköy. As a visitor, finding a venue and neighborhood to have a coffee can be quite difficult given the wide range of choices and factors. 

This study will be of interest to both visitors to Istanbul and locals who have yet to discover the hidden similarities between the two most sought-after districts. The report will help readers to:
1. Be more familiar with the discussed neighborhoods
2. Understand the relationship between coffee shops and other neighborhood attributes
3. Discover the similarities between neighborhoods in terms of coffee shops and other attributes
4. Be able to make better-informed decisions about where to have coffee in Istanbul like a resident

# 2 Data

All the data used in this project were obtained from various sources on the internet. While some were ready to use, others had to be wrangled and cleaned. Data manipulation was performed with _Python_, using mostly the _Pandas_ library. The data was collected towards building the attributes of each neighborhood bearing in mind the factors that could affect a coffee experience. It is worth noting that the data may be limited in some cases because I used free tier accounts but it gives a good sense of proportions.
## 2.1 Population & Demographics Data
Since the region of interest (ROI) has been narrowed down to Beşiktaş and Kadıköy, I could crawl the web and scrape using _beautiful soup_ to obtain such information as: 
* The list of neighborhoods in each district,
* The population of each neighborhood, and
* The average price of residential rent at each neighborhood

## 2.2 Geographical Data
To locate the neighborhoods, I used the Google Maps API and the [OpenStreetMap project](https://www.openstreetmap.org). The information obtained from the respective sources is as follows:
* Google Maps: coordinates as longitude and latitude pairs
* OpenStreetMap: neighborhood boundaries as polygon coordinates, which were then converted to *geojson* files using an API provided by [geojson.io](http://geojson.io/)
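
As a rough illustration of what the conversion produces, the sketch below wraps a ring of boundary coordinates in a GeoJSON `Feature`. The function name `polygon_to_feature`, the `name` property, and the coordinates are placeholders rather than values from this project; the actual files were generated with geojson.io.

```python
import json

def polygon_to_feature(name, ring):
    """Wrap a list of (longitude, latitude) pairs in a GeoJSON Feature."""
    coords = [list(pt) for pt in ring]
    # GeoJSON requires a linear ring to be closed: first point == last point
    if coords[0] != coords[-1]:
        coords.append(list(coords[0]))
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [coords]},
    }

# Made-up boundary, for illustration only (longitude first, then latitude)
ring = [(29.00, 41.04), (29.01, 41.04), (29.01, 41.05), (29.00, 41.05)]
feature = polygon_to_feature("EXAMPLE MAH.", ring)
collection = {"type": "FeatureCollection", "features": [feature]}
geojson_text = json.dumps(collection, ensure_ascii=False)
```

Note that GeoJSON stores each position as longitude followed by latitude, the reverse of the order most geocoding results are read in.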

## 2.3 Location Data
The list of coffee shops was obtained by querying [Foursquare](https://foursquare.com/) through the API. As I use a free tier account, the results of my queries were limited in some cases.

The features which will be extracted from the data include the **number of coffee shops per neighborhood, the average distance from the center of the neighborhood to the seaside, the estimated number of people served by a coffee shop in each neighborhood, and the socioeconomic status of each neighborhood**. With the extracted features, exploratory as well as inferential data analysis will be carried out. Finally, the neighborhoods will be clustered using machine learning.

# 3 Methodology

In this section, the steps taken to obtain the results are described in detail. These include data acquisition techniques, data wrangling, exploratory data analysis, and feature extraction techniques.
## 3.1 Data Acquisition
### 3.1.1 Population and Demographics
I started by searching the internet for the list of neighborhoods and their respective postal codes. This information was easily obtained from  [bulurum.com](https://www.bulurum.com/en/), an online business directory which provides detailed, geolocated information for all kinds of businesses and professionals in all regions and cities in Turkey.
The list can be found on the district's pages and requested as follows:
```python
import requests

# Source of neighborhoods and postal codes
besiktas = requests.get(r"https://www.bulurum.com/en/post-codes/besiktas/istanbul/").text
kadikoy = requests.get(r"https://www.bulurum.com/en/post-codes/kadikoy/istanbul/").text
```
using the `requests` library. Then a `BeautifulSoup` instance was created using the following code snippet:
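```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup instance for each page, naming the parser explicitly
b_soup = BeautifulSoup(besiktas, "html.parser")
k_soup = BeautifulSoup(kadikoy, "html.parser")
```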
After inspecting the pages, I could extract and organize the information that I needed into a dataframe, as shown below:

  Idx | Post_codes |                                      Neighborhood 
:---:|:----------:|--------------------------------------------------:
 0 |    34022   |     ABBASAĞA MAH., CIHANNUMA MAH., SİNANPAŞA MAH. 
 1 |    34330   |                        KONAKLAR MAH., LEVENT MAH. 
 2 |    34335   |                                         AKAT MAH. 
 3 |    34337   |                                       ETİLER MAH. 
 4 |    34340   | KÜLTÜR MAH., LEVAZIM MAH., NİSBETİYE MAH., ULU... 
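
Each row above still groups several neighborhoods under one postal code. One way to get one row per neighborhood, sketched here with a two-row stand-in for the scraped table (the variable names are illustrative, and this may not be exactly how the original notebook did it), is to split the string and use `DataFrame.explode`:

```python
import pandas as pd

# Stand-in for the scraped table: one row per postal code
post_codes = pd.DataFrame({
    "Post_codes": ["34330", "34335"],
    "Neighborhood": ["KONAKLAR MAH., LEVENT MAH.", "AKAT MAH."],
})

# Split the comma-separated names, then give each name its own row
per_neighborhood = (
    post_codes
    .assign(Neighborhood=post_codes["Neighborhood"].str.split(", "))
    .explode("Neighborhood")
    .reset_index(drop=True)
)
```

The postal code is repeated for every neighborhood that shares it, so the result can be joined with per-neighborhood data later on.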

The population of each neighborhood was obtained from [endeksa.com](https://www.endeksa.com/en/), an up-to-date analytics website for real estate in Turkey.
### 3.1.2 Geography & Location
#### Geography
Using the Google Maps API, it was possible to query the latitudes and longitudes of the neighborhoods. The names of the neighborhoods and the postal codes were used as an approximate address in a `for` loop.
```python
# Geocoding the addresses
# `gmaps` is assumed to be an authenticated googlemaps.Client instance
latitude = []
longitude = []
for x in coffee_shops_population.itertuples():
    geocode_result = gmaps.geocode(f'{x.Neighborhood}, Istanbul')
    # Keep the coordinates of the first match returned
    location = geocode_result[0]['geometry']['location']
    latitude.append(location['lat'])
    longitude.append(location['lng'])

# Add the latitudes and longitudes to the dataframe
coffee_shops_population['Latitude'] = latitude
coffee_shops_population['Longitude'] = longitude
coffee_shops_population.head()
```
In addition to the coordinates obtained from the Google Maps API, I also got the amount of time it would take to walk from the neighborhood center to the nearest coast. I added this as a feature because in Istanbul, having a sea view is one of the factors that could affect one's coffee experience.

The neighborhoods to be examined are shown pinned on the [map](https://leonardoiheme.wixsite.com/ibmproject/neighborhoods) below

<figure class="video_container">
  <iframe width="100%" height="600px" name="htmlComp-iframe" scrolling="auto" sandbox="allow-same-origin allow-forms allow-popups allow-scripts allow-pointer-lock" src="https://leonardoiheme-wixsite-com.filesusr.com/html/d6f1dc_a4b91ebc367c3c4fe89f58cb5f350f75.html"></iframe>
</figure>

#### Location
Location data was obtained by querying the Foursquare database via the API. Of particular interest to this project were the cafes and coffee shops within a 2 km radius of each neighborhood's center. To download the data, the following information is required as input:
* Client ID
* Client secret
* Latitude
* Longitude
* Version
* Search radius
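
To make the inputs concrete, here is a minimal sketch of how they could be assembled into a request URL. It assumes the legacy Foursquare v2 `venues/search` endpoint and a free-text `query`; the endpoint choice, the helper name `build_search_url`, the default values, and the credentials are placeholders, since the exact call used in the notebook is not shown here.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng,
                     version="20190901", radius_m=2000, query="coffee"):
    """Assemble a venue-search URL from the inputs listed above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the neighborhood center
        "radius": radius_m,      # search radius in meters
        "query": query,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials and illustrative coordinates
url = build_search_url("MY_ID", "MY_SECRET", 41.0430, 29.0094)
```

The resulting URL could then be fetched with `requests.get(url).json()` for each neighborhood center.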

After downloading the data and cleaning it, the number of coffee shops in each neighborhood was obtained and is depicted in the following figure:
![alt text](coffee_shops_neighborhood.png "The number of coffee shops in each neighborhood")

The figure shows at a glance which neighborhoods have the most coffee shops.


Before moving on to exploratory data analysis, it is worth mentioning one more feature of the data. Since the population and number of coffee shops in each neighborhood are known, a rough estimate of the number of people served by each coffee shop can be obtained by dividing the population by the number of coffee shops. With `pandas`, this was easy to calculate:
```python
# Estimate number of people per coffee shop and add to the dataframe.
coffee_shops_population['PeoplePerCoffeeShop'] = coffee_shops_population['Population'] / coffee_shops_population['# Coffee shops']
```
The columns of the final dataframe are shown below:

|Idx|Neighborhood  |District|HouseRent(sqm)|Population|# Coffee shops|WalkToSeaside(min)|Latitude |Longitude|PeoplePerCoffeeShop|
|------|--------------|--------|--------------|----------|--------------|------------------|---------|---------|-------------------|

## 3.2 Exploratory Data Analysis

In this section, the data is examined in more detail. A brief comparison of the two districts is made to highlight how they contrast based on the different features. A basic statistical summary of the data is shown below:

|Stat |HouseRent(sqm)|Population  |# Coffee shops|WalkToSeaside(min)|Latitude |Longitude|PeoplePerCoffeeShop|
|-----|--------------|------------|--------------|------------------|---------|---------|-------------------|
|count|44.000000     |44.000000   |44.000000     |44.000000         |44.000000|44.000000|44.000000          |
|mean |22.909091     |14475.000000|12.659091     |22.318182         |41.026023|29.034685|1485.218322        |
|std  |5.116404      |10060.739575|6.647060      |14.806932         |0.042205 |0.026972 |1334.437329        |
|min  |15.000000     |2534.000000 |5.000000      |1.000000          |40.958317|28.992715|144.037037         |
|25%  |18.000000     |6209.500000 |8.000000      |10.000000         |40.988611|29.014604|510.093583         |
|50%  |23.000000     |11504.000000|9.500000      |20.000000         |41.045028|29.031286|960.270833         |
|75%  |27.000000     |19358.000000|17.000000     |32.750000         |41.062030|29.049266|1917.916667        |
|max  |35.000000     |35260.000000|27.000000     |60.000000         |41.093440|29.100420|5091.500000        |

If broken down further, the averages for each district can be observed as follows:

|District|HouseRent(sqm)|Population  |# Coffee shops|WalkToSeaside(min)|PeoplePerCoffeeShop|
|--------|--------------|------------|--------------|------------------|-------------------|
|Besiktas|26.739130     |8062.913043 |13.086957     |21.565217         |781.023908         |
|Kadikoy |18.714286     |21497.761905|12.190476     |23.142857         |2256.478872        |

##### Observations
* Even though Kadıköy has a much higher population than Beşiktaş, both districts have around the same average number of coffee shops per neighborhood. Beşiktaş therefore has more coffee shops relative to its population, which the number of people per coffee shop confirms: about 781 on average in Beşiktaş versus about 2,256 in Kadıköy.
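
District-level averages like those in the table can be produced with a `groupby`. The sketch below uses four made-up rows rather than the project data; only the column names are taken from the final dataframe.

```python
import pandas as pd

# Made-up rows for illustration; not the values used in this report
toy = pd.DataFrame({
    "District": ["Besiktas", "Besiktas", "Kadikoy", "Kadikoy"],
    "Population": [6000, 10000, 20000, 24000],
    "# Coffee shops": [12, 10, 10, 12],
})
toy["PeoplePerCoffeeShop"] = toy["Population"] / toy["# Coffee shops"]

# Average every numeric column within each district
district_means = toy.groupby("District").mean(numeric_only=True)
```

In this toy case both districts average the same number of coffee shops, so the district with fewer residents ends up with fewer people per coffee shop.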

### 3.2.1 Population
The distribution of the population of both districts is shown in the plot below.
![alt text](population_bar.png "The population of each district")
The population of Kadıköy is much higher than that of Beşiktaş.
When observed at the neighborhood level, the distribution of the population becomes clearer. To see this, a [choropleth map](https://leonardoiheme.wixsite.com/ibmproject/population) is created.

<figure class="video_container">
  <iframe width="100%" height="600px" name="htmlComp-iframe" scrolling="auto" sandbox="allow-same-origin allow-forms allow-popups allow-scripts allow-pointer-lock" src="https://leonardoiheme-wixsite-com.filesusr.com/html/d6f1dc_efc448e8403300d1f1c934b3098d0d3a.html"></iframe>
</figure>


### 3.2.2 Number of coffee shops
The number of coffee shops and how they are distributed between the districts is shown in the violin plot below.
![alt text](coffee_shops_violin.png "The distribution of coffee shops in each district")
##### Observations
* As seen from the table of averages, both districts have an almost equal number of coffee shops per neighborhood. We can also see that the variation in the number of coffee shops per neighborhood is almost equal in both districts as well.

A [choropleth map](https://leonardoiheme.wixsite.com/ibmproject) reveals more detail.

<figure class="video_container">
  <iframe width="100%" height="600px" name="htmlComp-iframe" scrolling="auto" sandbox="allow-same-origin allow-forms allow-popups allow-scripts allow-pointer-lock" src="https://leonardoiheme-wixsite-com.filesusr.com/html/d6f1dc_49fd03c3e444d6b22e9bd739e9c60eb8.html"></iframe>
</figure>

### 3.2.3 Price of rent
The distribution of the cost of renting in each neighborhood can be visualized with a violin plot.
![alt text](price_of_rent_violin.png "The distribution of cost of renting in both districts")
##### Observations
* It costs more to rent in Beşiktaş than in Kadıköy.
* The price of rent varies more across the neighborhoods in Kadıköy. This can be seen from the wide shape of the violin plot.

A [choropleth map](https://leonardoiheme.wixsite.com/ibmproject/price-of-rent) reflects the observations.

<figure class="video_container">
  <iframe width="100%" height="600px" name="htmlComp-iframe" scrolling="auto" sandbox="allow-same-origin allow-forms allow-popups allow-scripts allow-pointer-lock" src="https://leonardoiheme-wixsite-com.filesusr.com/html/d6f1dc_49becf2cf26923789bd4c46dad762209.html"></iframe>
</figure>

# 4 Results and Discussion

This section provides insights into the extracted features and how they compare. Some observations are also highlighted to aid understanding.
## 4.1 Regression
To observe how the number of coffee shops is related to other features, such as the price of housing, the average shortest time it takes to get to the coast from the centre, and the number of people served by each coffee shop, a regression plot is made.
![alt text](reg_plots.png "The relationship between the number of coffee shops and other features")
From the left,
* The first plot shows that the price of rent is likely to be higher if there are a lot of coffee shops in the area. In Beşiktaş, the reverse is the case.
* The second figure reveals that as the population of an area decreases, the number of coffee shops also decreases.
* We can also observe that the further away the neighborhood is from the seaside, the fewer coffee shops it is likely to have.
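
Assuming the plots show simple linear fits, each regression line is an ordinary least-squares fit. As a minimal sketch of what is being estimated, using made-up points rather than the project data (the helper name `fit_line` is illustrative), the slope and intercept can be computed directly:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (the common 1/n factor cancels out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up example: walking time to the coast (min) vs. number of coffee shops
walk_minutes = [5, 10, 20, 40]
coffee_shops = [24, 22, 18, 10]
slope, intercept = fit_line(walk_minutes, coffee_shops)
```

A negative slope, as in this toy example, corresponds to the pattern described in the last bullet: fewer coffee shops as the walk to the seaside gets longer.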

## 4.2 Interesting Neighborhoods (Outliers)
Some neighborhoods stand out in the regression plots. To take a closer look at these interesting neighborhoods, the scatter plot below is provided:
![alt text](neighborhood_coffee_scatter.png "Observing the interesting neighborhoods")
With respect to the cost of renting, the following observations can be made:
* **Fenerbahce Mah.** is the most expensive neighborhood in Kadıköy and it has the most coffee shops.
* Renting at **Egitim Mah.** is relatively cheap and there are a lot of coffee shops in the area.
* **Ulus Mah.** in Beşiktaş has a few coffee shops but a high rental price.
* In general, Kadıköy seems to be a more diverse district than Beşiktaş.

## 4.3 Clustering Analysis (K-Means Clustering)
To detect similar neighborhoods based on the extracted features, an unsupervised machine learning model is trained. Specifically, K-means clustering is used to divide all the neighborhoods into five groups. The reason for choosing five is this: imagine that, as a visitor, you have five days to spend in the city and you want a totally different coffee experience (like a local) every day. The algorithm then offers an objective guide to selecting five neighborhoods in Beşiktaş and Kadıköy that will deliver different experiences.  
The following code shows how K-means clustering was run. With the `scikit-learn` library, machine learning algorithms can be run in just a few lines:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# set number of clusters
kclusters = 5

# Drop the identifier and coordinate columns, which are not used as features
clustering = coffee_shops_population.drop(['Neighborhood', 'District', 'Latitude', 'Longitude'], axis=1)

# Normalizing over the standard deviation
# Note: [:, 1:] skips the first remaining column
X = clustering.values[:, 1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Clus_dataSet)

# check cluster labels generated for each row in the dataframe
kmeans.labels_
```
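
The scaling step matters because K-means relies on Euclidean distance, so a feature with large values, such as population, would otherwise dominate the result. `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. A minimal pure-Python sketch of the same transformation for a single column, with made-up values and an illustrative helper name:

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up populations for illustration
population = [2500.0, 11500.0, 19000.0, 35000.0]
scaled = standardize(population)
```

After scaling, every feature contributes on a comparable scale to the distances that K-means minimizes.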
The map below shows the neighborhoods color-coded by cluster. Similar *coffee* neighborhoods have the same color.

<figure class="video_container">
  <iframe width="100%" height="600px" name="htmlComp-iframe" scrolling="auto" sandbox="allow-same-origin allow-forms allow-popups allow-scripts allow-pointer-lock" src="https://leonardoiheme-wixsite-com.filesusr.com/html/d6f1dc_b9cdb4cd6b3bc477a7380225601011e9.html"></iframe>
</figure>

The following table shows the neighborhoods in their respective clusters, one cluster per pair of columns:


|Neighborhood      |District|*_*|Neighborhood   |District|*_*|Neighborhood   |District|*_*|Neighborhood   |District|*_*|Neighborhood  |District|
|------------------|--------|---|---------------|--------|---|---------------|--------|---|---------------|--------|---|--------------|--------|
|19 MAYIS MAH.     |Kadikoy |* *|ABBASAGA MAH.  |Besiktas|* *|AKATLAR MAH.   |Besiktas|* *|CAFERAGA MAH.  |Kadikoy |* *|BALMUMCU MAH. |Besiktas|
|ACIBADEM MAH.     |Kadikoy |* *|ARNAVUTKOY MAH.|Besiktas|* *|DIKILITAS MAH. |Besiktas|* *|FENERBAHCE MAH.|Kadikoy |* *|BEBEK MAH.    |Besiktas|
|BOSTANCI MAH.     |Kadikoy |* *|ETILER MAH.    |Besiktas|* *|DUMLUPINAR MAH.|Kadikoy |* *|SUADIYE MAH.   |Kadikoy |* *|CIHANNUMA MAH.|Besiktas|
|CADDEBOSTAN MAH.  |Kadikoy |* *|KULTUR MAH.    |Besiktas|* *|FIKIRTEPE MAH. |Kadikoy |* *|* *            |* *     |* *|EGİTIM MAH.   |Kadikoy |
|ERENKOY MAH.      |Kadikoy |* *|KURUCESME MAH. |Besiktas|* *|GAYRETTEPE MAH.|Besiktas|* *|* *            |* *     |* *|MECIDIYE MAH. |Besiktas|
|FENERYOLU MAH.    |Kadikoy |* *|MURADIYE MAH.  |Besiktas|* *|HASANPASA MAH. |Kadikoy |* *|* *            |* *     |* *|ORTAKOY MAH.  |Besiktas|
|GOZTEPE MAH.      |Kadikoy |* *|OSMANAGA MAH.  |Kadikoy |* *|KONAKLAR MAH.  |Besiktas|* *|* *            |* *     |* *|YILDIZ MAH.   |Besiktas|
|KOZYATAGI MAH.    |Kadikoy |* *|RASIMPASA MAH. |Kadikoy |* *|KOSUYOLU MAH.  |Kadikoy |* *|* *            |* *     |* *|* *           |* *     |
|MERDIVENKOY MAH.  |Kadikoy |* *|SINANPASA MAH. |Besiktas|* *|LEVAZIM MAH.   |Besiktas|* *|* *            |* *     |* *|* *           |* *     |
|SAHRAYI CEDIT MAH.|Kadikoy |* *|TURKALI MAH.   |Besiktas|* *|LEVENT MAH.    |Besiktas|* *|* *            |* *     |* *|* *           |* *     |
|* *               |* *     |* *|VISNEZADE MAH. |Besiktas|* *|NISBETIYE MAH. |Besiktas|* *|* *            |* *     |* *|* *           |* *     |
|* *               |* *     |* *|ZUHTUPASA MAH. |Kadikoy |* *|ULUS MAH.      |Besiktas|* *|* *            |* *     |* *|* *           |* *     |


# 5 Discussion
The preceding sections of the report have shown how complex the neighborhoods are. Specifically, the exploratory data analysis reveals some unforeseen traits, such as the variation found in Kadıköy in contrast to Beşiktaş. In particular, the cost of living varies: while Beşiktaş is a more expensive district than Kadıköy, rents in Kadıköy are more spread out.

Furthermore, the likelihood of finding a wider range of coffee is higher in Beşiktaş than in Kadıköy since there are more coffee shops in the neighborhoods of Beşiktaş. In Kadıköy, the distance of the neighborhoods from the coast seems to be negatively correlated with the number of coffee shops and positively correlated with the cost of rent.

The clustering analysis revealed the similar neighborhoods. For instance, according to the results of clustering, neighborhoods in clusters three and four have the most coffee shops, as shown in the violin plot below
![alt text](cluster_coffee.png "Clustering result: number of coffee shops")
and neighborhoods in cluster one have the lowest rent
![alt text](cluster_housing.png "Clustering result: rent")
In addition, those in cluster three have the least crowded coffee shops while those in cluster one have the most crowded coffee shops.
![alt text](cluster_crowd.png "Clustering result: people per coffee shop")
Several recommendations can be made from the findings of this study.

# 6 Conclusion and Future Work
In this report, I have outlined the findings of both exploratory data analysis and inferential data analysis. Using several data gathering and wrangling techniques, features of two of the Istanbul districts most loved by locals were analyzed. Furthermore, machine learning was applied to find the neighborhoods that were most similar. The aim of the study was not to say which neighborhood was better but to serve as a guide to those seeking a specific coffee experience. I leave it to the reader to decide where and how to have coffee given the foregoing analysis and their preference.

There are many avenues for improving the work that has been done. One would be improving the quality of the data by acquiring it from sources that do not limit the amount that can be collected. With such data, one will have the means to perform more sophisticated and accurate analysis. With information such as customer reviews, sentiment analysis can be carried out.