<h1> Coursera Capstone Final Project - The Battle of Neighbourhoods </h1>
<h3> for the IBM Data Science Professional Certificate </h3>

<h2> Introduction / Business Problem </h2>

<p> A family is moving to London and needs to find schools for its children. As a first step, the family wants to identify the locations in London with the most convenient access to nearby schools (and, to be picky, to as many of them as possible). The family also wants to explore the suggested neighbo<u>u</u>rhoods.</p>

<h2> Data </h2>

<p> This project will (i) obtain geographical information about London by parsing a Wikipedia page, using the names of the 33 boroughs of London to obtain the corresponding coordinates; (ii) search for schools near these 33 boroughs, specifically elementary schools, middle schools, and high schools, using the Foursquare API; (iii) analyse and clean the data; (iv) group the schools into k clusters with the k-means method and hence obtain the k centroids most accessible to nearby schools; and (v) explore the vicinity of the centroids to discover patterns and information. </p>

<h2> Methodology </h2>

<p> The <strong>geographical information</strong> about London, in particular the names of the boroughs, is obtained by parsing a <a href='https://en.wikipedia.org/wiki/List_of_London_boroughs'>Wikipedia page</a> using <strong>Beautiful Soup</strong>. The coordinates of the borough centres are then obtained using geopy.Nominatim. These coordinates are passed to the Foursquare API as the latitudes and longitudes to search around. </p>
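<p> As a rough illustration of the parsing step, the sketch below pulls borough names out of an HTML table with Beautiful Soup. It is not the project's actual code: the inline sample table is a stand-in for the Wikipedia page, the helper name borough_names is hypothetical, and it assumes the names sit in the first column of a table with class wikitable. </p>

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the real table has more rows and columns.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Borough</th><th>Notes</th></tr>
  <tr><td><a href="/wiki/Barnet">Barnet</a></td><td>x</td></tr>
  <tr><td><a href="/wiki/Brent">Brent</a> <sup>[note 1]</sup></td><td>x</td></tr>
</table>
"""


def borough_names(html):
    """Return the text of the first cell of every data row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    names = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # header rows contain <th> cells only
        link = cells[0].find("a")
        # Prefer the link text so footnote markers are not included.
        text = link.get_text() if link else cells[0].get_text()
        names.append(text.strip())
    return names


names = borough_names(SAMPLE_HTML)
print(names)
```

<p> Each name could then be geocoded with geopy, for example by calling geocode(name + ", London, UK") on a Nominatim(user_agent=...) instance and reading the latitude and longitude of the result. That step needs network access, so it is left out of the sketch. </p>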

<p> When using the <strong>Foursquare API</strong> to <strong>search venues</strong>, category IDs are used as the search basis to improve accuracy. A keyword search for 'elementary/middle/high schools' may miss institutions whose names do not contain exactly those keywords. Conversely, such a keyword search may return venues which are not schools at all. </p>
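<p> A category-based search request might be assembled as in the sketch below. It assumes the legacy Foursquare v2 venues/search endpoint with its ll, categoryId, limit and v parameters; the category IDs, credentials, version date and seed coordinates are placeholders chosen for illustration, not values taken from the project. </p>

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

# Placeholders, not real IDs: look the actual values up in Foursquare's category list.
CATEGORY_IDS = {
    "Elementary School": "ELEMENTARY_SCHOOL_CATEGORY_ID",
    "Middle School": "MIDDLE_SCHOOL_CATEGORY_ID",
    "High School": "HIGH_SCHOOL_CATEGORY_ID",
}


def build_search_url(lat, lng, category_id, client_id, client_secret,
                     version="20200101", limit=50):
    """Build a search URL that filters by category ID instead of a keyword."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,              # API version date, YYYYMMDD (assumed value)
        "ll": f"{lat},{lng}",      # seed location
        "categoryId": category_id,
        "limit": limit,            # at most 50 venues per query
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Illustrative seed near central London; credentials are placeholders.
url = build_search_url(51.5074, -0.1278, CATEGORY_IDS["High School"],
                       "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

<p> Keeping the URL construction separate from the network call makes it easy to inspect what will be sent for each borough and category before spending any of the query quota. </p>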

<p> Also, instead of searching for many venues outwards from central London, the coordinates of the 33 boroughs of London are used as seed locations. The reason for iterating over boroughs is that, if only one centre is used in the search query, the API may return results ordered from near to far and may be less accurate outside central London. The trade-off is that using 33 seed locations multiplies the number of queries. The estimated usage is 33 (boroughs) x 3 (search categories) = 99 queries, returning at most 99 x 50 = 4950 venues. </p>
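<p> The query budget can be checked with a few lines of Python. The seed names below are stand-ins for the borough coordinates, and the limit of 50 venues per query is the figure quoted in the text. </p>

```python
from itertools import product

N_BOROUGHS = 33
CATEGORIES = ["elementary school", "middle school", "high school"]
PER_QUERY_LIMIT = 50  # maximum venues returned by one query

# Stand-ins for the 33 seed locations.
seeds = [f"borough_{i}" for i in range(N_BOROUGHS)]

# One query per (seed location, category) pair.
queries = list(product(seeds, CATEGORIES))
n_queries = len(queries)
max_venues = n_queries * PER_QUERY_LIMIT

print(n_queries, max_venues)  # 99 4950
```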

<p> Search queries are sent to the Foursquare API, which returns a list of venues for each query. The lists are combined into one dataframe for data cleaning and analysis, including the removal of duplicated entries. </p>
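<p> Duplicates arise because neighbouring boroughs can return the same school. The dependency-free sketch below shows the idea, assuming each venue record carries a unique id field; the records and the helper name are made up for illustration. With a pandas dataframe, drop_duplicates on the id column would achieve the same result. </p>

```python
def combine_and_deduplicate(result_lists):
    """Merge per-query venue lists, keeping the first record seen for each id."""
    seen = set()
    unique = []
    for venues in result_lists:
        for venue in venues:
            if venue["id"] in seen:
                continue  # already returned by an earlier query
            seen.add(venue["id"])
            unique.append(venue)
    return unique


# Hypothetical results from two overlapping queries.
query_a = [{"id": "a1", "name": "School One"}, {"id": "b2", "name": "School Two"}]
query_b = [{"id": "b2", "name": "School Two"}, {"id": "c3", "name": "School Three"}]

combined = combine_and_deduplicate([query_a, query_b])
print(len(query_a) + len(query_b), "returned,", len(combined), "unique")
```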

<p> After cleaning the data, the <strong>k-means</strong> method is used to partition the latitude and longitude data points into k clusters, each with a centroid. The best k is estimated with the <strong>elbow method</strong> and the <strong>silhouette score</strong>. </p>
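<p> The selection of k can be sketched from scratch as below. This is not the project's code: it runs a plain Lloyd's-algorithm k-means on synthetic points around three made-up centres, uses a deterministic farthest-point initialisation (libraries typically use random or k-means++ starts), and treats latitude and longitude as flat Euclidean coordinates. The inertia per k is what the elbow method plots, and the silhouette score is computed from its textbook definition. </p>

```python
import math
import random


def farthest_point_init(points, k):
    """Start at points[0], then keep adding the point farthest from those chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels, inertia)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centroids[i])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))
    return centroids, labels, inertia


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic (lat, lng) points around three made-up centres.
rng = random.Random(0)
blob_centres = [(51.58, -0.23), (51.52, 0.05), (51.42, -0.10)]
points = [(lat + rng.uniform(-0.01, 0.01), lng + rng.uniform(-0.01, 0.01))
          for lat, lng in blob_centres for _ in range(30)]

inertias, silhouettes = {}, {}
for k in range(2, 7):
    _, labels, inertia = kmeans(points, k)
    inertias[k] = inertia                                # elbow method: look for the bend
    silhouettes[k] = silhouette_score(points, labels)    # higher is better

best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

<p> In practice a library implementation is the more convenient route; scikit-learn's KMeans exposes the same inertia through its inertia_ attribute, and sklearn.metrics provides silhouette_score. </p>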

<p> Based on the latitudes and longitudes of the centroids, the <strong>explore query</strong> of the <strong>Foursquare API</strong> is used. Not surprisingly, food premises are returned often, so the explore query results are analysed separately for the food and non-food portions. A recommendation is made based on this analysis.</p> 

<h2> Results </h2>

<p> Geographical information about London is obtained. </p>

<img src='parse_london.png'></img>

<p> The Foursquare API also returns search results (shown after data cleaning). Fewer than half of the theoretical maximum number of venues are returned (2278 of a possible 99 x 50 = 4950), and in fact only <strong>823 unique venues</strong> remain after duplicates are removed. </p>

<img src='sample_schools.png'></img>

<p> For the k-means clustering model, the <strong>elbow method</strong> and the <strong>silhouette score</strong> suggest that <strong>k = 3</strong> gives a locally best-fit clustering result. </p>

<img src='best_k_metric.png'></img>

<p> The corresponding scatter plot at k = 3 is as follows. The markers with a home icon are the centroids. The <strong>green, yellow, and red</strong> data points correspond to <strong>clusters 0, 1, and 2</strong>.</p>

<img src='plot_clustered_schools.png'></img>

<p> Using the explore query of the Foursquare API, a maximum of <strong>50 venues within 2km of each centroid</strong> are explored. The API may return some venues outside this vicinity if there are not many places within it that Foursquare users have found worth commenting on.</p>
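<p> Venues that fall outside the 2km vicinity could be filtered out afterwards with a great-circle distance check. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the cluster 0 centroid is the one reported in this project, while the two venue records are invented for illustration. </p>

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


centroid = (51.5754, -0.2254)  # cluster 0 centroid from this project

# Invented venue records, one close to the centroid and one well to the north.
venues = [
    {"name": "near venue", "lat": 51.5760, "lng": -0.2240},
    {"name": "far venue", "lat": 51.6100, "lng": -0.2254},
]

within_2km = [v["name"] for v in venues
              if haversine_km(centroid[0], centroid[1], v["lat"], v["lng"]) <= 2.0]
print(within_2km)
```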

<p> A venue is treated as a common food premises if its category contains one of the following keywords (non-exhaustive): Restaurant | Café | Coffee Shop | Pub | Bar | Pizza Place | Gastropub | Fish &amp; Chips Shop | Sandwich Place | Steakhouse. The following pivot tables separate the data into food (dotted border) and non-food (solid border). </p>
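<p> The keyword split and the per-cluster count of unique venue types might look like the sketch below. The sample venues are invented, the helper names are hypothetical, and matching keywords as whole words (so that "Bar" does not match "Barbershop") is an assumption rather than something stated in this report. </p>

```python
import re
from collections import defaultdict

FOOD_KEYWORDS = [
    "Restaurant", "Café", "Coffee Shop", "Pub", "Bar", "Pizza Place",
    "Gastropub", "Fish & Chips Shop", "Sandwich Place", "Steakhouse",
]

# Whole-word, case-insensitive match on any keyword.
FOOD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in FOOD_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_food(category):
    """True if the category name contains one of the food keywords."""
    return FOOD_PATTERN.search(category) is not None


def unique_categories_by_cluster(venues):
    """Map cluster -> {"food": set of categories, "non_food": set of categories}."""
    result = defaultdict(lambda: {"food": set(), "non_food": set()})
    for cluster, category in venues:
        key = "food" if is_food(category) else "non_food"
        result[cluster][key].add(category)
    return dict(result)


# Invented (cluster, category) pairs standing in for the explore results.
sample_venues = [
    (0, "Italian Restaurant"), (0, "Supermarket"), (0, "Barbershop"),
    (1, "Hotel"), (1, "Pub"),
    (2, "Wine Bar"), (2, "Theater"), (2, "Italian Restaurant"),
]

by_cluster = unique_categories_by_cluster(sample_venues)
counts = {c: (len(g["food"]), len(g["non_food"])) for c, g in by_cluster.items()}
print(counts)
```

<p> Counting set sizes per cluster gives the number of unique food and non-food venue types, which is the quantity compared across clusters in the discussion. </p>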

<img src='common_food.png' style='width:30; border: 5px dotted gray'></img>
<img src='non_food.png' style='width:30; border: 5px solid gray'></img>

<h2> Discussion </h2>

<p> The clustering gives 3 distinct regions: cluster 0 (green) in the northwestern part of London, cluster 1 (yellow) in the eastern part, and cluster 2 (red) in the southern part. </p>

<p> It can be seen that <strong>cluster 2</strong> has the <strong>most unique types of food premises (18)</strong> compared to clusters 0 (13) and 1 (11). On the other hand, <strong>cluster 0</strong> has the <strong>most unique types of non-food premises (20)</strong> compared to clusters 1 (17) and 2 (12). </p>

<p> A closer look at cluster 0 shows a variety of premises well suited to everyday living: for example, clothing stores, department stores, grocery stores, and supermarkets. These are seldom seen in clusters 1 and 2. </p>

<p> Cluster 1 is marked by an airport and a lot of hotels. </p>

<p> Cluster 2 has a large number of restaurants and theatres, which makes it good for entertainment. </p>

<h2> Conclusion </h2>

<p> For a family with children that wants easy access to schools, London is clustered into 3 regions. <strong>The family may wish to settle in cluster 0</strong> (centroid coordinates 51.5754, -0.2254). This centroid turns out to be at <strong>Brent Cross</strong>, where there is a <a href='https://en.wikipedia.org/wiki/Brent_Cross_Shopping_Centre'>large shopping centre</a>, which no doubt explains the number of stores Foursquare finds there. </p>