
**The Battle of Neighborhoods**

by [Phillip Sinothi Moyo](https://www.linkedin.com/in/phyllsmoyo/)

# **Introduction**

Education matters for everyone. It is the process of acquiring knowledge, values, skills, beliefs, and moral habits. In any country, education helps people become better citizens, find better-paid jobs, and tell right from wrong. It shows us the importance of hard work and, at the same time, helps us grow and develop. As a result, we can shape a better society by knowing and respecting rights, laws, and regulations. People without an education often struggle, because education teaches us how to think, how to work properly, and how to make decisions. The better the education, the more choices and opportunities one is likely to have in life.

**Importance of Education**

A well-known saying holds that "education is the key to success". Education is important for the holistic development of a child: it addresses social, emotional, cognitive, and physical needs and builds a solid, broad foundation for lifelong learning and wellbeing.


**Business Problem**

The objective of this project is to analyze the universities and colleges of selected provinces in South Africa and to recommend institutions where parents can send their children to obtain a good all-round education with a balanced social life. The project can also be used by the Department of Education and other well-wishers to identify poorly performing institutions, so that they can direct the required resources to improve them.

The target audience for this project includes parents, government, and non-governmental organizations interested in the education sector.

**Data**

To address the problem, we need a dataset that contains:

> * All the provinces of South Africa.
> * All universities in South Africa.
> * Latitude and longitudes of all the universities.

Sources:
1. [List of South African Universities](https://en.wikipedia.org/wiki/List_of_universities_in_South_Africa)
2. Obtain the coordinates of each university using the Nominatim geocoder.
3. Explore the area around each university using the Foursquare API.

Wikipedia is the main source of data and is used to obtain the list of all universities in South Africa. We use pandas, a Python data analysis library whose ```read_html``` function extracts all the tables from this Wikipedia page, and convert them into a pandas dataframe. We then use Python's geopy package to obtain the latitude and longitude of every university location in the dataframe.

**Methodology**
- Obtain data from the Wikipedia page.
- Get coordinates using geocoder.
- Get venues surrounding our universities of interest using the Foursquare API.
- Filter venue categories.
- Perform clustering on the data using k-means.
- Visualize clusters using folium.

## **Import the Libraries**

In [183]:
import pandas as pd # is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool
import numpy as np # is a general-purpose array-processing package for scientific computing
import matplotlib.pyplot as plt # is a plotting library
%matplotlib inline 
import requests # package to send HTTP requests using Python
import json # package used to work with JSON data
from pandas import json_normalize # normalizes semi-structured JSON data into a flat table
from geopy.geocoders import Nominatim # geocoder used to convert place names into coordinates
import matplotlib.cm as cm
import matplotlib.colors as colors
from matplotlib.colors import rgb2hex
from sklearn.cluster import KMeans
import folium
import math
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# **Data Gathering**

URL of the page with the tables that list the universities.

In [184]:
url = 'https://en.wikipedia.org/wiki/List_of_universities_in_South_Africa'

Pandas ```read_html``` is simple to use and works well on many Wikipedia pages, since their tables are not complicated.

In [185]:
table = pd.read_html(url, flavor=['lxml', 'bs4'])
print(f'Total tables in the url: {len(table)}')

Total tables in the url: 10


The list of the required universities is split across 3 different tables, so we use pandas concat to join the tables together.


In [186]:
table1 = table[0]
table2 = table[1]
table3 = table[2]

df_uni = pd.concat([table1, table2, table3])
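
When tables are stacked with concat, each one keeps its own 0-based index unless told otherwise. The following is a minimal sketch of that behavior, assuming two made-up stand-in tables (`part_a` and `part_b` are hypothetical, not the scraped data); `ignore_index=True` is one way to get a clean running index.

```python
import pandas as pd

# Hypothetical stand-ins for the tables returned by pd.read_html.
part_a = pd.DataFrame({"Institution": ["University A", "University B"],
                       "Location(s)": ["Town 1", "Town 2, Town 3"]})
part_b = pd.DataFrame({"Institution": ["University C"],
                       "Location(s)": ["Town 4"]})

# Without ignore_index each table keeps its own index, so labels repeat.
stacked = pd.concat([part_a, part_b])

# ignore_index=True renumbers the rows 0..n-1.
stacked_clean = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())
print(stacked_clean.index.tolist())
```

Resetting the index later, as the notebook does after splitting the locations, achieves the same end result.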

Obtain the column names as a list in order to filter for only the required columns.

In [187]:
df_uni.columns.tolist()

['Institution',
 'Nickname',
 'Founded',
 'University status',
 'Undergrad',
 'Postgrad',
 'Total',
 'Location(s)',
 'Medium',
 'Total (2011)']

Create a dataframe with only the columns that we are interested in.

In [188]:
df = df_uni[['Institution', 'Location(s)']]
df = df.rename(columns={'Location(s)': 'Location'})

Note that some universities have multiple campuses, so to get the correct geocoordinates we will separate the locations.

In [189]:
#Split each university into one row per campus location and reset the index
df = df.assign(Location=df['Location'].str.split(',')).explode('Location').reset_index(drop=True)

#View the resulting DataFrame to check if the split worked correctly
df[df['Institution'] == 'Vaal University of Technology']

Unnamed: 0,Institution,Location
56,Vaal University of Technology,Vanderbijlpark
57,Vaal University of Technology,Secunda
58,Vaal University of Technology,Kempton Park
59,Vaal University of Technology,Klerksdorp
60,Vaal University of Technology,Upington
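
Splitting on a bare comma leaves a leading space on every location after the first, which can make later exact-match comparisons on the town name miss rows. A minimal sketch of the split-and-explode step with the whitespace stripped, assuming a small hypothetical frame (`demo` and its rows are invented for illustration):

```python
import pandas as pd

# Hypothetical rows mimicking the format of the 'Location' column.
demo = pd.DataFrame({"Institution": ["University A", "University B"],
                     "Location": ["Town 1, Town 2", "Town 3"]})

# One row per campus: split the comma-separated string, then explode the lists.
demo = (demo.assign(Location=demo["Location"].str.split(","))
            .explode("Location")
            .reset_index(drop=True))

# Remove the leading space left behind by splitting on ','.
demo["Location"] = demo["Location"].str.strip()

print(demo)
```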


In [190]:
#Drop University of Pretoria - Johannesburg as there is no campus in Johannesburg.
df.drop(index=17, inplace=True)

#Remove University of South Africa from the DataFrame as it is a distance-learning university.
is_unisa = (df['Institution'] == 'University of South Africa')
df = df[~is_unisa].reset_index(drop=True)
df.head(20)

Unnamed: 0,Institution,Location
0,University of Cape Town,Cape Town
1,University of Fort Hare,Alice
2,University of Fort Hare,East London
3,University of Fort Hare,Bhisho
4,University of the Free State,Bloemfontein
5,University of the Free State,QwaQwa
6,University of KwaZulu-Natal,Durban
7,University of KwaZulu-Natal,Pietermaritzburg
8,University of KwaZulu-Natal,Pinetown
9,University of KwaZulu-Natal,Westville


In [191]:
#Check if the DataFrame has any duplicates.
df.duplicated().any()

False

In [192]:
df["Latitude"] = ""
df["Longitude"] = ""
df.shape

(57, 4)

In [193]:
# Collect the locations that the geocoder cannot find so that they can be dropped
to_drop_unknown = []
geolocator = Nominatim(user_agent="sa_explorer")
for index, row in df.iterrows():
    address = row['Location'] + ', South Africa'
    try:
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))
        df.loc[index, 'Latitude'] = latitude
        df.loc[index, 'Longitude'] = longitude
    except AttributeError:
        # geocode() returns None for an unknown address, so .latitude raises AttributeError
        print('Cannot do: {}, will drop index: {}'.format(address, index))
        to_drop_unknown.append(index)

The geographical coordinates of Cape Town, South Africa are -33.928992, 18.417396.
The geographical coordinates of Alice, South Africa are -32.7888889, 26.8344444.
The geographical coordinates of  East London, South Africa are -33.0191604, 27.8998573.
The geographical coordinates of  Bhisho, South Africa are -32.8494444, 27.4463889.
The geographical coordinates of Bloemfontein, South Africa are -29.116395, 26.215496.
The geographical coordinates of  QwaQwa, South Africa are -28.5359227, 28.8066789.
The geographical coordinates of Durban, South Africa are -29.861825, 31.009909.
The geographical coordinates of  Pietermaritzburg, South Africa are -29.6, 30.3788889.
The geographical coordinates of  Pinetown, South Africa are -29.818056, 30.884167.
The geographical coordinates of  Westville, South Africa are -29.8244444, 30.9386111.
The geographical coordinates of Polokwane, South Africa are -23.9058333, 29.4613889.
The geographical coordinates of  Turfloop, South Africa are -23.88687725, 29.73143763805989.
The ge
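
The geocoding loop relies on an AttributeError to detect addresses that were not found. A more explicit variant is sketched below, assuming a hypothetical helper `geocode_town` that accepts any geocode callable; `FakeLocation` and `fake_geocode` are stand-ins invented here so the example runs without network access, and are not part of geopy.

```python
from typing import Callable, Optional, Tuple


def geocode_town(geocode: Callable, town: str,
                 country: str = "South Africa") -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a town, or None when nothing is found."""
    # Strip stray whitespace left over from splitting the location list.
    address = "{}, {}".format(town.strip(), country)
    location = geocode(address)
    if location is None:  # the geocoder found no match for this address
        return None
    return location.latitude, location.longitude


# Stand-in objects for illustration only (hypothetical, not geopy classes).
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {"Alice, South Africa": FakeLocation(-32.7888889, 26.8344444)}
    return known.get(address)


found = geocode_town(fake_geocode, " Alice")
missing = geocode_town(fake_geocode, "Nowhere")
print(found, missing)
```

With geopy, `geolocator.geocode` could be passed in place of `fake_geocode`. Wrapping it in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1` is one way to space out requests to the public Nominatim service.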

In [207]:
df.head()

Unnamed: 0,Institution,Location,Latitude,Longitude
0,University of Cape Town,Cape Town,-33.929,18.4174
1,University of Fort Hare,Alice,-32.7889,26.8344
2,University of Fort Hare,East London,-33.0192,27.8999
3,University of Fort Hare,Bhisho,-32.8494,27.4464
4,University of the Free State,Bloemfontein,-29.1164,26.2155


In [218]:
df.rename(columns={'Location':'Town'}, inplace=True)
df.head()

Unnamed: 0,Institution,Town,Latitude,Longitude
0,University of Cape Town,Cape Town,-33.929,18.4174
1,University of Fort Hare,Alice,-32.7889,26.8344
2,University of Fort Hare,East London,-33.0192,27.8999
3,University of Fort Hare,Bhisho,-32.8494,27.4464
4,University of the Free State,Bloemfontein,-29.1164,26.2155


In [221]:
#Filter for Universities in Johannesburg and Pretoria
filter_gp = (df['Town'] == 'Johannesburg') | (df['Town'] == 'Pretoria')
df_gp = df[filter_gp]
df_gp

Unnamed: 0,Institution,Town,Latitude,Longitude
16,University of Pretoria,Pretoria,-25.7459,28.1879
25,University of the Witwatersrand,Johannesburg,-26.205,28.0497
26,University of Johannesburg,Johannesburg,-26.205,28.0497
46,Tshwane University of Technology,Pretoria,-25.7459,28.1879
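
The same selection can be written with `isin`, which scales better when more towns are added. This is a minimal sketch on a small frame (`demo` holds three rows copied from the tables above; `towns_of_interest` is a name introduced here); the `str.strip()` call is a precaution against leading spaces in town names.

```python
import pandas as pd

# Three rows taken from the tables above, used only for illustration.
demo = pd.DataFrame({
    "Institution": ["University of Pretoria",
                    "University of the Witwatersrand",
                    "University of Cape Town"],
    "Town": ["Pretoria", "Johannesburg", "Cape Town"],
})

towns_of_interest = ["Johannesburg", "Pretoria"]

# Keep only the rows whose town is in the list.
demo_gp = demo[demo["Town"].str.strip().isin(towns_of_interest)]
print(demo_gp)
```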


In [222]:
# define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20210118' # Foursquare API version


In [223]:
#Top 100 venues that are within a radius of 500 meters.
radius = 500
LIMIT = 100

venues = []

for lat, long, town, institution in zip(df_gp['Latitude'], df_gp['Longitude'], df_gp['Town'], df_gp['Institution']):
    # build the explore request for the area around this location
    url = ("https://api.foursquare.com/v2/venues/explore"
           "?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}").format(
               CLIENT_ID, CLIENT_SECRET, VERSION, lat, long, radius, LIMIT)

    results = requests.get(url).json()["response"]['groups'][0]['items']

    # keep one tuple per venue: search location first, then venue details
    for venue in results:
        venues.append((institution,
                       town,
                       lat,
                       long,
                       venue['venue']['name'],
                       venue['venue']['location']['lat'],
                       venue['venue']['location']['lng'],
                       venue['venue']['categories'][0]['name']))
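
The loop above indexes straight into the JSON response, so an empty result or a venue without categories would raise an exception. A minimal defensive sketch follows, assuming a hypothetical helper `extract_venues` and a hand-written `sample` payload that only mimics the fields the loop reads; it is not an actual API response.

```python
def extract_venues(payload, institution, town, lat, lng):
    """Flatten an explore-style JSON payload into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((institution, town, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written payload for illustration (hypothetical, shaped like the fields used above).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Café Riche",
               "location": {"lat": -25.746579, "lng": 28.187304},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": -25.7460, "lng": 28.1880},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "University of Pretoria", "Pretoria", -25.745937, 28.187944)
empty_rows = extract_venues({"response": {}}, "University of Pretoria", "Pretoria", -25.745937, 28.187944)
print(len(rows), len(empty_rows))
```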


In [227]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Institution', 'Town', 'TownLatitude', 'TownLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(32, 8)


Unnamed: 0,Institution,Town,TownLatitude,TownLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,University of Pretoria,Pretoria,-25.745937,28.187944,Café Riche,-25.746579,28.187304,Café
1,University of Pretoria,Pretoria,-25.745937,28.187944,TriBeCa Coffee Shop,-25.744835,28.188936,Coffee Shop
2,University of Pretoria,Pretoria,-25.745937,28.187944,Church Square,-25.746366,28.188006,Plaza
3,University of Pretoria,Pretoria,-25.745937,28.187944,Wimpy,-25.748014,28.189257,Burger Joint
4,University of Pretoria,Pretoria,-25.745937,28.187944,Wimpy,-25.744685,28.189424,Burger Joint
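
As a sanity check, the distance between a search location and a returned venue can be compared against the 500 m radius. The sketch below uses the haversine formula with an assumed mean Earth radius of 6,371 km; `haversine_m` is a helper name introduced here, and the coordinates are those of the first row in the table above.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    earth_radius_m = 6371000  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Search centre for Pretoria versus the first venue in the table above.
distance = haversine_m(-25.745937, 28.187944, -25.746579, 28.187304)
print(round(distance), "m from the search centre")
```

For this row the distance falls well inside the 500 m search radius.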


In [230]:

venues_df.groupby(["Institution", "Town"]).count().head()

Institution,Town,TownLatitude,TownLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Tshwane University of Technology,Pretoria,10,10,10,10,10,10
University of Johannesburg,Johannesburg,6,6,6,6,6,6
University of Pretoria,Pretoria,10,10,10,10,10,10
University of the Witwatersrand,Johannesburg,6,6,6,6,6,6


In [231]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 9 unique categories.


In [233]:
venues_df['VenueCategory'].unique()[:10].tolist()

['Café',
 'Coffee Shop',
 'Plaza',
 'Burger Joint',
 'Fast Food Restaurant',
 'Pharmacy',
 'Scenic Lookout',
 'Breakfast Spot',
 'Portuguese Restaurant']
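
Before clustering, venue categories are commonly turned into numeric features. One possible approach, sketched here on a small hypothetical sample (`sample`, `onehot`, and `category_share` are names introduced for illustration, and the rows are invented), is to one-hot encode the category and average per institution, which gives the share of each category among an institution's venues.

```python
import pandas as pd

# Small hypothetical sample in the same shape as two columns of venues_df.
sample = pd.DataFrame({
    "Institution": ["Uni A", "Uni A", "Uni A", "Uni B"],
    "VenueCategory": ["Café", "Café", "Pharmacy", "Plaza"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(sample["VenueCategory"]).astype(float)
onehot.insert(0, "Institution", sample["Institution"])

# The mean of the indicators is the share of each category per institution.
category_share = onehot.groupby("Institution").mean()
print(category_share)
```

A table like `category_share` could then be passed to k-means as the feature matrix, with one row per institution.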