# Applied Data Science Specialization Capstone

This notebook is mainly used for the Applied Data Science Capstone project on Coursera, provided by IBM.
In this capstone, The Battle of the Neighborhoods, location data is used to solve a problem or to gain deeper insight into a neighborhood's reputation.

In [90]:
import pandas as pd
import numpy as np

Hello Capstone Project Course!

# Problem 1: Cluster the Neighborhoods in Toronto

Obtain the data in the table of Canadian postal codes whose first letter is M, and transform it into a pandas dataframe.

In [91]:
import requests
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
column_names = ['PostalCode', 'Borough', 'Neighborhood']
source = requests.get(url).text

Define the URL and the column names to be used, then use the requests module to get the page's text.

In [92]:
from bs4 import BeautifulSoup 
soup = BeautifulSoup(source, 'lxml')
table = soup.find('table')
df = pd.DataFrame(columns = column_names)

Next, parse the page source with the BeautifulSoup module (using the lxml parser) in order to locate the table that holds the data.
Then, use the pandas module to create an empty dataframe with the column names defined above.

In [93]:
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data

We grab all the tr elements from the table, then grab the td elements of each row one at a time. We read the text of each td element, strip the surrounding whitespace, and collect the values into a list. Each row with exactly three values is then appended to our table object, the pandas dataframe.
(https://srome.github.io/Parsing-HTML-Tables-in-Python-with-BeautifulSoup-and-pandas/)
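The row-by-row parsing described above can be tried offline on a small HTML string. This is a minimal sketch, not the notebook's exact cell: the `html` excerpt and the `sample_df` name are made up for illustration, and it uses the standard-library `html.parser` backend instead of lxml so that no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical excerpt standing in for the Wikipedia table.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td> Not assigned </td></tr>
  <tr><td> M3A </td><td> North York </td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser instead of lxml
rows = []
for tr in soup.find("table").find_all("tr"):
    # get_text(strip=True) returns the cell text without surrounding whitespace
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        rows.append(cells)

# Build the dataframe once from the collected rows.
sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Collecting the rows in a list and constructing the dataframe once at the end is usually faster than appending to the dataframe inside the loop, although either approach gives the same table.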

In [94]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


# Clean Data

In [97]:
# Keep only rows with an assigned borough; an unassigned neighborhood takes the borough name.
df = df[df['Borough'] != 'Not assigned'].copy()
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']

More than one neighborhood can exist in one postal code area. Such rows are combined into one row with the neighborhoods separated by a comma. First, a new dataframe, custom_df, groups the rows by PostalCode and joins the Neighborhood values of each group using the pandas groupby and apply functions. It is then merged back into the original dataframe.

In [98]:
custom_df = df.groupby('PostalCode')['Neighborhood'].apply(', '.join)
custom_df=custom_df.reset_index(drop=False)
custom_df.rename(columns={'Neighborhood':'Neighborhood_combined'},inplace=True)

df_merge is a new dataframe that takes custom_df (where the neighborhoods sharing a PostalCode are joined with a comma) and merges the new column with the original dataframe.

In [99]:
df_merge = pd.merge(df, custom_df, on='PostalCode')
df_merge.drop(['Neighborhood'],axis=1,inplace=True)
df_merge.drop_duplicates(inplace=True)
df_merge.rename(columns={'Neighborhood_combined': 'Neighborhood'},inplace = True)
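The group, merge, drop, and rename steps above can be checked against a toy frame. The sketch below is an alternative single-step formulation, not the notebook's method: it groups on both PostalCode and Borough so that no merge is needed, and it assumes each postal code belongs to exactly one borough. The `toy` and `combined` names and the rows are illustrative.

```python
import pandas as pd

# Toy data (hypothetical rows) with one postal code listed twice.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group on both key columns and join the neighborhood names of each group.
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

With `sort=False` the groups keep their order of first appearance, so the result has one row per postal code in the same order as the input.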

In [33]:
df_merge.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [58]:
df_merge.shape

(103, 3)

# Problem 2: Latitude and the Longitude Coordinates of Each Neighborhood

We have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name. In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In [59]:
import geocoder

def find_geocode(postal_code):
    # initialize the variable to None inside the function so the loop can assign it
    lat_lng_coords = None
    # loop until the coordinates are returned
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude
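The loop above keeps querying until coordinates come back, so it can run forever if the service never answers. A hedged sketch of a bounded variant follows. `geocode_with_retry`, `lookup`, `max_attempts`, and `delay` are hypothetical names, and `lookup` is a stand-in for any geocoding call that returns `[latitude, longitude]` or `None`. A fake lookup is used so the example runs without network access.

```python
import time

def geocode_with_retry(postal_code, lookup, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical stand-in for a geocoding call that returns
    [latitude, longitude] or None.
    """
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay)  # pause before the next attempt
    return None  # give up instead of looping forever


# Fake lookup that fails twice before answering, for illustration only.
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.806686, -79.194353]

result = geocode_with_retry("M1B", flaky_lookup)
print(result)
```

Returning `None` after the last attempt lets the caller decide how to handle a missing postal code, for example by falling back to a CSV of coordinates.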

In [60]:
geo_df = pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [61]:
geo_df.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

In [63]:
geo_merged = pd.merge(geo_df, df_merge, on='PostalCode')
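Because `pd.merge` performs an inner join by default, postal codes present in only one of the two frames are dropped silently. The sketch below shows one way to check for that with the `indicator` option. The miniature frames are illustrative, and `M1X` is a made-up code with no coordinates.

```python
import pandas as pd

# Miniature stand-ins for geo_df and df_merge (illustrative only).
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})
names = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1X"],  # M1X is a made-up code
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek", "Example"],
})

# Default inner join: only postal codes found in both frames survive.
inner = pd.merge(coords, names, on="PostalCode")

# An outer join with indicator=True adds a _merge column that flags unmatched rows.
check = pd.merge(coords, names, on="PostalCode", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "PostalCode"].tolist()
print(len(inner), unmatched)
```

An empty `unmatched` list on the real data would confirm that every neighborhood row received coordinates.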

In [70]:
geo_data=geo_merged[['PostalCode','Borough','Neighborhood','Latitude','Longitude']]
geo_merged.head()

Unnamed: 0,PostalCode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
