## This notebook is part of the Coursera Applied Data Science Capstone, Week 4

In this notebook, we download, clean, and structure neighborhood data for New York and Toronto, then combine it <a href="#item1">into one aggregate data set with geographical coordinates</a>. The main notebook of this project can then load this data set directly.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Use the Python package BeautifulSoup to scrape the list of postal codes of Toronto</a>

2. <a href="#item2">Use the link http://cocl.us/Geospatial_data to download Toronto's geographical coordinates directly</a>

3. <a href="#item3">Use the link https://cocl.us/new_york_dataset/newyork_data.json to download New York's geographical coordinates directly</a>

4. <a href="#item4">Aggregate New York's geographical coordinates with Toronto's data set and save the result for the next part</a>
</font>
</div>

### Import the necessary libraries
##### Note: installing geocoder and Folium takes a few minutes

In [7]:
import numpy as np 
import pandas as pd 
import bs4 as bs
import requests
import urllib.request
#!conda install -c conda-forge geocoder --yes # install geocoder packages 
import geocoder
import matplotlib.pyplot as plt # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

%matplotlib inline 
from sklearn.cluster import KMeans 
from geopy.geocoders import Nominatim
from IPython.display import Image 
from IPython.display import HTML # IPython.core.display is a deprecated import path
from pandas import json_normalize # top-level import, available in pandas >= 1.0
#!conda install -c conda-forge folium=0.5.0 --yes
import folium
print('Folium and geocoder installed')
print('Libraries imported.')

Folium and geocoder installed
Libraries imported.


#### 1 : Use the Python package <a href="#item1">BeautifulSoup</a> to scrape the list of postal codes of Toronto
##### Create a function to scrape and import all the text of a table

In [8]:
def data_url_scrapping(url, scrap="table", attrs=None):
    """
    Scrape one HTML table from a web page with BeautifulSoup.
    Adjust the scrap and attrs parameters to match your page.
    Parameters:
      url : URL of the page to scrape
      scrap : name of the tag to look for, "table" by default
      attrs : attributes of that tag, {"class": "wikitable sortable"} by default
    Returns a list of rows; the header row comes first when the table has one.
    """
    if attrs is None: # avoid a mutable default argument
        attrs = {"class": "wikitable sortable"}
    req = urllib.request.Request(url)
    page = urllib.request.urlopen(req)
    soup = bs.BeautifulSoup(page, "html.parser") # name the parser explicitly
    data_table = soup.find(scrap, attrs)
    trs = data_table.find_all('tr')
    rows = []
    def get_data_table_row_and_header(tr, coltag='td'): # td (data) or th (header)
        """
        Return the cell texts of one table row as a list of strings.
        Parameters:
          tr : a <tr> tag taken from the scraped table
          coltag : 'th' for header cells or 'td' for data cells
        """
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]
    headerow = get_data_table_row_and_header(trs[0], 'th')
    if headerow: # if there is a header row, include it first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every remaining table row
        rows.append(get_data_table_row_and_header(tr, 'td')) # data row
    return rows
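
The row extraction can be tried without a network connection. The following is a minimal sketch, assuming a small inline HTML string in place of the Wikipedia page; `SAMPLE_HTML` and `parse_table_rows` are hypothetical names introduced for illustration, and the "M1A" row is made-up sample data. Unlike the function above, this variant reads `th` and `td` cells in every row.

```python
import bs4 as bs

# Hypothetical inline HTML standing in for the scraped page (no download needed).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_table_rows(html, scrap="table", attrs=None):
    """Return every row of the first matching table as a list of cell texts."""
    if attrs is None:
        attrs = {"class": "wikitable sortable"}
    soup = bs.BeautifulSoup(html, "html.parser")
    table = soup.find(scrap, attrs)
    if table is None:
        raise ValueError("no matching table found")
    rows = []
    for tr in table.find_all("tr"):
        # Collect header (th) and data (td) cells in document order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

sample_rows = parse_table_rows(SAMPLE_HTML)
print(sample_rows[0])  # header row
print(sample_rows[2])  # second data row
```

Raising an error when no table matches gives a clearer message than the `AttributeError` that a `None` result would otherwise trigger.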

In [9]:
data_Toronto = data_url_scrapping('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Toronto:_M',scrap="table", attrs={"class": "wikitable sortable"})
data_Toronto = pd.DataFrame(data_Toronto[1:], columns=data_Toronto[0])
data_Toronto = data_Toronto[data_Toronto.Borough!="Not assigned"].reset_index(drop=True)
print("List of postal codes of Toronto scrapping and preprocessing done !! ")
print("List of postal codes of Toronto has : " + str(data_Toronto.shape[0]) +" rows")
print("\n")
print(data_Toronto.head())
print("\n")
print("\n")

List of postal codes of Toronto scrapping and preprocessing done !! 
List of postal codes of Toronto has : 103 rows


  Postal Code           Borough                                 Neighborhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government
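
The preprocessing step, which drops rows whose borough is "Not assigned", can be illustrated on a few rows. This is a sketch under the assumption that the scraped rows look like the hypothetical `sample_rows` list below; the "M1A" row is made-up sample data.

```python
import pandas as pd

# Hypothetical scraped rows: the header comes first, then the data rows.
sample_rows = [
    ["Postal Code", "Borough", "Neighborhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

# Build the frame from the data rows and use the first row as column names.
sample = pd.DataFrame(sample_rows[1:], columns=sample_rows[0])

# Keep only assigned boroughs and renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

`reset_index(drop=True)` discards the old row labels, so the filtered frame is numbered consecutively again.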






#### 2 : Use the link http://cocl.us/Geospatial_data to download Toronto's geographical coordinates directly
##### After the download, merge the coordinates with the Toronto postal code data and check the result

In [10]:
#!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data
Geospatial_Coordinates = pd.read_csv('https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv')
print('Data downloaded!') # report success only after the download has run
print("\n")
# join the coordinates to the postal codes on their shared key column
data_Toronto_final = data_Toronto.merge(Geospatial_Coordinates, on="Postal Code")
print(data_Toronto_final.head(2))
print("\n")
print("#"*50)
print("    To make sure that data set has not null values")
print("#"*50)
print("\n")
print(data_Toronto_final.isnull().sum())
print("\n")
print("\n")
data_Toronto_final["City"] = "Toronto" # label every row with its city

Data downloaded!


  Postal Code     Borough      Neighborhood   Latitude  Longitude
0         M3A  North York         Parkwoods  43.753259 -79.329656
1         M4A  North York  Victoria Village  43.725882 -79.315572


##################################################
    To make sure that data set has not null values
##################################################


Postal Code     0
Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64
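
A null count of zero after an inner merge does not show whether any postal code was dropped for lack of coordinates. One way to check, sketched here on hypothetical inline frames (`postal`, `coords`, and the code "M0Z" are made up for illustration), is a left merge with the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical postal code table; "M0Z" is a made-up code with no coordinates.
postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0Z"],
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every postal code, indicator=True adds a "_merge" column,
# and validate="one_to_one" raises if a key is duplicated on either side.
checked = postal.merge(
    coords, on="Postal Code", how="left", indicator=True, validate="one_to_one"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every postal code found its coordinates.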






#### 3 : Use the link https://cocl.us/new_york_dataset/newyork_data.json to download New York's geographical coordinates directly
##### After the download, clean and structure the data set

In [11]:
New_york_data = pd.read_json('https://cocl.us/new_york_dataset/newyork_data.json', orient='index')
New_york_data = New_york_data.T["features"].values[0] # list of neighborhood features
# column order matches the printed output below
column_names = ['Borough', 'City', 'Latitude', 'Longitude', 'Neighborhood', 'Postal Code']
new_york_rows = [] # one dict per neighborhood; DataFrame.append was removed in pandas 2.0
for data in New_york_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
    neighborhood_latlon = data['geometry']['coordinates'] # stored as [longitude, latitude]
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    # 'XXX' is a placeholder: the New York data has no postal codes
    new_york_rows.append({'Postal Code': 'XXX', 'Borough': borough, 'Neighborhood': neighborhood_name, 'Latitude': neighborhood_lat, 'Longitude': neighborhood_lon, 'City': "New York"})
New_york_data_final = pd.DataFrame(new_york_rows, columns=column_names)


print("download New york 's geographical coordinates done correctly !! ")
print("It has : " + str(New_york_data_final.shape[0]) +" rows")
print("\n")
print(New_york_data_final.head())
print("\n")
print("\n")

download New york 's geographical coordinates done correctly !! 
It has : 306 rows


  Borough      City   Latitude  Longitude Neighborhood Postal Code
0   Bronx  New York  40.894705 -73.847201    Wakefield         XXX
1   Bronx  New York  40.874294 -73.829939   Co-op City         XXX
2   Bronx  New York  40.887556 -73.827806  Eastchester         XXX
3   Bronx  New York  40.895437 -73.905643    Fieldston         XXX
4   Bronx  New York  40.890834 -73.912585    Riverdale         XXX
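
The same flattening can be done without an explicit loop. The sketch below assumes pandas 1.0 or later and uses `pd.json_normalize` on two hypothetical inline features; only the `properties` and `geometry.coordinates` keys are taken from the notebook, while the `type` fields are assumptions based on the usual GeoJSON layout.

```python
import pandas as pd

# Hypothetical sample of the "features" list; coordinates are [longitude, latitude].
sample_features = [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
     "properties": {"name": "Wakefield", "borough": "Bronx"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.829939, 40.874294]},
     "properties": {"name": "Co-op City", "borough": "Bronx"}},
]

# Nested dicts become dotted column names; the coordinate lists stay as lists.
flat = pd.json_normalize(sample_features)

# Split each [longitude, latitude] pair into two named columns.
lon_lat = pd.DataFrame(flat["geometry.coordinates"].tolist(),
                       columns=["Longitude", "Latitude"])

ny_sample = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    "Latitude": lon_lat["Latitude"],
    "Longitude": lon_lat["Longitude"],
})
ny_sample["City"] = "New York"
print(ny_sample)
```

Naming the coordinate columns at the point where the pair is split makes the longitude-first order explicit, which is easy to get wrong when indexing by position.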






#### 4 : Aggregate New York's geographical coordinates with Toronto's data set and save the result for the next part

In [12]:
New_york_and_Toronto_data_final = pd.concat([New_york_data_final, data_Toronto_final], axis=0, sort=True)
print("After aggregation New york and Toronto Neighborhoods dataframe has : " + str(New_york_and_Toronto_data_final.shape[0]) +" rows")
print("\n")
# local output path: adjust it to your own environment before running
save_path = "C:/Users/iamadou/Desktop/Projet ML/Certification IBM data science/Coursera_ML_Capstone_week_4/NewYork_and_Toronto_Neighborhoods.csv"
New_york_and_Toronto_data_final.to_csv(save_path, index=False)
print("numbers of null value")
print(New_york_and_Toronto_data_final.isnull().sum())

After aggregation New york and Toronto Neighborhoods dataframe has : 409 rows


numbers of null value
Borough         0
City            0
Latitude        0
Longitude       0
Neighborhood    0
Postal Code     0
dtype: int64
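
The aggregation and the save step can be checked together with a round trip through a CSV file. This is a minimal sketch on two hypothetical one-row frames, writing to a temporary directory rather than the path used above; `ignore_index=True` is an optional choice made here so that the combined frame gets a fresh index.

```python
import os
import tempfile

import pandas as pd

ny = pd.DataFrame({
    "Postal Code": ["XXX"], "Borough": ["Bronx"], "Neighborhood": ["Wakefield"],
    "Latitude": [40.894705], "Longitude": [-73.847201], "City": ["New York"],
})
# Same columns in a different order, to show what sort=True does.
toronto = pd.DataFrame({
    "City": ["Toronto"], "Borough": ["North York"], "Neighborhood": ["Parkwoods"],
    "Postal Code": ["M3A"], "Latitude": [43.753259], "Longitude": [-79.329656],
})

# sort=True orders the columns alphabetically when the frames are not aligned.
combined = pd.concat([ny, toronto], axis=0, sort=True, ignore_index=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "NewYork_and_Toronto_Neighborhoods.csv")
    combined.to_csv(csv_path, index=False)   # index=False keeps the file free of row labels
    reloaded = pd.read_csv(csv_path)

print(list(combined.columns))
print(reloaded.shape == combined.shape)
```

Comparing the shape of the reloaded frame with the original is a cheap check that nothing was lost when writing the file.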
