## Data Scraper Notebook

This notebook scrapes Google Maps for location data to add to the property sales data set. It should be viewed as a 'parallel' notebook to the Sales Data Cleaning notebook: it takes in the same .csv as that notebook and outputs a modified version of the data to be imported back into it.

The resulting csv is ultimately cleaned up in the sales notebook. 

A script from a 'Towards Data Science' post, cited below, was modified to perform this task. In addition to using selenium, the post introduced me to 'tqdm', which is used here to show a progress bar for the scraping process, and to folium, which was essential for quickly plotting the location data on an interactive map.

In [2]:
import pandas as pd
import numpy as np
import time

In [3]:
from selenium import webdriver
from tqdm import tqdm_notebook as tqdmn

In [4]:
# Import Manhattan sales for 2016 only (~20,000 rows).
df = pd.read_csv('./Sales_Manhattan/sales_manhattan_16.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20486 entries, 0 to 20485
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   BOROUGH                         20486 non-null  int64 
 1   NEIGHBORHOOD                    20486 non-null  object
 2   BUILDING CLASS CATEGORY         20486 non-null  object
 3   TAX CLASS AT PRESENT            20486 non-null  object
 4   BLOCK                           20486 non-null  int64 
 5   LOT                             20486 non-null  int64 
 6   EASE-MENT                       20486 non-null  object
 7   BUILDING CLASS AT PRESENT       20486 non-null  object
 8   ADDRESS                         20486 non-null  object
 9   APARTMENT NUMBER                20486 non-null  object
 10  ZIP CODE                        20486 non-null  int64 
 11  RESIDENTIAL UNITS               20486 non-null  object
 12  COMMERCIAL UNITS                20486 non-null

**The cell below was written with the intention of eventually using a larger sales data set, but this was never implemented.**

In [65]:
# Intended to import all Manhattan CSV files, 2003-2019 (~370000 rows).
# As written, range(18, 19) loads only the 2018 file.
def get_data():
    folder = "./Sales_Manhattan/sales_manhattan_{}.csv"
    dfs = []
    for num in range(18,19):
        file_loc = folder.format(num)
        dfs.append(pd.read_csv(file_loc, keep_default_na = False))
    return pd.concat(dfs)
        
df = get_data()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16837 entries, 0 to 16836
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   BOROUGH                         16837 non-null  int64 
 1   NEIGHBORHOOD                    16837 non-null  object
 2   BUILDING CLASS CATEGORY         16837 non-null  object
 3   TAX CLASS AT PRESENT            16837 non-null  object
 4   BLOCK                           16837 non-null  int64 
 5   LOT                             16837 non-null  int64 
 6   EASE-MENT                       16837 non-null  object
 7   BUILDING CLASS AT PRESENT       16837 non-null  object
 8   ADDRESS                         16837 non-null  object
 9   APARTMENT NUMBER                16837 non-null  object
 10  ZIP CODE                        16837 non-null  int64 
 11  RESIDENTIAL UNITS               16837 non-null  int64 
 12  COMMERCIAL UNITS                16837 non-null
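
The loop in `get_data` could be widened to the full 2003-2019 range by generating one path per year. A minimal sketch, assuming every file follows the `sales_manhattan_{}.csv` template with a zero-padded two-digit year (only the `16` and `18` suffixes appear in this notebook, so the padding for 2003-2009 is a guess); `sales_paths` is a hypothetical helper:

```python
def sales_paths(first_year=2003, last_year=2019,
                template="./Sales_Manhattan/sales_manhattan_{}.csv"):
    """Return one CSV path per year, using a two-digit year suffix (assumed)."""
    # year % 100 turns 2003 into 3; the :02d format pads it to '03'.
    return [template.format(f"{year % 100:02d}")
            for year in range(first_year, last_year + 1)]


paths = sales_paths()
print(len(paths), paths[0], paths[-1])
```

Each path could then be passed to `pd.read_csv` inside `get_data` in place of `folder.format(num)`.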

In [5]:
df['ADDRESS'] = df['ADDRESS'].str.strip()
df['Full Address'] = df['ADDRESS'].str.cat(df['ZIP CODE'].astype(str), sep = ' ')

In [6]:
a_url = pd.DataFrame()
a_url['Full Address'] = df['Full Address'].unique()
a_url.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11482 entries, 0 to 11481
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Full Address  11482 non-null  object
dtypes: object(1)
memory usage: 89.8+ KB


In [121]:
a_url['Url'] = ['https://www.google.com/maps/search/' + i for i in a_url['Full Address'] ]
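
The addresses contain spaces, which the browser has to encode before the request is sent. A minimal sketch of doing that explicitly with the standard library's `urllib.parse.quote`; `build_search_url` is a hypothetical helper, not part of the original script:

```python
from urllib.parse import quote

BASE_URL = "https://www.google.com/maps/search/"


def build_search_url(full_address):
    """Percent-encode an address and append it to the search URL."""
    # safe="" also encodes '/' so an address can never add a path segment.
    return BASE_URL + quote(full_address, safe="")


print(build_search_url("21 AVENUE B 10009"))
```

Encoding up front makes the stored `Url` column match what is actually requested, which can help when matching results back to addresses.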

In [122]:
#Adapted from 'Using Python and Selenium to get coordinates from street addresses'
#Medium post by Khalid El Mouloudi

#Added a 180 s pause every 1500 pages. Without a time delay the scraper was last kicked at 2288 pages.

Url_With_Coordinates = []

#"prefs to run the Webdriver without javascript and images. 
#This way the code will take much less time to load webpages. 
#Obviously, this isn’t a good choice if what you want to extract relies on javascript. 
#By removing 'images':2, 'javascript':2, the web pages will load 
#images and javascript normally"

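# Selenium 4 style: the driver path is passed through a Service object and
# elements are located with find_element(By..., ...).
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
prefs = {'profile.default_content_setting_values': {'images':2, 'javascript':2}}
option.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(service=Service("/Users/jhg/chromedriver"), options=option)

# Count pages from 1 so the pause triggers after every 1500th page.
for i, url in enumerate(tqdmn(a_url.Url, leave=False), start=1):
    driver.get(url)
    Url_With_Coordinates.append(driver.find_element(By.CSS_SELECTOR, 'meta[itemprop=image]').get_attribute('content'))

    #time delay
    if i % 1500 == 0:
        time.sleep(180)
driver.quit()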

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
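
The pause every 1500 pages in the loop above can be factored out of the scraping code. A rough sketch, where `paused_every` is a hypothetical generator and the `sleep` argument exists only so that the pause can be swapped out in a dry run:

```python
import time


def paused_every(items, every=1500, pause_seconds=180, sleep=time.sleep):
    """Yield items unchanged, sleeping after every `every`-th item."""
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            sleep(pause_seconds)


# Dry run with a fake sleep that only records the requested pauses.
pauses = []
seen = list(paused_every(range(7), every=3, pause_seconds=180, sleep=pauses.append))
print(seen, pauses)
```

With this in place the scraping loop would iterate over `paused_every(a_url.Url)` and would no longer need its own counter.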



In [124]:
import csv

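# newline='' is recommended by the csv module; writerow stores all URLs in a single row.
with open('Url_With_Coordinates.csv', 'w', newline='') as file:
    wr = csv.writer(file)
    wr.writerow(Url_With_Coordinates)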
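
`writerow` stores the whole list as a single row with one column per URL. If one URL per line is preferred, a sketch along these lines could be used instead; `save_urls` and `load_urls` are hypothetical helpers, and the file name and URLs are only examples:

```python
import csv
import os
import tempfile


def save_urls(urls, path):
    """Write one URL per row."""
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="") as file:
        csv.writer(file).writerows([url] for url in urls)


def load_urls(path):
    """Read the URLs back in the order they were written."""
    with open(path, newline="") as file:
        return [row[0] for row in csv.reader(file) if row]


# Round trip through a temporary file; the comma checks that quoting works.
example_urls = ["https://example.com/a?x=1,2", "https://example.com/b"]
path = os.path.join(tempfile.mkdtemp(), "urls_example.csv")
save_urls(example_urls, path)
print(load_urls(path))
```

One URL per row keeps the file readable line by line and lets it be loaded as a single column later.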

In [126]:
a_url.head(4)

| Unnamed: 0 | Full Address | Url | Url_C |
| --- | --- | --- | --- |
| 0 | EAST 29TH STREET 10016 | https://www.google.com/maps/search/EAST 29TH ... | https://maps.google.com/maps/api/staticmap?cen... |
| 1 | 264 EAST 7TH STREET 10009 | https://www.google.com/maps/search/264 EAST 7T... | https://maps.google.com/maps/api/staticmap?cen... |
| 2 | 21 AVENUE B 10009 | https://www.google.com/maps/search/21 AVENUE B... | https://maps.google.com/maps/api/staticmap?cen... |
| 3 | 615 EAST 6TH STREET 10009 | https://www.google.com/maps/search/615 EAST 6T... | https://maps.google.com/maps/api/staticmap?cen... |
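
The `Url_C` values are static-map links, and the coordinates still have to be pulled out of them. The links are truncated in the output above, so the sketch below assumes they carry a `center=<lat>,<lon>` query parameter. That parameter name, the helper `parse_center`, and the sample URL are assumptions for illustration, not values taken from the scraped data:

```python
from urllib.parse import parse_qs, urlparse


def parse_center(static_map_url):
    """Return (lat, lon) from a `center` query parameter, or None if absent."""
    query = parse_qs(urlparse(static_map_url).query)
    if "center" not in query:
        return None
    # parse_qs already decodes '%2C' to ',', so a plain split is enough.
    lat, lon = query["center"][0].split(",")
    return float(lat), float(lon)


# Made-up URL and coordinates, used only to exercise the function.
sample = "https://maps.google.com/maps/api/staticmap?center=40.7421%2C-73.9816&zoom=16"
print(parse_center(sample))
```

Applied to the `Url_C` column, a function like this would give the latitude and longitude columns that the sales notebook imports.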
