## Battle of Neighbourhoods - Week 1

## Optimal Choice Analysis of Buying a House in Nashua, NH

### 2. A description of the data and how it will be used to solve the problem


### 2.1 Description of Data



To address Emma’s requests, I proceeded as follows:

-	I collected listing data from ColdwellBankerHomes.com, including price, number of bedrooms, location (latitude, longitude), area (sq. ft.), zip code and more.

-	After carefully inspecting the HTML that the website returns for search requests, I scraped the data using Python and cleaned it before storing it in an appropriate format for further use.

-	The full data set is more than Emma needs: she is looking for houses with at least 3 bedrooms and a price of no more than $400,000. Therefore, I filtered the data to match those requirements.

-	Using the geospatial data (latitude and longitude), I plotted the houses on a map for better visualization.

-	Each marker for a house on sale has a hover tooltip that shows more details.

-	From the Nashua School District website, www.nashua.edu, I collected the street names and the school each street belongs to.

-	Combining the data from www.nashua.edu with the house listings fetched from ColdwellBankerHomes.com gives the school zone for each house on the market.

-	This makes it easy to filter for houses in the school group that matches Emma’s needs.

-	Using FourSquare.com, I collected venue information near the geolocation of each listed house and ranked them.

-	Finally, once all of the above data is cleaned and prepared, clustering it lets the map show clusters of houses that fall within a given school range. This visualization should help Emma view only the houses that interest her.
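
The filtering and school-zone steps above can be sketched with pandas. This is a minimal illustration rather than the notebook's actual implementation: the column names `Address`, `BedRooms` and `Price` follow the listing data frame built in section 2.3, while the `schools` table, its `Street` and `School` columns, the school labels and all sample prices are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample listings (prices are invented for illustration)
listings = pd.DataFrame({
    "Address": ["10 Indian Fern Dr", "15 Bennett St", "42 Norma Dr"],
    "BedRooms": [4, 2, 3],
    "Price": [389000, 250000, 415000],
})

# Hypothetical street-to-school lookup, standing in for the www.nashua.edu data
schools = pd.DataFrame({
    "Street": ["Indian Fern Dr", "Bennett St", "Norma Dr"],
    "School": ["School A", "School B", "School A"],
})

# Keep houses with at least 3 bedrooms priced at no more than $400,000
matches = listings[(listings["BedRooms"] >= 3) & (listings["Price"] <= 400_000)].copy()

# Derive the street name by dropping the leading house number
matches["Street"] = matches["Address"].str.replace(r"^\S+\s+", "", regex=True)

# Left join so that listings on streets missing from the lookup are kept
zoned = matches.merge(schools, on="Street", how="left")
print(zoned[["Address", "Price", "School"]])
```

A left join is used here so that a listing whose street is not in the school table still appears, with an empty `School` value, instead of silently disappearing from the results.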


### 2.2 Data Source

##### ColdwellBankerHomes.com
##### www.nashua.edu
##### FourSquare.com

### 2.3 Data Detail


<h1 align=left><font size = 3>Build data frame from www.coldwellbankerhomes.com for Nashua, NH</font></h1>

In [2]:
# Load dependencies

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests
from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


<h5>Scrape available house information from www.coldwellbankerhomes.com</h5>

In [4]:
from requests import get

In [5]:
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

In [6]:
sapo = "https://www.coldwellbankerhomes.com/nh/nashua?sortId=2&offset=12"
response = get(sapo, headers=headers)

In [7]:
html_soup = BeautifulSoup(response.text, 'html.parser')

In [8]:
house_containers = html_soup.find_all('div', class_="property-snapshot-psr-panel")

In [9]:
first = house_containers[0]
first.find_all('span')

[<span class="notranslate" itemprop="name">15 Bennett St, Nashua, NH 03064</span>,
 <span class="notranslate" itemprop="description">View this property at 15 Bennett St, Nashua, NH 03064</span>,
 <span class="notranslate" itemprop="streetAddress">15 Bennett St</span>,
 <span class="notranslate" itemprop="addressLocality">Nashua</span>,
 <span class="notranslate" itemprop="addressRegion">NH</span>,
 <span itemprop="postalCode">03064</span>,
 <span class="status"><span class="flag just-listed icon-house"></span><span class="property-status-indicator-text">Just Listed</span></span>,
 <span class="flag just-listed icon-house"></span>,
 <span class="property-status-indicator-text">Just Listed</span>,
 <span class="prev owl"><a class="nav-prev"></a></span>,
 <span class="next owl"><a class="nav-next"></a></span>,
 <span aria-hidden="true" class="icon-heart"></span>,
 <span class="visually-hidden">Save</span>,
 <span class="attr-agent notranslate">James Goddard</span>,
 <span class="attr-phon

In [10]:
import itertools 

In [12]:
%%time

addressArray = []
zipcodeArray = []
priceArray = []
latArray = []
lngArray = []
boroughArray = []
areaArray = []
bedroomArray = []
persqftArray = []

n_pages = 0

for page in range(0,20):
    n_pages += 1
    
    sapo_url = 'https://www.coldwellbankerhomes.com/nh/nashua?sortId=2&offset='+str(page*12)
    
 
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="property-snapshot-psr-panel")
    
    if house_containers != []:
        for container in house_containers:
            
            # Latitude
            latArray.append(container['data-lat'])
            
            # Longitude
            lngArray.append(container['data-lng'])
            
            # Address
            location = container.find_all('div', class_="street-address")[0].text
            addressArray.append(location)

            # Zip
            zipcode = str(container.find_all('div', class_="city-st-zip")[0].text[-5:])
            zipcodeArray.append(zipcode)
            
            # Area_ft2 (0 when the listing has no square-footage entry)
            area = container.find_all('li', class_="sq.-ft.")

            if area:
                area = area[0].text.replace("Sq. Ft.","").replace(",","")
            else:
                area = 0
            areaArray.append(area)

            # Bedrooms (0 when the listing has no bedroom entry)
            bedroom = container.find_all('li', class_="beds")

            if bedroom:
                bedroom = bedroom[0].text.replace("Beds","")
            else:
                bedroom = 0
            bedroomArray.append(bedroom)
            
        
            
            # Borough
            borough = str(container.find_all('div', class_="city-st-zip")[0].text[:6])
            boroughArray.append(borough)

            # Price
            price = container.find_all('div', class_="price-normal")[0].text.replace("$","").replace(",","")
            priceArray.append(price)
            
            # Price Per sq ft
            if int(area) > 0 :
                persqft = float(int(price)/int(area))
            else :
                persqft = 0
            
            persqftArray.append(persqft)
            
    else:
        break
    

CPU times: user 12.1 s, sys: 142 ms, total: 12.3 s
Wall time: 37.2 s
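
The loop above cleans each scraped value with chained `replace` calls and stores most fields as strings. As a possible alternative, a small helper could return integers directly so that the columns are numeric from the start. The sketch below uses only the standard library; `to_int` and `price_per_sqft` are hypothetical names, not part of the original notebook, and the sample strings are assumed to resemble the text found in the listing elements.

```python
import re

def to_int(text, default=0):
    """Return the first integer in text, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else default

def price_per_sqft(price, area):
    """Price per square foot, or 0.0 when the area is unknown (0)."""
    return price / area if area > 0 else 0.0

price = to_int("$349,900")
area = to_int("1,850 Sq. Ft.")
beds = to_int("3 Beds")
print(price, area, beds, round(price_per_sqft(price, area), 2))
```

Returning a default of 0 for missing values mirrors the fallback the loop already uses for listings without an area or bedroom entry.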


<h5>Build Pandas dataframe along with Latitude and Longitude</h5>

In [13]:
cols = ['Borough','Address', 'ZipCode', 'BedRooms', 'Area_ft2', 'Price', 'Price_ft2','Latitude', 'Longitude']

listNashua = pd.DataFrame({'Borough': boroughArray,
                           'Address': addressArray,
                           'ZipCode': zipcodeArray,
                           'BedRooms': bedroomArray,
                           'Area_ft2': areaArray,
                           'Price': priceArray,
                           'Price_ft2':persqftArray,
                           'Latitude': latArray,
                           'Longitude': lngArray
                           })[cols]

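# Persist the scraped listings, then reload them from disk
listNashua.to_excel('listNashua.xlsx')

listNashua = pd.read_excel('listNashua.xlsx')

# Restore leading zeros that may be lost if ZipCode is read back as a number
listNashua['ZipCode'] = listNashua['ZipCode'].astype(str).str.zfill(5)
print(listNashua)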

     Unnamed: 0 Borough                            Address ZipCode  BedRooms  \
0             0  Nashua                     15 Bennett St    03064         2   
1             1  Nashua                 10 Indian Fern Dr    03062         4   
2             2  Nashua             230 Cannongate III Rd    03063         2   
3             3  Nashua                      2 Mayfair Ln    03063         2   
4             4  Nashua                    9-11 Martin St    03064         0   
5             5  Nashua                   37 Majestic Ave    03063         3   
6             6  Nashua                     7 Syracuse Rd    03064         4   
7             7  Nashua                       42 Norma Dr    03062         3   
8             8  Nashua                   14 Artillery Ln    03064         3   
9             9  Nashua              15 Cannongate III Rd    03063         2   
10           10  Nashua        107 Tolles N Of Lock St St    03064         3   
11           11  Nashua                 

In [14]:
listNashua.shape

(480, 10)
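
The shape reports 10 columns although only 9 were defined in `cols`. The extra `Unnamed: 0` column is the data frame index, which was written to the file and then read back as ordinary data. The sketch below reproduces the effect and one way to avoid it, using an in-memory CSV as a stand-in for the Excel file so that it runs without an Excel engine; the two sample rows are taken from the output above, and the choice of CSV is an assumption made for the illustration.

```python
import io
import pandas as pd

listings = pd.DataFrame({
    "Address": ["15 Bennett St", "10 Indian Fern Dr"],
    "ZipCode": ["03064", "03062"],
})

# Naive round trip: the index is saved and comes back as an extra column,
# and the zip codes are parsed as integers
buf = io.StringIO()
listings.to_csv(buf)
buf.seek(0)
naive = pd.read_csv(buf)

# Round trip that skips the index and reads ZipCode as text
buf = io.StringIO()
listings.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf, dtype={"ZipCode": str})

print(list(naive.columns), naive["ZipCode"].tolist())
print(list(clean.columns), clean["ZipCode"].tolist())
```

`DataFrame.to_excel` and `pd.read_excel` accept the same `index=False` and `dtype` arguments, so the same approach should carry over to the Excel file used in this notebook.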