# <center>Zillow Webscraping Project</center>

## Project Details:
In this project, I use BeautifulSoup to scrape Zillow housing data from multiple listing pages. The goal is to collect housing data for every state in the United States: about 600 house listings per state, or roughly 30,000 listings combined. A sample of about 600 listings per state should provide enough data for meaningful analysis.

## Data Collection and Wrangling:
I will attempt to collect the following for each listing:
- House Price
- House Address
- House Beds, Baths, and SqFt
- House URL/Link to Zillow

## Project Goals:
- Collect 600 house listings per U.S. state using BeautifulSoup Webscraping
- Perform Exploratory Data Analysis with the housing data
- Create a dynamic dashboard that can be used to search house listings and analyze housing data in each state

*Note: Because of how the webscraping code is written, it will be run once for each U.S. state, 50 times in total. The 50 dataframes will then be appended to create the final dataframe.

## Import Needed Python Packages

In [3851]:
# Packages
import pandas as pd
import numpy as np

import requests
import os
import time
import sys
from IPython import display
from IPython.display import Image

import regex as re
import prettify
import numbers
import htmltext

import lxml
from lxml.html.soupparser import fromstring

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## BeautifulSoup Webscraping/Data Collection

In [3852]:
# Add headers to reduce the chance of getting CAPTCHA responses from Zillow
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

In [3853]:
# Create variables for all pages of Zillow that will be looped
# About 15 pages will be looped to collect 600 listings per state (Zillow has about 40 listings per page)

with requests.Session() as s:
    scrapingstate = 'washington dc/' #*--Change to desired state to scrape house listings--*
    
    Page = 'https://www.zillow.com/homes/for_sale/'+scrapingstate
    Page2 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/2_p/'
    Page3 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/3_p/'
    Page4 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/4_p/'
    Page5 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/5_p/'
    Page6 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/6_p/'
    Page7 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/7_p/'
    Page8 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/8_p/'
    Page9 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/9_p/'
    Page10 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/10_p/'
    Page11 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/11_p/'
    Page12 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/12_p/'
    Page13 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/13_p/'
    Page14 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/14_p/'
    Page15 = 'https://www.zillow.com/homes/for_sale/'+scrapingstate+'/15_p/'

    r = s.get(Page, headers=req_headers)
    r2 = s.get(Page2, headers=req_headers)
    r3 = s.get(Page3, headers=req_headers)
    r4 = s.get(Page4, headers=req_headers)
    r5 = s.get(Page5, headers=req_headers)
    r6 = s.get(Page6, headers=req_headers)
    r7 = s.get(Page7, headers=req_headers)
    r8 = s.get(Page8, headers=req_headers)
    r9 = s.get(Page9, headers=req_headers)
    r10 = s.get(Page10, headers=req_headers)
    r11 = s.get(Page11, headers=req_headers)
    r12 = s.get(Page12, headers=req_headers)
    r13 = s.get(Page13, headers=req_headers)
    r14 = s.get(Page14, headers=req_headers)
    r15 = s.get(Page15, headers=req_headers)
    
    url_links = [Page, Page2, Page3, Page4, Page5, Page6, Page7, Page8, Page9, Page10, Page11, Page12, Page13, Page14, Page15]
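
The fifteen page variables above all follow one pattern, so the list can also be generated. This is a minimal sketch, assuming the same URL layout as the cell above (page 1 has no suffix, page N ends in `N_p/`). `BASE_URL` and `build_page_urls` are hypothetical names and are not part of the notebook.

```python
BASE_URL = 'https://www.zillow.com/homes/for_sale/'


def build_page_urls(region, n_pages=15):
    """Return the listing-page URLs for one region, page 1 first."""
    # Strip slashes so the page suffix is not preceded by a double slash
    region = region.strip('/')
    urls = [f'{BASE_URL}{region}/']
    urls += [f'{BASE_URL}{region}/{page}_p/' for page in range(2, n_pages + 1)]
    return urls


url_links = build_page_urls('washington dc/')
print(len(url_links))
print(url_links[1])
```

Each URL in the list could then be requested inside a single loop over `url_links` with the same session and headers, instead of fifteen separate `s.get` calls.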

In [3854]:
# Now add the contents from urls to soup variables of each url
soup = BeautifulSoup(r.content, 'html.parser')
soup1 = BeautifulSoup(r2.content, 'html.parser')
soup2 = BeautifulSoup(r3.content, 'html.parser')
soup3 = BeautifulSoup(r4.content, 'html.parser')
soup4 = BeautifulSoup(r5.content, 'html.parser')
soup5 = BeautifulSoup(r6.content, 'html.parser')
soup6 = BeautifulSoup(r7.content, 'html.parser')
soup7 = BeautifulSoup(r8.content, 'html.parser')
soup8 = BeautifulSoup(r9.content, 'html.parser')
soup9 = BeautifulSoup(r10.content, 'html.parser')
soup10 = BeautifulSoup(r11.content, 'html.parser')
soup11 = BeautifulSoup(r12.content, 'html.parser')
soup12 = BeautifulSoup(r13.content, 'html.parser')
soup13 = BeautifulSoup(r14.content, 'html.parser')
soup14 = BeautifulSoup(r15.content, 'html.parser')

### Page 1 Loop

In [3855]:
# Create first dataframe for first page of Zillow
df = pd.DataFrame()

# Create loop that pulls each specified variable
for i in soup:
    address = soup.find_all (class_= 'list-card-addr')
    price = list(soup.find_all (class_='list-card-price'))
    details = list(soup.find_all("ul", class_="list-card-details"))    
    
# Create columns for dataframe
df['prices'] = price
df['address'] = address
df['details'] = details

# Create an empty url list
urls = []
# Now Loop through the url and pull the href and strip out the address tag
for link in soup.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3856]:
# Import the urls into a column called links
df['links'] = urls
df['links'] = df['links'].astype('str')
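
The per-page cells repeat the same extraction steps, which suggests a single reusable function. The sketch below is one possible way to write it, assuming the class names used above (`list-card-price`, `list-card-addr`, `list-card-details`, `list-card-link`). `parse_listing_page` is a hypothetical helper, and the sample HTML is invented for illustration rather than copied from Zillow. Unlike the cells above, it reads every field from within the same `article` card and returns plain text instead of tag objects.

```python
from bs4 import BeautifulSoup


def parse_listing_page(html):
    """Return one dict per listing card found in the page HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for card in soup.find_all('article'):
        price = card.find(class_='list-card-price')
        address = card.find(class_='list-card-addr')
        details = card.find('ul', class_='list-card-details')
        link = card.find('a', class_='list-card-link')
        rows.append({
            # Missing elements become None instead of raising an error
            'prices': price.get_text(strip=True) if price else None,
            'address': address.get_text(strip=True) if address else None,
            'details': details.get_text(' ', strip=True) if details else None,
            'links': link.get('href') if link else None,
        })
    return rows


# Invented sample markup that mimics the structure assumed above
sample_html = '''
<article>
  <a class="list-card-link" href="https://www.zillow.com/homedetails/example/">
    <address class="list-card-addr">1 Example St NW, Washington, DC 20001</address>
  </a>
  <div class="list-card-price">$500,000</div>
  <ul class="list-card-details">
    <li>3<abbr> bds</abbr></li><li>2<abbr> ba</abbr></li><li>1,500<abbr> sqft</abbr></li>
  </ul>
</article>
'''

rows = parse_listing_page(sample_html)
print(rows[0])
```

Because each row is built from one card, a listing with a missing price or address cannot shift the remaining columns out of alignment. A list of such dicts can be passed directly to `pd.DataFrame`.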

### Page 2 Loop

In [3857]:
# Create the second dataframe for the second page of Zillow
df1 = pd.DataFrame()

for i in soup1:
    address1 = soup1.find_all (class_= 'list-card-addr')
    price1 = list(soup1.find_all (class_='list-card-price'))
    details1 = list(soup1.find_all("ul", class_="list-card-details"))
            

df1['prices'] = price1
df1['address'] = address1
df1['details'] = details1


urls = []

for link in soup1.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3858]:
# Import the urls into a column called links
df1['links'] = urls
df1['links'] = df1['links'].astype('str')

### Append the first two dataframes to ensure that scraping and looping are working

In [3859]:
# Append (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat([df, df1], ignore_index=True)

In [3860]:
# Show top 5 rows
df.head()

Unnamed: 0,prices,address,details,links
0,"[$1,019,000]","[4810 48th St NW, Washington, DC 20016]","[[3, [ , , bds]], [3, [ , , ba]], [2,500, [ ...","<a class=""list-card-link list-card-link-top-ma..."
1,"[$1,049,999]","[1323 F St NE, Washington, DC 20002]","[[3, [ , , bds]], [3, [ , , ba]], [1,515, [ ...","<a class=""list-card-link list-card-link-top-ma..."
2,"[$395,000]","[1669 Columbia Rd NW #410, Washington, DC 20009]","[[2, [ , , bds]], [2, [ , , ba]], [1,120, [ ...","<a class=""list-card-link list-card-link-top-ma..."
3,"[$1,795,000]","[3750 Northampton St NW, Washington, DC 20015]","[[4, [ , , bds]], [4, [ , , ba]], [3,082, [ ...","<a class=""list-card-link list-card-link-top-ma..."
4,"[$1,499,000]","[609 Maryland Ave NE UNIT 5, Washington, DC 20...","[[2, [ , , bds]], [3, [ , , ba]], [2,634, [ ...","<a class=""list-card-link list-card-link-top-ma..."


In [3861]:
# See number of rows collected so far on 2 pages
# *Note: Each page on Zillow contains between 35 and 40 houses
len(df)

80

### Scraping and looping appear to be working with 80 houses collected from page 1 and page 2. 

In [3862]:
# Now create the other empty dataframes for each page to keep from having to declare before each loop
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
df6 = pd.DataFrame()
df7 = pd.DataFrame()
df8 = pd.DataFrame()
df9 = pd.DataFrame()
df10 = pd.DataFrame()
df11 = pd.DataFrame()
df12 = pd.DataFrame()
df13 = pd.DataFrame()
df14 = pd.DataFrame()

### Page 3 Loop

In [3863]:
for i in soup2:
    address2 = soup2.find_all (class_= 'list-card-addr')
    price2 = list(soup2.find_all (class_='list-card-price'))
    details2 = list(soup2.find_all("ul", class_="list-card-details"))
            

df2['prices'] = price2
df2['address'] = address2
df2['details'] = details2


urls = []

for link in soup2.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3864]:
# Import the urls into a column called links
df2['links'] = urls
df2['links'] = df2['links'].astype('str')

In [3865]:
len(df2)

40

### Page 4 Loop

In [3866]:
for i in soup3:
    address3 = soup3.find_all (class_= 'list-card-addr')
    price3 = list(soup3.find_all (class_='list-card-price'))
    details3 = list(soup3.find_all("ul", class_="list-card-details"))
            

df3['prices'] = price3
df3['address'] = address3
df3['details'] = details3


urls = []

for link in soup3.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3867]:
# Import the urls into a column called links
df3['links'] = urls
df3['links'] = df3['links'].astype('str')

In [3868]:
len(df3)

40

### Page 5 Loop

In [3869]:
for i in soup4:
    address4 = soup4.find_all (class_= 'list-card-addr')
    price4 = list(soup4.find_all (class_='list-card-price'))
    details4 = list(soup4.find_all("ul", class_="list-card-details"))
            

df4['prices'] = price4
df4['address'] = address4
df4['details'] = details4


urls = []

for link in soup4.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3870]:
# Import the urls into a column called links
df4['links'] = urls
df4['links'] = df4['links'].astype('str')

In [3871]:
len(df4)

40

### Page 6 Loop

In [3872]:
for i in soup5:
    address5 = soup5.find_all (class_= 'list-card-addr')
    price5 = list(soup5.find_all (class_='list-card-price'))
    details5 = list(soup5.find_all("ul", class_="list-card-details"))
            

df5['prices'] = price5
df5['address'] = address5
df5['details'] = details5


urls = []

for link in soup5.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3873]:
# Import the urls into a column called links
df5['links'] = urls
df5['links'] = df5['links'].astype('str')

In [3874]:
len(df5)

40

### Page 7 Loop

In [3875]:
for i in soup6:
    address6 = soup6.find_all (class_= 'list-card-addr')
    price6 = list(soup6.find_all (class_='list-card-price'))
    details6 = list(soup6.find_all("ul", class_="list-card-details"))
            

df6['prices'] = price6
df6['address'] = address6
df6['details'] = details6


urls = []

for link in soup6.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3876]:
# Import the urls into a column called links
df6['links'] = urls
df6['links'] = df6['links'].astype('str')

In [3877]:
len(df6)

40

### Page 8 Loop

In [3878]:
for i in soup7:
    address7 = soup7.find_all (class_= 'list-card-addr')
    price7 = list(soup7.find_all (class_='list-card-price'))
    details7 = list(soup7.find_all("ul", class_="list-card-details"))
            

df7['prices'] = price7
df7['address'] = address7
df7['details'] = details7


urls = []

for link in soup7.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3879]:
# Import the urls into a column called links
df7['links'] = urls
df7['links'] = df7['links'].astype('str')

In [3880]:
len(df7)

40

### Page 9 Loop

In [3881]:
for i in soup8:
    address8 = soup8.find_all (class_= 'list-card-addr')
    price8 = list(soup8.find_all (class_='list-card-price'))
    details8 = list(soup8.find_all("ul", class_="list-card-details"))
            

df8['prices'] = price8
df8['address'] = address8
df8['details'] = details8


urls = []

for link in soup8.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3882]:
# Import the urls into a column called links
df8['links'] = urls
df8['links'] = df8['links'].astype('str')

In [3883]:
len(df8)

40

### Page 10 Loop

In [3884]:
for i in soup9:
    address9 = soup9.find_all (class_= 'list-card-addr')
    price9 = list(soup9.find_all (class_='list-card-price'))
    details9 = list(soup9.find_all("ul", class_="list-card-details"))
            

df9['prices'] = price9
df9['address'] = address9
df9['details'] = details9


urls = []

for link in soup9.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3885]:
# Import the urls into a column called links
df9['links'] = urls
df9['links'] = df9['links'].astype('str')

In [3886]:
len(df9)

40

### Page 11 Loop

In [3887]:
for i in soup10:
    address10 = soup10.find_all (class_= 'list-card-addr')
    price10 = list(soup10.find_all (class_='list-card-price'))
    details10 = list(soup10.find_all("ul", class_="list-card-details"))
            

df10['prices'] = price10
df10['address'] = address10
df10['details'] = details10


urls = []

for link in soup10.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3888]:
# Import the urls into a column called links
df10['links'] = urls
df10['links'] = df10['links'].astype('str')

In [3889]:
len(df10)

40

### Page 12 Loop

In [3890]:
for i in soup11:
    address11 = soup11.find_all (class_= 'list-card-addr')
    price11 = list(soup11.find_all (class_='list-card-price'))
    details11 = list(soup11.find_all("ul", class_="list-card-details"))
            

df11['prices'] = price11
df11['address'] = address11
df11['details'] = details11


urls = []

for link in soup11.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3891]:
# Import the urls into a column called links
df11['links'] = urls
df11['links'] = df11['links'].astype('str')

In [3892]:
len(df11)

40

### Page 13 Loop

In [3893]:
for i in soup12:
    address12 = soup12.find_all (class_= 'list-card-addr')
    price12 = list(soup12.find_all (class_='list-card-price'))
    details12 = list(soup12.find_all("ul", class_="list-card-details"))
            

df12['prices'] = price12
df12['address'] = address12
df12['details'] = details12


urls = []

for link in soup12.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3894]:
# Import the urls into a column called links
df12['links'] = urls
df12['links'] = df12['links'].astype('str')

In [3895]:
len(df12)

40

### Page 14 Loop

In [3896]:
for i in soup13:
    address13 = soup13.find_all (class_= 'list-card-addr')
    price13 = list(soup13.find_all (class_='list-card-price'))
    details13 = list(soup13.find_all("ul", class_="list-card-details"))
            

df13['prices'] = price13
df13['address'] = address13
df13['details'] = details13


urls = []

for link in soup13.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3897]:
# Import the urls into a column called links
df13['links'] = urls
df13['links'] = df13['links'].astype('str')

In [3898]:
len(df13)

40

### Page 15 Loop

In [3899]:
for i in soup14:
    address14 = soup14.find_all (class_= 'list-card-addr')
    price14 = list(soup14.find_all (class_='list-card-price'))
    details14 = list(soup14.find_all("ul", class_="list-card-details"))
            

df14['prices'] = price14
df14['address'] = address14
df14['details'] = details14


urls = []

for link in soup14.find_all("article"):
    href = link.find('a',class_="list-card-link")
    addresses = href.find('address')
    addresses.extract()
    urls.append(href)

In [3900]:
# Import the urls into a column called links
df14['links'] = urls
df14['links'] = df14['links'].astype('str')

In [3901]:
len(df14)

40

### Now that all 15 pages have been looped and scraped, we can append all dataframes together into one dataframe
*Note: The first and second dataframes have already been appended above, so we start by appending df2.

In [3902]:
# Now append all remaining dataframes together
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat(
    [df, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14],
    ignore_index=True,
)

In [3903]:
# See length of dataframe
len(df)

600
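
A count of 600 matches 15 pages of about 40 listings each, but the count alone does not show whether every row is a distinct listing (for example, if the same page were parsed more than once). The sketch below shows one way to check, using toy data with made-up link values in place of the scraped dataframes.

```python
import pandas as pd

# Toy frames standing in for two per-page dataframes (made-up values)
page_a = pd.DataFrame({'links': ['url-1', 'url-2']})
page_b = pd.DataFrame({'links': ['url-2', 'url-3']})

combined = pd.concat([page_a, page_b], ignore_index=True)

n_rows = len(combined)
n_unique = combined['links'].nunique()
# duplicated() flags every repeat after the first occurrence of a link
n_duplicates = int(combined.duplicated(subset='links').sum())
print(n_rows, n_unique, n_duplicates)

# Keep only the first occurrence of each link
deduped = combined.drop_duplicates(subset='links', ignore_index=True)
```

If the number of unique links is much lower than the row count, some pages probably contributed the same listings more than once.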

In [3904]:
# Let's look at the first 50 rows
df.head(50)

Unnamed: 0,prices,address,details,links
0,"[$1,019,000]","[4810 48th St NW, Washington, DC 20016]","[[3, [ , , bds]], [3, [ , , ba]], [2,500, [ ...","<a class=""list-card-link list-card-link-top-ma..."
1,"[$1,049,999]","[1323 F St NE, Washington, DC 20002]","[[3, [ , , bds]], [3, [ , , ba]], [1,515, [ ...","<a class=""list-card-link list-card-link-top-ma..."
2,"[$395,000]","[1669 Columbia Rd NW #410, Washington, DC 20009]","[[2, [ , , bds]], [2, [ , , ba]], [1,120, [ ...","<a class=""list-card-link list-card-link-top-ma..."
3,"[$1,795,000]","[3750 Northampton St NW, Washington, DC 20015]","[[4, [ , , bds]], [4, [ , , ba]], [3,082, [ ...","<a class=""list-card-link list-card-link-top-ma..."
4,"[$1,499,000]","[609 Maryland Ave NE UNIT 5, Washington, DC 20...","[[2, [ , , bds]], [3, [ , , ba]], [2,634, [ ...","<a class=""list-card-link list-card-link-top-ma..."
5,"[$1,295,000]","[2349 King Pl NW, Washington, DC 20007]","[[3, [ , , bds]], [3, [ , , ba]], [3,350, [ ...","<a class=""list-card-link list-card-link-top-ma..."
6,"[$995,000]","[6125 32nd St NW, Washington, DC 20015]","[[3, [ , , bds]], [2, [ , , ba]], [1,758, [ ...","<a class=""list-card-link list-card-link-top-ma..."
7,"[$13,750,000]","[3301 Fessenden St NW, Washington, DC 20008]","[[11, [ , , bds]], [17, [ , , ba]], [17,631,...","<a class=""list-card-link list-card-link-top-ma..."
8,"[$875,000]","[1311 D St NE, Washington, DC 20002]","[[4, [ , , bds]], [2, [ , , ba]], [1,748, [ ...","<a class=""list-card-link list-card-link-top-ma..."
9,"[$849,000]","[3806 13th St NW, Washington, DC 20011]","[[3, [ , , bds]], [3, [ , , ba]], [2,535, [ ...","<a class=""list-card-link list-card-link-top-ma..."


# Data Cleaning
### Now that we have the scraped Zillow data, we need to clean this data up a bit for better analysis later on and easier handling in the BI Dashboard

In [3905]:
# Rearrange columns a bit
df = df[['prices', 'address', 'links', 'details']]

In [3906]:
# Convert columns to string
df['prices'] = df['prices'].astype('str')
df['address'] = df['address'].astype('str')
df['details'] = df['details'].astype('str')

In [3907]:
# Show top 5 rows
df.head()

Unnamed: 0,prices,address,links,details
0,"<div aria-label=""$1,019,000"" class=""list-card-...","<address class=""list-card-addr"">4810 48th St N...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
1,"<div aria-label=""$1,049,999"" class=""list-card-...","<address class=""list-card-addr"">1323 F St NE, ...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
2,"<div aria-label=""$395,000"" class=""list-card-pr...","<address class=""list-card-addr"">1669 Columbia ...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."
3,"<div aria-label=""$1,795,000"" class=""list-card-...","<address class=""list-card-addr"">3750 Northampt...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">4<a..."
4,"<div aria-label=""$1,499,000"" class=""list-card-...","<address class=""list-card-addr"">609 Maryland A...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."


### Clean Prices Column
- HTML needs to be removed
- Listing prices are collected twice and need to be shown only once

In [3908]:
# Keep only the last 20 characters to remove unwanted characters and the second price occurrence
df['prices'] = df['prices'].str[-20:]
df.head()

Unnamed: 0,prices,address,links,details
0,"ce"">$1,019,000</div>","<address class=""list-card-addr"">4810 48th St N...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
1,"ce"">$1,049,999</div>","<address class=""list-card-addr"">1323 F St NE, ...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
2,"rice"">$395,000</div>","<address class=""list-card-addr"">1669 Columbia ...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."
3,"ce"">$1,795,000</div>","<address class=""list-card-addr"">3750 Northampt...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">4<a..."
4,"ce"">$1,499,000</div>","<address class=""list-card-addr"">609 Maryland A...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."


In [3909]:
# Extract and keep one listing price as a dollar amount
# (raw string avoids an invalid-escape warning; expand=False returns a Series)
df['prices'] = df['prices'].str.extract(r'(\$[0-9,.]+)', expand=False)
df.head()

Unnamed: 0,prices,address,links,details
0,"$1,019,000","<address class=""list-card-addr"">4810 48th St N...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
1,"$1,049,999","<address class=""list-card-addr"">1323 F St NE, ...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
2,"$395,000","<address class=""list-card-addr"">1669 Columbia ...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."
3,"$1,795,000","<address class=""list-card-addr"">3750 Northampt...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">4<a..."
4,"$1,499,000","<address class=""list-card-addr"">609 Maryland A...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."


### Clean Address Column
- HTML needs to be removed

In [3910]:
# Remove all html tags from address
df['address'] = df['address'].replace('<address class="list-card-addr">', ' ', regex=True)
df['address'] = df['address'].replace('</address>', ' ', regex=True)
df.head()

Unnamed: 0,prices,address,links,details
0,"$1,019,000","4810 48th St NW, Washington, DC 20016","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
1,"$1,049,999","1323 F St NE, Washington, DC 20002","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">3<a..."
2,"$395,000","1669 Columbia Rd NW #410, Washington, DC 20009","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."
3,"$1,795,000","3750 Northampton St NW, Washington, DC 20015","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">4<a..."
4,"$1,499,000","609 Maryland Ave NE UNIT 5, Washington, DC 20...","<a class=""list-card-link list-card-link-top-ma...","<ul class=""list-card-details""><li class="""">2<a..."
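
The two `replace` calls above depend on the exact opening tag text. A more general alternative is to remove anything that looks like a tag. This is a minimal sketch using only the standard library. `strip_tags` and `TAG_PATTERN` are hypothetical names, and the simple pattern assumes well-formed tags with no `>` inside attribute values.

```python
import re
from html import unescape

# Matches anything between '<' and the next '>'
TAG_PATTERN = re.compile(r'<[^>]+>')


def strip_tags(text):
    """Remove HTML tags, decode entities such as &amp;, and trim whitespace."""
    return unescape(TAG_PATTERN.sub('', text)).strip()


raw = '<address class="list-card-addr">4810 48th St NW, Washington, DC 20016</address>'
print(strip_tags(raw))
```

Applied to a column, this could be written as `df['address'].apply(strip_tags)`.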


### Clean Links Column
- HTML needs to be removed

In [3911]:
# Remove all html tags from links
df['links'] = df['links'].replace('<a class="list-card-link list-card-link-top-margin" href="', ' ', regex=True)
df['links'] = df['links'].replace('" tabindex="0"></a>', ' ', regex=True)
df.head()

Unnamed: 0,prices,address,links,details
0,"$1,019,000","4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-...,"<ul class=""list-card-details""><li class="""">3<a..."
1,"$1,049,999","1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-...,"<ul class=""list-card-details""><li class="""">3<a..."
2,"$395,000","1669 Columbia Rd NW #410, Washington, DC 20009",https://www.zillow.com/homedetails/1669-Colum...,"<ul class=""list-card-details""><li class="""">2<a..."
3,"$1,795,000","3750 Northampton St NW, Washington, DC 20015",https://www.zillow.com/homedetails/3750-North...,"<ul class=""list-card-details""><li class="""">4<a..."
4,"$1,499,000","609 Maryland Ave NE UNIT 5, Washington, DC 20...",https://www.zillow.com/homedetails/609-Maryla...,"<ul class=""list-card-details""><li class="""">2<a..."
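
The link clean-up above works only when the anchor tag has exactly the class list and `tabindex` attribute shown. As a hedged alternative, the `href` value can be pulled out directly. In this sketch `extract_href` and `HREF_PATTERN` are hypothetical names, and the example URL is a placeholder rather than a real listing.

```python
import re

# Captures the value of the first double-quoted href attribute
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def extract_href(tag_text):
    """Return the href value from an anchor tag string, or None if absent."""
    match = HREF_PATTERN.search(tag_text)
    return match.group(1) if match else None


tag = ('<a class="list-card-link list-card-link-top-margin" '
       'href="https://www.zillow.com/homedetails/example/" tabindex="0"></a>')
print(extract_href(tag))
```

This does not depend on the order or number of other attributes in the tag.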


### Clean Details Column
- HTML needs to be removed
- Beds, Baths, and SQFT need to be separated

In [3912]:
# Remove all characters from each value of details except numbers and commas
def find_number(text):
    num = re.findall(r'[0-9,]+',text)
    return " ".join(num)
df['details']=df['details'].apply(lambda x: find_number(x))

In [3913]:
# Trim any leading and trailing white spaces from the details column to avoid errors when splitting
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df.head()

Unnamed: 0,prices,address,links,details
0,"$1,019,000","4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,"3 3 2,500"
1,"$1,049,999","1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,"3 3 1,515"
2,"$395,000","1669 Columbia Rd NW #410, Washington, DC 20009",https://www.zillow.com/homedetails/1669-Columb...,"2 2 1,120"
3,"$1,795,000","3750 Northampton St NW, Washington, DC 20015",https://www.zillow.com/homedetails/3750-Northa...,"4 4 3,082"
4,"$1,499,000","609 Maryland Ave NE UNIT 5, Washington, DC 20002",https://www.zillow.com/homedetails/609-Marylan...,"2 3 2,634"


In [3914]:
# Split details column into 3 new columns beds, baths, sqft
df['beds'] = df.details.str.split(' ', expand = True)[0]
df['baths'] = df.details.str.split(' ', expand = True)[1]
df['sqft'] = df.details.str.split(' ', expand = True)[2]

In [3915]:
# Show top 5 rows
df.head()

Unnamed: 0,prices,address,links,details,beds,baths,sqft
0,"$1,019,000","4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,"3 3 2,500",3,3,2500
1,"$1,049,999","1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,"3 3 1,515",3,3,1515
2,"$395,000","1669 Columbia Rd NW #410, Washington, DC 20009",https://www.zillow.com/homedetails/1669-Columb...,"2 2 1,120",2,2,1120
3,"$1,795,000","3750 Northampton St NW, Washington, DC 20015",https://www.zillow.com/homedetails/3750-Northa...,"4 4 3,082",4,4,3082
4,"$1,499,000","609 Maryland Ave NE UNIT 5, Washington, DC 20002",https://www.zillow.com/homedetails/609-Marylan...,"2 3 2,634",2,3,2634


In [3916]:
# Remove commas from sqft and convert to float
df["sqft"] = df["sqft"].str.replace(",","").astype(float)
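
The details steps (keep the numbers, split on spaces, convert) can also be written as one function. This is a sketch under the same assumption the cells above make: the first three numbers in the text are beds, baths, and square feet, in that order. `parse_details` and `NUMBER_PATTERN` are hypothetical names.

```python
import re

# A digit, then digits or commas, then an optional decimal part
NUMBER_PATTERN = re.compile(r'\d[\d,]*(?:\.\d+)?')


def parse_details(text):
    """Return beds, baths, and sqft as floats; missing values become None."""
    numbers = [float(n.replace(',', '')) for n in NUMBER_PATTERN.findall(text)]
    # Pad to three values, then keep only the first three
    numbers = (numbers + [None] * 3)[:3]
    return dict(zip(('beds', 'baths', 'sqft'), numbers))


print(parse_details('3 3 2,500'))
```

Values are assigned by position, so a listing that omits one of the three numbers (a lot with no bedrooms, for instance) would be shifted. Rows with fewer than three numbers are worth inspecting by hand.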

In [3917]:
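# Remove $ and commas from prices column
# (regex=False so that '$' is treated as a literal character, not a regex anchor)
df['prices'] = df['prices'].str.replace('$', '', regex=False)
df['prices'] = df['prices'].str.replace(',', '', regex=False)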

In [3918]:
# Convert prices column to float
df['prices'] = df['prices'].astype('float')
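
The price steps (trim the string, extract the dollar amount, drop `$` and commas, cast to float) can be condensed into one helper that works on the raw HTML string. This is a minimal sketch. `parse_price` is a hypothetical name, and taking the first dollar amount found is an assumption that holds when both occurrences in the tag show the same price.

```python
import re


def parse_price(text):
    """Return the first dollar amount in the text as a float, or None."""
    match = re.search(r'\$([0-9][0-9,]*)', text)
    if match is None:
        return None
    return float(match.group(1).replace(',', ''))


raw = '<div aria-label="$1,019,000" class="list-card-price">$1,019,000</div>'
print(parse_price(raw))
```

Because only the first match is used, the duplicated price inside the tag does not need to be sliced away first.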

In [3919]:
df.head()

Unnamed: 0,prices,address,links,details,beds,baths,sqft
0,1019000.0,"4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,"3 3 2,500",3,3,2500.0
1,1049999.0,"1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,"3 3 1,515",3,3,1515.0
2,395000.0,"1669 Columbia Rd NW #410, Washington, DC 20009",https://www.zillow.com/homedetails/1669-Columb...,"2 2 1,120",2,2,1120.0
3,1795000.0,"3750 Northampton St NW, Washington, DC 20015",https://www.zillow.com/homedetails/3750-Northa...,"4 4 3,082",4,4,3082.0
4,1499000.0,"609 Maryland Ave NE UNIT 5, Washington, DC 20002",https://www.zillow.com/homedetails/609-Marylan...,"2 3 2,634",2,3,2634.0


## Extract City and State from Address Column

In [3920]:
# Split address column into street, city, and statezip
df[['street','city','statezip']] = df.address.str.split(',',expand=True)

In [3921]:
# Show top 2 rows
df.head(2)

Unnamed: 0,prices,address,links,details,beds,baths,sqft,street,city,statezip
0,1019000.0,"4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,"3 3 2,500",3,3,2500.0,4810 48th St NW,Washington,DC 20016
1,1049999.0,"1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,"3 3 1,515",3,3,1515.0,1323 F St NE,Washington,DC 20002


In [3922]:
# Trim any leading and trailing spaces from statezip column to avoid errors while splitting zip column
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

In [3923]:
# Split statezip column into new columns state and zip
df['state'] = df.statezip.str.split(' ', expand = True)[0]
df['zip'] = df.statezip.str.split(' ', expand = True)[1]

In [3924]:
# Show top 2 rows
df.head(2)

Unnamed: 0,prices,address,links,details,beds,baths,sqft,street,city,statezip,state,zip
0,1019000.0,"4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,"3 3 2,500",3,3,2500.0,4810 48th St NW,Washington,DC 20016,DC,20016
1,1049999.0,"1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,"3 3 1,515",3,3,1515.0,1323 F St NE,Washington,DC 20002,DC,20002


In [3925]:
# Now drop unwanted columns details and statezip
# (the positional axis argument to drop was removed in pandas 2.0)
df = df.drop(columns=['details', 'statezip'])
df.head()

Unnamed: 0,prices,address,links,beds,baths,sqft,street,city,state,zip
0,1019000.0,"4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,3,3,2500.0,4810 48th St NW,Washington,DC,20016
1,1049999.0,"1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,3,3,1515.0,1323 F St NE,Washington,DC,20002
2,395000.0,"1669 Columbia Rd NW #410, Washington, DC 20009",https://www.zillow.com/homedetails/1669-Columb...,2,2,1120.0,1669 Columbia Rd NW #410,Washington,DC,20009
3,1795000.0,"3750 Northampton St NW, Washington, DC 20015",https://www.zillow.com/homedetails/3750-Northa...,4,4,3082.0,3750 Northampton St NW,Washington,DC,20015
4,1499000.0,"609 Maryland Ave NE UNIT 5, Washington, DC 20002",https://www.zillow.com/homedetails/609-Marylan...,2,3,2634.0,609 Maryland Ave NE UNIT 5,Washington,DC,20002
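
Splitting on every comma assumes that each address has exactly two commas. An address with a comma inside the street part would produce a fourth piece. The sketch below splits from the right instead, so any extra commas stay in the street. `split_address` is a hypothetical helper, and the second example address is invented.

```python
def split_address(address):
    """Split 'street, city, ST zip' into its parts, splitting from the right."""
    parts = [part.strip() for part in address.rsplit(',', 2)]
    if len(parts) != 3:
        # Not in the expected format; leave all fields empty
        return {'street': None, 'city': None, 'state': None, 'zip': None}
    street, city, statezip = parts
    state, _, zip_code = statezip.partition(' ')
    return {'street': street, 'city': city,
            'state': state, 'zip': zip_code or None}


print(split_address('4810 48th St NW, Washington, DC 20016'))
print(split_address('100 Example Rd, Unit 2, Washington, DC 20001'))
```

The zip code is kept as a string so that leading zeros are preserved.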


### Run to look at full dataframe

In [3926]:
df

Unnamed: 0,prices,address,links,beds,baths,sqft,street,city,state,zip
0,1019000.0,"4810 48th St NW, Washington, DC 20016",https://www.zillow.com/homedetails/4810-48th-S...,3.0,3.0,2500.0,4810 48th St NW,Washington,DC,20016
1,1049999.0,"1323 F St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1323-F-St-N...,3.0,3.0,1515.0,1323 F St NE,Washington,DC,20002
2,395000.0,"1669 Columbia Rd NW #410, Washington, DC 20009",https://www.zillow.com/homedetails/1669-Columb...,2.0,2.0,1120.0,1669 Columbia Rd NW #410,Washington,DC,20009
3,1795000.0,"3750 Northampton St NW, Washington, DC 20015",https://www.zillow.com/homedetails/3750-Northa...,4.0,4.0,3082.0,3750 Northampton St NW,Washington,DC,20015
4,1499000.0,"609 Maryland Ave NE UNIT 5, Washington, DC 20002",https://www.zillow.com/homedetails/609-Marylan...,2.0,3.0,2634.0,609 Maryland Ave NE UNIT 5,Washington,DC,20002
5,1295000.0,"2349 King Pl NW, Washington, DC 20007",https://www.zillow.com/homedetails/2349-King-P...,3.0,3.0,3350.0,2349 King Pl NW,Washington,DC,20007
6,995000.0,"6125 32nd St NW, Washington, DC 20015",https://www.zillow.com/homedetails/6125-32nd-S...,3.0,2.0,1758.0,6125 32nd St NW,Washington,DC,20015
7,13750000.0,"3301 Fessenden St NW, Washington, DC 20008",https://www.zillow.com/homedetails/3301-Fessen...,11.0,17.0,17631.0,3301 Fessenden St NW,Washington,DC,20008
8,875000.0,"1311 D St NE, Washington, DC 20002",https://www.zillow.com/homedetails/1311-D-St-N...,4.0,2.0,1748.0,1311 D St NE,Washington,DC,20002
9,849000.0,"3806 13th St NW, Washington, DC 20011",https://www.zillow.com/homedetails/3806-13th-S...,3.0,3.0,2535.0,3806 13th St NW,Washington,DC,20011


In [3927]:
# Export data to excel file
df.to_excel('StateData/washintondc.xlsx', index = False)

### State list
- Alabama ////////////
- Alaska /////////////
- Arizona ////////////
- Arkansas ///////////
- California //////////
- Colorado //////////
- Connecticut //////////
- Delaware //////////
- Florida //////////
- Georgia //////////
- Hawaii //////////
- Idaho //////////
- Illinois //////////
- Indiana //////////
- Iowa //////////
- Kansas ///////////
- Kentucky ///////////
- Louisiana ///////////
- Maine ///////////
- Maryland ///////////
- Massachusetts ///////////
- Michigan ///////////
- Minnesota ///////////
- Mississippi ///////////
- Missouri /////////
- Montana /////////
- Nebraska /////////
- Nevada /////////
- New Hampshire /////////
- New Jersey /////////
- New Mexico /////////
- New York /////////
- North Carolina /////////
- North Dakota /////////
- Ohio /////////
- Oklahoma /////////
- Oregon /////////
- Pennsylvania //////////
- Rhode Island //////////
- South Carolina /////////
- South Dakota /////////
- Tennessee /////////
- Texas ////////////
- Utah ////////////
- Vermont ////////////
- Virginia ////////////
- Washington ////////////
- West Virginia ////////////
- Wisconsin /////////
- Wyoming /////////
- Washington D.C. /////////