# Scrape sale price documents for Brooklyn homes

## Build a list of documents we would like to download

Visit https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page and peek under "Detailed Annual Sales Reports by Borough." We want to build a list of all of the Excel files linked for **one borough**. It's your choice - Manhattan, Brooklyn, Staten Island, etc.

* _**Tip:** You can basically cut and paste from the end of class on this one_
* _**Tip:** 2017 and earlier files are `.xls`, not `.xlsx`_
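
Because the extension changes partway through the list, a filter has to accept both. Here is a minimal sketch of one way to do that; the `sample_links` list and the `is_excel` helper are made up for illustration and are not taken from the page.

```python
# Hypothetical sample links, for illustration only
sample_links = [
    "https://example.com/2018_brooklyn.xlsx",
    "https://example.com/2017_brooklyn.xls",
    "https://example.com/2017_brooklyn.pdf",
]

def is_excel(url):
    """Return True for URLs ending in .xls or .xlsx (case-insensitive)."""
    # endswith() accepts a tuple, so both extensions are checked at once
    return url.lower().endswith((".xls", ".xlsx"))

excel_links = [url for url in sample_links if is_excel(url)]
print(excel_links)
```

Checking the end of the string is a bit stricter than `'xls' in link`, which would also match "xls" anywhere else in the URL.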

In [1]:
!ls

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page")
# Name the parser explicitly so BeautifulSoup doesn't have to guess one
soup = BeautifulSoup(response.text, "html.parser")

01 - Data Acquisition.ipynb 2016_brooklyn.xls
02 - Data Compilation.ipynb 2017_brooklyn.xls
03 - Data Analysis.ipynb    2018_brooklyn.xlsx
04 - Data Exploration.ipynb 2019_brooklyn.xlsx
2009_brooklyn.xls           2020_brooklyn.xlsx
2010_brooklyn.xls           2021_brooklyn.xlsx
2011_brooklyn.xls           cleaned.csv
2012_brooklyn.xls           merged.csv
2013_brooklyn.xls           sales_2007_brooklyn.xls
2014_brooklyn.xls           sales_2008_brooklyn.xls
2015_brooklyn.xls


In [29]:
# Copied and pasted the first loop from class

links = soup.find_all('a')
all_links = []
rough_brooklyn_links = []
brooklyn_links = []

# Keep only links to the annualized sales files, and make them absolute
for link in links:
    # .get() returns '' for <a> tags with no href, so no try/except is needed
    href = link.get('href', '')
    if '/assets/finance/downloads/pdf/rolling_sales/annualized-sales/' in href:
        full_link = 'https://www.nyc.gov' + href
        all_links.append(full_link)

# Narrow down to one borough
for link in all_links:
    if 'brooklyn' in link:
        rough_brooklyn_links.append(link)

# Keep only the Excel files ('xls' matches both .xls and .xlsx)
for link in rough_brooklyn_links:
    if 'xls' in link:
        brooklyn_links.append(link)



## Use Python to make a list of the URLs to be downloaded, and save them to a file.

The format is a _little_ different from what we did in class: a `/` at the beginning of a URL means "start from the top of the domain" instead of "start relative to the page you're on now." Just examine your URLs and you'll notice it.
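
To see the difference between the two kinds of relative link, here is a small sketch using `urljoin` from the standard library. The two `href` values are hypothetical and only there to show how each kind resolves against the page URL.

```python
from urllib.parse import urljoin

# The page the links were scraped from
page_url = "https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page"

# Hypothetical href values, for illustration only
root_relative = "/assets/example/file.xlsx"  # leading "/": start from the top of the domain
page_relative = "file.xlsx"                  # no leading "/": start from the current page's folder

full_root = urljoin(page_url, root_relative)
full_page = urljoin(page_url, page_relative)

print(full_root)  # https://www.nyc.gov/assets/example/file.xlsx
print(full_page)  # https://www.nyc.gov/site/finance/taxes/file.xlsx
```

For root-relative links like the first one, gluing `'https://www.nyc.gov'` onto the front gives the same result as `urljoin`.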

_**Tip:** If you want to google around for other ways to do this, the `'\n'.join(urls)` method might be an interesting one to look at._
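
As a rough sketch of the `join` approach, assuming a short made-up `urls` list and a throwaway output file (neither is part of the assignment):

```python
import os
import tempfile

# Hypothetical URLs, for illustration only
urls = ["https://example.com/a.xlsx", "https://example.com/b.xls"]

# join() puts a newline *between* items, so add one more at the end
text = "\n".join(urls) + "\n"

# Write to a temporary folder so this example doesn't touch your real urls.txt
out_path = os.path.join(tempfile.mkdtemp(), "example_urls.txt")
with open(out_path, "w") as fp:
    fp.write(text)

print(text)
```

This writes the whole file in one call instead of looping and writing one line at a time; the result is the same.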

In [30]:
brooklyn_links

['https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_brooklyn.xlsx',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_brooklyn.xlsx',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_brooklyn.xlsx',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2017/2017_brooklyn.xls',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2016/2016_brooklyn.xls',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2015/2015_brooklyn.xls',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2014/2014_brooklyn.xls',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2013/2013_brooklyn.xls',
 'https://www.nyc.gov/assets/fina

In [31]:


with open("urls.txt", 'w') as fp:
    for url in brooklyn_links:
        fp.write(url + "\n")

## Download the Excel files with `wget` or `curl`

You can see what I did in class, but `wget` has an option to provide it with a filename to download a list of files from.
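
If `wget` isn't available, the same idea can be sketched in Python: read one URL per line and save each file under its own name. This is not what we did in class and it is untested against the live site; `download_all` and its `fetch` parameter are hypothetical names chosen for this example.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

def download_all(url_file, dest_dir=".", fetch=urlretrieve):
    """Download every URL listed in url_file (one per line) into dest_dir.

    fetch is any function taking (url, path); it defaults to urlretrieve.
    Returns the list of paths that were newly downloaded.
    """
    saved = []
    with open(url_file) as fp:
        # Ignore blank lines and strip the trailing newline from each URL
        urls = [line.strip() for line in fp if line.strip()]
    for url in urls:
        # Use the last part of the URL path as the filename
        filename = os.path.basename(urlparse(url).path)
        path = os.path.join(dest_dir, filename)
        if os.path.exists(path):
            continue  # already on disk, so skip it
        fetch(url, path)
        saved.append(path)
    return saved

# Usage (commented out so nothing is downloaded by accident):
# download_all("urls.txt")
```

Skipping files that already exist means re-running the cell won't pile up extra copies, which is what `wget` does by default when it saves a repeat download with a `.1` suffix.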

In [32]:


!wget -i urls.txt

--2022-11-17 15:29:00--  https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx
Resolving www.nyc.gov (www.nyc.gov)... 173.223.185.104
Connecting to www.nyc.gov (www.nyc.gov)|173.223.185.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3212511 (3.1M) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘2021_brooklyn.xlsx.1’


2022-11-17 15:29:01 (4.67 MB/s) - ‘2021_brooklyn.xlsx.1’ saved [3212511/3212511]

--2022-11-17 15:29:01--  https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_brooklyn.xlsx
Reusing existing connection to www.nyc.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 2277851 (2.2M) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘2020_brooklyn.xlsx.1’


2022-11-17 15:29:04 (1.07 MB/s) - ‘2020_brooklyn.xlsx.1’ saved [2277851/2277851]

--2022-11-17 15:29:04--  https://www.nyc.gov/assets/