# Intermediate Scraping Homework: Wikipedia Table

In this assignment, we'll be extracting data from Wikipedia's table of the tallest buildings in Brooklyn: https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_Brooklyn

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Although this homework uses `BeautifulSoup`, you can choose to use `lxml` instead, if you prefer.

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd 

### 1) Grab the HTML for the webpage linked above

Use `requests` to get the HTML, assigning it to a variable.

In [5]:
url = "https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_Brooklyn"
html = requests.get(url).text

### 2) Convert the HTML into a `BeautifulSoup` object

In [6]:
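# Name the parser explicitly so the result doesn't depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")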

In [7]:
type(soup)

bs4.BeautifulSoup

### 3) Use `.select(...)` (and `[0]`) to select the main table

That's the one directly under the "Tallest buildings" heading.

Print out the first 100 characters of text from the table to make sure you have the right one.

In [8]:
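table = soup.select(".wikitable")[0]

# Slicing a string keeps its first N characters. A notebook cell displays only
# its last expression, so the 500-character slice is the one shown below.
table.text[:100]
table.text[:500]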

"\n\nRank\n\nName[a]\n\nImage\n\nHeightft (m)\n\nFloors\n\nYear completed\n\nNotes\n\n\n1\n\nThe Brooklyn Tower\n\n\n\n1,066 (325)\n\n93\n\n2022\n\nTopped out in October 2021.[2][23][24][25]\n\n\n2\n\nBrooklyn Point\n\n\n\n720 (219)\n\n68\n\n2019\n\nThe final phase of Extell's City Point development; topped out in April 2019, it is now the second tallest building in Brooklyn.[26] Also known as 138 Willoughby Street,[27][28] 1 City Point,[29] and City Point Tower III.[29][30][31]\n\n\n3\n\n11 Hoyt\n\n\n\n626 (191)\n\n51\n\n2020\n\nTopped out in June 2019."

In [9]:
len(table.text)

8661
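
As a minimal sketch of this step, the snippet below runs the same kind of selection on a tiny hand-written HTML string. The `sample_html` content and the `sample_*` names are invented for illustration; only `.select(...)`, `[0]`, and string slicing carry over to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables,
# only one of which has the "wikitable" class.
sample_html = """
<table class="infobox"><tr><td>Not the one we want</td></tr></table>
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>The Brooklyn Tower</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .select() always returns a list of matches, so [0] picks the first one.
sample_table = sample_soup.select("table.wikitable")[0]

# Slicing with [:100] keeps at most the first 100 characters of the text.
preview = sample_table.text[:100]
print(repr(preview))
```

If the preview doesn't start with the column names you expect, the selector is probably matching a different table.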

### 4) Use `.select(...)` (and `[0]`) again to select the table's first row

... which is its header. (Reminder that the `<thead>` tag is optional. Wikipedia tables don't use it.)

In [1]:
#table

In [10]:
# Note: despite its name, `rows` holds a single element, the header row (the first <tr>).
rows = table.select("tr")[0]

In [2]:
#table

In [11]:
rows.select("th")

[<th>Rank
 </th>,
 <th>Name<style data-mw-deduplicate="TemplateStyles:r1041539562">.mw-parser-output .citation{word-wrap:break-word}.mw-parser-output .citation:target{background-color:rgba(0,127,255,0.133)}</style><sup class="citation nobold" id="ref_id1none"><a href="#endnote_id1none">[a]</a></sup>
 </th>,
 <th class="unsortable">Image
 </th>,
 <th data-sort-type="number">Height<br/><span style="font-size:85%;">ft (m)</span>
 </th>,
 <th>Floors
 </th>,
 <th>Year completed
 </th>,
 <th class="unsortable">Notes
 </th>]

In [55]:
len(rows)

14

In [17]:
len(table.select("tr")[:1])

1

### 5) Extract the column names from that header

Use `.strip()` to remove any leading or trailing whitespace from the names.

First, try doing this with a standard `for` loop:

In [13]:
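# Loop over the <th> cells only. Iterating over the row itself would also
# yield the whitespace between the tags and print blank lines.
for cell in rows.select("th"):
    print(cell.text.strip())

Rank
Name[a]
Image
Heightft (m)
Floors
Year completed
Notes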


Try to do the same, but with a list comprehension:

In [14]:
# List comprehension reference: https://www.w3schools.com/python/python_lists_comprehension.asp
# Removing "\n" from strings in a list: https://stackoverflow.com/questions/21325212/how-to-remove-n-from-end-of-strings-inside-a-list

# Select the <th> cells so the whitespace between the tags is left out,
# then .strip() each name to drop its trailing newline.
column_names = [cell.text.strip() for cell in rows.select("th")]
column_names

['Rank',
 'Name[a]',
 'Image',
 'Heightft (m)',
 'Floors',
 'Year completed',
 'Notes']
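
The difference between iterating over a row and selecting its cells is easy to miss, so here is a small sketch of it. It assumes a made-up two-column header (`sample_html` is not the real page) laid out the way Wikipedia writes its rows, with a newline between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical header row with newlines between the tags.
sample_html = "<table><tr>\n<th>Rank\n</th>\n<th>Name\n</th></tr></table>"

header_row = BeautifulSoup(sample_html, "html.parser").select("tr")[0]

# Iterating over (or taking len() of) a tag walks ALL of its children,
# including the whitespace-only strings that sit between the <th> tags.
children = list(header_row.children)

# Selecting "th" returns only the header-cell elements.
names = [th.get_text().strip() for th in header_row.select("th")]

print(len(children), names)
```

This is why `len(...)` on a row can report more items than there are columns.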

### 6) Select all non-header row *elements* from the table

Since the header was the first row, you'll want to skip that one. How many rows are there? (Check with your eyes that this number matches what you deduce from the rankings column in the browser-rendered table.)

In [21]:
# len(rows)

number_of_rows = len(table.select("tr")[1:])

print(f'There are {number_of_rows} rows.')

There are 82 rows.


In [62]:
# Hmm, 82 doesn't match what the rankings column shows in the browser.

### 7) For each row, extract the text of each cell into a Python list

First, try this as two nested `for` loops:

In [26]:
row = table.select("tr")
#row

In [37]:
# Outer loop: one pass per data row (the [1:] skips the header row).
# Inner loop: one pass per <td> cell inside that row.
rows_text = []  # nested result: one list of cell strings per row
cell_text = []  # flat list of every cell's text, in document order
for tr in row[1:]:
    row_cells = []
    for td in tr.select("td"):
        row_cells.append(td.text.rstrip())
    rows_text.append(row_cells)
    cell_text.extend(row_cells)

# Reference on nested loops: https://www.geeksforgeeks.org/python-nested-loops/


In [40]:
#cell_text

Try the same, but with two list comprehensions (one nested in the other):
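
A minimal sketch of the nested comprehension, run on a small hand-written table rather than the live page. The `sample_html` string and the `sample_table` and `rows_text` names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical three-column table in the same shape as the real one.
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Floors</th></tr>
  <tr><td>1</td><td>The Brooklyn Tower</td><td>93</td></tr>
  <tr><td>2</td><td>Brooklyn Point</td><td>68</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").select(".wikitable")[0]

# Outer comprehension: one item per data row (the [1:] skips the header).
# Inner comprehension: one stripped string per <td> cell in that row.
rows_text = [
    [td.get_text().strip() for td in tr.select("td")]
    for tr in sample_table.select("tr")[1:]
]
print(rows_text)
```

Reading it from the bottom up mirrors the two `for` loops: the last line is the outer loop, the line above it is the inner one.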

### 8) Turn the data you've extracted into a `pandas` `DataFrame`

In [42]:
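# One dict per row, built directly from the table so this cell stands on its own.
# zip() pairs each key with the cell in the same position, so a row with fewer
# cells than keys (for example, after a rowspan) would be shifted left: check the result.
keys = ["rank", "name", "image", "height", "floors", "year_complete", "notes"]
df = pd.DataFrame([
    dict(zip(keys, [td.text.strip() for td in tr.select("td")]))
    for tr in table.select("tr")[1:]
])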

In [44]:
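pd.DataFrame(cell_text)

# Reference: https://jonathansoma.com/everything/scraping/convert-web-pages-to-csv/#__tabbed_2_3

# Note: cell_text is a flat list, so every cell lands in a single column.
# Passing one list per row instead would give one DataFrame row per building.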

Unnamed: 0,0
0,1
1,The Brooklyn Tower
2,
3,"1,066 (325)"
4,93
...,...
569,Upload image
570,295 (90)
571,30
572,2023


### 9) Which years are represented by at least 5 buildings?
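
A possible approach, sketched on invented toy data (the years in `toy` are made up and are not the scraped values): count how often each year appears, then keep the counts that reach 5.

```python
import pandas as pd

# Invented toy data; the real column would come from the scraped table.
toy = pd.DataFrame(
    {"Year completed": ["2019", "2019", "2020", "2019", "2020", "2019", "2019", "2022"]}
)

# value_counts() returns a Series mapping each year to its number of rows.
counts = toy["Year completed"].value_counts()

# Boolean indexing keeps only the years whose count is 5 or more.
frequent_years = counts[counts >= 5]
print(frequent_years)
```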

### 10) How many total floors do all the buildings have, combined?
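
Scraped cells are strings, so they need converting before they can be added up. The sketch below assumes a `Floors` column like the one in the table and uses a toy frame with one deliberately empty cell to show what happens to non-numeric values.

```python
import pandas as pd

# Toy data: three floor counts as strings, plus one empty cell.
toy = pd.DataFrame({"Floors": ["93", "68", "51", ""]})

# errors="coerce" turns anything that isn't a number into NaN instead of raising.
floors = pd.to_numeric(toy["Floors"], errors="coerce")

# sum() skips NaN by default, so the empty cell doesn't affect the total.
total_floors = int(floors.sum())
print(total_floors)
```

It is worth checking how many values were coerced to NaN (for example with `floors.isna().sum()`) before trusting the total.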

### 11) How many of the buildings have their own Wikipedia page?

For this, you'll need to query the row elements again; the information won't have been extracted into your `DataFrame`. 

(Hint: Whether a building has its own Wikipedia page isn't an explicit piece of data, but something you can infer from the presence of a particular sub-element.)
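
A sketch of that inference, under two assumptions that should be checked against the real markup: that the name sits in the second cell of each row, and that a link (`<a>`) inside that cell means the building has its own page. The building names and links in `sample_html` are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: two names are links, one is plain text.
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td><a href="/wiki/Building_A">Building A</a></td></tr>
  <tr><td>2</td><td>Building B</td></tr>
  <tr><td>3</td><td><a href="/wiki/Building_C">Building C</a></td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").select(".wikitable")[0]
data_rows = sample_table.select("tr")[1:]

# .select("a") returns an empty list (falsy) when the name cell has no link.
linked_rows = [tr for tr in data_rows if tr.select("td")[1].select("a")]
n_with_page = len(linked_rows)
print(n_with_page)
```

Links in other cells, such as footnotes in the notes column, are ignored here because only the name cell is inspected.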

### 12) How many have an image?

You could do this by testing for the presence of another element:

Or through information that's already in your `DataFrame`:
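
Both approaches can be sketched on a small invented table. The second one relies on an assumption suggested by the flat output earlier in this notebook: an image cell with a picture yields empty text, while one without a picture holds placeholder text such as "Upload image".

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical rows: two with an <img>, one with placeholder text.
sample_html = """
<table class="wikitable">
  <tr><th>Name</th><th>Image</th></tr>
  <tr><td>Building A</td><td><img src="a.jpg"/></td></tr>
  <tr><td>Building B</td><td><a href="#">Upload image</a></td></tr>
  <tr><td>Building C</td><td><img src="c.jpg"/></td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").select(".wikitable")[0]
data_rows = sample_table.select("tr")[1:]

# Approach 1: a row has an image if it contains an <img> element.
n_by_element = sum(1 for tr in data_rows if tr.select("img"))

# Approach 2: work from the extracted text. An <img> contributes no text,
# so cells with a picture come out as empty strings.
df_sample = pd.DataFrame(
    [[td.get_text().strip() for td in tr.select("td")] for tr in data_rows],
    columns=["Name", "Image"],
)
n_by_text = int((df_sample["Image"] == "").sum())

print(n_by_element, n_by_text)
```

If the two counts disagree on the real table, the text-based assumption is the first thing to revisit.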

### Bonus challenge

If we tried to run the same code on https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City instead, the results wouldn't be quite right. Try it. Then, examining the HTML of that page, try to figure out why.

If you want an extra-extra challenge, try writing code that would parse that table correctly.
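
One possible cause (an assumption here, not something confirmed against the live page) is a cell with a `rowspan` attribute: the rows beneath it contain fewer `<td>` elements, so their values shift left. The sketch below copies such a cell down into the rows it covers. `table_to_grid` is a hypothetical helper written for this example; it ignores `colspan` and assumes well-formed rows.

```python
from bs4 import BeautifulSoup

def table_to_grid(table):
    """Return a list of rows, repeating any cell that spans several rows."""
    grid = []
    carried = {}  # column index -> (text, number of further rows to fill)
    for tr in table.select("tr"):
        cells = iter(tr.find_all(["th", "td"]))
        pending = next(cells, None)
        row, col = [], 0
        while pending is not None or col in carried:
            if col in carried:
                # This column is covered by a rowspan cell from a row above.
                text, remaining = carried[col]
                row.append(text)
                if remaining > 1:
                    carried[col] = (text, remaining - 1)
                else:
                    del carried[col]
            else:
                text = pending.get_text().strip()
                span = int(pending.get("rowspan", 1))
                row.append(text)
                if span > 1:
                    carried[col] = (text, span - 1)
                pending = next(cells, None)
            col += 1
        grid.append(row)
    return grid

# Hypothetical table in which two buildings share rank 1.
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Floors</th></tr>
  <tr><td rowspan="2">1</td><td>Tower A</td><td>50</td></tr>
  <tr><td>Tower B</td><td>50</td></tr>
  <tr><td>3</td><td>Tower C</td><td>40</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").select(".wikitable")[0]
grid = table_to_grid(sample_table)
for grid_row in grid:
    print(grid_row)
```

Every row in `grid` then has the same number of cells, so it can be passed to `pd.DataFrame(grid[1:], columns=grid[0])`.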

---
