# Web Scraping: Practice

Helpful link: 
- https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup

## List of San Diego Communities

In [2]:
# San Diego Communities Webpage
sd_communities = 'https://en.wikipedia.org/wiki/List_of_communities_and_neighborhoods_of_San_Diego'

### Setup: First, scrape the web page above, and use BeautifulSoup to parse it

In [3]:
# YOUR CODE HERE
page = requests.get(sd_communities)
soup = BeautifulSoup(page.content, 'html.parser')

In [11]:
# What is the title of the webpage?

# YOUR CODE HERE
title = soup.title.string
print(title)

List of communities and neighborhoods of San Diego - Wikipedia


Goal: we would like a dictionary of all the communities listed on the Wikipedia page, with their links.

We want all the community names as keys, and the (relative) links as values. 

In [15]:
page.content

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of communities and neighborhoods of San Diego - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_communities_and_neighborhoods_of_San_Diego","wgTitle":"List of communities and neighborhoods of San Diego","wgCurRevisionId":852307991,"wgRevisionId":852307991,"wgArticleId":9146786,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Neighborhoods in San Diego","Urban communities in San Diego","Lists of neighborhoods in U.S. cities","San Diego-related lists","Geography of San Diego"],"wgBreakFrames":false,"wgPageContentLanguage":"en","w

In [28]:
table = soup.find('table')

In [41]:
type(table)

bs4.element.Tag

In [51]:
ll = table.find_all('a')

In [55]:
ll

[<a class="mw-redirect" href="/wiki/Balboa_Park,_San_Diego" title="Balboa Park, San Diego">Balboa Park</a>,
 <a href="/wiki/Bankers_Hill,_San_Diego" title="Bankers Hill, San Diego">Bankers Hill</a>,
 <a href="/wiki/Barrio_Logan,_San_Diego" title="Barrio Logan, San Diego">Barrio Logan</a>,
 <a class="mw-redirect" href="/wiki/Bay_Ho,_San_Diego" title="Bay Ho, San Diego">Bay Ho</a>,
 <a class="mw-redirect" href="/wiki/Bay_Park,_San_Diego" title="Bay Park, San Diego">Bay Park</a>,
 <a href="/wiki/Birdland,_San_Diego" title="Birdland, San Diego">Birdland</a>,
 <a href="/wiki/Black_Mountain_Ranch,_San_Diego" title="Black Mountain Ranch, San Diego">Black Mountain Ranch</a>,
 <a class="mw-redirect" href="/wiki/Border,_San_Diego" title="Border, San Diego">Border</a>,
 <a href="/wiki/Burlingame,_San_Diego" title="Burlingame, San Diego">Burlingame</a>,
 <a href="/wiki/Carmel_Mountain_Ranch,_San_Diego" title="Carmel Mountain Ranch, San Diego">Carmel Mountain Ranch</a>,
 <a href="/wiki/Carmel_Valle

In [58]:
solution = {}

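# Each item found here is a link (<a> tag) inside the table
for link in table.find_all('a'):

    # Use the link's title attribute as the key, and its relative href as the value
    k = link.get('title')
    href = link.get('href')

    solution[k] = href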

In [59]:
solution

{'Balboa Park, San Diego': '/wiki/Balboa_Park,_San_Diego',
 'Bankers Hill, San Diego': '/wiki/Bankers_Hill,_San_Diego',
 'Barrio Logan, San Diego': '/wiki/Barrio_Logan,_San_Diego',
 'Bay Ho, San Diego': '/wiki/Bay_Ho,_San_Diego',
 'Bay Park, San Diego': '/wiki/Bay_Park,_San_Diego',
 'Birdland, San Diego': '/wiki/Birdland,_San_Diego',
 'Black Mountain Ranch, San Diego': '/wiki/Black_Mountain_Ranch,_San_Diego',
 'Border, San Diego': '/wiki/Border,_San_Diego',
 'Burlingame, San Diego': '/wiki/Burlingame,_San_Diego',
 'Carmel Mountain Ranch, San Diego': '/wiki/Carmel_Mountain_Ranch,_San_Diego',
 'Carmel Valley, San Diego': '/wiki/Carmel_Valley,_San_Diego',
 'City Heights, San Diego': '/wiki/City_Heights,_San_Diego',
 'Clairemont, San Diego': '/wiki/Clairemont,_San_Diego',
 'College Area, San Diego': '/wiki/College_Area,_San_Diego',
 'Del Mar Heights, San Diego': '/wiki/Del_Mar_Heights,_San_Diego',
 'Del Mar Mesa, San Diego': '/wiki/Del_Mar_Mesa,_San_Diego',
 'Downtown San Diego': '/wiki/Do

### Communities - Part 1

Create a dictionary called `communities`, and fill it with the communities information, as above. 

From your `soup` object, use the `find_all` method to find all the links.

Using that, you can loop through all links, to collect them into a dictionary.

For a first pass, don't worry about sub-selecting links; just get all the links on the page.
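As a rough sketch of the loop described above (not the only way to do it), the example below runs on a small made-up HTML snippet rather than the live page, so it works offline. The names `sample_html`, `sample_soup`, and `all_links` are placeholders for illustration, and it uses each link's visible text as the key, which is one possible choice.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline
sample_html = """
<html><body>
  <div id="sidebar"><a href="/wiki/Main_Page" title="Main Page">Main page</a></div>
  <table>
    <tr><td><a href="/wiki/Birdland,_San_Diego" title="Birdland, San Diego">Birdland</a></td></tr>
    <tr><td><a href="/wiki/Burlingame,_San_Diego" title="Burlingame, San Diego">Burlingame</a></td></tr>
  </table>
  <a name="top">Anchor without an href</a>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Collect every link on the page: visible text -> relative link
all_links = {}
for link in sample_soup.find_all('a'):
    href = link.get('href')
    if href is None:
        # Some <a> tags are anchors with no href; skip those
        continue
    all_links[link.get_text(strip=True)] = href

print(all_links)
```

Notice that the sidebar link ends up in the dictionary too, since nothing here restricts the search to the table.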

In [None]:
# YOUR CODE HERE


In [61]:
# Check the resulting dictionary
solution

{'Balboa Park, San Diego': '/wiki/Balboa_Park,_San_Diego',
 'Bankers Hill, San Diego': '/wiki/Bankers_Hill,_San_Diego',
 'Barrio Logan, San Diego': '/wiki/Barrio_Logan,_San_Diego',
 'Bay Ho, San Diego': '/wiki/Bay_Ho,_San_Diego',
 'Bay Park, San Diego': '/wiki/Bay_Park,_San_Diego',
 'Birdland, San Diego': '/wiki/Birdland,_San_Diego',
 'Black Mountain Ranch, San Diego': '/wiki/Black_Mountain_Ranch,_San_Diego',
 'Border, San Diego': '/wiki/Border,_San_Diego',
 'Burlingame, San Diego': '/wiki/Burlingame,_San_Diego',
 'Carmel Mountain Ranch, San Diego': '/wiki/Carmel_Mountain_Ranch,_San_Diego',
 'Carmel Valley, San Diego': '/wiki/Carmel_Valley,_San_Diego',
 'City Heights, San Diego': '/wiki/City_Heights,_San_Diego',
 'Clairemont, San Diego': '/wiki/Clairemont,_San_Diego',
 'College Area, San Diego': '/wiki/College_Area,_San_Diego',
 'Del Mar Heights, San Diego': '/wiki/Del_Mar_Heights,_San_Diego',
 'Del Mar Mesa, San Diego': '/wiki/Del_Mar_Mesa,_San_Diego',
 'Downtown San Diego': '/wiki/Do

### Communities - Part 2

If you did the part above, extracting links, you probably realized that you extracted a whole bunch of links you don't really want, such as links from the sidebar.

Figure out how to sub-select the part of the page that includes the table with all the links, and then run the same link extraction on that specific part of the page. This should allow you to extract only the relevant links.
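One possible approach, sketched on a made-up HTML snippet so it runs offline: call `find` to grab the table first, then call `find_all` on that table instead of on the whole page. The names `sample_html`, `first_table`, and `table_links` are placeholders, and the sketch assumes the links you want sit in the first `<table>` on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline
sample_html = """
<html><body>
  <div id="sidebar"><a href="/wiki/Main_Page" title="Main Page">Main page</a></div>
  <table>
    <tr><td><a href="/wiki/Birdland,_San_Diego" title="Birdland, San Diego">Birdland</a></td></tr>
    <tr><td><a href="/wiki/Burlingame,_San_Diego" title="Burlingame, San Diego">Burlingame</a></td></tr>
  </table>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Assumption: the links of interest are in the first <table> on the page
first_table = sample_soup.find('table')

table_links = {}
if first_table is not None:
    # href=True keeps only <a> tags that actually have an href attribute
    for link in first_table.find_all('a', href=True):
        table_links[link.get_text(strip=True)] = link['href']

print(table_links)
```

Because the search starts from the table rather than the whole document, the sidebar link is no longer collected.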

In [None]:
# YOUR CODE HERE


In [None]:
# Check out the results
communities

### Communities - Part 3

You now have a dictionary of neighbourhoods in San Diego, and links to their respective pages on Wikipedia.

See if you can loop through the list of links you have, and collect latitude and longitude data from each one (if available).

I recommend you start by figuring out how to do this on an example page, and then put that in a loop.
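Here is a minimal sketch of the "example page first" step. It assumes that a page with coordinates marks them up with `<span class="latitude">` and `<span class="longitude">` elements, which is worth confirming by inspecting a real community page. The helper name `extract_coordinates` and the sample HTML (including its coordinate values) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet imitating how coordinates might appear on a page
sample_html = """
<span class="geo-dms">
  <span class="latitude">32°43′53″N</span>
  <span class="longitude">117°08′43″W</span>
</span>
"""

def extract_coordinates(html):
    """Return (latitude, longitude) as strings, or None if either is missing."""
    page_soup = BeautifulSoup(html, 'html.parser')
    lat = page_soup.find('span', class_='latitude')
    lon = page_soup.find('span', class_='longitude')
    if lat is None or lon is None:
        # Not every page lists coordinates
        return None
    return lat.get_text(strip=True), lon.get_text(strip=True)

print(extract_coordinates(sample_html))
print(extract_coordinates('<p>no coordinates here</p>'))
```

Returning `None` for pages without coordinates lets a loop over all the links skip those pages instead of crashing.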

In [None]:
# YOUR CODE HERE

In [64]:
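from urllib.parse import urljoin

# The stored links are relative. Passing one directly to requests.get raises
# MissingSchema: Invalid URL '/wiki/Balboa_Park,_San_Diego': No schema supplied.
# Joining each link with the address of the original page gives an absolute URL.
for name, link in solution.items():
    community_page = requests.get(urljoin(sd_communities, link))
    community_soup = BeautifulSoup(community_page.content, 'html.parser')
    #community_soup.find()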

## San Diego Crime Stats Page

Check out the San Diego crime stats page below. Let's try to get some data from it.

From the landing page, pull all the table data, storing it in a dictionary that maps each type of crime to its count.

Hints:
- Look for the HTML tag that holds the table data, and loop through all instances of that tag.
- Using this approach, you can get all the table data by looping across one tag. 
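To illustrate the hints, here is a sketch on a made-up table. It assumes the data lives in `<td>` cells that alternate between a crime label and its count; the real page may be laid out differently, so check its HTML first. The labels and numbers below are placeholders, not real statistics.

```python
from bs4 import BeautifulSoup

# Made-up table; the values are placeholders, not real crime statistics
sample_html = """
<table>
  <tr><td>Homicide</td><td>12</td></tr>
  <tr><td>Robbery</td><td>1,345</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Looping across the one tag (<td>) gives every cell, in document order
cells = [td.get_text(strip=True) for td in sample_soup.find_all('td')]

# Assumption: cells alternate label, count, label, count, ...
crime_counts = {}
for label, count in zip(cells[0::2], cells[1::2]):
    # Remove thousands separators before converting to an integer
    crime_counts[label] = int(count.replace(',', ''))

print(crime_counts)
```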

In [None]:
# SD Crime stats page
crime_stats_link = "http://crimestats.arjis.org/default.aspx"

In [None]:
# YOUR CODE HERE

### Discussion

The crime page above takes inputs to select dates and places. How could we programmatically enter queries into it, and get the results?
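One possible direction, sketched under assumptions: pages like this usually submit an HTML form, so you can read the form's fields, fill in your own values, and send them back with `requests.post`. The form below is made up (the field names `start_date` and `submit_button` are hypothetical), and `build_form_payload` is an illustrative helper. Since the page is an `.aspx` page, it may also carry hidden state fields such as `__VIEWSTATE` that need to be sent back unchanged, which is why the sketch copies every input.

```python
from bs4 import BeautifulSoup

# Made-up form; the real page's field names need to be read from its HTML
sample_form_html = """
<form method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="start_date" value="" />
  <input type="submit" name="submit_button" value="Search" />
</form>
"""

def build_form_payload(html, user_values):
    """Copy every named <input> from the first form, then apply user values."""
    form = BeautifulSoup(html, 'html.parser').find('form')
    payload = {}
    for field in form.find_all('input'):
        name = field.get('name')
        if name:
            payload[name] = field.get('value', '')
    # Overwrite the defaults with the query we actually want to run
    payload.update(user_values)
    return payload

payload = build_form_payload(sample_form_html, {'start_date': '01/01/2018'})
print(payload)

# Against the live page, the payload could then be sent with something like:
# response = requests.post(crime_stats_link, data=payload)
```

If the page builds its results with JavaScript rather than a plain form submission, this approach may not be enough, and a browser automation tool would be another option to discuss.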