## Intro to Web Scraping


Scraping is not always legal!  

Some rules to consider: 
* Be respectful: do not bombard a website with scraping requests, or you may get your IP address blocked.
* Check the website's permissions (its terms of use and robots.txt) before you begin! If there is an API available, use it. Many websites won't let you use their data commercially.
* Each website is unique and may change over time, so you may need to customize your scraping code for each website and update it when the site changes.
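
The first two rules can be checked in code. Here is a minimal sketch, assuming Python's standard-library `urllib.robotparser`; the robots.txt content, the `my-scraper` user agent name, and the example.com URLs are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "my-scraper" is a placeholder user agent name.
allowed = parser.can_fetch("my-scraper", "https://example.com/public/page.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

For a live site you would instead point the parser at the site's robots.txt with `set_url(...)` and `read()`, and pause between requests (for example with `time.sleep`).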


When is it a good idea to scrape a website:
* An API is not available, or the information you want is not in the API
* You want to anonymously scrape a website (use a VPN)

Here is a Web Scraping Sandbox where you can practice scraping: 
http://toscrape.com/

Today, we're going to start by scraping Wikipedia (https://en.wikipedia.org), because its content is freely licensed and it is generally *legal* to scrape

This lesson was adapted from: https://github.com/Pierian-Data/Complete-Python-3-Bootcamp/blob/master/13-Web-Scraping/00-Guide-to-Web-Scraping.ipynb

Make sure you install requests and bs4 (Beautiful Soup) from the terminal

* pip install requests
* pip install bs4

or if you're using Anaconda 

* conda install requests
* conda install bs4

or install them from within a notebook

* !pip install requests
* !pip install bs4 

In [1]:
# The requests library will grab the page
import requests

response = requests.get("https://en.wikipedia.org/wiki/Black_Lives_Matter")



In [2]:
response.text



In [3]:
# The BeautifulSoup library parses the extracted page so it is easier to read and analyze

import bs4 
soup = bs4.BeautifulSoup(response.text, 'html.parser')

In [4]:
soup


<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Black Lives Matter - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e811a5c8-b82b-4d3c-a75b-dcca28baeed9","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Black_Lives_Matter","wgTitle":"Black Lives Matter","wgCurRevisionId":967895873,"wgRevisionId":967895873,"wgArticleId":44751865,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: date format","CS1 Danish-language sources (da)","CS1 maint: archived copy as title","Articles that may be too long from July 2020","Wikipedi

#### Scraping Text

In [5]:
# Next, inspect the elements on the wiki page; I want to grab the headlines
# The headlines are in <span> tags with class="mw-headline"

soup.select(".mw-headline")

[<span class="mw-headline" id="Structure_and_organization">Structure and organization</span>,
 <span class="mw-headline" id="Loose_structure">Loose structure</span>,
 <span class="mw-headline" id="Guiding_principles">Guiding principles</span>,
 <span class="mw-headline" id="Broader_movement">Broader movement</span>,
 <span class="mw-headline" id="Funding_of_the_movement">Funding of the movement</span>,
 <span class="mw-headline" id="Policy_demands">Policy demands</span>,
 <span class="mw-headline" id="Strategies_and_tactics">Strategies and tactics</span>,
 <span class="mw-headline" id="Internet_and_social_media">Internet and social media</span>,
 <span class="mw-headline" id="Direct_action">Direct action</span>,
 <span class="mw-headline" id="Media,_music,_and_other_cultural_impacts">Media, music, and other cultural impacts</span>,
 <span class="mw-headline" id="Police_use_of_excessive_force">Police use of excessive force</span>,
 <span class="mw-headline" id="Black_Lives_Matters_influ

In [6]:
# Create a list for the scraped headlines

headlines = []
for item in soup.select(".mw-headline"):
    headlines.append(item.text)
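
The select-then-loop pattern above can be tried on a small inline document before pointing it at a live page. This is only a sketch: the HTML snippet and the `sample_` variable names are invented for illustration, and it assumes `bs4` is installed as described earlier.

```python
import bs4

# A tiny, made-up HTML fragment that mimics the structure of the wiki headings.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="History">History</span></h2>
<p>Some body text.</p>
<h3><span class="mw-headline" id="Early_years">Early years</span></h3>
<span class="other">Not a headline</span>
"""

sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# ".mw-headline" matches any tag with that class;
# "span.mw-headline" narrows the match to <span> tags only.
sample_headlines = [tag.get_text(strip=True) for tag in sample_soup.select(".mw-headline")]
sample_ids = [tag["id"] for tag in sample_soup.select("span.mw-headline")]

print(sample_headlines)  # ['History', 'Early years']
print(sample_ids)        # ['History', 'Early_years']
```

The list comprehension does the same job as the `for` loop with `append`; which one to use is a matter of taste.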

In [7]:
headlines

['Structure and organization',
 'Loose structure',
 'Guiding principles',
 'Broader movement',
 'Funding of the movement',
 'Policy demands',
 'Strategies and tactics',
 'Internet and social media',
 'Direct action',
 'Media, music, and other cultural impacts',
 'Police use of excessive force',
 'Black Lives Matters influence',
 'Reaction',
 'Corporate support',
 'Timeline of notable US events and demonstrations',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2020',
 'George Floyd protests',
 'BLM international movement',
 'Australia',
 'Canada',
 'New Zealand',
 'United Kingdom',
 'Germany',
 'Denmark',
 'Japan',
 '2016 US presidential election',
 'Primaries',
 'Democrats',
 'Republicans',
 'General election',
 'Counter-slogans and movements',
 '"All Lives Matter"',
 '"Blue Lives Matter"',
 '"White Student Union" Facebook groups',
 '"White Lives Matter"',
 'Criticism of "Black Lives Matter"',
 'Tactics',
 'Disagreement over racial bias',
 'Views on law enforcement',
 'Ferguson effect

In [8]:
# To save to a CSV, we first want to create a dataframe for the data

import pandas as pd

headline_df = pd.DataFrame({'headlines': headlines})

In [9]:
headline_df

Unnamed: 0,headlines
0,Structure and organization
1,Loose structure
2,Guiding principles
3,Broader movement
4,Funding of the movement
5,Policy demands
6,Strategies and tactics
7,Internet and social media
8,Direct action
9,"Media, music, and other cultural impacts"


In [10]:
# Save to csv 
headline_df.to_csv('headline.csv')
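
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A small sketch of the difference, using an in-memory buffer instead of a file; the `sample_df` data and variable names are only for illustration.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({"headlines": ["Structure and organization", "Loose structure"]})

# Default behaviour: the index is written as a first column with no header.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so only the data column is written.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'headlines']
print(list(reloaded_without_index.columns))  # ['headlines']
```

If you do not need the row numbers in the file, passing `index=False` to `to_csv` is one way to keep the CSV to a single column.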

#### Scraping Images

In [11]:
# We're starting by going through the HTML and looking for something all the images have in common 
# the class 'thumbimage' applies to all the images 

image_info = soup.select('.thumbimage')

In [12]:
image_info

[<img alt='Protesters lying down over rail tracks with a "Black Lives Matter" banner' class="thumbimage" data-file-height="4424" data-file-width="6629" decoding="async" height="147" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg/330px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg/440px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="3881" data-file-width="5175" decoding="async" height="165

In [13]:
len(image_info)

36

In [14]:
# We're creating a list of the links for the thumbnails 

links = []
for link in image_info:
    # the link is in the 'src' attribute
    item = link.get('src')
    print(item)
    # this if statement is because some <div> tags also have the class "thumbimage";
    # they have no 'src' attribute, so .get() returns None for them
    if isinstance(item, str):
        # we're adding https: to format it properly
        links.append("https:" + item)

//upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg
None
None
//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/OaklandBLM-4174.jpg/220px-OaklandBLM-4174.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Protesters_with_signs_in_Ferguson.jpg/220px-Protesters_with_signs_in_Ferguson.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/2020.06.05_Protesting_the_Murder_of_George_Floyd%2C_Washington%2C_DC_USA_157_34232.jpg/220px-2020.06.05_Protesting_the_Murder_of_George_Floyd%2C_Washington%2C_DC_USA_157_34232.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/5/52/Black_Lives_Matter_Black_F
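
The `src` values printed above start with `//`, which makes them protocol-relative URLs: they reuse the scheme of the page they appear on. Prepending `"https:"` works here; a possible alternative, sketched below, is the standard-library `urllib.parse.urljoin`, which also copes with other relative forms. `PAGE_URL` and `absolute_url` are hypothetical names chosen for this example.

```python
from urllib.parse import urljoin

# The page the images were scraped from (same URL as in the request above).
PAGE_URL = "https://en.wikipedia.org/wiki/Black_Lives_Matter"


def absolute_url(src, base=PAGE_URL):
    """Resolve an image src against the page URL; return None if src is missing."""
    if not src:
        return None
    return urljoin(base, src)


protocol_relative = absolute_url(
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/OaklandBLM-4174.jpg/220px-OaklandBLM-4174.jpg"
)
root_relative = absolute_url("/static/logo.png")  # made-up path for illustration
missing = absolute_url(None)

print(protocol_relative)
print(root_relative)
print(missing)
```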

In [15]:
print(links)

['https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/OaklandBLM-4174.jpg/220px-OaklandBLM-4174.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Protesters_with_signs_in_Ferguson.jpg/220px-Protesters_with_signs_in_Ferguson.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/2020.06.05_Protesting_the_Murder_of_George_Floyd%2C_Washington%2C_DC_USA_157_34232.jpg/220px-2020.06.05_Protesting_the_Murder_of_George_Floyd%2C_Washington%2C_DC_USA_157_34232.jpg', 'https://upload.wikimedia.org/wikipedia/co

#### Now, I want to download these images onto my computer

In [16]:
# To save the images, we first want to create a dataframe for the data; we already imported pandas earlier

# Creating a new dataframe
image_df = pd.DataFrame({'wiki_images': links})

In [17]:
image_df

Unnamed: 0,wiki_images
0,https://upload.wikimedia.org/wikipedia/commons...
1,https://upload.wikimedia.org/wikipedia/commons...
2,https://upload.wikimedia.org/wikipedia/commons...
3,https://upload.wikimedia.org/wikipedia/commons...
4,https://upload.wikimedia.org/wikipedia/commons...
5,https://upload.wikimedia.org/wikipedia/commons...
6,https://upload.wikimedia.org/wikipedia/commons...
7,https://upload.wikimedia.org/wikipedia/commons...
8,https://upload.wikimedia.org/wikipedia/commons...
9,https://upload.wikimedia.org/wikipedia/commons...


In [18]:
# FIRST: Create a folder titled Scraped_Images (os.makedirs creates it if it is missing)
import os
# We can use urllib to download the image urls
import urllib.request
# We want to add a sleeper to not get blocked
import time

os.makedirs('Scraped_Images', exist_ok=True)

# Iterate over DataFrame rows as (index, row) pairs
for index, row in image_df.iterrows():
    # Iterating over all the urls in the Series: row
    for url in row:
        print(url)
        # Sets the file name as everything after the last / in the link
        file_name = url.split('/')[-1]
        print(file_name)
        # Downloads the image
        urllib.request.urlretrieve(url, 'Scraped_Images/' + file_name)
        # Adding a 1 second break in between each image scraped
        time.sleep(1)

https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg
220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821587635011%29.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg/220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg
220px-Black_Lives_Matter_protest_against_St._Paul_police_brutality_%2821552673186%29.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/OaklandBLM-4174.jpg/220px-OaklandBLM-4174.jpg
220px-OaklandBLM-4174.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Protesters_with_signs_in_Ferguson.jpg/220px-Protesters_with_signs_in_Ferguson.jpg
220px-Protesters_with_signs_in_Ferguson.jpg
https://upload.wikimedia.org/wikipedia/commons/t

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Ferguson_Day_6%2C_Picture_45.png/220px-Ferguson_Day_6%2C_Picture_45.png
220px-Ferguson_Day_6%2C_Picture_45.png
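
The file names printed above keep the URL's percent-encoding (for example `%2C` for a comma and `%28`/`%29` for parentheses). If you would rather save the images under decoded names, one option is sketched below with the standard library; `filename_from_url` is a hypothetical helper, not part of the original lesson.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return a decoded file name taken from the last path segment of url."""
    # Take the last path segment first, so an encoded slash (%2F) cannot
    # be mistaken for a path separator.
    last_segment = PurePosixPath(urlsplit(url).path).name
    decoded = unquote(last_segment)
    # Replace any separators that decoding produced, to keep a plain file name.
    return decoded.replace("/", "_").replace("\\", "_")


example_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/"
    "Ferguson_Day_6%2C_Picture_45.png/220px-Ferguson_Day_6%2C_Picture_45.png"
)
example_name = filename_from_url(example_url)

print(example_name)  # 220px-Ferguson_Day_6,_Picture_45.png
```

In the download loop, this helper could stand in for `url.split('/')[-1]`; the URL passed to `urlretrieve` should stay encoded.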
