## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 12
---------------------------------------

GOALS:

1. Understand Causal vs Experimental Studies
2. Do a more free form data analysis
3. Start doing your ethics reading

----------------------------------------------------------

This homework has **3 questions**, **1 exercise** and **1 optional challenge problem**.

## Important Information

- Email: [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
- Office Hours: Duke 209 <a href="https://joannabieri.com/schedule.html"> Click Here for Joanna's Schedule</a>

## Announcements

**NEXT WEEK - Data Ethics.** This week you should be reading your book or articles.

## Day 12 Assignment - same drill.

1. Make sure you can **Fork** and **Clone** the Day12 repo from [Redlands-DATA101](https://github.com/Redlands-DATA101)
2. Open the file Day12-HW.ipynb and start doing the problems.
    * You can do these problems as you follow along with the lecture notes and video.
3. Get as far as you can before class.
4. Submit what you have so far **Commit** and **Push** to Git.
5. Take the daily check in quiz on **Canvas**.
6. Come to class with lots of questions!

------------------------------
---------------------

### Web Scraping Ethical Issues

There are some things to be aware of before you start scraping data from the web. 

- Some data is private or protected. Just because you have access to a website's data doesn't mean you are allowed to scrape it. For example, when you log into Facebook or another social media site, you are granted special access to data about the people you are connected to. It is unethical to use that access to scrape their private data!

- Some websites have rules against scraping and will cut off service to users who are clearly scraping data. How do they know? Web scrapers access the website very differently than regular users. If the site has a policy about scraping data, then you should follow it and/or contact them about getting the data if you have a true academic interest in it.

- The line between web scraping and plagiarism can be very blurry. Make sure that you are citing where your data comes from AND not just reproducing the data exactly. Always cite the source of your data and make sure you are doing something new with it.

- Ethics are different depending on whether you are using the data for a personal project (e.g. you just want to check scores for your favorite team daily and print the stuff you care about) or for your business or website (e.g. publishing information to drive clicks to your site/video/account or making money from the data you collect). In the latter case it is EXTRA important to respect the original owner of the data. Drive web traffic back to their site, check with them about using their data, etc.

**The Ethical Scraper** (from https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

I, the web scraper will live by the following principles:

- If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
- I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
- I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
- I will only save the data I absolutely need from your page. If all I need is OpenGraph meta-data, that’s all I’ll keep.
- I will respect any content I do keep. I’ll never pass it off as my own.
- I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
- I will respond in a timely fashion to your outreach and work with you towards a resolution.
- I will scrape for the purpose of creating new value from the data, not to duplicate it.
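
A few of these principles (an identifying User Agent, a reasonable request rate, and respecting the site's rules) can be sketched in code. This is only a minimal sketch under stated assumptions: the bot name, the contact address, the `polite_get` helper, and the sample robots.txt rules are all made up for illustration. It uses `urllib.robotparser` from the Python standard library to check whether a path is allowed before fetching it.

```python
import time
from urllib.robotparser import RobotFileParser

import requests

# Hypothetical identity: replace with your own project name and contact address.
USER_AGENT = "data101-homework-bot (contact: student@example.edu)"
HEADERS = {"User-Agent": USER_AGENT}

# Made-up robots.txt rules for illustration only. A real site publishes
# its own rules at <site>/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS.splitlines())


def polite_get(url, delay_seconds=1.0):
    """Fetch url only if the robots rules allow it, then pause briefly."""
    if not robots.can_fetch(USER_AGENT, url):
        raise PermissionError(f"robots.txt disallows scraping {url}")
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # stop on 4xx/5xx responses
    time.sleep(delay_seconds)    # keep the request rate reasonable
    return response


# Check two example paths against the sample rules (no network access needed).
allowed = robots.can_fetch(USER_AGENT, "https://example.com/pages/")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/data")
print(allowed, blocked)
```

For a real site you would load its actual robots.txt (for example with `robots.set_url(...)` followed by `robots.read()`) rather than the sample rules used here.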

In [361]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

# This stops a few warning messages from showing
pd.options.mode.chained_assignment = None 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Using pandas to get table data.

**Optional** - try using pandas to scrape a new site. See if you get errors or are able to find tables on that site.


---
---

## Using Beautiful Soup to get HTML code

### How to get data from static sites:

You should already have the packages bs4 and requests but if you get an error try running:

```{python}
!conda install -y bs4
!conda install -y requests
```

In [366]:
import requests
from bs4 import BeautifulSoup

In [368]:
website = 'https://www.scrapethissite.com/pages/simple/'

In [370]:
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

#### Let's see what is in soup

**Q1** Uncomment this line and run the cell. You will see a TON of text!

In [373]:
#soup

### Extracting data from HTML:

We will use the **soup.find_all()** function.

Here is the simplified function signature:

    soup.find_all(name=None,attrs={})
    
You can type soup.find_all? and run it to see all the information about additional arguments and advanced processes.

**Here is how we will mostly use it**, but there are much more advanced things you can do:

    soup.find_all( <type of section>, <info> )
    
The **.find_all()** function searches through the information in soup and returns only the sections that match the info. Here are some important types you might search for:

- 'h2' - this is a heading
- 'div' - this divides a block of information
- 'span' - this divides inline information
- 'a' - this specifies a hyperlink
- 'li' - this is a list item

You can also narrow the search with keyword arguments:

- class_= - many things have the class label (notice the underscore!)
- string= - you can also search by strings.
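
To see these options side by side, here is a small sketch that runs `find_all` on a hand-written HTML snippet instead of the live site. The snippet is made up for illustration; it only borrows the class names `country-name` and `country-capital` that appear on the practice page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration (not fetched from the website).
sample_html = """
<div class="country">
  <h3 class="country-name"> Andorra </h3>
  <span class="country-capital">Andorra la Vella</span>
  <a href="/pages/">Sandbox</a>
</div>
<div class="country">
  <h3 class="country-name"> Anguilla </h3>
  <span class="country-capital">The Valley</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Search by tag type only.
headings = sample_soup.find_all("h3")

# Search by tag type and class (note the underscore in class_).
capitals = sample_soup.find_all("span", class_="country-capital")

# The same search written with the attrs dictionary.
capitals_attrs = sample_soup.find_all("span", attrs={"class": "country-capital"})

# Search for links whose text is exactly "Sandbox".
links = sample_soup.find_all("a", string="Sandbox")

# .text gives the words inside a tag; .strip() removes surrounding whitespace.
names = [tag.text.strip() for tag in headings]
print(names)
print([tag.text for tag in capitals])
print([tag.get("href") for tag in links])
```

Each call returns a list-like result, so you can index it (`capitals[0]`) or loop over it just like a regular Python list.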


### Using Developer tools:

To figure out what data to extract I suggest you use developer tools on the website to find what you need. 

**Q2** Navigate to the website and use developer tools to explore the html code.

[Scrape This Site -  Website](https://www.scrapethissite.com/pages/simple/)

I really like Brave Browser or Google Chrome for this, but most browsers will have **More Tools/Developer Tools** where you can see the code.

### Search for all the country names

The country names are inside 
        
        <h3 class="country-name">

So let's search for this:

In [376]:
result = soup.find_all('h3',class_="country-name")
#print(result)

In [378]:
DF = pd.DataFrame()
DF['country']=result
DF['country'] = DF['country'].apply(lambda x: x.text.strip())
DF

Unnamed: 0,country
0,Andorra
1,United Arab Emirates
2,Afghanistan
3,Antigua and Barbuda
4,Anguilla
...,...
245,Yemen
246,Mayotte
247,South Africa
248,Zambia


In [380]:
result = soup.find_all('span',class_="country-capital")

In [382]:
result[0]

<span class="country-capital">Andorra la Vella</span>

In [384]:
result[0].text

'Andorra la Vella'

In [407]:
# Add the capital names to the data frame
DF['capital']=result
DF['capital'] = DF['capital'].apply(lambda x: x.text)
DF

Unnamed: 0,country,capital,population,area
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
245,Yemen,Sanaa,23495361,527970.0
246,Mayotte,Mamoudzou,159042,374.0
247,South Africa,Pretoria,49000000,1219912.0
248,Zambia,Lusaka,13460305,752614.0


### Your turn to try this!

**Q3** Now it's your turn. See if you can write code that gets the population and area information into the data frame. See if you can make your example match what I get below, including having the correct data types. Population should be an int and area should be a float.

Your goal is to get a data frame that looks like the one in lecture with columns for country, capital, population, and area!

In [410]:
result2 = soup.find_all('span',class_="country-population")

In [412]:
result2[0]

<span class="country-population">84000</span>

In [414]:
result2[0].text

'84000'

In [416]:
DF['population']=result2
DF['population'] = DF['population'].apply(lambda x: x.text)
DF['population'] = DF['population'].astype(int)
DF

Unnamed: 0,country,capital,population,area
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
245,Yemen,Sanaa,23495361,527970.0
246,Mayotte,Mamoudzou,159042,374.0
247,South Africa,Pretoria,49000000,1219912.0
248,Zambia,Lusaka,13460305,752614.0


In [418]:
result3 = soup.find_all('span',class_="country-area")

In [420]:
result3[0].text

'468.0'

In [422]:
DF['area']=result3
DF['area'] = DF['area'].apply(lambda x: x.text)
DF['area'] = DF['area'].astype(float)
DF

Unnamed: 0,country,capital,population,area
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
245,Yemen,Sanaa,23495361,527970.0
246,Mayotte,Mamoudzou,159042,374.0
247,South Africa,Pretoria,49000000,1219912.0
248,Zambia,Lusaka,13460305,752614.0


In [424]:
DF.dtypes

country        object
capital        object
population      int64
area          float64
dtype: object
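
The column-by-column steps above can also be collected into one small helper. The following is a sketch, not the lecture solution: `column_text` is a hypothetical helper name, and the HTML is a hand-written two-country sample that reuses the class names and values shown above so the example runs without the website.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written sample with the same class names as the practice page.
sample_html = """
<div class="country">
  <h3 class="country-name"> Andorra </h3>
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
<div class="country">
  <h3 class="country-name"> United Arab Emirates </h3>
  <span class="country-capital">Abu Dhabi</span>
  <span class="country-population">4975593</span>
  <span class="country-area">82880.0</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def column_text(soup, tag, css_class):
    """Return the stripped text of every matching tag as a list of strings."""
    return [element.text.strip() for element in soup.find_all(tag, class_=css_class)]


# Build all four columns at once from plain strings.
countries = pd.DataFrame({
    "country": column_text(sample_soup, "h3", "country-name"),
    "capital": column_text(sample_soup, "span", "country-capital"),
    "population": column_text(sample_soup, "span", "country-population"),
    "area": column_text(sample_soup, "span", "country-area"),
})

# Convert the numeric columns from strings to numbers in one step.
countries = countries.astype({"population": int, "area": float})
print(countries.dtypes)
```

Storing the extracted text (rather than the tag objects) in the data frame from the start avoids the intermediate step of assigning tags to a column and then applying `.text`.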

### Where can you get more practice

Here is a website dedicated to allowing students to practice webscraping:

[www.scrapethissite.com](https://www.scrapethissite.com/pages/)

-----------------
### Exercise 1:

We are going to scrape the site above to get the list of all links. 

Open the website and look at the developer tools.

Here is our goal: Make a pandas data frame that contains three columns: 

* "site_name" - which contains just the words of the link
* "link" - which contains just the website part of the link
* "description" - which contains the words below the link

When looking for links you can use

    result = soup.find_all('a')
    result[0].text # Get the text associated with the link
    result[0].get('href') # Get the link location



See if you can do this without looking at my code in the notes! What would your plan have to be?

**See the lecture notes for hints and the solution!**


In [491]:
website = 'https://www.scrapethissite.com/pages/'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

In [492]:
result = soup.find_all('a')

In [495]:
result[5].text

'Countries of the World: A Simple Example'

In [497]:
DF = pd.DataFrame()
DF['site_name']=result
DF['site_name'] = DF['site_name'].apply(lambda x: x.text.strip())
DF

Unnamed: 0,site_name
0,Scrape This Site
1,Sandbox
2,Lessons
3,FAQ
4,Login
5,Countries of the World: A Simple Example
6,"Hockey Teams: Forms, Searching and Pagination"
7,Oscar Winning Films: AJAX and Javascript
8,Turtles All the Way Down: Frames & iFrames
9,Advanced Topics: Real World Challenges You'll ...


In [499]:
result[5].text

'Countries of the World: A Simple Example'

In [501]:
result[5].get('href')

'/pages/simple/'

In [503]:
DF['link']= result
DF['link'] = DF['link'].apply(lambda x: x.get('href'))
DF

Unnamed: 0,site_name,link
0,Scrape This Site,/
1,Sandbox,/pages/
2,Lessons,/lessons/
3,FAQ,/faq/
4,Login,/login/
5,Countries of the World: A Simple Example,/pages/simple/
6,"Hockey Teams: Forms, Searching and Pagination",/pages/forms/
7,Oscar Winning Films: AJAX and Javascript,/pages/ajax-javascript/
8,Turtles All the Way Down: Frames & iFrames,/pages/frames/
9,Advanced Topics: Real World Challenges You'll ...,/pages/advanced/


In [505]:
# drop() returns a new data frame; DF itself is unchanged unless you reassign it
DF.drop([0, 1, 2, 3, 4])

Unnamed: 0,site_name,link
5,Countries of the World: A Simple Example,/pages/simple/
6,"Hockey Teams: Forms, Searching and Pagination",/pages/forms/
7,Oscar Winning Films: AJAX and Javascript,/pages/ajax-javascript/
8,Turtles All the Way Down: Frames & iFrames,/pages/frames/
9,Advanced Topics: Real World Challenges You'll ...,/pages/advanced/


In [507]:
result[0].text.strip()

'Scrape This Site'

In [509]:
DF['description'] = result
DF['description'] = DF['description'].apply(lambda x: x.text.strip())
DF

Unnamed: 0,site_name,link,description
0,Scrape This Site,/,Scrape This Site
1,Sandbox,/pages/,Sandbox
2,Lessons,/lessons/,Lessons
3,FAQ,/faq/,FAQ
4,Login,/login/,Login
5,Countries of the World: A Simple Example,/pages/simple/,Countries of the World: A Simple Example
6,"Hockey Teams: Forms, Searching and Pagination",/pages/forms/,"Hockey Teams: Forms, Searching and Pagination"
7,Oscar Winning Films: AJAX and Javascript,/pages/ajax-javascript/,Oscar Winning Films: AJAX and Javascript
8,Turtles All the Way Down: Frames & iFrames,/pages/frames/,Turtles All the Way Down: Frames & iFrames
9,Advanced Topics: Real World Challenges You'll ...,/pages/advanced/,Advanced Topics: Real World Challenges You'll ...
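
In the output above, the description column repeats the link text, because `result` holds the `<a>` tags themselves rather than the words below each link. One possible approach is to loop over the block that contains both the link and its description. The sketch below assumes each entry sits in a `<div class="page">` with the link inside it and the description in a `<p>` tag; that structure and the placeholder descriptions are assumptions, so check the real class names in developer tools.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written HTML that mimics an ASSUMED page structure; the description
# text is placeholder wording, not the site's real text.
sample_html = """
<nav><a href="/pages/">Sandbox</a></nav>
<div class="page">
  <h3 class="page-title"><a href="/pages/simple/">Countries of the World: A Simple Example</a></h3>
  <p class="lead session-desc"> Placeholder description for the first page. </p>
</div>
<div class="page">
  <h3 class="page-title"><a href="/pages/forms/">Hockey Teams: Forms, Searching and Pagination</a></h3>
  <p class="lead session-desc"> Placeholder description for the second page. </p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for block in sample_soup.find_all("div", class_="page"):
    link_tag = block.find("a")   # first link inside this block
    desc_tag = block.find("p")   # first paragraph inside this block
    rows.append({
        "site_name": link_tag.text.strip(),
        "link": link_tag.get("href"),
        "description": desc_tag.text.strip(),
    })

pages = pd.DataFrame(rows)
print(pages)
```

Because the search is limited to the blocks that hold a link and a description together, navigation links such as "Sandbox" never enter the data frame, so no rows need to be dropped afterwards.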


In [516]:
# Added from the answer in the notes.
# Run this cell only once: each run prepends base_website to the links again.
base_website = 'https://www.scrapethissite.com'
DF['link'] = DF['link'].apply(lambda x: base_website+x)
DF

Unnamed: 0,site_name,link,description
0,Scrape This Site,https://www.scrapethissite.comhttps://www.scra...,Scrape This Site
1,Sandbox,https://www.scrapethissite.comhttps://www.scra...,Sandbox
2,Lessons,https://www.scrapethissite.comhttps://www.scra...,Lessons
3,FAQ,https://www.scrapethissite.comhttps://www.scra...,FAQ
4,Login,https://www.scrapethissite.comhttps://www.scra...,Login
5,Countries of the World: A Simple Example,https://www.scrapethissite.comhttps://www.scra...,Countries of the World: A Simple Example
6,"Hockey Teams: Forms, Searching and Pagination",https://www.scrapethissite.comhttps://www.scra...,"Hockey Teams: Forms, Searching and Pagination"
7,Oscar Winning Films: AJAX and Javascript,https://www.scrapethissite.comhttps://www.scra...,Oscar Winning Films: AJAX and Javascript
8,Turtles All the Way Down: Frames & iFrames,https://www.scrapethissite.comhttps://www.scra...,Turtles All the Way Down: Frames & iFrames
9,Advanced Topics: Real World Challenges You'll ...,https://www.scrapethissite.comhttps://www.scra...,Advanced Topics: Real World Challenges You'll ...
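
The truncated links in the output above start with the base address twice, which is what happens when a cell that prepends `base_website` is run more than once. One way to make this step safe to re-run is `urljoin` from the standard library, which leaves an already complete address unchanged. This is a sketch on a small hand-made data frame; `to_absolute` is a hypothetical helper name.

```python
from urllib.parse import urljoin

import pandas as pd

base_website = "https://www.scrapethissite.com"

# Small hand-made example: two relative links and one that is already absolute.
links = pd.DataFrame({
    "link": ["/", "/pages/simple/", "https://www.scrapethissite.com/faq/"]
})


def to_absolute(link):
    """Join a link onto the base address; absolute links pass through unchanged."""
    return urljoin(base_website, link)


# Apply once ...
links["link"] = links["link"].apply(to_absolute)
once = links["link"].tolist()

# ... and again, to mimic re-running the cell.
links["link"] = links["link"].apply(to_absolute)
twice = links["link"].tolist()

print(once)
print(once == twice)
```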


### Challenge Problem

Here is another website to scrape. See if you can create a data frame that looks like the one in the lecture notes. Notice that you can only scrape the first page with the first link. 

If you want to try scraping the other pages, you have to notice how the website updates its address for each page. Then write a for loop to loop through however many pages you want to scrape. Do the same set of operations for each page and keep adding data to your data frame.

Make a histogram of your final data.

In [512]:
website='https://books.toscrape.com/index.html'
raw_code = requests.get(website)
html_doc = raw_code.text
soup = BeautifulSoup(html_doc, 'html.parser')

**Try to scrape the name, the link to the book, and the prices!** I decided to put the name and link information into a single column and then break that apart.
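
One possible plan for this challenge is sketched below; it is not the lecture solution. It assumes each book sits in an `<article class="product_pod">` with the title and link in an `<h3><a>` tag and the price in a `<p class="price_color">` tag, and that later pages follow the address pattern `catalogue/page-N.html`. Verify both assumptions in developer tools. The helper names `parse_books` and `scrape_pages` are hypothetical, and the sample HTML (titles and prices) is made up so the parsing step can be tried without the website.

```python
import re
import time
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = "https://books.toscrape.com/"

# Made-up sample that mimics the ASSUMED structure of one page.
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/example-book-one_1/index.html" title="Example Book One">Example Book ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/example-book-two_2/index.html" title="Example Book Two">Example Book ...</a></h3>
  <p class="price_color">£13.99</p>
</article>
"""


def parse_books(html_doc, page_url=BASE):
    """Return one dict per book with its name, absolute link, and price."""
    soup = BeautifulSoup(html_doc, "html.parser")
    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        link_tag = book.find("h3").find("a")
        price_text = book.find("p", class_="price_color").text
        rows.append({
            "name": link_tag.get("title"),
            "link": urljoin(page_url, link_tag.get("href")),
            # Keep only the digits and decimal point, dropping the currency symbol.
            "price": float(re.search(r"\d+\.\d+", price_text).group()),
        })
    return rows


def scrape_pages(n_pages, delay_seconds=1.0):
    """Loop over the first n_pages pages (ASSUMED address pattern)."""
    rows = []
    for n in range(1, n_pages + 1):
        page_url = f"{BASE}catalogue/page-{n}.html"
        response = requests.get(page_url, timeout=10)
        response.raise_for_status()
        rows.extend(parse_books(response.text, page_url))
        time.sleep(delay_seconds)  # pause between requests
    return pd.DataFrame(rows)


# Try the parsing step on the sample; call scrape_pages(...) for the real site.
books = pd.DataFrame(parse_books(sample_html))
print(books)
```

With a data frame like `books` in hand, `px.histogram(books, x="price")` is one way to draw the histogram of prices.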