## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [2]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'
import seaborn as sns

In [3]:
import requests
from bs4 import BeautifulSoup

# For dynamic sites using chrome
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time

### You Try - 2 Warm-Up Problems From Lecture

## You Try

Extract the application link for each job. Save the links to a list, then add the list to the data frame as a column named "application link".

NOTE - you will need to copy and paste the code from the lecture

In [16]:
# Code from lecture
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")  # name the parser explicitly to avoid a warning

results = soup.find(id="ResultsContainer")
job_cards = results.find_all("div", class_="card-content")

jobs = dict()
for i,job_card in enumerate(job_cards):
    # Get the information from each job card
    title_element = job_card.find("h2", class_="title")
    company_element = job_card.find("h3", class_="company")
    location_element = job_card.find("p", class_="location")
    link_element = job_card.find_all("a", class_="card-footer-item")[1]
    # Add the information to the dictionary
    jobs[i] = {'job':title_element.text.strip(),
               'company':company_element.text.strip(),
               'location':location_element.text.strip(),
               'link':link_element['href']}

df = pd.DataFrame(jobs).T
df

| | job | company | location | link |
|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | https://realpython.github.io/fake-jobs/jobs/pr... |
| ... | ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | Lake Abigail, AE | https://realpython.github.io/fake-jobs/jobs/mu... |
| 96 | Radiographer, diagnostic | Holder LLC | Jacobshire, AP | https://realpython.github.io/fake-jobs/jobs/ra... |
| 97 | Database administrator | Yates-Ferguson | Port Susan, AE | https://realpython.github.io/fake-jobs/jobs/da... |
| 98 | Furniture designer | Ortega-Lawrence | North Tiffany, AA | https://realpython.github.io/fake-jobs/jobs/fu... |
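
The lecture code picks the second footer link by position. A minimal sketch of an alternative, assuming the application link in each card is the anchor whose visible text is "Apply" (an assumption about the page markup, checked here only against a small inline HTML sample rather than the live site). The helper name `extract_application_links` and the sample URLs are hypothetical, and the result is stored under the requested column name "application link":

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Small stand-in for one job card; the real page is assumed to have the same shape.
SAMPLE_HTML = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title">Senior Python Developer</h2>
    <h3 class="company">Payne, Roberts and Davis</h3>
    <p class="location">Stewartbury, AA</p>
    <footer class="card-footer">
      <a href="https://example.com/learn" class="card-footer-item">Learn</a>
      <a href="https://example.com/jobs/senior-python-developer.html" class="card-footer-item"> Apply </a>
    </footer>
  </div>
</div>
"""


def extract_application_links(html):
    """Return one application URL (or None) per job card in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for card in soup.find_all("div", class_="card-content"):
        # Match the anchor by its text rather than by its position in the footer.
        anchor = card.find("a", string=re.compile(r"^\s*Apply\s*$"))
        links.append(anchor["href"] if anchor is not None else None)
    return links


application_links = extract_application_links(SAMPLE_HTML)
sample_df = pd.DataFrame({"job": ["Senior Python Developer"]})
sample_df["application link"] = application_links
```

Matching on the link text keeps working if the footer links are reordered, and a card without an "Apply" link produces `None` instead of an `IndexError`.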


## You Try

Try to scrape the quotes and authors from this website:

https://quotes.toscrape.com/

- What happens to the URL when you push the Next button at the bottom of the page?
- What happens to the URL when you click on a tag?

Try to scrape all the quotes for one of the larger tags: love, inspirational, life, humor, or books. Make sure to get all the pages!

Put this information into a DataFrame.

Challenge (optional) - Scrape all the quotes on the site along with the authors' names and the tags for each quote. Put all of this into a DataFrame and do an analysis: who has the most quotes, the longest or shortest quotes, the most love quotes, etc.?

In [46]:
# Your code here
tag = 'love'
pages = ['1','2']

quotes = []
authors = []

for p in pages:
    url = "https://quotes.toscrape.com/tag/" + tag + "/page/" + p + "/"
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")  # name the parser explicitly to avoid a warning
    
    results = soup.find_all('div',class_='quote')
    for r in results:
        quotes.append(r.find('span',class_="text").text)
        authors.append(r.find('small',class_="author").text)

df = pd.DataFrame({'quote':quotes,'author':authors})
df

| | quote | author |
|---|---|---|
| 0 | “It is better to be hated for what you are tha... | André Gide |
| 1 | “This life is what you make it. No matter what... | Marilyn Monroe |
| 2 | “You may not be her first, her last, or her on... | Bob Marley |
| 3 | “The opposite of love is not hate, it's indiff... | Elie Wiesel |
| 4 | “It is not a lack of love, but a lack of frien... | Friedrich Nietzsche |
| 5 | “I love you without knowing how, or when, or f... | Pablo Neruda |
| 6 | “If you can make a woman laugh, you can make h... | Marilyn Monroe |
| 7 | “The real lover is the man who can thrill you ... | Marilyn Monroe |
| 8 | “Love does not begin and end the way we seem t... | James Baldwin |
| 9 | “There is nothing I would not do for those who... | Jane Austen |
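
A hard-coded page list has to be updated by hand for every tag. A minimal sketch of an alternative that keeps following the pager link until it disappears, assuming each page marks that link as `<li class="next"><a href="...">` (an assumption about the markup). The names `parse_quote_page`, `scrape_all_pages`, and `fetch` are hypothetical, and two tiny fake pages stand in for the live site so the sketch runs offline:

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup


def parse_quote_page(html):
    """Return (rows, next_href) for one page of quotes."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="quote"):
        rows.append({
            "quote": block.find("span", class_="text").get_text(strip=True),
            "author": block.find("small", class_="author").get_text(strip=True),
        })
    # Assumed markup: the pager entry is <li class="next"><a href="...">.
    next_item = soup.find("li", class_="next")
    next_href = next_item.find("a")["href"] if next_item is not None else None
    return rows, next_href


def scrape_all_pages(start_url, fetch):
    """Follow 'next' links from start_url; fetch(url) must return HTML text."""
    rows, url = [], start_url
    while url is not None:
        page_rows, next_href = parse_quote_page(fetch(url))
        rows.extend(page_rows)
        # Relative hrefs are resolved against the page they came from.
        url = urljoin(url, next_href) if next_href else None
    return pd.DataFrame(rows, columns=["quote", "author"])


# Two tiny fake pages so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/tag/love/": """
        <div class="quote"><span class="text">First quote</span>
        <small class="author">Author A</small></div>
        <ul class="pager"><li class="next"><a href="/tag/love/page/2/">Next</a></li></ul>
    """,
    "https://example.com/tag/love/page/2/": """
        <div class="quote"><span class="text">Second quote</span>
        <small class="author">Author B</small></div>
    """,
}

all_quotes = scrape_all_pages("https://example.com/tag/love/", FAKE_SITE.__getitem__)
```

Against the live site, `fetch` could be `lambda url: requests.get(url).text`, ideally with a short `time.sleep` between requests.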


---------------
## Web Scraping - Day10 HW

FREE FORM!!!

Find a website that interests you, scrape some information from it, and see if you can learn something interesting.

For example:

Try scraping a job website using the job name/keyword and location that you are interested in. Can you create a data frame with job title, location, pay, etc? 


**NOTE** Just for fun I included an example of a job search wordcloud visualization that I coded up a few years ago. Check it out.


Your final notebooks should:

- [ ] Be a completely new notebook with just the Day9 material in it (no You Try problems): read in the data and make the plots. Make sure to discuss what you see and comment on why your plots are great!
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] Run without errors from start to finish.

**Remember to proofread and proof-run your code.** Restart the kernel and run the whole notebook once before submitting.

In [14]:
df = pd.read_csv('data/opsd_germany_daily.csv', parse_dates=True, index_col=0)
df

| Date | Consumption | Wind | Solar | Wind+Solar |
|---|---|---|---|---|
| 2006-01-01 | 1069.18400 | NaN | NaN | NaN |
| 2006-01-02 | 1380.52100 | NaN | NaN | NaN |
| 2006-01-03 | 1442.53300 | NaN | NaN | NaN |
| 2006-01-04 | 1457.21700 | NaN | NaN | NaN |
| 2006-01-05 | 1477.13100 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| 2017-12-27 | 1263.94091 | 394.507 | 16.530 | 411.037 |
| 2017-12-28 | 1299.86398 | 506.424 | 14.162 | 520.586 |
| 2017-12-29 | 1295.08753 | 584.277 | 29.854 | 614.131 |
| 2017-12-30 | 1215.44897 | 721.247 | 7.467 | 728.714 |


## Data Basics and Preparation

Here you get some basic familiarity with the data:

- run stats on each of the variables
- count nans in each column
- look at the data types
- double check that you understand the variables and their units
- determine the date range and frequency
- add columns to the data set: Year, Month, and Weekday Name
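
The checks in this list can be sketched as follows. The small `demo` frame is a synthetic stand-in with made-up values, used only so the sketch runs without the CSV file; the same calls should apply to the real `df` as long as it has a `DatetimeIndex`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same column names as the data file (values are made up).
dates = pd.date_range("2017-01-01", periods=14, freq="D")
demo = pd.DataFrame(
    {
        "Consumption": np.linspace(1200.0, 1500.0, len(dates)),
        "Wind": [np.nan, np.nan] + [300.0] * (len(dates) - 2),
        "Solar": [20.0] * len(dates),
    },
    index=dates,
)
demo.index.name = "Date"

summary = demo.describe()               # count, mean, std, min, quartiles, max
nan_counts = demo.isna().sum()          # missing values per column
column_types = demo.dtypes              # data type of each column
date_range = (demo.index.min(), demo.index.max())
frequency = pd.infer_freq(demo.index)   # inferred spacing of the index

# Calendar columns derived from the DatetimeIndex.
demo["Year"] = demo.index.year
demo["Month"] = demo.index.month
demo["Weekday Name"] = demo.index.day_name()
```

Deriving the calendar columns from the index, rather than parsing date strings again, keeps them consistent with any later slicing or resampling.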

## Data Exploration - Basic Visualization

Start to make plots and see if you can generate some questions about the data. Make sure that you make observations about each plot - say what you see and what it means in terms of the data.

- Plot the overall consumption over time.
- Plot the wind and solar production over time.
- Choose a focal year and redo the plots to look at variability over the year.
- Redo this for a focal month.
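
One way to zoom in on a focal year or month is partial-string indexing on the `DatetimeIndex`. A minimal sketch, assuming a synthetic two-year `demo` series with made-up values in place of the real data; the focal year 2017 and the month of February are arbitrary choices for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic two-year daily series standing in for the real data (values are made up).
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
demo = pd.DataFrame(
    {"Consumption": 1300 + 100 * np.cos(2 * np.pi * day_of_year / 365.25)},
    index=dates,
)

# Partial-string indexing on a DatetimeIndex selects a whole year or month.
focal_year = demo.loc["2017"]
focal_month = demo.loc["2017-02"]

# One panel per time scale: full record, focal year, focal month.
fig, axes = plt.subplots(3, 1, figsize=(10, 8))
panels = [
    (demo, "Full record"),
    (focal_year, "Focal year: 2017"),
    (focal_month, "Focal month: February 2017"),
]
for ax, (data, title) in zip(axes, panels):
    ax.plot(data.index, data["Consumption"], linewidth=0.8)
    ax.set_title(title)
    ax.set_ylabel("Consumption")
fig.tight_layout()
```

Stacking the three panels makes it easier to compare what is visible at each time scale, for example a yearly cycle in the full record versus day-to-day variation within a month.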

## Further Exploration

Now continue exploring the data to see what you can find out. Remember to explain what you are learning from each graph or calculation. Add guiding words in markdown to talk about what your code should be doing and why.

- How does seasonality affect the energy consumption? Consider the consumption grouped on a monthly basis. You could look at max, min, mean, etc. Make an interesting plot of this data (bar plot, box plot, etc.). What do you learn?

- How does the day of the week change energy consumption?

- Using downsampling, plot on the same graph the daily (original data) and the average weekly (downsampled data) production for both solar and wind.

- Using a rolling window, plot the yearly rolling average of both wind and solar production.

- See if you can come up with a really cool graph of your own!
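
The grouping, downsampling, and rolling-average steps in this list can be sketched as follows. The `demo` frame is synthetic (random, made-up values) so the sketch runs without the CSV file, and the 365-day centered window with `min_periods=360` is one possible choice rather than a required setting:

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the real file (values are made up).
dates = pd.date_range("2015-01-01", "2017-12-31", freq="D")
rng = np.random.default_rng(0)
day_of_year = dates.dayofyear.to_numpy()
demo = pd.DataFrame(
    {
        "Consumption": 1300 + 100 * np.cos(2 * np.pi * day_of_year / 365.25),
        "Wind": rng.uniform(50, 500, len(dates)),
        "Solar": rng.uniform(5, 200, len(dates)),
    },
    index=dates,
)

# Seasonality: summary statistics of consumption for each calendar month.
monthly_stats = demo.groupby(demo.index.month)["Consumption"].agg(["min", "mean", "max"])

# Day-of-week effect: mean consumption per weekday, ordered Monday to Sunday.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_mean = (
    demo.groupby(demo.index.day_name())["Consumption"].mean().reindex(weekday_order)
)

# Downsampling: collapse daily values to weekly means (fewer rows than the original).
weekly_mean = demo[["Wind", "Solar"]].resample("W").mean()

# Rolling average: a centered 365-day window smooths the series but keeps one row per day.
yearly_trend = demo[["Wind", "Solar"]].rolling(window=365, center=True, min_periods=360).mean()
```

Note the difference between the last two steps: `resample` reduces the number of rows, while `rolling` returns a value for every original date, with `NaN` near the edges where the window is incomplete. Plotting `demo["Wind"]` and `weekly_mean["Wind"]` on the same axes shows the daily and weekly views together.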

## Conclusion