# Web Scraping with BeautifulSoup and Requests

    Practicing web scraping. 
    
Video that helped me: https://www.youtube.com/watch?v=ng2o98k983k&list=WL&index=32&t=0s

## Agenda
        Context
        1 - Getting the post title
        2 - Getting the post date
        3 - Getting the post description
        4 - Getting the post tags
        5 - Modeling?
        6 - Joining information in a DataFrame

In [1]:
# Imports

# Web scraping tools
import requests
from bs4 import BeautifulSoup

# To create the DataFrame
import pandas as pd

## Context 
    Let's say we want to build a dataset with information from
    Corey Schafer's website, specifically each article's
    title, post date, description and tags.

    requests.get().text fetches the page's HTML as a string,
    and BeautifulSoup() parses it, so we can use the methods and
    attributes of the BeautifulSoup object to obtain all the data we need.

In [2]:
# Requesting the page and reading its HTML content

source = requests.get('http://coreyms.com').text
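
A slightly more defensive variant of the request, as a sketch: it adds a timeout and raises on HTTP error codes instead of silently parsing an error page. The helper name `fetch_html` and the 10-second default are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising on HTTP errors (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Usage (performs a real network request):
# source = fetch_html('http://coreyms.com')
```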

In [3]:
# Transforming the raw content into a BeautifulSoup object

soup = BeautifulSoup(source, 'lxml')
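
The `'lxml'` parser is a separate package, so the line above fails if it is not installed. A minimal sketch of a fallback to Python's built-in `'html.parser'` is shown here; the helper name `make_soup` is an assumption for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse `html` with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:  # raised by bs4 when the requested parser is missing
        return BeautifulSoup(html, "html.parser")


print(make_soup("<title>Hi</title>").title.text)
```

The two parsers can build slightly different trees for malformed HTML, so it is worth sticking to one of them once the scraper works.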

In [4]:
# With .prettify() we can see the HTML better structured (try and print only 'soup')

print(soup.prettify()[0:500])

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- This site is optimized with the Yoast SEO plugin v14.4.1 - https://yoast.com/wordpress/plugins/seo/ -->
  <title>
   CoreyMS - Development, Design, DIY, and more
  </title>
  <meta content="Development, Design, DIY, and more" name="description"/>
  <meta content="index, follow" name="robots"/>
  <meta content="index, follow, max-snippet:-1, max-imag


    The easiest way to get information from a tag is to access it
    like an attribute.

In [5]:
soup.title

<title>CoreyMS - Development, Design, DIY, and more</title>

In [6]:
# Now, to get the actual content (without the <title> tags) we use the .text attribute

soup.title.text

'CoreyMS - Development, Design, DIY, and more'

    Notice that soup.title returns the first title tag in the HTML
    (look back at soup.prettify()[0:500]),
    
    but the first tag is not always the one we want. In that case we can use the
    .find() method, passing arguments that identify the exact
    tag we're looking for.

In [7]:
'''
    soup.find('div') will give us the first (and long) div of the page.
    Let's take a look at a specific div with a class called 'entry-content'
    
    We can pass the attrs (attributes) parameter a dictionary containing
    the HTML attribute name and its value
'''
soup.find('div', attrs = {'class':'entry-content'})

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>

In [8]:
# Shorter way to achieve the same result: pass the attribute as a keyword argument
# ps: the trailing underscore in class_ is essential, because 'class' is a reserved word in Python

soup.find('div', class_ = 'entry-content')

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>
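
As a small self-contained check of the two `find()` spellings, the sketch below runs them on a made-up HTML snippet (the snippet and the `sample_` names are assumptions, not the real page) and also shows the CSS-selector form `select_one()`, which Beautiful Soup offers as a third option.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names used on the page
sample_html = """
<div class="sidebar">Menu</div>
<div class="entry-content" itemprop="text"><p>First post</p></div>
<div class="entry-content" itemprop="text"><p>Second post</p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_attrs = sample_soup.find("div", attrs={"class": "entry-content"})  # dictionary form
by_keyword = sample_soup.find("div", class_="entry-content")          # keyword form
by_css = sample_soup.select_one("div.entry-content")                  # CSS selector form

# All three point at the first matching div
print(by_attrs.p.text, by_keyword.p.text, by_css.p.text)
```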

### 1 - Getting the post title
    If we want to get multiple things from a page, a good way to start is just to get
    one of whatever it is we want to parse and then use a loop to get all of them.

In [9]:
title = soup.find('a', class_ = 'entry-title-link')
title

<a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a>

In [10]:
# We did it, one of the titles from the page

title.contents[0]

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'

In [11]:
'''
    To fetch all of the 'a' tags with the 'entry-title-link' class attribute
    we need to use the .find_all() method
'''
soup.find_all('a', class_ = 'entry-title-link')

[<a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a>,
 <a class="entry-title-link" href="https://coreyms.com/development/python/python-data-science-tutorial-analyzing-the-2019-stack-overflow-developer-survey" rel="bookmark">Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey</a>,
 <a class="entry-title-link" href="https://coreyms.com/development/python/python-multiprocessing-tutorial-run-code-in-parallel-using-the-multiprocessing-module" rel="bookmark">Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module</a>,
 <a class="entry-title-link" href="https://coreyms.com/development/python/python-threading-tutorial-run-code-concurrently-using-the-threading-module" rel="bookmark">Python Threading Tutorial: Run Code Concurrently Using the Threading Module</a>,
 <a cl

In [12]:
# Now, using a loop, let's put them into a list 

page_titles = soup.find_all('a', class_ = 'entry-title-link')

titles = []

for title in page_titles:
    titles.append(title.contents[0])
    

# YEEAAHHHH \(*O*)/ IT WORKED!!!
titles

['Python Tutorial: Zip Files – Creating and Extracting Zip Archives',
 'Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey',
 'Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module',
 'Python Threading Tutorial: Run Code Concurrently Using the Threading Module',
 'Update (2019-09-03)',
 'Python Quick Tip: The Difference Between “==” and “is” (Equality vs Identity)',
 'Python Tutorial: Calling External Commands Using the Subprocess Module',
 'Visual Studio Code (Windows) – Setting up a Python Development Environment and Complete Overview',
 'Visual Studio Code (Mac) – Setting up a Python Development Environment and Complete Overview',
 'Clarifying the Issues with Mutable Default Arguments']
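
The same loop can also be written as a list comprehension. This is only a sketch on made-up HTML (the snippet and `sample_` names are assumptions); it uses `get_text(strip=True)`, which also trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two title links and one unrelated link
sample_html = """
<a class="entry-title-link" href="#">First title</a>
<a class="other-link" href="#">Not a title</a>
<a class="entry-title-link" href="#">  Second title </a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# One pass over the matching links, keeping only their stripped text
sample_titles = [
    link.get_text(strip=True)
    for link in sample_soup.find_all("a", class_="entry-title-link")
]
print(sample_titles)
```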

### 2 - Getting the post date
    It's basically the same process.

In [13]:
# Looking for one

soup.find('time', class_='entry-time')

<time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">November 19, 2019</time>

In [14]:
# Getting only the information we aim for

date = soup.find('time', class_='entry-time')

date.contents[0]

'November 19, 2019'

In [15]:
# Finding all

soup.find_all('time', class_='entry-time')

[<time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">November 19, 2019</time>,
 <time class="entry-time" datetime="2019-10-17T12:35:51-04:00" itemprop="datePublished">October 17, 2019</time>,
 <time class="entry-time" datetime="2019-09-21T10:59:18-04:00" itemprop="datePublished">September 21, 2019</time>,
 <time class="entry-time" datetime="2019-09-12T10:49:54-04:00" itemprop="datePublished">September 12, 2019</time>,
 <time class="entry-time" datetime="2019-09-03T16:42:01-04:00" itemprop="datePublished">September 3, 2019</time>,
 <time class="entry-time" datetime="2019-08-06T12:17:28-04:00" itemprop="datePublished">August 6, 2019</time>,
 <time class="entry-time" datetime="2019-07-24T15:26:19-04:00" itemprop="datePublished">July 24, 2019</time>,
 <time class="entry-time" datetime="2019-05-01T14:03:24-04:00" itemprop="datePublished">May 1, 2019</time>,
 <time class="entry-time" datetime="2019-05-01T14:01:45-04:00" itemprop="datePublished">May 1, 2019<

In [16]:
# Looping through to get all the data. This time I used .text, which returns all the
# text inside the tag (.contents[0] returns only the tag's first child)

page_post_dates = soup.find_all('time', class_ = 'entry-time')
post_dates = []

for post_date in page_post_dates:
    
    post_dates.append(post_date.text)
    
post_dates

['November 19, 2019',
 'October 17, 2019',
 'September 21, 2019',
 'September 12, 2019',
 'September 3, 2019',
 'August 6, 2019',
 'July 24, 2019',
 'May 1, 2019',
 'May 1, 2019',
 'April 24, 2019']
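
The `time` tags also carry a machine-readable `datetime` attribute, which can be easier to sort and filter than the display text. A minimal sketch, assuming the attribute keeps the ISO 8601 format shown in the output above; the one-tag HTML snippet is copied from that output for illustration.

```python
from datetime import datetime

from bs4 import BeautifulSoup

sample_html = (
    '<time class="entry-time" datetime="2019-11-19T13:02:37-05:00" '
    'itemprop="datePublished">November 19, 2019</time>'
)
time_tag = BeautifulSoup(sample_html, "html.parser").find("time", class_="entry-time")

# Tag attributes are read like dictionary keys
published = datetime.fromisoformat(time_tag["datetime"])
print(published.date().isoformat())
```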

### 3 - Getting the post description

In [17]:
description = soup.find('div', class_='entry-content')

In [18]:
description

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>

In [19]:
description.p

<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>

In [20]:
description.p.text

'In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…'

In [21]:
post_descriptions = soup.find_all('div', class_='entry-content')

descriptions = []

for description in post_descriptions:
    descriptions.append(description.p.text)
    
descriptions

['In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…',
 'In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…',
 'In this Python Programming video, we will be learning how to run code in parallel using the multiprocessing module. We will also look at how to process multiple high-resolution images at the same time using a ProcessPoolExecutor from the concurrent.futures module. Let’s get started…',
 'In this Python Programming video, we will be learning how to run threads conc

In [22]:
descriptions[0]

'In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…'
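
One thing to watch for: `description.p` is `None` when a div has no `<p>` inside it, and calling `.text` on `None` raises an `AttributeError`. The sketch below guards against that on made-up HTML; the helper name `first_paragraph_text` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second div has no <p> tag
sample_html = """
<div class="entry-content"><p>Has a paragraph.</p></div>
<div class="entry-content"><span>Video only, no paragraph</span></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def first_paragraph_text(container):
    """Return the text of the first <p> inside `container`, or None if there is none."""
    paragraph = container.p
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


sample_descriptions = [
    first_paragraph_text(div)
    for div in sample_soup.find_all("div", class_="entry-content")
]
print(sample_descriptions)
```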

### 4 - Getting the post tags
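    The approach is a little different here because of how .contents and .text work:
    .contents is a list of a tag's direct children (strings and nested tags), while
    .text joins the text of everything inside the tag into a single string.
    The tag names live in nested <a> tags, so .contents[0] would only return
    'Tagged With: '. I encourage you to change the code and see the differences
    for yourself.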
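
To make the difference between `.contents` and `.text` concrete, here is a small sketch on a shortened, made-up version of the tags span (the snippet is an assumption modeled on the real markup).

```python
from bs4 import BeautifulSoup

sample_html = (
    '<span class="entry-tags">Tagged With: '
    '<a href="#" rel="tag">gzip</a>, <a href="#" rel="tag">shutil</a></span>'
)
span = BeautifulSoup(sample_html, "html.parser").find("span", class_="entry-tags")

children = span.contents        # direct children: a mix of strings and Tag objects
first_child = span.contents[0]  # only the first child, here the plain string before the links
all_text = span.text            # the text of every descendant, joined into one string

print(len(children))
print(repr(first_child))
print(repr(all_text))
```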

In [23]:
soup.find('span',class_ = 'entry-tags')

<span class="entry-tags">Tagged With: <a href="https://coreyms.com/tag/gzip" rel="tag">gzip</a>, <a href="https://coreyms.com/tag/shutil" rel="tag">shutil</a>, <a href="https://coreyms.com/tag/zip" rel="tag">zip</a>, <a href="https://coreyms.com/tag/zipfile" rel="tag">zipfile</a></span>

In [24]:
tag = soup.find('span',class_ = 'entry-tags')
tag.text

'Tagged With: gzip, shutil, zip, zipfile'

In [25]:
tags = soup.find_all('span',class_ = 'entry-tags')
tags[0].text[13:]   # skip the 13-character 'Tagged With: ' prefix and keep the rest

'gzip, shutil, zip, zipfile'

In [26]:
page_tags = soup.find_all('span',class_ = 'entry-tags')

tags = []

for tag in page_tags:
    tags.append(tag.text[13:])   # [13:] drops only the 'Tagged With: ' prefix

tags

['gzip, shutil, zip, zipfile',
 'data analysis, Data Science, stack overflow',
 'asynchronous, concurrent.futures, multiprocessing, parallel, threading',
 'asynchronous, concurrency, multiprocessing, threading',
 '== vs is, equality, identity',
 'standard library, subprocess',
 'Development Environment, visual studio code, visual studios, vs code, vscode',
 'Development Environment, visual studio code, visual studios, vs code, vscode',
 'common errors, common mistakes, functions, mutable default arguments']
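
Slicing by a fixed offset depends on the exact 'Tagged With: ' prefix. An alternative sketch reads the tag names directly from the `<a>` links inside each span, so no character counting is needed. The HTML snippet below is a shortened, made-up version of the real markup.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<span class="entry-tags">Tagged With: '
    '<a href="https://coreyms.com/tag/gzip" rel="tag">gzip</a>, '
    '<a href="https://coreyms.com/tag/zipfile" rel="tag">zipfile</a></span>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# One list of tag names per span; every link inside the span is a tag
sample_tag_lists = [
    [link.get_text() for link in span.find_all("a")]
    for span in sample_soup.find_all("span", class_="entry-tags")
]
# Join back into one string per post to match the format used in this notebook
sample_tag_strings = [", ".join(names) for names in sample_tag_lists]
print(sample_tag_strings)
```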

## Modeling?
    Now we can put everything together into a single cell that we can
    run again and again to scrape the page.
    Reviewing it, though, it is not really an optimization: it is a long way to do it,
    since I basically just placed the for loops one after another...
    
    ps: there was a problem with the structure of one article (the
        'Update (2019-09-03)' post, which has no tags), so the lists did not line up.
        I solved it manually by skipping that post, but that is not the best
        solution. We are always trying to automate things.

In [33]:
#      Titles
page_title_records = soup.find_all('a', class_ = 'entry-title-link')
titles = []

for title in page_title_records:
    if title.contents[0] == 'Update (2019-09-03)':
        continue
    else:
        titles.append(title.contents[0])

    
#      Post dates
page_post_dates = soup.find_all('time', class_ = 'entry-time')
post_dates = []

for post_date in page_post_dates:
    if  post_date.text == 'September 3, 2019':
        continue
    else:
        post_dates.append(post_date.text)
    

#      Descriptions
post_descriptions = soup.find_all('div', class_='entry-content')
descriptions = []
not_wanted_description = 'Hey everyone. I wanted to give you an update on my videos. I will be releasing videos on threading and multiprocessing within the next week. Thanks so much for your patience. I currently have a temporary recording studio setup at my Airbnb that will allow me to record and edit the threading/multiprocessing videos. I am going to be moving into my new house in 10 days and once I have my recording studio setup then you can expect much faster video releases. I really appreciate how patient everyone has been while I go through this move, especially those of you who are contributing monthly through YouTube '
for description in post_descriptions:
    if description.p.text == not_wanted_description:
        continue
    else:
        descriptions.append(description.p.text)
    

#      Tags
page_tags = soup.find_all('span',class_ = 'entry-tags')
tags = []

for tag in page_tags:
    tags.append(tag.text[13:])   # [13:] drops only the 'Tagged With: ' prefix
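
One way to avoid hard-coding the post to skip is to parse each post as a unit, so a missing field becomes `None` instead of shifting the lists out of alignment. The sketch below assumes each post is wrapped in an `<article>` tag, which this notebook does not verify; the HTML snippet and the helper names `text_or_none` and `parse_article` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second post has no tags span
sample_html = """
<article>
  <a class="entry-title-link" href="#">Zip Files</a>
  <time class="entry-time">November 19, 2019</time>
  <div class="entry-content"><p>Create and extract zip archives.</p></div>
  <span class="entry-tags">Tagged With: <a href="#">gzip</a>, <a href="#">zip</a></span>
</article>
<article>
  <a class="entry-title-link" href="#">Update (2019-09-03)</a>
  <time class="entry-time">September 3, 2019</time>
  <div class="entry-content"><p>An update on my videos.</p></div>
</article>
"""


def text_or_none(tag):
    """Return the stripped text of `tag`, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_article(article):
    """Collect the four fields of one post into a dictionary."""
    content = article.find("div", class_="entry-content")
    tags_span = article.find("span", class_="entry-tags")
    return {
        "title": text_or_none(article.find("a", class_="entry-title-link")),
        "post_date": text_or_none(article.find("time", class_="entry-time")),
        "description": text_or_none(content.p if content is not None else None),
        "tags": (
            ", ".join(link.get_text() for link in tags_span.find_all("a"))
            if tags_span is not None
            else None
        ),
    }


sample_soup = BeautifulSoup(sample_html, "html.parser")
records = [parse_article(article) for article in sample_soup.find_all("article")]
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`, and posts without tags can then be dropped or kept as needed.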

## Joining information in a DataFrame

In [34]:
data_frame = pd.DataFrame(data=titles, columns=['title'])
data_frame

|   | title |
|---|-------|
| 0 | Python Tutorial: Zip Files – Creating and Extr... |
| 1 | Python Data Science Tutorial: Analyzing the 20... |
| 2 | Python Multiprocessing Tutorial: Run Code in P... |
| 3 | Python Threading Tutorial: Run Code Concurrent... |
| 4 | Python Quick Tip: The Difference Between “==” ... |
| 5 | Python Tutorial: Calling External Commands Usi... |
| 6 | Visual Studio Code (Windows) – Setting up a Py... |
| 7 | Visual Studio Code (Mac) – Setting up a Python... |
| 8 | Clarifying the Issues with Mutable Default Arg... |


In [35]:
data_frame = data_frame.join(pd.DataFrame(data=post_dates, columns=['post_dates']))
data_frame

|   | title | post_dates |
|---|-------|------------|
| 0 | Python Tutorial: Zip Files – Creating and Extr... | November 19, 2019 |
| 1 | Python Data Science Tutorial: Analyzing the 20... | October 17, 2019 |
| 2 | Python Multiprocessing Tutorial: Run Code in P... | September 21, 2019 |
| 3 | Python Threading Tutorial: Run Code Concurrent... | September 12, 2019 |
| 4 | Python Quick Tip: The Difference Between “==” ... | August 6, 2019 |
| 5 | Python Tutorial: Calling External Commands Usi... | July 24, 2019 |
| 6 | Visual Studio Code (Windows) – Setting up a Py... | May 1, 2019 |
| 7 | Visual Studio Code (Mac) – Setting up a Python... | May 1, 2019 |
| 8 | Clarifying the Issues with Mutable Default Arg... | April 24, 2019 |


In [36]:
data_frame = data_frame.join(pd.DataFrame(data=descriptions, columns=['descriptions']))
data_frame

|   | title | post_dates | descriptions |
|---|-------|------------|--------------|
| 0 | Python Tutorial: Zip Files – Creating and Extr... | November 19, 2019 | In this video, we will be learning how to crea... |
| 1 | Python Data Science Tutorial: Analyzing the 20... | October 17, 2019 | In this Python Programming video, we will be l... |
| 2 | Python Multiprocessing Tutorial: Run Code in P... | September 21, 2019 | In this Python Programming video, we will be l... |
| 3 | Python Threading Tutorial: Run Code Concurrent... | September 12, 2019 | In this Python Programming video, we will be l... |
| 4 | Python Quick Tip: The Difference Between “==” ... | August 6, 2019 | In this Python Programming Tutorial, we will b... |
| 5 | Python Tutorial: Calling External Commands Usi... | July 24, 2019 | In this Python Programming Tutorial, we will b... |
| 6 | Visual Studio Code (Windows) – Setting up a Py... | May 1, 2019 | In this Python Programming Tutorial, we will b... |
| 7 | Visual Studio Code (Mac) – Setting up a Python... | May 1, 2019 | In this Python Programming Tutorial, we will b... |
| 8 | Clarifying the Issues with Mutable Default Arg... | April 24, 2019 | In this Python Programming Tutorial, we will b... |


In [37]:
data_frame = data_frame.join(pd.DataFrame(data=tags, columns=['tags']))
data_frame

|   | title | post_dates | descriptions | tags |
|---|-------|------------|--------------|------|
| 0 | Python Tutorial: Zip Files – Creating and Extr... | November 19, 2019 | In this video, we will be learning how to crea... | gzip, shutil, zip, zipfile |
| 1 | Python Data Science Tutorial: Analyzing the 20... | October 17, 2019 | In this Python Programming video, we will be l... | data analysis, Data Science, stack overflow |
| 2 | Python Multiprocessing Tutorial: Run Code in P... | September 21, 2019 | In this Python Programming video, we will be l... | asynchronous, concurrent.futures, multiprocess... |
| 3 | Python Threading Tutorial: Run Code Concurrent... | September 12, 2019 | In this Python Programming video, we will be l... | asynchronous, concurrency, multiprocessing, th... |
| 4 | Python Quick Tip: The Difference Between “==” ... | August 6, 2019 | In this Python Programming Tutorial, we will b... | == vs is, equality, identity |
| 5 | Python Tutorial: Calling External Commands Usi... | July 24, 2019 | In this Python Programming Tutorial, we will b... | standard library, subprocess |
| 6 | Visual Studio Code (Windows) – Setting up a Py... | May 1, 2019 | In this Python Programming Tutorial, we will b... | Development Environment, visual studio code, v... |
| 7 | Visual Studio Code (Mac) – Setting up a Python... | May 1, 2019 | In this Python Programming Tutorial, we will b... | Development Environment, visual studio code, v... |
| 8 | Clarifying the Issues with Mutable Default Arg... | April 24, 2019 | In this Python Programming Tutorial, we will b... | common errors, common mistakes, functions, mut... |
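
The chain of `.join()` calls aligns the columns by index. An alternative sketch builds the frame in one step from a dictionary of lists; the `sample_` lists are made-up stand-ins for the scraped ones. A side benefit is that pandas raises a `ValueError` when the lists have different lengths, which surfaces misaligned data early.

```python
import pandas as pd

# Made-up stand-ins for the scraped lists
sample_titles = ["First post", "Second post"]
sample_dates = ["November 19, 2019", "October 17, 2019"]
sample_tags = ["gzip, zip", "data analysis"]

# Each dictionary key becomes a column name
sample_frame = pd.DataFrame(
    {"title": sample_titles, "post_dates": sample_dates, "tags": sample_tags}
)
print(sample_frame)

# Lists of different lengths are rejected instead of being silently padded
try:
    pd.DataFrame({"title": sample_titles, "tags": sample_tags[:1]})
    lengths_checked = False
except ValueError:
    lengths_checked = True
print(lengths_checked)
```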


In [32]:
# Finally we can create a csv with our collected data

data_frame.to_csv('Corey_Schafer_Articles.csv')
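
By default `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A small sketch of the difference, using a made-up one-row frame and in-memory strings instead of a file:

```python
import io

import pandas as pd

sample_frame = pd.DataFrame(
    {"title": ["First post"], "post_dates": ["November 19, 2019"]}
)

with_index = sample_frame.to_csv()                # no path given: returns the CSV as a string
without_index = sample_frame.to_csv(index=False)  # leaves the row index out

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the default output back shows the extra column
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

Passing `index=False` in the cell above would keep that extra column out of `Corey_Schafer_Articles.csv`.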