# Web Scraping with BeautifulSoup and Requests

    Web scraping practice, version 2. After watching the whole video, I could
    take a slightly different (maybe better?) approach to web scraping.
    
Video that helped me: https://www.youtube.com/watch?v=ng2o98k983k&list=WL&index=32&t=0s


## Agenda
        Context
        Fetching the information
            1 - Getting the post title
            2 - Getting the post date
            3 - Getting the post description
            4 - Getting the post tags
            5 - Modeling?
            6 - Storing our work in a DataFrame  
        

In [24]:
# Web scraping
import requests
from bs4 import BeautifulSoup

# Making a DataFrame with the data we will scrape
import pandas as pd

## Context 
    Let's say we want to build a dataset with information from
    Corey Schafer's website, and that we specifically want each article's
    title, post date, description, and tags.

requests.get().text returns the HTML of the web page as a string,

and BeautifulSoup() parses that string, so we can use the methods and
attributes of the resulting BeautifulSoup object to obtain all the data
we need.
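
To make the parsing step concrete without hitting the network, here is a minimal sketch that parses a small inline HTML string. The `sample_html` markup and the variable names are hypothetical stand-ins, not the real page, and the sketch uses the built-in `'html.parser'` rather than the `'lxml'` parser used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the string that requests.get(url).text would return
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <article><h2><a href="/first-post">First post</a></h2></article>
  </body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' needs a separate install
sample_soup = BeautifulSoup(sample_html, 'html.parser')

page_title = sample_soup.title.text         # text inside <title>
first_link = sample_soup.find('a')['href']  # value of the href attribute
print(page_title, first_link)
```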

In [2]:
source = requests.get('http://coreyms.com').text

In [3]:
soup = BeautifulSoup(source, 'lxml')

In [4]:
# With .prettify() we can see the HTML better structured

print(soup.prettify()[0:500])

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- This site is optimized with the Yoast SEO plugin v14.4.1 - https://yoast.com/wordpress/plugins/seo/ -->
  <title>
   CoreyMS - Development, Design, DIY, and more
  </title>
  <meta content="Development, Design, DIY, and more" name="description"/>
  <meta content="index, follow" name="robots"/>
  <meta content="index, follow, max-snippet:-1, max-imag


In [5]:
# We can go after a tag that contains all the data we want
# and use it ('article') as the object we'll parse

article = soup.find('article')
print(article.prettify())

<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
    Python Tutorial: Zip Files – Creating and Extracting Zip Archives
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
    November 19, 2019
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
    <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </spa

## Fetching the information 
    
### 1 - Getting the post title
    If we want to get multiple things from a page, a good way to start is just to get
    one of whatever it is we want to parse and then use a loop to get all of them.
    
    And the easiest way to get information from a tag is to access it
    like an attribute.

In [6]:
article.h2

<h2 class="entry-title" itemprop="headline"><a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a></h2>

In [7]:
# It is possible to go deeper into the tags

article.h2.a

<a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a>

In [8]:
# and then: Article title obtained.

article.h2.a.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'
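
One thing to keep in mind with attribute-style access: it returns only the first matching tag, and it returns `None` when no such tag exists. The sketch below illustrates this on hypothetical markup (the `html` string and the variable names are made up for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no <h2>
html = """
<article><h2><a href="/a">Post A</a></h2></article>
<article><p>No heading here</p></article>
"""
soup_demo = BeautifulSoup(html, 'html.parser')
first, second = soup_demo.find_all('article')

title_a = first.h2.a.text   # attribute access returns the first matching descendant
missing = second.h2         # None, because this article has no <h2>

# Guard before chaining; calling .a on None would raise AttributeError
heading = second.h2
if heading is not None and heading.a is not None:
    title_b = heading.a.text
else:
    title_b = None

print(title_a, missing, title_b)
```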

### 2 - Getting the post date

    It is important to look at the HTML and find the tags that
    can lead you to the data you are after.
    
    It's not shown here, but I ran print(article.prettify()) again to
    look at the HTML and figure out which tag could bring me closer to
    the post date.

In [9]:
article.time

<time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">November 19, 2019</time>

In [10]:
article.time.text

'November 19, 2019'
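
If the date is needed as a real date object rather than a string, the standard library can parse both forms seen in the `<time>` tag above. This is a small sketch using only `datetime`; the two strings are copied from the output above, and in the notebook the second one could presumably be read with `article.time['datetime']`.

```python
from datetime import datetime

date_text = 'November 19, 2019'          # the visible text of the <time> tag
date_attr = '2019-11-19T13:02:37-05:00'  # the tag's datetime attribute

# %B matches the full month name (English names under the default C locale)
parsed_text = datetime.strptime(date_text, '%B %d, %Y')

# ISO 8601 string with a UTC offset; fromisoformat needs Python 3.7+
parsed_attr = datetime.fromisoformat(date_attr)

print(parsed_text.date(), parsed_attr.isoformat())
```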

### 3 - Getting the post description

In [11]:
article.div

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>

In [12]:
article.div.p

<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>

In [13]:
article.div.p.text 

'In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…'


    Changing the approach, we could also do:

    

In [14]:
article.find('div', class_='entry-content')

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>

In [15]:
article.find('div', class_='entry-content').p.text

'In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…'
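
Both approaches give the same result here because the content `div` happens to be the first `div` in the article. The sketch below, on hypothetical markup, shows a case where they would differ, which is one reason the `find()` call with `class_` can be the safer choice.

```python
from bs4 import BeautifulSoup

# Hypothetical article where a wrapper div comes before the content div
html = """
<article>
  <div class="entry-thumbnail"><p>Image caption</p></div>
  <div class="entry-content"><p>Post description</p></div>
</article>
"""
demo_article = BeautifulSoup(html, 'html.parser').article

# Attribute access picks whichever <div> comes first
by_attribute = demo_article.div.p.text

# find() with class_ targets the div we actually want
by_class = demo_article.find('div', class_='entry-content').p.text

print(by_attribute, '|', by_class)
```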

### 4 - Getting the post tags
    Using the new approach (find() with class_)

In [16]:
article.find('span', class_='entry-tags')

<span class="entry-tags">Tagged With: <a href="https://coreyms.com/tag/gzip" rel="tag">gzip</a>, <a href="https://coreyms.com/tag/shutil" rel="tag">shutil</a>, <a href="https://coreyms.com/tag/zip" rel="tag">zip</a>, <a href="https://coreyms.com/tag/zipfile" rel="tag">zipfile</a></span>

In [17]:
article.find('span', class_='entry-tags').text 

'Tagged With: gzip, shutil, zip, zipfile'

In [18]:
# The footer text mixes the categories in with the tags, so it isn't quite the format we want

article.footer.text

'Filed Under: Development, Python Tagged With: gzip, shutil, zip, zipfile'
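
If the footer text is kept, it can still be separated into categories and tags with plain string methods. The helper below is a hypothetical sketch (the name `split_footer` is not from the video) and assumes the footer always follows the "Filed Under: ... Tagged With: ..." pattern seen above.

```python
def split_footer(footer_text):
    """Split the footer text into (categories, tags) lists."""
    # Everything before 'Tagged With:' is the category part; the rest is the tag part
    filed, _, tagged = footer_text.partition('Tagged With:')
    filed = filed.replace('Filed Under:', '')
    categories = [c.strip() for c in filed.split(',') if c.strip()]
    tag_list = [t.strip() for t in tagged.split(',') if t.strip()]
    return categories, tag_list


example = 'Filed Under: Development, Python Tagged With: gzip, shutil, zip, zipfile'
print(split_footer(example))
```

Posts without any tags simply come back with an empty tag list.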

## Modeling?
    Now we can build a reusable pattern that lets us scrape every
    article the same way. It's a sort of code optimization.

In [19]:
# These will give us the data we are looking for

article.h2.a.text
article.time.text 
article.div.p.text 
article.footer.text

'Filed Under: Development, Python Tagged With: gzip, shutil, zip, zipfile'

In [20]:
# Testing to see if we'll receive the correct content

articles = soup.find_all('article')

for article in articles:
    print(article.h2.a.text)
    print(article.time.text)
    print(article.div.p.text)
    print(article.footer.text)
    break # We'll get out of the loop right after the first iteration

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
November 19, 2019
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
Filed Under: Development, Python Tagged With: gzip, shutil, zip, zipfile
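
One possible way to package this pattern is a small function that takes a single `<article>` tag and returns a dict. The sketch below is an assumption about how that could look, not the approach from the video; `text_or_none`, `parse_article`, and the sample markup are all hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.text.strip() if tag is not None else None


def parse_article(article):
    """Collect the four fields from one <article> tag into a dict."""
    content = article.find('div', class_='entry-content')
    paragraph = content.find('p') if content is not None else None
    return {
        'post_title': text_or_none(article.find('h2')),
        'post_date': text_or_none(article.find('time')),
        'post_description': text_or_none(paragraph),
        'post_tags': text_or_none(article.find('footer')),
    }


# Hypothetical markup: the second article is missing most fields
html = """
<article>
  <h2><a href="/a">Post A</a></h2>
  <time>May 1, 2019</time>
  <div class="entry-content"><p>About A</p></div>
  <footer>Filed Under: General</footer>
</article>
<article><h2><a href="/b">Post B</a></h2></article>
"""
rows = [parse_article(a) for a in BeautifulSoup(html, 'html.parser').find_all('article')]
print(rows)
```

Missing fields come back as `None` instead of raising, so every article still produces one row.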


In [40]:
articles = soup.find_all('article')

# Lists to keep the scraped information
titles = []
dates = []
descriptions = []
tags = []

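# Putting the information into the lists
for article in articles:
    try:
        # Read all four fields first so a failure leaves the lists untouched
        title = article.h2.a.text
        date = article.time.text
        description = article.div.p.text
        tag_text = article.footer.text
    except AttributeError:
        # A missing tag returns None, so chaining raises AttributeError;
        # skip the article instead of overwriting the lists with None
        continue
    titles.append(title)
    dates.append(date)
    descriptions.append(description)
    tags.append(tag_text)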

# Testing if we are receiving the right data
print(tags)

['Filed Under: Development, Python Tagged With: gzip, shutil, zip, zipfile', 'Filed Under: Development, Python Tagged With: data analysis, Data Science, stack overflow', 'Filed Under: Development, Python Tagged With: asynchronous, concurrent.futures, multiprocessing, parallel, threading', 'Filed Under: Development, Python Tagged With: asynchronous, concurrency, multiprocessing, threading', 'Filed Under: General ', 'Filed Under: Development, Python Tagged With: == vs is, equality, identity', 'Filed Under: Development, Python Tagged With: standard library, subprocess', 'Filed Under: Development, Python Tagged With: Development Environment, visual studio code, visual studios, vs code, vscode', 'Filed Under: Development, Python Tagged With: Development Environment, visual studio code, visual studios, vs code, vscode', 'Filed Under: Development, Python Tagged With: common errors, common mistakes, functions, mutable default arguments']


## Storing our work in a DataFrame
        For now, the only way I know to do this is to create one
        DataFrame per list and join them to each other.

In [41]:
data_frame = pd.DataFrame(data=titles, columns=['post_title'])
data_frame

Unnamed: 0,post_title
0,Python Tutorial: Zip Files – Creating and Extr...
1,Python Data Science Tutorial: Analyzing the 20...
2,Python Multiprocessing Tutorial: Run Code in P...
3,Python Threading Tutorial: Run Code Concurrent...
4,Update (2019-09-03)
5,Python Quick Tip: The Difference Between “==” ...
6,Python Tutorial: Calling External Commands Usi...
7,Visual Studio Code (Windows) – Setting up a Py...
8,Visual Studio Code (Mac) – Setting up a Python...
9,Clarifying the Issues with Mutable Default Arg...


In [42]:
data_frame = data_frame.join(pd.DataFrame(data=dates, columns=['post_dates']))
data_frame

Unnamed: 0,post_title,post_dates
0,Python Tutorial: Zip Files – Creating and Extr...,"November 19, 2019"
1,Python Data Science Tutorial: Analyzing the 20...,"October 17, 2019"
2,Python Multiprocessing Tutorial: Run Code in P...,"September 21, 2019"
3,Python Threading Tutorial: Run Code Concurrent...,"September 12, 2019"
4,Update (2019-09-03),"September 3, 2019"
5,Python Quick Tip: The Difference Between “==” ...,"August 6, 2019"
6,Python Tutorial: Calling External Commands Usi...,"July 24, 2019"
7,Visual Studio Code (Windows) – Setting up a Py...,"May 1, 2019"
8,Visual Studio Code (Mac) – Setting up a Python...,"May 1, 2019"
9,Clarifying the Issues with Mutable Default Arg...,"April 24, 2019"


In [43]:
data_frame = data_frame.join(pd.DataFrame(data=descriptions, columns=['post_descriptions']))
data_frame

Unnamed: 0,post_title,post_dates,post_descriptions
0,Python Tutorial: Zip Files – Creating and Extr...,"November 19, 2019","In this video, we will be learning how to crea..."
1,Python Data Science Tutorial: Analyzing the 20...,"October 17, 2019","In this Python Programming video, we will be l..."
2,Python Multiprocessing Tutorial: Run Code in P...,"September 21, 2019","In this Python Programming video, we will be l..."
3,Python Threading Tutorial: Run Code Concurrent...,"September 12, 2019","In this Python Programming video, we will be l..."
4,Update (2019-09-03),"September 3, 2019",Hey everyone. I wanted to give you an update o...
5,Python Quick Tip: The Difference Between “==” ...,"August 6, 2019","In this Python Programming Tutorial, we will b..."
6,Python Tutorial: Calling External Commands Usi...,"July 24, 2019","In this Python Programming Tutorial, we will b..."
7,Visual Studio Code (Windows) – Setting up a Py...,"May 1, 2019","In this Python Programming Tutorial, we will b..."
8,Visual Studio Code (Mac) – Setting up a Python...,"May 1, 2019","In this Python Programming Tutorial, we will b..."
9,Clarifying the Issues with Mutable Default Arg...,"April 24, 2019","In this Python Programming Tutorial, we will b..."


In [44]:
data_frame = data_frame.join(pd.DataFrame(data=tags, columns=['post_tags']))
data_frame

Unnamed: 0,post_title,post_dates,post_descriptions,post_tags
0,Python Tutorial: Zip Files – Creating and Extr...,"November 19, 2019","In this video, we will be learning how to crea...","Filed Under: Development, Python Tagged With: ..."
1,Python Data Science Tutorial: Analyzing the 20...,"October 17, 2019","In this Python Programming video, we will be l...","Filed Under: Development, Python Tagged With: ..."
2,Python Multiprocessing Tutorial: Run Code in P...,"September 21, 2019","In this Python Programming video, we will be l...","Filed Under: Development, Python Tagged With: ..."
3,Python Threading Tutorial: Run Code Concurrent...,"September 12, 2019","In this Python Programming video, we will be l...","Filed Under: Development, Python Tagged With: ..."
4,Update (2019-09-03),"September 3, 2019",Hey everyone. I wanted to give you an update o...,Filed Under: General
5,Python Quick Tip: The Difference Between “==” ...,"August 6, 2019","In this Python Programming Tutorial, we will b...","Filed Under: Development, Python Tagged With: ..."
6,Python Tutorial: Calling External Commands Usi...,"July 24, 2019","In this Python Programming Tutorial, we will b...","Filed Under: Development, Python Tagged With: ..."
7,Visual Studio Code (Windows) – Setting up a Py...,"May 1, 2019","In this Python Programming Tutorial, we will b...","Filed Under: Development, Python Tagged With: ..."
8,Visual Studio Code (Mac) – Setting up a Python...,"May 1, 2019","In this Python Programming Tutorial, we will b...","Filed Under: Development, Python Tagged With: ..."
9,Clarifying the Issues with Mutable Default Arg...,"April 24, 2019","In this Python Programming Tutorial, we will b...","Filed Under: Development, Python Tagged With: ..."
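
As an alternative to joining one DataFrame per list, pandas can also build the whole table in a single step from a dict of equal-length lists. The sketch below uses shortened, hypothetical sample lists in place of the real `titles`, `dates`, `descriptions`, and `tags`.

```python
import pandas as pd

# Shortened, hypothetical stand-ins for the four scraped lists
sample_titles = ['Post A', 'Post B']
sample_dates = ['May 1, 2019', 'April 24, 2019']
sample_descriptions = ['About A', 'About B']
sample_tags = ['Filed Under: General', 'Filed Under: Python']

# Each dict key becomes a column name; all lists must have the same length
frame = pd.DataFrame({
    'post_title': sample_titles,
    'post_dates': sample_dates,
    'post_descriptions': sample_descriptions,
    'post_tags': sample_tags,
})
print(frame.shape)
```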


In [45]:
# Finally, create a CSV file so we can use the data later
data_frame.to_csv('Corey Schafer Articles V2.csv')
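
A note on reading the file back: by default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is loaded again. The sketch below demonstrates this on a tiny hypothetical DataFrame, writing to in-memory buffers instead of real files.

```python
import io
import pandas as pd

demo = pd.DataFrame({'post_title': ['Post A', 'Post B']})

# Default behaviour: the row index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the row index out of the file
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default, cols_clean)
```

Whether to pass `index=False` depends on whether the row numbers are useful later; for this dataset they probably are not.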