In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd

At a high level, we'll go about web scraping through this process:

- Manually explore the site in a web browser, and identify the relevant HTML elements.
- Use the requests module to obtain the HTML from the page.
- Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
- (Maybe) Script the process of requesting another page and parsing the data from it as well.
- Take this data further down the data science pipeline.

**Steps**
- Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
- Assign the address of the web page to a variable named url.
- Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
- Print the response text to confirm that you received an HTML page.
- Take a look at the actual web page contents and inspect the source to understand the structure a bit.
- Use BeautifulSoup to parse the HTML into a variable ('soup').
- Identify the key tags you need to extract the data you are looking for.
- Create a dataframe of the data desired.
- Run some summary stats and inspect the data to ensure you have what you wanted.
- Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
- Create a corpus of the column with the text you want to analyze.
- Store that corpus for use in a future notebook.
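
As a rough illustration of the parse-and-extract steps above, the sketch below runs BeautifulSoup on a small hard-coded HTML string instead of a live response, so it can be run offline. The `sample_html` string and the `extract_links` helper are made up for this example; on a real page, `response.text` would take the place of `sample_html`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, so no network request is needed.
sample_html = """
<html><body>
  <h2 class="entry-title"><a href="https://example.com/post-1/">First Post</a></h2>
  <h2 class="entry-title"><a href="https://example.com/post-2/">Second Post</a></h2>
</body></html>
"""


def extract_links(html):
    """Return one dict per <h2><a href=...> link found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        {'title': link.get_text(strip=True), 'url': link['href']}
        for link in soup.select('h2 a[href]')
    ]


rows = extract_links(sample_html)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` for the inspection and summary steps.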

## Codeup Blog Articles

**Question 1**

Visit Codeup's Blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```
Plus any additional properties you think might be helpful.

**Bonus**: Scrape the text of *all* the articles linked on Codeup's blog page.


In [21]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [25]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [23]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [32]:
soup.find_all("h2")

[<h2 class="et_pb_slide_title"><a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a></h2>,
 <h2 class="et_pb_slide_title"><a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a></h2>,
 <h2 class="et_pb_slide_title"><a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a></h2>,
 <h2 class="et_pb_slide_title"><a href="https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/">What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-x-comic-con/">Codeup X Superhero C

In [14]:
soup.select('h2 a[href]')

[<a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>,
 <a href="https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/">What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science</a>,
 <a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/codeup-news/codeup-x-comic-con/">Codeup X Superhero Car Show &amp; Comic Con</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://

In [15]:
soup.select('h2 a[href]')[0]

<a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>

In [35]:
[i for i in range(1, 6)]

[1, 2, 3, 4, 5]

In [39]:
# Collect the href of every link that sits inside an <h2>
urls = [link['href'] for link in soup.select('h2 a[href]')]
urls

['https://codeup.com/data-science/recession-proof-career/',
 'https://codeup.com/featured/series-part-3-web-development/',
 'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
 'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/',
 'https://codeup.com/data-science/recession-proof-career/',
 'https://codeup.com/codeup-news/codeup-x-comic-con/',
 'https://codeup.com/featured/series-part-3-web-development/',
 'https://codeup.com/codeup-news/codeup-dallas-campus/',
 'https://codeup.com/codeup-news/codeup-tv-commercial/',
 'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
 'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/',
 'https://codeup.com/workshops/san-antonio/in-person-workshop-learn-to-code-javascript-on-7-26/',
 'https://codeup.com/workshops/in-person-workshop-learn-to-code-python-on-7-19/',
 'https://codeup.co
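
The list above contains repeats, because the slider headings (`et_pb_slide_title`) and the regular headings (`entry-title`) link to the same posts. One way to drop the repeats while keeping the original order is sketched below; `dedupe_preserving_order` and `sample_urls` are names invented for this example.

```python
def dedupe_preserving_order(items):
    """Drop repeated items while keeping the first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# A short sample taken from the scraped list above
sample_urls = [
    'https://codeup.com/data-science/recession-proof-career/',
    'https://codeup.com/featured/series-part-3-web-development/',
    'https://codeup.com/data-science/recession-proof-career/',
    'https://codeup.com/codeup-news/codeup-x-comic-con/',
    'https://codeup.com/featured/series-part-3-web-development/',
]

unique_urls = dedupe_preserving_order(sample_urls)
print(unique_urls)
```

Using `set(urls)` would also remove repeats, but it would not keep the posts in page order.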

In [55]:
webpages = urls[0]
response = get(webpages, headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

In [44]:
# getting title
soup.find('h1', class_='entry-title').text

'Is a Career in Tech Recession-Proof?'

In [63]:
# getting content
soup.find('div', class_='entry-content').text

'\nGiven the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\n\n\xa0\n\n\nCloud Administrator\nMore businesses are transiti

In [46]:
# Getting content of article
soup.find('div', class_='entry-content').text.strip()

'Given the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\n\n\xa0\n\n\nCloud Administrator\nMore businesses are transition
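
Even after `.strip()`, the text still carries non-breaking spaces (`\xa0`) and runs of newlines between paragraphs. If a single-line string is preferred for later text analysis, one possible cleanup is sketched below. `clean_text` is a hypothetical helper, and collapsing all whitespace to single spaces is an assumption about what the later analysis needs.

```python
import re


def clean_text(text):
    """Collapse non-breaking spaces and whitespace runs into single spaces."""
    text = text.replace('\xa0', ' ')          # non-breaking space -> plain space
    return re.sub(r'\s+', ' ', text).strip()  # squeeze newlines/tabs/spaces


sample = '\nGiven the climate.\n\xa0\n\n\nProgrammer/Developer\nNo matter what.\n'
print(clean_text(sample))
```

This discards paragraph breaks, so keep the raw text as well if paragraph structure matters.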

In [50]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers=headers)

# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

urls = [link['href'] for link in soup.select('h2 a[href]')]

articles = []

for url in urls:
    
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()

    article_dict = {
        'title': title,
        'content': content
    }
    
    articles.append(article_dict)

In [51]:
articles

[{'title': 'Is a Career in Tech Recession-Proof?',
  'content': 'Given the current economic climate, many economists are considering the U.S. to be entering a recession. This can cause confusion, fear, and uncertainty, especially as it pertains to job security.\nTo ease some of those feelings, below you’ll find some careers in tech that tend to hold up better than others amid a recession. In the event of a recession, companies will likely shift to digital strategies, making these careers in tech valuable and highly coveted.\n\xa0\n\n\nProgrammer/Developer\nNo matter the programming language you’ve mastered, having the knowledge alone makes you extremely valuable. The coding skills you possess as a programmer or developer are in-demand for companies looking to build or enhance their websites, and enhance their consumer experience. According to the U.S. Bureau of Labor Statistics, jobs in software development are expected to grow 22% by 2030. This is much faster than the average career.\

In [53]:
pd.DataFrame(articles)

| | title | content |
|---|---|---|
| 0 | Is a Career in Tech Recession-Proof? | Given the current economic climate, many econo... |
| 1 | What Jobs Can You Get After a Coding Bootcamp?... | If you’re considering a career in web developm... |
| 2 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... |
| 3 | What Jobs Can You Get After a Coding Bootcamp?... | If you are interested in embarking on a career... |
| 4 | Is a Career in Tech Recession-Proof? | Given the current economic climate, many econo... |
| 5 | Codeup X Superhero Car Show & Comic Con | Codeup had a blast at the San Antonio Superher... |
| 6 | What Jobs Can You Get After a Coding Bootcamp?... | If you’re considering a career in web developm... |
| 7 | Codeup’s New Dallas Campus | Codeup’s Dallas campus has a new location! For... |
| 8 | Codeup TV Commercial | Codeup has officially made its TV debut! Our c... |
| 9 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... |
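
The exercise asks for this work to be wrapped in a function named `get_blog_articles`. A minimal sketch of one way to do that follows. It reuses the `entry-title` and `entry-content` selectors found above; the `parse_article` helper, the `HEADERS` constant, and the choice to return `None` for missing elements are assumptions made for this example.

```python
from bs4 import BeautifulSoup
from requests import get

HEADERS = {'User-Agent': 'Codeup Data Science'}


def parse_article(html):
    """Pull the title and body text out of one article page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('h1', class_='entry-title')
    content_tag = soup.find('div', class_='entry-content')
    return {
        # Fall back to None when the page lacks the expected element
        'title': title_tag.get_text(strip=True) if title_tag else None,
        'content': content_tag.get_text().strip() if content_tag else None,
    }


def get_blog_articles(urls):
    """Fetch each URL and return a list of article dictionaries."""
    articles = []
    for url in urls:
        response = get(url, headers=HEADERS)
        response.raise_for_status()  # stop early on 4xx/5xx responses
        articles.append(parse_article(response.text))
    return articles
```

Calling `get_blog_articles(urls[:5])` would then cover the five posts the exercise asks for. Keeping the parsing in its own function means it can be checked against a saved HTML string without making any requests.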


### Date

In [64]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In [65]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [66]:
soup.find_all("span")

[<span class="menu-name-behind"></span>,
 <span class="menu-text"></span>,
 <span class="hamburger-box">
 <span class="hamburger-inner"></span>
 </span>,
 <span class="hamburger-inner"></span>,
 <span id="et_search_icon"></span>,
 <span class="close"></span>,
 <span class="mobile_menu_bar"></span>,
 <span class="published">Aug 12, 2022</span>,
 <span class="published">Aug 10, 2022</span>,
 <span class="published">Aug 2, 2022</span>,
 <span class="published">Jul 25, 2022</span>,
 <span class="published">Jul 20, 2022</span>,
 <span class="published">Jul 14, 2022</span>,
 <span class="published">Jul 7, 2022</span>,
 <span class="published">Jul 6, 2022</span>,
 <span class="published">Jun 20, 2022</span>,
 <span class="published">Jun 19, 2022</span>,
 <span class="published">Jun 8, 2022</span>,
 <span class="published">Jun 5, 2022</span>,
 <span class="published">Jun 1, 2022</span>,
 <span class="published">May 31, 2022</span>,
 <span class="published">May 23, 2022</span>,
 <span class="pu

In [None]:
soup.select('h2 a[href]')
soup.select('p span[href]')

In [95]:
dates = soup.find_all('span', class_='published')

In [90]:
soup.find_all('span', class_='published')[0].text

'Aug 12, 2022'

In [96]:
dates

[<span class="published">Aug 12, 2022</span>,
 <span class="published">Aug 10, 2022</span>,
 <span class="published">Aug 2, 2022</span>,
 <span class="published">Jul 25, 2022</span>,
 <span class="published">Jul 20, 2022</span>,
 <span class="published">Jul 14, 2022</span>,
 <span class="published">Jul 7, 2022</span>,
 <span class="published">Jul 6, 2022</span>,
 <span class="published">Jun 20, 2022</span>,
 <span class="published">Jun 19, 2022</span>,
 <span class="published">Jun 8, 2022</span>,
 <span class="published">Jun 5, 2022</span>,
 <span class="published">Jun 1, 2022</span>,
 <span class="published">May 31, 2022</span>,
 <span class="published">May 23, 2022</span>,
 <span class="published">May 17, 2022</span>,
 <span class="published">May 16, 2022</span>,
 <span class="published">May 16, 2022</span>]

In [98]:
dates = [date.text for date in soup.find_all('span', class_='published')]
dates

['Aug 12, 2022',
 'Aug 10, 2022',
 'Aug 2, 2022',
 'Jul 25, 2022',
 'Jul 20, 2022',
 'Jul 14, 2022',
 'Jul 7, 2022',
 'Jul 6, 2022',
 'Jun 20, 2022',
 'Jun 19, 2022',
 'Jun 8, 2022',
 'Jun 5, 2022',
 'Jun 1, 2022',
 'May 31, 2022',
 'May 23, 2022',
 'May 17, 2022',
 'May 16, 2022',
 'May 16, 2022']
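
The dates are still plain strings at this point. If they need to be sorted or filtered later, they can be converted to `date` objects. The sketch below uses the standard library's `datetime.strptime`; `parse_published` is a hypothetical helper, and the `'%b %d, %Y'` format assumes English month abbreviations like the ones shown above.

```python
from datetime import datetime


def parse_published(text):
    """Convert a string such as 'Aug 2, 2022' into a datetime.date."""
    return datetime.strptime(text.strip(), '%b %d, %Y').date()


sample_dates = ['Aug 12, 2022', 'Aug 2, 2022', 'May 16, 2022']
parsed = [parse_published(d) for d in sample_dates]
print([d.isoformat() for d in parsed])
```

With a DataFrame, `pd.to_datetime` on the date column is an alternative that does the same conversion in one call.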

In [4]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers=headers)

# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

urls = [link['href'] for link in soup.select('h2 a[href]')]

articles = []

for url in urls:
    
    url_response = get(url, headers=headers)
    soup = BeautifulSoup(url_response.text, 'html.parser')
    
    title = soup.find('h1', class_='entry-title').text
    content = soup.find('div', class_='entry-content').text.strip()
    # find_all returns every match, so 'date' ends up as a list of strings
    dates = [date.text for date in soup.find_all('span', class_='published')]

    article_dict = {
        'title': title,
        'content': content,
        'date': dates
    }
    
    articles.append(article_dict)

In [6]:
pd.DataFrame(articles)

| | title | content | date |
|---|---|---|---|
| 0 | Is a Career in Tech Recession-Proof? | Given the current economic climate, many econo... | [Aug 12, 2022] |
| 1 | What Jobs Can You Get After a Coding Bootcamp?... | If you’re considering a career in web developm... | [Aug 2, 2022] |
| 2 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... | [Jul 14, 2022] |
| 3 | What Jobs Can You Get After a Coding Bootcamp?... | If you are interested in embarking on a career... | [Jul 7, 2022] |
| 4 | Is a Career in Tech Recession-Proof? | Given the current economic climate, many econo... | [Aug 12, 2022] |
| 5 | Codeup X Superhero Car Show & Comic Con | Codeup had a blast at the San Antonio Superher... | [Aug 10, 2022] |
| 6 | What Jobs Can You Get After a Coding Bootcamp?... | If you’re considering a career in web developm... | [Aug 2, 2022] |
| 7 | Codeup’s New Dallas Campus | Codeup’s Dallas campus has a new location! For... | [Jul 25, 2022] |
| 8 | Codeup TV Commercial | Codeup has officially made its TV debut! Our c... | [Jul 20, 2022] |
| 9 | What Jobs Can You Get After a Coding Bootcamp?... | Have you been considering a career in Cloud Ad... | [Jul 14, 2022] |
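
The `date` column holds one-element lists because `find_all` always returns a list. If a plain string per article is preferred, `find` returns only the first match. The sketch below shows that variation on a hard-coded snippet; `first_published_date` and the sample HTML are made up for this example.

```python
from bs4 import BeautifulSoup


def first_published_date(html):
    """Return the text of the first <span class="published">, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('span', class_='published')
    return tag.get_text(strip=True) if tag else None


# Hypothetical fragment shaped like the spans shown earlier
sample_html = '<p><span class="published">Aug 12, 2022</span></p>'
print(first_published_date(sample_html))
```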


## News Articles

**Question 2**

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

**Hints**:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
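
The three hints above could be laid out roughly as follows. This is only a sketch: the `div.news-card` selector, the `itemprop` attribute selectors, and the `/en/read/<category>` URL pattern are assumptions that must be checked against the live page in a browser, and the helper names `parse_news_card` and `parse_news_page` are invented here.

```python
from bs4 import BeautifulSoup
from requests import get

BASE_URL = 'https://inshorts.com/en/read'  # category is appended (assumed pattern)


def parse_news_card(card, category):
    """Turn one article element into a dictionary (selectors are assumed)."""
    headline = card.select_one('[itemprop="headline"]')
    body = card.select_one('[itemprop="articleBody"]')
    return {
        'title': headline.get_text(strip=True) if headline else None,
        'content': body.get_text(strip=True) if body else None,
        'category': category,
    }


def parse_news_page(html, category):
    """Find every article element on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    return [parse_news_card(card, category)
            for card in soup.select('div.news-card')]  # assumed class name


def get_news_articles(categories=('business', 'sports',
                                  'technology', 'entertainment')):
    """Scrape every category page and return one combined list."""
    articles = []
    for category in categories:
        response = get(f'{BASE_URL}/{category}')
        response.raise_for_status()
        articles.extend(parse_news_page(response.text, category))
    return articles
```

Splitting the work this way means the two parsing functions can be tried on a saved HTML string before any request is sent.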


In [11]:
url = 'https://inshorts.com/en/read'
response = get(url)

# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [12]:
print(response.text[:400])

<!doctype html>
<html lang="en">

<head>
  <meta charset="utf-8" />
  <style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if ne
