## Acquire Data through Web Scraping:

#### Steps

1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
1. Assign the address of the web page to a variable named url.
1. Request the content of the web page from the server with get(), and store the server's response in the variable response.
1. Print the response text to ensure you have an HTML page.
1. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
1. Use BeautifulSoup to parse the HTML into a variable ('soup').
1. Identify the key tags you need to extract the data you are looking for.
1. Create a dataframe of the data desired.
1. Run some summary stats and inspect the data to ensure you have what you wanted.
1. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
1. Create a corpus of the column with the text you want to analyze.
1. Store that corpus for use in a future notebook.
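
The parsing and dataframe steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's final code: `sample_html` is a made-up stand-in for `response.text` so the sketch runs without a network request, and the `article` and `entry-content` tag and class names are assumptions about the page structure.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for response.text, so the sketch runs offline.
sample_html = """
<html><head><title>Sample Blog</title></head>
<body>
  <article>
    <h1 class="entry-title">First post</h1>
    <div class="entry-content"><p>Hello world.</p></div>
  </article>
  <article>
    <h1 class="entry-title">Second post</h1>
    <div class="entry-content"><p>More text here.</p></div>
  </article>
</body></html>
"""

# Parse the HTML into a soup object.
soup = BeautifulSoup(sample_html, 'html.parser')

# Pull the key tags out of each article (tag and class names are assumed).
rows = []
for article in soup.find_all('article'):
    rows.append({
        'title': article.find('h1', class_='entry-title').get_text(strip=True),
        'content': article.find('div', class_='entry-content').get_text(' ', strip=True),
    })

# Build the dataframe and inspect it.
df = pd.DataFrame(rows)
print(df.shape)

# The corpus is the single text column, as a list of documents.
corpus = df['content'].tolist()
```

From here the corpus could be written to disk (for example as JSON) so a later notebook can load it without scraping again.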

In [1]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd


1. Codeup Blog Articles

Visit Codeup's [Blog](https://codeup.com/blog/) and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```

Plus any additional properties you think might be helpful.

__Bonus:__ Scrape the text of __all__ the articles linked on [Codeup's blog page](https://codeup.com/blog/).
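
One possible shape for get_blog_articles is sketched below. It assumes the title lives in `<h1 class="entry-title">` and the body text in `<div class="et_pb_text_inner">` elements, which matches the one post inspected in this notebook but may not hold for every post. The helper `parse_blog_article` and the extra `url` key are illustrative additions, not part of the exercise.

```python
from requests import get
from bs4 import BeautifulSoup

# Codeup returns a 403 for the default python-requests user-agent.
HEADERS = {'User-Agent': 'Codeup Data Science'}


def parse_blog_article(html):
    """Return a {'title', 'content'} dict for one blog post's HTML.

    Assumes the title is in <h1 class="entry-title"> and the body text
    is in one or more <div class="et_pb_text_inner"> elements.
    """
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('h1', class_='entry-title')
    body_tags = soup.find_all('div', class_='et_pb_text_inner')
    return {
        'title': title_tag.get_text(strip=True) if title_tag else None,
        'content': '\n'.join(tag.get_text(' ', strip=True) for tag in body_tags),
    }


def get_blog_articles(urls):
    """Fetch each URL and return a list of article dictionaries."""
    articles = []
    for url in urls:
        response = get(url, headers=HEADERS)
        response.raise_for_status()  # fail loudly on 403/404 and similar
        article = parse_blog_article(response.text)
        article['url'] = url  # extra property, handy for debugging
        articles.append(article)
    return articles
```

Keeping the parsing separate from the fetching means the parser can be checked against saved HTML without hitting the site each time.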

***

2. News Articles

We will now be scraping text data from [inshorts](https://inshorts.com/en/read), a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment       

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
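
The three hints above map onto three small functions, sketched below. The markup details are assumptions that should be confirmed in the browser inspector: that each article sits in a `div` with class `news-card`, that the headline and body carry `itemprop="headline"` and `itemprop="articleBody"`, and that each topic lives at `https://inshorts.com/en/read/<category>`. The site's markup may differ or may have changed.

```python
from requests import get
from bs4 import BeautifulSoup

# Assumed URL pattern: one page per category under /en/read/.
BASE_URL = 'https://inshorts.com/en/read/'
CATEGORIES = ['business', 'sports', 'technology', 'entertainment']


def parse_news_card(card, category):
    """Turn one article element into a dictionary (attribute names assumed)."""
    headline = card.find(attrs={'itemprop': 'headline'})
    body = card.find(attrs={'itemprop': 'articleBody'})
    return {
        'title': headline.get_text(strip=True) if headline else None,
        'content': body.get_text(' ', strip=True) if body else None,
        'category': category,
    }


def parse_news_page(html, category):
    """Find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    cards = soup.find_all('div', class_='news-card')  # assumed class name
    return [parse_news_card(card, category) for card in cards]


def get_news_articles(categories=CATEGORIES):
    """Scrape all requested categories into one list of dictionaries."""
    articles = []
    for category in categories:
        response = get(BASE_URL + category)
        response.raise_for_status()
        articles.extend(parse_news_page(response.text, category))
    return articles
```

If the selectors turn out to be different, only the two parse functions need to change; get_news_articles stays the same.
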
***
3. Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. The functions that retrieve the data should prefer to read the local copy instead of making all the requests every time they are called. Include a boolean flag in the functions that allows the data to be acquired "fresh" from the actual sources (overwriting your local cache).
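
A minimal sketch of one way to do this caching with a JSON file is shown below. The helper name `get_cached`, the `fetch` callable, and the `fresh` flag name are illustrative choices, and `fake_fetch` only stands in for a real scraping function so the example runs offline.

```python
import json
import os
import tempfile


def get_cached(path, fetch, fresh=False):
    """Return cached data from `path`, or call fetch() and cache the result.

    With fresh=True the cache is ignored and overwritten.
    """
    if os.path.exists(path) and not fresh:
        with open(path) as f:
            return json.load(f)
    data = fetch()
    with open(path, 'w') as f:
        json.dump(data, f, indent=2)
    return data


# Demonstration with a fake fetcher that records how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return [{'title': 't', 'content': 'c'}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'articles.json')
    first = get_cached(cache_path, fake_fetch)               # fetches, writes file
    second = get_cached(cache_path, fake_fetch)              # reads the file
    third = get_cached(cache_path, fake_fetch, fresh=True)   # fetches again

print(len(calls))
```

A scraping function could wrap its own work the same way, for example by passing a lambda that calls it as the `fetch` argument.
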
***
***
***

1. Codeup Blog Articles

In [81]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In this instance, a header is necessary to avoid a 403 error.
- Headers are bits of meta information that can go along with a request.
- The User-Agent header can be used to identify ourselves to the web server.
- We can include headers as part of our request with the headers keyword argument.

Without this header, or something similar to it, Codeup rejects the request.
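
To see what the header changes, the request can be prepared without being sent. The sketch below uses `requests.Request(...).prepare()` for inspection only and compares the custom value against the library's default user-agent string; no network call is made.

```python
import requests

url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}

# Build the request without sending it, to inspect the outgoing headers.
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])

# The default value requests would send if no User-Agent were supplied.
print(requests.utils.default_user_agent())
```

Passing the same `headers` dictionary to `get(url, headers=headers)` sends the custom value in place of the default.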

In [82]:
response

<Response [200]>

In [9]:
# Perform a sanity check to ensure HTML data is observed
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [83]:
# I'll take this one blog post at a time
url2 = 'https://codeup.com/workshops/from-bootcamp-to-bootcamp-a-military-appreciation-panel/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response2 = get(url2, headers=headers)
# check the new response (response2), not the blog index response
print(response2)
print(response2.text[:100])

<Response [200]>
<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatib


In [102]:
# Make some soup (I wish it was Chowder)
# make soup variables holding the parsed responses
# response.content is the raw bytes; response.text is the decoded string
soup = BeautifulSoup(response.content, 'html.parser')
soup2 = BeautifulSoup(response2.content, 'html.parser')
soup3 = BeautifulSoup(response.text, 'html.parser')
soup4 = BeautifulSoup(response2.text, 'html.parser')
# no parser named: bs4 picks the best installed one and emits a GuessedAtParserWarning
soup5 = BeautifulSoup(response.text)
soup6 = BeautifulSoup(response2.text)

In [95]:
# Check.
# so I will need the title and full content of the article. 

title = soup2.title.string
title

'From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup'

In [None]:
# //*[@id="post-18280"]/div[2]/div/div/div/div[1]/div/div/div/p[1]/span/text()

In [96]:
soup.title.string

'Blog - Codeup'

In [108]:
# testing other soup variants out of curiosity for how they may differ
print(soup2.title.string)
print(soup3.title.string)
print(soup4.title.string)
print(soup5.title.string)
print(soup6.title.string)

From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup
Blog - Codeup
From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup
Blog - Codeup
From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup


In [137]:
soup2.find('div', id='main-content')
# the class should be et_pb_text_inner but I can't get it to work yet.

<div id="main-content">
<div class="container">
<div class="clearfix" id="content-area">
<div id="left-area">
<article class="et_pb_post post-18280 post type-post status-publish format-standard has-post-thumbnail hentry category-alumni-stories category-dallas category-events category-featured category-military category-san-antonio category-veterans category-virtual category-workshops" id="post-18280">
<div class="et_post_meta_wrapper">
<h1 class="entry-title">From Bootcamp to Bootcamp | A Military Appreciation Panel</h1>
<p class="post-meta"><span class="published">Apr 27, 2022</span> | <a href="https://codeup.com/category/alumni-stories/" rel="category tag">Alumni Stories</a>, <a href="https://codeup.com/category/workshops/dallas/" rel="category tag">Dallas</a>, <a href="https://codeup.com/category/events/" rel="category tag">Events</a>, <a href="https://codeup.com/category/featured/" rel="category tag">Featured</a>, <a href="https://codeup.com/category/military/" rel="category tag">M

This is what I want to extract:
```
<div class="et_pb_text_inner"><p data-key="16"><span data-key="17">In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans! We will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at Codeup. Grab your virtual seat now so you can be sent the exclusive Livestream link on the 11th! </span></p>
```
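
One way to reach that div by its class is the `class_` keyword (with a trailing underscore, because `class` is a reserved word in Python). The sketch below runs on a shortened copy of the snippet above and not on the live page, so the real page may contain several such divs.

```python
from bs4 import BeautifulSoup

# Shortened copy of the target markup, so the sketch runs offline.
sample = (
    '<div id="main-content"><div class="et_pb_text_inner">'
    '<p data-key="16"><span data-key="17">In honor of Military Appreciation '
    'Month, join us for a discussion with Codeup Alumni who are also '
    'Military Veterans!</span></p></div></div>'
)
soup = BeautifulSoup(sample, 'html.parser')

# Search by CSS class with the class_ keyword argument.
inner = soup.find('div', class_='et_pb_text_inner')
content = inner.get_text(' ', strip=True)
print(content)

# The same lookup written as a CSS selector.
same = soup.select_one('div.et_pb_text_inner').get_text(' ', strip=True)
```

If a post has several `et_pb_text_inner` divs, `find_all` (or `select`) returns all of them, and the text can be joined together.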

In [100]:
type(soup2)

bs4.BeautifulSoup

In [141]:
#print(soup2.prettify())
# if that's beautification then I should maybe try making it hideous instead..

In [148]:
#print(soup2.get_text())