# NOTES:
At a high level, we'll go about web scraping through this process:

1. Manually explore the site in a web browser, and identify the relevant HTML elements.
1. Use the requests module to obtain the HTML from the page.
1. Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
1. (Maybe) Script the process of requesting another page and parsing the data from it as well.
1. Take this data further down the data science pipeline.

### *Steps
1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
1. Assign the address of the web page to a variable named url.
1. Request the server the content of the web page by using get(), and store the server’s response in the variable response.
1. Print the response text to ensure you have an html page.
1. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
1. Use BeautifulSoup to parse the HTML into a variable ('soup').
1. Identify the key tags you need to extract the data you are looking for.
1. Create a dataframe of the data desired.
1. Run some summary stats and inspect the data to ensure you have what you wanted.
1. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
1. Create a corpus of the column with the text you want to analyze.
1. Store that corpus for use in a future notebook.

CSS Selectors
1. The name of the **element** itself is a selector. For example soup.select("p") will select every paragraph tag and soup.select("footer") selects the footer element (and everything inside it)
1. The **id** selector is denoted with a **#**. For example soup.select("#posts") will return the html element noted with the id=posts attribute
1. The **class** selector is denoted with a **.** symbol before the class name. For example, soup.select(".blog_post") returns all of the elements that have that class name.

In [67]:
#imports
import pandas as pd
import requests
from bs4 import BeautifulSoup

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the end function should be present in acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module.)

In [2]:
#acquire.py

1.a. Record the **urls** for at least 5 distinct blog posts. For each post, you should scrape at least the post's **title and content**.

In [3]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = requests.get(url, headers=headers)

In [4]:
soup = BeautifulSoup(response.text)
soup

<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-overlay-visible","over

In [33]:
links = soup.select('.entry-featured-image-url', href=True)[0]
links

<a class="entry-featured-image-url" href="https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/"><img alt="From Speech Pathology to Business Intelligence" class="" height="675" loading="lazy" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1080px, 100vw" src="https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-1080x629.png" srcset="https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-1080x629.png 1080w, https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-980x514.png 980w, https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-480x252.png 480w" width="1080"/></a>

In [39]:
url = links.attrs['href']
url

'https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/'

In [55]:
def blog_links():
    url = 'https://codeup.com/blog/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    urls = [links.attrs['href'] for links in soup.select('.entry-featured-image-url')]
    
    return urls

In [56]:
blog_links()

['https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/',
 'https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/',
 'https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/',
 'https://codeup.com/tips-for-prospective-students/why-should-i-become-a-system-administrator/',
 'https://codeup.com/codeup-news/codeup-candidate-for-accreditation/',
 'https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/',
 'https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/',
 'https://codeup.com/tips-for-prospective-students/why-you-need-the-best-coding-bootcamp-instructors/',
 'https://codeup.com/codeup-news/meet-the-new-codeup-coo-stephen-noteboom/',
 'https://codeup.com/alumni-stories/how-i-went-from-codeup-to-business-owner/',
 'https://codeup.com/tips-for-prospective-students/coding-is-for-women/',
 'https://codeup.com/codeup-news/codeup-acquires-rackspace-cloud-ac

In [149]:
title = soup.select_one('.entry-title').text
title

'From Speech Pathology to Business Intelligence'

In [150]:
content = soup.select_one('.post-content-inner').text
content

'By: Alicia Gonzalez Before Codeup, I was a home health Speech-Language Pathologist Assistant. I would go from home to...\n'

In [151]:
def article(url):
    url = 'https://codeup.com/blog/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    
    return {
        'title':soup.select_one('.entry-title').text,
        'content':soup.select_one('.post-content-inner').text.strip(),
    }

In [152]:
article(url)

{'title': 'From Speech Pathology to Business Intelligence',
 'content': 'By: Alicia Gonzalez Before Codeup, I was a home health Speech-Language Pathologist Assistant. I would go from home to...'}

1.b. Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article.

In [165]:
def get_blog_articles():
    links = blog_links()
    df = pd.DataFrame([article(url) for link in links])
    return df

In [166]:
#NEED to figure out why its repeating??????
df = get_blog_articles()
df.head()

Unnamed: 0,title,content
0,From Speech Pathology to Business Intelligence,"By: Alicia Gonzalez Before Codeup, I was a hom..."
1,From Speech Pathology to Business Intelligence,"By: Alicia Gonzalez Before Codeup, I was a hom..."
2,From Speech Pathology to Business Intelligence,"By: Alicia Gonzalez Before Codeup, I was a hom..."
3,From Speech Pathology to Business Intelligence,"By: Alicia Gonzalez Before Codeup, I was a hom..."
4,From Speech Pathology to Business Intelligence,"By: Alicia Gonzalez Before Codeup, I was a hom..."


1.c. Bonus: Scrape the text of all the articles 

In [None]:
#scraped all in the above answer.

2. Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries

In [167]:
response = requests.get('https://inshorts.com/en/read', headers={'User-Agent': 'Codeup Data Science'})

In [168]:
soup = BeautifulSoup(response.text)

In [169]:
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if needed */
        background-color: rgb(0,0,0); /* Fallback color */
        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
    }

    /* Modal Content/Box */
    .modal-content {
        background-color: #fefefe;
        margin: 15% auto;
        padding: 20px !important;
        padding-top: 0 !important;
        /* border: 1px solid #888; */
        text-align: center;
        position: relative;
        border-radius: 6px;
    }

    /* The Close Button */
    .close {
      left: 90%;
      color: #aaa;
      float: right;
      font-size: 28px;
      font-weight: bold;
    /* positio

In [179]:
author = soup.select_one('.author').text
author

'Ridham Gambhir'

In [180]:
published = soup.select_one('.time').attrs['content']
published

'2021-10-26T17:08:44.000Z'

In [185]:
headline = soup.select_one('.news-card-title')
headline

<div class="news-card-title news-right-box">
<a class="clickable" href="/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'TitleOfNews', 'action': 'clicked', 'label': 'Fire%20breaks%20out%20at%20firecracker%20shop%20in%20Tamil%20Nadu%3B%20at%20least%205%20dead)' });" style="color:#44444d!important">
<span itemprop="headline">Fire breaks out at firecracker shop in Tamil Nadu; at least 5 dead</span>
</a>
<div class="news-card-author-time news-card-author-time-in-title">
<a href="/prev/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813"><span class="short">short</span></a> by <span class="author">Ridham Gambhir</span> / 
      <span class="time" content="2021-10-26T17:08:44.000Z" itemprop="datePublished">10:38 pm</span> on <span clas="date">26 Oct 2021,Tuesday</span>
</div>
</div>

In [189]:
headline.select('a')[0].text.strip()

'Fire breaks out at firecracker shop in Tamil Nadu; at least 5 dead'

In [192]:
def news_card():
    response = requests.get('https://inshorts.com/en/read', headers={'User-Agent': 'Codeup Data Science'})
    soup = BeautifulSoup(response.text)
    author = soup.select_one('.author').text
    published = soup.select_one('.time').attrs['content']
    headline = soup.select_one('.news-card-title')
    headlines = headline.select('a')[0].text.strip()
    
    return author, published, headlines
    

In [193]:
news_card()

('Ridham Gambhir',
 '2021-10-26T17:08:44.000Z',
 'Fire breaks out at firecracker shop in Tamil Nadu; at least 5 dead')

In [194]:
response = requests.get('https://inshorts.com/en/read', headers={'User-Agent': 'Codeup Data Science'})
soup = BeautifulSoup(response.text)

In [260]:
url = soup.select('.news-card-title')[0]
url

<div class="news-card-title news-right-box">
<a class="clickable" href="/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'TitleOfNews', 'action': 'clicked', 'label': 'Fire%20breaks%20out%20at%20firecracker%20shop%20in%20Tamil%20Nadu%3B%20at%20least%205%20dead)' });" style="color:#44444d!important">
<span itemprop="headline">Fire breaks out at firecracker shop in Tamil Nadu; at least 5 dead</span>
</a>
<div class="news-card-author-time news-card-author-time-in-title">
<a href="/prev/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813"><span class="short">short</span></a> by <span class="author">Ridham Gambhir</span> / 
      <span class="time" content="2021-10-26T17:08:44.000Z" itemprop="datePublished">10:38 pm</span> on <span clas="date">26 Oct 2021,Tuesday</span>
</div>
</div>

In [267]:
urls=url.find_all('a')[0]
urls

<a class="clickable" href="/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'TitleOfNews', 'action': 'clicked', 'label': 'Fire%20breaks%20out%20at%20firecracker%20shop%20in%20Tamil%20Nadu%3B%20at%20least%205%20dead)' });" style="color:#44444d!important">
<span itemprop="headline">Fire breaks out at firecracker shop in Tamil Nadu; at least 5 dead</span>
</a>

In [268]:
urls.attrs['href']

'/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813'

In [275]:
def page_urls(urls):
    response = requests.get('https://inshorts.com/en/read', headers={'User-Agent': 'Codeup Data Science'})
    soup = BeautifulSoup(response.text)
   #url = soup.select('.news-card-title')[0]
   # urls=url.find_all('a')
    links = [urls.attrs['href'] for links in soup.select('.news-card-title')]
    
    return links

In [276]:
page_urls(urls)

['/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-5-dead-1635268124813',
 '/en/news/fire-breaks-out-at-firecracker-shop-in-tamil-nadu-at-least-