# NOTES:
At a high level, we'll go about web scraping through this process:

1. Manually explore the site in a web browser, and identify the relevant HTML elements.
1. Use the requests module to obtain the HTML from the page.
1. Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
1. (Maybe) Script the process of requesting another page and parsing the data from it as well.
1. Take this data further down the data science pipeline.

### Steps
1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
1. Assign the address of the web page to a variable named url.
1. Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
1. Print the response text to confirm that you received an HTML page.
1. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
1. Use BeautifulSoup to parse the HTML into a variable ('soup').
1. Identify the key tags you need to extract the data you are looking for.
1. Create a dataframe of the data desired.
1. Run some summary stats and inspect the data to ensure you have what you wanted.
1. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
1. Create a corpus of the column with the text you want to analyze.
1. Store that corpus for use in a future notebook.
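
The later steps (parse, build a dataframe, inspect, build a corpus, store) can be sketched offline as follows. This is a minimal illustration, not the notebook's actual code: the HTML string, the `body` class name, and the `content_length` column are assumptions made up for the example, and a real run would parse `response.text` instead.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page content; a real run would use response.text here.
html = """
<article><h2 class="entry-title">Post A</h2><div class="body">Text of A.</div></article>
<article><h2 class="entry-title">Post B</h2><div class="body">Text of B, a bit longer.</div></article>
"""

soup = BeautifulSoup(html, "html.parser")

# One dictionary per <article>, then a DataFrame built from the list.
rows = [
    {
        "title": art.select_one(".entry-title").get_text(strip=True),
        "content": art.select_one(".body").get_text(strip=True),
    }
    for art in soup.select("article")
]
df = pd.DataFrame(rows)

# Quick sanity checks: missing values and the length of each text.
print(df.isna().sum())
df["content_length"] = df["content"].str.len()
print(df.shape)

# The corpus is the text column as a plain list of strings.
corpus = df["content"].tolist()

# Serialize for a later notebook. With no path, to_json returns a string;
# passing a file path instead would write the records to disk.
stored = df.to_json(orient="records")
```

Keeping the corpus as a list of strings makes it easy to hand to whatever text-processing step comes next.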

### CSS Selectors
1. The name of the **element** itself is a selector. For example, soup.select("p") selects every paragraph tag, and soup.select("footer") selects the footer element (and everything inside it).
1. The **id** selector is denoted with a **#**. For example, soup.select("#posts") returns the HTML element that has the id="posts" attribute.
1. The **class** selector is denoted with a **.** symbol before the class name. For example, soup.select(".blog_post") returns all of the elements that have that class name.
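
The three selector types can be tried on a small document without making a request. The HTML below is hypothetical and written only to exercise the selectors; the ids and class names mirror the examples in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written only to exercise the three selector types.
html = """
<div id="posts">
  <div class="blog_post"><p>First post</p></div>
  <div class="blog_post"><p>Second post</p></div>
</div>
<footer><p>Contact us</p></footer>
"""

soup = BeautifulSoup(html, "html.parser")

# Element selector: every <p> tag, including the one inside <footer>.
paragraphs = [p.get_text() for p in soup.select("p")]

# id selector: select() returns a list, even when the id is unique.
posts = soup.select("#posts")

# Class selector: every element whose class list includes blog_post.
blog_posts = soup.select(".blog_post")

# select_one() returns the first match, or None when nothing matches.
footer = soup.select_one("footer")
missing = soup.select_one(".does-not-exist")

print(paragraphs)
print(len(posts), len(blog_posts), footer.get_text(strip=True), missing)
```

Because select_one() returns None for a missing element, calling .text on its result fails with an AttributeError when a selector does not match the page.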

In [45]:
#imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests import get

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the end function should be present in acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module.)

In [None]:
#acquire.py

1.a. Record the **urls** for at least 5 distinct blog posts. For each post, you should scrape at least the post's **title and content**.

In [2]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'}
response = requests.get(url, headers=headers)

In [3]:
soup = BeautifulSoup(response.text)
soup

<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://codeup.com/xmlrpc.php" rel="pingback"/>
<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttrib":"data-popup","modalIndicatorClass":"is-modal","blockingIndicatorClass":"is-blocking","defaultShowCloseButton":true,"withCloseClass":"with-close","noCloseClass":"no-close","triggerCloseClass":"close","singletonClass":"single","darkModeClass":"dark","noShadowClass":"no-shadow","altCloseClass":"close-alt","popupSelector":".et_pb_section.popup","initializeOnEvent":"et_pb_after_init_modules","popupWrapperClass":"area-outer-wrap","fullHeightClass":"full-height","openPopupClass":"da-overlay-visible","over

In [4]:
# select() takes a CSS selector; the [href] part keeps only anchors that have an href attribute
links = soup.select('.entry-featured-image-url[href]')[0]
links

<a class="entry-featured-image-url" href="https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/"><img alt="From Speech Pathology to Business Intelligence" class="" height="675" loading="lazy" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1080px, 100vw" src="https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-1080x629.png" srcset="https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-1080x629.png 1080w, https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-980x514.png 980w, https://199lj33nqk3p88xz03dvn481-wpengine.netdna-ssl.com/wp-content/uploads/2021/10/1200x628_BlogPost-02-Alicia-Success-Story-480x252.png 480w" width="1080"/></a>

In [5]:
url = links.attrs['href']
url

'https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/'

In [6]:
def blog_links():
    url = 'https://codeup.com/blog/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    urls = [links.attrs['href'] for links in soup.select('.entry-featured-image-url')]
    
    return urls
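
The link-extraction step can be separated from the request so that it is testable on a saved or hand-written page. The sketch below is one possible variant, not the notebook's function: `extract_links` is a hypothetical helper name, the sample HTML and example.com URLs are made up, and skipping duplicates and anchors without an href is an added assumption about what is wanted.

```python
from bs4 import BeautifulSoup

def extract_links(html, selector=".entry-featured-image-url"):
    """Return the unique href values of elements matching selector, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for anchor in soup.select(selector):
        # .get() returns None for a missing attribute instead of raising KeyError.
        href = anchor.get("href")
        if href and href not in urls:
            urls.append(href)
    return urls

# Hypothetical markup covering a duplicate, a missing href, and a non-matching class.
sample = """
<a class="entry-featured-image-url" href="https://example.com/post-1/">1</a>
<a class="entry-featured-image-url" href="https://example.com/post-1/">dup</a>
<a class="entry-featured-image-url">no href</a>
<a class="other" href="https://example.com/skip/">x</a>
<a class="entry-featured-image-url" href="https://example.com/post-2/">2</a>
"""

print(extract_links(sample))
```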

In [7]:
blog_links()

['https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/',
 'https://codeup.com/codeup-news/is-codeup-the-best-bootcamp-in-san-antonio-or-the-world/',
 'https://codeup.com/codeup-news/codeup-launches-first-podcast-hire-tech/',
 'https://codeup.com/tips-for-prospective-students/why-should-i-become-a-system-administrator/',
 'https://codeup.com/codeup-news/codeup-candidate-for-accreditation/',
 'https://codeup.com/codeup-news/codeup-takes-over-more-of-the-historic-vogue-building/',
 'https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/',
 'https://codeup.com/tips-for-prospective-students/why-you-need-the-best-coding-bootcamp-instructors/',
 'https://codeup.com/codeup-news/meet-the-new-codeup-coo-stephen-noteboom/',
 'https://codeup.com/alumni-stories/how-i-went-from-codeup-to-business-owner/',
 'https://codeup.com/tips-for-prospective-students/coding-is-for-women/',
 'https://codeup.com/codeup-news/codeup-acquires-rackspace-cloud-ac

In [8]:
title = soup.select_one('.entry-title').text
title

'From Speech Pathology to Business Intelligence'

In [9]:
content = soup.select_one('.post-content-inner').text
content

'By: Alicia Gonzalez Before Codeup, I was a home health Speech-Language Pathologist Assistant. I would go from home to...\n'

In [16]:
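def article(url):
    # Request the url that was passed in. Reassigning url to the blog index
    # here would make every call scrape the same page.
    headers = {'User-Agent': 'Codeup Data Science'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    
    # NOTE: these selectors were checked against the blog index page above;
    # confirm them against an individual article page before relying on them.
    return {
        'title':soup.select_one('.entry-title').text,
        'content':soup.select_one('.post-content-inner').text.strip(),
    }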

In [17]:
article(url)

{'title': 'From Speech Pathology to Business Intelligence',
 'content': 'By: Alicia Gonzalez Before Codeup, I was a home health Speech-Language Pathologist Assistant. I would go from home to...'}

1.b. Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article.

In [18]:
def get_blog_articles():
    links = blog_links()
    df = pd.DataFrame([article(link) for link in links])
    return df

In [19]:
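# Rows repeat when article() requests the blog index for every link instead of
# the url it is given; the output below was captured in that state.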
df = get_blog_articles()
df.head()

| | title | content |
|---|---|---|
| 0 | From Speech Pathology to Business Intelligence | By: Alicia Gonzalez Before Codeup, I was a hom... |
| 1 | From Speech Pathology to Business Intelligence | By: Alicia Gonzalez Before Codeup, I was a hom... |
| 2 | From Speech Pathology to Business Intelligence | By: Alicia Gonzalez Before Codeup, I was a hom... |
| 3 | From Speech Pathology to Business Intelligence | By: Alicia Gonzalez Before Codeup, I was a hom... |
| 4 | From Speech Pathology to Business Intelligence | By: Alicia Gonzalez Before Codeup, I was a hom... |
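
The exercise asks for a list of dictionaries, one per article, with each link fetched separately. A minimal sketch of that shape follows, under stated assumptions: `parse_article`, `fetch_html`, and the `fetch` parameter are hypothetical names introduced here, the default selectors are the ones used earlier in this notebook and have not been confirmed against a full article page, and the fake pages exist only so the example runs without network access.

```python
from bs4 import BeautifulSoup

def parse_article(html, title_selector=".entry-title",
                  content_selector=".post-content-inner"):
    """Parse one page of HTML into a {'title', 'content'} dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one(title_selector)
    content = soup.select_one(content_selector)
    # Use None for a missing element rather than failing on .get_text().
    return {
        "title": title.get_text(strip=True) if title else None,
        "content": content.get_text(" ", strip=True) if content else None,
    }

def fetch_html(url):
    """Download a page; imported lazily so the offline demo needs no network."""
    import requests
    response = requests.get(url, headers={"User-Agent": "Codeup Data Science"})
    response.raise_for_status()
    return response.text

def get_blog_articles(urls, fetch=fetch_html):
    """Return one dictionary per url, fetching each url separately."""
    return [parse_article(fetch(url)) for url in urls]

# Offline demonstration: a dictionary lookup stands in for the network.
fake_pages = {
    "https://example.com/a/": '<h1 class="entry-title">A</h1>'
                              '<div class="post-content-inner"><p>Body A</p></div>',
    "https://example.com/b/": '<h1 class="entry-title">B</h1>'
                              '<div class="post-content-inner"><p>Body B</p></div>',
}
articles = get_blog_articles(list(fake_pages), fetch=fake_pages.get)
print(articles)
```

Passing the list to pd.DataFrame(articles) gives the tabular view when one is needed, so the function itself can keep returning plain dictionaries.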


1.c. Bonus: Scrape the text of all the articles 

In [None]:
#scraped all in the above answer.

2. Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries.

In [20]:
response = requests.get('https://inshorts.com/en/read', headers={'User-Agent': 'Codeup Data Science'})

In [21]:
soup = BeautifulSoup(response.text)

In [22]:
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<style>
    /* The Modal (background) */
    .modal_contact {
        display: none; /* Hidden by default */
        position: fixed; /* Stay in place */
        z-index: 8; /* Sit on top */
        left: 0;
        top: 0;
        width: 100%; /* Full width */
        height: 100%;
        overflow: auto; /* Enable scroll if needed */
        background-color: rgb(0,0,0); /* Fallback color */
        background-color: rgba(0,0,0,0.4); /* Black w/ opacity */
    }

    /* Modal Content/Box */
    .modal-content {
        background-color: #fefefe;
        margin: 15% auto;
        padding: 20px !important;
        padding-top: 0 !important;
        /* border: 1px solid #888; */
        text-align: center;
        position: relative;
        border-radius: 6px;
    }

    /* The Close Button */
    .close {
      left: 90%;
      color: #aaa;
      float: right;
      font-size: 28px;
      font-weight: bold;
    /* positio

In [23]:
author = soup.select_one('.author').text
author

'Shalini Ojha'

In [24]:
published = soup.select_one('.time').attrs['content']
published

'2021-10-27T17:26:52.000Z'

In [25]:
headline = soup.find_all('span', itemprop = 'headline')[0].get_text()
headline

"Munawar Faruqui cancels shows in Mumbai over 'safety of audience'"

In [26]:
def news_card():
    response = requests.get('https://inshorts.com/en/read', headers={'User-Agent': 'Codeup Data Science'})
    soup = BeautifulSoup(response.text)
    author = soup.select_one('.author').text
    published = soup.select_one('.time').attrs['content']
    headline = soup.find_all('span', itemprop = 'headline')[0].get_text()
    
    return author, published, headline
    

In [27]:
news_card()

('Shalini Ojha',
 '2021-10-27T17:26:52.000Z',
 "Munawar Faruqui cancels shows in Mumbai over 'safety of audience'")

In [36]:
# set up base url and categories list
base_url = 'https://inshorts.com/en/read'
categories = ['sports', 'entertainment', 'business', 'technology']

In [37]:
# figure out how to use the categories to add to base url
base_url + '/' + categories[0]

'https://inshorts.com/en/read/sports'

In [38]:
# make function to create a list of urls 
# probably could just call this line in the main function but it was fun

def create_urls(base_url, categories):
    '''
    Takes a base url and a list of categories.
    Returns a new list of urls, each built as base_url + '/' + category.
    
    Used for scraping the category pages of the inshorts website.
    '''
    
    website_list = [base_url + '/' + category for category in categories]
    
    return website_list

In [39]:
create_urls(base_url, categories)

['https://inshorts.com/en/read/sports',
 'https://inshorts.com/en/read/entertainment',
 'https://inshorts.com/en/read/business',
 'https://inshorts.com/en/read/technology']
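
Plain string concatenation produces a double slash when the base url already ends with one, as in 'https://inshorts.com/en/read/'. A slightly more defensive variant is sketched below; `build_category_urls` is a hypothetical name, and lowercasing the category is an assumption based on the lowercase urls shown above.

```python
def build_category_urls(base_url, categories):
    """Join base_url and each category with exactly one slash between them."""
    base = base_url.rstrip("/")
    # Lowercasing is an assumption: the urls in this notebook are all lowercase.
    return [f"{base}/{category.strip('/').lower()}" for category in categories]

urls = build_category_urls("https://inshorts.com/en/read/", ["Sports", "business"])
print(urls)
```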

In [43]:
def get_blog_articles2(base_url, categories):
    '''
    Scrapes the inshorts category pages built from base_url and categories.
    
    On each category page, every headline is paired with its article body.
    Returns a dataframe with the columns 'title', 'content', and 'category',
    one row per article.
    '''
    
    # initialize empty list for the dictionaries
    article_list = []
    
    # set up headers
    headers = {'User-Agent': 'Codeup Data Science'} 
    
    # create list of websites using the categories
    website_list = create_urls(base_url, categories)
    
    # loop through list of websites and category list
    for website, category in zip(website_list, categories): 
        
        # get response
        response = get(website, headers=headers)
    
        # create soup object
        soup = BeautifulSoup(response.text)
        
        # find titles
        headlines = soup.find_all('span', itemprop = 'headline')
        
        # find bodies 
        bodies = soup.find_all('div', itemprop = 'articleBody')
        
        # pair each headline with its body; zip stops at the shorter list,
        # so a missing body cannot raise an IndexError
        for headline, body in zip(headlines, bodies):
            
            # create dictionary and add it to the list of dictionaries
            article_list.append({'title': headline.get_text(),
                                 'content': body.get_text(),
                                 'category': category})
        
    return pd.DataFrame(article_list)

In [46]:
# set up base url and categories list
base_url = 'https://inshorts.com/en/read'
categories = ['sports', 'entertainment', 'business', 'technology']

# test function
get_blog_articles2(base_url, categories)

| | title | content | category |
|---|---|---|---|
| 0 | I actually think people talking about my form ... | Australia opener David Warner said that he "ac... | sports |
| 1 | Waqar apologises for his 'namaz in front of Hi... | Former Pakistan captain Waqar Younis has apolo... | sports |
| 2 | Namibia beat Scotland in T20 WC 2021 as Ruben ... | Namibia defeated Scotland by four wickets in t... | sports |
| 3 | England chase down 125-run target in 85 balls ... | England chased down a target of 125 runs in ju... | sports |
| 4 | Sourav Ganguly quits ATK Mohun Bagan position ... | BCCI President Sourav Ganguly has confirmed to... | sports |
| ... | ... | ... | ... |
| 95 | Twitter posts $1.28 bn revenue, says iOS chang... | Twitter has reported a 37% jump in its third q... | technology |
| 96 | Google-parent Alphabet posts record profit in ... | Google-parent Alphabet has recorded a quarterl... | technology |
| 97 | Facebook hires Britney's lawyer to fight upcom... | Facebook has hired Mathew Rosengart, the lawye... | technology |
| 98 | Tesla Model 3 becomes first EV to top European... | Tesla's Model 3 has become the first electric ... | technology |
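
The exercise names the end product get_news_articles and asks for a list of dictionaries. A minimal sketch of that return shape is shown below, with the parsing kept separate from the request so it can run on saved HTML. The names `parse_news_page` and the `pages` argument are hypothetical, and the sample markup only imitates the itemprop attributes used above; it is not copied from inshorts.

```python
from bs4 import BeautifulSoup

def parse_news_page(html, category):
    """Return one dictionary per headline/body pair found in html."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = soup.find_all("span", itemprop="headline")
    bodies = soup.find_all("div", itemprop="articleBody")
    # zip stops at the shorter list, so an unmatched headline is dropped.
    return [
        {
            "title": headline.get_text(strip=True),
            "content": body.get_text(strip=True),
            "category": category,
        }
        for headline, body in zip(headlines, bodies)
    ]

def get_news_articles(pages):
    """pages maps category name to page HTML; returns a flat list of dictionaries."""
    articles = []
    for category, html in pages.items():
        articles.extend(parse_news_page(html, category))
    return articles

# Hypothetical markup that imitates the itemprop attributes used above.
pages = {
    "sports": '<div><span itemprop="headline">Team wins</span>'
              '<div itemprop="articleBody">They won.</div></div>',
    "business": '<div><span itemprop="headline">Stocks rise</span>'
                '<div itemprop="articleBody">Markets up.</div></div>',
}
news = get_news_articles(pages)
print(news)
```

In a live run, the values of pages would be the response.text of each category url.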


2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

In [47]:
# get response 
url = 'https://inshorts.com/en/read/'
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers=headers)

In [48]:
soup = BeautifulSoup(response.content)

In [49]:
soup.find_all('span', itemprop = 'headline')

[<span itemprop="headline">Munawar Faruqui cancels shows in Mumbai over 'safety of audience'</span>,
 <span itemprop="headline">I actually think people talking about my form is quite funny: David Warner</span>,
 <span itemprop="headline">Sourav Ganguly quits ATK Mohun Bagan position after their owners RPSG buy IPL team</span>,
 <span itemprop="headline">Namibia beat Scotland in T20 WC 2021 as Ruben takes 3 wickets in 1st over of match</span>,
 <span itemprop="headline">Clear Air India dues, purchase tickets in cash: Govt to ministries</span>,
 <span itemprop="headline">It was a family bet: Udaipur teacher fired for 'celebrating' Pak's win against India</span>,
 <span itemprop="headline">India will govern whole of Kashmir someday: Senior IAF official</span>,
 <span itemprop="headline">India successfully test-fires nuclear-capable Agni-5 missile with range of 5,000 km</span>,
 <span itemprop="headline">2 painters left hanging above 26th floor of building after woman cuts rope in Thailand

In [51]:
soup.find_all('span', itemprop = 'headline')[0].get_text()

"Munawar Faruqui cancels shows in Mumbai over 'safety of audience'"

In [52]:
soup.find_all('div', itemprop = 'articleBody')[0].get_text()

'Stand-up comedian Munawar Faruqui on Wednesday announced that he is cancelling his shows in Mumbai, titled \'Dongri to Nowhere\', scheduled for October 29-31. "The safety of the audience is what matters most to me," he wrote. Notably, Vishva Hindu Parishad (VHP) had written to Mumbai Police calling Faruqui a threat to law and order and claimed he mocks Hindu Gods.\n'

In [58]:
# set up base url and categories list
base_url = 'https://inshorts.com/en/read'
categories = ['sports', 'entertainment', 'business', 'technology']

# test function
get_blog_articles2(base_url, categories)

| | title | content | category |
|---|---|---|---|
| 0 | I actually think people talking about my form ... | Australia opener David Warner said that he "ac... | sports |
| 1 | England chase down 125-run target in 85 balls ... | England chased down a target of 125 runs in ju... | sports |
| 2 | Waqar apologises for his 'namaz in front of Hi... | Former Pakistan captain Waqar Younis has apolo... | sports |
| 3 | Namibia beat Scotland in T20 WC 2021 as Ruben ... | Namibia defeated Scotland by four wickets in t... | sports |
| 4 | Sourav Ganguly quits ATK Mohun Bagan position ... | BCCI President Sourav Ganguly has confirmed to... | sports |
| ... | ... | ... | ... |
| 95 | Tesla Model 3 becomes first EV to top European... | Tesla's Model 3 has become the first electric ... | technology |
| 96 | Govt to pitch to Tesla, Samsung for local batt... | The government is planning to pitch to compani... | technology |
| 97 | Govt preparing report on Facebook whistleblowe... | The government is preparing a report on Facebo... | technology |
| 98 | Facebook staff who raise alarms less likely to... | Facebook whistleblower Frances Haugen has said... | technology |
