In [1]:
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
import os

# Exercises:

## 1. Codeup Blog Articles

- Visit [Codeup's Blog](https://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

- Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
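
A minimal sketch of how one page's HTML could be turned into a dictionary of that shape. It assumes the article body sits in an element with the `entry-content` class and that the page has a `<title>` tag. The helper name `parse_article` and the sample HTML are hypothetical and stand in for a real request.

```python
from bs4 import BeautifulSoup

# Stand-in HTML so the example runs without a network request (hypothetical content).
SAMPLE_HTML = """
<html><head><title>Sample Post</title></head>
<body><div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>
"""

def parse_article(html):
    """Return {'title': ..., 'content': ...} for one article page (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # select_one returns the first match or None; select would return a list of tags
    body = soup.select_one('.entry-content')
    return {
        'title': soup.title.get_text(strip=True),
        # get_text flattens the tag tree into plain text instead of keeping Tag objects
        'content': body.get_text(separator=' ', strip=True) if body else '',
    }

article = parse_article(SAMPLE_HTML)
print(article)
```

Calling `get_text` on the selected element yields a plain string, which matches the "full text content" the exercise asks for, whereas the raw result of `select` is a list of tags.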


In [2]:
#Setting the URL that I'm going to access:
url = 'https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/'
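#Setting a custom User-Agent header so the request doesn't look like the default 'python-requests' client: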
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [3]:
#Verifying that the response is valid:
print(response.text[:500])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=window.DiviAreaConfig={"zIndex":1000000,"animateSpeed":400,"triggerClassPrefix":"show-popup-","idAttri


In [4]:
#Creating a beautiful soup object to contain the HTML information from the page:
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
#Getting the title of the article:
soup.title.string

'What Jobs Can You Get After a Coding Bootcamp?'

In [6]:
#Getting the content of the article by selecting everything with the 'entry-content' class designation:
soup.select(".entry-content")

[<div class="entry-content">
 <p><span style="font-weight: 400;">If you are interested in embarking on a career in tech, you’re probably wondering what your new job title could be, and even what your salary might look like.* </span><span style="font-weight: 400;">In this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries. </span><span style="font-weight: 400;">Today we will be diving into our </span><a href="https://codeup.com/program/data-science/"><span style="font-weight: 400;">Data Science</span></a><span style="font-weight: 400;"> program, with four potential job titles you could take on!</span></p>
 <h2><b>Program Overview</b><span style="font-weight: 400;"> </span></h2>
 <p><span style="font-weight: 400;">During this 20-week program, you will have the opportunity to take your career to new heights with data science being one of the m

In [7]:
def get_blog_articles(urls, refresh = False):
    
    #Checks whether there is already a CSV or if user wants to refresh data:
    if not os.path.isfile('blog_articles.csv') or refresh:
        
        #Creating an empty list to store one DataFrame per article:
        codeup_articles = []
        
        for url in urls:
            #Setting a custom User-Agent header and requesting the page:
            headers = {'User-Agent': 'Codeup Data Science'}
            response = get(url, headers=headers)
            soup = BeautifulSoup(response.content, 'html.parser')
            title = soup.title.string
            contents = soup.select(".entry-content")
            #Creating a dictionary entry for current article:
            article_info = {
                'url' : url,
                'title' : title,
                'contents' : contents
            }
            #Converting article dictionary to DataFrame:
            article_info = pd.DataFrame(article_info)
            #Appending DataFrame for current article onto list of blog info:
            codeup_articles.append(article_info)
            
        #Concatenating all of the DataFrames in the list to create one large DataFrame:
        codeup_articles = pd.concat(codeup_articles)
        #Writes the total DataFrame to a CSV for caching:
        codeup_articles.to_csv('blog_articles.csv', index = False)
        
    else:
        #Reading the cached CSV instead of requesting every page again:
        codeup_articles = pd.read_csv('blog_articles.csv')
        
    return codeup_articles

In [8]:
urls = ['https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/', 
        'https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/',
        'https://codeup.com/tips-for-prospective-students/is-our-cloud-administration-program-right-for-you/',
        'https://codeup.com/tips-for-prospective-students/mental-health-first-aid-training/',
        'https://codeup.com/codeup-news/inclusion-at-codeup-during-pride-month-and-always/']

codeup_articles = get_blog_articles(urls, refresh = True)

In [9]:
codeup_articles

Unnamed: 0,url,title,contents
0,https://codeup.com/data-science/jobs-after-a-c...,What Jobs Can You Get After a Coding Bootcamp?,"[\n, [[If you are interested in embarking on a..."
0,https://codeup.com/featured/what-jobs-can-you-...,What Jobs Can You Get After a Coding Bootcamp?...,"[\n, [[Have you been considering a career in C..."
0,https://codeup.com/tips-for-prospective-studen...,Is Our Cloud Administration Program Right for ...,"[\n, [[Changing careers can be scary. The firs..."
0,https://codeup.com/tips-for-prospective-studen...,Mental Health First Aid Training - Codeup,"[\n, [[As a student of Codeup, going through a..."
0,https://codeup.com/codeup-news/inclusion-at-co...,Inclusion at Codeup During Pride Month (and Al...,"[\n, [Happy Pride Month! Pride Month is a dedi..."


In [10]:
df = pd.DataFrame(codeup_articles)
df

Unnamed: 0,url,title,contents
0,https://codeup.com/data-science/jobs-after-a-c...,What Jobs Can You Get After a Coding Bootcamp?,"[\n, [[If you are interested in embarking on a..."
0,https://codeup.com/featured/what-jobs-can-you-...,What Jobs Can You Get After a Coding Bootcamp?...,"[\n, [[Have you been considering a career in C..."
0,https://codeup.com/tips-for-prospective-studen...,Is Our Cloud Administration Program Right for ...,"[\n, [[Changing careers can be scary. The firs..."
0,https://codeup.com/tips-for-prospective-studen...,Mental Health First Aid Training - Codeup,"[\n, [[As a student of Codeup, going through a..."
0,https://codeup.com/codeup-news/inclusion-at-co...,Inclusion at Codeup During Pride Month (and Al...,"[\n, [Happy Pride Month! Pride Month is a dedi..."


## 2. News Articles:

- We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

- Write a function that scrapes the news articles for the following topics:
    - Business
    - Sports
    - Technology
    - Entertainment

- The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:
- a. Start by inspecting the website in your browser. Figure out which elements will be useful.
- b. Start by creating a function that handles a single article and produces a dictionary like the one above.
- c. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- d. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
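
Hints b through d can be sketched as three small functions. This is one possible decomposition, not the required one. The `itemprop` values match the ones used in the cells below, while the `news-card` container class, the function names, and the `fetch_html` parameter are assumptions made for illustration. A fake fetcher replaces the real request so the sketch runs offline.

```python
from bs4 import BeautifulSoup

def parse_card(card, category):
    """Hint b: turn one article element into a dictionary."""
    return {
        'title': card.find(itemprop='headline').get_text(strip=True),
        'content': card.find(itemprop='articleBody').get_text(strip=True),
        'category': category,
    }

def parse_page(html, category):
    """Hint c: find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: each article sits in its own element with class 'news-card'
    return [parse_card(card, category) for card in soup.select('.news-card')]

def scrape_categories(categories, fetch_html):
    """Hint d: collect the articles from every category page into one list."""
    articles = []
    for category in categories:
        # fetch_html is any callable that returns the page HTML for a category
        articles.extend(parse_page(fetch_html(category), category))
    return articles

def fake_fetch(category):
    """Hypothetical stand-in for a real request, returning one article per page."""
    return f'''
    <div class="news-card">
      <span itemprop="headline">{category} headline</span>
      <div itemprop="articleBody">{category} body</div>
    </div>'''

articles = scrape_categories(['business', 'sports'], fake_fetch)
print(articles)
```

Parsing headline and body from the same container keeps each title paired with its own content, which two independent `find_all` calls cannot guarantee if one element is missing on the page.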

In [11]:
#Setting the URL that I'm going to access:
url = 'https://inshorts.com/en/read'
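#Setting a custom User-Agent header so the request doesn't look like the default 'python-requests' client: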
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [12]:
#Creating a beautiful soup object to contain the HTML information from the page:
soup = BeautifulSoup(response.content, 'html.parser')

In [13]:
title = soup.find_all(itemprop="headline")
content  = soup.find_all(itemprop="articleBody")

In [14]:
def get_shorts_articles(categories, refresh = False):
    
    #Checks whether there is already a CSV or if user wants to refresh data:
    if not os.path.isfile('news_articles.csv') or refresh:

        #Creating an empty list to contain DataFrames of scraped data components:
        inshorts_articles = []

        #Establishing a for-loop to iterate through desired categories:
        for category in categories:
            #Creating an empty dataframe to store lists of article data components:
            article_info = pd.DataFrame()
            #Establishing baseline url and using format string to iterate through categories:
            url = f'https://inshorts.com/en/read/{category}'
            #Establishing header so it doesn't look like the default 'python-requests' client:
            headers = {'User-Agent': 'Codeup Data Science'}
            #Saving the response from the website:
            response = get(url, headers=headers)
            #Creating a beautiful soup object to contain the HTML information from the page:
            soup = BeautifulSoup(response.content, 'html.parser')
            #Creating a list of all titles in the given category:
            titles = soup.find_all(itemprop = 'headline')
            #Creating a list of all article bodies in the given category:
            contents = soup.find_all(itemprop = 'articleBody')
            #Adding 'title' column to DataFrame containing title text for each article:
            article_info['title'] = [title.text for title in titles]
            #Adding 'contents' column to DataFrame containing article body text for each article:
            article_info['contents'] = [content.text for content in contents]
            #Adding 'category' column to list category for each article in the category:
            article_info['category'] = category
            #Appending DataFrame for each category to overall list of DataFrames:
            inshorts_articles.append(article_info)

        #Concatenating all of the DataFrames in the list to create one large DataFrame:
        inshorts_articles = pd.concat(inshorts_articles)
        #Writes the total DataFrame to a CSV for caching:
        inshorts_articles.to_csv('news_articles.csv', index = False)

    else:
        #Reading the cached CSV instead of requesting every page again:
        inshorts_articles = pd.read_csv('news_articles.csv')

    #Returning final DataFrame:
    return inshorts_articles

In [17]:
categories = ['business', 'sports', 'technology', 'entertainment']

inshorts_articles = get_shorts_articles(categories, refresh = True)

In [18]:
inshorts_articles

Unnamed: 0,title,contents,category
0,Rupee hits 80 per US dollar for the first time...,The Indian rupee touched 80 per US dollar for ...,business
1,ED arrests ex-Mumbai Police chief Sanjay Pande...,The Enforcement Directorate (ED) on Tuesday ar...,business
2,Gautam Adani overtakes Bill Gates to become wo...,Gautam Adani has overtaken Bill Gates to becom...,business
3,Who are now the world's 10 richest people as A...,Gautam Adani has overtaken Bill Gates to becom...,business
4,List of items exempt from GST when sold loose ...,Amid criticism over pre-packaged and pre-label...,business
...,...,...,...
20,Dad Rishi Kapoor called my film choices 'nonse...,Ranbir Kapoor revealed his late father Rishi K...,entertainment
21,My producers tell me 'You double our money in ...,Following the success of his film 'Bhool Bhula...,entertainment
22,"Jackie Chan opened doors for me in H'wood, he'...","Talking about Jackie Chan, Mallika Sherawat sa...",entertainment
23,I salute Sushmita Sen for living life on her o...,Filmmaker Mahesh Bhatt defended Sushmita Sen a...,entertainment



## 3. Bonus: cache the data

Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (re-writing your local cache).
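
A minimal sketch of that caching pattern, separated from the scraping itself. It stores the records as JSON rather than CSV so that a list of dictionaries round-trips unchanged. The names `get_cached`, `acquire`, and `fake_acquire` are hypothetical, and the fake acquire function stands in for the real requests.

```python
import json
import os
import tempfile

def get_cached(cache_path, acquire, refresh=False):
    """Return locally cached records unless the file is missing or refresh is True."""
    if os.path.isfile(cache_path) and not refresh:
        # Cache hit: read the local copy and skip the requests entirely
        with open(cache_path) as f:
            return json.load(f)
    # Cache miss or forced refresh: acquire fresh data and overwrite the cache
    records = acquire()
    with open(cache_path, 'w') as f:
        json.dump(records, f)
    return records

calls = []  # records how many times the (fake) acquisition actually ran

def fake_acquire():
    """Hypothetical stand-in for the scraping function."""
    calls.append(1)
    return [{'title': 'a title', 'content': 'some content'}]

cache_path = os.path.join(tempfile.mkdtemp(), 'articles.json')
first = get_cached(cache_path, fake_acquire)                 # no file yet, so it acquires
second = get_cached(cache_path, fake_acquire)                # file exists, so it reads it
third = get_cached(cache_path, fake_acquire, refresh=True)   # flag forces a re-acquire
print(len(calls))
```

The important detail is the branch that reads the file: without it, a call with `refresh=False` and an existing cache would return nothing useful.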