In [1]:
import numpy as np
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soupify

## Exercise
### 1. Codeup Blog Articles

### Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

### Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [2]:
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}


{'title': 'the title of the article',
 'content': 'the full text content of the article'}

### Plus any additional properties you think might be helpful.

In [3]:
#Blog Post URLs
url1 = 'https://codeup.com/codeup-news/dei-report/'
url2 = 'https://codeup.com/codeup-news/diversity-and-inclusion-award/'
url3 = 'https://codeup.com/featured/financing-career-transition/'
url4 = 'https://codeup.com/tips-for-prospective-students/tips-for-women/'
url5 = 'https://codeup.com/cloud-administration/cloud-computing-and-aws/'
url6 = 'https://codeup.com/codeup-news/c-suite-award-stephen-noteboom/'

## Let us work with url1

In [4]:
#get url1
url1

'https://codeup.com/codeup-news/dei-report/'

In [5]:
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url1, headers=headers)
response

<Response [200]>

In [6]:
# sanity check: look at the first 400 characters of the response
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [7]:
# make a soup:
# recipe:
# call BeautifulSoup on the content of our response
soup = BeautifulSoup(response.content, 'html.parser')

In [8]:
# If we look at soup, it has the same structure as the text, but a little cleaner;
# furthermore, it's a new object -- a BeautifulSoup object
# soup

In [9]:
# let us get title of the article for url1
title = soup.find('h1', class_='entry-title').text
title

'Diversity Equity and Inclusion Report'

In [10]:
# what date was it published?
date = soup.find('span', class_='published').text
date

'Oct 7, 2022'
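
The published date comes back as a display string. If it needs to be sorted or filtered later, one option is to parse it with the standard library. This is a minimal sketch that assumes the format always matches 'Oct 7, 2022' and an English locale; `parse_published` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime

def parse_published(date_text):
    """Convert a string like 'Oct 7, 2022' into a datetime.date."""
    # %b = abbreviated month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(date_text.strip(), '%b %d, %Y').date()

published = parse_published('Oct 7, 2022')
print(published.isoformat())
```

A date object sorts chronologically, which the raw string does not.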

In [11]:
# what is the category of the article?
category = soup.find('a', rel = 'category tag').text
category

'Codeup News'

In [12]:
# let us take a look at the content of the article 
content = soup.find('div', class_= 'entry-content').text.strip().replace('\n', ' ')
content

'Codeup is excited to launch our first Diversity Equity, and Inclusion (DEI) report! In over eight years as an organization, we’ve implemented policies and grown our DEI efforts. We are extremely proud of the progress we’ve made as a staff and Codeup community, and we recognize there is more to learn. This report captures some of the ways that we’ve lived our value of Cultivating Inclusive Growth, and how we will continue doing so as we look to the future. We wanted to shine a light on the demographics of our students and staff, and in particular how that compares to the tech industry as a whole. How we collect, organize, and share employee demographic data is informed by standards set by the Equal Employment Opportunity Commission (EEOC). We are proud to celebrate how we’ve grown and are motivated and committed to do more and be better. To view the report visit the link here, or download it below.'
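
The cleanup above chains `.strip()` and `.replace('\n', ' ')`. A slightly more general sketch collapses every run of whitespace, including non-breaking spaces (`\xa0`), into a single space. `clean_text` is a hypothetical helper and the sample string is made up for illustration.

```python
import re

def clean_text(raw):
    """Collapse whitespace runs (newlines, tabs, non-breaking spaces) into single spaces."""
    # For str patterns, \s matches Unicode whitespace, which includes \xa0
    return re.sub(r'\s+', ' ', raw).strip()

sample = '\n\nCodeup is excited\xa0to launch\n our first report! \n'
cleaned = clean_text(sample)
print(cleaned)
```

This also avoids the double spaces that a plain newline replacement can leave behind.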

In [13]:
# now let us create a function that creates a dictionary out of a codeup blog page url

def parse_blog(url):
    # establish a header so the request identifies itself
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    
    # Make a soup variable holding the response content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    output = {}
    output['title'] = soup.find('h1', class_='entry-title').text
    output['date'] = soup.find('span', class_='published').text
    output['category'] = soup.find('a', rel = 'category tag').text
    output['content'] = soup.find('div', class_= 'entry-content').text.strip().replace('\n', ' ').replace('\xa0', ' ')
    
    return output

In [14]:
#now let us test the function on url3
parse_blog(url3)

{'title': 'How Can I Finance My Career Transition?',
 'date': 'Sep 29, 2022',
 'category': 'Cloud Administration',
 'content': 'Deciding to transition into a tech career is a big step and a more significant commitment. Often after deciding to commit to a journey of this nature, the main obstacle is finding a way to finance your training. At Codeup, we recognize that many of our students are career transitioners, and attending one of our programs can sometimes require sacrifice. Luckily, we have several ways to help you finance your career transition, ultimately leading to that new career you’ve decided to pursue. Programs We offer three different accelerated coding bootcamps at Codeup including Full-Stack Web Development, Cloud Administration, and Data Science. These are all instructor-led and designed to quickly equip you with the skills and knowledge to secure an entry-level position in-field. Getting Started For all of our programs, a $1,000 deposit one week before a student’s first
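
parse_blog assumes every selector matches. If a page lacks one of the elements, `soup.find(...)` returns `None` and the `.text` access raises an `AttributeError`. The sketch below shows one way to guard against that; `safe_text` is a hypothetical helper, and the HTML fragment is made up rather than taken from the Codeup site.

```python
from bs4 import BeautifulSoup

def safe_text(soup, name, default=None, **attrs):
    """Return the stripped text of the first matching tag, or `default` if nothing matches."""
    tag = soup.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else default

# Made-up fragment: it has a title but no <span class="published">
html = '<h1 class="entry-title"> Example Post </h1><div class="entry-content">Body text</div>'
demo_soup = BeautifulSoup(html, 'html.parser')

demo_title = safe_text(demo_soup, 'h1', class_='entry-title')
demo_date = safe_text(demo_soup, 'span', default='unknown', class_='published')
print(demo_title, demo_date)
```

Whether a missing field should become a default value or stop the scrape is a judgment call; the default keeps a long run going when one page is laid out differently.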

In [15]:
# this function scrapes articles from codeup blog page and returns a list of dictionaries

def get_blog_articles(urls):
    """
    This function takes in a list of urls for Codeup blog articles. It loops through each and uses the parse_blog() function to get the title,
    publishing date, category, and article content, creating one dictionary per article. It returns a list of these dictionaries.
    """
  
    output = []
    
    for url in urls:
        output.append(parse_blog(url))
    
    return output
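
The loop above requests the pages back to back and stops at the first page that fails to parse. A possible variation pauses between requests and skips failures. This is only a sketch: `collect_articles` and `fake_parse` are hypothetical names, the example.com URLs are placeholders, and `fake_parse` stands in for parse_blog so the block runs without network access.

```python
import time

def collect_articles(urls, parse, delay=0.0):
    """Apply `parse` to each url, pausing between calls and skipping pages that fail."""
    articles = []
    for i, url in enumerate(urls):
        if i > 0 and delay:
            time.sleep(delay)  # be polite to the server between requests
        try:
            articles.append(parse(url))
        except Exception as err:
            print(f'skipping {url}: {err}')
    return articles

def fake_parse(url):
    """Stand-in parser: fails on one url, otherwise returns a small dictionary."""
    if 'broken' in url:
        raise AttributeError('selector not found')
    return {'title': url.rsplit('/', 2)[-2], 'content': '...'}

demo_urls = [
    'https://example.com/post-a/',
    'https://example.com/broken/',
    'https://example.com/post-b/',
]
demo_result = collect_articles(demo_urls, fake_parse)
print([article['title'] for article in demo_result])
```

With the real parser this would be called as `collect_articles(urls, parse_blog, delay=1)`.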

In [16]:
# now let us define list of urls
urls = [url1, url2, url3, url4, url5, url6]
urls

['https://codeup.com/codeup-news/dei-report/',
 'https://codeup.com/codeup-news/diversity-and-inclusion-award/',
 'https://codeup.com/featured/financing-career-transition/',
 'https://codeup.com/tips-for-prospective-students/tips-for-women/',
 'https://codeup.com/cloud-administration/cloud-computing-and-aws/',
 'https://codeup.com/codeup-news/c-suite-award-stephen-noteboom/']

In [17]:
# now we can test final function 
get_blog_articles(urls)

[{'title': 'Diversity Equity and Inclusion Report',
  'date': 'Oct 7, 2022',
  'category': 'Codeup News',
  'content': 'Codeup is excited to launch our first Diversity Equity, and Inclusion (DEI) report! In over eight years as an organization, we’ve implemented policies and grown our DEI efforts. We are extremely proud of the progress we’ve made as a staff and Codeup community, and we recognize there is more to learn. This report captures some of the ways that we’ve lived our value of Cultivating Inclusive Growth, and how we will continue doing so as we look to the future. We wanted to shine a light on the demographics of our students and staff, and in particular how that compares to the tech industry as a whole. How we collect, organize, and share employee demographic data is informed by standards set by the Equal Employment Opportunity Commission (EEOC). We are proud to celebrate how we’ve grown and are motivated and committed to do more and be better. To view the report visit the li

### 2. News Articles

### We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

### Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

In [27]:
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' 
}

{'title': 'The article title',
 'content': 'The article content',
 'category': 'business'}

In [28]:
# define the inshorts base url (category names get appended to it); note this reuses the name url1
url1 = 'https://inshorts.com/en/read'

In [29]:
soup = soupify(get(url1).content, 'html.parser')

In [30]:
# Find the first category name: it is the text of the second <li> on the page
soup.find_all('li')[1].text.lower()

'india'

In [31]:
# try concatenation:
url1 + '/' + soup.find_all('li')[1].text.lower()

'https://inshorts.com/en/read/india'

In [32]:
def get_cats(base_url):
    soup = soupify(get(base_url).content, 'html.parser')
    return [cat.text.lower() for cat in soup.find_all('li')[1:]]

In [34]:
get_cats(url1)

['india',
 'business',
 'sports',
 'world',
 'politics',
 'technology',
 'startup',
 'entertainment',
 'miscellaneous',
 'hatke',
 'science',
 'automobile']
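
The exercise only asks for four topics, while get_cats returns every category on the page. One way to narrow the list and build the category urls is sketched below. The `scraped` list is copied from the output above; `wanted` and `cat_urls` are hypothetical names.

```python
base_url = 'https://inshorts.com/en/read'

# Categories as returned by get_cats in the output above
scraped = ['india', 'business', 'sports', 'world', 'politics', 'technology',
           'startup', 'entertainment', 'miscellaneous', 'hatke', 'science',
           'automobile']

# Topics requested by the exercise
wanted = {'business', 'sports', 'technology', 'entertainment'}

# Keep only the requested topics, preserving page order, and build each url
cat_urls = {cat: f'{base_url}/{cat}' for cat in scraped if cat in wanted}

for cat, url in cat_urls.items():
    print(cat, url)
```

Filtering against the scraped list, rather than hardcoding the four urls, means a category that disappears from the site is dropped instead of producing a failed request.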

In [37]:
# let's make a first pass on a single category: science
cat_url1 = url1 + '/' + 'science'

In [38]:
cat_soup = soupify(get(cat_url1).content, 'html.parser')

In [39]:
#let's find the title of the first article on the page
cat_soup.find_all('span', itemprop='headline')[0].text

'New species of beetle named after Novak Djokovic'

In [41]:
cat_titles = [title.text for title in cat_soup.find_all('span', itemprop='headline')]
cat_titles

['New species of beetle named after Novak Djokovic',
 "Scientists map 'graveyard of stars' in our galaxy for the 1st time, pics released",
 'Toxic air pollution particles found in lungs, brains of unborn babies for the first time',
 'Orcas caught chasing, killing great white shark on video for the first time',
 'Microplastics found in human breast milk for the first time',
 'Asteroid hit by NASA leaves 10,000 km trail of debris; pic surfaces',
 'You have 50 mins until your life changes: Nobel Committee to winner at 1:53 am',
 "NASA releases new pic of Jupiter's ice-covered moon Europa captured by Juno",
 "Material coming out of black hole is like nothing we've ever seen: Scientists",
 "Pics show aerial view of US' Florida before and after Hurricane Ian's approach",
 "Asteroid's path altered after NASA deliberately crashes spacecraft into it",
 'SpaceX & NASA launch crew of 4, including a Russian cosmonaut, to ISS',
 'SpaceX launches 52 Starlink satellites hours after launching Crew-5 m

In [43]:
#let's check article's body
cat_soup.find_all('div', itemprop='articleBody')[0].text

'Serbian scientists named a new species of beetle after ex-world number one men\'s tennis player Novak Djokovic. The insect, which belongs to Duvalius genus of ground beetles present in Europe and was discovered several years ago in underground pit in Serbia, has been named \'Duvalius djokovici\'. "We feel urged to pay Djokovic back in...way we can," a researcher said.'

In [45]:
# let's create a function that builds a list of dictionaries, one per article, across every category page

def get_all_shorts(base_url):
    cats = get_cats(base_url)
    all_articles = []
    for cat in cats:
        cat_url = base_url + '/' + cat
        # request each category page once and reuse the response
        response = get(cat_url)
        print(response)
        cat_soup = soupify(response.content, 'html.parser')
        # collect headlines and article bodies in page order
        cat_titles = [
            title.text for title in cat_soup.find_all('span', itemprop='headline')
        ]
        cat_bodies = [
            body.text for body in cat_soup.find_all('div', itemprop='articleBody')
        ]
        # pair each headline with its body and tag it with the category
        cat_articles = [
            {'title': title, 'category': cat, 'body': body}
            for title, body in zip(cat_titles, cat_bodies)
        ]
        print('cat articles length: ', len(cat_articles))
        all_articles.extend(cat_articles)
        print('length of all_articles: ', len(all_articles))
    return all_articles
        

In [47]:
#get all the articles
all_articles = get_all_shorts(url1)

<Response [200]>
cat articles length:  12
length of all_articles:  12
<Response [200]>
cat articles length:  25
length of all_articles:  37
<Response [200]>
cat articles length:  25
length of all_articles:  62
<Response [200]>
cat articles length:  25
length of all_articles:  87
<Response [200]>
cat articles length:  25
length of all_articles:  112
<Response [200]>
cat articles length:  25
length of all_articles:  137
<Response [200]>
cat articles length:  25
length of all_articles:  162
<Response [200]>
cat articles length:  25
length of all_articles:  187
<Response [200]>
cat articles length:  24
length of all_articles:  211
<Response [200]>
cat articles length:  25
length of all_articles:  236
<Response [200]>
cat articles length:  25
length of all_articles:  261
<Response [200]>
cat articles length:  24
length of all_articles:  285
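
get_all_shorts pairs headlines and bodies with `zip`, which silently stops at the shorter list, so a missing body would misalign or drop articles without any error. It also stores the text under 'body', whereas the exercise's target shape uses 'content'. The sketch below shows one way to handle both; `pair_articles` is a hypothetical helper and the sample data is made up.

```python
def pair_articles(titles, bodies, category):
    """Pair titles with bodies, warning if the two lists differ in length."""
    if len(titles) != len(bodies):
        print(f'warning: {len(titles)} titles but {len(bodies)} bodies in {category}')
    # zip stops at the shorter list, so any extras are dropped
    return [
        {'title': title, 'content': body, 'category': category}
        for title, body in zip(titles, bodies)
    ]

demo_articles = pair_articles(
    ['First headline', 'Second headline'],
    ['First body text.', 'Second body text.'],
    'science',
)
print(demo_articles[0])
```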


In [48]:
#get the articles into a DataFrame
all_articles = pd.DataFrame(all_articles)

In [49]:
#check value counts
all_articles.category.value_counts()

business         25
sports           25
world            25
politics         25
technology       25
startup          25
entertainment    25
hatke            25
science          25
miscellaneous    24
automobile       24
india            12
Name: category, dtype: int64

In [50]:
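#check value counts by titles
# NOTE: without parentheses this displays the bound method (and the Series it belongs to), not the counts;
# call all_articles.title.value_counts() to get the actual counts
all_articles.title.value_counts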

<bound method IndexOpsMixin.value_counts of 0      Afghanistan wins SAFF title, spoils India's ha...
1      Nigerian weightlifter in dope net, India may gain
2           India beat NZ 3-2 to enter CWG hockey finals
3                 India's first Billiards Premier League
4              Infosys Gifts Sikka Shares Worth Rs 8.2cr
                             ...                        
280    Passenger vehicle wholesales rise by 92% in Se...
281    Porsche becomes Europe's most valuable automak...
282    Fix for wheel issue that caused electric car r...
283    Vintage cars on display to promote wildlife pr...
284    Tesla delivered record 83,135 China-made EVs i...
Name: title, Length: 285, dtype: object>
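
Counting titles is a quick way to spot the same story appearing under more than one category. A minimal sketch on a small made-up DataFrame (not the scraped data) shows the call with parentheses and one way to drop repeats; `demo`, `repeated`, and `deduped` are hypothetical names.

```python
import pandas as pd

# Made-up rows: 'Story A' appears under two categories
demo = pd.DataFrame([
    {'title': 'Story A', 'category': 'science'},
    {'title': 'Story B', 'category': 'science'},
    {'title': 'Story A', 'category': 'technology'},
])

# value_counts() must be called to return the counts
title_counts = demo.title.value_counts()

# Titles that occur more than once
repeated = title_counts[title_counts > 1]

# Keep the first occurrence of each title
deduped = demo.drop_duplicates(subset='title')

print(repeated)
print(len(deduped))
```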