Acquire NLP Exercise

In [1]:
import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import os
from acquire import get_blog_articles, get_news_articles

In [2]:
posts = get_blog_articles()

In [3]:
posts[:5]

[{'url': 'https://codeup.com/workshops/from-bootcamp-to-bootcamp-a-military-appreciation-panel/',
  'title': 'From Bootcamp to Bootcamp | A Military Appreciation Panel\nApr 27, 2022 | Alumni Stories, Dallas, Events, Featured, Military, San Antonio, Veterans, Virtual, WorkshopsIn honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans!...',
  'date_published': 'Apr 27, 2022',
  'content': 'In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans! We will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at Codeup. Grab your virtual seat now so you can be sent the exclusive Livestream link on the 11th! \nThank you to our panelists for participating: \n\nChristopher Aguirre\nTaryn McKenzie \nDesiree McElroy \n\n\nAnd thanks to Codeup’s Trey Iapachino who is also an Air Force Veteran!'},
 {'url': 'https://co

In [4]:
news = get_news_articles(desired_categories=['Business','Sports','Technology','Entertainment'], get_fresh_news=True)

Scraping category:  Business
Total News Articles in Category:  25
Scraping category:  Sports
Total News Articles in Category:  25
Scraping category:  Technology
Total News Articles in Category:  24
Scraping category:  Entertainment
Total News Articles in Category:  25


In [5]:
news[:3]

[{'headline': 'Rupee hits all-time low of 77.42 against US dollar',
  'author': 'Apaar Sharma',
  'datetime': '2022-05-09T05:05:31.000Z',
  'category': 'business',
  'content': 'The Indian rupee fell to an all-time low of 77.42 against the US dollar on Monday, Reuters reported. Asian markets were lower on Monday as US stock futures fell on fears of more policy tightening from the Federal Reserve and strict lockdown in Shanghai impacting global growth, according to Reuters.'},
 {'headline': 'Bitcoin falls to the lowest level since January, trades below $34,000',
  'author': 'Pragya Swastik',
  'datetime': '2022-05-09T09:20:34.000Z',
  'category': 'business',
  'content': "Bitcoin fell on Monday to as low as $33,266 in morning trade, nearing January's low of $32,951 as slumping equity markets continued to hurt cryptocurrencies. It then steadied to trade above $33,600. According to BBC, the world's largest cryptocurrency has fallen by 50% since its peak in November 2021."},
 {'headline': 
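
The `datetime` values in the output above are ISO 8601 strings with a trailing `Z`. A minimal sketch of turning one into a timezone-aware Python datetime, assuming the `Z` denotes UTC and that every timestamp carries fractional seconds as in the rows shown; `parse_inshorts_timestamp` is a hypothetical helper, not part of the acquire module:

```python
from datetime import datetime, timezone

def parse_inshorts_timestamp(stamp):
    """Parse a string like '2022-05-09T05:05:31.000Z' into an aware UTC datetime."""
    # The trailing 'Z' is matched literally, so the UTC zone is attached explicitly.
    parsed = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%S.%fZ')
    return parsed.replace(tzinfo=timezone.utc)

ts = parse_inshorts_timestamp('2022-05-09T05:05:31.000Z')
print(ts.isoformat())  # 2022-05-09T05:05:31+00:00
```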

Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content. Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
Plus any additional properties you think might be helpful.
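
As a quick sanity check on that shape, a minimal sketch follows. The helper name `missing_keys` and the sample record are hypothetical, introduced only for illustration; they are not part of the exercise.

```python
# Hypothetical helper: report which required keys an article dictionary lacks.
REQUIRED_KEYS = {'title', 'content'}

def missing_keys(article):
    """Return the sorted list of required keys absent from one article dict."""
    return sorted(REQUIRED_KEYS - set(article))

# Made-up sample record, used only to exercise the helper.
sample = {'title': 'the title of the article', 'url': 'https://example.com/post'}
print(missing_keys(sample))  # ['content']
```

Extra keys such as `url` are allowed; only the absence of `title` or `content` is reported.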

In [6]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)


# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')


In [7]:
def get_post_details(post):
    """ Returns dictionary of url, title, date published, and content for each post on the Codeup.com/blog site"""
    output = {}
    # Extract URL
    output['url'] = post.select('a')[0].attrs['href']
    # Extract title (note: post.text also includes the date and excerpt after a newline)
    output['title'] = post.text.strip()
    # Extract date published
    output['date_published'] = post.select('span.published')[0].text
    # Extracts blog post contents
    output['content'] = get_blog_content(output['url'])
    
    return output

def get_blog_content(url):
    """ Returns the content of the blog post """
    headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
    response = get(url, headers=headers)


    # Make a soup variable holding the response content
    soup = BeautifulSoup(response.content, 'html.parser')
    entry_text = ""
    for t in soup.select('div.entry-content'):
        entry_text += t.text.strip()
    return entry_text

def get_blog_articles(return_dataframe = False):
    """ Returns a list of dictionaries (or a dataframe) of information about blog posts on codeup.com/blog site """
    url = 'https://codeup.com/blog/'
    headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
    response = get(url, headers=headers)

    # Make a soup variable holding the response content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Scrape each post once, then choose the return type
    posts = [get_post_details(post) for post in soup.select('article.et_pb_post')]

    if return_dataframe:
        return pd.DataFrame(posts)

    return posts

In [8]:
articles = get_blog_articles(return_dataframe=True)
articles.head()

Unnamed: 0,url,title,date_published,content
0,https://codeup.com/workshops/from-bootcamp-to-...,From Bootcamp to Bootcamp | A Military Appreci...,"Apr 27, 2022","In honor of Military Appreciation Month, join ..."
1,https://codeup.com/featured/our-acquisition-of...,Our Acquisition of the Rackspace Cloud Academy...,"Apr 14, 2022","Just about a year ago on April 16th, 2021 we a..."
2,https://codeup.com/workshops/virtual/learn-to-...,"Learn to Code: HTML & CSS on 4/30\nApr 1, 2022...","Apr 1, 2022",HTML & CSS are the design building blocks of a...
3,https://codeup.com/workshops/virtual/learn-to-...,Learn to Code: Python Workshop on 4/23\nMar 31...,"Mar 31, 2022","According to LinkedIn, the “#1 Most Promising ..."
4,https://codeup.com/codeup-news/coming-soon-clo...,"Coming Soon: Cloud Administration\nMar 17, 202...","Mar 17, 2022",We’re launching a new program out of San Anton...
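
The `title` values above still carry the date and excerpt after a newline, because `post.text` covers the whole article card. A minimal sketch of one way to trim them, assuming the title is always the first line of that text (as it is in the rows shown); `first_line` is a hypothetical helper, not part of the original notebook, and the sample string is abbreviated from the first post in the earlier output:

```python
def first_line(text):
    """Return the first non-empty line of a block of text, stripped."""
    for line in text.strip().splitlines():
        if line.strip():
            return line.strip()
    return ''

# Abbreviated from the first scraped post; the real value continues past 'Events'.
raw_title = ('From Bootcamp to Bootcamp | A Military Appreciation Panel\n'
             'Apr 27, 2022 | Alumni Stories, Dallas, Events')
print(first_line(raw_title))  # From Bootcamp to Bootcamp | A Military Appreciation Panel
```

Selecting the heading element directly would be more robust, but the right selector depends on the page markup, which is not shown here.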


In [9]:
def get_category_news_cards(category):
    """ Returns a list of soup elements, one for each news card on the category page"""
    
    # Note that having the category name capitalized sends you to a different website than lowercase!!
    base_url = r'https://inshorts.com/en/read'
    url = base_url +r'/'+category.lower()
    
    headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
    response = get(url, headers=headers)

    # Make a soup variable holding the response content
    cat_soup = BeautifulSoup(response.content, 'html.parser')
    
    return cat_soup.select('div.news-card.z-depth-1')

def get_news_details(news_card, category):
    """ Returns dictionary with information about the article 
    news_card: the soup for an individual news card within a category card stack
    category: this is passed to this function so it can be inputted to the dictionary"""
    
    output={}
    output['headline'] = news_card.select('div.news-card-title')[0].find("span").text
    output['author'] =  news_card.select('div.news-card-author-time')[0].find('span', class_='author').text
    output['datetime'] = news_card.select('div.news-card-author-time')[0].find('span', class_='time').attrs['content']
    output['category'] = category.lower()
    output['content'] = news_card.select('div.news-card-content')[0].find('div').text
    
    return output
    
def get_each_news_in_category(category):
    """ Returns list of dictionaries for each article in the category with article information """
    
    list_of_news_cards = get_category_news_cards(category)
    print("Total News Articles in Category: ",len(list_of_news_cards))
    return [get_news_details(news_card, category) for news_card in list_of_news_cards]
    
def get_news_categories(soup):
    """ Returns list of news categories from the inshorts homepage """
    
    categories = soup.select('ul.category-list')[0].select('li.active-category')[1:]
    
    return [c.text.lower() for c in categories]

def get_news_articles(desired_categories = 'all', update_cache = False):
    """ Returns news article information from https://inshorts.com/ .
    Returns a dataframe when read from the cache, otherwise a list of dictionaries.
    desired_categories: 'all' by default or a list of categories desired
    update_cache: if True gets fresh news"""
    
    # Filepath for cache
    news_cache_file = 'news.csv'
    
    # Use `not` here: `~` is bitwise, and both ~True (-2) and ~False (-1) are truthy
    if not update_cache:
        if os.path.exists(news_cache_file):
            return pd.read_csv(news_cache_file)
        else:
            print("News cache does not exist, acquiring fresh news...")
    
    
    url = 'https://inshorts.com/en/read'
    headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
    response = get(url, headers=headers)


    # Make a soup variable holding the response content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    categories = get_news_categories(soup)
    
    # Initialize news list
    news = []
    
    # Check if we want articles from all categories or just specific ones
    if desired_categories == 'all':
        
        # Iterate through each category, scraping each article, save details to news list
        for cat in categories:
            
            print("Scraping category: ", cat)
            news+=get_each_news_in_category(cat)
    else:
        # For the case when we only want to scrape articles in particular categories
        for cat in desired_categories:
            # Checks if the desired category exists. If it doesn't moves on to the next category desired
            if cat.lower() not in categories:
                print(cat,"does not exist at site, skipping this category")
                continue
            print("Scraping category: ", cat)
            news+=get_each_news_in_category(cat)
    
    # Write results to cache
    pd.DataFrame(news).to_csv(news_cache_file, index = None)
       
    return news
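
One Python detail matters for the `update_cache` check: the bitwise `~` operator does not logically negate a bool. It inverts the underlying integer, so the result is truthy for both `True` and `False`, and `not` is the operator that gives logical negation. A minimal sketch, with `inverted` as a hypothetical helper for the comparison:

```python
import warnings

def inverted(flag):
    """Return (~flag, not flag) so the two operators can be compared."""
    with warnings.catch_warnings():
        # Python 3.12+ warns that ~ on a bool is deprecated; silence it for the demo.
        warnings.simplefilter('ignore', DeprecationWarning)
        return ~flag, not flag

print(inverted(True))   # (-2, False)
print(inverted(False))  # (-1, True)
```

Because `-2` and `-1` are both non-zero, an `if ~flag:` branch runs regardless of the flag's value.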

In [10]:
news = get_news_articles(desired_categories=['sports','Business','Technology','Entertainment'], update_cache=False)

In [11]:
news

Unnamed: 0,headline,author,datetime,category,content
0,Rupee hits all-time low of 77.42 against US do...,Apaar Sharma,2022-05-09T05:05:31.000Z,business,The Indian rupee fell to an all-time low of 77...
1,Bitcoin falls to the lowest level since Januar...,Pragya Swastik,2022-05-09T09:20:34.000Z,business,"Bitcoin fell on Monday to as low as $33,266 in..."
2,Made best possible decision: IndiGo on barring...,Pragya Swastik,2022-05-09T09:50:34.000Z,business,IndiGo's CEO Ronojoy Dutta said the airline ma...
3,India's biggest IPO of LIC subscribed nearly 3...,Pragya Swastik,2022-05-09T14:10:38.000Z,business,"LIC's IPO, India's biggest IPO which opened on..."
4,I will do my best to stay alive: Musk to his m...,Ridham Gambhir,2022-05-09T04:21:36.000Z,business,Soon after Tesla CEO Elon Musk shared a tweet ...
...,...,...,...,...,...
94,"Salman said I wasn't cut for B'wood, told me t...",Kriti Kambiri,2022-05-09T11:19:42.000Z,entertainment,Actor Sidharth Malhotra has revealed that when...
95,Actress Mohena Kumari Singh shares first glimp...,Udit Gupta,2022-05-09T11:28:33.000Z,entertainment,Actress-choreographer Mohena Kumari Singh took...
96,I'm fortunate I got to work with Akshay sir: M...,Mahima Kharbanda,2022-05-09T11:38:52.000Z,entertainment,When asked how it was working with Akshay Kuma...
97,Actor Faisal Shaikh to participate in 'Khatron...,Udit Gupta,2022-05-09T09:09:28.000Z,entertainment,Actor Faisal Shaikh aka Mr Faisu is all set to...
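
Note that the cached call above returns a DataFrame read from `news.csv`, while a fresh scrape returns a list of dictionaries. A minimal sketch of a cache wrapper that returns the same type on both paths, assuming a JSON cache file instead of CSV; `load_or_fetch`, `fake_fetch`, and the file name are hypothetical stand-ins, not the notebook's functions:

```python
import json
import os
import tempfile

def load_or_fetch(cache_file, fetch, update_cache=False):
    """Return a list of dicts from the cache, or call fetch() and cache its result."""
    if not update_cache and os.path.exists(cache_file):
        with open(cache_file, encoding='utf-8') as f:
            return json.load(f)
    records = fetch()
    with open(cache_file, 'w', encoding='utf-8') as f:
        json.dump(records, f)
    return records

# Stand-in for a real scraper, so the example runs offline.
def fake_fetch():
    return [{'headline': 'example headline', 'category': 'business'}]

cache_path = os.path.join(tempfile.mkdtemp(), 'news.json')
first = load_or_fetch(cache_path, fake_fetch)    # cache miss: calls fake_fetch
second = load_or_fetch(cache_path, fake_fetch)   # cache hit: reads the JSON file
print(first == second, type(second).__name__)    # True list
```

Callers that want a DataFrame can wrap the result in `pd.DataFrame(...)` themselves, which keeps the acquisition function's return type predictable.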
