### ``Exercises: NLP Acquire/Web Scraping``

    30AUGUST2022

----

In [1]:
# notebook dependencies 
import os # for caching purposes
import pandas as pd
import numpy as np

# visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# regular expression import
import re

# JSON import
import json

# importing BeautifulSoup for parsing HTML/XML
from bs4 import BeautifulSoup

# request module for connecting to APIs
from requests import get

#### ``Exercise Number 1: Web Scraping -- Codeup Blog Articles``

**<u>``Prompt:``</u>**

* Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. 

* For each post, you should scrape at least the post's title and content.

* Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries:

    - With each dictionary representing one article. The shape of each dictionary should look like this:

>{
    'title': 'the title of the article',\
    'content': 'the full text content of the article'
}


- Plus any additional properties you think might be helpful

In [2]:
# let's connect to the Codeup url/domain

domain = 'https://codeup.com'
endpoint = '/blog/'

# creating the url
url = domain + endpoint

# creating the response element/object (including headers)
# note: some websites don't accept the python-requests default user-agent
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers = headers)

print(f'url: {url}')

url: https://codeup.com/blog/


In [3]:
# checking the response object/type

type(response)

requests.models.Response
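
Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible wrapper, not part of the original exercise: `fetch_html` is a hypothetical helper name, and the 10-second `timeout` is an arbitrary assumption. It relies only on `requests.get`, `Response.raise_for_status()`, and `Response.text`.

```python
import requests

def fetch_html(url, user_agent="Codeup Data Science", timeout=10):
    """Hypothetical helper: GET a page and fail loudly on HTTP errors."""
    # timeout (in seconds) is an assumed value; without one a request can hang
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)

    # raises requests.HTTPError for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()

    return response.text
```

Calling `fetch_html('https://codeup.com/blog/')` would return the page's HTML as a string, which can be passed straight to `BeautifulSoup(..., 'html.parser')`.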

In [4]:
# let's use the BeautifulSoup module to create an HTML object

soup = BeautifulSoup(response.content, 'html.parser')
type(soup)

bs4.BeautifulSoup

**``Beautiful Soup Methods and Properties``**

* ``soup.title.string`` gets the page's title (the same text shown in the browser tab for a page); this is the title element

* ``soup.prettify()`` is useful to print in case you want to see the HTML

* ``soup.find_all("a")`` find all the anchor tags, or whatever argument is specified.

* ``soup.find("h1")`` finds the first matching element

* ``soup.get_text()`` gets the text from within a matching piece of soup/HTML

* The ``soup.select()`` method takes a CSS selector as a string and returns all matching elements, which makes it especially useful
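
The methods listed above can be tried on a small, made-up HTML string so that nothing depends on a live website. The page content and the `demo_soup` name are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up page, used only so the example runs without a network call.
html = """
<html>
  <head><title>Demo Blog</title></head>
  <body>
    <h1 class="entry-title">First Post</h1>
    <h2><a href="https://example.com/a">Post A</a></h2>
    <h2><a href="https://example.com/b">Post B</a></h2>
    <p>No link here.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

page_title = demo_soup.title.string          # text of the <title> element
first_h1 = demo_soup.find("h1").get_text()   # first matching element only
all_anchors = demo_soup.find_all("a")        # list of every <a> tag

# CSS selector: anchors with an href attribute nested inside an <h2>
links = [a["href"] for a in demo_soup.select("h2 a[href]")]

print(page_title, first_h1, len(all_anchors), links)
```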

In [5]:
# looking at the Codeup blog page, i notice the article titles are '<a href=...>' anchors nested inside 'h2' elements
# i can pass that pattern to the select method as a CSS selector and get back every matching anchor tag

soup.select('h2 a[href]') # checks out!

[<a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/codeup-news/codeup-x-comic-con/">Codeup X Superhero Car Show &amp; Comic Con</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://codeup.com/codeup-news/codeup-dallas-campus/">Codeup’s New Dallas Campus</a>,
 <a href="https://codeup.com/codeup-news/codeup-tv-commercial/">Codeup TV Commercial</a>,
 <a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>]

In [6]:
# what if we just want a single title or link?

url = soup.select('h2 a[href]')[0]['href'] # this is the link, what about the title?
url

'https://codeup.com/data-science/recession-proof-career/'

In [7]:
# what if we want all the anchor tags to iterate through? (each tag's 'href' holds the link)

urls = soup.select('h2 a[href]')[:]

type(urls)
urls

[<a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/codeup-news/codeup-x-comic-con/">Codeup X Superhero Car Show &amp; Comic Con</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://codeup.com/codeup-news/codeup-dallas-campus/">Codeup’s New Dallas Campus</a>,
 <a href="https://codeup.com/codeup-news/codeup-tv-commercial/">Codeup TV Commercial</a>,
 <a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>]

In [8]:
# extracting the published date

published_date = soup.find('span', class_ = "published").text.strip() # checks out!
published_date

'Aug 12, 2022'

In [9]:
# sample title extraction code

# container = []

# # create the blog url
# url = 'https://codeup.com/blog/'

# # include the headers
# headers = {'User-Agent': 'Codeup Data Science'} 

# #create the response object
# response1 = get(url, headers = headers)

# # first soup
# soup1 = BeautifulSoup(response1.content, 'html.parser')

# # hit the blog domain and retrieve article link
# link_url = soup1.select('h2 a[href]')[counter]["href"]

# response2 = get(link_url, headers = headers)

# # new soup object
# soup2 = BeautifulSoup(response2.content, 'html.parser')

# soup2.find('h1', class_ = "entry-title").text # checks out!

In [10]:
# with the links accessible, i can hit the needed articles to extract more data

container = []

# let's extract all articles in the Codeup blog post website
for num in range(len(urls)):

    # extracting the article url from Codeup blog urls
    article_url = urls[num]['href']

    # creating the response object (Article Website)
    response = get(article_url, headers = headers)

    # create the soup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # extract the title
    title = soup.find('h1', class_ = "entry-title").text
    
    # extract the publish date
    published = soup.find('span', class_ = "published").text.strip()
    
    # extract article body
    contents = soup.find('div', class_ = 'entry-content').text.strip()

    # create dictionary that holds article contents
    article_dict = { 
        "article_title": title,
        "publish_date": published,
        "contents": contents
    }

    # append article dictionary to the container list
    container.append(article_dict)

articles = pd.DataFrame(container).sort_values("publish_date").reset_index(drop = True)
articles

# notes to self:
# when creating a function that pulls information at scale, ensure the headers, tags, or required labeling of information is consistent and accurate

Unnamed: 0,article_title,publish_date,contents
0,Codeup X Superhero Car Show & Comic Con,"Aug 10, 2022",Codeup had a blast at the San Antonio Superher...
1,Is a Career in Tech Recession-Proof?,"Aug 12, 2022","Given the current economic climate, many econo..."
2,What Jobs Can You Get After a Coding Bootcamp?...,"Aug 2, 2022",If you’re considering a career in web developm...
3,What Jobs Can You Get After a Coding Bootcamp?...,"Jul 14, 2022",Have you been considering a career in Cloud Ad...
4,Codeup TV Commercial,"Jul 20, 2022",Codeup has officially made its TV debut! Our c...
5,Codeup’s New Dallas Campus,"Jul 25, 2022",Codeup’s Dallas campus has a new location! For...
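
Note that the rows above are ordered as text rather than chronologically ("Aug 2" lands after "Aug 12", and August comes before July), because `publish_date` is still a string when `sort_values` runs. A minimal sketch of one fix, assuming every date follows the `Mon D, YYYY` pattern seen in the output, is to parse the column with `pd.to_datetime` first. The `demo` frame below is rebuilt from the dates shown above and is only for illustration.

```python
import pandas as pd

# Publish dates copied from the output above.
dates = ["Aug 10, 2022", "Aug 12, 2022", "Aug 2, 2022",
         "Jul 14, 2022", "Jul 20, 2022", "Jul 25, 2022"]
demo = pd.DataFrame({"publish_date": dates})

# Sorting the raw strings is alphabetical: "Aug 10" sorts before "Aug 2".
as_text = demo.sort_values("publish_date")["publish_date"].tolist()

# Parsing first gives a true chronological order.
demo["publish_date"] = pd.to_datetime(demo["publish_date"], format="%b %d, %Y")
chronological = demo.sort_values("publish_date").reset_index(drop=True)

print(as_text)
print(chronological)
```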


In [11]:
# extract article/blog contents 

soup.find('div', class_ = 'entry-content').text.strip() # checks out!

'Have you been considering a career in Cloud Administration, but have no idea what your job title or potential salary could be? Continue reading below to find out!\nIn this mini-series, we will take each of our programs here at Codeup: Data Science, Web Development, and Cloud Administration, and outline respectively potential job titles, as well as entry-level salaries.*\xa0Let’s discuss Cloud Administration.\nProgram Overview\nAt Codeup, we offer a 15-week Cloud Administration program, which was derived from our previous two programs: Systems Engineering and Cyber Cloud. We combined the best of both and blended hands-on practical knowledge with skilled instructors to create the Cloud Administration program.\nUpon completing this program, you’ll have the opportunity to take on two exams for certifications: Amazon Web Services (AWS) Cloud Practitioner and AWS Solutions Architect Associate.\xa0\nPotential Jobs\nAccording to A Cloud Guru, with an AWS Certification you’ll be equipped with 

In [12]:
# creating a function to scrape all Codeup blogs

def scrape_codeup_blogs(url):

    # providing url headers for referencing/access
    headers = {'User-Agent': 'Codeup Data Science'}

    # creating the response object to access the url
    response = get(url, headers = headers)

    # creating the soup object
    soup = BeautifulSoup(response.content, "html.parser")
    
    # selecting/extracting all urls from the blog home page
    urls = soup.select('h2 a[href]')[:]

    # container list to store needed contents/attributes
    container = []

    # let's extract all articles in the Codeup blog post website
    for num in range(len(urls)):

        # extracting the article url from Codeup blog urls
        article_url = urls[num]['href']

        # creating the response object (Article Website)
        response = get(article_url, headers = headers)

        # create the soup object
        soup = BeautifulSoup(response.content, "html.parser")

        # extract the title
        title = soup.find('h1', class_ = "entry-title").text
        
        # extract the publish date
        published = soup.find('span', class_ = "published").text.strip()
        
        # extract article body
        contents = soup.find('div', class_ = 'entry-content').text.strip()

        # create dictionary that holds article contents
        article_dict = { 
            
            "article_title": title,
            "publish_date": published,
            "contents": contents
        }

        # append article dictionary to the container list
        container.append(article_dict)

    # create an articles/blogs dataframe
    df = pd.DataFrame(container).sort_values("publish_date").reset_index(drop = True)
    
    # print the shape
    print(f'dataframe shape: {df.shape}')

    # return articles/blogs in a Pandas Dataframe
    return df
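
One fragile spot in the function above is that `soup.find(...)` returns `None` when a tag is missing, so chaining `.text` onto it raises an `AttributeError` and stops the whole scrape. The sketch below shows one way to guard against that. `text_or_none` and `parse_article` are hypothetical helper names, and the sample HTML is made up; the tag names and classes are the ones used in the notebook.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, class_=class_name)
    return tag.get_text().strip() if tag is not None else None

def parse_article(html):
    """Hypothetical helper: turn one article page's HTML into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "article_title": text_or_none(soup, "h1", "entry-title"),
        "publish_date": text_or_none(soup, "span", "published"),
        "contents": text_or_none(soup, "div", "entry-content"),
    }

# Made-up page that is missing its publish date.
sample = """
<h1 class="entry-title">Hello</h1>
<div class="entry-content"> Body text. </div>
"""
print(parse_article(sample))
```

Keeping the parsing in one function that takes HTML as input would also let the later variants of this scraper share it instead of repeating the same loop body.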

In [13]:
# trying out the function

codeup_blogs = scrape_codeup_blogs('https://codeup.com/blog/')
codeup_blogs

dataframe shape: (6, 3)


Unnamed: 0,article_title,publish_date,contents
0,Codeup X Superhero Car Show & Comic Con,"Aug 10, 2022",Codeup had a blast at the San Antonio Superher...
1,Is a Career in Tech Recession-Proof?,"Aug 12, 2022","Given the current economic climate, many econo..."
2,What Jobs Can You Get After a Coding Bootcamp?...,"Aug 2, 2022",If you’re considering a career in web developm...
3,What Jobs Can You Get After a Coding Bootcamp?...,"Jul 14, 2022",Have you been considering a career in Cloud Ad...
4,Codeup TV Commercial,"Jul 20, 2022",Codeup has officially made its TV debut! Our c...
5,Codeup’s New Dallas Campus,"Jul 25, 2022",Codeup’s Dallas campus has a new location! For...


In [14]:
# creating a function to scrape all Codeup blogs and return them as a list of dictionaries (also saved as JSON)

def get_blogs_dict(url):

    # providing url headers for referencing/access
    headers = {'User-Agent': 'Codeup Data Science'}

    # creating the response object to access the url
    response = get(url, headers = headers)

    # creating the soup object
    soup = BeautifulSoup(response.content, "html.parser")
    
    # selecting/extracting all urls from the blog home page
    urls = soup.select('h2 a[href]')[:]

    # container list to store needed contents/attributes
    container = []

    # let's extract all articles in the Codeup blog post website
    for num in range(len(urls)):

        # extracting the article url from Codeup blog urls
        article_url = urls[num]['href']

        # creating the response object (Article Website)
        response = get(article_url, headers = headers)

        # create the soup object
        soup = BeautifulSoup(response.content, "html.parser")

        # extract the title
        title = soup.find('h1', class_ = "entry-title").text
        
        # extract the publish date
        published = soup.find('span', class_ = "published").text.strip()
        
        # extract article body
        contents = soup.find('div', class_ = 'entry-content').text.strip()

        # create dictionary that holds article contents
        article_dict = { 
            "article_title": title,
            "publish_date": published,
            "contents": contents
        }

        # append article dictionary to the container list
        container.append(article_dict)

    # cache the scraped articles locally as JSON
    with open("codeup_blogs.json", 'w') as f:

        json.dump(container, f)

    # return articles/blogs as a list of dictionaries
    return container
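
Since a list of dictionaries maps directly onto JSON, the file written by the function above can be read back with `json.load`. The round trip below is a small sketch using a temporary directory and a single placeholder record; the file name and the `"placeholder text"` value are assumptions for illustration.

```python
import json
import os
import tempfile

# One placeholder record in the same shape the scraper produces.
articles = [
    {
        "article_title": "Codeup’s New Dallas Campus",
        "publish_date": "Jul 25, 2022",
        "contents": "placeholder text",
    },
]

path = os.path.join(tempfile.mkdtemp(), "codeup_blogs.json")

with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters such as ’ readable in the file
    json.dump(articles, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == articles)
```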

In [15]:
# creating a function to scrape all Codeup blogs

def return_blogs_list(url):
    
    # providing url headers for referencing/access
    headers = {'User-Agent': 'Codeup Data Science'}

    # creating the response object to access the url
    response = get(url, headers = headers)

    # creating the soup object
    soup = BeautifulSoup(response.content, "html.parser")

    # selecting/extracting all urls from the blog home page
    urls = soup.select('h2 a[href]')[:]

    # container list to store needed contents/attributes
    container = []

    # let's extract all articles in the Codeup blog post website
    for num in range(len(urls)):

        # extracting the article url from Codeup blog urls
        article_url = urls[num]['href']

        # creating the response object (Article Website)
        response = get(article_url, headers = headers)

        # create the soup object
        soup = BeautifulSoup(response.content, "html.parser")

        # extract the title
        title = soup.find('h1', class_ = "entry-title").text

        # extract the publish date
        published = soup.find('span', class_ = "published").text.strip()

        # extract article body
        contents = soup.find('div', class_ = 'entry-content').text.strip()

        # create dictionary that holds article contents
        article_dict = { 

        "article_title": title,
        "publish_date": published,
        "contents": contents

        }

        # append article dictionary to the container list
        container.append(article_dict)

    # return the articles as a list of dictionaries
    return container

In [16]:
# testing out the Codeup web scrape function
# if successful, it should return back the same/similar df to the one previously created

codeup_blogs = scrape_codeup_blogs("https://codeup.com/blog/")
codeup_blogs # checks out!

dataframe shape: (6, 3)


Unnamed: 0,article_title,publish_date,contents
0,Codeup X Superhero Car Show & Comic Con,"Aug 10, 2022",Codeup had a blast at the San Antonio Superher...
1,Is a Career in Tech Recession-Proof?,"Aug 12, 2022","Given the current economic climate, many econo..."
2,What Jobs Can You Get After a Coding Bootcamp?...,"Aug 2, 2022",If you’re considering a career in web developm...
3,What Jobs Can You Get After a Coding Bootcamp?...,"Jul 14, 2022",Have you been considering a career in Cloud Ad...
4,Codeup TV Commercial,"Jul 20, 2022",Codeup has officially made its TV debut! Our c...
5,Codeup’s New Dallas Campus,"Jul 25, 2022",Codeup’s Dallas campus has a new location! For...


----
#### ``Exercise Number 2: News Articles``

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment


``The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:``

>{
'title': 'The article title',\
'content': 'The article content',\
'category': 'business' # for example
}


In [17]:
# let's check out the initial site

url = 'https://inshorts.com/en/read/business'

response = get(url)
type(response)

requests.models.Response

In [18]:
# what's in the object

response # successful connection

<Response [200]>

In [19]:
# creating a beautifulsoup object and exploring the site further

soup = BeautifulSoup(response.content, 'html.parser')
type(soup) # object type checks out!

bs4.BeautifulSoup

In [20]:
# what's in the read page of inshorts: looking at one title

soup.find('span', itemprop = 'headline').text



In [21]:
# ok, but can we get all the titles? using the find_all() method

soup.find_all('span', itemprop = 'headline') # checks out

 <span itemprop="headline">Snap CEO confirms 20% job cuts, says 'We must reduce cost to avoid ongoing losses'</span>,
 <span itemprop="headline">Price of commercial LPG cylinders cut by up to ₹100; list of rates in cities released</span>,
 <span itemprop="headline">Chairman of Russia's 2nd largest oil firm dies after fall from hospital window: Reports</span>,
 <span itemprop="headline">SpiceJet shares fall nearly 15% after CFO resigns amid widening losses</span>,
 <span itemprop="headline">August GST revenue collection jumps 28% YoY at ₹1.44 lakh crore</span>,
 <span itemprop="headline">SpiceJet makes payments in 'graded format', delays salaries for 2nd straight month</span>,
 <span itemprop="headline">Value of UPI transactions touches all-time high of ₹10.73 lakh cr in Aug</span>,
 <span itemprop="headline">Taiwan looks forward to producing 'democracy chips' with US: Prez</span>,
 <span itemprop="headline">Dennis Woodside appointed as President of Freshworks</span>,
 <span itemprop="h

In [22]:
# notes to self: use the find_all() method and iterate through the needed attributes/tags
# ensure that the total number of titles matches the total number of authors, publish dates, contents, etc.
# 23 articles on the Business page at the time of scraping

len(soup.find_all('span', itemprop = 'headline'))

23

In [23]:
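# extracting the author 
# reminder that class is a reserved python word, so must use 'class_' to specify the html class attribute
# note: 46 author spans for 23 headlines is exactly two per headline, so each author likely appears twice per article
# (rather than articles having multiple authors); authors[num] may therefore not line up with titles[num]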

len(soup.find_all('span', class_ = 'author'))

46

In [24]:
# what about content blurbs/paragraphs
# checks out! 23 article bodies, matching the 23 titles

len(soup.find_all('div', itemprop = 'articleBody'))

23

In [25]:
# understanding the contents object

contents = soup.find_all('div', itemprop = 'articleBody')
range(len(contents))

range(0, 23)
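
The counts above (23 headlines, 46 author spans, 23 bodies) show why indexing separate `find_all` lists by position is risky: once one list is longer or shorter than the others, values get attached to the wrong article. One alternative is to loop over each article's enclosing element and search inside it. The sketch below uses made-up markup; the `card` class name and the nesting are assumptions for illustration and are not copied from the real inshorts page, so the selector would need to be checked against the live HTML.

```python
from bs4 import BeautifulSoup

# Made-up markup: the "card" class and nesting are assumptions for illustration.
html = """
<div class="card">
  <span itemprop="headline">Headline one</span>
  <span class="author">Author A</span><span class="author">Author A</span>
  <div itemprop="articleBody">Body one</div>
  <a class="source">Source X</a>
</div>
<div class="card">
  <span itemprop="headline">Headline two</span>
  <span class="author">Author B</span><span class="author">Author B</span>
  <div itemprop="articleBody">Body two</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

def first_text(parent, *args, **kwargs):
    """Text of the first match inside `parent`, or None when nothing matches."""
    tag = parent.find(*args, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None

rows = []
for card in soup.find_all("div", class_="card"):
    # every lookup is scoped to one card, so fields cannot drift between articles
    rows.append({
        "title": first_text(card, "span", itemprop="headline"),
        "authors": first_text(card, "span", class_="author"),
        "content": first_text(card, "div", itemprop="articleBody"),
        "source": first_text(card, "a", class_="source"),
    })

print(rows)
```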

In [26]:
# creating the news article function

def get_news_articles(website_url):

    # create the unique response object
    response = get(website_url)

    # create the soup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # creating a list of titles/headlines to iterate through
    titles = soup.find_all('span', itemprop = 'headline')

    dates = soup.find_all('span', class_ = 'date')

    sources = soup.find_all('a', class_ = 'source')

    authors = soup.find_all('span', class_ = 'author')

    contents = soup.find_all('div', itemprop = 'articleBody')

    # creating a container list to hold article contents in
    container = []

    # iterate through the total number of headlines on website
    for num in range(len(titles)):
        
        published = dates[num].text

        title = titles[num].text

        author = authors[num].text

        content = contents[num].text
        
        # handle instances where there is no source at this index
        # (this could probably be written more efficiently and/or applied to all collected attributes)
        if num < len(sources):

            source = sources[num].text

        else: 

            source = None
            
        # creating a dictionary to save the articles contents
        article_dict = { 
            
            'publish_date': published, 
            'source': source, 
            'title': title,
            'authors': author,
            'content': content
        }

        # append to container list
        container.append(article_dict)
    
    # creating a dataframe from all scraped articles
    article_df = pd.DataFrame(container)

    # printing the dataframe shape
    print(f'dataframe shape: {article_df.shape}')

    return article_df

In [27]:
# testing out the function

inshort_busns = get_news_articles('https://inshorts.com/en/read/business')
inshort_busns.head() # where there are 5 unique authors on the inshort business site

dataframe shape: (23, 5)


Unnamed: 0,publish_date,source,title,authors,content
0,01 Sep,Moneycontrol,Don't eff this up: Bezos recalls warning from ...,Ridham Gambhir,Ahead of the debut of The Lord of the Rings' p...
1,01 Sep,The Associated Press,"Snap CEO confirms 20% job cuts, says 'We must ...",Ridham Gambhir,"In a letter to staff posted on Snap’s website,..."
2,01 Sep,Times Now,Price of commercial LPG cylinders cut by up to...,Ridham Gambhir,State-owned fuel retailers on Thursday announc...
3,01 Sep,Reuters,Chairman of Russia's 2nd largest oil firm dies...,Ridham Gambhir,The chairman of Russia's second-largest oil pr...
4,01 Sep,Reuters,SpiceJet shares fall nearly 15% after CFO resi...,Ridham Gambhir,SpiceJet shares declined nearly 15% during Thu...


In [28]:
# testing the function on the 'Sports' section

inshort_sports = get_news_articles('https://inshorts.com/en/read/sports')
inshort_sports["source"].unique()

dataframe shape: (25, 5)


array(['BCCI', 'ICC', 'News18', 'ACC', 'ANI', 'Instagram', 'Twitter',
       'Sportskeeda', 'Hindustan Times', 'Times Now', None], dtype=object)

In [29]:
# testing the function on the 'Technology' section

inshort_tech = get_news_articles('https://inshorts.com/en/read/technology')
inshort_tech.head()

dataframe shape: (24, 5)


Unnamed: 0,publish_date,source,title,authors,content
0,31 Aug,Reuters,2 top executives at Snap quit hours after repo...,Ridham Gambhir,Two senior advertising executives at Snap quit...
1,01 Sep,Twitter,Google workers protest outside US offices agai...,Ridham Gambhir,Former and current Google employees protested ...
2,01 Sep,The Associated Press,"Snap CEO confirms 20% job cuts, says 'We must ...",Aishwarya Awasthi,"In a letter to staff posted on Snap’s website,..."
3,31 Aug,Reuters,Musk seeks to delay Twitter trial to Nov amid ...,Aishwarya Awasthi,Tesla CEO Elon Musk is seeking to delay the tr...
4,31 Aug,Financial Express,Facebook's Gaming app to be shut down in Octob...,Ridham Gambhir,Facebook’s Gaming app for iOS and Android is s...


In [30]:
# testing the function on the 'Entertainment' section

inshort_ent = get_news_articles('https://inshorts.com/en/read/entertainment')
inshort_ent.head()

dataframe shape: (25, 5)


Unnamed: 0,publish_date,source,title,authors,content
0,01 Sep,News18,Jacqueline deleted data on Sukesh from her pho...,Ridham Gambhir,The Enforcement Directorate in its chargesheet...
1,01 Sep,LatestLY,"Rajeev Sen, Charu Asopa call off their divorce...",Ridham Gambhir,"Rajeev Sen and Charu Asopa, who had earlier co..."
2,01 Sep,Twitter,Punjabi singer Nirvair Singh dies in road acci...,Daisy Mowke,Punjabi singer Nirvair Singh was killed in a c...
3,01 Sep,Moneycontrol,Aamir Khan Productions shares & then deletes a...,Daisy Mowke,Aamir Khan Productions on Thursday posted and ...
4,01 Sep,Bollywood Hungama,Don't eff this up: Bezos recalls warning from ...,Daisy Mowke,Ahead of the debut of The Lord of the Rings' p...


----

#### ``Exercise Number 3: Caching the Data``

**<u>Notes:</u>**

* Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. 
* Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (re-writing your local cache)
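
The two requirements above can be sketched as a small wrapper around any scraping function. This is only one possible shape, not the exercise's reference solution: `load_or_fetch`, `fake_scrape`, and the `articles.json` file name are hypothetical, and the stand-in scraper returns placeholder data so the example runs without network access.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch, fresh=False):
    """Return cached records from `path`, calling `fetch()` only when needed."""
    # prefer the local copy unless the caller explicitly asks for fresh data
    if not fresh and os.path.isfile(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    # otherwise acquire the data and (re)write the local cache
    records = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records

# Stand-in for a real scraper; it records how often it is called.
calls = []
def fake_scrape():
    calls.append(1)
    return [{"title": "placeholder", "content": "placeholder"}]

cache_path = os.path.join(tempfile.mkdtemp(), "articles.json")
first = load_or_fetch(cache_path, fake_scrape)              # scrapes and writes the cache
second = load_or_fetch(cache_path, fake_scrape)             # reads the cache
third = load_or_fetch(cache_path, fake_scrape, fresh=True)  # scrapes again

print(len(calls))
```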

In [31]:
# let's first cache the Codeup and Inshorts article dataframes

codeup_blogs.to_csv("/Users/mijailmariano/codeup-data-science/natural-language-processing-exercises/codeup_blogs.csv", index = False)

In [32]:
# creating a function to first: check if the Codeup Blogs dataset exists, if not: scrape the web for it

def codeup_blogs_df():

    # creating the operating system filename for referencing
    filename = "codeup_blogs.csv"
    
    # check to see if the file path exists
    if os.path.isfile(filename):
        
        # if found, read the csv as a Pandas Dataframe
        df = pd.read_csv(filename)

        # let's print the shape
        print(f'df shape: {df.shape}')

        # return the blogs dataset
        return df
    
    # if not cached, then retrieve the data from Codeup's blog site
    else:

        # set the Codeup Blogs url
        url = "https://codeup.com/blog/"

        # providing url headers for referencing/web access
        headers = {'User-Agent': 'Codeup Data Science'}

        # creating the response object to access the url
        response = get(url, headers = headers)

        # creating the Codeup Blogs soup object
        soup = BeautifulSoup(response.content, "html.parser")
        
        # selecting/extracting all urls from the blog home page
        urls = soup.select('h2 a[href]')[:]

        # container list to store needed contents/attributes
        container = []

        # let's extract all articles in the Codeup blog post website
        for num in range(len(urls)):

            # extracting the article url from Codeup blog urls
            article_url = urls[num]['href']

            # creating the response object (Article Website)
            response = get(article_url, headers = headers)

            # create the soup object
            soup = BeautifulSoup(response.content, "html.parser")

            # extract the title
            title = soup.find('h1', class_ = "entry-title").text
            
            # extract the publish date
            published = soup.find('span', class_ = "published").text.strip()
            
            # extract article body
            contents = soup.find('div', class_ = 'entry-content').text.strip()

            # create dictionary that holds article contents
            article_dict = { 
                "article_title": title,
                "publish_date": published,
                "contents": contents
            }

            # append article dictionary to the container list
            container.append(article_dict)

        # create an articles/blogs dataframe
        df = pd.DataFrame(container).sort_values("publish_date").reset_index(drop = True)
        
        # creating a .csv file in local directory for future referencing
        df.to_csv("codeup_blogs.csv", index = False)

        # print the shape
        print(f'dataframe shape: {df.shape}')

        # return articles/blogs in a Pandas Dataframe
        return df

In [33]:
# let's test the get codeup blogs function
# can add to acquire file

df = codeup_blogs_df()
df.head() # checks out!

df shape: (22, 3)


Unnamed: 0,article_title,publish_date,contents
0,Codeup X Superhero Car Show & Comic Con,"Aug 10, 2022",Codeup had a blast at the San Antonio Superher...
1,Is a Career in Tech Recession-Proof?,"Aug 12, 2022","Given the current economic climate, many econo..."
2,Is a Career in Tech Recession-Proof?,"Aug 12, 2022","Given the current economic climate, many econo..."
3,What Jobs Can You Get After a Coding Bootcamp?...,"Aug 2, 2022",If you’re considering a career in web developm...
4,What Jobs Can You Get After a Coding Bootcamp?...,"Aug 2, 2022",If you’re considering a career in web developm...


In [46]:
# creating a codeup blogs json file

# df.to_json("codeup_blogs.json") checks out!

In [34]:
# let's try extracting the genre name from the url with regex

re.findall(r'\w+\/?$', 'https://inshorts.com/en/read/entertainment')[0] # checks out!

'entertainment'

In [35]:
# does the regex also work when the url is passed as a variable?

url = 'https://inshorts.com/en/read/entertainment'

re.findall(r'\w+\/?$', url)[0] # checks out!

'entertainment'
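
One caveat with this pattern: because of the optional `\/?`, a URL that ends in a slash would yield `'entertainment/'` rather than `'entertainment'`. A possible alternative, sketched below, splits the URL path with the standard library instead. `genre_from_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

def genre_from_url(url):
    """Hypothetical helper: last non-empty path segment of a URL."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

plain = "https://inshorts.com/en/read/entertainment"
slashed = plain + "/"

regex_plain = re.findall(r"\w+\/?$", plain)[0]
regex_slashed = re.findall(r"\w+\/?$", slashed)[0]  # keeps the trailing slash

print(regex_plain, regex_slashed, genre_from_url(plain), genre_from_url(slashed))
```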

In [37]:
# creating the news article function

def get_news_articles(website_url):

    # create the unique response object
    response = get(website_url)

    # creating a topic/genre object
    genre = re.findall(r'\w+\/?$', website_url)[0]

    # create the soup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # creating a list of titles/headlines to iterate through
    titles = soup.find_all('span', itemprop = 'headline')

    dates = soup.find_all('span', class_ = 'date')

    sources = soup.find_all('a', class_ = 'source')

    authors = soup.find_all('span', class_ = 'author')

    contents = soup.find_all('div', itemprop = 'articleBody')

    # creating a container list to hold article contents in
    container = []

    # iterate through the total number of headlines on website
    for num in range(len(titles)):
        
        published = dates[num].text

        title = titles[num].text

        author = authors[num].text

        content = contents[num].text
        
        # handle instances where there is no source at this index
        # (this could probably be written more efficiently and/or applied to all collected attributes)
        if num < len(sources):

            source = sources[num].text

        else: 

            source = None
            
        # creating a dictionary to save the articles contents
        article_dict = { 
            
            'genre': genre,
            'publish_date': published, 
            'source': source, 
            'title': title,
            'authors': author,
            'content': content
        }

        # append to container list
        container.append(article_dict)
    
    # creating a dataframe from all scraped articles
    article_df = pd.DataFrame(container)

    # printing the dataframe shape
    print(f'dataframe shape: {article_df.shape}')

    return article_df

In [38]:
# let's test this function

inshort_ent = get_news_articles('https://inshorts.com/en/read/entertainment')
inshort_ent.head() # checks out!

dataframe shape: (25, 6)


Unnamed: 0,genre,publish_date,source,title,authors,content
0,entertainment,01 Sep,News18,Jacqueline deleted data on Sukesh from her pho...,Ridham Gambhir,The Enforcement Directorate in its chargesheet...
1,entertainment,01 Sep,LatestLY,"Rajeev Sen, Charu Asopa call off their divorce...",Ridham Gambhir,"Rajeev Sen and Charu Asopa, who had earlier co..."
2,entertainment,01 Sep,Twitter,Punjabi singer Nirvair Singh dies in road acci...,Daisy Mowke,Punjabi singer Nirvair Singh was killed in a c...
3,entertainment,01 Sep,Moneycontrol,Aamir Khan Productions shares & then deletes a...,Daisy Mowke,Aamir Khan Productions on Thursday posted and ...
4,entertainment,01 Sep,Bollywood Hungama,Don't eff this up: Bezos recalls warning from ...,Daisy Mowke,Ahead of the debut of The Lord of the Rings' p...


In [39]:
# trying another website

inshort_sports = get_news_articles('https://inshorts.com/en/read/sports')
inshort_sports.head() # checks out!

dataframe shape: (25, 6)


Unnamed: 0,genre,publish_date,source,title,authors,content
0,sports,31 Aug,BCCI,India beat Hong Kong to reach Asia Cup Super 4...,Anmol Sharma,India beat Hong Kong by 40 runs to qualify for...
1,sports,31 Aug,ICC,Suryakumar Yadav smashes most sixes ever by an...,Anmol Sharma,Suryakumar Yadav on Wednesday broke the record...
2,sports,01 Sep,News18,Hong Kong cricketer Kinchit proposes to his gi...,Anmol Sharma,Hong Kong batter Kinchit Shah proposed to his ...
3,sports,01 Sep,ACC,How do the Asia Cup 2022 points tables read af...,Anmol Sharma,With their 40-run victory against Hong Kong in...
4,sports,01 Sep,ANI,I was wondering why he wasn't leaving the fiel...,Anmol Sharma,On being asked about Virat Kohli bowing down t...


In [40]:
# business

inshort_bus = get_news_articles('https://inshorts.com/en/read/business')
inshort_bus.head() # checks out!

dataframe shape: (23, 6)


Unnamed: 0,genre,publish_date,source,title,authors,content
0,business,01 Sep,Moneycontrol,Don't eff this up: Bezos recalls warning from ...,Ridham Gambhir,Ahead of the debut of The Lord of the Rings' p...
1,business,01 Sep,The Associated Press,"Snap CEO confirms 20% job cuts, says 'We must ...",Ridham Gambhir,"In a letter to staff posted on Snap’s website,..."
2,business,01 Sep,Times Now,Price of commercial LPG cylinders cut by up to...,Ridham Gambhir,State-owned fuel retailers on Thursday announc...
3,business,01 Sep,Reuters,Chairman of Russia's 2nd largest oil firm dies...,Ridham Gambhir,The chairman of Russia's second-largest oil pr...
4,business,01 Sep,Reuters,SpiceJet shares fall nearly 15% after CFO resi...,Ridham Gambhir,SpiceJet shares declined nearly 15% during Thu...


In [41]:
# technology

inshort_tech = get_news_articles('https://inshorts.com/en/read/technology')
inshort_tech.head() # checks out!

dataframe shape: (24, 6)


Unnamed: 0,genre,publish_date,source,title,authors,content
0,technology,31 Aug,Reuters,2 top executives at Snap quit hours after repo...,Ridham Gambhir,Two senior advertising executives at Snap quit...
1,technology,01 Sep,Twitter,Google workers protest outside US offices agai...,Ridham Gambhir,Former and current Google employees protested ...
2,technology,01 Sep,The Associated Press,"Snap CEO confirms 20% job cuts, says 'We must ...",Aishwarya Awasthi,"In a letter to staff posted on Snap’s website,..."
3,technology,31 Aug,Reuters,Musk seeks to delay Twitter trial to Nov amid ...,Aishwarya Awasthi,Tesla CEO Elon Musk is seeking to delay the tr...
4,technology,31 Aug,Financial Express,Facebook's Gaming app to be shut down in Octob...,Ridham Gambhir,Facebook’s Gaming app for iOS and Android is s...


In [42]:
# let's now work on the inshort function
# where i want to capture all required genre datasets

frames = [inshort_bus, inshort_tech, inshort_ent, inshort_sports]

inshort_articles = pd.concat(frames, axis = 0).reset_index(drop = True)
inshort_articles

Unnamed: 0,genre,publish_date,source,title,authors,content
0,business,01 Sep,Moneycontrol,Don't eff this up: Bezos recalls warning from ...,Ridham Gambhir,Ahead of the debut of The Lord of the Rings' p...
1,business,01 Sep,The Associated Press,"Snap CEO confirms 20% job cuts, says 'We must ...",Ridham Gambhir,"In a letter to staff posted on Snap’s website,..."
2,business,01 Sep,Times Now,Price of commercial LPG cylinders cut by up to...,Ridham Gambhir,State-owned fuel retailers on Thursday announc...
3,business,01 Sep,Reuters,Chairman of Russia's 2nd largest oil firm dies...,Ridham Gambhir,The chairman of Russia's second-largest oil pr...
4,business,01 Sep,Reuters,SpiceJet shares fall nearly 15% after CFO resi...,Ridham Gambhir,SpiceJet shares declined nearly 15% during Thu...
...,...,...,...,...,...,...
92,sports,01 Sep,Sportskeeda,KL Rahul probably has 'more ability' than Rohi...,Ankur Taliyan,Discussing Team India opener KL Rahul's 36 off...
93,sports,01 Sep,Hindustan Times,"I want Kohli at his best, but just make sure h...",Ankur Taliyan,Ex-Australia captain Ricky Ponting has said he...
94,sports,01 Sep,Sportskeeda,Short of words to describe Suryakumar Yadav's ...,Ankur Taliyan,"India captain Rohit Sharma said ""words will be..."
95,sports,01 Sep,Sportskeeda,Hong Kong not the right opposition to judge Vi...,Ankur Taliyan,Ex-India opener Gautam Gambhir has said Hong K...


In [43]:
# caching the combined inshort_articles dataframe as a json file

inshort_articles.to_json("inshort_articles.json") # checks out!

----

``JSON Cache Functions:``

In [None]:
# creating a function that first checks whether the Codeup Blogs dataset is cached locally; if not, it scrapes the blog for it

def codeup_blogs_json():

    # creating the operating system filename for referencing
    filename = "codeup_blogs.json"
    
    # check to see if the file path exists
    if os.path.isfile(filename):
        
        # open the cached json file; the context manager closes it afterwards
        with open(filename, "r") as file:

            # load the json data
            json_obj = json.load(file)

        # return articles/blogs
        return json_obj
    
    # if not cached, then retrieve the data from Codeup's blog site
    else:

        # set the Codeup Blogs url
        url = "https://codeup.com/blog/"

        # providing url headers for referencing/web access
        headers = {'User-Agent': 'Codeup Data Science'}

        # creating the response object to access to the url
        response = get(url, headers = headers)

        # creating the Codeup Blogs soup object
        soup = BeautifulSoup(response.content, "html.parser")
        
        # selecting/extracting all article links from the blog home page
        urls = soup.select('h2 a[href]')

        # container list to store needed contents/attributes
        container = []

        # let's extract all articles in the Codeup blog post website
        for num in range(len(urls)):

            # extracting the article url from Codeup blog urls
            article_url = urls[num]['href']

            # creating the response object (Article Website)
            response = get(article_url, headers = headers)

            # create the soup object
            soup = BeautifulSoup(response.content, "html.parser")

            # extract the title
            title = soup.find('h1', class_ = "entry-title").text
            
            # extract the publish date
            published = soup.find('span', class_ = "published").text.strip()
            
            # extract article body
            contents = soup.find('div', class_ = 'entry-content').text.strip()

            # create dictionary that holds article contents
            article_dict = { 
                
                "article_title": title,
                "publish_date": published,
                "contents": contents
            }

            # append article dictionary to the container list
            container.append(article_dict)

        # create an articles/blogs dataframe
        df = pd.DataFrame(container).sort_values("publish_date").reset_index(drop = True)
        
        # creating a .json file in local directory for future referencing
        df.to_json(filename)

        # read the cached file back in so both branches return the same structure
        with open(filename, "r") as file:

            # load the json data
            json_obj = json.load(file)

        # return articles/blogs
        return json_obj
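
The check-the-cache-then-scrape pattern used above can be factored into a small helper. The following is only a sketch using the standard library: `load_or_build_json` and its `build` argument are hypothetical names that do not appear elsewhere in this notebook, and `build` is assumed to be a zero-argument callable returning JSON-serializable data.

```python
import json
import os


def load_or_build_json(filename, build):
    """Return cached JSON from `filename`, or call `build()` and cache its result.

    `build` is a hypothetical zero-argument callable (for example, the
    scraping half of a function like codeup_blogs_json) that returns
    JSON-serializable data.
    """
    # cache hit: read the file and let the context manager close it
    if os.path.isfile(filename):
        with open(filename, "r") as file:
            return json.load(file)

    # cache miss: build the data, write it to disk, then return it
    data = build()
    with open(filename, "w") as file:
        json.dump(data, file)
    return data
```

With a helper like this, the scraping logic only has to be written once, and the second call with the same filename reads from disk instead of hitting the website again.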

In [55]:
# testing the function

codeup_blogs = codeup_blogs_json()
codeup_blogs # checks out!

{'article_title': {'0': 'Codeup X Superhero Car Show & Comic Con',
  '1': 'Is a Career in Tech Recession-Proof?',
  '2': 'Is a Career in Tech Recession-Proof?',
  '3': 'What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development',
  '4': 'What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development',
  '5': 'What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration',
  '6': 'What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration',
  '7': 'Codeup TV Commercial',
  '8': 'Codeup’s New Dallas Campus',
  '9': 'In-Person Workshop: Learn to Code – JavaScript on 7/26',
  '10': 'What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science',
  '11': 'What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science',
  '12': 'Inclusion at Codeup During Pride Month (and Always)',
  '13': 'Free JavaScript Workshop at Codeup Dallas on 6/28',
  '14': 'In-Person Workshop: Learn to Code – Python on 7/19',
  '15': 'PRIDE in Tech Panel
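
The dictionary above is column-oriented (column name, then row label, then value), which is the default layout written by `DataFrame.to_json`. The exercise prompt asks for a list of dictionaries, one per article, so a conversion step may be useful. Below is a minimal sketch; `columns_to_records` is a hypothetical helper, the sample values are made up, and it assumes every row label is an integer-like string such as `'0'`.

```python
def columns_to_records(column_oriented):
    """Convert {column: {row_label: value}} into a list of per-row dictionaries."""
    columns = list(column_oriented)
    if not columns:
        return []
    # row labels are strings such as '0', '1', ...; sort them numerically
    # (assumption: every label can be parsed as an integer)
    row_labels = sorted(column_oriented[columns[0]], key=int)
    return [
        {column: column_oriented[column][label] for column in columns}
        for label in row_labels
    ]


# tiny stand-in with the same shape as the output above (values are made up)
sample = {
    "article_title": {"0": "first post", "1": "second post"},
    "contents": {"0": "text of the first post", "1": "text of the second post"},
}
records = columns_to_records(sample)
```

An alternative worth trying is to pass `orient="records"` to `to_json` when writing the cache, so the file already holds one dictionary per article.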

In [58]:
# # let's create an inshorts article cache function

# def inshorts_articles():

#     # creating the operating system filename for referencing
#     filename = "inshorts_articles.json"
    
#     # check to see if the file path exists
#     if os.path.isfile(filename):
        
#         # read-in the json file
#         file = open(filename, "r")

#         # read the json file
#         json_data = file.read()

#         # load the json data
#         json_obj = json.loads(json_data)

#         # return articles/blogs
#         return json_obj
    
#     urls = [ 
#             'https://inshorts.com/en/read/technology',
#             'https://inshorts.com/en/read/sports',
#             'https://inshorts.com/en/read/business',
#             'https://inshorts.com/en/read/entertainment'
#             ]

#     # container to hold all article dataframes
#     df_container = []

#     # if not cached, then retrieve the data from the inshorts site
#     for url in urls:

#         website_url = url

#         # create the unique response object
#         response = get(website_url)

#         # creating a topic/genre object
#         genre = re.findall(r'\w+\/?$', website_url)[0]

#         # create the soup object
#         soup = BeautifulSoup(response.content, 'html.parser')

#         # creating a list of titles/headlines to iterate through
#         titles = soup.find_all('span', itemprop = 'headline')

#         dates = soup.find_all('span', class_ = 'date')

#         sources = soup.find_all('a', class_ = 'source')

#         authors = soup.find_all('span', class_ = 'author')

#         contents = soup.find_all('div', itemprop = 'articleBody')

#         # creating a container list to hold article contents in
#         article_container = []

#         # iterate through the total number of headlines on website
#         for num in range(len(titles)):
        
#             published = dates[num].text

#             title = titles[num].text

#             author = authors[num].text

#             content = contents[num].text
            
#             '''IF Statement to handle instances where there is not a source.
#             This code can probably be written more efficiently and/or across all collected attributes.'''
            
#             if num in range(len(sources)):

#                     source = sources[num].text

#                     # creating a dictionary to save the articles contents
#                     article_dict = { 
                        
#                         'genre': genre,
#                         'publish_date': published, 
#                         'source': source, 
#                         'title': title,
#                         'authors': author,
#                         'content': content
#                     }

#                     # append to container list
#                     article_container.append(article_dict)

#             else: 

#                 source = None
                
#                 # creating a dictionary to save the articles contents
#                 article_dict = { 
                    
#                     'genre': genre,
#                     'publish_date': published, 
#                     'source': source, 
#                     'title': title,
#                     'authors': author,
#                     'content': content
#                 }

#                 # append to container list
#                 article_container.append(article_dict)

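#         # creating a dataframe from all scraped articles for this genre
#         article_df = pd.DataFrame(article_container)

#         # append the genre dataframe to the container
#         df_container.append(article_df)

#     # combine all genre dataframes into one and return it
#     df = pd.concat(df_container, axis = 0).reset_index(drop = True)
#     return df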
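
The draft above notes that the missing-source check could be applied across all collected attributes. One possible way to do that is sketched here. `text_or_none`, `build_article`, and `tag` are hypothetical names, and `SimpleNamespace` objects stand in for BeautifulSoup tags so the example runs without a network request. Note that this only guards against short lists, as the original `if` statement does; if a source is missing in the middle of the page, later sources would still shift up by one position.

```python
from types import SimpleNamespace


def text_or_none(elements, index):
    """Return elements[index].text, or None when the index is out of range."""
    if 0 <= index < len(elements):
        return elements[index].text
    return None


def build_article(index, genre, dates, sources, titles, authors, contents):
    """Assemble one article dictionary, tolerating short attribute lists."""
    return {
        'genre': genre,
        'publish_date': text_or_none(dates, index),
        'source': text_or_none(sources, index),
        'title': text_or_none(titles, index),
        'authors': text_or_none(authors, index),
        'content': text_or_none(contents, index),
    }


def tag(text):
    """Stand-in for a BeautifulSoup tag: any object with a .text attribute."""
    return SimpleNamespace(text=text)


article = build_article(
    1, 'sports',
    dates=[tag('01 Sep'), tag('01 Sep')],
    sources=[tag('ANI')],  # the second article has no source
    titles=[tag('title 0'), tag('title 1')],
    authors=[tag('author 0'), tag('author 1')],
    contents=[tag('body 0'), tag('body 1')],
)
```

Because every attribute goes through the same helper, the two nearly identical dictionary blocks in the draft collapse into one.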
  

In [83]:
# get inshorts csv articles as Pandas df

def get_articles_df():

    # filename to search 
    filename = "inshort_articles.csv"

    # check to see if the file path exists
    if os.path.isfile(filename):
        
        # if found, read the csv as a Pandas Dataframe
        df = pd.read_csv(filename)

        # let's print the shape
        print(f'df shape: {df.shape}')

        # return the articles dataset
        return df


In [84]:
# testing the function

df = get_articles_df()
df # works for now!

df shape: (100, 6)


Unnamed: 0,genre,publish_date,source,title,authors,content
0,business,31 Aug,Twitter,India's GDP grows at 13.5% in first quarter of...,Anmol Sharma,India's GDP grew at 13.5% in the first quarter...
1,business,31 Aug,Reuters,"Snap to lay off 20% of staff, cancel several p...",Anmol Sharma,Snap said on Wednesday it will lay off 20% of ...
2,business,31 Aug,Reuters,2 top executives at Snap quit hours after repo...,Ananya Goyal,Two senior advertising executives at Snap quit...
3,business,31 Aug,Reuters,Musk seeks to delay Twitter trial to Nov amid ...,Ananya Goyal,Tesla CEO Elon Musk is seeking to delay the tr...
4,business,31 Aug,News18,Viral video shows Amazon parcels thrown out of...,Ridham Gambhir,A video from Guwahati railway station has gone...
...,...,...,...,...,...,...
95,sports,31 Aug,Sportskeeda,India faced a 'lot of difficulty' when they la...,Anmol Sharma,Ex-Team India opener Wasim Jaffer said Team In...
96,sports,31 Aug,Times Now,Pant can get into this side if Rahul doesn't f...,Anmol Sharma,Former India cricketer Saba Karim has said wit...
97,sports,31 Aug,Hindustan Times,I can't believe his place is under threat: Sty...,Anmol Sharma,During a discussion ahead of India's match aga...
98,sports,31 Aug,CricTracker,"Team India can tackle our bowlers, Pak can't: ...",Anmol Sharma,Former Afghanistan captain Asghar Afghan said ...


In [None]:
#Define a function to scrape articles from one topic
# borrowed Codeup Function

def scrape_one_page(topic):
    
    base_url = 'https://inshorts.com/en/read/'
    
    response = get(base_url + topic)
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    titles = soup.find_all('span', itemprop='headline')
    
    summaries = soup.find_all('div', itemprop='articleBody')
    
    summary_list = []
    
    for i in range(len(titles)):
        
        temp_dict = {}
        
        temp_dict['title'] = titles[i].text
        
        temp_dict['content'] = summaries[i].text
        
        temp_dict['category'] = topic
        
        summary_list.append(temp_dict)
        
    return summary_list 

In [None]:
#Define a function that will scrape information about an array of topics
# borrowed Codeup function

def get_news_articles():
    
    file = 'inshorts_articles.json'
    
    if os.path.exists(file):
        
        with open(file) as f:
            
            return json.load(f)
    
    topic_list = ['business', 'sports', 'technology', 'entertainment']
    
    final_list = []
    
    for topic in topic_list:
        
        final_list.extend(scrape_one_page(topic))
        
    with open(file, 'w') as f:
        
        json.dump(final_list, f)
        
    return final_list 
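
Some stories show up under more than one topic; in the outputs earlier in this notebook, the Snap job-cuts article is listed under both business and technology. If duplicates matter for later NLP work, the list returned by `get_news_articles` could be collapsed by title. This is only a sketch: `merge_duplicate_articles` is a hypothetical helper, the sample records are made up, and it assumes an identical title means an identical article.

```python
def merge_duplicate_articles(articles):
    """Collapse articles sharing a title into one entry listing every category."""
    merged = {}
    for article in articles:
        title = article['title']
        if title not in merged:
            # first sighting: copy the article and start a category list
            merged[title] = {
                'title': title,
                'content': article['content'],
                'categories': [article['category']],
            }
        elif article['category'] not in merged[title]['categories']:
            # repeat sighting under a new topic: record the extra category
            merged[title]['categories'].append(article['category'])
    # dicts preserve insertion order, so first-seen order is kept
    return list(merged.values())


# made-up records in the same shape scrape_one_page produces
sample = [
    {'title': 'story A', 'content': 'body A', 'category': 'business'},
    {'title': 'story B', 'content': 'body B', 'category': 'business'},
    {'title': 'story A', 'content': 'body A', 'category': 'technology'},
]
unique_articles = merge_duplicate_articles(sample)
```

Keeping a list of categories, rather than dropping the repeats outright, preserves the information that a story was filed under several topics.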