### ``Exercises: NLP Acquire/Web Scraping``

    30AUGUST2022

----

In [217]:
# notebook dependencies 
import os # for caching purposes
import pandas as pd
import numpy as np

# visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# regular expression import
import re

# importing BeautifulSoup for parsing HTML/XML
from bs4 import BeautifulSoup

# request module for connecting to APIs
from requests import get

#### ``Exercise Number 1: Web Scraping -- Codeup Blog Articles``

**<u>``Prompt:``</u>**

* Visit Codeup's Blog and record the urls for at least 5 distinct blog posts. 

* For each post, you should scrape at least the post's title and content.

* Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries:

    - With each dictionary representing one article. The shape of each dictionary should look like this:

>{
    'title': 'the title of the article',\
    'content': 'the full text content of the article'
}


- Plus any additional properties you think might be helpful

In [218]:
# let's connect to the Codeup url/domain

domain = 'https://codeup.com'
endpoint = '/blog/'

# creating the url
url = domain + endpoint

# creating the response element/object (including headers)
# note: some websites don't accept the python-requests default user-agent
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers = headers)

print(f'url: {url}')

url: https://codeup.com/blog/


In [219]:
# checking the response object/type

type(response)

requests.models.Response

In [220]:
# let's use the BeautifulSoup module to create an HTML object

soup = BeautifulSoup(response.content, 'html.parser')
type(soup)

bs4.BeautifulSoup

**``Beautiful Soup Methods and Properties``**

* ``soup.title.string`` gets the page's title (the same text shown in the browser tab; this is the title element)

* ``soup.prettify()`` returns the HTML as an indented string, which is useful to print when you want to inspect the markup

* ``soup.find_all("a")`` finds all the anchor tags (or whatever tag is passed as the argument)

* ``soup.find("h1")`` finds the first matching element

* ``soup.get_text()`` gets the text from within a matching piece of soup/HTML

* The ``soup.select()`` method takes a CSS selector as a string and returns all matching elements, which makes it handy for nested tags
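To see the methods listed above side by side, here is a minimal sketch that runs them against a small, made-up HTML string (the page, its titles, and the `example.com` links are placeholders, not the Codeup blog).

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML page (not the Codeup blog) to try the methods on
html = """
<html>
  <head><title>Sample Blog</title></head>
  <body>
    <h1>Welcome</h1>
    <h2><a href="https://example.com/post-1/">First Post</a></h2>
    <h2><a href="https://example.com/post-2/">Second Post</a></h2>
    <p>No link in this paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(html, 'html.parser')

page_title = sample_soup.title.string          # text of the <title> element
first_h1 = sample_soup.find('h1').get_text()   # first matching element only
anchors = sample_soup.find_all('a')            # every <a> tag, as a list
links = [a['href'] for a in sample_soup.select('h2 a[href]')]  # CSS selector

print(page_title)
print(first_h1)
print(len(anchors))
print(links)
```

The selector `'h2 a[href]'` reads as "anchor tags that have an href attribute and sit somewhere inside an h2".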

In [221]:
# on the Codeup blog page, the article titles are <a href="..."> links nested inside <h2> elements
# the select method takes a CSS selector for that pattern and returns every matching anchor tag

soup.select('h2 a[href]') # checks out!

[<a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>,
 <a href="https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/">What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science</a>,
 <a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/codeup-news/codeup-x-comic-con/">Codeup X Superhero Car Show &amp; Comic Con</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://

In [222]:
# what if we just want a single title or link?

url = soup.select('h2 a[href]')[0]['href'] # this is the link, what about the title?
url

'https://codeup.com/data-science/recession-proof-career/'

In [223]:
# what if we want just the links to iterate through?

urls = soup.select('h2 a[href]')
urls

[<a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://codeup.com/featured/what-jobs-can-you-get-after-a-coding-bootcamp-part-2-cloud-administration/">What Jobs Can You Get After a Coding Bootcamp? Part 2: Cloud Administration</a>,
 <a href="https://codeup.com/data-science/jobs-after-a-coding-bootcamp-part-1-data-science/">What Jobs Can You Get After a Coding Bootcamp? Part 1: Data Science</a>,
 <a href="https://codeup.com/data-science/recession-proof-career/">Is a Career in Tech Recession-Proof?</a>,
 <a href="https://codeup.com/codeup-news/codeup-x-comic-con/">Codeup X Superhero Car Show &amp; Comic Con</a>,
 <a href="https://codeup.com/featured/series-part-3-web-development/">What Jobs Can You Get After a Coding Bootcamp? Part 3: Web Development</a>,
 <a href="https://

In [224]:
urls[0]['href']

'https://codeup.com/data-science/recession-proof-career/'
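The anchor list above contains the same article more than once (the blog page links featured posts twice). One possible way to avoid requesting a page twice is an order-preserving de-duplication of the hrefs. In this sketch, `hrefs` is a hand-typed stand-in for `[a['href'] for a in urls]`.

```python
# hypothetical stand-in for [a['href'] for a in urls]
hrefs = [
    'https://codeup.com/data-science/recession-proof-career/',
    'https://codeup.com/featured/series-part-3-web-development/',
    'https://codeup.com/data-science/recession-proof-career/',
    'https://codeup.com/codeup-news/codeup-x-comic-con/',
    'https://codeup.com/featured/series-part-3-web-development/',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving the order of first appearance
unique_hrefs = list(dict.fromkeys(hrefs))

print(unique_hrefs)
```

A plain `set(hrefs)` would also remove repeats, but it does not keep the page order.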

In [225]:
# with the links accessible, i can hit the needed articles to extract more data

container = []

# let's extract all articles in the Codeup blog post website
for link in urls:

    # extracting the article url from the anchor tag
    article_url = link['href']

    # creating the response object (Article Website)
    response = get(article_url, headers = headers)

    # create the soup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # extract the title
    title = soup.find('h1', class_ = "entry-title").text
    
    # extract the publish date
    published = soup.find('span', class_ = "published").text.strip()
    
    # extract article body
    contents = soup.find('div', class_ = 'entry-content').text.strip()

    # create dictionary that holds article contents
    article_dict = { 
        "article_title": title,
        "publish_date": published,
        "contents": contents
    }

    # append article dictionary to the container list
    container.append(article_dict)

# note: publish_date is still a string here, so this sort is alphabetical rather than chronological
articles = pd.DataFrame(container).sort_values("publish_date").reset_index(drop = True)
articles

# notes to self:
# when writing a function that pulls information at scale, confirm that the tags, classes, and labels it relies on are consistent across every page

| | article_title | publish_date | contents |
|---|---|---|---|
| 0 | Codeup X Superhero Car Show & Comic Con | Aug 10, 2022 | Codeup had a blast at the San Antonio Superher... |
| 1 | Is a Career in Tech Recession-Proof? | Aug 12, 2022 | Given the current economic climate, many econo... |
| 2 | Is a Career in Tech Recession-Proof? | Aug 12, 2022 | Given the current economic climate, many econo... |
| 3 | What Jobs Can You Get After a Coding Bootcamp?... | Aug 2, 2022 | If you’re considering a career in web developm... |
| 4 | What Jobs Can You Get After a Coding Bootcamp?... | Aug 2, 2022 | If you’re considering a career in web developm... |
| 5 | What Jobs Can You Get After a Coding Bootcamp?... | Jul 14, 2022 | Have you been considering a career in Cloud Ad... |
| 6 | What Jobs Can You Get After a Coding Bootcamp?... | Jul 14, 2022 | Have you been considering a career in Cloud Ad... |
| 7 | Codeup TV Commercial | Jul 20, 2022 | Codeup has officially made its TV debut! Our c... |
| 8 | Codeup’s New Dallas Campus | Jul 25, 2022 | Codeup’s Dallas campus has a new location! For... |
| 9 | In-Person Workshop: Learn to Code – JavaScript... | Jul 6, 2022 | Join us for our live in-person JavaScript cras... |
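The table above is sorted on `publish_date` as text, which is why "Aug 2, 2022" lands after "Aug 12, 2022". A minimal sketch of sorting chronologically instead, using only the standard library and a few of the date strings from the table (the helper name `parse_publish_date` is made up for this example, and `%b` assumes English month abbreviations):

```python
from datetime import datetime

# a few publish dates in the same "Mon D, YYYY" format the blog uses
publish_dates = ['Aug 10, 2022', 'Aug 12, 2022', 'Aug 2, 2022', 'Jul 14, 2022', 'Jul 6, 2022']


def parse_publish_date(text):
    """Turn a string such as 'Aug 2, 2022' into a datetime object."""
    return datetime.strptime(text, '%b %d, %Y')


# sorting on the parsed value gives true date order
chronological = sorted(publish_dates, key=parse_publish_date)

print(chronological)
```

With a DataFrame, the same idea would be to convert the column with `pd.to_datetime` before calling `sort_values`.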


In [226]:
# sample title extraction code

# create the blog url
url = 'https://codeup.com/blog/'

# include the headers
headers = {'User-Agent': 'Codeup Data Science'} 

#create the response object
response1 = get(url, headers = headers)

# first soup
soup1 = BeautifulSoup(response1.content, 'html.parser')

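# hit the blog page and retrieve one article link
# assumption: `counter` is not defined anywhere in this notebook and the value used for the output below is not recorded; -1 picks the last link on the page
counter = -1
link_url = soup1.select('h2 a[href]')[counter]["href"]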

response2 = get(link_url, headers = headers)

# new soup object
soup2 = BeautifulSoup(response2.content, 'html.parser')

soup2.find('h1', class_ = "entry-title").text # checks out!

'In-Person Workshop: Learn to Code – JavaScript on 7/26'

In [227]:
# extracting the published date

published_date = soup2.find('span', class_ = "published").text.strip() # checks out!
published_date

'Jul 6, 2022'

In [228]:
# extract article/blog contents 

soup2.find('div', class_ = 'entry-content').text.strip() # checks out!

'Join us for our live in-person JavaScript crash course, where we will dig into one of the fastest-growing languages in the software development industry. It’s free and open to all – you don’t need to have any previous programming knowledge to participate. \nBy the end of the presentation, you will:\n\nHave a good understanding of what programming means\nKnow what JavaScript is and how it’s used\nThe best part: we will get our hands dirty writing some JavaScript. \n\nDon’t worry…we’ll walk you through every step. Come learn to code live with our very own instructor staff. Maybe this will be your jumpstart into an exciting and in-demand career…for FREE! \nMust be 18+ to participate. \nParking is free on Tuesdays in downtown San Antonio.'
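The prompt asks for this work to be wrapped in a function named `get_blog_articles`. Below is one possible sketch, not the only way to structure it. It reuses the three lookups tested above (`entry-title`, `published`, `entry-content`). The helper names `parse_blog_article` and `_text_or_none`, and the `fetch_html` parameter, are assumptions made for this example: passing the fetcher in keeps the parsing testable without a network call, and the demo below uses a made-up page instead of the real blog.

```python
from bs4 import BeautifulSoup


def _text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text().strip() if tag is not None else None


def parse_blog_article(html):
    """Pull the title, publish date, and body text out of one article page."""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': _text_or_none(soup.find('h1', class_='entry-title')),
        'published': _text_or_none(soup.find('span', class_='published')),
        'content': _text_or_none(soup.find('div', class_='entry-content')),
    }


def get_blog_articles(article_urls, fetch_html):
    """Return a list of dictionaries, one per article url.

    fetch_html is any callable that maps a url to an HTML string.
    """
    articles = []
    for article_url in article_urls:
        article = parse_blog_article(fetch_html(article_url))
        article['url'] = article_url  # extra property: where the article came from
        articles.append(article)
    return articles


# demo on a made-up page so the example runs offline
fake_pages = {
    'https://example.com/a/': (
        '<h1 class="entry-title">Post A</h1>'
        '<span class="published"> Aug 1, 2022 </span>'
        '<div class="entry-content"><p>Body of A.</p></div>'
    ),
}

blog_articles = get_blog_articles(list(fake_pages), fake_pages.get)
print(blog_articles)
```

Against the live site, `fetch_html` could be something like `lambda u: get(u, headers = headers).text`, using the same `get` and `headers` as the cells above.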

----
#### ``Exercise Number 2: News Articles``

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment


``The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:``

>{
'title': 'The article title',\
'content': 'The article content',\
'category': 'business' # for example
}


In [242]:
# let's check out the initial site

url = 'https://inshorts.com/en/read/business'

response = get(url)
type(response)

requests.models.Response

In [243]:
# what's in the object

response # successful connection

<Response [200]>

In [244]:
# creating a beautifulsoup object and exploring the site further

soup = BeautifulSoup(response.content, 'html.parser')
type(soup) # object type checks out!

bs4.BeautifulSoup

In [283]:
# what's in the read page of inshorts: looking at one title

soup.find('span', itemprop = 'headline').text

"Adani Transmission becomes India's 8th most valued company"

In [285]:
# ok, but can we get all the titles? using the find_all() method

soup.find_all('span', itemprop = 'headline') # checks out

[<span itemprop="headline">Adani Transmission becomes India's 8th most valued company</span>,
 <span itemprop="headline">Musk cites whistleblower's claims in new notice as reason to end Twitter deal</span>,
 <span itemprop="headline">No plan to rebrand Zomato app to Eternal: CEO Deepinder Goyal</span>,
 <span itemprop="headline">Cancelling AC, first-class confirmed train tickets to now attract 5% GST</span>,
 <span itemprop="headline">China arrests over 230 people tied to its largest-ever bank fraud of $5.8 billion</span>,
 <span itemprop="headline">Bank of India files insolvency plea with NCLT against Future Lifestyle Fashions</span>,
 <span itemprop="headline">Mukesh Ambani lays succession plan; allots Retail to Isha &amp; energy to Anant</span>,
 <span itemprop="headline">We need to use more oil &amp; gas, not less, otherwise civilisation will crumble: Musk</span>,
 <span itemprop="headline">BSE Sensex jumps over 1,000 pts to trade above 59,000, Nifty tops 17,600</span>,
 <span item

In [286]:
# notes to self: use the find_all() method and iterate through these
# ensure that the total number of titles matches the total number of authors, publish dates, content blocks, etc.
# 25 articles on the Business page

len(soup.find_all('span', itemprop = 'headline'))

25

In [308]:
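# extracting the author
# reminder that class is a reserved python word, so 'class_' must be used to filter on the html class attribute
# 50 author spans for 25 articles: there is more than one author span per article (possibly multiple authors),
# so positions in this list will not line up one-to-one with the headlines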

len(soup.find_all('span', class_ = 'author'))

50

In [289]:
# what about content blurbs/paragraphs?
# checks out! 25 article bodies and 25 titles

len(soup.find_all('div', itemprop = 'articleBody'))

25

In [353]:
# understanding the contents object

contents = soup.find_all('div', itemprop = 'articleBody')
range(len(contents))

True

In [362]:
# creating the news article function

def get_news_articles(website_url):

    # create the unique response object
    response = get(website_url)

    # create the soup object
    soup = BeautifulSoup(response.content, 'html.parser')

    # creating a list of titles/headlines to iterate through
    titles = soup.find_all('span', itemprop = 'headline')

    dates = soup.find_all('span', class_ = 'date')

    sources = soup.find_all('a', class_ = 'source')

    authors = soup.find_all('span', class_ = 'author')

    contents = soup.find_all('div', itemprop = 'articleBody')

    # creating a container list to hold article contents in
    container = []

    # iterate through the total number of headlines on website
    for num in range(len(titles)):
        
        published = dates[num].text

        title = titles[num].text

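        # note: the page returned 50 author spans for 25 headlines, so authors[num] may not belong to titles[num]
        author = authors[num].text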

        content = contents[num].text
        
        # handle the case where there are fewer sources than headlines
        # note: this only guards against running past the end of the list;
        # it cannot tell which article is the one missing its source
        if num < len(sources):
            source = sources[num].text
        else:
            source = None
            
        # creating a dictionary to save the articles contents
        article_dict = { 
            
            'publish_date': published, 
            'source': source, 
            'title': title,
            'authors': author,
            'content': content
        }

        # append to container list
        container.append(article_dict)
    
    # creating a dataframe from all scraped articles
    article_df = pd.DataFrame(container)

    # printing the dataframe shape
    print(f'dataframe shape: {article_df.shape}')

    return article_df
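Because the author count (50) does not match the headline count (25), indexing separate lists by position can pair a headline with the wrong author or source. One alternative is to loop over each article's container and search inside it, so every field comes from the same article. The sketch below also adds the `category` key the prompt asks for. It assumes each article sits in a `div` with class `news-card`; that class name, the function name `parse_news_cards`, and the sample HTML are all assumptions for illustration and should be checked against the live page.

```python
from bs4 import BeautifulSoup


def _text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text().strip() if tag is not None else None


def parse_news_cards(html, category):
    """Return one dictionary per article card found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    # assumption: 'news-card' is the class of the per-article container
    for card in soup.select('div.news-card'):
        articles.append({
            'title': _text_or_none(card.find('span', itemprop='headline')),
            'content': _text_or_none(card.find('div', itemprop='articleBody')),
            'author': _text_or_none(card.find('span', class_='author')),  # first author span in this card
            'source': _text_or_none(card.find('a', class_='source')),     # None when the card has no source
            'category': category,
        })
    return articles


# made-up markup: the first card repeats its author span, the second has no source link
sample_html = """
<div class="news-card">
  <span itemprop="headline">First headline</span>
  <span class="author">Author One</span>
  <div itemprop="articleBody">First body.</div>
  <span class="author">Author One</span>
  <a class="source" href="https://example.com/1">Example Source</a>
</div>
<div class="news-card">
  <span itemprop="headline">Second headline</span>
  <span class="author">Author Two</span>
  <div itemprop="articleBody">Second body.</div>
</div>
"""

news = parse_news_cards(sample_html, 'business')
print(news)
```

Searching inside each card means a missing source or a repeated author span affects only that one article instead of shifting every row after it.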

In [357]:
# testing out the function

df = get_news_articles('https://inshorts.com/en/read/business')
df.head() # where there are 5 unique authors on the inshorts business site

dataframe shape: (25, 5)


| | publish_date | source | title | authors | content |
|---|---|---|---|---|---|
| 0 | 30 Aug | Times Now | Adani Transmission becomes India's 8th most va... | Hiral Goyal | Adani Transmission has entered the club of Ind... |
| 1 | 30 Aug | Reuters | Musk cites whistleblower's claims in new notic... | Hiral Goyal | Tesla CEO Elon Musk's legal team has filed ano... |
| 2 | 30 Aug | Financial Express | No plan to rebrand Zomato app to Eternal: CEO ... | Ridham Gambhir | Zomato CEO Deepinder Goyal clarified in an exc... |
| 3 | 30 Aug | Hindustan Times | Cancelling AC, first-class confirmed train tic... | Ridham Gambhir | The Finance Ministry stated that cancellation ... |
| 4 | 30 Aug | ANI News | China arrests over 230 people tied to its larg... | Hiral Goyal | China has announced that 234 people who are su... |


In [363]:
# testing the function on the 'Sports' section

df = get_news_articles('https://inshorts.com/en/read/sports')
df["source"].unique()

dataframe shape: (25, 5)


array(['Sportskeeda', 'The Print', 'Hindustan Times', 'ICC', 'Instagram',
       'Twitter', 'ANI News', 'Reuters', 'News18', None], dtype=object)

In [366]:
# testing the function on the 'Technology' section

df = get_news_articles('https://inshorts.com/en/read/technology')
df.head()

dataframe shape: (25, 5)


| | publish_date | source | title | authors | content |
|---|---|---|---|---|---|
| 0 | 30 Aug | BQ Prime | Who are now the world's 10 wealthiest people a... | Hiral Goyal | Gautam Adani has become the world's third rich... |
| 1 | 30 Aug | Twitter | Lost 9 kg from my peak weight: Musk after reve... | Hiral Goyal | Responding to a Twitter user, the world's rich... |
| 2 | 29 Aug | Free Press Journal | Reliance Industries to stream AGM on VR, socia... | Ridham Gambhir | Reliance Industries on Monday will become one ... |
| 3 | 29 Aug | ANI | No such plans: Union Minister on reports of ba... | Ridham Gambhir | Union Minister of State for IT Rajeev Chandras... |
| 4 | 30 Aug | Reuters | Musk cites whistleblower's claims in new notic... | Ridham Gambhir | Tesla CEO Elon Musk's legal team has filed ano... |


In [367]:
# testing the function on the 'Entertainment' section

df = get_news_articles('https://inshorts.com/en/read/entertainment')
df.head()

dataframe shape: (25, 5)


| | publish_date | source | title | authors | content |
|---|---|---|---|---|---|
| 0 | 30 Aug | Free Press Journal | Amala Paul's rumoured ex-boyfriend arrested af... | Daisy Mowke | Actress Amala Paul's rumoured ex-boyfriend, Bh... |
| 1 | 30 Aug | Twitter | Kamaal R Khan taken to hospital due to chest p... | Daisy Mowke | Kamaal R Khan, who was arrested on Tuesday aft... |
| 2 | 30 Aug | Bollywood Hungama | When Barjatya announced 'Vivah', everyone thou... | Daisy Mowke | Filmmaker Ram Gopal Varma in an interview said... |
| 3 | 30 Aug | PINKVILLA | Michelle Yeoh to be honoured at Toronto Intern... | Daisy Mowke | Celebrated Malaysian actress Michelle Yeoh wil... |
| 4 | 30 Aug | PINKVILLA | Megan Thee Stallion to make a cameo appearance... | Udit Gupta | Rapper Megan Thee Stallion will reportedly fea... |


----

#### ``Exercise Number 3: Caching the Data``

**<u>Notes:</u>**

* Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of making all the requests every time the function is called.
* Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (rewriting your local cache).
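A minimal sketch of the caching pattern described in these notes, using a JSON file as the local cache. The names `cached_acquire`, `cache_path`, `acquire`, and `fresh` are assumptions for this example, and `fake_scrape` stands in for a real scraping function so the demo runs without any requests.

```python
import json
import os
import tempfile


def cached_acquire(cache_path, acquire, fresh=False):
    """Return cached data when it exists, otherwise call acquire() and cache the result.

    cache_path: where the JSON cache file lives
    acquire:    a zero-argument callable that returns a list of dictionaries
    fresh:      when True, ignore any existing cache and overwrite it
    """
    if os.path.exists(cache_path) and not fresh:
        with open(cache_path) as f:
            return json.load(f)

    data = acquire()
    with open(cache_path, 'w') as f:
        json.dump(data, f)
    return data


# stand-in for a real scraper; it records how many times it was called
calls = []


def fake_scrape():
    calls.append(1)
    return [{'title': 'Example title', 'content': 'Example content'}]


cache_file = os.path.join(tempfile.mkdtemp(), 'articles.json')

first = cached_acquire(cache_file, fake_scrape)               # no cache yet: scrapes and writes the file
second = cached_acquire(cache_file, fake_scrape)              # cache exists: reads the file, no scrape
third = cached_acquire(cache_file, fake_scrape, fresh=True)   # forced refresh: scrapes and rewrites the file

print(len(calls))
```

The scraper runs only on the first call and on the forced refresh; the second call is served entirely from the local file.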