By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may split your work into one file per website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions must still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

### Codeup Blog Articles

Scrape the article text from the following pages:

- https://codeup.com/codeups-data-science-career-accelerator-is-here/
- https://codeup.com/data-science-myths/
- https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
- https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
- https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
Plus any additional properties you think might be helpful.

Bonus:
 - Scrape the text of all the articles linked on Codeup's blog page.

Bonus Bonus:

 - Starting from the blog homepage, scrape the full text of every article linked on the page, then move on to the next page and keep doing the same thing until you have scraped the entire text of Codeup's blog.
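
One way to approach the page-by-page crawl is sketched below. This is a minimal sketch, not a finished scraper: `crawl_blog`, `fetch_page`, `max_pages`, and `FAKE_SITE` are hypothetical names introduced for illustration, and the sketch assumes you write a `fetch_page(url)` function that returns the article links on a page together with the URL of the next page (or `None` on the last page). The fake site stands in for real requests so the loop can be tried offline.

```python
def crawl_blog(start_url, fetch_page, max_pages=50):
    """Follow 'next page' links and collect article URLs in page order.

    fetch_page(url) is a hypothetical helper you supply. It must return a
    tuple (article_urls, next_url), where next_url is None on the last page.
    """
    seen_pages = set()
    article_urls = []
    url = start_url
    # stop on the last page, on a page we already visited, or at the page limit
    while url is not None and url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(url)
        links, url = fetch_page(url)
        for link in links:
            # the same article can be linked from more than one page
            if link not in article_urls:
                article_urls.append(link)
    return article_urls


# Stand-in for the real site: page name -> (article links, next page)
FAKE_SITE = {
    "page-1": (["a", "b"], "page-2"),
    "page-2": (["b", "c"], "page-3"),
    "page-3": (["d"], None),
}

found = crawl_blog("page-1", FAKE_SITE.get)
print(found)
```

Each collected URL could then be passed to a single-article function. The `seen_pages` set and the `max_pages` limit are there so that a "next" link pointing back to an earlier page cannot cause an endless loop.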

In [14]:
from requests import get
from bs4 import BeautifulSoup
import os

In [17]:
def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Ada Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='mk-single-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text

In [140]:
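def make_dictionary_from_article(url):
    # fetch one blog post and pull out its title and body text
    headers = {'User-Agent': 'Codeup Ada Data Science'}
    response = get(url, headers=headers)
    # name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find("h1")
    body = soup.find("div", class_="mk-single-content")

    return {
        "title": title.get_text(),
        "body": body.get_text()
    }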

In [141]:
make_dictionary_from_article("https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/")

{'title': 'Competitor Bootcamps Are Closing. Is the Model in Danger?',
 'body': '\nCompetitor Bootcamps Are Closing. Is the Model in Danger?\n\xa0\n\nIs the programming bootcamp model in danger?\nIn recent news, DevBootcamp and The Iron Yard announced that they are closing their doors. This is big news. DevBootcamp was the first programming bootcamp model and The Iron Yard is a national player with 15 campuses across the U.S. In both cases, the companies cited an unsustainable business model. Does that mean the boot-camp model is dead?\n\ntl;dr “Nope!”\nBootcamps exist because traditional education models have failed to provide students job-ready skills for the 21st century. Students demand better employment options from their education. Employers demand skilled and job ready candidates. Big Education’s failure to meet those needs through traditional methods created the fertile ground for the new business model of the programming bootcamp.\nEducation giant Kaplan and Apollo Education G

In [127]:
headers = {'User-Agent': 'Codeup Data Science'}
response = get("https://codeup.com/data-science-myths/",headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find("h1", class_ ="page-title")
body = soup.find("div", class_="mk-single-content")   
body.get_text()
title.get_text()

'Data Science Myths'

In [156]:
import json

def get_blog_articles():
    # if we already have the data, read it locally
    if os.path.exists('articles.json'):
        with open('articles.json') as f:
            return json.load(f)

    # otherwise go fetch the data
    urls = [
        "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
        "https://codeup.com/data-science-myths/",
        "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
        "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
        "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/",
    ]
    articles = []

    for url in urls:
        articles.append(make_dictionary_from_article(url))

    # save it for next time (JSON keeps the list-of-dictionaries shape)
    with open('articles.json', 'w') as f:
        json.dump(articles, f)

    return articles

In [157]:
get_blog_articles()

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'body': '\nThe rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace

### News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
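
The three-step structure in the hints can be sketched as follows. This is only an illustration under stated assumptions: `SAMPLE_HTML` is hypothetical markup modeled on the `itemprop` attributes used in this notebook, the `news-card` class and the names `parse_card`, `parse_page`, and `parse_all` are made up for the example, and the HTML is parsed from a string rather than requested from the site. The real page structure should be checked in the browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the wrapper class "news-card" is an assumption.
SAMPLE_HTML = """
<div class="news-card">
  <span itemprop="headline">First headline</span>
  <div itemprop="articleBody">First body.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Second headline</span>
  <div itemprop="articleBody">Second body.</div>
</div>
"""


def parse_card(card, category):
    """Step 1: turn one article card into a dictionary."""
    return {
        "title": card.find("span", itemprop="headline").get_text(strip=True),
        "content": card.find("div", itemprop="articleBody").get_text(strip=True),
        "category": category,
    }


def parse_page(html, category):
    """Step 2: find every card on one page and parse each one."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("div", class_="news-card")
    return [parse_card(card, category) for card in cards]


def parse_all(pages):
    """Step 3: combine the articles from every category page.

    `pages` maps a category name to that page's HTML.
    """
    articles = []
    for category, html in pages.items():
        articles.extend(parse_page(html, category))
    return articles


sample_articles = parse_all({"business": SAMPLE_HTML})
print(sample_articles)
```

Searching inside each card, rather than collecting all headlines and all bodies separately, keeps every title paired with its own body even if one card is missing a field.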

Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. Functions that retrieve the data should prefer reading the local copy over making all the requests every time they are called. Include a boolean flag in the functions that forces the data to be acquired "fresh" from the actual sources (overwriting your local cache).
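
A minimal sketch of such a cache is shown below. It assumes the data is a list of dictionaries that can be stored as JSON. `load_or_acquire`, its `acquire` argument, and the file name in the usage note are hypothetical names chosen for the example, not part of the exercise.

```python
import json
import os


def load_or_acquire(path, acquire, fresh=False):
    """Return cached data from `path`, or call `acquire()` and cache the result.

    acquire is any zero-argument function that does the real scraping.
    fresh=True skips the cache and overwrites it with newly acquired data.
    """
    if not fresh and os.path.exists(path):
        with open(path) as f:
            return json.load(f)

    data = acquire()
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

A scraping function could then wrap its work in this helper, for example `load_or_acquire('news_articles.json', scrape_all_categories, fresh=fresh)`, where `scrape_all_categories` is whatever function performs the requests.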

In [189]:
headers = {'User-Agent': 'Codeup Data Science'}
response = get("https://inshorts.com/en/read/business",headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find("span", itemprop ="headline")
body = soup.find("div", itemprop="articleBody")   

In [212]:
# url = "https://inshorts.com/en/read/business"
url = "https://inshorts.com"
cat = "entertainment"
cats = ["business", "sports", "technology", "entertainment"]

In [213]:
def make_dictionary_from_article(url, category):
    headers = {'User-Agent': 'Codeup Data Science'}
    specific_url = url + "/en/read/" + category
    response = get(specific_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # find() returns only the first match, so this grabs the top article on the page
    title = soup.find("span", itemprop="headline")
    body = soup.find("div", itemprop="articleBody")

    return {
        "title": title.text,
        "body": body.text,
        "category": category
    }

In [214]:
make_dictionary_from_article(url, cat)

{'title': 'Adult film star claims Censor Board members demanded bribe to pass film',
 'body': 'Malayalam adult film actress Shakeela has claimed that two Censor Board members asked her for bribe to pass her upcoming production venture \'Ladies Not Allowed\', an adult comedy. "Our film has been [rejected] by...Censor Board not once but twice so far," she said. Shakeela added, "Before starting the movie, we made it clear that we\'re making an adult comedy."',
 'category': 'entertainment'}

In [237]:
def get_specific_articles(category):
    # relies on the base `url` defined in an earlier cell
    headers = {'User-Agent': 'Codeup Data Science'}
    specific_url = url + "/en/read/" + category
    response = get(specific_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.find_all("span", itemprop="headline")
    bodies = soup.find_all("div", itemprop="articleBody")

    # pair each headline with its body; both lists are in page order
    news = []
    for title, body in zip(titles, bodies):
        news.append({
            "title": title.text,
            "body": body.text,
            "category": category
        })
    return news

In [244]:
def get_news_articles():
    cats = ["business", "sports", "technology", "entertainment"]
    all_news = []

    for cat in cats:
        all_news.extend(get_specific_articles(cat))
    return all_news

In [245]:
get_news_articles()

[{'title': 'RBI keeps repo rate unchanged at 5.15% after five cuts this year',
  'body': 'The RBI on Thursday kept the repo rate unchanged after five cuts this year. In 2019, the RBI has cut repo rate by 135 basis points so far but banks have passed on only a fraction of rate cuts to consumers. The status quo comes despite data showing growth slumping to a six-and-a-half-year low of 4.5% in the July-September quarter.',
  'category': 'business'},
 {'title': 'P Chidambaram walks out of Tihar Jail after 106 days',
  'body': 'Congress leader P Chidambaram walked out of Tihar Jail on Wednesday evening after spending 106 days in custody in connection with the INX Media case. His release came after the Supreme Court today granted him bail on conditions like surrendering his passport and making himself available for questioning. Chidambaram was arrested on August 21 by the Central Bureau of Investigation.',
  'category': 'business'},
 {'title': '8, 7, 6.6, 5.8, 5 & 4.5 is the state of econo

In [262]:
import pandas as pd

from requests import get
from bs4 import BeautifulSoup
import os

def make_dictionary_from_article(url):
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url,headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find("h1")
    body = soup.find("div", class_="mk-single-content")   
    
    return {
        "title": title.get_text(),
        "body": body.get_text()
    }

def get_blog_articles():
    # if we already have the data, read it locally
    if os.path.exists('articles.csv'):
        # read back the same file that is written below, restoring the index
        return pd.read_csv('articles.csv', index_col=0)

    # otherwise go fetch the data    
    urls = [
        "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
        "https://codeup.com/data-science-myths/",
        "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
        "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
        "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/",
    ]
    articles = []
    
    for url in urls:
        articles.append(make_dictionary_from_article(url))

    df = pd.DataFrame(articles)
    df.to_csv('articles.csv')

    return df

In [263]:
get_blog_articles()

|   | title | body |
|---|-------|------|
| 0 | Codeup’s Data Science Career Accelerator is Here! | `\nThe rumors are true! The time has arrived. C...` |
| 1 | Data Science Myths | `\nBy Dimitri Antoniou and Maggie Giust\nData S...` |
| 2 | Data Science VS Data Analytics: What’s The Dif... | `\nBy Dimitri Antoniou\nA week ago, Codeup laun...` |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | `\n10 Tips to Crush It at the SA Tech Job Fair\...` |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | `\nCompetitor Bootcamps Are Closing. Is the Mod...` |


In [None]:
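# find() returns the first matching element (or None if nothing matches)
soup.find("div", class_="mk-single-content")
# select() takes a CSS selector and returns a list of every matching element
soup.select("div.mk-single-content")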