## Data Acquisition Exercises


By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may split your work into a separate file for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions must still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

In [3]:
import pandas as pd
import requests
import re

from bs4 import BeautifulSoup

## 1. Codeup Blog Articles

Visit Codeup's Blog (https://codeup.edu/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [None]:
#{
#    'title': 'the title of the article',
#    'content': 'the full text content of the article'
#}

Plus any additional properties you think might be helpful.

In [24]:
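def get_page_content(url):
    """Fetch a URL and return its HTML parsed as a BeautifulSoup object."""
    # Send a custom User-Agent header and give up if the server does not answer in time.
    response = requests.get(url, headers={'User-Agent': 'Codeup Data Science'}, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error status
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup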

In [25]:
pg1 = get_page_content('https://codeup.edu/blog/')

In [13]:
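def extract_links(page):
    """Return the href of the first link inside each <h2> on the page."""
    new_links = []
    for heading in page.find_all("h2"):
        # Look the anchor up once; headings without a link are skipped.
        anchor = heading.find("a")
        if anchor is not None:
            new_links.append(anchor.get("href"))
    return new_links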

In [17]:
links_pg1 = extract_links(pg1)

In [30]:
links_pg1

['https://codeup.edu/featured/apida-heritage-month/',
 'https://codeup.edu/featured/women-in-tech-panelist-spotlight/',
 'https://codeup.edu/featured/women-in-tech-rachel-robbins-mayhill/',
 'https://codeup.edu/codeup-news/women-in-tech-panelist-spotlight-sarah-mellor/',
 'https://codeup.edu/events/women-in-tech-madeleine/',
 'https://codeup.edu/codeup-news/panelist-spotlight-4/']

In [47]:
def get_content(link):
    """Scrape one article page and return its title and paragraph text."""
    page_soup = get_page_content(link)
    header = page_soup.find("h1").get_text()
    # Keep only the <p> tags inside the main article container.
    paragraphs = page_soup.select(".entry-content")[0].find_all("p")
    clean = ' '.join(p.get_text() for p in paragraphs)
    page_content = {'title': header,
                    'content': clean}
    return page_content

In [37]:
get_content(links_pg1[0])

{'title': 'Spotlight on APIDA Voices: Celebrating Heritage and Inspiring Change ft. Arbeena Thapa',
 'content': 'May is traditionally known as Asian American and Pacific Islander (AAPI) Heritage Month. This month we celebrate the history and contributions made possible by our AAPI friends, family, and community. We also examine our level of support and seek opportunities to better understand the AAPI community.  In an effort to address real concerns and experiences, we sat down with Arbeena Thapa, one of Codeup’s Financial Aid and Enrollment Managers. Arbeena identifies as Nepali American and Desi. Arbeena’s parents immigrated to Texas in 1988 for better employment and educational opportunities. Arbeena’s older sister was five when they made the move to the US. Arbeena was born later, becoming the first in her family to be a US citizen. At Codeup we take our efforts at inclusivity very seriously. After speaking with Arbeena, we were taught that the term AAPI excludes Desi-American indi

In [60]:
def scrape(url):
    all_content = []
    page = get_page_content(url)
    links = extract_links(page)
    for link in links:
        all_content.append(get_content(link))
    return all_content
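
The exercise asks for this logic to be wrapped in a function named get_blog_articles. The following is only a sketch of one possible shape, not the notebook's own implementation: it assumes the title sits in the first `h1` and the body in `.entry-content`, as in the cells above, and it takes a hypothetical `fetch` argument (any callable that returns a page's HTML) so the parsing can be tried offline. `parse_article` and `sample_html` are illustrative names.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Turn one article page's HTML into a {'title', 'content'} dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    body = soup.select_one(".entry-content")
    # Fall back to empty values rather than raising if the markup is missing.
    paragraphs = body.find_all("p") if body is not None else []
    return {
        "title": title.get_text(strip=True) if title is not None else "",
        "content": " ".join(p.get_text(strip=True) for p in paragraphs),
    }


def get_blog_articles(urls, fetch):
    """Return one dictionary per URL; fetch(url) must return that page's HTML."""
    return [parse_article(fetch(url)) for url in urls]


# Offline stand-in for a real page, so the sketch runs without network access.
sample_html = """
<html><body>
  <h1>Sample Post</h1>
  <div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

articles = get_blog_articles(["https://example.com/post"], fetch=lambda url: sample_html)
print(articles)
```

Against the live site, `fetch` could be a small wrapper around `requests.get(url, ...).text`; injecting it keeps the parsing logic testable without making requests.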

In [63]:
page_one = scrape('https://codeup.edu/blog/')

In [64]:
len(page_one)

6

In [82]:
def scrape_links(url):
    """Follow the link to the previous page repeatedly and return every page URL."""
    all_pages = [url]
    while True:
        page = get_page_content(all_pages[-1])
        # A page may have no .alignleft element at all, so do not index [0] blindly.
        nav = page.select_one(".alignleft")
        previous_page = nav.find("a") if nav is not None else None
        if previous_page is None:
            break
        all_pages.append(previous_page.get("href"))
    return all_pages
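
The loop above keeps following a link from page to page until none is found. Below is a minimal sketch of the same idea with two guards added, a set of visited URLs and a page cap, so that a pagination cycle or a markup change cannot produce an endless loop. The names `follow_pagination`, `get_next`, and `max_pages` are hypothetical, and the dictionary stands in for the live site.

```python
def follow_pagination(start_url, get_next, max_pages=100):
    """Collect page URLs by repeatedly following a 'next page' link.

    get_next is any callable that takes a URL and returns the URL of the
    following page, or None when there is no further page.
    """
    pages = [start_url]
    seen = {start_url}
    while len(pages) < max_pages:
        next_url = get_next(pages[-1])
        # Stop at the last page, or if the site links back to a page already visited.
        if next_url is None or next_url in seen:
            break
        pages.append(next_url)
        seen.add(next_url)
    return pages


# Stand-in for the real site: each page maps to the page it links to.
fake_site = {
    "/blog/": "/blog/page/2/",
    "/blog/page/2/": "/blog/page/3/",
    "/blog/page/3/": None,
}

pages = follow_pagination("/blog/", fake_site.get)
print(pages)
```

With the notebook's helpers, `get_next` could be a function that calls get_page_content and returns the `href` found under `.alignleft`, or None when that link is absent.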

In [83]:
links = scrape_links('https://codeup.edu/blog/')
everything = []
for i, link in enumerate(links, start=1):
    everything.append(scrape(link))
    print(f'Page {i} done!')

Page 1 done!
Page 2 done!
Page 3 done!
Page 4 done!
Page 5 done!
Page 6 done!
Page 7 done!
Page 8 done!
Page 9 done!
Page 10 done!
Page 11 done!
Page 12 done!
Page 13 done!
Page 14 done!
Page 15 done!
Page 16 done!
Page 17 done!
Page 18 done!
Page 19 done!
Page 20 done!
Page 21 done!
Page 22 done!
Page 23 done!
Page 24 done!
Page 25 done!
Page 26 done!
Page 27 done!
Page 28 done!
Page 29 done!
Page 30 done!
Page 31 done!
Page 32 done!
Page 33 done!
Page 34 done!
Page 35 done!
Page 36 done!
Page 37 done!
Page 38 done!
Page 39 done!
Page 40 done!
Page 41 done!
Page 42 done!
Page 43 done!
Page 44 done!
Page 45 done!


In [85]:
len(everything)

45

In [228]:
titles = []
contents = []

In [229]:
for little_list in everything:
    for dictionary in little_list:
        titles.append(dictionary['title'])
        contents.append(dictionary['content'])

In [232]:
len(titles), len(contents)

(270, 270)

In [238]:
codeup_blogs = {'title':titles,
                'content':contents}

In [241]:
codeup_blogs = pd.DataFrame(codeup_blogs)

In [243]:
codeup_blogs.to_csv('codeup_blogs.csv',index=False)
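
Because every article is already a dictionary with the same keys, the nested list can also be flattened and handed to pandas directly, without building parallel lists first. A minimal sketch, in which the small hypothetical `sample` list stands in for `everything`:

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-in for `everything`: one inner list per blog page.
sample = [
    [{"title": "Post A", "content": "Text A"}, {"title": "Post B", "content": "Text B"}],
    [{"title": "Post C", "content": "Text C"}],
]

# Flatten the list of lists into a single list of article dictionaries.
articles = list(chain.from_iterable(sample))

# pandas builds one row per dictionary and one column per key.
df = pd.DataFrame(articles)
print(df.shape)
```

The flat `articles` list also matches the list-of-dictionaries shape that the exercise asks get_blog_articles to return.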

## 2. News Articles

We will now be scraping text data from inshorts (https://inshorts.com/en/read), a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

In [None]:
#{
#    'title': 'The article title',
#    'content': 'The article content',
#    'category': 'business' # for example
#}

Hints:

1. Start by inspecting the website in your browser. Figure out which elements will be useful.
2. Start by creating a function that handles a single article and produces a dictionary like the one above.
3. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
4. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
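
The layered structure described in the hints can be sketched as three small functions. This is an illustration under assumptions, not the site's documented markup: the `news-card` wrapper class and the `sample_html` snippet are invented for the example, `parse_card` and `parse_page` are hypothetical names, and `fetch` is a hypothetical callable that returns the HTML for a category so the sketch runs offline.

```python
from bs4 import BeautifulSoup


def parse_card(card, category):
    """Single article: one article element becomes one dictionary."""
    return {
        "title": card.find(itemprop="headline").get_text(strip=True),
        "content": card.find(itemprop="articleBody").get_text(strip=True),
        "category": category,
    }


def parse_page(html, category):
    """Single page: find every article element and parse each one."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all(class_="news-card")  # assumed wrapper class
    return [parse_card(card, category) for card in cards]


def get_news_articles(categories, fetch):
    """All pages: combine every category into one flat list of dictionaries."""
    articles = []
    for category in categories:
        articles.extend(parse_page(fetch(category), category))
    return articles


# Invented markup standing in for a category page.
sample_html = """
<div class="news-card">
  <span itemprop="headline">Headline one</span>
  <div itemprop="articleBody">Body one.</div>
</div>
<div class="news-card">
  <span itemprop="headline">Headline two</span>
  <div itemprop="articleBody">Body two.</div>
</div>
"""

articles = get_news_articles(["business", "sports"], fetch=lambda category: sample_html)
print(len(articles))
```

Parsing each article element on its own keeps a headline and its body together, and using `extend` rather than `append` yields the flat list of dictionaries the exercise asks for.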

In [204]:
category = ['business', 'sports', 'technology', 'entertainment']

In [195]:
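def get_inshorts(url):
    """Fetch an inshorts page and return it parsed as a BeautifulSoup object."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error status
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup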

In [196]:
def get_data_inshorts(cat):
    """Scrape one inshorts category page into a list of article dictionaries."""
    page = get_inshorts(f'https://inshorts.com/en/read/{cat}')
    headlines = page.find_all(itemprop='headline')
    bodies = page.find_all(itemprop='articleBody')

    all_data = []
    # Pair each headline with the body found at the same position on the page.
    for headline, body in zip(headlines, bodies):
        data = {'title': headline.get_text(),
                'content': body.get_text(),
                'category': cat}
        all_data.append(data)

    return all_data

In [197]:
all_news = []

for cat in category:
    all_news.append(get_data_inshorts(cat))

In [245]:
titles = []
contents = []
cats = []

In [246]:
for list_list in all_news:
    for dictionary in list_list:
        titles.append(dictionary['title'])
        contents.append(dictionary['content'])
        cats.append(dictionary['category'])

In [249]:
news = {'title':titles,
         'content':contents,
         'category':cats}

In [252]:
news = pd.DataFrame(news)

In [253]:
news.to_csv('news.csv',index=False)
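
Saving the results to disk also means later runs can skip the scraping step entirely. Here is a sketch of a small caching wrapper, assuming a hypothetical `load_or_scrape` helper and JSON as the cache format (the notebook itself writes CSV); `scrape_fn` stands for any zero-argument function that returns a list of dictionaries.

```python
import json
import os
import tempfile


def load_or_scrape(cache_path, scrape_fn, refresh=False):
    """Return cached articles if the file exists, otherwise scrape and cache them."""
    if os.path.exists(cache_path) and not refresh:
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    articles = scrape_fn()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles


# Demonstration with a fake scraper that records how often it is called.
calls = []


def fake_scrape():
    calls.append(1)
    return [{"title": "T", "content": "C", "category": "business"}]


cache_file = os.path.join(tempfile.mkdtemp(), "news.json")
first = load_or_scrape(cache_file, fake_scrape)   # scrapes and writes the cache
second = load_or_scrape(cache_file, fake_scrape)  # reads the cache, no scrape
print(first == second, len(calls))
```

Passing `refresh=True` forces a fresh scrape and overwrites the cached file.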

In [244]:
all_news

[[{'title': 'Govt probing accounts of Adani Group-run Mumbai and Navi Mumbai airports',
   'content': 'The Ministry of Corporate Affairs has opened an investigation into accounts of Adani Group-run Mumbai and Navi Mumbai airports. A significant part of the information being sought by the government pertains to the period from fiscal 2018 to 2022 prior to their acquisition, Adani Enterprises said. The company stated its units will respond to the communications with applicable legal provisions.',
   'category': 'business'},
  {'title': 'IndiGo Co-founder Gangwal to buy SpiceJet stake, say reports; stock up 20%',
   'content': 'IndiGo Co-founder Rakesh Gangwal is at advanced stage of talks to buy a sizeable stake in SpiceJet, ET NOW reported. Following reports of the deal, shares of cash-strapped SpiceJet closed nearly 20% higher on Friday after hitting 52-week high of ₹43.82/share. Gangwal holds a 13.23% stake in IndiGo-parent InterGlobe Aviation while his wife holds a 2.99% stake.',
   