# Acquire Exercises

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions must still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import re
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import pprint

## 1. Codeup Blog Articles

Scrape the article text from the following pages:

* https://codeup.com/codeups-data-science-career-accelerator-is-here/
* https://codeup.com/data-science-myths/
* https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
* https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
* https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```
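
A quick shape check can catch a missing key before the list is used downstream. The sketch below assumes a hypothetical helper named `has_article_shape`; it is not part of the exercise and only illustrates the dictionary shape described above.

```python
REQUIRED_KEYS = {"title", "content"}

def has_article_shape(article):
    """Return True if article is a dict with string 'title' and 'content' values."""
    if not isinstance(article, dict):
        return False
    # .get returns None for a missing key, which fails the isinstance check
    return all(isinstance(article.get(key), str) for key in REQUIRED_KEYS)

example = {
    "title": "the title of the article",
    "content": "the full text content of the article",
}
print(has_article_shape(example))         # True
print(has_article_shape({"title": "x"}))  # False, 'content' is missing
```

Extra keys such as `date_published` do not fail the check, so it can be applied to richer dictionaries as well.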

In [2]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/', 
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'
       ]

In [3]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
title = soup.title.text
title

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [6]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content clearfix')
text = article.text

In [7]:
date = soup.select("header > ul > li.jupiterx-post-meta-date.list-inline-item > time")[0]["datetime"]

In [8]:
date

'2018-09-30T05:26:22+00:00'
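
The `datetime` attribute is an ISO 8601 string, so it can be turned into a real datetime object with the standard library. This is a minimal sketch using the value shown above; the variable names `raw` and `published` are only for illustration.

```python
from datetime import datetime, timezone

raw = '2018-09-30T05:26:22+00:00'  # the datetime attribute scraped above
published = datetime.fromisoformat(raw)

print(published.date())                  # 2018-09-30
print(published.tzinfo == timezone.utc)  # True
```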

In [9]:
websites = []

In [10]:
websites.append({"title": title, "content": text})

In [11]:
websites

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rac

In [12]:
def get_blog_articles():
    websites = []
    urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/', 
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'
       ]
    for i in urls:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
        response = get(i, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.text
        text = soup.find('div', class_='jupiterx-post-content clearfix').text
        date = soup.select("header > ul > li.jupiterx-post-meta-date.list-inline-item > time")[0]["datetime"]
        websites.append({"title": title, "content": text, "date_published": date})
    return websites

In [13]:
websites = get_blog_articles()

In [14]:
websites

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rac
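
The scraped content contains non-breaking spaces (`\xa0`). One possible cleanup, sketched here with a hypothetical helper named `clean_text`, relies on the fact that `str.split()` with no arguments treats `\xa0` as whitespace.

```python
def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    # split() with no arguments splits on any Unicode whitespace, including \xa0
    return " ".join(text.split())

sample = "will help you land a job in\xa0Glassdoor’s #1 Best Job in America."
print(clean_text(sample))
```

This also flattens newlines into single spaces, which may or may not be wanted depending on how the text is used later.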

## Bonus:

Scrape the text of all the articles linked on Codeup's blog page:
https://codeup.com/resources/#blog

In [67]:
url = "https://codeup.com/resources/#blog"

In [68]:
headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [69]:
urls = []
for link in soup.find_all("a", href=True):
    urls.append(link["href"])

In [64]:
urls = []
for link in soup.find_all("a", class_='jet-listing-dynamic-link__link'):
    urls.append(link)

In [65]:
urls

[<a class="jet-listing-dynamic-link__link" href="https://codeup.com/bootcamp-to-bootcamp/"><span class="jet-listing-dynamic-link__label">From Bootcamp to Bootcamp: Two Military Veterans Discuss Their Transition Into Tech</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/bootcamp-to-bootcamp/"><span class="jet-listing-dynamic-link__label">Read More</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/how-to-get-started-on-a-programming-exercise/"><span class="jet-listing-dynamic-link__label">How to Get Started On Any Programming Exercise</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/how-to-get-started-on-a-programming-exercise/"><span class="jet-listing-dynamic-link__label">Read More</span></a>,
 <a class="jet-listing-dynamic-link__link" href="https://codeup.com/career-in-data-science/"><span class="jet-listing-dynamic-link__label">The Best Path to a Career in Data Science</span></a>,
 <a class="j

In [70]:
urls = pd.Series(urls)

In [71]:
urls = urls.iloc[58:257].unique()

In [74]:
links = soup.find_all('a', class_='jet-listing-dynamic-link__link')

In [75]:
urls = []
for link in links:

    # Add the link to my urls list
    urls.append(link['href'])

In [76]:
urls

['https://codeup.com/bootcamp-to-bootcamp/',
 'https://codeup.com/bootcamp-to-bootcamp/',
 'https://codeup.com/how-to-get-started-on-a-programming-exercise/',
 'https://codeup.com/how-to-get-started-on-a-programming-exercise/',
 'https://codeup.com/career-in-data-science/',
 'https://codeup.com/career-in-data-science/',
 'https://codeup.com/getting-hired-in-a-remote-environment/',
 'https://codeup.com/getting-hired-in-a-remote-environment/',
 'https://codeup.com/codeup-remote-students/',
 'https://codeup.com/codeup-remote-students/',
 'https://codeup.com/covid-relief/',
 'https://codeup.com/covid-relief/',
 'https://codeup.com/discovering-my-passion-through-codeup/',
 'https://codeup.com/discovering-my-passion-through-codeup/',
 'https://codeup.com/covid-19/',
 'https://codeup.com/covid-19/',
 'https://codeup.com/15-tips-for-virtual-interview-and-meetings/',
 'https://codeup.com/15-tips-for-virtual-interview-and-meetings/',
 'https://codeup.com/setting-myself-up-for-success-at-codeup/'
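
Each post appears twice in the list because the title link and its "Read More" link share the same class. A minimal way to drop the repeats while keeping the original order is sketched below; `sample_urls` is a short stand-in for the full list.

```python
sample_urls = [
    'https://codeup.com/bootcamp-to-bootcamp/',
    'https://codeup.com/bootcamp-to-bootcamp/',
    'https://codeup.com/how-to-get-started-on-a-programming-exercise/',
    'https://codeup.com/how-to-get-started-on-a-programming-exercise/',
    'https://codeup.com/career-in-data-science/',
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_urls = list(dict.fromkeys(sample_urls))
print(len(unique_urls))  # 3
```

Unlike `set(sample_urls)`, this keeps the posts in the order they appear on the page.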

In [77]:
def get_all_urls():
    '''
    This function scrapes all of the Codeup blog urls from
    the main Codeup blog page and returns a list of urls.
    '''
    # The main Codeup blog page with all the urls
    url = 'https://codeup.com/resources/#blog'
    
    headers = {'User-Agent': 'Codeup Data Science'} 
    
    # Send request to main page and get response
    response = get(url, headers=headers)
    
    # Create soup object using response
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Create empty list to hold the urls for all blogs
    urls = []
    
    # Create a list of the element tags that hold the href/links
    link_list = soup.find_all('a', class_='jet-listing-dynamic-link__link')
    
    # get the href/link from each element tag in my list
    for link in link_list:
        
        # Add the link to my urls list
        urls.append(link['href'])
        
    return urls

In [78]:
def get_blog_articles(urls, cache=False):
    '''
    This function takes in a list of Codeup blog urls and a cache flag.
    With the default cache == False, it reads big_blogs.csv and returns it as a df
    (the file must already exist from an earlier cache == True run).
    If cache == True, the function scrapes the title and text for each url, creates
    a list of dictionaries with the title and text for each blog, converts the list
    to a df, writes the df to big_blogs.csv, and returns the df.
    '''

    if cache == False:
        df = pd.read_csv('big_blogs.csv', index_col=0)
    else:
        headers = {'User-Agent': 'Codeup Data Science'} 

        # Create an empty list to hold dictionaries
        articles = []

        # Loop through each url in our list of urls
        for url in urls:

            # get request to each url saved in response
            response = get(url, headers=headers)

            # Create soup object from response text and parse
            soup = BeautifulSoup(response.text, 'html.parser')

            # Save the title of each blog in variable title
            title = soup.find('h1', itemprop='headline' ).text

            # Save the text in each blog to variable text
            text = soup.find('div', itemprop='text').text

            # Create a dictionary holding the title and text for each blog
            article = {'title': title, 'content': text}

            # Add each dictionary to the articles list of dictionaries
            articles.append(article)
        # convert our list of dictionaries to a df
        df = pd.DataFrame(articles)

        # Write df to csv file for faster access
        df.to_csv('big_blogs.csv')
    
    return df
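
The `cache` flag above reads the saved file when it is False, which can be surprising. As an alternative, here is a sketch that checks whether the file exists instead. It assumes a hypothetical helper named `load_or_scrape`, takes the scraper as a function argument, and stores JSON rather than CSV so it only needs the standard library.

```python
import json
import os

def load_or_scrape(path, scrape, refresh=False):
    """Return cached articles from path, or call scrape() and cache the result.

    scrape is any zero-argument function returning a list of dictionaries.
    """
    # Use the saved copy unless it is missing or a refresh was requested
    if os.path.exists(path) and not refresh:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    articles = scrape()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f)
    return articles
```

The first call scrapes and writes the file; later calls read the file unless `refresh=True` is passed.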

In [79]:
urls = get_all_urls()

In [23]:
def get_blog_articles_bonus(urls):
    websites = []
    for i in urls:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
        response = get(i, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.text
        text = soup.find('div', class_='jupiterx-post-content clearfix').text
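        # Some pages have no date element, so guard before indexing
        date_tags = soup.select("header > ul > li.jupiterx-post-meta-date.list-inline-item > time")
        date = date_tags[0]["datetime"] if date_tags else None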
        websites.append({"title": title, "content": text, "date_published": date})
    return websites
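
`soup.find` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` for pages that use a different layout. A small guard, sketched here as a hypothetical helper named `safe_text`, avoids stopping the whole loop on one odd page.

```python
from types import SimpleNamespace

def safe_text(tag, default=None):
    """Return tag.text stripped, or default when the lookup found nothing."""
    return tag.text.strip() if tag is not None else default

# SimpleNamespace stands in for a BeautifulSoup Tag so the sketch runs offline
found = SimpleNamespace(text="  Some article text  ")
print(safe_text(found))     # Some article text
print(safe_text(None, ""))  # prints an empty line
```

In the scraping loop this would wrap the lookup, for example `safe_text(soup.find('div', class_='jupiterx-post-content clearfix'))`.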

In [72]:
websites = get_blog_articles_bonus(urls)

In [73]:
pprint.pprint(websites)

[{'content': '100% Tuition Refund Policy_Codeup is dedicated in helping you '
             'transition into a career you love. That’s why we have our '
             'tuition refund policy in place! If you do not find a career '
             'in-field within 6 months of graduating our program, you will '
             'receive a 100% tuition refund. Read more on our policy '
             'below!What_Codeup will refund 100% of received tuition to '
             'graduates who do not receive an offer of “in-field” employment '
             '(as defined by the Texas Workforce Commission) within 6 months '
             'of graduation. To qualify, graduates must meet and maintain '
             'eligibility as outlined below. Refunds will be issued to the '
             'parties responsible for tuition funds, including individuals, '
             'loan partners, grant partners, and the VA. Refunds exclude '
             'associates interest fees and costs.Who_Graduates are eligible '
        

             'really where we shine! Zach and Ryan have been teaching at '
             'Codeup for 4 years in our web development program and have '
             'developed special depth of expertise in programming and teaching '
             'programming. They each have experience across OOP in Java, PHP, '
             'and Python, as well as database experience in SQL and web '
             'framework expertise in Laravel and Spring. They each have '
             'experience in both object oriented and functional programming, '
             'and experience building database-backed full-stack web '
             'applications in numerous languages (including Python, Java, '
             'Javascript, and Clojure) and web frameworks. Maggie has been '
             'working in R and Python for years in her different data science '
             'roles. When it comes to programming, these 3 can not only build '
             'you a Machine Learning model, but they can build a full-stack '


             'wasted, I have a fulfilling professional career with exciting '
             'possibilities for in-field advancement.” – Matthew CapperProject '
             'QUEST is a San Antonio based non-profit dedicated to workforce '
             'development and building our city’s future economy. They provide '
             'grant funding for educational and training programs, as well as '
             'career coaching and support resources (like utilities, '
             'childcare, and transportation). Over the last 4 years, Project '
             'QUEST and Codeup have collaborated to help dozens of students '
             'change careers.“Assistance from Project QUEST was invaluable and '
             'empowering.\xa0They offer incredible support for career '
             'training.” – Jesse RuizIn 2019 alone, Project QUEST has helped '
             'Codeup train 23 Software Developers and Data Scientists. Those '
             'students received grant funding (i.e. free money

             'would be the place for me, and it was going to be where I knew '
             'my life would change. I was going to be walking out of one door '
             'and into another. And I wasn’t going to look back.When I found '
             'out that I had got in, I was so happy that my hard work paid off '
             'and I knew it was just the beginning of what was to come. I '
             'looked forward to the start date and marked it on my calendar. I '
             'also quit my job, which was much needed because salon life was '
             'stressful enough, let alone learning a new skill.The next few '
             'months were filled with triumphs and failures, which is normal '
             'in any career. No one is born good at everything, it takes '
             'practice and determination (I totally had to repeat that to '
             'myself every single day at Codeup. Thanks Ryan!). Sometimes it '
             'felt like two steps back, but I was determin

In [27]:
for i in websites:
    print(i["title"])

From Bootcamp to Bootcamp: Veterans Transitioning into Tech
How to Get Started On Any Programming Exercise - Codeup
The Best Path to a Career in Data Science - Codeup
Getting Hired in a Remote Environment - Codeup
The Remote Codeup Student Experience - Codeup
COVID-19 Relief Scholarship | Codeup Scholarships
Discovering My Passion Through Codeup - Codeup
How To Launch Your New Career With Codeup During COVID-19 - Codeup
15 Tips on How to Prepare For Virtual Interviews and Meetings - Codeup
Setting Myself Up For Success at Codeup - Codeup
Landing My Dream Job Through A Web Development Course - Codeup
How To Have A Second Career Start With Codeup - Codeup
2019: A Codeup Year In Review - Codeup
How To Pick A Coding Bootcamp Curriculum - Codeup
Your Investment Towards Your Future With Codeup - Codeup
The Best Path To A Career In Software Development - Codeup
Financial Aid Options For Your Investment - Codeup
Hey Dallas, Meet Your Software Development Mentors! - Codeup
Hey San Antonio, Meet

## 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment


The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [28]:
# We will test the code by using the first link
url = "https://inshorts.com/en/read/business"

### Getting the content from the specific cards

In [29]:
url = "https://inshorts.com/en/news/lack-of-elite-umpires-an-issue-kumble-on-reason-behind-extra-review-in-tests-1590514459399"

In [30]:
headers = {'User-Agent': 'Codeup Bayes Data Science'} # same custom user-agent we used for codeup.com
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [31]:
title = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-title.news-right-box > a > span")[0].text

In [32]:
title

'Lack of elite umpires an issue: Kumble on reason behind extra review in Tests'

In [33]:
body = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-content.news-right-box > div:nth-child(1)")[0].text

In [34]:
author = soup.find("span", class_ = "author").text
author

'Anmol Sharma'

In [35]:
date_published = soup.find("span", class_="time")["content"]

### Getting all the URLs on the topic page

In [36]:
url = "https://inshorts.com/en/read/business"

In [37]:
headers = {'User-Agent': 'Codeup Bayes Data Science'} # same custom user-agent we used for codeup.com
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [38]:
# `find` expects a tag name, not a CSS selector; `select_one` takes the selector
soup.select_one("body > div.container > div > div.card-stack")

In [39]:
urls = []
for link in soup.find_all("a", href=True):
    urls.append(link["href"])

In [40]:
lines = pd.Series(urls)

In [41]:
urls = lines[lines.str.contains(r"^/en/news")].tolist()

In [42]:
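new_urls = []
for i in urls:
    # NOTE: the hrefs are site-relative paths, so prepending only the scheme
    # leaves the host out; get_url uses "https://inshorts.com" + i instead
    new_urls.append("http://" + i)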

In [43]:
new_urls

['http:///en/news/twitter-ceo-donates-$10m-to-project-giving-$1000-cash-to-covid19-hit-families-1590570982863',
 'http:///en/news/us-firm-buys-serum-institute-parents-czech-unit-to-make-covid19-vaccine-1590641548030',
 'http:///en/news/25yearold-anant-ambani-joins-$65-billion-jio-platforms-as-director-1590558778462',
 'http:///en/news/google-in-talks-to-buy-5-stake-in-vodafone-idea-reports-1590665747153',
 'http:///en/news/microsoft-in-talks-to-buy-25-stake-in-jio-for-$2-billion-report-1590656462690',
 'http:///en/news/white-woman-calls-police-over-black-man-in-us-franklin-templeton-fires-her-1590558726265',
 'http:///en/news/by-2025-only-25-of-our-staff-will-work-from-office-at-any-point-of-time-tcs-1590577247227',
 'http:///en/news/abu-dhabi-state-fund-in-talks-to-invest-$1-billion-in-jio-platforms-report-1590667027248',
 'http:///en/news/kents-atta-maker-ad-says-maids-hands-may-be-infected-company-apologises-1590660739781',
 'http:///en/news/ge-to-sell-its-129yearold-lightbulb-busin
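
The URLs above are missing the host because the hrefs on the page are site-relative paths. One way to build absolute URLs without hard-coding the host is `urllib.parse.urljoin`, sketched here with the topic page as the base and one of the paths from the output.

```python
from urllib.parse import urljoin

base = "https://inshorts.com/en/read/business"
path = "/en/news/google-in-talks-to-buy-5-stake-in-vodafone-idea-reports-1590665747153"

# A path starting with "/" replaces the base path but keeps scheme and host
full_url = urljoin(base, path)
print(full_url)
# https://inshorts.com/en/news/google-in-talks-to-buy-5-stake-in-vodafone-idea-reports-1590665747153
```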

In [44]:
def get_url(topic):
    url = f"https://inshorts.com/en/read/{topic}"
    headers = {'User-Agent': 'Codeup Bayes Data Science'} # same custom user-agent we used for codeup.com
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find all links within that topic
    urls = [link["href"] for link in soup.find_all("a", href=True)]
    # Keep only the site-relative article links and make them absolute
    new_urls = ["https://inshorts.com" + u for u in urls if u.startswith("/en/news")]
    return new_urls

def get_article_info(new_urls, topic):
    news = []
    for new_url in new_urls:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} # same custom user-agent we used for codeup.com
        response = get(new_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-title.news-right-box > a > span")[0].text
        body = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-content.news-right-box > div:nth-child(1)")[0].text
        author = soup.find("span", class_ = "author").text
        date_published = soup.find("span", class_="time")["content"]
        news.append({"title": title, "author": author, "topic": topic, "article": body, "date_published": date_published, "page_url": new_url})
    return news
    
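def get_news_articles(topics=None):
    # Avoid a mutable default argument; no topics means nothing to scrape
    topics = topics or []
    all_news = []
    for topic in topics:
        new_urls = get_url(topic)
        # extend keeps all_news a flat list of dictionaries
        all_news.extend(get_article_info(new_urls, topic))
    return all_news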
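
The dictionaries built above use the keys `article` and `topic`, while the exercise asks for `content` and `category`. A possible conversion is sketched below; `to_requested_shape` is a hypothetical helper and the record is made-up sample data.

```python
def to_requested_shape(item):
    """Map one scraped record to the title/content/category shape."""
    return {
        "title": item["title"],
        "content": item["article"],
        "category": item["topic"],
    }

# Made-up record with the key names used by get_article_info
record = {
    "title": "The article title",
    "author": "An Author",
    "topic": "business",
    "article": "The article content",
}
print(to_requested_shape(record))
# {'title': 'The article title', 'content': 'The article content', 'category': 'business'}
```

Applied to the whole result, this would be `[to_requested_shape(item) for item in news]`.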


In [45]:
news = get_news_articles(topics = ["business", "sports", "technology", "entertainment"])

In [46]:
df = pd.DataFrame(news)

In [47]:
df.groupby(["author", "topic"])[["topic"]].count()

| author | topic | topic (count) |
| --- | --- | --- |
| Aishwarya | business | 2 |
| Aishwarya | technology | 13 |
| Ankur Taliyan | sports | 14 |
| Anmol Sharma | sports | 11 |
| Anushka Dixit | business | 12 |
| Anushka Dixit | technology | 5 |
| Apaar Sharma | entertainment | 1 |
| Atul Mishra | entertainment | 12 |
| Daisy Mowke | entertainment | 2 |
| Kiran Khatri | business | 3 |
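
The same kind of count can be taken straight from the list of dictionaries without pandas. This is a small sketch with `collections.Counter`; `sample_news` is made-up data that only reuses the `author` and `topic` key names.

```python
from collections import Counter

# Made-up records with the same keys that get_article_info produces
sample_news = [
    {"author": "A", "topic": "business"},
    {"author": "A", "topic": "business"},
    {"author": "B", "topic": "sports"},
]

# Count how many articles each (author, topic) pair has
counts = Counter((item["author"], item["topic"]) for item in sample_news)
print(counts[("A", "business")])  # 2
```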


In [48]:
df.to_csv("inshort_cards.csv")