# Acquire Exercises

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may split your work into a separate file for each website (e.g. acquire_codeup_blog.py and acquire_news_articles.py), but the final functions must still be available from acquire.py (that is, acquire.py should import get_blog_articles from the acquire_codeup_blog module).

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os
import re
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import pprint

## 1. Codeup Blog Articles

Scrape the article text from the following pages:

* https://codeup.com/codeups-data-science-career-accelerator-is-here/
* https://codeup.com/data-science-myths/
* https://codeup.com/data-science-vs-data-analytics-whats-the-difference/
* https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/
* https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```
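
Before building the scraper, it can help to have a quick check that each record matches this shape. The sketch below is only an illustration: REQUIRED_KEYS and has_article_shape are hypothetical names, not part of the exercise.

```python
# Hypothetical helper: check that a scraped record has the expected shape.
REQUIRED_KEYS = {"title", "content"}

def has_article_shape(record):
    """Return True if record is a dict with a string value for every required key."""
    if not isinstance(record, dict):
        return False
    return all(isinstance(record.get(key), str) for key in REQUIRED_KEYS)

example = {"title": "the title of the article",
           "content": "the full text content of the article"}
print(has_article_shape(example))
print(has_article_shape({"title": "missing content"}))
```

Extra keys, such as a publication date, do not fail the check; only the two required keys are inspected.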

In [2]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/', 
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'
       ]

In [3]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

In [5]:
title = soup.title.text
title

'Codeup’s Data Science Career Accelerator is Here! - Codeup'

In [6]:
# see also `soup.find_all`
#
# beautiful soup uses `class_` as the keyword argument for searching
# for a class because `class` is a reserved word in python
# we'll use the class name that we identified from looking in the inspector in chrome
article = soup.find('div', class_='jupiterx-post-content clearfix')
text = article.text

In [7]:
date = soup.select("header > ul > li.jupiterx-post-meta-date.list-inline-item > time")[0]["datetime"]

In [8]:
date

'2018-09-30T05:26:22+00:00'
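
The datetime attribute is an ISO 8601 string, so it can be turned into a datetime object if you later want to sort or filter articles by date. A minimal sketch using only the standard library; the variable name published is illustrative.

```python
from datetime import datetime, timedelta

# The string scraped from the <time> tag's datetime attribute above
date = '2018-09-30T05:26:22+00:00'

# fromisoformat (Python 3.7+) understands the trailing +00:00 UTC offset
published = datetime.fromisoformat(date)

print(published.year, published.month, published.day)
print(published.utcoffset() == timedelta(0))
```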

In [9]:
websites = []

In [10]:
websites.append({"title": title, "content": text})

In [11]:
websites

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rac
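
The content above still contains non-breaking spaces (shown as \xa0). One possible cleanup, sketched here with a hypothetical helper named normalize_whitespace, is to split on any whitespace and rejoin with single spaces. It does not restore the spaces missing between paragraphs (as in 'America.Data'), which were lost when the tags were stripped.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace, including non-breaking spaces, into one space."""
    # str.split() with no arguments splits on any Unicode whitespace
    return " ".join(text.split())

sample = 'land a job in\xa0Glassdoor’s #1 Best Job in America.'
print(normalize_whitespace(sample))
```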

In [12]:
def get_blog_articles():
    websites = []
    urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/', 
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/'
       ]
    for i in urls:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
        response = get(i, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.text
        text = soup.find('div', class_='jupiterx-post-content clearfix').text
        date = soup.select("header > ul > li.jupiterx-post-meta-date.list-inline-item > time")[0]["datetime"]
        websites.append({"title": title, "content": text, "date_published": date})
    return websites

In [13]:
websites = get_blog_articles()

In [14]:
websites

[{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rac
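
Because each call to get_blog_articles re-requests every page, you may want to cache the result on disk. The following is a sketch under assumptions, not part of the exercise: load_or_fetch, its fetch argument, and the file name are all hypothetical, and a stand-in function replaces the real scraper so the example runs without network access.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached records from path if it exists; otherwise call fetch and cache the result."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    records = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records

# Stand-in for get_blog_articles so the sketch runs offline
calls = []
def fake_fetch():
    calls.append(1)
    return [{"title": "t", "content": "c"}]

cache_path = os.path.join(tempfile.mkdtemp(), "blog_articles.json")
first = load_or_fetch(cache_path, fake_fetch)   # calls fake_fetch and writes the file
second = load_or_fetch(cache_path, fake_fetch)  # reads the file, no second fetch
print(first == second, len(calls))
```

In the notebook, the equivalent call would pass the real scraper, for example load_or_fetch('blog_articles.json', get_blog_articles).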

## Bonus:

Scrape the text of all the articles linked on Codeup's blog page:
https://codeup.com/resources/#blog

In [15]:
url = "https://codeup.com/resources/#blog"

In [16]:
headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [17]:
for link in soup.find_all("a", href=True):
    print(link["href"])

#jupiterx-primary
https://codeup.com
https://codeup.com/student-page/
https://codeup.com/wd-admissions/
https://codeup.com/ds-admissions/
https://codeup.com/events/
#
https://codeup.com/sanantonio/
/dallas/
https://codeup.com/employer-partner/
https://alumni.codeup.com/
https://codeup.com/resources/
/resources/#outcomes
/resources/#blog
https://codeup.com/financial-aid/
https://codeup.com/frequently-asked-questions/
https://codeup.com/about-codeup/
https://codeup.com/careers/
https://codeup.com/contact/
https://codeup.com/apply-now/
https://codeup.com/student-page/
https://codeup.com/wd-admissions/
https://codeup.com/ds-admissions/
https://codeup.com/events/
#
https://codeup.com/sanantonio/
/dallas/
https://codeup.com/employer-partner/
https://alumni.codeup.com/
https://codeup.com/resources/
/resources/#outcomes
/resources/#blog
https://codeup.com/financial-aid/
https://codeup.com/frequently-asked-questions/
https://codeup.com/about-codeup/
https://codeup.com/careers/
https://codeup.co
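
The printed hrefs mix absolute URLs, site-relative paths such as /dallas/, and bare fragments such as #. A minimal sketch of normalizing them with urljoin from the standard library follows; absolute_links is a hypothetical helper, and dropping fragment-only links is an assumption about what is useful here.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping links that are only a fragment."""
    resolved = []
    for href in hrefs:
        if href.startswith("#"):
            continue  # same-page anchors such as "#" or "#jupiterx-primary"
        resolved.append(urljoin(base_url, href))
    return resolved

sample = ["#jupiterx-primary", "https://codeup.com", "/dallas/", "#", "/resources/#blog"]
print(absolute_links("https://codeup.com/resources/", sample))
```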

In [18]:
urls = []
for link in soup.find_all("a", class_='jet-listing-dynamic-link__link'):
    urls.append(link)

In [19]:
urls = pd.Series(urls)

In [20]:
urls = urls.iloc[58:257].unique()

In [21]:
links = soup.find_all('a', class_='jet-listing-dynamic-link__link')

In [22]:
urls = []
for link in links:

    # Add the link to my urls list
    urls.append(link['href'])

In [23]:
def get_blog_articles_bonus(urls):
    websites = []
    for i in urls:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
        response = get(i, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.text
        text = soup.find('div', class_='jupiterx-post-content clearfix').text
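        # Unlike get_blog_articles, this keeps the full list of matching <time> tags
        # rather than the first tag's "datetime" string
        date = soup.select("header > ul > li.jupiterx-post-meta-date.list-inline-item > time")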
        websites.append({"title": title, "content": text, "date_published": date})
    return websites

In [25]:
websites = get_blog_articles_bonus(urls)

In [26]:
pprint.pprint(websites)

[{'content': 'Are you a veteran or active-duty military member considering '
             'your next steps? Our alumni have been in your boots. In a recent '
             'virtual panel, two vets discussed their transition into '
             'technology careers with Codeup: Benny Fields III, a retired Air '
             'Force Master Sergeant turned Full Stack Web Developer, and '
             'Jeffery Roeder, a Navy Intelligence Analyst turned Data '
             'Scientist. Whether you’re interested in Data Science or Web '
             'Development, here are some key takeaways from the event.\xa0Why '
             'Codeup?“The GI Bill was a huge plus, but the icing on the cake '
             'was the placement program.” – Benny FieldsAfter retiring from '
             'the Air Force, Benny Fields took a job as a technical writer, '
             'but he quickly became more interested in the software he was '
             'writing about than the writing itself. His friend suggested '

             'empowering life change and launching careers in technology.We’ve '
             'secured a space in the heart of downtown Dallas, where students '
             'will have access to vibrant city life and dozens of employers. '
             'While we await final state approval for an early program 2020 '
             'launch, join our mailing list to be the first to hear when we '
             'launch!',
  'date_published': [<time datetime="2019-09-27T01:41:23+00:00" itemprop="datePublished">September 27, 2019</time>],
  'title': 'Codeup Dallas 2020 - Codeup'},
 {'content': 'For some, it may be difficult to grasp how prevalent data '
             'science is in our world! I’ve written a post here that brings to '
             'life some of the ways in which you may see data science in your '
             'everyday routine. However, you may find yourself wondering '
             'exactly how far-reaching data science actually is, or how it '
             'actually impacts ce

             'job: You ‘did it live’ and hacked your way into a data science '
             'skillset.Universities: You studied Data Science, Analytics, '
             'Statistics, Programming, or Business formally in a university '
             'setting.MOOCs (Massive Open Online Courses): You learned through '
             'an online resource like Udemy or Codecademy.On-site or corporate '
             'training: You were trained by a learning & development '
             'department, internal academy, or contracted provider.Immersive '
             'programs/bootcamps: You went to coding bootcamp \xa0and learned '
             'Data Science through an immersive, hands-on career accelerator '
             '(like Codeup, perhaps?)Each of these pathways has unique '
             'advantages and disadvantages across variables like cost, formal '
             'credential, length, and pace. A free online program is free and '
             'accessible, but takes a lot of dedication to foll

             'programming languages under your belt the more value you will be '
             'able to provide during every stage of your career.#3: Am I '
             'gritty?Grit is a character trait that can make or break '
             'long-term success. Psychologist Angela Duckworth explains it as, '
             '“Passion and perseverance for very long-term goals. Grit is '
             'having stamina…sticking with your future, day in and day out.” '
             'When it comes to software development, having grit means '
             'learning from your failures. We’ve seen students struggle not '
             'because they can’t figure out the content, but because they are '
             'afraid of failing. Move past that fear! You will inevitably make '
             'mistakes, but our successful students are those that learn from '
             'their failure along the way.#4: Am I a problem-solver?At the end '
             'of the day, code is designed to solve a problem! 

In [27]:
for i in websites:
    print(i["title"])

From Bootcamp to Bootcamp: Veterans Transitioning into Tech
How to Get Started On Any Programming Exercise - Codeup
The Best Path to a Career in Data Science - Codeup
Getting Hired in a Remote Environment - Codeup
The Remote Codeup Student Experience - Codeup
COVID-19 Relief Scholarship | Codeup Scholarships
Discovering My Passion Through Codeup - Codeup
How To Launch Your New Career With Codeup During COVID-19 - Codeup
15 Tips on How to Prepare For Virtual Interviews and Meetings - Codeup
Setting Myself Up For Success at Codeup - Codeup
Landing My Dream Job Through A Web Development Course - Codeup
How To Have A Second Career Start With Codeup - Codeup
2019: A Codeup Year In Review - Codeup
How To Pick A Coding Bootcamp Curriculum - Codeup
Your Investment Towards Your Future With Codeup - Codeup
The Best Path To A Career In Software Development - Codeup
Financial Aid Options For Your Investment - Codeup
Hey Dallas, Meet Your Software Development Mentors! - Codeup
Hey San Antonio, Meet
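
Most of the titles end with the site suffix ' - Codeup' because they come from the page's <title> tag. If you only want the article title, one option is to strip that suffix. strip_site_suffix is a hypothetical helper, and titles that follow a different pattern (such as the '| Codeup Scholarships' one above) are deliberately left unchanged.

```python
def strip_site_suffix(title, suffix=" - Codeup"):
    """Remove the trailing site suffix from a page title, if present."""
    if title.endswith(suffix):
        return title[:-len(suffix)]
    return title

print(strip_site_suffix("Getting Hired in a Remote Environment - Codeup"))
print(strip_site_suffix("COVID-19 Relief Scholarship | Codeup Scholarships"))
```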

## 2. News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment


The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [28]:
# Test the code on the first topic page
url = "https://inshorts.com/en/read/business"

### Getting the content from the specific cards

In [29]:
url = "https://inshorts.com/en/news/lack-of-elite-umpires-an-issue-kumble-on-reason-behind-extra-review-in-tests-1590514459399"

In [30]:
headers = {'User-Agent': 'Codeup Bayes Data Science'} # send a custom user-agent instead of the requests default
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [31]:
title = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-title.news-right-box > a > span")[0].text

In [32]:
title

'Lack of elite umpires an issue: Kumble on reason behind extra review in Tests'

In [33]:
body = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-content.news-right-box > div:nth-child(1)")[0].text

In [34]:
author = soup.find("span", class_ = "author").text
author

'Anmol Sharma'

In [35]:
date_published = soup.find("span", class_="time")["content"]

### Getting all the URLs on the topic page

In [36]:
url = "https://inshorts.com/en/read/business"

In [37]:
headers = {'User-Agent': 'Codeup Bayes Data Science'} # send a custom user-agent instead of the requests default
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

In [38]:
# `find` expects a tag name; a CSS selector needs `select_one` (or `select`)
soup.select_one("body > div.container > div > div.card-stack")

In [39]:
urls = []
for link in soup.find_all("a", href=True):
    urls.append(link["href"])

In [40]:
lines = pd.Series(urls)

In [41]:
urls = lines[lines.str.contains(r"^/en/news")].tolist()

In [42]:
new_urls = []
for i in urls:
    # Prepend the scheme and host; the scraped hrefs are site-relative paths
    new_urls.append("https://inshorts.com" + i)

In [43]:
new_urls

['https://inshorts.com/en/news/twitter-ceo-donates-$10m-to-project-giving-$1000-cash-to-covid19-hit-families-1590570982863',
 'https://inshorts.com/en/news/us-firm-buys-serum-institute-parents-czech-unit-to-make-covid19-vaccine-1590641548030',
 'https://inshorts.com/en/news/25yearold-anant-ambani-joins-$65-billion-jio-platforms-as-director-1590558778462',
 'https://inshorts.com/en/news/google-in-talks-to-buy-5-stake-in-vodafone-idea-reports-1590665747153',
 'https://inshorts.com/en/news/microsoft-in-talks-to-buy-25-stake-in-jio-for-$2-billion-report-1590656462690',
 'https://inshorts.com/en/news/white-woman-calls-police-over-black-man-in-us-franklin-templeton-fires-her-1590558726265',
 'https://inshorts.com/en/news/by-2025-only-25-of-our-staff-will-work-from-office-at-any-point-of-time-tcs-1590577247227',
 'https://inshorts.com/en/news/abu-dhabi-state-fund-in-talks-to-invest-$1-billion-in-jio-platforms-report-1590667027248',
 'https://inshorts.com/en/news/kents-atta-maker-ad-says-maids-hands-may-be-infected-company-apologises-1590660739781',
 'https://inshorts.com/en/news/ge-to-sell-its-129yearold-lightbulb-busin
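
If a topic page links to the same article more than once (an assumption worth checking against the list above), the duplicates can be dropped while keeping the original order. unique_in_order is a hypothetical helper, and the sample paths are shortened placeholders, not real article URLs.

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict keys preserve insertion order (Python 3.7+), so order is kept
    return list(dict.fromkeys(items))

sample = ["/en/news/a-1", "/en/news/b-2", "/en/news/a-1"]
print(unique_in_order(sample))
```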

In [44]:
def get_url(topic):
    url = f"https://inshorts.com/en/read/{topic}"
    headers = {'User-Agent': 'Codeup Bayes Data Science'} # send a custom user-agent instead of the requests default
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Collect every link on the topic page
    urls = []
    for link in soup.find_all("a", href=True):
        urls.append(link["href"])
    # Keep only the article links, which start with /en/news
    # (filtering once, after the loop, instead of on every iteration)
    lines = pd.Series(urls, dtype="object")
    urls = lines[lines.str.contains(r"^/en/news")].tolist()
    # Prepend the host to turn each relative path into a full URL
    new_urls = []
    for i in urls:
        new_urls.append("https://inshorts.com" + i)
    return new_urls

def get_article_info(new_urls, topic):
    news = []
    for new_url in new_urls:
        headers = {'User-Agent': 'Codeup Bayes Data Science'} # send a custom user-agent instead of the requests default
        response = get(new_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-title.news-right-box > a > span")[0].text
        body = soup.select("body > div.container > div > div.card-stack > div > div > div.news-card-content.news-right-box > div:nth-child(1)")[0].text
        author = soup.find("span", class_ = "author").text
        date_published = soup.find("span", class_="time")["content"]
        news.append({"title": title, "author": author, "topic": topic, "article": body, "date_published": date_published, "page_url": new_url})
    return news
    
def get_news_articles(topics=()):
    all_news = []
    for topic in topics:
        new_urls = get_url(topic)
        # extend keeps all_news a flat list of article dictionaries
        all_news.extend(get_article_info(new_urls, topic))
    return all_news


In [45]:
news = get_news_articles(topics = ["business", "sports", "technology", "entertainment"])

In [46]:
df = pd.DataFrame(news)

In [47]:
df.groupby(["author", "topic"])[["topic"]].count()

| author | topic | topic (count) |
|---|---|---|
| Aishwarya | business | 2 |
| Aishwarya | technology | 13 |
| Ankur Taliyan | sports | 14 |
| Anmol Sharma | sports | 11 |
| Anushka Dixit | business | 12 |
| Anushka Dixit | technology | 5 |
| Apaar Sharma | entertainment | 1 |
| Atul Mishra | entertainment | 12 |
| Daisy Mowke | entertainment | 2 |
| Kiran Khatri | business | 3 |


In [48]:
df.to_csv("inshort_cards.csv")
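
The dictionaries built by get_article_info use the keys article and topic, while the exercise description asks for content and category. If you need the requested shape, a small renaming step is one option. This is a sketch: to_requested_shape is a hypothetical helper, and the sample record is made up for illustration.

```python
def to_requested_shape(record):
    """Return a copy of a scraped record using the key names from the exercise description."""
    return {
        "title": record["title"],
        "content": record["article"],
        "category": record["topic"],
    }

# Made-up record with the same keys that get_article_info produces
sample = {"title": "Example title", "author": "Example author", "topic": "business",
          "article": "Example body", "date_published": "2020-05-27",
          "page_url": "https://example.com"}
print(to_requested_shape(sample))
```

Applied to the full result, this would look like [to_requested_shape(r) for r in news].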