# Acquire Exercises

__1) Codeup Blog Articles__

Scrape the article text from the following pages:

* https://codeup.com/data-science/codeups-data-science-career-accelerator-is-here/
* https://codeup.com/data-science/data-science-myths/
* https://codeup.com/data-science/data-science-vs-data-analytics-whats-the-difference/
* https://codeup.com/data-science/10-tips-to-crush-it-at-the-sa-tech-job-fair/
* https://codeup.com/data-science/competitor-bootcamps-are-closing-is-the-model-in-danger/

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

    {
        'title': 'the title of the article',
        'content': 'the full text content of the article'
    }

Plus any additional properties you think might be helpful.
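A small check like the one below can confirm that each scraped dictionary has the required shape before it is used downstream. This is only a sketch: `has_required_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of the exercise.

```python
# Hypothetical helper: verify that a scraped article dict has the required shape.
REQUIRED_KEYS = ('title', 'content')

def has_required_keys(article, required=REQUIRED_KEYS):
    """Return True if every required key is present and holds a non-empty string."""
    return all(
        isinstance(article.get(key), str) and article[key].strip() != ''
        for key in required
    )

# Extra properties such as 'date' are allowed alongside the required ones
sample = {
    'title': 'the title of the article',
    'content': 'the full text content of the article',
    'date': 'Sep 30, 2018',
}

print(has_required_keys(sample))
print(has_required_keys({'title': 'missing content'}))
```

Running a check like this over the whole list is a cheap way to notice a page whose markup differs from the others.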

__Bonus:__

Scrape the text of all the articles linked on Codeup's blog page.
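One possible approach to the bonus is to collect the article links from the blog index page first and then hand them to the scraping function. The sketch below shows only the link-collection step, run against made-up HTML; `extract_article_links`, the sample markup, and the URL prefix used for filtering are all assumptions, since the real structure of the blog page is not shown here.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_article_links(html, base_url, prefix):
    """Return the unique absolute links in html that start with prefix, in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for anchor in soup.find_all('a', href=True):
        # Resolve relative hrefs against the page they came from
        link = urljoin(base_url, anchor['href'])
        if link.startswith(prefix) and link not in links:
            links.append(link)
    return links

# Made-up HTML standing in for a blog index page (not the real Codeup markup)
sample_html = """
<div>
  <a href="/data-science/data-science-myths/">Myths</a>
  <a href="https://codeup.com/data-science/data-science-myths/">Read more</a>
  <a href="/about/">About</a>
</div>
"""

links = extract_article_links(
    sample_html,
    base_url='https://codeup.com/',
    prefix='https://codeup.com/data-science/',
)
print(links)
```

The resulting list has the same form as the list of urls used in the main exercise, so it could be passed to the same scraping function.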

In [1]:
import numpy as np
import pandas as pd
from requests import get
from bs4 import BeautifulSoup

In [12]:
#For the first article
url = 'https://codeup.com/data-science/codeups-data-science-career-accelerator-is-here/'

In [14]:
response = get(url, headers={'user-agent': 'Codeup DS Germain'})

In [17]:
response

<Response [200]>

In [18]:
soup = BeautifulSoup(response.content, 'html.parser')

In [24]:
#Find the title of the article
title = soup.find('h1', class_ = 'entry-title').text
title

'Codeup’s Data Science Career Accelerator is Here!'

In [30]:
#Find the article content
content = soup.find('div', class_ = 'et_pb_post_content').text.strip().replace('\n', ' ').replace('\xa0', ' ')
content

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America. Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry. Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students will work with rea
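The chained `replace` calls above turn each newline into a single space, so consecutive newlines can leave runs of spaces in the result. A possible alternative, sketched here with a hypothetical helper named `normalize_whitespace`, is to split on whitespace and rejoin with single spaces.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, non-breaking spaces) into one space."""
    # str.split() with no arguments splits on any Unicode whitespace, including '\xa0'
    return ' '.join(text.split())

raw = '\n\tThe rumors are true!\xa0\xa0The time has arrived.\n\n'
print(repr(normalize_whitespace(raw)))
```

This also strips leading and trailing whitespace, so a separate `strip()` call is not needed.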

In [33]:
#Get publication date
date = soup.find('span', class_ = 'published').text
date

'Sep 30, 2018'
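The date is scraped as plain text. If it needs to be sorted or compared later, it could be parsed into a date object. The sketch below assumes every article uses the same `'Sep 30, 2018'` style seen in this output; `parse_published_date` is a hypothetical helper name.

```python
from datetime import datetime

def parse_published_date(text):
    """Convert a string like 'Sep 30, 2018' into a datetime.date."""
    # %b is the abbreviated month name; this assumes English month names
    return datetime.strptime(text.strip(), '%b %d, %Y').date()

parsed = parse_published_date('Sep 30, 2018')
print(parsed.isoformat())
```

`strptime` raises a `ValueError` for a string in any other format, which makes an unexpected date layout easy to spot.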

In [35]:
#Get category
category = soup.find('a', rel = 'category tag').text
category

'Data Science'

In [43]:
#Since these blog articles share a similar structure, create a function that loops through
#each url and grabs the appropriate info
def get_blog_articles(urls):
    """
    This function takes in a list of urls for Codeup blog articles. It loops through each
    of them, gathers the article title, publication date, category, and content, puts them
    into a dictionary, and finally returns the list of dictionaries.
    """
    
    #Create the empty list to append the article dicts to
    articles = []
    
    for url in urls:
        response = get(url, headers = {'user-agent':'Codeup DS Germain'})
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        article = {
            'title': soup.find('h1', class_ = 'entry-title').text,
            'date': soup.find('span', class_ = 'published').text,
            'category': soup.find('a', rel = 'category tag').text,
            'content': soup.find('div', class_ = 'et_pb_post_content').text.strip().replace('\n', ' ').replace('\xa0', ' ')
        }
        
        articles.append(article)
        
    #Convert the list to dataframe
    #articles = pd.DataFrame(articles)
    
    return articles
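`soup.find` returns `None` when no element matches, so calling `.text` on the result raises an `AttributeError` if any page lacks one of these elements. A minimal sketch of one way to guard against that is shown below; `text_or_default` is a hypothetical helper and the sample HTML is made up for the demonstration.

```python
from bs4 import BeautifulSoup

def text_or_default(element, default=None):
    """Return the stripped text of a BeautifulSoup element, or default if it is None."""
    if element is None:
        return default
    return element.get_text(strip=True)

# Made-up HTML that has a title but no publication date
sample_html = '<h1 class="entry-title"> A title </h1>'
soup = BeautifulSoup(sample_html, 'html.parser')

title = text_or_default(soup.find('h1', class_='entry-title'))
date = text_or_default(soup.find('span', class_='published'), default='unknown')
print(title, date)
```

Wrapping each lookup this way lets the loop continue past a page with a missing field instead of stopping partway through the list of urls.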

In [40]:
#Create the list of urls
urls = ['https://codeup.com/data-science/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science/data-science-myths/',
        'https://codeup.com/data-science/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/data-science/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/data-science/competitor-bootcamps-are-closing-is-the-model-in-danger/'
       ]

In [44]:
#Get article data
articles = get_blog_articles(urls)

In [45]:
articles

[{'title': 'Codeup’s Data Science Career Accelerator is Here!',
  'date': 'Sep 30, 2018',
  'category': 'Data Science',
  'content': 'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in Glassdoor’s #1 Best Job in America. Data Science is a method of providing actionable intelligence from data. The data revolution has hit San Antonio, resulting in an explosion in Data Scientist positions across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen UTSA invest $70 M for a Cybersecurity Center and School of Data Science. We built a program to specifically meet the growing demands of this industry. Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked

__2) News Articles__

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

* Business
* Sports
* Technology
* Entertainment

The end product of this should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

    {
        'title': 'The article title',
        'content': 'The article content',
        'category': 'business' # for example
    }

The articles themselves do not appear to contain a label for the category they belong to, so I will use the category-specific links that the website provides.
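Because the category comes from the link rather than from the article, it can help to keep each label paired with its url. A minimal sketch, where `BASE_URL` and `category_urls` are names chosen for illustration:

```python
BASE_URL = 'https://inshorts.com/en/read/'
categories = ['business', 'sports', 'technology', 'entertainment']

# Map each category label to the page it is scraped from
category_urls = {category: BASE_URL + category for category in categories}

for category, url in category_urls.items():
    print(category, url)
```

With this mapping, the label that ends up in each article dictionary is always the one that produced the url.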

In [46]:
#Create a list of category urls
urls = ['https://inshorts.com/en/read/business',
       'https://inshorts.com/en/read/sports',
       'https://inshorts.com/en/read/technology',
       'https://inshorts.com/en/read/entertainment'
       ]

In [47]:
#Gather the information for a single article from a single category
url = 'https://inshorts.com/en/read/entertainment'

In [86]:
response = get(url, headers = {'user-agent': 'Codeup DS Germain'})

In [87]:
response

<Response [200]>

In [88]:
soup = BeautifulSoup(response.content, 'html.parser')

In [89]:
#Find the container that holds all of the article cards on the page
news_stack = soup.find('div', class_ = 'card-stack')

In [90]:
articles = news_stack.find_all('div', class_ = 'news-card')

In [91]:
article = articles[0]

In [92]:
#Find the title of the first article on the page
title = article.find('span', itemprop = 'headline').text
title

'Aishwaryaa shares pic of father Rajinikanth, husband Dhanush with National Awards medals'

In [93]:
#Find the author of the first article
author = article.find('span', class_ = 'author').text
author

'Daisy Mowke'

In [99]:
#Find the publication date
#(the keyword is clas, not class_: the lookup succeeds with it, so it matches the page's markup)
date = article.find('span', clas = 'date').text.split(',')[0]
date

'26 Oct 2021'

In [104]:
#Find the article content
content = article.find('div', itemprop = 'articleBody').text
content

'Following the 67th National Film Awards on Monday, Aishwaryaa R Dhanush shared a picture of her father Rajinikanth and husband Dhanush with their medals. "They are mine...and this is history. #prouddaughter #proudwife," she wrote. While Dhanush received the National Film Award in Best Actor category for \'Asuran\', Rajinikanth was bestowed with the Dadasaheb Phalke Award. '

In [105]:
#Now build a function to get the data from the entire page
def get_page_data(category):
    #Build the url
    url = 'https://inshorts.com/en/read/' + category
    
    #Get the html
    response = get(url, headers = {'user-agent':'Codeup DS Germain'})
    soup = BeautifulSoup(response.content, 'html.parser')
    
    #Access the news_stack that contains the articles
    news_stack = soup.find('div', class_ = 'card-stack')
    
    #Access the articles inside the news_stack
    articles = news_stack.find_all('div', class_ = 'news-card')
    
    #Create an empty list to store the article info
    article_info = []
    
    #Loop through each article and gather the info
    for article in articles:
        article_dict = {
            'title': article.find('span', itemprop = 'headline').text,
            'author': article.find('span', class_ = 'author').text,
            'date': article.find('span', clas = 'date').text.split(',')[0],
            'category': category,
            'content': article.find('div', itemprop = 'articleBody').text
        }
        
        article_info.append(article_dict)
        
    return article_info

Now that I have a function to find the articles in a single page (which is a single category), I will build a function that gathers the info for each category I need.

In [108]:
def get_news_articles(categories):
    """
    This function takes in a list of categories. It will loop through the list, gather the article
    info from each category, add it to a list of dicts, and return a list of dictionaries of all
    articles in those categories.
    """
    #First, create the empty list of article dicts
    article_info = []
    
    #Loop through the list of categories
    for cat in categories:
        #Add every article dict from this category to article_info
        article_info.extend(get_page_data(cat))
    
    return article_info
    

In [109]:
#Testing 
#Create list of cats
categories = ['business', 'sports', 'technology', 'entertainment']

In [110]:
news_articles = get_news_articles(categories)

In [111]:
len(news_articles)

100
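The total alone does not show whether every category contributed articles. A quick per-category tally could be sketched as follows; `count_by_category` is a hypothetical helper and the sample list is made up, so the counts printed here say nothing about the real scrape.

```python
from collections import Counter

def count_by_category(articles):
    """Count how many article dicts belong to each category."""
    return Counter(article['category'] for article in articles)

# Made-up sample data with the same keys the scraper produces
sample_articles = [
    {'title': 'a', 'category': 'business'},
    {'title': 'b', 'category': 'business'},
    {'title': 'c', 'category': 'sports'},
]

counts = count_by_category(sample_articles)
print(counts)
```

A category that is missing from the tally would point to a page that returned no article cards.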

In [113]:
pd.DataFrame(news_articles)

| | title | author | date | category | content |
|---|---|---|---|---|---|
| 0 | India's Covaxin may get WHO approval in next 2... | Kiran Khatri | 26 Oct 2021 | business | A WHO technical advisory group which met on Tu... |
| 1 | I decided to support Doge as it felt like the ... | Pragya Swastik | 26 Oct 2021 | business | Tesla CEO and the world's richest person Elon ... |
| 2 | Elon Musk tweets 'Wild $T1mes' after Tesla hit... | Pragya Swastik | 26 Oct 2021 | business | Tesla CEO and the world's richest person Elon ... |
| 3 | Which companies have $1 trillion or more marke... | Pragya Swastik | 26 Oct 2021 | business | Tesla has become the latest company to surpass... |
| 4 | How  many years did it take for various compan... | Pragya Swastik | 26 Oct 2021 | business | Tesla took 18 years to hit the $1-trillion m-c... |
| ... | ... | ... | ... | ... | ... |
| 95 | I've only made 5 really good films in my caree... | Kriti Kambiri | 26 Oct 2021 | entertainment | Actress Kristen Stewart has said that she thin... |
| 96 | Jr NTR's fan injured in accident, actor helps ... | Kriti Kambiri | 26 Oct 2021 | entertainment | A fan of Telugu actor Jr NTR was injured in a ... |
| 97 | Wanted to be part of this family: Angelina on ... | Amartya Sharma | 26 Oct 2021 | entertainment | Angelina Jolie has said that there were "many ... |
| 98 | It was 1 story for me: Kabir on not directing ... | Udit Gupta | 26 Oct 2021 | entertainment | Filmmaker Kabir Khan spoke about why he didn't... |
