## Acquire Data through Web Scraping

#### Steps

1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
1. Assign the address of the web page to a variable named url.
1. Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
1. Print the response text to ensure you have an HTML page.
1. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
1. Use BeautifulSoup to parse the HTML into a variable ('soup').
1. Identify the key tags you need to extract the data you are looking for.
1. Create a dataframe of the data desired.
1. Run some summary stats and inspect the data to ensure you have what you wanted.
1. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
1. Create a corpus of the column with the text you want to analyze.
1. Store that corpus for use in a future notebook.
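
The steps above can be sketched end to end on a small inline HTML string, so nothing here touches the network. This is only a minimal sketch: the markup, the `article` wrapper tags, and the `corpus.json` file name are assumptions made up for illustration, and a real page will need its own selectors.

```python
import json
import os
import tempfile

import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for response.text; this markup is invented for illustration.
html = """
<html><body>
  <article><h2 class="entry-title">First post</h2>
    <div class="entry-content"><p>Hello scraping.</p></div></article>
  <article><h2 class="entry-title">Second post</h2>
    <div class="entry-content"><p>More text here.</p></div></article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# One dictionary per article, built from the tags identified while inspecting the page.
records = [
    {
        "title": article.select_one(".entry-title").get_text(strip=True),
        "content": article.select_one(".entry-content").get_text(" ", strip=True),
    }
    for article in soup.select("article")
]

df = pd.DataFrame(records)
print(df.describe(include="all"))  # quick summary to confirm the shape

# The corpus is the list of texts from the column to analyze.
corpus = df["content"].tolist()

# Store the corpus for a later notebook (hypothetical file name and location).
corpus_path = os.path.join(tempfile.gettempdir(), "corpus.json")
with open(corpus_path, "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False)
```

JSON is used here only because it round-trips a list of strings without extra dependencies; a CSV or pickle file would serve the same purpose.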

In [12]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import strftime
import os

1. Codeup Blog Articles

Visit Codeup's [Blog](https://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named get_blog_articles that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

```
{
    'title': 'the title of the article',
    'content': 'the full text content of the article'
}
```

Plus any additional properties you think might be helpful.

__Bonus:__ Scrape the text of __all__ the articles linked on [Codeup's blog page](https://codeup.com/blog/).

***

2. News Articles

We will now be scraping text data from [inshorts](https://inshorts.com/en/read), a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment       

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
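
The three-function structure described in the hints could look roughly like the sketch below. The sample markup and the class names `news-card`, `headline`, and `body` are invented for illustration and are not taken from the inshorts site; the `fetch_html` parameter is likewise a hypothetical hook so the example runs offline.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for one category page.
SAMPLE_PAGE = """
<div class="news-card">
  <span class="headline">Markets rally</span>
  <div class="body">Stocks rose on Monday.</div>
</div>
<div class="news-card">
  <span class="headline">Rates unchanged</span>
  <div class="body">The central bank held rates.</div>
</div>
"""


def parse_news_card(card, category):
    """Turn one article element into a dictionary."""
    return {
        "title": card.select_one(".headline").get_text(strip=True),
        "content": card.select_one(".body").get_text(strip=True),
        "category": category,
    }


def parse_news_page(html, category):
    """Find every article on one page and parse each one."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_news_card(card, category) for card in soup.select(".news-card")]


def get_news_articles(fetch_html,
                      categories=("business", "sports", "technology", "entertainment")):
    """Collect articles for every category; fetch_html(category) must return HTML text."""
    articles = []
    for category in categories:
        articles.extend(parse_news_page(fetch_html(category), category))
    return articles


# Offline demonstration: every category "fetches" the same sample page.
articles = get_news_articles(lambda category: SAMPLE_PAGE,
                             categories=("business", "sports"))
print(len(articles))
```

In real use, `fetch_html` would wrap a `get()` call for the category's page and return `response.text`. Keeping the fetch step separate from the parsing makes each parser easy to check against saved HTML.
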
***
3. Bonus: cache the data

Write your code so that the acquired data is saved locally in some form. Your functions that retrieve the data should prefer to read the local data instead of making all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (rewriting your local cache).
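
One way to approach the caching bonus is sketched below, using only the standard library. The names `get_articles_cached`, `scrape`, `cache_path`, and `fresh` are hypothetical, and the stand-in scraper exists only so the example runs without any network access.

```python
import json
import os
import tempfile


def get_articles_cached(scrape, cache_path, fresh=False):
    """Return cached articles if present, otherwise call scrape() and cache the result.

    scrape is any zero-argument function returning a list of dictionaries.
    """
    if not fresh and os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)

    articles = scrape()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles


# Demonstration with a stand-in scraper that counts how often it is called.
calls = {"count": 0}


def fake_scrape():
    calls["count"] += 1
    return [{"title": "Example", "content": "Example text"}]


cache_path = os.path.join(tempfile.mkdtemp(), "blog_articles.json")
first = get_articles_cached(fake_scrape, cache_path)              # scrapes and writes the cache
second = get_articles_cached(fake_scrape, cache_path)             # reads the cache
third = get_articles_cached(fake_scrape, cache_path, fresh=True)  # forces a re-scrape
```

Passing the scraper in as an argument keeps the cache logic independent of any one site, so the same wrapper could serve both the blog and the news functions.
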
***
***
***

1. Codeup Blog Articles

In [81]:
url = 'https://codeup.com/blog/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response = get(url, headers=headers)

In this instance, a header is necessary to avoid a 403 error.
- Headers are bits of metadata that accompany a request.
- The User-Agent header can be used to identify ourselves to the web server.
- We can include headers as part of our request with the `headers` keyword argument.

Codeup otherwise blocks the request unless this header, or something similar to it, is present.
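
To see what the `headers` argument actually attaches, the request can be built without being sent. This is a small sketch using `requests.Request` and its `prepare()` method; it makes no network call, and the variable name `prepared` is just for illustration.

```python
from requests import Request

url = "https://codeup.com/blog/"
headers = {"User-Agent": "Codeup Data Science"}

# Build the request without sending it, to inspect what would go over the wire.
prepared = Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

After a real request, calling `response.raise_for_status()` raises an exception on a 4xx or 5xx status, which surfaces a blocked request earlier than inspecting the HTML would.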

In [82]:
response

<Response [200]>

In [9]:
# Perform a sanity check to ensure HTML data is observed
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [83]:
# I'll take this one blog post at a time
url2 = 'https://codeup.com/workshops/from-bootcamp-to-bootcamp-a-military-appreciation-panel/'
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the python-requests default user-agent
response2 = get(url2, headers=headers)
# Check the new response (response2), not the earlier blog index response
print(response2)
print(response2.text[:100])

<Response [200]>
<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatib


In [102]:
# Make some soup (I wish it was Chowder)
# make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')
soup2 = BeautifulSoup(response2.content, 'html.parser')
soup3 = BeautifulSoup(response.text, 'html.parser')
soup4 = BeautifulSoup(response2.text, 'html.parser')
soup5 = BeautifulSoup(response.text)
soup6 = BeautifulSoup(response2.text)

In [183]:
# Check.
# so I will need the title and full content of the article. 

soup.title.string

'Blog - Codeup'

In [None]:
# //*[@id="post-18280"]/div[2]/div/div/div/div[1]/div/div/div/p[1]/span/text()

In [108]:
# testing other soup variants out of curiosity for how they may differ
print(soup2.title.string)
print(soup3.title.string)
print(soup4.title.string)
print(soup5.title.string)
print(soup6.title.string)

From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup
Blog - Codeup
From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup
Blog - Codeup
From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup


In [182]:
#soup2.find('div', id='main-content')
# the class should be et_pb_text_inner but I can't get it to work yet.

This is what I want to extract
```
<div class="et_pb_text_inner"><p data-key="16"><span data-key="17">In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans! We will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at Codeup. Grab your virtual seat now so you can be sent the exclusive Livestream link on the 11th! </span></p>
```

In [100]:
type(soup2)

bs4.BeautifulSoup

In [185]:
blogtitle = soup2.title.string
blogtitle

'From Bootcamp to Bootcamp | A Military Appreciation Panel - Codeup'

In [178]:
#soup2.select('.et_pb_text_inner').text

# AttributeError: ResultSet object has no attribute 'text'.
# You're probably treating a list of elements like a single element.
# Did you call find_all() when you meant to call find()?
# I did not call find_all(), but select() also returns a ResultSet, so the same error applies.
# Without .text the output is too cluttered with tags. The same problem as before.
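
The error comes from the difference between `select()`, which always returns a list-like `ResultSet`, and `select_one()`, which returns a single `Tag` or `None`. The sketch below shows both on invented markup; that two elements share the `et_pb_text_inner` class is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Invented markup: two elements share the same class.
html = """
<div class="et_pb_text_inner"><p>First block.</p></div>
<div class="et_pb_text_inner"><p>Second block.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

matches = soup.select(".et_pb_text_inner")    # ResultSet: a list of Tag objects
first = soup.select_one(".et_pb_text_inner")  # a single Tag, or None if nothing matches

# A ResultSet has no .text, so pull the text out of each Tag in turn.
texts = [tag.get_text(strip=True) for tag in matches]
print(texts)
print(first.get_text(strip=True))
```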

In [180]:
# still giving me way more than what I want. 
# I should have looked more closely on Chrome and followed the trail up
# I was being too vague, and allowing the text_inner to fog my mind. 
soup2.select_one('.entry-content').text.strip()

'In honor of Military Appreciation Month, join us for a discussion with Codeup Alumni who are also Military Veterans! We will chat about their experiences attending a coding bootcamp, and how their military training set them up for success here at Codeup. Grab your virtual seat now so you can be sent the exclusive Livestream link on the 11th! \nThank you to our panelists for participating: \n\nChristopher Aguirre\nTaryn McKenzie \nDesiree McElroy \n\n\nAnd thanks to Codeup’s Trey Iapachino who is also an Air Force Veteran!'

In [187]:
# published date (read from soup2, the article page, to match the title and content above)
publication_date = soup2.select_one('.published').text
publication_date

'Apr 27, 2022'
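
The scraped date is still a plain string. If it needs to be sorted or filtered later, it could be converted with `datetime.strptime`, as in this small sketch. The format string assumes every post uses the same "Apr 27, 2022" layout and that month abbreviations are parsed under an English locale.

```python
from datetime import datetime

raw_date = "Apr 27, 2022"  # the string scraped from the .published element

# %b is the abbreviated month name, %d the day of the month, %Y the four-digit year.
parsed = datetime.strptime(raw_date, "%b %d, %Y").date()
print(parsed.isoformat())
```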

In [141]:
#print(soup2.prettify())
# if that's beautification then I should maybe try making it hideous instead..

In [148]:
#print(soup2.get_text())

In [163]:
soup5.select('.post-meta')

[<p class="post-meta"><span class="published">Apr 27, 2022</span> | <a href="https://codeup.com/category/alumni-stories/" rel="tag">Alumni Stories</a>, <a href="https://codeup.com/category/workshops/dallas/" rel="tag">Dallas</a>, <a href="https://codeup.com/category/events/" rel="tag">Events</a>, <a href="https://codeup.com/category/featured/" rel="tag">Featured</a>, <a href="https://codeup.com/category/military/" rel="tag">Military</a>, <a href="https://codeup.com/category/workshops/san-antonio/" rel="tag">San Antonio</a>, <a href="https://codeup.com/category/veterans/" rel="tag">Veterans</a>, <a href="https://codeup.com/category/workshops/virtual/" rel="tag">Virtual</a>, <a href="https://codeup.com/category/workshops/" rel="tag">Workshops</a></p>,
 <p class="post-meta"><span class="published">Apr 14, 2022</span> | <a href="https://codeup.com/category/codeup-news/" rel="tag">Codeup News</a>, <a href="https://codeup.com/category/featured/" rel="tag">Featured</a>, <a href="https://codeu

In [168]:
blog_links = [link['href'] for link in soup.select('.more-link')]

In [188]:
blog_links

['https://codeup.com/workshops/virtual/learn-to-code-python-workshop-on-4-16/',
 'https://codeup.com/codeup-news/coming-soon-cloud-administration/',
 'https://codeup.com/featured/5-books-every-woman-in-tech-should-read/',
 'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/',
 'https://codeup.com/codeup-news/vet-tec-funding-dallas/',
 'https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/',
 'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
 'https://codeup.com/it-training/it-certifications-101/',
 'https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/',
 'https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/',
 'https://codeup.com/tips-for-prospective-students/which-program-is-right-for-me-cyber-security-or-systems-engineering/',
 'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
 'https://codeup.com/alumni-stories/

In [204]:
def get_blog_links():
    """
    Requests Codeup's blog index, sending the User-Agent header the site requires,
    and returns a list of the article URLs found in the "more-link" anchors.
    The local imports are merely a failsafe in case the notebook's import cell was skipped.
    """
    # imports in case they were forgotten
    from requests import get
    from bs4 import BeautifulSoup

    # Request the HTML of Codeup's blog index.
    response = get("https://codeup.com/blog/", headers={"user-agent": "Codeup Data Science"})
    # Parse the HTML; naming the parser avoids bs4's "no parser was explicitly specified" warning.
    soup = BeautifulSoup(response.text, "html.parser")
    # Build a list holding the href attribute of every element with the class more-link.
    links = [link.attrs["href"] for link in soup.select(".more-link")]
    # Equivalent: [link['href'] for link in soup.select('.more-link')]

    return links

def parse_blog(url):
    """
    Takes the URL of a single Codeup blog post as a string, requests the page,
    and returns a dictionary with the post's title, publication date, and content.
    """
    # imports in case they were forgotten
    from requests import get
    from bs4 import BeautifulSoup

    # Request the HTML of the blog post.
    response = get(url, headers={"user-agent": "Codeup Data Science"})
    # Parse the HTML of the response.
    soup = BeautifulSoup(response.text, "html.parser")
    # Use select_one to pull each field out of the parsed page.
    return {
        "title": soup.select_one(".entry-title").text,
        "publication_date": soup.select_one(".published").text,
        "content": soup.select_one(".entry-content").text.strip(),
    }


def get_blog_df():
    """
    A master function that combines the two previous functions: it collects the blog
    links, parses each post, and returns the results as a dataframe built with a
    list comprehension.
    """
    # import in case it was forgotten
    import pandas as pd

    links = get_blog_links()
    df = pd.DataFrame([parse_blog(link) for link in links])
    return df

In [202]:
get_blog_links()

['https://codeup.com/workshops/virtual/learn-to-code-python-workshop-on-4-16/',
 'https://codeup.com/codeup-news/coming-soon-cloud-administration/',
 'https://codeup.com/featured/5-books-every-woman-in-tech-should-read/',
 'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/',
 'https://codeup.com/codeup-news/vet-tec-funding-dallas/',
 'https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/',
 'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
 'https://codeup.com/it-training/it-certifications-101/',
 'https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/',
 'https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/',
 'https://codeup.com/tips-for-prospective-students/which-program-is-right-for-me-cyber-security-or-systems-engineering/',
 'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
 'https://codeup.com/alumni-stories/

In [205]:
parse_blog('https://codeup.com/featured/our-acquisition-of-the-rackspace-cloud-academy-one-year-later/')

{'title': 'Our Acquisition of the Rackspace Cloud Academy: One Year Later',
 'publication_date': 'Apr 14, 2022',
 'content': 'Just about a year ago on April 16th, 2021 we announced our acquisition of the Rackspace Cloud Academy! For a short time after the acquisition, it was rebranded as the Codeup Cloud Academy and is now a full-time part of the Codeup brand. You can read our blog when we announced this last year by clicking here.\xa0\nTo look back at the past year, we checked in with Marcus Benavidez and Mike Jaime who stayed with Codeup after previously working with the Rackspace Cloud Academy. We also checked in with Dimitri Antoniou who is our VP of Strategic Initiatives and helped with the merger of the two companies. We asked them questions about the current campus, and also what the future may look like at Codeup’s new “Castle” campus.\xa0\nWhat makes Codeup’s new “Castle” campus so special?\nMike Jamie: Its history! Rackspace and former Rackers were instrumental in jumpstartin

In [208]:
get_blog_df()

Unnamed: 0,title,publication_date,content
0,Learn to Code: Python Workshop on 4/23,"Mar 31, 2022","According to LinkedIn, the “#1 Most Promising Job” is data science! But we here at Codeup understand changing careers can be a daunting idea. That’s where our free Learn to Code workshops come in! \nOn Saturday 4/23 we will be teaching a free Learn to Code workshop on the programming language Python which is one of the major building blocks of Data Science!\nWhat is data science? What is Python? \nIf you’re curious, join for free to learn the basics of Python from our very own instructors and get an introduction to the field of Data Science. This is all done from the comfort of home.\nSave your seat quickly – our Python workshops are always in high demand! \nWhat you need:\n1. Laptop (does not matter what kind). You need to be able to access WiFi and run an internet browser.\n2. To RSVP!\nYou can register for the event below!"
1,Coming Soon: Cloud Administration,"Mar 17, 2022","We’re launching a new program out of San Antonio!\nWith the acquisition of Rackspace Open Cloud Academy last year, Codeup expanded its tech training into IT, cybersecurity, and cloud. Now, we are excited to announce our newest coming-soon program: Cloud Administration! Learn to build and manage cloud-based solutions.\nWhat is the cloud? \nIn short, the cloud lets you utilize the power of other people’s computers and infrastructure. Instead of having to run everything on your own computer (and dealing with pesky problems of storage, security, and reliability), you can instantly tap into the cloud to utilize the power of companies that specialize in these areas, like AWS, Google, and Microsoft.\nWhy the cloud? \nThe future is in the cloud! As the new way of doing business, all companies need to build or interact with cloud products. And these aren’t just high-tech solutions – they’re the core of everyday products like cloud-based gaming from Xbox, collaborative work tools from Google Drive, and even music streaming services from Spotify.\nWhat will we learn? \nThe Cloud Administration program will be a 15-week career accelerator that marries the best of our Systems Engineering and Cyber Cloud programs. With hands-on training in networking, Linux, Windows, security, and AWS Cloud, we’ll prepare you for entry-level jobs like Cloud Specialist, Cloud Administrator, and Cloud Engineer. Learn to build infrastructure that enables software and data science products, manage cloud deployments, and optimize for cloud performance.\n\nWho is this program for? \nIf coding or statistics aren’t your things, but you’re interested in a career in tech, this is the program for you. Tinkerers, computer enthusiasts, mechanics, gamers – anyone who likes to fix and build things will love this program. 
With this foundation, students will be launched into IT towards careers in Cloud Architecture, Cybersecurity, DevOps, and Solutions Engineering.\nWhen will the program launch? \nWe anticipate starting our inaugural class in late May. With only 20 seats available in the inaugural class, space is limited. Sign up now to be the first to hear about the launch of our Cloud Administration program!\n\n\nBe the first know when the Cloud Administration program launches by signing up on this page here."
2,5 Books Every Woman In Tech Should Read,"Mar 8, 2022","On this International Women’s Day 2022 we wanted to tell stories about women in tech. What better way to do that than celebrate female authors! These women have written phenomenal books in the tech space to tell their stories. These are women who have walked the walk in the tech world and/or offered unique perspectives to not just women in tech, but women in the workplace. This list goes in no particular order as we believe you should add all of these to your kindle or book library asap. You can click on each book image to take you to a purchase page on Amazon.\nLet’s dive into the list below:\nReset: My Fight for Inclusion and Lasting Change by Ellen Pao\nFrom the book’s description on Amazon:\n“In 2015, Ellen K. Pao sued a powerhouse Silicon Valley venture capital firm, calling out workplace discrimination and retaliation against women and other underrepresented groups. Her suit rocked the tech world—and exposed its toxic culture and its homogeneity. Her message overcame negative PR attacks that took aim at her professional conduct and her personal life, and she won widespread public support—Time hailed her as “the face of change.” Though Pao lost her suit, she revolutionized the conversation at tech offices, in the media, and around the world. In Reset, she tells her full story for the first time.”\n\n \nFemale Innovators at Work: Women on Top of Tech by Danielle Newnham \nFrom the book’s description on Amazon:\n“This book describes the experiences and successes of female innovators and entrepreneurs in the still largely male-dominated tech world in twenty candid interviews. It highlights the varied life and career stories that lead these women to the top positions in the technology industry that they are in now.\nInterviewees include CEOs, founders, and inventors from a wide spectrum of tech organizations across sectors as varied as mobile technology, e-commerce, online education, and video games. 
Interviewer Danielle Newnham, a mobile startup and e-commerce entrepreneur herself as well as an online community organizer, presents the insights, instructive anecdotes, and advice shared with her in the interviews, including stories about raising capital for one’s start-up, and about the obstacles these women encountered and how they overcame them.”\n\n \nTechnically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech by Sara Wachter-Boettcher\nFrom the book’s description on Amazon:\n“Buying groceries, tracking our health, finding a date: whatever we want to do, odds are that we can now do it online. But few of us ask how all these digital products are designed, or why. It’s time we change that. Many of the services we rely on are full of oversights, biases, and downright ethical nightmares. Chatbots that harass women. Signup forms that fail anyone who’s not straight. Social media sites that send peppy messages about dead relatives. Algorithms that put more black people behind bars. \nTechnically Wrong takes an unflinching look at the values, processes, and assumptions that lead to these problems and more. Wachter-Boettcher demystifies the tech industry, leaving those of us on the other side of the screen better prepared to make informed choices about the services we use – and demand more from the companies behind them.”\n\n \nBrotopia: Breaking Up the Boys’ Club of Silicon Valley by Emily Chang\nFrom the book’s description on Amazon:\n“Silicon Valley is not a fantasyland of unicorns, virtual reality rainbows, and 3D-printed lollipops for women in tech. Instead, it’s a “Brotopia,” where men hold the cards and make the rules. 
While millions of dollars may seem to grow on trees in this land of innovation, tech’s aggressive, misogynistic, work-at-all costs culture has shut women out of the greatest wealth creation in the history of the world.\nBrotopia reveals how Silicon Valley got so sexist despite its utopian ideals, why bro culture endures even as its companies claim the moral high ground, and how women are speaking out and fighting back. Drawing on her deep network of Silicon Valley insiders, Chang opens the boardroom doors of male-dominated venture capital firms like Kleiner Perkins, the subject of Ellen Pao’s high-profile gender discrimination lawsuit, and Sequoia, where a partner once famously said they “won’t lower their standards” just to hire women. Exposing the flawed logic in common excuses for why tech has long suffered the “pipeline” problem and invests in the delusion of meritocracy, Brotopia also shows how bias coded into AI, internet troll culture, and the reliance on pattern recognition harms not just women in tech but us all, and at an unprecedented scale.”\n\n \nLife in Code: A Personal History of Technology by Ellen Ullman\nFrom the book’s description on Amazon:\n“The last twenty years have brought us the rise of the internet, the development of artificial intelligence, the ubiquity of once unimaginably powerful computers, and the thorough transformation of our economy and society. Through it all, Ellen Ullman lived and worked inside that rising culture of technology, and in Life in Code she tells the continuing story of the changes it wrought with a unique, expert perspective.\nWhen Ellen Ullman moved to San Francisco in the early 1970s and went on to become a computer programmer, she was joining a small, idealistic, and almost exclusively male cadre that aspired to genuinely change the world. 
In 1997 Ullman wrote Close to the Machine, the now classic and still definitive account of life as a coder at the birth of what would be a sweeping technological, cultural, and financial revolution.\nTwenty years later, the story Ullman recounts is neither one of unbridled triumph nor a nostalgic denial of progress. It is necessarily the story of digital technology’s loss of innocence as it entered the cultural mainstream, and it is a personal reckoning with all that has changed, and so much that hasn’t. Life in Code is an essential text toward our understanding of the last twenty years–and the next twenty.”\n\n \nLet us know your thoughts on this list on social media! What books or authors should we add to this list for a future post?\nAre you a woman who is interested in launching your career in tech? Help us close the gender gap in tech and apply for our Women in Tech scholarship! You can learn more by clicking here. \nWe have a Data Science program that starts on 3/21 and a Web Development program that starts on 4/1. Let us know if you have questions by submitting your application or reaching out to us at admissions@codeup.com!"
3,Codeup Start Dates for March 2022,"Jan 26, 2022","As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.\nFull Stack Web Development – 3/7/22\nFull Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!\nAs one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:\n\n1.5 million developer jobs*\n250,000 of them remain open\na high growth rate of 13%*\n\n \nData Science – 3/22/22\nOur first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.\nWhy consider pivoting careers to Data Science?\n\n#1 job in America from 2016-2020 (Glassdoor*)\n650% increase in data science positions since 2012\nNearly 12 million new jobs between 2019 and 2029\n31% ten-year growth rate\n\nThe supply of data scientists remains painfully low compared to the outrageous demand. YOU can help close the gap while launching a fulfilling, secure, and high-paying career – one of the very best in the country!\nEmployers are scrambling to find talent due to a lack of qualified applicants. YOU can help fill the gap while future-proofing your skillset. Have the flexibility, security, and salary that you’ve always wanted in a career.\nAre you ready to launch your career in tech? Apply today so our admissions team can save your seat and get your name on the list. Our application can be found here. \nWant to experience Codeup early? Join one of our workshops to get an intro to a specific coding language, learn about our financing options, or maybe even code yourself a resume! All of our events can be located here. \nWe can’t wait to help you launch your career in tech!"
4,VET TEC Funding Now Available For Dallas Veterans,"Jan 7, 2022","We are so happy to announce that VET TEC benefits are now available to be used at our campus located in Dallas, TX! Our next Dallas start date for our web development program is January 31st, 2022!\nYou don’t want to delay your application as this type of funding is limited. Apply now here for our web development program and note you are interested in using VET TEC.\n \nVET TEC IN-PERSON WORKSHOP AT OUR DALLAS CAMPUS on 1/12/22\nWe’re hosting a VET TECH workshop next Wednesday evening at our Dallas Campus to discuss everything about this new funding option for our Dallas Veterans.\nWe’re one of the few coding bootcamps in Dallas approved for the use of VET TEC funding, so this workshop will go over why Codeup is the best place to use your benefits and help launch your career in tech once you’re out of the military. \nDetails about the event and the link to grab your free tickets can be found here. \nThe VET TEC program details and eligibility requirements are listed below. You can also visit our VET TEC program page here.\n \nWhat is VET TEC?\nVET TEC, which stands for Veteran Employment Through Technology Education Courses, is a program through the VA that matches career accelerators like Codeup with veterans looking to gain high-tech skills. Programs like Codeup will help you build the skills you need to become, for example, a Web Developer, while the VA helps you to pay for the tuition.\n \nAre you eligible to use VET TEC funding?\nTo qualify for VET TEC, you must:\n1. Not be on active duty or are within 180 days of separating from active duty2. Qualify for VA Education Assistance under the GI Bill3. Have at least one day of unexpired GI Bill® entitlement\n \nWhen is VET TEC funding available?\nVET TEC renews its funding on an annual cycle in October. \nIMPORTANT: Funds run out very quickly each year. 
If you’re interested in using VET TEC, we recommend applying immediately and going through our admissions process as quickly as possible. That way, you’ll be accepted early and we can certify your enrollment as soon as VET TEC funding becomes available from the VA.\nDon’t delay your access to VET TEC funding and apply today!"
...,...,...,...
10,Which program is right for me: Cyber Security or Systems Engineering?,"Oct 28, 2021","What IT Career should I choose?\nIf you’re thinking about a career in IT, there are a lot of directions you could go. You could become a web developer or data scientist or study UI/UX and graphic design. Or for those of you reading right now, you might be looking at entering a career in networking, cybersecurity, or cloud administration. Which might lead you to ask: what’s the difference, and which program is right for me?\nBoth these programs are 13-weeks long. While they have different names, they share a lot of similar content as entry-level IT accelerators. Of the seven technical modules, four of them are shared. Both cover:\n\nNetworking – Gain in-depth exposure to networks and topics across the OSI model, networks, protocols, and packet Capture, network analysis, and more.\nLinux – Gain exposure and get hands-on configuring and maintaining the Linux operating system.\nWindows – Exposure to the Windows Server 2012 R2 Operating System to learn server management functions using Server Manager and basic command-line utilities and tools to run from the command prompt.\nAWS – Become familiar with key industry terms and concepts while building a foundation of cloud product knowledge across Amazon Web Services.\n\nNetworking, Linux, Windows, and AWS are a LOT of content for new IT employees. Outside of this content, the programs split into their areas of focus:\n\nCyber Cloud gets students certified in Security+ and AWS\nSystems Engineering gets students certified in Network+ and Linux+\n\nNow that you understand where the programs are similar and where they differ, which one is right for you? 
Here are some questions to consider:\n\nDo I have any technical work experience?\nDo I have any training or education in IT?\nDo you prefer solving open-ended and ambiguous problems, or concrete problems with solutions?\nAre you interested in building and developing systems, or in securing and protecting systems?\n\nIf you don’t have any work or education in IT, the Systems Engineering program is the one for you. It is designed to be the 0-1 step in your IT career to get you started! If you do have work experience or education in IT, then you might be interested in Cyber Cloud. Why is that?\nOur programs focus on CompTIA certifications, which is a major industry provider in technical training. The typical track through their IT certifications is to start with their A+, then complete their Network +, and then progress to their Security +. That means if you don’t have your Network + (or relevant experience), it can be really hard to catch up.\nStill not sure which is right for you? Try it out for yourself with this free Network Fundamentals Crash Course. Ready to get some questions answered? Request More Information Here"
11,What the Heck is System Engineering?,"Oct 21, 2021","Codeup offers a 13-week training program: Systems Engineering. Designed to help you launch your career in tech, this program takes you from 0 experience to IT hero with certifications and hands-on experience in just weeks. But if you’re new to tech, you might be wondering…what is Systems Engineering?\nWhat is Systems Engineering?\nIn IT terms, a Systems Engineer (Sysadmin or Sysad for short) is responsible for the configuration, upkeep, and operation of computer systems. They manage things like security, storage, automation, troubleshooting, and a whole lot more. For example, some of the day to day tasks include things like:\n\nReviewing system logs for anomalies and issues\nUpdating operating systems (OS)\nInstalling new hardware and software\nManaging user accounts\nDocumenting information about the system setup\nManaging file systems\n\nYou might be thinking, “Woah, that looks like a lot!” and that’s because it is! A Systems Engineer is a critical piece of IT infrastructure. And in today’s digital age, all companies are IT companies, which means Sysadmins are critical everywhere. You can think of a Sysad as a jack-of-all-trades of what we typically think of as IT. But don’t forget how the old saying actually ends: “A jack of all trades is a master of none, but oftentimes better than a master of none.” So, what are the skills a Sysadmin uses?\n\nNetworking – Networking skills include how to configure, maintain, and troubleshoot networks, including using hardware like switches, routers, and firewalls. These skills and concepts are crucial to any systems engineering position. You not only need to understand them but also need some way to prove it. With the Network+ certification, job seekers can come to employers with proof of their networking know-how.\nLinux- Linux is one of the most popular Operating Systems (OS). It is fully compatible with Mac OS and underpins all Android systems. 
In fact, the world’s IT infrastructure is run primarily on Linux. Certifications are the primary way technical folks prove their skills and Linux is no different. The CompTIA Linux+ certification is an entry-level certification that proves knowledge and basic technical proficiency in the job role\nCloud – This involves using servers and tools (i.e. processing and storage hardware) that are based in a different location, instead of on-site. Our program introduces students to cloud concepts like storage, computing, permissions, and more!\n\nWhy Become a Systems Engineer?\nNow you know what a Sysadmin does and what skills they use. But why become a Sysadmin? The world is increasingly dependent on computer systems and their underlying technologies. Configuring computer software, designing innovative solutions for specific problems, and basic troubleshooting skills are more and more in demand each year. Without systems engineers, the technical world would stop innovating and become stagnant.\n\nThey are in demand! Companies rely on Sysadmins to keep all their IT running\nYou’ll become a tech guru! You won’t just be the person who can fix the printer. You’ll also be the person who can script, work with servers, navigate AWS Cloud tools, and work on modern web technologies.\nYou can specialize! With the breadth of knowledge you’ll obtain, you can have a specialized career in IT. Just hone in on one of your interests, whether it’s security, cloud, databases, or something that doesn’t even exist yet!\n\nInterested in learning more? Sign up for our Network Fundamentals Crash Course to get a taste for these skills – a $500 value you can get for free!Ready to go all-in? Apply to our 13-week Systems Engineering program where you’ll earn your Network+ and Linux+ certifications, along with skills in AWS, LAMP, Microsoft, and more!"
12,From Speech Pathology to Business Intelligence,"Oct 18, 2021","By: Alicia Gonzalez\nBefore Codeup, I was a home health Speech-Language Pathologist Assistant. I would go from home to home working with children anywhere from 1 year old to 18 years old. After 5 years of the profession, my body was getting tired driving around all the time and I felt overworked. The work was satisfying and I do miss working with the kiddos, but I just felt drained. I knew it was time to find a different field, so I started thinking about what else I was good at.\nThen one day as I was on my way to a patient’s home, I found a Codeup billboard near Splashtown that said “drive to a job you love” and I took it as a sign. I thought, “I want to drive to a job I love!” I had also heard nothing but good things about Codeup from a friend that went the Web Development route. She answered my questions and helped me through the whole process.\nAt first, I was a little worried about tuition payment, but once I got started in the application process, they held Zoom meetings where they would talk about all the funding options. They made it really easy and addressed all of my concerns. I realized I didn’t have to be scared about the financial aspect. I applied to a couple of grants and was able to get about a third of my tuition covered through the Train for Jobs SA grant from Workforce Solutions Alamo. That really helped me. I also received a few different scholarships from Codeup. The rest of it was a loan from Meritize.\nIt was difficult leaving something I was comfortable and familiar with to start something new with no data science background, but I have always enjoyed all things tech. Once we got into the pre-assessment work and then the prework leading up to class, it showed me, “okay this IS what I wanted to be doing.” I really enjoyed it, thought it was fun, and realized this was for me.\nOverall, I really enjoyed the program. 
We were in Zoom the whole time because of the pandemic, but they made it slow, nice, and easy. The instructors are absolutely wonderful. Every single instructor helped me at one point or another, teaching things from ground zero and explaining everything really nicely in a way I can understand. One thing I value the most about the instructors is that, not only do they go over all the necessary skills needed in a data-related career, but they teach us how to learn the skills as well. They didn’t just give us the work and tell us exactly how to do it. They provide instruction, hands-on exercises/projects, and answers to all questions, but most importantly, they walk us through how to work with the data and obtain the answers on our own. This is extremely valuable as I believe it is the key to success in this field.\nNow after graduating from Codeup, I’m a Business Intelligence Analyst at Liquid Web. I’ve been here almost three months now and I’m contributing to so many projects. I really feel like a true member of the team. I can figure things out for myself and my bosses are telling me how well I’m doing and how far I’ve come since working here. I didn’t have to ask for much help or training, because Codeup prepared me well and taught me how to learn when I run into skills they don’t teach. It’s been really good.\nI’m so glad I made this change. It was scary to take that leap of faith, but ultimately, I’m really glad I did. I love my job and I love what I’m doing. If anyone is looking to make a career change and struggling to find their best fit, I recommend looking into Codeup and going through their process. I’m extremely happy where I am right now and wouldn’t change a thing."
13,Boris – Behind the Billboards,"Oct 3, 2021",


***
***

2. News Articles

We will now be scraping text data from [inshorts](https://inshorts.com/en/read), a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment       

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:

```
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
***

In [18]:
def parse_news_card(card, category):
    '''
    Parse a single inshorts news card into a dictionary.

    The title is the stripped text of the 'a.clickable' anchor, and the content is
    the text of the first div inside '.news-card-content'. The author comes from the
    '.author' element, and the publication time is the machine-readable timestamp
    stored in the content attribute of the '.time' element. The category is passed
    in by the caller. Returns a dictionary with the category, title, content,
    author, and published keys.
    '''
    output = {}

    output['category'] = category
    output['title'] = card.select_one('a.clickable').text.strip()

    card_content = card.select_one('.news-card-content')
    output['content'] = card_content.select_one('div').text

    author_and_time = card_content.select_one('.news-card-author-time')
    output['author'] = author_and_time.select_one('.author').text
    output['published'] = author_and_time.select_one('.time').attrs['content']

    return output
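
One way to check parse_news_card without hitting the site is to run it on a small, hand-written card. The sketch below assumes markup that only mirrors the selectors used above. The SAMPLE_CARD string is hypothetical and is not copied from inshorts, whose real markup may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only mirrors the selectors that parse_news_card relies on
SAMPLE_CARD = '''
<div class="news-card">
  <a class="clickable"> Sample headline </a>
  <div class="news-card-content">
    <div>Sample body text.</div>
    <div class="news-card-author-time">
      <span class="author">Jane Doe</span>
      <span class="time" content="2022-05-09T05:05:31.000Z">05:05 am</span>
    </div>
  </div>
</div>
'''

def parse_news_card(card, category):
    # Same logic as the function above, repeated so this cell runs on its own
    output = {}
    output['category'] = category
    output['title'] = card.select_one('a.clickable').text.strip()

    card_content = card.select_one('.news-card-content')
    output['content'] = card_content.select_one('div').text

    author_and_time = card_content.select_one('.news-card-author-time')
    output['author'] = author_and_time.select_one('.author').text
    output['published'] = author_and_time.select_one('.time').attrs['content']
    return output

sample_soup = BeautifulSoup(SAMPLE_CARD, 'html.parser')
sample = parse_news_card(sample_soup.select_one('.news-card'), 'business')
print(sample)
```

If the site changes its class names, a check like this keeps passing while the live scrape fails, so it verifies the parsing logic only, not the selectors against the real page.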

In [20]:
import requests

def parse_news_category(category):
    '''
    Scrape every article card on the inshorts page for a single category.

    The category is appended to the base url, the page is requested over HTTPS,
    and the response text is parsed into a Beautiful Soup object. Every element
    with the class news-card is selected, and each card is passed to
    parse_news_card along with the category. Returns a list of the resulting
    article dictionaries.
    '''
    url = 'https://inshorts.com/en/read/' + category
    response = requests.get(url)
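    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.text, 'html.parser')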

    cards = soup.select('.news-card')
    articles = []

    for card in cards:
        articles.append(parse_news_card(card, category))

    return articles
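
The request in parse_news_category assumes the server always answers with a usable page. A slightly more defensive fetch might look like the sketch below. fetch_soup is a hypothetical helper, not part of the original solution, and the user-agent string and 10-second timeout are placeholder values.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, user_agent='Codeup Data Science', timeout=10):
    """Request url and return the parsed page, raising on an HTTP error status."""
    response = requests.get(url, headers={'user-agent': user_agent}, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

# Example call (not executed here because it needs network access):
# soup = fetch_soup('https://inshorts.com/en/read/business')
# cards = soup.select('.news-card')
```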

In [19]:
def get_news_articles(use_cache=True):
    '''
    Scrape the inshorts articles for each category, caching the result on disk.

    If news_articles.json already exists and use_cache is True, the cached file is
    read with pandas and returned as a dataframe without making any requests.
    Otherwise each category is scraped in turn. extend is used rather than append
    because parse_news_category returns a list of dictionaries, and each of those
    dictionaries should be added to the articles list individually. The list is then
    converted into a dataframe and written to news_articles.json so that later calls
    can read from the cache.
    '''
    
    if os.path.exists('news_articles.json') and use_cache:
        return pd.read_json('news_articles.json')

    categories = ['business', 'sports', 'technology', 'entertainment']

    articles = []

    for category in categories:
        print(f'Getting {category} articles')
        articles.extend(parse_news_category(category))

    df = pd.DataFrame(articles)
    df.to_json('news_articles.json', orient='records')
    return df
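
The cache check in get_news_articles can be separated from the scraping itself. The sketch below uses only the standard library, with a stand-in function in place of a real scrape. load_or_build and fake_scrape are hypothetical names introduced for illustration, and the records are made up.

```python
import json
import os
import tempfile

def load_or_build(path, build, use_cache=True):
    """Return cached records from path when allowed, else call build() and cache the result."""
    if use_cache and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    records = build()
    with open(path, 'w') as f:
        json.dump(records, f)
    return records

calls = []  # records how many times the stand-in scrape actually runs

def fake_scrape():
    calls.append(1)
    return [{'title': 'Sample headline', 'category': 'business'}]

cache_path = os.path.join(tempfile.mkdtemp(), 'articles.json')
first = load_or_build(cache_path, fake_scrape)                    # no file yet: builds and writes
second = load_or_build(cache_path, fake_scrape)                   # file exists: reads the cache
third = load_or_build(cache_path, fake_scrape, use_cache=False)   # forces a rebuild
print(len(calls))
```

Passing use_cache=False is the way to refresh stale data, since the cached file is otherwise returned no matter how old it is.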

In [15]:
today = strftime('%Y-%m-%d')
get_news_articles().to_json(f'inshorts-{today}.json')

Getting business articles
Getting sports articles
Getting technology articles
Getting entertainment articles


In [16]:
df = pd.read_json('inshorts-2022-05-09.json')

In [17]:
df.head()

Unnamed: 0,category,title,content,author,published
0,business,Rupee hits all-time low of 77.42 against US dollar,"The Indian rupee fell to an all-time low of 77.42 against the US dollar on Monday, Reuters reported. Asian markets were lower on Monday as US stock futures fell on fears of more policy tightening from the Federal Reserve and strict lockdown in Shanghai impacting global growth, according to Reuters.",Apaar Sharma,2022-05-09T05:05:31.000Z
1,business,"Bitcoin falls to the lowest level since January, trades below $34,000","Bitcoin fell on Monday to as low as $33,266 in morning trade, nearing January's low of $32,951 as slumping equity markets continued to hurt cryptocurrencies. It then steadied to trade above $33,600. According to BBC, the world's largest cryptocurrency has fallen by 50% since its peak in November 2021.",Pragya Swastik,2022-05-09T09:20:34.000Z
2,business,Rupee closes at all-time low of 77.50 against US dollar,"The Indian rupee weakened further on Monday to close at a new all-time low of 77.50 against the US dollar, 60 paise over its previous close. During the trading session, the rupee touched its lifetime low of 77.52. The currency was weighed down by elevated crude oil prices and a widening trade deficit.",Pragya Swastik,2022-05-09T15:27:43.000Z
3,business,Made best possible decision: IndiGo on barring differently-abled child from flight,"IndiGo's CEO Ronojoy Dutta said the airline made ""the best possible decision"" by barring a differently-abled teenager and his family from boarding a Ranchi-Hyderabad flight. ""At boarding area, the teenager was visibly in panic...the airport staff, in line with safety guidelines, were forced to make a difficult decision,"" Dutta said. IndiGo offered to purchase an electric wheelchair for the child.",Pragya Swastik,2022-05-09T09:50:34.000Z
4,business,India's biggest IPO of LIC subscribed nearly 3 times on final day of bidding,"LIC's IPO, India's biggest IPO which opened on May 4 and closed on May 9, was subscribed 2.95 times on Monday. Expected to raise ₹20,557 crore, the IPO received bids for 47.83 crore equity shares against the IPO size of 16.2 crore shares. The policyholders' portion was subscribed 6.11 times, employees bid 4.39 times and retail investors bid 1.99 times.\n\n",Pragya Swastik,2022-05-09T14:10:38.000Z
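
The exercise asks get_news_articles to return a list of dictionaries, while the version above returns a dataframe. If the list form is needed, pandas can convert between the two. A small sketch with made-up rows (the titles are placeholders, not scraped data):

```python
import pandas as pd

rows = [
    {'title': 'First placeholder title', 'content': 'Some text.', 'category': 'business'},
    {'title': 'Second placeholder title', 'content': 'More text.', 'category': 'sports'},
]

df = pd.DataFrame(rows)

# orient='records' gives one dictionary per row, the shape the exercise describes
records = df.to_dict(orient='records')

# A quick sanity check on how many articles each category contributed
counts = df['category'].value_counts()
print(records)
print(counts)
```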


***
***
***

In [6]:
# Zach's solutions to the first problem:

import requests

# Defined at module level so that get_blog_articles can reuse it below
headers = {'user-agent': 'Innis Codeup Data Science'}

def get_blog_article_urls():
    response = requests.get('https://codeup.com/blog/', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = [a.attrs['href'] for a in soup.select('a.more-link')]
    return urls

def parse_blog_article(soup):
    return {
        'title': soup.select_one('h1.entry-title').text,
        'published': soup.select_one('.published').text,
        'content': soup.select_one('.entry-content').text.strip(),
    }

def get_blog_articles(use_cache=True):
    if os.path.exists('codeup_blog_articles.json') and use_cache:
        return pd.read_json('codeup_blog_articles.json')

    urls = get_blog_article_urls()
    articles = []

    for url in urls:
        print(f'fetching {url}')
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        articles.append(parse_blog_article(soup))

    df = pd.DataFrame(articles)
    df.to_json('codeup_blog_articles.json', orient='records')
    return df
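
parse_blog_article raises an AttributeError if any of its selectors matches nothing, because select_one returns None in that case. One possible guard is sketched below. text_or_none is a hypothetical helper, and SAMPLE_ARTICLE is made-up markup that deliberately omits the '.published' element.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, selector):
    """Return the stripped text of the first match for selector, or None if nothing matches."""
    element = soup.select_one(selector)
    return element.text.strip() if element is not None else None

# Hypothetical article markup with no '.published' element
SAMPLE_ARTICLE = '''
<h1 class="entry-title">Sample Post</h1>
<div class="entry-content">
  <p>First paragraph.</p>
</div>
'''

article_soup = BeautifulSoup(SAMPLE_ARTICLE, 'html.parser')
article = {
    'title': text_or_none(article_soup, 'h1.entry-title'),
    'published': text_or_none(article_soup, '.published'),
    'content': text_or_none(article_soup, '.entry-content'),
}
print(article)
```

Whether a missing field should become None or stop the scrape is a design choice. Failing loudly makes layout changes easier to notice, while None keeps a long scrape running.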