# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')` : same as `find_all` method above
- `soup.find('h2')`: finds the first matching element
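
As a quick way to compare these methods side by side, here is a minimal sketch that runs each one against a tiny hand-written HTML snippet. The snippet and its class names (`card`, `title`, `note`) are made up for illustration and are not from the demo site.

```python
from bs4 import BeautifulSoup

# A small, made-up document (not from the demo site) to try the methods on
html = """
<div class="card">
  <h2 class="title">First</h2>
  <p class="note">one</p>
</div>
<div class="card">
  <h2 class="title">Second</h2>
  <p class="note">two</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# CSS selectors: every match vs. only the first match
all_titles = [h2.text for h2 in soup.select('.title')]
first_title = soup.select_one('.title').text

# Tag-name lookups: attribute access and find() both return the first match
first_by_attribute = soup.h2.text
first_by_find = soup.find('h2').text

# find_all() and calling the soup directly both return every match
all_by_find_all = [h2.text for h2 in soup.find_all('h2')]
all_by_call = [h2.text for h2 in soup('h2')]

print(all_titles, first_title)
```

`select` and `find_all` always return a list (possibly empty), while `select_one` and `find` return a single element or `None`.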

First, make the request. The body of the response is the page's raw HTML.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
# html

We can make more sense of that HTML with the Beautiful Soup library.

In [3]:
# name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways of extracting parts of the HTML document.

We can leverage Google Chrome's developer tools by right-clicking an element and choosing "Inspect". The inspector shows which tags and classes surround the content we want, which tells us what selectors to use when scraping.

In [4]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
# articles

In [6]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">throughout time quality</h2>
<div class="grid grid-cols-2 italic">
<p> 1977-10-15 </p>
<p class="text-right">By Hannah Villarreal </p>
</div>
<p>Part entire challenge truth resource between fund fund. Government describe attention.
Single religious green war identify. Sea mother fight hospital country water girl. Prevent no conference management attention.</p>
</div>
</div>

In [7]:
# soupmethod.tagname.text
headline = article.h2.text
headline

'throughout time quality'

In [8]:
# get the date
# there was some white space that we stripped out
date = article.p.text.strip()
date

'1977-10-15'

In [9]:
article.select('.text-right')[0].text.strip()[3:]

'Hannah Villarreal'

In [10]:
# the leading dot in '.text-right' is CSS selector syntax for a class name and is required
author = article.select('.text-right')[0].text.strip()[3:]
author

'Hannah Villarreal'

In [11]:
# getting the actual content
content = article.select('p')[-1].text
content

'Part entire challenge truth resource between fund fund. Government describe attention.\nSingle religious green war identify. Sea mother fight hospital country water girl. Prevent no conference management attention.'

Bringing it all together: Make a function

In [12]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [13]:

parse_news(article)

{'headline': 'throughout time quality',
 'date': '1977-10-15',
 'author': 'Hannah Villarreal',
 'content': 'Part entire challenge truth resource between fund fund. Government describe attention.\nSingle religious green war identify. Sea mother fight hospital country water girl. Prevent no conference management attention.'}
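
`parse_news` assumes every article has all four pieces; if a tag is missing, `article.h2.text` raises an `AttributeError`. The sketch below is one possible defensive variant, not the original approach: `parse_news_safe` and `text_or_none` are hypothetical names, and the sample HTML is a shortened copy of the article shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first article's markup, used here as sample input
sample = '''<div class="grid grid-cols-4">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">throughout time quality</h2>
<div class="grid grid-cols-2 italic">
<p> 1977-10-15 </p>
<p class="text-right">By Hannah Villarreal </p>
</div>
<p>Part entire challenge truth resource between fund fund.</p>
</div>
</div>'''


def text_or_none(element):
    """Return the stripped text of an element, or None if it was not found."""
    return element.get_text(strip=True) if element is not None else None


def parse_news_safe(article):
    """Hypothetical variant of parse_news that tolerates missing elements."""
    author = text_or_none(article.select_one('.text-right'))
    # drop the leading "By " only when it is actually there
    if author and author.startswith('By '):
        author = author[3:]
    paragraphs = article.select('p')
    return {
        'headline': text_or_none(article.select_one('h2')),
        'date': text_or_none(article.select_one('p')),
        'author': author,
        'content': paragraphs[-1].get_text(strip=True) if paragraphs else None,
    }


sample_article = BeautifulSoup(sample, 'html.parser').select_one('.grid-cols-4')
record = parse_news_safe(sample_article)
print(record)
```

Missing pieces come back as `None` instead of stopping the whole loop, which makes gaps easy to spot later in the DataFrame.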

In [14]:
# loop through all the articles
# [parse_news(article) for article in articles]

In [15]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | throughout time quality | 1977-10-15 | Hannah Villarreal | Part entire challenge truth resource between f... |
| 1 | among detail begin | 1974-02-12 | Joseph King | Public wind special new fine because. Seat exp... |
| 2 | doctor color source | 2012-04-13 | Stephanie Nichols | Interview keep other way. Bar side politics.\n... |
| 3 | college although receive | 1987-10-02 | Priscilla Murphy MD | Available short hair. Real staff cell movement... |
| 4 | life particularly politics | 1991-10-18 | Kyle Lewis | Tax computer defense six either west success s... |
| 5 | between modern wait | 1996-12-19 | Anthony Mckee | Deal election piece similar arm experience.\nH... |
| 6 | man value general | 2007-09-07 | Donna Nichols | Night generation admit could. Face individual ... |
| 7 | hope campaign small | 1994-12-29 | Kevin Scott | Story let oil factor skill house. Represent qu... |
| 8 | professional follow soldier | 1982-01-19 | Tracy Cox | Reflect member finally option. Walk how knowle... |
| 9 | while range mission | 2019-11-27 | Barbara Camacho | Risk say ago hand. Church front rate rate. Tec... |


## Scraping People

In [16]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [17]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [18]:
cards = soup.select(".person")
cards

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Susan Nguyen</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Visionary exuding moderator"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">andrewmorgan@lane.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">925-794-3250</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 0043 Nathan Harbor <br/>
                 Ashleystad, MN 16113
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b"

In [19]:
card = cards[0]
card

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Susan Nguyen</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Visionary exuding moderator"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">andrewmorgan@lane.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">925-794-3250</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                0043 Nathan Harbor <br/>
                Ashleystad, MN 16113
            </p>
</div>
</div>

In [20]:
name = card.h2.text
name

'Susan Nguyen'

In [21]:
quote = card.p.text.strip()
quote

'"Visionary exuding moderator"'

In [22]:
email = card.find_all('p')[1].text
email

'andrewmorgan@lane.com'

In [23]:
phone = card.find_all('p')[2].text
phone

'925-794-3250'

In [24]:
# address = card.find_all('p')[3].text.strip()
# address

In [25]:
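import re
address = card.find_all('p')[3].text.strip()
# Remove runs of 2+ whitespace characters (the line break and indentation around <br/>).
# Note: replacing with "" joins the street and city lines with no separator;
# use " " as the replacement to keep a space between them.
address = re.sub(r"\s{2,}", "", address)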

address

'0043 Nathan HarborAshleystad, MN 16113'
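
The output above runs the street and city lines together because the whitespace between them is removed outright. One possible alternative, sketched below with only the standard library, is to split the raw text on line breaks and rejoin the non-empty pieces with a separator. `clean_address` is a hypothetical helper, and `raw_address` is a hand-typed stand-in for what `card.find_all('p')[3].text` returns.

```python
# Hand-typed stand-in for card.find_all('p')[3].text (an assumption, not scraped here)
raw_address = "\n                0043 Nathan Harbor \n                Ashleystad, MN 16113\n            "


def clean_address(text, separator=', '):
    """Strip each line, drop the empty ones, and join the rest with a separator."""
    lines = [line.strip() for line in text.splitlines()]
    return separator.join(line for line in lines if line)


address = clean_address(raw_address)
print(address)
```

Passing `separator=' '` instead keeps the address on one line without adding a comma.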

In [26]:
card.find_all('p')

[<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Visionary exuding moderator"
         </p>,
 <p class="email col-span-8">andrewmorgan@lane.com</p>,
 <p class="phone col-span-8">925-794-3250</p>,
 <p class="col-span-8">
                 0043 Nathan Harbor <br/>
                 Ashleystad, MN 16113
             </p>]

In [27]:
def parse_person(card):
    name = card.h2.text
    quote = card.p.text.strip()
    email = card.find_all('p')[1].text
    phone = card.find_all('p')[2].text
    address = card.find_all('p')[3].text.strip()
    address = re.sub(r"\s{2,}", "", address)
    
    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }

In [28]:
parse_person(card)

{'name': 'Susan Nguyen',
 'quote': '"Visionary exuding moderator"',
 'email': 'andrewmorgan@lane.com',
 'phone': '925-794-3250',
 'address': '0043 Nathan HarborAshleystad, MN 16113'}

In [29]:
# loop through all the persons
pd.DataFrame([parse_person(card) for card in cards])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Susan Nguyen | "Visionary exuding moderator" | andrewmorgan@lane.com | 925-794-3250 | 0043 Nathan HarborAshleystad, MN 16113 |
| 1 | Derek Sanchez | "Persevering systemic help-desk" | keith45@bridges.com | 602-359-6575x83313 | 7343 Hardy SquaresGregoryburgh, VA 02125 |
| 2 | James Mcclain | "Advanced modular help-desk" | joan90@yahoo.com | (067)374-0403 | 52340 Medina FreewayLake Rebekah, MI 32931 |
| 3 | Jordan Figueroa | "Expanded global standardization" | byrddanielle@singh.net | (476)704-0564x1646 | 254 Austin Hill Apt. 038East Andrew, OK 59826 |
| 4 | Michael Nguyen | "Visionary asymmetric function" | david51@williams.com | 192-325-2459x49485 | 348 Alyssa Circle Suite 795Tiffanymouth, MT 01010 |
| 5 | Teresa Harris | "Monitored directional Graphic Interface" | jessica84@salazar.com | 207.763.3744x80532 | 4221 Jon Way Apt. 694South Nicolemouth, GA 01140 |
| 6 | Kathy Fitzpatrick | "Front-line secondary protocol" | smiththeresa@hotmail.com | 001-836-731-9340 | 0291 Veronica CenterWest Rodneyborough, IL 73663 |
| 7 | Kelly Porter | "Fundamental holistic success" | lisa52@yahoo.com | 824.721.7467x312 | 83746 Larson LoafLake William, IL 93169 |
| 8 | Jason Walker | "Stand-alone global strategy" | shannonarnold@gmail.com | (410)194-2225x476 | 32541 Maddox Valley Apt. 270Port Billyton, WA ... |
| 9 | Adrienne Glenn | "Configurable cohesive framework" | brownrobert@woodward.com | 923-063-3537 | 642 Kylie Dam Suite 150Hooperside, ME 29406 |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
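
To make the `robots.txt` point concrete, here is a minimal sketch using the standard library's `urllib.robotparser`. The rules, the `example.com` URLs, and the user agent string are made up for illustration; they are parsed from an inline string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, parsed from a string so no request is sent.
# For a real site you could instead call parser.set_url('<site>/robots.txt')
# followed by parser.read().
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = 'codeup data science student'  # hypothetical descriptive user agent
blog_allowed = parser.can_fetch(user_agent, 'https://example.com/blog/')
private_allowed = parser.can_fetch(user_agent, 'https://example.com/private/page')

print(blog_allowed, private_allowed)
```

Checking `can_fetch` before each request is a cheap way to skip pages the site owner has asked crawlers to avoid.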

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that returns a list of dictionaries, with each dictionary representing one article and holding at least that article's title and content.

In [30]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [31]:
# print(soup.prettify())

In [32]:
# <h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [33]:
articles = soup.find_all('h2', class_ = 'entry-title')
articles

[<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/vet-tec-funding-dallas/">VET TEC Funding Now Available For Dallas Veterans</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/">Dallas Campus Re-opens With New Grant Partner</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/dallas-newsletter/codeup-dallas-open-house/">Codeup Dallas Open House</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/">Codeup’s Placement Team Continues Setting Records</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/it-training/it-certifications-101/">IT Certifications 101: Why They Matter, and Why They Don’t</a></h2>,
 <h2 class="entry-title"><a href="https://codeup.com/cybersecurity/a-

In [34]:
articles[0]

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [35]:
article = articles[0]
article

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [36]:
title = article.text
title

'Codeup Start Dates for March 2022'

In [37]:
article

<h2 class="entry-title"><a href="https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/">Codeup Start Dates for March 2022</a></h2>

In [38]:
link = article.a.attrs['href']
link

'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/'

In [39]:
def get_links():
    link_list = []
    response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
    soup = BeautifulSoup(response.text)
    articles = soup.find_all('h2', class_ = 'entry-title')
    for article in articles:
        link = article.a.attrs['href']
        link_list.append(link)
    return link_list

In [40]:
def get_link(article):
    link = article.a.attrs['href']
    return link

In [41]:
get_link(article)

'https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/'

In [42]:
temp_list = get_links()
temp_list

['https://codeup.com/codeup-news/codeup-start-dates-for-march-2022/',
 'https://codeup.com/codeup-news/vet-tec-funding-dallas/',
 'https://codeup.com/codeup-news/dallas-campus-re-opens-with-new-grant-partner/',
 'https://codeup.com/dallas-newsletter/codeup-dallas-open-house/',
 'https://codeup.com/codeup-news/codeups-placement-team-continues-setting-records/',
 'https://codeup.com/it-training/it-certifications-101/',
 'https://codeup.com/cybersecurity/a-rise-in-cyber-attacks-means-opportunities-for-veterans-in-san-antonio/',
 'https://codeup.com/codeup-news/use-your-gi-bill-benefits-to-land-a-job-in-tech/',
 'https://codeup.com/tips-for-prospective-students/which-program-is-right-for-me-cyber-security-or-systems-engineering/',
 'https://codeup.com/it-training/what-the-heck-is-system-engineering/',
 'https://codeup.com/alumni-stories/from-speech-pathology-to-business-intelligence/',
 'https://codeup.com/behind-the-billboards/boris-behind-the-billboards/',
 'https://codeup.com/codeup-new

In [43]:
article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
article_soup = BeautifulSoup(article_response.text)
# article_soup

In [44]:
article_content = [p.text for p in article_soup.find_all('p')]

In [45]:
article_content

['Jan 26, 2022 | Codeup News',
 'As we approach the end of January we wanted to look forward to our next start dates for all of our current programs.',
 'Full Stack Web Development is the first program we built and also our most popular. You’ve asked and we listened! Our next Web Development cohort will start on 3/7/2022 and is ENTIRELY VIRTUAL! THESE SEATS WILL GO FAST!',
 'As one of the most in-demand jobs in the country, software and web development is the tech career with the newest jobs. In the U.S., there’s:',
 '\xa0',
 'Our first new Data Science class of 2022 starts Monday 3/22/2022 at our downtown campus at the Vogue building.',
 'Why consider pivoting careers to Data Science?',
 'The supply of data scientists remains painfully low compared to the outrageous demand. YOU can help close the gap while launching a fulfilling, secure, and high-paying career – one of the very best in the country!',
 'Employers are scrambling to find talent due to a lack of qualified applicants. YOU 

In [46]:
# # for content:
# summaries = soup.find_all('div',class_="post-content")
# summaries[0]

In [47]:
# summary = summaries[0].text.strip()
# summary

In [48]:
# def get_blog_articles(article):
#     response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
#     soup = BeautifulSoup(response.text)
#     articles = soup.find_all('h2', class_ = 'entry-title')
#     title = article.text
#     link = article.a.attrs['href']
#     article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
#     article_soup = BeautifulSoup(article_response.text)
#     article_content = [p.text for p in article_soup.find_all('p')]

#     return {
#         'title': title, 'article_content': article_content
#     }


# codeup_blog_posts = pd.DataFrame([get_blog_articles(article) for article in articles])



### This was the original, working function I created before making it more complicated with the link function and writing to JSON


In [49]:
# def get_blog_articles():
#     filename = 'codeup_blog_articles.json'
#     if os.path.isfile(filename):
#         return pd.read_csv(filename)
    
#     else:
#         article_list=[]
#         response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
#         soup = BeautifulSoup(response.text)
#         articles = soup.find_all('h2', class_ = 'entry-title')
#         for article in articles:
# #             response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
# #             soup = BeautifulSoup(response.text)
# #             articles = soup.find_all('h2', class_ = 'entry-title')
#             title = article.text
#             links = get_links()
#             for link in links:
#                 article_response = requests.get(link, headers={'user-agent': 'Codeup DS Hopper'})
#                 article_soup = BeautifulSoup(article_response.text)
#                 article_content = [p.text for p in article_soup.find_all('p')]

#                 article = {
#                     'title': title, 'article_content': article_content
#                 }
#             article_list.append(article)
#         df = pd.DataFrame(article_list)
#         df.to_json('codeup_blog_articles.json')
#     return df



# this function was a nice try, but it got into an infinite loop

In [50]:
def get_blog_articles():
    filename = 'codeup_blog_articles.json'
    # return the cached copy if it exists (read_json, because the cache is a JSON file)
    if os.path.isfile(filename):
        return pd.read_json(filename)

    headers = {'user-agent': 'Codeup DS Hopper'}
    article_list = []
    response = requests.get('https://codeup.com/blog/', headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = soup.find_all('h2', class_='entry-title')
    for article in articles:
        title = article.text
        link = get_link(article)
        # fetch the full post and collect the text of every paragraph
        article_response = requests.get(link, headers=headers)
        article_soup = BeautifulSoup(article_response.text, 'html.parser')
        article_content = [p.text for p in article_soup.find_all('p')]
        article_list.append({'title': title, 'article_content': article_content})

    # cache the results so later calls can skip the requests
    df = pd.DataFrame(article_list)
    df.to_json(filename)
    return df


# Great, this function is working now
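
The check-the-file-first pattern used in `get_blog_articles` can be pulled out into a small reusable helper. The sketch below is one way to do that with only the standard library; `cached_records` and its `fetch` parameter are hypothetical names, and it stores a plain list of dictionaries rather than a DataFrame.

```python
import json
import os


def cached_records(filename, fetch):
    """Return records cached in `filename`, or call `fetch()` once and cache the result.

    `fetch` is any zero-argument function returning JSON-serializable data,
    for example a list of {'title': ..., 'content': ...} dictionaries.
    """
    if os.path.isfile(filename):
        # cache hit: read the saved records instead of scraping again
        with open(filename) as f:
            return json.load(f)

    # cache miss: do the expensive work, then save it for next time
    records = fetch()
    with open(filename, 'w') as f:
        json.dump(records, f)
    return records


# Example usage (assumes some scrape_posts() function of your own exists):
# posts = cached_records('codeup_blog_articles.json', scrape_posts)
# df = pd.DataFrame(posts)
```

Deleting the cache file forces a fresh scrape on the next call.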

In [51]:
codeup_blog_posts = get_blog_articles()
codeup_blog_posts

| | title | article_content |
|---|---|---|
| 0 | Codeup Start Dates for March 2022 | [Jan 26, 2022 \| Codeup News, As we approach th... |
| 1 | VET TEC Funding Now Available For Dallas Veterans | [Jan 7, 2022 \| Codeup News, Dallas Newsletter,... |
| 2 | Dallas Campus Re-opens With New Grant Partner | [Dec 30, 2021 \| Codeup News, Featured, We are ... |
| 3 | Codeup Dallas Open House | [Nov 30, 2021 \| Dallas Newsletter, Events, Com... |
| 4 | Codeup’s Placement Team Continues Setting Records | [Nov 19, 2021 \| Codeup News, Employers, Who ex... |
| 5 | IT Certifications 101: Why They Matter, and Wh... | [Nov 18, 2021 \| IT Training, Tips for Prospect... |
| 6 | A rise in cyber attacks means opportunities fo... | [Nov 17, 2021 \| Cybersecurity, In the last few... |
| 7 | Use your GI Bill® benefits to Land a Job in Tech | [Nov 4, 2021 \| Codeup News, Tips for Prospecti... |
| 8 | Which program is right for me: Cyber Security ... | [Oct 28, 2021 \| IT Training, Tips for Prospect... |
| 9 | What the Heck is System Engineering? | [Oct 21, 2021 \| IT Training, Tips for Prospect... |


In [52]:
# codeup_blog_posts = pd.DataFrame([get_blog_articles(article) for article in articles])


In [53]:
# codeup_blog_posts = 
# codeup_blog_posts

In [54]:
# codeup_blog_posts.article_content[0]

### Cracked it : )

#### News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named get_news_articles that returns a list of dictionaries, where each dictionary has this shape:


```python
{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}
```

In [55]:
base_url = 'https://inshorts.com'
section_links = ["/en/read/business","/en/read/sports","/en/read/technology","/en/read/entertainment"]
response = requests.get(base_url + '/en/read', headers={'user-agent': 'ds_student'})
soup = BeautifulSoup(response.text, 'html.parser')


In [56]:
response

<Response [200]>

In [57]:
soup.find_all(class_ = 'active-category')

[<li class="active-category selected">All News</li>,
 <li class="active-category">India</li>,
 <li class="active-category">Business</li>,
 <li class="active-category">Sports</li>,
 <li class="active-category">World</li>,
 <li class="active-category">Politics</li>,
 <li class="active-category">Technology</li>,
 <li class="active-category">Startup</li>,
 <li class="active-category">Entertainment</li>,
 <li class="active-category">Miscellaneous</li>,
 <li class="active-category">Hatke</li>,
 <li class="active-category">Science</li>,
 <li class="active-category">Automobile</li>]

In [58]:
temp = soup.find_all('ul')

In [59]:
soup.ul.attrs

{'class': ['category-list']}

In [60]:
temp2 = soup.ul.find_all('a')

In [61]:
temp2

[<a href="/en/read" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToAllNews', 'action': 'clicked', 'label': 'RedirectedToAllNews' });"> <li class="active-category selected">All News</li> </a>,
 <a href="/en/read/national" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToIndiaNews', 'action': 'clicked', 'label': 'RedirectedToIndiaNews' });"> <li class="active-category">India</li> </a>,
 <a href="/en/read/business" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToBusinessNews', 'action': 'clicked', 'label':  'RedirectedToBusinessNews' });"> <li class="active-category">Business</li> </a>,
 <a href="/en/read/sports" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkTosportsNews', 'action': 'clicked', 'label': 'RedirectedToSportsNews' });"> <li class="active-category">Sports</li> </a>,
 <a href="/en/read/world" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToworldNews', 'action': 'clicked', 'label': 'Re

In [62]:
temp2[2]

<a href="/en/read/business" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToBusinessNews', 'action': 'clicked', 'label':  'RedirectedToBusinessNews' });"> <li class="active-category">Business</li> </a>

In [63]:
temp3 = soup.ul.find_all('a')

In [64]:
temp3

[<a href="/en/read" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToAllNews', 'action': 'clicked', 'label': 'RedirectedToAllNews' });"> <li class="active-category selected">All News</li> </a>,
 <a href="/en/read/national" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToIndiaNews', 'action': 'clicked', 'label': 'RedirectedToIndiaNews' });"> <li class="active-category">India</li> </a>,
 <a href="/en/read/business" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToBusinessNews', 'action': 'clicked', 'label':  'RedirectedToBusinessNews' });"> <li class="active-category">Business</li> </a>,
 <a href="/en/read/sports" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkTosportsNews', 'action': 'clicked', 'label': 'RedirectedToSportsNews' });"> <li class="active-category">Sports</li> </a>,
 <a href="/en/read/world" onclick="track_GA_Mixpanel({'hitType': 'event', 'category': 'LinkToworldNews', 'action': 'clicked', 'label': 'Re

### Note: I might need to come back to this and try an easier approach, for example scraping each section page directly instead of navigating to it from the main page
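
As a starting point for the section-by-section approach, here is a minimal sketch that turns the `section_links` paths from above into full URLs keyed by category name, using the standard library's `urljoin`. `section_urls` is a hypothetical helper; it builds the URLs only and does not request or parse any page, since the markup of the section pages has not been inspected here.

```python
from urllib.parse import urljoin

base_url = 'https://inshorts.com'
section_links = ["/en/read/business", "/en/read/sports",
                 "/en/read/technology", "/en/read/entertainment"]


def section_urls(base_url, section_links):
    """Map each category name (the last path segment) to its full section URL."""
    return {
        link.rstrip('/').rsplit('/', 1)[-1]: urljoin(base_url, link)
        for link in section_links
    }


urls = section_urls(base_url, section_links)
for category, url in urls.items():
    print(category, url)
```

Each URL can then be passed to `requests.get` with a descriptive user agent, and the category key can fill the `'category'` field of the dictionaries that `get_news_articles` returns.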

In [65]:
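import os

def get_titanic_data():
    filename = "titanic.csv"

    # Return the cached copy if it already exists on disk
    if os.path.isfile(filename):
        return pd.read_csv(filename)
    else:
        # read the SQL query into a dataframe
        # (get_connection is assumed to be a helper defined elsewhere that
        # returns the connection string for the named database)
        df = pd.read_sql('SELECT * FROM passengers', get_connection('titanic_db'))

        # Write that dataframe to disk for later. Called "caching" the data for later.
        # DataFrame has no to_file method; to_csv matches the read_csv call above.
        df.to_csv(filename, index=False)

        # Return the dataframe to the calling code
        return df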
