# Scraping a Vanity Fair article
The goal is to scrape this article:
[Shame and Survival](https://www.vanityfair.com/style/society/2014/06/monica-lewinsky-humiliation-culture)

Features:
- The article uses a lot of special formatting that splits it into different sections, so scraping is not straightforward.
- Also, this time I would like to try making a class and to understand how exceptions are handled and passed between functions and the class that contains them.


## Preparation

In [29]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.vanityfair.com/style/society/2014/06/monica-lewinsky-humiliation-culture'
    
html_data = requests.get(url)
soup = BeautifulSoup(html_data.text, 'html.parser')

Could not get https://www.vanityfair.com/style/society/2014/06/monica-lewinsky-humiliation-culture


## HTML analysis
The article section is contained within `<div class="content paywall drop-cap" data-reactid="223">`.

Inside it, separate `<section class="content-section" ...>` elements each hold the `<p>` tags with the article text.

Now we extract this text.


In [None]:
# Locate the <div> that wraps the article body
article_container = soup.find('div', attrs={'class': 'content paywall drop-cap'})
# Collect every <p> tag inside that container
article_sections = article_container.find_all('p')
article_sections_untagged = [tag.text for tag in article_sections]
article_sections_untagged
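
Because the paragraphs live in separate `<section class="content-section">` blocks, it may be useful to keep them grouped by section rather than flattening everything. The following is only a minimal sketch, run on a small made-up HTML snippet: `sample_html` and `paragraphs_by_section` are hypothetical names, and only the class names are taken from the analysis above.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page; only the class names come from the analysis above.
sample_html = """
<div class="content paywall drop-cap">
  <section class="content-section"><p>First paragraph.</p><p>Second paragraph.</p></section>
  <section class="content-section"><p>Third paragraph.</p></section>
</div>
"""

def paragraphs_by_section(html):
    """Return one list of paragraph strings per content section (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_='content paywall drop-cap')
    if container is None:
        # The container was not found, so there is nothing to extract.
        return []
    sections = container.find_all('section', class_='content-section')
    return [[p.get_text(strip=True) for p in section.find_all('p')]
            for section in sections]

sections = paragraphs_by_section(sample_html)
print(sections)
```

Checking for `None` before calling `find_all` avoids an `AttributeError` when the page layout differs from what was expected.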


## Making a class
Now we start from scratch, but this time we make a class called Article.

Features:
- Pulling the data and processing it into soup are compartmentalised as two methods within the class. We then call both methods on initialisation.
- These methods need exception handling in case they fail.
- The class then needs to pass on the exception raised by whichever method fails when it runs them.
- We don't really know in advance which one will fail.
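
The idea in the list above can be sketched without any network access. This is only an illustration under assumptions: `Pipeline` and `build` are hypothetical stand-ins (for `Article` and for whoever creates it), and converting a string with `int` stands in for the real parsing step. It shows an exception travelling through three levels: a method, `__init__`, and the caller.

```python
class MyException(Exception):
    """Base class, so a caller can catch every failure of this pipeline at once."""

class PullDataException(MyException):
    pass

class ProcessDataException(MyException):
    pass

class Pipeline:
    """Hypothetical stand-in for Article: two steps run from __init__."""
    def __init__(self, raw):
        # No try/except needed here: an exception raised in either method
        # passes straight through __init__ to whoever created the object.
        self.raw = self.pull_data(raw)
        self.value = self.process_data()

    def pull_data(self, raw):
        if raw is None:
            raise PullDataException('Could not get data')
        return raw

    def process_data(self):
        try:
            return int(self.raw)  # a built-in ValueError may start here
        except ValueError as exc:
            # Translate the low-level error; "from exc" keeps the original as __cause__.
            raise ProcessDataException('Could not process ' + repr(self.raw)) from exc

def build(raw):
    """Outermost level: the caller decides what to do with any failure."""
    try:
        return Pipeline(raw).value
    except MyException as e:
        return type(e).__name__ + ': ' + str(e)

print(build('42'))
print(build(None))
print(build('abc'))
```

Catching the shared base class `MyException` at the outer level means the caller does not need to know in advance which step will fail.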


In [53]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.vanityfair.com/style/society/2014/06/monica-lewinsky-humiliation-culture'

class MyException(Exception):
    pass

class PullDataException(MyException):
    pass

class ProcessDataException(MyException):
    pass

class Article:
    """
    A class representing an article on the Vanity Fair website.
    
    Attributes:
        url (str): URL of the article
        html (Response object): the response returned by requests.get
        soupdata (soup object): the soup object after processing
    """
    def __init__(self, url):
        self.url = url
        try:
            self.html = self.pull_data()
            self.soupdata = self.process_data()
        except:
            raise 
        
    def pull_data(self):
        try:
            r = requests.get(self.url)
        except:
            raise PullDataException('Could not get ' + self.url)
        else:
            return r
        
    def process_data(self):
        try:
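            # 'htl.parser' is not a valid parser name, so bs4 raises FeatureNotFound here
            s = BeautifulSoup(self.html.text, 'htl.parser')
        # Bug: FeatureNotFound is defined in the bs4 module, not on the
        # BeautifulSoup class, hence the AttributeError in the output
        except BeautifulSoup.FeatureNotFound: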
            pass
            #raise ProcessDataException('Could not process ' + self.url)
        else:
            return s

mySoupObject = Article(url)

AttributeError: type object 'BeautifulSoup' has no attribute 'FeatureNotFound'
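
The `AttributeError` appears because `FeatureNotFound` is an exception class defined at module level in `bs4`, not an attribute of the `BeautifulSoup` class, so it has to be imported separately. Below is a minimal sketch of how the processing step could catch it. The standalone `process_data` function and its `parser` argument are assumptions made for illustration, not the method from the class above.

```python
from bs4 import BeautifulSoup, FeatureNotFound

class ProcessDataException(Exception):
    pass

def process_data(html_text, parser='html.parser'):
    """Parse html_text, translating bs4's FeatureNotFound into ProcessDataException."""
    try:
        return BeautifulSoup(html_text, parser)
    except FeatureNotFound as exc:
        # Re-raise as our own exception type, keeping the original as __cause__.
        raise ProcessDataException('Could not process with parser ' + repr(parser)) from exc

try:
    process_data('<p>hello</p>', parser='htl.parser')  # misspelled on purpose
except ProcessDataException as e:
    print(e)
```

With the exception imported from `bs4`, the misspelled parser name leads to a `ProcessDataException` instead of an unrelated `AttributeError`.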

## Conclusion
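- Inside a function, the usual pattern is to check a condition with `if` and then `raise` an exception. When running several functions in series, we wrap the calls in `try`, catch with `except SomeException as e`, and print `e`.
- I should really investigate exception propagation in Python. I don't yet know how to handle exceptions when there are three levels of calls.
- A bare `except:` with no exception type is bad practice, since it also catches things like `KeyboardInterrupt` and `SystemExit`.
- `except: raise` is OK though, because it re-raises the same exception unchanged.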
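
The points above can be tried out in a few lines. This is only a sketch with made-up names (`StepError`, `step_one`, `step_two`, `run_all` and `passthrough` are all hypothetical) and it uses no scraping code at all.

```python
class StepError(Exception):
    """Hypothetical base class for the steps below."""

def step_one(x):
    if x < 0:
        raise StepError('step_one: negative input')
    return x + 1

def step_two(x):
    if x > 10:
        raise StepError('step_two: input too large')
    return x * 2

def run_all(x):
    """Run the steps in series and report which one failed, if any."""
    try:
        x = step_one(x)
        x = step_two(x)
    except StepError as e:  # named type: unrelated bugs still surface as tracebacks
        return 'failed: ' + str(e)
    return 'ok: ' + str(x)

def passthrough(x):
    try:
        return step_one(x)
    except:  # bare except, but the exception is re-raised unchanged
        raise

print(run_all(1))
print(run_all(-1))
print(run_all(20))
```

Putting the step name in the message is one simple way to see which function failed when several run in series.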