# Argus

Scope:
* Sweden:
* Scandinavia:
* Europe:
* World

Scrape the main news for each country.

Do I need a manual approach for each newspaper? Tiresome. Let's try to automate things.
Grab the text element with the largest font on the main page?

Grab the elements, search for h1...h6, pick the first largest one (and the subtext?)?

No, font size is better.

## Links
https://en.wikipedia.org/wiki/List_of_newspapers_in_Sweden

## DONE
https://www.svd.se/

## Structure

Each scraper will have:
* URL
* Source Country
* Source Language
* Source Newspaper
*
* Scrape_function



In [541]:
import requests
from bs4 import BeautifulSoup
import pycountry
from googletrans import Translator


In [575]:
class mind_of_argus:
    """
    The Mind of Argus processes what all eyes see...
    This class takes in a destination language and a list of eyes.
    
    """
    def __init__(self, target_lang = 'en'):
        self.target_lang = target_lang
        self.translator = Translator()
    

    def communicate(self, url):
        eye = eye_of_argus(url)
        header = eye.perspective['header']
        excerpt = eye.perspective['excerpt']
        
        print(f"{eye.country}: {eye.newspaper}")

        trans_header = self.translator.translate(header, dest = self.target_lang).text
        trans_excerpt = self.translator.translate(excerpt, dest = self.target_lang).text
        print(trans_header)
        print(trans_excerpt)
        print()

        
        
        
a = mind_of_argus()



places = [
'https://www.aftonbladet.se/',
'https://www.svd.se/',
'https://www.gp.se/',
'https://www.dn.se/',
'https://www.expressen.se/',
'https://politiken.dk/',
'https://www.spiegel.de/',
'https://www.lemonde.fr/'
         
]
m = mind_of_argus()

In [576]:

for p in places:
    m.communicate(p)

(Auto): Sweden: (Auto): aftonbladet
Mass sales when the Swedes are corona bunkers
▸ Become a hard currency ✓ "Out of stock, out of stock"
(Auto): Sweden: (Auto): svd
Australia demands formal apology from Beijing
Prime Minister Scott Morrison calls China's behavior "disgusting."
(Auto): Sweden: (Auto): gp
After suspected copper thefts - entire department is replaced
Gothenburg Trams' CEO Hans Nilsson: "We simply need to start over"
(Auto): Sweden: (Auto): dn
Russia is fighting a devastating second wave
Death toll rises sharply - but the Kremlin wants to avoid a new shutdown.
(Auto): Sweden: (Auto): expressen
Large fire in reservoirs - risk of spreading
Magazine overheated in Bjuv - close to other buildings
(Auto): Denmark: (Auto): https://politiken
New restrictions in the capital: The spread of viruses among children and young people must go down
Not Found
(Auto): Germany: (Auto): spiegel
Marketplace
DISPLAY
(Auto): France: (Auto): lemonde
Services
Service: Conforama promo codes: free s

In [556]:
Translator.translate

'Service: vocational training Discover our distance learning courses Learn to code online Take your skills assessment online'

In [8]:
def get_soup(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.content)
    return soup


In [533]:
class eye_of_argus():
    """
    The Eyes of Argus sees all.
    
    The basic scraper class for this project.
    Each instance of an eye should at least provide an url, with which the eye will do it's best to get relevant content.
    Specifying the type of scraper, newspaper name etc make the returned information more relevant
    
    
    """
    def __init__(self, url, country = 'auto', newspaper = 'auto', language = 'auto', notes = 'None'):
        self.url = url
        self.country = country
        self.newspaper = newspaper
        self.language = language
        self.notes = notes
        self.perspective = self.get_perspective()
        self.format_perspective()
        
        if self.newspaper == 'auto':
            self.newspaper =  '(Auto): ' + self.url.split(sep = '.')[-2]
        
        if self.country =='auto':
            domain = self.url.split(sep = '.')[-1][:-1] 
            try:
                country = pycountry.countries.get(alpha_2=domain).name
                self.country = '(Auto): ' + country
            except:
                self.country = "Failed to infer"
            
    def get_soup(self):
        html = requests.get(self.url)
        soup = BeautifulSoup(html.content)
        return soup


    def get_perspective(self):
        content = self.get_soup()

        response = {
        'header' : "Not Found",
        'excerpt' : "Not Found"
        }
        
        if self.url not in manual.keys():
            response = self.naive_scrape(content, response)
        
        else:
            eye = manual[self.url]
            response = self.smart_scraper(content, response, eye)

        return response
    
    def format_perspective(self):
        #Make sure the content is a string. If not, assume it is a bs4 tag and extract the text element
        for key in self.perspective:
            
            if self.perspective[key] == '\n':
                self.perspective[key] = "Not Found"
                
            if type(self.perspective[key]) != type('string'):
                self.perspective[key] = self.perspective[key].text
            
            # Remove leading and trailing whitespace
            split = self.perspective[key].split()
            join = ' '.join(split)
            self.perspective[key] = join
            
    def naive_scrape(self, content, response):
        for head in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            rubric = content.find(head)
            if rubric != None:
                response['header'] = rubric
                response['excerpt'] = rubric.findNext()
                if response['excerpt'] == '\n':
                    response['excerpt'] = rubric.findNext().findNext()
                    
        return response
    
    def smart_scraper(self, content, response, specific_scraper):
        """
        The smart_scraper is a higher-order function
        It relies on the existence of functions with a specific format.
        That is, a function that retures the header element of content when 'header' is specified'
        and the excerpt element when 'excerpt is specified'
        """
    
        header = specific_scraper(content, 'header')
        if header != None:
            response['header'] = header
            excerpt = specific_scraper(content, 'excerpt')
            if excerpt != None:
                response['excerpt'] =  excerpt
        return response
   
    def describe(self, level = 0):
        auto = ''
        if self.url not in manual.keys():
            auto = '(Auto):'

        if level == 0:

            print(f"""
            SOURCE: {self.newspaper}
            Header: {auto} {self.perspective['header']}

            Excerpt: {auto} {self.perspective['excerpt']}
            """)
        else:
            
            print(f"""
            url: {self.url}
            country: {self.country}
            newspaper: {self.newspaper}
            language: {self.language}
            notes: {self.notes}

            Header:{auto} {self.perspective['header']}

            Excerpt: {auto} {self.perspective['excerpt']}
            """)
            
manual = {
    'https://www.aftonbladet.se/' : swe_aft,
    'https://www.svd.se/' : swe_svd,
    'https://www.gp.se/': swe_gp,
    'https://www.dn.se/': swe_dn,
    'https://www.expressen.se/': swe_exp,
    'https://politiken.dk/': dk_pol
      
}

In [500]:

# GERMANY

# DENMARK
def dk_pol(content, item):
    if item == 'header':
        return content.find('h2', attrs={'class': 'article-intro__title headline headline--xxxlarge'})
    if item == 'excerpt':
        return content.find('h2', attrs={'class': 'article-intro__title headline headline--xxxlarge'}).parent.find('ul', attrs={'class': 'article-intro__related'})

# SWEDEN

def swe_exp(content, item):
    if item == 'header':
        return content.find('div', attrs={'class': 'teaser'}).find('h2')
    if item == 'excerpt':
        return content.find('div', attrs={'class': 'teaser'}).find('h2').next_sibling

def swe_dn(content, item):
    if item == 'header':
        return content.find('div', attrs= {'class': 'teaser-package__content'}).find('h1')
    if item == 'excerpt':
        return content.find('div', attrs= {'class': 'teaser-package__content'}).find('h1').next_sibling.next_sibling

def swe_gp(content, item):
    if item == 'header':
        return content.find("div",  attrs = {'class': 'c-teaser__content'}).find("h2", attrs = {'class': 'c-teaser__title'})
    if item == 'excerpt':
        return content.find("div", attrs = {'class': 'c-teaser__summery'})
    
def swe_aft(content, item):
    if item == 'header':
        return content.find('h3')
    if item == 'excerpt':
        return content.find('h3').next_sibling
    
def swe_svd(content, item):
    if item == 'header':
        return content.find('h2')
    if item == 'excerpt':
        return content.find('h2').next_sibling.next_sibling