# BeautifulSoup

**html.parser** - BeautifulSoup(markup, "html.parser")
* Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)
* Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

**lxml** - BeautifulSoup(markup, "lxml")
* Advantages: Very fast, Lenient
* Disadvantages: External C dependency

**html5lib** - BeautifulSoup(markup, "html5lib")
* Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
* Disadvantages: Very slow, External Python dependency
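
Because `lxml` and `html5lib` are external dependencies, a script may want to fall back to the built-in parser when they are not installed. The following is a minimal sketch of that idea, not part of the original notebook: `make_soup` and its `preferred` ordering are hypothetical names chosen for illustration, and it relies on `BeautifulSoup` raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately broken markup: neither <p> nor <b> is closed.
soup, used = make_soup("<p>Unclosed paragraph<b>bold")
print(used, "->", soup.p.get_text())
```

Which parser ends up in `used` depends on what is installed, so the resulting tree can differ between machines for badly broken markup.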

In [95]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.wikipedia.org").read().decode('utf-8')
#print(html)

In [98]:
soup = BeautifulSoup(html, features='html.parser')
print(soup.h1, "\n\n", soup.p)

<h1 class="central-textlogo-wrapper">
<span class="central-textlogo__image sprite svg-Wikipedia_wordmark">
Wikipedia
</span>
<strong class="jsl10n localized-slogan" data-jsl10n="slogan">The Free Encyclopedia</strong>
</h1> 

 <p class="jsl10n" data-jsl10n="app-links.description">
Save your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.
</p>
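
Attribute access such as `soup.h1` returns only the first matching tag, the same element `soup.find("h1")` would return. The sketch below shows this on a small inline snippet (made up for illustration, loosely modelled on the output above) so it runs without a network call.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the real Wikipedia page.
html = (
    '<h1 class="title"><span>Wikipedia</span> '
    "<strong>The Free Encyclopedia</strong></h1>"
    "<p>First</p><p>Second</p>"
)
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.h1                                   # shorthand for soup.find("h1")
heading_text = first_h1.get_text(" ", strip=True)    # join text pieces with a space
heading_classes = first_h1["class"]                  # class is returned as a list
first_paragraph = soup.p.get_text()                  # only the first <p> is returned

print(heading_text)
print(heading_classes)
print(first_paragraph)
```

To collect every `<p>` rather than the first one, `find_all` is the tool to reach for.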


## Scrape GitHub trending using `find_all`


```python
def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    """Look in the children of this PageElement and find all
    PageElements that match the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet
    """
```
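
The parameters in the docstring above can be tried out on a small document. This is a sketch on made-up markup (the `id`, class names, and hrefs are invented for illustration) showing how `name`, `attrs`, `limit`, and `recursive` each narrow the result.

```python
from bs4 import BeautifulSoup

# Invented markup: two direct-child links and one nested inside a <p>.
html = """
<div id="root">
  <a class="repo" href="/a/one">one</a>
  <p><a class="repo" href="/b/two">two</a></p>
  <a class="user" href="/c">c</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", id="root")

all_links = root.find_all("a")                            # every <a> below root
repo_links = root.find_all("a", attrs={"class": "repo"})  # filter on an attribute
first_only = root.find_all("a", limit=1)                  # stop after one result
direct = root.find_all("a", recursive=False)              # direct children only

print(len(all_links), len(repo_links), len(first_only))
print([a["href"] for a in direct])
```

With `recursive=False` the link nested inside `<p>` is skipped, because only the direct children of `root` are considered.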

In [57]:
# Sample repository heading link copied from the trending page.
# The parser is named explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup("""<a data-hydro-click="{&quot;event_type&quot;:&quot;explore.click&quot;,&quot;payload&quot;:{&quot;click_context&quot;:&quot;TRENDING_REPOSITORIES_PAGE&quot;,&quot;click_target&quot;:&quot;REPOSITORY&quot;,&quot;click_visual_representation&quot;:&quot;REPOSITORY_NAME_HEADING&quot;,&quot;actor_id&quot;:8834824,&quot;record_id&quot;:364705340,&quot;originating_url&quot;:&quot;https://github.com/trending&quot;,&quot;user_id&quot;:8834824}}" data-hydro-click-hmac="026cba0368a35bfcf8f67322e96885e0ab123e75fb5e407bd02e1e0c6ea9b308" href="/google/fully-homomorphic-encryption" data-view-component="true">       <span data-view-component="true" class="text-normal">google /</span> fully-homomorphic-encryption</a>""", "html.parser")

In [108]:
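def match_class(target):
    """Build a filter that matches tags whose class attribute is exactly `target`.

    BeautifulSoup treats class as a list of values rather than a single
    string, so a search for class='d-inline-block' also matches
    class='Link--muted d-inline-block mr-3'. Comparing the joined class
    string against `target` rules out those partial matches.
    """
    def do_match(tag):
        classes = " ".join(tag.get('class', []))
        return classes == target
    return do_match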
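
The difference between the default class matching and an exact match can be checked on two made-up links. The sketch below re-declares a compact version of the filter under the hypothetical name `exact_class` so it runs on its own; the hrefs are invented.

```python
from bs4 import BeautifulSoup

# Invented markup: one link with a single class, one with three classes.
html = """
<a class="d-inline-block" href="/dev1">dev1</a>
<a class="Link--muted d-inline-block mr-3" href="/repo/stargazers">stars</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: a tag matches if ANY of its classes equals the value.
loose = soup.find_all("a", attrs={"class": "d-inline-block"})


def exact_class(target):
    """Hypothetical stand-in for the exact-match filter described above."""
    def do_match(tag):
        return " ".join(tag.get("class", [])) == target
    return do_match


# Function filter: a tag matches only if its whole class string equals the target.
strict = soup.find_all(exact_class("d-inline-block"))

print(len(loose), len(strict))
```

The function filter is called for every tag in the tree, which is why it defaults to an empty list for tags without a `class` attribute.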

In [89]:
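import requests

page = requests.get('https://github.com/trending')
soup = BeautifulSoup(page.text, 'html.parser')
# Each trending repository sits in its own <article class="Box-row">.
repos = soup.find_all(name='article', attrs={"class": "Box-row"})
for repo in repos:
    # Take the first link carrying data-view-component="true" and keep
    # the last path segment of its href as the repository name.
    name = repo.find(name="a", attrs={"data-view-component": "true"})
    print("repo: " + name["href"].split("/")[-1])
    # Developer links are the tags whose class is exactly "d-inline-block".
    devs = repo.find_all(match_class("d-inline-block"))
    for i, dev in enumerate(devs):
        print(f"developer {i}: " + dev["href"].replace("/", ""))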

repo: fully-homomorphic-encryption
developer 0: dibakch
repo: jd
developer 0: lsh26
developer 1: star261
repo: Watchy
developer 0: sqfmi
developer 1: dandelany
developer 2: LeonMatthes
developer 3: kicker22004
developer 4: per1234
repo: jina
developer 0: hanxiao
developer 1: jina-bot
developer 2: JoanFM
developer 3: nan-wang
developer 4: alexcg1
repo: tinygrad
developer 0: geohot
developer 1: marcelbischoff
developer 2: ryanneph
developer 3: adriangb
developer 4: Liamdoult
repo: turbo-rails
developer 0: dhh
developer 1: javan
developer 2: kaspth
developer 3: alexandreruban
developer 4: georgeclaghorn
repo: NvChad
developer 0: siduck76
developer 1: Vanderscycle
developer 2: marvelman3284
developer 3: mTvare6
developer 4: jaydamani
repo: jwasham
developer 0: jwasham
developer 1: avizmarlon
developer 2: YoSaucedo
developer 3: aleen42
developer 4: Ilyushin
repo: chriskiehl
developer 0: chriskiehl
developer 1: Shura1oplot
developer 2: eladeyal-intel
developer 3: Roshgar
developer 4: conradh
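
Since the page markup can change at any time, it may help to separate parsing from fetching so the parsing logic can be checked against a saved snippet. The sketch below is one way to do that under stated assumptions: `parse_trending` is a hypothetical helper, and `SAMPLE` is hand-written markup that only imitates the structure used above (an `article.Box-row`, a heading link with `data-view-component="true"`, and developer links with class `d-inline-block`).

```python
from bs4 import BeautifulSoup

# Hand-written imitation of one trending entry; not real GitHub output.
SAMPLE = """
<article class="Box-row">
  <h1><a data-view-component="true" href="/google/fully-homomorphic-encryption">
    google / fully-homomorphic-encryption</a></h1>
  <a class="Link--muted d-inline-block mr-3"
     href="/google/fully-homomorphic-encryption/stargazers">stars</a>
  <a class="d-inline-block" href="/dibakch">dibakch</a>
</article>
"""


def parse_trending(html):
    """Hypothetical helper: map 'owner/repo' to a list of developer names."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for repo in soup.find_all("article", attrs={"class": "Box-row"}):
        link = repo.find("a", attrs={"data-view-component": "true"})
        if link is None or not link.has_attr("href"):
            continue  # skip entries that do not have the expected heading link
        devs = [
            a["href"].strip("/")
            for a in repo.find_all("a", href=True)
            if a.get("class") == ["d-inline-block"]  # exact class match
        ]
        results[link["href"].strip("/")] = devs
    return results


parsed = parse_trending(SAMPLE)
print(parsed)
```

In a live run the argument would be `page.text` from the request instead of `SAMPLE`; skipping entries without a heading link avoids a `TypeError` when the layout differs from what the code expects.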