# Scraping SoundCloud

## Initial Setup

In [47]:
# import packages
import requests
from bs4 import BeautifulSoup

In [48]:
from requests_html import AsyncHTMLSession

In [49]:
session = AsyncHTMLSession()

## Connect to SoundCloud

In [22]:
# make connection to webpage
resp = requests.get("https://soundcloud.com/discover")

In [23]:
# get HTML from response object
html = resp.content

In [24]:
# convert HTML to BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

## Get links on the webpage. Notice that this doesn't extract all the links visible on the webpage. What can we do about that?

In [25]:
links = soup.find_all("a")
links

[<a class="header__logoLink sc-border-box sc-ir" href="/" title="Home">SoundCloud</a>,
 <a class="sc-button sc-button-medium" href="http://www.enable-javascript.com/" target="_blank">Show me how to enable it</a>,
 <a href="https://help.soundcloud.com/hc/articles/115003564308-Technical-requirements">Need help?</a>,
 <a href="http://google.com/chrome" target="_blank" title="Chrome">Chrome</a>,
 <a href="http://firefox.com" target="_blank" title="Firefox">Firefox</a>,
 <a href="http://apple.com/safari" target="_blank" title="Safari">Safari</a>,
 <a href="http://windows.microsoft.com/ie" target="_blank" title="Internet Explorer">Internet Explorer</a>,
 <a class="sc-button" href="https://help.soundcloud.com" id="try-again" target="_blank">I need help</a>,
 <a href="/popular/searches" title="Popular searches">Popular searches</a>]
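The nine links above are mostly help and browser-download links, which suggests the server returned the markup it serves to clients that do not run JavaScript. `requests` only downloads that markup and never executes scripts. As a minimal offline sketch of what a parser sees in static HTML, the block below pulls href/text pairs out of three anchors copied from the output. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages; `LinkCollector` and `STATIC_HTML` are illustrative names, not part of the notebook.

```python
from html.parser import HTMLParser

# Three of the nine anchors, copied from the static output above.
STATIC_HTML = """
<a class="header__logoLink sc-border-box sc-ir" href="/" title="Home">SoundCloud</a>
<a class="sc-button sc-button-medium" href="http://www.enable-javascript.com/" target="_blank">Show me how to enable it</a>
<a href="/popular/searches" title="Popular searches">Popular searches</a>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed(STATIC_HTML)
collector.close()

for href, text in collector.links:
    print(href, "->", text)
```

Any parser working on this markup can only return what the server sent, so the missing links have to come from running the page's JavaScript first.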

## 1) Use requests-html to extract other links on the page by executing JavaScript.  How many links do you see now?
## 2) After you complete 1), get the text of the new paragraphs now visible in the HTML.
## 3) Try out a few other tags - what else appears after executing the JavaScript?
## 4) Using a CSS selector, extract the meta tag with name = "keywords".  Can you get this tag's attributes?
## 5) Links that automatically open in a new tab are identified by having a "target" attribute equal to "_blank".  Try extracting these links and their URLs.

----------


## 1) Use requests-html to extract other links on the page by executing JavaScript

In [50]:
# fetch the page with the async session; most of its content is generated by JavaScript

r = await session.get("https://soundcloud.com/discover")

In [51]:
r.status_code

200

In [52]:
# execute the page's JavaScript in a headless browser and update r.html with the result
await r.html.arender()

In [53]:
new_links = r.html.find('a')
new_links

[<Element 'a' href='/' title='Home' class=('header__logoLink', 'header__logoLink-iconOnly', 'sc-border-box', 'sc-ir')>,
 <Element 'a' href='/' title='Home' class=('header__logoLink', 'header__logoLink-wordmark', 'sc-border-box', 'sc-ir')>,
 <Element 'a' class=('header__navMenuItem',) data-menu-name='home' href='/discover'>,
 <Element 'a' class=('header__navMenuItem',) data-menu-name='stream' href='/stream'>,
 <Element 'a' class=('header__navMenuItem',) data-menu-name='library' href='/you/library'>,
 <Element 'a' href='/upload' class=('uploadButton', 'header__link') tabindex='0'>,
 <Element 'a' href='' class=('header__moreButton', 'sc-ir') tabindex='0' aria-haspopup='true' role='button' aria-owns='dropdown-button-98'>,
 <Element 'a' href='/pages/cookies' class=('sc-link-dark',)>,
 <Element 'a' class=('playableTile__artworkLink', 'audibleTile__artworkLink') href='/capitalcqu/sets/midnight-lofi'>,
 <Element 'a' role='button' href='' class=('sc-button-play', 'playButton', 'sc-button', 'm-s

In [54]:
len(links), len(new_links)

(9, 169)
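The counts show how many more links rendering exposes, but not which ones. One way to see the difference is to compare the two sets of href values. The sketch below uses a small sample of hrefs typed in from the outputs above so that it runs offline. In the notebook the sets could presumably be built with something like `{a.get("href") for a in links}` and `{a.attrs.get("href") for a in new_links}`, which is an assumption about how you would wire it up. The variable names are illustrative.

```python
# Sample hrefs copied from the outputs above; the real sets are larger,
# so the results below describe this sample only.
static_hrefs = {"/", "/popular/searches", "http://firefox.com"}
rendered_hrefs = {
    "/",
    "/discover",
    "/stream",
    "/upload",
    "/capitalcqu/sets/midnight-lofi",
}

only_after_render = rendered_hrefs - static_hrefs   # appear only once JavaScript has run
only_in_static = static_hrefs - rendered_hrefs      # appear only in the fallback markup
in_both = static_hrefs & rendered_hrefs             # present either way

print("only after render:", sorted(only_after_render))
print("only in static:   ", sorted(only_in_static))
print("in both:          ", sorted(in_both))
```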

-------

## 2) Get the text of the new paragraphs now visible in the HTML

In [64]:
paragraphs = r.html.find('p')
paragraphs

[<Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>,
 <Element 'p' class=('mixedSelectionModule__description', 'sc-type-small', 'sc-type-light')>]

In [65]:
for p in paragraphs:
    print(p.text)

Popular playlists from the SoundCloud community
Popular playlists from the SoundCloud community
Popular playlists from the SoundCloud community
Popular playlists from the SoundCloud community
Popular playlists from the SoundCloud community
Popular playlists from the SoundCloud community
Up-and-coming tracks on SoundCloud
The most played tracks on SoundCloud this week
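Several paragraphs carry the same text, so it can help to count or de-duplicate them. The sketch below works on the eight strings printed above, typed in as a list so it runs offline; in the notebook the list would come from `[p.text for p in paragraphs]`.

```python
from collections import Counter

# The eight paragraph texts printed above.
texts = (
    ["Popular playlists from the SoundCloud community"] * 6
    + [
        "Up-and-coming tracks on SoundCloud",
        "The most played tracks on SoundCloud this week",
    ]
)

# How often each distinct text occurs.
counts = Counter(texts)
for text, n in counts.most_common():
    print(n, text)

# Distinct texts in order of first appearance (dict keys keep insertion order).
unique_texts = list(dict.fromkeys(texts))
print(unique_texts)
```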


## 3) Try out a few other tags: what else appears after executing the JavaScript?

In [66]:
r.html.find('div')

[<Element 'div' id='app'>,
 <Element 'div' class=('header__inner', 'l-container', 'l-fullwidth')>,
 <Element 'div' class=('header__left',)>,
 <Element 'div' class=('header__logo', 'left')>,
 <Element 'div' class=('header__middle',)>,
 <Element 'div' class=('header__search',) role='search'>,
 <Element 'div' class=('header__right', 'sc-clearfix')>,
 <Element 'div' class=('header__loginMenu', 'left')>,
 <Element 'div' class=('header__upload', 'left')>,
 <Element 'div' class=('announcements', 'g-z-index-fixed-top', 'g-z-index-header')>,
 <Element 'div' class=('sc-list-nostyle', 'sc-clearfix')>,
 <Element 'div' class=('announcements__item', 'sc-clearfix')>,
 <Element 'div' class=('announcement', 'g-dark', 'm-dismiss-visible')>,
 <Element 'div' class=('l-container', 'l-content')>,
 <Element 'div' class=('l-product-banners', 'l-inner-fullwidth')>,
 <Element 'div' class=('l-container',)>,
 <Element 'div' >,
 <Element 'div' id='content' role='main'>,
 <Element 'div' >,
 <Element 'div' class=('l

In [67]:
r.html.links

{'/',
 '//blog.soundcloud.com',
 '//creators.soundcloud.com',
 '/agamidae',
 '/agamidae/sets/instrumental-7',
 '/billie-kihega',
 '/billie-kihega/sets/billies-sleep-sounds',
 '/capitalcqu',
 '/capitalcqu/sets/midnight-lofi',
 '/charts/top',
 '/digitalstreams',
 '/digitalstreams/sets/deephousehits',
 '/discover',
 '/discover/sets/charts-top:all-music',
 '/discover/sets/charts-top:alternativerock',
 '/discover/sets/charts-top:ambient',
 '/discover/sets/charts-top:classical',
 '/discover/sets/charts-top:country',
 '/discover/sets/charts-trending:all-music',
 '/discover/sets/charts-trending:alternativerock',
 '/discover/sets/charts-trending:ambient',
 '/discover/sets/charts-trending:classical',
 '/discover/sets/charts-trending:country',
 '/egemannen',
 '/egemannen/sets/bed-bug-classic-ii',
 '/fitnation-egypt',
 '/fitnation-egypt/sets/jointhefitnation',
 '/goldsgymegypt',
 '/goldsgymegypt/sets/dubstep-by-golds-gym-egypt',
 '/ilyanaazman',
 '/ilyanaazman/sets/tsyn',
 '/imprint',
 '/itsulana'

In [68]:
r.html.absolute_links

{'https://blog.soundcloud.com',
 'https://creators.soundcloud.com',
 'https://itunes.apple.com/us/app/soundcloud/id336353151?mt=8',
 'https://play.google.com/store/apps/details?id=com.soundcloud.android&hl=us&referrer=utm_source%3Dsoundcloud%26utm_medium%3Dweb%26utm_campaign%3Dweb_xsell_discover_page',
 'https://soundcloud.com/',
 'https://soundcloud.com/agamidae',
 'https://soundcloud.com/agamidae/sets/instrumental-7',
 'https://soundcloud.com/billie-kihega',
 'https://soundcloud.com/billie-kihega/sets/billies-sleep-sounds',
 'https://soundcloud.com/capitalcqu',
 'https://soundcloud.com/capitalcqu/sets/midnight-lofi',
 'https://soundcloud.com/charts/top',
 'https://soundcloud.com/digitalstreams',
 'https://soundcloud.com/digitalstreams/sets/deephousehits',
 'https://soundcloud.com/discover',
 'https://soundcloud.com/discover/sets/charts-top:all-music',
 'https://soundcloud.com/discover/sets/charts-top:alternativerock',
 'https://soundcloud.com/discover/sets/charts-top:ambient',
 'http
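Comparing the two outputs, `links` returns hrefs as written in the page (including protocol-relative ones such as `//blog.soundcloud.com`), while `absolute_links` returns full URLs. The same resolution rules can be reproduced with the standard library's `urljoin`. This is a sketch of the idea, not necessarily how requests-html implements it; `BASE_URL` is assumed to be the page that was fetched.

```python
from urllib.parse import urljoin

# The page the hrefs were found on (assumed to be the URL fetched above).
BASE_URL = "https://soundcloud.com/discover"

# A few hrefs copied from the r.html.links output.
hrefs = [
    "/",
    "//blog.soundcloud.com",          # protocol-relative: inherits the base scheme
    "/charts/top",                    # root-relative: inherits scheme and host
    "/agamidae/sets/instrumental-7",
]

# Map each raw href to the absolute URL it resolves to.
absolute = {href: urljoin(BASE_URL, href) for href in hrefs}

for href, url in absolute.items():
    print(href, "->", url)
```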

## 4) Using a CSS selector, extract the meta tag with name = "keywords".

In [75]:
r.html.find('meta[name=keywords]')

[<Element 'meta' content='record, sounds, share, sound, audio, tracks, music, soundcloud' name='keywords'>]
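The exercise also asks about the tag's attributes. In requests-html, an element's attributes are available as a dictionary through its `attrs` property, so the first match's `attrs` should give the name and content shown in the repr above. The sketch below types that dictionary in by hand so it runs offline, then splits the content into individual keywords; `meta_attrs` is an illustrative name.

```python
# Attribute dictionary as shown in the element repr above
# (typed in by hand here; in the notebook it would come from the element's attrs).
meta_attrs = {
    "content": "record, sounds, share, sound, audio, tracks, music, soundcloud",
    "name": "keywords",
}

# Split the comma-separated content into a clean list of keywords.
keywords = [word.strip() for word in meta_attrs["content"].split(",")]

print(meta_attrs["name"])
print(keywords)
```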

## 5) Links that automatically open in a new tab are identified by having a "target" attribute equal to "_blank". Try extracting these links and their URLs.

In [90]:
a_list = r.html.find('a[target=_blank][href]')
a_list

[<Element 'a' href='https://itunes.apple.com/us/app/soundcloud/id336353151?mt=8' target='_blank' class=('mobileAppsButtons__button', 'mobileAppsButtons__appStore', 'g-appStoreButton', 'g-appStoreButton__appStore', 'sc-ir')>,
 <Element 'a' href='https://play.google.com/store/apps/details?id=com.soundcloud.android&hl=us&referrer=utm_source%3Dsoundcloud%26utm_medium%3Dweb%26utm_campaign%3Dweb_xsell_discover_page' target='_blank' class=('mobileAppsButtons__button', 'mobileAppsButtons__googlePlay', 'g-appStoreButton', 'g-appStoreButton__googlePlay', 'sc-ir')>,
 <Element 'a' class=('sc-link-verylight',) href='//creators.soundcloud.com' target='_blank' title='Creator Resources'>,
 <Element 'a' class=('sc-link-verylight',) href='//blog.soundcloud.com' target='_blank' title='SoundCloud blog'>]

In [93]:
for link in a_list:
    print(link.absolute_links)

{'https://itunes.apple.com/us/app/soundcloud/id336353151?mt=8'}
{'https://play.google.com/store/apps/details?id=com.soundcloud.android&hl=us&referrer=utm_source%3Dsoundcloud%26utm_medium%3Dweb%26utm_campaign%3Dweb_xsell_discover_page'}
{'https://creators.soundcloud.com'}
{'https://blog.soundcloud.com'}
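Because `absolute_links` returns a set for each element, the loop prints four one-element sets. If a single flat list of URLs is more convenient, the sets can be merged. The sketch below uses the four sets printed above, typed in so it runs offline; in the notebook the list would be `[link.absolute_links for link in a_list]`.

```python
# The four one-element sets printed above.
per_link_sets = [
    {"https://itunes.apple.com/us/app/soundcloud/id336353151?mt=8"},
    {"https://play.google.com/store/apps/details?id=com.soundcloud.android&hl=us&referrer=utm_source%3Dsoundcloud%26utm_medium%3Dweb%26utm_campaign%3Dweb_xsell_discover_page"},
    {"https://creators.soundcloud.com"},
    {"https://blog.soundcloud.com"},
]

# Merge all sets into one, then sort for a stable, readable order.
blank_target_urls = sorted(set().union(*per_link_sets))

for url in blank_target_urls:
    print(url)
```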
