# Capstone Recap: APIs and Webscraping

1. Twint for tweets: https://pypi.org/project/twint/
2. Spotipy for Spotify data: https://spotipy.readthedocs.io/en/2.16.1/
3. Scraping XML from BoardGameGeek
4. Scraping apartments from Craigslist

## Twint

In [1]:
import twint
import pandas as pd

import nest_asyncio  # Twint runs on asyncio; this lets it execute inside Jupyter's already-running event loop
nest_asyncio.apply()

In [2]:
c = twint.Config()
c.Search = "covid"
c.Min_likes = 100
c.Count = True
c.Limit = 100
c.Store_csv = True
c.Output = 'covidtweets.csv'

In [3]:
twint.run.Search(c)

1357042101897297921 2021-02-03 14:03:44 -0500 <AP> AstraZeneca’s COVID-19 vaccine appears to reduce transmission of the virus and offer strong protection for three months on just a single dose, researchers say. The preliminary findings appears to be good news in the effort to curb the spread of the virus.  https://t.co/01zRYbTEUc
1357041513885282317 2021-02-03 14:01:24 -0500 <KloppStyle> @LFC Please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid please don’t be covid
1357041455487942658 2021-02-03 14:01:10 -0500 <tumbaburross> Siempre hemos tenido deficiencias en nuestro sistema de salud, pero lo de hoy es inhumano.  Están dejando morir a la gente DE TODO, no solamente de Covid.  ¡Algún día los haremos pagar por tanto daño desgraciados!
1357041235752607745 2021-02-03 14:00:17 -0500 <HaddadDebochado> L

In [5]:
tweets = pd.read_csv('covidtweets.csv')

In [7]:
tweets['tweet']

0      Importante! Anvisa modifica os requisitos míni...
1      Join the CDC #COVID19 Partner Update Call Mond...
2      Trying to get things done but can’t stop think...
3      Rogan's wildly inconsistent responses to the C...
4      Unbelievable: Instead of paying essential work...
                             ...                        
195    Chris Whitty on Covid deniers:  “If you don’t ...
196    A la derecha, un mexicano con necesidad extrem...
197    Based on my own experiencing caring for patien...
198    Pareciera que la secretaria se manifiesta en c...
199    &gt;@EstadaoPolitica Senador Randolfe Rodrigue...
Name: tweet, Length: 200, dtype: object
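
The raw tweet text still carries HTML entities (note the `&gt;` in row 199) and shortened links. A minimal cleaning sketch follows, assuming you want plain text for later analysis. The helper name `clean_tweet` and the sample string are made up for illustration and are not part of Twint.

```python
import html
import re

URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet(text):
    """Unescape HTML entities, drop URLs, and collapse whitespace."""
    text = html.unescape(text)          # '&gt;' becomes '>'
    text = URL_PATTERN.sub("", text)    # remove links such as https://t.co/...
    return " ".join(text.split())       # collapse runs of whitespace

# Hypothetical sample stitched together from the output above
sample = "&gt;@EstadaoPolitica Senador  Randolfe https://t.co/01zRYbTEUc"
cleaned = clean_tweet(sample)
```

On the DataFrame this could be applied with `tweets['tweet'].map(clean_tweet)`.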

## Spotipy
* Documentation: https://spotipy.readthedocs.io/en/2.16.1/
* Spotify's API: https://developer.spotify.com/dashboard/applications
* Additional Spotify datasets: https://research.atspotify.com/datasets/

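Don't forget to store your credentials outside the notebook! Here they live in a local `config.py`, which should stay out of version control.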
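
The `config` module imported below is not shown in this recap. One possible way to keep keys out of the notebook is to read them from environment variables. The sketch below assumes the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`, which are the ones Spotipy itself looks for. The function name and the sample values are hypothetical.

```python
import os

def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Hypothetical values passed in directly, for illustration only
client_id, client_secret = load_spotify_credentials(
    {"SPOTIPY_CLIENT_ID": "abc", "SPOTIPY_CLIENT_SECRET": "xyz"}
)
```

A `config.py` could call this once and expose `client_id` and `client_secret` as module attributes.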

In [21]:
import config
import sys
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(config.client_id, config.client_secret))

In [22]:
my_user_id = config.user

In [23]:
playlists = sp.user_playlists(my_user_id)['items']
playlist_ids = [p['id'] for p in playlists]
sp.playlist_tracks(playlist_ids[0])['items']

[{'added_at': '2021-01-15T14:26:44Z',
  'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/125382293'},
   'href': 'https://api.spotify.com/v1/users/125382293',
   'id': '125382293',
   'type': 'user',
   'uri': 'spotify:user:125382293'},
  'is_local': False,
  'primary_color': None,
  'track': {'album': {'album_type': 'single',
    'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3gIRvgZssIb9aiirIg0nI3'},
      'href': 'https://api.spotify.com/v1/artists/3gIRvgZssIb9aiirIg0nI3',
      'id': '3gIRvgZssIb9aiirIg0nI3',
      'name': 'Jeremy Zucker',
      'type': 'artist',
      'uri': 'spotify:artist:3gIRvgZssIb9aiirIg0nI3'},
     {'external_urls': {'spotify': 'https://open.spotify.com/artist/5JMLG56F1X5mFmWNmS0iAp'},
      'href': 'https://api.spotify.com/v1/artists/5JMLG56F1X5mFmWNmS0iAp',
      'id': '5JMLG56F1X5mFmWNmS0iAp',
      'name': 'Chelsea Cutler',
      'type': 'artist',
      'uri': 'spotify:artist:5JMLG56F1X5mFmWNmS0iAp'}],
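
The nested structure above is awkward to analyze directly. Here is a rough sketch of flattening it into one row per track. It assumes each `track` dict also has top-level `name` and `artists` keys (the output above is cut off before they appear), and the sample item is a trimmed, hypothetical stand-in.

```python
def track_rows(items):
    """Flatten playlist items into one dict per track."""
    rows = []
    for item in items:
        track = item.get("track")
        if track is None:          # skip entries that carry no track data
            continue
        rows.append({
            "added_at": item["added_at"],
            "track": track["name"],
            "artists": ", ".join(a["name"] for a in track["artists"]),
        })
    return rows

# Hypothetical, heavily trimmed item in the shape shown above
sample_items = [{
    "added_at": "2021-01-15T14:26:44Z",
    "track": {
        "name": "example track",
        "artists": [{"name": "Jeremy Zucker"}, {"name": "Chelsea Cutler"}],
    },
}]
rows = track_rows(sample_items)
```

The result can be passed straight to `pd.DataFrame(rows)`.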


In [25]:
def spotipy_search(artist, track):
    # Spotify field filters take no space after the colon: artist:NAME track:NAME
    query = f'artist:{artist} track:{track}'
    return sp.search(q=query, limit=3)['tracks']['items']

In [26]:
spotipy_search('dua lipa', 'levitating')

[{'album': {'album_type': 'single',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/6M2wZ9GZgrQXHCFfjv46we'},
     'href': 'https://api.spotify.com/v1/artists/6M2wZ9GZgrQXHCFfjv46we',
     'id': '6M2wZ9GZgrQXHCFfjv46we',
     'name': 'Dua Lipa',
     'type': 'artist',
     'uri': 'spotify:artist:6M2wZ9GZgrQXHCFfjv46we'},
    {'external_urls': {'spotify': 'https://open.spotify.com/artist/4r63FhuTkUYltbVAg5TQnk'},
     'href': 'https://api.spotify.com/v1/artists/4r63FhuTkUYltbVAg5TQnk',
     'id': '4r63FhuTkUYltbVAg5TQnk',
     'name': 'DaBaby',
     'type': 'artist',
     'uri': 'spotify:artist:4r63FhuTkUYltbVAg5TQnk'}],
   'available_markets': ['AD',
    'AE',
    'AL',
    'AR',
    'AU',
    'BA',
    'BE',
    'BG',
    'BH',
    'BO',
    'BR',
    'BY',
    'CA',
    'CL',
    'CO',
    'CR',
    'CY',
    'CZ',
    'DK',
    'DO',
    'DZ',
    'EC',
    'EE',
    'EG',
    'ES',
    'FI',
    'FR',
    'GB',
    'GR',
    'GT',
    'HK',
    'HN',
 

In [27]:
sp.search(q='genre:pop', limit=10)['tracks']['items']

[{'album': {'album_type': 'single',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/1McMsnEElThX1knmY4oliG'},
     'href': 'https://api.spotify.com/v1/artists/1McMsnEElThX1knmY4oliG',
     'id': '1McMsnEElThX1knmY4oliG',
     'name': 'Olivia Rodrigo',
     'type': 'artist',
     'uri': 'spotify:artist:1McMsnEElThX1knmY4oliG'}],
   'available_markets': ['AD',
    'AE',
    'AL',
    'AR',
    'AT',
    'AU',
    'BA',
    'BE',
    'BG',
    'BH',
    'BO',
    'BR',
    'BY',
    'CA',
    'CH',
    'CL',
    'CO',
    'CR',
    'CY',
    'CZ',
    'DE',
    'DK',
    'DO',
    'DZ',
    'EC',
    'EE',
    'EG',
    'ES',
    'FI',
    'FR',
    'GB',
    'GR',
    'GT',
    'HK',
    'HN',
    'HR',
    'HU',
    'ID',
    'IE',
    'IL',
    'IN',
    'IS',
    'IT',
    'JO',
    'JP',
    'KR',
    'KW',
    'KZ',
    'LB',
    'LI',
    'LT',
    'LU',
    'LV',
    'MA',
    'MC',
    'MD',
    'ME',
    'MK',
    'MT',
    'MX',
    'MY',
    'NI',
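
A single search call returns only one page of results. Here is a sketch of paging through more of them with the `limit` and `offset` arguments of `sp.search`. The helper `search_tracks` and the stand-in `fake_search` are hypothetical, and the page size of 50 is an assumption about Spotify's per-request cap.

```python
def search_tracks(search, query, total=100, page_size=50):
    """Collect up to `total` tracks by paging with limit/offset."""
    tracks = []
    while len(tracks) < total:
        limit = min(page_size, total - len(tracks))
        page = search(q=query, limit=limit, offset=len(tracks))["tracks"]["items"]
        if not page:               # no more results to fetch
            break
        tracks.extend(page)
    return tracks

# Stand-in for sp.search so the sketch runs without credentials
def fake_search(q, limit, offset):
    catalogue = [{"name": f"track {i}"} for i in range(70)]
    return {"tracks": {"items": catalogue[offset:offset + limit]}}

found = search_tracks(fake_search, "genre:pop", total=100, page_size=50)
```

With real credentials, the call would look like `search_tracks(sp.search, 'genre:pop', total=200)`.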

## Webscraping Boardgames

(And dealing with XML formats)

In [28]:
import json
import requests
import time
from bs4 import BeautifulSoup

In [29]:
url = 'https://boardgamegeek.com/xmlapi2/thing?id=3&type=boardgame'
req = requests.get(url).content
soup = BeautifulSoup(req, 'xml')
games = soup.find_all('item')
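
The cell above fetches a single game (`id=3`). To pull several, one option is to loop over ids and pause between requests, which is presumably why `time` is imported. This is only a sketch: the helper `fetch_many`, the 2-second delay, and the second id are assumptions, and the fetch function is injected so the example runs offline.

```python
import time

BASE_URL = "https://boardgamegeek.com/xmlapi2/thing?id={id}&type=boardgame"

def fetch_many(ids, fetch, delay=2.0, sleep=time.sleep):
    """Fetch one XML document per id, pausing between requests."""
    pages = {}
    for i, game_id in enumerate(ids):
        if i > 0:
            sleep(delay)           # wait between calls to go easy on the server
        pages[game_id] = fetch(BASE_URL.format(id=game_id))
    return pages

# Stand-ins so the sketch runs offline; for real use pass
# fetch=lambda u: requests.get(u).content and keep the default sleep
calls = []
pages = fetch_many([3, 13],
                   fetch=lambda u: calls.append(u) or b"<items/>",
                   sleep=lambda seconds: None)
```

Each value in `pages` can then go through `BeautifulSoup(page, 'xml')` exactly as above.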

In [34]:
games[0]

<item id="3" type="boardgame">
<thumbnail>https://cf.geekdo-images.com/o9-sNXmFS_TLAb7ZlZ4dRA__thumb/img/22MSUC0-ZWgwzhi_VKIbENJik1w=/fit-in/200x150/filters:strip_icc()/pic3211873.jpg</thumbnail>
<image>https://cf.geekdo-images.com/o9-sNXmFS_TLAb7ZlZ4dRA__original/img/TPKZgpNxB_C73RNbhKyP6UR76X0=/0x0/filters:format(jpeg)/pic3211873.jpg</image>
<name sortindex="1" type="primary" value="Samurai"/>
<name sortindex="1" type="alternate" value="Samouraï"/>
<name sortindex="1" type="alternate" value="Samurái"/>
<name sortindex="1" type="alternate" value="Samuraj"/>
<name sortindex="1" type="alternate" value="Самурай"/>
<name sortindex="1" type="alternate" value="侍"/>
<name sortindex="1" type="alternate" value="사무라이"/>
<description>Samurai is set in medieval Japan. Players compete to gain the favor of three factions: samurai, peasants, and priests, which are represented by helmet, rice paddy, and Buddha figures scattered about the board, which features the islands of Japan. The competition is 

In [30]:
names = games[0].find_all('name')
names

[<name sortindex="1" type="primary" value="Samurai"/>,
 <name sortindex="1" type="alternate" value="Samouraï"/>,
 <name sortindex="1" type="alternate" value="Samurái"/>,
 <name sortindex="1" type="alternate" value="Samuraj"/>,
 <name sortindex="1" type="alternate" value="Самурай"/>,
 <name sortindex="1" type="alternate" value="侍"/>,
 <name sortindex="1" type="alternate" value="사무라이"/>]

In [35]:
games[0].find_all('name')[0]['type']

'primary'

In [36]:
name = list(filter(lambda n: n['type'] == "primary", names))

In [37]:
name[0]['value']

'Samurai'
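
For comparison, the same primary-name lookup can be sketched with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The XML below is a trimmed copy of the response shown earlier, the `<items>` wrapper is an assumption about the root element, and `primary_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above
xml_text = """
<items>
  <item id="3" type="boardgame">
    <name sortindex="1" type="primary" value="Samurai"/>
    <name sortindex="1" type="alternate" value="Samuraj"/>
  </item>
</items>
"""

def primary_names(xml_text):
    """Map each item's id to the value of its primary <name> element."""
    root = ET.fromstring(xml_text)
    names = {}
    for item in root.iter("item"):
        primary = item.find("name[@type='primary']")
        if primary is not None:
            names[item.get("id")] = primary.get("value")
    return names

result = primary_names(xml_text)
```

The attribute predicate `[@type='primary']` replaces the `filter` call, and it works the same way when the response holds several items.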

In [39]:
games[0].find('description').text

'Samurai is set in medieval Japan. Players compete to gain the favor of three factions: samurai, peasants, and priests, which are represented by helmet, rice paddy, and Buddha figures scattered about the board, which features the islands of Japan. The competition is waged through the use of hexagonal tiles, each of which help curry favor of one of the three factions &mdash; or all three at once! Players can make lightning-quick strikes with horseback ronin and ships or approach their conquests more methodically. As each figure (helmets, rice paddies, and Buddhas) is surrounded, it is awarded to the player who has gained the most favor with the corresponding group.&#10;&#10;Gameplay continues until all the symbols of one type have been removed from the board or four figures have been removed from play due to a tie for influence.&#10;&#10;At the end of the game, players compare captured symbols of each type, competing for majorities in each of the three types. Ties are not uncommon and a

## Scraping Craigslist

https://newyork.craigslist.org/search/aap

In [40]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://newyork.craigslist.org/search/hhh?'
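# name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# getting each apartment's html as an element in a list

apartments = soup.find_all('li', {'class': 'result-row'})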



In [42]:
apartments[0]

<li class="result-row" data-pid="7272107009" data-repost-of="6998435258">
<a class="result-image gallery" data-ids="3:00z0z_byG0RkJRpzO_0CI0pO,3:00E0E_cVHFXCBXhxw_0CI0pO,3:01111_h1zE3Bb56Uc_0Ny1ck,3:00u0u_37A9KV9fXmv_0pO0CI,3:00V0V_jBnumoAuceK_0CI0pO,3:00O0O_kJ9lOOrjDJe_0CI0pO,3:00k0k_cP3WBg3LP5W_0CI0pO,3:00q0q_hnY6OeS9ScS_0t20CI,3:00x0x_2I9xDfltOCo_0t20CI,3:00M0M_eDMlMU02g9m_0t20CI,3:00b0b_2F3sfI9LUW_0CI0pO,3:00v0v_lLBdntamwai_0t20CI,3:00O0O_gzRXJZOcTOH_0t20CI,3:00Z0Z_2ba7RSRx3QY_0CI0pO,3:00Y0Y_kTHyu2DtvZ0_0CI0pO" href="https://newyork.craigslist.org/stn/roo/d/staten-island-12-minute-walk-from-ferry/7272107009.html">
<span class="result-price">$750</span>
</a>
<div class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2021-02-03 14:37" title="Wed 03 Feb 02:37:25 PM">Feb  3</time>
<h3 class="result-heading">
<a class="result-title hdrlnk" data-id="7272107009" href="https://

In [43]:
titles = [a.find('a', {'class': 'result-title hdrlnk'}).text for a in apartments]
prices = [a.find('span', {'class': 'result-price'}).text for a in apartments]

In [46]:
len(prices)

120

In [51]:
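# getting the neighborhood for each listing; some listings have none,
# so a plain loop handles the missing case instead of a list comprehension
hoods = []
for a in apartments:
    result = a.find('span', {'class': 'result-hood'})
    if result is None:
        hoods.append(None)
    else:
        hoods.append(result.text)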

# for 'hoods', some are NoneType

# putting it in a dataframe:

df = pd.DataFrame(columns = ['titles', 'prices', 'hoods'])
df['titles'] = titles
df['prices'] = prices
df['hoods'] = hoods
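
The missing-neighborhood check can also be folded into a small helper, which brings back the list comprehension and strips the parentheses at the same time. A sketch follows, with `hood_text` as a hypothetical name and `SimpleNamespace` standing in for a BeautifulSoup tag so it runs on its own.

```python
from types import SimpleNamespace

def hood_text(node):
    """Return the neighborhood without surrounding parentheses, or None."""
    if node is None:               # a.find(...) gives None when the span is absent
        return None
    return node.text.strip().strip("()").strip()

# SimpleNamespace stands in for a BeautifulSoup tag here
present = hood_text(SimpleNamespace(text=" (Staten Island)"))
missing = hood_text(None)
```

In the notebook this would read `hoods = [hood_text(a.find('span', {'class': 'result-hood'})) for a in apartments]`.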

In [52]:
df

|  | titles | prices | hoods |
|---|---|---|---|
| 0 | 12 minute walk from Ferry $750 | $750 | (Staten Island) |
| 1 | NEW CONSTRUCTION \| LUXURY CONDO \| TWO PRIVATE ... | $1,275 | (Bushwick) |
| 2 | FRESH ON THE MARKET | $700 | (East Flatbush) |
| 3 | PRIVATE OFFICE WITH ITS OWN PRIVATE TOILET & P... | $3,000 | (Brooklyn) |
| 4 | Attorney Offices and Cubicles for Rent | $0 | (Downtown Brooklyn) |
| ... | ... | ... | ... |
| 115 | 750 with all utilities included | $750 | (Staten Island) |
| 116 | Large 2 bedroom+Laundry great sunlight+ Patio ... | $2,665 | (Bergen street) |
| 117 | NO BROKER FEE - MUST SEE! High End 3 BR Apt wi... | $2,800 | (Brooklyn) |
| 118 | 4Bed w/ office space PRIME BEDSTUY in Brownstone | $4,200 | (Brooklyn) |
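
The prices come back as strings such as `$1,275`, so they need converting before any sorting or averaging. A minimal sketch, with `parse_price` as a hypothetical helper that assumes every price is a dollar sign followed by digits and optional commas:

```python
def parse_price(price):
    """Convert a string like '$1,275' to the integer 1275."""
    return int(price.replace("$", "").replace(",", ""))

# Sample values taken from the table above
parsed = [parse_price(p) for p in ["$750", "$1,275", "$0"]]
```

On the DataFrame this could be applied with `df['prices'] = df['prices'].map(parse_price)`.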


## Additional Things to Explore

* BeautifulSoup not finding the exact info you're trying to scrape? The page may be filling it in with JavaScript after load. Try **Selenium**, which drives a real browser
    * https://www.scrapingbee.com/blog/selenium-python/
    
* Scrapy is another scraping library, built as a full crawling framework
    * https://scrapy.org/