# Capstone Recap: APIs and Webscraping

1. Twint for tweets: https://pypi.org/project/twint/
2. Spotipy for Spotify data: https://spotipy.readthedocs.io/en/2.16.1/
3. PRAW for Reddit data: https://praw.readthedocs.io/en/latest/
4. Scraping XML from BoardGameGeek
5. Scraping apartments from Craigslist

## Twint

In [None]:
import twint
import pandas as pd

import nest_asyncio  # lets twint's async calls run inside Jupyter's already-running event loop
nest_asyncio.apply()

In [None]:
c = twint.Config()

c.Search = "covid"
c.Min_likes = 100000
c.Count = True
c.Limit = 100
c.Store_csv = True
c.Output = 'covidtweets.csv'

Twint tweet attributes (the columns written to the CSV): https://github.com/twintproject/twint/wiki/Tweet-attributes

In [None]:
twint.run.Search(c)

In [None]:
tweets = pd.read_csv('covidtweets.csv')

In [None]:
tweets.head()

In [None]:
tweets['tweet']

## Spotipy
* Documentation: https://spotipy.readthedocs.io/en/2.16.1/
* Spotify developer dashboard (create an app here to get a client ID and secret): https://developer.spotify.com/dashboard/applications
* Additional Spotify datasets: https://research.atspotify.com/datasets/

Store your credentials in a separate `config` module (kept out of version control) rather than hard-coding them in the notebook.
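One possible layout for that `config` module is sketched below. The environment variable names (`SPOTIFY_CLIENT_ID` and so on) and the `load_section` helper are assumptions made for this example; only the `config.spotify[...]` and `config.reddit[...]` lookups come from the notebook itself.

```python
# config.py (hypothetical sketch; add this file to .gitignore)
import os


def load_section(prefix, keys, env=os.environ):
    """Build a credentials dict from variables named PREFIX_KEY.

    Missing variables become empty strings so the import never fails;
    the API client will then reject the empty credentials.
    """
    return {key: env.get(f'{prefix}_{key.upper()}', '') for key in keys}


# The variable names below are assumptions, not anything Spotify or Reddit require.
spotify = load_section('SPOTIFY', ['client_id', 'client_secret'])
reddit = load_section('REDDIT', ['client_id', 'client_secret', 'username', 'password'])
```

With this in place, `config.spotify['client_id']` reads whatever was exported as `SPOTIFY_CLIENT_ID` in the shell that launched Jupyter.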

In [None]:
import config
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

credentials = SpotifyClientCredentials(
    config.spotify['client_id'],
    config.spotify['client_secret'],
)
sp = spotipy.Spotify(client_credentials_manager=credentials)


In [None]:
playlists = sp.user_playlists('11138449814')['items']
playlist_ids = [p['id'] for p in playlists]
sp.playlist_tracks(playlist_ids[0])['items']

In [None]:
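def spotipy_search(artist, track):
    """Return up to three tracks matching the given artist and track name."""
    # Spotify field filters take the form field:value, with no space after the colon
    query = f'artist:{artist} track:{track}'
    return sp.search(q=query, limit=3)['tracks']['items']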

In [None]:
spotipy_search('dua lipa', 'levitating')

In [None]:
sp.search(q='genre:pop', limit=10)['tracks']['items']
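Spotify's search syntax attaches a field filter directly to its value, as in `artist:dua lipa`, with no space after the colon. The helper below is a hypothetical convenience (not part of Spotipy) that assembles such a query string from keyword arguments; it only builds the string and makes no API call.

```python
def build_query(**filters):
    """Join keyword arguments into 'field:value' pairs, skipping empty values."""
    return ' '.join(
        f'{field}:{value}' for field, value in filters.items() if value
    )


query = build_query(artist='dua lipa', track='levitating')
print(query)  # artist:dua lipa track:levitating
```

The result can be passed straight to `sp.search(q=query, limit=3)`.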

## Reddit

https://praw.readthedocs.io/en/latest/

In [None]:
import praw
import pandas as pd
import config

In [None]:
reddit = praw.Reddit(
    client_id=config.reddit['client_id'],
    client_secret=config.reddit['client_secret'],
    username=config.reddit['username'],
    password=config.reddit['password'],
    user_agent='test'  # Reddit asks for a unique, descriptive user agent string
)

In [None]:
for submission in reddit.subreddit("learnpython").hot(limit=10):
    print(submission.title)

In [None]:
list(reddit.subreddit("punpatrol").hot(limit=20))

https://praw.readthedocs.io/en/latest/code_overview/models/submission.html

In [None]:
data = [(sub.id, sub.title, sub.url, sub.score) for sub in reddit.subreddit("news").hot(limit=20)]

In [None]:
pd.DataFrame(data, columns=['id', 'title', 'url', 'score'])
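An alternative to building tuples and a separate column list is to collect each submission as a dict, so the field names travel with the values. The sketch below uses `SimpleNamespace` objects as stand-ins for PRAW submissions so it runs without credentials; `to_record` and `FIELDS` are hypothetical names introduced for this example.

```python
from types import SimpleNamespace

FIELDS = ['id', 'title', 'url', 'score']


def to_record(submission, fields=FIELDS):
    """Copy the chosen attributes of a submission into a plain dict.

    Attributes that do not exist on the object are stored as None.
    """
    return {field: getattr(submission, field, None) for field in fields}


# Stand-in objects; with PRAW these would come from reddit.subreddit(...).hot(...)
fake_submissions = [
    SimpleNamespace(id='abc123', title='Example post', url='https://example.com', score=42),
    SimpleNamespace(id='def456', title='Another post', url='https://example.org', score=7),
]

records = [to_record(sub) for sub in fake_submissions]
```

A list of dicts like `records` can be handed to `pd.DataFrame(records)` directly, and adding a column only requires extending `FIELDS`.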

## Webscraping Boardgames

(And dealing with XML formats)

In [None]:
import json
import requests
import time
from bs4 import BeautifulSoup

In [None]:
url = 'https://boardgamegeek.com/xmlapi2/thing?id=3&type=boardgame'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')  # the 'xml' parser requires lxml to be installed
games = soup.find_all('item')

In [None]:
games[0]

In [None]:
names = games[0].find_all('name')
names

In [None]:
names[0]['value']

In [None]:
games[0].find_all('name')[0]['type']

In [None]:
name = list(filter(lambda n: n['type'] == "primary", names))

In [None]:
name[0]['value']
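The same primary-name lookup can be tried offline with the standard library's `xml.etree.ElementTree`, which is handy for experimenting without hitting the API. The XML string below is a hand-written stand-in that only mimics the `item` / `name` / `description` layout used above; it is not real BoardGameGeek output, and `primary_name` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the structure of the API response
SAMPLE_XML = """
<items>
  <item type="boardgame" id="3">
    <name type="primary" sortindex="1" value="Example Game"/>
    <name type="alternate" sortindex="1" value="Beispielspiel"/>
    <description>A short description.</description>
  </item>
</items>
"""


def primary_name(item):
    """Return the value of the <name> child whose type is 'primary', or None."""
    for name in item.findall('name'):
        if name.get('type') == 'primary':
            return name.get('value')
    return None


root = ET.fromstring(SAMPLE_XML)
records = [
    {
        'id': item.get('id'),
        'name': primary_name(item),
        'description': item.findtext('description'),
    }
    for item in root.findall('item')
]
```

Each `item` becomes one dict, so a response containing several games would produce one row per game.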

In [None]:
games[0].find('description').text

## Scraping Craigslist

https://newyork.craigslist.org/search/aap

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://newyork.craigslist.org/search/hhh?'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# each listing's HTML becomes one element of the list
apartments = soup.find_all('li', {'class': 'result-row'})

In [None]:
apartments[0]

In [None]:
titles = [a.find('a', {'class': 'result-title hdrlnk'}).text for a in apartments]
prices = [a.find('span', {'class': 'result-price'}).text for a in apartments]

In [None]:
len(titles)

In [None]:
prices

In [None]:
# collecting neighborhoods with a loop, since some listings have no 'result-hood' span
hoods = []
for a in apartments:
    result = a.find('span', {'class': 'result-hood'})
    if result is None:
        hoods.append(None)  # keep a placeholder so the lists stay aligned
    else:
        hoods.append(result.text)

# putting it in a dataframe:
df = pd.DataFrame({'titles': titles, 'prices': prices, 'hoods': hoods})

In [None]:
df
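The scraped prices and neighborhoods are still raw strings, so a cleaning pass is usually the next step. The helpers below are a minimal sketch that assumes prices look like `'$2,500'` and neighborhoods look like `' (Upper West Side)'`; both formats and both function names are assumptions for this example, so check them against the actual scraped values.

```python
import re


def parse_price(text):
    """Turn a price label such as '$2,500' into an int; None if there are no digits."""
    if text is None:
        return None
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None


def clean_hood(text):
    """Strip surrounding whitespace and parentheses; None if nothing is left."""
    if text is None:
        return None
    cleaned = text.strip().strip('()').strip()
    return cleaned or None


print(parse_price('$2,500'))              # 2500
print(clean_hood(' (Upper West Side)'))   # Upper West Side
```

Applied to the dataframe, this could look like `df['prices'] = df['prices'].map(parse_price)` and `df['hoods'] = df['hoods'].map(clean_hood)`.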

## Additional Things to Explore

* BeautifulSoup not finding the exact info you're trying to scrape? The page may build its content with JavaScript after loading. Try **Selenium**, which drives a real browser.
    * https://www.scrapingbee.com/blog/selenium-python/
    
* **Scrapy** is another Python library for scraping, a full framework that also handles crawling across pages
    * https://scrapy.org/