## Collecting data via API
- Full name: Nguyễn Thế Hải
- Student ID: 19120069
- Task: Code the API collection
***
- Full name: Trần Đức Thắng
- Student ID: 19120130
- Task: Research the API, complete the notebook files.

## Coding environment
Use the package versions listed in the file "min_ds-env.yml".

***
## Import

In [1]:
import requests
import requests_cache
from bs4 import BeautifulSoup
import time
import json
import re
import pandas as pd # Used to read and display csv/tsv files
from datetime import datetime, timedelta # Used to handle time data
# YOUR CODE HERE (OPTION) 
import csv

import urllib.robotparser # Check whether robots.txt allows crawling
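
The `urllib.robotparser` import is meant for checking robots.txt before crawling. A minimal sketch of such a check follows. The helper name `can_crawl` and the example rules are hypothetical (they are not SoundCloud's actual robots.txt), and the rules are parsed from memory so the example runs without network access.

```python
import urllib.robotparser


def can_crawl(robots_lines, url, user_agent='*'):
    """Return True if the given robots.txt lines allow user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    # For a live site one would call parser.set_url(...) and parser.read() instead.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
example_rules = [
    'User-agent: *',
    'Disallow: /private/',
]

print(can_crawl(example_rules, 'https://example.com/public/page'))
print(can_crawl(example_rules, 'https://example.com/private/page'))
```

Parsing the rules once and reusing the parser avoids downloading robots.txt before every request.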

***

[SoundCloud](https://soundcloud.com/discover) is a website that lets users upload and share songs. We will collect information about artists, bands, podcasts, and music creators on SoundCloud.

## Global setup

First, we set up the global variables:
- The number of requests per minute is limited, so when a request fails the program sleeps and then sends the request again. The global variable `sleep_time` sets how many seconds the program sleeps (60 seconds here).
- Each output file should contain more than about 1000 records, so the global variable `n_records` sets the target number of records.
- "A Client ID is a unique identifier assigned to each visitor of a website on a given device or browser." We therefore set the global variable `client_id` to hold the Client ID used in every request.

In [2]:
sleep_time = 60

# client id
client_id = 'o2BWXZ9TFWJtTjM1cF9OvS5BEYPk1hBS'

# number of records needed to crawl
n_records = 1000
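
The sleep-then-retry idea behind `sleep_time` can be sketched as a small helper that gives up after a fixed number of attempts instead of retrying forever. This is only an illustration under stated assumptions: the name `fetch_with_retry`, the `max_attempts` parameter, and the convention that `fetch()` returns `None` on failure are all hypothetical and not part of the notebook.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or attempts run out.

    fetch: zero-argument callable returning a result, or None on failure.
    sleep: injectable so the helper can be exercised without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        # Wait before the next attempt, but not after the last one
        if attempt < max_attempts:
            sleep(wait_seconds)
    return None


# Usage with a fake fetch that fails twice, then succeeds
waits = []
responses = iter([None, None, {'collection': []}])
outcome = fetch_with_retry(lambda: next(responses), wait_seconds=1, sleep=waits.append)
print(outcome, waits)
```

Bounding the number of attempts keeps a permanently failing request (for example an expired Client ID) from blocking the notebook indefinitely.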

The task in this section is to write the `collect_api` function below. It takes the following inputs:
- `entities` is a string naming the kind of entity to search for: `users`, `tracks`, or `playlists`. It selects which search endpoint is requested.
- `key_word` is a string giving the search keyword. For example, key_word='blackpink'.
- `per_page` is an integer giving the number of results wanted per page.


The `collect_api` function returns:

- `results`: the list of items collected (the "collection" entries of all pages combined).

In [3]:
def collect_api(entities, key_word, per_page):
    results = list()
    
    url = f'https://api-v2.soundcloud.com/search/{entities}?q={key_word}&client_id={client_id}&limit={per_page}'
    
    # keep requesting pages until more than n_records items are collected
    while len(results) <= n_records:
        r = requests.get(url)
        
        # request failed (e.g. rate limit reached): wait, then retry the same url
        if not r.ok:
            time.sleep(sleep_time)
            continue
              
        # extract items
        text = r.json()
        results.extend(text['collection'])
        
        # find the next page; stop when next_href is missing or empty
        if text.get('next_href'):
            url = text['next_href']
            # add client_id to url
            url = url + f'&client_id={client_id}'
        else:
            break
            
    return results
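
The search URL in `collect_api` is assembled with an f-string, so a keyword containing spaces or `&` would reach the server unescaped. A hedged sketch of building the same URLs with `urllib.parse` follows. The helper names `build_search_url` and `with_client_id` are hypothetical, and the `offset` parameter in the sample `next_href` is only an illustrative assumption about its shape.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl


def build_search_url(entity, key_word, client_id, per_page):
    """Build the first-page search URL with properly escaped query values."""
    query = urlencode({'q': key_word, 'client_id': client_id, 'limit': per_page})
    return f'https://api-v2.soundcloud.com/search/{entity}?{query}'


def with_client_id(next_href, client_id):
    """Return next_href with client_id set exactly once in its query string."""
    parts = urlsplit(next_href)
    # dict() keeps the last value if a parameter is repeated
    params = dict(parse_qsl(parts.query))
    params['client_id'] = client_id
    return urlunsplit(parts._replace(query=urlencode(params)))


print(build_search_url('users', 'black pink', 'abc', 50))
print(with_client_id('https://api-v2.soundcloud.com/search/users?q=blackpink&offset=50', 'abc'))
```

Setting `client_id` through the parsed query also avoids adding it twice if the next-page link already carries one.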

Once we have the list of items taken from the "collection" of every page, we use `entities` to decide how the items are cleaned. `entities` is one of the following strings:
- `playlists`: a playlist can contain many tracks; we keep the ids of the first 5 tracks of each playlist
- `users`: the list of users
- `tracks`: the list of tracks


In [4]:
# format playlists
def format_playlists(playlists):
    for playlist in playlists:
        del playlist['user']

        # get first 5 track ids in a playlist
        track_ids = ''
        count = playlist['track_count']
        for i in range(min(count, 5)):
            track = playlist['tracks'][i]
            track_ids = track_ids + ', ' + str(track['id'])
        playlist['track_ids'] = track_ids
        del playlist['tracks']
        
    return playlists


# format users
def format_users(users):
    for user in users:
        del user['creator_subscriptions']
        del user['creator_subscription']      
        
        user['pro_badges'] = user['badges']['pro']
        user['pro_unlimited_badges'] = user['badges']['pro_unlimited']        
        user['verified_badges'] = user['badges']['verified'] 
        del user['badges']
        
    return users


# format tracks
def format_tracks(tracks):
    for track in tracks:
        del track['publisher_metadata']
        del track['media']
        del track['user']
        
    return tracks
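
Two details of the formatters above are worth noting: `del` raises `KeyError` when the API omits a field, and `format_playlists` prepends `', '` before every id, so `track_ids` starts with a separator. A minimal sketch of a more defensive variant for a single playlist follows. The name `format_playlist` and the `max_tracks` parameter are hypothetical, and unlike the notebook's version it returns a copy instead of modifying its input.

```python
def format_playlist(playlist, max_tracks=5):
    """Return a flattened copy of one playlist dict."""
    flat = dict(playlist)  # shallow copy, the input dict is left untouched
    # pop with a default does nothing when the key is absent
    flat.pop('user', None)
    tracks = flat.pop('tracks', None) or []
    # join puts the separator only between ids, never before the first one
    flat['track_ids'] = ', '.join(str(track['id']) for track in tracks[:max_tracks])
    return flat


# Made-up sample record, for illustration only
sample = {
    'id': 10,
    'user': {'id': 1},
    'track_count': 7,
    'tracks': [{'id': n} for n in range(101, 108)],
}
print(format_playlist(sample)['track_ids'])
```

Slicing `tracks[:max_tracks]` also stays safe when `track_count` is larger than the number of track entries actually returned.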

The `write_file` function takes the following inputs:
- `file_name` is the name of the output file, built from the entity name (for example `users.csv`).
- `data` is the list of items gathered from the "collection" of every page.

We use a `pandas` `DataFrame` to arrange `data` into a two-dimensional table of rows and columns, then write it to the file `file_name`.

In [5]:
def write_file(file_name, data):
    # convert to dataframe
    df = pd.DataFrame.from_dict(data)
    # write to file
    df.to_csv(file_name, index=False)
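
Building a `DataFrame` from a list of dicts lines the records up by key and leaves a missing value where a record lacks a field. The same idea can be sketched with only the standard library's `csv.DictWriter`. This is an alternative shown for illustration, not the notebook's method. The helper name `records_to_csv` and the sample records are hypothetical.

```python
import csv
import io


def records_to_csv(records, out):
    """Write a list of dicts to the file-like object out as CSV.

    The header is the union of all keys in first-seen order; fields missing
    from a record are written as empty strings.
    """
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


# Made-up records with different keys, written to an in-memory buffer
buffer = io.StringIO()
records_to_csv([{'id': 1, 'title': 'a'}, {'id': 2, 'genre': 'Kpop'}], buffer)
print(buffer.getvalue())
```

When writing to a real file with the `csv` module, open it with `newline=''` so that line endings are not doubled on Windows.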

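The `collect_data` function reuses `collect_api` to fetch the data, cleans it with the formatter that matches `entities`, and writes the result to a CSV file.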

In [6]:
def collect_data(entities, key_word):
    # collect api
    data = collect_api(entities, key_word, 101)
    
    # data cleaning
    if entities == 'users':
        data = format_users(data)
    if entities == 'tracks':
        data = format_tracks(data)
    if entities == 'playlists':
        data = format_playlists(data)
   
    # out to file
    write_file(f'{entities}.csv', data)


We test the program with a sample `key_word` for each of the default entities and save the results to .csv files.

In [7]:
collect_data('users', 'blackpink')
collect_data('tracks', 'blackpink')
collect_data('playlists', 'blackpink')

## Reading data from users.csv

In [8]:
courses_users = pd.read_csv('users.csv')
courses_users

Unnamed: 0,avatar_url,city,comments_count,country_code,created_at,description,followers_count,followings_count,first_name,full_name,...,uri,urn,username,verified,visuals,station_urn,station_permalink,pro_badges,pro_unlimited_badges,verified_badges
0,https://i1.sndcdn.com/avatars-3YiljcXAvj7pU7Vb...,,0,,2019-08-23T10:12:13Z,,224250,0,,,...,https://api.soundcloud.com/users/688679639,soundcloud:users:688679639,BLACKPINK,True,,soundcloud:system-playlists:artist-stations:68...,artist-stations:688679639,False,False,True
1,https://i1.sndcdn.com/avatars-000006487712-r86...,Jakarta,4,ID,2011-10-07T16:12:54Z,,5000,11,black,black pink,...,https://api.soundcloud.com/users/7907166,soundcloud:users:7907166,blackpink,False,,soundcloud:system-playlists:artist-stations:79...,artist-stations:7907166,False,False,False
2,https://i1.sndcdn.com/avatars-Wse6DLdqgv7cYfAd...,,0,,2020-09-22T23:10:22Z,,3860,0,,,...,https://api.soundcloud.com/users/884772682,soundcloud:users:884772682,blackpink,False,,soundcloud:system-playlists:artist-stations:88...,artist-stations:884772682,False,False,False
3,https://i1.sndcdn.com/avatars-i49yuVyJxcoYwzdf...,Seoul,0,KR,2019-11-13T18:52:41Z,,549,0,BLACKPINK,BLACKPINK,...,https://api.soundcloud.com/users/733288924,soundcloud:users:733288924,BLACKPINK,False,"{'urn': 'soundcloud:users:733288924', 'enabled...",soundcloud:system-playlists:artist-stations:73...,artist-stations:733288924,False,False,False
4,https://i1.sndcdn.com/avatars-RP5nzFyWTUqaFWtA...,,0,,2019-03-07T01:12:29Z,,865,2,BLACKPINK,BLACKPINK BLINK,...,https://api.soundcloud.com/users/600372183,soundcloud:users:600372183,BLACKPINK BLINK,False,"{'urn': 'soundcloud:users:600372183', 'enabled...",soundcloud:system-playlists:artist-stations:60...,artist-stations:600372183,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1005,https://i1.sndcdn.com/avatars-000737344153-6c6...,,0,,2019-12-10T13:34:24Z,,1,0,BLACKPINK,BLACKPINK LOVE,...,https://api.soundcloud.com/users/747018262,soundcloud:users:747018262,BLACKPINK LOVE,False,,soundcloud:system-playlists:artist-stations:74...,artist-stations:747018262,False,False,False
1006,https://i1.sndcdn.com/avatars-000579216768-g8q...,,5,,2019-02-09T02:49:26Z,,3,2,Blackpink,Blackpink Yg,...,https://api.soundcloud.com/users/586147170,soundcloud:users:586147170,Blackpink Yg,False,,soundcloud:system-playlists:artist-stations:58...,artist-stations:586147170,False,False,False
1007,https://i1.sndcdn.com/avatars-J3WxchBJI0ZdSUlh...,,0,US,2020-10-01T22:30:52Z,,1,1,,,...,https://api.soundcloud.com/users/888607462,soundcloud:users:888607462,BLACKPINK_BLINK,False,,soundcloud:system-playlists:artist-stations:88...,artist-stations:888607462,False,False,False
1008,https://i1.sndcdn.com/avatars-000298753519-rpc...,,0,,2017-02-27T23:10:16Z,,11,1,Blackpink,Blackpink Blink,...,https://api.soundcloud.com/users/291902432,soundcloud:users:291902432,Blackpink Blink,False,,soundcloud:system-playlists:artist-stations:29...,artist-stations:291902432,False,False,False


## Reading data from tracks.csv

In [9]:
courses_tracks = pd.read_csv('tracks.csv')
courses_tracks

Unnamed: 0,artwork_url,caption,commentable,comment_count,created_at,description,downloadable,download_count,duration,full_duration,...,urn,user_id,visuals,waveform_url,display_date,station_urn,station_permalink,track_authorization,monetization_model,policy
0,https://i1.sndcdn.com/artworks-000515081802-u0...,,True,9401.0,2019-04-04T16:23:08Z,BLACKPINK - Kill This Love\nDownload Album (MP...,False,0.0,186087,186087,...,soundcloud:tracks:600933039,609199215,,https://wave.sndcdn.com/ejCQ0jBoBeBF_m.json,2019-04-04T16:23:08Z,soundcloud:system-playlists:track-stations:600...,track-stations:600933039,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
1,https://i1.sndcdn.com/artworks-000515083182-qp...,,True,2586.0,2019-04-04T16:26:35Z,블랙핑크 BLACKPINK - Don't Know What To Do\nDownlo...,False,0.0,197086,197086,...,soundcloud:tracks:600934458,609199215,,https://wave.sndcdn.com/SgN7Mn4FAw3M_m.json,2019-04-04T16:26:35Z,soundcloud:system-playlists:track-stations:600...,track-stations:600934458,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
2,https://i1.sndcdn.com/artworks-000229999096-3i...,,True,4679.0,2017-06-22T10:25:58Z,#BLACKPINK #블랙핑크 #마지막처럼 #ASIFITSYOURLAST #TODA...,False,0.0,213388,213388,...,soundcloud:tracks:329471802,294531917,,https://wave.sndcdn.com/V0Y7raC7Adhv_m.json,2017-06-22T10:25:58Z,soundcloud:system-playlists:track-stations:329...,track-stations:329471802,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
3,https://i1.sndcdn.com/artworks-000274869476-ct...,,True,3047.0,2017-12-25T12:00:02Z,"Remix by TEDDY, 24, Danny Chung \n\nTitle : SO...",True,100.0,140029,140029,...,soundcloud:tracks:374190761,102916922,,https://wave.sndcdn.com/tA6w8SOZdDeE_m.json,2017-12-25T12:00:02Z,soundcloud:system-playlists:track-stations:374...,track-stations:374190761,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
4,https://i1.sndcdn.com/artworks-000175418654-4i...,,True,4481.0,2016-08-08T18:43:20Z,,False,0.0,245260,245260,...,soundcloud:tracks:277386211,236716649,,https://wave.sndcdn.com/Vc0T1XqN5gOg_m.json,2016-08-08T18:43:20Z,soundcloud:system-playlists:track-stations:277...,track-stations:277386211,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1005,https://i1.sndcdn.com/artworks-mbPiLJMNEj2toay...,,True,0.0,2020-07-01T04:15:55Z,,False,0.0,41561,41561,...,soundcloud:tracks:850034938,846383797,,https://wave.sndcdn.com/QSRwrBrESPmK_m.json,2020-07-01T04:15:55Z,soundcloud:system-playlists:track-stations:850...,track-stations:850034938,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
1006,https://i1.sndcdn.com/artworks-PH6hMHVNEnz34zy...,,True,284.0,2020-07-21T13:57:07Z,Download : https://hypeddit.com/track/snn07h\n...,False,4.0,211801,211801,...,soundcloud:tracks:861849877,145898152,,https://wave.sndcdn.com/VZJICPU6W8i8_m.json,2020-07-21T13:57:07Z,soundcloud:system-playlists:track-stations:861...,track-stations:861849877,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
1007,https://i1.sndcdn.com/artworks-aIFTc19BHUzV9vm...,,True,14.0,2020-06-01T21:39:17Z,,False,0.0,178233,178233,...,soundcloud:tracks:832434946,466475376,,https://wave.sndcdn.com/105Q3X9pr89A_m.json,2020-06-01T21:39:17Z,soundcloud:system-playlists:track-stations:832...,track-stations:832434946,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW
1008,https://i1.sndcdn.com/artworks-000016550599-de...,,True,0.0,2012-01-10T12:28:29Z,Warae warae onaka kakaete warae\r\nTanoshiku n...,False,0.0,285078,285078,...,soundcloud:tracks:32959435,7907166,,https://wave.sndcdn.com/O518TeUScgzL_m.json,2012-01-10T12:28:29Z,soundcloud:system-playlists:track-stations:329...,track-stations:32959435,eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJnZW8iO...,NOT_APPLICABLE,ALLOW


## Reading data from playlists.csv

In [10]:
courses_playlists = pd.read_csv('playlists.csv')
courses_playlists

Unnamed: 0,artwork_url,created_at,description,duration,embeddable_by,genre,id,kind,label_name,last_modified,...,tag_list,title,uri,user_id,set_type,is_album,published_at,display_date,track_count,track_ids
0,,2018-10-15T15:50:14Z,,656908,all,,623598537,playlist,,2018-10-15T16:05:03Z,...,,blackpink,https://api.soundcloud.com/playlists/623598537,508598199,,False,2018-10-15T15:50:14Z,2018-10-15T15:50:14Z,2,", 329471802, 401058264"
1,,2017-12-25T14:48:56Z,,103595810,none,,406348544,playlist,,2021-02-18T17:31:50Z,...,,Blackpink,https://api.soundcloud.com/playlists/406348544,257009086,,False,2017-12-25T14:48:56Z,2017-12-25T14:48:56Z,358,", 360260393, 295987850, 307372211, 343190783, ..."
2,,2018-03-20T00:01:15Z,,84464453,all,Kpop,476522658,playlist,,2021-07-05T04:54:08Z,...,,blackpink,https://api.soundcloud.com/playlists/476522658,299633379,,False,2018-03-20T00:01:15Z,2018-03-20T00:01:15Z,373,", 350872552, 308578451, 206558035, 329468204, ..."
3,https://i1.sndcdn.com/artworks-6c8141af-033c-4...,2020-10-02T08:05:26Z,,1485062,all,Dance,1138455559,playlist,,2021-05-19T22:08:39Z,...,,THE ALBUM,https://api.soundcloud.com/playlists/1138455559,688679639,album,True,2020-10-02T00:00:00Z,2020-10-02T00:00:00Z,8,", 847247068, 883389634, 903272074, 903272005, ..."
4,,2019-01-13T16:07:19Z,,45806285,all,,684966594,playlist,,2019-01-29T17:19:51Z,...,,Blackpink,https://api.soundcloud.com/playlists/684966594,522440574,,False,2019-01-13T16:07:19Z,2019-01-13T16:07:19Z,139,", 277386211, 277382742, 290812196, 290812191, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1001,,2019-04-28T02:05:13Z,,12585925,all,,766988553,playlist,,2019-04-28T02:05:21Z,...,,blackpink,https://api.soundcloud.com/playlists/766988553,416176383,,False,2019-04-28T02:05:13Z,2019-04-28T02:05:13Z,47,", 603636444, 544449522, 310637766, 448743879, ..."
1002,,2019-04-18T12:02:28Z,,11541267,all,,757188897,playlist,,2019-04-18T12:02:48Z,...,,blackpink,https://api.soundcloud.com/playlists/757188897,416627478,,False,2019-04-18T12:02:28Z,2019-04-18T12:02:28Z,42,", 544449522, 310637766, 448743879, 574858791, ..."
1003,,2019-05-10T02:41:15Z,,12585925,all,,776029629,playlist,,2019-05-10T02:42:02Z,...,,blackpink,https://api.soundcloud.com/playlists/776029629,416848206,,False,2019-05-10T02:41:15Z,2019-05-10T02:41:15Z,47,", 603636444, 544449522, 310637766, 448743879, ..."
1004,,2019-04-27T13:30:06Z,,12585925,all,,766405365,playlist,,2019-04-27T13:30:09Z,...,,blackpink,https://api.soundcloud.com/playlists/766405365,419093001,,False,2019-04-27T13:30:06Z,2019-04-27T13:30:06Z,47,", 603636444, 544449522, 310637766, 448743879, ..."
