# Crawling Daum news article comments (다음 뉴스 기사 댓글 크롤링)
----

### Finding the API URL for a news article

- news article: https://news.v.daum.net/v/20191207151034339 


- Basic API URL format: `http://comment.daum.net/apis/v1/posts/{post_id}/comments?parentId=0&offset={id_of_comment}&limit={number_of_comments_to_call_from_API}&sort=RECOMMEND`


- Elements of the URL:
    * **post_id**: the ID the comment API uses for the article (not the number in the article URL)
      - Finding the post_id of a news article (Chrome):
        * Open the developer tools in Chrome
        * Go to Network > XHR and select the request whose name starts with "comments?"
        * The `postId` value is shown in the Preview tab

    * **offset**: zero-based position of the first comment to return, counted in the sorted order
    * **limit**: number of comments to return per request
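
As a minimal sketch, the URL format above can be filled in with a small helper. The function name `build_comment_url` and its default values are illustrative assumptions; only the URL template and the post ID `138163733` come from this notebook.

```python
from urllib.parse import urlencode

# URL template taken from the format described above
BASE_URL = "http://comment.daum.net/apis/v1/posts/{post_id}/comments"


def build_comment_url(post_id, offset=0, limit=1, sort="RECOMMEND"):
    """Hypothetical helper: fill in the comment API URL template."""
    query = urlencode({
        "parentId": 0,     # 0 selects top-level comments, as in the template
        "offset": offset,  # position of the first comment to return
        "limit": limit,    # number of comments per request
        "sort": sort,
    })
    return BASE_URL.format(post_id=post_id) + "?" + query


print(build_comment_url(138163733, offset=0, limit=1))
```

Building the query string with `urlencode` avoids hand-concatenating `str(n)` into the URL and keeps the parameters readable.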

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

In [2]:
# Crawl the comments one at a time and do a first pass of cleaning
raw_comment = []

for n in range(0, 1000):
    url = "http://comment.daum.net/apis/v1/posts/138163733/comments?parentId=0&offset=" + str(n) + "&limit=1&sort=RECOMMEND"
    page = requests.get(url)
    soup = str(BeautifulSoup(page.text, 'html.parser'))

    if soup == "[]":
        # an empty list means there are no more comments
        print("done crawling:", n)
        break

    # strip brackets, braces and quotes so each comment becomes a flat "key:value,..." string
    clean = soup.replace("[", "").replace("]", "").replace('"', "").replace("{", "").replace("}", "")
    raw_comment.append(clean)

    if n % 50 == 0:
        print("crawled:", n)

crawled: 0
crawled: 50
crawled: 100
crawled: 150
crawled: 200
crawled: 250
crawled: 300
crawled: 350
crawled: 400
crawled: 450
crawled: 500
crawled: 550
crawled: 600
done crawling: 621
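
The loop above sends one request per comment. A possible alternative, sketched here under assumptions, is to request several comments per call and parse the response with `json` instead of stripping punctuation. It assumes the API returns a JSON array and accepts a `limit` greater than 1, neither of which this notebook verifies. `crawl_comments` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the network call so the sketch runs offline.

```python
import json


def crawl_comments(fetch_page, limit=100, max_comments=1000):
    """Collect comments page by page until an empty page comes back.

    fetch_page(offset, limit) must return the response body as JSON text.
    """
    comments = []
    offset = 0
    while offset < max_comments:
        page = json.loads(fetch_page(offset, limit))
        if not page:  # "[]" means there are no more comments
            break
        comments.extend(page)
        offset += len(page)
    return comments


def fake_fetch(offset, limit):
    """Stand-in for the HTTP request, serving five made-up comments."""
    data = [{"id": i, "content": f"comment {i}"} for i in range(5)]
    return json.dumps(data[offset:offset + limit])


result = crawl_comments(fake_fetch, limit=2)
print(len(result))
```

Against the real API, `fetch_page` could wrap `requests.get(url).text` with the URL built from the given `offset` and `limit`. Parsed this way, each comment is already a dictionary, so commas and colons inside the comment text need no special handling.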


In [3]:
# Further cleaning: turn each comment string into a list of (key, value) tuples
_raw_comment = list(raw_comment)  # work on a copy

clean_data = []
for record in _raw_comment:
    # extract content and roles first, because their values can contain commas
    comment = re.findall("(?<=content:)(.*)(?=createdAt)", record)
    roles = re.findall("(?<=roles:)(.*)(?=providerId:)", record)

    # remove content and roles from the string
    record = record.replace("content:" + comment[0], "").replace("roles:" + roles[0], "")

    # the remaining fields can be split safely on commas
    api = record.split(",")

    # put content and roles back in place, dropping the trailing comma each match captured
    api.insert(8, "content:" + comment[0][:-1])
    api.insert(22, "roles:" + roles[0][:-1])

    # split each "key:value" string on the first colon only
    tupled = [tuple(field.split(":", 1)) for field in api]

    clean_data.append(tupled)
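
To illustrate why `content` is pulled out with a regular expression before splitting, here is a small self-contained example. The `record` string is made up for illustration and only mimics the "key:value,key:value" shape of the cleaned strings; the pattern is the same one used in the cell above.

```python
import re

# Made-up record in the same shape as the cleaned comment strings
record = "id:1,status:S,content:fast, cheap, and safe,createdAt:2019-12-07T15:40:18+0900"

# A plain comma split cuts the comment text into several pieces
pieces = record.split(",")
print(pieces)

# The lookbehind/lookahead pair captures everything between the two keys,
# including the comma that separates content from createdAt
content = re.findall(r"(?<=content:)(.*)(?=createdAt)", record)[0]
print(repr(content))

# Dropping the last character removes that trailing comma
print(repr(content[:-1]))
```

Note that this relies on `createdAt` directly following `content` in every record, and that a comment containing the literal text "createdAt" would extend the match too far.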

In [4]:
len(clean_data)

621

In [5]:
# convert each list of (key, value) tuples into a dictionary
_dict = [dict(i) for i in clean_data]
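
If the comments were parsed as JSON rather than as strings, each one would be a nested dictionary. The sketch below shows one way such a record could be flattened into a single-level dictionary. It assumes the author fields sit under a nested `user` key, which the `user` column in the output suggests but this notebook does not confirm. `flatten_comment` and the `user_` prefix are hypothetical choices, and the sample record is a trimmed illustration.

```python
def flatten_comment(comment):
    """Hypothetical helper: merge the nested "user" fields into the top level."""
    # copy every top-level field except the nested user object
    flat = {key: value for key, value in comment.items() if key != "user"}
    # prefix the user fields so that user["id"] does not overwrite the comment id
    for key, value in comment.get("user", {}).items():
        flat["user_" + key] = value
    return flat


# Trimmed sample record; the nesting is an assumption
sample = {
    "id": 455313986,
    "postId": 138163733,
    "user": {"id": -84130810, "displayName": "Violet"},
}

flat = flatten_comment(sample)
print(flat)
```

A list of such flat dictionaries can be passed to `pd.DataFrame` in the same way as `_dict`.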

In [6]:
# convert the list of dictionaries into a pandas DataFrame
df = pd.DataFrame(_dict)

In [7]:
df

Unnamed: 0,id,userId,postId,forumId,parentId,type,status,flags,content,createdAt,...,user,icon,url,username,roles,providerId,providerUserId,displayName,description,commentCount
0,455313986,-84130810,138163733,-99,0,USER,S,0,택시기사새끼들 무서워서 타다를 막으면 국토부 공무원들 다 짤라야지,2019-12-07T15:40:18+0900,...,id:-84130810,https://t1.daumcdn.net/profile/NIAlo.1nGPU0,,DAUM:5H0gy,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,5H0gy,Violet,,4076
1,455313955,-546739925,138163733,-99,0,USER,S,0,택시는 왜 특혜 받아야 하나,2019-12-07T15:40:07+0900,...,id:-546739925,https://t1.daumcdn.net/profile/S.u713LnfqQ0,,DAUM:B041T,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,B041T,김종섭,,937
2,455314346,-51585116,138163733,-99,0,USER,S,0,시민들 편의에 믿겨야지,2019-12-07T15:42:34+0900,...,id:-51585116,https://t1.daumcdn.net/profile/_fFSMWKjphw0,,DAUM:3urE0,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,3urE0,황제폐하,,10902
3,455315081,281528,138163733,-99,0,USER,S,0,택시가 공공재인가? 택시를 국가에서 보호해야할 산업인가? \n\n서로 경쟁하는 것...,2019-12-07T15:47:02+0900,...,id:281528,https://t1.daumcdn.net/profile/kW_eP5UL9nw0,,DAUM:1sjeC,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,1sjeC,자유인,,10404
4,455315470,-34432195,138163733,-99,0,USER,S,0,경쟁을 시키세요.\n\n소비자가 현명하게 선택합니다.,2019-12-07T15:49:31+0900,...,id:-34432195,https://t1.daumcdn.net/profile/UBlkblXiupk0,,DAUM:2ktnZ,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,2ktnZ,정-지-서,,11456
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
616,455331867,-65923416,138163733,-99,0,USER,S,0,타다는 불법이다.\n타다를 정부가 옹호하려면\n택시 면허를 사야하는게 맞다.\n이런...,2019-12-07T17:27:48+0900,...,id:-65923416,https://t1.daumcdn.net/profile/b8s.4SeAtq90,,DAUM:4sBGU,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,4sBGU,moon,,26283
617,455329765,-92636193,138163733,-99,0,USER,S,0,타다 유사택시 절대금지 찬성 합니다,2019-12-07T17:14:53+0900,...,id:-92636193,https://t1.daumcdn.net/profile/M4tSuk52Iy10,,DAUM:6gGU9,"ROLE_USER,ROLE_DAUM,ROLE_IDENTIFIED",DAUM,6gGU9,산사랑,,23740
618,455313868,19558989,138163733,-99,0,USER,S,0,안탄다. 나쁜 타다.,2019-12-07T15:39:28+0900,...,id:19558989,https://k.kakaocdn.net/dn/4W0Oa/btqzHZ0AIKs/AG...,https://story.kakao.com/_f7tsX7,KAKAO:509053830,"ROLE_USER,ROLE_KAKAO,ROLE_IDENTIFIED",KAKAO,6001594,이한림,,8
619,455314192,28190016,138163733,-99,0,USER,S,0,타다가 나쁘다고는안할게요 댓글에 제발 택시를 마냥 나쁘다고 하지말아주세요;,2019-12-07T15:41:36+0900,...,id:28190016,https://k.kakaocdn.net/dn/bJGUfK/btqArDDGy4y/D...,,KAKAO:1101654557,"ROLE_USER,ROLE_KAKAO,ROLE_IDENTIFIED",KAKAO,132755140,서나,,60
