# Crawling Daum News Article Comments

___

This Jupyter notebook crawls user comments on Daum news articles.

As an example, the code below crawls the comments on the following article: https://news.v.daum.net/v/20191207151034339


In [125]:
import requests
import json
import pandas as pd

### 1. Find the `postId` in the Daum API

1. Open the article you want to crawl and launch your browser's web developer tools.

2. Go to **Network > XHR** and click the request whose name starts with `comments?...`. Then open the `Preview` tab, where `postId` is listed.

  ![Imgur](https://i.imgur.com/lWpyxWH.png)



#### Structure of the Daum news comment API URL

- Basic API URL format: `http://comment.daum.net/apis/v1/posts/{post_id}/comments?parentId=0&offset={offset}&limit={limit}&sort=RECOMMEND`


- Elements of the URL:
    * **post_id**: the `postId` of the article, found in the step above
    * **offset**: zero-based position of the first comment to return (0 is the first comment in the sorted list)
    * **limit**: number of comments to return in a single API call
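
The URL format above can also be assembled from its parts instead of by string concatenation. The following is a minimal sketch using only the standard library; the helper name `build_comment_url` and its default values are illustrative assumptions and are not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://comment.daum.net/apis/v1/posts"

def build_comment_url(post_id, offset=0, limit=1, sort="RECOMMEND"):
    """Build the comment API URL for one request (hypothetical helper)."""
    query = urlencode({
        "parentId": 0,     # 0 selects top-level comments, as in the format above
        "offset": offset,  # zero-based position of the first comment
        "limit": limit,    # number of comments per call
        "sort": sort,
    })
    return f"{BASE_URL}/{post_id}/comments?{query}"

print(build_comment_url(138163733, offset=0, limit=1))
```

Keeping the query parameters in a dictionary makes it easy to change `offset` or `limit` without editing a long string.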
 

In [143]:
# postId of the article, found in step 1
postId = 138163733

### 2. Define a function that crawls basic information about the article

* title of the article
* date the article was published
* number of top-level comments (total comments minus replies)

In [144]:
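def basic_info(postId):
    """Return [title, published date, number of top-level comments] for a post."""
    url = "http://comment.daum.net/apis/v1/posts/" + str(postId)
    page = requests.get(url)
    page.raise_for_status()  # fail early on an HTTP error
    data = page.json()

    # Subtract replies (childCount) from commentCount to count top-level comments only
    count = data['commentCount'] - data['childCount']
    title = data['title']
    date = data['createdAt']

    return [title, date, int(count)]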

In [145]:
basic_info(postId)

['정부 "타다 금지법 없다"..타다 제도권 수용법 주장', '2019-12-07T15:10:46+0900', 604]
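
The publication date in the output above is a plain string. If it is needed as a date object, it can be parsed with the standard library. This is a small sketch; the helper name `parse_daum_timestamp` is hypothetical, and the format string is inferred from the single timestamp shown above.

```python
from datetime import datetime

def parse_daum_timestamp(text):
    """Parse a timestamp such as '2019-12-07T15:10:46+0900' (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")

published = parse_daum_timestamp("2019-12-07T15:10:46+0900")
print(published.isoformat())  # 2019-12-07T15:10:46+09:00
```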

### 3. Define a function that crawls and parses the comment API

In [139]:
def crawling_comment(count, info):
    """Fetch `count` top-level comments one at a time, using the global `postId`."""
    data = []
    title = info[0]
    published = info[1]

    for n in range(count):  # offsets run from 0 to count-1
        url = "http://comment.daum.net/apis/v1/posts/"+str(postId)+"/comments?parentId=0&offset="+str(n)+"&limit=1&sort=RECOMMEND"
        page = requests.get(url)
        items = page.json()
        if not items:  # the API returned fewer comments than expected
            break
        item = items[0]

        _dict = {}

        # Article-level fields, repeated on every row
        _dict['title'] = title
        _dict['publishedDate'] = published
        # Comment-level fields
        _dict['id'] = item['id']
        _dict['userId'] = item['userId']
        _dict['postId'] = item['postId']
        _dict['content'] = item['content']
        _dict['createdAt'] = item['createdAt']
        _dict['updatedAt'] = item['updatedAt']
        _dict['childCount'] = item['childCount']
        _dict['likeCount'] = item['likeCount']
        _dict['dislikeCount'] = item['dislikeCount']
        _dict['recommendCount'] = item['recommendCount']
        # Fields of the user who wrote the comment
        _dict['username'] = item['user']['providerUserId']
        _dict['displayName'] = item['user']['displayName']
        _dict['commentCount'] = item['user']['commentCount']

        data.append(_dict)

        if n % 50 == 0:
            print("crawled:", n)

    print("Done! total:", len(data))

    return data  # list of dictionaries, one per comment
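
The function above sends one request per comment (`limit=1`). If the API accepts a larger `limit`, which is an assumption not verified in this notebook, the same comments could be collected with far fewer requests. The sketch below shows only the paging logic: `page_offsets`, `crawl_in_pages`, and the `fetch_page` callable are hypothetical names, and a stand-in fetcher is used so the example runs without network access.

```python
def page_offsets(total, page_size):
    """Yield (offset, limit) pairs that cover `total` comments in pages."""
    for offset in range(0, total, page_size):
        yield offset, min(page_size, total - offset)

def crawl_in_pages(fetch_page, total, page_size=100):
    """Collect comments page by page.

    `fetch_page(offset, limit)` is a hypothetical callable returning a list of
    comment dictionaries, e.g. a thin wrapper around requests.get(...).json().
    """
    comments = []
    for offset, limit in page_offsets(total, page_size):
        page = fetch_page(offset, limit)
        if not page:  # stop if the API returns nothing further
            break
        comments.extend(page)
    return comments

# Stand-in fetcher (assumption) so the sketch runs offline.
def fake_fetch(offset, limit):
    return [{"id": i} for i in range(offset, offset + limit)]

print(list(page_offsets(604, 100))[-1])  # (600, 4)
print(len(crawl_in_pages(fake_fetch, total=604, page_size=100)))  # 604
```

Separating the offset arithmetic from the HTTP call keeps the paging logic easy to check before any real request is sent.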

### 4. Run the functions and convert the output to a DataFrame

In [140]:
info = basic_info(postId)
data = crawling_comment(info[2], info)  # info[2] is the number of top-level comments
df = pd.DataFrame(data)

crawled: 0
crawled: 50
crawled: 100
crawled: 150
crawled: 200
crawled: 250
crawled: 300
crawled: 350
crawled: 400
crawled: 450
crawled: 500
crawled: 550
crawled: 600
Done! total: 604


In [141]:
df

| | title | publishedDate | id | userId | postId | content | createdAt | updatedAt | childCount | likeCount | dislikeCount | recommendCount | username | displayName | commentCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455313986 | -84130810 | 138163733 | 택시기사새끼들 무서워서 타다를 막으면 국토부 공무원들 다 짤라야지 | 2019-12-07T15:40:18+0900 | 2019-12-07T15:40:18+0900 | 88 | 1884 | 377 | 1507 | 5H0gy | Violet | 4189 |
| 1 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455313955 | -546739925 | 138163733 | 택시는 왜 특혜 받아야 하나 | 2019-12-07T15:40:07+0900 | 2019-12-07T15:40:07+0900 | 39 | 808 | 133 | 675 | B041T | 김종섭 | 937 |
| 2 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455314346 | -51585116 | 138163733 | 시민들 편의에 믿겨야지 | 2019-12-07T15:42:34+0900 | 2019-12-07T15:42:34+0900 | 10 | 396 | 38 | 358 | 3urE0 | 황제폐하 | 10890 |
| 3 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455315081 | 281528 | 138163733 | 택시가 공공재인가? 택시를 국가에서 보호해야할 산업인가? \n\n서로 경쟁하는 것... | 2019-12-07T15:47:02+0900 | 2019-12-07T15:47:02+0900 | 6 | 301 | 34 | 267 | 1sjeC | 자유인 | 10562 |
| 4 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455315470 | -34432195 | 138163733 | 경쟁을 시키세요.\n\n소비자가 현명하게 선택합니다. | 2019-12-07T15:49:31+0900 | 2019-12-07T15:49:31+0900 | 7 | 155 | 8 | 147 | 2ktnZ | 정-지-서 | 11585 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 598 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455333906 | 16554897 | 138163733 | 타다이재웅이는누구냐\n그렇게무서운가\n타다는불법인데\n왜\n난리들이야 | 2019-12-07T17:40:00+0900 | 2019-12-07T17:40:00+0900 | 0 | 0 | 2 | -2 | 80999326 | 이성덕 | 860 |
| 599 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455331867 | -65923416 | 138163733 | 타다는 불법이다.\n타다를 정부가 옹호하려면\n택시 면허를 사야하는게 맞다.\n이런... | 2019-12-07T17:27:48+0900 | 2019-12-07T17:27:48+0900 | 0 | 0 | 2 | -2 | 4sBGU | moon | 26423 |
| 600 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455329765 | -92636193 | 138163733 | 타다 유사택시 절대금지 찬성 합니다 | 2019-12-07T17:14:53+0900 | 2019-12-07T17:14:53+0900 | 0 | 0 | 2 | -2 | 6gGU9 | 산사랑 | 23942 |
| 601 | 정부 "타다 금지법 없다"..타다 제도권 수용법 주장 | 2019-12-07T15:10:46+0900 | 455313868 | 19558989 | 138163733 | 안탄다. 나쁜 타다. | 2019-12-07T15:39:28+0900 | 2019-12-07T15:39:28+0900 | 0 | 6 | 11 | -5 | 6001594 | 이한림 | 8 |
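
In the rows displayed above, `recommendCount` equals `likeCount` minus `dislikeCount`. The sketch below checks that relationship on those displayed rows only; whether it holds for every comment returned by the API is an assumption. The helper name `recommend_matches` is hypothetical.

```python
# (likeCount, dislikeCount, recommendCount) copied from the rows displayed above
rows = [
    (1884, 377, 1507),
    (808, 133, 675),
    (396, 38, 358),
    (301, 34, 267),
    (155, 8, 147),
    (0, 2, -2),
    (0, 2, -2),
    (0, 2, -2),
    (6, 11, -5),
]

def recommend_matches(like, dislike, recommend):
    """Return True when recommend equals like minus dislike."""
    return like - dislike == recommend

print(all(recommend_matches(*row) for row in rows))  # True
```

On the full DataFrame, a comparable check would be `(df['likeCount'] - df['dislikeCount']).eq(df['recommendCount']).all()`.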
