# Get Basic Information with Beautiful Soup
*DATE:* 2022-07-29

*SUMMARY:* Collect the webtoon ID, update day, thumbnail, title, author, plot, genre, and recommended age for every series listed on the Naver Webtoon homepage.

*FEATURE:* Uses BeautifulSoup to scrape the information. Other versions will be uploaded soon.

**WARNING**\
This notebook is based on the HTML/CSS layout of the Naver Webtoon homepage as of July 2022, so it may no longer work correctly. If it fails, check the homepage's current HTML/CSS and update the selectors accordingly.


# Scraping information from the page

In [None]:
from selenium import webdriver 
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
import numpy as np
import pandas as pd
import time
import requests
import os 
from urllib.request import urlretrieve 
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import re

In [8]:
idlist = []
daylist = []
imglist = [] 
titlelist = [] 
namelist = [] 
contentlist = [] 
genrelist = []
agelist = [] 
formlist = []

In [None]:
# Fetch the weekday page and collect every <a> tag with class 'title' as info
url='https://comic.naver.com/webtoon/weekday.nhn'
res=requests.get(url)
soup=bs(res.text, 'html.parser')

info = soup.find_all('a', {'class':'title'}) 

In [4]:
# Get ID and Update day from info 
numID = []
href = []
dates = []

for i in info:
    href = i.attrs['href'].split('/')
    href = href[2]
    
    num = re.findall(r'\d+', href)
    numID.append(num[0])
    
    hrefday = href.split('=')
    hrefday = hrefday[2]

    dates.append(hrefday)

# Make a dataframe 
df = pd.DataFrame()
df['id']=numID
df['day']=dates
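
The cell above extracts the ID and update day by splitting the `href` on `/` and `=` and indexing into the pieces. As a hedged alternative, the same values can be read from the query string by name, which does not depend on the order of the parameters. This is a minimal sketch that assumes the `href` has the form `/webtoon/list?titleId=...&weekday=...` (the July 2022 format implied by the code above); `parse_list_href` and `build_list_url` are hypothetical helper names, not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def parse_list_href(href):
    """Return (title_id, weekday) read from the href's query string."""
    query = parse_qs(urlparse(href).query)
    return query["titleId"][0], query["weekday"][0]


def build_list_url(title_id, weekday):
    """Rebuild the list-page URL that the scraping loop requests."""
    base = "https://comic.naver.com/webtoon/list"
    return base + "?" + urlencode({"titleId": title_id, "weekday": weekday})


# Sample href in the assumed July 2022 format
sample = "/webtoon/list?titleId=758037&weekday=mon"
print(parse_list_href(sample))          # ('758037', 'mon')
print(build_list_url("758037", "mon"))
```

If the site renames or drops one of these parameters, the lookup raises a `KeyError` naming the missing key, which is easier to diagnose than an index error on a split list.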

In [None]:
# Display the dataframe
# A webtoon that updates more than once a week appears once per update day, so its ID is listed redundantly.
# These duplicates are merged during the scraping step.
df

Unnamed: 0,id,day
0,758037,mon
1,183559,mon
2,648419,mon
3,783052,mon
4,602910,mon
...,...,...
549,776096,sun
550,786973,sun
551,798303,sun
552,798177,sun
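
The duplicate handling mentioned in the comment above can also be done before any page is requested. The sketch below merges the update days per ID with a dictionary; `merge_days` is a hypothetical helper, and the sample pairs reuse IDs that appear in this notebook's output.

```python
def merge_days(pairs):
    """Collapse (id, day) pairs into {id: 'day1,day2'}, keeping first-seen order."""
    merged = {}
    for title_id, day in pairs:
        if title_id in merged:
            merged[title_id] += "," + day   # same webtoon, additional update day
        else:
            merged[title_id] = day
    return merged


pairs = [("758037", "mon"), ("793283", "tue"), ("793283", "sat")]
print(merge_days(pairs))   # {'758037': 'mon', '793283': 'tue,sat'}
```

Merging first means each webtoon page only needs to be fetched once, and the dictionary lookup avoids the repeated list search that `idlist.index(ID)` performs.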


In [None]:
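# Scrape each webtoon's page using its ID and update day
# NOTE: `iddate` is not defined elsewhere in this notebook; it is assumed here to be the (id, day) pairs from df
iddate = list(zip(df['id'], df['day']))
for ID, date in tqdm_notebook(iddate):
    url='https://comic.naver.com/webtoon/list?titleId='+ID+'&weekday='+date
    res = requests.get(url)
    res.raise_for_status() # Raise an exception on an HTTP error status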
    soup = bs(res.text, 'lxml')
    time.sleep(0.5)
    
    # Duplication processing
    if (ID in idlist): 
        daylist[idlist.index(ID)] += ',' + date
        
    else: 
        idlist.append(ID)
        daylist.append(date)

        imgpath= soup.select_one('.comicinfo > .thumb')
        imgsrc = imgpath.select_one('img').attrs.get('src')
        imglist.append(imgsrc)

        detailpath = soup.select_one('.comicinfo > .detail')
        title = detailpath.select_one('h2 > .title').text
        titlelist.append(title)

        name = detailpath.select_one('h2 > .wrt_nm').text[8:]
        namelist.append(name)
        content = detailpath.select_one('p').text
        contentlist.append(content)

        detailinfopath = soup.select_one('.comicinfo > .detail > .detail_info')
        genre = detailinfopath.select_one('.genre').text
        genrelist.append(genre)
        age = detailinfopath.select_one('.age').text
        agelist.append(age)
        
        # Cut-toons have an 'ico_cut' element in the heading; everything else is treated as a scroll toon
        # '컷툰' = cut-toon, '스크롤' = scroll
        if detailpath.select_one('h2 > .ico_cut') is not None:
            form = '컷툰'
        else:
            form = '스크롤'
        formlist.append(form) 
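
Because the selectors in this loop depend on the July 2022 page layout, a selector that no longer matches makes `select_one` return `None`, and the failure only shows up later as an `AttributeError` on `.text` or `.attrs`. One possible guard, sketched here in plain Python, wraps each lookup so the error names the selector that failed. `require` is a hypothetical helper and not part of BeautifulSoup.

```python
def require(node, selector):
    """Return node unchanged, or raise a descriptive error if the selector matched nothing."""
    if node is None:
        raise LookupError(
            f"selector {selector!r} matched nothing; the page layout may have changed"
        )
    return node


# Intended use inside the loop (soup comes from the scraping cell above):
#   detailpath = require(soup.select_one('.comicinfo > .detail'), '.comicinfo > .detail')
```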


In [7]:
# Visualize a dataframe
cols = []
df = pd.DataFrame(columns=cols)

df['id'] = idlist
df['title'] = titlelist
df['name'] = namelist 
df['day'] = daylist
df['genre'] = genrelist
df['age'] = agelist
df['form'] = formlist
df['content'] = contentlist
df['imgsrc'] = imglist

df

Unnamed: 0,id,title,name,day,genre,age,form,content,imgsrc
0,758037,참교육,채용택 / 한가람,mon,"스토리, 액션",15세 이용가,스크롤,무너진 교권을 지키기 위해 교권보호국 소속 나화진의 참교육이 시작된다!<부활남> 채...,https://shared-comic.pstatic.net/thumb/webtoon...
1,183559,신의 탑,SIU,mon,"스토리, 판타지",12세 이용가,스크롤,자신의 모든 것이었던 소녀를 쫓아 탑에 들어온 소년그리고 그런 소년을 시험하는 탑,https://shared-comic.pstatic.net/thumb/webtoon...
2,648419,뷰티풀 군바리,설이 / 윤성원,mon,"스토리, 드라마",15세 이용가,스크롤,'여자도 군대에 간다면?'본격 여자도 군대 가는 만화!,https://shared-comic.pstatic.net/thumb/webtoon...
3,783052,퀘스트지상주의,박태준 만화회사,mon,"스토리, 드라마",15세 이용가,스크롤,"[외모지상주의], [싸움독학], [인생존망]과 세계관을 공유하는 작품!공부, 싸움,...",https://shared-comic.pstatic.net/thumb/webtoon...
4,602910,윈드브레이커,조용석,mon,"스토리, 스포츠",12세 이용가,스크롤,혼자서 자전거를 즐겨타던 모범생 조자현.원치 않게 자전거 크루의 일에 자꾸 휘말리게...,https://shared-comic.pstatic.net/thumb/webtoon...
...,...,...,...,...,...,...,...,...,...
534,773524,거래하실래요?,99C / 백도,sun,"스토리, 로맨스",전체연령가,스크롤,"새로운 사랑은 중고거래로 장만하세요(?!)남친과 헤어진 후, 추억의 물건을 중고거래...",https://shared-comic.pstatic.net/thumb/webtoon...
535,773067,제타,하지,sun,"스토리, 스릴러",12세 이용가,스크롤,극심한 슬럼프에 안드로이드 '제타'의 그림을 표절해버린 금세기 최고의 천재 화가 '...,https://shared-comic.pstatic.net/thumb/webtoon...
536,785812,"구해줘, 호구!",기천,sun,"스토리, 로맨스",전체연령가,스크롤,세상에서 착한 게 제일 싫은 준영은 '자타공인 청정호구' 산호가 자꾸 거슬린다. 취...,https://shared-comic.pstatic.net/thumb/webtoon...
537,776096,짝사랑의 유서,군밤,sun,"스토리, 로맨스",12세 이용가,스크롤,"민결은 단골카페 사장님의 동생인 태은을 1년동안 짝사랑 하던 중, 태은에게서 편지 ...",https://shared-comic.pstatic.net/thumb/webtoon...


In [8]:
# Display a webtoon that updates more than once a week
df[df['id']=='793283']

Unnamed: 0,id,title,name,day,genre,age,form,content,imgsrc
89,793283,악몽의 형상,김용키,"tue,sat","스토리, 스릴러",15세 이용가,컷툰,"'타인은지옥이다', '관계의종말' 이후 9년..종우와 다은은 여전히 끔찍한 지옥 속...",https://shared-comic.pstatic.net/thumb/webtoon...


In [9]:
# Save as CSV. The 'utf-8-sig' encoding keeps the Korean text readable when the file is opened in spreadsheet software
df.to_csv('네이버웹툰_기본정보.csv', encoding = 'utf-8-sig')
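
To illustrate what the `utf-8-sig` encoding changes, the sketch below uses only the standard library instead of pandas: it writes a small CSV to a temporary folder, checks that the file starts with the UTF-8 byte order mark, and reads it back. The file name `sample.csv` is a placeholder; the sample row reuses a title from the table above.

```python
import codecs
import csv
import os
import tempfile

rows = [["id", "title"], ["183559", "신의 탑"]]
path = os.path.join(tempfile.mkdtemp(), "sample.csv")  # placeholder file name

# Writing with 'utf-8-sig' prepends the UTF-8 byte order mark (BOM)
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, "rb") as f:
    has_bom = f.read().startswith(codecs.BOM_UTF8)

# Reading with 'utf-8-sig' strips the BOM again, so the data round-trips unchanged
with open(path, encoding="utf-8-sig", newline="") as f:
    round_trip = list(csv.reader(f))

print(has_bom, round_trip == rows)
```

When reading the saved file back with pandas, passing the same `encoding='utf-8-sig'` to `pd.read_csv` keeps the BOM out of the first column name.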

# Download Thumbnail files
Download each thumbnail image from its `src` URL with `urlretrieve`.\
The title, with special characters removed, is used as the file name.

In [20]:
# Remove special characters and spaces so that each title is safe to use as a file name
retitle = []
for i in titlelist:
    retitle.append(''.join(filter(str.isalnum, i)))
retitle    

['참교육',
 '신의탑',
 '뷰티풀군바리',
 '퀘스트지상주의',
 '윈드브레이커',
 '호랑신랑뎐',
 '장씨세가호위무사',
 '소녀의세계',
 '백수세끼',
 '신화급귀속아이템을손에넣었다',
 '앵무살수',
 '잔불의기사',
 '불청객',
 '버림받은왕녀의은밀한침실',
 '절대검감',
 '똑닮은딸',
 '리턴투플레이어',
 '히어로메이커',
 '아쫌참으세요영주님',
 '꼬리잡기',
 '야생천사보호구역',
 '이별후사내결혼',
 '더블클릭',
 '결혼생활그림일기',
 '황제와의하룻밤',
 '메리의불타는행복회로',
 '순정말고순종',
 '북부공작님을유혹하겠습니다',
 '세번째로망스',
 '칼가는소녀',
 '물어보는사이',
 '우산없는애',
 '파운더',
 '신군',
 '꿈의기업',
 '오빠집이비어서',
 '미니어처생활백서',
 '또다시계약부부',
 '다시쓰는연애사',
 '제왕빛과그림자',
 '오늘의비너스',
 '버그이터',
 '입술이예쁜남자',
 '사랑의헌옷수거함',
 '루크비셸따라잡기',
 '말박왕',
 '레지나레나용서받지못한그대에게',
 '원작은완결난지한참됐습니다만',
 '하루의하루',
 '경비실에서안내방송드립니다',
 '홍천기',
 '아마도',
 '싸이코리벤지',
 '최후의금빛아이',
 '백호랑',
 '매지컬급식암살법사',
 '그림자신부',
 '달로만든아이',
 '모노마니아',
 '파견체',
 '왕따협상',
 '찌질하지만로맨스는하고싶어',
 '나만의고막남친',
 '악녀18세공략기',
 '모락모락왕세자님',
 '지옥연애환담',
 '흔들리는세계로부터',
 '디나운스',
 '역주행',
 '결혼공략',
 '사막에핀달',
 '헬로맨스',
 '슈퍼스타천대리',
 '남주서치',
 '오로지오로라',
 '별을쫓는소년들',
 '기사님을지켜줘',
 '모스크바의여명',
 '김부장',
 '대학원탈출일지',
 '여신강림',
 '마루는강쥐',
 '1을줄게',
 '멸망이후의세계',
 '내가키운S급들',
 '중증외상센터골든아워',
 '신도림',
 '용사가돌아왔다',
 '삼국지톡',
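
Stripping every non-alphanumeric character can leave two titles with the same file name, or an empty one, in which case a later download would overwrite an earlier file. The sketch below is one way to guard against that; `safe_filenames` is a hypothetical helper and the input titles are made up for illustration.

```python
def safe_filenames(titles):
    """Strip non-alphanumerics, then disambiguate empty or repeated results."""
    seen = {}
    names = []
    for title in titles:
        base = "".join(ch for ch in title if ch.isalnum()) or "untitled"
        seen[base] = seen.get(base, 0) + 1
        # The first occurrence keeps the plain name; later ones get a numeric suffix
        names.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return names


print(safe_filenames(["구해줘, 호구!", "1을 줄게", "1을줄게", "?!"]))
# ['구해줘호구', '1을줄게', '1을줄게_2', 'untitled']
```

Since the underscore is never part of a cleaned base name, a suffixed name cannot collide with a name produced from another title.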

In [21]:
# Save each thumbnail from its src URL (the target folder must already exist)
for i in tqdm_notebook(range(len(imglist))):
    urlretrieve(imglist[i], "C:/Users/gynchoi/Trap/image/"+retitle[i]+".jpg")
    time.sleep(1)  # Pause between downloads to avoid overloading the server
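
The download cell hard-codes an absolute Windows folder and assumes every image is a `.jpg`. A more portable variant could build the target path with `pathlib` and take the extension from the URL. This is a sketch under assumptions: `thumbnail_path` is a hypothetical helper, `image` is a placeholder relative folder, and the URLs are made-up examples on `example.com`.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def thumbnail_path(out_dir, name, src):
    """Build the save path, taking the extension from the URL path (default '.jpg')."""
    ext = os.path.splitext(urlparse(src).path)[1] or ".jpg"
    return Path(out_dir) / f"{name}{ext}"


out_dir = Path("image")  # placeholder folder, relative to the notebook
# Before downloading, create the folder if needed:
#   out_dir.mkdir(parents=True, exist_ok=True)

print(thumbnail_path(out_dir, "신의탑", "https://example.com/thumb/a.png?type=m"))
print(thumbnail_path(out_dir, "참교육", "https://example.com/thumb/b"))
```

The resulting path can be passed to `urlretrieve` as its second argument in place of the concatenated string.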
