# North Korean News

Scrape the North Korean news agency http://kcna.kp

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!
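
As a minimal sketch of what "don't save the index" means, the snippet below compares the CSV text pandas produces with and without `index=False`. The one-row table and its column names are hypothetical and exist only for illustration; calling `to_csv()` without a path returns the CSV as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical one-row table, used only to show the effect of index=False
df = pd.DataFrame([{"headline": "Example headline", "article_id": "AR0125885"}])

# With no path argument, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The first line of each result is the header row
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]

print(header_with_index)     # starts with an empty, unnamed index column
print(header_without_index)  # only the real columns
```

If the index is saved by accident, reading the file back with `pd.read_csv` usually shows it as an extra column named `Unnamed: 0`.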

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible", it means "take only as much as needed"_
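
The last three tips can be tried out on a couple of `onclick` strings without touching the site. This is a sketch of one possible pattern, not the only way to do it; the two sample strings are copied from examples in this notebook, and the variable names are made up for illustration.

```python
import pandas as pd

# Two onclick values in the shape the site uses
onclicks = pd.Series([
    'fn_showArticle("AR0125885", "", "NT00", "L")',
    'fn_showArticle("AR0140322", "", "NT00", "L")',
])

# \( matches a literal parenthesis; the unescaped ( ) pair is the capture group.
# .*? is non-greedy, so the group stops at the first closing quote.
# expand=False returns a Series (not a DataFrame) when there is one group.
article_ids = onclicks.str.extract(r'fn_showArticle\("(.*?)"', expand=False)

# For comparison: greedy .* keeps going until the last quote in the string
greedy = onclicks.str.extract(r'fn_showArticle\("(.*)"', expand=False)

print(article_ids.tolist())
print(greedy.tolist())
```

Because `expand=False` yields a Series here, the result can be assigned straight to a single new column.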

In [1]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd



In [3]:
response = requests.get("http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf")
doc = BeautifulSoup(response.content, 'html.parser')
doc.prettify()

'<html>\n <head>\n  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n  <script language="javascript">\n   var globalContextPath = "";\r\n\tvar jsLangCode = "kor";\r\n\tvar flashPlayer = "/download/FlashPlayer10.zip";\r\n\tvar gYearStr = "주체";\n  </script>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n  <link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>\n  <link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>\n  <link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>\n  <link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>\n  <style>\n   body {\r\n\t\t\t\r\n\t\tfont-family: 돋음, 굴림, 청봉, Arial, Helvetica, sans-serif;;\t\t\r\n\t\r\n}\n  </style>\n  <!--[if IE]> \r\n\t<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>\r\n<![endif]-->\n  <!--[if IE 6]> \r\n\t<link href="/sys/css/homepage_ie6.css" rel="stylesh

In [4]:
# Each headline is an <a class="titlebet"> tag that carries an onclick attribute
full = doc.find_all('a', class_='titlebet')
fulllist = []
for article in full:
    title = article.text
    onclick = article['onclick']
    eacharticle = {'Article Headline': title,
                   'Onclick Value': onclick}
    fulllist.append(eacharticle)
print(fulllist)
# Note: article.find_all(class_='titlebet') returns nothing here, because
# find_all only searches descendants and the <a> itself is the titlebet tag.

          

[{'Article Headline': '경애하는 최고령도자 김정은동지께서 라오스인민혁명당 중앙위원회 총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다', 'Onclick Value': 'fn_showArticle("AR0140322", "", "NT00", "L")'}, {'Article Headline': '조선로동당 중앙위원회 제7기 제21차 정치국 확대회의 진행', 'Onclick Value': 'fn_showArticle("AR0140253", "", "NT00", "L")'}, {'Article Headline': '경애하는 최고령도자 김정은동지께서 수리아대통령에게 축전을 보내시였다', 'Onclick Value': 'fn_showArticle("AR0139989", "", "NT00", "L")'}, {'Article Headline': '조선로동당 중앙위원회 제7기 제20차 정치국 확대회의 진행', 'Onclick Value': 'fn_showArticle("AR0139950", "", "NT00", "L")'}, {'Article Headline': '경애하는 최고령도자 김정은동지께서  《총련분회대표자대회-2020》(새 전성기 3차대회) 참가자들에게 축하문을 보내시였다', 'Onclick Value': 'fn_showArticle("AR0139645", "", "NT00", "L")'}, {'Article Headline': '경애하는 최고령도자 김정은동지께서 고 라명희동지의 령전에 화환을 보내시였다', 'Onclick Value': 'fn_showArticle("AR0139638", "", "NT00", "L")'}, {'Article Headline': '작업현장소독을 구체적으로', 'Onclick Value': 'fn_showArticle("AR0140387", "", "NT41", "L")'}, {'Article Headline': '주체109(2020)년 12월 5일 신문개관', 'Onclick Value': 'fn_showA

In [5]:
df = pd.DataFrame(fulllist)
# Raw string so that \( and \w reach the regex engine without invalid-escape warnings
df['Article ID'] = df['Onclick Value'].str.extract(r'fn_showArticle\("(\w+)", "', expand=False)
df.to_csv("nk-news.csv", index=False)
df

,Article Headline,Onclick Value,Article ID
0,경애하는 최고령도자 김정은동지께서 라오스인민혁명당 중앙위원회 총비서인 라오스인민민주...,"fn_showArticle(""AR0140322"", """", ""NT00"", ""L"")",AR0140322
1,조선로동당 중앙위원회 제7기 제21차 정치국 확대회의 진행,"fn_showArticle(""AR0140253"", """", ""NT00"", ""L"")",AR0140253
2,경애하는 최고령도자 김정은동지께서 수리아대통령에게 축전을 보내시였다,"fn_showArticle(""AR0139989"", """", ""NT00"", ""L"")",AR0139989
3,조선로동당 중앙위원회 제7기 제20차 정치국 확대회의 진행,"fn_showArticle(""AR0139950"", """", ""NT00"", ""L"")",AR0139950
4,경애하는 최고령도자 김정은동지께서 《총련분회대표자대회-2020》(새 전성기 3차대...,"fn_showArticle(""AR0139645"", """", ""NT00"", ""L"")",AR0139645
...,...,...,...
111,모범기술혁신단위칭호쟁취운동 활발,"fn_showArticle(""AR0139668"", """", ""NT09"", ""L"")",AR0139668
112,방역사업의 강도를 높여,"fn_showArticle(""AR0140316"", """", ""NT10"", ""L"")",AR0140316
113,작업현장소독을 구체적으로,"fn_showArticle(""AR0140387"", """", ""NT10"", ""L"")",AR0140387
114,대중봉사장소에 대한 소독사업 강화,"fn_showArticle(""AR0140379"", """", ""NT10"", ""L"")",AR0140379
