# North Korean News

Scrape the North Korean news agency site at http://kcna.kp.

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see whether the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does the page look different when you scrape it with BeautifulSoup than when you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible", it means "take only as much as needed"_
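
The last two tips can be seen on the example `onclick` value from above. This is a small sketch using Python's built-in `re` module rather than pandas; the patterns are just one possible way to write them, and the variable names `greedy` and `lazy` are only for illustration.

```python
import re

onclick = 'fn_showArticle("AR0125885", "", "NT00", "L")'

# \( matches a literal "(" instead of opening a capture group.
# Greedy: .* runs on to the LAST double quote in the string.
greedy = re.search(r'fn_showArticle\("(.*)"', onclick).group(1)

# Non-greedy: .*? stops at the FIRST closing double quote.
lazy = re.search(r'fn_showArticle\("(.*?)"', onclick).group(1)

print(greedy)  # AR0125885", "", "NT00", "L
print(lazy)    # AR0125885
```

The same patterns should work with pandas' `.str.extract`, which also takes a regular expression with a capture group.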

In [1]:
from selenium import webdriver

In [2]:
driver = webdriver.Chrome()

In [3]:
driver.get('http://kcna.kp')

In [4]:
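from selenium.webdriver.common.by import By

# Collect the visible text and onclick handler of every <a> element.
# Selenium 4 removed find_elements_by_tag_name; find_elements(By.TAG_NAME, ...) replaces it.
# If the page re-renders mid-loop, Selenium raises StaleElementReferenceException;
# whatever was appended before the error stays in `articles`.
articles = []
items = driver.find_elements(By.TAG_NAME, 'a')
for item in items:
    article = {}
    article['article headline'] = item.text
    article['onclick'] = item.get_attribute('onclick')
    articles.append(article)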

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=75.0.3770.100)
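
One way to keep the loop from stopping partway is to skip any link that raises while it is being read. The sketch below is an assumption about how that might look, not the approach this notebook took: `collect_links`, `FakeLink` and `StaleLink` are hypothetical names, the fake classes only stand in for Selenium elements so the example runs without a browser, and the article ID in it is a placeholder.

```python
def collect_links(elements, skip_errors=(Exception,)):
    """Build one dict per link, skipping elements that raise while being read."""
    articles = []
    for item in elements:
        try:
            article = {
                'article headline': item.text,
                'onclick': item.get_attribute('onclick'),
            }
        except skip_errors:
            # The element went away between lookup and read; move on
            continue
        articles.append(article)
    return articles


class FakeLink:
    """Stand-in for a Selenium element (hypothetical, for illustration only)."""
    def __init__(self, text, onclick):
        self.text = text
        self._onclick = onclick

    def get_attribute(self, name):
        return self._onclick if name == 'onclick' else None


class StaleLink:
    """Stand-in that fails on read, like a stale element would."""
    @property
    def text(self):
        raise RuntimeError('stale element reference')

    def get_attribute(self, name):
        raise RuntimeError('stale element reference')


links = [
    FakeLink('Headline A', 'fn_showArticle("AR0000001", "", "NT00", "L")'),
    StaleLink(),
    FakeLink('Headline B', None),
]
articles = collect_links(links, skip_errors=(RuntimeError,))
print(len(articles))  # 2
```

With a real driver, the call would presumably pass the list of `<a>` elements and `skip_errors=(StaleElementReferenceException,)`, imported from `selenium.common.exceptions`.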


In [5]:
import pandas as pd
df = pd.DataFrame(articles)

In [6]:
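# Drop rows where the headline or onclick is missing (links without an onclick give None)
df = df.dropna(subset=['article headline', 'onclick'])
# Keep only rows whose onclick contains both "AR" (article IDs) and "L"
df = df[df.onclick.str.contains('AR')]
df = df[df.onclick.str.contains('L')]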

In [7]:
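# Pull out the article ID: "AR" followed by seven digits (raw string so \d is not treated as a string escape)
df['article ID'] = df['onclick'].str.extract(r"(AR\d{7})", expand=False)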
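
To see what `expand=False` does, here is a minimal sketch on a two-row stand-in DataFrame built from the `onclick` values shown in this notebook. The pattern `\("(AR\d+)"` is an alternative written for illustration, not the one used in the cell above.

```python
import pandas as pd

df = pd.DataFrame({
    "onclick": [
        'fn_showArticle("AR0125885", "", "NT00", "L")',
        'fn_showArticle("AR0126135", "", "NT00", "L")',
    ]
})

# With a single capture group, expand=False returns a Series,
# which can be assigned straight to one new column.
ids = df["onclick"].str.extract(r'\("(AR\d+)"', expand=False)
df["article ID"] = ids

print(type(ids).__name__)          # Series
print(df["article ID"].tolist())   # ['AR0125885', 'AR0126135']
```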

In [8]:
df.head()

| | article headline | onclick | article ID |
|---|---|---|---|
| 33 | 김정은동지께서 김대중 전 대통령의 부인 리희호녀사의 유가족들에게 조의문을 보내시였다 | `fn_showArticle("AR0126135", "", "NT00", "L")` | AR0126135 |
| 34 | 경애하는 최고령도자 김정은동지께서 김대중 전 대통령의 부인 리희호녀사의 유가족들에게... | `fn_showArticle("AR0126133", "", "NT00", "L")` | AR0126133 |
| 36 | 김정은동지께서 로씨야대통령에게 축전을 보내시였다 | `fn_showArticle("AR0126098", "", "NT00", "L")` | AR0126098 |
| 37 | 경애하는 최고령도자 김정은동지께서 조선인민군 제2기 제7차 군인가족예술소조경연에서 ... | `fn_showArticle("AR0125885", "", "NT00", "L")` | AR0125885 |
| 39 | 김정은동지께서 꾸바공산당 중앙위원회 제1비서에게 축전을 보내시였다 | `fn_showArticle("AR0125876", "", "NT00", "L")` | AR0125876 |


In [9]:
df.to_csv("nk-news.csv", index=False)
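
To check why `index=False` matters, the sketch below writes a tiny stand-in DataFrame both ways and reads each result back. It uses in-memory `io.StringIO` buffers instead of a real file so nothing is written to disk; the buffer and variable names are made up for the example.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"article ID": ["AR0125885", "AR0126135"]},
    index=[37, 33],
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'article ID']
print(cols_without)  # ['article ID']
```

If an `Unnamed: 0` column shows up when reading `nk-news.csv` back in, the index was most likely saved along with the data.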