# North Korean News

Scrape the North Korean news agency http://kcna.kp

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible," it means "take only as much as needed"_
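
The regex tips above can be tried on a tiny, hand-made DataFrame before touching the real scrape. This is a minimal sketch, assuming a column named `Link` (a name chosen here for illustration) that holds `onclick` strings shaped like the example in the assignment. The second sample value is made up.

```python
import re
import pandas as pd

# Hand-made sample; the second onclick value is invented for illustration
df = pd.DataFrame({
    "Link": [
        'fn_showArticle("AR0125885", "", "NT00", "L")',
        'fn_showArticle("AR0000001", "", "NT04", "L")',
    ]
})

# \( matches a literal parenthesis; the unescaped ( ) pair is the capture group.
# .*? is non-greedy, so it stops at the first closing quote.
# expand=False returns a Series, which can be assigned to a single column.
df["ID"] = df["Link"].str.extract(r'\("(.*?)"', expand=False)

# For comparison: greedy .* runs on to the LAST quote in the string
greedy = re.search(r'\("(.*)"', df["Link"][0]).group(1)

print(df["ID"].tolist())
print(greedy)
```

Comparing the two printed results shows why the non-greedy form is the safer default when the string contains several quoted arguments.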

In [22]:
import requests
import re
import pandas as pd

from bs4 import BeautifulSoup

In [23]:
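# Request the home-page listing endpoint that holds the headlines
url = 'http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf'
response = requests.get(url, verify=False)
# Parse the returned HTML so it can be searched by tag
doc = BeautifulSoup(response.text)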

In [24]:
doc


<html>
<head>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<script language="javascript">
	var globalContextPath = "";
	var jsLangCode = "kor";
	var flashPlayer = "/download/FlashPlayer10.zip";
	var gYearStr = "주체";
</script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>
<style>
	body {
			
		font-family: 돋음, 굴림, 청봉, Arial, Helvetica, sans-serif;;		
	
}
</style>
<!--[if IE]> 
	<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>
<![endif]-->
<!--[if IE 6]> 
	<link href="/sys/css/homepage_ie6.css" rel="stylesheet" type="text/css"/>
<![endif]-->
<script language="javascript" src="/sys/js/comjs.j

In [25]:
headlines = doc.find_all('h3')
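
Because the headlines carry their target in an `onclick` attribute rather than an `href`, it can help to try the lookup on a hand-written snippet first. The HTML below is a made-up stand-in for the real page (only the `onclick` value is taken from the assignment text), and the real markup may be nested differently.

```python
from bs4 import BeautifulSoup

# Invented sample markup, not copied from kcna.kp
sample_html = """
<h3><a href="#" onclick='fn_showArticle("AR0125885", "", "NT00", "L")'>Example headline</a></h3>
<h3>Heading without a link</h3>
"""
sample = BeautifulSoup(sample_html, "html.parser")

found = []
for h3 in sample.find_all("h3"):
    a = h3.find("a", onclick=True)  # only <a> tags that have an onclick attribute
    if a is not None:               # skip headings with no clickable link
        found.append(a["onclick"])

print(found)
```

The `None` check is a precaution: if any `<h3>` on the page lacks such a link, indexing `['onclick']` on the result of `find` would raise a `TypeError`.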

In [29]:
rows = []
for head in headlines:
    row = {}
    row['Headline'] = head.text
    # The articles have no href; the target lives in the onclick attribute
    row['Link'] = head.find('a', onclick=True)['onclick']
    # Extract the ID from this row's own onclick value (the first quoted argument)
    row['ID'] = re.findall(r'\("([A-Z]+\d+)"', row['Link'])[0]
    rows.append(row)
rows

[{'Headline': '\n김정은동지께서 김대중 전 대통령의 부인 리희호녀사의 유가족들에게 조의문을 보내시였다',
  'Link': 'fn_showArticle("AR0126135", "", "NT00", "L")',
  'ID': 'AR0126135'},
 {'Headline': '\n경애하는 최고령도자 김정은동지께서 김대중 전 대통령의 부인 리희호녀사의 유가족들에게 조의문과 조화를 보내시였다\xa0',
  'Link': 'fn_showArticle("AR0126133", "", "NT00", "L")',
  'ID': 'AR0126133'},
 {'Headline': '\n김정은동지께서 로씨야대통령에게 축전을 보내시였다',
  'Link': 'fn_showArticle("AR0126098", "", "NT00", "L")',
  'ID': 'AR0126098'},
 {'Headline': '\n경애하는 최고령도자 김정은동지께서 조선인민군 제2기 제7차 군인가족예술소조경연에서 당선된 군부대들의 군인가족예술소조원들과 기념사진을 찍으시였다\xa0',
  'Link': 'fn_showArticle("AR0125885", "", "NT00", "L")',
  'ID': 'AR0125885'},
 {'Headline': '\n김정은동지께서 꾸바공산당 중앙위원회 제1비서에게 축전을 보내시였다',
  'Link': 'fn_showArticle("AR0125876", "", "NT00", "L")',
  'ID': 'AR0125876'},
 {'Headline': '\n대집단체조와 예술공연 《인민의 나라》 개막\xa0',
  'Link': 'fn_showArticle("AR0125856", "", "NT00", "L")',
  'ID': 'AR0125856'},
 {'Headline': '\n조국통일연구원 상보',
  'Link': 'fn_showArticle("AR0125916", "", "NT04", "L")',
  'ID': 'AR0125916'},
 {'Head

In [30]:
df = pd.DataFrame(rows)
df.head()

Unnamed: 0,Headline,ID,Link
0,\n김정은동지께서 김대중 전 대통령의 부인 리희호녀사의 유가족들에게 조의문을 보내시였다,AR0126135,"fn_showArticle(""AR0126135"", """", ""NT00"", ""L"")"
1,\n경애하는 최고령도자 김정은동지께서 김대중 전 대통령의 부인 리희호녀사의 유가족들...,AR0126133,"fn_showArticle(""AR0126133"", """", ""NT00"", ""L"")"
2,\n김정은동지께서 로씨야대통령에게 축전을 보내시였다,AR0126098,"fn_showArticle(""AR0126098"", """", ""NT00"", ""L"")"
3,\n경애하는 최고령도자 김정은동지께서 조선인민군 제2기 제7차 군인가족예술소조경연에...,AR0125885,"fn_showArticle(""AR0125885"", """", ""NT00"", ""L"")"
4,\n김정은동지께서 꾸바공산당 중앙위원회 제1비서에게 축전을 보내시였다,AR0125876,"fn_showArticle(""AR0125876"", """", ""NT00"", ""L"")"


In [31]:
df.to_csv("nk-news.csv", index=False)
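
To see what `index=False` changes, the two variants can be compared on a toy table. This is a small sketch that writes to in-memory buffers (`io.StringIO`) instead of files, so nothing on disk is touched. The headline text is a placeholder, and the variable names are chosen only for this illustration.

```python
import io
import pandas as pd

# One-row toy table; "Example headline" is placeholder text
df = pd.DataFrame([{"Headline": "Example headline", "ID": "AR0125885"}])

with_index = io.StringIO()
df.to_csv(with_index)                    # default: the index is written as a first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)    # the index is left out

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

# Reading the first variant back gives the nameless index column a generated name
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))

print(repr(header_with))
print(repr(header_without))
print(list(reloaded.columns))
```

When the index is saved, the header row starts with an empty field, and `read_csv` later labels that column `Unnamed: 0`.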