# North Korean News

Scrape the North Korean news agency's website, http://kcna.kp.

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is taking up too much stuff, you can try `.*?` instead: rather than "take as much as possible", it means "take only as much as needed"_
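
The regex tips above can be tried on a small hand-made DataFrame before touching the scraped data. This is a minimal sketch, assuming a column named `link` that holds `onclick` strings like the one in the example; the sample rows and the `article_id` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data shaped like the onclick values on the site
df = pd.DataFrame({
    "link": [
        'fn_showArticle("AR0125885", "", "NT00", "L")',
        'fn_showArticle("AR0126283", "", "NT09", "L")',
    ]
})

# \( matches a literal "(", and (.*?) lazily captures up to the first closing quote.
# expand=False returns a Series, so it can be assigned to a single new column.
df["article_id"] = df["link"].str.extract(r'\("(.*?)"', expand=False)

# For comparison: a greedy .* runs on to the LAST quote in the string
greedy = df["link"].str.extract(r'\("(.*)"', expand=False)

print(df["article_id"].tolist())
print(greedy.tolist())
```

Printing both results side by side makes the difference between `.*?` and `.*` easy to see.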

In [7]:
import requests
import re
import pandas as pd

from bs4 import BeautifulSoup

In [8]:
url = 'http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=BD33A8A3052EA200FB70740E9828BF04'
response = requests.get(url, verify=False)
doc = BeautifulSoup(response.text)

In [9]:
doc


<html>
<head>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<script language="javascript">
	var globalContextPath = "";
	var jsLangCode = "eng";
	var flashPlayer = "/download/FlashPlayer10.zip";
	var gYearStr = "Juche";
</script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/sys/css/homepage.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/homecss.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/calendar.css" rel="stylesheet" type="text/css"/>
<link href="/sys/css/special.css" rel="stylesheet" type="text/css"/>
<style>
	body {
	
		font-family: Tahoma, serif, Arial, Helvetica;	
	
}
</style>
<!--[if IE]> 
	<link href="/sys/css/homepage_ie.css" rel="stylesheet" type="text/css"/>
<![endif]-->
<!--[if IE 6]> 
	<link href="/sys/css/homepage_ie6.css" rel="stylesheet" type="text/css"/>
<![endif]-->
<script language="javascript" src="/sys/js/comjs.js"></scrip

In [14]:
headlines = doc.find_all('h3')
headlines

[<h3><font style="font-size:9.5pt">
 <a class="titlebet" href="#this" onclick='fn_showArticle("AR0126283", "", "NT09", "L")'>Pyongyang International Sci-Tech Exhibition of Health and Medical Appliances Opens</a> <a href="#this" onclick='fn_showArticle("AR0126283", "", "NT09", "I")'><img alt="" border="0" height="11" src="images/photo.png" width="15"/></a> <a href="#this" onclick='fn_showArticle("AR0126283", "", "NT09", "V")'><img alt="" border="0" height="10" src="images/video.png" width="18"/></a></font></h3>,
 <h3><font style="font-size:9.5pt">
 <a class="titlebet" href="#this" onclick='fn_showArticle("AR0125978", "", "NT09", "L")'>Fictions and Models - 2019 Held</a> <a href="#this" onclick='fn_showArticle("AR0125978", "", "NT09", "I")'><img alt="" border="0" height="11" src="images/photo.png" width="15"/></a></font></h3>,
 <h3><font style="font-size:9.5pt">
 <a class="titlebet" href="#this" onclick='fn_showArticle("AR0124586", "", "NT12", "L")'>Performance Given by State Merited Cho
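
Each `<h3>` in the output above holds up to three `<a>` tags (headline text, photo icon, video icon), and only the one with class `titlebet` carries the headline. A minimal sketch of picking that link directly, assuming markup like the sample below; the HTML string is trimmed from the output above, the live page may differ, and `html.parser` is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Sample markup trimmed from the notebook output; not fetched from the live site
html = '''
<h3><font style="font-size:9.5pt">
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0125978", "", "NT09", "L")'>Fictions and Models - 2019 Held</a>&nbsp;<a href="#this" onclick='fn_showArticle("AR0125978", "", "NT09", "I")'><img alt="" src="images/photo.png"/></a></font></h3>
'''

soup = BeautifulSoup(html, "html.parser")

for h3 in soup.find_all("h3"):
    # Target the headline link by its class rather than taking the first <a>
    title_link = h3.find("a", class_="titlebet")
    # get_text(strip=True) drops the surrounding whitespace
    headline = title_link.get_text(strip=True)
    onclick = title_link["onclick"]
    print(headline, "|", onclick)
```

Reading the text from the `titlebet` link instead of the whole `<h3>` also avoids the leading newline and trailing non-breaking spaces that come from the surrounding markup.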

In [24]:
rows = []
for head in headlines:
    row = {}
    row['headline'] = head.text
    row['link'] = head.find('a', onclick=True)['onclick']
    # Extract the ID from this row's own onclick value (not a leftover `link` variable)
    row['ID'] = re.findall(r'\("([A-Z]+\d+)"', row['link'])[0]
    rows.append(row)
rows


[{'headline': '\nPyongyang International Sci-Tech Exhibition of Health and Medical Appliances Opens\xa0\xa0',
  'link': 'fn_showArticle("AR0126283", "", "NT09", "L")',
  'ID': 'AR0126283'},
 {'headline': '\nFictions and Models - 2019 Held\xa0',
  'link': 'fn_showArticle("AR0125978", "", "NT09", "L")',
  'ID': 'AR0125978'},
 {'headline': '\nPerformance Given by State Merited Chorus to Mark Founding Anniversary of KPRA\xa0',
  'link': 'fn_showArticle("AR0124586", "", "NT12", "L")',
  'ID': 'AR0124586'},
 {'headline': '\nPiano Concert Held by Students of Kim Won Gyun University of Music\xa0',
  'link': 'fn_showArticle("AR0126163", "", "NT12", "L")',
  'ID': 'AR0126163'},
 {'headline': '\n"Unbangul"-trademarked Musical Instruments Popular in DPRK\xa0',
  'link': 'fn_showArticle("AR0126072", "", "NT12", "L")',
  'ID': 'AR0126072'},
 {'headline': '\nArt Performance Given to Celebrate KCU Founding Anniversary\xa0',
  'link': 'fn_showArticle("AR0125945", "", "NT12", "L")',
  'ID': 'AR0125945'}

In [25]:
df = pd.DataFrame(rows)
df.head()

| | ID | headline | link |
|---|---|---|---|
| 0 | AR0126283 | \nPyongyang International Sci-Tech Exhibition ... | fn_showArticle("AR0126283", "", "NT09", "L") |
| 1 | AR0125978 | \nFictions and Models - 2019 Held | fn_showArticle("AR0125978", "", "NT09", "L") |
| 2 | AR0124586 | \nPerformance Given by State Merited Chorus to... | fn_showArticle("AR0124586", "", "NT12", "L") |
| 3 | AR0126163 | \nPiano Concert Held by Students of Kim Won Gy... | fn_showArticle("AR0126163", "", "NT12", "L") |
| 4 | AR0126072 | \n"Unbangul"-trademarked Musical Instruments P... | fn_showArticle("AR0126072", "", "NT12", "L") |


In [26]:
df.to_csv("nk-news.csv", index=False)
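
One way to confirm that the index was not saved is to read the CSV back and look at its columns. The sketch below is a rough check under stated assumptions: it uses an in-memory buffer and a single made-up row instead of the real file, so the variable names and sample values are illustrative only.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the scraped data
df = pd.DataFrame([{
    "headline": "Fictions and Models - 2019 Held",
    "link": 'fn_showArticle("AR0125978", "", "NT09", "L")',
    "ID": "AR0125978",
}])

# Write without the index, then read the result back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# For comparison: saving WITH the index adds an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(check.columns.tolist())
print(with_index.columns.tolist())
```

If an `Unnamed: 0` column shows up when reading the file back, the index was written along with the data.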