# Using Python 3

In Python 2, whether you `print` a Unicode string or just display its value makes a big difference in what you see. In Python 3 it does not, because Python 3 strings are Unicode natively.

The most obvious difference between Python 2 and Python 3 is that `print` is a function in Python 3, so you have to use parentheses.
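
Because `print` is an ordinary function in Python 3, it also accepts keyword arguments. The following is a minimal sketch of that idea; the `buffer` variable and the sample words are only illustrative, and `io.StringIO` is used here just to capture the output so it can be inspected.

```python
import io

# Capture what print() writes so we can look at it afterwards
buffer = io.StringIO()

# sep, end and file are keyword arguments of the print() function
print('hello', 'world', sep=', ', end='!\n', file=buffer)

output = buffer.getvalue()
print(output, end='')
```

In Python 2, `print 'hello world'` without parentheses was a statement; in Python 3 that form is a syntax error.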

In [92]:
# Works fine

print('hello world')

hello world


In [93]:
# Works fine

'hello world'

'hello world'

In [94]:
# Works fine

print('你好世界')

你好世界


In [95]:
# Works fine

'你好世界'

'你好世界'
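
The cells above differ only in whether the string is printed or displayed. Displaying a value uses its `repr`, and in Python 3 the `repr` of a string keeps printable non-ASCII characters. As a rough sketch of the contrast, the built-in `ascii()` produces an escaped form similar to what Python 2 showed for Unicode strings; the variable names here are illustrative.

```python
text = '你好世界'

displayed = repr(text)   # what a notebook shows for a bare expression
escaped = ascii(text)    # escaped form, with \u sequences instead of characters

print(displayed)
print(escaped)
```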

# Working with libraries

Some libraries handle UTF-8/Unicode better than others.

In [96]:
from bs4 import BeautifulSoup
import requests

In [97]:
# Use result.text (a decoded str) instead of result.content (raw bytes) to make sure you get Unicode
# (We might have used .content in class)
# Because we're using Python 3 instead of Python 2, we can see the
# Chinese characters in the output

result = requests.get("http://djchina.org")
result.text

'<!DOCTYPE html>\r\n\r\n<!-- BEGIN html -->\r\n<html xmlns="http://www.w3.org/1999/xhtml" lang="zh-CN">\r\n\r\n<!-- BEGIN head -->\r\n<head>\r\n\r\n\t<!-- Meta Tags -->\r\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\r\n    \r\n    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\r\n\t\r\n\t<!-- Title -->\r\n\t<title>数据新闻中文网 | When Data Meet Journalism</title>\r\n\r\n    <!--Favicon code-->\r\n     <link rel="shortcut icon" href="http://djchina.org/djchina/wp-content/uploads/2013/09/djchina_weibologo-03.jpg"/>\n   \r\n\r\n    <!-- 1140px Grid styles for IE -->\r\n\t<!--[if lte IE 9]><link rel="stylesheet" href="http://djchina.org/djchina/wp-content/themes/zend/css/ie.css" type="text/css" media="screen" /><![endif]-->\r\n\r\n\t<!--add css for different pages-->\r\n\t\r\n\t\r\n\t    \r\n\t        <link rel="stylesheet" href="http://djchina.org/djchina/wp-content/themes/zend/css/not-home.css" type="text/css" media="screen" />
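
In `requests`, `result.content` is the raw response body as `bytes`, while `result.text` is a `str` decoded with the encoding stored in `result.encoding`. The sketch below imitates that difference offline with a made-up snippet of HTML, so it makes no network call; `raw` and `decoded` are stand-ins for `.content` and `.text`, assuming the page is UTF-8.

```python
# Stand-in for result.content: the bytes that travel over the network
raw = '<title>数据新闻中文网</title>'.encode('utf-8')

# Stand-in for result.text: the same data decoded into a Python 3 str
decoded = raw.decode('utf-8')

print(type(raw).__name__, type(decoded).__name__)
print(decoded)
```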

In [98]:
# I personally like using
# soup = BeautifulSoup(result.text, 'html.parser')
# but let's keep it simpler

soup = BeautifulSoup(result.text)

In [99]:
# Pull out the link tags, then their titles and URLs

tags = soup.select("h2.title a")
titles = [tag.text for tag in tags]
urls = [tag['href'] for tag in tags]
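
To try the same selector without hitting the live site, here is a small sketch that runs it on an inline HTML string. The HTML, the `example.com` URLs and the titles are made up for illustration; the parser is passed explicitly as `'html.parser'`, which avoids the warning BeautifulSoup gives when no parser is named.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with the same h2.title > a structure
html = """
<html><body>
  <h2 class="title"><a href="http://example.com/a/">第一篇文章</a></h2>
  <h2 class="title"><a href="http://example.com/b/">Second post</a></h2>
  <h2 class="other"><a href="http://example.com/c/">Ignored</a></h2>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Only links inside <h2 class="title"> match the selector
tags = soup.select("h2.title a")
titles = [tag.text for tag in tags]
urls = [tag['href'] for tag in tags]

print(titles)
print(urls)
```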

In [100]:
# URLs are only ASCII, so they look fine
urls[:5]

['http://djchina.org/2015/07/12/miami-alberto-cairo/',
 'http://djchina.org/2015/04/24/2015nicar_without_coding/',
 'http://djchina.org/2015/02/10/tools-entry-level/',
 'http://djchina.org/2015/01/28/2015events/',
 'http://djchina.org/2014/12/28/sohu_visualization/']

In [101]:
# Looks fine (didn't in Python 2)
titles[:5]

['迈阿密大学：Alberto Cairo教授的数据新闻',
 '【2015 NICAR会议系列报道之二】不写代码，也能成为记者极客',
 '数据新闻实用法宝——入门篇',
 '2015可视化大事件一览',
 '回首PC时代的富媒体专题—以搜狐为例']

In [102]:
# Looks fine (didn't in Python 2)
print(titles[:5])

['迈阿密大学：Alberto Cairo教授的数据新闻', '【2015 NICAR会议系列报道之二】不写代码，也能成为记者极客', '数据新闻实用法宝——入门篇', '2015可视化大事件一览', '回首PC时代的富媒体专题—以搜狐为例']


In [103]:
# Looks fine (this is the only one that worked in Python 2)
for title in titles[:5]:
    print(title)

迈阿密大学：Alberto Cairo教授的数据新闻
【2015 NICAR会议系列报道之二】不写代码，也能成为记者极客
数据新闻实用法宝——入门篇
2015可视化大事件一览
回首PC时代的富媒体专题—以搜狐为例


## Pandas

`pandas` usually handles UTF-8 well, but not always. In Python 3 the examples below work without any special handling.

In [104]:
import pandas as pd

content = pd.DataFrame({'title': titles, 'url': urls})
content.head()

|   | title | url |
|---|-------|-----|
| 0 | 迈阿密大学：Alberto Cairo教授的数据新闻 | http://djchina.org/2015/07/12/miami-alberto-ca... |
| 1 | 【2015 NICAR会议系列报道之二】不写代码，也能成为记者极客 | http://djchina.org/2015/04/24/2015nicar_withou... |
| 2 | 数据新闻实用法宝——入门篇 | http://djchina.org/2015/02/10/tools-entry-level/ |
| 3 | 2015可视化大事件一览 | http://djchina.org/2015/01/28/2015events/ |
| 4 | 回首PC时代的富媒体专题—以搜狐为例 | http://djchina.org/2014/12/28/sohu_visualization/ |


In [105]:
# Saves fine without specifying encoding (unlike Python 2)

content.to_csv("pandas-output-python3.csv")
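
To check that non-ASCII text survives a save, one option is to write the CSV and read it straight back. This is a sketch with made-up rows and a temporary file rather than the scraped data; the `encoding='utf-8'` arguments are spelled out for clarity, and `index=False` is an added choice that keeps the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for the scraped titles and URLs
content = pd.DataFrame({
    'title': ['数据新闻实用法宝', 'Second post'],
    'url': ['http://example.com/a/', 'http://example.com/b/'],
})

# Write to a temporary location so nothing in the working directory is touched
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')
content.to_csv(path, index=False, encoding='utf-8')

# Read the file back and compare with the original titles
restored = pd.read_csv(path, encoding='utf-8')
print(restored['title'].tolist() == content['title'].tolist())
```

Without `index=False`, the index is written as an unnamed first column, which `read_csv` then loads back as `Unnamed: 0`.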