# Using Python 2

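In Python 2, whether you `print` a string or just let the notebook display it makes a big difference: `print` writes the characters themselves, while a bare expression shows the string's escaped representation.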

In [98]:
# Works fine since it's ASCII
print 'hello world'

hello world


In [99]:
# Works fine since it's ASCII
'hello world'

'hello world'

In [100]:
# It's UTF-8, but it works because we're using print
print '你好世界'

你好世界


In [101]:
# Shows escaped UTF-8 bytes (\x..) because we didn't use print :(
'你好世界'

'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
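
The difference comes from `print` writing the string itself while the notebook displays its `repr`. Here is a minimal sketch of that distinction, written so it runs on both Python 2 and Python 3 (the variable names `text` and `encoded` are just for illustration):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u'你好世界'               # four characters
encoded = text.encode('utf-8')  # the same text as UTF-8 bytes

print(len(text))      # 4 characters
print(len(encoded))   # 12 bytes, three per character
print(repr(encoded))  # the escaped \x.. form (with a b prefix on Python 3)

# Decoding the bytes gives back the original characters
print(encoded.decode('utf-8') == text)  # True
```

Each `\x..` pair in the escaped output is one byte, which is why four characters turn into twelve escapes.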

# Working with libraries

Some libraries handle UTF-8/Unicode better than others.

In [102]:
from bs4 import BeautifulSoup
import requests

In [103]:
# Use result.text instead of result.content to make sure you get Unicode
# (We might have used .content in class)
# Because we're using Python 2, the notebook displays it with \u escape codes

result = requests.get("http://djchina.org")
result.text

u'<!DOCTYPE html>\r\n\r\n<!-- BEGIN html -->\r\n<html xmlns="http://www.w3.org/1999/xhtml" lang="zh-CN">\r\n\r\n<!-- BEGIN head -->\r\n<head>\r\n\r\n\t<!-- Meta Tags -->\r\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\r\n    \r\n    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\r\n\t\r\n\t<!-- Title -->\r\n\t<title>\u6570\u636e\u65b0\u95fb\u4e2d\u6587\u7f51 | When Data Meet Journalism</title>\r\n\r\n    <!--Favicon code-->\r\n     <link rel="shortcut icon" href="http://djchina.org/djchina/wp-content/uploads/2013/09/djchina_weibologo-03.jpg"/>\n   \r\n\r\n    <!-- 1140px Grid styles for IE -->\r\n\t<!--[if lte IE 9]><link rel="stylesheet" href="http://djchina.org/djchina/wp-content/themes/zend/css/ie.css" type="text/css" media="screen" /><![endif]-->\r\n\r\n\t<!--add css for different pages-->\r\n\t\r\n\t\r\n\t    \r\n\t        <link rel="stylesheet" href="http://djchina.org/djchina/wp-content/themes/zend/css/not-home.cs
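
`result.content` is the raw bytes the server sent, and `result.text` is those bytes decoded into Unicode. Here is a rough sketch of that decoding step using only built-ins, with `raw_bytes` as a stand-in for `result.content` (an assumption so the example runs without a network request):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Stand-in for result.content: the UTF-8 bytes of the page title
raw_bytes = u'数据新闻中文网'.encode('utf-8')

# Decoding with the right codec recovers the characters
decoded = raw_bytes.decode('utf-8')

# Decoding with the wrong codec does not fail, it quietly produces garbage
garbled = raw_bytes.decode('latin-1')

print(len(raw_bytes))      # 21 bytes, three per character
print(len(decoded))        # 7 characters
print(decoded == garbled)  # False
```

If `result.text` ever looks garbled, the page was probably decoded with the wrong codec. In that case, setting `result.encoding = 'utf-8'` before reading `result.text` is worth trying.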

In [104]:
# I personally like using
# soup = BeautifulSoup(result.text, 'html.parser')
# but let's keep it simpler

soup = BeautifulSoup(result.text)

In [105]:
# Pull out the titles and URLs

tags = soup.select("h2.title a")
titles = [tag.text for tag in tags]
urls = [tag['href'] for tag in tags]

In [106]:
# URLs look fine
urls[:5]

['http://djchina.org/2015/07/12/miami-alberto-cairo/',
 'http://djchina.org/2015/04/24/2015nicar_without_coding/',
 'http://djchina.org/2015/02/10/tools-entry-level/',
 'http://djchina.org/2015/01/28/2015events/',
 'http://djchina.org/2014/12/28/sohu_visualization/']

In [107]:
# Shows up as Unicode escape codes (\uXXXX) instead of readable characters
titles[:5]

[u'\u8fc8\u963f\u5bc6\u5927\u5b66\uff1aAlberto Cairo\u6559\u6388\u7684\u6570\u636e\u65b0\u95fb',
 u'\u30102015 NICAR\u4f1a\u8bae\u7cfb\u5217\u62a5\u9053\u4e4b\u4e8c\u3011\u4e0d\u5199\u4ee3\u7801\uff0c\u4e5f\u80fd\u6210\u4e3a\u8bb0\u8005\u6781\u5ba2',
 u'\u6570\u636e\u65b0\u95fb\u5b9e\u7528\u6cd5\u5b9d\u2014\u2014\u5165\u95e8\u7bc7',
 u'2015\u53ef\u89c6\u5316\u5927\u4e8b\u4ef6\u4e00\u89c8',
 u'\u56de\u9996PC\u65f6\u4ee3\u7684\u5bcc\u5a92\u4f53\u4e13\u9898\u2014\u4ee5\u641c\u72d0\u4e3a\u4f8b']

In [108]:
# Still shows escape codes: printing a list prints the repr of each item
print(titles[:5])

[u'\u8fc8\u963f\u5bc6\u5927\u5b66\uff1aAlberto Cairo\u6559\u6388\u7684\u6570\u636e\u65b0\u95fb', u'\u30102015 NICAR\u4f1a\u8bae\u7cfb\u5217\u62a5\u9053\u4e4b\u4e8c\u3011\u4e0d\u5199\u4ee3\u7801\uff0c\u4e5f\u80fd\u6210\u4e3a\u8bb0\u8005\u6781\u5ba2', u'\u6570\u636e\u65b0\u95fb\u5b9e\u7528\u6cd5\u5b9d\u2014\u2014\u5165\u95e8\u7bc7', u'2015\u53ef\u89c6\u5316\u5927\u4e8b\u4ef6\u4e00\u89c8', u'\u56de\u9996PC\u65f6\u4ee3\u7684\u5bcc\u5a92\u4f53\u4e13\u9898\u2014\u4ee5\u641c\u72d0\u4e3a\u4f8b']


In [109]:
# Shows up as correct characters
for title in titles[:5]:
    print(title)

迈阿密大学：Alberto Cairo教授的数据新闻
【2015 NICAR会议系列报道之二】不写代码，也能成为记者极客
数据新闻实用法宝——入门篇
2015可视化大事件一览
回首PC时代的富媒体专题—以搜狐为例
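
Printing a list in Python 2 shows the `repr` of each item, which is why the escape codes come back. There are two ways to see readable text without writing a loop, sketched here with a shortened `titles` list that stands in for the scraped one: join the strings into one, or use `json.dumps` with `ensure_ascii=False`.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import json

# Two of the titles from above, standing in for the scraped list
titles = [u'迈阿密大学：Alberto Cairo教授的数据新闻', u'2015可视化大事件一览']

# Option 1: join into a single string, one title per line
joined = u'\n'.join(titles)
print(joined)

# Option 2: keep the list shape but skip the \u escaping
as_json = json.dumps(titles, ensure_ascii=False)
print(as_json)
```

Both options hand `print` a single string rather than a list, so the characters are written out directly.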


## Pandas

`pandas` displays UTF-8 text fine in the notebook, but in Python 2 saving to CSV fails unless you set the encoding yourself.

In [110]:
import pandas as pd

# Displays okay in the notebook

content = pd.DataFrame({'title': titles, 'url': urls})
content.head()

|   | title | url |
|---|-------|-----|
| 0 | 迈阿密大学：Alberto Cairo教授的数据新闻 | http://djchina.org/2015/07/12/miami-alberto-ca... |
| 1 | 【2015 NICAR会议系列报道之二】不写代码，也能成为记者极客 | http://djchina.org/2015/04/24/2015nicar_withou... |
| 2 | 数据新闻实用法宝——入门篇 | http://djchina.org/2015/02/10/tools-entry-level/ |
| 3 | 2015可视化大事件一览 | http://djchina.org/2015/01/28/2015events/ |
| 4 | 回首PC时代的富媒体专题—以搜狐为例 | http://djchina.org/2014/12/28/sohu_visualization/ |


In [111]:
# Fails to save because it's trying to save as ASCII

content.to_csv("pandas-output-python2.csv")

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

In [None]:
# If you set encoding it works, though

content.to_csv("pandas-output-python2.csv", encoding='utf-8')

In [None]:
# But reads in the UTF-8 CSV without a problem

df = pd.read_csv("pandas-output-python2.csv")
df.head()
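
The same rule applies outside `pandas`: name the encoding whenever text crosses into a file. Here is a minimal sketch using `io.open` from the standard library, which behaves the same on Python 2 and 3. The file name and the single sample row are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

# Hypothetical output path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'titles-utf8.csv')

rows = [u'title,url',
        u'2015可视化大事件一览,http://djchina.org/2015/01/28/2015events/']

# Writing: the encoding turns Unicode text into UTF-8 bytes on disk
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(rows) + u'\n')

# Reading: the same encoding turns the bytes back into Unicode text
with io.open(path, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

print(lines == rows)  # True
```

Leaving out `encoding` is what lets Python 2 fall back to ASCII, which is the error `to_csv` raised above.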

# Enabling Python 3 in IPython Notebooks

When you pick Python 2 while creating a new notebook, you're picking the **kernel**, the little bundle of Python 2 code that runs your cells. First, let's save a copy of the Python 2 kernel. From Terminal, run:

    ipython kernelspec install-self --user

Now we need to use Anaconda to install a fresh copy of Python 3, along with the Python 3 versions of all the libraries we use. This is called **creating a Python 3 environment**. Our default environment so far has been Python 2. This will take a while.

    conda create -n python3 python=3 anaconda
    
`python3` is the name of the environment (other tutorials might call it `py3k` or `py33` or something similar). Now we need to switch to our new environment:

    source activate python3
  
And then save the Python 3 kernel, just like we did with the default Python 2 one. `kernelspec install-self` saves the kernel for whichever Python is currently active, so we run:

    ipython kernelspec install-self --user

You might get an error about `$PYTHONPATH`, which might be set to have Python 3 looking for Python 2 libraries (oh no!). If you get this error, just run `unset PYTHONPATH` and then try the kernelspec line again.

Now that the kernel is saved, we can exit the `python3` environment:

    source deactivate

**Even though we won't be inside the python3 environment, IPython Notebook can still access it**. Start up your IPython Notebook with

    ipython notebook

And when you go to create a new notebook, you should have `Python 3` as an option. Hooray!
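
To confirm which kernel a notebook is actually running, you can ask Python itself from a cell. Here is a small sketch using only the standard library:

```python
from __future__ import print_function
import os
import sys

# Major version of the interpreter behind this kernel: 2 or 3
major = sys.version_info[0]
print(major)

# Path to the interpreter; for the Python 3 kernel it will typically
# point somewhere inside the python3 environment's folder
print(sys.executable)

# None means PYTHONPATH is not set; a path to Python 2 libraries here
# is the situation the `unset PYTHONPATH` step is meant to fix
print(os.environ.get('PYTHONPATH'))
```

Running this in a notebook created with each kernel should print a different major version and a different interpreter path.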