# COLX 521 Lecture 6: Text documents

* Text IO
* Encodings
* Webpages

## Text IO

The classic way to open a file in Python is to assign the result of the `open` function to a variable (here `f`), and then close the file when you are done by calling its `close` method.

In [75]:
f = open("Lecture6_files.ipynb")
print(f.read(500))
f.close()

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# COLX 521 Lecture 6: Text documents\n",
    "\n",
    "* Text IO\n",
    "* Encodings\n",
    "* Webpages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Text IO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subs


A popular alternative is Python's `with ... as` syntax, which closes the file automatically at the end of the code block.

- Advantage: won't accidentally leave files open and lose data
- Disadvantage: extra indentation

In [76]:
with open("Lecture6_files.ipynb") as f:
    print(f.read(500))

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# COLX 521 Lecture 6: Text documents\n",
    "\n",
    "* Text IO\n",
    "* Encodings\n",
    "* Webpages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Text IO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subs
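
One way to confirm that a `with` block really closes the file is to inspect the file object's `closed` attribute afterwards. The sketch below works in a temporary directory (the file name `demo.txt` is only for illustration) and also shows that the file is closed even when an exception interrupts the block.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    with open(path, "w") as fout:
        fout.write("hello\n")
    closed_after_block = fout.closed      # closed on leaving the block

    try:
        with open(path) as f:
            raise ValueError("something went wrong mid-read")
    except ValueError:
        pass
    closed_after_error = f.closed         # closed despite the exception

print(closed_after_block, closed_after_error)
```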


Remember that read mode "r" does not need to be specified, since it is the default. Write mode "w" overwrites any existing file with the same name. Append mode "a" is useful if you are continuously adding to a file.

In [77]:
#provided code
with open("test.txt","w") as fout:
    fout.write("test write 1\n")

In [78]:

#provided code
with open("test.txt","w") as fout:
    fout.write("test write 2\n")

In [79]:
#provided code
with open("test.txt","a") as fout:
    fout.write("test append")

In [80]:
#provided code
f = open("test.txt")
print(f.read())
f.close()

test write 2
test append


The two most common options for reading files are iterating line by line using a *for* loop (which does not require holding the entire file in memory), or reading the entire file into a single string at once, using [read](https://docs.python.org/3/library/io.html#io.TextIOBase.read). You can read a single line of a file without a loop by using [readline](https://docs.python.org/3/library/io.html#io.TextIOBase.readline). A fourth option is [readlines](https://docs.python.org/3/library/io.html#io.IOBase.readlines), which reads the entire file into a list of strings, where each string is a line. Remember that in all of these cases, the newline characters will still be there!

In [81]:
#provided code
some_lines = "line1\nline2\nline3\nline4"
with open("test.txt","w") as fout:
    fout.write(some_lines)

In [82]:
with open("test.txt") as f:
    # my code here
    print(f.read())
    # my code here

line1
line2
line3
line4


In [83]:
with open("test.txt") as f:
    # my code here
    for line in f:
        print(line.strip())
    # my code here

line1
line2
line3
line4


In [84]:
with open("test.txt") as f:
    # my code here
    print(f.readline())
    #print(f.readline())
    # my code here

line1



In [85]:
with open("test.txt") as f:
    # my code here
    print(f.readlines())
    # my code here

['line1\n', 'line2\n', 'line3\n', 'line4']
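
Because every reading method keeps the trailing newline, it is common to strip it off. A minimal sketch of two ways to do so, using an in-memory `io.StringIO` object as a stand-in for an opened file:

```python
import io

text = "line1\nline2\nline3\nline4"

# io.StringIO behaves like a file opened for reading.
with io.StringIO(text) as f:
    stripped = [line.rstrip("\n") for line in f]

# splitlines() drops the newline characters in one step.
with io.StringIO(text) as f:
    split = f.read().splitlines()

print(stripped)
print(split)
```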


For writing, the `write` method can be used whether you are writing incrementally or all in one shot. There is a [writelines](https://docs.python.org/3/library/io.html#io.IOBase.writelines) method if you already have a list of strings, though note that newlines are not added.

In [86]:
#provided code
some_lines = ["line1","line2","line3","line4"]

In [87]:

with open("test.txt","w") as fout:
    #my code here
    for line in some_lines:
        fout.write(line + "\n")
    #my code here


In [88]:
with open("test.txt","a") as fout:
    #my code here
    fout.writelines(some_lines)


In [89]:
#provided code
with open("test.txt") as f:
    print(f.read())

line1
line2
line3
line4
line1line2line3line4
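
If each string should end up on its own line, the newlines have to be added before calling `writelines`. A minimal sketch, writing to an in-memory `io.StringIO` buffer rather than a real file:

```python
import io

some_lines = ["line1", "line2", "line3", "line4"]

buffer = io.StringIO()
# writelines accepts any iterable of strings, including a generator.
buffer.writelines(line + "\n" for line in some_lines)
written = buffer.getvalue()
print(written)
```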


## Encodings

For computers, numbers are everything. However, when dealing with texts, we need a way to associate numbers with characters. *Encodings* provide such a mapping. Generally there is a trade-off between the number of possible characters that can be represented and the amount of space required to store text on disk, so different encodings were developed to represent the particular characters used in particular languages.
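
The character-to-number mapping can be inspected directly in Python. In this small sketch, `ord` gives the Unicode code point of a character, `chr` goes the other way, and `str.encode` shows the bytes that a particular encoding uses to store it (Latin-1 and UTF-8 here, both introduced below):

```python
# Code points: the number Unicode assigns to each character.
code_point = ord("A")
character = chr(65)

# The same character can need a different number of bytes
# depending on the encoding.
one_byte = "ç".encode("latin-1")
two_bytes = "ç".encode("utf-8")

print(code_point, character, len(one_byte), len(two_bytes))
```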

![test](http://www.asciitable.com/index/asciifull.gif)

In Python, an encoding can be selected using the `encoding` keyword when you open a file. ASCII was the first major encoding and is very compact, but it can only represent 128 characters; using ASCII will fail if you try to write text that uses characters which aren't found on a typical English keyboard.

In [90]:
#provided code
with open("test.txt", "w",encoding="ascii") as fout:
    fout.write("this works\n")
    fout.write("ça ne va pas\n")
    fout.write("不行\n")

UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 0: ordinal not in range(128)

Latin-1 and various related encodings use a single byte per character, giving up to 256 characters, and between them support most of the languages of Europe. A variation on Latin-1 called CP-1252 is usually the default encoding on Windows, at least for Western European locales.

In [None]:
#provided code
with open("test.txt", "w",encoding="latin-1") as fout:
    fout.write("this works")
    fout.write("ça va")
    fout.write("还是不行")
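
The difference between Latin-1 and CP-1252 shows up in a handful of characters. As one example (chosen here for illustration), the euro sign exists in CP-1252 but not in Latin-1:

```python
euro = "€"

cp1252_bytes = euro.encode("cp1252")   # a single byte

try:
    euro.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False                  # Latin-1 has no euro sign

print(cp1252_bytes, latin1_ok)
```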

These days, the most popular encoding is definitely UTF-8, which supports all the characters included in Unicode, covering pretty much every written language as well as things like emoji. Even if you don't think you need it, it is a good idea to save the text files you create as UTF-8. The characters included in ASCII have the same representation in UTF-8, so for normal English texts it is no less efficient. Note that UTF-8 is the default encoding on macOS (formerly OS X) and most Linux systems.

In [None]:
#provided code
with open("test.txt", "w",encoding="utf-8") as fout:
    fout.write("this works\n")
    fout.write("ça va\n")
    fout.write("可以了\n")

In [None]:
#provided code
with open("test.txt",encoding="utf-8") as f:
    print(f.read())

In [None]:
#provided code  
with open("test.txt",encoding="ascii") as f:
    print(f.read())
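
The claim that ASCII characters have the same representation in UTF-8 can be checked by comparing the encoded bytes. The sketch also shows that characters outside ASCII take more than one byte each in UTF-8:

```python
english = "this works"
same_bytes = english.encode("ascii") == english.encode("utf-8")

# Number of bytes UTF-8 needs for single characters.
bytes_per_char = {ch: len(ch.encode("utf-8")) for ch in ["a", "ç", "可", "😀"]}

print(same_bytes)
print(bytes_per_char)
```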

Exercise: write or grab some text in another language with characters that don't appear in English, and show that UTF-8 can represent it but ASCII cannot.

In [None]:
with open("test.txt", "w",encoding="utf-8") as fout:
    fout.write("هذا يعمل")
    
with open("test.txt", "r",encoding="utf-8") as f:
    print(f.read())    

with open("test.txt", "w",encoding="ascii") as fout:
    fout.write("هذا لا يعمل")
    

Most of the time, encodings just work and you don't have to think about them. However, sooner or later (probably sooner) you will get an encoding error when you read a file. You might try changing the encoding, or autodetecting the encoding (more on this below). But sometimes it just doesn't work (or you don't have the patience), at which point you might want to try a more liberal option for the *errors* keyword parameter, such as `"ignore"` or `"replace"`.

In [None]:
#provided code
with open("test.txt", "w",encoding="utf-8") as fout:
    fout.write("this works")
    fout.write("ça va")
    fout.write("可以了")
    
    
with open("test.txt",encoding="ascii",errors="replace") as f:
    print(f.read())
          

Usually encodings can be handled as part of file IO, but sometimes you need to encode to a bytes object or decode from one when there is no file involved. Use the `encode` method of strings and the `decode` method of bytes objects, which also accept the `errors` keyword argument.

In [None]:
with open("test.txt","rb") as f:
    text = f.read()
    # my code here
    print(text)
    print(text.decode("utf-8"))
    print(text.decode("ascii",errors="ignore"))
    # my code here
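
The same conversions work without any file at all. A minimal sketch of `encode` and `decode` on an in-memory string, including the effect of the two lenient `errors` options:

```python
text = "ça va"

# Encoding: str -> bytes
utf8_bytes = text.encode("utf-8")
ascii_replaced = text.encode("ascii", errors="replace")  # unencodable chars become "?"
ascii_ignored = text.encode("ascii", errors="ignore")    # unencodable chars are dropped

# Decoding: bytes -> str
round_trip = utf8_bytes.decode("utf-8")
# Each undecodable byte becomes the replacement character U+FFFD.
decoded_replaced = utf8_bytes.decode("ascii", errors="replace")

print(utf8_bytes, ascii_replaced, ascii_ignored)
print(round_trip, decoded_replaced)
```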

## Webpages

Webpages are opened much like a file on disk, using the `urlopen` function (from the module `urllib.request`), which returns a file-like object; just pass it the URL. The contents of the HTML file are then accessed using the `read` method.

In [None]:
from urllib.request import urlopen

url = "http://www.ubc.ca"

#my code here

f = urlopen(url)
binary_html = f.read()
print(binary_html[:100])

#my code here

One tricky bit is that `read` returns not a text string but a raw bytes string, which needs to be decoded. Since UTF-8 is by far the most common encoding on the web, decoding as UTF-8 usually does the trick.

In [None]:
html = binary_html.decode("utf-8")
print(html[:100])
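
UTF-8 is a good first guess, but a server can announce a different encoding in its `Content-Type` header. Below is a minimal sketch of reading that value with the standard library. The helper name `charset_from_content_type` and the sample header strings are made up for illustration, and falling back to UTF-8 is an assumption rather than a rule.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
fallback = charset_from_content_type("text/html")
print(declared, fallback)
```

With a live response from `urlopen`, the same information should be available as `f.headers.get_content_charset()`, and the result can be passed straight to `decode`.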

XML is a way to add explicit structure to text. 

- Spans of text are enclosed in tags representing metatextual information, forming an element
- Tags optionally include attributes and their values
- Basic syntax of an XML element: `<tagname attribute1="value" attribute2="value" ...> text </tagname>`
- In addition to text, elements can contain other elements, but elements cannot otherwise overlap
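
These pieces can be seen by parsing a tiny XML string with the standard library's `xml.etree.ElementTree` module. The document below, including its tag and attribute names, is invented for illustration:

```python
import xml.etree.ElementTree as ET

xml_text = '<sentence lang="en" id="s1">Hello <name type="person">Ada</name></sentence>'
root = ET.fromstring(xml_text)

print(root.tag)      # tag name of the outer element
print(root.attrib)   # its attributes as a dictionary
print(root.text)     # text before the first child element

child = root[0]      # elements can contain other elements
print(child.tag, child.attrib, child.text)
```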



HTML, the markup language used for representing webpages, follows the same tag-and-attribute conventions, though it is more lenient than XML about things like unclosed tags.

In [None]:
print(html)

Exercise: use regular expressions to pull out all the opening HTML tags from the webpage. You should include neither closing tags, which are marked with a /, nor *declarations*, which are marked with a !

In [None]:
import re

xml_re = r"<[^/!][^>]*>"  # "*" rather than "+" so that one-letter tags such as <p> also match

for match in re.finditer(xml_re,html):
    print(match.group())
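
A pattern of this kind is easiest to check on a small hand-written snippet where the expected tags are known. The sketch below uses `*` after the second character class so that one-letter tags such as `<b>` are matched too; the sample HTML is made up for illustration:

```python
import re

sample = ('<!DOCTYPE html><html><body><p class="intro">Hi <b>there</b></p>'
          '<br/></body></html>')

# "<", then one character that is neither "/" nor "!",
# then anything up to the closing ">".
opening_tag_re = r"<[^/!][^>]*>"

opening_tags = re.findall(opening_tag_re, sample)
print(opening_tags)
```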

Though you can use regex to pull out specific tags, it is often useful to parse XML/HTML into an explicit tree structure which can be navigated. There are many packages which will do this for you; one of the most popular is [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc). You can [create a "Soup" tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup) from either a file or an existing string; one bonus is that if you give Beautiful Soup an opened file corresponding to a webpage, it will guess the encoding for you and convert it to Unicode. If you want this functionality without parsing XML, use [UnicodeDammit](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit).

In [None]:
#provided code
from bs4 import BeautifulSoup

soup =  BeautifulSoup(urlopen("http://www.ubc.ca"),"lxml")
print(soup.prettify())

If you are looking for (the contents of) a specific node or nodes, you can find them using the [find/find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) methods.

In [None]:
for node in soup.find_all("a"):
    print(node)

You can also look around the tree manually by using attributes of each node object, including [contents](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children) (which holds a node's children) and `parent`. You can treat `contents` as a list.

In [93]:
#provided code
form = soup.find("form")

In [94]:
print(form.prettify())

<form action="//www.ubc.ca/search/" class="form-search" method="get" role="search">
 <label for="q">
  Search UBC
 </label>
 <input class="input-xlarge search-query" id="q" name="q" placeholder="Search UBC" type="text"/>
 <button class="btn" type="submit">
  Search
 </button>
</form>



In [95]:
# The children alternate between newline strings and tags; index 5 is the <button>
form.contents[5].contents[0]

'Search'

In [96]:
for node in form.contents:
    print(node)



<label for="q">Search UBC</label>


<input class="input-xlarge search-query" id="q" name="q" placeholder="Search UBC" type="text"/>


<button class="btn" type="submit">Search</button>




In [29]:
print(form.parent)

<div id="ubc7-search-box">
<form action="//www.ubc.ca/search/refine/" class="form-search" method="get">
<input class="input-xlarge search-query" name="q" placeholder="Search " type="text"/>
<input name="label" type="hidden" value="UBC Master of Data Science"/>
<input name="site" type="hidden" value="masterdatascience.ubc.ca"/>
<button class="btn btn-primary" type="submit">Search</button>
</form> </div>


Other things that can be accessed through a node object: the tag name (in `name`) and the HTML attributes (in `attrs`), which can also be accessed by treating the node as a dictionary.

In [30]:
form.name

'form'

In [31]:
form.attrs

{'class': ['form-search'],
 'method': 'get',
 'action': '//www.ubc.ca/search/refine/'}

In [32]:
form["class"]

['form-search']

Text is wrapped up in a special object (NavigableString) that forms the leaves of the tree. If you want to just grab any and all text under a node in the tree, use [get_text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text).

In [34]:
form.get_text()

'\n\n\n\nSearch\n'

In [35]:
type(form.contents[0])

bs4.element.NavigableString

In [36]:
type(form.contents[1])

bs4.element.Tag
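
The node attributes above can be tried out without a network connection by parsing a small hand-written string. The HTML snippet below is made up for illustration, and the sketch uses Python's built-in `"html.parser"` backend instead of `"lxml"` so that only `bs4` itself needs to be installed:

```python
from bs4 import BeautifulSoup

snippet = ('<div id="box"><form class="form-search" method="get">'
           '<label for="q">Search</label><input name="q" type="text"/>'
           '</form></div>')
soup = BeautifulSoup(snippet, "html.parser")
form = soup.find("form")

tag_name = form.name                                  # the tag itself
method = form["method"]                               # dictionary-style access
classes = form["class"]                               # class values come back as a list
parent_id = form.parent["id"]                         # one step up the tree
child_names = [child.name for child in form.contents]
all_text = form.get_text()                            # all text under the node

print(tag_name, method, classes, parent_id, child_names, all_text)
```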

Let's work through an exercise together: from the MDS-CL page (https://masterdatascience.ubc.ca/programs/computational-linguistics), we'll get a list of strings which correspond to the text that appears in "p" HTML tags whose parent is a div with an "id" of "cl-curriculum".

In [11]:
strings  = []
soup = BeautifulSoup(urlopen("http://masterdatascience.ubc.ca/programs/computational-linguistics"),"lxml")
for node in soup.find_all("p"):
    if node.parent.name == "div" and "id" in node.parent.attrs and node.parent["id"] == "cl-curriculum":
        text = node.get_text()
        if text.strip():
            strings.append(text)
print(strings)

['The program structure includes 24 one-credit courses offered in four-week segments. Courses are lab-oriented and delivered in-person with some blended online content.', 'At the end of the six segments, an eight-week capstone project is also included, allowing students to apply their newly acquired knowledge, while working alongside other students with real-life data sets. Please note that instructors are subject to change.', 'Review Admission Requirements Contact Us With Questions', 'As part of their capstone project, students from UBC’s Master of Data Science program partnered with Finn Ai, to help the banking software company improve their AI assistant’s ability to identify user intents.', 'Examining the company’s existing neural network model, the students were able to identify areas of confusion for the AI and improve customer service.', 'View Full Story']
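
The same parent-and-id condition can also be written as a CSS selector with Beautiful Soup's `select` method. The sketch below is an alternative to the loop above, not the approach used in the lecture. It runs on a made-up HTML string standing in for the real page and uses the built-in `"html.parser"` backend:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the real page.
sample_html = """
<div id="cl-curriculum">
  <p>First paragraph.</p>
  <p>   </p>
  <section><p>Nested, so not a direct child.</p></section>
</div>
<div id="other"><p>Wrong parent.</p></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# "div#cl-curriculum > p" selects <p> tags that are direct children
# of the <div> whose id is "cl-curriculum".
strings = [p.get_text()
           for p in soup.select("div#cl-curriculum > p")
           if p.get_text().strip()]
print(strings)
```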
