# Text processing

<img src="./outline.png" width="400" align="left">

In [9]:
import nltk
nltk.data.path.append("/data/3/zwang/nltk_data") # add your local NLTK data directory to NLTK's search path

# plotting setup: use matplotlib's default style
from matplotlib import pyplot as plt
plt.style.use('default')

# Currency exchange example

<img src="./currency_exchange.png" width="1000" align="center">
<br><br><br><br>

# Text processing with unicode
- Computers just deal with numbers
    - letters and characters are assigned to a number
    
- Texts in different languages have different **character encoding** systems:
    - English, ASCII
    - Europe, Latin (e.g., "ø", "ő", "ñ", "ň")
    - India, Hindi, ISCII (Indian Script Code for Information Interchange)
    - China, GB2312
    - Korea, Windows-949
    
- Earlier character encodings were limited:
    - they did not cover characters for all the world’s languages
    - no single encoding covered all the letters, punctuation, and technical symbols in common use.  

- Encodings conflicted with one another:  
    - two encodings could use the same number for two different characters
        - (english, 0100, a)
        - (european, 0100, x)
    - use different numbers for the same character
        - (english, 0100, a)
        - (european, 0101, a)

- Supporting logographic writing systems was a challenge:
    - Japanese (e.g., こんばんは)
    - Chinese (e.g., 自然语言处理)

- Passing data between computers that have different character encoding systems:
    - corruption
    - conflict
    - errors

- Any ideas for dealing with this issue?
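
The conflicts listed above can be reproduced with Python's built-in codecs. This is only an illustrative sketch: the byte value 0xF1 and the letter 'ą' are arbitrary picks, and `latin1`, `latin2`, and `cp1250` are Python's codec names for ISO-8859-1, ISO-8859-2, and Windows-1250.

```python
# One byte value, two different characters depending on the encoding
raw = b'\xf1'
as_latin1 = raw.decode('latin1')   # ISO-8859-1 (Western European)
as_latin2 = raw.decode('latin2')   # ISO-8859-2 (Central European)
print(as_latin1, as_latin2)

# One character, two different byte values depending on the encoding
in_latin2 = 'ą'.encode('latin2')
in_cp1250 = 'ą'.encode('cp1250')   # Windows-1250
print(in_latin2, in_cp1250)
```

A file written on one system and read on the other would therefore show the wrong letter without raising any error.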


## Unicode characters 
- Unicode Standard:
    - the universal character encoding standard used for representation of text for computer processing

- A global standard to support all the world’s languages
    - An encoding large enough to support the writing systems of all the world’s languages
    - Provides a standardized system of character codes
    - https://unicode.org/standard/principles.html
    - https://www.unicode.org/charts/
    
    
- A unique code for:
    - every character, in every language, in every program, on every platform
    - enables computers to support virtually every language
    - codes for more than 135,000 characters: 
        - the world's alphabets
        - writing systems
        - symbols
        
<br><br><br><br>

## ASCII 
- American Standard Code for Information Interchange
    - first edition published in 1963
    

- ASCII character table:
    - ASCII table: http://www.asciitable.com/
    - encodes 128 characters into **seven-bit** integers
    - 95 printable characters:
        - digits 0 to 9, 
        - lowercase letters a to z, 
        - uppercase letters A to Z, 
        - and punctuation symbols. 
    - 33 non-printing control codes
    - E.g., lowercase i: 
        - binary 1101001 = octal 151 = hexadecimal 69 = decimal 105
    
- Use case: CV in ASCII format:
    - 'plain' text with no formatting such as tabs, bold or underlining 
    - Notepad++ (ANSI)
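
The numbers quoted for ASCII can be checked with Python built-ins. The sketch below assumes that `str.isprintable()` is an acceptable test for "printable"; it counts the space character as printable.

```python
code = ord('i')                     # code number of lowercase i
print(code, bin(code), oct(code), hex(code))

# Split the 128 ASCII codes into printable characters and control codes
printable = [chr(n) for n in range(128) if chr(n).isprintable()]
control = [n for n in range(128) if not chr(n).isprintable()]
print(len(printable), len(control))
```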

<img src="./ASCII.png" width="600" align="center">

## Unicode in Python
- **code point**: the number assigned to each character
    - written \uXXXX (4 hexadecimal digits) or \UXXXXXXXX (8 hexadecimal digits) in Python string literals
    - encodings with a single byte per code point (e.g., ASCII and Latin-2) support only a small subset of Unicode, enough for a single language
    - encodings with multiple bytes per code point (e.g., UTF-8) can represent the full range of Unicode characters
    - In Python 3, source code is encoded using UTF-8 by default
    

- **decoding**: 
    - file/terminal --> program
    - bytes --> Unicode
- **encoding**: 
    - program --> file/terminal
    - Unicode --> bytes
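
A minimal round trip makes the two directions concrete. The word 'Państwowa' is taken from the Polish sample used later in these notes; the variable names are only illustrative.

```python
text = 'Państwowa'                       # a str holds Unicode code points

latin2_bytes = text.encode('latin2')     # encoding: Unicode --> bytes
utf8_bytes = text.encode('utf8')
print(latin2_bytes)
print(utf8_bytes)

decoded = latin2_bytes.decode('latin2')  # decoding: bytes --> Unicode
print(decoded == text)
```

The same string produces different byte sequences under different encodings, so the bytes can only be decoded correctly if the encoding is known.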


<img src="./encode_decode.png" width="500" align="center">

## Extracting encoded text from files
- Suppose we have a small text file and we know how it is encoded 
- For example, polish-lat2.txt, Polish text encoded as Latin-2, also known as ISO-8859-2. 
    - Polish Wikipedia: http://pl.wikipedia.org/wiki/Biblioteka_Pruska

In [11]:
# locate the file path
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt') 
path
# this is the location in my computer and you might have a different path in your computer

FileSystemPathPointer('/data/3/zwang/nltk_data/corpora/unicode_samples/polish-lat2.txt')

In [12]:
# open() decodes the file's bytes into Unicode strings when reading, and encodes Unicode strings when writing
# the "encoding" parameter specifies the encoding of the file being read or written
f = open(path, encoding='latin2') # what happens with a wrong encoding, e.g., GB2312?
for line in f:
    line = line.strip() # remove leading and trailing whitespace (e.g., space, \n)
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
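
The cell above asks what happens with a wrong encoding. The sketch below explores that without touching the corpus file: it encodes a fragment of the Polish text as Latin-2 (a stand-in for the file's bytes) and then decodes it with two wrong encodings. The choice of `latin1` and `utf8` as the wrong encodings is an assumption made for illustration.

```python
text = 'Niemców pod koniec II wojny światowej'
data = text.encode('latin2')            # stand-in for the bytes stored in the file

# A wrong single-byte encoding decodes without error but yields wrong letters
garbled = data.decode('latin1')
print(garbled)

# A wrong multi-byte encoding may fail outright
try:
    data.decode('utf8')
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print('decode failed:', err.reason)

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising
lenient = data.decode('utf8', errors='replace')
print(lenient)
```

Silent corruption (the first case) is usually harder to notice than an exception, which is one reason to pass `encoding` explicitly.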


- Check the underlying numerical values (or "codepoints") of the characters:
    - convert all non-ASCII characters into their two-digit \xXX and four-digit \uXXXX representations

In [13]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))
    
# \u0144: a four-digit Unicode escape (prefix \u); displayed on screen as ń
# \xf3: a two-digit escape for a code point in the 128-255 range; displayed as ó
# b'...': a bytes literal

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


- Unicode characters up to U+FFFF can be written with the \uXXXX escape sequence 
    - each X is a hexadecimal digit

In [52]:
# ord() returns the Unicode code point of a character
ord('ń')

324

In [53]:
u_char = '\u0144' # hex 0144 = 1*16*16 + 4*16 + 4 = 324
print(u_char)

ń


In [37]:
u_char = '\u3eac' # unicode
print(u_char)

㺬


In [90]:
u_char = '\u00f3'
u_char

'ó'

In [38]:
# check how the character '\u3eac' (㺬) is represented as a sequence of bytes inside a text file
print('\u3eac'.encode('utf8')) 
# b'...': a bytes literal

b'\xe3\xba\xac'


- Properties of Unicode characters are available through the unicodedata module

In [14]:
import unicodedata
lines = open(path, encoding='latin2').readlines()

# select the third line of the Polish text
line = lines[2] 

print(line)
print(line.encode('unicode_escape')) # escaped representation of non-ASCII characters

Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [4]:
print('ó'.encode('unicode_escape')) # try 'ń' as well: it needs the four-digit \u form

b'\\xf3'


- check characters outside the ASCII range

In [6]:
'ó'.encode('utf8')

b'\xc3\xb3'



In [21]:
for c in line: # iterate over character
    if ord(c) > 127: # text outside the ASCII range
        print('{} {} U+{:04x} {}'.format(c, c.encode('utf8'), ord(c), unicodedata.name(c))) 
        
# the character, utf8 byte sequence, unicode code point, unicode name (description for the character)

ó b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
Ś b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
ł b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE
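
Beyond names, `unicodedata` exposes other properties and an inverse lookup. A small sketch using the letter 'ń' from the examples above:

```python
import unicodedata

ch = 'ń'
name = unicodedata.name(ch)          # official character name
category = unicodedata.category(ch)  # general category; 'Ll' means lowercase letter
print(name, category)

# lookup() is the inverse of name(): from the name back to the character
same = unicodedata.lookup('LATIN SMALL LETTER N WITH ACUTE')
print(same == ch)
```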


## Difference between Unicode and other character encoding standards (e.g., ASCII) in terms of memory 
- How much memory does Unicode take?
- Unicode user guide: http://userguide.icu-project.org/unicode
- blog: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
- blog: https://rushter.com/blog/python-strings-and-memory/#:~:text=Unicode%20strings%20can%20take%20up,char%20(Latin%2D1%20encoding)

### Code point
- 1 byte = 8 bits
- ASCII: 7-bit code points, representing $2^7 = 128$ characters
- Unicode: 21-bit code points, because $2^{20} = 1,048,576$ is too small for the range and $2^{21} = 2,097,152$ is enough
    - range: 0 (hex) to 10FFFF (hex)
    - 17 planes of 65,536 code points each: 17*65536 = 1,114,112 code points
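
The arithmetic above can be checked directly in Python. This sketch uses only built-ins; the emoji code point U+1F600 is an arbitrary example of a character outside the first plane.

```python
import sys

max_cp = 0x10FFFF
bits_needed = max_cp.bit_length()       # bits needed for the largest code point
total_code_points = 17 * 65536          # 17 planes of 65,536 code points each
print(bits_needed, total_code_points, sys.maxunicode == max_cp)

# \uXXXX reaches only the first plane; \UXXXXXXXX (8 hex digits) reaches the rest
emoji = '\U0001F600'
print(hex(ord(emoji)))

# Anything above U+10FFFF is rejected
try:
    chr(max_cp + 1)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)
```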

## Memory usage 
- How much memory text uses depends on how it is encoded in memory 
    - ASCII: one byte (8 bits) per character in RAM
    - Unicode: varies considerably
        - Python 3: the str type stores Unicode code points
        - In CPython, a string takes 1, 2, or 4 bytes per character depending on the widest character it contains, which can be expensive from a memory perspective.
        

In [None]:
4 * 8 # 4 bytes * 8 bits = 32 bits per character

In [99]:
ord('ń')

324

In [96]:
'ń'.encode('utf8') 

b'\xc5\x84'

In [98]:
'ń'.encode('unicode_escape') # hex 0144 = 4 + 4*16 + 1*16*16 = 324

b'\\u0144'

In [101]:
'㺬'.encode('unicode_escape')

b'\\u3eac'
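
To compare sizes, the sketch below measures the same characters under three encodings and also estimates the in-memory cost per character. The in-memory figure relies on `sys.getsizeof` and reflects CPython's internal string layout, so it is an implementation detail rather than a guarantee; the sample characters are arbitrary.

```python
import sys

samples = ['a', 'ó', 'ń', '㺬', '\U0001F600']
sizes = {}

for ch in samples:
    utf8_len = len(ch.encode('utf8'))
    utf16_len = len(ch.encode('utf-16-le'))   # -le variant: no byte order mark added
    utf32_len = len(ch.encode('utf-32-le'))
    # Growth of the in-memory str per extra character (CPython-specific)
    per_char = sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)
    sizes[ch] = (utf8_len, utf16_len, utf32_len, per_char)
    print('U+{:04X} utf8={} utf16={} utf32={} in-memory={}'.format(
        ord(ch), utf8_len, utf16_len, utf32_len, per_char))
```

UTF-8 is compact for ASCII-heavy text, while UTF-32 spends four bytes on every character regardless of its code point.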

# Accessing text from different sources

## Online resources
- Kaggle:
    - https://www.kaggle.com/datasets
- Github:
    - https://github.com/niderhoff/nlp-datasets
    - topics: nlp-dataset
- Yelp:
    - https://www.yelp.com/dataset
- Airbnb
    - http://insideairbnb.com/
- Others?

<br><br><br><br><br>

## Processing web text 

<img src="./processing_pipeline.png" width="600" align="center">

### Online electronic books
- Project Gutenberg
- free online ebooks: http://www.gutenberg.org/catalog/
- over 50 languages, 90% in English

In [65]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"

# open the URL, read the raw bytes, and decode them into a string
raw_text = request.urlopen(url).read().decode('utf8')

print(raw_text[:75])

type(raw_text), len(raw_text)

﻿The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky


(str, 1176967)

In [75]:
raw_text[:75] # what is the difference between using and not using print?
# unicodedata.name('\ufeff') # non-printable 

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'
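
The leading `\ufeff` in the output above is a byte order mark (BOM). `print` shows the string itself, so the invisible character goes unnoticed, while the bare expression shows the `repr`, which escapes it. A sketch of one way to drop it, using a short made-up byte string instead of the downloaded book:

```python
import unicodedata

data = b'\xef\xbb\xbfThe Project Gutenberg EBook'   # sample bytes starting with a UTF-8 BOM

with_bom = data.decode('utf8')          # BOM kept as the character U+FEFF
without_bom = data.decode('utf-8-sig')  # BOM stripped if present

print(repr(with_bom[:4]))
print(repr(without_bom[:3]))
bom_name = unicodedata.name(with_bom[0])
print(bom_name)
```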

**tokenization**: break up strings into words and punctuation

In [91]:
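from nltk import word_tokenize
# note: the output below reflects raw_text after it was reassigned to the BBC page text in the HTML section
tokens = word_tokenize(raw_text[:100])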

print(tokens[:10])

type(tokens), len(tokens)

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in']


(list, 14)
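
To see what tokenization involves, here is a deliberately simplified regex tokenizer. It is not NLTK's algorithm: `simple_tokenize` is a hypothetical helper written for this sketch, and it splits every punctuation mark off, whereas `word_tokenize` keeps `'to` together in the output above.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "BBC NEWS | Health | Blondes 'to die out in 200 years'"
tokens = simple_tokenize(sample)
print(tokens)
print(len(tokens))
```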

### HTML
(1) use a web browser to save a page as a local text file, then access the text file with normal text processing methods

(2) get Python to do the work directly

In [81]:
# BBC News story: Blondes to die out in 200 years
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:200] 
# print(html)

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta '

- **BeautifulSoup**: a Python library to get text out of HTML
    - http://www.crummy.com/software/BeautifulSoup/

In [82]:
from bs4 import BeautifulSoup
raw_text = BeautifulSoup(html[:200], 'html.parser').get_text()
tokens = word_tokenize(raw_text)

print(raw_text[:100])
print(tokens[:10])




BBC NEWS | Health | Blondes 'to die out in 200 years'
<meta 
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in']


In [84]:
# wrap the tokens in nltk.Text; the slice keeps only the wanted tokens
text = nltk.Text(tokens[:10])
text

<Text: BBC NEWS | Health | Blondes 'to die...>
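
When BeautifulSoup is not installed, the standard library can do a rough version of the same job. The sketch below is a stand-in, not BeautifulSoup's method: `TextExtractor` and `html_to_text` are hypothetical names, the sample HTML is made up, and skipping `<script>` and `<style>` contents is a design choice of this sketch.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace
    return ' '.join(' '.join(parser.parts).split())

sample_html = ("<html><head><title>BBC NEWS | Health</title>"
               "<style>p {color: red}</style></head>"
               "<body><p>Blondes 'to die out'</p></body></html>")
page_text = html_to_text(sample_html)
print(page_text)
```

The resulting string can be passed to a tokenizer in the same way as `raw_text` above.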

## Capturing user input
- Python built-in functions
    - https://docs.python.org/3/library/functions.html

In [86]:
s = input("Enter some text: ")

Enter some text: Natural Language Processing


In [87]:
print(s)

Natural Language Processing


<img src="./string_operations.png" width="600" align="center">