# <center>Supervised Learning - Text Classification</center>
References:
* http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## 1. Finally, we come to machine learning ...
What can the players do when their every move is studied and predicted?

<img src="machine_learning_cartoon.png" width="60%">
https://www.kdnuggets.com/2018/06/cartoon-fifa-world-cup-football-machine-learning.html

* Review basic concepts of machine learning
  * Cross validation
  * Performance metrics: recall and precision
* Text Classification  
  * Assign a document to one or more pre-defined categories (or labels)
    * Input: 
      - a document $d$ 
      - a fixed set of classes C = {$c_1$, $c_2$,..., $c_J$}
      - A training set of $m$ hand-labeled documents ($d_1,c_1$),....,($d_m,c_m$)
    * Output: a classifier that assigns $d$ to a subset of classes $c \subseteq C$
  * **Single-label** classification: e.g. spam detection, sentiment detection
  * **Multi-label** classification: e.g. news categorization

## 2. Review basic concepts of machine learning
### 2.1. Model assessment and selection - How valid is a model? 
- Generalization: the prediction capability of a model ($f$) on independent test data, 
  - Given testing samples ($X, Y$), and prediction ($X, f(X)$)
  - Testing error: $L(Y, f(X))$, e.g.
     - squared error
     - absolute error
  
- Data-rich situation: split data into training, validation, and test sets (e.g. 50%, 25%, 25%)
  - training set: fit the model
  - validation set: estimate prediction error for model selection
  - test set: assess the prediction error of the final chosen model
  <img src="train_validation_test.png" width="40%">
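
The three-way split described above can be sketched with two calls to `train_test_split`. This is only an illustration on made-up toy data (the variables `X`, `y` and the 100-sample size are assumptions, not part of the course dataset):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, one feature, balanced binary labels
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# Step 1: hold out 25% of all samples as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: take another 25% of the original data (as an absolute count)
# from the remainder as the validation set; the rest (50%) is for training
n_val = len(X) // 4
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The model would be fitted on `X_train`, compared against alternatives on `X_val`, and evaluated once on `X_test` after the final choice is made.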


### 2.2. Cross Validation
- However, labeled data is often scarce, so we cannot always afford to set aside a separate validation set
- $K$-fold cross validation: 
    - Data is split into $K$ subsets (folds). Each time, one of the subsets is held out as the test set (a.k.a. holdout) and the remaining $K-1$ subsets are used as the training set. 
    <img src="cross_validation.png" width="40%"> [source](http://spark-public.s3.amazonaws.com/nlp/slides/sentiment.pptx)
    - This procedure repeats $K$ times, each time with a different subset as the test set. 
    - Calculate the average prediction error ($CV$) over the $K$ test sets $$ CV(\alpha) = \frac{1}{N} \sum_{i=1}^{N}{L(y_i, f^{k(i)}(x_i, \alpha))}$$ where $\alpha$: the model parameters (e.g. the number of neighbours in $k$-NN), $k(i)$: the fold that contains sample $i$, $f^{k(i)}$: the model fitted with fold $k(i)$ held out, $N$: number of samples
    - Tune model parameters ($\alpha$) to minimize the average prediction error
    - Select the model with the minimal prediction error (along with $\alpha$ determined)
    - Fit the selected model to all the data
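
The procedure above can be sketched as a manual loop. This is a minimal illustration, assuming synthetic data from `make_classification`, a $k$-NN classifier whose number of neighbours plays the role of $\alpha$, and 0-1 loss as $L$; the helper `cv_error` is a hypothetical name, not a scikit-learn function:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

def cv_error(alpha, X, y, k=5):
    """Average 0-1 loss over all N samples; each sample is predicted by
    a model fitted without the fold that contains it."""
    kfold = KFold(n_splits=k, shuffle=True, random_state=0)
    total_loss = 0
    for train_idx, test_idx in kfold.split(X):
        model = KNeighborsClassifier(n_neighbors=alpha)
        model.fit(X[train_idx], y[train_idx])
        # count misclassified samples in the held-out fold
        total_loss += np.sum(model.predict(X[test_idx]) != y[test_idx])
    return total_loss / len(y)

# Tune alpha: pick the value with the smallest cross-validation error
errors = {alpha: cv_error(alpha, X, y) for alpha in [1, 3, 5, 7]}
best_alpha = min(errors, key=errors.get)

# Fit the selected model to all the data
final_model = KNeighborsClassifier(n_neighbors=best_alpha).fit(X, y)
print(errors, best_alpha)
```

In practice `cross_validate` or `GridSearchCV` wrap this loop; writing it out only makes the correspondence with the $CV(\alpha)$ formula visible.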
  

### 2.3. Performance metrics
  * Precision: percentage of true cases among the predicted true cases
  * Recall: percentage of true cases that have been retrieved, out of the total number of true cases
  * F-score: $$\frac{2*precision*recall}{precision+recall}$$
  * Example: 
Confusion Matrix: <img src="confusion_matrix.png">
    * For "YES" group: 
      - precision=?, 
      - recall=?, 
      - f-score=?
      <img src="precision_recall.png" width="60%">
    * For "NO" group:
      - precision=?, 
      - recall=?, 
      - f-score=?
  * Overall model performance
    * precision_macro (or recall_macro or f1_macro) is calculated as:
      1. calculate precision for each label
      2. average over labels 
    * precision_micro (or recall_micro or f1_micro): calculates metrics globally regardless of labels
    * With imbalanced classes, the difference between these two metrics may be significant
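
As a worked illustration of these definitions, the sketch below computes per-label precision, recall and F-score, then the macro and micro averages of precision, in plain Python. The labels in `y_true` and `y_pred` are made up and do not correspond to the confusion matrix in the figure; `per_label_counts` and `f_score` are hypothetical helper names:

```python
# Hypothetical ground truth and predictions (4 "YES", 6 "NO")
y_true = ["YES", "YES", "YES", "YES", "NO", "NO", "NO", "NO", "NO", "NO"]
y_pred = ["YES", "YES", "YES", "NO", "YES", "NO", "NO", "NO", "NO", "NO"]

def per_label_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for a label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

labels = ["YES", "NO"]
counts = {label: per_label_counts(y_true, y_pred, label) for label in labels}

precisions = []
for label, (tp, fp, fn) in counts.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    precisions.append(precision)
    print(label, round(precision, 3), round(recall, 3),
          round(f_score(precision, recall), 3))

# macro: average the per-label precisions
precision_macro = sum(precisions) / len(precisions)

# micro: pool the counts over all labels, then compute precision once
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fp = sum(fp for tp, fp, fn in counts.values())
precision_micro = total_tp / (total_tp + total_fp)

print(round(precision_macro, 3), round(precision_micro, 3))
```

The sketch assumes every label is predicted at least once; otherwise the divisions would need a guard against zero denominators.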

## 3. Text Classification

* Basic process
  1. Load and preprocess sample data
  2. Extract features: e.g. bag of words with TF-IDF weights
  3. Split feature space into training and test sets following cross validation method
  4. Train a classifier/model with the training dataset using selected classification algorithm for each fold
  5. Calculate performance
 
* Considerations for deciding text classification algorithms
  - should be effective in high dimensional spaces (**curse of dimensionality**)
  - should be effective even if **the number of features is greater than the number of samples**
    * Is regression a good algorithm if you have a small number of text samples?
  - some good algorithms to start with:
      - Naive Bayes (https://web.stanford.edu/class/cs124/lec/naivebayes.pdf): baseline for performance benchmarking of text classification algorithms
      - Support Vector Machine (SVM). References:
        - https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/
        - http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
  

In [14]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd

In [31]:
# Exercise 3.1.: Load data 
# Load datasets (http://qwone.com/~jason/20Newsgroups/)
# For convenience, a subset of the data has been saved into "twenty_news_data.csv"

import pandas as pd
data=pd.read_csv("twenty_news_data.csv",header=0)
data.head()

type(data)

# print out the full text of the first sample
print(data["text"][0])

Unnamed: 0,text,label
0,From: sd345@city.ac.uk (Michael Collier)\nSubj...,comp.graphics
1,From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\...,comp.graphics
2,From: djohnson@cs.ucsd.edu (Darin Johnson)\nSu...,soc.religion.christian
3,From: s0612596@let.rug.nl (M.M. Zwart)\nSubjec...,soc.religion.christian
4,From: stanly@grok11.columbiasc.ncr.com (stanly...,soc.religion.christian


pandas.core.frame.DataFrame

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



## 3.1. TF-IDF matrix generation
- Function: **sklearn.feature_extraction.text.TfidfVectorizer**(input='content',encoding='utf-8', decode_error='strict', token_pattern='(?u)\b\w\w+\b', lowercase=True, stop_words=None, ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, norm='l2', use_idf=True, smooth_idf=True, ...)
- Some useful parameters:
    * **input** : string {'filename', 'file', 'content'}.
    * **encoding** : encoding scheme, 'utf-8' by default. If bytes or files are given to analyze, this encoding scheme is used to decode.
    * **decode_error** : {'strict', 'ignore', 'replace'}: Instruction on what to do if a byte sequence given to analyze contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.
    * **token_pattern** : Regular expression denoting what constitutes a "token". The default is '(?u)\b\w\w+\b', i.e. a token contains at least two word characters in unicode (note: ?u: unicode flag, \b: word boundary, i.e. the zero-width position between a word character and a non-word character, \w: word character). 
    * **ngram_range** : tuple (min_n, max_n): The lower and upper boundary of the range of n-values for different n-grams to be extracted. 
    * **stop_words** : string {'english'}, list, or None (default)
    * **lowercase** : boolean, default True: Convert all characters to lowercase before tokenizing.
    * **max_df/min_df** : float in range [0.0, 1.0] or int (defaults: max_df=1.0, min_df=1): When building the vocabulary, ignore terms that have a document frequency strictly higher than max_df (corpus-specific stop words) or strictly lower than min_df. If float, the parameter represents a proportion of documents; if integer, an absolute count. 
    * **max_features** : int or None, default=None. If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
    * **norm** : 'l1', 'l2' or None, optional. Norm used to normalize term vectors. None for no normalization.
    * **use_idf** : boolean, default=True. Enable inverse-document-frequency reweighting.
    * **smooth_idf** : boolean, default=True. Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- For all the parameters, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
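
To see how a few of these parameters change the vocabulary, here is a small sketch on a made-up three-sentence corpus (the corpus and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "a cat and a dog"]

# Default settings: the token pattern requires at least two word
# characters, so the single-letter token "a" is dropped
default_vect = TfidfVectorizer()
dtm = default_vect.fit_transform(corpus)
print(sorted(default_vect.vocabulary_))
print(dtm.shape)

# min_df=2 keeps only terms that appear in at least two documents
min_df_vect = TfidfVectorizer(min_df=2).fit(corpus)
print(sorted(min_df_vect.vocabulary_))

# ngram_range=(1, 2) adds two-word sequences (bigrams) to the unigrams
bigram_vect = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)
print("the cat" in bigram_vect.vocabulary_)
```

With the default `norm='l2'`, each row of `dtm` has unit Euclidean length, which makes documents of different lengths comparable.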


In [16]:
# Exercise 3.2 Create TF-IDF Matrix

from sklearn.feature_extraction.text import TfidfVectorizer

# initialize the TfidfVectorizer 

tfidf_vect = TfidfVectorizer() 

# with stop words removed
# tfidf_vect = TfidfVectorizer(stop_words="english") 

# generate tfidf matrix
dtm= tfidf_vect.fit_transform(data["text"])

print("type of dtm:", type(dtm))
print("size of tfidf matrix:", dtm.shape)

type of dtm: <class 'scipy.sparse.csr.csr_matrix'>
size of tfidf matrix: (2257, 35788)


In [17]:
# Exercise 3.3. Examine TF-IDF

# 1. Check vocabulary

# Vocabulary is a dictionary mapping a word to an index

# the number of words in the vocabulary
print("total number of words:", len(tfidf_vect.vocabulary_))

print("type of vocabulary:", \
      type(tfidf_vect.vocabulary_))
print("index of word 'city' in vocabulary:", \
      tfidf_vect.vocabulary_['city'])


total number of words: 35788
type of vocabulary: <class 'dict'>
index of word 'city' in vocabulary: 8696


In [18]:
# Exercise 3.4. Check words with top tf-idf weights in a document, 
# e.g. 1st document

# get mapping from word index to word
# i.e. reversal mapping of tfidf_vect.vocabulary_
voc_lookup={tfidf_vect.vocabulary_[word]:word \
            for word in tfidf_vect.vocabulary_}

print("\nOriginal text: \n"+data["text"][0])

print("\ntfidf weights: \n")

# first, convert the sparse matrix row to a dense array
doc0=dtm[0].toarray()[0]
print(doc0.shape)

# get index of top 20 words
top_words=(doc0.argsort())[::-1][0:20]
[(voc_lookup[i], doc0[i]) for i in top_words]




Original text: 
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.


tfidf weights: 

(35788,)


[('collier', 0.3841803935867984),
 ('city', 0.314400065528974),
 ('071', 0.25612026239119895),
 ('laserjet', 0.24645540709354397),
 ('477', 0.24645540709354397),
 ('converting', 0.21567205914741705),
 ('michael', 0.1962279892331408),
 ('iii', 0.18626015109199115),
 ('hp', 0.17358472047671197),
 ('files', 0.13635772403701527),
 ('sd345', 0.1348710554299733),
 ('8565', 0.1348710554299733),
 ('ec1v', 0.1348710554299733),
 ('x3769', 0.1348710554299733),
 ('0hb', 0.1348710554299733),
 ('tif', 0.12806013119559947),
 ('email', 0.125601499991304),
 ('ac', 0.12491817585060791),
 ('hpgl', 0.12322770354677198),
 ('img', 0.12322770354677198)]

In [33]:
# Exercise 3.5. classification using a single fold

# use MultinomialNB algorithm
from sklearn.naive_bayes import MultinomialNB

# import method for split train/test data set
from sklearn.model_selection import train_test_split

# import method to calculate metrics
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report

# split dataset into train (70%) and test sets (30%)
X_train, X_test, y_train, y_test = train_test_split(\
                dtm, data["label"], test_size=0.3, random_state=0)

# print(dtm)
# print("X train",X_train)
# print("X test",X_test)
# print("Y train",y_train)
# print("Y test",y_test)

print(type(y_train))




# train a multinomial naive Bayes model using the training data
clf = MultinomialNB().fit(X_train, y_train)

# predict the news group for the test dataset
predicted=clf.predict(X_test)

# get the list of unique labels
labels=sorted(data["label"].unique())

# calculate performance metrics. 
# Support is the number of occurrences of each label

precision, recall, fscore, support=\
     precision_recall_fscore_support(\
     y_test, predicted, labels=labels)

print("labels: ", labels)
print("precision: ", precision)
print("recall: ", recall)
print("f-score: ", fscore)
print("support: ", support)

# another way to get all performance metrics
print(classification_report\
      (y_test, predicted, target_names=labels))

<class 'pandas.core.series.Series'>
labels:  ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
precision:  [1.         0.9702381  0.96969697 0.78059072]
recall:  [0.73972603 0.94767442 0.93023256 0.98404255]
f-score:  [0.8503937  0.95882353 0.9495549  0.87058824]
support:  [146 172 172 188]
                        precision    recall  f1-score   support

           alt.atheism       1.00      0.74      0.85       146
         comp.graphics       0.97      0.95      0.96       172
               sci.med       0.97      0.93      0.95       172
soc.religion.christian       0.78      0.98      0.87       188

           avg / total       0.92      0.91      0.91       678



In [20]:
# Exercise 3.6.  predict new documents

docs_new = ['God is love', 'OpenGL on the GPU is fast']

# generate tfidf for new documents
X_new_tfidf = tfidf_vect.transform(docs_new)

print(X_new_tfidf.shape)

# predict probability that each document belongs to a class
predicted_p = clf.predict_proba(X_new_tfidf)

# predict classes for new documents
predicted = clf.predict(X_new_tfidf)

for idx, doc in enumerate(docs_new):
    print('\n', doc)
    for j, label in enumerate(labels):
        print('%s: %.3f' % (label, predicted_p[idx][j]))
    print('%r => %s' % (doc, predicted[idx]))
    


(2, 35788)

 God is love
alt.atheism: 0.171
comp.graphics: 0.044
sci.med: 0.053
soc.religion.christian: 0.732
'God is love' => soc.religion.christian

 OpenGL on the GPU is fast
alt.atheism: 0.174
comp.graphics: 0.367
sci.med: 0.234
soc.religion.christian: 0.224
'OpenGL on the GPU is fast' => comp.graphics


In [None]:
# Exercise 3.7. Classification with stop words removed
# Can removing stop words improve performance?
# In Exercise 3.2, uncomment line 10 and comment line 7
# Run Exercise 3.2, 3.5

In [1]:
# Exercise 3.8. Run 5-fold cross validation
# to show the generalizability of the model

# import cross validation method
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

metrics = ['precision_macro', 'recall_macro', \
           "f1_macro"]

clf = MultinomialNB()
#clf = MultinomialNB(alpha=0.5)

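# return_train_score=True is required in recent scikit-learn versions
# to also obtain cv['train_xx_macro']
cv = cross_validate(clf, dtm, data["label"], \
                    scoring=metrics, cv=5, \
                    return_train_score=True)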
print("Test data set average precision:")
print(cv['test_precision_macro'])
print("\nTest data set average recall:")
print(cv['test_recall_macro'])
print("\nTest data set average fscore:")
print(cv['test_f1_macro'])

# To see the performance of training data set use 
# cv['train_xx_macro']
print("\ntraining data average f1:\n", cv['train_f1_macro'])

# The metrics are quite stable across folds.
# The performance gap between training and test sets is small
# This indicates the model has good generalizability


In [None]:
# Exercise 3.9. Multinomial NB 
# with different smoothing parameter alpha
# comment line 11 and uncomment 12 in Exercise 3.8
# use different alpha value to see if it affects performance

In [23]:
# Exercise 3.10. SVM model

from sklearn.model_selection import cross_validate
#from sklearn.metrics import precision_recall_fscore_support
from sklearn import svm

metrics = ['precision_macro', 'recall_macro', "f1_macro"]

# initialize a linear SVM model
clf = svm.LinearSVC()

cv = cross_validate(clf, dtm, data["label"], \
                    scoring=metrics, cv=2, \
                    return_train_score=True)

print(cv)

# print("Test data set average precision:")
# print(cv['test_precision_macro'])
# print("\nTest data set average recall:")
# print(cv['test_recall_macro'])
# print("\nTest data set average fscore:")
# print(cv['test_f1_macro'])


{'fit_time': array([0.10393167, 0.09794426]), 'score_time': array([0.01998925, 0.01998758]), 'test_precision_macro': array([0.96284554, 0.95543853]), 'train_precision_macro': array([0.99916667, 1.        ]), 'test_recall_macro': array([0.96006408, 0.95092921]), 'train_recall_macro': array([0.99895833, 1.        ]), 'test_f1_macro': array([0.96105757, 0.95207155]), 'train_f1_macro': array([0.99906072, 1.        ])}


## 3.3. Parameter tuning using grid search
* Each classification model has a few parameters
  * e.g. "stop_words": "english" or None, min_df: [1,2,3, ...]
  * e.g. MultinomialNB(alpha=1.0)
  * e.g. LinearSVC(C=1.0, penalty='l2', loss='squared_hinge',...)
* Instead of tweaking the parameters of the various components, it is possible to run an exhaustive search of the best parameters on a grid of possible values
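
To get a feel for how quickly an exhaustive search grows, the sketch below enumerates a small grid with `ParameterGrid`. The step names (`tfidf`, `clf`) and the candidate values are assumptions chosen for illustration:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: keys follow the <step name>__<parameter name> convention
param_grid = {
    "tfidf__min_df": [1, 2, 5],
    "tfidf__stop_words": [None, "english"],
    "clf__alpha": [0.5, 1.0],
}

grid = ParameterGrid(param_grid)

# every combination of values is tried: 3 * 2 * 2 candidates
n_combinations = len(grid)

# with 5-fold cross validation, each candidate is fitted 5 times
n_folds = 5
n_fits = n_combinations * n_folds
print(n_combinations, n_fits)

# show the first few candidate settings
for params in list(grid)[:3]:
    print(params)
```

Because the number of fits is the product of all list lengths and the number of folds, adding one more value to any parameter can noticeably increase the search time.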

In [12]:
# Exercise 3.3.1 Grid search

# import pipeline class
from sklearn.pipeline import Pipeline

# import GridSearch
from sklearn.model_selection import GridSearchCV

# build a pipeline which does two steps all together:
# (1) generate tfidf, and (2) train classifier
# each step is named, i.e. "tfidf", "clf"

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB())
                   ])

# set the range of parameters to be tuned
# each parameter is defined as 
# <step name>__<parameter name in step>
# e.g. min_df is a parameter of TfidfVectorizer()
# "tfidf" is the name for TfidfVectorizer()
# therefore, 'tfidf__min_df' is the parameter in grid search

parameters = {'tfidf__min_df':[1, 2,5,10],
              'tfidf__stop_words':[None,"english"],
              'clf__alpha': [0.5,1.0,2.0],
}

# the metric used to select the best parameters
metric =  "f1_macro"

# GridSearch also uses cross validation
gs_clf = GridSearchCV\
(text_clf, param_grid=parameters, \
 scoring=metric, cv=5)

# due to the data volume and the large number of parameter combinations,
# it may take a long time to search for the optimal combination
# you can use a subset of data to test
gs_clf = gs_clf.fit(data["text"], data["label"])


In [11]:
# gs_clf.best_params_ returns a dictionary 
# with parameter and its best value as an entry

for param_name in gs_clf.best_params_:
    print(param_name,": ",gs_clf.best_params_[param_name])

print("best f1 score:", gs_clf.best_score_)

clf__alpha :  0.5
tfidf__min_df :  2
tfidf__stop_words :  english
best f1 score: 0.9684663644232789


In [None]:
# Exercise 3.3.2 Grid search
# Modify Exercise 3.3 and Exercise 3.8 
# to use the best parameters found
# re-create the Multinomial NB classifier

# also, check the dimension reduction of feature space by setting min_df to 2

## 4. Multi-label classification
- So far we have covered only single-label classification, i.e. assigning one class to each sample
- Multilabel classification emerges as a challenging problem, where classes are not mutually exclusive 
  * music categorization 
  * semantic classification of images
  * tagging
- **One-Vs-the-Rest** Strategy (a.k.a **one-vs-all**)
  * fitting one classifier per class. For each classifier, the class is fitted against all the other classes.
  * for $n$ classes (labels), $n$ classifiers are needed
  * Advantage: good interpretability - Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier
  * Disadvantage: 
     * many classifiers are created if there is a large number of classes
     * ignores the structure (or dependencies) of classes
- **Class indicator matrix** (a multi-label form of **one-hot encoding**): encode the label set of each sample as a binary row with one column per class, where 1 means the sample belongs to that class. 

| Document    | Money       | Investment | Crime & Justice |
| :-----------|:-----------:|:----------:|:--------------:|
| 1           | 0           |      0     | 1              |
| 2           | 1           |      1     | 0              |
| 3           | 1           |      0     | 0              |
| 4           | 0           |      1     | 1              |
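
The class indicator matrix in the table above can be reproduced with `MultiLabelBinarizer`. In this sketch the per-document label lists are read off the table rows, and the `classes` argument is passed only to keep the column order of the table (by default the classes would be sorted alphabetically):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Column order taken from the table
classes = ["Money", "Investment", "Crime & Justice"]

# Label sets of documents 1 to 4, read off the table rows
doc_labels = [
    ["Crime & Justice"],
    ["Money", "Investment"],
    ["Money"],
    ["Investment", "Crime & Justice"],
]

mlb = MultiLabelBinarizer(classes=classes)
Y = mlb.fit_transform(doc_labels)
print(Y)

# map the binary rows back to label tuples
print(mlb.inverse_transform(Y))
```

Each column of `Y` can then serve as the binary target of one classifier in the one-vs-rest strategy.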

- **dataset**: Yahoo News Ranked Multilabel Learning dataset (http://research.yahoo.com)
  - A subset is selected
  - 4 classes, 6426 samples
  
- **Discussion**: can you apply Naive Bayes for multi-label classification?

In [None]:
# Exercise 4.1 Multi-label classification- Load data

import json
# use a context manager so that the file is closed after loading
with open("../../dataset/ydata.json","r") as f:
    data=json.load(f)

docs,labels=zip(*data)

# show sample examples
docs[1]
labels[1]


In [None]:
# Exercise 4.2 One-hot encoding of classes

from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

mlb = MultiLabelBinarizer()
Y=mlb.fit_transform(labels)
# check size of indicator matrix
Y.shape
# check classes
mlb.classes_

# check # of samples in each class
np.sum(Y, axis=0)

In [None]:

# Exercise 4.3 Multi-label classification- one vs. rest classifier

import numpy as np
from sklearn.pipeline import Pipeline
# TfidfVectorizer is the vectorizer actually used in the pipeline below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report

# split dataset into train (70%) and test sets (30%)
X_train, X_test, Y_train, Y_test = train_test_split(\
                docs, Y, test_size=0.3, random_state=0)



classifier = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words="english",\
                              min_df=2)),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y_train)



In [None]:
# Exercise 4.4 Multi-label classification- Performance report

from sklearn.metrics import classification_report

predicted = classifier.predict(X_test)

predicted.shape
predicted[0:2]
Y_test[0:2]

print(classification_report\
      (Y_test, predicted, target_names=mlb.classes_))

## 5. Encoding and Decoding
References:
* https://www.agiliq.com/blog/2014/11/character-encoding-and-unicode/
* https://www.agiliq.com/blog/2014/12/understanding-python-unicode-str-unicodeencodeerro/

- Computers only work with binary (0 or 1). Any character needs to have **a binary representation** so that a computer can store it on disk or in memory. However, there are various ways in which characters can be converted to binary.

- **Unicode** provides standard code points for different characters. It can give code point for any character in any language.
  - e.g. 'a' <-> integer 97, hexadecimal 61 (denoted as '\u0061' or '\x61')
  - e.g. 'ä' <-> integer 228, hexadecimal E4
  
- Python 3 always stores **text strings as sequences of Unicode code points**.
- However, in Python 2, text strings are stored as **binary representations** (i.e. bytes)


### 5.1. Encoding
- **Encoding** means the process of converting a string to a binary representation. 
  - There are different **encoding schemes**  
      - **ascii**: encodes 128 specified characters into seven-bit integers 
        - e.g. a <-> 01100001 
      - **utf-8**: uses one to four 8-bit bytes to encode 1,112,064 code points
        - ä <-> 11000011 10100100, or '\xc3\xa4' (hexadecimal c3a4)    
      - **latin-1**: maps code points 0 to 255 directly to single byte values
        - ä <-> '\xe4' (hexadecimal E4)
  - Each encoding, which conforms to Unicode, has **a one-to-one mapping between the Unicode code points it covers and their binary representations**.
  
- Function **encode()**: convert a Unicode string to a binary representation according to an encoding scheme
- **UnicodeEncodeError**: raised when encoding a Unicode string that contains characters outside the scope of the encoding scheme
  - e.g. try to encode u'\u00E4' with ascii scheme
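
The differences between these schemes can be checked directly with `str.encode()`. A small sketch using only the standard library (the variable names are arbitrary):

```python
s = "\u00e4"  # the character 'ä', code point U+00E4

# code point as integer and hexadecimal
print(ord(s), hex(ord(s)))

# the same character has different binary representations per scheme
encoded = {scheme: s.encode(scheme) for scheme in ["utf-8", "latin-1"]}
for scheme, raw in encoded.items():
    print(scheme, raw, len(raw), "byte(s)")

# ascii only covers code points 0-127, so encoding 'ä' fails
try:
    s.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print("ascii failed:", err.reason)
```

The utf-8 result takes two bytes while latin-1 needs one, which is why text must be decoded with the same scheme that was used to encode it.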

In [None]:
# Exercise 5.1.1

s =  u'\xE4'  # a Unicode string. In Python 3 the prefix u is optional

# encode into binary
utf_s=s.encode("utf-8")
print(utf_s)

# During printing or writing files, 
# Python 3 converts the Unicode 'str' into bytes
# using the encoding of the output stream
# (sys.stdout.encoding for print)
print(s)

# to check the default encoding scheme used by encode()/decode()
import sys
sys.getdefaultencoding()

In [None]:
# Exercise 5.1.2 UnicodeEncodeError

s =  u'this is a strange \xE4 character'

# However, you cannot encode s using ascii, why?
utf_s=s.encode("ascii")
print(utf_s)

### 5.2. Decoding
- **decoding**: the process of converting an encoded binary representation back into Unicode code points.

- Function **decode()**: convert a binary string to a Unicode string according to an encoding scheme
- **UnicodeDecodeError**: raised when decoding a binary string that contains bytes outside the scope of the encoding scheme
  - e.g. try to decode b'\xc3\xa4' with ascii scheme


In [None]:
# Exercise 5.2.1. Decoding with UTF-8

# A binary string (i.e. byte) has a prefix "b"
s =  b'\xc3\xa4'
utf_s = s.decode('utf-8') # convert to Unicode using UTF-8. The result is u'\xE4'

print(utf_s)


In [None]:
# Exercise 5.2.2. UnicodeDecodeError

s =  b'\xc3\xa4'
utf_s = s.decode('ascii') # convert to Unicode using ascii
print(utf_s)



In [None]:
# Exercise 5.2.3. Encoding/decoding exception handling
# 'strict', 'ignore', and 'replace' 

s =  b'strange \xc3\xa4 text'
utf_s = s.decode('ascii', errors='ignore') # non-ascii bytes are dropped: 'strange  text'
print(utf_s)

utf_s = s.decode('ascii', errors='replace') # each non-ascii byte becomes U+FFFD: 'strange \ufffd\ufffd text'
print(utf_s)