# 💻 UnpackAI DL201 Bootcamp - Week 1 - Skills: NLP

## 📕 Learning Objectives

* Gain an appreciation for how NLP models accept data
* See past the complexity and extract text from common file formats
* Perform cursory EDA using a Pandas Series while building confidence in, and awareness of, its capabilities

## 📖 Concepts map
* Structured vs. Unstructured Data
* Qualitative vs Quantitative Data
* Flattening Data Structures
* Tokenization
* Transfer Learning

# Introduction

## Why is Natural Language Processing (NLP) Data Notorious?

The nature of text data makes it intimidating compared to other forms of data. Images and dataframes are easier to reason about because we are familiar with pictures and tables from everyday life.



## Text is Unstructured

Text data, by contrast, can take endless forms because it represents ideas. As a result, text is very free-form: it can be as short as an utterance or as long as a dictionary. It can be stored as scanned PDFs of typewritten archives, Word documents, e-books, front-end web pages, back-end API responses, or even plain .txt files.

## Text is Qualitative

Text doesn't carry a precise meaning the way numbers do, and it can be interpreted in different ways.

For example, a mountain is normally a noun, but it is an adjective in mountain lion, and a mountain lion is also called a puma or cougar. There is inherent ambiguity in language.

# Part 0 : Code preparation

In [4]:
!pip install transformers openpyxl python-docx -qq
!git clone https://github.com/unpackAI/DL201.git


# Imports 
from pathlib import Path


# Import libraries
import numpy as np
import pandas as pd
import torch
import requests
from transformers import BertTokenizer

# Kaggle config (active by default)
DATA_DIR = Path('/kaggle/working/DL201/data')
IMAGE_DIR = Path('/kaggle/working/DL201/img')


# Local config: comment out the Kaggle paths above and uncomment these on a local machine
#DATA_DIR = Path.home()/'Datasets'/'unpackAI'/'DL201'/'data'
#IMAGE_DIR = Path('../img')

# Part 1: How NLP Quantifies Text Data

### A basic NLP Overview

From Wikipedia:
- "Natural language processing (NLP) is a subfield of **linguistics, computer science, and artificial intelligence** concerned with the interactions between computers and **human language**, in particular how to program computers to **process and analyze** large amounts of natural language data. The goal is a computer capable of **"understanding"** the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

- Approaches to NLP tasks:
    - Rule-based
    - Traditional machine learning
    - Deep learning

In NLP, we often need to perform text preprocessing, such as removing stop words, stemming, lemmatization, and tokenization.
A nice overview is presented in: 
- https://stanfordnlp.github.io/CoreNLP/ 
- https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP
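
To make the preprocessing terms above more concrete, here is a minimal sketch using only the Python standard library. The stop-word list and the suffix rules are deliberately tiny, hypothetical stand-ins; real projects normally rely on a dedicated NLP library for these steps.

```python
import re

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes; real stemmers use many more rules."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The genie was rubbing the lamps and granted wishes to Aladdin."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]
print(stems)
```

The crude stems this produces (for example `rubb` and `wishe`) show why lemmatization, which maps words to dictionary forms, is often preferred when accuracy matters.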

Common NLP tasks:
- Classification
- Masked filling
- Text prediction
- Sentiment analysis
    - Positive
    - Negative
    - Subjectivity
- Entity recognition
    - Person
    - Location
    - Organization
- Entity extraction
- Keyword extraction
- Topic extraction

### Illustrative example

The code example below illustrates the use of Pandas for text manipulation, followed by a few exploratory steps that create tensors representing the text data.

Let's load a sample book conveniently available in txt format from the collection at textfiles.com (http://www.textfiles.com/stories/). The book in this case is Aladdin.


In [2]:
# Load a sample text, from the provided url
response = requests.get('http://www.textfiles.com/stories/alad10.txt')
sample_text = response.text

# Split the text into lines (each line is used here as a rough stand-in for a sentence)
sentences = sample_text.split('\n')

# Load the sentences into a dataframe
df = pd.DataFrame(sentences, columns=['sentence'])

As reiterated before, loading the data into Pandas gives us tremendous flexibility to perform data cleaning and preprocessing with ease.

In [None]:
# Inspect some of the sentences
df.sample(15)

Unnamed: 0,sentence
94,"bowl, twelve silver plates containing rich mea..."
136,desperate deed if I refused to go and ask your...
165,"the Princess. ""Fear nothing,"" Aladdin said to..."
33,the mountains. Aladdin was so tired that he b...
180,passed there. Her mother did not believe her i...
8,"Aladdin did not mend his ways. One day, when ..."
432,"people by her touch of their ailments, whereup..."
330,"way and ordered Aladdin to be unbound, and par..."
73,lamp and kill him afterwards.\r
420,more wicked and more cunning than himself. He...
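
The rows sampled above are lines of the file rather than complete sentences, because the text was split on newline characters. As a rough alternative, the sketch below first flattens the line breaks and then splits on sentence-ending punctuation. The short `sample_text` is a made-up stand-in for the downloaded book, and the regular expression is a simplification that would mis-split abbreviations such as "Mr.".

```python
import re

# A short stand-in for the downloaded book: hard line breaks fall mid-sentence.
sample_text = (
    "Aladdin did not mend his ways. One day, when he was\r\n"
    "playing in the streets, a stranger asked him his age.\r\n"
    "\r\n"
    "He went home and told his mother."
)

# Splitting on newlines yields lines, which may hold parts of several sentences.
lines = sample_text.split("\n")

# Collapse every run of whitespace (including \r\n) into a single space ...
flat_text = " ".join(sample_text.split())

# ... then split after '.', '!' or '?' when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", flat_text)

print(len(lines), "lines")
for sentence in sentences:
    print("-", sentence)
```

For the rest of this notebook the line-based split is kept, since BERT does not require grammatically complete sentences as input.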


In [None]:
# Keep only the sentences that have more than 3 words
# (.copy() avoids pandas warnings when the column is modified later)
df = df[df['sentence'].str.split().str.len() > 3].copy()

In [None]:
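# Remove punctuation from all sentences
# (regex=True is required: recent pandas versions treat the pattern as a literal string by default)
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)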

# Note: instead of regex a list of punctuation can be used, give it a try!
punctuation = [
    '.', ',', '!', '?', ':', ';', '"', "'", '-', '_', '(', ')', '[', ']', '{', '}', '#', '@', '$', '%', '^', '&', '*',
     '+', '=', '<', '>', '/', '\\', '|', '~', '`', '“', '”', '‘', '’'
]

df.sample(10)


Unnamed: 0,sentence
117,Aladdin at last prevailed upon her to go befor...
118,carry his request She fetched a napkin and la...
221,Besides this six slaves beautifully dressed to...
48,treasure Aladdin forgot his fears and grasped ...
334,amazed he could not say a word Where is your ...
209,and filled up the small house and garden Alad...
265,spokesman we cannot find jewels enough The Su...
457,deserve to be burnt to ashes but that this req...
27,at nightfall to his mother who was overjoyed t...
192,When the three months were over Aladdin sent h...
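
One possible way to try the punctuation-list idea suggested in the cell above is a translation table built with `str.maketrans`. This is only a sketch on a plain Python string; the `strip_punctuation` helper and the example sentence are illustrative and not part of the original notebook.

```python
# The punctuation list suggested in the cell above.
punctuation = [
    '.', ',', '!', '?', ':', ';', '"', "'", '-', '_', '(', ')', '[', ']', '{', '}',
    '#', '@', '$', '%', '^', '&', '*', '+', '=', '<', '>', '/', '\\', '|', '~', '`',
    '“', '”', '‘', '’',
]

# With three arguments, str.maketrans maps every character of the third
# argument to None, which means those characters get deleted.
remove_table = str.maketrans('', '', ''.join(punctuation))

def strip_punctuation(text):
    """Return text with every listed punctuation character removed."""
    return text.translate(remove_table)

example = 'the Princess. "Fear nothing," Aladdin said to her.'
print(strip_punctuation(example))
```

On a dataframe column, the same table can be passed to `Series.str.translate`. Note one difference from the regex version: `[^\w\s]` keeps underscores because `\w` matches them, while this list removes them.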


In [None]:
# Convert all sentences to lowercase
df['sentence'] = df['sentence'].str.lower()
df.sample(10)

Unnamed: 0,sentence
179,the bed had been carried into some strange hou...
116,her father his mother on hearing this burst o...
198,the princess that no man living would come up ...
261,returned aladdin i wished your majesty to hav...
303,hearing this said there is an old one on the c...
189,another such fearful night and wished to be se...
460,whom he murdered he it was who put that wish ...
229,saying i must build a palace fit for her and t...
396,cellar and the princess put the powder aladdin...
134,him of her sons violent love for the princess ...


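**Sentences, along with tokens, are a key unit of information in NLP.** To represent our data as a uniform "block" of text, every sentence must end up with the same length, so we first find the longest sentence; the shorter ones will later be padded with padding tokens. Note that `str.len()` counts characters rather than tokens, so for this text the value below is a generous upper bound on the number of tokens in any sentence.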

In [None]:
# Get the length (in characters) of the longest sentence
max_len = df['sentence'].str.len().max()
print(f'max sentence length is {max_len}')

max sentence length is 76


The transformers library provides a convenient way to load a variety of BERT models. Let's first load and explore a tokenizer.

In [None]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# Get the tokenizer vocabulary words
vocab = bert_tokenizer.vocab
vocab_size = len(vocab)
print(f'vocab size is: {vocab_size}')

vocab size is: 30522


In [None]:
# Get the vocabulary words as a list, load them into a dataframe
vocab_list = list(vocab.keys())
vocab_df = pd.DataFrame(vocab_list, columns=['tokens'])
vocab_df.sample(15)

Unnamed: 0,tokens
12204,encyclopedia
2853,sold
3550,##ized
23141,##firmed
13789,cheshire
2730,killed
12930,midst
15755,captains
21536,jing
1182,в


In [None]:
# Count the tokens that contain the string 'unused'
unused_tokens = vocab_df[vocab_df['tokens'].str.find('unused')>=0]
print(f'There are {len(unused_tokens)} tokens that contain "unused"')
unused_tokens.sample(10)

There are 995 tokens that contain "unused"


Unnamed: 0,tokens
667,[unused662]
320,[unused315]
781,[unused776]
459,[unused454]
759,[unused754]
910,[unused905]
554,[unused549]
261,[unused256]
56,[unused55]
722,[unused717]


In [None]:
# Get the tokens that have a size of 1 character
one_char_tokens = vocab_df[vocab_df['tokens'].str.len()==1]
print(f'There are {len(one_char_tokens)} tokens that have a size of 1 character')
one_char_tokens.sample(10)

There are 997 tokens that have a size of 1 character


Unnamed: 0,tokens
1492,ᴰ
1967,長
1203,ш
1754,井
1645,〜
1383,ச
1828,將
1053,q
1631,ⱼ
1901,法


In [None]:
# Get the tokens that are longer than 2 characters and do not contain the string 'unused'
word_like_tokens = vocab_df[(vocab_df['tokens'].str.len()>2) & (vocab_df['tokens'].str.find('unused')<0)]
print(f'There are {len(word_like_tokens)} tokens that likely represent English words')
word_like_tokens.sample(10)

There are 28042 tokens that likely represent English words


Unnamed: 0,tokens
20492,##mt
11114,gazed
16005,brewing
4433,draft
4099,enemy
9280,potentially
11127,gravel
26613,klan
18187,jagged
17587,peterborough


Each sentence is currently stored as a plain string, that is, a sequence of characters. We need to split it into a list of tokens, and then convert each token into a number by looking up its index in the tokenizer's vocabulary. Here is an example with a phrase:

In [None]:
# Example of tokenizing a sentence
sample_sentence = "This is a sample sentence, which we will tokenize using the BERT tokenizer."
print(f'The sample sentence is:\n{sample_sentence}')

tokenized_sentence = bert_tokenizer.tokenize(sample_sentence)
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

The sample sentence is:
This is a sample sentence, which we will tokenize using the BERT tokenizer.

The tokenized sentence is:
['this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.']

The numericalized sentence is:
[2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012]
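
The output above splits `tokenize` into `token` and `##ize`. The sketch below shows one way such sub-word pieces can be found: a greedy longest-match-first search over a vocabulary, where `##` marks a piece that continues a word. It is a simplified illustration with a hypothetical six-entry vocabulary, not the library's actual implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real one has 30522 entries.
toy_vocab = {"token", "##ize", "##izer", "ala", "##ddin", "sample"}

for word in ["sample", "tokenize", "tokenizer", "aladdin", "lamp"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Because rare words are broken into known pieces, a fixed vocabulary can cover far more words than it has entries.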


We should now do the same for the sentences in the dataframe. Before proceeding, it is a good idea to create a copy of what we have so far, so that we can revert to the original dataframe if needed.

In [None]:
# Create a copy of the sentences dataframe
tokens_df = df.copy()

In [None]:
# Tokenize each sentence in the dataframe
tokens_df['tokenized_sentence'] = tokens_df['sentence'].apply(bert_tokenizer.tokenize)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence
155,him of the lamp he rubbed it and the genie ap...,"[him, of, the, lamp, he, rubbed, it, and, the,..."
86,reality precious stones he then asked for som...,"[reality, precious, stones, he, then, asked, f..."
97,replied aladdin so they sat at breakfast till...,"[replied, ala, ##ddin, so, they, sat, at, brea..."
352,himself in africa under the window of the prin...,"[himself, in, africa, under, the, window, of, ..."
48,treasure aladdin forgot his fears and grasped ...,"[treasure, ala, ##ddin, forgot, his, fears, an..."
198,the princess that no man living would come up ...,"[the, princess, that, no, man, living, would, ..."
369,mine tell me what has become of an old lamp i ...,"[mine, tell, me, what, has, become, of, an, ol..."
214,stood in a halfcircle round the throne with th...,"[stood, in, a, half, ##ci, ##rcle, round, the,..."
100,hath made us aware of its virtues we will use ...,"[hat, ##h, made, us, aware, of, its, virtues, ..."
432,people by her touch of their ailments whereupo...,"[people, by, her, touch, of, their, ai, ##lm, ..."


In [None]:
# Add the numericalized sentences to the dataframe
tokens_df['numericalized_sentence'] = tokens_df['tokenized_sentence'].apply(bert_tokenizer.convert_tokens_to_ids)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence,numericalized_sentence
146,mother that though he consented to the marriag...,"[mother, that, though, he, consent, ##ed, to, ...","[2388, 2008, 2295, 2002, 9619, 2098, 2000, 199..."
46,this stone lies a treasure which is to be your...,"[this, stone, lies, a, treasure, which, is, to...","[2023, 2962, 3658, 1037, 8813, 2029, 2003, 200..."
47,may touch it so you must do exactly as i tell ...,"[may, touch, it, so, you, must, do, exactly, a...","[2089, 3543, 2009, 2061, 2017, 2442, 2079, 359..."
86,reality precious stones he then asked for som...,"[reality, precious, stones, he, then, asked, f...","[4507, 9062, 6386, 2002, 2059, 2356, 2005, 207..."
15,and told his mother of his newly found uncle ...,"[and, told, his, mother, of, his, newly, found...","[1998, 2409, 2010, 2388, 1997, 2010, 4397, 217..."
365,aladdin looked up she called to him to come t...,"[ala, ##ddin, looked, up, she, called, to, him...","[21862, 18277, 2246, 2039, 2016, 2170, 2000, 2..."
9,streets as usual a stranger asked him his age ...,"[streets, as, usual, a, stranger, asked, him, ...","[4534, 2004, 5156, 1037, 7985, 2356, 2032, 201..."
396,cellar and the princess put the powder aladdin...,"[cellar, and, the, princess, put, the, powder,...","[15423, 1998, 1996, 4615, 2404, 1996, 9898, 21..."
379,not but he will use violence aladdin comforte...,"[not, but, he, will, use, violence, ala, ##ddi...","[2025, 2021, 2002, 2097, 2224, 4808, 21862, 18..."
16,said your father had a brother but i always th...,"[said, your, father, had, a, brother, but, i, ...","[2056, 2115, 2269, 2018, 1037, 2567, 2021, 104..."
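
Numericalization is essentially a dictionary lookup. The sketch below uses a toy vocabulary whose IDs are copied from the outputs above, plus an `[UNK]` entry as the fallback for tokens that are missing from the vocabulary. The helper names are illustrative, not part of the transformers API.

```python
# Toy vocabulary mapping token -> integer ID.
toy_vocab = {
    "[UNK]": 100,
    "this": 2023,
    "is": 2003,
    "a": 1037,
    "sample": 7099,
    "sentence": 6251,
}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token; tokens missing from the vocabulary get the [UNK] ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]

def ids_to_tokens(ids, vocab):
    """Invert the vocabulary to map IDs back to tokens."""
    inverse = {index: token for token, index in vocab.items()}
    return [inverse[i] for i in ids]

ids = tokens_to_ids(["this", "is", "a", "sample", "lamp"], toy_vocab)
print(ids)
print(ids_to_tokens(ids, toy_vocab))
```

The round trip is lossy for unknown tokens: `lamp` comes back as `[UNK]`, which is why an unexpected `[UNK]` ID in the output is a useful warning sign.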


Sequences passed to a BERT model must include the special tokens `[CLS]` and `[SEP]`, which mark the start and the end of the input sequence. Let's add these tokens to the sample phrase. Another special token is `[PAD]`, which is used to pad shorter sequences.

In [None]:
# The square brackets are part of the token: a plain 'CLS' or 'SEP' is not in
# the vocabulary and would be converted to the unknown-token ID instead
tokenized_sentence = ['[CLS]'] + tokenized_sentence + ['[SEP]']
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

# Print the IDs for the special tokens for the BERT model
print(f'- The token ID for the special token [CLS] is: {bert_tokenizer.cls_token_id}')
print(f'- The token ID for the special token [SEP] is: {bert_tokenizer.sep_token_id}')
print(f'- The token ID for the special token [PAD] is: {bert_tokenizer.pad_token_id}')


The tokenized sentence is:
['[CLS]', 'this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.', '[SEP]']

The numericalized sentence is:
[101, 2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012, 102]
- The token ID for the special token [CLS] is: 101
- The token ID for the special token [SEP] is: 102
- The token ID for the special token [PAD] is: 0


As the example indicates, we need to add the `[CLS]` and `[SEP]` tokens to each tokenized sentence of the text dataframe.

In [None]:
# Add the [CLS] and [SEP] special token IDs to the numericalized sentences in the dataframe
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: [bert_tokenizer.cls_token_id] + x + [bert_tokenizer.sep_token_id])
tokens_df['numericalized_sentence'].sample(10)

20     [101, 2022, 4527, 2012, 2025, 2383, 2464, 2032...
449    [101, 5689, 2013, 1996, 8514, 2065, 2008, 2003...
475    [101, 2005, 2116, 2086, 2975, 2369, 2032, 1037...
420    [101, 2062, 10433, 1998, 2062, 23626, 2084, 23...
149    [101, 21862, 18277, 4741, 19080, 2005, 3053, 2...
112    [101, 2004, 2016, 2253, 1999, 1998, 2246, 2061...
388    [101, 2187, 2014, 9140, 2098, 2841, 18576, 210...
61     [101, 2677, 1997, 1996, 5430, 1996, 16669, 663...
10     [101, 1996, 2365, 1997, 2442, 9331, 3270, 1996...
81     [101, 8116, 2033, 2013, 2023, 2173, 26090, 199...
Name: numericalized_sentence, dtype: object

In [None]:
# Add the 0 padding to the numericalized sentences on the dataframe
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: x + [bert_tokenizer.pad_token_id] * (max_len - len(x)))

In [None]:
# Add a new column that indicates the length of the numericalized sentences
tokens_df['numericalized_sentence_length'] = tokens_df['numericalized_sentence'].apply(len)
tokens_df['numericalized_sentence_length'].sample(10)

348    76
343    76
120    76
144    76
201    76
381    76
18     76
439    76
77     76
339    76
Name: numericalized_sentence_length, dtype: int64
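
The padding step above works because every sequence here is shorter than `max_len`. Multiplying a list by a negative number gives an empty list, so a longer sequence would simply stay longer. The sketch below combines the special tokens, truncation, and padding in one helper. The truncation step and the function name are additions for illustration, and the IDs are the ones printed by the tokenizer earlier.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs printed by the tokenizer above

def add_special_and_pad(token_ids, max_len):
    """Wrap IDs in [CLS]/[SEP], then truncate or pad to exactly max_len."""
    # Reserve two positions for the special tokens.
    body = token_ids[: max_len - 2]
    sequence = [CLS_ID] + body + [SEP_ID]
    # After truncation the sequence is never longer than max_len,
    # so the padding count below is never negative.
    return sequence + [PAD_ID] * (max_len - len(sequence))

short = add_special_and_pad([2023, 2003, 1037], max_len=8)
long = add_special_and_pad(list(range(2000, 2020)), max_len=8)
print(short)
print(long)
```

Both results have exactly eight entries, which is the property needed to stack the sequences into a rectangular array.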

In [None]:
# Extract the numericalized sentences from the dataframe
numericalized_sentences = tokens_df['numericalized_sentence'].values
numericalized_sentences.shape

(447,)

In [None]:
# Convert each row of the numericalized sentences to a list
numericalized_sentences = [list(x) for x in numericalized_sentences]

In [None]:
# Convert the list into a 2D NumPy array
numericalized_sentences = np.array(numericalized_sentences)
print(f'The shape of the numericalized sentences is: {numericalized_sentences.shape}')
print(numericalized_sentences)

The shape of the numericalized sentences is: (447, 76)
[[  101 21862 18277 ...     0     0     0]
 [  101  2045  2320 ...     0     0     0]
 [  101  1037 23358 ...     0     0     0]
 ...
 [  101  2044  2023 ...     0     0     0]
 [  101  2002  4594 ...     0     0     0]
 [  101  2005  2116 ...     0     0     0]]


In [None]:
#  Convert the numpy array into a Tensor
numericalized_sentences = torch.from_numpy(numericalized_sentences)
print(numericalized_sentences)
print(f'the shape of the numericalized tensor is: {numericalized_sentences.shape}')

tensor([[  101, 21862, 18277,  ...,     0,     0,     0],
        [  101,  2045,  2320,  ...,     0,     0,     0],
        [  101,  1037, 23358,  ...,     0,     0,     0],
        ...,
        [  101,  2044,  2023,  ...,     0,     0,     0],
        [  101,  2002,  4594,  ...,     0,     0,     0],
        [  101,  2005,  2116,  ...,     0,     0,     0]], dtype=torch.int32)
the shape of the numericalized tensor is: torch.Size([447, 76])


# Part 2: How to Structure Text Data

Text data comes in so many forms that it can be hard to know where to start. A practical first step is to turn it all into a 1D object, such as a string or a list.

If we can wrap our heads around this, we can take a soup of texts from literature, websites, reviews, almost anything imaginable, and clean it up so it can be tokenized. BERT does not rely on the text being structured in order to work.

### Flattening the structure, starting from scratch

The simplest approach, the lowest common denominator, is to remove the structure of the data and turn it into a 1D object, either as a native Python string or as a list of strings.

### File Types

## .TXT Data

The simplest way to split text data is to use the string `.split()` method.
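
As a small illustration, the sketch below applies `.split()` and related string methods to a made-up snippet of text, then joins the words back into one flat string.

```python
# A made-up snippet standing in for the contents of a .txt file.
raw_text = "Once upon a time\nthere lived a poor tailor.\n\nHe had a son called Aladdin."

words = raw_text.split()             # no argument: split on any run of whitespace
lines = raw_text.splitlines()        # split on line breaks, keeping empty lines
paragraphs = raw_text.split("\n\n")  # split on blank lines

# Join the words back together to get one flat, single-line string.
flat = " ".join(words)

print(len(words), "words,", len(lines), "lines,", len(paragraphs), "paragraphs")
print(flat)
```

Calling `.split()` with no argument is usually the safest default, because it also absorbs tabs, carriage returns, and repeated spaces.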

## CSV 

CSV files are organized by columns, which makes it relatively straightforward to select the text columns.

Just as a NumPy array can be flattened, a text column can be flattened into one string.
The `str.cat` method of a pandas Series does this, with `sep` setting the separator placed between rows:

.str.cat(sep = ' ')
.str.cat(sep = '.')

In [32]:
hskVocabPath = DATA_DIR/'ChineseVocabulary'/'HSK Official With Definitions 2012 L3 freqorder.txt'

hskVocab = pd.read_csv(hskVocabPath,
                       header = None,
                       index_col = None,
                       sep='\t'
                      )

hskVocabColumns = ['Simplified','Traditional','Pinyin_Numeric','Pinyin_Accented','Definition']
hskVocab.columns = hskVocabColumns

In [33]:
hanzi = hskVocab['Simplified'].str.cat(sep=' ')

In [34]:
print(hanzi[0:150])

啊 还 把 过 如果 只 被 跟 自己 用 像 为 需要 应该 起来 才 又 拿 更 带 然后 一样 当然 相信 认为 明白 一直 地 地方 离开 一定 还是 发 发现 而且 必须 放 为了 向 老 位 先 种 最后 其他 记得 或者 过去 担心 条 以前 长 世界 重要 别人 机会 张 接 比赛 


In [35]:
definitions = hskVocab['Definition'].str.cat(sep='. ')

In [36]:
print(definitions[0:150])

ah; (particle showing elation, doubt, puzzled surprise, or approval). still; yet; in addition; even | repay; to return. (mw for things with handles); 


It is also possible to join columns together:

In [37]:
hanziWithDefinitions = hskVocab['Simplified'].str.cat(
    hskVocab['Definition'],sep=' ')
hanziWithDefinitions.head()

0    啊 ah; (particle showing elation, doubt, puzzle...
1    还 still; yet; in addition; even | repay; to re...
2    把 (mw for things with handles); (pretransitive...
3    过 to pass; to cross; go over; (indicates a pas...
4                             如果 if; in the event that
Name: Simplified, dtype: object

It is also possible to concatenate rows together, if you would like to preserve the relationship between them when they get tokenized in the same bag of words.

In [38]:
hanziWithDefinitions = hanziWithDefinitions.str.cat(sep='. ')

In [40]:
print(hanziWithDefinitions[0:99])

啊 ah; (particle showing elation, doubt, puzzled surprise, or approval). 还 still; yet; in addition; 


## JSON

JSON files follow a different paradigm from CSV files, which are table based. As a result, how well the approach above works depends on the structure of the file.

In this course, we aim to cover key threshold concepts well without becoming overextended. Since tensors depend on unique indices and regular shapes, this course goes into more detail about tables than about JSON.

With this in mind, there are two things to consider about JSON files. First, they do not have to have a regular structure: one JSON object can contain as few or as many items as required. Second, they can be heavily nested in ways that don't play well with the paradigm of tensors.

If a JSON file has a regular structure, such as a typical API response, it will fit more easily into a dataframe. If it does not have a regular structure, like a configuration file, it will probably not transfer well into a dataframe.

In that case, it can be treated as a Series instead, since a Series is not bound by the same requirements.

In many cases, it may be necessary to unpack it using Python's `json` library.

https://www.marsja.se/how-to-read-and-write-json-files-using-python-and-pandas/


If you end up with some highly nested JSON objects, this tutorial goes through some pretty complicated data and might give some inspiration.
https://medium.com/analytics-vidhya/extract-the-useful-data-from-jason-file-for-data-sceince-34ed5ae0b350
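
For irregular or nested JSON, one simple way to flatten it is to walk the parsed structure recursively and collect every string value into a 1D list. The sketch below uses only the standard `json` module. The document and the `collect_strings` helper are hypothetical examples, and dictionary keys are ignored on purpose.

```python
import json

# Hypothetical, irregularly nested JSON document.
raw = '''
{
  "title": "Aladdin",
  "chapters": [
    {"heading": "The Lamp", "paragraphs": ["He rubbed it.", "A genie appeared."]},
    {"heading": "The Palace", "notes": {"editor": "Check spelling."}}
  ],
  "pages": 42
}
'''

def collect_strings(node):
    """Recursively gather every string value, in document order."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        return [s for value in node.values() for s in collect_strings(value)]
    if isinstance(node, list):
        return [s for item in node for s in collect_strings(item)]
    return []  # numbers, booleans and null carry no text

texts = collect_strings(json.loads(raw))
flat_text = " ".join(texts)
print(texts)
```

The resulting list can be wrapped in a `pd.Series` and processed with the same `.str` methods used earlier for CSV columns.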


In [41]:
import json

# Creating a Python Dictionary
data = {"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731", 
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012", 
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Manegement", "IT", "HR", 
                      "Finance", "IT", "Manegement", "IT"],
        "Sex":[ "M", "M", "M", 
              "M", "M", "F", "F", "F"]}

print(data)

{'Sub_ID': ['1', '2', '3', '4', '5', '6', '7', '8'], 'Name': ['Erik', 'Daniel', 'Michael', 'Sven', 'Gary', 'Carol', 'Lisa', 'Elisabeth'], 'Salary': ['723.3', '515.2', '621', '731', '844.15', '558', '642.8', '732.5'], 'StartDate': ['1/1/2011', '7/23/2013', '12/15/2011', '6/11/2013', '3/27/2011', '5/21/2012', '7/30/2013', '6/17/2014'], 'Department': ['IT', 'Manegement', 'IT', 'HR', 'Finance', 'IT', 'Manegement', 'IT'], 'Sex': ['M', 'M', 'M', 'M', 'M', 'F', 'F', 'F']}


In [43]:
import json

# Write the dictionary to a JSON file
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

In [49]:

# Read JSON as a dataframe with Pandas:
json_df = pd.read_json('data.json')
json_df.head()

Unnamed: 0,Sub_ID,Name,Salary,StartDate,Department,Sex
0,1,Erik,723.3,1/1/2011,IT,M
1,2,Daniel,515.2,7/23/2013,Manegement,M
2,3,Michael,621.0,12/15/2011,IT,M
3,4,Sven,731.0,6/11/2013,HR,M
4,5,Gary,844.15,3/27/2011,Finance,M


In [50]:
(json_df['Name'] + ' ' + json_df['Department']).str.cat(sep='. ')

'Erik IT. Daniel Manegement. Michael IT. Sven HR. Gary Finance. Carol IT. Lisa Manegement. Elisabeth IT'

## PDF and Word Documents

PDFs and Word documents contain formatting information that makes them more readable than a plain .txt file. Word documents are more regularly structured than PDFs, which are designed purely to be viewed by people.

PDFMiner can be used to extract text from PDF documents, while python-docx can be used to extract text from a Word document.

### Docx

In [53]:
# The `docx` module is provided by the python-docx package on PyPI
!pip install python-docx -qq
import docx

In [54]:
document = docx.Document()

document.add_heading('HSK Vocabulary Definitions')

document.add_paragraph(hanziWithDefinitions)

<docx.text.paragraph.Paragraph at 0x7fc929707a90>

In [68]:
 
 
# open connection to Word Document
#doc = docx.Document("FILE_PATH")
 
# read in each paragraph in file
rawText = [p.text for p in document.paragraphs]

# Since there are multiple paragraphs, they need to be
# concatenated into a single string
rawText = ''.join(rawText)


In [67]:
rawText[:150]

'HSK Vocabulary Definitions啊 ah; (particle showing elation, doubt, puzzled surprise, or approval). 还 still; yet; in addition; even | repay; to return. '
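
In the output above, the heading runs straight into the first paragraph because the texts were joined with an empty string. The sketch below shows a possible alternative that keeps a separator between paragraphs. The `paragraph_texts` list is a made-up stand-in for `[p.text for p in document.paragraphs]`.

```python
# Stand-in for the list of paragraph texts read from a Word document.
paragraph_texts = [
    "HSK Vocabulary Definitions",
    "",
    "啊 ah; (particle showing elation).",
    "还 still; yet.",
]

# Joining with an empty string glues neighbouring paragraphs together.
glued = "".join(paragraph_texts)

# Skip empty paragraphs and keep one line break between the rest.
separated = "\n".join(t.strip() for t in paragraph_texts if t.strip())

print(glued)
print(separated)
```

Which separator to use depends on the next step: a newline keeps paragraph boundaries recoverable, while a space is enough if the text goes straight into a tokenizer.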

# Part 3 : Discussions - Questions

### Discuss the following:
* What was the pipeline of this exercise?
* Please give a summary of the data cleaning and preprocessing steps.
* What is the difference between a token and a sentence?
* Why did we convert the tokens to numbers?
* Why did we add the special tokens?
* What advantages are offered by Pandas for text manipulation?
* Would this approach be suitable for complex datasets?

### Exercise:

* Repeat this pipeline with 3 different books that appear very different in nature (don't add the special tokens).
* When you obtain the numericalized sentences, convert them into a long 1D Numpy array.
* Plot the distribution of the numericalized tokens for each book using histograms.
* Share your experience during the next lesson.