# 💻 UnpackAI DL201 Bootcamp - Week 1 - Skills: NLP

## 📕 Learning Objectives

* Gain an appreciation for how NLP models accept data
* See past the complexity and extract text from common file formats
* Perform cursory EDA with a Pandas Series while building confidence in, and awareness of, its capabilities

## 📖 Concepts map
* Structured vs. Unstructured Data
* Qualitative vs Quantitative Data
* Flattening Data Structures
* Tokenization
* Transfer Learning

# Introduction

## Why is Natural Language Processing (NLP) Data Notorious?

The nature of text data makes it intimidating compared to other forms of data. It is not too difficult to consider an image or dataframe because we are familiar with these from everyday life. 



## Text is Unstructured

Text data, by contrast, can take endless forms because it represents ideas. A text can be as short as a single utterance or as long as a dictionary. It can be stored as scanned PDFs of typewritten archives, Word documents, e-books, front-end web pages, back-end API responses, or plain .txt files.

## Text is Qualitative

Text lacks the fixed meaning of numbers and can be interpreted in different ways.

For example, "mountain" is normally a noun, but it works like an adjective in "mountain lion", and a mountain lion is also called a puma or a cougar. There is inherent ambiguity in language.

# Part 0 : Code preparation

In [1]:
!pip install transformers openpyxl python-docx -qq
!git clone https://github.com/unpackAI/DL201.git


# Imports 
from pathlib import Path


# Import libraries
import os
import numpy as np
import pandas as pd
import torch
import requests
from transformers import BertTokenizer

# Kaggle config (comment these two lines out when running locally)
DATA_DIR = Path('/kaggle/working/DL201/data')
IMAGE_DIR = Path('/kaggle/working/DL201/img')


# Local config (uncomment these lines when running locally)
#week_1_folder = os.getcwd()
#os.chdir("..")
#DATA_DIR = os.getcwd() + '/data'
#IMAGE_DIR = os.getcwd() + '/data'

# Part 1: How NLP Quantifies Text Data

### A basic NLP Overview

From Wikipedia:
- "Natural language processing (NLP) is a subfield of **linguistics, computer science, and artificial intelligence** concerned with the interactions between computers and **human language**, in particular how to program computers to **process and analyze** large amounts of natural language data. The goal is a computer capable of **"understanding"** the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

- Approaches to NLP tasks:
    - Rule-based
    - Traditional machine learning
    - Deep learning

In NLP, we often need to perform text preprocessing, such as removing stop words, stemming, lemmatization, and tokenization.
A nice overview is presented in: 
- https://stanfordnlp.github.io/CoreNLP/ 
- https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP
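To make these preprocessing steps concrete, here is a minimal sketch that uses only the Python standard library. The stop-word list is a tiny made-up sample (an assumption for illustration), and stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    tokens = text.split()                 # naive whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The genie appeared, and Aladdin was overwhelmed!"))
# ['genie', 'appeared', 'aladdin', 'overwhelmed']
```

Pretrained tokenizers such as the BERT one used later in this notebook take care of most of this themselves, so treat the sketch as a picture of the idea rather than a required step.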

Common NLP tasks:
- Classification
- Mask filling
- Text prediction
- Sentiment analysis
    - Positive
    - Negative
    - Subjectivity
- Entity recognition
    - Person
    - Location
    - Organization
- Entity extraction
- Keyword extraction
- Topic extraction

### Illustrative example

The code example below illustrates how to use Pandas for text manipulation, followed by a few exploratory steps that create Tensors representing the text data.

Let's load a sample book that is conveniently available in txt format from the collection at http://www.textfiles.com/stories/. The book in this case is Aladdin.


In [2]:
# Load a sample text, from the provided url
response = requests.get('http://www.textfiles.com/stories/alad10.txt')
sample_text = response.text

# Split the text into lines; each line is treated as a "sentence" from here on
sentences = sample_text.split('\n')

# Load the sentences into a dataframe
df = pd.DataFrame(sentences, columns=['sentence'])

As has been reiterated before, loading the data into Pandas gives us tremendous flexibility to perform data cleaning and preprocessing with ease.
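Splitting on `'\n'` yields the hard-wrapped lines of the text file, which do not always line up with grammatical sentences. If real sentence boundaries matter, one possible approach is sketched below: rejoin the lines, then split after end-of-sentence punctuation. The helper name `split_sentences` is made up for this example, and the rule is deliberately naive (abbreviations such as "Mr." would be split incorrectly).

```python
import re

def split_sentences(text):
    """Rejoin hard-wrapped lines, then split after '.', '!' or '?'."""
    flat = " ".join(text.split())             # collapse newlines and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", flat)  # split on whitespace that follows end punctuation
    return [p for p in parts if p]

wrapped = "Aladdin looked up. She called\r\nto him to come. He obeyed!"
print(split_sentences(wrapped))
# ['Aladdin looked up.', 'She called to him to come.', 'He obeyed!']
```

The resulting list can be loaded into a dataframe in the same way as the lines: `pd.DataFrame(split_sentences(sample_text), columns=['sentence'])`.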

In [3]:
# Inspect some of the sentences
df.sample(15)

Unnamed: 0,sentence
465,and requesting that the holy Fatima should be ...
365,Aladdin looked up. She called to him to come ...
154,"Aladdin, who was overwhelmed at first, but pre..."
11,"""but he died a long while ago."" On this the s..."
82,found himself outside. As soon as his eyes co...
335,"daughter?"" demanded the Sultan. ""For the firs..."
399,sign she was reconciled to him. Before drinki...
394,invited you to sup with me; but I am tired of ...
95,"and two bottles of wine. Aladdin's mother, wh..."
384,"with smiles, leading him to believe that you h..."


In [4]:
# Keep only the sentences that have more than 3 words
df = df[df['sentence'].str.split().str.len() > 3]

In [5]:
# Remove punctuation from all sentences (recent pandas versions need regex=True for patterns)
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)

# Note: instead of regex a list of punctuation can be used, give it a try!
punctuation = [
    '.', ',', '!', '?', ':', ';', '"', "'", '-', '_', '(', ')', '[', ']', '{', '}', '#', '@', '$', '%', '^', '&', '*',
     '+', '=', '<', '>', '/', '\\', '|', '~', '`', '“', '”', '‘', '’'
]

df.sample(10)



Unnamed: 0,sentence
473,After this Aladdin and his wife lived in peace\r
225,him in his childhood knew him not he had grown...
439,him what he thought of it It is truly beautif...
180,passed there Her mother did not believe her in...
413,received him in the hall of the fourandtwenty ...
457,deserve to be burnt to ashes but that this req...
6,the streets with little idle boys like himself...
270,surprised to receive his jewels again and visi...
458,from you but from the brother of the African m...
132,all but the Vizier and bade her speak freely p...
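For the "give it a try" suggestion above, one way to use a punctuation list instead of a regex is a translation table. This is a sketch with a shortened list; `strip_punctuation` is a name chosen for this example.

```python
# Shortened punctuation list for illustration; the full list from the cell above works the same way.
punctuation = ['.', ',', '!', '?', ':', ';', '"', "'", '-', '(', ')']

# With three arguments, str.maketrans maps every character of the third string to None (deletion).
remove_table = str.maketrans('', '', ''.join(punctuation))

def strip_punctuation(sentence):
    """Delete every listed punctuation character from the sentence."""
    return sentence.translate(remove_table)

print(strip_punctuation('"Aladdin," said the Sultan; "come here!"'))
# Aladdin said the Sultan come here
```

On the dataframe, `df['sentence'].str.translate(remove_table)` should apply the same table to every row.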


In [6]:
# Convert all sentences to lowercase
df['sentence'] = df['sentence'].str.lower()
df.sample(10)

Unnamed: 0,sentence
271,showed him the window finished the sultan emb...
57,his finger and gave it to aladdin bidding him ...
447,humour he begged to know what was amiss and s...
80,and will obey thee in all things aladdin fear...
407,in it back to china this was done and the pri...
218,bidding him make haste but aladdin first call...
101,which i shall always wear on my finger when t...
449,hanging from the dome if that is all replied ...
163,put him outside in the cold and return at dayb...
167,to you the princess was too frightened to spe...


**Sentences are a key unit of information in NLP** (as are tokens). To represent our data as a uniform "block" of text, we need to find the length of the longest sentence; the shorter ones will later be padded with padding tokens up to that length.

In [7]:
# Get the length of the longest sentence (measured in characters)
max_len = df['sentence'].str.len().max()
print(f'max sentence length is {max_len}')

max sentence length is 76


The transformers library provides a convenient way to load a variety of BERT models. Let's first load and explore a tokenizer.

In [8]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [9]:
# Get the tokenizer vocabulary words
vocab = bert_tokenizer.vocab
vocab_size = len(vocab)
print(f'vocab size is: {vocab_size}')

vocab size is: 30522


In [10]:
# Get the vocabulary words as a list, load them into a dataframe
vocab_list = list(vocab.keys())
vocab_df = pd.DataFrame(vocab_list, columns=['tokens'])
vocab_df.sample(15)

Unnamed: 0,tokens
7992,buddhist
24327,##metry
23429,prentice
6552,habitat
9736,poets
17274,portraying
25296,caressing
4707,alliance
8188,curled
25855,##smo


In [11]:
# Get the count of tokens that contain 'unused'
unused_tokens = vocab_df[vocab_df['tokens'].str.find('unused')>=0]
print(f'There are {len(unused_tokens)} tokens that contain "unused"')
unused_tokens.sample(10)

There are 995 tokens that contain "unused"


Unnamed: 0,tokens
218,[unused213]
927,[unused922]
770,[unused765]
110,[unused105]
937,[unused932]
81,[unused80]
948,[unused943]
595,[unused590]
598,[unused593]
795,[unused790]


In [12]:
# Get the tokens that have a size of 1 character
one_char_tokens = vocab_df[vocab_df['tokens'].str.len()==1]
print(f'There are {len(one_char_tokens)} tokens that have a size of 1 character')
one_char_tokens.sample(10)

There are 997 tokens that have a size of 1 character


Unnamed: 0,tokens
1060,x
1292,ق
1056,t
1554,₇
1869,智
1383,ச
1341,ि
1162,θ
1455,ᄀ
1179,ω


In [13]:
# Get the tokens that are longer than 2 characters and do not contain the word 'unused'
two_char_tokens = vocab_df[(vocab_df['tokens'].str.len()>2) & (vocab_df['tokens'].str.find('unused')<0)]
print(f'There are {len(two_char_tokens)} tokens that likely represent English words')
two_char_tokens.sample(10)

There are 28042 tokens that likely represent English words


Unnamed: 0,tokens
4824,understanding
27661,comrade
6507,archbishop
20147,##anne
24687,geometridae
6909,stroke
13244,meadow
8013,customer
29062,rosenthal
26014,alteration


Each sentence is currently stored as a string of characters. We need to transform it into a list of tokens, and the tokens are then converted into numbers by using the tokenizer's vocabulary as an index. Here is an example with a phrase:

In [14]:
# Example of tokenizing a sentence
sample_sentence = "This is a sample sentence, which we will tokenize using the BERT tokenizer."
print(f'The sample sentence is:\n{sample_sentence}')

tokenized_sentence = bert_tokenizer.tokenize(sample_sentence)
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

The sample sentence is:
This is a sample sentence, which we will tokenize using the BERT tokenizer.

The tokenized sentence is:
['this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.']

The numericalized sentence is:
[2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012]
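The `##` pieces in the output come from subword splitting. The sketch below shows the greedy longest-match-first idea behind WordPiece on a toy vocabulary. It is a simplification written for this notebook, not the library's implementation: the real tokenizer also normalizes text, splits punctuation, and caps the word length. The function name `wordpiece` and the toy vocabulary are assumptions; the IDs are copied from the output above, plus 100 for `[UNK]` in `bert-base-uncased`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter substring
        if piece is None:
            return [unk]                      # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (token -> ID).
toy_vocab = {"token": 19204, "##ize": 4697, "##izer": 17629, "sample": 7099, "[UNK]": 100}

for word in ["sample", "tokenize", "tokenizer", "xyz"]:
    pieces = wordpiece(word, toy_vocab)
    print(word, pieces, [toy_vocab[p] for p in pieces])
# sample ['sample'] [7099]
# tokenize ['token', '##ize'] [19204, 4697]
# tokenizer ['token', '##izer'] [19204, 17629]
# xyz ['[UNK]'] [100]
```

Because rare words break down into known pieces, a vocabulary of about 30,000 entries can cover far more than 30,000 words.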


We should now do the same for the sentences in the dataframe. Before proceeding, it is a good idea to create a copy of what we have so far, so that we can revert to the original dataframe if we need to.

In [15]:
# Create a copy of the sentences dataframe
tokens_df = df.copy()

In [16]:
# Tokenize each sentence in the dataframe
tokens_df['tokenized_sentence'] = tokens_df['sentence'].apply(bert_tokenizer.tokenize)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence
175,princess would not say a word and was very sor...,"[princess, would, not, say, a, word, and, was,..."
34,but the magician beguiled him with pleasant st...,"[but, the, magician, beg, ##uil, ##ed, him, wi..."
206,your answer not so long mother as you think h...,"[your, answer, not, so, long, mother, as, you,..."
166,wife promised to me by your unjust father and ...,"[wife, promised, to, me, by, your, un, ##just,..."
94,bowl twelve silver plates containing rich meat...,"[bowl, twelve, silver, plates, containing, ric..."
335,daughter demanded the sultan for the first i ...,"[daughter, demanded, the, sultan, for, the, fi..."
272,envious vizier meanwhile hinting that it was t...,"[en, ##vious, viz, ##ier, meanwhile, hint, ##i..."
159,the bride and bridegroom master i obey said t...,"[the, bride, and, bride, ##gr, ##oom, master, ..."
302,offering to exchange fine new lamps for old on...,"[offering, to, exchange, fine, new, lamps, for..."
451,the genie appeared commanded him to bring a ro...,"[the, genie, appeared, commanded, him, to, bri..."


In [17]:
# Add the numericalized sentences to the dataframe
tokens_df['numericalized_sentence'] = tokens_df['tokenized_sentence'].apply(bert_tokenizer.convert_tokens_to_ids)
tokens_df.sample(10)

Unnamed: 0,sentence,tokenized_sentence,numericalized_sentence
379,not but he will use violence aladdin comforte...,"[not, but, he, will, use, violence, ala, ##ddi...","[2025, 2021, 2002, 2097, 2224, 4808, 21862, 18..."
388,left her arrayed herself gaily for the first t...,"[left, her, array, ##ed, herself, gail, ##y, f...","[2187, 2014, 9140, 2098, 2841, 18576, 2100, 20..."
143,own son begged the sultan to withhold her for ...,"[own, son, begged, the, sultan, to, with, ##ho...","[2219, 2365, 12999, 1996, 7544, 2000, 2007, 12..."
467,aladdin seizing his dagger pierced him to the ...,"[ala, ##ddin, seizing, his, dagger, pierced, h...","[21862, 18277, 24681, 2010, 10794, 16276, 2032..."
54,these halls lead into a garden of fine fruit t...,"[these, halls, lead, into, a, garden, of, fine...","[2122, 9873, 2599, 2046, 1037, 3871, 1997, 298..."
443,wonder of the world\r,"[wonder, of, the, world]","[4687, 1997, 1996, 2088]"
73,lamp and kill him afterwards\r,"[lamp, and, kill, him, afterwards]","[10437, 1998, 3102, 2032, 5728]"
14,go to your mother and tell her i am coming al...,"[go, to, your, mother, and, tell, her, i, am, ...","[2175, 2000, 2115, 2388, 1998, 2425, 2014, 104..."
183,the following night exactly the same thing hap...,"[the, following, night, exactly, the, same, th...","[1996, 2206, 2305, 3599, 1996, 2168, 2518, 304..."
395,and would fain taste those of africa the magi...,"[and, would, fai, ##n, taste, those, of, afric...","[1998, 2052, 26208, 2078, 5510, 2216, 1997, 30..."


Phrases fed to a BERT model must include the special tokens `[CLS]` and `[SEP]`, which mark the start and the end of the input sequence. Let's add these tokens to the sample phrase. Another special token is `[PAD]`, which is used to pad shorter sequences.

In [18]:
# The special tokens must be written with their square brackets, otherwise they map to the unknown token
tokenized_sentence = ['[CLS]'] + tokenized_sentence + ['[SEP]']
print(f'\nThe tokenized sentence is:\n{tokenized_sentence}')

numericalized_sentence = bert_tokenizer.convert_tokens_to_ids(tokenized_sentence)
print(f'\nThe numericalized sentence is:\n{numericalized_sentence}')

# Print the IDs for the special tokens for the BERT model
print(f'- The token ID for the special token [CLS] is: {bert_tokenizer.cls_token_id}')
print(f'- The token ID for the special token [SEP] is: {bert_tokenizer.sep_token_id}')
print(f'- The token ID for the special token [PAD] is: {bert_tokenizer.pad_token_id}')


The tokenized sentence is:
['[CLS]', 'this', 'is', 'a', 'sample', 'sentence', ',', 'which', 'we', 'will', 'token', '##ize', 'using', 'the', 'bert', 'token', '##izer', '.', '[SEP]']

The numericalized sentence is:
[101, 2023, 2003, 1037, 7099, 6251, 1010, 2029, 2057, 2097, 19204, 4697, 2478, 1996, 14324, 19204, 17629, 1012, 102]
- The token ID for the special token [CLS] is: 101
- The token ID for the special token [SEP] is: 102
- The token ID for the special token [PAD] is: 0
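The exact spelling of the special tokens matters, because a vocabulary lookup falls back to the unknown token for any string it does not contain. The sketch below imitates that lookup with a toy dictionary. The helper `to_ids` is made up for this example; the IDs for `[CLS]`, `[SEP]` and `[PAD]` match the values printed above, and 100 is the `[UNK]` ID in `bert-base-uncased`.

```python
# Toy vocabulary (token -> ID); the word IDs are copied from the dataframe output earlier.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "lamp": 10437, "and": 1998}

def to_ids(tokens, vocab):
    """Look up each token, falling back to [UNK] when the string is not in the vocabulary."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = ["lamp", "and", "lamp"]
print(to_ids(["[CLS]"] + tokens + ["[SEP]"], toy_vocab))  # [101, 10437, 1998, 10437, 102]
print(to_ids(["CLS"] + tokens + ["SEP"], toy_vocab))      # [100, 10437, 1998, 10437, 100]
```

A 100 at either end of a numericalized sentence is therefore a sign that the brackets were dropped.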


As the example indicates, we need to add the `[CLS]` and `[SEP]` tokens to each numericalized sentence in the text dataframe.

In [19]:
# Add the [CLS] (101) and [SEP] (102) special token IDs to the numericalized sentences in the dataframe
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: [bert_tokenizer.cls_token_id] + x + [bert_tokenizer.sep_token_id])
tokens_df['numericalized_sentence'].sample(10)

69     [101, 3894, 2808, 1997, 1037, 6919, 10437, 202...
399    [101, 3696, 2016, 2001, 28348, 2000, 2032, 207...
349    [101, 1045, 2572, 2069, 1996, 6658, 1997, 1996...
350    [101, 2130, 2061, 2056, 21862, 18277, 2021, 15...
302    [101, 5378, 2000, 3863, 2986, 2047, 14186, 200...
311    [101, 28018, 2043, 2002, 2766, 2041, 1996, 104...
421    [101, 2000, 24896, 2010, 3428, 2331, 1998, 225...
134    [101, 2032, 1997, 2014, 4124, 6355, 2293, 2005...
293    [101, 2907, 1997, 1996, 10437, 1998, 2153, 259...
88     [101, 2210, 6557, 1998, 2097, 2175, 5271, 2009...
Name: numericalized_sentence, dtype: object

In [20]:
# Right-pad each numericalized sentence with the [PAD] ID (0) until it has max_len entries
tokens_df['numericalized_sentence'] = tokens_df['numericalized_sentence'].apply(lambda x: x + [bert_tokenizer.pad_token_id] * (max_len - len(x)))

In [21]:
# Add a new column that indicates the length of the numericalized sentences
tokens_df['numericalized_sentence_length'] = tokens_df['numericalized_sentence'].apply(len)
tokens_df['numericalized_sentence_length'].sample(10)

287    76
262    76
43     76
36     76
312    76
55     76
188    76
133    76
270    76
24     76
Name: numericalized_sentence_length, dtype: int64
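In this notebook `max_len` was measured in characters, which works here because it is comfortably larger than any token count. A tighter alternative is to count tokens and to guard against sequences that are longer than the target. The sketch below assumes a hypothetical helper `pad_or_truncate`; the sample IDs are copied from the outputs above.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a list of exactly max_len IDs: cut long inputs, right-pad short ones."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

batch = [[101, 4687, 1997, 1996, 2088, 102],
         [101, 10437, 102]]
max_len = max(len(ids) for ids in batch)  # longest sequence, counted in tokens
padded = [pad_or_truncate(ids, max_len) for ids in batch]
print(padded)
# [[101, 4687, 1997, 1996, 2088, 102], [101, 10437, 102, 0, 0, 0]]
```

Note that plain truncation would also cut off the closing `[SEP]` ID, so a real pipeline should truncate before adding the special tokens.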

In [22]:
# Extract the numericalized sentences from the dataframe
numericalized_sentences = tokens_df['numericalized_sentence'].values
numericalized_sentences.shape

(447,)

In [23]:
# Convert each row of the numericalized sentences to a list
numericalized_sentences = [list(x) for x in numericalized_sentences]

In [24]:
# Convert the list into a 2D NumPy array
numericalized_sentences = np.array(numericalized_sentences)
print(f'The shape of the numericalized sentences is: {numericalized_sentences.shape}')
print(numericalized_sentences)

The shape of the numericalized sentences is: (447, 76)
[[  101 21862 18277 ...     0     0     0]
 [  101  2045  2320 ...     0     0     0]
 [  101  1037 23358 ...     0     0     0]
 ...
 [  101  2044  2023 ...     0     0     0]
 [  101  2002  4594 ...     0     0     0]
 [  101  2005  2116 ...     0     0     0]]


In [25]:
#  Convert the numpy array into a Tensor
numericalized_sentences = torch.from_numpy(numericalized_sentences)
print(numericalized_sentences)
print(f'the shape of the numericalized tensor is: {numericalized_sentences.shape}')

tensor([[  101, 21862, 18277,  ...,     0,     0,     0],
        [  101,  2045,  2320,  ...,     0,     0,     0],
        [  101,  1037, 23358,  ...,     0,     0,     0],
        ...,
        [  101,  2044,  2023,  ...,     0,     0,     0],
        [  101,  2002,  4594,  ...,     0,     0,     0],
        [  101,  2005,  2116,  ...,     0,     0,     0]], dtype=torch.int32)
the shape of the numericalized tensor is: torch.Size([447, 76])


# Part 2: How to Structure Text Data

Because text comes in so many forms, it can be hard to know where to start. The common approach is to turn it all into a 1D object, such as a list.

If we can wrap our heads around this, we can take a soup of texts from literature, websites, reviews, or almost anything imaginable, and wipe it clean so it can be tokenized. BERT does not rely on the text being structured in order to work.

### Flattening the structure, starting from scratch

The simplest approach, the lowest common denominator, is to remove the structure of the data and turn it into a 1D object, either as a native Python string or as a list.

### File Types

## .TXT Data

The simplest way to split text data is the built-in `.split()` string method.
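As a small illustration with a made-up string, the sketch below splits a text into lines and into words, then joins the words back into one flat string. `splitlines()` is used for the lines because it handles both `\n` and `\r\n` endings, whereas `split('\n')` leaves a trailing `\r` on each line.

```python
text = "Aladdin rubbed the lamp.\r\nThe genie appeared.\r\n\r\nHe made a wish."

lines = [ln for ln in text.splitlines() if ln.strip()]  # split into lines, drop empty ones
words = text.split()                                    # no argument: split on any run of whitespace
flat = " ".join(words)                                  # one flat string, ready for a tokenizer

print(lines)  # ['Aladdin rubbed the lamp.', 'The genie appeared.', 'He made a wish.']
print(flat)   # Aladdin rubbed the lamp. The genie appeared. He made a wish.
```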

## CSV 

CSV files are organized by columns, which makes it relatively straightforward to select the text columns.

Just as a NumPy array can be flattened, a text column can be flattened into a single string. The `.str.cat()` method of a pandas Series does this by concatenating every row with the chosen separator:

.str.cat(sep=' ')
.str.cat(sep='.')

In [26]:
hskVocabPath = os.path.join(DATA_DIR, 'ChineseVocabulary', 'HSK Official With Definitions 2012 L3 freqorder.txt')
os.path.exists(hskVocabPath)

hskVocab = pd.read_csv(hskVocabPath,
                       header = None,
                       index_col = None,
                       sep='\t'
                      )

hskVocabColumns = ['Simplified','Traditional','Pinyin_Numeric','Pinyin_Accented','Definition']
hskVocab.columns = hskVocabColumns

In [27]:
hanzi = hskVocab['Simplified'].str.cat(sep=' ')

In [28]:
print(hanzi[0:150])

啊 还 把 过 如果 只 被 跟 自己 用 像 为 需要 应该 起来 才 又 拿 更 带 然后 一样 当然 相信 认为 明白 一直 地 地方 离开 一定 还是 发 发现 而且 必须 放 为了 向 老 位 先 种 最后 其他 记得 或者 过去 担心 条 以前 长 世界 重要 别人 机会 张 接 比赛 


In [29]:
definitions = hskVocab['Definition'].str.cat(sep='. ')

In [30]:
print(definitions[0:150])

ah; (particle showing elation, doubt, puzzled surprise, or approval). still; yet; in addition; even | repay; to return. (mw for things with handles); 


It is also possible to join columns together

In [31]:
hanziWithDefinitions = hskVocab['Simplified'].str.cat(
    hskVocab['Definition'],sep=' ')
hanziWithDefinitions.head()

0    啊 ah; (particle showing elation, doubt, puzzle...
1    还 still; yet; in addition; even | repay; to re...
2    把 (mw for things with handles); (pretransitive...
3    过 to pass; to cross; go over; (indicates a pas...
4                             如果 if; in the event that
Name: Simplified, dtype: object

It is also possible to concatenate rows together, if you would like to preserve a relationship between them when they get tokenized in the same bag of words.

In [32]:
hanziWithDefinitions = hanziWithDefinitions.str.cat(sep='. ')

In [33]:
print(hanziWithDefinitions[0:99])

啊 ah; (particle showing elation, doubt, puzzled surprise, or approval). 还 still; yet; in addition; 


## JSON

JSON files follow a different paradigm from CSV files, which are table based. As a result, how easily they can be flattened depends on their structure.

In this course, we aim to cover key threshold concepts well while being careful not to overextend. Since tensors depend on unique indices and regular shapes, this course goes into more detail about tables than about JSON.

With this in mind, there are two things to consider about JSON files. First, they do not have to have a regular structure: one JSON object can contain as few or as many items as required. Second, they can be heavily nested in ways that don't play well with the paradigm of tensors.

If a JSON file has a regular structure, as an API response often does, it will fit more easily into a dataframe. If it does not have a regular structure, like a configuration file, it will probably not transfer well into a dataframe.

In that case, it can be treated as a Series instead, since a Series is not bound by the same requirements.

In many cases, it may be necessary to unpack it using Python's `json` library.

https://www.marsja.se/how-to-read-and-write-json-files-using-python-and-pandas/


If you end up with some highly nested JSON objects, this tutorial goes through some pretty complicated data and might give some inspiration.
https://medium.com/analytics-vidhya/extract-the-useful-data-from-jason-file-for-data-sceince-34ed5ae0b350
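For irregular or nested JSON, one simple option is to ignore the structure and collect every string value into a flat list. The sketch below does this with a short recursive helper and the standard `json` module. Both the helper `collect_strings` and the sample record are made up for illustration.

```python
import json

def collect_strings(node):
    """Walk nested dicts and lists and return every string value in document order."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        node = list(node.values())  # keys are dropped, only values are kept
    if isinstance(node, list):
        found = []
        for child in node:
            found.extend(collect_strings(child))
        return found
    return []  # numbers, booleans and null carry no text

# Hypothetical irregular record, made up for this example.
raw = ('{"title": "Aladdin", "chapters": ['
       '{"name": "The Lamp", "notes": ["cave", "ring"]}, '
       '{"name": "The Palace", "pages": 12}]}')

print(". ".join(collect_strings(json.loads(raw))))
# Aladdin. The Lamp. cave. ring. The Palace
```

This throws away the keys and the nesting, which is acceptable when the goal is only a flat block of text for a tokenizer.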


In [34]:
import json

# Creating a Python Dictionary
data = {"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731", 
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012", 
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Manegement", "IT", "HR", 
                      "Finance", "IT", "Manegement", "IT"],
        "Sex":[ "M", "M", "M", 
              "M", "M", "F", "F", "F"]}

print(data)

{'Sub_ID': ['1', '2', '3', '4', '5', '6', '7', '8'], 'Name': ['Erik', 'Daniel', 'Michael', 'Sven', 'Gary', 'Carol', 'Lisa', 'Elisabeth'], 'Salary': ['723.3', '515.2', '621', '731', '844.15', '558', '642.8', '732.5'], 'StartDate': ['1/1/2011', '7/23/2013', '12/15/2011', '6/11/2013', '3/27/2011', '5/21/2012', '7/30/2013', '6/17/2014'], 'Department': ['IT', 'Manegement', 'IT', 'HR', 'Finance', 'IT', 'Manegement', 'IT'], 'Sex': ['M', 'M', 'M', 'M', 'M', 'F', 'F', 'F']}


In [35]:
import json

# Write the dictionary to a JSON file
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

In [36]:

# Read JSON as a dataframe with Pandas:
json_df = pd.read_json('data.json')
json_df.head()

Unnamed: 0,Sub_ID,Name,Salary,StartDate,Department,Sex
0,1,Erik,723.3,1/1/2011,IT,M
1,2,Daniel,515.2,7/23/2013,Manegement,M
2,3,Michael,621.0,12/15/2011,IT,M
3,4,Sven,731.0,6/11/2013,HR,M
4,5,Gary,844.15,3/27/2011,Finance,M


In [37]:
(json_df['Name'] + ' ' + json_df['Department']).str.cat(sep='. ')

'Erik IT. Daniel Manegement. Michael IT. Sven HR. Gary Finance. Carol IT. Lisa Manegement. Elisabeth IT'

## PDF and Word Documents

PDFs and Word documents carry layout information that makes them more readable to people than a plain .txt file. Word documents are more regularly structured than PDFs, which are designed purely for viewing.

To extract the text, pdfminer can be used for PDF documents and python-docx for Word documents.
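A minimal sketch of both extractions is shown below. It assumes the `pdfminer.six` and `python-docx` packages are installed, and the file names in the comments are placeholders. The imports sit inside the functions so that each library is only needed when its function is actually called.

```python
def pdf_to_text(path):
    """Extract the text layer of a PDF with pdfminer.six (pip install pdfminer.six)."""
    from pdfminer.high_level import extract_text
    return extract_text(path)

def docx_to_text(path):
    """Join the paragraph texts of a .docx file with python-docx (pip install python-docx)."""
    import docx
    return "\n".join(p.text for p in docx.Document(path).paragraphs)

def flatten(text):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join(text.split())

# Placeholder file names; replace them with real paths.
# flat_pdf = flatten(pdf_to_text("report.pdf"))
# flat_doc = flatten(docx_to_text("report.docx"))

print(flatten("A scanned\n  page   of\ttext"))
# A scanned page of text
```

Once the text is a flat string, it can go through the same cleaning and tokenizing steps as the .txt example in Part 1.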

# Part 3 : Discussions - Questions

### Discuss the following:

1. Why are text data sources more diverse than tabular or image data?

2. Are NLP Models supervised or unsupervised?

3. Do NLP Models need to have perfectly formatted data? 
	- A. Yes, the data needs to be cleansed and put into a perfect tabular structure
	- B. No, because NLP recognizes this problem, and instead utilizes unsupervised learning to create its own labels for training
	- C. Yes, this challenge makes it impossible to use PDFs or word documents, it must be in a .txt format to start out. 
	- D. No, because the hyperparameters autotune the words into tokens  

4. Why are tokens used instead of the words directly? 
	- A. Saves computing resources
	- B. It gives a word/subword a unique identifier that the models can use in training
	- C. The models need to multiply, add and subtract the tokens with each other directly
	- D. So that the information in the words is hidden from the NLP model and it can train by making strong guesses

5. What was the pipeline of this exercise?
6. Please give a summary of the data cleaning and preprocessing steps.
7. What is the difference between a token and a sentence?
8. Why did we convert the tokens to numbers?
9. Why did we add the special tokens?
10. What advantages are offered by Pandas for text manipulation?
11. Would this approach be suitable for complex datasets?

### Exercise:

* Repeat this pipeline with 3 different books that appear very different in nature (don't add the special tokens).
* When you obtain the numericalized sentences, convert them into a long 1D Numpy array.
* Plot the distribution of the numericalized tokens for each book using histograms.
* Comment on your experience during the next lesson.