# NLP. Lab 1. Tokenization.


## What is tokenization?


Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller units called tokens, typically words or sentences. Splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization. A simple word tokenizer splits on whitespace, while a simple sentence tokenizer splits on characters such as periods, exclamation points, and newlines. Choose the method that suits the task at hand. Depending on the tokenizer, characters such as spaces and punctuation may be dropped, in which case they do not appear in the final list of tokens.

![NLP_Tokenization](https://raw.githubusercontent.com/satishgunjal/images/master/NLP_Tokenization.png)


### Purpose


A sentence gets its meaning from the words it contains, so analyzing those words is a first step toward interpreting the text. Once we have a list of words, we can also apply statistical tools and methods to gain more insight into the text. For example, we can use word counts and word frequencies to estimate how important a word is in a sentence or document.
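
As a minimal sketch of the word-count idea, the snippet below counts word frequencies with `collections.Counter` from the standard library. The sample sentence is made up for illustration, and lowercasing plus whitespace splitting is only one simple choice of tokenizer.

```python
from collections import Counter

# Made-up sample text, used only for illustration.
sample = "the cat sat on the mat and the dog sat too"

# Lowercase and split on whitespace to get word tokens.
tokens = sample.lower().split()

# Count how often each token occurs.
word_freq = Counter(tokens)

print(len(tokens))               # total number of tokens: 11
print(word_freq.most_common(2))  # [('the', 3), ('sat', 2)]
```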


## Tokenization in Python


In [1]:
text = "Tokenization is one of the first step in any NLP pipeline. Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens. If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'. Generally 'space' is used to perform the word tokenization and characters like 'periods, exclamation point and newline char are used for Sentence Tokenization.  We have to choose the appropriate method as per the task in hand. While performing the tokenization few characters like spaces, punctuations are ignored and will not be the part of final list of tokens."

### Built-in methods


We can use the built-in **split()** string method to split a string into a list. Called without arguments, it splits on whitespace, so each word becomes a list item.


#### Word tokenization


In [2]:
tokens = text.split()
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']
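
Note that `split()` only separates on whitespace, so punctuation stays attached to the neighbouring word. The sketch below shows this on a made-up sample and strips leading and trailing punctuation with `string.punctuation`. This is one simple cleanup, not the only option.

```python
import string

# Made-up sample text, used only for illustration.
sample = "Hello, world! Tokenization is fun."

# Whitespace splitting keeps punctuation attached to words.
raw_tokens = sample.split()
print(raw_tokens)    # ['Hello,', 'world!', 'Tokenization', 'is', 'fun.']

# Strip punctuation from both ends of each token.
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]
print(clean_tokens)  # ['Hello', 'world', 'Tokenization', 'is', 'fun']
```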


#### Sentence tokenization


In [3]:
tokens = text.split(".")
print(tokens[:3])

['Tokenization is one of the first step in any NLP pipeline', ' Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens', " If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'"]
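
As the output shows, splitting on `"."` removes the period and leaves a leading space on every sentence after the first. It also produces an empty string after the final period. A minimal cleanup sketch on a made-up sample:

```python
# Made-up sample text, used only for illustration.
sample = "First sentence. Second one.  Third here."

# Raw split: leading spaces remain and the last item is empty.
print(sample.split("."))  # ['First sentence', ' Second one', '  Third here', '']

# Strip whitespace and drop empty pieces.
sentences = [part.strip() for part in sample.split(".") if part.strip()]
print(sentences)          # ['First sentence', 'Second one', 'Third here']
```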


### RegEx tokenization

- Using regular expressions (RegEx), we can match character combinations in a string and perform word or sentence tokenization.
- You can test your regular expressions at [regex101](https://regex101.com/).


#### Word tokenization


In [4]:
import re

# Raw string avoids the invalid-escape warning for "\w"; \w+ matches runs of word characters.
tokens = re.findall(r"\w+", text)
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']
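
Regular expressions can also be used for sentence tokenization. The sketch below splits on whitespace that follows a period, exclamation point, or question mark, which keeps the punctuation with its sentence. The pattern and the sample are illustrative assumptions. As the output shows, a rule this simple also splits after initials such as "W.", which is one reason to use a dedicated library.

```python
import re

# Made-up sample text, used only for illustration.
sample = "Is this a test? Yes! It works. George W. Bush ran a marathon."

# Split on whitespace that is preceded by ., ! or ? (lookbehind keeps the punctuation).
sentences = re.split(r"(?<=[.!?])\s+", sample)
print(sentences)
# ['Is this a test?', 'Yes!', 'It works.', 'George W.', 'Bush ran a marathon.']
```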


### NLTK library


#### Word tokenization


In [5]:
!pip install nltk

In [7]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

tokens = word_tokenize(text)
print(tokens[:5])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Tokenization', 'is', 'one', 'of', 'the']


#### Sentence tokenization


In [8]:
from nltk.tokenize import sent_tokenize

tokens = sent_tokenize(text)
print(tokens[:3])

['Tokenization is one of the first step in any NLP pipeline.', 'Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.', "If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'."]


### spaCy library


#### Word tokenization


In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [10]:
from spacy.lang.en import English

english_tokenizer = English()

doc = english_tokenizer(text)
tokens = [token.text for token in doc]
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']


#### Sentence tokenization


In [11]:
english_tokenizer = English()
english_tokenizer.add_pipe("sentencizer")


doc = english_tokenizer(text)
tokens = list(doc.sents)  # each item is a spaCy Span covering one sentence
print(tokens[:3])

[Tokenization is one of the first step in any NLP pipeline., Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens., If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'.]


## Task


Your goal is to solve a tokenization task and count the number of numeric tokens.

Submit your solution to the [competition](https://www.kaggle.com/t/50b3669520ce4a0e892900406bbc1f2f).


### Grade distribution

- your solution is ranked at or above the benchmark solution on the leaderboard - 1 point
- your solution is ranked below the benchmark solution - 0.5 points
- no submission / late submission / no appearance on leaderboard - 0 points


In [15]:
import spacy

In [26]:
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and was followed by Donald Trump in 2017.
As a young man, George H.W. Bush served in World War II as a fighter pilot. In 1944, he was shot down and had to parachute to safety.
Before he was president, George W. Bush was a cheerleader, a fraternity brother, an oilman, an owner of a professional baseball team, and a governor. After leaving office in 2009, Bush learned to paint.
Here's something else you probably didn't know about John Adams: He died on the Fourth of July. And he wasn't the only commander in chief to do so. In fact, three of the nation's five founding fathers—Adams, Thomas Jefferson, and James Monroe—died on Independence Day. Adams and Jefferson even passed on the same exact day: July 4, 1826, which happened to be the 50th anniversary of the adoption of the Declaration of Independence.
At 6 feet 4 inches tall, Abraham Lincoln and Lyndon B. Johnson were America's tallest presidents. But what about America's shortest president? That distinction goes to founding father James Madison (1809-1817), who, at 5 feet 4 inches tall, was a full foot shorter than his tallest peers.
That changed, however, in October 1860, when Lincoln received a letter from an 11-year-old girl named Grace Bedell. 'If you will let your whiskers grow I will try and get [my brothers] to vote for you,' Bedell wrote to Lincoln. 'You would look a great deal better for your face is so thin. All the ladies like whiskers and they would tease their husbands to vote for you and then you would be president'.
Richard Nixon was hardly the first president who liked to unwind by rolling a few strikes. Harry S. Truman also enjoyed bowling, and opened the first White House bowling alley in 1947.
If you had to bet on which U.S. president was the biggest movie fan, you'd probably put your money on America's actor-turned-president, Ronald Reagan (1981-1989). And that would be a great guess. Reagan reportedly watched 363 movies during his two terms in office.
Thomas Jefferson offered to sell his personal library when the Library of Congress was burned by the British during the War of 1812. He sold them 6487 books from his own collection, the largest in America at the time.
Born in New York in 1782, Martin Van Buren was the first president to have been born after the American Revolution, technically making him the first American-born president.
Benjamin Harrison had a tight-knit family and loved to amuse and dote on his grandchildren. He put up the first recorded White House Christmas tree in 1889, and was known to put on the Santa suit for entertainment.
A 16-year-old Bill Clinton managed to shake hands with President John F. Kennedy at a Boys Nation event in 1963. This would take place just four months before Kennedy's assassination.
In 1993—two years before he became the governor of Texas—George W. Bush ran the Houston marathon, finishing with a time of 3:44:52. He is the only president to have ever run a marathon."""

nlp = spacy.load("en_core_web_sm")

# Process the text line by line; each line gets its own count.
lines = text.split('\n')

counts = []

for line in lines:
    doc = nlp(line)
    # A token counts as numeric if it contains at least one digit (e.g. "44th", "1826").
    count = sum(1 for token in doc if any(char.isdigit() for char in token.text))
    counts.append(count)
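
The count depends on how the tokenizer splits the text. The sketch below applies the same "contains at least one digit" rule to two standard-library tokenizers on a made-up sentence. The helper `count_numeric` is a hypothetical name introduced here for illustration. spaCy may split the sentence differently again, so treat these numbers as a demonstration of the effect rather than as expected results for the task.

```python
import re


def count_numeric(tokens):
    """Count tokens that contain at least one digit."""
    return sum(1 for token in tokens if any(char.isdigit() for char in token))


# Made-up sample text, used only for illustration.
sample = "James Madison (1809-1817) was 5 feet 4 inches tall."

whitespace_tokens = sample.split()         # "(1809-1817)" stays one token
regex_tokens = re.findall(r"\w+", sample)  # "1809" and "1817" become separate tokens

print(count_numeric(whitespace_tokens))  # 3
print(count_numeric(regex_tokens))       # 4
```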


In [27]:
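# Write one row per input line: its index and its numeric-token count.
with open("submission.csv", "w") as f:
    f.write("id,count\n")
    for row_id, count in enumerate(counts):  # row_id avoids shadowing the built-in id()
        f.write(f"{row_id},{count}\n")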
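
As an alternative sketch, the same file layout can be produced with the standard `csv` module. The counts below are hypothetical placeholders, and an in-memory buffer stands in for the real file so the result can be inspected before submitting.

```python
import csv
import io

# Hypothetical per-line counts, standing in for the real `counts` list.
example_counts = [3, 2, 1]

# In-memory stand-in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "count"])             # header row
writer.writerows(enumerate(example_counts))  # rows of (index, count)

print(buffer.getvalue())
```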