# Additional NLP Concepts

## Tokenization

In [3]:
text = '''
The United States of America (U.S.A. or USA), is located in North America.
It consists of 50 states, five major unincorporated territories, 326 Indian reservations, 
a federal district, and some minor possessions.[g] 
At 3.8 million square miles (9.8 million square kilometers), 
it is the world's third- or fourth-largest country by total area.[c] 
With a population of more than 331 million people, 
it is the third most populous country in the world. 
The national capital is Washington, D.C., and the most populous city is New York City.
'''.replace('\n', '').strip()

print(text)

The United States of America (U.S.A. or USA), is located in North America.It consists of 50 states, five major unincorporated territories, 326 Indian reservations, a federal district, and some minor possessions.[g] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[c] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York City.


## Split on spaces

In [4]:
for token in text.split(' ')[:20]:
    print(f'"{token}"', end=', ')

"The", "United", "States", "of", "America", "(U.S.A.", "or", "USA),", "is", "located", "in", "North", "America.It", "consists", "of", "50", "states,", "five", "major", "unincorporated", 
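Splitting on a literal space treats every single space as a boundary and ignores other whitespace. The short sketch below, using a made-up `sample` string rather than the text above, contrasts `split(' ')` with the argument-free `split()`, which splits on any run of whitespace.

```python
# Hypothetical sample with a double space, a newline, and a tab.
sample = "North  America.\nIt consists\tof 50 states"

# Splitting on a single space keeps empty strings and leaves \n and \t inside tokens.
by_space = sample.split(' ')
print(by_space)

# Splitting with no argument collapses any run of whitespace.
by_whitespace = sample.split()
print(by_whitespace)
```

Neither variant separates punctuation from words, which is why tokens such as `"America."` still carry the trailing period.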

## Split on non-alphanumeric characters

`\W` matches any character that is not a word character. 
With the `re.ASCII` flag it is equivalent to `[^a-zA-Z0-9_]`; without the flag, Unicode letters and digits also count as word characters.

In [7]:
import re 

for token in re.split(r'\W+', text)[:20]:
    print(f'"{token}"', end=', ')

"The", "United", "States", "of", "America", "U", "S", "A", "or", "USA", "is", "located", "in", "North", "America", "It", "consists", "of", "50", "states", 
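Two side effects of `re.split(r'\W+', ...)` are worth seeing on a small input: the `re.ASCII` flag changes which characters count as word characters, and a string that ends in punctuation yields a trailing empty token. The snippet below is a minimal sketch on a made-up `sample` string, not on the text above.

```python
import re

# Hypothetical sample containing non-ASCII letters and trailing punctuation.
sample = "na\u00efve caf\u00e9!"

# Default (Unicode) matching: accented letters are word characters.
unicode_tokens = re.split(r'\W+', sample)
print(unicode_tokens)

# ASCII matching: accented letters are treated as separators.
ascii_tokens = re.split(r'\W+', sample, flags=re.ASCII)
print(ascii_tokens)

# Drop the empty strings produced by separators at the edges of the string.
clean_tokens = [t for t in unicode_tokens if t]
print(clean_tokens)
```

Filtering out empty strings, as in the last step, is a common cleanup after splitting on a regular expression.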

## Language-aware splitting

In [9]:
import spacy

nlp = spacy.blank("en")

for token in nlp(text)[:20]:
    print(f'"{token}"', end=', ')

"The", "United", "States", "of", "America", "(", "U.S.A.", "or", "USA", ")", ",", "is", "located", "in", "North", "America", ".", "It", "consists", "of", 

In [12]:
import spacy

nlp = spacy.blank("en")

for token in nlp("Let's go to N.Y.!"):
    print(f'"{token}"', end=', ')

"Let", "'s", "go", "to", "N.Y.", "!", 

In [15]:
import spacy

nlp = spacy.blank("en")

for token in nlp("I'm gonna visit New York City at 6:00 A.M. :-)"):
    print(f'"{token}"', end=', ')

"I", "'m", "gon", "na", "visit", "New", "York", "City", "at", "6:00", "A.M.", ":-)", 
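To give a rough intuition for why the outputs above keep `N.Y.`, `A.M.`, `6:00`, and `:-)` whole while still splitting off `'s` and `!`, here is a simplified rule-based sketch built on `re`. It is not spaCy's actual algorithm; the `EXCEPTIONS` list, the pattern, and the `simple_tokenize` name are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical exception list: strings that must never be split.
EXCEPTIONS = ["U.S.A.", "N.Y.", "A.M.", "D.C.", ":-)"]

# Alternatives are tried left to right, so exceptions win over the generic rules.
TOKEN_RE = re.compile(
    "|".join(re.escape(e) for e in EXCEPTIONS)
    + r"|\d+:\d+"    # times such as 6:00
    + r"|'\w+"       # clitics such as 's and 'm
    + r"|\w+"        # plain words and numbers
    + r"|[^\w\s]"    # any other single non-space character
)

def simple_tokenize(text):
    """Return the non-overlapping matches of TOKEN_RE, scanning left to right."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Let's go to N.Y.!"))
print(simple_tokenize("I'm gonna visit New York City at 6:00 A.M. :-)"))
```

Unlike the spaCy output above, this sketch leaves `gonna` as one token, because it has no rule for splitting informal contractions. A real language-aware tokenizer carries many more such language-specific rules.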

In [27]:
nlp = spacy.blank("en")

doc = nlp("I'm gonna visit New York City at 6:00 A.M. :-)")
    
with doc.retokenize() as retokenizer:
    # Slide a 3-token window over the doc; len(doc) - 2 start positions include the final window.
    for i in range(len(doc) - 2):
        if doc[i:i+3].text == 'New York City':
            retokenizer.merge(doc[i:i+3], attrs={"LEMMA": "new york city"})
            
print("After:", [token.text for token in doc])

After: ['I', "'m", 'gon', 'na', 'visit', 'New York City', 'at', '6:00', 'A.M.', ':-)']


In [16]:
tokens = nlp("I'm gonna visit New York City at 6:00 A.M. :-)")

In [26]:
tokens[1:1+3]

"'m gonna"

In [28]:
import spacy

nlp = spacy.blank("en")

for token in nlp("I'm gonna visit New York City at 6:00 A.M. :-)"):
    print(f'"{token.lemma_}"', end=', ')

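"", "", "", "", "", "", "", "", "", "", "", "", 

Every lemma is empty here because `spacy.blank("en")` creates a pipeline that contains only the tokenizer, so no lemmatizer component is available to fill in `lemma_`. The lemma assigned during the merge in the earlier cell belonged to a different `doc` object and does not carry over.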