# N-Grams demo

Here we use the nltk library to create n-grams.
Since we are using miniconda, it's important to know how to install nltk into a conda environment first.

From your terminal run:
> conda activate <env name>

> conda install -c anaconda nltk


Once this is done, you will need to install the nltk data.
From the same terminal, with your conda environment still active, run the command below. It downloads every nltk dataset, so it can take a while.

> python -m nltk.downloader all

In [41]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize, sent_tokenize

Python strings have a built-in split() method that could be used instead, but word_tokenize also handles things like punctuation, which it separates into its own tokens.

In [42]:
s = "Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides a way to run applications in a consistent, reliable environment, without the need to worry about the underlying infrastructure."
tokenized_words = word_tokenize(s)
print(tokenized_words)

['Kubernetes', 'is', 'an', 'open-source', 'platform', 'for', 'automating', 'the', 'deployment', ',', 'scaling', ',', 'and', 'management', 'of', 'containerized', 'applications', '.', 'It', 'provides', 'a', 'way', 'to', 'run', 'applications', 'in', 'a', 'consistent', ',', 'reliable', 'environment', ',', 'without', 'the', 'need', 'to', 'worry', 'about', 'the', 'underlying', 'infrastructure', '.']
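For comparison, here is a minimal sketch of what plain whitespace splitting does with punctuation. It uses only Python's built-in `str.split()`; the variable names `text` and `split_tokens` are just illustrative.

```python
# Plain whitespace splitting: punctuation stays attached to the neighbouring word.
text = "It provides a way to run applications in a consistent, reliable environment."

split_tokens = text.split()
print(split_tokens)

# Tokens that still carry trailing punctuation after splitting on whitespace.
with_punctuation = [tok for tok in split_tokens if tok[-1] in ",."]
print(with_punctuation)
```

Here `'consistent,'` and `'environment.'` come back as single tokens, whereas the word_tokenize output above lists `','` and `'.'` as separate tokens.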


The sent_tokenize function tokenizes by sentence instead of by word, which is handy when you want to process the text one sentence at a time.

In [43]:
tokenized_sentences = sent_tokenize(s)
print(tokenized_sentences)

['Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications.', 'It provides a way to run applications in a consistent, reliable environment, without the need to worry about the underlying infrastructure.']


Finally, we can use the ngrams function to get the n-grams we need. It returns an iterator, so we wrap it in list() to display the results. Here we get bigrams.

In [44]:
list(ngrams(tokenized_words, 2))

[('Kubernetes', 'is'),
 ('is', 'an'),
 ('an', 'open-source'),
 ('open-source', 'platform'),
 ('platform', 'for'),
 ('for', 'automating'),
 ('automating', 'the'),
 ('the', 'deployment'),
 ('deployment', ','),
 (',', 'scaling'),
 ('scaling', ','),
 (',', 'and'),
 ('and', 'management'),
 ('management', 'of'),
 ('of', 'containerized'),
 ('containerized', 'applications'),
 ('applications', '.'),
 ('.', 'It'),
 ('It', 'provides'),
 ('provides', 'a'),
 ('a', 'way'),
 ('way', 'to'),
 ('to', 'run'),
 ('run', 'applications'),
 ('applications', 'in'),
 ('in', 'a'),
 ('a', 'consistent'),
 ('consistent', ','),
 (',', 'reliable'),
 ('reliable', 'environment'),
 ('environment', ','),
 (',', 'without'),
 ('without', 'the'),
 ('the', 'need'),
 ('need', 'to'),
 ('to', 'worry'),
 ('worry', 'about'),
 ('about', 'the'),
 ('the', 'underlying'),
 ('underlying', 'infrastructure'),
 ('infrastructure', '.')]
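To make the sliding-window idea behind the output above concrete, here is a rough pure-Python sketch. `simple_ngrams` is a hypothetical helper written for illustration, not nltk's implementation, and it skips extras such as padding.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    tokens = list(tokens)
    # Zip the token list against copies of itself shifted by 1, 2, ..., n-1.
    # zip stops at the shortest copy, so only full-length windows are kept.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['Kubernetes', 'is', 'an', 'open-source', 'platform']

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A list of N tokens yields N - n + 1 n-grams, which is why the 42 tokens above produce 41 bigrams. Because the window slides over the whole token list, pairs such as `('.', 'It')` span the sentence boundary.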

Then we get trigrams.

In [45]:
list(ngrams(tokenized_words, 3))

[('Kubernetes', 'is', 'an'),
 ('is', 'an', 'open-source'),
 ('an', 'open-source', 'platform'),
 ('open-source', 'platform', 'for'),
 ('platform', 'for', 'automating'),
 ('for', 'automating', 'the'),
 ('automating', 'the', 'deployment'),
 ('the', 'deployment', ','),
 ('deployment', ',', 'scaling'),
 (',', 'scaling', ','),
 ('scaling', ',', 'and'),
 (',', 'and', 'management'),
 ('and', 'management', 'of'),
 ('management', 'of', 'containerized'),
 ('of', 'containerized', 'applications'),
 ('containerized', 'applications', '.'),
 ('applications', '.', 'It'),
 ('.', 'It', 'provides'),
 ('It', 'provides', 'a'),
 ('provides', 'a', 'way'),
 ('a', 'way', 'to'),
 ('way', 'to', 'run'),
 ('to', 'run', 'applications'),
 ('run', 'applications', 'in'),
 ('applications', 'in', 'a'),
 ('in', 'a', 'consistent'),
 ('a', 'consistent', ','),
 ('consistent', ',', 'reliable'),
 (',', 'reliable', 'environment'),
 ('reliable', 'environment', ','),
 ('environment', ',', 'without'),
 (',', 'without', 'the'),
 ('