Data can be grouped into three main categories based on its structure.

1) Structured Data: This is the most organized form of data. It is represented in tabular formats such as Excel files and Comma-Separated Value (CSV) files.

2) Semi-Structured Data: This type of data is not presented in a tabular structure, but it can be represented in a tabular format after transformation. Here, information is usually stored between tags following a definite pattern. XML and HTML files can be referred to as semi-structured data.

3) Unstructured Data: This type of data is the most difficult to deal with. Machine learning algorithms find it difficult to comprehend unstructured data without some loss of information. Text corpora and images are examples of unstructured data.
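As an illustration of how semi-structured data can be turned into a tabular form, here is a minimal sketch using only the Python standard library. The XML snippet, the helper name xml_to_rows, and the field names are made up for this example and are not part of the original notebook.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: values sit between tags that follow a pattern.
xml_text = """
<books>
  <book><title>Dune</title><year>1965</year></book>
  <book><title>Emma</title><year>1815</year></book>
</books>
"""

def xml_to_rows(text, record_tag, fields):
    """Flatten each <record_tag> element into a dict with one key per field."""
    root = ET.fromstring(text.strip())
    return [
        {field: record.findtext(field, default="") for field in fields}
        for record in root.iter(record_tag)
    ]

rows = xml_to_rows(xml_text, "book", ["title", "year"])

# Write the rows out as CSV text, which is the structured (tabular) form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "year"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each book element becomes one row and each child tag becomes one column, which is the kind of transformation the description of semi-structured data refers to.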

Categorization of Data Based on Content:

•	Text Data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
•	Image Data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
•	Audio Data: This refers to recordings of someone's voice, music, and so on. This type of data can only be heard.
•	Video Data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.



In [1]:
import re

In [2]:
sentence = 'Happy tweeted, "Witnessing 90th Republic Day of Viet Nam from Jack, \
Ho CHi Minh. Mesmerizing performance by Vietnamese Army! Awesome airshow! @Viet_Nam_official \
@Viet_Nam_official #VietNam #90thRepublic_Day. For more photos ping me jack@photoking.com :)"'

Replace every character other than digits, letters, and whitespace (underscores included) with a space. Then use the split() function to split the string into tokens.

In [3]:
re.sub(r'([^\s\w]|_)+', ' ', sentence).split()

['Happy',
 'tweeted',
 'Witnessing',
 '90th',
 'Republic',
 'Day',
 'of',
 'Viet',
 'Nam',
 'from',
 'Jack',
 'Ho',
 'CHi',
 'Minh',
 'Mesmerizing',
 'performance',
 'by',
 'Vietnamese',
 'Army',
 'Awesome',
 'airshow',
 'Viet',
 'Nam',
 'official',
 'Viet',
 'Nam',
 'official',
 'VietNam',
 '90thRepublic',
 'Day',
 'For',
 'more',
 'photos',
 'ping',
 'me',
 'jack',
 'photoking',
 'com']
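To see why the pattern contains the extra "|_" alternative, the sketch below compares it with a pattern that lacks it. The short sample string is taken from the tweet above; the variable names are introduced only for this illustration.

```python
import re

sample = '@Viet_Nam_official #90thRepublic_Day'

# \w matches letters, digits, and the underscore, so [^\s\w] alone keeps "_".
without_underscore_rule = re.sub(r'[^\s\w]+', ' ', sample).split()

# Adding "|_" to the pattern also replaces underscores with a space.
with_underscore_rule = re.sub(r'([^\s\w]|_)+', ' ', sample).split()

print(without_underscore_rule)
print(with_underscore_rule)
```

The first result keeps 'Viet_Nam_official' as a single token, while the second splits it into 'Viet', 'Nam', and 'official', matching the output shown above. The same rule is why the email address is split into 'jack', 'photoking', and 'com'.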

Usually, extracting each token separately does not help. For instance, consider the sentence, "I don't hate you, but your behavior." Here, if we process tokens such as "hate" and "behavior" separately, the true meaning of the sentence is lost. The context in which these tokens appear is essential, so we consider n consecutive tokens at a time. An n-gram is a group of n consecutive tokens.

We will extract n-grams first with a custom function and then with TextBlob, which relies on nltk.

In [4]:
def n_gram_extractor(sentence, n):
    tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
    for i in range(len(tokens)-n+1):
        print(tokens[i:i+n])

In [5]:
# To get the bi-grams, call the function with the text and n=2
n_gram_extractor('The cute little boy is playing with balls.', 2)

['The', 'cute']
['cute', 'little']
['little', 'boy']
['boy', 'is']
['is', 'playing']
['playing', 'with']
['with', 'balls']


In [6]:
# To get the tri-grams, call the function with the text and n=3
n_gram_extractor('The cute little boy is playing with balls.', 3)

['The', 'cute', 'little']
['cute', 'little', 'boy']
['little', 'boy', 'is']
['boy', 'is', 'playing']
['is', 'playing', 'with']
['playing', 'with', 'balls']
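The function above prints each n-gram. When the n-grams are needed for further processing, a variant that returns them can be more convenient. The following is a minimal sketch; the name n_grams is hypothetical, and it reuses the same regular expression for tokenization.

```python
import re

def n_grams(sentence, n):
    """Return the n-grams of a sentence as a list of tuples."""
    tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
    # zip over n shifted views of the token list; it stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams = n_grams("I don't hate you, but your behavior.", 2)
trigrams = n_grams('The cute little boy is playing with balls.', 3)
print(bigrams)
print(trigrams)
```

Note that this tokenization treats the apostrophe as a separator, so "don't" becomes the two tokens 'don' and 't'. The bi-grams still keep neighbouring words such as ('hate', 'you') together, which is the context that single tokens lose.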


In [8]:
import nltk
nltk.download('punkt')
from textblob import TextBlob
blob = TextBlob("The cute little boy is playing with balls.")
blob.ngrams(n=2)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[WordList(['The', 'cute']),
 WordList(['cute', 'little']),
 WordList(['little', 'boy']),
 WordList(['boy', 'is']),
 WordList(['is', 'playing']),
 WordList(['playing', 'with']),
 WordList(['with', 'balls'])]

In [9]:
import nltk
nltk.download('punkt')
from textblob import TextBlob
blob = TextBlob("The cute little boy is playing with balls.")
blob.ngrams(n=3)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[WordList(['The', 'cute', 'little']),
 WordList(['cute', 'little', 'boy']),
 WordList(['little', 'boy', 'is']),
 WordList(['boy', 'is', 'playing']),
 WordList(['is', 'playing', 'with']),
 WordList(['playing', 'with', 'balls'])]

Keras and TextBlob are two of the most popular Python libraries used for NLP tasks. TextBlob provides a simple, easy-to-use interface for them. Keras is used mainly for deep learning-based NLP tasks.