# TextBlob

TextBlob is a Python (2 and 3) library for processing textual data.  
It provides a simple API for diving into common Natural Language Processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.


## Import TextBlob

In [1]:
from textblob import TextBlob

TextBlob is a Python library that offers a simple API to access its methods and perform basic NLP tasks.

<hr>

A convenient property of TextBlob objects is that they behave just like Python strings, so you can slice, transform, and concatenate them the same way you would an ordinary string.

In [2]:
string1 = TextBlob("Analytics")

string1[1:5]  # Characters at indices 1 through 4 (the end index is exclusive)

TextBlob("naly")

In [3]:
string1.upper() # Convert text to upper case

TextBlob("ANALYTICS")

In [4]:
string2 = TextBlob("Vidhya")

# concat
string1 + " " + string2

TextBlob("Analytics Vidhya")

## Tokenization

Tokenization refers to dividing text or a sentence into a sequence of tokens, which roughly correspond to “words”.  
This is one of the basic tasks of NLP.  

To do this using TextBlob, follow these two steps:
1. Create a **TextBlob** object and pass a string to it.
2. Access a **property** or call a **method** of the TextBlob object to perform a specific task.

In [5]:
blob = TextBlob("Analytics Vidhya is a great platform to learn data science. \n It helps community through blogs, hackathons, discussions,etc.")

In [6]:
# Tokenizing Sentences
blob.sentences

[Sentence("Analytics Vidhya is a great platform to learn data science."),
 Sentence("It helps community through blogs, hackathons, discussions,etc.")]

In [7]:
# Extracting only the first sentence
blob.sentences[0]

Sentence("Analytics Vidhya is a great platform to learn data science.")

In [8]:
# Printing words of first sentence
for words in blob.sentences[0].words:
    print(words)

Analytics
Vidhya
is
a
great
platform
to
learn
data
science


In [9]:
# Printing all words
blob.words

WordList(['Analytics', 'Vidhya', 'is', 'a', 'great', 'platform', 'to', 'learn', 'data', 'science', 'It', 'helps', 'community', 'through', 'blogs', 'hackathons', 'discussions', 'etc'])
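
To make the idea of tokenization concrete, here is a minimal sketch that reproduces the two results above with regular expressions only. It is not how TextBlob tokenizes internally (TextBlob relies on NLTK tokenizers); the helper names `split_sentences` and `split_words` are made up for this illustration, and the rules are deliberately naive.

```python
import re

text = ("Analytics Vidhya is a great platform to learn data science. \n "
        "It helps community through blogs, hackathons, discussions,etc.")

def split_sentences(raw):
    """Split after '.', '!' or '?' whenever whitespace follows (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]

def split_words(raw):
    """Return runs of letters, digits and apostrophes, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", raw)

sentences = split_sentences(text)
words = split_words(text)

print(sentences)
print(words)
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer for real text.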

## Noun Phrase Extraction

In the previous section we extracted individual words; we can also extract the noun phrases from a TextBlob directly.  
Noun phrase extraction is particularly important when you want to analyze the “who” in a sentence.

In [10]:
blob = TextBlob("Analytics Vidhya is a great platform to learn data science.")

for noun_phrase in blob.noun_phrases:
    print(noun_phrase)

analytics vidhya
great platform
data science


## Part-of-speech Tagging

Part-of-speech tagging, or grammatical tagging, marks each word in a text according to its definition and context.  
In simple words, it tells whether a word is a noun, an adjective, a verb, etc.  
It can be seen as a more general version of noun phrase extraction, in which we find all the parts of speech in a sentence rather than only the noun phrases.

In [11]:
for words, tag in blob.tags:
    print(words, tag)

Analytics NNS
Vidhya NNP
is VBZ
a DT
great JJ
platform NN
to TO
learn VB
data NNS
science NN
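
One way to see the link between tags and noun phrases is to group consecutive adjective and noun tags into chunks. The sketch below does this in plain Python on the (word, tag) pairs printed above. It is an illustrative simplification, not the extractor TextBlob uses, and the names `is_np_tag` and `chunk_noun_phrases` are made up for this example.

```python
from itertools import groupby

# (word, tag) pairs copied from the output above.
tagged = [
    ("Analytics", "NNS"), ("Vidhya", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("great", "JJ"), ("platform", "NN"), ("to", "TO"), ("learn", "VB"),
    ("data", "NNS"), ("science", "NN"),
]

def is_np_tag(tag):
    """Adjectives (JJ*) and nouns (NN*) may be part of a noun phrase."""
    return tag.startswith(("JJ", "NN"))

def chunk_noun_phrases(pairs):
    """Join maximal runs of adjective/noun tokens that contain a noun."""
    phrases = []
    for keep, run in groupby(pairs, key=lambda pair: is_np_tag(pair[1])):
        run = list(run)
        if keep and any(tag.startswith("NN") for _, tag in run):
            phrases.append(" ".join(word.lower() for word, _ in run))
    return phrases

print(chunk_noun_phrases(tagged))
```

On this sentence the chunks coincide with the noun phrases reported in the previous section, although that will not hold for every input.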


## Words Inflection and Lemmatization

*Inflection* is a process of word formation in which characters are added to the base form of a word to express grammatical meanings.  
Word inflection in TextBlob is very simple: the words we tokenized from a TextBlob can easily be changed into their singular or plural form.

In [12]:
blob = TextBlob("Analytics Vidhya is a great platform to learn data science. \n It helps community through blogs, hackathons, discussions,etc.")

print(blob.sentences[1].words[1])
print(blob.sentences[1].words[1].singularize())

helps
help


The TextBlob library also offers a built-in class known as *Word*.  
We just need to create a Word object and then apply a function directly to it.

In [13]:
from textblob import Word
w = Word("Platform")
w.pluralize()

'Platforms'

We can also use the tags to inflect only words of a particular type, for example singular nouns (tag 'NN').

In [14]:
## Using Tags
for word, pos in blob.tags:
    if pos == 'NN':
        print(word.pluralize())

platforms
sciences
communities


Words can be lemmatized using the *lemmatize* function.

In [15]:
## Lemmatization
w = Word("running")

w.lemmatize("v") #Here 'v' represents verb

'run'
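
To make the inflection step less of a black box, here is a minimal sketch of rule-based pluralization. It covers only three regular spelling rules and is far simpler than the rules TextBlob inherits from Pattern; `naive_pluralize` is a hypothetical helper written for this illustration.

```python
def naive_pluralize(word):
    """Pluralize a regular English noun using three common spelling rules."""
    lower = word.lower()
    vowels = "aeiou"
    # consonant + y  ->  ies   (community -> communities)
    if lower.endswith("y") and len(lower) > 1 and lower[-2] not in vowels:
        return word[:-1] + "ies"
    # sibilant endings  ->  es   (box -> boxes)
    if lower.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    # default rule  ->  s   (platform -> platforms)
    return word + "s"

for noun in ["platform", "science", "community", "industry", "box"]:
    print(noun, "->", naive_pluralize(noun))
```

Irregular nouns such as "child" or "mouse" are not handled here, which is exactly the kind of case a library-maintained rule set takes care of.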

## N-grams

A sequence of N consecutive words is called an N-gram.  
N-grams (N > 1) are generally more informative than single words and can be used as features for language modelling.  
N-grams can be easily accessed in TextBlob using the **ngrams** function, which returns a list of WordList objects, each containing n successive words.

In [16]:
for ngram in blob.ngrams(2):
    print(ngram)

['Analytics', 'Vidhya']
['Vidhya', 'is']
['is', 'a']
['a', 'great']
['great', 'platform']
['platform', 'to']
['to', 'learn']
['learn', 'data']
['data', 'science']
['science', 'It']
['It', 'helps']
['helps', 'community']
['community', 'through']
['through', 'blogs']
['blogs', 'hackathons']
['hackathons', 'discussions']
['discussions', 'etc']
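
The same sliding-window idea can be written in a few lines of plain Python. This is only a sketch of the technique; the function below is a stand-in written for this illustration, not TextBlob's own code.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = ["Analytics", "Vidhya", "is", "a", "great", "platform"]

bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

for gram in bigrams:
    print(gram)
```

A list of k words yields k - n + 1 n-grams, and an empty list when k is smaller than n.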


## Sentiment Analysis

Sentiment analysis is the process of determining the attitude or emotion of the writer, i.e., whether it is positive, negative, or neutral.

The **sentiment** property of a TextBlob returns a named tuple with two fields, **polarity** and **subjectivity**.

Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.  
Subjective sentences generally refer to personal opinion, emotion, or judgment, whereas objective sentences refer to factual information.  
Subjectivity is also a float, in the range [0, 1], where 0 is very objective and 1 is very subjective.

In [17]:
print(blob)
blob.sentiment

Analytics Vidhya is a great platform to learn data science. 
 It helps community through blogs, hackathons, discussions,etc.


Sentiment(polarity=0.8, subjectivity=0.75)

We can see that the polarity is **0.8**, which means that the statement is positive, and the subjectivity of **0.75** indicates that it is mostly a personal opinion rather than factual information.
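
To give a feel for where such numbers can come from, here is a minimal sketch of a lexicon-based scorer: each known opinion word carries a polarity and a subjectivity, and the text score is their average. The tiny `LEXICON` and its values are hypothetical, and the sketch ignores negation and intensifiers, which a real analyzer has to handle.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
LEXICON = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
}

def simple_sentiment(text):
    """Average the scores of all lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[token] for token in tokens if token in LEXICON]
    if not hits:
        return 0.0, 0.0  # no opinion words: neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

print(simple_sentiment("Analytics Vidhya is a great platform to learn data science."))
print(simple_sentiment("It helps community through blogs and hackathons."))
```

With this approach a single strongly scored word, such as "great" in the sentence above, can determine the score of the whole text.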

## Spelling Correction

Spelling correction is a handy feature that TextBlob offers; it can be accessed using the **correct** function.

In [18]:
blob = TextBlob('Analytics Vidhya is a gret platfrm to learn data scence')

blob.correct()

TextBlob("Analytics Vidhya is a great platform to learn data science")

We can also check the list of suggested words and their confidence scores using the **spellcheck** function.

In [19]:
blob.words[4].spellcheck()

[('great', 0.5351351351351351),
 ('get', 0.3162162162162162),
 ('grew', 0.11216216216216217),
 ('grey', 0.026351351351351353),
 ('greet', 0.006081081081081081),
 ('fret', 0.002702702702702703),
 ('grit', 0.0006756756756756757),
 ('cret', 0.0006756756756756757)]
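
The confidences above behave like relative word frequencies among the candidate corrections. The sketch below illustrates that idea in the style of Peter Norvig's well-known spelling corrector: generate every string one edit away from the input, keep the known words, and rank them by frequency. The `WORD_COUNTS` table is hypothetical and tiny, so the numbers will not match the output above.

```python
import string

# Hypothetical word counts standing in for a real frequency dictionary.
WORD_COUNTS = {"great": 792, "get": 468, "grew": 166, "grey": 39,
               "platform": 25, "science": 60}

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def spellcheck(word):
    """Return (candidate, confidence) pairs, most likely first."""
    if word in WORD_COUNTS:
        return [(word, 1.0)]
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return [(word, 0.0)]  # nothing known nearby: keep the word as is
    total = sum(WORD_COUNTS[w] for w in candidates)
    ranked = sorted(candidates, key=WORD_COUNTS.get, reverse=True)
    return [(w, WORD_COUNTS[w] / total) for w in ranked]

def correct(word):
    """Pick the most frequent candidate."""
    return spellcheck(word)[0][0]

print(spellcheck("gret"))
print([correct(w) for w in ["gret", "platfrm", "scence"]])
```

Because the ranking depends only on frequency, a common word can win over a better-fitting rare one, so the suggestions are worth reviewing before they are applied blindly.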

## Creating a short summary of a text

This is a simple trick that combines the techniques we learned above.

In [20]:
import random

blob = TextBlob('Analytics Vidhya is a thriving community for data driven industry. This platform allows \
people to know more about analytics from its articles, Q&A forum, and learning paths. Also, we help \
professionals & amateurs to sharpen their skillsets by providing a platform to participate in Hackathons.')

In [21]:
nouns = list()

for word, tag in blob.tags:
    if tag == 'NN':
        nouns.append(word.lemmatize())      
       
print('This text is about...')

for item in random.sample(nouns, 5):
    word = Word(item)
    print(word.pluralize())

This text is about...
forums
communities
industries
platforms
platforms


What we did above was extract a list of nouns from the text to give the reader a general idea of what the text is about.
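
Because `random.sample` draws from a list that may contain the same noun several times, the output above shows "platforms" twice and changes on every run. A possible variation, sketched below under the assumption that the most frequent nouns are the most representative, is to count the nouns and keep the top few. The `nouns` list is a hypothetical stand-in for what the loop above collects.

```python
from collections import Counter

# Hypothetical list of lemmatized nouns, as the loop above might collect them.
nouns = ["community", "industry", "platform", "forum", "platform"]

def top_nouns(noun_list, k=3):
    """Return the k most frequent distinct nouns, most frequent first."""
    return [noun for noun, _ in Counter(noun_list).most_common(k)]

summary = top_nouns(nouns)
print("This text is about...")
for noun in summary:
    print(noun)
```

This version is deterministic and never repeats a noun, at the cost of always favouring the same words.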

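## Translation and Language Detection

TextBlob can detect the language of a text and translate it by calling the Google Translate web service, so an internet connection is required. Note that **detect_language** and **translate** were deprecated in TextBlob 0.16.0 and are no longer available in recent releases, so the cells below require an older version of the library.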

In [22]:
blob = TextBlob('هذا بارد') # Arabic Text

In [23]:
blob.detect_language()

'ar'

So, it is Arabic. Now, let’s translate it into English using TextBlob so that we can find out what is written.

In [24]:
# Translate to English
blob.translate(from_lang='ar', to='en')

TextBlob("that's cool")

Even if you don’t explicitly define the source language, TextBlob will automatically detect the language and translate the text into the desired language.

In [25]:
blob.translate(to='en')

TextBlob("that's cool")

## Text Classification

Let’s build a simple text classification model using TextBlob. For this, we first need to prepare training and testing data.

In [26]:
training = [
('Tom Holland is a terrible spiderman.','pos'),
('a terrible Javert (Russell Crowe) ruined Les Miserables for me...','pos'),
('The Dark Knight Rises is the greatest superhero movie ever!','neg'),
('Fantastic Four should have never been made.','pos'),
('Wes Anderson is my favorite director!','neg'),
('Captain America 2 is pretty awesome.','neg'),
('Let\'s pretend "Batman and Robin" never happened..','pos'),
]

testing = [
('Superman was never an interesting character.','pos'),
('Fantastic Mr Fox is an awesome film!','neg'),
('Dragonball Evolution is simply terrible!!','pos')
]

TextBlob provides a built-in **classifiers** module for creating a custom classifier. So, let’s quickly import it and create a basic classifier.

In [27]:
from textblob import classifiers

# Naive Bayes Classifier
nb_classifier = classifiers.NaiveBayesClassifier(training)

As you can see above, we have passed the training data into the classifier.

Note that we have used a Naive Bayes classifier here, but TextBlob also offers a decision tree classifier, as shown below.

In [28]:
# Decision Tree Classifier

dt_classifier = classifiers.DecisionTreeClassifier(training)

Now, let’s check the accuracy of the Naive Bayes classifier on the testing dataset. TextBlob also lets us inspect the most informative features.

In [29]:
print(nb_classifier.accuracy(testing))
nb_classifier.show_informative_features(3)

1.0
Most Informative Features
            contains(is) = True              neg : pos    =      2.9 : 1.0
         contains(never) = False             neg : pos    =      1.8 : 1.0
             contains(a) = False             neg : pos    =      1.8 : 1.0


As we can see, a text that contains “is” is about 2.9 times more likely to be labelled “neg” than “pos” in this training set.
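
Feature names such as `contains(is)` in the output come from turning each text into a set of yes/no features, one per word seen in the training data. The sketch below illustrates that kind of feature extraction in plain Python. It uses simple whitespace splitting and a two-sentence vocabulary, so it is an approximation for illustration only; `contains_features` is a made-up name.

```python
def contains_features(document, vocabulary):
    """Map every vocabulary word to True/False depending on its presence."""
    tokens = set(document.split())
    return {f"contains({word})": word in tokens for word in sorted(vocabulary)}

# Two of the training sentences, used here only to build a small vocabulary.
train_texts = ["Captain America 2 is pretty awesome.",
               "Fantastic Four should have never been made."]
vocabulary = {token for text in train_texts for token in text.split()}

features = contains_features("The weather is terrible!", vocabulary)
print(features["contains(is)"], features["contains(never)"])
```

The classifier then learns, for each label, how often every feature is True or False, which is what the ratios in the table summarize.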

In order to give a little more idea, let’s check our classifier on a random text.

In [30]:
blob = TextBlob('The weather is terrible!', classifier=nb_classifier)

print(blob.classify())

neg


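The classifier returns “neg” for this sentence. Keep in mind, however, that in the training data above the critical reviews are tagged “pos” and the favourable ones “neg”, and that the most informative feature was the common word “is”. With such a tiny dataset, the prediction illustrates the API and is not evidence that the model has learned sentiment.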

## Pros and Cons:

### Pros:
1. It is built on the shoulders of NLTK and Pattern and provides an intuitive interface to NLTK, which makes it simple for beginners.
2. It provides language translation and detection powered by Google Translate (not provided by spaCy).

### Cons:
1. It is a little slower than spaCy but faster than NLTK (speed: spaCy > TextBlob > NLTK).
2. It does not provide features such as dependency parsing and word vectors, which are provided by spaCy.