<font color='Red'>
    
### Note: the variables below were picked from multiple notebooks, so refer to the code only.

    1. N-grams are very useful in text classification
    2. Normalization (stemming and lemmatization) helps reduce data dimensionality and clean text
    3. Constituency Grammar: Any sentence can be organized into 3 constituents (<Subject>,<Context>,<Object>)
    4. Dependency Grammar can be represented in the form of a triplet (Governor, Relation, Dependent)
        
        (Governor,Relation,Dependent) ==> <Analyticsvidhya> <is> <the largest community of data scientists>
        
        Use Cases:
        1. Named Entity Recognition
        2. Question Answering System
        3. Coreference Resolution
        4. Text Summarization
        5. Text Classification
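
The triplet form maps naturally onto a small named tuple. The following is only a sketch: the `Triplet` class and its field names are hypothetical and not part of any NLP library.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """Hypothetical container for one (governor, relation, dependent) triplet."""
    governor: str
    relation: str
    dependent: str


# The example from the notes above, stored as a triplet
t = Triplet("Analyticsvidhya", "is", "the largest community of data scientists")

# Fields can be read by name or unpacked like a plain tuple
governor, relation, dependent = t
print(t.governor, "--", t.relation, "-->", t.dependent)
```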

In [11]:
text = """It is raining heavily today and I am not sure if I will be able to travel.
Can we postpone our meeting. Hope it is fine with you :) I am sending the new meeting invite on 
<a href= "www.example.com"> this link </a> """

### 1. Function to get the frequency of words present in the tweets

In [None]:
import pandas as pd

def gen_freq(text):
    #`text` is a pandas string accessor (e.g. dataset.text.str),
    #so text.split() yields one list of words per tweet
    word_list = []

    #Loop over all the tweets and extract words into word_list
    for tw_words in text.split():
        word_list.extend(tw_words)

    #Create word frequencies using word_list
    word_freq = pd.Series(word_list).value_counts()

    #Print top 20 words
    print(word_freq[:20])
    
    return word_freq

gen_freq(dataset.text.str)

Note:
    1. 'dataset' is the name of the dataframe
    2. 'text' is the column in the dataframe
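
For a quick check without a dataframe, the same counting idea can be sketched with `collections.Counter` from the standard library. The `tweets` list below is made-up sample data, not the dataset used in these notes.

```python
from collections import Counter

# Made-up tweets standing in for the dataframe's text column
tweets = ["rain rain go away", "come again another day", "rain again"]

# Collect every whitespace-separated word from every tweet
word_list = []
for tweet in tweets:
    word_list.extend(tweet.split())

# Counter maps each word to the number of times it occurs
word_freq = Counter(word_list)
print(word_freq.most_common(3))
```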

### 2. EDA using Word Clouds

    A word cloud gives a visual summary of text data by showing its top 100-200 words, sized by frequency

In [None]:
#Import libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud

#Generate word cloud
wc = WordCloud(width=400, height=330, max_words=100, background_color='white').generate_from_frequencies(word_freq)

plt.figure(figsize=(12, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Note: 
    word_freq = gen_freq(dataset.text.str)

### 3. Text Cleaning


Text Noise: Unwanted Information in Text Data

- Stopwords, URLs, hashtags, punctuation, numbers
- Slang: words that are not present in dictionaries
- Spelling and grammar errors
- Keyword variations, apostrophe contractions

In [2]:
import re

def clean_text(text):
    #Remove the standalone retweet marker RT (word boundaries keep words such as "SMART" intact)
    text = re.sub(r'\bRT\b', '', text)
    
    #Fix &
    text = re.sub(r'&amp;', '&', text)
    
    #Remove punctuations
    text = re.sub(r'[?!.;:,#@-]', '', text)

    #Convert to lowercase to maintain consistency
    text = text.lower()
    return text
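
A quick check of the cleaning steps on a made-up tweet. The sketch repeats the substitutions in a hypothetical helper, `clean_tweet`, so the block runs on its own. It uses a word-boundary pattern for the retweet marker, so words that merely contain "RT", such as "SMART", are left intact.

```python
import re


def clean_tweet(text):
    """Hypothetical stand-alone copy of the cleaning steps, for illustration."""
    text = re.sub(r'\bRT\b', '', text)       # drop only the standalone retweet marker
    text = re.sub(r'&amp;', '&', text)       # undo the HTML-escaped ampersand
    text = re.sub(r'[?!.;:,#@-]', '', text)  # remove the listed punctuation characters
    return text.lower().strip()


# Made-up sample tweet
sample = "RT @user: SMART prices &amp; wages are UP!!! #economy"
cleaned = clean_tweet(sample)
print(cleaned)
```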

#### 1. Removing Punctuation

In [3]:
import string

punc = string.punctuation

text = "".join(char for char in text if char not in punc)

text

'It is raining heavily today and I am not sure if I will be able to travel\nCan we postpone our meeting Hope it is fine with you  I am sending the new meeting invite on \na href wwwexamplecom this link a '
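
The same result can be obtained in a single pass with `str.translate` and a deletion table built by `str.maketrans`. This is a sketch on a shortened, made-up sentence, not the original cell.

```python
import string

# Shortened sample sentence for illustration
sample = "Can we postpone our meeting. Hope it is fine with you :)"

# The third argument of str.maketrans lists the characters to delete
delete_punctuation = str.maketrans("", "", string.punctuation)
no_punct = sample.translate(delete_punctuation)
print(repr(no_punct))
```

Removing ":)" leaves the space that preceded it, which is why a later whitespace clean-up step is still needed.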

#### 2. Replacing special characters like \n with space

In [4]:
text = text.replace("\n"," ")
text

'It is raining heavily today and I am not sure if I will be able to travel Can we postpone our meeting Hope it is fine with you  I am sending the new meeting invite on  a href wwwexamplecom this link a '

#### 3. Removing leading and trailing spaces

In [10]:
# text_cleaned1 is the lemmatized string built in the Lemmatization cells of section 6 below
text = text_cleaned1.strip()
text

'It be rain heavily today and I be not sure if I will be able to travel Can we postpone our meet Hope it be fine with you I be send the new meet invite on a href wwwexamplecom this link a'
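
Steps 2 and 3 can also be combined: splitting on any whitespace and re-joining with a single space replaces newlines, collapses repeated spaces, and trims both ends in one go. This is a sketch on a made-up string.

```python
# Made-up string with a newline, a double space and trailing spaces
raw = "Hope it is fine with you  I am sending\nthe invite on  "

# str.split() with no argument splits on any run of whitespace
normalized = " ".join(raw.split())
print(repr(normalized))
```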

### 4. Stop Word Removal

In [None]:
from wordcloud import STOPWORDS

text = dataset.text.apply(lambda x: clean_text(x))
word_freq = gen_freq(text.str)*100
word_freq = word_freq.drop(labels=STOPWORDS, errors='ignore')

#Generate word cloud
wc = WordCloud(width=450, height=330, max_words=200, background_color='white').generate_from_frequencies(word_freq)

plt.figure(figsize=(12, 14))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

In [6]:
from nltk.corpus import stopwords
stop = stopwords.words('english')


text_cleaned = ""
for word in text.split():
    if word in stop:
        pass
    else:
        text_cleaned += " "
        text_cleaned += word 

text_cleaned

' It raining heavily today I sure I able travel Can postpone meeting Hope fine I sending new meeting invite href wwwexamplecom link'
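
In the output above, "It", "I" and "Can" survive because the NLTK stop list is lowercase and the membership test is case-sensitive. A case-insensitive variant could look like the following sketch. The tiny `stop` set is a hypothetical stand-in for the much longer NLTK list.

```python
# Hypothetical, tiny stop list for illustration; NLTK's list is much longer
stop = {"it", "i", "is", "am", "and", "not", "if", "will", "be", "to", "can", "we", "our"}

sample = "It is raining heavily today and I am not sure if I will be able to travel"

# Compare the lowercased word, but keep the original spelling of words that stay
kept = [word for word in sample.split() if word.lower() not in stop]
text_cleaned = " ".join(kept)
print(text_cleaned)
```

Joining a list also avoids the leading space produced by repeated string concatenation.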

### 5. Tokenization

In [3]:
from nltk import sent_tokenize, word_tokenize

text = "Hi John, How are you doing? I will be travelling your city. Lets catchup."

sent_tokenize(text)

['Hi John, How are you doing?',
 'I will be travelling your city.',
 'Lets catchup.']

In [4]:
word_tokenize(text)

['Hi',
 'John',
 ',',
 'How',
 'are',
 'you',
 'doing',
 '?',
 'I',
 'will',
 'be',
 'travelling',
 'your',
 'city',
 '.',
 'Lets',
 'catchup',
 '.']
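
To see roughly what the two tokenizers do, here is a regex-based approximation using only the standard library. It is not NLTK's algorithm and will differ on contractions, abbreviations and similar cases, but it gives the same result for this sentence.

```python
import re

sample = "Hi John, How are you doing? I will be travelling your city. Lets catchup."

# Sentence split: break on whitespace that follows ., ? or !
sentences = re.split(r"(?<=[.?!])\s+", sample)

# Word split: runs of word characters, or any single non-space symbol
words = re.findall(r"\w+|[^\w\s]", sample)

print(sentences)
print(words)
```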

### 6. Normalization

#### 1. Stemming

In [5]:
#stemming

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('playing'))
print(stemmer.stem('running'))
print(stemmer.stem('increases'))

play
run
increas


In [6]:
# import nltk
# nltk.download('wordnet')

#### 2. Lemmatization

In [7]:
from nltk.stem import WordNetLemmatizer

lemm = WordNetLemmatizer()

print(lemm.lemmatize('increases'))

increase


In [8]:
print(lemm.lemmatize("running",pos='v'))

run


In [9]:
from nltk import pos_tag
text =  "Hi John, How are you doing? I will be travelling your city. Lets catchup."

tokens = word_tokenize(text)
pos_tag(tokens)

[('Hi', 'NNP'),
 ('John', 'NNP'),
 (',', ','),
 ('How', 'NNP'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('doing', 'VBG'),
 ('?', '.'),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('travelling', 'VBG'),
 ('your', 'PRP$'),
 ('city', 'NN'),
 ('.', '.'),
 ('Lets', 'VBZ'),
 ('catchup', 'NN'),
 ('.', '.')]

In [8]:
from nltk.stem import WordNetLemmatizer

lemm = WordNetLemmatizer()

text_cleaned1 = ""
for word in text.split():
    word = lemm.lemmatize(word)
    text_cleaned1 += " "
    text_cleaned1 +=  word
    
text_cleaned1

' It is raining heavily today and I am not sure if I will be able to travel Can we postpone our meeting Hope it is fine with you I am sending the new meeting invite on a href wwwexamplecom this link a'

In [9]:
text_cleaned1 = ""
for word in text.split():
    word = lemm.lemmatize(word, pos ="v") 
    text_cleaned1 += " "
    text_cleaned1 +=  word
    
text_cleaned1

# raining is converted to rain after defining the POS as verb
# meeting is converted to meet, etc.

' It be rain heavily today and I be not sure if I will be able to travel Can we postpone our meet Hope it be fine with you I be send the new meet invite on a href wwwexamplecom this link a'
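
The cell above applies `pos="v"` to every word. One common refinement is to derive the POS for each word from the `pos_tag` output instead. The helper below, `penn_to_wordnet`, is a hypothetical sketch that maps Penn Treebank tags to the single-letter codes WordNet uses ('n', 'v', 'a', 'r'). The `tagged` list is copied from the tagger output shown earlier.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default


# A few (word, tag) pairs taken from the pos_tag output above
tagged = [("I", "PRP"), ("will", "MD"), ("be", "VB"),
          ("travelling", "VBG"), ("your", "PRP$"), ("city", "NN")]

pos_for_lemmatizer = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_for_lemmatizer)
```

Each pair could then be passed on as `lemm.lemmatize(word, pos=pos)`.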

### 7. Extracting Synonyms

NLTK also integrates with WordNet, a large lexical database of English words. It can be used to look up synonyms and antonyms.

In [12]:
# synonyms

from nltk.corpus import wordnet
wordnet.synsets('good')

[Synset('good.n.01'),
 Synset('good.n.02'),
 Synset('good.n.03'),
 Synset('commodity.n.01'),
 Synset('good.a.01'),
 Synset('full.s.06'),
 Synset('good.a.03'),
 Synset('estimable.s.02'),
 Synset('beneficial.s.01'),
 Synset('good.s.06'),
 Synset('good.s.07'),
 Synset('adept.s.01'),
 Synset('good.s.09'),
 Synset('dear.s.02'),
 Synset('dependable.s.04'),
 Synset('good.s.12'),
 Synset('good.s.13'),
 Synset('effective.s.04'),
 Synset('good.s.15'),
 Synset('good.s.16'),
 Synset('good.s.17'),
 Synset('good.s.18'),
 Synset('good.s.19'),
 Synset('good.s.20'),
 Synset('good.s.21'),
 Synset('well.r.01'),
 Synset('thoroughly.r.02')]

In [13]:
wordnet.synsets('computer')

[Synset('computer.n.01'), Synset('calculator.n.01')]

### 8. Extracting n-grams

In [14]:
from nltk import ngrams

sentence = 'I love to play football'

n = 2

ngrams(word_tokenize(sentence),n)

<generator object ngrams at 0x000002143748F0C8>

In [15]:
for i in ngrams(word_tokenize(sentence),n):
    print(i)

('I', 'love')
('love', 'to')
('to', 'play')
('play', 'football')
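
The sliding window behind n-grams can also be written with plain Python. This sketch uses `zip` over shifted copies of the token list and splits on whitespace instead of calling `word_tokenize`.

```python
sentence = "I love to play football"
tokens = sentence.split()
n = 2

# zip stops at the shortest slice, so exactly len(tokens) - n + 1 tuples are produced
bigrams = list(zip(*(tokens[i:] for i in range(n))))

for gram in bigrams:
    print(gram)
```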


### 9. Extracting the Index for word tokens

In [1]:
from keras.preprocessing.text import Tokenizer

data = "rain rain go away come again another day little johny wants to play johny wants to play again in the rain"

tokenizer = Tokenizer()
tokenizer.fit_on_texts([data]) # fit_on_texts expects a list of texts; here the list holds a single sentence
                               # Every distinct word in it is assigned an index
    
tokens = tokenizer.word_index
tokens

{'rain': 1,
 'again': 2,
 'johny': 3,
 'wants': 4,
 'to': 5,
 'play': 6,
 'go': 7,
 'away': 8,
 'come': 9,
 'another': 10,
 'day': 11,
 'little': 12,
 'in': 13,
 'the': 14}
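
The ordering of the index above follows word frequency: the most frequent word gets index 1, and ties keep their order of first appearance. A rough standard-library approximation is sketched below. It skips the lowercasing and punctuation filtering that the Keras tokenizer also performs.

```python
from collections import Counter

data = ("rain rain go away come again another day little johny wants to play "
        "johny wants to play again in the rain")

# most_common() sorts by count; equal counts keep first-seen order
counts = Counter(data.split())

# Indices start at 1, as in the word_index shown above
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
print(word_index)
```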

<font color='Blue'>

## Preprocessing Techniques Using SpaCy

spaCy helps you build applications that process and "understand" large volumes of text. It is widely used across NLP tasks:

- Tokenisation (Word, Sentence)
- Different Properties of Tokens 
    - Punctuation Removal 
    - Stopword Removal 
    - Tokenization
    - Word Vector Notations
    - Document Similarity etc.
 


