Feature engineering, or text representation, means converting text into a numeric form such as a vector. Techniques that can be used for text representation include:

Label encoding

One Hot Encoding

Bag of Words

TF IDF

Bag of N-grams


In [1]:
!pip install nltk



In [10]:
import nltk
from nltk import sent_tokenize, word_tokenize
#Remove stop words
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [11]:
#Function to attach POS tag letter as needed by wordnet lemmatizer
def pos_tagger(word):
    nltk_tag = nltk.pos_tag([word])[0][1][0]
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [5]:
#Function to do basic preprocessing steps: lower casing, tokenization, punctuation removal, stop word removal, stemming and lemmatization
def text_preprocessing(corpus):
    text_lower=corpus.lower()
    text_tokenize= word_tokenize(text_lower)

    #Creating list of common punctuation marks
    punctuations= [",", ".", "!","'", "-", "_"]
    #Remove punctuations
    text_punc=[]
    for word in text_tokenize:
        if word not in punctuations:
            text_punc.append(word)

    #Stopword Removal
    stopWords = set(stopwords.words('english'))
    #Remove stop words
    text_stop=[]
    for word in text_punc:
        if word not in stopWords:
            text_stop.append(word)

    #Apply stemming using porter stemmer
    porter =PorterStemmer()
    text_stem=[]
    for word in text_stop:
        text_stem.append(porter.stem(word))

    #Apply lemmatization using wordnet lemmatizer on word after stop word removal
    lemmatizer= WordNetLemmatizer()
    text_lemma=[]
    for word in text_stop:
        tag= pos_tagger(word)
        if tag is None:
            text_lemma.append(lemmatizer.lemmatize(word))
        else:
            text_lemma.append(lemmatizer.lemmatize(word,tag))

    return text_lower, text_tokenize, text_punc, text_stop, text_stem, text_lemma

In [13]:
#Checking the above function

sent1= "Good morning dear students. Welcome to another lecture in Natural Language Processing"

# Download the missing resource
nltk.download('averaged_perceptron_tagger_eng')

lower, token, punctuation, stop, stem, lemma= text_preprocessing(sent1)
print("After Stemming: ",stem)
print("After Lemmatization: ", lemma)

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


After Stemming:  ['good', 'morn', 'dear', 'student', 'welcom', 'anoth', 'lectur', 'natur', 'languag', 'process']
After Lemmatization:  ['good', 'morning', 'dear', 'student', 'welcome', 'another', 'lecture', 'natural', 'language', 'processing']
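
The punctuation list in `text_preprocessing` names six marks, so tokens such as "?" or ";" would pass through unchanged. One possible way to widen the filter is sketched below, using `string.punctuation` from the standard library. The helper `simple_clean` and the small `STOP_WORDS` set are hypothetical names for illustration only; the notebook itself uses NLTK's English stop-word list.

```python
import string

# Hypothetical stop-word subset for illustration; the notebook uses stopwords.words('english')
STOP_WORDS = {"to", "in", "the", "is", "a", "of"}

def simple_clean(tokens):
    """Lowercase tokens, then drop punctuation-only tokens and stop words."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        # A token is punctuation-only if every character is ASCII punctuation
        if all(ch in string.punctuation for ch in tok):
            continue
        if tok in STOP_WORDS:
            continue
        cleaned.append(tok)
    return cleaned

tokens = ["Welcome", "to", "another", "lecture", "?", "...", ";"]
print(simple_clean(tokens))  # ['welcome', 'another', 'lecture']
```

Checking every character of a token also removes multi-character marks such as "...", which a membership test against a fixed list would miss.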


### Label Encoding

A vocabulary is built from all the unique words in the corpus. The text is usually cleaned first by removing stop words and punctuation. The vocabulary can also be sorted before each word is assigned a number (its index), which keeps the labels the same from run to run.
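
The three steps described above (deduplicate, sort, number) can be condensed into a few lines. This is a minimal sketch on an already-cleaned word list; `build_label_map` is a hypothetical helper name, and labels start at 1 to match the rest of this notebook.

```python
def build_label_map(words):
    """Map each unique word to an integer label, starting at 1, in sorted order."""
    vocab = sorted(set(words))
    return {word: index for index, word in enumerate(vocab, start=1)}

words = ["new", "delhi", "capital", "india", "delhi"]
label_map = build_label_map(words)
print(label_map)  # {'capital': 1, 'delhi': 2, 'india': 3, 'new': 4}
```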

In [14]:
corpus= "India, country that occupies the greater part of South Asia. Its capital is New Delhi"

#### Creating vocabulary from the given corpus

In [15]:
#Call the text preprocessing function to obtain the lemmatized word list

text_lower, text_tokenize, text_punc, text_stop, text_stem, text_lemma= text_preprocessing(corpus)
print("Lemmatized words from the corpus are- ", text_lemma)

Lemmatized words from the corpus are-  ['india', 'country', 'occupies', 'great', 'part', 'south', 'asia', 'capital', 'new', 'delhi']


In [16]:
#Take only the unique words to form the vocabulary

vocab= list(set(text_lemma))
print("The vocabulary is- ", vocab)

The vocabulary is-  ['country', 'capital', 'occupies', 'new', 'great', 'india', 'part', 'delhi', 'south', 'asia']


In [17]:
#Sort the vocab in ascending order
vocab_asc=sorted(vocab)
print(vocab_asc)

['asia', 'capital', 'country', 'delhi', 'great', 'india', 'new', 'occupies', 'part', 'south']


In [18]:
#Create a dictionary to give label to each word in the vocabulary
dict_words= {}
index=1
for word in vocab_asc:
    dict_words.update({word:index})
    index+=1
print(dict_words)

{'asia': 1, 'capital': 2, 'country': 3, 'delhi': 4, 'great': 5, 'india': 6, 'new': 7, 'occupies': 8, 'part': 9, 'south': 10}


#### Obtain vector for one of the sentences

In [19]:
#Apply label encoding using the dictionary on the given text

sent= "India, country that occupies the greater part of South Asia"
lower, token, punc, stop, stem, lemma= text_preprocessing(sent)
print(lemma)

['india', 'country', 'occupies', 'great', 'part', 'south', 'asia']


In [20]:
vector=[]
for word in lemma:
    vector.append(dict_words[word])
print(lemma)
print(vector)

['india', 'country', 'occupies', 'great', 'part', 'south', 'asia']
[6, 3, 8, 5, 9, 10, 1]
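
Looking words up with `dict_words[word]` raises a `KeyError` for any word that was not in the corpus. A common workaround, sketched here as an assumption rather than as part of the original notebook, is to reserve one label for unknown words. The names `UNK` and `encode` are hypothetical.

```python
UNK = 0  # assumed label for words missing from the vocabulary

def encode(words, label_map):
    """Replace each word with its label, using UNK for unseen words."""
    return [label_map.get(word, UNK) for word in words]

# Same labels as dict_words above
label_map = {'asia': 1, 'capital': 2, 'country': 3, 'delhi': 4, 'great': 5,
             'india': 6, 'new': 7, 'occupies': 8, 'part': 9, 'south': 10}
print(encode(['india', 'country', 'mumbai'], label_map))  # [6, 3, 0]
```

Label 0 is free because the vocabulary labels start at 1.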


### One hot encoding

Ref: Steven Bird, Ewan Klein, Edward Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, 1st Edition, O'Reilly Publications, 2009


In [23]:
def get_onehot_vector(word_list, vocab_dict):
    #One vector per word; position (label - 1) is set to 1 because labels start at 1
    onehot_encoded = []
    for word in word_list:
        temp = [0]*len(vocab_dict)
        if word in vocab_dict:
            temp[vocab_dict[word]-1] = 1
        onehot_encoded.append(temp)
    return onehot_encoded

# Example usage with your current variables
one_hot_vectors = get_onehot_vector(text_lemma, dict_words)
print(one_hot_vectors)

[[0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
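
Each one-hot vector has a single 1 at position `label - 1`, so the encoding can be reversed by finding that position. The sketch below shows one way to do this, under the assumption that labels are 1-based as in `dict_words`. The function name `onehot_to_words` is hypothetical.

```python
def onehot_to_words(onehot_vectors, vocab_dict):
    """Recover words from one-hot vectors built with 1-based vocabulary labels."""
    index_to_word = {index: word for word, index in vocab_dict.items()}
    words = []
    for vec in onehot_vectors:
        if 1 in vec:
            words.append(index_to_word[vec.index(1) + 1])
        else:
            words.append(None)  # all-zero vector: word was not in the vocabulary
    return words

vocab_dict = {'asia': 1, 'capital': 2, 'country': 3}
vectors = [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
print(onehot_to_words(vectors, vocab_dict))  # ['country', 'asia', None]
```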


### Bag of Words

In [24]:
corpus= "India, country that occupies the greater part of South Asia. India's capital is New Delhi. India is a democratic country."

#### Creating vocabulary for the given corpus

In [25]:
#Call the text preprocessing function to obtain the lemmatized word list

text_lower, text_tokenize, text_punc, text_stop, text_stem, text_lemma= text_preprocessing(corpus)
print("Lemmatized words from the corpus are- ", text_lemma)

Lemmatized words from the corpus are-  ['india', 'country', 'occupies', 'great', 'part', 'south', 'asia', 'india', "'s", 'capital', 'new', 'delhi', 'india', 'democratic', 'country']


In [26]:
#Take only the unique words to form the vocabulary

vocab= list(set(text_lemma))
print("The vocabulary is- ", vocab)

The vocabulary is-  ['country', 'capital', "'s", 'occupies', 'new', 'great', 'india', 'part', 'delhi', 'south', 'asia', 'democratic']


In [27]:
#Form a dictionary of vocabulary

dict_vocab= {}
index=1
for word in vocab:
    dict_vocab.update({word:index})
    index+=1
print(dict_vocab)

{'country': 1, 'capital': 2, "'s": 3, 'occupies': 4, 'new': 5, 'great': 6, 'india': 7, 'part': 8, 'delhi': 9, 'south': 10, 'asia': 11, 'democratic': 12}


#### Obtain BoW vector for the given sentence

In [28]:
#Sentence tokenize
sentences= sent_tokenize(corpus)
print(sentences)

['India, country that occupies the greater part of South Asia.', "India's capital is New Delhi.", 'India is a democratic country.']


In [29]:
#Preprocess the first sentence to check the lemmatized tokens

lower, token, punc, stop, stem, lemma= text_preprocessing(sentences[0])
print(lemma)

['india', 'country', 'occupies', 'great', 'part', 'south', 'asia']


In [30]:
final_vectors=[]
for sentence in sentences:
    vector=[]
    #Preprocess each sentence separately, then count every vocabulary word in it
    lower, token, punc, stop, stem, lemma= text_preprocessing(sentence)
    for word in vocab:
        vector.append(lemma.count(word))
    final_vectors.append(vector)
print(final_vectors)


[[1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0], [0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]]


In [31]:
final_vectors.append(vocab)
print(final_vectors)

[[1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0], [0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1], ['country', 'capital', "'s", 'occupies', 'new', 'great', 'india', 'part', 'delhi', 'south', 'asia', 'democratic']]


In [32]:
df = pd.DataFrame(final_vectors[0:3],columns=final_vectors[3], index= ["document1", "document2", "document3"])
df

Unnamed: 0,country,capital,'s,occupies,new,great,india,part,delhi,south,asia,democratic
document1,1,0,0,1,0,1,1,1,0,1,1,0
document2,0,1,1,0,1,0,1,0,1,0,0,0
document3,1,0,0,0,0,0,1,0,0,0,0,1
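
The same count vectors can be built with `collections.Counter` from the standard library. This is a minimal sketch on already-tokenized documents, not the notebook's own method. `bow_vectors` is a hypothetical helper, and it sorts the vocabulary so the column order is reproducible, whereas `list(set(...))` may order the words differently between runs.

```python
from collections import Counter

def bow_vectors(tokenized_docs):
    """Return (vocab, vectors): a sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in tokenized_docs for word in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = [['india', 'country', 'india'], ['india', 'capital']]
vocab, vectors = bow_vectors(docs)
print(vocab)    # ['capital', 'country', 'india']
print(vectors)  # [[0, 1, 2], [1, 0, 1]]
```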


### TF-IDF
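
The code in this section appears to implement the following definitions (read off the code itself, so treat them as an assumption; smoothed variants of IDF are also common). Here $N$ is the number of documents, which are sentences in this notebook, and $\mathrm{df}(t)$ is the number of documents that contain the term $t$.

$$
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of tokens in } d}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For example, "country" occurs in 2 of the 3 sentences, so $\mathrm{idf} = \ln(3/2) \approx 0.405$, and it is 1 of the 3 tokens in the third sentence, so $\mathrm{tf} = 1/3 \approx 0.33$.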


In [33]:
corpus= "India, country that occupies the greater part of South Asia. India's capital is New Delhi. India is a democratic country."

In [34]:
#Call the text preprocessing function to obtain the lemmatized sentences

tokenized_sentences=[]
tokenized_words=[]
for sentence in sentences:

    lower, token, punc, stop, stem, lemma= text_preprocessing(sentence)
    tokenized_sentences.append(lemma)
    tokenized_words.extend(lemma)
print(tokenized_sentences)
print(tokenized_words)

[['india', 'country', 'occupies', 'great', 'part', 'south', 'asia'], ['india', "'s", 'capital', 'new', 'delhi'], ['india', 'democratic', 'country']]
['india', 'country', 'occupies', 'great', 'part', 'south', 'asia', 'india', "'s", 'capital', 'new', 'delhi', 'india', 'democratic', 'country']


#### To find Term Frequency

In [35]:
#Function to find TF for each word in each document
def tf(token_sent):
    dict_tf={}
    for word in token_sent:
        dict_tf.update({word: np.round(token_sent.count(word)/len(token_sent),2)})
    print(dict_tf)
    return dict_tf

In [36]:
#Collect the TF dictionary of each sentence, keyed by a 1-based document number
dict_tf_final={}
for count, token_sent in enumerate(tokenized_sentences, start=1):
    dict_tf_final.update({count:tf(token_sent)})

{'india': np.float64(0.14), 'country': np.float64(0.14), 'occupies': np.float64(0.14), 'great': np.float64(0.14), 'part': np.float64(0.14), 'south': np.float64(0.14), 'asia': np.float64(0.14)}
{'india': np.float64(0.2), "'s": np.float64(0.2), 'capital': np.float64(0.2), 'new': np.float64(0.2), 'delhi': np.float64(0.2)}
{'india': np.float64(0.33), 'democratic': np.float64(0.33), 'country': np.float64(0.33)}


In [37]:
#Dataframe to display the term frequency of each word in each sentence
df_tf= pd.DataFrame(dict_tf_final)
df_tf=df_tf.fillna(0)
df_tf

Unnamed: 0,1,2,3
india,0.14,0.2,0.33
country,0.14,0.0,0.33
occupies,0.14,0.0,0.0
great,0.14,0.0,0.0
part,0.14,0.0,0.0
south,0.14,0.0,0.0
asia,0.14,0.0,0.0
's,0.0,0.2,0.0
capital,0.0,0.2,0.0
new,0.0,0.2,0.0


#### To find IDF

In [38]:
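def idf(tokenized_words, tokenized_sentences):
    #IDF of a word = log(number of sentences / number of sentences containing the word)
    dict_idf={}
    for word in tokenized_words:
        count=0
        for sentence in tokenized_sentences:
            if word in set(sentence):
                count+=1
        dict_idf.update({word:np.round(np.log(len(tokenized_sentences)/count),3)})
    return dict_idf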

In [39]:
dict_idf={}
for word in tokenized_words:
        count=0
        for sentence in tokenized_sentences:
            if word in set(sentence):
                count+=1
        dict_idf.update({word:np.round(np.log(len(tokenized_sentences)/count),3)})
print(dict_idf)

{'india': np.float64(0.0), 'country': np.float64(0.405), 'occupies': np.float64(1.099), 'great': np.float64(1.099), 'part': np.float64(1.099), 'south': np.float64(1.099), 'asia': np.float64(1.099), "'s": np.float64(1.099), 'capital': np.float64(1.099), 'new': np.float64(1.099), 'delhi': np.float64(1.099), 'democratic': np.float64(1.099)}


In [40]:
df_idf = pd.DataFrame.from_dict(dict_idf.items())
df_idf.columns = ['words', 'idf']
df_idf.set_index("words")

Unnamed: 0_level_0,idf
words,Unnamed: 1_level_1
india,0.0
country,0.405
occupies,1.099
great,1.099
part,1.099
south,1.099
asia,1.099
's,1.099
capital,1.099
new,1.099


In [41]:
#TF-IDF score = TF * IDF; the IDF values are aligned with df_tf by word
idf_series= pd.Series(dict_idf)
df_score= df_tf.mul(idf_series, axis=0).round(3)

In [42]:
df_score

Unnamed: 0,1,2,3
india,0.0,0.0,0.0
country,0.057,0.0,0.134
occupies,0.154,0.0,0.0
great,0.154,0.0,0.0
part,0.154,0.0,0.0
south,0.154,0.0,0.0
asia,0.154,0.0,0.0
's,0.0,0.22,0.0
capital,0.0,0.22,0.0
new,0.0,0.22,0.0
delhi,0.0,0.22,0.0
democratic,0.0,0.0,0.363
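
For comparison, the whole TF-IDF computation can be written without pandas. The sketch below is an illustration under the same definitions used in this section (TF as count divided by sentence length, IDF as the natural log of N over document frequency). `tf_idf` is a hypothetical helper, and unlike the cells above it does not round TF and IDF before multiplying, so the scores can differ in the third decimal place.

```python
import math

def tf_idf(tokenized_docs):
    """Return one {word: score} dict per document, using tf = count/len and idf = ln(N/df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each word
    df = {}
    for doc in tokenized_docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in tokenized_docs:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            doc_scores[word] = tf * math.log(n_docs / df[word])
        scores.append(doc_scores)
    return scores

docs = [['india', 'country', 'asia'], ['india', 'capital'], ['india', 'country']]
for doc_scores in tf_idf(docs):
    print({word: round(score, 3) for word, score in doc_scores.items()})
```

A word that occurs in every document, such as "india" here, gets a score of 0 because ln(N/N) is 0.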

### N Grams using NLTK

REF: https://www.askpython.com/python/examples/n-grams-python-nltk


In [46]:
from nltk import ngrams, bigrams
from nltk import sent_tokenize, word_tokenize
from nltk import collocations

In [47]:
sentence = input("Enter the sentence: ")
n = int(input("Enter the value of n: "))
n_grams = ngrams(sentence.split(), n)

Enter the sentence: Hell , what's up ?
Enter the value of n: 2


In [48]:
for grams in n_grams:

    print(grams)

('Hell', ',')
(',', "what's")
("what's", 'up')
('up', '?')
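
To make the mechanics visible, n-grams can also be produced without NLTK by zipping shifted copies of the token list. This is a minimal sketch, with `make_ngrams` as a hypothetical helper name; it is not how `nltk.ngrams` is implemented internally.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted views of the token list; it stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Hello how are you".split()
print(make_ngrams(tokens, 2))  # [('Hello', 'how'), ('how', 'are'), ('are', 'you')]
print(make_ngrams(tokens, 3))  # [('Hello', 'how', 'are'), ('how', 'are', 'you')]
```

A list of k tokens yields k - n + 1 n-grams, and an empty list when n is larger than k.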


In [49]:
sentence = input("Enter the sentence: ")
#Store the tokens under a new name so the imported sent_tokenize function is not overwritten
tokens= word_tokenize(sentence)
print(tokens)

Enter the sentence: Hello how are you
['Hello', 'how', 'are', 'you']


In [50]:
BiGrams=list(bigrams(tokens))
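
A bag of n-grams, listed among the techniques at the top of this notebook, is the same idea as bag of words with n-grams as the counted units. The sketch below counts bigrams with `collections.Counter`. `bag_of_ngrams` is a hypothetical helper, and the example sentence is made up for illustration.

```python
from collections import Counter

def bag_of_ngrams(tokens, n):
    """Count how often each n-gram occurs in a token list."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

tokens = "india is a country india is a democracy".split()
bag = bag_of_ngrams(tokens, 2)
print(bag[('india', 'is')])      # 2
print(bag[('a', 'democracy')])   # 1
```

Turning these counts into a fixed-length vector works as in the Bag of Words section, with the distinct n-grams as the vocabulary.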

In [53]:
from nltk.text import Text

sentence = "Good morning all and welcome to this new lecture"
tokens = word_tokenize(sentence)
text = Text(tokens)
text.collocations()


