**Article on the subject:** https://medium.com/@hckecommerce/essential-nlp-feature-extraction-methods-1ff7cb4dc9f1

# **Essential_NLP_Feature_Extraction_Methods.ipynb**

NLP feature extraction methods are techniques used to convert raw text data into numerical representations that can be processed by machine learning models. These methods aim to capture the meaningful information and patterns in text data.

**Here are some essential NLP feature extraction methods:**


1. Label Encoding
2. One Hot Encoding
3. Count Vectorization
   * TF-IDF Vectorizer
   * Bag Of Words (BOW)
4. Word Embedding
   * Word2Vec
   * GloVe
   * FastText
5. N-gram Features


# **1 - Label Encoding**

Label Encoding is a technique used to convert categorical variables (here, text labels) into numerical representations. Each unique category is assigned a unique integer value.

It is quick and easy to apply, but the integers carry no information about how the categories relate to one another. For example, nothing in the encoding reflects that nurses and doctors are closer to each other than either is to teachers or police, and a model may mistake the arbitrary integer order for a ranking.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example categorical data
categories = ['teacher', 'nurse', 'police', 'doctor']

# Initializing the LabelEncoder
encoder = LabelEncoder()

# Fitting and transforming the categories
encoded_labels = encoder.fit_transform(categories)

# Creating a DataFrame
df = pd.DataFrame({'Category': categories, 'Encoded_Labels': encoded_labels})

# Printing the DataFrame
df.head()


Unnamed: 0,Category,Encoded_Labels
0,teacher,3
1,nurse,1
2,police,2
3,doctor,0


# **2 - One Hot Encoding**

One Hot Encoding is a technique used to convert categorical variables into binary vectors. Each category is represented by a binary vector where only one element is "hot" (1) and the others are "cold" (0).

If the number of categories is low, One Hot Encoding is a practical way to convert text labels into numerical values. If the number of categories is large, it adds one column per category, so the data grows quickly and computational cost and time increase with it.

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
categories = ['teacher', 'nurse', 'police', 'doctor']

# Convert categorical data into a DataFrame
data = pd.DataFrame({'Category': categories})

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, dtype=int)

# Fit and transform the categorical data
encoded_data = encoder.fit_transform(data)

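# Convert the encoded data to a DataFrame.
# The encoder orders its output columns by sorted category, not by input order,
# so the column labels must come from the encoder itself.
encoded_df = pd.DataFrame(encoded_data, columns=encoder.categories_[0])

# Print the encoded DataFrame
encoded_df.head()


Unnamed: 0,doctor,nurse,police,teacher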
0,0,0,0,1
1,0,1,0,0
2,0,0,1,0
3,1,0,0,0
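As a rough illustration of what the encoder does internally, the sketch below builds the same binary rows by hand. It assumes one column per unique category, taken in sorted order (the order scikit-learn's encoder uses for its output columns); the variable names are illustrative only.

```python
# Hand-rolled one-hot encoding with the standard library only.
categories = ['teacher', 'nurse', 'police', 'doctor']

# One column per unique category, in sorted order
columns = sorted(set(categories))

# For each input value, put a 1 in its own column and 0 everywhere else
one_hot = [[1 if category == column else 0 for column in columns]
           for category in categories]

print(columns)
for category, row in zip(categories, one_hot):
    print(category, row)
```

Each row sums to 1, and the row width equals the number of distinct categories, which is why the representation becomes expensive when there are many of them.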


# **3 - Count Vectorization**

Count Vectorization is a technique used to convert text documents into numerical vectors based on how often each word occurs in each document.

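**a ) TF-IDF Vectorizer:**
It combines "TF" (Term Frequency), how often a term occurs in a document, with "IDF" (Inverse Document Frequency), which down-weights terms that appear in many documents. Words that are frequent in one document but rare across the corpus receive the highest scores.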

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectors = vectorizer.fit_transform(data['Text'])

# Convert the TF-IDF vectors to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(), columns=vectorizer.get_feature_names_out())

# Print the TF-IDF DataFrame
tfidf_df.head(20)


Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
1,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
2,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
3,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
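The numbers in the table can be reproduced by hand. The sketch below is a simplified re-implementation that assumes scikit-learn's default settings: lowercasing, tokens of two or more word characters, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The helper `tfidf_row` is hypothetical and written only for this illustration.

```python
import math
import re

documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Lowercase and keep tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Raw count times IDF for every vocabulary term, then L2-normalized."""
    raw = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf_rows = [tfidf_row(tokens) for tokens in tokenized]
for term, value in zip(vocabulary, tfidf_rows[0]):
    print(f"{term:10s} {value:.6f}")
```

In the first document, "first" scores higher than "document" because it appears in fewer documents, while "is", "the", and "this" occur everywhere and get the lowest non-zero weight.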


**b ) Bag Of Words (BOW):**

It creates a vocabulary of unique words from the corpus and represents each document as a vector of word frequencies.

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
bow_vectors = vectorizer.fit_transform(data['Text'])

# Convert the BOW vectors to a DataFrame
bow_df = pd.DataFrame(bow_vectors.toarray(), columns=vectorizer.get_feature_names_out())

# Print the BOW DataFrame
bow_df.head()


Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0,1,1,1,0,0,1,0,1
1,0,2,0,1,0,1,1,0,1
2,1,0,0,1,1,0,1,1,1
3,0,1,1,1,0,0,1,0,1
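The same counts can be produced with `collections.Counter`, which makes one limitation of Bag of Words easy to see. This is a sketch under the assumption that tokenization lowercases the text and keeps tokens of two or more word characters; the names `tokenized` and `bow_rows` are illustrative.

```python
import re
from collections import Counter

documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Tokenize and build the sorted vocabulary
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document: how many times each vocabulary word occurs in it
bow_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_rows.append([counts[term] for term in vocabulary])

for row in bow_rows:
    print(row)

# Word order is discarded: a statement and a question with the same words
# receive exactly the same vector.
print(bow_rows[0] == bow_rows[3])
```

The first and fourth documents differ in meaning and word order but are indistinguishable to the model, which is part of the motivation for n-gram features and embeddings.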


# **4 - Word Embedding**

Word Embedding is a technique in NLP that represents words as dense, real-valued vectors, typically with far fewer dimensions than the vocabulary-sized sparse vectors produced by one-hot or count-based methods. It captures semantic meaning and word relationships: words used in similar contexts end up with similar vectors. Word embeddings are learned from large text corpora, often with neural network models, and usually improve NLP model performance compared to sparse representations.

**a ) Word2Vec:**

It is a neural network-based model that learns continuous vector representations (embeddings) of words from large text corpora. These embeddings capture semantic and syntactic relationships between words. Each word receives a single static vector, learned from all the contexts in which it appears during training.

**- CBOW (Continuous Bag of Words)**: predicts the target word based on the surrounding context words. Given the context words, CBOW tries to predict the target word in the center.

**- Skip-gram**: predicts the surrounding context words given a target word. Given a target word in the center, Skip-gram aims to predict the context words that typically appear around it.
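The difference between the two objectives is easiest to see in the training examples each one generates from a sentence. The sketch below is a simplification that ignores details of real trainers (such as negative sampling or subsampling of frequent words); the helper `context_windows` and the window size of 1 are assumptions made for the illustration.

```python
def context_windows(tokens, window):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

sentence = ["I", "enjoy", "eating", "fruits"]
window = 1  # small window chosen only to keep the output short

# CBOW: all context words together are the input, the center word is the target
cbow_examples = [(context, center)
                 for center, context in context_windows(sentence, window)]

# Skip-gram: the center word is the input, each context word is a separate target
skipgram_examples = [(center, context_word)
                     for center, context in context_windows(sentence, window)
                     for context_word in context]

print("CBOW:")
for context, target in cbow_examples:
    print(f"  {context} -> {target}")

print("Skip-gram:")
for center, target in skipgram_examples:
    print(f"  {center} -> {target}")
```

CBOW produces one example per position, while Skip-gram produces one example per (center, context) pair, so Skip-gram sees more training examples from the same text.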

**CBOW (Continuous Bag of Words)**:

In [7]:
# CBOW (Continuous Bag of Words)

import pandas as pd
from gensim.models import Word2Vec

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"],
             ["Apples", "are", "delicious"],
             ["Fruits", "provide", "vitamins"]]

# Training the CBOW model with sg=0
model_cbow_sg0 = Word2Vec(sentences, min_count=1, window=3, sg=0, vector_size=5)

# Accessing word vectors for CBOW (sg=0)
word_vectors_sg0 = model_cbow_sg0.wv

# Creating a DataFrame for word vectors with CBOW (sg=0)
word_vectors_df_sg0 = pd.DataFrame(word_vectors_sg0.vectors, index=word_vectors_sg0.index_to_key)


# Displaying the word vectors DataFrame
word_vectors_df_sg0.head(15)



Unnamed: 0,0,1,2,3,4
I,-0.010725,0.004729,0.102067,0.180185,-0.186059
vitamins,-0.142336,0.129177,0.17946,-0.100309,-0.075267
provide,0.14761,-0.030669,-0.090732,0.131081,-0.097203
Fruits,-0.03632,0.057532,0.019837,-0.165704,-0.188976
delicious,0.146235,0.101405,0.135154,0.015257,0.127018
are,-0.068107,-0.018928,0.115371,-0.150433,-0.078722
Apples,-0.150232,-0.018601,0.190762,-0.146383,-0.046675
fruits,-0.038755,0.161549,-0.118618,0.000903,-0.095075
eating,-0.192071,0.100146,-0.175192,-0.087837,-0.000702
enjoy,-0.005924,-0.153225,0.192295,0.099641,0.184663


**Skip-gram**:

In [8]:
# Skip-gram

import pandas as pd
from gensim.models import Word2Vec

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"],
             ["Apples", "are", "delicious"],
             ["Fruits", "provide", "vitamins"]]

# Training the Skip-gram model with sg=1
model_skip_gram_sg1 = Word2Vec(sentences, min_count=1, window=3, sg=1, vector_size=5)

# Accessing word vectors for Skip-gram (sg=1)
word_vectors_sg1 = model_skip_gram_sg1.wv

# Creating a DataFrame for word vectors with Skip-gram (sg=1)
word_vectors_df_sg1 = pd.DataFrame(word_vectors_sg1.vectors, index=word_vectors_sg1.index_to_key)

# Displaying the word vectors DataFrame
word_vectors_df_sg1.head(12)

Unnamed: 0,0,1,2,3,4
I,-0.010725,0.004729,0.102067,0.180185,-0.186059
vitamins,-0.142336,0.129177,0.17946,-0.100309,-0.075267
provide,0.14761,-0.030669,-0.090732,0.131081,-0.097203
Fruits,-0.03632,0.057532,0.019837,-0.165704,-0.188976
delicious,0.146235,0.101405,0.135154,0.015257,0.127018
are,-0.068107,-0.018928,0.115371,-0.150433,-0.078722
Apples,-0.150232,-0.018601,0.190762,-0.146383,-0.046675
fruits,-0.038755,0.161549,-0.118618,0.000903,-0.095075
eating,-0.192071,0.100146,-0.175192,-0.087837,-0.000702
enjoy,-0.005924,-0.153225,0.192295,0.099641,0.184663


**b ) GloVe:**

GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm that aims to generate word embeddings by capturing global word co-occurrence patterns in a corpus.
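The "global word co-occurrence patterns" are collected in a co-occurrence matrix before any vectors are trained. The sketch below shows only that counting step, assuming a symmetric window and the 1/distance weighting described for GloVe; the toy sentences and the window size of 2 are chosen for illustration, and the actual fitting of vectors to the logarithm of these counts is not shown.

```python
from collections import defaultdict

sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"]]
window = 2  # assumed window size for this illustration

# cooccurrence[(word, other)] accumulates weighted co-occurrence counts
cooccurrence = defaultdict(float)

for tokens in sentences:
    for i, word in enumerate(tokens):
        # Look back at up to `window` preceding words
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)  # nearer words contribute more
            cooccurrence[(word, tokens[j])] += weight
            cooccurrence[(tokens[j], word)] += weight  # keep the matrix symmetric

for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```

Unlike Word2Vec, which processes one context window at a time, GloVe works from these corpus-wide totals.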

In [None]:
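# Write a tiny toy file in the GloVe text format: one word per line,
# followed by its space-separated vector components.
# The vectors below are made up for illustration; they are not trained embeddings.
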
words = ['apple', 'orange', 'banana', 'grape']
vectors = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5, 1.6]
]

glove_file = 'glove_file.txt'
with open(glove_file, 'w', encoding='utf-8') as f:
    for word, vector in zip(words, vectors):
        vector_str = ' '.join(str(num) for num in vector)
        f.write(f"{word} {vector_str}\n")


In [None]:
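import csv
import pandas as pd

# Path to the toy GloVe-format file written in the previous cell
glove_file = 'glove_file.txt'

# Read the file: the first column is the word (used as the index),
# the remaining columns are the vector components.
# quoting=csv.QUOTE_NONE keeps quote characters inside words literal.
word_vectors_df = pd.read_csv(glove_file, sep=' ', header=None, index_col=0, quoting=csv.QUOTE_NONE)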


# Displaying the word vectors DataFrame
word_vectors_df.head()



0,1,2,3,4
apple,0.1,0.2,0.3,0.4
orange,0.5,0.6,0.7,0.8
banana,0.9,1.0,1.1,1.2
grape,1.3,1.4,1.5,1.6


**c ) FastText:**

FastText extends Word2Vec: it learns word embeddings with the Skip-gram or Continuous Bag-of-Words (CBOW) architecture, but represents each word as a bag of character n-grams. Because vectors are built from subword pieces, it can produce embeddings for rare or out-of-vocabulary words, which makes it particularly useful for languages with rich morphology and for large-scale datasets.
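FastText's strength with rich morphology comes from splitting words into character n-grams. The sketch below shows a simplified version of that splitting: the word is wrapped in `<` and `>` boundary markers and every substring of length `min_n` to `max_n` is collected. The function `char_ngrams` is hypothetical, and real implementations additionally keep the whole word as a unit and hash the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

# n-grams of length 3 and 4 for a single word
apple_ngrams = char_ngrams("apple", min_n=3, max_n=4)
print(apple_ngrams)

# Related word forms share many n-grams, so their vectors share components
shared = set(char_ngrams("apple", 3, 3)) & set(char_ngrams("apples", 3, 3))
print(sorted(shared))
```

Because "apple" and "apples" share most of their n-grams, a model that has seen only one of them can still build a reasonable vector for the other.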

In [None]:
import pandas as pd
from gensim.models import FastText

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"]]

# Training the FastText model
model_fasttext = FastText(sentences, min_count=1, window=5, vector_size=100)

# Accessing word vectors
word_vectors = model_fasttext.wv

# Creating a DataFrame for word vectors
word_vectors_df = pd.DataFrame(word_vectors.vectors, index=word_vectors.index_to_key)

# Displaying the word vectors DataFrame
word_vectors_df.head(10)

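# Cosine similarity between the vectors of two words
similarity = model_fasttext.wv.similarity("apples", "fruits")
print("Similarity between 'apples' and 'fruits':", similarity)

# Analogy-style query: words closest to the vector eating + fruits - apples
analogies = model_fasttext.wv.most_similar(positive=["eating", "fruits"], negative=["apples"])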
print("Word analogy for 'eating' and 'fruits' - 'apples':", analogies)


Similarity between 'apples' and 'fruits': 0.5611198
Word analogy for 'eating' and 'fruits' - 'apples': [('enjoy', 0.09406167268753052), ('I', -0.022638631984591484), ('like', -0.06936056911945343)]
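The similarity score printed above is a cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal standard-library sketch follows; the function name and the toy vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, chosen only to show the range of possible values
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point in the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions. With a corpus as small as the one above, the scores mostly reflect random initialization and shared character n-grams rather than learned meaning.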


In [None]:
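# Requires the `fasttext` package. Downloads the pre-trained English model
# (a large file) unless it is already present, then loads it.
import fasttext
import fasttext.util

fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')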


# **5 - N-gram Features**

N-gram features are contiguous sequences of n words in a text document. They capture the contextual information and relationships between words, considering not just individual words but also the groups of words they form.

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer with desired n-gram range
ngram_vectorizer = CountVectorizer(ngram_range=(2,4))

# Fit and transform the text data
ngram_vectors = ngram_vectorizer.fit_transform(data['Text'])

# Convert the N-gram vectors to a DataFrame
ngram_df = pd.DataFrame(ngram_vectors.toarray(), columns=ngram_vectorizer.get_feature_names_out())

# Print the N-gram DataFrame
ngram_df.head()


Unnamed: 0,and this,and this is,and this is the,document is,document is the,document is the second,first document,is the,is the first,is the first document,...,this document,this document is,this document is the,this is,this is the,this is the first,this is the third,this the,this the first,this the first document
0,0,0,0,0,0,0,1,1,1,1,...,0,0,0,1,1,1,0,0,0,0
1,0,0,0,1,1,1,0,1,0,0,...,1,1,1,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,1,0,0,...,0,0,0,1,1,0,1,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,1,1
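The column names in the table are simply every run of 2 to 4 consecutive words found in the corpus. The sketch below extracts them for a single document by hand, assuming the same tokenization as before (lowercasing, tokens of two or more word characters); `word_ngrams` is a hypothetical helper written for this illustration.

```python
import re

def word_ngrams(text, min_n, max_n):
    """Return all contiguous word sequences of length min_n..max_n."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

grams = word_ngrams("This is the first document.", 2, 4)
for gram in grams:
    print(gram)

# Unlike single-word counts, n-grams distinguish a statement from a question
statement = set(word_ngrams("This is the first document.", 2, 2))
question = set(word_ngrams("Is this the first document?", 2, 2))
print(sorted(statement - question))
```

A five-word sentence yields 4 bigrams, 3 trigrams, and 2 four-grams, so the feature space grows much faster than with single words; this is the trade-off for preserving local word order.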
