Feature engineering is the process of creating new features or modifying existing ones from raw data to improve the performance of machine learning algorithms. It involves selecting, transforming, and combining variables in the dataset to make them more suitable for modeling.

In various machine learning tasks, the quality of features has a significant impact on the model's predictive power. Feature engineering aims to extract relevant information from the data and represent it in a way that enhances the model's ability to learn patterns and make accurate predictions.

Some common techniques used in feature engineering include:

Feature Transformation: Transforming features to make them more suitable for modeling. This may include scaling, normalization, or applying mathematical transformations such as logarithms or square roots.

Feature Encoding: Converting categorical variables into numerical representations that algorithms can understand. This can involve techniques such as one-hot encoding, label encoding, or binary encoding.

Feature Selection: Selecting the most relevant features from the dataset to reduce dimensionality and improve model performance. This can be done using techniques like correlation analysis, feature importance ranking, or model-based selection.

Feature Extraction: Creating new features from existing ones to capture additional information or relationships in the data. This may involve techniques such as principal component analysis (PCA), text vectorization, or deriving new variables based on domain knowledge.

Feature Aggregation: Combining multiple features into a single feature to capture higher-level information. This can include aggregating numerical features using statistics like mean, median, or standard deviation, or combining categorical features into higher-level categories.

Feature Imputation: Handling missing values in the dataset by filling them in with estimated values based on other observations. Imputation techniques may include mean or median imputation, predictive modeling, or using algorithms specifically designed for handling missing data.
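As a rough illustration of three of the techniques above (imputation, transformation, and encoding), here is a minimal sketch in plain Python. The `rows` records, the `income` and `city` fields, and their values are hypothetical and invented purely for illustration; real projects would more likely use pandas or scikit-learn for these steps.

```python
import math
from statistics import mean

# Hypothetical toy records (an assumption for illustration); None marks a missing value.
rows = [
    {"income": 30000.0, "city": "Paris"},
    {"income": None, "city": "Berlin"},
    {"income": 90000.0, "city": "Paris"},
]

# Feature imputation: fill missing incomes with the mean of the observed ones.
observed = [r["income"] for r in rows if r["income"] is not None]
income_mean = mean(observed)
incomes = [r["income"] if r["income"] is not None else income_mean for r in rows]

# Feature transformation: log1p compresses the range of a skewed numeric feature.
log_incomes = [math.log1p(x) for x in incomes]

# Feature encoding: one-hot encode the categorical "city" field.
categories = sorted({r["city"] for r in rows})
one_hot = [[1 if r["city"] == c else 0 for c in categories] for r in rows]

# Final feature matrix: one row per record, [log_income, city_Berlin, city_Paris].
features = [[li] + oh for li, oh in zip(log_incomes, one_hot)]
```

Mean imputation and one-hot encoding are only one choice among those listed; the order of the steps matters here because the logarithm cannot be applied to a missing value.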

Effective feature engineering requires a deep understanding of the data and the problem domain, as well as creativity and experimentation to identify the most informative features. It is often an iterative process, where features are continuously refined and evaluated to improve model performance. Good feature engineering can lead to more robust and accurate machine learning models, ultimately enhancing their ability to solve real-world problems.

We will calculate the Jaccard and cosine similarity for a given pair of texts.

In [1]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()

1. `from nltk import word_tokenize`: nltk (Natural Language Toolkit) is a Python library that provides tools for working with human language data, including more than 50 corpora and lexical resources. `word_tokenize` is an nltk function that splits text into words.

2. `from nltk.stem import WordNetLemmatizer`: Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing that prepare text, words, and documents for further processing. The WordNetLemmatizer uses the WordNet database to look up lemmas.

3. `from sklearn.feature_extraction.text import TfidfVectorizer`: sklearn (scikit-learn) is a machine learning library for Python. TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. It is equivalent to CountVectorizer followed by TfidfTransformer.

4. `from sklearn.metrics.pairwise import cosine_similarity`: This imports the `cosine_similarity` function from scikit-learn. The function computes the cosine of the angle between vectors, which can serve as a measure of similarity between two text documents represented as TF-IDF vectors.

5. `lemmatizer = WordNetLemmatizer()`: This creates an object (`lemmatizer`) of the class WordNetLemmatizer, which will be used to lemmatize words, i.e., convert them to their base form (lemma).
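To make the "cosine of the angle between two vectors" idea concrete, the following is a minimal sketch that computes it by hand. It is not scikit-learn's implementation: it assumes the texts are already tokenized and uses raw term counts rather than TF-IDF weights, and the function name `cosine_from_counts` is hypothetical.

```python
import math
from collections import Counter

def cosine_from_counts(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors (illustrative helper)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product: only terms present in both texts contribute.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

same = cosine_from_counts(["a", "b"], ["a", "b"])      # identical token lists
disjoint = cosine_from_counts(["a", "b"], ["c", "d"])  # no shared tokens
partial = cosine_from_counts(["a", "b"], ["a", "c"])   # one of two tokens shared
```

Identical token lists give a value of about 1, lists with no shared tokens give 0, and partial overlap falls in between. TF-IDF weighting changes the vector entries but not this formula.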

In [2]:
# Declare the pair1, pair2 and pair3 variables
pair1 = ["What you do defines you","Your deeds define you"]
pair2 = ["Once upon a time there lived a king.", "Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]

In [3]:
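# Function to compute the Jaccard similarity between a pair of sentences

def extract_text_similarity_jaccard(text1, text2):
    # Tokenize, lowercase, and lemmatize each text (the lemmatizer defaults to noun POS)
    words_text1 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text1)]
    words_text2 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text2)]
    # Numerator: number of unique words shared by both texts
    nr = len(set(words_text1).intersection(set(words_text2)))
    # Denominator: number of unique words across both texts
    dr = len(set(words_text1).union(set(words_text2)))
    # Guard against division by zero when both texts are empty
    jaccard_sim = nr / dr if dr else 0.0
    return jaccard_sim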

The function `extract_text_similarity_jaccard` computes the Jaccard similarity between two texts, `text1` and `text2`:

1. The arguments `text1` and `text2` are tokenized. The `word_tokenize()` function splits the given text into separate words. This is done for both `text1` and `text2`.

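2. Each token generated in the previous step is lowercased and passed to the lemmatizer, which reduces it to its base form. Note that `lemmatize()` treats every word as a noun unless a `pos` argument is supplied, so in this function 'deeds' becomes 'deed', while 'running' is left unchanged; converting 'running' to 'run' or 'better' to 'good' would require `pos='v'` or `pos='a'` respectively. Thus, `words_text1` and `words_text2` are lists containing the lemmatized words from `text1` and `text2` respectively.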

3. To calculate the Jaccard similarity, the intersection (common elements) between the two sets of unique words (from `words_text1` and `words_text2`) is counted. This is the numerator (`nr`) of the Jaccard similarity.

4. The union (all unique elements from both sets, with shared words counted once) of the two sets of unique words is counted. This is the denominator (`dr`) of the Jaccard similarity.

5. The Jaccard similarity (`jaccard_sim`) is calculated as the ratio of the intersection count (`nr`) and the union count (`dr`).

6. It then returns the numeric value of `jaccard_sim`, i.e., the Jaccard similarity index. Keep in mind that the Jaccard coefficient ranges from 0 to 1; a higher score indicates greater similarity between the two texts.
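The steps above can be traced by hand for `pair1`. The sketch below hard-codes the token lists instead of calling NLTK. It assumes that tokenizing and lemmatizing with the default noun setting turns 'deeds' into 'deed' and leaves 'defines' and 'define' as they are. This is an assumption about the lemmatizer's output, not something produced by this snippet.

```python
# Assumed lemmatized tokens for pair1 (hard-coded for illustration, not produced by NLTK here)
words_text1 = ["what", "you", "do", "defines", "you"]  # "What you do defines you"
words_text2 = ["your", "deed", "define", "you"]        # "Your deeds define you"

set1, set2 = set(words_text1), set(words_text2)

shared = set1 & set2     # intersection: words that appear in both texts
combined = set1 | set2   # union: words that appear in either text

# Jaccard similarity = |intersection| / |union|
jaccard_sim = len(shared) / len(combined)
```

Under these assumed tokens only 'you' is shared, and the union holds seven distinct words, so the ratio is 1/7, or roughly 0.143. Because 'defines' and 'define' are not reduced to the same lemma, they count as different words, which lowers the score for two sentences that are close in meaning.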

In [20]:
extract_text_similarity_jaccard(pair1[0],pair1[1])

0.14285714285714285

In [21]:
extract_text_similarity_jaccard(pair2[0],pair2[1])

0.0