# Embeddings

### Overview

#### What are Embeddings?

- Embeddings are a technique in machine learning where words or objects are represented as vectors in a multidimensional space. This representation is based on the context in which words or objects appear, and it captures the semantic relationships between them. For instance:
  - Words with similar meanings would be mapped to a similar region in the space.
  - The distance between vectors can indicate the level of similarity between the words or objects they represent.
  
  This technique allows machine learning models to understand and process data more effectively, enhancing their ability to make accurate predictions or classifications.
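The idea that distance between vectors reflects similarity can be sketched with cosine similarity. The three-dimensional vectors below are hypothetical values chosen by hand, not the output of any trained model, and `cosine_similarity` is a helper defined here for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "feline": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_feline = cosine_similarity(toy_vectors["cat"], toy_vectors["feline"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs feline: {sim_cat_feline:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With these made-up values, 'cat' and 'feline' point in almost the same direction, while 'cat' and 'car' do not. Real embeddings have hundreds of dimensions, but the comparison works the same way.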

#### Connection between Text Comparison and Embeddings

- Text comparison is a technique often utilized in the initial stages of data analysis, where the goal is to find similarities or differences between texts based on certain criteria or metrics. Embeddings further this analysis by providing a more nuanced method of comparison. For instance:
  - Consider the sentences: "The cat is on the roof" and "The feline is on top of the house". 
  - A simple text comparison might indicate that these sentences are quite different.
  - However, when we use embeddings, we can represent words such as 'cat' and 'feline', 'roof' and 'house', 'on' and 'on top of' as vectors in a space where their proximity indicates similarity. This representation allows us to perceive the sentences as more similar than a simple text comparison would suggest.
  
  Embeddings, therefore, offer a way to capture the synonymous nature of words and phrases, translating them into a numerical format that can be used for deeper analyses and various machine learning tasks.
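To see why a simple text comparison rates the two example sentences as quite different, one option is to measure token overlap. The sketch below uses Jaccard similarity as an assumed stand-in for "simple text comparison"; the function name and whitespace tokenization are illustrative choices:

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct lowercase tokens that two texts have in common."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

s1 = "The cat is on the roof"
s2 = "The feline is on top of the house"
overlap = jaccard_similarity(s1, s2)
print(f"Token overlap: {overlap:.2f}")
```

Only the function words 'the', 'is', and 'on' are shared, so the overlap is low even though the sentences mean nearly the same thing.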

#### Role of Data Engineering in Working with Embeddings

- Data engineering plays a significant role in effectively managing the data pipelines associated with embeddings. This role encompasses several tasks including:
  - **Data Cleaning**: Ensuring the data is cleaned and preprocessed to remove any noise or irrelevant information.
  - **Data Transformation**: Transforming data into a format that is suitable for generating embeddings.
  - **Optimized Storage Solutions**: Developing strategies for storing and retrieving the generated embeddings efficiently, which is crucial for facilitating smooth data processing and analysis.
  
  These aspects of data engineering are vital when working with large datasets typically encountered in embedding generation tasks, ensuring a streamlined workflow in machine learning projects that utilize embeddings.
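As a rough illustration of the storage point, the sketch below saves a small embedding matrix together with its item IDs and reads one vector back. The IDs, file name, and random vectors are all placeholders, and a single `.npz` file is just one assumed option; larger systems would more likely use a dedicated vector store:

```python
import os
import tempfile
import numpy as np

# Hypothetical item IDs and random stand-in embeddings (not from a real model)
item_ids = np.array(["doc-1", "doc-2", "doc-3"])
rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(3, 8)).astype(np.float32)  # float32 uses half the space of float64

# Persist IDs and vectors together so they stay aligned
path = os.path.join(tempfile.mkdtemp(), "embeddings.npz")
np.savez_compressed(path, ids=item_ids, vectors=embeddings)

# Later: load the file and look up one item's vector by ID
with np.load(path) as store:
    loaded_ids = store["ids"]
    loaded_vectors = store["vectors"]
row = int(np.where(loaded_ids == "doc-2")[0][0])
vector_doc2 = loaded_vectors[row]
print(vector_doc2.shape, vector_doc2.dtype)
```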


---

## Types of Embeddings

### Overview

In the field of machine learning and natural language processing, various types of embeddings are used to convert words or items into numerical vectors. Understanding the different types of embeddings can help in choosing the right approach for a specific task. In this section, we will explore some of the popular types of embeddings:

### Word Embeddings

#### Word2Vec
- **Description**: Word2Vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
- **Example**: If the word 'king' is often found in the same context as 'queen', their vectors would be close in the vector space.
- **Application in Data Engineering**: Data engineers may utilize Word2Vec for tasks such as semantic search, where they need to identify items that are semantically similar to a given item.


In the below script, we are utilizing the `gensim` library to create word embeddings using the Word2Vec model. Here's a step-by-step explanation of the script:

1. **Importing the Word2Vec Class**: 
   ```python
   from gensim.models import Word2Vec
   ```
   We are importing the `Word2Vec` class from the `gensim.models` module.

2. **Creating a List of Sentences**:
   ```python
   sentences = [['I', 'love', 'coding'], ['Python', 'is', 'my', 'favorite', 'language'], ['I', 'love', 'data', 'engineering']]
   ```
   We are creating a list of sentences, where each sentence is represented as a list of words (tokens).

3. **Initializing and Training the Word2Vec Model**:
   ```python
   model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
   ```
   We are initializing and training the Word2Vec model with the following parameters:
   - `sentences`: The data on which the model will be trained, a list of sentence tokens.
   - `vector_size=100`: The number of dimensions of the embedding space, meaning each word will be represented as a 100-dimensional vector.
   - `window=5`: The maximum distance between the current and predicted word within a sentence during training. Here, up to 5 words on each side of the current word are considered.
   - `min_count=1`: Ignores all words with a total frequency lower than this. With a minimum count of 1, every word in the corpus is kept.
   - `workers=4`: The number of worker threads used to train the model; more workers can speed up training on multi-core machines.

4. **Getting the Vector Representation of a Word**:
   ```python
   vector = model.wv['coding']
   ```
   After training the model, we are getting the vector representation of the word 'coding'. This vector is a 100-dimensional array representing the 'coding' word in the vector space created by Word2Vec.

5. **Output - Vector**:
   The `vector` variable holds the 100-dimensional vector representation of the word 'coding'. 

6. **Usage of the Output Vector**:
   This vector can be used in various natural language processing tasks such as:
   - **Semantic Analysis**: To understand the semantic similarity between words by calculating the cosine similarity between their vectors.
   - **Text Classification**: As features in machine learning models for tasks like sentiment analysis.
   - **Information Retrieval**: To enhance search algorithms by finding semantically similar words or documents.
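The effect of the `window` parameter can be sketched by listing the (center, context) word pairs a given window size produces. This is a simplified illustration with a hypothetical helper, `context_pairs`; real Word2Vec training adds details such as negative sampling that are left out here:

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

sentence = ['I', 'love', 'data', 'engineering']
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbours. A larger window would also pair 'I' with 'data', exposing the model to broader context.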

In [1]:
from gensim.models import Word2Vec
# Training Word2Vec model on sample sentences
sentences = [['I', 'love', 'coding'], ['Python', 'is', 'my', 'favorite', 'language'], ['I', 'love', 'data', 'engineering']]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Getting the vector representation for the word 'coding'
vector = model.wv['coding']
print(vector)

#### GloVe (Global Vectors for Word Representation)
- **Description**: GloVe, which stands for "Global Vectors", is a method to efficiently learn word vectors by leveraging both global statistical information and local context information from the corpus on which it is trained. The resulting vectors encapsulate semantic relationships between words.

  In this script:
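  - We first import the `spaCy` library and load a pre-trained English pipeline (`en_core_web_md`) that ships with static word vectors. Older releases of this model used GloVe vectors; the vector source may differ in newer versions.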
  - We then retrieve the vector representations for two words: 'computer' and 'laptop'.
  - Next, we calculate the cosine similarity between the two vectors, which gives us a measure of the semantic similarity between the words.
  - Finally, we print the calculated cosine similarity.

- **Application in Data Engineering**: Data engineers can use GloVe embeddings to analyze text data more deeply. By using the vector representations, they can develop systems capable of understanding semantic relationships between words, which can be used in various applications such as semantic search, sentiment analysis, and recommendation systems.

- **Output - Cosine Similarity**: The output is a cosine similarity score between the vectors representing the words 'computer' and 'laptop'. This score indicates the semantic similarity between the two words, with a higher score indicating greater similarity.

- **Usage of the Output**: The calculated cosine similarity can be used in several ways:
  - **Semantic Search**: To develop search algorithms that can find semantically similar words or documents.
  - **Natural Language Processing**: To enhance NLP models by understanding the semantic relationships between words.
  - **Data Analysis**: To perform data analysis where understanding the semantic relationships between words is essential.

In [None]:
import sys
#!{sys.executable} -m spacy download en_core_web_md
!{sys.executable} -m pip install tensorflow_hub

In [13]:
import spacy

# Load the spaCy model that includes GloVe embeddings
nlp = spacy.load('en_core_web_md')

# Get the vector representations for the words 'computer' and 'laptop'
vector_computer = nlp('computer').vector
vector_laptop = nlp('laptop').vector

# Calculate the cosine similarity between the two vectors to find the semantic similarity
cosine_similarity = vector_computer.dot(vector_laptop) / (nlp('computer').vector_norm * nlp('laptop').vector_norm)

print(f"The cosine similarity between the words 'computer' and 'laptop' is: {cosine_similarity}")

The cosine similarity between the words 'computer' and 'laptop' is: 0.6255755212776157
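The "global statistical information" mentioned in the GloVe description refers to corpus-wide word co-occurrence counts. The sketch below builds such counts for a tiny corpus. It is a simplification under stated assumptions: it uses plain counts within a symmetric window and leaves out the weighting and the vector-fitting step that GloVe performs on top of them. The helper name `cooccurrence_counts` is illustrative:

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    ['I', 'love', 'coding'],
    ['I', 'love', 'data', 'engineering'],
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[('I', 'love')])
print(counts[('love', 'data')])
```

Words that co-occur with similar neighbours end up with similar rows in this table, which is the signal that GloVe-style training turns into dense vectors.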


### Advanced Embeddings

#### BERT (Bidirectional Encoder Representations from Transformers)

- **Description**: BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that is pre-trained on large text corpora. During training, BERT uses the context on both the left and the right of a word (bidirectional context), so the representation of a word depends on the sentence it appears in. This is a departure from static embeddings such as Word2Vec and GloVe, which assign each word a single vector regardless of context.

<br>

  In the below script:
  - We first import the necessary classes from the `transformers` library.
  - We initialize the BERT tokenizer and model using the `from_pretrained` method to load the pre-trained BERT model.
  - We define two sentences that we want to analyze for semantic similarity.
  - We tokenize these sentences and pass them through the BERT model to obtain embeddings for each sentence.
  - We then calculate the cosine similarity between the embeddings of the two sentences to get a measure of their semantic similarity.
  - Finally, we print the calculated cosine similarity score.

<br><br>
- **Application in Data Engineering**: Data engineers can utilize BERT embeddings in various NLP applications such as sentiment analysis, question-answering systems, and document classification. The embeddings provide a rich representation of the text data, capturing the contextual nuances and semantic relationships between words.

- **Output - Cosine Similarity**: The script outputs the cosine similarity between the embeddings of the two sentences. This score is a measure of the semantic similarity between the sentences, with a higher score indicating a higher degree of similarity.

- **Usage of the Output**: The cosine similarity score can be used in several applications such as:
  - **Semantic Search**: Enhancing search algorithms by finding documents or content that are semantically similar to the query.
  - **Content Recommendation**: Developing recommendation systems that suggest content based on semantic similarity to user preferences.
  - **Text Analysis**: Performing text analysis where understanding the semantic relationships between pieces of text is vital.


In [15]:
from transformers import BertTokenizer, BertModel
import torch.nn.functional as F

# Initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Define two sentences for similarity analysis
sentence1 = "I love programming."
sentence2 = "Coding is my passion."

# Tokenize the sentences and obtain the outputs from the BERT model
inputs1 = tokenizer(sentence1, return_tensors='pt')
outputs1 = model(**inputs1)

inputs2 = tokenizer(sentence2, return_tensors='pt')
outputs2 = model(**inputs2)

# Calculate the cosine similarity between the sentence embeddings
cosine_similarity = F.cosine_similarity(outputs1.last_hidden_state.mean(dim=1), outputs2.last_hidden_state.mean(dim=1))

print(f"The cosine similarity between the sentences is: {cosine_similarity.item()}")

The cosine similarity between the sentences is: 0.8084198236465454


A score of about 0.81 indicates high similarity: the model has captured that both sentences express a positive attitude towards coding/programming. Finer distinctions, such as the difference between "loving programming" and having "a passion for coding", are not visible in a single similarity score.

BERT is a contextual model, so each token's representation depends on its surrounding words, and averaging those token vectors gives a rough sentence embedding. Cosine similarities between such mean-pooled BERT vectors tend to be fairly high even for unrelated sentences, so scores are better compared with each other than judged against a fixed threshold. Similarity metrics also do not always align with human judgment, and different embeddings or pooling methods can give different results.

To make the contrast clearer, the next example compares two sentences with clearly different content.

In [17]:
from transformers import BertTokenizer, BertModel
import torch.nn.functional as F

# Initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Define two sentences for similarity analysis
sentence1 = "I love programming."
sentence2 = "The weather is sunny today."

# Tokenize the sentences and obtain the outputs from the BERT model
inputs1 = tokenizer(sentence1, return_tensors='pt')
outputs1 = model(**inputs1)

inputs2 = tokenizer(sentence2, return_tensors='pt')
outputs2 = model(**inputs2)

# Calculate the cosine similarity between the sentence embeddings
cosine_similarity = F.cosine_similarity(outputs1.last_hidden_state.mean(dim=1), outputs2.last_hidden_state.mean(dim=1))

print(f"The cosine similarity between the sentences is: {cosine_similarity.item()}")

The cosine similarity between the sentences is: 0.5463641285896301
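The scripts above average all token vectors with `.mean(dim=1)`, which works for a single unpadded sentence. When several sentences of different lengths are batched, padded positions would be averaged in as well. The NumPy sketch below shows one common way to handle this, masked mean pooling. The array values are made up, and `masked_mean_pool` is a hypothetical helper; with the `transformers` tokenizer, the mask would correspond to the `attention_mask` it returns:

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, hidden size 2 (values are made up)
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The first sentence is averaged over its two real tokens only, so the padding slot does not pull the result towards zero.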


#### ELMo (Embeddings from Language Models)

- **Description**: ELMo, which stands for "Embeddings from Language Models", is a deep learning model developed to create word embeddings. Unlike traditional word embeddings, ELMo embeddings are contextual, meaning that the embedding for each word depends on the surrounding words in a sentence. This allows ELMo to capture complex semantics and understand word polysemy, where a single word can have multiple meanings based on its context.

- **Python Usage and Example**:
  ```python
  import tensorflow_hub as hub
  import tensorflow as tf
  from sklearn.metrics.pairwise import cosine_similarity
  
  # Load the ELMo module from TensorFlow Hub
  elmo = hub.load('https://tfhub.dev/google/elmo/3')
  
  # Define two sentences for similarity analysis
  sentence1 = ["I love programming."]
  sentence2 = ["Coding is my passion."]
  
  # Get the ELMo embeddings for the sentences
  embeddings1 = elmo.signatures['default'](tf.constant(sentence1))['elmo']
  embeddings2 = elmo.signatures['default'](tf.constant(sentence2))['elmo']
  
  # Calculate the cosine similarity between the sentence embeddings
  cosine_sim = cosine_similarity(embeddings1[0].numpy().mean(axis=0).reshape(1, -1), embeddings2[0].numpy().mean(axis=0).reshape(1, -1))
  
  print(f"The cosine similarity between the sentences is: {cosine_sim[0][0]}")
  ```
  In this script:
  - We first import the necessary modules: `tensorflow_hub` for loading the ELMo model, `tensorflow` to work with tensors, and `cosine_similarity` from scikit-learn to calculate the cosine similarity.
  - We then load the pre-trained ELMo model from TensorFlow Hub.
  - Next, we define two sentences that we wish to analyze for semantic similarity.
  - We use the ELMo model to get the embeddings for each sentence. These embeddings are contextual, capturing the semantic nuances based on the context of each word in the sentences.
  - We then calculate the cosine similarity between the mean embeddings of the two sentences to gauge their semantic similarity.
  - Finally, we print the calculated cosine similarity score, which gives a measure of the semantic similarity between the sentences.

- **Application in Data Engineering**: ELMo embeddings can be utilized in various data engineering tasks such as semantic search, sentiment analysis, and text summarization. By understanding the contextual nuances of words, data engineers can build more sophisticated NLP models that provide richer insights into text data.

- **Output - Cosine Similarity**: The script outputs the cosine similarity score between the two sentences, indicating the degree of semantic similarity based on the context captured by the ELMo embeddings.

- **Usage of the Output**: The cosine similarity score can be used in several applications including:
  - **Semantic Search**: Enhancing search algorithms to find content or documents that are semantically similar to a given query.
  - **Content Recommendation**: Building recommendation systems that suggest content with similar semantic contexts to users.
  - **Text Analysis**: Conducting text analysis where understanding the semantic relationships between texts is essential.

In [28]:
import tensorflow_hub as hub
import tensorflow as tf
from sklearn.metrics.pairwise import cosine_similarity

# Load the ELMo module from TensorFlow Hub
elmo = hub.load('https://tfhub.dev/google/elmo/3')

# Define two sentences for similarity analysis
sentence1 = ["I love programming."]
sentence2 = ["Coding is my passion."]

# Get the ELMo embeddings for the sentences
embeddings1 = elmo.signatures['default'](tf.constant(sentence1))['elmo']
embeddings2 = elmo.signatures['default'](tf.constant(sentence2))['elmo']

# Calculate the cosine similarity between the sentence embeddings
cosine_sim = cosine_similarity(embeddings1[0].numpy().mean(axis=0).reshape(1, -1), embeddings2[0].numpy().mean(axis=0).reshape(1, -1))

print(f"The cosine similarity between the sentences is: {cosine_sim[0][0]}")

The cosine similarity between the sentences is: 0.5848343968391418


---

Embeddings can play a significant role in various data engineering concepts and tasks. Here's a detailed explanation:

1. **Data Preprocessing and Feature Engineering**:
   - **Text Normalization**: Embeddings can help in the normalization of text data by identifying synonyms and semantically similar words, which can then be unified to a standard term.
   - **Dimensionality Reduction**: Embeddings can facilitate dimensionality reduction by representing high-dimensional data (like text) in a lower-dimensional space while preserving essential information.

2. **Information Retrieval**:
   - **Semantic Search**: Embeddings can enhance search algorithms by enabling the retrieval of information based on semantic similarity rather than exact keyword matches. This makes the search more intuitive and capable of understanding user intent.
   - **Document Clustering**: Embeddings can be used to cluster documents based on semantic similarity, facilitating the organization and retrieval of information in large datasets.

3. **Content Recommendation**:
   - **Personalized Recommendations**: Embeddings can be used in recommendation systems to suggest content (articles, products, etc.) that is semantically similar to the content that the user has shown interest in, thereby personalizing the recommendations.
   
4. **Natural Language Processing (NLP)**:
   - **Sentiment Analysis**: Embeddings can enhance sentiment analysis by capturing the semantic nuances in the text, helping in the accurate classification of sentiments.
   - **Named Entity Recognition (NER)**: In NER tasks, embeddings can help in identifying and categorizing entities in the text based on semantic contexts.
   
5. **Data Integration**:
   - **Entity Resolution**: Embeddings can assist in entity resolution tasks by identifying records that refer to the same entity across different data sources based on semantic similarity.
   
6. **Data Visualization**:
   - **Semantic Data Visualization**: Embeddings can be used to visualize data in a semantic space, where the geometric distances between points represent semantic relationships, aiding in the exploration and analysis of complex data.

7. **Anomaly Detection**:
   - **Outlier Detection**: In anomaly detection tasks, embeddings can help identify outliers by analyzing the semantic relationships between data points and identifying those that deviate significantly from the norm.

8. **Data Quality and Consistency**:
   - **Data Cleaning**: Embeddings can assist in data cleaning by identifying inconsistencies and errors in the data based on semantic analysis, helping to improve the quality and reliability of the data.

9. **Optimizing Data Storage**:
   - **Data Compression**: Embeddings can facilitate data compression by representing data in a compact form without significant loss of information, optimizing data storage requirements.

10. **Real-Time Analytics**:
    - **Stream Analytics**: Embeddings can be used in stream analytics to analyze and interpret data in real-time, enabling applications like real-time sentiment analysis, trend detection, etc.

By integrating embeddings into data engineering workflows, data engineers can build more sophisticated data pipelines and systems that can understand and process data in a semantically rich manner, enhancing the insights and value derived from the data.
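As one concrete instance from the list above, outlier detection can be sketched by measuring how far each embedding lies from the centroid of the dataset. The data here is synthetic, and the z-score threshold and the helper name `centroid_outliers` are illustrative assumptions rather than a recommended method:

```python
import numpy as np

def centroid_outliers(embeddings, z_threshold=2.0):
    """Flag rows whose distance to the centroid is unusually large."""
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    z_scores = (distances - distances.mean()) / distances.std()
    return np.where(z_scores > z_threshold)[0]

# Synthetic data: 50 tightly clustered points plus one point far away
rng = np.random.default_rng(seed=42)
normal_points = rng.normal(loc=0.0, scale=0.1, size=(50, 4))
odd_point = np.full((1, 4), 5.0)
data = np.vstack([normal_points, odd_point])

outliers = centroid_outliers(data)
print(outliers)
```

Only the last row, the point placed far from the cluster, is flagged. With real embeddings the threshold would need tuning against known examples.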

---

### Data Preprocessing and Feature Engineering - Text Normalization

In [29]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Create a Pandas DataFrame with some sample data
data = {'Text': ['I love programming', 'I adore coding', 'I enjoy software development']}
df = pd.DataFrame(data)

# Using TfidfVectorizer to convert the text data to vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Text'])

# Compute the cosine similarity between the first sentence and every sentence
# (TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity)
cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()
related_docs_indices = cosine_similarities.argsort()[:-3:-1]
related_docs_indices

# Output: array([0, 2])

array([0, 2])

In this script, we use TF-IDF (Term Frequency-Inverse Document Frequency) vectors to rank the sentences in the dataset by their similarity to the first sentence ("I love programming"). Note that TF-IDF measures word overlap rather than meaning, which matters for interpreting the result. Here's a detailed breakdown of the script and the meaning of its output:

1. **Importing Necessary Libraries and Creating a DataFrame**:
   ```python
   import pandas as pd
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.metrics.pairwise import linear_kernel
   
   # Create a Pandas DataFrame with some sample data
   data = {'Text': ['I love programming', 'I adore coding', 'I enjoy software development']}
   df = pd.DataFrame(data)
   ```
   In this step, we import the necessary libraries and create a Pandas DataFrame containing some sample sentences.

2. **Vectorizing the Text Data**:
   ```python
   # Using TfidfVectorizer to convert the text data to vectors
   vectorizer = TfidfVectorizer()
   tfidf_matrix = vectorizer.fit_transform(df['Text'])
   ```
   We use the `TfidfVectorizer` class from the scikit-learn library to convert the sentences into TF-IDF vectors. TF-IDF is a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus).

3. **Calculating Cosine Similarities**:
   ```python
   # Compute the cosine similarity between the first sentence and every sentence
   cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()
   ```
   We calculate the cosine similarities between the TF-IDF vector of the first sentence and the TF-IDF vectors of all sentences in the DataFrame. Because `TfidfVectorizer` L2-normalizes each row by default, the plain dot product computed by `linear_kernel` equals the cosine similarity and is cheaper to compute.

4. **Finding the Most Similar Sentences**:
   ```python
   related_docs_indices = cosine_similarities.argsort()[:-3:-1]
   related_docs_indices
   # Output: array([0, 2])
   ```
   We identify the most similar sentences to the first sentence by sorting the cosine similarity scores in descending order and selecting the top scores (top 2 in this case).

**Understanding the Output**:
- The output, `array([0, 2])`, lists the indices of the two highest-scoring sentences. Index 0 is the first sentence itself, which trivially has a similarity of 1.
- Index 2 ("I enjoy software development") should not be read as being more similar to "I love programming" than "I adore coding" is. With the default `TfidfVectorizer` settings, single-character tokens such as "I" are dropped, so the three sentences share no tokens and both other sentences score 0 against the first. The order between them only reflects how `argsort` happens to break the tie.

This script illustrates a limit of TF-IDF: it measures lexical overlap, so it cannot detect that 'love', 'adore', and 'enjoy' are near-synonyms. Capturing that relationship requires an embedding model such as those shown in the earlier sections.
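TF-IDF can only score tokens that literally appear in both texts. A quick check of which tokens the sample sentences share can be sketched as follows, assuming scikit-learn's default token pattern (words of two or more characters, lowercased); the `tokens` helper is written here for illustration and is not part of scikit-learn:

```python
import re

# Default token pattern used by scikit-learn's TfidfVectorizer:
# words of two or more characters, so the single-letter "I" is dropped
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Return the set of lowercase tokens the default vectorizer would keep."""
    return set(TOKEN_PATTERN.findall(text.lower()))

sentences = ['I love programming', 'I adore coding', 'I enjoy software development']
token_sets = [tokens(s) for s in sentences]
shared_with_first = [token_sets[0] & other for other in token_sets[1:]]
print(token_sets)
print(shared_with_first)
```

Both intersections are empty, so the TF-IDF cosine similarity between the first sentence and each of the other two is zero.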


---

### Information Retrieval - Semantic Search

In [30]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a Pandas DataFrame with some sample data
data = {'Document': ['The sky is blue and beautiful', 'I love blueberries', 'Blue whales are the largest animals']}
df = pd.DataFrame(data)

# Using TfidfVectorizer to convert the document data to vectors
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(df['Document'])

# Semantic search: Finding documents semantically similar to a query
query = "I am feeling blue"
query_vector = vectorizer.transform([query])
cosine_sim = cosine_similarity(query_vector, doc_vectors)

# Finding the document most similar to the query
most_similar_doc = df.iloc[cosine_sim.argmax()]
most_similar_doc

# Output: Document    The sky is blue and beautiful

Document    The sky is blue and beautiful
Name: 0, dtype: object

In this script, we search for the document that best matches a given query, using TF-IDF (Term Frequency-Inverse Document Frequency) vectors and cosine similarity as the metric. Because TF-IDF matches on shared words rather than meaning, this is keyword-based retrieval; it serves here as a baseline for semantic search. Here's a step-by-step breakdown of the script and an explanation of its output:

1. **Importing Necessary Libraries and Creating a DataFrame**:
   ```python
   import pandas as pd
   from sklearn.metrics.pairwise import cosine_similarity
   from sklearn.feature_extraction.text import TfidfVectorizer
   
   # Create a Pandas DataFrame with some sample data
   data = {'Document': ['The sky is blue and beautiful', 'I love blueberries', 'Blue whales are the largest animals']}
   df = pd.DataFrame(data)
   ```
   Initially, we import the required libraries and create a Pandas DataFrame that contains a set of documents.

2. **Vectorizing the Document Data**:
   ```python
   # Using TfidfVectorizer to convert the document data to vectors
   vectorizer = TfidfVectorizer(stop_words='english')
   doc_vectors = vectorizer.fit_transform(df['Document'])
   ```
   We then use `TfidfVectorizer` to convert the documents into TF-IDF vectors. The `stop_words='english'` parameter tells the vectorizer to ignore common English stop words (like 'is', 'and', etc.) which generally do not carry much semantic meaning.

3. **Performing Semantic Search**:
   ```python
   # Semantic search: Finding documents semantically similar to a query
   query = "I am feeling blue"
   query_vector = vectorizer.transform([query])
   cosine_sim = cosine_similarity(query_vector, doc_vectors)
   ```
   We define a query and convert it to a TF-IDF vector using the same `vectorizer` object, so the query is mapped onto the vocabulary learned from the documents (query words outside that vocabulary are ignored). We then calculate the cosine similarity between the query vector and the vector of every document in the DataFrame.

4. **Finding the Most Similar Document**:
   ```python
   # Finding the document most similar to the query
   most_similar_doc = df.iloc[cosine_sim.argmax()]
   most_similar_doc
   # Output: Document    The sky is blue and beautiful
   ```
   Based on the cosine similarity scores, we identify the document that is most similar to the query. The `argmax` function is used to find the index of the document with the highest cosine similarity score to the query.

**Understanding the Output**:
- The output, `Document    The sky is blue and beautiful`, indicates that this document has the highest cosine similarity to the query "I am feeling blue" under the TF-IDF representation.
- The match rests entirely on the shared token "blue". After stop-word removal, "blue" is the only query term in the vocabulary, since "feeling" does not occur in any document. Both the first and third documents contain "blue"; the first wins because it has fewer other terms, so "blue" carries more weight in its normalized vector. "blueberries" is a separate token and does not match. The idiomatic meaning of the query (feeling sad) plays no role.

This script demonstrates keyword-based retrieval with TF-IDF vectors and cosine similarity. For genuinely semantic search, the TF-IDF vectors would be replaced by embeddings from a model such as those shown in the earlier sections, while the similarity and ranking steps stay the same.
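The ranking step itself does not depend on how the vectors were produced. The sketch below shows top-k retrieval over dense embeddings with NumPy. The three-dimensional vectors are invented for illustration (a real system would obtain them from an embedding model), and `top_k_similar` is a hypothetical helper:

```python
import numpy as np

def top_k_similar(query_vec, doc_matrix, k=2):
    """Return indices and cosine scores of the k rows most similar to the query."""
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Hypothetical 3-dimensional document embeddings, made up for illustration
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # e.g. a document about weather
    [0.1, 0.9, 0.1],   # e.g. a document about mood
    [0.0, 0.2, 0.9],   # e.g. a document about animals
])
query_embedding = np.array([0.2, 0.8, 0.1])  # stand-in for "I am feeling blue"

indices, scores = top_k_similar(query_embedding, doc_embeddings, k=2)
print(indices, scores)
```

Normalizing once and using a matrix product keeps the search to a single vectorized operation, which is the usual starting point before moving to an approximate nearest-neighbour index for large collections.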


---

### Content Recommendation - Personalized Recommendations

In [33]:
import pandas as pd
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

# Create a Pandas DataFrame with some sample data (user preferences and content descriptions)
data = {'User Preferences': ['I love programming and coding', 'I enjoy outdoor activities', 'I like reading books'],
        'Content Description': ['A guide to programming in Python', 'Top 10 hiking trails', 'Bestselling fiction books of 2022']}
df = pd.DataFrame(data)

# Load the Universal Sentence Encoder model
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Create embeddings for content descriptions and user preferences using USE
content_embeddings = use_model(df['Content Description'])
preference_embeddings = use_model(df['User Preferences'])

# Calculate cosine similarity between user preferences and content descriptions
cosine_sim = cosine_similarity(preference_embeddings, content_embeddings)

# Find the content most similar to each user's preferences
recommendations = cosine_sim.argmax(axis=1)
df['Recommended Content'] = df['Content Description'][recommendations].values
df

| | User Preferences | Content Description | Recommended Content |
|---|---|---|---|
| 0 | I love programming and coding | A guide to programming in Python | A guide to programming in Python |
| 1 | I enjoy outdoor activities | Top 10 hiking trails | Top 10 hiking trails |
| 2 | I like reading books | Bestselling fiction books of 2022 | Bestselling fiction books of 2022 |


In this script, we build a simple content recommendation system by using the Universal Sentence Encoder (USE) to create embeddings for the user preferences and content descriptions. USE is designed to embed whole sentences, so it can capture semantic relationships that word-frequency methods such as TF-IDF miss. Here's a step-by-step explanation of the script and the interpretation of its output:

1. **Importing Necessary Libraries and Creating a DataFrame**:
   ```python
   import pandas as pd
   import tensorflow_hub as hub
   from sklearn.metrics.pairwise import cosine_similarity
   
   # Create a Pandas DataFrame with some sample data (user preferences and content descriptions)
   data = {'User Preferences': ['I love programming and coding', 'I enjoy outdoor activities', 'I like reading books'],
           'Content Description': ['A guide to programming in Python', 'Top 10 hiking trails', 'Bestselling fiction books of 2022']}
   df = pd.DataFrame(data)
   ```
   Initially, we import the necessary libraries and create a Pandas DataFrame containing user preferences and content descriptions.

2. **Loading the Universal Sentence Encoder and Creating Embeddings**:
   ```python
   # Load the Universal Sentence Encoder model
   use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
   
   # Create embeddings for content descriptions and user preferences using USE
   content_embeddings = use_model(df['Content Description'])
   preference_embeddings = use_model(df['User Preferences'])
   ```
   We load the Universal Sentence Encoder (USE) model from TensorFlow Hub and use it to create embeddings for the content descriptions and user preferences. The USE creates embeddings that capture the semantic nuances of the sentences, providing a rich representation of the content and preferences.

3. **Calculating Cosine Similarity and Finding Recommendations**:
   ```python
   # Calculate cosine similarity between user preferences and content descriptions
   cosine_sim = cosine_similarity(preference_embeddings, content_embeddings)
   
   # Find the content most similar to each user's preferences
   recommendations = cosine_sim.argmax(axis=1)
   df['Recommended Content'] = df['Content Description'][recommendations].values
   df
   ```
   We then calculate the cosine similarity between the embeddings of user preferences and content descriptions to find the semantic similarity between them. Based on these similarity scores, we identify the most similar content for each user by finding the content with the highest cosine similarity score for each user's preferences.

**Understanding the Output**:
- The output is a DataFrame that displays the recommended content for each user based on their preferences. The recommendations are determined based on the semantic similarity between the user preferences and content descriptions, as captured by the USE embeddings.
- The Universal Sentence Encoder is able to understand the deeper semantic relationships between sentences, which helps in providing more accurate and contextually relevant recommendations compared to methods based on simple word frequency counts.

This script demonstrates one approach to building a content recommendation system: the Universal Sentence Encoder embeds user preferences and content descriptions in the same vector space, and each user is matched to the content whose embedding is most similar.
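When a user's interests are known from items they have already liked rather than from a written preference, one possible variation is to average the embeddings of the liked items into a profile vector and rank candidates against it. The sketch below assumes this averaging approach and uses made-up two-dimensional vectors; `recommend` is a hypothetical helper, not part of any library:

```python
import numpy as np

def recommend(liked_vectors, candidate_vectors, top_n=1):
    """Rank candidates by cosine similarity to the mean of the liked items."""
    profile = liked_vectors.mean(axis=0)
    profile = profile / np.linalg.norm(profile)
    candidates = candidate_vectors / np.linalg.norm(candidate_vectors, axis=1, keepdims=True)
    scores = candidates @ profile
    return np.argsort(scores)[::-1][:top_n]

# Made-up 2-dimensional embeddings: axis 0 ~ "programming", axis 1 ~ "outdoors"
liked = np.array([[0.9, 0.1], [0.8, 0.3]])
candidates = np.array([[0.1, 0.9], [0.95, 0.2], [0.5, 0.5]])

best = recommend(liked, candidates, top_n=2)
print(best)
```

The candidate that leans most towards the "programming" axis ranks first, followed by the mixed one.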

