Building a content recommendation system using NLP with Hugging Face involves several steps, from data preparation to model deployment. Here’s a comprehensive guide to creating an end-to-end content recommendation system.

1. Environment Setup
Install Necessary Libraries: Start by installing the required libraries.
bash
pip install transformers datasets torch scikit-learn pandas flask
2. Data Collection & Preprocessing
Collect Data: You'll need a dataset that includes user interactions with content. For instance, the MovieLens dataset can be used for movie recommendations.
Load Data: Load the dataset using Pandas or the datasets library.
python
import pandas as pd

data = pd.read_csv("path_to_dataset.csv")
# Assuming the dataset has columns: user_id, item_id, interaction (e.g., rating), and item_description (text)
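MovieLens-style data usually ships interactions and item metadata in separate files, so the single table assumed above often has to be built with a join. A minimal sketch, using small in-memory frames in place of real files; the values and the `item_description` text are illustrative assumptions:

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item) rating.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 20, 10],
    "interaction": [5.0, 3.0, 4.0],
})

# Hypothetical item metadata: one row per item with its text description.
items = pd.DataFrame({
    "item_id": [10, 20, 30],
    "item_description": ["space adventure", "courtroom drama", "heist comedy"],
})

# Left join keeps every interaction and attaches the matching description.
# validate="many_to_one" raises if an item_id appears twice in the metadata.
data = interactions.merge(items, on="item_id", how="left", validate="many_to_one")

# Fail early if any column the later steps rely on is missing.
required = {"user_id", "item_id", "interaction", "item_description"}
missing = required - set(data.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")
```

Items that nobody has interacted with (item 30 here) drop out of the joined table, which matters later if you want to recommend unseen content.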
Preprocess Data: Prepare the data for model training, including tokenization and text vectorization.
python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_text(text):
    return tokenizer(text, padding="max_length", truncation=True, return_tensors="pt")

# Example: Assuming 'item_description' is the text to be used for recommendations
data['tokenized'] = data['item_description'].apply(lambda x: tokenize_text(x)['input_ids'])
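To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy version in plain Python. It is not the tokenizer's implementation (a real tokenizer also keeps special tokens in place when truncating); `pad_and_truncate`, `max_length=6`, and the token ids are made up for illustration.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad a list of token ids to exactly max_length.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_and_truncate([101, 2023, 102], max_length=6)
long_ids, long_mask = pad_and_truncate(list(range(1, 10)), max_length=6)
```

The attention mask is what lets a model tell real tokens from padding. Keeping only `input_ids`, as the line above does, discards it.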
3. Feature Engineering
Content-Based Features: Extract features from the text content. Use a pre-trained BERT model to get embeddings of content descriptions.
python
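import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased").eval()
# Mean-pool token states into one 1-D vector per row; the mask keeps BERT from attending to [PAD]
with torch.no_grad():
    data['embeddings'] = data['tokenized'].apply(lambda x: model(x, attention_mask=(x != tokenizer.pad_token_id).long())['last_hidden_state'].mean(dim=1).squeeze(0).numpy())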
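A plain mean over `last_hidden_state` also averages the vectors at padded positions. A common alternative is a masked mean that counts only real tokens. The sketch below shows the arithmetic with NumPy on made-up numbers; `masked_mean_pool` is a hypothetical helper, and the same formula carries over to the model's tensors.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real tokens only.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    returns: (batch, hidden_dim)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padded position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)  # ignores the padded vector
naive = hidden.mean(axis=1)              # pulled toward the padded vector
```

With `padding="max_length"`, short descriptions are mostly padding, so the difference between the two results can be large.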
User-Based Features: Aggregate content embeddings for users based on their interaction history.
python
import numpy as np
user_embeddings = data.groupby('user_id')['embeddings'].apply(lambda x: np.mean(np.vstack(x.values), axis=0))  # one 1-D vector per user
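An unweighted mean treats every interaction equally. If the `interaction` column holds ratings, one option is to weight each item vector by its rating so that well-liked items pull the profile harder. This is an assumption about how you might build profiles, not something the pipeline above requires; `build_user_profile` is a hypothetical helper.

```python
import numpy as np

def build_user_profile(item_vectors, ratings=None):
    """Combine a user's item vectors into one profile vector.

    item_vectors: sequence of 1-D arrays of equal length
    ratings: optional non-negative weights, one per item vector
    """
    stacked = np.vstack(item_vectors)
    if ratings is None:
        return stacked.mean(axis=0)
    return np.average(stacked, axis=0, weights=np.asarray(ratings, dtype=float))

vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
plain = build_user_profile(vectors)                    # equal weight per item
weighted = build_user_profile(vectors, ratings=[5, 1])  # first item dominates
```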
4. Model Selection
Choose a Recommendation Model: Use a simple nearest neighbors model or a more advanced model like matrix factorization or neural collaborative filtering. Here, we'll use a simple nearest neighbors model.
python
from sklearn.neighbors import NearestNeighbors
import numpy as np

knn = NearestNeighbors(n_neighbors=5, metric='cosine')
knn.fit(np.vstack(data['embeddings'].values))  # 2-D array: one row per embedding
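With `metric='cosine'`, scikit-learn ranks neighbors by cosine distance, which is 1 minus the cosine similarity, so only the direction of an embedding matters and not its length. A brute-force NumPy sketch of the same ranking on toy vectors (`cosine_top_k` is a hypothetical helper, not part of scikit-learn, and it assumes no all-zero vectors):

```python
import numpy as np

def cosine_top_k(query, items, k):
    """Return (indices, distances) of the k items closest to query by cosine distance."""
    items = np.asarray(items, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = items @ query / (np.linalg.norm(items, axis=1) * np.linalg.norm(query))
    distances = 1.0 - sims
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Item 1 is much longer than item 0 but points in almost the same direction.
items = np.array([[1.0, 0.0], [10.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
indices, distances = cosine_top_k([1.0, 0.0], items, k=2)
```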
5. Model Training (Optional)
Fine-Tuning a Pre-Trained Model: If you want to fine-tune the BERT model on your specific data, you can do so using the Hugging Face Trainer API.
python
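from datasets import Dataset
from transformers import (AutoModelForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Trainer needs a model that returns a loss; the bare AutoModel above does not.
# Assumption: masked-language-model fine-tuning on the item descriptions,
# which is one possible objective among several.
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Build a tokenized train/validation split from the unique descriptions
texts = data['item_description'].drop_duplicates().tolist()
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
).train_test_split(test_size=0.1, seed=42)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    logging_dir="./logs",
)

trainer = Trainer(
    model=mlm_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    # Randomly masks tokens and pads each batch
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)

trainer.train()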
6. Content Recommendation
Generate Recommendations: For a given user, find the nearest items based on content embeddings.
python
user_id = 123  # Example user ID
user_embedding = user_embeddings.loc[user_id]
distances, indices = knn.kneighbors([user_embedding])

recommended_items = data.iloc[indices[0]]['item_id']
print(f"Recommended items for user {user_id}: {recommended_items.tolist()}")
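Because `data` has one row per interaction, the neighbor lookup can return the same item more than once, and it can return items the user has already interacted with. A small post-processing step handles both. A sketch in plain Python, where `filter_recommendations` is a hypothetical helper and the ids are made up:

```python
def filter_recommendations(candidates, seen, k):
    """Keep the first k candidates that are new to the user, preserving rank order."""
    seen = set(seen)
    result = []
    for item in candidates:
        # Skip items the user already knows and duplicates of earlier picks.
        if item in seen or item in result:
            continue
        result.append(item)
        if len(result) == k:
            break
    return result

ranked = [42, 42, 7, 13, 7, 99, 5]  # neighbor output, best match first
already_seen = [7, 99]
top = filter_recommendations(ranked, already_seen, k=3)
```

Since filtering shrinks the list, it helps to request more neighbors than you plan to show, for example by passing a larger `n_neighbors` to `kneighbors`.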
7. Model Evaluation
Evaluate the Recommendations: Use metrics like precision, recall, or mean reciprocal rank (MRR) to evaluate the quality of recommendations.
python
def precision_at_k(actual, predicted, k):
    return len(set(predicted[:k]) & set(actual)) / k

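# Example evaluation. For a fair estimate, actual_items should come from held-out
# interactions rather than the history used to build the user embedding.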
actual_items = data[data['user_id'] == user_id]['item_id'].tolist()
predicted_items = recommended_items.tolist()
print(f"Precision at 5: {precision_at_k(actual_items, predicted_items, 5)}")
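Only precision is implemented above. Recall and MRR follow the same pattern. A sketch in plain Python; the function names and the toy data are illustrative, and `actual` is assumed to hold each user's relevant (ideally held-out) items:

```python
def recall_at_k(actual, predicted, k):
    """Fraction of the relevant items that appear in the top k predictions."""
    if not actual:
        return 0.0
    return len(set(predicted[:k]) & set(actual)) / len(set(actual))

def reciprocal_rank(actual, predicted):
    """1 / rank of the first relevant prediction, or 0.0 if none is relevant."""
    relevant = set(actual)
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(actual_by_user, predicted_by_user):
    """Average reciprocal rank over all users that have predictions."""
    scores = [reciprocal_rank(actual_by_user.get(user, []), preds)
              for user, preds in predicted_by_user.items()]
    return sum(scores) / len(scores) if scores else 0.0

actual = {"u1": [3, 8], "u2": [5]}
predicted = {"u1": [1, 3, 8], "u2": [9, 4, 2]}

recall_u1 = recall_at_k(actual["u1"], predicted["u1"], k=2)
mrr = mean_reciprocal_rank(actual, predicted)
```

Precision and recall ignore the order within the top k, while MRR rewards placing the first relevant item early, so the metrics can disagree on which model is better.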
8. Model Deployment
Save the Model: Save the trained model and tokenizer for deployment.
python
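import joblib

joblib.dump(knn, "knn_model.pkl")
# The API below also needs the item ids (in the row order used to fit the KNN) and the user vectors
data[['item_id']].to_pickle("items.pkl")
user_embeddings.to_pickle("user_embeddings.pkl")
tokenizer.save_pretrained("./tokenizer")
model.save_pretrained("./bert_model")
Deploy the Recommendation System: Deploy the system using Flask, FastAPI, or any other web framework. The service loads the artifacts saved above at startup.
python
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModel
import joblib
import pandas as pd

app = Flask(__name__)

knn_model = joblib.load("knn_model.pkl")
items = pd.read_pickle("items.pkl")
user_embeddings = pd.read_pickle("user_embeddings.pkl")
# Tokenizer and model are only needed if the service also embeds new content
tokenizer = AutoTokenizer.from_pretrained("./tokenizer")
model = AutoModel.from_pretrained("./bert_model")

@app.route("/recommend", methods=["POST"])
def recommend():
    user_id = request.json['user_id']
    # Users without any interaction history have no profile vector
    if user_id not in user_embeddings.index:
        return jsonify({"error": f"unknown user_id: {user_id}"}), 404
    user_embedding = user_embeddings.loc[user_id]
    distances, indices = knn_model.kneighbors([user_embedding])
    recommended_items = items.iloc[indices[0]]['item_id'].tolist()
    return jsonify({"recommended_items": recommended_items})

if __name__ == "__main__":
    app.run(debug=True)  # debug mode is for local development only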
9. Monitoring and Maintenance
Monitor Recommendations: Keep track of recommendation performance using real-time analytics.
Update the Model: Periodically update the embeddings and retrain the nearest neighbors model as new content is added.
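One simple signal to monitor is the share of recent recommendations that users actually clicked. The sketch below keeps a rolling window in memory; `RollingHitRate`, the window size, and the refresh threshold are illustrative assumptions, and a production system would more likely send these events to a metrics service.

```python
from collections import deque

class RollingHitRate:
    """Track the fraction of the last `window` recommendations that were clicked."""

    def __init__(self, window=1000):
        self.events = deque(maxlen=window)  # oldest events fall off automatically

    def record(self, clicked):
        self.events.append(1 if clicked else 0)

    def value(self):
        return sum(self.events) / len(self.events) if self.events else 0.0

    def needs_refresh(self, threshold):
        """True once the window is full and the hit rate has dropped below threshold."""
        return len(self.events) == self.events.maxlen and self.value() < threshold

tracker = RollingHitRate(window=4)
for clicked in [True, False, False, True, False]:
    tracker.record(clicked)
```

A sustained drop in this rate is one possible trigger for recomputing embeddings and refitting the nearest neighbors model.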
10. Documentation and Sharing
Document the Process: Provide documentation for the entire pipeline from data preprocessing to deployment.
Share the Model and Code: Optionally, share the model and code on platforms like GitHub or Hugging Face Model Hub.
This guide gives you a complete overview of building a content recommendation system using NLP and Hugging Face tools, from data processing to model deployment. The approach can be customized based on your specific use case, data availability, and desired complexity.