## MongoDB Project

### Objectives:

- Analyze and extract relevant information from a text knowledge base.
- Apply NLP processing to the knowledge base.
- Use MongoDB to store, search, and query this data, using advanced NoSQL features (full-text indexing, search, graphs).
- Develop a RESTful API in Python (Flask or FastAPI) to expose the knowledge base to an application (e.g. a chatbot).
- Enrich the MongoDB documents with automatic tags or named entities.


## TP5

**Objective: Apply NLP processing to your text documents to enrich them with useful information**

1. Choice of NLP tool

spaCy to extract named entities

HuggingFace to classify or summarize (optional)

2. Document processing

- Load the texts from MongoDB
- Apply an NLP model to each document
- Extract the key entities or tags

3. Updating the database

Add the NLP results as new fields:

{
  "tags": [...],
  "entites": [...],
  "resume": "..."
}

4. Testing the new search features

Search by tag, search by entity, and weighted search combining the two

**Use the two databases, sample_mflix and sample_airbnb, from the Atlas cluster**


In [1]:
# Install the libraries
!pip install pymongo[srv] spacy transformers

Collecting pymongo[srv]
  Downloading pymongo-4.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo[srv])
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Downloading pymongo-4.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.13.0


In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import spacy
from transformers import pipeline

# Load the models
nlp_spacy = spacy.load("en_core_web_sm")
summarizer = pipeline("summarization")

print("spaCy and Transformers are ready to use!")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Device set to use cpu


spaCy and Transformers are ready to use!


In [4]:
# Connect to MongoDB Atlas
import os

from pymongo import MongoClient

# Read the Atlas connection string from the MONGODB_URI environment variable
# instead of hard-coding the username and password in the notebook
MONGODB_URI = os.environ["MONGODB_URI"]

client = MongoClient(MONGODB_URI)

# MongoClient connects lazily, so ping the server to confirm the connection works
client.admin.command("ping")

# Access the databases
db_mflix = client["sample_mflix"]
db_airbnb = client["sample_airbnb"]

# Main collections
movies_col = db_mflix["movies"]
listings_col = db_airbnb["listingsAndReviews"]

print("Connected to MongoDB.")


Connected to MongoDB.


In [18]:
# Load the text documents

# 5 movies with their title and full plot (non-empty plots only)
movies = list(movies_col.find({"fullplot": {"$exists": True, "$ne": ""}}, {"title": 1, "fullplot": 1}).limit(5))

# 5 listings with their name and description (non-empty descriptions only)
listings = list(listings_col.find({"description": {"$exists": True, "$ne": ""}}, {"name": 1, "description": 1}).limit(5))

print("Sample movies:")
for m in movies:
    print("-", m["title"])

print("\nSample listings:")
for listing in listings:
    print("-", listing["name"])

Sample movies:
- The Great Train Robbery
- A Corner in Wheat
- Winsor McCay, the Famous Cartoonist of the N.Y. Herald and His Moving Comics
- Gertie the Dinosaur
- The Perils of Pauline

Sample listings:


### Choice of NLP tool

spaCy

In [6]:
def extract_entities_spacy(text):
    """Return the unique named entities found in text, ignoring bare numbers (CARDINAL)."""
    doc = nlp_spacy(text)
    return list(set(ent.text for ent in doc.ents if ent.label_ != "CARDINAL"))
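
Because `list(set(...))` returns the entities in arbitrary order, two runs can store the same entities in a different order. Below is a minimal sketch of an order-preserving variant. It assumes the entities have already been pulled out as `(text, label)` pairs, for instance with `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `dedupe_entities` and the sample pairs are illustrative and not part of the notebook.

```python
def dedupe_entities(pairs, excluded_labels=("CARDINAL",)):
    """Return unique entity texts in first-seen order, skipping excluded labels."""
    kept = (text for text, label in pairs if label not in excluded_labels)
    # dict.fromkeys keeps insertion order and drops duplicates
    return list(dict.fromkeys(kept))


# Hypothetical (text, label) pairs, shaped like spaCy's ent.text / ent.label_
sample_pairs = [
    ("American", "NORP"),
    ("first", "ORDINAL"),
    ("two", "CARDINAL"),
    ("American", "NORP"),
    ("Sheriff", "PERSON"),
]
print(dedupe_entities(sample_pairs))
```

Passing a longer `excluded_labels` tuple (for example adding `"ORDINAL"`) would also filter out entities such as `first`.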

Hugging Face

In [19]:
def generate_summary(text):
    word_count = len(text.split())

    if word_count < 30:
        return text  # Too short to summarize

    # max_length and min_length are counted in tokens; the word count is a rough proxy
    max_len = max(32, int(word_count * 0.6))  # about 60% of the input size
    min_len = max(20, int(word_count * 0.3))  # about 30% of the input size

    # truncation=True keeps inputs longer than the model's limit from raising an error
    summary = summarizer(text, max_length=max_len, min_length=min_len, do_sample=False, truncation=True)[0]['summary_text']
    return summary
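
The length bounds can be checked in isolation. The sketch below reproduces the same arithmetic in a standalone helper and prints the bounds for a few input sizes. The name `summary_length_bounds` and its ratio parameters are assumptions made for illustration.

```python
def summary_length_bounds(word_count, max_ratio=0.6, min_ratio=0.3):
    """Mirror the bounds computed in generate_summary for a given word count."""
    max_len = max(32, int(word_count * max_ratio))
    min_len = max(20, int(word_count * min_ratio))
    return min_len, max_len


for n in (30, 54, 200):
    print(n, summary_length_bounds(n))
```

For short inputs the floors of 20 and 32 dominate, which may explain why a short plot can yield a summary that stops mid-sentence.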


### Document processing

In [20]:
def process_documents(docs, text_field):
    processed = []
    for doc in docs:
        texte = doc.get(text_field, "")
        entities = extract_entities_spacy(texte)
        summary = generate_summary(texte)
        processed.append({
            "_id": doc["_id"],
            "tags": list(set(word.lower() for word in texte.split() if len(word) > 4)),
            "entites": entities,
            "resume": summary
        })
    return processed

# Run the NLP processing
movies_nlp = process_documents(movies, "fullplot")
listings_nlp = process_documents(listings, "description")
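
Splitting on whitespace keeps punctuation attached to words, so tags such as `posse.` or `sheriff's` can end up stored. One possible refinement is to tokenize with a regular expression and drop a small stop-word list. This is a sketch that assumes English text and that tags should be plain lowercase alphabetic words. The function name `extract_tags` and the stop-word set are illustrative and not part of the notebook.

```python
import re

# Small illustrative stop-word list; a real one would be larger
STOP_WORDS = {"among", "there", "their", "which", "about", "several"}


def extract_tags(text, min_length=5):
    """Return sorted unique lowercase words of at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if len(w) >= min_length and w not in STOP_WORDS})


print(extract_tags("They are then pursued by a Sheriff's posse. All hand tinted."))
```

Sorting the result makes the stored `tags` array deterministic, which `list(set(...))` does not guarantee.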

In [24]:
from pprint import pprint

print("Example - enriched movie:")
# Check that movies_nlp is not empty before printing
if movies_nlp:
    pprint(movies_nlp[0])
else:
    print("movies_nlp is empty.")

print("\nExample - enriched listing:")
# Check that listings_nlp is not empty before printing
if listings_nlp:
    pprint(listings_nlp[0])
else:
    print("listings_nlp is empty. No listing was processed.")

Example - enriched movie:
{'_id': ObjectId('573a1390f29313caabcd42e8'),
 'entites': ['Sheriff', 'American', 'first'],
 'resume': ' Among the earliest existing films in American cinema - notable as '
           'the first film that presented a narrative story to tell - it '
           'depicts a group of outlaws who',
 'tags': ['narrative',
          'first',
          'train',
          'cinema',
          'outlaws',
          'color',
          'presented',
          'posse.',
          'among',
          'films',
          'tinted.',
          'notable',
          "sheriff's",
          'story',
          'group',
          'american',
          'passengers.',
          'pursued',
          'earliest',
          'cowboy',
          'depicts',
          'scenes',
          'existing',
          'included',
          'several']}

Example - enriched listing:
listings_nlp is empty. No listing was processed.


### Updating the database

In [25]:
# Update the movie documents
for doc in movies_nlp:
    movies_col.update_one(
        {"_id": doc["_id"]},
        {"$set": {
            "tags": doc["tags"],
            "entites": doc["entites"],
            "resume": doc["resume"]
        }}
    )

In [26]:
# Update the Airbnb documents
for doc in listings_nlp:
    listings_col.update_one(
        {"_id": doc["_id"]},
        {"$set": {
            "tags": doc["tags"],
            "entites": doc["entites"],
            "resume": doc["resume"]
        }}
    )
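
Both loops build the same filter and `$set` document. The sketch below factors that construction into one helper so the same code could serve either collection. `build_update` is a hypothetical name and the sample document is made up for illustration.

```python
def build_update(doc, fields=("tags", "entites", "resume")):
    """Return the (filter, update) pair for one enriched document."""
    query = {"_id": doc["_id"]}
    # Only copy the NLP fields that are actually present on the document
    update = {"$set": {name: doc[name] for name in fields if name in doc}}
    return query, update


# Made-up enriched document, shaped like the items in movies_nlp
sample = {"_id": 1, "tags": ["train"], "entites": ["American"], "resume": "..."}
query, update = build_update(sample)
print(query)
print(update)
# With a live collection this would be: movies_col.update_one(query, update)
```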

In [27]:
from pprint import pprint

doc_test = movies_col.find_one({"_id": movies_nlp[0]["_id"]})
pprint(doc_test)

{'_id': ObjectId('573a1390f29313caabcd42e8'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['A.C. Abadie',
          "Gilbert M. 'Broncho Billy' Anderson",
          'George Barnes',
          'Justus D. Barnes'],
 'countries': ['USA'],
 'directors': ['Edwin S. Porter'],
 'entites': ['Sheriff', 'American', 'first'],
 'fullplot': 'Among the earliest existing films in American cinema - notable '
             'as the first film that presented a narrative story to tell - it '
             'depicts a group of cowboy outlaws who hold up a train and rob '
             "the passengers. They are then pursued by a Sheriff's posse. "
             'Several scenes have color included - all hand tinted.',
 'genres': ['Short', 'Western'],
 'imdb': {'id': 439, 'rating': 7.4, 'votes': 9847},
 'languages': ['English'],
 'lastupdated': '2015-08-13 00:27:59.177000000',
 'num_mflix_comments': 0,
 'plot': 'A group of bandits stage a brazen train hold-up, only to find a '
         'de

In [31]:
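from pprint import pprint

# listings_nlp is empty here, so guard the lookup instead of indexing into it
if listings_nlp:
    doc_test = listings_col.find_one({"_id": listings_nlp[0]["_id"]})
    pprint(doc_test)
else:
    print("listings_nlp is empty; there is no listing to look up.")

listings_nlp is empty; there is no listing to look up.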

### Testing the new search features

Search by tag

In [33]:
def search_by_tag(collection, keyword):
    return list(collection.find({"tags": keyword.lower()}))
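
`search_by_tag` matches a single keyword. As a sketch of how the same idea extends to several keywords, the helper below builds the filter document with MongoDB's `$in` operator (any tag matches) or `$all` operator (every tag must match). `build_tag_query` is an illustrative name and not part of the notebook.

```python
def build_tag_query(keywords, match_all=False):
    """Build a filter on the 'tags' array for one or more keywords."""
    normalized = [k.lower() for k in keywords]
    operator = "$all" if match_all else "$in"
    return {"tags": {operator: normalized}}


print(build_tag_query(["Color", "train"]))
print(build_tag_query(["Color", "train"], match_all=True))
# With a live collection: list(movies_col.find(build_tag_query(["color"])))
```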

In [34]:
print("Search by tag = 'color'")
res1 = search_by_tag(movies_col, "color")
pprint(res1[:1])

Search by tag = 'color'
[{'_id': ObjectId('573a1390f29313caabcd42e8'),
  'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
  'cast': ['A.C. Abadie',
           "Gilbert M. 'Broncho Billy' Anderson",
           'George Barnes',
           'Justus D. Barnes'],
  'countries': ['USA'],
  'directors': ['Edwin S. Porter'],
  'entites': ['Sheriff', 'American', 'first'],
  'fullplot': 'Among the earliest existing films in American cinema - notable '
              'as the first film that presented a narrative story to tell - it '
              'depicts a group of cowboy outlaws who hold up a train and rob '
              "the passengers. They are then pursued by a Sheriff's posse. "
              'Several scenes have color included - all hand tinted.',
  'genres': ['Short', 'Western'],
  'imdb': {'id': 439, 'rating': 7.4, 'votes': 9847},
  'languages': ['English'],
  'lastupdated': '2015-08-13 00:27:59.177000000',
  'num_mflix_comments': 0,
  'plot': 'A group of bandits stage a bra

Search by entity

In [35]:
def search_by_entity(collection, entity_name):
    return list(collection.find({"entites": entity_name}))


In [36]:
print("\nSearch by entity = 'first'")
res2 = search_by_entity(movies_col, "first")
pprint(res2[:1])


Search by entity = 'first'
[{'_id': ObjectId('573a1390f29313caabcd42e8'),
  'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
  'cast': ['A.C. Abadie',
           "Gilbert M. 'Broncho Billy' Anderson",
           'George Barnes',
           'Justus D. Barnes'],
  'countries': ['USA'],
  'directors': ['Edwin S. Porter'],
  'entites': ['Sheriff', 'American', 'first'],
  'fullplot': 'Among the earliest existing films in American cinema - notable '
              'as the first film that presented a narrative story to tell - it '
              'depicts a group of cowboy outlaws who hold up a train and rob '
              "the passengers. They are then pursued by a Sheriff's posse. "
              'Several scenes have color included - all hand tinted.',
  'genres': ['Short', 'Western'],
  'imdb': {'id': 439, 'rating': 7.4, 'votes': 9847},
  'languages': ['English'],
  'lastupdated': '2015-08-13 00:27:59.177000000',
  'num_mflix_comments': 0,
  'plot': 'A group of bandits stage a

Weighted search (tags + entities)

In [37]:
def weighted_search(collection, tag=None, entity=None):
    query = {"$or": []}
    if tag:
        query["$or"].append({"tags": tag.lower()})
    if entity:
        query["$or"].append({"entites": entity})

    if not query["$or"]:
        return []

    results = list(collection.find(query))

    # Weighting: +2 for a tag match, +1 for an entity match
    for doc in results:
        score = 0
        if tag and tag.lower() in doc.get("tags", []):
            score += 2
        if entity and entity in doc.get("entites", []):
            score += 1
        doc["score"] = score

    return sorted(results, key=lambda d: d["score"], reverse=True)
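
`weighted_search` fetches every matching document and scores it in Python. The scoring could also be pushed to the server with an aggregation pipeline. The sketch below only builds such a pipeline as plain dictionaries, using the same weights. `build_weighted_pipeline` and its `limit` parameter are assumptions for illustration, and the pipeline has not been run against the Atlas cluster.

```python
def build_weighted_pipeline(tag=None, entity=None, limit=10):
    """Build a pipeline scoring +2 for a tag match and +1 for an entity match."""
    clauses, score_terms = [], []
    if tag:
        tag = tag.lower()
        clauses.append({"tags": tag})
        # $literal stops a value starting with "$" from being read as a field path;
        # $ifNull guards documents that have no 'tags' array yet
        score_terms.append(
            {"$cond": [{"$in": [{"$literal": tag}, {"$ifNull": ["$tags", []]}]}, 2, 0]}
        )
    if entity:
        clauses.append({"entites": entity})
        score_terms.append(
            {"$cond": [{"$in": [{"$literal": entity}, {"$ifNull": ["$entites", []]}]}, 1, 0]}
        )
    if not clauses:
        raise ValueError("provide at least one of tag or entity")
    return [
        {"$match": {"$or": clauses}},
        {"$addFields": {"score": {"$add": score_terms}}},
        {"$sort": {"score": -1}},
        {"$limit": limit},
    ]


pipeline = build_weighted_pipeline(tag="Color", entity="American")
print(pipeline[0])
# With a live collection: list(movies_col.aggregate(pipeline))
```

Sorting and limiting on the server avoids transferring full documents that would be discarded after ranking.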

In [38]:
print("\nWeighted search: tag = 'color', entity = 'American'")
res3 = weighted_search(movies_col, tag="color", entity="American")
pprint(res3[:1])


Weighted search: tag = 'color', entity = 'American'
[{'_id': ObjectId('573a1390f29313caabcd42e8'),
  'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
  'cast': ['A.C. Abadie',
           "Gilbert M. 'Broncho Billy' Anderson",
           'George Barnes',
           'Justus D. Barnes'],
  'countries': ['USA'],
  'directors': ['Edwin S. Porter'],
  'entites': ['Sheriff', 'American', 'first'],
  'fullplot': 'Among the earliest existing films in American cinema - notable '
              'as the first film that presented a narrative story to tell - it '
              'depicts a group of cowboy outlaws who hold up a train and rob '
              "the passengers. They are then pursued by a Sheriff's posse. "
              'Several scenes have color included - all hand tinted.',
  'genres': ['Short', 'Western'],
  'imdb': {'id': 439, 'rating': 7.4, 'votes': 9847},
  'languages': ['English'],
  'lastupdated': '2015-08-13 00:27:59.177000000',
  'num_mflix_comments': 0,
  'plot': 'A

## FastAPI

In [39]:
!pip install fastapi nest-asyncio uvicorn pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.9-py3-none-any.whl.metadata (9.3 kB)
Downloading pyngrok-7.2.9-py3-none-any.whl (25 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.9


In [41]:
import os

from fastapi import FastAPI
from pymongo import MongoClient
import nest_asyncio
import uvicorn
from pyngrok import ngrok

# Connect to MongoDB Atlas; the connection string is read from the MONGODB_URI
# environment variable instead of being hard-coded with credentials
MONGODB_URI = os.environ["MONGODB_URI"]
client = MongoClient(MONGODB_URI)

# Databases and collections
db_mflix = client["sample_mflix"]
movies_col = db_mflix["movies"]

db_airbnb = client["sample_airbnb"]
listings_col = db_airbnb["listingsAndReviews"]

# Generic search functions
def search_by_tag(collection, keyword):
    return list(collection.find({"tags": keyword.lower()}))

def search_by_entity(collection, entity_name):
    return list(collection.find({"entites": entity_name}))

def weighted_search(collection, tag=None, entity=None):
    query = {"$or": []}
    if tag:
        query["$or"].append({"tags": tag.lower()})
    if entity:
        query["$or"].append({"entites": entity})
    if not query["$or"]:
        return []

    results = list(collection.find(query))
    for doc in results:
        score = 0
        if tag and tag.lower() in doc.get("tags", []):
            score += 2
        if entity and entity in doc.get("entites", []):
            score += 1
        doc["score"] = score
    return sorted(results, key=lambda d: d["score"], reverse=True)

# Convert the ObjectId to a string so the document can be JSON-encoded
def serialize(doc):
    doc["_id"] = str(doc["_id"])
    return doc

# Create the API
app = FastAPI(
    title="MongoDB + NLP API",
    description="An API to search movies and listings using NLP"
)

@app.get("/")
def root():
    return {"message": "MongoDB NLP API - mflix & airbnb is up and running."}

@app.get("/search/tags/{tag}")
def tag_search(tag: str):
    docs = search_by_tag(movies_col, tag)
    return [serialize(doc) for doc in docs]

# Endpoints for movies
@app.get("/movies/search/tags/{tag}")
def search_tag_movies(tag: str):
    results = search_by_tag(movies_col, tag)
    return [serialize(doc) for doc in results]

@app.get("/movies/search/entities/{entity}")
def search_entity_movies(entity: str):
    results = search_by_entity(movies_col, entity)
    return [serialize(doc) for doc in results]

@app.get("/movies/search/combined")
def search_combined_movies(tag: str | None = None, entity: str | None = None):
    results = weighted_search(movies_col, tag, entity)
    return [serialize(doc) for doc in results]

# Endpoints for Airbnb listings
@app.get("/airbnb/search/tags/{tag}")
def search_tag_airbnb(tag: str):
    results = search_by_tag(listings_col, tag)
    return [serialize(doc) for doc in results]

@app.get("/airbnb/search/entities/{entity}")
def search_entity_airbnb(entity: str):
    results = search_by_entity(listings_col, entity)
    return [serialize(doc) for doc in results]

@app.get("/airbnb/search/combined")
def search_combined_airbnb(tag: str | None = None, entity: str | None = None):
    results = weighted_search(listings_col, tag, entity)
    return [serialize(doc) for doc in results]
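
`serialize` converts only the top-level `_id` and modifies the document in place. A more defensive variant, sketched below, walks nested dictionaries and lists and converts any value that is not a basic JSON type to a string, returning a copy. The name `to_jsonable` and the `FakeObjectId` stand-in (used so the example runs without `bson` installed) are assumptions for illustration, and the sample values are made up.

```python
from datetime import datetime


def to_jsonable(value):
    """Recursively copy a document, converting non-JSON values to strings."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, datetime):
        return value.isoformat()
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Fallback for other types, such as bson.ObjectId
    return str(value)


class FakeObjectId:
    """Stand-in for bson.ObjectId, which renders as its hex string."""

    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __str__(self):
        return self.hex_id


# Made-up document for illustration
doc = {
    "_id": FakeObjectId("573a1390f29313caabcd42e8"),
    "created": datetime(2024, 1, 1),
    "cast": ["A.C. Abadie"],
}
print(to_jsonable(doc))
```

Returning a copy leaves the original document untouched, which matters if the same result list is reused after serialization.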


In [42]:
# The token is read from the NGROK_AUTHTOKEN environment variable rather than pasted into the notebook
!ngrok config add-authtoken "$NGROK_AUTHTOKEN"


Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [43]:
nest_asyncio.apply()

# Start ngrok
tunnel = ngrok.connect(8000)
print(f"API available at: {tunnel.public_url}")
print(f"Swagger: {tunnel.public_url}/docs")

# Start FastAPI
uvicorn.run(app, host="0.0.0.0", port=8000)

API available at: https://3c7c-34-106-47-253.ngrok-free.app
Swagger: https://3c7c-34-106-47-253.ngrok-free.app/docs


INFO:     Started server process [562]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO:     89.156.91.45:0 - "GET / HTTP/1.1" 200 OK
INFO:     89.156.91.45:0 - "GET /favicon.ico HTTP/1.1" 404 Not Found


INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [562]
