In [1]:
import pandas as pd
import json

from tqdm import tqdm
from joblib import Memory
from typing import List, Dict
from pymongo import MongoClient

# Data Structure Observation

The primary objective of this notebook is to observe the structure of the data in the database.

Let's retrieve some data to better understand its structure.

- The data is stored in MongoDB
- Each feedback is a "document" that can contain multiple fields/metadata
- The mandatory metadata are:
    * Timestamp
    * A text field (often verbatim)
    * An ID
    * A brand (to filter our data based on the client)
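
As a minimal sketch, the mandatory fields listed above could be checked on a raw document as follows. The field names `timestamp`, `verbatims`, `_id`, and `brand` are the ones that appear in the `ditp_analysis` sample in this notebook; the helper name `missing_mandatory_fields` is hypothetical, and the text field may be named differently for other brands.

```python
from typing import Any, Dict, List

# Assumed field names, based on the ditp_analysis sample in this notebook.
MANDATORY_FIELDS = ("timestamp", "verbatims", "_id", "brand")


def missing_mandatory_fields(document: Dict[str, Any]) -> List[str]:
    """Return the mandatory fields that are absent or None in a document."""
    return [field for field in MANDATORY_FIELDS if document.get(field) is None]


example_document = {
    "_id": "ditp_analysis/0000000",  # made-up ID for illustration
    "brand": "ditp_analysis",
    "timestamp": 1715724000000.0,
}
print(missing_mandatory_fields(example_document))
```

Here the made-up document has no text field, so the helper reports `verbatims` as missing.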

The dashboard example below is for the brand ColumbusCafe; the queries in this notebook use the brand stored in `BRAND`.

Dashboard link: https://dashboard.allobrain.com/columbuscafe?filter_time_range=%5B1660521600000%2C1727733599999%5D&

Connect to the database

In [2]:
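import os

import certifi

# Read the password from an environment variable instead of hardcoding the secret.
# The variable name MONGO_PASSWORD is an assumption; adapt it to your setup.
MONGO_PASSWORD = os.environ["MONGO_PASSWORD"]

mongo_client = MongoClient(
    f"mongodb+srv://alloreview:{MONGO_PASSWORD}@"
    "feedbacksdev.cuwx1.mongodb.net/"
    "myFirstDatabase?retryWrites=true&w=majority",
    tlsCAFile=certifi.where(),  # use certifi's CA bundle for the TLS handshake
)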
collection = mongo_client['feedbacks_db']['feedbacks_Prod']

Brand:

In [3]:
BRAND = 'ditp_analysis'

In [4]:
# Sample 100 random documents for the selected brand
from_mongo = pd.DataFrame(list(collection.aggregate([
    {'$match': {'brand': BRAND}},   # keep only this brand's feedback
    {'$sample': {'size': 100}},     # draw 100 documents at random
])))

from_mongo.shape

(100, 82)
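
The sample above comes back with 82 columns. When only a few fields are needed, a `$project` stage can be appended to the pipeline so that MongoDB returns just those fields. The sketch below only builds the pipeline; the helper name `build_sample_pipeline` and the chosen field list are illustrative assumptions.

```python
from typing import Any, Dict, List, Sequence


def build_sample_pipeline(
    brand: str, size: int, fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline: filter by brand, sample, keep chosen fields."""
    return [
        {"$match": {"brand": brand}},
        {"$sample": {"size": size}},
        # 1 means "include this field"; _id is included by default.
        {"$project": {field: 1 for field in fields}},
    ]


pipeline = build_sample_pipeline(
    "ditp_analysis", 100, ["timestamp", "titre", "verbatims", "brand"]
)
print(pipeline)
# With a live connection, the result could be loaded as before:
# from_mongo_small = pd.DataFrame(list(collection.aggregate(pipeline)))
```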

In [14]:
from_mongo.head()

Unnamed: 0,_id,accessibilite,action,aide_ia_proposee_reponse_structure_1,aide_ia_proposee_reponse_structure_2,aide_ia_proposee_reponse_structure_3,amelioration_de_service_a_considerer,audio,brand,canaux_typologie_1,...,taux_de_similarite_reponse_ia_structure_3,timestamp,titre,top_ia_structure_1,top_ia_structure_2,top_ia_structure_3,vote_de_l_agent_reponse_ia_structure_1,vote_de_l_agent_reponse_ia_structure_2,vote_de_l_agent_reponse_ia_structure_3,verbatims
0,ditp_analysis/4325567,Négatif,,N,N,,Non,N,ditp_analysis,"Démarche en ligne,E-mail",...,,1702249000000.0,Renouvellement du titre de séjour étudiant,O,N,,,,,Renouvellement du titre de séjour étudiant\nJ'...
1,ditp_analysis/3812414,Négatif,,O,N,,Non,N,ditp_analysis,Démarche en ligne,...,,1689631000000.0,Refus de la demande de changement d'adresse,O,N,,,,,Refus de la demande de changement d'adresse\nJ...
2,ditp_analysis/4136135,Négatif,,O,,,Non,N,ditp_analysis,"Démarche en ligne,Téléphone",...,,1697926000000.0,CCAM,O,,,,,,CCAM\nLe consulat de France à Fès est le plus ...
3,ditp_analysis/5094741,Positif,,O,,,Non,N,ditp_analysis,Démarche en ligne,...,,1725055000000.0,Edition d'un plan de situation pour faire une DP,O,,,,,,Edition d'un plan de situation pour faire une ...
4,ditp_analysis/468186,Négatif,,N,,,Non,N,ditp_analysis,Démarche en ligne,...,,1638054000000.0,demande de documents,O,,,,,,demande de documents\nj'ai demandé par ANTS un...


In [19]:
from_mongo.columns

Index(['_id', 'accessibilite', 'action',
       'aide_ia_proposee_reponse_structure_1',
       'aide_ia_proposee_reponse_structure_2',
       'aide_ia_proposee_reponse_structure_3',
       'amelioration_de_service_a_considerer', 'audio', 'brand',
       'canaux_typologie_1', 'canaux_typologie_2', 'canaux_typologie_3',
       'cle_de_tracking', 'code_insee_departement_usager',
       'code_insee_region_usager', 'code_postal_typologie_1',
       'code_postal_typologie_2', 'code_postal_typologie_3',
       'date_action_engagee', 'date_action_realisee', 'date_de_publication',
       'description', 'ecrit_le', 'etat_experience',
       'evaluation_inutile_reponse_structure_1_par_visiteurs',
       'evaluation_inutile_reponse_structure_2_par_visiteurs',
       'evaluation_inutile_reponse_structure_3_par_visiteurs',
       'evaluation_reponse_structure_1_par_auteur',
       'evaluation_reponse_structure_2_par_auteur',
       'evaluation_reponse_structure_3_par_auteur',
       'evaluation_util

In [10]:
sample_document = from_mongo.sample().iloc[0]

# Print every field of one randomly chosen document
print(sample_document)

_id                                                                   ditp_analysis/4797317
accessibilite                                                                          None
action                                                                                 None
aide_ia_proposee_reponse_structure_1                                                      O
aide_ia_proposee_reponse_structure_2                                                   None
                                                                ...                        
top_ia_structure_3                                                                     None
vote_de_l_agent_reponse_ia_structure_1                                                utile
vote_de_l_agent_reponse_ia_structure_2                                                 None
vote_de_l_agent_reponse_ia_structure_3                                                 None
verbatims                                 Renouvellement\n- Le site web est trop

In [42]:
sample_document.generated_answer

"Bonjour,\n\nMerci pour votre retour d’expérience. Désolé d’apprendre que votre séjour au Pôle Hébergement du CROUS n’a pas été à la hauteur de vos attentes. \n\nPour résoudre les problèmes mentionnés, nous vous recommandons d'abord de contacter directement le service concerné pour demander une révision de la décision liée à votre carte d’identité italienne. Pour l’assurance logement, il est conseillé de demander un certificat provisoire à votre assureur pour couvrir la période du 30 août au 1er septembre.\n\nEnfin, concernant le traitement de votre dossier en ligne, il est important de signaler ces incidents à la direction du CROUS pour qu'ils puissent améliorer leurs services. En attendant, n’hésitez pas à leur envoyer les documents numériques requis directement par l’adresse mail prévue à cet effet.\n\nNous vous remercions pour votre patience et espérons que ces solutions contribueront à améliorer votre situation.\n\n"

### Exploring the different fields of the document


In [16]:
print("Timestamp:", sample_document.timestamp)
print("Title:", sample_document.titre)
print("Text field:", sample_document.verbatims)
print("ID:", sample_document._id)
print("Brand:", sample_document.brand)

Timestamp: 1715724000000.0
Title: Renouvellement
Text field: Renouvellement
- Le site web est trop touffu ; je n’ai pas vu la mention claire de la nécessité de prendre rendez-vous pour un retrait de pièce d’identité. 
- Délai de 8 mois pour l’obtention de pièce d’identité pour un nouveau-né (service de l’etat-civil de Nantes compris) : globalement beaucoup trop long. C’est hélas un grand pas en arrière. 
- Personnel consulaire extrêmement gentil, efficace, à l’écoute et proactif face aux problèmes rencontrés.
ID: ditp_analysis/4797317
Brand: ditp_analysis
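
The timestamp printed above looks like a Unix epoch expressed in milliseconds (the dashboard's `filter_time_range` parameter uses values of the same magnitude). Assuming that interpretation is right, a minimal conversion to a readable date could look like this; the helper name `timestamp_to_datetime` is hypothetical.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(timestamp_ms: float) -> datetime:
    """Convert an epoch-milliseconds value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


print(timestamp_to_datetime(1715724000000.0).isoformat())
# 2024-05-14T22:00:00+00:00
```

Under that assumption, the sample document dates from 14 May 2024 at 22:00 UTC.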


### Checking for additional fields

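Each brand has its own metadata.

In [None]:
# These fields are brand-specific and may not exist for every brand,
# so .get() is used: it returns None instead of raising an AttributeError.
print(sample_document.get('rating_out_of_5'))
print(sample_document.get('establishment'))
print(sample_document.get('author'))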
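
Because the metadata differs from brand to brand, it can help to see which fields are actually populated. The sketch below computes, for a list of raw documents, the share of documents in which each field is present and not `None`. The helper name `field_fill_rates` and the sample documents are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def field_fill_rates(documents: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Return, per field, the share of documents where it is present and not None."""
    filled: Counter = Counter()
    seen = set()
    total = 0
    for document in documents:
        total += 1
        seen.update(document)  # remember every field name encountered
        filled.update(key for key, value in document.items() if value is not None)
    if total == 0:
        return {}
    return {field: filled[field] / total for field in sorted(seen)}


# Made-up documents for illustration only.
example_documents = [
    {"brand": "brand_a", "rating_out_of_5": 4, "author": None},
    {"brand": "brand_a", "rating_out_of_5": None, "author": None},
]
print(field_fill_rates(example_documents))
```

A field with a rate of 0.0 exists in the schema but is never filled for the sampled documents.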

## Analysis fields

### 1. Topic Extraction

First, we extract the topics that emerge from a review. The goal is to turn a long text that mixes several topics into a list of distinct, reformulated topics.

Each extracted topic carries a sentiment (positive, negative, or neutral), stored in its "sentiment" field.

In [None]:
# The text field of this brand is the plain string column 'verbatims'
print(sample_document.verbatims)

sample_document.extractions

In [None]:
for extr in sample_document.extractions:
    print('Extraction:', extr['extraction'])
    print('Sentiment:', extr['sentiment'])
    print('-' * 50)

**Detailed Structure of Extractions**

Upon closer examination, each object within the "extractions" field contains the following information:

- Extraction: The extracted topic or subject.
- Sentiment: The sentiment associated with the extraction (positive, negative, or suggestion).
- Elementary Subjects: Generated subjects that allow us to classify the extractions.
- Topics (optional): More general and business-oriented subjects.
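
To make this structure concrete, here is a small sketch with made-up extractions. The keys `extraction`, `sentiment`, `elementary_subjects`, and `topics` are the ones read elsewhere in this notebook; the value types (lists of strings, lowercase sentiment labels) and the helper name `count_sentiments` are assumptions.

```python
from collections import Counter
from typing import Any, Dict, List

# Made-up extractions for illustration; value types are assumptions.
example_extractions: List[Dict[str, Any]] = [
    {
        "extraction": "The website is cluttered",
        "sentiment": "negative",
        "elementary_subjects": ["Website clarity"],
        "topics": ["Online services"],
    },
    {
        "extraction": "The staff is kind and efficient",
        "sentiment": "positive",
        "elementary_subjects": ["Staff friendliness"],
        # "topics" is optional and omitted here
    },
]


def count_sentiments(extractions: List[Dict[str, Any]]) -> Counter:
    """Count how many extractions carry each sentiment label."""
    return Counter(extr["sentiment"] for extr in extractions)


for extr in example_extractions:
    # .get() avoids a KeyError when the optional "topics" key is absent
    print(extr["extraction"], "|", extr["sentiment"], "|", extr.get("topics", []))

print(count_sentiments(example_extractions))
```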

#### Elementary Subjects

Elementary subjects are generated labels used to classify the extractions. They are designed to surface the subjects customers raise most often, and they feed the "Top Subjects" graph on the dashboard.

Grouping similar extractions under a common elementary subject gives a structured view of the feedback and shows which issues or opinions are most prevalent.
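
A ranking of the most frequent elementary subjects, in the spirit of the "Top Subjects" graph, could be sketched as below. This is not the dashboard's implementation; it assumes each document holds an `extractions` list whose items carry an `elementary_subjects` list of strings, and the helper name `top_elementary_subjects` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_elementary_subjects(
    documents: Iterable[Dict[str, Any]], n: int = 3
) -> List[Tuple[str, int]]:
    """Return the n most frequent elementary subjects across all documents."""
    counts: Counter = Counter()
    for document in documents:
        # "or []" tolerates documents where the field is missing or None
        for extr in document.get("extractions") or []:
            counts.update(extr.get("elementary_subjects") or [])
    return counts.most_common(n)


# Made-up documents for illustration only.
example_documents = [
    {"extractions": [
        {"elementary_subjects": ["Waiting time"]},
        {"elementary_subjects": ["Staff friendliness"]},
    ]},
    {"extractions": [{"elementary_subjects": ["Waiting time", "Website clarity"]}]},
    {"extractions": None},  # a document that has not been analysed
]
print(top_elementary_subjects(example_documents, n=1))
```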

#### Topics

Topics, on the other hand, are more general, business-oriented subjects. There are fewer of them than elementary subjects, and they provide a higher-level categorization.

They capture the broad themes that matter to the business, which gives a more strategic view of the feedback and helps identify overarching areas of concern or satisfaction.

In [None]:
for extr in sample_document.extractions:
    print('Extraction:', extr['extraction'])
    print('Elementary subjects:', extr['elementary_subjects'])
    print('Topics:', extr['topics'])
    print('-' * 50)

### 2. Linking Extractions to Feedback

Each extracted topic is linked to the part of the feedback it comes from through the "splitted_analysis_v2" field, which maps every extraction to its position in the original text. The dashboard uses this mapping to highlight the relevant passages in the "Details" graph.
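
The actual layout of `splitted_analysis_v2` is not shown in this notebook. Purely as an illustration of position-based highlighting, the sketch below assumes the mapping can be reduced to non-overlapping `(start, end)` character offsets into the feedback text. The helper name `highlight`, the `[[ ]]` markers, and the sample text are all made up.

```python
from typing import Iterable, List, Tuple


def highlight(text: str, spans: Iterable[Tuple[int, int]]) -> str:
    """Wrap each (start, end) character span of text in [[ ]] markers.

    Assumes the spans do not overlap.
    """
    parts: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        parts.append(text[cursor:start])            # untouched text before the span
        parts.append("[[" + text[start:end] + "]]")  # highlighted span
        cursor = end
    parts.append(text[cursor:])                      # remaining text after the last span
    return "".join(parts)


feedback = "The site is cluttered. The staff is kind."
segments = ["site is cluttered", "staff is kind"]
# Derive offsets from the made-up segments instead of hardcoding them.
spans = [(feedback.index(s), feedback.index(s) + len(s)) for s in segments]
print(highlight(feedback, spans))
```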

In [None]:
sample_document.splitted_analysis_v2