In [13]:
import pandas as pd
import json

from tqdm import tqdm
from typing import List, Dict
from pymongo import MongoClient
import boto3
import certifi

# Data Structure Observation

The primary objective of this notebook is to observe the structure of the data in the database.

Let's retrieve some data to better understand its structure.

- The data is stored in MongoDB
- Each piece of feedback is a "document" that can contain multiple fields/metadata
- The mandatory metadata are:
    * Timestamp
    * A text field (often verbatim)
    * An ID
    * A brand (to filter our data based on the client)
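
As a rough sketch, a document carrying only the mandatory metadata could look like the dictionary below. The field names follow the columns that appear later in this notebook (`_id`, `timestamp`, `verbatim`, `brand`); the values and the `missing_mandatory_fields` helper are hypothetical and only serve as an illustration.

```python
# Field names taken from the columns shown later in this notebook.
MANDATORY_FIELDS = ("_id", "timestamp", "verbatim", "brand")

# Hypothetical document: every value here is made up for illustration.
example_document = {
    "_id": "somebrand/0123456789abcdef",
    "timestamp": 1690329600000,  # assumed to be epoch milliseconds
    "verbatim": {"text": "Great coffee"},
    "brand": "somebrand",
}


def missing_mandatory_fields(document: dict) -> list:
    """Return the mandatory field names that are absent from the document."""
    return [field for field in MANDATORY_FIELDS if field not in document]


print(missing_mandatory_fields(example_document))
```

A check like this can be handy before running an analysis on a new brand, since only these four fields are expected on every document.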

We will use the brand ColumbusCafe as an example.

Dashboard link: https://dashboard.allobrain.com/columbuscafe?filter_time_range=%5B1660521600000%2C1727733599999%5D&

Connect to the database

In [14]:
_secrets_manager_client = boto3.client("secretsmanager", region_name="eu-west-3")

# Fetch the secret and parse its JSON payload
_secrets = json.loads(
    _secrets_manager_client.get_secret_value(
        SecretId="Prod/alloreview"
    )["SecretString"]
)
MONGO_CONNECTION_STRING = (
    "mongodb+srv://alloreview:{}@feedbacksdev.cuwx1.mongodb.net".format(
        _secrets["mongodb"]["password"]
    )
)
mongo_client = MongoClient(MONGO_CONNECTION_STRING, tlsCAFile=certifi.where())

# Collection queried by the cells below
collection = mongo_client['feedbacks_db']['feedbacks_Prod']


Brand:

In [15]:
BRAND = 'columbuscafe_test'

In [59]:
# Sample 100 random documents for the selected brand

from_mongo = pd.DataFrame(list(collection.aggregate([
    {
        '$match': {
            'brand': BRAND,
        },
    },
    { "$sample" : { "size": 100 } }
])))

from_mongo.shape

(100, 16)

In [60]:
from_mongo.head()

| | _id | id | brand | timestamp | verbatim | establishment | review_site | author | rating_out_of_5 | language | review_title | extractions | splitted_analysis_v2 | topics_v2 | splitted_analysis | topics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | columbuscafe_test/2c0206086f3771dee60d | 2c0206086f3771dee60d | columbuscafe_test | 1685491000000.0 | {} | Boulogne-sur-Mer | Google | Myriam Zeghdoudi | 5.0 | fr | | | | | | |
| 1 | columbuscafe_test/984e66d8fd06d2829343 | 984e66d8fd06d2829343 | columbuscafe_test | 1682986000000.0 | {'text': 'Services trop long et les cafés sont... | Caen Rives de l’Orne | Google | oce gp | 3.0 | fr | | | | | | |
| 2 | columbuscafe_test/b11d9049001593e8ed03 | b11d9049001593e8ed03 | columbuscafe_test | 1695254000000.0 | {'text': '3👍'} | Ajaccio | Uber Eats | Angélique B | 5.0 | fr | 21-09-2023 - 20.7EUR | | | | | |
| 3 | columbuscafe_test/c6587c0bc333ed5e4336 | c6587c0bc333ed5e4336 | columbuscafe_test | 1697155000000.0 | {'text': 'Service de qualité et très bons prod... | Grenoble Alsace Lorraine | Google | Camille Viguet-carrin | 5.0 | fr | | | | | | |
| 4 | columbuscafe_test/2a75e9fa52fa6a2267b3 | 2a75e9fa52fa6a2267b3 | columbuscafe_test | 1690070000000.0 | {'text': 'Accueil chaleureux, nourriture de bo... | Aubergenville Marques Avenue | Google | Zehra Aktas | 5.0 | fr | | | | | | |


In [53]:
sample_document = from_mongo.iloc[0]

# the text of the client feedback
print(sample_document.verbatim)

{'text': '3👍'}


### Exploring the different fields of the document


In [54]:
print("Timestamp:", sample_document.timestamp)
print("Text field:", sample_document.verbatim)
print("ID:", sample_document._id)
print("Brand:", sample_document.brand)

Timestamp: 1690329600000.0
Text field: {'text': '3👍'}
ID: columbuscafe_test/88fb857fd1ce0901758d
Brand: columbuscafe_test
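
The timestamp printed above is a raw number. Assuming it is expressed in epoch milliseconds in UTC (the `filter_time_range` values in the dashboard link have the same magnitude), it can be converted to a readable date with the standard library. The helper name `ms_to_datetime` is hypothetical.

```python
from datetime import datetime, timezone


def ms_to_datetime(timestamp_ms: float) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# Value taken from the output above.
print(ms_to_datetime(1690329600000.0))  # 2023-07-26 00:00:00+00:00
```

On a whole DataFrame column, `pd.to_datetime(from_mongo["timestamp"], unit="ms")` should give the equivalent result under the same assumption.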


### Checking for additional fields

Beyond the mandatory fields, each brand has its own metadata!

In [55]:
print(sample_document.rating_out_of_5)
print(sample_document.establishment)
print(sample_document.author)

5.0
Marseille La Valentine 2
Laetitia R


## Analysis fields

### 1. Topic Extraction

First, we extract the topics that emerge from a review. The goal is to turn a long text that mixes several topics into a list of distinct, reformulated topics.

Each extracted topic can have a positive, negative, or neutral sentiment associated with it. The sentiment of each topic is indicated in the "sentiment" field.

In [56]:
print(sample_document.verbatim['text'])

sample_document.extractions

3👍


nan
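
The `nan` above shows that `extractions` can be missing for a document, in which case pandas stores a float NaN rather than a list. A minimal sketch of a guard, assuming that a populated field is always a Python list (the helper name `extractions_or_empty` is hypothetical):

```python
import math


def extractions_or_empty(value) -> list:
    """Return the extractions list, or an empty list when the field is missing."""
    # A missing field shows up as float NaN in the DataFrame, not as a list.
    return value if isinstance(value, list) else []


print(extractions_or_empty(math.nan))
print(extractions_or_empty([{"extraction": "Professionnalisme", "sentiment": "POSITIVE"}]))
```

Wrapping the field this way lets a loop over extractions run on every row without raising a `TypeError` on documents that have not been analysed.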

In [47]:
sample_document

_id                                columbuscafe_test/a0b9134143921a9809f6
id                                                   a0b9134143921a9809f6
brand                                                   columbuscafe_test
timestamp                                                 1696291200000.0
verbatim                {'text': 'Samantha et Yves rose sont charmante...
establishment                                Bourges Saintes Thorette A71
review_site                                                        Google
author                                                         Chris Prat
rating_out_of_5                                                       5.0
language                                                               fr
review_title                                                          NaN
extractions             [{'sentiment': 'POSITIVE', 'extraction': 'Sama...
splitted_analysis       [{'text': 'Samantha et Yves rose sont charmant...
topics                                

In [48]:
for extr in sample_document.extractions:
    print('Extraction:', extr['extraction'])
    print('Sentiment:', extr['sentiment'])
    print('-' * 50)

Extraction: Samantha et Yves sont charmantes
Sentiment: POSITIVE
--------------------------------------------------
Extraction: Professionnalisme
Sentiment: POSITIVE
--------------------------------------------------
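
Since each extraction carries its own sentiment, a review can be summarised by counting sentiments. The sketch below reuses the two extractions printed above as literal data; it assumes every extraction has a `sentiment` key, as is the case in this sample.

```python
from collections import Counter

# The two extractions printed above, copied as literal data.
extractions = [
    {"extraction": "Samantha et Yves sont charmantes", "sentiment": "POSITIVE"},
    {"extraction": "Professionnalisme", "sentiment": "POSITIVE"},
]

# Count how many extractions fall under each sentiment label.
sentiment_counts = Counter(extr["sentiment"] for extr in extractions)
print(sentiment_counts)
```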


**Detailed Structure of Extractions**

Upon closer examination, each object within the "splitted_analysis_v2" field contains the following information:

- Extraction: The extracted topic or subject.
- Sentiment: The sentiment associated with the extraction (positive, negative, or suggestion).
- Elementary Subjects: Generated subjects that allow us to classify the extractions.
- Topics (optional): More general and business-oriented subjects.

#### Elementary Subjects

Elementary subjects are generated subjects that help us classify the extractions. They are designed to highlight the most frequent subjects expressed by customers. These elementary subjects are displayed in the "Top Subjects" graph on the dashboard.

The purpose of elementary subjects is to provide a structured and organized way to categorize the extracted topics. By identifying common themes and grouping similar extractions together, we can gain insights into the most prevalent issues or opinions expressed by customers.

#### Topics

Topics, on the other hand, are more general and business-oriented subjects. There are fewer of them than elementary subjects, and they provide a higher-level categorization.

Topics are intended to capture broader themes or categories that are relevant to the business or domain. They allow for a more strategic view of the feedback and can help identify overarching areas of concern or satisfaction.

In [49]:
for extr in sample_document.extractions:
    print('Extraction:', extr['extraction'])
    print('Elementary subjects:', extr['elementary_subjects'])
    print('Topics:', extr['topics'])
    print('-' * 50)

Extraction: Samantha et Yves sont charmantes


KeyError: 'elementary_subjects'
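
The `KeyError` above suggests that not every extraction carries an `elementary_subjects` key (and, since topics are described as optional, possibly not a `topics` key either). A minimal sketch of a more tolerant access using `dict.get` with empty-list defaults; the helper name `describe_extraction` and the choice of defaults are assumptions.

```python
def describe_extraction(extraction: dict) -> dict:
    """Read the analysis fields of an extraction, tolerating missing keys."""
    return {
        "extraction": extraction.get("extraction"),
        # Fall back to an empty list when the key is absent.
        "elementary_subjects": extraction.get("elementary_subjects", []),
        "topics": extraction.get("topics", []),
    }


# Extraction shaped like the one that triggered the KeyError above.
sample = {"extraction": "Samantha et Yves sont charmantes", "sentiment": "POSITIVE"}
print(describe_extraction(sample))
```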

### 2. Linking Extractions to Feedback

Each extracted topic is linked to the corresponding part of the feedback in the "splitted_analysis_v2" field. This field allows us to highlight the topics in the "Details" graph on the dashboard.

The "splitted_analysis_v2" field contains information that maps the extracted topics to their respective positions within the original feedback text. This mapping enables us to visually highlight the relevant parts of the feedback when displaying the extracted topics on the dashboard.
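
A minimal sketch of how this mapping could be used, based on the first three segments of the output shown in this section. It assumes that concatenating the `text` values rebuilds the original feedback (the leading spaces in the segments suggest so) and that segments without an `extraction` key are simply not highlighted.

```python
# First three segments of the splitted_analysis_v2 output shown in this section.
segments = [
    {
        "text": "Le service laisse largement à désirer.",
        "extraction": "Service de mauvaise qualité",
        "sentiment": "NEGATIVE",
        "topics": [],
    },
    {
        "text": " La personne a la caisse n'est pas du tout compétente.",
        "extraction": "Incompétence de la personne à la caisse",
        "sentiment": "NEGATIVE",
        "topics": ["Le personnel > Amabilité du personnel"],
    },
    {"text": " Nous avons du redonner nos boisson trois fois pour enfin être servie."},
]

# Assumption: joining the "text" values rebuilds the original feedback.
full_text = "".join(segment["text"] for segment in segments)

# Keep only the segments that carry an extraction, i.e. the parts to highlight.
highlighted = [
    (segment["text"].strip(), segment["extraction"], segment["sentiment"])
    for segment in segments
    if "extraction" in segment
]

for text, extraction, sentiment in highlighted:
    print(f"[{sentiment}] {extraction!r} <- {text!r}")
```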

In [23]:
sample_document.splitted_analysis_v2

[{'text': 'Le service laisse largement à désirer.',
  'extraction': 'Service de mauvaise qualité',
  'sentiment': 'NEGATIVE',
  'topics': []},
 {'text': " La personne a la caisse n'est pas du tout compétente.",
  'extraction': 'Incompétence de la personne à la caisse',
  'sentiment': 'NEGATIVE',
  'topics': ['Le personnel > Amabilité du personnel']},
 {'text': ' Nous avons du redonner nos boisson trois fois pour enfin être servie.'},
 {'text': " J'ai demander un frappé et l'on ne m'a même pas demander le type de lait que je voulais.",
  'extraction': 'Manque de communication sur les choix de lait',
  'sentiment': 'NEGATIVE',
  'topics': []},
 {'text': ' Je ne digèrent pas le lait de vache.'},
 {'text': " J'ai donc dû demander moi même à une des serveuses après avoir entendue des clients en parler à une table.\n"},
 {'text': 'Le services à été vraiment très long.',
  'extraction': 'Service très long, Erreurs dans les commandes',
  'sentiment': 'NEGATIVE',
  'topics': ['Le personnel > Ra