# OpenClassrooms - AI Engineer
# Project 10 - Fly Me - Part 1: Data analysis and preparation
# Build a chatbot to book vacations

## Project objective:
- **Build an MVP that helps Fly Me employees easily book a plane ticket for their vacation**

## Plan:
- **Part 1: Data analysis and preparation**
    - Loading the data
    - Analysis of the raw data
    - Parsing the data for LUIS
    - Saving the parsed data for LUIS
    - Analysis of the parsed data for LUIS
    - Cleaning the parsed data for LUIS
    - Saving the parsed and cleaned data for LUIS


- **Part 2: Modelling with LUIS**
    - Loading the data
    - Creating the application (LUISAuthoringClient)
    - Creating the training examples: utterances, entities and entity values
        - Adding examples for the intent: OrderTravel
        - Adding examples for the intent: Greetings
        - Adding examples for the intent: None
    - Training the model
    - Publishing the model
    - Evaluating the model
        - Creating the application client (LUISRuntimeClient)
        - Evaluation on the test set


- **Part 3: Modelling with Microsoft Bot Framework**
    - See the GitHub repository


- **Part 4: Tests**
    - See the GitHub repository


- **Part 5: Deployment**
    - Done with Azure + GitHub Actions


- **Part 6: Monitoring**
    - Done with Azure Application Insights, plus a written methodology

## Script for Part 1: Data analysis and preparation

### Preliminary remark:
- **The code in this script is split into functions, which makes it easier to read and to maintain**

In [1]:
import pandas as pd
import numpy as np

# Loading the data

In [2]:
# pd.read_json is the public entry point for loading a JSON file
df_data = pd.read_json('../data/frames.json')

# Analysis of the raw data

In [3]:
df_data

Unnamed: 0,user_id,turns,wizard_id,id,labels
0,U22HTHYNP,[{'text': 'I'd like to book a trip to Atlantis...,U21DKG18C,e2c0fc6c-2134-4891-8353-ef16d8412c9a,"{'userSurveyRating': 4.0, 'wizardSurveyTaskSuc..."
1,U21E41CQP,"[{'text': 'Hello, I am looking to book a vacat...",U21DMV0KA,4a3bfa39-2c22-42c8-8694-32b4e34415e9,"{'userSurveyRating': 3.0, 'wizardSurveyTaskSuc..."
2,U21RP4FCY,[{'text': 'Hello there i am looking to go on a...,U21E0179B,6e67ed28-e94c-4fab-96b6-68569a92682f,"{'userSurveyRating': 2.0, 'wizardSurveyTaskSuc..."
3,U22HTHYNP,[{'text': 'Hi I'd like to go to Caprica from B...,U21DKG18C,5ae76e50-5b48-4166-9f6d-67aaabd7bcaa,"{'userSurveyRating': 5.0, 'wizardSurveyTaskSuc..."
4,U21E41CQP,"[{'text': 'Hello, I am looking to book a trip ...",U21DMV0KA,24603086-bb53-431e-a0d8-1dcc63518ba9,"{'userSurveyRating': 5.0, 'wizardSurveyTaskSuc..."
...,...,...,...,...,...
1364,U2AMZ8TLK,[{'text': 'Hi I've got 9 days free and I'm loo...,U21DMV0KA,957fd205-bb7c-4b81-8cb6-13c81c51c5c9,"{'userSurveyRating': 3.5, 'wizardSurveyTaskSuc..."
1365,U2AMZ8TLK,[{'text': 'I need to get to Fortaleza on Septe...,U260BGVS6,71b21b86-2d05-4372-a0ee-6ed64b0ddc42,"{'userSurveyRating': 4.5, 'wizardSurveyTaskSuc..."
1366,U231PNNA3,[{'text': 'We're finally going on vacation isn...,U21T9NMKM,ef2cd70e-c1f2-42be-8839-cb465af0bf41,"{'userSurveyRating': 5.0, 'wizardSurveyTaskSuc..."
1367,U2AMZ8TLK,"[{'text': 'Hi there, I'm looking for a place t...",U21DMV0KA,ffa79d2c-14eb-45e6-8573-b0817a1a1803,"{'userSurveyRating': 4.0, 'wizardSurveyTaskSuc..."
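
Each entry of the `turns` column is a list of dictionaries, one per message. The sketch below illustrates that nesting with a hand-written toy record: the values are made up for illustration, and only the field names (`text`, `labels`, `acts`, `name`, `args`, `key`, `val`) follow those read by the parsing code in this notebook.

```python
# Toy record mimicking one entry of the 'turns' column (values are made up).
toy_turns = [
    {
        "text": "Hi! I'd like to fly from Busan to Caprica.",
        "labels": {
            "acts": [
                {"name": "greeting", "args": []},
                {"name": "inform", "args": [
                    {"key": "intent", "val": "book"},
                    {"key": "or_city", "val": "Busan"},
                    {"key": "dst_city", "val": "Caprica"},
                ]},
            ]
        },
    }
]

first_turn = toy_turns[0]

# Names of the dialogue acts attached to the first message.
act_names = [act["name"] for act in first_turn["labels"]["acts"]]

# Flatten every (key, val) pair of the first message into one dictionary.
slots = {
    arg["key"]: arg["val"]
    for act in first_turn["labels"]["acts"]
    for arg in act["args"]
}

print(act_names)
print(slots)
```

On the real data, the same inspection can be applied to `df_data['turns'][0]`.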


# Parsing the data for LUIS

## Parsing function
- The **parse_data_for_luis** function parses the data so that it can later be used by the Azure LUIS service

In [4]:
def parse_data_for_luis(df_data_to_parse):
    """Return one row per dialogue: first message, intent and slot values."""

    # slot values extracted from the first message of each dialogue
    slot_keys = ['or_city', 'dst_city', 'str_date', 'end_date', 'budget']
    rows = []

    # parse the data passed as argument (not the global dataframe)
    for turns in df_data_to_parse['turns']:

        # only the first message of the dialogue is used
        first_turn = turns[0]
        acts = first_turn['labels']['acts']

        intent = ""
        slots = dict.fromkeys(slot_keys, "")

        # case where at least one act is specified
        if acts:

            # Greeting intent
            if acts[0]['name'] == 'greeting':
                intent = 'greeting'

            # explicit intent: the last act whose first argument is 'intent' wins
            for act in acts:
                if act['args'] and act['args'][0]['key'] == 'intent':
                    intent = act['args'][0]['val']

            # slot values are read from the second act when there are several acts,
            # otherwise from the only act
            source_act = acts[1] if len(acts) > 1 else acts[0]
            for arg in source_act['args']:
                if arg['key'] in slot_keys:
                    slots[arg['key']] = arg['val']

        # case where no act is specified
        else:
            intent = "none"

        # store the extracted information
        rows.append([first_turn['text'], intent] + [slots[key] for key in slot_keys])

    return pd.DataFrame(rows, columns=['text', 'intent'] + slot_keys)

In [5]:
df_data_parsed_luis = parse_data_for_luis(df_data)

In [6]:
df_data_parsed_luis

Unnamed: 0,text,intent,or_city,dst_city,str_date,end_date,budget
0,I'd like to book a trip to Atlantis from Capri...,book,Caprica,Atlantis,"Saturday, August 13, 2016",,1700
1,"Hello, I am looking to book a vacation from Go...",book,Gotham City,Mos Eisley,,,2100
2,Hello there i am looking to go on a vacation w...,book,,Gotham City,,,
3,"Hi I'd like to go to Caprica from Busan, betwe...",book,Busan,Caprica,"Sunday August 21, 2016","Wednesday August 31, 2016",
4,"Hello, I am looking to book a trip for 2 adult...",book,Kochi,Denver,,,"$21,300"
...,...,...,...,...,...,...,...
1364,Hi I've got 9 days free and I'm looking for a ...,book,,,,,
1365,I need to get to Fortaleza on September 8th or...,book,,Fortaleza,September 8th,,
1366,We're finally going on vacation isn't that ama...,book,,,,,15600
1367,"Hi there, I'm looking for a place to get away ...",book,,,,,


# Saving the parsed data for LUIS

In [7]:
df_data_parsed_luis.to_csv("luis_parsed_dataset.csv", index=False)
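
One point to keep in mind when this file is loaded again later: by default `pd.read_csv` parses empty fields as `NaN`, so the empty strings used here for missing slots do not survive a round trip. The sketch below shows the difference on a toy dataframe written to an in-memory buffer; the toy values are illustrative, and whether empty strings or `NaN` are preferable downstream is left open.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "text": ["Hi!", "To Caprica from Busan"],
    "or_city": ["", "Busan"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: the empty field is read back as NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps the empty field as an empty string.
string_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["or_city"].tolist())
print(string_read["or_city"].tolist())
```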

# Analysis of the parsed data for LUIS

In [8]:
df_data_parsed_luis

Unnamed: 0,text,intent,or_city,dst_city,str_date,end_date,budget
0,I'd like to book a trip to Atlantis from Capri...,book,Caprica,Atlantis,"Saturday, August 13, 2016",,1700
1,"Hello, I am looking to book a vacation from Go...",book,Gotham City,Mos Eisley,,,2100
2,Hello there i am looking to go on a vacation w...,book,,Gotham City,,,
3,"Hi I'd like to go to Caprica from Busan, betwe...",book,Busan,Caprica,"Sunday August 21, 2016","Wednesday August 31, 2016",
4,"Hello, I am looking to book a trip for 2 adult...",book,Kochi,Denver,,,"$21,300"
...,...,...,...,...,...,...,...
1364,Hi I've got 9 days free and I'm looking for a ...,book,,,,,
1365,I need to get to Fortaleza on September 8th or...,book,,Fortaleza,September 8th,,
1366,We're finally going on vacation isn't that ama...,book,,,,,15600
1367,"Hi there, I'm looking for a place to get away ...",book,,,,,


In [9]:
df_data_parsed_luis['intent'].unique()

array(['book', '', 'greeting', 'none'], dtype=object)

In [10]:
df_data_parsed_luis['intent'].value_counts()

book        1134
greeting     141
              91
none           3
Name: intent, dtype: int64
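
The plan for Part 2 names three LUIS intents: OrderTravel, Greetings and None. A possible way to map the raw labels counted above onto those intents is sketched below. The mapping itself is an assumption made for illustration, in particular the choice to send the empty label to None; the notebook does not specify it.

```python
import pandas as pd

# Hypothetical mapping from the raw labels to the LUIS intents named in the plan.
INTENT_MAP = {
    "book": "OrderTravel",
    "greeting": "Greetings",
    "none": "None",
    "": "None",  # assumption: unlabelled messages are treated as None
}

toy = pd.DataFrame({"intent": ["book", "greeting", "", "none", "book"]})
toy["luis_intent"] = toy["intent"].map(INTENT_MAP)

counts = toy["luis_intent"].value_counts()
print(counts)
```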

In [11]:
df_data_parsed_luis[df_data_parsed_luis['intent']=='book']

Unnamed: 0,text,intent,or_city,dst_city,str_date,end_date,budget
0,I'd like to book a trip to Atlantis from Capri...,book,Caprica,Atlantis,"Saturday, August 13, 2016",,1700
1,"Hello, I am looking to book a vacation from Go...",book,Gotham City,Mos Eisley,,,2100
2,Hello there i am looking to go on a vacation w...,book,,Gotham City,,,
3,"Hi I'd like to go to Caprica from Busan, betwe...",book,Busan,Caprica,"Sunday August 21, 2016","Wednesday August 31, 2016",
4,"Hello, I am looking to book a trip for 2 adult...",book,Kochi,Denver,,,"$21,300"
...,...,...,...,...,...,...,...
1364,Hi I've got 9 days free and I'm looking for a ...,book,,,,,
1365,I need to get to Fortaleza on September 8th or...,book,,Fortaleza,September 8th,,
1366,We're finally going on vacation isn't that ama...,book,,,,,15600
1367,"Hi there, I'm looking for a place to get away ...",book,,,,,


In [12]:
df_data_parsed_luis[df_data_parsed_luis['intent']=='greeting']

Unnamed: 0,text,intent,or_city,dst_city,str_date,end_date,budget
40,Hi!,greeting,,,,,
43,Hi! I'd like to go to Boston from Mos Eisley o...,greeting,Mos Eisley,Boston,August 15th,,
48,Heyo!,greeting,,,,,
52,Good morning.,greeting,,,,,
63,Hello wozbot!,greeting,,,,,
...,...,...,...,...,...,...,...
1188,hi there. i really wanna pretend im somewhere ...,greeting,,,,,2900
1211,Guess what? I'm a recently married person look...,greeting,osaka,manaus,,,
1223,Hi,greeting,,,,,
1251,Hi,greeting,,,,,
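
Some rows labelled `greeting` also carry booking information (rows 43 and 1211 above, for example). A minimal sketch of how such rows could be separated from pure greetings, using a toy dataframe with the same columns; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

slot_cols = ["or_city", "dst_city", "str_date", "end_date", "budget"]

toy = pd.DataFrame(
    [
        ["Hi!", "greeting", "", "", "", "", ""],
        ["Hi! I'd like to go to Boston from Mos Eisley",
         "greeting", "Mos Eisley", "Boston", "August 15th", "", ""],
        ["Good morning.", "greeting", "", "", "", "", ""],
    ],
    columns=["text", "intent"] + slot_cols,
)

# True for rows where at least one slot column is non-empty.
has_slot = toy[slot_cols].ne("").any(axis=1)
is_greeting = toy["intent"] == "greeting"

greeting_with_slots = toy[is_greeting & has_slot]
pure_greetings = toy[is_greeting & ~has_slot]

print(greeting_with_slots["text"].tolist())
print(pure_greetings["text"].tolist())
```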


In [13]:
df_data_parsed_luis[df_data_parsed_luis['intent']=='none']

Unnamed: 0,text,intent,or_city,dst_city,str_date,end_date,budget
526,"Have you ever read the book ""Vernon's Travels""?",none,,,,,
657,psssstttttt,none,,,,,
1158,Vacay time woooohooooooo,none,,,,,


# Cleaning the parsed data for LUIS

In [14]:
# work on a copy so that the parsed dataframe is left untouched
df_data_parsed_luis_cleaned = df_data_parsed_luis.copy()

## Removing the '-1' values

In [15]:
df_data_parsed_luis_cleaned = df_data_parsed_luis_cleaned.replace({'-1': ""})
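
With a plain dictionary and no `regex=True`, `DataFrame.replace` only swaps cells whose whole value equals the key, in every column, and returns a new dataframe. The sketch below shows this on toy values (made up for illustration): the `'-1'` cells are emptied while `'-1500'` is left alone.

```python
import pandas as pd

toy = pd.DataFrame({
    "or_city": ["-1", "Busan", "Kochi"],
    "budget": ["1700", "-1", "-1500"],
})

# Whole-cell match: only cells exactly equal to '-1' are replaced.
cleaned = toy.replace({"-1": ""})

print(cleaned)
```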

## Normalising the dollar amounts:
- the budget is always expressed in dollars, so the currency marker can be removed to make the values uniform

In [16]:
# strip the currency markers in a single pass; 'dollars?' matches both 'dollar' and 'dollars'
df_data_parsed_luis_cleaned['budget'] = (
    df_data_parsed_luis_cleaned['budget']
    .str.replace(r'\$|USD|dollars?', '', regex=True)
    .str.strip()
)
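
When several currency markers are stripped one after another, the order matters: removing `dollar` before `dollars` leaves a stray `s` behind. The sketch below compares chained replacements with a single regular-expression pass on a few made-up budget strings. Note that neither variant removes thousands separators such as the comma in `$21,300`.

```python
import pandas as pd

samples = pd.Series(["$21,300", "1700 USD", "2100 dollars", "900 dollar", ""])

# Chained replacement with the singular form first: "dollars" becomes "s".
chained = (
    samples
    .str.replace("dollar", "", regex=False)
    .str.replace("dollars", "", regex=False)
)

# Single pass: the optional "s" covers both forms; strip() drops leftover spaces.
single_pass = samples.str.replace(r"\$|USD|dollars?", "", regex=True).str.strip()

print(chained.tolist())
print(single_pass.tolist())
```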

In [17]:
df_data_parsed_luis_cleaned

Unnamed: 0,text,intent,or_city,dst_city,str_date,end_date,budget
0,I'd like to book a trip to Atlantis from Capri...,book,Caprica,Atlantis,"Saturday, August 13, 2016",,1700
1,"Hello, I am looking to book a vacation from Go...",book,Gotham City,Mos Eisley,,,2100
2,Hello there i am looking to go on a vacation w...,book,,Gotham City,,,
3,"Hi I'd like to go to Caprica from Busan, betwe...",book,Busan,Caprica,"Sunday August 21, 2016","Wednesday August 31, 2016",
4,"Hello, I am looking to book a trip for 2 adult...",book,Kochi,Denver,,,21300
...,...,...,...,...,...,...,...
1364,Hi I've got 9 days free and I'm looking for a ...,book,,,,,
1365,I need to get to Fortaleza on September 8th or...,book,,Fortaleza,September 8th,,
1366,We're finally going on vacation isn't that ama...,book,,,,,15600
1367,"Hi there, I'm looking for a place to get away ...",book,,,,,


# Saving the parsed and cleaned data for LUIS

In [18]:
df_data_parsed_luis_cleaned.to_csv("luis_parsed_cleaned_dataset.csv", index=False)