<h1>Group 5 - Prediction</h1>
> Functions Notebook

<span class="tocSkip"></span>

> *Authors : All*

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Environment" data-toc-modified-id="Environment-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Environment</a></span><ul class="toc-item"><li><span><a href="#Libraries-:" data-toc-modified-id="Libraries-:-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Libraries :</a></span></li><li><span><a href="#Data-loading-:" data-toc-modified-id="Data-loading-:-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data loading :</a></span></li><li><span><a href="#Functions-:" data-toc-modified-id="Functions-:-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Functions :</a></span></li></ul></li><li><span><a href="#Textual-analysis-approach" data-toc-modified-id="Textual-analysis-approach-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Textual analysis approach</a></span></li><li><span><a href="#Metadata-analysis-approach" data-toc-modified-id="Metadata-analysis-approach-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Metadata analysis approach</a></span></li><li><span><a href="#Both" data-toc-modified-id="Both-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Both</a></span></li></ul></div>

#  Introduction

This notebook gathers all the functions needed to run the other notebooks.

It is divided into 4 main parts:
- The first one sets up the environment: library imports, function imports and data loading.
- The second one gathers the functions used for the textual analysis approach, i.e. the functions needed at each step of the text analysis (pre-processing, model computing...).
- The third one is the equivalent for the metadata analysis approach.
- The fourth one contains the functions that can be used by both groups.



Every cell carries a comment explaining what the function is for.

Inside a function, one-line comments (starting with #) explain how the function works and what happens when it runs.

Larger functions carry more detailed information, written between """---""" and displayed in red.

Explanatory cells are sometimes added to interpret or conclude on the previous code cell.

# Environment

## Libraries : 

In [7]:
import pandas as pd
from langdetect import detect
from tqdm import tqdm_notebook
import spacy
from tqdm import tqdm
import en_core_web_sm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import sklearn
import pickle
import numpy as np
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import re
import string
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk import ne_chunk, pos_tag, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import gensim
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_short
from gensim.parsing.preprocessing import remove_stopwords
import scipy
from scipy import stats  # needed by test_khi_deux (stats.chi2_contingency)
from textblob import TextBlob
from textblob.en.sentiments import NaiveBayesAnalyzer
from nltk.stem import WordNetLemmatizer
import datetime
import statistics
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [8]:
# Loading the English language model of the spaCy library
nlp = spacy.load('en_core_web_sm')

## Data loading :

*Unused here.*

## Functions : 

*Unused here.*

#  Textual analysis approach

In [None]:
# Lemmatizes a comment and concatenates the lemmas into a single string.


def lemmatize(sentence: str) -> str:
    # Reusing the spaCy model loaded once in the Environment section
    # instead of reloading it on every call
    s = nlp(sentence)
    lemmatized = ''
    for w in s:
        lemmatized += w.lemma_ + ' '

    return lemmatized

In [None]:
# Cleans the comment by dropping boilerplate strings and applying a few string operations.


def preprocessing(commentary: str) -> str:
    # Dropping 'Points positifs'-like strings
    commentary = commentary.replace('Points positifs', ' ').replace(
        'Points négatifs', ' ')
    # Dropping 'Verified'-like strings and '|' separators
    commentary = commentary.replace('Trip Verified', ' ').replace('Not Verified', ' ').replace(
        'Verified Review', ' ').replace('|', ' ')
    # Converting text to lowercase
    commentary = commentary.lower()
    # Lemmatizing
    commentary = lemmatize(commentary)
    commentary = commentary.replace('-PRON-', ' ').replace("✅", " ")
    return commentary.strip()

In [None]:
# Computes the TF-IDF matrix of a given column in a given DataFrame


def TF_IDF_V0(data: pd.core.frame.DataFrame, COLUMN_NAME: str) -> pd.core.frame.DataFrame:
    # Ignoring English stop words and terms present in fewer than 50 documents
    vectorizer = TfidfVectorizer(
        stop_words="english", min_df=50)
    X = vectorizer.fit_transform(list(data[COLUMN_NAME].values.astype('U')))
    print('Shape: ', X.shape)
    # get_feature_names_out() requires scikit-learn >= 1.0
    # (older versions expose get_feature_names() instead)
    tf_idf_data = pd.DataFrame(
        X.toarray(), columns=vectorizer.get_feature_names_out())
    return tf_idf_data
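
To make the values of a TF-IDF matrix less opaque, here is a minimal pure-Python sketch of the weighting on a toy corpus. It assumes the smoothed formula documented for the default settings of scikit-learn's `TfidfVectorizer` (idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row). The corpus and the helper name `tfidf_sketch` are made up for illustration, and tokenisation is simplified to whitespace splitting, so the result is not guaranteed to match the library on real text.

```python
import math
from collections import Counter


def tfidf_sketch(corpus):
    """Hypothetical helper: smoothed TF-IDF with L2-normalised rows."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    doc_freq = {w: sum(w in doc for doc in docs) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[w] * idf[w] for w in vocab]
        # L2 normalisation (an all-zero row is left untouched)
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows


vocab, rows = tfidf_sketch(["good flight", "bad flight", "good crew"])
for row in rows:
    print(dict(zip(vocab, (round(x, 3) for x in row))))
```

In the second toy document, "bad" appears in a single document while "flight" appears in two, so "bad" receives the larger weight: rare terms are what distinguish a comment.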

In [None]:
# Function taking the data imported from the xlsx file and returning its comments


def database(data: pd.core.frame.DataFrame,
             column: str) -> pd.core.frame.DataFrame:
    # Dropping rows with empty review and realigning the row labels,
    # so that the 'index' column created below matches the DataFrame index
    data = data[data["Review"].notnull()].reset_index(drop=True)
    # Creating index
    data['index'] = range(len(data))
    # Filling empty rows
    data = data.fillna('')
    sentences = data[["index", column]]
    return sentences

In [9]:
# Cleans every sentence of the given column: removes useless strings and characters,
# keeps English comments only and converts each word to its lemma.
# Returns a list of [index, cleaned sentence] pairs.


def clean_data(sentences: pd.core.frame.DataFrame, column: str) -> list:
    nlp_en = spacy.load('en_core_web_sm')

    sentences_clean = []

    for i in tqdm(sentences["index"]):

        # Dropping 'Points positifs'-like strings
        sentence = sentences[column][i].replace('Points positifs', ' ').replace(
            'Points négatifs', ' ')

        # Dropping 'Verified'-like strings
        sentence = sentence.replace('Trip Verified', ' ').replace(
            'Not Verified', ' ').replace('Verified Review', ' ')

        # Dropping links
        sentence = re.sub(
            r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', ' ', sentence)

        # Appending the words used as hashtags a second time
        sentence = sentence + ' ' + ' '.join(re.findall(r"#(\w+)", sentence))

        # Appending the words used as @mentions a second time
        sentence = sentence + ' ' + \
            ' '.join(re.findall(r"@(\w+)", sentence))

        # Deleting punctuation
        sentence = strip_punctuation(sentence)

        semi_clean_sentence = ''

        # Lowercasing and parsing the comment
        comments = nlp_en(sentence.lower())
        # Keeping non-empty comments only
        if len(comments) != 0:
            try:
                # Keeping English comments only
                if detect(str(comments)) == 'en':
                    for token in comments:
                        # Adding the lemma of each token
                        semi_clean_sentence = semi_clean_sentence + token.lemma_ + ' '
                    # Deleting "-PRON-"
                    semi_clean_sentence = semi_clean_sentence.replace(
                        '-PRON-', '')
                    # Deleting short words and stop words
                    semi_clean_sentence = remove_stopwords(
                        strip_short(semi_clean_sentence))
                    sentences_clean.append([i, semi_clean_sentence])
            except Exception:
                # Language detection failed: print the index of the skipped comment
                print(i)
    return sentences_clean
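
The link removal and the hashtag/mention handling are the least obvious steps of `clean_data`. The sketch below replays them on a made-up sentence using only the standard `re` module and the same patterns as the function. It leaves out the later steps (punctuation stripping, language detection, lemmatization), which need gensim, langdetect and spaCy.

```python
import re

URL_PATTERN = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'

# Made-up example sentence
sentence = "Great crew! See https://example.com/review #comfort @airline"

# 1. Replace links with a space
sentence = re.sub(URL_PATTERN, ' ', sentence)

# 2. Append the hashtag words, then the mention words, a second time
sentence = sentence + ' ' + ' '.join(re.findall(r"#(\w+)", sentence))
sentence = sentence + ' ' + ' '.join(re.findall(r"@(\w+)", sentence))

print(sentence.split())
```

Each tagged word now appears twice, so it counts double in any later frequency-based representation such as TF-IDF.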

In [10]:
# Function creating the TF-IDF matrix from cleaned sentences and saving it as a csv file


def create_tfidf(sentences_clean: list, file: str):
    # Recover comments
    comments = [i[1] for i in sentences_clean]
    # Recover index
    index = [i[0] for i in sentences_clean]

    # Ignoring English stop words and terms present in less than 15% of the comments
    vectorizer = TfidfVectorizer(
        stop_words="english", min_df=0.15)
    X = vectorizer.fit_transform(comments)

    # get_feature_names_out() requires scikit-learn >= 1.0
    # (older versions expose get_feature_names() instead)
    col = list(vectorizer.get_feature_names_out())

    # Creating the matrix
    M = pd.DataFrame(X.toarray(), columns=col)

    # Adding index and comments to the TF-IDF matrix
    tfidf = np.concatenate((pd.DataFrame(index), pd.DataFrame(
        comments), M), axis=1)
    # Naming columns
    col = ['index', 'commentaire'] + col

    # Saving matrix (the output file name is fixed; `file` is only used in the message)
    pd.DataFrame(tfidf, columns=col).set_index('index').to_csv(
        "ALL_DATA_Processed_15.csv", sep=",")

    print('TF-IDF matrix of ' + file + ' saved')

In [11]:
def nb_word_comment(data: pd.core.frame.DataFrame,
                    column: str) -> pd.core.frame.DataFrame:
    # Computes the word count of each comment
    # Out : dataframe plus a column containing the word count

    df = pd.DataFrame(data)
    size = []
    for i in range(len(df[column])):
        comm = df[column][i].split()
        size.append(len(comm))
    df['nb_word_comment'] = size
    return df

In [12]:
def nbpunctuation(data: pd.core.frame.DataFrame, punct: str,
                  columns: str) -> pd.core.frame.DataFrame:
    # Counts the occurrences of the given punctuation mark in each comment
    # Out : dataframe plus a column with the punctuation count

    df = pd.DataFrame(data)
    nbpunct = []
    for i in range(len(df[columns])):
        comment = df[columns][i]
        cpt = 0
        for j in range(len(comment)):
            if comment[j] == punct:
                cpt += 1
        nbpunct.append(cpt)
    df[punct] = nbpunct
    return df

In [13]:
def sentiment_analyse(
    df: pd.DataFrame, review: str) -> pd.core.frame.DataFrame:
    # Out : dataframe with sentiment analysis features in new columns
    # polarity and subjectivity from the TextBlob module
    # polarity scores (neg, neu, pos, compound) from VADER
    # References: https://www.nltk.org/api/nltk.sentiment.html

    df["polarity"] = df[review].apply(lambda x: TextBlob(x).sentiment.polarity)
    df["subjectivity"] = df[review].apply(
        lambda x: TextBlob(x).sentiment.subjectivity)

    sid = SentimentIntensityAnalyzer()
    df["sentiments"] = df[review].apply(lambda x: sid.polarity_scores(x))
    df = pd.concat([df.drop(["sentiments"], axis=1),
                    df["sentiments"].apply(pd.Series)], axis=1)
    return df

In [14]:
def nbword(data, word: str, columns: str) -> pd.core.frame.DataFrame:
    # Counts the occurrences of the given word in each comment
    # Out : dataframe plus a column containing the word count

    df = pd.DataFrame(data)
    nbword = []
    word = word.lower()
    for i in range(len(df[columns])):
        comment = df[columns][i].split()
        comment = [item.replace(".", "") for item in comment]
        cpt = 0
        for j in range(len(comment)):
            if comment[j].lower() == word:
                cpt += 1
        nbword.append(cpt)
    df[word] = nbword
    return df

In [None]:
def nblistword(data: pd.core.frame.DataFrame,
               listword: list, columns: str,
               name: str) -> pd.core.frame.DataFrame:

    # Counts, in each comment, the occurrences of the words given in the list
    # Out : dataframe plus a column with the total word count

    for colname in listword:
        data = nbword(data, colname, columns)
    data["nb_word_" + name] = 0
    print(data)
    for colname in listword:
        data["nb_word_" + name] = data["nb_word_" + name] + data[colname]
        data.drop([colname], axis=1, inplace=True)
    return data

In [None]:
def nblisteponctuation(data: pd.core.frame.DataFrame,
                       listpoct: list, columns: str,
                       name: str) -> pd.core.frame.DataFrame:

    # Counts, in each comment, the occurrences of the punctuation marks given in the list
    # Out : dataframe plus a column with the total punctuation count

    for colname in listpoct:
        data = nbpunctuation(data, colname, columns)
    data["nb_ponct_" + name] = 0
    print(data)
    for colname in listpoct:
        data["nb_ponct_" + name] = data["nb_ponct_" + name] + data[colname]
        data.drop([colname], axis=1, inplace=True)
    return data

In [None]:
def nb_sentence(data: pd.core.frame.DataFrame,
                column: str) -> pd.core.frame.DataFrame:
    # Computes the sentence count of each comment (number of pieces obtained by splitting on '.')
    # Out : dataframe plus a column with the sentence count

    df = pd.DataFrame(data)
    nbsentence = []
    for i in range(len(df[column])):
        phrase = df[column][i].split('.')
        nbsentence.append(len(phrase))
    df['nbsentence'] = nbsentence
    return df
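
One property of counting sentences with `split('.')` is worth keeping in mind when reading the `nbsentence` feature: a comment that ends with a full stop yields one extra, empty piece. The short sketch below shows this on a made-up comment, together with one possible way (an assumption, not what `nb_sentence` does) of ignoring empty pieces.

```python
comment = "Good flight. Friendly crew."

pieces = comment.split('.')
print(pieces)         # ['Good flight', ' Friendly crew', '']
print(len(pieces))    # 3

# Possible variant: ignore pieces that are empty or whitespace only
non_empty = [p for p in pieces if p.strip()]
print(len(non_empty))  # 2
```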

# Metadata analysis approach

In [15]:
# Applying different functions on the 'Aircraft_Type' column in order to have the
# same spelling across DataFrames


def homogenize_aircraft(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    # Type str
    df['Aircraft_Type'] = df['Aircraft_Type'].astype(str)
    # Lower case
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(lambda x: x.lower())
    # Deleting rows corresponding to flights with connections
    # (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
    df = df[~df['Aircraft_Type'].str.contains(
        r'/|,|&|and|\+|then', na=False)].copy()
    # Deleting spaces and '-'
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: re.sub(r"(\s|-)", "", x))
    # The 6 next commands replace some strings by other ones
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: x.replace('embraer', 'e'))
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: x.replace('ee', 'e'))
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: x.replace('airbus', 'a'))
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: x.replace('aa', 'a'))
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: x.replace('boeing', 'b'))
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: x.replace('bb', 'b'))
    # Deleting everything from a 'v' + number version marker onwards
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: re.sub(r"v[0-9].*$", "", x))
    # Keeping only format letter + numbers
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: re.sub(r".+?(?=[a-z]{1}[0-9]{1,})", '', x))
    # Deleting everything that is after a parenthesis
    df['Aircraft_Type'] = df['Aircraft_Type'].apply(
        lambda x: re.sub(r"(\().*$", "", x))

    return(df)
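
To see what the chain of replacements in `homogenize_aircraft` does to individual values, the sketch below applies the same steps, in the same order, to a single string. `normalize_aircraft_sketch` is a hypothetical helper written for this illustration; it returns `None` where the DataFrame version would drop the row, and the aircraft names are made-up inputs.

```python
import re


def normalize_aircraft_sketch(name):
    """Hypothetical single-string version of the normalization steps."""
    name = str(name).lower()
    # Values listing several aircraft (connections) are dropped
    if re.search(r'/|,|&|and|\+|then', name):
        return None
    # Deleting spaces and '-'
    name = re.sub(r'(\s|-)', '', name)
    # Shortening manufacturer names, then collapsing doubled letters
    for old, new in (('embraer', 'e'), ('ee', 'e'),
                     ('airbus', 'a'), ('aa', 'a'),
                     ('boeing', 'b'), ('bb', 'b')):
        name = name.replace(old, new)
    name = re.sub(r'v[0-9].*$', '', name)
    name = re.sub(r'.+?(?=[a-z]{1}[0-9]{1,})', '', name)
    name = re.sub(r'(\().*$', '', name)
    return name


for raw in ["Boeing 737-800", "Airbus A380", "Embraer E-190",
            "A320 / A321", "De Havilland Dash 8"]:
    print(raw, '->', normalize_aircraft_sketch(raw))
```

The last input shows a side effect of the connection filter: `and` is matched as a substring, so a name such as "De Havilland" is treated as a connection and dropped.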

In [16]:
# Applying different functions on a given column (the one holding the cabin class)
# in order to have the same spelling across DataFrames


def homogenize_cabin_class(df: pd.core.frame.DataFrame,
                           COLUMN_NAME: str) -> pd.core.frame.DataFrame:
    # Deleting word 'Class'
    df[COLUMN_NAME] = df[COLUMN_NAME].apply(
        lambda x: x.replace(' Class', ''))
    # Lower
    df[COLUMN_NAME] = df[COLUMN_NAME].apply(
        lambda x: x.lower())

    return(df)

In [17]:
# Applying different functions on a given column (the one holding the airline name)
# in order to have the same spelling across DataFrames


def homogenize_airline(df : pd.core.frame.DataFrame
                       , COLUMN_NAME: str) -> pd.core.frame.DataFrame:
    # Lower case
    df[COLUMN_NAME] = df[COLUMN_NAME].apply(
        lambda x: x.lower())
    # Deleting spaces and '-'
    df[COLUMN_NAME] = df[COLUMN_NAME].apply(
        lambda x: re.sub("(\s|-)", "", x))
    return(df)

In [18]:
# Merging DataFrames coming from the SEATGURU_INFO_AIRCRAFT.csv and the ALL_DATA.xlsx files
# in order to get only complete rows


def merge_all_data_seatguru(df_all_data: pd.core.frame.DataFrame,
                            seatguru: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    # Merging
    new_df = pd.merge(df_all_data, seatguru,  how='left',
                      left_on=['Aircraft_Type', 'Airline_Name', 'Cabin_Class'],
                      right_on=['Aircraft_Type', 'Airline_name', 'Category'])

    # Deleting rows from the new DataFrame that are not 'complete'
    # No Aircraft Type
    new_df = new_df[new_df['Aircraft_Type'] != '']
    # No 'Total_seat' (because this one is representative of the seatguru file)
    new_df = new_df[~new_df['Total_seat'].isnull()]
    # No 'Data_Source_x' (because this one is representative of the all_data file)
    new_df = new_df[~new_df['Data_Source_x'].isnull()]

    # Dropping useless columns
    new_df = new_df.drop(['Data_Source_y',
                          'Airline_name',
                         'Category'], axis=1)

    # Renaming columns that have '_x' or '_y' at end because of merge
    new_df = new_df.rename(columns={
                           "Data_Source_x": "Data_Source",
                           "Category_x": "Category",
                           "Seat_Type_y": "Seat_Type"})

    return(new_df)

In [19]:
# Transforms a date string into a datetime by trying each known date format.
# To support a new format, add it to the DATE_FORMATS list.
# Returns "nc" when the date is missing or when no format matches.

DATE_FORMATS = ["%d %B %Y", "%d %b %Y", "%Y/%m/%d", "%d/%m/%Y %H:%M:%S",
                "%Y-%m-%d %H:%M:%S", "%d %m %Y", "%B %Y"]


def format_date(date: str):
    if date == "nc":
        return "nc"

    # Removing ordinal suffixes ("1st", "22nd", "3rd", "4th") without
    # touching month names such as "August"
    date_str = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", str(date))

    # Trying each format in turn
    for date_format in DATE_FORMATS:
        try:
            return datetime.datetime.strptime(date_str, date_format)
        except ValueError:
            continue

    print("Format not recognised. \nConsider "
          "adding a date format "
          "in the function \"format_date\".")
    return "nc"
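
Ordinal suffixes are the delicate part of this date cleaning: removing "st", "nd", "rd" or "th" everywhere in the string also damages the month name "August". The short sketch below, on a made-up date, shows the problem and one way of restricting the removal to suffixes that directly follow a digit.

```python
import re

raw = "1st August 2019"

# Blanket replacement also eats the "st" inside "August"
blanket = raw.replace("st", "")
print(blanket)   # 1 Augu 2019

# Anchoring the suffix to a preceding digit leaves month names intact
anchored = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw)
print(anchored)  # 1 August 2019
```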

In [20]:
# Function to select the class according to the seat type


def select_class_from_seat(classe: str, seat_type: str) -> str:

    # Replace only missing ("nc") values
    if classe == 'nc':
        if "Economy Class" in seat_type:
            classe = "Economy Class"
        elif "Premium Economy" in seat_type:
            classe = "Premium Economy"
        elif "Business Class" in seat_type:
            classe = "Business Class"
        elif "World Traveller Plus" in seat_type:
            classe = "Premium Economy"
    return classe

In [21]:
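# Chi-square test of independence between two columns; returns the p-value


def test_khi_deux(df: pd.core.frame.DataFrame, col1: str, col2: str) -> float:

    # Contingency table of the two columns
    contingency = pd.crosstab(df[col1], df[col2])
    chisq, p_value, dof, expected = stats.chi2_contingency(contingency)
    return p_value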
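
As a reminder of what the chi-square test measures, the sketch below computes the Pearson statistic and the degrees of freedom by hand on a made-up contingency table. `chi2_statistic_sketch` is a hypothetical helper; it stops at the statistic, since turning it into a p-value requires the chi-square distribution, which is what `stats.chi2_contingency` provides.

```python
def chi2_statistic_sketch(table):
    """Hypothetical helper: Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Made-up counts: 2 cabin classes (rows) x 3 rating levels (columns)
chi2, dof = chi2_statistic_sketch([[10, 20, 30], [30, 20, 10]])
print(chi2, dof)  # 20.0 2
```

Here every expected count is 20, so the statistic is (100 + 0 + 100 + 100 + 0 + 100) / 20 = 20 with 2 degrees of freedom. The larger the statistic for a given number of degrees of freedom, the smaller the p-value.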

In [22]:
# Chi-square computing then decision. Compares one column with every other column and keeps those with a p-value < 1%


def compare_colonne(df : pd.core.frame.DataFrame, COL : str) -> list:

    # delete empty values
    filter_df = df[df[COL].notnull()]
    keep_column = []
    columns_df = filter_df.columns
    for column in columns_df:

        # Check that the column has at least two different values
        VALUES = filter_df[column].nunique()
        if VALUES > 1 and column != COL:
            p_value = test_khi_deux(filter_df, COL, column)

            # Keeps the columns significantly dependent
            if p_value < 0.01:
                keep_column.append([column, p_value])

    return keep_column

In [23]:
# Completes the "Cabin_Class" column


def clean_Cabin_Class(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:

    # Replace all NaN values by "nc" (easier)
    df['Cabin_Class'] = df['Cabin_Class'].fillna("nc")
    df['Seat_Type'] = df['Seat_Type'].fillna("nc")
    df['Type_Of_Traveller'] = df['Type_Of_Traveller'].fillna("nc")

    # Deduce the missing classes from the seat type
    df['Cabin_Class'] = df.apply(lambda row: select_class_from_seat(
        row['Cabin_Class'], row['Seat_Type']), axis=1)

    # Fill the remaining "nc" classes with the most frequent class
    # among travellers of the same type
    for i in range(len(df['Type_Of_Traveller'])):
        if df['Cabin_Class'][i] == "nc" and df['Type_Of_Traveller'][i] != "nc":
            mode = statistics.mode(
                df[df['Type_Of_Traveller'] == df['Type_Of_Traveller'][i]]['Cabin_Class'])
            # .loc avoids chained assignment, which may silently write to a copy
            df.loc[i, 'Cabin_Class'] = mode

    return df
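
The imputation idea used here (fill a missing class with the most frequent class among travellers of the same type) can be illustrated without pandas. The sketch below works on a made-up list of records. `fill_class_sketch` is a hypothetical helper, and unlike the function above it leaves "nc" entries out of the mode computation, which is an assumption about the intended behaviour.

```python
import statistics

# Made-up records
rows = [
    {"Type_Of_Traveller": "Business", "Cabin_Class": "Business Class"},
    {"Type_Of_Traveller": "Business", "Cabin_Class": "Business Class"},
    {"Type_Of_Traveller": "Business", "Cabin_Class": "nc"},
    {"Type_Of_Traveller": "Solo Leisure", "Cabin_Class": "Economy Class"},
    {"Type_Of_Traveller": "Solo Leisure", "Cabin_Class": "nc"},
]


def fill_class_sketch(records):
    """Hypothetical helper: mode imputation of Cabin_Class per traveller type."""
    known = {}
    for rec in records:
        if rec["Cabin_Class"] != "nc":
            known.setdefault(rec["Type_Of_Traveller"], []).append(rec["Cabin_Class"])
    # Most frequent known class for each traveller type
    modes = {kind: statistics.mode(classes) for kind, classes in known.items()}
    filled = []
    for rec in records:
        new_rec = dict(rec)
        if rec["Cabin_Class"] == "nc":
            # Stay "nc" when no class is known for this traveller type
            new_rec["Cabin_Class"] = modes.get(rec["Type_Of_Traveller"], "nc")
        filled.append(new_rec)
    return filled


filled = fill_class_sketch(rows)
print([rec["Cabin_Class"] for rec in filled])
```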

In [24]:
# Completes the "Value_For_Money" column


def clean_Value_For_Money(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:

    # Binarize the features once on the whole DataFrame so that the training
    # rows and the rows to predict share exactly the same dummy columns
    features = pd.get_dummies(
        df[['Cabin_Class', 'Overall_Customer_Rating', 'Recommended']])

    # Train on the rows where Value_For_Money is known
    known = df['Value_For_Money'].notnull()
    X_train = features[known]
    Y_train = df.loc[known, 'Value_For_Money']

    # KNN with 20 neighbors
    neigh = KNeighborsClassifier(n_neighbors=20).fit(X_train, Y_train)
    # Keep the DataFrame index so that predictions align with their rows
    predict = pd.Series(neigh.predict(features), index=df.index)

    # Replace NaN values by the prediction
    df['Value_For_Money'] = df['Value_For_Money'].fillna(predict)

    return df

In [25]:
# Completes the "Type_Of_Traveller" column thanks to the "Cabin_Class" column


def clean_Type_Of_Traveller(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:

    mode = {}

    # For each cabin class, getting the most frequent type of traveller
    for unique_value in df['Cabin_Class'].unique():
        if unique_value != "nc":
            mode[unique_value] = statistics.mode(
                df[(df['Cabin_Class'] == unique_value) & (df['Type_Of_Traveller'] != "nc")]['Type_Of_Traveller'])

    df['Type_Of_Traveller'] = df.apply(lambda row: mode[row['Cabin_Class']] if row['Type_Of_Traveller']
                                       == "nc" and row['Cabin_Class'] != "nc" else row['Type_Of_Traveller'], axis=1)

    return df

In [26]:
# Gives percentage of missing values by column


def pourcent_col_missing(df: pd.core.frame.DataFrame) -> dict:

    # Take all columns name
    column = df.columns
    missing_values = {}

    # Calculate missing values for all column
    for col in column:
        missing_values[col] = 100 * len(df[df[col].isnull()][col]) / len(df)

    return missing_values

In [27]:
# Deletes columns with too many NaN values


def select_col_nal(df: pd.core.frame.DataFrame, Percent: int) -> pd.core.frame.DataFrame:

    # percentage of missing values by column
    percent_columns = pourcent_col_missing(df)

    # keep the names of the columns with less than `Percent` % of NaN values
    var_percent = [key for key, val in percent_columns.items()
                   if val < Percent]

    df = df[var_percent]

    return df

In [28]:
# Calculates the percentage of missing values by row


def pourcent_row_missing(df: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    
    # percentage of missing values by row
    df['percent_missing'] = (df.isnull().sum(axis=1))/len(list(df))*100
    
    return df

In [29]:
# Deletes rows with too many NaN values


def select_lign_nal(df: pd.core.frame.DataFrame, Percent: int) -> pd.core.frame.DataFrame:

    # missing values by row
    df = pourcent_row_missing(df)
    list_var = list(df[df['percent_missing'] > Percent].index)

    # Delete rows above the given percentage
    # (drop(..., inplace=True) returns None, so its result must not be assigned to df)
    df = df.drop(index=list_var)

    del df['percent_missing']

    return df

In [30]:
# Calculate number of modality by column


def nb_moda(df: pd.core.frame.DataFrame) -> dict:

    # Takes all columns name
    column = df.columns
    nunique_value = {}

    # Calculates number of modality for all column
    for col in column:
        nunique_value[col] = df[col].nunique()

    return nunique_value

In [31]:
# Keeps only the columns with more than 1 and fewer than nb_mod modalities


def select_moda_nal(df: pd.core.frame.DataFrame, nb_mod: int) -> pd.core.frame.DataFrame:

    # takes number of modality by column
    moda = nb_moda(df)

    # keeps only the variables whose number of modalities is between 1 and nb_mod (both excluded)
    good_mod = [key for key, val in moda.items() if val < nb_mod and val > 1]
    df = df[good_mod]

    return df

In [32]:
# Binarizes all the columns of the list except `var`


def binarized(df: pd.core.frame.DataFrame, liste: list, var: str) -> pd.core.frame.DataFrame:

    for elt in liste:
        if elt != var:
            # Transform all in int column
            df = pd.get_dummies(df, columns=[elt], dummy_na=True)
    return df

In [33]:
# At a given step, finds the column predicted with the best cross-validated score and fills its missing values


def score(df: pd.core.frame.DataFrame, step: int, tab_score: pd.core.frame.DataFrame, drop_var: list) -> tuple:
    best_score = 0.0
    column = df.columns

    for col in column:

        # Binarized dataframe except one column
        df_binari = binarized(df, column, col)
        df_bin = df_binari[df_binari[col].notnull()]
        df_nan = df_binari[df_binari[col].isnull()]

        model = LogisticRegression()
        cv_score = []

        # Logistic Regression with cross validation (KFold)
        kf = KFold(n_splits=3)
        for train_index, test_index in kf.split(df_bin):
            X_train, X_test = df_bin.drop(col, axis=1).iloc[list(
                train_index)], df_bin.drop(col, axis=1).iloc[list(test_index)]
            y_train, y_test = df_bin[col].iloc[list(
                train_index)], df_bin[col].iloc[list(test_index)]
            model.fit(X_train, y_train)
            prediction_cv = model.predict(X_test)
            cv_score.append(accuracy_score(prediction_cv, y_test))

        mean_cv_score = np.mean(cv_score)
        var_cv_score = np.var(cv_score)

        # Keep the column with the best score and predict its missing values. Be careful: a column that has already been completed is not completed again.
        if (mean_cv_score > best_score) and (col not in drop_var):
            best_score = mean_cv_score
            prediction = model.predict(df_nan.drop(col, axis=1))
            column_select = col
            df_nan_loc = df_nan

        # Store the score in the score table
        # (.loc avoids chained assignment, which may silently write to a copy)
        tab_score.loc[tab_score.index[step], col] = mean_cv_score

    # Complete all nan values on the original DataFrame
    row = df_nan_loc.index.values
    predict_value = {}

    for index, valeur in zip(row, prediction):
        predict_value[index] = valeur

    df[column_select] = df[column_select].reset_index().apply(lambda row: predict_value[row['index']]
                                                              if row['index'] in predict_value else row[column_select], axis=1)

    return tab_score, df, column_select
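
The cross-validation in `score` relies on `KFold(n_splits=3)` without shuffling. As a rough illustration of which rows end up in each fold, the sketch below reproduces the contiguous splitting documented for scikit-learn's unshuffled `KFold`, where the first `n_samples % n_splits` folds receive one extra sample. `kfold_indices_sketch` is a hypothetical helper written for this note, not part of scikit-learn.

```python
def kfold_indices_sketch(n_samples, n_splits=3):
    """Hypothetical helper: contiguous, unshuffled train/test index folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if not (start <= i < start + size)]
        yield train, test
        start += size


folds = list(kfold_indices_sketch(7))
for train, test in folds:
    print("train:", train, "test:", test)
```

Because the folds are contiguous, a DataFrame sorted by date or by data source gives folds that are not comparable to each other. Shuffling the rows beforehand is one common way to limit this effect.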

In [34]:
# Replaces NaN values with logistic regression predictions


def fill_dataset(df: pd.core.frame.DataFrame) -> tuple:

    # Creating an empty dataframe that will be completed with all columns
    var_list = []
    tab_score = pd.DataFrame(
        np.zeros((len(df.columns), len(df.columns))), columns=df.columns)

    # Completes the NaN values one column per step; the stored scores help when choosing parameters
    for i in range(len(df.columns)):
        tab_score, df, column = score(df, i, tab_score, var_list)
        var_list.append(column)
        
    return df, tab_score

In [35]:
# Changes the units of the seat angles to 'deg' for degrees, 'in' for inches.


def modif_recline_degrees(recline: str) -> str:

    # Meant to be applied to the values of the 'Recline' column of the DataFrame
    # built from the SEATGURU_INFO_AIRCRAFT.csv file
    if recline != 'nc':
        recline = str(recline).lower()
        recline = recline.replace('inches', 'in').replace('inch', 'in').replace(
            '"', ' in').replace('degrees', 'deg').replace('degree', 'deg')
        if ('deg' not in recline and 'in' not in recline):
            str_list = re.findall(r'\d+', recline)
            int_list = [int(s) for s in str_list]
            nb = np.mean(int_list)
            if nb < 90:
                recline = recline + ' in'
            else:
                recline = recline + ' deg'

    return recline

In [36]:
# Converts a recline given in inches or in degrees to a common percentage scale


def get_recline_percent(recline: str) -> float:
    if recline != 'nc':
        recline = modif_recline_degrees(recline)
        list_string = re.findall(r'\d+', recline)
        list_int = [int(n) for n in list_string]
        if 'in' in recline:
            recline = (np.mean(list_int)/18)*100
        elif 'deg' in recline:
            recline = (np.mean(list_int)/180)*100
            
    return recline
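
The conversion can be checked on a few values with the standalone sketch below. It keeps the two reference values used above (18 for inches, 180 for degrees) and the same order of tests, but `recline_percent_sketch` is a hypothetical, simplified helper: it skips the unit-guessing step of `modif_recline_degrees` and returns `None` when no number or unit is found.

```python
import re
from statistics import mean


def recline_percent_sketch(recline):
    """Hypothetical helper: recline as a percentage of a reference value."""
    text = str(recline).lower()
    numbers = [int(n) for n in re.findall(r'\d+', text)]
    if not numbers:
        return None
    # A range such as "4-6 inches" is replaced by its mean
    value = mean(numbers)
    if 'in' in text:
        return value / 18 * 100
    if 'deg' in text:
        return value / 180 * 100
    return None


for raw in ["9 inches", "90 degrees", "4-6 inches"]:
    print(raw, '->', recline_percent_sketch(raw))
```

With these reference values, 9 inches and 90 degrees both map to 50%, which is what makes the two units comparable in a single feature.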

In [37]:
# Fills a missing maximum width with the minimum width


def fill_width_max(width_min: float, width_max: float) -> float:
    if width_max is None : 
        width_max = width_min
        
    return width_max

In [38]:
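# Calculates the percentage of seats occupied by passengers


def filling_percent(total_seat: float, count: float) -> float:
    # "nc" is returned when one of the inputs is missing
    # (without this default, the return below raises UnboundLocalError)
    fill = 'nc'
    if total_seat != 'nc' and count != 'nc':
        fill = (count/total_seat)*100

    return fill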

In [39]:
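# Gives the absolute difference between the two given ratings


def rating_overall_rating_gap(rating: float, overall_rating: float) -> float:
    # "nc" is returned when one of the inputs is missing
    # (without this default, the return below raises UnboundLocalError)
    gap = 'nc'
    if rating != 'nc' and overall_rating != 'nc':
        gap = abs(rating - overall_rating)

    return gap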

In [40]:
# Returns the best parameters for the given model type.
# The grid searches are very slow, so they are commented out and the best parameters already found are returned directly.

def best_parameter(model_type: str, X_train: pd.core.frame.DataFrame, Y_train: pd.core.frame.DataFrame) -> list:
    
    list_best_parameter = []
    # Determine better parameter for KNN
    if model_type == 'knn':
        list_best_parameter = [1]
        '''param_grid2 = {
            'n_neighbors': range(1, 20, 5)
        }

        KNN = KNeighborsClassifier()
        gs = GridSearchCV(
            estimator=KNN, param_grid=param_grid2, cv=3, n_jobs=-1)
        gs.fit(X_train, Y_train)

        list_best_parameter.append(gs.best_params_.get('n_neighbors'))'''
    
    # Determine better parameters for Random Forest
    elif model_type == 'random_forest':
        list_best_parameter = [50,42,100]
        '''param_grid1 = {
            'max_depth': [30, 40],
            'random_state': [42],
            'n_estimators': [100]
        }
        
        randomF = RandomForestRegressor()
        gs = GridSearchCV(estimator=randomF, param_grid=param_grid1, cv=3, n_jobs=-1)
        gs.fit(X_train,Y_train)
        list_best_parameter.append(gs.best_params_.get('max_depth'))
        list_best_parameter.append(gs.best_params_.get('random_state'))
        list_best_parameter.append(gs.best_params_.get('n_estimators'))'''
    
    # Best parameters for LightGBM (num_leaves, max_depth, min_data_in_leaf, n_estimators)
    elif model_type == 'lgbm':
        list_best_parameter = [2500, 17, 100, 200]
        '''param = {
            'num_leaves': [2500],
            'max_depth': [15, 16, 17],
            'min_data_in_leaf': [100, 400],
            'n_estimators': [100, 150, 200]
        }
        gbm = lgb.LGBMClassifier(boosting_type='gbdt')

        gs = GridSearchCV(estimator=gbm, param_grid=param, cv=3, n_jobs=-1, verbose=1)
        gs.fit(X_train, Y_train)

        list_best_parameter.append(gs.best_params_.get('num_leaves'))
        list_best_parameter.append(gs.best_params_.get('max_depth'))
        list_best_parameter.append(gs.best_params_.get('min_data_in_leaf'))
        list_best_parameter.append(gs.best_params_.get('n_estimators'))'''
    
    return list_best_parameter
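
For reference, here is a small sketch of how a grid search such as the commented-out KNN one can be run and read. It uses synthetic data as a stand-in for `X_train` / `Y_train`; the names `X_toy`, `y_toy`, `search` and `best_k` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / Y_train
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

# Candidate values: 1, 6, 11, 16
param_grid = {"n_neighbors": list(range(1, 20, 5))}

search = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, cv=3)
search.fit(X_toy, y_toy)

# best_params_ maps each parameter name to the value with the best mean CV score
best_k = search.best_params_["n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

The value found on synthetic data says nothing about the real dataset; only the mechanics carry over.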

In [41]:
# Returns the labels that are not being predicted, so that they can be dropped


def labels_todrop(label_predict: str) -> list:

    liste_label = ['Cabin_Staff_Service', 'Seat_Comfort',
                   'Food_And_Beverages', 'Inflight_Entertainment']
    liste_label.remove(label_predict)

    return liste_label

In [42]:
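# Separate the features X from the target y (the column to predict)


def create_XY(df: pd.core.frame.DataFrame, predict: str) -> tuple:

    y = df[predict]
    # The target column is removed from df in place, so the caller's DataFrame is modified
    del df[predict]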
    X = df
    
    return X, y
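
If modifying the caller's DataFrame is not wanted, a non-mutating split is possible. The sketch below is an alternative under that assumption, not the notebook's implementation; `split_features_target` and the toy columns are hypothetical.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target: str) -> tuple:
    # drop returns a new DataFrame, so df keeps its target column
    X = df.drop(columns=[target])
    y = df[target]
    return X, y


reviews = pd.DataFrame({"Seat_Comfort": [3, 5], "recline": [10.0, 12.5]})
X, y = split_features_target(reviews, "Seat_Comfort")
```

Code that relies on the target column being removed from the original DataFrame would need to use the returned `X` instead.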

In [43]:
# Returns the accuracy, RMSE, cross-validation mean and cross-validation variance for a model. Used to choose the best model.


def modele(df: pd.core.frame.DataFrame, label_predict: str, model_type: str, percent_missing_col: int, nb_moda: int) -> list:

    
    # Remove columns that are empty or hold too few values
    df = select_col_nal(df, percent_missing_col)

    df = select_moda_nal(df, nb_moda)


    # Keep only the label to predict
    unnecessary_label = labels_todrop(label_predict)

    # drop returns a new DataFrame (with inplace=True it would return None)
    df = df.drop(unnecessary_label, axis=1)

    df = df[df[label_predict].notnull()]
    
    # Binarize string columns only (LightGBM handles categorical features natively)
    if model_type != 'lgbm':
        bin_var = df.select_dtypes(include=['object']).columns
        df = binarized(df, bin_var, label_predict)

        # Create X and y for the prediction, then replace missing values with -1
        X, y = create_XY(df, label_predict)
        X = X.fillna(-1)
    else:
        X, y, categ = prepare_data_non_binarized(df, label_predict)
    
   

    # Split into training (70%) and test (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    # Determine best parameters for model
    list_best_parameter = []
    if model_type in ('knn', 'random_forest', 'lgbm'):
        list_best_parameter = best_parameter(model_type, X_train, Y_train)

    # Predict and return all metrics
    if model_type != 'lgbm':
        accuracy, mean_error, mean_cv_score, var_cv_score = model_computing(X,
                                                                        y,
                                                                        model_type,
                                                                        label_predict,
                                                                        list_best_parameter,                                                                        
                                                                        X_train,
                                                                        Y_train,
                                                                        X_test,
                                                                        Y_test)
    else:
        accuracy, mean_error, mean_cv_score, var_cv_score = model_computing_non_binarize(X, y, categ, label_predict, X_train,
                             Y_train, X_test, Y_test)

    liste_result = [round(accuracy, 4), round(mean_error, 4),
                    round(mean_cv_score, 4), round(var_cv_score, 4)]

    return liste_result

In [44]:
# Compute every metric for every model and label. The best model is chosen from these results.

def all_model(df: pd.core.frame.DataFrame, liste_label: list, liste_model: list, tab_score: pd.core.frame.DataFrame) -> pd.core.frame.DataFrame:
    # Column suffixes: accuracy, RMSE, cross-validation mean, cross-validation variance
    types = ['_A', '_E', '_CV', '_VAR']

    for label in liste_label:
        for model in liste_model:
            # Compute all metrics for this label and model
            score = modele(df, label, model, 86, 100)
            for typ, value in zip(types, score):
                tab_score.loc[label, model + typ] = value
    return tab_score
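
The score table is indexed by label, with one column per model and metric suffix. Here is a sketch of how such a table could be prepared and filled, assuming it starts empty. `fake_scores` is a placeholder standing in for `modele`, and its numbers are made up.

```python
import pandas as pd

labels = ["Seat_Comfort", "Food_And_Beverages"]
models = ["knn", "random_forest"]
suffixes = ["_A", "_E", "_CV", "_VAR"]

# Empty score table: one row per label, one column per (model, metric) pair
tab_score = pd.DataFrame(index=labels,
                         columns=[m + s for m in models for s in suffixes],
                         dtype=float)


def fake_scores(label: str, model: str) -> list:
    # Placeholder for modele(); the values are invented for illustration
    return [0.5, 1.0, 0.45, 0.001]


for label in labels:
    for model in models:
        for suffix, value in zip(suffixes, fake_scores(label, model)):
            tab_score.loc[label, model + suffix] = value
```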

In [45]:
# Prepare the data without binarizing: string columns are label-encoded as integers


def prepare_data_non_binarized(data: pd.core.frame.DataFrame,
                               to_predict: str) -> tuple:

    # Work on a copy so that the caller's DataFrame is left untouched
    data = data.copy()

    # Getting list of object type columns
    categ = data.select_dtypes('object').columns.tolist()

    # Casting those columns to string
    data[categ] = data[categ].astype(str)

    # Encoding those columns as integer categories
    data[categ] = data[categ].apply(LabelEncoder().fit_transform)

    # Keeping only the rows with a known target, then separating features and target
    data = data[data[to_predict].notnull()]
    X, y = data.drop([to_predict], axis=1), data[to_predict]

    return X, y, categ
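
To make the encoding step concrete, here is a small sketch on toy data. The DataFrame and its `cabin` / `rating` columns are invented for illustration. `LabelEncoder` assigns integers following the sorted order of the distinct values in each column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "cabin": pd.Series(["Economy", "Business", "Economy"], dtype=object),
    "rating": [3, 5, 4],
})

# Object columns are the ones to encode
categ = toy.select_dtypes("object").columns.tolist()
toy[categ] = toy[categ].astype(str)

# Each column is encoded independently: 'Business' -> 0, 'Economy' -> 1
toy[categ] = toy[categ].apply(LabelEncoder().fit_transform)
```

The integer codes carry no ordering meaning, which is why the column names in `categ` are later declared as categorical features.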

In [46]:
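# Trains a LightGBM model on label-encoded data, computes RMSE, accuracy and cross-validation scores, then saves the model.


def model_computing_non_binarize(X: pd.core.frame.DataFrame,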
                                 y: pd.core.frame.DataFrame,
                                 categ: list,
                                 value_to_predict: str,
                                 X_train: pd.core.frame.DataFrame,
                                 Y_train: pd.core.frame.DataFrame,
                                 X_test: pd.core.frame.DataFrame,
                                 Y_test: pd.core.frame.DataFrame) -> tuple:

    lgb_train = lgb.Dataset(X_train, Y_train, categorical_feature=categ)
    lgb_eval = lgb.Dataset(X_test, Y_test, reference=lgb_train)

    # Parameters for LightGBM
    params = {
        'num_leaves': 2500,
        'max_depth': 17,
        'min_data_in_leaf': 100,
        'n_estimators': 200
    }

    # Training
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=80,
                    valid_sets=lgb_eval)

    prediction = gbm.predict(X_test)
    # RMSE
    mean_error = rmse(prediction, Y_test)

    # Accuracy score
    accuracy = accuracy_score(np.round(prediction), Y_test)

    # Cross validation
    cv_score = []
    kf = KFold(n_splits=4, shuffle=True)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
        y_train, y_test = y.iloc[list(train_index)], y.iloc[list(test_index)]
        lgb_train = lgb.Dataset(X_train, y_train, categorical_feature=categ)
        lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
        gbm = lgb.train(params,
                        lgb_train,
                        num_boost_round=80,
                        valid_sets=lgb_eval)
        prediction_cv = gbm.predict(X_test)
        cv_score.append(accuracy_score(np.round(prediction_cv), y_test))

    mean_cv_score = np.mean(cv_score)
    var_cv_score = np.var(cv_score)

    # Saving the model to disk (gbm is the model trained on the last cross-validation fold)
    filename = 'g5_' + value_to_predict + '_lgbm.sav'
    with open(filename, 'wb') as file:
        pickle.dump(gbm, file)
    print('File saved as ' + filename)

    return accuracy, mean_error, mean_cv_score, var_cv_score
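
A saved `.sav` file can be loaded back with `pickle.load`. The sketch below shows the round trip with a scikit-learn `DummyRegressor` as a stand-in, so that it runs without LightGBM. The temporary directory and the stand-in model are assumptions made for the example.

```python
import os
import pickle
import tempfile

from sklearn.dummy import DummyRegressor

# Stand-in model; the function above pickles a LightGBM booster instead
model = DummyRegressor(strategy="mean").fit([[0], [1], [2]], [1.0, 2.0, 3.0])

# Same naming scheme, written to a temporary directory for this example
filename = os.path.join(tempfile.mkdtemp(), "g5_" + "Seat_Comfort" + "_lgbm.sav")
with open(filename, "wb") as file:
    pickle.dump(model, file)

# Reload the model and use it for prediction
with open(filename, "rb") as file:
    reloaded = pickle.load(file)

prediction = reloaded.predict([[5]])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.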

# Both

In [47]:
# Plots a bar chart of a metric score by label and model


def plot_bars_model(labels_color: list, lists_models: list, labels_bar: list, measure: str):
    """
    Plots a grouped bar chart with one group per label and one bar per model.

        Parameters:
            labels_color: list of the labels shown in the legend (one per model)
            lists_models: list of the values to plot, grouped by model
            labels_bar: list of the labels shown on the x-axis
            measure: name of the measure shown on the y-axis
    """
    fig, ax = plt.subplots()
    width = 0.20
    positions = np.arange(len(lists_models[0]))
    # One series of bars per model, each shifted by one bar width
    for i, values in enumerate(lists_models):
        ax.bar(positions + i * width, values, width, label=labels_color[i])
    ax.set_ylabel(measure)
    ax.set_title(measure + ' by label and model')
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xticks(positions + width, labels_bar)
    plt.show()
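
The `lists_models` argument holds one inner list per model, each with one value per label. The sketch below shows that layout with made-up accuracy values, and how the bar offsets follow from it. It uses the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

models = ["knn", "random_forest", "lgbm"]
labels = ["Seat_Comfort", "Food_And_Beverages"]
# Invented values: one inner list per model, one value per label
scores = [[0.40, 0.42], [0.45, 0.47], [0.50, 0.51]]

fig, ax = plt.subplots()
width = 0.20
positions = np.arange(len(labels))
for i, values in enumerate(scores):
    # Each model's bars are shifted right by i bar widths
    ax.bar(positions + i * width, values, width, label=models[i])
ax.set_xticks(positions + width)
ax.set_xticklabels(labels)
ax.set_ylabel("Accuracy")
ax.legend()
```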

In [48]:
# Computes the Root Mean Squared Error between targets and predictions


def rmse(predictions: list, targets: list) -> float:
    return np.sqrt(mean_squared_error(predictions, targets))
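
As a worked example, the RMSE is $\sqrt{\frac{1}{n}\sum_{i}(\hat{y}_i - y_i)^2}$, which can be checked by hand with NumPy alone. The values below are invented for illustration.

```python
import numpy as np

targets = np.array([3.0, 4.0, 5.0, 2.0])
predictions = np.array([2.0, 4.0, 4.0, 4.0])

errors = predictions - targets   # [-1, 0, -1, 2]
mse = np.mean(errors ** 2)       # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_value = np.sqrt(mse)        # sqrt(1.5), about 1.2247
```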

In [49]:
# Computes a given model on given data, predicts and computes RMSE, accuracy score, cross validation, then saves the model.


def model_computing(X: pd.core.frame.DataFrame,
                    y: pd.core.frame.DataFrame,
                    model_type: str,
                    value_to_predict: str,
                    list_param: list,
                    X_train: pd.core.frame.DataFrame,
                    Y_train: pd.core.frame.DataFrame,
                    X_test: pd.core.frame.DataFrame,
                    Y_test: pd.core.frame.DataFrame) -> tuple:
    """
    Trains the chosen model, predicts on the test split, computes the metrics, then saves the model to disk.

        Parameters:
            X: Preprocessed DataFrame containing features
            y: Preprocessed DataFrame containing labels
            model_type: One of 'knn', 'logistic_regression', 'random_forest'
            value_to_predict: Label name
            list_param: List of parameters; its length and content types depend on model_type
            X_train: Split preprocessed DataFrame containing training features
            Y_train: Split preprocessed DataFrame containing training labels
            X_test: Split preprocessed DataFrame containing test features
            Y_test: Split preprocessed DataFrame containing test labels

        Out:
            Tuple containing:
                accuracy: Accuracy score on the test split
                mean_error: RMSE score on the test split
                mean_cv_score: Mean of the accuracy scores computed by cross-validation
                var_cv_score: Variance of the accuracy scores computed by cross-validation

    """
    if model_type == 'knn':
        model = KNeighborsClassifier(n_neighbors=list_param[0])

    elif model_type == 'logistic_regression':
        model = LogisticRegression()

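    elif model_type == 'random_forest':
        model = RandomForestRegressor(
            max_depth=list_param[0], random_state=list_param[1], n_estimators=list_param[2])

    else:
        # Fail early with a clear message instead of an undefined 'model' below
        raise ValueError('Unknown model_type: ' + str(model_type))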

    # Training model and predicting
    model.fit(X_train, Y_train)
    prediction = model.predict(X_test)

    # RMSE
    mean_error = rmse(prediction, Y_test)

    # Creation of a DataFrame with predictions
    prediction = pd.DataFrame(prediction, columns=["Label"])
    prediction["index"] = X_test.index

    # Accuracy score
    accuracy = accuracy_score(np.round(prediction["Label"]), Y_test)

    # Cross validation
    cv_score = []
    kf = KFold(n_splits=3)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
        y_train, y_test = y.iloc[list(train_index)], y.iloc[list(test_index)]
        model.fit(X_train, y_train)
        prediction_cv = model.predict(X_test)
        cv_score.append(accuracy_score(np.round(prediction_cv), y_test))

    mean_cv_score = np.mean(cv_score)
    var_cv_score = np.var(cv_score)

    # Saving the model to disk (the model was last fitted on the final cross-validation fold)
    filename = 'g5_' + value_to_predict + '_' + model_type + '.sav'
    with open(filename, 'wb') as file:
        pickle.dump(model, file)
    print('File saved as ' + filename)

    return accuracy, mean_error, mean_cv_score, var_cv_score
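
For classifiers, the manual `KFold` loop gives the same fold accuracies as scikit-learn's `cross_val_score`. The sketch below compares the two on synthetic data; `X_toy`, `y_toy` and the choice of `n_neighbors=3` are illustrative. It does not cover the regressor case, where predictions are rounded before the accuracy is computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_toy, y_toy = make_classification(n_samples=90, n_features=4, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)
kf = KFold(n_splits=3)

# Manual loop: fit on the training folds, score on the held-out fold
manual_scores = []
for train_index, test_index in kf.split(X_toy):
    model.fit(X_toy[train_index], y_toy[train_index])
    fold_prediction = model.predict(X_toy[test_index])
    manual_scores.append(accuracy_score(y_toy[test_index], fold_prediction))

# Same folds handled by scikit-learn
auto_scores = cross_val_score(model, X_toy, y_toy, cv=kf, scoring="accuracy")

mean_cv_score, var_cv_score = np.mean(auto_scores), np.var(auto_scores)
```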