## Measuring execution time: ```%%timeit``` and ```%timeit```
With ```%%timeit```, the measurement covers all the code in the cell (*cell mode*), whereas with ```%timeit``` it covers only the statement written on the same line, after the magic (*line mode*).

The code is executed *n × r* times and the result is reported as ```n loops, best of r: xxx ms per loop```. Read this as: the code was run in *r* blocks of *n* executions each, and the best block of *n* executions had a mean execution time of xxx ms. To set *r* and *n*, use the following syntax:

```%%timeit -n 10 -r 3```

*Note*: *r* defaults to 7 in recent IPython releases (older releases used 3). *n* has no fixed default: it is chosen according to the code under test and can be very large (e.g. 10,000,000).

Note also that ```%%timeit``` and ```%timeit``` appear to wrap the code to run in a function. Every variable assigned in the timed cell is therefore assigned in the local scope of that function, which is destroyed when execution ends. The results stored in those variables cannot be retrieved afterwards.
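The scoping effect described above can be reproduced with an ordinary function. This is only a sketch of the mechanism, not IPython's actual wrapper; `timed_body` and `cell_result` are hypothetical names chosen for the illustration.

```python
def timed_body():
    # Stands in for the content of a timed cell once it is wrapped in a function
    cell_result = sum(range(1000))


timed_body()

# The assignment happened in the function's local scope, so nothing leaked out
result_survived = 'cell_result' in globals()
print(result_survived)
```

If the value is needed afterwards, one option is to compute it once in a separate, untimed cell.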

In [22]:
%%timeit -n 10 -r 5

str(1) == "1"

10 loops, best of 5: 191 ns per loop
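Outside a notebook, the same *n* and *r* logic can be approximated with the standard library `timeit` module. The sketch below assumes that `number` plays the role of `-n` and `repeat` the role of `-r`; the variable names are illustrative.

```python
import timeit

n, r = 10, 5  # same roles as -n and -r above

# timeit.repeat returns one total time (in seconds) per block of n executions
block_times = timeit.repeat('str(1) == "1"', number=n, repeat=r)

# Keep the best block and average it over its n executions
best_per_loop = min(block_times) / n
print('%d loops, best of %d: %.0f ns per loop' % (n, r, best_per_loop * 1e9))
```

The reported figure varies from one machine and one run to the next, so no expected value is shown.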


There is also a ```%%time``` magic, which runs the cell once and reports its CPU and wall-clock time.

## ```timer``` decorator

In [2]:
import time

# print() is called with a single argument so the code runs on both Python 2 and Python 3
def timer(f):
    def wrapper(*args, **kwargs):
        t1 = time.time()
        result = f(*args, **kwargs)
        duration = time.time() - t1
        print('Total function runtime : %(hours)02dh%(mins)02dm%(secs)02ds%(ms)03dms' %
              {'hours': duration // 3600, 'mins': (duration % 3600) // 60,
               'secs': duration % 60, 'ms': (duration % 1) * 1000 // 1})
        return result  # pass the wrapped function's return value through
    return wrapper

In [3]:
@timer
def test_func(n):
    time.sleep(n)
    
test_func(2.365)

Total function runtime : 00h00m02s367ms
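The duration formatting used by the decorator can also be isolated in a small helper built on `divmod`, which makes it easier to check on its own. This is a sketch: `format_duration` is a hypothetical name, and unlike the decorator above it rounds to the nearest millisecond instead of truncating.

```python
def format_duration(seconds):
    """Format a duration in seconds as HHhMMmSSsmmmms."""
    # Work in whole milliseconds to avoid floating-point remainders
    total_ms = int(round(seconds * 1000))
    secs, ms = divmod(total_ms, 1000)
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return '%02dh%02dm%02ds%03dms' % (hours, mins, secs, ms)


print(format_duration(2.367))     # 00h00m02s367ms
print(format_duration(3725.004))  # 01h02m05s004ms
```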


## Spark ```gather``` function
Spark equivalent of ```pandas.melt```, or of ```tidyr::gather``` for R users. It is the inverse of a pivot. Its advantage is that, unlike a pivot, the operation requires no ```groupBy``` and therefore no *shuffle*. In other words, it can be carried out partition by partition without any exchange of data between executors.

**Important note**: the values that end up in the same column must all have the same type. There is no automatic type conversion. For example, applying the function to columns that mix ```IntegerType()``` and ```FloatType()``` will fail.
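To make the reshaping concrete, here is a plain-Python sketch of the same operation on a list of dicts. It is not Spark code and only illustrates the expected result; `melt_rows` is a hypothetical helper whose arguments mirror those of the Spark function defined in the next cell.

```python
def melt_rows(rows, var_col_name, val_col_name, pivot_col_name_list, drop_na=False):
    """Reverse pivot on a list of dicts: one output row per (input row, melted column)."""
    melted = []
    for row in rows:
        # Columns that are not melted are copied unchanged on every output row
        kept = {k: v for k, v in row.items() if k not in pivot_col_name_list}
        for name in pivot_col_name_list:
            value = row[name]
            if drop_na and value is None:
                continue
            new_row = dict(kept)
            new_row[var_col_name] = name
            new_row[val_col_name] = value
            melted.append(new_row)
    return melted


rows = [{'id': 1, 'day': 2, 'month': 9},
        {'id': 2, 'day': None, 'month': 9}]
melted = melt_rows(rows, 'sensor', 'value', ['day', 'month'], drop_na=True)
for r in melted:
    print(r)
```

Each input row is processed independently of the others, which is why the operation needs no shuffle.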

In [None]:
def melt(df, var_col_name, val_col_name, pivot_col_name_list, drop_na):
    """
    Implements a melt operation in Spark (reverse pivot)
    :param df : Input Spark DataFrame (pyspark.sql.DataFrame)
    :param var_col_name : Name of the future column that holds the names of all the melted columns (string)
    :param val_col_name : Name of the future column that holds the values of all the melted columns (string)
    :param pivot_col_name_list : Names of the columns to melt (list of strings)
    :param drop_na : If True, drops NA values in val_col_name (boolean)
    """
    import pyspark.sql.functions as sqlf
    
    bcktk_pivot_col_name_list = ['`' + x.replace('`', '') + '`' for x in pivot_col_name_list]
    
    var_val_pairs = sqlf.array(*[sqlf.struct(sqlf.lit(var).alias(var_col_name), sqlf.col(var).alias(val_col_name)) 
                                 for var in bcktk_pivot_col_name_list])
    
    cols = [sqlf.col(name) for name in df.columns if '`' + name + '`' not in bcktk_pivot_col_name_list] + \
                [sqlf.col('_tmp_nested_col')[var_col_name].alias(var_col_name),
                sqlf.col('_tmp_nested_col')[val_col_name].alias(val_col_name)]
    
    melted_df = df.withColumn('_tmp_nested_col', sqlf.explode(var_val_pairs))\
            .select(*cols)\
            .withColumn(var_col_name, sqlf.regexp_replace(var_col_name, '`', ''))
            
    if drop_na:
        melted_df = melted_df.na.drop(subset=[val_col_name])
    
    return melted_df

Details on how ```explode``` is used:

In [None]:
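# Imports needed by this cell and the next one (an active `spark` session is assumed)
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql.types import IntegerType

columns = ['id', 'day']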
vals = [(1, 2),
        (2, 2),
        (3, 2),
        (4, 2),
        (5, 2),
        (6, 2),
        (7, 4),
        (8, 4),
        (9, 4),
        (10, 4),
        (11, 4),
        (12, 5),
        (13, 5),
        (14, 5)]

# create DataFrame
dfe = spark.createDataFrame(vals, columns)\
        .withColumn('month', lit(9))\
        .withColumn('day', col('day').cast(IntegerType()))

In [None]:
dfe.select(
    col('id'),
    explode(array(struct(lit('day').alias('var_name'), col('day').alias('value')),
                 struct(lit('month').alias('var_name'), col('month').alias('value')))).alias('gather'))\
    .show()

In [None]:
# The function defined above is named melt and expects a drop_na argument
tst = melt(dfe, 'sensor', 'value', ['day', 'month'], drop_na=False)
tst.show()

## Spark Metastore utilities

In [None]:
def spark_list_tables():
    list_tables = spark.catalog.listTables()
    table_list_cols = ['tableName', 'database', 'description', 'tableType', 'isTemporary']
    return pd.DataFrame([dict(zip(table_list_cols, list(x))) for x in list_tables], columns=table_list_cols)

def spark_list_databases():
    list_db = spark.catalog.listDatabases()
    db_list_cols = ['dbName', 'description', 'locationUri']
    return pd.DataFrame([dict(zip(db_list_cols, list(x))) for x in list_db], columns=db_list_cols)

In [35]:
import subprocess
import pandas as pd

old_table_path = '/apps/hive/warehouse/historic_measurement'
test_list = [u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=1',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=10',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=2',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=3',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=4',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=5',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=6',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=7',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=8',
 u'aircraft=a380/registration=F-HPJB/departure_year=2018/departure_month=9']

records = []
for part in test_list:
    out = subprocess.check_output('hdfs dfs -count ' + old_table_path + '/' + part, shell=True)
    out_splt = [x for x in out.replace('\n', '').split(' ') if x != '']
    records.append(dict(zip(['file_nb', 'file_size_bytes', 'partition'], out_splt[1:])))
    
part_data = pd.DataFrame(records)\
    .assign(**{'file_nb' : lambda df : df['file_nb'].astype('int64'),
             'file_size_bytes' : lambda df : df['file_size_bytes'].astype('int64')})\
    .assign(**{'file_size_GiB' : lambda df : df['file_size_bytes']/(1024)**3})\
    .loc[:,['file_nb', 'file_size_bytes', 'file_size_GiB', 'partition']]
    
part_data

| | file_nb | file_size_bytes | file_size_GiB | partition |
|---|---|---|---|---|
| 0 | 1716 | 1817447582 | 1.69263 | /apps/hive/warehouse/historic_measurement/airc... |
| 1 | 2938 | 3366535298 | 3.13533 | /apps/hive/warehouse/historic_measurement/airc... |
| 2 | 2739 | 3250229880 | 3.027012 | /apps/hive/warehouse/historic_measurement/airc... |
| 3 | 382 | 324940148 | 0.302624 | /apps/hive/warehouse/historic_measurement/airc... |
| 4 | 2652 | 3168396437 | 2.950799 | /apps/hive/warehouse/historic_measurement/airc... |
| 5 | 2558 | 2975102376 | 2.77078 | /apps/hive/warehouse/historic_measurement/airc... |
| 6 | 3034 | 3252658429 | 3.029274 | /apps/hive/warehouse/historic_measurement/airc... |
| 7 | 3536 | 4183635325 | 3.896314 | /apps/hive/warehouse/historic_measurement/airc... |
| 8 | 2586 | 2842847892 | 2.647608 | /apps/hive/warehouse/historic_measurement/airc... |
| 9 | 3140 | 3596756801 | 3.349741 | /apps/hive/warehouse/historic_measurement/airc... |
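The parsing step of the cell above can be isolated in a small function that is testable without a cluster. This is a sketch that assumes the usual four-column output of `hdfs dfs -count` (directory count, file count, content size, path); `parse_hdfs_count` is a hypothetical name, and the sample line uses a shortened, made-up path.

```python
def parse_hdfs_count(line):
    """Parse one output line of `hdfs dfs -count` into a dict."""
    # Split on runs of whitespace; maxsplit=3 keeps a path containing spaces intact
    dir_count, file_count, content_size, path = line.strip().split(None, 3)
    size_bytes = int(content_size)
    return {
        'dir_nb': int(dir_count),
        'file_nb': int(file_count),
        'file_size_bytes': size_bytes,
        'file_size_GiB': size_bytes / float(1024 ** 3),
        'partition': path,
    }


# Hypothetical sample line: counts taken from the first row above, path shortened
sample = '           1         1716         1817447582 /apps/hive/warehouse/example\n'
record = parse_hdfs_count(sample)
print(record['file_nb'], round(record['file_size_GiB'], 5))
```

Under Python 3, `subprocess.check_output` returns bytes, so the output would need to be decoded before being passed to this function.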


In [42]:
for part in test_list[0:1]:
    out = subprocess.check_output('hdfs dfs -ls ' + old_table_path + '/' + part, shell=True)
    out_splt = [x for x in out.split('\n')]
    out_splt

### Pretty print dict

In [3]:
def pprintDict(dc, i=0):
    # Recursively print a nested dict as an indented tree (runs on Python 2 and 3)
    for k, v in dc.items():
        if isinstance(v, dict):
            print('    '*i + '|-' + str(k))
            pprintDict(v, i+1)
        else:
            print('    '*i + '|-' + str(k) + ' : ' + str(v))

In [1]:
import os, json
dir_param = '/app/PROFILER/travail/pilienhart/Test_refactoring'

with open(os.path.join(dir_param, 'generate_param_ata28_pmp_dp.json'), "r") as f:
    d = json.load(f)

In [16]:
d['ata28_pmp_dp']['param']['load_pipeline']

TypeError: list indices must be integers, not str

In [17]:
for dc in d['ata28_pmp_dp'][0]['param']['load_pipeline']:
    pprintDict(dc)

|-function : merge_sar_with_meta
|-module : profiler.load
|-param
    |-sensors : [u'_CURM_9QL1_3.2', u'FPMPBRITEN.1', u'FQWT.RM:1', u'FPMPARITPS.1', u'_CURM_9QL2_1.2', u'FQCC.4:1', u'FPMPBLITPS.1', u'_CURM_10QL1_3.2', u'FPMPBLITEN.1', u'FQFT.3:1', u'FQFT.4:1', u'FPMPBROTPS.1', u'_CURM_5QA1_2.2', u'FPMPF4MPS', u'_CURM_13QN2_3.1', u'_CURM_5QA1_3.2', u'FQWT.LI:1', u'_CURM_5QA4_3.1', u'FPMPF1MEN.1', u'_CURM_5QA4_2.1', u'_CURM_6QN2_1.1', u'FPMPBLOTPS.1', u'_CURM_11QL2_2.1', u'_CURM_804QA_2.2', u'FQCC.3:1', u'FPMPALITEN.1', u'FPMPF1MPS.1', u'_CURM_804QA_1.2', u'_CURM_5QA4_1.1', u'_CURM_11QL1_3.1', u'_CURM_20QN1_3.1', u'_CURM_804QA_3.2', u'FQCC.2:1', u'FPMPF4MEN', u'_CURM_13QN1_1.1', u'FPMPBLMTPS.1', u'_CURM_10QL2_3.2', u'FPMPBLOTEN.1', u'_CURM_5QA1_1.2', u'FPMPBLMTEN.1', u'FPMPALMTEN.1', u'FPMPBRITPS.1', u'FQFT.1:1', u'FPMP2TTPS.1', u'_CURM_20QN2_2.1', u'FQCC.1:1', u'FPMPARMTPS.1', u'FPMP2TTEN.1', u'FPMPBRMTPS.1', u'FPMPARMTEN.1', u'FPMPALMTPS.1', u'FPMPBROTEN.1', u'FPMPF3MEN.1', u'FPMPALIT

In [22]:
with open(os.path.join(dir_param, 'global_settings.json'), "r") as f:
    dparam = json.load(f)

In [23]:
dparam.keys()

[u'prd', u'clean_mongo', u'ite', u'sparkparams', u'csv_to_parquet', u'cae']

In [20]:
pprintDict(dparam)

|-global_settings
    |-prd
        |-mongo_db : profiler
        |-hive_measurement_table_name : measurement
        |-hive_flight_table_name : flight
        |-nb_days_hdfs_sliding_window_hive : 90
        |-dir_fs_csv_a380 : /app/PROFILER/data/hdfs
        |-dir_hdfs_resources : /user/profiler/resources
        |-spark.local.dir : /app/PROFILER/tmp
        |-nb_days_hdfs_sliding_window_csv : 3
        |-dir_hdfs_csv : /user/profiler/csv
        |-mongo_client : mongodb://PFI_O:Profil3rO@tlspbpfimg01.france.airfrance.fr:27017,tlspbpfimg02.france.airfrance.fr:27017/profiler?replicaSet=rs606
    |-clean_mongo
        |-aircraft
            |-a380
                |-metainformations
                    |-departure_date : departureDate
                    |-equipment_name : equipmentName
                    |-immat : registration
                    |-time : time
                |-ata
                    |-ata32
                        |-equipment_type
                            |-bws
  

In [34]:
d.keys()

[u'prd', u'clean_mongo', u'ite', u'sparkparams', u'csv_to_parquet', u'cae']

In [39]:
'cae' not in ['ite']



True

In [27]:
with open(os.path.join(dir_param, 'global_settings.json'), "r") as f:
    d = json.load(f)
    
pprintDict(d)

|-prd
    |-mongo_db : profiler
    |-dir_fs_csv_a380 : /app/PROFILER/data/hdfs
    |-dir_hdfs_resources : /user/profiler/resources
    |-spark.local.dir : /app/PROFILER/tmp
    |-mongo_client : mongodb://PFI_O:Profil3rO@tlspbpfimg01.france.airfrance.fr:27017,tlspbpfimg02.france.airfrance.fr:27017/profiler?replicaSet=rs606
    |-dir_hdfs_csv : /user/profiler/csv
    |-data_storage : HDFS
    |-output_storage : mongodb
|-clean_mongo
    |-aircraft
        |-b777
            |-metainformations
                |-departure_date : departureDate
                |-immat : registration
                |-time : acarsTime
            |-ata
                |-ata28
                    |-equipment_type
                        |-datafuel
                            |-collection : measurementsB777Ata28Datafuel
                            |-nb_flights_30_min : 1
                            |-components : [u'fuelTemp', u'fuelQtyLeft', u'fuelQtyCenter', u'fuelQtyRight', u'fuelQtyTotal', u'densityLeft', 

In [29]:
print('PREPROCESSING PIPELINE')
for i, dc in enumerate(d[u'ata28_pmp_dp'][0]['param']['preprocessing_pipeline']):
    print('Preprocessing function #' + str(i+1))
    pprintDict(dc)
    print(dc['function'] + '(')
    print('\n')

print('AGGREGATE PIPELINE')
for i, dc in enumerate(d[u'ata28_pmp_dp'][1]['param']['aggregate_functions_spark']):
    print('Aggregate function #' + str(i+1))
    pprintDict(dc)
    print('\n')

PREPROCESSING PIPELINE
Preprocessing function #0
|-function : add_unix_time
|-param
|-module : profiler.preprocessing


Preprocessing function #1
|-function : order_by_list
|-param
    |-column_name : [u'aircraft', u'registration', u'departure_date', u'flight_leg_count', u'time']
|-module : profiler.preprocessing


Preprocessing function #2
|-function : flight_duration
|-param
    |-partition_by : [u'aircraft', u'registration', u'departure_date', u'flight_leg_count']
|-module : profiler.preprocessing


Preprocessing function #3
|-function : add_flight_chrono
|-param
    |-return_name : _unix_time_from_start
    |-partition_by : [u'aircraft', u'registration', u'departure_date', u'flight_leg_count']
|-module : profiler.preprocessing


Preprocessing function #4
|-function : filter_on_column
|-param
    |-condition : >3600
    |-column_name : flight_duration
|-module : profiler.preprocessing


Preprocessing function #5
|-function : add_column_operation
|-param
    |-operation : * 0.1
    |

In [61]:
def rebuild_func(dct):
    if not dct['param']:
        return '.pipe(' + dct['function'] + ')'
    
    return '.pipe(partial(' + ','.join([dct['function']] + [k + '=' + str(v) for k, v in dct['param'].items()]) + '))'

In [63]:
print('df\\')
for i, dc in enumerate(d[u'ata28_pmp_dp'][0]['param']['preprocessing_pipeline']):
    print('\t' + rebuild_func(dc) + '\\')  # to be refined


df\
	.pipe(add_unix_time)\
	.pipe(partial(order_by_list,column_name=[u'aircraft', u'registration', u'departure_date', u'flight_leg_count', u'time']))\
	.pipe(partial(flight_duration,partition_by=[u'aircraft', u'registration', u'departure_date', u'flight_leg_count']))\
	.pipe(partial(add_flight_chrono,return_name=_unix_time_from_start,partition_by=[u'aircraft', u'registration', u'departure_date', u'flight_leg_count']))\
	.pipe(partial(filter_on_column,condition=>3600,column_name=flight_duration))\
	.pipe(partial(add_column_operation,operation=* 0.1,return_name=start_progno_unix_time,column_name=flight_duration))\
	.pipe(partial(add_column_operation,operation=* 0.9,return_name=end_progno_unix_time,column_name=flight_duration))\
	.pipe(partial(add_column_operation_2_cols,operation=>=,return_name=filter_start_progno_unix_time,column_name_1=_unix_time_from_start,column_name_2=start_progno_unix_time))\
	.pipe(partial(add_column_operation_2_cols,operation=<=,return_name=filter_end_progno_unix

In [72]:
def rebuild_func_agg(dct):
    return dct['function'] + '(' + ','.join([k + '=' + str(v) for k, v in dct['param'].items()]) + ')'

In [73]:
print('df.groupBy(' + str(d[u'ata28_pmp_dp'][1]['param']['aggregation_level']) + ')')
print('\t' + '.agg(' + ','.join(rebuild_func_agg(x) for x in d[u'ata28_pmp_dp'][1]['param']['aggregate_functions_spark']) + ')')

df.groupBy([u'aircraft', u'registration', u'departure_date', u'flight_leg_count'])
	.agg(count_condition(result_name=risingEdgesNumber5qa1,variable=rising_edges_5qa1,condition_value=0,column_name_cond=first_occurence_300),count_condition(result_name=risingEdgesNumber5qa4,variable=rising_edges_5qa4,condition_value=0,column_name_cond=first_occurence_300),agg_min(result_name=noise5qa1,variable=noise_90_CURM_5QA1,column_name_cond=filter_common),agg_min(result_name=noise5qa4,variable=noise_90_CURM_5QA4,column_name_cond=filter_common),agg_max(result_name=maxRisingEdges5qa1,variable=rising_edges_5qa1,column_name_cond=filter_common),agg_max(result_name=maxRisingEdges5qa4,variable=rising_edges_5qa4,column_name_cond=filter_common))


Problem: the trailing \ at the end of each line. Should the arguments be quoted? How does the application handle this when parsing?
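One possible way to address both points, offered as a sketch rather than as what the application actually does: format the values with `repr` so that strings come out quoted, and wrap the whole chain in parentheses so that no trailing backslash is needed. `rebuild_func_quoted`, `rebuild_pipeline` and the sample `steps` are hypothetical.

```python
def rebuild_func_quoted(dct):
    """Rebuild one .pipe(...) call, quoting string arguments with repr()."""
    if not dct['param']:
        return '.pipe(' + dct['function'] + ')'
    # sorted() only makes the argument order deterministic
    kwargs = ', '.join(k + '=' + repr(v) for k, v in sorted(dct['param'].items()))
    return '.pipe(partial(' + dct['function'] + ', ' + kwargs + '))'


def rebuild_pipeline(steps):
    """Wrap the chain in parentheses so no line needs a trailing backslash."""
    lines = ['(df'] + ['    ' + rebuild_func_quoted(s) for s in steps] + [')']
    return '\n'.join(lines)


# Hypothetical steps shaped like the entries of preprocessing_pipeline
steps = [
    {'function': 'add_unix_time', 'param': {}},
    {'function': 'filter_on_column',
     'param': {'condition': '>3600', 'column_name': 'flight_duration'}},
]
code = rebuild_pipeline(steps)
print(code)
```

Under Python 2, `repr` of a unicode string keeps the `u` prefix, which is still valid source code there.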

### List file size on local FS

In [None]:
def list_file_size_fs(dir_path, file_ext):
    if dir_path[-1]=='/':
        dir_path = dir_path[:-1]
    files = [dir_path + '/' + x for x in os.listdir(dir_path) if os.path.isfile(dir_path + '/' + x) and file_ext in x]
    files_size = [round(os.path.getsize(x)/(1024.0*1024.0),3) for x in files]
    return pd.DataFrame(data={'file' : files, 'size_MiB' : files_size})