# SELECTOR Readme

SELECTOR is a library for selecting the best-performing features and algorithms for human mobility prediction.
It is composed of several scripts, each corresponding to one step towards deriving individual and population models.
This readme gives an overview of these scripts, SELECTOR's functionality, and how it can be used.

## Summary of the steps
1. Feature selection 'STEP1_feature_selection.py'.
2. Create candidate individual models 'STEP2_create_candidate_individual_models.py'.
    1. Print candidate individual models performance 'STEP2a_print_candidate_individual_models_performance.py'.
3. Create feature ranks 'STEP3_create_feature_ranks.py'.
    1. Create feature rank plot 'STEP3a_visualize_feature_ranks.py'.
4. Create candidate population models 'STEP4_create_candidate_population_models.py'.
5. Compare candidate population models to individual models 'STEP5_compare_population_models.py'.
    1. Visualize population model performance 'STEP5a_visualize_population_model_performance.py'.
    2. Visualize performance comparison between individual and population models 'STEP5b_visualize_individual_vs_population_model_performance.py'.
6. Visualize performance gains of deriving population models for demographic groups 'STEP6_visualize_demographic_population_models_performance.py'.
7. Compute and visualize population models for periods of the day 'STEP7_visualize_demo_daily_population_models_performance.py'.

## Important practical notes
- Parallel execution: Each main file of SELECTOR provides an option to run in parallel internally. However, we recommend a different approach: each main file also lets you restrict the execution to a subset of users or prediction tasks, so you can simply start the same script several times in parallel on non-overlapping subsets. The snippet below this list shows how to start such parallel processes for non-overlapping subsets of users.
- Visualization: Each visualization script offers the flags 'SAVE' and 'SHOW' at the top of the file. To show a plot on screen, set 'SHOW' to 'True'. To save the plot to a file, set 'SAVE' to 'True'.



In [None]:
screen -d -m -S run1 sh -c 'python STEP4_create_candidate_population_models.py 1 3 0 8; exec bash'
screen -d -m -S run2 sh -c 'python STEP4_create_candidate_population_models.py 1 3 8 16; exec bash'
screen -d -m -S run3 sh -c 'python STEP4_create_candidate_population_models.py 1 3 16 24; exec bash'
screen -d -m -S run4 sh -c 'python STEP4_create_candidate_population_models.py 1 3 24 32; exec bash'
screen -d -m -S run5 sh -c 'python STEP4_create_candidate_population_models.py 1 3 32 40; exec bash'
screen -d -m -S run6 sh -c 'python STEP4_create_candidate_population_models.py 1 3 40 48; exec bash'
screen -d -m -S run7 sh -c 'python STEP4_create_candidate_population_models.py 1 3 48 56; exec bash'
screen -d -m -S run8 sh -c 'python STEP4_create_candidate_population_models.py 1 3 56 64; exec bash'
screen -d -m -S run9 sh -c 'python STEP4_create_candidate_population_models.py 1 3 64 72; exec bash'
screen -d -m -S run10 sh -c 'python STEP4_create_candidate_population_models.py 1 3 72 80; exec bash'
screen -d -m -S run11 sh -c 'python STEP4_create_candidate_population_models.py 1 3 80 88; exec bash'
screen -d -m -S run12 sh -c 'python STEP4_create_candidate_population_models.py 1 3 88 96; exec bash'
screen -d -m -S run13 sh -c 'python STEP4_create_candidate_population_models.py 1 3 96 104; exec bash'
screen -d -m -S run14 sh -c 'python STEP4_create_candidate_population_models.py 1 3 104 112; exec bash'
screen -d -m -S run15 sh -c 'python STEP4_create_candidate_population_models.py 1 3 112 120; exec bash'
screen -d -m -S run16 sh -c 'python STEP4_create_candidate_population_models.py 1 3 120 128; exec bash'
screen -d -m -S run17 sh -c 'python STEP4_create_candidate_population_models.py 1 3 128 136; exec bash'
screen -d -m -S run18 sh -c 'python STEP4_create_candidate_population_models.py 1 3 136 1000; exec bash'

This script spawns 18 processes, which can run on 18 separate cores (if available).
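
The list of commands above follows a regular pattern, so it could also be generated instead of written by hand. The following is a minimal sketch under that assumption; the helper names 'user_ranges' and 'build_screen_commands' are hypothetical and not part of SELECTOR.

```python
def user_ranges(chunk_size, last_start, open_end):
    """Yield non-overlapping (start, end) user index pairs.

    Regular chunks are produced up to 'last_start'; the final range runs
    from 'last_start' to 'open_end' to catch all remaining users.
    """
    start = 0
    while start < last_start:
        yield start, start + chunk_size
        start += chunk_size
    yield start, open_end


def build_screen_commands(script, task_start, task_end, ranges):
    """Build one detached 'screen' command per user range."""
    commands = []
    for run_number, (user_start, user_end) in enumerate(ranges, start=1):
        commands.append(
            "screen -d -m -S run%d sh -c 'python %s %d %d %d %d; exec bash'"
            % (run_number, script, task_start, task_end, user_start, user_end)
        )
    return commands


commands = build_screen_commands(
    "STEP4_create_candidate_population_models.py", 1, 3,
    list(user_ranges(chunk_size=8, last_start=136, open_end=1000)),
)
for command in commands:
    print(command)
```

The printed lines can be pasted into a shell or written to a script file.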

## STEP 1: Feature selection

- Main file: STEP1_feature_selection.py
- Parameters (all mandatory): 
    - start task index
    - end task index
    - start user index
    - end user index
- Example: 'python STEP1_feature_selection.py 1 3 0 1'
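
As an illustration of how these four positional parameters might be read, here is a minimal sketch. The names 'RunArguments' and 'parse_run_arguments' are hypothetical; the actual parsing in 'STEP1_feature_selection.py' may differ.

```python
from collections import namedtuple

# Hypothetical container for the four mandatory parameters.
RunArguments = namedtuple(
    "RunArguments", ["start_task", "end_task", "start_user", "end_user"]
)


def parse_run_arguments(argv):
    """Convert the four positional parameters to integers.

    'argv' has the same layout as sys.argv: script name first.
    """
    if len(argv) != 5:
        raise SystemExit(
            "usage: python %s START_TASK END_TASK START_USER END_USER" % argv[0]
        )
    return RunArguments(*(int(value) for value in argv[1:]))


# Equivalent to: python STEP1_feature_selection.py 1 3 0 1
arguments = parse_run_arguments(
    ["STEP1_feature_selection.py", "1", "3", "0", "1"]
)
print(arguments)
```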

Feature selection is the main entry point of SELECTOR.
The core logic is contained in the file 'STEP1_feature_selection.py'; executing this file performs the feature selection.
To do so, a set of parameters needs to be set and access to the data must be provided.
The following subsections summarize these parameters, explain how to set them, and describe the purpose of several further files.

### Database access
SELECTOR requires access to a database that contains data instances along with the corresponding features.
It is important to mention that the current SELECTOR version requires a predefined database schema, since particular columns (features) are accessed by indices instead of names.
For each prediction task, there are two database tables containing such information.
These tables are '!TASK!_Pre_Selected_Features' and '!TASK!_Feature_Matrix'.
The former table contains a mask for each timestamp indicating which features and instances should be used based on their availability.
The latter table contains the actual feature matrix for each user.
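
To make the '!TASK!' placeholder concrete, the following sketch expands it for the three task names that appear later in 'EvaluationRun.py'. The helper 'table_names_for' is a hypothetical illustration, not a SELECTOR function.

```python
# Table name templates as given above; '!TASK!' stands for the task name.
TABLE_TEMPLATES = ("!TASK!_Pre_Selected_Features", "!TASK!_Feature_Matrix")


def table_names_for(task):
    """Return the (mask table, feature matrix table) names for one task."""
    return tuple(template.replace("!TASK!", task) for template in TABLE_TEMPLATES)


for task in ("NextSlotPlace", "NextSlotTransition", "NextPlace"):
    print(table_names_for(task))
```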

The gateway to access the database is the Python file 'Database_Handler.py'.
The method 'Get_DB_Handler()' returns a database handler and allows setting database parameters such as 'username' or 'password'. 

In [None]:
def Get_DB_Handler():
    
    return Database_Handler.Database_Handler("ADDRESS", 3306, "USERNAME", "PASSWORD", "DBNAME")

This class further provides a set of methods to access and manipulate database tables.

In [None]:
# Method signatures only; the bodies are omitted here.
def insert(self, table_name, fields, values): ...

def insertMany(self, table_name, fields, values): ...

def select(self, query): ...

def update(self, query): ...

def dropTable(self, table_name): ...

def createTable(self, table_name, selectString): ...

def deleteData(self, query): ...

def truncateTable(self, table_name): ...

def getGreatestIndex(self, table_name, device_id): ...

def getGreatestTimestamp(self, table_name, device_id): ...

An example of how to interact with the database is given in the 'STEP1_feature_selection.py' file.

In [None]:
import datetime

import Database_Handler

def Save_End_Evaluation_Run_To_DB(evaluation_run):
    
    # store prediction run details
    timestamp = datetime.datetime.now().strftime('%d-%m-%Y-%H:%M:%S')
    
    dbHandler = Database_Handler.Get_DB_Handler()
    query = "UPDATE %s_Prediction_Run SET end_timestamp = '%s' WHERE id = %i" % (evaluation_run.task, timestamp, evaluation_run.run_id)
    dbHandler.update(query)
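
The query above is assembled with string formatting. As a self-contained illustration of the same update, the following sketch uses Python's built-in 'sqlite3' module as a stand-in for the project's database and binds the values as parameters. The table layout and the function 'save_end_timestamp' are assumptions made for this example only; 'Database_Handler' itself is not used here.

```python
import datetime
import sqlite3


def save_end_timestamp(connection, task, run_id):
    """Set end_timestamp for one run in the '<task>_Prediction_Run' table."""
    # Table names cannot be bound as parameters, so only the name is formatted.
    table = "%s_Prediction_Run" % task
    timestamp = datetime.datetime.now().strftime('%d-%m-%Y-%H:%M:%S')
    connection.execute(
        "UPDATE %s SET end_timestamp = ? WHERE id = ?" % table,
        (timestamp, run_id),
    )
    connection.commit()
    return timestamp


# In-memory stand-in database with a reduced, hypothetical table layout.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE NextPlace_Prediction_Run (id INTEGER PRIMARY KEY, end_timestamp TEXT)"
)
connection.execute("INSERT INTO NextPlace_Prediction_Run (id) VALUES (1)")

stored = save_end_timestamp(connection, "NextPlace", 1)
print(stored)
```

The '?' placeholder is specific to 'sqlite3'; other database drivers use a different placeholder style.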

### Retrieving user data from the database

After establishing a connection to our database, we now look at how particular user data can be retrieved.
To do so, two data structures ('UserData.py' and 'UserDataSet.py') and a helper class ('UserDataAssemble.py') are used.
As the first step, we initialize a 'UserData.py' object to store all user related data.

In [None]:
## STEP1_feature_selection.py
def Run_Main_Loop():

    ## ...
    
    userData = UserData.UserData()
    userData.userId = int(user) 
    
    ## ...

The 'UserData.py' data structure allows us to store information such as user ID, optimization set, training set, and test set of user data.

In [None]:
import numpy
from UserDataSet import UserDataSet

class UserData:
    
    def __init__ (self):
        self.userId = None;
        self.pre_feature_combination = None;
        self.complete_feature_matrix = None;
        
        self.optimization_set = None;   
        self.training_set = None;
        self.test_set = None;   

After we have initialized the 'UserData.py' structure for our user, the next step is to load data from the database.

In [None]:
## STEP1_feature_selection.py
def Run_Main_Loop():

    ## ...
    
    user_data_assemble = UserDataAssemble.UserDataAssemble(evaluation_run)
    evaluation_run = user_data_assemble.Get_User_Data() 
    
    ## ...

This code sample introduces two further files.
The file 'UserDataAssemble.py' contains all the logic to load data from the database and to parse it into three (optimization, training, test) subsets.
Our code checks for an existing partition of the data in the database.
If it is available, the data loaded from the database are partitioned accordingly.
Otherwise a new partition is created.

In [None]:
class UserDataAssemble:
    ## ...
    def Get_User_Data(self): 
        ## ...  
        if evaluation_run.task == EvaluationRun.task_next_place_daily:
            task = evaluation_run.task[5:]
        else:
            task = evaluation_run.task
        query = "SELECT optimization_array, training_array, test_array FROM %s_Prediction_Run WHERE user_id = %i  LIMIT 1" % (task, evaluation_run.userData.userId)
        
        dbHandler = Database_Handler.Get_DB_Handler()
        existing_data = dbHandler.select(query)
        existing_data = numpy.array(existing_data)
        
        ## A partition exists
        if existing_data.shape[0] > 0:
            optimization_idx = numpy.fromstring(existing_data[0,0], sep=', ').astype(int)
            training_idx = numpy.fromstring(existing_data[0,1], sep=', ').astype(int)
            test_idx = numpy.fromstring(existing_data[0,2], sep=', ').astype(int)
        else: ## No partition found --> create a new partition of data into three subsets
            optimization_set_size = self.Get_Optimization_Set_Size(number_of_days);
            training_set_size = self.Get_Training_Set_Size(number_of_days - optimization_set_size);

            # select sub sets of data    
            indices = numpy.random.permutation(unique_days)
            optimization_idx, training_idx, test_idx = indices[:optimization_set_size], indices[optimization_set_size:optimization_set_size+training_set_size], indices[optimization_set_size+training_set_size:]

            # map the selected days back to row indices (numpy is imported as 'numpy', not 'np')
            optimization_membership = numpy.array([i in optimization_idx for i in day_strings])
            optimization_idx = numpy.where(optimization_membership == True)[0]

            training_membership = numpy.array([i in training_idx for i in day_strings])
            training_idx = numpy.where(training_membership == True)[0]

            test_membership = numpy.array([i in test_idx for i in day_strings])
            test_idx = numpy.where(test_membership == True)[0]
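
The day-based partitioning above can be reproduced in isolation. The sketch below is a simplified illustration: the shares passed to 'partition_rows_by_day' are hypothetical, since the logic of 'Get_Optimization_Set_Size' and 'Get_Training_Set_Size' is not shown here.

```python
import numpy


def partition_rows_by_day(day_strings, optimization_share, training_share, seed=None):
    """Split row indices into three subsets so that all rows of one day
    end up in the same subset."""
    day_strings = numpy.asarray(day_strings)
    unique_days = numpy.unique(day_strings)

    # Shuffle the days, not the rows.
    rng = numpy.random.default_rng(seed)
    shuffled_days = rng.permutation(unique_days)

    n_optimization = int(round(len(unique_days) * optimization_share))
    n_training = int(round(len(unique_days) * training_share))
    day_groups = (
        shuffled_days[:n_optimization],
        shuffled_days[n_optimization:n_optimization + n_training],
        shuffled_days[n_optimization + n_training:],
    )
    # Map each group of days back to the row indices that belong to it.
    return tuple(
        numpy.where(numpy.isin(day_strings, days))[0] for days in day_groups
    )


day_strings = ["d1"] * 3 + ["d2"] * 2 + ["d3"] * 4 + ["d4"] * 1 + ["d5"] * 2
optimization_idx, training_idx, test_idx = partition_rows_by_day(
    day_strings, optimization_share=0.2, training_share=0.4, seed=0
)
print(optimization_idx, training_idx, test_idx)
```

Splitting by day rather than by row keeps all instances of one day together, which matches the use of 'unique_days' in the snippet above.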

The resulting partitions are then applied to the data loaded from the database to create three subsets that are encapsulated in the data structure 'UserDataSet.py'.
This structure contains timestamps of instances, day and time strings, ground truth values, the corresponding feature matrix, and a mask that indicates which features should be used for the current experiment.

In [None]:
class UserDataSet:
    
    def __init__ (self):
        self.timestamps = None;
        self.day_string = None;
        self.time_string = None;
        
        self.ground_truth = None;
        self.feature_matrix = None;
        
        self.rows_mask = None; 

The initialization of a data subset is shown in the following snippet.
As already mentioned, SELECTOR relies on a well-defined database schema.
The following snippet also demonstrates this dependency: timestamps, day strings, etc. are expected at fixed column positions.

In [None]:
class UserDataAssemble:
    ## ...
    def Get_User_Data(self): 
        # optimization set
        optimization_set = UserDataSet.UserDataSet()
        optimization_set.timestamps = ravel(feature_matrix[optimization_idx, 2:3])[day_period_mask[optimization_idx]]
        optimization_set.day_string = ravel(feature_matrix[optimization_idx, 3:4])[day_period_mask[optimization_idx]]
        optimization_set.time_string = ravel(feature_matrix[optimization_idx, 4:5])[day_period_mask[optimization_idx]]
        
        optimization_set.ground_truth = (ravel(feature_matrix[optimization_idx, 7:8]).astype(float))[day_period_mask[optimization_idx]]
        optimization_set.feature_matrix = (feature_matrix[optimization_idx, 8:feature_matrix.shape[1]])[day_period_mask[optimization_idx],:]
        optimization_set.rows_mask = optimization_idx
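
To make the fixed column layout easier to follow, the sketch below performs the same kind of slicing on a small toy matrix. The column indices are taken from the snippet above; the constant names and the function 'slice_subset' are hypothetical.

```python
import numpy

# Column positions as used in the snippet above.
TIMESTAMP_COL, DAY_COL, TIME_COL = 2, 3, 4
GROUND_TRUTH_COL = 7
FIRST_FEATURE_COL = 8


def slice_subset(feature_matrix, row_idx, period_mask):
    """Select the rows of one subset, keep only those inside the day period,
    and split them into the individual fields."""
    rows = feature_matrix[row_idx][period_mask[row_idx]]
    return {
        "timestamps": rows[:, TIMESTAMP_COL],
        "day_string": rows[:, DAY_COL],
        "time_string": rows[:, TIME_COL],
        "ground_truth": rows[:, GROUND_TRUTH_COL].astype(float),
        "feature_matrix": rows[:, FIRST_FEATURE_COL:],
    }


# Toy data: 4 instances, 10 columns (8 metadata columns + 2 features).
toy_matrix = numpy.arange(40).reshape(4, 10)
row_idx = numpy.array([0, 2, 3])
period_mask = numpy.array([True, False, False, True])

subset = slice_subset(toy_matrix, row_idx, period_mask)
print(subset["timestamps"], subset["feature_matrix"].shape)
```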

All information regarding the current evaluation run, including user data, is stored in the data structure 'EvaluationRun.py'.
Besides many parameters and fields, this data structure also provides static fields that ensure a consistent spelling of algorithms and metrics across the entire code base.

In [None]:
class EvaluationRun:
    
    task_next_slot_place = 'NextSlotPlace'
    task_next_slot_transition = 'NextSlotTransition'
    task_next_place = 'NextPlace'
    
    task_next_place_daily = 'DailyNextPlace'
    
    metric_accuracy = 'accuracy'
    metric_fscore = 'fscore'
    metric_MCC = 'MCC'
    
    alg_logistic_regression = 'logistic_regression'
    alg_knn = 'knn'
    alg_knn_dyn = 'knn_dyn'
    alg_perceptron = 'perceptron'
    alg_decision_tree = 'decision_tree'
    alg_gradient_boost = 'gradient_boost'
    alg_svm = 'svm'
    alg_naivebayes = 'naive_bayes'
    alg_stupid = 'stupid'
    
    ## baselines
    alg_random = 'random'
    alg_majority = 'majority'
    alg_histogram = 'histogram'
    
    algorithms = [alg_knn_dyn, alg_naivebayes];
    
    metrics_next_place = [metric_accuracy, metric_fscore, metric_MCC]

An example of how to initialize such a structure is given in the following snippet.

In [None]:
class STEP1_feature_selection:
    ## ...
    def Run_Main_Loop():
        ## ...
        evaluation_run = EvaluationRun()
        evaluation_run.task = current_task
        evaluation_run.task_object = task_objects[task_id]

        # feature group selection
        evaluation_run.is_network = True;
        evaluation_run.is_temporal = True;
        evaluation_run.is_spatial = True;
        evaluation_run.is_context = True;

### Parallel execution of the feature selection

After getting familiar with several data structures and helper classes, we now examine the code within the file 'STEP1_feature_selection.py' and an option to execute the code in parallel.

There are several alternatives to parallelize SELECTOR.
One way is implemented in the file 'STEP1_feature_selection.py'.
Using the flag 'THREAD_LEVEL', it allows us to create multiple threads.

The lowest level of parallelization is THREAD_LEVEL = 1.
It runs the execution for the selected performance metrics in parallel.
The corresponding code is shown in the following snippet.

In [None]:
class STEP1_feature_selection:
    ## ...
    def Thread_Algorithm(evaluation_run, metrics, start):
        ## ...
        threads = []
        for current_metric in metrics:
            ## ... 
            if THREAD_LEVEL > 0:
                metric_thread = threading.Thread( target=Thread_Metric, args=(current_evaluation_run, start,) )
                threads.append(metric_thread)
                metric_thread.start()
            else:   
                Thread_Metric(current_evaluation_run, start) 

        if THREAD_LEVEL > 0:
            for thread in threads:
                thread.join()

The next level is to also parallelize the execution of predictors.
To do so, THREAD_LEVEL should be set to 2.

In [None]:
class STEP1_feature_selection:
    ## ...
    def Thread_Task(task_id, evaluation_run, algorithms, list_of_metrics, userData, start):
        threads = []
        for current_algorithm in algorithms:
            ## ...
            if THREAD_LEVEL > 1:
                algorithm_thread = threading.Thread( target=Thread_Algorithm, args=(current_evaluation_run, metrics, start,) )
                threads.append(algorithm_thread)
                algorithm_thread.start()
            else:
                Thread_Algorithm(current_evaluation_run, metrics, start)

        if THREAD_LEVEL > 1:
            for thread in threads:
                thread.join()

Lastly, we can also parallelize SELECTOR to execute each prediction task individually.
To do so, we have to set THREAD_LEVEL = 3.

In [None]:
class STEP1_feature_selection:
    ## ...
    def Run_Main_Loop():
        ## ...
        threads = []
        for current_task in tasks:
            ## ...
            if THREAD_LEVEL > 2:
                task_thread = threading.Thread( target=Thread_Task, args=(task_id, evaluation_run, algorithms, list_of_metrics, userData, start,) )
                threads.append(task_thread)
                task_thread.start()
            else:
                Thread_Task(task_id, evaluation_run, algorithms, list_of_metrics, userData, start)
            ## ...
        if THREAD_LEVEL > 2:
            for thread in threads:
                thread.join()

There are additional and higher values of THREAD_LEVEL that can be set to further parallelize the code.
We will discuss these values later.
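
The three snippets above share one pattern: start a thread per item if THREAD_LEVEL is high enough, otherwise run sequentially. As a minimal sketch, this pattern could be factored into a single helper. The function 'run_for_each' and the worker below are hypothetical and not part of SELECTOR.

```python
import threading

THREAD_LEVEL = 2  # hypothetical setting for this sketch


def run_for_each(items, worker, required_level):
    """Call worker(item) for every item, in threads if THREAD_LEVEL allows it."""
    if THREAD_LEVEL < required_level:
        # Sequential fallback.
        for item in items:
            worker(item)
        return

    threads = [threading.Thread(target=worker, args=(item,)) for item in items]
    for thread in threads:
        thread.start()
    # Wait until all threads have finished.
    for thread in threads:
        thread.join()


results = []
results_lock = threading.Lock()


def evaluate_metric(metric):
    # The lock protects the shared list against concurrent appends.
    with results_lock:
        results.append("evaluated " + metric)


run_for_each(["accuracy", "fscore", "MCC"], evaluate_metric, required_level=1)
print(sorted(results))
```

Note that in standard CPython, threads executing pure Python code are serialized by the global interpreter lock, so starting separate processes may scale better for CPU-bound work.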

### Kicking off the feature selection procedure

After defining feature selection tasks for each combination of prediction tasks, performance metrics, users, and predictors, we now examine the code that is responsible for selecting features.

This code is located in the file 'SFFS.py', which is an implementation of the Sequential Floating Forward Selection (SFFS) algorithm.

The following code in the file 'STEP1_feature_selection.py' starts SFFS.

In [None]:
class STEP1_feature_selection:
    ## ...
    def Thread_Metric(evaluation_run, start):  
        ## ...
        # save data to database
        evaluation_run = Save_Start_Evaluation_Run_To_DB(evaluation_run)  
        
        # prepare data
        evaluation_run.training_set = evaluation_run.userData.optimization_set
        evaluation_run.test_set = evaluation_run.userData.training_set

        # run SFFS
        sffs = SFFS.SFFS(evaluation_run, 10, start)
        sffs.Run_SFFS()

        Save_End_Evaluation_Run_To_DB(evaluation_run)

Similar to the file 'STEP1_feature_selection.py', the file 'SFFS.py' provides further options to parallelize the execution.
This can be achieved by setting THREAD_LEVEL to 4 or higher.
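
For readers unfamiliar with the algorithm, the general idea of sequential floating forward selection can be sketched as follows. This is a generic, simplified illustration with a toy scoring function, not the code in 'SFFS.py'; the function 'floating_forward_selection' and its parameters are assumptions made for this example.

```python
def floating_forward_selection(number_of_features, score, max_size):
    """Return (best score, best subset) found by floating forward selection.

    'score' maps a list of feature ids to a number (higher is better).
    """
    selected = []
    best_by_size = {}  # subset size -> (score, subset)

    while len(selected) < max_size:
        candidates = [f for f in range(number_of_features) if f not in selected]
        if not candidates:
            break

        # Forward step: add the feature that improves the score the most.
        best_feature = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [best_feature]
        current = score(selected)
        size = len(selected)
        if size not in best_by_size or current > best_by_size[size][0]:
            best_by_size[size] = (current, list(selected))

        # Floating step: drop features while this beats the best known
        # subset of the smaller size.
        while len(selected) > 2:
            least_useful = max(
                selected, key=lambda f: score([g for g in selected if g != f])
            )
            reduced = [g for g in selected if g != least_useful]
            reduced_score = score(reduced)
            if reduced_score > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (reduced_score, list(reduced))
            else:
                break

    return max(best_by_size.values(), key=lambda entry: entry[0])


def toy_score(features):
    # Toy criterion: features 1 and 3 are useful, every feature has a small cost.
    return len(set(features) & {1, 3}) - 0.1 * len(features)


best_score, best_subset = floating_forward_selection(5, toy_score, max_size=4)
print(best_score, sorted(best_subset))
```

In SELECTOR, the scoring function would correspond to evaluating a predictor with the chosen performance metric on the given training and test data.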

### Saving feature selection results to the database
While feature selection is performed, SELECTOR stores intermediate and final results in the database.
Two tables per prediction task are dedicated to this purpose.
These tables are '!TASK!_Prediction_Result_Analysis' and '!TASK!_Prediction_Run'.
The former table stores the results of each SFFS execution cycle, while the latter one contains high-level information of the current evaluation run, e.g., which prediction algorithm is used or which performance metric should be optimized.

The file 'NextPlaceOrSlotPredictionTask.py' contains the entire code to compute the performance and store the current execution step in the database (table: '!TASK!_Prediction_Result_Analysis').
After each SFFS step, the method 'Save_To_DB(..)' is executed.

Before and after the execution of SFFS, a corresponding entry is written to the table '!TASK!_Prediction_Run' containing all the high-level information of the current evaluation run.
The implementation of these two methods (Save_Start_Evaluation_Run_To_DB(..) and Save_End_Evaluation_Run_To_DB(..)) is in the file 'STEP1_feature_selection.py'.


## STEP 2: Create candidate individual models

- Main file: STEP2_create_candidate_individual_models.py
- Parameters (all mandatory): 
    - start task index
    - end task index
    - start user index
    - end user index
- Example: 'python STEP2_create_candidate_individual_models.py 1 3 0 1'

After applying feature selection, the next step is to measure the actual performance of the predictors before and after the feature selection.
As already mentioned, our data set is divided into three subsets.
While the optimization and training subsets are used by SFFS, the training and test subsets are used to assess the actual performance.

Since we are also interested in the influence of phone context data on the performance in solving human mobility prediction tasks, we measure the performance in all four cases:
- No feature selection, with phone context data
- No feature selection, without phone context data
- After feature selection, with phone context data
- After feature selection, without phone context data

The corresponding implementation of this step can be found in the file 'STEP2_create_candidate_individual_models.py'.

In [None]:
class STEP2_create_candidate_individual_models:
    ## ...
    BASELINE = False
    ## ...
    def Run_Main_Loop():

        start = time()

        ## PREPARE TO RUN THE LOOP
        list_of_metrics = [EvaluationRun.metrics_next_place, EvaluationRun.metrics_next_place, EvaluationRun.metrics_next_place]
        algorithms = [EvaluationRun.alg_knn_dyn, EvaluationRun.alg_perceptron, EvaluationRun.alg_decision_tree, EvaluationRun.alg_svm];

        if IS_PER_DAY_PERIOD:
            start_periods = [1, 49, 69];
            end_periods = [48, 68, 96];
            tasks = [EvaluationRun.task_next_place_daily]
            task_objects = [NextPlaceOrSlotPredictionTask]
            feature_group_combinations = [[False, True, True, True, False]];
        else:
            start_periods = [1];
            end_periods = [96];
            tasks = [EvaluationRun.task_next_slot_place, EvaluationRun.task_next_slot_transition, EvaluationRun.task_next_place]
            task_objects = [NextPlaceOrSlotPredictionTask, NextPlaceOrSlotPredictionTask, NextPlaceOrSlotPredictionTask]
            ## first argument: true = no feature selection; false = feature selection
            feature_group_combinations = [[True, True, True, True, True],
                                          [True, True, True, True, False],
                                          [False, True, True, True, True],
                                          [False, True, True, True, False]];

        if BASELINE:
            algorithms = [EvaluationRun.alg_random, EvaluationRun.alg_majority, EvaluationRun.alg_histogram];
            feature_group_combinations = [[True, True, True, True, True]];

This code allows us to define which metrics and algorithms should be included.
In the context of this evaluation, we also include three naive algorithms along with the four more sophisticated ones.
The flag 'BASELINE', which can be set at the top of the file, allows us to switch between sophisticated and naive predictors.
The flag 'IS_PER_DAY_PERIOD' is set to 'True' only for building population models for different periods of the day (STEP 7).
By default, it is set to 'False'.

The matrix 'feature_group_combinations' allows us to define the four aforementioned scenarios. Each of the five boolean values has the following meaning:

1. No feature selection?
2. Network features?
3. Temporal features?
4. Spatial features?
5. Phone context features?
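
To illustrate how the five positions map to the four scenarios, the following sketch rebuilds the matrix from the two flags that vary and labels one row. The names 'FLAG_NAMES' and 'describe' are hypothetical helpers for this example.

```python
import itertools

# Meaning of the five positions, in the order listed above.
FLAG_NAMES = (
    "no_feature_selection", "network", "temporal", "spatial", "phone_context"
)


def describe(combination):
    """Attach a name to each of the five boolean flags."""
    return dict(zip(FLAG_NAMES, combination))


# Only the first and the last flag vary across the four scenarios.
feature_group_combinations = [
    [no_selection, True, True, True, phone_context]
    for no_selection, phone_context in itertools.product([True, False], repeat=2)
]

for combination in feature_group_combinations:
    print(describe(combination))
```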

The entire structure of this file is similar to the feature selection implementation.
The method 'Thread_Feature_Group(..)' processes all the parameters and finally performs the performance evaluation.

Lastly, the results are stored in the corresponding database table 'PostSelection_!TASK!_Prediction'.

## STEP 2a: Print candidate individual models performance

- Main file: STEP2a_print_candidate_individual_models_performance.py
- Parameters: none
- Example: 'python STEP2a_print_candidate_individual_models_performance.py'

This script prints the performance measured in the previous step to the console.
The implementation of this script allows us to define for which metrics, tasks, and algorithms this information should be displayed.
An additional flag 'IS_BASELINE' switches between the naive algorithms (for which no feature selection is applied) and the more sophisticated ones.
The corresponding code snippet is given below.

In [None]:
class STEP2a_print_candidate_individual_models_performance:
    ## ...
    def printPerformance():

        IS_BASELINE = True

        list_of_metrics = [EvaluationRun.metric_accuracy, EvaluationRun.metric_fscore, EvaluationRun.metric_MCC]
        tasks = [EvaluationRun.task_next_place, EvaluationRun.task_next_slot_place, EvaluationRun.task_next_slot_transition]

        if IS_BASELINE:
            algorithms = [EvaluationRun.alg_random, EvaluationRun.alg_histogram, EvaluationRun.alg_majority]
            is_feature_selection_array = ['=']
            is_phone_context_array = [1]
        else:
            algorithms = [EvaluationRun.alg_knn_dyn, EvaluationRun.alg_perceptron, EvaluationRun.alg_decision_tree, EvaluationRun.alg_svm]
            is_feature_selection_array = ['=', '>']
            is_phone_context_array = [1,0]

        ## ...

## STEP 3: Create feature ranks

- Main file: STEP3_create_feature_ranks.py
- Parameters: none
- Example: 'python STEP3_create_feature_ranks.py'

After deriving candidate individual models, we now identify the individual models (the best-performing ones out of the candidates) and create a ranking of the features selected in the users' individual models.

### Identify individual models
The following SQL query selects an individual model for each user out of the candidates.

In [None]:
class STEP3_create_feature_ranks:
    ## ...
    def createFeatureRanks():
        ## ...
        for task in tasks:
            for metric in list_of_metrics:
                dbHandler = Database_Handler.Get_DB_Handler()
                query = ("select a.user_id, a.%s, a.selected_algorithm, a.selected_features from "
                         "(select user_id, %s, selected_algorithm, selected_features from PostSelection_%s_Prediction "
                         "where is_context = 0 and selected_metric = '%s' and feature_selection_id > 0 and "
                         "(selected_algorithm = 'knn_dyn' or selected_algorithm = 'svm' or "
                         "selected_algorithm = 'perceptron' or selected_algorithm = 'decision_tree') "
                         "AND start_time = %s and end_time = %s "
                         "order by %s DESC) as a group by a.user_id order by a.user_id;"
                         "") % (metric, metric, task, metric, start_periods[time_index], 
                                end_periods[time_index], metric)

                query_results = numpy.array(dbHandler.select(query))

### Compute ranks for all demographic groups

This process is repeated for all 16 demographic groups by checking for each user whether they belong to the currently selected demographic group, as shown in the following snippet.

In [None]:
class STEP3_create_feature_ranks:
    ## ...
    def createFeatureRanks():
        ## ...
        for demo_key in get_demo_groups_dict().keys():

            feature_ids = numpy.arange(0,NUMBER_OF_FEATURES,1)
            feature_occurrence = numpy.zeros((NUMBER_OF_FEATURES,))
            total_number_of_models = 0

            # For each user
            for row in query_results:
                user_id = int(row[0])

                # Apply the following steps only if the user belongs to the current demographic group
                if Util.userBelongToDemoGroup(user_id, demo_key):
                    selected_features = ravel(numpy.fromstring(row[3], sep=', ').astype(int))
                    feature_occurrence[selected_features] += 1
                    total_number_of_models += 1

After creating the ranks for all features, we finally store the results to the database table 'FeatureRanks'.
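
The counting step above can be turned into a ranking in a few lines. The following sketch assumes that features are ranked by how often they occur in the individual models, which is an assumption of this example; the function 'rank_features' and the toy data are hypothetical.

```python
import numpy

NUMBER_OF_FEATURES = 6  # hypothetical value for this sketch


def rank_features(selected_feature_strings, number_of_features):
    """Count how often each feature occurs and rank features by that count."""
    occurrence = numpy.zeros(number_of_features, dtype=int)
    for feature_string in selected_feature_strings:
        # Each model stores its selected features as a string such as '0, 2, 5'.
        feature_ids = [int(token) for token in feature_string.split(",")]
        occurrence[feature_ids] += 1

    # Most frequent feature first; ties keep the original feature order.
    ranking = numpy.argsort(-occurrence, kind="stable")
    return ranking, occurrence


models = ["0, 2, 5", "2, 5", "2, 3"]
ranking, occurrence = rank_features(models, NUMBER_OF_FEATURES)
print(ranking, occurrence)
```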

## STEP 3a: Visualize feature ranks

- Main file: STEP3a_visualize_feature_ranks.py
- Parameters: none
- Example: 'python STEP3a_visualize_feature_ranks.py'

This file simply visualizes the results of the computation of feature ranks for all combinations of tasks and metrics.

In [7]:
PDF('../plots/feature_ranks.pdf',size=(550,275))

## STEP 4: Create candidate population models

- Main file: STEP4_create_candidate_population_models.py
- Parameters (all mandatory):
    - start task index
    - end task index
    - start user index
    - end user index
- Example: 'python STEP4_create_candidate_population_models.py 1 3 0 1'

After ranking features based on the individual models, we are now able to derive candidate population models.
The corresponding file 'STEP4_create_candidate_population_models.py' follows the same structure as previous files, therefore we mainly focus on the major differences and some of the parameters that can be set.

Along with the metrics, tasks, and predictors that can be selected, in this part of the evaluation we can also define how many of the top-ranked features (top-X) should be used for the brute-force derivation of candidate population models and for which demographic groups it should be performed.

These parameters can be set in the method 'Thread_Metric(..)' as shown in the following snippet.

In [None]:
class STEP4_create_candidate_population_models:
    ## ...
    def Thread_Metric(evaluation_run, start):  
        ## ...
        feature_set_size = 5 
    
        for demo_group in Util.demo_groups:

            ## check whether the user belongs to the demographic group
            if Util.userBelongToDemoGroup(user, demo_group) == False:
                continue;
            ## ...

Among other useful functions, the file 'Util.py' contains a list of all demographic groups and functions to check whether a user belongs to one of the selected groups, as shown in the following snippet.

In [None]:
class Util:
    ## ...
    demo_groups = ['all', 'female','male','working','study','age_group_16_21','age_group_22_27',
                      'age_group_28_33','age_group_34_38','age_group_39_44','no_children_all','with_children_all',
                      'with_children_female','with_children_male','single','family'];

    def areUsersBelongToDemoGroup(users, demo_group):

        mask_users_belong_to_demo_group = numpy.zeros((len(users),),dtype=bool)

        for idx in range(len(mask_users_belong_to_demo_group)):
            mask_users_belong_to_demo_group[idx] = userBelongToDemoGroup(users[idx], demo_group)

        return mask_users_belong_to_demo_group


    def userBelongToDemoGroup(user, demo_group):

        ## get demographic data
        dbHandler = Database_Handler.Get_DB_Handler()
        query = ("select * FROM Demographics where userid = %s") % (user) 
        demographics = numpy.array(dbHandler.select(query))

        gender = demographics[0, 1]
        age = demographics[0, 2] 
        work = demographics[0, 3]
        relationship = demographics[0, 4]
        children = demographics[0, 5]

        if demo_group == 'all':
            return True
        if demo_group == 'female':
            return gender == 1
        if demo_group == 'male':
            return gender == 2
        if demo_group == 'working':
            return work == 1
        if demo_group == 'study':
            return work == 4
        if demo_group == 'age_group_16_21':
            return age == 2
        if demo_group == 'age_group_22_27':
            return age == 3
        if demo_group == 'age_group_28_33':
            return age == 4
        if demo_group == 'age_group_34_38':
            return age == 5
        if demo_group == 'age_group_39_44':
            return age == 6
        if demo_group == 'no_children_all':
            return children == 0
        if demo_group == 'with_children_all':
            return children > 0
        if demo_group == 'with_children_female':
            return (children > 0) & (gender == 1)
        if demo_group == 'with_children_male':
            return (children > 0) & (gender == 2)
        if demo_group == 'single':
            return relationship == 0
        if demo_group == 'family':
            return relationship > 0
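
The chain of 'if' statements above can also be expressed as a lookup table of predicates, which separates the group definitions from the database access. The sketch below reuses the numeric codes from the snippet above; the 'Demographics' tuple and the function 'belongs_to_demo_group' are hypothetical and operate on an already loaded record.

```python
from collections import namedtuple

# Hypothetical record holding the five columns read in the snippet above.
Demographics = namedtuple(
    "Demographics", ["gender", "age", "work", "relationship", "children"]
)

# One predicate per demographic group, using the codes from the snippet above.
GROUP_PREDICATES = {
    "all": lambda d: True,
    "female": lambda d: d.gender == 1,
    "male": lambda d: d.gender == 2,
    "working": lambda d: d.work == 1,
    "study": lambda d: d.work == 4,
    "age_group_16_21": lambda d: d.age == 2,
    "age_group_22_27": lambda d: d.age == 3,
    "age_group_28_33": lambda d: d.age == 4,
    "age_group_34_38": lambda d: d.age == 5,
    "age_group_39_44": lambda d: d.age == 6,
    "no_children_all": lambda d: d.children == 0,
    "with_children_all": lambda d: d.children > 0,
    "with_children_female": lambda d: d.children > 0 and d.gender == 1,
    "with_children_male": lambda d: d.children > 0 and d.gender == 2,
    "single": lambda d: d.relationship == 0,
    "family": lambda d: d.relationship > 0,
}


def belongs_to_demo_group(demographics, demo_group):
    """Check one user's demographic record against one group."""
    return bool(GROUP_PREDICATES[demo_group](demographics))


user = Demographics(gender=1, age=3, work=1, relationship=1, children=2)
print([group for group in GROUP_PREDICATES if belongs_to_demo_group(user, group)])
```

With this layout, the demographic record can be loaded once per user and then checked against all groups without further queries.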

As mentioned, to derive candidate population models, we apply the brute-force approach as shown in the following snippet.

In [None]:
class STEP4_create_candidate_population_models:
    ## ...
    def Thread_Metric(evaluation_run, start):  
        ## ...
        dbHandler = Database_Handler.Get_DB_Handler()
        query = ("select feature_id FROM FeatureRanks where metric = '%s' and prediction_task = '%s' "
                 "and demo_group = '%s' and daily_period = '%s' order by id") % ( 
                 evaluation_run.selected_metric, evaluation_run.task, demo_group, daily_period)

        feature_combinations = dbHandler.select(query)
        feature_combinations = ravel(feature_combinations).astype(int)
        feature_combinations = feature_combinations[0:feature_set_size]
        
        # brute-force execution
        for L in range(1, len(feature_combinations)+1):
            for subset in itertools.combinations(feature_combinations, L):

                evaluation_run.selected_features = numpy.array(subset)
                evaluation_run.training_set = evaluation_run.userData.training_set
                evaluation_run.test_set = evaluation_run.userData.test_set
                
                # Predict mobility
                predictors_pipeline = PredictorsPipeline(evaluation_run)
                evaluation_run = predictors_pipeline.Run_Predictions()
                
                # Measure performance
                evaluation_run.metric_results = evaluation_run.task_object.Run_Analysis(evaluation_run)
                evaluation_run.metric_results.selected_features = evaluation_run.selected_features
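
To illustrate what the two nested loops above enumerate, here is a minimal standalone sketch. The function name `enumerate_feature_subsets` and the feature ids are made up for illustration; only `itertools.combinations` and the loop structure are taken from the snippet above.

```python
import itertools


def enumerate_feature_subsets(ranked_features, feature_set_size):
    """Yield every non-empty subset of the top-ranked features, smallest first."""
    top_features = ranked_features[:feature_set_size]
    for subset_size in range(1, len(top_features) + 1):
        for subset in itertools.combinations(top_features, subset_size):
            yield subset


ranked_features = [12, 7, 3, 25, 9]  # made-up feature ids, best rank first
subsets = list(enumerate_feature_subsets(ranked_features, feature_set_size=3))
print(subsets)
# [(12,), (7,), (3,), (12, 7), (12, 3), (7, 3), (12, 7, 3)]
print(len(subsets))  # 7, i.e. 2**3 - 1
```

For n retained features the loops visit 2^n - 1 subsets, so the number of evaluations doubles with every additional feature. This is why the ranked feature list is truncated to `feature_set_size` before the enumeration starts.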

## STEP 5: Compare candidate population models

- Main file: STEP5_compare_population_models.py
- Parameters (all mandatory):
    - start demographic group index
    - end demographic group index
- Example: 'python STEP5_compare_population_models.py 0 1'

To identify the best performing population models among the candidates, we compare the performance of each candidate to that achieved by the individual models.
The implementation of this process can be found in the file 'STEP5_compare_population_models.py'.
The procedure also allows us to find population models for different demographic groups.
Therefore, along with the already known parameters such as the choice of prediction tasks, we select users according to their membership in each of the demographic groups considered in this study.
The following code snippet shows the initialization phase of this step.

In [None]:
class STEP5_compare_population_models:
    ## ...
    def compareCandidatePopulationModels():
        ## ...
        list_of_metrics = [EvaluationRun.metric_accuracy, EvaluationRun.metric_fscore, EvaluationRun.metric_MCC]
    
        algorithms = [EvaluationRun.alg_knn_dyn, EvaluationRun.alg_perceptron, EvaluationRun.alg_decision_tree, EvaluationRun.alg_svm]

        start_demo_group = int(sys.argv[1])
        end_demo_group = int(sys.argv[2])
        demo_groups = Util.demo_groups[start_demo_group:end_demo_group]

        for demo_group in demo_groups:

            # read user list (the file is closed automatically when the block ends)
            with open("userids.txt", "r") as text_file:
                user_ids = text_file.read().split('\n')

            # Identify which users belong to the current demo group
            mask_users_belong_to_demo_group = Util.areUsersBelongToDemoGroup(user_ids, demo_group)
            
            ## ...

For each prediction task and metric, our script first selects individual models of all users and then removes those users who do not belong to the current demographic group, as shown in the following snippet.

In [None]:
class STEP5_compare_population_models:
    ## ...
    def compareCandidatePopulationModels():
        ## ...
        # Get individual models
        dbHandler = Database_Handler.Get_DB_Handler()
        query = ("select a.%s from "
                 "(select user_id, %s, selected_algorithm, selected_features from PostSelection_%s_Prediction "
                 "where is_context = 0 and selected_metric = '%s' and feature_selection_id > 0 and "
                 "(selected_algorithm = 'knn_dyn' or selected_algorithm = 'svm' or "
                 "selected_algorithm = 'perceptron' or selected_algorithm = 'decision_tree') "
                 "AND start_time = %s and end_time = %s "
                 "order by %s DESC) as a group by a.user_id order by a.user_id;"
                 "") % (metric, metric, task, metric, start_periods[time_index], end_periods[time_index], metric)

        query_individual_models = numpy.array(dbHandler.select(query))
        individual_models_performance = query_individual_models[:,0]

        # Remove user results that do not belong to the current demo group
        individual_models_performance = individual_models_performance[mask_users_belong_to_demo_group]
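
The last line above relies on NumPy boolean indexing, which only gives the intended result if the performance values and the mask refer to the same users in the same order. The sketch below uses made-up values to show the filtering step in isolation; it assumes, as the snippet above appears to, that the order of 'userids.txt' matches the `order by a.user_id` clause of the query.

```python
import numpy as np

# Made-up data: one best-performance value per user, ordered by user id.
user_ids = ["101", "102", "103", "104"]
individual_models_performance = np.array([0.81, 0.64, 0.77, 0.70])

# Hypothetical membership flags, in the same user order as the values above.
mask_users_belong_to_demo_group = np.array([True, False, True, True])

# Boolean indexing is positional, so both arrays must be aligned user by user.
assert len(individual_models_performance) == len(mask_users_belong_to_demo_group)

group_performance = individual_models_performance[mask_users_belong_to_demo_group]
print(group_performance.tolist())  # [0.81, 0.77, 0.7]
```

If a user listed in 'userids.txt' has no row in the query result, the two arrays differ in length and the indexing fails, so a length check like the one above can be a useful safeguard.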

Next, we select all configurations of candidate population models for the given prediction task and metric.

In [None]:
class STEP5_compare_population_models:
    ## ...
    def compareCandidatePopulationModels():
        ## ...
        # Get candidate population model configs
        dbHandler = Database_Handler.Get_DB_Handler()
        query = ("select distinct(selected_features) from %s where demo_group = '%s' "
                 "and prediction_task = '%s' and selected_metric = '%s' AND start_time = %s and end_time = %s;"
                 "") % (table, demo_group, task, metric, start_periods[time_index], end_periods[time_index])

        query_results_selected_features = numpy.array(dbHandler.select(query))
    
        # Iterate over each feature subset
        for selected_features in query_results_selected_features[:,0]:
            number_of_features = len(ravel(numpy.fromstring(selected_features, sep=', ').astype(int)))
            
            # For each predictor
            for algorithm in algorithms:
                dbHandler = Database_Handler.Get_DB_Handler()
                query = ("select %s from %s where demo_group = '%s' "
                         "and prediction_task = '%s' and selected_metric = '%s' and selected_algorithm = '%s' "
                         "and selected_features = '%s' and start_time = %s and end_time = %s order by user_id;"
                         "") % (metric, table, demo_group, task, metric, algorithm, selected_features, 
                                start_periods[time_index], end_periods[time_index])

                query_results_performance = numpy.array(dbHandler.select(query))
                candidate_population_model_performance = query_results_performance[:,0]
                
                ## ...
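
In the snippet above, each feature subset is read back from the database as a string and parsed to count its features. The following sketch shows the same parsing step in plain Python. It assumes that the stored value is a comma-separated list of integer ids such as '3, 7, 12', which is inferred from the `sep=', '` argument above; `parse_selected_features` is a hypothetical helper and not part of SELECTOR.

```python
def parse_selected_features(selected_features):
    """Parse a stored feature subset such as '3, 7, 12' into a list of ints."""
    return [int(token) for token in selected_features.split(",") if token.strip()]


features = parse_selected_features("3, 7, 12")
print(features)       # [3, 7, 12]
print(len(features))  # 3
```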

Finally, we compare the performance achieved by the individual models to that achieved by each candidate population model.
To this end, we use the root-mean-square error (RMSE) of the per-user performance differences as the metric.
As explained in our paper, we split the performance results into those indicating that individual models perform better and the rest.
All results are then stored in the database table 'Candidate_Population_Models_Performance'.
The corresponding implementation is shown in the following snippet.

In [None]:
class STEP5_compare_population_models:
    ## ...
    def compareCandidatePopulationModels():
        ## ...
        diff_performance = individual_models_performance - candidate_population_model_performance
        mask_individual_model_better = diff_performance > 0

        # the higher, the better is individual model
        if sum(mask_individual_model_better) > 0:
            RMSE_individual_models = sqrt(sum(pow(diff_performance[mask_individual_model_better], 2)) 
                                          / sum(mask_individual_model_better))
        else:
            RMSE_individual_models = 0.0
        # the lower, the better is population model
        if sum(~mask_individual_model_better) > 0:
            RMSE_population_models = -sqrt(sum(pow(diff_performance[~mask_individual_model_better], 2)) 
                                           / sum(~mask_individual_model_better))
        else:
            RMSE_population_models = 0.0

        # count-weighted average of the two signed RMSE values
        RMSE_total = ((RMSE_individual_models * sum(mask_individual_model_better)
                       + RMSE_population_models * sum(~mask_individual_model_better))
                      / len(mask_individual_model_better))
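
To make the computation concrete, the following standalone sketch reproduces the same three values on a small made-up example. The function name `signed_rmse_split` and the input values are assumptions for illustration; the logic follows the snippet above, including the convention that ties (a difference of exactly zero) are counted on the population-model side.

```python
import numpy as np


def signed_rmse_split(individual, population):
    """Return (rmse_individual, rmse_population, rmse_total) for aligned scores."""
    diff = np.asarray(individual, dtype=float) - np.asarray(population, dtype=float)
    individual_better = diff > 0
    n_individual = int(individual_better.sum())
    n_population = int((~individual_better).sum())

    # RMSE over users whose individual model wins (reported as >= 0).
    rmse_individual = np.sqrt(np.mean(diff[individual_better] ** 2)) if n_individual else 0.0
    # RMSE over the remaining users, negated (reported as <= 0).
    rmse_population = -np.sqrt(np.mean(diff[~individual_better] ** 2)) if n_population else 0.0

    # Count-weighted average of the two signed values.
    rmse_total = (rmse_individual * n_individual + rmse_population * n_population) / len(diff)
    return float(rmse_individual), float(rmse_population), float(rmse_total)


individual = [0.80, 0.60, 0.70, 0.50]  # made-up per-user scores
population = [0.70, 0.70, 0.70, 0.60]
rmse_ind, rmse_pop, rmse_total = signed_rmse_split(individual, population)
print(round(rmse_ind, 3), round(rmse_pop, 3), round(rmse_total, 3))
# 0.1 -0.082 -0.036
```

In this example only the first user is better served by the individual model, so the total is negative, which indicates that the candidate population model is on average at least as good as the individual models for this group.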

## STEP 5a: Visualize population model performance

- Main file: STEP5a_visualize_population_model_performance.py
- Parameters: none
- Example: 'python STEP5a_visualize_population_model_performance.py'

This script allows us to visualize the performance results achieved by the selected population models.

In [6]:
PDF('../plots/PopulationModel_Performance_boxplot_brute_force.pdf',size=(750,250))

## STEP 5b: Visualize individual vs population model performance

- Main file: STEP5b_visualize_individual_vs_population_model_performance.py
- Parameters: none
- Example: 'python STEP5b_visualize_individual_vs_population_model_performance.py'

This script visualizes the performance differences between individual and population models when derived by considering the entire population.

In [5]:
PDF('../plots/PopulationModel_vs_Individual_boxplot_brute_force.pdf',size=(750,250))

## STEP 6: Performance gains of deriving population models for demographic groups

- Main file: STEP6_visualize_demographic_population_models_performance.py
- Parameters: none
- Example: 'python STEP6_visualize_demographic_population_models_performance.py'

This script retrieves performance results achieved by population models that are derived for the entire population and compares them to the results achieved if population models are derived for each demographic group separately.

In [4]:
PDF('../plots/demographic_model_improvements.pdf',size=(870,190))

## STEP 7: Computing and visualizing population models for day periods of time

The last part of SELECTOR makes it possible to derive population models for non-overlapping periods of the day.
The current implementation considers three day periods (given as indices of 15-minute time slots starting from 00:00):

- 1 - 48
- 49 - 68
- 69 - 96

This split is based on analysis results that are outside the scope of SELECTOR.
Nevertheless, the day periods can easily be changed, as shown in the following snippet.

In [None]:
start_periods = [1, 49, 69]
end_periods = [48, 68, 96]
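
When changing the day periods, it can help to translate the slot indices into clock times and to check that the periods still split the day without gaps or overlaps. The following sketch does both. It assumes that slot indices are 1-based and inclusive, that slot 1 covers 00:00 to 00:15, and that the periods are listed in chronological order; `slot_range_to_clock` and `periods_partition_day` are hypothetical helpers and not part of SELECTOR.

```python
SLOT_MINUTES = 15


def slot_range_to_clock(start_slot, end_slot):
    """Map 1-based, inclusive slot indices to an 'HH:MM-HH:MM' interval."""
    start_minutes = (start_slot - 1) * SLOT_MINUTES
    end_minutes = end_slot * SLOT_MINUTES
    return "%02d:%02d-%02d:%02d" % (start_minutes // 60, start_minutes % 60,
                                    end_minutes // 60, end_minutes % 60)


def periods_partition_day(start_periods, end_periods, slots_per_day=96):
    """Check that the periods are contiguous and cover slots 1..slots_per_day."""
    expected_start = 1
    for start, end in zip(start_periods, end_periods):
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start == slots_per_day + 1


start_periods = [1, 49, 69]
end_periods = [48, 68, 96]
for start, end in zip(start_periods, end_periods):
    print(slot_range_to_clock(start, end))
# 00:00-12:00
# 12:00-17:00
# 17:00-24:00
print(periods_partition_day(start_periods, end_periods))  # True
```

Requiring the periods to cover the full day is an assumption of this sketch; the text above only states that they are non-overlapping.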

To construct population models for different day periods and to compare their performance with that of population models constructed for the entire day, we need to rerun the steps described above with some of the parameters changed.
The following snippets along with explanations provide an overview of how to conduct such experiments with SELECTOR and which parameters in which files need to be changed.

### Database and prediction task

The current implementation of SELECTOR supports the construction of population models for different day periods of time for the next-place (NP) prediction task only.
However, the implementation can be easily extended if further prediction tasks should be supported too.
Furthermore, although all the results could be saved in the same database schema, we decided to store them in separate database tables.
This is mainly done for debugging purposes.

The corresponding tables are:
- STEP 1:
    - Input: NextPlace_Feature_Matrix, NextPlace_Pre_Selected_Features
    - Output: Daily_NextPlace_Prediction_Result_Analysis, Daily_NextPlace_Prediction_Run
- STEP 2:
    - Input: Daily_NextPlace_Prediction_Result_Analysis, Daily_NextPlace_Prediction_Run
    - Output: PostSelection_DailyNextPlace_Prediction
- STEP 3:
    - Input: PostSelection_DailyNextPlace_Prediction
    - Output: FeatureRanksDaily
- STEP 4:
    - Input: FeatureRanksDaily
    - Output: Candidate_Population_Models_Daily
- STEP 5:
    - Input: Candidate_Population_Models_Daily
    - Output: Candidate_Population_Models_Performance_Daily
- STEP 6: omitted and replaced by STEP 7
- STEP 7:
    - Input: Candidate_Population_Models_Daily, Candidate_Population_Models_Performance_Daily
    - Output: -

### STEP 1 – 5

To repeat STEP 1 to STEP 5 in the context of building population models for different day periods of time, we simply need to change the flag 'IS_PER_DAY_PERIOD' in each of the files, as shown in the following snippet.

In [None]:
class STEP1_feature_selection:
    IS_PER_DAY_PERIOD = True
    ## ...

All the other parameters, such as the prediction tasks, are changed automatically by an if-else statement, as shown next.

In [None]:
class STEP1_feature_selection:
    ## ...
    def Run_Main_Loop():

        start = time()

        ## PREPARE TO RUN THE LOOP
        list_of_metrics = [EvaluationRun.metrics_next_place, EvaluationRun.metrics_next_place, EvaluationRun.metrics_next_place]
        algorithms = [EvaluationRun.alg_knn_dyn, EvaluationRun.alg_perceptron, EvaluationRun.alg_decision_tree, EvaluationRun.alg_svm]

        # Select a configuration depending on whether the mobility should be predicted for specific day periods of time or not
        if IS_PER_DAY_PERIOD:
            start_periods = [1, 49, 69]
            end_periods = [48, 68, 96]
            tasks = [EvaluationRun.task_next_place_daily]
            task_objects = [NextPlaceOrSlotPredictionTask]
        else:
            start_periods = [1]
            end_periods = [96]
            tasks = [EvaluationRun.task_next_slot_place, EvaluationRun.task_next_slot_transition, EvaluationRun.task_next_place]
            task_objects = [NextPlaceOrSlotPredictionTask, NextPlaceOrSlotPredictionTask, NextPlaceOrSlotPredictionTask]

### Visualization of STEPS 1 – 5

In the context of STEP 7 analysis, all the visualization steps 2a, 3a, 5a, and 5b are skipped and replaced by the final visualization of STEP 7.

### Visualization of performance comparison of different types of population models

- Main file: STEP7_visualize_demo_daily_population_models_performance.py
- Parameters: none
- Example: 'python STEP7_visualize_demo_daily_population_models_performance.py'

Lastly, we visualize the performance differences between population models that are built for different day periods of time and those built for the entire day.
Besides comparing them for the entire population of users, we also compare them for each of the 15 demographic groups.

In [3]:
PDF('../plots/daily_vs_all_population_model_improvements.pdf',size=(870,210))

In [2]:
class PDF(object):
  def __init__(self, pdf, size=(200,200)):
    self.pdf = pdf
    self.size = size

  def _repr_html_(self):
    return '<iframe src={0} width={1[0]} height={1[1]}></iframe>'.format(self.pdf, self.size)

  def _repr_latex_(self):
    return r'\includegraphics[width=1.0\textwidth]{{{0}}}'.format(self.pdf)