## Predicting flight delays with regression analysis
Let’s try to predict flight delays by using the sample flight data. We want to use information such as the weather and location of the origin and destination, the flight distance, and the carrier to predict the number of minutes each flight is delayed. Because the delay in minutes is a continuous numeric variable, we’ll use regression analysis to make the prediction.

We have chosen this dataset as an example because it is easily accessible for Kibana users and the use case is relevant. However, the data has been manually created and contains some inconsistencies. For example, a flight can be both delayed and canceled. Please remember that the quality of your input data will affect the quality of results.

Each document in the dataset contains details for a single flight, so the data is already in a two-dimensional, entity-based data structure (a data frame) and is ready for analysis. In general, you often need to transform your data into an entity-centric index before you can analyze it.

In [1]:
# imports
import pprint

import requests
from elasticsearch import Elasticsearch

# create a client to connect to Elasticsearch
es_url = 'http://localhost:9200'
es_client = Elasticsearch(es_url)

### Example document

In [2]:
# read one example document from the sample flights index

results = es_client.search(index='kibana_sample_data_flights', filter_path=['hits.hits._*'], size=1)
results

{'hits': {'hits': [{'_index': 'kibana_sample_data_flights',
    '_id': 'RtUC128BeYSKXsYkOwi4',
    '_score': 1.0,
    '_source': {'FlightNum': '9HY9SWR',
     'DestCountry': 'AU',
     'OriginWeather': 'Sunny',
     'OriginCityName': 'Frankfurt am Main',
     'AvgTicketPrice': 841.2656419677076,
     'DistanceMiles': 10247.856675613455,
     'FlightDelay': False,
     'DestWeather': 'Rain',
     'Dest': 'Sydney Kingsford Smith International Airport',
     'FlightDelayType': 'No Delay',
     'OriginCountry': 'DE',
     'dayOfWeek': 0,
     'DistanceKilometers': 16492.32665375846,
     'timestamp': '2020-01-13T00:00:00',
     'DestLocation': {'lat': '-33.94609833', 'lon': '151.177002'},
     'DestAirportID': 'SYD',
     'Carrier': 'Kibana Airlines',
     'Cancelled': False,
     'FlightTimeMin': 1030.7704158599038,
     'Origin': 'Frankfurt am Main Airport',
     'OriginLocation': {'lat': '50.033333', 'lon': '8.570556'},
     'DestRegion': 'SE-BD',
     'OriginAirportID': 'FRA',
     'Or

Regression is a supervised machine learning analysis and therefore needs to train on data that contains the ground truth for the dependent_variable that we want to predict. In this example, the ground truth is available in each document as the actual value of FlightDelayMin. In order to be analyzed, a document must contain at least one field with a supported data type (numeric, boolean, text, keyword, or ip) and must not contain arrays with more than one item.

If your source data consists of some documents that contain a dependent_variable and some that do not, the model is trained on the training_percent of the documents that contain ground truth. However, predictions are made against all of the data. The current implementation of regression analysis supports a single batch analysis for both training and predictions.
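The array restriction mentioned above can be checked locally before a job is created. The following is a minimal sketch, not part of the original notebook: `multi_item_array_fields` is a hypothetical helper, the sample document is invented for illustration, and the check only approximates the rule described in the text rather than reproducing how Elasticsearch applies it.

```python
def multi_item_array_fields(source, prefix=""):
    """Return dotted paths of fields whose value is a list with more than one item."""
    paths = []
    for key, value in source.items():
        path = f"{prefix}{key}"
        if isinstance(value, list) and len(value) > 1:
            paths.append(path)
        elif isinstance(value, dict):
            # Descend into object fields such as DestLocation.
            paths.extend(multi_item_array_fields(value, prefix=path + "."))
    return paths


# Hypothetical document: "tags" and "single" do not exist in the flights data.
example_doc = {
    "FlightNum": "9HY9SWR",
    "DestLocation": {"lat": "-33.94609833", "lon": "151.177002"},
    "tags": ["a", "b"],
    "single": [1],
}

offending_fields = multi_item_array_fields(example_doc)
print(offending_fields)
```

A document for which this list is non-empty would be a candidate for cleaning or exclusion before the analysis.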

### Creating a regression model
To predict the number of minutes delayed for each flight, first create a data frame analytics job. Use the create data frame analytics jobs API, as shown in the following example:

In [3]:
endpoint_url = "/_ml/data_frame/analytics/model-flight-delays"

job_config = {
  "source": {
    "index": [
      "kibana_sample_data_flights" # [1]
    ],
    "query": { 
      "range": {
        "DistanceKilometers": {  # [2]
          "gt": 0
        }
      }
    }
  },
  "dest": {
    "index": "df-flight-delays"  # [3]
  },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",  # [4]
      "training_percent": 90  #  [5] see below note on training percent
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": [     # [6]
      "Cancelled",
      "FlightDelay",
      "FlightDelayType"
    ]
  },
  "model_memory_limit": "100mb"  # [7]
}

result = requests.put(es_url+endpoint_url, json=job_config)
pprint.pprint(result.json())



{'allow_lazy_start': False,
 'analysis': {'regression': {'dependent_variable': 'FlightDelayMin',
                             'prediction_field_name': 'FlightDelayMin_prediction',
                             'randomize_seed': 3819668469703279014,
                             'training_percent': 90.0}},
 'analyzed_fields': {'excludes': ['Cancelled',
                                  'FlightDelay',
                                  'FlightDelayType'],
                     'includes': []},
 'create_time': 1580394054274,
 'dest': {'index': 'df-flight-delays', 'results_field': 'ml'},
 'id': 'model-flight-delays',
 'model_memory_limit': '100mb',
 'source': {'index': ['kibana_sample_data_flights'],
            'query': {'range': {'DistanceKilometers': {'gt': 0}}}},
 'version': '8.0.0'}


[1] The source index to analyze.


[2] This query removes erroneous data from the analysis to improve its quality.


[3] The index that will contain the results of the analysis; it will consist of a copy of the source index data where each document is annotated with the results.


[4] Specifies the continuous variable we want to predict with the regression analysis.


[5] Specifies the approximate proportion of data that is used for training. In this example we randomly select 90% of the source data for training.


[6] Specifies fields to be excluded from the analysis. It is recommended to exclude fields that either contain erroneous data or describe the dependent_variable.


[7] Specifies a memory limit for the job. If the job requires more than this amount of memory, it fails to start. This makes it possible to prevent job execution if the available memory on the node is limited.

#### A brief note on training percentage

As you may have noticed, the job configuration above sets `training_percent` to 90. This means that 90 percent of the Flights dataset is used to train the model and the remaining 10 percent is used to test it.

You might wonder what the best train/test split is and how to choose the percentage for your own job. The answer usually depends on your particular situation, but it is useful to consider the following tradeoffs.

The more data you supply at training time, the more examples the model has to learn from, which usually leads to better predictive performance. However, more training data also increases the training time, and at some point additional training examples yield only a marginal increase in accuracy.

Moreover, the more data you use for training, the less data you have for the testing phase. This means that you have fewer previously unseen examples to show your model, so your estimate of the generalization error may be less accurate.

In general, for datasets containing several thousand documents or more, start with a low training percentage of 5-10% and see how your results and runtime evolve as you increase it.
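One way to explore this tradeoff is to prepare several job configurations that differ only in `training_percent`. The sketch below is an illustration rather than part of the original notebook: the helper `job_config_for`, the job IDs, and the destination index names with a percentage suffix are hypothetical, and the range check is an assumption of this sketch. It only builds the configurations; each one could then be submitted with the same `requests.put` call used above.

```python
import copy

# Settings shared by all candidate jobs, copied from the configuration above.
BASE_JOB_CONFIG = {
    "source": {
        "index": ["kibana_sample_data_flights"],
        "query": {"range": {"DistanceKilometers": {"gt": 0}}},
    },
    "analysis": {"regression": {"dependent_variable": "FlightDelayMin"}},
    "analyzed_fields": {
        "includes": [],
        "excludes": ["Cancelled", "FlightDelay", "FlightDelayType"],
    },
    "model_memory_limit": "100mb",
}


def job_config_for(training_percent):
    """Return a (job_id, config) pair for one training percentage."""
    if not 1 <= training_percent <= 100:
        raise ValueError("training_percent must be between 1 and 100")
    config = copy.deepcopy(BASE_JOB_CONFIG)
    config["analysis"]["regression"]["training_percent"] = training_percent
    # Hypothetical naming scheme: one destination index per candidate job.
    config["dest"] = {"index": f"df-flight-delays-{training_percent}"}
    return f"model-flight-delays-{training_percent}", config


candidates = [job_config_for(percent) for percent in (10, 30, 50)]
for job_id, config in candidates:
    print(job_id, "->", config["dest"]["index"])
```

Writing each candidate to its own destination index keeps the results separate, so the evaluation metrics and runtimes can be compared afterwards.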

In [4]:
# 2. Start the job

start_endpoint = "/_ml/data_frame/analytics/model-flight-delays/_start"
result = requests.post(es_url+start_endpoint)
pprint.pprint(result.json())

{'acknowledged': True}


The job takes a few minutes to run. Runtime depends on the local hardware and also on the number of documents and fields that are analyzed. The more fields and documents, the longer the job runs.
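Because the job runs in the background, it can help to condense the per-phase progress reported by the job's `_stats` endpoint (queried in the next cell) into a small summary. The following is a minimal sketch under stated assumptions: `summarize_progress` and `is_finished` are hypothetical helpers, the sample response only mimics the shape of the stats output with made-up progress values, and treating "every phase at 100 percent" as finished is a heuristic chosen for illustration.

```python
def summarize_progress(stats_response, job_id):
    """Return {phase: progress_percent} for one job in a _stats response."""
    for job in stats_response.get("data_frame_analytics", []):
        if job.get("id") == job_id:
            return {p["phase"]: p["progress_percent"] for p in job.get("progress", [])}
    raise KeyError(f"job {job_id!r} not found in stats response")


def is_finished(progress):
    """Heuristic: the job is done when every reported phase has reached 100 percent."""
    return bool(progress) and all(percent >= 100 for percent in progress.values())


# Hypothetical response fragment shaped like the _stats output.
sample_stats = {
    "count": 1,
    "data_frame_analytics": [
        {
            "id": "model-flight-delays",
            "progress": [
                {"phase": "reindexing", "progress_percent": 100},
                {"phase": "loading_data", "progress_percent": 100},
                {"phase": "analyzing", "progress_percent": 40},
            ],
        }
    ],
}

progress = summarize_progress(sample_stats, "model-flight-delays")
print(progress, "finished:", is_finished(progress))
```

In a notebook, the same helpers could be applied to `result.json()` from the stats request, repeated every few seconds until `is_finished` returns `True`.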

In [6]:
# 3. Check the job stats

stats_endpoint = "/_ml/data_frame/analytics/model-flight-delays/_stats"
result = requests.get(es_url+stats_endpoint)
pprint.pprint(result.json())

{'count': 1,
 'data_frame_analytics': [{'assignment_explanation': '',
                           'id': 'model-flight-delays',
                           'node': {'attributes': {'ml.machine_memory': '34359738368',
                                                   'ml.max_open_jobs': '20',
                                                   'xpack.installed': 'true'},
                                    'ephemeral_id': 'x4i2YXznSRGhu2GGR-ucyg',
                                    'id': '6YdR-HiRQeiZ7-sarLIxng',
                                    'name': 'Camillas-MBP.lan',
                                    'transport_address': '192.168.1.131:9300'},
                           'progress': [{'phase': 'reindexing',
                                         'progress_percent': 100},
                                        {'phase': 'loading_data',
                                         'progress_percent': 100},
                                        {'phase': 'analyzing',
              

## View Regression Results
Now you have a new index that contains a copy of your source data with predictions for your dependent variable. Use the standard Elasticsearch search command to view the results in the destination index:

In [8]:
# fetch one document from the destination index that was not used for training
query = {"query": {"term": {"ml.is_training": {"value": False }}}}
result = es_client.search(index='df-flight-delays', filter_path=['hits.hits._*'], size=1, body=query)
result

{'hits': {'hits': [{'_index': 'df-flight-delays',
    '_id': '-9UC128BeYSKXsYkPAr_',
    '_score': 2.3027456,
    '_source': {'FlightNum': 'VUHDJ5M',
     'Origin': 'Guangzhou Baiyun International Airport',
     'OriginLocation': {'lon': '113.2990036', 'lat': '23.39240074'},
     'DestLocation': {'lon': '140.3860016', 'lat': '35.76470184'},
     'FlightDelay': False,
     'DistanceMiles': 1831.1103511142028,
     'FlightTimeMin': 163.71591427241867,
     'OriginWeather': 'Hail',
     'dayOfWeek': 2,
     'AvgTicketPrice': 998.7819944362366,
     'Carrier': 'Logstash Airways',
     'FlightDelayMin': 0,
     'OriginRegion': 'SE-BD',
     'FlightDelayType': 'No Delay',
     'DestAirportID': 'NRT',
     'timestamp': '2020-01-15T07:29:09',
     'Dest': 'Narita International Airport',
     'FlightTimeHour': 2.728598571206978,
     'Cancelled': False,
     'DistanceKilometers': 2946.886456903536,
     'OriginCityName': 'Guangzhou',
     'DestWeather': 'Sunny',
     'OriginCountry': 'CN',
    

## Evaluating Results
The results can be evaluated for documents that contain both the ground truth field and the prediction. In the example below, FlightDelayMin contains the ground truth and the prediction is stored as ml.FlightDelayMin_prediction.
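To make the pairing of ground truth and prediction concrete, the sketch below pulls both values out of search hits from the destination index. It is an illustration only: `actual_and_predicted` is a hypothetical helper, and the sample hit is abbreviated with a made-up prediction value, since the real output above is truncated.

```python
def actual_and_predicted(search_response, actual_field="FlightDelayMin", results_field="ml"):
    """Return (actual, predicted, is_training) tuples for hits that contain both values."""
    prediction_field = f"{actual_field}_prediction"
    rows = []
    for hit in search_response["hits"]["hits"]:
        source = hit["_source"]
        ml_results = source.get(results_field, {})
        # Skip documents that lack either the ground truth or the prediction.
        if actual_field in source and prediction_field in ml_results:
            rows.append(
                (source[actual_field], ml_results[prediction_field], ml_results.get("is_training"))
            )
    return rows


# Hypothetical, abbreviated hit; the prediction value is invented for illustration.
sample_response = {
    "hits": {
        "hits": [
            {
                "_index": "df-flight-delays",
                "_source": {
                    "FlightNum": "VUHDJ5M",
                    "FlightDelayMin": 0,
                    "ml": {"FlightDelayMin_prediction": 12.5, "is_training": False},
                },
            },
            {"_index": "df-flight-delays", "_source": {"FlightNum": "NO-RESULT"}},
        ]
    }
}

rows = actual_and_predicted(sample_response)
print(rows)
```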

Use the data frame analytics evaluate API to evaluate the results.

First, we want to know the training error that represents how well the model performed on the training dataset:

In [9]:
# compute the training error

evaluate_endpoint = "/_ml/data_frame/_evaluate"

config = {
 "index": "df-flight-delays",  # [1]
   "query": {
    "term": {
      "ml.is_training": {  # [2]
        "value": True  
      }
    }
  },
 "evaluation": {
   "regression": {
     "actual_field": "FlightDelayMin",   # [3]
     "predicted_field": "ml.FlightDelayMin_prediction", # [4]
     "metrics": {
       "r_squared": {},
       "mean_squared_error": {}
     }
   }
 }
}

result = requests.post(es_url+evaluate_endpoint, json=config)
result.json()

{'regression': {'mean_squared_error': {'error': 2622.6653870997893},
  'r_squared': {'value': 0.7164057980072627}}}

[1] The destination index which is the output of the analysis job.

[2] We calculate the training error by only evaluating the training data.

[3] The ground truth label.

[4] Predicted value.
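For intuition about the two metrics requested in the evaluation, the sketch below computes the textbook definitions of mean squared error and R squared on a tiny, made-up set of delays. It assumes that the evaluate API's metrics of the same names follow these standard definitions; the numbers are invented and unrelated to the flights data.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    residual_sum = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_sum = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_sum / total_sum


# Made-up delays in minutes and made-up predictions.
actual_delays = [0, 30, 60, 90]
predicted_delays = [10, 20, 70, 80]

mse = mean_squared_error(actual_delays, predicted_delays)
r2 = r_squared(actual_delays, predicted_delays)
print(f"MSE: {mse:.1f}, R^2: {r2:.3f}")
```

A lower mean squared error and an R squared closer to 1 both indicate predictions that track the actual values more closely.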

Next, we would like to compute the generalization error, that is, how well the model performs on data points that were not used in training.

In [10]:
# compute the generalization error

evaluate_endpoint = "/_ml/data_frame/_evaluate"

config = {
 "index": "df-flight-delays",  # [1]
   "query": {
    "term": {
      "ml.is_training": {  # [2]
        "value": False  
      }
    }
  },
 "evaluation": {
   "regression": {
     "actual_field": "FlightDelayMin",   # [3]
     "predicted_field": "ml.FlightDelayMin_prediction", # [4]
     "metrics": {
       "r_squared": {},
       "mean_squared_error": {}
     }
   }
 }
}

result = requests.post(es_url+evaluate_endpoint, json=config)
result.json()

{'regression': {'mean_squared_error': {'error': 4115.252232438452},
  'r_squared': {'value': 0.5929111762737593}}}
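The two evaluation results can be compared directly to see how much performance drops on unseen data. The following minimal sketch uses the values printed above; `metric_values` and `generalization_gap` are hypothetical helpers, and reading a larger error on the test data as a sign of some overfitting is a common rule of thumb rather than a conclusion stated by the original notebook.

```python
# Evaluation responses copied from the two outputs above.
train_result = {
    "regression": {
        "mean_squared_error": {"error": 2622.6653870997893},
        "r_squared": {"value": 0.7164057980072627},
    }
}
test_result = {
    "regression": {
        "mean_squared_error": {"error": 4115.252232438452},
        "r_squared": {"value": 0.5929111762737593},
    }
}


def metric_values(result):
    """Extract (mean_squared_error, r_squared) from an evaluate response."""
    regression = result["regression"]
    return regression["mean_squared_error"]["error"], regression["r_squared"]["value"]


def generalization_gap(train, test):
    """Summarize how much worse the model does on test data than on training data."""
    train_mse, train_r2 = metric_values(train)
    test_mse, test_r2 = metric_values(test)
    return {
        "mse_increase": test_mse - train_mse,
        "mse_ratio": test_mse / train_mse,
        "r_squared_drop": train_r2 - test_r2,
    }


gap = generalization_gap(train_result, test_result)
for name, value in gap.items():
    print(f"{name}: {value:.4f}")
```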