## <u style="text-decoration-thickness: 4px;">4. Concept and Methodology</u>

### **Approach**  
The methodology involves processing helicopter flight trajectory data from an HDF5 file, preparing it for machine learning, and applying an LSTM model for path prediction.

### **Methodology Breakdown**  
1. **Data Loading & Exploration**  
   - Access HDF5 file using `h5py`.  
   - Identify datasets and convert to Pandas DataFrame.  

2. **Data Preprocessing**  
   - Extract features and labels.  
   - Handle missing values.  

3. **Train-Test Split**  
   - Partition data chronologically: the first 80% of rows for training, the last 20% for testing.  

4. **Model Training**  
   - Train an LSTM model.  

5. **Evaluation**  
   - Predict and compute Mean Squared Error (MSE).  
   - Analyze performance.  
 
- **Interaction of Methods**: The methods follow a sequential pipeline, where data preprocessing ensures quality input for training, and model evaluation verifies performance.  
- **Overall Structure**: Data flows from HDF5 storage → preprocessing → model training → evaluation.  
- **Boundary Conditions**: The methodology assumes clean data with sufficient features for LSTM. Poor-quality or insufficient data may affect accuracy.  
- **Method Selection**: LSTM was chosen for efficiency, while alternative complex models were avoided for initial analysis.

# <u style="text-decoration-thickness: 4px;">HDF5 Data Processing & Linear Regression Model</u>

## **1. Import Libraries**  
- `h5py`, `numpy`, `pandas`, and `sklearn` for data handling and modeling.

## **2. Load & Explore HDF5 File**  
- Open the file with `h5py.File()`.  
- List datasets using `.visititems()`.  
- Convert to a `pandas.DataFrame`.  

## **3. Preprocess Data**  
- Extract features & labels.  
- Handle missing values (if any).  

## **4. Train-Test Split**  
- Use `train_test_split()` for data partitioning.  

## **5. Train & Evaluate Model**  
- Train a `LinearRegression()` model.  
- Predict and compute **Mean Squared Error (MSE)**.  

## **6. Results**  
- Print dataset preview.  
- Display MSE and analyze performance.  


In [1]:
##### Import required libraries
import h5py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# File name
#file_name = "final_updated_features.h5"
file_name = "final_updated_features_with_flight_id.h5"

# Step 1: Open the H5 file and explore its structure
with h5py.File(file_name, 'r') as h5file:
    print("Datasets and groups in the H5 file:")
    h5file.visit(print)  # Print the names of all groups and datasets in the file

Datasets and groups in the H5 file:
float_columns


# <u style="text-decoration-thickness: 4px;">Exploring an HDF5 Dataset</u>

## **1. Select a Dataset**  
- Identify available datasets in the HDF5 file.  
- Replace `'float_columns'` with the actual dataset name.  

## **2. Open and Inspect the Dataset**  
- Check if the dataset exists in the file.  
- Print dataset shape and data type.  

## **3. Convert to Pandas DataFrame (If Needed)**  
- Load the dataset into a NumPy array.  
- Convert to a Pandas DataFrame for further analysis.  

## **4. Next Steps**  
- Preprocess the dataset (handle missing values, normalization, etc.).  
- Visualize data if necessary.  
- Use it for machine learning model training.  


In [2]:
# Step 2: Choose a dataset to explore (replace 'your_dataset_name' with the actual dataset name after inspecting the file)
dataset_name = 'float_columns'  # Replace this with the actual dataset name

with h5py.File(file_name, 'r') as h5file:
    if dataset_name in h5file:
        data = h5file[dataset_name]
        print(f"\nDataset: {dataset_name}")
        print(f"Shape: {data.shape}")
        print(f"Data type: {data.dtype}")



Dataset: float_columns
Shape: (25158056, 8)
Data type: float64


# <u style="text-decoration-thickness: 4px;">Dataset Exploration Code</u>

## **1. Open the HDF5 File**  
- The HDF5 file `"final_updated_features_with_flight_id.h5"` is opened using `h5py.File()`.  
- We check if the dataset `'float_columns'` exists within the file.

## **2. Dataset Inspection**  
- If the dataset is found, the **shape** and **data type** of the dataset are printed for a quick understanding of its structure.  
- Example output would show the dataset's dimensions and the type of data stored (e.g., `float64`).

## **3. Preview Data**  
- We slice the first rows of the dataset (`dataset[:10000]`) to inspect the values; NumPy abbreviates the printed array to its first and last rows.  
- The data can then be **converted into a Pandas DataFrame** for better readability and manipulation.

## **4. Output Example**  
- An example of the output shows the shape, data type, and the first few rows of the dataset, allowing you to validate its content and structure.

## **5. Next Steps**  
- After exploring the data, you can move to further preprocessing, analysis, and visualization before using the data for machine learning models.


In [3]:
##file_name = "combined_with_flight_id.h5"
# File path
file_name = "final_updated_features_with_flight_id.h5"

# Open the HDF5 file
with h5py.File(file_name, 'r') as hf:
    # Check if the dataset exists
    if 'float_columns' in hf:
        dataset = hf['float_columns']
        
        # Print the shape of the dataset
        print(f"Dataset 'float_columns' has shape: {dataset.shape}")
        
        # Print the first 10,000 rows (NumPy abbreviates the printed array)
        print(dataset[:10000])  

    else:
        print("'float_columns' dataset not found in the file.")


Dataset 'float_columns' has shape: (25158056, 8)
[[ 775.  775.   91. ...  320.  180. 7413.]
 [ 775.  775.   91. ...  320.  180. 7413.]
 [ 775.  775.   91. ...  320.  180. 7413.]
 ...
 [1575. 2375.  123. ... -128.  180. 2479.]
 [1575. 2375.  123. ... -128.  180. 2479.]
 [1575. 2375.  123. ... -128.  180. 2479.]]


# <u style="text-decoration-thickness: 4px;">Loading Data and Converting to DataFrame</u>

## **1. Import Libraries**  
- **`h5py`**: To read the HDF5 file.  
- **`numpy`**: For array operations.  
- **`pandas`**: For handling data in DataFrame format.  
- **`sklearn.model_selection`**: For splitting the dataset into training and testing sets.  
- **`sklearn.linear_model`**: To apply linear regression.  
- **`sklearn.metrics`**: For calculating model performance (e.g., Mean Squared Error).

## **2. Load Dataset from HDF5 File**  
- The file `"final_updated_features_with_flight_id.h5"` is opened using `h5py.File()`.  
- The dataset `'float_columns'` is accessed and loaded into memory using `[:]`, which loads the entire dataset.

## **3. Convert to Pandas DataFrame**  
- The loaded dataset is converted into a **Pandas DataFrame** for easier manipulation and analysis.

## **4. Next Steps**  
- Once the data is in a DataFrame, you can proceed with preprocessing (e.g., handling missing values) and model training.  
- The next step might involve splitting the data into training and testing sets and applying a machine learning model, such as **Linear Regression**.
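
Loading the whole dataset with `[:]` pulls every row into RAM at once. As a rough sanity check before doing so, the footprint can be estimated from the shape and dtype reported earlier, `(25158056, 8)` and `float64`. This is only a minimal sketch; the helper name `estimate_nbytes` is illustrative and is not part of `h5py` or NumPy.

```python
import numpy as np


def estimate_nbytes(shape, dtype):
    """Return the number of bytes an array of this shape and dtype occupies."""
    n_items = int(np.prod(shape, dtype=np.int64))
    return n_items * np.dtype(dtype).itemsize


# Shape and dtype as printed by the exploration cell above.
n_bytes = estimate_nbytes((25158056, 8), "float64")
print(f"{n_bytes / 1024**3:.2f} GiB")  # roughly 1.5 GiB for the raw array
```

Several arrays of comparable size are created later in the notebook (the DataFrame, `X`, `y`, and the scaled copies), so peak memory use is likely to be a multiple of this figure.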


In [4]:
import h5py
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

file_name = "final_updated_features_with_flight_id.h5"  # Replace with your actual file path
with h5py.File(file_name, 'r') as h5_file:
    dataset = h5_file['float_columns'][:]  # Load the entire dataset into memory
df = pd.DataFrame(dataset)

In [5]:
print(df.head())
print(df.tail())

       0      1     2          3         4      5      6       7
0  775.0  775.0  91.0  53.473299  9.945374  320.0  180.0  7413.0
1  775.0  775.0  91.0  53.473299  9.945374  320.0  180.0  7413.0
2  775.0  775.0  91.0  53.473299  9.945374  320.0  180.0  7413.0
3  775.0  775.0  91.0  53.473299  9.945374  320.0  180.0  7413.0
4  775.0  775.0  91.0  53.473299  9.945374  320.0  180.0  7413.0
               0       1      2          3          4    5      6        7
25158051  2775.0  3250.0  123.0  47.617264  10.281898  0.0  180.0  21105.0
25158052  2775.0  3250.0  123.0  47.617264  10.281898  0.0  180.0  21105.0
25158053  2775.0  3250.0  123.0  47.617264  10.281898  0.0  180.0  21105.0
25158054  2775.0  3250.0  123.0  47.617264  10.281898  0.0  180.0  21105.0
25158055  2775.0  3250.0  123.0  47.617264  10.281898  0.0  180.0  21105.0


# <u style="text-decoration-thickness: 4px;">Assigning Custom Column Names</u>

## **1. Define Custom Column Names**  
- A list of custom column names is provided for the dataset:
  ```python
  column_names = ['altitude', 'geoaltitude', 'groundspeed', 'latitude', 'longitude', 'vertical_rate', 'compute_gs', 'flight_id']
  ```


In [6]:
# Assign custom column names based on the list you provided
column_names = ['altitude', 'geoaltitude', 'groundspeed', 'latitude', 'longitude', 'vertical_rate', 'compute_gs', 'flight_id']

# Assign these column names to the DataFrame
df.columns = column_names

# Check if the column names are assigned correctly
print(df.columns)


Index(['altitude', 'geoaltitude', 'groundspeed', 'latitude', 'longitude',
       'vertical_rate', 'compute_gs', 'flight_id'],
      dtype='object')


# <u style="text-decoration-thickness: 4px;">Converting Latitude and Longitude to Cartesian Coordinates</u>

## **1. Import Libraries**  
- **`numpy`**: For mathematical operations (like converting degrees to radians).  
- **`pandas`**: For DataFrame manipulation.

## **2. Convert Latitude and Longitude to Radians**  
- The columns `'latitude'` and `'longitude'` are converted from degrees to radians:
  ```python
  df['lat_rad'] = np.radians(df['latitude'])  # Convert latitude to radians
  df['lon_rad'] = np.radians(df['longitude'])  # Convert longitude to radians
  ```


In [7]:
import numpy as np
import pandas as pd

# Assuming df is your DataFrame with 'latitude' and 'longitude' columns in degrees
df['lat_rad'] = np.radians(df['latitude'])  # Convert latitude to radians
df['lon_rad'] = np.radians(df['longitude'])  # Convert longitude to radians

# Compute x, y, z coordinates
df['x'] = np.cos(df['lat_rad']) * np.cos(df['lon_rad'])
df['y'] = np.cos(df['lat_rad']) * np.sin(df['lon_rad'])
df['z'] = np.sin(df['lat_rad'])

# Drop original latitude, longitude, and intermediate radian columns
df.drop(columns=['latitude', 'longitude', 'lat_rad', 'lon_rad'], inplace=True)

# Display the transformed DataFrame
print(df.head())


   altitude  geoaltitude  groundspeed  vertical_rate  compute_gs  flight_id  \
0     775.0        775.0         91.0          320.0       180.0     7413.0   
1     775.0        775.0         91.0          320.0       180.0     7413.0   
2     775.0        775.0         91.0          320.0       180.0     7413.0   
3     775.0        775.0         91.0          320.0       180.0     7413.0   
4     775.0        775.0         91.0          320.0       180.0     7413.0   

          x         y        z  
0  0.586253  0.102796  0.80358  
1  0.586253  0.102796  0.80358  
2  0.586253  0.102796  0.80358  
3  0.586253  0.102796  0.80358  
4  0.586253  0.102796  0.80358  
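
The conversion above places each position on a unit sphere, so the original latitude and longitude can be recovered from `x`, `y`, and `z` when predictions need to be read as geographic coordinates. The following is a minimal sketch of both directions; the function names are illustrative and do not appear in the notebook.

```python
import numpy as np


def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    return (np.cos(lat) * np.cos(lon),
            np.cos(lat) * np.sin(lon),
            np.sin(lat))


def unit_xyz_to_latlon(x, y, z):
    """Inverse mapping: recover latitude/longitude in degrees."""
    lat = np.degrees(np.arcsin(np.clip(z, -1.0, 1.0)))
    lon = np.degrees(np.arctan2(y, x))
    return lat, lon


# First row of the dataset as printed earlier.
x, y, z = latlon_to_unit_xyz(53.473299, 9.945374)
lat, lon = unit_xyz_to_latlon(x, y, z)
print(x, y, z)
print(lat, lon)
```

The mapping carries no height information, which is why altitude stays as a separate column. Predicted `(x, y, z)` triples will generally not lie exactly on the unit sphere, so normalizing them before inverting may be worthwhile.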


In [8]:
df = df.sort_values(by="flight_id")

In [9]:
df = df.drop(columns=["flight_id"])

In [10]:
print("Data after sorting and removing the flight_id column:")
print(df.head())

Data after sorting and removing the flight_id column:
       altitude  geoaltitude  groundspeed  vertical_rate  compute_gs  \
11797    1400.0       1375.0        132.0            0.0       180.0   
88921     450.0        425.0        116.0           64.0       180.0   
88922     475.0        425.0        115.0           64.0       180.0   
88923     475.0        450.0        115.0           64.0       180.0   
88924     475.0        450.0        115.0           64.0       180.0   

              x         y         z  
11797  0.586926  0.104000  0.802933  
88921  0.586205  0.103102  0.803575  
88922  0.586226  0.103064  0.803565  
88923  0.586226  0.103064  0.803565  
88924  0.586226  0.103064  0.803565  


# <u style="text-decoration-thickness: 4px;">Splitting Data into Input (X) and Output (y)</u>

## **1. Importing Libraries**  
- **`numpy`**: For array manipulation and handling data.

## **2. Splitting the DataFrame**  
- The dataset is split into two parts:  
  - **Input (X)**: This will contain the features used to train the model.  
  - **Output (y)**: This will contain the target variables that the model will predict.

### **a. Selecting Input (X)**  
- The input variable **X** includes all columns from the DataFrame:  
  ```python
  X = df.values  # Use all columns as input
  ```


In [11]:
import numpy as np

# Split into input (X) and output (y)
X = df.values  # Use all columns as input
y = df.iloc[:, [0, 5, 6, 7]].values  # Targets: altitude, x, y, z (0-based columns 0, 5, 6, 7)


# Split into input (X) and output (y)
#X = df.iloc[:, 3:].values  # Use all columns except the first 3 as input
#y = df.iloc[:, :3].values  # Use the first 3 columns as output


print("X shape:", X.shape)
print("y shape:", y.shape)


X shape: (25158056, 8)
y shape: (25158056, 4)
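
Positional indices such as `[0, 5, 6, 7]` are easy to misread. As a small sketch, the same positions can be derived from column names; the column order below is taken from the DataFrame preview printed after `flight_id` was dropped, and the list names are illustrative.

```python
# Column order after dropping latitude, longitude and flight_id,
# as shown in the DataFrame preview above.
feature_columns = ["altitude", "geoaltitude", "groundspeed", "vertical_rate",
                   "compute_gs", "x", "y", "z"]
# Targets assumed from the positions used in the split: 0, 5, 6 and 7.
target_columns = ["altitude", "x", "y", "z"]

target_idx = [feature_columns.index(name) for name in target_columns]
print(target_idx)  # [0, 5, 6, 7]
```

With the DataFrame at hand, `df[target_columns].values` selects the same columns without any index arithmetic.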


# <u style="text-decoration-thickness: 4px;">Splitting the Data into Train and Test Sets</u>

## **1. Define Train-Test Split Ratio**  
- The **train ratio** is set to 80%, meaning 80% of the data will be used for training and 20% for testing:  
  ```python
  train_ratio = 0.8
  ```


In [12]:
## splitting into train and test

train_ratio = 0.8
train_size = int(len(df) * train_ratio)

X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


X_train shape: (20126444, 8)
y_train shape: (20126444, 4)
X_test shape: (5031612, 8)
y_test shape: (5031612, 4)


# <u style="text-decoration-thickness: 4px;">Time-Series Data Preparation for Forecasting</u>

This code snippet is designed to prepare training and test datasets for a time-series forecasting model, specifically aiming to predict values 60 timesteps ahead.

#### 1. **Parameters**:
   - `n_target_steps = 60`: Specifies the number of timesteps into the future that the model is expected to predict.

#### 2. **Training Data Preparation**:
   - `X_train_new` and `y_train_new` are initialized as empty lists to store the new input-output pairs for training.
   - A loop iterates through the original training data (`X_train` and `y_train`):
     - Each input sample consists of the current timestep (`X_train[i, :]`).
     - The output is the value 60 timesteps ahead (`y_train[i + n_target_steps, :]`).
   - After the loop, the lists `X_train_new` and `y_train_new` are converted into NumPy arrays (`X_train` and `y_train`).

#### 3. **Test Data Preparation**:
   - Similarly, `X_test_new` and `y_test_new` are initialized for test data processing.
   - A loop processes the `X_test` and `y_test` in the same way, generating input-output pairs where the output is 60 timesteps ahead of the input.

#### 4. **Final Shape Check**:
   - The shapes of the final training and test datasets (`X_train`, `y_train`, `X_test`, and `y_test`) are printed. The expected shapes are:
     - `X_train` and `X_test`: `(samples, features)`
     - `y_train` and `y_test`: `(samples, target_features)`
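
The pairing described above (input at row `i`, target at row `i + n_target_steps`) can also be expressed with two array slices, which avoids a Python loop over millions of rows. This is a sketch on toy data; `shift_targets` is a hypothetical helper, not a function used in the notebook.

```python
import numpy as np


def shift_targets(X, y, n_steps):
    """Pair each input row with the target row n_steps later."""
    if n_steps <= 0 or n_steps >= len(X):
        raise ValueError("n_steps must be between 1 and len(X) - 1")
    # Drop the last n_steps inputs and the first n_steps targets.
    return X[:-n_steps], y[n_steps:]


# Toy data: 10 timesteps, 2 features, 1 target column.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float).reshape(10, 1)

X_shift, y_shift = shift_targets(X_demo, y_demo, n_steps=3)
print(X_shift.shape, y_shift.shape)  # (7, 2) (7, 1)
```

The slices are views on the original arrays rather than copies. Note that either approach shifts across the whole concatenated array, so pairs within `n_steps` rows of a flight boundary combine an input from one flight with a target from the next.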



In [13]:
# Parameters
n_target_steps = 60  # Predict 60 timesteps into the future

# Initialize lists for inputs and outputs
X_train_new = []
y_train_new = []

# Create training data
for i in range(len(X_train) - n_target_steps):
    # Current timestep as input
    X_train_new.append(X_train[i, :])  # Take the current timestep (single row)
    # Value 60 timesteps into the future as output
    y_train_new.append(y_train[i + n_target_steps, :])  # Predict 60 timesteps ahead

# Convert to NumPy arrays
X_train = np.array(X_train_new)
y_train = np.array(y_train_new)

# Do the same for the test set
X_test_new = []
y_test_new = []

for i in range(len(X_test) - n_target_steps):
    X_test_new.append(X_test[i, :])
    y_test_new.append(y_test[i + n_target_steps, :])

X_test = np.array(X_test_new)
y_test= np.array(y_test_new)

# Check shapes
print("X_train shape:", X_train.shape)  # Should be (samples, features)
print("y_train shape:", y_train.shape)  # Should be (samples, target_features)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)


X_train shape: (20126384, 8)
y_train shape: (20126384, 4)
X_test shape: (5031552, 8)
y_test shape: (5031552, 4)


# <u style="text-decoration-thickness: 4px;">Feature Scaling for Time-Series Data</u>

This code snippet applies feature scaling using `MinMaxScaler` from `sklearn` to transform the input features and target values of a time-series dataset. The goal is to normalize the data for improved model performance.

#### 1. **Scaler Initialization**:
   - Four `MinMaxScaler` objects are created:
     - `scaler_X_altitude`: Scales the altitude feature in the input data (`X`) to a range of [0, 1].
     - `scaler_X_other`: Scales other features in `X` to a range of [-1, 1].
     - `scaler_y_altitude`: Scales the output altitude (`y`) to a range of [0, 1].
     - `scaler_y_other`: Scales the other output columns (`y`) to a range of [-1, 1].

#### 2. **Separation of Altitude and Other Features**:
   - The altitude column is isolated from both the input features (`X_train` and `X_test`) and target values (`y_train` and `y_test`), using `altitude_index = 0` as the reference for the altitude column.
   - The altitude values are reshaped to be a 2D array, while the remaining features are stored separately in `X_train_other` and `X_test_other` for the input, and `y_train_other` and `y_test_other` for the target.

#### 3. **Scaling**:
   - The altitude features are scaled between 0 and 1 using `scaler_X_altitude` and `scaler_y_altitude`.
   - The remaining features and target values are scaled between -1 and 1 using `scaler_X_other` and `scaler_y_other`.

#### 4. **Recombining Scaled Features**:
   - After scaling, the altitude features (both input and output) are combined with the other features (input and output) to form the final scaled input (`X_train`, `X_test`) and target (`y_train`, `y_test`) datasets using `np.hstack()`.

#### 5. **Shape Check**:
   - The shapes of the scaled datasets (`X_train`, `y_train`, `X_test`, `y_test`) are printed to ensure the dimensions remain consistent after scaling.
   
#### 6. **Optional Output**:
   - The first few rows of the scaled training data are printed for inspection, allowing verification of the transformed values.
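
To make the scaling step concrete, the per-column arithmetic behind min-max scaling can be written out in plain NumPy. This is a sketch of the formula only, not scikit-learn's implementation, and the function names are hypothetical. The minimum and maximum are learned from the training data and then reused for the test data.

```python
import numpy as np


def minmax_fit(train, feature_range=(0.0, 1.0)):
    """Learn per-column scale and offset from the training data only."""
    lo, hi = feature_range
    col_min = train.min(axis=0)
    col_max = train.max(axis=0)
    # Constant columns get a span of 1 to avoid division by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    scale = (hi - lo) / span
    offset = lo - col_min * scale
    return scale, offset


def minmax_apply(data, scale, offset):
    """Scale data with parameters learned by minmax_fit."""
    return data * scale + offset


def minmax_invert(scaled, scale, offset):
    """Map scaled values back to the original units."""
    return (scaled - offset) / scale


train = np.array([[0.0, 10.0], [50.0, 20.0], [100.0, 30.0]])
test = np.array([[25.0, 40.0]])

scale, offset = minmax_fit(train, feature_range=(-1.0, 1.0))
print(minmax_apply(train, scale, offset))
print(minmax_apply(test, scale, offset))
```

Test values outside the training range (the `40.0` here) map outside the target interval, which is expected when the parameters come from the training data alone.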


In [14]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create MinMaxScaler objects
scaler_X_altitude = MinMaxScaler(feature_range=(0, 1))  # For altitude to scale between 0 and 1
scaler_X_other = MinMaxScaler(feature_range=(-1, 1))  # For other features to scale between -1 and 1
scaler_y_altitude = MinMaxScaler(feature_range=(0, 1))  # For output altitude to scale between 0 and 1
scaler_y_other = MinMaxScaler(feature_range=(-1, 1))  # For other output columns to scale between -1 and 1

# Identify the altitude column index
altitude_index = 0  # Change this based on your dataset

# Separate the altitude column from X (features)
altitude_train_X = X_train[:, altitude_index].reshape(-1, 1)
X_train_other = X_train[:, [i for i in range(X_train.shape[1]) if i != altitude_index]]

altitude_test_X = X_test[:, altitude_index].reshape(-1, 1)
X_test_other = X_test[:, [i for i in range(X_test.shape[1]) if i != altitude_index]]

# Separate the altitude column from y (target)
altitude_train_y = y_train[:, altitude_index].reshape(-1, 1)
y_train_other = y_train[:, [i for i in range(y_train.shape[1]) if i != altitude_index]]

altitude_test_y = y_test[:, altitude_index].reshape(-1, 1)
y_test_other = y_test[:, [i for i in range(y_test.shape[1]) if i != altitude_index]]

# Scale the altitude feature in X to [0, 1]
altitude_train_X_scaled = scaler_X_altitude.fit_transform(altitude_train_X)
altitude_test_X_scaled = scaler_X_altitude.transform(altitude_test_X)

# Scale the other features in X to [-1, 1]
X_train_other_scaled = scaler_X_other.fit_transform(X_train_other)
X_test_other_scaled = scaler_X_other.transform(X_test_other)

# Scale the altitude in y to [0, 1]
altitude_train_y_scaled = scaler_y_altitude.fit_transform(altitude_train_y)
altitude_test_y_scaled = scaler_y_altitude.transform(altitude_test_y)

# Scale the other columns in y to [-1, 1]
y_train_other_scaled = scaler_y_other.fit_transform(y_train_other)
y_test_other_scaled = scaler_y_other.transform(y_test_other)

# Combine the scaled altitude with the scaled other features in X
X_train = np.hstack([altitude_train_X_scaled, X_train_other_scaled])
X_test = np.hstack([altitude_test_X_scaled, X_test_other_scaled])

# Combine the scaled altitude with the scaled other columns in y
y_train= np.hstack([altitude_train_y_scaled, y_train_other_scaled])
y_test = np.hstack([altitude_test_y_scaled, y_test_other_scaled])

# Check the shapes after scaling (should remain the same)
print("X_train_scaled shape:", X_train.shape)
print("y_train_scaled shape:", y_train.shape)
print("X_test_scaled shape:", X_test.shape)
print("y_test_scaled shape:", y_test.shape)

# Optional: Print the first few rows of the scaled training data
print(X_train[:5])
print(y_train[:5])

X_train_scaled shape: (20126384, 8)
y_train_scaled shape: (20126384, 4)
X_test_scaled shape: (5031552, 8)
y_test_scaled shape: (5031552, 4)
[[ 2.03125000e-02 -9.62580394e-01 -8.73015873e-01  4.23728814e-02
  -1.04805054e-13  5.82042289e-01  2.44283662e-01  5.55616578e-01]
 [ 1.28906250e-02 -9.77392321e-01 -8.88407888e-01  4.51977401e-02
  -1.04805054e-13  5.80401163e-01  2.43010181e-01  5.61400906e-01]
 [ 1.30859375e-02 -9.77392321e-01 -8.89369889e-01  4.51977401e-02
  -1.04805054e-13  5.80449006e-01  2.42955740e-01  5.61307183e-01]
 [ 1.30859375e-02 -9.77002534e-01 -8.89369889e-01  4.51977401e-02
  -1.04805054e-13  5.80449006e-01  2.42955740e-01  5.61307183e-01]
 [ 1.30859375e-02 -9.77002534e-01 -8.89369889e-01  4.51977401e-02
  -1.04805054e-13  5.80449006e-01  2.42955740e-01  5.61307183e-01]]
[[0.01230469 0.58061868 0.24267503 0.56104589]
 [0.01230469 0.58053663 0.24266607 0.56129005]
 [0.01230469 0.58053663 0.24266607 0.56129005]
 [0.01230469 0.58065747 0.24253726 0.56104589]
 [0.01

# <u style="text-decoration-thickness: 4px;">Overview of Model Serialization for Scalers</u>

This code snippet demonstrates how to save the trained `MinMaxScaler` objects to disk using `joblib` for future use.

#### 1. **Saving Scalers**:
   - The following scalers are serialized (saved) as `.pkl` files using `joblib.dump()`:
     - `scaler_X_altitude`: The scaler for scaling the altitude feature in the input data (`X`).
     - `scaler_X_other`: The scaler for scaling other features in the input data (`X`).
     - `scaler_y_altitude`: The scaler for scaling the altitude feature in the target data (`y`).
     - `scaler_y_other`: The scaler for scaling other target columns in the target data (`y`).

   These serialized scaler objects can later be reloaded into another script or model to ensure that the same scaling transformations are applied consistently during inference or testing.


In [15]:
import joblib
joblib.dump(scaler_X_altitude, 'scaler_X_altitude.pkl')
joblib.dump(scaler_X_other, 'scaler_X_other.pkl')
joblib.dump(scaler_y_altitude, 'scaler_y_altitude.pkl')
joblib.dump(scaler_y_other, 'scaler_y_other.pkl')


['scaler_y_other.pkl']

## <u style="text-decoration-thickness: 4px;">Dimension Expansion for LSTM Input</u>

This code snippet adjusts the dimensions of the input data to make it compatible with the requirements of a Long Short-Term Memory (LSTM) network.

#### 1. **Expanding Dimensions**:
   - The `np.expand_dims()` function is used to add an additional dimension to the input datasets (`X_train` and `X_test`), converting them into 3D arrays:
     - The new shape becomes `(samples, time_steps, features)`, where:
       - `samples` represents the number of data points.
       - `time_steps` is set to 1 since we are processing one timestep at a time (i.e., each sequence has length 1).
       - `features` refers to the number of features in the dataset.
   - Specifically, the expansion happens along the second axis (axis=1), which is the time-step dimension.

#### 2. **Shape of Data**:
   - After the operation, the shape of `X_train` and `X_test` changes from `(samples, features)` to `(samples, 1, features)`. This ensures the data fits the expected input format for LSTM models, which require 3D data.

#### 3. **Print Shape for Verification**:
   - The shapes of `X_train` and `X_test` after the expansion are printed to verify the transformation. The expected output is:
     - `X_train shape after expansion: (samples, 1, features)`
     - `X_test shape after expansion: (samples, 1, features)`
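
With `time_steps = 1` the LSTM sees one row per sample. For comparison, the sketch below shows how the same 3D layout could hold longer windows of consecutive rows. It is an illustration on toy data, not what the notebook does; `make_windows` is a hypothetical helper, and `sliding_window_view` requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(X, time_steps):
    """Return overlapping windows with shape (samples, time_steps, features)."""
    windows = sliding_window_view(X, window_shape=time_steps, axis=0)
    # The window axis is appended last: (samples, features, time_steps).
    return np.transpose(windows, (0, 2, 1))


X_demo = np.arange(12, dtype=float).reshape(6, 2)  # 6 timesteps, 2 features

single_step = np.expand_dims(X_demo, axis=1)
print(single_step.shape)  # (6, 1, 2)

windows = make_windows(X_demo, time_steps=3)
print(windows.shape)  # (4, 3, 2)
```

Each window of length 3 contains three consecutive rows, so the number of samples drops from 6 to 4.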


In [16]:
# Expand dimensions for LSTM input (samples, time_steps, features)
X_train = np.expand_dims(X_train, axis=1)  # Shape becomes (samples, 1, features)
X_test = np.expand_dims(X_test, axis=1)    

print("X_train shape after expansion:", X_train.shape)
print("X_test shape after expansion:", X_test.shape)

X_train shape after expansion: (20126384, 1, 8)
X_test shape after expansion: (5031552, 1, 8)


# <u style="text-decoration-thickness: 4px;">LSTM Model for Time-Series Forecasting</u>

This code snippet defines, compiles, trains, and evaluates a Long Short-Term Memory (LSTM) model for a time-series forecasting task.

#### 1. **Model Definition**:
   - A Sequential model is created using Keras:
     - **LSTM Layer**: 
       - The LSTM layer is added with 7 units (neurons) and uses the ReLU activation function.
       - The `input_shape` is defined as `(X_train.shape[1], X_train.shape[2])`, which corresponds to the time steps (1) and the number of features in the input data.
     - **Dense Layer**: 
       - A Dense output layer is added with `y_train.shape[1]` units, which represents the number of target columns (i.e., 4 target variables: altitude, x, y, z).
  
#### 2. **Model Compilation**:
   - The model is compiled using the Adam optimizer (`'adam'`) and Mean Squared Error (`'mse'`) as the loss function. The Adam optimizer is commonly used for training deep learning models due to its efficiency and adaptability.

#### 3. **Model Training**:
   - The model is trained using the `fit()` function:
     - Training is performed for 20 epochs with a batch size of 64.
     - The training progress is printed (`verbose=1`), and validation data (`X_test` and `y_test`) is provided to evaluate the model on unseen data during training.

#### 4. **Model Evaluation**:
   - After training, the model is evaluated using the `evaluate()` function on the test set (`X_test`, `y_test`), and the test loss is printed.

   - The expected output will display the test loss, which indicates how well the model performs on the unseen test data.
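
The code for this first model is not included in the notebook. Based only on the description above, it could look roughly like the sketch below. The function name, the explicit `Input` layer, and the argument names are assumptions, and the training calls are left commented out because they depend on the arrays prepared earlier.

```python
def build_baseline_lstm(time_steps, n_features, n_targets):
    """Single-layer LSTM as described above: 7 ReLU units, then a Dense output."""
    # Imported here so the function can be defined without loading TensorFlow.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, LSTM, Dense

    model = Sequential([
        Input(shape=(time_steps, n_features)),
        LSTM(7, activation="relu"),
        Dense(n_targets),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


# Usage with the arrays prepared earlier (commented out: training on the
# full dataset takes a long time):
# model = build_baseline_lstm(X_train.shape[1], X_train.shape[2], y_train.shape[1])
# model.fit(X_train, y_train, epochs=20, batch_size=64, verbose=1,
#           validation_data=(X_test, y_test))
# print("Test Loss:", model.evaluate(X_test, y_test))
```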


# <u style="text-decoration-thickness: 4px;">Enhanced LSTM Model with Regularization and Early Stopping</u>

This code snippet enhances the previous LSTM model by adding more layers, applying dropout for regularization, and implementing early stopping during training.

#### 1. **Model Definition**:
   - **LSTM Layers**:
     - Three LSTM layers with 50 units each are added to the model. 
     - Each LSTM layer uses the **tanh** activation function instead of ReLU to capture more complex relationships in the data.
     - The first two LSTM layers return sequences (`return_sequences=True`) to pass a sequence of outputs to the next layer, while the final LSTM layer does not return sequences, producing the final output for prediction.
   - **Dropout Layers**:
     - Dropout layers with a rate of 0.2 are added after each LSTM layer. Dropout helps prevent overfitting by randomly setting a fraction of the input units to zero during training.
   - **Dense Layer**:
     - A Dense output layer is added with a number of units equal to `y_train.shape[1]` (i.e., the number of target features).

#### 2. **Model Compilation**:
   - The model is compiled with the Adam optimizer (`'adam'`) and Mean Squared Error (`'mse'`) as the loss function, which is commonly used for regression problems like time-series forecasting.

#### 3. **Early Stopping**:
   - The **EarlyStopping** callback is defined:
     - It monitors the validation loss (`'val_loss'`) to detect if the model’s performance is not improving.
     - The training process will stop if the validation loss does not improve for 3 consecutive epochs (`patience=3`).
     - The best model weights are restored after training stops (`restore_best_weights=True`), ensuring the best model is retained.

#### 4. **Model Training**:
   - The model is trained with the `fit()` function:
     - Training runs for up to 20 epochs with a batch size of 64.
     - The `validation_data` is passed to monitor the performance on the test set during training.
     - The **EarlyStopping** callback is used to prevent overfitting and stop training once the model performance plateaus.

#### 5. **Model Evaluation**:
   - After training, the model is evaluated on the test set (`X_test`, `y_test`) using the `evaluate()` function.
     

In [17]:
# Adding early stopping, changing ReLU to tanh, and adding dropout layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping


model = Sequential()

model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))  # Add dropout for regularization

model.add(LSTM(50, return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))

model.add(LSTM(50, activation='tanh'))
model.add(Dropout(0.2))

model.add(Dense(y_train.shape[1])) 

model.compile(optimizer='adam', loss='mse')

# Define EarlyStopping callback
early_stopping = EarlyStopping(
    monitor='val_loss',   # Monitors the validation loss
    patience=3,           # Number of epochs to wait for improvement before stopping
    restore_best_weights=True  # Restores the best model weights after stopping
)

# Train the model with EarlyStopping
history = model.fit(
    X_train, 
    y_train, 
    epochs=20, 
    batch_size=64, 
    verbose=1, 
    validation_data=(X_test, y_test), 
    callbacks=[early_stopping]  # Pass EarlyStopping callback
)

# Evaluate the model
loss = model.evaluate(X_test, y_test)
print("Test Loss:", loss)

##0.0090 as mse



Epoch 1/20
314475/314475 ━━━━━━━━━━━━━━━━━━━━ 698s 2ms/step - loss: 0.0012 - val_loss: 6.3799e-04
Epoch 2/20
314475/314475 ━━━━━━━━━━━━━━━━━━━━ 702s 2ms/step - loss: 8.5705e-04 - val_loss: 6.5319e-04
Epoch 3/20
314475/314475 ━━━━━━━━━━━━━━━━━━━━ 711s 2ms/step - loss: 8.5081e-04 - val_loss: 6.5730e-04
Epoch 4/20
314475/314475 ━━━━━━━━━━━━━━━━━━━━ 715s 2ms/step - loss: 8.4274e-04 - val_loss: 6.5131e-04
157236/157236 ━━━━━━━━━━━━━━━━━━━━ 122s 775us/step - loss: 6.8818e-04
Test Loss: 0.0006379964761435986
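
The log above shows training ending after epoch 4 of 20. The sketch below replays the logged validation losses through a simplified version of the patience rule to show why. It is an illustration of the idea, not the Keras implementation, and `early_stopping_trace` is a hypothetical helper.

```python
def early_stopping_trace(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based.

    Simplified patience rule: stop once `patience` consecutive epochs
    fail to improve on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Validation losses from the training log above.
val_losses = [6.3799e-04, 6.5319e-04, 6.5730e-04, 6.5131e-04]
best_epoch, stop_epoch = early_stopping_trace(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 1 4
```

Epoch 1 has the lowest validation loss and the next three epochs do not improve on it, so training stops after epoch 4. With `restore_best_weights=True` the epoch-1 weights are restored, which is consistent with the reported test loss of about 0.000638 being close to the epoch-1 `val_loss`.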


# <u style="text-decoration-thickness: 4px;">Prediction and Evaluation on Test Set</u>

This code snippet generates predictions for the test set and compares the predicted values to the actual target values.

#### 1. **Generating Predictions**:
   - The model's `predict()` function is used to generate predictions on the test set (`X_test`). These predictions are stored in `y_pred`.

#### 2. **Displaying Predicted vs Actual Values**:
   - The code prints a comparison of the first few predicted values (`y_pred`) versus the actual values (`y_test`) from the test set. 
   - A loop is used to display 5 examples (specified by `num_examples = 5`):
     - For each example, the predicted and actual values are printed.
     - A separator line (`"-" * 30`) is used for clarity between examples.

This step allows for visual inspection of the model's performance by comparing the predicted outputs with the ground truth, providing insight into how well the model generalizes to unseen data.


In [18]:
# Generate predictions on the test set
y_pred = model.predict(X_test)

# Print a few examples of predictions and actual values
num_examples = 5  # Number of examples to display
print("Predicted vs Actual Values:")

for i in range(num_examples):
    print(f"Example {i+1}:")
    print(f"Predicted: {y_pred[i]}")
    print(f"Actual: {y_test[i]}")
    print("-" * 30)


[1m157236/157236[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m132s[0m 834us/step
Predicted vs Actual Values:
Example 1:
Predicted: [0.02834139 0.74597764 0.25983354 0.02050689]
Actual: [0.02832031 0.74902057 0.25640524 0.01169354]
------------------------------
Example 2:
Predicted: [0.02835777 0.74599934 0.25999257 0.02025291]
Actual: [0.02832031 0.74902057 0.25640524 0.01169354]
------------------------------
Example 3:
Predicted: [0.02835828 0.7460049  0.25998527 0.02024028]
Actual: [0.02851563 0.74904069 0.2563979  0.01162982]
------------------------------
Example 4:
Predicted: [0.02834491 0.7459714  0.25986397 0.02048442]
Actual: [0.02832031 0.74907141 0.25638748 0.01153179]
------------------------------
Example 5:
Predicted: [0.02835853 0.74600756 0.25998178 0.0202342 ]
Actual: [0.02832031 0.74907141 0.25638748 0.01153179]
------------------------------
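
Visual inspection can be complemented with a per-target error summary. The sketch below computes the mean absolute error of each of the four target columns, using only the five rows printed above; the variable names are illustrative and the numbers say nothing about the full test set.

```python
import numpy as np

# First five predictions and targets, copied from the output above
y_pred_sample = np.array([
    [0.02834139, 0.74597764, 0.25983354, 0.02050689],
    [0.02835777, 0.74599934, 0.25999257, 0.02025291],
    [0.02835828, 0.7460049, 0.25998527, 0.02024028],
    [0.02834491, 0.7459714, 0.25986397, 0.02048442],
    [0.02835853, 0.74600756, 0.25998178, 0.0202342],
])
y_true_sample = np.array([
    [0.02832031, 0.74902057, 0.25640524, 0.01169354],
    [0.02832031, 0.74902057, 0.25640524, 0.01169354],
    [0.02851563, 0.74904069, 0.2563979, 0.01162982],
    [0.02832031, 0.74907141, 0.25638748, 0.01153179],
    [0.02832031, 0.74907141, 0.25638748, 0.01153179],
])

# Mean absolute error per target column (axis 0 averages over the rows)
mae_per_column = np.abs(y_pred_sample - y_true_sample).mean(axis=0)
for idx, mae in enumerate(mae_per_column):
    print(f"target column {idx}: MAE = {mae:.5f}")
```

In these five rows the fourth target column shows the largest gap between prediction and ground truth. The same two lines applied to the full `y_pred` and `y_test` arrays would give the corresponding figures for the whole test set.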


In [19]:
import joblib
joblib.dump(model,'model.pkl')
#model = joblib.load('model.pkl')

['model.pkl']

In [19]:
import joblib
model = joblib.load('model.pkl')

2025-02-09 14:35:58.156363: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# <u style="text-decoration-thickness: 4px;">Data Loading and Merging from Multiple CSV Files</u>

This code snippet demonstrates how to load multiple CSV files, process them, and combine them into a single DataFrame. The files represent test data for multiple years and months, and the goal is to merge them for further analysis.

#### 1. **Directory Setup**:
   - The directory path `/home/jovyan/data/data_test` is specified, where the CSV files are stored.

#### 2. **Iterating Over Files**:
   - A loop iterates over years from 2017 to 2021 and months from 1 to 12.
   - For each year-month combination, the corresponding file name is generated (e.g., `test_set_1_2017.csv` for January 2017).
   - The full file path is constructed using `os.path.join()`.

#### 3. **File Loading**:
   - Inside a `try-except` block, the code attempts to load each CSV file into a Pandas DataFrame using `pd.read_csv()`.
   - The `timestamp` column is converted into a `datetime` format to ensure consistent date handling.

#### 4. **Handling Missing Files**:
   - If a file is not found (`FileNotFoundError`), it is skipped, and the loop continues to the next file.

#### 5. **Merging Data**:
   - All successfully loaded DataFrames are appended to a list (`data_frames`).
   - After iterating through all the files, the DataFrames are concatenated into a single DataFrame using `pd.concat()`, with `ignore_index=True` to reset the index.

#### 6. **Summary**:
   - The code prints the number of CSV files that were successfully loaded and merged into the final DataFrame (`merged_data`), providing insight into the data loading process.

This process allows the user to efficiently load, process, and merge large amounts of time-series data spread across multiple files.


In [20]:
import os
import pandas as pd

data_frames = []  # List to hold DataFrames
data_dir = "/home/jovyan/data/data_test"  # Data directory

# Counter for successfully loaded files
loaded_files_count = 0

# Looping through all years and months
for year in range(2017, 2022):  # From 2017 to 2021
    for month in range(1, 13):  # Months from 1 to 12
        # Construct the full file path
        file_name = f"test_set_{month}_{year}.csv"  # The file name
        file_path = os.path.join(data_dir, file_name)
        
        try:
            # Load the data into a DataFrame
            temp_data = pd.read_csv(file_path, sep=',')
            
            # Convert the timestamp column to datetime format
            temp_data['timestamp'] = pd.to_datetime(temp_data['timestamp'])
            
            # Add the DataFrame to the list
            data_frames.append(temp_data)
            
            # Increment the counter for successfully loaded files
            loaded_files_count += 1
        except FileNotFoundError:
            # Skip the file
            pass

# Concatenate all DataFrames in the list into a single DataFrame
merged_data = pd.concat(data_frames, ignore_index=True)

# Print the count of successfully loaded files
print(f"Number of CSV files successfully loaded: {loaded_files_count}")


Number of CSV files successfully loaded: 12
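
The loading loop can also be wrapped in a small function. The sketch below is one possible variant, not the notebook's own code: `load_monthly_csvs` is a hypothetical helper, it parses the `timestamp` column while reading, and it raises a clear error when no file is found, since `pd.concat` fails on an empty list. The usage part writes one throwaway file to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def load_monthly_csvs(data_dir, years, prefix="test_set"):
    """Load every existing <prefix>_<month>_<year>.csv and concatenate the results."""
    frames = []
    for year in years:
        for month in range(1, 13):
            path = os.path.join(data_dir, f"{prefix}_{month}_{year}.csv")
            if not os.path.exists(path):
                continue  # skip months without a file
            frames.append(pd.read_csv(path, parse_dates=["timestamp"]))
    if not frames:
        # pd.concat raises ValueError on an empty list, so fail with a clearer message
        raise FileNotFoundError(f"no {prefix}_*.csv files found in {data_dir}")
    return pd.concat(frames, ignore_index=True), len(frames)


# Usage with a temporary directory holding a single toy file
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"timestamp": ["2017-08-01 07:18:29"], "altitude": [1275.0]})
    toy.to_csv(os.path.join(tmp, "test_set_8_2017.csv"), index=False)
    merged, n_loaded = load_monthly_csvs(tmp, range(2017, 2022))
    print(f"Number of CSV files successfully loaded: {n_loaded}")
```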


In [21]:
print(merged_data.head())
print(merged_data.tail())

   Unnamed: 0.1  Unnamed: 0 registration  icao24 operator callsign  alert  \
0           108       79969       D-HXBA  3E0F76     ADAC    CHX32  False   
1           109       79970       D-HXBA  3E0F76     ADAC    CHX32  False   
2           110       79971       D-HXBA  3E0F76     ADAC    CHX32  False   
3           111       79972       D-HXBA  3E0F76     ADAC    CHX32  False   
4           112       79973       D-HXBA  3E0F76     ADAC    CHX32  False   

   altitude  count  geoaltitude  ...  squawk           timestamp      track  \
0    1275.0      2       1550.0  ...     NaN 2017-08-01 07:18:29  315.00000   
1    1325.0      2       1600.0  ...     NaN 2017-08-01 07:18:30  293.19858   
2    1325.0      2       1625.0  ...     NaN 2017-08-01 07:18:31  294.44397   
3    1300.0      2       1625.0  ...     NaN 2017-08-01 07:18:32  294.44397   
4    1300.0      2       1600.0  ...     NaN 2017-08-01 07:18:33  291.25052   

   vertical_rate   flight_id   cumdist  compute_gs  compute_tr

# <u style="text-decoration-thickness: 4px;">Timestamp Processing and Flight ID Modification</u>

This code snippet processes the timestamp column and modifies the `flight_id` column in the merged dataset.

#### 1. **Convert Timestamp to Datetime**:
   - The code ensures that the `timestamp` column in the `merged_data` DataFrame is in `datetime` format using `pd.to_datetime()`. If it’s already in the correct format, this step has no effect.

#### 2. **Extract Year and Month**:
   - The `year_month` variable is created by extracting the year and month from the `timestamp` column. The `dt.strftime('%Y_%m')` method formats the year and month into the format `YYYY_MM` (e.g., `2021_01` for January 2021).

#### 3. **Modify the `flight_id` Column**:
   - The `flight_id` column is modified by appending the extracted `year_month` to each existing `flight_id`. This creates a new unique identifier for each flight that combines the `flight_id` with the corresponding year and month of the data (e.g., `flight123_2021_01` for January 2021).

#### 4. **Display the Updated DataFrame**:
   - The updated `merged_data` DataFrame is displayed using `print(merged_data.head())`, which shows the first few rows of the DataFrame with the updated `flight_id` and timestamp columns.




In [22]:
import pandas as pd

# Convert timestamp to datetime if not already done
merged_data['timestamp'] = pd.to_datetime(merged_data['timestamp'])

# Extract year and month in the format YYYY_MM
year_month = merged_data['timestamp'].dt.strftime('%Y_%m')

# Modify the flight_id column
merged_data['flight_id'] = merged_data['flight_id'] + '_' + year_month

# Display the updated DataFrame
print(merged_data.head())


   Unnamed: 0.1  Unnamed: 0 registration  icao24 operator callsign  alert  \
0           108       79969       D-HXBA  3E0F76     ADAC    CHX32  False   
1           109       79970       D-HXBA  3E0F76     ADAC    CHX32  False   
2           110       79971       D-HXBA  3E0F76     ADAC    CHX32  False   
3           111       79972       D-HXBA  3E0F76     ADAC    CHX32  False   
4           112       79973       D-HXBA  3E0F76     ADAC    CHX32  False   

   altitude  count  geoaltitude  ...  squawk           timestamp      track  \
0    1275.0      2       1550.0  ...     NaN 2017-08-01 07:18:29  315.00000   
1    1325.0      2       1600.0  ...     NaN 2017-08-01 07:18:30  293.19858   
2    1325.0      2       1625.0  ...     NaN 2017-08-01 07:18:31  294.44397   
3    1300.0      2       1625.0  ...     NaN 2017-08-01 07:18:32  294.44397   
4    1300.0      2       1600.0  ...     NaN 2017-08-01 07:18:33  291.25052   

   vertical_rate           flight_id   cumdist  compute_gs  co

In [23]:
merged_data = merged_data.sort_values(by=['flight_id', 'timestamp'])
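
Because the suffix is taken from each row's own timestamp, a flight that crosses midnight on the last day of a month would end up with two different IDs. Whether this occurs in the data is not checked here. The sketch below illustrates the effect on a toy frame and shows one possible alternative, tagging the IDs per source file at load time. `tag_flight_ids` is a hypothetical helper, and the approach assumes each monthly file contains whole flights.

```python
import pandas as pd


def tag_flight_ids(frame, year, month):
    """Return a copy whose flight_id carries the file's year and month as a suffix."""
    tagged = frame.copy()
    tagged["flight_id"] = tagged["flight_id"] + f"_{year}_{month:02d}"
    return tagged


# Toy stand-in for rows read from a January 2021 file
january = pd.DataFrame({
    "flight_id": ["ABC_001", "ABC_001"],
    "timestamp": pd.to_datetime(["2021-01-31 23:59:59", "2021-02-01 00:00:01"]),
})

# Per-row suffix, as in the cell above: the single flight is split in two
per_row = january["flight_id"] + "_" + january["timestamp"].dt.strftime("%Y_%m")

# Per-file suffix: both rows keep the same ID
per_file = tag_flight_ids(january, 2021, 1)["flight_id"]

print(per_row.tolist())
print(per_file.tolist())
```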

# <u style="text-decoration-thickness: 4px;">Column Selection in the DataFrame</u>

This code snippet selects a subset of columns from the `merged_data` DataFrame to focus on specific features.

#### 1. **Define Columns to Keep**:
   - A list, `columns_to_keep`, is created to specify which columns from the original `merged_data` DataFrame should be retained:
     - `flight_id`: The unique identifier for each flight.
     - `altitude`: The altitude of the aircraft.
     - `geoaltitude`: The geometric altitude of the aircraft.
     - `groundspeed`: The speed of the aircraft relative to the ground.
     - `latitude`: The geographical latitude of the aircraft.
     - `longitude`: The geographical longitude of the aircraft.
     - `vertical_rate`: The rate of change in altitude (climb or descent rate).
     - `compute_gs`: A computed value for ground speed.

#### 2. **Subset the DataFrame**:
   - Using the list `columns_to_keep`, the DataFrame `df` is created by selecting only the specified columns from `merged_data`.

#### 3. **Display the Result**:
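   - The cell ends with `df.head`, written without parentheses, so the notebook shows the bound method's representation, which includes the whole DataFrame rather than only the first five rows. Calling `df.head()` would display just the first rows. Either way, the output lets the user verify that the correct columns have been retained.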

This process allows for focusing on a specific set of features, which can be particularly useful when preparing the data for analysis or machine learning tasks.


In [24]:
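## change the columns to keep as the input changes
# 'compute_gs' matches the column list described above and the outputs of the following cells
columns_to_keep = ['flight_id', 'altitude', 'geoaltitude', 'groundspeed', 'latitude', 'longitude', 'vertical_rate', 'compute_gs']
df = merged_data[columns_to_keep]
df.head  # no parentheses: shows the bound method's repr (the whole frame); use df.head() for the first rows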

<bound method NDFrame.head of                   flight_id  altitude  geoaltitude  groundspeed   latitude  \
101055   3DD7B9_007_2017_09       0.0        -50.0         44.0  53.504489   
101056   3DD7B9_007_2017_09       0.0        -50.0         44.0  53.504489   
101057   3DD7B9_007_2017_09       0.0        -50.0         44.0  53.504489   
101058   3DD7B9_007_2017_09     -25.0        -50.0         44.0  53.504489   
101059   3DD7B9_007_2017_09     -25.0        -50.0         44.0  53.504489   
...                     ...       ...          ...          ...        ...   
1832681  RDF29_2110_2021_07    5050.0       5550.0         89.0  51.491547   
1832683  RDF29_2110_2021_07    5025.0       5500.0         90.0  51.491318   
1832685  RDF29_2110_2021_07    5000.0       5450.0         91.0  51.491197   
1832687  RDF29_2110_2021_07    4925.0       5375.0         91.0  51.490825   
1832689  RDF29_2110_2021_07    4900.0       5350.0         91.0  51.490677   

         longitude  vertical_rate

# <u style="text-decoration-thickness: 4px;">Missing Value Analysis</u>

This code snippet identifies and summarizes the missing values in the `df` DataFrame, and calculates the percentage of missing data for each column.

#### 1. **Identifying Missing Values**:
   - The `df.isnull().sum()` function is used to identify missing values in the DataFrame. It returns the count of missing values (`NaN`) for each column.

#### 2. **Displaying the Missing Values Count**:
   - The count of missing values is printed to provide insight into which columns have missing data and how many missing entries they contain.

#### 3. **Calculating the Percentage of Missing Values**:
   - The percentage of missing values for each column is calculated by dividing the count of missing values (`missing_summary`) by the total number of rows (`len(df)`), and then multiplying by 100.
   - This helps to understand the proportion of missing data in each column relative to the total dataset.

The result allows the user to assess the quality of the data and determine whether any preprocessing or imputation is needed before further analysis or modeling.


In [25]:
# Summarize missing data: number of NaN entries per column
missing_summary = df.isnull().sum()
print(missing_summary)
# Percentage of missing values
print((missing_summary / len(df)) * 100)

flight_id            0
altitude         13916
geoaltitude      26632
groundspeed      12173
latitude             0
longitude            0
vertical_rate    12306
compute_gs        1383
dtype: int64
flight_id        0.000000
altitude         0.639227
geoaltitude      1.223331
groundspeed      0.559162
latitude         0.000000
longitude        0.000000
vertical_rate    0.565272
compute_gs       0.063528
dtype: float64


# <u style="text-decoration-thickness: 4px;">Missing Value Imputation with Moving Average</u>

This code snippet demonstrates how to handle missing values in a DataFrame by applying a moving average for imputation on numeric columns and using forward and backward filling to handle remaining NaNs.

#### 1. **Define Constants**:
   - The `window_size` variable is defined as `10`, specifying the number of periods over which the moving average is calculated.

#### 2. **Track Missing Values Before and After Imputation**:
   - `total_missing_before` stores the initial count of missing values for each column as a pandas Series (the result of `df.isnull().sum()`).
   - The `total_missing_after` dictionary is initialized with zero values for all columns and is updated after imputation.

#### 3. **Handling Numeric Columns**:
   - The `numeric_cols` variable stores the list of columns that contain numeric data (using `df.select_dtypes(include=["number"])`).
   - For each numeric column, the following imputation steps are performed:
     - **Moving Average**: Missing values are filled with the rolling mean of the column, using a window size of `10`. The `min_periods=1` ensures that the average is calculated even if there are fewer than 10 available data points.
     - **Forward Fill**: Remaining missing values are forward filled, using the previous non-null value.
     - **Backward Fill**: Any remaining NaNs are backward filled, using the next available non-null value.

#### 4. **Handling Non-Numeric Columns**:
   - For non-numeric columns (if any), the missing values count is updated, but no imputation is applied.

#### 5. **Display Results**:
   - After imputation, the total number of missing values for each column is printed, showing how many NaNs remain in each column after the imputation process.

#### 6. **Optional Saving**:
   - The code includes a commented-out line (`df.to_csv("imputed_data.csv", index=False)`) to save the imputed DataFrame to a CSV file if needed.

This process ensures that missing values in numeric columns are filled based on a smoothed representation of the data, allowing for a more complete dataset suitable for analysis or modeling.


In [26]:
import pandas as pd

# Define constants
window_size = 10  # Window size for moving average

# Initialize total NaN counts dictionaries
total_missing_before = df.isnull().sum()
total_missing_after = {col: 0 for col in df.columns}  # Initialize for all columns

# Select only numeric columns for rolling mean
numeric_cols = df.select_dtypes(include=["number"]).columns

# Process each column
for col in numeric_cols:
    # Apply moving average to impute missing values
    df[col] = df[col].fillna(
        df[col].rolling(window=window_size, min_periods=1).mean()
    )
    
    # Forward fill for remaining NaNs
    # (.ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas)
    df[col] = df[col].ffill()
    
    # Backward fill for remaining NaNs
    df[col] = df[col].bfill()
    
    # Update total NaN counts after imputation
    total_missing_after[col] = df[col].isnull().sum()

# Handle non-numeric columns: update total_missing_after for consistency
for col in df.columns:
    if col not in numeric_cols:
        total_missing_after[col] = df[col].isnull().sum()

# Print total NaN counts per column after imputation
print("\nTotal NaN counts per column:")
for col in df.columns:
    print(f"Feature '{col}':")
    print(f"  NaN after imputation: {total_missing_after[col]}")

# Save the imputed DataFrame if needed
# df.to_csv("imputed_data.csv", index=False)

print("\nImputation completed.")


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = df[col].fillna(
  df[col] = df[col].fillna(method="ffill")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = df[col].fillna(method="ffill")
  df[col] = df[col].fillna(method="bfill")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = df[col].fillna(method="bfill")



Total NaN counts per column:
Feature 'flight_id':
  NaN after imputation: 0
Feature 'altitude':
  NaN after imputation: 0
Feature 'geoaltitude':
  NaN after imputation: 0
Feature 'groundspeed':
  NaN after imputation: 0
Feature 'latitude':
  NaN after imputation: 0
Feature 'longitude':
  NaN after imputation: 0
Feature 'vertical_rate':
  NaN after imputation: 0
Feature 'compute_gs':
  NaN after imputation: 0

Imputation completed.
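
Two points about the cell above can be illustrated with a short sketch. The warnings appear because `df` was created as a column selection of `merged_data`, and the rolling mean runs over the whole frame, so a gap at the start of one flight can be filled with values from the previous flight. The variant below takes an explicit copy and imputes within each `flight_id` group. `impute_per_flight` is a hypothetical helper shown on a toy frame, not the method used for the results in this notebook.

```python
import numpy as np
import pandas as pd


def impute_per_flight(frame, numeric_cols, window_size=10):
    """Fill NaNs with a trailing rolling mean, then ffill/bfill, separately per flight."""
    out = frame.copy()  # explicit copy, so no SettingWithCopyWarning is raised
    for col in numeric_cols:
        by_flight = out.groupby("flight_id", sort=False)[col]
        rolling_mean = by_flight.transform(
            lambda s: s.rolling(window=window_size, min_periods=1).mean()
        )
        filled = out[col].fillna(rolling_mean)
        # Remaining gaps (e.g. at the start of a flight) are filled within the same flight
        out[col] = filled.groupby(out["flight_id"], sort=False).transform(
            lambda s: s.ffill().bfill()
        )
    return out


toy = pd.DataFrame({
    "flight_id": ["A", "A", "A", "B", "B"],
    "altitude": [100.0, np.nan, 300.0, np.nan, 50.0],
})
result = impute_per_flight(toy, ["altitude"])
print(result)
```

In the toy frame, the gap at the start of flight `B` is filled from flight `B` itself. A rolling mean over the whole column would instead fill it with the average of the preceding values from flight `A`.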


# <u style="text-decoration-thickness: 4px;">Missing Data</u>

This code calculates and displays the total number of missing (null) values in each column of the dataframe `df`. 

1. **`df.isnull()`**: This function checks for missing values in the dataframe `df` and returns a boolean dataframe where `True` indicates a missing value and `False` indicates a non-missing value.
2. **`.sum()`**: This function is applied to the boolean dataframe to count the number of `True` values (missing values) in each column.
3. **`missing_summary`**: The result is stored in the variable `missing_summary`, which contains the total count of missing values for each column.
4. **`print(missing_summary)`**: This line outputs the `missing_summary` to the console, allowing users to easily view the count of missing values in each column of the dataframe.

This helps to quickly identify columns with missing data, which can be crucial for data cleaning and preprocessing.


In [27]:
# Summarize missing data
missing_summary = df.isnull().sum()
print(missing_summary)

flight_id        0
altitude         0
geoaltitude      0
groundspeed      0
latitude         0
longitude        0
vertical_rate    0
compute_gs       0
dtype: int64


# <u style="text-decoration-thickness: 4px;">Counting Unique Values in the 'flight_id' Column</u>

This code calculates the number of unique flight IDs in the `'flight_id'` column of the dataframe `df`.

1. **`df['flight_id']`**: This accesses the `'flight_id'` column in the dataframe `df`.
2. **`.nunique()`**: This function returns the number of unique values in the specified column. It counts the distinct entries in the `'flight_id'` column.
3. **`num_unique_flight_ids`**: The result is stored in the variable `num_unique_flight_ids`, which contains the count of unique flight IDs.
4. **`print(f"Number of unique flight IDs: {num_unique_flight_ids}")`**: This line outputs the number of unique flight IDs to the console, allowing the user to quickly see how many different flight IDs are present in the dataset.

The count shows how many distinct flights the dataset contains after the flight IDs were extended with the year and month.


In [28]:
# Count the number of unique values in the 'flight_id' column
num_unique_flight_ids = df['flight_id'].nunique()

print(f"Number of unique flight IDs: {num_unique_flight_ids}")


Number of unique flight IDs: 5324


# <u style="text-decoration-thickness: 4px;">Counting and Extracting Unique Flight IDs</u>

This code counts the number of unique flight IDs and extracts the distinct flight IDs from the `'flight_id'` column in the dataframe `df`.

1. **Counting Unique Flight IDs**:
   - **`df['flight_id'].nunique()`**: This function calculates the number of unique flight IDs in the `'flight_id'` column of the dataframe `df`.
   - **`num_unique_flight_ids`**: The count of unique flight IDs is stored in the variable `num_unique_flight_ids`.
   - **`print(f"Number of unique flight IDs: {num_unique_flight_ids}")`**: The total number of unique flight IDs is printed to the console.

2. **Extracting Unique Flight IDs**:
   - **`df['flight_id'].unique()`**: This function extracts the unique flight IDs from the `'flight_id'` column as an array and stores them in the variable `all_unique_ids`.
   - A commented-out section (`#print("\nList of unique flight IDs:")`) shows an option to print the list of unique flight IDs for inspection.

3. **Viewing the DataFrame**:
   - **`print("Updated DataFrame:")`**: Prints a label ahead of the DataFrame preview.
   - **`print(df.tail())`**: Prints the last 5 rows of `df`, which serves as a quick check of the data after imputation.

This code reports how many distinct flight IDs the dataset contains and keeps them in `all_unique_ids` for later use. It also previews the last rows of the dataframe.


In [29]:
# Count the number of unique values in the 'flight_id' column
num_unique_flight_ids = df['flight_id'].nunique()
print(f"Number of unique flight IDs: {num_unique_flight_ids}")

# Extract unique flight IDs
all_unique_ids = df['flight_id'].unique()

# Uncomment the next two lines to print the list of unique flight IDs:
#print("\nList of unique flight IDs:")
#print(all_unique_ids)


print("Updated DataFrame:")
print(df.tail())


Number of unique flight IDs: 5324
Updated DataFrame:
                  flight_id  altitude  geoaltitude  groundspeed   latitude  \
1832681  RDF29_2110_2021_07    5050.0       5550.0         89.0  51.491547   
1832683  RDF29_2110_2021_07    5025.0       5500.0         90.0  51.491318   
1832685  RDF29_2110_2021_07    5000.0       5450.0         91.0  51.491197   
1832687  RDF29_2110_2021_07    4925.0       5375.0         91.0  51.490825   
1832689  RDF29_2110_2021_07    4900.0       5350.0         91.0  51.490677   

         longitude  vertical_rate  compute_gs  
1832681   9.160371        -2240.0   78.208565  
1832683   9.159777        -2368.0   94.224470  
1832685   9.159546        -2432.0   40.602660  
1832687   9.158630        -2432.0  147.530010  
1832689   9.158366        -2496.0   47.926025  


# <u style="text-decoration-thickness: 4px;">Extracting the Last Record for Each Flight ID</u>

This code groups the dataframe by the `'flight_id'` column and extracts the last record for each unique flight ID.

1. **Grouping by Flight ID**:
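   - **`df.groupby('flight_id').last()`**: This groups the dataframe by the `'flight_id'` column and retrieves, for each column, the last non-null value in each group. Because the data was sorted by `flight_id` and `timestamp` earlier and the missing values have already been imputed, this corresponds to the final recorded row of each flight.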
   - **`.reset_index()`**: This resets the index of the resulting dataframe after the groupby operation, ensuring that the flight IDs are treated as regular columns rather than indices.

2. **Reordering Columns**:
   - **`df_last = df_last[df.columns]`**: This line ensures that the columns in the new dataframe `df_last` match the original column order in the original dataframe `df`.

3. **Displaying Data**:
   - **`print(df_last.head())`**: This prints the first 5 rows of the `df_last` dataframe, allowing you to inspect the results of the operation.
   - **`print(df_last.tail())`**: This prints the last 5 rows of the `df_last` dataframe, providing a quick way to verify the final entries.

This code is useful for extracting the most recent or final record for each flight ID in a dataset, which can be important for analyzing the latest available data for each flight.


In [30]:
#taking the last values
df_last = df.groupby('flight_id').last().reset_index()

df_last = df_last[df.columns]

print(df_last.head())
print(df_last.tail())

            flight_id  altitude  geoaltitude  groundspeed   latitude  \
0  3DD7B9_007_2017_09     425.0        400.0        138.0  53.353683   
1  3DD7B9_042_2017_08     550.0        525.0         99.0  53.411291   
2  3DD7B9_073_2017_08    1200.0       1200.0         93.0  53.408295   
3  3DD7B9_075_2017_08     950.0        950.0        126.0  53.315094   
4  3DD7B9_080_2017_08     525.0        475.0         96.0  53.393188   

   longitude  vertical_rate  compute_gs  
0   9.804984            0.0         0.0  
1   9.892381            0.0         0.0  
2   9.891384          448.0         0.0  
3   9.771005         -128.0         0.0  
4   9.839669            0.0         0.0  
               flight_id  altitude  geoaltitude  groundspeed   latitude  \
5319  DHXCB_1762_2021_07     775.0       1150.0        125.0  50.750931   
5320  RDF17_1583_2021_07    2775.0       2975.0        115.0  51.370488   
5321  RDF17_2108_2021_07    2575.0       2900.0         87.0  51.685043   
5322  RDF29_210
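
The difference between `groupby(...).last()` and taking the last row of each group only matters when missing values are present. The toy example below, with made-up values, sketches that difference; in this notebook the imputation step runs first, so both approaches should give the same rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "flight_id": ["A", "A", "B", "B"],
    "altitude": [100.0, np.nan, 200.0, 250.0],
    "groundspeed": [90.0, 95.0, 80.0, 85.0],
})

# last(): last non-null value per column within each group
last_non_null = toy.groupby("flight_id").last().reset_index()

# tail(1): the literal last row of each group, NaNs included
last_row = toy.groupby("flight_id").tail(1).reset_index(drop=True)

print(last_non_null)
print(last_row)
```

For flight `A`, `last()` reports the altitude from the first row because the second one is missing, while `tail(1)` keeps the `NaN`.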

# <u style="text-decoration-thickness: 4px;">Saving Flight IDs and Dropping the 'flight_id' Column</u>

This code saves the `'flight_id'` values and removes the `'flight_id'` column from `df_last`, producing a new dataframe, `df_last_scaled_df`, that holds only the numeric columns. Despite its name, the dataframe is not scaled in this step.

1. **Saving Flight IDs**:
   - **`saved_flight_ids = df_last["flight_id"].values`**: This line saves all the flight ID values from the `'flight_id'` column of `df_last` into the `saved_flight_ids` variable. The `.values` attribute returns a NumPy array of flight IDs, maintaining their order.

2. **Dropping the 'flight_id' Column**:
   - **`df_last_scaled_df = df_last.drop(columns=['flight_id'])`**: This removes the `'flight_id'` column from the dataframe `df_last`, creating a new dataframe `df_last_scaled_df`. The resulting dataframe no longer includes the flight ID values, which is typically done before applying transformations like scaling or normalization.

3. **Displaying the New DataFrame**:
   - **`print(df_last_scaled_df.head())`**: This prints the first 5 rows of the new dataframe `df_last_scaled_df`, allowing inspection of the dataset without the `'flight_id'` column.

This code is typically used when preparing data for modeling or further analysis, where flight IDs are preserved separately, but the main dataframe is simplified by removing non-numeric or identifier columns.


In [31]:

#flight_ids = df_last['flight_id'].values
saved_flight_ids = df_last["flight_id"].values  # Save all flight_id values in order
df_last_scaled_df = df_last.drop(columns=['flight_id'])
print(df_last_scaled_df.head())


   altitude  geoaltitude  groundspeed   latitude  longitude  vertical_rate  \
0     425.0        400.0        138.0  53.353683   9.804984            0.0   
1     550.0        525.0         99.0  53.411291   9.892381            0.0   
2    1200.0       1200.0         93.0  53.408295   9.891384          448.0   
3     950.0        950.0        126.0  53.315094   9.771005         -128.0   
4     525.0        475.0         96.0  53.393188   9.839669            0.0   

   compute_gs  
0         0.0  
1         0.0  
2         0.0  
3         0.0  
4         0.0  


# <u style="text-decoration-thickness: 4px;">Converting Geographic Coordinates to 3D Cartesian Coordinates</u>

This code transforms geographic coordinates (latitude and longitude) from degrees to 3D Cartesian coordinates (x, y, z) using spherical coordinate conversion.

1. **Converting Latitude and Longitude to Radians**:
   - **`df_last_scaled_df['lat_rad'] = np.radians(df_last_scaled_df['latitude'])`**: This line converts the `'latitude'` column from degrees to radians using NumPy's `radians` function.
   - **`df_last_scaled_df['lon_rad'] = np.radians(df_last_scaled_df['longitude'])`**: Similarly, this line converts the `'longitude'` column from degrees to radians.

2. **Computing the Cartesian Coordinates**:
   - **`df_last_scaled_df['x'] = np.cos(df_last_scaled_df['lat_rad']) * np.cos(df_last_scaled_df['lon_rad'])`**: This calculates the `x` coordinate using the formula for spherical to Cartesian transformation.
   - **`df_last_scaled_df['y'] = np.cos(df_last_scaled_df['lat_rad']) * np.sin(df_last_scaled_df['lon_rad'])`**: This calculates the `y` coordinate using the sine of the longitude and cosine of the latitude.
   - **`df_last_scaled_df['z'] = np.sin(df_last_scaled_df['lat_rad'])`**: This calculates the `z` coordinate, which is the sine of the latitude in radians.

3. **Dropping Intermediate Columns**:
   - **`df_last_scaled_df.drop(columns=['lat_rad', 'lon_rad','latitude', 'longitude'], inplace=True)`**: After the conversion, the intermediate columns for radians and original latitude/longitude are dropped, as they are no longer necessary.

4. **Displaying the Transformed DataFrame**:
   - **`print(df_last_scaled_df.head())`**: This prints the first 5 rows of the transformed dataframe, which now contains the `x`, `y`, and `z` coordinates instead of the original latitude and longitude.

This code is commonly used for geospatial data transformations, especially when working with geographic coordinates for 3D visualizations, modeling, or distance computations on a spherical Earth model.


In [32]:
import numpy as np
import pandas as pd

# df_last_scaled_df holds 'latitude' and 'longitude' columns in degrees
df_last_scaled_df['lat_rad'] = np.radians(df_last_scaled_df['latitude'])  # Convert latitude to radians
df_last_scaled_df['lon_rad'] = np.radians(df_last_scaled_df['longitude'])  # Convert longitude to radians

# Compute x, y, z coordinates
df_last_scaled_df['x'] = np.cos(df_last_scaled_df['lat_rad']) * np.cos(df_last_scaled_df['lon_rad'])
df_last_scaled_df['y'] = np.cos(df_last_scaled_df['lat_rad']) * np.sin(df_last_scaled_df['lon_rad'])
df_last_scaled_df['z'] = np.sin(df_last_scaled_df['lat_rad'])

# Drop the intermediate radian columns and the original latitude/longitude columns
df_last_scaled_df.drop(columns=['lat_rad', 'lon_rad','latitude', 'longitude'], inplace=True)

# Display the transformed DataFrame
print(df_last_scaled_df.head())


   altitude  geoaltitude  groundspeed  vertical_rate  compute_gs         x  \
0     425.0        400.0        138.0            0.0         0.0  0.588155   
1     550.0        525.0         99.0            0.0         0.0  0.587204   
2    1200.0       1200.0         93.0          448.0         0.0  0.587248   
3     950.0        950.0        126.0         -128.0         0.0  0.588748   
4     525.0        475.0         96.0            0.0         0.0  0.587548   

          y         z  
0  0.101645  0.802335  
1  0.102403  0.802935  
2  0.102400  0.802904  
3  0.101388  0.801933  
4  0.101906  0.802747  


# <u style="text-decoration-thickness: 4px;">Preparing Feature and Target Variables</u>

This code extracts the feature matrix `X` and the target variable(s) `y` from the transformed dataframe `df_last_scaled_df`, and prints their shapes to verify the structure.

1. **Extracting Features (X)**:
   - **`X = df_last_scaled_df.values`**: Extracts all eight columns of `df_last_scaled_df` into a NumPy array and assigns it to `X`. Every column is treated as an input feature, including the four columns that are also used as targets below.

2. **Extracting Target Variables (y)**:
   - **`y = df_last_scaled_df.iloc[:, [0, 5, 6, 7]].values`**: Selects the columns at positions 0, 5, 6, and 7 with `.iloc[]`. In the DataFrame printed above these are `altitude`, `x`, `y`, and `z`. Their values are stored in `y` and serve as the targets of the supervised learning task.

3. **Printing the Shapes**:
   - **`print("X shape:", X.shape)`**: This prints the shape of the feature matrix `X`, showing the number of rows (samples) and columns (features).
   - **`print("y shape:", y.shape)`**: This prints the shape of the target variable(s) `y`, showing the number of rows (samples) and the number of columns (target variables).

This code is used to prepare the data for machine learning or statistical analysis by separating the features (input variables) and target variables (output labels), while also verifying their structure through the printed shapes.


In [33]:
X = df_last_scaled_df.values  
y = df_last_scaled_df.iloc[:, [0, 5, 6,7]].values 

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (5324, 8)
y shape: (5324, 4)
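
Selecting the targets by position only works while the column order stays fixed. As a minimal sketch, assuming the column names shown in the printed DataFrame above, the same split can be written by name. The small `demo_df` below retypes two printed rows for illustration and is not the full dataset.

```python
import numpy as np
import pandas as pd

# Two rows retyped from the printed DataFrame (illustration only)
demo_df = pd.DataFrame({
    "altitude": [425.0, 550.0],
    "geoaltitude": [400.0, 525.0],
    "groundspeed": [138.0, 99.0],
    "vertical_rate": [0.0, 0.0],
    "compute_gs": [0.0, 0.0],
    "x": [0.588155, 0.587204],
    "y": [0.101645, 0.102403],
    "z": [0.802335, 0.802935],
})

# Same columns as positions [0, 5, 6, 7], but robust to reordering
target_columns = ["altitude", "x", "y", "z"]

X_demo = demo_df.to_numpy()                  # all eight columns as features
y_demo = demo_df[target_columns].to_numpy()  # targets selected by name

print("X shape:", X_demo.shape)
print("y shape:", y_demo.shape)
```

If a column is missing or renamed, the name-based selection raises a `KeyError` instead of silently picking a different column.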


# <u style="text-decoration-thickness: 4px;">Loading Pre-trained Scalers for Data Transformation</u>

This code loads pre-trained scaling models from disk to transform the features and target variables during the machine learning process.

1. **Loading Scalers for Features and Targets**:
   - **`scaler_X_altitude = joblib.load('scaler_X_altitude.pkl')`**: This line loads the pre-trained scaler for the `X` features related to altitude from the file `'scaler_X_altitude.pkl'`. The scaler is expected to be used for transforming the feature data related to altitude, such as standardizing or normalizing values.
   - **`scaler_X_other = joblib.load('scaler_X_other.pkl')`**: Similarly, this loads the pre-trained scaler for other features (not related to altitude) from the file `'scaler_X_other.pkl'`.
   - **`scaler_y_altitude = joblib.load('scaler_y_altitude.pkl')`**: This line loads the pre-trained scaler for the target variable related to altitude from `'scaler_y_altitude.pkl'`, which will be used for transforming the target values corresponding to altitude.
   - **`scaler_y_other = joblib.load('scaler_y_other.pkl')`**: This loads the pre-trained scaler for other target variables from `'scaler_y_other.pkl'`.

2. **Purpose of Scalers**:
   - The loaded scalers standardize or normalize the data before it is passed to the model. Because they were fitted during training, they apply the same stored parameters (for example, the per-column minimum and maximum, or the mean and standard deviation) to new data, so features and targets are transformed exactly as they were for the training set.

This approach allows for reusability of scaling operations in subsequent steps, ensuring that the model can be applied to new data in the same manner as it was originally trained.


In [34]:
import joblib
scaler_X_altitude = joblib.load('scaler_X_altitude.pkl')
scaler_X_other = joblib.load('scaler_X_other.pkl')
scaler_y_altitude = joblib.load('scaler_y_altitude.pkl')
scaler_y_other = joblib.load('scaler_y_other.pkl')
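
The following sketch shows the save-and-reload cycle that such `.pkl` files imply. It assumes a scikit-learn `MinMaxScaler` with a `[0, 1]` range, which is suggested by the comments in the scaling cell but not confirmed here. The training values and the file name `scaler_demo.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training altitudes (illustration only)
train_altitude = np.array([[0.0], [500.0], [2000.0]])

# Assumption: a MinMaxScaler that maps the training range to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_altitude)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler_demo.pkl")  # hypothetical file name
    joblib.dump(scaler, path)     # done once, after fitting on the training data
    restored = joblib.load(path)  # done later, before transforming new data

new_altitude = np.array([[1000.0]])
scaled = restored.transform(new_altitude)       # uses the stored minimum and maximum
recovered = restored.inverse_transform(scaled)  # back to the original units

print(scaled, recovered)
```

Calling `transform` on the reloaded object, rather than fitting a new scaler, is what keeps new data on the same scale as the training data.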


# <u style="text-decoration-thickness: 4px;">Separating, Scaling, and Preparing Data for LSTM Input</u>

This code preprocesses the feature and target variables by separating, scaling, and reshaping the data for input into a Long Short-Term Memory (LSTM) model.

1. **Identifying the Altitude Column**:
   - **`altitude_index = 0`**: The index of the altitude column is defined. This index is used to separate altitude-related data from the other features and targets. Modify the index based on the dataset's structure if needed.

2. **Separating the Altitude Data**:
   - **`altitude_X = X[:, altitude_index].reshape(-1, 1)`**: The altitude feature (from `X`) is extracted and reshaped to a column vector for later scaling.
   - **`X_other = X[:, [i for i in range(X.shape[1]) if i != altitude_index]]`**: All other features (excluding altitude) are stored in `X_other`.

3. **Separating the Altitude Target**:
   - **`altitude_y = y[:, altitude_index].reshape(-1, 1)`**: The target variable related to altitude is extracted and reshaped to a column vector.
   - **`y_other = y[:, [i for i in range(y.shape[1]) if i != altitude_index]]`**: The remaining target variables are stored in `y_other`.

4. **Scaling the Features and Targets**:
   - **`altitude_X_scaled = scaler_X_altitude.transform(altitude_X)`**: The altitude feature in `X` is scaled to the range [0, 1] using the pre-trained scaler `scaler_X_altitude`.
   - **`X_other_scaled = scaler_X_other.transform(X_other)`**: Other features in `X` are scaled to the range [-1, 1] using the pre-trained scaler `scaler_X_other`.
   - **`altitude_y_scaled = scaler_y_altitude.transform(altitude_y)`**: The altitude target in `y` is scaled to the range [0, 1] using the pre-trained scaler `scaler_y_altitude`.
   - **`y_other_scaled = scaler_y_other.transform(y_other)`**: The other targets in `y` are scaled to the range [-1, 1] using the pre-trained scaler `scaler_y_other`.

5. **Recombining Scaled Data**:
   - **`X = np.hstack([altitude_X_scaled, X_other_scaled])`**: The scaled altitude feature is recombined with the other scaled features into a new feature matrix `X`.
   - **`y = np.hstack([altitude_y_scaled, y_other_scaled])`**: The scaled altitude target is recombined with the other scaled targets into a new target array `y`.

6. **Expanding Dimensions for LSTM Input**:
   - **`X = np.expand_dims(X, axis=1)`**: LSTM models expect input data to be of shape `(samples, time_steps, features)`. This reshapes `X` to include a single time step, making the shape `(samples, 1, features)`.

7. **Ensuring y is a NumPy Array**:
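   - **`y = y.values if isinstance(y, pd.Series) else y`**: Converts `y` to a NumPy array if it is a `pandas.Series`. Because `np.hstack` already returned a NumPy array in step 5, this line leaves `y` unchanged here and only acts as a safeguard.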

8. **Printing the Shapes**:
   - **`print("X shape after expansion:", X.shape)`**: This prints the shape of `X` after the reshaping operation.
   - **`print("y shape:", y.shape)`**: This prints the shape of `y`, confirming the correct dimensionality for model input.

This code ensures that the features and targets are appropriately separated, scaled, and reshaped for input into an LSTM model, which is commonly used for sequence prediction tasks.


In [35]:
# Identify the altitude column index
altitude_index = 0  # Change this based on your dataset

# Separate the altitude column from X (features)
altitude_X = X[:, altitude_index].reshape(-1, 1)
X_other = X[:, [i for i in range(X.shape[1]) if i != altitude_index]]

# Separate the altitude column from y (target)
altitude_y = y[:, altitude_index].reshape(-1, 1)
y_other = y[:, [i for i in range(y.shape[1]) if i != altitude_index]]

# **Use the SAME scalers from training**
# Scale the altitude feature in X to [0, 1]
altitude_X_scaled = scaler_X_altitude.transform(altitude_X)

# Scale the other features in X to [-1, 1]
X_other_scaled = scaler_X_other.transform(X_other)

# Scale the altitude in y to [0, 1]
altitude_y_scaled = scaler_y_altitude.transform(altitude_y)

# Scale the other columns in y to [-1, 1]
y_other_scaled = scaler_y_other.transform(y_other)

# **Recombine scaled altitude with other features**
X = np.hstack([altitude_X_scaled, X_other_scaled])
y = np.hstack([altitude_y_scaled, y_other_scaled])

# **Expand dimensions for LSTM input (samples, time_steps, features)**
X = np.expand_dims(X, axis=1)  # Shape becomes (samples, 1, features)  

# Ensure y is a NumPy array
y = y.values if isinstance(y, pd.Series) else y  

# Print the shapes
print("X shape after expansion:", X.shape)
print("y shape:", y.shape)


X shape after expansion: (5324, 1, 8)
y shape: (5324, 4)
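
With a single time step, each sample contains one observation, so the LSTM sees no history within a sample. As a sketch of how the same `(samples, time_steps, features)` layout looks with longer sequences, the hypothetical helper `make_windows` below stacks consecutive rows into overlapping windows. It is not part of the notebook and assumes the rows are ordered in time and belong to one flight.

```python
import numpy as np

def make_windows(series: np.ndarray, time_steps: int) -> np.ndarray:
    """Stack consecutive rows of a (T, features) array into overlapping
    windows of shape (T - time_steps + 1, time_steps, features)."""
    n_windows = series.shape[0] - time_steps + 1
    if n_windows < 1:
        raise ValueError("series is shorter than time_steps")
    return np.stack([series[i:i + time_steps] for i in range(n_windows)])

# Illustration data: 6 time-ordered rows with 8 features each
series = np.arange(48, dtype=float).reshape(6, 8)

single_step = np.expand_dims(series, axis=1)  # layout used in the notebook: (6, 1, 8)
windows = make_windows(series, time_steps=3)  # three observations per sample: (4, 3, 8)

print(single_step.shape, windows.shape)
```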


# <u style="text-decoration-thickness: 4px;">Making Predictions for the Next 60 Seconds and Saving Results</u>

This code predicts the state 60 seconds ahead for each input row and stores the predicted altitude and the Cartesian coordinates `x`, `y`, and `z` together with the corresponding `flight_id`. The predictions can then be saved to a CSV file for further analysis.

1. **Initializing a List for Predictions**:
   - **`predictions = []`**: This list stores one dictionary per input row, holding the predicted altitude, `x`, `y`, and `z`.

2. **Iterating Over Each Row in `X`**:
   - **`for idx in range(len(X)):`**: This loop iterates over every row in the `X` matrix, which represents the features for each sample.
   
3. **Preparing the Input for Prediction**:
   - **`last_input = np.zeros((1, 1, 8))`**: Initializes an array `last_input` with zeros, shaped to match the model's expected input format, i.e., `(samples, time_steps, features)`. The shape here is `(1, 1, 8)` to represent 1 sample with 1 time step and 8 features.
   - **`last_input[0, 0, :] = X[idx, :]`**: Populates the `last_input` array with the feature values for the current row in `X`.

4. **Making Predictions**:
   - **`next_prediction = model.predict(last_input, verbose=0)`**: Uses the trained model to predict the values 60 seconds ahead (altitude, `x`, `y`, `z`) for the current row. The outputs are still in scaled units.

5. **Storing Predictions**:
   - The predicted values (altitude, `x`, `y`, `z`) are appended to the `predictions` list as a dictionary. The `flight_id` is added afterwards, in step 7.
   - The dictionary stores each value under a descriptive key: `"predicted_altitude"`, `"predicted_x"`, `"predicted_y"`, and `"predicted_z"`.

6. **Converting Predictions to a DataFrame**:
   - **`predictions_df = pd.DataFrame(predictions)`**: Converts the list of dictionaries into a DataFrame for easy manipulation and analysis.

7. **Restoring the `flight_id`**:
   - **`predictions_df["flight_id"] = saved_flight_ids`**: Restores the original `flight_id` column, ensuring the predictions are associated with the correct flight IDs.
   
8. **Reordering Columns**:
   - **`predictions_df = predictions_df[["flight_id", "predicted_altitude", "predicted_x", "predicted_y", "predicted_z"]]`**: Reorders the columns so that `flight_id` comes first, followed by the predicted altitude and the predicted `x`, `y`, and `z` coordinates.

9. **Saving Predictions to a CSV File**:
   - **`predictions_df.to_csv('row_wise_predictions.csv', index=False)`**: Saves the DataFrame containing the predictions to a CSV file. This line is commented out in the code, so no file is written unless it is enabled.
   
10. **Final Output**:
    - **`print("Predictions saved to 'row_wise_predictions.csv'")`**: Prints a confirmation message. Because the `to_csv` call is commented out, the message appears even though no file is written.

This code is designed to run predictions for each row of input data using a trained model, store the results in a structured format, and save the predictions to a CSV file for further analysis or reporting.


In [36]:
# Initialize a list to store predictions
predictions = []

# Iterate over each row in X
for idx in range(len(X)):
    # Build a single-sample input of shape (samples, time_steps, features) = (1, 1, 8)
    last_input = np.zeros((1, 1, 8))  # Match the input shape of the model
    last_input[0, 0, :] = X[idx, :]  # Copy all 8 features from the current row

    # Predict 60 seconds into the future for this row
    next_prediction = model.predict(last_input, verbose=0)  # One row with 4 predicted values

    # Store the four predicted values (still in scaled units)
    predictions.append({
        "predicted_altitude": next_prediction[0][0],  # Predicted altitude
        "predicted_x": next_prediction[0][1],  # Predicted x coordinate
        "predicted_y": next_prediction[0][2],  # Predicted y coordinate
        "predicted_z": next_prediction[0][3],  # Predicted z coordinate
    })

# Convert predictions to a DataFrame
predictions_df = pd.DataFrame(predictions)

# Restore the flight_id column
predictions_df["flight_id"] = saved_flight_ids  # Ensuring original order

# Reorder columns for better readability
predictions_df = predictions_df[["flight_id", "predicted_altitude", "predicted_x", "predicted_y", "predicted_z"]]

# Optional: uncomment the next line to write the predictions to a CSV file
#predictions_df.to_csv('row_wise_predictions.csv', index=False)

print("Predictions saved to 'row_wise_predictions.csv'")


Predictions saved to 'row_wise_predictions.csv'
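
Calling `predict` once per row is slow for several thousand rows. Keras models accept a whole batch in one `predict` call, so the loop can in principle be replaced by a single call on the full `(samples, 1, 8)` array. The sketch below uses a hypothetical stand-in class `StubModel` instead of the trained LSTM so that it runs on its own; the input data is random and for illustration only.

```python
import numpy as np
import pandas as pd

class StubModel:
    """Hypothetical stand-in for the trained model: maps (n, 1, 8) inputs to (n, 4) outputs."""

    def predict(self, inputs, verbose=0):
        # Returns the first four features of the single time step (illustration only)
        return inputs[:, 0, :4]

stub_model = StubModel()
X_demo = np.random.default_rng(0).random((5, 1, 8))  # random illustration data

# One call for all rows instead of one call per row
batch_predictions = stub_model.predict(X_demo, verbose=0)  # shape (5, 4)

predictions_demo = pd.DataFrame(
    batch_predictions,
    columns=["predicted_altitude", "predicted_x", "predicted_y", "predicted_z"],
)
print(predictions_demo.shape)
```

With the real `model`, only the `predict` call and the `DataFrame` construction would be needed; the row order of the result matches the row order of the input array.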


In [37]:
predictions_df.head()
predictions_df.tail()

               flight_id  predicted_altitude  predicted_x  predicted_y  predicted_z
5319  DHXCB_1762_2021_07            0.021725     0.674062     0.212727     0.296856
5320  RDF17_1583_2021_07            0.020973     0.655071     0.233320     0.339777
5321  RDF17_2108_2021_07            0.020091     0.640503     0.235079     0.384064
5322  RDF29_2109_2021_07            0.020793     0.652798     0.238901     0.342042
5323  RDF29_2110_2021_07            0.020852     0.652501     0.234566     0.346771


# <u style="text-decoration-thickness: 4px;">Inverse Scaling of Predictions and Data Preprocessing</u>

This code applies the inverse transformation to the scaled predictions, restoring the predicted altitude and the predicted `x`, `y`, and `z` coordinates to their original ranges.

1. **Identifying the Altitude Column Index**:
   - **`altitude_index = 1`**: Defines an index for the altitude column. The variable is not used in the rest of this cell, because the prediction columns are selected by name. Note that the value differs from `altitude_index = 0` in the scaling step.

2. **Extracting the Predicted Values**:
   - **`predicted_altitude = predictions_df[['predicted_altitude']].values`**: This line extracts the predicted altitude values from the `predictions_df` DataFrame and stores them as a 2D array for later rescaling.
   - **`predicted_other = predictions_df[['predicted_x', 'predicted_y', 'predicted_z']].values`**: Extracts the predicted `x`, `y`, and `z` coordinates as a 2D array with three columns.

3. **Performing the Inverse Transformation**:
   - **`rescaled_altitude = scaler_y_altitude.inverse_transform(predicted_altitude)`**: This line applies the inverse transformation using the pre-trained scaler `scaler_y_altitude` to rescale the predicted altitude values back to their original range.
   - **`rescaled_other = scaler_y_other.inverse_transform(predicted_other)`**: Similarly, the predicted `x`, `y`, and `z` values are rescaled with the pre-trained scaler `scaler_y_other` to restore them to their original ranges.

4. **Adding the Rescaled Values as New Columns**:
   - **`predictions_df['rescaled_altitude'] = rescaled_altitude`**: The rescaled altitude values are added to the DataFrame as a new column `rescaled_altitude`.
   - **`predictions_df[['rescaled_x', 'rescaled_y', 'rescaled_z']] = rescaled_other`**: The rescaled `x`, `y`, and `z` values are added to the DataFrame as the new columns `rescaled_x`, `rescaled_y`, and `rescaled_z`.

5. **Optionally Dropping the Old Scaled Prediction Columns**:
   - **`predictions_df.drop(columns=['predicted_altitude', 'predicted_x', 'predicted_y', 'predicted_z'], inplace=True)`**: The original scaled prediction columns are dropped from the DataFrame as they are no longer needed after rescaling.

6. **Verifying the Results**:
   - **`print(predictions_df.head())`**: Prints the first 5 rows of the updated DataFrame to verify that the rescaled values are correctly added.
   - **`print(predictions_df.tail())`**: Prints the last 5 rows to further verify the result.

This code converts the scaled predictions back to their original units, which makes them interpretable and suitable for further analysis or reporting. After the inverse transformation, the altitude is in its original unit, while `x`, `y`, and `z` are Cartesian coordinates on the unit sphere that still have to be converted back to latitude and longitude.


In [38]:
# Note: altitude_index is not used in this cell; the columns are selected by name below
altitude_index = 1

# Extract predicted values
predicted_altitude = predictions_df[['predicted_altitude']].values  # Extract altitude as 2D array
predicted_other = predictions_df[['predicted_x', 'predicted_y','predicted_z']].values  # Extract other features

# **Perform inverse transformation**
rescaled_altitude = scaler_y_altitude.inverse_transform(predicted_altitude)  # Rescale altitude (0,1 → original range)
rescaled_other = scaler_y_other.inverse_transform(predicted_other)  # Rescale other features (-1,1 → original range)

# **Add the rescaled values as new columns**
predictions_df['rescaled_altitude'] = rescaled_altitude  # Add rescaled altitude
predictions_df[['rescaled_x', 'rescaled_y','rescaled_z']] = rescaled_other  # Add rescaled x, y, z

# **Optionally drop the old scaled prediction columns**
predictions_df.drop(columns=['predicted_altitude', 'predicted_x', 'predicted_y','predicted_z'], inplace=True)

# **Verify the results**
print(predictions_df.head())
print(predictions_df.tail())


            flight_id  rescaled_altitude  rescaled_x  rescaled_y  rescaled_z
0  3DD7B9_007_2017_09         699.877563    0.586225    0.103837    0.803171
1  3DD7B9_042_2017_08         703.252380    0.585793    0.104077    0.803480
2  3DD7B9_073_2017_08         703.340332    0.585527    0.104184    0.803667
3  3DD7B9_075_2017_08         702.617004    0.586547    0.103751    0.802948
4  3DD7B9_080_2017_08         701.963501    0.586011    0.103962    0.803326
               flight_id  rescaled_altitude  rescaled_x  rescaled_y  \
5319  DHXCB_1762_2021_07        1580.817749    0.627347    0.081743   
5320  RDF17_1583_2021_07        1484.539551    0.619005    0.096268   
5321  RDF17_2108_2021_07        1371.697144    0.612606    0.097508   
5322  RDF29_2109_2021_07        1461.520142    0.618006    0.100204   
5323  RDF29_2110_2021_07        1469.118164    0.617876    0.097147   

      rescaled_z  
5319    0.774204  
5320    0.778970  
5321    0.783887  
5322    0.779221  
5323    0.779746

# <u style="text-decoration-thickness: 4px;">Restoring Latitude and Longitude from 3D Coordinates</u>

This code calculates the latitude and longitude from the previously rescaled 3D Cartesian coordinates (x, y, z) and updates the `predictions_df` DataFrame accordingly. The `x`, `y`, and `z` columns are then dropped if they are no longer needed.

1. **Computing Latitude from z**:
   - **`predictions_df['latitude'] = np.degrees(np.arcsin(predictions_df['rescaled_z']))`**: Calculates the latitude as the arcsine (`np.arcsin`) of the rescaled `z` values. For a point on the unit sphere, `z` equals the sine of the latitude, so the arcsine inverts the forward transformation. The result is in radians and is converted to degrees with `np.degrees`. The computation assumes that the rescaled `z` values lie in the range [-1, 1].

2. **Computing Longitude from x and y**:
   - **`predictions_df['longitude'] = np.degrees(np.arctan2(predictions_df['rescaled_y'], predictions_df['rescaled_x']))`**: Calculates the longitude with `np.arctan2(y, x)`, which returns the angle of the point `(x, y)` in radians and uses the signs of both arguments to select the correct quadrant. The result is converted to degrees with `np.degrees`. This inverts the forward transformation, in which `x` and `y` are proportional to the cosine and sine of the longitude.

3. **Dropping the Original `x`, `y`, and `z` Columns**:
   - **`predictions_df.drop(columns=['rescaled_x', 'rescaled_y', 'rescaled_z'], inplace=True)`**: The rescaled `x`, `y`, and `z` columns are dropped from the DataFrame since they are no longer needed after calculating the latitude and longitude.

4. **Displaying the Updated DataFrame**:
   - **`print(predictions_df.head())`**: This line prints the first 5 rows of the updated DataFrame to verify that the latitude and longitude have been correctly restored.

This code is useful for converting the 3D spatial coordinates back into latitude and longitude values, which are commonly used for mapping or geographic analyses. The resulting `latitude` and `longitude` values are now in degrees, ready for further use or reporting.


In [39]:
import numpy as np

# Compute latitude from z
predictions_df['latitude'] = np.degrees(np.arcsin(predictions_df['rescaled_z']))

# Compute longitude from x and y
predictions_df['longitude'] = np.degrees(np.arctan2(predictions_df['rescaled_y'], predictions_df['rescaled_x']))

# Drop x, y, z if not needed
predictions_df.drop(columns=['rescaled_x', 'rescaled_y', 'rescaled_z'], inplace=True)

# Display the DataFrame with restored lat/lon
print(predictions_df.head())


            flight_id  rescaled_altitude   latitude  longitude
0  3DD7B9_007_2017_09         699.877563  53.433998  10.044492
1  3DD7B9_042_2017_08         703.252380  53.463711  10.074492
2  3DD7B9_073_2017_08         703.340332  53.481689  10.089153
3  3DD7B9_075_2017_08         702.617004  53.412563  10.030974
4  3DD7B9_080_2017_08         701.963501  53.448883  10.059967
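
The inversion with `np.arcsin` is exact only when `(x, y, z)` lies on the unit sphere, and predicted coordinates generally do not. As a minimal sketch, assuming the unit-sphere convention of the forward transformation and a non-zero vector length, the coordinates can be divided by their norm before the inversion. The helper names `to_unit_sphere` and `to_lat_lon` and the sample coordinates are hypothetical.

```python
import numpy as np

def to_unit_sphere(lat_deg, lon_deg):
    """Forward transformation: degrees to unit-sphere x, y, z."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)

def to_lat_lon(x, y, z):
    """Inverse transformation with normalization to the unit sphere."""
    norm = np.sqrt(x**2 + y**2 + z**2)
    # Dividing by the norm keeps the arcsin argument within [-1, 1]
    lat = np.degrees(np.arcsin(np.clip(z / norm, -1.0, 1.0)))
    # arctan2 depends only on the ratio of y to x, so it needs no normalization
    lon = np.degrees(np.arctan2(y, x))
    return lat, lon

# Illustration coordinates in degrees
lat_in = np.array([53.43, 48.89])
lon_in = np.array([10.04, 11.82])

x, y, z = to_unit_sphere(lat_in, lon_in)
# Stretch the vectors to mimic predictions that drift off the unit sphere
lat_out, lon_out = to_lat_lon(1.02 * x, 1.02 * y, 1.02 * z)

print(lat_out, lon_out)
```

Without the normalization, a predicted `z` slightly above 1 would make `np.arcsin` return `NaN`.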


# <u style="text-decoration-thickness: 4px;">Merging Predictions with Original Data Based on Flight ID and Timestamp</u>

This code merges the predictions DataFrame (`predictions_df`) with the original data (`original_order_df`). The merge key combines the `flight_id` with the year and month taken from the `timestamp`, which matches the format of the flight IDs in `predictions_df`.

1. **Loading the Original Flight Data**:
   - **`original_order_df = pd.read_csv("/home/jovyan/data/predictions_format.csv")`**: Loads the CSV file that defines the expected prediction format, including the flight IDs and timestamps in their original order. In the rows printed below, its `latitude`, `longitude`, and `altitude` columns contain 0.

2. **Converting Timestamp to Datetime Format**:
   - **`original_order_df['timestamp'] = pd.to_datetime(original_order_df['timestamp'])`**: The `timestamp` column is converted to the `datetime` format to facilitate time-based operations like extracting the year and month.

3. **Extracting Year and Month**:
   - **`original_order_df['year_month'] = original_order_df['timestamp'].dt.strftime('%Y_%m')`**: This extracts the year and month in the `YYYY_MM` format from the `timestamp` column and stores it in a new column `year_month`. This allows grouping or merging based on time intervals.

4. **Creating a New `flight_id_rename` Column**:
   - **`original_order_df['flight_id_rename'] = original_order_df['flight_id'].astype(str) + '_' + original_order_df['year_month']`**: This creates a new column `flight_id_rename` by concatenating the `flight_id` with the `year_month` value, forming a unique identifier that combines both flight ID and the year-month pair.

5. **Merging Predictions with the Original Data**:
   - **`merged_df = original_order_df.merge(predictions_df, left_on='flight_id_rename', right_on='flight_id', how='left')`**: Merges `original_order_df` with `predictions_df` using the `flight_id_rename` column in `original_order_df` and the `flight_id` column in `predictions_df`. The "left" join keeps all rows from `original_order_df` and adds the matching rows from `predictions_df`. Because both DataFrames contain `flight_id`, `latitude`, and `longitude` columns, pandas appends the suffix `_x` to the columns from the left DataFrame and `_y` to those from the right.

6. **Dropping the `flight_id_rename` Column**:
   - **`merged_df.drop(columns=['flight_id_rename'], inplace=True)`**: After the merge, the temporary `flight_id_rename` column is no longer needed, so it is dropped from the DataFrame.

7. **Displaying the Updated DataFrame**:
   - **`print(merged_df.head())`**: This prints the first 5 rows of the merged DataFrame, showing the original flight data with the corresponding predictions.

This process allows you to combine the predictions with the original flight data by aligning the flight IDs and the year-month values, ensuring that the predictions match the correct flight records based on both flight ID and time.


In [40]:
import pandas as pd

# Load the original flight ID order and timestamp
original_order_df = pd.read_csv("/home/jovyan/data/predictions_format.csv")

# Convert 'timestamp' column to datetime format
original_order_df['timestamp'] = pd.to_datetime(original_order_df['timestamp'])

# Extract year and month in the format YYYY_MM
original_order_df['year_month'] = original_order_df['timestamp'].dt.strftime('%Y_%m')

# Create a new column 'flight_id_rename'
original_order_df['flight_id_rename'] = original_order_df['flight_id'].astype(str) + '_' + original_order_df['year_month']

# Merge predictions_df into original_order_df using 'flight_id_rename'
merged_df = original_order_df.merge(
    predictions_df, 
    left_on='flight_id_rename', 
    right_on='flight_id', 
    how='left'
)

# Drop the 'flight_id_rename' column as it's no longer needed
merged_df.drop(columns=['flight_id_rename'], inplace=True)

# Display the updated DataFrame
print(merged_df.head())

  flight_id_x           timestamp  latitude_x  longitude_x  altitude  \
0  3E0F76_969 2017-08-01 07:25:23           0            0         0   
1  3E0F76_970 2017-08-01 08:19:52           0            0         0   
2  3E0F76_972 2017-08-01 11:26:36           0            0         0   
3  3E0F76_973 2017-08-01 12:00:52           0            0         0   
4  3E0F76_974 2017-08-01 13:42:23           0            0         0   

  year_month         flight_id_y  rescaled_altitude  latitude_y  longitude_y  
0    2017_08  3E0F76_969_2017_08        1943.071655   48.893475    11.815664  
1    2017_08  3E0F76_970_2017_08        1909.218506   48.960136    11.863684  
2    2017_08  3E0F76_972_2017_08        1958.378540   48.872608    11.338322  
3    2017_08  3E0F76_973_2017_08        1955.404297   48.879944    11.226740  
4    2017_08  3E0F76_974_2017_08        2053.909180   48.694546    11.682269  
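
A left join keeps template rows that have no matching prediction and fills their prediction columns with `NaN`. The sketch below shows one way to detect such rows, using the `indicator` and `validate` parameters of `DataFrame.merge`. The two small DataFrames, the flight IDs, and the column name `merge_key` are hypothetical illustration data.

```python
import pandas as pd

# Hypothetical template rows (illustration only)
template_df = pd.DataFrame({
    "flight_id": ["A1_001", "A1_002"],
    "timestamp": pd.to_datetime(["2017-08-01 07:25:23", "2017-09-02 08:19:52"]),
})

# Hypothetical predictions keyed by "<flight_id>_<YYYY_MM>"
pred_df = pd.DataFrame({
    "flight_id": ["A1_001_2017_08"],
    "rescaled_altitude": [1943.0],
})

# Build the same kind of key as in the notebook: flight ID plus year and month
template_df["merge_key"] = (
    template_df["flight_id"].astype(str)
    + "_"
    + template_df["timestamp"].dt.strftime("%Y_%m")
)

merged_demo = template_df.merge(
    pred_df.rename(columns={"flight_id": "merge_key"}),  # avoids a duplicated flight_id column
    on="merge_key",
    how="left",
    validate="many_to_one",  # raises if a key occurs more than once in the predictions
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Template rows without a matching prediction
unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "flight_id"].tolist()
print(unmatched)
```

Checking `unmatched` before saving makes missing predictions visible instead of leaving silent `NaN` values in the output file.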


# <u style="text-decoration-thickness: 4px;">Saving Merged Predictions to a CSV File</u>

This code saves the merged predictions DataFrame (`merged_df`) into a CSV file for future analysis or reporting.

1. **Saving the Merged DataFrame to CSV**:
   - **`merged_df.to_csv('rescaled_predictions.csv', index=False)`**: This line saves the `merged_df` DataFrame, which contains both the original flight data and the rescaled predictions, to a CSV file named `rescaled_predictions.csv`. The `index=False` argument ensures that the row indices are not included in the saved CSV file.

2. **Printing Confirmation**:
   - **`print("Rescaled predictions saved to 'rescaled_predictions.csv'")`**: After successfully saving the file, this print statement confirms that the predictions have been saved to the specified file.

By saving the merged DataFrame, you can preserve the predictions along with the corresponding flight data, making it easier to share, analyze, or further process the results.


In [41]:
# Save predictions to a CSV file
merged_df.to_csv('rescaled_predictions.csv', index=False)

print("Rescaled predictions saved to 'rescaled_predictions.csv'")


Rescaled predictions saved to 'rescaled_predictions.csv'


# <u style="text-decoration-thickness: 4px;">Loading and Displaying the Rescaled Predictions</u>

This code loads the previously saved rescaled predictions from a CSV file and displays the first few rows of the data.

1. **Loading the CSV File**:
   - **`df = pd.read_csv('rescaled_predictions.csv')`**: This line loads the `rescaled_predictions.csv` file into a Pandas DataFrame named `df`. This file contains the merged flight data and rescaled predictions.

2. **Displaying the First Few Rows**:
   - **`df.head()`**: This function displays the first 5 rows of the DataFrame `df` to give a quick overview of the loaded data. It helps verify the content and structure of the rescaled predictions file.

By loading and displaying the data, you can ensure that the saved predictions were properly stored and can now be further analyzed or processed.


In [42]:
import pandas as pd

df = pd.read_csv('rescaled_predictions.csv')
df.head()


   flight_id_x            timestamp  latitude_x  longitude_x  altitude  year_month         flight_id_y  rescaled_altitude  latitude_y  longitude_y
0   3E0F76_969  2017-08-01 07:25:23           0            0         0     2017_08  3E0F76_969_2017_08          1943.0717   48.893475    11.815664
1   3E0F76_970  2017-08-01 08:19:52           0            0         0     2017_08  3E0F76_970_2017_08          1909.2185   48.960136    11.863684
2   3E0F76_972  2017-08-01 11:26:36           0            0         0     2017_08  3E0F76_972_2017_08          1958.3785   48.872610    11.338322
3   3E0F76_973  2017-08-01 12:00:52           0            0         0     2017_08  3E0F76_973_2017_08          1955.4043   48.879944    11.226740
4   3E0F76_974  2017-08-01 13:42:23           0            0         0     2017_08  3E0F76_974_2017_08          2053.9092   48.694546    11.682269


# <u style="text-decoration-thickness: 4px;">Cleaning the DataFrame by Dropping Unwanted Columns</u>

This code removes specific columns from the DataFrame (`df`) that are no longer needed for further analysis.

1. **Defining the Columns to Drop**:
   - **`columns_to_delete = ['latitude_x', 'longitude_x', 'altitude', 'year_month', 'flight_id_y']`**: A list of columns that should be removed from the DataFrame is defined. These columns are identified as unnecessary for the analysis or have duplicate information.

2. **Dropping the Specified Columns**:
   - **`df_cleaned = df.drop(columns=columns_to_delete)`**: This line drops the columns specified in the `columns_to_delete` list from the DataFrame `df`. The resulting DataFrame, with the unwanted columns removed, is stored in a new variable `df_cleaned`.

The cleaned DataFrame (`df_cleaned`) now contains only the relevant columns, making it ready for further processing or analysis.


In [46]:

columns_to_delete = ['latitude_x', 'longitude_x','altitude','year_month','flight_id_y']

df_cleaned = df.drop(columns=columns_to_delete)


# <u style="text-decoration-thickness: 4px;">Renaming Columns in the DataFrame</u>

This code renames specific columns in the cleaned DataFrame (`df_cleaned`) to match the desired column names.

1. **Creating a Dictionary for Column Renaming**:
   - **`columns_to_rename = {'flight_id_x': 'flight_id', 'rescaled_altitude': 'altitude', 'latitude_y': 'latitude', 'longitude_y': 'longitude'}`**: A dictionary is created, where the keys represent the old column names, and the values represent the new column names. This allows for a clear mapping from the old names to the new ones.

2. **Renaming the Columns**:
   - **`df_cleaned = df_cleaned.rename(columns=columns_to_rename)`**: `rename` returns a new DataFrame whose columns are renamed according to the dictionary `columns_to_rename`; it does not modify the original in place. Assigning the result back to `df_cleaned` replaces the previous DataFrame.

By renaming the columns, the DataFrame is now standardized and has more meaningful names, making it easier to interpret and work with the data.


In [48]:
# Dictionary mapping old column names to new ones
columns_to_rename = {
    'flight_id_x': 'flight_id',
    'rescaled_altitude': 'altitude',
    'latitude_y': 'latitude',
    'longitude_y': 'longitude'
}

# Rename columns
df_cleaned = df_cleaned.rename(columns=columns_to_rename)
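
The drop and rename steps can also be chained into one expression. The sketch below is an alternative formulation, not the notebook's code. It uses one row retyped from the merged output above and passes `errors="raise"` to `rename`, so that a misspelled source column raises a `KeyError` instead of being ignored.

```python
import pandas as pd

# One row with the merged column layout (values retyped from the output above)
demo = pd.DataFrame({
    "flight_id_x": ["3E0F76_969"],
    "timestamp": ["2017-08-01 07:25:23"],
    "latitude_x": [0],
    "longitude_x": [0],
    "altitude": [0],
    "year_month": ["2017_08"],
    "flight_id_y": ["3E0F76_969_2017_08"],
    "rescaled_altitude": [1943.0717],
    "latitude_y": [48.893475],
    "longitude_y": [11.815664],
})

final_demo = (
    demo
    .drop(columns=["latitude_x", "longitude_x", "altitude", "year_month", "flight_id_y"])
    .rename(
        columns={
            "flight_id_x": "flight_id",
            "rescaled_altitude": "altitude",
            "latitude_y": "latitude",
            "longitude_y": "longitude",
        },
        errors="raise",  # fail loudly if a column to rename does not exist
    )
)

print(list(final_demo.columns))
```

Both `drop` and `rename` return new DataFrames, so `demo` itself keeps its original columns.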


# <u style="text-decoration-thickness: 4px;">Saving the Cleaned DataFrame to a CSV File</u>

This code saves the cleaned and renamed DataFrame (`df_cleaned`) to a new CSV file for further use or sharing.

1. **Saving the Cleaned DataFrame**:
   - **`df_cleaned.to_csv('final_predictions_group32.csv', index=False)`**: This line saves the `df_cleaned` DataFrame to a CSV file named `final_predictions_group32.csv`. The `index=False` argument ensures that the row indices are not included in the saved file.

2. **Confirmation Message**:
   - **`print("Cleaned CSV saved as 'final_predictions_group32.csv'")`**: After saving the file, this print statement confirms that the cleaned DataFrame has been successfully saved with the specified filename.

By saving the cleaned DataFrame, you create a final version of the predictions that is ready for reporting, further analysis, or sharing.


In [49]:
# Save the cleaned DataFrame to a new CSV file
df_cleaned.to_csv('final_predictions_group32.csv', index=False)

print("Cleaned CSV saved as 'final_predictions_group32.csv'")


Cleaned CSV saved as 'final_predictions_group32.csv'


## 5. Presentation and Evaluation of Results

Results can be produced given the technical implementation of the presented concept and the database. These results need to be evaluated and verified. Subsequently, it has to be evaluated whether and with which quality the results can fulfill the desired objectives.

The following questions and topics might be used as guidance for the presentation of the results:

- Use meaningful figures to present the results, and also cover intermediate results.
- What findings can be derived with regard to the objectives?

The following questions and topics might be used as guidance for the evaluation of the results:

- Evaluate the algorithms with common metrics.
- How robust is the implemented methodology?
- How can the performance of the methodology be classified with regard to the achieved results?
- Can reliable statements be made on the basis of the methodology?
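
As one possible starting point for the metric-based evaluation, the sketch below computes the great-circle (haversine) distance between predicted and true positions and the root mean squared error (RMSE) of the altitude. It assumes that ground-truth positions are available, which this notebook does not show. The function names and all sample values are hypothetical, and the Earth is approximated as a sphere with a mean radius of 6371 km.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def rmse(predicted, actual):
    """Root mean squared error between two equally shaped arrays."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical true and predicted values (illustration only)
true_lat, true_lon = np.array([53.0, 48.0]), np.array([10.0, 11.0])
pred_lat, pred_lon = np.array([54.0, 48.0]), np.array([10.0, 11.0])

position_error_km = haversine_km(true_lat, true_lon, pred_lat, pred_lon)
altitude_rmse = rmse([700.0, 1500.0], [690.0, 1510.0])

print(position_error_km, altitude_rmse)
```

Reporting the position error in kilometres is often easier to interpret than an error on the scaled `x`, `y`, and `z` values.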

## 6. Applicability Analysis and Outlook<a class="anchor" id="outlook"></a>
-----------------------

