# Training a Contrail Classifier
The goal of this project is to train a machine learning model that can accurately classify images of the sky according to whether or not they contain contrails.
To build the model, we have obtained cloud data from four sources:
1. [Cirrus Cumulus Stratus Nimbus (CCSN) Database](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CADDPD)
2. [Singapore Data Swimcat](https://ieeexplore.ieee.org/abstract/document/7350833)
3. [CLASA](https://github.com/CLASA/Contrail-Machine-Vision) ([official website](https://clasa.github.io/)), a proposed solution to the NASA Clouds vs Contrails challenge.
4. [Google Cloud Project](https://arxiv.org/abs/2304.02122)

# Potential Applications
* Climate Studies: Contrails can have a significant impact on the Earth's atmosphere and climate. They can reflect sunlight back into space, contributing to global cooling, but they can also trap heat within the Earth's atmosphere, contributing to global warming. Therefore, a machine learning model trained to detect and monitor contrails could provide important data for climate researchers.

* Air Traffic Control: A model trained to identify contrails could be useful in tracking aircraft routes and densities, particularly in areas with less developed radar infrastructure.

* Aerospace and Defense: This model could be used for aerospace and defense purposes. For instance, detecting contrails could help in tracking and identifying stealth, unauthorized, or unrecognized flights, which can be important in maintaining airspace security.

# Preprocessing Script
Preprocessing is a crucial step in the machine learning pipeline because the quality and quantity of the data fed into the model directly affect how well it can learn. Here are some ways we could preprocess the image data:
* Labeling: Not all of the images are labeled, and they come from different datasets with different labeling schemes. The purpose of labeling is to homogenize the data so that every image is labeled in the same manner.

* Image Resizing: In real-world scenarios, images can come in different sizes and aspect ratios. However, many computer vision models (like Convolutional Neural Networks) require images to be of a uniform size. Therefore, images often need to be resized to fit the requirements of the model.

* Normalization: Image pixel intensities typically range from 0 to 255. Scaling them to a smaller range, often 0 to 1 or -1 to 1, keeps the inputs small (and, in the -1 to 1 case, centered), which suits common weight-initialization schemes and can help the optimizer converge faster during training. The `Rescaling` layer in TensorFlow (`tf.keras.layers.Rescaling`) can be used for this purpose.

* Data Augmentation: Image datasets can be augmented by applying random transformations such as rotation, scaling, translation, and flipping. This increases the effective amount of training data and makes the model more robust to variations it has not seen before, which can help it generalize to new data. TensorFlow provides augmentation layers such as `RandomFlip` and `RandomRotation` in `tf.keras.layers` (in older releases they lived in the `tf.keras.layers.experimental.preprocessing` module).

* Dealing with Color Channels: Some models require grayscale images, while others require color images, so you might need to convert images from color to grayscale or vice versa. Depending on your data, transforming the color space of your images (from RGB to HSV, Lab, YUV, etc.) could also improve the model's ability to detect features.

* Feature Extraction: In some cases, it might be beneficial to manually extract features from the images, such as edges, corners, and other local features, and use them as inputs to the machine learning model. For contrail detection, filters that are sensitive to the characteristic features of contrails could be used. This might require some research and experimentation.

* Dimensionality Reduction: Images are high-dimensional data, and it may be beneficial to reduce their dimensionality. This can be done through techniques like Principal Component Analysis (PCA) or autoencoders, which can make the model more efficient without losing too much information.

* Balancing Classes: If the numbers of contrail and no-contrail images are not roughly equal, the model might become biased towards the more common class. Solutions include oversampling the minority class, undersampling the majority class, or using a combination of both.

These preprocessing steps help to make the image data more suitable for computer vision models and can lead to better performance.
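
The class-balancing idea from the list above can be illustrated with a minimal random-oversampling sketch. It uses only the standard library; the helper name `oversample_minority` and the toy file names and labels are assumptions made for illustration and are not part of the project code.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of the rarer classes until every class
    has as many samples as the most common one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Indices of the existing samples that belong to this class
        pool = [i for i, l in enumerate(labels) if l == label]
        # Draw (with replacement) as many extra copies as are missing
        for _ in range(target - count):
            i = rng.choice(pool)
            out_samples.append(samples[i])
            out_labels.append(label)
    return out_samples, out_labels

# Toy example: file names stand in for images, 1 = contrail, 0 = no contrail
names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
labels = [0, 0, 0, 0, 1]
balanced_names, balanced_labels = oversample_minority(names, labels)
print(Counter(balanced_labels))
```

Oversampling is usually applied to the training split only, so that duplicated images do not leak into the test set.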

## Loading the Data and Labeling

In [4]:
import os
from PIL import Image

In [5]:
def image_loader(image_dir, dictionary, filetype):
    """Load every image under image_dir/<class folder>/ and label it via `dictionary`."""
    # List to hold all image data
    images = []
    # List to hold image classifications (1 = contrail, 0 = no contrail)
    classes = []
    # Iterate over each class folder in the directory
    for folder in sorted(os.listdir(image_dir)):
        folder_path = os.path.join(image_dir, folder)
        # Skip stray files that sit next to the class folders
        if not os.path.isdir(folder_path):
            continue
        for filename in sorted(os.listdir(folder_path)):
            # Only open files with the requested extension (e.g. ".jpg")
            if filename.endswith(filetype):
                # Open each image file and copy it into memory so the file handle is closed again
                with Image.open(os.path.join(folder_path, filename)) as img:
                    images.append(img.copy())
                # Translate the folder name into a contrail label
                classes.append(dictionary[folder])
    # 'images' now holds all the images in image_dir as PIL Image objects, with the matching labels in 'classes'
    return images, classes

In [6]:
image_dir = "../data/CCSN_v2"
# Dictionary to translate scientific names into contrail labels
ccsn_dictionary = {
    'Ct':1,
    'Ac':0, 'Sc':0, 'Ns':0, 'Cu':0, 'Ci':0, 'Cc':0, 'Cb':0, 'As':0, 'Cs':0, 'St':0
}
images_ccsn, classes_ccsn = image_loader(image_dir, ccsn_dictionary, ".jpg")

In [8]:
print(classes_ccsn)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... (output truncated)

In [None]:
image_dir = "data/CLASA"
# Dictionary to translate CLASA folder names into contrail labels
clasa_dictionary = {
    'Contrail':1,
    'Cirrus':0
}
images_clasa, classes_clasa = image_loader(image_dir, clasa_dictionary, ".jpg")

In [None]:
image_dir = "data/Singapore Data Swimcat"
# Dictionary to translate SWIMCAT labels into contrail labels (no category contains contrails)
swimcat_dictionary = {
    'A-sky':0,
    'B-pattern':0,
    'C-thick-dark':0,
    'D-thick-white':0,
    'E-veil':0
}
images_singapore, classes_singapore = image_loader(image_dir, swimcat_dictionary, ".jpg")

In [None]:
# Connect to the Google Cloud api, so that we pull 

In [None]:
# Merge all the datasets into two flat lists: one containing the images, the other the labels.
# Concatenating with + keeps the result a flat list in the same order for images and labels.
images_all = images_ccsn + images_clasa + images_singapore

classes_all = classes_ccsn + classes_clasa + classes_singapore

## Train test split

In [None]:
from sklearn.model_selection import train_test_split

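# images_all holds the features (X) and classes_all the labels (y);
# stratify keeps the contrail/no-contrail ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    images_all, classes_all, test_size=0.2, random_state=42, stratify=classes_all)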

## Normalization
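The normalization step outlined in the preprocessing overview could look roughly like the sketch below. It assumes the images have already been converted to 8-bit NumPy arrays; the helper names `normalize_01` and `normalize_pm1` and the toy array are made up for illustration.

```python
import numpy as np

def normalize_01(pixels):
    """Scale 8-bit pixel values from [0, 255] to [0, 1]."""
    return pixels.astype(np.float32) / 255.0

def normalize_pm1(pixels):
    """Scale 8-bit pixel values from [0, 255] to [-1, 1]."""
    return pixels.astype(np.float32) / 127.5 - 1.0

# Toy 2x2 RGB "image" standing in for a real sky photo
toy = np.array([[[0, 128, 255], [255, 255, 255]],
                [[0, 0, 0], [64, 64, 64]]], dtype=np.uint8)
scaled01 = normalize_01(toy)
scaled_pm1 = normalize_pm1(toy)
print(scaled01.min(), scaled01.max())
print(scaled_pm1.min(), scaled_pm1.max())
```

If the generator used later already rescales by 1/255, this step should not be applied a second time.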

## Dealing with Color Channels
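As a rough illustration of the color-channel handling described in the preprocessing overview, the sketch below converts between RGB and grayscale with NumPy. The helper names and the toy array are assumptions; the luma weights are the ITU-R BT.601 values, which Pillow's `convert("L")` also uses.

```python
import numpy as np

def rgb_to_grayscale(rgb):
    """Convert an (H, W, 3) RGB array to an (H, W) grayscale array
    using the ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return rgb.astype(np.float32) @ weights

def grayscale_to_rgb(gray):
    """Repeat a single channel three times for models that expect RGB input."""
    return np.stack([gray, gray, gray], axis=-1)

toy = np.zeros((2, 2, 3), dtype=np.uint8)
toy[0, 0] = [255, 255, 255]   # white pixel
toy[0, 1] = [255, 0, 0]       # pure red pixel
gray = rgb_to_grayscale(toy)
print(gray.shape, grayscale_to_rgb(gray).shape)
```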

## Feature Extraction
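One possible hand-crafted feature for the feature-extraction step is an edge map, since contrails tend to appear as thin, straight streaks. The sketch below applies 3x3 Sobel kernels with plain NumPy; the function name `sobel_edges` and the toy image are assumptions, and whether edge features actually help on this data would need to be tested.

```python
import numpy as np

def sobel_edges(gray):
    """Return the gradient magnitude of a 2-D grayscale array using
    3x3 Sobel kernels (border pixels are left at zero)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T
    h, w = gray.shape
    g = gray.astype(np.float32)
    gx = np.zeros((h, w), dtype=np.float32)
    gy = np.zeros((h, w), dtype=np.float32)
    # Accumulate the nine shifted, weighted copies of the image
    for dy in range(3):
        for dx in range(3):
            window = g[dy:h - 2 + dy, dx:w - 2 + dx]
            gx[1:-1, 1:-1] += kx[dy, dx] * window
            gy[1:-1, 1:-1] += ky[dy, dx] * window
    return np.hypot(gx, gy)

# Toy image: dark sky with one bright vertical streak in column 3
toy = np.zeros((7, 7), dtype=np.uint8)
toy[:, 3] = 255
edges = sobel_edges(toy)
print(edges[3])
```

The response is strongest on both sides of the streak and zero on the streak itself, which is the expected behavior of a gradient filter.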

## Dimensionality Reduction
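The PCA option mentioned in the preprocessing overview can be sketched with NumPy's singular value decomposition. The function name `pca_reduce`, the number of components, and the random toy data are assumptions for illustration; a library implementation such as scikit-learn's `PCA` would be the more usual choice in practice.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto their first n_components principal components."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by explained variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = (singular_values ** 2) / np.sum(singular_values ** 2)
    return centered @ components.T, explained[:n_components]

# Toy data: 20 random "images" of 8x8 pixels, flattened to 64 features each
rng = np.random.default_rng(0)
flat_images = rng.random((20, 8 * 8))
reduced, explained_ratio = pca_reduce(flat_images, n_components=5)
print(reduced.shape, explained_ratio.sum())
```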

## Image Resizing
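A minimal sketch of the resizing step with Pillow is shown below. The target size of 224x224 and the helper name `resize_image` are assumptions; the right size depends on the model that is eventually chosen.

```python
from PIL import Image

def resize_image(img, size=(224, 224)):
    """Convert a PIL image to RGB and resize it to a fixed (width, height).
    The aspect ratio is not preserved."""
    return img.convert("RGB").resize(size)

# Toy image standing in for a sky photo of arbitrary size
toy = Image.new("RGB", (640, 480), color=(135, 206, 235))
resized = resize_image(toy)
print(resized.size)
```

Applying this to every loaded image gives the uniform shape that is needed before the images can be stacked into a single array.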

In [1]:
from keras.preprocessing.image import ImageDataGenerator

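ModuleNotFoundError: No module named 'tensorflow'

Note: the import above fails until TensorFlow is installed in the notebook environment (for example with `pip install tensorflow`).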

In [None]:
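random_datagen = ImageDataGenerator(
    rescale = 1/255,           # scale pixel values from 0-255 to 0-1
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True,
    validation_split = 0.2     # assumed value; needed for the 'training'/'validation' subsets in flow()
    )
# fit() computes the quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied).
# It is only needed when one of those featurewise options is enabled, and it
# expects X_train to be a 4D NumPy array (samples, height, width, channels).
random_datagen.fit(X_train)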

# Balancing the Data

In [None]:
from sklearn.utils import class_weight
import numpy as np

# Calculate class weights (recent scikit-learn versions require keyword arguments here)
classes = np.unique(y_train)
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                 classes=classes,
                                                 y=np.asarray(y_train))

# Convert class_weights to a {label: weight} dictionary to include it in model.fit()
class_weights = dict(zip(classes.tolist(), class_weights))
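
To make the `'balanced'` setting less of a black box, the sketch below reproduces its formula, `n_samples / (n_classes * class_count)`, in plain Python. The helper name `balanced_class_weights` and the 90/10 toy split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

# Toy labels: 90 no-contrail images and 10 contrail images
toy_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

In this toy split the rare contrail class receives a weight of 5.0 and the common class a weight of about 0.56, so each class contributes equally to the total loss.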

# Fitting the Model

In [None]:
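# Fit the model on batches with real-time data augmentation.
# 'model' is assumed to be a compiled Keras model defined elsewhere.
# subset='training'/'validation' relies on validation_split being set on the ImageDataGenerator.
# Pass the class weights through the class_weight argument of fit();
# the number of steps per epoch is inferred from the length of the generator.
model.fit(random_datagen.flow(X_train, y_train, batch_size=32, subset='training'),
         validation_data=random_datagen.flow(X_train, y_train, batch_size=8, subset='validation'),
         epochs=32, class_weight=class_weights)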

# Testing the Model
When dealing with imbalanced classes, traditional metrics like accuracy can be misleading. For a task where avoiding false positives (i.e., the model predicting a positive class when it's actually negative) is important, you might want to consider the following metrics:

* Precision: Precision is the ratio of true positives (TP) to the sum of true positives and false positives (FP). Precision is directly concerned with minimizing false positive predictions. Precision = TP / (TP + FP)

* F1 Score: The F1 score is the harmonic mean of precision and recall. While it doesn't directly focus on false positives, it provides a balance between precision and recall. This can be useful if both false positives and false negatives are of concern.

* Area Under the Precision-Recall Curve (AUPRC): In an imbalanced classification problem, AUPRC can be a better metric than traditional ones. It calculates the area under the curve formed by plotting recall (x-axis) against precision (y-axis) at various threshold settings. The closer this area is to 1, the better the model is at distinguishing between the positive and negative classes.
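
The metrics above can be worked through on a small example. The sketch below is plain Python with made-up labels, predictions, and scores; the function names are assumptions, and `average_precision` is a step-wise approximation of AUPRC that assumes no tied scores. In practice, `precision_score`, `f1_score`, and `average_precision_score` from `sklearn.metrics` would normally be used instead.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve: the precision at each
    correctly ranked positive, averaged over all positives (no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += (tp / rank) / n_pos
    return ap

# Toy data: 3 contrail images (1) and 7 no-contrail images (0)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
ap = average_precision(y_true, scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision, recall, f1, ap)
```

Here accuracy is 0.8 even though only two of the three predicted contrails are correct (precision of about 0.67), which shows how accuracy can flatter a model on imbalanced data.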

For the cost function in the training phase of a neural network, the standard is cross-entropy loss. When dealing with imbalanced classes, one way to handle this is by applying class weights to the loss function, which assigns a higher penalty for misclassifying the minority class.
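
A class-weighted binary cross-entropy can be sketched in a few lines of plain Python. The function name, the clipping constant `eps`, the choice to average the weighted per-sample losses, and the toy weights are all assumptions made for illustration; deep learning frameworks may normalize the weighted loss differently.

```python
import math

def weighted_binary_cross_entropy(y_true, y_prob, class_weights, eps=1e-7):
    """Mean binary cross-entropy in which each sample's loss is multiplied
    by the weight of its true class."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        # Clip probabilities so that log() never receives 0
        p = min(max(p, eps), 1 - eps)
        loss = -(t * math.log(p) + (1 - t) * math.log(1 - p))
        total += class_weights[t] * loss
    return total / len(y_true)

# Toy example: one contrail (1) and one no-contrail (0) image, both predicted at 0.5
y_true = [1, 0]
y_prob = [0.5, 0.5]
unweighted = weighted_binary_cross_entropy(y_true, y_prob, {0: 1.0, 1: 1.0})
weighted = weighted_binary_cross_entropy(y_true, y_prob, {0: 1.0, 1: 5.0})
print(unweighted, weighted)
```

With a weight of 5 on the contrail class, the same uncertain prediction costs five times as much for the contrail image, so the total loss triples from ln 2 to 3 ln 2.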