# Contrails Detection Descriptive Analysis


This notebook explores a synthetic dataset of randomly generated values that simulate atmospheric conditions affecting contrail formation and visibility. A contrail, or condensation trail, is the visible streak of condensed water vapor left behind an aircraft, produced mainly by engine exhaust. Contrails are commonly observed at high altitudes, where the air is cold enough to freeze the water vapor emitted by the engines into ice crystals. The notebook applies descriptive statistics to summarize altitude, temperature, humidity, contrail persistence, and sky coverage. Because every column is drawn independently at random, any apparent correlation or trend in this data reflects sampling noise rather than a physical relationship.

In [20]:
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew  # skew is used in the Skewness section

# Seed for reproducibility
np.random.seed(42)

# Generate sample data
data = {
    "ObservationID": range(1, 101),
    "Altitude": np.random.randint(30000, 40000, 100),  # Contrails typically form at high altitudes
    "Temperature": np.random.randint(-50, -20, 100),  # Very cold temperatures at high altitudes
    "RelativeHumidity": np.random.randint(60, 100, 100),  # Higher humidity needed for contrail formation
    "ContrailPersistence": np.random.randint(1, 60, 100),  # Persistence can vary widely
    "SkyCoverage": np.random.randint(0, 100, 100)  # Percentage of sky covered by contrails
}

df = pd.DataFrame(data)

df.shape

(100, 6)

In [21]:
df.head()

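| | ObservationID | Altitude | Temperature | RelativeHumidity | ContrailPersistence | SkyCoverage |
|---|---|---|---|---|---|---|
| 0 | 1 | 37270 | -23 | 83 | 4 | 49 |
| 1 | 2 | 30860 | -44 | 74 | 33 | 24 |
| 2 | 3 | 35390 | -42 | 91 | 14 | 23 |
| 3 | 4 | 35191 | -43 | 91 | 21 | 12 |
| 4 | 5 | 35734 | -39 | 83 | 48 | 59 |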


In [22]:
df.describe()

| | ObservationID | Altitude | Temperature | RelativeHumidity | ContrailPersistence | SkyCoverage |
|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 50.5 | 35177.22 | -35.45 | 80.13 | 31.97 | 49.5 |
| std | 29.011492 | 2869.564753 | 9.712405 | 12.24304 | 15.414446 | 28.733361 |
| min | 1.0 | 30064.0 | -50.0 | 60.0 | 2.0 | 0.0 |
| 25% | 25.75 | 32674.25 | -44.0 | 70.0 | 22.0 | 21.75 |
| 50% | 50.5 | 35428.0 | -37.0 | 81.5 | 33.0 | 54.0 |
| 75% | 75.25 | 37655.25 | -26.0 | 91.0 | 46.0 | 68.0 |
| max | 100.0 | 39998.0 | -21.0 | 99.0 | 59.0 | 98.0 |


#### **Mean**

The mean provides an average value for our data, offering insights into the central tendency of each variable.

The mean (average) of a data set is calculated by summing all the numbers in the data set and then dividing by the count of those numbers.

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

Where:
- $\mu$ is the mean,
- $N$ is the number of observations,
- $x_i$ is each individual observation.
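As a small worked example of the formula, the sketch below computes the mean by hand for the first five `ContrailPersistence` values shown in `df.head()` and compares it with NumPy's result. The variable names are illustrative only.

```python
import numpy as np

# First five ContrailPersistence values from df.head()
values = [4, 33, 14, 21, 48]

# Sum the observations and divide by their count, as in the formula
manual_mean = sum(values) / len(values)

# NumPy's built-in mean should agree with the manual calculation
numpy_mean = float(np.mean(values))

print(manual_mean, numpy_mean)  # 24.0 24.0
```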

In [23]:
print("Mean values:")
df.mean()

Mean values:


ObservationID             50.50
Altitude               35177.22
Temperature              -35.45
RelativeHumidity          80.13
ContrailPersistence       31.97
SkyCoverage               49.50
dtype: float64

#### **Median**

The median is the middle value of the ordered data, which makes it more robust to outliers than the mean. With an even number of observations, the median is the average of the two middle values.
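The sketch below illustrates both cases with small hypothetical samples, and shows how a single extreme value moves the mean but not the median. It uses Python's standard `statistics` module; the sample values are made up for illustration.

```python
import statistics

odd_sample = [4, 14, 21, 33, 48]        # five values: the middle one is the median
even_sample = [4, 14, 21, 33, 48, 59]   # six values: average of the two middle ones

median_odd = statistics.median(odd_sample)
median_even = statistics.median(even_sample)
print(median_odd, median_even)  # 21 27.0

# Replace the largest value with an extreme outlier (hypothetical)
with_outlier = [4, 14, 21, 33, 4800]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)
print(median_outlier, mean_outlier)  # the median stays at 21, the mean jumps to 974.4
```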


In [24]:
print("Median values:")
df.median()

Median values:


ObservationID             50.5
Altitude               35428.0
Temperature              -37.0
RelativeHumidity          81.5
ContrailPersistence       33.0
SkyCoverage               54.0
dtype: float64

#### **Mode**

The mode represents the most frequently occurring value in our data, which can be particularly informative for categorical data. A data set may have one mode, more than one mode, or no mode at all.
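To make the three cases concrete, here is a minimal sketch using `pandas.Series.mode` on small hypothetical samples. Note that when every value is distinct, pandas returns all of them (sorted) rather than reporting "no mode".

```python
import pandas as pd

# Hypothetical samples chosen to show each case
one_mode = pd.Series([91, 83, 91, 74])
two_modes = pd.Series([91, 83, 91, 83, 74])
all_unique = pd.Series([37270, 30860, 35390])

print(one_mode.mode().tolist())    # [91]
print(two_modes.mode().tolist())   # [83, 91]
print(all_unique.mode().tolist())  # [30860, 35390, 37270]
```

This matters when reading the results for this dataset: a column such as `ObservationID`, whose values are all distinct, has every value as a "mode", so taking the first row of `df.mode()` simply returns its smallest value.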


In [25]:
# df.mode() returns one row per tied mode; .loc[0] keeps only the first (smallest) mode of each column
mode_values = df.mode().loc[0]
print("Mode values:")
mode_values


Mode values:


ObservationID              1.0
Altitude               30064.0
Temperature              -21.0
RelativeHumidity          91.0
ContrailPersistence       33.0
SkyCoverage               57.0
Name: 0, dtype: float64

#### **Variance**

Variance measures the spread of the data as the average squared deviation from the mean. Squaring the differences emphasizes larger deviations, so extreme values contribute more to the overall measure of variability. It is useful for describing the spread of a distribution or assessing variability in a sample. The formula below is the population variance, which divides by $N$. Note that pandas' `df.var()` defaults to the sample variance, which divides by $N - 1$ instead, so the values printed below use that divisor.

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

Where:
- $\sigma^2$ is the variance,
- $N$ is the number of observations,
- $x_i$ is each individual observation,
- $\mu$ is the mean of the data set.
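The difference between dividing by $N$ and by $N - 1$ can be checked directly. The sketch below uses the first five `ContrailPersistence` values from `df.head()`; the `ddof` parameter of `Series.var` selects the divisor ($N - \text{ddof}$), and pandas defaults to `ddof=1`.

```python
import pandas as pd

# First five ContrailPersistence values from df.head()
values = pd.Series([4, 33, 14, 21, 48], dtype=float)

n = len(values)
mu = values.mean()
squared_dev = ((values - mu) ** 2).sum()

population_var = squared_dev / n        # divides by N, as in the formula
sample_var = squared_dev / (n - 1)      # divides by N - 1

print(population_var, values.var(ddof=0))  # both 233.2
print(sample_var, values.var())            # both 291.5 (pandas default is ddof=1)
```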



In [26]:
print("Variance values:")
df.var()

Variance values:


ObservationID          8.416667e+02
Altitude               8.234402e+06
Temperature            9.433081e+01
RelativeHumidity       1.498920e+02
ContrailPersistence    2.376052e+02
SkyCoverage            8.256061e+02
dtype: float64

#### **Standard Deviation**

Standard deviation is the square root of the variance. It can be read as the typical distance of an observation from the mean, and because it is expressed in the same units as the original data, it is easier to interpret than the variance. The formula below is the population standard deviation, which divides by $N$. As with variance, pandas' `df.std()` defaults to the sample version, which divides by $N - 1$.

$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
$$

Where:
- $\sigma$ is the standard deviation.
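A quick check of the square-root relationship, sketched with the first five `Temperature` values from `df.head()`. Passing `ddof=0` makes pandas use the population formula shown above.

```python
import math
import pandas as pd

# First five Temperature values from df.head()
temps = pd.Series([-23, -44, -42, -43, -39], dtype=float)

population_var = temps.var(ddof=0)       # population variance (divide by N)
manual_std = math.sqrt(population_var)   # square root of the variance
pandas_std = temps.std(ddof=0)           # pandas population standard deviation

# The two values agree; the result is in the same units as the temperatures
print(manual_std, pandas_std)
```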

In [27]:
print("Standard Deviation values:")
df.std()


Standard Deviation values:


ObservationID            29.011492
Altitude               2869.564753
Temperature               9.712405
RelativeHumidity         12.243040
ContrailPersistence      15.414446
SkyCoverage              28.733361
dtype: float64

#### **Skewness**

Skewness measures the asymmetry of a distribution around its mean. Positive skewness means the right tail is longer, negative skewness means the left tail is longer, and a perfectly symmetrical distribution has a skewness of zero. The formula below uses the population standard deviation $\sigma$, which is also what `scipy.stats.skew` computes by default.

$$
\text{Skewness} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^3
$$
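The following sketch evaluates the skewness formula by hand on a small hypothetical sample with one large value, then compares it with `scipy.stats.skew`, whose default (`bias=True`) uses the same population moments.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical sample with a long right tail
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])

mu = x.mean()
sigma = x.std()  # NumPy's default is the population standard deviation (ddof=0)

# Average of the cubed standardized deviations
manual_skew = np.mean(((x - mu) / sigma) ** 3)
scipy_skew = skew(x)

print(manual_skew, scipy_skew)  # equal, and positive because of the long right tail
```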


In [28]:
print("Skewness values:")
df.apply(skew)


Skewness values:


ObservationID          0.000000
Altitude              -0.152978
Temperature            0.077802
RelativeHumidity      -0.180743
ContrailPersistence   -0.262596
SkyCoverage           -0.050774
dtype: float64

#### **Kurtosis**

Kurtosis describes how heavy the tails of a distribution are relative to a normal distribution. High kurtosis indicates heavy tails, meaning more outliers or extreme values than a normal distribution would produce; low kurtosis indicates lighter tails and fewer outliers. The formula below subtracts 3, the kurtosis of a normal distribution, so the result is the excess kurtosis, which is what `scipy.stats.kurtosis` returns by default. A uniform distribution has an excess kurtosis of about $-1.2$, which is why the randomly generated columns here produce values close to that figure.

$$
\text{Kurtosis} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^4 - 3
$$
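As a check on the formula, the sketch below computes excess kurtosis by hand for the integers 1 to 100 (the same values as the `ObservationID` column) and compares it with `scipy.stats.kurtosis`, whose defaults (`fisher=True`, `bias=True`) correspond to the expression above.

```python
import numpy as np
from scipy.stats import kurtosis

# The integers 1..100, matching the ObservationID column
x = np.arange(1, 101, dtype=float)

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std()

# Mean fourth power of the standardized values, minus 3 for excess kurtosis
manual_kurtosis = np.mean(z ** 4) - 3
scipy_kurtosis = kurtosis(x)

print(manual_kurtosis, scipy_kurtosis)  # both approximately -1.20024
```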

In [29]:
print("Kurtosis values:")
df.apply(kurtosis)


Kurtosis values:


ObservationID         -1.200240
Altitude              -1.168779
Temperature           -1.427832
RelativeHumidity      -1.253947
ContrailPersistence   -0.797769
SkyCoverage           -1.123127
dtype: float64

#### **Range**

Range in statistics refers to the difference between the maximum and minimum values in a dataset. It gives a measure of how spread out the data is. A larger range indicates greater variability among the data points, while a smaller range suggests less variability.

$$
\text{Range} = \max(x_i) - \min(x_i)
$$
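For a single array, the same calculation can be sketched with NumPy, either directly or with `np.ptp` ("peak to peak"). The values are the first five `Altitude` entries from `df.head()`.

```python
import numpy as np

# First five Altitude values from df.head()
altitudes = np.array([37270, 30860, 35390, 35191, 35734])

manual_range = altitudes.max() - altitudes.min()
ptp_range = np.ptp(altitudes)  # NumPy's built-in max-minus-min

print(manual_range, ptp_range)  # 6410 6410
```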

In [30]:
column_ranges = df.max() - df.min()
print("Range of each column:")
column_ranges

Range of each column:


ObservationID            99
Altitude               9934
Temperature              29
RelativeHumidity         39
ContrailPersistence      57
SkyCoverage              98
dtype: int64

#### **Interquartile Range**

The interquartile range (IQR) is a measure of statistical dispersion, specifically used to describe the spread of the middle 50% of a dataset. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).

$$
\text{IQR} = Q_3 - Q_1
$$
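A minimal sketch of the quartile calculation on the first five `ContrailPersistence` values from `df.head()`, using `np.percentile` with its default linear interpolation (the same default pandas uses for `quantile`). `scipy.stats.iqr` is shown as a one-call alternative.

```python
import numpy as np
from scipy.stats import iqr

# First five ContrailPersistence values from df.head()
values = np.array([4, 33, 14, 21, 48])

q1 = np.percentile(values, 25)  # first quartile
q3 = np.percentile(values, 75)  # third quartile
manual_iqr = q3 - q1

print(q1, q3, manual_iqr)  # 14.0 33.0 19.0
print(iqr(values))         # 19.0
```

Unlike the range, the IQR ignores the smallest and largest quarter of the data, so a single extreme value has little effect on it.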

In [31]:
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

print("Interquartile range of each column:")
iqr

Interquartile range of each column:


ObservationID            49.50
Altitude               4981.00
Temperature              18.00
RelativeHumidity         21.00
ContrailPersistence      24.00
SkyCoverage              46.25
dtype: float64