# Exploratory Data Analysis
Here, I want to take some time to explore the smoke detection dataset.  


## 1. Information provided by publisher
First, let's take a look at the information the publisher gave on this dataset.


### 1.1. Context
A smoke detector is a device that senses smoke, typically as an indicator of fire. Smoke detectors are usually housed in plastic enclosures, typically shaped like a disk about 150 millimetres (6 in) in diameter and 25 millimetres (1 in) thick, but shape and size vary.

#### Types of Smoke Detectors
1. Photoelectric Smoke Detector:  
A photoelectric smoke detector contains a source of infrared, visible, or ultraviolet light, a lens, and a photoelectric receiver. In some types, the light emitted by the light source passes through the air being tested and reaches the photosensor. The received light intensity will be reduced due to scattering from particles of smoke, air-borne dust, or other substances; the circuitry detects the light intensity and generates an alarm if it is below a specified threshold, potentially due to smoke. Such detectors are also known as optical detectors.  

2. Ionization Smoke Detector:  
An ionization smoke detector uses a radioisotope to ionize air. If any smoke particles enter the open chamber, some of the ions will attach to the particles and not be available to carry the current in that chamber. An electronic circuit detects that a current difference has developed between the open and sealed chambers, and sounds the alarm.  
  
The author of this dataset has successfully created a smoke detection device with the help of IoT devices and an AI model.


### 1.2. About the dataset
Training data is collected with the help of IoT devices, since the goal is to develop an AI-based smoke detector device.
Many different environments and fire sources have to be sampled to ensure a good dataset for training. A short list of different scenarios which are captured:

- Normal indoor
- Normal outdoor
- Indoor wood fire, firefighter training area
- Indoor gas fire, firefighter training area
- Outdoor wood, coal, and gas grill
- Outdoor high humidity
- etc.

The dataset is nearly 60,000 readings long. The sample rate is 1 Hz for all sensors. To keep track of the data, a UTC timestamp is added to every sensor reading.

## 2. Own analysis
Next, we want to get some more insights into the data ourselves. Let's take a look.

### 2.1. Load the data
Let's load the data and take a first glance at it.

In [2]:
# all import statements
import pandas as pd

In [5]:
df = pd.read_csv("../../data/raw/smoke_detection_iot.csv", index_col=0)
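
A quick note on `index_col=0`: it tells pandas to use the first CSV column as the row index instead of loading it as a regular data column. Below is a minimal sketch of that behaviour on two rows copied from the top of the dataset. The empty first header field is an assumption about how the raw file stores the row index, and `RAW_SNIPPET` and `sample` are illustrative names only.

```python
import io

import pandas as pd

# Two rows copied from the head of the dataset. The leading empty header
# field is an ASSUMPTION about how the raw CSV stores the row index.
RAW_SNIPPET = """\
,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
"""

# index_col=0 turns the first CSV column into the DataFrame index
# instead of keeping it as an extra "Unnamed: 0" data column.
sample = pd.read_csv(io.StringIO(RAW_SNIPPET), index_col=0)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # integer vs. float columns at a glance
```

Running the same two `print` calls on the full `df` is a cheap sanity check right after loading.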

### 2.2. Preliminary Analysis
We can take a very high level look at the data. Nothing too fancy, just getting some initial information before diving in more deeply.

In [7]:
df.columns

Index(['UTC', 'Temperature[C]', 'Humidity[%]', 'TVOC[ppb]', 'eCO2[ppm]',
       'Raw H2', 'Raw Ethanol', 'Pressure[hPa]', 'PM1.0', 'PM2.5', 'NC0.5',
       'NC1.0', 'NC2.5', 'CNT', 'Fire Alarm'],
      dtype='object')

Okay, so let's find an explanation for each of the columns.  

- `UTC`: Timestamp
- `Temperature[C]`: Air Temperature; measured in Celsius
- `Humidity[%]`: Air Humidity; measured in percent
- `TVOC[ppb]`: Total Volatile Organic Compounds; measured in parts per billion
- `eCO2[ppm]`: CO2 equivalent concentration; measured in parts per million
- `Raw H2`: raw molecular hydrogen; measured in ???
- `Raw Ethanol`: raw ethanol gas; measured in ???
- `Pressure[hPa]`: Air Pressure; measured in hectopascal
- `PM1.0`: particulate matter size smaller than 1 µm
- `PM2.5`: particulate matter size between 1 and 2.5 µm
- `NC0.5`: Number concentration of particulate matter with particle size < 0.5 µm. This differs from PM because NC gives the actual number of particles in the air. The raw NC is also classified by particle size.
- `NC1.0`: Number concentration of particulate matter with particle size between 0.5 µm and 1.0 µm
- `NC2.5`: Number concentration of particulate matter with particle size between 1.0 µm and 2.5 µm
- `CNT`: Sample counter
- `Fire Alarm`: binary label indicating whether there is a fire (1) or not (0)
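
Since `UTC` holds Unix timestamps in seconds, we can make it human-readable and check the publisher's 1 Hz claim by looking at the spacing between consecutive readings. The following is only a sketch on a toy frame: the first three timestamps come from the dataset, the fourth is made up to simulate a jump, and `timestamp_gaps` is a hypothetical helper name.

```python
import pandas as pd


def timestamp_gaps(frame: pd.DataFrame, column: str = "UTC") -> pd.Series:
    """Count how often each step size (in seconds) occurs between consecutive rows."""
    steps = frame[column].diff().dropna()
    return steps.value_counts().sort_index()


# Toy frame: the first three timestamps are taken from the dataset,
# the last one is MADE UP to simulate a jump between recordings.
toy = pd.DataFrame({"UTC": [1654733331, 1654733332, 1654733333, 1654733400]})

# Interpret the integers as seconds since the Unix epoch, in UTC.
toy["timestamp"] = pd.to_datetime(toy["UTC"], unit="s", utc=True)

print(toy["timestamp"].iloc[0])
print(timestamp_gaps(toy))
```

On the real data, a step size of 1 should dominate if the sample rate really is 1 Hz. Any other step sizes would point to gaps or to several separate recordings.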

Now let's look at the top and bottom of the dataset, just to get a feel for the data.

In [8]:
df.head()

Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [9]:
df.tail()

Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
62625,1655130047,18.438,15.79,625,400,13723,20569,936.67,0.63,0.65,4.32,0.673,0.015,5739,0
62626,1655130048,18.653,15.87,612,400,13731,20588,936.678,0.61,0.63,4.18,0.652,0.015,5740,0
62627,1655130049,18.867,15.84,627,400,13725,20582,936.687,0.57,0.6,3.95,0.617,0.014,5741,0
62628,1655130050,19.083,16.04,638,400,13712,20566,936.68,0.57,0.59,3.92,0.611,0.014,5742,0
62629,1655130051,19.299,16.52,643,400,13696,20543,936.676,0.57,0.59,3.9,0.607,0.014,5743,0


Let's look at some random examples where there was a fire and where there was not.  

Can we see a clear pattern here?

In [15]:
random_sample = df.groupby("Fire Alarm").apply(lambda x: x.sample(n=10))
random_sample

Fire Alarm (group key),Row index,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,2398,1654735729,18.663,51.25,42,400,13199,20127,939.599,0.95,0.99,6.57,1.025,0.023,2398,0
0,2919,1654736250,12.597,48.42,168,400,13152,20013,939.632,1.08,1.12,7.43,1.159,0.026,2919,0
0,54636,1654715681,25.57,45.95,0,400,13078,20859,937.47,2.02,2.1,13.9,2.168,0.049,3494,0
0,26400,1654762749,14.99,55.69,27,400,13092,19986,939.665,0.89,0.92,6.1,0.952,0.021,1406,0
0,53518,1654714563,28.5,42.77,177,429,12772,20542,937.324,1.91,1.99,13.18,2.055,0.046,2376,0
0,27513,1654763862,18.03,52.6,52,400,13191,20123,939.616,1.02,1.06,7.02,1.094,0.025,2519,0
0,60663,1655128085,12.708,41.67,0,400,13418,21253,937.411,2.06,2.14,14.2,2.214,0.05,3777,0
0,26926,1654763275,16.73,53.18,74,400,13114,19970,939.614,0.56,0.58,3.83,0.597,0.013,1932,0
0,58928,1655126350,-5.171,46.54,143,410,12775,20575,937.344,1.92,2.0,13.23,2.063,0.047,2042,0
0,53424,1654714469,26.22,46.58,136,411,12785,20588,937.341,1.8,1.87,12.39,1.932,0.044,2282,0


To be completely honest, I do not see any pattern here...  
Maybe it is a combination of parameters that plays a role? We will find out later if a machine can learn to set off a fire alarm with this data.
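
Eyeballing individual rows only gets us so far. A slightly more systematic approach is to compare a summary statistic per class, for example the median of every feature for fire and no-fire readings. Here is a minimal sketch on a toy frame. All numbers in it are made up for illustration, and `class_medians` is a hypothetical helper name.

```python
import pandas as pd


def class_medians(frame: pd.DataFrame, label: str = "Fire Alarm") -> pd.DataFrame:
    """Return one row per feature and one column per class, holding the class median."""
    return frame.groupby(label).median().T


# Toy frame with MADE-UP values, only meant to show the shape of the result.
toy = pd.DataFrame(
    {
        "TVOC[ppb]": [0, 42, 168, 900, 1100, 1300],
        "Humidity[%]": [57.0, 51.0, 48.0, 50.0, 52.0, 54.0],
        "Fire Alarm": [0, 0, 0, 1, 1, 1],
    }
)

medians = class_medians(toy)
print(medians)
```

Calling `class_medians(df)` on the real data puts the two classes side by side. Features whose medians differ strongly are good candidates for a closer look in the visualization part.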

Lastly, for our preliminary data analysis, we use the `.describe()` method to see if we find something interesting there.

In [16]:
df.describe()

Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
count,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0
mean,1654792000.0,15.970424,48.539499,1942.057528,670.021044,12942.453936,19754.257912,938.627649,100.594309,184.46777,491.463608,203.586487,80.049042,10511.386157,0.714626
std,110002.5,14.359576,8.865367,7811.589055,1905.885439,272.464305,609.513156,1.331344,922.524245,1976.305615,4265.661251,2214.738556,1083.383189,7597.870997,0.451596
min,1654712000.0,-22.01,10.74,0.0,400.0,10668.0,15317.0,930.852,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1654743000.0,10.99425,47.53,130.0,400.0,12830.0,19435.0,938.7,1.28,1.34,8.82,1.384,0.033,3625.25,0.0
50%,1654762000.0,20.13,50.15,981.0,400.0,12924.0,19501.0,938.816,1.81,1.88,12.45,1.943,0.044,9336.0,1.0
75%,1654778000.0,25.4095,53.24,1189.0,438.0,13109.0,20078.0,939.418,2.09,2.18,14.42,2.249,0.051,17164.75,1.0
max,1655130000.0,59.93,75.2,60000.0,60000.0,13803.0,21410.0,939.861,14333.69,45432.26,61482.03,51914.68,30026.438,24993.0,1.0


- Temperature goes from -22°C to 60°C. My guess would be that higher temperatures correlate with fire.
    - We will see about that in the data visualization part.  
- TVOC and eCO2 both cap at exactly 60000 (ppb and ppm, respectively), which is suspicious. The 75th percentiles (1189 ppb and 438 ppm) are tiny compared to that.
    - We need to investigate that.
    - We need to investigate outliers in general.  
- Pressure covers a rather small range (931-940 hPa).
    - A small change in that value could already be significant.  
- All the PM and NC values go from 0 to maxima in the tens of thousands, while every 75th percentile stays below 15.
    - Investigate outliers.  
  
- Mean of "Fire Alarm" is ~0.71, so roughly 71% of the readings are labeled as fire. We have unbalanced data here.
    - We have to do _something_ about that before training,
    - because in day-to-day operation the input data is massively skewed towards no fire, which is the opposite of this dataset.
    - At the same time, we should strive for 100% recall with as high precision as possible.

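- We will certainly need some scaler, since the feature ranges differ by orders of magnitude (pressure spans about 9 hPa, while TVOC spans 60000 ppb).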
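
Two of the observations above are easy to put into numbers: the class balance and the need for scaling. The sketch below uses plain pandas on a toy frame (values partly borrowed from the `.describe()` output, the rest made up). `class_share` and `standardize` are hypothetical helper names, and z-score standardization is just one possible choice of scaler.

```python
import pandas as pd


def class_share(frame: pd.DataFrame, label: str = "Fire Alarm") -> pd.Series:
    """Fraction of rows per class, sorted by class value."""
    return frame[label].value_counts(normalize=True).sort_index()


def standardize(frame: pd.DataFrame, label: str = "Fire Alarm") -> pd.DataFrame:
    """Z-score every feature column (mean 0, population std 1); the label stays untouched."""
    features = frame.drop(columns=[label])
    scaled = (features - features.mean()) / features.std(ddof=0)
    scaled[label] = frame[label]
    return scaled


# Toy frame: ILLUSTRATIVE values only, not a real excerpt of the dataset.
toy = pd.DataFrame(
    {
        "Pressure[hPa]": [939.7, 939.6, 937.3, 936.7],
        "TVOC[ppb]": [0, 42, 1189, 60000],
        "Fire Alarm": [0, 1, 1, 1],
    }
)

share = class_share(toy)
scaled = standardize(toy)
print(share)
print(scaled)
```

For actual training, the mean and standard deviation should be computed on the training split only and then reused for validation and test data. scikit-learn's `StandardScaler` does the same job and would be the more common choice inside a pipeline.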

### 2.3. Data Cleaning / Processing

We want to check the data to see if there is some processing or cleaning necessary.  
Missing values are usually the lowest-hanging fruit, so let's check for that.

In [18]:
df.isna().sum()

UTC               0
Temperature[C]    0
Humidity[%]       0
TVOC[ppb]         0
eCO2[ppm]         0
Raw H2            0
Raw Ethanol       0
Pressure[hPa]     0
PM1.0             0
PM2.5             0
NC0.5             0
NC1.0             0
NC2.5             0
CNT               0
Fire Alarm        0
dtype: int64

Cool, no missing values. Great work from the publisher! Maybe there already was some cleaning before publishing the dataset :)

TODO: We need to check for outliers and possibly remove them (if they are incorrect, that is). We will do that in the visualization section and clean the data on demand then.
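
As a starting point for that TODO, here is a minimal sketch of one common heuristic: flag values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles and report the flagged share per column. The IQR rule and the factor 1.5 are assumptions, not something the publisher prescribes. `iqr_outlier_share` is a hypothetical helper name, and the toy values are illustrative.

```python
import pandas as pd


def iqr_outlier_share(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Share of values per column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column names.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.mean()


# Toy frame with ILLUSTRATIVE values: one extreme PM1.0 reading, calm pressure.
toy = pd.DataFrame(
    {
        "PM1.0": [1.28, 1.81, 2.09, 1.9, 2.0, 1.5, 1.7, 14333.69],
        "Pressure[hPa]": [938.7, 938.8, 939.4, 939.0, 938.9, 939.1, 939.2, 939.3],
    }
)

shares = iqr_outlier_share(toy)
print(shares)
```

A flagged value is not automatically a wrong value. During an actual fire, extreme particle readings may be exactly the signal we are after, so the flags are best read together with the `Fire Alarm` label before anything is removed.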

### 2.4. Data Visualization
Let us continue by visualizing some data from the dataset to get a better feel for it.