## Purpose 

Predict daily smoke behavior using only contextual information.

### Inputs (X)
- Day (derived from the `date` / `day` columns)
- Weather
- Occasion

### Outputs (y)
- Probability of smoke detection today
- Likely time-of-day bucket if smoke occurs


### Constraints
- No time inputs
- Time only appears as targets
- Minimal user input (forecast-style UX)
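
The "no time inputs" constraint can be checked mechanically before training. The following is a minimal sketch, not part of the original notebook: the helper name `assert_no_time_inputs` is hypothetical, and treating `duration` as a time-derived column is an assumption.

```python
# Hypothetical guard: fail fast if a time-derived column is used as a model input.
# Assumption: these four dataset columns are the ones that carry time information.
TIME_COLUMNS = {
    "time_opening_windows",
    "time_closing_windows",
    "time_sensing_smoke",
    "duration",
}

def assert_no_time_inputs(features):
    """Raise ValueError if any feature name is a known time column."""
    leaked = sorted(set(features) & TIME_COLUMNS)
    if leaked:
        raise ValueError(f"Time columns used as inputs: {leaked}")
    return True

# The feature names used later in this notebook pass the check.
ok = assert_no_time_inputs(["day_of_week", "is_weekend", "weather", "occassion"])
```

Calling the guard right after the feature list is defined keeps the constraint visible in code rather than only in prose.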

## Imports

In [8]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score, classification_report

import lightgbm as lgb


### Load Data 

In [9]:
import os

# os.getcwd()

df = pd.read_csv("smoke_occurence_nairobi.csv")
df.head()

Unnamed: 0,id,time_opening_windows,time_closing_windows,smoke_detected,time_sensing_smoke,duration,date,day,occassion,weather,type_of_smoke
0,1,1641,1717,,1717.0,,2025-10-03,Friday,nothing,cloudywithoutwind,stove
1,2,1252,1528,,1728.0,,2025-10-04,Saturday,,cloudywithoutwind,stove
2,3,1452,1734,,1734.0,,2025-10-05,Sunday,,windy,
3,4,1035,1759,,1759.0,,2025-10-06,,,windy,stove
4,5,1400,1813,True,1813.0,,2025-10-07,Tuesday,,,


In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    20 non-null     int64  
 1   time_opening_windows  20 non-null     int64  
 2   time_closing_windows  20 non-null     int64  
 3   smoke_detected        14 non-null     object 
 4   time_sensing_smoke    8 non-null      float64
 5   duration              12 non-null     float64
 6   date                  20 non-null     object 
 7   day                   19 non-null     object 
 8   occassion             12 non-null     object 
 9   weather               16 non-null     object 
 10  type_of_smoke         6 non-null      object 
dtypes: float64(2), int64(3), object(6)
memory usage: 1.8+ KB


In [11]:
df.isna().sum()


id                       0
time_opening_windows     0
time_closing_windows     0
smoke_detected           6
time_sensing_smoke      12
duration                 8
date                     0
day                      1
occassion                8
weather                  4
type_of_smoke           14
dtype: int64

### Date & Day Engineering

In [12]:

df["date"] = pd.to_datetime(df["date"], errors="coerce")

df["day_of_week"] = df["date"].dt.dayofweek  # 0=Monday
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
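
Since `date` is complete while `day` has one missing value, the day name can be recomputed from the date and compared with the recorded one. This is a sketch on three toy rows modelled on the data above; the column names `day_from_date` and `day_filled` are hypothetical.

```python
import pandas as pd

# Toy rows modelled on the dataset; the middle row has no recorded day.
toy = pd.DataFrame({
    "date": ["2025-10-05", "2025-10-06", "2025-10-07"],
    "day": ["Sunday", None, "Tuesday"],
})
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# Derive the English day name from the date and use it to fill gaps.
toy["day_from_date"] = toy["date"].dt.day_name()
toy["day_filled"] = toy["day"].fillna(toy["day_from_date"])

# Rows where the recorded day disagrees with the date deserve a manual look.
mismatches = toy[toy["day"].notna() & (toy["day"] != toy["day_from_date"])]
print(toy[["date", "day", "day_filled"]])
print("mismatching rows:", len(mismatches))
```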


For now we do not need `time_opening_windows` or `time_closing_windows`. The goal is to see whether the remaining features correlate with smoke occurrence without relying on specific times.

### Target Feature Engineering

In [17]:
df["smoke_today"] = df["smoke_detected"].map(
    {"True": 1, "False": 0}
).fillna(0)


In [19]:
df["smoke_today"].head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: smoke_today, dtype: float64
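
Row 4 has `smoke_detected` set to `True` in the raw data, yet its `smoke_today` value is 0.0. One possible explanation is that pandas parsed the column into Python booleans, so the string keys `"True"` and `"False"` in the mapping never matched. The sketch below assumes that explanation; the helper name `to_smoke_flag` is hypothetical, and it accepts both booleans and strings.

```python
import numpy as np
import pandas as pd

# Toy column mixing the value types that can appear after read_csv.
raw = pd.Series([np.nan, True, False, "True", "false"], dtype="object")

def to_smoke_flag(value):
    """Map True/'True' (any casing) to 1 and everything else, including NaN, to 0."""
    if pd.isna(value):
        return 0
    return int(str(value).strip().lower() == "true")

smoke_today = raw.apply(to_smoke_flag).astype(int)
print(smoke_today.tolist())
```

Treating missing values as 0 mirrors the `.fillna(0)` in the cell above. Whether a missing reading really means "no smoke" is a separate modelling decision.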

Target B -> **Time Bucket Engineering** (Core ML Step)

Time appears ONLY HERE.

In [24]:
# Convert an HHMM clock reading (e.g. 1717 or "1717.0") to minutes since midnight.
def hhmm_to_minutes(hhmm):
    # Go through float first so readings stored as floats or strings like "1717.0" parse
    try:
        hhmm = int(float(hhmm))
    except (ValueError, TypeError):
        # Missing (NaN/None) or unparseable readings have no time of day
        return np.nan
    hours = hhmm // 100
    minutes = hhmm % 100
    return hours * 60 + minutes

In [25]:
# bucketing 
def time_bucket(minutes):
    if pd.isna(minutes):
        return "none"
    if minutes < 12 * 60:
        return "morning"
    if minutes < 17 * 60:
        return "afternoon"
    if minutes < 21 * 60:
        return "evening"
    return "night"


In [27]:
df["smoke_time_bucket"] = df["time_sensing_smoke"].apply(hhmm_to_minutes).apply(time_bucket)

In [28]:
df["smoke_time_bucket"].head()

0    evening
1    evening
2    evening
3    evening
4    evening
Name: smoke_time_bucket, dtype: object
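
To make the bucket boundaries concrete, here is a self-contained worked example with the same cut-offs (12:00, 17:00, 21:00). The function names `to_minutes` and `bucket` are stand-ins for the notebook helpers, rewritten without pandas so the snippet runs on its own.

```python
import math

def to_minutes(hhmm):
    """HHMM reading -> minutes since midnight; NaN when the reading is missing."""
    if hhmm is None or (isinstance(hhmm, float) and math.isnan(hhmm)):
        return math.nan
    value = int(float(hhmm))
    return (value // 100) * 60 + value % 100

def bucket(minutes):
    """Same thresholds as time_bucket above."""
    if math.isnan(minutes):
        return "none"
    if minutes < 12 * 60:
        return "morning"
    if minutes < 17 * 60:
        return "afternoon"
    if minutes < 21 * 60:
        return "evening"
    return "night"

readings = [1035, 1641, 1717.0, 2100, math.nan]
buckets = [bucket(to_minutes(r)) for r in readings]
# 1035 -> 635 min, 1641 -> 1001 min, 1717 -> 1037 min, 2100 -> 1260 min
print(buckets)  # ['morning', 'afternoon', 'evening', 'night', 'none']
```

A reading of 1717 is 17 * 60 + 17 = 1037 minutes, just past the 1020-minute boundary, which is why the first rows land in `evening`.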

End of Target Engineering

### Feature Selection

In [29]:
FEATURES = [
    "day_of_week",
    "is_weekend",
    "weather",
    "occassion"
]


In [30]:
X = df[FEATURES]
y_binary = df["smoke_today"]
y_time = df["smoke_time_bucket"]


In [31]:
X.head()

Unnamed: 0,day_of_week,is_weekend,weather,occassion
0,4,0,cloudywithoutwind,nothing
1,5,1,cloudywithoutwind,
2,6,1,windy,
3,0,0,windy,
4,1,0,,


### Categorical Encoding

In [33]:
cat_features = ["weather", "occassion"]

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

X_cat = encoder.fit_transform(X[cat_features])
X_num = X.drop(columns=cat_features).fillna(0)

X_final = np.hstack([X_num.values, X_cat])


In [34]:
X_final.shape

(20, 11)

### Train/Test Split

In [38]:
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y_binary, test_size=0.2, random_state=42
)
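
With only 20 rows, a purely random split can leave a class out of the training or test set. Passing `stratify` to `train_test_split` keeps class proportions roughly equal on both sides. The sketch below uses synthetic labels (6 positives out of 20), not the notebook's data. Stratification requires at least two rows per class, so it only helps once the target contains both classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows, 2 features, 6 positive labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 6 + [0] * 14)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives:", int(y_te.sum()), "of", len(y_te))
```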


From here on, we train two models:

- smoke prediction (binary)
- time bucket forecast 

### Model 1 - Smoke Detection

In [39]:
clf_smoke = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=5,
    random_state=42
)

clf_smoke.fit(X_train, y_train)


[LightGBM] [Info] Number of positive: 0, number of negative: 16
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 16, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000000 -> initscore=-34.538776
[LightGBM] [Info] Start training from score -34.538776
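
The log reports 0 positives among the 16 training rows, so the classifier has only one class to learn from. The start score of -34.538776 is consistent with the log-odds of a zero prior clipped to a tiny epsilon. The value 1e-15 used below is an assumption about LightGBM internals, and `check_binary_labels` is a hypothetical guard that could run before `fit`.

```python
import math
from collections import Counter

def check_binary_labels(labels):
    """Raise ValueError unless both classes occur in the training labels."""
    counts = Counter(int(v) for v in labels)
    if len(counts) < 2:
        raise ValueError(f"Only one class present in training labels: {dict(counts)}")
    return counts

# Log-odds of a prior clipped to eps (assumed clipping constant).
eps = 1e-15
init_score = math.log(eps / (1 - eps))
print(round(init_score, 6))  # -34.538776

# A label vector with both classes passes the guard.
counts = check_binary_labels([0, 1, 0, 0])
```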


### Evaluation (Model 1)

In [41]:
y_pred_proba = clf_smoke.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred_proba)
print(classification_report(y_test, clf_smoke.predict(X_test)))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00         4

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4
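
The perfect scores above come from a test set that contains only class 0.0, so they say little about the model. ROC AUC is undefined when `y_test` holds a single class, and the `roc_auc_score` result in the cell is not displayed because it is not the last expression. A hedged sketch of a guarded evaluation follows; `safe_roc_auc` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_score):
    """Return ROC AUC, or None when y_true contains fewer than two classes."""
    if len(np.unique(y_true)) < 2:
        return None
    return roc_auc_score(y_true, y_score)

# Single-class test labels, as in the report above: AUC cannot be computed.
single_class = safe_roc_auc([0, 0, 0, 0], [0.1, 0.1, 0.1, 0.1])

# Two classes with perfectly ranked scores.
two_classes = safe_roc_auc([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])

print("single-class AUC:", single_class)
print("two-class AUC:", two_classes)
```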





### Model 2 (Time Bucket Forecast)

In [42]:
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(
    X_final, y_time, test_size=0.2, random_state=42
)

clf_time = lgb.LGBMClassifier(
    objective="multiclass",
    num_class=len(y_time.unique()),
    n_estimators=200,
    learning_rate=0.05,
    random_state=42
)

clf_time.fit(X_train_t, y_train_t)


[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 16, number of used features: 0
[LightGBM] [Info] Start training from score -2.079442
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -0.470004


### Evaluation of Model 2 (Time Buckets)

In [43]:
y_pred_time = clf_time.predict(X_test_t)
print(classification_report(y_test_t, y_pred_time))


              precision    recall  f1-score   support

     evening       0.00      0.00      0.00         2
        none       0.50      1.00      0.67         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4
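
This report is consistent with a model that predicts `none` for every test row. The sketch below reproduces the same numbers from a constant prediction. The test labels are reconstructed from the support column (their order is an assumption), and `zero_division=0` makes the handling of the never-predicted `evening` class explicit.

```python
from sklearn.metrics import classification_report

# Test labels reconstructed from the support column; order is an assumption.
y_true = ["evening", "evening", "none", "none"]
# Constant majority-class prediction.
y_pred = ["none"] * 4

report = classification_report(
    y_true, y_pred, zero_division=0, output_dict=True
)
print(report["evening"]["precision"], report["none"]["precision"])
print(report["none"]["recall"], report["accuracy"])
```

Comparing any future model against such a constant baseline shows whether the features add information beyond the class priors.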



