# Feature Engineering

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import yaml

In [3]:
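def correct_path(path_type, name):
    """Return the path stored under config[path_type][name], relative to the parent directory."""
    config_path = os.path.join("..", "configs", "paths.yaml")
    with open(config_path, "r") as file:
        config = yaml.safe_load(file)

    path = config[path_type][name]
    # Normalize Windows-style backslashes so the path also works on POSIX systems
    full_path = os.path.join("..", path.replace("\\", "/"))
    return full_path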

In [4]:
# data loading
data_path = correct_path("data_paths", "cleaned_data_path")
clean_data = pd.read_csv(data_path)

**Chi-Square Test for Independence:**

This test checks whether there is a statistically significant association between two categorical variables, that is, whether the distribution of one variable is independent of the other.
It applies when we have counts of observations for each combination of categories of two nominal variables.
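
To make the mechanics concrete, here is a minimal sketch on hypothetical toy data (the counts are invented for illustration and are not taken from the weather dataset). It builds a contingency table, runs the test, and recomputes the expected counts under independence as `row_total * column_total / n`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 100 days, rain today vs. rain tomorrow
toy = pd.DataFrame({
    "RainToday":    ["Yes"] * 30 + ["No"] * 70,
    "RainTomorrow": ["Yes"] * 20 + ["No"] * 10 + ["Yes"] * 10 + ["No"] * 60,
})

# Observed counts for each combination of categories
observed = pd.crosstab(toy["RainToday"], toy["RainTomorrow"])
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Expected counts under independence: row_total * column_total / n
row_totals = observed.sum(axis=1).to_numpy()[:, None]
col_totals = observed.sum(axis=0).to_numpy()[None, :]
manual_expected = row_totals * col_totals / observed.to_numpy().sum()

print(observed)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.4g}")
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.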

In [5]:
# target column
target = 'RainTomorrow'
# categorical feature columns (object dtype), excluding the target
cat_cols_FE = clean_data.select_dtypes(include=['object']).columns.difference([target])

### Hypotheses
- Null hypothesis (H0): The feature and `RainTomorrow` are independent (no association).
- Alternative hypothesis (H1): The feature and `RainTomorrow` are associated.

In [None]:
from scipy.stats import chi2_contingency

p_values = {}

for col in cat_cols_FE:
    # Create the contingency table
    contingency_table = pd.crosstab(clean_data[col], clean_data[target])

    # Perform Chi-Square test
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    p_values[col] = p

    print(f"\nChi-Square Test between '{col}' and '{target}'")
    print(f"Chi2 Statistic = {chi2:.2f}, p-value = {p:.10f}")
    if p < 0.05:
        print("Significant relationship (reject H0)")
    else:
        print("No significant relationship (fail to reject H0)")

Chi-Square Test between 'Location' and 'RainTomorrow'
Chi2 Statistic = 3472.46, p-value = 0.0000000000
Significant relationship (reject H0)

Chi-Square Test between 'RainToday' and 'RainTomorrow'
Chi2 Statistic = 13595.74, p-value = 0.0000000000
Significant relationship (reject H0)

Chi-Square Test between 'Season' and 'RainTomorrow'
Chi2 Statistic = 0.00, p-value = 1.0000000000
No significant relationship (fail to reject H0)

Chi-Square Test between 'WindDir3pm' and 'RainTomorrow'
Chi2 Statistic = 1206.21, p-value = 0.0000000000
Significant relationship (reject H0)

Chi-Square Test between 'WindDir9am' and 'RainTomorrow'
Chi2 Statistic = 1787.93, p-value = 0.0000000000
Significant relationship (reject H0)

Chi-Square Test between 'WindGustDir' and 'RainTomorrow'
Chi2 Statistic = 1484.60, p-value = 0.0000000000
Significant relationship (reject H0)

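All columns except `Season` have p-values below 0.05 (printed as 0.0000000000), so we reject H0 for them: each is significantly associated with `RainTomorrow`. `Season` has a chi-square statistic of 0.00 and a p-value of 1.0, so we fail to reject H0 for it.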
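
A statistic of exactly 0.00 with a p-value of exactly 1.0 is unusual. One possible explanation (an assumption, not something the notebook confirms) is that the column holds only one category, in which case the contingency table has a single row and zero degrees of freedom. The sketch below reproduces that situation on hypothetical toy data; running `nunique()` on the real column would show whether it applies here.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data in which Season has a single category
toy = pd.DataFrame({
    "Season": ["Summer"] * 6,
    "RainTomorrow": ["Yes", "No", "No", "Yes", "No", "No"],
})

table = pd.crosstab(toy["Season"], toy["RainTomorrow"])
chi2_stat, p_value, dof, _ = chi2_contingency(table)
n_levels = toy["Season"].nunique()

print(f"levels = {n_levels}, chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.1f}")
```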

## Feature Engineering

We engineered new features that summarize within-day changes, daily averages, and a rain flag, with the aim of improving predictive power.

---

### 1. TempDiff = MaxTemp − MinTemp
- **Logic:** Daily temperature range; smaller swings may coincide with rainy or cloudy days.

### 2. WindSpeedAvg = (WindSpeed9am + WindSpeed3pm) / 2
- **Logic:** Average of the 9am and 3pm wind speeds; windier conditions often precede rain.

### 3. HumidityDiff = Humidity3pm − Humidity9am
- **Logic:** Change in humidity from 9am to 3pm; humidity that stays high through the day can signal rain.

### 4. PressureDiff = Pressure3pm − Pressure9am
- **Logic:** Change in atmospheric pressure from 9am to 3pm; a negative value is a pressure drop, and low-pressure systems often bring rain.

### 5. CloudCoverAvg = (Cloud9am + Cloud3pm) / 2
- **Logic:** Average of the 9am and 3pm cloud cover; more cloud generally means a higher chance of precipitation.

### 6. RainToday = 1 if Rainfall > 0 else 0
- **Logic:** Flag for rain today; rainfall events tend to cluster from day to day.

### 7. WindGustDiff = WindGustSpeed − WindSpeedAvg
- **Logic:** How far the strongest gust exceeds the average wind speed; strong gusts can be a precursor to storms.


In [None]:
# Create engineered features
clean_data['TempDiff']      = clean_data['MaxTemp'] - clean_data['MinTemp']
clean_data['WindSpeedAvg']  = clean_data[['WindSpeed9am', 'WindSpeed3pm']].mean(axis=1)
clean_data['HumidityDiff']  = clean_data['Humidity3pm'] - clean_data['Humidity9am']
clean_data['PressureDiff']  = clean_data['Pressure3pm'] - clean_data['Pressure9am']
clean_data['CloudCoverAvg'] = clean_data[['Cloud9am', 'Cloud3pm']].mean(axis=1)
# Note: this overwrites the existing categorical RainToday column with a 0/1 flag
clean_data['RainToday']     = (clean_data['Rainfall'] > 0).astype(int)
# Depends on WindSpeedAvg, which is computed above
clean_data['WindGustDiff']  = clean_data['WindGustSpeed'] - clean_data['WindSpeedAvg']


**Observation:**  
Average wind speeds (`WindSpeedAvg`) are generally higher on days before rain, suggesting wind patterns can signal incoming systems.

**Observation:**  
The change in humidity (`HumidityDiff`) is smaller on rainy days, reflecting sustained high moisture levels throughout the day.
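
Observations like these can be checked by comparing group means of the engineered features. The sketch below shows the pattern on hypothetical toy values (the numbers are invented); on the real data, the same `groupby` call on `clean_data` would show whether the observations hold.

```python
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the comparison
toy = pd.DataFrame({
    "WindSpeedAvg": [10.0, 12.0, 20.0, 22.0],
    "HumidityDiff": [-30.0, -26.0, -8.0, -4.0],
    "RainTomorrow": ["No", "No", "Yes", "Yes"],
})

# Mean of each engineered feature within each target group
group_means = toy.groupby("RainTomorrow")[["WindSpeedAvg", "HumidityDiff"]].mean()
print(group_means)
```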


## ANOVA Test on Numerical Features

- Why use ANOVA?

ANOVA (Analysis of Variance) tests whether the means of two or more independent groups differ by more than chance would explain. Here we use it to check whether the mean of each numeric feature differs significantly between the "RainTomorrow = Yes" and "RainTomorrow = No" groups. With only two groups, one-way ANOVA is equivalent to an equal-variance two-sample t-test (F = t²).
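
For the two-group case used here, the relationship between one-way ANOVA and the equal-variance two-sample t-test can be checked numerically. A minimal sketch on synthetic data (the group sizes, means, and seed are arbitrary assumptions):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Synthetic stand-ins for one numeric feature split by the target
rng = np.random.default_rng(0)
group_yes = rng.normal(loc=70.0, scale=10.0, size=200)
group_no = rng.normal(loc=50.0, scale=10.0, size=300)

f_stat, p_anova = f_oneway(group_yes, group_no)
t_stat, p_ttest = ttest_ind(group_yes, group_no, equal_var=True)

# With two groups, the F statistic equals the squared t statistic
print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```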


In [None]:
from scipy.stats import f_oneway  # Import the ANOVA test from scipy
# Select all numeric columns from the DataFrame
numeric_cols = clean_data.select_dtypes(include='number').columns

# Dictionary to store p-values for each numeric feature
p_values = {}

# Loop through each numeric column to perform ANOVA
for col in numeric_cols:
    # Split the column into two groups based on 'RainTomorrow' (Yes / No), dropping missing values
    group_yes = clean_data[clean_data['RainTomorrow'] == 'Yes'][col].dropna()
    group_no = clean_data[clean_data['RainTomorrow'] == 'No'][col].dropna()

    # Perform one-way ANOVA test between the two groups
    stat, p = f_oneway(group_yes, group_no)

    # Store the p-value in the dictionary
    p_values[col] = p

# Convert the p-values dictionary to a DataFrame and sort by p-value (ascending)
pval_df = pd.DataFrame.from_dict(p_values, orient='index', columns=['p_value']).sort_values('p_value')

# Print a label, then display the first five rows (the five smallest p-values)
print("All p-values sorted (ascending):")
pval_df.head()

All p-values sorted (ascending):

## P-Values Summary from ANOVA Test



The table below shows the top numeric features ranked by their p-values from the ANOVA test comparing "RainTomorrow = Yes" vs "RainTomorrow = No":

| Feature         | p-value |
|-----------------|---------|
| MaxTemp         | 0.0     |
| Rainfall        | 0.0     |
| WindGustSpeed   | 0.0     |
| Sunshine        | 0.0     |
| Pressure3pm     | 0.0     |

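These p-values are displayed as 0.0, meaning they are too small to distinguish from zero at the precision shown. The differences in the means of these features between the two RainTomorrow groups are therefore **statistically significant**. With a dataset this large, however, even small differences in means can reach significance, so a tiny p-value does not by itself indicate a large effect.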


## Final Comment on Important Features:

Based on the ANOVA test results, the features **MaxTemp**, **Rainfall**, **WindGustSpeed**, **Sunshine**, and **Pressure3pm** showed statistically significant differences between the "RainTomorrow = Yes" and "RainTomorrow = No" groups (p-value ≈ 0). Only the five smallest p-values are shown, so other numeric features may be significant as well.

These features are promising candidates for predicting whether it will rain tomorrow. Statistical significance alone does not measure predictive strength, so they should still be validated during feature selection and modeling.


# Feature Engineering and Selection Process

In [None]:
# Separate the features from the target
X = clean_data.drop(columns='RainTomorrow')
y = clean_data['RainTomorrow']

In [None]:
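from sklearn.preprocessing import LabelEncoder

# Label-encode every column, numeric ones included: each distinct value
# is mapped to a non-negative integer code, as chi2 requires
X_encoded = X.copy()
for col in X_encoded.columns:
    X_encoded[col] = LabelEncoder().fit_transform(X_encoded[col])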


In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Score every feature in X_encoded against y; k='all' keeps all features
chi2_selector = SelectKBest(score_func=chi2, k='all')
chi2_selector.fit(X_encoded, y)

feature_names = X_encoded.columns
scores = chi2_selector.scores_

# Print the features from highest to lowest score
sorted_indices = np.argsort(scores)[::-1]
for i in sorted_indices:
    print(f"{feature_names[i]}: {scores[i]:.2f}")


In [None]:
# Assuming 'X_encoded' is the feature matrix and 'chi2_selector' is the fitted SelectKBest object
feature_names = X_encoded.columns

# Scores from chi2_selector
scores = chi2_selector.scores_

# Sorting indices in descending order based on the scores
sorted_indices = np.argsort(scores)[::-1]

# Sorting features and their scores
sorted_feature_names = feature_names[sorted_indices]
sorted_scores = scores[sorted_indices]

# Plotting the bar chart
plt.figure(figsize=(12, 8))
plt.barh(sorted_feature_names, sorted_scores, color='skyblue')
plt.xlabel('Chi-Square Scores')
plt.title('Feature Importance based on Chi-Square Test')
plt.gca().invert_yaxis()  # To show the most important feature at the top
plt.show()




## Feature Selection Using Chi-Square Test

### Objective

Before model training, we aimed to identify which categorical and numerical features have the strongest statistical relationship with our target variable, **RainTomorrow**. The goal of this **feature selection** step was to reduce dimensionality and to improve model performance and interpretability.

---
### Methodology

We used the **Chi-Square test (`chi2`)** with `sklearn.feature_selection.SelectKBest` to score each feature by its association with the target variable. A higher score indicates a stronger measured association, but scores are not directly comparable across features on different scales, because `chi2` treats feature values as non-negative counts and its score grows with their magnitude.

Before applying the test:

- Every column, numeric ones included, was label-encoded with `LabelEncoder`, which maps each distinct value to an integer code.
- The resulting codes are non-negative, as the Chi-Square test requires.
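
One property of `sklearn.feature_selection.chi2` is worth keeping in mind when reading the scores: it treats feature values as counts, so multiplying a non-negative feature by a constant multiplies its score by the same constant. The sketch below illustrates this on synthetic data (the feature, target, and seed are assumptions made for the example). A column with many large integer codes can therefore score far higher than a low-cardinality column carrying similar information.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Synthetic binary target and a non-negative feature loosely related to it
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
x = rng.integers(0, 5, size=500) + y

# Second column is the same feature on a 10x larger scale
X = np.column_stack([x, 10 * x])
scores, p_values = chi2(X, y)

print(f"score(x) = {scores[0]:.3f}, score(10x) = {scores[1]:.3f}")
```

For continuous features, a scale-free alternative such as `f_classif` may be a better fit, with `chi2` reserved for count-like or one-hot encoded columns.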

---

### Results

Here are the Chi-Square scores for the ten highest-scoring features:

| Feature         | Chi2 Score        |
|----------------|-------------------|
| TempDiff        | 105,086,227.56    |
| WindGustDiff    | 23,303,550.23     |
| Rainfall        | 15,857,153.72     |
| PressureDiff    | 8,396,806.55      |
| Sunshine        | 5,189,527.32      |
| Pressure9am     | 4,252,383.59      |
| Pressure3pm     | 3,755,461.03      |
| Humidity3pm     | 3,082,988.01      |
| Temp3pm         | 1,182,737.16      |
| HumidityDiff    | 1,146,961.35      |

These features had the highest chi-square scores and will be prioritized in our model training phase. The scores are likely inflated for continuous columns, which label encoding turns into many large integer codes, so they are best read as a rough ranking.

The lowest-scoring features were:

| Feature         | Chi2 Score        |
|----------------|-------------------|
| Location        | 37.41             |
| year            | 32.47             |
| day             | 22.91             |
| month           | 18.63             |

These features had the lowest scores and may be excluded or analyzed further.

---

### Conclusion

- **TempDiff**, **WindGustDiff**, and **Rainfall** have the highest chi-square scores for rain on the following day.
- Features with low scores (like `Location`, `year`, `day`) may be less useful for prediction and could be dropped or encoded differently. `Location` nevertheless ranks 8th in the XGBoost importances below, so low-scoring features deserve a second look before being dropped.
- This analysis can help reduce noise and improve model efficiency by focusing on the highest-scoring variables.

Next step: Feed the selected features into our classification models.


###  Feature Importance (from XGBoost)

| Rank | Feature         | Importance |
|------|------------------|------------|
| 1    | Humidity3pm      | 0.3097     |
| 2    | Cloud3pm         | 0.0849     |
| 3    | Sunshine         | 0.0562     |
| 4    | WindGustDiff     | 0.0506     |
| 5    | Pressure3pm      | 0.0499     |
| 6    | Rainfall         | 0.0450     |
| 7    | WindGustSpeed    | 0.0413     |
| 8    | Location         | 0.0216     |
| 9    | Temp9am          | 0.0193     |
| 10   | PressureDiff     | 0.0193     |
| ...  | ...              | ...        |
| 30   | Cloud9am         | 0.0106     |

> `Humidity3pm` was the most influential feature for rain prediction.

---

###  Top Features by Univariate Selection (SelectKBest)

| Rank | Feature         | Score        |
|------|------------------|--------------|
| 1    | Humidity3pm      | 36829.35     |
| 2    | Sunshine         | 26400.96     |
| 3    | Cloud3pm         | 24651.29     |
| 4    | CloudCoverAvg    | 24021.69     |
| 5    | Rainfall         | 18300.56     |
| 6    | TempDiff         | 16788.64     |
| 7    | Cloud9am         | 16335.03     |
| 8    | RainToday        | 12817.98     |
| 9    | Humidity9am      | 10821.16     |
| 10   | HumidityDiff     | 10454.75     |

> Features related to humidity and cloud cover scored highest, which is consistent with the physical drivers of rain.

---
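
The agreement between the two rankings above can be checked directly. A minimal sketch, with the feature names copied from the XGBoost and SelectKBest tables (the variable names are illustrative):

```python
# Top-10 feature names copied from the two tables above
xgb_top10 = [
    "Humidity3pm", "Cloud3pm", "Sunshine", "WindGustDiff", "Pressure3pm",
    "Rainfall", "WindGustSpeed", "Location", "Temp9am", "PressureDiff",
]
kbest_top10 = [
    "Humidity3pm", "Sunshine", "Cloud3pm", "CloudCoverAvg", "Rainfall",
    "TempDiff", "Cloud9am", "RainToday", "Humidity9am", "HumidityDiff",
]

# Features in both lists, kept in XGBoost rank order
kbest_set = set(kbest_top10)
shared = [f for f in xgb_top10 if f in kbest_set]
only_xgb = [f for f in xgb_top10 if f not in kbest_set]

print("In both top-10 lists:", shared)
print("Only in the XGBoost top 10:", only_xgb)
```

Four features (`Humidity3pm`, `Cloud3pm`, `Sunshine`, `Rainfall`) appear in both top-10 lists, which makes them the most consistently ranked candidates across the two methods.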
