# PROBLEM STATEMENT 

The dataset contains daily weather observations from Australian weather stations. The goal is to predict whether it will rain tomorrow based on today's observations. The target variable is "RainTomorrow", a binary variable with two possible values: "Yes" or "No".

"RainTomorrow" is "Yes" if the rainfall recorded for the following day is 1 mm or more, meaning it rained that day. If the recorded rainfall for the following day is less than 1 mm, "RainTomorrow" is "No".

- The aim is to predict whether it will rain in Australia the next day: Yes or No.
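
To make the target definition concrete, here is a minimal sketch of how such a label could be derived from daily rainfall for a single station. The rainfall values and the `obs` frame are made up for illustration; the real dataset already ships with the label, so this is not the procedure used to build it.

```python
import pandas as pd

# Hypothetical rainfall readings (mm) for one station on consecutive days.
obs = pd.DataFrame({
    "Date": pd.date_range("2010-01-01", periods=5, freq="D"),
    "Rainfall": [0.0, 3.2, 0.4, 1.0, 0.0],
})

# Look one row ahead: the rainfall recorded on the following day.
next_day_rain = obs["Rainfall"].shift(-1)

# "Yes" if the following day recorded 1 mm or more, otherwise "No".
obs["RainTomorrow"] = (next_day_rain >= 1.0).map({True: "Yes", False: "No"})

# The last row has no following day, so its label is unknown.
obs.loc[next_day_rain.isna(), "RainTomorrow"] = None

print(obs)
```

With several stations in one frame, the shift would need to be applied per location (for example with `groupby("Location")`) so that labels never cross from one station to another.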

# OVERVIEW 

The dataset weatherAUS.csv contains 40000 rows and 23 columns of weather observations in Australia.

Note: the data used in this notebook has already been preprocessed (including encoding and scaling), so the column names in the loaded file differ from the original names listed here.

Here's an explanation of each original column:

- Date: The date of the observation.
- Location: The location where the weather data was recorded.
- MinTemp: The minimum temperature in degrees Celsius.
- MaxTemp: The maximum temperature in degrees Celsius.
- Rainfall: The amount of rainfall recorded for the day in millimeters.
- Evaporation: The so-called Class A pan evaporation (in millimeters) in the 24 hours to 9am.
- Sunshine: The number of hours of bright sunshine in the day.
- WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight.
- WindGustSpeed: The speed (in km/h) of the strongest wind gust in the 24 hours to midnight.
- WindDir9am: Direction of the wind at 9am.
- WindDir3pm: Direction of the wind at 3pm.
- WindSpeed9am: Wind speed (in km/h) averaged over 10 minutes prior to 9am.
- WindSpeed3pm: Wind speed (in km/h) averaged over 10 minutes prior to 3pm.
- Humidity9am: Humidity (percent) at 9am.
- Humidity3pm: Humidity (percent) at 3pm.
- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am.
- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm.
- Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas," which are a unit of eighths. It records how many eighths of the sky are obscured by cloud.
- Cloud3pm: Fraction of sky obscured by cloud at 3pm. Similar to Cloud9am.
- Temp9am: Temperature (degrees Celsius) at 9am.
- Temp3pm: Temperature (degrees Celsius) at 3pm.
- RainToday: Indicates if it has rained. Yes if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise No.
- RainTomorrow: The target variable. Indicates if it will rain tomorrow. Yes or No.

These columns provide comprehensive information about daily weather conditions, useful for various analyses including weather forecasting, climate studies, and understanding local weather patterns.

- There is a date variable, given by the Date column.
- There are 6 categorical variables: Location, WindGustDir, WindDir9am, WindDir3pm, RainToday and RainTomorrow.
- Two of the categorical variables are binary: RainToday and RainTomorrow.
- RainTomorrow is the target variable.
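
As a quick illustration of how the variable types above could be checked programmatically, here is a small sketch on a hand-made frame. The `sample` data is hypothetical and only mimics a few of the original (unencoded) columns; it is not the file loaded later in this notebook.

```python
import pandas as pd

# Hypothetical rows shaped like the original, unencoded dataset.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03"]),
    "Location": ["Albany", "Perth", "Uluru"],
    "WindGustDir": ["N", "SW", "E"],
    "MinTemp": [12.1, 15.3, 9.8],
    "RainToday": ["No", "Yes", "No"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# Categorical columns: everything that is neither numeric nor a datetime.
categorical_cols = sample.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

# Binary categorical columns: exactly two distinct values.
binary_cols = [col for col in categorical_cols if sample[col].nunique() == 2]

print(categorical_cols)
print(binary_cols)
```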


## ------------------------------------------------------------------------------------

## Guidelines to follow in this notebook 
- The name of the main dataframe should be df 
- Keep the seed value 42
- Names of training and testing variables should be X_train, X_test, y_train, y_test
- Keep the name of model instance as "model", e.g. model = DecisionTreeClassifier()
- Keep the predictions on training and testing data in a variable named y_train_pred and y_test_pred respectively.

## ------------------------------------------------------------------------------------

## Import Libraries 
#### Let's begin by importing the necessary libraries

In [1]:
#Importing necessary libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
#from sklearn.neighbors import KNeighborsClassifier

### Load the dataset

In [2]:
# Load the dataset and preview the first five rows
df = pd.read_csv('Rain_in_aus_data_scaled_with_feature_extraction.csv')
df.head()

Unnamed: 0,Location_Penrith,WindSpeedDiff,Location_Bendigo,Location_Watsonia,Location_Uluru,Location_Perth,Location_Nuriootpa,WindDir3pm_NNE,Location_Albany,WindDir3pm_NNW,...,WindSpeed3pm.1,Humidity9am,Humidity9am.1,MaxTemp,MaxTemp.1,WindDir3pm_NW,MaxTemp.2,MaxTemp.3,Evaporation,Evaporation.1
0,False,37.5,False,False,False,False,False,False,False,False,...,22.0,1.510122,100.0,-1.648527,10.8,False,-1.648527,10.8,-0.591534,3.4
1,False,11.0,False,False,False,False,False,False,False,False,...,7.0,0.426211,80.0,-0.99932,15.4,False,-0.99932,15.4,-1.343942,2.0
2,False,14.0,False,False,False,False,False,False,False,False,...,11.0,1.076558,92.0,-0.124303,21.6,False,-0.124303,21.6,-0.37656,3.8
3,False,32.0,False,False,False,False,False,False,False,False,...,31.0,0.913971,89.0,-1.225131,13.8,False,-1.225131,13.8,-0.054099,4.4
4,False,26.0,False,False,False,False,False,False,False,False,...,26.0,-0.386722,65.0,0.708376,27.5,False,0.708376,27.5,-0.054099,4.4


In [3]:
# Displaying the columns in the dataset
df.columns

Index(['Location_Penrith', 'WindSpeedDiff', 'Location_Bendigo',
       'Location_Watsonia', 'Location_Uluru', 'Location_Perth',
       'Location_Nuriootpa', 'WindDir3pm_NNE', 'Location_Albany',
       'WindDir3pm_NNW', 'Location_PearceRAAF', 'WindDir3pm_ESE',
       'RainTomorrow', 'Month', 'WindSpeed3pm', 'WindSpeed3pm.1',
       'Humidity9am', 'Humidity9am.1', 'MaxTemp', 'MaxTemp.1', 'WindDir3pm_NW',
       'MaxTemp.2', 'MaxTemp.3', 'Evaporation', 'Evaporation.1'],
      dtype='object')
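
The column names above look different from the original list because of preprocessing. As a rough sketch of why, assuming the preprocessing used one-hot encoding and standard scaling (the notebook only says "encoding and scaling", so this is a guess at the exact steps), the toy example below shows how a `Location` column turns into `Location_<name>` indicator columns and how a temperature column is rescaled. The `raw` frame and the `MaxTemp_scaled` name are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw rows; values are for illustration only.
raw = pd.DataFrame({
    "Location": ["Penrith", "Perth", "Albany", "Perth"],
    "MaxTemp": [10.8, 15.4, 21.6, 27.5],
})

# One-hot encoding turns each Location value into its own indicator column.
encoded = pd.get_dummies(raw, columns=["Location"])

# Standard scaling rescales a numeric column to zero mean and unit variance.
encoded["MaxTemp_scaled"] = StandardScaler().fit_transform(encoded[["MaxTemp"]]).ravel()

print(encoded.columns.tolist())
```

Suffixes such as `.1` and `.2` (for example `MaxTemp.1`) are what `pd.read_csv` appends when a CSV file contains the same column name more than once.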

## Separate the independent and dependent features in 'X' and 'y' respectively

In [4]:
# your code here
# raise NotImplementedError
X = df.drop(columns='RainTomorrow')  # 'columns=' already selects the axis, so axis=1 is not needed
y = df['RainTomorrow']

# Splitting the data

In [5]:
#splitting of df to training and testing with 0.25 as Test size 
# your code here
# raise NotImplementedError
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state = 42)
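
A small sanity check of what `test_size=0.25` and a fixed `random_state` do, shown on synthetic data so it runs without the notebook's CSV. The `X_demo`/`y_demo` names and the data from `make_classification` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=42)

# 25% of the rows go to the test set, the remaining 75% to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
print(np.array_equal(X_te, X_te2))
```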

In [6]:
from sklearn.model_selection import cross_val_predict
import numpy as np

# Create base estimators

In [7]:
# Define base estimators
# Create 2 estimators with the variable names rf and gb
# Keep seed 42 in all cases
# 1. Random forest classifier with depth 20, 61 trees, min leaf 10, named 'rf'
# 2. Gradient boosting classifier with 400 trees, learning rate 0.3 and depth 10, named 'gb'

# your code here
# raise NotImplementedError
estimators = [
    ('rf', RandomForestClassifier(max_depth=20, min_samples_leaf=10, n_estimators=61, random_state=42, n_jobs=-1)),
    ('gb', GradientBoostingClassifier(n_estimators=400, learning_rate=0.3, max_depth=10, random_state=42))
]


The next step is to perform cross-validation and collect the predictions made on each held-out fold of the training data. We store the results in a NumPy array, which is why numpy is imported as well.

In [8]:
# Generate out-of-fold predictions for each base model using cross-validation on training data.
# Since it's a binary classification, store the probability of class 1 (not class 0) and stack them.

import numpy as np
# your code here
# raise NotImplementedError

X_train_meta = []
for name, estimator in estimators:
    # predict_proba returns one column per class; each row is predicted by a model
    # that did not see that row during fitting
    predictions = cross_val_predict(estimator, X_train, y_train, cv=5, method="predict_proba")
    X_train_meta.append(predictions[:, 1])  # Keep only the probability of class 1

# Convert to a 2D array with one column per base model
X_train_meta = np.column_stack(X_train_meta)
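
To show what "out-of-fold" means, the sketch below reproduces `cross_val_predict` by hand on synthetic data: each row's probability comes from a model fitted on the other folds. The data, the `LogisticRegression` stand-in model (chosen only because it is fast) and the variable names are assumptions for illustration, not the notebook's base models.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data and a quick stand-in base model.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# Manual version: fit on four folds, predict the held-out fold, repeat.
oof = np.zeros(len(y_demo))
for fit_idx, holdout_idx in cv.split(X_demo, y_demo):
    fold_model = clone(clf).fit(X_demo[fit_idx], y_demo[fit_idx])
    oof[holdout_idx] = fold_model.predict_proba(X_demo[holdout_idx])[:, 1]

# Same thing with cross_val_predict, using the identical fold definition.
oof_sklearn = cross_val_predict(clf, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]

print(np.allclose(oof, oof_sklearn))
```

Because no row is ever predicted by a model that was trained on it, these probabilities are a fairer input for the meta-model than predictions made on data the base models have already seen.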

In [9]:
# Stack the predictions column-wise into X_train_meta
# (already done at the end of the previous cell with np.column_stack)

# your code here
# raise NotImplementedError

In [10]:

# Define base estimators and keep the name of the variable as estimators, a list of the previously used classifiers
# Create a list of tuples with the 2 classifiers: 1. random forest 2. gradient boosting
# example: [("<nameofmodel>", model)]

estimators = [
    ('rf', RandomForestClassifier(max_depth=20, min_samples_leaf=10, n_estimators=61, random_state=42, n_jobs=-1)),
    ('gb', GradientBoostingClassifier(n_estimators=400, learning_rate=0.3, max_depth=10, random_state=42))
]


In [11]:
# Function to generate meta features
def generate_meta_features(estimators, X_fit, y_fit, X):
    """Fit each base model on (X_fit, y_fit) and return its class-1 probabilities for X."""
    meta_features = []
    for name, estimator in estimators:
        estimator.fit(X_fit, y_fit)  # Train each model on the full training set
        predictions = estimator.predict_proba(X)
        meta_features.append(predictions[:, 1])  # Keep only the probability of class 1
    return np.column_stack(meta_features)

# Generate meta features for the test set
X_test_meta = generate_meta_features(estimators, X_train, y_train, X_test)

Here we have performed cross-validation on multiple base models to generate out-of-fold probability predictions for each instance in the training dataset. These predictions are then compiled into a new feature set called meta-features, which will be used for training a meta-model in a stacking ensemble, leveraging the combined strengths of each base model.

We have now created a meta-features dataset that will be used to train the meta-model. The code is shown first, followed by an explanation of what it does.

Let's go ahead and train the meta-model. In this case we use a logistic regression model.

## Make final predictions using the meta model (Logistic Regression)

In [12]:
from sklearn.linear_model import LogisticRegression

# Train meta-model (Logistic Regression) on the out-of-fold predictions
meta_model = LogisticRegression()
meta_model.fit(X_train_meta, y_train)
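
For comparison, scikit-learn ships a built-in `StackingClassifier` that performs similar steps in one object: out-of-fold predictions from the base models, then a final estimator trained on them. The sketch below is not the approach used in this notebook; it runs on synthetic data with deliberately small, hypothetical hyperparameters so it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=25, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # meta-model
    cv=5,                                  # folds used for the out-of-fold predictions
    stack_method="predict_proba",          # feed class probabilities to the meta-model
)
stack.fit(X_tr, y_tr)

# transform() exposes the meta-features: one probability column per base model.
print(stack.transform(X_te).shape)
print(f"F1 on held-out demo data: {f1_score(y_te, stack.predict(X_te)):.3f}")
```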

In [13]:
from sklearn.metrics import f1_score
y_train_pred = meta_model.predict(X_train_meta)

# Calculate and print F1 score for the training data
f1 = f1_score(y_train, y_train_pred)
print(f"F1 Score for train Data: {f1}")


F1 Score for train Data: 0.6970153753391619


With that, let's go ahead and make predictions on the test data and then take a look at the F1 score.

In [14]:
from sklearn.metrics import f1_score

# Predictions on the test data using the meta-model
y_test_pred = meta_model.predict(X_test_meta)

# Calculate and print F1 score for the test data
f1 = f1_score(y_test, y_test_pred)
print(f"F1 Score for Test Data: {f1}")

F1 Score for Test Data: 0.7090094574415131
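
As a reminder of what the F1 score measures, the sketch below computes it by hand from precision and recall on a tiny made-up set of labels and checks the result against `f1_score`. The `y_true` and `y_pred` arrays are hypothetical and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = rain tomorrow, 0 = no rain tomorrow.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # rainy days correctly predicted
fp = np.sum((y_true == 0) & (y_pred == 1))  # dry days predicted as rainy
fn = np.sum((y_true == 1) & (y_pred == 0))  # rainy days that were missed

precision = tp / (tp + fp)  # share of predicted rainy days that were rainy
recall = tp / (tp + fn)     # share of rainy days that were predicted
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```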
