# Weather Data Classification using scikit-learn

<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Importing the Necessary Libraries<br></p>

First we import the Python libraries needed to demonstrate the Decision Tree Classifier.

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Read the weather data from the CSV file using the pandas `read_csv` function.

In [None]:
data = pd.read_csv('./daily_weather.csv')

<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold">Daily Weather Data Description</p>
<br>
The file **daily_weather.csv** is a comma-separated file that contains weather data.  This data comes from a weather station located in San Diego, California.  The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity.  Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.<br><br>
Let's now list all the columns in the dataset.

In [None]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm', 'Unnamed: 11'],
      dtype='object')

In [None]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm,Unnamed: 11
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16,
1,1,917.347688,71.403843,101.935179,2.443009,140.471549,3.533324,0.0,0.0,24.328697,19.426597,
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46,
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547,
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74,


Check whether the dataset contains null values by showing the first rows that have at least one.

In [None]:
data[data.isnull().any(axis=1)].head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm,Unnamed: 11
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16,
1,1,917.347688,71.403843,101.935179,2.443009,140.471549,3.533324,0.0,0.0,24.328697,19.426597,
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46,
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547,
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74,
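To see which columns are responsible for the nulls, it can help to count missing values per column. The following is a minimal sketch on a hypothetical three-row frame (the values and the `sample` name are illustrative assumptions, not the real contents of `daily_weather.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the weather data:
# 'Unnamed: 11' is empty in every row, 'air_temp_9am' has one gap.
sample = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6],
    "relative_humidity_3pm": [36.2, 19.4, 14.5],
    "Unnamed: 11": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Columns in which every value is missing
all_null_columns = null_counts[null_counts == len(sample)].index.tolist()
print(all_null_columns)
```

Running the same two calls on `data` would show whether a single column accounts for most of the missing values.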


<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Data Cleaning Steps<br><br></p>

We do not need the `number` column: it holds a unique value for each row, so it cannot help the classifier make any decision. We delete it.

In [None]:
del data['number']

Count the rows in the dataset before removing the rows that contain null values.

In [None]:
before_rows = data.shape[0]
print(before_rows)

1095


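Remove the rows that contain null values. Note that `dropna()` drops a row if any of its columns is null, so a column that is empty in every row (as `Unnamed: 11` appears to be in the preview above) causes every row to be removed.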

In [None]:
data = data.dropna()

Count the rows in the dataset after removing the rows that contain null values.

In [None]:
after_rows = data.shape[0]
print(after_rows)

0


Calculate how many rows were dropped because they contained null values.

In [None]:
before_rows - after_rows

1095
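Losing all 1095 rows suggests that every row has at least one null, most likely in the `Unnamed: 11` column. One possible remedy, sketched here on a hypothetical four-row frame (the `sample` and `cleaned` names and the values are assumptions for illustration), is to drop columns that are entirely empty before dropping incomplete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather data
sample = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "relative_humidity_3pm": [36.2, 19.4, 14.5, 12.7],
    "Unnamed: 11": [np.nan] * 4,
})

# Plain dropna() removes every row, because each row has a NaN in 'Unnamed: 11'
rows_after_plain_dropna = len(sample.dropna())
print(rows_after_plain_dropna)

# Drop columns that are entirely empty first, then drop incomplete rows
cleaned = sample.dropna(axis=1, how="all").dropna()
print(len(cleaned))
print(list(cleaned.columns))
```

Applied to the notebook, the equivalent would be `data = data.dropna(axis=1, how='all').dropna()`, assuming the empty column is the only reason all rows disappear.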

Create a binary label that is 1 when the relative humidity at 3pm is above 24.99 and 0 otherwise.

In [None]:
clean_data = data.copy()
# The comparison yields booleans; multiplying by 1 converts them to 0/1 integers
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99) * 1
clean_data['high_humidity_label'].head()

Series([], Name: high_humidity_label, dtype: int64)
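Because the Series above is empty, the effect of the threshold is not visible. A small sketch with illustrative humidity values (the `humidity_3pm` Series is a made-up example, not the dataset) shows how the comparison behaves, including at the boundary value itself:

```python
import pandas as pd

# Illustrative 3pm humidity readings, including the threshold value 24.99
humidity_3pm = pd.Series([36.16, 19.43, 14.46, 24.99, 76.74],
                         name="relative_humidity_3pm")

# Strictly greater than 24.99 gives True; multiplying by 1 turns True/False into 1/0
high_humidity_label = (humidity_3pm > 24.99) * 1
print(high_humidity_label.tolist())
```

A reading of exactly 24.99 is labeled 0, since the comparison is strict.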

In [None]:
y = clean_data[['high_humidity_label']].copy()
y.head()

Unnamed: 0,high_humidity_label


In [None]:
clean_data['relative_humidity_3pm'].head()

Series([], Name: relative_humidity_3pm, dtype: float64)

<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Use 9am Sensor Signals as Features to Predict Humidity at 3pm
<br><br></p>


Store the names of the 9am sensor measurements in `morning_features`. The target, `relative_humidity_3pm`, is left out.

In [None]:
morning_features = ['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am']

Copy the morning feature columns from `clean_data` into a new DataFrame `X`.

In [None]:
X = clean_data[morning_features].copy()
X.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am'],
      dtype='object')

In [None]:
y.columns

Index(['high_humidity_label'], dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Perform Test and Train split

<br><br></p>


`train_test_split` divides the data into a training set and a test set. It cannot split an empty dataset, so we first inspect the inputs.

In [None]:
print(clean_data.shape)  # Shape of clean_data as (rows, columns)
print(morning_features)  # Feature names used to build the feature matrix
print(clean_data[morning_features].head())  # First few rows of the feature columns
print(y.head())  # First few rows of the label

# A row count of 0 here means the rows were lost upstream, in the cleaning steps
# (for example, dropna() removes any row that has a null in any column).

(0, 12)
['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am']
Empty DataFrame
Columns: [air_pressure_9am, air_temp_9am, avg_wind_direction_9am, avg_wind_speed_9am, max_wind_direction_9am, max_wind_speed_9am, rain_accumulation_9am, rain_duration_9am, relative_humidity_9am]
Index: []
Empty DataFrame
Columns: [high_humidity_label]
Index: []
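Once the data is non-empty, the split itself is a single call. The sketch below uses randomly generated stand-in data with two of the feature names; the `X_demo` and `y_demo` names, the 100-row size, `test_size=0.33`, and `random_state=0` are all assumptions chosen for illustration, not values taken from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned weather features (not real measurements)
rng = np.random.default_rng(0)
n_rows = 100
X_demo = pd.DataFrame({
    "air_pressure_9am": rng.normal(918, 3, n_rows),
    "relative_humidity_9am": rng.uniform(5, 95, n_rows),
})
# Synthetic label, built only so that the example has something to split
y_demo = pd.DataFrame(
    {"high_humidity_label": (X_demo["relative_humidity_9am"] > 50) * 1}
)

# Hold out roughly a third of the rows for testing; a fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Features and labels are split with the same row assignment, so each row of `X_train` still lines up with its label in `y_train`.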


<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Fit on Train Set
<br><br></p>


Next we create a decision tree classifier and fit it on the training data. The cell below first confirms the shapes of `clean_data` and `X`.

In [None]:
# Before calling train_test_split, check the shape of the DataFrames:
print(clean_data.shape)
print(X.shape)

# If they are empty, investigate the steps leading to the creation of 'clean_data'.
# Ensure that data is being loaded and processed correctly.
# You might need to review the code that generates 'clean_data'.

(0, 12)
(0, 9)


In [None]:
# 'humidity_classifier' is created and fitted in a later cell; run that cell first.
type(humidity_classifier)

<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Predict on Test Set

<br><br></p>


Using `humidity_classifier`, we predict labels for `X_test` and store them in `y_predicted`. Because the weather data is empty at this point, the cells below use small placeholder arrays instead.

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder data: these arrays overwrite X and y from the weather dataset.
# Replace them with the real features and labels once the data loads correctly.
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([0, 1, 1])

# Fail early if there is nothing to split
if X.size == 0 or y.size == 0:
    raise ValueError("The dataset is empty. Please check your data loading process.")

# Hold out 20% of the samples as the test set (rounded up to one sample here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set:", X_train, y_train)
print("Test set:", X_test, y_test)


Training set: [[3 4]
 [5 6]] [1 1]
Test set: [[1 2]] [0]


In [None]:
# 'humidity_classifier' must already be fitted; it is created and fitted in the next cell,
# so run that cell first to avoid a NameError.
y_predicted = humidity_classifier.predict(X_test)

# Print up to the first 10 predictions
print(y_predicted[:10])

[0]


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.validation import check_is_fitted
import numpy as np

# Example data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([0, 1, 0, 1, 0])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifier
humidity_classifier = DecisionTreeClassifier()

# Fit the classifier on the training data
humidity_classifier.fit(X_train, y_train)

# Check if the model is fitted before prediction
check_is_fitted(humidity_classifier)

# Make predictions on the test data
y_predicted = humidity_classifier.predict(X_test)

# Print the predictions
print("Predictions:", y_predicted)


Predictions: [0]


<p style="font-family: Arial; font-size:1.75em;color:blue; font-style:bold"><br>

Measure Accuracy of the Classifier
<br><br></p>


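We check the accuracy of the model with the `accuracy_score` function from `sklearn.metrics`. With the five-sample example data used above, the test set holds a single sample, so the score can only be 0% or 100%. The run shown here gives 0%, which says nothing about how the classifier would perform on the weather data.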

In [None]:
accuracy_score(y_test,y_predicted)*100

0.0
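For reference, `accuracy_score` returns the fraction of predictions that match the true labels. The short sketch below, using made-up label lists (`y_true` and `y_pred` are illustrative, not outputs of this notebook), compares a manual count with the library result:

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Manual computation: matching positions divided by the total number of samples
matches = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = matches / len(y_true)

# Library computation
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy * 100, library_accuracy * 100)
```

Six of the eight predictions match, so both computations give 75%.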