
load dataset

In [None]:
import pandas as pd
data = pd.read_csv('weather_forecast_data.csv')
print(data.head())

   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    19.096119  71.651723   14.782324    48.699257   987.954760  no rain
1    27.112464  84.183705   13.289986    10.375646  1035.430870  no rain
2    20.433329  42.290424    7.216295     6.673307  1033.628086  no rain
3    19.576659  40.679280    4.568833    55.026758  1038.832300  no rain
4    19.828060  93.353211    0.104489    30.687566  1009.423717  no rain


preprocessing

First, count the missing values in each column.

In [None]:
missing_data = data.isnull().sum()
print("Missing values in each column:")
print(missing_data)

Missing values in each column:
Temperature    25
Humidity       40
Wind_Speed     32
Cloud_Cover    33
Pressure       27
Rain            0
dtype: int64
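Raw counts are easier to judge relative to the dataset size. A minimal sketch of that idea, using a small made-up frame (`toy` and its values are hypothetical, not the weather data): taking the mean of `isnull()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data
toy = pd.DataFrame({
    "Temperature": [19.1, np.nan, 20.4, 19.6],
    "Humidity": [71.7, 84.2, np.nan, np.nan],
    "Rain": ["no rain", "rain", "no rain", "no rain"],
})

# isnull() yields booleans; their column mean is the fraction of missing values
missing_pct = toy.isnull().mean() * 100
print("Percentage of missing values in each column:")
print(missing_pct)
```

The same expression should work on `data` directly, assuming it is loaded as in the cell above.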


Next, handle the missing data.
First technique: drop every row that contains a missing value.

In [None]:
dropped_data = data.dropna()
print("Data after dropping missing values:")
print(dropped_data.head())

Data after dropping missing values:
   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    19.096119  71.651723   14.782324    48.699257   987.954760  no rain
1    27.112464  84.183705   13.289986    10.375646  1035.430870  no rain
2    20.433329  42.290424    7.216295     6.673307  1033.628086  no rain
3    19.576659  40.679280    4.568833    55.026758  1038.832300  no rain
4    19.828060  93.353211    0.104489    30.687566  1009.423717  no rain
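Since `head()` only shows the first rows, it does not reveal how many rows were dropped. A small sketch of how one might check that, again on a hypothetical toy frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 1 and 2 each contain one missing value
toy = pd.DataFrame({
    "Temperature": [19.1, np.nan, 20.4, 19.6],
    "Humidity": [71.7, 84.2, np.nan, 40.7],
    "Rain": ["no rain", "rain", "no rain", "no rain"],
})

# dropna() removes every row that has at least one missing value
toy_dropped = toy.dropna()
rows_removed = len(toy) - len(toy_dropped)
print(f"Rows before: {len(toy)}, after: {len(toy_dropped)}, removed: {rows_removed}")
```

Comparing `len(data)` with `len(dropped_data)` in the same way would show how much of the weather data this technique discards.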


Second technique: replace each missing value with the average of its feature. This handles both numeric and non-numeric features.

1st, separate the numeric and non-numeric features.

In [None]:
numeric_features = data.select_dtypes(include=['number']).columns
non_numeric_features = data.select_dtypes(exclude=['number']).columns

2nd, fill the missing values in the numeric features with each column's mean.

In [None]:
data_replaced = data.copy()
data_replaced[numeric_features] = data_replaced[numeric_features].fillna(data_replaced[numeric_features].mean())

3rd, handle the non-numeric features using the mode (the most frequent value), which is the usual choice for categorical data such as ours.


In [None]:
for feature in non_numeric_features:
    data_replaced[feature] = data_replaced[feature].fillna(data_replaced[feature].mode()[0])

In [None]:
print("\nData after replacing missing values with the average of the feature:")
print(data_replaced.head())


Data after replacing missing values with the average of the feature:
   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    19.096119  71.651723   14.782324    48.699257   987.954760  no rain
1    27.112464  84.183705   13.289986    10.375646  1035.430870  no rain
2    20.433329  42.290424    7.216295     6.673307  1033.628086  no rain
3    19.576659  40.679280    4.568833    55.026758  1038.832300  no rain
4    19.828060  93.353211    0.104489    30.687566  1009.423717  no rain
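The first five rows have no missing values, so the printout above looks unchanged. A minimal sketch of the same mean/mode replacement on a hypothetical toy frame (`toy`, `numeric_cols`, and `other_cols` are illustrative names) makes the effect visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "Temperature": [10.0, np.nan, 20.0, 30.0],
    "Rain": ["no rain", None, "no rain", "rain"],
})

numeric_cols = toy.select_dtypes(include=["number"]).columns
other_cols = toy.select_dtypes(exclude=["number"]).columns

toy_filled = toy.copy()
# Numeric columns: replace missing values with the column mean
toy_filled[numeric_cols] = toy_filled[numeric_cols].fillna(toy_filled[numeric_cols].mean())
# Non-numeric columns: replace missing values with the most frequent value
for col in other_cols:
    toy_filled[col] = toy_filled[col].fillna(toy_filled[col].mode()[0])

print(toy_filled)
print("Remaining missing values:", toy_filled.isnull().sum().sum())
```

Running `data_replaced.isnull().sum()` is a similar check for the weather data.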


Split the data into training and testing sets so that the models are trained on
one part and evaluated on the other.

train/test split

In [None]:
from sklearn.model_selection import train_test_split

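# Note: this splits the original `data`, which still contains missing values
# (hence the NaN entries in the scaled output); split `data_replaced` or
# `dropped_data` instead to train on the cleaned data.
features = data.iloc[:, :-1]
target = data.iloc[:, -1]  # the last column is the target; the rest are features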

feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=20)
print("\nTraining set size:", feature_train.shape)
print("Testing set size:", feature_test.shape)


Training set size: (2000, 5)
Testing set size: (500, 5)
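The split above uses `test_size=0.2` and a fixed `random_state`. A short sketch on a hypothetical ten-row frame illustrates what those two arguments do: 20% of the rows go to the test set, and repeating the call with the same seed selects the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: one feature column, target in the last column
toy = pd.DataFrame({
    "Temperature": [float(t) for t in range(10, 20)],
    "Rain": ["no rain"] * 8 + ["rain"] * 2,
})
X = toy.iloc[:, :-1]
y = toy.iloc[:, -1]

# Same arguments twice: a fixed random_state makes the split reproducible
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.2, random_state=20)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.2, random_state=20)

print("Train shape:", X_train_a.shape)
print("Test shape:", X_test_a.shape)
print("Same test rows both times:", list(X_test_a.index) == list(X_test_b.index))
```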


feature scaling

Scale the numeric features in the training set

In [None]:
from sklearn.preprocessing import StandardScaler

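scaler = StandardScaler()
# Fit the scaler on the training data only, so no test-set statistics leak in
feature_train_numeric = feature_train[numeric_features]
feature_train_scaled = scaler.fit_transform(feature_train_numeric)

# Work on a copy to avoid pandas' SettingWithCopyWarning, then overwrite the numeric columns
feature_train = feature_train.copy()
feature_train[numeric_features] = feature_train_scaled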

print("Scaled Training Data:")
print(feature_train.head())

Scaled Training Data:
      Temperature  Humidity  Wind_Speed  Cloud_Cover  Pressure
2202    -0.236241  0.649045   -1.233287     1.540939  1.501493
766      0.837869  0.826690    0.472179    -1.612293  0.583840
714      0.482423 -0.594048   -0.138226          NaN  0.679527
1801     0.985756  0.842187    1.207046     1.509516  0.720259
2038    -1.728037 -0.622781   -1.392236    -0.674957 -0.472427


Scale the numeric features in the testing set

In [None]:
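# Reuse the scaler fitted on the training set (transform only, no refit)
feature_test_numeric = feature_test[numeric_features]
feature_test_scaled = scaler.transform(feature_test_numeric)

feature_test = feature_test.copy()
feature_test[numeric_features] = feature_test_scaled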

print("Scaled Testing Data:")
print(feature_test.head())

Scaled Testing Data:
      Temperature  Humidity  Wind_Speed  Cloud_Cover  Pressure
1760    -1.496370 -0.722929   -1.695469     0.584244  0.313179
2345    -0.736163  0.966734    1.485667     1.448742  1.659652
2370    -0.418491  1.467582   -1.454176     0.235008 -1.387687
187      1.587758 -0.393788   -1.317569          NaN -1.624697
1911    -1.361379 -0.194031   -1.237179    -0.050157  1.137124
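The two cells above fit the scaler on the training set and only transform the testing set. A minimal sketch with made-up numbers (the arrays `train` and `test` are hypothetical) shows what that means: the test values are standardized with the training mean and standard deviation, so they need not end up with mean 0 themselves. `StandardScaler` also ignores NaN when fitting and keeps it when transforming, which is why NaN entries survive in the printed tables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print("Training mean learned by the scaler:", scaler.mean_[0])
print("Scaled train mean/std:", train_scaled.mean(), train_scaled.std())
print("Scaled test values:", test_scaled.ravel())
```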


implement decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

decision_tree_model = DecisionTreeClassifier(random_state=40)

decision_tree_model.fit(feature_train, target_train)  # train the model
predicted_target = decision_tree_model.predict(feature_test)  # predict on the test set

accuracy = accuracy_score(target_test, predicted_target)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9860
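The cell imports `precision_score`, `recall_score`, and `classification_report` but only reports accuracy. If the classes are imbalanced, accuracy alone can look good while the rarer class is predicted poorly, so a sketch of how the other metrics could be computed may help. The labels below are made up for illustration, and treating "rain" as the positive class is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels
y_true = ["rain", "no rain", "no rain", "rain", "no rain", "no rain"]
y_pred = ["rain", "no rain", "rain", "no rain", "no rain", "no rain"]

accuracy = accuracy_score(y_true, y_pred)
# pos_label names the class treated as "positive" (assumed here to be "rain")
precision = precision_score(y_true, y_pred, pos_label="rain")
recall = recall_score(y_true, y_pred, pos_label="rain")

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (rain): {precision:.4f}")
print(f"Recall (rain): {recall:.4f}")
print(classification_report(y_true, y_pred))
```

With the notebook's variables, the equivalent calls would take `target_test` and `predicted_target` in place of the toy lists.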
