
Load the dataset

In [1]:
import pandas as pd
data = pd.read_csv('weather_forecast_data.csv')
print(data.head())

   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    19.096119  71.651723   14.782324    48.699257   987.954760  no rain
1    27.112464  84.183705   13.289986    10.375646  1035.430870  no rain
2    20.433329  42.290424    7.216295     6.673307  1033.628086  no rain
3    19.576659  40.679280    4.568833    55.026758  1038.832300  no rain
4    19.828060  93.353211    0.104489    30.687566  1009.423717  no rain


Preprocessing

First, we identify the missing data.

In [2]:
missing_data = data.isnull().sum()
print("Missing values in each column:")
print(missing_data)

Missing values in each column:
Temperature    25
Humidity       40
Wind_Speed     32
Cloud_Cover    33
Pressure       27
Rain            0
dtype: int64
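
Raw counts are easier to judge as a share of the rows. The raw split later in this notebook totals 2,500 rows (2000 + 500), so 40 missing Humidity values is 1.6% of that column. A minimal sketch of computing the share directly, on a made-up `toy` frame (its values are illustrative, not taken from `weather_forecast_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data.
toy = pd.DataFrame({
    "Temperature": [19.1, np.nan, 20.4, 19.6],
    "Humidity": [71.7, 84.2, np.nan, np.nan],
    "Rain": ["no rain", "rain", "no rain", "no rain"],
})

# isnull().mean() is the fraction of missing entries per column.
missing_share = toy.isnull().mean() * 100
print(missing_share)
```

On the real frame the same expression would be `data.isnull().mean() * 100`.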


Next, we handle the missing data.
First technique: dropping the rows that contain missing values.

In [3]:
dropped_data = data.dropna()
print("Data after dropping missing values:")
print(dropped_data.head())

Data after dropping missing values:
   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    19.096119  71.651723   14.782324    48.699257   987.954760  no rain
1    27.112464  84.183705   13.289986    10.375646  1035.430870  no rain
2    20.433329  42.290424    7.216295     6.673307  1033.628086  no rain
3    19.576659  40.679280    4.568833    55.026758  1038.832300  no rain
4    19.828060  93.353211    0.104489    30.687566  1009.423717  no rain
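
Because `head()` shows the same first five rows before and after, it does not reveal how much data was removed. The split sizes later in this notebook (1877 + 470 = 2347 rows for the dropped data versus 2000 + 500 = 2500 for the raw data) imply that 153 rows were lost. A small sketch of checking this directly, using a hypothetical `toy` frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; rows 1 and 2 each have one missing value.
toy = pd.DataFrame({
    "Temperature": [19.1, np.nan, 20.4, 19.6],
    "Humidity": [71.7, 84.2, np.nan, 40.7],
    "Rain": ["no rain", "rain", "no rain", "no rain"],
})

dropped_toy = toy.dropna()  # drops any row with at least one missing value
rows_lost = len(toy) - len(dropped_toy)
print(f"Rows before: {len(toy)}, after: {len(dropped_toy)}, lost: {rows_lost}")
```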


Second technique: replacing the missing values with the average of the feature. This handles both numeric and non-numeric features.

1st, separate the numeric and non-numeric features.

In [4]:
numeric_features = data.select_dtypes(include=['number']).columns
non_numeric_features = data.select_dtypes(exclude=['number']).columns
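
To illustrate what the two selections return, here is a small sketch on a hypothetical `toy` frame (column values are made up):

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "Temperature": [19.1, 27.1],
    "Pressure": [988, 1035],
    "Rain": ["no rain", "rain"],
})

# 'number' matches both integer and float dtypes.
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
other_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()
print(numeric_cols, other_cols)
```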

2nd, fill the missing values in each numeric feature with that feature's mean.

In [5]:
data_replaced = data.copy()
data_replaced[numeric_features] = data_replaced[numeric_features].fillna(data_replaced[numeric_features].mean())

3rd, handle the non-numeric features using the mode (the most frequent value), which suits categorical data such as ours.


In [6]:
for feature in non_numeric_features:
    data_replaced[feature] = data_replaced[feature].fillna(data_replaced[feature].mode()[0])

In [7]:
print("\nData after replacing missing values with the average of the feature:")
print(data_replaced.head())


Data after replacing missing values with the average of the feature:
   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    19.096119  71.651723   14.782324    48.699257   987.954760  no rain
1    27.112464  84.183705   13.289986    10.375646  1035.430870  no rain
2    20.433329  42.290424    7.216295     6.673307  1033.628086  no rain
3    19.576659  40.679280    4.568833    55.026758  1038.832300  no rain
4    19.828060  93.353211    0.104489    30.687566  1009.423717  no rain
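
In this dataset `Rain` has no missing values, so the mode step leaves it unchanged, and the first five rows have no gaps to fill. The following sketch shows both fills taking effect on a hypothetical `toy` frame with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "Humidity": [70.0, np.nan, 90.0],
    "Rain": ["no rain", None, "no rain"],
})

filled = toy.copy()
# Numeric column: fill with the mean of the observed values (80.0 here).
filled["Humidity"] = filled["Humidity"].fillna(filled["Humidity"].mean())
# Categorical column: fill with the most frequent value.
filled["Rain"] = filled["Rain"].fillna(filled["Rain"].mode()[0])
print(filled)
```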


Splitting the data into training and testing sets, used to train and evaluate our
models

Train/test split on the unhandled (raw) data

In [8]:
from sklearn.model_selection import train_test_split

features = data.iloc[:, :-1]
target = data.iloc[:, -1]  # the last column is the target; the rest are features

feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=20)
print("\nTraining set size:", feature_train.shape)
print("Testing set size:", feature_test.shape)


Training set size: (2000, 5)
Testing set size: (500, 5)
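
If the two classes are imbalanced (this notebook does not check), a purely random split can leave the test set with a different class ratio than the training set. One option, sketched here on hypothetical toy data, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy target: 80 "no rain" rows and 20 "rain" rows.
toy_features = pd.DataFrame({"Humidity": range(100)})
toy_target = pd.Series(["no rain"] * 80 + ["rain"] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=20, stratify=toy_target
)

# With stratification the 80/20 ratio is preserved in the 20-row test set.
print(y_test.value_counts())
```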


Train/test split on the data handled with the dropping technique

In [None]:
from sklearn.model_selection import train_test_split

features = dropped_data.iloc[:, :-1]
target = dropped_data.iloc[:, -1]  # the last column is the target; the rest are features

feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=24)
print("\nTraining set size:", feature_train.shape)
print("Testing set size:", feature_test.shape)


Training set size: (1877, 5)
Testing set size: (470, 5)


Train/test split on the data handled with the replacing technique

In [9]:
from sklearn.model_selection import train_test_split

features = data_replaced.iloc[:, :-1]
target = data_replaced.iloc[:, -1]  # the last column is the target; the rest are features

feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=20)
print("\nTraining set size:", feature_train.shape)
print("Testing set size:", feature_test.shape)


Training set size: (2000, 5)
Testing set size: (500, 5)
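
Note that `data_replaced` was filled before splitting, so the means used for filling were computed from rows that later land in the test set. A stricter variant, sketched here under the assumption of a made-up `toy` frame, splits first and learns the fill value from the training rows only, using scikit-learn's `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two missing temperatures.
toy = pd.DataFrame({
    "Temperature": [19.0, np.nan, 21.0, 23.0, 25.0, np.nan, 27.0, 29.0, 31.0, 33.0],
    "Rain": ["no rain", "rain"] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["Temperature"]], toy["Rain"], test_size=0.2, random_state=20
)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # learn the mean from training rows only
X_test_filled = imputer.transform(X_test)        # reuse that training mean on the test rows
print("Fill value learned from training rows:", imputer.statistics_[0])
```

This mirrors how the scaler is handled in the next section: fit on the training set, then apply to the test set.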


Feature scaling

Scale the numeric features in the training set.

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
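feature_train = feature_train.copy()  # guard against pandas SettingWithCopyWarning on assignment
feature_train_numeric = feature_train[numeric_features]
# Fit the scaler on the training set only, then standardize it
feature_train_scaled = scaler.fit_transform(feature_train_numeric)

feature_train[numeric_features] = feature_train_scaled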

print("Scaled Training Data:")
print(feature_train.head())

Scaled Training Data:
      Temperature  Humidity  Wind_Speed  Cloud_Cover  Pressure
2202    -0.237315  0.654273   -1.241286     1.549968  1.510454
766      0.841933  0.833311    0.475375    -1.621552  0.587481
714      0.484786 -0.598567   -0.139037    -0.007912  0.683722
1801     0.990528  0.848930    1.215065     1.518362  0.724690
2038    -1.736248 -0.627525   -1.401278    -0.678779 -0.474909


Scale the numeric features in the testing set, reusing the scaler fitted on the training set.

In [11]:
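feature_test = feature_test.copy()  # guard against pandas SettingWithCopyWarning on assignment
feature_test_numeric = feature_test[numeric_features]
# transform only (no fit), so the test set uses the training mean and standard deviation
feature_test_scaled = scaler.transform(feature_test_numeric)

feature_test[numeric_features] = feature_test_scaled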

print("Scaled Testing Data:")
print(feature_test.head())

Scaled Testing Data:
      Temperature  Humidity  Wind_Speed  Cloud_Cover  Pressure
1760    -1.503473 -0.728458   -1.706502     0.587725  0.315251
2345    -0.739630  0.974453    1.495514     1.457237  1.669529
2370    -0.420437  1.479228   -1.463625     0.236463 -1.395475
187      1.595409 -0.396737   -1.326121    -0.007912 -1.633859
1911    -1.367837 -0.195413   -1.245203    -0.050356  1.143972
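
`StandardScaler` subtracts the training mean and divides by the training standard deviation, which is why test values are not centered exactly on zero. A short check of that formula on hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler().fit(train)  # learns mean and std from the training data only

# Same computation by hand; np.std defaults to the population std (ddof=0),
# which is what StandardScaler uses.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
scaled = scaler.transform(test)
print(scaled, manual)
```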


Implement a decision tree

Accuracy using the dropped data: 0.9957


Accuracy using the replaced data: 0.9860

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

decision_tree_model = DecisionTreeClassifier(random_state = 20)

decision_tree_model.fit(feature_train, target_train)  # train the model
predicted_target = decision_tree_model.predict(feature_test)

accuracy = accuracy_score(target_test, predicted_target)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9860
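
The two accuracies quoted above come from re-running the cells with a different split each time, and the dropped-data split uses `random_state=24` while the others use 20, so the comparison is not strictly like-for-like. One way to make it repeatable is a small helper that takes a frame and a seed. This is only a sketch: `evaluate_tree` is a hypothetical name, the data is synthetic, and scaling is omitted because decision trees do not depend on feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_tree(frame, seed=20):
    """Split `frame` (last column is the target), fit a tree, return test accuracy."""
    features = frame.iloc[:, :-1]
    target = frame.iloc[:, -1]
    f_train, f_test, t_train, t_test = train_test_split(
        features, target, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(f_train, t_train)
    return accuracy_score(t_test, model.predict(f_test))


# Hypothetical synthetic data: it rains whenever humidity exceeds 70.
rng = np.random.default_rng(0)
humidity = rng.uniform(30, 100, size=200)
toy = pd.DataFrame({
    "Humidity": humidity,
    "Rain": np.where(humidity > 70, "rain", "no rain"),
})

toy_accuracy = evaluate_tree(toy)
print(f"Accuracy: {toy_accuracy:.4f}")
```

With the notebook's frames, the calls would be `evaluate_tree(dropped_data)` and `evaluate_tree(data_replaced)`, both with the same seed.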


Precision
  Definition: The ratio of correctly predicted positive instances to the total predicted positives.
  Meaning: Measures how many of the predicted positives are actually correct.

Accuracy
  Definition: The ratio of correctly predicted instances to the total number of instances.
  Meaning: Measures how often the model is correct overall.

Recall (Sensitivity or True Positive Rate)
  Definition: The ratio of correctly predicted positive instances to all actual positives.
  Meaning: Measures how many of the actual positives were correctly identified.
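
As a worked example of these three definitions, the sketch below counts true/false positives and negatives by hand on hypothetical labels (treating "rain" as the positive class, which is an assumption) and compares the results with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels; "rain" is treated as the positive class.
actual = ["rain", "rain", "rain", "no rain", "no rain", "no rain", "no rain", "no rain"]
predicted = ["rain", "rain", "no rain", "rain", "no rain", "no rain", "no rain", "no rain"]

pairs = list(zip(actual, predicted))
tp = sum(a == "rain" and p == "rain" for a, p in pairs)        # correctly predicted rain
fp = sum(a == "no rain" and p == "rain" for a, p in pairs)     # predicted rain, was dry
fn = sum(a == "rain" and p == "no rain" for a, p in pairs)     # missed rain
tn = sum(a == "no rain" and p == "no rain" for a, p in pairs)  # correctly predicted dry

precision = tp / (tp + fp)                  # 2 / 3
recall = tp / (tp + fn)                     # 2 / 3
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 6 / 8

print(precision, recall, accuracy)
print(precision_score(actual, predicted, pos_label="rain"),
      recall_score(actual, predicted, pos_label="rain"),
      accuracy_score(actual, predicted))
```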


In [22]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

naive_bayes_model = GaussianNB()

naive_bayes_model.fit(feature_train, target_train)  # train the model
# Alternative: naive_bayes_model.partial_fit(feature_train, target_train, classes=target_train.unique())
predicted_target = naive_bayes_model.predict(feature_test)

accuracy = accuracy_score(target_test, predicted_target)
# Alternative: accuracy = naive_bayes_model.score(feature_test, target_test)
# average='weighted' averages the per-class scores, weighted by each class's support
precision = precision_score(target_test, predicted_target, average='weighted')
recall = recall_score(target_test, predicted_target, average='weighted')
print(f"Accuracy: {accuracy:.4f}")
print(f"recall: {recall:.4f}")
print(f"precision: {precision:.4f}")

Accuracy: 0.9500
recall: 0.9500
precision: 0.9527
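
Recall and accuracy are identical in this output because of `average='weighted'`: each class's recall is weighted by its number of true instances, and that weighted sum reduces to the overall fraction of correct predictions. The sketch below demonstrates this on hypothetical labels and reproduces the weighted precision by hand:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, not taken from the notebook's test set.
actual = ["rain", "rain", "rain", "no rain", "no rain", "no rain", "no rain", "no rain"]
predicted = ["rain", "rain", "no rain", "rain", "no rain", "no rain", "no rain", "no rain"]
labels = ["no rain", "rain"]

# Per-class precision, in the order given by `labels`.
per_class_precision = precision_score(actual, predicted, average=None, labels=labels)
# Support = number of true instances of each class.
support = np.array([actual.count(label) for label in labels])
manual_weighted = np.average(per_class_precision, weights=support)

weighted_precision = precision_score(actual, predicted, average="weighted")
weighted_recall = recall_score(actual, predicted, average="weighted")
accuracy = accuracy_score(actual, predicted)

print(f"weighted precision: {weighted_precision:.4f} (by hand: {manual_weighted:.4f})")
print(f"weighted recall: {weighted_recall:.4f}, accuracy: {accuracy:.4f}")
```

To see precision and recall for the "rain" class alone, `classification_report(target_test, predicted_target)`, which the cell above already imports, prints the per-class values.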


I got an accuracy of 95% using