# <center> Water Quality </center>


<center> <img src="https://media.istockphoto.com/photos/analyzing-samples-picture-id182188515?k=6&m=182188515&s=612x612&w=0&h=Hcjly5YZGs4tFxPmD6Q-hbCKcoGFU-JIPT8qLaYDUOQ="> </center>


## About dataset

#### Content
The water_potability.csv file contains water quality metrics for 3276 different water bodies.

**1. pH value:**
pH is an important parameter for evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline.
The WHO recommends a permissible pH range of 6.5 to 8.5. The ranges reported in the referenced investigation were 6.52–6.83, which fall within the WHO standard.

**2. Hardness:**
Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from the geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, an effect caused by calcium and magnesium.

**3. Solids (Total dissolved solids - TDS):**
Water can dissolve a wide range of inorganic and some organic minerals or salts, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, and sulfates. These minerals produce an unwanted taste and a diluted color in the water. TDS is an important parameter for the use of water: a high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit is 1000 mg/L, as prescribed for drinking purposes.

**4. Chloramines:**
Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

**5. Sulfate:**
Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

**6. Conductivity:**
Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines its electrical conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

**7. Organic_carbon:**
Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated/drinking water and below 4 mg/L in source water that is used for treatment.

**8. Trihalomethanes:**
THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppb (μg/L) are considered safe in drinking water.

**9. Turbidity:**
The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

**10. Potability:**
Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
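
The limits quoted in the descriptions above can be turned into a quick screening check. The sketch below is only an illustration: the `GUIDELINE_RANGES` dictionary and the `within_guidelines` helper are hypothetical names, the sample rows are made up, and the thresholds are simply the ones stated in this section.

```python
import pandas as pd

# Limits quoted in the feature descriptions above. The dictionary and the
# helper below are illustrative names, not part of the dataset or any library.
GUIDELINE_RANGES = {
    "ph": (6.5, 8.5),               # recommended pH range
    "Solids": (None, 1000.0),       # maximum TDS limit, mg/L
    "Chloramines": (None, 4.0),     # mg/L
    "Conductivity": (None, 400.0),  # μS/cm
    "Turbidity": (None, 5.0),       # NTU
}

def within_guidelines(df: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a value respects its quoted limit.

    Missing values compare as False, so they are reported as out of range.
    """
    flags = {}
    for column, (low, high) in GUIDELINE_RANGES.items():
        ok = pd.Series(True, index=df.index)
        if low is not None:
            ok &= df[column] >= low
        if high is not None:
            ok &= df[column] <= high
        flags[column] = ok
    return pd.DataFrame(flags)

# Two made-up samples: the first respects every limit, the second does not.
sample = pd.DataFrame({
    "ph": [7.0, 3.7],
    "Solids": [450.0, 18630.0],
    "Chloramines": [3.5, 6.6],
    "Conductivity": [350.0, 592.9],
    "Turbidity": [2.9, 4.5],
})
flags = within_guidelines(sample)
print(flags.all(axis=1).tolist())  # [True, False]
```

A per-column view (`flags.mean()`) would show the share of samples inside each limit, which can be a useful sanity check before modelling.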


## Water Potability Classification using Machine Learning Algorithms

**Introduction:**

In this Colab notebook, we will explore the water potability dataset and classify water samples as potable or non-potable using various machine learning algorithms. The dataset consists of 10 columns: nine features (pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity) and the target, potability. By the end of this notebook, you will have a better understanding of how to preprocess data, handle imbalanced datasets, and apply machine learning algorithms to classify water potability.

Table of Contents:

1.   Data Import and Preprocessing
2.   Handling Imbalanced Datasets
3.   Feature Scaling
4.   Machine Learning Algorithms
5.   Model Evaluation and Comparison






Data Import and Preprocessing:

We begin by importing the necessary libraries and loading the water potability dataset. We then provide some basic information about the dataset, such as its shape, data types, and missing values.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE

from collections import Counter as ctr

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv("/content/drive/MyDrive/water_potability.csv")
print(df.shape)

(3276, 10)


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 256.1 KB


In [None]:
df.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
| - | -- | -------- | ------ | ----------- | ------- | ------------ | -------------- | --------------- | --------- | ---------- |
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


In [None]:
df.duplicated().sum()

0

In [None]:
df.isnull().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
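
To put these counts in perspective, it helps to express them as a share of the 3276 rows. The short sketch below only redoes that arithmetic with the numbers printed above; the variable names are our own.

```python
import pandas as pd

# Counts copied from the df.isnull().sum() output above
n_rows = 3276
missing = pd.Series({"ph": 491, "Sulfate": 781, "Trihalomethanes": 162})

# Percentage of rows with a missing value in each column
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct)
```

Roughly 15% of the pH values, 24% of the sulfate values, and 5% of the trihalomethane values are missing, so dropping incomplete rows would discard a large part of the dataset. On the real frame, `df.isnull().mean() * 100` gives the same percentages directly.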

In [None]:
df['ph']=df['ph'].fillna(df.groupby(['Potability'])['ph'].transform('mean'))
df['Sulfate']=df['Sulfate'].fillna(df.groupby(['Potability'])['Sulfate'].transform('mean'))
df['Trihalomethanes']=df['Trihalomethanes'].fillna(df.groupby(['Potability'])['Trihalomethanes'].transform('mean'))
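
Note that the cell above fills missing values with class-wise means computed on the full dataset, so the imputed values depend on the target label and on rows that later end up in the test set. A more conservative alternative, sketched below on made-up data, is to split first and learn the column means from the training split only. This is an assumption about how one might avoid that leakage, not what the notebook does; the toy frame and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (values are synthetic)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, 200),
    "Sulfate": rng.normal(330.0, 40.0, 200),
    "Potability": rng.integers(0, 2, 200),
})
# Knock out some values to mimic the missing entries
toy.loc[rng.choice(200, 30, replace=False), "ph"] = np.nan
toy.loc[rng.choice(200, 40, replace=False), "Sulfate"] = np.nan

X_toy = toy.drop(columns=["Potability"])
y_toy = toy["Potability"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Learn the column means on the training split only, then reuse them
imputer = SimpleImputer(strategy="mean")
X_tr_filled = imputer.fit_transform(X_tr)
X_te_filled = imputer.transform(X_te)

print(imputer.statistics_)  # training-set means used for both splits
```

With this ordering the test rows never influence the fill values, and the target is not used for imputation at all.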

In [None]:
X = df.drop(columns = ['Potability'])
y = df['Potability']

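Handling Imbalanced Datasets:

The classes are imbalanced (1998 non-potable versus 1278 potable samples), so we use SMOTE to oversample the minority class. SMOTE is applied to the training split only, so the test set keeps its original class distribution.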

In [None]:
ctr(y)

Counter({0: 1998, 1: 1278})

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)

smote = SMOTE()
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
ctr(y_train)

Counter({0: 1598, 1: 1598})
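
SMOTE reaches this balance by creating synthetic minority samples rather than duplicating existing ones. The core idea is to pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The sketch below shows only that interpolation step with made-up values; it is a simplification for intuition and not the imbalanced-learn implementation, and `interpolate_sample` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and neighbor."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)

# Two made-up minority samples with two features each
x = np.array([7.0, 200.0])
neighbor = np.array([8.0, 220.0])

synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region already occupied by the minority class.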

Feature Scaling:

We will use the MinMaxScaler to scale our features to have values between 0 and 1. This is an essential step for some machine learning algorithms that are sensitive to the scale of the input features.

In [None]:
MMScaler = MinMaxScaler()
X_train = MMScaler.fit_transform(X_train)
X_test = MMScaler.transform(X_test)
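
For intuition, min-max scaling maps each value to (x - min) / (max - min), where min and max come from the training data. The sketch below reproduces this by hand on a small made-up array; it is meant as an illustration of the formula, and the array values are arbitrary.

```python
import numpy as np

# Made-up training and test rows with two features
train = np.array([[6.0, 100.0], [8.0, 300.0], [7.0, 200.0]])
test = np.array([[9.0, 150.0]])

# Column-wise minimum and maximum are learned from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)  # every column spans exactly 0 to 1
print(test_scaled)   # [[1.5  0.25]]
```

The first test value scales to 1.5 because 9.0 is above the training maximum of 8.0. This is expected: test values can fall outside the 0 to 1 range when the scaler is fitted on the training split only.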

### The following algorithms are used in this notebook.

|    Algorithm         |
| -------------------- |
| DecisionTreeClassifier|
| RandomForestClassifier|
| GradientBoostingClassifier|
| KNeighborsClassifier|
| BaggingClassifier|
| AdaBoostClassifier|


Model Evaluation and Comparison:

We will compare the performance of each algorithm using cross-validation scores and present the results in a table.

In [None]:
mod = []
cv_score=[]
model =[AdaBoostClassifier(), BaggingClassifier(), GradientBoostingClassifier(), DecisionTreeClassifier(), KNeighborsClassifier(), RandomForestClassifier()]
for m in model:
    cv_score.append(cross_val_score(m, X_train, y_train, scoring='accuracy', cv=5).mean())
    mod.append(m)

model_df=pd.DataFrame(columns=['model','cv_score'])
model_df['model']=mod
model_df['cv_score']=cv_score
model_df.sort_values(by=['cv_score'], ascending=True)

|   | model | cv_score |
| - | ----- | -------- |
| 4 | KNeighborsClassifier() | 0.630798 |
| 0 | AdaBoostClassifier() | 0.703067 |
| 3 | DecisionTreeClassifier() | 0.720273 |
| 2 | GradientBoostingClassifier() | 0.751256 |
| 1 | BaggingClassifier() | 0.757823 |
| 5 | RandomForestClassifier() | 0.796626 |
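
One caveat about these scores: the cross-validation runs on data that was already resampled and scaled, so each validation fold has been influenced by preprocessing fitted on the whole training set. A common way to avoid this is to wrap the preprocessing and the model in a pipeline, so every step is refitted inside each fold. The sketch below shows the idea with scikit-learn only, on synthetic data; to include SMOTE in the same way, imbalanced-learn offers its own `Pipeline` class that accepts samplers. The data and variable names here are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with nine features, like the water dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# The scaler is refitted on the training part of every fold
pipe = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=5)
print(scores.mean())
```

Scores obtained this way are usually a more honest estimate of how the model behaves on unseen data.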


In [None]:
from hyperopt import hp, fmin, tpe, Trials, space_eval
import numpy as np
from sklearn.model_selection import cross_val_score

# Define objective function
def objective(params):
    if params['classifier'] == 'RandomForest':
        clf = RandomForestClassifier(n_estimators=params['n_estimators'],
                                     max_depth=params['max_depth'],
                                     min_samples_split=params['min_samples_split'],
                                     min_samples_leaf=params['min_samples_leaf'])
    elif params['classifier'] == 'Bagging':
        base_estimator = RandomForestClassifier(n_estimators=params['base_n_estimators'],
                                                 max_depth=params['base_max_depth'],
                                                 min_samples_split=params['base_min_samples_split'],
                                                 min_samples_leaf=params['base_min_samples_leaf'])
        # The parameter is `estimator` in scikit-learn >= 1.2 (formerly `base_estimator`)
        clf = BaggingClassifier(estimator=base_estimator,
                                n_estimators=params['n_estimators'],
                                max_samples=params['max_samples'],
                                max_features=params['max_features'])
    elif params['classifier'] == 'GradientBoosting':
        clf = GradientBoostingClassifier(n_estimators=params['n_estimators'],
                                          learning_rate=params['learning_rate'],
                                          max_depth=params['max_depth'],
                                          min_samples_split=params['min_samples_split'],
                                          min_samples_leaf=params['min_samples_leaf'])
    else:
        raise ValueError("Invalid classifier")

    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    return -score  # Hyperopt minimizes the objective, so we use negative score

# Define search space
space = {
    'classifier': hp.choice('classifier', ['RandomForest', 'Bagging', 'GradientBoosting']),

    'n_estimators': hp.choice('n_estimators', range(10, 300)),
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'min_samples_split': hp.uniform('min_samples_split', 0.1, 0.9),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0.1, 0.5),

    'base_n_estimators': hp.choice('base_n_estimators', range(10, 300)),
    'base_max_depth': hp.choice('base_max_depth', range(1, 20)),
    'base_min_samples_split': hp.uniform('base_min_samples_split', 0.1, 0.9),
    'base_min_samples_leaf': hp.uniform('base_min_samples_leaf', 0.1, 0.5),
    'max_samples': hp.uniform('max_samples', 0.1, 1.0),
    'max_features': hp.uniform('max_features', 0.1, 1.0),

    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2))
}

# Select optimization algorithm
algo = tpe.suggest

# Run hyperparameter optimization
trials = Trials()
best = fmin(objective, space, algo=algo, max_evals=100, trials=trials)

# Retrieve best hyperparameters
best_params = space_eval(space, best)
print("Best hyperparameters:", best_params)

 15%|█▌        | 15/100 [04:55<56:12, 39.68s/trial, best loss: -0.7424877738654148]

In [None]:
param_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400]}

rf_classifier = RandomForestClassifier()

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best Estimator: {grid_search.best_params_} , Best Score : {grid_search.best_score_}")

In [None]:
from sklearn.model_selection import GridSearchCV
param={'n_estimators': [100,150,200,250,300,350,400]}
grid_Grd=GridSearchCV(GradientBoostingClassifier(), param_grid=param, cv=5, scoring='accuracy')
grid_Grd.fit(X_train, y_train)
print(f"Best Estimator: {grid_Grd.best_params_} , Best Score : {grid_Grd.best_score_}")


In [None]:
param={'n_estimators': [100,150,200,250,300,350,400]}
grid_Bag=GridSearchCV(BaggingClassifier(), param_grid=param, cv=5, scoring='accuracy')
grid_Bag.fit(X_train, y_train)
print(f"Best Estimator: {grid_Bag.best_params_} , Best Score : {grid_Bag.best_score_}")


In [None]:
rf_model = RandomForestClassifier(n_estimators=300, random_state=42)
rf_model.fit(X_train, y_train)

gb_model = GradientBoostingClassifier(n_estimators=150, random_state=42)
gb_model.fit(X_train, y_train)

bg_model = BaggingClassifier(n_estimators=100,random_state=42 )
bg_model.fit(X_train,y_train)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)
y_pred_bg = bg_model.predict(X_test)

predictions = {
    'Random Forest': y_pred_rf,
    'Gradient Boosting': y_pred_gb,
    'Bagging Classifier': y_pred_bg,
}

performance_metrics = {}

# Calculate performance metrics for each model and store them in the dictionary
for model_name, y_pred in predictions.items():
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    performance_metrics[model_name] = [accuracy, precision, recall, f1]

# Create a DataFrame from the performance metrics dictionary
performance_df = pd.DataFrame(performance_metrics, index=['Accuracy', 'Precision', 'Recall', 'F1 Score']).T

# Display the summary table
print(performance_df)

In [None]:
from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_model),
    ('gb', gb_model),
    ('bg', bg_model)
], voting='hard')

voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, y_pred)
print("Ensemble Accuracy:", ensemble_accuracy)
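
With `voting='hard'`, each base model casts one vote per sample and the majority class wins. The sketch below reproduces that rule by hand for three made-up prediction vectors in a binary problem; it is a minimal illustration of majority voting, not the internals of `VotingClassifier`.

```python
import numpy as np

# Rows are models, columns are samples (made-up binary predictions)
preds = np.array([
    [1, 0, 1, 0],  # model A
    [1, 1, 0, 0],  # model B
    [0, 1, 1, 0],  # model C
])

# A sample is labelled 1 when more than half of the models vote for 1
votes_for_1 = preds.sum(axis=0)
majority = (votes_for_1 > preds.shape[0] / 2).astype(int)
print(majority)  # [1 1 1 0]
```

With an odd number of models and two classes there are no ties, which is one reason three base models are a convenient choice here.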

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Define the base models
base_models = [

    ('rf', rf_model),
    ('gb', gb_model),
    ('bg', bg_model)
]

# Define the meta learner (or blender)
meta_learner = LogisticRegression()

# Create the stacking classifier
stacking_classifier = StackingClassifier(estimators=base_models, final_estimator=meta_learner)

# Train the stacking classifier
stacking_classifier.fit(X_train, y_train)

# Evaluate the stacking classifier
accuracy = stacking_classifier.score(X_test, y_test)
print("Stacking Classifier Accuracy:", accuracy)

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Define function to plot confusion matrix
def plot_confusion_matrix(y_true, y_pred, model_name):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.show()

# Generate confusion matrix for Random Forest
plot_confusion_matrix(y_test, y_pred_rf, 'Random Forest')

# Generate confusion matrix for Gradient Boosting
plot_confusion_matrix(y_test, y_pred_gb, 'Gradient Boosting')

# Generate confusion matrix for Bagging Classifier
plot_confusion_matrix(y_test, y_pred_bg, 'Bagging Classifier')
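
The four cells of a binary confusion matrix are enough to recompute accuracy, precision, recall, and F1 by hand, which is a useful way to read the heatmaps. The sketch below does this for made-up labels; the arrays are illustrative and unrelated to the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = potable, 0 = not potable)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted potable that really is potable
recall = tp / (tp + fn)     # share of potable samples that were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

For a potability classifier, false positives (unsafe water predicted as potable) are the costly errors, so precision on class 1 deserves particular attention.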


In [None]:
from sklearn.feature_selection import SelectPercentile, SelectKBest, chi2

# Instantiate SelectPercentile for 50th percentile
selector_percentile_50 = SelectPercentile(percentile=50)

# Fit and transform the selector on your features
X_selected_percentile_50 = selector_percentile_50.fit_transform(X_train, y_train)

# Instantiate SelectPercentile for 70th percentile
selector_percentile_70 = SelectPercentile(percentile=70)

# Fit and transform the selector on your features
X_selected_percentile_70 = selector_percentile_70.fit_transform(X_train, y_train)

# Instantiate SelectKBest with chi2 as score function
selector_kbest = SelectKBest(score_func=chi2, k=7)

# Fit and transform the selector on your features
X_selected_kbest = selector_kbest.fit_transform(X_train, y_train)

# Get the indices of the selected features for SelectKBest
selected_features_indices = selector_kbest.get_support(indices=True)

# Get the names of the selected features
selected_features_names = X.columns[selected_features_indices]

print("Selected features using SelectPercentile (50th percentile):", X_selected_percentile_50.shape[1])
print("Selected features using SelectPercentile (70th percentile):", X_selected_percentile_70.shape[1])
print("Selected features using SelectKBest:", X_selected_kbest.shape[1])
print("Selected features:", selected_features_names)


In [None]:
import pandas as pd
from sklearn.feature_selection import SelectPercentile, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Function to calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return accuracy, precision, recall, f1

# Define feature selection methods
methods = ['Select Percentile 50', 'Select Percentile 70', 'SelectKBest 5', 'SelectKBest 7', 'No feature selection']

# Initialize lists to store results
accuracy_results = []
precision_results = []
recall_results = []
f1_results = []

# Loop through each feature selection method
for method in methods:
    if 'Select Percentile' in method:
        # Instantiate SelectPercentile
        percentile = int(method.split()[-1])
        selector = SelectPercentile(percentile=percentile)
        X_selected = selector.fit_transform(X_train, y_train)
    elif 'SelectKBest' in method:
        # Instantiate SelectKBest
        k = int(method.split()[-1])
        selector = SelectKBest(score_func=chi2, k=k)
        X_selected = selector.fit_transform(X_train, y_train)
    else:
        # No feature selection
        X_selected = X_train

    # Split data into train and test sets
    X_train_sel, X_test_sel, y_train_sel, y_test_sel = train_test_split(X_selected, y_train, test_size=0.2, random_state=1, stratify=y_train)

    # Train model
    model = LogisticRegression()
    model.fit(X_train_sel, y_train_sel)

    # Make predictions
    y_pred_sel = model.predict(X_test_sel)

    # Calculate metrics
    accuracy, precision, recall, f1 = calculate_metrics(y_test_sel, y_pred_sel)

    # Append results to lists
    accuracy_results.append(accuracy)
    precision_results.append(precision)
    recall_results.append(recall)
    f1_results.append(f1)

# Create DataFrame to store results
results_df = pd.DataFrame({
    'Method': methods,
    'Accuracy': accuracy_results,
    'Precision': precision_results,
    'Recall': recall_results,
    'F1 Score': f1_results
})

# Display results
print(results_df)


In [None]:
# Define the models
rf_model = RandomForestClassifier()
gb_model = GradientBoostingClassifier()
bg_model = BaggingClassifier()

# Create a list of models
models = [rf_model, gb_model, bg_model]

# Create a list of model names
model_names = ['Random Forest', 'Gradient Boosting', 'Bagging Classifier']

# Collect one row per model, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []

# Loop through each model and calculate training and testing accuracy
for name, clf in zip(model_names, models):
    # Fit the model to the training data
    clf.fit(X_train, y_train)

    # Calculate training accuracy
    train_acc = clf.score(X_train, y_train)

    # Calculate testing accuracy
    test_acc = clf.score(X_test, y_test)

    # Store the results for this model
    rows.append({'Model': name, 'Training Accuracy': train_acc, 'Testing Accuracy': test_acc})

# Create a DataFrame to store the results
results = pd.DataFrame(rows, columns=['Model', 'Training Accuracy', 'Testing Accuracy'])

# Print the results
print(results)

# Plot the results
plt.figure(figsize=(10,6))
plt.bar(results['Model'], results['Testing Accuracy'], color='blue')
plt.xlabel('ML Classifiers')
plt.ylabel('Accuracy')
plt.title('ML Classifiers vs Accuracy')
plt.show()


In [None]:
# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)

# Train the Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=0)
gb_model.fit(X_train, y_train)

# Train the Bagging Classifier model
# (the parameter is `estimator` in scikit-learn >= 1.2, formerly `base_estimator`)
bg_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=0)
bg_model.fit(X_train, y_train)


In [None]:
# Calculate accuracy for each model on training and testing data
train_accuracy = []
test_accuracy = []

for name, model in zip(['Random Forest', 'Gradient Boosting', 'Bagging Classifier'], [rf_model, gb_model, bg_model]):
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_accuracy.append(train_acc)
    test_accuracy.append(test_acc)

# Create a DataFrame to store the results
results = pd.DataFrame({'Classifier': ['Random Forest', 'Gradient Boosting', 'Bagging Classifier'],
                         'Train Accuracy': train_accuracy,
                         'Test Accuracy': test_accuracy})

# Print the DataFrame
print(results)

# Create a bar plot to visualize the results
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.bar(results['Classifier'], results['Test Accuracy'], color='blue')
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Classifier Accuracy')
plt.show()


In [None]:
# Collect one row per model (DataFrame.append was removed in pandas 2.0)
rows = []

# Loop through each model and calculate training and testing accuracy
for name, clf in zip(model_names, models):
    # Fit the model to the training data
    clf.fit(X_train, y_train)

    # Store training and testing accuracy for this model
    rows.append({'Model': name,
                 'Training Accuracy': clf.score(X_train, y_train),
                 'Testing Accuracy': clf.score(X_test, y_test)})

# Build the DataFrame and set 'Model' column as index
results = pd.DataFrame(rows).set_index('Model')

# Plot grouped bars so the training bars do not hide the testing bars
x = np.arange(len(results))
width = 0.4
plt.figure(figsize=(10,6))
plt.bar(x - width / 2, results['Testing Accuracy'], width, color='blue', label='Testing Accuracy')
plt.bar(x + width / 2, results['Training Accuracy'], width, color='red', label='Training Accuracy')
plt.xticks(x, results.index)
plt.xlabel('ML Classifiers')
plt.ylabel('Accuracy')
plt.title('ML Classifiers vs Accuracy')
plt.legend()
plt.show()


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import AdaBoostClassifier

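# Define and fit the classifiers that have not been trained yet
# (predict() raises an error on an unfitted model)
ada_boost_model = AdaBoostClassifier().fit(X_train, y_train)
decision_tree_model = DecisionTreeClassifier().fit(X_train, y_train)
knn_model = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

# Map display names to the fitted models
models = {
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model,
    'Bagging Classifier': bg_model,
    'AdaBoost Classifier': ada_boost_model,
    'Decision Tree Classifier': decision_tree_model,
    'K Neighbors Classifier': knn_model
}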

# Calculate metrics for each model
metrics = {}
for name, model in models.items():
    # Make predictions
    y_pred = model.predict(X_test)
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # Store metrics in dictionary
    metrics[name] = [accuracy, precision, recall, f1]

# Create DataFrame to store results
metrics_df = pd.DataFrame(metrics, index=['Accuracy', 'Precision', 'Recall', 'F1 Score'])

# Print the DataFrame
print(metrics_df)


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# List of classifiers
classifiers = [
    ('KNeighborsClassifier', KNeighborsClassifier()),
    ('AdaBoostClassifier', AdaBoostClassifier()),
    ('BaggingClassifier', BaggingClassifier()),
    ('GradientBoostingClassifier', GradientBoostingClassifier()),
    ('RandomForestClassifier', RandomForestClassifier()),
    ('DecisionTreeClassifier', DecisionTreeClassifier())
]

# List to store scores
scores = []

# Loop over classifiers
for name, clf in classifiers:
    # Fit the classifier
    clf.fit(X_train, y_train)
    # Predict the test set results
    y_pred = clf.predict(X_test)
    # Calculate scores
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted')
    rec = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    # Append scores to the list
    scores.append((name, acc, prec, rec, f1))

# Convert the list to a DataFrame
df_scores = pd.DataFrame(scores, columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

# Print the DataFrame
print(df_scores)


In [None]:
import matplotlib.pyplot as plt

# Plotting F1 Score
plt.figure(figsize=(12,6))
plt.bar(df_scores['Classifier'], df_scores['F1 Score'])
plt.xlabel('Classifiers')
plt.ylabel('F1 Score')
plt.title('F1 Score of Different Classifiers')
plt.xticks(rotation=45)
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, RegularPolygon
from pandas.plotting import register_matplotlib_converters
import math

register_matplotlib_converters()

# Assuming df_scores is the DataFrame containing the scores
categories=list(df_scores)[1:]
N = len(categories)

# Prepare data
data = df_scores.set_index('Classifier').T.to_dict('list')
data = {k: v for k, v in data.items()}

# Create the radar chart
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, polar=True)
ax.set_rlabel_position(0)
ax.set_theta_offset(math.pi / 2)
ax.set_theta_direction(-1)
ax.set_ylim(bottom=0, top=max(df_scores.loc[:, categories].max()))

# Draw one axe
ax.plot([theta for theta in range(N)], [0 for _ in range(N)],
        linestyle='--', label='Average')
ax.fill([theta for theta in range(N)], [0 for _ in range(N)], 'white')

# Draw each category
for d, color in zip(data.values(), colors):
    ax.plot(theta, d, color=color, linewidth=1, linestyle='solid')
    ax.fill(theta, d, facecolor=color, alpha=0.25)

# Add labels
ax.set_xticks(theta)
ax.set_xticklabels(categories)

# Add legend
ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1.1))

plt.show()


In [None]:
import matplotlib.pyplot as plt

# Define colors
colors = plt.cm.rainbow(np.linspace(0, 1, len(data)))


In [None]:
import math
import numpy as np
import matplotlib.pyplot as plt

# Assuming df_scores is the DataFrame containing the scores
categories = list(df_scores)[1:]
N = len(categories)

# One angle per metric, evenly spaced around the circle
theta = np.linspace(0, 2 * math.pi, N, endpoint=False)

# Prepare data: classifier name -> list of metric values
data = df_scores.set_index('Classifier').T.to_dict('list')

# Define colors
colors = plt.cm.rainbow(np.linspace(0, 1, len(data)))

# Create the radar chart
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, polar=True)
ax.set_rlabel_position(0)
ax.set_theta_offset(math.pi / 2)
ax.set_theta_direction(-1)
ax.set_ylim(bottom=0, top=max(df_scores.loc[:, categories].max()))

# Draw each classifier as a closed polygon (repeat the first point at the end)
closed_theta = np.append(theta, theta[0])
for (name, values), color in zip(data.items(), colors):
    closed_values = values + values[:1]
    ax.plot(closed_theta, closed_values, color=color, linewidth=1, linestyle='solid', label=name)
    ax.fill(closed_theta, closed_values, facecolor=color, alpha=0.25)

# Add labels
ax.set_xticks(theta)
ax.set_xticklabels(categories)

# Add legend
ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1.1))

plt.show()


In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Initialize the models
rf_model = RandomForestClassifier()
gb_model = GradientBoostingClassifier()
bg_model = BaggingClassifier()
ada_boost_model = AdaBoostClassifier()
decision_tree_model = DecisionTreeClassifier()
knn_model = KNeighborsClassifier(n_neighbors = 7)

# Train the models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
bg_model.fit(X_train, y_train)
ada_boost_model.fit(X_train, y_train)
decision_tree_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)

# List of trained models
models = [rf_model, gb_model, bg_model, ada_boost_model, decision_tree_model, knn_model]

# List of model names
model_names = ['Random Forest', 'Gradient Boosting', 'Bagging', 'AdaBoost', 'Decision Tree', 'KNN']


In [None]:
import pickle

# Create pickle files for each model
for i in range(len(models)):
    with open(f'{model_names[i]}.pkl', 'wb') as file:
        pickle.dump(models[i], file)


In [None]:
import pickle

# List of trained models
models = [rf_model, gb_model, bg_model, ada_boost_model, decision_tree_model, knn_model]

# List of model names
model_names = ['RandomForest', 'GradientBoosting', 'Bagging', 'AdaBoost', 'DecisionTree', 'KNN']

# Loop through each model and save it as a pickle file
for i in range(len(models)):
    with open(f'{model_names[i]}_model.pkl', 'wb') as file:
        pickle.dump(models[i], file)
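
A saved model is only useful if it can be loaded again and gives the same predictions. The sketch below shows a save-and-load round trip for a single classifier trained on synthetic data; the temporary directory and the variable names are assumptions for illustration, and the file name simply follows the `{name}_model.pkl` pattern used above.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "DecisionTree_model.pkl")

    # Save the fitted model
    with open(path, "wb") as file:
        pickle.dump(clf, file)

    # Load it back, as a separate script or session would
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same)
```

Since the models in this notebook were trained on scaled features, the fitted `MMScaler` would presumably need to be pickled alongside them so that new samples can be scaled the same way before prediction. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.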