## Algorithms Used for Classification
1. CART (Classification and Regression Trees)
2. Gaussian Naive Bayes / Naive Bayes
3. Gradient Boosting Machines (AdaBoost)
4. K-Nearest Neighbors (K-NN)
5. Logistic Regression
6. Multi-Layer Perceptron (MLP)
7. Perceptron
8. Random Forest
9. Support Vector Machines (SVM)
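
Every classifier in this list is evaluated with the same load, split, fit, score pattern. As a rough illustration, the sketch below runs that pattern over three of the listed classifiers in one loop. It assumes a synthetic dataset from `make_classification` as a stand-in for the CSV file, so the printed accuracies are not results for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 8 features and a binary target
X, Y = make_classification(n_samples=400, n_features=8, random_state=7)

# Same 80:20 split for every model so the accuracies are comparable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=7)

models = {
    "CART": DecisionTreeClassifier(max_depth=5, random_state=7),
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=200),
}

# Fit each model on the training rows and score it on the held-out rows
scores = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    scores[name] = model.score(X_test, Y_test)
    print("%s accuracy: %.3f%%" % (name, scores[name] * 100.0))
```

Sharing one split across models keeps the comparison fair, since every model sees the same held-out rows.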

### 1. CART (Classification and Regression Trees) - DecisionTree Classifier
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy
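
Accuracy is the fraction of test rows whose predicted class matches the true class, and it is what a scikit-learn classifier's `score` method returns. The sketch below checks that equivalence on synthetic data (an assumption, used only so the example runs without the CSV file).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 8 features and a binary target
X, Y = make_classification(n_samples=300, n_features=8, random_state=50)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=50)

model = DecisionTreeClassifier(max_depth=5, random_state=50).fit(X_train, Y_train)
predicted = model.predict(X_test)

# Three ways to compute the same number
score_accuracy = model.score(X_test, Y_test)
metric_accuracy = accuracy_score(Y_test, predicted)
fraction_correct = (predicted == Y_test).mean()

print("Accuracy: %.3f%%" % (score_accuracy * 100.0))
```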

In [1]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size and random seed for reproducibility
test_size = 0.20
random_seed = 50  # You can change this value

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=random_seed)

# Initialize and train the Decision Tree Classifier with hyperparameters
max_depth = 5  # You can adjust this value
min_samples_split = 2  # You can adjust this value
min_samples_leaf = 1  # You can adjust this value

model = DecisionTreeClassifier(
    max_depth=max_depth,
    min_samples_split=min_samples_split,
    min_samples_leaf=min_samples_leaf,
    random_state=random_seed
)

model.fit(X_train, Y_train)

# Evaluate the accuracy
accuracy = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (accuracy * 100.0))



### 2. Gaussian Naive Bayes
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create a Gaussian Naive Bayes classifier
model = GaussianNB(priors=None, var_smoothing=1e-9)
# Hyperparameters:
# - priors: You can specify class prior probabilities if you have prior knowledge.
# - var_smoothing: A smoothing parameter for avoiding zero variances.

# Train the model on the training data
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))


### 3. Gradient Boosting Machines (AdaBoost)
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create an AdaBoost classifier
model = AdaBoostClassifier(n_estimators=50, random_state=seed)
# Hyperparameters:
# - n_estimators: The number of weak classifiers (base estimators) to train. You can adjust this to control the complexity of the ensemble.
# - random_state: The random seed for reproducibility. You can set this to a specific value if you want consistent results.

# Train the model on the training data
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))
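
To see how `n_estimators` affects the ensemble, one option is `staged_score`, which yields the test accuracy after each boosting round of an already fitted model. The sketch below is a minimal illustration on synthetic data (an assumption; the numbers are not results for the diabetes data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with 8 features and a binary target
X, Y = make_classification(n_samples=400, n_features=8, random_state=7)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=7)

model = AdaBoostClassifier(n_estimators=50, random_state=7)
model.fit(X_train, Y_train)

# One accuracy value per boosting round actually performed
staged_accuracy = list(model.staged_score(X_test, Y_test))

# Report a few rounds; boosting can stop early, so clamp the indices
rounds = sorted({1, min(10, len(staged_accuracy)), len(staged_accuracy)})
for n in rounds:
    print("Rounds: %d  Accuracy: %.3f%%" % (n, staged_accuracy[n - 1] * 100.0))
```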


### 4. K-Nearest Neighbors (K-NN)
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create a K-Nearest Neighbors (K-NN) classifier
model = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')
# Hyperparameters:
# - n_neighbors: The number of nearest neighbors to consider when making predictions. You can adjust this to control the model's sensitivity to local patterns.
# - weights: Determines how the neighbors' contributions are weighted (e.g., 'uniform' or 'distance'). You can choose the appropriate weighting strategy.
# - algorithm: The algorithm used to compute the nearest neighbors ('auto', 'ball_tree', 'kd_tree', or 'brute'). You can choose the most suitable algorithm based on your data size and structure.

# Train the model on the training data
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))
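
The effect of `n_neighbors` and `weights` is easiest to judge by trying a few combinations on the same split. The sketch below does this on synthetic data (an assumption). With a single neighbor the weighting scheme cannot change the prediction, so the two `n_neighbors=1` rows should agree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in data with 8 features and a binary target
X, Y = make_classification(n_samples=400, n_features=8, random_state=7)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=7)

# Score every combination of neighborhood size and weighting strategy
knn_scores = {}
for k in (1, 5, 15):
    for weights in ("uniform", "distance"):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        model.fit(X_train, Y_train)
        knn_scores[(k, weights)] = model.score(X_test, Y_test)
        print("k=%d weights=%s accuracy: %.3f%%" % (k, weights, knn_scores[(k, weights)] * 100.0))
```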


### 5. Logistic Regression
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create a Logistic Regression model
model = LogisticRegression(max_iter=200, solver='lbfgs', C=1.0)
# Hyperparameters:
# - max_iter: The maximum number of iterations for the solver to converge. You can adjust this if the model does not converge.
# - solver: The algorithm to use for optimization ('lbfgs', 'liblinear', etc.). Choose an appropriate solver for your data and problem.
# - C: Inverse of regularization strength. Smaller values increase regularization. You can adjust this to control the trade-off between fitting the data and preventing overfitting.

# Train the model on the training data
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))
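
Because `C` is the inverse of the regularization strength, a smaller `C` pulls the coefficients toward zero. The sketch below illustrates this by comparing the size of the coefficient vector for a few values of `C` on synthetic data (an assumption; the specific values of `C` are chosen only for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic stand-in data with 8 features and a binary target
X, Y = make_classification(n_samples=400, n_features=8, random_state=7)

# Fit the same model with strong, default, and weak regularization
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000).fit(X, Y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print("C=%g  coefficient norm: %.3f" % (C, coef_norms[C]))
```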


### 6. Multi-Layer Perceptron (MLP)
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create an MLP-based model
model = MLPClassifier(hidden_layer_sizes=(65, 32), activation='relu', solver='adam', max_iter=200, random_state=seed)
# Hyperparameters:
# - hidden_layer_sizes: The number of neurons in each hidden layer. You can customize the architecture by adjusting this parameter.
# - activation: The activation function used in the hidden layers ('relu', 'tanh', etc.). Choose the appropriate one for your problem.
# - solver: The algorithm for weight optimization ('adam', 'lbfgs', etc.). Select the one that works best for your data.
# - max_iter: The maximum number of iterations for the solver to converge. You can adjust this if the model does not converge.
# - random_state: The random seed for reproducibility. Set this to a specific value for consistent results.

# Train the model
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))


### 7. Perceptron
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create a Perceptron classifier
model = Perceptron(max_iter=200, random_state=seed, eta0=1.0, tol=1e-3)
# Hyperparameters:
# - max_iter: The maximum number of iterations for the solver to converge. You can adjust this if the model does not converge.
# - random_state: The random seed for reproducibility. Set this to a specific value for consistent results.
# - eta0: The constant by which the weight updates are multiplied (the learning rate). You can control the step size for weight updates by adjusting this.
# - tol: The tolerance for stopping criterion. The model will stop training when the change in the average loss is smaller than this value.

# Train the model
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))


### 8. Random Forest
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create a Random Forest classifier
rfmodel = RandomForestClassifier(n_estimators=100, random_state=seed, max_depth=None, min_samples_split=2, min_samples_leaf=1)
# Hyperparameters:
# - n_estimators: The number of decision trees in the random forest. Adjust this to control the ensemble size.
# - random_state: The random seed for reproducibility. Set this to a specific value for consistent results.
# - max_depth: The maximum depth of the decision trees. You can limit tree depth to prevent overfitting.
# - min_samples_split: The minimum number of samples required to split a node. Adjust this to control tree node splitting.
# - min_samples_leaf: The minimum number of samples required in a leaf node. You can adjust this to control tree leaf size.

# Train the model
rfmodel.fit(X_train, Y_train)

# Evaluate the accuracy
result = rfmodel.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))


### 9. Support Vector Machines (SVM)
- Sampling Technique - Train/Test Split (80:20)
- Classification Metrics - Accuracy

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the dataset
filename = 'D:/DataViz/BDA/pima-indians-diabetes.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Set the test size
test_size = 0.20  # Hyperparameter: Fraction of the dataset to use for testing
seed = 7

# Split the dataset into test and train
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Create an SVM classifier
model = SVC(kernel='linear', C=1.0, random_state=seed)
# Hyperparameters:
# - kernel: The type of kernel to use ('linear', 'poly', 'rbf', etc.). Choose the appropriate kernel for your problem.
# - C: The regularization parameter. Smaller values increase regularization. You can adjust this to control the trade-off between fitting the data and preventing overfitting.
# - random_state: The random seed, used only when probability=True (to shuffle the data for probability estimates). It has no effect with the settings above.

# Train the model
model.fit(X_train, Y_train)

# Evaluate the accuracy
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))


## Algorithms Used for Regression
1. CART (Classification and Regression Trees)
2. Elastic Net
3. Gradient Boosting Machines (AdaBoost)
4. K-Nearest Neighbors (K-NN)
5. Lasso and Ridge Regression
6. Linear Regression
7. Multi-Layer Perceptron (MLP)
8. Random Forest
9. Support Vector Machines (SVM)

**When comparing models, a lower MAE is generally better.**
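
scikit-learn scorers follow a "higher is better" convention, so the `neg_mean_absolute_error` scorer used throughout this section returns the MAE with its sign flipped. The sketch below shows the raw scores and the negation that recovers the MAE. It assumes synthetic data from `make_regression` in place of the housing CSV.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: synthetic stand-in data with 13 features
X, Y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

kfold = KFold(n_splits=10)
neg_mae = cross_val_score(LinearRegression(), X, Y, cv=kfold,
                          scoring='neg_mean_absolute_error')

# Raw scores are zero or negative; negate them to get the MAE per fold
mae_per_fold = -neg_mae
print("Raw scores (negated MAE):", neg_mae.round(3))
print("MAE: %.3f (%.3f)" % (mae_per_fold.mean(), mae_per_fold.std()))
```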

### 1. CART (Classification and Regression Trees) - DecisionTree Regressor
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the Decision Tree Regressor (cross_val_score fits it on each fold)
model = DecisionTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=None)
# Hyperparameters:
# - max_depth: The maximum depth of the decision tree. You can limit tree depth to prevent overfitting.
# - min_samples_split: The minimum number of samples required to split an internal node. Adjust this to control node splitting.
# - min_samples_leaf: The minimum number of samples required in a leaf node. You can adjust this to control leaf size.
# - random_state: The random seed for reproducibility. Set this to a specific value for consistent results.

# Calculate the mean absolute error
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (-results.mean(), results.std()))
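
`KFold` without shuffling assigns contiguous blocks of rows to each test fold, which matters if the file is sorted in some way. The sketch below contrasts that default with `shuffle=True`, where `random_state` makes the shuffled folds repeatable. The 25-row array is a hypothetical stand-in for the dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Assumption: 25 hypothetical rows with 2 features each
X = np.arange(50).reshape(25, 2)

# Default behaviour: folds are contiguous blocks of row indices
plain = KFold(n_splits=5)
plain_test_folds = [test_idx for _, test_idx in plain.split(X)]
print("First unshuffled test fold:", plain_test_folds[0])  # → [0 1 2 3 4]

# With shuffle=True, random_state fixes the row order used to build the folds
shuffled = KFold(n_splits=5, shuffle=True, random_state=7)
shuffled_test_folds = [test_idx for _, test_idx in shuffled.split(X)]
repeat_test_folds = [test_idx for _, test_idx in shuffled.split(X)]
print("First shuffled test fold:", shuffled_test_folds[0])
```

Either way, every row appears in exactly one test fold.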


### 2. Elastic Net
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import ElasticNet

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the Elastic Net model (cross_val_score fits it on each fold)
model = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=1000, random_state=None)
# Hyperparameters:
# - alpha: The overall regularization strength applied to the combined L1 and L2 penalty. Larger values shrink the coefficients more.
# - l1_ratio: The mixing parameter for L1 and L2 penalties. A value of 0 corresponds to L2, 1 to L1, and values in between to combinations.
# - max_iter: The maximum number of iterations for the solver to converge. You can adjust this if the model does not converge.
# - random_state: The random seed, used only when selection='random'. With the default selection='cyclic' it has no effect.

# Calculate the mean absolute error
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (-results.mean(), results.std()))
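
The roles of the two Elastic Net parameters can be checked directly: `l1_ratio=1.0` reduces the model to Lasso, while `alpha` scales the whole penalty. The sketch below illustrates both on synthetic data (an assumption; the alpha values are chosen only for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Assumption: synthetic stand-in data with 13 features
X, Y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

# l1_ratio=1.0 leaves only the L1 term, which is the Lasso penalty
lasso = Lasso(alpha=1.0).fit(X, Y)
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, Y)
same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print("l1_ratio=1.0 matches Lasso:", same_as_lasso)

# alpha sets the overall strength: a larger alpha shrinks the coefficients more
weak = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
strong = ElasticNet(alpha=10.0, l1_ratio=0.5).fit(X, Y)
weak_norm = float(np.linalg.norm(weak.coef_))
strong_norm = float(np.linalg.norm(strong.coef_))
print("Coefficient norm, alpha=0.1: %.3f  alpha=10.0: %.3f" % (weak_norm, strong_norm))
```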


### 3. Gradient Boosting Machines (AdaBoost)
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostRegressor

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the AdaBoost Regressor (cross_val_score fits it on each fold)
ada_model = AdaBoostRegressor(n_estimators=50, learning_rate=1.0, random_state=None)
# Hyperparameters:
# - n_estimators: The number of weak regressors to combine in the ensemble. You can adjust this to control the complexity of the ensemble.
# - learning_rate: The contribution of each weak regressor to the final prediction. You can adjust this to control the impact of individual estimators.
# - random_state: The random seed for reproducibility. Set this to a specific value for consistent results.

# Calculate the mean absolute error with AdaBoost
scoring = 'neg_mean_absolute_error'
ada_results = cross_val_score(ada_model, X, Y, cv=kfold, scoring=scoring)
print("AdaBoost MAE: %.3f (%.3f)" % (-ada_results.mean(), ada_results.std()))


### 4. K-Nearest Neighbors (K-NN)
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the K-Nearest Neighbors Regressor (cross_val_score fits it on each fold)
knn_model = KNeighborsRegressor(n_neighbors=5, weights='uniform', algorithm='auto')
# Hyperparameters:
# - n_neighbors: The number of nearest neighbors to consider when making predictions. You can adjust this to control the model's sensitivity to local patterns.
# - weights: Determines how the neighbors' contributions are weighted (e.g., 'uniform' or 'distance'). You can choose the appropriate weighting strategy.
# - algorithm: The algorithm used to compute the nearest neighbors ('auto', 'ball_tree', 'kd_tree', or 'brute'). You can choose the most suitable algorithm based on your data size and structure.

# Calculate the mean absolute error with K-NN
scoring = 'neg_mean_absolute_error'
knn_results = cross_val_score(knn_model, X, Y, cv=kfold, scoring=scoring)
print("K-NN MAE: %.3f (%.3f)" % (-knn_results.mean(), knn_results.std()))


### 5. Lasso and Ridge Regression
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the Lasso Regression model (cross_val_score fits it on each fold)
lasso_model = Lasso(alpha=1.0, max_iter=1000, random_state=None)
# Hyperparameters for Lasso:
# - alpha: The regularization parameter that controls the strength of L1 regularization. Adjust this to control the level of sparsity in the model.
# - max_iter: The maximum number of iterations for the solver to converge. You can adjust this if the model does not converge.
# - random_state: The random seed, used only when selection='random'. With the default selection='cyclic' it has no effect.

# Calculate the mean absolute error with Lasso
scoring = 'neg_mean_absolute_error'
lasso_results = cross_val_score(lasso_model, X, Y, cv=kfold, scoring=scoring)
print("Lasso MAE: %.3f (%.3f)" % (-lasso_results.mean(), lasso_results.std()))

# Define the Ridge Regression model (cross_val_score fits it on each fold)
ridge_model = Ridge(alpha=1.0, max_iter=1000, random_state=None)
# Hyperparameters for Ridge:
# - alpha: The regularization parameter that controls the strength of L2 regularization. Larger values shrink the coefficients more.
# - max_iter: The maximum number of iterations for the iterative solvers. The default solver='auto' may pick a direct solver, in which case it has no effect.
# - random_state: The random seed, used only by the 'sag' and 'saga' solvers to shuffle the data.

# Calculate the mean absolute error with Ridge
ridge_results = cross_val_score(ridge_model, X, Y, cv=kfold, scoring=scoring)
print("Ridge MAE: %.3f (%.3f)" % (-ridge_results.mean(), ridge_results.std()))
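
The practical difference between the two penalties is that L1 can set coefficients exactly to zero, while L2 only shrinks them. The sketch below counts zero coefficients for both models on synthetic data in which only 4 of the 13 features carry signal (an assumption, as is the `alpha=5.0` used for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Assumption: synthetic data where only 4 of 13 features are informative
X, Y = make_regression(n_samples=200, n_features=13, n_informative=4,
                       noise=10.0, random_state=7)

lasso_model = Lasso(alpha=5.0).fit(X, Y)
ridge_model = Ridge(alpha=5.0).fit(X, Y)

# Count coefficients that are exactly zero in each fitted model
n_zero_lasso = int(np.sum(lasso_model.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge_model.coef_ == 0.0))
print("Zero coefficients with Lasso: %d of %d" % (n_zero_lasso, X.shape[1]))
print("Zero coefficients with Ridge: %d of %d" % (n_zero_ridge, X.shape[1]))
```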


### 6. Linear Regression
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the Linear Regression model (cross_val_score fits it on each fold)
model = LinearRegression()

# Calculate the mean absolute error
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (-results.mean(), results.std()))


### 7. Multi-Layer Perceptron (MLP)
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the MLP Regressor with specified hyperparameters (cross_val_score fits it on each fold)
mlp_model = MLPRegressor(
    hidden_layer_sizes=(100, 50),  # Hyperparameter: Adjust the architecture as needed, specifying the number and size of hidden layers.
    activation='relu',             # Hyperparameter: Choose an activation function ('identity', 'logistic', 'tanh', or 'relu').
    solver='adam',                 # Hyperparameter: Choose an optimization algorithm ('lbfgs', 'sgd', or 'adam').
    learning_rate='constant',      # Hyperparameter: Learning rate schedule ('constant', 'invscaling', 'adaptive'); only used when solver='sgd'.
    max_iter=1000,                 # Hyperparameter: Adjust the maximum number of iterations for training.
    random_state=50                # Hyperparameter: Set a random seed for reproducibility.
)

# Calculate the mean absolute error with MLP
scoring = 'neg_mean_absolute_error'
mlp_results = cross_val_score(mlp_model, X, Y, cv=kfold, scoring=scoring)
print("MLP MAE: %.3f (%.3f)" % (-mlp_results.mean(), mlp_results.std()))


### 8. Random Forest
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the Random Forest Regressor with specified hyperparameters (cross_val_score fits it on each fold)
rf_model = RandomForestRegressor(
    n_estimators=100,       # Hyperparameter: The number of trees in the forest. You can adjust this for ensemble size.
    max_depth=None,         # Hyperparameter: The maximum depth of each tree. Adjust to control tree depth.
    min_samples_split=2,    # Hyperparameter: The minimum number of samples required to split an internal node. Adjust to control node splitting.
    min_samples_leaf=1,     # Hyperparameter: The minimum number of samples required in a leaf node. Adjust to control leaf size.
    random_state=42         # Hyperparameter: Set a random seed for reproducibility.
)

# Calculate the mean absolute error with Random Forest
scoring = 'neg_mean_absolute_error'
rf_results = cross_val_score(rf_model, X, Y, cv=kfold, scoring=scoring)
print("Random Forest MAE: %.3f (%.3f)" % (-rf_results.mean(), rf_results.std()))


### 9. Support Vector Machines (SVM)
- Sampling Technique - K-fold Cross Validation (k=10)
- Regression Metrics - MAE

In [None]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Load the dataset
filename = 'D:/DataViz/BDA/housing.csv'
dataframe = read_csv(filename)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

# Define 10-fold cross-validation (no shuffling, so random_state stays None)
kfold = KFold(n_splits=10, random_state=None)

# Define the Support Vector Regressor (SVR) with specified hyperparameters (cross_val_score fits it on each fold)
svm_model = SVR(
    kernel='rbf',           # Hyperparameter: The kernel function to use ('linear', 'poly', 'rbf', etc.).
    C=1.0,                  # Hyperparameter: The regularization parameter. Adjust this to control the trade-off between margin width and error.
    epsilon=0.1,            # Hyperparameter: The epsilon-tube within which no penalty is associated with errors.
)

# Calculate the mean absolute error with SVM
scoring = 'neg_mean_absolute_error'
svm_results = cross_val_score(svm_model, X, Y, cv=kfold, scoring=scoring)
print("SVM MAE: %.3f (%.3f)" % (-svm_results.mean(), svm_results.std()))


In [None]:
import joblib

# Save the Random Forest classifier (rfmodel, trained in the classification section) to a file.
# The target directory must already exist.
model_filename = 'D:/sample_model/random_forest_model.pkl'
joblib.dump(rfmodel, model_filename)

In [None]:
import joblib

# Load the saved Random Forest model
model_filename = 'D:/sample_model/random_forest_model.pkl'
loaded_model = joblib.load(model_filename)

# Define sample input data in the same column order used for training:
# [[preg, plas, pres, skin, test, mass, pedi, age]]
# Replace the values below with the ones you want to test
sample_input = [[0, 100, 72, 35, 0, 33.6, 0.627, 50]]

# Make predictions using the loaded model
predictions = loaded_model.predict(sample_input)

# Define messages based on the predicted class
if predictions[0] == 1:
    print("The person is diabetic")
else:
    print("The person is non-diabetic")
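
A quick way to confirm that saving and loading worked is to check that the reloaded model gives the same predictions as the original. The sketch below does this round trip in a temporary directory, with a small Random Forest trained on synthetic data (both assumptions, chosen so the example runs without the CSV file or the `D:/sample_model` folder).

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic stand-in data with 8 features and a binary target
X, Y = make_classification(n_samples=200, n_features=8, random_state=7)
rfmodel = RandomForestClassifier(n_estimators=20, random_state=7).fit(X, Y)

# Save to, and reload from, a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    model_filename = os.path.join(tmp_dir, 'random_forest_model.pkl')
    joblib.dump(rfmodel, model_filename)
    loaded_model = joblib.load(model_filename)

# The reloaded model should reproduce the original predictions exactly
predictions_match = bool((loaded_model.predict(X) == rfmodel.predict(X)).all())
print("Predictions match after reload:", predictions_match)
```

A saved model should be loaded with the same scikit-learn version that produced it, since compatibility across versions is not guaranteed.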
