## MachineLearningEngine Class

The MachineLearningEngine class builds on the CoreEngine class. CoreEngine serves as the parent class for engines that focus on data, while MachineLearningEngine is for engines that focus on learning from data.

In [None]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine

#Creates an empty MachineLearningEngine object and prints it
engine = MachineLearningEngine()
engine.print()

## MachineLearningAnalysis Class

The MachineLearningAnalysis class builds on the Analysis class, which is used to perform analysis on the data.

In [None]:
from src.StreamPort.ml.MachineLearningAnalysis import MachineLearningAnalysis

# Creates an empty MachineLearningAnalysis object and prints it
analysis = MachineLearningAnalysis()
analysis.print()

#### Load the CSV File  

This cell loads the dataset from a CSV file and creates a list of analysis objects. The data is then assembled into a matrix whose rows are labeled with the analysis names, ready to be visualized, for example with a scatter plot.

In [None]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from sklearn.decomposition import PCA 
import matplotlib.pyplot as plt

# Creates a MachineLearningEngine object and loads the analyses from the CSV file
path = 'feature_list.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

engine.print()

print("List of analysis objects:")
for analysis in engine._analyses:
    print(f"Analysis: {analysis.name}")
    for key, value in analysis.data.items():
        print(f"{key}: {value}")
    print("\n")

rownames = engine.get_analyses_names()
print("Analysis names: ", rownames)

mat = engine.get_data()
mat.index = rownames
print("Matrix: \n", mat)
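
For readers without the package at hand, the same idea (one row per analysis, one column per feature, row labels taken from the analysis names) can be sketched with pandas alone. The CSV content, the column names, and the variable names below are made up for illustration; they are not the layout of `feature_list.csv` and not how the engine reads it.

```python
import io

import pandas as pd

# Hypothetical CSV: the first column holds the analysis name, the rest are features.
csv_text = """name,feature_1,feature_2,feature_3
analysis_a,1.0,2.0,3.0
analysis_b,4.0,5.0,6.0
"""

table = pd.read_csv(io.StringIO(csv_text))

# Use the analysis names as row labels and keep only the numeric features.
rownames = table["name"].tolist()
mat = table.drop(columns=["name"])
mat.index = rownames

print("Analysis names:", rownames)
print("Matrix:\n", mat)
```

Labeling the rows this way lets a single analysis be looked up by name, for example `mat.loc["analysis_b"]`.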


#### Make a Principal Component Analysis (PCA)

This step configures the machine learning engine to perform PCA on the dataset and visualize the results. ProcessingSettings is the parent class of MakePCA; ProcessingSettings objects are used to assemble data processing workflows within each engine. The subclass MakePCASKL of MakePCA uses the scikit-learn algorithm to perform the PCA (in the code below, the imported settings class is named MakeModelPCASKL).

In [None]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from src.StreamPort.ml.MachineLearningProcessingSettings import  MakeModelPCASKL
import webbrowser

# Creates a MachineLearningEngine object and loads the analyses from the CSV file
path = 'feature_list.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

class_path = 'feature_metadata.csv'
engine.add_classes_from_csv(class_path)

engine.print()
#print(engine.get_classes())

# Make a general data plot:
# the x axis is the index of the features (i.e., the column names),
# the y axis is the value for each analysis,
# and a color legend is applied for each analysis.
engine.plot_data()
webbrowser.open('general_data_plot.html')


# Add the ProcessingSettings to the _settings attribute with add_settings
pca_model = MakeModelPCASKL(n_components=2, center_data=True)
engine.add_settings(pca_model)
engine.print()
# Run the workflow to perform the PCA and collect the results
engine.run_workflow()
# The results are added to the _results attribute of the engine
# Plot the PCA results and classes; check the scores plot before the loadings plot
engine.plot_pca()
webbrowser.open('pca_scores_plot.html')
webbrowser.open('pca_loadings_plot.html')
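
As a rough illustration of what a scores plot and a loadings plot are built from, the sketch below runs scikit-learn's `PCA` directly on synthetic data. It does not show how MakeModelPCASKL is implemented; the random matrix, the variable names, and the explicit centering step (shown on the assumption that this is what `center_data=True` refers to) are all illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 synthetic analyses, 5 synthetic features

# Center each feature. scikit-learn's PCA also centers internally, so this
# step is redundant here and is shown only to make the centering explicit.
X_centered = X - X.mean(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_centered)  # one row per analysis
loadings = pca.components_.T            # one row per feature

print("Scores shape:", scores.shape)
print("Loadings shape:", loadings.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The scores position each analysis in the reduced space, while the loadings indicate how strongly each feature contributes to each component.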


#### Make a Density-Based Spatial Clustering of Applications with Noise (DBSCAN)



In [None]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from src.StreamPort.ml.MachineLearningProcessingSettings import  MakeModelDBSCANSKL

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
import plotly.express as px

# Creates a MachineLearningEngine object and loads the analyses from the CSV file
path = 'feature_list.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

class_path = 'feature_metadata.csv'
engine.add_classes_from_csv(class_path)

engine.print()

data = engine.get_data()
mean = np.mean(data, axis=0)
data = data - mean

pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)
print("Reduced data shape:", data_2d.shape)

# Visualize the data
plt.scatter(data_2d[:, 0], data_2d[:, 1])
plt.title('Data Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

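# Create a k-distance plot.
# Note: kneighbors() on the fitted data counts each point as its own nearest
# neighbor, and the distances here are computed on the full data matrix,
# whereas DBSCAN below is fitted on the 2D PCA scores.
k = 5  # Choose k based on min_samples
neighbors = NearestNeighbors(n_neighbors=k)
neighbors_fit = neighbors.fit(data)
distances, indices = neighbors_fit.kneighbors(data)
distances = np.sort(distances[:, k-1], axis=0)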

plt.plot(distances)
plt.title('K-Distance Plot')
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to {}-th nearest neighbor'.format(k))
plt.show()

# Experiment with DBSCAN parameters
eps = 1.5E6  # Adjust based on the k-distance plot
min_samples = 3  # Adjust based on your data

dbscan = DBSCAN(eps=eps, min_samples=min_samples)
dbscan.fit(data_2d)

labels = dbscan.labels_

# Analyze the clusters
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)

# Create a DataFrame for Plotly
df = pd.DataFrame(data_2d, columns=['PC1', 'PC2'])
df['Cluster'] = labels.astype(str)  # Convert to string for categorical coloring

# Plot the results using Plotly
fig = px.scatter(df, x='PC1', y='PC2', color='Cluster', title='DBSCAN Clustering Results',
                 labels={'PC1': 'Principal Component 1', 'PC2': 'Principal Component 2'})

fig.show()
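
The cell above reads `eps` off a k-distance plot. A minimal sketch of that heuristic follows, with the k-distances computed on the same matrix that DBSCAN clusters and with `k` equal to `min_samples`. The data are synthetic blobs standing in for the PCA scores, and taking the 95th percentile is only a crude stand-in for picking the elbow by eye; neither is taken from the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic 2D data standing in for the PCA scores.
points, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

min_samples = 5

# kneighbors() on the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the distance to the
# (min_samples - 1)-th other point, matching DBSCAN's core-point rule,
# which also counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)
k_distances = np.sort(distances[:, -1])

# Crude stand-in for reading the elbow off the plot: take a high percentile.
eps = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Because `eps` is expressed in the units of the clustered matrix, a value read from distances in the full feature space would generally not transfer to the 2D scores.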


### Uniform Manifold Approximation and Projection (UMAP)

In [1]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from src.StreamPort.ml.MachineLearningProcessingSettings import  MakeModelUMAP

# Creates a MachineLearningEngine object and loads the analyses from the CSV file
path = 'feature_list.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

class_path = 'feature_metadata.csv'
engine.add_classes_from_csv(class_path)
engine.print()

umap_model = MakeModelUMAP(n_neighbors=15, min_dist=0.1, n_components=2,random_state=42)
engine.add_settings(umap_model)
engine.print()
engine.run_workflow()
engine.plot_umap()


Structure of the CSV file: {'number_of_rows': 45, 'number_of_columns': 4445}
Structure of the CSV file: {'number_of_rows': 45, 'number_of_columns': 2}

MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-14 13:13:21.954842 
  analyses: 45 
  settings: 0 


MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-14 13:13:21.954842 
  analyses: 45 
  settings: 1 

Running workflow with settings: MakeModel


  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")


In [1]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from src.StreamPort.ml.MachineLearningProcessingSettings import  MakeModelUMAP

path = 'feature_list.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

class_path = 'new_metadata.csv'

# for March
engine.month_march(class_path)
engine.print()
umap_model_march = MakeModelUMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
engine.add_settings(umap_model_march)
engine.run_workflow()
engine.plot_umap()

# for April
engine.month_april(class_path)
engine.print()
umap_model_april = MakeModelUMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
engine.add_settings(umap_model_april)
engine.run_workflow()
engine.plot_umap()

Structure of the CSV file: {'number_of_rows': 45, 'number_of_columns': 4445}

MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-15 13:57:24.908298 
  analyses: 45 
  settings: 0 

Running workflow with settings: MakeModel


  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")



MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-15 13:57:24.908298 
  analyses: 45 
  settings: 1 

Running workflow with settings: MakeModel



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [3]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from src.StreamPort.ml.MachineLearningProcessingSettings import  MakeModelUMAP

path = 'feature_list.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

class_path = 'new_metadata.csv'

# for April
engine.month_april(class_path)
engine.print()
umap_model_april = MakeModelUMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
engine.add_settings(umap_model_april)
engine.run_workflow()
engine.plot_umap()

Structure of the CSV file: {'number_of_rows': 45, 'number_of_columns': 4445}

MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-14 12:34:40.990528 
  analyses: 45 
  settings: 0 

Running workflow with settings: MakeModel



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.




### Decision Tree Classifier

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('new_metadata.csv')
target_column = 'class'

label_encoders = {}
for column in df.columns:
    if df[column].dtype == 'object':  
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le  

X = df.drop(columns=[target_column])
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.7777777777777778
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      0.50      0.67         2
           2       1.00      0.50      0.67         2
           3       1.00      1.00      1.00         2
           4       0.33      1.00      0.50         1

    accuracy                           0.78         9
   macro avg       0.87      0.80      0.77         9
weighted avg       0.93      0.78      0.80         9

Confusion Matrix:
 [[2 0 0 0 0]
 [0 1 0 0 1]
 [0 0 1 0 1]
 [0 0 0 2 0]
 [0 0 0 0 1]]
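
The cell above fits a single `DecisionTreeClassifier`. For comparison, a random forest can be swapped in with the same fit and predict calls. The sketch below does this on synthetic data from `make_classification`, not on `new_metadata.csv`; the hyperparameters and the stratified split are illustrative choices rather than settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 10 features, 3 classes.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "random forest": RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=42
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {accuracies[name]:.2f}")
```

With a test set as small as the nine samples reported above, accuracy differences between models of this kind are likely to be within noise.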


In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('groups_classes.csv')
target_column = 'class'

label_encoders = {}
for column in df.columns:
    if df[column].dtype == 'object': 
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le  

X = df.drop(columns=[target_column])
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
unique_classes = y_test.unique()
print("Classification Report:\n", classification_report(y_test, y_pred, labels=unique_classes, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.673469387755102
Classification Report:
               precision    recall  f1-score   support

           3       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         3
           7       0.00      0.00      0.00         3
          12       1.00      0.94      0.97        16
           6       0.83      1.00      0.91         5
           5       0.89      1.00      0.94         8
           8       0.40      1.00      0.57         2
          11       0.00      0.00      0.00         2
           2       0.00      0.00      0.00         3
           4       0.00      0.00      0.00         1
           0       0.40      1.00      0.57         2
          10       0.17      1.00      0.29         1
          13       0.00      0.00      0.00         1

   micro avg       0.72      0.67      0.69        49
   macro avg       0.28      0.46      0.33        49
weighted avg       0.59      0.67      0.61        49

Confusion Matrix:
 [[ 2  0 
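
The rows of zeros in the report above belong to classes that appear in `y_test` but are never predicted correctly. A small hand-made example (the labels below are invented) shows how `zero_division=0` behaves in that case, and how passing the same sorted `labels` to both functions keeps the report rows and the confusion-matrix rows in the same order.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Tiny hand-made example: class 2 occurs in y_true but is never predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0]

# Sorted labels keep the report rows and the matrix rows in the same order.
labels = sorted(set(y_true))

report = classification_report(
    y_true, y_pred, labels=labels, zero_division=0, output_dict=True
)
matrix = confusion_matrix(y_true, y_pred, labels=labels)

# Precision for class 2 is 0/0; zero_division=0 reports it as 0.0 without a warning.
print("Class 2 precision:", report["2"]["precision"])
print("Class 2 recall:", report["2"]["recall"])
print("Confusion matrix:\n", matrix)
```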

### NEW DATA ###

Plot the 'neg' and 'pos' classes with UMAP.

In [12]:
import pandas as pd
import numpy as np
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine, MachineLearningAnalysis
from src.StreamPort.ml.MachineLearningProcessingSettings import MakeModelUMAP

path = 'groups_ints.csv'
class_path = 'groups_classes.csv'
df = pd.read_csv(class_path)

df_pos = df[df['polarity'] == 'positive']
df_neg = df[df['polarity'] == 'negative']

#process and plot neg class
print("plot neg class")
engine = MachineLearningEngine() 
engine.add_analyses_from_csv(path)

#add neg classes to the engine
for index, row in df_neg.iterrows():
    row_value = row.tolist()[1:]
    class_name = row['class']
    ana = MachineLearningAnalysis(name=str(class_name), data={"x": np.array(df_neg.columns.tolist()[1:]), "y": np.array(row_value)})
    if ana.validate():
        engine.add_classes(class_name)
    else:
        print(f"Analysis {class_name} did not pass validation.")

umap_model_neg = MakeModelUMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
engine.add_settings(umap_model_neg)
engine.run_workflow()
engine.plot_umap()  

#process and plot the pos classes
print("plot pos classes")
engine = MachineLearningEngine()  
engine.add_analyses_from_csv(path)

#add pos classes to the engine
for index, row in df_pos.iterrows():
    row_value = row.tolist()[1:]
    class_name = row['class']
    ana = MachineLearningAnalysis(name=str(class_name), data={"x": np.array(df_pos.columns.tolist()[1:]), "y": np.array(row_value)})
    if ana.validate():
        engine.add_classes(class_name)
    else:
        print(f"Analysis {class_name} did not pass validation.")

umap_model_pos = MakeModelUMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
engine.add_settings(umap_model_pos)
engine.run_workflow()
engine.plot_umap()  


plot neg class
Structure of the CSV file: {'number_of_rows': 202, 'number_of_columns': 2006}
Running workflow with settings: MakeModel


plot pos classes
Structure of the CSV file: {'number_of_rows': 202, 'number_of_columns': 2006}
Running workflow with settings: MakeModel
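
The negative and positive blocks above differ only in the polarity filter, so they could be folded into one loop. The sketch below shows just the grouping step with pandas. The small DataFrame is hypothetical (only the `polarity` and `class` column names come from the cell above), and the engine calls are left out because their behavior is not verified here.

```python
import pandas as pd

# Hypothetical metadata; only the 'polarity' and 'class' column names
# are taken from the notebook cell.
df = pd.DataFrame(
    {
        "analysis": ["a1", "a2", "a3", "a4"],
        "polarity": ["positive", "negative", "positive", "negative"],
        "class": ["blank", "sample", "sample", "blank"],
    }
)

classes_by_polarity = {}
for polarity, subset in df.groupby("polarity"):
    # One pass per polarity; an engine would be built and plotted here.
    classes_by_polarity[polarity] = subset["class"].tolist()
    print(f"{polarity}: {len(subset)} analyses, classes = {classes_by_polarity[polarity]}")
```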


PCA plot for the new data.

In [31]:
from src.StreamPort.ml.MachineLearningEngine import MachineLearningEngine
from src.StreamPort.ml.MachineLearningProcessingSettings import  MakeModelPCASKL
import webbrowser

# Creates a MachineLearningEngine object and loads the analyses from the CSV file
path = 'groups_ints.csv'
engine = MachineLearningEngine()
engine.add_analyses_from_csv(path)

class_path = 'groups_classes.csv'
engine.add_classes_from_csv(class_path)
engine.print()
engine.plot_data()
webbrowser.open('general_data_plot.html')

pca_model = MakeModelPCASKL(n_components = 2, center_data= True)
engine.add_settings(pca_model)
engine.print()
engine.run_workflow()
engine.plot_pca()
webbrowser.open('pca_scores_plot.html')
webbrowser.open('pca_loadings_plot.html')

Structure of the CSV file: {'number_of_rows': 202, 'number_of_columns': 2006}
Structure of the CSV file: {'number_of_rows': 245, 'number_of_columns': 6}

MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-10 01:57:56.439960 
  analyses: 202 
  settings: 0 


MachineLearningEngine 
  name: None 
  author: None 
  path: None 
  date: 2024-10-10 01:57:56.439960 
  analyses: 202 
  settings: 1 

Running workflow with settings: MakeModel


True