oesterei/JNBs

JNBs using a Python 3 kernel

The Jupyter Notebook – SKLearnDecisionTree.ipynb – uses the SKLearn module to generate a decision tree. The notebook analyzes the Breast Cancer Wisconsin (Diagnostic) Data Set provided by Kaggle: https://www.kaggle.com/code/nisasoylu/decision-tree-implementation-on-cancer-dataset. The data contains 32 attributes (columns) for 569 unique breast tissue samples (rows) with no missing values. The attributes include a sample ID, the diagnosis (malignant or benign), and the mean, standard error (SE), and worst values for 10 sample attributes: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. First, the Pandas module is imported to preprocess the data, including separating out the sample ID column. SKLearn is used to separate the data into training and test sets using a 70/30 split and to establish a decision tree classifier using all 10 attributes. SKLearn metrics then calculates the accuracy of the decision tree's predictions against the actual diagnoses. The established decision tree is visualized using both the graphviz and dtreeviz modules.
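
A minimal sketch of this workflow, assuming the Kaggle CSV is saved locally as data.csv and uses the dataset's standard id and diagnosis column names (both assumptions, as is the random_state value):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Load the Kaggle CSV (local file name is an assumption).
df = pd.read_csv("data.csv")

# Preprocess: separate out the sample ID column and the diagnosis label.
X = df.drop(columns=["id", "diagnosis"])
y = df["diagnosis"]

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit a decision tree classifier on the remaining attributes.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Accuracy of the tree's predictions against the recorded diagnoses.
print("Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))
```

The fitted clf can then be handed to sklearn.tree.export_graphviz or to the dtreeviz package to produce the visualizations the notebook describes.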

The Jupyter Notebook – SKLearnRandomForest.ipynb – uses the SKLearn module to generate a random forest. The notebook analyzes the Breast Cancer Wisconsin (Diagnostic) Data Set described above, provided by Kaggle: https://www.kaggle.com/code/nisasoylu/decision-tree-implementation-on-cancer-dataset. First, the Pandas module is imported to preprocess the data, including separating out the sample ID column. SKLearn is used to separate the data into training and test sets using a 70/30 split and to establish a random forest classifier using all 10 attributes. SKLearn metrics then calculates the accuracy of the random forest's predictions against the actual diagnoses. To assess the accuracy of random forest predictions based solely on the most predictive attributes, SKLearn is used to identify the most important attributes; Pandas then selects the data for those attributes, and the random forest classification and accuracy calculation are repeated.
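
A similar sketch for the random forest workflow, including the repeat run on the most important attributes; the preprocessing follows the assumptions above, and keeping the five highest-ranked attributes is an illustrative cutoff rather than the notebook's documented choice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

df = pd.read_csv("data.csv")  # file name is an assumption
X = df.drop(columns=["id", "diagnosis"])
y = df["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random forest on all attributes.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("All attributes:", metrics.accuracy_score(y_test, rf.predict(X_test)))

# Rank attributes by importance and keep only the highest-ranked ones.
importances = pd.Series(rf.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False).head(5).index  # cutoff of 5 is illustrative

# Repeat the classification and accuracy calculation on the reduced set.
rf_top = RandomForestClassifier(random_state=42)
rf_top.fit(X_train[top], y_train)
print("Top attributes:", metrics.accuracy_score(y_test, rf_top.predict(X_test[top])))
```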

The Jupyter Notebook – manualRandomForest.ipynb – is a non-SKLearn implementation that emphasizes how a random forest is a collection of multiple decision trees. The notebook analyzes the Breast Cancer Wisconsin (Diagnostic) Data Set described above, provided by Kaggle: https://www.kaggle.com/code/nisasoylu/decision-tree-implementation-on-cancer-dataset. First, the Pandas package is imported, then the sample ID column is removed before the data is separated into training and test sets using a 70/30 split. The program uses the mean, standard error, and worst values to calculate a decision value for each training sample and attribute using the following formula: decision value = worst - mean + SE. The program then collects the decision values by diagnosis and establishes each attribute's two decision points from the average decision value for each diagnosis, adjusted by the SE. Next, the decision points for each attribute are applied to every sample individually, and a weighted score – 0, 0.5, or 1, with 0 and 1 representing benign and malignant guesses, respectively – is assigned to each attribute based on where the sample's value falls relative to the attribute's decision points. For example, if the decision points are -17.8 and -12.3 for malignant and benign, respectively, then a sample's attribute value of -18 scores a 1, a value of -10 scores a 0, and a value of -15 scores a 0.5. A running sum is calculated from these weighted scores; if the final sum is larger than 5, the program predicts the sample is malignant. The program's predictions are compared to the actual sample diagnoses to establish accuracy. Finally, the decision points established from the training data are applied to the test data, and the scoring process is repeated to validate the approach.
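
The decision-value and scoring logic is easiest to follow in code. The sketch below is a loose reconstruction, not the notebook's actual code: the Kaggle column naming (radius_mean, radius_se, radius_worst, and so on), the direction in which each class average is adjusted by its SE, and all helper names are assumptions.

```python
import numpy as np
import pandas as pd

ATTRIBUTES = ["radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave points", "symmetry",
              "fractal dimension"]

def decision_values(df):
    """Per sample and attribute: decision value = worst - mean + SE.
    Column names like radius_worst follow the Kaggle CSV convention
    and are an assumption."""
    out = pd.DataFrame(index=df.index)
    for a in ATTRIBUTES:
        out[a] = df[f"{a}_worst"] - df[f"{a}_mean"] + df[f"{a}_se"]
    return out

def decision_points(values, labels):
    """Per attribute: one decision point per diagnosis, taken as that
    class's average decision value nudged by its standard error toward
    the other class (the direction of the nudge is an assumption)."""
    points = {}
    for a in values.columns:
        mal = values.loc[labels == "M", a]
        ben = values.loc[labels == "B", a]
        m_pt = mal.mean() + mal.sem() * np.sign(ben.mean() - mal.mean())
        b_pt = ben.mean() + ben.sem() * np.sign(mal.mean() - ben.mean())
        points[a] = (m_pt, b_pt)
    return points

def predict(sample, points, threshold=5):
    """Score each attribute 1 (past the malignant point), 0 (past the
    benign point), or 0.5 (between the two points), then call the sample
    malignant when the summed score exceeds the threshold."""
    total = 0.0
    for a, (m_pt, b_pt) in points.items():
        v = sample[a]
        if m_pt <= b_pt:  # malignant decision point on the low side
            total += 1.0 if v <= m_pt else (0.0 if v >= b_pt else 0.5)
        else:             # malignant decision point on the high side
            total += 1.0 if v >= m_pt else (0.0 if v <= b_pt else 0.5)
    return "M" if total > threshold else "B"
```

With the example decision points of -17.8 (malignant) and -12.3 (benign), this scoring gives a 1 for a value of -18, a 0 for -10, and a 0.5 for -15, matching the description above.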
