oesterei/JNBs

JNBs using a Python 3 kernel

The Jupyter Notebook – SKLearnDecisionTree.ipynb – uses the SKLearn module to generate a decision tree. The notebook analyzes the Breast Cancer Wisconsin (Diagnostic) Data Set provided by Kaggle: https://www.kaggle.com/code/nisasoylu/decision-tree-implementation-on-cancer-dataset. The data contains 32 attributes (columns) for 569 unique breast tissue samples (rows) with no missing values. The attributes include a sample ID, the diagnosis (malignant or benign), and the mean, standard error (SE), and worst values for 10 sample attributes: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. First, the Pandas module is imported to preprocess the data, including separating out the sample ID column. SKLearn is used to separate the data into training and test sets using a 70/30 split and to establish a decision tree classifier using all 10 attributes. SKLearn metrics then calculates the accuracy of the decision tree's predictions against the actual diagnoses. The established decision tree is visualized using both the graphviz and dtreeviz modules.
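
A minimal sketch of this workflow, assuming the Kaggle CSV is saved locally as data.csv and uses the dataset's standard id and diagnosis column names (both assumptions, as is the random_state value):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Load the Kaggle CSV (local file name is an assumption).
df = pd.read_csv("data.csv")

# Preprocess: separate out the sample ID column and the diagnosis label.
X = df.drop(columns=["id", "diagnosis"])
y = df["diagnosis"]

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit a decision tree classifier on the remaining attributes.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Accuracy of the tree's predictions against the recorded diagnoses.
print("Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))
```

The fitted clf can then be handed to sklearn.tree.export_graphviz or to the dtreeviz package to produce the visualizations the notebook describes.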

The Jupyter Notebook – SKLearnRandomForest.ipynb – uses the SKLearn module to generate a random forest. The notebook analyzes the Breast Cancer Wisconsin (Diagnostic) Data Set described above, provided by Kaggle: https://www.kaggle.com/code/nisasoylu/decision-tree-implementation-on-cancer-dataset. First, the Pandas module is imported to preprocess the data, including separating out the sample ID column. SKLearn is used to separate the data into training and test sets using a 70/30 split and to establish a random forest classifier using all 10 attributes. SKLearn metrics then calculates the accuracy of the random forest's predictions against the actual diagnoses. To assess the accuracy of random forest predictions based solely on the most predictive attributes, SKLearn is used to identify the most important attributes; Pandas then selects the data for those attributes, and the random forest classification and accuracy calculation are repeated.
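
A similar sketch for the random forest workflow, including the repeat run on the most important attributes; the preprocessing follows the assumptions above, and keeping the five highest-ranked attributes is an illustrative cutoff rather than the notebook's documented choice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

df = pd.read_csv("data.csv")  # file name is an assumption
X = df.drop(columns=["id", "diagnosis"])
y = df["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random forest on all attributes.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("All attributes:", metrics.accuracy_score(y_test, rf.predict(X_test)))

# Rank attributes by importance and keep only the highest-ranked ones.
importances = pd.Series(rf.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False).head(5).index  # cutoff of 5 is illustrative

# Repeat the classification and accuracy calculation on the reduced set.
rf_top = RandomForestClassifier(random_state=42)
rf_top.fit(X_train[top], y_train)
print("Top attributes:", metrics.accuracy_score(y_test, rf_top.predict(X_test[top])))
```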

The Jupyter Notebook – manualRandomForest.ipynb – is a non-SKLearn implementation that emphasizes how a random forest is a collection of multiple decision trees. The notebook analyzes the Breast Cancer Wisconsin (Diagnostic) Data Set described above, provided by Kaggle: https://www.kaggle.com/code/nisasoylu/decision-tree-implementation-on-cancer-dataset. First, the Pandas package is imported, then the sample ID column is removed before the data is separated into training and test sets using a 70/30 split. The program uses the mean, standard error, and worst values to calculate a decision value for each training sample and attribute using the following formula: decision value = worst - mean + SE. The program then collects the decision values by diagnosis and establishes each attribute's two decision points from the average decision value for each diagnosis, adjusted by the SE. Next, the decision points for each attribute are applied to every sample individually, and a weighted score – 0, 0.5, or 1, with 0 and 1 representing benign and malignant guesses, respectively – is assigned to each attribute based on where the sample's value falls relative to the attribute's decision points. For example, if the decision points are -17.8 and -12.3 for malignant and benign, respectively, then a sample's attribute value of -18 scores a 1, a value of -10 scores a 0, and a value of -15 scores a 0.5. A running sum is calculated from these weighted scores; if the final sum is larger than 5, the program predicts the sample is malignant. The program's predictions are compared to the actual sample diagnoses to establish accuracy. Finally, the decision points established from the training data are applied to the test data, and the scoring process is repeated to validate the approach.
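
The decision-value and scoring logic is easiest to follow in code. The sketch below is a loose reconstruction, not the notebook's actual code: the Kaggle column naming (radius_mean, radius_se, radius_worst, and so on), the direction in which each class average is adjusted by its SE, and all helper names are assumptions.

```python
import numpy as np
import pandas as pd

ATTRIBUTES = ["radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave points", "symmetry",
              "fractal dimension"]

def decision_values(df):
    """Per sample and attribute: decision value = worst - mean + SE.
    Column names like radius_worst follow the Kaggle CSV convention
    and are an assumption."""
    out = pd.DataFrame(index=df.index)
    for a in ATTRIBUTES:
        out[a] = df[f"{a}_worst"] - df[f"{a}_mean"] + df[f"{a}_se"]
    return out

def decision_points(values, labels):
    """Per attribute: one decision point per diagnosis, taken as that
    class's average decision value nudged by its standard error toward
    the other class (the direction of the nudge is an assumption)."""
    points = {}
    for a in values.columns:
        mal = values.loc[labels == "M", a]
        ben = values.loc[labels == "B", a]
        m_pt = mal.mean() + mal.sem() * np.sign(ben.mean() - mal.mean())
        b_pt = ben.mean() + ben.sem() * np.sign(mal.mean() - ben.mean())
        points[a] = (m_pt, b_pt)
    return points

def predict(sample, points, threshold=5):
    """Score each attribute 1 (past the malignant point), 0 (past the
    benign point), or 0.5 (between the two points), then call the sample
    malignant when the summed score exceeds the threshold."""
    total = 0.0
    for a, (m_pt, b_pt) in points.items():
        v = sample[a]
        if m_pt <= b_pt:  # malignant decision point on the low side
            total += 1.0 if v <= m_pt else (0.0 if v >= b_pt else 0.5)
        else:             # malignant decision point on the high side
            total += 1.0 if v >= m_pt else (0.0 if v <= b_pt else 0.5)
    return "M" if total > threshold else "B"
```

With the example decision points of -17.8 (malignant) and -12.3 (benign), this scoring gives a 1 for a value of -18, a 0 for -10, and a 0.5 for -15, matching the description above.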
