Advanced Machine Learning with Apache Spark: Leveraging Logistic Regression, Random Forest and Decision Tree Classifiers

nickShengY/Spark-ML-Projects

Spark-ML-Projects

Project Walk Through: YouTube

Project 1: Toxic Comment Classification

Introduction

This project explores basic text processing on the toxic comment classification dataset. The primary objective was to convert the comment_text column into a sparse vector representation that a classification algorithm in the Spark ML library can consume. (details in the project directory)

Data Description

The dataset, originally released by Jigsaw and Google, is a large CSV file in which each row represents a unique comment. Its columns are id, comment_text, and a set of binary labels (0 or 1), each indicating whether the comment falls into the corresponding toxicity category.

Model Explanation and Evaluation

The chosen model for this task was Logistic Regression, which was implemented using the PySpark 'LogisticRegression' class.

Project 2: Heart Disease Prediction using Logistic Regression

Introduction

This project involves a Python script that uses Logistic Regression to identify the most significant risk factors associated with heart disease and to predict overall risk levels, using the Framingham Heart dataset. (details in the project directory)

Dataset Description

The Framingham Heart Dataset originates from the Framingham Heart Study, which began with an initial cohort of 5,209 subjects. The data typically includes demographic information about patients together with an outcome recording the presence or absence of coronary heart disease.

Model Explanation and Results

The chosen model for this task was logistic regression, implemented using the PySpark 'LogisticRegression' class.

Project 3: Income Prediction with Logistic Regression on Spark: A Dive into Census Income Data

Project Description

This project implements Logistic Regression with Apache Spark ML/MLlib on UCI's Census Income Data. The goal was to predict the income bracket, either >50K or <=50K, from 14 attributes, a mix of categorical and numerical fields with some missing values, across 48,842 instances. (details in the project directory)

Project 4: Advanced Machine Learning with Apache Spark: Leveraging Logistic Regression, Random Forest and Decision Tree Classifiers

Project Description

This project demonstrates practical machine learning in large-scale data environments using Apache Spark ML/MLlib. It first ports existing Python code to Apache Spark, then applies Logistic Regression in Spark ML/MLlib, and concludes by exploring two further Spark ML algorithms: the Random Forest and Decision Tree classifiers. (details in the project directory)
