AQI-prediction-spark-ml

This project demonstrates the efficiency of distributed computing and distributed databases. Multiclass classification algorithms from Spark's machine learning library were used to predict the Air Quality Index (AQI) in California.

Project Description

Air pollution has become a concern for people in California. On one hand, prediction of the Air Quality Index benefits from the high sampling frequency, the large number of sampling stations, and the substantial size of the relevant data. On the other hand, it is challenging to store, manage, and process such a vast amount of data effectively in real time. In this project, we explore a pipeline that stores and processes air quality datasets and applies machine learning models to them to make predictions. Using a 10-year air quality dataset for California, we develop Logistic Regression and Random Forest classification models on a local machine as well as on a distributed system. We found that, using Amazon S3, MongoDB, and Apache Spark, the distributed setting on a cluster achieved better computational performance than the non-distributed setting.
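Since the task is framed as multiclass classification, each AQI value has to be mapped to a class label at some point. The README does not say which classes were used, so the sketch below assumes the standard US EPA AQI categories; the function name `aqi_category` is hypothetical and not taken from the project notebooks.

```python
from bisect import bisect_left

# Inclusive upper bounds of the US EPA AQI categories (assumed labelling scheme).
_UPPER_BOUNDS = [50, 100, 150, 200, 300]
_CATEGORIES = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi: float) -> str:
    """Map a non-negative AQI value to its category label."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # bisect_left returns the index of the first bound >= aqi,
    # so a value equal to a bound stays in the lower category.
    return _CATEGORIES[bisect_left(_UPPER_BOUNDS, aqi)]
```

A label column built this way could then serve as the target for the Logistic Regression and Random Forest classifiers.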

Work Flow

The pipeline was built using Apache Spark SQL and Spark's machine learning library (MLlib) on AWS Elastic MapReduce (EMR).
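A minimal PySpark sketch of such a pipeline is shown below. It is an illustration under stated assumptions, not the code from the project notebooks: the hyperparameter values, the function names, the CSV input, and the column names passed in are all hypothetical, and the MongoDB step is left out.

```python
from typing import Dict, List

# Hypothetical hyperparameters; the README does not list the values used.
MODEL_SETTINGS: Dict[str, Dict[str, int]] = {
    "logistic_regression": {"maxIter": 50},
    "random_forest": {"numTrees": 50, "maxDepth": 10},
}


def model_settings(name: str) -> Dict[str, int]:
    """Return a copy of the illustrative hyperparameters for one model."""
    if name not in MODEL_SETTINGS:
        raise ValueError(f"unknown model: {name!r}")
    return dict(MODEL_SETTINGS[name])


def build_pipeline(name: str, feature_cols: List[str], label_col: str):
    """Assemble features, index the label, and attach the chosen classifier."""
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import (
        LogisticRegression,
        RandomForestClassifier,
    )
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    params = model_settings(name)  # validates the model name first
    classifiers = {
        "logistic_regression": LogisticRegression,
        "random_forest": RandomForestClassifier,
    }
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    # Rows whose label was not seen during fitting are skipped, not fatal.
    indexer = StringIndexer(
        inputCol=label_col, outputCol="label", handleInvalid="skip"
    )
    classifier = classifiers[name](
        featuresCol="features", labelCol="label", **params
    )
    return Pipeline(stages=[assembler, indexer, classifier])


def train_and_score(spark, path: str, name: str,
                    feature_cols: List[str], label_col: str) -> float:
    """Load a CSV, select columns with Spark SQL, train, return accuracy."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    df = spark.read.csv(path, header=True, inferSchema=True)
    df.createOrReplaceTempView("aqi")
    columns = ", ".join(feature_cols + [label_col])
    clean = spark.sql(f"SELECT {columns} FROM aqi").dropna()
    train, test = clean.randomSplit([0.8, 0.2], seed=42)
    model = build_pipeline(name, feature_cols, label_col).fit(train)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(model.transform(test))
```

The same functions can run against a local `SparkSession` or one created on an EMR cluster. Only the session configuration and the input path change, which is what makes a local versus distributed comparison straightforward.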

Results

Conclusion

Deploying the models on a cluster with various hyperparameter settings showed that the distributed setting achieved better computational performance than the non-distributed setting. The results are promising: they indicate that the designed pipeline can provide scalable and efficient throughput for machine learning algorithms applied to air quality prediction.
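One simple way to quantify such a comparison is to time the same training job in both settings and report the ratio. The helpers below are a hypothetical sketch of that measurement; `timed` and `speedup` are illustrative names, and the README does not describe how the project recorded its timings.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(job: Callable[[], T]) -> Tuple[T, float]:
    """Run a zero-argument job and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = job()
    return result, time.perf_counter() - start


def speedup(local_seconds: float, cluster_seconds: float) -> float:
    """Ratio of local to cluster run time; values above 1 favour the cluster."""
    if local_seconds <= 0 or cluster_seconds <= 0:
        raise ValueError("durations must be positive")
    return local_seconds / cluster_seconds
```

Wrapping the model-fitting call in `timed` on the local machine and again on the cluster gives two durations that `speedup` turns into a single figure.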

Folders and files

pics/
AQI.pptx
README.md
Report.docx
Report.pdf
data_preprocessing.ipynb
data_query.ipynb
logistic_regression_with_spark.ipynb
random_forest_on_spark.ipynb
sql_query_10yrs_group.sql

liyinging/AQI-prediction-spark-ml
