Perform a simple prediction of world population with linear regression, using several data science tools
Learn a complete data science workflow and how to integrate different data science tools into it
Data includes the population of 264 countries
- Amazon Relational Database Service (Amazon RDS): tool to set up, operate, and scale MySQL deployments in the cloud
- Amazon Elastic Compute Cloud (Amazon EC2): tool to provide secure, resizable compute capacity in the cloud
- MySQL: tool to create tables and store the data
- Talend ETL: tool to load data into MySQL
- Tableau: tool to visualize data
Setup:
- Launch a MySQL database instance in my AWS account
- Launch an EC2 instance
- Connect to the EC2 instance via SSH
- Connect to the database in AWS from my computer (via EC2) to interact with it using SQL
- Connect Talend ETL to AWS RDS
- Upload data to Talend ETL, then transfer the data to AWS RDS
- Connect MySQL to Python with PyMySQL
- Integrate the 2 datasets and create the prediction file in Python
- Load the data into Talend ETL, then transfer the data to AWS RDS
- Connect Tableau to AWS RDS and visualize
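
The PyMySQL step in the list above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the environment variable names (`RDS_HOST`, `RDS_PORT`, `RDS_USER`, `RDS_PASSWORD`, `RDS_DATABASE`), their default values, and the helper names `rds_connection_params` and `fetch_rows` are all assumptions made for illustration. It also assumes that PyMySQL is installed and that the database endpoint is reachable from the machine running the script.

```python
import os


def rds_connection_params(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names and defaults are hypothetical; adapt them to
    your own RDS instance.
    """
    return {
        "host": env.get("RDS_HOST", "127.0.0.1"),
        "port": int(env.get("RDS_PORT", "3306")),
        "user": env.get("RDS_USER", "admin"),
        "password": env.get("RDS_PASSWORD", ""),
        "database": env.get("RDS_DATABASE", "population"),
    }


def fetch_rows(query, params=None):
    """Run a SELECT query and return the rows as a list of dicts."""
    # Imported lazily so the module loads even where PyMySQL is missing.
    import pymysql
    import pymysql.cursors

    conn = pymysql.connect(
        cursorclass=pymysql.cursors.DictCursor,
        **rds_connection_params(),
    )
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall()
    finally:
        conn.close()


# Example call (the table name is hypothetical):
# rows = fetch_rows("SELECT * FROM population LIMIT %s", (5,))
```

Reading the credentials from the environment keeps the password out of the notebook, which matters if the notebook is shared publicly.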
Preprocessing and Model Training in Python: _The notebook can be found here_
- Preprocessing:
  - Fill in missing values with the forward-fill method
- Visualization:
- Prediction:
  - Since the population trends of many countries are linear, and since the main purpose is to practice with the tools, I went with linear regression to predict the population from 2019 to 2021
  - Score: 0.994
  - Find the prediction csv file here
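
The preprocessing and prediction steps above can be illustrated with a small standard-library sketch. It is not the notebook's code, which may well rely on pandas and scikit-learn. The function names `forward_fill` and `fit_and_predict` and the toy numbers are made up for illustration, and the sketch assumes that the reported score is the coefficient of determination (R²), which the text does not state.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading None entries stay None because nothing precedes them.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fit_and_predict(years, population, future_years):
    """Fit population = slope * year + intercept by ordinary least squares.

    Returns (predictions by year, R^2 on the training data).
    """
    if len(years) < 2 or len(years) != len(population):
        raise ValueError("need at least two (year, population) pairs")
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(population) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, population))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    fitted = [slope * x + intercept for x in years]
    ss_res = sum((y - f) ** 2 for y, f in zip(population, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in population)
    # A constant series has no variance to explain; report 1.0 in that case.
    r2 = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot

    predictions = {year: slope * year + intercept for year in future_years}
    return predictions, r2


if __name__ == "__main__":
    # Toy values for one hypothetical country, with a gap in 2016.
    years = [2014, 2015, 2016, 2017, 2018]
    population = [100.0, 102.0, None, 106.0, 108.0]

    predictions, r2 = fit_and_predict(
        years, forward_fill(population), [2019, 2020, 2021]
    )
    print(predictions, r2)
```

In the real workflow the same fit would be repeated per country, with the predictions for 2019 to 2021 written out to the csv file.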