Test assignment on Microsoft Geolife Dataset
The test assignment is based on the Microsoft Geolife Dataset.
It consists of four Python modules in the geo_app/geo folder, plus tests:
ingest.py - by default reads Geolife data from the local folder geo_data_source, transforms it, and saves it as geo_table.parquet to the local folder geo_data_output; alternatively, --input_folder and --output_folder can be supplied
length_dist.py - by default reads geo_table.parquet (from the ingestion stage) as a SQL table from the local folder geo_data_output, calculates the distribution of trip lengths in km, and saves it to geo_data_output as
length_{timestamp}.parquet
length_{timestamp}.png
time_gross_dist.py - by default reads geo_table.parquet (from the ingestion stage) as a SQL table from the local folder geo_data_output, calculates the distribution of trip durations in hours, and saves it to geo_data_output as
time_gross_{timestamp}.parquet
time_gross_{timestamp}.png
time_net_dist.py - by default reads geo_table.parquet (from the ingestion stage) as a SQL table from the local folder geo_data_output, calculates the distribution of trip durations excluding stops, in hours, and saves it to geo_data_output as
time_net_{timestamp}.parquet
time_net_{timestamp}.png
Human stops are defined as trajectory steps where the speed is below 0.05 km/h (see the sketch below).
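The stop rule follows directly from the per-step geometry. Below is a minimal sketch, not the assignment's actual code, that flags the stop steps of a single trajectory; it assumes a pandas DataFrame with the schema columns listed further down, and the helper names haversine_km and flag_stops are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0
STOP_SPEED_KMH = 0.05  # stop threshold from the assignment

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def flag_stops(traj: pd.DataFrame) -> pd.Series:
    """Mark steps of a single trajectory whose speed is below the stop threshold."""
    traj = traj.sort_values("StepTimestamp")
    dist_km = haversine_km(traj["Latitude"].shift(), traj["Longitude"].shift(),
                           traj["Latitude"], traj["Longitude"])
    hours = traj["StepTimestamp"].diff().dt.total_seconds() / 3600.0
    return (dist_km / hours) < STOP_SPEED_KMH
```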
Schema of geo_table.parquet:
IngestionTime: TimestampType,
UserId: StringType,
TrajectoryID: StringType,
Latitude: DoubleType,
Longitude: DoubleType,
Altitude: DoubleType,
StepTimestamp: TimestampType,
Label: IntegerType
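The type names above match PySpark's. If the pipeline is built on PySpark, which the "as a SQL table" phrasing suggests, the schema could be declared as in the following sketch; the variable name GEO_SCHEMA is hypothetical.

```python
from pyspark.sql.types import (StructType, StructField, TimestampType,
                               StringType, DoubleType, IntegerType)

# Sketch of the output table schema; field names and types follow the list above.
GEO_SCHEMA = StructType([
    StructField("IngestionTime", TimestampType()),
    StructField("UserId", StringType()),
    StructField("TrajectoryID", StringType()),
    StructField("Latitude", DoubleType()),
    StructField("Longitude", DoubleType()),
    StructField("Altitude", DoubleType()),
    StructField("StepTimestamp", TimestampType()),
    StructField("Label", IntegerType()),
])
```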
IngestionTime - data ingestion time; the same timestamp is assigned to all input data processed during one script run
UserId - folder name in the Geolife Dataset: Geolife Trajectories 1.3/Data/{user}
TrajectoryID - file name of the .plt file in the Geolife Dataset: Geolife Trajectories 1.3/Data/{user}/Trajectory/{.plt}
Latitude - the first column of the .plt file
Longitude - the second column of the .plt file
Altitude - the fourth column of the .plt file
StepTimestamp - the 6th and 7th columns of the .plt file combined into a timestamp (date + time)
Label - labels merged in as categories from the labels.txt file, matched by time
The 3rd and 5th columns of the .plt file are unused according to the Geolife Dataset User Guide.
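Parsing one .plt file into these columns could look like the sketch below, assuming the layout described in the User Guide (six header lines, then comma-separated rows); read_plt is a hypothetical helper name, not the module's actual code.

```python
import pandas as pd

def read_plt(path: str) -> pd.DataFrame:
    # .plt rows: latitude, longitude, a constant 0, altitude,
    # days since 1899-12-30, date string, time string.
    cols = ["Latitude", "Longitude", "Zero", "Altitude", "Days", "Date", "Time"]
    df = pd.read_csv(path, skiprows=6, header=None, names=cols)
    # StepTimestamp: the 6th and 7th columns combined into one timestamp.
    df["StepTimestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"])
    # Drop the unused 3rd and 5th columns.
    return df[["Latitude", "Longitude", "Altitude", "StepTimestamp"]]
```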
- Download the [Microsoft Geolife Dataset](https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/) and unpack it into geo_data_source. This is the default local folder for input data. You can also create your own directory and pass it to all scripts via --input_folder
- Install Python 3.8
- Create a Python environment:
cd geo_app
python3 -m venv geo_env
source geo_env/bin/activate
- Install the requirements from the geo_app folder:
pip install -r requirements.txt
- Go to the geo folder:
cd geo
- Run ingest.py from the geo_app/geo folder
default folders:
python3 ingest.py
custom local folders:
python3 ingest.py --input_folder ../../test_data --output_folder ../../geo_data_output_test
- Check geo_data_output or your output folder for geo_table.parquet
- Run length_dist.py from the geo_app/geo folder
default folders:
python3 length_dist.py
custom local folders:
python3 length_dist.py --database_folder ../../geo_data_output_test
- Look at geo_data_output or your output folder. You will see two new files: length_{timestamp}.parquet and length_{timestamp}.png
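For orientation, here is the kind of query length_dist.py might run to get per-trip lengths, assuming PySpark and a registered temp view; the paths, names, and the query itself are an illustrative sketch, not the module's actual code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("length_dist_sketch").getOrCreate()
spark.read.parquet("geo_data_output/geo_table.parquet") \
     .createOrReplaceTempView("geo_table")

# Per-trip length in km: sum of great-circle (haversine) distances
# between consecutive trajectory steps.
trip_lengths = spark.sql("""
    WITH steps AS (
        SELECT UserId, TrajectoryID, Latitude, Longitude,
               LAG(Latitude)  OVER (PARTITION BY UserId, TrajectoryID
                                    ORDER BY StepTimestamp) AS prev_lat,
               LAG(Longitude) OVER (PARTITION BY UserId, TrajectoryID
                                    ORDER BY StepTimestamp) AS prev_lon
        FROM geo_table
    )
    SELECT UserId, TrajectoryID,
           SUM(2 * 6371 * ASIN(SQRT(
               POW(SIN(RADIANS(Latitude - prev_lat) / 2), 2)
               + COS(RADIANS(prev_lat)) * COS(RADIANS(Latitude))
                 * POW(SIN(RADIANS(Longitude - prev_lon) / 2), 2)
           ))) AS length_km
    FROM steps
    GROUP BY UserId, TrajectoryID
""")
```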
- Run time_gross_dist.py from the geo_app/geo folder
default folders:
python3 time_gross_dist.py
custom local folders:
python3 time_gross_dist.py --database_folder ../../geo_data_output_test
- Look at geo_data_output or your output folder. You will see two new files: time_gross_{timestamp}.parquet and time_gross_{timestamp}.png
- Run time_net_dist.py from the geo_app/geo folder
default folders:
python3 time_net_dist.py
custom local folders:
python3 time_net_dist.py --database_folder ../../geo_data_output_test
- Look at geo_data_output or your output folder. You will see two new files: time_net_{timestamp}.parquet and time_net_{timestamp}.png
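Net duration can be derived from the same per-step speeds. A minimal sketch, reusing the hypothetical flag_stops helper from the earlier stop-rule sketch:

```python
import pandas as pd

def net_duration_hours(traj: pd.DataFrame) -> float:
    """Trip duration in hours, excluding steps flagged as stops (< 0.05 km/h)."""
    traj = traj.sort_values("StepTimestamp")
    hours = traj["StepTimestamp"].diff().dt.total_seconds() / 3600.0
    return hours[~flag_stops(traj)].sum()  # flag_stops: see the earlier sketch
```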
- Run the tests from the geo_app folder:
cd geo_app
python -m pytest geo
- Install Docker
- Go to the airflow folder:
cd airflow_docker
- Load some data into the airflow_docker/geo_data_source folder. This folder is the source folder for new data; you can copy the full Geolife Dataset or just part of it. Example of the folder structure (per-user layout, as in the Geolife Dataset):
geo_data_source/{user}/Trajectory/{trajectory}.plt
geo_data_source/{user}/labels.txt
- Run the Airflow Docker containers and wait until they start:
./build_and_run_airflow.sh
- Go to http://0.0.0.0:8080/home in your browser. Username: admin, password: admin
- Trigger the DAG and wait until it finishes
- Check the airflow_docker/geo_data_output directory for results
- At the end, the input folder airflow_docker/geo_data_source must be empty: all input .plt files are moved to the airflow_docker/processed_data folder (see the DAG sketch below)
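For orientation, a minimal sketch of what such a DAG might look like, assuming Airflow 2.x BashOperator tasks; the dag id, task ids, and paths are hypothetical and not taken from the repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="geolife_pipeline_sketch",
         start_date=datetime(2023, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:
    # Ingest new .plt files into geo_table.parquet.
    ingest = BashOperator(task_id="ingest",
                          bash_command="python3 /opt/geo_app/geo/ingest.py")
    # The three distribution modules can run in parallel after ingestion.
    dists = [BashOperator(task_id=name,
                          bash_command=f"python3 /opt/geo_app/geo/{name}.py")
             for name in ("length_dist", "time_gross_dist", "time_net_dist")]
    # Move processed inputs out of the source folder once all distributions finish.
    archive = BashOperator(task_id="archive_inputs",
                           bash_command="mv /opt/geo_app/geo_data_source/* "
                                        "/opt/geo_app/processed_data/")
    ingest >> dists >> archive
```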