A tool for finding the difference between multiple data sources which should have the same value.
Heavily tested versions
Django | Python 2 | Python 3 |
---|---|---|
1.11.15 | V | |
2.0.5 | V |
- Reduce the efforts required to check data validation from different source.
- Report the differences to user.
- Resolve the differences for user basing on input rules.
Scenarios
- Comparing tables within same database
- Comparing tables from different databases
- Comparing tables with different range of data within same/different database
- Comparing table and CSV file
- Comparing unordered CSV file and database
install on python2 Venv
in these steps, the bold and italic sentences are the commands for terminal
-
install brew
sudo xcodebuild -license /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
-
install mysql via brew
brew install mysql
-
start mysql
brew services start mysql
-
install rabbitmqÂ
brew install rabbitmq
-
start rabbitmq
/usr/local/sbin/rabbitmq-server -detached
-
pull the source code
cd ~ git clone https://github.com/analyticsMD/datadiff.git
-
create virtual environment p2env for qdiff
cd datadiff python -m venv p2env
-
activate the virtual environmentÂ
source p2env/bin/activate
-
install dependency for qdiff
pip install -r requirements.txt
-
create databaseÂ
mysql -uroot -p CREATE DATABASE qdiff; exit;
-
init tables in the database
python manage.py makemigrations python manage.py migrate
-
start Celery workers
celery -A qdiff worker -l info --detach
-
do a test for sanity check
python manage.py test
-
run the server for demo!!
python manage.py runserver
-
check URLÂ http://127.0.0.1:8000/
After you restart your device, you only need to do the following to launch Qdiff
-
start rabbitmq server
/usr/local/sbin/rabbitmq-server -detached
-
activate virtual environment
cd ~/datadiff source p2env/bin/activate
-
start Celery workers
celery -A qdiff worker -l info --detach
-
run the django server
python manage.py runserver
install on python3 Venv
in these steps, the bold and italic sentences are the commands for terminal
-
install brew
sudo xcodebuild -license /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
-
install mysql via brew
brew install mysql
-
start mysql
brew services start mysql
-
install rabbitmqÂ
brew install rabbitmq
-
start rabbitmq
/usr/local/sbin/rabbitmq-server -detached
-
install python36
brew install python
-
pull the source code
cd ~ git clone https://github.com/analyticsMD/datadiff.git
-
create virtual environment p3env for qdiff
cd datadiff python3 -m venv p3env
-
activate the virtual environmentÂ
source p3env/bin/activate
-
install dependency for qdiff
pip install -r requirements.txt
-
create databaseÂ
mysql -uroot -p CREATE DATABASE qdiff; exit;
-
init tables in the database
python manage.py makemigrations python manage.py migrate
-
start Celery workers
celery -A qdiff worker -l info --detach
-
do a test for sanity check
python manage.py test
-
run the server for demo!!
python manage.py runserver
-
check URLÂ http://127.0.0.1:8000/
After you restart your device, you only need to do the following to launch Qdiff
-
start rabbitmq server
/usr/local/sbin/rabbitmq-server -detached
-
activate virtual environment
cd ~/datadiff source p3env/bin/activate
-
start Celery workers
celery -A qdiff worker -l info --detach
-
run the django server
python manage.py runserver
detail guides
detail guides
- install mysql
sudo yum install mysql-server
- install python 3.6
sudo yum install python36
- clone repository
ssh-agent bash -c 'ssh-add /path/to/your/private/rsakey; git clone git@github.com:analyticsMD/datadiff.git'
- install gcc for compiling mysql-connector
sudo yum install gcc
- install python-devel for compiling mysql-connector
sudo yum install -y python36-devel
- install dependency
python3 -m pip install -r datadiff/requirements.txt --user
detail guides
- start mysql server
sudo service mysqld start
- login into mysql with root user
mysql -u root -p
- create database qdiff and qdiff_test
mysql> create database qdiff; mysql> create database qdiff_test;
- change database password in the /qdiff/setting/settings.py and settings_test.py
- install the database schema
python3 manage.py makemigrations python3 manage.py migrate
detail guides
- install Erlang Version 20.1
cd /opt sudo wget https://github.com/rabbitmq/erlang-rpm/releases/download/v20.1.7/erlang-20.1.7-1.el6.x86_64.rpm sudo rpm -ivh erlang-20.1.7-1.el6.x86_64.rpm
- install Socat
sudo yum install socat
- RabbitMQ v3.7.0
sudo wget https://dl.bintray.com/rabbitmq/all/rabbitmq-server/3.7.0/rabbitmq-server-3.7.0-1.el6.noarch.rpm sudo rpm -ivh rabbitmq-server-3.7.0-1.el6.noarch.rpm
- start rabbitmq, run the command
rabbitmq-server
- or you can start rabbitmq as service
sudo service rabbitmq-server start
detail guides
celery -A qdiff worker -l info --detach
You can also check http://docs.celeryproject.org/en/latest/userguide/daemonizing.html
detail guides
- run command
sudo python3 manage.py test
- Cheers if all the test cases are successful
Please make sure you already finish this section
detail guides
detail guides
- make directory /opt/
sudo mkdir /opt
- clone the code with git
cd /opt git clone git@github.com:analyticsMD/datadiff.git
detail guides
- run command
sudo python3 -m pip install uwsgi
detail guides
- run command
sudo yum install nginx
detail guides
- run command
sudo vim /etc/nginx/nginx.conf
- edit context for your requirement, this is a simple sample
user ec2-user ec2-user; worker_processes auto; error_log /var/log/nginx/error.log; pid /var/run/nginx.pid; include /usr/share/nginx/modules/*.conf; events { worker_connections 1024; } http { server_names_hash_bucket_size 128; log_format main '$remote_addr - $remote_user [$time_local] "$request" ' '$status $body_bytes_sent "$http_referer" ' '"$http_user_agent" "$http_x_forwarded_for"'; access_log /var/log/nginx/access.log main; sendfile on; tcp_nopush on; tcp_nodelay on; keepalive_timeout 65; types_hash_max_size 2048; include /etc/nginx/mime.types; default_type application/octet-stream; include /etc/nginx/conf.d/*.conf; include /etc/nginx/sites-enable/*; index index.html index.htm; }
detail guides
- make two directories, /etc/nginx/sites-enable/ and /etc/nginx/sites-available/
mkdir /etc/nginx/sites-enable/ mkdir /etc/nginx/sites-available/
- create qdiff_nginx.conf in /etc/nginx/sites-available/
sudo vim /etc/nginx/sites-available/qdiff_nginx.conf
- edit content as follow
# the upstream component nginx needs to connect to upstream django { server unix:///opt/datadiff/qdiff.sock; # for a file socket # server 127.0.0.1:8001; # for a web port socket } # configuration of the server server { # the port your site will be served on listen 8000; # the domain name it will serve for server_name ec2-54-183-250-158.us-west-1.compute.amazonaws.com; # substitute your machine's IP address or FQDN charset utf-8; # max upload size client_max_body_size 1000M; # adjust to taste # Django media location /media { alias /opt/datadiff/media; # your Django project's media files - amend as required } location /static { alias /opt/datadiff/static; # your Django project's static files - amend as required } # Finally, send all non-media requests to the Django server. location / { uwsgi_pass django; include /opt/datadiff/uwsgi_params; # the uwsgi_params file you installed } }
detail guides
- check the read write permissions
it should return
ls -l /var/lib/nginx/tmp/
drwxr-xr-x 2 ec2-user ec2-user 4096 Jul 31 19:36 client_body drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 fastcgi drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 proxy drwx------ 2 ec2-user ec2-user 4096 Jul 24 02:06 scgi drwx------ 7 ec2-user ec2-user 4096 Jul 31 19:36 uwsgi
detail guides
-
make sure the rabbit mq already start
-
start the nginx service
sudo service nginx start
-
start uwsgi worker
cd /opt/datadiff uwsgi --socket qdiff.sock --module setting.wsgi --chmod-socket=664 --daemonize uwsgi.log
detail guides
-
access URL host:port/tasks to check if it works or not
for example: http://ec2-xx-xx-xx-xx.us-west-1.compute.amazonaws.com/tasks
-
if not working, check following files for debug
/var/log/nginx/error.log #nginx reverse server log /opt/datadiff/uwsgi.log #uwsgi server log /opt/datadiff/dev.log #qdiff log
details
When input databases as datasources, users need to generate a database access token first. The database access token is an encrypted JSON file which contains the database setting for Django.details
- Using Django DB cursor to read the data
- Supporting multiple databases and file sources
- Support CSV excel
A wrapper for package tableschema
- It will query setting.settings.SCHEMA_INFER_LIMIT number records and infer the schema from them.
- The setting.settings.SCHEMA_INFER_CONFIDENCE permits the minor occurance of abnomral, default is 1.00.
- The setting.settings.SCHEMA_CSV_MISSING_VALUES and SCHEMA_DATABASE_MISSING_VALUES is used to configurate what should be consider as missing.
-
Variable definitions
data1, data2: the input data, can be queryset, list, dictionary
item1, item2: the elements from data1 and data2
dict1, dict2: the list for saving unmatch records
-
steps
- Sort the data1 and data2
- Iterate over data1 and data2 at same time, item1 comes from data1 and item2 comes from data2
- If the item1 is identical to item2, iterate next item pair
- Else if the item1 in dict2, save the all elements in dict2 except item1 as conflicted results, iterate next item1
- Else if the item2 in dict1, save the all elements in dict1 except item2 as conflicted results, iterate next item2
- Else, the item1 is different from the item2, put item1 in dict1 and item2 in dict2, iterate next item pair
-
Complexity.
given m,n = len(data1),len(data2)
Time complexity:
O(m+n)
Space complexity:
O(m+n)
-
pseudo code in python
qDiff(data1, data2): iter1 = iter(sorted(data1)) iter2 = iter(sorted(data2)) temp_dict1 ={} temp_dict2 ={} item1 = None item2 = None try: while True: item1 = next(iter1) item2 = next(iter2) h1 = hash(item1) h2 = hash(item2) if h1==h2: item1 = None item2 = None continue elif h1 in temp_dict2 or h2 in temp_dict1: if h1 in temp_dict2: temp_dict2.pop(h1) saveToConflictedResult(temp_dict2.values()) if h2 in temp_dict1: temp_dict1.pop(h2) saveToConflictedResult(temp_dict1.values()) else: temp_dict1[h1]=item1 temp_dict2[h2]=item2 item1 = None item2 = None except StopIteration as e: if not item1: saveToConflictedResult(list(iter2)) else: saveToConflictedResult([item1] + list(iter1))
- Providing the comparison result
- GUI for accepting rules for resolving ((Not implemented yet))
-
Parsing the input rules and save as rule set for reuse
-
Rules:
Where to write the resolved result Left join, right join, inner join, and outer join Condition based rule, (E.X. when field1 == 0 and field2 > 3)
- Filtering the conflicted results basing on the input rules
Entities
-
Task
- Information about the data source
- Uploaded file path
- Database information (This will not contain password)
- Datetime, Recording the start time and end time for performance evaluation
- Owner (Not implemented yet)
-
Conflict record
- Raw data
- What source it belongs to
- The name of raw table
-
Raw Table
- Fields here are dynamics, this table schema is depending on the schema of given datasources
-
Report
- Which generator will process the data
- Generated file path
- Parameters for the generator, in JSON format
-
Rule set (Not implemented yet)
- Name
- Description
- Rule, formatted rules in json format