PostgreSQL scripts for Microsoft Academic Graph (MAG)
Part of sql_mag.
In order to use pg_mag
you need:
- The MAG raw
txt
files. - A PostgreSQL server, with the extensions
postgis
andpg_trgm
.
In order to get the MAG raw files, please check the official documentation for details. The normal procedure involves creating a Microsoft Azure storage account, signing up for MAG provisioning and downloading the raw files to your host system (using Azcopy or another tool).
If you are already running a Postgres Server you just need to enable the extensions postgis
and pg_trgm
. You can do this by uncommenting and executing the following lines from main.sql
:
CREATE EXTENSION postgis;
CREATE EXTENSION pg_trgm;
If you want to set up a Postgres server from scratch, one of the easiest ways is through Docker and the postgis/postgis
image. If you use the postgis/postgis
image, you just need to install the pg_trgm
extension. After having installed Docker and Docker compose you can run a new server by modifying and executing the provided Docker Compose definition file (docker-compose.yml
):
version: '3.5'
services:
db:
image: postgis/postgis:13-3.0
restart: always
ports:
- 5432:5432
environment:
POSTGRES_PASSWORD: YOURPASSWORD # Set a password
POSTGRES_USER: YOURUSER # Set a username
POSTGRES_DB: DBNAME # Set a database
volumes:
- YOURDBDATA:/var/lib/postgresql/data # Set the folder that will contain the Postgres data
- YOURINPUT:/home/input # Set the folder that contains the MAG txt files
To start the db
service you need to run docker-compose up
on a terminal, from your project folder (the folder that contains your docker-compose.yml
file). If you are new to Docker you can get more information here.
In order to improve the performance of Postgres according to the characteristics of your host computer, you can edit the Postgres configuration file (postgresql.conf
). Please check the official Postgres documentation for more details and PGTune to get some suggested parameters according to your system. Among others, it is recommended to increase the shared_buffers
, effective_cache_size
and max_wal_size
, as to allow Postgres to handle big files, before executing the script.
You can run the script using a PostgreSQL client such as psql
. You can run the whole script at once (psql -h localhost -U YOURUSER -d DBNAME -f main.sql
) or manually executing the sections you need.
If you are using Docker, you can open a psql
client session by running:
docker exec -it CONTAINERNAME psql -h localhost -U YOURUSER -d DBNAME
COPY
is a Postgres command while \copy
is a psql
client command. COPY
is faster than \copy
for large files, but the file accessibility and access rights depend on the server rather than the client (see SQL-COPY).
Once you get all the files, you may need to modify some of them. As Postgres does not allow loading files with the characters \000
and with the character \\
inside the text, you may need to delete or scape them before loading. You can achieve this by using the sed
and tr
commands in Linux. If you are using a psql
client, you can execute Linux commands adding the \!
at the beginning of each line. For example, for modifying the Papers.txt
file, you can execute inside a psql
client the following lines:
\! sed -e 's/\\/\\\\/g' < Papers.txt > Papers_.txt
\! tr -d '\000' < Papers_.txt > Papers__.txt
Alternatively, you can execute the sed
and tr
commands from a Linux terminal (removing \!
from each line), using pipes to run sed and tr at once, etc.
Another option is to read the txt
files WITH CSV delimiter E'\t' ESCAPE '\' QUOTE E'\b'
, but this may not work if the character \b
is inside the text.
If you are using Docker with the postgis:postgis
image, remember to set all the necessary reading permissions for your data
and input
folders (e.g. in Linux, you can type chmod 755 yourfolder
) and to mount them, when defining the Docker compose file.
The postgis
extension is required to enable the Spatial and Geographic objects (i.e. the geom columns) and the pg_trgm
to enable the gin_trgm_ops
class, to create GIN indexes and speeding searching for similar strings up. If you remove the geom columns from the schema definition and the gidx_*
indexes, you can omit these extensions.
It depends on your version of MAG, Postgres and the set of indexes. For example, the set of the MAG 20200917 raw txt files for the core tables is around 456 GB. After loading the data on Postgres 13, the MAG core tables require around 450 GB. Therefore, you need at least 1TB of space to store the raw and the Postgres database files.
It depends on the kind of analysis that you intend to perform. For basic operations, 4 - 16 GBs of RAM should be enough.
Please refer to my mysql_mag repository.
Please report them using Github or feel free to contact me by email. Feel free to fork this repository and submit your contributions/pull requests.
If you use these scripts in your research publications, please include the following citation:
Chacua, Christian (2020). sql_mag. MySQL and PostgreSQL scripts for Microsoft Academic Graph. https://github.com/cchacua/sql_mag