## Scientific Publications Data Warehouse

Project 1: A Data Cube on top of Delta Lake (ETL)
#### *Purpose*
The purpose is to extract data about scientific publications from JSON data that describe, title, topic, authors, etc., about a large number of papers and populate a data warehouse to issue analytics queries using SQL.

We will use Spark DataFrames to extract and transform the data.

We will also use Spark tables (delta tables) to be used for dimensions and fact tables as will be shown below.

### *DWH Schema*

We will follow the proposed schema as shown:

DBLP Fact Table:
    - Date_ID (FK)
    - Keyword_ID (FK)
    - Type_ID (FK)
    - Publication_ID (FK)
    - Venue_ID (FK)
    - FOS_ID (FK)
    - ORG_ID (FK)
    - Author_ID (FK)
    - Lange_ID (FK)
    - AuthorRank

Keyword Table:
    - ID
    - Text

Type Table:
    - ID
    - Description

Publication Table:
    - ID
    - Title
    - Year
    - PageStart
    - PageEnd
    - DOI
    - PDF
    - URL
    - Abstract
    - IndexedAbstract
    - N_Citation

Venue Table:
    - ID
    - Name
    - City
    - Country

Date Table:
    - ID
    - Year
    - Month
    - Day

Language Table:
    - ID
    - Name

FOS Table:
    - ID
    - Field

ORG Table:
    - ID
    - Name
    - City
    - Country

Author Table:
    - ID
    - FirstName
    - LastName
    - MiddleName


### *Dataset*

The data source is https://www.aminer.org/citation, version 13, as it is the most detailed one in JSON
format. You can also check the schema of the respective data set on the same page under the  "Description" link – note that the schema may not correspond to the schema in the JSON file.


#### Dataschema of V13

_Backed to v11 schema, where id and references are in String form._*


| --- | --- | --- | ---
| Field Name | Field Type | Description | Example
| id | string | paper ID | 43e17f5b20f7dfbc07e8ac6e
| title | string | paper title | Data mining: concepts and techniques
| authors.name | string | author name | Jiawei Han
| authors.org | string | author affiliation | Department of Computer Science, University of Illinois at Urbana-Champaign
| authors.id | string | author ID | 53f42f36dabfaedce54dcd0c
| venue.id | string | paper venue ID | 53e17f5b20f7dfbc07e8ac6e
| venue.raw | string | paper venue name | Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial
| year | int | published year | 2000
| keywords | list of strings | keywords | ["data mining", "structured data", "world wide web", "social network", "relational data"]
| fos.name | string | paper fields of study | Web mining
| fos.w | float | fields of study weight | 0.659690857
| references | list of strings | paper references | ["4909282", "16018031", "16159250",  "19838944", ...]
| n_citation | int | citation number | 40829
| page_start | string | page start | 11
| page_end | string | page end | 18
| doc_type | string | paper type: journal, book title... | book
| lang | string | detected language | en
| publisher | string | publisher | Elsevier
| volume | string | volume | 10
| issue | string | issue | 29
| issn | string | issn | 0020-7136
| isbn | string | isbn | 1-55860-489-8
| doi | string | doi | 10.4114/ia.v10i29.873
| pdf | string | pdf URL | //static.aminer.org/upload/pdf/1254/ 370/239/53e9ab9eb7602d970354a97e.pdf
| url | list | external links | ["http://dx.doi.org/10.4114/ia.v10i29.873", "http://polar.lsi.uned.es/revista/index.php/ia/ article/view/479"]
| abstract | string | abstract | Our ability to generate...
| indexed_abstract | dict | indexed abstract | {"IndexLength": 164, "InvertedIndex": {"Our": [0], "ability": [1], "to": [2, 7, ...]}}

### Extract

In [0]:
# let's fetch the data
# !wget https://originalstatic.aminer.cn/misc/dblp.v13.7z
# !7z x dblp.v13.7z


In [0]:
%pip install gender_guesser

Python interpreter will be restarted.
Python interpreter will be restarted.


In [0]:
# we will read and process the data while cleaning it simultaneously

import json
import ast
import os
# import tqdm.notebook

def process_json(file_name, split_size, output_prefix, offset=0, file_number=0):
    with open(file_name, 'r', encoding='utf-8') as ifh:
        # seek to the second line in the file
        if offset > 0:
            ifh.seek(offset)
        else:
            ifh.seek(1) # skip the first '['
        file_number = file_number
        checkpoint = []
        file_sizes = []
        end_of_file = False
        # we will keep looping until all the lines are read
        while not ifh or not end_of_file:
            file_number += 1
            # json_objects = [ast.literal_eval(build_json_object(ifh)) for _ in tqdm.notebook.tqdm(range(split_size))]
            json_objects = []
            while len(json_objects) < split_size:
                try:
                    json_objects.append(ast.literal_eval(build_json_object(ifh)))
                except:
                    end_of_file = True
                    break # we reached the end of the file
            print(f"Checkpoint {file_number}: {ifh.tell()}, objects processed: {len(json_objects)}")
            # write each json object to a file
            with open(f"{output_prefix}{file_number}.json", 'w', encoding='utf-8') as ofh:
                # this process yields smaller files than using json.dump w/ indent = 4
                for i, json_object in enumerate(json_objects):
                    if i == len(json_objects) - 1:
                        ofh.write(json.dumps(json_object) + "]")
                    elif i == 0:
                        ofh.write('[' + json.dumps(json_object) + ",")
                    else:
                        ofh.write(json.dumps(json_object[0]) + ",")
            # get the size of the file
            file_sizes.append(os.path.getsize(f"{output_prefix}{file_number}.json") / 1024 / 1024)
            checkpoint.append(ifh.tell())
            print(f"Checkpoint {file_number}: {checkpoint[-1]}, objects processed: {len(json_objects)}, size of file {file_number}: {file_sizes[-1]} MB")
            # break # for testing purposes
        print(f"Finished processing {file_name}, {file_number} files created.")
        return checkpoint, file_sizes

def clean_line(line):
    if "NumberInt" in line:
        line = line.replace("NumberInt", "") # NumberInt(123) -> (123)
        line = line.replace("(", '"') # (123) -> "123)
        line = line.replace(")", '"') # "123) -> "123"
    if ": null," in line or ": null" in line:
        line = line.replace("null", '""')
    return line

def build_json_object(fh):
    buffer = ''
    line = fh.readline()
    while line != "},\n":
        if not line:
            print("Reached end of file")
            return buffer[:-2]
        buffer += clean_line(line)
        line = fh.readline()
    buffer += line
    return buffer


In [0]:
# results = process_json('dblpv13.json', 100000, 'clean_dataset/dblpv13_clean_', offset=17260419910, file_number=53)
# results = process_json('dblpv13.json', 100000, 'clean_dataset/dblpv13_clean_')

### Transform

Here we will begin the transformation part of our pipeline. We will use delta tables and pyspark dataframes to do this. There are a few tasks that we must complete:
1. Drop publications with very short titles (one word, empty authors, etc.)
2. Visualize the number of citations
3. ISSN is sometimes filled with wrong values, we can either drop or make an effor to resolve using DOI for instance.
4. Defining the type of publication (journal, book, conference, etc.)
5. Resolving ambiguous author names
6. Resolving ambiguous or abbreviated conference and journal names using DBLP database.
7. Refining venues
8. Author gender
9. H-index of authors
10. Normalization of the field of study

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

import warnings
warnings.filterwarnings("ignore")


In [0]:
path_to_data = 'dbfs:/FileStore/tables/dblpv13_cleanv2_'
# path_to_data = 'dbfs:/FileStore/tables/small_'

In [0]:
# change the memory size depending
spark = SparkSession.builder \
    .appName("Project1") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")\
    .config("spark.driver.memory", "16g")\
    .config("spark.executor.memory", "16g")\
    .config("spark.driver.maxResultSize", "16g")\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")\
    .getOrCreate()

In [0]:
# gender
# %pip install gender_guesser
import gender_guesser.detector as gender

def get_author_gender(first_name):
    d = gender.Detector()
    gender_result = d.get_gender(first_name)
    if gender_result == "mostly_male":
        return "male"
    elif gender_result == "mostly_female":
        return "female"
    else:
        return gender_result

# with api / has limit
# import request
# def get_author_gender(first_name):
#     url = f'https://api.genderize.io/?name={first_name}'
#     response = requests.get(url)
#     if response.status_code == 200:
#         data = response.json()
#         if 'gender' in data:
#             return data['gender']
#     return None

def add_author_gender(authors):
    genders = []
    for author in authors:
        if author is None:
            return None
        gender = get_author_gender(author.split()[0])
        genders.append(gender)
    return genders

In [0]:
# author names
import requests

def resolve_names(name):
    response = requests.get("https://dblp.org/search/author/api", params={"q": name, "format": "json"})
    if response.status_code == 200:
        result = response.json()
        if "result" in result and "hits" in result["result"]:
            if "hit" in result["result"]["hits"]:
                for hit in result["result"]["hits"]:
                    if "info" in hit and "author" in hit["info"]:
                        author_name = hit["info"]["author"]
#                     if author_name.lower() == name.lower():
                        return author_name
    return name


def add_resolved_names(authors):
    names = []
    for author in authors:
        name = resolve_names(author['name'])
        names.append(name)
    return names

In [0]:
# publ titles
import requests

def resolve_data(request):
    response = requests.get("https://dblp.org/search/publ/api", params={"q": request, "format": "json"})
    if response.status_code == 200:
        result = response.json()
        if "result" in result and "hits" in result["result"]:
            hits = result["result"]["hits"]
            if "hit" in hits:
                for hit in hits["hit"]:
                    if all(key in hit["info"] for key in ["title", "venue"]):
                        resolved_data = hit["info"]["title"]
                        refined_venue = hit["info"]["venue"]
                        return (resolved_data, refined_venue)
    
    return (None, None)

# resolve_data('A stepwise framework for the normalization of array CGH data.')

In [0]:
from pyspark.sql.functions import explode, desc, row_number
from pyspark.sql.window import Window


# Function to merge two schemas
def merge_schemas(schema1, schema2):
    merged_fields = {field.name: field for field in schema1}
    for field in schema2:
        if field.name not in merged_fields:
            merged_fields[field.name] = field
    return StructType(sorted(merged_fields.values(), key=lambda field: field.name))

  
def preprocess_dataframe(df):
    df = df.dropDuplicates()
    
    # 1. Drop publications with very short titles (one word, empty authors, etc.)
    df = df.filter(df.title.isNotNull()) \
        .filter(length(df.title) > 5) \
        .filter(col('title').contains(' '))\
        .filter(~df.title.rlike(".*Editorial.*")) \
        .filter(~df.title.rlike(".*Forward.*")) \
        .filter(~df.title.rlike(".*Preface.*")) \
        .filter(~df.title.rlike(".*Conference.*")) \
        .filter(~df.title.rlike(".*Proceedings.*")) \
        .filter(~df.title.rlike(".*Symposium.*")) \
        .filter(~df.title.rlike(".*Workshop.*")) \
        .filter(~df.title.rlike(".*Tutorial.*")) \
        .filter(~df.title.rlike(".*Forum.*"))

    df = df.filter(length(df.abstract) > 0)

    df = df.filter(df.issn.isNotNull()) \
        .filter(length(df.issn) > 5)
    
    # 2. Visualize the number of citations
    df = df.withColumn("n_citation_int", col("n_citation").cast("integer"))
    
    # 3. ISSN is sometimes filled with wrong values, we can either drop or make an effor to resolve using DOI for instance.
    df = df.filter(~df.doi.rlike(".*[a-zA-Z]+.*"))
    
    
    # 4. Defining the type of publication (journal, book, conference, etc.)
    # Create a new column with the default publication type of 'Conference'
    df = df.withColumn('pub_type', lit('Conference'))
    # Update the publication type based on the 'venue.raw', 'volume', and 'issue' columns
    df = df.withColumn('pub_type', when(col('venue.raw').contains('@'), 'Workshop')
                                   .when((col('volume') != '') | (col('issue') != ''), 'Journal')
                                   .otherwise(col('pub_type')))
    
    # 5. Resolving ambiguous author names
    resolve_name_udf = udf(add_resolved_names, StringType())
    df = df.withColumn("resolved_authors", resolve_name_udf(df.authors))
    
    # 6. Resolving ambiguous or abbreviated conference and journal names using DBLP database
    # 7. Refining venues
    resolve_data_udf = udf(resolve_data, StructType([StructField("title", StringType()), StructField("venue", StringType())]))

    df = df.withColumn("resolved_data", resolve_data_udf(df.title))
    df = df.withColumn("resolved_title", df.resolved_data.title)
    df = df.withColumn("refined_venue", df.resolved_data.venue)
    df = df.drop("resolved_data")
    
    
    # 8. Author gender
    get_author_gender_udf = udf(add_author_gender, StringType())
    df = df.withColumn('authors_gender', get_author_gender_udf(df.resolved_authors))
    
    
    # 9.
    

    
    # 10. 
    


    return df


In [0]:
# We read JSON files with schema inference and preprocess them
json_files = [f"{path_to_data}{i}.json" for i in range(1,2)]
dataframes = [preprocess_dataframe(spark.read.option("inferSchema", "true").json(file)) for file in json_files]

In [0]:
# 2. Visualize the number of citations
display(dataframes[0].limit(20))

_id,abstract,authors,doi,fos,isbn,issn,issue,keywords,lang,n_citation,page_end,page_start,pdf,references,title,url,venue,volume,year,n_citation_int,pub_type,resolved_authors,resolved_title,refined_venue,authors_gender
53e99796b7602d9701f5cf52,No abstract,"List(List(null, null, null, 5b86b6c6e1cd8e14a34ccaa2, Yongho Sohn Ph.D., null, null, null, null, University of Central Florida, USA, null, 5f71b28c1c455f439fe3c9f6, null, null, null), List(53f7e29fdabfae938c6e77f2, null, null, 5b86b6d9e1cd8e14a34d55b9, Carelyn Campbell Ph.D., null, null, null, null, National Institute of Standards and Technology, USA, null, 5f71b2831c455f439fe3c650, null, null, null), List(null, null, null, 5b869472e1cd8e14a365e82f, John E. Morral Ph.D., null, null, null, null, The Ohio State University, USA, null, 5f71b2831c455f439fe3c656, null, null, null), List(null, null, null, 5b86a6eae1cd8e14a3e065ee, Richard D. Sisson Ph.D., null, null, null, null, Worcester Polytechnic Institute, USA, null, 5f71b2971c455f439fe3cedc, null, null, null))",10.1016/0266-3538(95)00134-4,,,Composites Science and Technology,3.0,List(),en,2,207,205.0,,,Guest editorial,"List(http://dx.doi.org/10.1016/0266-3538(95)00134-4, http://dx.doi.org/10.3233/JHS-1996-5101, http://dx.doi.org/10.1007/BF02736555, http://link.springer.com/article/10.1007/BF02736555, http://dx.doi.org/10.1049/iet-net.2016.0080)","List(555036c17cea80f9541516a0, null, null, null, null, null, , J. High Speed Networks, , null, null, null, 0)",56,1996,2,Journal,"[Yongho Sohn Ph.D., Carelyn Campbell Ph.D., John E. Morral Ph.D., Richard D. Sisson Ph.D.]",Editorial: A Message from the Editorial Team and an Introduction to the January-March 2016 Issue.,IEEE Trans. Learn. Technol.,"[unknown, unknown, male, male]"
53e99796b7602d9701f5d38d,"In recent years the amount of digital data in the world has risen immensely.But, the more information exists, the greater is the possibility of itsunwanted disclosure. Thus, the data privacy protection has become a pressingproblem of the present time. The task of individual privacy-preserving is beingthoroughly studied nowadays. At the same time, the problem of statisticaldisclosure control for collective (or group) data is still open. In this paperwe propose an effective and relatively simple (wavelet-based) way to providegroup anonymity in collective data. We also provide a real-life example toillustrate the method.","List(List(53f46040dabfaee4dc83594a, null, chertov@i.ua, 5b86a745e1cd8e14a3e2c4ba, Oleg Chertov, null, 544bd94045ce266baeb06b5f, null, null, National Technical University of Ukraine, “Kyiv Polytechnic Institute” Faculty of Applied Mathematics 37 Peremohy Prospekt 03056 Kyiv Ukraine, null, 5f71b32f1c455f439fe4112d, null, null, 18205960), List(53f44da9dabfaee1c0b09fca, null, null, 5b86a745e1cd8e14a3e2c4ba, Dan Tavrov, null, 544bd94045ce266baeb06b5f, null, null, National Technical University of Ukraine, “Kyiv Polytechnic Institute” Faculty of Applied Mathematics 37 Peremohy Prospekt 03056 Kyiv Ukraine, null, 5f71b32f1c455f439fe4112d, null, null, 11702932))",10.1007/978-3-642-14058-7_61,,,"""Information Processing and Management of Uncertainty in  Knowledge-Based Systems. Applications"" (Communications in Computer and  Information Science, Volume 81, Part 6, Part 9, 592-601), 2010",,"List(wavelet analysis., privacy-preserving data mining, group anonymity, statistical disclosure control)",en,3,601,592.0,https://static.aminer.cn/storage/pdf/arxiv/10/1011/1011.1119.pdf,,Group Anonymity,"List(http://arxiv.org/abs/1011.1119, https://arxiv.org/abs/1011.1119)","List(53e18c8d20f7dfbc07e9024f, null, null, null, null, null, , Information Processing and Management of Uncertainty, null, null, null, null, 2)",abs/1011.1,2010,3,Journal,"[Oleg Chertov, Dan Tavrov]",Dynamic group size accreditation and group discounts preserving anonymity.,Int. J. Inf. Sec.,"[male, male]"
53e99796b7602d9701f5c900,"This paper introduces and describes an algorithm or technique, called gravitational clustering, for performing cluster analysis on Euclidean data. The paper describes the physical gravitational model, an abstract generalized model, and several specific gravitational models. It illustrates clustering by one of these models using several sample data sets, and compares the results with those obtained using two other well-known nongravitational clustering methods. The paper also illustrates four graphical techniques to aid in the analysis of a clustering.","List(List(53f4d0addabfaeedd377e0c3, null, null, 5b86be33e1cd8e14a3828b37, W.E. Wright, null, 544bd9e645ce266baf39d328, null, null, Southern Illinois University at Carbondale, Carbondale, IL 62901, U.S.A., null, 5f71b2aa1c455f439fe3d5cf, null, null, null))",10.1016/0031-3203(77)90013-9,,,Pattern Recognition,3.0,"List(Gravitational clustering, Classification, Euclidean data, Cluster analysis, Contour plot)",en,68,166,151.0,,,Gravitational clustering,List(http://dx.doi.org/10.1016/0031-3203(77)90013-9),"List(539e72778314ff4cf49b3f5d, null, null, null, null, null, , Pattern Recognition, , null, null, null, 0)",9,1977,68,Journal,[W.E. Wright],Combining density peaks clustering and gravitational search method to enhance data clustering.,Eng. Appl. Artif. Intell.,[unknown]
53e99792b7602d9701f5b2c7,"In two-channel competitive genomic hybridization microarray experiments, the ratio of the two fluorescent signal intensities at each spot on the microarray is commonly used to infer the relative amounts of the test and reference sample DNA levels. This ratio may be influenced by systematic measurement effects from non-biological sources that can introduce biases in the estimated ratios. These biases should be removed before drawing conclusions about the relative levels of DNA. The performance of existing gene expression microarray normalization strategies has not been evaluated for removing systematic biases encountered in array-based comparative genomic hybridization (CGH), which aims to detect single copy gains and losses typically in samples with heterogeneous cell populations resulting in only slight shifts in signal ratios. The purpose of this work is to establish a framework for correcting the systematic sources of variation in high density CGH array images, while maintaining the true biological variations.After an investigation of the systematic variations in the data from two array CGH platforms, SMRT (Sub Mega base Resolution Tiling) BAC arrays and cDNA arrays of Pollack et al., we have developed a stepwise normalization framework integrating novel and existing normalization methods in order to reduce intensity, spatial, plate and background biases. We used stringent measures to quantify the performance of this stepwise normalization using data derived from 5 sets of experiments representing self-self hybridizations, replicated experiments, detection of single copy changes, array CGH experiments which mimic cell population heterogeneity, and array CGH experiments simulating different levels of gene amplifications and deletions. Our results demonstrate that the three-step normalization procedure provides significant improvement in the sensitivity of detection of single copy changes compared to conventional single step normalization approaches in both SMRT BAC array and cDNA array platforms.The proposed stepwise normalization framework preserves the minute copy number changes while removing the observed systematic biases.","List(List(53f43184dabfaedf4354af59, null, mkhojast@bccrc.ca, 5b86c4a8e1cd8e14a3b0a117, Mehrnoush Khojasteh, null, null, null, null, British Columbia Cancer Research Centre, Vancouver, BC, Canada|Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada, null, 5f71b28e1c455f439fe3cad2, null, null, null), List(548a7e4ddabfaed7b5fa41f6, null, wanlam@bccrc.ca, 5b8690dee1cd8e14a34ed9ef, Wan L Lam, null, null, null, null, British Columbia Cancer Research Centre, Vancouver, BC, Canada, null, null, null, null, null), List(54480291dabfae87b7dc1fa9, null, rababw@ece.ubc.ca, 5b86c1e5e1cd8e14a39d1258, Rabab K Ward, null, null, null, null, Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada, null, 5f71b28e1c455f439fe3cad2, null, null, null), List(54311531dabfae8f2912ab95, null, cmacaula@bccrc.ca, 5b8690dee1cd8e14a34ed9ef, Calum MacAulay, null, null, null, 0000-0003-4440-2792, British Columbia Cancer Research Centre, Vancouver, BC, Canada, null, null, null, null, null))",10.1186/1471-2105-6-274,"List(CDNA Arrays, Normalization (statistics), Pattern recognition, Biology, Population Heterogeneity, High density, Comparative genomic hybridization, Artificial intelligence, Mega-, Genetics, DNA microarray, Gene expression microarray)",,1471-2105,1.0,"List(microarrays, copy number, genetic variation, gene expression, nucleic acid hybridization, gene amplification, algorithms, bioinformatics, calibration)",en,110,274,274.0,https://static.aminer.cn/upload/pdf/program/53e99792b7602d9701f5b2c7_0.pdf,List(53e9baebb7602d970471a152),A stepwise framework for the normalization of array CGH data.,"List(http://dx.doi.org/10.1186/1471-2105-6-274, http://www.ncbi.nlm.nih.gov/pubmed/16297240?report=xml&format=text, https://link.springer.com/10.1186/1471-2105-6-274, http://www.webofknowledge.com/)","List(54826079582fc50b5e20fe5b, null, null, null, null, null, null, BMC Bioinformatics, null, null, null, null, 0)",6,2005,110,Journal,"[Mehrnoush Khojasteh, Wan L Lam, Rabab K Ward, Calum MacAulay]",A stepwise framework for the normalization of array CGH data.,BMC Bioinform.,"[female, andy, female, andy]"
53e99796b7602d9701f5cf4e,"Even fifteen years ago, most of us felt privileged to be co-located with a terminal room enabling us to run our various programs on remote mainframes. Those in the most technical positions, such as programmers and equipment designers, had nearby labs with somewhat more impressive, but bulky equipment. We communicated with our associates, suppliers, and customers either face-to-face or via telephone. Our presentations were prepared by hand and made presentable by clerical staff. Paper documents were transmitted by mail or courier; paper tape was still sometimes used for data. Any suggestion to an executive that he type something for himself was immediately met by indignation.","List(List(null, null, null, null, Chieng, D., null, null, null, null, null, null, null, null, null, null), List(null, null, null, null, Yau, K.A., null, null, null, null, null, null, null, null, null, null), List(null, null, null, null, Ni, Q., null, null, null, null, null, null, null, null, null, null))",10.1016/0263-8231(94)90042-6,,,FEMS Microbiology Reviews,1.0,"List(Electronic mail, Office automation, Software, Data integration, Business communication, Standards)",en,3,1,1.0,,,Guest editorial,"List(http://dx.doi.org/10.1016/0263-8231(94)90042-6, http://dx.doi.org/10.1002/ima.1850040102, http://dx.doi.org/10.1016/j.inffus.2015.07.008, http://dl.acm.org/citation.cfm?id=2837848.2838126&coll=DL&dl=GUIDE&preflayout=flat, http://dx.doi.org/10.1049/iet-net.2015.0093, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=7312542, https://doi.org/10.1108/ITSE-02-2017-0016, http://dx.doi.org/10.1109/MNET.1987.6434223, https://ieeexplore.ieee.org/document/6434223/, https://doi.org/10.1109/MNET.2018.8473414, http://dx.doi.org/10.1902/jop.1989.60.7.417, https://www.ncbi.nlm.nih.gov/pubmed/29539664, http://www.webofknowledge.com/)","List(53a7321620f7420be8d78ece, null, null, null, null, null, null, European Transactions on Telecommunications, null, null, null, null, 0)",11,1993,3,Journal,"[Chieng, D., Yau, K.A., Ni, Q.]",Editorial: A Message from the Editorial Team and an Introduction to the January-March 2016 Issue.,IEEE Trans. Learn. Technol.,"[unknown, unknown, unknown]"
53e99796b7602d9701f5c215,"A read-channel chip set for rewritable 3.5 in 230 Mbytes magneto-optical disk drives (MOD) is presented. The front-end chip includes an automatic gain control (AGC) circuit, a programmable six-pole two-zero equiripple filter/equalizer, a DC restore circuit, and pulse detectors. The back-end contains a frequency synthesizer phase-locked loop (PLL) and a data separator PLL with 3:1 operating range to support a constant density recording with 8-24 Mb/s data rate (or code rate of 16 to 48 Mb/s) in (2, 7) run-length limited (RLL) encoding format. The architecture of the chip provides high degree of programmability through a serial microprocessor interface, fast switching (<1 /spl mu/s) between sector mark and data detector modes, and four levels of power management in a 1.5 /spl mu/m 4 GHz BiCMOS process. With a nominal power supply of 5 V, the chip set dissipates 600 mW during normal operation and 1 mW during sleep mode.","List(List(53f45bb3dabfaee2a1d87ab3, null, null, null, Sang-Soo Lee, null, null, null, null, null, null, null, null, null, null), List(53f43b7edabfaee0d9b96d29, null, null, null, Carlos A. Laber, null, null, null, null, null, null, null, null, null, null))",10.1109/92.544410,"List(Phase-locked loop, Code rate, Microprocessor, Frequency synthesizer, Chip, Engineering, Chipset, Automatic gain control, Computer hardware, Detector)",,1063-8210,4.0,"List(rll encoding format, magneto-optical disk drives, frequency synthesizer pll, 600 mw, code rate, data separator pll, power management, dc restore circuit, back-end chip, 5 v, pulse detectors, bicmos integrated circuits, mbytes magneto-optical disk drive, magneto-optical recording, automatic gain control, 1 mw, 1.5 micron, phase locked loops, agc circuit, 230 mbyte, mixed analogue-digital integrated circuits, read-channel chip, phase-locked loop, serial microprocessor interface, mbytes read-channel chip, front-end chip, read-channel chip set, optical disc storage, run-length limited encoding, six-pole two-zero filter/equalizer, bicmos process, 8 to 48 mbit/s, data rate, runlength codes, 4 ghz, programmable equiripple filter/equalizer, encoding format, 3.5 in, constant density recording, phase lock loop, gain control, front end, detectors, frequency synthesizer, phase locked loop, normal operator, chip)",en,1,463,455.0,,,A 3.5 in 230 Mbytes read-channel chip set for magneto-optical disk drives,"List(http://dx.doi.org/10.1109/92.544410, http://doi.ieeecomputersociety.org/10.1109/92.544410, http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?tp=&arnumber=544410)","List(555036cf7cea80f954158d42, null, null, IEEE Transactions on Very Large Scale Integration Systems, null, null, null, IEEE Trans. VLSI Syst., null, null, null, null, 0)",4,1996,1,Journal,"[Sang-Soo Lee, Carlos A. Laber]",A 3.5 in 230 Mbytes read-channel chip set for magneto-optical disk drives.,IEEE Trans. Very Large Scale Integr. Syst.,"[male, male]"
53e99796b7602d9701f5c226,"Since Wolfram proposed to use cellular automata as pseudorandom sequence generators, many cryptographic applications using cellular automata have been introduced. One of the recent one is Mukherjee, Ganguly, and Chaudhuri's message authentication scheme using a special class of cellular automata called Single Attractor Cellular Automata (SACA). In this paper, we show that their scheme is vulnerable to a chosen-message attack, i.e., the secret key can be recovered by an attacker using only several chosen message-MAC pairs. The weakness of the scheme results from the regularity of SACA.","List(List(53f428e7dabfaec09f0df4cf, null, null, 5b86c978e1cd8e14a3d36653, Mun-Kyu Lee, null, null, null, null, School of Computer Science and Engineering, Inha University, Incheon 402-751, Korea, null, 5f71b2f71c455f439fe3f876, List(Inha Univ, Sch Engn & Comp Sci, Incheon 402751, South Korea), null, null), List(53f4c792dabfaee57c77c396, null, null, 5b86bb06e1cd8e14a36b9803, Dowon Hong, null, null, null, 0000-0001-9690-5055, Electronics and Telecommunications Research Institute, Daejeon 305-350, Korea, null, 5f71b89a1c455f439fe60441, List(Elect & Telecommun Res Inst, Daejeon 305350, South Korea), null, null), List(54856b64dabfae8a11fb2912, null, dqkim@hanyang.ac.kr, 5b869ad4e1cd8e14a390d384, Dong Kyue Kim, null, null, null, null, Division of Electronics and Computer Engineering, Hanyang University, Seoul 133-791, Korea, null, 5f71b2f51c455f439fe3f7b9, List(Hanyang Univ, Div Elect & Comp Engn, Seoul 133791, South Korea), null, null))",10.1007/978-3-540-74377-4_45,"List(Attractor, Hash-based message authentication code, Cellular automaton, Message authentication code, Cryptography, Computer science, Cryptanalysis, Theoretical computer science, Data Authentication Algorithm, Chosen message attack)",,0302-9743,,"List(special class, chosen-message attack, cellular automaton, cryptographic application, chosen message attack, scheme result, message authentication scheme, secret key, single attractor cellular automata, chosen message-mac pair, pseudorandom sequence generator, message authentication, cellular automata)",en,0,434,427.0,,"List(53e9a114b7602d9702a06562, 53e99dccb7602d970268b3a8, 53e9aef1b7602d970391d2b1, 53e9a20fb7602d9702b112a2, 53e9ac28b7602d97035e7263, 53e99a0ab7602d97022572c1, 53e9ac28b7602d97035e7141, 53e9994cb7602d970218dbdb, 53e9b1a9b7602d9703c3455f)",Chosen Message Attack Against Mukherjee-Ganguly-Chaudhuri's Message Authentication Scheme,"List(http://dx.doi.org/10.1007/978-3-540-74377-4_45, http://www.webofknowledge.com/)","List(53a72ec720f7420be8c9a9e3, null, null, null, null, null, null, CIS, null, null, null, null, 0)",4456,2007,0,Journal,"[Mun-Kyu Lee, Dowon Hong, Dong Kyue Kim]",Chosen Message Attack Against Mukherjee-Ganguly-Chaudhuri's Message Authentication Scheme.,CIS,"[unknown, unknown, andy]"
53e99792b7602d9701f5b3e5,"Augmented Reality is the technology that overlays virtual image and information generated by computer on real scene, the technology combines the virtual object in a real world. This paper puts forward a registration method of multi-marker in augmented system. A paleontology magic book is designed and realized. The book is special for virtual education and can be interactive in real time. At the end of the paper, the result is showed. In this study, the prosperous future of augmented reality applying in education can be seen.","List(List(5603dcf245cedb3396276ecc, null, xiaohuitan@163.com, null, Xiaohui Tan, null, null, null, null, Beijing Normal Univ, Sch Informat Sci & Technol, Beijing 100875, Peoples R China, null, 5f71b28b1c455f439fe3c989, List(Beijing Normal Univ, Sch Informat Sci & Technol, Beijing 100875, Peoples R China, Informat Engineer Coll Capital Normal Univ, Beijing 100048, Peoples R China), null, null), List(53f391a4dabfae4b34a58998, null, null, null, Pengcheng Fan, null, null, null, null, Beijing Normal Univ, Sch Informat Sci & Technol, Beijing 100875, Peoples R China, null, 5f71b28b1c455f439fe3c989, List(Beijing Normal Univ, Sch Informat Sci & Technol, Beijing 100875, Peoples R China), null, null), List(53f44999dabfaee02ad25e1d, null, null, null, Liming Luo, null, null, null, null, Informat Engineer Coll Capital Normal Univ, Beijing 100048, Peoples R China, null, 5f71b28b1c455f439fe3c989, List(Informat Engineer Coll Capital Normal Univ, Beijing 100048, Peoples R China), null, null), List(542a16aedabfaec7081dce4b, null, null, null, Mingquan Zhou, null, null, null, null, Beijing Normal Univ, Sch Informat Sci & Technol, Beijing 100875, Peoples R China, null, 5f71b28b1c455f439fe3c989, List(Beijing Normal Univ, Sch Informat Sci & Technol, Beijing 100875, Peoples R China), null, null))",10.1007/978-3-642-27552-4_60,"List(Virtual image, Educational technology, Computer architecture, Computer graphics (images), Computer science, Augmented reality, Overlay)",,1867-5662,,"List(Augmented Reality, Virtual Education, Computer Vision)",en,0,+,431.0,,"List(53e9a8c5b7602d9703216bc6, 53e9ab6fb7602d970350d776, 53e9adb6b7602d97037b854c)",A Method of Multiple-Marker Register and Application on Virtual Education.,"List(http://dx.doi.org/10.1007/978-3-642-27552-4_60, http://www.webofknowledge.com/)","List(null, 1867-5662, null, null, null, null, null, FRONTIERS IN COMPUTER EDUCATION, null, FRONTIERS IN COMPUTER EDUCATION, null, J, null)",133,2011,0,Journal,"[Xiaohui Tan, Pengcheng Fan, Liming Luo, Mingquan Zhou]",A Method of Multiple-Marker Register and Application on Virtual Education.,ICFCE,"[unknown, unknown, unknown, unknown]"
53e99796b7602d9701f5c1a2,"Existing medical vocabularies lack rich terms to describe findings that are generated by modem molecular diagnostic procedures. Most bioinformatics resources were designed primarily to support the needs of the research community. We describe the development of a curated resource, the Clinical Bioinformatics Ontology (CBO), a semantic network appropriate for describing clinically significant genomics concepts. The CBO includes concepts appropriate for both molecular diagnostics and cytogenetics. A standardized methodology based on consistent application of RefSeq information is applied to the curation of the CBO in order to provide a reproducible and reliable tool. Challenges related to this curation process are discussed in this paper. At the time of submission the CBO included 4,069 concepts, associated by 8,463 relationships.","List(List(56017d9745cedb3395e63f00, null, null, null, M Hoffman, null, null, null, null, null, null, null, null, null, null), List(53f47465dabfaedd74ea10d5, null, null, null, C Arnoldi, null, null, null, null, null, null, null, null, null, null), List(53f43320dabfaee02aca879b, null, null, null, I Chuang, null, null, null, null, null, null, null, null, null, null))",,"List(Ontology, RefSeq, Molecular diagnostics, Information retrieval, Computer science, Research community, Semantic network, Genomics, Bioinformatics)",,2335-6936,,"List(controlled vocabulary, diagnostic test, semantic network, snomed ct, single nucleotide polymorphism, molecular diagnostics)",en,33,150,139.0,//static.aminer.org/pdf/PDF/000/554/244/the_clinical_bioinformatics_ontology_a_curated_semantic_network_utilizing_refseq.pdf,"List(53e9a281b7602d9702b88181, 55a3f4c2612ca648687ef2fa, 5550475c45ce0a409eb66b99, 53e9aba5b7602d970355360e, 53e9afc0b7602d9703a0e389, 53e9a682b7602d9702fb432e, 53e9a86ab7602d97031b90e9, 5550474545ce0a409eb660ce)",The clinical bioinformatics ontology: a curated semantic network utilizing RefSeq information.,"List(http://psb.stanford.edu/psb-online/proceedings/psb05/hoffman.pdf, http://www.ncbi.nlm.nih.gov/pubmed/15759621?report=xml&format=text)","List(53a7254520f7420be8b4a58f, null, null, Pacific Symposium on Biocomputing, null, null, null, Pacific Symposium on Biocomputing, null, null, null, null, 0)",,2005,33,Conference,"[M Hoffman, C Arnoldi, I Chuang]",The Clinical Bioinformatics Ontology: A Curated Semantic Network Utilizing RefSeq Information.,Pacific Symposium on Biocomputing,"[unknown, unknown, unknown]"
53e99796b7602d9701f5c857,"We introduce gesture controllers, a method for animating the body language of avatars engaged in live spoken conversation. A gesture controller is an optimal-policy controller that schedules gesture animations in real time based on acoustic features in the user's speech. The controller consists of an inference layer, which infers a distribution over a set of hidden states from the speech signal, and a control layer, which selects the optimal motion based on the inferred state distribution. The inference layer, consisting of a specialized conditional random field, learns the hidden structure in body language style and associates it with acoustic features in speech. The control layer uses reinforcement learning to construct an optimal policy for selecting motion clips from a distribution over the learned hidden states. The modularity of the proposed method allows customization of a character's gesture repertoire, animation of non-human characters, and the use of additional inputs such as speech recognition or direct user control.","List(List(53f42828dabfaeb22f3ce756, null, null, 5b86b6bfe1cd8e14a34c91c3, Sergey Levine, null, 544bd96545ce266baeca9366, null, null, Stanford University, null, 5f71b2841c455f439fe3c67b, null, null, null), List(53f4589ddabfaedd74e36b9a, null, null, 5b86b6bfe1cd8e14a34c91c3, Philipp Krähenbühl, null, 544bd96545ce266baeca9366, null, null, Stanford University, null, 5f71b2841c455f439fe3c67b, null, null, null), List(53f48525dabfaedf436905bf, null, null, 5b86b6bfe1cd8e14a34c91c3, Sebastian Thrun, null, 544bd96545ce266baeca9366, null, null, Stanford University, null, 5f71b2841c455f439fe3c67b, null, null, null), List(54089dfbdabfae450f42ee3c, null, null, 5b86b6bfe1cd8e14a34c91c3, Vladlen Koltun, null, 544bd96545ce266baeca9366, null, null, Stanford University, null, 5f71b2841c455f439fe3c67b, null, null, null))",10.1145/1833349.1778861,,,0730-0301,4.0,"List(data-driven animation, speech signal, optimal control, gesture controller, direct user control, human animation, hidden state, schedules gesture animation, control layer, speech recognition, gesture repertoire, gesture synthesis, acoustic feature, inference layer, nonverbal behavior generation)",en,125,11,1.0,,,Gesture controllers,"List(http://doi.acm.org/10.1145/1833351.1778861, https://dl.acm.org/doi/abs/10.1145/1778765.1778861)","List(555036f67cea80f95416a9f4, null, null, null, null, null, null, ACM Trans. Graph., null, null, null, null, 0)",29,2010,125,Journal,"[Sergey Levine, Philipp Krähenbühl, Sebastian Thrun, Vladlen Koltun]",A Gesture Recognition Model for Virtual Reality Motion Controllers.,CGVC,"[male, male, male, male]"


Databricks visualization. Run in Databricks to view.

In [0]:
# Merge schemas from all dataframes
merged_schema = dataframes[0].schema
for df in dataframes[1:]:
    merged_schema = merge_schemas(merged_schema, df.schema)

# Apply the merged schema to all dataframes
dataframes_with_merged_schema = [df.selectExpr(*merged_schema.fieldNames()) for df in dataframes]

# Union all dataframes to create a single dataframe with a consistent schema
combined_df = dataframes_with_merged_schema[0].limit(20)
for df in dataframes_with_merged_schema[1:]:
    combined_df = combined_df.unionByName(df.limit(20), allowMissingColumns=True)

# Save the combined dataframe to a Delta table
combined_df.write.format("delta").mode("append").option("mergeSchema", "true").save("dbfs:/FileStore/tables/delta/dblpv13")

In [0]:
combined_df = combined_df.withColumn("publication_id", monotonically_increasing_id()) \
    .withColumn("author_id", monotonically_increasing_id()) \
    .withColumn("venue_id", monotonically_increasing_id()) \
    .withColumn("fos_id", monotonically_increasing_id()) \
    .withColumn("org_id", monotonically_increasing_id()) \
    .withColumn("date_id", monotonically_increasing_id()) \
    .withColumn("keyword_id", monotonically_increasing_id()) \
    .withColumn("type_id", monotonically_increasing_id()) \
    .withColumn("lang_id", monotonically_increasing_id())

# display(combined_df)

# dblp_fact_table = combined_df.select("date_id", "keyword_id", "type_id", "publication_id", "venue_id",
#                                       "fos_id", "org_id", "author_id", "lang_id")

# keyword_table = combined_df.select("keyword_id",
#                                    col("keywords").alias("text"))


venue_table = combined_df.select("venue_id",
#                                  col("venue.name_d").alias("name"),
                                 col("refined_venue").alias("name"),
                                 col("venue.type").alias("type"),
                                 col("venue.raw").alias("raw"),
                                 col("venue._id").alias("vid"))

venue_table = venue_table.filter(venue_table.type.isNotNull())

# date_table = combined_df.select("date_id",
#                                 year("year").alias("year"))

# language_table = combined_df.select("lang_id",
#                                     col("lang").alias("name"))


# fos_table = combined_df.select("fos_id",
#                                col("fos").alias("field"))

# author_table = combined_df.select("author_id",
# #                                   col("authors.name").alias("name"),
#                                   col("resolved_authors").alias("name"),
#                                   col("authors.org").alias("org"),
#                                   col("authors.gid").alias("gid"),
#                                   col("authors.orgid").alias("orgid"),
#                                   col("authors_gender").alias("gender"))

# publication_table = combined_df.select("publication_id",
# #                                        col("title").alias("name"),
#                                        col("resolved_title").alias("name"),
#                                        col("abstract").alias("description"),
#                                        col("doi").alias("doi"),
#                                        col("issn").alias("issn"),
#                                        col("isbn").alias("isbn"),
#                                        col("url").alias("url"),
#                                        col("pdf").alias("pdf"),
#                                        col("page_start").alias("page_start"),
#                                        col("page_end").alias("page_end"),
#                                        col("volume").alias("volume"),
#                                        col("issue").alias("issue"),
#                                        col("n_citation").alias("n_citation"),
#                                        col("pub_type").alias("pub_type"))


In [0]:
# dblp_fact_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/dblp_fact_table")
# keyword_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/keyword_table")
venue_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/venue_table")
# date_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/date_table")
# language_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/language_table")
# fos_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/fos_table")
# author_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/author_table")
# publication_table.write.format("delta").mode("append").save("dbfs:/FileStore/tables/delta/publication_table")


In [0]:
# load table from delta lake
venue_table = spark.read.format("delta").load("dbfs:/FileStore/tables/delta/venue_table")


In [0]:
dbutils.fs.rm("dbfs:/FileStore/tables/delta/venue_table", recurse=True)

Out[211]: True

In [0]:
display(venue_table)

venue_id,name,type,raw,vid
0,IEEE Trans. Learn. Technol.,0,J. High Speed Networks,555036c17cea80f9541516a0
1,Int. J. Inf. Sec.,2,Information Processing and Management of Uncertainty,53e18c8d20f7dfbc07e9024f
2,Eng. Appl. Artif. Intell.,0,Pattern Recognition,539e72778314ff4cf49b3f5d
3,BMC Bioinform.,0,BMC Bioinformatics,54826079582fc50b5e20fe5b
4,IEEE Trans. Learn. Technol.,0,European Transactions on Telecommunications,53a7321620f7420be8d78ece
5,IEEE Trans. Very Large Scale Integr. Syst.,0,IEEE Trans. VLSI Syst.,555036cf7cea80f954158d42
6,CIS,0,CIS,53a72ec720f7420be8c9a9e3
8,Pacific Symposium on Biocomputing,0,Pacific Symposium on Biocomputing,53a7254520f7420be8b4a58f
9,CGVC,0,ACM Trans. Graph.,555036f67cea80f95416a9f4
10,J. Symb. Log.,0,"J. Comb. Theory, Ser. B",555036df7cea80f954162774
