#### This notebook requires the Delta tables from project 1 to be saved on DBFS!
In project 1, run chapters 1, 2, and 3 in full to save the Delta tables.

### Read data and extract relevant features

In [0]:
import pyspark.sql.functions as F
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [0]:
# Read the full dblp fact table.
dblp_df = (spark
           .read.load('dbfs:/user/dblpv13/dblp_full')
           .filter(F.col('Lang') == 'en')) # Keep only papers in English

In [0]:
# Keep just the relevant features: FOS, Keywords, Title, Abstract.
data = dblp_df.select('FOS', 'Keywords', 'Title', 'Abstract')

# Drop all rows where any value is null.
data = data.na.drop()

# Keep only rows where the title and the abstract each contain more than one word.
data = data.filter((F.size(F.split(F.col('Title'), ' ')) > 1) &
                   (F.size(F.split(F.col('Abstract'), ' ')) > 1))

# Remove all empty (null or '') values from the FOS and Keywords arrays.
data = (data
        .withColumn('FOS', F.expr('filter(FOS, x -> x is not null and length(x) > 0)'))
        .withColumn('Keywords', F.expr('filter(Keywords, x -> x is not null and length(x) > 0)')))

logger.info(f'Our data has {data.count()} samples.')

display(data.limit(10))

INFO:__main__:Our data has 1356788 samples.


FOS,Keywords,Title,Abstract
"List(Logic synthesis, Topology, Digital electronics, Boolean circuit, Sequential logic, Pass transistor logic, Logic optimization, Algorithm, Register-transfer level, High-level verification, Mathematics)","List(combinational circuits, formal verification, logic design, logic testing, network topology, SAT, combinational circuits, fixed circuit topology, formal verification, logic synthesis, logic verification, minimum ECO, net lists, test vector generation method)",Logic synthesis and verification on fixed topology,"We discuss about logic synthesis and formal verification of combinational circuits mapped to a given fixed topology. Here “fixed topology” means that circuit structures in terms of net lists except for gate/cell types are fixed in advance. That is, for logic synthesis, what should be generated are the types of gates/cells (or simply gates) in the circuits, and all the others are prefixed before synthesis. As the circuit topology is fixed in advance, placement and routing could be shared among different designs and minimum ECO can be realized by keeping the same layout. Also, we can show that we do not need many test vectors in order to guarantee 100% correctness of such synthesis. Small numbers of test vectors, such as only 100 test vectors for 30 input circuits, are sufficient to test if the circuits behave correctly for all possible input value combinations. That is, very efficient formal verification can be realized through simulations with small numbers of test vectors. We present SAT based implementation of the synthesis and a test vector generation method with preliminary but encouraging experimental results."
"List(Detection theory, Computer science, Inference, Algorithm, Artificial intelligence, Fuse (electrical), Machine learning, Recursion)","List(adaptive signal detection, compressed sensing, multidimensional signal processing, signal reconstruction, adaptive sequential compressive signal detection, compressive gain, multidimensional phenomena, nonadaptive strategy, recursive sparse reconstruction algorithm, sequential reconstruction, sparse signal detection, stopping criterion)",Adaptive sequential compressive detection,"Sparsity is at the heart of numerous applications dealing with multidimensional phenomena with low-information content. The primary question that this work investigates is whether, and how much, further compressive gains could be achieved if the goal of the inference task does not require exact reconstruction of the underlying signal. In particular, if the goal is to detect the existence of a sparse signal in noise, it is shown that the number of measurements can be reduced. In contrast to prior work, which considered non-adaptive strategies, a sequential adaptive approach for compressed signal detection is proposed. The key insight is that the decision can be made as soon as a stopping criterion is met during sequential reconstructions. Two sources of performance gains are studied, namely, compressive gains due to adaptation, and computational gains via recursive sparse reconstruction algorithms that fuse newly acquired measurements and previous signal estimates."
"List(Model order selection, Subspace topology, Computer science, Ordinal number, Ordinal data, Matrix decomposition, Speech recognition, Language acquisition, Artificial intelligence, Non-negative matrix factorization, Robot, Machine learning)","List(acoustic signal processing, audio user interfaces, human-robot interaction, intelligent robots, learning (artificial intelligence), matrix decomposition, CSNMF, accuracy improvement, acoustic data augmentation, automatic relevance, command execution, command meaning representation, constrained subspace NMF, grounding information, learning rate improvement, machine learning algorithm, model order selection, ordinal structure, ordinal word acquisition, robot learning, semantic labels, spoken utterances, vocal interface, weakly-supervised NMF, weakly-supervised nonnegative matrix factorization, Automatic Relevance Determination (ARD), Language acquisition, Machine learning, Nonnegative Matrix Factorization (NMF), Ordinal data)",Acquisition of ordinal words using weakly supervised NMF,This paper issues in the design of a vocal interface for a robot that can learn to understand spoken utterances through demonstration. Weakly supervised non-negative matrix factorization (NMF) is used as a machine learning algorithm where acoustic data are augmented with semantic labels representing the meaning of the command. Many parameters that the robot needs in order to execute the commands have an ordinal structure. Constrained subspace NMF (CSNMF) is proposed as an extension to NMF that aims to better deal with ordinal data and thus increase the learning rate of the grounding information with an ordinal structure. Furthermore automatic relevance determination is used to deal with model order selection. The use of CSNMF yields a significant improvement in the learning rate and accuracy when recognising ordinal parameters.
"List(Teleoperation, Robot control, Computer vision, Social robot, Robot calibration, Robot end effector, Visual servoing, Artificial intelligence, Engineering, Robot, Mobile robot)","List(human-robot interaction, image sensors, mobile robots, robot vision, stereo image processing, telerobotics, visual perception, image camera, laser point, master slave system, master touch manipulation, mobile robot teleoperation, remote location, slave mobile robot, spatial recognition, stereo camera, touch input, master-slave system, operability, two-wheeled robot, visual servoing, xenon, human robot interaction, zirconium)",An approach to high operational control of a slave mobile robot by master touch manipulation,"The master-slave system is expected to be a key technology for the next generation of robots. Indeed, a master-slave system will be utilized in many fields to perform tasks remotely in unknown environments. However, it is difficult for the operator to lead the slave robot to goal position correctly in case of operating the slave robot in remote location since the operator cannot recognize three-dimensional space correctly. The purpose of this research is realization of high operational control of teleoperated mobile robot. With such system, the operator need not spatial recognition, and the proposed system does not depend on the operator's skill, and also the proposed system can be applied to unknown environment. In the proposed system, the operator input the goal position by touch input on the screen which show the camera image. After touch input, the slave robot control the manipulator which mounts laser pointer and illuminates goal position. Then, the slave robot recognizes laser point by stereo camera and move to the goal position automatically. In this research, we utilize the touch input device as the master robot and the two-wheeled robot as the slave robot. Simulation is conducted to verify the validity of the proposed method."
"List(Least mean squares filter, Mathematical optimization, Renewable energy, Idle, Scheduling (computing), Computer science, Mean squared error, Solar power, Operator (computer programming), Grid)","List(least squares approximations, load forecasting, sunlight, LMS algorithm, LinEx cost functions, LinLin cost functions, asymmetrie cost functions, grid operator problem, least mean squares algorithm, online solar radiation forecasting, tracking ability)",Online solar radiation forecasting under asymmetrie cost functions,"Grid operators are tasked to balance the electric grid such that generation equals load. In recent years renewable energy sources have become more popular. Because wind and solar power are intermittent, system operators must predict renewable generation and allocate some operating reserves to mitigate errors. If they overestimate the renewable generation during scheduling, they do not have enough generation available during operation, which can be very costly. On the other hand, if they underestimate the renewable generation, they face only the cost of keeping some generation capacity online and idle. So overestimation of resources create a more serious problem than underestimation. However, many researchers who study the solar radiation forecasting problem evaluate their methods using symmetric criteria like root mean square error (RMSE) or mean absolute error (MAE). In this paper, we use LinLin and LinEx which are asymmetric cost functions that are better fitted to the grid operator problem. We modify the least mean squares (LMS) algorithm according to LinLin and LinEx cost functions to create an online forecasting method. Due to tracking ability, the online methods gives better performance than their corresponding batch methods which is confirmed using simulation results."
"List(Computer vision, Pattern recognition, Computer science, Support vector machine, Pixel, Artificial intelligence, Image resolution, Classification rate, Geometric mean, Statistical analysis)","List(face recognition, geometry, image classification, statistical analysis, visual databases, CCR, GMean, automatic face gender classification, correct classification rate, cross-database, face datasets, geometric mean, image resolution, image sizes, single-database, statistical analysis)",Analysis of the Effect of Image Resolution on Automatic Face Gender Classification,"This paper presents a thorough study into the influence of the image resolution on automatic face gender classification. The images involved range from extremely low resolutions (2 × 1 pixels) to full face sizes (329 × 264 pixels). A comprehensive comparison of the performances achieved by two classifiers using ten different image sizes is provided by means of two performance measures Correct Classification Rate (CCR) and Geometric Mean (GMean). Single- and cross-database experiments are designed over three well-known face datasets. A detailed statistical analysis of the results revealed that a face as small as 3 × 2 pixels carries some useful information for distinguishing between genders. However, in situations where higher resolution face images are available, moderately sized faces from 22 × 18 to 90 × 72 pixels are optimal for this task. Furthermore, the performance of the classifiers was robust to the changes in the image resolution (using medium to full sizes). Only when the image resolution was reduced to 8 × 6 pixels or smaller, the classification results were significantly affected."
"List(Control theory, Output feedback, Control theory, Exact model matching, Invertible matrix, Mathematics)","List(delays, feedback, linear systems, matrix algebra, robust control, emmdr, exact model matching with simultaneous disturbance rejection, linear singular multidelay systems, multidelay measurement output feedback type controller, necessary conditions, realizable controller solution, sufficient conditions)",Exact model matching and disturbance rejection of linear singular multi-delay systems via measurement output feedback,For the case of regularizable linear singular multi delay systems the design problem of Exact Model Matching with simultaneous Disturbance Rejection (EMMDR) is solved under the assumption that the ideal model is left invertible and using a general proportional multi delay measurement output feedback type controller. The necessary and sufficient conditions for the problem to admit a realizable controller solution are established and the general solution of the realizable controllers is derived.
"List(Channel code, Secure multi-party computation, Secret sharing, Computer science, Communication channel, Computer network, Theoretical computer science, Implementation, Redundancy (engineering), Linear function, Computation)","List(Reed-Solomon codes, channel coding, channel coding techniques, communication overhead, nested Reed-Solomon codes, secret sharing schemes, secure multiparty computation, wiretap channel, wiretap codes)",Wiretap codes for secure multi-party computation,"In this paper, we propose a new secret sharing scheme for secure multi-party computation. We present a general framework that allows us to construct efficient secret sharing schemes from channel coding techniques for the wiretap channel. The resulting schemes can be employed to securely calculate linear functions of data that are distributed in a network without leaking any information on the data except the desired result. For the examples considered in this paper, our schemes minimize the communication overhead while keeping the data perfectly secure. Compared to conventional schemes, for which the communication overhead grows quadratically in the number of clients in the considered scenarios, the communication overhead for our approach grows only linearly with the number of clients. This property is maintained even if our secret sharing scheme is set up to introduce redundancy in order to compensate for losses of secret shares. While we only consider the case of passive eavesdroppers and implementations based on nested Reed-Solomon codes in this paper, the proposed framework can also be applied in other cases (e.g., when clients tamper with the data) by taking into account the effects of attacks in the design of the underlying wiretap code."
"List(Mel-frequency cepstrum, Speech processing, Turkish, Computer science, Support vector machine, User information, Speech recognition, Speaker recognition, Artificial intelligence, Natural language processing, Hidden Markov model, German)","List(speaker recognition, speech processing, Turkish language, audio based age identification, audio based gender identification, audio information extraction, age identification, classification, gender identification, speech database)",Audio-based gender and age identification,"Nowadays interaction between humans and computers is increasing rapidly. Efficiency and comfort of these interactions depend on the availability of user information to computers. Gender, age and emotional state are most the most fundamental pieces of these information. Extraction of such information from audio or video data is an important research area. There are several works on different languages including especially English, German and Italian. In this study, we developed a system for Turkish language to extract gender and age information from the speech. Our test results show that the proposed system is able to identify the gender and age range of the speaker with a success rate of 99%."
"List(Baseband, Digital biquad filter, Wireless, Chip, Electronic engineering, CMOS, Transimpedance amplifier, Engineering, Code division multiple access, Operational amplifier)","List(3g mobile communication, cmos integrated circuits, cellular radio, channel allocation, code division multiple access, inductors, integrated circuit packaging, next generation networks, operational amplifiers, surface acoustic waves, telecommunication standards, 3g td-scdma dual bands, 4g, cmos, pga, baseband cr filter, cellular phone networks, cellular standards, dynamic gain-bandwidth-product-extension circuit, high-speed wireless communication, multiband inductor less saw-less 2g/3g-td-scdma cellular receiver, receiver supporting 2g quad bands, size 40 nm, transimpedance amplifier, wider channel bandwidth)",20.7 A multi-band inductor-less SAW-less 2G/3G-TD-SCDMA cellular receiver in 40nm CMOS,"The growing demand for high-speed wireless communication has driven the evolution of cellular phone networks. New-generation cellular standards use wider channel bandwidth and more sophisticated modulation to obtain higher data-rates. Due to various cellular standards, chip providers are required to offer highly integrated solutions that support 2G, 3G, and even 4G in one chip. This paper presents a receiver supporting 2G quad bands and 3G TD-SCDMA dual bands. Figure 20.7.1 shows the 2G/3G receiver, whose front-end current-mode outputs are combined at baseband CR filter and biquad PGA, which are shared between all bands. A dynamic gain-bandwidth-product-extension circuit technique is used to remove a transimpedance amplifier to save die area and current."
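
The cleaning rules applied above (drop rows with nulls, require multi-word titles and abstracts, strip empty entries from the arrays) can be mirrored in plain Python to check expectations on a handful of rows without a Spark session. This is only a sketch: `clean_array`, `clean_row`, and the sample row are hypothetical names and data introduced for illustration, and the code is not the Spark pipeline itself.

```python
from typing import Optional

FIELDS = ('FOS', 'Keywords', 'Title', 'Abstract')


def clean_array(values):
    """Drop None and empty-string entries, like the `filter(...)` expressions."""
    return [v for v in values if v is not None and len(v) > 0]


def clean_row(row: dict) -> Optional[dict]:
    """Return the cleaned row, or None if the row would be dropped."""
    # Mirrors na.drop(): discard rows where any selected column is null.
    if any(row.get(field) is None for field in FIELDS):
        return None
    # Mirrors the size(split(col, ' ')) > 1 checks on Title and Abstract.
    if len(row['Title'].split(' ')) <= 1 or len(row['Abstract'].split(' ')) <= 1:
        return None
    # Mirrors the array clean-up of FOS and Keywords.
    return {**row,
            'FOS': clean_array(row['FOS']),
            'Keywords': clean_array(row['Keywords'])}


# Hypothetical sample row, not taken from the dataset.
sample = {'FOS': ['Algorithm', '', None],
          'Keywords': [None],
          'Title': 'Two words',
          'Abstract': 'Some abstract text'}
print(clean_row(sample))
```

Note that in this sketch a row whose `Keywords` array ends up empty is still kept. The cells above appear to behave the same way, because the null check runs before the arrays are filtered and an empty array is not null.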
