%e WORDY word checks
%--------------------------------------------------------------------------------
% - weasel
% - weak
% - puffy
% - ditto
%--------------------------------------------------------------------------------
\section{Introduction}
\label{sec:introduction}
% intro structuring basing on style from https://explorationsofstyle.com/2013/01/22/introductions/
%Intro short:
% - recent developments of of A.I. and machine learnin
% - most research problems applied to image recognition, translation and in the RL space to games and robotics.
% - global warming, lots of problems
% - reinvent the energy grid, lots of changes to the structure
% - very difficult to construct such a highly complex, globally spanning, must-never-fail system
% - combine the two
%Intro long
% - energy grids of the future background research (PTac)
% - key components of such an intelligent agent (prediction, actions --> supervised learning and \ac{RL} )
% - research in supervised learning and \ac{RL} has seen huge improvements in recent years, thanks to neural network
% - agents/brokers in the field of PTac haven't been seeing much of these improvements
% - also an issue of "adopting what has been learned by previous agents (transfer learning issues)"
% -
% -
% Global warming is a key challenge of the near and medium future. Without proper action, entire continents will see
%
% Global warming, if not combated, will change the face of the planet. Billions will be impacted, entire coastlines will
% be changed and cities all over the global will have to either be retrofitted to handle sub-sea level positioning or
% abandoned and relocated. (global warming report)
%
%
% One key component to avoid such disastrous effects is the reinvention of the energy systems of the world. While
% appliances on an individual level need to become ever more efficient, globally it is necessary to shift the
% transportation sector towards renewable energy sources.
% Solar and wind
% are required. But The future of energy is difficult (--> MISQ paper argumentation line)
%
% Smart grids need decentralized intelligence where appliance level evaluation of the grid status impacts how energy is
% consumed. When such intelligence shifting is happening towards the \emph{edge} of the grid, it can be intelligent to
% introduce intermediate broker entities that mediate between the two extremes, the end-consumers and the wholesale
% market.
%
% At the same time, current developments in AI and machine learning allow for highly sophisticated learning machines that
% can help manage complex tasks and systems. (citing some sexy AI papers)
%
% Bringing these two developments together, it is intuitive to apply some of the recently developed technologies of
% \ac{AI} research to solve the coordination issues of contemporary, frankly crude energy networks.
%-------------------------------------------------------------------------------
% Done v1, looking over it again at the end
%-------------------------------------------------------------------------------
In recent years, \ac{AI} research has seen a steady rise in publications and overall interest in the field
\citep{arulkumaran2017brief, russell2016artificial}.
It has been discussed as a key future challenge for nation states and companies alike
\citep{mozur_markoff_2017, faznetchina_2018}. Researchers have produced a large corpus of work focusing on learning
from visual data such as image recognition, on audio- and text-based language processing, and on robotics. In the field of
\ac{RL}, recent breakthroughs were achieved by applying it to robotics as well as common game challenges like solving Atari games or
playing Go
\citep{arulkumaran2017brief}.
There are other important problem fields that can also benefit from these technologies, one of them being global energy markets.
These are expected to shift radically in the upcoming decades, adapting to new problems related to global warming,
distributed and alternative energy sources, a lack of intelligently coordinated systems, cybersecurity and electric vehicles \cite[p.10ff.]{mitei2011}. New problem-solving techniques are required to address such \emph{wicked problems}, because
they depend on numerous elements such as economic, social, political and technical factors \citep{ketter2015competitive}.
On a local scale, but much more prominently in day-to-day life, machines need to deliver their performance with minimal
energy requirements. Cars, fridges, water heating appliances, dishwashers and entertainment systems alike have all shown
improvements in their efficiency and this has become a key component of a customer's purchasing choice. Similarly,
large distributed IT systems as well as building management systems are adapted to make more efficient use of the energy
they consume \citep{Orgerie:2014:STI:2597757.2532637}. A key driver in this field is
the usage of intelligent systems that adapt to changing environments and demands \citep{DePaola:2014:IMS:2620784.2611779}.
On a macro scale, the problem is just as complex, albeit less salient. Electricity grids were conventionally not built
to contain \emph{energy buffers}. Electricity always needed to be produced to match demand. This is expected to change
over the coming years due to an increasing number of electric vehicles and smart appliances. Such systems can serve as
buffers
by offering their storage capacity to the grid, which is needed because decentralized
solar energy production changes the demand curve of macro-level energy supply. As an example, California is currently
facing an
oversupply of energy during sunny summer days and undersupply during peak demand hours that intersect with low levels of
solar energy production. This puts previously unseen stress on the grid systems which were constructed to deliver steady amounts
of energy from a few sources to many consumers instead of having many small producers distributed throughout the system.
Furthermore, large conventional power plants struggle to adapt quickly to changes in demand patterns
\citep{roberts_2016}.
\ac{PowerTAC}, a competitive simulation of future energy markets, attempts to solve the planning dilemma of such
complex systems. It allows researchers to experiment with numerous scenarios and participant designs. By adapting system
parameters, robust system designs are developed to incentivize participants to align with the overall
interests. The interaction of a variety of market participants using different technologies to automatically generate
profit is explored in a competitive game environment. Researchers are invited to participate in this simulation by
supplying usage models for appliances and developing \emph{brokers} that participate in the game. Brokers trade energy,
offer contracts and coordinate storage capacities within their own customer network as well as with the overall market.
The simulation offers opportunities for several fields of research: game design, energy demand forecasting,
intelligent contract design, commodity trading and general simulation and software design
\citep{ketter2015competitive, ketter2018powertac}.
The competition has been organized for several years and brokers can be developed by anyone. This means that some
broker developers have years of experience while others have not participated in a single competition. Previous researchers have
identified the problem as a \ac{POMDP}, a common model in the \ac{RL} literature \citep{tactexurieli2016mdp}. Deep neural network
architectures have proven to be successful in solving games in a variety of instances. It is intuitive to
attempt to apply such architectures to the problems posed by the \ac{PowerTAC} simulation. Unfortunately, most of the
implementations are only available in Python \citep{baselines, plappert2016kerasrl, schaarschmidt2017tensorforce} and
\ac{PowerTAC} is almost exclusively based on Java. An extension of the current communication protocols to other
languages may increase the reach of the simulation and motivate newcomers to join the competition with
their Python-based neural network architectures.
Finally, a sub field of \ac{RL} research has identified a problem in the transfer of knowledge from previously trained
networks to newly developed iterations. Because neural networks are mostly black boxes to researchers
\citep{yosinski2015understanding}, it is difficult to extract knowledge and to transfer this to another architecture. The
learned weights of a neural network can not be easily transferred between models, especially when architectures fundamentally
differ in their hyperparameters. The field of transfer learning has shown new approaches for solving this problem.
Agents with access to previously developed models may pass their observations to the \emph{teacher agent} and initially
attempt to align their decisions with those that their teacher would make \citep{schmitt2018kickstarting}. High-level
problem-solving agents may be trained by first training several small, narrow-focus agent networks on sub-problems and
then applying transfer learning to transfer their knowledge to the generic high-level
agent \citep{parisotto2015actor}. For problems where a reward function is difficult to construct, \emph{inverse
reinforcement learning} can be used to train an agent to behave similarly to an observable expert. The policy function of
the agent shows good performance despite lacking a specific reward function \citep{NG2004Apprentice}.
In summary, neural networks are an interesting technology to solve complex problems and energy markets stand to benefit from
their usage. To ensure beneficial results, \ac{PowerTAC} simulates complex energy markets before implementing them in the real
world. \ac{PowerTAC} focuses on brokers as intermediaries between end consumers and wholesale markets to reduce
complexity and to decentralize decision making, which also aids resilience. To allow current and future teams competing in the \ac{PowerTAC}
competition to easily deploy neural network technologies and to allow new brokers in the \ac{PowerTAC} competition to quickly
catch up to previously developed competitor brokers, it may be beneficial to extend the technology scope of the
competition and enable learning transfer methods and their underlying deep architectures for the problem scope of
\ac{PowerTAC}. Therefore, the research question for this work is as follows:
\emph{Can deep reinforcement learning agents learn from actions of other agents in the \ac{PowerTAC} environment? If so,
how? Can imitation allow for boosted performance of reinforcement learning algorithms within a competitive simulation
environment?}
\subsection{Methodology}
\label{sec:methodology}
To answer these questions, substantial groundwork is required. First, the competition needs to be able to interface
with the technologies required by modern neural network frameworks. Then, a problem mapping needs to occur to map the
\ac{PowerTAC} problems to a structure that frameworks and libraries can work with. Finally, current research methods for
learning transfer need to be applied to the \ac{PowerTAC} environment. Consequently, this work follows a practical
software development approach, and this document also serves as a documentation to accompany the implementation. The
practical part of the thesis consists of the concepts for integrating \ac{PowerTAC} and the libraries and tools of
modern \ac{RL} research, then implementing all the necessary components to bridge the gap. The implemented source code is
covered by unit tests where appropriate to ensure the correct functioning of the components. Once all components are
developed to connect the two areas, machine learning approaches need to be applied to create solutions for the demand
prediction problem and the wholesale trading problem.
This document is structured as follows: first, the literature of the fields of \ac{AI}, \ac{RL} and the \ac{PowerTAC}
competitive simulation for energy markets is reviewed. In the field of \ac{AI}, its subfields of supervised and
unsupervised learning are introduced. Here, the focus is placed on the area of neural networks and a way to let them
learn through Backpropagation. In the field of \ac{RL} the focus is on the \ac{MDP} framework. Next follows an
introduction of the recent research concerning the use of neural networks in \ac{RL} settings to allow the so-called
\ac{DeepRL}. For \ac{PowerTAC}, its concepts and how agents (often called brokers in the context of \ac{PowerTAC}) make
decisions are analyzed, including an analysis of previous agents' solution approaches.
%After having introduced the basic research of \ac{AI} and \ac{RL}, I will summarize the state of research of Animal
%Cognition, which focuses on how animals and humans learn, act and remember in their environment. Since humans and
%animals are the only known form of intelligent life to us as of today, it is intuitive why exploring the exact workings
%of these examples might help in better understanding how to artificially create intelligence. It is also a basis of the
%thesis, as many animals show forms of social learning, concepts of teaching and learning through observation.
Following the theoretical background, an analysis of a number of observation based learning approaches is performed.
Three alternative approaches for learning from other agents or historical data are discussed conceptually.
In the practical part, the main technologies used are briefly explained. Then follows a summary of the implementation
of two important decision areas, wholesale trading and demand prediction. Both implementations demonstrate
the ability to use current research results from the neural network and \ac{RL} research community to apply them to the
\ac{PowerTAC} problem set.
Finally, a conclusion is drawn together with a discussion of the limits and weaknesses as well as recommended further research.
%This chapter will introduce the two underlying research fields, \ac{AI} and the \ac{PowerTAC} simulation. The broad
%field of \ac{AI} will be separated into three sections: an introduction to the fields of \ac{AI}, neural networks and \ac{RL}.
%\ac{PowerTAC} will be discussed by introducing it, comparing it to similar work and analyzing its
%components and some dominant past broker implementations.
\section{Artificial Intelligence}%
\label{sec:artificial_intelligence}
%-------------------------------------------------------------------------------
% done unless professor adds notes
%-------------------------------------------------------------------------------
The field of \ac{AI} is both old and yet quite contemporary.
With the advent of computers around the middle of the 20th century, research started to aim for artificial
intelligence. Generally, defining \ac{AI} in a single sentence is hard. \citet{russell2016artificial} structures
historical definitions along two dimensions: the degree to which a system \emph{thinks} \texttt{XOR} \emph{behaves} in a
\emph{human} \texttt{XOR} \emph{rational} manner. These four directions are all pursued by researchers. For
this thesis, the goal of \emph{behaving rationally} is the most appropriate subfield of research within the larger field of
\ac{AI}.
Today, some 70 years later, \ac{AI} is again extensively discussed by both researchers and mainstream media
\citep[p.24ff.]{russell2016artificial, arulkumaran2017brief}. The reasons for this are diverse but it can be argued that
the combination of easily available computing power through cloud computing and advances in the mathematical
underpinnings have allowed for fast-paced advances in recent years. Also, the currently popular neural network
architectures often require large amounts of data to learn which have lately been readily available for companies and
researchers through the adoption of online technologies by the majority of the population
\citep[p.27]{russell2016artificial}.
\subsection{Learning}
\label{sec:learning}
According to \citet{russell2016artificial}, learning agents are those that \emph{improve their performance on future
tasks after making observations about the world} \cite[p.693]{russell2016artificial}. Among living animals, learning
behavior is present in many species. The general goal of \ac{AI} research is to imitate these
skills to dynamically adapt
to unforeseen environments. A learning algorithm does not require its creator to anticipate every
potential variant of the environment that the agent will be confronted with, yet it can still produce a successfully
acting agent. The cognitive sciences define learning as a change of state due to experiences and often limit the
recognition of learning to some externally observable behavior \cite[p.96f.]{cognition1999}. This applies to all known
species and the same definition can easily be applied to a learning artificial agent. A learning agent that doesn't
change its behavior is not helpful and an agent that doesn't change its internal state can hardly have learned
something.
In \ac{AI} research, a \emph{loss function} is commonly used as a measure of learning progress. Loss functions describe
the difference between the utility of the correct actions and the results of the agent's learned actions. The
exact loss function might be a mean squared error function or an absolute loss depending on the learning algorithm that
is used or whether the researcher intends to emphasize large deviations from the target \cite[p.710]{russell2016artificial}.
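
As a small illustration, the two loss functions mentioned above can be computed as in the sketch below; the target and prediction vectors are invented example values, not data from this work.

\begin{verbatim}
# Mean squared error vs. mean absolute error as measures of learning progress.
# Targets and predictions are invented example values.
import numpy as np

y_true = np.array([1.0, 0.0, 2.5])      # utilities of the correct actions
y_pred = np.array([0.8, 0.4, 1.5])      # results of the agent's learned actions

mse = np.mean((y_true - y_pred) ** 2)   # emphasizes large deviations from the target
mae = np.mean(np.abs(y_true - y_pred))  # treats all deviations linearly
print(mse, mae)
\end{verbatim}
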
Computational learning theory looks at different problems of learning: how to learn through a large number of
examples, the effects of learning when the agent already knows something, how to learn without examples, how to learn
through feedback from the environment and how to learn if the origin of the feedback is not deterministic
\citep{russell2016artificial}. In this work, two of those problems are of special interest: the ability to learn from
previously labeled examples and the ability to learn through feedback from the environment. The former is called
supervised learning and the latter is referred to as \acl{RL}. To understand the difference, it is also important to
understand algorithms that don't have access to labels of existing data, yet are still able to derive value from the
information. These belong to the class of unsupervised learning. Although this class is not heavily relied upon in the
later implementation of the agent, it is crucial for machine learning tasks such as data exploration or anomaly
recognition.
The following sections will describe both supervised learning and unsupervised learning and Section~\ref{sec:neural_networks} will introduce
an architecture that can be used as the learning function in these learning problems. Finally,
Section~\ref{sec:Backpropagation} will explain how exactly neural networks learn.
\subsubsection{Supervised Learning}
%-------------------------------------------------------------------------------
% done unless professor adds notes
%-------------------------------------------------------------------------------
As noted above, supervised learning uses labeled examples to learn to recognize future examples that might be of the
same kind but not identical \cite[p.695]{russell2016artificial}. Common examples of this form of learning include object recognition in images or
time-series prediction. One of the best-known examples to date is the ImageNet classification network by
\citet{krizhevsky2012imagenet}, which was one of the first neural-network-based algorithms to break the classification
record on a popular image classification benchmark. The goal is to correctly classify images according to a set of defined
labels. If a picture of a dog is read by the neural network, it needs to be able to classify the fact that a dog is in the
picture. In trading environments it can be helpful to be able to predict future patterns based on current and previous
observations. In the space of machinery, learning to recognize sensor data that indicates faulty parts can be used to
avoid down-time of machines through preemptive replacement during scheduled service intervals \citep{rudin2012machine}. In the online marketing
industry, recognizing user interests in order to serve appropriate ads benefits from the approach, just as spam filters do
that recognize unwanted messages and filter them out \citep{domingos2012few}.
The general problem of supervised learning is as follows:
\begin{enumerate}
\item Generation of a \emph{training set} that holds a set of input-output pairs \\ $(x_1,y_1),(x_2,y_2),\ldots$
\item Training of algorithm against training set
\item Verification of results against a previously unseen \emph{test set}
\end{enumerate}
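
To make the three steps above concrete, the following sketch shows a minimal supervised learning workflow; it assumes the scikit-learn library and a synthetic dataset, so the data and model choice are purely illustrative.

\begin{verbatim}
# A minimal supervised learning workflow (sketch, assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Generate a training set of input-output pairs (x_i, y_i).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# 2. Train an algorithm (here: logistic regression) against the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 3. Verify the results against the previously unseen test set.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
\end{verbatim}
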
If $y$ can only take one of a given set of values, the problem is a \emph{classification} problem; if the problem requires the
prediction of one of a potentially infinite number of alternatives (e.g.\ a real number between 1 and 10), it is a
\emph{regression} problem. The outputs $y_n$, or labels, are created by an underlying true function $f$ which the
algorithm tries to learn or approximate through a function $h$, the hypothesis. The space of hypotheses is infinitely
large and the general principle, called \emph{Ockham's razor}, is that simpler hypotheses performing equally well as more complex
ones are to be preferred. By restricting the hypothesis space up-front (e.g.\ to all linear functions), the best hypothesis
might not be able to perfectly match the underlying true function $f$. On the other hand, a hypothesis chosen from such
a restricted hypothesis space may generalize well and is easier to understand and implement.
The tradeoff described above is a key factor when deciding on the \emph{right} function to use to solve a supervised
learning problem. A linear regression model is easier to understand than complex convoluted functions and neural
networks have
often been described as hard to interpret as it is not clear \emph{what} they learn. Systems such as decision trees,
which make many sequential decisions about features of the input in question to arrive at a classification, are easy to
interpret and might be more appropriate when not only the performance of the system is important but also the
inner workings of it \citep[p.696ff.]{russell2016artificial}.
\subsubsection{Unsupervised Learning}
%-------------------------------------------------------------------------------
% done unless professor adds notes
%-------------------------------------------------------------------------------
Unsupervised learning differs in one key respect: the data set that is used to learn from lacks labels or
classifications. In other words, there are no examples to indicate what it is that needs to be
learned. Common examples of unsupervised learning are \emph{clustering} or \emph{principal components analysis}. The
overall goal of unsupervised learning is not to predict but to learn information about the underlying distributions and
reasons as to why the values have been measured in a certain way. Unsupervised learning is also used during pre-processing
data for supervised learning problems to improve the later results of the regression or classification problems
\cite[p.373f.]{james2013introduction}.
Additional features can be constructed from results of unsupervised learning such as distances to cluster centers. These
additional features may then also be fed to the learning algorithm. Doing so is risky, as it may introduce
implicit biases of the data analyst.
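
A minimal sketch of this feature-construction idea is given below; it assumes scikit-learn and synthetic, unlabeled data, so the cluster count and data are illustrative only.

\begin{verbatim}
# Sketch: deriving additional features from unsupervised learning
# (assumes scikit-learn; the data is synthetic and purely illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))          # unlabeled observations

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
distances = kmeans.transform(X)        # distance of each sample to every cluster center

# The distances could be appended as extra columns for a downstream supervised model,
# keeping in mind that this may introduce the analyst's implicit biases.
X_augmented = np.hstack([X, distances])
\end{verbatim}
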
\subsection{Neural Networks}%
\label{sec:neural_networks}
%-------------------------------------------------------------------------------
% done unless professor adds notes
%-------------------------------------------------------------------------------
\begin{figure}[b]
\centering
\includegraphics[width=0.8\linewidth]{img/perceptron.png}
\caption{Model of the perceptron, taken from \citep{russell2016artificial}.}
\label{fig:perceptron}
\end{figure}
Neural networks are a technology used to approach both supervised and unsupervised learning problems. The original
concept can be dated as far back as 1943 \cite[p.727]{russell2016artificial} and the mathematical description of a
neuron is a linear combination of many input variables $a_i$ and their weights $w_i$. If the linear combination of the
input variables exceeds a threshold, defined by an activation function $g$, the neuron activates or \emph{fires}. When
the activation function $g$ produces a binary value, the unit is called a \emph{perceptron}. Real-valued outputs are possible
through a different, non-binary activation function. These are often logistic functions and the resulting unit is
sometimes called a sigmoid perceptron \cite[p.729]{russell2016artificial}. A visual model of this unit
is given in Figure~\ref{fig:perceptron}.
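
As a brief illustration of this definition, the following sketch computes the output of a single sigmoid perceptron with NumPy; the weights, inputs and bias are arbitrary example values.

\begin{verbatim}
# Sketch of a single (sigmoid) perceptron: a weighted sum of inputs passed
# through an activation function g. Weights and inputs are arbitrary examples.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.array([0.5, -1.2, 3.0])   # input activations a_i
w = np.array([0.8,  0.1, 0.4])   # weights w_i
bias = -0.5                      # threshold modeled as a bias weight

z = np.dot(w, a) + bias          # linear combination of the inputs
output = sigmoid(z)              # soft activation; a hard step would yield a classic perceptron
print(output)
\end{verbatim}
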
A neural network is a collection of such neuron components, often layered. The properties of the neurons as well
as the overall network properties are called \emph{hyperparameters} and describe the overall architecture of the neural network.
A common architecture is the \emph{feed-forward network} which holds several layers of sets of neurons. Each set has no
connections within itself, but its activation outputs are fed into the next layer's neurons. It is a directed
acyclic graph. Other than the weights, this network has no internal state and cannot hold information about
the input in some form of memory. An alternative is the \emph{\acl{RNN}}, which includes loops and can
hold state. The former network is often used for image classification problems while the latter is used for
time-series analysis and natural language processing.
When looking at neural networks one important decision is the number of layers. In fact, the history of neural networks has shown
three key phases of progress, the first phase which included simple single-layer networks, the second which included one
\emph{hidden layer} and the third phase, today, which uses networks that benefit from several hidden layers. A hidden
layer is a number of neurons between the input layer and the output layer. This allows the network to generate complex
input-output relationships. Such a multi-layer network is conceptualized in Figure~\ref{fig:multilayernn}. Each layer
$h^n$ feeds into the next until the output layer is reached \citep[p.729ff.]{russell2016artificial}.
\begin{figure}[]
\centering
\includegraphics[width=0.3\linewidth]{img/multilayer_nn.png}
\caption{Multi-layer neural network from \citep{bengio2009learning} }
\label{fig:multilayernn}
\end{figure}
Neural networks can represent complex non-linear and discontinuous functions
\cite[p.732]{russell2016artificial}, even with a modest number of layers or neurons. Such multi-layer, or \emph{deep}, networks however long
suffered from a major issue: it was unclear how to train them, i.e.\ how to make them learn. The next section describes
a solution to this problem.
\subsubsection{Learning Neural Networks and Backpropagation}
\label{sec:Backpropagation}
%-------------------------------------------------------------------------------
% done unless professor adds notes
%-------------------------------------------------------------------------------
The previous sections have described learning in respect to the goal of the learning process and the input data that is
used to learn from. This section explains the forms of learning and focuses on one widely used form called \emph{backpropagation}.
When looking at neural networks while remembering the earlier definition of learning, it becomes clear that there are many
ways a neural network can change its state. It could:
\begin{enumerate}
\item develop new connections
\item remove existing connections
\item change the connecting weights
\item change the threshold values of the activation functions
\item change its input function
\item develop new neurons
\item remove existing neurons \cite[p.60]{kriesel2007brief}
\end{enumerate}
\noindent Of these many actions, changing the weights is the most common way to let a neural network learn. This is because many
of the other changes in its state can be performed by a specific way of changing the weights. Removing connections is
equivalent to setting the weight of the connection to 0 and forbidding further adaption afterwards. Equally, adding new
connections is the same as setting a weight of 0 to something that is not 0. Changing the threshold values can also be
achieved by modeling them as weights. Changing the input function is uncommon. The addition and removal of neurons
(i.e.\ the growing or shrinking of the network itself) is a popular field of research but will not be discussed further
\cite[p.60]{kriesel2007brief}.
Learning by changing the weights covers a wide range of possible adaptations of the network structure. When
looking at a single (sigmoid) perceptron, adjusting the weights of its input values follows the same process as
gradient descent. Because the activation function is most often \emph{soft} to ensure
differentiability, and because a hard threshold creates a non-continuous function, the process of fitting the weights to
minimize the loss is called logistic regression \cite[p.729f.]{russell2016artificial}. For a detailed explanation of the gradient
descent approach, refer to the works of \citet{russell2016artificial} as well as
\citet{Goodfellow-et-al-2016}.
%logistic regression + gradient descent on a unit based view
The concept of learning from labeled examples described above is intuitive for a single-layer neural network. The output can be
directly compared to the labels provided by the training set and logistic regression can be applied to correct the weights of
the network to reduce the loss. It becomes problematic though, when several layers are inserted between the input and
the output. The weights of the hidden layers are not included in the labeled examples. This is where the concept of
backpropagation becomes useful. In Figure~\ref{fig:multilayernn}, any error in the weights of the neurons in
layer $h^1$ influences the outputs of layers $h^2$ and $h^3$ (in the case of fully connected layers).
However, for any additive loss function (such as $L_2$) the error is simply the sum of the gradients of the losses of
the outputs \cite[p.733f.]{russell2016artificial}. For a given $L_2$ loss it is
\begin{equation}
\frac{\partial}{\partial w} Loss(w) = \frac{\partial}{\partial w} \vert y-h_w(x) \vert ^2 = \frac{\partial}{\partial w} \sum_k{(y_k - a_k)^2} = \sum_k{\frac{\partial}{\partial w}(y_k - a_k)^2}
\label{equ:errorssum}
\end{equation}
\noindent where $w$ is the weight of the target neuron, $y$ the target value and $k$ the index of the nodes in the
output layer \cite[p.733f.]{russell2016artificial}. This alone does not
resolve the issue that the training set does not include target values for the hidden layers; that problem is addressed by
back-propagating the error values through the network. Each preceding hidden neuron is considered to be partially
responsible for a downstream error in proportion to the weight connecting it to the target neuron.
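
The following sketch illustrates this idea for a tiny fully connected network with one hidden layer, trained on a single input-output pair with an $L_2$ loss; the layer sizes, data and learning rate are arbitrary illustrative assumptions, not values used in the later implementation.

\begin{verbatim}
# Sketch of one backpropagation step for a tiny network with one hidden layer.
# Layer sizes, data and learning rate are arbitrary illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # single input example
y = np.array([1.0, 0.0])               # its label (target output)
W1 = rng.normal(size=(4, 3)) * 0.1     # input -> hidden weights
W2 = rng.normal(size=(2, 4)) * 0.1     # hidden -> output weights
alpha = 0.1                            # learning rate

# Forward pass.
h = sigmoid(W1 @ x)                    # hidden activations
out = sigmoid(W2 @ h)                  # network output

# Backward pass: propagate the L2 error from the output layer towards the input.
delta_out = (out - y) * out * (1 - out)        # error signal at the output layer
delta_h = (W2.T @ delta_out) * h * (1 - h)     # each hidden unit's share of the error

# Gradient descent update on both weight matrices.
W2 -= alpha * np.outer(delta_out, h)
W1 -= alpha * np.outer(delta_h, x)
\end{verbatim}
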
\subsubsection{Recurrent Neural Networks}%
\label{sec:recurrent_neural_networks}
As already noted in the previous section, neural networks can be both acyclic and cyclic graphs. The
\emph{vanilla} neural network is usually considered to be an acyclic feed-forward network, as it has no internal state
and is better suited to illustrate the basic concepts of how such networks operate. Especially in translation and text-to-speech
applications, \ac{RNN} models are popular as they are able to act on previously seen information in a sequence of
data. Generally, they are suitable for many applications where the data has a time-dependent structure
\cite[p.373]{Goodfellow-et-al-2016}.
A \ac{RNN} computes its output based on the weights $w_i$, commonly denoted $\theta$, its current input
$x^t$ and its previous hidden units' internal states $h^{t-1}$.
\begin{equation}
h^t = f(h^{t-1}, x^t, \theta)
\end{equation}
\noindent The network generally learns to use $h^t$ to encode previously seen aspects relevant to the current task, although this
is inherently lossy as the number of previous inputs (i.e.\ $t-1$) is arbitrary. Figure~\ref{fig:rnn_concept}
shows this concept.
\begin{figure}[]
\centering
\includegraphics[width=0.8\linewidth]{img/rnn_concept.png}
\caption[Recurrent Neural Network conceptualized]{Recurrent neural network conceptualized. \emph{Left}: circuit diagram where the black square represents a
1 time slot delay. \emph{Right:} The same network unfolded where each node represents a particular time instance.
Taken from \citet{Goodfellow-et-al-2016}.}
\label{fig:rnn_concept}
\end{figure}
The network structure has two benefits: firstly, it allows for arbitrary sequence length, as the network size is
dependent on the time slot specific input and not on the number of previous time slots. Secondly, the same network with
the same weights (or in mathematical terms the same transition function $f$) can be used during each time slot. This
means: when a \ac{RNN} is fed a sequence of data, the weights will stay the same throughout the sequence. They can be
updated after the entire sequence has been processed.
Such recurrent systems, while theoretically able to hold information across inputs, suffer from an issue called the
\emph{vanishing gradient problem}. A network that sequentially processes 20 samples cannot easily hold useful
information within its state from the very beginning of the sequence and then act upon it later. This is a common
problem in translation: sentences often have structures where the first word influences the meaning of the final one.
The network processes one word at a time, diluting the information representing the first word because it
is overlaid with noise from the other (potentially irrelevant) words. \citet{Hochreiter:1997:LSM:1246443.1246450}
developed the \ac{LSTM} model to solve this problem. Each unit in the network is actually a group of gates that act in
harmony to store information in a recurrent cell. \ac{LSTM} implementations differ between libraries, but they generally
follow the same core concepts. Modern TensorFlow-based implementations offer \ac{GPU} acceleration which significantly
increases performance and allows for distributed calculation.
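
As a rough illustration, a minimal \ac{LSTM} model for a sequence task could be defined with the TensorFlow Keras API as sketched below; the sequence length, feature count and layer sizes are arbitrary assumptions, not values prescribed by this work.

\begin{verbatim}
# Minimal sketch of an LSTM-based sequence model using the TensorFlow Keras API.
# Sequence length, feature count and layer sizes are arbitrary illustrative choices.
import tensorflow as tf

inputs = tf.keras.Input(shape=(24, 8))      # 24 time steps, 8 features per step
x = tf.keras.layers.LSTM(32)(inputs)        # recurrent layer keeping state across the sequence
outputs = tf.keras.layers.Dense(1)(x)       # e.g. one regression output per sequence

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
\end{verbatim}
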
\subsection{Reinforcement Learning}
\label{sub:rl}
The previous sections have introduced concepts of supervised learning, neural networks, backpropagation and recurrent
neural networks for time-embedded
learning tasks. \ac{RL} can be described as lying between supervised and unsupervised learning, and
Deep \ac{RL} is the use of neural networks, especially those with many layers, to perform \ac{RL}.
On the one hand \ac{RL} does not require large amounts of labeled data to enable successful systems which is
beneficial for areas where such data is either expensive to acquire or difficult to clearly label. On the other hand it
requires some form of feedback. Generally, \ac{RL} \emph{agents} use feedback received from an \emph{environment}. The
general principle of \ac{RL} includes an agent and the environment where it performs actions. The function
that determines the action $a$ taken by the agent in a given state $s$ is called its policy, usually represented by
$\pi$. The environment reacts to the actions of the agent by returning new states $s'$ which are evaluated and a
corresponding reward $r$ is given to the agent. The reward gives the agent information about how well it performed
\citep[p.830f.]{russell2016artificial}.
This section will first introduce the concepts of a \ac{MDP}, then introduce different concepts of \ac{RL} agents,
describe approaches to encourage exploration of its options and finally describe how neural networks can be used to create
state-of-the-art agents that can solve complex tasks. The majority of Section~\ref{sub:rl} is based on
chapters 17 and 21 of \citet[]{russell2016artificial} unless it is marked otherwise.
\subsubsection{Markovian Decision Processes}%
\label{ssub:markovian_decision_processes}
A common model describing the conceptual process of states and actions followed by new states and new actions of an
agent and its environment is called a \acf {MDP}. In fact, \ac{RL} is an approach to optimally solve \ac{MDP} problems.
A \ac{MDP} is usually defined by the following components:
\begin{itemize}
\item $\mathcal{A}$: Finite set of allowed actions
\item $\mathcal{S}$: Finite set of states
\item $P(s' \mid s,a) \forall s \in \mathcal{S}, a \in \mathcal{A}$: Probability of transitioning from state
$s$ to state $s'$ when action $a$ is taken
\item $\gamma$: Discount factor for each time slot, discounting future rewards to allow for long-term and
short-term focus
\item $R(s)$: Reward function that defines the reward received for transitioning into state $s$
\end{itemize}
To solve an \ac{MDP}, an agent needs to be equipped with a policy $\pi$ that allows for corresponding actions to each
of the states. The type of policy can further be distinguished between \emph{stationary} and \emph{non-stationary}
policies. The former type refers to policies that recommend the same action for the same state independent of the
time step. The latter describes policies for finite-horizon problems, where an
agent might act differently once time becomes scarce. However, infinite-horizon \ac{MDP} problems can also have
terminal states which conceptually mean that the process has ended.
A more complex form of \ac{MDP} is the \ac{POMDP} which involves agents basing their actions on a belief of the
current state. However, as the later practical application to \ac{PowerTAC} can be mapped to a \ac{MDP}
where the transition probability implicitly represents the partial observability \citep{tactexurieli2016mdp}, this will not be discussed.
\subsubsection{Bellman Equation}%
\label{ssub:bellman_equation}
The Bellman Equation offers a way to describe the utility of each state in an \ac{MDP}. For this, it defines the
utility of a state as the reward for the current state plus the sum of all future rewards discounted by $\gamma$.
\begin{equation}
U(s) = R(s) + \gamma \max_{a\in\mathcal{A}(s)} \sum_{s'}{P(s' \mid s,a)U(s')}
\end{equation}
\noindent In the above equation, the \emph{max} operation selects the optimal action among all possible actions. The
Bellman equation explicitly targets \emph{discrete} state spaces. If the state transition graph is cyclic,
solving the Bellman equations requires solving a system of equations, because $U(s')$ may depend on $U(s)$
and vice versa. Further, the \emph{max} operator introduces non-linearity, which quickly becomes intractable for
large state spaces. An iterative approach called \emph{Value Iteration} is a valid alternative.
%In a discrete action space this would be a selection over all possible actions, in a continuous action space it however
%can become more complex. neural network based \ac{RL} agents simply invoke their policy network to retrieve the action which
%the agent believes it the one with the highest utility \citep{mnih2013playing}.
%As an example, an agent active in the environment of playing Super Mario may receive rewards corresponding to the game
%score. It may perform all valid moves permitted by the game and the goal is to improve its score.
%To train an agent, the task is usually performed several times and the environment is reset after each iteration to
%allow for a new learning step.
%When thinking about such an agent, it becomes obvious that without some explicit incentive to explore new alternatives,
%it may be contempt with whatever success it achieves and then always perform the same action. To avoid this, the agent
%can either be forced to try new alternative actions (through forcing random actions in a certain percentage of cases) or
%through explicit rewards for random actions.
%There are several forms of learning to solve a \ac{MDP} using \ac{RL} but for the purpose of this thesis I will focus
%on explaining how policy search algorithms work. Their goal is to find a good policy $\pi$ given a certain state $s$.
%Alternatives to this approach are concepts that try to learn expected future reward for future possible states, for
%available actions and others.
\subsubsection{Value and Policy Iteration}%
\label{sub:policy_and_value_iteration}
Value Iteration uses the Bellman equation to iteratively converge towards a correct estimation of each state's utility,
assuming both the transition function $P(s' \mid s,a) \forall s \in \mathcal{S}$ and the reward function $R(s)$ are
known to the agent.
In the algorithm, the utility of each state is updated based on the \emph{Bellman update} rule:
\begin{equation}
U_{i+1}(s) \gets R(s) + \gamma \max_{a \in \mathcal{A}(s)} \sum_{s'}{P(s' \mid s,a) U_i(s')}
\end{equation}
This needs to be performed for \emph{each} state during \emph{each} iteration. It is clear how quickly this becomes
intractable when $\gamma$ is reasonably close to 1, meaning that long-term rewards are also taken into
consideration.
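
A minimal sketch of Value Iteration on a toy \ac{MDP} is shown below; the states, actions, transition probabilities and rewards are invented purely for illustration and are not part of the \ac{PowerTAC} problem.

\begin{verbatim}
# Value Iteration sketch on a toy MDP with invented states, actions and dynamics.
# P[s][a] is a list of (probability, next_state) pairs, R[s] the reward for state s.
states = ["low", "high"]
actions = ["wait", "charge"]
P = {
    "low":  {"wait": [(1.0, "low")],
             "charge": [(0.8, "high"), (0.2, "low")]},
    "high": {"wait": [(0.6, "high"), (0.4, "low")],
             "charge": [(1.0, "high")]},
}
R = {"low": -1.0, "high": 2.0}
gamma = 0.9

U = {s: 0.0 for s in states}
for _ in range(100):            # fixed number of sweeps instead of a convergence test
    U = {
        s: R[s] + gamma * max(
            sum(p * U[s2] for p, s2 in P[s][a]) for a in actions
        )
        for s in states
    }
print(U)
\end{verbatim}
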
However, the agent doesn't care much about the values of various states. It cares about making the right
decisions, using the value of states as a basis for doing so. It is often observed that the policy $\pi$ converges far
sooner than the utility estimates $U(s)$. This is the basis for the \emph{Policy Iteration} approach which alternates
between:
\begin{enumerate}
\item evaluating the current policy $\pi_i$ by calculating $U_i=U^{\pi_i}$, the value of each state if $\pi$ is
executed and
\item improving the policy using one-step look-ahead based on $U_i$
\end{enumerate}
\noindent This process stops when the policy no longer shows any significant improvement with respect to its loss value. It is
generally also not necessary to always apply the above operations to \emph{every} state. Instead, state values and
policies can be updated only with respect to newly discovered knowledge regarding specific states or specific actions.
This is called \emph{asynchronous policy iteration}.
Both variants require the transition function and the reward function to be known to the agent. \ac{RL} research has
developed several methods that adapt the concepts of the two iteration algorithms for environments with the two unknown
functions. They are explained in the next sections.
%It allows for many interesting concepts to be used such as distributed learning, using multiple agents to explore an
%environment and learn from all their observations centrally, and
\subsubsection{Temporal Difference Learning}%
\label{sub:temporal_difference_learning}
When the underlying transition function is not known, but the agent has the ability to perform many trial runs in the
environment, an empirical approach can be adopted. To this end, the agent performs a number of trials where it acts
according to a (fixed) policy and observes the rewards it receives. Each sequence of alternating actions and observations
is called a trial.
The update rule for the utility of each state is
\begin{equation}
U^\pi(s) \gets U^\pi(s) + \alpha(R(s) + \gamma U^\pi(s') - U^\pi(s))
\end{equation}
where $\alpha$ is the learning rate and $U^\pi$ the utility under the execution of $\pi(s)$ in state $s$. This only
updates the utilities based on the observed transitions so if the unknown transition function sometimes leads to
extremely negative rewards through rare transitions, this is unlikely to be captured. However, with sufficiently many
trials, these rare probabilities will be adequately represented in the utilities of the states. A gradual reduction
of $\alpha$ leads to convergence on the correct values.
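
The update rule translates directly into code; the sketch below applies it to a stream of observed transitions, with the transition data and hyperparameters invented for illustration.

\begin{verbatim}
# Temporal Difference (TD(0)) utility updates from observed transitions.
# The trajectory, rewards and hyperparameters are invented for illustration.
alpha = 0.1      # learning rate
gamma = 0.9      # discount factor

U = {}           # utility estimates per state, filled lazily
# Observed (state, reward, next_state) transitions from trials under a fixed policy.
trajectory = [("s0", -1.0, "s1"), ("s1", 0.0, "s2"), ("s2", 5.0, "s0")]

for s, r, s_next in trajectory:
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
print(U)
\end{verbatim}
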
\subsubsection{Exploration}%
\label{sub:exploration}
The above learning approach has one weakness: it is only based on observed utilities. If $\pi$ follows the pattern of
always choosing the action that leads to the highest expected $U_{i+1}$, i.e.\
\begin{equation}
\pi(s) = \arg\max_{a \in \mathcal{A}(s)} \sum_{s'}{P(s' \mid s, a)U(s')}
\end{equation}
then it will never explore possible alternatives and will very quickly get stuck on a rigid action
pattern mapping each state to a resulting action. To avoid this, the concept of \emph{exploration} has been introduced.
There are many approaches to encourage exploration. The simplest way is to define a factor $\epsilon$ which defines the
probability of choosing a random action at each step.
A more advanced variant is to add a term to the loss function that corresponds to the negative entropy of the policy, $-\beta
H(\pi(a \mid s))$, where $H$ measures the entropy of the policy's action distribution. This encourages randomness in the policy but
permits the policy function to determine how this randomness occurs \citep{schmitt2018kickstarting}. This
entropy-based loss also regulates itself automatically: when the agent is not at all able to choose rewarding actions,
it reduces its loss through high-entropy choices, i.e.\ lots of exploration. Once the agent finds actions for certain
states that lead to high rewards, the loss incurred by choosing other, random actions outweighs their entropy benefit.
Therefore, the policy becomes less random and its entropy decreases. If $\beta$ is progressively lowered, the impact of the entropy term on the loss
is progressively reduced, allowing the agent to continue improving its loss despite less exploration.
Another alternative is the positive weighting of actions in states that have not been tried yet, essentially giving such
actions an optimistic prior as if they promise higher rewards than the already explored regions. This is easy to
implement for small, discrete state and action spaces but more complex for continuous spaces.
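
The simplest of these schemes, $\epsilon$-greedy action selection, can be sketched in a few lines; the Q-value table and the value of $\epsilon$ below are illustrative assumptions.

\begin{verbatim}
# Epsilon-greedy action selection: with probability epsilon pick a random action,
# otherwise pick the action with the highest estimated value. Values are illustrative.
import random

epsilon = 0.1
q_values = {"wait": 0.5, "charge": 1.2, "discharge": -0.3}   # Q(s, a) for the current state

def select_action(q_values, epsilon):
    if random.random() < epsilon:
        return random.choice(list(q_values))                 # explore
    return max(q_values, key=q_values.get)                   # exploit

print(select_action(q_values, epsilon))
\end{verbatim}
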
% Q Learning and Deep-Q learning
% Policy Gradient
\subsubsection{Q Learning}%
\label{sub:q_learning}
Section~\ref{sub:policy_and_value_iteration} already described how to learn the values of states, given an
action. This action can also be derived from a policy function. When an agent wants to learn its policy
(i.e.\ learn what a good policy is), it becomes problematic if the transition function is not known. An alternative
model is called \emph{Q-Learning} which is a form of Temporal Difference Learning. It learns an action-utility value
instead of state values. The relationship between the \emph{Q-Value} and the previously introduced state value is
\begin{equation}
U(s) = \max_{a}Q(s,a)
\end{equation}
so the value of a state is that of its highest Q-Value. This approach is beneficial because it does not require a model
of how the world works; it is therefore called a \emph{model-free} method. The update rule for the Q-Values is
the Bellman equation with $U(s)$ and $U(s')$ replaced by $Q(s,a)$ and $Q(s',a')$ respectively.
The update rules for the Q-Value approach are related to the Temporal Difference Learning rules but include a $\max$
operator
\begin{equation}
Q(s,a) \gets Q(s,a) + \alpha(R(s) + \gamma \max_{a'}Q(s', a') - Q(s,a))
\end{equation}
An alternative version is the reduction of the above equation by removing the $\max$ operator. This results in the
\emph{actual} action being considered instead of the one that the policy believes to be the best. Q-Learning is
\emph{off-policy} while the latter version, called \ac{SARSA}, is \emph{on-policy}. The distinction has a significant
consequence: while Q-Learning may be used to learn the Q-Values from recorded state-action pairs, \ac{SARSA} requires
the action taken to be derived from the current policy function.
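
As a rough sketch, the tabular Q-Learning update can be implemented as below; the transition data, action set and hyperparameters are invented for illustration only.

\begin{verbatim}
# Tabular Q-Learning update from observed transitions (s, a, r, s').
# The transition data and hyperparameters are invented for illustration.
from collections import defaultdict

alpha, gamma = 0.1, 0.9
actions = ["wait", "charge"]
Q = defaultdict(float)                       # Q[(state, action)] estimates

transitions = [("s0", "charge", -1.0, "s1"), ("s1", "wait", 3.0, "s0")]

for s, a, r, s_next in transitions:
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

print(dict(Q))
\end{verbatim}
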
\subsubsection{Policy Search and Policy Gradient Methods}%
\label{sub:policy_search_and_policy_gradient_methods}
These two approaches are possibly the simplest of the \ac{RL} algorithms. In its most basic form, policy search requires
the algorithm to start with an initial policy to then adapt it until no further gains can be made. While the
concept is simple, it may lead to strong performance if the \emph{choices regarding what to change} are made
wisely. If the policy is just randomly changed, the results will be equally random. However, if the policy is changed depending
on a good interpretation of the environment's responses, this method can offer good performance without the need
to have a model of how the world works. Such an agent takes the current state $s$ as input and uses its policy to
determine an output action $a = \pi(s)$. The value of a policy is noted as $\rho(\theta)$.
For simplicity, actions from a policy are assumed to be continuous as the later application relies on
such actions and because the analysis of policy search algorithms becomes more complex in discrete action spaces. When
both the policy and the environment are deterministic and without noise, policy search algorithms are effective.
The agent can repeat actions in the equivalent states several times, adapting its policy parameters $\theta$ by small
values and determining the empirical gradient values, allowing the agent to perform hill-climbing in the policy
function. This will converge to a local optimum, hence trying different actions allows the agent to improve its
performance as long as the local optimum has not been reached.
In real world scenarios, environments (and also policies) are commonly stochastic. Changing the policy
parameters $\theta$ by a very small value and comparing results of two instances of executing the policy may lead to
strong variations in the reward due to the stochasticity of the environment and the noise in the reward
signal. This is a common problem of statistics and the typical answer is to increase the number of trials until
statistical significance can be reached. But this is often impractical for real world problems and it is also not the
best approach.
The general idea of modern policy gradient methods thus follows an approach of using a different function as an
estimator for the gradient of the policy in a given configuration. A common approach is to use an advantage function
$\hat{A}_t$ to create an estimator for the policy gradient:
\begin{equation}
\hat{g} \ =\ \hat{\mathbb{E}}_{t} \ \left[ \nabla _{\theta }\log \pi _{\theta }( a_{t} \ \mid s_{t})\hat{A}_{t} \right]
\end{equation}
\noindent where $\hat{A}_t$ describes the advantage of taking one action over another in a given state. This can be
described as an \emph{actor-critic architecture}, because $A(a_t, s_t) = Q(a_t,s_t) - V(s_t)$, meaning that the
advantage is the difference between the value of performing a specific action (derived from the policy) in that state
and the estimated value of the state itself \citep{mnih2016asynchronous}.
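
The estimator above can be turned into a surrogate loss whose gradient matches $\hat{g}$; the TensorFlow sketch below shows this, with the network size, batch data and advantage estimates invented for illustration. For readability it uses a discrete action space, even though the later application assumes continuous actions.

\begin{verbatim}
# Sketch of the advantage-based policy gradient estimator
# g_hat = E[ grad_theta log pi_theta(a|s) * A_hat ].
# Network size, batch data and advantages are invented for illustration.
import tensorflow as tf

n_actions = 3
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(n_actions),           # logits over discrete actions
])

states = tf.random.normal((32, 4))              # batch of observed states
actions = tf.random.uniform((32,), maxval=n_actions, dtype=tf.int32)  # actions taken
advantages = tf.random.normal((32,))            # advantage estimates A_hat (e.g. from a critic)

with tf.GradientTape() as tape:
    logits = policy(states)
    log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)          # log pi(a|s) for the taken actions
    # Minimizing this loss follows the policy gradient estimator above.
    loss = -tf.reduce_mean(log_probs * advantages)

grads = tape.gradient(loss, policy.trainable_variables)
\end{verbatim}
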
\subsubsection{Contemporary Research in Deep Reinforcement Learning}%
\label{sub:deep_learning_in_reinforcement_settings}
The previous sections have outlined the conceptual approaches for designing learning agents based on various approaches
for what is essentially a system that tries to act intelligently in respect to its environment. Especially in recent
years, many breakthroughs have been made possible by using neural networks in \ac{RL} settings. Neural networks have
been proven effective as
parameterized Q-Value estimators, state-value estimators and as policy functions. Most approaches suffer from similar
problems: data efficiency with respect to the number of trials required to learn a desired skill, scalability and robustness
\citep{proximalpolicyopt}. The reasons for these challenges are clear: the agent receives minimal feedback and it has
a hard time mapping the rewards it receives to specific alterations in its behavior (the credit assignment problem)
\citep{arulkumaran2017brief}.
The research has shown many approaches to alleviate these shortcomings: Inverse Reinforcement Learning allows the
learning of a policy by imitating a trusted expert, allowing faster learning rates through clear signals
\citep{NG2000InvReinf}. \citet{brockman2016openai} have created the \emph{gym} which allows for coherent benchmarking of
various approaches against a common set of challenges. These challenges include the aforementioned Atari games as well
as locomotion simulations where various bodies are controlled by agents \citep{heess2017emergence}.
\citet{hafner2017agents} have created a setup that allows for massive parallel processing of several environments which
all contribute to the improvement of a central policy function. The implemented algorithm uses neural networks and an
advanced form of the advantage-based policy methods introduced earlier. By this time, research groups were usually
referring to their agents' learning progress in the range of millions of time steps \citep{proximalpolicyopt}. One common
argument for the benefit of \ac{AI} is the ability to transfer knowledge learned by one agent to many agents. However, the
structure of neural networks makes it difficult to transfer knowledge between agents with varying
hyperparameters. The weights for the neurons cannot simply be copied between networks with different structures and
learning them again from scratch is resource intensive due to the complexity of the systems.
\citet{matiisen2017teacher} have introduced a concept that helps agents learn faster by breaking complex challenges down
into several simpler subtasks, similar to how humans are taught in educational institutions. This allows researchers to
quickly get new generations of agents up to speed with their predecessors.
Another approach to this problem of repetitive learning has been introduced by \citet{schmitt2018kickstarting}. In
their setup, the newly created agent transitions from trying to act as similarly as possible to its \emph{teacher agent}
towards trying to improve its performance independently. To achieve this, the student agent's objective includes a
term that penalizes the difference between its actions and the actions its teacher would have taken.
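A rough sketch of how such a loss term could look is given below: the student's own \ac{RL} loss is combined with a
cross-entropy term towards the teacher's action distribution, weighted by a factor that decays over training. The
weighting schedule and all names are assumptions for illustration and do not reproduce the exact formulation of
\citet{schmitt2018kickstarting}.
\begin{verbatim}
import numpy as np

def kickstarting_loss(rl_loss, student_probs, teacher_probs, step,
                      initial_weight=1.0, decay=1e-4):
    """Student RL loss plus a teacher-imitation penalty.

    The imitation term is the cross-entropy H(teacher, student); its
    weight shrinks with the training step so the student gradually
    becomes independent of the teacher."""
    weight = initial_weight * np.exp(-decay * step)
    imitation = -np.sum(teacher_probs * np.log(student_probs + 1e-8))
    return rl_loss + weight * imitation
\end{verbatim}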
In summary, many tweaks to the core concepts allow for improvements on the challenges outlined before: faster learning given limited
\sloppy resources through bootstrapping, improving wall-clock time by leveraging large-scale architectures and
parallelization, transferring knowledge from (human) experts through inverse \ac{RL}, etc. A rich landscape of tools is
in rapid development, and to construct capable agents it is beneficial to leverage both the structure of the specific
problem domain and the available resources.
\section{PowerTAC: a Competitive Simulation}%
\label{sec:powertac_a_competitive_simulation}
% Simulating the effect on the energy efficiency of smart grid technologies.pdf
In the following section, \acl{PowerTAC} is introduced and some similarities to comparable
research are summarized. At the end of the section, some competing brokers are compared and their underlying
functioning analyzed where possible.
\ac{PowerTAC} simulates a liberalized retail electricity market where multiple autonomous agents compete in
different markets. Firstly, a tariff market where agents, or \emph{brokers}, compete for numerous end-users through the
offering of tariff contracts. Secondly, a wholesale market in which brokers buy and sell large amounts of electric
energy to match their customers' demands. This market allows brokers to place bids up to 24 hours in advance, and each
hour a broker can place new bids to correct for changes in its forecast models. Lastly, the balancing
market places relatively high costs on any broker that causes an imbalance in the system, giving brokers an incentive
to balance their own portfolios prior to the balancing operations \citep{ketter2018powertac}. Figure~\ref{fig:powertacoverview} summarizes
this ecosystem.
\begin{figure}[h]%!h
\centering
\includegraphics[width=1.0\textwidth]{powerTACScenarioOverview.png}
\caption{An overview of the markets simulated in PowerTAC, from \citet{ketter2018powertac}}
\label{fig:powertacoverview}
\end{figure}
%OVERVIEW,Economical
The broker to be developed has to compete in a number of markets and handle a variety of customer types. While
\ac{PowerTAC} generates a fairly complex landscape, it mostly aims at economic complexity rather than
modeling the technical underpinnings of the system. Therefore, it doesn't simulate any hardware but rather focuses on the
different agents involved in the market.
Its goal is to explore numerous market designs to correctly incentivize the market participants, allowing future
energy grids to be distributed, failure tolerant and adaptable to network dynamics \citep{ketter2015competitive}. Future
grids need to handle the changing landscape of energy production, delivery and consumption patterns, triggered by the
increase of renewable energy sources and the planned reduction of fossil fuel dependency in many countries
\citep[p.13]{smartgrids2012smartgrids}.
\subsection{Similar research}%
\label{sub:similar_research}
\ac{PowerTAC} is part of a larger body of research focusing on agent-based simulations. The current landscape of generic
agent-based simulation frameworks is summarized by \citet{abar2017agent}. \ac{PowerTAC} falls into a subcategory of
simulations concerning the energy markets. \citet{zhou2007agent} surveyed a number of tools in 2009, before the
inception of \ac{PowerTAC}. They define six categories to be used to compare a number of existing platforms and
frameworks for creating simulations. In this work, focus is placed on the components \ac{PowerTAC} does or does not
exhibit without describing the other platforms. \ac{PowerTAC} mostly focuses on the intermediaries between the end
consumers and the producers of energy, simulating both ends of the market through automated models and not by defining
them as agents with goals and intelligent behavior. It also does not simulate the transmission infrastructures and their
capacities, nor does it assume hierarchical structures of local and inter-regional grid interaction. \ac{PowerTAC} offers,
in the form of the central server instance, a strong ``Independent System Operator'', i.e.\ an entity that manages the
grid, the market and the communication between all agents in the simulation. The wholesale market deploys mostly
bidding approaches, in contrast to other simulations that also support bilateral mid- and long-term contracting options.
However, it emphasizes the concept of offering balancing capacity through energy storage devices and curtailment of
energy consumption which was not noted in the survey by \citet{zhou2007agent}.
\ac{PowerTAC} follows a distributed research approach. Teams can create their own agents and compete with each
other. This creates a rich landscape of solution approaches from researchers based in a number of countries and with
diverse backgrounds \citep{ketter2015competitive}. There is one drawback: few teams have opened their agents'
implementations to others, which increases the entry barrier and may lead to duplication of work that could otherwise
have been reused \citep{boettiger2015introduction}.
\subsection{Components}%
\label{sub:components}
\ac{PowerTAC} is separated both technically and logically into several components to aid comprehensibility of the
system while still allowing complex simulations of realistic scenarios. In the following pages, those logical components
will be explained. Most of them map easily onto the technical implementation. The technical
structure will not be explained in detail but can be found under the GitHub \ac{PowerTAC}
organization\footnote{\url{https://github.com/powertac/}}.
\subsubsection{Distribution Utility} The \ac{DU} represents an entity that regulates the real-time electricity usage and
corrects any imbalances in the brokers' portfolios. Any broker that does not balance its electricity supply and demand
incurs costs, which acts as an incentive to always keep its portfolio as balanced as possible. The \ac{DU} also owns the
distribution grid, and every broker must pay fees for the use of the grid in proportion to the volume of its customers'
energy \citep[p.10]{ketter2018powertac}. The fees for the grid are constructed in a way that incentivizes
brokers not only to balance their portfolio but also to avoid high peak demand. The \ac{DU} further offers tariffs and is
thus the equivalent of a \emph{baseline broker} whose tariffs define an upper bound on broker profitability.
\subsubsection{Accounting} All accounting is managed by the central simulation server to prevent adversarial brokers
from tampering with the game's rules. Negative balances are usually charged a 10\% p.a.\ interest rate while
positive balances receive a 5\% p.a.\ interest rate. This component tracks every broker's financial balance as well as
all brokers' customer subscriptions and wholesale market positions \citep[p.11]{ketter2018powertac}.
\subsubsection{Wholesale Market}
Every broker needs to purchase energy before it can sell it to its customers, unless those customers themselves
generate sufficient energy to balance the broker's portfolio. For this, \ac{PowerTAC} offers a wholesale market that operates
a \emph{periodic double auction}, which represents traditional energy exchanges like those in the United States
and Europe. Participants in this wholesale market are the brokers as well as a large entity
representing a number of generating facilities, a grid buyer that simulates large-scale demand based on real-world data
adjusted for weather forecasts, and a wholesale buyer that regularly places high-volume, low-price bids. During each
time slot, 24 future slots are open for brokers to place bids. After the bids have been collected, a clearing
price is calculated as the intersection of the supply and demand curves. Orders without limit prices are
always served first. After the clearing, all uncleared bids and asks are disclosed and distributed to the brokers to indicate the
direction of the market's demand and supply curves \citep[p.21f.]{ketter2018powertac}.
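The following simplified sketch illustrates how a uniform clearing price can be determined as the intersection of the
collected bids and asks; it abstracts away market orders and the exact \ac{PowerTAC} clearing rules, and all names and
the price convention are illustrative assumptions.
\begin{verbatim}
def clear_market(bids, asks):
    """bids/asks: lists of (limit_price, quantity_mwh) for one time slot.
    Returns (clearing_price, cleared_quantity) or None if nothing clears."""
    bids = sorted(bids, key=lambda o: -o[0])   # highest-paying buyers first
    asks = sorted(asks, key=lambda o: o[0])    # cheapest sellers first
    i = j = 0
    bid_left = ask_left = 0.0
    cleared, price = 0.0, None
    while True:
        if bid_left == 0.0:
            if i >= len(bids):
                break
            bid_price, bid_left = bids[i]; i += 1
        if ask_left == 0.0:
            if j >= len(asks):
                break
            ask_price, ask_left = asks[j]; j += 1
        if bid_price < ask_price:              # curves no longer intersect
            break
        traded = min(bid_left, ask_left)
        cleared += traded
        bid_left -= traded
        ask_left -= traded
        price = (bid_price + ask_price) / 2.0  # one common pricing convention
    return (price, cleared) if cleared > 0 else None
\end{verbatim}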
\subsubsection{Balancing Market}
The balancing market is the final trading opportunity for agents; in terms of the game, it occurs right after the last
opportunity to trade for a target time slot, meaning that it takes place virtually in parallel to the consumption of
electricity. Any imbalance during this phase is corrected by the \ac{DU}, which imposes forced balancing on brokers
with an imbalanced portfolio. As a
result, brokers with too much supply in their portfolio receive very little reimbursement for it and those whose customers' usage is higher than the estimated amount pay
high prices for the additionally supplied energy. The \ac{DU} also distributes the cost for the grid infrastructure
according to the peak demand distribution among all brokers. This is based on the assumption that the grid
infrastructure has a static capacity equivalent to the maximum
transmission demands. Brokers are incentivized to create portfolios that don't exhibit large deltas between
different hours of the day or days of the week.
Moreover, brokers whose tariffs include economic control abilities can pass this capacity along to the \ac{DU}, which uses
it to correct the market's imbalances, charging customers' storage devices if an oversupply is present or
depleting their devices in case of an undersupply. It is hence economically beneficial for brokers to attract
customers with such balancing capabilities since it offers a buffer capacity against the balancing costs otherwise
incurred through the actions of the \ac{DU} \citep[p.5]{ketter2018powertac}.
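The asymmetric settlement described above can be sketched as follows; the concrete prices are illustrative
assumptions, as the actual \ac{PowerTAC} balancing prices depend on market conditions.
\begin{verbatim}
def balancing_cash_flow(imbalance_mwh, surplus_price=10.0, shortage_price=80.0):
    """imbalance_mwh > 0: broker over-supplied; < 0: broker under-supplied.
    Returns the broker's cash flow (negative values are payments)."""
    if imbalance_mwh >= 0:
        return imbalance_mwh * surplus_price     # small reimbursement per MWh
    return imbalance_mwh * shortage_price        # large payment for missing energy
\end{verbatim}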
\subsubsection{Customer Market}
The foundation of any broker's ability to generate profit is a sufficient number of customers subscribed to its tariffs. For
this to occur, the broker must publish competitive tariffs that attract customers. Nonetheless, if the
broker offers tariffs that lead to net losses, long-term profit will not be possible\footnote{While the 2017
competition technically allowed brokers to remain in the game despite offering highly under-priced tariffs that
corrupted the simulation results, a proper broker must not pursue such strategies for economic
reasons.}.
The broker has a wide variety of actions at its disposal to create a rich portfolio. The simulation allows the
creation of a variety of tariff types with parameters that the broker can adapt. The types include:
\begin{description}
\item[Flat rate] Customers pay a flat rate per kWh and they always receive their demanded
amount.
\item[Tariff with fixed fee] Customers pay a definable fixed fee every day to receive the service.
\item [Tiered rates] Customers pay a certain price per kWh until a limit is reached after which the kWh price
changes. Arbitrarily many such tiers can be added.
\item[Time-of-use] Customers pay different prices depending
on the time of the day or the day of the week.
\item[Dynamic Pricing] Allows the broker to dynamically adapt the price per kWh in real time to incentivize customers
    to reduce their usage during high-demand times. A minimum, maximum and mean price per kWh as well as a
    notification interval need to be specified.
\item[Curtailable] Customers can opt in to a tariff that allows the broker to reduce the delivered amount of
    electricity per time slot up to a certain percentage. This means the customer is exposed to the risk of not
    receiving the entire electrical supply demanded, usually in exchange for a discounted unit cost per kWh.
\item[Storage] Customers can offer their storage capacity to brokers to allow them to balance their portfolios.
    Customers receive payment from the brokers if their storage devices are being depleted and pay a (reduced) fee
    for charging events.
\item[Signup fees and withdrawal fees] Customers can receive bonuses or pay fees
for signing up or canceling a subscription.
\end{description}
\noindent Some of the above types can also be combined to create complex tariff landscapes for customers to choose from \citep[p.9]{ketter2018powertac}.
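As an illustration of how a broker might represent one of these tariff types internally, the following sketch models a
simple time-of-use tariff and the daily cost it implies for a given usage profile. This is not the \ac{PowerTAC}
tariff interface; the structure, parameters and rates are assumptions for illustration only.
\begin{verbatim}
from dataclasses import dataclass
from typing import List

@dataclass
class TimeOfUseTariff:
    hourly_rates: List[float]       # price per kWh for each hour 0..23
    daily_fixed_fee: float = 0.0    # optional periodic payment
    signup_bonus: float = 0.0       # one-off payment on subscription

    def daily_cost(self, hourly_usage_kwh: List[float]) -> float:
        """Cost of one day's consumption under this tariff."""
        energy = sum(rate * usage for rate, usage
                     in zip(self.hourly_rates, hourly_usage_kwh))
        return energy + self.daily_fixed_fee

# Example: cheap night rate (hours 0-6 and 23), more expensive daytime rate.
tariff = TimeOfUseTariff(hourly_rates=[0.08] * 7 + [0.15] * 16 + [0.08])
\end{verbatim}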
\subsubsection{Customer models}%
\label{sub:customer_models}
The final part of the simulation environment is made up of the customer models, which simulate real-world customers.
Each customer can both produce and consume electricity. Consumers are modeled by both factored and elemental models
\citep[p.14]{ketter2018powertac}, allowing for detailed patterns of small populations and averaged patterns of large
populations, respectively. The customers evaluate the offered tariffs based on a number of deterministic functions
that include the various costs and variants of the offered tariffs, multiplied by an \emph{irrationality factor} that
models the limited rationality of real actors. Additional assessments such as broker reputation
and energy source preferences are also included in the utility function.
Customers evaluate new tariffs irregularly based on an \emph{inertia factor} that limits
their attention to such tariffs. Customers are not inherently loyal to their brokers, but the inertia factor
indirectly causes them not to switch immediately even if a more rational tariff is available.
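The decision logic just described can be sketched as follows, with the expected tariff cost perturbed by an
irrationality factor and re-evaluation gated by an inertia probability. The actual \ac{PowerTAC} utility function is
more elaborate; all parameters and names here are illustrative assumptions.
\begin{verbatim}
import random

def choose_tariff(current_tariff, offered_tariffs, expected_cost,
                  irrationality=0.1, inertia=0.8):
    """Return the tariff the customer subscribes to for the next period.

    expected_cost: callable mapping a tariff to its estimated cost."""
    if random.random() < inertia:       # customer ignores the market this round
        return current_tariff

    def perceived_utility(tariff):
        noise = 1.0 + random.uniform(-irrationality, irrationality)
        return -expected_cost(tariff) * noise

    return max(offered_tariffs + [current_tariff], key=perceived_utility)
\end{verbatim}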
As previously noted, customers can both consume and produce electricity. While most production is non-deterministic
and non-controllable (e.g.\ solar and wind electricity), some sources are controllable, such as \ac{CHP} or
biogas units \citep[p.16]{ketter2018powertac}. Devices such as electric vehicles or water heaters can also offer
regulation actions allowing brokers to balance their portfolios. A \emph{smart} water heater could refill only minimally
after heavy use if usage patterns show that the owners will most likely not use it again for several hours. This
way, an additional, possibly profitable capacity for energy consumption is created for the consumer, as the broker
usually charges less for electricity delivered under capacity regulation terms \citep[p.14ff.]{ketter2018powertac}.
\subsection{Existing broker implementations}%
\label{sub:existing_broker_concepts}
Before designing an agent, it is helpful to investigate previously developed agents and their design to understand
the current state of research. For this, the papers of the brokers AgentUDE, TacTex and COLDPower were analyzed, because they
performed well in previous tournaments and because their creators have published their concepts. The
paper by \citet{peters2013reinforcement} was also analyzed as it was the first paper to describe a \ac{RL} agent acting in a predecessor
of the \ac{PowerTAC} environment. This broker, while technically not competing in the \ac{PowerTAC} competition, is
referred to as \ac{SELF}. Their architectures,
models and performances are summarized in the following pages. The analysis is based on publications that describe the
TacTex, COLDPower and AgentUDE agents of 2015, as they are the last publications of these brokers that are available on
the \ac{PowerTAC} website. Unfortunately, the source code of the agents has not been made available, which does not
allow inspection of the exact inner mechanics.
From what is visible in their shared binaries, all agents are based on Java and do not employ any other technologies to
perform their actions during competitions.
\subsubsection{Tariff market strategies}%
\label{ssub:tariff_market_strategies}
AgentUDE deploys an aggressive but rigid tariff market strategy, offering cheap tariffs at the beginning of the game to
trigger competing agents to react. It also places high transaction costs on the tariffs by making use of early
withdrawal penalties and bonus payments \citep{ozdemir2017strategy}. While this may be beneficial for success in the
competition, it doesn't translate into real-world scenarios as energy markets are not a round-based, finite game.
TacTex does not target tariff fees such as early withdrawal fees to make a profit. It also doesn't publish tariffs for
production of energy \citep{tactexurieli2016mdp}. TacTex has modeled the entire competition as a \ac{MDP} and included the
tariff market actions in this model. It selects a tariff from a set of predefined fixed-rate consumption tariffs to
reduce the action space complexity of the agent. Ultimately, it uses \ac{RL} to decide on its tariff market
actions, reducing the possible actions based on domain knowledge.
COLDPower also deploys \ac{RL} approaches with a Q-Learning based agent choosing from a range of predefined changes to
its existing tariff portfolio. It can perform the following actions: \emph{maintain, lower, raise, inline, minmax, wide,
bottom}. These describe fixed action strategies that have been constructed based on domain knowledge \citep{cuevas2015distributed}. The agent
is not \emph{learning} how to behave in the market on a low level but rather on a more abstract level. It can be
compared to an \ac{RL} agent that doesn't learn how to perform locomotion to move a controllable body through space but
rather one that merely chooses the direction of the walk, without needing to understand \emph{how} to do it. While this
leads to quick results, it may significantly limit the achievable performance as the solution space is greatly reduced.
\ac{SELF} also defines the tariff market as a \ac{MDP} and uses feature selection and regularization to reduce the state
space of their learning \ac{SARSA} agent. The action space has been defined with discrete pre-defined actions that are
similar to those of the COLDPower agent \citep{peters2013reinforcement}. As with COLDPower, the discrete action space by itself introduces assumptions about
the problem domain that the agent cannot overcome. As an example, the two actions \emph{LowMargin} (10\% margin) and
\emph{HighMargin} (20\% margin) restrict the profitability of the agent to two points in the overall action space. Maybe
the optimum is at 14.25\%, or maybe it is even higher than 20\%. A discrete-action agent can neither discover nor act upon
these possible improvements. Neural networks may help overcome this limitation because they can both handle large state spaces
and act successfully in continuous action spaces.
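The following toy illustration makes this restriction concrete by contrasting a fixed set of discrete margin actions
with a continuous policy output squashed to an arbitrary margin. Numbers, bounds and names are purely illustrative.
\begin{verbatim}
import math

DISCRETE_MARGINS = {"LowMargin": 0.10, "HighMargin": 0.20}

def discrete_margin(action_name):
    """A discrete-action agent can only choose between the fixed margins."""
    return DISCRETE_MARGINS[action_name]

def continuous_margin(raw_output, low=0.0, high=0.5):
    """Squash an unbounded policy output into an arbitrary margin in [low, high],
    e.g. a value near 0.1425 if that happens to be optimal."""
    return low + (high - low) / (1.0 + math.exp(-raw_output))
\end{verbatim}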
\subsubsection{Wholesale market strategies}%
\label{ssub:wholesale_market_strategies}
AgentUDE's strategy for the wholesale market includes both demand and price prediction. For the demand prediction, AgentUDE
uses a simple weighted estimation based on the previous time-step and the demand of 24 hours before the target time-step
\citep{ozdemir2015winner}. Their price prediction is more complex and involves a dynamic programming model based on
\citet{tesauro2002strategic} to find \emph{similar hours} in recent history and determine current prices using
Q-Learning \citep{ozdemir2017strategy}. Their \ac{MDP} is constructed in a way that the agent needs to determine the
limit price that minimizes costs. It has only one action dimension, which describes the limit price, and its environment
observation is represented by a belief function $f(s,a)$, which makes it a \ac{POMDP}. The agent uses value iteration to
solve the Bellman equations, determining the expected price. The ultimate limit prices are then determined based on a
heuristic that offers higher prices for ``short-term'' purchases and adjusts this to also offer higher prices
in the case of an expected higher overall trading volume \citep{ozdemir2017strategy}.
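The simple weighted demand estimate mentioned at the beginning of this paragraph can be sketched as follows, combining
the most recent observed demand with the demand 24 hours before the target slot; the weighting factor is an assumption
and not taken from the AgentUDE publications.
\begin{verbatim}
def predict_demand(history, target_slot, alpha=0.5):
    """history: dict mapping time slot index -> observed net demand in MWh."""
    recent = history.get(target_slot - 1, 0.0)     # most recent observation
    daily = history.get(target_slot - 24, recent)  # same hour, previous day
    return alpha * recent + (1.0 - alpha) * daily
\end{verbatim}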
TacTex considers the wholesale market actions to be part of the overall complexity-reduced \ac{MDP}. It uses a demand
predictor to determine the \ac{MWh} amount to be placed in the order. The
predictor is based on the actual customer models of the simulation server itself. While this surely leads to good
performance, it is debatable whether it actually benefits the research goal. The price predictor is
a linear regression model fitted on the bootstrap period and corrected by a bias term based on the prediction error of
the last 24 hours \citep{tactexurieli2016mdp}.
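A minimal sketch of a linear regression price predictor with a rolling bias correction, in the spirit of the approach
described above, is shown below. The feature choice, the correction window handling and all names are assumptions for
illustration and do not reproduce the TacTex implementation.
\begin{verbatim}
import numpy as np

class PricePredictor:
    def __init__(self):
        self.weights = None
        self.recent_errors = []          # prediction errors of the last 24 hours

    def fit(self, features, prices):
        """Ordinary least squares on bootstrap-period data."""
        X = np.column_stack([np.ones(len(features)), features])
        self.weights, *_ = np.linalg.lstsq(X, np.asarray(prices), rcond=None)

    def _raw(self, feature_row):
        x = np.concatenate([[1.0], np.atleast_1d(feature_row)])
        return float(x @ self.weights)

    def predict(self, feature_row):
        bias = np.mean(self.recent_errors) if self.recent_errors else 0.0
        return self._raw(feature_row) + bias   # correct by the recent mean error

    def observe(self, feature_row, actual_price):
        """Record the error of the uncorrected prediction against the outcome."""
        self.recent_errors.append(actual_price - self._raw(feature_row))
        self.recent_errors = self.recent_errors[-24:]
\end{verbatim}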
COLDPower deploys a linear regression model to predict prices and determines the demand by ``using the energy demand
historical information'' \citep{cuevas2015distributed}. The order is placed accordingly.
The authors of \ac{SELF} don't describe its actions in the wholesale market. Most likely, the early variant of the
simulation did not yet contain this component and simply calculated the market price for the
electricity before submitting it to the agent.