@node Arithmetic, Date and Time, Mathematics, Top
@c %MENU% Low level arithmetic functions
@chapter Arithmetic Functions
This chapter contains information about functions for doing basic
arithmetic operations, such as splitting a float into its integer and
fractional parts or retrieving the imaginary part of a complex value.
These functions are declared in the header files @file{math.h} and
@file{complex.h}.
@menu
* Integers:: Basic integer types and concepts
* Integer Division:: Integer division with guaranteed rounding.
* Floating Point Numbers:: Basic concepts. IEEE 754.
* Floating Point Classes:: The five kinds of floating-point number.
* Floating Point Errors:: When something goes wrong in a calculation.
* Rounding:: Controlling how results are rounded.
* Control Functions:: Saving and restoring the FPU's state.
* Arithmetic Functions:: Fundamental operations provided by the library.
* Complex Numbers:: The types. Writing complex constants.
* Operations on Complex:: Projection, conjugation, decomposition.
* Parsing of Numbers:: Converting strings to numbers.
* System V Number Conversion:: An archaic way to convert numbers to strings.
@end menu
@node Integers
@section Integers
@cindex integer
The C language defines several integer data types: integer, short integer,
long integer, and character, all in both signed and unsigned varieties.
@w{ISO C99} adds long long integers, which the GNU C compiler also
provides as an extension in older language modes.
@cindex signedness
The C integer types were intended to allow code to be portable among
machines with different inherent data sizes (word sizes), so each type
may have different ranges on different machines. The problem with
this is that a program often needs to be written for a particular range
of integers, and sometimes must be written for a particular size of
storage, regardless of what machine the program runs on.
To address this problem, the GNU C library contains C type definitions
you can use to declare integers that meet your exact needs. Because the
GNU C library header files are customized to a specific machine, your
program source code doesn't have to be.
These @code{typedef}s are in @file{stdint.h}.
@pindex stdint.h
If you require that an integer be represented in exactly N bits, use one
of the following types, with the obvious mapping to bit size and signedness:
@itemize @bullet
@item int8_t
@item int16_t
@item int32_t
@item int64_t
@item uint8_t
@item uint16_t
@item uint32_t
@item uint64_t
@end itemize
If your C compiler and target machine do not support integers of a
certain size, the corresponding type above does not exist.
If you don't need a specific storage size, but want the smallest data
structure with @emph{at least} N bits, use one of these:
@itemize @bullet
@item int_least8_t
@item int_least16_t
@item int_least32_t
@item int_least64_t
@item uint_least8_t
@item uint_least16_t
@item uint_least32_t
@item uint_least64_t
@end itemize
If you don't need a specific storage size, but want the data structure
that allows the fastest access while having at least N bits (and
among data structures with the same access speed, the smallest one), use
one of these:
@itemize @bullet
@item int_fast8_t
@item int_fast16_t
@item int_fast32_t
@item int_fast64_t
@item uint_fast8_t
@item uint_fast16_t
@item uint_fast32_t
@item uint_fast64_t
@end itemize
If you want an integer with the widest range possible on the platform on
which it is being used, use one of the following. If you use these,
you should write code that takes into account the variable size and range
of the integer.
@itemize @bullet
@item intmax_t
@item uintmax_t
@end itemize
The GNU C library also provides macros that tell you the maximum and
minimum possible values for each integer data type. The macro names
follow these examples: @code{INT32_MAX}, @code{UINT8_MAX},
@code{INT_FAST32_MIN}, @code{INT_LEAST64_MIN}, @code{UINTMAX_MAX},
@code{INTMAX_MAX}, @code{INTMAX_MIN}. Note that there are no macros for
unsigned integer minima. These are always zero.
@cindex maximum possible integer
@cindex minimum possible integer
There are similar macros for use with C's built in integer types which
should come with your C compiler. These are described in @ref{Data Type
Measurements}.
Don't forget you can use the C @code{sizeof} operator with any of these
data types to get the number of bytes of storage each uses.
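
As a minimal sketch of how the exact-width types and their limit macros
fit together, the following helper adds two @code{int32_t} values and
clamps the result instead of overflowing. The name
@code{saturating_add} is hypothetical and not part of the library; the
sketch assumes that @code{int32_t} and @code{int64_t} both exist on the
target.

```c
#include <stdint.h>

/* Hypothetical helper, not a library function: add two int32_t values
   and clamp the result to [INT32_MIN, INT32_MAX].  */
int32_t
saturating_add (int32_t a, int32_t b)
{
  /* The exact sum of two 32-bit values always fits in 64 bits.  */
  int64_t wide = (int64_t) a + (int64_t) b;

  if (wide > INT32_MAX)
    return INT32_MAX;
  if (wide < INT32_MIN)
    return INT32_MIN;
  return (int32_t) wide;
}
```

With this definition, @code{saturating_add (INT32_MAX, 1)} yields
@code{INT32_MAX} rather than invoking undefined behavior.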
@node Integer Division
@section Integer Division
@cindex integer division functions
This section describes functions for performing integer division. These
functions are redundant when GNU CC is used, because in GNU C the
@samp{/} operator always rounds towards zero. But in other C
implementations, @samp{/} may round differently with negative arguments.
@code{div} and @code{ldiv} are useful because they specify how to round
the quotient: towards zero. The remainder has the same sign as the
numerator.
These functions are specified to return a result @var{r} such that the value
@code{@var{r}.quot*@var{denominator} + @var{r}.rem} equals
@var{numerator}.
@pindex stdlib.h
To use these facilities, you should include the header file
@file{stdlib.h} in your program.
@comment stdlib.h
@comment ISO
@deftp {Data Type} div_t
This is a structure type used to hold the result returned by the @code{div}
function. It has the following members:
@table @code
@item int quot
The quotient from the division.
@item int rem
The remainder from the division.
@end table
@end deftp
@comment stdlib.h
@comment ISO
@deftypefun div_t div (int @var{numerator}, int @var{denominator})
The @code{div} function computes the quotient and remainder from
the division of @var{numerator} by @var{denominator}, returning the
result in a structure of type @code{div_t}.
If the result cannot be represented (as in a division by zero), the
behavior is undefined.
Here is an example, albeit not a very useful one.
@smallexample
div_t result;
result = div (20, -6);
@end smallexample
@noindent
Now @code{result.quot} is @code{-3} and @code{result.rem} is @code{2}.
@end deftypefun
@comment stdlib.h
@comment ISO
@deftp {Data Type} ldiv_t
This is a structure type used to hold the result returned by the @code{ldiv}
function. It has the following members:
@table @code
@item long int quot
The quotient from the division.
@item long int rem
The remainder from the division.
@end table
(This is identical to @code{div_t} except that the components are of
type @code{long int} rather than @code{int}.)
@end deftp
@comment stdlib.h
@comment ISO
@deftypefun ldiv_t ldiv (long int @var{numerator}, long int @var{denominator})
The @code{ldiv} function is similar to @code{div}, except that the
arguments are of type @code{long int} and the result is returned as a
structure of type @code{ldiv_t}.
@end deftypefun
@comment stdlib.h
@comment ISO
@deftp {Data Type} lldiv_t
This is a structure type used to hold the result returned by the @code{lldiv}
function. It has the following members:
@table @code
@item long long int quot
The quotient from the division.
@item long long int rem
The remainder from the division.
@end table
(This is identical to @code{div_t} except that the components are of
type @code{long long int} rather than @code{int}.)
@end deftp
@comment stdlib.h
@comment ISO
@deftypefun lldiv_t lldiv (long long int @var{numerator}, long long int @var{denominator})
The @code{lldiv} function is like the @code{div} function, but the
arguments are of type @code{long long int} and the result is returned as
a structure of type @code{lldiv_t}.
The @code{lldiv} function was added in @w{ISO C99}.
@end deftypefun
@comment inttypes.h
@comment ISO
@deftp {Data Type} imaxdiv_t
This is a structure type used to hold the result returned by the @code{imaxdiv}
function. It has the following members:
@table @code
@item intmax_t quot
The quotient from the division.
@item intmax_t rem
The remainder from the division.
@end table
(This is identical to @code{div_t} except that the components are of
type @code{intmax_t} rather than @code{int}.)
See @ref{Integers} for a description of the @code{intmax_t} type.
@end deftp
@comment inttypes.h
@comment ISO
@deftypefun imaxdiv_t imaxdiv (intmax_t @var{numerator}, intmax_t @var{denominator})
The @code{imaxdiv} function is like the @code{div} function, but the
arguments are of type @code{intmax_t} and the result is returned as
a structure of type @code{imaxdiv_t}.
See @ref{Integers} for a description of the @code{intmax_t} type.
The @code{imaxdiv} function was added in @w{ISO C99}.
@end deftypefun
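
The relation between quotient, remainder and operands described in this
section can be sketched as follows. Both helper names are hypothetical.
The first checks the identity
@code{@var{r}.quot*@var{denominator} + @var{r}.rem == @var{numerator}};
the second uses @code{imaxdiv} to build a modulus that is never
negative, assuming the modulus @var{m} is positive.

```c
#include <stdlib.h>
#include <inttypes.h>

/* Hypothetical check: nonzero if the result of div satisfies
   quot * denominator + rem == numerator.  The denominator must be
   nonzero and the quotient must be representable.  */
int
div_identity_holds (int numerator, int denominator)
{
  div_t r = div (numerator, denominator);
  return r.quot * denominator + r.rem == numerator;
}

/* Hypothetical helper: modulus in the range [0, m) for m > 0.
   imaxdiv gives a remainder with the sign of the numerator, so a
   negative remainder is shifted up by m.  */
intmax_t
floor_mod (intmax_t n, intmax_t m)
{
  imaxdiv_t r = imaxdiv (n, m);
  return r.rem < 0 ? r.rem + m : r.rem;
}
```

For instance, @code{imaxdiv (-7, 3)} has a remainder of @code{-1}, so
@code{floor_mod (-7, 3)} returns @code{2}.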
@node Floating Point Numbers
@section Floating Point Numbers
@cindex floating point
@cindex IEEE 754
@cindex IEEE floating point
Most computer hardware has support for two different kinds of numbers:
integers (@math{@dots{}-3, -2, -1, 0, 1, 2, 3@dots{}}) and
floating-point numbers. Floating-point numbers have three parts: the
@dfn{mantissa}, the @dfn{exponent}, and the @dfn{sign bit}. The real
number represented by a floating-point value is given by
@tex
$(s \mathrel? -1 \mathrel: 1) \cdot 2^e \cdot M$
@end tex
@ifnottex
@math{(s ? -1 : 1) @mul{} 2^e @mul{} M}
@end ifnottex
where @math{s} is the sign bit, @math{e} the exponent, and @math{M}
the mantissa. @xref{Floating Point Concepts}, for details. (It is
possible to have a different @dfn{base} for the exponent, but all modern
hardware uses @math{2}.)
Floating-point numbers can represent a finite subset of the real
numbers. While this subset is large enough for most purposes, it is
important to remember that the only reals that can be represented
exactly are rational numbers that have a terminating binary expansion
shorter than the width of the mantissa. Even simple fractions such as
@math{1/5} can only be approximated by floating point.
Mathematical operations and functions frequently need to produce values
that are not representable. Often these values can be approximated
closely enough for practical purposes, but sometimes they can't.
Historically there was no way to tell when the results of a calculation
were inaccurate. Modern computers implement the @w{IEEE 754} standard
for numerical computations, which defines a framework for indicating to
the program when the results of calculation are not trustworthy. This
framework consists of a set of @dfn{exceptions} that indicate why a
result could not be represented, and the special values @dfn{infinity}
and @dfn{not a number} (NaN).
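
The decomposition into sign, exponent and mantissa can be sketched with
the standard functions @code{frexp} and @code{ldexp}. The helper names
below are hypothetical, and the sketch assumes a finite argument.
@code{rebuild} takes a value apart and reassembles it according to the
formula above; @code{exact_in_float} shows one way to observe that a
fraction such as @math{1/5} is only approximated, because its
@code{float} and @code{double} approximations differ.

```c
#include <math.h>

/* Hypothetical helper: split a finite x into sign s, mantissa M and
   exponent e, then rebuild it as (s ? -1 : 1) * 2^e * M.  frexp
   returns a signed mantissa whose magnitude lies in [0.5, 1) for
   nonzero x.  */
double
rebuild (double x)
{
  int e;
  double m = frexp (x, &e);     /* x == m * 2^e */
  int s = signbit (x) != 0;     /* 1 if the sign bit is set */
  double M = s ? -m : m;        /* magnitude of the mantissa */

  return (s ? -1.0 : 1.0) * ldexp (M, e);
}

/* Hypothetical helper: nonzero if x survives rounding to float and
   back, which it does only if it fits in float's shorter mantissa.  */
int
exact_in_float (double x)
{
  return (double) (float) x == x;
}
```

Under these assumptions @code{exact_in_float (0.25)} is nonzero, since
@math{1/4} has a short binary expansion, while
@code{exact_in_float (0.2)} is zero.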
@node Floating Point Classes
@section Floating-Point Number Classification Functions
@cindex floating-point classes
@cindex classes, floating-point
@pindex math.h
@w{ISO C99} defines macros that let you determine what sort of
floating-point number a variable holds.
@comment math.h
@comment ISO
@deftypefn {Macro} int fpclassify (@emph{float-type} @var{x})
This is a generic macro which works on all floating-point types and
which returns a value of type @code{int}. The possible values are:
@vtable @code
@item FP_NAN
The floating-point number @var{x} is ``Not a Number'' (@pxref{Infinity
and NaN})
@item FP_INFINITE
The value of @var{x} is either plus or minus infinity (@pxref{Infinity
and NaN})
@item FP_ZERO
The value of @var{x} is zero. In floating-point formats like @w{IEEE
754}, where zero can be signed, this value is also returned if
@var{x} is negative zero.
@item FP_SUBNORMAL
Numbers whose absolute value is too small to be represented in the
normal format are represented in an alternate, @dfn{denormalized} format
(@pxref{Floating Point Concepts}). This format is less precise but can
represent values closer to zero. @code{fpclassify} returns this value
for values of @var{x} in this alternate format.
@item FP_NORMAL
This value is returned for all other values of @var{x}. It indicates
that there is nothing special about the number.
@end vtable
@end deftypefn
@code{fpclassify} is most useful if more than one property of a number
must be tested. There are more specific macros which only test one
property at a time. Generally these macros execute faster than
@code{fpclassify}, since there is special hardware support for them.
You should therefore use the specific macros whenever possible.
@comment math.h
@comment ISO
@deftypefn {Macro} int isfinite (@emph{float-type} @var{x})
This macro returns a nonzero value if @var{x} is finite: not plus or
minus infinity, and not NaN. It is equivalent to
@smallexample
(fpclassify (x) != FP_NAN && fpclassify (x) != FP_INFINITE)
@end smallexample
@code{isfinite} is implemented as a macro which accepts any
floating-point type.
@end deftypefn
@comment math.h
@comment ISO
@deftypefn {Macro} int isnormal (@emph{float-type} @var{x})
This macro returns a nonzero value if @var{x} is finite and normalized.
It is equivalent to
@smallexample
(fpclassify (x) == FP_NORMAL)
@end smallexample
@end deftypefn
@comment math.h
@comment ISO
@deftypefn {Macro} int isnan (@emph{float-type} @var{x})
This macro returns a nonzero value if @var{x} is NaN. It is equivalent
to
@smallexample
(fpclassify (x) == FP_NAN)
@end smallexample
@end deftypefn
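
When several properties matter at once, a single call to
@code{fpclassify} can drive a @code{switch}. The following is a sketch
only; the function name @code{fp_class_name} and the strings it returns
are illustrative choices, not part of the library.

```c
#include <math.h>

/* Hypothetical helper: map the result of fpclassify to a short,
   human-readable name.  */
const char *
fp_class_name (double x)
{
  switch (fpclassify (x))
    {
    case FP_NAN:
      return "nan";
    case FP_INFINITE:
      return "infinite";
    case FP_ZERO:
      return "zero";          /* also returned for negative zero */
    case FP_SUBNORMAL:
      return "subnormal";
    case FP_NORMAL:
      return "normal";
    default:
      return "unknown";       /* implementation-specific class */
    }
}
```

If only one property is of interest, a direct test such as
@code{isnan (x)} remains the simpler choice.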
Another set of floating-point classification functions was provided by
BSD. The GNU C library also supports these functions; however, we
recommend that you use the ISO C99 macros in new code. Those are standard
and will be available more widely. Also, since they are macros, you do
not have to worry about the type of their argument.
@comment math.h
@comment BSD
@deftypefun int isinf (double @var{x})
@comment math.h
@comment BSD
@deftypefunx int isinff (float @var{x})
@comment math.h
@comment BSD
@deftypefunx int isinfl (long double @var{x})
This function returns @code{-1} if @var{x} represents negative infinity,
@code{1} if @var{x} represents positive infinity, and @code{0} otherwise.
@end deftypefun
@comment math.h
@comment BSD
@deftypefun int isnan (double @var{x})
@comment math.h
@comment BSD
@deftypefunx int isnanf (float @var{x})
@comment math.h
@comment BSD
@deftypefunx int isnanl (long double @var{x})
This function returns a nonzero value if @var{x} is a ``not a number''
value, and zero otherwise.
@strong{NB:} The @code{isnan} macro defined by @w{ISO C99} overrides
the BSD function. This is normally not a problem, because the two
routines behave identically. However, if you really need to get the BSD
function for some reason, you can write
@smallexample
(isnan) (x)
@end smallexample
@end deftypefun
@comment math.h
@comment BSD
@deftypefun int finite (double @var{x})
@comment math.h
@comment BSD
@deftypefunx int finitef (float @var{x})
@comment math.h
@comment BSD
@deftypefunx int finitel (long double @var{x})
This function returns a nonzero value if @var{x} is neither infinite nor
a ``not a number'' value, and zero otherwise.
@end deftypefun
@strong{Portability Note:} The functions listed in this section are BSD
extensions.
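
Code that wants the signed result of the BSD @code{isinf} function
without depending on it can combine two ISO C99 macros, because the
standard @code{isinf} macro is only required to return a nonzero value
for an infinite argument. The helper name @code{infinity_sign} is
hypothetical; this is a sketch, not the library's implementation.

```c
#include <math.h>

/* Hypothetical helper: return -1 for negative infinity, 1 for
   positive infinity, and 0 for every other value, including NaN.  */
int
infinity_sign (double x)
{
  if (!isinf (x))
    return 0;
  return signbit (x) ? -1 : 1;
}
```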
@node Floating Point Errors
@section Errors in Floating-Point Calculations
@menu
* FP Exceptions:: IEEE 754 math exceptions and how to detect them.
* Infinity and NaN:: Special values returned by calculations.
* Status bit operations:: Checking for exceptions after the fact.
* Math Error Reporting:: How the math functions report errors.
@end menu
@node FP Exceptions
@subsection FP Exceptions
@cindex exception
@cindex signal
@cindex zero divide
@cindex division by zero
@cindex inexact exception
@cindex invalid exception
@cindex overflow exception
@cindex underflow exception
The @w{IEEE 754} standard defines five @dfn{exceptions} that can occur
during a calculation. Each corresponds to a particular sort of error,
such as overflow.
When exceptions occur (when exceptions are @dfn{raised}, in the language
of the standard), one of two things can happen. By default the
exception is simply noted in the floating-point @dfn{status word}, and
the program continues as if nothing had happened. The operation
produces a default value, which depends on the exception (see the table
below). Your program can check the status word to find out which
exceptions happened.
Alternatively, you can enable @dfn{traps} for exceptions. In that case,
when an exception is raised, your program will receive the @code{SIGFPE}
signal. The default action for this signal is to terminate the
program. @xref{Signal Handling}, for how you can change the effect of
the signal.
@findex matherr
In the System V math library, the user-defined function @code{matherr}
is called when certain exceptions occur inside math library functions.
However, the Unix98 standard deprecates this interface. We support it
for historical compatibility, but recommend that you do not use it in
new programs.
@noindent
The exceptions defined in @w{IEEE 754} are:
@table @samp
@item Invalid Operation
This exception is raised if the given operands are invalid for the
operation to be performed. Examples are
(see @w{IEEE 754}, @w{section 7}):
@enumerate
@item
Addition or subtraction: @math{@infinity{} - @infinity{}}. (But
@math{@infinity{} + @infinity{} = @infinity{}}).
@item
Multiplication: @math{0 @mul{} @infinity{}}.
@item
Division: @math{0/0} or @math{@infinity{}/@infinity{}}.
@item
Remainder: @math{x} REM @math{y}, where @math{y} is zero or @math{x} is
infinite.
@item
Square root if the operand is less than zero. More generally, any
mathematical function evaluated outside its domain produces this
exception.
@item
Conversion of a floating-point number to an integer or decimal
string, when the number cannot be represented in the target format (due
to overflow, infinity, or NaN).
@item
Conversion of an unrecognizable input string.
@item
Comparison via predicates involving @math{<} or @math{>}, when one or
other of the operands is NaN. You can prevent this exception by using
the unordered comparison functions instead; see @ref{FP Comparison Functions}.
@end enumerate
If the exception does not trap, the result of the operation is NaN.
@item Division by Zero
This exception is raised when a finite nonzero number is divided
by zero. If no trap occurs the result is either @math{+@infinity{}} or
@math{-@infinity{}}, depending on the signs of the operands.
@item Overflow
This exception is raised whenever the result cannot be represented
as a finite value in the precision format of the destination. If no trap
occurs the result depends on the sign of the intermediate result and the
current rounding mode (@w{IEEE 754}, @w{section 7.3}):
@enumerate
@item
Round to nearest carries all overflows to @math{@infinity{}}
with the sign of the intermediate result.
@item
Round toward @math{0} carries all overflows to the largest representable
finite number with the sign of the intermediate result.
@item
Round toward @math{-@infinity{}} carries positive overflows to the
largest representable finite number and negative overflows to
@math{-@infinity{}}.
@item
Round toward @math{@infinity{}} carries negative overflows to the
most negative representable finite number and positive overflows
to @math{@infinity{}}.
@end enumerate
Whenever the overflow exception is raised, the inexact exception is also
raised.
@item Underflow
The underflow exception is raised when an intermediate result is too
small to be calculated accurately, or if the operation's result rounded
to the destination precision is too small to be normalized.
When no trap is installed for the underflow exception, underflow is
signaled (via the underflow flag) only when both tininess and loss of
accuracy have been detected. If no trap handler is installed the
operation continues with an imprecise small value, or zero if the
destination precision cannot hold the small exact result.
@item Inexact
This exception is signalled if a rounded result is not exact (such as
when calculating the square root of two) or a result overflows without
an overflow trap.
@end table
@node Infinity and NaN
@subsection Infinity and NaN
@cindex infinity
@cindex not a number
@cindex NaN
@w{IEEE 754} floating point numbers can represent positive or negative
infinity, and @dfn{NaN} (not a number). These three values arise from
calculations whose result is undefined or cannot be represented
accurately. You can also deliberately set a floating-point variable to
any of them, which is sometimes useful. Some examples of calculations
that produce infinity or NaN:
@ifnottex
@smallexample
@math{1/0 = @infinity{}}
@math{log (0) = -@infinity{}}
@math{sqrt (-1) = NaN}
@end smallexample
@end ifnottex
@tex
$${1\over0} = \infty$$
$$\log 0 = -\infty$$
$$\sqrt{-1} = \hbox{NaN}$$
@end tex
When a calculation produces any of these values, an exception also
occurs; see @ref{FP Exceptions}.
The basic operations and math functions all accept infinity and NaN and
produce sensible output. Infinities propagate through calculations as
one would expect: for example, @math{2 + @infinity{} = @infinity{}},
@math{4/@infinity{} = 0}, atan @math{(@infinity{}) = @pi{}/2}. NaN, on
the other hand, infects any calculation that involves it. Unless the
calculation would produce the same result no matter what real value
replaced NaN, the result is NaN.
In comparison operations, positive infinity is larger than all values
except itself and NaN, and negative infinity is smaller than all values
except itself and NaN. NaN is @dfn{unordered}: it is not equal to,
greater than, or less than anything, @emph{including itself}. @code{x ==
x} is false if the value of @code{x} is NaN. You can use this to test
whether a value is NaN or not, but the recommended way to test for NaN
is with the @code{isnan} function (@pxref{Floating Point Classes}). In
addition, @code{<}, @code{>}, @code{<=}, and @code{>=} will raise an
exception when applied to NaNs.
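
The unordered nature of NaN can be sketched as follows. The helper name
is hypothetical, and the code assumes the compiler has not been told to
ignore NaNs (for example with GCC's @option{-ffast-math}); in real
programs @code{isnan} is still the recommended test.

```c
#include <math.h>

/* Hypothetical helper: NaN is the only value that compares unequal
   to itself.  The volatile copy discourages the compiler from
   simplifying the comparison away.  */
int
is_nan_by_comparison (double x)
{
  volatile double v = x;
  return v != v;
}
```

Infinities, by contrast, compare as ordinary ordered values:
@code{INFINITY > 1e308} is true, and @code{INFINITY == INFINITY} is
true as well.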
@file{math.h} defines macros that allow you to explicitly set a variable
to infinity or NaN.
@comment math.h
@comment ISO
@deftypevr Macro float INFINITY
An expression representing positive infinity. It is equal to the value
produced by mathematical operations like @code{1.0 / 0.0}.
@code{-INFINITY} represents negative infinity.
You can test whether a floating-point value is infinite by comparing it
to this macro. However, this is not recommended; you should use the
@code{isfinite} macro instead. @xref{Floating Point Classes}.
This macro was introduced in the @w{ISO C99} standard.
@end deftypevr
@comment math.h
@comment GNU
@deftypevr Macro float NAN
An expression representing a value which is ``not a number''. This
macro is a GNU extension, available only on machines that support the
``not a number'' value---that is to say, on all machines that support
IEEE floating point.
You can use @samp{#ifdef NAN} to test whether the machine supports
NaN. (Of course, you must arrange for GNU extensions to be visible,
such as by defining @code{_GNU_SOURCE}, and then you must include
@file{math.h}.)
@end deftypevr
@w{IEEE 754} also allows for another unusual value: negative zero. This
value is produced when you divide a positive number by negative
infinity, or when a negative result is smaller than the limits of
representation. Negative zero compares equal to zero and behaves like
it in most calculations. The difference becomes visible when you test
the sign bit with @code{signbit} or @code{copysign}, or when you divide
a nonzero number by it, since the sign of the resulting infinity follows
the sign of the zero.
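
A sketch of how negative zero can be told apart from positive zero,
using only the @code{signbit} macro from @file{math.h}. The helper name
@code{is_negative_zero} is hypothetical.

```c
#include <math.h>

/* Hypothetical helper: nonzero if x is a zero whose sign bit is set.
   The comparison alone cannot tell, because -0.0 == 0.0 is true.  */
int
is_negative_zero (double x)
{
  return x == 0.0 && signbit (x);
}
```

On an @w{IEEE 754} system, @code{is_negative_zero (1.0 / -INFINITY)}
is nonzero, which matches the first way of producing negative zero
mentioned above.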
@node Status bit operations
@subsection Examining the FPU status word
@w{ISO C99} defines functions to query and manipulate the
floating-point status word. You can use these functions to check for
untrapped exceptions when it's convenient, rather than worrying about
them in the middle of a calculation.
These constants represent the various @w{IEEE 754} exceptions. Not all
FPUs report all the different exceptions. Each constant is defined if
and only if the FPU you are compiling for supports that exception, so
you can test for FPU support with @samp{#ifdef}. They are defined in
@file{fenv.h}.
@vtable @code
@comment fenv.h
@comment ISO
@item FE_INEXACT
The inexact exception.
@comment fenv.h
@comment ISO
@item FE_DIVBYZERO
The divide by zero exception.
@comment fenv.h
@comment ISO
@item FE_UNDERFLOW
The underflow exception.
@comment fenv.h
@comment ISO
@item FE_OVERFLOW
The overflow exception.
@comment fenv.h
@comment ISO
@item FE_INVALID
The invalid exception.
@end vtable
The macro @code{FE_ALL_EXCEPT} is the bitwise OR of all exception macros
which are supported by the FP implementation.
These functions allow you to clear exception flags, test for exceptions,
and save and restore the set of exceptions flagged.
@comment fenv.h
@comment ISO
@deftypefun int feclearexcept (int @var{excepts})
This function clears all of the supported exception flags indicated by
@var{excepts}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@comment fenv.h
@comment ISO
@deftypefun int feraiseexcept (int @var{excepts})
This function raises the supported exceptions indicated by
@var{excepts}. If more than one exception bit in @var{excepts} is set
the order in which the exceptions are raised is undefined except that
overflow (@code{FE_OVERFLOW}) or underflow (@code{FE_UNDERFLOW}) are
raised before inexact (@code{FE_INEXACT}). Whether the inexact
exception is also raised for overflow or underflow is likewise
implementation dependent.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@comment fenv.h
@comment ISO
@deftypefun int fetestexcept (int @var{excepts})
Test whether the exception flags indicated by the parameter @var{excepts}
are currently set. If any of them are, a nonzero value is returned
which specifies which exceptions are set. Otherwise the result is zero.
@end deftypefun
To understand these functions, imagine that the status word is an
integer variable named @var{status}. @code{feclearexcept} is then
equivalent to @samp{status &= ~excepts} and @code{fetestexcept} is
equivalent to @samp{(status & excepts)}. The actual implementation may
be very different, of course.
Exception flags are only cleared when the program explicitly requests it,
by calling @code{feclearexcept}. If you want to check for exceptions
from a set of calculations, you should clear all the flags first. Here
is a simple example of the way to use @code{fetestexcept}:
@smallexample
@{
double f;
int raised;
feclearexcept (FE_ALL_EXCEPT);
f = compute ();
raised = fetestexcept (FE_OVERFLOW | FE_INVALID);
if (raised & FE_OVERFLOW) @{ /* @dots{} */ @}
if (raised & FE_INVALID) @{ /* @dots{} */ @}
/* @dots{} */
@}
@end smallexample
You cannot explicitly set bits in the status word. You can, however,
save the entire status word and restore it later. This is done with the
following functions:
@comment fenv.h
@comment ISO
@deftypefun int fegetexceptflag (fexcept_t *@var{flagp}, int @var{excepts})
This function stores in the variable pointed to by @var{flagp} an
implementation-defined value representing the current setting of the
exception flags indicated by @var{excepts}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@comment fenv.h
@comment ISO
@deftypefun int fesetexceptflag (const fexcept_t *@var{flagp}, int @var{excepts})
This function restores the flags for the exceptions indicated by
@var{excepts} to the values stored in the variable pointed to by
@var{flagp}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
Note that the value stored in @code{fexcept_t} bears no resemblance to
the bit mask returned by @code{fetestexcept}. The type may not even be
an integer. Do not attempt to modify an @code{fexcept_t} variable.
@node Math Error Reporting
@subsection Error Reporting by Mathematical Functions
@cindex errors, mathematical
@cindex domain error
@cindex range error
Many of the math functions are defined only over a subset of the real or
complex numbers. Even if they are mathematically defined, their result
may be larger or smaller than the range representable by their return
type. These are known as @dfn{domain errors}, @dfn{overflows}, and
@dfn{underflows}, respectively. Math functions do several things when
one of these errors occurs. In this manual we will refer to the
complete response as @dfn{signalling} a domain error, overflow, or
underflow.
When a math function suffers a domain error, it raises the invalid
exception and returns NaN. It also sets @code{errno} to @code{EDOM};
this is for compatibility with old systems that do not support @w{IEEE
754} exception handling. Likewise, when overflow occurs, math
functions raise the overflow exception and return @math{@infinity{}} or
@math{-@infinity{}} as appropriate. They also set @code{errno} to
@code{ERANGE}. When underflow occurs, the underflow exception is
raised, and zero (appropriately signed) is returned. @code{errno} may be
set to @code{ERANGE}, but this is not guaranteed.
Some of the math functions are defined mathematically to result in a
complex value over parts of their domains. The most familiar example of
this is taking the square root of a negative number. The complex math
functions, such as @code{csqrt}, will return the appropriate complex value
in this case. The real-valued functions, such as @code{sqrt}, will
signal a domain error.
Some older hardware does not support infinities. On that hardware,
overflows instead return a particular very large number (usually the
largest representable number). @file{math.h} defines macros you can use
to test for overflow on both old and new hardware.
@comment math.h
@comment ISO
@deftypevr Macro double HUGE_VAL
@comment math.h
@comment ISO
@deftypevrx Macro float HUGE_VALF
@comment math.h
@comment ISO
@deftypevrx Macro {long double} HUGE_VALL
An expression representing a particular very large number. On machines
that use @w{IEEE 754} floating point format, @code{HUGE_VAL} is infinity.
On other machines, it's typically the largest positive number that can
be represented.
Mathematical functions return the appropriately typed version of
@code{HUGE_VAL} or @code{@minus{}HUGE_VAL} when the result is too large
to be represented.
@end deftypevr
@node Rounding
@section Rounding Modes
Floating-point calculations are carried out internally with extra
precision, and then rounded to fit into the destination type. This
ensures that results are as precise as the input data. @w{IEEE 754}
defines four possible rounding modes:
@table @asis
@item Round to nearest.
This is the default mode. It should be used unless there is a specific
need for one of the others. In this mode results are rounded to the
nearest representable value. If the result is midway between two
representable values, the even one is chosen. @dfn{Even} here
means the lowest-order bit is zero. This rounding mode avoids
statistical bias and helps numeric stability: the relative round-off
error of each individual @code{float} operation is at most half of
@code{FLT_EPSILON}, although errors can still accumulate over a lengthy
calculation.
@c @item Round toward @math{+@infinity{}}
@item Round toward plus Infinity.
All results are rounded to the smallest representable value
which is greater than or equal to the result.
@c @item Round toward @math{-@infinity{}}
@item Round toward minus Infinity.
All results are rounded to the largest representable value which is less
than or equal to the result.
@item Round toward zero.
All results are rounded to the largest representable value whose
magnitude is no greater than that of the result. In other words, if the
result is negative it is rounded up; if it is positive, it is rounded
down.
@end table
@noindent
@file{fenv.h} defines constants which you can use to refer to the
various rounding modes. Each one will be defined if and only if the FPU
supports the corresponding rounding mode.
@table @code
@comment fenv.h
@comment ISO
@vindex FE_TONEAREST
@item FE_TONEAREST
Round to nearest.
@comment fenv.h
@comment ISO
@vindex FE_UPWARD
@item FE_UPWARD
Round toward @math{+@infinity{}}.
@comment fenv.h
@comment ISO
@vindex FE_DOWNWARD
@item FE_DOWNWARD
Round toward @math{-@infinity{}}.
@comment fenv.h
@comment ISO
@vindex FE_TOWARDZERO
@item FE_TOWARDZERO
Round toward zero.
@end table
Underflow is an unusual case. Normally, @w{IEEE 754} floating point
numbers are always normalized (@pxref{Floating Point Concepts}).
Numbers smaller in magnitude than @math{2^r} (where @math{r} is the
minimum exponent, @code{FLT_MIN_EXP-1} for @code{float}) cannot be
represented as normalized numbers. Rounding all such numbers to zero or
@math{2^r} would cause some algorithms to fail at 0. Therefore, they
are left in denormalized form. That produces loss of precision, since
some of the leading bits of the mantissa are zero and carry no
information.
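A short sketch of this behavior for @code{float}, assuming @w{IEEE 754}
arithmetic with denormalized numbers enabled (no flush-to-zero mode).
The helper names are hypothetical; @code{FLT_MIN} and
@code{FLT_MIN_EXP} come from @file{float.h}.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: returns 1 if halving the smallest normalized
   float yields a nonzero denormalized value rather than zero.  */
int half_of_flt_min_is_denormal(void)
{
    volatile float smallest_normal = FLT_MIN;
    volatile float half = smallest_normal / 2.0f;

    return half > 0.0f && fpclassify(half) == FP_SUBNORMAL;
}

/* Hypothetical helper: returns 1 if the smallest normalized float
   equals 2 raised to the power FLT_MIN_EXP - 1.  */
int flt_min_matches_min_exp(void)
{
    return ldexpf(1.0f, FLT_MIN_EXP - 1) == FLT_MIN;
}
```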
If a result is too small to be represented as a denormalized number, it
is rounded to zero. However, the sign of the result is preserved; if
the calculation was negative, the result is @dfn{negative zero}.
Negative zero can also result from some operations on infinity, such as
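@math{4/-@infinity{}}. Negative zero compares equal to zero and behaves
like it in most operations. The difference shows up when the
@code{copysign} or @code{signbit} functions are used to check the sign
bit directly, or when a nonzero number is divided by it, which yields an
infinity whose sign depends on the sign of the zero.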
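The sketch below produces a negative zero with the division mentioned
above and inspects it. It assumes @w{IEEE 754} arithmetic and the
@code{INFINITY} macro from @file{math.h}; the helper name
@code{negative_zero_is_distinguishable} is hypothetical.

```c
#include <math.h>

/* Hypothetical helper: returns 1 if 4 / -infinity compares equal to
   zero and yet carries a sign bit that signbit and copysign can see.  */
int negative_zero_is_distinguishable(void)
{
    volatile double inf = INFINITY;
    volatile double nz = 4.0 / -inf;            /* negative zero */

    int compares_equal = (nz == 0.0);           /* -0.0 == 0.0 holds */
    int sign_set = (signbit(nz) != 0);          /* sign bit is set */
    int copied = (copysign(1.0, nz) == -1.0);   /* sign is transferred */

    return compares_equal && sign_set && copied;
}
```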
At any time one of the above four rounding modes is selected. You can
find out which one with this function:
@comment fenv.h
@comment ISO
@deftypefun int fegetround (void)
Returns the currently selected rounding mode, represented by one of the
values of the defined rounding mode macros.
@end deftypefun
@noindent
To change the rounding mode, use this function:
@comment fenv.h
@comment ISO
@deftypefun int fesetround (int @var{round})
Changes the currently selected rounding mode to @var{round}. If
@var{round} does not correspond to one of the supported rounding modes,
nothing is changed. @code{fesetround} returns zero if it changed the
rounding mode, and a nonzero value if the mode is not supported.
@end deftypefun
You should avoid changing the rounding mode if possible. It can be an
expensive operation; also, some hardware requires you to compile your
program differently for it to work. The resulting code may run slower.
See your compiler documentation for details.
@c This section used to claim that functions existed to round one number
@c in a specific fashion. I can't find any functions in the library
@c that do that. -zw
@node Control Functions
@section Floating-Point Control Functions
@w{IEEE 754} floating-point implementations allow the programmer to
decide whether traps will occur for each of the exceptions, by setting
bits in the @dfn{control word}. In C, traps result in the program
receiving the @code{SIGFPE} signal; see @ref{Signal Handling}.
@strong{NB:} @w{IEEE 754} says that trap handlers are given details of
the exceptional situation, and can set the result value. C signals do
not provide any mechanism to pass this information back and forth.
Trapping exceptions in C is therefore not very useful.
It is sometimes necessary to save the state of the floating-point unit
while you perform some calculation. The library provides functions
which save and restore the exception flags, the set of exceptions that
generate traps, and the rounding mode. This information is known as the
@dfn{floating-point environment}.
The functions to save and restore the floating-point environment all use
a variable of type @code{fenv_t} to store information. This type is
defined in @file{fenv.h}. Its size and contents are
implementation-defined. You should not attempt to manipulate a variable
of this type directly.
To save the state of the FPU, use one of these functions:
@comment fenv.h
@comment ISO
@deftypefun int fegetenv (fenv_t *@var{envp})
Store the floating-point environment in the variable pointed to by
@var{envp}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@comment fenv.h