#LyX 1.5.5 created this file. For more info see http://www.lyx.org/
\lyxformat 276
\begin_document
\begin_header
\textclass article
\language english
\inputencoding auto
\font_roman palatino
\font_sans default
\font_typewriter default
\font_default_family default
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100
\graphics default
\paperfontsize 12
\spacing single
\papersize letterpaper
\use_geometry false
\use_amsmath 1
\use_esint 0
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\author ""
\author ""
\end_header
\begin_body
\begin_layout Title
Embedded MIPS Development with the Nintendo 64
\end_layout
\begin_layout Author
Ryan Underwood
\end_layout
\begin_layout Abstract
Using Nintendo's 64-bit console, we explore the intricacies and design decisions
involved in developing software for a console or embedded platform.
\series bold
This is a work in progress!
\end_layout
\begin_layout Section
Introduction
\end_layout
\begin_layout Paragraph
Embedded development, while drastically simplified compared to systems programmi
ng for a general purpose platform, presents many challenges to attain a
superior product at minimal cost.
The first set of challenges we can identify is purely physical.
Obviously, the speed of the CPU and the size of memory are the two main
decisions that impact the cost per unit.
Also necessary to consider are the desired user interface and the appropriate
I/O hardware to accommodate each aspect of that interface.
Size must be considered, and not only in terms of material cost; who wants
to lug around a bulky digital music player? Power consumption is critical;
it threatens effective platform design in three ways.
The first issue is simply the rate at which the system consumes power.
This is a problem especially for battery-powered devices, but extends as
a problem of scalability too; greater power consumption simply means greater
operating cost to the end user and thus a lesser value assigned to the
product.
Power consumption is determined by the components selected for the platform,
as well as how software chooses to utilize those components.
The second issue is the heat that is dissipated by the switching of silicon
gates.
Excessive heat generation produces undesirable traits for the user, and
can also lead to premature component failures.
Usually, power consumption and heat dissipation of a system are strongly
linked, and frequently result from a hardware design strategy that is excessive
in nature.
The last issue is that of reliability.
Designs which employ a high rate of power consumption must have safety
features such as filters scattered throughout the system to prevent erroneous
behavior under load, and require higher or tighter component tolerances
in areas such as the power supply.
All this leads to extra design time, more potential sources of design error,
and extra unit cost.
\end_layout
\begin_layout Paragraph
Among other factors, the number of components and their complexity affect
size and power consumption the most.
Therefore, the hardware designer is faced with a difficult tradeoff.
Essentially, he must select which features of a general purpose computing
platform must be omitted in his embedded platform in order to reduce costs,
while allowing the features that will provide the user with the most value
to remain.
What makes this decision even more difficult is that knowing which features
will provide the most value requires knowledge of how software designers
will make use of the platform.
Frequently, the wrong features are cut and less important features remain,
incurring unnecessary cost on the manufacturer and thwarting attempts at
elegant software design.
\end_layout
\begin_layout Paragraph
The focus of this text is on software development where the hardware platform
is a given.
Targeting a fixed and known hardware platform fits well with development
for set-top boxes or game consoles, but also with the rising popularity
of single-board computers and System-On-a-Chip (SOC) solutions.
In this paradigm, the hardware is mass produced so that it may be sold
at a very low cost, and individual providers develop the software to run
on the mass produced platform.
The hardware/software combination together completes the product that is
to be shipped to end users.
\end_layout
\begin_layout Paragraph
Our focus is on development for a game console, the Nintendo 64 (hereinafter
referred to as Ultra 64 or N64), which had a multi-year mass market lifespan.
Consequently, the install base is very large and consists primarily of
users who purchased the console to be able to use game software designed
for it.
The knowledge we gain in developing a software platform for this console
can be extended to embedded development in general; the only difference
is that in an embedded product, the platform and its software are designed
to be inseparable.
If the N64 were an embedded platform, it would have shipped with the program
cartridge moulded to the console.
\end_layout
\begin_layout Section
Taking Stock of the Hardware
\end_layout
\begin_layout Paragraph
As software designers, we will be working intimately with our chosen hardware
platform.
Therefore, it is essential to know with as much precision as possible what
the details of our hardware architecture are.
Sometimes hardware manufacturers or licensed third parties offer complete
SDKs (Software Development Kits) for their platforms that ease the bootstrapping
of a project; usually an experienced C programmer will be able to make
use of an SDK to drastically cut down time-to-market.
However, the SDK usually contains proprietary information obtained or developed
at cost to the platform designer, which manifests itself to the software
developer as a per-unit licensing fee or the unavailability of source code.
(See Section 3 for more information on SDKs).
For this discussion, we will focus on developing software solutions without
a third-party SDK, and on developing our own SDK to use in-house or to
license to third parties.
\end_layout
\begin_layout Paragraph
The information we need to know about our platform boils down to three categorie
s:
\end_layout
\begin_layout Enumerate
How to program the processor(s)
\end_layout
\begin_layout Enumerate
How to program peripherals
\end_layout
\begin_layout Enumerate
How to execute our code on the target
\end_layout
\begin_layout Paragraph
This information can be gleaned (rarely) from marketing materials or, more
usually, from a designer's handbook.
In a limited fashion, it can also be derived from observation or from reverse
engineering.
\end_layout
\begin_layout Paragraph
The N64's hardware features are as follows:
\end_layout
\begin_layout Itemize
MIPS R4300i RISC 64-bit embedded processor, 93.75 MHz
\end_layout
\begin_layout Itemize
Reality CoProcessor (RCP), 62.5 MHz
\end_layout
\begin_layout Itemize
4MB Rambus RDRAM (8MB with memory expansion)
\end_layout
\begin_layout Itemize
4 peripheral ports
\end_layout
\begin_layout Itemize
Cartridge/system bus interface
\end_layout
\begin_layout Standard
<FIXME block diagram>
\end_layout
\begin_layout Paragraph
MIPS provides a public specification for the MIPS III instruction set as
well as the R4300i processor specifically.
Therefore, programming this processor should be no problem, assuming no
customization has been made to it.
\end_layout
\begin_layout Paragraph
The rest of the system is documented only privately, in the Nintendo SDK
HTML manuals and 'man' pages.
These are available to SDK licensees only.
Furthermore, these documents typically only cover the Nintendo SDK operating
system interfaces, and do not go into much detail about the underlying
software<->hardware interface.
Information about the rest of the system has been derived by members of
the
\begin_inset LatexCommand url
name "Dextrose"
target "http://www.dextrose.com"
\end_inset
group and message board, by N64 emulator authors, by many unnamed and defunct
groups around the globe, and by commercial interests who develop unofficial
N64 development platforms to be sold at a much lower cost than the official
ones.
Through the efforts of these disparate (and only occasionally cooperating)
groups, unofficial programming information for nearly all of the N64's
peripherals, its coprocessor, and its memory/register map has been produced.
Nintendo is unable to claim any intellectual property rights on this independen
tly derived information, so we use it freely in this document.
We use symbolic constant names compatible with the official SDK, so that
individuals already familiar with the SDK can more easily follow along.
\end_layout
\begin_layout Standard
FIXME: note Sega vs Accolade, Nintendo v Tengen court cases regarding hardware
lock-outs.
\end_layout
\begin_layout Subsection
Programming the N64 CPU
\end_layout
\begin_layout Paragraph
The N64 CPU is an NEC VR4300, which is a clone of the MIPS R4300i, a low
power version of the R4300, which is itself a low cost version of the
R4200.
The main differences between the VR4300 and the R4200 are that the VR4300
does not support cache parity, implements only a 32-bit data bus (SysAD),
and supports only a 32-bit (4GB) address space instead of the 36-bit (64GB)
address space of the R4200.
The R4300, like all R4000 processors, implements the MIPS III instruction
set.
The CPU runs at 93.75 MHz PClock (MasterClock*1.5).
It can execute one instruction per clock cycle and has a 5-stage static
pipeline.
It has a 16KB instruction cache and an 8KB write-back data cache, both
non-parity.
At the nominal 93.75MHz clock speed, the VR4300 attains 125 MIPS and a score
of 60/45 on SPECint/fp92, while dissipating 1.8 watts on average, a very
modest amount.
It has fixed-width instructions and a clearly specified interface for up
to four coprocessors, two of which are implemented on-board (CP0, the MMU,
and CP1, the FPU).
The N64 configures the CPU to run in big-endian mode, but it is capable
of running in either endian mode, and can be switched at runtime between
32-bit and 64-bit addressing.
It can be switched to reduced power mode to slow the processor to 1/4 of
its nominal clock speed at any time.
The memory management unit (MMU) is powered off when not in use, saving
power.
\end_layout
\begin_layout Subsubsection
Addressing and Cacheability
\end_layout
\begin_layout Paragraph
The CPU has three modes of execution (kernel, supervisor, and user mode),
each with a different memory map.
Since a normal program is in complete control of the machine, we will exclusive
ly be operating in kernel mode.
(If we were writing a general purpose operating system, we would implement
the operating system in kernel mode, and processes would be executed in
user mode.
kseg0 and kseg1 would be inaccessible to user code.) We will also use 32-bit
addressing instead of 64-bit; since the N64 is physically limited to 8
megabytes of system memory, 64-bit pointers would only be a waste of space.
\end_layout
\begin_layout Paragraph
An important effect of the mode of the processor is in determining the system's
memory map.
According to the R4300 datasheet, there are five regions of memory when
in Kernel mode:
\end_layout
\begin_layout Itemize
0x00000000-0x7FFFFFFF: kuseg (TLB-mapped, 2GB)
\end_layout
\begin_layout Itemize
0x80000000-0x9FFFFFFF: kseg0 (direct-mapped 512MB, cached)
\end_layout
\begin_layout Itemize
0xA0000000-0xBFFFFFFF: kseg1 (direct-mapped 512MB, uncached)
\end_layout
\begin_layout Itemize
0xC0000000-0xDFFFFFFF: ksseg (TLB-mapped)
\end_layout
\begin_layout Itemize
0xE0000000-0xFFFFFFFF: kseg3 (TLB-mapped)
\end_layout
\begin_layout Paragraph
We are concerned mainly with the first three regions.
The power-on configuration causes the physical memory of the N64 to be
mirrored in kseg0 and kseg1.
kuseg, ksseg, and kseg3 are unmapped and cannot be accessed until mapped.
Since we do not need more than a 512MB virtual address space on the N64,
we will use kseg0 addressing when the CPU needs to access main memory.
kseg1 is useful because in some cases, bypassing the CPU's cache is desirable.
The RCP, on the other hand, does not cache or mirror memory, so physical
addresses are required when programming the RCP to access memory.
Usually we will use the following guidelines for selecting memory addressing
modes:
\end_layout
\begin_layout Itemize
Use kseg0 (cached) virtual addresses in general.
\end_layout
\begin_layout Itemize
Use kseg1 (uncached) virtual addresses when the CPU must access memory-mapped
hardware registers.
\end_layout
\begin_layout Itemize
Use physical addresses when informing the RCP of a memory location (i.e.
for a DMA transaction).
\end_layout
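\begin_layout Standard
The guidelines above reduce to changing the top three address bits, which
can be sketched in C as follows.
This is a minimal sketch: the macro and function names are our own illustrations,
not names taken from the official SDK, and it applies only to the direct-mapped
kseg0 and kseg1 segments.
\end_layout

```c
#include <stdint.h>

/* Segment bases from the kernel-mode memory map above. */
#define KSEG0_BASE 0x80000000u  /* direct-mapped, cached            */
#define KSEG1_BASE 0xA0000000u  /* direct-mapped, uncached          */
#define PHYS_MASK  0x1FFFFFFFu  /* low 29 bits: physical address    */

/* Physical address behind a kseg0 or kseg1 virtual address
 * (this is the form the RCP expects, e.g. for a DMA transaction). */
static inline uint32_t virt_to_phys(uint32_t vaddr)
{
    return vaddr & PHYS_MASK;
}

/* Cached CPU view of a physical address. */
static inline uint32_t phys_to_kseg0(uint32_t paddr)
{
    return (paddr & PHYS_MASK) | KSEG0_BASE;
}

/* Uncached CPU view, e.g. for memory-mapped hardware registers. */
static inline uint32_t phys_to_kseg1(uint32_t paddr)
{
    return (paddr & PHYS_MASK) | KSEG1_BASE;
}
```

\begin_layout Standard
A CPU-side buffer pointer in kseg0 would thus be passed through virt_to_phys()
before being handed to the RCP.
\end_layout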
\begin_layout Paragraph
When peripherals are programmed to perform DMA reads, they will always access
the uncached version of the data directly from memory.
Therefore, if a cached region (kseg0 or kuseg) is written to, the CPU cache
line corresponding to that region must be flushed before using that memory
for operations in the RCP or other peripherals.
Conversely, when a peripheral performs a DMA write to memory, the CPU cache
line corresponding to that region must be invalidated to ensure two things:
that reads from cached memory reflect the updated data, and that a subsequent
CPU cache flush doesn't overwrite the new data from the peripheral.
\end_layout
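\begin_layout Standard
Both the write-back before a DMA read and the invalidate after a DMA write
have to visit every data cache line that overlaps the buffer.
The following sketch shows only that loop, assuming the 16-byte data cache
line size of the R4300.
The callback type is a hypothetical hook of our own; on the target it would
issue the appropriate cache instruction for one line, which is not shown
here.
\end_layout

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE 16u  /* data cache line size in bytes */

/* Hypothetical hook: on the target this would write back or
 * invalidate the data cache line starting at `line`. */
typedef void (*cache_line_op)(uint32_t line);

/* Apply `op` to every data cache line overlapping [addr, addr + len).
 * Returns the number of lines visited; `op` may be NULL to only count. */
static unsigned dcache_for_range(uint32_t addr, size_t len, cache_line_op op)
{
    unsigned n = 0;

    if (len == 0)
        return 0;

    /* Round the first and last byte down to their line boundaries. */
    uint32_t line = addr & ~(DCACHE_LINE - 1u);
    uint32_t last = (uint32_t)(addr + len - 1u) & ~(DCACHE_LINE - 1u);

    for (;;) {
        if (op)
            op(line);
        n++;
        if (line == last)
            break;
        line += DCACHE_LINE;
    }
    return n;
}
```

\begin_layout Standard
Note that a buffer which is not aligned to a line boundary shares its first
or last line with neighbouring data, so invalidating it can discard unrelated
writes; aligning DMA buffers to the line size avoids this.
\end_layout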
\begin_layout Paragraph
It is possible to disable the kseg0 caching to aid in debugging.
In order to do this, set the low three bits of the CP0 config to 2 instead
of the default 3.
\end_layout
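\begin_layout Standard
The bit manipulation involved can be sketched as below.
The names are illustrative only; on the target, the value would be read
from and written back to the CP0 Config register (R16) with the mfc0 and
mtc0 instructions, which this host-side sketch does not attempt.
\end_layout

```c
#include <stdint.h>

#define CONFIG_K0_MASK     0x7u  /* Config bits 0-2: kseg0 cache mode */
#define CONFIG_K0_UNCACHED 0x2u  /* kseg0 accesses bypass the cache   */
#define CONFIG_K0_CACHED   0x3u  /* default                           */

/* Return `config` with its K0 field replaced by `mode`;
 * all other bits are left unchanged. */
static inline uint32_t config_set_k0(uint32_t config, uint32_t mode)
{
    return (config & ~CONFIG_K0_MASK) | (mode & CONFIG_K0_MASK);
}
```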
\begin_layout Paragraph
Remember that if the low 29 bits of two addresses are identical, their physical
location is identical.
Changing the upper three bits only changes the CPU's access semantics (cached/uncached
and TLB-mapped or direct-mapped).
For example, the PIF-ROM can be found at two seemingly different locations
-- 0x9FC00000 and 0xBFC00000.
These two locations are actually one and the same, because the low-order
29 bits in the two addresses are identical.
They only differ in the access semantics.
Note that even though 0x1FC00000, read as a virtual address, falls in the
kuseg address space, it cannot be accessed until it has been mapped by the user.
\end_layout
\begin_layout Subsubsection
Delay Slots
\end_layout
\begin_layout Paragraph
For the highest performing code, we must observe the necessity of delay
slots on the MIPS architecture.
Delay slots are required by the design in order to prevent pipeline stalls.
A delay slot is specified when:
\end_layout
\begin_layout Itemize
A branch or jump is executed.
\newline
The instruction in the delay slot is executed,
unless the branch instruction is of
\begin_inset Quotes eld
\end_inset
Likely
\begin_inset Quotes erd
\end_inset
type, in which case the instruction in the delay slot is not executed if
the branch is not taken.
\end_layout
\begin_layout Itemize
A register load is performed.
\newline
If the instruction in the delay slot is dependent
on the register being loaded, an interlock condition occurs and the processor
must stall until the register load has completed.
\end_layout
\begin_layout Itemize
A register load is performed from a coprocessor.
\newline
Since there is no interlock
on coprocessor loads, the destination register is not filled until after
the next instruction has executed! (This is similar to MIPS I load behavior,
but limited to from-coprocessor loads only.)
\end_layout
\begin_layout Paragraph*
Not observing the branch or load delay slots will cause seemingly mysterious
performance declines or unwanted program behavior.
\end_layout
\begin_layout Subsubsection
Exception Handling
\end_layout
\begin_layout Paragraph
The CPU reset/NMI vector is mapped to 0xBFC00000, which is physical address
0x1FC00000.
As we will later see, the reset condition is always handled by an embedded
ROM (PIF-ROM) in the N64 which manages the lock-out chip.
The vector can be changed later for purposes of NMI or soft-reset (generated
by the Reset button on the console), but we can't do anything about a hard
reset; the PIF-ROM will always be executed.
(FIXME right?)
\end_layout
\begin_layout Paragraph
The general exception vector is at 0xBFC00380 when BEV is set (see section
\begin_inset LatexCommand ref
reference "sub:Programming-the-RCP"
\end_inset
), and can be used for all other purposes.
When an exception condition is encountered, the CPU jumps to 0xBFC00380,
with the return address saved in EPC and the reason recorded in Cause; the
hardware does not push any machine state onto the stack.
\end_layout
\begin_layout Subsubsection
Code Generation
\end_layout
\begin_layout Paragraph
Assembling MIPS instructions to binary can be performed using the GNU binutils
package, configured for the mips or mips64 targets.
32-bit MIPS code can be executed on the N64, but 64-bit code gives higher
performance with calculations involving large values.
We will later use the GNU C compiler to generate MIPS code, but it is much
easier for now to simply build binutils (for the assembler) than to build
a complete C/C++ toolchain.
In addition, running a C program requires start-up code, which we have
not developed yet, to initialize the hardware and stack pointer and to
invoke main().
\end_layout
\begin_layout Subsubsection
MMU (CP0)
\end_layout
\begin_layout Paragraph
Here we list the CP0 registers that interest us.
There are more, but the others deal with memory management (TLB), which
we are not particularly concerned with unless we are writing an OS.
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R9 (Count) Timer count register
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R11 (Compare) Timer compare value register
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R12 (SR) Status Register
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R13 (Cause) Exception Cause
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R14 (EPC) Exception Saved PC
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R16 (Config) Config Register
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R18 (WatchLo) address trap lower bits
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R19 (WatchHi) address trap upper bits
\end_layout
\begin_layout List
\labelwidthstring 00.00.0000
R30 (ErrorEPC) Reset/NMI saved PC
\end_layout
\begin_layout Standard
Status register (SR) -
\end_layout
\begin_layout Standard
RP - Reduced Power - Bit 27 On
\end_layout
\begin_layout Standard
FR - Floating-point Register - Bit 26 On
\end_layout
\begin_layout Standard
RE - Reverse Endian - Bit 25 On
\end_layout
\begin_layout Standard
BEV - Bootstrap Exception Vector - Bit 22 On
\end_layout
\begin_layout Standard
SR - Is set if Soft Reset/NMI occurred - Bit 20
\end_layout
\begin_layout Standard
KSU - Set to 00 for kernel mode - Bit 3-4
\end_layout
\begin_layout Standard
ERL - Is set if a CPU error occurred - Bit 2
\end_layout
\begin_layout Standard
EXL - Is set if an exception occurred - Bit 1
\newline
Read the cause register to
figure out what happened
\end_layout
\begin_layout Standard
IE - Interrupt Enable - Bit 0 On
\end_layout
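\begin_layout Standard
The bit positions listed above are fields of the Status register (R12) and
can be written down as masks.
The names below are our own shorthand, and the two helper functions encode
standard MIPS III behavior rather than anything N64-specific: EXL or ERL
forces kernel mode and also blocks interrupts regardless of IE.
\end_layout

```c
#include <stdint.h>

#define SR_RP       (1u << 27)  /* reduced power                   */
#define SR_FR       (1u << 26)  /* floating-point register mode    */
#define SR_RE       (1u << 25)  /* reverse endian                  */
#define SR_BEV      (1u << 22)  /* bootstrap exception vectors     */
#define SR_SR       (1u << 20)  /* soft reset / NMI occurred       */
#define SR_KSU_MASK (3u << 3)   /* 00 selects kernel mode          */
#define SR_ERL      (1u << 2)   /* error level                     */
#define SR_EXL      (1u << 1)   /* exception level                 */
#define SR_IE       (1u << 0)   /* interrupt enable                */

/* Nonzero if the CPU is running in kernel mode for this SR value. */
static inline int sr_in_kernel_mode(uint32_t sr)
{
    return (sr & SR_KSU_MASK) == 0 || (sr & (SR_EXL | SR_ERL)) != 0;
}

/* Nonzero if interrupts are globally enabled (per-source masks in
 * the interrupt mask field are not considered here). */
static inline int sr_interrupts_enabled(uint32_t sr)
{
    return (sr & SR_IE) != 0 && (sr & (SR_EXL | SR_ERL)) == 0;
}
```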
\begin_layout Standard
ErrorEPC register - Reset/NMI saved PC
\end_layout
\begin_layout Standard
EPC register - Exception saved PC
\end_layout
\begin_layout Standard
WatchHi/WatchLo registers to trap memory accesses
\end_layout
\begin_layout Standard
Config register -
\end_layout
\begin_layout Standard
EC - Bit 28-30 System Clock Ratio (1:1/110 1.5:1/111 2:1/000 3:1/001) readonly
\end_layout
\begin_layout Standard
BE - Bit 15 - Big Endian - 0 => LE, 1=> BE
\end_layout
\begin_layout Standard
K0 - bit 0-2 - 010 => noncacheable 011=>cacheable
\end_layout
\begin_layout Standard
Count register / Compare register - when Count equals Compare, IP7 is set,
which causes an interrupt if IE is set; write to Compare to clear it.
\end_layout
\begin_layout Standard
Count increments at 1/2 PClock and rolls over.
\end_layout
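\begin_layout Standard
Since Count advances at half of the 93.75 MHz PClock, timer arithmetic can
be sketched as follows.
The function names are illustrative, and reading Count or writing Compare
on the target (via mfc0 and mtc0) is left out; only the tick arithmetic
is shown, relying on unsigned 32-bit wraparound to match the rollover of
Count.
\end_layout

```c
#include <stdint.h>

#define PCLOCK_HZ 93750000u         /* 93.75 MHz                  */
#define COUNT_HZ  (PCLOCK_HZ / 2u)  /* Count ticks at 1/2 PClock  */

/* Convert microseconds to Count ticks, rounding down. */
static inline uint32_t usec_to_count(uint32_t usec)
{
    return (uint32_t)(((uint64_t)usec * COUNT_HZ) / 1000000u);
}

/* Value to write to Compare so that the timer interrupt fires
 * `ticks` after the sampled Count value; wraps like Count does. */
static inline uint32_t compare_after(uint32_t count_now, uint32_t ticks)
{
    return count_now + ticks;
}

/* Ticks elapsed between two Count samples; correct across a single
 * rollover because the subtraction is modulo 2^32. */
static inline uint32_t count_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

\begin_layout Standard
At this rate Count rolls over roughly every 91.6 seconds, so longer intervals
need a software counter maintained by the timer interrupt handler.
\end_layout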
\begin_layout Subsubsection
FPU (CP1)
\end_layout
\begin_layout Subsection
\begin_inset LatexCommand label
name "sub:Programming-the-RCP"
\end_inset
Programming the RCP Coprocessor Unit
\end_layout
\begin_layout Paragraph
The RCP is interfaced to the CPU as a standard MIPS coprocessor unit (CP2)
and connected to the CPU via the 32-bit SysAD interface that the R4300
provides.
The RCP is clocked at 62.5 MHz (==MasterClock) and is capable of 500MFlops
at that speed.
It is based on a .35 micron manufacturing process.
The RCP is connected to CPU interrupt 0 (INT0).
\end_layout
\begin_layout Paragraph
The various functional subunits of the RCP can be accessed through memory-mapped
registers.
The available subunits are:
\end_layout
\begin_layout Itemize
RAC (Rambus ASIC Cell)
\newline
A memory controller IP for the Rambus RDRAM system
memory.
This could be off-the-shelf depending on the design.
\end_layout
\begin_layout Itemize
RSP (Reality Signal Processor)
\newline
A custom R4000-like and DSP-like processor
with 64-bit instruction width, 32-bit scalar register width, 16-bit integer
SIMD capability, and a Harvard microarchitecture with separate 4K instruction
memory (IMEM) of 512 64-bit words and 4K data memory (DMEM) of 256 128-bit
words.
In typical usage, it reads the next operation from a task list, uses DMA
to fetch the microcode and data for the next task, and executes the microcode.
Typical tasks are to preprocess graphics commands for dispatch to the RDP,
or to process sound data that is then written to the Audio Interface.
The RSP has its own set of internal coprocessors: its own CP0 is the RDP,
and CP1 is the 16-bit integer SIMD Vector Unit (VU).
While they are accessed through the MIPS coprocessor interface, the RDP
and VU are not intended to be at all compatible with the CP0 (MMU) and
CP1 (FPU) that would be found on a MIPS CPU.
\end_layout
\begin_layout Itemize
RDP (Reality Display Processor)
\newline
The rasterizing engine of the N64.
The RDP receives commands from the RSP or a memory buffer, filters and
manipulates the image data, and writes the final image data to an off-screen
framebuffer in main memory.
\end_layout
\begin_layout Itemize
VI (Video Interface)
\newline
The N64's framebuffer interface.
It displays the contents of the on-screen framebuffer onto the external
video display.
\end_layout
\begin_layout Itemize
AI (Audio Interface)
\newline
Plays digital audio samples via DMA.
\end_layout
\begin_layout Itemize
PI (Peripheral Interface)
\newline
Connects cartridges and external hardware units
to the N64.
\end_layout
\begin_layout Itemize
MI (MIPS Interface)
\newline
A control interface for CPU-related functionality, such
as masking interrupts or determining the source of an interrupt.
\end_layout
\begin_layout Itemize
RI (RDRAM Interface)
\newline
The RCP's interface to the RAC and RDRAM system memory.
\end_layout
\begin_layout Itemize
SI (Serial Interface)
\newline
Responsible for accessing devices connected to the
controller ports.
\end_layout
\begin_layout Subsection
Programming the RSP (Reality Signal Processor)
\end_layout
\begin_layout Subsection
Programming the RDP (Reality Display Processor)
\end_layout
\begin_layout Standard
4KB Texture memory (one 32x32 RGBA texture)
\end_layout
\begin_layout Standard
Features:
\end_layout
\begin_layout Standard
Alpha Transparency (8-bit)
\end_layout
\begin_layout Standard
Anti-Aliasing
\end_layout
\begin_layout Standard
Bilinear/Trilinear Filtering/Interpolation (and Point Sampling)
\end_layout
\begin_layout Standard
Culling/Level of Detail Management
\end_layout
\begin_layout Standard
Dithering
\end_layout
\begin_layout Standard
Environment Mapping
\end_layout
\begin_layout Standard
Fog
\end_layout
\begin_layout Standard
Mipmapping
\end_layout
\begin_layout Standard
Perspective Correct Texture Mapping
\end_layout
\begin_layout Standard
Shading (Flat/Gouraud)
\end_layout
\begin_layout Standard
Specular Reflection/Shiny Surfaces (Metal Mario)
\end_layout
\begin_layout Standard
Trilinear Mipmap Interpolation
\end_layout
\begin_layout Standard
Z-Buffering
\end_layout
\begin_layout Subsection
Interfacing System RAM
\end_layout
\begin_layout Paragraph
The Rambus RDRAM memory used in the N64 has a narrow 9-bit interface to
the Rambus ASIC Cell (RAC) in the RCP, and is capable of 562.5 MB/s burst
bandwidth.
It implements parity checking, so it should generate a NMI when a parity
error is detected (FIXME but R4300 doesn't support parity?).
The RAM is mapped at 0x00000000, and 64MB of address space is reserved
for it.
However, only 63MB of this address space can be used for physical memory,
since the last 1MB is reserved for the RAC registers.
\end_layout
\begin_layout Subsubsection
Addressing
\end_layout
\begin_layout Paragraph
The N64 console ships with 4MB of memory onboard, and a required bus terminator
called a
\begin_inset Quotes eld
\end_inset
Jumper Pak
\begin_inset Quotes erd
\end_inset
is installed in the memory expansion slot.
A 4MB upgrade can be purchased and installed for a total of 8MB system
memory.
The following maps can be used depending on the amount of memory installed:
\end_layout
\begin_layout Itemize
For cached access to 4MB memory, use 0x80000000-0x803FFFFF.
\end_layout
\begin_layout Itemize
For cached access to 8MB memory, use 0x80000000-0x807FFFFF.
\end_layout
\begin_layout Itemize
For uncached access to 4MB memory, use 0xA0000000-0xA03FFFFF.
\end_layout
\begin_layout Itemize
For uncached access to 8MB memory, use 0xA0000000-0xA07FFFFF.
\end_layout
\begin_layout Subsubsection
Memory Detection
\end_layout
\begin_layout Paragraph
To check how much memory is installed, we consult the address 0x80000318.
It corresponds to osMemSize from the SDK, and contains the size of memory
(0x00400000 for 4MB or 0x00800000 for 8MB).
The PIF-ROM code detects the size of memory and sets the value at that
address before jumping to program code.
If we wish to detect the size of memory ourselves instead of using the
value the PIF-ROM gave us, we can use the following algorithm:
\end_layout
\begin_layout Enumerate
Start at address (0xA0400000 - 4) and a memory count of 4MB.
\end_layout
\begin_layout Enumerate
Write a 32-bit word to the address, read it back, and compare it to the
value written.
\end_layout
\begin_layout Enumerate
If it is the same, add 0x100000 to the address and 1MB to the memory count.
\end_layout
\begin_layout Enumerate
Repeat until the memory count is equal to the desired amount of memory.
If we are only verifying the (non)existence of an officially-released memory
expansion pack, stop when the memory count is equal to 8MB.
\end_layout
\begin_layout Paragraph
Note that an attempt to access a non-existent physical memory address will
result in an exception, so in order to prevent a CPU crash when performing
memory detection, an appropriate general exception handler must be installed.
\end_layout
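\begin_layout Standard
The detection loop can be sketched in C as below.
This is a variation on the steps above, not a definitive implementation:
it takes the base 4MB as given, probes the last word of each further megabyte
before counting it, and restores the word it overwrote.
The mem_probe_ops hooks are hypothetical; on the target they would perform
uncached 32-bit accesses and report whether the general exception handler
caught a fault.
A small host-side stand-in for the hardware is included so that the sketch
can be exercised off-target.
\end_layout

```c
#include <stdint.h>

#define MB         0x100000u
#define KSEG1_BASE 0xA0000000u

/* Hypothetical access hooks; each returns nonzero if the access
 * raised an exception. */
typedef struct {
    int (*read32)(uint32_t addr, uint32_t *value);
    int (*write32)(uint32_t addr, uint32_t value);
} mem_probe_ops;

/* Probe upward from 4MB in 1MB steps. Returns the memory size found
 * in bytes, never more than limit_bytes. */
static uint32_t detect_memory(const mem_probe_ops *ops, uint32_t limit_bytes)
{
    const uint32_t pattern = 0x5AA5C33Cu;
    uint32_t count = 4u * MB;  /* base memory is assumed present */

    while (count < limit_bytes) {
        /* Last 32-bit word of the next megabyte, uncached view. */
        uint32_t addr = KSEG1_BASE + count + MB - 4u;
        uint32_t saved, seen;

        if (ops->read32(addr, &saved) != 0)
            break;
        if (ops->write32(addr, pattern) != 0)
            break;
        if (ops->read32(addr, &seen) != 0 || seen != pattern)
            break;
        ops->write32(addr, saved);  /* restore original contents */
        count += MB;
    }
    return count;
}

/* Host-side stand-in for the hardware: one backing word per
 * megabyte, with sim_installed bytes of memory fitted. */
static uint32_t sim_installed = 8u * MB;
static uint32_t sim_words[64];

static int sim_read32(uint32_t addr, uint32_t *value)
{
    uint32_t phys = addr & 0x1FFFFFFFu;
    if (phys >= sim_installed || phys / MB >= 64u)
        return 1;  /* simulated fault */
    *value = sim_words[phys / MB];
    return 0;
}

static int sim_write32(uint32_t addr, uint32_t value)
{
    uint32_t phys = addr & 0x1FFFFFFFu;
    if (phys >= sim_installed || phys / MB >= 64u)
        return 1;  /* simulated fault */
    sim_words[phys / MB] = value;
    return 0;
}

static const mem_probe_ops sim_ops = { sim_read32, sim_write32 };
```

\begin_layout Standard
To check only for an officially-released memory expansion, the limit passed
to detect_memory() would be 8MB.
\end_layout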
\begin_layout Subsubsection
Access Strategy
\end_layout
\begin_layout Paragraph
The N64's Rambus RDRAM system memory has a high burst bandwidth but trades
access latency for this bandwidth.
Compared to SDRAM of the same speed, accessing main memory in a random fashion
is exceptionally slow.
A poster at the
\emph on
Beyond3d.com
\emph default
forums claimed that it would take 64 clocks (640ns) to initiate memory access,
which is a crippling latency when combined with the lack of prefetch and
read-around-write capabilities
\begin_inset LatexCommand cite
key "beyond3d-forum"
\end_inset
.
(benchmark)
\end_layout
\begin_layout Paragraph
Also, since the memory controller is part of the RCP and can only be accessed
through the RCP, the data width of memory access is limited to the 32-bit
SysAD interface between the CPU and RCP; storing a 64-bit longword to memory
will thus require two transfers to the RCP, while storing a value of size
32 bits or less requires only one transfer.
(benchmark) Additionally, the R4300 CPU will be locked out from accessing
memory whenever the RCP's internal bus is busy, causing program slowdown.
(benchmark while e.g.
PI DMA is going on)
\end_layout
\begin_layout Paragraph
To avoid this latency, programs that run on the R4300 CPU should be optimized
so that registers are used as much as possible, both for passing procedure
parameters and for procedure temporary (stack) variables.
Thanks to the MIPS architecture providing 32 64-bit general-purpose registers
(GPR), there are ample registers to use for this purpose.
\end_layout
\begin_layout Paragraph
The
\series medium
\emph on
n32
\series default
\emph default
Application Binary Interface (ABI) developed by SGI
\begin_inset LatexCommand ref
reference "sub:64-Bit-Issues"
\end_inset
leverages this advantage at the compiler and linker level.
When this ABI is employed, up to eight 64-bit words, whether integer, float,
struct, or union members, are passed in registers.
Any data in excess of the eight 64-bit words is then additionally passed
on the stack.
Inside a function, up to 18 integer registers minus the number of register-pass
ed integer arguments are available, and up to 26 floating-point registers
minus the number of register-passed FP arguments are available without
allocating additional stack space in main memory for local temporary variables.
\end_layout
\begin_layout Paragraph
If register variables are allocated by the caller and its callers, they
will be saved to the stack before the call, even if the subroutine parameters
fit into the allocated registers.
For this reason, recursive functions and other algorithms with deep call
graphs, as well as algorithms with high space complexity, are, all other
things equal, a poor choice for the N64; eventually the data cache will
be exhausted and the stack will spill over into the main memory that is
extremely slow when accessed by the CPU.
Functions written in a way such that they can be inlined into the caller,
on the other hand, both leverage the high register count of the R4300 and
have less data cache impact, but caution must be exercised such that excessive
use of inline code inside a loop does not thrash the relatively small instructi
on cache.
(Instruction cache misses are almost as expensive as data cache misses,
only being somewhat mitigated by the R4300 design, which prefetches two
instructions at a time to save power.) As standard practice goes, pass large variables
by reference when possible; where a subroutine would otherwise copy a
parameter into a stack-allocated temporary variable, simply pass that
variable by value instead.
\end_layout
\begin_layout Paragraph
Always use DMA transfers back and forth to main memory when large amounts
of data must be read from or written to peripherals.
Do not read a whole data structure into memory to then simply write a (possibly
modified) copy to another memory-mapped location! This is what DMA is for.
Trying to do all of this on the CPU will be extraordinarily slow due to
the 32-bit interface between the CPU and RCP, high latency of RDRAM, and
internal bus competition with the RCP.
When main memory must be read from or stored to, try to design programs
such that memory is accessed in bursts of contiguous reads or writes that
are smaller than the data cache size.
Then use the cache flush command to write the cache out to memory all at
once, instead of waiting for individual cache lines to be automatically
flushed to memory by the LRU algorithm.
(FIXME test is this really faster?)
\end_layout
\begin_layout Paragraph
Also remember that the CPU has a 16K instruction cache (4096 32-bit instruction
words) organized as 512 32-byte lines, and an 8K data cache (2048 32-bit or
1024 64-bit data words) organized as 512 16-byte lines.
If, in a given program mode, your program's call graph and data structures
can be made to fit within these cache limits, main memory accesses will