Update SMCA Error Decoding for AMD EPYC Processors #101

Closed
wants to merge 2 commits into from

Conversation

AvaNaik
Contributor

@AvaNaik AvaNaik commented Jun 5, 2023

Modern AMD EPYC processors support Scalable MCA (SMCA) error decoding.
Currently, however, on Family 19h and 1Ah based AMD EPYC processors, not
all SMCA errors are decoded. This patchset addresses the issue by updating
the error description structures and handling errata in some SMCA bank
types.

The first patch adds new error descriptions for various SMCA bank types,
rewords existing descriptions, and removes unused ones.

The second patch handles the erratum encountered on Genoa and a few other
CPUs, where bits in the Control register of the Coherent Slave (CS) SMCA
bank type were reassigned.

Update and reword some existing SMCA bank type error descriptions to extend
SMCA error decoding functionality for modern AMD processors. Also add new
error descriptions for missing SMCA bank types.

Signed-off-by: Avadhut Naik <avadnaik@amd.com>

Currently, on AMD systems with Scalable MCA (SMCA), each machine check
error of an SMCA bank type has an associated bit position in the bank's
control (CTL) register, which is used to enable or disable reporting of
that error. An error's bit position in the CTL register is also used
during error decoding as an offset into the corresponding bank's error
description structure. As new errors are added to existing SMCA bank
types on newer AMD systems, the underlying SMCA architecture guarantees
that the bit positions of existing errors are not altered.
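
The indexing scheme described above can be sketched roughly as follows. This is only an illustration, not the code from this patchset: the table `ex_bank_desc`, the helper `ex_decode`, and the description strings are hypothetical names and placeholders chosen for the example.

```c
#include <stddef.h>

#define EX_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Hypothetical per-bank-type description table. Entry i describes the
 * error whose enable bit is bit i of the bank's CTL register. New errors
 * are appended at the end so that existing indices never change.
 */
static const char * const ex_bank_desc[] = {
	"Example error for CTL bit 0",
	"Example error for CTL bit 1",
	"Example error for CTL bit 2",
};

/*
 * Look up the description for an error, using its CTL bit position as
 * the offset into the table. Out-of-range positions are reported as
 * unknown rather than read past the end of the table.
 */
static const char *ex_decode(const char * const *desc, size_t count,
			     unsigned int ctl_bit)
{
	if (ctl_bit >= count || desc[ctl_bit] == NULL)
		return "Unknown error";

	return desc[ctl_bit];
}
```

A caller would pass the table for the bank type being decoded, for example `ex_decode(ex_bank_desc, EX_ARRAY_SIZE(ex_bank_desc), 1)`. Because the lookup is purely positional, it is only correct as long as a bit position keeps its meaning across CPU models.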

However, on some AMD systems, such as Genoa, some of the existing bit
definitions in the CTL register of the Coherent Slave (CS) SMCA bank type
have been reassigned without defining a new HWID and McaType. Consequently,
the errors whose bit definitions were reassigned are decoded incorrectly.

As a solution, create a new software-defined SMCA bank type using one of
the hardware-reserved values for HWID. The new SMCA bank type is only used
for CS error decoding on affected CPU models.

Additionally, since the existing error description structure for the CS
SMCA bank type is still valid, add a new error description structure to
account for the reassigned bit definitions.
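
A minimal sketch of this workaround, under stated assumptions: every identifier and numeric value below (`EX_HWID_CS`, `EX_HWID_CS_QUIRK`, `EX_FAMILY`, the model range, and the helper names) is a hypothetical placeholder and is not taken from the patch or from AMD documentation. The idea is to rewrite the HWID to the reserved value on affected models before the usual (HWID, McaType) lookup, so that the lookup selects the software-defined bank type and, with it, the separate description table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration only, not real hardware IDs. */
#define EX_HWID_CS        0x1001u /* stands in for the CS bank's HWID */
#define EX_HWID_CS_QUIRK  0x1FFFu /* stands in for a hardware-reserved HWID */
#define EX_FAMILY         0xAAu   /* stands in for the affected family */
#define EX_MODEL_FIRST    0x10u   /* stands in for the affected model range */
#define EX_MODEL_LAST     0x1Fu

enum ex_bank_type { EX_CS, EX_CS_QUIRK, EX_UNKNOWN };

struct ex_hwid_entry {
	enum ex_bank_type type;
	uint16_t hwid;
	uint16_t mcatype;
};

/* The software-defined type gets its own entry, keyed by the reserved HWID. */
static const struct ex_hwid_entry ex_hwid_table[] = {
	{ EX_CS,       EX_HWID_CS,       0 },
	{ EX_CS_QUIRK, EX_HWID_CS_QUIRK, 0 },
};

/* Hypothetical check for CPU models whose CS CTL bits were reassigned. */
static bool ex_cs_bits_reassigned(unsigned int family, unsigned int model)
{
	return family == EX_FAMILY &&
	       model >= EX_MODEL_FIRST && model <= EX_MODEL_LAST;
}

/* Substitute the reserved HWID for CS banks on affected models only. */
static uint16_t ex_fixup_hwid(unsigned int family, unsigned int model,
			      uint16_t hwid)
{
	if (hwid == EX_HWID_CS && ex_cs_bits_reassigned(family, model))
		return EX_HWID_CS_QUIRK;

	return hwid;
}

/* Ordinary lookup of the bank type from the (HWID, McaType) pair. */
static enum ex_bank_type ex_lookup(uint16_t hwid, uint16_t mcatype)
{
	size_t i;

	for (i = 0; i < sizeof(ex_hwid_table) / sizeof(ex_hwid_table[0]); i++) {
		if (ex_hwid_table[i].hwid == hwid &&
		    ex_hwid_table[i].mcatype == mcatype)
			return ex_hwid_table[i].type;
	}

	return EX_UNKNOWN;
}
```

With this arrangement, `ex_lookup(ex_fixup_hwid(family, model, hwid), mcatype)` yields `EX_CS_QUIRK` only on affected models, and unaffected models continue to resolve to `EX_CS` and its existing description table unchanged.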

Signed-off-by: Avadhut Naik <avadnaik@amd.com>
@AvaNaik AvaNaik closed this Aug 14, 2023
Contributor Author

AvaNaik commented Aug 14, 2023

Will be combining pull requests 101, 105 and 106 into a single pull request.

@AvaNaik AvaNaik deleted the smca_error_decoding branch August 14, 2023 01:18