[origin-aggregated-logging 207] Add diagnostics for aggregated logging #10964

Merged Oct 4, 2016 (1 commit)

jcantrill
Contributor

This PR satisfies: https://trello.com/c/BAwWkEiy

It provides diagnostic check for:

@jcantrill
Contributor Author

[test]

@jcantrill
Contributor Author

cc @sosiouxme first pass review..

}
for _, name := range clusterReaderRoleBindingNames.List() {
if !boundServiceAccounts.Has(name) {
r.Error("AGL0610", nil, fmt.Sprintf(clusterReaderUnboundServiceAccount, name, project, project, name))
Member

Instead of repeating arguments, use the %[n]s arg-specifier notation in the template. So

oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:%[2]s:%[1]s
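As a minimal sketch of the suggestion above, explicit argument indexes let a template reuse each argument without passing it twice. The constant and function names here are hypothetical; only the command text comes from the review comment.

```go
package main

import "fmt"

// addRoleCmd is a hypothetical template. %[1]s is the service account name
// and %[2]s is the project, so each argument is passed only once.
const addRoleCmd = "oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:%[2]s:%[1]s"

// renderAddRoleCmd fills the template using indexed verbs.
func renderAddRoleCmd(name, project string) string {
	return fmt.Sprintf(addRoleCmd, name, project)
}

func main() {
	fmt.Println(renderAddRoleCmd("aggregated-logging-fluentd", "logging"))
}
```

With indexed verbs the call site stays at two arguments even if the message mentions the project or the account several times.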


func newFakeDiagnostic(t *testing.T) *fakeDiagnostic {
return &fakeDiagnostic{
messages: make(map[string]fakeLogMessage, 20),
Member

make is not used often. This would normally be:

map[string]fakeLogMessage{}

func (d *AggregatedLogging) Check() types.DiagnosticResult {
project := retrieveLoggingProject(d.result, d.masterConfig, d.OsClient)
if len(project) != 0 {
checkServiceAccounts(d, d, project)
Member

These look really, really weird. Is there a reason you couldn't have d.result as the first arg for each of these and skip the Error/Debug/Warn/Info definitions above?

Contributor Author

@sosiouxme I changed it to facilitate testing by introducing an interface, 'diagnosticReporter', that allows verifying that the various messages were produced. There may be a way to use the actual code and retrieve the results from where it saves them. I envision moving this reporter to a higher-level, public interface so it could be used for testing across all the diagnostics.
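The testing approach described in this reply can be sketched roughly as follows. The method shapes mirror the calls visible in the diff (Info/Debug take an ID and a message, Warn/Error also take an error), but the exact interface in the PR may differ; fakeReporter and checkEndpoints are hypothetical names used only for illustration.

```go
package main

import "fmt"

// diagnosticReporter is an assumed shape based on the calls seen in the diff.
type diagnosticReporter interface {
	Info(id string, message string)
	Debug(id string, message string)
	Warn(id string, err error, message string)
	Error(id string, err error, message string)
}

// fakeReporter records the last message per ID so a test can assert on it.
type fakeReporter struct {
	messages map[string]string
}

func newFakeReporter() *fakeReporter {
	return &fakeReporter{messages: map[string]string{}}
}

func (f *fakeReporter) Info(id, message string)             { f.messages[id] = message }
func (f *fakeReporter) Debug(id, message string)            { f.messages[id] = message }
func (f *fakeReporter) Warn(id string, _ error, m string)   { f.messages[id] = m }
func (f *fakeReporter) Error(id string, _ error, m string)  { f.messages[id] = m }

// checkEndpoints is a hypothetical check that depends only on the interface.
func checkEndpoints(r diagnosticReporter, service string, endpointCount int) {
	if endpointCount == 0 {
		r.Warn("AGL0225", nil, fmt.Sprintf("There are no endpoints found for service '%s'.", service))
	}
}

func main() {
	r := newFakeReporter()
	checkEndpoints(r, "logging-es", 0)
	fmt.Println(r.messages["AGL0225"])
}
```

Because the check only sees the interface, a test can pass the fake and look up message IDs without running the real diagnostic framework.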

@jcantrill
Contributor Author

cc @openshift/ui-review

@jwforres
Member

@openshift/cli-review


const daemonSetPartialNodesLabeled = `
There are some nodes that do not match the selector for DaemonSet '%s'.
A list of these nodes can be discovered by running:
Member

Wouldn't this be the list of nodes that do match? It might be clearer if this sentence said "A list of matching nodes can be discovered by running:"

Contributor Author

I see the error, but I'm wondering about the value now. Wouldn't it be more useful to be able to determine the nodes that DO NOT match the selector, which is what the first line of the message says? Is there a way to find all nodes that do not match these labels? Those should be the correct ones, right?

Contributor Author

modified to address nodes that match... not sure how we would advise to get unmatched nodes.

The Pod '%[1]s' matched by DaemonSet '%[2]s' is not in '%[3]s' status: %[4]s.

Depending upon the state, this could mean there is an error running the image
for one or more pod container, the node could be pulling images, etc. Try running
Member

@jwforres Sep 23, 2016

plural containers, think this occurs in a couple other places in this PR

return false, errors.New("config must include a cluster-admin context to run this diagnostic")
}
if d.KubeClient == nil {
return false, errors.New("config must include a cluster-admin context to run this diagnostic")
Member

Is there a reason these two don't start capitalized?

Contributor Author

fixed

func decodeSecret(secret *kapi.Secret, key string) (string, error) {
value, ok := secret.Data[key]
if !ok {
return "", errors.New(fmt.Sprintf("The %s secret did not have an data entry for %s", secret.ObjectMeta.Name, key))
Member

'an entry' or 'a data entry'

Contributor Author

fixed

)

const routeUnaccepted = `
An unaccepted route is a most likely due to one of the following reasons:
Member

'is most likely' or 'is usually'

Contributor Author

fixed

if route.Spec.TLS != nil && len(route.Spec.TLS.Certificate) != 0 && len(route.Spec.TLS.Key) != 0 {
checkRouteCertificate(r, route)
} else {
r.Debug("AGL0331", fmt.Sprintf("Skipping check of key and certificate of route '%s'. It could be be missing one or the other or both", route.ObjectMeta.Name))
Member

Skipping key and certificate checks on route X. Either of them may be missing.

Contributor Author

fixed

var serviceAccountNames = sets.NewString("logging-deployer", "aggregated-logging-kibana", "aggregated-logging-curator", "aggregated-logging-elasticsearch", fluentdServiceAccountName)

const serviceAccountsMissing = `
Did not find any logging ServiceAccounts. The logging infrastructure may not be able to
Member

The logging infrastructure may not function properly without a service account with cluster permissions.

Contributor Author

fixed

re-run the installer.
`
const serviceAccountMissing = `
Did not find ServiceAccount '%s'. The logging infrastructure may not be able to
Member

The logging infrastructure may not function properly without a service account with cluster permissions.

var loggingServices = sets.NewString("logging-es", "logging-es-cluster", "logging-es-ops", "logging-es-ops-cluster", "logging-kibana", "logging-kibana-ops")

const serviceNotFound = `
Expected to find '%s' among the logging services for the project but did not. It may be
Member

Depending on the options chosen while running the deployer, these services may not have been provided.

Contributor Author

Modified slightly to exclude mention of the deployer since it will be going away

return
}
if len(endpoints.Subsets) == 0 {
r.Warn("AGL0225", nil, fmt.Sprintf("There are no endpoints found for service '%s'. This may be immaterial if the backing pods were not deployed (e.g. ops).", service))
Member

I'm not a fan of using the word immaterial, but I don't have another suggestion at the moment.

Member

I'm not a fan of using the word immaterial, but I don't have another suggestion at the moment.

"This might not be a problem if..."?

Contributor Author

fixed

@jwforres
Member

On Fri, Sep 23, 2016 at 2:06 PM, Sam Padgett notifications@github.com
wrote:

@spadgett commented on this pull request.

In pkg/diagnostics/cluster/aggregated_logging/services.go
#10964:

+      checkServiceEndpoints(r, adapter, project, service)
+    } else {
+      r.Warn("AGL0215", nil, fmt.Sprintf(serviceNotFound, service))
+    }
+  }
+}
+
+// checkServiceEndpoints validates if there is an available endpoint for the service.
+func checkServiceEndpoints(r diagnosticReporter, adapter servicesAdapter, project string, service string) {
+  endpoints, err := adapter.endpointsForService(project, service)
+  if err != nil {
+    r.Warn("AGL0220", err, fmt.Sprintf("Unable to retrieve endpoints for service '%s': %s", service, err))
+    return
+  }
+  if len(endpoints.Subsets) == 0 {
+    r.Warn("AGL0225", nil, fmt.Sprintf("There are no endpoints found for service '%s'. This may be immaterial if the backing pods were not deployed (e.g. ops).", service))

I'm not a fan of using the word immaterial, but I don't have another
suggestion at the moment.

"This might not be a problem if..."?

Yes that ^


@jcantrill
Contributor Author

@openshift/cli-review @sosiouxme any more comments? Any +1, LGTM... Bueller?

Member

@fabianofranz left a comment

Minor nit otherwise LGTM. This is nice!

There are no nodes that match the selector for DaemonSet '%[1]s'.
An example of a command to target a specific node for this DaemonSet:

oc label node/ip-172-18-2-170.ec2.internal %[2]s
Member

Make sure the indentation is consistent across all messages. We usually use 2 spaces, the same as in command examples.

Contributor Author

Which indentation are you referring to? The initial statement like "There are..." or the command example: " oc label"? Do you maybe have an example I can follow?

Member

I mean this line oc label .... Just make sure there are 2 spaces of indentation here and in other messages where you give command examples. ;)

Contributor Author

fixed one file where it was an issue

`

func checkClusterRoleBindings(r diagnosticReporter, adapter clusterRoleBindingsAdapter, project string) {
r.Info("AGL0600", "Checking ClusterRoleBindings...")
Member

@sosiouxme Sep 29, 2016

In an effort to minimize the "wall of text" that you get when running diagnostics, I would prefer that these sub-steps be announced only at Debug level.

Contributor Author

fixed

if loggingUrl.Host == route.Spec.Host {
project := route.ObjectMeta.Namespace
r.Debug("AGL0015", fmt.Sprintf("aggregated logging project name: '%s'", project))
return project
Member

I would like this loop to go on looking for other routes that match and raise a warning if there are any, as that would be a pretty clear sign that someone had multiple logging deployments.

Member

Also, it should check that the project has an empty nodeSelector annotation and issue a warning if it doesn't, as that's an easy problem to have and not notice.

Contributor Author

updated to look for additional issues. Added empty nodeSelector check
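A rough sketch of the requested loop behavior, assuming a pared-down route type: keep the first matching route as the logging project, but continue scanning and collect any further matches so a warning can be raised. The route struct and findLoggingProject are hypothetical stand-ins, not the PR's actual code.

```go
package main

import "fmt"

// route is a hypothetical, simplified stand-in for the real route object.
type route struct {
	Name      string
	Namespace string
	Host      string
}

// findLoggingProject returns the namespace of the first route whose host
// matches, plus "namespace/name" for every additional matching route.
func findLoggingProject(host string, routes []route) (project string, duplicates []string) {
	for _, r := range routes {
		if r.Host != host {
			continue
		}
		if project == "" {
			project = r.Namespace
			continue // keep scanning instead of returning early
		}
		duplicates = append(duplicates, r.Namespace+"/"+r.Name)
	}
	return project, duplicates
}

func main() {
	routes := []route{
		{Name: "logging-kibana", Namespace: "logging", Host: "kibana.example.com"},
		{Name: "kibana", Namespace: "other", Host: "kibana.example.com"},
	}
	project, dups := findLoggingProject("kibana.example.com", routes)
	fmt.Println(project, dups)
}
```

A caller could then emit a warning whenever the duplicates slice is non-empty, since that suggests more than one logging deployment.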

r.Debug("AGL0013", fmt.Sprintf("Comparing URL to route.Spec.Host: %s", route.Spec.Host))
if loggingUrl.Host == route.Spec.Host {
project := route.ObjectMeta.Namespace
r.Debug("AGL0015", fmt.Sprintf("aggregated logging project name: '%s'", project))
Member

@sosiouxme Sep 29, 2016

Could this be promoted to an Info e.g.
"Found route %s matching logging URL %s in project %s"

Contributor Author

fixed.

d.result.Info(id, message)
}

func (d *AggregatedLogging) Error(id string, err error, message string) {
Member

Here's one thing I really don't like about these wrapper routines... they obscure the source of the error. You get:

ERROR: [AGL0515 from diagnostic AggregatedLogging@openshift/origin/pkg/diagnostics/cluster/aggregated_logging/diagnostic.go:96]

... regardless of where in the code it actually was called. So instead of a file and line number to look at to see what the context of the error was, you have to go by ID. I don't know if that's worth changing your whole testing framework over.

Contributor Author

I think the tests have value. I am leaning towards leaving it as is and to modify it if we find it becomes an issue going forward.
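For context on the concern about wrappers hiding the call site, here is one general Go technique, not something taken from this PR: a wrapper can ask runtime.Caller to skip its own frame so the reported location is that of its caller. How the origin diagnostics framework actually derives the file and line is not shown in this thread, so callSite and reportError below are purely hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// callSite returns "file:line" for a frame above this function.
// skip=0 names the direct caller of callSite; skip=1 names that caller's caller.
func callSite(skip int) string {
	_, file, line, ok := runtime.Caller(skip + 1)
	if !ok {
		return "unknown"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

// reportError is a hypothetical wrapper. It skips its own frame so the
// message points at the code that called reportError.
func reportError(id, message string) string {
	return fmt.Sprintf("ERROR: [%s from %s] %s", id, callSite(1), message)
}

func main() {
	fmt.Println(reportError("AGL0515", "example message"))
}
```

Whether this is worth the extra plumbing is the same trade-off discussed above: the message ID already identifies the check, and the tests depend on the wrapper.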

d.Error("AGL0505", err, fmt.Sprintf("There was an error while trying to retrieve the pods for the AggregatedLogging stack: %s", err))
return
}
if len(saList.Items) == 0 {
Member

There will always be service accounts in a project... they're created by default. So this will never be seen. I would rather see this message instead of four "can't find SA" msgs, but the meaning should be that none of the expected service accounts are present.

Contributor Author

updated to log single message. removed zero check


//checkKibanaRoutesInOauthClient verifies the client contains the correct redirect uris
func checkKibanaRoutesInOauthClient(r types.DiagnosticResult, osClient *client.Client, project string, oauthclient *oauthapi.OAuthClient) {
r.Info("AGL0141", "Checking oauthclient redirectURIs for the logging routes...")
Member

Debug

Contributor Author

fixed.

if foundServices.Has(service) {
checkServiceEndpoints(r, adapter, project, service)
} else {
r.Warn("AGL0215", nil, fmt.Sprintf(serviceNotFound, service))
Member

I don't like that all the services are treated as equal here. The non-ops ones should be an error if they're missing, as nothing will work. The ops ones should be errors too, if we can determine that they're supposed to exist, but I'm not sure we can cleanly, and I think the deployment creates them regardless of whether there's anything to use them. So I think the ops ones should just be warnings if they're missing.

Contributor Author

Updated to warn for ops and error on non-ops
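The distinction agreed on here could look roughly like the sketch below. It assumes, as the service names in the diff suggest, that ops services can be recognized by "-ops" in the name; missingServiceSeverity is a hypothetical helper, and the PR's actual implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// missingServiceSeverity returns "warn" for ops services, which may be
// legitimately absent, and "error" for the core services.
// Assumption: ops services contain "-ops" in their name
// (e.g. logging-es-ops, logging-es-ops-cluster, logging-kibana-ops).
func missingServiceSeverity(service string) string {
	if strings.Contains(service, "-ops") {
		return "warn"
	}
	return "error"
}

func main() {
	for _, s := range []string{"logging-es", "logging-es-ops-cluster", "logging-kibana-ops"} {
		fmt.Printf("%s: %s\n", s, missingServiceSeverity(s))
	}
}
```

Note that a suffix check alone would miss "logging-es-ops-cluster", which is why this sketch matches on the substring instead.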

return
}
if len(dcList.Items) == 0 {
r.Error("AGL0047", nil, fmt.Sprintf("Did not find any matching DeploymentConfigs in project '%s'", project))
Member

A little context for what this implies would be good. Something like "This means that no logging components have been deployed."

Contributor Author

fixed

r.Info("AGL0400", fmt.Sprintf("Checking DaemonSets in project '%s'...", project))
dsList, err := adapter.daemonsets(project, kapi.ListOptions{LabelSelector: loggingInfraFluentdSelector.AsSelector()})
if err != nil {
r.Error("AGL0405", err, fmt.Sprintf("There was an error while trying to retrieve the logging DaemonSets in project '%s'", project))
Member

This would almost certainly be a transient error, would probably be nice to say that and print the error (which would probably be something like "connection reset" or whatever).

Contributor Author

I've updated it to include the error. I'm not sure how else to identify that it's transient. As you note, I would expect to almost never see these errors.

return
}
if len(dsList.Items) == 0 {
r.Error("AGL0407", err, fmt.Sprintf("There were no DaemonSets in project '%s' that included label '%s'", project, loggingInfraFluentdSelector.AsSelector()))
Member

Give context on what this implies: The fluentd log collectors are not deployed, or possibly they still have an old version of logging that they need to upgrade.

Contributor Author

updated.

}
nodeList, err := adapter.nodes(kapi.ListOptions{})
if err != nil {
r.Error("AGL0410", err, "There was an error while trying to retrieve the list of Nodes")
Member

print the error and note that it's probably transient

Contributor Author

updated

)

const daemonSetNoLabeledNodes = `
There are no nodes that match the selector for DaemonSet '%[1]s'.
Member

Again, context for if I don't really know what a Daemonset is or why it matters...

"... so fluentd isn't running and gathering logs from any nodes."

Contributor Author

updated.

r.Debug("AGL0435", fmt.Sprintf("Checking for running pods for DaemonSet '%s' with matchLabels '%s'", ds.ObjectMeta.Name, podSelector))
podList, err := adapter.pods(project, kapi.ListOptions{LabelSelector: podSelector})
if err != nil {
r.Error("AGL0438", err, fmt.Sprintf("There was an error retrieving pods matched to DaemonSet '%s'", ds.ObjectMeta.Name))
Member

print error, note it's transient

Contributor Author

updated

}

}
} else {
Member

@sosiouxme Sep 29, 2016

reverse the conditional and return from this branch, so that you don't have to indent the main flow.

Contributor Author

fixed
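The refactoring requested in this thread is the usual guard-clause pattern: handle the short case first and return, so the main flow is not nested inside an else. A generic sketch, with hypothetical function names that are not from the PR:

```go
package main

import "fmt"

// describePodsNested keeps the main flow inside the else branch.
func describePodsNested(podCount int) string {
	if podCount != 0 {
		return fmt.Sprintf("%d pods found", podCount)
	} else {
		return "no pods found"
	}
}

// describePods reverses the conditional and returns early,
// so the main flow stays at the top indentation level.
func describePods(podCount int) string {
	if podCount == 0 {
		return "no pods found"
	}
	return fmt.Sprintf("%d pods found", podCount)
}

func main() {
	fmt.Println(describePods(0))
	fmt.Println(describePods(3))
}
```

Both versions behave the same; the second simply reads top to bottom without extra nesting.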

r.Error("AGL0438", err, fmt.Sprintf("There was an error retrieving pods matched to DaemonSet '%s'", ds.ObjectMeta.Name))
return
}
if len(podList.Items) == 0 {
Member

can it also check that the number of pods is equal to the number of nodes labeled?

Contributor Author

Added:

r.Error("AGL0443", nil, fmt.Sprintf("The number of deployed pods %d does not match the number of labeled nodes %d", len(podList.Items), numLabeledNodes))

exists := found.Has(entry)
if !exists {
if strings.HasSuffix(entry, "-ops") {
r.Warn("AGL0060", nil, fmt.Sprintf(deploymentConfigWarnMissingForOps, entry))
Member

I really don't like having warnings for completely correct configurations. Once again I wish that we could determine after the deployment whether there is supposed to be an ops deployment or not. Someone without one shouldn't "get used to" seeing warnings from diagnostics. Maybe we should go down the messy path of checking the fluentd env vars for whether the ops host is different from the regular one.

Otherwise, I'd say this only warrants an Info.

Contributor Author

updated to info

return project
}
}
r.Error("AGL0014", errors.New("aggregated logging project name is empty"), "Unable to determine the project from the loggingPublicURL defined in the master config")
Member

@sosiouxme Sep 30, 2016

Unable to find a route matching the loggingPublicURL defined in the master config:
  [URL]
Check that the URL is correct and aggregated logging is deployed.

(This may be the only message they see so make it informative)

Contributor Author

fixed.


routeList, err := osClient.Routes(kapi.NamespaceAll).List(kapi.ListOptions{LabelSelector: loggingSelector.AsSelector()})
if err != nil {
r.Error("AGL0012", err, fmt.Sprintf("There was an error while trying to find the route associated with '%s'", loggingUrl))
Member

print the error, note it's probably transient

Contributor Author

fixed

func retrieveLoggingProject(r types.DiagnosticResult, masterCfg *configapi.MasterConfig, osClient *client.Client) string {
r.Debug("AGL0010", fmt.Sprintf("masterConfig.AssetConfig.LoggingPublicURL: '%s'", masterCfg.AssetConfig.LoggingPublicURL))

loggingUrl, err := url.Parse(masterCfg.AssetConfig.LoggingPublicURL)
Member

before doing this, need to exclude the case where the URL is empty. I should not get an error in that case.

Contributor Author

added check of len for 0, added debug message, returned empty
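A minimal sketch of the guard described in this exchange: treat an empty URL as "logging not configured" and return without an error, and only parse non-empty values. The function name loggingHost is hypothetical; only the use of url.Parse comes from the diff.

```go
package main

import (
	"fmt"
	"net/url"
)

// loggingHost returns the host portion of the configured logging URL.
// An empty URL is not an error: it yields an empty host and a nil error.
func loggingHost(rawURL string) (string, error) {
	if len(rawURL) == 0 {
		return "", nil // nothing configured, nothing to check
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Host, nil
}

func main() {
	host, err := loggingHost("https://kibana.example.com/")
	fmt.Println(host, err)
}
```

The caller can then skip the remaining checks when the host is empty, which matches the "returned empty" behavior mentioned above.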

return
}
if len(endpoints.Subsets) == 0 {
r.Warn("AGL0225", nil, fmt.Sprintf("There are no endpoints found for service '%s'. This may not be a problem if the backing pods were not deployed (e.g. ops).", service))
Member

Here's another place that warns needlessly for just not deploying ops... could this be an info for -ops services?

Member

Also, a little context... what does this mean to the end user? Need to let them know it means there are no pods serviced by this service, so the component is not functioning.

@jcantrill force-pushed the 207_agl_diagnostics branch 2 times, most recently from f4372c5 to 38d3246, September 30, 2016 20:15
@jcantrill
Contributor Author

@sosiouxme care to glance at the updates again?

}
project, err := osClient.Projects().Get(projectName)
if err != nil {
r.Error("AGL0018", err, fmt.Sprintf("There was an error retrieving project '%s' which most likely a transient error: %s", projectName, err))
Member

which /is/ most likely

Contributor Author

updated

r.Debug("AGL0013", fmt.Sprintf("Comparing URL to route.Spec.Host: %s", route.Spec.Host))
if loggingUrl.Host == route.Spec.Host {
if len(projectName) == 0 {
projectName := route.ObjectMeta.Namespace
Member

should not be := but just plain =
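To illustrate why the operator matters: inside a nested block, := declares a new variable that shadows the outer one, so the outer value never changes. The functions below are a hypothetical, simplified reproduction of that pattern, not the PR's code.

```go
package main

import "fmt"

// firstMatchShadowed shows the bug: := creates a new inner variable,
// so the outer name is never assigned and the function returns "".
func firstMatchShadowed(hosts []string, want string) string {
	name := ""
	for _, h := range hosts {
		if h == want && len(name) == 0 {
			name := h // shadows the outer name
			_ = name  // used only to satisfy the compiler
		}
	}
	return name
}

// firstMatchFixed assigns to the outer variable with plain =.
func firstMatchFixed(hosts []string, want string) string {
	name := ""
	for _, h := range hosts {
		if h == want && len(name) == 0 {
			name = h
		}
	}
	return name
}

func main() {
	hosts := []string{"a", "b"}
	fmt.Printf("shadowed=%q fixed=%q\n", firstMatchShadowed(hosts, "b"), firstMatchFixed(hosts, "b"))
}
```

In the shadowed version the outer variable stays empty after the loop, which is the behavior the review comment is guarding against.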

means Fluentd is not running and is not gathering logs from any nodes.
An example of a command to target a specific node for this DaemonSet:

oc label node/ip-172-18-2-170.ec2.internal %[2]s
Member

I think this is a little too specific. Could it be something like:

oc label node/node1.example.com %[2]s

Alternatively, I think we could leave the specific case out and just give the command to label all of them which is most likely to be what we want anyway.

Contributor Author

updated


Try finding and deleting the other route by running the following:

oc get --all-namespaces routes --template='{{println}}{{range .items}}{{if eq .spec.host "%[2]s"}}{{.metadata.name}}{{println}}{{end}}{{end}}'
Member

Can I suggest the slightly more informative:

oc get --all-namespaces routes --template='{{range .items}}{{if eq .spec.host "kibana.example.com"}}{{println .metadata.name "in" .metadata.namespace}}{{end}}{{end}}'

Has output like:

logging-kibana in logging

Also, I think "Try finding and deleting" is a little too imperative. I don't want them doing this blindly. How about something like "If a router has been deployed, look for duplicate matching routes by running the following:"

Contributor Author

updated.

Contributor Author

Actually updated, except that I removed 'kibana.example.com' in favor of '%[2]s'.

@openshift-bot
Contributor

Evaluated for origin test up to 1fbfe81

@jcantrill
Contributor Author

[merge]

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9615/)

@openshift-bot
Contributor

openshift-bot commented Oct 4, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9615/) (Image: devenv-rhel7_5129)

@openshift-bot
Contributor

Evaluated for origin merge up to 1fbfe81

@openshift-bot openshift-bot merged commit 01c20d0 into openshift:master Oct 4, 2016
Member

@sosiouxme left a comment

Found some more nits, then LGTM.

func checkKibana(r types.DiagnosticResult, osClient *client.Client, kClient *kclient.Client, project string) {
oauthclient, err := osClient.OAuthClients().Get(kibanaProxyOauthClientName)
if err != nil {
r.Error("AGL0115", err, fmt.Sprintf("Error retrieving the OauthClient '%s'. Unable to check Kibana", kibanaProxyOauthClientName))
Member

print the error in the message

nodeSelector, ok := project.ObjectMeta.Annotations["openshift.io/node-selector"]
if ok && len(nodeSelector) != 0 {
r.Warn("AGL0030", nil, fmt.Sprintf(projectNodeSelectorWarning, projectName))
}
Member

If the annotation is not there then I would like them to get a warning because it leaves the project subject to the master config projectConfig.defaultNodeSelector which could be set now or later to something that would limit fluentd being deployed.

Member

if !ok {
    r.Warn(<new error>);
} else if len(nodeSelector) != 0 {
    r.Warn("AGL0030"...)
}
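The pseudocode above can be fleshed out as a small runnable sketch. It uses Go's comma-ok map lookup to tell a missing annotation apart from one that is present but empty. The function name and the returned strings are hypothetical placeholders for the two warnings; only the annotation key comes from the diff.

```go
package main

import "fmt"

const nodeSelectorAnnotation = "openshift.io/node-selector"

// nodeSelectorIssue classifies the project's node selector annotation.
// It returns "" when the annotation is present and empty (the desired state).
func nodeSelectorIssue(annotations map[string]string) string {
	nodeSelector, ok := annotations[nodeSelectorAnnotation]
	if !ok {
		// Missing annotation: the project falls back to the master's default selector.
		return "missing"
	} else if len(nodeSelector) != 0 {
		// Present and non-empty: may restrict where fluentd can run.
		return "non-empty"
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", nodeSelectorIssue(map[string]string{}))
	fmt.Printf("%q\n", nodeSelectorIssue(map[string]string{nodeSelectorAnnotation: "region=infra"}))
	fmt.Printf("%q\n", nodeSelectorIssue(map[string]string{nodeSelectorAnnotation: ""}))
}
```

Looking up a key in a nil map is safe in Go, so the missing-annotation branch also covers projects with no annotations at all.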

r.Debug("AGL0100", "Checking oauthclient secrets...")
secret, err := kClient.Secrets(project).Get(kibanaProxySecretName)
if err != nil {
r.Error("AGL0105", err, fmt.Sprintf("Error retrieving the secret '%s'", kibanaProxySecretName))
Member

print the error in the message

}
decoded, err := decodeSecret(secret, oauthSecretKeyName)
if err != nil {
r.Error("AGL0110", err, fmt.Sprintf("Unable to decode Kibana Secret"))
Member

print the error

r.Debug("AGL0141", "Checking oauthclient redirectURIs for the logging routes...")
routeList, err := osClient.Routes(project).List(kapi.ListOptions{LabelSelector: loggingSelector.AsSelector()})
if err != nil {
r.Error("AGL0143", err, "Error retrieving the logging routes.")
Member

put error in message

r.Debug("AGL0300", "Checking routes...")
routeList, err := adapter.routes(project, kapi.ListOptions{LabelSelector: loggingSelector.AsSelector()})
if err != nil {
r.Error("AGL0305", err, fmt.Sprintf("There was an error retrieving routes in the project '%s' with selector '%s'", project, loggingSelector.AsSelector()))
Member

put error in message

if block != nil {
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
r.Error("AGL0335", err, fmt.Sprintf("Unable to parse the certificate for route '%s'", route.ObjectMeta.Name))
Member

print error in message; this one is actually reasonably likely to happen at some point...

r.Debug("AGL0355", fmt.Sprintf("Checking certificate matches key for route '%s'", route.ObjectMeta.Name))
_, err := tls.X509KeyPair([]byte(route.Spec.TLS.Certificate), []byte(route.Spec.TLS.Key))
if err != nil {
r.Error("AGL0365", err, fmt.Sprintf("Route '%s' key and certficate do not match: %s. The router will be unable to pass traffic using this route.", route.ObjectMeta.Name, err))
Member

certificate

@jcantrill jcantrill deleted the 207_agl_diagnostics branch October 5, 2016 14:12
jcantrill added a commit to jcantrill/origin that referenced this pull request Oct 5, 2016
jcantrill added a commit to jcantrill/origin that referenced this pull request Oct 5, 2016