Still dups in search with 1.9.4 #4317

Closed
ChristopheBordieu opened this issue Apr 25, 2023 · 16 comments
@ChristopheBordieu

Hi @vladak

On one of my PRD OpenGrok instances, I have installed 1.9.4 and reindexed everything from scratch.
Users report they still see dups when searching :-(
The same issue occurs on another UAT instance running 1.9.4, where a user found a dup (after first finding one on the related PRD instance running 1.7.35, which indexes the same code).

What could I do to try to understand the issue?

@vladak
Member

vladak commented Apr 25, 2023

Firstly, it would help to get more information about the dups, i.e. quantify them and see how they change over time. Ideally track when they started appearing after the reindex from scratch, esp. w.r.t. periodic reindex events.

@ChristopheBordieu
Author

ChristopheBordieu commented Apr 26, 2023

It is very difficult for me to quantify the dups and tell when they started to appear, because of the size of the data indexed.

Regarding a case that has been reported on my UAT instance:
I see no dups in the history cache. That is a good sign, since on PRD (1.7.35) I have plenty of history cache dups for the same file.

opengrok@[uat-1.9.4]:/srv/opengrok-var/historycache/QWERTY/azerty$ zgrep -A 1 '<void property="revision">' Forest.xml.gz  | grep string | sed -e 's/<string>//' -e 's/<\/string>//' | sort | uniq -c | grep -v " 1 " | wc -l
0
opengrok@[prd-1.7.35]:/srv/opengrok-var/historycache/QWERTY/azerty$ zgrep -A 1 '<void property="revision">' Forest.xml.gz  | grep string | sed -e 's/<string>//' -e 's/<\/string>//' | sort | uniq -c | grep -v " 1 " | wc -l
55

What I don't understand is the number of commits for the duplicated file:

opengrok@[uat-1.9.4]:/srv/opengrok-src/QWERTY/azerty$ git log --oneline --merges -- Forest.xml | wc -l
116
opengrok@[uat-1.9.4]:/srv/opengrok-src/QWERTY/azerty$ git log --oneline -- Forest.xml | wc -l
1154
opengrok@[uat-1.9.4]:/srv/opengrok-src/QWERTY/azerty$ cd -
/srv/opengrok-var/historycache/QWERTY/azerty
opengrok@[uat-1.9.4]:/srv/opengrok-var/historycache/QWERTY/azerty$ zgrep -A 1 '<void property="revision">' Forest.xml.gz  | grep string | sed -e 's/<string>//' -e 's/<\/string>//' | sort | wc -l
1112

1154 - 116 = 1038, so I would have expected 1038 distinct commits in the history cache file, not 1112.
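For what it's worth, git can produce the non-merge count directly with `--no-merges` rather than by subtracting the two numbers; a sketch against the same repository:

```shell
# Non-merge commits touching the file; by definition this equals
# (all commits touching the file) minus (merge commits touching it).
git log --oneline --no-merges -- Forest.xml | wc -l
```

The invariant total = merges + non-merges holds for any path limit, so this is a quick cross-check of the arithmetic above.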

@ChristopheBordieu
Author

When searching, Tomcat logs for file duplicated:

26-Apr-2023 16:24:01.976 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 12,602: /QWERTY/azerty/Forest.xml
26-Apr-2023 16:24:01.976 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 12,773: /QWERTY/azerty/Forest.xml

@ChristopheBordieu
Author

My mistake... I closed the issue.

@vladak
Member

vladak commented Apr 27, 2023

Is QWERTY a Git repository? What settings are enabled for it (esp. merge commits, renamed files)?

As for the discrepancy between the git log output and the history cache, the two are not always 1:1 because of the way the history cache is constructed: the history for the top-level directory of the given repository is retrieved, and then a reverse hash map (file -> list of commits) is created to store the history cache entries for individual files. I commented on that recently in a discussion, but cannot find it right now.
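Purely as an illustration (the actual construction happens in Java inside the indexer, not in shell), the reverse mapping can be sketched with awk over `git log --name-only`-style output; the sample input below is made up:

```shell
# Build a file -> list-of-commits map from repository-wide history,
# mimicking how per-file history cache entries are derived.
# Input: an abbreviated commit hash on its own line, followed by the
# files it touched (the shape `git log --pretty=format:%h --name-only` prints).
awk '
	length($0) == 7 && $0 ~ /^[0-9a-f]+$/ { rev = $0; next }
	NF { files[$0] = files[$0] " " rev }
	END { for (f in files) print f ":" files[f] }
' <<'EOF'
3b039b1
Makefile
components/python/pyasn1/Makefile
170ebb4
components/python/pyasn1/Makefile
EOF
```

This prints, in unspecified order, `Makefile: 3b039b1` and `components/python/pyasn1/Makefile: 3b039b1 170ebb4` — each file ends up with the list of commits that touched it.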

@vladak
Member

vladak commented Apr 27, 2023

Also, what do the changes for /QWERTY/azerty/Forest.xml look like? The contents of the file are not relevant; I am after how it is being changed in terms of changesets.

Given that I am not able to reproduce this locally, I will need some extensive logging and problem pinpointing on your side.

@ChristopheBordieu
Author

> Is QWERTY a Git repository? What settings are enabled for it (esp. merge commits, renamed files)?
>
> As for the discrepancy between the git log output and the history cache, the two are not always 1:1 because of the way the history cache is constructed: the history for the top-level directory of the given repository is retrieved, and then a reverse hash map (file -> list of commits) is created to store the history cache entries for individual files. I commented on that recently in a discussion, but cannot find it right now.

Yes, repository azerty in project QWERTY is a Git repository.
Renamed file handling is supposed to be off, but I will check.
The same goes for merge commits.

@vladak
Member

vladak commented May 3, 2023

I'd propose the following strategy for identifying the pattern that leads to the duplicate documents:

  • set up a separate test environment with just a single project (one where the dups were detected in the past)
  • increase the log level for the indexer to FINEST, using a logging configuration that maintains a separate log file for each indexer run and keeps the log files from rotating away
  • run the index check after each indexing using --checkIndex documents

As for the mirroring, one way would be to let it run organically, i.e. let the incoming changes to the repository be brought in as they happen in the origin. The alternative would be to start at a certain changeset in the repository and add a bunch of changesets at a time, followed by a reindex and an index check. Repeat until duplicates appear in the index.

@vladak vladak added the bug label May 31, 2023
@tarangchikhalia
Contributor

This is happening in OpenGrok version 1.12.12 as well. I will try to provide the details asked for in the previous comment.

@vladak
Member

vladak commented Jan 22, 2024

We have hit this problem with a Mercurial repository on 1.12.12 (i.e. history based reindex enabled and in action). The problem became apparent after the upgrade to 1.12.28: in 1.12.15 the AnalyzerGuru version was bumped due to the addition of proper YAML support, and this revealed checkSettings() being called for documents that have no matching file under the source root, i.e. documents that should have been deleted.

Looking closer, there was also a case of multiple checkSettings() calls for files with the same path (different from the missing file), i.e. the existence of duplicate documents, which I confirmed by running --checkIndex documents. Tracing the origin of these changes, the files were changed in a particular changeset that renamed them (the indexer is running with renamed file handling on).

The bottom line is: the duplicate document problem still exists, and there is also an angle with documents not being deleted when they should be, although the cause is probably similar.

@vladak
Member

vladak commented Jan 26, 2024

Managed to replicate the problem with https://github.com/oracle/solaris-userland/ using this script (needs #4535 to avoid failing the document check due to zero live documents on the initial indexing):

#!/bin/bash

#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
#
# based on git-commit-hopping.sh but with added randomness

# set -x
set -e

repo_url="https://github.com/oracle/solaris-userland/"
initial_rev=32c0d9faed7b049872ca9bd78f9bf3e901cff482	# from 2022

src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"

# Assumes built OpenGrok.
function run_indexer()
{
        # TODO: store the log somewhere
        java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
            org.opengrok.indexer.index.Indexer \
            -c /usr/local/bin/ctags \
	    --economical \
            -H -S -P -s "$src_root" -d "$data_root" \
	    -W "$data_root/config.xml" \
	    >/dev/null 2>&1

	java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
            org.opengrok.indexer.index.Indexer \
	    -H \
	    -R "$data_root/config.xml" \
	    --checkIndex documents
}

function get_next()
{
	ids=( $(git log --oneline --reverse ..origin/master | awk '{ print $1 }') )
	size="${#ids[@]}"
	modulo=16
	if (( size == 0 )); then
		echo ""
		return
	fi
	if (( modulo > size )); then
		modulo=$size
	fi
	n=$(( RANDOM % modulo ))
	if (( n == 0 )); then
		n=1
	fi
	# guard against indexing past the end when few changesets remain
	if (( n >= size )); then
		n=$(( size - 1 ))
	fi
	echo "${ids[$n]}"
}

project_root="$src_root/solaris-userland"
if [[ ! -d $src_root ]]; then
	echo "Cloning $repo_url to source root"
	git clone $repo_url "$project_root"
fi

echo "Removing data root"
rm -rf "$data_root"

cd "$project_root"
git checkout "$initial_rev"

while [[ 1 ]]; do
	rev=$(get_next)
	if [[ -z $rev ]]; then
		break
	fi
	echo "Checking out $rev"
	git checkout -q $rev
	run_indexer
done

It ended after just 4 iterations with:

Jan 26, 2024 6:42:55 PM org.opengrok.indexer.index.IndexCheck checkProjectsParallel
WARNING: index check failed
index '/var/tmp/data.opengrok-issue-4317/index/solaris-userland' failed document check: duplicate paths = {/solaris-userland/components/python/pyasn1-modules/Makefile=2}
	at org.opengrok.indexer.index.IndexCheck.checkDocuments(IndexCheck.java:590)
	at org.opengrok.indexer.index.IndexCheck.checkDir(IndexCheck.java:274)
	at org.opengrok.indexer.index.IndexCheck.checkDirWithLogging(IndexCheck.java:244)
	at org.opengrok.indexer.index.IndexCheck.lambda$checkProjectsParallel$0(IndexCheck.java:152)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

which matches my internal observations.

For the record, here is the sequence of changesets: 32c0d9fae (no index), 1870f2259, 3b039b16f, a45ee5bf6, 87ffc28c4

@vladak
Member

vladak commented Jan 29, 2024

The cause of the above is that in the last reindex for changeset 87ffc28c4, the list of files returned from the FileCollector contained 2 similar paths, in this order:

  • /solaris-userland/components/python/pyasn1-modules/Makefile
  • /solaris-userland/components/python/pyasn1/Makefile

The ordering of the uidIter is reversed, which means that when the uidIter is on the latter and the former comes into processFileIncremental(), the document is skipped, so there is no call to removeFile(); however, the file is still added to the IndexDownArgs, leading to the duplicate document.

The trouble lies here:

* Assumes comparing in the same way as {@code org.opengrok.indexer.index.IndexDatabase#FILENAME_COMPARATOR}.

Rather than assuming, the set should be returned unsorted and IndexDatabase should perform the sorting itself.
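The string-vs-component ordering mismatch is easy to reproduce outside OpenGrok: byte-wise sorting (what a plain string compare does) puts the longer directory name first, because '-' (0x2d) sorts before the path separator '/' (0x2f). Using the two paths from the comment above:

```shell
# Plain byte-wise (C locale) sort: pyasn1-modules comes out first,
# since '-' (0x2d) compares lower than the path separator '/' (0x2f).
printf '%s\n' \
    '/solaris-userland/components/python/pyasn1/Makefile' \
    '/solaris-userland/components/python/pyasn1-modules/Makefile' \
    | LC_ALL=C sort
# first line printed: /solaris-userland/components/python/pyasn1-modules/Makefile
```

Component-wise, pyasn1 should sort before pyasn1-modules, which is exactly the order the FileCollector produced — hence the two sides disagree.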

@vladak
Member

vladak commented Jan 29, 2024

There is another problem on

&& Util.uid2url(uidIter.term().utf8ToString()).compareTo(path) <= 0) {
- this also needs to compare the paths using path components, not the whole path as a string.

Otherwise a path that, at the position where the other path has its separator ('/'), contains a character that sorts lower will compare as lesser even though, component-wise, it should not.
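A shell-level illustration of component-wise ordering (the actual fix of course belongs in the Java comparator): mapping the separator to a byte that sorts below every printable character makes the end of a shorter component compare lower than any character continuing a longer name:

```shell
# Map '/' to 0x01 before sorting so that the end of a path component
# always compares lower than any character continuing a longer name,
# then restore the separator afterwards.
printf '%s\n' \
    '/solaris-userland/components/python/pyasn1-modules/Makefile' \
    '/solaris-userland/components/python/pyasn1/Makefile' \
    | tr '/' '\001' | LC_ALL=C sort | tr '\001' '/'
# first line printed: /solaris-userland/components/python/pyasn1/Makefile
```

This yields the opposite order from the plain byte-wise sort, matching the per-component semantics the index traversal needs.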

vladak added a commit to vladak/OpenGrok that referenced this issue Jan 29, 2024
@vladak vladak self-assigned this Jan 29, 2024
@vladak
Member

vladak commented Jan 31, 2024

There is some trickery involved with fixing this. After making sure the paths coming from the FileCollector to IndexDatabase are properly sorted based on path components and the same comparator is also used when traversing the index in IndexDatabase#processFileIncremental(), rerunning a test case multiple times still makes it fail with duplicate documents sometimes.

What seems to be happening in the failing case is that the path later identified as duplicate goes through IndexDatabase#processFileIncremental(), and there is a document with that path that is deleted and therefore skipped. There does not seem to be a live document matched for that path in the cycle (something I have yet to verify), so the document is merely added, not updated, leading to the duplicate.

I think the reason this is not 100% reproducible is that the heavy parallelization done in the 2nd phase of indexing leads to different ordering of how documents are added/deleted in the index, possibly leading to segment reduction or the lack of it. In the cases where the same test passes, there are no deleted documents for the paths that otherwise cause the document verification failure in the last reindex.

For the record, here's the test script:

#!/bin/bash

#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
# (more isolation)
#
# based on ~/opengrok-issue-4317.sh sans the randomness.
#
# It tests a particular sequence of changes that **SOMETIMES** leads
# to duplicate documents.
#


set -e

repo_url="https://github.com/oracle/solaris-userland/"

src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"

# Assumes built OpenGrok.
function run_indexer()
{
        java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
	    -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
            org.opengrok.indexer.index.Indexer \
            -c /usr/local/bin/ctags \
	    --economical \
            -H -S -P -s "$src_root" -d "$data_root" \
	    -W "$data_root/config.xml" \
	    >/var/tmp/opengrok-issue-4317-$1.log 2>&1

	java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
            org.opengrok.indexer.index.Indexer \
	    -H \
	    -R "$data_root/config.xml" \
	    --checkIndex documents
}

project_root="$src_root/solaris-userland"
if [[ ! -d $src_root ]]; then
	echo "Cloning $repo_url to source root"
	git clone $repo_url "$project_root"
fi

echo "Removing data root"
rm -rf "$data_root"

echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log

cd "$project_root"
git checkout -q master

ids="3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8"

for id in $ids; do
	echo "### $id"
	git checkout -q $id
	#if [[ $id != "4f8fe9ee8" ]]; then
	#	run_indexer $id
	#fi
	run_indexer $id
done

@vladak
Member

vladak commented Feb 7, 2024

What is causing the latest test failure is a uid overlap/clash. In the iteration that leads to the duplicate document, a match for the file is found in IndexDatabase#processFileIncremental(); however, the uid was found among the deleted documents, so the iteration was short-circuited by the continue statement. The cycle should have gone through another, non-deleted document for the same path; however it did not, because that document shared the same uid with the deleted one. The cycle finished, the file was present, and therefore an IndexDownArgs entry was created, leading to the duplicate document.

The first part of the uid is the path; the second part is the time stamp, taken from the last modified time of the file. I have yet to figure out whether this is caused by my approach (using git checkout to switch to a given changeset; maybe I should have used git reset --hard) or whether it is inherent to the repository (in which case more fixing would be needed).
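A minimal sketch of such a clash (the uid here is simplified to path plus epoch seconds; the real encoding is internal to OpenGrok): when two checkouts leave the file with an identical mtime, a path+timestamp uid cannot distinguish the two versions:

```shell
# Two "versions" of the same path with identical last-modified time
# produce the same path+timestamp uid, so they cannot be told apart.
f=$(mktemp)
touch -t 202401011200.00 "$f"              # first checkout
uid1="/QWERTY/azerty/Forest.xml $(stat -c %Y "$f")"
touch -t 202401011200.00 "$f"              # second checkout, same mtime
uid2="/QWERTY/azerty/Forest.xml $(stat -c %Y "$f")"
[ "$uid1" = "$uid2" ] && echo "uid clash"
rm -f "$f"
```

The path is borrowed from the earlier comments; `stat -c %Y` (GNU coreutils) prints the mtime in epoch seconds.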

@vladak
Member

vladak commented Feb 9, 2024

Git does not preserve/restore file time stamps, so this behavior is sort of expected. Therefore, the last failure was indeed caused by sloppy test setup. A detailed analysis is at https://gist.github.com/vladak/351f135c1f9b0ddc979be70ceaa20133. That said, to avoid such corruption in the future (not related to this issue, I believe) I added a uid/date check to the history based processing. Running the updated script (below) with an indexer that performs the check survived 16 iterations; with the previous version of the script it failed after a handful of iterations.

#!/bin/bash

#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
# (more isolation)
#
# based on ~/opengrok-issue-4317.sh sans the randomness.
#
# v4: touch the files to avoid duplicate uids
#
# It tests a particular sequence of changes that **SOMETIMES** leads
# to duplicate documents.
#


set -e

repo_url="https://github.com/oracle/solaris-userland/"

src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"

# Assumes built OpenGrok.
function run_indexer()
{
        java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
	    -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
            org.opengrok.indexer.index.Indexer \
            -c /usr/local/bin/ctags \
	    --economical \
            -H -S -P -s "$src_root" -d "$data_root" \
	    -W "$data_root/config.xml" \
	    >/var/tmp/opengrok-issue-4317-$1.log 2>&1

	java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
	    -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
            org.opengrok.indexer.index.Indexer \
	    -H \
	    -R "$data_root/config.xml" \
	    --checkIndex documents \
	    >/var/tmp/opengrok-issue-4317-$1.check.log 2>&1
}

project_root="$src_root/solaris-userland"
if [[ ! -d $src_root ]]; then
	echo "Cloning $repo_url to source root"
	git clone $repo_url "$project_root"
fi

echo "Removing data root"
rm -rf "$data_root"

echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log

cd "$project_root"
git checkout -q master
# Establish a common time base line.
find "$src_root/" -type f -exec touch {} \;

ids="3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8"

for id in $ids; do
	echo "### $id"
	git checkout -q $id
	# Git does not preserve/restore file time stamps so simulate a git pull.
	# Ideally this should be done only for the "incoming" files, however it suffices for this use case.
	find "$src_root/" -type f -exec touch {} \;

	#if [[ $id != "4f8fe9ee8" ]]; then
	#	run_indexer $id
	#fi
	run_indexer $id
done

I am going to modify the original randomness-based script and let it run for an extended period to see if this really sticks.

vladak added a commit to vladak/OpenGrok that referenced this issue Feb 13, 2024
vladak added a commit to vladak/OpenGrok that referenced this issue Feb 13, 2024
@vladak vladak closed this as completed in 520ce3b Feb 19, 2024