a workflow to remove a repository after ingestion? #15

Open
gasche opened this issue May 13, 2021 · 1 comment

@gasche
Contributor

gasche commented May 13, 2021

Several times now I have ingested a large set of repositories, looked at the data, and noticed oddities caused by a repository that should not have been included in the first place.

Is there a workflow to remove a repository from the database, and rerun the plotting?

Currently I don't know of such a workflow, so I manually remove the repository, delete the database, and restart ingestion from scratch. This is ok, but it can be annoying when ingestion is slow (several minutes on large repository sets).

I thought about running sqlite on the database and issuing a `DELETE` on all `raw_commits` rows coming from that repository. However, if I understand correctly, the plotting data comes from the `authors` table, which I would then need to update with new aggregates, and I don't know how to do that easily.

Assuming this does not currently exist, my proposal would be to have a command `fornalder reanalyze foo.db` that would drop the current `authors` table and recompute it from the `raw_commits` table as it currently exists.

(Another option of course would be to have a `fornalder repo-remove foo.db repo.git` command that removes a repository from a table, instead of adding it as `fornalder ingest foo.db repo.git` does. But that sounds like more work.)

@hpjansson
Owner

The `authors` table is re-derived from `raw_commits` on every run, so it should be safe to poke around in the latter. See:

fornalder/src/commitdb.rs

Lines 204 to 233 in 43f3d48

// Generate table with per-author stats like time of first and
// last commit.
self.conn.execute ("drop table authors;", NO_PARAMS).ok();
self.conn.execute ("
    create table authors as
        select author_name,
               first_time,
               first_year,
               last_time,
               last_year,
               last_time-first_time as active_time,
               n_commits,
               n_changes
        from
        (
            select author_name,
                   min(author_time) as first_time,
                   min(author_year) as first_year,
                   max(author_time) as last_time,
                   max(author_year) as last_year,
                   count(id) as n_commits,
                   sum(n_insertions) + sum(n_deletions) as n_changes
            from raw_commits
            group by author_name
        );
    create index index_author_name on authors (author_name);
    create index index_first_time on authors (first_time);
    create index index_active_time on authors (active_time);
", NO_PARAMS).chain_err(|| "Could not create author summaries")?;

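To illustrate why editing `raw_commits` and re-running is enough, here is a minimal sketch that runs the aggregate query quoted above on a toy table, using Python's built-in `sqlite3` module. The toy schema contains only the columns the query references, the sample rows are invented, the indexes are omitted, and `drop table if exists` stands in for the ignored error on `drop table`. It is not fornalder's actual code path.

```python
import sqlite3

# The aggregate query from the quoted snippet, minus the indexes.
# "drop table if exists" replaces the original's ignored drop error.
REBUILD_AUTHORS = """
drop table if exists authors;
create table authors as
    select author_name, first_time, first_year, last_time, last_year,
           last_time - first_time as active_time, n_commits, n_changes
    from (
        select author_name,
               min(author_time) as first_time,
               min(author_year) as first_year,
               max(author_time) as last_time,
               max(author_year) as last_year,
               count(id) as n_commits,
               sum(n_insertions) + sum(n_deletions) as n_changes
        from raw_commits
        group by author_name
    );
"""

SUMMARY = ("select author_name, active_time, n_commits, n_changes "
           "from authors order by author_name")

conn = sqlite3.connect(":memory:")
# ASSUMPTION: simplified toy schema, not the real raw_commits definition.
conn.execute("""
    create table raw_commits (
        id integer primary key,
        author_name text,
        author_time integer,
        author_year integer,
        n_insertions integer,
        n_deletions integer
    )
""")
conn.executemany(
    "insert into raw_commits (author_name, author_time, author_year,"
    " n_insertions, n_deletions) values (?, ?, ?, ?, ?)",
    [("alice", 100, 2019, 10, 2),
     ("alice", 500, 2021, 5, 5),
     ("bob", 300, 2020, 1, 1)],
)
conn.commit()

conn.executescript(REBUILD_AUTHORS)
before = conn.execute(SUMMARY).fetchall()

# Edit raw_commits by hand, then rebuild the derived table from scratch.
conn.execute("delete from raw_commits where author_time = 500")
conn.commit()
conn.executescript(REBUILD_AUTHORS)
after = conn.execute(SUMMARY).fetchall()

print(before)
print(after)
```

Because `authors` is dropped and recreated from whatever `raw_commits` currently holds, the second rebuild reflects the deleted row without any incremental bookkeeping.
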
I intended to re-run `postprocess()` only if something changed (e.g. by storing a hash of the provided meta file and clearing a flag whenever a fornalder command like `ingest` changes the database), but it wasn't too slow in practice, so I didn't feel the need to optimize it, at least not yet. I left a reminder here:

cdb.postprocess(&meta.domains)?; // FIXME: Skip if metadata is unchanged
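The skip-if-unchanged idea described above could look roughly like the sketch below, written in Python with the built-in `sqlite3` and `hashlib` modules rather than in fornalder's Rust. Everything here is hypothetical: the `state` table, its `meta_hash` and `clean` keys, and the three helper functions are invented for illustration and do not exist in fornalder.

```python
import hashlib
import sqlite3

def _ensure_state(conn):
    # HYPOTHETICAL bookkeeping table; not part of fornalder's schema.
    conn.execute(
        "create table if not exists state (key text primary key, value text)"
    )

def needs_postprocess(conn, meta_bytes):
    """True if the meta file changed or the database was modified."""
    _ensure_state(conn)
    digest = hashlib.sha256(meta_bytes).hexdigest()
    stored = dict(conn.execute("select key, value from state"))
    return stored.get("meta_hash") != digest or stored.get("clean") != "1"

def mark_postprocessed(conn, meta_bytes):
    """Record the meta hash and set the flag after a successful rebuild."""
    _ensure_state(conn)
    digest = hashlib.sha256(meta_bytes).hexdigest()
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "insert or replace into state (key, value) values (?, ?)",
            [("meta_hash", digest), ("clean", "1")],
        )

def mark_dirty(conn):
    """Clear the flag; an ingest-like command would call this."""
    _ensure_state(conn)
    with conn:
        conn.execute(
            "insert or replace into state (key, value) values ('clean', '0')"
        )

conn = sqlite3.connect(":memory:")
meta = b"example meta file contents"

checks = [needs_postprocess(conn, meta)]            # nothing recorded yet
mark_postprocessed(conn, meta)
checks.append(needs_postprocess(conn, meta))        # unchanged since rebuild
mark_dirty(conn)
checks.append(needs_postprocess(conn, meta))        # database was modified
mark_postprocessed(conn, meta)
checks.append(needs_postprocess(conn, meta + b"!")) # meta file changed
print(checks)
```

One caveat with such a flag: a manual edit made directly through sqlite would bypass `mark_dirty`, so unconditional re-derivation is what currently keeps hand edits safe.
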

Anyway, the bottom line is that manually editing `raw_commits` is safe, for now.
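As a minimal sketch of such a manual edit, the script below deletes one repository's rows from `raw_commits` using Python's built-in `sqlite3` module. The column name `repo_name` and the helper `remove_repo` are assumptions made for illustration; check the real schema first (for example with `.schema raw_commits` in the sqlite3 shell) and back up the database file before trying this on real data. The demo uses an in-memory table with a simplified schema so that it runs on its own.

```python
import sqlite3

def remove_repo(conn, repo_name):
    """Delete all raw_commits rows for one repository; return the row count.

    ASSUMPTION: raw_commits identifies the repository in a column named
    `repo_name`. Verify this against the actual schema before use.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "delete from raw_commits where repo_name = ?", (repo_name,)
        )
    return cur.rowcount

# Self-contained demo on a toy table (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table raw_commits ("
    " id integer primary key, repo_name text, author_name text)"
)
conn.executemany(
    "insert into raw_commits (repo_name, author_name) values (?, ?)",
    [("good.git", "alice"), ("bad.git", "bob"), ("bad.git", "carol")],
)
conn.commit()

removed = remove_repo(conn, "bad.git")
remaining = conn.execute("select count(*) from raw_commits").fetchone()[0]
print(removed, remaining)
```

For a real database, the in-memory connection would be replaced by `sqlite3.connect("foo.db")`, followed by re-running the usual plotting step so that `authors` is rebuilt from the edited `raw_commits`.
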

I like the idea of having CLI commands for common database edits (like removing a repo, or maybe a date range). Let's keep this issue open for `repo-remove` (or `remove-repo`?).
