Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Markdown / Jupyter markup not getting counted #937

Merged
merged 11 commits into from
Jan 24, 2023

Conversation

spenserblack
Copy link
Collaborator

@spenserblack spenserblack commented Jan 19, 2023

This allows an optional definition of the line type for a language. If it is defined, it will sum them together. Defaults to using .code. By using this definition, comments are included in the type of lines of code to be summed for Markdown and Jupyter Notebooks.

This also changes the summing of child languages to be recursive, so that deeply nested code (e.g. Bash in Markdown in Jupyter) can be counted.

Fixes #933

@spenserblack
Copy link
Collaborator Author

spenserblack commented Jan 19, 2023

I can add a test or two if needed.

Edit: Added a couple of tests.

@spenserblack spenserblack marked this pull request as draft January 19, 2023 16:28
@spenserblack spenserblack force-pushed the feature/933/line-types branch 2 times, most recently from 811b320 to 5ed5031 Compare January 19, 2023 16:31
This allows an optional definition of the line type for a language. If
it is defined, it will sum them together. Defaults to using `.code`.

Fixes o2sh#933
src/info/langs/mod.rs Outdated Show resolved Hide resolved
languages.yaml Outdated Show resolved Hide resolved
src/info/langs/language.tera Outdated Show resolved Hide resolved
Copy link
Owner

@o2sh o2sh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

Left a few remarks.

Don't forget to apply the same logic to the get_total_loc function.

spenserblack and others added 2 commits January 20, 2023 22:47
"compile" is a bit of a misnomer, since it's related to any compilation,
but merely counts the number of lines of code for a language.

Co-authored-by: Ossama Hjaji <ossama-hjaji@live.fr>
Co-authored-by: Ossama Hjaji <ossama-hjaji@live.fr>
@spenserblack
Copy link
Collaborator Author

spenserblack commented Jan 20, 2023

Thanks for the review! I'll probably get to get_total_loc over the weekend.

spenserblack and others added 2 commits January 21, 2023 09:34
This references internal logic, and is unnecessarily wordy. Instead,
this field is now described in CONTRIBUTING to explain what it is and
what values it can take.

Co-authored-by: Ossama Hjaji <ossama-hjaji@live.fr>

let has_children = !language.children.is_empty();

if has_children {
for reports in language.children.values() {
for stats in reports.iter().map(|r| r.stats.summarise()) {
code += stats.code;
loc += stats.code;
Copy link
Owner

@o2sh o2sh Jan 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

language::loc should probably also be applied here in case one of the children language happens to be Markdown.

Jupyter notebooks is a good example:

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I don't use Jupyter notebooks, so I thought Markdown was the "top level" code. Could that Markdown then contain its own code blocks? If so, perhaps this function should be recursive for deeply nested code blocks?

Copy link
Owner

@o2sh o2sh Jan 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokei's parser is indeed recursive, as stated in its changelog:

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deeply nested stats are now recursively counted with 939e9e1

languages.yaml Outdated Show resolved Hide resolved
@spenserblack spenserblack marked this pull request as draft January 22, 2023 14:14
@spenserblack
Copy link
Collaborator Author

spenserblack commented Jan 22, 2023

Oops, I thought that I would be labelled the committer and you would be the author of f41bc02. I guess codespaces behave differently when --author= is used. Hope you don't mind that I made a commit that looks like you.

@spenserblack
Copy link
Collaborator Author

As something to do in the future, I think it would be nice to collect code samples like linguist and tokei do, and perform some snapshot tests on the CLI output using those samples. As a lazy test, I used

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "```bash\n",
    "echo 'Hello, World!'\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}

And got 4 lines as expected (Jupyter contains Markdown, Markdown contains 3 comment lines and 1 line of Bash).

spenserblack and others added 3 commits January 23, 2023 02:09
All lines are counted from the language's children.

Co-authored-by: Ossama Hjaji <ossama-hjaji@live.fr>
`.summarise()` accomplishes what recursion was doing.

Co-authored-by: Ossama Hjaji <ossama-hjaji@live.fr>
@spenserblack
Copy link
Collaborator Author

Just checked the source code of .summarise, and it looks like we'll lose some granularity.
Before aa31a72, a comment in a bash code block in Markdown in Jupyter would be ignored, since only code is counted for Bash. Summarise would combine the CodeStats from the Bash with the Markdown, making all comments get counted, as long as the 1st-level child is Markdown.

Which may be preferable behavior (it makes sense to me IMO), but I just want to throw that out there.

Comment on lines 115 to 139
pub fn loc(language_type: &tokei::LanguageType, language: &tokei::Language) -> usize {
let loc = match language_type {
{% for language, attrs in languages -%}
{%- set line_types = attrs.line_types | default(value=['code']) -%}
tokei::LanguageType::{{ language }} => language.{{ line_types.0 }}{% for line_type in line_types | slice(start=1) %} + language.{{ line_type }}{% endfor %},
{% endfor %}
_ => unimplemented!("Language Type {:?}", language_type),
};
loc + language.children.iter().fold(0, |sum, (lang_type, reports)| {
sum + reports.iter().fold(0, |sum, report| sum + stats_loc(lang_type, &report.stats))
})
}

/// Counts the lines-of-code of a tokei `Report`. This is the child of a
/// `tokei::CodeStats`.
pub fn stats_loc(language_type: &tokei::LanguageType, stats: &tokei::CodeStats) -> usize {
let stats = stats.summarise();
match language_type {
{% for language, attrs in languages -%}
{%- set line_types = attrs.line_types | default(value=['code']) -%}
tokei::LanguageType::{{ language }} => stats.{{ line_types.0 }}{% for line_type in line_types | slice(start=1) %} + stats.{{ line_type }}{% endfor %},
{% endfor %}
_ => unimplemented!("Language Type {:?}", language_type),
}
}
Copy link
Owner

@o2sh o2sh Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you think it would be better to keep the code logic outside of the tera template? I'm refering to lines 123-125 and 131.

We could probably extract those lines and put them back in get_language_distribution() between line 34 and 36:

let has_children = !language.children.is_empty();

if has_children {
    for (lang_type, reports) in language.children.iter() {
        loc += reports
            .iter()
            .map(|r| r.stats.summarise())
            .fold(0, |sum, stats| sum + language::stats_loc(lang_type, &stats));
    }
}

this would simplify the template but more importantly avoid obfuscating part of the code logic.

Leaving two methods loc and stats_loc inside the template which would be almost identical except for one of their parameters. I wonder if there is a way we could combine the two methods 🤔 (maybe in the future)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Yeah, it does obfuscate it a bit by hiding the logic in the template. But my thinking was that I'd like to avoid needing to repeat similar logic (adding the LOC of the children) for both language distributions and total line counts. Since these are kind of private methods, maybe they should be #[doc(hidden)] and called __loc(), and have more visible methods called loc() and stats_loc() in language.rs that call these methods and act as wrappers.

fn loc(language_type: &tokei::LanguageType, language: &tokei::Language) -> usize {
    __loc(language_type, language) + language
        .children 
        // ...

What do you think?

Copy link
Owner

@o2sh o2sh Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned in your comment #937 (comment)
We do need additional tests covering more complex cases (language with children f. ex.) to make sure we don't break things. Should not block this PR, but preferably before the next release.

Copy link
Owner

@o2sh o2sh Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But my thinking was that I'd like to avoid needing to repeat similar logic (adding the LOC of the children) for both language distributions and total line counts.

You're right, I missed that 🤦

Since these are kind of private methods, maybe they should be #[doc(hidden)] and called __loc(), and have more visible methods called loc() and stats_loc() in language.rs that call these methods and act as wrappers.

I'm sorry, but I'm lost - too many loc 😅 - maybe you can commit your suggestion to illustrate your idea. But yeah, if we could avoid putting logic inside the tera template, that would be great. we could probably factorize this part inside a method:

let mut loc = language::loc(language_name, language);

let has_children = !language.children.is_empty();

if has_children {
    for (lang_type, reports) in language.children {
        loc += reports
            .iter()
            .map(|r| r.stats.summarise())
            .fold(0, |sum, stats| sum + language::stats_loc(&lang_type, &stats));
    }
}

and call it for both language distributions and total line counts.

But it's your call, do what you think is best 👍 I think this PR is already pretty good.

Copy link
Collaborator Author

@spenserblack spenserblack Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can commit your suggestion to illustrate your idea.

Sure! 64077e7 is the basic idea. __<function> contains the repetitive code that the template manages, and <function> is a wrapper that calls that function and includes some helper code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably add a test for Jupyter code and call it on get_total_loc. That way loc and stats_loc is 100% covered. __loc's template would be 100% covered, but the generated code would probably be <10% with each language being a branch 😆
I think I'll just hardcode the structs necessary to simulate Jupyter code.

Copy link
Owner

@o2sh o2sh Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect 🎊 , thanks @spenserblack

This adds public wrapper functions to the templated functions and makes
them private. This makes it so as much code is in actual Rust code, and
leaves the template to only manage the repetitive code.
@o2sh o2sh merged commit 5379ecd into o2sh:main Jan 24, 2023
@spenserblack spenserblack deleted the feature/933/line-types branch January 24, 2023 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Markdown is ignored if it doesn't contain code blocks
2 participants