bug about remove_markup #3520

Open
seadog-www opened this issue Mar 28, 2024 · 2 comments

seadog-www commented Mar 28, 2024

Problem description

After calling gensim.corpora.wikicorpus.filter_wiki, some text that should have been stripped is still present.

RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)

Before RE_P1 is applied, the following tags should be stripped:

re.compile('(?:<br />|<br/>|<nowiki/>)', re.DOTALL | re.UNICODE)
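A minimal sketch of the failure mode, using only the standard re module. RE_P1 is copied from the line above; the sample string is made up for illustration and is not taken from the dump.

```python
import re

# RE_P1 as quoted in this report.
RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)

# Illustrative sample: a <ref> whose body contains a self-closing <nowiki/>.
sample = 'A<ref>note<nowiki/>http://example.org/x</ref>B'

# The lazy match stops at the first "/>", which belongs to <nowiki/>,
# not to the <ref> element.
removed = RE_P1.search(sample).group(0)
remaining = RE_P1.sub('', sample)

print(removed)    # <ref>note<nowiki/>
print(remaining)  # Ahttp://example.org/x</ref>B
```

The leftover </ref> is presumably removed by a later step, which would explain why only the URL survives in the output below.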

Steps/code/corpus to reproduce

import re
from gensim.corpora.wikicorpus import filter_wiki

# https://zh.wikipedia.org/wiki/%E5%90%89%E6%9E%97%E7%9C%81
s1 = '''
{{seealso|吉林省各市州人口列表}}
2022年末,全省总人口为2347.69万人<ref>吉林省2022年人口<nowiki/>https://www.hongheiku.com/sjrk/1059.html</ref>,其中城镇常住人口1496.18万人,占总人口比重(常住人口城镇化率)为63.73%,比上年末提高0.37个百分点。户籍人口城镇化率为49.08%。全年出生人口10.23万人,出生率为4.33‰;死亡人口19.84万人,死亡率为8.40‰;自然增长率为-4.07‰。人口性别比为99.83(以女性为100)。
'''
# https://zh.wikipedia.org/wiki/%E7%BB%8F%E6%B5%8E%E5%AD%A6
s2 = '''
羅賓斯認為,此定義注重的不是以經濟學「研究某些行為」,而是要以分析的角度去「研究行為是如何被資源有限的條件所改變」<ref>Robbins, Lionel (1932). ''An Essay on the Nature and Significance of Economic Science'', p. [http://books.google.com/books?id=nySoIkOgWQ4C&printsec=find&pg=PA16#v=onepage&q&f=false 16] {{Wayback|url=http://books.google.com/books?id=nySoIkOgWQ4C&printsec=find&pg=PA16#v=onepage&q&f=false |date=20130910062356 }}.</ref>。一些人批評此定義過度廣泛,而且無法將分析範疇侷限在對於市場的研究上。然而,自從1960年代起,由於理性選擇理論和其引發的[[賽局理論]]不斷將經濟學的研究領域擴張,這個定義已為世所認<ref name="Backhouse2009Stigler">Backhouse, Roger E., and Steven G. Medema (2009). "Defining Economics: The Long Road to Acceptance of the Robbins Definition", ''Economica'', 76(302), [http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0335.2009.00789.x/full#ss4 V. Economics Spreads Its Wings] {{Wayback|url=http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0335.2009.00789.x/full#ss4 |date=20130602222736 }}. [Pp. [http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0335.2009.00789.x/full 805–820] {{Wayback|url=http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0335.2009.00789.x/full |date=20130602222736 }}.]<br/>&nbsp;&nbsp; [[喬治·斯蒂格勒|Stigler, George J.]] (1984). "Economics—The Imperial Science?" ''Scandinavian Journal of Economics'', 86(3), pp. [http://www.jstor.org/pss/3439864 301]-313.</ref>,但仍有對此定義的批評<ref>Blaug, Mark (2007). "The Social Sciences: Economics", ''The New Encyclopædia Britannica'', v. 27, p. 343 [pp. 343–52].</ref>。
'''

print(filter_wiki(s1))
print(filter_wiki(s2))

print('=============')
# Workaround: strip <br /> and <nowiki/> before calling filter_wiki
RE_P = re.compile('(?:<br />|<br/>|<nowiki/>)', re.DOTALL | re.UNICODE)

print(filter_wiki(re.sub(RE_P, '', s1)))
print(filter_wiki(re.sub(RE_P, '', s2)))

output:

2022年末,全省总人口为2347.69万人https://www.hongheiku.com/sjrk/1059.html,其中城镇常住人口1496.18万人,占总人口比重(常住人口城镇化率)为63.73%,比上年末提高0.37个百分点。户籍人口城镇化率为49.08%。全年出生人口10.23万人,出生率为4.33‰;死亡人口19.84万人,死亡率为8.40‰;自然增长率为-4.07‰。人口性别比为99.83(以女性为100)。

羅賓斯認為,此定義注重的不是以經濟學「研究某些行為」,而是要以分析的角度去「研究行為是如何被資源有限的條件所改變」。一些人批評此定義過度廣泛,而且無法將分析範疇侷限在對於市場的研究上。然而,自從1960年代起,由於理性選擇理論和其引發的賽局理論不斷將經濟學的研究領域擴張,這個定義已為世所認   Stigler, George J. (1984). "Economics—The Imperial Science?" ''Scandinavian Journal of Economics'', 86(3), pp. 301-313.,但仍有對此定義的批評。

=============

2022年末,全省总人口为2347.69万人,其中城镇常住人口1496.18万人,占总人口比重(常住人口城镇化率)为63.73%,比上年末提高0.37个百分点。户籍人口城镇化率为49.08%。全年出生人口10.23万人,出生率为4.33‰;死亡人口19.84万人,死亡率为8.40‰;自然增长率为-4.07‰。人口性别比为99.83(以女性为100)。

羅賓斯認為,此定義注重的不是以經濟學「研究某些行為」,而是要以分析的角度去「研究行為是如何被資源有限的條件所改變」。一些人批評此定義過度廣泛,而且無法將分析範疇侷限在對於市場的研究上。然而,自從1960年代起,由於理性選擇理論和其引發的賽局理論不斷將經濟學的研究領域擴張,這個定義已為世所認,但仍有對此定義的批評。

Versions

Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Bits 64
NumPy 1.26.4
SciPy 1.12.0
gensim 4.3.2
FAST_VERSION 0

Wiki text is from zhwiki-20231201-pages-articles-multistream1.xml-p1p187712.bz2

gojomo (Collaborator) commented Apr 10, 2024

Thank you for your report & code-to-reproduce!

Looking over the code, the more basic problem may be that RE_P1 (for REF tags) assumes that any /> ends the tag, as if there would never be a nested tag. In your example fragments, though, nested <NOWIKI/> and <BR/> tags do appear. Your suggested fix looks like it would only remedy the problem in the few cases you've seen, while the unexpected nesting of any other tag ending in /> risks triggering the same problem.
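One way such a regex could avoid treating a nested /> as the end of the REF element is sketched below. This is an illustrative pattern, not gensim's code: the name RE_REF_NESTED_OK and the sample strings are hypothetical. The idea is to accept /> only when it closes the opening <ref ...> tag itself, and otherwise to require an explicit </ref>.

```python
import re

# Hypothetical replacement pattern (not part of gensim):
#   <ref ... />        self-closing form, "/>" must close the opening tag
#   <ref ...>...</ref> paired form, body may contain other tags such as <br/>
RE_REF_NESTED_OK = re.compile(
    r'<ref(?:\s[^<>]*?)?(?:/>|>.*?</ref>)',
    re.DOTALL | re.UNICODE,
)

def strip_refs_regex(text):
    """Remove <ref> elements, tolerating self-closing tags nested in the body."""
    return RE_REF_NESTED_OK.sub('', text)

nested = 'A<ref>note<nowiki/>http://example.org/x</ref>B'
self_closing = 'A<ref name="n" />B'
with_attrs = 'A<ref name="n">t<br/>more</ref>B'

print(strip_refs_regex(nested))        # AB
print(strip_refs_regex(self_closing))  # AB
print(strip_refs_regex(with_attrs))    # AB
```

This still assumes REF elements are never nested inside each other and that attribute values contain no < or > characters; whether it has side effects on the later steps of filter_wiki has not been checked here.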

It looks like RE_P9 (described as "external links"?) and RE_P10 (math) share a similar assumption that any /> must end the tag of interest, rather than some nested tag, and thus might be susceptible to the same issue.
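A quick way to check that suspicion is sketched below. It does not use gensim's actual RE_P9 or RE_P10; it assumes, as the comment suggests, that they have the same shape as RE_P1 with a different tag name, and the helper names are hypothetical.

```python
import re

def same_shape_as_re_p1(tag):
    """Hypothetical helper: RE_P1's pattern with another tag name swapped in."""
    return re.compile(
        r'<%s([> ].*?)(</%s>|/>)' % (tag, tag), re.DOTALL | re.UNICODE
    )

def truncates_early(tag, nested='<br/>'):
    """True if the match stops at the nested tag rather than the closing tag."""
    sample = '<%s>before%safter</%s>' % (tag, nested, tag)
    return same_shape_as_re_p1(tag).search(sample).group(0) != sample

results = {tag: truncates_early(tag) for tag in ('ref', 'nowiki', 'math')}
print(results)  # {'ref': True, 'nowiki': True, 'math': True}
```

If the real patterns differ from this assumed shape, the check would need to be rerun against them directly.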

I think it'd be better to tune those regexes so they don't assume the absence of nested tags, but that might risk other side effects or require reordering other steps. I'm not sure why the existing regexes work the way they do, and processing HTML or Wikipedia's weird wikitext format with regexes is an inherently clunky and hard-to-maintain approach.

It might be most robust to move some form of RE_P11 ("All other tags") earlier in the process, but narrowed so that it leaves any specific tags of interest in place.
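One possible reading of that idea, as a sketch only: run an early pass that removes every self-closing tag except a self-closing <ref ... />, then apply RE_P1 unchanged. The name RE_SELF_CLOSING_NON_REF and the function below are hypothetical, and the choice of <ref> as the only tag of interest is an assumption made for this example.

```python
import re

# RE_P1 as quoted in the report above.
RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)

# Hypothetical narrowed "other tags" pass: any self-closing tag whose
# name is not exactly "ref" (so <br/>, <br />, <nowiki/> are removed,
# while <ref name="n" /> is left for RE_P1).
RE_SELF_CLOSING_NON_REF = re.compile(r'<(?!ref\b)\w+[^<>]*?/>', re.UNICODE)

def strip_refs_with_prepass(text):
    """Drop nested self-closing tags first, then remove <ref> elements."""
    text = RE_SELF_CLOSING_NON_REF.sub('', text)
    return RE_P1.sub('', text)

sample = 'A<ref>note<nowiki/>http://example.org/x</ref>B<ref name="n" />C<br />D'
cleaned = strip_refs_with_prepass(sample)
print(cleaned)  # ABCD
```

This generalizes the reporter's three-alternative pattern to any self-closing tag, but it would change what the later steps see, so the ordering concern raised above still applies.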

seadog-www (Author) commented

Yes, my suggestion is not perfect. Do you have a better method for processing HTML or Wikipedia's weird wikitext format without regexes?
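For the tag-handling part only, one regex-free direction is a tolerant tag parser. The sketch below uses Python's standard html.parser.HTMLParser to drop the contents of <ref> elements; the class and function names are hypothetical, it is not something gensim provides, and it does nothing about templates, links, or other wikitext syntax.

```python
from html.parser import HTMLParser

class RefStripper(HTMLParser):
    """Collect text outside <ref>...</ref>, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.ref_depth = 0   # > 0 while inside a paired <ref> element
        self.chunks = []     # text kept so far

    def handle_starttag(self, tag, attrs):
        if tag == 'ref':
            self.ref_depth += 1

    def handle_endtag(self, tag):
        if tag == 'ref' and self.ref_depth > 0:
            self.ref_depth -= 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<nowiki/>, <br/>, <ref name="n" />) carry no
        # text and must not change the depth counter.
        pass

    def handle_data(self, data):
        if self.ref_depth == 0:
            self.chunks.append(data)

def strip_refs(text):
    parser = RefStripper()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

sample = 'A<ref>note<nowiki/>http://example.org/x</ref>B<ref name="n" />C<br/>D'
print(strip_refs(sample))  # ABCD
```

Because the parser distinguishes self-closing tags from paired ones, a nested <nowiki/> no longer ends the <ref>. Note that convert_charrefs=True also decodes entities such as &nbsp;, which may or may not be wanted.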
