Use available corpora for opensubtitles (63 languages) #79

Open
hugolpz opened this issue Feb 25, 2021 · 3 comments
Comments


hugolpz commented Feb 25, 2021

Research

  • J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

Gain

The closest available approximation of natural spoken-language corpora.

Links

  • Portal
    • bre.txt.gz -- Breton corpus.
    • 60+ languages available.
    • List: af,ar,bg,bn,br,bs,ca,cs,da,de,el,en,eo,es,et,eu,fa,fi,fr,gl,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,ko,lt,lv,mk,ml,ms,nl,no,pl,pt,pt_br,ro,ru,si,sk,sl,sq,sr,sv,ta,te,th,tl,tr,uk,ur,vi,ze_en,ze_zh,zh_cn,zh_tw
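
For anyone who wants to try one of these files before a crawler exists, here is a minimal sketch of reading a downloaded monolingual file such as bre.txt.gz. It assumes the file is UTF-8 text with one sentence per line, which I have not verified for every language; the function names `iter_sentences` and `word_frequencies` are hypothetical helpers, not part of CorpusCrawler.

```python
import gzip
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_sentences(path: Path) -> Iterator[str]:
    """Yield non-empty lines from a gzip-compressed text file.

    Assumption: the corpus is UTF-8 with one sentence per line.
    Undecodable bytes are replaced rather than raising an error.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:
                yield sentence


def word_frequencies(path: Path) -> Counter:
    """Count lowercased, whitespace-separated tokens across the whole file."""
    counts: Counter = Counter()
    for sentence in iter_sentences(path):
        counts.update(sentence.lower().split())
    return counts
```

With a local copy of the file, `word_frequencies(Path("bre.txt.gz")).most_common(20)` would list the 20 most frequent tokens. Whitespace splitting is only a rough tokenizer and will not suit languages written without spaces (ja, th, zh_cn, zh_tw).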

There are ready-to-download OpenSubtitles corpora available.

| Project introduction | Type | Languages (2024) | Portal all | Language specific | Download link | Comments |
|---|---|---|---|---|---|---|
| OpenSubtitles 2016/2018 | Subtitles; parallel sentences; monolingual sentences | 75 | Portal | br&en | bre (mono) | Source: P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf . Licence: unclear, "The corpora is made freely available to the research community on the OPUS website" (Lison and Tiedemann, 2016). |

Collaborator

brawer commented Feb 25, 2021

Sounds great. Send a pull request?

@hugolpz hugolpz changed the title Add crawler for opensubtitles Add crawler for opensubtitles (63 languages) Feb 25, 2021
Author

hugolpz commented Feb 26, 2021

Hello Sascha / @brawer,
My Python skills are near zero so far, so I do my best to help with the knowledge and know-how I have:

  • multilingual corpora literature review → sharing
  • Wikimedia's API, ecosystems, resources → sharing
  • documenting open-source projects positively to increase engagement
  • clarifying roadmaps
  • networking for stronger projects¹

The project also lacks meaningful documentation (#80). It would be inefficient to put a total Python newbie on Python copy-engineering. I will be more productive on other linguistic diversity issues, here or on @lingua-libre projects.

Given how central this CLDR/UNILEX/Unicode/Google CorpusCrawler repository is to linguistic diversity on the web, is there an email contact to which I, Wikimedia France, and/or the Wikimedia Foundation could write to ask for more solid support for CorpusCrawler? Volunteering can do a lot but is too irregular. A dedicated, versatile, paid maintainer supervising ~20² of Google's open-source projects, unblocking the key bottlenecks via 4-hour coding sprints and community support, would quickly provide a positive ROI. 2020 opened access to skilled workers all around the world. There is surely a long list of open-source projects that would gain from such small yet skilled bottleneck-kicks to move forward.

I would be interested in coordinating such an email with Wikimedia France and the US Wikimedia Foundation, to get a handful of names on that email (if there is a reasonable, >5~10% chance of achieving the intended goal of a skilled, paid maintainer here 4 hrs/week within the next 2 years).

1: see text above
2: depending on project activity, could be fewer or more. The current project has about 1 issue per month.

Author

hugolpz commented Mar 4, 2021

Thanks for the chat @brawer. Our online chat will help me better plan the next phases of Lingualibre and the collaboration with the crawler.

@hugolpz hugolpz changed the title Add crawler for opensubtitles (63 languages) Use available corpora for opensubtitles (63 languages) Feb 17, 2024