Stopwords for various languages in JSON format. Per Wikipedia:
> Stop words are words which are filtered out prior to, or after, processing of natural language data [...] these are some of the most common, short function words, such as the, is, at, which, and on.
To use every language at once, load stopwords-all.json, which is keyed by ISO 639-1 language code. For a single language, use one of the individual stopword files listed in the table below.
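As a minimal sketch of how the combined file might be consumed, the Python below loads it and filters a token list. It assumes each language code maps to a JSON array of strings, which this README does not spell out. The helper names `load_all_stopwords` and `remove_stopwords` are hypothetical and not part of this project.

```python
import json
from pathlib import Path


def load_all_stopwords(path):
    """Load the combined file, assumed to be {language code: [words]}."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Convert each list to a set so membership tests are fast.
    return {lang: set(words) for lang, words in data.items()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Example usage (assumes stopwords-all.json is in the working directory):
#   stopwords = load_all_stopwords("stopwords-all.json")
#   remove_stopwords("the cat is on the mat".split(), stopwords["en"])
```

Lowercasing before the lookup is a design choice of this sketch. It suits languages with letter case and has no effect on scripts without it.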
There are a total of 43 supported languages:
Language | Stopword count | Filename |
---|---|---|
Arabic | 162 | ar.json |
Armenian | 45 | hy.json |
Basque | 98 | eu.json |
Bengali | 116 | bn.json |
Breton | 126 | br.json |
Bulgarian | 259 | bg.json |
Catalan | 218 | ca.json |
Chinese | 542 | zh.json |
Croatian | 179 | hr.json |
Czech | 346 | cs.json |
Danish | 101 | da.json |
Dutch | 275 | nl.json |
English | 570 | en.json |
Esperanto | 173 | eo.json |
Ewe | 35 | ee.json |
Finnish | 772 | fi.json |
French | 606 | fr.json |
Galician | 160 | gl.json |
German | 596 | de.json |
Greek | 75 | el.json |
Hebrew | 194 | he.json |
Hindi | 225 | hi.json |
Hungarian | 781 | hu.json |
Indonesian | 355 | id.json |
Irish | 109 | ga.json |
Italian | 623 | it.json |
Japanese | 109 | ja.json |
Korean | 679 | ko.json |
Latin | 49 | la.json |
Latvian | 161 | lv.json |
Marathi | 99 | mr.json |
Norwegian | 172 | no.json |
Persian | 332 | fa.json |
Polish | 260 | pl.json |
Portuguese | 408 | pt.json |
Romanian | 282 | ro.json |
Russian | 539 | ru.json |
Slovak | 110 | sk.json |
Slovenian | 446 | sl.json |
Spanish | 577 | es.json |
Swedish | 401 | sv.json |
Thai | 115 | th.json |
Turkish | 279 | tr.json |
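The individual files in the table can be read the same way. The sketch below assumes each per-language file is a plain JSON array of strings and that the files sit together in one directory. Both are assumptions, and `load_language_file` and `count_stopwords` are hypothetical helper names rather than anything this project ships.

```python
import json
from pathlib import Path


def load_language_file(directory, code):
    """Load one per-language file such as en.json (assumed: a JSON array)."""
    path = Path(directory) / f"{code}.json"
    with path.open(encoding="utf-8") as fh:
        words = json.load(fh)
    if not isinstance(words, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return words


def count_stopwords(directory, codes):
    """Map each language code to the number of entries in its file."""
    return {code: len(load_language_file(directory, code)) for code in codes}


# Example usage (assumes the .json files are in the current directory):
#   count_stopwords(".", ["en", "fr"])
```

If the files match the table above, the counts for `en` and `fr` should come out as 570 and 606 respectively, which makes this a quick sanity check after downloading.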
The stopword lists draw on the following sources:
- Apache Lucene - Apache 2.0 License
- Carrot2 - License
- cue.language - Apache 2.0 License
- Jacques Savoy - BSD License
- SMART Information Retrieval System: ftp://ftp.cs.cornell.edu/pub/smart/
Copyright (c) 2014 Peter Graham and contributors. Released under the Apache-2.0 license.