Supported languages are **French** (fra), **Hausa** (hau), **Igbo** (ibo), **Luganda** (lug), **Nigerian Pidgin** (pcm), **Kirundi** (run), **Shona** (sna), **Somali** (som), **Swahili** (swa), and **Yoruba** (yor).

# Stopword Extraction Based on POS Tags
*   *MasakhaPOS* is a dataset of text excerpts in African languages annotated with the Universal Dependencies part-of-speech tags described [here](https://universaldependencies.org/u/pos/).
*   We treat tokens annotated with any of the following POS tags as stopwords: *AUX*, *PRON*, *CCONJ*, *SCONJ*, *DET*, and *PART*.

In [None]:
import requests
# Download the MasakhaPOS dev split for the selected language
lang = "yor"
url = "https://raw.githubusercontent.com/masakhane-io/masakhane-pos/main/data/"+lang+"/dev.txt"
response = requests.get(url)
textlist = response.text.split("\n")

In [None]:
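stopwords = []
# Universal POS tags whose tokens are treated as stopwords
stopwordsc = ["AUX", "PRON", "CCONJ", "SCONJ", "DET", "PART"]
# The response body is '404: Not Found' when no file exists for the language
if textlist != ['404: Not Found']:
  for phrase in textlist:
    entryn = phrase.split(" ")
    # Skip blank or malformed lines that lack a tag column
    if len(entryn) >= 2 and entryn[1] in stopwordsc:
      stopwords.append(entryn[0])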

In [None]:
stopwords = list(set(stopwords))
len(stopwords)

122
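
The tag-based filter above can also be wrapped in a function and tried on an in-memory sample, which avoids a network call. This is a minimal sketch: the name `extract_stopwords` and the sample tokens are illustrative placeholders, not data from MasakhaPOS. It assumes only the layout the cell above relies on, one token and its tag per line. Unlike that cell, it splits on any whitespace rather than on a single space.

```python
# Tags treated as stopword classes, as listed in this notebook
STOPWORD_TAGS = {"AUX", "PRON", "CCONJ", "SCONJ", "DET", "PART"}

def extract_stopwords(lines, tags=STOPWORD_TAGS):
    """Return the set of tokens whose tag (second column) is in `tags`."""
    found = set()
    for line in lines:
        parts = line.split()
        # Blank lines and lines without a tag column are ignored
        if len(parts) >= 2 and parts[1] in tags:
            found.add(parts[0])
    return found

# Hypothetical sample; the tokens are placeholders, not real annotations
sample = ["tokA PRON", "tokB NOUN", "", "tokC AUX", "tokA PRON", "malformed"]
sample_stopwords = extract_stopwords(sample)
print(sorted(sample_stopwords))
```

Returning a set makes the later `list(set(...))` deduplication step unnecessary.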

# Retrieving Stopwords from the African Stopwords Project
* We extract stopwords from the African Stopwords Project and we add them to the ones extracted from MasakhaPOS.
* The final list of considered stopwords is lowercased.

In [None]:
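# Map the language codes used above to the file names used by the African Stopwords Project
dictlang = {"hau": "ha", "pcm": "pcm", "run": "rn", "som": "so", "swa": "sw", "yor": "yo"}
if lang in dictlang:
  lg = dictlang[lang]
  url = "https://raw.githubusercontent.com/masakhane-io/masakhanePreprocessor/main/african-stopwords/languages/"+lg+".txt"
  response = requests.get(url)
  # Drop the empty string left after the trailing newline
  textlist = response.text.split("\n")[:-1]
  # Merge with the POS-based list, then lowercase and deduplicate
  stopwords = list(set(stopwords + textlist))
  stopwords = list(set([x.lower() for x in stopwords]))
# Number of distinct entries in the downloaded list
len(list(set(textlist)))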

60
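
The merge-and-lowercase step can be written as a small reusable helper. The sketch below is one possible formulation, not the notebook's own code: `merge_stopwords` is a hypothetical name and the input lists are placeholders. Besides lowercasing, it strips surrounding whitespace and drops empty entries, which guards against stray carriage returns or blank lines in a downloaded file.

```python
def merge_stopwords(*sources):
    """Merge several stopword lists into one sorted, lowercased, deduplicated list."""
    merged = set()
    for source in sources:
        for word in source:
            word = word.strip().lower()
            # Skip entries that are empty after stripping
            if word:
                merged.add(word)
    return sorted(merged)

# Placeholder inputs standing in for the POS-based list and a downloaded list
pos_based = ["TokA", "tokB"]
project_list = ["toka", "tokC\r", ""]
merged = merge_stopwords(pos_based, project_list)
print(merged)
```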

# Extracting stopwords for French
* French is supported by neither MasakhaPOS nor the African Stopwords Project.
* We extract stopwords for French from Stopwords-ISO (https://github.com/stopwords-iso).


In [None]:
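# Map the language code used above to the two-letter code used by Stopwords-ISO
dictlang = {"fra": "fr"}
if lang in dictlang:
  lg = dictlang[lang]
  url = "https://raw.githubusercontent.com/stopwords-iso/stopwords-"+lg+"/master/stopwords-"+lg+".txt"
  response = requests.get(url)
  # Drop the empty string left after the trailing newline
  textlist = response.text.split("\n")[:-1]
  # Merge with any stopwords collected so far, then lowercase and deduplicate
  stopwords = list(set(stopwords + textlist))
  stopwords = list(set([x.lower() for x in stopwords]))
# When lang is not "fra", textlist still holds the list from the previous cell
len(list(set(textlist)))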

60

# Availability of stopwords in categorized corpora
*   *MasakhaNEWS* is a dataset of news items in African languages that are categorized according to their main topic.
*   We count how many of the stopwords occur in the dataset and in how many categories each of them appears.

In [None]:
import requests
import re
# Download the MasakhaNEWS dev split and strip punctuation before tokenizing
url = "https://raw.githubusercontent.com/masakhane-io/masakhane-news/main/data/"+lang+"/dev.tsv"
response = requests.get(url)
textlist = re.sub(r'[^\w\s]', '', response.text).split("\n")
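
One caveat about the pattern `[^\w\s]`: in Python's `re` module, `\w` does not match Unicode combining marks, so a tone mark stored as a separate code point (for example U+0301 after `ọ`, which has no precomposed form with an acute accent) is removed together with the punctuation. The sketch below illustrates this and shows one possible alternative based on Unicode categories. The helper name `strip_punctuation` and the sample string are illustrative assumptions, not part of the original pipeline.

```python
import re
import unicodedata

def strip_punctuation(text):
    """Remove punctuation but keep letters, digits, underscores, whitespace and combining marks."""
    kept = []
    for ch in unicodedata.normalize("NFC", text):
        major = unicodedata.category(ch)[0]
        # L = letters, N = numbers, M = combining marks
        if major in "LNM" or ch.isspace() or ch == "_":
            kept.append(ch)
    return "".join(kept)

# Sample with two combining tone marks (U+0301, U+0300) and two precomposed letters
sample = "\u1ecd\u0301r\u1ecd\u0300, n\u00e1\u00e0!"
regex_result = re.sub(r"[^\w\s]", "", sample)
mark_safe_result = strip_punctuation(sample)

print(len(regex_result), len(mark_safe_result))
```

With the regex, the two combining marks are dropped; with the category-based helper they are kept. Whichever variant is used, the stopword list and the corpus should be cleaned the same way so that their tokens remain comparable.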

In [None]:
# Build one lowercased document per category by concatenating its articles
corpus = []
category_list = []
for line in textlist[1:]:  # skip the header row
  linetext = line.split("\t")
  if len(linetext) == 4:
    category = linetext[0]
    text = linetext[2].lower()
    if category not in category_list:
      category_list.append(category)
      corpus.append(text)
    else:
      corpus[category_list.index(category)] += " " + text

In [None]:
len(corpus)

5

In [None]:
category_list

['entertainment', 'sports', 'health', 'politics', 'religion']

In [None]:
wordlist = []
for item in corpus:
  wordlist.append(list(set(item.split(" "))))

In [None]:
len(wordlist)

5

In [None]:
wordlist

[['',
  'faithia',
  'film',
  'ìgbéyàwó',
  'sukun',
  'awawi',
  'iwọnba',
  'steve',
  'dae',
  'mura',
  'nadeem',
  'florida',
  'aladun',
  'fùn',
  'anfani',
  'náà',
  'bu',
  'olabiran',
  'kó',
  'ero',
  'alabayọ',
  'lẹyin',
  'ensemble',
  'gbẹnu',
  'oloufe',
  'ẹlomiran',
  'party',
  'ọlọsha',
  'orúkọ',
  'abuke',
  'kofo',
  'buruku',
  'oniṣowo',
  'tree',
  'mgm',
  'fire',
  'baba',
  'bankole',
  'idi',
  'gbọngan',
  'orí',
  'ajao',
  'jáwó',
  'lalẹ',
  'tani',
  'bọwọ',
  'dúkìá',
  'mejidinlaadọrin',
  'ṣọun',
  'age',
  'family',
  'imoore',
  'aráàlú',
  'of',
  'portable',
  'dwan',
  'amin',
  'láàárín',
  'sẹnatọ',
  'iṣẹlẹ',
  'kej',
  'ojuti',
  'yannana',
  'ṣade',
  'sonatas',
  'adewusi',
  'ṣọrun',
  'koro',
  'latiu',
  'ayọ',
  'igbákejì',
  'ra',
  'adeyinka',
  'bẹẹbẹẹlọ',
  'àwòrán',
  'apala',
  '2',
  'mọ',
  'alien',
  'weeknd',
  'rock',
  'ọrọ',
  'taa',
  'sira',
  'fesi',
  'your',
  'ijesha',
  'nibẹ',
  'tiwa',
  'kingsaheed',
  'audi

In [None]:
lexicon_ = []
for i in range(len(wordlist)):
  lexicon_ += wordlist[i]

In [None]:
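from collections import Counter
# Each word appears at most once per category in lexicon_,
# so its count is the number of categories that contain it
counter = Counter(lexicon_)
common = counter.most_common()
# Keep only the words that are in the stopword list
common = [x for x in common if (x[0] in stopwords)]
# Size of the overall vocabulary
len(list(set(lexicon_)))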

8210

In [None]:
print("Available Stopwords:", len(common), "\nAll Stopwords:", len(stopwords), "\nRate:", len(common)/len(stopwords))

Available Stopwords: 62 
All Stopwords: 160 
Rate: 0.3875


In [None]:
# Distribution of stopwords by number of categories: (categories, stopwords)
counts = [x[1] for x in common]
counter1 = Counter(counts)
cat = counter1.most_common()
cat

[(5, 49), (4, 4), (3, 3), (2, 3), (1, 3)]

In [None]:
uncommon = [x[0] for x in common if (x[1] == 1)]
uncommon

['í', 'kì', 'é']
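
The coverage analysis above can be condensed into a single function that works on per-category vocabularies. This is a sketch under stated assumptions: `category_coverage` is a hypothetical name, the toy corpus and stopwords are placeholders, and tokenization is plain whitespace splitting on lowercased text.

```python
from collections import Counter

def category_coverage(texts_by_category, stopwords):
    """Count, for each stopword, the number of categories whose text contains it."""
    stopword_set = set(stopwords)
    coverage = Counter()
    for text in texts_by_category.values():
        # Each category contributes at most 1 to a word's count
        vocabulary = set(text.lower().split())
        for word in vocabulary & stopword_set:
            coverage[word] += 1
    return coverage

# Placeholder data: three categories and three candidate stopwords
toy_corpus = {
    "sports": "tokA runs tokB",
    "health": "tokA rests",
    "politics": "votes tokA",
}
toy_stopwords = ["toka", "tokb", "tokc"]

coverage = category_coverage(toy_corpus, toy_stopwords)
rate = len(coverage) / len(toy_stopwords)    # share of stopwords found in the corpus
distribution = Counter(coverage.values())    # categories covered -> number of stopwords
print(dict(coverage), rate, dict(distribution))
```

Using a set for the stopwords makes each membership test constant time, where the list used in the cells above is scanned linearly.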