[Rozha](https://github.com/ian-nai/Rozha) is a package that simplifies and streamlines common natural language processing (NLP) methods for a wide variety of languages, so that users can apply NLP to both non-English and English texts.

Much NLP work has followed an Anglocentric model, pairing English texts with tools and computer models designed primarily for the English language. Rozha was created to make it easier for people to begin engaging with non-English materials in their NLP and digital humanities work.

In [None]:
# Getting Started

# Install using pip:

!pip install rozha

# Or download the GitHub repo and then install the requirements:

# pip3 install -r requirements.txt

# While using Colab, we need to set up our taggers in this way:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
!python -m spacy download en_core_web_md

Then begin using the package by importing the modules you plan to use. Rozha is structured into three classes: process, analyze, and visualize.


In [None]:
# If you installed using pip, import them this way:

import rozha.process as process # (or whatever name you choose)
import rozha.analyze as analyze # (or whatever name you choose)
import rozha.visualize as visualize # (or whatever name you choose)

# If running from a local copy of the files, use the following:

# from process import process
# from analyze import analyze
# from visualize import visualize

# Some Example Pipelines

The following pipelines show how to work with the package on non-English and multilingual texts. For a full list of the package's functions, [follow this link](https://github.com/ian-nai/Rozha/blob/main/Functions.md).

First, let's open a file, perform word tokenization and remove stopwords, make the text lowercase, and then get part-of-speech tags for the text:

In [None]:
import rozha.process as process
import rozha.analyze as analyze

word_tokenized = process.lowerFile("sample_text.txt")
pos_tags = analyze.posList(word_tokenized)
print(pos_tags)
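
To make the preprocessing steps concrete, here is a minimal plain-Python sketch of lowercasing, word tokenization, and stopword removal. It is not Rozha's implementation: the `preprocess` function and the tiny `TOY_STOPWORDS` set are hypothetical names made up for this illustration, and a real pipeline would rely on a full, language-specific stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# real stopword lists are much longer and language-specific.
TOY_STOPWORDS = {"the", "a", "of", "and", "in"}


def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    # \w+ matches runs of Unicode word characters, so accented letters survive.
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


print(preprocess("The House of Usher and the Autumn of the Year"))
# → ['house', 'usher', 'autumn', 'year']
```

The resulting token list is the kind of input a part-of-speech tagger would then work on.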

Next, we'll perform sentence tokenization without removing stopwords, and then perform named entity recognition on each sentence using spaCy:

In [None]:
import rozha.process as process
import rozha.analyze as analyze

# Tokenizing our sentences and performing the named entity recognition
sent_tokenized = process.allSentsVar("Jane runs fast. John runs slow.", "english")
ner_tags = analyze.spacyNer(sent_tokenized, "en")

print(ner_tags)
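
For intuition about what sentence tokenization does, the sketch below splits the same example string with a naive regular expression. This is only a rough heuristic under the assumption that sentences end in `.`, `!`, or `?` followed by whitespace; `naive_sentences` is a hypothetical helper, not part of Rozha.

```python
import re


def naive_sentences(text):
    """Split text wherever whitespace follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(naive_sentences("Jane runs fast. John runs slow."))
# → ['Jane runs fast.', 'John runs slow.']
```

A rule this simple breaks on abbreviations such as "Dr. Smith" and on writing systems with different sentence punctuation, which is one reason to prefer trained tokenizers like the `punkt` models downloaded earlier.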

Next we'll perform word tokenization and remove stopwords from a string, make the text lowercase, and graph the 5 most common words as a bar graph:

In [20]:
import rozha.process as process
import rozha.visualize as visualize

our_text = "During the whole of a dull, dark, and soundless day in the autumn of the year, when the clouds hung oppressively low in the heavens, I had been passing alone, on horseback, through a singularly dreary tract of country; and at length found myself, as the shades of the evening drew on, within view of the melancholy House of Usher."
no_stopwords = str(process.stopwordsVar(our_text, "english"))
word_tokenized = process.lowerVar(no_stopwords)
# Pass the var, number of words to graph, the height and width of the graph, and your preferred filename
# The graph will save as "my_graph.png"
visualize.barFreq(word_tokenized, 5, 400, 400, 'my_graph')

['whole dull dark soundless day autumn year clouds hung oppressively low heavens passing alone horseback singularly dreary tract country length found shades evening drew within view melancholy house usher']
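
To see the numbers behind a frequency bar graph, the counts can be reproduced with the standard library. The sketch below is an assumption-labeled illustration rather than what `barFreq` does internally; it takes the cleaned text printed above as a plain string and counts whitespace-separated words.

```python
from collections import Counter

# The stopword-free, lowercased text printed above, as a plain string.
cleaned = (
    "whole dull dark soundless day autumn year clouds hung oppressively "
    "low heavens passing alone horseback singularly dreary tract country "
    "length found shades evening drew within view melancholy house usher"
)

counts = Counter(cleaned.split())
top_five = counts.most_common(5)  # list of (word, count) pairs
print(top_five)
```

In this particular sentence every remaining word occurs exactly once, so any choice of five "most common" words is a tie; a longer text gives a more informative graph.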


Finally, let's detect the languages in a text and tag words according to which language they belong to.

In [None]:
import rozha.process as process
import rozha.analyze as analyze

# Tokenizing into sentences
sent_tokenized = process.sentTokenizeFile("sample_text.txt")
print(sent_tokenized)

# Now we detect the languages and print the output
tagged_sents = analyze.detectLangVar(sent_tokenized)
print(tagged_sents)
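
As a rough illustration of the idea behind language detection, the toy sketch below guesses a sentence's language by counting how many of its words appear in small per-language stopword lists. This is a swapped-in teaching technique, not the method Rozha uses; `guess_language` and `TOY_STOPWORDS_BY_LANG` are hypothetical names, and the word lists are far too small for real use.

```python
import re

# Hypothetical mini stopword lists, for illustration only.
TOY_STOPWORDS_BY_LANG = {
    "english": {"the", "and", "is", "of", "in"},
    "spanish": {"el", "la", "y", "es", "de", "en"},
    "french": {"le", "la", "et", "est", "de", "dans"},
}


def guess_language(sentence):
    """Return the language whose stopword list overlaps the sentence most."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    scores = {
        lang: len(tokens & words)
        for lang, words in TOY_STOPWORDS_BY_LANG.items()
    }
    best = max(scores, key=scores.get)
    # With no overlap at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"


for sentence in ["The house is in the country.", "La casa es de mi madre."]:
    print(guess_language(sentence), "|", sentence)
```

Short sentences and closely related languages share many function words, so a heuristic like this misfires easily; dedicated language-identification models are far more reliable.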

# Conclusion

Rozha aims to make multilingual digital humanities and natural language processing more accessible, to simplify the work of those already in the field, and perhaps to open new avenues for newcomers and established NLP practitioners alike. My hope is that this tool will encourage diversity in the NLP landscape, and that people who have found it too daunting to work with non-English materials will feel more comfortable doing so with this package. Beyond that, I hope the package will serve as a conduit for further contributions and collaboration, and that the code will help strengthen the field and the community of practitioners working with non-English materials.