# Working in Languages Beyond English

By [Quinn Dombrowski](http://www.quinndombrowski.com/)

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is authored by <a href="https://dlcl.stanford.edu/people/quinn-dombrowski/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

Most of the tools and tutorials you'll find for computational text analysis assume that you're working with English-language text. This section is dedicated to helping students and scholars accomplish text analysis tasks in languages beyond English. Select lessons are adapted for non-English languages including Danish, Spanish, Chinese, and Russian.

## Two Kinds of Text Analysis
The steps you need to take to analyze a language beyond English will depend on the kind of text analysis method that you are interested in using.

The methods introduced in this chapter can be broadly organized into two groups:

1) Methods based on word counts, such as TF-IDF and topic modeling

2) Methods that use language-specific NLP models, such as Named Entity Recognition and part-of-speech tagging

There are more resources to support non-English text analysis in the first group of methods than in the second group.

To apply the first group of methods to non-English texts, you will need to *pre-process* your texts — in other words, to create a derivative version of your text that will work better with these tools. 

To apply the second group of methods to non-English texts, you will need to find a language-specific version of the NLP models. Unfortunately, for most of the roughly 6,500 languages spoken in the world, there are currently few if any language-specific tools or resources to support computational analysis. Out of the 100 languages with the greatest number of speakers, at least two-thirds are missing the tools you'll need to complete all the activities in this section of the textbook.


## Text Analysis Based on Word Counts: Pre-Processing Non-English Texts

The pre-processing steps needed to make texts in other languages usable with computational text analysis methods vary depending on the language. For example, some languages, such as Chinese, do not separate words with spaces, and texts in these languages will need to have artificial spaces inserted before text analysis.
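To make the idea of inserting artificial spaces concrete, here is a minimal sketch of one simple segmentation strategy, greedy forward maximum matching. The `VOCAB` word list and the `segment` function are hypothetical and invented for this illustration. In practice you would more likely rely on a dedicated Chinese segmentation library with a much larger dictionary or a statistical model.

```python
# Toy forward maximum-matching segmenter.
# VOCAB is a tiny, made-up word list (an assumption for this example);
# real segmenters use large dictionaries or statistical models.
VOCAB = {"我", "喜欢", "读", "小说"}
MAX_WORD_LEN = max(len(word) for word in VOCAB)


def segment(text):
    """Split text into words by greedily matching the longest vocabulary entry."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character if nothing in VOCAB matches.
            if candidate in VOCAB or length == 1:
                words.append(candidate)
                i += length
                break
    return words


sentence = "我喜欢读小说"  # "I like reading novels"
spaced = " ".join(segment(sentence))
print(spaced)  # 我 喜欢 读 小说
```

The space-separated string in `spaced` is the kind of derivative text that word-count methods can then treat like any other whitespace-delimited text. Greedy matching can pick the wrong boundaries when several segmentations are possible, which is one reason real tools are more elaborate.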

Other languages with more *inflection* than English (i.e. where words appear in different forms, depending on how they're used) need to be *lemmatized*, replacing every variant word form with the dictionary form, or *stemmed*, cutting off the inflection at the end of the word. Lemmatizing or stemming usually (but not always) leaves you with something resembling the root. For example, in Spanish `hablar` ("to speak") and its inflected forms `hablo` ("I speak") and `hablas` ("you speak") all become `habl` when stemmed.
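As a rough illustration of what "cutting off the inflection" means, the sketch below strips a handful of Spanish verb endings. The `SUFFIXES` list and the `toy_stem` function are assumptions made up for this example, not a real stemming algorithm. Established stemmers such as the Snowball family use many more rules and handle far more cases.

```python
# Hypothetical, deliberately incomplete list of Spanish verb endings.
# Longer endings come first so that "amos" is tried before "as" or "o".
SUFFIXES = ["amos", "ar", "as", "an", "a", "o"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


forms = ["hablar", "hablo", "hablas", "hablamos"]
stems = [toy_stem(form) for form in forms]
print(stems)  # ['habl', 'habl', 'habl', 'habl']
```

Because all four forms collapse to the same string, a word-count method will treat them as one item. The same crude rules would also shorten words that are not verbs at all (for instance `casa`, "house", would become `cas`), which is why stemming only usually leaves you with something resembling the root.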

The situation is even more complicated for languages known as *agglutinative languages*, in which words are formed by repeatedly gluing together *morphemes*, or small bits of meaning. In agglutinative languages, a single "word" can be translated as an entire English sentence. How would you reduce a word like Turkish *Çekoslovakyalılaştıramadıklarımızdanmışsınız* — meaning, "you are reportedly one of those that we could not make Czechoslovakian" — down to a root that you could count?

When doing text analysis in English, you can do things like word frequency without thinking too much about questions like "what, actually, is a word?" However, the ways you have to modify text in many other languages to make it compatible with computational text analysis — even to the point of harming human readability — mean that you have to grapple with this question more directly when working with other languages.

If you want to do text analysis with word counts for Danish or Spanish, you will first need to pre-process the texts for your chosen language, and then use the derivative text for the TF-IDF or topic modeling code.
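The two-step workflow described above (pre-process first, then count) might look something like the following sketch. The `LEMMAS` lookup table and the `make_derivative` function are hypothetical stand-ins for a real lemmatizer, and the sample sentence is invented for the example.

```python
from collections import Counter

# Hypothetical lemma lookup table (an assumption for this example);
# a real one would come from a lemmatizer or morphological dictionary.
LEMMAS = {"hablo": "hablar", "hablas": "hablar", "habla": "hablar"}


def make_derivative(text):
    """Lowercase the text, split on whitespace, and replace known forms with lemmas."""
    tokens = text.lower().split()
    return " ".join(LEMMAS.get(token, token) for token in tokens)


original = "Hablo con Ana y Ana habla con Luis"
derivative = make_derivative(original)

# Count words in the original and in the derivative text.
raw_counts = Counter(original.lower().split())
lemma_counts = Counter(derivative.split())

print(derivative)              # hablar con ana y ana hablar con luis
print(raw_counts["hablar"])    # 0
print(lemma_counts["hablar"])  # 2
```

In the original sentence, `hablo` and `habla` are counted as two unrelated words, and the dictionary form never appears. In the derivative text they are counted together, at the cost of a sentence that no longer reads naturally.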

## Text Analysis Based on NLP Models: Non-English NLP


The named-entity recognition and part-of-speech keywords sections also have language-specific dependencies; there are separate versions of those tutorials for a number of requested languages.