# Language Detection
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/lang_detection.ipynb)

## Overview
Every language detector exposes two methods:
1. `detect_single_language` returns the language with the highest score assigned by the detection model.
2. `detect_many_languages` returns a list of `(language, score)` pairs produced by the model.
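
To make the relationship between the two methods concrete, here is a minimal sketch built around a hypothetical `ToyDetector`. It is not part of LangChain; it simply scores languages by counting a few hard-coded stop words. The sketch assumes, as the descriptions above suggest, that `detect_single_language` returns the top entry of what `detect_many_languages` would return.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stop-word lists, for illustration only.
STOP_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "is", "my", "name", "hello"},
    "es": {"el", "soy", "yo", "hola", "me"},
    "de": {"der", "ich", "bin", "auf", "und"},
}


class ToyDetector:
    """Illustrative stand-in exposing the two methods described above."""

    def detect_many_languages(self, text: str) -> List[Tuple[str, float]]:
        # Normalize words, then count stop-word hits per language.
        words = [w.strip(".,!?").lower() for w in text.split()]
        hits = {
            lang: sum(w in vocab for w in words)
            for lang, vocab in STOP_WORDS.items()
        }
        total = sum(hits.values())
        if total == 0:
            return []
        # Turn counts into scores that sum to 1, highest first.
        scores = [(lang, n / total) for lang, n in hits.items() if n > 0]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)

    def detect_single_language(self, text: str) -> str:
        ranked = self.detect_many_languages(text)
        if not ranked:
            raise ValueError("no language could be detected")
        return ranked[0][0]


detector = ToyDetector()
sample = "Hola, yo soy Sofia. Hello!"
ranked = detector.detect_many_languages(sample)
best = detector.detect_single_language(sample)
print(ranked)  # [('es', 0.75), ('en', 0.25)]
print(best)    # es
```

A real detector uses a statistical model rather than word lists, but the shape of the two return values is the same.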

## Quickstart

### Langdetect
Detects the language of a text using a naive Bayesian filter. Based on [langdetect](https://github.com/Mimino666/langdetect/tree/master).

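**Important!** It is **non-deterministic**: repeated calls on the same text can return different results, especially when the text is short or ambiguous.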
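
The upstream langdetect README documents setting `DetectorFactory.seed` before detection to make results reproducible. Another option, sketched below, is to call the detector several times and keep the most frequent answer. `stable_detect` is a hypothetical helper, not part of LangChain or langdetect; it accepts any callable that maps a text to a language code, and the `flaky_detect` function is only a stand-in used to demonstrate it.

```python
from collections import Counter
from typing import Callable


def stable_detect(detect: Callable[[str], str], text: str, runs: int = 5) -> str:
    """Call a possibly non-deterministic detector `runs` times
    and return the most common result."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    votes = Counter(detect(text) for _ in range(runs))
    # most_common(1) returns [(language, count)] for the top-voted language.
    return votes.most_common(1)[0][0]


# Stand-in detector that flips between answers, for demonstration only.
_answers = iter(["es", "en", "es", "es", "en"])


def flaky_detect(text: str) -> str:
    return next(_answers)


result = stable_detect(flaky_detect, "Hello world! Soy Sofía.")
print(result)  # es
```

With a `LangDetector` instance named `lang_detector`, the call would look like `stable_detect(lang_detector.detect_single_language, text)`. The trade-off is that each extra run costs one more detection.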

In [2]:
# Install necessary packages
# ! pip install langdetect

In [3]:
from langchain_experimental.language_detector import LangDetector

single_lang_text = "Hello world, my name is John Doe"

lang_detector = LangDetector()
lang_detector.detect_single_language(single_lang_text)

'en'

In [4]:
lang_detector.detect_many_languages(single_lang_text)

[('en', 0.99999476088831)]

In [5]:
many_langs_text = "Hello world! Me lammo Sofía, soy Madrileña. Auf Wiedersehen!"
lang_detector.detect_single_language(many_langs_text)

'es'

In [6]:
lang_detector.detect_many_languages(many_langs_text)

[('en', 0.5714281048193632),
 ('es', 0.28571321105210407),
 ('de', 0.14285709363629925)]

Note that the top-scoring language here (`en`) differs from the `es` returned by `detect_single_language` for the same text. This is likely an effect of the detector's non-determinism, which is most visible on short, mixed-language inputs.

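### FastText
Detects the language of a text using a [fasttext](https://fasttext.cc/docs/en/language-identification.html) model. Note that `FastTextDetector` is not yet available in `langchain_experimental.language_detector`, so the import below currently fails.

In [7]:
# Install necessary packages
# ! pip install fasttext-wheel

In [8]:
from langchain_experimental.language_detector import FastTextDetector

lang_detector = FastTextDetector()
lang_detector.detect_single_language(single_lang_text)

ImportError: cannot import name 'FastTextDetector' from 'langchain_experimental.language_detector' (/home/mateusz/Documents/Projects/langchain/libs/experimental/langchain_experimental/language_detector/__init__.py)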

In [None]:
lang_detector.detect_many_languages(single_lang_text)

In [None]:
lang_detector.detect_single_language(many_langs_text)

In [None]:
lang_detector.detect_many_languages(many_langs_text)

## Usage in chain
Language detection is particularly useful with components that require a language to be selected, e.g. text-to-speech or data anonymization.

### Usage with data anonymizer
Let's see how to combine the functionality of both modules.

In [None]:
# Install other required packages
# ! pip install presidio-analyzer presidio-anonymizer faker
# ! python -m spacy download en_core_web_lg
# ! python -m spacy download es_core_news_sm

In [9]:
from langchain.chat_models import ChatOpenAI
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
from langchain.schema import runnable

languages_config = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        {"lang_code": "es", "model_name": "es_core_news_sm"},
    ],
}
anonymizer = PresidioReversibleAnonymizer(languages_config=languages_config)
llm = ChatOpenAI(temperature=0)

In [10]:
def detect_language(text: str) -> dict:
    language = lang_detector.detect_single_language(text)
    return {"text": text, "language": language}

print(single_lang_text)
chain = (
    runnable.RunnableLambda(detect_language)
    | (lambda x: anonymizer.anonymize(x["text"], language=x["language"])) 
    | llm
)
chain.invoke(single_lang_text)

Hello world, my name is John Doe


AIMessage(content='Hello Nathan Jimenez! How can I assist you today?', additional_kwargs={}, example=False)

As expected, the entity "John Doe" was anonymized.

However, English is the default language. Let's see how the anonymization works for Spanish.

In [11]:
es_lang_text = "Hola el mundo, Yo soy Sofia Lopez"
print(es_lang_text)
chain.invoke(es_lang_text)

Hola el mundo, Yo soy Sofia Lopez


AIMessage(content='¡Hola Scott Diaz! ¿Cómo estás? ¿En qué puedo ayudarte hoy?', additional_kwargs={}, example=False)

We can see that "Sofia Lopez" was anonymized as a single entity.

In [12]:
anonymizer._deanonymizer_mapping.data

{'PERSON': {'Nathan Jimenez': 'John Doe', 'Scott Diaz': 'Sofia Lopez'}}

Let's compare this with an anonymizer that always uses the default language, English.

In [13]:
bare_anonymizer = PresidioReversibleAnonymizer(languages_config=languages_config)
bare_chain = bare_anonymizer.anonymize | llm
bare_chain.invoke(es_lang_text)

AIMessage(content='¡Hola Patrick y Daniel! ¿Cómo están? ¿En qué puedo ayudarles hoy?', additional_kwargs={}, example=False)

This time, both "Yo soy" and "Sofia Lopez" were recognized as PERSON entities and anonymized, because the Spanish text was analyzed as if it were English.

In [14]:
bare_anonymizer._deanonymizer_mapping.data

{'PERSON': {'Patrick Vasquez': 'Yo soy', 'Daniel Lynch': 'Sofia Lopez'}}
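
The chain in this section passes whatever language the detector returns straight to the anonymizer, while `languages_config` only lists `en` and `es`. For a text such as `many_langs_text`, the detector could return `de`, which is not configured. Below is a minimal sketch of a guard, assuming you want to fall back to a default language in that case. `pick_supported_language` and its `min_score` threshold are hypothetical, not part of LangChain; the helper takes `(language, score)` pairs in the shape returned by `detect_many_languages`.

```python
from typing import Iterable, List, Tuple


def pick_supported_language(
    scores: List[Tuple[str, float]],
    supported: Iterable[str],
    default: str = "en",
    min_score: float = 0.5,
) -> str:
    """Return the best-scoring supported language,
    or `default` if none is confident enough."""
    allowed = set(supported)
    candidates = [(lang, score) for lang, score in scores if lang in allowed]
    if not candidates:
        return default
    best_lang, best_score = max(candidates, key=lambda pair: pair[1])
    return best_lang if best_score >= min_score else default


supported = ["en", "es"]
confident = pick_supported_language([("es", 0.9), ("en", 0.1)], supported)
fallback = pick_supported_language([("de", 0.8), ("es", 0.2)], supported)
print(confident)  # es
print(fallback)   # en
```

In the chain, `detect_language` could call `lang_detector.detect_many_languages(text)` and pass the result through this helper before building the dictionary it returns.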

## Next Steps
- Add [fasttext](https://fasttext.cc/docs/en/language-identification.html) language detection, which is reported to be more accurate in some cases.