# Language Detection with the langdetect package
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/lang_detection.ipynb)

## Overview
The implemented model detects the language of a text using a naive Bayes filter. It achieves over 99% precision for 53 languages.
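
To make the idea of a naive Bayes filter more concrete, here is a minimal sketch of a character-trigram classifier written with the standard library only. It is a toy illustration, not the langdetect implementation: the class name `ToyNaiveBayesDetector`, the three training sentences, the uniform prior, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]


class ToyNaiveBayesDetector:
    """Hypothetical toy classifier; NOT how langdetect is implemented."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.vocab = set()  # every n-gram seen during training

    def fit(self, samples):
        """samples: iterable of (language_code, training_text) pairs."""
        for lang, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)

    def scores(self, text):
        """Return (language, probability) pairs, highest probability first."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)
        log_scores = {}
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Uniform prior over languages; add-one smoothing for unseen n-grams.
            log_scores[lang] = sum(
                math.log((counter[g] + 1) / (total + vocab_size)) for g in grams
            )
        # Normalize in log space so the scores sum to 1.
        top = max(log_scores.values())
        weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
        norm = sum(weights.values())
        return sorted(
            ((lang, w / norm) for lang, w in weights.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )


# Tiny made-up training set, far too small for real use.
detector = ToyNaiveBayesDetector()
detector.fit(
    [
        ("en", "hello world my name is john and this is the house where we live"),
        ("es", "hola mundo me llamo sofia y esta es la casa donde vivimos"),
        ("de", "hallo welt ich heisse hans und das ist das haus in dem wir wohnen"),
    ]
)
print(detector.scores("my name is")[0][0])
```

The highest-scoring pair plays the role of a single-language result, while the full sorted list corresponds to a many-languages result.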

The language detector provides the following two methods:
1. detect_single_language - returns the language with the highest score produced by the detection model,
2. detect_many_languages - returns a list of pairs, each consisting of a language and the corresponding score produced by the model.

## Quickstart

In [1]:
# Install necessary packages
# ! pip install langdetect

In [17]:
from langchain_experimental.language_detector import LangDetector

single_lang_text = "Hello world, my name is John Doe"

lang_detector = LangDetector()
lang_detector.detect_single_language(single_lang_text)

'en'

In [2]:
lang_detector.detect_many_languages(single_lang_text)

[('en', 0.999996206585063)]

In [3]:
many_langs_text = "Hello world! Me lammo Sofía, soy Madrileña. Auf Wiedersehen!"
lang_detector.detect_single_language(many_langs_text)

'en'

In [4]:
lang_detector.detect_many_languages(many_langs_text)

[('en', 0.5714275935995099), ('de', 0.42857059444347717)]
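
The two outputs above suggest a simple way to interpret the score pairs: treat the text as mixed when the top score falls below some threshold. The sketch below assumes only that the input is a list of `(language, score)` pairs like the ones printed above. The helper name `summarize_detection` and the `dominance=0.9` threshold are hypothetical choices for illustration, not part of the library.

```python
def summarize_detection(scored, dominance=0.9):
    """Summarize a list of (language, score) pairs.

    `dominance` is an assumed, arbitrary threshold: if the best score is
    below it, the text is flagged as possibly mixed-language.
    """
    if not scored:
        raise ValueError("no languages were detected")
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    top_lang, top_score = ranked[0]
    return {
        "language": top_lang,
        "score": top_score,
        "is_mixed": top_score < dominance,
    }


# Score pairs copied from the outputs shown above.
single = [("en", 0.999996206585063)]
mixed = [("en", 0.5714275935995099), ("de", 0.42857059444347717)]

print(summarize_detection(single))
print(summarize_detection(mixed))
```

Note that in the mixed example Spanish does not appear in the scores at all, so a low top score signals uncertainty rather than a reliable list of every language present.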

## Usage in chain
Language detection is particularly useful with components that require a language to be selected, e.g. text-to-speech or a data anonymizer.
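
A downstream component usually supports only a fixed set of languages, so the detected code may need to be mapped onto that set before it is passed along. The following is a minimal sketch of such a fallback step. The helper `select_supported_language` and its `default` parameter are hypothetical names introduced here, not part of LangChain or langdetect.

```python
def select_supported_language(detected, supported, default="en"):
    """Map a detected language code onto the languages a component supports.

    Hypothetical helper: falls back to `default` when the detected language
    is not configured in the downstream component.
    """
    code = detected.lower()
    if code in supported:
        return code
    # Region-qualified codes such as "zh-cn" can fall back to their base code.
    base = code.split("-")[0]
    if base in supported:
        return base
    return default


supported_languages = {"en", "es"}
print(select_supported_language("es", supported_languages))
print(select_supported_language("de", supported_languages))
```

Whether a silent fallback is appropriate depends on the component. For an anonymizer, it may be safer to raise an error than to process text with the wrong language model.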

### Usage with data anonymizer
Let's see how to combine the functionality of both modules.

In [68]:
# Install other required packages
# ! pip install presidio-analyzer presidio-anonymizer faker
# ! python -m spacy download en_core_web_lg
# ! python -m spacy download es_core_news_sm

In [61]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import TransformChain
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

languages_config = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        {"lang_code": "es", "model_name": "es_core_news_sm"},
    ],
}
anonymizer = PresidioReversibleAnonymizer(languages_config=languages_config)
llm = ChatOpenAI(temperature=0)

In [62]:
def lang_transform_func(inputs: dict) -> dict:
    text = inputs["text"]
    language = lang_detector.detect_single_language(text)
    return {"text": text, "language": language}

lang_transform_chain = TransformChain(
    input_variables=["text"],
    output_variables=["text", "language"],
    transform=lang_transform_func,
)

In [63]:
print(single_lang_text)
chain = lang_transform_chain | (lambda x: anonymizer.anonymize(x["text"], language=x["language"])) | llm
chain.invoke(single_lang_text)

Hello world, my name is John Doe


AIMessage(content='Hello Emily Woods! How can I assist you today?', additional_kwargs={}, example=False)

As expected, the entity "John Doe" was anonymized.

However, English is the default language. Let's see how anonymization works for Spanish.

In [64]:
es_lang_text = "Hola el mundo, Yo soy Sofia Lopez"
print(es_lang_text)
chain.invoke(es_lang_text)

Hola el mundo, Yo soy Sofia Lopez


AIMessage(content='¡Hola Daisy Dudley! ¿Cómo puedo ayudarte hoy?', additional_kwargs={}, example=False)

We can see that "Sofia Lopez" was anonymized as a single entity.

In [65]:
anonymizer._deanonymizer_mapping.data

{'PERSON': {'Emily Woods': 'John Doe', 'Daisy Dudley': 'Sofia Lopez'}}
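
The mapping above is what makes the anonymization reversible: each fake value points back to the original one. As a rough illustration of how such a mapping could be applied to a model response, here is a plain-dictionary sketch. The function `restore_originals` is a hypothetical helper written for this example and is not the library's own deanonymization logic.

```python
def restore_originals(text, mapping):
    """Replace fake values in `text` with the originals from `mapping`.

    `mapping` has the shape shown above: {entity_type: {fake: original}}.
    Hypothetical helper based on simple string replacement.
    """
    for replacements in mapping.values():
        # Replace longer fake values first so a shorter one cannot clobber
        # part of a longer one.
        for fake in sorted(replacements, key=len, reverse=True):
            text = text.replace(fake, replacements[fake])
    return text


# Mapping and responses copied from the outputs shown above.
mapping = {"PERSON": {"Emily Woods": "John Doe", "Daisy Dudley": "Sofia Lopez"}}

print(restore_originals("Hello Emily Woods! How can I assist you today?", mapping))
print(restore_originals("¡Hola Daisy Dudley! ¿Cómo puedo ayudarte hoy?", mapping))
```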

Let's compare this with an anonymizer that uses the default English language.

In [66]:
bare_anonymizer = PresidioReversibleAnonymizer(languages_config=languages_config)
bare_chain = bare_anonymizer.anonymize | llm
bare_chain.invoke(es_lang_text)

AIMessage(content='¡Hola Crystal y Monica! ¿Cómo están? ¿En qué puedo ayudarles hoy?', additional_kwargs={}, example=False)

This time, because the Spanish text was analyzed with the default English model, both "Yo soy" and "Sofia Lopez" were recognized as PERSON entities and anonymized.

In [67]:
bare_anonymizer._deanonymizer_mapping.data

{'PERSON': {'Crystal Dawson': 'Yo soy', 'Monica Black': 'Sofia Lopez'}}