# Final Project

## **Introduction**

In Natural Language Processing (NLP), languages are categorized based on the availability of linguistic resources:

- **High-Resource Languages:** Languages like English, Spanish, and Chinese, which have abundant linguistic data, including extensive text corpora, annotated datasets, and comprehensive computational tools. This wealth of resources facilitates the development of effective NLP applications such as machine translation, sentiment analysis, and speech recognition.

- **Low-Resource Languages:** Languages that lack extensive data and funding, making it challenging to develop effective NLP applications.

To bridge the gap between high-resource and low-resource languages, **knowledge transfer** is employed. This involves leveraging knowledge gained from one domain or language to improve performance in another. In cross-lingual transfer, models trained on high-resource languages are adapted to perform tasks in low-resource languages, enhancing NLP capabilities where data is scarce.

**Multilingual Models** play a pivotal role in this process. Trained on data from multiple languages simultaneously, they learn shared representations across languages, allowing for effective cross-lingual transfer. This shared understanding enables the model to apply knowledge from high-resource languages to low-resource ones, improving performance in tasks such as machine translation and language understanding.

Understanding what drives cross-lingual transfer is crucial for optimizing multilingual models, especially for low-resource languages. This study investigates how language contamination, linguistic distance, and script differences between source and target languages influence the efficiency of cross-lingual knowledge transfer. By analyzing these factors, we seek to identify what facilitates or hinders a model's ability to generalize to new languages, and thereby to inform improvements in multilingual model performance.

We rely on a key metric, **data transfer**, proposed by De Souza et al. (2024), to quantify how effectively a model uses its pretraining knowledge when fine-tuned for a target language.
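For orientation, the metric can be written in the effective-data framing that, to our understanding, De Souza et al. (2024) adopt. The symbols below are our own notation and are an assumption to be checked against the original paper:

$$
D_T = D_E - D_F, \qquad \text{fraction of effective data from transfer} = \frac{D_T}{D_E}
$$

Here $D_F$ is the amount of target-language data used for fine-tuning, $D_E$ is the amount of target-language data a model trained from scratch would need to reach the same loss, and $D_T$ is the data transfer attributed to pretraining. A fraction close to 1 would mean that most of the effective data comes from pretraining rather than from fine-tuning.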

The central focus of this work is to investigate how linguistic and dataset-related factors contribute to variations in this metric. Specifically, we examine the roles of:

- **Language Contamination:** The overlap of sentences between source and target datasets.

- **Linguistic Distances:** Pre-computed measures of syntactic, phonological, and genetic differences.

- **Script Differences:** Captured through the squared difference between the Unicode token-to-text ratios of the source and target languages.
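To make the contamination and script measures above more concrete, here is a minimal Python sketch. It is not the procedure used by De Souza et al. (2024): the function names, the choice of UTF-8 bytes as the text length, and the character-level tokenizer are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def sentence_overlap_ratio(source: Iterable[str], target: Iterable[str]) -> float:
    """Fraction of target sentences that also occur verbatim in the source.

    Hypothetical stand-in for a contamination measure based on sentence overlap.
    """
    source_set = {s.strip() for s in source}
    target_list = [t.strip() for t in target]
    if not target_list:
        raise ValueError("target must contain at least one sentence")
    return sum(t in source_set for t in target_list) / len(target_list)


def token2text_rate(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Tokens produced per UTF-8 byte of text (assumed definition)."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    return len(tokenize(text)) / n_bytes


def squared_rate_difference(
    source_text: str, target_text: str, tokenize: Callable[[str], List[str]]
) -> float:
    """Squared difference between source and target token-to-text rates."""
    diff = token2text_rate(source_text, tokenize) - token2text_rate(target_text, tokenize)
    return diff ** 2


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): one token per Unicode character."""
    return list(text)


# Latin script: every character here takes one UTF-8 byte.
rate_en = token2text_rate("hello world", char_tokenize)
# Cyrillic script: each letter takes two UTF-8 bytes, the space takes one.
rate_ru = token2text_rate("привет мир", char_tokenize)
script_gap = squared_rate_difference("hello world", "привет мир", char_tokenize)

overlap = sentence_overlap_ratio(["a b", "c d", "e f"], ["c d", "x y"])
```

With this toy tokenizer the Latin-script sample has a rate of exactly 1.0, while the Cyrillic sample has a lower rate because most of its characters need two bytes. A real analysis would plug in the tokenizer of the model under study.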

Previous work in cross-lingual transfer learning suggests that language similarity, contamination, and script differences significantly influence model performance.

**Language Similarity:**
Lin et al. (2024) introduced mPLM-Sim, a measure assessing language similarities using multilingual pretrained language models. Their findings indicate that mPLM-Sim correlates with linguistic similarity measures and can effectively select source languages to enhance zero-shot cross-lingual transfer performance.

**Language Contamination:**
Blevins and Zettlemoyer (2022) explored how unintended inclusion of non-English data during pretraining affects cross-lingual transfer. They discovered that even minimal amounts of non-English text in the training data can significantly improve a model's performance in other languages, highlighting the impact of language contamination on cross-lingual capabilities.

**Script Differences and Tokenization:**
Rust et al. (2021) examined the impact of tokenization on multilingual language models. Their study found that designated monolingual tokenizers play a crucial role in downstream performance, especially for languages with unique scripts. Replacing a multilingual tokenizer with a specialized monolingual one improved performance across various tasks and languages.

In this context, we analyze the experimental results of De Souza et al. (2024) to test the hypothesis that these three factors contribute to the variability of the data transfer metric.


---

## **Quick Description**

- **Objective:** Analyze factors influencing the fraction of effective data transfer during cross-lingual fine-tuning.
- **Key Variables:** 
  - Contamination ratios (`detected_language_ratio` and `detected_language_ratio_on_target`).
  - Linguistic distances (e.g., `syntactic_distance`, `phonological_distance`).
  - Script differences (`token2text_rate_square_difference`).
- **Methodology:** Conduct exploratory data analysis, followed by hypothesis testing using statistical models to evaluate the impact of these variables on the metric.
- **Expected Outcome:** Identify the most influential predictors and understand their interaction with cross-lingual transfer efficiency.
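As one possible shape for the hypothesis-testing step, the sketch below runs a permutation test on the Pearson correlation between a single predictor and the metric. It uses only the Python standard library. The numbers are made-up toy values, not results from De Souza et al. (2024), and the variable `data_transfer_fraction` is a hypothetical name for the target metric.

```python
import random
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def permutation_p_value(
    x: Sequence[float], y: Sequence[float], n_permutations: int = 2000, seed: int = 0
) -> float:
    """Two-sided p-value for H0: no association between x and y.

    Shuffling y breaks any pairing with x, so the shuffled correlations
    approximate the distribution of r under the null hypothesis.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy values for illustration only (NOT experimental results).
syntactic_distance = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
data_transfer_fraction = [0.9, 0.85, 0.7, 0.72, 0.5, 0.45, 0.3, 0.2]

r = pearson_r(syntactic_distance, data_transfer_fraction)
p = permutation_p_value(syntactic_distance, data_transfer_fraction)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A permutation test is used here because it makes no normality assumption, which may matter with a small number of language pairs. A regression model with several predictors would be the natural next step for studying their joint effect.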

## **References**

- Blevins, T., & Zettlemoyer, L. (2022). Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models. *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, 3563–3574.

- De Souza, L. R., Almeida, T. S., Lotufo, R., & Nogueira, R. (2024). Measuring Cross-lingual Transfer in Bytes. *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, 7526–7537. https://doi.org/10.18653/v1/2024.naacl-long.418

- Lin, P., Hu, C., Zhang, Z., Martins, A. F. T., & Schütze, H. (2024). mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models. *Findings of the Association for Computational Linguistics: EACL 2024*, 276–310.

- Rust, P., Pfeiffer, J., Vulić, I., Ruder, S., & Gurevych, I. (2021). How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)*, 3118–3135. https://doi.org/10.18653/v1/2021.acl-long.243

