# The Quadrilingual Land of Lonpestia

In a land far, far away, there lies a country with four official languages: Englcrevbeh, Hungeleabeen, En Gli and Hure. The locals understand each other without any trouble, even when they speak different languages, but the same cannot be said for you, a simple tourist in the harsh lands of Lonpestia.

All of the locals in Lonpestia become extremely angry if you assume they belong to a different demographic, so you need to be able to correctly identify what language people are speaking in order to safely ask for directions and survive your vacation.

## Train & Test Data

There is no training data. 🙂

The test data is a series of texts in the four languages spoken in Lonpestia:

- Englcrevbeh
- Hungeleabeen
- En Gli
- Hure

## Task

To start off, you will need to correctly identify if two people are speaking the same language in a conversation. If not, you won't be able to approach them for directions, because at least one of them will attack you.

### Subtask 1 – Monolingual Conversations (30p)

You are given a series of data points in the following format: `datapointID`, `textA`, `textB`. Your task is to create a submission that, for each datapoint in this subtask, specifies whether `textA` and `textB` are written in the same language.

Once you have confirmed that a conversation is monolingual, you need to figure out which language to approach the locals with!

### Subtask 2 – Language Identification (70p)

You are given a series of data points in the following format: `datapointID`, `textA`, `textB`, where `textB` is always missing. Your task is to create a submission that, for each datapoint, returns the language `textA` is written in. `textB` can be safely disregarded for this subtask.

## Output Format

Your submission must be a .csv file with the columns `datapointID`, `subtaskID`, `answer`. The `subtaskID` column is 1 or 2, depending on which subtask the row answers.

The `answer` column must contain:

- either `true` or `false` for rows corresponding to the first subtask.
- one of `Englcrevbeh`, `Hungeleabeen`, `En Gli` or `Hure` for rows corresponding to the second subtask.

Values in the `answer` column are case-insensitive. For example, for the first subtask, `true`, `True` and `tRue` are all accepted as `true`. For the second subtask, `engLCRevbeh` is accepted in place of `Englcrevbeh`, but `EnGli` is not accepted in place of `En Gli`, because the space is missing.

The `datapointID` column must match the `datapointID` of the row in `test_data.csv` that you are answering.
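
The case rule above can be checked locally before submitting. The following is a minimal sketch, not part of the task materials: `normalize_answer` is a hypothetical helper, and it assumes the judge lowercases answers and otherwise compares them exactly.

```python
# Hypothetical helper, not part of the official task materials.
# Assumption: the judge lowercases answers and otherwise compares them exactly.
LANGS = ["Englcrevbeh", "Hungeleabeen", "En Gli", "Hure"]
BOOLS = ["true", "false"]


def normalize_answer(subtask_id, answer):
    """Return the canonical spelling of `answer`, or None if it is not accepted."""
    allowed = BOOLS if subtask_id == 1 else LANGS
    for canonical in allowed:
        if answer.lower() == canonical.lower():
            return canonical
    return None


print(normalize_answer(1, "tRue"))         # true
print(normalize_answer(2, "engLCRevbeh"))  # Englcrevbeh
print(normalize_answer(2, "EnGli"))        # None
```

Running every `answer` value through such a check catches spelling slips such as a missing space before they cost points.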

## Scoring

The first subtask is scored as follows:

- F1 < 0.65 → 2 points.
- 0.65 <= F1 < 0.85 → 10-20 points.
- 0.85 <= F1 → 30 points.

The second subtask is scored as follows:

- Accuracy < 0.4 → 3 points.
- 0.4 <= Accuracy < 0.8 → 10-60 points.
- 0.8 <= Accuracy < 0.95 → 65 points.
- 0.95 <= Accuracy → 70 points.
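
The thresholds above can be written down as two small lookup functions. This is only a sketch: `subtask1_points` and `subtask2_points` are hypothetical names, and because the statement does not say how points vary inside the 10-20 and 10-60 bands, each function returns the band as a `(min, max)` pair rather than a single number.

```python
# Hypothetical helpers: map a metric value to the (min, max) points listed above.
# Assumption: the rule for interpolating inside a band is not given in the task,
# so the whole band is returned instead of a single score.

def subtask1_points(f1):
    if f1 < 0.65:
        return (2, 2)
    if f1 < 0.85:
        return (10, 20)
    return (30, 30)


def subtask2_points(accuracy):
    if accuracy < 0.4:
        return (3, 3)
    if accuracy < 0.8:
        return (10, 60)
    if accuracy < 0.95:
        return (65, 65)
    return (70, 70)


print(subtask1_points(0.7))   # (10, 20)
print(subtask2_points(0.96))  # (70, 70)
```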

## Editorial

[Editorial.](https://gitlab.com/nitronlp/judge-editorials/-/wikis/Cram-School-PreONIA-2025)

In [1]:
import pandas as pd
import numpy as np
from itertools import permutations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [2]:
df = pd.read_csv("test_data.csv")
df

| Unnamed: 0 | datapointID | textA | textB |
|---|---|---|---|
| 0 | 120 | o utse airs eaien eii llin cri. | |
| 1 | 1168 | e str er's ga ii hie ied al wi e aad en swih e... | |
| 2 | 796 | den diin eipri skis, eie ren ers ie taend ie c... | |
| 3 | 114 | he e kekcleazologuclea nem e hejen eleajecleaz... | |
| 4 | 989 | licrhcrevbe coheleaenlicr gleaoup crhiileaed i... | |
| ... | ... | ... | ... |
| 1582 | 782 | e eldi taams i event r uns ae diii ers (i ee i... | |
| 1583 | 416 | elegucleagucleauk eclea felegucleagucleauk cle... | |
| 1584 | 760 | hev er, iis eded ieet urn, wi e t20 iraif ee, ... | |
| 1585 | 656 | hbele ciicrlicrlecr eleae ucred licro hold ii ... | |


## Subtask 1

In [3]:
df1 = df[df['textB'].notnull()]

In [4]:
# Character n-grams (2 to 4 characters) capture spelling patterns without a word tokenizer.
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))

In [5]:
subtask1_rows = []

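for idx, row in df1.iterrows():
    # Fit TF-IDF on this pair alone so both texts share one vocabulary.
    vectors = vectorizer.fit_transform([row['textA'], row['textB']])
    sim = cosine_similarity(vectors[0], vectors[1])[0, 0]

    # Pairs above the similarity threshold are treated as the same language.
    same_lang = sim > 0.4
    subtask1_rows.append([1, row['datapointID'], str(same_lang)])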
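
The idea behind the cell above is that two texts in the same language share many character n-grams. The sketch below illustrates this with the standard library only. It swaps in raw n-gram counts for TF-IDF weights, so its similarity values will differ from the notebook's and the 0.4 threshold would need re-tuning. The helper names and the toy sentences are made up for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=2, n_max=4):
    """Count all character n-grams of length n_min..n_max in `text`."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_language(text_a, text_b, threshold=0.4):
    """Assumed decision rule: similarity above `threshold` means same language."""
    return cosine(char_ngrams(text_a), char_ngrams(text_b)) > threshold


# Toy inputs, not taken from test_data.csv.
print(cosine(char_ngrams("hello there"), char_ngrams("hello world")))
print(same_language("aaaa", "bbbb"))  # False: no n-gram is shared
```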

## Subtask 2

In [6]:
df2 = df[df["textB"].isnull()]

In [7]:
langs = ["Englcrevbeh", "Hungeleabeen", "En Gli", "Hure"]

In [8]:
texts = df2["textA"].tolist()
vectors = vectorizer.fit_transform(texts)

In [9]:
pca = PCA(3)
vectors = pca.fit_transform(vectors)

In [10]:
kmeans = KMeans(n_clusters=len(langs), random_state=42)
clusters = kmeans.fit_predict(vectors)

In [11]:
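# The language names appear to be spelled in the style of their own language,
# so they are used here as the only labelled samples.
lang_vectors = vectorizer.transform(langs)
lang_vectors = pca.transform(lang_vectors)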

In [12]:
best_perm = None
best_score = -np.inf

# Try every cluster-to-language assignment (4! = 24) and keep the best one.
for perm in permutations(range(len(langs))):
    l_perm = lang_vectors[list(perm)]
    sim_matrix = cosine_similarity(kmeans.cluster_centers_, l_perm)
    score = np.trace(sim_matrix)  # sum of diagonal elements (pairwise sim)

    if score > best_score:
        best_score = score
        best_perm = perm
best_perm, best_score

((1, 0, 2, 3), np.float64(1.4518432190427306))
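
The permutation search above can be illustrated on a toy example without any dependencies. This is a sketch under assumptions: `best_assignment` is a hypothetical helper, and the 2-D vectors are made up for illustration rather than taken from the notebook.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_assignment(centers, label_vectors):
    """Return (perm, score) where cluster i is matched to label perm[i]."""
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(len(label_vectors))):
        # Total similarity of this one-to-one matching.
        score = sum(cosine(centers[i], label_vectors[p]) for i, p in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy data: three cluster centres and three label vectors in a shuffled order.
centers = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
label_vectors = [[0.9, 0.1], [-1.0, 0.2], [0.1, 0.8]]
perm, score = best_assignment(centers, label_vectors)
print(perm)  # (2, 0, 1)
```

Brute force is fine for four languages, since only 24 assignments exist. For many more clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual alternative.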

In [13]:
subtask2_rows = []

for did, cidx in zip(df2['datapointID'], clusters):
    # Cluster i was matched to language best_perm[i] in the search above.
    lang = langs[best_perm[cidx]]
    subtask2_rows.append([2, did, lang])

## Save answers

In [14]:
submission_rows = subtask1_rows + subtask2_rows
df_submission = pd.DataFrame(submission_rows, columns=["subtaskID", "datapointID", "answer"])
df_submission.to_csv("submission.csv", index=False)

## Submission results

Subtask 1:
- F1 Score: 0.85475
- Score: 30/30

Subtask 2:
- Accuracy: 0.978172
- Score: 70/70