https://www.hackerrank.com/challenges/stitch-the-torn-wiki/problem

## Input Format

An integer N on the first line. This is followed by 2N+1 lines.

Text fragments (numbered 1 to N) from Set A, each on a new line (so a total of N lines).

A separator with five asterisk marks "*" which indicates the end of Set A and beginning of Set B.

Text fragments (numbered 1 to N) from Set B, each on a new line (so a total of N lines).
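
As an illustration of the layout above, here is a minimal parsing sketch. It assumes the whole input is already available as a list of lines; the helper name `parse_input` and the toy fragments are hypothetical and not part of the challenge.

```python
def parse_input(lines):
    """Split raw input lines into (n, set_a, set_b).

    Assumes the layout described above: N, then N fragments,
    a separator line of asterisks, then N more fragments.
    """
    n = int(lines[0].strip())
    set_a = [line.strip() for line in lines[1:n + 1]]
    separator = lines[n + 1].strip()
    if set(separator) != {"*"}:
        raise ValueError("expected a separator line of asterisks")
    set_b = [line.strip() for line in lines[n + 2:2 * n + 2]]
    return n, set_a, set_b


# Toy input with N = 2 (made-up fragments).
sample = ["2", "first A", "second A", "*****", "first B", "second B"]
n, set_a, set_b = parse_input(sample)
print(n, set_a, set_b)
```

Checking the separator explicitly makes an off-by-one slice fail loudly instead of silently treating the separator as a fragment.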

## Output Format

N lines, each containing one integer.

The i-th line should contain an integer j such that the i-th element of Set A and the j-th element of Set B are a pair, i.e., both originally came from the same block of text/Wikipedia article.

## Constraints

1 <= N <= 100

No text fragment will have more than 10000 characters.



## Sample Input
(Please note that the real inputs used will be much longer, and generated with text blocks with 500-1000 words. This is for explanatory purposes only)

```txt
3
Delhi (also known as the National Capital Territory of India) is a metropolitan region in India that includes the national capital city, New Delhi. With a population of 22 million in 2011, it is the world's second most populous city and the largest city in India in terms of area. The NCT and its urban region have been given the special status of National Capital Region (NCR) under the Constitution of India's 69th amendment act of 1991. The NCR includes the neighbouring cities of Baghpat, Gurgaon, Sonepat, Faridabad, Ghaziabad, Noida, Greater Noida and other nearby towns, and has nearly 22.2 million residents.    
Seattle is a coastal seaport city and the seat of King County, in the U.S. state of Washington. With an estimated 634,535 residents as of 2012, Seattle is the largest city in the Pacific Northwest region of North America and one of the fastest-growing cities in the United States. The Seattle metropolitan area of around 4 million inhabitants is the 15th largest metropolitan area in the nation.[6] The city is situated on a narrow isthmus between Puget Sound (an inlet of the Pacific Ocean) and Lake Washington, about 100 miles (160 km) south of the Canada–United States border. A major gateway for trade with Asia, Seattle is the 8th largest port in the United States and 9th largest in North America in terms of container handling.  
Martin Luther OSA (10 November 1483 – 18 February 1546) was a German monk, Catholic priest, professor of theology and seminal figure of a reform movement in 16th century Christianity, subsequently known as the Protestant Reformation.[1] He strongly disputed the claim that freedom from God's punishment for sin could be purchased with money. He confronted indulgence salesman Johann Tetzel, a Dominican friar, with his Ninety-Five Theses in 1517. His refusal to retract all of his writings at the demand of Pope Leo X in 1520 and the Holy Roman Emperor Charles V at the Diet of Worms in 1521 resulted in his excommunication by the Pope and condemnation as an outlaw by the Emperor.
*****  
The Seattle area had been inhabited by Native Americans for at least 4,000 years before the first permanent European settlers. Arthur A. Denny and his group of travelers, subsequently known as the Denny Party, arrived at Alki Point on November 13, 1851. The settlement was moved to its current site and named "Seattle" in 1853, after Chief Si'ahl of the local Duwamish and Suquamish tribes.  
Although technically a federally administered union territory, the political administration of the NCT of Delhi today more closely resembles that of a state of India, with its own legislature, high court and an executive council of ministers headed by a Chief Minister. New Delhi is jointly administered by the federal government of India and the local government of Delhi, and is the capital of the NCT of Delhi.  
Luther taught that salvation and subsequently eternity in heaven is not earned by good deeds but is received only as a free gift of God's grace through faith in Jesus Christ as redeemer from sin and subsequently eternity in hell. His theology challenged the authority of the Pope of the Roman Catholic Church by teaching that the Bible is the only source of divinely revealed knowledge from God and opposed sacerdotalism by considering all baptized Christians to be a holy priesthood. Those who identify with these, and all of Luther's wider teachings, are called Lutherans.
```



## Sample Output

2  
1  
3  


### Explanation
```txt
The first, second and third text fragment of Set A are about Delhi, Seattle and Martin Luther respectively.
In Set B, the paragraph on Delhi is the second text fragment.
The paragraph on Seattle is the first text fragment in Set B.
The paragraph on Martin Luther is the third text fragment in Set B.
So, the expected output is 2, 1, 3 respectively.
```

## Scoring
```txt
A sample test case with twenty paragraphs is provided to you when you Compile and Test.
Extensive training data is not required for this challenge. The weightage for a test case will be proportional to the number of tests (Articles) which it contains. This works out to a ratio of 1:2 (Sample Test: Hidden Test).
Score = M * (C)/N Where M is the Maximum Score for the test case.
C = Number of correct answers in your output.
N = Total number of Wikipedia Articles (which were split into 2N fragments and divided into Set A and Set B respectively).
```
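
The scoring formula above can be sketched in a few lines. The function name `score` and the numbers used here are made up for illustration only.

```python
def score(max_score, correct, total):
    """Score = M * C / N, as stated in the scoring rules."""
    return max_score * correct / total


# Hypothetical test case: maximum score 50, 18 of 20 articles matched correctly.
example_score = score(50, 18, 20)
print(example_score)  # 45.0
```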

Note:
    Submissions will be disqualified if it is evident that the code has been written in such a way that the sample test case answers are hard-coded, or similar approaches, where the answer is not computed, but arrived at by trying to ensure the code matches the sample answers.

# Solution 1 (score: 92.5)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def find_matching_fragments(N, fragments):
    set_a = fragments[:N]
    set_b = fragments[N+1:]  # skip the separator line at index N

    # Build a shared vocabulary from both sets
    vectorizer = CountVectorizer().fit(set_a + set_b)
    tfidf_transformer = TfidfTransformer()

    # Compute TF-IDF vectors
    # Note: the IDF weights are fitted on Set A only and then applied to Set B
    tfidf_a = tfidf_transformer.fit_transform(vectorizer.transform(set_a))
    tfidf_b = tfidf_transformer.transform(vectorizer.transform(set_b))

    # Compute the cosine similarity between every A/B pair
    similarities = cosine_similarity(tfidf_a, tfidf_b)

    # For each fragment in Set A, pick the most similar fragment in Set B
    result = []
    for i in range(N):
        most_similar_index = np.argmax(similarities[i])
        result.append(most_similar_index + 1)  # output is 1-based

    return result

# Read the input
N = int(input())
fragments = [input().strip() for _ in range(2 * N + 1)]

# Find the matching fragments
result = find_matching_fragments(N, fragments)

# Print the result
for res in result:
    print(res)
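
Solution 1 picks the best match for each row independently, so nothing stops two fragments of Set A from choosing the same fragment of Set B. The sketch below shows this on a hypothetical similarity matrix; the scores and the helper name `greedy_match` are made up, and plain lists are used instead of NumPy to keep it self-contained.

```python
def greedy_match(similarities):
    """For each row, pick the column with the highest similarity (1-based)."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in similarities]


# Hypothetical similarity scores: rows are Set A, columns are Set B.
similarities = [
    [0.90, 0.80, 0.10],
    [0.85, 0.30, 0.20],
    [0.10, 0.20, 0.70],
]
matches = greedy_match(similarities)
has_duplicates = len(set(matches)) < len(matches)
print(matches)         # [1, 1, 3]
print(has_duplicates)  # True: fragment 1 of Set B is chosen twice
```

Since every fragment in Set B belongs to exactly one fragment in Set A, a duplicate like this means at least one answer is wrong.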

# Solution 2 (score: 95)

In [None]:
import re
import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from scipy.optimize import linear_sum_assignment

def preprocess_text(text):
    # Lowercase the text and split it into word tokens, dropping punctuation
    words = re.findall(r'\w+', text.lower())
    return words

def create_word_count_dict(texts):
    # Build one word-frequency Counter per text
    word_count_dicts = [Counter(preprocess_text(text)) for text in texts]
    return word_count_dicts

def create_global_word_count_dict(word_count_dicts):
    # Sum the per-text counters into one corpus-wide word count
    global_word_count = Counter()
    for word_count in word_count_dicts:
        global_word_count.update(word_count)
    return global_word_count

def normalize_word_count_dict(word_count_dict, global_word_count):
    # Divide each word's count by its corpus-wide count, so that words
    # that are frequent across all fragments get less weight
    normalized_dict = {word: count / global_word_count[word] for word, count in word_count_dict.items()}
    return normalized_dict

def dict_to_vector(word_count_dict, global_word_count):
    # Convert a word-weight dict into a dense vector over the global vocabulary
    vector = np.zeros(len(global_word_count))
    for i, word in enumerate(global_word_count):
        if word in word_count_dict:
            vector[i] = word_count_dict[word]
    return vector

def find_matching_fragments(N, fragments):
    set_a = fragments[:N]
    set_b = fragments[N+1:]  # skip the separator line at index N

    # Build the per-fragment word counts
    word_count_dicts_a = create_word_count_dict(set_a)
    word_count_dicts_b = create_word_count_dict(set_b)

    # Build the corpus-wide word counts
    global_word_count = create_global_word_count_dict(word_count_dicts_a + word_count_dicts_b)

    # Normalize the per-fragment word counts
    normalized_dicts_a = [normalize_word_count_dict(word_count, global_word_count) for word_count in word_count_dicts_a]
    normalized_dicts_b = [normalize_word_count_dict(word_count, global_word_count) for word_count in word_count_dicts_b]

    # Convert the normalized dicts to vectors
    vectors_a = np.array([dict_to_vector(word_count, global_word_count) for word_count in normalized_dicts_a])
    vectors_b = np.array([dict_to_vector(word_count, global_word_count) for word_count in normalized_dicts_b])

    # Compute the cosine similarity between every A/B pair
    similarity_matrix = cosine_similarity(vectors_a, vectors_b)

    # Find the best one-to-one matching.
    # linear_sum_assignment solves the same assignment problem as the Hungarian algorithm.
    # The linear assignment problem minimizes total cost, but we want the highest
    # total similarity, which is the opposite, so the matrix is negated.
    # row_ind[i] is the index of a fragment in Set A and col_ind[i] is the index
    # of the fragment in Set B matched to it.
    row_ind, col_ind = linear_sum_assignment(-similarity_matrix)

    return col_ind + 1  # output is 1-based

# Read the input
N = int(input())
fragments = [input().strip() for _ in range(2 * N + 1)]

# Find the matching fragments
result = find_matching_fragments(N, fragments)

# Print the result
for res in result:
    print(res)
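
To see what the weighting in `normalize_word_count_dict` does, here is a small standalone sketch on two made-up fragments. It repeats the same `count / global_count[word]` formula inline rather than importing the functions above.

```python
from collections import Counter

# Two toy fragments; the texts are made up for illustration.
counts = [
    Counter("the city of delhi".split()),
    Counter("the city of seattle".split()),
]

# Corpus-wide word counts, as in create_global_word_count_dict.
global_count = Counter()
for c in counts:
    global_count.update(c)

# Same weighting as normalize_word_count_dict: count / corpus-wide count.
weights = [{w: n / global_count[w] for w, n in c.items()} for c in counts]
print(weights[0])
```

Words shared by both fragments ("the", "city", "of") end up with weight 0.5, while the distinctive word "delhi" keeps weight 1.0, so distinctive words dominate the cosine similarity.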

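# Hungarian Algorithm

The Hungarian algorithm, also known as the Kuhn-Munkres algorithm, is a combinatorial optimization algorithm for matching problems on bipartite graphs. It finds a maximum-weight or minimum-weight matching in polynomial time and is commonly used to solve the assignment problem. A detailed explanation follows: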

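## Problem Background
In the assignment problem there are two groups of elements, for example workers and tasks, and each worker has a cost or a benefit for completing each task. The goal is to find a one-to-one assignment that minimizes the total cost or maximizes the total benefit.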
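
To make the problem definition concrete, the sketch below solves a tiny instance by trying every possible assignment. This is a brute-force illustration of the problem, not the Hungarian algorithm itself; the function name `brute_force_assignment` is hypothetical, and the 3x3 matrix is the one from the worked example later in this section.

```python
from itertools import permutations


def brute_force_assignment(cost):
    """Try every one-to-one assignment and return (min_total, columns).

    columns[i] is the 0-based column assigned to row i. This takes O(n!)
    time and is only meant for tiny matrices.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda cols: sum(cost[i][cols[i]] for i in range(n)),
    )
    return sum(cost[i][best[i]] for i in range(n)), best


cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, columns = brute_force_assignment(cost)
print(total, columns)  # 5 (1, 0, 2)
```

With 0-based indices, `(1, 0, 2)` means row 1 takes column 2, row 2 takes column 1, and row 3 takes column 3.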

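## Algorithm Steps
The core idea of the Hungarian algorithm is to simplify the problem through a series of matrix operations until an optimal matching can be read off. The main steps are: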

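## Build the initial cost matrix:

Represent the problem as an N x N cost matrix in which each element is the cost of one worker completing one task.

## Row operation:

For each row, subtract the row's minimum from every element, so that each row contains at least one zero.

## Column operation:

For each column, subtract the column's minimum from every element, so that each column contains at least one zero.

## Cover the zeros:

Cover all the zeros with as few horizontal and vertical lines as possible. If the number of lines equals the dimension of the matrix, an optimal matching has been found; otherwise, continue to the next step.

## Adjust the matrix:

Find the smallest uncovered element, **subtract** it from every uncovered element, **and add it to every element that is covered twice** (by both a horizontal and a vertical line). Repeat the cover-the-zeros and adjust-the-matrix steps until an optimal matching is found.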

## Example:
Suppose we have the following cost matrix:
```txt
  1  2  3
1 4  1  3
2 2  0  5
3 3  2  2
```

Step 1: Row operation

Subtract each row's minimum from that row:

```txt
  1  2  3
1 3  0  2
2 2  0  5
3 1  0  0
```

Step 2: Column operation

Subtract each column's minimum from that column:

```txt
  1  2  3
1 2  0  2
2 1  0  5
3 0  0  0
```
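
The two reduction steps above can be reproduced with a short sketch. The helper names `reduce_rows` and `reduce_columns` are hypothetical; the matrix is the example cost matrix without its row and column labels.

```python
def reduce_rows(matrix):
    """Subtract each row's minimum from every element of that row."""
    return [[value - min(row) for value in row] for row in matrix]


def reduce_columns(matrix):
    """Subtract each column's minimum from every element of that column."""
    column_mins = [min(column) for column in zip(*matrix)]
    return [[value - m for value, m in zip(row, column_mins)] for row in matrix]


cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
after_rows = reduce_rows(cost)
after_columns = reduce_columns(after_rows)
print(after_rows)     # [[3, 0, 2], [2, 0, 5], [1, 0, 0]]
print(after_columns)  # [[2, 0, 2], [1, 0, 5], [0, 0, 0]]
```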

Step 3: Cover the zeros

Cover all the zeros with as few lines as possible:
```txt
  1  2  3
1 2  0  2
2 1  0  5
3 0  0  0
```
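All the zeros can be covered with two lines (for example, column 2 and row 3). Two is less than the matrix dimension of 3, so the matrix has to be adjusted.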


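Step 4: Adjust the matrix

Find the smallest uncovered element (1), subtract it from every uncovered element, and add it to the element covered by both lines (row 3, column 2):

```txt
  1  2  3
1 1  0  1
2 0  0  4
3 0  1  0
```

Covering all the zeros now takes three lines, which equals the matrix dimension, so an optimal matching can be read off the zeros: row 1 with column 2, row 2 with column 1, and row 3 with column 3, for a total cost of 1 + 2 + 2 = 5 in the original matrix. In general, the cover and adjust steps are repeated until an optimal matching is found.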
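
The adjustment step can be sketched as follows. The function name `adjust` is hypothetical, and the covered row and column are passed in by hand (row 3 and column 2 of the example, written 0-based), since finding a minimum line cover is a separate sub-problem not shown here.

```python
def adjust(matrix, covered_rows, covered_cols):
    """Apply one adjustment step of the Hungarian method.

    Subtracts the smallest uncovered value from every uncovered element
    and adds it to every element covered by both a row and a column line.
    """
    delta = min(
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i not in covered_rows and j not in covered_cols
    )
    adjusted = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, value in enumerate(row):
            if i not in covered_rows and j not in covered_cols:
                value -= delta  # uncovered element
            elif i in covered_rows and j in covered_cols:
                value += delta  # element covered twice
            new_row.append(value)
        adjusted.append(new_row)
    return adjusted


reduced = [
    [2, 0, 2],
    [1, 0, 5],
    [0, 0, 0],
]
adjusted = adjust(reduced, covered_rows={2}, covered_cols={1})
print(adjusted)  # [[1, 0, 1], [0, 0, 4], [0, 1, 0]]
```

Elements covered by exactly one line are left unchanged, which is why the zeros in column 2 of rows 1 and 2 survive the adjustment.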


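## Applications

The Hungarian algorithm is widely used for assignment problems such as:

- Task assignment: assign tasks to workers so that the total cost is minimized.
- Student dormitory assignment: assign students to dormitories so that total satisfaction is maximized.
- Image processing: in multi-object tracking, assign detected objects to trackers.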