    You have 1 million text files, each a news article scraped from various news sites. Since news sites often report the same news, and sometimes even the same articles, many of the files have very similar content. In the language of your choice, write a program to filter out these files so that the end result contains only files that are sufficiently different from each other. You’re free to choose a metric to define the “similarity” of content between files.

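    - Read the text from each file and preprocess it:
    remove special characters and convert the text to lowercase.
    Convert the preprocessed text to TF-IDF vectors using
    TfidfVectorizer from the scikit-learn library.
    
    - Use the cosine_similarity function as the similarity measure:
    compute the cosine similarity matrix between the TF-IDF vectors (a similarity score for every pair of articles).
    
    - Define a similarity threshold and iterate over the similarity matrix to identify similar articles.
    - If the similarity score between two articles is at or above the threshold, treat them as similar and mark only one of them as unique.
    
    - Print the list of unique files by index.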
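
To make the similarity metric in the steps above concrete, here is a minimal standard-library sketch of cosine similarity on three made-up sentences. It is a simplification: it uses raw term counts instead of TF-IDF weights, and the `cosine` helper, the sample texts, and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Made-up sample texts: the first two are near-duplicates.
docs = [
    "stocks fall as rates rise",
    "stocks fall as rates rise again",
    "local team wins the cup final",
]
vectors = [Counter(d.split()) for d in docs]

threshold = 0.9  # assumed cutoff for "similar"
sim_01 = cosine(vectors[0], vectors[1])  # 5 / sqrt(5 * 6), roughly 0.913
sim_02 = cosine(vectors[0], vectors[2])  # no shared words, so 0.0

print(sim_01 >= threshold)  # the near-duplicate pair is flagged
print(sim_02 >= threshold)  # the unrelated pair is not
```

A score of 1.0 means the two vectors point in the same direction and 0.0 means the documents share no terms, which is why a threshold close to 1 is a natural choice for near-duplicate detection.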

In [4]:
import os
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def get_similar_articles(filePath):
    # Read and preprocess every file in the directory.
    # Note: preprocess_text must be defined before this function is called.
    files = os.listdir(filePath)
    num_files = len(files)
    texts = []
    for file in files:
        with open(os.path.join(filePath, file), 'r', encoding='utf-8') as f:
            texts.append(preprocess_text(f.read()))

    # Convert the preprocessed texts to TF-IDF vectors.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)

    # Calculate the cosine similarity matrix (dense, num_files x num_files).
    similarity_matrix = cosine_similarity(tfidf_matrix)

    # Filter out similar articles: keep the first file of each group and
    # mark every later file that is too similar to it as a duplicate.
    unique_files = []
    duplicate_files = set()
    similarity_threshold = 0.9

    for i in range(num_files):
        if i not in duplicate_files:
            unique_files.append(i)
            for j in range(i + 1, num_files):
                if similarity_matrix[i, j] >= similarity_threshold:
                    duplicate_files.add(j)

    print(f" Total files : {num_files}")
    print(f" unique files : {len(unique_files)}")

    for file_idx in unique_files:
        print(files[file_idx])
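
One practical caveat at the stated scale: `cosine_similarity(tfidf_matrix)` returns a dense n x n array. For 1 million files that is 10^12 entries, about 8 TB at 8 bytes per float64 value, so it will not fit in memory. A possible workaround, sketched here under assumptions, computes similarities one block of rows at a time. `filter_in_blocks` and `block_size` are hypothetical names that are not part of the original notebook, and the running time is still quadratic in the number of files, so this addresses memory only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_in_blocks(texts, threshold=0.9, block_size=100):
    """Greedy near-duplicate filter that never builds the full n x n matrix.

    Hypothetical helper: returns the indices of the texts that are kept.
    """
    tfidf = TfidfVectorizer().fit_transform(texts)  # sparse, shape (n, vocab)
    n = tfidf.shape[0]
    is_duplicate = [False] * n
    kept = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Similarities of this block's rows against all rows:
        # a dense array of shape (stop - start, n).
        block = cosine_similarity(tfidf[start:stop], tfidf)
        for offset in range(stop - start):
            i = start + offset
            if is_duplicate[i]:
                continue
            kept.append(i)
            row = block[offset]
            # Mark every later text that is too similar to text i.
            for j in range(i + 1, n):
                if row[j] >= threshold:
                    is_duplicate[j] = True
    return kept

# Made-up sample texts: the first two are identical.
sample_texts = [
    "stocks fall as rates rise today",
    "stocks fall as rates rise today",
    "local team wins the cup final",
]
kept_indices = filter_in_blocks(sample_texts, threshold=0.9, block_size=2)
print(kept_indices)
```

Peak memory for the similarity values is `block_size * n` floats instead of `n * n`, so `block_size` can be tuned to the machine at hand.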
    
    

In [5]:
def preprocess_text(text):
    text = re.sub('[^a-zA-Z0-9]+',' ',text)
    text = text.lower()
    return text
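
As a quick check of what the cleanup step does, the snippet below runs `preprocess_text` on a made-up headline. The function is repeated so the snippet runs on its own, and the sample string is an assumption for illustration.

```python
import re

def preprocess_text(text):
    # Same definition as above, repeated so this snippet is self-contained.
    text = re.sub('[^a-zA-Z0-9]+', ' ', text)
    text = text.lower()
    return text

sample = "Breaking: Stocks FALL 3% -- again!"  # made-up headline
cleaned = preprocess_text(sample)
print(repr(cleaned))  # 'breaking stocks fall 3 again '
```

Each run of non-alphanumeric characters collapses to a single space, so trailing punctuation leaves a trailing space. That is harmless here because `TfidfVectorizer` splits the text into word tokens itself. The pattern also removes every non-ASCII character, so this step as written is only suitable for English-language text.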