## Novelty Detection in News Project

### Introduction
This project aims to develop a novelty detection system applied to sports news. Novelty detection is a machine learning task that identifies new content by comparing it to previously known information.

### Task
The main task is to identify "novel" news articles, that is, articles that cover new aspects of an event. For each event, the news in the `source` folder serves as the reference against which the items in the `target` folder are evaluated, using novelty detection algorithms to detect significant changes in content (anomalies).
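To make the idea of comparing a target article against its source articles concrete, here is a minimal sketch that scores novelty as one minus the highest Jaccard word overlap with any source text. This is only an illustration: the function names `tokens` and `novelty_score` are hypothetical, and Jaccard overlap is a stand-in baseline, not necessarily the algorithm used in this project.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def novelty_score(target_text, source_texts):
    """Return 1 minus the highest Jaccard similarity to any source text."""
    target = tokens(target_text)
    best = 0.0
    for source_text in source_texts:
        source = tokens(source_text)
        union = target | source
        if union:
            # Jaccard similarity: shared words divided by all distinct words.
            best = max(best, len(target & source) / len(union))
    return 1.0 - best


# Hypothetical example texts, not taken from the corpus.
sources = ["Ramdev will wrestle an Olympic medallist in New Delhi"]
print(novelty_score("Ramdev will wrestle an Olympic medallist in New Delhi", sources))
print(novelty_score("The league announced a new broadcast schedule", sources))
```

A score near 0 means the target repeats a source article almost word for word, while a score near 1 means it shares very little vocabulary with any of them. A threshold on this score would be one simple way to label an article "novel" or "non-novel".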

### Dataset
We are using the **LREC2018 corpus** in the "SPORTS" category. The dataset is organized by events in subfolders, where each event contains:
- A `source` folder with three initial (seed) news articles.
- A `target` folder with additional news articles to be evaluated as "novel" or "non-novel."

Each article has a `.txt` file with the content and an accompanying `.xml` file containing metadata such as title, publication date, publisher, and other event-related information.
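The layout described above can be read with a few lines of standard-library code. The sketch below is not the project's `CorpusParser`; the function name `load_event` is hypothetical, and it assumes that the metadata is stored as XML attributes, which may not match the actual corpus files.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def load_event(event_dir):
    """Collect the articles of one event from its source and target folders."""
    articles = []
    for split in ("source", "target"):
        folder = Path(event_dir, split)
        if not folder.is_dir():
            continue
        for txt_path in sorted(folder.glob("*.txt")):
            xml_path = txt_path.with_suffix(".xml")
            meta = {}
            if xml_path.exists():
                # Assumption: metadata lives in element attributes.
                for element in ET.parse(xml_path).getroot().iter():
                    meta.update(element.attrib)
            articles.append({
                **meta,
                "news_id": txt_path.stem,
                "is_source": split == "source",
                "content": txt_path.read_text(encoding="utf-8", errors="replace"),
            })
    return articles
```

Calling `load_event` on one event subfolder returns a list of dictionaries, one per article, which can then be passed to `pandas.DataFrame` to obtain a table similar in spirit to `df_news`.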



-----

### Libs

In [1]:
from scripts.parser import CorpusParser 
import os

base_dir = os.getcwd()
corpus_dir = os.path.join(base_dir, 'database', 'TAP-DLND-1.0_LREC2018')

parser = CorpusParser(corpus_dir)
df_news = parser.parse()


In [2]:
from tabulate import tabulate
print(tabulate(df_news, headers='keys', tablefmt='psql'))


In [3]:
print(df_news.count())

event_id     96
news_id      96
content      96
is_source    96
DOP          96
publisher    96
title         6
eventid      96
eventname    96
topic        96
sentence     96
words        96
sourceid     90
DLA          90
SLNS         90
dtype: int64


In [4]:
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   event_id   96 non-null     string
 1   news_id    96 non-null     string
 2   content    96 non-null     string
 3   is_source  96 non-null     bool  
 4   DOP        96 non-null     string
 5   publisher  96 non-null     string
 6   title      6 non-null      string
 7   eventid    96 non-null     string
 8   eventname  96 non-null     string
 9   topic      96 non-null     string
 10  sentence   96 non-null     Int64 
 11  words      96 non-null     Int64 
 12  sourceid   90 non-null     string
 13  DLA        90 non-null     string
 14  SLNS       90 non-null     string
dtypes: Int64(2), bool(1), string(12)
memory usage: 10.9 KB


### Data Structure and News Text Preprocessing


In [5]:
from scripts.tokenize_and_normalize import tokenize_and_remove_punctuation, remove_stopwords

# Apply the text-processing functions
df_news['content_clean_tokenized'] = df_news['content'].apply(tokenize_and_remove_punctuation)
print(df_news['content_clean_tokenized'].iloc[0])

# Remove stopwords from the tokenized content
df_news['content_no_sw'] = df_news['content_clean_tokenized'].apply(remove_stopwords)
print(len(df_news['content_clean_tokenized'].iloc[0]))
print(len(df_news['content_no_sw'].iloc[0]))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mab0205/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/mab0205/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


First 10 stopwords: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
['Dangal', 'Baba', 'Ramdev', 'to', 'wrestle', 'it', 'out', 'with', 'Russian', 'Olympic', 'medallist', 'Dangal', 'Baba', 'Ramdev', 'to', 'wrestle', 'it', 'out', 'with', 'Russian', 'Olympic', 'medallist', 'Currently', 'reading', 'Dangal', 'Baba', 'Ramdev', 'to', 'wrestle', 'it', 'out', 'with', 'Russian', 'Olympic', 'medallist', 'Baba', 'Ramdev', 'Wrestling', 'Yoga', 'guru', 'Ramdev', 'will', 'challenge', 'the', 'Olympic', 'for', 'a', 'friendly', 'wrestling', 'bout', 'ahead', 'of', 'the', 'second', 'semifinal', 'match', 'between', 'Mumbai', 'Maharathi', 'and', 'NCR', 'Punjab', 'Royals', 'in', 'the', 'Pro', 'Wrestling', 'League', 'The', 'eye', 'turning', 'match', 'is', 'scheduled', 'at', 'pm', 'today', 'in', 'New', 'Delhi', 'Indira', 'Gandhi', 'Indoor', 'Stadium', 'I', 'have', 'fought', 'bouts', 'with', 'national', 'level', 'wrestlers', 'But', 'playing', 'against', 'an', 'internationally', 'r

In [6]:

# Content with stopwords and punctuation removed
df_news['content_no_sw'] = df_news['content_clean_tokenized'].apply(remove_stopwords)

# Token counts of the first article before and after stopword removal
print(len(df_news['content_clean_tokenized'].iloc[0]))
print(len(df_news['content_no_sw'].iloc[0]))

244
147


After tokenization and punctuation removal, the first article contains 244 tokens; removing stopwords reduces it to 147.
The removed stopwords come from the NLTK stopword list, whose first ten entries are `['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]`.
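The bodies of `tokenize_and_remove_punctuation` and `remove_stopwords` are not shown in this notebook. As a rough illustration of what such steps do, the sketch below uses only the standard library: the names `tokenize_and_strip` and `drop_stopwords` are hypothetical, the regular expression is an assumption about the tokenization, and `STOPWORDS` is a small hand-picked subset rather than the full NLTK list that the project loads.

```python
import re

# Illustrative subset only; the notebook loads NLTK's full stopword list.
STOPWORDS = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
             "to", "it", "out", "with", "the", "a", "of"}


def tokenize_and_strip(text):
    """Split text into alphabetic tokens, dropping punctuation and digits."""
    return re.findall(r"[A-Za-z]+", text)


def drop_stopwords(tokens):
    """Remove tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = tokenize_and_strip("Dangal: Baba Ramdev to wrestle it out with Russian Olympic medallist")
filtered = drop_stopwords(tokens)
print(len(tokens), len(filtered))  # prints "11 7"
```

Matching on the lowercase form keeps the original casing of the remaining tokens, so capitalized stopwords such as "The" are still removed.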

### Exploratory Analysis