# News Article Clustering

Data Collection: Gather a large dataset of news articles through web scraping, news APIs, or existing datasets.

Text Preprocessing:

- Tokenization: Split the text into words or tokens.
- Stopword Removal: Remove common words that carry little meaning (such as "and" and "the").
- Stemming/Lemmatization: Reduce words to their base or root form.
- Removing Punctuation and Special Characters: Clean the text so that only alphanumeric characters remain.
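
The preprocessing steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code used by `NewsCluster`: the stopword list is a tiny hand-picked assumption, and `naive_stem` is a hypothetical stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; real pipelines use a fuller list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}


def naive_stem(token):
    """Very crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, replace non-alphanumerics with spaces, tokenize, drop stopwords, stem."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("Oil prices are rising in the markets, and traders reacted!")
print(tokens)
```

The crude stemmer produces truncated forms such as `pric` and `ris`, which is typical of stemming; lemmatization would return dictionary forms instead.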

Feature Extraction:

- Convert the text into numerical form using TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (such as Word2Vec or GloVe).
- TF-IDF weights each word by how important it is to a document relative to the whole corpus, while word embeddings capture semantic similarity between words.
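
To make the TF-IDF idea concrete, here is a from-scratch sketch on already tokenized documents. It assumes the smoothed IDF variant `log((1 + N) / (1 + df)) + 1`, which is one common choice among several; the toy documents are invented for illustration, and a library vectorizer would normally be used in practice.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        # Term frequency (relative count) times IDF, in vocabulary order.
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Toy corpus, invented for illustration.
docs = [
    ["oil", "price", "rise"],
    ["oil", "supply", "cut"],
    ["football", "match", "result"],
]
vocab, vectors = tfidf(docs)
```

In the first document, "oil" receives a lower weight than "price" because "oil" also appears in the second document and is therefore less distinctive.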

Dimensionality Reduction (optional, but recommended for large datasets):

- PCA (Principal Component Analysis) can reduce the number of features while retaining most of the variance in the data. t-SNE is mainly useful for projecting the data to two or three dimensions for visualization.
- This step lowers the computational cost and can improve clustering quality by removing noisy dimensions.
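
As a rough illustration of PCA, the sketch below centers a feature matrix and projects it onto its top principal directions using NumPy's SVD. The random matrix is only a stand-in for real document features, and `pca_reduce` is a hypothetical helper name, not part of this project.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))  # stand-in for 20 documents x 10 features
X_reduced = pca_reduce(X, n_components=2)
print(X_reduced.shape)
```

For very large, sparse TF-IDF matrices, centering makes the matrix dense, so a truncated SVD without centering is often preferred.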

Clustering:

- Apply a clustering algorithm such as K-means, DBSCAN, or hierarchical clustering to the processed text data.
- The algorithm groups articles into clusters based on the similarity of their content.
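
The K-means idea can be sketched without any library. The version below is a simplified illustration, not the algorithm used inside `NewsCluster`: it assumes Euclidean distance and swaps the usual random initialization for a deterministic farthest-first initialization so the result is reproducible. The 2-D points are invented stand-ins for document vectors.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-first initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for document vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

K-means requires the number of clusters up front, whereas DBSCAN infers it from density and can mark outliers as noise, which may matter for news data with many off-topic articles.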

NLP Techniques for Improved Clustering:

- Topic Modeling: Techniques like LDA (Latent Dirichlet Allocation) identify topics within the articles, which can guide or enhance the clustering process.
- Named Entity Recognition (NER): Identifying and classifying key entities (such as people, organizations, and locations) provides additional features for clustering.
- Sentiment Analysis: The sentiment of an article can help separate articles with similar content but different tones.

In [None]:
import sys

import pandas as pd  # used below for pd.concat

from cluster import *

# Import main utility functions
sys.path.insert(0, r'c:\Users\joneh\master_thesis\src')
from main_utils import *

### Create clusters

In [None]:
df_NYT = load_df('news', 'NYT_CrudeANDOil.csv')
df_TG = load_df('news', 'TG_CrudeANDOil.csv')

NYT_cluster = NewsCluster(df_NYT)
TG_cluster = NewsCluster(df_TG)

### Plot clusters

In [None]:
fig1, ax1 = NYT_cluster.plot_clusters()
fig2, ax2 = TG_cluster.plot_clusters()

fig1.savefig('images/NYT_clusters.png')
fig2.savefig('images/TG_clusters.png')

In [None]:
NYT_cluster.print_clusters()
TG_cluster.print_clusters()

### Remove unwanted clusters

In [None]:
# Cluster IDs to discard from each source
TG_remove = [7, 8, 9, 10]
TG_cleaned = TG_cluster.remove_cluster(TG_remove)

NYT_remove = [1, 2, 3, 4, 5]
NYT_cleaned = NYT_cluster.remove_cluster(NYT_remove)
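
The internals of `remove_cluster` are not shown in this notebook. Assuming it filters rows on the `cluster` column that is dropped in the next step, a hypothetical equivalent in plain pandas might look like the sketch below; `drop_clusters` and the demo data are invented for illustration.

```python
import pandas as pd


def drop_clusters(df, cluster_ids):
    """Return a copy of df without rows whose 'cluster' label is in cluster_ids."""
    return df[~df["cluster"].isin(cluster_ids)].copy()


# Invented demo data; the real dataframes come from load_df.
demo = pd.DataFrame(
    {
        "headline": ["Oil rallies", "Cooking with olive oil", "OPEC cuts output", "Oil painting show"],
        "cluster": [1, 7, 2, 8],
    }
)
kept = drop_clusters(demo, [7, 8])
print(kept)
```

Returning a copy means that adding a column such as `source` afterwards does not modify, or warn about modifying, a view of the original dataframe.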

### Combine and save news dataframe

In [None]:
# Tag each article with its source
NYT_cleaned['source'] = 'NYT'

TG_cleaned['source'] = 'TG'

combined_df = pd.concat([NYT_cleaned, TG_cleaned])

# drop cluster column
combined_df = combined_df.drop(columns=['cluster'])

# Enter filename here:
file_name = 'CombinedArchive.csv'
# Enter relative path for saving the file:
relative_path = 'data/news'

combined_df.to_csv(save_path(relative_path, file_name), index=True)