# News Article Clustering

Data Collection: Gather a large dataset of news articles, for example through web scraping, news APIs, or existing datasets.

Text Preprocessing:

- Tokenization: Split the text into words or tokens.
- Stopword Removal: Remove common words that carry little meaning (such as "and" and "the").
- Stemming/Lemmatization: Reduce words to their base or root form.
- Removing Punctuation and Special Characters: Clean the text so that only alphanumeric characters remain.

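
The preprocessing steps above can be sketched with the standard library alone. The stopword set and the `crude_stem` helper are illustrative assumptions, not part of this notebook's `cluster` module; a real pipeline would more likely use a full stopword list and a proper stemmer or lemmatizer from NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (an assumption; real projects usually
# rely on a fuller list from an NLP library).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "as", "is", "are"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, drop non-alphanumerics, tokenize, remove stopwords, stem."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # keep alphanumerics only
    tokens = text.split()                              # whitespace tokenization
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("Oil prices surged as OPEC announced cuts!"))
# ['oil', 'pric', 'surg', 'opec', 'announc', 'cut']
```

The truncated stems ("pric", "announc") show why lemmatization is often preferred when the tokens need to stay readable.
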

Feature Extraction:

Convert the text into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (such as Word2Vec or GloVe). TF-IDF weights each term by how distinctive it is for a document, while embeddings place semantically similar words close together in vector space.

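
To make the TF-IDF weighting concrete, here is a minimal sketch of the plain textbook formula, tf × log(N / df), on tokenized documents. The `tfidf` function and the sample documents are assumptions made for illustration; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # tf = relative frequency in the document, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already-preprocessed headlines.
docs = [
    ["oil", "price", "rise"],
    ["oil", "price", "fall"],
    ["gas", "pipeline", "deal"],
]
for vec in tfidf(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in only one document ("rise", "fall") receive a higher weight than terms shared across documents ("oil", "price"), and a term present in every document would receive a weight of zero.
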

Dimensionality Reduction (optional, but recommended for large datasets):

Techniques such as PCA (Principal Component Analysis), or truncated SVD for sparse TF-IDF matrices, reduce the number of features while retaining most of the variance. t-SNE can also be used, although it is mainly suited to visualizing clusters in two or three dimensions. Reducing dimensionality lowers computational cost and often improves clustering quality.

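
As a rough illustration of the reduction step, PCA can be written in a few lines with NumPy's SVD. The `pca` helper and the random matrix standing in for document features are assumptions for this sketch; for sparse TF-IDF matrices a truncated SVD is usually the more practical choice, because centering would make the matrix dense.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA operates on centered features
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # stand-in for 100 documents x 50 features
X_reduced = pca(X, n_components=5)
print(X_reduced.shape)  # (100, 5)
```
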

Clustering:

Apply a clustering algorithm such as K-means, DBSCAN, or hierarchical clustering to the feature vectors. These algorithms group articles into clusters based on the similarity of their content.

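
For intuition about what K-means does with the feature vectors, the sketch below implements Lloyd's algorithm in plain Python. It is not the method used by `NewsCluster`, whose internals are not shown here; the `kmeans` and `sq_dist` helpers and the toy points are assumptions, and in practice a library implementation would normally be used.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c])) for pt in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for article vectors.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three the other; which group is numbered 0 depends on the random initialization.
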

NLP Techniques for Improved Clustering:

- Topic Modeling: Techniques like LDA (Latent Dirichlet Allocation) identify topics within the articles, which can guide or enhance the clustering process.
- Named Entity Recognition (NER): Identifying and classifying key entities (such as people, organizations, and locations) provides additional features for clustering.
- Sentiment Analysis: The sentiment of an article can help separate articles with similar content but different tones.

In [None]:
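import sys
import warnings

import pandas as pd  # used below for pd.concat; imported explicitly rather than via the star imports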

from cluster import *

warnings.filterwarnings("ignore")

# Import main utility functions
sys.path.insert(0, r'c:\Users\joneh\master_thesis\src')
from main_utils import *
from db_utils import *

### Load news data

In [None]:
# load news from database
df = news_db_load('news')
display(df)

query_source = (df['source'] + '_' + df['query']).unique()
print(query_source)

### Check for NaN and Duplicates

In [None]:

for qs in query_source:
    tags = qs.split('_')
    query_string = 'source == @tags[0] and query == @tags[1]'
    # select the rows for this source/query pair once and reuse them
    subset = df.query(query_string)
    print(f'{qs.replace("AND", " ")}:')

    # check for NaN values
    nNaN = subset['headline'].isna().sum()
    print(f'NaNs: {nNaN}')

    # check for duplicates
    nDuplicates = subset['headline'].duplicated().sum()
    print(f'Duplicates: {nDuplicates}')

    print()

# drop rows without a headline (duplicates are only reported above, not removed)
df = df.dropna(subset=['headline'])

display(df)



### Create and plot clusters

In [None]:
clusters = {}

for qs in query_source:
    tags = qs.split('_')
    query_string = 'source == @tags[0] and query == @tags[1]'

    df_query = df.query(query_string)

    clusters[qs] = NewsCluster(df_query, qs)

    fig, ax = clusters[qs].plot_clusters()

    fig.savefig(f'images/{qs}_clusters.png')


In [None]:
for tag, cluster in clusters.items():
    cluster.print_clusters()


### Remove unwanted clusters

In [None]:
# cluster labels to remove, keyed by '<source>_<query>' tag
remove_clusters = {
    'NYT_CrudeANDOil': [10, 11, 12],
    'TG_CrudeANDOil': [7, 8, 9, 10],
    'TG_NaturalANDGas': [],
}

cleaned_dfs = {}

for tag, cluster in clusters.items():
    print(tag)
    # fall back to removing nothing if a tag has no entry in remove_clusters
    cleaned = cluster.remove_cluster(remove_clusters.get(tag, []))
    cleaned_dfs[tag] = cleaned
    print()


### Combine and save news dataframe

In [None]:

combined_df = pd.concat(cleaned_dfs.values())

# drop cluster column
combined_df = combined_df.drop(columns=['cluster'])

# Enter filename here:
file_name = 'CombinedArchive.csv'
# Enter relative path for saving the file:
relative_path = 'data/news'

combined_df.to_csv(save_path(relative_path, file_name), index=True)

news_db_commit(combined_df, 'news_filtered')
db_info()