# Basic Dictionary Classification

In the following, I attempt to distinguish between hard news (i.e. news focused on current events, politics, public affairs etc.) and soft news (i.e. news focused on lifestyle, entertainment, human interest stories etc.) using a simple dictionary method. I define two dictionaries, one for hard and one for soft news, and use them to classify the articles in the dataset. The dictionaries are based on a brief qualitative immersion phase in which I pick out some of the words that may signal soft vs. hard news. I then use keyword expansion with word2vec to broaden their scope. There are four classification categories: 'soft news' means the article contains more words from the soft news dictionary than from the hard news dictionary, 'hard news' means it contains more words from the hard news dictionary, 'mixed' means it contains an equal, non-zero number of words from both dictionaries, and 'none' means it contains no words from either dictionary.
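
As a toy illustration of the four-way rule described above, the sketch below applies it to a few made-up headlines. The mini dictionaries, the token lists and the function name `label_by_hits` are invented for this example and are not part of the analysis itself.

```python
def label_by_hits(tokens, soft_set, hard_set):
    """Label a token list by comparing dictionary hit counts."""
    soft_hits = sum(1 for w in tokens if w in soft_set)
    hard_hits = sum(1 for w in tokens if w in hard_set)
    if soft_hits > hard_hits:
        return 'soft'
    if hard_hits > soft_hits:
        return 'hard'
    # Equal counts: 'mixed' only if at least one word was found
    return 'mixed' if soft_hits > 0 else 'none'

# Hypothetical mini dictionaries, for illustration only
toy_soft = {'film', 'music', 'football'}
toy_hard = {'war', 'election', 'tariffs'}

toy_articles = [
    ['new', 'film', 'wins', 'music', 'award'],       # 2 soft, 0 hard
    ['tariffs', 'rise', 'as', 'war', 'continues'],   # 0 soft, 2 hard
    ['football', 'match', 'halted', 'by', 'war'],    # 1 soft, 1 hard
    ['weather', 'stays', 'mild'],                    # no hits
]

toy_labels = [label_by_hits(t, toy_soft, toy_hard) for t in toy_articles]
print(toy_labels)  # ['soft', 'hard', 'mixed', 'none']
```

Note that the rule only compares counts, so a single extra hit on one side is enough to decide the label.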

In [11]:
# Import libraries
import pandas as pd
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
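import nltk

# word_tokenize relies on NLTK's Punkt tokenizer data; download it once if missing
# (recent NLTK releases look for 'punkt_tab' rather than 'punkt')
nltk.download('punkt', quiet=True)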

# Read the data
df = pd.read_csv('bbc_news.csv')  

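# Define initial keyword dictionaries
# Caveat: matching below is done on lowercased single-word tokens, so multi-word entries
# ('world cup', 'sci fi', 'climate change') and uppercase entries ('DJ', 'EU') never match as written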
soft_news = {'gucci', 'disney', 'world cup', 'celebrity', 'football', 'arsenal', 'liverpool', 'wsl', 'music',
             'song', 'fashion', 'squad', 'dance', 'artist', 'coffee', 'f1', 'royal', 'star', 'film', 'travel',
             'novel', 'book', 'author', 'chef', 'lifestyle', 'healthy', 'actor', 'surfing', 'athlete', 'paralympic',
             'medal', 'league', 'sci fi', 'DJ'}

hard_news = {'war', 'ukraine', 'covid', 'restrictions', 'tariffs', 'politics', 'zelensky', 'tory', 'russia',
             'murder', 'civilians', 'died', 'assault', 'killed', 'putin', 'tornado', 'climate change', 'emissions',
             'elections', 'business', 'prisoner', 'arrest', 'farmer', 'crash', 'supply', 'abortion', 'farming',
             'EU', 'terrorism'}

# Combine 'title' and 'description' into a single text field
df['full_text'] = df['title'].astype(str) + " " + df['description'].astype(str)

# Tokenize and lowercase the combined text
df['tokens'] = df['full_text'].apply(lambda x: [w.lower() for w in word_tokenize(x)])

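# Train Word2Vec model on the tokenized title + description text
# sg=1 (skip-gram), min_count=1 and window=2 are chosen because the texts are short.
# seed=42 fixes the initialisation; per the gensim docs, fully reproducible runs also need
# workers=1 (and a fixed PYTHONHASHSEED across interpreter launches).
model = Word2Vec(sentences=df['tokens'], vector_size=100, window=2,
                 min_count=1, sg=1, epochs=5, seed=42)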

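# Function to expand keywords
def expand_keywords_w2v(model, seed_words, topn=100, threshold=0.6):
    """Return the seed words plus every word among each seed's `topn` nearest
    neighbours whose cosine similarity is at least `threshold`."""
    expanded = set(seed_words)
    for word in seed_words:
        try:
            similar = model.wv.most_similar(word, topn=topn)
        except KeyError:
            # Seed word is not in the model vocabulary (e.g. multi-word or uppercase entries)
            continue
        for sim_word, score in similar:
            if score >= threshold:
                expanded.add(sim_word.lower())
    return expanded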

# Expand both categories
soft_news_expanded = expand_keywords_w2v(model, soft_news)
hard_news_expanded = expand_keywords_w2v(model, hard_news)

# Classification of Articles
def classify_article(tokens, soft_set, hard_set):
    soft_hits = sum(1 for w in tokens if w in soft_set)
    hard_hits = sum(1 for w in tokens if w in hard_set)
    if soft_hits > hard_hits:
        return 'soft'
    elif hard_hits > soft_hits:
        return 'hard'
    elif soft_hits == hard_hits and soft_hits > 0:
        return 'mixed'
    else:
        return 'none'

# Apply classification
df['news_type'] = df['tokens'].apply(lambda x: classify_article(x, soft_news_expanded, hard_news_expanded))

# Proportion of articles in each category
proportion = df['news_type'].value_counts(normalize=True).round(3)
print("News Type Proportions:\n", proportion)


News Type Proportions:
 news_type
hard     0.500
soft     0.389
mixed    0.078
none     0.034
Name: proportion, dtype: float64
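
Because the classification above compares lowercased single tokens, dictionary entries such as 'world cup', 'climate change', 'DJ' or 'EU' cannot contribute hits. One possible way to let them count is sketched below: each keyword is lowercased, split into words and matched as a consecutive run of tokens. The helper name `count_keyword_hits` and the toy inputs are hypothetical; this is not the method used for the proportions reported above, and applying it would change them.

```python
def count_keyword_hits(tokens, keywords):
    """Count occurrences of single- and multi-word keywords in a token list.

    Keywords and tokens are both lowercased, and a multi-word keyword
    matches only when its words appear consecutively in the tokens.
    """
    tokens = [t.lower() for t in tokens]
    hits = 0
    for kw in keywords:
        parts = kw.lower().split()
        if not parts:
            continue  # skip empty keywords
        n = len(parts)
        # Slide a window of length n over the tokens and compare
        hits += sum(1 for i in range(len(tokens) - n + 1)
                    if tokens[i:i + n] == parts)
    return hits

# Toy example (made-up tokens, not taken from the dataset)
toy_tokens = ['the', 'world', 'cup', 'final', 'in', 'the', 'eu']
toy_hits = count_keyword_hits(toy_tokens, {'world cup', 'EU', 'DJ'})
print(toy_hits)  # 2: one hit for 'world cup', one for 'EU'
```

A sketch like this could stand in for the `w in soft_set` checks in `classify_article`, at the cost of one pass over the tokens per keyword.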
