## Categorical Feature Transformation

- One-hot encoding for categorical variables
- Collapse low-frequency section names into 'Other'
- Save the transformed dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
merged_df = pd.read_csv('cleaned_data.csv')

### 1. One-hot encoding for categorical variables

In [2]:
# Display original distribution
print("Distribution of article types before encoding:")
print(merged_df['type'].value_counts())
print("\nPercentage distribution:")
print((merged_df['type'].value_counts() / len(merged_df) * 100).round(2), "%")

# One-hot encode 'type'; drop_first=True drops the first category ('article') as the reference level
type_encoded = pd.get_dummies(merged_df['type'], prefix='type', drop_first=True)

print("\nShape of encoded matrix:", type_encoded.shape)
print("\nEncoded columns (reference category excluded):")
print(type_encoded.columns.tolist())

print("\nCount of each encoded type:")
for col in type_encoded.columns:
    print(f"{col}: {type_encoded[col].sum()}")

# Add encoded columns to dataframe
merged_df = pd.concat([merged_df, type_encoded], axis=1)

Distribution of article types before encoding:
type
article        1398
liveblog        259
interactive       2
Name: count, dtype: int64

Percentage distribution:
type
article        84.27
liveblog       15.61
interactive     0.12
Name: count, dtype: float64 %

Shape of encoded matrix: (1659, 2)

Encoded columns (reference category excluded):
['type_interactive', 'type_liveblog']

Count of each encoded type:
type_interactive: 2
type_liveblog: 259
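
With `drop_first=True`, `pd.get_dummies` omits the first category in sorted order (here `article`), so a row with no indicator set belongs to that reference category. A minimal sketch of this behaviour, using a made-up five-row Series (the toy values and the names `toy_types`, `toy_encoded` and `is_reference` are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy data (hypothetical); the category names mirror those in the 'type' column
toy_types = pd.Series(
    ["article", "liveblog", "article", "interactive", "article"], name="type"
)

# drop_first=True drops the first category in sorted order ('article')
toy_encoded = pd.get_dummies(toy_types, prefix="type", drop_first=True)

# Rows with no indicator set belong to the dropped reference category
is_reference = ~toy_encoded.any(axis=1)

print(toy_encoded.columns.tolist())
print(is_reference.tolist())
```

Dropping one level avoids perfectly collinear indicator columns, which matters mainly for linear models; tree-based models can usually take the full set of indicators.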


### 2. Collapse low-frequency section names into 'Other'

In [3]:
# Calculate section frequencies and threshold
section_counts = merged_df['sectionName'].value_counts()
threshold_percent = 5
threshold = len(merged_df) * threshold_percent / 100

# Show sections to be collapsed
print("\nOriginal number of sections:", len(section_counts))
print(f"\nSections to be collapsed (less than {threshold_percent}%):")
for section, count in section_counts[section_counts < threshold].items():
    print(f"{section}: {count} articles ({(count/len(merged_df)*100):.1f}%)")

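# Replace low-frequency sections with 'Other' (missing values are left unchanged)
low_freq_sections = section_counts[section_counts < threshold].index
merged_df['sectionName'] = merged_df['sectionName'].where(
    ~merged_df['sectionName'].isin(low_freq_sections), 'Other'
)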


Original number of sections: 38

Sections to be collapsed (less than 5%):
Music: 80 articles (4.8%)
Sport: 64 articles (3.9%)
Film: 61 articles (3.7%)
Politics: 59 articles (3.6%)
Television & radio: 53 articles (3.2%)
Football: 51 articles (3.1%)
Books: 49 articles (3.0%)
Society: 48 articles (2.9%)
Opinion: 45 articles (2.7%)
Life and style: 44 articles (2.7%)
Art and design: 38 articles (2.3%)
Environment: 31 articles (1.9%)
Education: 26 articles (1.6%)
Games: 25 articles (1.5%)
Stage: 25 articles (1.5%)
Media: 23 articles (1.4%)
Culture: 22 articles (1.3%)
Technology: 17 articles (1.0%)
Fashion: 17 articles (1.0%)
Science: 16 articles (1.0%)
Money: 13 articles (0.8%)
Global development: 13 articles (0.8%)
Wellness: 9 articles (0.5%)
Food: 8 articles (0.5%)
Stand Out By Design: 6 articles (0.4%)
Travel: 5 articles (0.3%)
Crosswords: 5 articles (0.3%)
Law: 3 articles (0.2%)
GNM press office: 3 articles (0.2%)
News: 2 articles (0.1%)
Visit The USA: The United States of Adventure: 1 articles (0.1%)
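
The collapsing step can also be packaged as a small reusable helper. The sketch below is one possible formulation, not the notebook's own code: the function name `collapse_rare` and its parameters `min_share` and `other` are hypothetical, and the toy Series is invented for illustration. Shares are computed against the total row count, matching the threshold definition used above.

```python
import pandas as pd


def collapse_rare(values: pd.Series, min_share: float, other: str = "Other") -> pd.Series:
    """Replace categories whose share of all rows is below min_share with `other`."""
    # Share of each category relative to the total number of rows
    shares = values.value_counts() / len(values)
    rare = shares[shares < min_share].index
    # where() keeps entries for which the condition is True; missing values stay missing
    return values.where(~values.isin(rare), other)


# Toy data (hypothetical): 'Law' makes up 10% of rows and falls below the 20% cut-off
toy_sections = pd.Series(["Business"] * 6 + ["UK news"] * 3 + ["Law"])
collapsed = collapse_rare(toy_sections, min_share=0.2)
print(collapsed.value_counts().to_dict())
```

Because the helper returns a new Series, the original column stays available for comparison until the result is assigned back.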

In [4]:
# Show final distribution
final_counts = merged_df['sectionName'].value_counts()
print("\nFinal section distribution:")
print(final_counts)
print("\nFinal number of sections:", len(final_counts))

# Display final dataset info
print("\nFinal dataset shape:", merged_df.shape)
print("\nColumns in transformed dataset:")
for col in merged_df.columns:
    print(f"- {col}")


Final section distribution:
sectionName
Other             864
Business          301
UK news           175
World news        125
US news            98
Australia news     96
Name: count, dtype: int64

Final number of sections: 6

Final dataset shape: (1659, 28)

Columns in transformed dataset:
- Unnamed: 0
- X
- Open
- High
- Low
- Close
- Volume
- Dividends
- Stock.Splits
- Date
- id
- type
- sectionName
- publicationDate
- webTitle
- webUrl
- headline
- body
- sentiment_score
- sentiment_label
- vader_score
- vader_label
- word_count
- month
- day
- volatility
- type_interactive
- type_liveblog
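
As a quick sanity check on the output above, the six final section counts should add up to the 1659 rows reported in the dataset shape, which would confirm that collapsing relabelled rows rather than dropping them. The sketch below re-enters the printed counts by hand, so it is only as accurate as that transcription; the names `final_counts_printed`, `n_rows` and `shares` are illustrative.

```python
# Counts copied from the "Final section distribution" output above
final_counts_printed = {
    "Other": 864,
    "Business": 301,
    "UK news": 175,
    "World news": 125,
    "US news": 98,
    "Australia news": 96,
}
n_rows = 1659  # first element of the reported dataset shape

# The category counts should account for every row
total = sum(final_counts_printed.values())
print("Counts sum to row total:", total == n_rows)

# Percentage share of each final category, rounded to one decimal place
shares = {name: round(100 * count / n_rows, 1) for name, count in final_counts_printed.items()}
print(shares)
```

With a 5% threshold, 'Other' becomes the largest category, so a lower threshold may be worth trying if section-level detail matters downstream.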


In [5]:
# Save the transformed dataset
output_filename = 'final_dataset.csv'
merged_df.to_csv(output_filename, index=False)
print(f"\nTransformed dataset saved to: {output_filename}")


Transformed dataset saved to: final_dataset.csv
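
Passing `index=False` to `to_csv` keeps the row index out of the file. The `Unnamed: 0` column in the column list above is the kind of column `read_csv` produces when a file was written with its index, although whether that is its origin here is an assumption. A minimal sketch of the difference, using an in-memory buffer and a made-up two-row frame instead of the real file:

```python
import io

import pandas as pd

# Toy frame (hypothetical) standing in for the transformed dataset
toy_df = pd.DataFrame(
    {"sectionName": ["Business", "Other"], "type_liveblog": [False, True]}
)

# Written without the index, the file round-trips with the same columns
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Written with the default index=True, an extra unnamed column appears on reload
buffer_with_index = io.StringIO()
toy_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)
print(reloaded_with_index.columns.tolist())
```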
