# Notebook 0. About the Dataset

## 1. Description

The Twitter Sentiment Analysis dataset combines annotations from two sentiment analysis approaches: RoBERTa (a transformer-based NLP model) and VADER (a lexicon-based sentiment analyzer).

Each entry in the dataset represents a tweet, along with its sentiment label, which is typically classified as positive, neutral, or negative. Additional metadata, such as tweet ID, user, and date, may also be included depending on the CSV version.

This dataset is useful for training, evaluating, and demonstrating sentiment analysis models. It is small enough to handle easily in experiments and demonstrations, yet diverse enough to reflect real-world variation in tweets.

## 2. Load the dataset



In [6]:
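import pandas as pd
from sklearn.model_selection import train_test_split

file_path = '../data/tweets.csv'
# Sort by timestamp so that the splits in the next section follow time order.
# read_csv does not parse dates by default, so if 'Created At' holds date strings,
# the order is chronological only when the format also sorts correctly as text (e.g. ISO 8601).
df = pd.read_csv(file_path).sort_values(by='Created At')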
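
If the timestamps are not stored in a format that sorts correctly as text, one option is to convert the column to datetimes before sorting. The following is a minimal sketch under that assumption. The helper name `load_sorted_by_time` and the sample rows are hypothetical, and the actual format of `Created At` in `tweets.csv` is not known here.

```python
import io

import pandas as pd


def load_sorted_by_time(csv_source, time_column="Created At"):
    """Read a CSV and return its rows in chronological order (hypothetical helper)."""
    df = pd.read_csv(csv_source)
    # Convert to datetime so the sort is chronological rather than lexicographic.
    # Values that cannot be parsed become NaT, which sort_values places last.
    df[time_column] = pd.to_datetime(df[time_column], errors="coerce")
    return df.sort_values(by=time_column, kind="stable").reset_index(drop=True)


# Made-up sample: as plain strings, "01/01/2021" would sort before "12/31/2020".
sample_csv = io.StringIO(
    "Created At,Text\n"
    "01/01/2021 09:00,newer\n"
    "12/31/2020 23:00,older\n"
)
sample_df = load_sorted_by_time(sample_csv)
print(sample_df["Text"].tolist())
```

Note that after this conversion `to_csv` writes the timestamps in pandas' default datetime format, which may differ from the format in the source file.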

## 3. Split dataset

In [None]:

# Split into main (first 80%) and updates (last 20%) sets.
# shuffle=False preserves the sorted order, so random_state has no effect and is omitted.
main_df, updates_df = train_test_split(df, test_size=0.2, shuffle=False)

# Split the updates set into two equal halves (10% of the full dataset each)
update1_df, update2_df = train_test_split(updates_df, test_size=0.5, shuffle=False)


print("\nMain set shape:", main_df.shape)
print("Update1 set shape:", update1_df.shape)
print("Update2 set shape:", update2_df.shape)


Main set shape: (800, 7)
Update1 set shape: (100, 7)
Update2 set shape: (100, 7)
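
Because the split is unshuffled, it amounts to cutting the sorted rows at fixed positions. The sketch below reproduces the same 80/10/10 cut with plain positional slicing on made-up data and checks that each block ends before the next one begins. The helper `chronological_split` and the demo frame are hypothetical and stand in for the real dataset.

```python
import pandas as pd


def chronological_split(df, fractions):
    """Cut an already time-sorted DataFrame into consecutive blocks by row position."""
    bounds = [0]
    for fraction in fractions[:-1]:
        bounds.append(bounds[-1] + int(round(fraction * len(df))))
    bounds.append(len(df))  # the last block takes whatever rows remain
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


# Made-up data: 1000 rows with strictly increasing timestamps.
demo_df = pd.DataFrame({"Created At": pd.date_range("2021-01-01", periods=1000, freq="D")})
main_part, first_update, second_update = chronological_split(demo_df, [0.8, 0.1, 0.1])

# Each block should end strictly before the next one begins.
assert main_part["Created At"].max() < first_update["Created At"].min()
assert first_update["Created At"].max() < second_update["Created At"].min()

print(len(main_part), len(first_update), len(second_update))
```

A check like this is only meaningful when the time column holds real datetimes. On string timestamps the comparison would be lexicographic.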


## 4. Save splits

In [12]:

main_df.to_csv('../data/main.csv', index=False)
update1_df.to_csv('../data/update1.csv', index=False)
update2_df.to_csv('../data/update2.csv', index=False)

print("Main/updates CSV files saved.")


Main/updates CSV files saved.
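
To confirm that the files were written as intended, the saved CSVs can be read back and compared against the in-memory frames. The following is a minimal sketch on made-up data in a temporary directory. The helper `save_and_verify` is hypothetical, and only the shape is compared, not the cell values.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(frames, out_dir):
    """Write each DataFrame to <out_dir>/<name>.csv and check it reloads with the same shape."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shapes = {}
    for name, frame in frames.items():
        path = out_dir / f"{name}.csv"
        frame.to_csv(path, index=False)  # index=False keeps the column count unchanged
        reloaded = pd.read_csv(path)
        if reloaded.shape != frame.shape:
            raise ValueError(f"{path} reloaded as {reloaded.shape}, expected {frame.shape}")
        shapes[name] = reloaded.shape
    return shapes


# Made-up rows standing in for the real splits.
demo = pd.DataFrame({"Text": ["a", "b", "c"], "Label": ["positive", "neutral", "negative"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    shapes = save_and_verify({"main": demo.iloc[:2], "update1": demo.iloc[2:]}, tmp_dir)
print(shapes)
```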
