# Twitter Sentiment Data Exploration

This notebook explores the Twitter Sentiment dataset ([Kaggle](https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis)), with the goal of understanding the data and how it relates to the sentiment expressed by users.

To start, we load the required libraries and the data. The dataset already ships with separate training and validation files.

In [23]:
import pandas as pd

import kagglehub

path = kagglehub.dataset_download("jp797498e/twitter-entity-sentiment-analysis")
training_path = f"{path}/twitter_training.csv"
validation_path = f"{path}/twitter_validation.csv"

df_train = pd.read_csv(
    training_path,
    header=None,
    names=['Tweet ID', 'Entity', 'Sentiment','Tweet content']
)

df_test = pd.read_csv(
    validation_path,
    header=None,
    names=['Tweet ID', 'Entity', 'Sentiment','Tweet content']
)

df_train

| | Tweet ID | Entity | Sentiment | Tweet content |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| ... | ... | ... | ... | ... |
| 74677 | 9200 | Nvidia | Positive | Just realized that the Windows partition of my... |
| 74678 | 9200 | Nvidia | Positive | Just realized that my Mac window partition is ... |
| 74679 | 9200 | Nvidia | Positive | Just realized the windows partition of my Mac ... |
| 74680 | 9200 | Nvidia | Positive | Just realized between the windows partition of... |
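
The loading cell passes `header=None` together with `names=[...]`, which suggests the CSV files have no header row. A minimal sketch of that behaviour, using two made-up rows in the same four-column layout rather than the real files:

```python
import io

import pandas as pd

# Two made-up rows in the same four-column layout, with no header row.
raw = io.StringIO(
    "1,Nvidia,Positive,nice card\n"
    "2,Borderlands,Negative,too many bugs\n"
)
columns = ["Tweet ID", "Entity", "Sentiment", "Tweet content"]

# header=None stops pandas from consuming the first data row as column names;
# names= supplies the labels instead.
df = pd.read_csv(raw, header=None, names=columns)
print(df.shape)
print(df)
```

Without `header=None`, the first tweet would be turned into column labels and silently lost from the data.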


Once the data has been loaded, let's look at the count and percentage of missing values per column.

In [24]:
missing_count = df_train.isna().sum()
missing_pct = df_train.isna().mean() * 100
missing_stats = (
    pd.DataFrame({
        'missing_count': missing_count,
        'missing_pct': missing_pct
    })
    .sort_values('missing_count', ascending=False)
)

missing_stats

| | missing_count | missing_pct |
|---|---|---|
| Tweet content | 686 | 0.918561 |
| Tweet ID | 0 | 0.0 |
| Entity | 0 | 0.0 |
| Sentiment | 0 | 0.0 |


Only a small fraction of rows have missing values (under 1%, all in `Tweet content`), so we drop both duplicates and rows with missing values. Depending on how the models perform, we may add extra cleaning rules later.

In [25]:
df_train = df_train.drop_duplicates()
df_train = df_train.dropna()

df_test = df_test.drop_duplicates()
df_test = df_test.dropna()
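
To see how much each cleaning step removes, the row counts can be compared before and after. The following is a small sketch on a made-up four-row frame (not the real data); the `reset_index` call is an optional addition that is not part of the cell above.

```python
import pandas as pd

# Toy frame (made-up rows) with one exact duplicate and one missing tweet.
df = pd.DataFrame({
    "Tweet ID": [1, 1, 2, 3],
    "Entity": ["Nvidia", "Nvidia", "Nvidia", "Borderlands"],
    "Sentiment": ["Positive", "Positive", "Neutral", "Negative"],
    "Tweet content": ["nice card", "nice card", None, "too many bugs"],
})

rows_before = len(df)
deduped = df.drop_duplicates()   # drops rows identical in every column
cleaned = deduped.dropna()       # drops rows with any missing value
# Optional: renumber the index so it has no gaps after rows are removed.
cleaned = cleaned.reset_index(drop=True)

print(f"duplicates removed: {rows_before - len(deduped)}")
print(f"rows with NaN removed: {len(deduped) - len(cleaned)}")
```

Note that `drop_duplicates()` with no arguments compares all columns, so two rows with the same text but different `Tweet ID` values are both kept.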

Now that the data is clean, let's quickly try a few models with their default parameters to get a baseline. We can also check whether the "Entity" column has any impact on model performance.
Once they are working, we can set up a more robust cross-validated search to tune hyperparameters.
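
The next cell relies on a pattern where pipeline steps start as `None` and the parameter grid supplies the actual vectorizer and classifier. Here is a stripped-down sketch of that pattern on a made-up toy corpus; the texts, labels, and the explicit `StratifiedKFold` splitter are assumptions for illustration and are not taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus, not taken from the dataset (1 = positive, 0 = negative).
texts = [
    "love this game", "hate this bug", "great fun tonight", "worst patch ever",
    "best update ever", "terrible lag again", "really enjoy playing", "really awful servers",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Both steps start empty; the grid decides what fills each slot.
pipeline = Pipeline([("vect", None), ("clf", None)])

param_grid = {
    "vect": [TfidfVectorizer(), CountVectorizer()],
    "clf": [LogisticRegression(max_iter=1000), MultinomialNB()],
}

grid = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    # Scoring and splitter are given explicitly because the placeholder
    # pipeline has no classifier yet for scikit-learn to infer them from.
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=0),
)
grid.fit(texts, labels)

for params, score in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"]):
    print(type(params["vect"]).__name__, type(params["clf"]).__name__, round(score, 3))
```

In this sketch, two vectorizers times two classifiers give 4 candidates, and with 2 folds that is 8 fits plus the final refit. The same arithmetic explains the "8 candidates, totalling 40 fits" line in the output of the real search: four grids with two vectorizers each, evaluated on 5 folds.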

In [None]:
import mlflow, mlflow.sklearn

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

def sentiment_encoding(sentiment):
    sentiments = {
        'Positive': 1,
        'Negative': 0,
        'Neutral': 2,
        # Neutral and irrelevant are treated equally as per the comment
        # on the source dataset (see link at the start of the notebook):
        # "We regard messages that are not relevant to the entity (i.e. Irrelevant) as Neutral."
        'Irrelevant': 2
    }
    
    return sentiments[sentiment]

# Pre-processing, drop tweet id
text_pipeline = Pipeline([
    ('vect', None)
])

fields = {
    'Tweet content': ('text', text_pipeline, 'Tweet content'),
    'Entity': ('entity', OneHotEncoder(handle_unknown='ignore'), ['Entity'])
}

fields_to_test = []
for field_name in fields.keys():
    fields_to_test.append(field_name)

    X_train = df_train[fields_to_test]
    # X_train = df_train[['Tweet content']]
    y_train = df_train['Sentiment'].map(sentiment_encoding)

    X_test = df_test[fields_to_test]
    # X_test = df_test[['Tweet content']]
    y_test = df_test['Sentiment'].map(sentiment_encoding)

    preprocessor = ColumnTransformer(list(fields[field] for field in fields_to_test))

    # preprocessor = ColumnTransformer([
    #     ('text', text_pipeline, 'Tweet content'),
    # ])

    print(X_train.columns)
    print(preprocessor)

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('clf', None)
    ])

    param_grid = [
        {
            'preprocessor__text__vect': [TfidfVectorizer(), CountVectorizer()],
            'clf': [RandomForestClassifier()],
        },
        {
            'preprocessor__text__vect': [TfidfVectorizer(), CountVectorizer()],
            'clf': [LogisticRegression(max_iter=1000)],
        },
        {
            'preprocessor__text__vect': [TfidfVectorizer(), CountVectorizer()],
            'clf': [MultinomialNB()],
        },
        {
            'preprocessor__text__vect': [TfidfVectorizer(), CountVectorizer()],
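            # use_label_encoder is ignored by recent XGBoost releases (see the "are not used" warning in the output) and can be dropped
            'clf': [XGBClassifier(objective='multi:softprob', eval_metric='mlogloss', use_label_encoder=False)],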
        }
    ]

    grid = GridSearchCV(
        pipeline,
        param_grid=param_grid,
        cv=5,
        scoring=['accuracy','f1_macro', 'roc_auc_ovr'],
        refit='f1_macro',
        n_jobs=-1,
        verbose=1
    )

    print("Fitting to pipeline")
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.sklearn.autolog()
    mlflow.set_experiment(experiment_id="1")
    with mlflow.start_run():
        mlflow.log_param("fields",'-'.join(fields_to_test))
        grid.fit(X_train, y_train)

    print("Best estimator:", grid.best_estimator_)
    print("Best CV score:", grid.best_score_)
    print("Estimating performance on test set")
    y_pred = grid.predict(X_test)
    print(classification_report(y_test, y_pred))



Index(['Tweet content'], dtype='object')
ColumnTransformer(transformers=[('text', Pipeline(steps=[('vect', None)]),
                                 'Tweet content')])
Fitting to pipeline
Fitting 5 folds for each of 8 candidates, totalling 40 fits


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
2025/08/07 21:57:15 INFO mlflow.sklearn.utils: Log

🏃 View run rare-shad-684 at: http://localhost:5000/#/experiments/1/runs/f34c2f546ac34f228d5787302126413d
🧪 View experiment at: http://localhost:5000/#/experiments/1
🏃 View run victorious-perch-246 at: http://localhost:5000/#/experiments/1/runs/6f02e5567cdd4d6ba330d9ed303a84a1
🧪 View experiment at: http://localhost:5000/#/experiments/1
🏃 View run unleashed-whale-244 at: http://localhost:5000/#/experiments/1/runs/1c809e2772f24348b67c39d731e82f38
🧪 View experiment at: http://localhost:5000/#/experiments/1
🏃 View run dapper-squirrel-182 at: http://localhost:5000/#/experiments/1/runs/e8177464b5a94233bea33b8b1c4b4870
🧪 View experiment at: http://localhost:5000/#/experiments/1
🏃 View run casual-wolf-596 at: http://localhost:5000/#/experiments/1/runs/345e30bc02bf49dcbfa95dc626482c58
🧪 View experiment at: http://localhost:5000/#/experiments/1
🏃 View run masked-goat-126 at: http://localhost:5000/#/experiments/1/runs/3f83a7b3fdd342b391d3d31853dc7daa
🧪 View experiment at: http://localhost:5000/#/e



In [None]:
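cv_df = pd.DataFrame(grid.cv_results_)
print(cv_df.columns)
# Multi-metric searches suffix the score columns with the metric name
# (e.g. 'mean_test_f1_macro'); single-metric searches use 'mean_test_score'.
cv_df[['params'] + [c for c in cv_df.columns if c.startswith(('mean_test_', 'std_test_'))]]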

In [None]:
# Best set of hyperparameters
print("best params:", grid.best_params_)
print("best score (f1):", grid.best_score_)
print("Best estimator pipeline:", grid.best_estimator_)

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_clf', 'param_preprocessor__text__vect', 'params',
       'split0_test_score', 'split1_test_score', 'split2_test_score',
       'split3_test_score', 'split4_test_score', 'mean_test_score',
       'std_test_score', 'rank_test_score'],
      dtype='object')
                                              params  mean_test_score  \
0  {'clf': RandomForestClassifier(), 'preprocesso...         0.427614   
1  {'clf': RandomForestClassifier(), 'preprocesso...         0.425714   
2  {'clf': LogisticRegression(max_iter=1000), 'pr...         0.456627   
3  {'clf': LogisticRegression(max_iter=1000), 'pr...         0.436991   
4  {'clf': MultinomialNB(), 'preprocessor__text__...         0.421585   
5  {'clf': MultinomialNB(), 'preprocessor__text__...         0.461814   
6  {'clf': XGBClassifier(base_score=None, booster...         0.448392   
7  {'clf': XGBClassifier(base_score=None, booster...         0.449732 
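
The column listing above uses the single-metric names (`mean_test_score`, `rank_test_score`). When `scoring` is a list, as in the `GridSearchCV` call earlier, scikit-learn names the columns per metric instead. A small sketch on synthetic data (the dataset and the `C` grid here are assumptions chosen only to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data, only here to make the search runnable.
X, y = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    scoring=["accuracy", "f1_macro"],
    refit="f1_macro",  # best_score_ and best_params_ follow this metric
    cv=3,
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
metric_cols = ["mean_test_accuracy", "mean_test_f1_macro", "rank_test_f1_macro"]
print(results[["params"] + metric_cols].sort_values("rank_test_f1_macro"))
print("has mean_test_score:", "mean_test_score" in results.columns)
```

Sorting by `rank_test_f1_macro` puts the candidate selected by `refit='f1_macro'` first, which makes it easy to compare against the accuracy column in the same row.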