### Things to do
#### General Notes
- `airline_sentiment` and possibly `airline_sentiment_confidence` are target columns (the latter cannot be in the training data)
- Remove instances of `"@airline"` tags from text
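
As a rough illustration of the tag removal noted above, the sketch below strips every `@handle` with a regular expression. The helper name `strip_mentions` is hypothetical and is not taken from the project code. Note that the pattern removes all handles, not only airline ones, and would also mangle e-mail addresses.

```python
import re

# Matches a Twitter-style handle such as "@united" or "@SouthwestAir".
MENTION_RE = re.compile(r"@\w+")


def strip_mentions(text: str) -> str:
    """Remove @handles and collapse the whitespace they leave behind."""
    without_mentions = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", without_mentions).strip()


print(strip_mentions("@united flight delayed again, thanks @united"))
```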

#### How to handle each column
**Numerical Columns**
- `negativereason_confidence` -- fill missing data with 0
- `retweet_count` -- remove, almost 100% of the values are 0

**Categorical Columns**
- `negativereason` -- one hot encode top K reasons +1 column for "other"
- `airline` -- remove or one hot encode with "other" column
- `airline_sentiment_gold` -- remove, almost 100% missing data
- `name` -- remove, unique data
- `negativereason_gold` -- remove, almost 100% missing data
- `tweet_location` -- remove or one hot encode with "other" column
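
The "top K plus other" idea can be sketched with plain pandas as follows. The function `one_hot_top_k` and the toy `airline` series are hypothetical and serve only as an illustration. Missing values fall into the "other" bucket here, which is one possible choice among several.

```python
import pandas as pd


def one_hot_top_k(values: pd.Series, k: int, other_label: str = "other") -> pd.DataFrame:
    """One-hot encode the k most frequent categories; the rest become `other_label`."""
    # value_counts ignores missing values, so they can never be a top category.
    top = values.value_counts().nlargest(k).index
    # Keep frequent categories as they are and replace everything else.
    collapsed = values.where(values.isin(top), other_label)
    return pd.get_dummies(collapsed, prefix=values.name, dtype=int)


# Toy data, not taken from Tweets.csv.
airline = pd.Series(
    ["United", "United", "Delta", "Delta", "Delta", "Southwest", None],
    name="airline",
)
encoded = one_hot_top_k(airline, k=2)
print(encoded)
```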

**Other Columns**
- `tweet_coord` -- remove, almost 100% missing data
- `user_timezone` -- remove, many missing values and it correlates with location
- `tweet_created` -- convert to columns: day of year (sin/cos), day of week, time of day (sin/cos)
- `text` -- sklearn.feature_extraction.text -> CountVectorizer (?)
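
The sin/cos encoding proposed for `tweet_created` could look roughly like the sketch below. It is a standalone illustration and does not describe how the project's `TimeTransformer` works; the names `cyclical` and `time_features` are hypothetical. The encoding places 23:59 next to 00:00, which a raw hour value does not.

```python
import math
from datetime import datetime


def cyclical(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def time_features(ts: datetime) -> dict[str, float]:
    """Build day-of-year, day-of-week and time-of-day features for one timestamp."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    doy_sin, doy_cos = cyclical(ts.timetuple().tm_yday - 1, 365.25)
    tod_sin, tod_cos = cyclical(seconds, 24 * 3600)
    return {
        "day_of_year_sin": doy_sin,
        "day_of_year_cos": doy_cos,
        "day_of_week": ts.weekday(),  # 0 = Monday
        "time_of_day_sin": tod_sin,
        "time_of_day_cos": tod_cos,
    }


features = time_features(datetime(2015, 2, 24, 6, 0, 0))
print(features)
```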


# Stage 1: Data Collection and Preparation

**Goal**: Gather and perform an initial analysis of publicly available datasets containing labeled texts with sentiment (positive, negative, neutral) in both Polish and English.

**Dataset**:
- **E2 - Twitter US Airline Sentiment**:
  - [Link to Dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment)

**Tasks**:
1. Conduct an initial data exploration (e.g., number of examples, class distribution).
2. Prepare the data for modeling:
   - Handle missing data.
   - Split the data into training and test sets.
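
The tasks above can be sketched on a toy frame as shown below. The data is made up and only mimics three of the columns in `Tweets.csv`; the 25% test share is an arbitrary value chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for Tweets.csv; the real file has many more rows and columns.
df = pd.DataFrame({
    "airline_sentiment": ["negative"] * 12 + ["neutral"] * 4 + ["positive"] * 4,
    "negativereason": ["Late Flight"] * 12 + [None] * 8,
    "text": [f"tweet {i}" for i in range(20)],
})

# Initial exploration: size, class distribution, share of missing values.
print(len(df))
print(df["airline_sentiment"].value_counts(normalize=True))
missing_share = df.isna().mean()
print(missing_share)

# Stratified split keeps the class proportions equal in both parts.
train, test = train_test_split(
    df, test_size=0.25, stratify=df["airline_sentiment"], random_state=0
)
print(test["airline_sentiment"].value_counts())
```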

---


In [12]:
# Stage 1 
import sys
sys.path.append('..')

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

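# Import only the custom transformers that the notebook uses
from src.transformersJS import DropColumnTransformer, TextTransformer, TimeTransformer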

In [14]:
def load_data():
    df = pd.read_csv('../data/Tweets.csv')
    df = df.drop(columns=['tweet_id'])

    df_train, df_test = train_test_split(df, test_size=0.1, stratify=df[['airline_sentiment']], random_state=0)

    X_train = df_train.drop(columns=['airline_sentiment', 'airline_sentiment_confidence'])
    y_train = df_train[['airline_sentiment']]

    X_test = df_test.drop(columns=['airline_sentiment', 'airline_sentiment_confidence'])
    y_test = df_test[['airline_sentiment']]

    return X_train, y_train, X_test, y_test

In [15]:
X_train, y_train, X_test, y_test = load_data()

X_train

print(X_train[X_train['text'].str.contains('#', na=False)]['text'].tolist())

print(X_train[X_train['text'].str.contains('http', na=False)].shape[0])


['@SouthwestAir when are you releasing your flights for September? Just found out you fly direct lbb to las! So excited! #tripofalifetime', "@AmericanAir I paid extra $ for my seat &amp; the monitor didn't work from on AA111. How about a refund on the seat? Conf #: MDBEEI, McMullen", '@USAirways forced sections 4 and 5 to check their carry on. would have packed differently to check my bag. Why even allow it? #pissed', '@AmericanAir Would have had to fly real far south, huh? #WinterWeather #Brrr', '@SouthwestAir you are failing! Diverted, stuck and no communication! Make a decision and let us go!!!! 😞😡 flight #4229', "@SouthwestAir what's up with these delays?! Throw some priority boarding my way &amp; I'll forgive you!! 👍 #southwest #southwestairlines", '@SouthwestAir Thank you for having flights going out of Nashville! You guys Rock! #DisneyPrincessHalfMarathon #girlsweekend #bffs', "@VirginAmerica Beats EPS Views, Takes On #SouthwestAir VA LUV - Investor's Business Daily http://t.co/


# Stage 2: Building a Simple Sentiment Analysis Model

**Goal**: Develop a basic classification model without advanced variable transformations.

**Tasks**:

1. **Basic Text Processing**:
   - Convert text to lowercase.
   - Remove punctuation and special characters.
   - Remove stop words.
   - Tokenization.
   
2. **Text Representation**:
   - Use Bag-of-Words (BoW) or TF-IDF to transform text into feature vectors.

3. **Model Training**:
   - Apply simple classifiers, such as:
     - Naive Bayes classifier.
     - Logistic regression.
     - Decision trees.
   - Train the model on the training set.

4. **Model Evaluation**:
   - Test the model on the test set.
   - Calculate metrics: AUC/GINI, accuracy, precision, recall, F1-score.
   - Analyze the confusion matrix.
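
A minimal baseline covering the steps above might look like the following sketch. The texts and labels are invented for the example, and the choice of logistic regression over the other listed classifiers is arbitrary. With so little data the scores themselves are meaningless; the point is only the shape of the workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline

# Invented examples, three per class.
train_texts = [
    "flight delayed three hours terrible service",
    "lost my bag worst airline",
    "cancelled flight no refund awful",
    "what time does the flight board",
    "is there wifi on this plane",
    "which gate for flight 204",
    "great crew thank you",
    "smooth flight lovely staff",
    "love the new seats amazing",
]
train_labels = ["negative"] * 3 + ["neutral"] * 3 + ["positive"] * 3
test_texts = ["terrible delay and lost bag", "thank you great flight", "what gate is boarding"]
test_labels = ["negative", "positive", "neutral"]

# TfidfVectorizer lowercases, tokenizes and drops English stop words.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)
predicted = model.predict(test_texts)

labels = ["negative", "neutral", "positive"]
cm = confusion_matrix(test_labels, predicted, labels=labels)
print(cm)
print(classification_report(test_labels, predicted, labels=labels, zero_division=0))
```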

---

In [16]:
# Stage 2
columns_to_drop = ['retweet_count', 'airline_sentiment_gold', 'negativereason_gold', 'tweet_coord', 'name', 'user_timezone']
columns_to_fill_zero = ['negativereason_confidence']
columns_to_fill_unknown = ['negativereason', 'tweet_location']
columns_to_ohe = ['negativereason', 'airline', 'tweet_location']

column_order_after_transform = (
    columns_to_fill_zero
    + columns_to_fill_unknown
    + ["airline", "text", "tweet_created"]
)


def column_idx(c):
    return column_order_after_transform.index(c)

preprocessor = Pipeline(
    steps=[
        ("drop", DropColumnTransformer(columns_to_drop)),
        (
            "fill_missing",
            ColumnTransformer(
                transformers=[
                    (
                        "fill_zero",
                        SimpleImputer(strategy="constant", fill_value=0),
                        columns_to_fill_zero,
                    ),
                    (
                        "fill_other",
                        SimpleImputer(strategy="constant", fill_value="Unknown"),
                        columns_to_fill_unknown,
                    ),
                ],
                remainder="passthrough",
            ),
        ),
        (
            "encode",
            ColumnTransformer(
                transformers=[
                    (
                        "ohe",
                        OneHotEncoder(
                            handle_unknown="infrequent_if_exist",
                            max_categories=3,
                            sparse_output=False,
                        ),
                        list(map(column_idx, columns_to_ohe)),
                    ),
                    (
                        "time",
                        TimeTransformer(),
                        list(map(column_idx, ["tweet_created"])),
                    ),
                    (
                        "text",
                        Pipeline(
                            [
                                ("text_transformer", TextTransformer()),
                                ("tfidf", TfidfVectorizer(stop_words="english")),
                            ]
                        ),
                        list(map(column_idx, ["text"])),
                    ),
                ],
                remainder="passthrough",
            ),
        ),
    ]
)

X_transformed = preprocessor.fit_transform(X_train)


In [17]:
text_data = X_train["text"].values.reshape(-1, 1)

text_transformer = TextTransformer()
processed_texts = text_transformer.fit_transform(text_data)

for i in range(70):
    print(f"Original: {text_data[i][0]}")
    print(f"Processed: {processed_texts[i]}")

Original: @SouthwestAir when are you releasing your flights for September? Just found out you fly direct lbb to las! So excited! #tripofalifetime
Processed: when are you releasing your flight for september ? just found out you fly direct lbb to la ! so excited ! # tripofalifetime
Original: @USAirways can you help us figure out our correct six digit confirmation number?
Processed: can you help u figure out our correct six digit confirmation number ?
Original: @AmericanAir I paid extra $ for my seat &amp; the monitor didn't work from on AA111. How about a refund on the seat? Conf #: MDBEEI, McMullen
Processed: i paid extra $ for my seat & amp ; the monitor did n't work from on aa111 . how about a refund on the seat ? conf # : mdbeei , mcmullen
Original: @JetBlue could I get a free flight to Vegas since it's my bday😏☺️
Processed: could i get a free flight to vega since it 's my bday : smirking_face : :smiling_face :
Original: @united flight 4841...3 gate changes on top of this.  Really ho

In [18]:
X = preprocessor.fit_transform(X_train)
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 218096 stored elements and shape (13176, 12477)>


# Stage 3: Developing an Advanced Sentiment Analysis Model

**Goal**: Build a more advanced model, considering detailed data cleaning, transformations, and the use of advanced modeling techniques.

**Tasks**:

1. **Advanced Data Processing and Cleaning**:
   - Handle emoticons and emojis.
   - Correct spelling errors.
   - Apply stemming or lemmatization.
   - Consider negations in the text (e.g., "not good" vs. "bad").
   - Remove duplicates.
   - Normalize text (e.g., expand abbreviations).

2. **Feature Engineering**:
   - Create additional features such as:
     - N-grams (bigrams, trigrams).
     - Word frequency.
     - Sentiment indicators based on dictionaries.
     - Use word embeddings (e.g., Word2Vec, GloVe).

3. **Advanced Modeling Techniques**:
   - Apply more complex models, such as:
     - Support Vector Machines (SVM).
     - Random Forest.
     - Gradient Boosting (e.g., XGBoost).
     - Neural Networks:
       - Recurrent Neural Networks (RNN, LSTM).
       - Convolutional Neural Networks (CNN).
       - Transformer models (e.g., BERT, RoBERTa).

4. **Hyperparameter Tuning**:
   - Use techniques like Grid Search or Random Search for model optimization.

5. **Model Evaluation**:
   - Use cross-validation for model evaluation.
   - Compare results with the simple model:
     - Did advanced techniques improve the performance?
   - Analyze cases where the model performs better or worse.
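
One simple way to account for negations, sketched here under the assumption of a small hand-picked negation list, is to tag every token that follows a negation word until the next punctuation mark. The function `mark_negation` is hypothetical and is only one of several possible approaches.

```python
import re

# Small hand-picked list; a real project would likely need a longer one.
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"[.,;:!?]")


def mark_negation(text: str) -> str:
    """Append _NEG to tokens that follow a negation word, up to the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,;:!?]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if CLAUSE_END.fullmatch(tok):
            negating = False  # punctuation closes the negated scope
            out.append(tok)
        elif tok in NEGATIONS or tok.endswith("n't"):
            negating = True
            out.append(tok)
        else:
            out.append(tok + "_NEG" if negating else tok)
    return " ".join(out)


print(mark_negation("The flight was not good, but the crew was great."))
```

After marking, "good" and "good_NEG" are different tokens, so a bag-of-words or TF-IDF model can assign them different weights.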

---

In [19]:
# Stage 3

# Stage 4: Comparison with LLM Models (e.g., OpenAI)

**Goal**: Compare the results of custom-built models with those from LLM (Large Language Models).

**Tasks**:

1. **Developing an LLM Prompt**:
   - Create an effective prompt for sentiment analysis. Example:
     ```
     Analyze the sentiment of the following text and classify it as positive, negative, or neutral:
     "{text}"
     ```

2. **Using LLM API**:
   - Send test data to the LLM model via API.
   - Save LLM model predictions.
   - Ensure compliance with LLM usage policies.

3. **Analysis and Comparison of Results**:
   - Compare evaluation metrics of all models.
   - Identify differences in predictions between models.
   - Discuss potential reasons for these differences:
     - Ability to understand context.
     - Handling irony or sarcasm.
     - Impact of input data quality.
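
The prompt step and the handling of free-form answers can be sketched without calling any API. Everything below is an assumption made for illustration: the extra "Answer with a single word" instruction, the helper names, and the fallback to `neutral` when no label is found.

```python
LABELS = ("positive", "negative", "neutral")

PROMPT_TEMPLATE = (
    "Analyze the sentiment of the following text and classify it as "
    "positive, negative, or neutral. Answer with a single word.\n"
    '"{text}"'
)


def build_prompt(text: str) -> str:
    """Fill the tweet text into the prompt template."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_label(response: str, default: str = "neutral") -> str:
    """Return the label mentioned first in the response, or `default` if none is found."""
    lowered = response.lower()
    hits = [(lowered.find(label), label) for label in LABELS if label in lowered]
    return min(hits)[1] if hits else default


print(build_prompt("@united thanks for the upgrade!"))
print(parse_label("Sentiment: Negative."))
```

Taking the first mentioned label is a crude rule: an answer such as "not positive, rather negative" would be parsed as `positive`, so constraining the model to a one-word answer matters.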

---

In [20]:
# Stage 4
import sys
!{sys.executable} -m pip install transformers
# Quote the extras so that the shell does not treat the brackets as a glob pattern
!{sys.executable} -m pip install "transformers[sentencepiece]"
!{sys.executable} -m pip install openai google-generativeai transformers torch seaborn

Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece])
  Using cached sentencepiece-0.2.0.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentencepiece
  Building wheel for sentencepiece (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [98 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build/lib.linux-x86_64-cpython-313/sentencepiece
      copying src/sentencepiece/__init__.py -> build/lib.linux-x86_64-cpython-313/sentencepiece
      copying src/sentencepiece/_version.py -> build/lib.linux-x86_64-cpython-313/sentencepiece
      copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-x86_64-cpython-313/senten

In [None]:
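import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.metrics import roc_auc_score, RocCurveDisplay
import matplotlib.pyplot as plt

results_df = pd.read_csv('model_comparison.csv')

# y_test is a single-column DataFrame; flatten it to 1-D for LabelEncoder
y_true = y_test['airline_sentiment'].to_numpy()

le = LabelEncoder()
y_true_encoded = le.fit_transform(y_true)
classes = np.arange(len(le.classes_))

# One-hot encode the labels: with three classes, roc_auc_score and
# RocCurveDisplay need per-class scores, not a single column of class ids
y_true_bin = label_binarize(y_true_encoded, classes=classes)

models = {
    'DeBERTa': 'deberta',
    'GPT': 'gpt',
    'Gemini': 'gemini',
    'RoBERTa': 'roberta'
}

# Draw every curve on one shared axis instead of one figure per model
fig, ax = plt.subplots(figsize=(10, 8))

for model_name, col_name in models.items():
    y_pred = results_df[col_name].values
    y_pred_encoded = le.transform(y_pred)
    # These are hard label predictions, so each curve has a single operating point
    y_pred_bin = label_binarize(y_pred_encoded, classes=classes)

    # Macro-averaged one-vs-rest AUC over the three classes
    auc = roc_auc_score(y_true_bin, y_pred_bin, average='macro')
    print(f'{model_name}: macro OvR AUC = {auc:.3f}')

    # Micro-averaged curve: flatten the one-hot matrices into one binary problem
    RocCurveDisplay.from_predictions(
        y_true_bin.ravel(),
        y_pred_bin.ravel(),
        name=f'{model_name} (micro-avg)',
        plot_chance_level=(model_name == 'DeBERTa'),
        ax=ax,
    )

plt.title('ROC Curves for All Models')
plt.axis('square')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()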

In [None]:
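import ast

results_df = pd.read_csv('model_comparison.csv')
y_true = y_test

# Parse the stringified probability lists with ast.literal_eval instead of eval,
# which would execute arbitrary code found in the CSV
results_df['deberta_probs'] = results_df['deberta_probs'].apply(ast.literal_eval).apply(np.array)
results_df['roberta_probs'] = results_df['roberta_probs'].apply(ast.literal_eval).apply(np.array)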
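
Once probability vectors are available, a multiclass AUC can be computed from them directly. The sketch below uses invented values and assumes the column order (negative, neutral, positive); the order actually stored in `model_comparison.csv` is not known from this notebook and would need to be checked.

```python
import ast

import numpy as np
from sklearn.metrics import roc_auc_score

# Invented rows: probability vectors stored as strings, as they would be in a CSV.
# Assumed class order: 0 = negative, 1 = neutral, 2 = positive.
raw_probs = [
    "[0.8, 0.1, 0.1]",
    "[0.2, 0.6, 0.2]",
    "[0.1, 0.2, 0.7]",
    "[0.5, 0.3, 0.2]",
    "[0.3, 0.4, 0.3]",
    "[0.2, 0.2, 0.6]",
]
y_true_ids = np.array([0, 1, 2, 0, 1, 2])

# Stack the parsed lists into an (n_samples, n_classes) matrix.
probs = np.vstack([ast.literal_eval(p) for p in raw_probs])

# One-vs-rest AUC, averaged over classes; each row of probs must sum to 1.
auc_from_probs = roc_auc_score(y_true_ids, probs, multi_class="ovr", average="macro")
print(auc_from_probs)
```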

# Optional: Enhancing Project Appeal

1. **Experimenting with Ensemble Methods**:
   - Combine results from different models to improve accuracy (e.g., voting, stacking).

2. **Bias and Ethics in AI**:
   - Analyze whether models exhibit biases towards certain groups or topics.
   - Propose methods to reduce bias in models.

3. **Practical Application of Models**:
   - Use the models for analyzing current data (e.g., recent tweets on a particular topic).
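
The voting idea from item 1 can be sketched as a plain majority vote over the label predictions of several models. The predictions below are invented, and resolving ties as `neutral` is an arbitrary assumption.

```python
from collections import Counter


def majority_vote(predictions: list[str], tie_breaker: str = "neutral") -> str:
    """Return the most common label; fall back to `tie_breaker` when the top two tie."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_breaker
    return counts[0][0]


# Invented predictions for three tweets from three models.
per_model = {
    "deberta": ["negative", "positive", "neutral"],
    "gpt": ["negative", "neutral", "positive"],
    "roberta": ["positive", "neutral", "negative"],
}

# zip(*...) regroups the predictions so that each tuple holds one tweet's votes.
ensemble = [majority_vote(list(votes)) for votes in zip(*per_model.values())]
print(ensemble)
```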

In [None]:
# Optional