### Things to do
#### General Notes
- `airline_sentiment` and possibly `airline_sentiment_confidence` are target columns (the latter must not appear in the training data)
- Remove instances of `"@airline"` tags from the text
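
A minimal sketch of the tag removal, assuming that every `@handle` can be dropped, not only airline handles. The helper name `strip_mentions` is hypothetical and is not part of `src.transformers`.

```python
import re

# Matches an @-mention such as "@united" plus any whitespace that follows it.
# A bare "@" followed by a space (as in "@ DEN Airport") is left untouched.
MENTION_RE = re.compile(r"@\w+\s*")


def strip_mentions(text: str) -> str:
    """Remove every @handle from a tweet and trim leftover whitespace."""
    return MENTION_RE.sub("", text).strip()


print(strip_mentions("@USAirways can you help us figure out our correct six digit confirmation number?"))
# prints: can you help us figure out our correct six digit confirmation number?
```

To keep non-airline mentions, the pattern could instead list the airline handles explicitly.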

#### How to handle each column
**Numerical Columns**
- `negativereason_confidence` -- fill missing data with 0
- `retweet_count` -- remove, almost 100% of the values are 0

**Categorical Columns**
- `negativereason` -- one-hot encode the top K reasons, plus one column for "other"
- `airline` -- remove or one-hot encode with an "other" column
- `airline_sentiment_gold` -- remove, almost 100% missing data
- `name` -- remove, unique data
- `negativereason_gold` -- remove, almost 100% missing data
- `tweet_location` -- remove or one-hot encode with an "other" column

**Other Columns**
- `tweet_coord` -- remove, almost 100% missing data
- `user_timezone` -- remove, many missing values and it correlates with location
- `tweet_created` -- convert to columns: day of year (sin/cos), day of week, time of day (sin/cos)
- `text` -- sklearn.feature_extraction.text -> CountVectorizer (?)
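
The sin/cos idea for `tweet_created` can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the year length is approximated as 365.25 days, and it is not the `TimeTransformer` from `src.transformers`. Mapping a periodic value onto the unit circle keeps 23:59 close to 00:00, which a raw hour number would not.

```python
import math
from datetime import datetime


def cyclical(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def time_features(ts: datetime) -> dict:
    """Derive day-of-year, day-of-week and time-of-day features from a timestamp."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    time_sin, time_cos = cyclical(seconds, 24 * 3600)
    # tm_yday starts at 1; shift to 0 so January 1st maps to angle 0.
    doy_sin, doy_cos = cyclical(ts.timetuple().tm_yday - 1, 365.25)
    return {
        "doy_sin": doy_sin,
        "doy_cos": doy_cos,
        "day_of_week": ts.weekday(),  # Monday == 0
        "time_sin": time_sin,
        "time_cos": time_cos,
    }


print(time_features(datetime(2015, 2, 24, 6, 0, 0)))
```

`day_of_week` is left as an integer here; it could also be one-hot or sin/cos encoded.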


In [1]:
import sys
sys.path.append('..')

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

from src.transformers import DropColumnTransformer, TimeTransformer, TextTransformer

In [3]:
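def load_data():
    """Load the tweets and return a stratified 90/10 split as (X_train, y_train, X_test, y_test)."""
    df = pd.read_csv('../data/Tweets.csv')
    df = df.drop(columns=['tweet_id'])

    # Stratify on the label so both splits keep the same sentiment distribution.
    df_train, df_test = train_test_split(
        df, test_size=0.1, stratify=df[['airline_sentiment']], random_state=0)

    # The confidence column describes the label, so it must not leak into the features.
    target_columns = ['airline_sentiment', 'airline_sentiment_confidence']

    X_train = df_train.drop(columns=target_columns)
    y_train = df_train[['airline_sentiment']]

    X_test = df_test.drop(columns=target_columns)
    y_test = df_test[['airline_sentiment']]

    return X_train, y_train, X_test, y_test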

In [4]:
X_train, y_train, X_test, y_test = load_data()

In [5]:
columns_to_drop = ['retweet_count', 'airline_sentiment_gold', 'negativereason_gold', 'tweet_coord', 'name', 'user_timezone']
columns_to_fill_zero = ['negativereason_confidence']
columns_to_fill_unknown = ['negativereason', 'tweet_location']
columns_to_ohe = ['negativereason', 'airline', 'tweet_location']

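# Output order of 'fill_missing': imputed columns first, then the passthrough remainder.
column_order_after_transform = (
    columns_to_fill_zero
    + columns_to_fill_unknown
    + ['airline', 'text', 'tweet_created']
)

def column_idx(c):
    # Positional index of column `c` in the array produced by 'fill_missing'.
    return column_order_after_transform.index(c)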

preprocessor = Pipeline(steps=[
    ('drop', DropColumnTransformer(columns_to_drop)),
    ('fill_missing', 
        ColumnTransformer(
            transformers=[
                ('fill_zero', SimpleImputer(strategy='constant', fill_value=0), columns_to_fill_zero),
                ('fill_other', SimpleImputer(strategy='constant', fill_value='Unknown'), columns_to_fill_unknown),
                
            ], 
            remainder='passthrough')),
    ('encode', ColumnTransformer(transformers=[
        ('ohe', OneHotEncoder(
            handle_unknown='infrequent_if_exist', 
            max_categories=3, 
            sparse_output=False), 
            list(map(column_idx, columns_to_ohe))),
        ('time', TimeTransformer(), list(map(column_idx, ['tweet_created']))),
        ('text', TextTransformer(), list(map(column_idx, ['text'])))
    ],
    remainder='passthrough'))
])

In [6]:
X = preprocessor.fit_transform(X_train)

[['@SouthwestAir when are you releasing your flights for September? Just found out you fly direct lbb to las! So excited! #tripofalifetime']
 ['@USAirways can you help us figure out our correct six digit confirmation number?']
 ["@AmericanAir I paid extra $ for my seat &amp; the monitor didn't work from on AA111. How about a refund on the seat? Conf #: MDBEEI, McMullen"]
 ...
 ['@USAirways despite mechanical issues and many delays followed by a Cancelled Flightlation, still getting to Vegas thanks to great gate agents!']
 ['@SouthwestAir Thx Ops Agt Rich Westagard n Flight Att. Nancy @ DEN Airport.Held flight 1027 n even saved seat 4 Bus Select #CustomersFirst!']
 ['@united Your social listening capabilities are awful if this is the reply for the context in which you were mentioned @stevelord212']]


In [7]:
new_columns = preprocessor[1:].get_feature_names_out()
df = pd.DataFrame(X, columns=new_columns)
df