# Preprocessing Pipeline
**Tokenization**: in this experiment I treat a token as a word in a tweet, delimited by whitespace.

Punctuation is dropped, except for the following:
- Keep "?" and "!" because they are important and often found in tweets expressing surprise or anger.
- Keep "@" because user mentions are distinct from ordinary words, even when a username contains such a word.
- Keep "#" because hashtags carry a different meaning from plain words.
- Keep "." for ellipses ("..."), which also carry emotional information.
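
The rule above can be sketched with the standard library alone. This is only a minimal illustration, not the pipeline itself: `KEEP`, `REMOVE_TABLE` and `strip_punctuation` are hypothetical names, and a plain whitespace split stands in for the tokenizer.

```python
import string

# Characters kept per the list above (illustrative; the names here are hypothetical)
KEEP = "?!@#."

# Translation table that deletes every ASCII punctuation character not in KEEP
REMOVE_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)

def strip_punctuation(text):
    """Delete unwanted punctuation, then split on whitespace."""
    return text.translate(REMOVE_TABLE).split()

tokens = strip_punctuation("Wait... @anna, is this (really) #happening?!")
print(tokens)
# ['Wait...', '@anna', 'is', 'this', 'really', '#happening?!']
```

The comma and parentheses disappear, while the ellipsis, the mention, the hashtag and the trailing "?!" survive attached to their words.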

The following class encapsulates the preprocessing logic.

In [None]:
import re
import string

import emoji
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer

In [None]:
class PreprocessPipeline:
  def __init__(self):
    # strip_handles=False so that @mentions survive tokenization, matching the "@" rule above
    self.tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=False)

    # Characters to keep; "." is kept for ellipses ("...")
    # "_" occurs in demojized emoji names; "'" occurs in contractions
    chars_to_keep = "@#?!.'_"
    self.punct_to_remove = "".join([c for c in string.punctuation if c not in chars_to_keep])

  def clean_text(self, text):
    # Converts emoji to text, e.g. 😂 becomes " face_with_tears_of_joy " (spaces replace the default ":" delimiters)
    text = emoji.demojize(text, delimiters=(" ", " "))

    # Lower
    text = text.lower()

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\b(href|http|https)\b', '', text)

    # Noise patterns left over from HTML/feed markup
    noise_patterns = [
        r'\bgt\b',                                                      # 'gt' left over from the entity &gt; (word-bounded so "length" is untouched)
        r'class[^\w\s]*delicious[^\w\s]*title[^\w\s]*share[^\w\s]*del', # Specific CSS/HTML string
        r'rel[^\w\s]*nofollow[^\w\s]*target[^\w\s]*blank',              # Specific CSS/HTML string
        r'languagedirection[^\w\s]*ltr',                                # Directional metadata
        r'\b(type|application|atom|xml|feedlinks|href|http|https)\b',   # Feed/markup metadata
    ]

    combined_noise = '|'.join(noise_patterns)
    text = re.sub(combined_noise, '', text)

    # Remove punctuation, keep some special characters
    # We use a translation table here; it's much faster than regex for single characters
    table = str.maketrans('', '', self.punct_to_remove)
    text = text.translate(table)

    # Re-apply the noise patterns: stripping punctuation can join fragments into new matches
    text = re.sub(combined_noise, '', text)

    # Remove extra space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

  def transform(self, text):
    text = self.clean_text(text)
    tokens = self.tweet_tokenizer.tokenize(text)
    return tokens
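
One thing worth checking with very short noise patterns is whether they also match inside ordinary words. The sketch below uses only the standard library to compare a bare `gt` pattern with a word-bounded one; `bare`, `bounded` and `sample` are hypothetical names chosen for this illustration.

```python
import re

bare = re.compile(r"gt")         # matches "gt" anywhere, even inside words
bounded = re.compile(r"\bgt\b")  # matches "gt" only as a standalone token

sample = "the length is &gt; 5"

bare_result = bare.sub("", sample)
bounded_result = bounded.sub("", sample)

print(bare_result)     # the lenh is &; 5
print(bounded_result)  # the length is &; 5
```

The bare pattern clips "length" to "lenh", while the bounded pattern only removes the entity residue. The leftover "&;" would be removed later by the punctuation step.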