## Text Cleaning Swedish Tweets
We begin by importing and installing all necessary libraries. We got some extras for later usage.

In [2]:
%%capture
!pip install gensim --upgrade
!pip install transformers
!pip install fse

In [9]:
import numpy as np # linear algebra
import pandas as pd # dat

from tqdm import tqdm
tqdm.pandas()

from pathlib import Path

import os
import random 
import operator 
import regex as re

"""
import torch
import torch.optim as optim

# fastai
from fastai import *
from fastai.text import *
from fastai.callbacks import *

# transformers
from transformers import PreTrainedModel, PreTrainedTokenizer, PretrainedConfig

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModelWithLMHead, AutoModel
from transformers import AutoTokenizer, TFAutoModelForTokenClassification
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, DistilBertConfig
"""

# gensim + fasttext
from gensim.models.fasttext import FastText, load_facebook_vectors
from gensim.models import KeyedVectors

# import fastai
# import transformers
import spacy

In [10]:
# print('fastai version :', fastai.__version__)
# print('transformers version :', transformers.__version__)
print('spacy version :', spacy.__version__)

# tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased-ner")
# model = AutoModelForTokenClassification.from_pretrained("KB/bert-base-swedish-cased-ner")			

spacy version : 2.2.4


## Reading the data
Let's begin by inspecting the data to understand it further

In [44]:
# Make sure format is raw,y
FILENAME = "../datasets/swedish_tweet_combined.csv"

df = pd.read_csv(FILENAME)
print(df.shape)
df.head()

(36680, 2)


Unnamed: 0,raw,y
0,RT @Carolinafarraj: Fick höra idag 'hur känner...,Negative
1,Nice att vakna halv 3 på natten med en snustor...,Negative
2,RT @DKristoffersson: David Luiz är rolig på In...,Positive
3,"RT @Chiyokosmet: Nej, jag bryr inte om vem du ...",Positive
4,RT @petterbristav: Det har läckt ut nakenbilde...,Neutral


## Preprocessing allowing our Word Embeddings to actually cover more
We want to cover the data as well as possible through our embeddings, else the feature / word is completely lost.

To begin with we'll look at the vocabs. Both word & characters.

In [45]:
def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [46]:
def build_char_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
          for char in word:
            try:
                vocab[char] += 1
            except KeyError:
                vocab[char] = 1
    return vocab

### Inspecting our vocab
Let's look at how good we cover our words with these methods!

In [47]:
sentences = df["raw"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print()
print({k: vocab[k] for k in list(vocab)[:5]})

100%|██████████| 36680/36680 [00:00<00:00, 127130.37it/s]
100%|██████████| 36680/36680 [00:00<00:00, 152065.62it/s]
{'RT': 12726, '@Carolinafarraj:': 58, 'Fick': 104, 'höra': 88, 'idag': 354}



In [49]:
char_vocab = build_char_vocab(sentences)
print()
print({k: char_vocab[k] for k in list(char_vocab)[-10:]})
print({k: char_vocab[k] for k in list(char_vocab)[:10]})

100%|██████████| 36680/36680 [00:00<00:00, 64041.68it/s]
{'👺': 1, '™': 2, 'ā': 2, 'ś': 1, 'ī': 1, 'ū': 1, '🇫': 1, '🇷': 1, 'ć': 1, '🐻': 1}
{'R': 20832, 'T': 22687, '@': 33486, 'C': 4685, 'a': 247282, 'r': 214951, 'o': 122662, 'l': 140640, 'i': 150416, 'n': 201214}



### Loading our Word2Vec
Let's load the W2V and then check the coverage!

In [51]:
MODEL_PATH = "../models/cc.sv.100.bin"
f_vectors = load_facebook_vectors(MODEL_PATH) # load_facebook_vectors('cc.sv.300.bin.gz')

In [52]:
def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except:
            oov[word] = vocab[word]
            i += vocab[word]
            pass

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

In [53]:
oov = check_coverage(vocab, f_vectors.vocab)

100%|██████████| 99887/99887 [00:00<00:00, 307338.26it/s]Found embeddings for 39.04% of vocab
Found embeddings for  77.00% of all text



### Coverage
Pretty okay for no changes applied, covering ~ 77% of the text (but only 39% of the vocab! :O)

We can simply inspect the OOV words

In [60]:
oov[:10]

[('#svpol', 1271),
 (':)', 729),
 ('#dinröst', 655),
 ('&amp;', 558),
 ('#val2014', 530),
 ('Egots', 316),
 (';)', 242),
 ('@Emmywin:', 217),
 ('#08pol', 214),
 ('@niklassvensson:', 200)]

Looking at the data it seems as if the embeddings are lower-cased. By looking at them you can see this yourself, it's actually only lower-case words. 

Let us fix this by lowering all text (shouldn't loose way to much context by this - but important to keep in mind during cleaning!)

In [0]:
df['TweetText'] = df['TweetText'].progress_apply(lambda x: x.lower())
sentences = df["TweetText"].apply(lambda x: x.split())
vocab = build_vocab(sentences)
oov = check_coverage(vocab, f_vectors.vocab)

oov[:10]

100%|██████████| 36680/36680 [00:00<00:00, 540415.38it/s]
100%|██████████| 36680/36680 [00:00<00:00, 187809.17it/s]
100%|██████████| 92393/92393 [00:00<00:00, 726411.08it/s]

Found embeddings for 46.43% of vocab
Found embeddings for  82.60% of all text





[('&amp;', 558),
 ('…', 496),
 ('ja,', 263),
 ('@emmywin:', 217),
 ('@niklassvensson:', 200),
 ('http://…', 169),
 ('@expressen:', 160),
 ('nej,', 160),
 ('@ahedenstedt:', 153),
 ("'det", 152)]

**Gains**

10 % units gained in vocab coverage and almost 20 % units gained in text coverage! That's _really_ good for such a simple thing to do. Looking at the top OOV we can still see some issues.

1. Retweets & mentions
2. http/https websites
3. html-code (`&amp` etc)
4. bad tokenization (e.g. `'det`)

Let's start by fixing a few of these. Namely retweet-pattern & removing punctuation (using subword embeddings such as fastText would allow us to keep these better)

In [0]:
retweet_pattern = re.compile('rt \S+: ')
punct_pattern = re.compile('([!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~])')

In [0]:
def clean_text(text):
  text = retweet_pattern.sub('', text)
  text = at_pattern.sub(' @ användare', text)
  text = http_pattern.sub('länk ', text)
  text = punct_pattern.sub(' ', text)
  text = text.replace(' #', ' # ')
  text = text.replace(':)', 'glad')
  text = text.replace(';)', 'glad')
  text = text.replace(':-)', 'glad')
  text = text.replace('&amp;', '&')
  #for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
  #      text = text.replace(punct, '')
  return text

In [0]:
df['TweetText'] = df['TweetText'].progress_apply(lambda x: clean_text(x))
sentences = df["TweetText"].apply(lambda x: x.split())
vocab = build_vocab(sentences)



  0%|          | 0/36680 [00:00<?, ?it/s][A[A

 19%|█▉        | 6912/36680 [00:00<00:00, 69112.32it/s][A[A

 38%|███▊      | 13834/36680 [00:00<00:00, 69143.67it/s][A[A

 56%|█████▋    | 20720/36680 [00:00<00:00, 69056.91it/s][A[A

 76%|███████▌  | 27843/36680 [00:00<00:00, 69691.49it/s][A[A

100%|██████████| 36680/36680 [00:00<00:00, 69404.87it/s]


  0%|          | 0/36680 [00:00<?, ?it/s][A[A

100%|██████████| 36680/36680 [00:00<00:00, 198765.74it/s]


In [0]:
oov = check_coverage(vocab, f_vectors.vocab)
oov[:10]



100%|██████████| 53503/53503 [00:00<00:00, 652081.11it/s]

Found embeddings for 75.42% of vocab
Found embeddings for  95.43% of all text





[('…', 713),
 ('08pol', 218),
 ('http…', 116),
 ('användare…', 112),
 ('dinroest', 106),
 ('ht…', 92),
 ('htt…', 85),
 ('jobbvalet', 76),
 ('hypnostillstånd', 70),
 ('twittpuck', 64)]

In [0]:
at_pattern = re.compile('\s(@[\w_-]+):?')
http_pattern = re.compile('https?:\S+')