# Tweet Sanitizer
---
Python code that sanitizes raw tweet content, i.e. removes hashtags, mentions, links, photos, etc.

In [1]:
import numpy as np

import re
import csv
import os

import emot
import emoji

from tqdm.notebook import tqdm

from src.sanitization import TweetSanitizer

from src.constants import (RAW_PATH, SANITIZED_PATH,
                           SUPPLEMENT_RAW_DIR, SUPPLEMENT_SANITIZED_DIR)

**PARTIAL SANITIZATION**

Remove:
* stray non-UTF-8 characters
* user mentions
* links {https://t.co/P3zt8zBUbL}
* photo links {pic.twitter.com...}
* hashtags with hashcodes {#.43djr324rj34}
* special characters {/w; /n; /r}
* redundant spaces
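
The removal steps listed above can be approximated with a handful of regular expressions. The following is a minimal sketch, not the actual `TweetSanitizer.partial_sanitization` code: the pattern names and the `partial_sanitize_sketch` helper are assumptions made for illustration, and the sketch does not try to reproduce every detail of the real output (for example, it leaves `...` and `…` in place).

```python
import re

# Hypothetical patterns; the real TweetSanitizer in src.sanitization may differ.
PHOTO = re.compile(r"pic\.twitter\.com/\S+")   # photo links
LINK = re.compile(r"https?://\S*")             # links, including a bare "https://"
HASHCODE = re.compile(r"#\.\S+")               # hashtags that carry a hashcode
MENTION = re.compile(r"@\w+")                  # user mentions
WHITESPACE = re.compile(r"\s+")                # newlines, tabs, repeated spaces


def partial_sanitize_sketch(text: str) -> str:
    """Strip photo links, links, hashcode hashtags and mentions, then collapse whitespace."""
    for pattern in (PHOTO, LINK, HASHCODE, MENTION):
        text = pattern.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()
```

Ordinary hashtags such as `#Kompania` are kept on purpose, because partial sanitization only drops the hashcode variant.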


**FULL SANITIZATION**

Remove:
* everything removed in partial sanitization
* the hash sign from all hashtags {#}
* everything other than text

Extract:
* emoticons {:); ;)}
* emoji {🤦‍♂️; 🤣; 😂; 🤣}
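
The extraction step can be sketched with the standard library alone. This is only an illustration under stated assumptions: the emoticon pattern covers a few common shapes, the emoji pattern uses rough Unicode ranges rather than a maintained emoji table, and `extract_sketch` is a hypothetical helper, not a method of `TweetSanitizer`. The notebook imports `emot` and `emoji`, which suggests the real code relies on those libraries instead.

```python
import re

# Assumed, deliberately small emoticon pattern: eyes, optional nose, mouth.
# The lookbehind keeps the ":/" inside "https://" from being matched.
EMOTICON = re.compile(r"(?<!\w)[:;]-?[()/DP](?!\w)")

# Rough emoji ranges (pictographs plus miscellaneous symbols and dingbats).
# Regional indicators and ZWJ sequences are NOT handled by this sketch.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extract_sketch(text: str):
    """Return (clean_text, emoji, emoticons), mirroring the tuple order shown in the notebook."""
    emoticons = EMOTICON.findall(text)
    text = EMOTICON.sub(" ", text)
    emojis = EMOJI.findall(text)
    text = EMOJI.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text, " ".join(emojis), " ".join(emoticons)
```

Emoticons are removed before emoji so that the two result strings never overlap. A range-based pattern like this one goes stale as new emoji are added to Unicode, which is one reason to prefer a maintained library in practice.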

In [2]:
sanitizer = TweetSanitizer()

In [3]:
test_text = '#Kompania #Węglowa @weglowa :( pic.twitter.com/O2ixmQ2Jm1 https:// blokuje śląskie sądy. http://niezalezna.pl/209246-sprawdzili-czy-tusk 😂 20 tysięcy pozwów ws. deputatów węglowych :/- Dziennik...zachodni.pl:http://niezalezna.pl/209246-sprawdzili-czy-tusk-jest-winny #.VIXGNXEL7p8.twitter …'
test_text

'#Kompania #Węglowa @weglowa :( pic.twitter.com/O2ixmQ2Jm1 https:// blokuje śląskie sądy. http://niezalezna.pl/209246-sprawdzili-czy-tusk 😂 20 tysięcy pozwów ws. deputatów węglowych :/- Dziennik...zachodni.pl:http://niezalezna.pl/209246-sprawdzili-czy-tusk-jest-winny #.VIXGNXEL7p8.twitter …'

In [4]:
sanitizer.partial_sanitization(test_text)

'#Kompania #Węglowa :( blokuje śląskie sądy. 😂 20 tysięcy pozwów ws. deputatów węglowych :/- Dziennikzachodni.pl: '

In [5]:
sanitizer.full_sanitization(test_text)

('Kompania Węglowa blokuje śląskie sądy. 20 tysięcy pozwów ws. deputatów węglowych - Dziennikzachodni.pl: ',
 '😂',
 ':( :/')

In [6]:
sanitizer.extract_emoji___('This does not work: 🤨, !🇵, 🤪, and 🥺. But this 😂 works!')

('This does not work: \U0001f928, !🇵, \U0001f92a, and \U0001f97a. But this  works!',
 '😂')

In [7]:
sanitizer.extract_emoji('This does not work: 🤨, !🇵, 🤪, and 🥺. But this 😂 works!')

('This does not work: , !, , and . But this  works!',
 '\U0001f928 🇵 \U0001f92a \U0001f97a 😂')

In [8]:
file_label = '2016-0206'
sanitizer.sanitize_tweets(SUPPLEMENT_RAW_DIR.replace('{}', file_label),
                          SUPPLEMENT_SANITIZED_DIR.replace('{}', file_label), full_sanitize=True)
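
The file layout that `sanitize_tweets` expects is not shown in this notebook. Purely as an assumption, the sketch below treats the input as a CSV file with the tweet text in one column and writes a sanitized copy; `sanitize_csv_sketch`, its `sanitize` callback and the `text_column` parameter are hypothetical names, not part of `TweetSanitizer`.

```python
import csv


def sanitize_csv_sketch(raw_path, sanitized_path, sanitize, text_column=0):
    """Apply `sanitize` to one column of every row and write the rows to a new file."""
    with open(raw_path, newline="", encoding="utf-8") as src, \
         open(sanitized_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip empty lines
            row[text_column] = sanitize(row[text_column])
            writer.writerow(row)
```

Passing the sanitizing function as a parameter means the same loop could serve both partial and full sanitization.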

**Get all texts from vulgar tweets.**