# Data Preparation


In [1]:
%load_ext autoreload
%autoreload 2

## Data Cleaning

The data exploration showed that the data set does not fully meet the requirements and contains several errors. The data cleaning process therefore adjusts the data set and fixes the identified errors. The goal is to prepare the data set for the preprocessing pipeline.

#### A. Data Cleaning 

The following code runs the entire data cleaning process. Alternatively, the data cleaning can be carried out step by step in the cells further down.

In [3]:
%%script false
from src.features.data_cleaning import CleaningPipeline

# run data cleaning
pipeline = CleaningPipeline(path='../data/raw/twitter_tweets_raw.pkl')
df = pipeline.run()

# save cleaned data set
df.to_feather('../data/intermediate/twitter_tweets_intermediate.feather')

Couldn't find program: 'false'


#### 0. Load the data set

In [4]:
from src.data.nitter_scraper_standalone_v2 import Tweet, TweetScraper
import pandas as pd

# load tweet objects
list_of_tweets = TweetScraper.load_collected_tweets(path='../data/raw/twitter_tweets_raw.pkl')

# transform into a dataframe
tweet_records = [{"url": tweet.url, "date": tweet.date, "rawContent": tweet.rawContent} for tweet in list_of_tweets]
df = pd.DataFrame(tweet_records)

#### 1. Remove duplicates

In [5]:
df.drop_duplicates(subset=['rawContent'], inplace=True)

# check for success
if df['rawContent'].duplicated().any():
    print(f"{len(df[df['rawContent'].duplicated()])} duplicates found.")

#### 2. Identify & remove non-English posts

In [6]:
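from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from tqdm import tqdm

# define function to identify language of tweet
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # raised when the text has no detectable features, e.g. only URLs or emojis
        return None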

tqdm.pandas()
# determine the language for each tweet
df['lang'] = df['rawContent'].progress_apply(detect_language)

# identify and delete all non-english tweets
non_english_posts = df.query('lang != "en"')
df.drop(index=non_english_posts.index, inplace=True)

# check for success
if not df['lang'].eq('en').all():
    print(df.query('lang != "en"'))

100%|█████████████████████████████████████████████████████████████████████████| 854238/854238 [57:31<00:00, 247.47it/s]


#### 3. Parse the date column

In [8]:
import datetime

df['date'] = df['date'].apply(lambda x: datetime.datetime.strptime(x, "%b %d, %Y · %I:%M %p %Z"))
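To illustrate the format string, the sketch below parses a single timestamp. The sample value is an assumption inferred from the format string and is not taken from the data set. Note that `%Z` only consumes the zone name, so the result is a naive `datetime`, and that `%b` and `%p` depend on the active locale.

```python
import datetime

# Format string used in the cell above
DATE_FORMAT = "%b %d, %Y · %I:%M %p %Z"

# Hypothetical raw value shaped the way the format string expects
raw_date = "Mar 30, 2023 · 11:59 PM UTC"

parsed = datetime.datetime.strptime(raw_date, DATE_FORMAT)
print(parsed)         # the 12-hour clock value is converted to 24-hour time
print(parsed.tzinfo)  # %Z matches the zone name but does not attach tzinfo
```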

#### 4. Drop irrelevant data

In [9]:
df.drop(columns=['lang'], inplace=True)

#### 5. Save the cleaned data set

In [10]:
# feather requires a default RangeIndex, so renumber the rows after the deletions above
df.reset_index(drop=True, inplace=True)
df.to_feather('../data/intermediate/twitter_tweets_intermediate.feather')

---

## Preprocessing Pipeline

After the data set has been cleaned, the text data is passed through a preprocessing pipeline. This pipeline is decisive for the later modeling results because it ensures that the text is in a form the model can work with. The pipeline, which was designed specifically for the requirements of this project, can be visualized as follows:

TODO: FIGURE OF THE PREPROCESSING PIPELINE
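The idea of a pipeline as an ordered chain of text transformations can be sketched in a few lines. This is a simplified illustration using only the standard library. It is not the implementation of `DefaultPipeline`, and the step functions, `STEPS`, `run_pipeline`, and the sample tweet are hypothetical stand-ins.

```python
import re
from functools import reduce

def remove_urls(text):
    # simplified stand-in: strips only tokens that start with http:// or https://
    return re.sub(r'https?://\S+', '', text)

def remove_mentions(text):
    return re.sub(r'@\w+', '', text)

def tokenize(text):
    # runs of word characters or runs of punctuation become tokens
    return re.findall(r'\w+|[^\w\s]+', text)

def lowercase(tokens):
    return [token.lower() for token in tokens]

def drop_non_words(tokens):
    # keep alphabetic tokens only (stand-in for punctuation and number removal)
    return [token for token in tokens if token.isalpha()]

# order matters: string-level steps run before tokenization, token-level steps after
STEPS = [remove_urls, remove_mentions, tokenize, lowercase, drop_non_words]

def run_pipeline(text, steps=STEPS):
    # feed the output of each step into the next one
    return reduce(lambda value, step: step(value), steps, text)

result = run_pipeline("Thanks @alice! New CVE write-up: https://t.co/abc123")
print(result)
```

Keeping the steps in a plain list makes it easy to reorder, remove, or swap individual steps while experimenting.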

#### A. Preprocessing Pipeline

The following code runs the entire preprocessing pipeline. Alternatively, the pipeline can be executed step by step in the cells further down.

In [1]:
%%script false
from src.features.preprocessing_pipeline import DefaultPipeline
import pandas as pd

# run preprocessing pipeline
pipeline = DefaultPipeline(dataframe=pd.read_feather('../data/intermediate/twitter_tweets_intermediate.feather'))
df = pipeline.run()

# save preprocessed data set
df.to_feather('../data/processed/twitter_tweets_processed.feather')
df.to_csv('../data/processed/twitter_tweets_processed.csv', index=False)

2023-05-23 14:29:14,545 - INFO - initialize pipeline and download required nltk packages...
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
2023-05-23 14:29:14,818 - INFO - remove urls...
100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:11<00:00, 76401.46it/s]
2023-05-23 14:29:26,017 - INFO - remove twitter user mentions...
100%|██████████████████████████████████████████████████████████████████████| 851926/851926 [00:02<00:00, 369182.39it/s]
2023-05-23 14:29:28,369 - INFO - fix contractions...
100%|█████████████████████████████

#### 0. Load packages & data set

In [27]:
import pandas as pd
import contractions
import nltk
import string
import emoji
import re

from tqdm import tqdm

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer

# download required nltk packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# enable progress bar
tqdm.pandas()

# load dataframe
df = pd.read_feather('../data/intermediate/twitter_tweets_intermediate.feather')
df.head(4)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lukas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | url | date | rawContent |
|---|---|---|---|
| 0 | https://twitter.com/_Bob_S/status/164159110508... | 2023-03-30 23:59:00 | Govt IT security isnt 'a nice thing to do': it... |
| 1 | https://twitter.com/WYSIWYGVentures/status/164... | 2023-03-30 23:59:00 | ISC West 2023: Cyberattackers Are Targeting Ph... |
| 2 | https://twitter.com/HackerAran7/status/1641591... | 2023-03-30 23:59:00 | What’s the hack. #stem #science #stemeducation... |
| 3 | https://twitter.com/bytefeedai/status/16415909... | 2023-03-30 23:59:00 | BuzzFeed Is Using AI To Write SEO-Bait Travel ... |


#### 1. Remove URLs

In [28]:
# regex pattern for url detection, compiled once instead of on every call
url_pattern = re.compile(r'\b(?:https?://)?(?:[a-z]+\.[a-z]+\.[a-z]+|[a-z]+\.[a-z]+(?:/[^\s]*)?)\b')

def remove_urls(text):
    # remove url matches from the text
    return url_pattern.sub('', text)

df['preprocessed_text'] = df['rawContent'].progress_apply(remove_urls)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:12<00:00, 69055.39it/s]
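To make the behavior of the URL pattern concrete, the sketch below applies it to three made-up strings. It shows that the pattern also matches any lowercase `word.word` pair and can leave the path of a three-part host name behind. `SCHEME_ONLY_PATTERN` is a hypothetical, stricter alternative added for comparison. It is not part of the project code and would in turn miss URLs written without `http://` or `https://`.

```python
import re

# Pattern copied from the cell above
URL_PATTERN = re.compile(r'\b(?:https?://)?(?:[a-z]+\.[a-z]+\.[a-z]+|[a-z]+\.[a-z]+(?:/[^\s]*)?)\b')

# Hypothetical alternative: only strip tokens that carry an explicit scheme
SCHEME_ONLY_PATTERN = re.compile(r'https?://\S+')

samples = [
    "Read more at https://t.co/abc123 now",  # shortened link with a path
    "see https://www.example.com/path",      # three-part host followed by a path
    "patch now.then reboot",                 # missing space after a full stop
]

for sample in samples:
    print(repr(URL_PATTERN.sub('', sample)), '|', repr(SCHEME_ONLY_PATTERN.sub('', sample)))
```

Which trade-off is acceptable depends on how often tweets in the data set contain scheme-less links compared with missing spaces after full stops.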


#### 2. Remove mentions

In [29]:
# regex pattern for user mentions, compiled once instead of on every call
mention_pattern = re.compile(r'@\w+')

def remove_mentions(text):
    # remove user mentions from the text
    return mention_pattern.sub('', text)

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(remove_mentions)

100%|██████████████████████████████████████████████████████████████████████| 851926/851926 [00:02<00:00, 342273.36it/s]
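A small sketch of what the mention pattern removes, using made-up input. The second call shows a side effect worth knowing: the pattern is not anchored to a word start, so the `@domain` part of an e-mail address is stripped as well. If URLs are removed first, as in this notebook, the domain of an address may already be gone at this point.

```python
import re

MENTION_PATTERN = re.compile(r'@\w+')

def remove_mentions(text):
    # delete every "@" followed by one or more word characters
    return MENTION_PATTERN.sub('', text)

print(repr(remove_mentions("Thanks @alice and @bob_99!")))
print(repr(remove_mentions("Contact: bob@example.com")))
```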


#### 3. Expand contractions

In [30]:
def fix_contractions(text):
    try:
        return contractions.fix(text)
    except IndexError: # error should not appear
        return text

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(fix_contractions)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:10<00:00, 79503.52it/s]


#### 4. Tokenize

In [31]:
# initialize the tokenizer
tokenizer = WordPunctTokenizer()

def tokenize_text(text):
    return tokenizer.tokenize(text)

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(tokenize_text)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:16<00:00, 53018.49it/s]
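NLTK documents `WordPunctTokenizer` as a regular-expression tokenizer based on the pattern `\w+|[^\w\s]+`. The sketch below reproduces that rule with the standard library to show what kind of tokens the later steps receive. The function name and the sample sentence are made up for illustration.

```python
import re

# same splitting rule as WordPunctTokenizer: runs of word characters
# or runs of non-word, non-space characters
WORD_PUNCT_PATTERN = re.compile(r'\w+|[^\w\s]+')

def tokenize_text(text):
    return WORD_PUNCT_PATTERN.findall(text)

tokens = tokenize_text("Do not click: #phishing!!")
print(tokens)
```

Adjacent punctuation characters end up in a single token such as `!!`, which matters for the punctuation filter in step 6, and a hashtag is split into `#` and the word that follows it.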


#### 5. Convert tokens to lowercase

In [32]:
def lowercase(tokens):
    return [token.lower() for token in tokens]

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(lowercase)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:12<00:00, 67626.35it/s]


#### 6. Remove punctuation

In [33]:
# set of punctuation characters, extended by typographic quotes, bullets and ellipses
punct = set(string.punctuation + "’‘“”•…�")

def remove_punct(tokens):
    # drop tokens that consist solely of punctuation characters (e.g. "!!", "...", "-#");
    # `token not in <str>` would be a substring test and would let "!!" through
    return [token for token in tokens if not all(char in punct for char in token)]

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(remove_punct)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:15<00:00, 56254.17it/s]
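Because `in` on a Python string is a substring test, a punctuation filter behaves differently depending on whether the punctuation is stored as one string or checked character by character. The sketch below contrasts both variants on made-up tokens, using only `string.punctuation` for brevity.

```python
import string

tokens = ["patch", "!", "!!", "()", "...", "now"]

# Substring test: a token is dropped only if it occurs as a contiguous
# piece of string.punctuation ("()" does, "!!" and "..." do not)
kept_by_substring = [t for t in tokens if t not in string.punctuation]

# Per-character test: a token is dropped if every character is punctuation
punct_chars = set(string.punctuation)
kept_by_chars = [t for t in tokens if not all(c in punct_chars for c in t)]

print(kept_by_substring)
print(kept_by_chars)
```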


#### 7. Remove numeric tokens

In [34]:
def remove_numerics(tokens):
    return [token for token in tokens if not token.isdigit()]

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(remove_numerics)

100%|██████████████████████████████████████████████████████████████████████| 851926/851926 [00:04<00:00, 202292.88it/s]
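A short illustration of the filter on made-up tokens: `str.isdigit()` is only true when every character is a digit, so mixed tokens such as `0day` or `log4j` survive this step.

```python
def remove_numerics(tokens):
    # drop tokens made up entirely of digits
    return [token for token in tokens if not token.isdigit()]

filtered = remove_numerics(["cve", "2023", "0day", "log4j", "443"])
print(filtered)
```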


#### 8. Remove stop words

In [35]:
# define set of stopwords (set membership is much faster than scanning a list per token)
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stop_words and len(token) > 1]

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(remove_stopwords)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:39<00:00, 21569.20it/s]
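The filter applies two conditions at once: the token must not be a stop word, and it must be longer than one character. A minimal sketch with a hypothetical five-word stop list, since the real list comes from NLTK:

```python
# hypothetical subset for illustration; the notebook uses NLTK's English stop word list
STOP_WORDS = {"the", "is", "a", "to", "not"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # keep tokens that are not stop words and have more than one character
    return [token for token in tokens if token not in stop_words and len(token) > 1]

filtered = remove_stopwords(["the", "patch", "is", "not", "x", "released"])
print(filtered)
```

Storing the stop words in a set keeps each lookup at roughly constant cost, which adds up over several hundred thousand tweets.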


#### 9. Remove emojis

In [36]:
def remove_emoji(tokens):
    return [token for token in tokens if not any(char in emoji.EMOJI_DATA for char in token)]

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(remove_emoji)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:18<00:00, 46123.68it/s]


#### 10. Lemmatize

In [37]:
# initialization of the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize(tokens):
    # without a POS tag, lemmatize() treats every token as a noun
    return [lemmatizer.lemmatize(token) for token in tokens]

df['preprocessed_text'] = df['preprocessed_text'].progress_apply(lemmatize)

100%|███████████████████████████████████████████████████████████████████████| 851926/851926 [00:54<00:00, 15540.25it/s]


#### 11. Save the preprocessed data set

In [38]:
df.to_feather('../data/processed/twitter_tweets_processed.feather')
df.to_csv('../data/processed/twitter_tweets_processed.csv', index=False)

---