# Telegram Mining (Notebook: Author Identification)

**Master's Thesis: Social Media & Text Mining Using Telegram as an Example**

Computer Science (Master's)

Maximilian Bundscherer

**Note**: The sections ``Initialize the working environment`` and ``Load and prepare chats`` are already described in detail in the notebook ``Telegram.ipynb`` and are therefore skipped here.

## Initialize the working environment

See the description in the notebook ``Telegram.ipynb``.

In [None]:
# Import default libs
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import re
import os
import sys
import demjson
import requests
import networkx as nx
import warnings
from pprint import pprint
from urllib.parse import urlparse
from collections import Counter
from pathlib import Path
from lxml.html import fromstring

In [None]:
# Hide DeprecationWarning
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
!{sys.executable} -m pip install demoji

In [None]:
import nltk
import demoji

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
dictGloStopwatches = dict()

# Start timer (for reporting)
def gloStartStopwatch(key):
    print("[Stopwatch started >>" + str(key) + "<<]")
    dictGloStopwatches[key] = time.time()

# Stop timer (for reporting)
def gloStopStopwatch(key):
    endTime     = time.time()
    startTime   = dictGloStopwatches[key]
    print("[Stopwatch stopped >>" + str(key) + "<< (" + '{:5.3f}s'.format(endTime-startTime) + ")]")

In [None]:
nltk.download("stopwords")

In [None]:
demoji.download_codes()

In [None]:
# Show all columns (pandas hides columns by default)
pd.set_option('display.max_columns', None)

# Set plot style
plt.style.use('ggplot')

font = {'size'   : 13}

plt.rc('font', **font)

In [None]:
dir_var                 = "./work/notebooks/"
dir_var_output          = dir_var + "output/"
dir_var_pandas_cache    = dir_var + "cache/pandas/"

In [None]:
def gloReplaceGermanChars(inputText):

    inputText = inputText.replace("ö", "oe")
    inputText = inputText.replace("ü", "ue")
    inputText = inputText.replace("ä", "ae")

    inputText = inputText.replace("Ö", "Oe")
    inputText = inputText.replace("Ü", "Ue")
    inputText = inputText.replace("Ä", "Ae")

    inputText = inputText.replace("ß", "ss")
    
    return inputText

In [None]:
# Remove unsafe characters (emojis, umlauts, everything except letters, digits and whitespace)
def gloConvertToSafeString(text):
    text = demoji.replace(text, "")
    text = gloReplaceGermanChars(text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Generate a safe chat name, truncated to the first 30 characters
def gloConvertToSafeChatName(chatName):
    chatName = gloConvertToSafeString(chatName)
    return chatName[:30]

In [None]:
def gloGetStopWordsList(filterList):

    stopWordsList = []

    deWordsList = nltk.corpus.stopwords.words('german')

    enWordsList = nltk.corpus.stopwords.words('english')

    # Read additional stop words (one per line, blank lines are skipped)
    aStopwords = []
    with open(dir_var + "additionalStopwords.txt") as file:
        for line in file:
            line = line.strip()
            if line != "":
                aStopwords.append(line)

    # Transliterate umlauts in the custom, German and additional stop words
    for s in filterList:
        s = gloReplaceGermanChars(s)
        stopWordsList.append(s)

    for s in deWordsList:
        s = gloReplaceGermanChars(s)
        stopWordsList.append(s)

    for s in enWordsList:
        stopWordsList.append(s)

    for s in aStopwords:
        s = gloReplaceGermanChars(s)
        stopWordsList.append(s)

    return stopWordsList

## Load and prepare chats

See the description in the notebook ``Telegram.ipynb``.

## Author identification

See the description in ``Thesis.pdf``.

### Load chats

In [None]:
C_USE_CACHE_FILE = "final-run-24-03.pkl"

#### Load from cache

In [None]:
gloStartStopwatch("Read cache")
dfAllDataMessages = pd.read_pickle(dir_var_pandas_cache + C_USE_CACHE_FILE)
gloStopStopwatch("Read cache")

#### Filter and display

In [None]:
# Keep only the four chat exports used for author identification
dfAllDataMessages = dfAllDataMessages[dfAllDataMessages['ftFilePath'].isin(
    [
        "DS-05-01-2021/ChatExport_2021-01-05-hildmann",
        "DS-05-01-2021/ChatExport_2021-01-05-janich",
        "DS-05-01-2021/ChatExport_2021-01-05-xavier",
        "DS-05-01-2021/ChatExport_2021-01-05-evaherman"
    ]
)]

# Keep only valid, non-empty text messages with a text length greater than 5
dfAllDataMessages = dfAllDataMessages[dfAllDataMessages.ftQrIsValidText == True]
dfAllDataMessages = dfAllDataMessages[dfAllDataMessages.ftTdCleanText != ""]
# Copy the filtered frame so that the assignment below does not trigger a SettingWithCopyWarning
dfAllDataMessages = dfAllDataMessages[dfAllDataMessages.ftTdTextLength > 5].copy()

# Convert author names to safe strings
dfAllDataMessages["from"] = dfAllDataMessages["from"].apply(gloConvertToSafeChatName)

In [None]:
dfAllDataMessages.head(3)

### Prepare chats

#### Test CountVectorizer

In [None]:
test = CountVectorizer(ngram_range=(1, 3))

In [None]:
d1 = test.fit_transform(["wordA wordB wordC wordA wordB wordC"])
d1.toarray()

In [None]:
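# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
test.get_feature_names_out()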

In [None]:
d2 = test.transform(["wordA wordB"])
d2.toarray()

In [None]:
test.get_params()
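
To make the feature space produced by `CountVectorizer(ngram_range=(1, 3))` easier to follow, the sketch below reproduces the counting step in plain Python. It assumes scikit-learn's default behaviour (lowercasing, tokens of at least two word characters); the helper name `countNgrams` is hypothetical and is neither part of this notebook nor of scikit-learn.

```python
import re
from collections import Counter

def countNgrams(text, ngramRange=(1, 3)):
    """Count word n-grams roughly the way CountVectorizer does by default."""
    # Assumed defaults: lowercase the text, keep tokens with at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    minN, maxN = ngramRange
    for n in range(minN, maxN + 1):
        # Slide a window of n tokens over the message
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = countNgrams("wordA wordB wordC wordA wordB wordC")

# List the features alphabetically, like the vectorizer's vocabulary
for feature in sorted(counts):
    print(feature, counts[feature])
```

The six-token test message yields nine distinct features (three unigrams, three bigrams, three trigrams), which is why the feature matrix grows quickly once real messages are vectorized.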

#### Test TfidfTransformer

In [None]:
test = TfidfTransformer()

In [None]:
test.fit_transform(d1).toarray()

In [None]:
test.transform(d2).toarray()
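
The weighting applied by `TfidfTransformer` can be sketched in plain Python as well. The example assumes the transformer's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per row); `tfidfRows` is a hypothetical helper, and unlike the notebook it computes the IDF weights on the same rows it transforms.

```python
import math

def tfidfRows(countRows):
    """TF-IDF with smoothed IDF and L2 row normalization (assumed defaults)."""
    nDocs = len(countRows)
    nFeatures = len(countRows[0])
    # Document frequency: number of documents in which each feature occurs
    docFreq = [sum(1 for row in countRows if row[j] > 0) for j in range(nFeatures)]
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + nDocs) / (1 + d)) + 1 for d in docFreq]
    result = []
    for row in countRows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 normalization; all-zero rows are left unchanged
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

# Two toy documents over two features; feature 0 occurs in both documents
rows = tfidfRows([[1, 1], [1, 0]])
for row in rows:
    print([round(v, 3) for v in row])
```

Feature 1 occurs in only one document and therefore receives a higher weight than feature 0 in the first row, although both have the same raw count.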

#### Features and balancing of authors

In [None]:
# Work on a copy so that the new columns do not modify dfAllDataMessages
targetDf = dfAllDataMessages.copy()

targetDf['clText']    = targetDf['ftTdCleanText']
targetDf['clFrom']    = targetDf['from']
targetDf['clFromId']  = targetDf['from'].factorize()[0]

In [None]:
_ = targetDf['clFrom'].value_counts().plot.bar()

In [None]:
targetDf['clText'][:5]

In [None]:
targetDf['clFromId'].value_counts()

In [None]:
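# Down-sample each author to at most k messages to balance the classes
def getSamples(df, k=9103):
    if len(df) < k:
        return df
    return df.sample(k, random_state=42)  # fixed seed for reproducibility

# Same logic as groupby(...).apply(getSamples), but independent of how
# pandas handles the grouping column inside apply
targetDf = pd.concat(
    [getSamples(group) for _, group in targetDf.groupby('clFromId')]
).reset_index(drop=True)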

In [None]:
_ = targetDf['clFrom'].value_counts().plot.bar()
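
The effect of the down-sampling step can be checked on a small toy frame. The following is a minimal sketch with made-up data; `toyDf`, `downsample` and the cap `k=3` are hypothetical and only illustrate the idea used above.

```python
import pandas as pd

# Toy data (hypothetical): author "a" has far more messages than "b" and "c"
toyDf = pd.DataFrame({
    "clFrom": ["a"] * 6 + ["b"] * 3 + ["c"] * 2,
    "clText": ["message " + str(i) for i in range(11)],
})

def downsample(df, column, k, seed=42):
    """Keep at most k randomly chosen rows per value of `column`."""
    parts = []
    for _, group in df.groupby(column):
        # Small groups are kept completely, large groups are capped at k rows
        parts.append(group if len(group) <= k else group.sample(k, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

balancedDf = downsample(toyDf, "clFrom", k=3)
print(balancedDf["clFrom"].value_counts().to_dict())
```

Down-sampling only caps the large classes: an author with fewer than `k` messages (here `c`) stays under-represented, which is why the bar plot is worth checking after this step.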

#### Author dictionary

In [None]:
dfFromId            = targetDf[['clFrom', 'clFromId']].drop_duplicates().sort_values('clFromId')

dictFrom_to_id      = dict(dfFromId.values)
dictId_to_from      = dict(dfFromId[['clFromId', 'clFrom']].values)

In [None]:
dictId_to_from

### Train and evaluate

In [None]:
X_train, X_test, y_train, y_test = train_test_split(targetDf['clText'], targetDf['clFrom'], random_state = 42, test_size=0.20, stratify=targetDf['clFrom'])

print("Train size:\t" + str(len(X_train.index)))
print("Test size:\t" + str(len(X_test.index)))

In [None]:
_ = y_train.value_counts().plot.bar()

In [None]:
_ = y_test.value_counts().plot.bar()

In [None]:
gloStartStopwatch("Transform messages")

count_vect          = CountVectorizer(ngram_range=(1, 3))
tfidf_transformer   = TfidfTransformer()

# Transform and fit train
X_train_counts      = count_vect.fit_transform(X_train)
X_train_tfidf       = tfidf_transformer.fit_transform(X_train_counts)

# Transform test
X_test_counts       = count_vect.transform(X_test)
X_test_tfidf        = tfidf_transformer.transform(X_test_counts)

gloStopStopwatch("Transform messages")
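
As an alternative to keeping `count_vect` and `tfidf_transformer` as separate objects, the two transformation steps and a classifier can be bundled in a scikit-learn `Pipeline`. The sketch below uses made-up toy messages and `MultinomialNB`; it is not the setup used in this notebook, only an illustration of how fitting on training data and transforming new text stay consistent.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy messages and authors (hypothetical, not taken from the data set)
toyTexts   = ["folge meinem kanal", "folge dem kanal jetzt",
              "liebe gruesse eva", "liebe freunde gruesse"]
toyAuthors = ["authorA", "authorA", "authorB", "authorB"]

pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 3))),
    ("tfidf",  TfidfTransformer()),
    ("model",  MultinomialNB()),
])

# fit() learns the vocabulary, the IDF weights and the classifier in one call
pipeline.fit(toyTexts, toyAuthors)

# predict() accepts raw strings; the fitted transformers are applied automatically
prediction = pipeline.predict(["folge kanal"])
print(prediction)
```

With a pipeline, the vocabulary and IDF weights are always learned from the training data only, so the test data cannot leak into the features by accident.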

In [None]:
def trainAndEvalModel(model, outputFilename):
   
    gloStartStopwatch("Train now model " + str(model))
    model.fit(X_train_tfidf, y_train)
    gloStopStopwatch("Train now model " + str(model))

    searchStrings = ["Folge Attila Hildmann", "Liebe Eva", "Premium Kanal", "OneLove"]

    for sS in searchStrings:

        sS = str(sS)
        print()
        print("Who has written '" + sS + "'?")
        t = tfidf_transformer.transform(count_vect.transform([sS]))
        r = model.predict(t)
        print(str(r))

    y_pred_train        = model.predict(X_train_tfidf)
    y_pred_test         = model.predict(X_test_tfidf)

    print()
    print("Train Score:\t"  + str(accuracy_score(y_true=y_train, y_pred=y_pred_train)))
    print("Test Score:\t"   + str(accuracy_score(y_true=y_test, y_pred=y_pred_test)))

    print()
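    print("Confusion Matrix on test:")
    # Pass the label order explicitly so that rows and columns match the tick labels
    # (by default, confusion_matrix sorts the labels, which may differ from the clFromId order)
    labels   = dfFromId.clFrom.values
    conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred_test, labels=labels)
    
    fig, ax  = plt.subplots(figsize=(9,9))

    plt.title(str(model).replace("()", ""))
    
    sns.heatmap(conf_mat, annot=True, fmt='d',
                xticklabels=labels, yticklabels=labels)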
                
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    
    plt.xticks(rotation=45)
    plt.yticks(rotation=45)

    if(outputFilename != ""):
        plt.savefig(dir_var_output + outputFilename)

    plt.show()

#### LinearSVC

In [None]:
trainAndEvalModel(LinearSVC(), "class-linearsvc.svg")

#### MultinomialNB

In [None]:
trainAndEvalModel(MultinomialNB(), "class-multinomialnb.svg")

#### LogisticRegression

In [None]:
trainAndEvalModel(LogisticRegression(), "class-logisticregression.svg")

#### MLPClassifier

In [None]:
trainAndEvalModel(MLPClassifier(), "class-mlp.svg")

#### DecisionTreeClassifier

In [None]:
trainAndEvalModel(DecisionTreeClassifier(), "class-decisiontree.svg")

#### RandomForestClassifier

In [None]:
trainAndEvalModel(RandomForestClassifier(), "class-randomforest.svg")

#### DummyClassifier

In [None]:
trainAndEvalModel(DummyClassifier(strategy="uniform"), "class-dummy.svg")
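
Accuracy alone does not show which authors are confused with each other. The confusion matrix plotted by `trainAndEvalModel` can be reduced to precision and recall per author, as in the following sketch. The helper `perClassScores` and the matrix values are hypothetical; the numbers are made up and are not results from the thesis.

```python
def perClassScores(confMat, labels):
    """Derive precision and recall per class from a confusion matrix.

    Rows are actual classes, columns are predicted classes
    (the layout returned by sklearn.metrics.confusion_matrix).
    """
    scores = {}
    for i, label in enumerate(labels):
        truePositives  = confMat[i][i]
        actualTotal    = sum(confMat[i])                 # row sum
        predictedTotal = sum(row[i] for row in confMat)  # column sum
        precision = truePositives / predictedTotal if predictedTotal else 0.0
        recall    = truePositives / actualTotal if actualTotal else 0.0
        scores[label] = (precision, recall)
    return scores

# Hypothetical 3x3 matrix for three authors (made-up numbers)
exampleMatrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
scores = perClassScores(exampleMatrix, ["authorA", "authorB", "authorC"])
for label, (precision, recall) in scores.items():
    print(label, round(precision, 2), round(recall, 2))
```

Comparing these per-author values against the `DummyClassifier` baseline shows whether a model has learned something about every author or only about the easiest ones.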

## Further reading / inspiration

- https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f
- https://towardsai.net/p/data-mining/text-mining-in-python-steps-and-examples-78b3f8fd913b
- https://towardsdatascience.com/text-mining-for-dummies-text-classification-with-python-98e47c3a9deb
- https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
- https://realpython.com/python-keras-text-classification/
- https://www.tidytextmining.com/ngrams.html
- http://seaborn.pydata.org/tutorial/categorical.html?highlight=bar%20plot
- https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166
- https://www.kirenz.com/post/2019-08-13-network_analysis/
- https://tgstat.com
- https://huggingface.co/bert-base-german-cased
- https://github.com/sekhansen/text-mining-tutorial/tree/master
- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
- https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
- https://github.com/sekhansen/text-mining-tutorial/blob/master/tutorial_notebook.ipynb
- https://textmining.wp.hs-hannover.de/Preprocessing.html
- https://likegeeks.com/nlp-tutorial-using-python-nltk/
- https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
- https://data-flair.training/blogs/nltk-python-tutorial/
- https://github.com/expectocode/telegram-analysis
- https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame