# Telegram Mining

**Master's Thesis: Social Media & Text Mining Using Telegram as an Example**

Computer Science (Master's Program)

Maximilian Bundscherer

Description TBD.

## Initializing the Working Environment

### Jupyter Notebook Parameters

The following parameters control how the notebook runs:

| Identifier | Data type | Example | Description |
|---|---|---|---|
|``C_LOCAL``|``bool``|``True``|Set to ``True`` if the kernel is accessed through an external connection; set to ``False`` when working in the browser. Affects the working-directory paths.|
|``C_SHORT_RUN``|``bool``|``False``|Set to ``True`` to perform a reduced run. Shortens development time locally.|
|``C_NUMBER_SAMPLES``|``int``|``1000``|Only applies if ``C_SHORT_RUN`` is ``True``. To shorten development time further, only the first ``C_NUMBER_SAMPLES`` messages of each chat are processed.|
|``C_RESOLVE_NEW_URLS``|``bool``|``True``|Whether YouTube titles and web page titles should be resolved during this run.|
|``C_LOAD_DATASETS``|``string[]``|``["dataSet0"]``|Which datasets to load.|
|``C_LOAD_TRANSFORMERS``|``bool``|``True``|Whether to load the Transformers pipelines. Loading can take a long time, so the runs skip all Transformers steps when this is ``False``.|
|``C_TRANSFORMERS_DATASETS``|``string[]``|``["dataSet0"]``|Only applies if ``C_LOAD_TRANSFORMERS`` is ``True``. Defines which datasets the Transformers are applied to.|
|``C_TIME_PLOT_FREQ``|``string``|``1D``|Defines the time interval used for the time plots further below.|
|``C_USE_CACHE_FILE``|``string``|``file.pkl``|Set this to load an existing DataFrame cache. Defines the file name; leave empty to process the chats from scratch.|
|``C_NEW_CACHE_FILE``|``string``|``file.pkl``|Set this to write a new DataFrame cache. Defines the file name; only used if ``C_USE_CACHE_FILE`` is empty.|

In [None]:
C_LOCAL                 = False

C_SHORT_RUN             = False
C_NUMBER_SAMPLES        = 500

C_RESOLVE_NEW_URLS      = False

"""
Ava:    ["dataSet0", "dataSet1", "dataSet1a", "dataSet2"]
Htdocs: ["dataSet0", "dataSet1a", "dataSet2"]
Req:    ["dataSet0"]
"""
C_LOAD_DATASETS         = ["dataSet0", "dataSet1", "dataSet1a", "dataSet2"]

C_LOAD_TRANSFORMERS         = False
C_TRANSFORMERS_DATASETS     = ["dataSet0"]

C_TIME_PLOT_FREQ        = "6M"

"""
Please set only one value!
e.g.
# - long-run-server-28-01.pkl   (Long run, with hf, with htdocs-datasets, updated with sen-pipe-2)
# - long-run-server-07-02.pkl   (Long run, with hf, with all datasets, updated with sen-pipe-2)
# - local-run-28-01.pkl         (Short run, with hf, with htdocs-datasets, updated with sen-pipe-2)
# - test.pkl                    (Test file)
"""
C_USE_CACHE_FILE        = "final-run-24-03.pkl"
C_NEW_CACHE_FILE        = ""
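
The interplay of the two cache parameters can be summarized in a small sketch. The helper `describeCacheMode` is hypothetical and not part of the notebook; it only mirrors the decision the notebook makes later when it either loads the DataFrame cache or builds it.

```python
def describeCacheMode(useCacheFile, newCacheFile):
    """Return which cache action the two parameters imply (illustrative only)."""
    if useCacheFile != "":
        # An existing cache file takes precedence; the new-cache name is ignored
        return "load " + useCacheFile
    if newCacheFile != "":
        # Nothing to load: process all chats, then persist the result
        return "build and write " + newCacheFile
    # Neither is set: process all chats without persisting anything
    return "build only"

print(describeCacheMode("final-run-24-03.pkl", ""))
print(describeCacheMode("", "test.pkl"))
print(describeCacheMode("", ""))
```

With the values set in the cell above, the first case applies: the run loads `final-run-24-03.pkl` and skips processing.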

### Preparing the Working Environment

#### Loading Libraries and Dependencies

##### Dependencies from the Docker Image, I/O Libraries, and Others

In [None]:
# Import default libs
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time
import re
import os
import sys
import demjson
import requests
import networkx as nx
import warnings
from pprint import pprint
from urllib.parse import urlparse
from collections import Counter
from pathlib import Path
from lxml.html import fromstring

##### Installing Additional Dependencies

In [None]:
!{sys.executable} -m pip install demoji
!{sys.executable} -m pip install HanTa
!{sys.executable} -m pip install textblob-de

##### Importing Additional Dependencies

In [None]:
import demoji

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess

#import pyLDAvis.gensim
import pickle 
import pyLDAvis

import nltk
from nltk.util import ngrams

from wordcloud import WordCloud

import torch

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

from HanTa import HanoverTagger as ht

from textblob_de import TextBlobDE as TextBlob

In [None]:
# Init hanoverTagger (https://github.com/wartaal/HanTa/blob/master/Demo.ipynb)
hanoverTagger = ht.HanoverTagger('morphmodel_ger.pgz')

In [None]:
# Suppress DeprecationWarnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

#### Providing a Stopwatch

In [None]:
dictGloStopwatches = dict()

# Start timer (for reporting)
def gloStartStopwatch(key):
    print("[Stopwatch started >>" + str(key) + "<<]")
    dictGloStopwatches[key] = time.time()

# Stop timer (for reporting)
def gloStopStopwatch(key):
    endTime     = time.time()
    startTime   = dictGloStopwatches[key]
    print("[Stopwatch stopped >>" + str(key) + "<< (" + '{:5.3f}s'.format(endTime-startTime) + ")]")
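
As an alternative to paired start/stop calls, the same reporting can be sketched as a context manager. This is not the notebook's implementation; the name `stopwatch` is an assumption, and `time.perf_counter` is used instead of `time.time` because it is a monotonic clock intended for measuring durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(key):
    """Time the enclosed block and report the duration, even if the block raises."""
    print("[Stopwatch started >>" + str(key) + "<<]")
    startTime = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - startTime
        print("[Stopwatch stopped >>" + str(key) + "<< (" + "{:5.3f}s".format(elapsed) + ")]")

with stopwatch("Demo"):
    total = sum(range(1000))
```

The `finally` clause guarantees that every started stopwatch is also stopped, which the paired functions cannot enforce.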

#### Providing Models and Databases

##### Transformers

In [None]:
dictPipelines = {}

def loadPipelines():

    if(C_LOAD_TRANSFORMERS == False):
        print("Skip loading pipelines")
        return list()

    gloStartStopwatch("Load Pipelines")
    
    gloStartStopwatch("Load ner-xlm-Roberta")
    dictPipelines["ner-xlm-roberta"] = pipeline(
        'ner', 
        model='xlm-roberta-large-finetuned-conll03-german',
        tokenizer='xlm-roberta-large-finetuned-conll03-german'
    )
    gloStopStopwatch("Load ner-xlm-Roberta")

    gloStartStopwatch("Load ner-Bert")
    dictPipelines["ner-bert"] = pipeline(
        'ner', 
        model='fhswf/bert_de_ner',
        tokenizer='fhswf/bert_de_ner'
    )
    gloStopStopwatch("Load ner-Bert")

    gloStartStopwatch("Load sen-Bert")
    dictPipelines["sen-bert"] = pipeline(
        'sentiment-analysis', 
        model='nlptown/bert-base-multilingual-uncased-sentiment',
        tokenizer='nlptown/bert-base-multilingual-uncased-sentiment'
    )
    gloStopStopwatch("Load sen-Bert")

    gloStartStopwatch("Load text-gen-gpt2")
    dictPipelines["text-gen-gpt2"] = pipeline(
        'text-generation', 
        model='dbmdz/german-gpt2',
        tokenizer='dbmdz/german-gpt2'
    )
    gloStopStopwatch("Load text-gen-gpt2")

    gloStartStopwatch("Load text-gen-gpt2-faust")
    dictPipelines["text-gen-gpt2-faust"] = pipeline(
        'text-generation', 
        model='dbmdz/german-gpt2-faust',
        tokenizer='dbmdz/german-gpt2-faust'
    )
    gloStopStopwatch("Load text-gen-gpt2-faust")

    gloStopStopwatch("Load Pipelines")

    return dictPipelines.keys()

pipelineKeys = loadPipelines()
print()
print(str(pipelineKeys))

##### NLTK

In [None]:
nltk.download("stopwords")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Providing the stop-word lists

In [None]:
def gloGetStopWordsList(filterList):

    stopWordsList = []

    # NLTK stop words (German and English)
    deWordsList = nltk.corpus.stopwords.words('german')
    enWordsList = nltk.corpus.stopwords.words('english')

    # Additional project-specific stop words (one per line, blank lines ignored)
    aStopwords = []
    with open(dir_var + "additionalStopwords.txt") as file:
        for line in file:
            line = line.strip()
            if(line != ""):
                aStopwords.append(line)

    # Normalize umlauts and ß so that the list matches the cleaned message text;
    # the English words need no normalization
    for s in filterList:
        stopWordsList.append(gloReplaceGermanChars(s))

    for s in deWordsList:
        stopWordsList.append(gloReplaceGermanChars(s))

    for s in enWordsList:
        stopWordsList.append(s)

    for s in aStopwords:
        stopWordsList.append(gloReplaceGermanChars(s))

    return stopWordsList
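
How such a list is typically applied can be sketched as follows. The tiny hand-written list stands in for the result of `gloGetStopWordsList`; the sample words and the helper `removeStopWords` are illustrative assumptions. Converting the list to a `set` removes duplicates (the combined list may contain some) and makes each membership test constant-time.

```python
def removeStopWords(tokens, stopWordsList):
    stopWords = set(stopWordsList)  # deduplicate and speed up membership tests
    # Compare in lowercase, since the stop-word lists are lowercase
    return [t for t in tokens if t.lower() not in stopWords]

sampleStopWords = ["das", "ist", "ein", "fuer", "the", "is"]
sampleTokens = ["Das", "ist", "ein", "Test", "fuer", "Telegram"]

filteredTokens = removeStopWords(sampleTokens, sampleStopWords)
print(filteredTokens)
```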

##### Demoji

In [None]:
demoji.download_codes()

#### Applying the Configuration to the Environment

##### Applying Environment Settings

In [None]:
# Set tokenizer parallelism 
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# matplotlib output
%matplotlib inline

# Show all columns (pandas hides columns by default)
pd.set_option('display.max_columns', None)

# Set plot style
plt.style.use('ggplot')

font = {'size'   : 13}

plt.rc('font', **font)

##### Defining the Working Directory

In [None]:
# Set env vars
if(C_LOCAL == True):
    dir_var = "./work/notebooks/"
else:
    dir_var = "./"

dir_var_output = dir_var + "output/"

dir_var_cache= dir_var + "cache/"

dir_var_pandas_cache = dir_var + "cache/pandas/"

# Debug output
! echo "- Workdir -"
! ls -al $dir_var

! echo
! echo "- Outputdir -"
! ls -al $dir_var_output

! echo
! echo "- Cachedir -"
! ls -al $dir_var_cache

! echo
! echo "- Pandas -"
! ls -al $dir_var_pandas_cache
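
The `ls` calls above only list the directories; they fail if a directory is missing. A minimal sketch of creating the same layout up front is shown below. It uses a temporary directory as a stand-in for `dir_var`, so running it has no side effects; the variable names are assumptions.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as baseDir:  # stand-in for dir_var
    outputDir = Path(baseDir) / "output"
    pandasCacheDir = Path(baseDir) / "cache" / "pandas"

    for d in (outputDir, pandasCacheDir):
        # parents=True also creates "cache/"; exist_ok=True makes reruns harmless
        d.mkdir(parents=True, exist_ok=True)

    createdDirs = sorted(p.relative_to(baseDir).as_posix() for p in Path(baseDir).rglob("*"))

print(createdDirs)
```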

##### Initializing the Cache for Dynamic Resolution

Cache I/O functions

- toFile
- fromFile
- initFile

In [None]:
# Dict File Cache
dictFileCache = {}

# Write dict to file (CSV)
def gloWriteDictToFile(filename, targetDict):
    # Clear the in-memory cache in place (a plain assignment would only create a local variable)
    dictFileCache.clear()
    d = pd.DataFrame.from_dict(targetDict, orient="index")
    d.to_csv(dir_var_cache + filename, header=False)

# Read dict from file (CSV)
def gloReadDictFromFile(filename):
    # Cache?
    if(filename in dictFileCache):
        return dictFileCache[filename]

    # DataFrame.squeeze("columns") replaces the read_csv argument squeeze=True (removed in pandas 2.0)
    d = pd.read_csv(dir_var_cache + filename, header=None, index_col=0).squeeze("columns")
    retDict = d.to_dict()

    dictFileCache[filename] = retDict #Add to cache

    return retDict

# Init csv file if not exists
def gloInitFileDict(filename):
    f = Path(dir_var_cache + filename)
    if(f.exists() == False):
        print("Init cache file >>" + filename + "<<")
        f.touch()
        gloWriteDictToFile(filename, {"initKey": "initValue"})
    else:
        print("Cache already exists >>" + filename + "<<")

Cache functions

- checkIsCached
- addToCache
- getFromCache

In [None]:
# Check if is already cached
def gloCheckIsAlreadyCached(filename, targetKey):
    targetDict = gloReadDictFromFile(filename)
    if(targetKey in targetDict.keys()):
        return True
    else:
        return False

# Add key to cache
def gloAddToCache(filename, targetKey, targetValue):
    targetDict = gloReadDictFromFile(filename)
    targetDict[targetKey] = targetValue
    gloWriteDictToFile(filename, targetDict)

# Get key from cache
def gloGetCached(filename, targetKey):
    targetDict = gloReadDictFromFile(filename)
    return targetDict[targetKey]
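
The read-modify-write cycle behind these cache functions can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's pandas-based code: the helper names are assumptions, the URL and title are made-up sample data, and the file is written to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

def writeDictToCsv(path, targetDict):
    # One key/value pair per row, no header row
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(targetDict.items())

def readDictFromCsv(path):
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f) if row}

with tempfile.TemporaryDirectory() as tmpDir:
    cachePath = Path(tmpDir) / "resolved-urls.csv"
    writeDictToCsv(cachePath, {"initKey": "initValue"})

    # Read, add one entry, write back (the pattern gloAddToCache follows)
    cache = readDictFromCsv(cachePath)
    cache["https://example.org"] = "Example Domain"
    writeDictToCsv(cachePath, cache)

    roundTrip = readDictFromCsv(cachePath)

print(roundTrip)
```

Note that every added key rewrites the whole file, and all values come back as strings. That is acceptable for a small lookup table but becomes costly as the cache grows.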

Cache I/O initialization

In [None]:
gloInitFileDict("resolved-urls.csv")
gloInitFileDict("resolved-youtube.csv")

## Loading and Preparing Chats

### Preprocessing Functions for German

#### Replacing German-Specific Characters in a String

In [None]:
def gloReplaceGermanChars(inputText):

    inputText = inputText.replace("ö", "oe")
    inputText = inputText.replace("ü", "ue")
    inputText = inputText.replace("ä", "ae")

    inputText = inputText.replace("Ö", "Oe")
    inputText = inputText.replace("Ü", "Ue")
    inputText = inputText.replace("Ä", "Ae")

    inputText = inputText.replace("ß", "ss")
    
    return inputText

gloReplaceGermanChars("ö ä ü Ö Ä Ü")
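
A more compact way to express the same mapping is a translation table. This is an alternative sketch, not the notebook's implementation; `GERMAN_CHAR_MAP` and `replaceGermanCharsTranslate` are hypothetical names.

```python
# Map each German-specific character to its ASCII transliteration
GERMAN_CHAR_MAP = str.maketrans({
    "ö": "oe", "ü": "ue", "ä": "ae",
    "Ö": "Oe", "Ü": "Ue", "Ä": "Ae",
    "ß": "ss",
})

def replaceGermanCharsTranslate(inputText):
    # str.translate applies all replacements in a single pass over the string
    return inputText.translate(GERMAN_CHAR_MAP)

print(replaceGermanCharsTranslate("Größe, Übung, Ärger"))  # Groesse, Uebung, Aerger
```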

#### Tokenization with NLTK

NLTK German tokens

In [None]:
def getTokenFromText(inputText):
    return nltk.word_tokenize(inputText, language="german")

list(getTokenFromText("Hallo Leser! Das ist ein Test."))

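#### Lemmatization & POS Tagging with HanTa

For comparison first: a POS-tagging attempt with NLTK's English tagger. Because `nltk.pos_tag` uses an English model by default, the resulting tags are not reliable for German text.

1. German tokenization (NLTK)
2. English-language POS tagger (NLTK)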

In [None]:
sampleText = "Sie lesen gerade einen kurzen Beispielsatz!"
sampleText

In [None]:
nltk.pos_tag(getTokenFromText(sampleText))

HanTa

In [None]:
def getLemmaAndTaggingFromText(inputText):
    return hanoverTagger.tag_sent(getTokenFromText(inputText))

getLemmaAndTaggingFromText(sampleText)

### Stage 1: Loading Chats

#### Reading the CSV

In [None]:
def readDataFrameFromCSV(filePath):
    return pd.read_csv(dir_var + filePath, sep=";")

dfInputFiles = readDataFrameFromCSV("inputFiles.csv")

#### Filtering and Displaying

In [None]:
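def filterBaseData(df):
    # Keep only the rows whose inputLabel is listed in C_LOAD_DATASETS (in that order).
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    frames = [df[df.inputLabel == dS] for dS in C_LOAD_DATASETS]
    return pd.concat(frames) if frames else pd.DataFrame()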

dfInputFiles = filterBaseData(dfInputFiles)

dfInputFiles

### Stage 2: Preparing Chats

In [None]:
# Convert to DataFrame Meta (Chat Meta)
def convertToDataFrameMeta(filePath):
    dF = pd.read_json(dir_var + "data/" + filePath + "/result.json", encoding='utf-8')
    return dF

dictMeta          = {}   

# Add Key = filePath / Value = DataFrame (Chat Meta)
for fP in dfInputFiles.inputPath:
    dictMeta[fP] = convertToDataFrameMeta(fP)

In [None]:
list(dictMeta.keys())

In [None]:
list(dictMeta["DS-05-01-2021/ChatExport_2021-01-05-hildmann"].keys())

### Stage 3: Preparing Messages

#### Stage 3a: Parsing Messages

##### Parsing Messages

In [None]:
# Convert to DataFrame Messages (Chat Messages)
def convertToDataFrameMessages(filePath):
    dF = pd.json_normalize(dictMeta[filePath].messages)
    return dF

##### Reducing to a Sample (Optional)

- filePath
- chatType

(not described here)

##### Exploring the Message Attributes

In [None]:
convertToDataFrameMessages("DS-05-01-2021/ChatExport_2021-01-05-hildmann").columns

##### Assigning Chat Attributes

- filePath
- chatType

(not described here)

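##### Detecting Formatted Messages

Supports two modes: with `singleMode = True` the text is checked for a single JSON object (`{...}`), with `singleMode = False` for a JSON list (`[...]`).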

In [None]:
def gloCheckIsTextJsonFormatted(text, singleMode):
    textString = str(text)
    if      (singleMode == False and textString.startswith("[") == True and textString.endswith("]") == True):
        return True
    elif    (singleMode == True and textString.startswith("{") == True and textString.endswith("}") == True):
        return True
    else:
        return False
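
The check is a heuristic on the first and last character only. The sketch below restates it compactly under the hypothetical name `looksJsonFormatted` and runs it on a few made-up inputs; the last one shows a false positive for plain text that happens to start with `[` and end with `]`.

```python
def looksJsonFormatted(text, singleMode):
    textString = str(text)
    if singleMode:
        # Single entity: a JSON object
        return textString.startswith("{") and textString.endswith("}")
    # Whole message: a JSON list of strings and entities
    return textString.startswith("[") and textString.endswith("]")

print(looksJsonFormatted("Hallo Welt", singleMode=False))
print(looksJsonFormatted("['Hallo ', {'type': 'bold', 'text': 'Welt'}]", singleMode=False))
print(looksJsonFormatted("{'type': 'bold', 'text': 'Welt'}", singleMode=True))
print(looksJsonFormatted("[Info] Termin [verschoben]", singleMode=False))  # false positive
```

Such false positives are not fatal: when parsing fails later, the extraction step falls back to the unparsed input text.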

#### Stage 3b: Extracting Text and Meta Information

##### Extracting Text and Meta Information

In [None]:
"""
Extract text data from a message

param   ftIsJsonFormatted   Boolean (is text json formatted?)
param   text                String  (text from message)

return (tuple, unpacked as a-g in getExtractedTextDataParam)
a   procText            Plain Text
b   processedURLs       Array of URLs in Text
c   processedHashtags   Array of Hashtags in Text (currently always empty) #TODO: RM
d   processedBolds      Array of Bold Items in Text
e   processedItalics    Array of Italic Items in Text
f   processedUnderlines Array of Underlined Items in Text
g   processedEmails     Array of E-Mails in Text
"""
def extractTextData(ftIsJsonFormatted, text):
    
    # 3 returns in this function...
    
    processedURLs       = list()
    processedHashtags   = list() # TODO: RM
    processedBolds      = list()
    processedItalics    = list()
    processedUnderlines = list()
    processedEmails     = list()
    
    if(ftIsJsonFormatted != True):
        #Is not JSON formatted (return normal text)
        return (text, processedURLs, processedHashtags, processedBolds, processedItalics, processedUnderlines, processedEmails)
    else:
        #Is JSON formatted (try to parse)
        try:
            returnList = []
            jsonList = demjson.decode(str(text), encoding='utf8')

            # Do for each item in list
            for lItem in jsonList:

                messageString = str(lItem)

                isJsonSubString = gloCheckIsTextJsonFormatted(messageString, singleMode = True)

                if(isJsonSubString):
                    # Is Json Sub String
                    subJsonString = demjson.decode(str(messageString), encoding='utf8')
                    subJsonType = subJsonString["type"]

                    if(subJsonType == "bold"):
                        #text included
                        processedBolds.append(subJsonString["text"])
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "italic"):
                        #text included
                        processedItalics.append(subJsonString["text"])
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "underline"):
                        #text included
                        processedUnderlines.append(subJsonString["text"])
                        returnList.append(subJsonString["text"])
                    
                    elif(subJsonType == "email"):
                        #text included
                        processedEmails.append(subJsonString["text"])
                        
                    elif(subJsonType == "text_link"):
                        #text and href included
                        processedURLs.append(subJsonString["href"])
                        #returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "link"):
                        #text included
                        processedURLs.append(subJsonString["text"])
                        
                    elif(subJsonType == "hashtag"):
                        #text included
                        #processedHashtags.append(subJsonString["text"]) # TODO: Refactor: Dont add hashtags here!
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "mention"):
                        #text included
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "mention_name"):
                        #text and user_id included
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "bot_command"):
                        #text included
                        returnList = returnList 
                        
                    elif(subJsonType == "code"):
                        #text included
                        returnList = returnList
                        
                    elif(subJsonType == "phone"):
                        #text included
                        returnList = returnList
                        
                    elif(subJsonType == "strikethrough"):
                        #text included
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "pre"):
                        #text and language included
                        returnList.append(subJsonString["text"])
                        
                    elif(subJsonType == "bank_card"):
                        #text included
                        returnList = returnList
                        
                    else:
                        print("- Error: Unknown json type >>" + str(subJsonType) + "<< (ignore) >>" + str(text) + "<<")

                else:
                    # Is no json formatted sub string (append text)
                    returnList.append(messageString)

            return (''.join(returnList), processedURLs, processedHashtags, processedBolds, processedItalics, processedUnderlines, processedEmails)
        
        except Exception:
            # Parser error (set inputText to returnText)
            print("- Warn: Json parser error (set inputText to returnText) >>" + str(text) + "<<")
            return (text, processedURLs, processedHashtags, processedBolds, processedItalics, processedUnderlines, processedEmails)
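
The core idea of the function above can be sketched in a simplified form. The sketch assumes the text field is already available as a native Python list of strings and entity dicts, whereas the notebook parses a stringified version with `demjson`. It also does not special-case every entity type. The helper name `flattenMessageText` and the sample message are illustrative assumptions.

```python
def flattenMessageText(text):
    """Return (plainText, urls) for a Telegram-style text field (simplified)."""
    if isinstance(text, str):
        # Unformatted message: nothing to flatten
        return text, []

    parts, urls = [], []
    for item in text:
        if isinstance(item, str):
            parts.append(item)
        elif item.get("type") == "text_link":
            urls.append(item["href"])   # keep the target, drop the link label
        elif item.get("type") == "link":
            urls.append(item["text"])   # the text itself is the URL
        else:
            parts.append(item["text"])  # bold, italic, mention, ...: keep the text
    return "".join(parts), urls

sampleMessageText = [
    "Neues Video: ",
    {"type": "bold", "text": "Livestream"},
    " ",
    {"type": "link", "text": "https://example.org/stream"},
]

plainText, urls = flattenMessageText(sampleMessageText)
print(repr(plainText), urls)
```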

##### Post-Processing Text and Meta Information

URLs

- getUrlRegex
- extractUrls
- removeUrls

In [None]:
# https://stackoverflow.com/questions/6718633/python-regular-expression-again-match-url
def getUrlRegex():
    # Raw string, so that regex escapes such as \d, \. and \w are not interpreted as string escapes
    return r"((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)"

def urlExtractUrls(inputText):
    return re.findall(getUrlRegex(), str(inputText))

def urlRemoveUrls(inputText):
    return re.sub(getUrlRegex(), " ", str(inputText))
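
The extract/remove pair can be illustrated with a deliberately simplified pattern. Unlike the notebook's regex, `SIMPLE_URL_REGEX` below only matches URLs that start with `http://` or `https://`; it and the helper names are assumptions made for this sketch, and the sample sentence is made up.

```python
import re

SIMPLE_URL_REGEX = r"https?://\S+"  # simplified: requires a scheme, stops at whitespace

def extractUrlsSimple(inputText):
    return re.findall(SIMPLE_URL_REGEX, str(inputText))

def removeUrlsSimple(inputText):
    # Replace each URL with a space so neighboring words are not glued together
    return re.sub(SIMPLE_URL_REGEX, " ", str(inputText))

sampleUrlText = "Mehr dazu: https://example.org/artikel und http://example.com"

print(extractUrlsSimple(sampleUrlText))
print(removeUrlsSimple(sampleUrlText).split())
```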

Hashtags

- getHashtagRegex
- extractHashTags

In [None]:
def getHashtagRegex():
    return r"#(\w+)"

def hashTagExtractHashTags(inputText):

    inputText = str(inputText)

    inputText = re.sub('\n', ' ', inputText) # Replace \n
    inputText = demoji.replace(inputText, " ") # Rm emoji
    inputText = gloReplaceGermanChars(inputText) # Replace german chars

    return re.findall(getHashtagRegex(), inputText)
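
A short sketch of what the hashtag pattern captures; the helper name `extractHashtags` and the sample sentence are illustrative assumptions. The capturing group returns only the word after `#`, and `\w+` stops at whitespace and punctuation.

```python
import re

def extractHashtags(inputText):
    # The group (\w+) excludes the leading "#" and any trailing punctuation
    return re.findall(r"#(\w+)", inputText)

print(extractHashtags("Heute #Demo in #Berlin! Mehr unter #b2908."))  # ['Demo', 'Berlin', 'b2908']
```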

In [None]:
"""
Get params from extractedTextData
See cell below (key)
"""
def getExtractedTextDataParam(key, extractedTextData):

    a,b,c,d,e,f,g = extractedTextData

    if(key == 0):

        return urlRemoveUrls(a)

    elif(key == 1):

        # Combine the URLs from the JSON entities with the URLs found in the plain text.
        # A new list is built instead of extending b in place, so that repeated calls
        # on the same tuple do not duplicate URLs.
        return b + urlExtractUrls(a)

    elif(key == 2):

        # TODO: Refactor dont take it from extractedTextData
        return hashTagExtractHashTags(a)

    else:
        switcher = {
            3: d,
            4: e,
            5: f,
            6: g
        }
        return switcher.get(key)
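
Numeric keys such as `0` to `6` are easy to mix up. One possible alternative, shown here only as a sketch and not as the notebook's design, is a named tuple whose field names follow the return values documented for `extractTextData`; the class name `ExtractedTextData` is an assumption.

```python
from typing import List, NamedTuple

class ExtractedTextData(NamedTuple):
    procText: str
    processedURLs: List[str]
    processedHashtags: List[str]
    processedBolds: List[str]
    processedItalics: List[str]
    processedUnderlines: List[str]
    processedEmails: List[str]

sampleExtracted = ExtractedTextData("Hallo Welt", [], [], ["Welt"], [], [], [])

print(sampleExtracted.processedBolds)  # access by name
print(sampleExtracted[3])              # positional access still works
```

Because a named tuple is still a tuple, existing code that unpacks seven values would keep working.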

##### Cleaning Text and Computing Further Attributes

- cleanText
- emojis
- safeText
- safeLowercaseText
- textLength

(not described here)

#### Stage 3c: Query Features

##### Adding Helper Columns (Optional)

(not described here)

##### Assigning Evaluation Attributes

In [None]:
def evalIsValidText(ftTdTextLength):
    if(ftTdTextLength > 3):
        return True
    else:
        return False

In [None]:
def evalContainsSomething(att):
    if(str(att) == "nan"):
        return False
    else:
        return True

In [None]:
def evalNonEmptyList(att):
    if(str(att) == "[]"):
        return False
    else:
        return True
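
The two checks above compare string representations (`"nan"`, `"[]"`). A type-based variant is sketched below under the assumption that missing values are float NaN (such as `np.nan`) and that list attributes are real Python lists. The helper names are hypothetical. One difference to note: the literal string `"nan"` counts as content here, whereas the string comparison treats it as missing.

```python
import math

def containsSomething(att):
    # Only a float NaN counts as "missing"
    return not (isinstance(att, float) and math.isnan(att))

def isNonEmptyList(att):
    return isinstance(att, list) and len(att) > 0

print(containsSomething(float("nan")), containsSomething("photos/photo_1.jpg"))
print(isNonEmptyList([]), isNonEmptyList(["https://example.org"]))
```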

#### Stage 3d: Advanced Text Mining Models

##### Applying Transformers

NER Transformers

In [None]:
# returns dict (empty dict if disabled, dict with empty lists on error)
listUnknownTypes = list()
def processNerPipeline(inputText, pipelineKey, configMinScore):
    if(pipelineKey in pipelineKeys):

        listPer     = list()
        listMisc    = list()
        listOrg     = list()
        listLoc     = list()


        try:

            data = dictPipelines[pipelineKey](inputText)

            for d in data:

                jsonData = demjson.decode(str(d), encoding='utf8')
                            
                if(jsonData["score"] >= configMinScore):
                    # Is Valid
                    if      (jsonData["entity"] == "I-PER" or jsonData["entity"] == "B-PER"):
                        listPer.append(jsonData["word"])
                    elif    (jsonData["entity"] == "I-MISC" or jsonData["entity"] == "B-MISC"):
                        listMisc.append(jsonData["word"])
                    elif    (jsonData["entity"] == "I-ORG" or jsonData["entity"] == "B-ORG"):
                        listOrg.append(jsonData["word"])
                    elif    (jsonData["entity"] == "I-LOC" or jsonData["entity"] == "B-LOC"):
                        listLoc.append(jsonData["word"])
                    else:
                        uT = str(jsonData["entity"])
                        if(uT not in listUnknownTypes):
                            print("- Warn - Got unknown type >>" + uT + "<<")
                            listUnknownTypes.append(uT)

        except:
            pass
            #print("Error in processNerPipeline (ignore) >>" + str(inputText) + "<<")
        

        return {
            "per": listPer,
            "misc": listMisc,
            "org": listOrg,
            "loc": listLoc
        }

    else:
        return dict()
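
The grouping step can be sketched without loading a model. The sample results below are made up, but they use the same keys the function above reads (`entity`, `score`, `word`). The helper `groupEntities` is an assumption and differs from the notebook in two ways: it groups any entity type it encounters, and it omits types that do not occur instead of returning empty lists.

```python
from collections import defaultdict

def groupEntities(nerResults, minScore=0.0):
    grouped = defaultdict(list)
    for r in nerResults:
        if r["score"] >= minScore:
            # "B-PER" / "I-PER" -> "per": drop the B-/I- position prefix
            entityType = r["entity"].split("-")[-1].lower()
            grouped[entityType].append(r["word"])
    return dict(grouped)

sampleNerResults = [
    {"entity": "B-PER", "score": 0.99, "word": "Angela"},
    {"entity": "I-PER", "score": 0.98, "word": "Merkel"},
    {"entity": "B-LOC", "score": 0.95, "word": "Berlin"},
    {"entity": "B-ORG", "score": 0.40, "word": "Telegram"},
]

print(groupEntities(sampleNerResults, minScore=0.5))
```

The notebook itself calls `processNerPipeline` with `configMinScore=0`, so no entity is filtered out by score there.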

SEN Transformers

In [None]:
# returns
# 1 - 5 (1 = bad / 5 = good)
# -1 disabled or error
def processSenPipeline(inputText, pipelineKey, configMinScore):
    if(pipelineKey in pipelineKeys):

        sen = -1

        try:

            data = dictPipelines[pipelineKey](inputText)
            
            for d in data:


                jsonData = demjson.decode(str(d), encoding='utf-8')

                if(jsonData["score"]) > configMinScore:
                    # Is Valid
                    labelData = str(jsonData["label"])

                    if("stars" in labelData):
                        labelData = re.sub(" stars", "", labelData)
                    else:
                        labelData = re.sub(" star", "", labelData)
                    
                    sen = int(labelData)

        except:
            pass
            #print("Error in processSenPipeline (ignore) >>" + str(inputText) + "<<")

        return sen

    else:
        return -1
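
The label handling above turns strings such as `"1 star"` or `"4 stars"` into an integer. The same conversion can be sketched with a single regular expression; `parseStarLabel` is a hypothetical helper, and `-1` mirrors the notebook's value for "disabled or error".

```python
import re

def parseStarLabel(label):
    # Accept "<digits> star" or "<digits> stars"; anything else is treated as invalid
    match = re.match(r"(\d+) stars?$", str(label))
    return int(match.group(1)) if match else -1

print(parseStarLabel("1 star"), parseStarLabel("5 stars"), parseStarLabel("n/a"))  # 1 5 -1
```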

##### Applying TextBlob

In [None]:
# returns
# dict (polarity, subjectivity) or none (fail or disabled)
def processSentimentAnalysisPython(inputText):

    try:
        t = TextBlob(inputText)
        return {
            "polarity": t.polarity,
            "subjectivity": t.subjectivity
        }
    except:
        return None

#### Loading from the Cache or Building the Cache

In [None]:
# return dictMessages and dfAllDataMessages
def initProcessData():

    dictMessages      = {}
    dfAllDataMessages = pd.DataFrame()

    gloStartStopwatch("Extract Text Data")

    # Add Key = filePath / Value = DataFrame (Chat Message)
    for fP in dfInputFiles.inputPath:

        gloStartStopwatch("TD-Extract " + fP)
        
        ##############################
        ########## Stage 3a ##########
        ##############################
        
        # Parse messages
        dfMessages                          = convertToDataFrameMessages(fP)
        tmpMeta                             = convertToDataFrameMeta(fP)

        # Reduce to a sample (optional)
        if(C_SHORT_RUN):
            print("Short run active!")
            dfMessages = dfMessages.head(C_NUMBER_SAMPLES)
            
        # Explore message attributes
        # see above

        # Assign chat attributes (filePath, chatType)
        dfMessages["ftFilePath"]      = fP
        dfMessages["ftChatType"]      = tmpMeta.type.iloc[0]
        
        # Detect formatted messages (isJsonFormatted)
        dfMessages["ftIsJsonFormatted"]   = dfMessages["text"].apply(gloCheckIsTextJsonFormatted, singleMode = False)        
        
        ##############################
        ########## Stage 3b ##########
        ##############################
        
        # Extract text and meta information
        dfMessages["tmpExtractedTD"]        = dfMessages.apply(lambda x: extractTextData(x.ftIsJsonFormatted, x.text), axis=1)

        # Post-process text and meta information
        dfMessages["ftTdText"]            = dfMessages.apply(lambda x: getExtractedTextDataParam(0, x.tmpExtractedTD), axis=1)        
        
        dfMessages["ftTdUrls"]            = dfMessages.apply(lambda x: getExtractedTextDataParam(1, x.tmpExtractedTD), axis=1)        
        dfMessages["ftTdHashtags"]        = dfMessages.apply(lambda x: getExtractedTextDataParam(2, x.tmpExtractedTD), axis=1)
        dfMessages["ftTdBolds"]           = dfMessages.apply(lambda x: getExtractedTextDataParam(3, x.tmpExtractedTD), axis=1)
        dfMessages["ftTdItalics"]         = dfMessages.apply(lambda x: getExtractedTextDataParam(4, x.tmpExtractedTD), axis=1)
        dfMessages["ftTdUnderlines"]      = dfMessages.apply(lambda x: getExtractedTextDataParam(5, x.tmpExtractedTD), axis=1)        
        dfMessages["ftTdEmails"]          = dfMessages.apply(lambda x: getExtractedTextDataParam(6, x.tmpExtractedTD), axis=1)        

        # Clean text and assign further attributes
        dfMessages['ftTdCleanText']           = dfMessages['ftTdText'].map(lambda x: re.sub('\n', ' ', x)) # Replace \n
        
        dfMessages['ftTdEmojis']              = dfMessages['ftTdCleanText'].map(lambda x: demoji.findall_list(x, desc = False)) # Filter out emoji
        dfMessages['ftTdEmojisDesc']          = dfMessages['ftTdCleanText'].map(lambda x: demoji.findall_list(x, desc = True)) # Filter out emoji with desc
        
        dfMessages['ftTdCleanText']           = dfMessages['ftTdCleanText'].map(lambda x: demoji.replace(x, " ")) # Rm emoji
        dfMessages['ftTdCleanText']           = dfMessages['ftTdCleanText'].map(lambda x: gloReplaceGermanChars(x)) # Replace german chars
        
        dfMessages['ftTdSafeText']            = dfMessages['ftTdCleanText'].map(lambda x: re.sub(r'[^a-zA-Z0-9\s]', ' ', x)) # Filter out . ! ? ... (get only safe chars)
        dfMessages['ftTdSafeLowerText']       = dfMessages['ftTdSafeText'].map(lambda x: x.lower()) # To lower
        
        dfMessages["ftTdTextLength"]          = dfMessages["ftTdCleanText"].str.len()

        ##############################
        ########## Stage 3c ##########
        ##############################
        
        # Add helper columns (optional)
        if "photo" not in dfMessages:
            dfMessages["photo"] = np.nan

        if "file" not in dfMessages:
            dfMessages["file"] = np.nan

        if "edited" not in dfMessages:
            dfMessages["edited"] = np.nan

        if "forwarded_from" not in dfMessages:
            dfMessages["forwarded_from"] = np.nan

        # Assign query attributes
        dfMessages["ftQrIsValidText"]               = dfMessages.ftTdTextLength.apply(evalIsValidText)
        dfMessages["ftQrIsEdited"]                  = dfMessages.edited.apply(evalContainsSomething)       
        dfMessages["ftQrIsForwarded"]               = dfMessages.forwarded_from.apply(evalContainsSomething)
        
        dfMessages["ftQrCoPhotos"]                  = dfMessages.photo.apply(evalContainsSomething)        
        dfMessages["ftQrCoFiles"]                   = dfMessages.file.apply(evalContainsSomething)
        dfMessages["ftQrCoUrls"]                    = dfMessages.ftTdUrls.apply(evalNonEmptyList)
        dfMessages["ftQrCoHashtags"]                = dfMessages.ftTdHashtags.apply(evalNonEmptyList)
        dfMessages["ftQrCoBolds"]                   = dfMessages.ftTdBolds.apply(evalNonEmptyList)
        dfMessages["ftQrCoItalics"]                 = dfMessages.ftTdItalics.apply(evalNonEmptyList)
        dfMessages["ftQrCoUnderlines"]              = dfMessages.ftTdUnderlines.apply(evalNonEmptyList)
        dfMessages["ftQrCoEmails"]                  = dfMessages.ftTdEmails.apply(evalNonEmptyList)
        dfMessages['ftQrCoEmojis']                  = dfMessages.ftTdEmojis.apply(evalNonEmptyList)

        ##############################
        ########## Stage 3d ##########
        ##############################
        
        # Apply Transformers
        if dfInputFiles[dfInputFiles.inputPath == fP].iloc[0].inputLabel in C_TRANSFORMERS_DATASETS:
            
            # NER Transformers
            
            # - ner-xlm-roberta
            gloStartStopwatch("Process pipeline ner-xlm-roberta")
            dfMessages['ftTrNerRoberta']    = dfMessages['ftTdCleanText'].map(lambda x: processNerPipeline(x, "ner-xlm-roberta", configMinScore=0))
            gloStopStopwatch("Process pipeline ner-xlm-roberta")

            # - ner-bert
            gloStartStopwatch("Process pipeline ner-bert")
            dfMessages['ftTrNerBert']           = dfMessages['ftTdCleanText'].map(lambda x: processNerPipeline(x, "ner-bert", configMinScore=0))
            gloStopStopwatch("Process pipeline ner-bert")

            # SEN Transformers
            
            # - sen-bert
            gloStartStopwatch("Process pipeline sen-bert")
            dfMessages['ftTrSenBert']           = dfMessages['ftTdCleanText'].map(lambda x: processSenPipeline(x, "sen-bert", configMinScore=0))
            gloStopStopwatch("Process pipeline sen-bert")

        # Apply TextBlob
        gloStartStopwatch("Process textblob")
        dfMessages['ftSenTb']           = dfMessages['ftTdCleanText'].map(lambda x: processSentimentAnalysisPython(x))
        gloStopStopwatch("Process textblob")
        
        ##############################
        ## (Mapping dictMessages) ####
        ##############################
        
        dictMessages[fP] = dfMessages
        gloStopStopwatch("TD-Extract " + fP)

    gloStopStopwatch("Extract Text Data")

    ###############################
    # (Mapping dfAllDataMessages) #
    ###############################
    
    # All Messages to DataFrame
    gloStartStopwatch("Generate global DataFrame")
    for fP in dfInputFiles.inputPath:
        dfMessages        = dictMessages[fP].copy()
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        dfAllDataMessages = pd.concat([dfAllDataMessages, dfMessages])
    gloStopStopwatch("Generate global DataFrame")

    return (dictMessages, dfAllDataMessages)

In [None]:
# return dictMessages and dfAllDataMessages
def initCacheData(dfAllDataMessages):
    dictMessages = {}
    for fP in dfInputFiles.inputPath:
        dictMessages[fP] = dfAllDataMessages[dfAllDataMessages.ftFilePath == fP]
    return (dictMessages, dfAllDataMessages)

Start the global stopwatch

In [None]:
gloStartStopwatch("Global notebook")

In [None]:
if(C_USE_CACHE_FILE == ""):
    print("Should not use cache (build new cache)")
    dictMessages, dfAllDataMessages = initProcessData()
    if(C_NEW_CACHE_FILE != ""):
        print("Write cache to file >>" + str(C_NEW_CACHE_FILE) + "<<")
        dfAllDataMessages.to_pickle(dir_var_pandas_cache + C_NEW_CACHE_FILE)
else:
    print("Should use cache (load cache)")
    dictMessages, dfAllDataMessages = initCacheData(pd.read_pickle(dir_var_pandas_cache + C_USE_CACHE_FILE))
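
The branch above either rebuilds the data and optionally writes a cache, or loads an existing cache. A minimal sketch of the same decision, assuming plain `pickle` files and a hypothetical `load_or_build` helper (the notebook itself uses `to_pickle` and `pd.read_pickle` on the DataFrame):

```python
import os
import pickle
import tempfile

def load_or_build(use_cache_file, new_cache_file, build, cache_dir):
    """Empty use_cache_file means: rebuild, then optionally write a new cache."""
    if use_cache_file == "":
        data = build()
        if new_cache_file != "":
            with open(os.path.join(cache_dir, new_cache_file), "wb") as fh:
                pickle.dump(data, fh)
        return data
    # Otherwise load the existing cache and skip the expensive build
    with open(os.path.join(cache_dir, use_cache_file), "rb") as fh:
        return pickle.load(fh)

cache_dir = tempfile.mkdtemp()
build_calls = []

def build_demo():
    # Stand-in for the expensive processing step (hypothetical)
    build_calls.append(1)
    return {"rows": [1, 2, 3]}

first = load_or_build("", "demo.pkl", build_demo, cache_dir)
second = load_or_build("demo.pkl", "", build_demo, cache_dir)
```

The second call returns the same data without running the build step again.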

## Social Media Mining

### Chats

#### Introduction: Which chat types are there?

In [None]:
dfInputFiles.inputType.value_counts()

#### Unique chat identifier

In [None]:
# Remove unsafe characters (emojis, German special characters, punctuation)
def gloConvertToSafeString(text):
    text = demoji.replace(text, "")
    text = gloReplaceGermanChars(text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Generate unique chat name
def gloConvertToSafeChatName(chatName):
    chatName = gloConvertToSafeString(chatName)
    return chatName[:30]
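
Because the safe chat name is cut to 30 characters, two different chats can end up with the same identifier. The following sketch illustrates this under stated assumptions: `replace_german_chars` is a hypothetical stand-in for `gloReplaceGermanChars` (defined elsewhere in the notebook), and the emoji removal is left to the regular expression instead of `demoji`.

```python
import re

# Hypothetical stand-in for gloReplaceGermanChars
GERMAN_CHARS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def replace_german_chars(text):
    for src, dst in GERMAN_CHARS.items():
        text = text.replace(src, dst)
    return text

def to_safe_chat_name(chat_name, max_len=30):
    text = replace_german_chars(chat_name)
    # Drop everything except ASCII letters, digits and whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return text[:max_len]

name_a = to_safe_chat_name("Über uns 😎!")
long_a = to_safe_chat_name("Freiheit und Demokratie fuer alle - Kanal 1")
long_b = to_safe_chat_name("Freiheit und Demokratie fuer alle - Kanal 2")
```

Here `long_a` and `long_b` are identical after truncation, so the shortened name works as a label but should not be relied on as a strictly unique key.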

#### Define attribute queries

In [None]:
def queryChatId(filePath):
    dfMeta = dictMeta[filePath].copy()
    return str(dfMeta["id"].iloc[0])

In [None]:
def queryChatName(filePath):
    dfMeta      = dictMeta[filePath].copy()
    chatName    = str(dfMeta["name"].iloc[0])
    chatName    = gloConvertToSafeChatName(chatName)
    return chatName

In [None]:
def queryChatType(filePath):
    dfMeta = dictMeta[filePath].copy()
    return str(dfMeta["type"].iloc[0])

In [None]:
def queryNumberOfMessages(filePath):
    dfMessages = dictMessages[filePath].copy()
    return len(dfMessages.index)

In [None]:
def queryNumberOfMessagesByAttEqTrue(filePath, attKey):
    dfMessages = dictMessages[filePath].copy()
    dfMessages = dfMessages[dfMessages[attKey] == True]
    return len(dfMessages.index)

#### Run queries (dfQueryMeta)

In [None]:
dfQueryMeta = pd.DataFrame(dfInputFiles.inputPath)

dfQueryMeta["qryChatId"]                        = dfQueryMeta.inputPath.apply(queryChatId)
dfQueryMeta["qryChatName"]                      = dfQueryMeta.inputPath.apply(queryChatName)
dfQueryMeta["qryChatType"]                      = dfQueryMeta.inputPath.apply(queryChatType)
dfQueryMeta["qryNumberOfMessages"]              = dfQueryMeta.inputPath.apply(queryNumberOfMessages)

dfQueryMeta["qryNumberOfFormattedTextMessages"] = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftIsJsonFormatted"), axis=1)

dfQueryMeta["qryNumberOfValidTextMessages"]     = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrIsValidText"), axis=1)

dfQueryMeta["qryNumberOfPhotos"]                = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoPhotos"), axis=1)
dfQueryMeta["qryNumberOfFiles"]                 = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoFiles"), axis=1)
dfQueryMeta["qryNumberOfEditedMessages"]        = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrIsEdited"), axis=1)
dfQueryMeta["qryNumberOfForwardedMessages"]     = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrIsForwarded"), axis=1)

dfQueryMeta["qryNumberOfMessagesWithUrl"]           = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoUrls"), axis=1)
dfQueryMeta["qryNumberOfMessagesWithHashtag"]       = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoHashtags"), axis=1)
dfQueryMeta["qryNumberOfMessagesWithBold"]          = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoBolds"), axis=1)
dfQueryMeta["qryNumberOfMessagesWithItalic"]        = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoItalics"), axis=1)
dfQueryMeta["qryNumberOfMessagesWithUnderline"]     = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoUnderlines"), axis=1)
dfQueryMeta["qryNumberOfMessagesWithEmail"]         = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoEmails"), axis=1)
dfQueryMeta["qryNumberOfMessagesWithEmoji"]         = dfQueryMeta.apply(lambda x: queryNumberOfMessagesByAttEqTrue(x.inputPath, "ftQrCoEmojis"), axis=1)

#### How could these attributes be presented?

##### As a table

In [None]:
dfQueryMeta.sort_values(by="qryNumberOfMessages", ascending=False)

##### As a plot

Implementation

In [None]:
# Auto label query plot
def autolabelAx(rects, ax):
    """
    Attach a text label above each bar in *rects*, displaying its height.
    Copied from https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/barchart.html (22.12.2020)
    """
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
        
# param inputDescFilter set "" == no filter
# param outputFilename set "" = no output
def queryMetaPlotter(inputDescFilter, configPlotWidth, configPlotHeight, configBarWidth, outputFilename):
    # Init data
    dataLabels                          = list()
    dataNumberOfMesssages               = list()
    dataNumberOfFormattedTextMessages   = list()
    dataNumberOfValidTextMessages       = list()
    dataNumberOfEditedMessages          = list()
    dataNumberOfForwardedMessages       = list()
    dataNumberOfPhotos                  = list()
    dataNumberOfFiles                   = list()
    dataNumberOfMessagesWUrl            = list()
    dataNumberOfMessagesWHashtag        = list()
    dataNumberOfMessagesWBold           = list()
    dataNumberOfMessagesWItalic         = list()
    dataNumberOfMessagesWUnderline      = list()
    dataNumberOfMessagesWEmail          = list()
    dataNumberOfMessagesWEmoji          = list()

    # Iterate over Meta DataFrame
    for index, row in dfQueryMeta.sort_values(by="qryNumberOfMessages", ascending=False).iterrows():

        # Get attributes (check filter)
        if(inputDescFilter == "" or dfInputFiles[dfInputFiles.inputPath == row.inputPath].inputLabel.iloc[0] == inputDescFilter):
            dataLabels                          .append(row.qryChatName)
            dataNumberOfMesssages               .append(row.qryNumberOfMessages)
            dataNumberOfFormattedTextMessages   .append(row.qryNumberOfFormattedTextMessages)
            dataNumberOfValidTextMessages       .append(row.qryNumberOfValidTextMessages)
            dataNumberOfEditedMessages          .append(row.qryNumberOfEditedMessages)
            dataNumberOfForwardedMessages       .append(row.qryNumberOfForwardedMessages)
            dataNumberOfPhotos                  .append(row.qryNumberOfPhotos)
            dataNumberOfFiles                   .append(row.qryNumberOfFiles)
            dataNumberOfMessagesWUrl            .append(row.qryNumberOfMessagesWithUrl)
            dataNumberOfMessagesWHashtag        .append(row.qryNumberOfMessagesWithHashtag)
            dataNumberOfMessagesWBold           .append(row.qryNumberOfMessagesWithBold)
            dataNumberOfMessagesWItalic         .append(row.qryNumberOfMessagesWithItalic)
            dataNumberOfMessagesWUnderline      .append(row.qryNumberOfMessagesWithUnderline)
            dataNumberOfMessagesWEmail          .append(row.qryNumberOfMessagesWithEmail)
            dataNumberOfMessagesWEmoji          .append(row.qryNumberOfMessagesWithEmoji)

    # Convert list to array
    dataLabels                          = np.array(dataLabels)
    dataNumberOfMesssages               = np.array(dataNumberOfMesssages)
    dataNumberOfFormattedTextMessages   = np.array(dataNumberOfFormattedTextMessages)
    dataNumberOfValidTextMessages       = np.array(dataNumberOfValidTextMessages)
    dataNumberOfEditedMessages          = np.array(dataNumberOfEditedMessages)
    dataNumberOfForwardedMessages       = np.array(dataNumberOfForwardedMessages)
    dataNumberOfPhotos                  = np.array(dataNumberOfPhotos)
    dataNumberOfFiles                   = np.array(dataNumberOfFiles)
    dataNumberOfMessagesWUrl            = np.array(dataNumberOfMessagesWUrl)
    dataNumberOfMessagesWHashtag        = np.array(dataNumberOfMessagesWHashtag)
    dataNumberOfMessagesWBold           = np.array(dataNumberOfMessagesWBold)
    dataNumberOfMessagesWItalic         = np.array(dataNumberOfMessagesWItalic)
    dataNumberOfMessagesWUnderline      = np.array(dataNumberOfMessagesWUnderline)
    dataNumberOfMessagesWEmail          = np.array(dataNumberOfMessagesWEmail)
    dataNumberOfMessagesWEmoji          = np.array(dataNumberOfMessagesWEmoji)

    # Draw
    with sns.color_palette("tab10", 11):
        fig, ax = plt.subplots()
    x = np.arange(len(dataLabels))

    barWidth = configBarWidth

    fig.set_figwidth(configPlotWidth)
    fig.set_figheight(configPlotHeight)

    r1 = x
    r2 = [x + barWidth for x in r1]
    r3 = [x + barWidth for x in r2]
    r4 = [x + barWidth for x in r3]
    r5 = [x + barWidth for x in r4]
    r6 = [x + barWidth for x in r5]
    r7 = [x + barWidth for x in r6]
    r8 = [x + barWidth for x in r7]
    r9 = [x + barWidth for x in r8]
    r10 = [x + barWidth for x in r9]
    r11 = [x + barWidth for x in r10]
    r12 = [x + barWidth for x in r11]
    r13 = [x + barWidth for x in r12]
    r14 = [x + barWidth for x in r13]

    rects1 = ax.bar(r1, dataNumberOfMesssages, barWidth, label='Messages')
    rects2 = ax.bar(r2, dataNumberOfFormattedTextMessages, barWidth, label='Formatted Messages')
    rects3 = ax.bar(r3, dataNumberOfValidTextMessages, barWidth, label='Valid Text Messages')
    rects4 = ax.bar(r4, dataNumberOfEditedMessages, barWidth, label='Edited Messages')
    rects5 = ax.bar(r5, dataNumberOfForwardedMessages, barWidth, label='Forwarded Messages')
    rects6 = ax.bar(r6, dataNumberOfPhotos, barWidth, label='Messages with Photos')
    rects7 = ax.bar(r7, dataNumberOfFiles, barWidth, label='Messages with Files')
    rects8 = ax.bar(r8, dataNumberOfMessagesWUrl, barWidth, label='Messages with Urls')
    rects9 = ax.bar(r9, dataNumberOfMessagesWHashtag, barWidth, label='Messages with Hashtags')
    rects10 = ax.bar(r10, dataNumberOfMessagesWBold, barWidth, label='Messages with Bold Items')
    rects11 = ax.bar(r11, dataNumberOfMessagesWItalic, barWidth, label='Messages with Italic Items')
    rects12 = ax.bar(r12, dataNumberOfMessagesWUnderline, barWidth, label='Messages with Underlined Items')
    rects13 = ax.bar(r13, dataNumberOfMessagesWEmail, barWidth, label='Messages with E-Mails')
    rects14 = ax.bar(r14, dataNumberOfMessagesWEmoji, barWidth, label='Messages with Emojis')

    chartTitle = ""
    if(inputDescFilter != ""):
        chartTitle = " (" + inputDescFilter + ")"

    ax.set_ylabel("Number of")
    ax.set_title("Meta Overview" + chartTitle)
    ax.set_xticks(x)
    ax.set_xticklabels(dataLabels)
    plt.xticks(rotation=0)
    ax.legend()

    rects = [rects1, rects2, rects3, rects4, rects5, rects6, rects7, rects8, rects9, rects10, rects11, rects12, rects13, rects14]

    for rect in rects:
        autolabelAx(rect, ax)

    fig.tight_layout()

    #plt.xticks(rotation=30)
    
    if(outputFilename != ""):
        plt.savefig(dir_var_output + outputFilename)
    
    plt.show()
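
The fourteen offset lists `r1` to `r14` all follow one rule: series `s` is shifted by `s * barWidth` relative to the group position. A compact sketch of that rule, using a hypothetical helper `grouped_bar_positions` that is not part of the notebook:

```python
def grouped_bar_positions(n_groups, n_series, bar_width):
    """Return one list of x positions per series, each shifted by bar_width."""
    return [[g + s * bar_width for g in range(n_groups)] for s in range(n_series)]

positions = grouped_bar_positions(n_groups=3, n_series=14, bar_width=0.065)

# Total width of one group; it must stay below 1 so neighbouring groups do not overlap
group_width = 14 * 0.065

# Offset that would centre a tick label under its group instead of under the first bar
center_offset = (14 - 1) * 0.065 / 2
```

With `configBarWidth = 0.065`, a group of 14 bars is 0.91 wide, which is why this value fits the 14 series used in the plots.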

Execution

- Plot DataSet0

In [None]:
queryMetaPlotter(
    inputDescFilter = "dataSet0",
    configPlotWidth = 16,
    configPlotHeight = 9,
    configBarWidth = 0.065,
    outputFilename = "meta-overview-dataSet0.svg"
)

- Plot DataSet1

In [None]:
if("dataSet1" in C_LOAD_DATASETS):
    queryMetaPlotter(
        inputDescFilter = "dataSet1",
        configPlotWidth = 100,
        configPlotHeight = 9,
        configBarWidth = 0.065,
        outputFilename = "meta-overview-dataSet1.svg"
    )

- Plot DataSet1a

In [None]:
if("dataSet1a" in C_LOAD_DATASETS):
    queryMetaPlotter(
        inputDescFilter = "dataSet1a",
        configPlotWidth = 16,
        configPlotHeight = 9,
        configBarWidth = 0.065,
        outputFilename = "meta-overview-dataSet1a.svg"
    )

- Plot DataSet2

In [None]:
if("dataSet2" in C_LOAD_DATASETS):
    queryMetaPlotter(
        inputDescFilter = "dataSet2",
        configPlotWidth = 32,
        configPlotHeight = 9,
        configBarWidth = 0.065,
        outputFilename = "meta-overview-dataSet2.svg"
    )

### Social Graphs - Mapping chats to features

#### Deviations between chat names and user names

Different attributes

In [None]:
def compareIdsAndLabels(df):

    gloStartStopwatch("Compare ids and labels")

    dictFromTranslator  = {}
    dictActorTranslator = {}

    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    
    df = df.set_index("date")
    df = df.sort_index()
    
    addFromCounter      = 0
    changedFromCounter  = 0
    
    addActorCounter     = 0
    changedActorCounter = 0

    for index, row in df.iterrows():
        
        n_from      = row["from"]
        n_from_id   = row["from_id"]

        n_from = str(n_from)
        n_from_id = str(n_from_id)

        n_actor      = row["actor"]
        n_actor_id   = row["actor_id"]

        n_actor = str(n_actor)
        n_actor_id = str(n_actor_id)

        if(str(n_from) != "nan"):
            if(n_from_id not in dictFromTranslator):
                # Add new key
                dictFromTranslator[n_from_id] = [n_from]
                addFromCounter = addFromCounter + 1
            else:
                # Has changed?
                oValueL = dictFromTranslator[n_from_id]
                if(n_from not in oValueL):
                    newList = oValueL.copy()
                    newList.append(n_from)
                    print("- Add changed attribute in from (prev=" + str(oValueL) + "/new=" + str(newList) + ")")
                    changedFromCounter = changedFromCounter + 1
                    dictFromTranslator[n_from_id] = newList

        if(str(n_actor) != "nan"):
            if(n_actor_id not in dictActorTranslator):
                # Add new key
                dictActorTranslator[n_actor_id] = [n_actor]
                addActorCounter = addActorCounter + 1
            else:
                # Has changed?
                oValueL = dictActorTranslator[n_actor_id]
                if(n_actor not in oValueL):
                    newList = oValueL.copy()
                    newList.append(n_actor)
                    print("- Add changed attribute in actor (prev=" + str(oValueL) + "/new=" + str(newList) + ")")
                    changedActorCounter = changedActorCounter + 1
                    dictActorTranslator[n_actor_id] = newList

    gloStopStopwatch("Compare ids and labels")
    
    print()
    print("addFromCounter:\t\t" + str(addFromCounter))
    print("changedFromCounter:\t" + str(changedFromCounter))
    
    print()
    print("addActorCounter:\t" + str(addActorCounter))
    print("changedActorCounter:\t" + str(changedActorCounter))
    
    print()
    if(addFromCounter != 0):
        print("fromFails Percent:\t" + str((changedFromCounter/addFromCounter)* 100) + "%")
            
    if(addActorCounter != 0):
        print("actorFails Percent:\t" + str((changedActorCounter/addActorCounter)* 100) + "%")

    return dictFromTranslator
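
The core of the comparison is a dictionary that maps each id to every display name seen for it, in chronological order. A reduced sketch with made-up records (the ids and names below are illustrative only):

```python
def track_name_changes(records):
    """records: iterable of (user_id, display_name) in chronological order."""
    names_by_id = {}
    added = 0
    changed = 0
    for user_id, name in records:
        known = names_by_id.setdefault(user_id, [])
        if not known:
            # First time this id appears
            known.append(name)
            added += 1
        elif name not in known:
            # Same id, previously unseen name: the label has changed
            known.append(name)
            changed += 1
    return names_by_id, added, changed

names_by_id, added, changed = track_name_changes([
    ("user1", "Alice"),
    ("user2", "Bob"),
    ("user1", "Alice"),
    ("user1", "Alice W."),
])
change_rate = changed / added * 100
```

In this toy input one of two ids changes its name once, which gives a change rate of 50 percent.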

In [None]:
if(C_SHORT_RUN == False):
    compareIdsAndLabels(dfAllDataMessages)

#### Extracting features and dynamic resolution

##### Formatting-specific static features

In [None]:
def extractImportantHashtags(df):
    dfMessages = df.copy()
    dfMessages = dfMessages[dfMessages.ftQrCoHashtags == True]

    hashTagList = list()
    for index, row in dfMessages.iterrows():
        for hashtagItem in row["ftTdHashtags"]:
            hashTagList.append(hashtagItem)

    return hashTagList
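
The hashtag extraction flattens one list per message into a single list, which is later counted. As a sketch with toy data, the same flattening can be written with `itertools.chain`:

```python
from collections import Counter
from itertools import chain

# One hashtag list per message (toy data); empty lists are simply skipped
hashtags_per_message = [["#a", "#b"], [], ["#a"]]

hashtag_counts = Counter(chain.from_iterable(hashtags_per_message))
top_hashtag = hashtag_counts.most_common(1)
```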

In [None]:
# return combinations
def extractImportantEmojis(df):

    dfMessages = df.copy()
    dfMessages = dfMessages[dfMessages.ftQrCoEmojis == True]

    li = dfMessages.ftTdEmojisDesc.values.tolist()

    retLi = list()

    for l in li:
        aString = ""
        for e in l:
            aString = aString + ":" + e 
        retLi.append(aString)

    return retLi
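
The emoji extraction returns one combination string per message, so the order of the emojis matters. The sketch below assumes that `ftTdEmojisDesc` holds plain description strings; the sorted variant is an optional alternative, not what the notebook does.

```python
from collections import Counter

def emoji_combination(descriptions):
    # Same construction as above: every description is prefixed with ":"
    return "".join(":" + d for d in descriptions)

messages = [["fire", "heart"], ["heart", "fire"], ["fire", "heart"]]

# Order-sensitive, as in extractImportantEmojis
combo_counts = Counter(emoji_combination(m) for m in messages)

# Order-insensitive alternative: sort the descriptions first
unordered_counts = Counter(emoji_combination(sorted(m)) for m in messages)
```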

##### Author-specific static features

(not described here, see below)

##### Dynamic features

In [None]:
# param flagResolveNewUrls  Flag (see config above)

def resolveUrl(completeUrl, flagResolveNewUrls):
    
    if "bit.ly" in completeUrl:

        if(gloCheckIsAlreadyCached("resolved-urls.csv", completeUrl)):
            return gloGetCached("resolved-urls.csv", completeUrl)
        else:

            if(flagResolveNewUrls == False):
                return completeUrl

            print("(Resolve now >>" + completeUrl + "<<)")
            try:
                r = requests.get(completeUrl, timeout = 5)
                u = r.url
                gloAddToCache("resolved-urls.csv", completeUrl, u)
                return u
            except Exception:
                print("(- Warn: Can not resolve (return completeUrl))")
                return completeUrl

    else:
        return completeUrl
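
The control flow of `resolveUrl` can be tested without network access when the HTTP lookup is injected. This is a sketch under stated assumptions: an in-memory dict replaces the CSV-backed cache helpers, and `fake_fetch` is a hypothetical stand-in for `requests.get(...).url`.

```python
def make_cached_resolver(fetch, cache, allow_new_lookups):
    def resolve(url):
        if "bit.ly" not in url:
            return url                      # only short links are resolved
        if url in cache:
            return cache[url]               # cache hit, no lookup
        if not allow_new_lookups:
            return url                      # resolving disabled for this run
        try:
            resolved = fetch(url)
        except Exception:
            return url                      # fall back to the original URL
        cache[url] = resolved
        return resolved
    return resolve

calls = []

def fake_fetch(url):
    calls.append(url)
    return "https://example.org/article"

cache = {}
resolve = make_cached_resolver(fake_fetch, cache, allow_new_lookups=True)
first = resolve("https://bit.ly/abc")
second = resolve("https://bit.ly/abc")
untouched = resolve("https://example.com/page")
offline = make_cached_resolver(fake_fetch, {}, allow_new_lookups=False)("https://bit.ly/xyz")
```

The second call is served from the cache, so `fake_fetch` runs only once.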

In [None]:
# Return
# a = urlList,
# b = refList
# c = hostList
def extractImportantUrls(df):
    dfMessages = df.copy()
    dfMessages = dfMessages[dfMessages.ftQrCoUrls == True]

    hostList        = list()
    urList          = list()
    refList         = list()

    counterSucHostname = 0
    counterErrHostname = 0

    for index, row in dfMessages.iterrows():
        for urlItem in row["ftTdUrls"]:
            
            urlData = urlparse(str(urlItem))

            completeUrl      = urlData.geturl()

            rUrl     = resolveUrl(completeUrl, flagResolveNewUrls=C_RESOLVE_NEW_URLS)
            rUrlData = urlparse(rUrl)
            rCompleteUrl = rUrlData.geturl()
            rCompleteHostname = rUrlData.hostname

            if(str(rCompleteHostname) != "None"):
                counterSucHostname = counterSucHostname + 1

                hostList.append(str(rCompleteHostname))

                urList.append(str(rCompleteUrl))

                if "t.me" in str(rCompleteHostname):
                    refList.append(str(rCompleteUrl))
            else:
                counterErrHostname = counterErrHostname + 1

    print("Got Hostnames (suc=" + str(counterSucHostname) + "/err=" + str(counterErrHostname) + ")")

    return (urList, refList, hostList)
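
Two details of the hostname handling are easy to miss: `urlparse` reports no hostname when the scheme is missing, and the substring test `"t.me" in hostname` also matches unrelated hosts. The sketch below shows both effects; `hostname_of` and `is_telegram_host` are hypothetical helpers and only one possible way to handle them.

```python
from urllib.parse import urlparse

def hostname_of(url):
    host = urlparse(url).hostname
    if host is None and "://" not in url:
        # Without a scheme the host is parsed as part of the path;
        # a leading "//" marks it as the network location
        host = urlparse("//" + url).hostname
    return host

def is_telegram_host(host):
    # Exact match or subdomain instead of a substring test
    return host is not None and (host == "t.me" or host.endswith(".t.me"))

plain = urlparse("t.me/somechannel").hostname
fixed = hostname_of("t.me/somechannel")
with_scheme = hostname_of("https://t.me/somechannel")
substring_match = "t.me" in "secret.me"
strict_match = is_telegram_host("secret.me")
```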

In [None]:
# param flagResolveNewUrls  Flag (see config above)
def resolveImportantYoutubeVideos(urlList, flagResolveNewUrls):

    # Thanks https://gist.github.com/rodrigoborgesdeoliveira/987683cfbfcc8d800192da1e73adc486

    ytList = list()

    for url in urlList:

        url = str(url)

        if("youtube.com" in url or "youtu.be" in url or "youtube-nocookie.com" in url):
            if(gloCheckIsAlreadyCached("resolved-youtube.csv", url)):
                ytList.append(gloGetCached("resolved-youtube.csv", url)) 
            else:

                if(flagResolveNewUrls == False):
                    print("(Disable resolve new youtube urls (return completeUrl) >>" + url + "<<)")
                    ytList.append(url)
                else:
                    print("Resolve now youtube >>" + url + "<<")
                    try:
                        r = requests.get(url, timeout = 5)
                        t = fromstring(r.content)
                        a = str(t.findtext('.//title'))
                        ytList.append(a)
                        gloAddToCache("resolved-youtube.csv", url, a)
                    except Exception:
                        print("(- Warn: Can not resolve youtube url (return completeUrl))")
                        ytList.append(url)

    return ytList
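
The title lookup reads the `<title>` element of the fetched page. The following sketch shows the same idea with the standard library's `html.parser` instead of `fromstring`, applied to a static HTML string; `TitleParser` and `extract_title` are illustrative names, not part of the notebook.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> element is used
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title

page = "<html><head><title>Example Video - YouTube</title></head><body><p>hi</p></body></html>"
title = extract_title(page)
missing = extract_title("<html><body>no title</body></html>")
```

A page without a title yields `None` here. In the function above, `str(t.findtext('.//title'))` would turn a missing title into the string `"None"` and cache it, so a check for a missing title before caching may be worth considering.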

##### Implementation

In [None]:
# TODO: Bug: urlparse detects no hostname if the string does not start with "http"
# TODO: Check: refs in both directions

# Returns
# a = Counter forwardedFromList
# b = Counter refList
# c = Counter hashtagList
# d = Counter hostList
# e = Counter emojiList
# f = Counter fromList
def extractSocialGraph(filePath, debugPrint, debugPrintCount):

    dfMessages = dictMessages[filePath].copy()

    # Formatting-specific static features
    
    hashtagList = extractImportantHashtags(dfMessages)
    emojiList = extractImportantEmojis(dfMessages)
    
    # Author-specific static features
    
    forwardedFromList = list()
    if("forwarded_from" in dfMessages.columns):
        df = dfMessages.copy()
        df = df[df.ftQrIsForwarded == True]
    
        for index, row in df.iterrows():        
            forwardedFromList.append(str(row["forwarded_from"]))
            
    actorList = list()
    if("actor" in dfMessages.columns):
        for index, row in dfMessages.iterrows():
            actorList.append(str(row["actor"]))
    
    memberList = list()
    if("members" in dfMessages.columns):
        for index, row in dfMessages.iterrows():
            if(str(row["members"]) != "nan"):
                for memberItem in row["members"]:
                    memberList.append(str(memberItem))
                    
    fromList = list()
    if("from" in dfMessages.columns):
        for index, row in dfMessages.iterrows():
            s = str(row["from"])
            s = gloConvertToSafeString(s)
            if(s != "None"):
                fromList.append(s)
            
    savedFromList = list()
    if("saved_from" in dfMessages.columns):
        for index, row in dfMessages.iterrows():
            savedFromList.append(str(row["saved_from"]))

    # Dynamic features
    urlList, refList, hostList = extractImportantUrls(dfMessages)

    ytList = resolveImportantYoutubeVideos(urlList, flagResolveNewUrls = C_RESOLVE_NEW_URLS)
    
    # Debug print
            
    configTopN = debugPrintCount

    if(debugPrint):

        print()
        print("Set top n to " + str(debugPrintCount))
        print()

        print("- Top Hosts (resolved) -")
        print ("\n".join(map(str, Counter(hostList).most_common(configTopN))))
        print()
        print("- Top URLs (resolved) -")
        print ("\n".join(map(str, Counter(urlList).most_common(configTopN))))
        print()
        print("- Top Refs from text (resolved) -")
        print ("\n".join(map(str, Counter(refList).most_common(configTopN))))
        print()
        print("- Top Refs (forwarded_from) -")
        print ("\n".join(map(str, Counter(forwardedFromList).most_common(configTopN))))
        print()
        print("- Top Refs (actor) -")
        print ("\n".join(map(str, Counter(actorList).most_common(configTopN))))
        print()
        print("- Top Refs (members) -")
        print ("\n".join(map(str, Counter(memberList).most_common(configTopN))))
        print()
        print("- Top Refs (from) -")
        print ("\n".join(map(str, Counter(fromList).most_common(configTopN))))
        print()
        print("- Top Refs (saved_from) -")
        print ("\n".join(map(str, Counter(savedFromList).most_common(configTopN))))
        print()
        print("- Top hashtags -")
        print ("\n".join(map(str, Counter(hashtagList).most_common(configTopN))))
        print()
        print("- Top emojis -")
        print ("\n".join(map(str, Counter(emojiList).most_common(configTopN))))
        print()
        print("- Top yt (resolved) -")
        print ("\n".join(map(str, Counter(ytList).most_common(configTopN))))
        print()
    
    return (Counter(forwardedFromList), Counter(refList), Counter(hashtagList),  Counter(hostList), Counter(emojiList), Counter(fromList))

In [None]:
dictSGD_ForwardedFrom = {}
dictSGD_Ref           = {}
dictSGD_Hashtag       = {}
dictSGD_Host          = {}
dictSGD_Emoji         = {}
dictSGD_From          = {}

gloStartStopwatch("Extract Social Graph Data")

for fP in dfInputFiles.inputPath:

    gloStartStopwatch("Extract Social Graph Data >>" + fP + "<<")

    a, b, c, d, e, f = extractSocialGraph(fP, debugPrint=False, debugPrintCount = 0)

    dictSGD_ForwardedFrom[fP]   = a
    dictSGD_Ref[fP]             = b
    dictSGD_Hashtag[fP]         = c
    dictSGD_Host[fP]            = d
    dictSGD_Emoji[fP]           = e
    dictSGD_From[fP]            = f

    gloStopStopwatch("Extract Social Graph Data >>" + fP + "<<")

gloStopStopwatch("Extract Social Graph Data")
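
Each `dictSGD_*` dictionary maps a file path to a `Counter`. Since counters can be added, a total across all chats can be derived from such a dictionary. A sketch with made-up keys and counts:

```python
from collections import Counter

# Toy stand-in for a dictionary such as dictSGD_Host (keys and counts are invented)
per_chat_hosts = {
    "chatA": Counter({"youtube.com": 3, "t.me": 2}),
    "chatB": Counter({"t.me": 4}),
}

# Summing with an empty Counter as start value merges the per-chat counts
total_hosts = sum(per_chat_hosts.values(), Counter())
top_host = total_hosts.most_common(1)
```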

#### Top-n view

In [None]:
def printSocialGraphDebug(filePathList):
    for fP in filePathList:
        print("Analyse now >>" + fP + "<<")
        _ = extractSocialGraph(fP, debugPrint=True, debugPrintCount=10)

In [None]:
if(C_SHORT_RUN == False):
    printSocialGraphDebug(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath)

#### Search patterns for chats

In [None]:
# Get Top Influencer
# param fPList      filePath List
# param configTopN  Get Top n influencer e.g. 10
def getTopInfluencer(fPList, configTopN):

    for fP in fPList:

        chatName = queryChatName(fP)

        print()
        print("Analyse Chat (Forwarded From) >>" + chatName + "<<")
        
        socialGraphData = dictSGD_ForwardedFrom[fP]
        socialGraphData = socialGraphData.most_common(configTopN)

        counter = 1

        # Iterate over data
        for oChatName, oChatRefs in socialGraphData:
            
            # Query other params
            oChatName    = gloConvertToSafeChatName(str(oChatName))
            oChatRefs    = oChatRefs

            # Already downloaded?
            flagDownloaded = False
            if oChatName in dfQueryMeta.qryChatName.values:
                flagDownloaded = True

            if(oChatName != "nan"):

                print(str(counter) + ": (downloaded=" + str(flagDownloaded) + ") (refs=" + str(oChatRefs) + ")\t\t>>" + str(oChatName) + "<<")
                counter = counter + 1


        print()
        print("Analyse Chat (Refs) >>" + chatName + "<<")
        
        socialGraphData = dictSGD_Ref[fP]
        socialGraphData = socialGraphData.most_common(configTopN)

        counter = 1

        # Iterate over data
        for oChatName, oChatRefs in socialGraphData:
            
            # Query other params
            oChatName    = str(oChatName)
            oChatRefs    = oChatRefs

            if(oChatName != "nan"):

                print(str(counter) + " (refs=" + str(oChatRefs) + ")\t\t>>" + str(oChatName) + "<<")
                counter = counter + 1
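
`getTopInfluencer` first takes the top n entries and then skips `"nan"`. If a placeholder such as `"nan"` ranks among the top entries, fewer than n results are printed. The sketch below, with invented counts, compares this with filtering before the top-n selection:

```python
from collections import Counter

forwarded_from = Counter({"nan": 50, "Channel A": 7, "Channel B": 5, "Channel C": 1})

# Order used above: take the top n, then drop the placeholder
top_then_filter = [(n, c) for n, c in forwarded_from.most_common(2) if n != "nan"]

# Alternative: drop the placeholder first, then take the top n
filtered = Counter({n: c for n, c in forwarded_from.items() if n != "nan"})
filter_then_top = filtered.most_common(2)
```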

In [None]:
# TODO: Can not get all items in dataSet1

"""
# Attila Hildmann #
- Anonymous Germany - not found
- https://t.me/DEMOKRATENCHAT - no entries
- https://t.me/ChatDerFreiheit - no entries
- https://t.me/FREIHEITSCHAT2020 - not found

# Oliver Janich #
- Oliver Janich Premium - not found

# Xavier Naidoo #
- Xavier(Der VereiNiger)Naidoo😎 - not found
- https://t.me/PostAppender_bot - bot chat
"""
getTopInfluencer(list(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath), 10)

### Social Graphs - Rendering graphs

#### Visualization options for graphs

##### Define layouts and drawing functions

Select layout

- 1 = Kamada-Kawai layout
- 2 = Spring layout
- 3 = Graphviz layout

In [None]:
def getSocialGraphLayout(layoutSelector, G):
    if(layoutSelector == 1):
        return nx.kamada_kawai_layout(G.to_undirected())
    elif(layoutSelector == 2):
        return nx.spring_layout(G.to_undirected(), k = 0.15, iterations=200)
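    elif(layoutSelector == 3):
        return nx.nx_pydot.graphviz_layout(G)
    else:
        # Fail early instead of silently returning None
        raise ValueError("Unknown layoutSelector: " + str(layoutSelector))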

Define the plot function

- ``G``: graph
- ``layoutSelector``: see above (layout selection)
- ``configFactorEdge``: e.g. 10 => edge width = (100 - weight) / 10
- ``configFactorNode``: e.g. 10 => node size = weight / 10
- ``configArrowSize``: e.g. 5
- ``configPlotWidth``: e.g. 16
- ``configPlotHeight``: e.g. 9
- ``outputFilename``: e.g. test.png (set "" == no output file)
- ``outputTitle``: e.g. Graph (required)

In [None]:
def drawSocialGraph(G, layoutSelector, configFactorEdge, configFactorNode, configArrowSize, configPlotWidth, configPlotHeight, outputFilename, outputTitle):
    
    gloStartStopwatch("Social Graph Plot")
    
    plt.figure(figsize=(configPlotWidth,configPlotHeight))
        
    pos = getSocialGraphLayout(layoutSelector = layoutSelector, G = G)
    
    # Clean edges
    edges       = nx.get_edge_attributes(G, "weight")
    edgesTLabel = nx.get_edge_attributes(G, "tLabel")

    clean_edges         = dict()
    clean_edges_labels  = dict()
    
    for key in edges:
        
        #Set edge weight
        clean_edges[key]        = (100 - edges[key]) / configFactorEdge

        #set edge layout
        clean_edges_labels[key] = edgesTLabel[key]
    
    # Clean nodes
    nodes       = nx.get_node_attributes(G,'weight')
    nodesTLabel = nx.get_node_attributes(G,'tLabel')
    nodesTColor = nx.get_node_attributes(G,'tColor')

    clean_nodes         = dict()
    clean_nodes_labels  = dict()
    clean_nodes_color   = dict()
    
    for key in nodes:
        
        #Set node weight        
        clean_nodes[key]        = nodes[key] / configFactorNode

        #Set node layout
        clean_nodes_labels[key] = nodesTLabel[key]
        clean_nodes_color[key]  = nodesTColor[key]
    
    # Revert DiGraph (arrows direction)
    #G_rev = nx.DiGraph.reverse(G) 
    
    G_rev = G

    # Draw
    nx.draw(G_rev,
        pos,
        with_labels=True,
        width=list(clean_edges.values()),
        node_size=list(clean_nodes.values()),
        labels=clean_nodes_labels,
        node_color=list(clean_nodes_color.values()),
        arrowsize=configArrowSize,
        #arrowstyle="wedge"
        #connectionstyle="arc3, rad = 0.1"
    )
    
    # Set labels
    _ = nx.draw_networkx_edge_labels(G_rev, pos, edge_labels=clean_edges_labels, font_size = 13)

    if(outputTitle != ""):
        plt.title(outputTitle)

    # Save and show fig
    if(outputFilename != ""):
        plt.savefig(dir_var_output + outputFilename)
    
    plt.show()
    
    gloStopStopwatch("Social Graph Plot")

##### Testing the plots

In [None]:
# Generates Test Graph
def generateTestGraph():

    G_weighted = nx.DiGraph()

    # Formula: (1 - (number of messages sent / number of valid messages in the target chat)) * 100
    
    G_weighted.add_edge("Autor 1", "Chat 1", weight=44,  tLabel = "(1 - (7.000/12.500)) * 100 = 44")
    G_weighted.add_edge("Autor 2", "Chat 1", weight=92,  tLabel = "(1 - (1.000/12.500)) * 100 = 92")
    G_weighted.add_edge("Autor 3", "Chat 1", weight=88,  tLabel = "(1 - (1.500/12.500)) * 100 = 88")
    G_weighted.add_edge("Chat 2",  "Chat 1", weight=76,  tLabel = "(1 - (3.000/12.500)) * 100 = 76")
        
    G_weighted.add_edge("Autor 1", "Chat 2", weight=81.25,  tLabel = "(1 - (3.000/16.000)) * 100 = 81,25")
    G_weighted.add_edge("Autor 2", "Chat 2", weight=25,     tLabel = "(1 - (12.000/16.000)) * 100 = 25")
    G_weighted.add_edge("Autor 3", "Chat 2", weight=93.75,  tLabel = "(1 - (1.000/16.000)) * 100 = 93,75")
    
    # Node weight (number of valid messages)
    
    # - Exact number of valid messages (received)
    G_weighted.add_node("Chat 1", weight=12500, tLabel = "Chat 1\n[12.500]", tColor="#0080ff")
    G_weighted.add_node("Chat 2", weight=16000, tLabel = "Chat 2\n[16.000]", tColor="#0080ff")

    # - Estimated number of valid messages (sent), maximum value
    G_weighted.add_node("Autor 1", weight=7000, tLabel = "Autor 1\n[7.000]", tColor="#ff8000")
    G_weighted.add_node("Autor 2", weight=12000, tLabel = "Autor 2\n[12.000]", tColor="#ff8000")
    G_weighted.add_node("Autor 3", weight=1500, tLabel = "Autor 3\n[1.500]", tColor="#ff8000")
    
    return G_weighted

generatedTestGraph = generateTestGraph()
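
The edge weights in the test graph follow the formula from the comment above. The following minimal sketch recomputes them; the helper name `edge_weight` is hypothetical and not part of the notebook.

```python
def edge_weight(sent_messages, valid_messages_in_target):
    """Edge weight (distance) in percent: (1 - sent / valid) * 100."""
    if valid_messages_in_target <= 0:
        raise ValueError("target chat needs at least one valid message")
    return (1 - sent_messages / valid_messages_in_target) * 100

# Messages sent into "Chat 1" (12,500 valid messages) and "Chat 2" (16,000 valid messages)
weights_chat1 = [round(edge_weight(s, 12500), 2) for s in (7000, 1000, 1500, 3000)]
weights_chat2 = [round(edge_weight(s, 16000), 2) for s in (3000, 12000, 1000)]

print(weights_chat1)
print(weights_chat2)
```

A sender who contributes a large share of a chat ends up with a small weight, so the weight can be read as a distance in the layout.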

In [None]:
drawSocialGraph(
    G = generatedTestGraph,
    layoutSelector=1,
    configFactorEdge = 10,
    configFactorNode = 2,
    configArrowSize = 15,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "social-graph-s-sample.svg",
    outputTitle = "Test Graph Kamada Kawai Layout"
)

In [None]:
drawSocialGraph(
    G = generatedTestGraph,
    layoutSelector=2,
    configFactorEdge = 10,
    configFactorNode = 2,
    configArrowSize = 15,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "",
    outputTitle = "Test Graph Spring Layout"
)

In [None]:
drawSocialGraph(
    G = generatedTestGraph,
    layoutSelector=3,
    configFactorEdge = 10,
    configFactorNode = 2,
    configArrowSize = 15,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "",
    outputTitle = "Test Graph Graphviz Layout"
)

#### Visualization options for downloaded chats

##### Implementation

Helper function for weights

In [None]:
# Add node weight to dict
# Only adds new weight if newWeight > oldWeight
def addSocialGraphNodeWeight(chatName, chatWeight, targetDict):
    
    if(chatName in targetDict):
        oldWeight = targetDict[chatName]
        if(chatWeight > oldWeight):
            targetDict[chatName] = chatWeight
    else:
        targetDict[chatName] = chatWeight
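
The helper keeps the largest weight seen per node. An equivalent, more compact formulation is sketched below; `keep_max_weight` is a hypothetical name used only for illustration.

```python
def keep_max_weight(name, weight, target):
    """Store weight for name unless a larger weight is already present."""
    target[name] = max(weight, target.get(name, weight))

node_weights = {}
observations = [("Autor 1", 7000), ("Autor 1", 3000), ("Autor 2", 1000), ("Autor 2", 12000)]
for name, sent in observations:
    keep_max_weight(name, sent, node_weights)

print(node_weights)
```

For chats that were not downloaded, the maximum over all observed counts serves as an estimate of the node size, which matches the "MAX" remark in the test graph.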

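Compute the graph

- ``configTopNInfluencer``: number of top entries to keep per chat, e.g. ``10`` for the top 10
- ``configMinRefs``: minimum share in percent, e.g. ``1`` keeps only entries that account for more than 1 % of the target messages
- ``listFilePaths``: list of file paths to process
- ``socialGraphTargetDict``: dict of counters per file path, e.g. the forwarded-from dict or the hashtag dict
- ``socialGraphTargetAttribute``: attribute used as the base for the percentage, e.g. ``ftQrIsForwarded``
- ``configFlagDebugLabel``: if ``True``, show debug information on the edge labels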

In [None]:
def generateSocialGraph(configTopNInfluencer, configMinRefs, listFilePaths, socialGraphTargetDict, socialGraphTargetAttribute, configFlagDebugLabel):
    
    # Save node weights to dict
    dictSocialNodeWeights   = dict()

    # Flag downloaded nodes (exact node weight)
    dictExactNodesLabels    = {}
    
    gloStartStopwatch("Social Graph")
    
    # Generate directed graph
    G_weighted = nx.DiGraph()
    
    print("- Add edges")
    for fP in listFilePaths:
        
        # Query own params
        chatName                        = queryChatName(fP)
        chatNumberOfMessages            = queryNumberOfMessages(fP)
        chatNumberOfTargetMessages      = queryNumberOfMessagesByAttEqTrue(fP, socialGraphTargetAttribute)

        gloStartStopwatch("SG-Extract " + chatName + "(" + str(chatNumberOfTargetMessages) + "/" + str(chatNumberOfMessages) + " messages)")
        
        # Add exact node size (chat downloaded) and flag node
        addSocialGraphNodeWeight(chatName, chatNumberOfMessages, dictSocialNodeWeights)
        dictExactNodesLabels[chatName] = str(chatName) + "\n=[" + str(chatNumberOfTargetMessages) + "/" + str(chatNumberOfMessages) + "]"

        # Extract social graph data and get top influencer
        socialGraphData = socialGraphTargetDict[fP]
        socialGraphData = socialGraphData.most_common(configTopNInfluencer)
        
        # Iterate over forwarder
        for oChatName, oChatRefs in socialGraphData:
            
            # Normalize the name of the referenced chat
            oChatName    = gloConvertToSafeChatName(str(oChatName))

            # If has forwarder
            if(oChatName != "nan"):
        
                # Calc percent (forwarded_messages)
                per = (oChatRefs/chatNumberOfTargetMessages) * 100

                # Filter unimportant forwarders
                if(per > configMinRefs):
                
                    # Add estimated node size (chat not downloaded)
                    addSocialGraphNodeWeight(oChatName, oChatRefs, dictSocialNodeWeights)

                    # Invert percent (distance)
                    wei = 100 - per

                    # Label
                    if(configFlagDebugLabel):
                        lab = str(round(per, 3)) + "% (" + str(oChatRefs) + "/" + str(chatNumberOfTargetMessages) + "≙" + str(round(wei, 3)) + ")"
                    else:
                        lab = ""

                    # Add edge
                    G_weighted.add_edge(
                        chatName,
                        oChatName,
                        weight=wei,
                        tLabel = lab
                    )

        gloStopStopwatch("SG-Extract " + chatName + "(" + str(chatNumberOfTargetMessages) + "/" + str(chatNumberOfMessages) + " messages)")
        
    print("- Add different nodes")
    for aNode in dictSocialNodeWeights:
        
        # Query node params
        nodeName   = str(aNode)
        nodeWeight = dictSocialNodeWeights[aNode]

        # Set defaults
        tValueColor = "#ff8000"
        tLabel = str(nodeName) + "\n≈[" + str(nodeWeight) + "]"

        # Overwrite (if chat downloaded = exact weight)
        if(nodeName in dictExactNodesLabels):
            tValueColor = "#0080ff"
            tLabel = dictExactNodesLabels[nodeName]
        
        G_weighted.add_node(
            nodeName,
            weight=nodeWeight,
            tLabel = tLabel,
            tColor=tValueColor
        )
        
    gloStopStopwatch("Social Graph")
        
    return G_weighted
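
The core of the edge selection (top N entries, percentage, threshold, inverted weight) can be illustrated without networkx. This is a simplified sketch under assumptions: `top_reference_edges` and the sample counter are made up, and the guard against an empty chat is an addition that the function above does not contain.

```python
from collections import Counter

def top_reference_edges(chat_name, references, total_target_messages, top_n, min_percent):
    """Return (source, target, weight) tuples for the most referenced entries."""
    edges = []
    if total_target_messages <= 0:
        return edges  # avoid a division by zero for chats without target messages
    for other, refs in references.most_common(top_n):
        percent = refs / total_target_messages * 100
        if percent > min_percent:
            # Invert the percentage so that frequent references get a short distance
            edges.append((chat_name, other, round(100 - percent, 2)))
    return edges

refs = Counter({"Chat A": 50, "Chat B": 30, "Chat C": 1})
edges = top_reference_edges("Chat 1", refs, 100, top_n=3, min_percent=5)
print(edges)
```

"Chat C" is within the top 3 but only accounts for 1 % of the messages, so the threshold removes it.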

##### Top 25 Forwarded From (DataSet0)

In [None]:
drawSocialGraph(
    generateSocialGraph(
        configTopNInfluencer = 25,  
        configMinRefs = 0,       
        listFilePaths = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath),
        socialGraphTargetDict = dictSGD_ForwardedFrom,
        socialGraphTargetAttribute = "ftQrIsForwarded",
        configFlagDebugLabel = False
    ),
    layoutSelector = 1,
    configFactorEdge = 10,
    configFactorNode = 4,
    configArrowSize = 20,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "social-graph-dataSet0-forwarded-from.svg",
    outputTitle = "Top 25 Forwarded From (DataSet0)"
)

##### Top 25 Hashtags (DataSet0)

In [None]:
drawSocialGraph(
    generateSocialGraph(
        configTopNInfluencer = 25,  
        configMinRefs = 0,        
        listFilePaths = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath),
        socialGraphTargetDict = dictSGD_Hashtag,
        socialGraphTargetAttribute = "ftQrCoHashtags",
        configFlagDebugLabel = False
    ),
    layoutSelector = 1,
    configFactorEdge = 10,
    configFactorNode = 5,
    configArrowSize = 15,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "",
    outputTitle = "Top 25 Hashtags (DataSet0)"
)

##### Top 25 Hosts (DataSet0)

In [None]:
drawSocialGraph(
    generateSocialGraph(
        configTopNInfluencer = 25,  
        configMinRefs = 0,        
        listFilePaths = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath),
        socialGraphTargetDict = dictSGD_Host,
        socialGraphTargetAttribute = "ftQrCoUrls",
        configFlagDebugLabel = False
    ),
    layoutSelector = 1,
    configFactorEdge = 10,
    configFactorNode = 8,
    configArrowSize = 20,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "",
    outputTitle = "Top 25 Hosts (DataSet0)"
)

##### Top 25 Emojis (DataSet0)

In [None]:
drawSocialGraph(
    generateSocialGraph(
        configTopNInfluencer = 25,  
        configMinRefs = 0,        
        listFilePaths = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath),
        socialGraphTargetDict = dictSGD_Emoji,
        socialGraphTargetAttribute = "ftQrCoEmojis",
        configFlagDebugLabel = False
    ),
    layoutSelector = 1,
    configFactorEdge = 10,
    configFactorNode = 10,
    configArrowSize = 20,
    configPlotWidth = 16,
    configPlotHeight = 9,
    outputFilename = "",
    outputTitle = "Top 25 Emojis (DataSet0)"
)

##### Top 25 From (DataSet1a)

In [None]:
if("dataSet1a" in C_LOAD_DATASETS):
    drawSocialGraph(
        generateSocialGraph(
            configTopNInfluencer = 25,  
            configMinRefs = 0,        
            listFilePaths = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet1a"].inputPath),
            socialGraphTargetDict = dictSGD_From,
            socialGraphTargetAttribute = "ftQrIsValidText",
            configFlagDebugLabel = False
        ),
        layoutSelector = 1,
        configFactorEdge = 10,
        configFactorNode = 10,
        configArrowSize = 20,
        configPlotWidth = 16,
        configPlotHeight = 9,
        outputFilename = "social-graph-dataSet1a-from.svg",
        outputTitle = "Top 25 From (DataSet1a)"
    )

### Time dimension

#### How to prepare the data?

##### Number of messages up to a cutoff date (with term filter)

- ``targetDate``: cutoff date (inclusive), e.g. 1970-01-01
- ``fP``: file path of the chat
- ``highlightWord``: term that a message must contain; set to "" to disable the filter

In [None]:
def queryNumberOfMessagesByDate(targetDate, fP, highlightWord):

    df = dictMessages[fP].copy()

    df = df[df.ftQrIsValidText == True]

    df["date"] = pd.to_datetime(df["date"])
    
    df = df[df.date <= targetDate]

    if(highlightWord != ""):
        # Match the term literally (no regex) and treat missing text as no match
        df = df[df.ftTdSafeLowerText.str.contains(highlightWord, regex=False, na=False)]

    l = len(df.index)

    if(l > 0):
        return l
    else:
        return np.nan
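
The function above filters the complete DataFrame once per cutoff date. If the dates are sorted once, the cumulative count can be found with a binary search instead. The sketch below uses only the standard library; `count_messages_up_to` and the sample dates are assumptions for illustration. Unlike the function above, it returns 0 rather than NaN when no message exists yet.

```python
from bisect import bisect_right
from datetime import datetime

def count_messages_up_to(sorted_dates, cutoff):
    """Number of messages with date <= cutoff (sorted_dates must be ascending)."""
    return bisect_right(sorted_dates, cutoff)

message_dates = sorted(
    datetime(2020, month, day) for month, day in [(1, 5), (1, 20), (2, 3), (3, 15)]
)
cutoffs = [datetime(2020, month, 1) for month in (1, 2, 3, 4)]
counts = [count_messages_up_to(message_dates, c) for c in cutoffs]

print(counts)
```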

##### Matplotlib implementation

- ``filePathList``: list of file paths to plot
- ``outputFilename``: set to "" for no output file
- ``highlightWords``: list of highlight words (leave empty if not used)
- ``configFrequency``: sampling frequency of the time axis, e.g. 1D or 1M

In [None]:
# TODO: Add percent to label

def drawTimePlot(filePathList, outputFilename, highlightWords, configFrequency):

    gloStartStopwatch("Time Plot")

    plt.figure(figsize=(16, 9))

    df = pd.DataFrame(
        index=pd.date_range( #m/d/y
            start='9/1/2018',
            end='2/1/2021',
            freq=configFrequency
            )
        )

    # Add date to process
    df["date"] = df.index

    vLineHeight = -1

    for fP in filePathList:
        gloStartStopwatch("Time Plot >>" + fP + "<<")

        # Plot Graph Var 1
        if not highlightWords:
            # Plot
            plt.plot(
                df.index, #x
                df.apply(lambda x: queryNumberOfMessagesByDate(x.date, fP, highlightWord = ""), axis=1), #y
                label = queryChatName(fP) #label
            )
            # Set vline height
            currentHeight = queryNumberOfMessagesByAttEqTrue(fP, "ftQrIsValidText")
            if(currentHeight > vLineHeight):
                vLineHeight = currentHeight

        # Plot High Light Word Graph / Var 2
        for hWord in highlightWords:
            y = df.apply(lambda x: queryNumberOfMessagesByDate(x.date, fP, highlightWord = hWord), axis=1)
            # Plot
            plt.plot(
                df.index, #x
                y, #y
                label = queryChatName(fP) + " usages of '" + hWord + "'" #label
            )
            # Set vline height
            currentHeight = y.max()
            if(currentHeight > vLineHeight):
                vLineHeight = currentHeight

        gloStopStopwatch("Time Plot >>" + fP + "<<")

    # Dates in yyyy-mm-dd format
    # TODO: Double check https://www.bundesgesundheitsministerium.de/coronavirus/chronik-coronavirus.html?stand=20210104
    plt.vlines(x = ["2018-12-10"], ymin=0, ymax=vLineHeight, color="orange", ls='--', label="Global Compact for Migration (2018-12-10)")
    plt.vlines(x = ["2020-01-27"], ymin=0, ymax=vLineHeight, color="grey", ls='--', label="Corona Patient Zero Germany (2020-01-27)")
    plt.vlines(x = ["2020-03-23"], ymin=0, ymax=vLineHeight, color="purple", ls='--', label="1. Lockdown Germany (2020-03-23)")
    plt.vlines(x = ["2020-11-02"], ymin=0, ymax=vLineHeight, color="purple", ls='--', label="2. Lockdown light Germany (2020-11-02)")
    plt.vlines(x = ["2020-12-16"], ymin=0, ymax=vLineHeight, color="purple", ls='--', label="3. Lockdown Germany (2020-12-16)")

    plt.gcf().autofmt_xdate()
    _ = plt.legend()

    if(outputFilename != ""):
        plt.savefig(dir_var_output + outputFilename)

    gloStopStopwatch("Time Plot")

#### Standard time plot

In [None]:
if(C_SHORT_RUN == False):
    drawTimePlot(
        filePathList = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath),
        outputFilename = "time-plot-dataSet0.svg",
        highlightWords = [],
        configFrequency=C_TIME_PLOT_FREQ
    )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet1" in C_LOAD_DATASETS):
        drawTimePlot(
            filePathList = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet1"].inputPath),
            outputFilename = "time-plot-dataSet1.svg",
            highlightWords = [],
            configFrequency=C_TIME_PLOT_FREQ
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet1a" in C_LOAD_DATASETS):
        drawTimePlot(
            filePathList = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet1a"].inputPath),
            outputFilename = "time-plot-dataSet1a.svg",
            highlightWords = [],
            configFrequency=C_TIME_PLOT_FREQ
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        drawTimePlot(
            filePathList = list(dfInputFiles[dfInputFiles.inputLabel == "dataSet2"].inputPath),
            outputFilename = "time-plot-dataSet2.svg",
            highlightWords = [],
            configFrequency=C_TIME_PLOT_FREQ
        )

#### Word Tracer

In [None]:
# https://www.bundestag.de/parlament/plenum/sitzverteilung_19wp
highlightwords = ["cdu", "spd", "afd", "fdp", "linke", "gruenen", "merkel"]

In [None]:
if(C_SHORT_RUN == False):
    drawTimePlot(
        filePathList = list(["DS-05-01-2021/ChatExport_2021-01-05-janich"]),
        outputFilename = "word-tracer-oliver-janich.svg",
        highlightWords = highlightwords,
        configFrequency=C_TIME_PLOT_FREQ
    )

In [None]:
if(C_SHORT_RUN == False):
    drawTimePlot(
        filePathList = list(["DS-05-01-2021/ChatExport_2021-01-05-hildmann"]),
        outputFilename = "word-tracer-attila-hildmann.svg",
        highlightWords = highlightwords,
        configFrequency=C_TIME_PLOT_FREQ
    )

In [None]:
if(C_SHORT_RUN == False):
    drawTimePlot(
        filePathList = list(["DS-05-01-2021/ChatExport_2021-01-05-evaherman"]),
        outputFilename = "word-tracer-eva-herman.svg",
        highlightWords = highlightwords,
        configFrequency=C_TIME_PLOT_FREQ
    )

In [None]:
if(C_SHORT_RUN == False):
    drawTimePlot(
        filePathList = list(["DS-05-01-2021/ChatExport_2021-01-05-xavier"]),
        outputFilename = "word-tracer-xavier-naidoo.svg",
        highlightWords = highlightwords,
        configFrequency=C_TIME_PLOT_FREQ
    )

## Text Mining

### Which text messages are usable?

In [None]:
def removeTextLengthOutliersFromDataFrame(df, interval, maxTextLength):
    df = df.copy()
    df = df[df.ftTdTextLength < maxTextLength]
    # https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame
    # Keep only rows whose text length lies within <interval> standard deviations of the mean.
    return df[np.abs(df.ftTdTextLength-df.ftTdTextLength.mean()) <= (interval*df.ftTdTextLength.std())]
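
The filter keeps values that lie within `interval` standard deviations of the mean. A plain Python sketch of the same rule follows; `remove_length_outliers` and the sample lengths are hypothetical. `statistics.stdev` computes the sample standard deviation, which corresponds to the pandas default (`ddof=1`).

```python
from statistics import mean, stdev

def remove_length_outliers(lengths, interval=3):
    """Keep values within interval sample standard deviations of the mean."""
    mu = mean(lengths)
    sigma = stdev(lengths)
    return [x for x in lengths if abs(x - mu) <= interval * sigma]

# Twenty short messages and one very long message
lengths = [10] * 20 + [1000]
filtered = remove_length_outliers(lengths, interval=3)

print(len(lengths), len(filtered))
```

Because a single extreme value also inflates the mean and the standard deviation, the rule only removes values that are far from the bulk of the data.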

In [None]:
# param outputFilename set "" == no output file
def textLengthHistPlotter(outputFilename):
    dfMessages = dfAllDataMessages.copy()
    print("Number of all messages:\t\t\t\t\t\t" + str(len(dfMessages.index)))

    dfMessages = dfMessages[dfMessages.ftQrIsValidText == True]
    print("Number of valid text messages:\t\t\t\t\t" + str(len(dfMessages.index)))

    dfMessagesOT = removeTextLengthOutliersFromDataFrame(
        dfMessages,
        interval = 3,               #Default is 3
        maxTextLength = 999999999   #TODO: Maybe enable max text length
        )
    print("Number of valid text messages (after outliers filtering):\t" + str(len(dfMessagesOT.index)))

    print()
    print("Text Length Hist (after outlier filtering)")
    plt.figure(figsize=(16,9))
    _ = dfMessagesOT.ftTdTextLength.hist(bins=20)
    plt.title('Histogram Text Length (after outlier filtering - global) (20 bins)')

    if(outputFilename != ""):
        plt.savefig(dir_var_output + outputFilename)

In [None]:
textLengthHistPlotter(outputFilename = "meta-text-length-hist.svg")

### Word Clouds

- ``targetDataFrame``: DataFrame with the messages of a chat
- ``outputFilename``: file name in the output directory (set to "" for no output file)
- ``filterList``: additional words to exclude
- ``flagShow``: set to ``True`` to show the word cloud
- ``configPlotWidth``: e.g. 1920
- ``configPlotHeight``: e.g. 1080

In [None]:
# param rowID: name of the text column, e.g. ftTdSafeText
def gloGenerateTextFromChat(df, rowID):
    df = df.copy()
    df = df[df.ftQrIsValidText == True]
    
    # Concatenate all texts of the chat, each prefixed with a space
    textString = ''.join(" " + text for text in df[rowID])

    return textString

In [None]:
# TODO: Context?
# TODO: Improve stop words

def generateWordCloud(targetDataFrame, outputFilename, filterList, flagShow, configPlotWidth, configPlotHeight):
    

    dfMessages = targetDataFrame.copy()
    
    textString = gloGenerateTextFromChat(dfMessages, rowID="ftTdSafeText")
    
    stopWordsList = gloGetStopWordsList(filterList)
    
    # Generate word cloud and save it to file
    wordcloud = WordCloud(
                background_color="black",
                width=configPlotWidth,
                height=configPlotHeight,
                stopwords=stopWordsList
            ).generate(textString)

    if(outputFilename != ""):
        wordcloud.to_file(dir_var_output + outputFilename)
    
    if(flagShow):
        # Show top 20
        print()
        print("Top 20 occ:\n" + str(pd.Series(wordcloud.words_).head(20)))
        print()
        
        # Show word cloud
        print("- Start generate figure")
        plt.figure(figsize=(14, 14))
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.show()
    

#### Across an entire chat

In [None]:
# Oliver Janich öffentlich (public_channel - dataSet0)
generateWordCloud(
    dictMessages["DS-05-01-2021/ChatExport_2021-01-05-janich"],
    "wordcloud-oliver-janich.png",
    [],
    flagShow = True,
    configPlotWidth = 1920,
    configPlotHeight = 1080
)

In [None]:
# ATTILA HILDMANN OFFICIAL (public_channel - dataSet0)
if(C_SHORT_RUN == False):
    generateWordCloud(
        dictMessages["DS-05-01-2021/ChatExport_2021-01-05-hildmann"],
        "wordcloud-attila-hildmann.png",
        [],
        flagShow = True,
        configPlotWidth = 1920,
        configPlotHeight = 1080
    )

In [None]:
# Eva Herman Offiziell (public_channel - dataSet0)
if(C_SHORT_RUN == False):
    generateWordCloud(
        dictMessages["DS-05-01-2021/ChatExport_2021-01-05-evaherman"],
        "wordcloud-eva-herman.png",
        [],
        flagShow = True,
        configPlotWidth = 1920,
        configPlotHeight = 1080
    )

In [None]:
# Xavier Naidoo (public_channel - dataSet0)
if(C_SHORT_RUN == False):
    generateWordCloud(
        dictMessages["DS-05-01-2021/ChatExport_2021-01-05-xavier"],
        "wordcloud-xavier-naidoo.png",
        [],
        flagShow = True,
        configPlotWidth = 1920,
        configPlotHeight = 1080
    )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken089"],
            "wordcloud-querdenken-089-group.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken591Info"],
            "wordcloud-querdenken-591-info.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken773"],
            "wordcloud-querdenken-773-group.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken773Info"],
            "wordcloud-querdenken-773-info.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken711"],
            "wordcloud-querdenken-711-group.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken711Info"],
            "wordcloud-querdenken-711-info.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken69"],
            "wordcloud-querdenken-69-group.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet2" in C_LOAD_DATASETS):
        generateWordCloud(
            dictMessages["DS-13-01-2021/ChatExport_2021-01-13-querdenken69Info"],
            "wordcloud-querdenken-69-info.png",
            [],
            flagShow = True,
            configPlotWidth = 1920,
            configPlotHeight = 1080
        )

#### Development over time

##### Extract a specific time period from the DataFrame

In [None]:
def extractTimePeriodDataFrame(df, timeStart, timeStop):

    #print("- Got Start " + str(timeStart) + " and Stop " + str(timeStop))

    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    
    dfNew = df[df.date <= timeStop]
    dfNew = dfNew[dfNew.date >= timeStart]

    dfNew = dfNew.set_index("date")
    dfNew = dfNew.sort_index()

    return dfNew

##### Compute the periods

In [None]:
def generateWCPeriod():
    return list(pd.date_range( #m/d/y
            start='1/1/2018',
            end='2/1/2021',
            #freq="W-MON",
            freq="1M"
            ))

##### Wrapper

- ``fP``: filePath
- ``label``: e.g. chatName
- ``filterList``: additional stopWords

In [None]:
def generateWordCloudAuto(fP, label, filterList):

    stopwatchLabel = "Generate Word Cloud Auto >>" + fP + "<<"
    gloStartStopwatch(stopwatchLabel)

    periods = generateWCPeriod()

    dictSaved = {}

    prevStart = periods[0]

    for period in periods:

        stop = period

        e = extractTimePeriodDataFrame(dictMessages[fP], timeStart = prevStart, timeStop = stop)

        # Skip the first iteration (start equals stop) and periods without messages
        if(prevStart != stop and len(e.index) > 0):
            fileName = "autoWordCloud/" + queryChatName(fP) + "-" + str(prevStart) + "-" + str(stop) + ".png"
            generateWordCloud(
                e,
                fileName,
                filterList,
                flagShow = False,
                configPlotWidth = 1280,
                configPlotHeight = 720
            )
            dictSaved[fileName] = str(prevStart) + " - " + str(stop)

        prevStart = stop

    gloWriteDictToFile("auto-wordcloud-" + label + ".csv", dictSaved)

    gloStopStopwatch(stopwatchLabel)
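
The loop above walks over consecutive month-end dates and treats each neighbouring pair as one window. The pairing itself can be sketched with `zip`; the helper `month_ends` is a hypothetical stand-in for the pandas date range and only uses the standard library.

```python
import calendar
from datetime import date

def month_ends(start_year, start_month, end_year, end_month):
    """Return the last day of every month in the given range (inclusive)."""
    ends = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        last_day = calendar.monthrange(year, month)[1]
        ends.append(date(year, month, last_day))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return ends

periods = month_ends(2020, 1, 2020, 4)
# Pair every period end with its successor: (start, stop) windows
windows = list(zip(periods, periods[1:]))

for start, stop in windows:
    print(start, "->", stop)
```

Pairing the list with itself shifted by one yields one window fewer than there are period ends, which is why the first iteration of the loop above (start equals stop) produces no word cloud.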

##### Applied to DataSet0

In [None]:
if(C_SHORT_RUN == False):
    generateWordCloudAuto(
        fP = "DS-05-01-2021/ChatExport_2021-01-05-janich",
        label = "oliver-janich",
        filterList = []
    )

In [None]:
if(C_SHORT_RUN == False):
    generateWordCloudAuto(
        fP = "DS-05-01-2021/ChatExport_2021-01-05-hildmann",
        label = "attila-hildmann",
        filterList = []
    )

In [None]:
if(C_SHORT_RUN == False):
    generateWordCloudAuto(
        fP = "DS-05-01-2021/ChatExport_2021-01-05-evaherman",
        label = "eva-herman",
        filterList = []
    )

In [None]:
if(C_SHORT_RUN == False):
    generateWordCloudAuto(
        fP = "DS-05-01-2021/ChatExport_2021-01-05-xavier",
        label = "xavier-naidoo",
        filterList = []
    )

### N-grams

#### Text to n-grams

In [None]:
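def generateNGram(text, n):
    # https://albertauyeung.github.io/2018/06/03/generating-ngrams.html
    text = text.lower()
    # Replace everything except ASCII letters, digits and whitespace (this also replaces umlauts)
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # split() without arguments splits on any whitespace (including newlines) and drops empty tokens
    tokens = text.split()
    
    return list(ngrams(tokens, n))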

##### Examples (n = 1 to 3)

In [None]:
sampleText = "Mir geht es heute gut!"
sampleText

In [None]:
generateNGram(sampleText, 1)

In [None]:
generateNGram(sampleText, 2)

In [None]:
generateNGram(sampleText, 3)
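
To make the sliding window behind the examples explicit, the following sketch builds n-grams with plain Python instead of `nltk`. The function `simple_ngrams` is hypothetical and only mirrors the tokenization idea used above.

```python
import re

def simple_ngrams(text, n):
    """Lowercase, strip non-alphanumeric characters and return n-gram tuples."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    # zip over n shifted copies of the token list yields the sliding window
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams = simple_ngrams("Mir geht es heute gut!", 2)
trigrams = simple_ngrams("Mir geht es heute gut!", 3)

print(bigrams)
print(trigrams)
```

A text with five tokens yields four bigrams and three trigrams.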

#### Top n-grams per chat

##### Implementation

In [None]:
def generateNGramChat(fP, n, mostCommon):
    return Counter(
        generateNGram(
            gloGenerateTextFromChat(dictMessages[fP], rowID="ftTdSafeText"),
            n = n
        )
    ).most_common(mostCommon)

In [None]:
def generateNGramAuto(filePathList, n, mostCommon):
    for fP in filePathList:

        print()
        print("Analyse now >>" + fP + "<<")

        c = generateNGramChat(
            fP,
            n = n,
            mostCommon = mostCommon
        )

        print ("\n".join(map(str, c)))
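
Because all messages of a chat are joined into one text before counting, some n-grams span the boundary between two messages. A possible variation, sketched here under the assumption that messages are available as a list of strings, counts the n-grams per message and sums the counters. The function name and sample messages are made up.

```python
from collections import Counter

def count_bigrams_per_message(messages):
    """Count bigrams inside each message so that none crosses a message boundary."""
    counts = Counter()
    for message in messages:
        tokens = message.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

messages = ["guten morgen", "guten morgen allerseits"]
bigram_counts = count_bigrams_per_message(messages)

print(bigram_counts.most_common(1))
```

With the joined text, the pair ("morgen", "guten") would be counted as well although it never occurs inside a single message.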

##### Applied to DataSet0

In [None]:
generateNGramAuto(
    dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath,
    n = 2,
    mostCommon = 10
)

In [None]:
generateNGramAuto(
    dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath,
    n = 3,
    mostCommon = 10
)

In [None]:
generateNGramAuto(
    dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath,
    n = 4,
    mostCommon = 10
)

In [None]:
generateNGramAuto(
    dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath,
    n = 5,
    mostCommon = 10
)

In [None]:
generateNGramAuto(
    dfInputFiles[dfInputFiles.inputLabel == "dataSet0"].inputPath,
    n = 6,
    mostCommon = 10
)

### POS tagging (proper nouns)

#### Implementation

In [None]:
# param outputFilename, if "" - no output
def plotFreqNouns(inputText, outputFilename, mostCommon, flagRemoveStopwords):
    # https://textmining.wp.hs-hannover.de/Preprocessing.html
    nouns = []
    sentences_tok = [nltk.tokenize.word_tokenize(sent) for sent in getTokenFromText(inputText)]

    for sent in sentences_tok:
        tags = hanoverTagger.tag_sent(sent) 
        nouns_from_sent = [lemma for (word,lemma,pos) in tags if pos == "NE"] # pos == "NN" or 
        nouns.extend(nouns_from_sent)

    pNouns = list()

    if(flagRemoveStopwords):

        print("- Warn: remove stopWords")
        stopWords = gloGetStopWordsList(filterList = list())
        for n in nouns:
            if n.lower() not in stopWords:
                pNouns.append(n)

    else:
        pNouns = nouns

    # Thank you https://stackoverflow.com/questions/52908305/how-to-save-a-nltk-freqdist-plot
    fig = plt.figure(figsize = (16,9))
    plt.gcf().subplots_adjust(bottom=0.15)

    fdist = nltk.FreqDist(pNouns)    

    fdist.plot(mostCommon,cumulative=False)

    _ = plt.show()

    if(outputFilename != ""):
        fig.savefig(dir_var_output + outputFilename, bbox_inches="tight")
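
The counting step after the tagging can be illustrated in isolation. The sketch below assumes that the tagger has already produced a list of proper-noun lemmas; the list, the stop-word set and the name `count_proper_nouns` are invented for illustration. `collections.Counter` is used here as a plain stand-in for the frequency distribution.

```python
from collections import Counter

def count_proper_nouns(lemmas, stop_words):
    """Count lemmas, ignoring those whose lowercase form is a stop word."""
    return Counter(lemma for lemma in lemmas if lemma.lower() not in stop_words)

lemmas = ["München", "München", "Eis", "Berlin", "München"]
stop_words = {"eis"}
freq = count_proper_nouns(lemmas, stop_words)

print(freq.most_common(2))
```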

##### Example

In [None]:
sampleText = "Ich denke an Eis in München. Das ist ein guter Beispielstext aus München. An diesem tollen Text werde ich nun einige Verfahren anwenden! Ich wohne in der Nähe von München und esse gerne Eis."
sampleText

In [None]:
plotFreqNouns(sampleText, outputFilename = "", mostCommon = 10, flagRemoveStopwords = True)

#### Across an entire chat

In [None]:
def generateFreqNounsPlot(fP, mostCommon, outputFilename):

    gloStartStopwatch("Generate text")
    df = dictMessages[fP].copy()
    inputText = gloGenerateTextFromChat(df, "ftTdCleanText")
    gloStopStopwatch("Generate text")

    gloStartStopwatch("Process data")
    plotFreqNouns(inputText, outputFilename=outputFilename, mostCommon=mostCommon, flagRemoveStopwords=True)
    gloStopStopwatch("Process data")

##### Applied to DataSet0

In [None]:
if(C_SHORT_RUN == False):
    generateFreqNounsPlot("DS-05-01-2021/ChatExport_2021-01-05-janich", mostCommon=25, outputFilename = "freq-nouns-oliver-janich.svg")

In [None]:
if(C_SHORT_RUN == False):
    generateFreqNounsPlot("DS-05-01-2021/ChatExport_2021-01-05-hildmann", mostCommon=25, outputFilename = "freq-nouns-attila-hildmann.svg")

In [None]:
if(C_SHORT_RUN == False):
    generateFreqNounsPlot("DS-05-01-2021/ChatExport_2021-01-05-evaherman", mostCommon=25, outputFilename = "freq-nouns-eva-herman.svg")

In [None]:
if(C_SHORT_RUN == False):
    generateFreqNounsPlot("DS-05-01-2021/ChatExport_2021-01-05-xavier", mostCommon=25, outputFilename = "freq-nouns-xavier-naidoo.svg")

### Named-entity recognition

#### Examples

In [None]:
sampleText = "Hallo, mein Name ist Maximilian Mustermann und ich lebe in Deutschland in Europa."
sampleText

In [None]:
processNerPipeline(sampleText, "ner-bert", 0)

In [None]:
processNerPipeline(sampleText, "ner-xlm-roberta", 0)

#### Applied to DataSet0

In [None]:
def evalNerPipeline(pipelineKey, inputSelector, configTopN):

    if(inputSelector in C_TRANSFORMERS_DATASETS):
        
        filePaths = dfInputFiles[dfInputFiles.inputLabel == inputSelector].inputPath

        for fP in filePaths:
            
            gloStartStopwatch("Process now >>" + str(fP) + "<<")

            if(pipelineKey == "ftTrNerRoberta" or pipelineKey == "ftTrNerBert"):
                
                df = dictMessages[fP].copy()
                df = df[df.ftQrIsValidText == True]
                
                listPer     = list()
                listMisc    = list()
                listOrg     = list()
                listLoc     = list()

                for index, row in df.iterrows():

                    d = row[pipelineKey]
                    
                    listPer.extend(d["per"])
                    listMisc.extend(d["misc"])
                    listOrg.extend(d["org"])
                    listLoc.extend(d["loc"])

                print("- Top per -")
                print ("\n".join(map(str, Counter(listPer).most_common(configTopN))))
                print()

                print("- Top misc -")
                print ("\n".join(map(str, Counter(listMisc).most_common(configTopN))))
                print()

                print("- Top org -")
                print ("\n".join(map(str, Counter(listOrg).most_common(configTopN))))
                print()

                print("- Top loc -")
                print ("\n".join(map(str, Counter(listLoc).most_common(configTopN))))
                print()

            else:
                print("Error pipeline not found >>" + str(pipelineKey) + "<<")

            gloStopStopwatch("Process now >>" + str(fP) + "<<")

    else:
        print("Error data not found >>" + inputSelector + "<<")

In [None]:
evalNerPipeline("ftTrNerRoberta", "dataSet0", configTopN = 20)

In [None]:
evalNerPipeline("ftTrNerBert", "dataSet0", configTopN = 20)
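
The aggregation step inside `evalNerPipeline` can be sketched in isolation. This is a minimal sketch, not the notebook's implementation: the `rows` list and the helper `top_entities` are hypothetical stand-ins for the per-message dictionaries stored in the `ftTrNerBert` / `ftTrNerRoberta` columns, assuming each one has the keys `per`, `misc`, `org` and `loc`.

```python
from collections import Counter

# Hypothetical per-message NER results, shaped like the dicts read from
# the "ftTrNerBert" / "ftTrNerRoberta" columns above.
rows = [
    {"per": ["Maximilian Mustermann"], "misc": [], "org": [], "loc": ["Deutschland", "Europa"]},
    {"per": [], "misc": [], "org": ["WHO"], "loc": ["Deutschland"]},
    {"per": ["Maximilian Mustermann"], "misc": [], "org": [], "loc": []},
]

def top_entities(rows, top_n):
    """Count entities per label and return the top_n most frequent ones for each label."""
    counters = {label: Counter() for label in ("per", "misc", "org", "loc")}
    for row in rows:
        for label, counter in counters.items():
            # Counter.update adds one count per list element
            counter.update(row[label])
    return {label: counter.most_common(top_n) for label, counter in counters.items()}

result = top_entities(rows, top_n=2)
for label, ranked in result.items():
    print("- Top " + label + " -")
    for entity, count in ranked:
        print(entity, count)
    print()
```

Using one `Counter` per label avoids building four intermediate lists, and the loop over labels replaces the four repeated print blocks.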

### Sentiment analysis

#### TextBlob

##### Examples

In [None]:
print(str(processSentimentAnalysisPython("Heute ist ein toller Tag. Ich freue mich hier zu sein!")))

In [None]:
print(str(processSentimentAnalysisPython("Heute war ein furchtbarer Tag. Ich hasse alles.")))

##### DataSet0

Implementation shared by both approaches (TextBlob and Transformers)

In [None]:
def evalSenPipeline(pipelineKey, inputSelector, outputFilename, configRolling, configShowScatter):

    if(inputSelector in C_TRANSFORMERS_DATASETS):
        
        filePaths = dfInputFiles[dfInputFiles.inputLabel == inputSelector].inputPath

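        plt.figure(figsize=(16, 9))

        # Fallback bounds for the vertical event lines, so that an unknown
        # pipelineKey does not raise a NameError at plt.vlines below
        vLineMin = 0
        vLineMax = 1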

        for fP in filePaths:
            
            gloStartStopwatch("Process now >>" + str(fP) + "<<")

            if(pipelineKey == "sen-bert"):
                
                df = dictMessages[fP].copy()
                df = df[df.ftQrIsValidText == True]

                df["date"] = pd.to_datetime(df["date"])
                df = df.set_index("date")
                df = df.sort_index()

                # key = x = time / value = y = score
                dictData = {}

                for index, row in df.iterrows():
                    
                    date = index
                    score = row["ftTrSenBert"]

                    if(score != -1):
                        dictData[date] = score

                # Plot
                x,y = zip(*sorted(dictData.items()))
                
                df = pd.DataFrame(list(zip(x, y)), columns =['x', 'y'])

                df['rolling'] = df.y.rolling(configRolling).mean()

                sns.lineplot(data=df, x="x", y="rolling", label = queryChatName(fP))

                if(configShowScatter):
                    sns.scatterplot(data=df, x="x", y="y", label = queryChatName(fP), marker="+")

                plt.gcf().autofmt_xdate()

                # Add vlines
                vLineMin = 2
                vLineMax = 4

            elif(pipelineKey=="sentiment"):

                df = dictMessages[fP].copy()
                df = df[df.ftQrIsValidText == True]

                df["date"] = pd.to_datetime(df["date"])
                df = df.set_index("date")
                df = df.sort_index()

                # key = x = time / value = y = score
                dictData = {}

                for index, row in df.iterrows():
                    
                    date = index
                    retDict = row["ftSenTb"]

                    if(retDict != None):
                        polarity = retDict["polarity"]
                        dictData[date] = polarity

                # Plot
                x,y = zip(*sorted(dictData.items()))

                df = pd.DataFrame(list(zip(x, y)), columns =['x', 'y'])

                df['rolling'] = df.y.rolling(configRolling).mean()

                sns.lineplot(data=df, x="x", y="rolling", label = queryChatName(fP))

                if(configShowScatter):
                    sns.scatterplot(data=df, x="x", y="y", label = queryChatName(fP), marker="+")

                plt.gcf().autofmt_xdate()

                # Add vlines
                vLineMin = -0.05
                vLineMax = 0.175

            elif(pipelineKey=="subjectivity"):

                df = dictMessages[fP].copy()
                df = df[df.ftQrIsValidText == True]

                df["date"] = pd.to_datetime(df["date"])
                df = df.set_index("date")
                df = df.sort_index()

                # key = x = time / value = y = score
                dictData = {}

                for index, row in df.iterrows():
                    
                    date = index
                    retDict = row["ftSenTb"]

                    if(retDict != None):

                        subjectivity = retDict["subjectivity"]
                        dictData[date] = subjectivity

                # Plot
                x,y = zip(*sorted(dictData.items()))

                df = pd.DataFrame(list(zip(x, y)), columns =['x', 'y'])

                df['rolling'] = df.y.rolling(configRolling).mean()

                sns.lineplot(data=df, x="x", y="rolling", label = queryChatName(fP))

                if(configShowScatter):
                    sns.scatterplot(data=df, x="x", y="y", label = queryChatName(fP), marker="+")

                plt.gcf().autofmt_xdate()

                # Add vlines
                vLineMin = 0
                vLineMax = 0.10
                
            else:
                print("Error pipeline not found >>" + str(pipelineKey) + "<<")

            gloStopStopwatch("Process now >>" + str(fP) + "<<")

        # yy - mm - dd
        # TODO: Double check https://www.bundesgesundheitsministerium.de/coronavirus/chronik-coronavirus.html?stand=20210104
        plt.vlines(x = ["2018-12-10"], ymin=vLineMin, ymax=vLineMax, color="orange", ls='--', label="Global Compact for Migration (2018-12-10)")
        plt.vlines(x = ["2020-01-27"], ymin=vLineMin, ymax=vLineMax, color="grey", ls='--', label="Corona Patient Zero Germany (2020-01-27)")
        plt.vlines(x = ["2020-03-23"], ymin=vLineMin, ymax=vLineMax, color="purple", ls='--', label="1. Lockdown Germany (2020-03-23)")
        plt.vlines(x = ["2020-11-02"], ymin=vLineMin, ymax=vLineMax, color="purple", ls='--', label="2. Lockdown light Germany (2020-11-02)")
        plt.vlines(x = ["2020-12-16"], ymin=vLineMin, ymax=vLineMax, color="purple", ls='--', label="3. Lockdown Germany (2020-12-16)")

        _ = plt.legend()

        if(outputFilename != ""):
            plt.savefig(dir_var_output + outputFilename)

    else:
        print("Error data not found >>" + inputSelector + "<<")
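
A note on the smoothing used in `evalSenPipeline`: `rolling(configRolling).mean()` with an integer window averages over the last `configRolling` messages, not over a time span, and with pandas defaults the first `configRolling - 1` values are `NaN`. The following is a minimal pure-Python sketch of that behaviour; `rolling_mean` and the sample scores are hypothetical and only stand in for the pandas call.

```python
import math
from collections import deque

def rolling_mean(values, window):
    """Mean over the last `window` values; NaN until the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # the oldest value is about to be evicted by append()
            total -= buf[0]
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else math.nan)
    return out

# Hypothetical polarity scores of five consecutive messages
scores = [0.0, 0.3, -0.3, 0.6, 0.0]
smoothed = rolling_mean(scores, window=3)
print(smoothed)
```

With `configRolling = 600`, a chat with fewer than 600 valid messages would therefore produce no line at all. Note also that `dictData` is keyed by timestamp, so two messages with an identical timestamp overwrite each other before smoothing.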

Polarity

In [None]:
evalSenPipeline("sentiment", "dataSet0", outputFilename = "", configRolling = 600, configShowScatter = True)

In [None]:
evalSenPipeline("sentiment", "dataSet0", outputFilename = "eval-pipeline-sen-textblob-dataSet0.svg", configRolling = 600, configShowScatter = False)

Subjectivity

In [None]:
evalSenPipeline("subjectivity", "dataSet0", outputFilename = "", configRolling = 600, configShowScatter = True)

In [None]:
evalSenPipeline("subjectivity", "dataSet0", outputFilename = "eval-pipeline-subjectivity-dataSet0.svg", configRolling = 600, configShowScatter = False)

#### Transformers

##### Examples

In [None]:
processSenPipeline("Das ist toll. Ich würde es mir wieder kaufen!", "sen-bert", 0)

In [None]:
processSenPipeline("Das ist toll. Ich würde es aber nicht mehr kaufen!", "sen-bert", 0)

In [None]:
processSenPipeline("Das funktioniert nicht.", "sen-bert", 0)

##### DataSet0

In [None]:
evalSenPipeline("sen-bert", "dataSet0", outputFilename = "", configRolling = 600, configShowScatter = True)

In [None]:
evalSenPipeline("sen-bert", "dataSet0", outputFilename = "eval-pipeline-sen-dataSet0.svg", configRolling = 600, configShowScatter = False)

### Latent Dirichlet Allocation (LDA)

Inspired by: https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

Overview of topic-modeling approaches in Python: https://nlpforhackers.io/topic-modeling/

- LDA (Probabilistic Graphical Models)
- LSA or LSI (Linear Algebra Singular Value Decomposition)
- NMF (Linear Algebra Non-Negative Matrix Factorization)

#### Preparation

##### Text preparation

In [None]:
sampleList = ["Studenten sind faul", "und Studenten essen gerne Eis"]
sampleList

In [None]:
def gensimPreprocess(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks (e.g. "ü" becomes "u"); simple_preprocess itself lowercases, tokenizes, and drops punctuation
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
        
sampleList = list(gensimPreprocess(sampleList))
sampleList

In [None]:
def gensimRemoveStopwords(inputList, stop_words):
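    # str(doc) flattens an already tokenized document; simple_preprocess re-tokenizes it before the stop words are filtered
    return [[word for word in gensim.utils.simple_preprocess(str(doc)) if word not in stop_words] for doc in inputList]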

sampleList = gensimRemoveStopwords(inputList = sampleList, stop_words=gloGetStopWordsList(filterList = []))
sampleList
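
To make the effect of the two preparation steps visible without gensim, here is a minimal pure-Python approximation. It is a sketch under assumptions: `approx_preprocess` and `strip_accents` are hypothetical helpers that imitate the documented behaviour of `simple_preprocess` (lowercasing, alphabetic tokens of 2 to 15 characters, optional accent stripping), not gensim's actual code.

```python
import re
import unicodedata

# Runs of word characters that are not digits (approximates gensim's tokenizer)
TOKEN_PATTERN = re.compile(r"[^\W\d]+")

def strip_accents(text):
    """Remove combining accent marks, e.g. 'für' becomes 'fur'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def approx_preprocess(text, stop_words=(), min_len=2, max_len=15):
    """Lowercase, strip accents, tokenize, then drop short/long tokens and stop words."""
    tokens = TOKEN_PATTERN.findall(strip_accents(text).lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len
        and not t.startswith("_")
        and t not in stop_words
    ]

print(approx_preprocess("Studenten essen gerne Eis, für immer!"))
print(approx_preprocess("Studenten essen gerne Eis, für immer!", stop_words={"fur"}))
```

One consequence for German text: after accent stripping, a stop word list that contains `für` no longer matches the token `fur`. If the list returned by `gloGetStopWordsList` contains umlauts, it may be worth normalizing it the same way; whether that applies here depends on how `ftTdSafeLowerText` was produced.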

##### LDA preparation

In [None]:
def ldaGetDictionary(inputList):
    return corpora.Dictionary(inputList)

sampleDictonary = ldaGetDictionary(sampleList)

In [None]:
sampleDictonary.num_docs

In [None]:
sampleDictonary.num_pos

In [None]:
sampleDictonary.token2id

In [None]:
def ldaGetBOW(dictonary, inputList):
    return [dictonary.doc2bow(text) for text in inputList]

ldaGetBOW(sampleDictonary, sampleList)
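
What the dictionary and the bag-of-words step compute can be sketched in plain Python. This is an illustration only: `build_token2id` and `doc_to_bow` are hypothetical helpers, the token lists are made up, and the integer ids assigned here may differ from the ids gensim's `corpora.Dictionary` would assign.

```python
from collections import Counter

def build_token2id(documents):
    """Assign a consecutive integer id to every distinct token."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(token2id, doc):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical tokenized documents after stop word removal
docs = [["studenten", "faul"], ["studenten", "essen", "gerne", "eis", "eis"]]

token2id = build_token2id(docs)
bows = [doc_to_bow(token2id, d) for d in docs]

print(token2id)
print(bows)
```

Each document becomes a sparse vector of `(id, count)` pairs, which is the corpus format the LDA model consumes. Word order is lost at this point.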

#### Modeling and visualization

`processLda` returns a tuple of:

- ``lda_model``
- ``corpus``
- ``id2word``

In [None]:
def processLda(df, num_topics, debugPrint, stopWords):

    df = df.copy()

    df = df[df.ftQrIsValidText == True]

    df = df[["date", "ftTdSafeLowerText"]]

    df = df.set_index("date")
    df = df.sort_index()

    inputList = df.ftTdSafeLowerText.values.tolist()

    inputList = list(gensimPreprocess(inputList))
    inputList  = gensimRemoveStopwords(inputList, stopWords)

    dictionary = ldaGetDictionary(inputList)

    # Term Document Frequency (dict to bag of words)
    corpus = ldaGetBOW(dictionary, inputList)

    # Build LDA model
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=dictionary,
                                       num_topics=num_topics)

    if(debugPrint):
        pprint(lda_model.print_topics())
        #doc_lda = lda_model[corpus] # TODO: ?

    return (lda_model, corpus, dictionary)

LDA to HTML

In [None]:
def ldaToHtml(lda_model, corpus, id2word, outputLabel):

    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)

    pyLDAvis.save_html(LDAvis_prepared, dir_var_output + 'pyLDAvis/' + outputLabel + '-report.html')

Wrapper

In [None]:
# param outputLabel required
def autoLda(df, debugPrint, outputLabel, filterList, listNumberTopics):

    for iTopics in listNumberTopics:

        iLabel = outputLabel + "-t-" + str(iTopics)

        gloStartStopwatch("Process LDA (" + str(iTopics) + " topics) >> "+ iLabel + "<<")
              
        try:
            
            lda_model, corpus, id2word = processLda(
                    df = df,
                    num_topics = iTopics,
                    debugPrint = debugPrint,
                    stopWords = gloGetStopWordsList(filterList)
                )

            ldaToHtml(
                    lda_model = lda_model,
                    corpus = corpus,
                    id2word = id2word,
                    outputLabel = iLabel
                )

        except Exception as e:
            # Report the actual cause instead of silently swallowing it
            print("Error in process lda >>" + str(e) + "<<")

        gloStopStopwatch("Process LDA (" + str(iTopics) + " topics) >> "+ iLabel + "<<")

##### LDA on DataSet0

In [None]:
if(C_SHORT_RUN == False):
    autoLda(
        df = dictMessages["DS-05-01-2021/ChatExport_2021-01-05-janich"],
        debugPrint = False,
        outputLabel = "oliver-janich",
        filterList = [],
        listNumberTopics = [2,4,8,16]
    )

In [None]:
if(C_SHORT_RUN == False):
    autoLda(
        df = dictMessages["DS-05-01-2021/ChatExport_2021-01-05-hildmann"],
        debugPrint = False,
        outputLabel = "attila-hildmann",
        filterList = [],
        listNumberTopics = [2,4,8,16]
    )

In [None]:
if(C_SHORT_RUN == False):
    autoLda(
        df = dictMessages["DS-05-01-2021/ChatExport_2021-01-05-evaherman"],
        debugPrint = False,
        outputLabel = "eva-herman",
        filterList = [],
        listNumberTopics = [2,4,8,16]
    )

In [None]:
if(C_SHORT_RUN == False):
    autoLda(
        df = dictMessages["DS-05-01-2021/ChatExport_2021-01-05-xavier"],
        debugPrint = False,
        outputLabel = "xavier-naidoo",
        filterList = [],
        listNumberTopics = [2,4,8,16]
    )

##### LDA on DataSet1a

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet1a" in C_LOAD_DATASETS):
        autoLda(
            df = dictMessages["DS-05-01-2021a/ChatExport_2021-01-05-freiheitsChat"],
            debugPrint = False,
            outputLabel = "group-freiheitsChat",
            filterList = [],
            listNumberTopics = [2,4,8,16,32]
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet1a" in C_LOAD_DATASETS):
        autoLda(
            df = dictMessages["DS-05-01-2021a/ChatExport_2021-01-05-freiheitsChatBlitz"],
            debugPrint = False,
            outputLabel = "group-freiheitsChatBlitz",
            filterList = [],
            listNumberTopics = [2,4,8,16,32]
        )

In [None]:
if(C_SHORT_RUN == False):
    if("dataSet1a" in C_LOAD_DATASETS):
        autoLda(
            df = dictMessages["DS-05-01-2021a/ChatExport_2021-01-05-liveFuerDeOsSc"],
            debugPrint = False,
            outputLabel = "group-liveFuerDeOsSc",
            filterList = [],
            listNumberTopics = [2,4,8,16,32]
        )

### Outlook: text generation

Stop the global stopwatch

In [None]:
gloStopStopwatch("Global notebook")

Sample text

In [None]:
sampleText = "Hallo, mein Name ist Max und ich esse gerne Eis. Ich schreibe gerade an meiner Masterarbeit und teste neue Verfahren. Ich komme aus dem Großraum München und bin Informatiker."
sampleText

In [None]:
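def processTextGenPipeline(inputText, pipelineKey, cMaxLength):
    if(pipelineKey in pipelineKeys):
        return dictPipelines[pipelineKey](inputText, max_length=cMaxLength)
    else:
        # Report unknown keys instead of silently returning None
        print("Error pipeline not found >>" + str(pipelineKey) + "<<")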

##### gpt2

In [None]:
processTextGenPipeline(sampleText, "text-gen-gpt2", cMaxLength = 100)

##### gpt2-faust

In [None]:
processTextGenPipeline(sampleText, "text-gen-gpt2-faust", cMaxLength = 100)

## Further reading / inspiration

- https://towardsai.net/p/data-mining/text-mining-in-python-steps-and-examples-78b3f8fd913b
- https://towardsdatascience.com/text-mining-for-dummies-text-classification-with-python-98e47c3a9deb
- https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
- https://realpython.com/python-keras-text-classification/
- https://www.tidytextmining.com/ngrams.html
- http://seaborn.pydata.org/tutorial/categorical.html?highlight=bar%20plot
- https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166
- https://www.kirenz.com/post/2019-08-13-network_analysis/
- https://tgstat.com
- https://huggingface.co/bert-base-german-cased
- https://github.com/sekhansen/text-mining-tutorial/tree/master
- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
- https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
- https://github.com/sekhansen/text-mining-tutorial/blob/master/tutorial_notebook.ipynb
- https://textmining.wp.hs-hannover.de/Preprocessing.html
- https://likegeeks.com/nlp-tutorial-using-python-nltk/
- https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
- https://data-flair.training/blogs/nltk-python-tutorial/
- https://github.com/expectocode/telegram-analysis