# Introduction to Python for Computational Social Science (CSS)

## Jonas Volle
Research Associate  
Chair of Methodology and Empirical Social Research  
Otto-von-Guericke-Universität

[jonas.volle@ovgu.de](mailto:jonas.volle@ovgu.de)

**Office hours**: individually, by prior appointment via [email](mailto:jonas.volle@ovgu.de)

Thursday, 4 July 2024

**Source:** Parts of this session follow Chapter 7 of the book:  

McLevey, John. 2021. Doing Computational Social Science: A Practical Introduction. 1st ed. Thousand Oaks: SAGE Publications.

and the Introduction to Computational Social Science methods with Python by GESIS, available at: https://github.com/gesiscss/css_methods_python 

# Session 6: Text Analysis

## Importing the Text Data

In [None]:
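import cred  # local module that holds the API key
import time
import pprint as pp

import pandas as pd  # needed below for json_normalize and to_datetime
import requests
from bs4 import BeautifulSoup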

GUARDIAN_KEY = cred.GUARDIAN_KEY

In [None]:
# API endpoint (HTTPS, so the API key is not sent in clear text)
API_ENDPOINT = 'https://content.guardianapis.com/search'

# API parameters
PARAMS = {
    'api-key': GUARDIAN_KEY,
    'from-date': '2024-01-01',
    'to-date': '2024-06-30',
    'lang': 'en',
    'production-office': 'uk',
    'q': '"european union" OR EU OR eurozone OR brussels OR "european parliament"',
    'show-fields': 'wordcount,body,byline',
    'page-size': 50
} 


In [None]:
# GET request

response = requests.get(API_ENDPOINT, params=PARAMS) 
response_dict = response.json()['response']

In [None]:
response_dict['total']

In [None]:
response_dict['pages']

In [None]:
all_results = []
cur_page = 1
total_pages = 1

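# page through the results; the `cur_page < 50` guard caps the download at 49 pages
while (cur_page <= total_pages) and (cur_page < 50):

    # make the API request for the current page
    PARAMS['page'] = cur_page
    response = requests.get(API_ENDPOINT, params=PARAMS)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    response_dict = response.json()['response']

    # update the total number of pages
    total_pages = response_dict['pages']

    print(f"page: {cur_page} of {total_pages}")

    # move on to the next page
    cur_page += 1

    # append the results of this page
    all_results.extend(response_dict['results'])

    # pause between requests to go easy on the API
    time.sleep(1)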
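
The paging logic can be tried out without an API key by separating it from the network call. The following is a minimal sketch under assumptions: `fetch_all_pages` and `fake_fetch` are hypothetical helpers that are not part of the original notebook, and `fake_fetch` only imitates the `pages` and `results` fields of the Guardian response.

```python
def fetch_all_pages(fetch_page, max_pages=49):
    """Collect results from a paged API.

    fetch_page(page) must return a dict with the keys 'pages' and 'results'.
    """
    all_results = []
    cur_page = 1
    total_pages = 1  # corrected after the first response
    while cur_page <= total_pages and cur_page <= max_pages:
        response_dict = fetch_page(cur_page)
        total_pages = response_dict['pages']
        all_results.extend(response_dict['results'])
        cur_page += 1
    return all_results


def fake_fetch(page, page_size=2, n_items=5):
    """Hypothetical stand-in for the API: serves n_items items in pages."""
    items = [f"article-{i}" for i in range(n_items)]
    start = (page - 1) * page_size
    n_pages = -(-n_items // page_size)  # ceiling division
    return {'pages': n_pages, 'results': items[start:start + page_size]}


articles = fetch_all_pages(fake_fetch)
print(articles)
```

Passing the request function as an argument keeps the loop itself free of network code, so the same loop could later be called with a function that wraps `requests.get`.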

In [None]:
all_results_df = pd.json_normalize(all_results)

In [None]:
all_results_df['text'] = [BeautifulSoup(i, "html.parser").text for i in all_results_df['fields.body']]

In [None]:
# date
all_results_df['article_date'] = pd.to_datetime(all_results_df.webPublicationDate)

# rename columns
all_results_df = all_results_df.rename(columns={'webTitle':'article_title',
                                               'webUrl':'article_url',
                                               'fields.byline': 'article_author',
                                               'sectionName': 'section_name',
                                               'pillarName': 'pillar_name'})

# filter columns
all_results_df_f = all_results_df[['id', 'article_date', 'section_name', 'pillar_name',
                                   'article_title', 'article_url', 
                                   'article_author', 'text']].copy()

In [None]:
all_results_df_f.head()

In [None]:
# export full data
# all_results_df_f.to_csv('../data/guardian_eu_textdata.csv', index= False)


In [None]:
# export sampled data

all_results_df_sample_100 = all_results_df_f.sample(100, random_state=1234)
# all_results_df_sample_100.to_csv('../data/guardian_eu_textdata_sample_100.csv',
#                                  index= False)

all_results_df_sample_200 = all_results_df_f.sample(200, random_state=1234)
# all_results_df_sample_200.to_csv('../data/guardian_eu_textdata_sample_200.csv',
#                                  index= False)

all_results_df_sample_500 = all_results_df_f.sample(500, random_state=1234)
# all_results_df_sample_500.to_csv('../data/guardian_eu_textdata_sample_500.csv',
#                                  index= False)

all_results_df_sample_1000 = all_results_df_f.sample(1000, random_state=1234)
# all_results_df_sample_1000.to_csv('../data/guardian_eu_textdata_sample_1000.csv',
#                                  index= False)

## Natural Language Processing

First, we import the text data.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../data/guardian_eu_textdata_sample_100.csv')

### Tokenization

For the analysis, the texts are split into units of analysis.

The words of a text document, separated by whitespace and punctuation, are called tokens.

We can also collect all tokens in a single list to count the most frequent ones.
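
As a minimal sketch of this idea, the following example tokenizes two made-up sentences and counts the tokens. It uses a simple regular expression from the standard library as a stand-in for a real tokenizer such as `word_tokenize` from nltk, which is used elsewhere in this session; `simple_tokenize` and the example sentences are assumptions for illustration only.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Split a text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


docs = ["The EU voted. The vote was close!", "Brussels welcomed the vote."]

# all tokens of all documents in one flat, lowercased list
all_tokens = [tok.lower() for doc in docs for tok in simple_tokenize(doc)]

# token -> number of occurrences in the whole corpus
vocabulary = Counter(all_tokens)
print(vocabulary.most_common(3))
```

A `Counter` built this way has the same shape as the `vocabulary` object whose values are plotted as a histogram in this section.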

In [None]:


# all tokens in a single list


In [None]:
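import matplotlib.pyplot as plt

# distribution of token frequencies (`vocabulary` maps each token to its count)
plt.hist(vocabulary.values(), bins=1000, color='blue', edgecolor='black')
plt.yscale('log')
plt.show()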

### Stemming

Stemming removes the suffixes of words to obtain a simplified form of the word.

running, runs, run -> run

A widely used stemming algorithm is the Porter stemmer.
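
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies a much larger set of ordered rules; `naive_stem` and its suffix list are assumptions for illustration only.

```python
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix; a toy version of stemming."""
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least three characters so that short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # undo consonant doubling, e.g. "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for token in ["running", "runs", "run"]:
    print(token, "->", naive_stem(token))
```

Like any stemmer, this works on the surface form only, so the result is not guaranteed to be a dictionary word.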

In [None]:
# !conda install nltk

In [None]:
# !python -m nltk.downloader popular

In [None]:
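from nltk.tokenize import word_tokenize

# `example_text` is assumed to hold a raw text string, e.g. one article from df.text
example_df = pd.DataFrame({'token': word_tokenize(example_text)})
example_df.head()

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
example_df['stem'] = [stemmer.stem(token) for token in example_df.token]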

### Lemmatization

A lemma is the base (dictionary) form of a word.  

go, goes, went, gone, or going -> go
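
Unlike stemming, lemmatization needs knowledge about the language, because forms such as "went" cannot be reduced to "go" by cutting off a suffix. The following toy sketch uses a hand-written lookup table to show the principle; `LEMMA_TABLE` and `lookup_lemma` are assumptions for illustration, whereas spaCy derives lemmas from a trained pipeline.

```python
# hand-written table: inflected form -> lemma
LEMMA_TABLE = {"goes": "go", "went": "go", "gone": "go", "going": "go"}


def lookup_lemma(token):
    """Return the lemma from the table, or the lowercased token itself."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)


forms = ["go", "goes", "went", "gone", "going"]
print([lookup_lemma(form) for form in forms])
```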

In [None]:
# import spaCy


In [None]:
# Process the text with spaCy


In [None]:


# https://spacy.io/api/token#attributes

In [None]:
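# `doc` is the spaCy document created in the cell above
lemma_df = pd.DataFrame({'token': [token.text for token in doc],
                         'lemma': [token.lemma_ for token in doc]})

# add the Porter stem for comparison (`stemmer` from the stemming section)
lemma_df['stem'] = [stemmer.stem(i) for i in lemma_df.token]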

In [None]:
lemma_df.sample(20)

### N-grams

N-grams are combinations of n words. gensim can detect words that frequently occur together.
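
The principle can be sketched with the standard library alone: count adjacent word pairs and join the frequent ones with an underscore. This is a simplification. gensim scores candidate pairs rather than using a raw count threshold, and `find_frequent_bigrams`, `merge_bigrams`, and the example texts are assumptions for illustration.

```python
from collections import Counter


def find_frequent_bigrams(texts, min_count=2):
    """Return the adjacent token pairs that occur at least min_count times."""
    counts = Counter()
    for tokens in texts:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Replace each detected pair by a single token joined with '_'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


texts = [["the", "european", "union", "voted"],
         ["european", "union", "leaders", "met"],
         ["leaders", "voted"]]

frequent = find_frequent_bigrams(texts)
texts_merged = [merge_bigrams(tokens, frequent) for tokens in texts]
print(texts_merged)
```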

In [None]:
# gensim expects tokenized texts as input
texts = [word_tokenize(text) for text in df.text]

In [None]:
# extract bigrams


In [None]:
# visualize the extracted bigrams
extracted_bigrams = []
for text in texts_bigrams:
    for el in text:
        if "_" in el:
            extracted_bigrams.append(el)

extracted_bigrams = set(extracted_bigrams)
print(extracted_bigrams)

### Stopwords

Stop words are words that occur very frequently in a language but usually carry little meaning or semantic value of their own. Examples of English stop words are "the", "a", "an", "and", "in", "on", "is", "are", "for", "with", and so on.
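
Removing stop words boils down to a membership test against a list. The sketch below uses a tiny hand-made set built from the examples above; `STOP_WORDS_SMALL` and the example tokens are assumptions for illustration, and the stop word list shipped with spaCy is considerably longer.

```python
# a tiny stop word set for illustration only
STOP_WORDS_SMALL = {"the", "a", "an", "and", "in", "on", "is", "are", "for", "with"}

tokens = ["the", "parliament", "is", "in", "brussels"]

# keep the tokens that are not stop words, and record what was dropped
filtered_tokens = [tok for tok in tokens if tok not in STOP_WORDS_SMALL]
removed_tokens = [tok for tok in tokens if tok in STOP_WORDS_SMALL]

print("Filtered tokens:", filtered_tokens)
print("Stop words removed:", removed_tokens)
```

Using a `set` keeps the membership test fast even for long stop word lists.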

In [None]:
plt.hist(vocabulary.values(), bins=1000, color='blue', edgecolor='black')
plt.yscale('log')
plt.show()

In [None]:
text = df.text[40]

# Process the text with spaCy


# Define the list of stop words


In [None]:
# Remove stop words from the text


In [None]:
# Print the original and filtered text, and the stop words removed
print("Original tokens: ", [token.text for token in doc])

In [None]:
print("Filtered tokens:", filtered_text)

In [None]:
print("Stop words removed: ", stop_words_removed)

In [None]:
print(len(stop_words))
stop_words.extend(["bst"])
print(len(stop_words))

In [None]:
'bst' in stop_words

### Parts of Speech

English has nine main parts of speech:

verb: expresses an action or a state of being. E.g. jump, is, write, become  
noun: identifies a person, place, or thing, or names a particular one of these (proper noun). E.g. man, house, happiness  
pronoun: can replace a noun or noun phrase. E.g. she, we, they, it  
determiner: is placed in front of a noun to express a quantity or to clarify what the noun refers to. E.g. the, a, some, my  
adjective: modifies a noun or a pronoun. E.g. pretty, old, blue, smart  
adverb: modifies a verb, an adjective, or another adverb. E.g. gently, extremely, carefully, well  
preposition: connects a noun or pronoun to other parts of the sentence. E.g. by, with, about, until  
conjunction: glues words, clauses, and sentences together. E.g. and, but, or, while, because  
interjection: expresses emotion in a sudden or exclamatory way. E.g. oh!, wow!, oops!  

In [None]:
pd.DataFrame({'token': [token.text for token in doc],
             'pos': [token.pos_ for token in doc]}).sample(20)

In [None]:
spacy.explain("PROPN")

### Named Entity Recognition

In [None]:
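# `doc.ents` holds the entities spaCy found in the processed text
example_text_entities = pd.DataFrame({'entity': [ent.text for ent in doc.ents],
                                      'entity_label': [ent.label_ for ent in doc.ents]})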

In [None]:
example_text_entities[example_text_entities.entity_label == ''].head()

In [None]:
example_text_entities[example_text_entities.entity_label == ''].value_counts('entity')

### Preprocessing Pipeline

We can combine the individual preprocessing steps into a pipeline of functions:

In [None]:
import spacy
import re # regex
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")

Remove special characters, numbers, line breaks, etc. This function uses regular expressions (`regex`). An overview of these expressions can be found, for example, here: https://images.datacamp.com/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf

In [None]:
def clean_text(text):

    # remove punctuation and special characters
    pattern = r"[^\w\s]"
    text_clean = re.sub(pattern, "", text)

    # remove numbers
    pattern = r"\d+"
    text_clean = re.sub(pattern, "", text_clean)

    # remove all non-ASCII characters
    pattern = r"[^\x00-\x7F]+"
    text_clean = re.sub(pattern, "", text_clean)

    # replace new line characters (str.replace returns a new string, so assign it)
    text_clean = text_clean.replace("\n", " ")

    # remove empty spaces left by regex
    text_clean = ' '.join(text_clean.split())
    
    return text_clean
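
To see what the individual patterns do, it helps to apply them step by step to a short string. The following sketch repeats three of the steps from `clean_text` on a made-up sample sentence, which is an assumption for illustration.

```python
import re

sample = "Brexit: 52% voted\nLeave in 2016!"

# [^\w\s] matches everything that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", sample)

# \d+ matches runs of digits
no_digits = re.sub(r"\d+", "", no_punct)

# split() without arguments splits on any whitespace, including line breaks
cleaned = " ".join(no_digits.split())

print(repr(no_punct))
print(repr(no_digits))
print(repr(cleaned))
```

Note that `\w` also matches non-ASCII letters such as "é", which is why `clean_text` removes non-ASCII characters in a separate step.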

... tokenization

In [None]:
def tokenization(texts):
    # one list of tokens per document
    return [word_tokenize(text) for text in texts]

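... removing stop words

In [None]:
def remove_stop_words(texts, stop_words=()):
    stop_set = set(stop_words)  # set lookup is faster than list lookup
    return [[token for token in text if token not in stop_set] for text in texts]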


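... adding bigrams

In [None]:
from gensim.models.phrases import Phrases

def add_bigrams(texts):
    # assumption: gensim's Phrases with default settings; detected pairs are joined with "_"
    phrases = Phrases(texts)
    return [phrases[text] for text in texts]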

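... stemming

In [None]:
def stemming(texts):
    # assumes `stemmer` is the nltk PorterStemmer instance from the stemming section
    return [[stemmer.stem(token) for token in text] for text in texts]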

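... lemmatizing

In [None]:
def lemmatization(texts):
    # assumption: re-join each token list and let spaCy (`nlp`, loaded above) return the lemmas
    return [[token.lemma_ for token in nlp(" ".join(text))] for text in texts]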

We can now integrate all functions into a pipeline. This function takes a text corpus as input. A text corpus consists of a collection of documents, which can be stored, for example, in a list or an array.

In [None]:
stop_words = list(STOP_WORDS)
stop_words.extend(['bst', 'year'])

In [None]:
def pipeline(corpus):
    print("Cleaning text...")
    corpus = [clean_text(text) for text in corpus]

    print("Tokenization...")
    corpus = tokenization(corpus)

    print("Lowercasing...")
    corpus = [[el.lower() for el in text] for text in corpus]

    print("Stop Words removal...")
    corpus = remove_stop_words(corpus, stop_words=stop_words)
    
    print("Extract bigrams...")
    corpus = add_bigrams(corpus)

    print("Lemmatization...")
    corpus = lemmatization(corpus)

    print("Stop Words removal after lemmatizing...")
    corpus = remove_stop_words(corpus, stop_words)

    print("Removing tokens that are too short...")
    corpus = [[c for c in text if len(c) > 2] for text in corpus]

    return corpus

Now we create our corpus:

In [None]:
# we create a dictionary

In [None]:
print(f"Number of words in the dictionary: {len(dictionary)}")
print("Dictionary first 5 elements (id, token):", list(dictionary.items())[:5])

In [None]:
plt.hist(dictionary.dfs.values())
plt.show()

In [None]:
# Prune the dictionary:


In [None]:
print(f"Number of words in the dictionary after pruning: {len(dictionary)}")
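
Pruning usually drops tokens that appear in very few documents (too rare to be informative) or in almost all documents (too common to discriminate). gensim's `Dictionary.filter_extremes` offers the parameters `no_below` and `no_above` for this. The sketch below imitates that logic with the standard library; `prune_vocabulary`, its thresholds, and the example texts are assumptions for illustration.

```python
from collections import Counter


def prune_vocabulary(texts, no_below=2, no_above=0.8):
    """Keep tokens found in at least no_below documents
    and in at most a share of no_above of all documents."""
    n_docs = len(texts)
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each token once per document
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}


texts = [["eu", "vote", "brexit"],
         ["eu", "vote", "trade"],
         ["eu", "trade", "tariff"],
         ["eu", "election"]]

kept = prune_vocabulary(texts)
print(sorted(kept))
```

Here "eu" is dropped because it occurs in every document, and "brexit", "tariff", and "election" are dropped because they occur in only one.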

In [None]:
plt.hist(dictionary.dfs.values())
plt.show()

In [None]:
# convert the corpus to bag-of-words format


In [None]:
print("First document in bag-of-words format (raw):", document_term_matrix[0])

In [None]:
print("First document in bag-of-words format (word, frequency):",
      [[dictionary[id], freq] for id, freq in document_term_matrix[0]])
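
The bag-of-words format stores, for each document, pairs of token id and frequency, and discards word order. A minimal sketch with the standard library shows how such a representation can be built; `token2id`, `doc2bow`, and the example texts are assumptions for illustration and only mimic what the gensim dictionary does.

```python
from collections import Counter

texts = [["eu", "vote", "eu"], ["trade", "vote"]]

# assign an integer id to each distinct token, in order of first appearance
token2id = {}
for tokens in texts:
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))


def doc2bow(tokens):
    """Return sorted (token_id, frequency) pairs for one document."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


bow_corpus = [doc2bow(tokens) for tokens in texts]
print(token2id)
print(bow_corpus)
```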

Top words in the corpus:

In [None]:
word_counts_df = pd.DataFrame([[dictionary[id], freq] for id, freq in dictionary.cfs.items()],
                            columns=['word', 'count']).sort_values('count', ascending=False)

word_counts_df.head(20)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
top_words = word_counts_df.head(20)  # the 20 most frequent words

ax = sns.barplot(data=top_words, 
                 y='word',
                 x='count',
                 color='darkgray') 
sns.despine() 
plt.show()

In [None]:
word_counts_df.tail(20)

### Export

In [None]:
import pickle as pkl

with open("../data/dict_gensim.pkl", "wb") as file:
    pkl.dump(dictionary, file)

with open("../data/text_df.pkl", "wb") as file:
    pkl.dump(df, file)

with open("../data/document_term_matrix.pkl", "wb") as file:
    pkl.dump(document_term_matrix, file)

<div class='alert alert-block alert-success'>

### Exercise 1

--> session_06_excercise_01.ipynb

</div>

## Topic Model

Topic models are probabilistic models used to identify semantic clusters in collections of documents. They are well suited to exploring text data because they find thematic structures that are not defined in advance. The estimation aims to determine the proportional composition of a fixed number of topics in the documents of a collection. We can interpret these semantic clusters as topics.

Topic models yield a probability distribution over all words for each topic and a probability distribution over all topics for each document. Every smallest unit of analysis (e.g. a word or an n-gram) has a probability of belonging to each topic, and every topic has a probability of occurring in each document. A topic becomes semantically interpretable through its n most probable words.
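
The two kinds of distributions can be illustrated with a toy example. All numbers below are made up for illustration and do not come from a fitted model; `topic_word`, `doc_topic`, `top_words`, and `word_prob_in_doc` are hypothetical names.

```python
# P(word | topic): one probability distribution over the vocabulary per topic
topic_word = {
    "topic_0": {"election": 0.5, "vote": 0.3, "trade": 0.1, "tariff": 0.1},
    "topic_1": {"election": 0.1, "vote": 0.1, "trade": 0.5, "tariff": 0.3},
}

# P(topic | document): one probability distribution over the topics per document
doc_topic = {"doc_a": {"topic_0": 0.8, "topic_1": 0.2}}


def top_words(topic, n=2):
    """The n most probable words of a topic, used to interpret it."""
    dist = topic_word[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]


def word_prob_in_doc(doc, word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(p_topic * topic_word[topic][word]
               for topic, p_topic in doc_topic[doc].items())


print(top_words("topic_0"))
print(top_words("topic_1"))
print(round(word_prob_in_doc("doc_a", "trade"), 2))
```

Reading the top words, one might label the first toy topic "elections" and the second "trade policy"; this interpretive step is always up to the researcher.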

In [None]:
# import
# Load the gensim dictionary
with open("../data/dict_gensim.pkl", "rb") as file:
    dictionary = pkl.load(file)

# Load the DataFrame
with open("../data/text_df.pkl", "rb") as file:
    df = pkl.load(file)

# Load the document term matrix
with open("../data/document_term_matrix.pkl", "rb") as file:
    document_term_matrix = pkl.load(file)

In [None]:


# train an LDA model on the corpus

In [None]:
# print the topics and associated keywords


Model selection based on the coherence score.

Topic coherence scores a single topic by measuring the degree of semantic similarity between its highest-scoring words. These measures help to distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. In addition, we can compare different models by their mean coherence.
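
There are several coherence measures. As a rough illustration, the sketch below computes the UMass coherence, which relies only on how often the top words of a topic occur together in the same documents. This is a different measure from `c_v`, the default of gensim's `CoherenceModel`; `umass_coherence` and the example texts are assumptions for illustration.

```python
import math


def umass_coherence(top_words, texts):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs,
    where D counts the documents that contain all given words.
    Assumes every top word occurs in at least one document."""
    doc_sets = [set(tokens) for tokens in texts]

    def doc_count(*words):
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_count(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_count(top_words[l]))
    return score


texts = [["eu", "vote"],
         ["eu", "vote", "trade"],
         ["trade", "tariff"],
         ["eu", "tariff"]]

coherent = umass_coherence(["eu", "vote"], texts)       # words that co-occur
incoherent = umass_coherence(["vote", "tariff"], texts)  # words that never co-occur
print(coherent, incoherent)
```

Higher (less negative) values indicate that the top words of a topic tend to appear in the same documents.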

In [None]:
from gensim.models import CoherenceModel, LdaModel
import numpy as np

scores = []
models = []

for num_topics in np.arange(5, 30):

    # fit LDA model
    lda_model = LdaModel(document_term_matrix,
                         id2word=dictionary,
                         num_topics=num_topics,
                         random_state=12345
                        )

    # compute Coherence Score
    coherence_model_lda = CoherenceModel(model=lda_model,
                                         texts=df.text_preprocessed,
                                         dictionary=dictionary)
    
    coherence_lda = coherence_model_lda.get_coherence()
    print(f'Coherence Score with {num_topics} topics: {coherence_lda}')

    scores.append([num_topics, coherence_lda])
    models.append(lda_model)

In [None]:
scores_df = pd.DataFrame(scores, columns=['num_topic', 'coherence_score'])

In [None]:
ax = sns.lineplot(data=scores_df, x='num_topic', y='coherence_score')
plt.show()

In [None]:
scores_df.sort_values('coherence_score', ascending=False)

In [None]:
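# best model: the one with the highest coherence score
lda_model_best = models[scores_df['coherence_score'].idxmax()]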

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model_best,
                                     document_term_matrix,
                                     dictionary)
vis

<div class='alert alert-block alert-success'>

### Exercise 2

--> session_06_excercise_02.ipynb

</div>