**Text Mining in R**

# Text, Strings, and Regular Expressions

Jan R. Riebling

## Agenda

* What is text?
* Text in R
* Text I/O
* String operations

## Additional packages

For historical reasons, base R offers only very limited facilities for processing character strings. The following additional packages are therefore recommended for text mining:

* `tidyverse`: meta-package for data science; includes `dplyr`, `stringr`, and others. The documentation can be found [here]().
* `stringr`: string functionality built on the `stringi` package, which in turn wraps the ICU C/C++ library.
* `stringi`: the more comprehensive package underlying `stringr`.
* `tidytext`: additional support for text-based data structures in line with the `tidyverse` philosophy. Documented in detail in the book [“Text Mining with R”]().
* `tm`: provides data structures for handling text corpora. See the corresponding [documentation](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf).

In [1]:
install.packages(c( 'tidytext', 'tm', 'tidyverse' ))

Installing packages into ‘/home/jrriebling/.local/lib/R’
(as ‘lib’ is unspecified)



In [5]:
library(tidyverse)
library(stringr)
library(tm)

Loading required package: NLP


Attaching package: ‘NLP’


The following object is masked from ‘package:ggplot2’:

    annotate




## Tibbles vs. data.frames

For most purposes, a `tibble` works almost the same as a `data.frame`. It is, however, considerably stricter and exhibits the following behaviors:

* Never converts or transforms data types or variable names, and never assigns row names automatically.
* No partial matching of variable names: `$` requires the full column name.
* Subsetting with the index notation `[` always returns a `tibble`-object.
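
The differences listed above can be seen in a minimal sketch. The toy data and the object names `df` and `tb` are assumptions made for illustration only.

```r
library(tibble)

# Hypothetical toy data, used only for illustration
df <- data.frame(name = c("eggs", "spam"), count = c(2L, 5L),
                 stringsAsFactors = FALSE)
tb <- as_tibble(df)

# Single-column subsetting with `[`: a data.frame drops to a vector,
# a tibble stays a tibble
class(df[, "name"])
class(tb[, "name"])

# Partial matching with `$`: works on a data.frame,
# returns NULL (with a warning) on a tibble
df$na
suppressWarnings(tb$na)
```

Because `tb[, "name"]` keeps its type, code written for tibbles does not need to guard against a column silently turning into a plain vector.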

# Machine-Readable Text

## A formal definition of language

A language $\mathcal{L}$ can be defined as consisting of an alphabet $\Sigma{}$ and a grammar containing the rules for the construction of valid expressions. The alphabet of a language consists of a set of "words" $w$, the smallest elements of a language. Under the model of a generative grammar the construction of a valid sentence can be described as a set of substitution rules, starting from the level of the "sentence" and stopping when the word level is reached. If each substitution is correct the resulting sentence is correct. Only formal languages can be described entirely in terms of a grammar.

## Grammar 

> A grammar $G$ is a tuple $G = (V, T, R, S)$, where
> * $V$ is a finite set of variables;
> * $T$ is a finite set of terminals, with $V \cap T = \emptyset$;
> * $R$ is a finite set of rules. A rule is an element $(P, Q)$ of $(V \cup T)^* \, V \, (V \cup T)^* \times (V \cup T)^*$. That is, $P$ is a word over $(V \cup T)$ that contains at least one variable from $V$, while $Q$ is an arbitrary word over $(V \cup T)$. $P$ is also called the premise and $Q$ the conclusion of the rule. For a rule $(P, Q) \in R$ one usually also writes $P \to_G Q$ or simply $P \to Q$.
> * $S$ is the start symbol, $S \in V$.

(Erk & Priese 2008, 54)
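
To make the definition concrete, here is a small worked example. It is not taken from the cited source, and the symbols $a$ and $b$ are arbitrary choices. Consider

$$G = (\{S\},\ \{a, b\},\ R,\ S), \qquad R = \{\, S \to aS,\ S \to b \,\}$$

Both rules have the single variable $S$ as their premise. A derivation rewrites the start symbol until only terminals remain:

$$S \Rightarrow aS \Rightarrow aaS \Rightarrow aab$$

This grammar therefore generates exactly the words $a^n b$ with $n \geq 0$.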


## Text as a “String”

By default, computers represent text as a sequence of characters (a string). The grammar that generates a valid string is called a regular grammar: beginning at a start point, elements are appended in a predefined direction until an end symbol is reached.

In base R, strings are represented by the class `character`.

In [2]:
class("This is a string!")

# Properties of Strings

## Definition

Strings are opened with either `"` or `'` and must be closed with the same character.

In [3]:
string1 <- "This is a string."
string2 <- 'This is another one, using single quotes'
string1

## Special characters

In [4]:
"This will work"

In [6]:
## Syntax error: the apostrophe closes the single-quoted string early
'But this won't'

## Escaping

To represent special characters in a string, or to suppress a character's special function, the character must be “escaped”. This is done by prefixing it with `\`, which in turn means that a literal `\` inside a string has to be written as `\\`.

Special characters:

* `\n`: the (UNIX) newline; a line break.
* `\r`: carriage return; jumps to the beginning of the line.
* `\t`: tab whitespace.
* `\b`: backspace.
* `\u....`: defines a Unicode code point using hexadecimal digits.

In [9]:
cat("But this won\'t
Leer\tzeile")

But this won't
Leer	zeile

In [11]:
print("\u201e\u01f6\u201c")

[1] "„Ƕ“"
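
The effect of escaping can be checked by counting characters. The following is a minimal base R sketch; the Windows-style path is a hypothetical example value.

```r
# A single backslash inside a string is written as "\\"
backslash <- "\\"
nchar(backslash)          # one character, although typed as two

# Escape sequences count as one character each
nchar("a\tb")             # "a", tab, "b"

# print() shows the escaped form, writeLines() the raw characters
path <- "C:\\temp\\new"
print(path)
writeLines(path)
```

`print()` is useful for spotting hidden escapes, while `cat()` and `writeLines()` show the text as it will be read by a person.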


# Text I/O

## Unicode

The original ASCII encoding uses 7 bits per character and therefore covers only 128 characters; even its 8-bit extensions are limited to 256. The Unicode standard was introduced to cover the full range of natural languages. It assigns every character a code point, and a specific encoding (e.g. UTF-8) determines how each code point is stored as bytes, which a font can then render as a visible character.

The function `file()` can be used to declare the encoding of a text stream when reading it. The encoding is passed via the argument `encoding="..."`.

In [13]:
lines <- readLines(file('../data/ucexample_Weber_utf8.txt', 
                        encoding='UTF-8'))

In [16]:
show(lines)

[1] "Der Nationalstaat und die Volkswirtschaftspolitik"
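
The relation between a code point and its byte representation can be inspected directly in base R. This is a minimal sketch; the character chosen (a-umlaut, written as `\u00e4`) is only an example.

```r
# "\u00e4" is the letter a-umlaut, written as an escape to keep the code ASCII
x <- enc2utf8("\u00e4")

utf8ToInt(x)                 # Unicode code point as an integer (228 = 0xE4)
charToRaw(x)                 # UTF-8 representation: two bytes, c3 a4

# Re-encode to Latin-1: the same character becomes the single byte e4
latin1 <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(latin1)
Encoding(latin1)
```

The same character thus has different byte sequences in different encodings, which is why the encoding of a file has to be known when reading it.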

## Raw text input

R's standard function for reading text files is `readLines()`. As the name suggests, it returns a character vector with one element per line (separated by newlines) of the input file. To preserve the formatting and get the *plain text* content as a single string, the function `readr::read_file()` can be used. Other packages offer additional read/write functions for text.

In [28]:
plain <- read_file('../data/ucexample_Weber_utf8.txt')

In [29]:
show(plain)

[1] "Der Nationalstaat und die Volkswirtschaftspolitik\n\n1895\nVorbemerkung\n\nNicht die Zustimmung, sondern der Widerspruch, welchen die nachstehenden Ausführungen bei vielen ihrer Hörer fanden, veranlaßten mich, sie zu veröffentlichen. Sachlich Neues werden sie Fachgenossen wie andern nur in Einzelheiten bringen, und in welchem speziellen Sinn allein sie den Anspruch auf das Prädikat der »Wissenschaftlichkeit« erheben, ergibt sich aus der Veranlassung ihres Entstehens. Eine Antrittsrede bietet eben Gelegenheit zur offenen Darlegung und Rechtfertigung des persönlichen und insoweit »subjektiven« Standpunktes bei der Beurteilung volkswirtschaftlicher Erscheinungen. Die Ausführungen S. 15 bis 18 (oben)hatte ich mit Rücksicht auf Zeit und Hörerkreis fortgelassen, andere mögen beim Sprechen eine andere Form angenommen haben. Zu den Darlegungen im Eingang ist zu bemerken, daß die Vorgänge hier naturgemäß wesentlich vereinfacht gegenüber der Wirklichkeit dargestellt werden. Die Zeit von 187
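
The difference between the two functions can be reproduced without the original data. The sketch below writes a small hypothetical file to a temporary location and reads it back both ways.

```r
library(readr)

# Write a small hypothetical text file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("Title", "", "First paragraph."), tmp)

lines <- readLines(tmp, encoding = "UTF-8")   # one element per line
plain <- read_file(tmp)                       # a single string, newlines kept

length(lines)
length(plain)
plain

unlink(tmp)
```

`readLines()` is convenient for line-oriented processing, while the single string from `read_file()` keeps paragraph breaks available for later splitting.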

## CSV files

One of the most common and robust file formats is CSV (*comma-separated values*). Each line contains one row of data, with fields separated by a specific character (e.g. a comma). Optionally, the first line contains a header giving the column names.

In [20]:
library(tidyverse)
econ_df <- as_tibble(read.csv(file="~/Dropbox/UNI/Lehre/Workshops/Text Analysis and Topic Modelling in R/data/EconAbstRaw.csv", 
                              header=TRUE, 
                              sep=","))

In [10]:
dim(econ_df)

In [21]:
econ_df

X,Authors,Year.of.Publication,Title,Author.Keywords,KeyWords.Plus,Abstracts,ISO.4.Journal.Abbreviation,Zeitschrift,Origin
<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
0,"Lizzeri, A; Siniscalchi, M",2008,Parental guidance and supervised learning,,NATIONAL LONGITUDINAL SURVEY; MATERNAL EMPLOYMENT; CULTURAL TRANSMISSION; CHILDREN; HERITABILITY; ENVIRONMENT; OUTCOMES; MONKEY; INCOME; YOUTH,We propose a simple theoretical model of supervised learning that is potentially useful to interpret a number of empirical phenomena relevant to the nature-nurture debate. The model captures a basic trade-off between sheltering the child from the consequences of his mistakes and allowing him to learn from experience. We characterize the optimal parenting policy and its comparative-statics properties. We then show that key features of the optimal policy can be useful to interpret provocative findings from behavioral genetics.,Q. J. Econ.,QUARTERLY JOURNAL OF ECONOMICS,General Economics
1,"Caselli, F; Gennaioli, N",2008,Economics and politics of alternative institutional reforms,,FINANCIAL DEVELOPMENT; TRANSITION; CONSTRAINTS; UNCERTAINTY; SOCIETIES; TRADE,"In a model with heterogeneity in managerial talent, we compare the economic and political consequences of reforms aimed at reducing fixed costs of entry (deregulation) and improving the efficiency of financial markets (financial reform). The effects of these reforms depend on the market where control rights over incumbent firms are traded. In the absence of a market for control, both reforms increase the number and the average quality of firms, and are politically equivalent. When a market for control exists, financial reform induces less entry than deregulation, and endogenously compensates incumbents, thereby encountering less political opposition from them. Using this result, we show that financial reform may be used in the short run to open the way for future deregulation. Our model sheds light on the privatization and reform experiences of formerly planned economies as well as on the observed path of reforms in economies of the Organisation for Economic Co-operation and Development.",Q. J. Econ.,QUARTERLY JOURNAL OF ECONOMICS,General Economics
2,"Doyle, JJ",2008,Child protection and adult crime: Using investigator assignment to estimate causal effects of foster care,,WEAK INSTRUMENTS; YOUNG-ADULTS; PLACEMENT; INCARCERATION; OUTCOMES; IDENTIFICATION; NEIGHBORHOOD; SERVICES; POLICE; MODELS,"This paper uses the randomization of families to child protection investigators to estimate causal effects of foster care on adult crime. The analysis uses a new data set that links criminal justice data to child protection data in Illinois, and I find that investigators affect foster care placement. Children on the margin of placement are found to be two to three times more likely to enter the criminal justice system as adults if they were placed in foster care. One innovation describes the types of children on the margin of placement, a group that is more likely to include African Americans, girls, and young adolescents.",J. Polit. Econ.,JOURNAL OF POLITICAL ECONOMY,General Economics
3,"Boone, AL; Mulherin, JH",2008,Do auctions induce a winner's curse? New evidence from the corporate takeover market,mergers and acquisitions; auction; negotiation,COMMON VALUE AUCTIONS; OPERATING PERFORMANCE; ACQUIRING FIRMS; EMPIRICAL ECONOMICS; PUBLIC INFORMATION; BIDDING FIRMS; GAME-THEORY; ACQUISITIONS; MERGERS; WEALTH,"We contrast the winner's curse hypothesis and the competitive market hypothesis as potential explanations for the observed returns to bidders in corporate takeovers. The winner's curse hypothesis posits suboptimal behavior in which winning bidders fail to adapt their strategies to the level of competition and the amount of uncertainty in the takeover environment and predicts that bidder returns are inversely related to the level of competition in a given deal and to the uncertainty in the value of the target. Our measure of takeover competition comes from a unique data set on the auction process that occurs prior to the announcement of a takeover. In our empirical estimation, we control for the endogeneity between bidder returns and the level of competition in takeover deals. Controlling for endogeneity, we find that the returns to bidders are not significantly related to takeover competition. We also find that uncertainty in the value of the target does not reduce bidder returns. Related analysis indicates that prestigious investment banks do not promote overbidding. Analysis of post-takeover operating performance also fails to find any negative effects of takeover competition. As a whole, the results indicate that the breakeven returns to bidders in corporate takeovers stem not from the winner's curse but from the competitive market for targets that occurs predominantly prior to the public announcement of bids. (C) 2008 Elsevier B.V. All rights reserved.",J. Financ. Econ.,JOURNAL OF FINANCIAL ECONOMICS,General Economics
4,"Beneish, MD; Jansen, IP; Lewis, MF; Stuart, NV",2008,Diversification to mitigate expropriation in the tobacco industry,tobacco; acquisitions; diversification; expropriation costs,FREE CASH FLOW; CAMPAIGN CONTRIBUTIONS; CORPORATE DIVERSIFICATION; DIVIDEND ANNOUNCEMENTS; CONGLOMERATE MERGER; ACCOUNTING CHOICE; SHARE REPURCHASES; FIRM PERFORMANCE; MARKET-STRUCTURE; ACQUIRING FIRMS,"While it is well established that diversifying acquisitions by large, cash-rich firms destroy shareholder wealth, we document positive abnormal returns to such acquisitions in the tobacco industry. We show that these abnormal returns are associated with proxies for lower expected expropriation costs. Specifically, we show that wealth creation increases in the degree of domestic geographic expansion afforded by the acquisition (increasing tobacco firms' influence in more political districts) and in the liquidity of tobacco firms' assets (converting cash to harder-to-expropriate operating assets). We also show that the threat of expropriation constrains payments to shareholders before expropriation becomes certain in 1998. (C) 2008 Elsevier B.V. All rights reserved.",J. Financ. Econ.,JOURNAL OF FINANCIAL ECONOMICS,General Economics
5,"Yung, C; Colak, G; Wang, W",2008,Cycles in the IPO market,initial public offerings; adverse selection; underpricing; delisting rates; cross-sectional return variance,INITIAL PUBLIC OFFERINGS; ABNORMAL STOCK RETURNS; INVESTOR SENTIMENT; PRICE PERFORMANCE; INFORMATION; ISSUES; HOT,"We develop a model in which time-varying real investment opportunities lead to time-varying adverse selection in the market for IPOs. The model is consistent with several stylized facts known about the IPO market: economic expansions are associated with a dramatic increase in the number of firms going public, which is in turn positively correlated with underpricing. Adverse selection is procyclical in the sense that dispersion in unobservable quality across firms should be more pronounced during booms. Taking the premise that uncertainty is resolved (and thus private information revealed) over time, we test this hypothesis by looking at long-rum abnormal returns and delisting rates. Consistent with the model, we find (a) greater cross-sectional return variance, and (b) higher incidence of delisting for hot-market IPOs. (C) 2008 Elsevier B.V. All Rights reserved.",J. Financ. Econ.,JOURNAL OF FINANCIAL ECONOMICS,General Economics
6,"Malmendier, U; Tate, G",2008,Who makes acquisitions? CEO overconfidence and the market's reaction,mergers and acquisitions; returns to mergers; overconfidence; hubris; managerial biases,DIVERSIFICATION DESTROY VALUE; CORPORATE DIVERSIFICATION; TENDER OFFERS; STOCK-OPTIONS; MERGER WAVE; CASH FLOW; TOBINS-Q; TAKEOVERS; RETURNS; FIRM,"Does CEO overconfidence help to explain merger decisions? Overconfident CEOs overestimate their ability to generate returns. As a result, they overpay for target companies and undertake value-destroying mergers. The effects are strongest if they have access to internal financing. We test these predictions using two proxies for overconfidence: CEOs' personal over-investment in their company and their press portrayal. We find that the odds of making an acquisition are 65% higher if the CEO is classified as overconfident. The effect is largest if the merger is diversifying and does not require external financing. The market reaction at merger announcement (-90 basis points) is significantly more negative than for non-overconfident CEOs (-12 basis points). We consider alternative interpretations including inside information, signaling, and risk tolerance. (C) 2008 Elsevier B.V. All rights reserved.",J. Financ. Econ.,JOURNAL OF FINANCIAL ECONOMICS,General Economics
7,"Graham, JR; Li, S; Qiu, JP",2008,Corporate misreporting and bank loan contracting,corporate misreporting; financial restatement; corporate fraud; bank loans; cost of debt,DEBT MATURITY STRUCTURE; ECONOMETRIC EVALUATION ESTIMATOR; ASYMMETRIC INFORMATION; GROWTH OPPORTUNITIES; FINANCIAL CONTRACTS; SYNDICATED LOANS; LIQUIDITY RISK; DETERMINANTS; COVENANTS; COST,"This paper is the first to study the effect of financial restatement on bank loan contracting. Compared with loans initiated before restatement, loans initiated after restatement have significantly higher spreads, shorter maturities, higher likelihood of being secured, and More covenant restrictions. The increase in loan spread is significantly larger for fraudulent restating firms than other restating firms. We also find that after restatement, the number of lenders per loan declines and firms pay higher upfront and annual fees. These results are consistent with banks using tighter loan contract terms to overcome risk and information problems arising from financial restatements. (C) 2008 Elsevier B.V. All Rights reserved.",J. Financ. Econ.,JOURNAL OF FINANCIAL ECONOMICS,General Economics
8,"Denis, DJ; Osobov, I",2008,Why do firms pay dividends? International evidence on the determinants of dividend policy,dividend policy; international,DISAPPEARING DIVIDENDS; CATERING INCENTIVES; EARNINGS,"In the US, Canada, UK, Germany, France, and Japan, the propensity to pay dividends is higher among larger, more profitable firms, and those for which retained earnings comprise a large fraction of total equity. Although there are hints of reductions in the propensity to pay dividends in most of the sample countries over the 1994-2002 period, they are driven by a failure of newly listed firms to initiate dividends when expected to do so. Dividend abandonment and the failure to initiate by existing nonpayers are economically unimportant except in Japan. Moreover, in each country, aggregate dividends have not declined and are concentrated among the largest. most profitable firms. Finally, outside of the US there is little evidence of a systematic positive relation between relative prices of dividend paying and non-paying firms and the propensity to pay dividends. Overall, these findings cast doubt on signaling, clientele. and catering explanations for dividends, but support agency cost-based lifecycle theories. (C) 2008 Elsevier B.V. All rights reserved.",J. Financ. Econ.,JOURNAL OF FINANCIAL ECONOMICS,General Economics
9,"Bassetto, M; Phelan, C",2008,Tax riots,,TAXATION; IMPLEMENTATION; INSURANCE; POLICY,"This paper considers an optimal taxation environment where household income is private information, and the government randomly audits and punishes households found to be underreporting. We prove that the optimal mechanism derived using standard mechanism design techniques has a bad equilibrium (a tax riot) where households underreport their incomes, precisely because other households are expected to do so as well. We then consider three alternative approaches to designing a tax scheme when one is worried about bad equilibria.",Rev. Econ. Stud.,REVIEW OF ECONOMIC STUDIES,General Economics


In [11]:
table(econ_df$Zeitschrift)


                        AMERICAN ECONOMIC REVIEW 
                                            1020 
     AMERICAN JOURNAL OF ECONOMICS AND SOCIOLOGY 
                                             587 
                      ECONOMIC AND SOCIAL REVIEW 
                                             158 
       INTERNATIONAL JOURNAL OF SOCIAL ECONOMICS 
                                              92 
                  JOURNAL OF ECONOMIC LITERATURE 
                                              58 
                  JOURNAL OF ECONOMIC PSYCHOLOGY 
                                             760 
                  JOURNAL OF FINANCIAL ECONOMICS 
                                            1157 
                    JOURNAL OF POLITICAL ECONOMY 
                                             782 
JOURNAL OF SOCIAL POLITICAL AND ECONOMIC STUDIES 
                                               7 
                  QUARTERLY JOURNAL OF ECONOMICS 
                                             753 
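
The CSV workflow above can be tried on a miniature example. The data and the column names in `csv_text` are hypothetical and only mimic the structure of the real file.

```r
# Hypothetical miniature CSV; note the quoted field containing a comma
csv_text <- 'id,Authors,Journal
0,"Doe, J; Roe, R",JOURNAL A
1,"Poe, E",JOURNAL B
2,"Moe, M",JOURNAL A'

toy_df <- read.csv(text = csv_text, header = TRUE, sep = ",",
                   stringsAsFactors = FALSE)

dim(toy_df)              # rows and columns
toy_df$Authors[1]        # the comma inside the quotes is kept as data
table(toy_df$Journal)    # frequency of each journal
```

Quoting is what allows a field such as an author list to contain the separator character itself.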

# String Operations

## `stringr`

This package provides standard functions for manipulating strings. Their names always start with `str_`.

* `str_length()`: returns the length of the string; respects `NA`.
* `str_c()`: concatenates strings.
* `str_sub()`: subsetting for strings (works like an index).
* `str_to_lower()`, `str_to_upper()`, `str_to_title()`: return the lower-case, upper-case, or title-case representation of the string.
* `str_sort()`: returns a sorted copy of a character vector.
* `str_extract()`: extracts the substrings that match the given pattern.
* `str_detect()`: returns a logical vector indicating which strings match the given pattern.
* `str_count()`: counts how often the pattern occurs in the string.

Some of these functions have an `_all` variant (e.g. `str_extract_all()`) that applies to every match of the pattern within the string. Otherwise only the first match is used.
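
A few of these functions in action, as a minimal sketch with made-up input values:

```r
library(stringr)

x <- c("spam", NA, "bacon")
str_length(x)                          # 4 NA 5: NA stays NA

str_sub("lobster", 1, 3)               # "lob"
str_sub("lobster", -3)                 # "ter": negative indices count from the end

str_detect(c("eggs", "spam"), "a")     # FALSE TRUE
str_count("banana", "a")               # 3

# Without `_all` only the first match is affected
str_replace("banana", "a", "o")        # "bonana"
str_replace_all("banana", "a", "o")    # "bonono"
```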

In [23]:
## <TAB> to expand
?str_replace

In [24]:
str_c("Ham, bacon", "Spam", sep=" and ")

In [25]:
breakfast <- c("eggs", "bacon", "spam", "lobster")
breakfast

In [26]:
str_c(breakfast, collapse=", ")

In [20]:
str_c(breakfast, "NA")

In [27]:
## NA is contagious!
str_c(breakfast, NA)

## Vectorization

Most `stringr` functions, when applied to a vector, are automatically applied to each of its elements. In that case the return value is a vector as well.

In [28]:
str_length('breakfast')

In [29]:
str_length(breakfast)

In [30]:
str_to_lower(breakfast)

## Locale

`stringr` functions allow a locale to be specified via the corresponding [ISO 639 code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

In [31]:
## Default english language ("en")
str_to_upper(c("i", "ı"))

In [32]:
## Turkish upper case
str_to_upper(c("i", "ı"), locale="tr")

# Regular Expressions

## A few words of warning

> Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

> *Jamie Zawinski* 	 

## What are regular expressions?

Regular expressions specify a selection from a finite character set (alphabet) $\Sigma$. In essence, they are a very elegant way of selecting parts of strings using placeholders.

They are available in almost all operating systems and programming languages (see [here](https://www.regular-expressions.info/)).

## Formal definition

If $x$ and $y$ are regular expressions, then:

1. Concatenation: $(xy)$
2. Alternation: $(x|y)$
3. Repetition (Kleene star): $(x^*)$

are also valid regular expressions.
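
The three construction rules can be tried out with `stringr`. In this minimal sketch the test words are arbitrary, and the anchors `^` and `$` are added so that the whole word has to match.

```r
library(stringr)

words <- c("", "a", "b", "ab", "abab", "aba")

# Concatenation (xy): "a" followed by "b"
str_detect(words, "^ab$")
# Alternation (x|y): either "a" or "b"
str_detect(words, "^(a|b)$")
# Kleene star (x*): zero or more repetitions of "ab"
str_detect(words, "^(ab)*$")
```

Note that the Kleene star also accepts the empty string, since zero repetitions are allowed.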

## Implementation

Two dominant standards for regular expressions:

* [POSIX ERE](http://www.regular-expressions.info/posix.html)
* [PCRE - Perl Compatible Regular Expressions](http://www.pcre.org/)

## ... and then there is R

http://www.regular-expressions.info/rlanguage.html:

>  The R Project for Statistical Computing provides seven regular expression functions in its base package. The R documentation claims that the default flavor implements POSIX extended regular expressions. That is not correct. In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari's TRE engine. It mimics POSIX but deviates from the standard in many subtle and not-so-subtle ways. What this website says about POSIX ERE does not (necessarily) apply to R.

## Functions

Like most programming languages, base R has basic facilities for working with regular expressions (e.g. `grep`). In addition, most `stringr` and `stringi` functions accept regular expressions as well.

## String matching

In the simplest case, a regular expression just specifies a literal part of a string.
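
A short side-by-side comparison of the two families, as a minimal sketch with a made-up `menu` vector. The main practical difference is the argument order.

```r
library(stringr)

menu <- c("Egg and bacon", "Egg and Spam", "Lobster Thermidor")

# Base R: pattern first, then the character vector
grepl("Spam", menu)                  # FALSE TRUE FALSE
sub("Egg", "Tofu", menu)             # replaces the first match per element

# stringr: character vector first, then the pattern
str_detect(menu, "Spam")             # FALSE TRUE FALSE
str_replace(menu, "Egg", "Tofu")
```

Having the data as the first argument is what makes the `stringr` functions easy to use in pipelines.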

In [33]:
breakfast <- 'Egg and bacon\
Egg, sausage, and bacon\
Egg and Spam\
Egg, bacon, and Spam\
Egg, bacon, sausage, and Spam\
Spam, bacon, sausage, and Spam\
Spam, egg, Spam, Spam, bacon, and Spam\
Spam, Spam, Spam, egg, and Spam\
Spam, Spam, Spam, Spam, Spam, Spam, baked beans, Spam, Spam, Spam, and Spam\
Lobster Thermidor aux crevettes with a Mornay sauce, garnished with truffle pâté, brandy, and a fried egg on top, and Spam.'

In [34]:
str_extract(breakfast, 'Spam')

In [35]:
str_extract_all(breakfast, 'Spam')

## Special characters

* `.` : matches any character except a newline.
* `^` : matches the start of a string.
* `$` : matches the end of a string.
* `\` : escapes special characters.

In [36]:
str_extract_all(breakfast, ".gg.")
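
Anchors and escaping can be illustrated with a few made-up strings. This is a minimal sketch; `orders` is a hypothetical example vector.

```r
library(stringr)

orders <- c("Spam and egg", "egg and Spam", "Spam.")

str_detect(orders, "^Spam")      # TRUE FALSE TRUE: starts with "Spam"
str_detect(orders, "Spam$")      # FALSE TRUE FALSE: ends with "Spam"

# An unescaped `.` matches any character; `\\.` matches a literal dot.
# The backslash is doubled because it must also be escaped in the R string.
str_extract("v1.2", "1.2")       # "1.2"
str_extract("v1x2", "1.2")       # "1x2"
str_extract("v1x2", "1\\.2")     # NA
```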

## Alternation

The `|` behaves much like a logical OR.

In [37]:
str_extract_all(breakfast, 'Egg|egg')

In [44]:
dennis <- "Listen, strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
dennis

In [38]:
alice <- '\'When I\'M a Duchess,\' she said to herself, (not in a very hopeful tone though), \'I won\'t have any pepper in my kitchen AT ALL. Soup does very well without--Maybe it\'s always pepper that makes people hot-tempered\'.'
alice

## Repetition

Specifies how often the preceding regular expression $x$ is repeated. The following quantifiers are available:

Syntax | Meaning
-|-
`*` | 0 or more repetitions
`+` | 1 or more repetitions
`{m}` | Exactly `m` repetitions
`{m,n}` | From `m` up to and including `n` repetitions
`?` | 0 or 1 repetitions; placed after another quantifier, it switches off greedy matching.

Repetitions are *greedy* by default, i.e. they consume as much of the string as possible. This behavior can be switched off by placing a `?` after the quantifier.

In [39]:
## Greedy by default; the `?` after `+` makes this match lazy
gene <- "GCUGCCGCAGCG"

str_extract(gene, "C.+?C")
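
To see what the `?` changes, the greedy and the lazy pattern can be run on the same string. A minimal sketch reusing the `gene` sequence from above:

```r
library(stringr)

gene <- "GCUGCCGCAGCG"

str_extract(gene, "C.+C")     # greedy: "CUGCCGCAGC"
str_extract(gene, "C.+?C")    # lazy:   "CUGC"
str_extract(gene, "C{2}")     # exactly two: "CC"
str_count(gene, "GC")         # 4
```

The greedy version runs from the first `C` to the last one in the string, whereas the lazy version stops at the earliest `C` that completes a match.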

## Specifying character classes

Syntax | Equivalent (ASCII) | Meaning
-|-|-
`\d` | `[0-9]` | Digits
`\D` | `[^0-9]` | Anything that is not a digit
`\s` | `[ \t\n\r\f\v]` | Any whitespace
`\S` | `[^ \t\n\r\f\v]` | Anything that is not whitespace
`\w` | `[a-zA-Z0-9_]` | Alphanumeric characters and underscore
`\W` | `[^a-zA-Z0-9_]` | Anything that is not alphanumeric or an underscore
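
The shorthand classes can be tried on a short made-up string. In this minimal sketch, `note` is a hypothetical example value; as in all R strings, each backslash has to be doubled.

```r
library(stringr)

note <- "Room 101,\tfloor 3"

str_extract_all(note, "\\d+")[[1]]      # "101" "3"
str_replace_all(note, "\\s+", " ")      # "Room 101, floor 3"
str_extract_all(note, "\\w+")[[1]]      # "Room" "101" "floor" "3"
str_count(note, "\\D")                  # 13 non-digit characters
```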

## Splitting text into tokens

In [45]:
tokens <- str_extract_all(alice, "\\w+[-\']?\\w*")
tokens

## Finding a valid email address

![Regular Expression](https://imgs.xkcd.com/comics/regular_expressions.png)

In reality it is far more complicated: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html.

# Applied Examples

## Counting vowels

In [61]:
str_count(breakfast, "[aeiou]")

## Search and replace

In [46]:
bsp <- 'Replace the square brackets ([]) in all citations\
with the correct round brackets.\
Specifically for Blabla [2010] but also for others [e.g. Foobaz 2009, 17].'


str_replace_all(bsp, '\\[(.+?)\\]', '(\\1)')
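
Backreferences such as `\\1` can also be used to reorder captured groups. A minimal sketch with illustrative names in "Last, First" order:

```r
library(stringr)

# Illustrative names in "Last, First" order
authors <- c("Riebling, Jan", "Weber, Max")

# \\1 and \\2 refer to the first and second parenthesised group
str_replace(authors, "(\\w+), (\\w+)", "\\2 \\1")
```

Each pair of parentheses in the pattern defines one group, numbered from left to right.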

## Removing HTML or other markup

In [48]:
## Remove the tags
htmltag <- "<p>text</p>"

str_replace_all(htmltag, "<.*?>", "")