In [1]:
import csv
import pandas as pd

# English data (AG News)
classes_en = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
train_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/train.csv", 
                       names = ["Label", "Title", "Article"],
                       encoding = "utf-8")
test_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/test.csv", 
                      names = ["Label", "Title", "Article"],
                      encoding = "utf-8")

# German data (10kGNAD)
train_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/train.csv", 
                       sep = ";", names = ["Label", "Article"], 
                       quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")
test_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/test.csv", 
                       sep = ";", names = ["Label", "Article"], 
                       quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")
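The dictionary `classes_en` maps the numeric AG News labels to class names, but it is not applied anywhere in the cell above. The following is a minimal sketch of how the mapping could be used. The `sample_labels` list is a hypothetical stand-in for `train_en["Label"]`; with pandas, the same lookup could presumably be written as `train_en["Label"].map(classes_en)`.

```python
# Label-to-name mapping, copied from the cell above.
classes_en = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Hypothetical sample of numeric labels (stand-in for train_en["Label"]).
sample_labels = [3, 3, 1, 4, 2]

# Translate each numeric label into its readable class name.
label_names = [classes_en[label] for label in sample_labels]
print(label_names)
```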

In [2]:
print(train_en.shape)
print(test_en.shape)
train_en.head()

(120000, 3)
(7600, 3)


| | Label | Title | Article |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


In [3]:
print(train_de.shape)
print(test_de.shape)
train_de.head()

(9245, 2)
(1028, 2)


| | Label | Article |
|---|---|---|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |


# **spaCy**

spaCy is a natural language processing library with many built-in features for core linguistic tasks.  
Work through the following exercises to get familiar with the spaCy API. The documentation can be found at:

[https://spacy.io/usage](https://spacy.io/usage)

spaCy needs a language model to analyze text. We will work with both the English and the German language models, which can be downloaded by executing the following:

In [4]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4 MB)
     --------------------------------------- 37.4/37.4 MB 11.7 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'

    Error: Couldn't link model to 'en_core_web_sm'
    Creating a symlink in spacy/data failed. Make sure you have the required
    permissions and try re-running the command as admin, or use a
    virtualenv. You can still import the model as a module and call its
    load() method, or create the symlink manually.

    D:\Users\Micha\anaconda3\envs\text\lib\site-packages\en_core_web_sm -->
    D:\Users\Micha\anaconda3\envs\text\lib\site-packages\spacy\data\en_core_web_sm


    Creating a shortcut link for 'en' didn't work (maybe you don't have
    admin permissions?), but you can still load the model via its full
    package nam

In [5]:
!python -m spacy download de_core_news_sm

Collecting de_core_news_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz (38.2 MB)
     --------------------------------------- 38.2/38.2 MB 11.7 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'

    Error: Couldn't link model to 'de_core_news_sm'
    Creating a symlink in spacy/data failed. Make sure you have the required
    permissions and try re-running the command as admin, or use a
    virtualenv. You can still import the model as a module and call its
    load() method, or create the symlink manually.

    D:\Users\Micha\anaconda3\envs\text\lib\site-packages\de_core_news_sm -->
    D:\Users\Micha\anaconda3\envs\text\lib\site-packages\spacy\data\de_core_news_sm


    Creating a shortcut link for 'en' didn't work (maybe you don't have
    admin permissions?), but you can still load the model via its full
    packa

### Loading models

In [6]:
import spacy
import random
import en_core_web_sm
import de_core_news_sm

nlp_en = en_core_web_sm.load()
nlp_de = de_core_news_sm.load()

### Use spaCy to tokenize a random article from both the English and the German dataset

In [30]:
# English: pick a random article and run it through the pipeline
r_en = random.randint(0, train_en.shape[0] - 1)
tokens_en = nlp_en(train_en["Article"].iloc[r_en])

print(train_en["Article"].iloc[r_en])
print([token_en.text for token_en in tokens_en])

EXPECT to see and hear Creative Technology everywhere as it embarks on a worldwide marketing campaign to snatch dominance of the digital music player market from Apple Computer.
['EXPECT', 'to', 'see', 'and', 'hear', 'Creative', 'Technology', 'everywhere', 'as', 'it', 'embarks', 'on', 'a', 'worldwide', 'marketing', 'campaign', 'to', 'snatch', 'dominance', 'of', 'the', 'digital', 'music', 'player', 'market', 'from', 'Apple', 'Computer', '.']
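For comparison, here is a minimal sketch of what a naive whitespace split does with the same sentence. It uses only the Python standard library and is meant purely as a contrast to the spaCy output above, not as a replacement for it.

```python
# The article text printed above, copied verbatim.
text = ("EXPECT to see and hear Creative Technology everywhere as it embarks on a "
        "worldwide marketing campaign to snatch dominance of the digital music "
        "player market from Apple Computer.")

# Splitting on whitespace leaves punctuation glued to the neighbouring word.
naive_tokens = text.split()
print(len(naive_tokens))
print(naive_tokens[-2:])
```

The whitespace split keeps the final period attached to `Computer.`, whereas the spaCy tokenizer emits the word and the period as two separate tokens.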


In [31]:
# German: pick a random article and run it through the pipeline
r_de = random.randint(0, train_de.shape[0] - 1)
tokens_de = nlp_de(train_de["Article"].iloc[r_de])

print(train_de["Article"].iloc[r_de])
print([token_de.text for token_de in tokens_de])

Bereits auf CES und MWC präsentiert, DualSIM-Unterstützung als neues Feature. Das US-Unternehmen Saygus will es noch einmal wissen. Konnte man 2009 aufgrund zahlreicher Verzögerungen das damals vorgestellte vPhone nie in den Verkauf bringen, soll es nun beim neuen Projekt, dem Saygus V2 (V-Squared) besser laufen. Auch diesmal ist man um Versprechen nicht verlegen, soll das Gerät doch praktisch alle erdenklichen Ansprüche an ein Smartphone abdecken. Das wasserdichte Gerät verfügt über eine Snapdragon-801-CPU, drei GB RAM, zwei microSD-Slots, eine 21-MP-Kamera mit optischer Bildstabilisierung, einen Fingerabdruckscanner, Infrarot, austauschbarer Akku und eine Reihe anderer Funktionen, die es zum attraktiven Komplettpaket machen sollen. Doch der geplante Start im ersten Halbjahr wurde nach hinten vertagt. Nun will man zu Jahresende liefern und will zur Unterstützung der Massenproduktion per Crowdfunding auf Indiegogo Geld einsammeln. Dafür wurde das Smartphone mit einem weiteren neuen Fea

### Use spaCy to lemmatize a random article from both the English and the German dataset

In [9]:
print(train_en["Article"].iloc[r_en])
print([token_en.lemma_ for token_en in tokens_en])

The price of crude oil climbed in European trading Monday, edging back over \$43 per barrel on fears that the producer cartel OPEC may cut production to stem a recent price drop.
['the', 'price', 'of', 'crude', 'oil', 'climb', 'in', 'european', 'trading', 'monday', ',', 'edge', 'back', 'over', '\\$43', 'per', 'barrel', 'on', 'fear', 'that', 'the', 'producer', 'cartel', 'opec', 'may', 'cut', 'production', 'to', 'stem', 'a', 'recent', 'price', 'drop', '.']
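To see what lemmatization actually changed, it can help to list only the tokens whose lemma differs from the lowercased surface form. The sketch below uses a small subset of the token/lemma pairs printed above, copied by hand; in the notebook, the same comparison could presumably be run over `token.text` and `token.lemma_` directly.

```python
# Hand-copied subset of the token/lemma pairs from the output above.
tokens = ["The", "price", "climbed", "edging", "fears", "OPEC"]
lemmas = ["the", "price", "climb", "edge", "fear", "opec"]

# Keep only pairs where the lemma is more than a lowercased copy of the token.
changed = [(t, l) for t, l in zip(tokens, lemmas) if t.lower() != l]
print(changed)
```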


In [10]:
print(train_de["Article"].iloc[r_de])
print([token_de.lemma_ for token_de in tokens_de])

Ithaca – Dass zuckerhaltige Getränke, Fast Food und Süßigkeiten nicht gesund sind, ist unumstritten. Forscher der Cornell University behaupten nun aber in einer Studie, dass der Konsum solcher Lebensmittel nicht die Hauptursache für Fettleibigkeit in den USA sei. Das allgemeine Ernährungs- und Bewegungsverhalten sei weitaus bedeutsamer für die Entwicklung von Adipositas als häufiger Verzehr ungesunder Nahrungsmittel allein. LinkObesity Science & Practice (red, 6.11.2015)
['Ithaca', '–', 'dass', 'zuckerhaltige', 'Getränk', ',', 'Fast', 'Food', 'und', 'Süßigkeit', 'nicht', 'gesund', 'sein', ',', 'sein', 'unumstritten', '.', 'Forscher', 'der', 'Cornell', 'University', 'behaupten', 'nun', 'aber', 'in', 'einer', 'Studie', ',', 'dass', 'der', 'Konsum', 'solch', 'Lebensmittel', 'nicht', 'der', 'Hauptursache', 'für', 'Fettleibigkeit', 'in', 'der', 'USA', 'sein', '.', 'der', 'allgemein', 'Ernährungs-', 'und', 'Bewegungsverhalten', 'sein', 'weitaus', 'bedeutsam', 'für', 'der', 'Entwicklung', 'vo

### Use spaCy for part-of-speech tagging of a random article from both the English and the German dataset

- Either print the token attributes or visualize them as a table!
- What do the attributes describe?
- Visualize the POS attribute as a dependency plot with spaCy's displacy!
- Optional: For the German dataset, visualize sentences separately for better readability.

In [11]:
"""
    Text: The original word text.
    Lemma: The base form of the word.
    POS: The simple UPOS part-of-speech tag.
    Tag: The detailed part-of-speech tag.
    Dep: Syntactic dependency, i.e. the relation between tokens.
    Shape: The word shape – capitalization, punctuation, digits.
    is alpha: Does the token consist of alphabetic characters only?
    is stop: Is the token part of a stop list, i.e. one of the most common words of the language?
"""

token_df_en = pd.DataFrame({"Text": [token_en.text for token_en in tokens_en],
                            "Lemma": [token_en.lemma_ for token_en in tokens_en],
                            "POS": [token_en.pos_ for token_en in tokens_en],
                            "Tag": [token_en.tag_ for token_en in tokens_en],
                            "Dep": [token_en.dep_ for token_en in tokens_en],
                            "Shape": [token_en.shape_ for token_en in tokens_en],
                            "is alpha": [token_en.is_alpha for token_en in tokens_en],
                            "is stop": [token_en.is_stop for token_en in tokens_en]})

token_df_en.head()

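| | Text | Lemma | POS | Tag | Dep | Shape | is alpha | is stop |
|---|---|---|---|---|---|---|---|---|
| 0 | The | the | DET | DT | det | Xxx | True | False |
| 1 | price | price | NOUN | NN | nsubj | xxxx | True | False |
| 2 | of | of | ADP | IN | prep | xx | True | True |
| 3 | crude | crude | ADJ | JJ | amod | xxxx | True | False |
| 4 | oil | oil | NOUN | NN | pobj | xxx | True | False |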
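The `Shape` column can look puzzling at first: `price` has five letters but the shape `xxxx`. The sketch below is a plain-Python approximation of the word-shape idea. It maps uppercase letters to `X`, lowercase letters to `x` and digits to `d`, and truncates runs of the same shape character after four. The function `word_shape` is written here for illustration and is an assumption about the behaviour, not spaCy's own code.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape (illustrative re-implementation)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation and other symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Runs of the same shape character are cut off after four.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["The", "price", "of", "Ithaca", "zuckerhaltige"]:
    print(word, word_shape(word))
```

For the words shown in the token tables, this sketch reproduces the values in the `Shape` column.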


In [19]:
from spacy import displacy

displacy.render(tokens_en, style = "dep", jupyter = True)

In [12]:
token_df_de = pd.DataFrame({"Text": [token_de.text for token_de in tokens_de],
                            "Lemma": [token_de.lemma_ for token_de in tokens_de],
                            "POS": [token_de.pos_ for token_de in tokens_de],
                            "Tag": [token_de.tag_ for token_de in tokens_de],
                            "Dep": [token_de.dep_ for token_de in tokens_de],
                            "Shape": [token_de.shape_ for token_de in tokens_de],
                            "is alpha": [token_de.is_alpha for token_de in tokens_de],
                            "is stop": [token_de.is_stop for token_de in tokens_de]})

token_df_de.head()

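| | Text | Lemma | POS | Tag | Dep | Shape | is alpha | is stop |
|---|---|---|---|---|---|---|---|---|
| 0 | Ithaca | Ithaca | NOUN | NN | ROOT | Xxxxx | True | False |
| 1 | – | – | PROPN | NE | punct | – | False | False |
| 2 | Dass | dass | SCONJ | KOUS | cp | Xxxx | True | False |
| 3 | zuckerhaltige | zuckerhaltige | ADJ | ADJA | nk | xxxx | True | False |
| 4 | Getränke | Getränk | NOUN | NN | sb | Xxxxx | True | False |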
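In the table above the dash is tagged `PROPN`, which looks like a tagger mistake rather than a real proper noun. One way to keep such tokens out of simple statistics is to filter on the `is alpha` attribute before counting. The sketch below uses the five rows shown above, copied by hand; the `rows` list is a hypothetical stand-in for the full `token_df_de` table.

```python
from collections import Counter

# Rows hand-copied from the German token table: (text, POS, is_alpha).
rows = [
    ("Ithaca", "NOUN", True),
    ("–", "PROPN", False),
    ("Dass", "SCONJ", True),
    ("zuckerhaltige", "ADJ", True),
    ("Getränke", "NOUN", True),
]

# Drop tokens that are not purely alphabetic, then count POS tags.
alpha_rows = [row for row in rows if row[2]]
pos_counts = Counter(pos for _, pos, _ in alpha_rows)
print(pos_counts.most_common())
```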


In [23]:
displacy.render(list(tokens_de.sents), style = "dep", jupyter = True)

### Use spaCy for Named Entity Recognition (NER) of a random article from both the English and the German dataset

- Either print the entity attributes or visualize them as a table!
- Visualize the entities as an entity plot with spaCy's displacy!

In [25]:
#entities

entities_en_df = pd.DataFrame({"Text": [ent.text for ent in tokens_en.ents],
                               "Start": [ent.start_char for ent in tokens_en.ents],
                               "End": [ent.end_char for ent in tokens_en.ents],
                               "Label": [ent.label_ for ent in tokens_en.ents]})

entities_en_df.head()

| | Text | Start | End | Label |
|---|---|---|---|---|
| 0 | European | 34 | 42 | NORP |
| 1 | Monday | 51 | 57 | DATE |
| 2 | OPEC | 126 | 130 | ORG |
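The `Start` and `End` columns are character offsets into the article text, with `End` being exclusive, so they can be used directly for slicing. Here is a minimal sketch using the beginning of the English article and the first two entities from the table above; the `entities` list is copied by hand for illustration.

```python
# First part of the English article that was analyzed above.
text = "The price of crude oil climbed in European trading Monday"

# (start_char, end_char, label) triples copied from the entity table.
entities = [(34, 42, "NORP"), (51, 57, "DATE")]

# Slicing with the offsets recovers the entity text.
spans = [(text[start:end], label) for start, end, label in entities]
print(spans)
```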


In [27]:
displacy.render(tokens_en, style = "ent", jupyter = True)

In [26]:
entities_de_df = pd.DataFrame({"Text": [ent.text for ent in tokens_de.ents],
                               "Start": [ent.start_char for ent in tokens_de.ents],
                               "End": [ent.end_char for ent in tokens_de.ents],
                               "Label": [ent.label_ for ent in tokens_de.ents]})

entities_de_df.head()

| | Text | Start | End | Label |
|---|---|---|---|---|
| 0 | Ithaca | 0 | 6 | PER |
| 1 | Süßigkeiten | 52 | 63 | ORG |
| 2 | Cornell University | 114 | 132 | ORG |
| 3 | Konsum | 178 | 184 | ORG |
| 4 | den USA | 251 | 258 | LOC |


In [28]:
displacy.render(tokens_de, style = "ent", jupyter = True)

In [32]:
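# https://spacy.io/usage/spacy-101#vectors-similarity
# Compare the English and the German document. The two models were trained
# separately and do not share a vector space, so this score is not a
# meaningful measure of semantic similarity.
tokens_en.similarity(tokens_de)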

-0.015280610111442511
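According to the spaCy documentation, the default similarity estimate is the cosine similarity of the two document vectors. A value close to 0, as above, means the vectors are nearly orthogonal. The following is a minimal standard-library sketch of that measure. The helper `cosine_similarity` is written here for illustration and assumes two non-zero vectors of equal length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```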

In [38]:
token = tokens_en[1]

In [39]:
token.vector.shape

(384,)

In [41]:
tokens_en.vector.shape

(384,)
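Both the single token and the whole document report 384 dimensions. The spaCy documentation describes `Doc.vector` as defaulting to an average of the token vectors, which would explain why the two shapes match. Here is a minimal sketch of such an average with hypothetical three-dimensional vectors; the helper `mean_vector` is written for illustration and is not part of spaCy.

```python
def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Hypothetical token vectors with 3 dimensions instead of 384.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

doc_vector = mean_vector(token_vectors)
print(len(doc_vector), doc_vector)
```

The averaged vector has the same number of dimensions as each token vector, regardless of how many tokens the document contains.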