
In [22]:
import csv
import pandas as pd

paragraph = """
The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed. It was inspired by the smaller Anscombe's quartet that was created in 1973. Similar to the Anscombe's quartet, the Datasaurus dozen was designed to further illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic data sets. The first data set, in the shape of a Tyrannosaurus, that inspired the rest of the "datasaurus" data set was constructed in 2016 by Alberto Cairo. It was proposed by Maarten Lambrechts that this data set also be called "Anscombosaurus." This data set was then accompanied by twelve other data sets that were created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike the Anscombe's quartet, where it is not known how the data set was generated, the authors used simulated annealing to make these data sets. They made small, random, and biased changes to each point towards the desired shape. Each shape took 200,000 iterations of perturbations to complete.
"""

In [8]:
%%capture output
!python -m spacy download en_core_web_lg

In [None]:
if "Download and installation successful" in output.stdout:
    print("Download and installation successful!")
else:
    print(output.stdout)
    print(output.stderr)

Download and installation successful!
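
Because `python -m spacy download` installs the model as a regular Python package (the notebook later imports it with `import en_core_web_lg`), its presence can be checked before downloading again. The following is a minimal sketch, not part of the original notebook; the helper name `model_installed` is a hypothetical choice.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when no importable top-level package exists.
    return importlib.util.find_spec(package_name) is not None


if model_installed("en_core_web_lg"):
    print("Model already installed, the download can be skipped.")
else:
    print("Model missing, run: python -m spacy download en_core_web_lg")
```

This only checks that the package is present; it does not verify that the installed model version is compatible with the installed spaCy version.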


### Loading models

In [9]:
import spacy
import random
import re
import en_core_web_lg

nlp_en = en_core_web_lg.load()

### Use spaCy to tokenize a random sentence

In [27]:
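# Naive sentence split: break on spaces that directly follow ., ! or ?.
# A closing quote after the period (as in '."') prevents a split, so a
# chunk may contain more than one sentence.
sentences = re.split(r'(?<=[.!?]) +', paragraph.strip())
sentence = random.choice(sentences)
tokens_en = nlp_en(sentence)  # returns a spaCy Doc

print(sentence)
print([token_en.text for token_en in tokens_en])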

It was proposed by Maarten Lambrechts that this data set also be called "Anscombosaurus." This data set was then accompanied by twelve other data sets that were created by Justin Matejka and George Fitzmaurice at Autodesk.
['It', 'was', 'proposed', 'by', 'Maarten', 'Lambrechts', 'that', 'this', 'data', 'set', 'also', 'be', 'called', '"', 'Anscombosaurus', '.', '"', 'This', 'data', 'set', 'was', 'then', 'accompanied', 'by', 'twelve', 'other', 'data', 'sets', 'that', 'were', 'created', 'by', 'Justin', 'Matejka', 'and', 'George', 'Fitzmaurice', 'at', 'Autodesk', '.']
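
The printed text above actually contains two sentences: the lookbehind pattern only splits when the space directly follows `.`, `!`, or `?`, so the `."` before "This" is not treated as a boundary. The sketch below shows one possible workaround using only the standard library. The shortened sample string `text` and the name `QUOTE_AWARE` are assumptions made for illustration and do not appear in the original notebook.

```python
import re

# Shortened sample text (an assumption for illustration), modeled on the paragraph above.
text = ('It was proposed that this data set also be called "Anscombosaurus." '
        'This data set was then accompanied by twelve other data sets. '
        'Each shape took 200,000 iterations of perturbations to complete.')

# Pattern from the notebook: split on spaces directly preceded by ., ! or ?
naive = re.split(r'(?<=[.!?]) +', text)

# Variant: also split when a closing quote sits between the punctuation
# mark and the whitespace. Each lookbehind is fixed-width, as `re` requires.
QUOTE_AWARE = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]["\']))\s+')
quote_aware = QUOTE_AWARE.split(text)

print(len(naive))        # chunks produced by the original pattern
print(len(quote_aware))  # chunks produced by the quote-aware variant
for chunk in quote_aware:
    print(repr(chunk))
```

A regex remains a heuristic: abbreviations such as "e.g." would still cause spurious splits, which is why a trained sentence segmenter is usually preferred for real corpora.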


### Use spaCy to lemmatize a random sentence

In [28]:
print(sentence)
print([token_en.lemma_ for token_en in tokens_en])

It was proposed by Maarten Lambrechts that this data set also be called "Anscombosaurus." This data set was then accompanied by twelve other data sets that were created by Justin Matejka and George Fitzmaurice at Autodesk.
['it', 'be', 'propose', 'by', 'Maarten', 'Lambrechts', 'that', 'this', 'datum', 'set', 'also', 'be', 'call', '"', 'Anscombosaurus', '.', '"', 'this', 'datum', 'set', 'be', 'then', 'accompany', 'by', 'twelve', 'other', 'datum', 'set', 'that', 'be', 'create', 'by', 'Justin', 'Matejka', 'and', 'George', 'Fitzmaurice', 'at', 'Autodesk', '.']
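
To make the effect of lemmatization easier to see, the sketch below groups surface forms by their lemma. The `(token, lemma)` pairs are copied by hand from the two outputs above rather than computed, and the variable names are illustrative assumptions.

```python
from collections import defaultdict

# (token text, lemma) pairs copied from the tokenization and lemmatization outputs above.
pairs = [
    ("was", "be"), ("proposed", "propose"), ("data", "datum"),
    ("set", "set"), ("be", "be"), ("called", "call"),
    ("sets", "set"), ("were", "be"), ("created", "create"),
]

# Collect every distinct lowercased surface form under its lemma.
forms_by_lemma = defaultdict(set)
for token_text, lemma in pairs:
    forms_by_lemma[lemma].add(token_text.lower())

# Show only lemmas that merge more than one surface form.
for lemma, forms in sorted(forms_by_lemma.items()):
    if len(forms) > 1:
        print(lemma, sorted(forms))
```

Inflected forms such as "was" and "were" collapse onto the single lemma "be", which is what makes lemmas useful for counting word frequencies.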


### Use spaCy for part-of-speech tagging of a random sentence

- Either print the token attributes or visualize them as a table!
- What do the attributes describe?
- Visualize the POS attribute as a dependency plot with spaCy's displacy!
- Optional: For the German dataset, visualize each sentence separately for better readability.

In [29]:
"""
    Text: The original word text.
    Lemma: The base form of the word.
    POS: The simple UPOS part-of-speech tag.
    Tag: The detailed part-of-speech tag.
    Dep: Syntactic dependency, i.e. the relation between tokens.
    Shape: The word shape (capitalization, punctuation, digits).
    is alpha: Does the token consist only of alphabetic characters?
    is stop: Is the token part of a stop list, i.e. the most common words of the language?
"""

token_df_en = pd.DataFrame({"Text": [token_en.text for token_en in tokens_en],
                            "Lemma": [token_en.lemma_ for token_en in tokens_en],
                            "POS": [token_en.pos_ for token_en in tokens_en],
                            "Tag": [token_en.tag_ for token_en in tokens_en],
                            "Dep": [token_en.dep_ for token_en in tokens_en],
                            "Shape": [token_en.shape_ for token_en in tokens_en],
                            "is alpha": [token_en.is_alpha for token_en in tokens_en],
                            "is stop": [token_en.is_stop for token_en in tokens_en]})

token_df_en.head()

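|   | Text | Lemma | POS | Tag | Dep | Shape | is alpha | is stop |
|---|------|-------|-----|-----|-----|-------|----------|---------|
| 0 | It | it | PRON | PRP | nsubjpass | Xx | True | True |
| 1 | was | be | AUX | VBD | auxpass | xxx | True | True |
| 2 | proposed | propose | VERB | VBN | ROOT | xxxx | True | False |
| 3 | by | by | ADP | IN | agent | xx | True | True |
| 4 | Maarten | Maarten | PROPN | NNP | compound | Xxxxx | True | False |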
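
The Shape column can look surprising at first: "proposed" has eight letters but the shape `xxxx`. The sketch below is a rough pure-Python approximation of that behavior. It assumes uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and that runs of the same symbol are capped at four. The helper `approx_shape` is hypothetical and is not spaCy's API; spaCy's own implementation may differ in edge cases.

```python
def approx_shape(text: str, max_run: int = 4) -> str:
    """Approximate a word shape: X/x/d symbols, with runs capped at max_run."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and other characters are kept as-is
        if sym == last:
            run += 1
        else:
            run = 1
            last = sym
        if run <= max_run:
            shape.append(sym)
    return "".join(shape)


# Compare against the Shape values in the table above.
for word in ["It", "was", "proposed", "by", "Maarten"]:
    print(word, approx_shape(word))
```

For the five tokens shown in the table, this approximation reproduces the Shape column.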


In [30]:
from spacy import displacy

displacy.render(tokens_en, style="dep", jupyter=True)

### Use spaCy for Named Entity Recognition (NER) of a random sentence

In [31]:
entities_en_df = pd.DataFrame({"Text": [ent.text for ent in tokens_en.ents],
                               "Start": [ent.start_char for ent in tokens_en.ents],
                               "End": [ent.end_char for ent in tokens_en.ents],
                               "Label": [ent.label_ for ent in tokens_en.ents]})

entities_en_df.head()

|   | Text | Start | End | Label |
|---|------|-------|-----|-------|
| 0 | Maarten Lambrechts | 19 | 37 | PERSON |
| 1 | Anscombosaurus | 73 | 87 | WORK_OF_ART |
| 2 | twelve | 128 | 134 | CARDINAL |
| 3 | Justin Matejka | 172 | 186 | PERSON |
| 4 | George Fitzmaurice | 191 | 209 | PERSON |
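
The Start and End columns are character offsets into the analyzed text, with End being exclusive, so slicing the sentence with them recovers each entity. The sketch below checks this using the sentence and the table values copied from the outputs above; the variable names are illustrative assumptions.

```python
# Sentence and entity rows copied from the notebook outputs above.
sentence = ('It was proposed by Maarten Lambrechts that this data set also be '
            'called "Anscombosaurus." This data set was then accompanied by '
            'twelve other data sets that were created by Justin Matejka and '
            'George Fitzmaurice at Autodesk.')

entities = [
    ("Maarten Lambrechts", 19, 37, "PERSON"),
    ("Anscombosaurus", 73, 87, "WORK_OF_ART"),
    ("twelve", 128, 134, "CARDINAL"),
    ("Justin Matejka", 172, 186, "PERSON"),
    ("George Fitzmaurice", 191, 209, "PERSON"),
]

# Slice the sentence with each (start, end) pair and compare with the entity text.
for text, start, end, label in entities:
    span = sentence[start:end]
    print(f"{label:<12} {span!r} matches={span == text}")
```

Character offsets like these are what annotation tools typically need when entities must be highlighted in the raw text rather than in the token list.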


In [32]:
displacy.render(tokens_en, style="ent", jupyter=True)