<a href="https://colab.research.google.com/github/popelucha/digital-humanities/blob/main/DH4_Text_Analysis_(continued).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Analysis
The aim of this notebook is to start a quantitative analysis of text data. We select a Czech text, split it into tokens, perform a frequency analysis, and observe the nature of the data.

## Install necessary packages
In this notebook, we use NLTK (Natural Language ToolKit) for tokenization of input text, and Pandas, a package for easy handling of tabular data.

In [None]:
# Skip this cell in G13 and in Google Colab; the packages are already installed there.
!pip3 install --user nltk
!pip3 install --user pandas
!pip3 install --user matplotlib
!pip3 install --user numpy

In [None]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from collections import Counter
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Get the data
You will probably need to change the filename here.

In [None]:
!wget https://nlp.fi.muni.cz/~xpopelk/maj.txt

--2024-10-17 07:13:54--  https://nlp.fi.muni.cz/~xpopelk/maj.txt
Resolving nlp.fi.muni.cz (nlp.fi.muni.cz)... 147.251.51.11
Connecting to nlp.fi.muni.cz (nlp.fi.muni.cz)|147.251.51.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29777 (29K) [text/plain]
Saving to: ‘maj.txt.1’


2024-10-17 07:13:55 (185 KB/s) - ‘maj.txt.1’ saved [29777/29777]



In [None]:
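# Read the whole file as UTF-8 so that Czech diacritics are decoded the same way on every platform.
with open('maj.txt', encoding='utf-8') as f:  # modify the path if needed
    text = f.read()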

In [None]:
len(word_tokenize(text))

5675

In [None]:
# Counter counts the occurrences of each token directly.
tokens = Counter(word_tokenize(text))
tokens

Counter({'1': 1,
         'Byl': 3,
         'pozdní': 6,
         'večer': 5,
         '–': 256,
         'první': 6,
         'máj': 15,
         'večerní': 7,
         'byl': 5,
         'lásky': 11,
         'čas': 23,
         '.': 203,
         'Hrdliččin': 1,
         'zval': 3,
         'ku': 10,
         'lásce': 7,
         'hlas': 28,
         ',': 405,
         'kde': 14,
         'borový': 3,
         'zaváněl': 2,
         'háj': 4,
         'O': 2,
         'šeptal': 2,
         'tichý': 7,
         'mech': 2,
         ';': 67,
         'květoucí': 3,
         'strom': 3,
         'lhal': 2,
         'žel': 4,
         'svou': 14,
         'lásku': 2,
         'slavík': 2,
         'růži': 3,
         'pěl': 2,
         'růžinu': 2,
         'jevil': 2,
         'vonný': 3,
         'vzdech': 3,
         'Jezero': 2,
         'hladké': 2,
         'v': 130,
         'křovích': 2,
         'stinných': 2,
         'zvučelo': 2,
         'temně': 2,
         'tajný': 2,
   

## Create DataFrame
A Pandas DataFrame is a tabular data object that is easy to handle. Let's experiment with it.

In [None]:
df = pd.DataFrame({"token": list(tokens.keys()), "freq": list(tokens.values())})
df

,token,freq
0,1,1
1,Byl,3
2,pozdní,6
3,večer,5
4,–,256
...,...,...
1921,takému,1
1922,utěchy,1
1923,jaké,1
1924,Zklamánať,1


### DataFrame Info
**TASK 1**: How many different tokens are in the text? This number is the *vocabulary size*.

In [None]:
df.columns

Index(['token', 'freq'], dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1926 entries, 0 to 1925
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   token   1926 non-null   object
 1   freq    1926 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 30.2+ KB
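The vocabulary size is the number of distinct keys in the frequency table, whereas the text length is the sum of all frequencies. A minimal sketch on a hand-picked sample (the first verse of the poem without the leading number; `sample_tokens` is an illustrative name, not part of the notebook) also suggests that the result depends on whether upper and lower case are merged:

```python
from collections import Counter

# Hypothetical sample: the first verse, already tokenized by hand.
sample_tokens = ["Byl", "pozdní", "večer", "–", "první", "máj", "–",
                 "večerní", "máj", "–", "byl", "lásky", "čas", "."]

counts = Counter(sample_tokens)

n_tokens = sum(counts.values())   # length of the text in tokens
vocabulary_size = len(counts)     # number of distinct tokens

# Case folding merges "Byl" and "byl" into one vocabulary item.
folded_vocabulary_size = len(Counter(t.lower() for t in sample_tokens))

print(n_tokens, vocabulary_size, folded_vocabulary_size)
```

In the DataFrame above, the same number is the row count, e.g. `len(df)`.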


In [None]:
df.sort_values(by='freq', ascending=False)

,token,freq
17,",",405
4,–,256
11,.,203
42,v,130
75,se,93
...,...,...
945,věží,1
944,jezerní,1
943,lehounce,1
941,uveden,1


**TASK 2**: How many *hapax legomena* (tokens that occur exactly once) do we have in the data?

In [None]:
df.loc[df.freq==1]

,token,freq
0,1,1
12,Hrdliččin,1
58,bloudila,1
59,blankytnými,1
60,pásky,1
...,...,...
1921,takému,1
1922,utěchy,1
1923,jaké,1
1924,Zklamánať,1


In [None]:
1200/1926

0.6230529595015576
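The division above presumably relates the number of hapax legomena (typed in by hand) to the vocabulary size of 1926. A small sketch of how both numbers could be computed instead of typed, using a hand-made sample (`sample_tokens` and `hapaxes` are illustrative names):

```python
from collections import Counter

# Hypothetical sample: the first verse, already tokenized by hand.
sample_tokens = ["Byl", "pozdní", "večer", "–", "první", "máj", "–",
                 "večerní", "máj", "–", "byl", "lásky", "čas", "."]

counts = Counter(sample_tokens)

# Hapax legomena are the tokens whose frequency is exactly 1.
hapaxes = [token for token, freq in counts.items() if freq == 1]

# Share of the vocabulary that occurs only once.
hapax_ratio = len(hapaxes) / len(counts)

print(len(hapaxes), len(counts), hapax_ratio)
```

With the DataFrame from this notebook, an equivalent expression would be `(df["freq"] == 1).sum() / len(df)`.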

### Pandas Series
A Pandas Series is a one-dimensional labeled array. Selecting a single row or a single column of a DataFrame yields a Series, whereas filtering a DataFrame by a condition yields another DataFrame.
Let's see a single row, a single column, and a single cell.



In [None]:
df.loc[2]

token    pozdní
freq          6
Name: 2, dtype: object

In [None]:
df['freq']

0         1
1         3
2         6
3         5
4       256
5         6
6        15
7         7
8         5
9        11
10       23
11      203
12        1
13        3
14       10
15        7
16       28
17      405
18       14
19        3
20        2
21        4
22        2
23        2
24        7
25        2
26       67
27        3
28        3
29        2
30        4
31       14
32        2
33        2
34        3
35        2
36        2
37        2
38        3
39        3
40        2
41        2
42      130
43        2
44        2
45        2
46        2
47        2
48        2
49        5
50       32
51        2
52       14
53       77
54        6
55        2
56        2
57        2
58        1
59        1
60        1
61        1
62       24
63       39
64       10
65        3
66        1
67        3
68        1
69        1
70        8
71        2
72        1
73        1
74       22
75       93
76        1
77       24
78        4
79        1
80        1
81        5
82        1
83  

In [None]:
df['token'][2]

'pozdní'
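To make the difference between the selections explicit, the following sketch checks the type of each result on a three-row toy table (`toy` is a hypothetical DataFrame; the frequencies are taken from the counts above):

```python
import pandas as pd

# Hypothetical miniature of the token/frequency table.
toy = pd.DataFrame({"token": ["máj", "čas", "hlas"], "freq": [15, 23, 28]})

row = toy.loc[1]                    # one row    -> Series indexed by column names
column = toy["freq"]                # one column -> Series indexed by row labels
cell = toy.loc[1, "token"]          # one cell   -> plain Python value
filtered = toy.loc[toy.freq > 20]   # condition  -> DataFrame with the matching rows

print(type(row).__name__, type(column).__name__,
      type(cell).__name__, type(filtered).__name__)
```

`toy.loc[1, "token"]` selects the cell in one step; the chained form `toy["token"][1]` used above returns the same value.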

#### Tokens with a certain frequency

In [None]:
df.loc[df.freq==10]

,token,freq
14,ku,10
64,slzy,10
89,tvář,10
116,vzdy,10
154,pod,10
161,jezera,10
293,si,10
297,hory,10
468,lesů,10
533,žádný,10


### Processing of the Text
So far, we have only performed **tokenization** in order to observe single words. Tokenization is fairly simple for languages that separate words with spaces; it is much harder for languages written without spaces between words, such as Chinese or Japanese. Even so, there are decisions to be made, and some of them are language dependent:
 - "can't" -> "can", "not" or "can", "'", "t"
 - "won't" -> "will", "not" or "won", "'", "t"
 - "cannot" -> "can", "not" or "cannot"
 - "přišels" -> "přišel", "jsi" or "přišels"
 - "P. D. Jamesová" -> "P.", "D.", "Jamesová" or "P", ".", "D", ".", "Jamesová"
 - "16/10/2019" -> "16", "/", "10", "/", "2019" or "16/10/2019" or "16/", "10/", "2019"
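The effect of such decisions can be seen by comparing two deliberately naive tokenizers. This is only a sketch with hypothetical helper names; it is not how NLTK's `word_tokenize` works internally:

```python
import re

def tokenize_split_punct(text):
    # Each run of word characters is a token; every other
    # non-space character becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_whitespace(text):
    # Split on whitespace only; punctuation stays attached to words.
    return text.split()

sample = "P. D. Jamesová can't come on 16/10/2019."

fine = tokenize_split_punct(sample)
coarse = tokenize_whitespace(sample)

print(len(fine), fine)
print(len(coarse), coarse)
```

The same sentence gives 16 tokens with the first strategy and 7 with the second, so every frequency computed afterwards depends on the tokenizer chosen.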

### Tagging
We could take the analysis further if we had more information about the text, for example the part of speech (POS) of each word. Note that the tagging task (assigning one POS to each word) is language dependent and sometimes very difficult, e.g.:
- "hope" - verb or noun
- "loving" - noun, adjective, or verb
- "stát" - verb or noun
- "svíčková" - noun or adjective

### Use of remote services
POS tagging is a common NLP task offered by many services. To annotate your own text, you can either upload it somewhere and download the result, or let a computer program do the work via an Application Programming Interface (API). The role of an API is similar to that of a waiter.

<img src="https://www.10000ft.com/assets/img/blog/api-restaurant-analogy-example1.jpg"/>

Analogously, we let our computer program send the request "I need this tokenized text to be POS-tagged" and present the result.

As an example API, we will use the Language Services at NLPC FI MUNI: https://nlp.fi.muni.cz/languageservices/. We will use the Python library `requests`. To write down the data exchanged between computer programs, we use `JSON`.
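JSON is plain text that maps directly onto Python dictionaries and lists. The sketch below parses a shortened, hand-written response of the same shape as the tagger output shown later in this notebook; `response_text` is a made-up stand-in, and no network request is sent:

```python
import json

# Hypothetical, shortened response body; the real service returns many more items.
response_text = '{"vertical": [["<s>"], ["Byl", "být", "k5eAaImAgInS"], ["</s>"]]}'

# json.loads turns JSON text into Python objects (dict, list, str, ...).
response = json.loads(response_text)

# Items with three elements are [word, lemma, tag];
# one-element items are structural markers such as <s>.
rows = [item for item in response["vertical"] if len(item) == 3]

# json.dumps goes the other way; ensure_ascii=False keeps Czech letters readable.
print(json.dumps(rows, ensure_ascii=False))
```

With `requests`, calling `.json()` on a response performs the `json.loads` step for you.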

In [None]:
!pip3 install --user requests # not necessary to run in Google Colab

In [None]:
import requests
import json

In [None]:
data = {"call": "tagger",
        "lang": "cs",
        "output": "json",
        "text": text.replace(';', ',')
       }
uri = "https://nlp.fi.muni.cz/languageservices/service.py"
r = requests.post(uri, data=data)
r

<Response [200]>

In [None]:
print(r.content)

b'{\n    "vertical": [\n        [\n            "<s>"\n        ], \n        [\n            "1", \n            "#num#", \n            "k4"\n        ], \n        [\n            "Byl", \n            "b\xc3\xbdt", \n            "k5eAaImAgInS"\n        ], \n        [\n            "pozdn\xc3\xad", \n            "pozdn\xc3\xad", \n            "k2eAgInSc1d1"\n        ], \n        [\n            "ve\xc4\x8der", \n            "ve\xc4\x8der", \n            "k1gInSc1"\n        ], \n        [\n            "\xe2\x80\x93", \n            "\xe2\x80\x93", \n            "k?"\n        ], \n        [\n            "prvn\xc3\xad", \n            "prvn\xc3\xad", \n            "k4xOgInSc4"\n        ], \n        [\n            "m\xc3\xa1j", \n            "m\xc3\xa1j", \n            "k1gInSc4"\n        ], \n        [\n            "\xe2\x80\x93", \n            "\xe2\x80\x93", \n            "k?"\n        ], \n        [\n            "ve\xc4\x8dern\xc3\xad", \n            "ve\xc4\x8dern\xc3\xad", \n            "k2eAgI

In [None]:
data = r.json()
data

{'vertical': [['<s>'],
  ['1', '#num#', 'k4'],
  ['Byl', 'být', 'k5eAaImAgInS'],
  ['pozdní', 'pozdní', 'k2eAgInSc1d1'],
  ['večer', 'večer', 'k1gInSc1'],
  ['–', '–', 'k?'],
  ['první', 'první', 'k4xOgInSc4'],
  ['máj', 'máj', 'k1gInSc4'],
  ['–', '–', 'k?'],
  ['večerní', 'večerní', 'k2eAgInSc4d1'],
  ['máj', 'máj', 'k1gFnSc4'],
  ['–', '–', 'k?'],
  ['byl', 'být', 'k5eAaImAgInS'],
  ['lásky', 'láska', 'k1gFnSc2'],
  ['čas', 'čas', 'k1gInSc1'],
  ['<g/>'],
  ['.', '.', 'kIx.'],
  ['</s>'],
  ['<s desamb="1">'],
  ['Hrdliččin', 'hrdliččin', 'k2eAgInSc1d1'],
  ['zval', 'zvát', 'k5eAaImAgInS'],
  ['ku', 'k', 'k7c3'],
  ['lásce', 'láska', 'k1gFnSc3'],
  ['hlas', 'hlas', 'k1gInSc1'],
  ['<g/>'],
  [',', ',', 'kIx,'],
  ['kde', 'kde', 'k6eAd1'],
  ['borový', 'borový', 'k2eAgMnSc1d1'],
  ['zaváněl', 'zavánět', 'k5eAaImAgInS'],
  ['háj', 'háj', 'k1gInSc1'],
  ['<g/>'],
  ['.', '.', 'kIx.'],
  ['</s>'],
  ['<s desamb="1">'],
  ['O', 'o', 'k7c6'],
  ['lásce', 'láska', 'k1gFnSc6'],
  ['šeptal',

In [None]:
# Keep only the [word, lemma, tag] triples; structural markers such as <s> or <g/> have one element.
tokens = [token for token in data['vertical'] if len(token)==3]
pd.set_option('display.max_rows', len(tokens))
# Each triple becomes one row of the DataFrame.
df2 = pd.DataFrame(tokens, columns=["word", "lemma", "tag"])
df2

,word,lemma,tag
0,1,#num#,k4
1,Byl,být,k5eAaImAgInS
2,pozdní,pozdní,k2eAgInSc1d1
3,večer,večer,k1gInSc1
4,–,–,k?
5,první,první,k4xOgInSc4
6,máj,máj,k1gInSc4
7,–,–,k?
8,večerní,večerní,k2eAgInSc4d1
9,máj,máj,k1gFnSc4


In [None]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5684 entries, 0 to 5683
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    5684 non-null   object
 1   lemma   5684 non-null   object
 2   tag     5684 non-null   object
dtypes: object(3)
memory usage: 133.3+ KB


In [None]:
# The first two characters of a tag (e.g. k1, k5) encode the part of speech.
df2["pos"] = df2["tag"].str[:2]
df2

,word,lemma,tag,pos
0,1,#num#,k4,k4
1,Byl,být,k5eAaImAgInS,k5
2,pozdní,pozdní,k2eAgInSc1d1,k2
3,večer,večer,k1gInSc1,k1
4,–,–,k?,k?
5,první,první,k4xOgInSc4,k4
6,máj,máj,k1gInSc4,k1
7,–,–,k?,k?
8,večerní,večerní,k2eAgInSc4d1,k2
9,máj,máj,k1gFnSc4,k1
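The rest of a tag carries further grammatical information. Judging from the tags in the output above, a tag looks like a sequence of two-character pairs (an attribute letter followed by its value). Under that assumption, a tag can be split into a dictionary; `parse_tag` is a hypothetical helper, not part of any library:

```python
def parse_tag(tag):
    # Assumption: the tag consists of consecutive (attribute, value)
    # pairs of one character each, e.g. "k1" + "gI" + "nS" + "c1".
    return {tag[i]: tag[i + 1] for i in range(0, len(tag) - 1, 2)}

noun = parse_tag("k1gInSc1")
verb = parse_tag("k5eAaImAgInS")
unknown = parse_tag("k?")

print(noun)
print(verb)
print(unknown)
```

The `k` attribute of the parsed tag corresponds to the second character of the `pos` column.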


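### List tokens with an unknown POS tag (`k?`) appearing in the text
The filter below selects the tag `k?`, which the tagger assigns to tokens it cannot classify. To list numerals instead, filter for `k4`.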

In [None]:
set(df2[df2["pos"]=="k?"]['word'].values)

{'!!', '!!!', '?!', 'an', 'líbý', 'ust', 'zaň', 'č.', '–', '‘', '‚', '“', '„'}

**TASK 3**: List the prepositions and store them in the variable `prep`.

In [None]:
prep = df2[df2["pos"]=="k7"]

### Count preposition frequencies
If you stored the prepositions in `prep`, you can see the frequency of each preposition in the text.

In [None]:
x = prep.groupby(by="lemma").count()['word']
x

lemma
bez        10
beze        1
do          1
k          35
kol         4
kolem       5
mezi        5
mimo        1
na         34
nad        23
o           8
od          4
ode         1
po         37
pod        14
pro         2
proti       1
před        4
přes        5
při         5
s          21
u          18
v         153
vstříc      1
z          26
za         16
zpod        1
Name: word, dtype: int64
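The result above is ordered alphabetically by lemma. If an ordering by frequency is preferred, `value_counts` is one alternative to `groupby(...).count()`. A sketch on a made-up toy table (the rows of `toy` are illustrative, not taken from the tagger output):

```python
import pandas as pd

# Hypothetical mini-table of prepositions with their lemmas.
toy = pd.DataFrame({"word":  ["v", "ku", "V", "na", "k", "v"],
                    "lemma": ["v", "k",  "v", "na", "k", "v"]})

# Same approach as above: one count per lemma, sorted by lemma.
by_group = toy.groupby(by="lemma").count()["word"]

# Alternative: counts sorted from the most to the least frequent lemma.
by_value = toy["lemma"].value_counts()

print(by_group)
print(by_value)
```

Both results hold the same counts; only the ordering differs.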

### Count POS frequencies

In [None]:
df2.groupby(by="pos").count()['word']

pos,word
k0,9
k1,1633
k2,594
k3,496
k4,38
k5,693
k6,278
k7,436
k8,231
k9,76
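Raw counts are easier to interpret as shares with readable labels. The sketch below reuses the ten counts listed above; the mapping from `k` codes to part-of-speech names is an assumption about the tagset and should be checked against the documentation of the service:

```python
# Counts copied from the output above (only the ten categories listed there).
pos_counts = {"k0": 9, "k1": 1633, "k2": 594, "k3": 496, "k4": 38,
              "k5": 693, "k6": 278, "k7": 436, "k8": 231, "k9": 76}

# Assumed meaning of the codes; verify against the tagset documentation.
pos_names = {"k0": "interjection", "k1": "noun", "k2": "adjective",
             "k3": "pronoun", "k4": "numeral", "k5": "verb",
             "k6": "adverb", "k7": "preposition", "k8": "conjunction",
             "k9": "particle"}

total = sum(pos_counts.values())

# Relative frequency of each category among the listed ones.
shares = {pos_names[code]: round(count / total, 3)
          for code, count in pos_counts.items()}

most_frequent = max(pos_counts, key=pos_counts.get)

print(total, pos_names[most_frequent], shares)
```

The shares are relative to the ten listed categories only, so punctuation and tokens tagged `k?` are not part of the total.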
