<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 2 - Marcos

This solution reads `vuam_test.xml`, a text encoded in `TEI/XML`, into a `pandas` DataFrame for data wrangling, and demonstrates how to export the DataFrame to the `JSONL` and `TSV` formats for further processing.

## What is `TEI/XML`?

The [Text Encoding Initiative](https://tei-c.org/) (TEI) is a consortium that collectively develops and maintains a standard for representing texts in digital form, used especially in the field now known as textual digital humanities.

`TEI/XML` can be thought of as a sibling of HTML (they're approximately the same age, depending on how you measure it) which evolved with a focus on defined textual semantics rather than defined display semantics.

Related software tools include the TEI [Stylesheets](https://github.com/TEIC/Stylesheets), which are used for converting TEI documents to various formats.

## What is `Beautiful Soup`?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Although best known as a web-scraping tool, it is used here with the `lxml` XML parser to read the `TEI/XML` document.

Please refer to:
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [How To Scrape Web Pages with Beautiful Soup and Python 3](https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3)
- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
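
As a minimal sketch of how Beautiful Soup navigates XML, the snippet below parses a small, made-up TEI-style fragment. The element and attribute names (`w`, `c`, `seg`, `lemma`, `type`, `function`) mirror those used later in this notebook; the fragment itself and the `PUN` value are assumptions for illustration, not an excerpt from `vuam_test.xml`.

```python
from bs4 import BeautifulSoup

# Hypothetical TEI-style fragment (not taken from vuam_test.xml)
fragment = """<s>
  <w lemma="reveal" type="VVZ"><seg function="mrw" type="met">reveals</seg> </w>
  <w lemma="corporate" type="AJ0">corporate </w>
  <c type="PUN">.</c>
</s>"""

# 'lxml-xml' selects the lxml XML parser, so the lxml package must be installed
soup = BeautifulSoup(fragment, 'lxml-xml')

# find_all() accepts a list of tag names and returns matches in document order;
# get() returns the given default when an attribute is absent
tokens = [
    (element.name, element.get('lemma', ''), element.get_text(strip=True))
    for element in soup.find_all(['w', 'c'])
]
print(tokens)
```

Each tuple holds the tag name, the `lemma` attribute (empty when absent) and the stripped text of the element, including text nested inside a `seg` child.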

## Required Python packages

- beautifulsoup4
- lxml
- pandas

## Importing the required libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

## Importing `vuam_test.xml` into a DataFrame

In [2]:
# Parsing the document
with open('vuam_test.xml', 'r', encoding='utf8', newline='\n') as xml_doc:
    soup = BeautifulSoup(xml_doc, 'lxml-xml')

In [3]:
# Capturing the desired information: one row per <w> (word) or <c> (character) element
rows = []
for element in soup.find_all(['w', 'c']):
    # Attributes of the token itself; a missing attribute defaults to ''
    lemma = element.get('lemma', '')
    word_type = element.get('type', '')
    text = element.text.strip()
    # First nested <seg> annotation, if the token has one
    seg = element.find('seg')
    if seg is not None:
        seg_function = seg.get('function', '')
        seg_type = seg.get('type', '')
        seg_vici_morph = seg.get('vici:morph', '')
        seg_text = seg.text.strip()
    else:
        # No <seg> child: leave the seg columns empty
        seg_function = ''
        seg_type = ''
        seg_vici_morph = ''
        seg_text = ''
    rows.append([lemma, word_type, text, seg_function, seg_type, seg_vici_morph, seg_text])

In [4]:
# Creating DataFrame
df = pd.DataFrame(rows, columns=['lemma', 'type', 'text', 'seg function', 'seg type', 'seg vici:morph', 'seg text'])

### Checking data types

In [5]:
df.dtypes

lemma             object
type              object
text              object
seg function      object
seg type          object
seg vici:morph    object
seg text          object
dtype: object

### Displaying the DataFrame

In [6]:
df

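| | lemma | type | text | seg function | seg type | seg vici:morph | seg text |
|---|---|---|---|---|---|---|---|
| 0 | late | AJS | Latest | | | | |
| 1 | corporate | AJ0 | corporate | | | | |
| 2 | unbundler | NN1 | unbundler | | | | |
| 3 | reveal | VVZ | reveals | mrw | met | n | reveals |
| 4 | laid-back | AJ0 | laid-back | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 238270 | be | VBB | 're | | | | |
| 238271 | here | AV0 | here | | | | |
| 238272 | that | DT0-CJT | that | mrw | met | n | that |
| 238273 | be | VBZ | 's | | | | |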
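
As a sketch of the kind of data wrangling the DataFrame allows, the example below rebuilds four of the rows displayed above and counts the tokens that carry a `seg` annotation. The names `sample` and `annotated` are illustrative only; on the full data the same expressions would be applied to `df`.

```python
import pandas as pd

columns = ['lemma', 'type', 'text', 'seg function', 'seg type', 'seg vici:morph', 'seg text']

# Four rows copied from the DataFrame output above
sample = pd.DataFrame(
    [
        ['late', 'AJS', 'Latest', '', '', '', ''],
        ['reveal', 'VVZ', 'reveals', 'mrw', 'met', 'n', 'reveals'],
        ['here', 'AV0', 'here', '', '', '', ''],
        ['that', 'DT0-CJT', 'that', 'mrw', 'met', 'n', 'that'],
    ],
    columns=columns,
)

# Tokens without a <seg> child have empty strings in the seg columns
annotated = sample[sample['seg function'] != '']
print(len(annotated), 'of', len(sample), 'tokens carry a seg annotation')

# Frequency of each 'seg function' value among the annotated tokens
print(annotated['seg function'].value_counts())
```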


## Exporting to a file

### `JSONL` format

In [7]:
df.to_json('vuam_test.jsonl', orient='records', lines=True)
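
With `orient='records'` and `lines=True`, each row is written as one JSON object per line. The sketch below illustrates this on a single row copied from the DataFrame output above, writing to a temporary file (the `sample.jsonl` name is arbitrary) and reading the first line back with the standard `json` module.

```python
import json
import os
import tempfile

import pandas as pd

columns = ['lemma', 'type', 'text', 'seg function', 'seg type', 'seg vici:morph', 'seg text']
sample = pd.DataFrame(
    [['reveal', 'VVZ', 'reveals', 'mrw', 'met', 'n', 'reveals']],
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.jsonl')
    # Same call as in the cell above, applied to the one-row sample
    sample.to_json(path, orient='records', lines=True)
    # Each line of the file is a standalone JSON object
    with open(path, encoding='utf-8') as jsonl_file:
        first_record = json.loads(jsonl_file.readline())

print(first_record)
```

By default `to_json` escapes non-ASCII characters; passing `force_ascii=False` keeps them as UTF-8, which may be preferable for corpus text.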

### `TSV` format

In [8]:
df.to_csv('vuam_test.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')
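
One point to watch when reading the `TSV` file back: by default `pandas.read_csv` treats empty fields, and strings such as `null` or `NA`, as missing values. The sketch below uses a temporary file and a hypothetical token `null` (not taken from the corpus) to show how `keep_default_na=False` preserves them as text.

```python
import os
import tempfile

import pandas as pd

# Small illustrative sample; the 'null' token is a made-up example
sample = pd.DataFrame(
    [['reveal', 'reveals', 'mrw'], ['null', 'null', '']],
    columns=['lemma', 'text', 'seg function'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.tsv')
    sample.to_csv(path, sep='\t', index=False, encoding='utf-8', lineterminator='\n')

    # Default behaviour: '' and 'null' are read back as NaN
    default_read = pd.read_csv(path, sep='\t')
    # keep_default_na=False keeps them as the original strings
    text_read = pd.read_csv(path, sep='\t', dtype=str, keep_default_na=False)

missing_count = int(default_read.isna().sum().sum())
round_trip_ok = text_read.values.tolist() == sample.values.tolist()
print(missing_count, round_trip_ok)
```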

## Appendices

### Running the solution from the command line

It is recommended to set up a dedicated Python virtual environment for each Python project, so that its packages do not interfere with those of other projects.

#### Setting up the `myenv` virtual environment

```
eyamrog@Rog-ASUS:~$ cd "$HOME"
eyamrog@Rog-ASUS:~$ sudo apt update && sudo apt upgrade -y
<omitted>
eyamrog@Rog-ASUS:~$ sudo apt install -y python3-pip python3-venv
[sudo] password for eyamrog: 
<omitted>
eyamrog@Rog-ASUS:~$ python3 -m venv myenv
eyamrog@Rog-ASUS:~$ ll
total 148
drwxr-x--- 16 eyamrog eyamrog  4096 Aug 27 12:15 ./
drwxr-xr-x  3 root    root     4096 Mar 31 22:09 ../
<omitted>
drwxr-xr-x  5 eyamrog eyamrog  4096 Aug 27 12:15 myenv/
<omitted>
```

#### Activating the `myenv` virtual environment and checking the installed packages

```
eyamrog@Rog-ASUS:~$ source "$HOME"/myenv/bin/activate
(myenv) eyamrog@Rog-ASUS:~$ pip freeze
```

#### Installing the required Python packages

The packages can be installed from any directory while `myenv` is active. The prompt in the log below shows the project directory, which is created in the section that follows.

```
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ pip install beautifulsoup4 lxml pandas
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Collecting lxml
  Using cached lxml-5.3.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.0 MB)
Collecting pandas
  Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.6-py3-none-any.whl (36 kB)
Collecting pytz>=2020.1
  Using cached pytz-2024.1-py2.py3-none-any.whl (505 kB)
Collecting numpy>=1.22.4
  Using cached numpy-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting tzdata>=2022.7
  Using cached tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Collecting six>=1.5
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, soupsieve, six, numpy, lxml, python-dateutil, beautifulsoup4, pandas
Successfully installed beautifulsoup4-4.12.3 lxml-5.3.0 numpy-2.1.0 pandas-2.2.2 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 soupsieve-2.6 tzdata-2024.1
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ pip freeze
beautifulsoup4==4.12.3
lxml==5.3.0
numpy==2.1.0
pandas==2.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
six==1.16.0
soupsieve==2.6
tzdata==2024.1
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ 
```

#### Installing and running the solution

```
(myenv) eyamrog@Rog-ASUS:~$ mkdir work
(myenv) eyamrog@Rog-ASUS:~$ cd work
(myenv) eyamrog@Rog-ASUS:~/work$ ll
total 8
drwxr-xr-x  2 eyamrog eyamrog 4096 Aug 27 12:01 ./
drwxr-x--- 16 eyamrog eyamrog 4096 Aug 27 12:15 ../
(myenv) eyamrog@Rog-ASUS:~/work$ git clone https://github.com/laelgelc/cl_st1_marcos.git
Cloning into 'cl_st1_marcos'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 1), reused 8 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 31.27 KiB | 6.25 MiB/s, done.
Resolving deltas: 100% (1/1), done.
(myenv) eyamrog@Rog-ASUS:~/work$ cd cl_st1_marcos/
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ ll
total 47088
drwxr-xr-x 4 eyamrog eyamrog     4096 Aug 27 19:05 ./
drwxr-xr-x 4 eyamrog eyamrog     4096 Aug 27 18:56 ../
drwxr-xr-x 8 eyamrog eyamrog     4096 Aug 27 19:07 .git/
drwxr-xr-x 2 eyamrog eyamrog     4096 Aug 27 18:56 .ipynb_checkpoints/
-rw-r--r-- 1 eyamrog eyamrog    23600 Aug 27 18:56 CL_St1_Ph1_Marcos.ipynb
-rw-r--r-- 1 eyamrog eyamrog    21270 Aug 27 19:05 CL_St1_Ph2_Marcos.ipynb
-rw-r--r-- 1 eyamrog eyamrog      458 Aug 27 18:56 README.md
-rw-r--r-- 1 eyamrog eyamrog      257 Aug 27 18:56 cl_st1_ph1_marcos.py
-rw-r--r-- 1 eyamrog eyamrog     1050 Aug 27 18:59 cl_st1_ph2_marcos.py
-rw-r--r-- 1 eyamrog eyamrog   175914 Aug 27 18:56 debate.xml
-rw-r--r-- 1 eyamrog eyamrog 26535146 Aug 27 19:05 vuam_test.jsonl
-rw-r--r-- 1 eyamrog eyamrog  4601087 Aug 27 19:05 vuam_test.tsv
-rwxr-xr-x 1 eyamrog eyamrog 16820947 Aug 27 19:04 vuam_test.xml*
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ python cl_st1_ph2_marcos.py
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ ll
total 47088
drwxr-xr-x 4 eyamrog eyamrog     4096 Aug 27 19:14 ./
drwxr-xr-x 4 eyamrog eyamrog     4096 Aug 27 18:56 ../
drwxr-xr-x 8 eyamrog eyamrog     4096 Aug 27 19:07 .git/
drwxr-xr-x 2 eyamrog eyamrog     4096 Aug 27 18:56 .ipynb_checkpoints/
-rw-r--r-- 1 eyamrog eyamrog    23600 Aug 27 18:56 CL_St1_Ph1_Marcos.ipynb
-rw-r--r-- 1 eyamrog eyamrog    21632 Aug 27 19:14 CL_St1_Ph2_Marcos.ipynb
-rw-r--r-- 1 eyamrog eyamrog      458 Aug 27 18:56 README.md
-rw-r--r-- 1 eyamrog eyamrog      257 Aug 27 18:56 cl_st1_ph1_marcos.py
-rw-r--r-- 1 eyamrog eyamrog     1026 Aug 27 19:17 cl_st1_ph2_marcos.py
-rw-r--r-- 1 eyamrog eyamrog   175914 Aug 27 18:56 debate.xml
-rw-r--r-- 1 eyamrog eyamrog 26535146 Aug 27 19:17 vuam_test.jsonl
-rw-r--r-- 1 eyamrog eyamrog  4601087 Aug 27 19:17 vuam_test.tsv
-rwxr-xr-x 1 eyamrog eyamrog 16820947 Aug 27 19:04 vuam_test.xml*
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ 
```

#### Deactivating the `myenv` virtual environment

```
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ deactivate
eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ 
```

#### Removing the `myenv` virtual environment (optional)

If the virtual environment is no longer needed, it can be removed.

```
eyamrog@Rog-ASUS:~$ cd "$HOME"
eyamrog@Rog-ASUS:~$ rm -r myenv
eyamrog@Rog-ASUS:~$ 
```