<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1 - Marcos

This solution reads `debate.xml`, an `XML` file, into a `pandas` DataFrame for data wrangling, and shows how to export the DataFrame to `JSONL` and `TSV` formats for further processing.

## Required Python packages

- pandas
- lxml (the default parser used by `pandas.read_xml`)

## Importing the required libraries

In [1]:
import pandas as pd

## Importing `debate.xml` into a DataFrame

In [2]:
df = pd.read_xml('debate.xml')
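
To illustrate how `pd.read_xml` turns repeated elements into rows, here is a minimal sketch on an inline XML string. The element names (`debate`, `turn`) and the sample text are assumptions made for illustration; the actual layout of `debate.xml` may differ. The sketch passes `parser="etree"` only so that it runs without `lxml`; the notebook cell above relies on the default `lxml` parser.

```python
import io

import pandas as pd

# Hypothetical miniature of a transcript file; the real debate.xml may be
# organised differently (the element names here are assumptions).
xml_text = """<debate>
  <turn>
    <Speaker>TRUMP</Speaker>
    <Text>Example turn one.</Text>
  </turn>
  <turn>
    <Speaker>BIDEN</Speaker>
    <Text>Example turn two.</Text>
  </turn>
</debate>"""

# With the default xpath ("./*"), each child of the root element becomes one
# row, and each of that child's sub-elements becomes a column.
# parser="etree" uses the standard library instead of lxml.
sample_df = pd.read_xml(io.StringIO(xml_text), parser="etree")

print(sample_df.shape)
print(list(sample_df.columns))
```

Wrapping the string in `io.StringIO` avoids passing literal XML text directly, which recent pandas releases discourage.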

### Checking data types

In [3]:
df.dtypes

Title           object
Debate          object
Date            object
Participants    object
Moderators      object
Speaker         object
Text            object
dtype: object

### Converting the `Date` column to the `datetime64[ns]` dtype

In [4]:
df['Date'] = pd.to_datetime(df['Date'])

In [5]:
df.dtypes

Title                   object
Debate                  object
Date            datetime64[ns]
Participants            object
Moderators              object
Speaker                 object
Text                    object
dtype: object
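
When the date layout is known, passing an explicit `format` and `errors="coerce"` can make the conversion stricter and easier to audit. The sketch below assumes dates written like `September 29, 2020`, which is a guess based on the `Title` column, not a confirmed property of `debate.xml`; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately malformed.
raw_dates = pd.Series(["September 29, 2020", "October 22, 2020", "not a date"])

# format= pins the expected layout; errors="coerce" turns values that do not
# match into NaT instead of raising an exception.
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to spot rows whose dates did not match the expected layout.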

### Displaying the DataFrame

In [6]:
df

| | Title | Debate | Date | Participants | Moderators | Speaker | Text |
|---|---|---|---|---|---|---|---|
| 0 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | Thank you very much, Chris. I will tell you ve... |
| 1 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | Well, first of all, thank you for doing this a... |
| 2 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | The American people have a right to have a say... |
| 3 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | There aren’t a hundred million people with pre... |
| 4 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | During that period of time, during that period... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | Yes. And here’s the deal. We count the ballots... |
| 248 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | It’s already been established. Take a look at ... |
| 249 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | TRUMP | Look at Carolyn Maloney’s race. They have no i... |
| 250 | September 29, 2020 Debate Transcript | Presidential Debate at Case Western Reserve Un... | 2020-09-29 | Former Vice President Joe Biden (D) and Presid... | Chris Wallace (Fox News) | BIDEN | He has no idea what he’s talking about. Here’s... |


## Exporting to a file

### `JSONL` format

In [7]:
df.to_json('debate.jsonl', orient='records', lines=True)
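
Two `to_json` options may be worth considering for this export. With the pandas version listed in the appendix (2.2.2), datetime columns are written as epoch milliseconds unless `date_format="iso"` is given, and non-ASCII characters such as `’` are escaped unless `force_ascii=False` is given. The sketch below is a hedged round trip on a small hypothetical DataFrame (the rows are invented for illustration), written to a string rather than to `debate.jsonl`.

```python
import io
import json

import pandas as pd

# Hypothetical two-row stand-in for the debate DataFrame.
sample_df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-09-29", "2020-09-29"]),
        "Speaker": ["TRUMP", "BIDEN"],
        "Text": ["It’s a sample turn.", "Here’s another sample turn."],
    }
)

# With no path argument, to_json returns the JSONL text as a string.
jsonl_text = sample_df.to_json(
    orient="records",
    lines=True,
    date_format="iso",   # ISO 8601 strings instead of epoch milliseconds
    force_ascii=False,   # keep characters such as ’ unescaped
)

# Each line is an independent JSON object.
first_record = json.loads(jsonl_text.splitlines()[0])
print(first_record)

# Reading the JSONL back into a DataFrame.
restored = pd.read_json(io.StringIO(jsonl_text), lines=True, convert_dates=["Date"])
print(restored)
```

Whether ISO dates and unescaped characters are preferable depends on the tools that consume the file downstream.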

### `TSV` format

In [8]:
df.to_csv('debate.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')
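
As a quick check that the TSV export can be read back, the following sketch round-trips a small hypothetical DataFrame through an in-memory string. The sample rows are invented; `parse_dates` restores the `Date` column as a datetime dtype, since a TSV file stores it as plain text.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the debate DataFrame.
sample_df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-09-29", "2020-09-29"]),
        "Speaker": ["TRUMP", "BIDEN"],
        "Text": ["Sample turn, with a comma.", "Another sample turn."],
    }
)

# With no path argument, to_csv returns the TSV text as a string.
tsv_text = sample_df.to_csv(sep="\t", index=False, lineterminator="\n")
print(tsv_text)

# Reading the TSV back; parse_dates converts the Date column again.
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t", parse_dates=["Date"])
print(restored.dtypes)
```

Commas inside `Text` need no quoting because the delimiter is a tab; with the default quoting settings, fields that themselves contain a tab or a line break are wrapped in double quotes.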

## Appendices

### Running the solution from the command line

It is recommended to set up a dedicated Python virtual environment for each project.

#### Setting up the `myenv` virtual environment

```
eyamrog@Rog-ASUS:~$ cd "$HOME"
eyamrog@Rog-ASUS:~$ sudo apt update && sudo apt upgrade -y
<omitted>
eyamrog@Rog-ASUS:~$ sudo apt install -y python3-pip python3-venv
[sudo] password for eyamrog: 
<omitted>
eyamrog@Rog-ASUS:~$ python3 -m venv myenv
eyamrog@Rog-ASUS:~$ ll
total 148
drwxr-x--- 16 eyamrog eyamrog  4096 Aug 27 12:15 ./
drwxr-xr-x  3 root    root     4096 Mar 31 22:09 ../
<omitted>
drwxr-xr-x  5 eyamrog eyamrog  4096 Aug 27 12:15 myenv/
<omitted>
```

#### Activating the `myenv` virtual environment and checking the installed packages

```
eyamrog@Rog-ASUS:~$ source "$HOME"/myenv/bin/activate
(myenv) eyamrog@Rog-ASUS:~$ pip freeze
```

#### Installing the required Python packages

```
(myenv) eyamrog@Rog-ASUS:~$ pip install pandas lxml
Collecting pandas
  Using cached pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Collecting tzdata>=2022.7
  Using cached tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Collecting pytz>=2020.1
  Using cached pytz-2024.1-py2.py3-none-any.whl (505 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting numpy>=1.22.4
  Using cached numpy-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)
Collecting six>=1.5
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, six, numpy, python-dateutil, pandas
Successfully installed numpy-2.1.0 pandas-2.2.2 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 tzdata-2024.1
Collecting lxml
  Using cached lxml-5.3.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.0 MB)
Installing collected packages: lxml
Successfully installed lxml-5.3.0
(myenv) eyamrog@Rog-ASUS:~$ pip freeze
lxml==5.3.0
numpy==2.1.0
pandas==2.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
six==1.16.0
tzdata==2024.1
```
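
The same check can be made from inside Python, which is handy when it is unclear which environment a notebook kernel is using. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this solution depends on.
for name in ("pandas", "lxml"):
    print(name, installed_version(name))
```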

#### Installing and running the solution

```
(myenv) eyamrog@Rog-ASUS:~$ mkdir work
(myenv) eyamrog@Rog-ASUS:~$ cd work
(myenv) eyamrog@Rog-ASUS:~/work$ ll
total 8
drwxr-xr-x  2 eyamrog eyamrog 4096 Aug 27 12:01 ./
drwxr-x--- 16 eyamrog eyamrog 4096 Aug 27 12:15 ../
(myenv) eyamrog@Rog-ASUS:~/work$ git clone https://github.com/laelgelc/cl_st1_marcos.git
Cloning into 'cl_st1_marcos'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 1), reused 8 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 31.27 KiB | 6.25 MiB/s, done.
Resolving deltas: 100% (1/1), done.
(myenv) eyamrog@Rog-ASUS:~/work$ cd cl_st1_marcos/
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ ll
total 212
drwxr-xr-x 4 eyamrog eyamrog   4096 Aug 27 12:20 ./
drwxr-xr-x 3 eyamrog eyamrog   4096 Aug 27 12:20 ../
drwxr-xr-x 8 eyamrog eyamrog   4096 Aug 27 12:20 .git/
drwxr-xr-x 2 eyamrog eyamrog   4096 Aug 27 12:20 .ipynb_checkpoints/
-rw-r--r-- 1 eyamrog eyamrog  16084 Aug 27 12:20 CL_St1_Marcos.ipynb
-rw-r--r-- 1 eyamrog eyamrog     54 Aug 27 12:20 README.md
-rw-r--r-- 1 eyamrog eyamrog    257 Aug 27 12:20 cl_st1_marcos.py
-rw-r--r-- 1 eyamrog eyamrog 175914 Aug 27 12:20 debate.xml
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ python cl_st1_marcos.py
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ ll
total 492
drwxr-xr-x 4 eyamrog eyamrog   4096 Aug 27 12:26 ./
drwxr-xr-x 3 eyamrog eyamrog   4096 Aug 27 12:20 ../
drwxr-xr-x 8 eyamrog eyamrog   4096 Aug 27 12:20 .git/
drwxr-xr-x 2 eyamrog eyamrog   4096 Aug 27 12:20 .ipynb_checkpoints/
-rw-r--r-- 1 eyamrog eyamrog  16084 Aug 27 12:20 CL_St1_Marcos.ipynb
-rw-r--r-- 1 eyamrog eyamrog     54 Aug 27 12:20 README.md
-rw-r--r-- 1 eyamrog eyamrog    257 Aug 27 12:20 cl_st1_marcos.py
-rw-r--r-- 1 eyamrog eyamrog 153138 Aug 27 12:26 debate.jsonl
-rw-r--r-- 1 eyamrog eyamrog 129291 Aug 27 12:26 debate.tsv
-rw-r--r-- 1 eyamrog eyamrog 175914 Aug 27 12:20 debate.xml
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ 
```
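
The contents of `cl_st1_marcos.py` are not reproduced in this document. As a rough sketch only, a script equivalent to the notebook cells above could look like the following; the function name `convert_debate` and its parameters are assumptions, not the repository's actual code.

```python
from pathlib import Path

import pandas as pd


def convert_debate(xml_path, jsonl_path, tsv_path, parser="lxml"):
    """Read an XML transcript and export it as JSONL and TSV (hypothetical helper)."""
    df = pd.read_xml(xml_path, parser=parser)

    # Same conversion as in the notebook: parse the Date column.
    df["Date"] = pd.to_datetime(df["Date"])

    # Same export calls as in the notebook cells.
    df.to_json(jsonl_path, orient="records", lines=True)
    df.to_csv(tsv_path, sep="\t", index=False, encoding="utf-8", lineterminator="\n")
    return df


if __name__ == "__main__":
    source = Path("debate.xml")
    if source.exists():
        convert_debate(source, "debate.jsonl", "debate.tsv")
    else:
        print(f"{source} not found; run this from the repository directory.")
```

Keeping the steps in a function makes it possible to reuse them on other transcript files without editing the script.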

#### Deactivating the `myenv` virtual environment

```
(myenv) eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ deactivate
eyamrog@Rog-ASUS:~/work/cl_st1_marcos$ 
```

#### Removing the `myenv` virtual environment (optional)

If the virtual environment is no longer needed, it can be removed.

```
eyamrog@Rog-ASUS:~$ cd "$HOME"
eyamrog@Rog-ASUS:~$ rm -r myenv
eyamrog@Rog-ASUS:~$ 
```