Patricia Grau Francitorra
# Exploratory article analysis
## Objective
1. Get data from Wikipedia and analyse the HTML format
2. UD tagging
3. Treebanks
4. Comparison
  - Problem: aligning passages that carry the same information across languages.
    - Some files start with an extra explanatory line: see the first line of the French article.
    - Information spread over two sentences in one article may be contained in a single sentence in another; see the French article.

In [23]:
import re
import requests
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4
# In VS Code, Shift+Alt+F formats the file

### 1. Get data from Wikipedia

In [24]:
# Neat idea, but it fails where the bare title leads to a disambiguation page (Jupiter in fr, es, ca)
list_of_langs = ['en', 'sv', 'ca', 'es', 'fr']
list_of_urls = [f'https://{lang}.wikipedia.org/wiki/Jupiter' for lang in list_of_langs]
list_of_urls

['https://en.wikipedia.org/wiki/Jupiter',
 'https://sv.wikipedia.org/wiki/Jupiter',
 'https://ca.wikipedia.org/wiki/Jupiter',
 'https://es.wikipedia.org/wiki/Jupiter',
 'https://fr.wikipedia.org/wiki/Jupiter']

In [25]:
eng_jup_url = 'https://en.wikipedia.org/wiki/Jupiter'
sv_jup_url = 'https://sv.wikipedia.org/wiki/Jupiter'
fr_jup_url = 'https://fr.wikipedia.org/wiki/Jupiter_(planète)' # disambiguation issue
spa_jup_url = 'https://es.wikipedia.org/wiki/Júpiter_(planeta)' # disambiguation issue
ca_jup_url = 'https://ca.wikipedia.org/wiki/Júpiter_(planeta)' # disambiguation issue
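One way to avoid hard-coding the disambiguated titles might be to ask the MediaWiki API for the interlanguage links of the English page. The sketch below only builds the query parameters and reads a response. The helper names `langlinks_params` and `urls_from_langlinks` are made up for illustration, and `sample_response` is a hand-written stand-in for the real API answer, not data fetched from Wikipedia.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def langlinks_params(title):
    """Query parameters asking the MediaWiki API for a page's interlanguage links."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }


def urls_from_langlinks(response_json, wanted_langs):
    """Map language code -> article URL, using the titles in a langlinks response."""
    urls = {}
    for page in response_json["query"]["pages"].values():
        for link in page.get("langlinks", []):
            lang = link["lang"]
            if lang in wanted_langs:
                # In the default JSON format the linked title is stored under "*"
                title = link["*"].replace(" ", "_")
                urls[lang] = f"https://{lang}.wikipedia.org/wiki/{title}"
    return urls


# Hand-written stand-in for:
#   requests.get(API_URL, params=langlinks_params("Jupiter")).json()
sample_response = {
    "query": {"pages": {"1": {"title": "Jupiter", "langlinks": [
        {"lang": "fr", "*": "Jupiter (planète)"},
        {"lang": "es", "*": "Júpiter (planeta)"},
        {"lang": "de", "*": "Jupiter (Planet)"},
    ]}}}
}
print(urls_from_langlinks(sample_response, {"fr", "es", "ca"}))
```

Languages missing from the response (here `ca`) are simply absent from the result, so they would still need a manual URL.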

In [36]:
# Save the raw HTML of each article; the jupiter/ directory must already exist
with open('jupiter/jupiter_en.html', 'w', encoding='utf-8') as f:
    f.write(requests.get(url=eng_jup_url).text)
with open('jupiter/jupiter_spa.html', 'w', encoding='utf-8') as f:
    f.write(requests.get(url=spa_jup_url).text)
with open('jupiter/jupiter_fr.html', 'w', encoding='utf-8') as f:
    f.write(requests.get(url=fr_jup_url).text)

In [27]:
def print_cleantext(file):
    """Print the text of every line of a saved HTML file that opens a paragraph."""
    with open(file, encoding='utf-8') as f:
        for alltext in f:
            texto = alltext.strip('\t')
            # Keep only lines that start a paragraph
            if texto.startswith(('<p>', '</p><p>')):
                # With "lxml" instead of "html.parser", some paragraphs/lines go missing
                cleantext = BeautifulSoup(texto, "html.parser").text
                cleantext = re.sub(r'\[\d+\]', '', cleantext)  # remove reference markers such as [12]
                cleantext = cleantext.strip()
                if cleantext:
                    print(cleantext)
                    print()
    # TODO: save to a file or use it for UD tagging

In [40]:
def save_cleantext(readfile, writefile):
    """Write the cleaned text of every paragraph line in readfile to writefile, one per line."""
    with open(readfile, encoding='utf-8') as f, open(writefile, 'w', encoding='utf-8') as g:
        for alltext in f:
            texto = alltext.strip('\t')
            # Keep only lines that start a paragraph
            if texto.startswith(('<p>', '</p><p>')):
                # With "lxml" instead of "html.parser", some paragraphs/lines go missing
                cleantext = BeautifulSoup(texto, "html.parser").text
                cleantext = re.sub(r'\[\d+\]', '', cleantext)  # remove reference markers such as [12]
                cleantext = cleantext.strip()
                if cleantext:
                    g.write(cleantext + '\n')
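The line-by-line approach above relies on every `<p>` tag starting its own line in the saved HTML. A possible alternative, sketched here under the assumption that parsing the whole document at once is acceptable, is to let BeautifulSoup find the paragraphs. `extract_paragraphs` is a hypothetical helper name and `sample_html` is a hand-written snippet, not real Wikipedia markup.

```python
import re
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the non-empty <p> texts of an HTML document, reference markers removed."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        text = re.sub(r"\[\d+\]", "", p.get_text())
        text = " ".join(text.split())  # collapse line breaks and repeated spaces
        if text:
            paragraphs.append(text)
    return paragraphs


# Hand-written snippet for illustration only
sample_html = """
<div><p>Jupiter is the fifth planet from the Sun<sup>[1]</sup>
and the largest in the Solar System.</p>
<p>   </p>
<table><tr><td>not a paragraph</td></tr></table>
<p>It is a gas giant.[12]</p></div>
"""
for paragraph in extract_paragraphs(sample_html):
    print(paragraph)
```

This would also pick up `<p>` tags outside the article body, so on a real page it may be worth restricting the search to the main content element first.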

In [42]:
save_cleantext('jupiter/jupiter_en.html', 'jupiter/clean_jupiter_en.txt')
save_cleantext('jupiter/jupiter_spa.html', 'jupiter/clean_jupiter_spa.txt')
save_cleantext('jupiter/jupiter_fr.html', 'jupiter/clean_jupiter_fr.txt')

### 2. UD tagging

![Jupiter is the fifth planet from the Sun and the largest in the solar system.png](attachment:Jupiter%20is%20the%20fifth%20planet%20from%20the%20Sun%20and%20the%20largest%20in%20the%20solar%20system.png)
Example of the output I want (from corenlp.run), to be produced in CoNLL-U format.
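For reference, each token in a CoNLL-U file is one line of ten tab-separated columns. The sketch below splits such a line into named fields. `parse_token_line` is a hypothetical helper, and the example line is hand-written to illustrate the layout; it is not output from any parser.

```python
# The ten CoNLL-U columns, in order
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a field-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# Hand-written illustrative line, not actual parser output
example = "1\tJupiter\tJupiter\tPROPN\tNNP\tNumber=Sing\t5\tnsubj\t_\t_"
token = parse_token_line(example)
print(token["FORM"], token["UPOS"], token["HEAD"], token["DEPREL"])
```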

In [14]:
# using stanza
import stanza
from stanza.utils.conll import CoNLL

In [45]:
# Parse the first paragraph of each article and write it out in CoNLL-U format
list_of_clean_files = [
    ('jupiter/clean_jupiter_en.txt', 'en'),
    ('jupiter/clean_jupiter_spa.txt', 'es'),
    ('jupiter/clean_jupiter_fr.txt', 'fr'),
]
for file, lang in list_of_clean_files:
    stanza.download(lang)
    nlp = stanza.Pipeline(lang)
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            if line.startswith('J'):  # first line talking about Jupiter ("Jupiter" / "Júpiter")
                doc = nlp(line)
                CoNLL.write_doc2conll(doc, f"jupiter/jup_{lang}.conllu")
                break

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-01-25 15:42:20 INFO: Downloading default packages for language: en (English)...
2022-01-25 15:42:22 INFO: File exists: /home/gusgraupa@GU.GU.SE/stanza_resources/en/default.zip.
2022-01-25 15:42:30 INFO: Finished downloading models and saved to /home/gusgraupa@GU.GU.SE/stanza_resources.
2022-01-25 15:42:30 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-01-25 15:42:30 INFO: Use device: gpu
2022-01-25 15:42:30 INFO: Loading: tokenize
2022-01-25 15:42:30 INFO: Loading: pos
2022-01-25 15:42:32 INFO: Loading: lemma
2022-01-25 15:42:33 INFO: Loading: depparse
2022-01-25 15:42:35 INFO: Loading: sentiment
2022-01-25 15:42:38 INFO: Loading: constituency
2022-01-25 15:42:42 INFO: Loading: ner
2022-01-25 15:42:48 INFO: Done loading processors!

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-01-25 15:42:48 INFO: Downloading default packages for language: es (Spanish)...
2022-01-25 15:42:50 INFO: File exists: /home/gusgraupa@GU.GU.SE/stanza_resources/es/default.zip.
2022-01-25 15:42:59 INFO: Finished downloading models and saved to /home/gusgraupa@GU.GU.SE/stanza_resources.
2022-01-25 15:42:59 INFO: Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| mwt       | ancora  |
| pos       | ancora  |
| lemma     | ancora  |
| depparse  | ancora  |
| ner       | conll02 |

2022-01-25 15:42:59 INFO: Use device: gpu
2022-01-25 15:42:59 INFO: Loading: tokenize
2022-01-25 15:42:59 INFO: Loading: mwt
2022-01-25 15:42:59 INFO: Loading: pos
2022-01-25 15:43:01 INFO: Loading: lemma
2022-01-25 15:43:01 INFO: Loading: depparse
2022-01-25 15:43:06 INFO: Loading: ner
2022-01-25 15:43:17 INFO: Done loading processors!


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-01-25 15:43:17 INFO: Downloading default packages for language: fr (French)...
2022-01-25 15:43:18 INFO: File exists: /home/gusgraupa@GU.GU.SE/stanza_resources/fr/default.zip.
2022-01-25 15:43:28 INFO: Finished downloading models and saved to /home/gusgraupa@GU.GU.SE/stanza_resources.
2022-01-25 15:43:28 INFO: Loading these models for language: fr (French):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |
| ner       | wikiner |

2022-01-25 15:43:28 INFO: Use device: gpu
2022-01-25 15:43:28 INFO: Loading: tokenize
2022-01-25 15:43:28 INFO: Loading: mwt
2022-01-25 15:43:28 INFO: Loading: pos
2022-01-25 15:43:32 INFO: Loading: lemma
2022-01-25 15:43:33 INFO: Loading: depparse
2022-01-25 15:43:36 INFO: Loading: ner
2022-01-25 15:43:45 INFO: Done loading processors!
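As a possible first step towards the comparison (objective 4), the UPOS tags in each `.conllu` file could be counted and put side by side. This is only a sketch: `upos_counts` is a hypothetical helper, and `sample` is a hand-written fragment that includes a multiword-token line of the kind the `mwt` processor produces for Spanish and French.

```python
from collections import Counter


def upos_counts(conllu_text):
    """Count UPOS tags over the token lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        columns = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "5.1")
        if not columns[0].isdigit():
            continue
        counts[columns[3]] += 1  # column 4 is UPOS
    return counts


# Hand-written fragment for illustration, not actual parser output
sample = "\n".join([
    "# text = del Sol",
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tSol\tSol\tPROPN\t_\t_\t0\troot\t_\t_",
])
print(upos_counts(sample))
```

On the real data, the same function could be applied to the contents of `jupiter/jup_en.conllu`, `jupiter/jup_es.conllu` and `jupiter/jup_fr.conllu`, read with `encoding='utf-8'`.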
