In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
import os
os.chdir("/content/drive/MyDrive/NLP/lab_1")

# Lab 1. Basic text processing techniques

## Regex

A regular expression is a sequence of characters that defines a search pattern.

Regular expressions are useful for finding specific patterns in text and also for normalizing text. Reference: https://www.w3schools.com/python/python_regex.asp


In [1]:
import re

text = """
Praise for The Rain in Portugal
 
“Nothing in Billy Collins’s twelfth book . . . is exactly what readers might expect, and that’s the charm of this collection.”—The Washington Post
 
“This new collection shows [Collins] at his finest. . . . Certain to please his large readership and a good place for readers new to Collins to begin.”—Library Journal. 
 
“Disarmingly playful and wistfully candid.”—Booklist
Buy new:$38.65
No Import Fees Deposit & $13.01 Shipping to Romania Details -12.3.
"""

Usage example: with `re.sub` we first replace every character that is not an upper- or lowercase letter of the English alphabet with a space, then collapse every run of consecutive whitespace characters into a single space.

In [2]:
cleaned_text = re.sub("[^A-Za-z]", " ", text)
cleaned_text = re.sub(r"\s+", " ", cleaned_text)
print(cleaned_text)

 Praise for The Rain in Portugal Nothing in Billy Collins s twelfth book is exactly what readers might expect and that s the charm of this collection The Washington Post This new collection shows Collins at his finest Certain to please his large readership and a good place for readers new to Collins to begin Library Journal Disarmingly playful and wistfully candid Booklist Buy new No Import Fees Deposit Shipping to Romania Details 


To test patterns interactively, we can use https://regex101.com/.

### The `finditer` function

This function searches a string for a pattern and returns an iterator that yields one Match object per match.

In [3]:
import re

s = 'Readability counts.'
pattern = r'[aeoui]'

matches = re.finditer(pattern, s)
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='a'>
<re.Match object; span=(4, 5), match='a'>
<re.Match object; span=(6, 7), match='i'>
<re.Match object; span=(8, 9), match='i'>
<re.Match object; span=(13, 14), match='o'>
<re.Match object; span=(14, 15), match='u'>


Example: we look for all float or int numbers, together with their positions and values. Here we use the `compile()` function to compile the regular expression, given as a string, into a reusable pattern object.

In [4]:
pattern = re.compile(r"[+-]?(\d+\.)?\d+")
for match in pattern.finditer(text):
    print(match, "--> the value starts at character", match.start(), "and is", match.group())

<re.Match object; span=(418, 423), match='38.65'> --> the value starts at character 418 and is 38.65
<re.Match object; span=(450, 455), match='13.01'> --> the value starts at character 450 and is 13.01
<re.Match object; span=(484, 489), match='-12.3'> --> the value starts at character 484 and is -12.3
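One caveat with this pattern: it contains a capturing group, so `re.findall` would return only the text captured by the group instead of the whole number. The sketch below uses a made-up string (`sample` is an assumption, not the lab text) to show the difference and how a non-capturing group `(?:...)` avoids it.

```python
import re

# Hypothetical sample string, modelled on the last lines of the lab text.
sample = "Buy new:$38.65, shipping $13.01, details -12.3."

# With a capturing group, findall returns only what the group captured.
capturing = re.compile(r"[+-]?(\d+\.)?\d+")
print(capturing.findall(sample))

# With a non-capturing group, findall returns the full match.
non_capturing = re.compile(r"[+-]?(?:\d+\.)?\d+")
numbers = non_capturing.findall(sample)
print(numbers)

# The matched strings can then be converted to numeric values.
values = [float(n) for n in numbers]
print(values)
```

`finditer` is not affected by this, because `match.group()` always returns the full match.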


## Encodings

The encoding of a text can vary with the language, and it is a very important detail when working with text.

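Python 3 strings are Unicode, and UTF-8 is the standard encoding for Romanian and for most other languages. Note that `open()` without an `encoding` argument uses a platform-dependent default, which is not always UTF-8.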
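Which encoding Python assumes depends on the context. The short sketch below prints the two relevant defaults on the current machine; the actual value of the first one varies by operating system and locale, so no particular output is assumed.

```python
import locale
import sys

# Encoding used by open() when no `encoding` argument is given
# (platform dependent, for example cp1252 on many Windows setups).
default_file_encoding = locale.getpreferredencoding(False)
print(default_file_encoding)

# Encoding used by str.encode() and bytes.decode() when none is given.
default_str_encoding = sys.getdefaultencoding()
print(default_str_encoding)
```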

The following example is taken from a Russian subtitle file (.srt) that is not encoded in UTF-8. If we try to read it without specifying the encoding, the text is decoded incorrectly and comes out garbled:

In [5]:
with open('encoded_text.txt', "r") as fin:
    content = fin.read()
    print(content)

1
00:00:05,100 --> 00:00:10,860
Ýòî áûëè òÿæåëûå âðåìåíà. Ðèì íàõîäèëñÿ ïîä ãîñïîäñòâîì êîððóìïèðîâàííîãî Ïàïû è ñîìíèòåëüíûõ çàêîíîâ

2
00:00:10,960 --> 00:00:13,820
èãð âëàñòè è ìåæäîóñîáíîé áîðüáû

3
00:00:15,660 --> 00:00:20,310
Ñèíüîðû íà÷èíàëè æåñòîêèå ñðàæåíèÿ ñ åäèíñòâåííîé öåëüþ – íàêîïèòü ñîñòîÿíèå

4
00:00:21,640 --> 00:00:24,950
À òåì âðåìåíåì, ïðîñòûå ëþäè åëè íå êàæäûé äåíü

5
00:00:30,090 --> 00:00:36,040
Ëþáîâü áûëà òåìîé äëÿ ïîýòîâ, íî ðåäêî óïîìèíàëàñü â ñâàäåáíûõ êëÿòâàõ

6
00:00:36,940 --> 00:00:42,140
Æåíùèí îòäàâàëè â æåíû ìóæ÷èíàì, êîòîðûõ îíè åäâà çíàëè, íå ãîâîðÿ óæå î ëþáâè

7
00:00:43,040 --> 00:00:47,980
Â ýòîì ìèðå, æåñòîêîì è íåñïðàâåäëèâîì, ÿ ïîâñòðå÷àëà äâóõ ìîëîäûõ ëþäåé



We can detect the encoding with the `chardet` library:

In [6]:
! pip install chardet

Collecting chardet
  Downloading chardet-5.1.0-py3-none-any.whl (199 kB)
     ---------------------------------------- 0.0/199.1 kB ? eta -:--:--
     ------------------ -------------------- 92.2/199.1 kB 1.7 MB/s eta 0:00:01
     -------------------------------------- 199.1/199.1 kB 2.4 MB/s eta 0:00:00
Installing collected packages: chardet
Successfully installed chardet-5.1.0



[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [8]:
import chardet

with open('encoded_text.txt', "rb") as f:
    rawdata = f.read()
    result = chardet.detect(rawdata)
    print(result)
    extracted_encoding = result['encoding']
    print("The encoding of this file is:", extracted_encoding)

{'encoding': 'windows-1251', 'confidence': 0.9865976935897622, 'language': 'Russian'}
The encoding of this file is: windows-1251


With the right encoding, the file can now be read correctly:

In [9]:
with open('encoded_text.txt', "r", encoding="windows-1251") as fin:
    content = fin.read()
    print(content)

1
00:00:05,100 --> 00:00:10,860
Это были тяжелые времена. Рим находился под господством коррумпированного Папы и сомнительных законов

2
00:00:10,960 --> 00:00:13,820
игр власти и междоусобной борьбы

3
00:00:15,660 --> 00:00:20,310
Синьоры начинали жестокие сражения с единственной целью – накопить состояние

4
00:00:21,640 --> 00:00:24,950
А тем временем, простые люди ели не каждый день

5
00:00:30,090 --> 00:00:36,040
Любовь была темой для поэтов, но редко упоминалась в свадебных клятвах

6
00:00:36,940 --> 00:00:42,140
Женщин отдавали в жены мужчинам, которых они едва знали, не говоря уже о любви

7
00:00:43,040 --> 00:00:47,980
В этом мире, жестоком и несправедливом, я повстречала двух молодых людей
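The garbled output seen earlier can be reproduced in memory, without the file. The sketch below takes the first words of the subtitle, encodes them as windows-1251, and decodes the bytes with the wrong single-byte encoding. Using `latin-1` as the wrong encoding is an assumption chosen because it maps every byte to a character.

```python
original = "Это были тяжелые времена"

# Encode the text the way the subtitle file stores it.
raw = original.encode("windows-1251")

# Decoding with the wrong single-byte encoding does not fail,
# it silently produces the wrong characters.
garbled = raw.decode("latin-1")
print(garbled)

# Decoding with the right encoding restores the text.
restored = raw.decode("windows-1251")
print(restored)

# As long as no bytes were lost, the garbling can be undone.
repaired = garbled.encode("latin-1").decode("windows-1251")
print(repaired)
```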



If we want, we can save the content as UTF-8, the most widely supported encoding. UTF-8 is not the default for `open()` on every platform, so the encoding should still be specified when the file is read back:

In [16]:
with open('encoded_text.txt', 'r', encoding=extracted_encoding) as fin:
    content = fin.read()
with open('utf8_text.txt', 'w', encoding='utf-8') as fout:
    fout.write(content)

In [21]:
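# Pass the encoding explicitly. Without it, open() falls back to the platform default
# (for example cp1252 on Windows), which in the original run raised:
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 53: character maps to <undefined>
with open('utf8_text.txt', "r", encoding="utf-8") as fin:
    content = fin.read()
    print(content)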

## Non-standard files (PDF, Word, etc.)

We can read text from Word documents using the `docx2txt` library.

In [22]:
!pip install docx2txt

Collecting docx2txt
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Installing collected packages: docx2txt
  Running setup.py install for docx2txt: started
  Running setup.py install for docx2txt: finished with status 'done'
Successfully installed docx2txt-0.8


  DEPRECATION: docx2txt is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
import docx2txt
my_text = docx2txt.process("soup.docx")
print(my_text)

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.



These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.



This document covers Beautiful Soup version 4.10.0. The examples in this documentation were written for Python 3.8.



You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that all support for it was dropped on December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.



This documentation has been translated into other languages by Beaut

We can read PDFs whose content is stored as text (not as images), for example with the `pdfplumber` library:

In [None]:
! pip install pdfplumber


In [None]:
import pdfplumber
with pdfplumber.open('soup.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text())

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite
parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly
saves programmers hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the
library is good for, how it works, how to use it, how to make it do what you want, and what to do when
it violates your expectations.
This document covers Beautiful Soup version 4.10.0. The examples in this documentation were written
for Python 3.8.
You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful
Soup 3 is no longer being developed and that all support for it was dropped on December 31, 2020. If
you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting
code to BS4.
This documentation has been translated into other languages by Beautiful Soup us
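The two extractions differ in layout: the Word text keeps paragraphs separated by blank lines, while the PDF text is broken at the visual line ends. If the line breaks are not needed, they can be collapsed with the same `re.sub` normalization used in the regex section. The sketch below works on a hard-coded excerpt of the PDF output (`extracted` is a stand-in for the real `page.extract_text()` result).

```python
import re

# Stand-in for the text returned by the PDF extraction above.
extracted = (
    "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite\n"
    "parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly\n"
    "saves programmers hours or days of work."
)

# Replace every run of whitespace (including newlines) with one space.
single_line = re.sub(r"\s+", " ", extracted).strip()
print(single_line)
```

This also merges real paragraph breaks, so it only suits cases where paragraph structure does not matter.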

## Web scraping

Scraping refers to a set of methods for downloading unstructured data from the web. We are interested in text data, which we can process and store in a structured form once it has been retrieved.

As a first scraping example, we will try the following task: starting from the competitive programming site "infoarena.ro", we want to download information about all the submissions made by a given user.

Example submissions page: https://www.infoarena.ro/monitor?user=iordache.bogdan

To make a request that returns the page content, we can use the `requests` library:

In [23]:
! pip install requests

Collecting requests
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
     ---------------------------------------- 0.0/62.8 kB ? eta -:--:--
     ------------------------ ------------- 41.0/62.8 kB 991.0 kB/s eta 0:00:01
     ---------------------------------------- 62.8/62.8 kB 1.1 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.4-py3-none-any.whl (61 kB)
     ---------------------------------------- 0.0/61.5 kB ? eta -:--:--
     ---------------------------------------- 61.5/61.5 kB ? eta 0:00:00
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.0.1-cp310-cp310-win_amd64.whl (96 kB)
     ---------------------------------------- 0.0/96.5 kB ? eta -:--:--
     ---------------------------------------- 96.5/96.5 kB 5.4 MB/s eta 0:00:00
Collecting certifi>=2017.4.17
  Downloading certifi-2022.12.7-py3-none-any.whl (155 kB)
     ---------------------------------------- 0.0/155.3 kB ? eta -:--:--
     -------------------------------------- 155.3/155.


[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [24]:
import requests

def get_submissions_page(user):
    return requests.get(f"https://www.infoarena.ro/monitor?user={user}")

In [None]:
html = get_submissions_page("iordache.bogdan").content

With the function above we can download the entire HTML content of the page. To extract useful information we need to parse this content. For this we will use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library:

In [25]:
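# bs4 is provided by the beautifulsoup4 package (install with: pip install beautifulsoup4).
# Without it, this cell fails with "ModuleNotFoundError: No module named 'bs4'".
import bs4

def parse_html(html):
    return bs4.BeautifulSoup(html, "html.parser")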

With the parsed content, we can now determine how many submissions this user has in total:

In [None]:
import re

soup = parse_html(html)

# look for a span with the class "count"; it contains the number of submissions
submission_count_text = soup.find("span", class_="count").text
print(submission_count_text)


 (5033 rezultate)


To extract just the number from this string, we can use a regex:

In [None]:
submission_count = int(re.search(r"\d+", submission_count_text).group())
print(submission_count)

5033
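`re.search` returns `None` when nothing matches, so calling `.group()` directly raises `AttributeError` if the text contains no digits (for instance after a change in the page layout). A defensive variant could look like the sketch below; the helper name `parse_count` is hypothetical.

```python
import re

def parse_count(text):
    """Return the first integer found in `text`, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:
        return None
    return int(match.group())

print(parse_count(" (5033 rezultate)"))
print(parse_count("no digits here"))
```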


Notice that these submissions are split across several pages (the results are paginated).

Also, the following link: https://www.infoarena.ro/monitor?user=iordache.bogdan&display_entries=250&first_entry=100 returns 250 submissions, starting with submission number 100.
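With these two parameters, covering all submissions means requesting one page per offset. As a small worked example, assuming the total of 5033 submissions found earlier and a page size of 250, the offsets can be generated with `range`:

```python
import math

total = 5033      # number of submissions reported by the page
page_size = 250   # value passed as display_entries

# One first_entry offset per page: 0, 250, 500, ...
offsets = list(range(0, total, page_size))
print(len(offsets), offsets[:3], offsets[-1])

# The number of pages is the total divided by the page size, rounded up.
page_count = math.ceil(total / page_size)
print(page_count)
```

The last page starts at offset 5000 and holds the remaining 33 entries.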

We can modify the `get_submissions_page` function as follows:

In [None]:
def get_submissions_page(user, display_entries=None, first_entry=None):
    req_string = f"https://www.infoarena.ro/monitor?user={user}"
    if display_entries is not None:
        req_string += f"&display_entries={display_entries}"
    if first_entry is not None:
        req_string += f"&first_entry={first_entry}"

    return requests.get(req_string)
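Building the query string by concatenation works for simple values but does not escape special characters. As an alternative sketch, the same URL can be assembled with `urllib.parse.urlencode` from the standard library; the helper name `build_monitor_url` is hypothetical and makes no network request.

```python
from urllib.parse import urlencode

def build_monitor_url(user, display_entries=None, first_entry=None):
    # Collect only the parameters that were actually provided.
    params = {"user": user}
    if display_entries is not None:
        params["display_entries"] = display_entries
    if first_entry is not None:
        params["first_entry"] = first_entry
    # urlencode escapes the values and joins them with "&".
    return "https://www.infoarena.ro/monitor?" + urlencode(params)

print(build_monitor_url("iordache.bogdan", display_entries=250, first_entry=100))
```

`requests.get` also accepts a `params` dictionary and performs the same kind of encoding.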

And we can implement a function that returns information about all of a user's submissions:

In [None]:
from tqdm import tqdm
import pandas as pd

def scrape_submissions(user):
    # determine the total number of submissions
    html = get_submissions_page(user).content
    soup = parse_html(html)
    submission_count_text = soup.find("span", class_="count").text
    submission_count = int(re.search(r"\d+", submission_count_text).group())

    # this dictionary stores the data about the extracted submissions; this layout
    # will later help us build a table (DataFrame) with pandas
    d = {
        "id": [],
        "problema": [],
        "url_problema": [],
        "url_sursa": [],
        "data": [],
        "puncte": [],
    }

    # request the submission pages in batches of 250
    for first_entry in tqdm(range(0, submission_count, 250)):
        html = get_submissions_page(user, display_entries=250, first_entry=first_entry).content
        soup = parse_html(html)

        # select all table rows (tr)
        lines = soup.select("table.monitor tbody tr")

        for line in lines:
            # select the cells of this row
            cells = line.select("td")

            # extract the links to the problem and to the source code
            try:
                url_problema = cells[2].select_one("a")["href"]
                url_sursa = cells[4].select_one("a")["href"]
            except Exception:  # if either link is missing, skip the row
                continue

            d["id"].append(cells[0].text)
            d["problema"].append(cells[2].text)
            d["url_problema"].append(url_problema)
            d["url_sursa"].append(url_sursa)
            d["data"].append(cells[5].text)

            # extract the score; rows without a numeric score count as 0
            try:
                puncte = int(re.search(r"\d+", cells[6].text).group())
            except Exception:
                puncte = 0
            d["puncte"].append(puncte)

    return pd.DataFrame(d)
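The selection logic can be tried offline on a small HTML snippet before running it against the live site. The markup below is invented for illustration (it only imitates the `span.count` and `table.monitor` structure that the function relies on) and assumes the `beautifulsoup4` package is installed.

```python
import bs4

# Hypothetical markup imitating the structure of the monitor page.
sample_html = """
<span class="count"> (2 rezultate)</span>
<table class="monitor">
  <tbody>
    <tr><td>#1</td><td>user</td><td><a href="/problema/atac">Atac</a></td></tr>
    <tr><td>#2</td><td>user</td><td>no link here</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# find() returns the first element matching the tag name and class.
count_text = soup.find("span", class_="count").text

# select() takes a CSS selector and returns every matching element.
rows = soup.select("table.monitor tbody tr")

links = []
for row in rows:
    cells = row.select("td")
    anchor = cells[2].select_one("a")  # None when the cell has no link
    links.append(anchor["href"] if anchor is not None else None)

print(count_text, links)
```

Checking `select_one` for `None` is a narrower alternative to wrapping the lookup in `try`/`except Exception`.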

In [None]:
df_submissions = scrape_submissions("iordache.bogdan")

100%|██████████| 21/21 [00:31<00:00,  1.50s/it]


In [None]:
df_submissions.head()

Unnamed: 0,id,problema,url_problema,url_sursa,data,puncte
0,#2971352,Atac,/problema/atac,/job_detail/2971352?action=view-source,27 ian 23 01:27:44,100
1,#2971346,Atac,/problema/atac,/job_detail/2971346?action=view-source,27 ian 23 01:07:30,20
2,#2971294,Pirati,/problema/pirati,/job_detail/2971294?action=view-source,26 ian 23 23:18:14,100
3,#2970859,Lowest Common Ancestor,/problema/lca,/job_detail/2970859?action=view-source,25 ian 23 23:42:36,100
4,#2970853,Lowest Common Ancestor,/problema/lca,/job_detail/2970853?action=view-source,25 ian 23 23:19:15,100


In [None]:
df_submissions.to_csv("submissions.csv", index=False)

Example of writing and reading a JSON file:

In [None]:
import json

vec = [
    {"title": "example_1", "size": 7},
    {"title": "example_2", "size": 3},
    {"title": "example_3", "size": 8},
]

with open("example.json", "w") as f:
    json.dump(vec, f, indent=4)

In [None]:
with open("example.json", "r") as f:
    vec = json.load(f)
print(vec)

[{'title': 'example_1', 'size': 7}, {'title': 'example_2', 'size': 3}, {'title': 'example_3', 'size': 8}]


Another way to scrape is to use the pandas library to extract HTML tables and turn them into DataFrames, which are easy to manipulate. A useful example is extracting the Romanian public holidays for 2022 from https://www.timeanddate.com/.

In [None]:
! pip install lxml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd

tables_df = pd.read_html('https://www.timeanddate.com/holidays/romania/2022?hol=1')
df = tables_df[0]

# We can clean it by dropping the rows with null values and flattening the column names from tuples such as "(Date, Date)" to "Date"
df = df.dropna(axis='index')
df.columns = ['Date', 'Day', 'Name', 'Type']

# Reindex the table
df = df.reset_index(drop=True)

# Show the first 5 rows
df.head()

Unnamed: 0,Date,Day,Name,Type
0,Jan 1,Saturday,New Year's Day,National holiday
1,Jan 2,Sunday,Day after New Year's Day,National holiday
2,Jan 24,Monday,Unification Day,National holiday
3,Feb 19,Saturday,Constantin Brancusi Day,Observance
4,Feb 24,Thursday,Dragobete,Observance


In [None]:
# To see the holidays that fall on a Monday, we can select rows from the dataframe
df_luni = df.loc[df["Day"] == "Monday"]
df_luni

Unnamed: 0,Date,Day,Name,Type
2,Jan 24,Monday,Unification Day,National holiday
10,Apr 25,Monday,Orthodox Easter Monday,"National holiday, Orthodox"
19,Jun 13,Monday,Orthodox Pentecost Monday,"National holiday, Orthodox"
23,Aug 15,Monday,St Mary's Day,National holiday
25,Oct 31,Monday,Halloween,Observance
33,Dec 26,Monday,Second day of Christmas,National holiday
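Selections like this one can be combined. The sketch below rebuilds a few of the rows shown above by hand (so it runs without a network request; `df_sample` is a stand-in for the scraped table) and keeps only the Monday holidays whose type includes "National holiday".

```python
import pandas as pd

# Hand-made stand-in containing a few rows from the scraped table.
df_sample = pd.DataFrame({
    "Date": ["Jan 1", "Jan 24", "Apr 25", "Oct 31"],
    "Day": ["Saturday", "Monday", "Monday", "Monday"],
    "Name": ["New Year's Day", "Unification Day", "Orthodox Easter Monday", "Halloween"],
    "Type": ["National holiday", "National holiday", "National holiday, Orthodox", "Observance"],
})

# Each condition yields a boolean Series; & combines them element-wise.
is_monday = df_sample["Day"] == "Monday"
is_national = df_sample["Type"].str.contains("National holiday")

monday_national = df_sample.loc[is_monday & is_national]
print(monday_national["Name"].tolist())
```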


We can save the result to a JSON file, as an alternative to the DataFrame, just as we would any Python list of dictionaries. This can be useful in an application for communicating with the front end.

In [None]:
import json
json_str = df.to_json(orient='records')
json_result = json.loads(json_str)

with open('holidays.json', 'w', encoding='utf8') as fout:
    json.dump(json_result, fout, indent=4, sort_keys=True, ensure_ascii=False)
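The `ensure_ascii=False` argument matters once the data contains non-ASCII characters such as Romanian diacritics. The sketch below compares the two settings on a made-up record (the name with diacritics is an assumption, since the scraped table spells it without them).

```python
import json

# Hypothetical record containing Romanian diacritics.
record = {"Name": "Constantin Brâncuși Day"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)
print(escaped)

# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(record, ensure_ascii=False)
print(readable)

# Both forms decode back to the same Python object.
same = json.loads(escaped) == json.loads(readable)
print(same)
```

When writing readable output to a file, the file should be opened with `encoding='utf8'`, as in the cell above.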

Other useful libraries for scraping:
 * [scrapy](https://scrapy.org/) (used mainly for web crawling)
 * [selenium](https://selenium-python.readthedocs.io/) (used to simulate browser activity, mainly when writing tests for front-end applications)

## TASK: IMDb scraping

File upload:
https://docs.google.com/forms/d/e/1FAIpQLSdOKefipRl6cwjukN5YIxl7Q64dHoUHk2zqg1aO31U7kieHXQ/viewform?usp=sf_link

1. Starting from the list of the top 250 movies on IMDb ([https://www.imdb.com/chart/top/](https://www.imdb.com/chart/top/)), find the link to the reviews page of each of these movies.

Example: the reviews page for "The Shawshank Redemption" is here: [https://www.imdb.com/title/tt0111161/reviews](https://www.imdb.com/title/tt0111161/reviews)

2. For each movie, collect data about its reviews (title, text, rating, date, user, etc.)

3. Create a dataset of reviews. For each review, store:
 * the movie it belongs to
 * the review title
 * the review text
 * the rating
 * the date
 * the user

 Save the dataset to a JSON file.

4. A reviews page shows only a small number of reviews. The "Load more" button at the bottom, when clicked, triggers a request that returns the HTML of the next batch of reviews. Using this mechanism, automatically collect a larger number of reviews for each movie.