<a href="https://colab.research.google.com/github/japarra27/Dr.-Semmelweis-and-the-discovery-of-handwashing/blob/master/ProyectoFinal_An%C3%A1lisisDeepLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![image](https://docs.google.com/uc?export=download&id=1NUy1Q-abpoV9XYK9qT9t8Mdhj3ZVlveO)

# **Final Project**

* Steven Llerena - 202010212 
* Jaime Andrés Parra - 202107161
* Jesús Ramírez - 201015827

## **Contents**
1. [**Problem**](#id1)
2. [**Installing and importing the libraries required for the lab**](#id2)
3. [**Data loading**](#id3)
4. [**Understanding the dataset**](#id4)
5. [**Modeling**](#id5)
6. [**Metric estimation**](#id6)
7. [**Conclusions**](#id7)

## **Problem**<a name="id1"></a>

- <p align = "justify">Reviewing papers to decide whether they will contribute to a research project is currently a long and often unfruitful process. We therefore build a text summarization model using transformers. The dataset used is CORD19:</p>

> [Data source](https://www.semanticscholar.org/cord19/download)

https://colab.research.google.com/github/PubChimps/ibmvirtualmeetups/blob/master/5-12/meetup.ipynb#scrollTo=8-hqHk_49P6h
https://colab.research.google.com/github/huggingface/nlp/blob/master/notebooks/Overview.ipynb#scrollTo=xxLcdj2yvSU3

## **Installing and importing the libraries required for the lab**<a name="id2"></a>

In [None]:
%%capture
# Colab utility setup
!shred -u setup_colab_general.py
!wget -q "https://github.com/jpcano1/python_utils/raw/main/setup_colab_general.py" -O setup_colab_general.py
!pip install --progress-bar off -q tqdm==4.56.0
!pip install datasets
!pip install wandb
!pip install transformers
!pip install pytorch-lightning
!pip install SentencePiece
!jupyter nbextension enable --py widgetsnbextension

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
%%capture
# Import the libraries used in the notebook
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import torch
import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader
from pytorch_lightning.loggers import WandbLogger
#from nlp import load_metric
import datasets
from datasets import list_datasets, load_dataset, load_metric
from pprint import pprint
import wandb

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)

In [None]:
## Log in to wandb to authorize the API key
!wandb login

In [None]:
# Create the wandb logger used to monitor the model
wandb_logger = WandbLogger(project='resumen-textos-transformers')

## **Data loading**<a name="id3"></a>

- <p align = "justify">The CORD19 dataset, maintained by the Allen Institute, can currently be downloaded free of charge from its official page. However, following the work of Priya Dwivedi, we use the nlp library (imported in this notebook under its current name, datasets), which downloads the dataset already processed and optimized for the use cases we want to cover.</p>

### **Dataset attributes**

In [None]:
# Dataset attributes
cord_dataset = list_datasets(with_details=True)[list_datasets().index('cord19')]
pprint(cord_dataset.__dict__)

### **Downloading the dataset**

In [None]:
# Download the dataset
dataset = load_dataset('cord19', "fulltext", data_dir='data/')

In [None]:
# Dataset information
pprint(dataset)

In [None]:
# Check the keys of the dataset and of a single example
dataset.keys(), dataset.get("train")[0].keys()

### **Example of a paper's features**

In [None]:
# Example of a paper title
pprint(dataset.get("train")[100].get('title'))
print("\n the len of the title is:", len(dataset.get("train")[100].get('title').split(" ")), 'words')

In [None]:
# Example of a full text
pprint(dataset.get("train")[100].get('fulltext'))
print("\n the len of the paper is:", len(dataset.get("train")[100].get('fulltext').split(" ")), 'words')

In [None]:
# Example of an abstract
pprint(dataset.get("train")[100].get('abstract'))
print("\n the len of the abstract is:", len(dataset.get("train")[100].get('abstract').split(" ")), 'words')

## **Understanding the dataset**<a name="id4"></a>

### **Estimating the average number of words in the papers**

In [None]:
muestra_dataset = dataset.get('train').select(list(range(0, 1000)))
texto_len = []
summary_len=[]

In [None]:
for i in range(len(muestra_dataset)):
    ejemplo = muestra_dataset[i]
    # Replace newlines with spaces so words on adjacent lines are not merged
    texto_ejemplo = ejemplo.get('fulltext')
    texto_ejemplo = texto_ejemplo.replace('\n', ' ')
    texto_words = texto_ejemplo.split()
    texto_len.append(len(texto_words))
    summary_ejemplo = ejemplo['abstract']
    summary_ejemplo = summary_ejemplo.replace('\n', ' ')
    summary_words = summary_ejemplo.split()
    summary_len.append(len(summary_words))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(texto_len, bins=10)
plt.title('Distribution of full-text word counts - first 1000 examples')
plt.show()

In [None]:
plt.hist(summary_len)
plt.title('Distribution of abstract word counts - first 1000 examples')
plt.show()

In [None]:
print("Average words per full text: ", sum(texto_len)/len(texto_len))

In [None]:
print("Average words per abstract: ", sum(summary_len)/len(summary_len))
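
A mean word count can be pulled upward by a few very long papers, so the median and an upper percentile may also be worth checking before choosing truncation lengths. The following is a minimal sketch using only the standard library. `summarize_lengths` is a hypothetical helper and `word_counts` is made-up toy data standing in for `texto_len` or `summary_len`; neither is part of the original notebook.

```python
import statistics

def summarize_lengths(lengths):
    """Return the mean, median and 90th percentile of a list of word counts."""
    # quantiles(n=10) returns the 9 cut points between deciles;
    # the last cut point is the 90th percentile.
    deciles = statistics.quantiles(lengths, n=10, method="inclusive")
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p90": deciles[-1],
    }

# Hypothetical toy data: four short documents and one very long one
word_counts = [120, 150, 180, 200, 4000]
stats = summarize_lengths(word_counts)
print(stats)
```

In the notebook, the same helper could be called as `summarize_lengths(texto_len)` once the lists above have been filled.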

## **Modeling**<a name="id5"></a>

In [None]:
# Special characters to remove from the text
CHARS = [
    "¦",
    "§",
    "¨",
    "©",
    "ª",
    "«",
    "®",
    "¯",
    "°",
    "±",
    "²",
    "³",
    "´",
    "µ",
    "¶",
    "·",
    "º",
    "»",
    "¼",
    "½",
    "¿",
    "×",
    "Ø",
    "÷",
    "ø",
    "Ɵ",
    "Ƶ",
    "ǁ",
    "ǆ",
    "Ǉ",
    "ǌ",
    "ʹ",
    "ʼ",
    "ˆ",
    "ˇ",
    "À",
    "Á",
    "Â",
    "Ã",
    "Ä",
    "Å",
    "Ç",
    "È",
    "É",
    "Ê",
    "Í",
    "Ð",
    "Ñ",
    "Ò",
    "Ó",
    "Ô",
    "Õ",
    "Ö",
    "Ú",
    "Û",
    "Ü",
    "Þ",
    "ß",
    "à",
    "á",
    "â",
    "ã",
    "ä",
    "å",
    "ç",
    "è",
    "é",
    "ê",
    "ë",
    "ì",
    "í",
    "î",
    "ï",
    "ð",
    "ñ",
    "ò",
    "ó",
    "ô",
    "õ",
    "ö",
    "ù",
    "ú",
    "û",
    "ü",
    "ý",
    "þ",
    "ÿ",
    "ā",
    "Ă",
    "ą",
    "Ć",
    "ć",
    "Č",
    "č",
    "ď",
    "Đ",
    "ē",
    "ę",
    "Ě",
    "ě",
    "Ğ",
    "Ĩ",
    "Į",
    "ı",
    "ĸ",
    "Ĺ",
    "ł",
    "ń",
    "Ň",
    "Ō",
    "ō",
    "Ő",
    "ő",
    "Ś",
    "ś",
    "ŝ",
    "ş",
    "Š",
    "š",
    "Ŭ",
    "ů",
    "ŵ",
    "Ŷ",
    "ź",
    "ż",
    "Ž",
    "ž",
    "Ɖ",
    "Ƌ",
    "ƌ",
    "Ɛ",
    "ƚ",
    "ǎ",
    "ǐ",
    "ǒ",
    "ǔ",
    "ǡ",
    "ș",
    "ɑ",
    "ɛ",
    "ɣ",
    "ʋ",
    "˘",
    "˚",
    "˛",
    "˝",
    "́",
    "̇",
    "͕",
    "͖",
    "͗",
    "͘",
    "ͬ",
    "Ͳ",
    "а",
    "б",
    "в",
    "г",
    "д",
    "е",
    "ж",
    "з",
    "и",
    "й",
    "к",
    "л",
    "м",
    "н",
    "о",
    "п",
    "р",
    "с",
    "т",
    "у",
    "ф",
    "х",
    "ц",
    "ч",
    "ш",
    "щ",
    "ы",
    "ь",
    "э",
    "ю",
    "я",
    "ӧ",
    "Յ",
    "Ն",
    "؉",
    "؊",
    "؋",
    "،",
    "؍",
    "؎",
    "ء",
    "آ",
    "أ",
    "ؤ",
    "إ",
    "ئ",
    "ا",
    "ب",
    "ة",
    "ت",
    "ث",
    "ج",
    "ح",
    "خ",
    "د",
    "ذ",
    "ر",
    "ز",
    "س",
    "ش",
    "ص",
    "ض",
    "ط",
    "ظ",
    "ع",
    "غ",
    "ف",
    "ق",
    "ك",
    "ل",
    "م",
    "ن",
    "ه",
    "و",
    "ى",
    "ي",
    "ً",
    "ٌ",
    "ٍ",
    "َ",
    "ُ",
    "ِ",
    "ّ",
    "ْ",
    "ܰ",
    "ܴ",
    "݅",
    "݇",
    "ݏ",
    "ݑ",
    "ݕ",
    "ߚ",
    "ߜ",
    "ߤ",
    "ߪ",
    "ଝ",
    "ଵ",
    "ଶ",
    "᭧",
    "ᮊ",
    "ᵒ",
    "Ḡ",
    "ỹ",
    "‖",
    "‚",
    "†",
    "‡",
    "•",
    "…",
    "‰",
    "′",
    "″",
    "⁄",
    "⁎",
    "⁶",
    "⁹",
    "₀",
    "€",
    "℃",
    "ℜ",
    "™",
    "Ω",
    "Ⅰ",
    "Ⅱ",
    "Ⅲ",
    "→",
    "↓",
    "↵",
    "⇑",
    "⌬",
    "⌿",
    "⍀",
    "␣",
    "␤",
    "␥",
    "␦",
    "■",
    "▪",
    "▶",
    "▸",
    "►",
    "○",
    "◗",
    "★",
    "☆",
    "✔",
    "✜",
    "✩",
    "➜",
    "⩾",
    "、",
    "・",
    "Ϳ",
    "΄",
    "·",
    "Ί",
    "Α",
    "Γ",
    "Ε",
    "Θ",
    "Ι",
    "Λ",
    "Μ",
    "ϩ",
    "Ϫ",
    "ϫ",
    "Ϭ",
    "ϭ",
    "Ϯ",
    "ϯ",
    "ϰ",
    "ϱ",
    "ϲ",
    "ϳ",
    "ϵ",
    "Ϸ",
    "Ͻ",
    "Ͼ",
    "Ј",
    "Љ",
    "Њ",
    "А",
    "Б",
    "В",
    "Д",
    "И",
    "К",
    "Н",
    "О",
    "Р",
    "С",
    "Т",
    "У",
    "Ф",
    "Х",
    "Ц",
    "Ч",
    "Ш",
    "中",
    "乌",
    "亏",
    "代",
    "何",
    "充",
    "冒",
    "吃",
    "國",
    "型",
    "子",
    "學",
    "寄",
    "寒",
    "山",
    "感",
    "扬",
    "方",
    "明",
    "是",
    "暑",
    "替",
    "板",
    "根",
    "桑",
    "民",
    "決",
    "熱",
    "狗",
    "理",
    "生",
    "福",
    "脊",
    "膽",
    "與",
    "良",
    "芳",
    "藍",
    "藥",
    "處",
    "補",
    "論",
    "醫",
    "钟",
    "間",
    "風",
    "首",
    "龍",
    "가",
    "각",
    "간",
    "감",
    "갑",
    "강",
    "같",
    "개",
    "객",
    "거",
    "걱",
    "건",
    "걸",
    "검",
    "것",
    "게",
    "겨",
    "격",
    "겪",
    "결",
    "겼",
    "경",
    "계",
    "고",
    "공",
    "과",
    "관",
    "교",
    "구",
    "국",
    "군",
    "그",
    "근",
    "글",
    "급",
    "기",
    "긴",
    "길",
    "까",
    "꺼",
    "꼈",
    "나",
    "낙",
    "난",
    "남",
    "났",
    "내",
    "넷",
    "년",
    "노",
    "높",
    "누",
    "느",
    "는",
    "능",
    "니",
    "다",
    "단",
    "달",
    "당",
    "대",
    "던",
    "도",
    "동",
    "되",
    "된",
    "두",
    "드",
    "든",
    "들",
    "등",
    "따",
    "때",
    "또",
    "라",
    "람",
    "램",
    "략",
    "량",
    "러",
    "렇",
    "레",
    "려",
    "력",
    "련",
    "령",
    "로",
    "록",
    "론",
    "롯",
    "료",
    "루",
    "률",
    "르",
    "른",
    "를",
    "리",
    "립",
    "마",
    "만",
    "말",
    "망",
    "매",
    "머",
    "멀",
    "메",
    "며",
    "면",
    "명",
    "모",
    "목",
    "못",
    "무",
    "문",
    "물",
    "미",
    "밀",
    "및",
    "바",
    "반",
    "받",
    "발",
    "방",
    "배",
    "백",
    "번",
    "범",
    "법",
    "별",
    "병",
    "보",
    "복",
    "본",
    "부",
    "분",
    "불",
    "비",
    "빈",
    "사",
    "산",
    "상",
    "생",
    "서",
    "석",
    "선",
    "설",
    "성",
    "세",
    "소",
    "속",
    "손",
    "쇄",
    "수",
    "순",
    "술",
    "슈",
    "스",
    "시",
    "식",
    "신",
    "실",
    "심",
    "써",
    "아",
    "악",
    "안",
    "않",
    "알",
    "았",
    "애",
    "야",
    "약",
    "양",
    "어",
    "언",
    "얼",
    "없",
    "었",
    "에",
    "여",
    "역",
    "연",
    "염",
    "였",
    "영",
    "예",
    "와",
    "왔",
    "외",
    "요",
    "욕",
    "용",
    "우",
    "운",
    "울",
    "움",
    "원",
    "월",
    "웠",
    "위",
    "유",
    "육",
    "율",
    "으",
    "은",
    "을",
    "음",
    "응",
    "의",
    "이",
    "인",
    "일",
    "임",
    "입",
    "있",
    "자",
    "작",
    "잘",
    "잠",
    "장",
    "재",
    "저",
    "적",
    "전",
    "절",
    "점",
    "접",
    "정",
    "제",
    "조",
    "족",
    "존",
    "종",
    "주",
    "준",
    "줄",
    "중",
    "증",
    "지",
    "직",
    "진",
    "질",
    "징",
    "차",
    "착",
    "찰",
    "참",
    "처",
    "척",
    "철",
    "첫",
    "청",
    "체",
    "쳐",
    "촉",
    "총",
    "최",
    "추",
    "축",
    "출",
    "충",
    "취",
    "측",
    "치",
    "칠",
    "코",
    "콩",
    "크",
    "타",
    "태",
    "택",
    "터",
    "토",
    "통",
    "트",
    "특",
    "파",
    "판",
    "퍼",
    "편",
    "평",
    "폐",
    "포",
    "폭",
    "푛",
    "표",
    "품",
    "프",
    "피",
    "하",
    "학",
    "한",
    "할",
    "함",
    "항",
    "해",
    "핵",
    "했",
    "행",
    "향",
    "헌",
    "험",
    "혀",
    "현",
    "형",
    "호",
    "혹",
    "홍",
    "화",
    "확",
    "환",
    "활",
    "황",
    "회",
    "효",
    "후",
    "휴",
    "흡",
    "\u202a",
    "\u202b",
    "\u202c",
    "\ue024",
    "\ue02c",
    "\ue02e",
    "\ue031",
    "\ue032",
    "\ue033",
    "\ue035",
    "\ue061",
    "\ue062",
    "\ue06d",
    "\ue152",
    "\uf020",
    "\uf02b",
    "\uf02d",
    "\uf02f",
    "\uf03d",
    "\uf044",
    "\uf046",
    "\uf05b",
    "\uf05d",
    "\uf061",
    "\uf062",
    "\uf063",
    "\uf065",
    "\uf067",
    "\uf06b",
    "\uf06c",
    "\uf06d",
    "\uf09f",
    "\uf0a2",
    "\uf0a3",
    "\uf0a7",
    "\uf0ae",
    "\uf0b0",
    "\uf0b4",
    "\uf0b7",
    "\uf0bb",
    "\uf0d7",
    "\uf0e0",
    "\uf6d9",
    "\uf761",
    "\uf762",
    "\uf764",
    "\uf765",
    "\uf766",
    "\uf767",
    "\uf768",
    "\uf769",
    "\uf76b",
    "\uf76c",
    "\uf76e",
    "\uf76f",
    "\uf770",
    "\uf772",
    "\uf773",
    "\uf774",
    "\uf775",
    "\uf776",
    "\uf777",
    "\uf778",
    "\uf779",
    "\uf77a",
    "�",
]
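
The list above is turned into a deletion table for `str.translate` inside `clean_text` later in the notebook. As a minimal sketch of that mechanism, the example below uses a hypothetical three-character subset (`SAMPLE_CHARS`) and a hypothetical helper name (`strip_chars`); it maps each code point to `None`, which is the documented way to delete a character with `str.translate`.

```python
# Hypothetical small subset of the CHARS list, for illustration only
SAMPLE_CHARS = ["©", "•", "→"]

# Mapping a code point to None tells str.translate to delete that character.
# Each entry must be a single code point, because ord() rejects longer strings.
DELETE_TABLE = {ord(ch): None for ch in SAMPLE_CHARS}

def strip_chars(text):
    """Remove every character listed in SAMPLE_CHARS from text."""
    return text.translate(DELETE_TABLE)

print(strip_chars("© 2020 • virus → host"))
```

Building the table once, rather than on every call, avoids re-creating a dictionary of several hundred entries for each document.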

### **Creating the Cord19 class to load the data**

In [None]:
class Cord19(Dataset):
    def __init__(
        self,
        tokenizer,
        type_path,
        num_samples,
        input_length,
        output_length,
        print_text=False,
    ):
        self.dataset = load_dataset(
            "cord19", "fulltext", data_dir="data/", split=type_path
        )
        if num_samples:
            # split=type_path already returns a single split, so select directly on it
            self.dataset = self.dataset.select(list(range(0, num_samples)))
        self.input_length = input_length
        self.tokenizer = tokenizer
        self.output_length = output_length
        self.print_text = print_text

    def __len__(self):
        return self.dataset.shape[0]

    def clean_text(self, text):
        text = text.translate({ord(x): "" for x in CHARS})
        text = text.replace("\n", " ")  # use a space so words across line breaks are not merged
        text = text.replace("``", "")
        text = text.replace('"', "")

        return text

    def convert_to_features(self, example_batch):
        # Clean and tokenize the full text (input) and the abstract (target)

        input_ = self.clean_text(example_batch["fulltext"])
        target_ = self.clean_text(example_batch["abstract"])

        if self.print_text:
            print("Input Text: ", input_)

        source = self.tokenizer.batch_encode_plus(
            [input_],
            max_length=self.input_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

        targets = self.tokenizer.batch_encode_plus(
            [target_],
            max_length=self.output_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

        return source, targets

    def __getitem__(self, index):
        source, targets = self.convert_to_features(self.dataset[index])

        source_ids = source["input_ids"].squeeze()
        target_ids = targets["input_ids"].squeeze()

        src_mask = source["attention_mask"].squeeze()
        target_mask = targets["attention_mask"].squeeze()

        return {
            "source_ids": source_ids,
            "source_mask": src_mask,
            "target_ids": target_ids,
            "target_mask": target_mask,
        }
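
Calling the tokenizer with `padding="max_length"` and `truncation=True` yields sequences of a fixed length, together with an attention mask that marks which positions hold real tokens. The pure-Python sketch below illustrates the idea only. It is not the tokenizer's implementation and it ignores special tokens such as end-of-sequence; `pad_or_truncate` and the pad id `0` are assumptions made for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: keep only the first max_length tokens
    kept = token_ids[:max_length]
    # Padding: fill the remaining positions with pad_id
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The mask is 1 for real tokens and 0 for padding
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A sequence shorter than max_length is padded; a longer one is cut
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the tensors returned by `__getitem__` can be stacked into batches by a `DataLoader` without a custom collate function.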

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
dataset = Cord19(tokenizer, 'train', None, 512, 150, True)
len(dataset)

In [None]:
tokenizer.batch_encode_plus([])

In [None]:
data = dataset[50]
print()
print("Shape of Tokenized Text: ", data['source_ids'].shape)
print()
print("Sanity check - Decode Text: ", tokenizer.decode(data['source_ids']))
print("====================================")
print("Sanity check - Decode Summary: ", tokenizer.decode(data['target_ids']))