# Data processing
This notebook gathers all the transformations applied to the data that feed the remaining notebooks, whether for exploring the data or for testing different techniques: Levenshtein distance, ML, DL...

## Data loading

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


Import the required libraries.

In [None]:
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import json

In [None]:
df_authors = pd.read_csv('/content/drive/MyDrive/TFM/notebooks/entrega/characteristics_modificados_concatenados.csv', index_col = 0)
df_authors

Unnamed: 0.1,Unnamed: 0,id,characteristics,domain,source,name
0,0,31.0,"Bowness, James Simeon Assistive artificial int...",TECNOLOGIA,WOS,"Bowness, James Simeon"
1,1,32.0,"Burckett-St Laurent, D Assistive artificial in...",TECNOLOGIA,WOS,"Burckett-St Laurent, D"
2,2,,"Hernandez, Nadia Assistive artificial intellig...",TECNOLOGIA,WOS,"Hernandez, Nadia"
3,3,,"Keane, Pearse A Assistive artificial intellige...",TECNOLOGIA,WOS,"Keane, Pearse A"
4,4,,"Lobo, Clara Assistive artificial intelligence ...",TECNOLOGIA,WOS,"Lobo, Clara"
...,...,...,...,...,...,...
390,390,20.0,"Yuan'An Liu DNN Deployment, Task Offloading, a...",TECNOLOGIA,IEEE,Yuan'An Liu
391,391,24.0,Z. Ye Deep Negative Correlation Multisource Do...,TECNOLOGIA,IEEE,Z. Ye
392,392,25.0,Jianbo Yu Deep Negative Correlation Multisourc...,TECNOLOGIA,IEEE,Jianbo Yu
393,393,24.0,Zhuang Ye Multiscale Weighted Morphological Ne...,TECNOLOGIA,IEEE,Zhuang Ye
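
The `Unnamed: 0` column in the output above usually appears when a CSV was saved with its index and later read back without `index_col`. The following is a minimal sketch of that effect, assuming a small hypothetical frame and an in-memory buffer in place of the Drive file; it is not part of the original pipeline.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the authors table
frame = pd.DataFrame({
    "id": [31.0, 32.0],
    "name": ["Bowness, James Simeon", "Burckett-St Laurent, D"],
})

# Writing with the index and reading without index_col leaves an 'Unnamed: 0' column
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# Writing with index=False keeps only the real columns
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Saving intermediate files with `index=False` is one way to keep such leftover columns out of later notebooks.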


## Tokenize

In [None]:
!pip install nltk



In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download the stopwords and punkt packages
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
docs_tokenized = []
ids = []
for index, row in df_authors.iterrows() :
  word_tokens = word_tokenize(row['characteristics'])
  # Lower-case each token only for the stop-word lookup; the tokens that are kept preserve their original case
  filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
  docs_tokenized.append(filtered_sentence)
  ids.append(row['id'])

authors_tokenized = [" ".join(item) for item in docs_tokenized]
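
The filtering step can be illustrated on a single sentence. This is a minimal sketch, assuming a hand-written stop-word set (`STOP_WORDS`, hypothetical) and a simple regex tokenizer as stand-ins for NLTK's English list and `word_tokenize`; it does not reproduce NLTK's tokenization rules, only the case-insensitive lookup.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list instead
STOP_WORDS = {"the", "is", "to", "and", "of", "in", "a"}


def remove_stop_words(text: str) -> str:
    """Split text into word and punctuation tokens and drop stop words, ignoring case."""
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark on its own
    tokens = re.findall(r"\w+|[^\w\s]", text)
    kept = [token for token in tokens if token.lower() not in STOP_WORDS]
    return " ".join(kept)


sample = "The device is used to guide needle insertion."
print(remove_stop_words(sample))  # device used guide needle insertion .
```

As in the notebook's output, punctuation survives as separate tokens, which is why the joined text contains spaces before commas and full stops.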

In [None]:
authors_tokenized[0]

'Bowness , James Simeon Assistive artificial intelligence ultrasound image interpretation regional anaesthesia : external validation study . BACKGROUND : Ultrasonound used identify anatomical structures regional anaesthesia guide needle insertion injection local anaesthetic . ScanNav Anatomy Peripheral Nerve Block ( Intelligent Ultrasound , Cardiff , UK ) artificial intelligence-based device produces colour overlay real-time B-mode ultrasound highlight anatomical structures interest . evaluated accuracy artificial-intelligence colour overlay perceived influence risk adverse events block failure.METHODS : Ultrasound-guided regional anaesthesia experts acquired 720 videos 40 volunteers ( across nine anatomical regions ) without using device . artificial-intelligence colour overlay subsequently applied . Three experts independently reviewed video ( original unmodified video ) assess accuracy colour overlay relation key anatomical structures ( true positive/negative false positive/negative

In [None]:
dataset_tokenized = pd.DataFrame({'id': ids, 'characteristics': authors_tokenized, 'name': df_authors['name']})
dataset_tokenized.head()

Unnamed: 0,id,characteristics,name
0,31.0,"Bowness , James Simeon Assistive artificial in...","Bowness, James Simeon"
1,32.0,"Burckett-St Laurent , Assistive artificial int...","Burckett-St Laurent, D"
2,,"Hernandez , Nadia Assistive artificial intelli...","Hernandez, Nadia"
3,,"Keane , Pearse Assistive artificial intelligen...","Keane, Pearse A"
4,,"Lobo , Clara Assistive artificial intelligence...","Lobo, Clara"


## Split into sets
Train and test (70% / 30%)

In [None]:
train_set = dataset_tokenized.sample(frac = 0.7)
train_set.to_csv('/content/drive/MyDrive/TFM/notebooks/entrega/train_set.csv')
print('Train')
print(len(train_set))
print(train_set.head())
test_set = dataset_tokenized.drop(train_set.index)
test_set.to_csv('/content/drive/MyDrive/TFM/notebooks/entrega/test_set.csv')
print('Test')
print(len(test_set))
print(test_set.head())

Train
276
       id                                    characteristics  \
360  14.0  B. Hu Adaptive Hierarchical Energy Management ...   
43   35.0  Ko , Cerebrospinal fluid mutant huntingtin bio...   
357   NaN  M. Molan Anomaly Detection Anticipation High P...   
190  66.0  Tabrizi , Sarah J. Mutant huntingtin neurofila...   
332   NaN  Z. Wang Image Reconstruction Based Multilevel ...   

                  name  
360              B. Hu  
43               Ko, S  
357           M. Molan  
190  Tabrizi, Sarah J.  
332            Z. Wang  
Test
119
      id                                    characteristics               name
5   27.0  Margetts , Assistive artificial intelligence u...        Margetts, S
7    NaN  Pawa , Amit Assistive artificial intelligence ...         Pawa, Amit
8    NaN  Rosenblatt , Meg Assistive artificial intellig...    Rosenblatt, Meg
18   NaN  Harris , Catherine Evaluation impact assistive...  Harris, Catherine
20   NaN  Morecroft , Megan Evaluation impact assis
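
Because `sample` is called without a seed, each run of the cell above produces a different split. The following is a minimal sketch of a repeatable variant, assuming a hypothetical helper `split_train_test` and an arbitrary seed value; it relies on the frame having a unique index, as `drop(train.index)` does in the original cell.

```python
import pandas as pd


def split_train_test(frame: pd.DataFrame, train_frac: float = 0.7, seed: int = 0):
    """Return (train, test): train is a seeded random sample, test is every remaining row."""
    train = frame.sample(frac=train_frac, random_state=seed)
    # Rows not drawn for training form the test set (requires a unique index)
    test = frame.drop(train.index)
    return train, test


# Hypothetical ten-row frame standing in for dataset_tokenized
toy = pd.DataFrame({"id": range(10), "name": [f"author_{i}" for i in range(10)]})
train, test = split_train_test(toy)
print(len(train), len(test))
```

Fixing the seed makes the saved `train_set.csv` and `test_set.csv` reproducible across sessions.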

## Pair records within each set

### Train

In [None]:
pair_columns = ["author", "author_name", "candidate", "candidate_name", "label"]
train_pairs = []
for index_author, author in train_set.iterrows():
  for index_candidate, candidate in train_set.iterrows():
    if index_author != index_candidate:
      # Label 1 when both records share an id; NaN ids never compare equal, so they get 0
      label = 1 if author['id'] == candidate['id'] else 0
      train_pairs.append([author['characteristics'], author['name'], candidate['characteristics'], candidate['name'], label])

# Build the frame once instead of calling pd.concat per pair; reversed to keep the newest-first row order
df_train_0 = pd.DataFrame(train_pairs[::-1], columns = pair_columns)

df_train_0.head()

Unnamed: 0,author,author_name,candidate,candidate_name,label
0,J. Liu Joint Task Offloading Resource Allocati...,J. Liu,"Robbins , TW Biological clinical characteristi...","Robbins, TW",0
1,J. Liu Joint Task Offloading Resource Allocati...,J. Liu,"Rodrigues , Filipe B. Brain-derived neurotroph...","Rodrigues, Filipe B.",0
2,J. Liu Joint Task Offloading Resource Allocati...,J. Liu,M. Sun Trustworthy Localization EM-Based Feder...,M. Sun,0
3,J. Liu Joint Task Offloading Resource Allocati...,J. Liu,"Lowe , J Biological clinical characteristics g...","Lowe, J",0
4,J. Liu Joint Task Offloading Resource Allocati...,J. Liu,"Zetterberg , H Biological clinical characteris...","Zetterberg, H",0


In [None]:
len(df_train_0)

75900
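
The nested loops visit every ordered pair of distinct rows, which is why 276 training rows yield 276 × 275 = 75,900 pairs. The same table can be sketched without explicit loops using a cross join. This is an illustrative alternative, not the notebook's method: the helper `build_pairs` and the toy frame are hypothetical, `how="cross"` needs pandas 1.2 or later, and the row order differs from the loop version.

```python
import numpy as np
import pandas as pd


def build_pairs(records: pd.DataFrame) -> pd.DataFrame:
    """Return every ordered pair of distinct rows, labelled 1 when the ids match."""
    # Expose the row label as a column so a record is never paired with itself
    indexed = records.rename_axis("row").reset_index()
    pairs = indexed.merge(indexed, how="cross", suffixes=("_author", "_candidate"))
    pairs = pairs[pairs["row_author"] != pairs["row_candidate"]]
    # NaN == NaN is False, so records without an id are always labelled 0
    label = (pairs["id_author"] == pairs["id_candidate"]).astype(int)
    return pd.DataFrame({
        "author": pairs["characteristics_author"],
        "author_name": pairs["name_author"],
        "candidate": pairs["characteristics_candidate"],
        "candidate_name": pairs["name_candidate"],
        "label": label,
    }).reset_index(drop=True)


# Hypothetical toy data: two records share id 1.0, two have no id
toy = pd.DataFrame({
    "id": [1.0, 1.0, np.nan, np.nan],
    "characteristics": ["a", "b", "c", "d"],
    "name": ["A", "B", "C", "D"],
})
toy_pairs = build_pairs(toy)
print(len(toy_pairs))            # 4 * 3 = 12 ordered pairs
print(toy_pairs["label"].sum())  # 2: the pair with id 1.0, in both directions
```

Note that each matching pair appears twice, once in each direction, so the positive count is always even.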

### Test

In [None]:
pair_columns = ["author", "author_name", "candidate", "candidate_name", "label"]
test_pairs = []
for index_author, author in test_set.iterrows():
  for index_candidate, candidate in test_set.iterrows():
    if index_author != index_candidate:
      # Label 1 when both records share an id; NaN ids never compare equal, so they get 0
      label = 1 if author['id'] == candidate['id'] else 0
      test_pairs.append([author['characteristics'], author['name'], candidate['characteristics'], candidate['name'], label])

# Build the frame once instead of calling pd.concat per pair; reversed to keep the newest-first row order
df_test_0 = pd.DataFrame(test_pairs[::-1], columns = pair_columns)

df_test_0.head()

Unnamed: 0,author,author_name,candidate,candidate_name,label
0,Zhuang Ye Multiscale Weighted Morphological Ne...,Zhuang Ye,"Fan Wu DNN Deployment , Task Offloading , Reso...",Fan Wu,0
1,Zhuang Ye Multiscale Weighted Morphological Ne...,Zhuang Ye,"Z. Chen DNN Deployment , Task Offloading , Res...",Z. Chen,0
2,Zhuang Ye Multiscale Weighted Morphological Ne...,Zhuang Ye,"Wenhao Fan DNN Deployment , Task Offloading , ...",Wenhao Fan,0
3,Zhuang Ye Multiscale Weighted Morphological Ne...,Zhuang Ye,Y. Su Joint Task Offloading Resource Allocatio...,Y. Su,0
4,Zhuang Ye Multiscale Weighted Morphological Ne...,Zhuang Ye,"Guo , W Deep-Learning-Based Surrogate Model Th...","Guo, W",0


In [None]:
len(df_test_0)

14042

## Clean each set
Overall there were about 800 rows labelled 1 and about 75,000 labelled 0.
The classes are heavily imbalanced, so we keep every row labelled 1 and only a random sample of similar size from the rows labelled 0.

Let's look at each set in turn.

### Train
There are 174 rows labelled 1 and 75,726 labelled 0. We keep every row labelled 1 and a random sample of 200 rows labelled 0.

In [None]:
print(len(df_train_0))
print(len(df_train_0.loc[df_train_0['label'] == 0]))
print(len(df_train_0.loc[df_train_0['label'] == 1]))

75900
75726
174


In [None]:
df_train_0_positives = df_train_0.loc[df_train_0['label'] == 1]
df_train_0_negatives = df_train_0.loc[df_train_0['label'] == 0]
df_train_0_clean = pd.concat([df_train_0_positives, df_train_0_negatives.sample(n = 200)])
df_train_0_clean = df_train_0_clean.sample(frac = 1)

In [None]:
print(len(df_train_0_clean))
print(len(df_train_0_clean.loc[df_train_0_clean['label'] == 0]))
print(len(df_train_0_clean.loc[df_train_0_clean['label'] == 1]))

374
200
174


In [None]:
df_train_0_clean.to_csv('/content/drive/MyDrive/TFM/notebooks/entrega/train_limpio.csv')
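
The cells above draw a fixed 200 negatives, so the result is close to balanced rather than exactly balanced. The following is a minimal sketch of a strictly balanced, seeded variant, assuming a hypothetical helper `balance_pairs` and a toy frame that only carries the `label` column.

```python
import pandas as pd


def balance_pairs(pairs: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Keep every positive pair and an equally sized random sample of negatives."""
    positives = pairs[pairs["label"] == 1]
    negatives = pairs[pairs["label"] == 0]
    # Never ask for more negatives than exist
    n_negatives = min(len(positives), len(negatives))
    sampled = negatives.sample(n=n_negatives, random_state=seed)
    # Shuffle so the two classes are interleaved
    return pd.concat([positives, sampled]).sample(frac=1, random_state=seed)


# Hypothetical toy data: 3 positive pairs and 20 negative pairs
toy = pd.DataFrame({"label": [1] * 3 + [0] * 20})
balanced = balance_pairs(toy)
print(len(balanced), int(balanced["label"].sum()))
```

The same helper could be applied to the test pairs; whether an exact 1:1 ratio is preferable to the fixed sample sizes used here depends on the downstream model.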

### Test
There are 52 rows labelled 1 and 13,990 labelled 0. We keep every row labelled 1 and a random sample of 100 rows labelled 0.

In [None]:
print(len(df_test_0))
print(len(df_test_0.loc[df_test_0['label'] == 0]))
print(len(df_test_0.loc[df_test_0['label'] == 1]))

14042
13990
52


In [None]:
df_test_0_positives = df_test_0.loc[df_test_0['label'] == 1]
df_test_0_negatives = df_test_0.loc[df_test_0['label'] == 0]
df_test_0_clean = pd.concat([df_test_0_positives, df_test_0_negatives.sample(n = 100)])
df_test_0_clean = df_test_0_clean.sample(frac = 1)

In [None]:
df_test_0_clean.head()

Unnamed: 0,author,author_name,candidate,candidate_name,label
6676,"Wild , Edward J Mutant huntingtin neurofilamen...","Wild, Edward J","Wild , E J Longitudinal evaluation proton magn...","Wild, E J",1
8383,"Byrne , LM Longitudinal evaluation proton magn...","Byrne, LM","Guo , W Deep-Learning-Based Surrogate Model Th...","Guo, W",0
12080,"Gordon , B Cerebrospinal fluid mutant huntingt...","Gordon, B","Wu , H Diagnostic value alpha-fetoprotein , Le...","Wu, H",0
11992,"Banos , Raul Mutant Huntingtin Cleared Brain v...","Banos, Raul","Shapiro , JI Oxidized HDL , Adipokines , Endot...","Shapiro, JI",0
241,"Z. Chen DNN Deployment , Task Offloading , Res...",Z. Chen,H. Chaoui Developing Online Data-Driven State ...,H. Chaoui,0


In [None]:
print(len(df_test_0_clean))
print(len(df_test_0_clean.loc[df_test_0_clean['label'] == 0]))
print(len(df_test_0_clean.loc[df_test_0_clean['label'] == 1]))

152
100
52


In [None]:
df_test_0_clean.to_csv('/content/drive/MyDrive/TFM/notebooks/entrega/test_limpio.csv')