# Recognition of institution names using Artificial Intelligence

## PIBITI Project

### Call for Proposals (Edital) 18/2021-PROPPG-IFG

#### Student (scholarship holder): João Gabriel Grandotto Viana
#### Advisor: Waldeyr Mendes Cordeiro da Silva

## Part 01 - Collection and processing of data from the primary sources

In [30]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
import pandas as pd

In [31]:
url1 = 'https://raw.githubusercontent.com/joaograndotto/PIBITI/main/Datasets/scopus.csv'
url2 = 'https://raw.githubusercontent.com/joaograndotto/PIBITI/main/Datasets/webscience.csv'

scopus = pd.read_csv(url1)
web_of_science = pd.read_csv(url2, sep="\t")

In [32]:
web_of_science.shape

(31, 67)

In [33]:
scopus.shape

(1276, 50)

In [34]:
# Rename the Web of Science columns so they match the corresponding Scopus column names
web_of_science.rename(
    columns={
        'Publication Year': 'Year',
        'Article Title': 'Title',
        'Publisher': 'Affiliations',
    },
    inplace=True,
)

In [35]:
scopus["Affiliations"].head(10)

0    Instituto Federal de Goiás-IFG, Campus Goiânia...
1    Instituto Federal de Goiás (IFG), Aparecida de...
2    Instituto Federal de Educação, Ciência e Tecno...
3    Universidade Federal de Goiás – UFG, Rede Pró ...
4    Department of Agronomy, Universidade Federal R...
5    Institute for Hygiene and Public Health, Medic...
6    Department of Environmental Informatics, Helmh...
7    Grupo de Estudos em Geomática (GEO), Instituto...
8    Laboratory of Environmental Biotechnology and ...
9    Universidade Federal de Goiás (UFG), Instituto...
Name: Affiliations, dtype: object

In [36]:
web_of_science["Affiliations"].head(10)

0    INST FED EDUCATION, SCIENCE & TECHNOLOGY OF GO...
1    INST FED EDUCATION, SCIENCE & TECHNOLOGY OF GO...
2                        UNIV DO VALE DO RIO DOS SINOS
3    INST FED EDUCATION, SCIENCE & TECHNOLOGY OF GO...
4                                             ELSEVIER
5                          UNIV FEDERAL CAMPINA GRANDE
6    INST FED EDUCATION, SCIENCE & TECHNOLOGY OF GO...
7                PONTIFICIA UNIV CATOLICA PARANA-PUCPR
8                             UNIV FEDERAL SANTA MARIA
9                                      INST AGRONOMICO
Name: Affiliations, dtype: object

In [37]:
# Concatenate the two DataFrames (Scopus rows first, then Web of Science rows)
result = pd.concat([scopus, web_of_science])

In [38]:
nulos = result.loc[result['DOI'].isnull()]  # only the records without a DOI
nulos.index[0]
# Drop the first record without a DOI, selected by its index label
result = result.drop([nulos.index[0]])

In [39]:
total_artigos = result.shape[0]

In [40]:
result.columns

Index(['Authors', 'Author(s) ID', 'Title', 'Year', 'Source title', 'Volume',
       'Issue', 'Art. No.', 'Page start', 'Page end',
       ...
       'Number of Pages', 'WoS Categories', 'Research Areas', 'IDS Number',
       'UT (Unique WOS ID)', 'Pubmed Id', 'Open Access Designations',
       'Highly Cited Status', 'Hot Paper Status', 'Date of Export'],
      dtype='object', length=105)

In [41]:
# Find duplicates in the DOI column and drop the rows with a repeated DOI, keeping the first occurrence
result = result.drop_duplicates(subset=['DOI'], keep='first')
result.shape

(1162, 105)

In [42]:
duplicados_eliminados = total_artigos - result.shape[0]
duplicados_eliminados

144
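
As a rough illustration of what `drop_duplicates(subset=['DOI'], keep='first')` does, the following pure-Python sketch keeps the first record seen for each DOI. The records and the helper name `drop_duplicate_dois` are made up for illustration; they are not taken from the datasets.

```python
def drop_duplicate_dois(records):
    """Keep only the first record for each DOI (hypothetical helper)."""
    seen = set()
    unique = []
    for record in records:
        doi = record["DOI"]
        if doi in seen:
            continue  # a record with this DOI was already kept
        seen.add(doi)
        unique.append(record)
    return unique


# Toy records (illustrative values, not from Scopus or Web of Science)
records = [
    {"DOI": "10.1000/a", "source": "scopus"},
    {"DOI": "10.1000/b", "source": "scopus"},
    {"DOI": "10.1000/a", "source": "web_of_science"},
]
unique_records = drop_duplicate_dois(records)
removed = len(records) - len(unique_records)
print(len(unique_records), removed)
```

Because the Scopus rows come first in the concatenated table, keeping the first occurrence means that the Scopus version of an article present in both sources is the one retained.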

In [43]:
# Renumber the index from 0 to len(result) - 1 after the rows were dropped
result = result.reset_index(drop=True)

In [44]:
result[['DOI', 'Title', 'Affiliations']]

|  | DOI | Title | Affiliations |
| --- | --- | --- | --- |
| 0 | 10.1016/j.nonrwa.2021.103406 | Classical solution for a nonlinear hybrid syst... | Instituto Federal de Goiás-IFG, Campus Goiânia... |
| 1 | 10.1007/978-3-030-79165-0_25 | An Innovative Textile Product Proposal Based o... | Instituto Federal de Goiás (IFG), Aparecida de... |
| 2 | 10.1590/1519-6984.245368 | Detection of enteroparasites in foliar vegetab... | Instituto Federal de Educação, Ciência e Tecno... |
| 3 | 10.1590/1519-6984.234476 | Phytochemical characterization, and antioxidan... | Universidade Federal de Goiás – UFG, Rede Pró ... |
| 4 | 10.1038/s41598-021-97854-8 | Stability analysis of reference genes for RT-q... | Department of Agronomy, Universidade Federal R... |
| ... | ... | ... | ... |
| 1157 | 10.31977/grirfi.v16i2.774 | HUMAN RIGHTS: FROM THE UNIFORMITY OF THE SPECI... | UNIV FED RECONCAVO BAHIA, CENTRO FORMACAO PROF... |
| 1158 | 10.1590/S0101-31732015000400002 | PRESENTATION OF THE DOSSIER ROUSSEAU | UNESP-MARILIA |
| 1159 | 10.1590/S1415-43662014000400013 | Physiological quality of soybean seeds stored ... | UNIV FEDERAL CAMPINA GRANDE |
| 1160 | 10.1590/S0034-89102010005000053 | Ethics in the publication of studies on human ... | REVISTA DE SAUDE PUBLICA |


In [45]:
result[['DOI','Affiliations']].to_csv("dados_para_label.tsv", sep = "\t", index=False)

## Part 02 - Processing the labeled data

The selected data were labeled manually and saved in the dataset [labeled data](https://raw.githubusercontent.com/joaograndotto/PIBITI/main/Datasets/dados_com_label.csv).

In [46]:
url = 'https://raw.githubusercontent.com/joaograndotto/PIBITI/main/Datasets/dados_com_label.csv'

# Load the manually labeled dataset (comma-separated)
dataset = pd.read_csv(url, sep=",")
dataset

|  | DOI | Title | Year | Affiliations | Campus | Institution |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 10.1016/j.nonrwa.2021.103406 | Classical solution for a nonlinear hybrid syst... | 2022 | Instituto Federal de Goiás-IFG, Campus Goiânia... | Goiânia | Instituto Federal de Goiás |
| 1 | 10.1007/978-3-030-79165-0_25 | An Innovative Textile Product Proposal Based o... | 2022 | Instituto Federal de Goiás (IFG), Aparecida de... | Aparecida de Goiânia | Instituto Federal de Goiás |
| 2 | 10.1590/1519-6984.245368 | Detection of enteroparasites in foliar vegetab... | 2022 | Instituto Federal de Educação, Ciência e Tecno... | Aparecida de Goiânia | Instituto Federal de Goiás |
| 3 | 10.1590/1519-6984.234476 | Phytochemical characterization, and antioxidan... | 2022 | Universidade Federal de Goiás – UFG, Rede Pró ... | Goiânia | Instituto Federal de Goiás |
| 4 | 10.1038/s41598-021-97854-8 | Stability analysis of reference genes for RT-q... | 2021 | Department of Agronomy, Universidade Federal R... | Águas Lindas | Instituto Federal de Goiás |
| ... | ... | ... | ... | ... | ... | ... |
| 1157 | 10.31977/grirfi.v16i2.774 | HUMAN RIGHTS: FROM THE UNIFORMITY OF THE SPECI... | 2017 | UNIV FED RECONCAVO BAHIA, CENTRO FORMACAO PROF... | Bahia | Universidade Federal Reconcavo |
| 1158 | 10.1590/S0101-31732015000400002 | PRESENTATION OF THE DOSSIER ROUSSEAU | 2015 | UNESP-MARILIA | Marilia | UNESP |
| 1159 | 10.1590/S1415-43662014000400013 | Physiological quality of soybean seeds stored ... | 2014 | UNIV FEDERAL CAMPINA GRANDE | Campina Grande | Universidade Federal Campina Grande |
| 1160 | 10.1590/S0034-89102010005000053 | Ethics in the publication of studies on human ... | 2011 | REVISTA DE SAUDE PUBLICA | Brasil | Revista Saude Publica |


### Strategy 01 - Bag of words

Text must be converted into a numeric representation before machine learning algorithms can use it.

The first strategy we use is known as bag of words: each text is split into words (separated by whitespace), every distinct word is assigned an ID, and the frequency of each word in each document is counted.
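
The idea can be sketched in a few lines of plain Python. This is only a minimal illustration: the two documents are toy strings rather than rows from the dataset, and the helper name `to_count_vector` is an assumption introduced here.

```python
from collections import Counter

# Toy documents (illustrative strings, not rows from the dataset)
documents = [
    "instituto federal de goiás",
    "universidade federal de goiás",
]

# Assign an ID to each distinct word, in alphabetical order
vocabulary = {}
for word in sorted({w for doc in documents for w in doc.split()}):
    vocabulary[word] = len(vocabulary)


def to_count_vector(text):
    """Return the word counts of `text`, ordered by vocabulary ID."""
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocabulary]


vectors = [to_count_vector(doc) for doc in documents]
print(vocabulary)
print(vectors)
```

Each document becomes a vector with one position per vocabulary word. scikit-learn's `CountVectorizer` follows the same idea, although by default it also lowercases the text and applies its own tokenization rule instead of a plain whitespace split.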

In [47]:
dataset = pd.read_csv('https://raw.githubusercontent.com/joaograndotto/PIBITI/main/Datasets/dados_com_label.csv')
# Binary target: 1 when the labeled institution is "Instituto Federal de Goiás", 0 otherwise
dataset['target'] = dataset['Institution'].apply(lambda x: 1 if str(x).strip() == "Instituto Federal de Goiás" else 0)
dataset.sample(n=3)

|  | DOI | Title | Year | Affiliations | Campus | Institution | target |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 152 | 10.1016/j.compositesa.2020.105939 | Partially compacted polypropylene glass fiber ... | 2020 | Transfercenter für Kunststofftechnik GmbH, Fra... | Austria | Schachermayerstrasse | 0 |
| 625 | 10.1590/1678-69712016/administracao.v17n4p108-129 | The co-production of innovation: A case study ... | 2016 | Department of Academic Areas I, Instituto Fede... | Goiânia | Instituto Federal de Goiás | 1 |
| 1122 | 10.1002/xrs.733 | A modular system for XRF and XRD applications ... | 2004 | IfG - Inst. F. Gerätebau GmbH, Rudower Chausse... | Berlin | Institut für Geraetebau GmbH | 0 |


In [48]:
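# Features: the manually labeled institution name; target: 1 for IFG, 0 otherwise.
# Note that `target` is derived from this same 'Institution' column.
X, y = dataset['Institution'], dataset['target']

# Alternative 90/10 split that keeps every column of the labeled dataset
train_dataset = dataset.sample(frac = 0.9, random_state = 25)
test_dataset  = dataset.drop(train_dataset.index)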

In [49]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y,  
                                                    random_state=1, 
                                                    test_size=0.2, shuffle=True)
x_train.shape, x_test.shape

((929,), (233,))

In [50]:
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer()

In [51]:
# Learn the vocabulary from the training split only
countVectorizer.fit(x_train.values.astype('U'))

x_train_vec = countVectorizer.transform(x_train.values.astype('U'))
x_test_vec = countVectorizer.transform(x_test.values.astype('U'))

In [52]:
# Inspect the first 10 columns of the sparse matrix
x_train_vec[:, [0,1,2,3,4,5,6,7,8,9]].toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [53]:
x_test_vec[:, [0,1,2,3,4,5,6,7,8,9]].toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

*TF* $\rightarrow$ term frequency $\rightarrow$ count(word) / total number of words in the document

*TF-IDF* $\rightarrow$ term frequency times inverse document frequency $\rightarrow$ reduces the weight of words that occur in many documents
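
The two definitions above can be worked through with a small pure-Python sketch. It uses the textbook form, TF = count / total words and IDF = ln(N / df), on toy documents; the helper names `fit_idf` and `tf_idf` are assumptions made for this illustration. scikit-learn's `TfidfTransformer` applies, by default, a smoothed IDF and normalizes each row, so its values will differ from these.

```python
import math
from collections import Counter

# Toy training documents (illustrative, not from the dataset)
train_docs = [
    "instituto federal de goiás",
    "universidade federal de goiás",
    "universidade federal de santa maria",
]


def fit_idf(docs):
    """Compute IDF(word) = ln(N / df) from the training documents only."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tf_idf(text, idf):
    """Weight each known word of `text` by TF * IDF; unknown words are skipped."""
    words = text.split()
    counts = Counter(words)
    return {w: (c / len(words)) * idf[w] for w, c in counts.items() if w in idf}


idf = fit_idf(train_docs)
weights = tf_idf("instituto federal de goiás", idf)
print(weights)
```

Words that occur in every training document ("federal", "de") receive weight 0, while the rarer "instituto" receives the largest weight. The IDF values are learned once from the training documents and then reused for any new text.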

In [54]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTransformer= TfidfTransformer()

In [55]:
# Fit the IDF weights on the training split, then reuse them for the test split
X_train_tfidf = tfidfTransformer.fit_transform(x_train_vec)
X_test_tdidf = tfidfTransformer.transform(x_test_vec)


In [56]:
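from sklearn import svm

# Linear SVM trained on the TF-IDF features
# (named svm_clf so the classifier does not shadow the imported svm module)
svm_clf = svm.SVC(kernel='linear')
svm_clf.fit(X_train_tfidf, y_train)

# Predict on the TF-IDF test features, the same representation used for training
pred_svm = svm_clf.predict(X_test_tdidf)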

In [57]:
from sklearn.metrics import confusion_matrix, accuracy_score

print("Accuracy =", accuracy_score(y_test, pred_svm) * 100, '%')

Accuracy = 98.71244635193133 %


In [58]:
cm = confusion_matrix(y_test, pred_svm)
# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()
print('True Positive(TP) = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN) = ', TN)
print('False Negative(FN) = ', FN)
accuracy = (TP+TN) /(TP+FP+TN+FN)

print('Binary classification accuracy = {:0.3f}%'.format(accuracy*100))

True Positive(TP) =  61
False Positive(FP) =  3
True Negative(TN) =  169
False Negative(FN) =  0
Binary classification accuracy = 98.712%
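
Accuracy alone hides how the errors are distributed between the two classes. As a small complementary sketch, precision, recall and F1 can be derived from the same four counts reported above. These metrics are not computed in the original notebook; the code only uses the reported counts.

```python
# Counts reported by the confusion matrix above
TP, FP, TN, FN = 61, 3, 169, 0

accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)  # share of predicted positives that are correct
recall = TP / (TP + FN)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts, every IFG record in the test split is found (recall of 1.0), and the 3 false positives bring precision down to 61/64.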
