# Data analysis

This notebook cleans and analyses the data collected previously. The cleaning steps were:

- removed duplicate occurrences of the same article title and researcher id
- kept, after that removal, duplicate occurrences of the same article title (these come from different researchers)
- removed articles with fewer than 3 characters in the title

- [Cleaning](#Cleaning)
    - [Repeated occurrences of article and researcher id](#Repeated-occurrences-of-article-and-researcher-id)
    - [Short article titles](#Short-article-titles)
- [Analysis](#Analysis)
- [Saving](#Saving)

In [19]:
import pandas as pd
import pickle

In [69]:
# Reading the data collected previously
with open('data/articles_and_fields_df.pickle', 'rb') as f:
    articles_and_fields_df = pickle.load(f)
print("Shape:", articles_and_fields_df.shape)
articles_and_fields_df.head()

Shape: (14735, 4)


| | researcher_name | researcher_lattes_id | researcher_major_field | article_title |
|---|---|---|---|---|
| 0 | Juarez Morbini Lopes | 2612407908632983 | Ciências Agrárias | Níveis das vitaminas A e E em dietas de frango... |
| 1 | Juarez Morbini Lopes | 2612407908632983 | Ciências Agrárias | Adição de bentonita sódica como adsorvente de ... |
| 2 | Juarez Morbini Lopes | 2612407908632983 | Ciências Agrárias | Níveis de substituição da DL-metionina pela me... |
| 3 | Juarez Morbini Lopes | 2612407908632983 | Ciências Agrárias | Enzimas de função hepática na aflatoxicose agu... |
| 4 | Juarez Morbini Lopes | 2612407908632983 | Ciências Agrárias | Efeitos de Níveis das VitaminasA, E, Piridoxin... |


## Cleaning

### Repeated occurrences of article and researcher id

In [70]:
# Removing repeated occurrences of article and researcher id
articles_and_fields_df.drop_duplicates(['researcher_lattes_id', 'article_title'], inplace=True, ignore_index=True)
print("New shape:", articles_and_fields_df.shape)

New shape: (14541, 4)
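
To make the effect of the two-column subset concrete, here is a minimal sketch on toy data (the rows and the name `toy_df` are made up for illustration, not taken from the dataset). A row is dropped only when both the researcher id and the title repeat, so the same title under a different researcher is kept.

```python
import pandas as pd

# Toy data (hypothetical): researcher 1 lists 'Editorial' twice,
# and researcher 2 shares that same title.
toy_df = pd.DataFrame({
    'researcher_lattes_id': [1, 1, 2, 3],
    'article_title': ['Editorial', 'Editorial', 'Editorial', 'Poder'],
})

# Rows count as duplicates only when BOTH columns match,
# so researcher 2's 'Editorial' survives the removal.
deduplicated_df = toy_df.drop_duplicates(
    ['researcher_lattes_id', 'article_title'], ignore_index=True)

print(deduplicated_df)
```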


### Short article titles

In [71]:
# Getting articles with fewer than 3 characters in the title
short_title = articles_and_fields_df['article_title'].apply(len) < 3
short_title_index = articles_and_fields_df[short_title].index

In [72]:
# Removing those articles
articles_and_fields_df.drop(index=short_title_index, inplace=True)
articles_and_fields_df.reset_index(drop=True, inplace=True)

In [73]:
print("New shape:", articles_and_fields_df.shape)

New shape: (14537, 4)
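
The same length filter can also be written as a single boolean mask. This is only a sketch on made-up titles (`toy_titles` is a hypothetical name); it assumes every title is a string, in which case `.str.len()` counts characters just like `.apply(len)`.

```python
import pandas as pd

# Toy titles (hypothetical); 'AB' and '' have fewer than 3 characters
toy_titles = pd.DataFrame({'article_title': ['Rua', 'AB', '', 'Poder']})

# Boolean mask: True where the title is shorter than 3 characters
too_short = toy_titles['article_title'].str.len() < 3

# Keep the complement of the mask and rebuild a clean 0..n-1 index
kept_df = toy_titles[~too_short].reset_index(drop=True)

print(kept_df)
```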


## Analysis

In [74]:
# I think it is important to keep repeated articles if they were written by different people:
# when the samples come from researchers of the same field, the repetition strengthens
# the relation between the words in the title and the major field.

# Also, if an article was written by researchers from different fields, it is important to know
# that the words in its title may not strongly define a specific major field.

print('Some samples with repeated article titles:')
duplicated_titles = articles_and_fields_df.duplicated(['article_title'], keep=False)
articles_and_fields_df[duplicated_titles].sort_values('article_title').head(10)

Some samples with repeated article titles:


| | researcher_name | researcher_lattes_id | researcher_major_field | article_title |
|---|---|---|---|---|
| 6008 | Eloisa Madeira Szanto | 9217930367595325 | Ciências Exatas e da Terra | A triple telescope for the simultaneous identi... |
| 6557 | Marcia Maria de Moura | 8259109725413086 | Ciências Exatas e da Terra | A triple telescope for the simultaneous identi... |
| 13111 | Roberto de Oliveira Brandão | 4225280762403997 | Lingüística, Letras e Artes | Apresentação |
| 13103 | Tanira Castro | 5948190405351510 | Lingüística, Letras e Artes | Apresentação |
| 7880 | José William Vesentini | 3945292708273502 | Ciências Humanas | Apresentação |
| 13530 | Irene Teodora Helena Aron | 1685178556977854 | Lingüística, Letras e Artes | Deutsch Als Fremdsprache In Brasilien |
| 13055 | Sidney Camargo | 5528024320536892 | Lingüística, Letras e Artes | Deutsch Als Fremdsprache In Brasilien |
| 5999 | Eloisa Madeira Szanto | 9217930367595325 | Ciências Exatas e da Terra | Dynamics of light heavy-ion reactions in the f... |
| 6511 | Marcia Maria de Moura | 8259109725413086 | Ciências Exatas e da Terra | Dynamics of light heavy-ion reactions in the f... |
| 10478 | Marlene Picarelli | 872170049864518 | Ciências Sociais Aplicadas | Editorial |
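
One way to tell the two cases apart (a shared title within a single field versus a shared title across fields) is to count the distinct major fields per repeated title. The sketch below uses made-up rows, and `fields_per_title` and `ambiguous_titles` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical): 'Apresentação' appears in two fields,
# 'Title A' appears twice in the same field, 'Herpes' appears once.
toy_df = pd.DataFrame({
    'researcher_major_field': [
        'Ciências Humanas', 'Lingüística, Letras e Artes',
        'Ciências Exatas e da Terra', 'Ciências Exatas e da Terra',
        'Ciências da Saúde',
    ],
    'article_title': ['Apresentação', 'Apresentação', 'Title A', 'Title A', 'Herpes'],
})

# Keep only the rows whose title occurs more than once
shared = toy_df[toy_df.duplicated(['article_title'], keep=False)]

# Number of distinct major fields behind each repeated title
fields_per_title = shared.groupby('article_title')['researcher_major_field'].nunique()

# Titles used in more than one field are weaker evidence for any single field
ambiguous_titles = fields_per_title[fields_per_title > 1].index.tolist()

print(fields_per_title)
print(ambiguous_titles)
```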


In [75]:
# Checking for NULL values
articles_and_fields_df.isna().sum()

researcher_name           0
researcher_lattes_id      0
researcher_major_field    0
article_title             0
dtype: int64

In [76]:
print("Count of researchers:", articles_and_fields_df['researcher_lattes_id'].nunique())

Count of researchers: 1349


In [77]:
cols = ['researcher_lattes_id', 'researcher_major_field']
researcher_major_field = articles_and_fields_df[cols].groupby(
    'researcher_lattes_id', as_index=False).apply(lambda x: x.mode()).reset_index(drop=True)

print('Researchers by major field:')
researcher_major_field['researcher_major_field'].value_counts()

Researchers by major field:


Ciências Humanas               269
Engenharias                    246
Ciências Sociais Aplicadas     234
Lingüística, Letras e Artes    209
Ciências da Saúde              136
Ciências Biológicas             90
Ciências Agrárias               89
Ciências Exatas e da Terra      76
Name: researcher_major_field, dtype: int64
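
If each researcher has exactly one major field (an assumption the sketch checks explicitly, on toy data), the same per-field count can be obtained by keeping one row per researcher instead of grouping and taking the mode. The rows and the name `one_row_per_researcher` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one row per article, three researchers
toy_df = pd.DataFrame({
    'researcher_lattes_id': [1, 1, 2, 3, 3, 3],
    'researcher_major_field': [
        'Engenharias', 'Engenharias', 'Ciências Humanas',
        'Engenharias', 'Engenharias', 'Engenharias',
    ],
})

# Assumption: every researcher maps to a single major field
fields_per_researcher = toy_df.groupby('researcher_lattes_id')['researcher_major_field'].nunique()
assert fields_per_researcher.max() == 1

# One row per researcher, then count researchers in each field
one_row_per_researcher = toy_df.drop_duplicates('researcher_lattes_id')
researchers_by_field = one_row_per_researcher['researcher_major_field'].value_counts()

print(researchers_by_field)
```

If the assertion fails on real data, some researcher has more than one field and the mode-based grouping is the safer choice.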

In [78]:
# Articles by researcher major field
print('Articles by researcher major field:')
articles_and_fields_df['researcher_major_field'].value_counts()

Articles by researcher major field:


Ciências da Saúde              2539
Ciências Humanas               2231
Ciências Sociais Aplicadas     1982
Ciências Biológicas            1834
Ciências Agrárias              1607
Lingüística, Letras e Artes    1607
Engenharias                    1573
Ciências Exatas e da Terra     1164
Name: researcher_major_field, dtype: int64

In [79]:
print('Titles consisting only of whitespace:')
articles_and_fields_df[articles_and_fields_df['article_title'].apply(lambda t: t.isspace())]

Titles consisting only of whitespace:


| | researcher_name | researcher_lattes_id | researcher_major_field | article_title |
|---|---|---|---|---|


In [80]:
print('Titles consisting only of digits:')
articles_and_fields_df[articles_and_fields_df['article_title'].apply(lambda t: t.isdigit())]

Titles consisting only of digits:


| | researcher_name | researcher_lattes_id | researcher_major_field | article_title |
|---|---|---|---|---|


In [81]:
no_space = lambda s: ' ' not in s
no_space_articles = articles_and_fields_df[articles_and_fields_df['article_title'].apply(no_space)]

print('Number of articles with a single-word title:', no_space_articles.shape[0])
no_space_articles.head()

Number of articles with a single-word title: 55


| | researcher_name | researcher_lattes_id | researcher_major_field | article_title |
|---|---|---|---|---|
| 1827 | Bruno Edgar Irgang | 7939236055203512 | Ciências Biológicas | Umbelliferae |
| 2633 | Maria do Carmo Mendes Marques | 5521926255675400 | Ciências Biológicas | Balsaminaceas |
| 2634 | Maria do Carmo Mendes Marques | 5521926255675400 | Ciências Biológicas | Ericaceas |
| 3547 | Antônio Jorge Salomão | 7351422514524870 | Ciências da Saúde | Dismenorréia |
| 3772 | Lorivaldo Minelli | 9004250710060552 | Ciências da Saúde | Psoríase |
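
Testing for the absence of a space misses titles with leading or trailing spaces. A possible alternative, sketched here on made-up titles, is to count whitespace-separated words with `str.split()`; `word_counts` and `single_word` are hypothetical names.

```python
import pandas as pd

# Toy titles (hypothetical); ' Herpes ' has surrounding spaces,
# so a plain "no space in the string" test would not flag it.
toy_titles = pd.Series(['Psoríase', ' Herpes ', 'Deutsch Als Fremdsprache', 'Rua'])

# str.split() with no pattern splits on any run of whitespace,
# and .str.len() on the resulting lists gives the word count.
word_counts = toy_titles.str.split().str.len()

# Titles made of exactly one word
single_word = toy_titles[word_counts == 1]

print(word_counts.tolist())
print(single_word.tolist())
```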


In [82]:
print("The 10 articles with the shortest titles:")
articles_and_fields_df.loc[articles_and_fields_df['article_title'].apply(len).sort_values().index].head(10)

The 10 articles with the shortest titles:


| | researcher_name | researcher_lattes_id | researcher_major_field | article_title |
|---|---|---|---|---|
| 14093 | Luiz Carlos da Silva Dantas | 8629430408925370 | Lingüística, Letras e Artes | Rua |
| 10498 | Ercilio Antonio Denny | 1283151791303387 | Ciências Sociais Aplicadas | Poder |
| 11291 | Isaac Epstein | 850416458609837 | Ciências Sociais Aplicadas | Jogos |
| 9663 | Wilson Thomé Sardinha Martins | 9425012950971918 | Ciências Sociais Aplicadas | INCRA |
| 10499 | Ercilio Antonio Denny | 1283151791303387 | Ciências Sociais Aplicadas | Pessoa |
| 14289 | Vera Beatriz Sass | 2245606626815858 | Lingüística, Letras e Artes | Música |
| 13265 | Níobe Abreu Peixoto da Silva | 8900271000825079 | Lingüística, Letras e Artes | Sobre |
| 4016 | Abes Mahmed Amed | 1565520483471962 | Ciências da Saúde | Herpes |
| 13713 | Maria Augusta Calado De Saloma Rodrigues | 7846824770288139 | Lingüística, Letras e Artes | Tapuio |
| 7196 | Euclides Redin | 7949812265735134 | Ciências Humanas | Título? |


## Saving

In [68]:
with open('data/articles_and_fields_cleaned_df.pickle', 'wb') as f:
    pickle.dump(articles_and_fields_df, f)
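
After saving, it can be worth confirming that the file loads back unchanged. The sketch below does a round trip with pandas' own `to_pickle` and `read_pickle` on a toy frame in a temporary directory; the file name `toy_df.pickle` is made up and the real data file is not touched.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical) standing in for the cleaned data
toy_df = pd.DataFrame({'article_title': ['Rua', 'Poder']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_df.pickle')  # hypothetical file name
    toy_df.to_pickle(path)              # write the frame to disk
    restored_df = pd.read_pickle(path)  # read it back

# The restored frame should match the original in values, dtypes and index
print(restored_df.equals(toy_df))
```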