## DATA CLEANING 

This notebook achieves the following goals:

   1. Find authors with aliases in the Persons datatable
   2. Create a look-up table to remove aliases (by choosing only one name per author)
   3. Clean all datatables so that they contain no aliases
   4. Merge the book/journal name into the Title field (books/journals are correlated with topics)
   5. Bring all tables into a canonical form [KEY, AUTHOR, TITLE, YEAR]
   6. Tokenize Titles and remove stopwords.
   
At the end of this process, we shall have a clean dataset which can be used for Topic extraction in the future.

In [1]:
import pandas as pd

### 1.1 Authors with Aliases

In [2]:
persons_df = pd.read_csv('dblp_persons.csv')

In [3]:
multiple = persons_df.Resource.value_counts()

In [4]:
multiple_names = persons_df[persons_df.Resource.isin(multiple[multiple >1].index)].set_index('Resource')

In [5]:
multiple = multiple_names.Author.values 
multiple_names.head(6)

Resource,Author
homepages/96/520,Fu-Chiang Tsui
homepages/96/520,Fuchiang (Rich) Tsui
homepages/96/3827,Peter A. Henning
homepages/96/3827,Peter Henning
homepages/96/7099,Jos Kleinjans
homepages/96/7099,Jos C. S. Kleinjans
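
The same selection can be sketched on a toy table. This is a minimal illustration, not the notebook's actual cell: the `homepages/1` style identifiers and the third author are made up, and `duplicated(keep=False)` is used in place of `value_counts()` to flag every Resource that appears more than once.

```python
import pandas as pd

# Toy stand-in for dblp_persons.csv (identifiers invented for illustration).
persons = pd.DataFrame({
    "Resource": ["homepages/1", "homepages/1", "homepages/2",
                 "homepages/3", "homepages/3"],
    "Author": ["Peter A. Henning", "Peter Henning", "Ada Lovelace",
               "Jos Kleinjans", "Jos C. S. Kleinjans"],
})

# keep=False marks every row whose Resource occurs more than once.
has_alias = persons.Resource.duplicated(keep=False)
aliases = persons[has_alias].set_index("Resource")
print(aliases)
```

Only the rows for Resources with more than one name survive, which is the shape shown in the table above.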


### 1.2   A look-up table for getting unique name per author

In [6]:
multiple_names['new_names'] = multiple_names.index.map(lambda x : multiple_names.loc[x].Author.values[0]).values

In [7]:
multiple_names.reset_index(inplace=True, drop=True)
multiple_names.set_index('Author', inplace=True)

In [8]:
multiple_names.head(6)

Author,new_names
Fu-Chiang Tsui,Fu-Chiang Tsui
Fuchiang (Rich) Tsui,Fu-Chiang Tsui
Peter A. Henning,Peter A. Henning
Peter Henning,Peter A. Henning
Jos Kleinjans,Jos Kleinjans
Jos C. S. Kleinjans,Jos Kleinjans
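
A vectorised way to build the same kind of look-up table is sketched below, assuming toy data with invented Resource identifiers. It uses `groupby(...).transform("first")` rather than a per-row `.loc` lookup. Note that `"first"` picks the first non-null name in each group.

```python
import pandas as pd

# Toy alias table (Resource identifiers invented for illustration).
aliases = pd.DataFrame({
    "Resource": ["homepages/1", "homepages/1", "homepages/3", "homepages/3"],
    "Author": ["Peter A. Henning", "Peter Henning",
               "Jos Kleinjans", "Jos C. S. Kleinjans"],
})

# For each Resource, use the first listed name as the preferred one.
aliases["new_names"] = aliases.groupby("Resource")["Author"].transform("first")

# Series that maps every alias to its preferred name.
lookup = aliases.set_index("Author")["new_names"]
print(lookup)
```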


In [9]:
multiple_names.to_csv('name_synonyms.csv')

### 1.3 Cleaning all datatables to have no Aliases

In [10]:
proceedings_df = pd.read_csv('dblp_proceedings.csv')

In [11]:
proceedings_df.head(6)

Unnamed: 0,Proceeding,Editor,Title,Year
0,journals/thipeac/2009-2,Per Stenström,Transactions on High-Performance Embedded Arch...,2009
1,journals/thipeac/2011-4,Per Stenström,Transactions on High-Performance Embedded Arch...,2011
2,journals/thipeac/2007-1,Per Stenström,Transactions on High-Performance Embedded Arch...,2007
3,journals/thipeac/2007-1,Michael F. P. O'Boyle,Transactions on High-Performance Embedded Arch...,2007
4,journals/thipeac/2007-1,François Bodin,Transactions on High-Performance Embedded Arch...,2007
5,journals/thipeac/2007-1,Marcelo Cintra,Transactions on High-Performance Embedded Arch...,2007


In [12]:
# .copy() yields independent frames, so the assignment below does not raise a SettingWithCopyWarning
proceedings_df_a = proceedings_df.loc[proceedings_df.Editor.isin(multiple)].copy()
proceedings_df_n_a = proceedings_df.loc[~proceedings_df.Editor.isin(multiple)].copy()

In [13]:
proceedings_df_a['Editor'] = proceedings_df_a.Editor.apply(lambda x : multiple_names.loc[x].new_names)


In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
proceedings_df_clean = pd.concat([proceedings_df_a, proceedings_df_n_a])

In [15]:
proceedings_df_clean.reset_index(inplace=True, drop=True)

In [16]:
proceedings_df_clean.to_csv('dblp_proceedings_clean.csv')
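
As an alternative to splitting the table and gluing it back together, the replacement can be sketched with `Series.map` plus a fallback to the original name. This is only a sketch on invented rows: the `lookup` Series and the `conf/x/...` keys are hypothetical, and the approach assumes every alias appears exactly once in the look-up index (`map` raises an error on a duplicated index).

```python
import pandas as pd

# Hypothetical look-up: alias -> preferred name (index must be unique).
lookup = pd.Series(
    {"Peter Henning": "Peter A. Henning", "Peter A. Henning": "Peter A. Henning"},
    name="new_names",
)

# Toy proceedings table (keys invented for illustration).
proceedings = pd.DataFrame({
    "Proceeding": ["conf/x/2001", "conf/x/2002", "conf/y/2003"],
    "Editor": ["Peter Henning", "Per Stenström", "Peter A. Henning"],
})

# map() gives NaN for editors missing from the look-up;
# fillna() then restores their original names.
proceedings["Editor"] = proceedings["Editor"].map(lookup).fillna(proceedings["Editor"])
print(proceedings)
```

Unlike the split-and-concatenate approach, this keeps the rows in their original order.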

In [17]:
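def clean_names(df, filename):
    """Replace author aliases in df with the preferred name and write the result to filename."""
    # Split rows by whether the author has aliases; .copy() avoids SettingWithCopyWarning
    df_a = df.loc[df.Author.isin(multiple)].copy()
    df_n_a = df.loc[~df.Author.isin(multiple)]
    df_a['Author'] = df_a.Author.apply(lambda x : multiple_names.loc[x].new_names)

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df_a, df_n_a])
    df.reset_index(inplace=True, drop=True)

    df.to_csv(filename)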

In [35]:
%time clean_names(pd.read_csv('dblp_books.csv'), 'dblp_books_clean.csv')

CPU times: user 490 ms, sys: 367 ms, total: 857 ms
Wall time: 1.13 s



In [18]:
%time clean_names(pd.read_csv('dblp_theses.csv'), 'dblp_theses_clean.csv')

CPU times: user 962 ms, sys: 51.8 ms, total: 1.01 s
Wall time: 1.06 s


In [19]:
%time clean_names(pd.read_csv('dblp_incollections.csv'), 'dblp_incollections_clean.csv')

CPU times: user 1.86 s, sys: 74.1 ms, total: 1.94 s
Wall time: 2 s


In [20]:
%time clean_names(pd.read_csv('dblp_articles.csv'), 'dblp_articles_clean.csv')

CPU times: user 2min 1s, sys: 3.22 s, total: 2min 4s
Wall time: 2min 6s


In [21]:
%time clean_names(pd.read_csv('dblp_inproceedings.csv'), 'dblp_inproceedings_clean.csv')

CPU times: user 2min 40s, sys: 4.62 s, total: 2min 45s
Wall time: 2min 46s


### 1.4 & 1.5 Merging Journal/Book information and converting to the 'canonical' form

In [22]:
articles_df = pd.read_csv('dblp_articles_clean.csv', index_col = 0)

articles_df.Title = articles_df.Title + ' ' + articles_df.Publication

articles_df.drop( 'Publication', axis=1, inplace=True)

articles_df.columns = ['Key', 'Author', 'Title', 'Year']

articles_df.to_csv('dblp_articles_clean.csv')

In [38]:
books_df = pd.read_csv('dblp_books_clean.csv', index_col = 0)

books_df.columns = ['Key', 'Author', 'Title', 'Year']

books_df.to_csv('dblp_books_clean.csv')

In [24]:
incollections_df = pd.read_csv('dblp_incollections_clean.csv',index_col = 0)

incollections_df.Title = incollections_df.Title + ' ' + incollections_df.Publication

incollections_df.drop( 'Publication', axis=1, inplace=True)

incollections_df.columns = ['Key', 'Author', 'Title', 'Year']

incollections_df.to_csv('dblp_incollections_clean.csv')


In [25]:
inproceedings_df = pd.read_csv('dblp_inproceedings_clean.csv',index_col = 0)

inproceedings_df.Title = inproceedings_df.Title + ' ' + inproceedings_df.Publication

inproceedings_df.drop('Publication', axis=1, inplace=True)

inproceedings_df.columns = ['Key', 'Author', 'Title', 'Year']

inproceedings_df.to_csv('dblp_inproceedings_clean.csv')


In [26]:
thesis_df = pd.read_csv('dblp_theses_clean.csv',index_col = 0)

thesis_df.columns = ['Key', 'Author', 'Title', 'Year']

thesis_df.to_csv('dblp_theses_clean.csv')

In [27]:
proceedings_df = pd.read_csv('dblp_proceedings_clean.csv',index_col = 0)

proceedings_df.columns = ['Key', 'Author', 'Title', 'Year']

proceedings_df.to_csv('dblp_proceedings_clean.csv')

We now have usable text dataframes, all in the same [Key, Author, Title, Year] format.
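
The per-table cells above all follow one pattern, which could be folded into a single helper. The sketch below is an assumption-laden illustration: `to_canonical`, the `Article` column name, and the sample row are hypothetical. Like the cells above, it relies on the remaining columns already being in key, author, title, year order.

```python
import pandas as pd

CANONICAL_COLUMNS = ["Key", "Author", "Title", "Year"]

def to_canonical(df):
    """Return a copy of df in [Key, Author, Title, Year] form (hypothetical helper)."""
    df = df.copy()
    # Tables with a Publication column get it appended to the Title.
    if "Publication" in df.columns:
        df["Title"] = df["Title"] + " " + df["Publication"]
        df = df.drop(columns="Publication")
    # Rename positionally, as the notebook cells do.
    df.columns = CANONICAL_COLUMNS
    return df

# Invented sample row for illustration.
articles = pd.DataFrame({
    "Article": ["journals/acta/Example01"],
    "Author": ["A. Author"],
    "Title": ["Fifo nets"],
    "Publication": ["Acta Inf."],
    "Year": [1988],
})
canonical = to_canonical(articles)
print(canonical)
```

Returning a copy leaves the input frame untouched, so the helper can be applied to each table without side effects.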

### 1.6 Tokenisation of Titles and Removal of Stopwords

In [28]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
    
def tokenize_and_clean_title(filename):
    
    df = pd.read_csv(filename, index_col=0)
    df.Title = df.Title.apply(lambda x : [y for y in simple_preprocess(x) if y not in STOPWORDS]) 
    df.to_csv(filename)



In [29]:
tokenize_and_clean_title('dblp_articles_clean.csv')
tokenize_and_clean_title('dblp_books_clean.csv')
tokenize_and_clean_title('dblp_incollections_clean.csv')
tokenize_and_clean_title('dblp_theses_clean.csv')
tokenize_and_clean_title('dblp_proceedings_clean.csv')
tokenize_and_clean_title('dblp_inproceedings_clean.csv')
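
To make the effect of this step concrete without requiring gensim, here is a simplified pure-Python stand-in. It is not gensim's implementation: `toy_tokenize` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny sample, only ASCII letters are handled, and the 2 to 15 character bounds are our understanding of the `simple_preprocess` defaults.

```python
import re

# Tiny illustrative stopword set; gensim's STOPWORDS is far larger.
TOY_STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"})

def toy_tokenize(title):
    """Lowercase, keep ASCII letter runs of 2 to 15 characters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if 2 <= len(t) <= 15 and t not in TOY_STOPWORDS]

# Made-up title used only for illustration.
tokens = toy_tokenize("The Hensel Lemma in Algebraic Manipulation")
print(tokens)
```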


#### Results -- Example Data frames

In [40]:
pd.read_csv('dblp_books_clean.csv', index_col=0).head(5)

Unnamed: 0,Key,Author,Title,Year
0,books/iee/Ghanbari2003,Mohammed Ghanbari,"['standard', 'codecs', 'image', 'compression',...",2003
1,books/garland/Brosgol73,Benjamin M. Brosgol,"['deterministic', 'translation', 'grammars']",1973
2,books/garland/Sedgewick75,Robert Sedgewick,['quicksort'],1975
3,books/garland/Yun73,David Y. Y. Yun,"['hensel', 'lemma', 'algebraic', 'manipulation']",1973
4,books/bu/Rijsbergen79,C. J. van Rijsbergen,"['information', 'retrieval']",1979


In [41]:
pd.read_csv('dblp_articles_clean.csv', index_col=0).head(5)

Unnamed: 0,Key,Author,Title,Year
0,journals/acta/FinkelC87,Annie Choquet-Geniet,"['fifo', 'nets', 'order', 'deadlock', 'acta', ...",1988
1,journals/acta/CalzarossaIS86,Mariacarla Calzarossa,"['workload', 'model', 'representative', 'stati...",1986
2,journals/acta/KariK17,Lila Kari,"['disjunctivity', 'properties', 'sets', 'pseud...",2017
3,journals/acta/BulychevDLL14,Kim G. Larsen,"['efficient', 'controller', 'synthesis', 'frag...",2014
4,journals/acta/CremersH78a,Armin B. Cremers,"['functional', 'behavior', 'data', 'spaces', '...",1978
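
Because the token lists are written to CSV, they come back as strings such as `"['quicksort']"` when the file is read again, as the quoted Title values above suggest. One way to recover real lists is sketched below, assuming the column holds valid Python list literals. It uses an in-memory buffer instead of a file, with two rows taken from the books table above.

```python
import ast
import io

import pandas as pd

books = pd.DataFrame({
    "Key": ["books/garland/Sedgewick75", "books/bu/Rijsbergen79"],
    "Author": ["Robert Sedgewick", "C. J. van Rijsbergen"],
    "Title": [["quicksort"], ["information", "retrieval"]],
    "Year": [1975, 1979],
})

# Round trip through CSV: lists are written as their string representation.
buffer = io.StringIO()
books.to_csv(buffer)
buffer.seek(0)

# ast.literal_eval safely parses each Title string back into a list.
restored = pd.read_csv(buffer, index_col=0, converters={"Title": ast.literal_eval})
print(restored.Title.tolist())
```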
