In [98]:
import pandas as pd
import re
from IPython.display import clear_output, display, HTML

* **Original Idea**: https://coppelia.io/2012/06/graphing-the-history-of-philosophy/

**<font size = "6">The JNaaP Title**
***

### 1. Load Data and Exploratory Analysis

#### Exploratory Analysis
Structured content from Wikipedia can be accessed through DBpedia. Exploring some pages on philosophers, we find two relevant relational attributes:
1. https://dbpedia.org/ontology/influenced i.e., The people, works or schools of thought that were **influenced by** this philosopher
2. https://dbpedia.org/ontology/influencedBy i.e., The people, works or schools of thought that were an **influence on** this philosopher
 
 
There are a few things to consider:
1. The DBpedia pages on these attributes say that they are mutual inverses of each other. However, it is not clear at this stage whether the dataset holds a one-to-one relationship for all links, i.e., whether an Influenced By link exists for each Influenced link and vice versa. This is something we will need to investigate.
2. The subject and object of each relation can be a Person, Work or School of Thought (e.g., Neoplatonism). We are only interested in links between Philosophers, a subset of the category Persons. We will need to filter the data appropriately for this.
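
One way to test consideration #1 is to normalise both link tables to (influencer, influenced) pairs and compare the two sets. The sketch below uses a tiny made-up sample, not the real DBpedia data; the column names follow the SPARQL variables used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two link tables; the rows are illustrative only.
influenced_by = pd.DataFrame({
    "Philosopher": ["Plato", "Aristotle"],
    "Influenced_By": ["Socrates", "Plato"],
})
influenced = pd.DataFrame({
    "Philosopher": ["Socrates", "Plato", "Plato"],
    "Influenced": ["Plato", "Aristotle", "Plotinus"],
})

# Normalise both tables to (influencer, influenced) pairs.
pairs_from_by = set(zip(influenced_by["Influenced_By"], influenced_by["Philosopher"]))
pairs_from_to = set(zip(influenced["Philosopher"], influenced["Influenced"]))

# Links recorded in only one direction show where the inverse is missing.
only_in_influenced_by = pairs_from_by - pairs_from_to
only_in_influenced = pairs_from_to - pairs_from_by
print(only_in_influenced_by)
print(only_in_influenced)
```

If both differences are empty, the two attributes are true inverses in the data; any leftover pair is a link that only one of the two pages records.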

To achieve this we download four datasets:
1. List of Wikipedia pages categorised as Philosophers http://dbpedia.org/ontology/Philosopher
2. List of *Influenced_By* links (referred to as *influence_from* from now on)
3. List of *Influenced* links (referred to as *influence_to* from now on)
4. List of Philosophers with both *Influenced_By* and *Influenced* links


In [99]:
# SPARQL Queries (DBpedia SPARQL endpoint: https://dbpedia.org/sparql) 
query = \
"""
# Philosopher: 
SELECT ?Philosopher, ?BirthYr, ?DeathYr
WHERE {
?Philosopher a <http://dbpedia.org/ontology/Philosopher> .
?Philosopher <http://dbpedia.org/ontology/birthYear> ?BirthYr .
?Philosopher <http://dbpedia.org/ontology/deathYear> ?DeathYr .
}

# Influenced By: 
SELECT ?Philosopher, ?Influenced_By WHERE {
?Philosopher a <http://dbpedia.org/ontology/Philosopher> .
?Philosopher <http://dbpedia.org/ontology/influencedBy> ?Influenced_By .
}

# Influenced (To): 
SELECT ?Philosopher, ?Influenced 
WHERE {
?Philosopher a <http://dbpedia.org/ontology/Philosopher> .
?Philosopher <http://dbpedia.org/ontology/influenced> ?Influenced .
}

# Influenced (Two Way): 
SELECT ?Philosopher, ?Influenced_By, ?Influenced 
WHERE {
?Philosopher a <http://dbpedia.org/ontology/Philosopher> .
?Philosopher <http://dbpedia.org/ontology/influencedBy> ?Influenced_By .
?Philosopher <http://dbpedia.org/ontology/influenced> ?Influenced .
}
"""
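
The notebook does not say how the query results were saved to the CSV files read in the next cell. A minimal sketch of one possible route is given here, assuming the endpoint accepts a `format=text/csv` request parameter (an assumption about the DBpedia server, not something stated in this notebook). The helper names `build_csv_url` and `fetch_to_csv` are hypothetical.

```python
from urllib.parse import urlencode

import pandas as pd

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_csv_url(sparql_query: str, endpoint: str = DBPEDIA_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for the result as CSV.

    The 'format' parameter is an assumption about the endpoint.
    """
    params = {"query": sparql_query, "format": "text/csv"}
    return f"{endpoint}?{urlencode(params)}"


def fetch_to_csv(sparql_query: str, path: str) -> pd.DataFrame:
    """Run one query and save the result; needs network access."""
    df = pd.read_csv(build_csv_url(sparql_query))
    df.to_csv(path, index=False)
    return df


influenced_by_query = """
SELECT ?Philosopher ?Influenced_By WHERE {
  ?Philosopher a <http://dbpedia.org/ontology/Philosopher> .
  ?Philosopher <http://dbpedia.org/ontology/influencedBy> ?Influenced_By .
}
"""
url = build_csv_url(influenced_by_query)
# fetch_to_csv(influenced_by_query, "influence_from.csv")  # uncomment to download
```

Each of the four queries would be sent separately. Public endpoints may cap the number of rows returned per request, so a result with a suspiciously round row count is worth double-checking.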

In [100]:
list_data_xplor = []

df_philosophers = pd.read_csv('philosophers.csv')
list_data_xplor.append(
    ['df_philosophers', df_philosophers.shape, df_philosophers[df_philosophers.isnull().any(axis=1)].shape[0], len(df_philosophers.iloc[:,0].unique())])

df_influence_from = pd.read_csv('influence_from.csv')
list_data_xplor.append(
    ['df_influence_from', df_influence_from.shape, df_influence_from[df_influence_from.isnull().any(axis=1)].shape[0], len(df_influence_from.iloc[:,0].unique())])

df_influence_to = pd.read_csv('influence_to.csv')
list_data_xplor.append(
    ['df_influence_to', df_influence_to.shape, df_influence_to[df_influence_to.isnull().any(axis=1)].shape[0], len(df_influence_to.iloc[:,0].unique())])

df_influence_two_way = pd.read_csv('influence_two_way.csv')
list_data_xplor.append(
    ['df_influence_two_way', df_influence_two_way.shape, df_influence_two_way[df_influence_two_way.isnull().any(axis=1)].shape[0], len(df_influence_two_way.iloc[:,0].unique())])

The table below gives the shape of each dataset and the number of unique philosopher pages in each. Wikipedia has 2854 unique pages on philosophers (which is not to say that it has pages on 2854 philosophers, as we shall see later).

Both influence_from and influence_to have far fewer unique pages than the list of philosophers, and the number of unique pages differs between the two. So clearly, there is no one-to-one relationship between them. The two tables may still overlap, so some links could be duplicated across them.

The number of philosopher pages with both influence_from and influence_to links is very small (174), so we shall abandon this specific dataset.

Thankfully, none of the rows in any of the datasets have NaNs. Yay!

In [101]:
df_xplor = pd.DataFrame(list_data_xplor, columns=["Data", "Shape", "Rows with NaN", '# Unique Pages'])
display(HTML(df_xplor.to_html()))

Unnamed: 0,Data,Shape,Rows with NaN,# Unique Pages
0,df_philosophers,"(2854, 1)",0,2854
1,df_influence_from,"(7729, 2)",0,1666
2,df_influence_to,"(5175, 2)",0,904
3,df_influence_two_way,"(10000, 3)",0,174
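
The four summary rows are built with the same expression repeated for each dataframe. A small helper could produce one row per dataframe instead; the sketch below is an illustration on a toy frame, and the name `summarise` is hypothetical. It uses `nunique()`, which ignores NaN, whereas `len(unique())` counts NaN as a value; the two agree when the first column has no missing entries.

```python
import pandas as pd


def summarise(name: str, df: pd.DataFrame) -> list:
    """Return [name, shape, rows containing NaN, unique values in column 0]."""
    rows_with_nan = int(df.isnull().any(axis=1).sum())
    unique_pages = df.iloc[:, 0].nunique()
    return [name, df.shape, rows_with_nan, unique_pages]


# Toy frame, not the real data.
toy = pd.DataFrame({
    "Philosopher": ["Plato", "Plato", "Kant"],
    "Influenced_By": ["Socrates", None, "Hume"],
})
summary = pd.DataFrame(
    [summarise("toy", toy)],
    columns=["Data", "Shape", "Rows with NaN", "# Unique Pages"],
)
print(summary)
```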


One more thing to cross-check is whether the list of pages on philosophers is a superset of the philosophers in influence_from and influence_to. The cell below shows that this is the case. Yay #2

In [102]:
set_philosophers = set(df_philosophers.iloc[:,0])
print(set(df_influence_from.iloc[:,0]).issubset(set_philosophers))
print(set(df_influence_to.iloc[:,0]).issubset(set_philosophers))

True
True


### Data Cleaning

Thankfully there are no NaN rows, but we need to do some gentle cleaning

1. Rename the columns consistently across dataframes to make life easier

In [103]:
col_names = {'Philosopher':'ph_url'}
df_philosophers.rename(columns = col_names, inplace=True)

col_names = {'Philosopher':'ph_url', 'Influenced_By': 'infl_from_url'}
df_influence_from.rename(columns = col_names, inplace=True)

col_names = {'Philosopher':'ph_url', 'Influenced': 'infl_to_url'}
df_influence_to.rename(columns = col_names, inplace=True)

2. The data entries are in the form of URLs, so let's extract names from these. The extraction scheme is simple: take everything after the ../resource/ part of the URL

In [104]:
df_philosophers.head(2)

Unnamed: 0,ph_url
0,http://dbpedia.org/resource/A.K._Chatterjee
1,http://dbpedia.org/resource/A._C._Ewing


In [105]:
regex_search = r'^.*resource\/(.*)$'

df_philosophers['ph_name'] = df_philosophers['ph_url'].str.extract(regex_search)

df_influence_from['ph_name'] = df_influence_from['ph_url'].str.extract(regex_search)
df_influence_from['infl_from_name'] = df_influence_from['infl_from_url'].str.extract(regex_search)

df_influence_to['ph_name'] = df_influence_to['ph_url'].str.extract(regex_search)
df_influence_to['infl_to_name'] = df_influence_to['infl_to_url'].str.extract(regex_search)

Let's check how the data looks

In [106]:
df_philosophers.head(2)

Unnamed: 0,ph_url,ph_name
0,http://dbpedia.org/resource/A.K._Chatterjee,A.K._Chatterjee
1,http://dbpedia.org/resource/A._C._Ewing,A._C._Ewing


In [107]:
df_influence_from.head(2)

Unnamed: 0,ph_url,infl_from_url,ph_name,infl_from_name
0,http://dbpedia.org/resource/A._C._Ewing,http://dbpedia.org/resource/Brand_Blanshard,A._C._Ewing,Brand_Blanshard
1,http://dbpedia.org/resource/A._D._Gordon,http://dbpedia.org/resource/Leo_Tolstoy,A._D._Gordon,Leo_Tolstoy


We also should check if there have been any errors

In [108]:
print(df_philosophers.isnull().any().any())
print(df_influence_from.isnull().any().any())
print(df_influence_to.isnull().any().any())

False
False
True


Some NaNs in df_influence_to, **Uh oh!** Let's dig deeper

In [109]:
display(HTML(df_influence_to[df_influence_to.isnull().any(axis=1)].to_html()))

Unnamed: 0,ph_url,infl_to_url,ph_name,infl_to_name
5087,http://dbpedia.org/resource/T._K._Seung,https://las.depaul.edu/academics/political-science/faculty/Pages/david-lay-williams.aspx,T._K._Seung,


It seems that for one of the entries, the influence_to URL points outside Wikipedia, so the regex finds no ../resource/ part and the extracted name is NaN. This is an error in the dataset. We'll drop this datapoint
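
The NaN comes from how `str.extract` behaves: when the pattern does not match, the capture group is returned as NaN rather than raising an error. A minimal sketch with one DBpedia URL and one made-up external URL:

```python
import pandas as pd

urls = pd.Series([
    "http://dbpedia.org/resource/T._K._Seung",
    "https://example.org/faculty/some-page.aspx",  # hypothetical external URL
])

# expand=False returns a Series for a single capture group.
names = urls.str.extract(r"^.*resource/(.*)$", expand=False)
print(names.tolist())
print(names.isna().tolist())
```

Checking for NaN after the extraction, as done above, is therefore a cheap way to catch URLs that do not follow the expected scheme.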

In [110]:
# .copy() avoids a SettingWithCopyWarning when columns are added to this frame later
df_influence_to = df_influence_to[~df_influence_to.isnull().any(axis=1)].copy()

Sort the list of philosophers by name

In [111]:
df_philosophers.sort_values('ph_name', inplace=True)

So good to go? Sadly no

While eyeballing the data, a problem was spotted. The data set has two distinct pages on the great logician Frege:
1. https://dbpedia.org/page/Gottlob_Frege
2. https://dbpedia.org/page/Gottlob_Frege__Gottlob_Frege__1

On investigation, one finds that this is not true for all philosophers. A few have duplicate pages like this but most do not. It is not immediately clear why this is the case (some quirk of Wikipedia, perhaps?), but we shall deal with it!

#### Linking duplicate pages

So we need a scheme to map multiple pages on one philosopher (where they exist) to a single name. The following is a simple scheme:

1. In the ph_name field of df_philosophers, test for string duplication using first n characters (n = 10 below)
2. For the rows that are highlighted as duplicates, test for duplication in the first two words of the name (a word is taken to be separated by underscores, and includes initials. So A._C._Ewing will have three words while Bertrand_Russell will have two)
3. For the rows highlighted as duplicates in test #2, manually check and create a map for true duplicates
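
The two automated tests can be sketched on a handful of names as follows. This is an illustration rather than the notebook's own cell: it uses `duplicated(keep=False)` and `str.split` in place of `groupby` and the regex, and the sample names are chosen only to show a stage-1 false positive.

```python
import pandas as pd

df = pd.DataFrame({"ph_name": [
    "Gottlob_Frege",
    "Gottlob_Frege__Gottlob_Frege__1",
    "Richard_Rorty",
    "Richard_Rothe",   # shares a 10-character prefix with Richard_Rorty
    "A._C._Ewing",
]})

# Stage 1: keep names whose first 10 characters are shared with another row.
df["prefix10"] = df["ph_name"].str[:10]
stage1 = df[df["prefix10"].duplicated(keep=False)].copy()

# Stage 2: among those, keep names sharing the first two underscore-separated words.
stage1["first_two"] = stage1["ph_name"].str.split("_").str[:2].str.join("_")
stage2 = stage1[stage1["first_two"].duplicated(keep=False)]

print(stage1["ph_name"].tolist())
print(stage2["ph_name"].tolist())
```

Stage 1 is deliberately loose and stage 2 tightens it, which keeps the list that has to be checked by hand in step 3 short.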

In [112]:
df_philosophers['dup_test_col1'] = df_philosophers['ph_name'].str[:10]

regex_search = r'^([^_^\n]*_[^_^\n]*).*$'
df_philosophers['dup_test_col2'] = df_philosophers['ph_name'].str.extract(regex_search)

In [113]:
df_philosophers.head(3)

Unnamed: 0,ph_url,ph_name,dup_test_col1,dup_test_col2
857,http://dbpedia.org/resource/'Abd_al-Haqq_al-De...,'Abd_al-Haqq_al-Dehlawi__1,'Abd_al-Ha,'Abd_al-Haqq
0,http://dbpedia.org/resource/A.K._Chatterjee,A.K._Chatterjee,A.K._Chatt,A.K._Chatterjee
1,http://dbpedia.org/resource/A._C._Ewing,A._C._Ewing,A._C._Ewin,A._C.


In [114]:
df_dup_test1 = df_philosophers.groupby('dup_test_col1').count()
list_suspected_dups_test1 = list(df_dup_test1[df_dup_test1['ph_name'] > 1].index)
len(list_suspected_dups_test1)

102

We found 102 ten-character prefixes that are shared by more than one of the 2854 rows; these are our potential duplicates

In [115]:
df_dup_test2 = df_philosophers[df_philosophers['dup_test_col1'].isin(list_suspected_dups_test1)]
df_dup_test2 = df_dup_test2.groupby('dup_test_col2').count()
list_suspected_dups_test2 = list(df_dup_test2[df_dup_test2['ph_name'] > 1].index)
len(list_suspected_dups_test2)

36

The second test narrows this down to 36 two-word prefixes shared by more than one row. Now we export the matching rows as CSV and manually create a mapping for the true duplicates

In [116]:
df_philosophers[df_philosophers['dup_test_col2'].isin(list_suspected_dups_test2)].to_csv('suspected_dups.csv', index = False)

In [117]:
df_dups_mapping = pd.read_csv('dups_mapping.csv')
print(df_dups_mapping[df_dups_mapping['Map'].isna()].shape)
print(df_dups_mapping[~df_dups_mapping['Map'].isna()].shape)

(36, 5)
(42, 5)


The mapping file holds 78 rows: 42 carry a value in the Map column and 36 do not. Each mapped row points to a single value, the ph_name of the page we keep for that philosopher; rows left blank were judged not to be duplicates

In [118]:
df_dups_mapping = df_dups_mapping[~df_dups_mapping['Map'].isna()]
df_dups_mapping.head(4)

Unnamed: 0,ph_url,ph_name,dup_test_col1,dup_test_col2,Map
2,http://dbpedia.org/resource/Algernon_Sidney,Algernon_Sidney,Algernon_S,Algernon_Sidney,Algernon_Sidney
3,http://dbpedia.org/resource/Algernon_Sidney__A...,Algernon_Sidney__Algernon_Sidney__1,Algernon_S,Algernon_Sidney,Algernon_Sidney
4,http://dbpedia.org/resource/Allan_Bloom,Allan_Bloom,Allan_Bloo,Allan_Bloom,Allan_Bloom
5,http://dbpedia.org/resource/Allan_Bloom__Allan...,Allan_Bloom__Allan_Bloom__1,Allan_Bloo,Allan_Bloom,Allan_Bloom


Now we use this map to replace the name fields in all dataframes with the mapped name

In [119]:
# Build a {page name: canonical name} lookup from the ph_name and Map columns
dict_dups_mapping = df_dups_mapping.set_index('ph_name')['Map'].to_dict()
# Adapted from https://stackoverflow.com/questions/26716616/convert-a-pandas-dataframe-to-a-dictionary

In [120]:
df_philosophers['ph_name_dedupl'] = df_philosophers['ph_name']
df_philosophers['ph_name_dedupl'] = df_philosophers['ph_name_dedupl'].replace(dict_dups_mapping)
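
As a small illustration of what the replacement does, the sketch below applies a two-row toy mapping to a short list of names. Rows that map a name to itself are dropped before building the dictionary; this is a choice made for the sketch, not something the notebook does. Names that are not keys of the dictionary pass through unchanged.

```python
import pandas as pd

# Toy mapping table in the same layout as dups_mapping.csv (ph_name, Map).
mapping = pd.DataFrame({
    "ph_name": ["Gottlob_Frege", "Gottlob_Frege__Gottlob_Frege__1"],
    "Map": ["Gottlob_Frege", "Gottlob_Frege"],
})

# Identity rows add nothing to the lookup, so keep only real renames.
renames = mapping[mapping["ph_name"] != mapping["Map"]]
name_map = renames.set_index("ph_name")["Map"].to_dict()

names = pd.Series(["Gottlob_Frege__Gottlob_Frege__1", "Bertrand_Russell"])
deduplicated = names.replace(name_map)
print(deduplicated.tolist())
```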

So let's check whether we have been able to make the doppelganger of Herr Frege disappear

In [121]:
df_philosophers[df_philosophers['ph_name'].str.contains('Frege')]

Unnamed: 0,ph_url,ph_name,dup_test_col1,dup_test_col2,ph_name_dedupl
2451,http://dbpedia.org/resource/Gottlob_Frege,Gottlob_Frege,Gottlob_Fr,Gottlob_Frege,Gottlob_Frege
2839,http://dbpedia.org/resource/Gottlob_Frege__Got...,Gottlob_Frege__Gottlob_Frege__1,Gottlob_Fr,Gottlob_Frege,Gottlob_Frege


Success!

Finally, check for errors:

In [122]:
df_philosophers['ph_name_dedupl'].isna().any()

False

Do the same for the other two dataframes

In [123]:
df_influence_from['ph_name_dedupl'] = df_influence_from['ph_name'].replace(dict_dups_mapping)
df_influence_from['infl_from_name_dedupl'] = df_influence_from['infl_from_name'].replace(dict_dups_mapping)
df_influence_from[['ph_name_dedupl','infl_from_name_dedupl']].isna().any()

ph_name_dedupl           False
infl_from_name_dedupl    False
dtype: bool

In [124]:
df_influence_to['ph_name_dedupl'] = df_influence_to['ph_name'].replace(dict_dups_mapping)
df_influence_to['infl_to_name_dedupl'] = df_influence_to['infl_to_name'].replace(dict_dups_mapping)
df_influence_to[['ph_name_dedupl','infl_to_name_dedupl']].isna().any()

ph_name_dedupl         False
infl_to_name_dedupl    False
dtype: bool

FINALLY we are at the moment of synthesis!

We have two dataframes where each row is a relationship or link. On one side is an influencer and on the other is the influenced

1. Change the column names to a common format:

        a. infl_from_url, infl_from_name - Page URL and name of the philosopher who is the influencer in the relationship
        b. infl_to_url, infl_to_name - Page URL and name of the philosopher who is being influenced in the relationship
        
2. Concatenate the dataframes
3. Remove duplicate links
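
The three steps can be sketched on toy data as follows. To keep it short, the sketch carries only the name columns and removes duplicates on the name pair alone, which is an assumption of this example. A comparison across all four columns would instead keep two links that differ only in their page URL.

```python
import pandas as pd

# influenced-by table: the page owner (ph_name) is the one being influenced.
df_from = pd.DataFrame({"ph_name": ["Plato"], "infl_from_name": ["Socrates"]})
# influenced table: the page owner (ph_name) is the influencer.
df_to = pd.DataFrame({
    "ph_name": ["Socrates", "Plato"],
    "infl_to_name": ["Plato", "Aristotle"],
})

# Steps 1 and 2: rename to the common format, then stack the two tables.
links = pd.concat(
    [
        df_from.rename(columns={"ph_name": "infl_to_name"}),
        df_to.rename(columns={"ph_name": "infl_from_name"}),
    ],
    ignore_index=True,
)

# Step 3: the Socrates -> Plato link appears in both tables; keep it once.
links = links.drop_duplicates(subset=["infl_from_name", "infl_to_name"], ignore_index=True)
print(links)
```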

In [125]:
df_influence_from_concat = df_influence_from[['ph_name_dedupl','ph_url','infl_from_name_dedupl','infl_from_url']]
df_influence_from_concat = df_influence_from_concat.rename(columns =
                                                           {'ph_name_dedupl':'infl_to_name','ph_url':'infl_to_url','infl_from_name_dedupl':'infl_from_name','infl_from_url':'infl_from_url'})

df_influence_to_concat = df_influence_to[['ph_name_dedupl','ph_url','infl_to_name_dedupl','infl_to_url']]
df_influence_to_concat = df_influence_to_concat.rename(columns = 
                                                       {'ph_name_dedupl':'infl_from_name','ph_url':'infl_from_url','infl_to_name_dedupl':'infl_to_name','infl_to_url':'infl_to_url',})

df_influence_links = pd.concat([df_influence_from_concat, df_influence_to_concat], join='inner', ignore_index=True)

df_influence_links = df_influence_links.drop_duplicates(keep='first',  ignore_index=True)

But wait! We have not yet addressed consideration #2: "The subject and object of each relation can be a Person, Work or School of Thought (e.g., Neoplatonism). We are only interested in links between Philosophers, a subset of the category Persons. We will need to filter the data appropriately for this"

Let's do it

In [126]:
set_philosophers = set(df_philosophers['ph_name_dedupl'])
df_influence_links = df_influence_links[df_influence_links['infl_to_name'].isin(set_philosophers)]
df_influence_links = df_influence_links[df_influence_links['infl_from_name'].isin(set_philosophers)]

In [127]:
print(df_influence_links.shape)
print(len(set(df_influence_links['infl_to_name'])))
print(len(set(df_influence_links['infl_from_name'])))

(6214, 4)
1636
881


So finally, we have a dataset of 6214 unique relationships involving 881 distinct influencers (infl_from_name) and 1636 distinct influenced philosophers (infl_to_name).
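
The filter and the three counts can be reproduced on a toy table, which makes it easier to see which column holds which role: infl_from_name is the influencer and infl_to_name is the influenced. The rows below are illustrative only.

```python
import pandas as pd

links = pd.DataFrame({
    "infl_from_name": ["Socrates", "Plato", "Plato", "Neoplatonism"],
    "infl_to_name": ["Plato", "Aristotle", "Plotinus", "Augustine_of_Hippo"],
})
philosophers = {"Socrates", "Plato", "Aristotle", "Plotinus", "Augustine_of_Hippo"}

# Keep a link only if both endpoints are known philosophers;
# the Neoplatonism row is a school of thought and is dropped.
mask = links["infl_from_name"].isin(philosophers) & links["infl_to_name"].isin(philosophers)
links = links[mask]

n_links = len(links)
n_influencers = links["infl_from_name"].nunique()
n_influenced = links["infl_to_name"].nunique()
print(n_links, n_influencers, n_influenced)
```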

Final check for errors

In [128]:
df_influence_links.isna().any()

infl_to_name      False
infl_to_url       False
infl_from_name    False
infl_from_url     False
dtype: bool

Time to have fun!

In [129]:
df_influence_links.to_csv('results_links.csv', index = False)

In [138]:
df_count_infl_to = df_influence_links.groupby('infl_from_name', as_index=False)['infl_to_name'].count().rename(columns={'infl_from_name':'ph_name','infl_to_name':'count_influence_to'})
df_count_infl_from = df_influence_links.groupby('infl_to_name', as_index=False)['infl_from_name'].count().rename(columns={'infl_to_name':'ph_name', 'infl_from_name':'count_influence_from'})
df_count_infl = pd.merge(df_count_infl_to, df_count_infl_from, how = 'outer', on = 'ph_name').fillna(0)
df_count_infl['displname'] = df_count_infl['ph_name'].replace(r'_', ' ', regex=True).replace(r'\.', '', regex=True)
df_count_infl = pd.merge(df_count_infl, df_philosophers[['ph_name','ph_url']], how='left', on='ph_name')
df_count_infl['ph_url'] = df_count_infl['ph_url'].replace(r'dbpedia.org/resource',r'en.wikipedia.org/wiki', regex=True)
df_count_infl.to_csv('results_nodes.csv', index = False)
df_count_infl.to_json('results_nodes.json', orient = 'records', force_ascii=False)

In [136]:
df_count_infl

Unnamed: 0,ph_name,count_influence_to,count_influence_from,displname,ph_url
0,A._J._Ayer,7.0,7.0,A J Ayer,http://en.wikipedia.org/wiki/A._J._Ayer
1,Abdolkarim_Soroush,1.0,1.0,Abdolkarim Soroush,http://en.wikipedia.org/wiki/Abdolkarim_Soroush
2,Abdulhakim_Arvasi,1.0,0.0,Abdulhakim Arvasi,http://en.wikipedia.org/wiki/Abdulhakim_Arvasi
3,Abu_Yusuf,1.0,0.0,Abu Yusuf,http://en.wikipedia.org/wiki/Abu_Yusuf
4,Abu_Zayd_al-Balkhi,1.0,1.0,Abu Zayd al-Balkhi,http://en.wikipedia.org/wiki/Abu_Zayd_al-Balkhi
...,...,...,...,...,...
1751,Zhang_Weiwei_(professor),0.0,4.0,Zhang Weiwei (professor),http://en.wikipedia.org/wiki/Zhang_Weiwei_(pro...
1752,Zhao_Tingyang,0.0,7.0,Zhao Tingyang,http://en.wikipedia.org/wiki/Zhao_Tingyang
1753,Ágnes_Heller,0.0,4.0,Ágnes Heller,http://en.wikipedia.org/wiki/Ágnes_Heller
1754,Éric_Geoffroy,0.0,2.0,Éric Geoffroy,http://en.wikipedia.org/wiki/Éric_Geoffroy
