## Getting the Data <a class="anchor" id="Getting the Data"></a>

We first going to gather the data needed for this project. We are going to extract the words in each Wikipedia page to understand the relation between each physicist and physics domain. 

In [None]:
## We get the nobel data set
import numpy as np
import pandas as pd
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer
import requests

class Parser:
    
    def __init__(self, url):  
        http = Http()
        status, response = http.request(url)
        tables = BeautifulSoup(response, "lxml", 
                              parse_only=SoupStrainer("table", {"class":"wikitable sortable"}))
        self.table = tables.contents[1]
    
    def parse_table(self):      
        rows = self.table.find_all("tr")
        header = self.parse_header(rows[0])
        table_array = [self.parse_row(row) for row in rows[1:]]
        table_df = pd.DataFrame(table_array, columns=header).apply(self.clean_table, 1)
        return table_df.replace({"Year":{'':np.nan}})
        
    def parse_row(self, row):     
        columns = row.find_all("td")
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != '']
    
    def parse_header(self, row):     
        columns = row.find_all("th")
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != ""]
    
    def clean_table(self, row):
        if not row.iloc[0].isdigit() and row.iloc[0] != '':
            return row.shift(1)
        else:
            return row
        
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics"        
parser = Parser(url)   
nobel_df = parser.parse_table()
nobel_df.columns = ["Year", "Laureate", "Country", "Rationale"]
nobel_df.dropna(subset=["Country"], inplace=True)
nobel_df.fillna(method="ffill", inplace=True)
nobel_df.drop(["Year", "Country", "Rationale"], 1, inplace=True)

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
link_df = pd.DataFrame([[x.string, x["href"]] for x in table.contents[1].find_all("a")],
                       columns=["Laureate", "link"]).drop_duplicates()

nobel_df = nobel_df.merge(link_df, on="Laureate", how="left")
nobel_df.set_index("Laureate", inplace=True)
nobel_df.drop_duplicates(inplace=True)
nobel_df

We also going to extract the links of each of the physics domains listed in the Research fields table of the [https://en.wikipedia.org/wiki/Physics](https://en.wikipedia.org/wiki/Physics) Wikipedia page.

In [53]:
## We get the physics links
url = "https://en.wikipedia.org/wiki/Physics"

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
physics_df = pd.DataFrame([[x.string.lower(), x["href"].lower()] for x in table.contents[2].find_all("a")],
                       columns=["Physics_domain", "link"]).drop_duplicates()

physics_df = physics_df.groupby("Physics_domain").first()
physics_df

Unnamed: 0_level_0,link
Physics_domain,Unnamed: 1_level_1
't hooft,/wiki/gerard_%27t_hooft
albert einstein,/wiki/albert_einstein
antimatter,/wiki/antimatter
antiparticle,/wiki/antiparticle
applied,/wiki/applied_physics
astrophysics,/wiki/astrophysics
atomic physics,/wiki/atomic_physics
"atomic, molecular, and optical physics","/wiki/atomic,_molecular,_and_optical_physics"
bardeen,/wiki/john_bardeen
becquerel,/wiki/henri_becquerel



## Cleaning the data <a class="anchor" id="Cleaning the data"></a>

>Use the code from your previous homework to create the different functions to clean the get the word data and clean it.

In [54]:
from string import punctuation

## We get the bios
def get_text(link, root_website = "https://en.wikipedia.org"):    
    http = Http()
    status, response = http.request(root_website + link)
    body = BeautifulSoup(response, "lxml", parse_only=SoupStrainer("div", {"id":"mw-content-text"}))
    return BeautifulSoup.get_text(body.contents[1])

# TODO: copy your clean_string function from the previous homework
def clean_string(string):
    for p in punctuation + "1234567890":
        string = string.replace(p,'').lower()   
    return string

# TODO: copy your remove function from the previous homework
def remove(list_to_clean, element_to_remove=[None, ""]):
    return list(filter(lambda x: x != element_to_remove, list_to_clean))

# TODO: copy your remove_one function from the previous homework
def remove_one(list_to_clean):
    return list(filter(lambda x: len(x) > 1, list_to_clean))


We now going to write a function that takes a data frame with a "link" column and return a column of list of words. We basically aggregate all the above function into one to reproduce what was done in the previous homework. Note that here we DO NOT use the function that keep only a unique element of each list nor the one that filter on the number of occurance.

> Write a function that applies all the previous functions to clean a text.

In [56]:
from nltk.corpus import stopwords
words_to_remove = set(stopwords.words('english'))

#remove stopwords
def remove_stopword(list_to_clean, remove_list = words_to_remove):
    return list(filter(lambda x: x not in remove_list, list_to_clean))

# TODO: aggregate all the above function into one to return a list of words from each link
def clean_everything(df):
    return df['link'].apply(get_text) \
            .apply(clean_string) \
            .str.split() \
            .apply(remove) \
            .apply(remove_one) \
            .apply(remove_stopword)

    
    
physics_df["physics_list"] = clean_everything(physics_df)
nobel_df["physics_list"] = clean_everything(nobel_df)
nobel_df

LookupError: 
**********************************************************************
  Resource [93mstopwords[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('stopwords')
  [0m
  Attempted to load [93mcorpora/stopwords[0m

  Searched in:
    - 'C:\\Users\\Bob/nltk_data'
    - 'C:\\Users\\Bob\\Anaconda3\\envs\\dscsulb\\nltk_data'
    - 'C:\\Users\\Bob\\Anaconda3\\envs\\dscsulb\\share\\nltk_data'
    - 'C:\\Users\\Bob\\Anaconda3\\envs\\dscsulb\\lib\\nltk_data'
    - 'C:\\Users\\Bob\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


We saw last time that there are many words that are not relevant to physics concepts in those Wikipedia pages. We are going to attempt to filter those with the simple following approach. 

We are going to compile a set of all the unique words in the `nobel_df` lists and a set of all the unique words in the `physics_df` lists. By taking the intersection of those 2 sets, we can subset the words corpus to something more relevant to physics.

>- Compile a set of all unique words in `nobel_df["physics_list"]`. You can use the function [`pd.sum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) to concatenate lists. You can cast the final list to a [`set`](https://docs.python.org/2/library/sets.html)
- Compile a set of all unique words in `physics_df["physics_list"]`.
- Compile the intersection of those 2 sets using the `intersection` function.

In [57]:
# TODO: find all the words in nobel_df["physics_list"]
all_nobel_words =  set(nobel_df['physics_list'].sum())

# TODO: find all the words in physics_df["physics_list"]
all_physics_words =  set(physics_df['physics_list'].sum())

# TODO: find all the intersection of all_nobel_words and all_physics_words
physics_corpus =  set(all_nobel_words.intersection(all_physics_words))

physics_corpus

NameError: name 'nobel_df' is not defined

In [8]:
#Checking intersection list length
print(len(physics_corpus), len(all_nobel_words), len(all_physics_words))

12725 31370 33494


>Write a function that keep only specific words from a list

In [9]:
# TODO: write a function that keep only specific words from a list
def keep_only(list_to_clean, corpus=physics_corpus):
    return list(filter(lambda x: x in corpus, list_to_clean))
    
nobel_df["physics_list_clean"] = nobel_df["physics_list"].apply(keep_only)
physics_df["physics_list_clean"] = physics_df["physics_list"].apply(keep_only)


## Compiling the nodes data <a class="anchor" id="Compiling the nodes data"></a>

For those 2 dataframes, we are going to create 2 additional columns:
    
>- create columns "length" that counts the number of words in each list. This column will be used to capture the size of the nodes in the networks. Basically we are going to say: the more words in the Wikipedia page, the more significant the physicist or physics domain is.
- create columns "group" with a unique value for each of those dataframes. Set the value to 1 in the `nobel_df` dataframe and 0 for the `physics_df` dataframe. This columns will be used to distinguish the physicists from the physics domains and attribute them different colors in the network visualization.

In [10]:
# TODO: compute the length of each list
nobel_df["length"] =  nobel_df['physics_list_clean'].apply(len)
physics_df["length"] =  physics_df['physics_list_clean'].apply(len)

# TODO: Set this column to 1
nobel_df["group"] =  1
# TODO: Set this column to 0 
physics_df["group"] =  0

Let's concatenate the those 2 dataframes into the `nodes_df` dataframe. 

>Use the [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to do so and only keep the "length" and "group" columns. The concatenation needs to be done along the row axis.

In [11]:
# TODO: concatenate those two dataframe into the nodes_df dataframe. 
# keep only the "length" and "group" columns.
nodes_df =   pd.concat([nobel_df[['length', 'group']], physics_df[['length','group']]], axis=0)

nodes_df.index.name = "id"

From this dataframe, we can easily format the data as a list of dictionaries as the d3.js library expects the data to be. We have the "length" attribute for the size of the node, the "group" attribute to distinguish between physicists and physics domains and each node has a unique "id" tag represented by the names.  

In [12]:
nodes_list = list(nodes_df.reset_index().transpose().to_dict().values())
nodes_list

[{'group': 1, 'id': 'Wilhelm Conrad Röntgen', 'length': 1293},
 {'group': 1, 'id': 'Hendrik Lorentz', 'length': 2745},
 {'group': 1, 'id': 'Pieter Zeeman', 'length': 986},
 {'group': 1, 'id': 'Antoine Henri Becquerel', 'length': 1159},
 {'group': 1, 'id': 'Pierre Curie', 'length': 1432},
 {'group': 1, 'id': 'Maria Skłodowska-Curie', 'length': 4966},
 {'group': 1, 'id': 'Lord Rayleigh', 'length': 1418},
 {'group': 1, 'id': 'Philipp Eduard Anton von Lenard', 'length': 1258},
 {'group': 1, 'id': 'Joseph John Thomson', 'length': 2901},
 {'group': 1, 'id': 'Albert Abraham Michelson', 'length': 2306},
 {'group': 1, 'id': 'Gabriel Lippmann', 'length': 1472},
 {'group': 1, 'id': 'Guglielmo Marconi', 'length': 4085},
 {'group': 1, 'id': 'Karl Ferdinand Braun', 'length': 840},
 {'group': 1, 'id': 'Johannes Diderik van der Waals', 'length': 2188},
 {'group': 1, 'id': 'Wilhelm Wien', 'length': 747},
 {'group': 1, 'id': 'Nils Gustaf Dalén', 'length': 779},
 {'group': 1, 'id': 'Heike Kamerlingh-Onne


## Compiling the links data <a class="anchor" id="Compiling the links data"></a>

We have the nodes, we need to find a way to connect them. We are going to compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between each of the wikipedia pages. It is called the cosine similarity because a dot product between 2 vectors $\mathbf{A}$ and $\mathbf{B}$ can be express as:
\begin{equation}
A\cdot B = \Vert A\Vert_2\Vert B\Vert_2\cos\theta
\end{equation}
where $\theta$ is the angle between the 2 vectors. Similarly:
\begin{equation}
\cos\theta = \frac{ A\cdot B}{\Vert A\Vert_2\Vert B\Vert_2}
\end{equation}
$\cos\theta\in[-1,1]$ and specifically $\cos\theta = 1$ if the 2 vectors are in the same direction or $\cos\theta = -1$ if the 2 vectors are in the opposite direction. The matter here becomes to be able to express a Wikipedia page as a vector. I suggest here a simple approach but there are many ways to achieve this. 

We defined earlier the corpus of physics words `physics_corpus`. Each word can be thought as a orthogonal basis defining a vector space where our Wikipedia pages are living in. Each component can be represented by the number of time a specific word appear in a page. As an example, imaging a page $P$ represented by the following list of words:
```
P_list = ["data", "data", "science", "python", "python"]
```
And let's imaging that we have a simple word corpus:
```
corpus = {"engineering", "data", "science", "python"}
```
Then in that basis $P$ could be represented by a vector 
\begin{equation}
\mathbf{P}=\left(\begin{matrix}
  0  \\
  2  \\
  1  \\
  2
 \end{matrix}\right)
\end{equation}

We need to express each Wikipedia page as such vector. 

>Let's start by creating a dataframe with as columns, all the indices of the `nodes_df` dataframe (`nodes_df.index.values`) and as index, the whole `physics_corpus` set.

In [13]:
# TODO: create a data frame with the index of nodes_df as columns and physics_corpus as index
words_vector =  pd.DataFrame(index=physics_corpus, columns = nodes_df.index.values)
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
wish,,,,,,,,,,,...,,,,,,,,,,
günter,,,,,,,,,,,...,,,,,,,,,,
irreversibility,,,,,,,,,,,...,,,,,,,,,,
daughter,,,,,,,,,,,...,,,,,,,,,,
bertotti,,,,,,,,,,,...,,,,,,,,,,
lifetime,,,,,,,,,,,...,,,,,,,,,,
highlevel,,,,,,,,,,,...,,,,,,,,,,
chinas,,,,,,,,,,,...,,,,,,,,,,
cline,,,,,,,,,,,...,,,,,,,,,,
qualifying,,,,,,,,,,,...,,,,,,,,,,


To fill this table we are going to use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function on the list of words contained in `nobel_df["physics_list_clean"]` and `physics_df["physics_list_clean"]`. 

>Write a function that takes a list and return a value_counts. The return value should be a pandas series with the words as index. We are using this function to populate the `words_vector` dataframe. Note that because `words_vector` already has an index, the values get populated at the right place automatically.

In [14]:
#TODO: write a function that take a list and return the a word count
def count_words(list_to_count):
    return pd.Series(list_to_count).value_counts()

words_vector.loc[:,nobel_df.index] = nobel_df["physics_list_clean"].apply(count_words).transpose()
words_vector.loc[:,physics_df.index] = physics_df["physics_list_clean"].apply(count_words).transpose()
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
wish,,,,,,1,2,,,,...,,,,,,,,,,
günter,,,,,,,,,,,...,,,,,,,,,,
irreversibility,,,,,,,,,,,...,,,,,,,,,,
daughter,,1,,,3,7,1,,2,4,...,,1,,,,,,,,
bertotti,,,,,,,,,,,...,,,,,,,,,,
lifetime,,,,,,,,1,,,...,,1,,,,,1,,,2
highlevel,,,,,,,,,,,...,,,,,,,,,,
chinas,,,,,,,,,,,...,,,,,,,,,,
cline,,,,,,,,,,,...,,,,,,,,,,
qualifying,,,,,,,,,,,...,,,,,,,,,,


There are many entries in this dataframe that appear as `NaN`. We just need to replace those missing values by 0 since they indicate that no records of the word were found for those pages. 

>Use the function [`pd.fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) to fill with 0.

In [15]:
# TODO: fill the missing values
words_vector = words_vector.fillna(0)
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
wish,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
günter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
irreversibility,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
daughter,0.0,1.0,0.0,0.0,3.0,7.0,1.0,0.0,2.0,4.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bertotti,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
lifetime,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
highlevel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
chinas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cline,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
qualifying,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


>- Write a function that takes 2 vectors (2 pandas series) and return the cosine similarity index. You can use the function [`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.dot.html), [`pow`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pow.html) and `sum` if you like.
- Use this function to fill the `similarity_df` dataframe
- Bonus points if you can compute this dataframe using matrix algebra ([`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.dot.html)) without having to iterate through the columns. Hint create 2 dataframe: one that is the dot products of words_vector with itself and one that represent a matrix of norm products. Then divide one matrix by the other element wise. For this case, you would not need to use the `compute_similarity` function.

In [16]:
# TODO: write a function that takes 2 vectors and return the cosine similarity index
def compute_similarity(vect1, vect2):
    return vect1.dot(vect2) / (np.sqrt(vect1.pow(2).sum()) * np.sqrt(vect2.pow(2).sum()))

similarity_df = pd.DataFrame(columns=words_vector.columns, index=words_vector.columns, dtype=float)

# TODO: fill the similarity_df dataframe with the cosine similarity
for i in range(336):
    for j in range(336):
        similarity_df.iloc[i,j] = compute_similarity(words_vector.T.iloc[i,:], words_vector.iloc[:,j])

# TODO: bonus points if you can compute this dataframe using matrix algebra 


similarity_df

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
Wilhelm Conrad Röntgen,1.000000,0.185980,0.235582,0.246754,0.243904,0.264306,0.279527,0.322531,0.271153,0.226106,...,0.069598,0.063835,0.067767,0.082096,0.075353,0.098687,0.100884,,,
Hendrik Lorentz,0.185980,1.000000,0.372728,0.220099,0.179496,0.181588,0.205124,0.255789,0.181050,0.272384,...,0.121991,0.073885,0.215487,0.205946,0.073176,0.290997,0.165452,,,
Pieter Zeeman,0.235582,0.372728,1.000000,0.257315,0.209804,0.210226,0.217910,0.282626,0.209114,0.268707,...,0.095153,0.074724,0.071284,0.098204,0.074647,0.120857,0.092422,,,
Antoine Henri Becquerel,0.246754,0.220099,0.257315,1.000000,0.310111,0.277176,0.199766,0.243003,0.213223,0.209687,...,0.072694,0.086675,0.064999,0.090554,0.088253,0.099746,0.108139,,,
Pierre Curie,0.243904,0.179496,0.209804,0.310111,1.000000,0.799271,0.231557,0.211746,0.201587,0.194248,...,0.073215,0.061602,0.058849,0.088676,0.106891,0.098525,0.105708,,,
Maria Skłodowska-Curie,0.264306,0.181588,0.210226,0.277176,0.799271,1.000000,0.252649,0.238631,0.197374,0.226771,...,0.089659,0.068067,0.068095,0.100521,0.104622,0.119297,0.142642,,,
Lord Rayleigh,0.279527,0.205124,0.217910,0.199766,0.231557,0.252649,1.000000,0.227093,0.346124,0.338095,...,0.079245,0.058784,0.139080,0.121577,0.078784,0.174022,0.116242,,,
Philipp Eduard Anton von Lenard,0.322531,0.255789,0.282626,0.243003,0.211746,0.238631,0.227093,1.000000,0.313150,0.254457,...,0.136562,0.075325,0.115500,0.152290,0.102614,0.206807,0.158576,,,
Joseph John Thomson,0.271153,0.181050,0.209114,0.213223,0.201587,0.197374,0.346124,0.313150,1.000000,0.262140,...,0.089759,0.081221,0.109685,0.133455,0.105196,0.160041,0.139934,,,
Albert Abraham Michelson,0.226106,0.272384,0.268707,0.209687,0.194248,0.226771,0.338095,0.254457,0.262140,1.000000,...,0.085421,0.095191,0.075190,0.089988,0.071441,0.121639,0.129082,,,


We are now going to "melt" `similarity_df` into a long dataframe: 

>- We need to reset the index of `similarity_df` ([`pd.reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html))
- Then we use the [`melt`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to melt the dataframe. Use the resetted index as `id_vars`. 

In [17]:
# TODO: reset the index and melt the dataframe
similarity_df =  similarity_df.reset_index()
melted_df = pd.melt(similarity_df, id_vars=['index'])

melted_df.columns = ["source", "target", "value"]
melted_df

Unnamed: 0,source,target,value
0,Wilhelm Conrad Röntgen,Wilhelm Conrad Röntgen,1.000000
1,Hendrik Lorentz,Wilhelm Conrad Röntgen,0.185980
2,Pieter Zeeman,Wilhelm Conrad Röntgen,0.235582
3,Antoine Henri Becquerel,Wilhelm Conrad Röntgen,0.246754
4,Pierre Curie,Wilhelm Conrad Röntgen,0.243904
5,Maria Skłodowska-Curie,Wilhelm Conrad Röntgen,0.264306
6,Lord Rayleigh,Wilhelm Conrad Röntgen,0.279527
7,Philipp Eduard Anton von Lenard,Wilhelm Conrad Röntgen,0.322531
8,Joseph John Thomson,Wilhelm Conrad Röntgen,0.271153
9,Albert Abraham Michelson,Wilhelm Conrad Röntgen,0.226106


You can see that at this point we have something close to what is necessary to create the links data. There are 3 things we need do to finalize our dataset. First, it is unnecessary to have "source" equal to the "target" (a node does not need to be linked to itself). Second, we have a duplicated links because in our case a "source" has the same role than a "target" and can be interchanged (our graph is not directed). Third, we need to subset the links set because there are too many for the program to run efficiently.

Let's shuffle the data set rowwise ([`sample`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html)). This help us to not bias our links selection due to a prior alphabetic ordering of the data

In [18]:
melted_df = melted_df.sample(frac=1.).reset_index(drop=True)
melted_df

Unnamed: 0,source,target,value
0,Philip Warren Anderson,Antoine Henri Becquerel,0.167537
1,Pierre Curie,spontaneous symmetry breaking,0.069079
2,Gabriel Lippmann,Takaaki Kajita,0.255798
3,optics,Saul Perlmutter,0.074754
4,Antoine Henri Becquerel,Werner Heisenberg,0.165486
5,Leon Neil Cooper,spectral line,0.062728
6,Felix Bloch,Steven Chu,0.263732
7,Subrahmanyan Chandrasekhar,John Hasbrouck Van Vleck,0.242608
8,applied physics,particle physics phenomenology,0.508666
9,Philip Warren Anderson,Cecil Frank Powell,0.284631


We then going to find the pairs of ("source", "target") that are equal to pairs ("target", "source"). To do that we are going to merge `melted_df` with itself where ("source", "target") = ("target", "source").

>- merge it with itself with `left_on=["source", "target"]` and `right_on=["target", "source"]`. Pass the dataframe with the index resetted using [`pd.reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html).

In [19]:
# TODO: merge melted_df with itself
melted_df = melted_df.reset_index()
merged_df =  pd.merge(left=melted_df, right=melted_df, left_on=['source', 'target'], right_on=['target','source'])

merged_df

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
0,0,Philip Warren Anderson,Antoine Henri Becquerel,0.167537,17835,Antoine Henri Becquerel,Philip Warren Anderson,0.167537
1,1,Pierre Curie,spontaneous symmetry breaking,0.069079,92305,spontaneous symmetry breaking,Pierre Curie,0.069079
2,2,Gabriel Lippmann,Takaaki Kajita,0.255798,104280,Takaaki Kajita,Gabriel Lippmann,0.255798
3,3,optics,Saul Perlmutter,0.074754,15475,Saul Perlmutter,optics,0.074754
4,4,Antoine Henri Becquerel,Werner Heisenberg,0.165486,93342,Werner Heisenberg,Antoine Henri Becquerel,0.165486
5,5,Leon Neil Cooper,spectral line,0.062728,92846,spectral line,Leon Neil Cooper,0.062728
6,6,Felix Bloch,Steven Chu,0.263732,96128,Steven Chu,Felix Bloch,0.263732
7,7,Subrahmanyan Chandrasekhar,John Hasbrouck Van Vleck,0.242608,77560,John Hasbrouck Van Vleck,Subrahmanyan Chandrasekhar,0.242608
8,8,applied physics,particle physics phenomenology,0.508666,33769,particle physics phenomenology,applied physics,0.508666
9,9,Philip Warren Anderson,Cecil Frank Powell,0.284631,13782,Cecil Frank Powell,Philip Warren Anderson,0.284631


At this point, we can see that each pair of ("source", "target") has the redondant equivalent ("target", "source"). This also highlight the cases where "source" = "target". To filter the useless rows we can simply pick the ("source", "target") pair or the ("target", "source") pick to remove. Let's choose which pair to remove by capturing the index we want to remove

>- Look at the pair of columns `merged_df[["index_x", "index_y"]]` and simple choose the greater between the two using [`max`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html). By selecting only the unique values ([`unique`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)) of the resulting list of indices we have selected the index to remove and we can drop them using [`drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). 

In [20]:
# TODO: find the index to drop
index_to_drop =  merged_df[['index_x','index_y']].max(axis=1).unique()

# TODO: use the index_to_drop to subset the melted_df dataframe
melted_df_sub = merged_df.drop(index_to_drop)
melted_df_sub

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
0,0,Philip Warren Anderson,Antoine Henri Becquerel,0.167537,17835,Antoine Henri Becquerel,Philip Warren Anderson,0.167537
1,1,Pierre Curie,spontaneous symmetry breaking,0.069079,92305,spontaneous symmetry breaking,Pierre Curie,0.069079
2,2,Gabriel Lippmann,Takaaki Kajita,0.255798,104280,Takaaki Kajita,Gabriel Lippmann,0.255798
3,3,optics,Saul Perlmutter,0.074754,15475,Saul Perlmutter,optics,0.074754
4,4,Antoine Henri Becquerel,Werner Heisenberg,0.165486,93342,Werner Heisenberg,Antoine Henri Becquerel,0.165486
5,5,Leon Neil Cooper,spectral line,0.062728,92846,spectral line,Leon Neil Cooper,0.062728
6,6,Felix Bloch,Steven Chu,0.263732,96128,Steven Chu,Felix Bloch,0.263732
7,7,Subrahmanyan Chandrasekhar,John Hasbrouck Van Vleck,0.242608,77560,John Hasbrouck Van Vleck,Subrahmanyan Chandrasekhar,0.242608
8,8,applied physics,particle physics phenomenology,0.508666,33769,particle physics phenomenology,applied physics,0.508666
9,9,Philip Warren Anderson,Cecil Frank Powell,0.284631,13782,Cecil Frank Powell,Philip Warren Anderson,0.284631


We have filtered quite a bit of rows but it still is too many for the network simulation to run efficiently. For each source, we are going to select the 10 highest values. 

>- Group `melted_df_sub` by "source" using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method and select the 10 targets that have the highest values using the [`nlargest`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nlargest.html) method.
- The resulting pandas Series has a multiindex with 2 levels. We need to get the level 1 of the multiindex to know which rows to keep in `melted_df_sub`. You can get it using the function [`get_level_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html) on the index. 

In [21]:
# TODO: Group melted_df_sub by "source" using the groupby method and select the 10 
# targets that have the highest values using the nlargest method
largest_df =  melted_df_sub.groupby('source_x')['value_x'].nlargest(10)

# TODO: get the level 1 of the multiindex
index_to_keep =  largest_df.index.get_level_values(1)
links_df = melted_df_sub.loc[index_to_keep]
links_df

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
37985,37985,Aage Bohr,Ben Roy Mottelson,0.649636,45870,Ben Roy Mottelson,Aage Bohr,0.649636
59067,59067,Aage Bohr,Leo James Rainwater,0.573598,90561,Leo James Rainwater,Aage Bohr,0.573598
52003,52003,Aage Bohr,Nicolaas Bloembergen,0.484146,81549,Nicolaas Bloembergen,Aage Bohr,0.484146
59769,59769,Aage Bohr,Val Logsdon Fitch,0.466606,83125,Val Logsdon Fitch,Aage Bohr,0.466606
1342,1342,Aage Bohr,Jerome I. Friedman,0.434127,105573,Jerome I. Friedman,Aage Bohr,0.434127
46015,46015,Aage Bohr,Andre Geim,0.429708,89971,Andre Geim,Aage Bohr,0.429708
88032,88032,Aage Bohr,Kai Manne Börje Siegbahn,0.427583,113744,Kai Manne Börje Siegbahn,Aage Bohr,0.427583
12650,12650,Aage Bohr,Bertram Brockhouse,0.419082,71591,Bertram Brockhouse,Aage Bohr,0.419082
22796,22796,Aage Bohr,Yoichiro Nambu,0.415775,94382,Yoichiro Nambu,Aage Bohr,0.415775
81987,81987,Aage Bohr,David J. Wineland,0.405996,92620,David J. Wineland,Aage Bohr,0.405996


In [22]:
#Cutting out unnecessary columns and renaming column names to look cleaner
links_df2 = links_df[['source_x', 'target_x', 'value_x']].copy()
links_df2.columns = ['source', 'target', 'value']
links_df2

Unnamed: 0,source,target,value
37985,Aage Bohr,Ben Roy Mottelson,0.649636
59067,Aage Bohr,Leo James Rainwater,0.573598
52003,Aage Bohr,Nicolaas Bloembergen,0.484146
59769,Aage Bohr,Val Logsdon Fitch,0.466606
1342,Aage Bohr,Jerome I. Friedman,0.434127
46015,Aage Bohr,Andre Geim,0.429708
88032,Aage Bohr,Kai Manne Börje Siegbahn,0.427583
12650,Aage Bohr,Bertram Brockhouse,0.419082
22796,Aage Bohr,Yoichiro Nambu,0.415775
81987,Aage Bohr,David J. Wineland,0.405996


We need to cast this data frame as a list of dictionaries as we have done for the list of nodes. 

>Use a similar code to than for the nodes to create a list of links: 

In [23]:
# TODO: create the list of links
links_list =  list(links_df2.transpose().to_dict().values())
links_list

[{'source': 'Aage Bohr',
  'target': 'Ben Roy Mottelson',
  'value': 0.6496358338199949},
 {'source': 'Aage Bohr',
  'target': 'Leo James Rainwater',
  'value': 0.5735984893350551},
 {'source': 'Aage Bohr',
  'target': 'Nicolaas Bloembergen',
  'value': 0.48414552726817844},
 {'source': 'Aage Bohr',
  'target': 'Val Logsdon Fitch',
  'value': 0.46660587882502436},
 {'source': 'Aage Bohr',
  'target': 'Jerome I. Friedman',
  'value': 0.43412684404660457},
 {'source': 'Aage Bohr', 'target': 'Andre Geim', 'value': 0.42970797701864916},
 {'source': 'Aage Bohr',
  'target': 'Kai Manne Börje Siegbahn',
  'value': 0.42758271079183063},
 {'source': 'Aage Bohr',
  'target': 'Bertram Brockhouse',
  'value': 0.41908191993562294},
 {'source': 'Aage Bohr',
  'target': 'Yoichiro Nambu',
  'value': 0.41577494725922026},
 {'source': 'Aage Bohr',
  'target': 'David J. Wineland',
  'value': 0.40599597458476433},
 {'source': 'Abdus Salam', 'target': 'Andre Geim', 'value': 0.312172016599622},
 {'source': 

We now create the final dictionary for the network and save it into a json file

In [24]:
import json
network_dict = {"nodes": nodes_list,
                "links": links_list}

with open("./data/physicists.json","w") as f:
    json.dump(network_dict, f, indent=4)

If you have a mac, the following script is going to open a safari window to visualize this network, otherwise you can just open the index.html file with Safari or Firefox. For some reason it does not work with chrome.

In [25]:
import os
os.system("open -a /Applications/Safari.app ./index.html")

256

Adjust the parameters to try to find nodes that tend to be grouped together. You can try to recreate the network with with different number of links.