# Social Network Data Visualisation

- [Physics as network](#Physics as network)
- [Getting the Data](#Getting the Data)
- [Cleaning the data](#Cleaning the data)
- [Compiling the nodes data](#Compiling the nodes data)
- [Compiling the links data](#Compiling the links data)

<img src="additional/facebook_world_friend_map.png" width=600>

We are using [d3.js](https://d3js.org/), a JavaScript data visualization library, to represent our networks. You can see an example of such a network [here](https://bl.ocks.org/mbostock/4062045).


## Physics as a social network  <a class="anchor" id="Physics as network"></a>

A social network can be represented as a graph with a set of nodes and a set of links between those nodes:

<img src="additional/small_undirected_network_labeled.png" width=300>

In the previous homework, we used data sets with physicists and physics domains. If we consider each physicist and each physics domain as a possible node on a network, the problem becomes building edges between them. Here is an example of how to represent graphs with Python data structures:

In [1]:
small_network = {
    "nodes": [
        {"id": "Albert Einstein"},
        {"id": "Paul Dirac"},
        {"id": "Niels Bohr"}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac"},
        {"source": "Albert Einstein", "target": "Niels Bohr"},
        {"source": "Paul Dirac", "target": "Niels Bohr"}
    ]
}

## We dump this network into a .json file
import json
import os

## Create the output directory first: open() does not create missing folders
os.makedirs("./data", exist_ok=True)
with open("./data/small_network.json", "w") as f:
    json.dump(small_network, f, indent=4)

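If you have a Mac, the following script will open a Safari window to visualize this network; otherwise you can just open the `small_network.html` file with Safari or Firefox. It does not work when the file is opened directly in Chrome, most likely because Chrome blocks a page opened from disk from loading other local files such as the `.json` data.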

In [2]:
import os
os.system("open -a /Applications/Safari.app ./small_network.html")

256

Obviously, that is not a very interesting network, and we are going to build a more substantial one! D3.js expects the data in the specific format shown above, and we are going to shape our data to follow those requirements. For each node, we need an "id" tag, and we can add other attributes if we wish. For the links, we need a "source" and a "target" to connect nodes, and we can add other attributes as well. 

In the following network, we are going to add the "length" attribute, which captures how big each node is, and the "value" attribute for each link, which captures how "strong" the link is. We are also going to distinguish between 2 types of nodes: the nodes for the physicists and the nodes for the physics domains. Those 2 sets of nodes will be distinguished by the "group" attribute. For example:

```
small_network = {
    "nodes": [
        {"id": "Albert Einstein", "group": 1, "length": 100},
        {"id": "Paul Dirac", "group": 1, "length": 200},
        {"id": "Niels Bohr", "group": 1, "length": 300}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac", "value": 0.5},
        {"source": "Albert Einstein", "target": "Niels Bohr", "value": 0.4},
        {"source": "Paul Dirac", "target": "Niels Bohr", "value": 0.3}
    ]
}
```
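
Before handing such a dictionary to the visualization, it can help to check that it really follows the format described above. The sketch below is only an illustration: `check_network` is a hypothetical helper, not part of d3.js or of the homework, and it only tests two assumptions (node ids are unique, and every link endpoint is a known node id).

```python
def check_network(network):
    """Return a list of problems found in a {"nodes": [...], "links": [...]} dict."""
    problems = []
    ids = [node.get("id") for node in network["nodes"]]
    known = set(ids)
    if len(known) != len(ids):
        problems.append("duplicate node ids")
    for link in network["links"]:
        for end in ("source", "target"):
            # Every link endpoint must refer to an existing node id
            if link.get(end) not in known:
                problems.append(f"unknown {end}: {link.get(end)!r}")
    return problems


example = {
    "nodes": [
        {"id": "Albert Einstein", "group": 1, "length": 100},
        {"id": "Paul Dirac", "group": 1, "length": 200},
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac", "value": 0.5},
        {"source": "Paul Dirac", "target": "Niels Bohr", "value": 0.3},
    ],
}
print(check_network(example))  # ["unknown target: 'Niels Bohr'"]
```

An empty list means the structure passed both checks.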

## Getting the Data <a class="anchor" id="Getting the Data"></a>

We are first going to gather the data needed for this project. We are going to extract the words in each Wikipedia page to understand the relation between each physicist and physics domain. 

In [3]:
## We get the nobel data set
import numpy as np
import pandas as pd
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer

class Parser:
    
    def __init__(self, url):  
        http = Http()
        status, response = http.request(url)
        tables = BeautifulSoup(response, "lxml", 
                              parse_only=SoupStrainer("table", {"class":"wikitable sortable"}))
        self.table = tables.contents[1]
    
    def parse_table(self):      
        rows = self.table.find_all("tr")
        header = self.parse_header(rows[0])
        table_array = [self.parse_row(row) for row in rows[1:]]
        table_df = pd.DataFrame(table_array, columns=header).apply(self.clean_table, 1)
        return table_df.replace({"Year":{'':np.nan}})
        
    def parse_row(self, row):     
        columns = row.find_all("td")
        return [col.get_text().strip() for col in columns if col.get_text() != '']
    
    def parse_header(self, row):     
        columns = row.find_all("th")
        return [col.get_text().strip() for col in columns if col.get_text() != ""]
    
    def clean_table(self, row):
        if not row.iloc[0].isdigit() and row.iloc[0] != '':
            return row.shift(1)
        else:
            return row
        
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics"        
parser = Parser(url)   
nobel_df = parser.parse_table()
nobel_df.columns = ["Year", "Laureate", "Country", "Rationale"]
nobel_df.dropna(subset=["Country"], inplace=True)
## fillna(method="ffill") and a positional axis in drop() are rejected by recent pandas versions
nobel_df.ffill(inplace=True)
nobel_df.drop(columns=["Year", "Country", "Rationale"], inplace=True)

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
link_df = pd.DataFrame([[x.string, x["href"]] for x in table.contents[1].find_all("a")],
                       columns=["Laureate", "link"]).drop_duplicates()

nobel_df = nobel_df.merge(link_df, on="Laureate", how="left")
nobel_df.set_index("Laureate", inplace=True)
nobel_df.drop_duplicates(inplace=True)
nobel_df

Laureate,link
Wilhelm Conrad Röntgen,/wiki/Wilhelm_R%C3%B6ntgen
Hendrik Lorentz,/wiki/Hendrik_Lorentz
Pieter Zeeman,/wiki/Pieter_Zeeman
Antoine Henri Becquerel,/wiki/Henri_Becquerel
Pierre Curie,/wiki/Pierre_Curie
Maria Skłodowska-Curie,/wiki/Maria_Sk%C5%82odowska-Curie
Lord Rayleigh,"/wiki/John_Strutt,_3rd_Baron_Rayleigh"
Philipp Eduard Anton von Lenard,/wiki/Philipp_Lenard
Joseph John Thomson,/wiki/J._J._Thomson
Albert Abraham Michelson,/wiki/Albert_Abraham_Michelson


We are also going to extract the links of each of the physics domains listed in the Research fields table of the [https://en.wikipedia.org/wiki/Physics](https://en.wikipedia.org/wiki/Physics) Wikipedia page.

In [4]:
## We get the physics links
url = "https://en.wikipedia.org/wiki/Physics"

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
physics_df = pd.DataFrame([[x.string.lower(), x["href"].lower()] for x in table.contents[2].find_all("a")],
                       columns=["Physics_domain", "link"]).drop_duplicates()

physics_df = physics_df.groupby("Physics_domain").first()
physics_df

Physics_domain,link
accelerator physics,/wiki/accelerator_physics
acoustics,/wiki/acoustics
agrophysics,/wiki/agrophysics
antimatter,/wiki/antimatter
applied physics,/wiki/applied_physics
astrometry,/wiki/astrometry
astronomy,/wiki/astronomy
astrophysics,/wiki/astrophysics
atom,/wiki/atom
atomic and molecular astrophysics,/wiki/atomic_and_molecular_astrophysics



## Cleaning the data <a class="anchor" id="Cleaning the data"></a>

>Use the code from your previous homework to create the different functions that get the word data and clean it.

In [5]:
from string import punctuation

## We get the bios
def get_text(link, root_website = "https://en.wikipedia.org"):    
    http = Http()
    status, response = http.request(root_website + link)
    body = BeautifulSoup(response, "lxml", parse_only=SoupStrainer("div", {"id":"mw-content-text"}))
    return BeautifulSoup.get_text(body.contents[1])

# TODO: copy your clean_string function from the previous homework
def clean_string(string):
    for p in punctuation + "1234567890":
        string = string.replace(p,'').lower()   
    return string

# TODO: copy your remove function from the previous homework
def remove(list_to_clean, element_to_remove=(None, "")):
    # Drop every entry that is one of the unwanted elements
    return list(filter(lambda x: x not in element_to_remove, list_to_clean))

# TODO: copy your remove_one function from the previous homework
def remove_one(list_to_clean):
    return list(filter(lambda x: len(x) > 1, list_to_clean))
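
As a quick sanity check, the helpers can be tried on a toy sentence without any network access. The block below is a self-contained sketch: the three functions are simplified re-implementations written for this illustration, and the sample sentence is made up.

```python
from string import punctuation


def clean_string(text):
    # Strip ASCII punctuation and digits, then lowercase once at the end
    for char in punctuation + "1234567890":
        text = text.replace(char, "")
    return text.lower()


def remove(words, unwanted=(None, "")):
    # Drop entries that are None or empty strings
    return [w for w in words if w not in unwanted]


def remove_one(words):
    # Drop single-character tokens
    return [w for w in words if len(w) > 1]


raw = "In 1905, A. Einstein published 4 papers!"
words = remove_one(remove(clean_string(raw).split()))
print(words)  # ['in', 'einstein', 'published', 'papers']
```

Note that `string.punctuation` only contains ASCII characters, so symbols such as the en dash are not stripped. That is why tokens like `'b–b'` can survive the cleaning.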


We are now going to write a function that takes a data frame with a "link" column and returns a column of lists of words. We basically aggregate all the above functions into one to reproduce what was done in the previous homework. Note that here we DO NOT use the function that keeps only the unique elements of each list, nor the one that filters on the number of occurrences.

> Write a function that applies all the previous functions to clean a text.

In [6]:
from nltk.corpus import stopwords
words_to_remove = set(stopwords.words('english'))

#remove stopwords
def remove_stopword(list_to_clean, remove_list = words_to_remove):
    return list(filter(lambda x: x not in remove_list, list_to_clean))

# TODO: aggregate all the above functions into one to return a list of words from each link
def clean_everything(df):
    return df['link'].apply(get_text) \
            .apply(clean_string) \
            .str.split() \
            .apply(remove) \
            .apply(remove_one) \
            .apply(remove_stopword)

    
    
physics_df["physics_list"] = clean_everything(physics_df)
nobel_df["physics_list"] = clean_everything(nobel_df)
nobel_df

Laureate,link,physics_list
Wilhelm Conrad Röntgen,/wiki/Wilhelm_R%C3%B6ntgen,"[wilhelm, röntgen, born, wilhelm, conrad, rönt..."
Hendrik Lorentz,/wiki/Hendrik_Lorentz,"[confused, hendrikus, albertus, lorentz, ludvi..."
Pieter Zeeman,/wiki/Pieter_Zeeman,"[pieter, zeeman, born, may, zonnemaire, nether..."
Antoine Henri Becquerel,/wiki/Henri_Becquerel,"[uses, see, becquerel, disambiguation, antoine..."
Pierre Curie,/wiki/Pierre_Curie,"[pierre, curie, born, may, paris, france, died..."
Maria Skłodowska-Curie,/wiki/Maria_Sk%C5%82odowska-Curie,"[article, polish, physicist, uses, see, marie,..."
Lord Rayleigh,"/wiki/John_Strutt,_3rd_Baron_Rayleigh","[lord, rayleigh, om, prs, born, november, lang..."
Philipp Eduard Anton von Lenard,/wiki/Philipp_Lenard,"[waterfall, effect, redirects, illusory, visua..."
Joseph John Thomson,/wiki/J._J._Thomson,"[article, nobel, laureate, physicist, moral, p..."
Albert Abraham Michelson,/wiki/Albert_Abraham_Michelson,"[confused, athlete, albert, michelsen, albert,..."


We saw last time that many words in those Wikipedia pages are not relevant to physics concepts. We are going to attempt to filter those out with the following simple approach. 

We are going to compile a set of all the unique words in the `nobel_df` lists and a set of all the unique words in the `physics_df` lists. By taking the intersection of those 2 sets, we can subset the words corpus to something more relevant to physics.

>- Compile a set of all unique words in `nobel_df["physics_list"]`. You can use the [`sum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) method to concatenate lists. You can cast the final list to a [`set`](https://docs.python.org/2/library/sets.html).
- Compile a set of all unique words in `physics_df["physics_list"]`.
- Compile the intersection of those 2 sets using the `intersection` function.

In [7]:
# TODO: find all the words in nobel_df["physics_list"]
all_nobel_words =  set(nobel_df['physics_list'].sum())

# TODO: find all the words in physics_df["physics_list"]
all_physics_words =  set(physics_df['physics_list'].sum())

# TODO: find all the intersection of all_nobel_words and all_physics_words
physics_corpus =  set(all_nobel_words.intersection(all_physics_words))

physics_corpus

{'supplemented',
 'reflections',
 'b–b',
 'mechanical',
 'importantly',
 'quaternions',
 'batavia',
 'ore',
 'cosmologies',
 'bodys',
 'rmathrm',
 'gutenberg',
 'alwyn',
 'staring',
 'lyndenbell',
 'oliver',
 'threeyear',
 'chemist',
 'repository',
 'olum',
 'angular',
 'measurementsedit',
 'nasas',
 'flint',
 'lima',
 'hong',
 'compared',
 'handson',
 'speculation',
 'chimney',
 'nov',
 'prevent',
 'capable',
 'hancock',
 'cengage',
 'unpublished',
 'pavia',
 'elaborated',
 'albert',
 'appeal',
 'ram',
 'violations',
 'licht',
 'forward',
 'fat',
 'example',
 'curl',
 'leaving',
 'see',
 'hypothesis',
 'hoop',
 'ordinarily',
 'mineralogy',
 'kandel',
 'substantive',
 'theoryedit',
 'aperture',
 'bag',
 'kerr',
 'virtue',
 'extended',
 'fermi–dirac',
 '♠−',
 'approval',
 'poles',
 'cemented',
 'joos',
 'endeavors',
 'squeezed',
 'summation',
 'beach',
 'selfsustaining',
 'focuses',
 'indirect',
 'dont',
 'books',
 'sherwood',
 'personified',
 'denver',
 'lakeside',
 'magic',
 'pursue',

In [8]:
#Checking intersection list length
print(len(physics_corpus), len(all_nobel_words), len(all_physics_words))

12723 31367 33486
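
The intersection step can be illustrated on a toy example. This sketch uses plain lists of made-up words instead of the real dataframes. It also uses `set().union(*lists)`, which is an alternative to concatenating with `sum` and not what the exercise asks for, and the same call should work on a pandas Series of lists.

```python
nobel_lists = [["quantum", "born", "paris"], ["quantum", "energy"]]
physics_lists = [["energy", "quantum", "field"], ["field", "wave"]]

# set().union(*lists) collects the unique words without first building one long list
all_nobel = set().union(*nobel_lists)
all_physics = set().union(*physics_lists)

# Only words seen on both sides are kept
corpus = all_nobel.intersection(all_physics)
print(sorted(corpus))  # ['energy', 'quantum']
```

Biographical words such as "born" or "paris" disappear because they never occur on the physics side, which is the effect we are after.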


>Write a function that keeps only specific words from a list

In [9]:
# TODO: write a function that keeps only specific words from a list
def keep_only(list_to_clean, corpus=physics_corpus):
    return list(filter(lambda x: x in corpus, list_to_clean))
    
nobel_df["physics_list_clean"] = nobel_df["physics_list"].apply(keep_only)
physics_df["physics_list_clean"] = physics_df["physics_list"].apply(keep_only)


## Compiling the nodes data <a class="anchor" id="Compiling the nodes data"></a>

For those 2 dataframes, we are going to create 2 additional columns:
    
>- Create a "length" column that counts the number of words in each list. This column will be used to set the size of the nodes in the network. Basically we are going to say: the more words in the Wikipedia page, the more significant the physicist or physics domain is.
- Create a "group" column with a single value for each of those dataframes. Set the value to 1 in the `nobel_df` dataframe and 0 in the `physics_df` dataframe. This column will be used to distinguish the physicists from the physics domains and give them different colors in the network visualization.

In [10]:
# TODO: compute the length of each list
nobel_df["length"] =  nobel_df['physics_list_clean'].apply(len)
physics_df["length"] =  physics_df['physics_list_clean'].apply(len)

# TODO: Set this column to 1
nobel_df["group"] =  1
# TODO: Set this column to 0 
physics_df["group"] =  0

Let's concatenate those 2 dataframes into the `nodes_df` dataframe. 

>Use the [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to do so and only keep the "length" and "group" columns. The concatenation needs to be done along the row axis.

In [11]:
# TODO: concatenate those two dataframes into the nodes_df dataframe. 
# keep only the "length" and "group" columns.
nodes_df =   pd.concat([nobel_df[['length', 'group']], physics_df[['length','group']]], axis=0)

nodes_df.index.name = "id"

From this dataframe, we can easily format the data as a list of dictionaries, which is the form the d3.js library expects. We have the "length" attribute for the size of the node, the "group" attribute to distinguish between physicists and physics domains, and a unique "id" tag for each node, given by its name.  
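
One possible way to do this conversion is `to_dict(orient="records")`, sketched here on a toy dataframe. The two rows and their values are made up for the illustration.

```python
import pandas as pd

toy_nodes = pd.DataFrame(
    {"length": [100, 200], "group": [1, 0]},
    index=pd.Index(["Albert Einstein", "optics"], name="id"),
)

# reset_index turns the "id" index into a regular column,
# and orient="records" yields one dictionary per row
records = toy_nodes.reset_index().to_dict(orient="records")
print(records)
```

Each dictionary then carries the "id", "length" and "group" keys of one node.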

In [12]:
nodes_list = list(nodes_df.reset_index().transpose().to_dict().values())
nodes_list

[{'group': 1, 'id': 'Wilhelm Conrad Röntgen', 'length': 1256},
 {'group': 1, 'id': 'Hendrik Lorentz', 'length': 2756},
 {'group': 1, 'id': 'Pieter Zeeman', 'length': 986},
 {'group': 1, 'id': 'Antoine Henri Becquerel', 'length': 1159},
 {'group': 1, 'id': 'Pierre Curie', 'length': 1431},
 {'group': 1, 'id': 'Maria Skłodowska-Curie', 'length': 4954},
 {'group': 1, 'id': 'Lord Rayleigh', 'length': 1418},
 {'group': 1, 'id': 'Philipp Eduard Anton von Lenard', 'length': 1258},
 {'group': 1, 'id': 'Joseph John Thomson', 'length': 2901},
 {'group': 1, 'id': 'Albert Abraham Michelson', 'length': 2306},
 {'group': 1, 'id': 'Gabriel Lippmann', 'length': 1472},
 {'group': 1, 'id': 'Guglielmo Marconi', 'length': 4074},
 {'group': 1, 'id': 'Karl Ferdinand Braun', 'length': 836},
 {'group': 1, 'id': 'Johannes Diderik van der Waals', 'length': 2188},
 {'group': 1, 'id': 'Wilhelm Wien', 'length': 747},
 {'group': 1, 'id': 'Nils Gustaf Dalén', 'length': 779},
 {'group': 1, 'id': 'Heike Kamerlingh-Onne


## Compiling the links data <a class="anchor" id="Compiling the links data"></a>

We have the nodes; now we need a way to connect them. We are going to compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between each pair of Wikipedia pages. It is called the cosine similarity because the dot product between 2 vectors $\mathbf{A}$ and $\mathbf{B}$ can be expressed as:
\begin{equation}
\mathbf{A}\cdot\mathbf{B} = \Vert\mathbf{A}\Vert_2\Vert\mathbf{B}\Vert_2\cos\theta
\end{equation}
where $\theta$ is the angle between the 2 vectors. Equivalently:
\begin{equation}
\cos\theta = \frac{\mathbf{A}\cdot\mathbf{B}}{\Vert\mathbf{A}\Vert_2\Vert\mathbf{B}\Vert_2}
\end{equation}
$\cos\theta\in[-1,1]$; specifically, $\cos\theta = 1$ if the 2 vectors point in the same direction and $\cos\theta = -1$ if they point in opposite directions. The problem then becomes expressing a Wikipedia page as a vector. I suggest a simple approach here, but there are many ways to achieve this. 
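
To make the formula concrete, here is a minimal sketch in plain Python. The function name `cosine` and the sample vectors are chosen only for this illustration.

```python
import math


def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1, 2], [2, 4]))    # same direction: close to 1
print(cosine([1, 0], [0, 3]))    # orthogonal: 0
print(cosine([1, 2], [-1, -2]))  # opposite direction: close to -1
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last digits.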

We defined earlier the corpus of physics words `physics_corpus`. Each word can be thought of as one axis of an orthogonal basis defining the vector space in which our Wikipedia pages live. Each component is the number of times a specific word appears in a page. As an example, imagine a page $P$ represented by the following list of words:
```
P_list = ["data", "data", "science", "python", "python"]
```
And let's imagine that we have a simple word corpus:
```
corpus = {"engineering", "data", "science", "python"}
```
Then in that basis, taking the corpus words in the order written above, $P$ could be represented by the vector 
\begin{equation}
\mathbf{P}=\left(\begin{matrix}
  0  \\
  2  \\
  1  \\
  2
 \end{matrix}\right)
\end{equation}

We need to express each Wikipedia page as such a vector. 
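
The toy example can be reproduced in a few lines. This is only a sketch: the `basis` list is an assumption introduced here, because a Python `set` has no guaranteed order and the components need a fixed one.

```python
from collections import Counter

P_list = ["data", "data", "science", "python", "python"]
# Fix the order of the basis explicitly (a set would not guarantee it)
basis = ["engineering", "data", "science", "python"]

counts = Counter(P_list)
# A Counter returns 0 for words that never appear in the page
P = [counts[word] for word in basis]
print(P)  # [0, 2, 1, 2]
```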

>Let's start by creating a dataframe whose columns are the indices of the `nodes_df` dataframe (`nodes_df.index.values`) and whose index is the whole `physics_corpus` set.

In [13]:
# TODO: create a data frame with the index of nodes_df as columns and physics_corpus as index
words_vector =  pd.DataFrame(index=physics_corpus, columns = nodes_df.index.values)
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
supplemented,,,,,,,,,,,...,,,,,,,,,,
reflections,,,,,,,,,,,...,,,,,,,,,,
b–b,,,,,,,,,,,...,,,,,,,,,,
mechanical,,,,,,,,,,,...,,,,,,,,,,
importantly,,,,,,,,,,,...,,,,,,,,,,
quaternions,,,,,,,,,,,...,,,,,,,,,,
batavia,,,,,,,,,,,...,,,,,,,,,,
ore,,,,,,,,,,,...,,,,,,,,,,
cosmologies,,,,,,,,,,,...,,,,,,,,,,
bodys,,,,,,,,,,,...,,,,,,,,,,


To fill this table we are going to use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function on the list of words contained in `nobel_df["physics_list_clean"]` and `physics_df["physics_list_clean"]`. 

>Write a function that takes a list and returns its `value_counts`. The return value should be a pandas Series with the words as index. We are using this function to populate the `words_vector` dataframe. Note that because `words_vector` already has an index, the values get populated at the right place automatically.

In [14]:
#TODO: write a function that takes a list and returns a word count
def count_words(list_to_count):
    return pd.Series(list_to_count).value_counts()

words_vector.loc[:,nobel_df.index] = nobel_df["physics_list_clean"].apply(count_words).transpose()
words_vector.loc[:,physics_df.index] = physics_df["physics_list_clean"].apply(count_words).transpose()
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
supplemented,,,,,,,,,,,...,,,,,,1,,,,
reflections,,,,,,,,,,1,...,,,1,1,,,,,,
b–b,,,,,,,,,,,...,,,,,,,,,,
mechanical,2,,,,,,,,,,...,,,2,1,,3,,,,
importantly,,,,,,,,,,,...,,,,,,,1,,,
quaternions,,,,,,,,,,,...,,,1,,,,,,,
batavia,,,,,,,,,,,...,,,,,,,,,,
ore,,,,,,2,,,,,...,,,,,,,,,,
cosmologies,,,,,,,,,,,...,,,,,,,1,,,
bodys,,,,1,,,,,,,...,,,,,,,,,,


There are many entries in this dataframe that appear as `NaN`. We just need to replace those missing values by 0 since they indicate that no records of the word were found for those pages. 

>Use the function [`pd.fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) to fill with 0.

In [15]:
# TODO: fill the missing values
words_vector = words_vector.fillna(0)
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
supplemented,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
reflections,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
b–b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mechanical,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,1.0,0.0,3.0,0.0,0.0,0.0,0.0
importantly,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
quaternions,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
batavia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ore,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cosmologies,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
bodys,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


>- Write a function that takes 2 vectors (2 pandas Series) and returns the cosine similarity index. You can use the functions [`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.dot.html), [`pow`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pow.html) and `sum` if you like.
- Use this function to fill the `similarity_df` dataframe.
- Bonus points if you can compute this dataframe using matrix algebra ([`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.dot.html)) without having to iterate through the columns. Hint: create 2 dataframes, one that holds the dot products of `words_vector` with itself and one that holds the matrix of norm products. Then divide one matrix by the other element-wise. In this case, you would not need to use the `compute_similarity` function.

In [16]:
# TODO: write a function that takes 2 vectors and return the cosine similarity index
def compute_similarity(vect1, vect2):
    return vect1.dot(vect2) / (np.sqrt(vect1.pow(2).sum()) * np.sqrt(vect2.pow(2).sum()))

similarity_df = pd.DataFrame(columns=words_vector.columns, index=words_vector.columns, dtype=float)

# TODO: fill the similarity_df dataframe with the cosine similarity
n_nodes = len(words_vector.columns)  # cover every node instead of a hard-coded count
for i in range(n_nodes):
    for j in range(n_nodes):
        similarity_df.iloc[i,j] = compute_similarity(words_vector.iloc[:,i], words_vector.iloc[:,j])

# TODO: bonus points if you can compute this dataframe using matrix algebra 


similarity_df

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
Wilhelm Conrad Röntgen,1.000000,0.187495,0.234174,0.246212,0.243290,0.259589,0.278950,0.323920,0.270947,0.225047,...,0.069528,0.062356,0.067698,0.082119,0.075613,0.096406,0.100489,,,
Hendrik Lorentz,0.187495,1.000000,0.374019,0.220995,0.180621,0.182818,0.205957,0.257049,0.181743,0.273420,...,0.121976,0.074170,0.216167,0.206869,0.073594,0.291843,0.167898,,,
Pieter Zeeman,0.234174,0.374019,1.000000,0.257315,0.209661,0.210837,0.217910,0.282626,0.209114,0.268707,...,0.095191,0.074724,0.071180,0.098205,0.074647,0.119626,0.093208,,,
Antoine Henri Becquerel,0.246212,0.220995,0.257315,1.000000,0.309748,0.277870,0.199766,0.243003,0.213223,0.209687,...,0.072723,0.086675,0.064575,0.090555,0.088253,0.098324,0.109370,,,
Pierre Curie,0.243290,0.180621,0.209661,0.309748,1.000000,0.799027,0.231758,0.211606,0.201191,0.194121,...,0.073278,0.061577,0.058881,0.088355,0.102899,0.096951,0.106298,,,
Maria Skłodowska-Curie,0.259589,0.182818,0.210837,0.277870,0.799027,1.000000,0.253190,0.239254,0.197890,0.227425,...,0.089972,0.068374,0.068389,0.101055,0.105122,0.118168,0.143752,,,
Lord Rayleigh,0.278950,0.205957,0.217910,0.199766,0.231758,0.253190,1.000000,0.227093,0.346124,0.338095,...,0.079277,0.058784,0.138953,0.121579,0.078784,0.173256,0.117252,,,
Philipp Eduard Anton von Lenard,0.323920,0.257049,0.282626,0.243003,0.211606,0.239254,0.227093,1.000000,0.313150,0.254457,...,0.136616,0.075325,0.115452,0.152293,0.102614,0.204830,0.160181,,,
Joseph John Thomson,0.270947,0.181743,0.209114,0.213223,0.201191,0.197890,0.346124,0.313150,1.000000,0.262140,...,0.089795,0.081221,0.109702,0.133458,0.105196,0.158560,0.141266,,,
Albert Abraham Michelson,0.225047,0.273420,0.268707,0.209687,0.194121,0.227425,0.338095,0.254457,0.262140,1.000000,...,0.085456,0.095191,0.075166,0.089990,0.071441,0.120201,0.130093,,,
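
The matrix-algebra route suggested in the bonus item could look like the sketch below. It runs on a toy dataframe with made-up counts, and the names `toy_vectors`, `dot_products` and `norm_products` are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# Words as index, pages as columns, like words_vector
toy_vectors = pd.DataFrame(
    {"page_a": [2.0, 0.0, 1.0], "page_b": [4.0, 0.0, 2.0], "page_c": [0.0, 3.0, 0.0]},
    index=["data", "science", "python"],
)

# (pages x pages) matrix of dot products
dot_products = toy_vectors.T.dot(toy_vectors)

# One Euclidean norm per page, then the matrix of pairwise norm products
norms = np.sqrt((toy_vectors ** 2).sum(axis=0))
norm_products = pd.DataFrame(
    np.outer(norms, norms),
    index=dot_products.index,
    columns=dot_products.columns,
)

# Element-wise division gives the cosine similarity of every pair at once
toy_similarity = dot_products / norm_products
print(toy_similarity.round(3))
```

Here `page_a` and `page_b` point in the same direction and get a similarity of about 1, while `page_c` shares no word with them and gets 0. A page whose vector is all zeros would have a norm of 0 and produce `NaN` in its row and column.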


We are now going to "melt" `similarity_df` into a long dataframe: 

>- We need to reset the index of `similarity_df` ([`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)).
- Then we use the [`melt`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to melt the dataframe. Use the reset index as `id_vars`. 

In [17]:
# TODO: reset the index and melt the dataframe
similarity_df =  similarity_df.reset_index()
melted_df = pd.melt(similarity_df, id_vars=['index'])

melted_df.columns = ["source", "target", "value"]
melted_df

Unnamed: 0,source,target,value
0,Wilhelm Conrad Röntgen,Wilhelm Conrad Röntgen,1.000000
1,Hendrik Lorentz,Wilhelm Conrad Röntgen,0.187495
2,Pieter Zeeman,Wilhelm Conrad Röntgen,0.234174
3,Antoine Henri Becquerel,Wilhelm Conrad Röntgen,0.246212
4,Pierre Curie,Wilhelm Conrad Röntgen,0.243290
5,Maria Skłodowska-Curie,Wilhelm Conrad Röntgen,0.259589
6,Lord Rayleigh,Wilhelm Conrad Röntgen,0.278950
7,Philipp Eduard Anton von Lenard,Wilhelm Conrad Röntgen,0.323920
8,Joseph John Thomson,Wilhelm Conrad Röntgen,0.270947
9,Albert Abraham Michelson,Wilhelm Conrad Röntgen,0.225047
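
To see what the melting step does, here is a small sketch on a 2 x 2 toy similarity matrix with made-up values.

```python
import pandas as pd

toy_similarity = pd.DataFrame(
    [[1.0, 0.3], [0.3, 1.0]],
    index=["Pierre Curie", "optics"],
    columns=["Pierre Curie", "optics"],
)

# reset_index moves the row labels into a column named "index";
# melt then stacks every remaining column into (variable, value) pairs
toy_long = pd.melt(toy_similarity.reset_index(), id_vars=["index"])
toy_long.columns = ["source", "target", "value"]
print(toy_long)
```

A square matrix with n rows becomes a long table with n x n rows, one per (source, target) pair, including the diagonal and both orientations of each pair.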


You can see that at this point we have something close to what is needed to create the links data. There are 3 things we need to do to finalize our dataset. First, it is unnecessary to have a "source" equal to its "target" (a node does not need to be linked to itself). Second, we have duplicated links because in our case a "source" plays the same role as a "target" and the two can be interchanged (our graph is not directed). Third, we need to subset the links because there are too many for the program to run efficiently.

Let's shuffle the data set row-wise ([`sample`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html)). This helps us avoid biasing our links selection through a prior alphabetical ordering of the data.

In [18]:
melted_df = melted_df.sample(frac=1.).reset_index(drop=True)
melted_df

Unnamed: 0,source,target,value
0,William Bradford Shockley,applied physics,0.193861
1,seesaw mechanism,fluid dynamics,0.099984
2,photon,Robert Hofstadter,0.218224
3,Masatoshi Koshiba,Nils Gustaf Dalén,0.389865
4,Jerome I. Friedman,gravitational,0.083606
5,Eric Allin Cornell,superconductor,0.109921
6,Theodor W. Hänsch,planet,0.052566
7,Ivar Giaever,optoelectronics,0.094194
8,James Watson Cronin,bloch wave,0.064355
9,David J. Thouless,engineering physics,0.199512


We are then going to find the pairs of ("source", "target") that are equal to pairs of ("target", "source"). To do that we are going to merge `melted_df` with itself where ("source", "target") = ("target", "source").

>- Merge it with itself with `left_on=["source", "target"]` and `right_on=["target", "source"]`. Pass the dataframe with its index reset using [`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html).

In [19]:
# TODO: merge melted_df with itself
melted_df = melted_df.reset_index()
merged_df =  pd.merge(left=melted_df, right=melted_df, left_on=['source', 'target'], right_on=['target','source'])

merged_df

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
0,0,William Bradford Shockley,applied physics,0.193861,39099,applied physics,William Bradford Shockley,0.193861
1,1,seesaw mechanism,fluid dynamics,0.099984,17110,fluid dynamics,seesaw mechanism,0.099984
2,2,photon,Robert Hofstadter,0.218224,54423,Robert Hofstadter,photon,0.218224
3,3,Masatoshi Koshiba,Nils Gustaf Dalén,0.389865,77280,Nils Gustaf Dalén,Masatoshi Koshiba,0.389865
4,4,Jerome I. Friedman,gravitational,0.083606,84263,gravitational,Jerome I. Friedman,0.083606
5,5,Eric Allin Cornell,superconductor,0.109921,36857,superconductor,Eric Allin Cornell,0.109921
6,6,Theodor W. Hänsch,planet,0.052566,114261,planet,Theodor W. Hänsch,0.052566
7,7,Ivar Giaever,optoelectronics,0.094194,97530,optoelectronics,Ivar Giaever,0.094194
8,8,James Watson Cronin,bloch wave,0.064355,15993,bloch wave,James Watson Cronin,0.064355
9,9,David J. Thouless,engineering physics,0.199512,54232,engineering physics,David J. Thouless,0.199512


At this point, we can see that each ("source", "target") pair has a redundant ("target", "source") equivalent. This also highlights the cases where "source" = "target". To filter out the useless rows, we simply pick either the ("source", "target") row or the ("target", "source") row to remove. Let's make that choice by capturing the index we want to remove.

>- Look at the pair of columns `merged_df[["index_x", "index_y"]]` and simply choose the greater of the two using [`max`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html). By selecting only the unique values ([`unique`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)) of the resulting list of indices, we have selected the indices to remove, and we can drop them using [`drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). 

In [20]:
# TODO: find the index to drop
index_to_drop =  merged_df[['index_x','index_y']].max(axis=1).unique()

# TODO: use the index_to_drop to subset the merged_df dataframe (its row labels match index_x)
melted_df_sub = merged_df.drop(index_to_drop)
melted_df_sub

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
0,0,William Bradford Shockley,applied physics,0.193861,39099,applied physics,William Bradford Shockley,0.193861
1,1,seesaw mechanism,fluid dynamics,0.099984,17110,fluid dynamics,seesaw mechanism,0.099984
2,2,photon,Robert Hofstadter,0.218224,54423,Robert Hofstadter,photon,0.218224
3,3,Masatoshi Koshiba,Nils Gustaf Dalén,0.389865,77280,Nils Gustaf Dalén,Masatoshi Koshiba,0.389865
4,4,Jerome I. Friedman,gravitational,0.083606,84263,gravitational,Jerome I. Friedman,0.083606
5,5,Eric Allin Cornell,superconductor,0.109921,36857,superconductor,Eric Allin Cornell,0.109921
6,6,Theodor W. Hänsch,planet,0.052566,114261,planet,Theodor W. Hänsch,0.052566
7,7,Ivar Giaever,optoelectronics,0.094194,97530,optoelectronics,Ivar Giaever,0.094194
8,8,James Watson Cronin,bloch wave,0.064355,15993,bloch wave,James Watson Cronin,0.064355
9,9,David J. Thouless,engineering physics,0.199512,54232,engineering physics,David J. Thouless,0.199512
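
To see why keeping the larger index of each pair works, here is a minimal sketch on a made-up data frame. The names `toy`, `merged` and `kept` and the letters used as nodes are assumptions for illustration only, not part of the homework data. Each row and its mirror image share the same maximum index, so exactly one of the two is dropped, and a self-link (source equal to target) is its own mirror and is always dropped.

```python
import pandas as pd

# Made-up links: every pair appears in both directions, plus one self-link (A, A).
toy = pd.DataFrame({
    "source": ["A", "B", "A", "C", "B", "C", "A"],
    "target": ["B", "A", "C", "A", "C", "B", "A"],
    "value":  [0.9, 0.9, 0.4, 0.4, 0.2, 0.2, 1.0],
}).reset_index()

# Pair each row with its mirror image (source and target swapped).
merged = toy.merge(toy, left_on=["source", "target"], right_on=["target", "source"])

# A row and its mirror have the same max index, so this lists one row per pair.
index_to_drop = merged[["index_x", "index_y"]].max(axis=1).unique()

# Filter on the index_x column so we do not depend on how merge numbers its rows.
kept = merged[~merged["index_x"].isin(index_to_drop)]
print(kept[["source_x", "target_x", "value_x"]])
```

Filtering on `index_x` with `isin` is a small design choice made here: it gives the same result as `drop` whenever the merged row labels coincide with `index_x`, but it does not rely on that coincidence.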


We have filtered out quite a few rows, but there are still too many for the network simulation to run efficiently. For each source, we are going to keep only the 10 highest values. 

>- Group `melted_df_sub` by "source" (the `source_x` column) using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method and select, for each source, the 10 targets that have the highest values using the [`nlargest`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nlargest.html) method.
>- The resulting pandas Series has a MultiIndex with 2 levels. Level 1 of this MultiIndex holds the original row labels, so it tells us which rows to keep in `melted_df_sub`. You can get it using the method [`get_level_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html) on the index. 

In [21]:
# TODO: Group melted_df_sub by "source" using the groupby method and select the 10 
# targets that have the highest values using the nlargest method
largest_df =  melted_df_sub.groupby('source_x')['value_x'].nlargest(10)

# TODO: get the level 1 of the multiindex
index_to_keep =  largest_df.index.get_level_values(1)
links_df = melted_df_sub.loc[index_to_keep]
links_df

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
100144,100144,Aage Bohr,Niels Bohr,0.809488,112897,Niels Bohr,Aage Bohr,0.809488
17472,17472,Aage Bohr,Nicolaas Bloembergen,0.484249,112014,Nicolaas Bloembergen,Aage Bohr,0.484249
26845,26845,Aage Bohr,Val Logsdon Fitch,0.466029,110427,Val Logsdon Fitch,Aage Bohr,0.466029
75285,75285,Aage Bohr,James Franck,0.456310,114236,James Franck,Aage Bohr,0.456310
106627,106627,Aage Bohr,Jerome I. Friedman,0.434219,114398,Jerome I. Friedman,Aage Bohr,0.434219
8792,8792,Aage Bohr,Frederick Reines,0.429680,70803,Frederick Reines,Aage Bohr,0.429680
78152,78152,Aage Bohr,Kai Manne Börje Siegbahn,0.427674,114701,Kai Manne Börje Siegbahn,Aage Bohr,0.427674
41199,41199,Aage Bohr,Norman Foster Ramsey,0.426393,57408,Norman Foster Ramsey,Aage Bohr,0.426393
41042,41042,Aage Bohr,Yoichiro Nambu,0.415296,46971,Yoichiro Nambu,Aage Bohr,0.415296
41262,41262,Aage Bohr,Toshihide Maskawa,0.406069,69300,Toshihide Maskawa,Aage Bohr,0.406069
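
The shape of the MultiIndex is easier to see on a small example. The sketch below uses a made-up data frame (`links`, with letters as node names and arbitrary row labels 10 to 15, all assumptions for illustration) and keeps the 2 largest values per source instead of 10.

```python
import pandas as pd

# Made-up links with non-trivial row labels, three rows per source.
links = pd.DataFrame({
    "source": ["A", "A", "A", "B", "B", "B"],
    "target": ["B", "C", "D", "C", "D", "E"],
    "value":  [0.9, 0.4, 0.7, 0.2, 0.5, 0.3],
}, index=[10, 11, 12, 13, 14, 15])

# For each source, keep the 2 largest values.
# The result is a Series indexed by (source, original row label).
top = links.groupby("source")["value"].nlargest(2)
print(top)

# Level 1 of the MultiIndex holds the original row labels.
index_to_keep = top.index.get_level_values(1)
top_links = links.loc[index_to_keep]
print(top_links)
```

Because level 1 carries the original labels, `links.loc[index_to_keep]` recovers the full rows (including "target"), which the grouped Series alone does not contain.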


In [22]:
# Keep only the columns we need and give them cleaner names
links_df2 = links_df[['source_x', 'target_x', 'value_x']].copy()
links_df2.columns = ['source', 'target', 'value']
links_df2

Unnamed: 0,source,target,value
100144,Aage Bohr,Niels Bohr,0.809488
17472,Aage Bohr,Nicolaas Bloembergen,0.484249
26845,Aage Bohr,Val Logsdon Fitch,0.466029
75285,Aage Bohr,James Franck,0.456310
106627,Aage Bohr,Jerome I. Friedman,0.434219
8792,Aage Bohr,Frederick Reines,0.429680
78152,Aage Bohr,Kai Manne Börje Siegbahn,0.427674
41199,Aage Bohr,Norman Foster Ramsey,0.426393
41042,Aage Bohr,Yoichiro Nambu,0.415296
41262,Aage Bohr,Toshihide Maskawa,0.406069


We need to convert this data frame into a list of dictionaries, as we did for the list of nodes. 

>Use code similar to that used for the nodes to create a list of links: 

In [26]:
# TODO: create the list of links
links_list =  list(links_df2.transpose().to_dict().values())
links_list

[{'source': 'Aage Bohr', 'target': 'Niels Bohr', 'value': 0.8094877678877767},
 {'source': 'Aage Bohr',
  'target': 'Nicolaas Bloembergen',
  'value': 0.4842485435315032},
 {'source': 'Aage Bohr',
  'target': 'Val Logsdon Fitch',
  'value': 0.46602932287047916},
 {'source': 'Aage Bohr',
  'target': 'James Franck',
  'value': 0.45631043675497934},
 {'source': 'Aage Bohr',
  'target': 'Jerome I. Friedman',
  'value': 0.4342192173574456},
 {'source': 'Aage Bohr',
  'target': 'Frederick Reines',
  'value': 0.42968015212539395},
 {'source': 'Aage Bohr',
  'target': 'Kai Manne Börje Siegbahn',
  'value': 0.42767369164499813},
 {'source': 'Aage Bohr',
  'target': 'Norman Foster Ramsey',
  'value': 0.4263934942174235},
 {'source': 'Aage Bohr',
  'target': 'Yoichiro Nambu',
  'value': 0.4152957025808827},
 {'source': 'Aage Bohr',
  'target': 'Toshihide Maskawa',
  'value': 0.4060685239433306},
 {'source': 'Abdus Salam',
  'target': 'Sheldon Lee Glashow',
  'value': 0.33038895207158464},
 {'sour
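
As a side note, pandas can also produce this list directly with `to_dict(orient="records")`. The sketch below compares the two approaches on a made-up two-row frame (`links_demo` is a hypothetical name; the values are copied, rounded, from the table above).

```python
import pandas as pd

# Two rows in the same shape as links_df2, for illustration only.
links_demo = pd.DataFrame({
    "source": ["Aage Bohr", "Aage Bohr"],
    "target": ["Niels Bohr", "James Franck"],
    "value":  [0.809488, 0.456310],
}, index=[100144, 75285])

# Approach used in this notebook: transpose, then take the per-row dictionaries.
via_transpose = list(links_demo.transpose().to_dict().values())

# Alternative: ask pandas for one dictionary per row directly.
via_records = links_demo.to_dict(orient="records")

print(via_records)
print(via_transpose == via_records)
```

The `records` orientation ignores the row labels, so it does not depend on the index being unique, whereas the transpose approach uses the row labels as dictionary keys.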

We now create the final dictionary for the network and save it into a JSON file:

In [27]:
import json
import os

network_dict = {"nodes": nodes_list,
                "links": links_list}

# Create the output folder first: open() raises FileNotFoundError if it is missing
os.makedirs("./data", exist_ok=True)
with open("./data/physicists.json", "w") as f:
    json.dump(network_dict, f, indent=4)
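
Before opening the visualisation, it can help to check that every link points to a node that actually exists, since a force layout generally cannot place a link whose endpoint is missing from the node list. The helper below is a minimal sketch; `find_dangling_links` and `demo_network` are hypothetical names, and it assumes each node is a dictionary with an "id" key, as in the small network at the top of this notebook.

```python
def find_dangling_links(network):
    """Return the links whose source or target is not a known node id."""
    node_ids = {node["id"] for node in network["nodes"]}
    return [
        link for link in network["links"]
        if link["source"] not in node_ids or link["target"] not in node_ids
    ]

# Made-up network: the second link refers to a node that is not listed.
demo_network = {
    "nodes": [{"id": "Aage Bohr"}, {"id": "Niels Bohr"}],
    "links": [
        {"source": "Aage Bohr", "target": "Niels Bohr", "value": 0.809488},
        {"source": "Aage Bohr", "target": "James Franck", "value": 0.456310},
    ],
}

dangling = find_dangling_links(demo_network)
print(dangling)
```

Calling `find_dangling_links(network_dict)` on the real data should return an empty list if the nodes and links were compiled consistently.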

If you have a Mac, the following script opens a Safari window to visualize this network. Otherwise, you can just open the `index.html` file with Safari or Firefox. For some reason, it does not work with Chrome.

In [None]:
import os
os.system("open -a /Applications/Safari.app ./index.html")
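
If the page stays empty in your browser, one possible explanation (an assumption, not verified here) is that the browser refuses to let a page opened from `file://` load a local JSON file. A common workaround is to serve the folder over HTTP with Python's standard library. The sketch below is one way to do that; `serve_directory` and `page_url` are hypothetical helper names.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory=".", port=0):
    """Serve `directory` on localhost in a background thread.

    port=0 lets the operating system pick a free port.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def page_url(server, page="index.html"):
    """Build the URL of `page` on a server returned by serve_directory."""
    host, port = server.server_address[:2]
    return f"http://{host}:{port}/{page}"
```

To use it from the notebook, start the server with `server = serve_directory(".")`, open the page in the default browser with `import webbrowser; webbrowser.open(page_url(server))`, and call `server.shutdown()` when you are done.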

Adjust the parameters to try to find nodes that tend to be grouped together. You can also try to recreate the network with a different number of links.

In [25]:
ls

additional/               index.html*          Text Mining Project.ipynb
Homework2.ipynb           small_network.html*
Homework2_JayHwang.ipynb  small_network.json
