# Social Network Data Visualization

This project builds upon the previous one we completed: "Classifying Nobel Prize Physicists by Physics Field."

In that project, we assigned each Nobel Prize Physicist to one of 6 physics subfields. Now, we graphically represent all these physicists (along with various physics domains) as a social network.

## Social network - The basics
A social network consists of nodes and links between those nodes. We produce a small social network as a JSON file, where the nodes are physicists and the links are the connections between them. The file produced is named "small_network.json". To visualize this network, open the "small_network.html" file (provided by Damien Benveniste) in your preferred browser. Both the JSON and HTML files are available in the repository.

NOTE: In order to work, the json file MUST be in a folder titled "data", which is itself within the folder containing the html file.

In [1]:
small_network = {
    "nodes": [
        {"id": "Albert Einstein"},
        {"id": "Paul Dirac"},
        {"id": "Niels Bohr"}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac"},
        {"source": "Albert Einstein", "target": "Niels Bohr"},
        {"source": "Paul Dirac", "target": "Niels Bohr"}
    ]
}

## We dump this network into a .json file
import json
## Forward slashes avoid invalid escape sequences such as "\d" and work on every platform
with open("data/small_network.json", "w") as f:
    json.dump(small_network, f, indent=4)
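
The folder requirement in the note above can be handled in code rather than by hand. The following is a minimal sketch, not part of the original notebook: the helper name `write_network` is hypothetical, and a temporary directory stands in for the folder that holds the html file. It creates the "data" folder if it is missing and builds the path with `pathlib` instead of a hard-coded separator.

```python
import json
import tempfile
from pathlib import Path


def write_network(network, html_dir, filename):
    """Write `network` to <html_dir>/data/<filename> and return the path."""
    data_dir = Path(html_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # create "data" if missing
    out_path = data_dir / filename
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(network, f, indent=4)
    return out_path


demo_network = {
    "nodes": [{"id": "Albert Einstein"}, {"id": "Paul Dirac"}],
    "links": [{"source": "Albert Einstein", "target": "Paul Dirac"}],
}

# A temporary directory stands in for the folder holding the html file.
with tempfile.TemporaryDirectory() as html_dir:
    path = write_network(demo_network, html_dir, "small_network.json")
    with open(path, encoding="utf-8") as f:
        round_trip = json.load(f)
```

Reading the file back and comparing it with the original dictionary is a cheap way to confirm that the export is valid JSON before opening the browser.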

The network just created is simple. The rest of this project builds a more complex social network, connecting Nobel physicists and various physics topics.

## Getting physicists and physics domains data

We are going to gather the words from the Wikipedia pages for physicists and physics domains. This is the information we use to determine which physicists/domains are connected to which, and how strong each connection is.

We first get the links to each Nobel physicist's and each physics domain's Wikipedia page. We identify the Nobel physicists from a table at "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics" (as of 12 Jul 2018), which lists them all. We get the physics topics from the Research table at "https://en.wikipedia.org/wiki/Physics", which lists all these physics domains.

- ### Gathering physicists links
We gather the links for the physicists. This is largely done using the code from the previous project, "Classifying Nobel Prize Physicists by Physics Field." However, we drop the Year, Country, and Rationale columns that we gather. We then use Beautiful Soup to parse the links on the Wikipedia page. At the end of this we have a df with the physicist ("Laureate") as the index, containing the link to each physicist's Wikipedia page.

In [3]:
## We get the nobel data set
import numpy as np
import pandas as pd
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer

class Parser:
    
    def __init__(self, url):  
        http = Http()
        status, response = http.request(url)
        tables = BeautifulSoup(response, "lxml", 
                              parse_only=SoupStrainer("table", {"class":"wikitable plainrowheaders sortable"}))
        self.table = tables.contents[1]
    
    def parse_table(self):      
        rows = self.table.find_all("tr")
        header = self.parse_header(rows[0])
        table_array = [self.parse_row(row) for row in rows[1:]]
        table_df = pd.DataFrame(table_array, columns=header).apply(self.clean_table, 1)
        table_df = table_df.replace({"Year":{'':np.nan}})
        table_df.drop(labels = ['Ref','Image'], axis =1, inplace=True)
        return table_df
        
    def parse_row(self, row):     
        columns = row.find_all(["td","th"])
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != '']
    
    def parse_header(self, row):     
        columns = row.find_all("th")
        to_return = [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != ""]
        return to_return

    def clean_table(self, row):
        if not row.iloc[0].isdigit():
            return row.shift(periods = 1)
        else:
            return row
        
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics"        
parser = Parser(url)   
nobel_df = parser.parse_table()

nobel_df.columns = ["Year", "Laureate", "Country", "Rationale"]
nobel_df.dropna(subset=["Country"], inplace=True)
## Forward-fill missing values; drop the columns by name (positional axis is rejected by recent pandas)
nobel_df.ffill(inplace=True)
nobel_df.drop(columns=["Year", "Country", "Rationale"], inplace=True)

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
link_df = pd.DataFrame([[x.string, x["href"]] for x in table.contents[1].find_all("a")],
                       columns=["Laureate", "link"]).drop_duplicates()

nobel_df = nobel_df.merge(link_df, on="Laureate", how="left")
nobel_df.set_index("Laureate", inplace=True)
nobel_df.drop_duplicates(inplace=True)
nobel_df

Laureate,link
Wilhelm Conrad Röntgen,/wiki/Wilhelm_R%C3%B6ntgen
Hendrik Lorentz,/wiki/Hendrik_Lorentz
Pieter Zeeman,/wiki/Pieter_Zeeman
Antoine Henri Becquerel,/wiki/Henri_Becquerel
Pierre Curie,/wiki/Pierre_Curie
Maria Skłodowska-Curie,/wiki/Maria_Sk%C5%82odowska-Curie
Lord Rayleigh,"/wiki/John_William_Strutt,_3rd_Baron_Rayleigh"
Philipp Eduard Anton von Lenard,/wiki/Philipp_Lenard
Joseph John Thomson,/wiki/J._J._Thomson
Albert Abraham Michelson,/wiki/Albert_A._Michelson
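
The link-gathering step can be illustrated on a small inline fragment, without a network request. This is a sketch under stated assumptions: it uses the standard library's `html.parser` rather than the Beautiful Soup calls in the notebook, the class name `LinkCollector` is hypothetical, and `sample_html` is a made-up fragment shaped like two rows of the laureates table.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical fragment shaped like rows of the laureates table.
sample_html = """
<table>
  <tr><td>1902</td><td><a href="/wiki/Hendrik_Lorentz">Hendrik Lorentz</a></td></tr>
  <tr><td>1902</td><td><a href="/wiki/Pieter_Zeeman">Pieter Zeeman</a></td></tr>
</table>
"""

collector = LinkCollector()
collector.feed(sample_html)
links = dict(collector.links)  # name -> relative Wikipedia link
```

The resulting mapping plays the same role as the "link" column above: a display name paired with a relative URL that is later appended to the site root.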


- ### Gather physics domain links

Using similar code, and taking advantage of Beautiful Soup, we gather the links for the physics domains, and place them in a df named "physics_df".

In [20]:
## We get the physics links
url = "https://en.wikipedia.org/wiki/Physics"

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
physics_df = pd.DataFrame([[x.string.lower(), x["href"].lower()] for x in table.contents[3].find_all("a")],
                       columns=["Physics_domain", "link"]).drop_duplicates()

physics_df = physics_df.groupby("Physics_domain").first()

In [21]:
physics_df

Physics_domain,link
accelerator physics,/wiki/accelerator_physics
acoustics,/wiki/acoustics
agrophysics,/wiki/agrophysics
antimatter,/wiki/antimatter
applied physics,/wiki/applied_physics
astrometry,/wiki/astrometry
astronomy,/wiki/astronomy
astrophysics,/wiki/astrophysics
atom,/wiki/atom
atomic and molecular astrophysics,/wiki/atomic_and_molecular_astrophysics


## Gather important words from Wikipedia pages

Now that we have the links, we need the words from the Wikipedia page behind each link. We define one function to get the text, and several functions to keep only those words we consider "important". This code is largely borrowed from our previous project, "Classifying Nobel Prize Physicists by Physics Field." One difference here is that we aggregate all these functions into a single function called "clean_everything" that gathers AND cleans the words.

- ### Gather all words from Wikipedia bios

In [22]:
from string import punctuation

## We get the bios
def get_text(link, root_website = "https://en.wikipedia.org"):    
    http = Http()
    status, response = http.request(root_website + link)
    body = BeautifulSoup(response, "lxml", parse_only=SoupStrainer("div", {"id":"mw-content-text"}))
    return BeautifulSoup.get_text(body.contents[1])

- ### Copy clean_string, remove, remove_one from previous project

In [30]:
## Copy clean_string function from Nobel Classification project
def clean_string(string):
    for p in punctuation + "1234567890":
        string = string.replace(p,'').lower()
    return string

## Copy remove function from the Nobel Classification project
def remove(list_to_clean, element_to_remove=[None, ""]):
    list_cleaned = [x for x in list_to_clean if x not in element_to_remove]
    return list_cleaned

## Copy remove_one function from Nobel Classification project
def remove_one(list_to_clean):
    list_cleaned = [x for x in list_to_clean if len(x) > 1]
    return list_cleaned

## Import stop words to remove
from nltk.corpus import stopwords
words_to_remove = set(stopwords.words('english'))

- ### Aggregate all the above functions into one to return a list of words from each link

In [31]:
def clean_everything(df):
    fin_list = df['link'].apply(get_text).apply(clean_string).str.split().apply(remove).apply(remove, element_to_remove = words_to_remove).apply(remove_one)
    return fin_list

physics_df["physics_list"] = clean_everything(physics_df)
nobel_df["physics_list"] = clean_everything(nobel_df)
nobel_df

Laureate,link,physics_list
Wilhelm Conrad Röntgen,/wiki/Wilhelm_R%C3%B6ntgen,"[wilhelm, röntgen, born, wilhelm, conrad, rönt..."
Hendrik Lorentz,/wiki/Hendrik_Lorentz,"[confused, hendrikus, albertus, lorentz, ludvi..."
Pieter Zeeman,/wiki/Pieter_Zeeman,"[pieter, zeeman, born, may, zonnemaire, nether..."
Antoine Henri Becquerel,/wiki/Henri_Becquerel,"[uses, see, becquerel, disambiguation, antoine..."
Pierre Curie,/wiki/Pierre_Curie,"[pierre, curie, pierre, curie, born, may, pari..."
Maria Skłodowska-Curie,/wiki/Maria_Sk%C5%82odowska-Curie,"[article, polishfrench, physicist, uses, see, ..."
Lord Rayleigh,"/wiki/John_William_Strutt,_3rd_Baron_Rayleigh","[lord, rayleighom, prs, born, november, langfo..."
Philipp Eduard Anton von Lenard,/wiki/Philipp_Lenard,"[waterfall, effect, redirects, illusory, visua..."
Joseph John Thomson,/wiki/J._J._Thomson,"[article, nobel, laureate, physicist, moral, p..."
Albert Abraham Michelson,/wiki/Albert_A._Michelson,"[confused, athlete, albert, michelsen, albert,..."
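
To see what the cleaning steps do to a single string, here is a compact sketch. It is not the notebook's implementation: it removes punctuation and digits with `str.translate` instead of repeated `replace` calls, the function name `clean_words` is hypothetical, and `demo_stop_words` is a tiny stand-in for the NLTK English stop-word list. The sample sentence is illustrative.

```python
from string import punctuation

# Tiny stand-in for the NLTK English stop-word list (assumption for the demo).
demo_stop_words = {"the", "of", "in", "on", "was", "a"}


def clean_words(text, stop_words):
    """Lowercase, strip punctuation and digits, split, then drop
    stop words and one-character words."""
    table = str.maketrans("", "", punctuation + "1234567890")
    words = text.lower().translate(table).split()
    return [w for w in words if w not in stop_words and len(w) > 1]


sample = "Pieter Zeeman was born on 25 May 1865 in Zonnemaire, the Netherlands."
cleaned = clean_words(sample, demo_stop_words)
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, words joined only by punctuation fuse together, which is the likely origin of tokens such as "polishfrench" in the output above.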


## How many words?

We now have all the words from the Wikipedia pages for the physicists and physics domains. Let's find out how many unique words are in nobel_df["physics_list"], how many are in physics_df["physics_list"], how many are in the intersection of the two, and how many of the words in each entry (bio) of each respective list are in the intersection.

- ### Find all the words in nobel_df["physics_list"]

In [32]:
all_nobel_words =  set( nobel_df["physics_list"].sum() )
all_nobel_words

{'camilla',
 'buckley',
 'température',
 'angelesdoctoral',
 'thickness',
 'breslow',
 'northen',
 'armistice',
 'distorted',
 'maldon',
 'sampson',
 'howard',
 'maquoketa',
 'schik',
 'awarding',
 'judd',
 'mussolini',
 'interpretational',
 'driggs',
 'longest',
 'europhysics',
 'uhuru',
 'kaka',
 'honesttogoodness',
 'nicéphore',
 'colman',
 'delhi',
 'cars',
 'conducting',
 'hbs',
 'человека',
 'deserlawrence',
 'darin',
 'bscknown',
 'marshakgordon',
 'bombed',
 'cancellation',
 'anna',
 'roger',
 'ovation',
 'kattowitz',
 'whhmxg',
 'romanian',
 'ccting',
 'donor',
 'expertise',
 'distortion',
 'australiancitizenship',
 '秀樹born',
 'distort',
 'danby',
 'lazarevich',
 'orb',
 'shiver',
 'arxivhepex',
 'indexes',
 'watts',
 'www',
 'chap',
 'jrfebruary',
 'kölliker',
 'rijkedoctoral',
 'peres',
 'poets',
 'meti',
 'nandalal',
 'byrobert',
 'rich',
 'obsolescent',
 'alton',
 'arthurs',
 'unassailability',
 'pseudoreligious',
 'chidambaram',
 'bacterial',
 'isotopesmass',
 'boyle',
 '

- ### Find all the words in physics_df["physics_list"]

In [33]:
all_physics_words =  set( physics_df["physics_list"].sum() )
all_physics_words

{'paints',
 'deimosmoon',
 'galyam',
 'debyes',
 'thickness',
 'logicians',
 'distorted',
 'sculptures',
 'howard',
 'green–kubo',
 'chevallier',
 'cαγo',
 'cgh',
 'longest',
 'patil',
 'europhysics',
 'florian',
 'motioncitation',
 'conceptedit',
 'subbrown',
 'cars',
 'conducting',
 'bina',
 'sushruta',
 'cancellation',
 'meanfield',
 'eccentricity',
 'maire',
 'anna',
 'roger',
 'ldx',
 'lamarckism',
 'romanian',
 'offshore',
 'neoncopper',
 'brandon',
 'setend',
 'slicing',
 'expertise',
 'donor',
 'distortion',
 'distort',
 'danby',
 'rotatingnonspherical',
 'sipylum',
 'orb',
 'alhazens',
 'superdense',
 'qm',
 'arxivhepex',
 'watts',
 'grummans',
 'indexes',
 'riser',
 'www',
 'chap',
 'integrator',
 'decker',
 'peres',
 'trick',
 'tunnelingcurrent',
 'straightforwardly',
 'rich',
 'snr',
 'cladding',
 'leftcidagger',
 'biomaterials',
 'bk†',
 'deviates',
 'bacterial',
 'bangedit',
 'orientation',
 'even–even',
 'boyle',
 'acs',
 'lose',
 'samples',
 'floatovoltaics',
 'timetrav

- ### Find the intersection of all_nobel_words and all_physics_words

In [34]:
physics_corpus =  all_nobel_words.intersection(all_physics_words)

physics_corpus

{'paved',
 'proposal',
 'released',
 'wwwencyclopediacom',
 'pbar',
 'exact',
 'mystic',
 'thickness',
 'element',
 'aas',
 'poisonous',
 'ralph',
 'oleary',
 'distorted',
 'aqueous',
 'howard',
 'society',
 'reject',
 'le',
 'ramsey',
 'students',
 'liberated',
 'jaguar',
 'mother',
 'wrinkles',
 'kids',
 'longest',
 'europhysics',
 'electrochemical',
 'abelian',
 'religious',
 'natura',
 'detroit',
 'bibcodephrvk',
 'unanticipated',
 'enumerating',
 'dem',
 'nearer',
 'thermostatic',
 'cars',
 'conducting',
 'keeps',
 'genes',
 'intrigued',
 'allied',
 'stripped',
 'dj',
 'automaton',
 'eta',
 'harwood',
 'manifest',
 'cancellation',
 'contain',
 'stabilized',
 'trend',
 'installations',
 'lists',
 'anna',
 'roger',
 'superconductive',
 'glasses',
 'romanian',
 'pearce',
 'highquality',
 'schopenhauer',
 'donor',
 'expertise',
 'distortion',
 'chu',
 'congress',
 'manmade',
 'drawn',
 'routes',
 'distort',
 'danby',
 'alkali',
 'orb',
 'preferring',
 'ring',
 'microscope',
 'bibcoden

In [35]:
print(len(physics_corpus), len(all_nobel_words), len(all_physics_words))

12935 33378 34722


So the unique words from the physicist pages total 33,378.
The unique words from the physics domain pages total 34,722.
The unique words the two sets share total 12,935, a little more than a third of the unique words in either.
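
The proportions quoted above follow directly from the three counts printed by the notebook. A short worked calculation, with variable names chosen here for illustration:

```python
# Counts reported by the notebook output above.
n_corpus, n_nobel, n_physics = 12935, 33378, 34722

share_of_nobel = n_corpus / n_nobel      # fraction of the physicist-page vocabulary
share_of_physics = n_corpus / n_physics  # fraction of the domain-page vocabulary

# Inclusion-exclusion: distinct words across both sets of pages.
union_size = n_nobel + n_physics - n_corpus

print(f"{share_of_nobel:.1%} {share_of_physics:.1%} {union_size}")
```

Both shares come out between 37% and 39%, which is where "a little more than a third" comes from.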

For every entry in the "physics_list" for both nobel_df and physics_df, we are going to keep only those words in the intersection (physics_corpus, as defined above). We will store these in a new column in our dataframes, "physics_list_clean".

- ### Keep only words in physics corpus.

In [36]:
def keep_only(list_to_clean, corpus=physics_corpus):
    return [word for word in list_to_clean if word in corpus]
    
nobel_df["physics_list_clean"] = nobel_df["physics_list"].apply(keep_only)
physics_df["physics_list_clean"] = physics_df["physics_list"].apply(keep_only)
nobel_df

Laureate,link,physics_list,physics_list_clean
Wilhelm Conrad Röntgen,/wiki/Wilhelm_R%C3%B6ntgen,"[wilhelm, röntgen, born, wilhelm, conrad, rönt...","[wilhelm, röntgen, born, wilhelm, conrad, rönt..."
Hendrik Lorentz,/wiki/Hendrik_Lorentz,"[confused, hendrikus, albertus, lorentz, ludvi...","[confused, lorentz, lorenz, see, also, lorentz..."
Pieter Zeeman,/wiki/Pieter_Zeeman,"[pieter, zeeman, born, may, zonnemaire, nether...","[pieter, zeeman, born, may, october, amsterdam..."
Antoine Henri Becquerel,/wiki/Henri_Becquerel,"[uses, see, becquerel, disambiguation, antoine...","[uses, see, becquerel, disambiguation, henri, ..."
Pierre Curie,/wiki/Pierre_Curie,"[pierre, curie, pierre, curie, born, may, pari...","[pierre, curie, pierre, curie, born, may, pari..."
Maria Skłodowska-Curie,/wiki/Maria_Sk%C5%82odowska-Curie,"[article, polishfrench, physicist, uses, see, ...","[article, physicist, uses, see, marie, curie, ..."
Lord Rayleigh,"/wiki/John_William_Strutt,_3rd_Baron_Rayleigh","[lord, rayleighom, prs, born, november, langfo...","[lord, born, november, langford, june, place, ..."
Philipp Eduard Anton von Lenard,/wiki/Philipp_Lenard,"[waterfall, effect, redirects, illusory, visua...","[effect, redirects, visual, motion, effect, se..."
Joseph John Thomson,/wiki/J._J._Thomson,"[article, nobel, laureate, physicist, moral, p...","[article, nobel, laureate, physicist, moral, p..."
Albert Abraham Michelson,/wiki/Albert_A._Michelson,"[confused, athlete, albert, michelsen, albert,...","[confused, albert, albert, michelson, born, de..."


Next, we gather the information that we really want: The number of words in each Wikipedia entry that are in the physics corpus, and store them in a new column in both dataframes (the column is named "length").

- ### Compute the length of each list

In [37]:
nobel_df["length"] =  nobel_df["physics_list_clean"].apply(len)
physics_df["length"] = physics_df["physics_list_clean"].apply(len)

Lastly, we are eventually going to keep all this information in a single dataframe. However, we still want to identify which entries are physicists, and which are physics domains. So we create a new column, named "group", in both dataframes. Entries in nobel_df receive a value of 1 for "group", while those in physics_df are set to 0.

- ### Create "group" column: 1 for nobel_df, 0 for physics_df

In [38]:
nobel_df["group"] =  1

In [39]:
physics_df["group"] =  0

Now that we have essentially all the information we need, we simply need to put it into the right format. Recall that a social network consists of nodes and links connecting those nodes. Specifically, we are going to create a list containing the nodes (nodes_list), and a list containing the links (links_list). For now, we will focus on creating nodes_list.

## Create nodes_list

nodes_list: A list of dictionaries, each containing a physics domain/physicist, the length of their Wikipedia entry (specifically, how many words in their entry were also in physics_corpus), and their group (0 if it is a physics domain, 1 if it is a physicist).

First, let's get the relevant info for all the physicists and physics domains into one dataframe.

- ### Concatenate physics_df and nobel_df to form nodes_df

In [40]:
nodes_df =  pd.concat([physics_df, nobel_df])[['length','group']]

nodes_df.index.name = "id"
nodes_df

id,length,group
accelerator physics,866,0
acoustics,2072,0
agrophysics,398,0
antimatter,3532,0
applied physics,289,0
astrometry,1151,0
astronomy,5207,0
astrophysics,1859,0
atom,6633,0
atomic and molecular astrophysics,471,0


Since we are using the D3.js library, we must format the nodes as a list of dictionaries. Each dictionary will contain the group, length, and name of the physicist/physics domain (as 'id').

- ### Convert nodes_df to list of dictionaries

In [41]:
nodes_list = list(nodes_df.reset_index().transpose().to_dict().values())
nodes_list

[{'group': 0, 'id': 'accelerator physics', 'length': 866},
 {'group': 0, 'id': 'acoustics', 'length': 2072},
 {'group': 0, 'id': 'agrophysics', 'length': 398},
 {'group': 0, 'id': 'antimatter', 'length': 3532},
 {'group': 0, 'id': 'applied physics', 'length': 289},
 {'group': 0, 'id': 'astrometry', 'length': 1151},
 {'group': 0, 'id': 'astronomy', 'length': 5207},
 {'group': 0, 'id': 'astrophysics', 'length': 1859},
 {'group': 0, 'id': 'atom', 'length': 6633},
 {'group': 0, 'id': 'atomic and molecular astrophysics', 'length': 471},
 {'group': 0, 'id': 'atomic physics', 'length': 964},
 {'group': 0, 'id': 'atomic, molecular, and optical physics', 'length': 1870},
 {'group': 0, 'id': 'bcs theory', 'length': 1362},
 {'group': 0, 'id': 'big bang', 'length': 6932},
 {'group': 0, 'id': 'biophysics', 'length': 1074},
 {'group': 0, 'id': 'black hole', 'length': 8355},
 {'group': 0, 'id': 'bloch wave', 'length': 839},
 {'group': 0, 'id': 'bose-einstein condensate', 'length': 3494},
 {'group': 0

Creating nodes_list was easy. Creating links_list will be more involved.

## Create links_list

Ultimately we want to create a list of dictionaries, each containing two physicists/physics domains (one listed as "target", the other as "source"), and their "value", which measures the strength of their connection (we will discuss later how the strength of their connection is computed).

Here is the basic outline of what we are going to do:

- Create a df (words_vector) containing how many times each word in physics_corpus occurs in each physicist/physics domain entry.
- Create another df (similarity_df) containing the cosine-similarity index between each pair of entries. This is the value we will use to measure how "connected" two physicists/physics domains are. (The cosine-similarity index will be explained later.)
- Melt similarity_df into a df containing three columns ("source", "target", "value"). The resulting df is named melted_df.
- Filter melted_df down to the links we want to keep, and convert the result into a list of dictionaries. This list will be named links_list.

Once we have that, we can use nodes_list and links_list to visualize our social network.

Let's start with creating words_vector. We will create an empty dataframe, populate it with the count of each word per Wikipedia entry, and then fill in any missing values with a 0.

- ### Create a(n empty) data frame with the nodes_df index as columns and physics_corpus as index

In [42]:
words_vector =  pd.DataFrame( columns = nodes_df.index.values, index = physics_corpus)
words_vector

Unnamed: 0,accelerator physics,acoustics,agrophysics,antimatter,applied physics,astrometry,astronomy,astrophysics,atom,atomic and molecular astrophysics,...,Hiroshi Amano,Shuji Nakamura,Takaaki Kajita,Arthur B. McDonald,David J. Thouless,F. Duncan M. Haldane,John M. Kosterlitz,Rainer Weiss,Kip Thorne,Barry Barish
paved,,,,,,,,,,,...,,,,,,,,,,
proposal,,,,,,,,,,,...,,,,,,,,,,
released,,,,,,,,,,,...,,,,,,,,,,
wwwencyclopediacom,,,,,,,,,,,...,,,,,,,,,,
pbar,,,,,,,,,,,...,,,,,,,,,,
exact,,,,,,,,,,,...,,,,,,,,,,
mystic,,,,,,,,,,,...,,,,,,,,,,
thickness,,,,,,,,,,,...,,,,,,,,,,
element,,,,,,,,,,,...,,,,,,,,,,
aas,,,,,,,,,,,...,,,,,,,,,,


Now we need a function that takes a list and returns the count of each element in it. The result is a Series whose index holds the distinct elements of the list and whose values are their counts.

We will apply this function to each element of "physics_list_clean" in both nobel_df and physics_df. Recall that the physics_list_clean column contains all words from each physicist's/physics domain's Wikipedia entry that are also in physics_corpus. Applying this function to an element of "physics_list_clean" therefore returns a Series indexed by words from physics_corpus, with the count of each. Corpus words that do not occur in an entry are absent from its Series and show up as missing values, which we fill in afterwards. This is exactly what we need to fill each column of words_vector.

- ### Write/apply function that returns word count of given list

In [43]:
def count_words(list_to_count):
    return (pd.Series(list_to_count)).value_counts()

words_vector.loc[:,nobel_df.index] = nobel_df["physics_list_clean"].apply(count_words).transpose()
words_vector.loc[:,physics_df.index] = physics_df["physics_list_clean"].apply(count_words).transpose()

Now we replace any NaN values with 0.

- ### Fill the missing values

In [44]:
words_vector = words_vector.fillna(value=0)
words_vector
#nobel_df["physics_list_clean"].apply(count_words)

Unnamed: 0,accelerator physics,acoustics,agrophysics,antimatter,applied physics,astrometry,astronomy,astrophysics,atom,atomic and molecular astrophysics,...,Hiroshi Amano,Shuji Nakamura,Takaaki Kajita,Arthur B. McDonald,David J. Thouless,F. Duncan M. Haldane,John M. Kosterlitz,Rainer Weiss,Kip Thorne,Barry Barish
paved,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
proposal,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
released,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
wwwencyclopediacom,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
pbar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
exact,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mystic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
thickness,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
element,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,30.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
aas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
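
The counting and zero-filling steps can be illustrated without pandas. The sketch below uses `collections.Counter`, which returns 0 for missing keys, so no separate fill step is needed. The corpus and entries (`demo_corpus`, `demo_entries`) are made up for illustration and far smaller than the real ones.

```python
from collections import Counter

# Hypothetical miniature corpus and entries, for illustration only.
demo_corpus = ["atom", "element", "proposal"]
demo_entries = {
    "entry_a": ["atom", "element", "atom"],
    "entry_b": ["proposal"],
}

# One count vector per entry, ordered like demo_corpus.
count_vectors = {}
for name, words in demo_entries.items():
    counts = Counter(words)  # a missing word counts as 0
    count_vectors[name] = [counts[word] for word in demo_corpus]
```

Each list in `count_vectors` corresponds to one column of words_vector: position i holds how often the i-th corpus word occurs in that entry.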


Next, we want to create similarity_df. Before we can, we must define the cosine similarity index, and compute it for each pair of physicists/physics domains.

According to Wikipedia, "Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them." In our situation, we use the words in physics_corpus as an orthogonal basis for our vector space. Each vector represents one physicist/physics domain Wikipedia entry: it is 12935 elements long, and each element is the number of times the corresponding physics_corpus word appears in that entry.

Thus for any two Wikipedia entries, we can compute the cosine similarity index as

$$\text{similarity} = \cos\left(\theta\right) = \frac{A\cdot B}{\left\|A\right\| \left\|B\right\|},$$ where $A, B$ are their vector representations. A value of 1 means the two vectors point in exactly the same direction, i.e. the two entries use the corpus words in identical proportions. In general the cosine ranges from -1 (exactly opposite directions) to 1, but because word counts are never negative, every value here falls between 0 (no words in common) and 1.
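
The formula can be checked by hand on very small vectors. The following sketch uses only the standard library; the function name `cosine_similarity` and the three count vectors are assumptions made for the example. With non-negative counts, the result always lies between 0 and 1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical word-count vectors over a three-word corpus.
entry_a = [2, 1, 0]
entry_b = [4, 2, 0]  # same proportions as entry_a, twice the counts
entry_c = [0, 0, 3]  # shares no words with entry_a

same_direction = cosine_similarity(entry_a, entry_b)  # close to 1
no_overlap = cosine_similarity(entry_a, entry_c)      # exactly 0
```

Note that `entry_b` is twice as long as `entry_a` yet scores (up to floating-point rounding) 1: the measure compares proportions of words, not the length of the articles.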



- ### Create similarity_df

In [63]:
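def euc_norm(values):
    ## Euclidean (L2) norm of one column of word counts
    ## (parameter renamed so it no longer shadows the built-in "list")
    return np.sqrt(sum([x**2 for x in values]))

## Column vector holding the norm of every entry; used for the denominator
norm = pd.DataFrame(words_vector.apply(euc_norm))

## Numerator: dot product between every pair of entries
numer = words_vector.transpose().dot(words_vector)
## Denominator: product of the two norms for every pair of entries
denom = norm.dot( norm.transpose() )
similarity_df = numer/denom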

In [64]:
similarity_df

Unnamed: 0,accelerator physics,acoustics,agrophysics,antimatter,applied physics,astrometry,astronomy,astrophysics,atom,atomic and molecular astrophysics,...,Hiroshi Amano,Shuji Nakamura,Takaaki Kajita,Arthur B. McDonald,David J. Thouless,F. Duncan M. Haldane,John M. Kosterlitz,Rainer Weiss,Kip Thorne,Barry Barish
accelerator physics,1.000000,0.181027,0.324358,0.243108,0.476278,0.133262,0.169541,0.296805,0.250150,0.106064,...,0.172038,0.120158,0.170583,0.180732,0.172286,0.169737,0.145952,0.151179,0.179970,0.200716
acoustics,0.181027,1.000000,0.232904,0.145087,0.204141,0.137180,0.179164,0.224255,0.163213,0.064637,...,0.115521,0.092485,0.095625,0.112706,0.108834,0.109798,0.086004,0.125214,0.150543,0.138593
agrophysics,0.324358,0.232904,1.000000,0.164500,0.477257,0.152558,0.194524,0.370907,0.169753,0.087921,...,0.197510,0.112776,0.201126,0.229489,0.201445,0.211273,0.160923,0.167525,0.232340,0.241451
antimatter,0.243108,0.145087,0.164500,1.000000,0.230500,0.190037,0.330833,0.282645,0.454103,0.130306,...,0.136195,0.169202,0.170693,0.175667,0.229921,0.201114,0.152238,0.176616,0.211923,0.197134
applied physics,0.476278,0.204141,0.477257,0.230500,1.000000,0.104750,0.157798,0.397664,0.192285,0.105777,...,0.310852,0.167854,0.291100,0.311594,0.297402,0.309185,0.250680,0.215102,0.265340,0.295526
astrometry,0.133262,0.137180,0.152558,0.190037,0.104750,1.000000,0.593334,0.456449,0.172033,0.171902,...,0.076848,0.099620,0.095817,0.093859,0.098176,0.072651,0.080420,0.125324,0.172844,0.117039
astronomy,0.169541,0.179164,0.194524,0.330833,0.157798,0.593334,1.000000,0.668919,0.303349,0.238242,...,0.116259,0.148210,0.157312,0.156652,0.150366,0.132998,0.096918,0.212065,0.254451,0.215852
astrophysics,0.296805,0.224255,0.370907,0.282645,0.397664,0.456449,0.668919,1.000000,0.268597,0.264489,...,0.190416,0.132860,0.222760,0.240200,0.224993,0.220498,0.184756,0.254001,0.307599,0.263597
atom,0.250150,0.163213,0.169753,0.454103,0.192285,0.172033,0.303349,0.268597,1.000000,0.218797,...,0.142735,0.191568,0.185659,0.208703,0.233741,0.190818,0.164022,0.173454,0.213961,0.209013
atomic and molecular astrophysics,0.106064,0.064637,0.087921,0.130306,0.105777,0.171902,0.238242,0.264489,0.218797,1.000000,...,0.057762,0.065542,0.062792,0.054900,0.053323,0.048615,0.035700,0.086568,0.094658,0.062200


We have all the information we need! We have the cosine similarity for every pair of physicists/physics domains, but we're going to clean it up some more. Here's what we need to do.

- Melt the df so that the data is easier to work with.
- Get rid of duplicates. For example, there is a row with acoustics for "source" and accelerator physics for "target", and also a row with accelerator physics for "source" and acoustics for "target". Because our social network is not directed, we treat these as the same, so we keep only one of these entries. Similarly, we remove all other "duplicates".
- We want to highlight only the most important connections. So for every physicist/physics domain, we only keep the 8 physicists/domains with the highest cosine similarity index to it.


- ### Reset the index and melt the dataframe

In [48]:
melted_df = pd.melt(similarity_df.reset_index(), id_vars = ['index'])
melted_df.columns = ["source", "target", "value"]
melted_df

Unnamed: 0,source,target,value
0,accelerator physics,accelerator physics,1.000000
1,acoustics,accelerator physics,0.181027
2,agrophysics,accelerator physics,0.324358
3,antimatter,accelerator physics,0.243108
4,applied physics,accelerator physics,0.476278
5,astrometry,accelerator physics,0.133262
6,astronomy,accelerator physics,0.169541
7,astrophysics,accelerator physics,0.296805
8,atom,accelerator physics,0.250150
9,atomic and molecular astrophysics,accelerator physics,0.106064
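
What "melting" does is easiest to see on a tiny table. The sketch below reproduces the reshaping in plain Python rather than with `pd.melt`; the nested dictionary `demo_similarity` and its values are made up for illustration.

```python
# Hypothetical 2x2 similarity table stored as nested dicts: table[row][column].
demo_similarity = {
    "atom": {"atom": 1.0, "acoustics": 0.16},
    "acoustics": {"atom": 0.16, "acoustics": 1.0},
}

# Long format: one record per (row, column) cell of the square table.
demo_melted = [
    {"source": row, "target": col, "value": value}
    for row, columns in demo_similarity.items()
    for col, value in columns.items()
]
```

An n-by-n table always melts into n² records, including the n self-pairs on the diagonal and two mirrored records for every other pair, which is why the duplicate removal that follows is needed.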


- ### Shuffle data set rowwise.

In [49]:
melted_df = melted_df.sample(frac=1.).reset_index(drop=True)
melted_df

Unnamed: 0,source,target,value
0,antimatter,Rudolf Ludwig Mössbauer,0.134750
1,Joseph John Thomson,William Daniel Phillips,0.233942
2,nanoscale and mesoscopic physics,physics of computation,0.079613
3,Kip Thorne,gravitation physics,0.344277
4,Peter Higgs,Albert Abraham Michelson,0.206263
5,biophysics,Luis Walter Alvarez,0.085663
6,Steven Chu,Steven Chu,1.000000
7,nuclear physics,Tsung-Dao Lee,0.206987
8,applied physics,Horst Ludwig Störmer,0.220484
9,Georges Charpak,Arthur B. McDonald,0.373899


Getting rid of duplicates is a bit tricky. We merge melted_df with itself, matching ['source', 'target'] on the left with ['target', 'source'] on the right, and keep the indices of both "copies" of melted_df. For each pair of duplicates (i.e. rows where (source_x, target_x) = (target_y, source_y)), we store in the list "index_to_drop" the value of either index_x or index_y, whichever is greater (using .unique() so that we don't have duplicates in this list). Finally, we drop all indices in "index_to_drop" from melted_df to get melted_df_sub, which has no duplicates. Note that a row whose source and target are the same (such as Steven Chu with Steven Chu) matches itself, so index_x equals index_y and the row is dropped as well; this removes the self-links.

- ### Merge melted_df with itself

In [50]:
merged_df = pd.merge(melted_df.reset_index(), melted_df.reset_index(), left_on =['source','target'], right_on=['target','source'])

merged_df

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
0,0,antimatter,Rudolf Ludwig Mössbauer,0.134750,34893,Rudolf Ludwig Mössbauer,antimatter,0.134750
1,1,Joseph John Thomson,William Daniel Phillips,0.233942,2597,William Daniel Phillips,Joseph John Thomson,0.233942
2,2,nanoscale and mesoscopic physics,physics of computation,0.079613,59382,physics of computation,nanoscale and mesoscopic physics,0.079613
3,3,Kip Thorne,gravitation physics,0.344277,51995,gravitation physics,Kip Thorne,0.344277
4,4,Peter Higgs,Albert Abraham Michelson,0.206263,63381,Albert Abraham Michelson,Peter Higgs,0.206263
5,5,biophysics,Luis Walter Alvarez,0.085663,10652,Luis Walter Alvarez,biophysics,0.085663
6,6,Steven Chu,Steven Chu,1.000000,6,Steven Chu,Steven Chu,1.000000
7,7,nuclear physics,Tsung-Dao Lee,0.206987,88144,Tsung-Dao Lee,nuclear physics,0.206987
8,8,applied physics,Horst Ludwig Störmer,0.220484,49847,Horst Ludwig Störmer,applied physics,0.220484
9,9,Georges Charpak,Arthur B. McDonald,0.373899,71197,Arthur B. McDonald,Georges Charpak,0.373899


- ### Find the indices to drop

In [51]:
index_to_drop = merged_df[["index_x", "index_y"]].max(axis=1).unique()

- ### Use the index_to_drop to subset the melted_df dataframe

In [52]:
melted_df_sub =  melted_df.drop(index_to_drop)
melted_df_sub

Unnamed: 0,source,target,value
0,antimatter,Rudolf Ludwig Mössbauer,0.134750
1,Joseph John Thomson,William Daniel Phillips,0.233942
2,nanoscale and mesoscopic physics,physics of computation,0.079613
3,Kip Thorne,gravitation physics,0.344277
4,Peter Higgs,Albert Abraham Michelson,0.206263
5,biophysics,Luis Walter Alvarez,0.085663
7,nuclear physics,Tsung-Dao Lee,0.206987
8,applied physics,Horst Ludwig Störmer,0.220484
9,Georges Charpak,Arthur B. McDonald,0.373899
10,Willis Eugene Lamb,high pressure physics,0.069202
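
For comparison, the same outcome (one row per unordered pair, no self-links) can be sketched without a self-merge. This is a different technique from the one used in the notebook: each pair is identified by a `frozenset`, which ignores order. The rows in `demo_rows` are made up for illustration.

```python
demo_rows = [
    ("acoustics", "atom", 0.16),
    ("atom", "acoustics", 0.16),  # mirror of the first row
    ("atom", "atom", 1.0),        # self-pair
    ("atom", "astronomy", 0.30),
]

seen = set()
undirected = []
for source, target, value in demo_rows:
    if source == target:
        continue  # a node is not linked to itself
    key = frozenset((source, target))  # order-independent pair identity
    if key in seen:
        continue  # the mirrored row was already kept
    seen.add(key)
    undirected.append((source, target, value))
```

As with the merge-based approach, which of the two mirrored rows survives depends only on row order, so shuffling beforehand decides at random which node ends up as "source".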


And now, for every physicist/domain, we keep only the 8 strongest connections. We store this in largest_df, a Series with a 2-level multiindex: level 0 is the "source" and level 1 is the original row index in melted_df_sub. We capture the values of level 1 and store them in index_to_keep. These are the row indices we keep in links_df.

- ### Group melted_df_sub by "source" using the groupby method and select the 8 targets with highest "value."

In [66]:
largest_df = melted_df_sub.groupby('source')['value'].nlargest(8)

- ### Get the level 1 of the multiindex

In [67]:
index_to_keep =  largest_df.index.get_level_values(1)

links_df = melted_df_sub.loc[index_to_keep]
links_df

Unnamed: 0,source,target,value
1514,Aage Bohr,Niels Bohr,0.810448
9442,Aage Bohr,Ben Roy Mottelson,0.640552
22853,Aage Bohr,Leo James Rainwater,0.593636
87695,Aage Bohr,Nicolaas Bloembergen,0.453569
32066,Aage Bohr,James Franck,0.451218
4246,Aage Bohr,Jerome I. Friedman,0.429247
2537,Aage Bohr,Andre Geim,0.424991
79156,Aage Bohr,David J. Thouless,0.403194
58248,Abdus Salam,Andre Geim,0.308075
82062,Abdus Salam,Nicolaas Bloembergen,0.298805
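
The "strongest links per source" selection can also be sketched in plain Python, which makes the grouping explicit. This is an illustration rather than the notebook's method: it uses `heapq.nlargest` instead of the pandas groupby, the values in `demo_links` are rounded from the output above, and `K` is set to 2 instead of 8 to keep the example small.

```python
from collections import defaultdict
from heapq import nlargest

demo_links = [
    ("Aage Bohr", "Niels Bohr", 0.81),
    ("Aage Bohr", "James Franck", 0.45),
    ("Aage Bohr", "Andre Geim", 0.42),
    ("Abdus Salam", "Andre Geim", 0.31),
]
K = 2  # the notebook uses 8; 2 keeps the demo small

# Group the links by their source node.
by_source = defaultdict(list)
for source, target, value in demo_links:
    by_source[source].append((source, target, value))

# Keep at most K links per source, strongest first.
top_links = [
    link
    for links in by_source.values()
    for link in nlargest(K, links, key=lambda item: item[2])
]
```

A source with fewer than K links simply keeps all of them, which matches the behavior of `nlargest` on small groups in pandas.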


Finally, we convert this to a list.

- ### Create the links_list

In [68]:
links_list =  list(links_df.transpose().to_dict().values())
links_list

[{'source': 'Aage Bohr', 'target': 'Niels Bohr', 'value': 0.810448278052767},
 {'source': 'Aage Bohr',
  'target': 'Ben Roy Mottelson',
  'value': 0.6405523849120306},
 {'source': 'Aage Bohr',
  'target': 'Leo James Rainwater',
  'value': 0.59363610094523},
 {'source': 'Aage Bohr',
  'target': 'Nicolaas Bloembergen',
  'value': 0.4535685670272227},
 {'source': 'Aage Bohr',
  'target': 'James Franck',
  'value': 0.45121833589128957},
 {'source': 'Aage Bohr',
  'target': 'Jerome I. Friedman',
  'value': 0.4292465753445105},
 {'source': 'Aage Bohr', 'target': 'Andre Geim', 'value': 0.42499050449712006},
 {'source': 'Aage Bohr',
  'target': 'David J. Thouless',
  'value': 0.4031941944705469},
 {'source': 'Abdus Salam',
  'target': 'Andre Geim',
  'value': 0.3080747544129564},
 {'source': 'Abdus Salam',
  'target': 'Nicolaas Bloembergen',
  'value': 0.2988046873651129},
 {'source': 'Abdus Salam',
  'target': 'Anthony James Leggett',
  'value': 0.2873736808741804},
 {'source': 'Abdus Salam',

As we did at the beginning of this project for the small network, we export this as a json file. As before, we dump this file in a folder named "data", which is itself inside the folder containing "index.html". Open index.html, and you will be able to interact with our social network.

When you open this file, you will see a social network. Each circle represents either a physicist (red) or a domain (blue). Clicking on a circle highlights its 8 closest connections. The size of each circle is determined by "length", the number of physics_corpus words in the corresponding Wikipedia entry.

In [69]:
network_dict = {"nodes": nodes_list,
                "links": links_list}

## Forward slashes avoid invalid escape sequences such as "\d" and "\p"
with open("data/physicists.json", "w") as f:
    json.dump(network_dict, f, indent=4)
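
Before opening the page, it can be worth checking that every link points at nodes that exist. The sketch below assumes the page expects each link's "source" and "target" to match a node "id"; the helper name `dangling_links` is hypothetical, and the lengths and values in `demo_network` are made up except where they repeat figures shown earlier.

```python
def dangling_links(network):
    """Return links whose source or target is not a declared node id."""
    node_ids = {node["id"] for node in network["nodes"]}
    return [
        link
        for link in network["links"]
        if link["source"] not in node_ids or link["target"] not in node_ids
    ]


demo_network = {
    "nodes": [
        {"id": "atom", "group": 0, "length": 6633},
        {"id": "Niels Bohr", "group": 1, "length": 100},  # length is made up
    ],
    "links": [
        # value is made up
        {"source": "Niels Bohr", "target": "atom", "value": 0.5},
        # "Aage Bohr" is deliberately missing from the nodes above
        {"source": "Niels Bohr", "target": "Aage Bohr", "value": 0.81},
    ],
}

problems = dangling_links(demo_network)
```

An empty result means the nodes and links are consistent; here the second link is reported because its target has no matching node.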