# Formalia:
Please read the [assignment overview page](https://github.com/SocialComplexityLab/socialgraphs2022/wiki/Assignments) carefully before proceeding. This page contains information about formatting (including file formats), group sizes, and many other aspects of handing in the assignment.

If you fail to follow these simple instructions, it will negatively impact your grade!

Due date and time: The assignment is due on Tuesday November 1st, 2022 at 23:55. Hand in your IPython notebook file (with extension .ipynb) via http://peergrade.io/ (we won't be doing peergrading, but we'll still use http://peergrade.io/ for the handin.)

----

# Assignment 2: Network Science

_Course: 02805 - **Social Graphs and Interactions**_ <br>
_Course responsible: **Sune Lehmann Jørgensen**_ <br>
_DTU - **Technical University of Denmark**_  
_Due date - **01/11/2022**_ <br>
_Students - **Nikos Karageorgos, John Manganas, Georgios Panagiotopoulos**_

---

## Table of Contents:
- [__Part 0: Data__](#0.)

- [__Part 1: Basic Stats__](#1.)

- [__Part 2: Communities__](#2.)

- [__Part 3: Sentiment__](#3.)

---

# Introduction  

In Assignment 2 we will be working with Superheroes from the comics series of Marvel and DC. Each hero's data is the text from the corresponding Wikipedia page.  
In [Part 0](#0.) the functions and code used to extract the data for each superhero are presented. In [Part 1](#1.) the basic statistics and visualisations of the resulting network are illustrated. [Part 2](#2.) explores the community structure of the network and finally, in [Part 3](#3.), sentiment analysis is carried out for the 'good' and 'bad' heroes.  

For this notebook, the questions will be shown as indented text, as follows:

> Question 

The answers are shown in the subsequent text cell, starting with __Answer__: 

Before starting, we import the necessary libraries:

In [6]:
from io import BytesIO
import requests
import pandas as pd
from sqlalchemy import create_engine
import re
import matplotlib.pyplot as plt
%matplotlib inline
import pickle
import numpy as np
from tqdm.notebook import tqdm
import networkx as nx

Our work has been made easier by the provision of the names and wikilinks of the characters for both the DC and Marvel Universes. This data is stored as `.csv` files on the GitHub page of the course for the [Marvel](https://github.com/SocialComplexityLab/socialgraphs2022/blob/main/files/marvel.csv) and [DC](https://github.com/SocialComplexityLab/socialgraphs2022/blob/main/files/dc.csv) universes. We have created the text files and uploaded them to a [public GitHub repository](https://github.com/gpanagioto/projects_socialgraphs22/tree/main/Assignment2/Txt_files).

After extraction, our DC and Marvel superhero datasets were stored in a cloud PostgreSQL database, where they can be accessed quickly and easily.

In [7]:
# Define a function for importing data from the cloud DB. It returns one DataFrame for the given table (one table per universe) with two columns: character_name, wiki_text
def DataImport(table_name):
  host="ec2-54-75-184-144.eu-west-1.compute.amazonaws.com"
  port="5432" 
  dbname="dab1kopm5t3l06"
  user="kpervzhazofybh" 
  password="0d1b5470c51c481880eed267865a8529bdc671f8cb90702d6dcb9e7c199d02ee"

  Engine   = create_engine('postgresql+psycopg2://{}:{}@{}:{}/{}'.format(user,password,host,port,dbname))

  # Connect to the PostgreSQL server; the connection is closed when the block exits
  with Engine.connect() as dbConnection:
    # Read the whole table into a DataFrame
    df = pd.read_sql("select * from {}".format(table_name), dbConnection)

  pd.set_option('display.expand_frame_repr', False)

  return df

The network has been saved as a `.gpickle` file and is available from the [repository](https://github.com/gpanagioto/projects_socialgraphs22/blob/main/Assignment2/SuperHeroesGraph.gpickle). In the following cell, this file is loaded and the DiGraph `G` contains the information for the superheroes network.

In [8]:
# Our Network has been stored as a pickle file in our GitHub Repository of this second assignment
mLink = 'https://github.com/gpanagioto/projects_socialgraphs22/blob/main/Assignment2/SuperHeroesGraph.gpickle?raw=true'
mfile = BytesIO(requests.get(mLink).content)
G = pickle.load(mfile)

#G_dir = pickle.load(open('/content/drive/MyDrive/DTU/02805 Social graphs and interactions/SuperHeroesGraph.gpickle', 'rb'))

Finally, the edge list for the generated network, prior to removing any nodes, has been stored in `.pickle` format in the [same public repository](https://github.com/gpanagioto/projects_socialgraphs22/blob/main/Assignment2/superheroes_edgelist.pickle). To recreate the network, the file `superheroes_edgelist.pickle` is loaded into a variable using the `pickle` module. The graph is then created with `networkx`'s function [`from_edgelist`](https://networkx.org/documentation/stable/reference/generated/networkx.convert.from_edgelist.html), passing the optional argument `create_using=nx.DiGraph`.
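
The steps described above can be sketched as follows. This is a minimal illustration rather than the exact code used for the assignment: the helper name `graph_from_pickled_edgelist` is hypothetical, the edge list is assumed to be an iterable of `(source, target)` tuples, and a small in-memory toy edge list stands in for `superheroes_edgelist.pickle`.

```python
import pickle
from io import BytesIO

import networkx as nx


def graph_from_pickled_edgelist(file_obj):
    """Load a pickled edge list and build a directed graph from it (hypothetical helper)."""
    # Assumption: the pickle holds an iterable of (source, target) tuples
    edgelist = pickle.load(file_obj)
    return nx.from_edgelist(edgelist, create_using=nx.DiGraph)


# Toy stand-in for superheroes_edgelist.pickle; the edges are made up for illustration
toy_edges = [("Batman", "Robin"), ("Robin", "Batman"), ("Batman", "Joker")]
toy_file = BytesIO(pickle.dumps(toy_edges))

G_toy = graph_from_pickled_edgelist(toy_file)
print(G_toy.number_of_nodes(), G_toy.number_of_edges())
```

For the real file, the same helper could be called on a file opened in binary mode, for example `open('superheroes_edgelist.pickle', 'rb')`.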

<a id='0.'></a>
# Part 0: Data 


* Write a short paragraph describing the network. The paragraph should contain the following information

  * The number of nodes and links.
  * The average, median, mode, minimum and maximum value of the network's in-degrees, and of the out-degrees.

In [None]:
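from scipy import stats as st  # needed for st.mode; not part of the imports cell above

def Measures(Graph, TypeOfDegree, Data):
  # Collect the in- or out-degree of every node in Data
  if TypeOfDegree == 'In':
    values = list(dict(Graph.in_degree(Data)).values())
  else:
    values = list(dict(Graph.out_degree(Data)).values())

  mean = np.mean(values)
  print("The average value is {}.".format(mean))
  median = np.median(values)
  print("\nThe median is {}.".format(median))
  # st.mode returns an array in older SciPy releases and a scalar in newer ones
  mode = np.atleast_1d(st.mode(values).mode)[0]
  print("\nThe mode is {}.".format(mode))
  minimum = np.min(values)  # avoid shadowing the built-in min
  print("\nThe min is {}.".format(minimum))
  maximum = np.max(values)  # avoid shadowing the built-in max
  print("\nThe max is {}.".format(maximum))

  return mean, median, mode, minimum, maximum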

In [None]:
print('The network has {} nodes and {} edges.'.format(G.number_of_nodes(), G.number_of_edges()))

The network has 2538 nodes and 29998 edges.


In [None]:
Data = list(G.nodes())
TypeOfDegree = 'In'
average, median, mode, minimum, maximum = Measures(G, TypeOfDegree, Data)

The average value is 11.819542947202521.

The median is 4.0.

The mode is 1.

The min is 0.

The max is 448.


In [None]:
Data = list(G.nodes())
TypeOfDegree = 'Out'
average, median, mode, minimum, maximum = Measures(G, TypeOfDegree, Data)

The average value is 11.819542947202521.

The median is 7.0.

The mode is 0.

The min is 0.

The max is 112.
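
The average in-degree and the average out-degree printed above are identical. This is expected in any directed graph: every edge adds exactly one to some node's out-degree and one to some node's in-degree, so both averages equal the number of edges divided by the number of nodes. The short check below illustrates this on a made-up toy graph (the node names are placeholders) and then redoes the arithmetic with the node and edge counts reported earlier.

```python
import networkx as nx

# Toy directed graph, used only to illustrate the identity
toy = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
n, m = toy.number_of_nodes(), toy.number_of_edges()

mean_in = sum(d for _, d in toy.in_degree()) / n
mean_out = sum(d for _, d in toy.out_degree()) / n
print(mean_in, mean_out, m / n)  # all three values coincide

# Same arithmetic with the counts reported for the superhero network
expected_mean_degree = 29998 / 2538
print(expected_mean_degree)
```

The last value should agree with the average of about 11.82 reported for both degree types.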


* We also want the degree distributions and a plot of the network

  * Create in- and out-going degree distributions as described in Lecture 5.
  * Estimate the slope of the incoming degree distribution as described in Lecture 5.
  * Plot the network using the Force Atlas algorithm as described in Lecture 5.
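
As a rough illustration of the first two points, the sketch below computes a degree distribution with `numpy` and estimates its slope with a least-squares line fit in log-log space. This is an assumption about one possible approach and not necessarily the procedure from Lecture 5; the helper names `degree_distribution` and `loglog_slope` are hypothetical, and the degrees are synthetic values chosen so that the expected slope is known.

```python
import numpy as np


def degree_distribution(degrees):
    """Return the distinct degree values k and the fraction of nodes p(k) with each."""
    counts = np.bincount(np.asarray(degrees, dtype=int))
    k = np.nonzero(counts)[0]
    return k, counts[k] / counts.sum()


def loglog_slope(k, p):
    """Least-squares slope of log10 p(k) against log10 k, skipping k = 0."""
    mask = k > 0
    slope, _intercept = np.polyfit(np.log10(k[mask]), np.log10(p[mask]), 1)
    return slope


# Synthetic degrees: the number of nodes with degree k is proportional to k**-2
toy_degrees = np.repeat([1, 2, 4, 8], [64, 16, 4, 1])
k, p = degree_distribution(toy_degrees)
slope = loglog_slope(k, p)
print(slope)  # close to -2 for this synthetic data
```

For the superhero network, the input could be `[d for _, d in G.in_degree()]`. A plain line fit over all degrees is sensitive to the noisy tail of the distribution, so the resulting slope should be read as a coarse estimate.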
