# Instructions
On this first assignment, applying the basic functions of the Igraph package is required. The following datasets are going to be used:

* Actors dataset - undirected graph - : For the 2005 Graph Drawing conference a data set was provided of the IMDB movie database. We will use a reduced version of this dataset, which derived all actor-actor collaboration edges where the actors co-starred in at least 2 movies together between 1995 and 2004. 


You have to complete the code chunks in this document but also analyze the results, extract insights and answer the short questions. Fill the CSV attached with your answers, sometimes just the number is enough, some others just a small sentence or paragraph. Remember to change the header with your email.

In your submission please upload both this document in HTML and the CSV with the solutions.


# Loading data

In this section, the goal is loading the datasets given, building the graph and analyzing basics metrics. Include the edge or node attributes you consider.

Describe the values provided by summary function on the graph object.





In [1]:
import pandas as pd

In [2]:
!pip install python-igraph 
!apt-get install libcairo2-dev libjpeg-dev libgif-dev
!pip install pycairo
!pip install cairocffi

Collecting python-igraph
  Downloading python_igraph-0.9.6-cp37-cp37m-manylinux2010_x86_64.whl (3.2 MB)
[K     |████████████████████████████████| 3.2 MB 12.9 MB/s 
[?25hCollecting texttable>=1.6.2
  Downloading texttable-1.6.4-py2.py3-none-any.whl (10 kB)
Installing collected packages: texttable, python-igraph
Successfully installed python-igraph-0.9.6 texttable-1.6.4
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libjpeg-dev is already the newest version (8c-2ubuntu8).
libjpeg-dev set to manually installed.
libgif-dev is already the newest version (5.1.4-2ubuntu0.1).
libgif-dev set to manually installed.
The following additional packages will be installed:
  libcairo-script-interpreter2 libpixman-1-dev libxcb-shm0-dev
Suggested packages:
  libcairo2-doc
The following NEW packages will be installed:
  libcairo-script-interpreter2 libcairo2-dev libpixman-1-dev libxcb-shm0-dev
0 upgraded, 4 newly installed, 0 to remove and 37 not upgrade

In [3]:
import igraph as ig

In [4]:
from igraph import *
import cairo
import cairocffi
import pandas as pd

In [5]:
#imports CSV files from local drive
from google.colab import files
uploaded = files.upload()

Saving imdb_actor_edges.tsv to imdb_actor_edges.tsv
Saving imdb_actors_key.tsv to imdb_actors_key.tsv


In [6]:
#save the files as df_key and df_act

df_key = pd.read_csv("imdb_actors_key.tsv", sep='\t', header=0, encoding='windows-1252')
df_act = pd.read_csv("imdb_actor_edges.tsv", sep='\t', header=0, encoding='windows-1252')

In [7]:
df_key.head()

Unnamed: 0,id,name,movies_95_04,main_genre,genres
0,15629,"Rudder, Michael (I)",12,Thriller,"Action:1,Comedy:1,Drama:1,Fantasy:1,Horror:1,N..."
1,5026,"Morgan, Debbi",16,Drama,"Comedy:2,Documentary:1,Drama:6,Horror:2,NULL:3..."
2,11252,"Bellows, Gil",33,Drama,"Comedy:6,Documentary:1,Drama:7,Family:1,Fantas..."
3,5150,"Dray, Albert",20,Comedy,"Comedy:6,Crime:1,Documentary:1,Drama:4,NULL:5,..."
4,4057,"Daly, Shane (I)",18,Drama,"Comedy:2,Crime:1,Drama:7,Horror:1,Music:1,Musi..."


In [None]:
df_act.head()

Unnamed: 0,from,to,weight
0,17776,17778,6
1,5578,9770,3
2,5578,929,2
3,5578,9982,2
4,1835,6278,2


In [8]:
g1 = Graph.DataFrame(df_act, directed=False, vertices = df_key)

**1) How many nodes and edges are there?**


In [None]:
#Output will show that there are 17577 nodes and 287074 edges
summary(g1)

IGRAPH UNW- 17577 287074 -- 
+ attr: genres (v), main_genre (v), movies_95_04 (v), name (v), weight (e)


In [None]:
g1.ecount()

287074

In [None]:
g1.vcount()

17577

# Degree distribution

Analyse the degree distribution. Compute the total degree distribution.




**3) What does this distributions look like?**

The distribution appears to show a long tail. It seems that many actors have 35 degrees or less, whereas only a small number of actors have over 100 degrees. While the relationship between number of degrees and fame is not direct, I would expect that those with fewer degrees tend to not have "made it" in the movie world, and may have only starred in a few movies at most.
I would guess that those with a higher amount of degrees have starred in more movies, with new actors, and in a diverse range of genres.


In [None]:
#the distribution appears to show a long tail. It seems that most actors have 50 degrees or less, whereas a small number of actors have over 100 degrees.
#I would guess that those with a higher amount of degrees have starred in more movies and in a diverse range of genres

import plotly.express as px
df = px.data.tips()
data=g1.get_vertex_dataframe()
data['degree']=g1.degree(mode='all')
fig = px.histogram(data, x="degree")
fig.show()

**4) What is the maximum degree?**


In [None]:
g1.maxdegree()

784

**5) What is the minum degree?**

In [None]:
min(Graph.degree(g1))

1

# Network Diameter and Average Path Length

You have functions in igraph to calculate the diameter and the average path length. Think if you should consider the weights, the directions, etc.




**6) What is the diameter of the graph?**
*italicized text*

In [None]:
#max distance between 2 vertices

g1.diameter()

16

**7) What is the avg path length of the graph?**

In [9]:

g1.average_path_length(directed=False, unconn=True)  

4.890545545798965

# Node importance: Centrality measures

(Optional but recommended): Obtain the distribution of the number of movies made by an actor and the number of genres in which an actor starred in. It may be useful to analyze and discuss the results to be obtained in the following exercises.

In [10]:
df_key.head()

Unnamed: 0,id,name,movies_95_04,main_genre,genres
0,15629,"Rudder, Michael (I)",12,Thriller,"Action:1,Comedy:1,Drama:1,Fantasy:1,Horror:1,N..."
1,5026,"Morgan, Debbi",16,Drama,"Comedy:2,Documentary:1,Drama:6,Horror:2,NULL:3..."
2,11252,"Bellows, Gil",33,Drama,"Comedy:6,Documentary:1,Drama:7,Family:1,Fantas..."
3,5150,"Dray, Albert",20,Comedy,"Comedy:6,Crime:1,Documentary:1,Drama:4,NULL:5,..."
4,4057,"Daly, Shane (I)",18,Drama,"Comedy:2,Crime:1,Drama:7,Horror:1,Music:1,Musi..."


In [11]:
#This will count the number of genres. Note that it includes the NULL genres.

df_key['# of genres']= df_key['genres'].str.count(":").head()

Obtain three vectors with the degree, betweeness and closeness for each vertex of the actors' graph.

In [12]:
df_key['degree'] = Graph.degree(g1)

In [13]:
df_key.sort_values("degree", ascending=False).head()

Unnamed: 0,id,name,movies_95_04,main_genre,genres,# of genres,degree
12147,162,"Davis, Mark (V)",540,Adult,"Action:1,Adult:429,Comedy:3,Crime:1,Documentar...",,784
1761,1743,"Sanders, Alex (I)",467,Adult,"Action:1,Adult:380,Adventure:1,Comedy:2,Docume...",,610
13442,1754,"North, Peter (I)",460,Adult,"Action:1,Adult:389,Documentary:5,Drama:5,NULL:...",,599
11272,1802,"Marcus, Mr.",435,Adult,"Adult:343,Crime:1,Documentary:2,NULL:86,Short:...",,584
4092,407,"Tedeschi, Tony",364,Adult,"Adult:286,Adventure:1,Comedy:1,Documentary:2,D...",,561


In [14]:
degree1 = df_key['degree']

In [15]:
degree1

0         36
1         23
2         22
3         23
4         46
        ... 
17572     18
17573      5
17574     57
17575    350
17576     62
Name: degree, Length: 17577, dtype: int64

In [16]:
df_key['betweenness'] = Graph.betweenness(g1)

In [17]:
betweenness1 = df_key['betweenness']

In [18]:
df_key['closeness'] = Graph.closeness(g1)

In [19]:
closeness1 = df_key['closeness']

Obtain the list of the 20 actors with the largest degree centrality. It can be useful to show a list with the degree, the name of the actor, the number of movies, the main genre, and the number of genres in which the actor has participated.

**8) Who is the actor with highest degree centrality?**

**9) How do you explain the high degree of the top-20 list??**

Degree can be thought of as a simple count of the total number of connections linked to a vertex. It can be thought of as a kind of popularity measure. In this case, it might reflect actors who are well connected with others, which could be due to factors such as: a) frequently working with new actors/actresses, b) having been in a variety of genres, c) their geographical location, etc.

It's notable that Adult actors tend to score high on centrality measures. This is likely due to them frequently starring with other actors and producing many more films than average.

In [20]:
df_key.sort_values("degree", ascending = False).head(20)

Unnamed: 0,id,name,movies_95_04,main_genre,genres,# of genres,degree,betweenness,closeness
12147,162,"Davis, Mark (V)",540,Adult,"Action:1,Adult:429,Comedy:3,Crime:1,Documentar...",,784,931853.1,0.2493
1761,1743,"Sanders, Alex (I)",467,Adult,"Action:1,Adult:380,Adventure:1,Comedy:2,Docume...",,610,557236.5,0.245821
13442,1754,"North, Peter (I)",460,Adult,"Action:1,Adult:389,Documentary:5,Drama:5,NULL:...",,599,417338.5,0.241765
11272,1802,"Marcus, Mr.",435,Adult,"Adult:343,Crime:1,Documentary:2,NULL:86,Short:...",,584,1463808.0,0.249964
4092,407,"Tedeschi, Tony",364,Adult,"Adult:286,Adventure:1,Comedy:1,Documentary:2,D...",,561,672163.5,0.245693
8354,164,"Dough, Jon",300,Adult,"Adult:248,Adventure:1,Comedy:1,Documentary:1,D...",,555,863647.9,0.248562
5968,179,"Stone, Lee (II)",403,Adult,"Adult:310,Comedy:1,Documentary:1,Fantasy:2,NUL...",,545,339310.9,0.238488
2236,176,"Voyeur, Vince",370,Adult,"Action:1,Adult:303,Comedy:3,Crime:1,Documentar...",,533,381060.6,0.245783
5752,175,"Lawrence, Joel (II)",315,Adult,"Adult:257,Comedy:1,Documentary:1,Musical:1,NUL...",,500,285123.6,0.241337
15511,160,"Steele, Lexington",429,Adult,"Adult:340,Comedy:1,Documentary:4,Drama:1,Fanta...",,493,297173.5,0.240841


Obtain the list of the 20 actors with the largest betweenness centrality. Show a list with the betweenness, the name of the actor, the number of movies, the main genre, and the number of genres in which the actor has participated.

**10) Who is the actor with highest betweenes?**

**11) How do you explain the high betweenness of the top-20 list?**

Betweenness centrality measures the number of times a node lies on the shortest path between other nodes.

Betweenness is useful for analyzing communication dynamics. Here, we might expect that actors in the top 20 list tend to have influence in their areas. This is likely the case for #1, Ron Jeremy, who has the highest betweenness by far. He has starred in a very high amount of films, and likely often with different Adult actresses. 

In [None]:
df_key.sort_values("betweenness", ascending = False).head(20)

Unnamed: 0,id,name,movies_95_04,main_genre,genres,# of genres,degree,betweenness,closeness
10548,2108,"Jeremy, Ron",280,Adult,"Adult:149,Adventure:1,Animation:1,Comedy:15,Do...",,471,9748544.0,0.28272
4693,3284,"Chan, Jackie (I)",59,Comedy,"Action:2,Comedy:13,Crime:4,Documentary:18,Fami...",,135,4716909.0,0.287238
2563,564,"Cruz, Penélope",46,Drama,"Adventure:1,Comedy:2,Documentary:5,Drama:6,Fam...",,182,4330663.0,0.295555
14433,14458,"Shahlavi, Darren",16,Action,"Action:4,Comedy:3,Documentary:1,Drama:1,Fantas...",,8,4295503.0,0.193886
15720,17308,"Del Rosario, Monsour",20,Action,"Action:8,Drama:3,Fantasy:1,Horror:2,NULL:1,Rom...",,6,4267099.0,0.163154
17458,285,"Depardieu, Gérard",56,Comedy,"Adventure:1,Comedy:15,Crime:2,Documentary:11,D...",,159,4037356.0,0.278351
8799,13723,"Bachchan, Amitabh",35,Romance,"Action:1,Comedy:1,Crime:1,Documentary:1,Drama:...",,66,2570247.0,0.226349
10412,1529,"Jackson, Samuel L.",97,Drama,"Action:3,Adventure:1,Comedy:3,Crime:3,Document...",,427,2539614.0,0.309265
5517,5083,"Soualem, Zinedine",65,Comedy,"Animation:1,Comedy:17,Crime:3,Documentary:1,Dr...",,121,2368164.0,0.249825
15894,1923,"Del Rio, Olivia",84,Adult,"Adult:64,Drama:1,Fantasy:2,NULL:14,Sci-Fi:1,Sh...",,168,2316388.0,0.240033


Obtain the list of the 20 actors with the largest closeness centrality. Show a list with the closeness the name of the actor, the number of movies, the main genre, and the number of genres in which the actor has participated.

**12) Who is the actor with highest closeness centrality?**

**13) How do you explain the high closeness of the top-20 list?**

Closeness centrality scores each node based on their ‘closeness’ to all other nodes in the network. For finding the individuals who are best placed to influence the entire network most quickly. When people belong to a social network that is highly connected, they may all tend to score high in closeness.

We might be able to derive more information from "influences" from single clusters that aren't highly connected.


In [27]:
df_key[df_key['closeness'] < 1].sort_values("closeness", ascending = False).head(20)


Unnamed: 0,id,name,movies_95_04,main_genre,genres,# of genres,degree,betweenness,closeness
2109,16747,"Armanis, Julian",12,Adult,"Adult:11,Documentary:1",,6,24.0,0.714286
14828,13582,"Fazira, Erra",13,Romance,"Animation:1,Crime:1,NULL:2,Romance:9",,1,0.0,0.666667
13001,16913,"Lim, Kay Tong",11,Drama,"Comedy:3,Drama:3,Romance:2,Short:1,Thriller:1,...",,1,0.0,0.666667
6367,17822,"Lee, Mark (X)",10,Comedy,"Comedy:4,Crime:2,Drama:1,Family:1,NULL:1,Roman...",,1,0.0,0.666667
17467,13581,"Hassan, Jalaluddin",14,Romance,"Comedy:1,Drama:5,NULL:1,Romance:6,Sci-Fi:1",,1,0.0,0.666667
2567,17804,"Kovac, Erik",11,Adult,"Adult:8,Documentary:1,NULL:2",,6,1.0,0.588235
9514,17803,"Sulik, Dano",21,Adult,"Adult:17,Documentary:1,NULL:2,Romance:1",,6,1.0,0.588235
7288,16740,"Novotny, Pavel",15,Adult,Adult:15,,4,21.0,0.588235
6659,16745,"Bonnet, Sebastian",17,Adult,"Adult:14,Documentary:1,NULL:1,Romance:1",,6,1.0,0.588235
5377,16748,"Davidov, Ion",10,Adult,"Adult:7,Documentary:1,NULL:2",,6,1.0,0.588235
