# Introduction

In this kernel, I will show how basic **NLP** and **Social Network Analysis** can work together to generate a similarity network between texts. Social networks are structures that represent entities and the relations between them. We will visualize the social network as a **graph** and draw some conclusions using network metrics.

__Note:__ This kernel was partly inspired by [Anisotropic](https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial)'s kernel. Please check it out; it's very interesting!

## Notebook summary<br><br>

**1. Data contextualization**

**2. Basic NLP: Constructing Document x Document Matrix**

**3. Social Network as a Graph**

**4. Network Centralities: Eigenvector and Betweenness**<br><br><br>

“Invisible things are the only realities.” 
> Edgar Allan Poe, Loss of Breath 

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import TfidfVectorizer
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from matplotlib import pyplot as plt
import networkx as nx
import networkx.drawing.layout as nxlayout
from math import sqrt
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

sample_submission.csv
test.csv
train.csv



## 1. Data Contextualization: The Authors

* [Edgar Allan Poe - EAP](https://en.wikipedia.org/wiki/Edgar_Allan_Poe): "Edgar Allan Poe was an American writer, editor, and literary critic. Poe is best known for his **poetry and short stories**, particularly his tales of **mystery and the macabre**. He is widely regarded as a central figure of **Romanticism** in the United States and American literature as a whole, and he was one of the country's earliest practitioners of the short story. Poe is generally considered the **inventor of the detective fiction genre** and is further credited with **contributing to the emerging genre of science fiction**. He was the first well-known American writer to try to earn a living through writing alone, resulting in a **financially difficult life and career**."
* [H. P. Lovecraft - HPL](https://en.wikipedia.org/wiki/H._P._Lovecraft): "Howard Phillips Lovecraft was an American writer who achieved posthumous fame through his influential works of **horror fiction**. He is regarded as one of the most significant 20th-century authors in his genre. Among his most celebrated tales are **"The Rats in the Walls", "The Call of Cthulhu", "At the Mountains of Madness" and "The Shadow Out of Time"**, all canonical to the Cthulhu Mythos."
* [Mary Shelley - MWS](https://en.wikipedia.org/wiki/Mary_Shelley): "Mary Wollstonecraft Shelley was an English novelist, short story writer, dramatist, essayist, biographer, and travel writer, best known for her Gothic novel **Frankenstein: or, The Modern Prometheus (1818)**."
> Wikipedia (adapted)

## 2. Basic NLP: Constructing Document x Document Matrix

Let's build our [Document x Term matrix (DxT)](https://en.wikipedia.org/wiki/Document-term_matrix). This matrix has documents as rows and terms as columns. Each value DxT[a,b] in the matrix is the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) of term[b] in document[a].

I will let the **sklearn.feature_extraction.text.TfidfVectorizer** deal with removing stopwords, tokenization and normalization of the matrix. :D

In [2]:
df = pd.read_csv('../input/train.csv')
tfidf = TfidfVectorizer(stop_words='english',norm='l2')
DxT = tfidf.fit_transform(df['text'])
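# get_feature_names() was removed in scikit-learn 1.2; older versions need it instead of get_feature_names_out()
pd.DataFrame(data=DxT.toarray(),index=df['id'],columns=tfidf.get_feature_names_out()).head()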

| id | aaem | ab | aback | abaft | abandon | abandoned | abandoning | abandonment | abaout | abased | ... | æneid | ærial | æronaut | æronauts | ærostation | æschylus | élite | émeutes | οἶδα | υπνος |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id26305 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| id17569 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| id11008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| id27763 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| id12958 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.267616 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


As you can see, there are a lot of zeroes in our DxT matrix. This is because DxT matrices are normally [sparse](https://en.wikipedia.org/wiki/Sparse_matrix): each document uses only a small fraction of the whole vocabulary.
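
To make the sparsity concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_docs` is hypothetical and not taken from the competition data). It builds a small DxT matrix with the same vectorizer settings and measures the fraction of non-zero entries.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the raven sat upon the bust",
    "the monster fled across the ice",
    "a raven and a monster met at midnight",
]

toy_tfidf = TfidfVectorizer(stop_words='english', norm='l2')
toy_DxT = toy_tfidf.fit_transform(toy_docs)  # scipy sparse matrix, one row per document

n_docs, n_terms = toy_DxT.shape
# nnz counts the stored (non-zero) entries of the sparse matrix
density = toy_DxT.nnz / (n_docs * n_terms)
print(toy_DxT.shape, round(density, 3))
```

On the real training set the vocabulary is far larger than any single sentence, so the density should be much lower than in this toy case.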

You can think of each row of the DxT matrix as a **vector representation** of a document. In the next step, we will calculate the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the vectors representing documents. This will produce a Document x Document (DxD) matrix where each value DxD[a,b] is the **cosine of the angle between Document[a] and Document[b]**. The formula to calculate the cosine of the angle between two vectors A and B is:

$$\cos(\theta) = \frac{A \cdot B}{|A|\,|B|}$$

But, in our case, we have already **normalized** the document vectors! Remember the **TfidfVectorizer**? Its "norm" argument lets us choose a normalization method. With **"l2"**, also called the **Euclidean norm**, TfidfVectorizer takes each document vector A and divides it by |A|. This means that the length of every document vector becomes 1! In this case, our cosine similarity formula becomes:

$$\cos(\theta) = A \cdot B$$

Applying the dot product between the matrix DxT and the matrix TxD (DxT transposed) results in the matrix DxD!
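
As a sanity check of this shortcut, the sketch below (again on a hypothetical toy corpus) confirms that the rows have unit length after l2 normalization and that the plain matrix product matches scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "the raven sat upon the bust",
    "the monster fled across the ice",
    "a raven and a monster met at midnight",
]
toy_DxT = TfidfVectorizer(stop_words='english', norm='l2').fit_transform(toy_docs)

# With norm='l2', every document vector should have Euclidean length 1
row_norms = np.linalg.norm(toy_DxT.toarray(), axis=1)

# DxT times its transpose gives all pairwise dot products at once
toy_DxD = (toy_DxT @ toy_DxT.T).toarray()

# Reference value computed with the full cosine formula
reference = cosine_similarity(toy_DxT)
print(np.allclose(row_norms, 1.0), np.allclose(toy_DxD, reference))
```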

In [3]:
# Sparse matrix product; np.dot is not reliable with scipy sparse inputs
DxD = DxT.dot(DxT.T)

Now that NLP is over, we can start with Social Network Analysis! :D

## 3. Social Network as a Graph

In [Social Network Analysis](https://en.wikipedia.org/wiki/Social_network_analysis), there are two basic elements that characterize network structures: **Nodes** (representing entities) and **Edges** (establishing relations). In our case,
* The Nodes will be the rows of the dataset, each identified by its "id" and having "text" and "author" as attributes;
* The establishment of an Edge between Nodes will be defined as follows:

        Let "a" and "b" be Nodes,

        if DxD[a,b] >= cutoff, establish Edge[a,b,w] with w = DxD[a,b];
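
The edge rule above can be sketched on a tiny, hand-written similarity matrix. Everything here (`toy_ids`, `toy_DxD`, the values) is hypothetical; the point is only to show the cutoff test and the weight assignment.

```python
import numpy as np
import networkx as nx

# Hypothetical symmetric similarity matrix with ones on the diagonal.
toy_ids = ['a', 'b', 'c', 'd']
toy_DxD = np.array([
    [1.0, 0.7, 0.2, 0.0],
    [0.7, 1.0, 0.5, 0.1],
    [0.2, 0.5, 1.0, 0.3],
    [0.0, 0.1, 0.3, 1.0],
])
cutoff = 0.5

toy_G = nx.Graph()
toy_G.add_nodes_from(toy_ids)

# Visit only the upper triangle (k=1 skips the diagonal), so each pair is tested once
rows, cols = np.triu_indices(len(toy_ids), k=1)
for i, j in zip(rows, cols):
    if toy_DxD[i, j] >= cutoff:
        toy_G.add_edge(toy_ids[i], toy_ids[j], weight=float(toy_DxD[i, j]))

print(sorted(toy_G.edges(data='weight')))
```

With this cutoff, node `d` stays without edges, which is the kind of isolated node the notebook removes before drawing.
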
**Graph Representation:**<br>
Now that our basic structures are chosen, we can turn to the visual aspects of our graph representation. The nodes will be represented as **circles** and the edges as **lines** between nodes. In this case, the network is **undirected**, which means that the relations between nodes go **both ways**. (When a network is directed, the relations have a direction associated with them, so they are drawn as **arrows**.) The **thickness** of each edge will be proportional to the similarity between its pair of nodes, and the **color** of each node will represent its **author**.
 
 

In [15]:
G = nx.Graph()
for i in range(len(df)):
    idx = df.at[i,'id']
    text = df.at[i,'text']
    author = df.at[i,'author']
    G.add_node(idx,text=text,author=author)

dense_DxD = DxD.toarray()
len_dense = len(dense_DxD)
cutoff=0.5
for i in range(len_dense):
    for j in range(i+1,len_dense):
        if dense_DxD[i,j]>=cutoff:
            weight=dense_DxD[i,j]
            G.add_edge(df.at[i,'id'],df.at[j,'id'],weight=weight)

# Drop nodes that ended up with no edges (isolates)
G.remove_nodes_from(list(nx.isolates(G)))

pos = nxlayout.fruchterman_reingold_layout(G,k=1.5/sqrt(len(G.nodes())))

edge_data = []
# Numeric codes so they line up with the colorbar tickvals [1,2,3]
colors = {'EAP':1,'HPL':2,'MWS':3}
for u,v,w in G.edges(data=True):
    x0,y0 = pos[u]
    x1,y1 = pos[v]
    w = w['weight']
    edge_data.append(go.Scatter(x=[x0,x1,None],
                            y=[y0,y1,None],
                            line=dict(width=3.0*w,color='#888'),
                            hoverinfo='none',
                            mode='lines'))


# Collect coordinates, hover text and author color codes first, then build the trace in one go
# (appending to trace properties fails in recent plotly versions, where they are tuples)
node_x, node_y, node_text, node_color = [], [], [], []
for u,w in G.nodes(data=True):
    x,y = pos[u]
    node_x.append(x)
    node_y.append(y)
    node_text.append(w['text'])
    node_color.append(int(colors[w['author']]))

node_data = go.Scatter(
        x=node_x,
        y=node_y,
        text=node_text,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='Viridis',
            reversescale=True,
            color=node_color,
            size=5.0,
            colorbar=dict(
                thickness=15,
                xanchor='left',
                tickmode='array',
                tickvals=[1,2,3],
                ticktext=['EAP','HPL','MWS'],
                ticks = 'outside'
            ),
            line=dict(width=0.5)))

In [16]:
py.iplot(go.Figure(data=edge_data+[node_data],
                layout=go.Layout(
                width=800,
                height=600,
                title='<br>Spooky Similarity Network',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                annotations=[ dict(
                    text="Kaggle",
                    showarrow=False,
                    xref="paper", yref="paper",
                    x=0.005, y=-0.002 ) ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))))

## 4. Network Centralities: Betweenness and Eigenvector
<br><br>
To analyse this network, we will use two centrality measurements: **Betweenness centrality** and **Eigenvector centrality**.<br>
[**Betweenness centrality**](https://en.wikipedia.org/wiki/Centrality#Betweenness_centrality) measures how often a node acts as a **bridge** along the shortest path between two other nodes. It captures the control a node has over the communication between groups in the network. In our case, it allows us to identify which nodes work as a "semantic bridge" between semantic groups (this is not an established term; I made it up for lack of a better way to explain it).<br><br>
[**Eigenvector centrality**](https://en.wikipedia.org/wiki/Centrality#Eigenvector_centrality) is a measure of the **influence** of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Google's PageRank and the Katz centrality are variants of the eigenvector centrality.<br><br>
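
To see how the two measures differ, here is a small sketch on a hypothetical graph: two triangles connected through a single node named `bridge`. The node names and the graph itself are made up for illustration.

```python
import networkx as nx

# Hypothetical toy graph: two triangles joined through one bridge node.
toy_G = nx.Graph()
toy_G.add_edges_from([
    ('a1', 'a2'), ('a2', 'a3'), ('a1', 'a3'),   # first triangle
    ('b1', 'b2'), ('b2', 'b3'), ('b1', 'b3'),   # second triangle
    ('a3', 'bridge'), ('bridge', 'b1'),         # the only route between the triangles
])

toy_betweenness = nx.betweenness_centrality(toy_G)
toy_eigen = nx.eigenvector_centrality(toy_G, max_iter=1000)

# Node with the largest score under each measure
top_betweenness = max(toy_betweenness, key=toy_betweenness.get)
top_eigen = max(toy_eigen, key=toy_eigen.get)
print(top_betweenness, top_eigen)
```

Every shortest path between the two triangles passes through `bridge`, so it gets the highest betweenness, while its two neighbours `a3` and `b1` have more connections and score higher on eigenvector centrality.
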
In a hypothetical practical situation, if I had to get a general understanding of a set of texts but could not read everything, I would calculate these metrics, take the top N nodes with the highest centrality values and read them! (I actually do this a lot, by the way.) In this specific case, it doesn't work so well because the texts are short.<br><br>
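
A minimal sketch of this "top N" idea follows. The helper `top_n_nodes` is a hypothetical name, and the example runs on networkx's built-in karate club graph rather than on the similarity network.

```python
import networkx as nx

def top_n_nodes(centrality, n=5):
    """Return the n (node, score) pairs with the largest centrality scores."""
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:n]

# Small built-in example graph, used here only to exercise the helper
toy_G = nx.karate_club_graph()
toy_top = top_n_nodes(nx.betweenness_centrality(toy_G), n=3)

for node, score in toy_top:
    print(node, round(score, 3))
```

On the similarity network, the same helper could be applied to the centrality dictionaries of `G`, and the text of each selected node could presumably be read from its "text" attribute.
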

Plotting the Eigenvector centrality against the Betweenness centrality in a scatter plot is a good way to visualize them! :D

In [17]:
from math import log

def log_scaled(centrality):
    # Divide by the largest score, then apply log(1 + x) to compress the range
    max_value = max(centrality.values())
    return {node: log(1 + value / max_value) for node, value in centrality.items()}

betweenness = log_scaled(nx.betweenness_centrality(G))
eigen = log_scaled(nx.eigenvector_centrality(G))
# Collect the values first, then build the trace in one go
# (appending to trace properties fails in recent plotly versions, where they are tuples)
scatter_x, scatter_y, scatter_text, scatter_color = [], [], [], []
for u,w in G.nodes(data=True):
    scatter_x.append(eigen[u])
    scatter_y.append(betweenness[u])
    scatter_text.append(w['text'])
    scatter_color.append(int(colors[w['author']]))

scatter_data = go.Scatter(
        x=scatter_x,
        y=scatter_y,
        text=scatter_text,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='Viridis',
            reversescale=True,
            color=scatter_color,
            size=5.0,
            colorbar=dict(
                thickness=15,
                xanchor='left',
                tickmode='array',
                tickvals=[1,2,3],
                ticktext=['EAP','HPL','MWS'],
                ticks = 'outside'
            ),
            line=dict(width=0.5)))
py.iplot(go.Figure(data=[scatter_data],
                layout=go.Layout(
                width=800,
                height=600,
                title='<br>Log of Eigenvector Centrality X Log of Betweenness Centrality',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=50,l=50,r=100,t=100),
                annotations=[ dict(
                    text="Kaggle",
                    showarrow=False,
                    xref="paper", yref="paper",
                    x=0, y=-0.005 ) ],
                xaxis=dict(title='Eigenvector',showgrid=True, zeroline=True, showticklabels=True),
                yaxis=dict(title='Betweenness',showgrid=True, zeroline=True, showticklabels=True))))