# Data Science for Twitter

See `scraper.py` for how the Twitter stream is consumed into the database. I initially used the `py2neo` library to interface with neo4j, but the latest version is poorly documented and its level of abstraction is too high, so I switched to the [official Python driver](https://neo4j.com/docs/api/python-driver/current/). See `object.py` for the (failed) implementation of a scraper in `py2neo`.

## Nodes and Relations

The nodes are straightforward: `Tweet` for tweet objects and `User` for user objects. Both use `id` as the primary key; the remaining properties are just extra information saved alongside. I did not include all of the additional information from Twitter's stream API, though.

There are only four relations:  
`User` POSTED `Tweet`  
`Tweet` MENTIONED `User`  
`Tweet` RETWEETED `Tweet`  
`Tweet` REPLIED `Tweet`  

I could have created a separate `Hashtag` node and connected tweets to it, but that would not help answer the questions below.
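As a rough illustration of how the four relations map onto Cypher, the sketch below turns one tweet into a list of `MERGE` statements. It is not the code in `scraper.py`: the helper name `tweet_to_statements` is hypothetical, and the field names (`user`, `entities.user_mentions`, `retweeted_status`, `in_reply_to_status_id`) assume the tweet arrives as a dict shaped like Twitter's v1.1 status JSON.

```python
def tweet_to_statements(tweet):
    """Translate one tweet dict into (cypher, params) pairs, one per relation.

    Hypothetical helper. The field names follow Twitter's v1.1 status JSON,
    which is an assumption about what the stream delivers.
    """
    tid = tweet["id"]
    # Every tweet has an author: User POSTED Tweet
    statements = [(
        "MERGE (u:User {id: $uid}) MERGE (t:Tweet {id: $tid}) "
        "MERGE (u)-[:POSTED]->(t)",
        {"uid": tweet["user"]["id"], "tid": tid},
    )]
    # Tweet MENTIONED User, once per mentioned account
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        statements.append((
            "MERGE (t:Tweet {id: $tid}) MERGE (u:User {id: $uid}) "
            "MERGE (t)-[:MENTIONED]->(u)",
            {"tid": tid, "uid": mention["id"]},
        ))
    # Tweet RETWEETED Tweet, only when the payload embeds the original tweet
    original = tweet.get("retweeted_status")
    if original:
        statements.append((
            "MERGE (rt:Tweet {id: $tid}) MERGE (o:Tweet {id: $oid}) "
            "MERGE (rt)-[:RETWEETED]->(o)",
            {"tid": tid, "oid": original["id"]},
        ))
    # Tweet REPLIED Tweet, only when the tweet is a reply
    parent_id = tweet.get("in_reply_to_status_id")
    if parent_id is not None:
        statements.append((
            "MERGE (r:Tweet {id: $tid}) MERGE (p:Tweet {id: $pid}) "
            "MERGE (r)-[:REPLIED]->(p)",
            {"tid": tid, "pid": parent_id},
        ))
    return statements
```

Each pair could then be passed to the official driver inside a session, for example `session.run(cypher, params)`. Using `MERGE` rather than `CREATE` keeps a node from being duplicated when the same user or tweet shows up again in the stream.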

## Main questions  
---

1. Who is retweeting the most number of tweets?
2. What is the distance from the author to the last retweet?
3. Who is the most connected user in your dataset?

In [1]:
# Setup
from neo4j import GraphDatabase
from collections import Counter
from pprint import pprint
import pandas as pd

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "1234"))

COUNT = "CALL apoc.meta.stats() YIELD labels RETURN labels;"

with driver.session() as session:
    result = session.run(COUNT)
    record = result.single()
    print("Number of respective nodes:")
    print(record.data()['labels'])



Number of respective nodes:
{'Tweet': 152877, 'User': 108001}


## 1. Who is retweeting the most number of tweets?

To rephrase: which user has posted the most tweets that can be classified as retweets?

In [2]:
Q1_QUERY = "MATCH (user)-[:POSTED]->()-[:RETWEETED]->() RETURN user.id AS id, user.username AS username"

with driver.session() as session:
    result = session.run(Q1_QUERY)
    # Read the rows while the session is still open
    data = result.data()

# Get the top 5 user_ids with most retweets
top_5 = Counter([dicts['id'] for dicts in data]).most_common(5)
# Swap each user id for the matching username
top_5 = [([d['username'] for d in data if d['id'] == tup[0]][0], tup[1]) for tup in top_5]
display(pd.DataFrame(top_5, columns=['User', 'Retweets']))


    

| | User | Retweets |
|---|---|---|
| 0 | Sooner1944 | 64 |
| 1 | onmymindmarais | 54 |
| 2 | SallyMoen2 | 42 |
| 3 | BLKFOXGNG | 38 |
| 4 | allohhhh | 32 |
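The cell above pulls one row per retweet into Python and counts there. As a rough sketch of two alternatives: the counting could be pushed into Cypher so that only five rows come back, or, staying in Python, the username lookup could be built once instead of rescanning the rows for every user. The names `Q1_AGGREGATED` and `top_retweeters` are illustrative and not part of the notebook, the aggregated query has not been run against this dataset, and the sample rows are made up.

```python
from collections import Counter

# Assumed alternative: aggregate inside Cypher so only five rows are returned.
Q1_AGGREGATED = """
MATCH (user:User)-[:POSTED]->(:Tweet)-[:RETWEETED]->(:Tweet)
RETURN user.username AS username, count(*) AS retweets
ORDER BY retweets DESC
LIMIT 5
"""


def top_retweeters(rows, n=5):
    """Count rows per user id and return [(username, count), ...] for the top n."""
    counts = Counter(row["id"] for row in rows)
    # Build the id -> username lookup once rather than rescanning rows per user.
    names = {row["id"]: row["username"] for row in rows}
    return [(names[uid], total) for uid, total in counts.most_common(n)]


# Made-up rows in the shape Q1_QUERY returns (one row per retweet).
sample = [
    {"id": 1, "username": "alice"},
    {"id": 2, "username": "bob"},
    {"id": 1, "username": "alice"},
]
print(top_retweeters(sample))  # [('alice', 2), ('bob', 1)]
```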


## 2. What is the distance from the author to the last retweet?

This confused me for a bit. Which author: the one who retweeted, or the one who made the original tweet? And what is the 'last retweet'? Let's say the author is the one who made the original tweet, and the 'last retweet' is the latest retweet of that specific tweet. Wouldn't the 'distance' always be 2?  

author -POSTED> tweet <RETWEETED- retweet 

Then, after a bit of research, I realised that a directed path requires all of its relationships to point in the same direction. neo4j offers two ways to compute this: Cypher's [shortestPath](https://neo4j.com/docs/cypher-manual/3.5/execution-plans/shortestpath-planning/) and the [Dijkstra Shortest Path algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/shortest-path/). I tried both.  


In [3]:
dijk_query = """MATCH (:User)-[:POSTED]->(rt:Tweet)-[:RETWEETED]->(:Tweet)<-[:POSTED]-(author:User)
WITH rt, author
ORDER by rt.created desc limit 1
CALL algo.shortestPath(author, rt, null,{direction:'OUTGOING'})
YIELD writeMillis,loadMillis,nodeCount, totalCost
RETURN writeMillis,loadMillis,nodeCount, totalCost"""

with driver.session() as session:
    result = session.run(dijk_query)
    data = result.data()
print(data[0])

{'writeMillis': 0, 'loadMillis': 141, 'nodeCount': 0, 'totalCost': -1.0}


Uh oh, looks like there isn't any shortest path. Replacing `{direction: 'OUTGOING'}` with `{direction: 'BOTH'}`, which treats the relationships as undirected, would give the expected distance cost of 2.

In [4]:
shortest_query = """MATCH (:User)-[:POSTED]->(rt:Tweet)-[:RETWEETED]->(:Tweet)<-[:POSTED]-(author:User)
WITH rt, author
ORDER by rt.created desc limit 1
MATCH p = shortestPath((rt)-[*]-(author))
RETURN p, length(p)"""

with driver.session() as session:
    result = session.run(shortest_query)
    data = result.data()
print("shortest bidirectional distance:", data[0]['length(p)'])

shortest bidirectional distance: 2


The `shortestPath` pattern above is undirected (`-[*]-`), so it ignores relationship direction. Taken together, the two results suggest that there is no directed path from the author to the retweet, while the undirected path has length 2. Most likely, though:  
* I wrote the query wrong
* I did not store data in a way that could connect the retweet and the author
* I misunderstood the question
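
One way to sanity-check the two results: with the relations as defined in this notebook, `POSTED` and `RETWEETED` both point into the original tweet, so following arrows outward from the author cannot reach the retweet through those two relations. The toy breadth-first search below illustrates this on a hand-made four-node graph. The node names and the `hops` helper are invented for the example, it ignores `MENTIONED` and `REPLIED`, and it does not touch the database.

```python
from collections import deque

# Toy copy of the pattern above: both relationships point INTO the original tweet.
EDGES = [
    ("author", "tweet"),       # author  -POSTED->    tweet
    ("retweeter", "retweet"),  # retweeter -POSTED->  retweet
    ("retweet", "tweet"),      # retweet -RETWEETED-> tweet
]


def hops(edges, start, goal, directed=True):
    """Breadth-first search; returns the hop count, or None if goal is unreachable."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, []).append(dst)
        if not directed:
            # Undirected: the edge can also be walked backwards.
            neighbours.setdefault(dst, []).append(src)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(hops(EDGES, "author", "retweet", directed=True))   # None
print(hops(EDGES, "author", "retweet", directed=False))  # 2
```

If a directed distance from author to retweet is what the question intends, the `RETWEETED` relation would presumably have to be stored in the opposite direction, from the original tweet to the retweet.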

## 3. Who is the most connected user in your dataset?

This required a bit of googling. Eventually I found [this page](http://nicolewhite.github.io/neo4j-jupyter/twitter.html), which computes both 'betweenness' and 'closeness' centrality using the `python-igraph` library.

From that source:  

$betweenness(v) = \sum_{s, t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$

The betweenness centrality of a node $v$ is the number of shortest paths that pass through $v$, $\sigma_{st}(v)$,  divided by the total number of shortest paths, $\sigma_{st}$.

$closeness(v) = \frac{1}{\sum_{x} d(v, x)}$

The closeness centrality is the reciprocal of a node's farness, or sum of its shortest path distances from all other nodes in the graph.
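
To make the two definitions concrete, here is a small brute-force sketch in plain Python, run on a four-node path graph `a - b - c - d`. It is only an illustration of the formulas above, not how `python-igraph` computes them: the function names are made up, it assumes a connected, unweighted, undirected graph, it counts each unordered pair $(s, t)$ once, and it uses the unnormalised closeness $1 / \sum_x d(v, x)$.

```python
from collections import deque
from itertools import combinations


def bfs(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:     # v is a predecessor of w
                sigma[w] += sigma[v]
    return dist, sigma


def centralities(adj):
    """Brute-force betweenness and closeness for a small connected graph."""
    info = {v: bfs(adj, v) for v in adj}
    between = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):      # each unordered pair once
        d_st, sigma_st = info[s][0][t], info[s][1][t]
        for v in adj:
            if v in (s, t):
                continue
            # v lies on a shortest s-t path iff d(s,v) + d(v,t) == d(s,t)
            if info[s][0][v] + info[v][0][t] == d_st:
                between[v] += info[s][1][v] * info[v][1][t] / sigma_st
    close = {v: 1 / sum(info[v][0].values()) for v in adj}
    return between, close


path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
between, close = centralities(path)
print(between)                                   # {'a': 0.0, 'b': 2.0, 'c': 2.0, 'd': 0.0}
print({v: round(c, 3) for v, c in close.items()})  # {'a': 0.167, 'b': 0.25, 'c': 0.25, 'd': 0.167}
```

The two middle nodes score highest on both measures: every shortest path between the two ends has to pass through them, and they are on average closer to everyone else.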

In [5]:
from igraph import Graph as IGraph
from py2neo import Graph

q3_query = """
MATCH (user1:User)-[:POSTED]->(retweet:Tweet)-[:RETWEETED]->(tweet:Tweet),
      (user2:User)-[:POSTED]->(tweet)
RETURN user1.username, user2.username, count(*) AS weight
"""
graph = Graph(password='1234')
data = graph.run(q3_query)
    
# Build an igraph graph from the (user1, user2, weight) rows
ig = IGraph.TupleList(data, weights=True)

# Compute each centrality once for the whole graph instead of once per vertex
between = list(zip(ig.vs["name"], ig.betweenness()))
top_between = sorted(between, key=lambda x: x[1], reverse=True)

close = list(zip(ig.vs["name"], ig.closeness()))
top_close = sorted(close, key=lambda x: x[1], reverse=True)



KeyboardInterrupt: 

In [None]:
print(top_between[:5], top_close[:5])

Since the above takes forever to compute on my Lenovo, even on a fairly small dataset, I tried neo4j's built-in `PageRank` algorithm instead.

In [6]:
from pprint import pprint  # re-imported so this cell runs on its own

pagerank_query = """CALL algo.pageRank.stream('User', null, {direction:'BOTH'})
YIELD nodeId, score
RETURN algo.asNode(nodeId).username AS page, score
ORDER BY score DESC LIMIT 5"""

with driver.session() as session:
    result = session.run(pagerank_query)
    data = result.data()
pprint(data)
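
For intuition about what a PageRank score measures, here is a textbook power-iteration sketch on a made-up graph. It is not neo4j's implementation, and its scores need not match what `algo.pageRank` returns. The usernames, the damping factor of 0.85, and the choice to draw an edge from each retweeter to the author they retweeted are all assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [nodes it links to]}.

    Every linked-to node must also appear as a key in out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share each round.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical retweet graph: retweeter -> author of the original tweet.
retweets = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": [],
    "dave": ["alice"],
}
scores = pagerank(retweets)
for user, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(user, round(score, 3))
```

In this toy graph `carol` ends up with the highest score because two users retweet her, one of whom is retweeted in turn.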