# Data Science for Twitter

See `scraper.py` for how the Twitter stream is consumed into the database. I initially used the `py2neo` library to interface with Neo4j, but its latest version is poorly documented and its level of abstraction is too high, so I switched to the [official Python driver](https://neo4j.com/docs/api/python-driver/current/). See `object.py` for the (failed) implementation of a scraper in `py2neo`.

## Nodes and Relations

The nodes are straightforward: `Tweet` for tweet objects and `User` for user objects. Both use `id` as the primary key; the other properties are just extra information saved alongside. I did not include all the additional information from Twitter's stream API, though.

There are only four relationship types:  
`User` POSTED `Tweet`  
`Tweet` MENTIONED `User`  
`Tweet` RETWEETED `Tweet`  
`Tweet` REPLIED `Tweet`  

I could have created another Hashtag node and connected to that, but that doesn't really help in answering the questions involved.
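
To make the model concrete, here is a minimal sketch of how one tweet could map onto those four relationship types. The simplified dictionary keys (`author`, `mentions`, `retweet_of`, `reply_to`) and the function name are assumptions for illustration; they are not the field names used in `scraper.py` or in Twitter's API.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (start node, relationship type, end node)


def edges_for_tweet(tweet: Dict) -> List[Edge]:
    """Return the graph edges implied by one simplified tweet record."""
    tweet_node = f"Tweet:{tweet['id']}"
    # Every tweet has exactly one author.
    edges: List[Edge] = [(f"User:{tweet['author']}", "POSTED", tweet_node)]
    # One MENTIONED edge per mentioned user.
    for user_id in tweet.get("mentions", []):
        edges.append((tweet_node, "MENTIONED", f"User:{user_id}"))
    # Optional edges to the tweet being retweeted or replied to.
    if tweet.get("retweet_of") is not None:
        edges.append((tweet_node, "RETWEETED", f"Tweet:{tweet['retweet_of']}"))
    if tweet.get("reply_to") is not None:
        edges.append((tweet_node, "REPLIED", f"Tweet:{tweet['reply_to']}"))
    return edges


# Hypothetical record: user 10 retweets tweet 1 and mentions user 11.
example = {"id": 2, "author": 10, "mentions": [11], "retweet_of": 1}
example_edges = edges_for_tweet(example)
for edge in example_edges:
    print(edge)
```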

## Main questions  
---

1. Who is retweeting the most number of tweets?
2. What is the distance from the author to the last retweet?
3. Who is the most connected user in your dataset?

In [1]:
# Setup
from neo4j import GraphDatabase
from collections import Counter
from pprint import pprint
import pandas as pd

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "1234"))

COUNT = "CALL apoc.meta.stats() YIELD labels RETURN labels;"

with driver.session() as session:
    result = session.run(COUNT)
    record = result.single()
    print("Number of respective nodes:")
    print(record.data()['labels'])



Number of respective nodes:
{'Tweet': 27334, 'User': 19496}
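
The remaining cells all repeat the same open-session, run, collect pattern. A small helper could wrap it; `run_query` is a hypothetical name that is not part of the driver, and the sketch relies only on the driver methods already used in this notebook (`session()`, `run()` and `data()`).

```python
from typing import Any, Dict, List, Optional


def run_query(
    driver: Any, query: str, params: Optional[Dict[str, Any]] = None
) -> List[Dict[str, Any]]:
    """Run a Cypher query and return its records as a list of dicts.

    The records are read inside the ``with`` block, i.e. before the
    session is closed.
    """
    with driver.session() as session:
        result = session.run(query, params or {})
        return result.data()
```

With this in place a cell would reduce to something like `display(pd.DataFrame(run_query(driver, Q1_QUERY)))`.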


## 1. Who is retweeting the most number of tweets?

To rephrase: which user has posted the most tweets that are themselves retweets?

In retrospect, it makes much more sense to construct a Cypher query that counts and sorts the relationships for us.

In [2]:
Q1_QUERY = """MATCH (user)-[p:POSTED]->()-[:RETWEETED]->()
WITH user, count(p) as postcount
RETURN user.username AS User, postcount as Retweets
ORDER by postcount DESC
LIMIT 5"""

with driver.session() as session:
    result = session.run(Q1_QUERY)
    # Read the records into a list of dicts before the session closes.
    data = result.data()

display(pd.DataFrame(data))


    

|   | Retweets | User |
|---|----------|------|
| 0 | 39 | LuisEstradaJr69 |
| 1 | 37 | internet_threat |
| 2 | 20 | 702_5832112 |
| 3 | 15 | AbbyGuguBanda |
| 4 | 13 | Nettie_Mo |
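
The alternative to sorting in Cypher is to pull the rows and count them client-side. A minimal sketch with `collections.Counter`, using made-up usernames rather than the real dataset:

```python
from collections import Counter

# Hypothetical rows: one (username, is_retweet) pair per posted tweet.
posts = [
    ("alice", True),
    ("bob", True),
    ("alice", True),
    ("carol", False),
    ("alice", False),
    ("bob", True),
    ("alice", True),
]

# Count only the posts that are retweets, grouped by user.
retweet_counts = Counter(user for user, is_retweet in posts if is_retweet)

# Equivalent of ORDER BY postcount DESC LIMIT 2.
top_retweeters = retweet_counts.most_common(2)
print(top_retweeters)
```

This gives the same ranking, but every row has to travel to the client first, which is why letting the database aggregate is the better choice.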



## 2. What is the distance from the author to the last retweet?

This confused me for a bit. Which author: the one who retweeted, or the one who made the original tweet? And what is the 'last retweet'? Let's say the author is the one who made the original tweet, and the 'last retweet' is the latest retweet of that specific tweet. Wouldn't the 'distance' then always be 2?  

`(author)-[:POSTED]->(tweet)<-[:RETWEETED]-(retweet)`

Then, after a bit of research, I realised that the relationships on a directed path all have to point in the same direction. Neo4j has two options for this: Cypher's [shortestPath](https://neo4j.com/docs/cypher-manual/3.5/execution-plans/shortestpath-planning/) and the [Dijkstra Shortest Path algorithm](https://neo4j.com/docs/graph-algorithms/current/algorithms/shortest-path/) from the graph algorithms library. I tried both.  


In [3]:
dijk_query = """MATCH (:User)-[:POSTED]->(rt:Tweet)-[:RETWEETED]->(:Tweet)<-[:POSTED]-(author:User)
WITH rt, author
ORDER by rt.created desc limit 1
CALL algo.shortestPath(author, rt, null,{direction:'OUTGOING'})
YIELD writeMillis,loadMillis,nodeCount, totalCost
RETURN writeMillis,loadMillis,nodeCount, totalCost"""

with driver.session() as session:
    result = session.run(dijk_query)
    data = result.data()
print(data[0])

{'writeMillis': 0, 'loadMillis': 33, 'nodeCount': 0, 'totalCost': -1.0}


Uh oh, looks like there isn't any shortest path: `totalCost` is -1.0. Replacing `{direction:'OUTGOING'}` with `{direction:'BOTH'}`, which treats the relationships as undirected, would result in the expected distance cost of 2.

In [4]:
shortest_query = """MATCH (:User)-[:POSTED]->(rt:Tweet)-[:RETWEETED]->(:Tweet)<-[:POSTED]-(author:User)
WITH rt, author
ORDER by rt.created desc limit 1
MATCH p = shortestPath((rt)-[*]-(author))
RETURN p, length(p)"""

with driver.session() as session:
    result = session.run(shortest_query)
    data = result.data()
print("shortest bidirectional distance:", data[0]['length(p)'])

shortest bidirectional distance: 2


The `shortestPath` pattern above is written without an arrow, so it ignores relationship direction. It appears that there is no directed path from the author to the retweet, while the undirected path has length 2. Most likely though:  
* I wrote the query wrong
* I did not store data in a way that could connect the retweet and the author
* I misunderstood the question
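
The two results can be reproduced on a toy copy of the pattern with a plain breadth-first search. This is only a sketch of the reasoning; the node names are placeholders, and it does not use `algo.shortestPath` or Cypher.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Toy copy of the pattern: both users point at their tweets, and the
# retweet points at the original tweet.
EDGES: List[Tuple[str, str]] = [
    ("author", "tweet"),       # author -POSTED-> tweet
    ("retweeter", "retweet"),  # retweeter -POSTED-> retweet
    ("retweet", "tweet"),      # retweet -RETWEETED-> tweet
]


def build_adjacency(
    edges: List[Tuple[str, str]], directed: bool
) -> Dict[str, List[str]]:
    """Adjacency lists; undirected mode also adds the reverse of each edge."""
    adjacency: Dict[str, List[str]] = {}
    for start, end in edges:
        adjacency.setdefault(start, []).append(end)
        adjacency.setdefault(end, [])
        if not directed:
            adjacency[end].append(start)
    return adjacency


def bfs_distance(
    adjacency: Dict[str, List[str]], source: str, target: str
) -> Optional[int]:
    """Number of hops from source to target, or None if unreachable."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return None


directed_hops = bfs_distance(build_adjacency(EDGES, True), "author", "retweet")
undirected_hops = bfs_distance(build_adjacency(EDGES, False), "author", "retweet")
print(directed_hops, undirected_hops)
```

Following the arrows, the author can only reach the original tweet, which has no outgoing relationships, so the retweet is unreachable. Ignoring direction, the retweet is two hops away.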

## 3. Who is the most connected user in your dataset?

This required a bit of googling. Eventually I found [this notebook](http://nicolewhite.github.io/neo4j-jupyter/twitter.html), which computes both 'betweenness' and 'closeness' centrality using the `python-igraph` library.

From that source:  

$betweenness(v) = \sum_{s, t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$

The betweenness centrality of a node $v$ is the number of shortest paths that pass through $v$, $\sigma_{st}(v)$,  divided by the total number of shortest paths, $\sigma_{st}$.

$closeness(v) = \frac{1}{\sum_{x} d(v, x)}$

The closeness centrality is the reciprocal of a node's farness, or sum of its shortest path distances from all other nodes in the graph.
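
To see the two formulas in action, here is a small pure-Python sketch on a four-node cycle. It is not the `python-igraph` or `algo.*` implementation; the graph and function names are made up, and the betweenness sum runs over unordered pairs $(s, t)$, as is usual for an undirected graph.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Tuple

# Undirected 4-cycle: a-b, b-d, d-c, c-a.
GRAPH: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
}


def bfs(graph: Dict[str, List[str]], source: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                sigma[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:
                # Every shortest path to `node` extends to a shortest path to `nb`.
                sigma[nb] += sigma[node]
    return dist, sigma


def closeness(graph: Dict[str, List[str]], v: str) -> float:
    """Reciprocal of the sum of shortest-path distances from v."""
    dist, _ = bfs(graph, v)
    total = sum(d for node, d in dist.items() if node != v)
    return 1.0 / total if total else 0.0


def betweenness(graph: Dict[str, List[str]], v: str) -> float:
    """Sum over pairs (s, t) of the fraction of shortest s-t paths through v."""
    info = {node: bfs(graph, node) for node in graph}
    dist_v, sigma_v = info[v]
    score = 0.0
    for s, t in combinations([n for n in graph if n != v], 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s or v not in dist_s or t not in dist_v:
            continue  # s, t and v are not all connected
        if dist_s[v] + dist_v[t] == dist_s[t]:
            score += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


b_betweenness = betweenness(GRAPH, "b")
b_closeness = closeness(GRAPH, "b")
print(b_betweenness, b_closeness)
```

For node `b`, only the pair `(a, d)` has a shortest path through it, and that is one of two such paths, so $\sigma_{ad}(b) / \sigma_{ad} = 1/2$. Its distances to the other nodes are 1, 1 and 2, so its closeness is $1/4$.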

I have also included the PageRank algorithm for comparison.



## Note
Before these algorithms are run, we have to add relationships that the scraper did not capture. Indirect relationships such as `(u:User)-[:POSTED]->(:Tweet)-[:MENTIONED]->(u2:User)` have to be converted into direct `KNOWS` relationships. Otherwise the output of the algorithms makes no meaningful distinction between users (i.e. every centrality is 0 and every PageRank score is the same).

In [5]:
# Add a basic mention direct relationship. All replies will mention the user in the tweet object, so this covers that
reply = """MATCH (u:User)-[:POSTED]->(:Tweet)-[:MENTIONED]-(u2:User)
MERGE p = (u)-[:KNOWS]->(u2)
RETURN count(p)"""
with driver.session() as session:
    result = session.run(reply)
    data = result.data()
print(data)

[{'count(p)': 10772}]


In [6]:
# Add a direct relationship between users where one retweeted the other.
retweet = """MATCH (u:User)-[:POSTED]->(:Tweet)-[:RETWEETED]->(:Tweet)<-[:POSTED]-(u2:User)
MERGE p = (u)-[:KNOWS]->(u2)
RETURN count(p)"""
with driver.session() as session:
    result = session.run(retweet)
    data = result.data()
print(data)

[{'count(p)': 3022}]


In [7]:
# Finally, remove self referential relationships
delete = """MATCH (u:User)-[k:KNOWS]-(u)
DELETE k"""
with driver.session() as session:
    result = session.run(delete)
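
The effect of the three cells above can be sketched in plain Python on toy data. The tweet ids and usernames are hypothetical, and a `set` stands in for `MERGE`, since both avoid creating the same edge twice.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: who posted which tweet, whom each tweet mentions,
# and which original tweet each retweet points at.
posted: Dict[str, str] = {"t1": "alice", "t2": "bob", "t3": "carol", "t4": "alice"}
mentioned: List[Tuple[str, str]] = [("t2", "alice"), ("t4", "alice")]
retweeted: List[Tuple[str, str]] = [("t3", "t1")]

knows: Set[Tuple[str, str]] = set()

# (u)-[:POSTED]->(:Tweet)-[:MENTIONED]->(u2) becomes (u)-[:KNOWS]->(u2).
for tweet, target in mentioned:
    knows.add((posted[tweet], target))

# (u)-[:POSTED]->(rt)-[:RETWEETED]->(orig)<-[:POSTED]-(u2) becomes (u)-[:KNOWS]->(u2).
for rt, orig in retweeted:
    knows.add((posted[rt], posted[orig]))

# Drop self-referential edges, e.g. alice mentioning herself in t4.
knows = {(u, u2) for u, u2 in knows if u != u2}
print(sorted(knows))
```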


`PageRank` algorithm.

In [8]:
pagerank_query = """CALL algo.pageRank.stream('User', null, {direction:'BOTH'})
YIELD nodeId, score
RETURN algo.asNode(nodeId).username AS page,score
ORDER BY score DESC limit 5"""

with driver.session() as session:
    result = session.run(pagerank_query)
    data = result.data()
display(pd.DataFrame(data))

|   | page | score |
|---|------|-------|
| 0 | realDonaldTrump | 32.697384 |
| 1 | BTS_twt | 27.138027 |
| 2 | Lilcurin | 15.959499 |
| 3 | AnnaBordelon84 | 15.83562 |
| 4 | IlhanMN | 15.179674 |
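
For intuition about what the score measures, here is a minimal power-iteration sketch of PageRank on a made-up directed `KNOWS` graph. It is not the `algo.pageRank` implementation: the damping factor of 0.85, the iteration count and the handling of dangling nodes are assumptions, and the scores are normalised to sum to 1, so their absolute values are not comparable to the table above.

```python
from typing import Dict, List


def pagerank(
    graph: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Power-iteration PageRank; dangling nodes spread their rank evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its rank on in equal parts along its outgoing edges.
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


# Hypothetical graph: three users all point at one heavily mentioned account.
toy = {
    "alice": ["celebrity"],
    "bob": ["celebrity"],
    "carol": ["celebrity", "alice"],
    "celebrity": [],
}
scores = pagerank(toy)
top_user = max(scores, key=scores.get)
print(top_user)
```

An account that many others point at accumulates rank, which matches the intuition that heavily mentioned and retweeted accounts come out on top.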


In [9]:
betweeness_query = """CALL algo.betweenness.stream('User', null, {direction: 'both'})
YIELD nodeId, centrality
MATCH (user) WHERE id(user) = nodeId
RETURN user.username AS user,centrality
ORDER BY centrality DESC
LIMIT 5"""

with driver.session() as session:
    result = session.run(betweeness_query)
    data = result.data()
display(pd.DataFrame(data))

|   | centrality | user |
|---|------------|------|
| 0 | 2153260.0 | realDonaldTrump |
| 1 | 1109247.0 | IlhanMN |
| 2 | 909522.4 | theestallion |
| 3 | 875563.1 | BTS_twt |
| 4 | 696977.5 | KidCudi |


In [10]:
closeness_query = """CALL algo.closeness.stream('User')
YIELD nodeId, centrality
MATCH (user) WHERE id(user) = nodeId
RETURN user.username AS user, centrality
ORDER BY centrality DESC
LIMIT 5"""
with driver.session() as session:
    result = session.run(closeness_query)
    data = result.data()
display(pd.DataFrame(data))

|   | centrality | user |
|---|------------|------|
| 0 | 1.0 | fyoosha |
| 1 | 1.0 | ElChibo |
| 2 | 1.0 | _ShayShay2X |
| 3 | 1.0 | singledadissad |
| 4 | 1.0 | asiaxcheyanne |


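From the [Neo4j docs](https://neo4j.com/docs/graph-algorithms/current/algorithms/closeness-centrality/):

> Academically, closeness centrality works best on connected graphs. If we use the original formula on an unconnected graph, we can end up with an infinite distance between two nodes in separate connected components. This means that we’ll end up with an infinite closeness centrality score when we sum up all the distances from that node.

The `KNOWS` graph here is not connected, which probably explains why the top five users all score exactly 1.0: a user in a tiny isolated component is as close as possible to everyone it can reach, so the ranking says little about how connected that user is overall.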
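
One common workaround for disconnected graphs is harmonic centrality, which sums reciprocal distances so that unreachable nodes contribute 0 instead of an infinite distance. The sketch below contrasts it with a closeness score computed inside each component only. Both functions and the toy graph are illustrative assumptions; in particular `component_closeness` is just one plausible way to end up with scores of 1.0 and is not taken from the `algo.closeness` source.

```python
from collections import deque
from typing import Dict, List

# Two components: a chain a-b-c-d and an isolated pair x-y.
GRAPH: Dict[str, List[str]] = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "x": ["y"],
    "y": ["x"],
}


def distances_from(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Breadth-first distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def component_closeness(graph: Dict[str, List[str]], v: str) -> float:
    """(reachable nodes) / (sum of distances), ignoring other components."""
    dist = distances_from(graph, v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def harmonic(graph: Dict[str, List[str]], v: str) -> float:
    """Mean reciprocal distance to all other nodes; unreachable nodes count as 0."""
    dist = distances_from(graph, v)
    return sum(1.0 / d for node, d in dist.items() if node != v) / (len(graph) - 1)


for node in ("b", "x"):
    print(node, component_closeness(GRAPH, node), harmonic(GRAPH, node))
```

Within its two-node component `x` gets the maximum closeness of 1.0 and outranks `b`, even though `b` reaches three users. Harmonic centrality reverses that order because it is normalised over the whole graph.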