-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
24 lines (21 loc) · 1.36 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Instructions for running:
Run init.py using "python init.py".
open http://127.0.0.1:5000/ in browser.
Use UI to insert new Csv file and search top tweets related to given query.
Approach:
- Load Twitter data in the Cassandra Cluster.
- Index it using ElasticSearch.
- Weight matrix is constructed for each tweet containing words belonging to that tweet along with the corresponding frequencies.
- Random Tweets selected as centroids.
- For a cluster, selection of tweets containing atleast one similar word using elasticsearch.
- Assignment based on distance from each cluster in case of conflicts.
- Update centroids on each iteration.
- Given a query, selection of cluster with minimum distance.
- Extract information of all tweets belonging to that cluster from Cassandra.
Observation:
- Without Cassandra on multinode Large Data Sets cannot be handeled
- Without Elastic Search Time Required for clustering using K-means is sifnificantly higher.
Future Work:
- The next steps would be to reduce run-time on bigger datasets and cluster them in real-time. This may involve using a database for storing the word occurrence counts.
- With more documents, we may have to do some kind of filtering to reduce the size of the weight matrix and remove outdated edges.
- There might be other clustering methods that can be explored to reduce the time taken for clustering