Find similar tweets based on the cosine similarities of TF-IDF vectors

heaven00/cosine_similarities

How this works

We implement a simple yet effective model called the Vector Space Model (VSM).

In this model, we project the sentences (here, tweets) into an n-dimensional space, measure the angle between each pair of vectors, and take the cosine of that angle. This value is called the cosine similarity. For TF-IDF vectors, whose components are never negative, it ranges from 0 to 1: a pair scoring near 0 (cos 90°) is less similar than a pair scoring near 1 (cos 0°).
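
To make the definition concrete, here is a minimal sketch of cosine similarity in plain Python: the dot product of two vectors divided by the product of their lengths. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not part of this repository.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Vectors pointing the same way (angle 0) versus at right angles (angle 90).
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])
```

Here `same_direction` comes out at 1 (up to floating-point rounding) and `orthogonal` at 0, matching the two ends of the range described above.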

Implementation

We use scikit-learn's TfidfVectorizer to project the tweets as vectors, calculate the cosine similarities between these vectors, and store the results in a file. This is by no means computationally efficient, so bear with this old but useful code.
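
As a rough illustration of that pipeline, the sketch below vectorizes a few strings with TfidfVectorizer and computes all pairwise cosine similarities. The sample tweets and variable names are made up for the example, and the repository's own code may use different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample data standing in for collected tweets.
tweets = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "stock markets fell sharply today",
]

# Each tweet becomes one row of a sparse TF-IDF matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(tweets)

# Entry [i, j] is the cosine similarity between tweet i and tweet j.
similarities = cosine_similarity(tfidf)
```

The result is a square array with one row and one column per tweet. The two identical tweets score 1 against each other, while the third shares no words with them and scores 0.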

How to use

You need to set this up a little bit first.

  1. In the "config.py" file, add your Twitter keys so that the program can access the Twitter streaming API.

  2. In "stream.py", choose between two functions: search for a #hashtag query, or get the stream of a user.

Once steps 1 and 2 are done, run:

    python run.py

This runs a program that collects tweets from the streaming API every 9 seconds; you can change the time interval in "run.py" to anything you like.
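
The collection loop can be pictured roughly as follows. This is only a sketch of the polling idea, not the contents of "run.py": `collect_periodically`, `fake_fetch`, and their parameters are hypothetical names, and the fetch function here returns canned text instead of calling the Twitter API.

```python
import time


def collect_periodically(fetch, interval_seconds=9, rounds=3, sleep=time.sleep):
    """Call fetch() `rounds` times, pausing `interval_seconds` between calls."""
    collected = []
    for i in range(rounds):
        collected.extend(fetch())
        if i < rounds - 1:
            # No need to wait after the final round.
            sleep(interval_seconds)
    return collected


def fake_fetch():
    # Hypothetical stand-in for a request to the streaming API.
    return ["example tweet"]


# A no-op sleep keeps the demonstration instant; drop it to really wait.
tweets = collect_periodically(fake_fetch, interval_seconds=9, rounds=2, sleep=lambda s: None)
```

Passing the sleep function in as a parameter is a design choice for this sketch: it lets the interval be changed or skipped without editing the loop.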

The output is a CSV file called "tweet_cosine_similarities.csv" that contains the gathered tweets and their cosine similarities.

Requirements

  1. TwitterAPI
  2. Pandas
  3. scikit-learn (imported as sklearn)
