Document (Term) Similarity using Latent Semantic Indexing

A small Python program that computes semantic similarity between documents (or items) using Latent Semantic Indexing (LSI).

Dependencies

The code depends on numpy, scipy and gensim. To install them, run: pip install -U numpy scipy gensim

How to use this code

The compressed file data.tar.bz2 contains an example dataset of artists and tags (crawled from Last.fm in 2009). Uncompress it and you can run a demo that computes similarity between tags.
The file config.py holds the default configuration for the demo in model.py. The model needs two files to work:
- A dictionary file, data/lastfm_artists.txt, which includes the names of the items (artists in this case), one item per line.
- A corpus file, data/lastfm_tags_artists.tsv, which includes a document name (tag) and its list of items (artists) with their corresponding normalized weights, in the following format:
```
   document_name[TAB][(item_name1, weight1), (item_name2, weight2), ...]
```
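As a rough illustration of how the two files fit together, the sketch below loads a dictionary and turns one corpus line into a sparse list of (item_id, weight) pairs, which is the bag-of-words shape that gensim works with. It assumes the bracketed list is written as a Python literal with quoted item names, which the format line above does not confirm. The helper names `load_dictionary` and `parse_corpus_line` and the sample values are hypothetical and are not taken from model.py.

```python
import ast

def load_dictionary(lines):
    """Map each item name (one per line) to an integer id."""
    names = [line.strip() for line in lines if line.strip()]
    return {name: idx for idx, name in enumerate(names)}

def parse_corpus_line(line, item2id):
    """Split 'document_name<TAB>[(item, weight), ...]' into a name and a sparse vector."""
    doc_name, raw_items = line.rstrip("\n").split("\t", 1)
    # Assumption: the list is serialized as a Python literal with quoted names.
    items = ast.literal_eval(raw_items)
    # Items missing from the dictionary are skipped rather than raising.
    vector = [(item2id[name], float(weight)) for name, weight in items if name in item2id]
    return doc_name, sorted(vector)

# Toy values made up for illustration; they are not from the Last.fm dataset.
item2id = load_dictionary(["Radiohead\n", "Miles Davis\n", "Daft Punk\n"])
name, vec = parse_corpus_line("electronic\t[('Daft Punk', 0.9), ('Radiohead', 0.4)]\n", item2id)
print(name, vec)  # electronic [(0, 0.4), (2, 0.9)]
```

If your own files quote or escape item names differently, only the `ast.literal_eval` step should need to change.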
The default configuration computes tag similarity. For artist similarity, use data/lastfm_tags.txt as the dictionary file and data/lastfm_artists_tags.tsv as the corpus file instead.
You can also use your own dataset, but it must follow exactly the same format as the sample files mentioned above. Hopefully I'll make the code more flexible in the future so that you can use your own format.
The remaining configuration options for the demo are explained in the main function of model.py.
There is also another script, cleaner.py. The tag dataset from Last.fm is very noisy, so some code was needed to clean it. By default, the demo in model.py calls DefaultCleaner, which only checks whether the input document name is a string. For tags, I used TagCleaner to clean the dataset, so it is also needed when querying the tag dataset (hence the -c TagCleaner option in the examples below).
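To make the role of a cleaner concrete, here is a minimal sketch of the two behaviours described above. The real rules live in cleaner.py and may differ. The class names `SketchDefaultCleaner` and `SketchTagCleaner`, the `clean` method, and the normalization itself (lowercasing and dropping non-alphanumeric characters) are assumptions inferred from the shape of the tags in the example output, such as 'italianhiphop'.

```python
import re

class SketchDefaultCleaner:
    """Accepts any string unchanged and rejects everything else."""

    def clean(self, name):
        if not isinstance(name, str):
            raise TypeError("document name must be a string")
        return name

class SketchTagCleaner(SketchDefaultCleaner):
    """Assumed normalization: lowercase, then drop everything outside a-z and 0-9."""

    def clean(self, name):
        name = super().clean(name)
        return re.sub(r"[^a-z0-9]", "", name.lower())

cleaner = SketchTagCleaner()
print(cleaner.clean("Italian Rap"))  # italianrap
print(cleaner.clean("Acid House"))   # acidhouse
```

The point of the sketch is that a query has to go through the same cleaner as the corpus. Otherwise a query such as "Acid House" would never match the stored, normalized document name.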

Examples

Help

$ python model.py -h

Tag similarity

Similar tags to a given tag

$ python model.py "Italian Rap" -s 10 -c TagCleaner
('rapitaliano', 0.99289674)
('hiphopitaliano', 0.9866116)
('italianhiphop', 0.98450512)
('rapitalian', 0.94187772)
('ithiphop', 0.93498826)
('areacronica', 0.92992383)
('raphardcore', 0.9299171)
('nelvortice', 0.90312099)
('tormento', 0.89701408)
('soulville', 0.89118701)

$ python model.py "calm" -s 10 -n 200 -c TagCleaner
('mellow', 0.87416512)
('soothing', 0.87190253)
('soft', 0.86796957)
('melancholy', 0.85312933)
('gentle', 0.85238552)
('sad', 0.83518672)
('reflective', 0.83046651)
('quiet', 0.83010238)
('calming', 0.82065952)
('intimate', 0.81874591)

$ python model.py "Acid House" -s 5 -n 50 -c TagCleaner
('godfathersofhouseandtechno', 0.92068398)
('italohouse', 0.9116993)
('detroithouse', 0.89956081)
('digipunk', 0.89837003)
('progresivehouse', 0.89674282)

$ python model.py "Acid House" -s 5 -n 100 -c TagCleaner
('godfathersofhouseandtechno', 0.91140896)
('myrootsinelectronicmusic', 0.88002872)
('nuhouse', 0.86712235)
('houseartist', 0.86652893)
('funkyelectro', 0.86088693)

$ python model.py "Acid House" -s 5 -n 200 -c TagCleaner
('godfathersofhouseandtechno', 0.85408849)
('afrofuturism', 0.8189801)
('electronicdancemusic', 0.77227825)
('hiphouse', 0.76878512)
('wbmx', 0.75617921)
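
Judging from the runs above, -s sets how many neighbours are returned and the query tag itself is left out of the results. What -n controls is documented in model.py, not here. The sketch below shows one way such a ranking can be produced once every tag has a latent vector: normalize the vectors once, score by dot product, and keep the best few. The function names and the toy 2-dimensional vectors are made up for illustration and are not the output of an actual LSI model.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def most_similar(query, vectors, top_n):
    """Return the top_n (name, similarity) pairs, excluding the query itself."""
    unit = {name: normalize(vec) for name, vec in vectors.items()}
    q = unit[query]
    scores = (
        (name, sum(a * b for a, b in zip(q, v)))
        for name, v in unit.items()
        if name != query
    )
    return heapq.nlargest(top_n, scores, key=lambda pair: pair[1])

# Toy "latent" vectors, hypothetical values for illustration only.
toy = {
    "calm": [1.0, 0.1],
    "mellow": [0.9, 0.2],
    "soft": [0.8, 0.4],
    "thrash": [0.0, 1.0],
}
for name, score in most_similar("calm", toy, 2):
    print(name, round(score, 3))
```

With these toy vectors the two nearest neighbours of "calm" are "mellow" and "soft", in that order.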

Similarity between two tags

$ python model.py "Acid House" -p "House" -c TagCleaner
0.860134568029

$ python model.py "Acid House" -p "Jazz" -c TagCleaner
0.0198018934176

$ python model.py "heavy metal" -p "calm" -c TagCleaner
-0.00130043656911

$ python model.py "classical music" -p "party" -c TagCleaner
0.0144112308305

$ python model.py "dance" -p "party" -c TagCleaner
0.804738605955
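
The scores above fall between -1 and 1, and one is slightly negative, which is what cosine similarity between latent vectors would produce. That model.py uses cosine similarity is an assumption here, although it is the usual measure for LSI. The sketch below uses a hypothetical `cosine` helper and toy vectors to show how related, unrelated, and slightly opposed vectors score.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine([1.0, 0.0], [-0.1, 1.0]))  # slightly opposed: small negative value
```

Read this way, a score near zero such as "Acid House" versus "Jazz" means the two tags share almost no latent structure, while a small negative score is best treated as "unrelated" and not as a meaningful opposite.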

neomoha/python-lsi-similarity
