### Gensim Tutorial

Gensim is an open-source Python library for unsupervised topic modeling and natural language processing. It provides efficient implementations of Word2Vec, FastText, Doc2Vec, LDA (Latent Dirichlet Allocation), and LSI (Latent Semantic Indexing). This notebook walks through a hands-on example of each model.

### Setup

In [20]:
# Install Gensim and NLTK if not already installed
# !pip install gensim nltk

In [21]:
# Import necessary libraries
import gensim
from gensim.models import Word2Vec, FastText, Doc2Vec, LdaModel, LsiModel
from gensim.models.doc2vec import TaggedDocument
from gensim import corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sagarmaheshwari/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
# Sample corpus
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

In [None]:
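# Preprocessing: lowercase, tokenize, and drop English stop words
stop_words = set(stopwords.words('english'))  # a set makes membership checks fast

def preprocess(doc):
    """Tokenize a document with simple_preprocess and remove stop words."""
    return [token for token in simple_preprocess(doc) if token not in stop_words]

processed_docs = [preprocess(doc) for doc in documents]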

### Word2Vec

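Word2Vec learns dense vector representations of words by training a shallow neural network to predict a word from its context (CBOW, Gensim's default) or the context from a word (skip-gram). With a corpus as small as this one, the vectors stay close to their random initialization, so the similarity scores below are illustrative only.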

In [24]:
# Train Word2Vec model
w2v_model = Word2Vec(sentences=processed_docs, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

In [25]:
# Access vector for a word
vector = w2v_model.wv['computer']
print(f"Vector for 'computer':\n{vector}")

Vector for 'computer':
[-0.00514661 -0.00667657 -0.00777021  0.00832735 -0.00199479 -0.00686069
 -0.00416312  0.0051635  -0.00288184 -0.00376003  0.00161964 -0.00278592
 -0.00157794  0.00108014 -0.00298168  0.008512    0.00392145 -0.00995569
  0.00624274 -0.00679344  0.00075621  0.00441559 -0.00511017 -0.00212507
  0.00809678 -0.00424531 -0.00764362  0.00928519 -0.00217425 -0.00471042
  0.00857088  0.00426897  0.00432532  0.00926353 -0.00846707  0.0052542
  0.00205502  0.00418248  0.00169106  0.00447188  0.00449154  0.00608854
 -0.0032179  -0.0045721  -0.00041279  0.00250806 -0.00328215  0.0060547
  0.0041642   0.00777352  0.00256294  0.00810639 -0.00137941  0.00808476
  0.00370277 -0.00804334 -0.00392963 -0.00247643  0.00487847 -0.00085269
 -0.00281719  0.00782761  0.00934011 -0.00160275 -0.00516775 -0.00468007
 -0.0048488  -0.00958754  0.00135457 -0.00422307  0.00253821  0.00562748
 -0.00405598 -0.00961495  0.00155525 -0.0066844   0.00250963 -0.00377671
  0.00707518  0.0006297   0.00

In [26]:
# Find most similar words
similar_words = w2v_model.wv.most_similar('computer', topn=5)
print("Most similar words to 'computer':")
for word, score in similar_words:
    print(f"{word}: {score}")

Most similar words to 'computer':
system: 0.21709389984607697
unordered: 0.12631958723068237
intersection: 0.10369198024272919
widths: 0.10257931798696518
random: 0.083739273250103
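
The scores printed by `most_similar` are cosine similarities between word vectors, which range from -1 to 1. As a minimal sketch of that computation in plain Python (the function name `cosine_similarity` and the toy vectors are illustrative and not part of Gensim):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, made up for illustration
v_same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
v_orth = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal vectors
print(round(v_same, 6), round(v_orth, 6))
```

Parallel vectors score close to 1 and orthogonal vectors score 0, so values such as 0.2 or 0.1 in the output above indicate only weak similarity.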


### FastText

FastText extends Word2Vec with subword information: each word is represented through its character n-grams, so the model can also produce embeddings for out-of-vocabulary words.
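
The subword step can be sketched in plain Python. FastText wraps each word in `<` and `>` boundary markers and collects its character n-grams, and a word's vector is built from the vectors of those n-grams. The helper `char_ngrams` below is hypothetical (not a Gensim function) and assumes the 3-to-6 character range that Gensim uses by default (`min_n=3`, `max_n=6`):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A word missing from the vocabulary still shares n-grams with a known one
shared = set(char_ngrams("computers")) & set(char_ngrams("computer"))
print(len(shared) > 0)
```

Because "computers" shares many n-grams with "computer", a model trained only on the latter can still assemble a vector for the former.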

In [27]:
# Train FastText model
ft_model = FastText(sentences=processed_docs, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

In [28]:
# Access vector for a word
vector = ft_model.wv['computer']
print(f"Vector for 'computer':\n{vector}")

Vector for 'computer':
[ 2.9880801e-04  3.2837299e-04 -8.7087392e-04  3.4074162e-04
 -5.0530396e-04 -2.0400675e-03 -1.2366952e-03 -1.9385577e-03
  1.3510046e-03 -2.4163353e-03  9.1793487e-04 -1.0294152e-03
 -7.6270627e-04  7.1173337e-05  1.3854944e-03  5.1190931e-04
 -2.9630365e-04 -1.1949538e-03 -1.1720804e-03 -6.1215356e-04
 -6.7950838e-04  3.9473677e-04  1.0080618e-04  8.1050477e-04
  5.8251829e-04  7.0226018e-04 -7.3584268e-04 -1.0394261e-03
 -6.2610256e-04 -2.3708391e-04 -1.1937958e-03 -2.6840135e-04
  7.3543075e-04 -7.2244566e-04 -1.2749806e-03  1.2888059e-04
  3.8285521e-04 -1.3327518e-03 -2.7399871e-03 -3.0622751e-04
  9.2991581e-04 -7.2863739e-04 -1.1310756e-03 -3.2716527e-04
 -2.0244121e-04 -1.1019036e-04 -6.2306185e-04 -1.6128301e-03
  9.9268020e-04  9.7158161e-05  3.6868628e-04 -5.3636177e-04
  1.1346547e-03  8.7445206e-04 -1.6418194e-03 -8.5519900e-04
 -6.4471364e-04  6.2608358e-04  8.3561492e-04 -1.1247990e-03
  1.2888766e-03 -3.4181305e-04 -1.1802679e-03 -1.6068361e-03
 

In [29]:
# Find most similar words
similar_words = ft_model.wv.most_similar('computer', topn=5)
print("Most similar words to 'computer':")
for word, score in similar_words:
    print(f"{word}: {score}")

Most similar words to 'computer':
generation: 0.14043159782886505
opinion: 0.12851285934448242
response: 0.1254628300666809
perceived: 0.12263429164886475
measurement: 0.1199503019452095


### Doc2Vec

Doc2Vec extends Word2Vec to learn a fixed-length vector for each document alongside the word vectors, so whole texts can be compared by similarity.

In [30]:
# Tag documents
tagged_docs = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(processed_docs)]

In [31]:
# Train Doc2Vec model
d2v_model = Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=1, workers=4, epochs=10)

In [32]:
# Infer vector for a new document
new_doc = preprocess("Human computer interaction")
vector = d2v_model.infer_vector(new_doc)
print(f"Vector for new document:\n{vector}")

Vector for new document:
[ 1.1764626e-03  2.7333130e-03  2.1515836e-04 -3.7031824e-04
 -1.5070344e-03  3.2378642e-03 -4.5723487e-03 -3.9421222e-05
  3.8916387e-03  1.4232891e-03  2.6362306e-03  4.7930083e-03
  3.9003030e-03  2.2299709e-03 -3.6341529e-03 -4.8852395e-03
  1.8337595e-03 -3.6945071e-03 -3.7225005e-03 -1.6903148e-03
 -4.7524776e-03 -2.6911485e-04 -1.8954898e-03  4.4092005e-03
 -1.1342483e-03  1.1098807e-03 -1.9440490e-03  3.4257399e-03
 -2.4771898e-03 -4.7522308e-03  2.3820894e-03 -7.6019316e-04
  1.7676657e-03 -3.8657379e-03 -3.9780149e-04  1.6454047e-03
  2.2256067e-03  3.8652827e-03  2.5457109e-03 -2.9814015e-03
 -3.3781792e-03 -1.3600319e-03 -4.4311825e-03 -3.1819853e-03
 -2.7437157e-03  3.3696613e-03 -3.2702973e-03 -3.1563635e-03
  2.6651809e-03 -3.7297118e-03  8.6977694e-04  1.4308354e-04
 -3.4440651e-03  1.0450006e-03 -1.2880480e-03  3.6284293e-03
 -2.8214226e-03 -3.4230915e-03 -1.8694761e-03 -1.7016320e-03
  1.8465366e-04 -3.0380900e-03 -4.0114444e-04 -1.9978984e-04

In [33]:
# Find most similar documents
similar_docs = d2v_model.dv.most_similar([vector], topn=3)
print("Most similar documents:")
for doc_id, score in similar_docs:
    print(f"Document ID {doc_id}: {score}")


Most similar documents:
Document ID 0: 0.08832982927560806
Document ID 8: 0.08134177327156067
Document ID 3: 0.048103343695402145


### Topic Modeling with LDA

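LDA (Latent Dirichlet Allocation) models each document as a mixture of topics and each topic as a probability distribution over words, so words that frequently occur together tend to receive high weight in the same topic.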

In [34]:
# Create dictionary (token -> integer ID) and bag-of-words corpus
dictionary = corpora.Dictionary(processed_docs)
# Each document becomes a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
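
Each entry of `corpus` is a bag-of-words vector: a list of `(token_id, count)` pairs that records how often each dictionary token occurs in a document, ignoring word order. The sketch below mimics that idea in plain Python. `build_vocab` and `to_bow` are hypothetical helpers, and the IDs they assign will not necessarily match those of `corpora.Dictionary`:

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer ID to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy tokenized documents, made up for illustration
toy_docs = [["graph", "minors", "survey"], ["graph", "trees", "graph"]]
vocab = build_vocab(toy_docs)
bow = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)  # {'graph': 0, 'minors': 1, 'survey': 2, 'trees': 3}
print(bow)    # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
```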

In [35]:
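# Train LDA model
# Training is randomized, so topics can differ between runs; pass random_state=<int> for reproducible results
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)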

In [36]:
# Display topics
print("LDA Topics:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

LDA Topics:
Topic 0: 0.119*"system" + 0.064*"computer" + 0.064*"human" + 0.063*"eps" + 0.063*"interface" + 0.063*"user" + 0.037*"survey" + 0.037*"testing" + 0.037*"opinion" + 0.037*"engineering"
Topic 1: 0.057*"time" + 0.057*"response" + 0.057*"user" + 0.056*"error" + 0.056*"perceived" + 0.056*"measurement" + 0.056*"relation" + 0.056*"random" + 0.056*"generation" + 0.056*"unordered"
Topic 2: 0.123*"graph" + 0.087*"trees" + 0.086*"minors" + 0.049*"quasi" + 0.049*"widths" + 0.049*"well" + 0.049*"iv" + 0.049*"ordering" + 0.049*"intersection" + 0.049*"paths"


### Topic Modeling with LSI

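LSI (Latent Semantic Indexing) applies a truncated singular value decomposition to the term-document matrix and keeps only the `num_topics` strongest latent dimensions. Unlike LDA weights, LSI term weights can be negative, as Topic 2 in the output below shows.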
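
In symbols, a sketch of that decomposition (the notation is the standard one for truncated SVD, not taken from this notebook): if $X$ is the $m \times n$ term-document matrix and $k$ is the number of topics, LSI keeps the $k$ largest singular values:

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
```

The columns of $U_k$ are the topics (one weight per term), and each document's coordinates in topic space are given by the corresponding row of $V_k \Sigma_k$. In this notebook, $k = 3$.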

In [37]:
# Train LSI model
lsi_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=3)

In [38]:
# Display topics
print("LSI Topics:")
for idx, topic in lsi_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

LSI Topics:
Topic 0: 0.579*"system" + 0.376*"user" + 0.270*"eps" + 0.257*"response" + 0.257*"time" + 0.230*"computer" + 0.224*"human" + 0.191*"interface" + 0.176*"survey" + 0.157*"opinion"
Topic 1: 0.480*"graph" + 0.464*"trees" + 0.361*"minors" + 0.266*"quasi" + 0.266*"iv" + 0.266*"widths" + 0.266*"ordering" + 0.266*"well" + 0.119*"paths" + 0.119*"intersection"
Topic 2: 0.359*"response" + 0.359*"time" + -0.313*"system" + 0.301*"user" + -0.290*"human" + -0.244*"eps" + 0.241*"perceived" + 0.241*"measurement" + 0.241*"error" + 0.241*"relation"


### Saving and Loading Models

In [None]:
# Uncomment to save the trained models to disk
# w2v_model.save("word2vec.model")
# ft_model.save("fasttext.model")
# d2v_model.save("doc2vec.model")
# lda_model.save("lda.model")
# lsi_model.save("lsi.model")

# Uncomment to load the saved models back
# w2v_model = Word2Vec.load("word2vec.model")
# ft_model = FastText.load("fasttext.model")
# d2v_model = Doc2Vec.load("doc2vec.model")
# lda_model = LdaModel.load("lda.model")
# lsi_model = LsiModel.load("lsi.model")

### CoinGecko Tutorial

CoinGecko provides a free public API for real-time cryptocurrency data, including current prices, historical trends, and market statistics. This section uses Python to fetch the current price of Bitcoin (BTC) in USD with the `fetch_price` function, imported from the `gensim_utils` module. The function sends a single HTTP request to the CoinGecko API and returns the latest Bitcoin price, which developers can feed into dashboards, bots, or analytics workflows.


In [39]:
from gensim_utils import fetch_price

fetch_price()

103086
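
The implementation of `fetch_price` is not shown in this notebook. As a rough sketch of what such a helper could look like, the code below uses only the standard library and assumes CoinGecko's public `/api/v3/simple/price` endpoint. The function names, the timeout, and the response handling are assumptions, not the actual contents of `gensim_utils`:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: CoinGecko's public "simple price" API
API_URL = "https://api.coingecko.com/api/v3/simple/price"

def build_price_url(coin_id="bitcoin", currency="usd"):
    """Build the request URL for one coin/currency pair."""
    query = urlencode({"ids": coin_id, "vs_currencies": currency})
    return f"{API_URL}?{query}"

def parse_price(payload, coin_id="bitcoin", currency="usd"):
    """Extract the price from a JSON body shaped like {"bitcoin": {"usd": 103086}}."""
    return json.loads(payload)[coin_id][currency]

def fetch_price_sketch(coin_id="bitcoin", currency="usd", timeout=10):
    """Hypothetical stand-in for fetch_price: one GET request, returns the price."""
    with urlopen(build_price_url(coin_id, currency), timeout=timeout) as response:
        return parse_price(response.read().decode("utf-8"), coin_id, currency)

# Offline check of the URL builder and parser (no network access needed)
url = build_price_url()
price = parse_price('{"bitcoin": {"usd": 103086}}')
print(url)
print(price)  # 103086
```

Calling `fetch_price_sketch()` performs a live request, so it can fail or be rate-limited; separating URL building and parsing from the network call keeps both testable offline.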