# Gensim

Gensim is an open-source Python library for unsupervised topic modelling, document indexing, and similarity retrieval over large corpora, using modern statistical machine learning. It is implemented in Python and Cython and is aimed at the natural language processing (NLP) and information retrieval (IR) communities.

## Features

1. All algorithms are memory-independent with respect to corpus size: input larger than RAM can be processed in a streamed, out-of-core fashion.
2. Intuitive interfaces:
   - easy to plug in your own input corpus or data stream (simple streaming API)
   - easy to extend with other vector space algorithms (simple transformation API)
3. Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), and word2vec.
4. Distributed computing: Latent Semantic Analysis and Latent Dirichlet Allocation can run on a cluster of computers.

The name Gensim stands for "Generate Similar".

Models and algorithms provided by Gensim include:

1. fastText
2. word2vec
3. LSA
4. LDA
5. TF-IDF

## Install Package

In [1]:
pip install gensim

Collecting gensim
  Downloading gensim-4.0.0-cp38-cp38-win_amd64.whl (23.8 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-4.2.0.tar.gz (119 kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py): started
  Building wheel for smart-open (setup.py): finished with status 'done'
  Created wheel for smart-open: filename=smart_open-4.2.0-py3-none-any.whl size=109637 sha256=d74014803ab3206ff1e00dc056af2f58e470b74b4d01a30696b2f5bc92fac074
  Stored in directory: c:\users\piyush.pathak\appdata\local\pip\cache\wheels\24\f6\ea\70a0761bdfaeacff66662751fe71920e25c4c43d97098a3886
Successfully built smart-open
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.0.0 smart-open-4.2.0
Note: you may need to restart the kernel to use updated packages.


## Import Package

In [2]:
import gensim



#### Document: a piece of text

In [4]:
document = "Akshay is teaching gensim on youtube."


#### Corpus: a collection of documents

In [5]:
corpus = ["Akshay is teaching gensim on youtube.","Today is a sunny day","India is one of the top ranking teasm in cricket","My favourite hobby is playing badminton"]

In [6]:

stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [[word for word in document.lower().split() if word not in stoplist]
   for document in corpus]

In [7]:
import pprint
pprint.pprint(processed_corpus)

[['akshay', 'is', 'teaching', 'gensim', 'on', 'youtube.'],
 ['today', 'is', 'sunny', 'day'],
 ['india', 'is', 'one', 'top', 'ranking', 'teasm', 'cricket'],
 ['my', 'favourite', 'hobby', 'is', 'playing', 'badminton']]


In [8]:
import gensim

# simple_preprocess takes a single string, so the sentences are joined into one text
corpus = "Akshay is teaching gensim on youtube. Today is a sunny day. India is one of the top ranking teasm in cricket. My favourite hobby is playing badminton"

In [9]:
gensim.utils.simple_preprocess(corpus, deacc=False, min_len=2, max_len=15)

['akshay',
 'is',
 'teaching',
 'gensim',
 'on',
 'youtube',
 'today',
 'is',
 'sunny',
 'day',
 'india',
 'is',
 'one',
 'of',
 'the',
 'top',
 'ranking',
 'teasm',
 'in',
 'cricket',
 'my',
 'favourite',
 'hobby',
 'is',
 'playing',
 'badminton']


A document is a piece of text; a vector is a mathematically convenient representation of that text.

Note that two different documents may have the same vector representation. A bag-of-words vector, for instance, records only how often each word occurs, so word order is lost.

In [15]:

from gensim import corpora
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(20 unique tokens: ['akshay', 'gensim', 'is', 'on', 'teaching']...)


In [16]:
pprint.pprint(dictionary.token2id)

{'akshay': 0,
 'badminton': 15,
 'cricket': 9,
 'day': 6,
 'favourite': 16,
 'gensim': 1,
 'hobby': 17,
 'india': 10,
 'is': 2,
 'my': 18,
 'on': 3,
 'one': 11,
 'playing': 19,
 'ranking': 12,
 'sunny': 7,
 'teaching': 4,
 'teasm': 13,
 'today': 8,
 'top': 14,
 'youtube.': 5}


In [17]:
processed_corpus

[['akshay', 'is', 'teaching', 'gensim', 'on', 'youtube.'],
 ['today', 'is', 'sunny', 'day'],
 ['india', 'is', 'one', 'top', 'ranking', 'teasm', 'cricket'],
 ['my', 'favourite', 'hobby', 'is', 'playing', 'badminton']]

In [18]:

BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
 [(2, 1), (6, 1), (7, 1), (8, 1)],
 [(2, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
 [(2, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]]
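Each pair in the output above is `(token_id, count)`, and token ids that do not occur in a document are simply left out. The plain-Python sketch below expands the second document's sparse vector into a dense one. It assumes the vocabulary size of 20 reported by the dictionary; the variable names are illustrative only.

```python
# Sparse bag-of-words vector of the second document, copied from the output above.
sparse_vector = [(2, 1), (6, 1), (7, 1), (8, 1)]
vocabulary_size = 20  # the dictionary reports 20 unique tokens

# Start with all zeros and fill in the counts for the ids that are present.
dense_vector = [0] * vocabulary_size
for token_id, count in sparse_vector:
    dense_vector[token_id] = count

print(dense_vector)
```

Storing only the non-zero entries is what keeps bag-of-words vectors compact when the vocabulary is large.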


In [19]:
from gensim import models
tfidf = models.TfidfModel(BoW_corpus)
words = "akshay cricket".lower().split()
print(tfidf)
print(tfidf[dictionary.doc2bow(words)])

TfidfModel(num_docs=4, num_nnz=23)
[(0, 0.7071067811865475), (9, 0.7071067811865475)]
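The two equal weights can be reproduced by hand. The sketch below assumes Gensim's default TF-IDF settings (raw term frequency, base-2 logarithmic inverse document frequency, and normalization to unit length). The document frequencies are read off the corpus above: `akshay` (id 0) and `cricket` (id 9) each occur in one of the four documents.

```python
import math

num_docs = 4
doc_freq = {0: 1, 9: 1}        # token id -> number of documents containing it
query_bow = [(0, 1), (9, 1)]   # bag-of-words vector of "akshay cricket"

# Unnormalized weight: term frequency times log2(num_docs / document frequency).
raw_weights = [
    (token_id, tf * math.log2(num_docs / doc_freq[token_id]))
    for token_id, tf in query_bow
]

# Scale the vector to unit Euclidean length.
norm = math.sqrt(sum(weight * weight for _, weight in raw_weights))
tfidf_vector = [(token_id, weight / norm) for token_id, weight in raw_weights]

print(tfidf_vector)
```

Both terms are equally rare, so they get the same raw weight of 2, and after normalization each ends up at 1/√2, roughly 0.7071.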


## Topic modelling using LDA

In [20]:
import gensim

corpus = "In terms of unforgettable looks, and enduring desire from enthusiasts who may have grown up gluing together the AMT 3-in-1 model kit that it inspired, the 1940 models stand today as some of the most iconic, instantly recognizable automobiles that the Ford Motor Company ever produced. That year, Fords were produced in two series: Standard and Deluxe. The easiest way to tell them apart is to look for a cleaner one-piece grille on Standard models, while the Deluxe version has a three-piece grille assembly. Both cars also had slightly different pieces of hood trim. This 1940 Ford Standard Tudor sedan was a very popular model that year–around 151,000 of them were built and sold. This Standard has been under the same California ownership since 1994, after the seller bought it from an owner in Texas. The seller describes the car as being entirely original, though the age of the finish and status of any restoration or refresh are unknown."

In [21]:
# sent_tokenize requires NLTK's Punkt tokenizer data (install it once with nltk.download)
from nltk import sent_tokenize

In [None]:
list_of_sentence = sent_tokenize(corpus)

list_of_sentence

In [None]:
list_of_simple_preprocess_data = []
for i in list_of_sentence:
    list_of_simple_preprocess_data.append(gensim.utils.simple_preprocess(i, deacc=True, min_len=3))

In [None]:
texts = list_of_simple_preprocess_data
texts

In [None]:
bigram = gensim.models.Phrases(list_of_simple_preprocess_data)
bigram

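In [None]:
# gensim.utils.lemmatize was removed in Gensim 4.0, so NLTK's WordNetLemmatizer
# is used here as a substitute (needs the stopwords, WordNet and POS-tagger NLTK data)
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
stops = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
def process_texts(texts):
    # Drop stopwords, merge detected bigrams, then keep lemmatized nouns of 5+ characters
    texts = [[word for word in line if word not in stops] for line in texts]
    texts = [bigram[line] for line in texts]
    texts = [[lemmatizer.lemmatize(word) for word, tag in pos_tag(line)
              if tag.startswith('NN') and len(word) >= 5] for line in texts]
    return texts

In [None]:
train_texts = process_texts(list_of_simple_preprocess_data)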

In [None]:
# gensim.models.wrappers (LdaMallet) was removed in Gensim 4.0 and is not used below
from gensim.models import LdaModel
from gensim.corpora import Dictionary

In [None]:
train_texts

In [None]:
# Build the dictionary from the processed sentences so the token ids match train_texts
dictionary = Dictionary(train_texts)
corpus = [dictionary.doc2bow(text) for text in train_texts]
print(corpus)

In [None]:
print(dictionary)

In [None]:
ldamodel = LdaModel(corpus=corpus, num_topics=2, id2word=dictionary)

ldamodel.show_topics()

In [None]:

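# pyLDAvis 3.3+ renamed its Gensim helper module to gensim_models;
# on older releases use `import pyLDAvis.gensim` instead
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)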