<a href="https://colab.research.google.com/github/karthikcs/colab/blob/master/Tickets_Word2vec_(Doc2vec)_V1_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Segregation using Doc2Vec and Clustering 

In [0]:
from gensim.models import Word2Vec, KeyedVectors
import pandas as pd
import nltk

### Configuration from the user
In this section, the user can specify the configuration the tool needs to run, for example the input file URL, the key column to consider, and additional stop words.

In [0]:
## Configuration
input_file = r'https://storage.googleapis.com/karthik101/incident_reduced.csv'
key_field = 'description'
no_of_topics = 20
more_stopwords = ['hi', 'hello']

### Input  Data
The data we are processing is one year of ticket history. It contains more than 20,000 tickets, and the file is almost 50MB. GitHub supports no more than 25MB, so the file is hosted on Google Cloud Storage (GCP) instead. (Account: karthikcs101)

In [0]:
df = pd.read_csv(input_file, encoding = "ISO-8859-1")

In [0]:
data = df[key_field].values.tolist()  # use the configured key column instead of a hard-coded name
# df.head()

### Data Cleaning
Before we start processing the data, we need to perform the following pre-processing steps:


1.   **Gensim simple_preprocess** - Converts a document into a list of tokens. It lowercases, tokenizes, and removes accents.
2.   **Removing stopwords** - Removes spaCy's stopwords, plus any additional user-specified stopwords.
3.   **Lemmatize** - Uses spaCy to reduce every word to its lemma. Example: *message*, *messages*, *messaging* are all converted to the root word *message*.
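
As a rough illustration of what the first two steps do to a single ticket, here is a standard-library-only sketch. It is not how Gensim or spaCy implement these steps: `approx_preprocess`, `drop_stopwords`, the tiny stop-word set, and the sample sentence are all hypothetical, and lemmatization is left out because it needs a language model.

```python
import re
import unicodedata

def approx_preprocess(text, min_len=2, max_len=15):
    """Approximate lowercasing, de-accenting and tokenizing (illustration only)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of ASCII letters and filter them by token length.
    tokens = re.findall(r"[a-z]+", no_accents)
    return [t for t in tokens if min_len <= len(t) <= max_len]

def drop_stopwords(tokens, stop_words):
    """Remove every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical ticket text and a made-up stop-word set extended with user words.
sample = "Hello Team, please reset my password for the café portal."
stop_words = {"the", "for", "my", "please"} | {"hi", "hello"}

tokens = approx_preprocess(sample)
cleaned = drop_stopwords(tokens, stop_words)
print(tokens)
print(cleaned)
```

The union with `{"hi", "hello"}` mirrors how the user-supplied `more_stopwords` list is meant to extend the default stop words.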







In [0]:
from gensim.utils import simple_preprocess
import spacy

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3+
stop_words = nlp.Defaults.stop_words

In [0]:
def simple_processing(sentences):
    """Lowercase, tokenize and de-accent each document."""
    for sentence in sentences:
        yield simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts):
    """Drop stop words from documents that are already tokenized."""
    return [[word for word in doc if word not in stop_words] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Keep the lemmas of tokens whose POS tag is allowed (https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [0]:
## Step 1: Data simple processing
data_words = list(simple_processing(data))

In [0]:
## Step 2: Removing Stopwords 
## Skipping for now 
# stop_words = nlp.Defaults.stop_words
# stp_list = list(stop_words)
# stp_list.extend(more_stopwords)
# stop_words = set(stp_list)
# data_words_nostops = remove_stopwords(data_words)

In [0]:
## Step 3: Lemmatize the data words
## Skipping for now 
# data_lemmatized = lemmatization(data_words_nostops)
# data_lemmatized[:10]

In [0]:
tokanized_sentenaces = data_words

### Now our data is cleaned and ready for Doc2Vec processing

In [0]:
tokanized_sentenaces[:10]

### Data modeling  - Doc2Vec 
https://radimrehurek.com/gensim/models/doc2vec.html

In [0]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [0]:
## Creating Tagged Document for Doc2Vec
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokanized_sentenaces)]

In [0]:
model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4)

#### Let us look at some samples and compare them

In [0]:
print(data[30])
print(data[33])  # index 33 matches the document compared in the similarity cell

In [0]:
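# cosine_similarities needs two arguments (a vector and a list of vectors);
# calling it with none raises a TypeError, so inspect its signature instead.
help(model.wv.cosine_similarities)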

In [0]:
one = tokanized_sentenaces[30]
two = tokanized_sentenaces[33]

vector1 = model.infer_vector(one)
# print(vector1)
vector2 = model.infer_vector(two)
# print(vector2)
# model.wv.wmdistance(one, two)
model.wv.cosine_similarities(vector1, [vector1, vector2])

array([1.        , 0.72883576], dtype=float32)
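
The first value is 1 because `vector1` is compared with itself. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. A minimal plain-Python sketch of that formula follows; the `cosine_similarity` helper and the three toy vectors are hypothetical and stand in for the inferred document vectors.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1 (dot product is 0)

sims = [cosine_similarity(v1, v) for v in (v1, v2, v3)]
print(sims)
```

Here the first two similarities are 1 (up to floating-point rounding) and the third is 0, which is the scale on which the 0.73 above should be read.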

#### Represent each line item (Description) as a vector and store it in the DataFrame

In [0]:
vectors = [model.infer_vector(desc) for desc in tokanized_sentenaces]

In [0]:
model.wv.cosine_similarities(vector1, vectors)

[-0.33119848  0.41945872  0.20573749 -0.32744673  1.4933609   1.0280933
  0.29825422  1.2858794   2.3606308  -0.7747618 ]
[ 0.07841662  0.5612426   0.18695262  0.04362907  1.934794    1.3937353
  0.3778075   1.1910995   2.538958   -0.7002043 ]


In [0]:
df['data_words'] = data_words
df['vector']  = vectors



In [0]:
df.to_csv("output.csv")
!curl -X POST --data-binary @'output.csv' -H "Content-Type: text/csv" "https://www.googleapis.com/upload/storage/v1/b/karthik101/o?uploadType=media&name=output.csv"

{
 "kind": "storage#object",
 "id": "karthik101/output.csv/1561376266748871",
 "selfLink": "https://www.googleapis.com/storage/v1/b/karthik101/o/output.csv",
 "name": "output.csv",
 "bucket": "karthik101",
 "generation": "1561376266748871",
 "metageneration": "1",
 "contentType": "text/csv",
 "timeCreated": "2019-06-24T11:37:46.748Z",
 "updated": "2019-06-24T11:37:46.748Z",
 "storageClass": "MULTI_REGIONAL",
 "timeStorageClassUpdated": "2019-06-24T11:37:46.748Z",
 "size": "92919543",
 "md5Hash": "UiSQbFIJJIr/Ark8T3g2Xg==",
 "mediaLink": "https://www.googleapis.com/download/storage/v1/b/karthik101/o/output.csv?generation=1561376266748871&alt=media",
 "crc32c": "D6a26w==",
 "etag": "CMeH9u2DguMCEAE="
}


In [0]:
X = vectors

### Apply DBSCAN clustering to the dataset using the vector representation of the input data

In [0]:
from sklearn.cluster import DBSCAN
from sklearn import metrics


In [0]:
db = DBSCAN(eps=0.6, min_samples=5).fit(X)
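
To make the two parameters concrete: in scikit-learn's DBSCAN, `eps` is the neighbourhood radius and `min_samples` is how many points (the point itself included) must lie within that radius for a point to count as a core point. The sketch below checks only that condition, in plain Python with Euclidean distance, on made-up 2-D points. `is_core_point` and the sample data are hypothetical, and this is not the full clustering algorithm.

```python
import math

def is_core_point(index, points, eps, min_samples):
    """True if at least min_samples points (self included) lie within eps."""
    center = points[index]
    neighbours = sum(1 for p in points if math.dist(center, p) <= eps)
    return neighbours >= min_samples

# Five points packed closely together plus one isolated point (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]

core_flags = [is_core_point(i, points, eps=0.6, min_samples=5) for i in range(len(points))]
print(core_flags)
```

With these values the five packed points are core points and the isolated one is not; a point like that, with no core point within `eps`, is what DBSCAN reports as noise.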

In [0]:
set(db.labels_)

{-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}

In [0]:
[list(db.labels_).count(i) for i in set(db.labels_) ]

[17636, 1079, 6, 6, 7, 11, 7, 11, 10, 5, 6, 5, 9, 5, 7, 2143]
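
The list above gives the cluster sizes but not which label each size belongs to, since the iteration order of a set is not guaranteed to match the sorted display. A label-keyed count avoids that ambiguity. In the sketch below, `labels` is a made-up stand-in for `db.labels_`; in scikit-learn's DBSCAN the label `-1` marks noise points that belong to no cluster.

```python
from collections import Counter

# Hypothetical labels; in the notebook this would be db.labels_.
labels = [-1, 0, 0, -1, 1, 0, -1, 1, 2, -1]

sizes = Counter(labels)                      # label -> number of points
noise = sizes.get(-1, 0)                     # points DBSCAN left unclustered
n_clusters = len(sizes) - (1 if -1 in sizes else 0)
noise_share = noise / len(labels)

for label, count in sorted(sizes.items()):
    print(label, count)
print(n_clusters, noise_share)
```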

In [0]:
list(db.labels_).count(1)
df['dbscan-5'] = db.labels_
df[df['dbscan-5'] == 1]

In [0]:
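# Note: only 'dbscan-5' is created in this notebook as shown. The 'dbscan' column
# must already exist in df (e.g. from an earlier run), otherwise this raises a KeyError.
df[['dbscan', 'dbscan-5']].to_csv('output-dbscan.csv')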
!curl -X POST --data-binary @'output-dbscan.csv' -H "Content-Type: text/csv" "https://www.googleapis.com/upload/storage/v1/b/karthik101/o?uploadType=media&name=output-dbscan.csv"

{
 "kind": "storage#object",
 "id": "karthik101/output-dbscan.csv/1561383573998493",
 "selfLink": "https://www.googleapis.com/storage/v1/b/karthik101/o/output-dbscan.csv",
 "name": "output-dbscan.csv",
 "bucket": "karthik101",
 "generation": "1561383573998493",
 "metageneration": "1",
 "contentType": "text/csv",
 "timeCreated": "2019-06-24T13:39:33.998Z",
 "updated": "2019-06-24T13:39:33.998Z",
 "storageClass": "MULTI_REGIONAL",
 "timeStorageClassUpdated": "2019-06-24T13:39:33.998Z",
 "size": "204153",
 "md5Hash": "cKAFVam1tOUizuvNa1/xag==",
 "mediaLink": "https://www.googleapis.com/download/storage/v1/b/karthik101/o/output-dbscan.csv?generation=1561383573998493&alt=media",
 "crc32c": "X9eJIw==",
 "etag": "CJ2XpYqfguMCEAE="
}


### Apply the K-Means algorithm to see the difference in clustering behavior


In [0]:
from sklearn.cluster import KMeans

In [0]:
kmeans = KMeans(n_clusters=15, random_state=0).fit(X)

In [0]:
[list(kmeans.labels_).count(i) for i in set(kmeans.labels_) ]

[2987, 346, 6541, 460, 223, 74, 565, 284, 1946, 5394, 619, 117, 167, 858, 372]
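
One difference from DBSCAN is visible in these counts: K-Means has no noise label, so every ticket lands in one of the 15 clusters. The reason is the assignment step, which gives each point to its nearest centroid however far away that is. The plain-Python sketch below shows only that step, not the full algorithm; `assign_to_nearest`, the points, and the centroids are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

# Two tight groups plus one far-away outlier (toy data).
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1), (100.0, 100.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

assignments = assign_to_nearest(points, centroids)
print(assignments)
```

The outlier at `(100.0, 100.0)` is still assigned to cluster 1, whereas a density-based method could leave such a point unclustered.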

In [0]:
df['kmeans-15'] = kmeans.labels_

In [0]:
df[df['kmeans-15'] == 14]

| Unnamed: 0 | number | short_description | description | kmeans-15 |
|---|---|---|---|---|
| 128 | INC0099235 | Access to: http://teamsites/isit/helpdesk/pmo/ | Hello,\r\n\r\nAs a designated approver please ... | 14 |
| 271 | INC0098533 | access to QM11 and QM15 in EWP | Dear Team ,\r\n\r\nPlease create incident and ... | 14 |
| 280 | INC0098483 | Portal tasks | Hello Team,\r\n\r\nCould you give access to th... | 14 |
| 436 | INC0097735 | RE: CRM Access request INC0067014 GRC pre appr... | Hi Team,\r\n\r\nWe have now access to SAP but ... | 14 |
| 555 | INC0097216 | RE: FW: Food System - China - Security require... | Ramesh,\r\n\r\nPlease use plant finance manage... | 14 |
| 589 | INC0097057 | Check the PC situation for X_SurjawanI | Hello,\r\n\r\nEWP, BW and EWC must be needed\r... | 14 |
| 749 | INC0095933 | Need to import transports in EWM systme | Svd@ Pls create an incident and assign it to t... | 14 |
| 969 | INC0094829 | RE: Please provide authorization is CRP | Please create user id and assign.\r\n\r\nFrom:... | 14 |
| 1086 | INC0094233 | EWP Retrofits in EWI | Hi Team,\r\n\r\nCan you please create an INC a... | 14 |
| 1106 | INC0094147 | New Transaction ZFITVCITI_DUPENTRIES | Dear Team ,\r\n\r\nPlease raise incident and a... | 14 |


In [0]:
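# Note: the 'dbscan' column is not created in this notebook as shown; it must
# already exist in df (e.g. from an earlier run), otherwise this raises a KeyError.
df[['dbscan', 'dbscan-5', 'kmeans-15']].to_csv('output-dbscan-kmeans.csv')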
!curl -X POST --data-binary @'output-dbscan-kmeans.csv' -H "Content-Type: text/csv" "https://www.googleapis.com/upload/storage/v1/b/karthik101/o?uploadType=media&name=output-dbscan-kmeans.csv"

{
 "kind": "storage#object",
 "id": "karthik101/output-dbscan-kmeans.csv/1561384428685891",
 "selfLink": "https://www.googleapis.com/storage/v1/b/karthik101/o/output-dbscan-kmeans.csv",
 "name": "output-dbscan-kmeans.csv",
 "bucket": "karthik101",
 "generation": "1561384428685891",
 "metageneration": "1",
 "contentType": "text/csv",
 "timeCreated": "2019-06-24T13:53:48.685Z",
 "updated": "2019-06-24T13:53:48.685Z",
 "storageClass": "MULTI_REGIONAL",
 "timeStorageClassUpdated": "2019-06-24T13:53:48.685Z",
 "size": "252842",
 "md5Hash": "AqG6BqwMi2gDYvn9/4b7ug==",
 "mediaLink": "https://www.googleapis.com/download/storage/v1/b/karthik101/o/output-dbscan-kmeans.csv?generation=1561384428685891&alt=media",
 "crc32c": "dzJcsA==",
 "etag": "CMOU66GiguMCEAE="
}
