# TF-IDF Vectorization.
## &emsp; Converting Unstructured Text Data into Structured Data.

### When there is a system issue, we generally raise a ticket and describe the issue in a comment.
### Tickets are then prioritized based on that comment or issue description and resolved in order of priority.

### For this, I can build an algorithm trained on historical ticket data.
### Then, when a new ticket arrives, the algorithm can predict the ticket priority from its comment or issue description.


**************************
**************************

## Creating Data.

In [1]:
# Creating a sample System Issue Ticket Data.
import pandas as pd

TicketData= pd.DataFrame(data=[['Hi Please reset my password, i am not able to reset it','P3'],
                               ['Hi Please reset my password','P3'],
                               ['Hi The system is down please restart it', 'P1'],
                               ['Not able to login can you check?', 'P3'],
                               ['The data is not getting exported', 'P2'] ],
                                    columns= ['Issue_Description','Priority'])
# Printing the data
TicketData

| | Issue_Description | Priority |
|---|---|---|
| 0 | Hi Please reset my password, i am not able to ... | P3 |
| 1 | Hi Please reset my password | P3 |
| 2 | Hi The system is down please restart it | P1 |
| 3 | Not able to login can you check? | P3 |
| 4 | The data is not getting exported | P2 |


## Converting Unstructured Text Data into Structured Data.

In [2]:
# TF-IDF vectorization of Text Data.
from sklearn.feature_extraction.text import TfidfVectorizer

Comment = TicketData['Issue_Description'].values

# 'english' selects scikit-learn's built-in stop-word list (the same words as
# ENGLISH_STOP_WORDS) without importing from the private _stop_words module.
vectorizer = TfidfVectorizer(stop_words='english')

X = vectorizer.fit_transform(Comment)

# New column names (get_feature_names() was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out().tolist())

print(X.shape)

['able', 'check', 'data', 'exported', 'getting', 'hi', 'login', 'password', 'reset', 'restart']
(5, 10)


In [3]:
# Visualizing the Document Term Matrix using TF-IDF.
VectorizedText = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
VectorizedText['Issue_Description']= pd.Series(Comment)
VectorizedText

| | able | check | data | exported | getting | hi | login | password | reset | restart | Issue_Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38665 | 0.0 | 0.0 | 0.0 | 0.0 | 0.320954 | 0.0 | 0.38665 | 0.7733 | 0.0 | Hi Please reset my password, i am not able to ... |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.506204 | 0.0 | 0.609818 | 0.609818 | 0.0 | Hi Please reset my password |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.556451 | 0.0 | 0.0 | 0.0 | 0.830881 | Hi The system is down please restart it |
| 3 | 0.495524 | 0.614189 | 0.0 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.0 | 0.0 | Not able to login can you check? |
| 4 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | The data is not getting exported |


In [4]:
# Example Data frame For machine learning.
# Priority column acts as a target variable and other columns as predictors.
DataForML = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
DataForML['Priority'] = TicketData['Priority']
DataForML.head()

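| | able | check | data | exported | getting | hi | login | password | reset | restart | Priority |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38665 | 0.0 | 0.0 | 0.0 | 0.0 | 0.320954 | 0.0 | 0.38665 | 0.7733 | 0.0 | P3 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.506204 | 0.0 | 0.609818 | 0.609818 | 0.0 | P3 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.556451 | 0.0 | 0.0 | 0.0 | 0.830881 | P1 |
| 3 | 0.495524 | 0.614189 | 0.0 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.0 | 0.0 | P3 |
| 4 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | P2 |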


## Now, with this data, I can build a classification algorithm to predict the ticket priority.
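
One possible way to do that is sketched below. The choice of `LogisticRegression` and the use of a scikit-learn pipeline are assumptions made for illustration, not something this notebook prescribes, and the two new tickets are invented examples. Five training rows are far too few for a reliable model; real use would need many more historical tickets and a held-out evaluation set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    'Hi Please reset my password, i am not able to reset it',
    'Hi Please reset my password',
    'Hi The system is down please restart it',
    'Not able to login can you check?',
    'The data is not getting exported',
]
priorities = ['P3', 'P3', 'P1', 'P3', 'P2']

# The pipeline vectorizes the text and fits the classifier in one step,
# so new tickets go through exactly the same TF-IDF transformation.
model = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(max_iter=1000),  # illustrative classifier choice
)
model.fit(descriptions, priorities)

# Hypothetical new tickets to score.
new_tickets = [
    'Please reset my password',
    'System is down, restart needed',
]
predicted = model.predict(new_tickets)

for ticket, priority in zip(new_tickets, predicted):
    print(priority, '<-', ticket)
```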

Pranab Kumar Paul.