### Using Natural Language Processing & Text Modeling to group the inputs by topic

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('user_input.csv')
df.head(10)

Unnamed: 0,input,topic
0,lower bill,tips
1,how can i reduce my energy use,tips
2,hi im looking for ways to reduce my energy usage,tips
3,hi im looking for ways to reduce my energy uses,tips
4,what small changes can i make to my energy use...,tips
5,is there a way i can reduce my energy bill,tips
6,how can i lower my bill,tips
7,how to conserve energy use,tips
8,i need tips to reduce my energy bill,tips
9,tips to reduce my energy bill,tips


In [6]:
df[df["topic"].isin(["tips", "resource", "disconnection"])]

Unnamed: 0,input,topic
0,lower bill,tips
1,how can i reduce my energy use,tips
2,hi im looking for ways to reduce my energy usage,tips
3,hi im looking for ways to reduce my energy uses,tips
4,what small changes can i make to my energy use...,tips
5,is there a way i can reduce my energy bill,tips
6,how can i lower my bill,tips
7,how to conserve energy use,tips
8,i need tips to reduce my energy bill,tips
9,tips to reduce my energy bill,tips


To turn the text into something we can analyze, we use `CountVectorizer` from `sklearn`. It builds a vocabulary from a collection of texts: it lowercases each text and splits it into tokens on whitespace and punctuation. By default it also ignores one-character tokens. I pass it the built-in list of frequent English words ('stop words'), which are left out of the vocabulary because they carry too little information.

The vectorizer expects strings, so we convert the column to a NumPy array of Unicode strings with `.values.astype('U')`. This also turns any missing value into the string `'nan'` instead of raising an error.
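To make the conversion concrete, here is a minimal sketch on a made-up column. The `toy` series and its contents are assumptions for illustration and are not taken from `user_input.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value, stored as generic Python objects
toy = pd.Series(["lower bill", np.nan, "how can i lower my bill"], dtype=object)

as_unicode = toy.values.astype('U')  # object array -> fixed-width Unicode array

print(as_unicode.dtype.kind)  # 'U' marks a NumPy Unicode string dtype
print(as_unicode[1])          # the missing value is now the text 'nan'
```

If missing inputs should not end up as the word "nan" in the vocabulary, one option is to drop them before converting, for example with `dropna()`.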

In [9]:
from sklearn.feature_extraction.text import CountVectorizer  # turns text into word counts

text = df['input'].values.astype('U')  # take the text from the df and convert it to Unicode
vect = CountVectorizer(stop_words='english')  # create the vectorizer, with English stop words
vect = vect.fit(text)  # learn the vocabulary from the user inputs
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vect.get_feature_names_out()  # the words in the vocabulary
# Note: the slice [500:520] is empty for a vocabulary this small; use e.g. [:20] to see actual words
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 58 words in the vocabulary. A selection: []
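To see what the vectorizer keeps and drops, it can help to fit it on a couple of sentences first. This is a small sketch under assumptions: `toy_docs` and `toy_vect` are made-up names and sentences, not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences (not rows from user_input.csv)
toy_docs = ["How can I LOWER my bill?", "Tips to reduce my energy use"]

toy_vect = CountVectorizer(stop_words='english').fit(toy_docs)

# vocabulary_ maps each kept word to its column index in the matrix
print(sorted(toy_vect.vocabulary_))

# The built-in stop word list is fairly broad, so it is worth checking
# whether a word that matters for your topics is being removed
print("bill" in toy_vect.get_stop_words())
```

Words such as "how" and "my" are removed as stop words, and "LOWER" is stored in lowercase. If the built-in list removes a word you need, `stop_words` also accepts your own list of words.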


In [11]:
docu_feat = vect.transform(text)  # transform counts the vocabulary words per document and returns a sparse matrix
print(docu_feat[0:50, 0:50])  # print part of the matrix: the first 50 documents (rows) and first 50 words (columns)

  (0, 27)	1
  (1, 16)	1
  (1, 41)	1
  (2, 16)	1
  (2, 20)	1
  (2, 23)	1
  (2, 26)	1
  (2, 41)	1
  (3, 16)	1
  (3, 20)	1
  (3, 23)	1
  (3, 26)	1
  (3, 41)	1
  (4, 5)	1
  (4, 16)	2
  (4, 27)	1
  (4, 28)	2
  (4, 46)	1
  (5, 16)	1
  (5, 41)	1
  (6, 27)	1
  (7, 8)	1
  (7, 16)	1
  (8, 16)	1
  (8, 35)	1
  :	:
  (28, 15)	1
  (28, 44)	1
  (29, 14)	1
  (29, 40)	1
  (30, 40)	1
  (30, 45)	1
  (31, 17)	1
  (31, 36)	1
  (31, 37)	1
  (32, 15)	1
  (32, 23)	1
  (32, 24)	1
  (32, 45)	1
  (33, 3)	1
  (33, 11)	1
  (33, 15)	1
  (33, 18)	1
  (33, 26)	1
  (33, 30)	1
  (33, 38)	1
  (33, 45)	1
  (34, 15)	1
  (34, 45)	1
  (35, 13)	1
  (35, 15)	1


## Naive Bayes

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

First, we rebuild the document-feature matrix, this time keeping only the three topics we want to classify.

In [13]:
df = pd.read_csv('user_input.csv')

df = df.loc[df["topic"].isin(["tips", "resource", "disconnection"])]  # keep only the three topics
text = df['input'].values.astype('U')  # take the text from the df and convert it to Unicode
vect = CountVectorizer(stop_words='english')  # create the vectorizer, with English stop words
vect = vect.fit(text)  # learn the vocabulary from the user inputs
docu_feat = vect.transform(text)  # build the document-feature matrix

Now, we will use the Naïve Bayes classifier from `sklearn`.

In [14]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()  # create the model
X = docu_feat  # the document-feature matrix is the X matrix
y = df['topic']  # the topic labels are the y vector

# hold out 30% of the rows for testing; no random_state is set, so the split changes between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

nb = nb.fit(X_train, y_train)  # fit the model: X = word counts, y = topic
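Once a model like this is fitted, a new input has to go through the same fitted vectorizer before it can be classified. Here is a minimal sketch on made-up training data; `toy_text`, `toy_topic`, `toy_vect` and `toy_nb` are hypothetical names and the sentences are not from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training examples (not rows from user_input.csv)
toy_text = [
    "reduce energy use",
    "tips to lower energy use",
    "power was disconnected",
    "service disconnected yesterday",
]
toy_topic = ["tips", "tips", "disconnection", "disconnection"]

toy_vect = CountVectorizer().fit(toy_text)  # learn the vocabulary once
toy_nb = MultinomialNB().fit(toy_vect.transform(toy_text), toy_topic)

# Use transform (not fit) on new text so the columns match the training matrix
new_input = ["how do i reduce energy"]
pred = toy_nb.predict(toy_vect.transform(new_input))
print(pred[0])  # tips
```

Words the vectorizer has never seen, such as "how" and "do" here, are simply ignored at prediction time.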




## Evaluating the model

In [15]:
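# Evaluate the model on the held-out test set
y_test_p = nb.predict(X_test)  # predicted topic for each test document
nb.score(X_test, y_test)  # mean accuracy: the share of test documents classified correctly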

1.0

In [None]:
cm = confusion_matrix(y_test, y_test_p)  # rows are the true topics, columns are the predicted topics

cm
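A bare confusion matrix does not show which row belongs to which topic. One possible way to label it is to pass an explicit `labels` list and wrap the result in a DataFrame. The values below are invented for illustration and are not results from this model; `toy_true`, `toy_pred` and `labelled` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["disconnection", "resource", "tips"]

# Hypothetical true and predicted topics (not output from the model above)
toy_true = ["tips", "tips", "resource", "disconnection", "resource"]
toy_pred = ["tips", "resource", "resource", "disconnection", "resource"]

# labels fixes the order of rows and columns
toy_cm = confusion_matrix(toy_true, toy_pred, labels=labels)

# Rows are the true topics, columns are the predicted topics
labelled = pd.DataFrame(toy_cm, index=labels, columns=labels)
print(labelled)
```

In this made-up example, the single off-diagonal count sits in row `tips`, column `resource`: one input about tips was predicted as a resource question.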