## Text modeling

In the next two example Notebooks, we apply text mining techniques to discover [whether a hotel review is genuine or fake](https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus). In this Notebook, we will create the document-feature matrix.

In [2]:
import pandas as pd

First, let's read in the data file.

In [3]:
reviews = pd.read_csv('deceptive-opinion.csv')
reviews.head()

Unnamed: 0,deceptive,hotel,polarity,source,text
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...


In [17]:
reviews['deceptive'].value_counts()

truthful     800
deceptive    800
Name: deceptive, dtype: int64

As we can see, there are 800 truthful and 800 deceptive reviews. 

To turn the text into something we can analyze, we need an object from `sklearn` called a `CountVectorizer`. Essentially, it builds a dictionary (the vocabulary) from a collection of texts. By default it lowercases the text and splits it into tokens, treating whitespace and punctuation as separators and keeping only tokens of two or more characters. We also pass a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.
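To make the lowercasing, tokenizing and stop-word removal concrete, here is a minimal sketch on two made-up sentences. The names `toy_reviews`, `toy_vect` and `toy_vocab` are illustrative assumptions and not part of the dataset or the original Notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, invented for illustration only
toy_reviews = [
    "The room was GREAT, and the staff was great!",
    "Terrible room; the staff ignored the guests.",
]

# Same settings as in the Notebook: default tokenizer plus English stop words
toy_vect = CountVectorizer(stop_words='english')
toy_vect.fit(toy_reviews)

# vocabulary_ maps each kept word to its column index; sorting gives the words
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)
```

"GREAT" and "great" end up as a single lowercase entry, punctuation disappears, and frequent words such as "the" are dropped.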

We also need to convert the text to Unicode strings, so that every entry handed to the `CountVectorizer` is text. We do so by using `.values.astype('U')`.
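The effect of `.values.astype('U')` can be sketched on a tiny, made-up column. The assumption here is that the column could contain a missing value; whether the real data does is not checked in this Notebook, and `toy_text` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing review (NaN), stored as generic objects
toy_text = pd.Series(["Nice hotel", np.nan], dtype=object)

# astype('U') casts every entry to a NumPy Unicode string
as_unicode = toy_text.values.astype('U')

print(as_unicode.dtype.kind)  # 'U' marks a Unicode string array
print(as_unicode)
```

Note that the missing value becomes the literal string `'nan'`, so after the conversion every element is a string that the vectorizer can lowercase and tokenize.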

In [19]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = reviews['text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 9284 words in the vocabulary. A selection: ['aid', 'aides', 'air', 'airfare', 'airline', 'airlines', 'airplane', 'airport', 'airports', 'airshow', 'airy', 'aka', 'akin', 'akk', 'al', 'alarm', 'alarmed', 'alas', 'albeit', 'albiet']


Now that we have the dictionary, we can count the occurrences of each word in each review. This way, we can create a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.
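As a small illustration of what such a matrix looks like, the sketch below counts words in two made-up documents. The variable names are assumptions, stop words are deliberately not removed so that every token stays visible, and `.toarray()` is used only to make the counts printable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents, invented for illustration only
toy_reviews = ["clean room, clean lobby", "noisy room"]

toy_vect = CountVectorizer()  # no stop words here, to keep every token
toy_matrix = toy_vect.fit_transform(toy_reviews)  # fit and count in one step

# Order the words by their column index so the labels match the columns
toy_words = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
toy_dfm = pd.DataFrame(toy_matrix.toarray(), columns=toy_words)
print(toy_dfm)
```

Each row is a document and each column a word: "clean" is counted twice in the first document and not at all in the second.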

In [20]:
matrix = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(matrix[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 14)	1
  (7, 0)	2
  (7, 37)	1
  (10, 21)	1
  (12, 0)	1
  (12, 1)	1
  (12, 4)	1
  (12, 13)	1
  (12, 25)	2
  (14, 39)	1
  (18, 49)	1
  (23, 0)	4
  (23, 14)	1
  (25, 6)	1
  (25, 7)	1
  (25, 8)	1
  (25, 40)	1
  (41, 41)	1
  (42, 14)	1
  (47, 24)	1
  (48, 33)	1
  (49, 0)	1


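As you can see, the printed output contains no 0's: each line shows a (row, column) position followed by the count stored there. Because the matrix is mostly zeroes, only the non-zero entries are stored, which saves memory. This is a so-called _sparse matrix_. We can, however, convert it to a regular (dense) matrix with `.toarray()`.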
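The idea of a sparse matrix can be sketched with SciPy directly. This is a minimal, hypothetical example with a hand-made 3 x 3 array, not the review data; `dense`, `sparse` and `restored` are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small hand-made matrix that is mostly zeroes
dense = np.array([[0, 0, 2],
                  [0, 0, 0],
                  [1, 0, 0]])

# The CSR format keeps only the non-zero entries and their positions
sparse = csr_matrix(dense)
print(sparse)       # lists only the non-zero entries
print(sparse.nnz)   # number of stored values

# .toarray() fills the zeroes back in
restored = sparse.toarray()
print(restored)
```

Only two of the nine cells are stored, yet the round trip through `.toarray()` gives back the original array.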

In [21]:
docu_feat = pd.DataFrame(matrix.toarray()) #make a regular matrix, then put in Dataframe
docu_feat.index = reviews['text'] #Give the rows names (text of the review)
docu_feat.columns = feature_names #Give the columns names (words from vocabulary)

In [23]:
docu_feat.iloc[0:4, 1000:1015] #Show a part of the matrix

text,beaches,beagle,bearing,bears,beast,beat,beaten,beating,beatles,beats,beauti,beautiful,beautifull,beautifully,beauty
"We stayed for a one night getaway with family on a thursday. Triple AAA rate of 173 was a steal. 7th floor room complete with 44in plasma TV bose stereo, voss and evian water, and gorgeous bathroom(no tub but was fine for us) Concierge was very helpful. You cannot beat this location... Only flaw was breakfast was pricey and service was very very slow(2hours for four kids and four adults on a friday morning) even though there were only two other tables in the restaurant. Food was very good so it was worth the wait. I would return in a heartbeat. A gem in chicago... \n",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
"Triple A rate with upgrade to view room was less than $200 which also included breakfast vouchers. Had a great view of river, lake, Wrigley Bldg. & Tribune Bldg. Most major restaurants, Shopping, Sightseeing attractions within walking distance. Large room with a very comfortable bed. \n",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"This comes a little late as I'm finally catching up on my reviews from the past several months:) A dear friend and I stayed at the Hyatt Regency in late October 2007 for one night while visiting a friend and her husband from out of town. This hotel is perfect, IMO. Easy check in and check out. Lovely, clean, comfortable rooms with great views of the city. I know this area pretty well and it's very convenient to many downtown Chicago attractions. We had dinner and went clubing with our friends around Division St.. We had no problems getting cabs back and forth to the Hyatt and there's even public transportation right near by but we didn't bother since we only needed cabs from and to the hotel. Parking, as is usual for Chicago, was expensive but we were able to get our car out quickly (however, we left on a Sunday morning, not exactly a high traffic time although it was a Bears homegame day, so a bit busier than usual I would think). No problems at all and the best part is that we got a rate of $100 through Hotwire, a downright steal for this area of Chicago and the quality of the hotel. \n",0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
"The Omni Chicago really delivers on all fronts, from the spaciousness of the rooms to the helpful staff to the prized location on Michigan Avenue. While this address in Chicago requires a high level of quality, the Omni delivers. Check in for myself and a whole group of people with me was under 3 minutes, the staff had plentiful recommendations for dining and events, and the rooms are some of the largest you'll find at this price range in Chicago. Even the 'standard' room has a separate living area and work desk. The fitness center has free weights, weight machines, and two rows of cardio equipment. I shared the room with 7 others and did not feel cramped in any way! All in all, a great property! \n",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


As we can see, the matrix is almost entirely filled with zeroes.
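How empty a matrix is can also be expressed as a number. The sketch below uses a hypothetical helper, `zero_share`, on a small made-up sparse matrix; it assumes that the matrix stores no explicit zeroes, so that `nnz` equals the number of non-zero cells.

```python
import numpy as np
from scipy.sparse import csr_matrix

def zero_share(sparse_matrix):
    """Return the fraction of cells that are zero (hypothetical helper)."""
    n_rows, n_cols = sparse_matrix.shape
    # nnz counts the stored entries; everything else is an implicit zero
    return 1 - sparse_matrix.nnz / (n_rows * n_cols)

# Made-up 2 x 4 matrix with two non-zero cells out of eight
toy = csr_matrix(np.array([[0, 1, 0, 0],
                           [0, 0, 0, 3]]))
print(zero_share(toy))
```

The same function could be applied to a sparse document-feature matrix without converting it to a dense array first.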