# Project #3 - Text Mining
Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/Seattle_hotels.csv

## Project #3 Requirements
* Load data and examine data
* Clean data: 1) remove punctuation, 2) lowercase, 3) stem or lemmatize
* Vectorize cleaned data
* Generate similarities matrix
* Generate hotel recommendations for the 3 listed hotels
  * Motel 6 Seattle Sea-Tac Airport South
  * The Bacon Mansion Bed and Breakfast
  * Holiday Inn Seattle Downtown

In [1]:
from datetime import datetime
# Explicit directives: %D and %T are not supported by strftime on every platform
print(f'Run time: {datetime.now().strftime("%m/%d/%y %H:%M:%S")}')

Run time: 11/10/23 13:38:27


### Import libraries

In [2]:
import pandas as pd
import re
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Load data

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/Seattle_hotels.csv")

### Examine data

In [4]:
print(df.head())

                             name  \
0  Hilton Garden Seattle Downtown   
1          Sheraton Grand Seattle   
2   Crowne Plaza Seattle Downtown   
3   Kimpton Hotel Monaco Seattle    
4              The Westin Seattle   

                                           address  \
0  1821 Boren Avenue, Seattle Washington 98101 USA   
1   1400 6th Avenue, Seattle, Washington 98101 USA   
2                  1113 6th Ave, Seattle, WA 98101   
3                   1101 4th Ave, Seattle, WA98101   
4   1900 5th Avenue,�Seattle,�Washington�98101�USA   

                                                desc  
0  Located on the southern tip of Lake Union, the...  
1  Located in the city's vibrant core, the Sherat...  
2  Located in the heart of downtown Seattle, the ...  
3  What?s near our hotel downtown Seattle locatio...  
4  Situated amid incredible shopping and iconic a...  


In [5]:
print(df.shape)

(152, 3)


In [6]:
print(df.columns)

Index(['name', 'address', 'desc'], dtype='object')


In [7]:
print(df.describe())

                                  name  \
count                              152   
unique                             152   
top     Hilton Garden Seattle Downtown   
freq                                 1   

                                                address  \
count                                               152   
unique                                              152   
top     1821 Boren Avenue, Seattle Washington 98101 USA   
freq                                                  1   

                                                     desc  
count                                                 152  
unique                                                152  
top     Located on the southern tip of Lake Union, the...  
freq                                                    1  


In [8]:
print(df.dtypes)

name       object
address    object
desc       object
dtype: object


In [9]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     152 non-null    object
 1   address  152 non-null    object
 2   desc     152 non-null    object
dtypes: object(3)
memory usage: 3.7+ KB
None


### Prepare data

In [10]:
df = df.drop(columns=['address'])
print(df)

                                        name  \
0             Hilton Garden Seattle Downtown   
1                     Sheraton Grand Seattle   
2              Crowne Plaza Seattle Downtown   
3              Kimpton Hotel Monaco Seattle    
4                         The Westin Seattle   
..                                       ...   
147                The Halcyon Suite Du Jour   
148                              Vermont Inn   
149               Stay Alfred on Wall Street   
150       Pike's Place Lux Suites by Barsala   
151  citizenM Seattle South Lake Union hotel   

                                                  desc  
0    Located on the southern tip of Lake Union, the...  
1    Located in the city's vibrant core, the Sherat...  
2    Located in the heart of downtown Seattle, the ...  
3    What?s near our hotel downtown Seattle locatio...  
4    Situated amid incredible shopping and iconic a...  
..                                                 ...  
147  Located in Queen An

#### Clean column hotel descriptions
1) remove punctuation
2) lowercase text
3) either stem or lemmatize text

In [11]:
def remove_punctuation(text):
    # Delete every ASCII punctuation character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

df['step1'] = df['desc'].apply(remove_punctuation)
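
Deleting punctuation outright fuses hyphenated words (the cleaned output later shows tokens such as `threeminut` and `hotelamazon`), and it leaves the `�` replacement character in place because that character is not in `string.punctuation`. The sketch below is one possible alternative, not part of the original pipeline: it maps punctuation and U+FFFD to spaces instead. The helper name `punctuation_to_space` and the sample sentence are made up for illustration.

```python
import re
import string

REPLACEMENT_CHAR = "\ufffd"  # the "�" seen in some hotel descriptions

def punctuation_to_space(text):
    """Hypothetical helper: turn punctuation and U+FFFD into spaces, then collapse runs of whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation + REPLACEMENT_CHAR})
    return re.sub(r"\s+", " ", text.translate(table)).strip()

# Illustrative sentence, not taken from the data set
sample = "A quick three-minute drive; free Wi-Fi\ufffdincluded."

# Behaviour of the notebook's approach: punctuation is deleted, so words fuse
deleted = "".join(ch for ch in sample if ch not in string.punctuation)

# Alternative: words stay separated and the replacement character is dropped
spaced = punctuation_to_space(sample)

print(deleted)
print(spaced)
```

Switching to this variant would change the vocabulary that the vectorizer sees, so the similarity scores reported further down would no longer match exactly.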

In [12]:
def to_lowercase(text):
    return text.lower()

df['step2'] = df['step1'].apply(to_lowercase)

In [13]:
from nltk.stem import WordNetLemmatizer  # needed for the lemmatize branch

def stem_or_lemmatize(text, use_stemmer=True):
    # Tokenize once, then reduce each token to its stem or its lemma
    tokens = word_tokenize(text)
    if use_stemmer:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
    else:
        # Lemmatizing also requires the WordNet data: nltk.download('wordnet')
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

#### Display updated dataframe

In [14]:
import nltk
nltk.download('punkt')
use_stemming = True

# Apply the cleaning functions and create a new column
df['desc_cleaned'] = df['step2'].apply(lambda text: stem_or_lemmatize(text, use_stemming))

# Display the cleaned data
print(df['desc_cleaned'])

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jinwoorhee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


0      locat on the southern tip of lake union the hi...
1      locat in the citi vibrant core the sheraton gr...
2      locat in the heart of downtown seattl the awar...
3      what near our hotel downtown seattl locat the ...
4      situat amid incred shop and icon attract the w...
                             ...                        
147    locat in queen ann district the halcyon suit d...
148    just a block from the world famou space needl ...
149    stay alfr on wall street resid in the heart of...
150    the perfect marriag of heighten conveni and un...
151    ye it true everi room at citizenm is the best ...
Name: desc_cleaned, Length: 152, dtype: object


### Vectorize cleaned hotel descriptions

In [15]:
cleaned_text = df['desc_cleaned']
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_text)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_df)

      10       100  1000  10000  103000  109  109room  10best   10minut  \
0    0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
1    0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
2    0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
3    0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
4    0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
..   ...       ...   ...    ...     ...  ...      ...     ...       ...   
147  0.0  0.063154   0.0    0.0     0.0  0.0      0.0     0.0  0.157663   
148  0.0  0.115977   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
149  0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
150  0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   
151  0.0  0.000000   0.0    0.0     0.0  0.0      0.0     0.0  0.000000   

     10night  ...     youll      your  yourself  youth     yummi  zagat  \
0        0.0  ...  0.000
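
To make the numbers in the matrix above less opaque, here is a dependency-free sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents and the function name `tfidf_vectors` are assumptions for illustration, and the sketch splits on whitespace, whereas scikit-learn's default tokenizer also drops one-character tokens.

```python
import math
from collections import Counter

# Toy "cleaned" descriptions (illustrative, not from the data set)
docs = [
    "hotel near airport free shuttl",
    "hotel downtown near space needl",
    "bed and breakfast on capitol hill",
]

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf_vectors(docs)
for vec in vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first document, `airport` ends up with a larger weight than `hotel` because it occurs in fewer documents. That is the property the recommender relies on.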

### Generate similarities matrix on cleaned hotel descriptions

In [16]:
# Reuse tfidf_matrix from the previous cell instead of fitting the vectorizer again
cosine_sim_matrix = cosine_similarity(tfidf_matrix)
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=df.index, columns=df.index)
print(cosine_sim_df)

          0         1         2         3         4         5         6    \
0    1.000000  0.204004  0.242663  0.157574  0.283405  0.250509  0.209805   
1    0.204004  1.000000  0.179680  0.132857  0.232243  0.178648  0.158034   
2    0.242663  0.179680  1.000000  0.115171  0.217613  0.178156  0.160445   
3    0.157574  0.132857  0.115171  1.000000  0.193410  0.213950  0.160997   
4    0.283405  0.232243  0.217613  0.193410  1.000000  0.243606  0.187601   
..        ...       ...       ...       ...       ...       ...       ...   
147  0.131941  0.097382  0.097513  0.094734  0.131270  0.141449  0.075832   
148  0.163753  0.151492  0.190590  0.108045  0.168909  0.202738  0.124185   
149  0.223970  0.174066  0.221418  0.128916  0.192328  0.197694  0.155560   
150  0.101081  0.071346  0.095523  0.086470  0.151553  0.088760  0.044619   
151  0.123778  0.092845  0.119354  0.074897  0.101220  0.131381  0.102740   

          7         8         9    ...       142       143       144  \
0  
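
Each cell of the matrix above is the cosine of the angle between two TF-IDF vectors, which explains the 1.0 values on the diagonal and the symmetry around it. A minimal pure-Python sketch of that calculation follows. The sparse vectors `a`, `b`, and `c` are invented for illustration, and the function name `cosine` is not from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_u * norm_v)

# Illustrative vectors: a and b share one term, c shares none with a
a = {"hotel": 1.0, "airport": 2.0}
b = {"hotel": 1.0, "downtown": 2.0}
c = {"bed": 1.0, "breakfast": 1.0}

print(round(cosine(a, a), 4))  # a vector compared with itself
print(round(cosine(a, b), 4))  # partial overlap
print(round(cosine(a, c), 4))  # no shared terms
```

Because `TfidfVectorizer` already L2-normalizes each row by default, the cosine similarity of two rows reduces to their dot product.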

### Create hotel recommender

In [17]:
def get_hotel_recommendations(hotel_name, num_recommendations=5):
    """Print the hotels whose cleaned descriptions are most similar to hotel_name."""
    matches = df.index[df['name'] == hotel_name]
    if len(matches) == 0:
        raise ValueError(f"Hotel not found: {hotel_name!r}")
    hotel_index = matches[0]

    # Similarity of the chosen hotel to every other hotel (the hotel itself is dropped)
    similarities = cosine_sim_df.loc[hotel_index].drop(hotel_index)
    top_recommendations = similarities.nlargest(num_recommendations)

    for i, similarity in top_recommendations.items():
        print("Hotel Name:", df.loc[i, 'name'])
        print("Similarity Score:", round(similarity, 4))
        print("Hotel Description:", df.loc[i, 'desc_cleaned'])
        print()
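
The recommender looks hotels up by exact name, and at least one name in the data carries a trailing space (`'Kimpton Hotel Monaco Seattle '` in the `df.head()` output), so an exact match can miss. The sketch below shows one way a more forgiving lookup could work. The helpers `normalize_name` and `find_hotel_index` are hypothetical, and the short name list stands in for `df['name']`.

```python
def normalize_name(name):
    """Collapse whitespace and ignore case so near-identical names compare equal."""
    return " ".join(name.split()).casefold()

# Stand-in for df['name']; note the trailing space on the second entry
names = [
    "Hilton Garden Seattle Downtown",
    "Kimpton Hotel Monaco Seattle ",
    "The Westin Seattle",
]

# Map each normalized name to its row position
name_to_index = {normalize_name(n): i for i, n in enumerate(names)}

def find_hotel_index(query):
    """Return the row position for a hotel name, or raise KeyError if it is unknown."""
    key = normalize_name(query)
    if key not in name_to_index:
        raise KeyError(f"Hotel not found: {query!r}")
    return name_to_index[key]

print(find_hotel_index("kimpton hotel monaco seattle"))
```

The returned position could then be used in place of the exact-match lookup at the top of `get_hotel_recommendations`.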
        

### Make hotel recommendations for the following hotel names:
* Motel 6 Seattle Sea-Tac Airport South
* The Bacon Mansion Bed and Breakfast
* Holiday Inn Seattle Downtown

In [18]:
hotel_name = "Motel 6 Seattle Sea-Tac Airport South"
get_hotel_recommendations(hotel_name)

Hotel Name: Ramada by Wyndham SeaTac Airport
Similarity Score: 0.3766
Hotel Description: explor the puget sound is easi when you make a reserv at our ramada seatac airport hotel locat off i5 our seatac set near the museum of flight place you within minut of great restaur and top attract seattl tacoma intern airport sea is a quick threeminut drive from our front door start your morn with a fullservic breakfast check the headlin with a free usa today and repli to email with free wifi leisur facil includ a fit center and heat outdoor pool all of our guestroom are wellequip with a coffe maker safe and free hbo kitchenett and suit with jet hot tub provid extra space and amen for extend stay 100 nonsmok and access accommod are also avail take advantag of our free airport shuttl servic park avail for a nomin fee

Hotel Name: Red Roof Inn Seattle Airport - SEATAC
Similarity Score: 0.3223
Hotel Description: red roof inn seattl airport seatac is a 100 smokefre petfriendli hotel in seattl that� l

In [19]:
hotel_name = "The Bacon Mansion Bed and Breakfast"
get_hotel_recommendations(hotel_name)

Hotel Name: Shafer Baillie Mansion Bed & Breakfast
Similarity Score: 0.4101
Hotel Description: look for the perfect bed and breakfast in seattl wa the shafer bailli mansion is a magnific 14000squarefoot tudor reviv home where you will find the graciou atmospher of a bygon era with the�comfort and amenities�that today� travel need all ideal situat on capitol hill in seattl washington our capitol hill neighborhood is ideal situat for both busi and vacat travel just a block from volunt park the crown jewel of the seattl park system you will be a short walk from restaur and shop on broadway or even closer to the busi district on 15th avenu east downtown seattl is a quick bu ride fiveminut car trip or 20minut walk away we are minut by bu car or taxi to the waterfront pike place market seattl center space needl experi music project etc univers of washington seattl univers all major seattl hospit and medic center museum music and theater venu and much more shafer bailli mansion is the largest

In [20]:
hotel_name = "Holiday Inn Seattle Downtown"
get_hotel_recommendations(hotel_name)

Hotel Name: Holiday Inn Express & Suites Seattle-City Center
Similarity Score: 0.4876
Hotel Description: discov all that seattl ha to offer at the�holiday inn express�� suit seattl citi center our hotel locat is in the heart of downtown shop dine and entertain district you are sure to have a memor trip when you stay with us busi travel find mani of the area major compani are near our hotelamazon microsoft boe and starbuck corpor are all just a few minut away we are also close to the univers of washington seattl univers and the washington state convent center is a mile from our hotel we featur busi amen like free highspe internet access and busi center youll find mani of seattl most popular attract within a few mile of our hotel the famou space needl experi music project and key arena are all just a few block away in seattl citi center our hotel locat is also near safeco and centurylink field as well as seattl waterfront whether your visit seattl for work or fun we are the perfect home 