# Text Summary
Summarize reviews of New York Airbnb listings.  
  
Trying out __Text Summarization in 5 Steps using NLTK__  
https://becominghuman.ai/text-summarization-in-5-steps-using-nltk-65b21e352b65

In [119]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk import word_tokenize, PorterStemmer

## 1. Data Preprocessing
In this step we will:  
1. read in the datasets
2. keep only neighbourhoods with at least `cutoff` listings
3. join each listing's neighbourhood with its reviews
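
The three steps above can be sketched on a toy example before running them on the real files. This is only an illustration: the rows, `toy_cutoff`, and every `toy_*` name below are made up, and the filter uses `value_counts` rather than the `transform`-based approach in the cells that follow.

```python
import pandas as pd

# Toy stand-ins for listings-2.csv and reviews-2.csv (made-up rows).
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "neighbourhood": ["Midtown", "Midtown", "Harlem", "Midtown"],
})
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 3, 4],
    "comments": ["Great stay.", "Very clean.", "Nice host.", "Good location."],
})

toy_cutoff = 2  # hypothetical cutoff; the notebook uses 200

# Step 2: keep neighbourhoods with at least `toy_cutoff` listings.
counts = toy_listings["neighbourhood"].value_counts()
top_nbr = counts[counts >= toy_cutoff].index
kept = toy_listings[toy_listings["neighbourhood"].isin(top_nbr)]

# Step 3: attach each review to its listing's neighbourhood.
toy_df = kept.merge(
    toy_reviews.rename(columns={"listing_id": "id"}), on="id", how="inner"
)
print(toy_df)
```

Listings without reviews (id 2 here) and reviews from dropped neighbourhoods (id 3) both disappear in the inner join.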

In [14]:
os.listdir('data')

['calendar.csv',
 'listings-2.csv',
 'listings.csv',
 'neighbourhoods.csv',
 'neighbourhoods.geojson',
 'reviews-2.csv',
 'reviews.csv']

In [49]:
# read in the listings-2 and reviews-2 datasets
print('\nlistings-2:')
df_listing2 = pd.read_csv('data/listings-2.csv')
display(df_listing2.head())

print('\nreviews-2:')
df_review = pd.read_csv('data/reviews-2.csv')
display(df_review.head())


listings-2:


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,10,48,2019-11-04,0.39,1,1
1,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,295,2019-11-22,4.67,1,1
2,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,78,2019-10-13,0.6,1,19
3,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,29,49,2017-10-05,0.38,1,365
4,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79,2,454,2019-11-21,3.52,1,242



reviews-2:


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2595,17857,2009-11-21,50679,Jean,Notre séjour de trois nuits.\r\nNous avons app...
1,2595,19176,2009-12-05,53267,Cate,Great experience.
2,2595,19760,2009-12-10,38960,Anita,I've stayed with my friend at the Midtown Cast...
3,2595,34320,2010-04-09,71130,Kai-Uwe,"We've been staying here for about 9 nights, en..."
4,2595,46312,2010-05-25,117113,Alicia,We had a wonderful stay at Jennifer's charming...


In [44]:
# count listings per neighbourhood, then keep neighbourhoods with at least `cutoff` listings
cutoff = 200
df_nbr_count = df_listing2[['neighbourhood']].reset_index(drop=True)
df_nbr_count['freq'] = df_nbr_count.groupby('neighbourhood')['neighbourhood']\
                                   .transform('count')

# create subset of data based on cutoff
df_top_nbr = df_nbr_count[df_nbr_count['freq'] >= cutoff].drop_duplicates()

# print result
print(f'Number of neighbourhood with at least {cutoff} listings {df_top_nbr.shape[0]}')

Number of neighbourhood with at least 200 listings 49


In [55]:
# create trimmed id, neighbourhood table
df_id_nbr = df_listing2[['id', 'neighbourhood']].drop_duplicates()
df_nbr_trim = df_id_nbr.merge(df_top_nbr, on='neighbourhood', how='inner')

df_nbr_trim = df_nbr_trim[['id', 'neighbourhood']].reset_index(drop=True)

In [61]:
# get comments
df_comments = df_review[['listing_id', 'comments']].reset_index(drop=True)
df_comments = df_comments.rename(columns={'listing_id':'id'})
print('\n comments:')
display(df_comments.head())


 comments:


Unnamed: 0,id,comments
0,2595,Notre séjour de trois nuits.\r\nNous avons app...
1,2595,Great experience.
2,2595,I've stayed with my friend at the Midtown Cast...
3,2595,"We've been staying here for about 9 nights, en..."
4,2595,We had a wonderful stay at Jennifer's charming...


In [62]:
# join id, neighborhood with comments
# since this is our final table, we will call it df
df = df_nbr_trim.merge(df_comments, on='id', how='inner')
df.shape

(1073129, 3)
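
The inner join drops every review whose listing is not in a retained neighbourhood, without reporting how many. A minimal sketch of one way to check this, using the `indicator` and `validate` options of `DataFrame.merge` on made-up frames (`nbr`, `comments`, and `checked` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Made-up stand-ins for df_nbr_trim and df_comments.
nbr = pd.DataFrame({"id": [1, 2], "neighbourhood": ["Midtown", "Harlem"]})
comments = pd.DataFrame({"id": [1, 1, 3], "comments": ["a", "b", "c"]})

# A left join from the comments side keeps every review. indicator=True adds
# a "_merge" column saying whether a match was found, and validate raises an
# error if an id appears more than once on the neighbourhood side.
checked = comments.merge(
    nbr, on="id", how="left", indicator=True, validate="many_to_one"
)
print(checked["_merge"].value_counts())

# Rows marked "both" are the ones an inner join would keep.
inner_equivalent = checked[checked["_merge"] == "both"].drop(columns="_merge")
```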

In [63]:
df.head()

Unnamed: 0,id,neighbourhood,comments
0,2595,Midtown,Notre séjour de trois nuits.\r\nNous avons app...
1,2595,Midtown,Great experience.
2,2595,Midtown,I've stayed with my friend at the Midtown Cast...
3,2595,Midtown,"We've been staying here for about 9 nights, en..."
4,2595,Midtown,We had a wonderful stay at Jennifer's charming...


## 2. EDA
We will run some basic EDA, such as counts, and read a few comments to get a sense of their content and of what a good summary should look like.

In [97]:
# change index to inspect a different comment (label-based lookup on the row index)
index = 1842
print(df[df['neighbourhood'] == 'Midtown']['comments'][index])

The Maria’s place is exactly as described. The apartment is small but enough for 2 people.
The location is fantastic and in few minutes walking you are on the best spots of NY.
My suggestion is to book it if available


In [94]:
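# distribution of listing counts per neighbourhood
# (df_nbr_count has one row per listing, so large neighbourhoods weigh more)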
df_nbr_count['freq'].describe()

count    50599.000000
mean      1547.119963
std       1303.485229
min          1.000000
25%        376.000000
50%       1155.000000
75%       2504.000000
max       3974.000000
Name: freq, dtype: float64
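
Two further counts in the same spirit are the number of reviews per neighbourhood and the length of each comment. The sketch below runs on a made-up frame (`toy_df` is hypothetical) and assumes some comments may be missing, which the real data may or may not contain.

```python
import pandas as pd

# Made-up rows with the same columns as the joined table.
toy_df = pd.DataFrame({
    "neighbourhood": ["Midtown", "Midtown", "Harlem"],
    "comments": ["Great experience.", None, "Nice and quiet place."],
})

# Number of review rows per neighbourhood (missing comments included).
reviews_per_nbr = toy_df.groupby("neighbourhood").size()

# Character length of each comment; a missing comment gives a missing length.
comment_len = toy_df["comments"].str.len()

print(reviews_per_nbr)
print(comment_len.describe())
```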

## 3. Text Summarization
We will concatenate the reviews for each neighbourhood into one long text, then summarize that text.
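
One possible way to do the concatenation is a `groupby` followed by a string join. This is a sketch on made-up rows, not necessarily how the rest of the notebook does it; `toy_df` and `nbr_text` are hypothetical names, and dropping missing comments first is an assumption about the data.

```python
import pandas as pd

# Made-up rows with the same columns as the joined table.
toy_df = pd.DataFrame({
    "neighbourhood": ["Midtown", "Harlem", "Midtown", "Harlem"],
    "comments": ["Great experience.", "Nice host.", "Very clean.", None],
})

# Drop missing comments, then join each neighbourhood's reviews with a space.
nbr_text = (
    toy_df.dropna(subset=["comments"])
          .groupby("neighbourhood")["comments"]
          .agg(" ".join)
)
print(nbr_text["Midtown"])
```

The result is a Series indexed by neighbourhood, so each entry can be passed to the summarizer as a single string.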

### 1. Create Word Frequency Table

In [120]:
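def _create_frequency_table(text_string) -> dict:
    """Count how often each stemmed, non-stop-word token occurs in text_string."""

    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        # stem first, so inflected forms share one key
        word = ps.stem(word)
        # note: this check is case-sensitive and punctuation is not filtered,
        # so tokens such as 'My' or '.' can end up in the table
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    return freqTable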

In [121]:
index = 1842
sample_string = df[df['neighbourhood'] == 'Midtown']['comments'][index]
print(sample_string)

The Maria’s place is exactly as described. The apartment is small but enough for 2 people.
The location is fantastic and in few minutes walking you are on the best spots of NY.
My suggestion is to book it if available


In [123]:
freq_table = _create_frequency_table(sample_string)
freq_table

{'maria': 1,
 '’': 1,
 'place': 1,
 'exactli': 1,
 'describ': 1,
 '.': 3,
 'apart': 1,
 'small': 1,
 'enough': 1,
 '2': 1,
 'peopl': 1,
 'locat': 1,
 'fantast': 1,
 'minut': 1,
 'walk': 1,
 'best': 1,
 'spot': 1,
 'NY': 1,
 'My': 1,
 'suggest': 1,
 'book': 1,
 'avail': 1}
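
The table above still contains punctuation (`'.'`, `'’'`) and the capitalized stop word `'My'`. As a rough sketch of how such tokens could be filtered, the variant below lowercases the text and keeps only alphanumeric tokens. It is not the NLTK pipeline used above: `toy_frequency_table` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-written stand-in for NLTK's English list, and no stemming is applied.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption; NLTK's English list is longer).
TOY_STOP_WORDS = {"the", "is", "as", "but", "for", "my", "to", "it", "if"}

def toy_frequency_table(text: str) -> Counter:
    """Count lowercase alphanumeric tokens that are not stop words."""
    # Lowercasing first makes the stop-word check case-insensitive;
    # the regex skips punctuation entirely.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in TOY_STOP_WORDS)

toy_table = toy_frequency_table("My suggestion is to book it. The place is small.")
print(toy_table)
```

The same two ideas (lowercase before the stop-word check, skip non-alphanumeric tokens) could be folded into `_create_frequency_table` while keeping the stemmer.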