# Text Summary
Summarize reviews of New York Airbnb listings.  
  
Trying out __Text Summarization in 5 Steps using NLTK__  
https://becominghuman.ai/text-summarization-in-5-steps-using-nltk-65b21e352b65

In [26]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk import word_tokenize, PorterStemmer, sent_tokenize

## 1. Data Preprocessing
In this step we will:  
1. read in the listings and reviews datasets
2. keep only neighbourhoods with at least {cutoff} listings
3. join each listing's neighbourhood to its reviews
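
The three steps can be sketched on toy data as follows. The two small frames stand in for `listings.csv` and `reviews.csv` and their values are invented; only the column names `id`, `neighbourhood`, `listing_id` and `comments` come from the dataset, and the toy `cutoff` of 3 is a hypothetical value.

```python
import pandas as pd

# Toy stand-ins for listings.csv and reviews.csv (values are invented).
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "neighbourhood": ["Midtown", "Midtown", "Harlem", "Midtown"],
})
reviews = pd.DataFrame({
    "listing_id": [1, 1, 3, 4],
    "comments": ["Great stay.", "Clean room.", "Nice host.", "Good location."],
})

cutoff = 3  # hypothetical threshold for the toy data

# Step 2: keep only neighbourhoods with at least `cutoff` listings.
counts = listings["neighbourhood"].value_counts()
top = counts[counts >= cutoff].index
listings_top = listings[listings["neighbourhood"].isin(top)]

# Step 3: attach each review to the neighbourhood of its listing.
joined = listings_top.merge(
    reviews.rename(columns={"listing_id": "id"}), on="id", how="inner"
)
print(joined)
```

Using `value_counts` plus `isin` is one of several ways to apply the cutoff; the cells below reach the same result with `groupby(...).transform('count')`.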

In [3]:
os.listdir('data')

['calendar.csv', 'listings.csv', 'reviews.csv']

In [110]:
# read in listing2 dataset
print('\nlistings:')
df_listing2 = pd.read_csv('data/listings.csv')
display(df_listing2.head())

print('\nreviews:')
df_review = pd.read_csv('data/reviews.csv')
display(df_review.head())


listings:




Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,https://www.airbnb.com/rooms/2595,20200212052319,2020-02-12,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...","- Spacious (500+ft²), immaculate and nicely fu...","Beautiful, spacious skylit studio in the heart...",none,Centrally located in the heart of Manhattan ju...,...,f,f,strict_14_with_grace_period,t,t,2,2,0,0,0.39
1,3831,https://www.airbnb.com/rooms/3831,20200212052319,2020-02-13,Cozy Entire Floor of Brownstone,Urban retreat: enjoy 500 s.f. floor in 1899 br...,Greetings! We own a double-duplex brownst...,Urban retreat: enjoy 500 s.f. floor in 1899 br...,none,Just the right mix of urban center and local n...,...,f,f,moderate,f,f,1,1,0,0,4.69
2,5099,https://www.airbnb.com/rooms/5099,20200212052319,2020-02-12,Large Cozy 1 BR Apartment In Midtown East,My large 1 bedroom apartment has a true New Yo...,I have a large 1 bedroom apartment centrally l...,My large 1 bedroom apartment has a true New Yo...,none,My neighborhood in Midtown East is called Murr...,...,f,f,moderate,t,t,1,1,0,0,0.59
3,5121,https://www.airbnb.com/rooms/5121,20200212052319,2020-02-12,BlissArtsSpace!,,HELLO EVERYONE AND THANKS FOR VISITING BLISS A...,HELLO EVERYONE AND THANKS FOR VISITING BLISS A...,none,,...,f,f,strict_14_with_grace_period,f,f,1,0,1,0,0.38
4,5178,https://www.airbnb.com/rooms/5178,20200212052319,2020-02-13,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"You will use one large, furnished, private roo...",Please don’t expect the luxury here just a bas...,none,"Theater district, many restaurants around here.",...,f,f,strict_14_with_grace_period,f,f,1,0,1,0,3.53



reviews:


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2595,17857,2009-11-21,50679,Jean,Notre séjour de trois nuits.\r\nNous avons app...
1,2595,19176,2009-12-05,53267,Cate,Great experience.
2,2595,19760,2009-12-10,38960,Anita,I've stayed with my friend at the Midtown Cast...
3,2595,34320,2010-04-09,71130,Kai-Uwe,"We've been staying here for about 9 nights, en..."
4,2595,46312,2010-05-25,117113,Alicia,We had a wonderful stay at Jennifer's charming...


In [111]:
# get count of neighbourhood with at least 200 listings
cutoff = 200
df_nbr_count = df_listing2[['neighbourhood']].reset_index(drop=True)
df_nbr_count['freq'] = df_nbr_count.groupby('neighbourhood')['neighbourhood']\
                                   .transform('count')

# create subset of data based on cutoff
df_top_nbr = df_nbr_count[df_nbr_count['freq'] >= cutoff].drop_duplicates()

# print result
print(f'Number of neighbourhood with at least {cutoff} listings {df_top_nbr.shape[0]}')

Number of neighbourhood with at least 200 listings 37


In [112]:
# create trimmed id, neighbourhood table
df_id_nbr = df_listing2[['id', 'neighbourhood']].drop_duplicates()
df_nbr_trim = df_id_nbr.merge(df_top_nbr, on='neighbourhood', how='inner')

df_nbr_trim = df_nbr_trim[['id', 'neighbourhood']].reset_index(drop=True)

In [113]:
# get comments
df_comments = df_review[['listing_id', 'comments']].reset_index(drop=True)
df_comments = df_comments.rename(columns={'listing_id':'id'})
print('\n comments:')
display(df_comments.head())


 comments:


Unnamed: 0,id,comments
0,2595,Notre séjour de trois nuits.\r\nNous avons app...
1,2595,Great experience.
2,2595,I've stayed with my friend at the Midtown Cast...
3,2595,"We've been staying here for about 9 nights, en..."
4,2595,We had a wonderful stay at Jennifer's charming...


In [114]:
# join id, neighborhood with comments
# since this is our final table, we will call it df
df = df_nbr_trim.merge(df_comments, on='id', how='inner')\
                .drop_duplicates()
df.shape

(1131048, 3)

In [115]:
df.head()

Unnamed: 0,id,neighbourhood,comments
0,2595,Midtown,Notre séjour de trois nuits.\r\nNous avons app...
1,2595,Midtown,Great experience.
2,2595,Midtown,I've stayed with my friend at the Midtown Cast...
3,2595,Midtown,"We've been staying here for about 9 nights, en..."
4,2595,Midtown,We had a wonderful stay at Jennifer's charming...


## 2. EDA
We will do some exploratory data analysis (EDA): look at listing counts per neighbourhood and read a few comments to get a sense of their content and of what a good summary should capture.
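
A minimal sketch of the kind of counts meant here, run on a made-up frame with the same two columns as `df`. The sample rows and the names `sample`, `reviews_per_nbr` and `word_counts` are assumptions for illustration.

```python
import pandas as pd

# Invented sample with the same columns as the joined table `df`.
sample = pd.DataFrame({
    "neighbourhood": ["Midtown", "Midtown", "Harlem"],
    "comments": ["Great experience.", "Very clean and quiet room.", None],
})

# Number of review rows per neighbourhood (empty comments included).
reviews_per_nbr = sample["neighbourhood"].value_counts()

# Rough review length in words; missing comments count as zero words.
word_counts = sample["comments"].fillna("").str.split().str.len()

print(reviews_per_nbr)
print(word_counts.describe())
```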

In [22]:
# change index to inspect
index = 1856
print(df[df['neighbourhood'] == 'Midtown']['comments'][index])

The apartment is very nice and well located. Since the host was not home, we were unfortunately not able to communicate beyond basic necessities. Brian left us clear and detailed instructions about the apartment and the neighborhood, and made sure that our arrival went smoothly.


In [18]:
# listing by neighborhood
df_nbr_count['freq'].describe()

count    51082.000000
mean      4948.452018
std       4732.830392
min          1.000000
25%        578.000000
50%       1924.000000
75%      10167.000000
max      10873.000000
Name: freq, dtype: float64

## 3. Text Summarization
We will concatenate all reviews for the same neighbourhood into one long text, then summarize that text.
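
The concatenation step could look roughly like this. The sample frame is invented, and dropping missing comments before joining is an assumption about how empty reviews should be handled (joining strings fails on `NaN`).

```python
import pandas as pd

# Invented sample with the same columns as `df`.
sample = pd.DataFrame({
    "neighbourhood": ["Midtown", "Harlem", "Midtown", "Harlem"],
    "comments": ["Great stay.", "Nice host.", None, "Quiet street."],
})

# Drop missing comments first, then join the rest per neighbourhood.
combined = (
    sample.dropna(subset=["comments"])
          .groupby("neighbourhood")["comments"]
          .agg(" ".join)
          .reset_index()
)
print(combined)
```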

### 1. Create Word Frequency Table

In [38]:
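def _create_frequency_table(text_string) -> dict:
    """Count how often each stemmed, non-stop-word token occurs in the text."""
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        # note: the stop-word check runs on the stem, so stems such as
        # 'wa' (from 'was') and punctuation tokens are still counted
        word = ps.stem(word)
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    return freqTable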

In [39]:
index = 1873
sample_string = df[df['neighbourhood'] == 'Midtown']['comments'][index]
print(sample_string)

I can't say enough about Brian or his place.  The location was perfect for exploring the city, the apartment was small but very clean and welcoming and Brian himself was a fantastic host.  Due to flight issues we arrived in the middle of the night and he was very friendly and accommodating.  He even had a list of favorite local restaurants, tourist activities, etc. ready for us!  I definitely recommend this air bnb!


In [40]:
freq_table = _create_frequency_table(sample_string)
freq_table

{'I': 2,
 'ca': 1,
 "n't": 1,
 'say': 1,
 'enough': 1,
 'brian': 2,
 'hi': 1,
 'place': 1,
 '.': 4,
 'locat': 1,
 'wa': 4,
 'perfect': 1,
 'explor': 1,
 'citi': 1,
 ',': 3,
 'apart': 1,
 'small': 1,
 'veri': 2,
 'clean': 1,
 'welcom': 1,
 'fantast': 1,
 'host': 1,
 'due': 1,
 'flight': 1,
 'issu': 1,
 'arriv': 1,
 'middl': 1,
 'night': 1,
 'friendli': 1,
 'accommod': 1,
 'He': 1,
 'even': 1,
 'list': 1,
 'favorit': 1,
 'local': 1,
 'restaur': 1,
 'tourist': 1,
 'activ': 1,
 'etc': 1,
 'readi': 1,
 'us': 1,
 '!': 2,
 'definit': 1,
 'recommend': 1,
 'thi': 1,
 'air': 1,
 'bnb': 1}
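
The table above still contains punctuation (`'.'`, `','`, `'!'`), capitalised short words (`'I'`, `'He'`) and stems of stop words (`'wa'`, `'thi'`, `'veri'`, `'hi'`). A possible variant, sketched below, lowercases and filters stop words before stemming. The regex tokenizer, the function name and the `stop_words` and `stem` parameters are assumptions made to keep the sketch self-contained; in this notebook one would presumably pass `set(stopwords.words("english"))` and `PorterStemmer().stem`.

```python
import re
from collections import Counter

def create_frequency_table(text, stop_words, stem=lambda w: w):
    """Count stemmed word tokens, skipping stop words and punctuation."""
    # Lowercase first so the stop-word check is case-insensitive; the regex
    # keeps runs of letters (with internal apostrophes) and so drops
    # punctuation and digits.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    # Filter stop words before stemming, so "was" is removed instead of
    # surviving as the stem "wa".
    return Counter(stem(tok) for tok in tokens if tok not in stop_words)

# Tiny illustrative stop list; NOT the NLTK list.
demo_stop_words = {"i", "the", "was", "and", "he", "a", "very"}
table = create_frequency_table(
    "The host was great. He was very friendly, and the flat was great!",
    demo_stop_words,
)
print(table)
```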

### 2. Tokenize the sentences

In [90]:
sentences = sent_tokenize(sample_string)
sentences

["I can't say enough about Brian or his place.",
 'The location was perfect for exploring the city, the apartment was small but very clean and welcoming and Brian himself was a fantastic host.',
 'Due to flight issues we arrived in the middle of the night and he was very friendly and accommodating.',
 'He even had a list of favorite local restaurants, tourist activities, etc.',
 'ready for us!',
 'I definitely recommend this air bnb!']

### 3. Score the sentences: Term frequency

In [62]:
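def _score_sentences(sentences, freqTable) -> dict:
    """Score each sentence by the table frequencies of the keys it contains,
    divided by its token count. Keys are the first 10 characters of a sentence."""
    sentenceValue = dict()

    for sentence in sentences:
        word_count_in_sentence = (len(word_tokenize(sentence)))
        for wordValue in freqTable:
            # substring match against the lowercased sentence
            if wordValue in sentence.lower():
                if sentence[:10] in sentenceValue:
                    sentenceValue[sentence[:10]] += freqTable[wordValue]
                else:
                    sentenceValue[sentence[:10]] = freqTable[wordValue]

        # skip sentences that matched no key (avoids a KeyError)
        if sentence[:10] in sentenceValue:
            sentenceValue[sentence[:10]] = sentenceValue[sentence[:10]] // word_count_in_sentence

    return sentenceValue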

In [63]:
sentence_scores = _score_sentences(sentences, freq_table)
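
The scorer above keys each sentence by its first 10 characters, so two sentences with the same opening share one entry, and it tests every table key as a substring of the lowercased sentence, so a key such as `'.'` counts toward nearly every sentence. The sketch below is one possible variant, not the method from the article: it keys scores by sentence position and looks up whole tokens. The regex tokenizer, the function name and the `stem` parameter are assumptions.

```python
import re

def score_sentences(sentences, freq_table, stem=lambda w: w):
    """Return {sentence index: average table frequency per token}."""
    scores = {}
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence.lower())
        if not tokens:
            continue  # nothing to score, e.g. a stray "." or "2."
        # Whole-token lookup: a token contributes only if its stem is a key.
        total = sum(freq_table.get(stem(tok), 0) for tok in tokens)
        scores[i] = total / len(tokens)
    return scores

demo_table = {"great": 2, "host": 1, "flat": 1}  # invented frequencies
demo_sentences = ["Great host.", "Great flat, great view.", "2."]
demo_scores = score_sentences(demo_sentences, demo_table)
print(demo_scores)
```

Unlike the original, this counts every occurrence of a token rather than each table key once, and it returns floats instead of floor-divided integers, so its scores are not directly comparable with the ones produced above.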

### 4. Find the threshold

In [64]:
def _find_average_score(sentenceValue) -> int:
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average value of a sentence from original text
    average = int(sumValues / len(sentenceValue))

    return average

In [65]:
threshold = _find_average_score(sentence_scores)

### 5. Generate the summary

In [66]:
def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:10] in sentenceValue and sentenceValue[sentence[:10]] > (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary

In [67]:
summary = _generate_summary(sentences, sentence_scores, 1.5*threshold)

In [68]:
print(summary)

 I can't say enough about Brian or his place. He even had a list of favorite local restaurants, tourist activities, etc. I definitely recommend this air bnb!
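
The summary keeps the sentences whose score exceeds a multiple of the average score. Since `_find_average_score` truncates the average to an integer, small scores can collapse to a threshold of 0 or 1. The sketch below shows the same selection rule with an untruncated mean. It assumes scores keyed by sentence position, which differs from the first-10-characters keys used in this notebook; the function name and demo scores are invented.

```python
from statistics import mean

def select_sentences(sentences, scores, multiplier=1.5):
    """Keep sentences whose score exceeds multiplier * mean score."""
    if not scores:
        return ""
    threshold = multiplier * mean(scores.values())  # float, not truncated
    kept = [s for i, s in enumerate(sentences) if scores.get(i, 0) > threshold]
    return " ".join(kept)

demo_sentences = ["Great host.", "It was fine.", "Great flat, great view."]
demo_scores = {0: 3.0, 1: 0.5, 2: 2.5}  # invented scores, mean is 2.0

loose = select_sentences(demo_sentences, demo_scores, multiplier=1.0)
strict = select_sentences(demo_sentences, demo_scores, multiplier=1.4)
print(loose)
print(strict)
```

Raising the multiplier shortens the summary: at 1.0 two sentences pass the threshold of 2.0, at 1.4 only the top-scoring one passes 2.8.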


In [250]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

def summarize(text, multiplier=1.5):
    # 1 Create the word frequency table
    freq_table = _create_frequency_table(text)

    # 2 Tokenize the sentences
    sentences = sent_tokenize(text)

    # 3 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(sentences, freq_table)

    # 4 Find the threshold
    threshold = _find_average_score(sentence_scores)

    # 5 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, multiplier * threshold)

    return summary

In [183]:
df_top = df[['neighbourhood', 'comments']][:5]
df_tail = df[['neighbourhood', 'comments']][-5:]
df_test = pd.concat([df_top, df_tail]).reset_index(drop=True)
df_test

Unnamed: 0,neighbourhood,comments
0,Midtown,Notre séjour de trois nuits.\r\nNous avons app...
1,Midtown,Great experience.
2,Midtown,I've stayed with my friend at the Midtown Cast...
3,Midtown,"We've been staying here for about 9 nights, en..."
4,Midtown,We had a wonderful stay at Jennifer's charming...
5,East Flatbush,The place was dirty. Adorable kids
6,East Flatbush,L'appartement se situe en sous sol d'une maiso...
7,East Flatbush,Staying at this location exceeded more than my...
8,East Flatbush,"it was easy to get to ,very convenient to wher..."
9,East Flatbush,Shovonte was an amazing host and was super eas...


In [184]:
df_test = df_test.groupby('neighbourhood',as_index=False)\
                 .agg(lambda x:' '.join(x))
df_test

Unnamed: 0,neighbourhood,comments
0,East Flatbush,The place was dirty. Adorable kids L'apparteme...
1,Midtown,Notre séjour de trois nuits.\r\nNous avons app...


In [185]:
summarize(df_test['comments'][1])

" Notre séjour de trois nuits. Agréable, propre et bien soigné. C'est idéal pour une famille de 3 ou 4 personnes. Jennifer est correcte le remboursement de la caution était très rapide."

In [190]:
df_top = df[['neighbourhood', 'comments']][:5]
df_tail = df[['neighbourhood', 'comments']][5:]
df_test = pd.concat([df_top, df_tail]).reset_index(drop=True)
df_test

Unnamed: 0,neighbourhood,comments
0,Midtown,Notre séjour de trois nuits.\r\nNous avons app...
1,Midtown,Great experience.
2,Midtown,I've stayed with my friend at the Midtown Cast...
3,Midtown,"We've been staying here for about 9 nights, en..."
4,Midtown,We had a wonderful stay at Jennifer's charming...
...,...,...
1131043,East Flatbush,The place was dirty. Adorable kids
1131044,East Flatbush,L'appartement se situe en sous sol d'une maiso...
1131045,East Flatbush,Staying at this location exceeded more than my...
1131046,East Flatbush,"it was easy to get to ,very convenient to wher..."


In [282]:
%%time
n_comment = 5000

# one row per neighbourhood: the summary of its first n_comment reviews
rows = []
for item in df['neighbourhood'].drop_duplicates():
    df_temp = df[df['neighbourhood'] == item][:n_comment]

    # join the reviews into one block of text, then summarize it
    text = df_temp['comments'].str.cat(sep='\n')
    rows.append({'neighbourhood': item, 'comments': summarize(text, 4.5)})

# DataFrame.append was removed in pandas 2.0, so build the frame in one call
df_concat = pd.DataFrame(rows, columns=['neighbourhood', 'comments'])

print(df_concat)



         neighbourhood                                           comments
0              Midtown   ! Clean. Thanks. P.S. ! . . . Perfection! Per...
1             Brooklyn   . . Recomendamos este lugar con toda segurida...
2            Manhattan   Great host. Great host. Recomendo...!!!! Grea...
3   Bedford-Stuyvesant   Reasonable rates + private kitchen and bathro...
4      Upper West Side   Wonderful. Quiet. Great host, cozy space\nInc...
5      Lower East Side   2. 3. Thanks. . clean. clean. Location. Locat...
6           Park Slope   Warm welcome, left suitcase in living room wi...
7         Williamsburg   Great apartment in a great area. Well-equippe...
8              Chelsea   Comfortable bedroom. Anyway. Comfortable bed,...
9         East Village   Clean. 2. 3. 2. Thanks. .. Awesome. Huge apar...
10              Harlem   AMAZING. . Recommended! Awesome. P.S. Awesome...
11    Hamilton Heights   Recommended. Recommended if you’re looking fo...
12            Bushwick   Great place i

In [290]:
print(df_concat['comments'][5])

 2. 3. Thanks. . clean. clean. Location. Location. Location. Amazing. Amazing. !. Clean. Location... Location... Perfect. . walk. 2. 3. Thanks. Perfect. Thanks. 1. 2. 1. Clean. Location. Location. . Check. Check. Check. Check. 2. Thanks. etc. . Thanks. Thanks. etc. Thanks. etc. Clean.


In [283]:
df_concat.to_csv('comment_summary.csv', index=False)

In [284]:
os.listdir()

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'comment_summary.csv',
 'data',
 'README.md',
 'sandbox.ipynb']