# NLP: Text Summarization and Keyword Extraction on Property Rental Listings
# Part 1



__A practical implementation of NLP techniques such as text summarization, named entity recognition (NER), topic modeling, and text classification on a property rental listing dataset. This Jupyter Notebook is the code companion to an article published on Medium.__



__Part 1 (this article)__ covers the basics: the goal, the data and its preparation, and the methods used to extract keywords and text summaries with named entity recognition (NER), sentence scoring, and Google's T5 (Text-to-Text Transfer Transformer). We'll also touch on leveraging these insights to improve user experience - serving suggestions included.

__Part 2__ will demonstrate how to perform topic modeling on unlabeled data.




Daniel Kristiyanto  
Bali+Singapore, Summer 2024

https://medium.com/@kristiyanto_



In [1]:
import pandas as pd
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
import requests
from io import BytesIO
from spacy import displacy

try:
    # Read the cached copy if it exists; otherwise download the data
    data = pd.read_pickle('data.pkl')
except FileNotFoundError:
    url = "https://data.insideairbnb.com/japan/kantō/tokyo/2024-03-30/data/listings.csv.gz"
    response = requests.get(url)
    response.raise_for_status()  # Fail early instead of continuing with `data` undefined
    data = pd.read_csv(BytesIO(response.content), compression='gzip')

    data = data[['id', 'description']].dropna().astype(str)
    data.to_pickle('data.pkl')

# Sample text for demonstration
sample_text = "This apartment is located in the center of Tokyo, it is an apartment building located in the bustling commercial street. Because each floor is a single-story apartment, it will not affect everyone's respective rest. There are 2 toilets and 2 bathrooms that can be used separately by men and women to avoid everyone having to take turns when bathing and affect rest for a long time. The nearest station is Higashi-Shinjuku station and Shin-Okubo, each walk only takes 3 minutes and 7 minutes, taking advantage of two subway and Yamanote lines, very convenient to go anywhere.Shinjuku Kabukicho, 3 minutes on foot to Isetan, Odakyu Department Store, Large Electronics Store Labi, Teruma Natural Hot Springs, etc. are also only about 10 minutes away. Downstairs is 1-km long commercial street with food and beverage, department stores, drug and health products stores, supermarkets, 24 hours convenience stores, etc. 24-hour cheap duty-free shop, Don Quixote, 5 minutes on foot, ready for shopping and d"
data.describe()

| | id | description |
|---|---|---|
| count | 14588 | 14588 |
| unique | 14588 | 8845 |
| top | 890199509573985394 | `* Tokyo E joy Inn Higashi Nippori<br />-------...` |
| freq | 1 | 56 |


# 1. Data Preparation

In [2]:
def detect_language(text):
    nlp = spacy.blank('xx')  
    doc = nlp(text)
    has_japanese = False
    has_english = False
    
    for token in doc:
        if any('\u3040' <= char <= '\u30ff' or '\u4e00' <= char <= '\u9fff' for char in token.text):
            has_japanese = True
        # A-Z is U+0041-U+005A and a-z is U+0061-U+007A; the punctuation in between is excluded
        elif any('\u0041' <= char <= '\u005A' or '\u0061' <= char <= '\u007A' for char in token.text):
            has_english = True
            
    if has_japanese and not has_english:
        return "Japanese"
    elif has_english and not has_japanese:
        return "English"
    elif has_japanese and has_english:
        return "Mixed"
    else:
        return "Unknown"

if 'language' not in data.columns:
    data['language'] = data['description'].apply(detect_language)
    data.to_pickle('data.pkl')
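
The range check above can be tried in isolation. The following is a minimal sketch that applies the same Unicode ranges character by character, without spaCy; `classify_script` is a hypothetical helper name, not part of the notebook. One assumption to keep in mind: the CJK ideograph range is shared with Chinese, so "Japanese" here really means "contains kana or CJK characters".

```python
def classify_script(text: str) -> str:
    """Classify text as Japanese, English, Mixed, or Unknown by character ranges."""

    def is_japanese(ch: str) -> bool:
        # Hiragana and Katakana (U+3040-U+30FF), CJK Unified Ideographs (U+4E00-U+9FFF)
        return '\u3040' <= ch <= '\u30ff' or '\u4e00' <= ch <= '\u9fff'

    def is_latin(ch: str) -> bool:
        # Basic Latin letters only; digits and punctuation are ignored
        return 'A' <= ch <= 'Z' or 'a' <= ch <= 'z'

    has_japanese = any(is_japanese(ch) for ch in text)
    has_english = any(is_latin(ch) for ch in text)

    if has_japanese and has_english:
        return "Mixed"
    if has_japanese:
        return "Japanese"
    if has_english:
        return "English"
    return "Unknown"


for sample in ["Cozy room near Shinjuku", "新宿駅から徒歩3分", "Welcome! ようこそ", "12345 !!!"]:
    print(sample, "->", classify_script(sample))
```

Because this version looks at every character rather than one flag per token, a string that glues Japanese and Latin characters together without spaces is still reported as "Mixed".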

In [3]:
def data_cleaning(data):
    # Remove HTML artifacts
    data['description'] = data['description'].str.replace(r'<.*?>', '', regex=True)
    data['description'] = data['description'].str.replace(r'&nbsp;', ' ', regex=True)
    data['description'] = data['description'].str.replace(r'&lt;', '<', regex=True)
    data['description'] = data['description'].str.replace(r'&gt;', '>', regex=True)

    # Remove non 'English' language listings
    row_count = data.shape[0]

    clean_data = data[data['language'].isin(['English', 'Mixed'])].copy()
    discarded_data = data[~data.index.isin(clean_data.index)]

    clean_data.drop_duplicates(subset='description', inplace=True)
    clean_data = clean_data[clean_data['description'].notnull()]

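    # Note: this count also includes duplicate descriptions dropped above, not only non-English listings
    print(f"Removed {row_count - clean_data.shape[0]} non-English listings")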
    return clean_data, discarded_data

data, discarded_data = data_cleaning(data)
if discarded_data.shape[0] > 0:
    discarded_data.to_csv('discarded_data.csv', index=False)
    data.to_pickle('data.pkl')
    
data.language.value_counts()

Removed 5804 non-English listings


language
English    8182
Mixed       602
Name: count, dtype: int64
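
The cleaning step removes tags by replacing them with an empty string, which can fuse the words on either side of a `<br />` (the sample text contains "anywhere.Shinjuku", which may be one such case). As an alternative sketch, assuming only the standard library, tags can be replaced with a space and all HTML entities decoded in one pass with `html.unescape`. The helper name `strip_html` is hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_html(text: str) -> str:
    """Remove HTML tags, decode entities, and collapse whitespace."""
    # Replace each tag with a space so neighbouring words are not fused
    no_tags = TAG_RE.sub(" ", text)
    # Decode entities such as &nbsp;, &lt;, &gt;, and &amp;
    decoded = html.unescape(no_tags)
    # Collapse runs of whitespace (including non-breaking spaces) into one space
    return WHITESPACE_RE.sub(" ", decoded).strip()


print(strip_html("Cozy&nbsp;room<br />near&amp;station"))
```

On a DataFrame this could be applied with `data['description'].apply(strip_html)` in place of the chained `str.replace` calls.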

In [4]:

nlp = spacy.load("en_core_web_sm")
@Language.component("custom_lemma_component")
def custom_lemma_component(doc):
    custom_lemmas = {
        "br": "bedroom",
        "apt": "apartment",
        "st": "street",
        "min": "minute",
        "w/": "with",
    }
    for token in doc:
        lower_text = token.text.lower()
        if lower_text in custom_lemmas:
            token.lemma_ = custom_lemmas[lower_text]
    return doc

def extract_tokens(text):
    doc = nlp(text)
    tokens = [token.lemma_.lower().strip() for token in doc if not token.is_stop and token.is_ascii]
    return tokens

nlp.add_pipe('custom_lemma_component', after='tagger')

<function __main__.custom_lemma_component(doc)>
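
To make the effect of the lookup table concrete, here is a minimal sketch of the same abbreviation mapping applied to a plain list of tokens, without spaCy. `normalize_tokens` is a hypothetical helper used only for illustration. Note that in this notebook the overrides affect `token.lemma_` (used by `extract_tokens`), while `extract_keywords` works on `span.text`, so abbreviations such as "min" can still appear in the keyword output.

```python
CUSTOM_LEMMAS = {
    "br": "bedroom",
    "apt": "apartment",
    "st": "street",
    "min": "minute",
    "w/": "with",
}


def normalize_tokens(tokens):
    """Lower-case each token and expand known listing abbreviations."""
    return [CUSTOM_LEMMAS.get(token.lower(), token.lower()) for token in tokens]


print(normalize_tokens(["2", "BR", "apt", "w/", "balcony", "5", "min", "walk"]))
```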

# 2. Keyword Extraction using Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging

In [5]:
def extract_keywords(text, max_keywords=10):
    doc = nlp(text)
    matcher = Matcher(nlp.vocab)

    # Noun and Noun Phrases
    noun_phrases_patterns = [
        [{'POS': 'NUM'}, {'POS': 'NOUN'}], #example: 2 bedrooms
        [{'POS': 'ADJ', 'OP': '*'}, {'POS': 'NOUN'}], #example: beautiful house
        [{'POS': 'NOUN', 'OP': '+'}], #example: house
    ]

    # Geo-political entity
    gpe_patterns = [
        [{'ENT_TYPE': 'GPE'}], #example: Tokyo
    ]

    # Location
    loc_patterns = [
        [{'ENT_TYPE': 'LOC'}], #example: downtown
    ]

    # Facility
    fac_patterns = [
        [{'ENT_TYPE': 'FAC'}], #example: airport
    ]

    # Proximity
    proximity_patterns = [
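        # The optional last token means any ADJ + ADP pair also matches (e.g. "such as", "more than")
        [{'POS': 'ADJ'}, {'POS': 'ADP'}, {'POS': 'NOUN', 'ENT_TYPE': 'FAC', 'OP': '?'}], # intended example: close to airport
        [{'POS': 'ADJ'}, {'POS': 'ADP'}, {'POS': 'PROPN', 'ENT_TYPE': 'FAC', 'OP': '?'}] # intended example: close to Narita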
    ]

    for entity, patterns in zip(['NOUN_PHRASE', 'GPE', 'LOC', 'FAC', "PROXIMITY"], 
                                [noun_phrases_patterns, gpe_patterns, loc_patterns, 
                                 fac_patterns, proximity_patterns]):
        
        matcher.add(entity, patterns)

    matches = matcher(doc)
    keywords = []
    for match_id, start, end in matches:
        span = doc[start:end]
        match_label = nlp.vocab.strings[match_id]
        keywords.append((match_label, span.text.strip().lower()))

    keyword_freq = {}
    for keyword in keywords:
        keyword_freq[keyword] = keyword_freq.get(keyword, 0) + 1
    
    keywords = sorted(keyword_freq, key=keyword_freq.get, reverse=True)
    return keywords[:max_keywords]


extract_keywords(sample_text)

[('NOUN_PHRASE', 'minutes'),
 ('NOUN_PHRASE', 'apartment'),
 ('NOUN_PHRASE', 'stores'),
 ('NOUN_PHRASE', 'commercial street'),
 ('NOUN_PHRASE', 'street'),
 ('NOUN_PHRASE', 'rest'),
 ('NOUN_PHRASE', 'station'),
 ('NOUN_PHRASE', '3 minutes'),
 ('NOUN_PHRASE', 'foot'),
 ('NOUN_PHRASE', 'center')]
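
The output above contains overlapping matches: both "street" and "commercial street", and both "minutes" and "3 minutes". If that redundancy is unwanted, one option is to drop any keyword that is fully contained in a longer one. The sketch below is an optional post-processing step, not part of the original pipeline, and `drop_nested_keywords` is a hypothetical name.

```python
def drop_nested_keywords(keywords):
    """Drop keywords whose text appears, on word boundaries, inside a longer keyword."""
    kept = []
    for label, text in keywords:
        # Pad with spaces so "rest" does not count as nested in "street"
        nested = any(
            other != text and f" {text} " in f" {other} "
            for _, other in keywords
        )
        if not nested:
            kept.append((label, text))
    return kept


keywords = [
    ('NOUN_PHRASE', 'minutes'),
    ('NOUN_PHRASE', 'apartment'),
    ('NOUN_PHRASE', 'stores'),
    ('NOUN_PHRASE', 'commercial street'),
    ('NOUN_PHRASE', 'street'),
    ('NOUN_PHRASE', 'rest'),
    ('NOUN_PHRASE', 'station'),
    ('NOUN_PHRASE', '3 minutes'),
    ('NOUN_PHRASE', 'foot'),
    ('NOUN_PHRASE', 'center'),
]
filtered = drop_nested_keywords(keywords)
print(filtered)
```

Whether the shorter or the longer phrase is the better keyword depends on the use case, so this filter is a design choice rather than a correction.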

In [6]:
displacy.render(nlp(sample_text), style="ent", jupyter=True)  # Visualize the entities

In [7]:
if "keywords" not in data.columns:  
    data['keywords'] = data['description'].apply(extract_keywords)
    data.to_pickle('data.pkl')
    nlp.to_disk('property_rental_pipeline')
    
data

| | id | description | language | keywords |
|---|---|---|---|---|
| 0 | 890199509573985394 | You'll have a great time at this comfortable p... | English | [(NOUN_PHRASE, great time), (NOUN_PHRASE, time... |
| 1 | 31868772 | We have Free wifi/washing machine/refrigerator... | English | [(NOUN_PHRASE, walk), (NOUN_PHRASE, kitchen), ... |
| 2 | 24378724 | `****Room No. 103****If you will stayed with ki...` | English | [(NOUN_PHRASE, room), (NOUN_PHRASE, station), ... |
| 3 | 1060499710482458044 | [Benefits in this property]1. Sobu Line, Hanzo... | English | [(PROXIMITY, such as), (PROXIMITY, more than),... |
| 5 | 42366410 | Hotel Wing International Premium Tokyo Yotsuya... | English | [(NOUN_PHRASE, minute), (NOUN_PHRASE, walk), (... |
| ... | ... | ... | ... | ... |
| 14791 | 678769871979608297 | 【About six minutes on foot from Sangenjaya Sta... | English | [(NOUN_PHRASE, people), (NOUN_PHRASE, six minu... |
| 14792 | 42318267 | Welcome to our home near Insect Park, just a 1... | English | [(NOUN_PHRASE, minutes), (NOUN_PHRASE, room), ... |
| 14794 | 811011650992939510 | Enjoy a quiet Setagaya Kodo in a small apartment. | English | [(NOUN_PHRASE, small apartment), (NOUN_PHRASE,... |
| 14797 | 707705614523948141 | Women's floors, upper beds【Woman only floor , ... | English | [(NOUN_PHRASE, floors), (NOUN_PHRASE, upper be... |


# 3a. Text Summarization using Keyword-Based Sentence Scoring

In [8]:
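def summarize(row, char_limit=80):
    """Build an extractive summary from a row with `description` and `keywords`.

    Sentences are ranked by how many keyword occurrences they contain. The
    top-ranked sentence is always kept, even if it is longer than `char_limit`.
    """
    doc = nlp(row.description)
    sentences = [sent.text.strip() for sent in doc.sents]
    keywords = [keyword[1] for keyword in row.keywords]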

    if not keywords or not sentences:
        return ""
    
    scores = []
    for sentence in sentences:
        score = 0
        for keyword in keywords:
            score += sentence.lower().count(keyword.lower())  # Ensure case-insensitive matching
        scores.append(score)
    
    sorted_sentences = [sent for _, sent in sorted(zip(scores, sentences), reverse=True)]
    
    output = ""
    for i, sentence in enumerate(sorted_sentences):
        if i == 0:
            output += sentence + " "
            continue
        elif (len(output) + len(sentence) + 1 <= char_limit):  
            output += sentence + " "
        else:
            break 
    
    return output.strip()  

pd.DataFrame({"description": [sample_text], "keywords": [extract_keywords(sample_text)]}).apply(summarize, axis=1).values

array(['Downstairs is 1-km long commercial street with food and beverage, department stores, drug and health products stores, supermarkets, 24 hours convenience stores, etc. 24-hour cheap duty-free shop, Don Quixote, 5 minutes on foot, ready for shopping and d'],
      dtype=object)
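
The `summarize` function above ranks sentences by raw keyword counts, so a keyword that appears in nearly every sentence counts as much as a rare one. A TF-IDF-style weighting is one way to favour the more distinctive keywords. The sketch below is a possible variant, not the method used in this notebook: it treats each sentence as a "document", uses a smoothed inverse sentence frequency, and relies only on the standard library. `score_sentences` is a hypothetical name.

```python
import math


def score_sentences(sentences, keywords):
    """Score each sentence by keyword counts weighted by inverse sentence frequency."""
    lowered = [sentence.lower() for sentence in sentences]
    keywords = [keyword.lower() for keyword in keywords]
    n = len(lowered)

    # Smoothed IDF: keywords found in fewer sentences receive a higher weight
    idf = {}
    for keyword in keywords:
        sentence_freq = sum(1 for sentence in lowered if keyword in sentence)
        idf[keyword] = math.log((1 + n) / (1 + sentence_freq)) + 1

    # Term frequency is the raw substring count, as in the notebook's scoring
    return [
        sum(sentence.count(keyword) * idf[keyword] for keyword in keywords)
        for sentence in lowered
    ]


sentences = [
    "The apartment is near the station.",
    "The station is 3 minutes away.",
    "A quiet apartment with a balcony and a garden view.",
]
scores = score_sentences(sentences, ["station", "balcony"])
best_sentence = sentences[scores.index(max(scores))]
print(best_sentence)
```

With raw counts all three sentences would tie at one keyword hit each; with the weighting, the sentence containing the rarer keyword ranks first.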

In [9]:
if 'summary' not in data.columns:
    data['summary'] = data.apply(summarize, axis=1)
    data.to_pickle('data.pkl')
    nlp.to_disk('property_rental_pipeline')

data

| | id | description | language | keywords | summary |
|---|---|---|---|---|---|
| 0 | 890199509573985394 | You'll have a great time at this comfortable p... | English | [(NOUN_PHRASE, great time), (NOUN_PHRASE, time... | You'll have a great time at this comfortable p... |
| 1 | 31868772 | We have Free wifi/washing machine/refrigerator... | English | [(NOUN_PHRASE, walk), (NOUN_PHRASE, kitchen), ... | We have Free wifi/washing machine/refrigerator... |
| 2 | 24378724 | `****Room No. 103****If you will stayed with ki...` | English | [(NOUN_PHRASE, room), (NOUN_PHRASE, station), ... | This room is the best for family and group!!Th... |
| 3 | 1060499710482458044 | [Benefits in this property]1. Sobu Line, Hanzo... | English | [(PROXIMITY, such as), (PROXIMITY, more than),... | Sobu Line, Hanzomon Line Kinshicho station 6 m... |
| 5 | 42366410 | Hotel Wing International Premium Tokyo Yotsuya... | English | [(NOUN_PHRASE, minute), (NOUN_PHRASE, walk), (... | All rooms have a flat-screen satellite TV, ref... |
| ... | ... | ... | ... | ... | ... |
| 14791 | 678769871979608297 | 【About six minutes on foot from Sangenjaya Sta... | English | [(NOUN_PHRASE, people), (NOUN_PHRASE, six minu... | 【About six minutes on foot from Sangenjaya Sta... |
| 14792 | 42318267 | Welcome to our home near Insect Park, just a 1... | English | [(NOUN_PHRASE, minutes), (NOUN_PHRASE, room), ... | Welcome to our home near Insect Park, just a 1... |
| 14794 | 811011650992939510 | Enjoy a quiet Setagaya Kodo in a small apartment. | English | [(NOUN_PHRASE, small apartment), (NOUN_PHRASE,... | Enjoy a quiet Setagaya Kodo in a small apartment. |
| 14797 | 707705614523948141 | Women's floors, upper beds【Woman only floor , ... | English | [(NOUN_PHRASE, floors), (NOUN_PHRASE, upper be... | Women's floors, upper beds【Woman only floor , ... |


# 3b. Text Summarization using T5

In [32]:
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small" # "t5-base" or "t5-large" can also be used
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_name)

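def summarize_with_t5(text, prompt="summarize: ", max_length=80):
    # Short texts are returned unchanged. Note that len(text) counts characters,
    # while max_length below is a token limit for the generated summary.
    if len(text) < max_length:
        return text

    input_text = prompt + text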

    input_token = tokenizer.encode(input_text, return_tensors="pt", truncation=True)   
    summary_ids = model.generate(input_token, max_length=max_length, min_length=10, length_penalty=0.5, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [34]:
summarize_with_t5(sample_text, 'summarize: ')

'each floor is a single-story apartment building located in the bustling commercial street. the nearest station is Higashi-Shinjuku station and Shin-Okubo, each walk only takes 3 minutes and 7 minutes.'

In [30]:
summarize_with_t5(sample_text, 'make it sound like a short tagline: ')

'a single-story apartment is located in the center of Tokyo. the apartment has 2 toilets and 2 bathrooms that can be used separately by men and women. the nearest station is Higashi-Shinjuku station and Shin-Okubo station.'

In [27]:
summarize_with_t5(sample_text, 'pick the best sentence: ')

"apartment is located in the center of Tokyo, it is an apartment building located in the bustling commercial street. Because each floor is a single-story apartment, it will not affect everyone's respective rest."

In [33]:
summarize_with_t5(sample_text, 'based on following information, where is the apartment?: ')

'is located in the center of Tokyo, it is an apartment building located in the bustling commercial street. This apartment is located in the center of Tokyo, it is an apartment building located in the bustling commercial street.'

In [35]:
# Depending on computing power and the model selected, this can take a while (minutes to hours to overnight!)
def process_batch(batch):
    return [summarize_with_t5(text) for text in batch]

if "summary_t5" not in data.columns:
    # Summaries are computed once, in parallel, below

    # Split data into batches
    num_cores = 4  # Adjust based on your machine's capabilities
    batch_size = int(np.ceil(len(data) / num_cores))
    batches = [data['description'][i:i + batch_size] for i in range(0, data.shape[0], batch_size)]

    # Process batches in parallel
    with ProcessPoolExecutor(max_workers=num_cores) as executor:
        results = list(executor.map(process_batch, batches))

    # Flatten the list of results and add them to the DataFrame
    data['summary_t5'] = [summary for batch in results for summary in batch]
    data.to_pickle('data_enriched_part1.pkl')
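
The split-map-flatten pattern in the cell above can be checked on a small scale without loading a model. The sketch below uses a stand-in summarizer (`fake_summarize`, which simply truncates the text) and a thread pool instead of a process pool so that it runs anywhere; both are assumptions made for illustration. It shows that `executor.map` returns results in input order, so the flattened list lines up with the original rows.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, num_batches):
    """Split a list into at most `num_batches` contiguous batches."""
    if not items:
        return []
    batch_size = math.ceil(len(items) / num_batches)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_summarize(text):
    # Stand-in for summarize_with_t5: keep the first 10 characters
    return text[:10]


def process_batch(batch):
    return [fake_summarize(text) for text in batch]


texts = [f"listing number {i} with details" for i in range(10)]
batches = split_into_batches(texts, 4)

# executor.map yields results in the same order as the input batches
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_batch, batches))

# Flatten the per-batch results back into one list aligned with `texts`
summaries = [summary for batch in results for summary in batch]
print(len(batches), len(summaries))
```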

In [39]:
data.sample(100)

| | id | description | language | keywords | summary | summary_t5 |
|---|---|---|---|---|---|---|
| 5186 | 1104144168592121299 | It is very convenient to use the Yamanote line... | English | [(NOUN_PHRASE, sofa), (NOUN_PHRASE, double sof... | The double sofa bed is arranged in the shape o... | it is very convenient to use the Yamanote line... |
| 8835 | 740964517941808733 | First of all, thank you for your interest on A... | English | [(GPE, tokyo), (NOUN_PHRASE, minutes), (NOUN_P... | First of all, thank you for your interest on A... | my house is located in Kita-ku, Tokyo, about 1... |
| 10856 | 41484548 | The maisonette room in the building 5F, 6F, Ro... | English | [(NOUN_PHRASE, room), (NOUN_PHRASE, karaoke), ... | The maisonette room in the building 5F, 6F, Ro... | the maisonette room in the building 5F, 6F, Ro... |
| 1027 | 42060717 | I am a Superhost and strive to give my guests ... | Mixed | [(NOUN_PHRASE, min), (NOUN_PHRASE, walk), (NOU... | Heart of SHINJUKU, 4-min walk to Shinjuku-Sanc... | this listing is 100% licensed and legal.Heart ... |
| 2947 | 28090881 | Apartment hotel "Shin"Only 3-min walk to Subwa... | Mixed | [(NOUN_PHRASE, min), (NOUN_PHRASE, walk), (NOU... | Apartment hotel "Shin"Only 3-min walk to Subwa... | the apartment offers a 2-bedroom suite with 4 ... |
| ... | ... | ... | ... | ... | ... | ... |
| 5657 | 965766457139741853 | This is a stylish stylish old house-style room... | English | [(NOUN_PHRASE, mins), (NOUN_PHRASE, train), (N... | The large space has all the kitchen, toilet, b... | the room has a 140cm double bed and a set of d... |
| 2871 | 39142570 | 7 min to Osaki Station on JR Yamanote Line & R... | English | [(NOUN_PHRASE, 7 min), (NOUN_PHRASE, min), (GP... | Shibuya(6min), Shinagawa(3min), Shinjuku, Hara... | Wi-Fi or Mobile Wi-Fi can be used for free. |
| 9016 | 1026450563795155365 | English/Chinese/Japanese OKAbout 6 minutes on ... | English | [(NOUN_PHRASE, minutes), (NOUN_PHRASE, apartme... | English/Chinese/Japanese OKAbout 6 minutes on ... | Japanese/chinese OKAbout 6 minutes on foot fro... |
| 4635 | 790213498190844856 | Renovated and decorated like new. All furnitur... | English | [(NOUN_PHRASE, furniture), (NOUN_PHRASE, brand... | All furniture, appliances, beds, etc. are bran... | all furniture, appliances, beds, etc. are bran... |
