# 2. Data Preparation

## 2.1 Merging and cleaning the data

To keep things simple, I decided not to fight the inaccurate labeling and instead just merge the unique examples and supplement them with generated data.

In [45]:
import pandas as pd
from pathlib import Path

RAW_DATA_DIR = Path("../raw_data")
PREPROCESSED_DATA_DIR = Path("../preprocessed_data")

rates_clean = pd.read_csv(RAW_DATA_DIR / "rates_clean.csv")
rates_dirty = pd.read_csv(RAW_DATA_DIR / "rates_dirty.csv")

In [46]:
unique_clean = rates_clean.fillna("undefined").drop_duplicates()
unique_dirty = rates_dirty.fillna("undefined").drop_duplicates()
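# Drop rows whose "class" value is the literal string "class"
# (most likely header rows repeated inside the file).
unique_dirty = unique_dirty[unique_dirty["class"] != "class"]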

In [47]:
unique_merged = (
    pd.concat([unique_clean, unique_dirty])
    .drop(columns=["floor", "bedrooms"])
    .replace("twin/twin-or-double", "double/double-or-twin")
)
print(unique_merged.shape)
print(unique_merged.drop_duplicates().shape)

(76859, 9)
(75051, 9)


There are duplicates between rates_clean and rates_dirty. Before removing them, let's make sure they carry identical labels:

In [48]:
duplicates_by_name = unique_merged[unique_merged.duplicated(subset=["rate_name"], keep=False)]
full_duplicates = duplicates_by_name[duplicates_by_name.duplicated(keep=False)]

print(duplicates_by_name.shape, full_duplicates.shape)
duplicates_by_name.compare(full_duplicates)

(3616, 9) (3616, 9)
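
A complementary way to look for conflicting labels is to drop exact duplicates first and then see whether any `rate_name` still occurs more than once. The sketch below runs on a small hypothetical frame (`toy` and its rows are illustrative, not taken from the real files) and is only one possible way to phrase the check:

```python
import pandas as pd

# Hypothetical stand-in for unique_merged; the real frame has more label columns.
toy = pd.DataFrame(
    {
        "rate_name": ["Sextuple Room", "Sextuple Room", "Platinum Suite", "Platinum Suite"],
        "class": ["room", "room", "suite", "room"],
        "capacity": ["sextuple", "sextuple", "undefined", "undefined"],
    }
)

# After exact duplicates are gone, a rate_name that still repeats
# must carry at least two different label combinations.
deduplicated = toy.drop_duplicates()
conflicting = deduplicated[deduplicated.duplicated(subset=["rate_name"], keep=False)]

print(conflicting)
```

An empty `conflicting` frame would mean that every repeated `rate_name` has a single, consistent set of labels.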


In [49]:
unique_final = unique_merged.drop_duplicates()
print(unique_final.shape)
unique_final.to_csv(PREPROCESSED_DATA_DIR / "rates_merged_cleaned.csv", index=False)

(75051, 9)


In [50]:
unique_final.head()

Unnamed: 0,rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
0,deluxe triple room,room,deluxe,private bathroom,undefined,triple,not club,no balcony,undefined
1,Premium Two Queen Room with Living Lrea\tHigh ...,room,premium,private bathroom,undefined,double,not club,no balcony,undefined
2,Premier Premier Seaview Sea,room,premier,private bathroom,undefined,undefined,not club,no balcony,sea view
3,Queen room (1 Queen Bed),room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
4,Studio - Pool access Studio - Pool access pool,studio,undefined,private bathroom,undefined,undefined,not club,no balcony,undefined


## 2.2 Дополнение данных

First, let's recall which classes we are short of:

In [51]:
from ydata_profiling import ProfileReport

ProfileReport(unique_final).to_notebook_iframe()


In [53]:
unique_final["class"].value_counts().to_frame()

class,count
room,57361
suite,6260
apartment,3300
studio,2910
junior-suite,1990
dorm,1311
villa,956
bungalow,337
cottage,286
capsule,141


I want to generate examples for the "camping" class. For few-shot prompting I use several random examples plus records of the "tent" class, since such records should resemble "camping".

In [62]:
few_shot_sample = pd.concat(
    [
        unique_final[unique_final["class"] == "tent"].sample(5),
        unique_final.sample(5),
    ]
).sort_values("rate_name")
few_shot_sample

Unnamed: 0,rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
1881,Double room - Glamping Double room - Glamping,tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
176647,"Double safari-tent ""Sirenevyy zakat"" double",tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
666,Family room - Terrace Family room - Terrace,room,standard,private bathroom,undefined,undefined,not club,no balcony,undefined
133345,"Safari with 3 Twin Beds Tent - Shared Bath,Non...",tent,undefined,shared bathroom,undefined,undefined,not club,no balcony,undefined
122806,Standard Double Room (SensBase) (1 Queen Bed),room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
6555,Standard GLAMPING TWIN,tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
139112,"Standard Single, Single or Double/Twin Bed",room,standard,private bathroom,double/double-or-twin,single,not club,no balcony,undefined
44586,Swiss Advantage Twin Twin room,room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
176945,Tent standard with double bed Lesnoe uedinenie...,tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
76113,small double (1 Double),room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined


In [63]:
few_shot_sample.to_csv(PREPROCESSED_DATA_DIR / "few_shot/few_shot_for_camping.csv", index=False)
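
The sampling above is not seeded, so a rerun yields a different few-shot set. A reproducible variant might look like the sketch below. The helper `build_few_shot`, its parameters and the toy frame are hypothetical, and the helper assumes a unique index (for example after `reset_index(drop=True)`):

```python
import pandas as pd

# Toy frame for illustration; only two columns are kept.
toy = pd.DataFrame(
    {
        "rate_name": [
            "Standard GLAMPING TWIN",
            "Double room - Glamping",
            "Safari tent",
            "Sextuple Room",
            "Quintuple Room",
            "Platinum Suite",
        ],
        "class": ["tent", "tent", "tent", "room", "room", "suite"],
    }
)


def build_few_shot(df, column, value, n_target, n_random, seed=0):
    """Sample rows with df[column] == value plus random other rows, reproducibly."""
    target = df[df[column] == value]
    target_part = target.sample(min(n_target, len(target)), random_state=seed)
    # Exclude already chosen rows so that the two parts never overlap.
    rest = df.drop(target_part.index)
    random_part = rest.sample(min(n_random, len(rest)), random_state=seed)
    return pd.concat([target_part, random_part]).sort_values("rate_name")


sample = build_few_shot(toy, "class", "tent", n_target=2, n_random=2)
csv_text = sample.to_csv(index=False)  # text that can be pasted into a prompt
print(csv_text)
```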

I give ChatGPT 4o the following prompt:


Below is a sample from csv file with hotel rates (room tariffs) and attributes extracted from descriptions:
```
rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
Double room - Glamping Double room - Glamping,tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
"Double safari-tent ""Sirenevyy zakat"" double ",tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
Family room - Terrace Family room - Terrace,room,standard,private bathroom,undefined,undefined,not club,no balcony,undefined
"Safari with 3 Twin Beds Tent - Shared Bath,NonSmoking",tent,undefined,shared bathroom,undefined,undefined,not club,no balcony,undefined
Standard Double Room (SensBase) (1 Queen Bed),room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
Standard GLAMPING TWIN,tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
"Standard Single, Single or Double/Twin Bed",room,standard,private bathroom,double/double-or-twin,single,not club,no balcony,undefined
Swiss Advantage Twin Twin room,room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
Tent standard with double bed Lesnoe uedinenie double  in block,tent,undefined,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
small double (1 Double),room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
```
Here is column descriptions:
- rate_name - string with hotel room description in English, no more than 200 symbols,
- class - hotel room type, one of 'room, studio, suite, junior-suite, apartment, villa, dorm, cottage, bungalow, tent, capsule, chalet, run-of-house, camping',
- quality - hotel room class, one of 'deluxe, premium, premier, standard, undefined, comfort, economy, grand, classic, executive, superior, presidential, royal, luxury, privilege, business, prestige, ambassador, platinum',
- bathroom - hotel bathroom type, one of 'private bathroom, shared bathroom, undefined, external private bathroom',
- capacity - hotel room capacity, one of 'triple, double, undefined, single, quadruple, sextuple, quintuple',
- bedrooms - number of rooms in a room, one of 'undefined, 1 bedroom, 3 bedrooms, 2 bedrooms, 4 bedrooms, 5 bedrooms, 6 bedrooms',
- club - flag whether room is a club type, one of 'club, not club',
- balcony - flag whether room is having a balcony, one of 'balcony, no balcony',
- view - view from the hotel room, one of 'undefined, sea view, water view, partial-sea view, pool view, city view, harbour view, beachfront, with view, ocean view, ocean front, garden view, sea front, bosphorus view, bay view, partial view, sunrise view, mountain view, burj-khalifa view, partial-ocean view, sunset view, courtyard view, panoramic view, golf view, park view, land view, sheikh-zayed view, inland view, lake view, street view, river view, canal view, kremlin view, dubai-marina view'.

Your task is to generate a csv with 15 records with "class" equals to "camping". Generate "rate_name" according to "class". Fill in other fields according to "rate_name". Output only final csv.

This is the result I get:

```
rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
Camping Tent - Deluxe Private Bath,camping,deluxe,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
"Camping Tent - Standard, 1 Double Bed",camping,standard,shared bathroom,double/double-or-twin,double,not club,no balcony,undefined
"Camping Cabin - Comfort, Shared Bathroom",camping,comfort,shared bathroom,double/double-or-twin,double,not club,no balcony,undefined
Luxury Camping Tent - Private Bath, Sea View,camping,luxury,private bathroom,double/double-or-twin,double,not club,no balcony,sea view
Camping Lodge - Premium with Shared Bathroom,camping,premium,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
"Camping Tent - Economy, Shared Bath",camping,economy,shared bathroom,double/double-or-twin,double,not club,no balcony,undefined
Family Camping Cabin - Private Bath,camping,standard,private bathroom,double/double-or-twin,quadruple,not club,no balcony,undefined
"Camping Safari Tent - Deluxe, Garden View",camping,deluxe,shared bathroom,double/double-or-twin,double,not club,no balcony,garden view
Standard Camping Tent - External Private Bath,camping,standard,external private bathroom,double/double-or-twin,double,not club,no balcony,undefined
"Camping Tent - Ambassador, Private Bath",camping,ambassador,private bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
"Camping Tent - Royal, Sea Front",camping,royal,private bathroom,double/double-or-twin,double,not club,no balcony,sea front
"Camping Cabin - Business Class, Shared Bath",camping,business,shared bathroom,double/double-or-twin,undefined,not club,no balcony,undefined
Luxury Camping Tent - Burj-Khalifa View,camping,luxury,private bathroom,double/double-or-twin,double,not club,no balcony,burj-khalifa view
"Camping Tent - Presidential, Private Bath",camping,presidential,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
Standard Camping Cabin - Partial-Ocean View,camping,standard,shared bathroom,double/double-or-twin,undefined,not club,no balcony,partial-ocean view
```

I save all results to the `preprocessed_data/generated` directory.
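
Generated CSV is worth checking before it is saved. In the output above, the row starting with `Luxury Camping Tent - Private Bath, Sea View` has an unquoted comma inside `rate_name`, so it splits into ten fields instead of nine. A minimal sketch of such a check, using only the standard library (the function name `find_malformed_rows` is hypothetical):

```python
import csv
import io

EXPECTED_HEADER = [
    "rate_name", "class", "quality", "bathroom", "bedding",
    "capacity", "club", "balcony", "view",
]


def find_malformed_rows(csv_text):
    """Return (line_number, row) pairs whose field count differs from the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"Unexpected header: {header}")
    bad = []
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad.append((line_number, row))
    return bad


# Two rows copied from the generated output above.
generated = (
    "rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view\n"
    "Camping Tent - Deluxe Private Bath,camping,deluxe,private bathroom,"
    "double/double-or-twin,double,not club,no balcony,undefined\n"
    "Luxury Camping Tent - Private Bath, Sea View,camping,luxury,private bathroom,"
    "double/double-or-twin,double,not club,no balcony,sea view\n"
)

bad = find_malformed_rows(generated)
for line_number, row in bad:
    print(f"line {line_number}: {len(row)} fields")
```

Rows flagged this way can be fixed by hand, for example by quoting the `rate_name` value.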

Moving on to the next columns:

In [56]:
unique_final["quality"].value_counts().to_frame()

quality,count
standard,32163
undefined,12264
deluxe,9335
superior,8045
classic,2674
comfort,2195
economy,2187
premium,1724
executive,1551
premier,707


In [57]:
unique_final[unique_final["quality"] == "platinum"].to_csv(PREPROCESSED_DATA_DIR / "few_shot/few_shot_for_platinum.csv", index=False)

I send the next prompt in the same conversation:

Below is another sample from the same file:
```
rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
Platinum King Room (1 King Bed),room,platinum,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
Platinum Suite,suite,platinum,private bathroom,undefined,undefined,not club,no balcony,undefined
Suite (Rock Platinum) BC ,suite,platinum,private bathroom,undefined,undefined,not club,no balcony,undefined
"Platinum King, 1 King Bed",room,platinum,private bathroom,double/double-or-twin,double,not club,no balcony,undefined
```

Your task is to generate 15 more records with "quality" equals to "platinum". Generate "rate_name" according to "quality". Fill in other fields according to "rate_name". Output only final csv.

In [58]:
unique_final["bathroom"].value_counts().to_frame()

bathroom,count
private bathroom,72684
shared bathroom,2294
undefined,57
external private bathroom,16


In [59]:
unique_final["capacity"].value_counts().to_frame()

capacity,count
double,46299
undefined,18039
single,5931
quadruple,2907
triple,1862
quintuple,9
sextuple,4


In [60]:
unique_final[unique_final["capacity"].isin(["sextuple", "quintuple"])].to_csv(PREPROCESSED_DATA_DIR / "few_shot/few_shot_for_sextuple_or_quintuple.csv", index=False)

The next prompt:

Below is another sample from the same file:
```
rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
Economy room with 6 single beds twin (Main Building) in block,room,economy,shared bathroom,single bed,sextuple,not club,no balcony,undefined
Sextuple Room (3 Adults + 2 Children).,room,standard,private bathroom,undefined,sextuple,not club,no balcony,undefined
Standard Room (Quintuple) (5 Twin Beds),room,standard,private bathroom,undefined,quintuple,not club,no balcony,undefined
"FOUR SINGLE BEDS DELUXE ROOM WITH HARAM VIEW, COMP WIFI/COFFEE-TEA FACILITIES/49 INCH HDTV, 63 SQM/IRON-BOARD/SAFE/MINIBAR/WALKIN SHOWER Quintuple",room,deluxe,private bathroom,single bed,quintuple,not club,no balcony,undefined
Sextuple Room,room,standard,private bathroom,undefined,sextuple,not club,no balcony,undefined
Quintuple Room Bed and Breakfast,room,standard,private bathroom,undefined,quintuple,not club,no balcony,undefined
Standard room with 5 single beds  ,room,standard,private bathroom,single bed,quintuple,not club,no balcony,undefined
Quintuple Room non-smoking,room,standard,private bathroom,undefined,quintuple,not club,no balcony,undefined
Quintuple Room,room,standard,private bathroom,undefined,quintuple,not club,no balcony,undefined
Apartment with 3 rooms for 6 people Sextuple,apartment,undefined,private bathroom,undefined,sextuple,not club,no balcony,undefined
Quintuple Apartments with kitchen (Korvet building) Basic rate without breakfast,apartment,undefined,private bathroom,undefined,quintuple,not club,no balcony,undefined
Standard room with 5 single beds twin ,room,standard,private bathroom,single bed,quintuple,not club,no balcony,undefined
Apartment (Quintuple) (2 Twin Beds and 2 Twin Sofa Beds),apartment,undefined,private bathroom,double/double-or-twin,quintuple,not club,no balcony,undefined
```

Your task is to generate 15 more records with "capacity" equals to "quintuple" or "sextuple". Generate "rate_name" according to "capacity". Fill in other fields according to "rate_name". Output only final csv.

In [44]:
unique_final["club"].value_counts().to_frame()

club,count
not club,74587
club,464


In [64]:
unique_final["balcony"].value_counts().to_frame()


balcony,count
no balcony,70950
balcony,4101


In [65]:
unique_final["view"].value_counts().to_frame()

view,count
undefined,59231
sea view,3597
city view,3158
garden view,1901
pool view,976
partial-sea view,805
with view,737
ocean view,654
mountain view,530
river view,387


In [68]:
pd.concat(
    [
        unique_final[unique_final["view"] == "kremlin view"],
        unique_final[unique_final["view"] != "undefined"].sample(10),
    ]
).to_csv(PREPROCESSED_DATA_DIR / "few_shot/few_shot_for_kremlin_dubai_marina_view.csv", index=False)

The next prompt:

Below is another sample from the same file:
```
rate_name,class,quality,bathroom,bedding,capacity,club,balcony,view
Studio room Kremlin view double,studio,undefined,private bathroom,double/double-or-twin,double,not club,no balcony,kremlin view
Twin Room Standard | Garden View,room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,garden view
Studio - City View,studio,undefined,private bathroom,undefined,undefined,not club,no balcony,city view
Junior Suite Garden View Junior Suite,junior-suite,undefined,private bathroom,undefined,undefined,not club,no balcony,garden view
Double or twin garden view,room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,garden view
Apartment 1 room Partial Ocean View One Bedroom Condo,apartment,undefined,private bathroom,undefined,undefined,not club,no balcony,partial-ocean view
Sea View Twin Room,room,standard,private bathroom,double/double-or-twin,double,not club,no balcony,sea view
Luxury Sea View Room (Third adult in sofa Bed) Luxury Sea View Room,room,luxury,private bathroom,undefined,undefined,not club,no balcony,sea view
"room, 1 bedroom, garden view",room,standard,private bathroom,undefined,undefined,not club,no balcony,garden view
"deluxe room, 1 king bed, pool view",room,deluxe,private bathroom,double/double-or-twin,double,not club,no balcony,pool view
"Deluxe Room with Two Single Beds, Land or Pool View, ",room,deluxe,private bathroom,double/double-or-twin,double,not club,no balcony,pool view
```

Your task is to generate 20 more records with "view" equals to "kremlin view" or "dubai-marina view". Generate "rate_name" according to "view". Fill in other fields according to "rate_name". Output only final csv.

The last prompt adds examples in different languages:

Your task is to generate 32 more records for different "view" values, but now in different languages. Write "rate_name" in Russian, French, Spanish and Turkish, but follow the same rules regarding the other fields. Use various values for fields among available. DO NOT CHANGE VALUES OF OTHER FIELDS, THEY SHOULD BE IN ENGLISH. KEEP "view" COLUMN IN ENGLISH. Output only final csv. 

Now it is time to combine the resulting data:

In [72]:
files_to_concat = [unique_final]
for generated_file in PREPROCESSED_DATA_DIR.glob("generated/*.csv"):
    print(f"Processing {generated_file}")
    df = pd.read_csv(generated_file)
    files_to_concat.append(df)

augmented_final = pd.concat(files_to_concat).drop_duplicates()
augmented_final.to_csv(PREPROCESSED_DATA_DIR / "augmented_merged.csv", index=False)

Processing ../preprocessed_data/generated/generated_platinum.csv
Processing ../preprocessed_data/generated/generated_camping.csv
Processing ../preprocessed_data/generated/generated_kremlin_dubai_marina.csv
Processing ../preprocessed_data/generated/generated_sextuple_quintuple.csv
Processing ../preprocessed_data/generated/generated_different_languages.csv
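
Since the generated files come from a language model, it may be worth checking that their label columns only contain allowed values. The sketch below uses the value lists given in the first prompt for three of the columns; the helper `out_of_vocabulary` and the toy frame (including the `ensuite` value) are hypothetical:

```python
import pandas as pd

# Allowed values as listed in the first prompt of this section.
ALLOWED = {
    "bathroom": {"private bathroom", "shared bathroom", "undefined", "external private bathroom"},
    "club": {"club", "not club"},
    "balcony": {"balcony", "no balcony"},
}


def out_of_vocabulary(df, allowed):
    """Map each checked column to the set of values that are not allowed in it."""
    problems = {}
    for column, values in allowed.items():
        unexpected = set(df[column].unique()) - values
        if unexpected:
            problems[column] = unexpected
    return problems


# Toy frame for illustration; "ensuite" is a made-up invalid label.
toy_generated = pd.DataFrame(
    {
        "bathroom": ["private bathroom", "ensuite"],
        "club": ["not club", "club"],
        "balcony": ["no balcony", "balcony"],
    }
)

problems = out_of_vocabulary(toy_generated, ALLOWED)
print(problems)
```

On the real data the same call could be made on each `df` inside the loop above, or once on `augmented_final`.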


In [73]:
ProfileReport(augmented_final).to_notebook_iframe()
