
## Capstone Project
## Smart Travel Assistant: Optimizing the Travel Experience with AI

### Team 37
##### A00759664 - Joel Orlando Hernández Ramos
##### A01793486 - Juan Carlos Alvarado Carricarte
##### A00260430 - Juan Carlos Romo Cárdenas


**Loading the Dataset**

In [1]:
# Importing required libraries
import pyarrow.parquet as pq
import pandas as pd
import numpy as np


In [2]:
# Loading the hotel dataset as a multi-part Apache Parquet dataset
dataset = pq.ParquetDataset('hotel-dataset')
table = dataset.read()

In [3]:
# Creating a pandas DataFrame for the analysis
dataframe = table.to_pandas()


**Summary Statistics of the Dataset**

The hotel information dataset contains more than one million ninety-three thousand records (1,093,095) with 14 columns. The columns include values such as the hotel description, the hotel's star rating, the hotel address, the code of the country where the hotel is located, and its geographic coordinates, among others.

The following fields or columns are expected to be used to build the knowledge base in AWS Bedrock:
* Hotel name
* Hotel description
* Name of the city where the hotel is located
* Code of the country where the hotel is located
* Hotel rating

Other pieces of information that could be used include:
* Longitude and latitude
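
As a rough illustration of how the fields listed above could be combined into a single text entry for a knowledge base, the sketch below joins them into one labeled block. The `build_document` helper, the label wording, and the sample description are hypothetical and are not part of the notebook; the column names are the ones that appear in the dataset.

```python
def build_document(record):
    """Compose one plain-text entry from a hotel record (hypothetical layout)."""
    parts = [
        f"Hotel: {record['HotelName']}",
        f"City: {record['CityName']}",
        f"Country code: {record['CountryCode']}",
        f"Rating: {record['HotelRating']}",
        f"Description: {record['Description']}",
    ]
    return "\n".join(parts)


# Illustrative record; the description text is a placeholder, not real data
sample = {
    "HotelName": "pension abc",
    "CityName": "Berlin",
    "CountryCode": "DE",
    "HotelRating": 3.0,
    "Description": "Example description text.",
}

doc = build_document(sample)
print(doc)
```

Keeping one labeled field per line is only one possible layout; the right format depends on how the knowledge base ingests and chunks its documents.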


One of the important parameters when building a knowledge base for RAG is the size of the text in characters, or *chunk size*. To that end, a new column will be added with the length of the hotel description in characters.



In [4]:
# Calculating the length in characters of the hotel description
dataframe['DescLength'] = dataframe['Description'].str.len()
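
To show why the description length matters for the chunk size, the sketch below estimates how many chunks a description would need for a given size. The `estimate_chunks` helper and the value of `CHUNK_SIZE` are assumptions made for illustration only, and the lengths are example values rather than results from the dataset.

```python
import math


def estimate_chunks(desc_length, chunk_size):
    """Number of fixed-size, non-overlapping chunks needed for a text."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(desc_length / chunk_size)


CHUNK_SIZE = 1000  # hypothetical chunk size in characters

# Illustrative description lengths in characters
lengths = [0, 480, 1383, 2325]
chunks = [estimate_chunks(n, CHUNK_SIZE) for n in lengths]
print(chunks)  # [0, 1, 2, 3]
```

An empty description produces no chunks at all, which is one reason to look at the lower end of the length distribution.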


In [5]:
# Displaying a statistical summary of the data
dataframe.describe(include='all')

Unnamed: 0,HotelCode,HotelName,Description,Address,Pincode,CountryCode,PhoneNumber,CityName,Longitude,Latitude,HotelRating,uuid,match_id,match_confidence_score,DescLength
count,1093095.0,1093095,1093095,1093095,1093095.0,1093095,742223,1093095,1090879.0,1093095.0,1093052.0,1093095,1093095.0,24688.0,1093095.0
unique,1093095.0,994885,1025518,1060951,178491.0,441,612681,100184,1009164.0,980655.0,,1093095,,,
top,1688289.0,quality inn,<br/><b>Disclaimer notification: Amenities are...,"510 gulf shore drive, , destin, 32541, usa",,US,91-93-13931393,Rome,-86.49782,,,e8558524-a7e1-4c2c-a4aa-afaa8adc2563,,,
freq,1.0,373,62806,94,40804.0,116324,6690,7937,94.0,2217.0,,1,,,
mean,,,,,,,,,,,47.7285,,2839610000000.0,1.0,1356.845
std,,,,,,,,,,,47844.83,,1744277000000.0,0.0,587.4179
min,,,,,,,,,,,0.0,,0.0,1.0,0.0
25%,,,,,,,,,,,0.0,,1322850000000.0,1.0,977.0
50%,,,,,,,,,,,2.0,,2843268000000.0,1.0,1383.0
75%,,,,,,,,,,,3.0,,4355097000000.0,1.0,1750.0


The extended description of the data stored in the dataset indicates that:
* There are 1,093,095 records with 14 original columns, plus the derived *DescLength* column
* The *PhoneNumber*, *Longitude*, *HotelRating* and *match_confidence_score* columns have missing values
* The mean of *HotelRating* is 47.73 with a standard deviation of 47,844.83, far outside any plausible star rating, so the column appears to contain outliers; the median is 2 stars and the 75th percentile is 3 stars
* The average length of a hotel description is about 1,357 characters according to the mean of *DescLength*, although some descriptions are empty, given the minimum value of zero (0) in the same column
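
A mean can be pulled far from the typical value by a handful of extreme entries, so it can help to compare it with the median and to count the values outside the expected range. The sketch below does this on toy data; the assumption that valid ratings lie between 0 and 5 is ours and is not stated in the dataset.

```python
import pandas as pd

# Toy ratings with one extreme value (illustrative, not taken from the dataset)
ratings = pd.Series([0.0, 2.0, 3.0, 4.0, 5.0, 50000.0])

mean_rating = ratings.mean()      # distorted by the extreme value
median_rating = ratings.median()  # robust to the extreme value

# Assumption: valid star ratings lie in the range 0 to 5
in_range = ratings.between(0, 5)
out_of_range = ratings[~in_range]
valid_mean = ratings[in_range].mean()

print(f"mean={mean_rating:.2f}, median={median_rating}")
print(f"out-of-range values: {len(out_of_range)}")
print(f"mean of in-range values: {valid_mean:.2f}")
```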


In [6]:
# Displaying the value types stored in each column, with the count of non-missing values
dataframe.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093095 entries, 0 to 1093094
Data columns (total 15 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   HotelCode               1093095 non-null  object 
 1   HotelName               1093095 non-null  object 
 2   Description             1093095 non-null  object 
 3   Address                 1093095 non-null  object 
 4   Pincode                 1093095 non-null  object 
 5   CountryCode             1093095 non-null  object 
 6   PhoneNumber             742223 non-null   object 
 7   CityName                1093095 non-null  object 
 8   Longitude               1090879 non-null  object 
 9   Latitude                1093095 non-null  object 
 10  HotelRating             1093052 non-null  float64
 11  uuid                    1093095 non-null  object 
 12  match_id                1093095 non-null  int64  
 13  match_confidence_score  24688 non-null    float64
 14  De

The dataset information display also confirms the total record count and the missing-value counts indicated above.

In [7]:
# Displaying a sample of the values present in the dataset
dataframe.head()

Unnamed: 0,HotelCode,HotelName,Description,Address,Pincode,CountryCode,PhoneNumber,CityName,Longitude,Latitude,HotelRating,uuid,match_id,match_confidence_score,DescLength
0,1688289,villa alun,<p>HeadLine : In Seminyak (Batubelig)</p><p>Lo...,"jalan raya batu belig gang kamboja no.2, gang ...",80361,ID,62-812-38144235,Seminyak,115.148258,-8.672673,4.0,e8558524-a7e1-4c2c-a4aa-afaa8adc2563,1005022348356,,2325
1,1863020,house of ahasna,"Located in Katunayaka, House Of Ahasna feature...","air force road, kuranakatunayake 109akatunayak...",11450,lk,,Katunayaka,79.86482,7.18265,3.0,fe5a67fb-7140-43fc-a62c-b2521f86f257,4741643896104,,732
2,5414820,casa vacanze margherita,Casa Vacanze Margherita is a detached holiday ...,"localita' piricone, , orosei, 08028, italy",8028,IT,,Orosei,9.6809,40.36094,0.0,86d185f6-1e7b-4a24-8ba2-18e8b0edcf1b,927712937388,,898
3,5758326,casa bethel,"Located in Cobán, in a building dating from 20...","9na avenida 2 07 zona 1 coban, , coban, 16001,...",16001,GT,,Coban,-90.37927,15.46935,0.0,b93537ec-e148-42b0-9bd9-2347b765f611,5592047420856,,480
4,1116957,pension abc,This quietly located guest house in Berlin off...,"kurfürstenstr. 20, 10785 berlinschöneberg, ber...",10785,DE,(49) 3026949903,Berlin,13.36543,52.49947,3.0,ee2eb159-87a7-4b7b-8b2a-fbcb8d92dcf6,137438955021,,1662


A preliminary review of the data shows the following:
* The hotel description includes HTML elements, or tags, that have to be removed
* The code of the country where the hotel is located appears to be based on the two-letter ISO 3166 standard (ISO, n.d.). This is categorical data with an expected cardinality of at most 249, since the ISO 3166 standard includes 249 codes. The statistical summary, however, reports 441 unique values for *CountryCode*, and the sample includes a lowercase code ("lk"), so the column needs to be normalized.
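
The two cleanup steps suggested by this review can be sketched with the standard library alone. The `strip_html` and `normalize_country_code` helpers below are hypothetical names introduced for illustration, and the sample markup is made up to resemble the tags seen in the descriptions.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text with HTML tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def normalize_country_code(code):
    """Trim and upper-case a country code so that 'lk' and 'LK' match."""
    return code.strip().upper()


raw = "<p>HeadLine : In Seminyak (Batubelig)</p><br/><b>Example text</b>"
clean = strip_html(raw)
print(clean)
print(normalize_country_code("lk"))
```

Upper-casing only fixes inconsistent casing; values that are not ISO 3166 codes at all would still need to be checked against the list of valid codes.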


**Missing Data Analysis**



In [9]:
# Determining which columns have missing values
dataframe.isnull().sum()

HotelCode                       0
HotelName                       0
Description                     0
Address                         0
Pincode                         0
CountryCode                     0
PhoneNumber                350872
CityName                        0
Longitude                    2216
Latitude                        0
HotelRating                    43
uuid                            0
match_id                        0
match_confidence_score    1068407
DescLength                      0
dtype: int64

The preliminary analysis of missing data shows the following:
* Approximately one third of the records, more than 350 thousand, have no phone number
* Although more than two thousand two hundred longitude values are missing, no latitude values are reported as missing
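
Expressing the missing counts as percentages makes statements such as "about one third" easy to check. The sketch below computes the share of missing values per column on a toy frame (illustrative values only) and then applies the same arithmetic to the phone number count reported above.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset (illustrative values)
toy = pd.DataFrame({
    "PhoneNumber": ["62-812-38144235", None, None, "(49) 3026949903"],
    "Longitude": ["115.148258", "79.86482", None, "13.36543"],
    "Latitude": ["-8.672673", "7.18265", "", "52.49947"],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Same arithmetic with the counts reported by the notebook
phone_missing_pct = round(350872 / 1093095 * 100, 1)
print(phone_missing_pct)  # 32.1
```

Note that the empty string in the toy *Latitude* column is not counted as missing, since `isnull` only detects `None` and `NaN`.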

Next, only the categorical and numeric data will be analyzed, so a dataset that excludes the values that are not required will be created.

In [11]:
# Removing text fields for the statistical analysis
dataframe_analysis = dataframe.drop(['Description','PhoneNumber','CityName','Address','Pincode','uuid','match_id','match_confidence_score'], axis=1)

In [12]:
dataframe_analysis.describe(include='all')

Unnamed: 0,HotelCode,HotelName,CountryCode,Longitude,Latitude,HotelRating,DescLength
count,1093095.0,1093095,1093095,1090879.0,1093095.0,1093052.0,1093095.0
unique,1093095.0,994885,441,1009164.0,980655.0,,
top,1688289.0,quality inn,US,-86.49782,,,
freq,1.0,373,116324,94.0,2217.0,,
mean,,,,,,47.7285,1356.845
std,,,,,,47844.83,587.4179
min,,,,,,0.0,0.0
25%,,,,,,0.0,977.0
50%,,,,,,2.0,1383.0
75%,,,,,,3.0,1750.0


In [13]:
dataframe_analysis.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093095 entries, 0 to 1093094
Data columns (total 7 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   HotelCode    1093095 non-null  object 
 1   HotelName    1093095 non-null  object 
 2   CountryCode  1093095 non-null  object 
 3   Longitude    1090879 non-null  object 
 4   Latitude     1093095 non-null  object 
 5   HotelRating  1093052 non-null  float64
 6   DescLength   1093095 non-null  int64  
dtypes: float64(1), int64(1), object(5)
memory usage: 58.4+ MB


In [14]:
dataframe_analysis.head()

Unnamed: 0,HotelCode,HotelName,CountryCode,Longitude,Latitude,HotelRating,DescLength
0,1688289,villa alun,ID,115.148258,-8.672673,4.0,2325
1,1863020,house of ahasna,lk,79.86482,7.18265,3.0,732
2,5414820,casa vacanze margherita,IT,9.6809,40.36094,0.0,898
3,5758326,casa bethel,GT,-90.37927,15.46935,0.0,480
4,1116957,pension abc,DE,13.36543,52.49947,3.0,1662


The extended description of the analysis dataset, as well as its displayed information, match the values obtained from the original dataset.

Since the longitude and latitude geographic coordinates are candidates to be added to the knowledge base, it is worth reviewing the missing values to see whether there is a way to complete them.

In [16]:
dataframe.loc[dataframe['Longitude'].isnull()]

Unnamed: 0,HotelCode,HotelName,Description,Address,Pincode,CountryCode,PhoneNumber,CityName,Longitude,Latitude,HotelRating,uuid,match_id,match_confidence_score,DescLength
352,1944494,sunview lodge & restaurant,<br/><b>Disclaimer notification: Amenities are...,"mombasa road, kibwezi 90137, kibwezi, 90137, k...",90137,KE,,Kibwezi,,,0.0,1b539168-5527-45a1-a3e5-dfcdba5c0f90,1760936592369,,121
429,5001876,santubong suites b just like home damai,"<p>Featuring an outdoor swimming pool, a fitne...","jalan sultan tengah, 93050 kuching, malaysia, ...",93050,MY,,Damai Beach,,,4.0,d51aa7de-79f8-4ea0-831a-6607a6fdadd3,5643587027080,,532
1432,5002452,hotel salyut,"<p>The hotel ""salute"" was founded in 2007. It ...","tula region, city aleksin, 18, bolotova st., ,...",,RU,,Aleksin,,,3.0,9c8143f7-b995-4968-a82f-77258df3ca4b,5643587027641,,504
2031,5020924,atour hotel shanghai pudong lujiazui,<p> This hotel offers a pleasant stay in Shang...,"no. 138 pudong avenue lujiazui, pudong 200000 ...",,CN,,Shanghai,,,4.0,a129f8e4-4275-460d-9b47-bb1723a781fc,3341484557220,,954
3066,5016410,premier inn bangor (northern ireland),Whether you’re planning romantic breaks or fam...,"castle avenue, bangor, northern ireland bt20 4...",,GB,,Bangor,,,0.0,956944a4-1d0b-4024-b903-90465cd86463,4294967296355,,846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1091396,5015364,iris lite corbett,Our core focus is to provide a budget friendly...,"dhikuli village,, ramnagar, ramnagar, 244715",244715,13,6398601206,Ramnagar,,,3.0,b359cb1b-91ad-44ba-8e35-e6731d02659e,4724464026634,,418
1092511,5007324,jinjiang innselect xinhua road wuhan,<p>While staying at Jinjiang Inn (Wuhan Xinhua...,"no. 162 north jianghan, roadwuhan, nanjing,",,CN,,NANJING,,,0.0,92d0a2c3-be2d-4284-97d3-4457138e9a37,4698694222010,,318
1092668,5010802,ji hotel shanghai wujiaochang shiguang road,<p> JI Hotel Shanghai Wujiaochang Shiguang Roa...,"no.635 shiguang road, shanghai, pvg, 0000, sha...",,CN,,Shanghai,,,0.0,58917d5c-df4c-4292-bcad-140b71977686,2319282340207,,306
1092679,5021993,bristol hotel podgorica,"<p>Situated in Podgorica, 700 metres from Chur...","2 bore stankovica,, 81000 podgorica, montenegr...",81000,ME,,Podgorica,,,0.0,a7d30a6e-a847-4fdc-b4a3-09c7b1cd970f,3289944949646,,1076


An important finding from the display above is that, in the records where the longitude value is not present, the latitude shows no value either, even though the missing-value analysis indicates that latitude has values for all records.

To verify this finding, the longitude and latitude values will be converted to either a valid floating-point value or a missing value.

In [17]:
# Converting longitude and latitude to numbers; invalid values become NaN
def to_float_number(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

dataframe_analysis['Longitude'] = dataframe_analysis['Longitude'].apply(to_float_number) 
dataframe_analysis['Latitude'] = dataframe_analysis['Latitude'].apply(to_float_number)
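
As an alternative to the element-wise `to_float_number` helper above, `pd.to_numeric` with `errors="coerce"` performs the same kind of conversion in a vectorized way. The sketch below uses a toy frame (illustrative values) to show how empty strings and stray text are not counted by `isnull` until they are coerced.

```python
import pandas as pd

# Toy frame: None, empty strings and stray text stored as coordinates
toy = pd.DataFrame({
    "Longitude": ["115.148258", None, "", "Longitude"],
    "Latitude": ["-8.672673", "", "", "Latitude"],
})

# Before conversion only None is reported as missing
before = toy.isnull().sum()
print(before)

# Coerce every value that cannot be parsed as a number to NaN
converted = toy.apply(pd.to_numeric, errors="coerce")
after = converted.isnull().sum()
print(after)
```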

In [18]:
dataframe_analysis.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093095 entries, 0 to 1093094
Data columns (total 7 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   HotelCode    1093095 non-null  object 
 1   HotelName    1093095 non-null  object 
 2   CountryCode  1093095 non-null  object 
 3   Longitude    1066303 non-null  float64
 4   Latitude     1066296 non-null  float64
 5   HotelRating  1093052 non-null  float64
 6   DescLength   1093095 non-null  int64  
dtypes: float64(3), int64(1), object(3)
memory usage: 58.4+ MB


In [19]:
# Determining which columns have missing values
dataframe_analysis.isnull().sum()

HotelCode          0
HotelName          0
CountryCode        0
Longitude      26792
Latitude       26799
HotelRating       43
DescLength         0
dtype: int64

After the conversion, longitude and latitude have a similar, though not identical, count of missing values (26,792 and 26,799, respectively).
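
Since the two counts differ, some records must have exactly one of the two coordinates missing. A minimal sketch of how such records could be isolated is shown below on toy data; the values are illustrative and the variable names are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy coordinates after numeric conversion (illustrative values)
toy = pd.DataFrame({
    "Longitude": [115.148258, np.nan, np.nan, 13.36543],
    "Latitude": [-8.672673, np.nan, 40.36094, np.nan],
})

lon_missing = toy["Longitude"].isna()
lat_missing = toy["Latitude"].isna()

# XOR keeps rows where exactly one coordinate is missing
only_one_missing = toy[lon_missing ^ lat_missing]
# AND keeps rows where both coordinates are missing
both_missing = toy[lon_missing & lat_missing]

print(only_one_missing)
print(both_missing)
```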


In [59]:
# Selecting records whose description is shorter than 150 characters, without the unused columns
dataframe_desc = dataframe.loc[dataframe['DescLength']<150].drop(['PhoneNumber','CityName','Address','Pincode','uuid','match_id','match_confidence_score'], axis=1)


In [60]:
dataframe_desc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63228 entries, 29 to 1093088
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   HotelCode    63228 non-null  object 
 1   HotelName    63228 non-null  object 
 2   Description  63228 non-null  object 
 3   CountryCode  63228 non-null  object 
 4   Longitude    62959 non-null  object 
 5   Latitude     63228 non-null  object 
 6   HotelRating  63227 non-null  float64
 7   DescLength   63228 non-null  int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 4.3+ MB


In [61]:
dataframe_desc.head(40)

Unnamed: 0,HotelCode,HotelName,Description,CountryCode,Longitude,Latitude,HotelRating,DescLength
29,5574247,artinov villa,<br/><b>Disclaimer notification: Amenities are...,UA,28.401920318604,49.241844177246,0.0,121
42,5227047,cabanas wualmapu,<br/><b>Disclaimer notification: Amenities are...,CL,-71.97763,-39.27225,3.0,121
69,1100700,lewis grand hotel,<br/><b>Disclaimer notification: Amenities are...,PH,120.577199,15.165388,4.0,121
120,1397984,nour el balad,<br/><b>Disclaimer notification: Amenities are...,EG,32.60036,25.71717,2.0,121
124,5617885,rigofutty vendeghaz,<br/><b>Disclaimer notification: Amenities are...,HU,19.9869624,46.82569881,0.0,121
165,5474897,pokoje goscinne koralik,<br/><b>Disclaimer notification: Amenities are...,PL,19.962448120117,49.294097900391,0.0,121
177,5301104,apartment cesar,<br/><b>Disclaimer notification: Amenities are...,SI,13.89544567,46.28912249,0.0,121
184,5883785,casa baixú caraíva,<br/><b>Disclaimer notification: Amenities are...,BR,-39.14976952,-16.80781346,0.0,121
185,5441589,el vergel,<br/><b>Disclaimer notification: Amenities are...,AR,-57.046329498291,-37.343910217285,0.0,121
196,6186660,bella pensao,<br/><b>Disclaimer notification: Amenities are...,IN,73.7905,15.57383,0.0,121


In [62]:
# Narrowing the selection to descriptions shorter than 100 characters
dataframe_desc = dataframe_desc.loc[dataframe_desc['DescLength']<100]

In [63]:
dataframe_desc.info()

<class 'pandas.core.frame.DataFrame'>
Index: 94 entries, 26256 to 1037114
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   HotelCode    94 non-null     object 
 1   HotelName    94 non-null     object 
 2   Description  94 non-null     object 
 3   CountryCode  94 non-null     object 
 4   Longitude    94 non-null     object 
 5   Latitude     94 non-null     object 
 6   HotelRating  93 non-null     float64
 7   DescLength   94 non-null     int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 6.6+ KB


In [64]:
dataframe_desc.head(40)

Unnamed: 0,HotelCode,HotelName,Description,CountryCode,Longitude,Latitude,HotelRating,DescLength
26256,4dWK,tesoro ixtapa all inclusive,<p><b>About the property</b><br /><span></span...,MX,-101606,1766317,3.0,51
36715,tJ67,americinn by wyndham wisconsin dells,This cosy hotel is set in Wisconsin Rapids Area.,US,-8979388,43622463,3.0,48
48092,kD6w,camino real hotel & suites puebla,Rooms Number: 149,MX,-9825305939,1901609993,4.0,17
51347,L0F5,capital o hotel posada terraza,,MX,-103833908,2090864,3.0,0
53828,R4fM,homewood suites by hilton austin round rock tx,,US,-976758,304899,3.0,0
62935,kk3L,ramada by wyndham viscount suites tucson east,This simple hotel is located in Downtown/Unive...,US,-11088852,3222175,3.0,52
69356,PfXB,camelback resort,,US,-75355011,41051849,3.0,0
82836,HotelCode,hotelname,Description,CountryCode,Longitude,Latitude,,11
121147,9fqD,quality inn litchfield route 66,"Easy interstate access, riverboat gambling 50 ...",US,-8966609,391822,2.0,96
125927,XFhS,disney's all star movies resort package,,US,-81.57,28.34,3.0,0


**References**

* ISO. (n.d.). ISO 3166 — Country Codes. International Organization for Standardization. Retrieved May 3, 2024, from https://www.iso.org/iso-3166-country-codes.html
