# AirBNB Project

### Context Study

#### What is AirBNB?

Before exploring and analyzing the data, we put ourselves in context.

"Airbnb is a company that offers a digital platform dedicated to offering accommodation to individuals and tourists (vacation rentals), through which hosts can advertise and rent out their properties to guests; hosts and guests can rate each other, as a reference for future users." Wikipedia [es.wikipedia.org/wiki/Airbnb](https://es.wikipedia.org/wiki/Airbnb)

This tells us what to expect in the data: information about listings and reviews.

#### Exploring the datasets directory

````
ls -s datasets
total 627988
 432868 calendar.csv
  86000 listings.csv
 109120 reviews.csv
````

We see that these are CSV files of medium to large size, so file size is the first thing we need to take into account.
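One common way to handle CSV files of this size is to read them in chunks instead of all at once. The following is a minimal sketch, assuming pandas is available; the helper `count_rows_in_chunks` and the in-memory sample are hypothetical stand-ins for the real file, and only the `chunksize` parameter of `pd.read_csv` is taken from pandas itself.

```python
import io

import pandas as pd

# Small in-memory stand-in for a large CSV file (hypothetical sample rows)
sample = io.StringIO(
    "listing_id,date,available\n"
    "50778,2020-04-26,f\n"
    "133654,2020-04-27,t\n"
    "133654,2020-04-28,t\n"
)


def count_rows_in_chunks(source, chunksize):
    """Count rows while holding at most `chunksize` rows in memory."""
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += len(chunk)
    return total


n_rows = count_rows_in_chunks(sample, chunksize=2)
print(n_rows)
```

With the real files, `source` would be a path such as `'../datasets/calendar.csv'` and `chunksize` would be much larger.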

We move on to the EDA.

##### Analyzing the ````calendar.csv```` file

````
listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
50778,2020-04-26,f,"$2,655.00","$2,655.00",5,1125
133654,2020-04-27,t,"$1,150.00","$1,150.00",4,1125
133654,2020-04-28,t,"$1,150.00","$1,150.00",4,1125
133654,2020-04-29,t,"$1,150.00","$1,150.00",4,1125
133654,2020-04-30,t,"$1,150.00","$1,150.00",4,1125
133654,2020-05-01,t,"$1,150.00","$1,150.00",4,1125
133654,2020-05-02,t,"$1,150.00","$1,150.00",4,1125
````

At first glance it looks like a fact table, from a data warehouse point of view, with the following fields:

* **listing_id**: id of the table that describes the listings, that is, the physical place
* **date**: date
* **available**: **t** is true, **f** is false
* **price**: price per night
* **adjusted_price**: 
* **minimum_nights**: minimum number of nights
* **maximum_nights**: maximum number of nights

After a look at the listings file, we verified that calendar associates a listing with a date, an availability flag and a price; the last two fields, **minimum_nights** and **maximum_nights**, are redundant.
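To make the field semantics concrete, here is a small sketch that parses one row of the sample with the standard library only. The helpers `parse_price` and `parse_row` are hypothetical names introduced for illustration, and the sketch treats only **t** as true, which is an assumption about the flag.

```python
import csv
import io
from datetime import date

# Header and first data row copied from the calendar.csv sample
raw = (
    "listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights\n"
    '50778,2020-04-26,f,"$2,655.00","$2,655.00",5,1125\n'
)


def parse_price(text):
    """Turn a string such as '$2,655.00' into a float."""
    return float(text.replace("$", "").replace(",", ""))


def parse_row(row):
    """Convert one calendar row (a dict of strings) into typed values."""
    return {
        "listing_id": int(row["listing_id"]),
        "date": date.fromisoformat(row["date"]),
        "available": row["available"] == "t",
        "price": parse_price(row["price"]),
        "adjusted_price": parse_price(row["adjusted_price"]),
    }


record = parse_row(next(csv.DictReader(io.StringIO(raw))))
print(record)
```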


We proceed with the cleaning.

In [1]:
import pandas as pd
import numpy as np

In [None]:
# Load only the columns we keep; minimum_nights and maximum_nights are left out
calendar = pd.read_csv('../datasets/calendar.csv', sep=',', usecols=['listing_id', 'date', 'available', 'price', 'adjusted_price'])

# uint16 tops out at 65535 and the sample already contains 133654, so use uint32
calendar.listing_id = calendar.listing_id.astype('uint32')

# Anything other than 'f' counts as available
calendar.available = calendar.available != 'f'

calendar.date = pd.to_datetime(calendar.date)

# Strip quotes, the currency symbol and thousands separators, then convert to float
for col in ['price', 'adjusted_price']:
    calendar[col] = calendar[col].str.replace(r'[$,"]', '', regex=True).astype('float32')
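The integer width chosen for **listing_id** is worth checking against the data, since a 16-bit unsigned integer cannot hold every id that appears in the sample. A minimal sketch of that check, assuming NumPy is available and using only two ids from the sample rather than the full column:

```python
import numpy as np

# Two listing_id values taken from the calendar.csv sample
sample_ids = [50778, 133654]

# Largest value a 16-bit unsigned integer can represent
uint16_max = np.iinfo(np.uint16).max
fits_uint16 = [i <= uint16_max for i in sample_ids]

# Smallest unsigned dtype able to hold the largest id in the sample
smallest = np.min_scalar_type(max(sample_ids))
print(uint16_max, fits_uint16, smallest)
```

On the full file the same check would use the column maximum, for example `calendar.listing_id.max()`, before picking a dtype.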


In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

In [38]:
import os

import pymysql
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

HOST = 'localhost'
SCHEMA = 'airbnb'
USER = 'root'
PORT = 3306
# The real password is not shown here. As an assumption, it is read from a
# hypothetical MYSQL_PASSWORD environment variable.
PASS = os.environ.get('MYSQL_PASSWORD', '')

# calendar_limpio = pd.read_csv('../datasets/calendar_limpio.csv', sep="\t")

# URL format: mysql+pymysql://[user]:[pass]@[host]:[port]/[schema]
cnx = create_engine(f'mysql+pymysql://{USER}:{PASS}@{HOST}:{PORT}/{SCHEMA}', echo=False)

# calendar_limpio.to_sql(name="calendar", con=cnx)
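The `to_sql` call can be tried without a MySQL server. The sketch below uses an in-memory SQLite database as a stand-in, which is an assumption made only so the example is self-contained; the table contents are hypothetical rows in the cleaned calendar layout.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL schema (assumption)
conn = sqlite3.connect(":memory:")

calendar_demo = pd.DataFrame({
    "listing_id": [50778, 133654],
    "date": ["2020-04-26", "2020-04-27"],
    "available": [False, True],
    "price": [2655.0, 1150.0],
})

# index=False keeps the DataFrame's positional index out of the table;
# chunksize bounds how many rows are written per batch
calendar_demo.to_sql("calendar", conn, if_exists="replace", index=False, chunksize=1000)

stored = pd.read_sql("SELECT listing_id, price FROM calendar ORDER BY listing_id", conn)
print(stored)
conn.close()
```

For a table the size of calendar, writing in batches through `chunksize` avoids sending every row in a single statement.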

## **listings.csv** file

We read the **listings.csv** file into a DataFrame.

In [None]:
listings = pd.read_csv('../datasets/listings.csv', low_memory=False)

In [6]:
listings.head()
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23729 entries, 0 to 23728
Columns: 106 entries, id to reviews_per_month
dtypes: float64(23), int64(21), object(62)
memory usage: 19.2+ MB


We see that it contains 106 columns!

````[id, ...., calculated_host_listings_count_shared_rooms, reviews_per_month]````

In [None]:
columnas_listings = listings.columns

# Print every column name on a single line
for c in columnas_listings:
    print(c, end=", ")

After analyzing the columns, the file looks like several relational tables condensed into one. We conclude that at least 7 tables can be extracted, grouped by the nature of their columns.

For example, we separate the host information into a table called **hosts** and store it in **MySQL**.

In [39]:
columnas_host = ['host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified']

host = listings[columnas_host]
host = host.drop_duplicates()

host.to_sql(name="hosts", con=cnx, if_exists='replace')
# host.to_csv('../datasets/hosts.csv', encoding='utf-8', sep=';')


15536

We update the table by removing the host information from the listings DataFrame, keeping **host_id** as the link to **hosts**.

In [47]:
listings_filtered = listings.drop(columnas_host[1:], axis=1)
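The extract-then-drop pattern used for the host columns can be wrapped in a small helper. This is only a sketch on hypothetical data: the function `split_off` and the demo DataFrame are not part of the notebook, and the real column lists are much longer.

```python
import pandas as pd


def split_off(df, columns, key):
    """Return (sub-table, remainder).

    The sub-table holds `columns` without duplicate rows; the remainder
    drops those columns but keeps `key` as the link back to the sub-table.
    """
    sub = df[columns].drop_duplicates().reset_index(drop=True)
    rest = df.drop(columns=[c for c in columns if c != key])
    return sub, rest


# Hypothetical miniature of the listings table
listings_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "host_id": [10, 10, 20],
    "host_name": ["Ana", "Ana", "Luis"],
    "price": [100.0, 120.0, 90.0],
})

hosts_demo, rest_demo = split_off(listings_demo, ["host_id", "host_name"], key="host_id")
print(hosts_demo)
print(list(rest_demo.columns))
```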

Following the same pattern, we continue with the remaining tables we can generate.

In [41]:
columnas_info1 = ['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url']

info1 = listings[columnas_info1]

info1 = info1.drop_duplicates()

info1.to_sql(name="info1", con=cnx, if_exists="replace")


23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_info1[1:], axis=1)

listings_filtered

In [49]:
columnas_info2 = ["id","street", "neighbourhood_cleansed", "neighbourhood_group_cleansed","require_guest_profile_picture","require_guest_phone_verification",
           "calculated_host_listings_count", "calculated_host_listings_count_entire_homes", "calculated_host_listings_count_shared_rooms",
           "calculated_host_listings_count_private_rooms"]

# info2 = pd.read_csv('../datasets/listings.csv', usecols=columnas_info2)
info2 = listings[columnas_info2]

info2 = info2.drop_duplicates()

info2.to_sql(name='info2', con = cnx, if_exists="replace")

23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_info2[1:], axis=1)

listings_filtered

In [53]:
columnas_location = ['id', 'neighbourhood', 'city', 'state',
       'zipcode', 'market', 'smart_location', 'country_code', 'country',
       'latitude', 'longitude', 'is_location_exact']

location = listings[columnas_location]
location.to_sql(name='location', con = cnx, if_exists="replace")

23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_location[1:], axis=1)

listings_filtered

In [55]:
columnas_propiedad = ['id','property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'bed_type', 'amenities', 'square_feet']

datos_propiedad = listings[columnas_propiedad]

datos_propiedad = datos_propiedad.drop_duplicates()

datos_propiedad.to_sql(name='datos_propiedad', con=cnx, if_exists='replace')

23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_propiedad[1:], axis=1)

listings_filtered

In [57]:
columnas_precio = ['id','price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people']

precios = listings[columnas_precio]

precios = precios.drop_duplicates()

precios.to_sql(name='precios', con=cnx, if_exists='replace')


23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_precio[1:], axis=1)

listings_filtered

In [59]:
columnas_disponibilidad = ['id', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped']

disponibilidad = listings[columnas_disponibilidad]

disponibilidad.to_sql(name="disponibilidad", con=cnx, if_exists='replace')



23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_disponibilidad[1:], axis=1)

listings_filtered

In [61]:
columnas_review = ['id','number_of_reviews', 'number_of_reviews_ltm', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month']

reviews_info = listings[columnas_review]

reviews_info.to_sql(name='reviews_info', con=cnx, if_exists='replace')

23729

In [None]:
listings_filtered = listings_filtered.drop(columnas_review[1:], axis=1)

listings_filtered

In [None]:
# Assuming listings is your DataFrame


# to_drop = ["street", "neighbourhood_cleansed", "neighbourhood_group_cleansed","require_guest_profile_picture","require_guest_phone_verification",
#            "calculated_host_listings_count", "calculated_host_listings_count_entire_homes", "calculated_host_listings_count_shared_rooms",
#            "calculated_host_listings_count_private_rooms"]

# to_drop2 = ['neighbourhood', 'city', 'state',
#        'zipcode', 'market', 'smart_location', 'country_code', 'country',
#        'latitude', 'longitude', 'is_location_exact']

# to_drop0 = ['host_url', 'host_name', 'host_since', 'host_location', 'host_about',
#        'host_response_time', 'host_response_rate', 'host_acceptance_rate',
#        'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
#        'host_neighbourhood', 'host_listings_count',
#        'host_total_listings_count', 'host_verifications',
#        'host_has_profile_pic', 'host_identity_verified']

# to_drop3 = ['scrape_id', 'last_scraped', 'summary', 'space', 'description',
#        'experiences_offered', 'neighborhood_overview', 'notes', 'transit',
#        'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
#        'picture_url', 'xl_picture_url']

# to_drop4 = ['property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
#        'beds', 'bed_type', 'amenities', 'square_feet']

# to_drop5 =['price', 'weekly_price', 'monthly_price', 'security_deposit',
#        'cleaning_fee', 'guests_included', 'extra_people']

# to_drop6 = ['minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
#        'maximum_minimum_nights', 'minimum_maximum_nights',
#        'maximum_maximum_nights', 'minimum_nights_avg_ntm',
#        'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
#        'availability_30', 'availability_60', 'availability_90',
#        'availability_365', 'calendar_last_scraped']

# to_drop7 = ['number_of_reviews', 'number_of_reviews_ltm', 'first_review',
#        'last_review', 'review_scores_rating', 'review_scores_accuracy',
#        'review_scores_cleanliness', 'review_scores_checkin',
#        'review_scores_communication', 'review_scores_location',
#        'review_scores_value', 'reviews_per_month']

# listings_filtered = listings.drop(to_drop0, axis=1)
# listings_filtered = listings_filtered.drop(to_drop, axis=1)
# listings_filtered = listings_filtered.drop(to_drop2, axis=1)
# listings_filtered = listings_filtered.drop(to_drop3, axis=1)
# listings_filtered = listings_filtered.drop(to_drop4, axis=1)
# listings_filtered = listings_filtered.drop(to_drop5, axis=1)
# listings_filtered = listings_filtered.drop(to_drop6, axis=1)
# listings_filtered = listings_filtered.drop(to_drop7, axis=1)



# listings_filtered.info()

Finally, we save the listings table.

In [63]:
listings_filtered.to_sql(name="listings", con=cnx, if_exists='replace')

23729
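Once the wide table has been split, a quick sanity check is to join the pieces back together on their keys. The sketch below does this on hypothetical miniature tables; the `validate` argument of `merge` makes pandas raise an error if a key that should be unique is repeated, which would matter if the same **host_id** ended up on more than one row of **hosts**.

```python
import pandas as pd

# Hypothetical miniatures of three of the generated tables
listings_demo = pd.DataFrame({"id": [1, 2, 3], "host_id": [10, 10, 20]})
precios_demo = pd.DataFrame({"id": [1, 2, 3], "price": [100.0, 120.0, 90.0]})
hosts_demo = pd.DataFrame({"host_id": [10, 20], "host_name": ["Ana", "Luis"]})

# validate= checks that the key relationship is the expected one
rebuilt = (
    listings_demo
    .merge(precios_demo, on="id", how="left", validate="one_to_one")
    .merge(hosts_demo, on="host_id", how="left", validate="many_to_one")
)
print(rebuilt)
```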

In [64]:

reviews = pd.read_csv('../datasets/reviews.csv')

reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387099 entries, 0 to 387098
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     387099 non-null  int64 
 1   id             387099 non-null  int64 
 2   date           387099 non-null  object
 3   reviewer_id    387099 non-null  int64 
 4   reviewer_name  387099 non-null  object
 5   comments       386923 non-null  object
dtypes: int64(3), object(3)
memory usage: 17.7+ MB


In [7]:
reviews.to_parquet('../datasets/reviews.parquet')

In [3]:
%pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-15.0.0-cp312-cp312-win_amd64.whl.metadata (3.1 kB)
Downloading pyarrow-15.0.0-cp312-cp312-win_amd64.whl (25.3 MB)

In [4]:
import nltk
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\javier\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

# Create an instance of the sentiment intensity analyzer
sid = SentimentIntensityAnalyzer()

# Analyze the sentiment of the first 100 reviews only
for i, review in reviews.head(100).iterrows():
    # Some comments are missing (NaN), so skip anything that is not text
    if not isinstance(review.comments, str):
        continue
    compound = sid.polarity_scores(review.comments)['compound']
    # Print only the reviews with a moderately positive compound score
    if 0.4 < compound < 0.6:
        print(compound, review.listing_id, review.reviewer_name, review.comments, "\n", end=" ")

# sid.polarity_scores("Excellent, genius, best in the world")
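The compound score ranges from -1 to 1, so it can be mapped to a coarse label. The sketch below uses cut-offs of plus and minus 0.05, which are the values commonly suggested for VADER and are treated here as an assumption; the function `label_compound` and the example scores are hypothetical and not computed from the reviews.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, not computed from the reviews
examples = [0.85, 0.0, -0.4, 0.05]
labels = [label_compound(c) for c in examples]
print(labels)
```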

In [6]:
# reviews = pd.read_csv('../datasets/reviews.csv')
reviews

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,11508,1615861,2012-07-02,877808,Charlie,Amazing place!\r\n\r\nLocation: short walk to ...
1,11508,3157005,2012-12-26,656077,Shaily,Really enjoyed Candela's recommendations and q...
2,11508,3281011,2013-01-05,2835998,Michiel,Candela and her colleague were very attentive ...
3,11508,6050019,2013-07-28,4600436,Tara,"The apartment was in a beautiful, modern build..."
4,11508,9328455,2013-12-22,3130017,Simon,My stay at Candela's apartment was very enjoya...
...,...,...,...,...,...,...
387094,42974156,621670219,2020-04-03,270233993,Carolina,Muchas gracias Mariano por la amabilidad en to...
387095,42975917,620648461,2020-03-23,342208450,Guillermo,"Me encanto el lugar. Impecable, moderno, y ate..."
387096,42990298,622364643,2020-04-13,342811096,Heber,"Lugar muy bien ubicado y tal cual las fotos, c..."
387097,43080350,622571105,2020-04-17,184553721,Elisabeth,"The apartment is a beautiful, small and good l..."


In [52]:
listings.iloc[:,-20]

0         95.0
1         95.0
2        100.0
3          NaN
4         99.0
         ...  
23724      NaN
23725      NaN
23726      NaN
23727      NaN
23728      NaN
Name: review_scores_rating, Length: 23729, dtype: float64
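Selecting the column by name rather than by position makes the intent explicit and keeps working if columns are reordered. A small sketch, using the first five values of the output above as hypothetical data, that also measures how much of the rating column is missing:

```python
import numpy as np
import pandas as pd

# First five ratings from the output above, used as a tiny demo table
demo = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "review_scores_rating": [95.0, 95.0, 100.0, np.nan, 99.0],
})

ratings = demo["review_scores_rating"]

# Share of listings without a rating
missing_share = ratings.isna().mean()

# mean() skips NaN values by default
mean_rating = ratings.mean()
print(missing_share, mean_rating)
```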

In [None]:
import pandas as pd

pd.Da