<a href="https://colab.research.google.com/github/nortonvanz/Texas_Airbnb/blob/main/datasets/NPS_Texas_Airbnb_c1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

Dataset: Austin (Texas) Airbnb, from Inside Airbnb



Objectives - Cycle 1:

Using pyspark:

- 1 Eliminate properties without review
- 2 Identify customers who were detractors in some evaluation
- 3 Get bag of words from these negative reviews

# Cycle 1

## Imports

### PySpark

In [2]:
#pyspark libs and dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
import os
import findspark

#configure JAVA_HOME and SPARK_HOME environment variables in Google Colab, indicating where Java and Apache Spark are installed in the environment
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

#Launch findspark, a Python library that helps you locate the Apache Spark installation on your system and configure the Python environment to interact with Spark.
findspark.init()

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

#Create a Spark session using the PySpark library
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Iniciando com Spark') \
    .config('spark.ui.port', '4050') \
    .getOrCreate()

### Imports

In [25]:
import gzip
import shutil

## Load data

In [69]:
#Load datasets directly from Inside Airbnb into the temporary Colab environment:
  #Datasets: http://insideairbnb.com/get-the-data/

# listings.csv.gz = Detailed Listings Data
!wget --quiet --show-progress http://data.insideairbnb.com/united-states/tx/austin/2023-09-10/data/listings.csv.gz

# reviews.csv.gz = Detailed Review Data (using this .gz because it contains the comments)
!wget --quiet --show-progress http://data.insideairbnb.com/united-states/tx/austin/2023-09-10/data/reviews.csv.gz



In [70]:
#unzip datasets:

# Both files are needed: reviews.csv and listings.csv are loaded in the next cell
for input_file, output_file in [("reviews.csv.gz", "reviews.csv"),
                                ("listings.csv.gz", "listings.csv")]:
    with gzip.open(input_file, 'rb') as f_in:
        with open(output_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    print(f'The file {input_file} was decompressed to {output_file}.')

The file reviews.csv.gz was decompressed to reviews.csv.
The file listings.csv.gz was decompressed to listings.csv.
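
Before decompressing a large file to disk, it can help to check its columns first. The sketch below reads the header and a few rows straight from the compressed file using only the standard library; `peek_gz_csv` is a hypothetical helper name, not part of the notebook. Spark's CSV reader can usually read gzip-compressed files directly as well, so the explicit unzip step may be optional.

```python
import csv
import gzip

def peek_gz_csv(path, n_rows=3):
    """Return the header and the first n_rows records of a gzip-compressed CSV."""
    # 'rt' decompresses on the fly and yields text, so nothing is written to disk
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

For example, `header, rows = peek_gz_csv("listings.csv.gz")` would return the column names of the listings file without creating `listings.csv`.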


In [71]:
# Load real estate dataset
df_houses = spark.read.csv("listings.csv", header=True, inferSchema=True)

# Load review dataset
df_reviews = spark.read.csv("reviews.csv", header=True, inferSchema=True)

In [72]:
#new: preview the first rows
df_houses.show(3)


In [73]:
#number of rows
df_houses.count()

24182
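
This count is higher than the 14861 houses noted later in this notebook, and some rows further down show `null` coordinates. One possible cause (an assumption, not confirmed by the source) is that quoted text fields contain line breaks, so a line-by-line CSV read splits one listing across several rows. The sketch below reproduces that effect on made-up data with Python's `csv` module. If this is the cause, the `multiLine=True` and `escape='"'` options of Spark's CSV reader are the usual remedy.

```python
import csv
import io

# Made-up sample: the second record has a line break inside a quoted field
raw = (
    'id,name,description\n'
    '1,Loft,"Bright and quiet"\n'
    '2,Home,"Two floors.\nBig yard."\n'
)

# Naive view: every physical line after the header counts as a row
physical_rows = len(raw.splitlines()) - 1

# CSV-aware view: quoted line breaks stay inside their record
records = list(csv.DictReader(io.StringIO(raw)))
logical_rows = len(records)

print(physical_rows, logical_rows)  # 3 2
```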

In [74]:
df_reviews.show(3)

+----------+-------+----------+-----------+-------------+--------------------+
|listing_id|     id|      date|reviewer_id|reviewer_name|            comments|
+----------+-------+----------+-----------+-------------+--------------------+
|      5456|    865|2009-03-08|       5267|        Ellen|Sylvia is a hoste...|
|    282342| 913203|2012-02-11|     633688|      Claudia|This is a fantast...|
|    282342|1064098|2012-03-31|    1613219|        Kerry|Chris and his fam...|
+----------+-------+----------+-----------+-------------+--------------------+
only showing top 3 rows



In [75]:
df_reviews.count()

583744

## Create Views

In [76]:
#create views from datasets
df_houses.createOrReplaceTempView("houses")
df_reviews.createOrReplaceTempView("reviews")

## 1 Eliminate properties without review

In [77]:
# Select 1 property: https://www.airbnb.com.br/rooms/5456

In [82]:
#SPARK SQL
spark.sql('''
          SELECT
            *
          FROM houses h
          WHERE h.id = 5456

          --WHERE host_name like '%Serita%'
        ''').show(10, truncate=False)


In [None]:
# Select reviews from property

In [99]:
spark.sql('''
          SELECT
            *
          FROM reviews
          WHERE listing_id = 5456

          --WHERE EXTRACT (YEAR FROM date) = 2023
          --AND reviewer_name = 'Sneha'

        ''').show(10, truncate=False)

[Output truncated: table with columns listing_id, id, date, reviewer_id, reviewer_name, comments]

In [106]:
#Eliminate properties without review
  # build new dataset keeping only the listings that have at least one review:

houses_w_review = spark.sql('''
          SELECT
            *
          FROM houses h
          WHERE h.id IN (SELECT listing_id FROM reviews)

        ''')

houses_w_review.show(5)


In [None]:
#total number of houses = 14861
#total number of houses w/ review = 11758
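
The `IN (SELECT ...)` filter used above keeps a listing when its `id` appears at least once among the review `listing_id` values, no matter how many reviews it has. This is the behaviour of a semi join, which is how Spark typically plans such a query. A minimal pure-Python sketch of the same logic follows; the IDs starting with 999 are made up for illustration.

```python
# Listing IDs and review rows as (listing_id, review_id); 999xxx values are hypothetical
listing_ids = [5456, 282342, 999001, 999002]
reviews = [(5456, 865), (282342, 913203), (282342, 1064098)]

# The subquery: the distinct listing IDs that have at least one review
reviewed_ids = {listing_id for listing_id, _ in reviews}

# The outer filter: each listing is kept at most once, in its original order
listings_with_review = [i for i in listing_ids if i in reviewed_ids]

print(listings_with_review)  # [5456, 282342]
```

Listing 282342 has two reviews but appears only once in the result, which is why the filtered dataset still has one row per listing.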

In [107]:
#create view from dataset
houses_w_review.createOrReplaceTempView("houses_w_reviews")

In [155]:
spark.sql('''
          SELECT
           *
          FROM houses_w_reviews
        ''').show(10) #,truncate=False)


## 2 Identify customers who were detractors in some evaluation

In [165]:
#get evaluation from name, and create new view
houses_w_ratings = spark.sql('''
          SELECT
              id,
              name,
              -- [.] matches a literal dot without relying on backslash escaping
              regexp_extract(name, '★([0-9]*[.][0-9]+)', 1) AS evaluation,
              host_location,
              latitude,
              longitude,
              property_type,
              accommodates
          FROM
              houses_w_reviews
          WHERE
              name LIKE "%★%"
        ''')

#create view from dataset
houses_w_ratings.createOrReplaceTempView("houses_w_ratings")
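
The extraction rule in the query above can be checked outside Spark with Python's `re` module. The sketch below uses the same pattern idea on made-up listing names; `extract_rating` is a hypothetical helper. One difference to keep in mind: Spark's `regexp_extract` returns an empty string when nothing matches, while this helper returns `None`.

```python
import re

# Same pattern idea as the SQL: optional integer part, a dot, then decimals
RATING_PATTERN = re.compile(r"★([0-9]*\.[0-9]+)")

def extract_rating(name):
    """Return the star rating embedded in a listing name, or None if absent."""
    match = RATING_PATTERN.search(name or "")
    return float(match.group(1)) if match else None

# Made-up names for illustration
print(extract_rating("Home in Austin · ★4.95 · 3 bedrooms"))  # 4.95
print(extract_rating("Loft in Austin · ★New · 1 bedroom"))    # None
```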

In [166]:
spark.sql('''
          SELECT
             *
          FROM houses_w_ratings
        ''').show(10) #,truncate=False)

+--------+--------------------+----------+--------------------+--------+---------+--------------------+------------+
|      id|                name|evaluation|       host_location|latitude|longitude|       property_type|accommodates|
+--------+--------------------+----------+--------------------+--------+---------+--------------------+------------+
|17239710|Tiny home in Aust...|      4.95|          Austin, TX|    null|     null|                null|        null|
|18227140|Home in Austin · ...|      4.95|            Kyle, TX| 30.2205|-97.70055|         Entire home|          12|
|20932451|Home in Austin · ...|      4.98|          Austin, TX|30.25544|-97.73423|         Entire home|          16|
|21098269|Loft in Austin · ...|      4.95|          Austin, TX|30.26014| -97.7154|         Entire loft|           2|
| 2322667|Home in Austin · ...|       5.0|          Austin, TX|    null|     null|                null|        null|
|28302431|Townhouse in Aust...|       5.0|          Austin, TX|3

In [173]:
#identify properties with evaluation < 2, to get an example
spark.sql('''
          SELECT
             *
          FROM houses_w_ratings
          WHERE evaluation < 2 AND evaluation > 0
        ''').show(10) #,truncate=False)

+------------------+--------------------+----------+-------------+------------------+------------------+--------------------+------------+
|                id|                name|evaluation|host_location|          latitude|         longitude|       property_type|accommodates|
+------------------+--------------------+----------+-------------+------------------+------------------+--------------------+------------+
|844172257191676124|Home in Austin · ...|      1.40|         null|30.390417104953457|-97.68489815675012|Private room in home|           3|
+------------------+--------------------+----------+-------------+------------------+------------------+--------------------+------------+



In [175]:
#check property evaluations: id 844172257191676124

spark.sql('''
          SELECT
             *
          FROM reviews
          where listing_id = 844172257191676124
        ''').show(10, truncate=False)

[Output truncated: table with columns listing_id, id, date, reviewer_id, reviewer_name, comments]

In [None]:
#these 5 customers above were detractors.
#Run sentiment analysis to identify them all, from their comments

### Sentiment Analysis

In [None]:
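# Define a UDF (User Defined Function) for sentiment analysis
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from textblob import TextBlob

def analyze_sentiment(comment):
    # Reviews can have empty comments; treat them as neutral
    if comment is None:
        return 'Neutral'
    polarity = TextBlob(comment).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity == 0:
        return 'Neutral'
    else:
        return 'Negative'

# Wrap the function so it can be applied to a DataFrame column
sentiment_udf = udf(analyze_sentiment, StringType())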
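
The labelling rule above depends only on the polarity score, so it can be separated from TextBlob and checked on its own. The sketch below does that. `label_polarity` and its `neutral_band` parameter are hypothetical names, and widening the band beyond 0 is an assumption about how weakly polarised comments might be treated, not something the notebook does.

```python
def label_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"

# Made-up polarity scores
scores = [0.8, 0.0, -0.35, 0.05]
print([label_polarity(s) for s in scores])
# ['Positive', 'Neutral', 'Negative', 'Positive']
print([label_polarity(s, neutral_band=0.1) for s in scores])
# ['Positive', 'Neutral', 'Negative', 'Neutral']
```

With the default band of 0 the function reproduces the three-way split used in the notebook.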

In [None]:
# Create a DataFrame with the review comments to be scored
columns = ['comments']
df = df_reviews.select(*columns)

## 3 Get bag of words from these negative reviews
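
This step has no code in the notebook yet. As a minimal sketch of what a bag of words could look like, the plain-Python example below counts word occurrences across a list of comments. The stop-word list, the tokenisation rule, and the sample reviews are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (hypothetical; a real one would be longer)
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "in"}

def bag_of_words(comments):
    """Count word occurrences across comments, ignoring case and stop words."""
    counts = Counter()
    for comment in comments:
        # Lowercase, then keep runs of letters and apostrophes as tokens
        tokens = re.findall(r"[a-z']+", (comment or "").lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Made-up negative reviews, including an empty one
negative_comments = [
    "The room was dirty and the host was rude.",
    "Dirty bathroom, noisy street.",
    None,
]
print(bag_of_words(negative_comments).most_common(1))  # [('dirty', 2)]
```

On the full dataset the same counting would more likely be done in Spark, after filtering the reviews labelled as negative.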

# Improvements for next cycles:
-