# üê≠üè∞ Disneyland Review Analysis in SQL

In [17]:
# Import necessary libraries
import pandas as pd
from pyspark.sql import SparkSession

## 📥 Load/Read the Data








In [18]:
# Read the csv
df = pd.read_csv("DisneylandReviews.csv", encoding='latin-1')

In [19]:
# Preview the dataset
df.head()

|   | Review_ID | Rating | Year_Month | Reviewer_Location | Review_Text | Branch |
|---|-----------|--------|------------|-------------------|-------------|--------|
| 0 | 670772142 | 4 | 2019-4 | Australia | If you've ever been to Disneyland anywhere you... | Disneyland_HongKong |
| 1 | 670682799 | 4 | 2019-5 | Philippines | Its been a while since d last time we visit HK... | Disneyland_HongKong |
| 2 | 670623270 | 4 | 2019-4 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh... | Disneyland_HongKong |
| 3 | 670607911 | 4 | 2019-4 | Australia | HK Disneyland is a great compact park. Unfortu... | Disneyland_HongKong |
| 4 | 670607296 | 4 | 2019-4 | United Kingdom | the location is not in the city, took around 1... | Disneyland_HongKong |


## 📊 Set Up PySpark for SQL



The pyspark.sql module in PySpark is a tool for working with structured data. It allows me to run SQL-like queries on my data.

In [20]:
# Create Spark Session
spark = SparkSession.builder.master("local[*]").appName("ReviewAnalysis").getOrCreate()

In [21]:
# Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)

In [22]:
# Create a SQL table called "reviews"
spark_df.createOrReplaceTempView("reviews")

## 🔍 Queries

### 1) Which Branch Has the Highest Average Rating?


In [23]:
spark.sql("""
SELECT Branch, ROUND(AVG(Rating), 2) AS Avg_Rating
FROM reviews
GROUP BY Branch
ORDER BY Avg_Rating DESC
""").show()

+--------------------+----------+
|              Branch|Avg_Rating|
+--------------------+----------+
|Disneyland_Califo...|      4.41|
| Disneyland_HongKong|       4.2|
|    Disneyland_Paris|      3.96|
+--------------------+----------+



Disneyland California holds the highest average rating at 4.41, indicating strong visitor satisfaction. Disneyland Hong Kong follows at 4.2, and Disneyland Paris has the lowest average at 3.96. Overall, Disneyland California appears to offer the best-received experience, while Disneyland Paris has the most room for improvement. Note that this query includes rows whose Year_Month is 'missing', whereas the later per-branch queries exclude them, which is why the averages reported later (4.22 for Hong Kong and 3.98 for Paris) differ slightly from the ones here.

### 2) How has review count changed over the years?


In [24]:
spark.sql("""
SELECT SUBSTR(Year_Month, 1, 4) AS Year, COUNT(*) AS Review_Count
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Year
ORDER BY Year
""").show()

+----+------------+
|Year|Review_Count|
+----+------------+
|2010|         143|
|2011|        1984|
|2012|        4342|
|2013|        4717|
|2014|        5301|
|2015|        6979|
|2016|        6599|
|2017|        5195|
|2018|        3997|
|2019|         786|
+----+------------+



Review volume for the Disneyland branches rose through the first half of the decade and then fell. Starting from only 143 reviews in 2010, the count jumped to 1,984 in 2011 and kept growing until it peaked at 6,979 in 2015. After 2015 it declined to 6,599 in 2016, 5,195 in 2017, and 3,997 in 2018, with only 786 reviews recorded in 2019. The 2010 and 2019 figures may cover only part of a year (the preview rows above date from April and May 2019), so the endpoints should be read with caution. The decline after 2015 may reflect changes in visitor engagement, park developments or promotions, shifts in travel patterns, or the emergence of new review platforms.
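To make the rise and fall easier to quantify, the yearly counts can be turned into year-over-year percentage changes. This is a minimal sketch in plain Python; the `review_counts` dictionary just restates the query output above, and the variable names are my own.

```python
# Yearly review counts copied from the query output above
review_counts = {
    2010: 143, 2011: 1984, 2012: 4342, 2013: 4717, 2014: 5301,
    2015: 6979, 2016: 6599, 2017: 5195, 2018: 3997, 2019: 786,
}

# Year with the largest number of reviews
peak_year = max(review_counts, key=review_counts.get)

# Percentage change of each year relative to the previous year
years = sorted(review_counts)
yoy_change = {
    year: round((review_counts[year] - review_counts[prev]) / review_counts[prev] * 100, 1)
    for prev, year in zip(years, years[1:])
}

print("Peak year:", peak_year)
for year, change in yoy_change.items():
    print(year, f"{change:+.1f}%")
```

Positive values mark growth years and negative values mark decline years, so the turning point after the peak is visible at a glance.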



### 3) Find the top 5 most common reviewer locations


In [25]:
spark.sql("""
SELECT Reviewer_Location, COUNT(*) as Review_Count
FROM reviews
GROUP BY Reviewer_Location
ORDER BY Review_Count DESC
LIMIT 5
""").show()

+-----------------+------------+
|Reviewer_Location|Review_Count|
+-----------------+------------+
|    United States|       14551|
|   United Kingdom|        9751|
|        Australia|        4679|
|           Canada|        2235|
|            India|        1511|
+-----------------+------------+



The United States is the most common reviewer location with 14,551 reviews, followed by the United Kingdom with 9,751. Australia ranks third with 4,679 reviews, a substantial but smaller contribution than the US and UK. Canada and India round out the top five with 2,235 and 1,511 reviews, respectively. Most reviews therefore come from predominantly English-speaking countries, with India as the largest contributor outside that group.

### 4) Which months have higher review volumes?

In [26]:
spark.sql("""
SELECT
    -- The month is whatever follows 'YYYY-' (one or two digits)
    CAST(SUBSTR(Year_Month, 6, 2) AS INT) AS Month,
    COUNT(*) AS Review_Count
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Month
ORDER BY Month
""").show()

+-----+------------+
|Month|Review_Count|
+-----+------------+
|    1|        2516|
|    2|        2459|
|    3|        3134|
|    4|        3478|
|    5|        3439|
|    6|        3590|
|    7|        3880|
|    8|        3994|
|    9|        3230|
|   10|        3764|
|   11|        2685|
|   12|        3874|
+-----+------------+



The months with higher review volumes are typically those in the middle of the year. August stands out with the highest review count at 3,994, followed closely by July with 3,880 reviews. These summer months likely see an increase in visitors, contributing to more reviews. Additionally, December also has a high review volume, with 3,874 reviews, possibly due to holiday visits and end-of-year travel. Other months with relatively high volumes include October (3,764) and June (3,590). These months suggest that peak tourist seasons, including summer and holidays, generate the most review activity.
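The month extraction in the query relies on Year_Month having the form `YYYY-M` or `YYYY-MM`, with `'missing'` as a placeholder. The same logic can be sketched in plain Python to check how each kind of value is handled. The helper name `parse_month` is hypothetical and not part of the notebook.

```python
def parse_month(year_month):
    """Return the month as an int, or None if the value is not 'YYYY-M' or 'YYYY-MM'."""
    parts = year_month.split("-")
    # Expect exactly two parts (year and month), with a numeric month
    if len(parts) != 2 or not parts[1].isdigit():
        return None
    month = int(parts[1])
    # Reject anything outside the calendar range
    return month if 1 <= month <= 12 else None

for value in ["2019-4", "2015-12", "missing"]:
    print(value, "->", parse_month(value))
```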

### 5) What is the review count for each rating?

In [27]:
spark.sql("""
SELECT Rating, COUNT(*) AS Review_Count
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Rating
ORDER BY Rating
""").show()

+------+------------+
|Rating|Review_Count|
+------+------------+
|     1|        1338|
|     2|        1929|
|     3|        4782|
|     4|       10086|
|     5|       21908|
+------+------------+



The rating distribution is heavily skewed toward positive feedback. 5-star reviews dominate with 21,908, meaning that a majority of reviewers reported a highly positive experience. 4-star reviews follow with 10,086, less than half the 5-star count. 3-star reviews account for 4,782, reflecting a more neutral sentiment, while 2-star and 1-star reviews are far less common at 1,929 and 1,338, respectively. Most visitors are therefore satisfied with their experience, although a small share leave critical feedback.
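The raw counts are easier to compare as shares of the total. The sketch below restates the query output in a dictionary (variable names are my own) and computes the percentage of reviews at each rating, plus the combined share of 4-star and 5-star reviews.

```python
# Review counts per rating, copied from the query output above
rating_counts = {1: 1338, 2: 1929, 3: 4782, 4: 10086, 5: 21908}

total = sum(rating_counts.values())

# Share of all reviews at each rating, in percent
rating_share = {
    rating: round(count / total * 100, 2)
    for rating, count in rating_counts.items()
}

# Combined share of 4-star and 5-star reviews, in percent
positive_share = round((rating_counts[4] + rating_counts[5]) / total * 100, 2)

print("Total reviews:", total)
print("Share per rating (%):", rating_share)
print("4-star or 5-star (%):", positive_share)
```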

### 6) What percentage of reviews mention food and what is the average rating for those reviews?

In [28]:
spark.sql("""
SELECT
    Branch,
    ROUND((COUNT(IF(LOWER(Review_Text) LIKE '%food%', 1, NULL)) / COUNT(*)) * 100, 2) AS Percentage_Food_Mentions,
    ROUND(AVG(Rating), 2) AS Average_Rating
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Branch
""").show()

+--------------------+------------------------+--------------+
|              Branch|Percentage_Food_Mentions|Average_Rating|
+--------------------+------------------------+--------------+
| Disneyland_HongKong|                   26.42|          4.22|
|Disneyland_Califo...|                   18.71|          4.41|
|    Disneyland_Paris|                   32.67|          3.98|
+--------------------+------------------------+--------------+



Food comes up at different rates across the three branches. Disneyland Paris has the highest share of food-related mentions at 32.67% of reviews, followed by Disneyland Hong Kong at 26.42% and Disneyland California at 18.71%. Note that the Average_Rating column is computed over all of a branch's reviews, not only those that mention food: the values (3.98, 4.22, and 4.41) match the overall branch averages in Query 9. The query therefore does not show how food-mentioning reviews are rated. What it does show is that Paris, the branch with the lowest overall rating, is also where food is discussed most often, while California, the highest-rated branch, is where it is discussed least. Whether food itself drives that difference would require averaging the ratings of the food-mentioning reviews separately.
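To average only the reviews that mention food, the rating can be wrapped in a conditional so that non-matching rows contribute NULL, which AVG ignores. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark so that it runs anywhere, on a tiny made-up sample (the branch names and review texts are hypothetical). The `CASE WHEN` form should also be valid in Spark SQL, but I have not rerun it on the real dataset, so no real results are shown.

```python
import sqlite3

# Tiny hypothetical sample; the notebook itself queries the Spark view "reviews"
rows = [
    ("Branch_A", 5, "Great food and short queues"),
    ("Branch_A", 2, "The food was overpriced"),
    ("Branch_A", 4, "Loved the parade"),
    ("Branch_B", 3, "Food was average"),
    ("Branch_B", 5, "Wonderful day out"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (Branch TEXT, Rating INTEGER, Review_Text TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)

query = """
SELECT
    Branch,
    ROUND(100.0 * SUM(CASE WHEN LOWER(Review_Text) LIKE '%food%' THEN 1 ELSE 0 END) / COUNT(*), 2)
        AS Percentage_Food_Mentions,
    -- Rows without a food mention yield NULL here and are skipped by AVG
    ROUND(AVG(CASE WHEN LOWER(Review_Text) LIKE '%food%' THEN Rating END), 2)
        AS Avg_Rating_Food_Reviews,
    ROUND(AVG(Rating), 2) AS Avg_Rating_All_Reviews
FROM reviews
GROUP BY Branch
ORDER BY Branch
"""

results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

Keeping both averages side by side makes it possible to see whether reviews that mention food are rated above or below the branch as a whole.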

### 7) What percentage of reviews mention a line and what is the average rating for those reviews?

In [29]:
spark.sql("""
SELECT
    Branch,
    ROUND((COUNT(IF(LOWER(Review_Text) LIKE '%line%', 1, NULL)) / COUNT(*)) * 100, 2) AS Percentage_Line_Mentions,
    ROUND(AVG(Rating), 2) AS Average_Rating
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Branch
""").show()

+--------------------+------------------------+--------------+
|              Branch|Percentage_Line_Mentions|Average_Rating|
+--------------------+------------------------+--------------+
| Disneyland_HongKong|                   18.42|          4.22|
|Disneyland_Califo...|                   31.66|          4.41|
|    Disneyland_Paris|                    16.2|          3.98|
+--------------------+------------------------+--------------+



The share of reviews mentioning lines differs noticeably between the three branches. Disneyland California has the highest share at 31.66%, followed by Disneyland Hong Kong at 18.42% and Disneyland Paris at 16.2%. As in the food query, the Average_Rating column is the average over all of a branch's reviews (4.41, 4.22, and 3.98), not only those that mention lines, so it says nothing about how line-mentioning reviews are rated. The result does show that California combines the most frequent line mentions with the highest overall rating, which suggests that discussing wait times does not by itself go hand in hand with a low-rated branch. Two caveats apply: the pattern '%line%' also matches unrelated words such as "online", and visitors may describe waiting with other words such as "queue", so these percentages are only a rough proxy for wait-time complaints.
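The substring caveat can be illustrated with a few made-up review snippets. The sketch below compares a plain substring test, which mirrors `LIKE '%line%'`, with a word-boundary regular expression. The sample texts and function names are hypothetical; in Spark SQL, the `RLIKE` operator could express a similar pattern, though I have not applied it to the dataset here.

```python
import re

# Matches "line" or "lines" as whole words, case-insensitively
LINE_PATTERN = re.compile(r"\blines?\b", re.IGNORECASE)

def mentions_line_substring(text):
    """Mirror of LIKE '%line%': true for any occurrence of the substring."""
    return "line" in text.lower()

def mentions_line_word(text):
    """True only when 'line' or 'lines' appears as a separate word."""
    return LINE_PATTERN.search(text) is not None

samples = [
    "The line for the ride was two hours long",
    "We booked our tickets online",
    "Lines were short in the morning",
]

for text in samples:
    print(mentions_line_substring(text), mentions_line_word(text), "|", text)
```

The second sample is where the two approaches disagree: the substring test counts "online" as a line mention, while the word-boundary pattern does not.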

### 8) What percentage of reviews for each branch have a rating higher than the average rating of all reviews?

In [30]:
spark.sql("""
SELECT
    Branch,
    ROUND(
        (COUNT(IF(Rating > (SELECT AVG(Rating) FROM reviews WHERE Year_Month != 'missing'), 1, NULL)) / COUNT(*)) * 100,
        2
    ) AS Percentage_Higher_Than_Avg
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Branch
""").show()

+--------------------+--------------------------+
|              Branch|Percentage_Higher_Than_Avg|
+--------------------+--------------------------+
| Disneyland_HongKong|                     47.43|
|Disneyland_Califo...|                     64.84|
|    Disneyland_Paris|                     45.44|
+--------------------+--------------------------+



Because ratings are whole numbers and the overall average is about 4.23 (computed from the rating counts in Query 5), a rating is above average only if it is 5 stars. This query therefore reports each branch's share of 5-star reviews. Disneyland California has the highest share at 64.84%, meaning that nearly two thirds of its reviewers gave the top rating. Disneyland Hong Kong follows with 47.43% and Disneyland Paris has the lowest share at 45.44%, so in both branches fewer than half of the reviews are 5-star. These differences are consistent with the average ratings: California benefits from a larger proportion of top-rated visits than the other two branches.
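The overall average that the subquery compares against can be recovered from the rating counts in Query 5, which use the same `Year_Month != 'missing'` filter. A minimal sketch, with variable names of my own choosing:

```python
# Review counts per rating, copied from the Query 5 output
rating_counts = {1: 1338, 2: 1929, 3: 4782, 4: 10086, 5: 21908}

total = sum(rating_counts.values())

# Weighted mean: each rating value weighted by how many reviews gave it
mean_rating = sum(rating * count for rating, count in rating_counts.items()) / total

# Rating values that are strictly greater than the overall mean
above_mean = [rating for rating in sorted(rating_counts) if rating > mean_rating]

print("Overall mean rating:", round(mean_rating, 2))
print("Ratings above the mean:", above_mean)
```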

### 9) What is the average rating and the average review length for each branch?



In [31]:
spark.sql("""
SELECT
    Branch,
    ROUND(AVG(Rating), 2) AS Average_Rating,
    ROUND(AVG(LENGTH(Review_Text)), 2) AS Average_Review_Length
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Branch
""").show()

+--------------------+--------------+---------------------+
|              Branch|Average_Rating|Average_Review_Length|
+--------------------+--------------+---------------------+
| Disneyland_HongKong|          4.22|               542.84|
|Disneyland_Califo...|          4.41|               607.54|
|    Disneyland_Paris|          3.98|               877.91|
+--------------------+--------------+---------------------+



Disneyland California has the highest average rating at 4.41, with an average review length of 607.54 characters, noticeably shorter than the reviews for Disneyland Paris. Disneyland Hong Kong, rated 4.22 on average, has the shortest reviews at 542.84 characters, possibly indicating more concise feedback from visitors. Disneyland Paris has both the lowest average rating at 3.98 and the longest reviews at 877.91 characters. This combination of lower ratings and longer reviews might indicate that customers write more when they are dissatisfied or have specific comments to make. The differences in review length across branches may also reflect customer expectations, experiences, and cultural differences.

### 10) Is there a correlation between the length of reviews and the ratings given by customers?



In [32]:
spark.sql("""
SELECT
    Branch,
    ROUND(CORR(LENGTH(Review_Text), Rating), 2) AS Review_Length_Rating_Correlation
FROM reviews
WHERE Year_Month != 'missing'
GROUP BY Branch
""").show()

+--------------------+--------------------------------+
|              Branch|Review_Length_Rating_Correlation|
+--------------------+--------------------------------+
| Disneyland_HongKong|                           -0.03|
|Disneyland_Califo...|                           -0.17|
|    Disneyland_Paris|                           -0.16|
+--------------------+--------------------------------+



The correlation between review length and rating is weak and negative in all three branches. Disneyland Hong Kong shows a correlation of -0.03, indicating almost no linear relationship between review length and rating. Disneyland California and Disneyland Paris show correlations of -0.17 and -0.16, respectively, meaning that longer reviews tend to come with slightly lower ratings. Because the correlations are weak, review length explains very little of the variation in ratings, and a correlation alone does not show that one causes the other. Other factors likely play a more significant role.
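Spark's `CORR` returns the Pearson correlation coefficient. To show what that number measures, here is a minimal plain-Python sketch of the same formula applied to a handful of made-up review lengths and ratings. The data and the function name `pearson_corr` are hypothetical and are not taken from the dataset.

```python
from math import sqrt

def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel between numerator and denominator, so sums suffice
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical review lengths (in characters) and the ratings given
review_lengths = [120, 300, 450, 800, 1500]
ratings = [5, 4, 5, 3, 2]

r = pearson_corr(review_lengths, ratings)
print("Correlation:", round(r, 2))
```

Values near 0 mean that length and rating barely move together, while values near -1 or 1 indicate a strong linear relationship.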