## Industry Requirement: Identifying the Best Travel Locations for Outdoor Activities Based on Weather Conditions
**Description**:
Travel and tourism companies need to know which cities offer the best conditions for outdoor activities in a given month. Using the hourly weather dataset, we can rank cities by how often their weather meets the following criteria:
* Temperature Range: Between 20°C and 30°C.
* Low Rainfall: Precipitation less than 2 mm.
* Clear Skies: Cloud cover less than 20%.
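
The three criteria can be read as a single per-observation test. The sketch below is a plain-Python illustration, not part of the Spark pipeline. The function name `is_favorable` and the sample values are hypothetical. It treats the temperature bounds as inclusive and the other two thresholds as strict, which matches the filter applied in step 3.

```python
def is_favorable(temp_c: float, precip_mm: float, cloud_pct: float) -> bool:
    """Return True when one hourly observation meets all three criteria."""
    return 20 <= temp_c <= 30 and precip_mm < 2 and cloud_pct < 20


# Hypothetical observations: (temperature in °C, precipitation in mm, cloud cover in %)
samples = [
    (25.0, 0.0, 10.0),  # comfortably inside every threshold
    (20.0, 1.9, 19.9),  # on the inclusive lower temperature bound
    (30.1, 0.0, 0.0),   # slightly too warm
    (25.0, 2.0, 0.0),   # exactly 2 mm of precipitation is excluded
    (25.0, 0.0, 20.0),  # exactly 20% cloud cover is excluded
]

for row in samples:
    print(row, is_favorable(*row))
```

Checking the boundary values this way makes it explicit which edge cases the thresholds admit before the same logic is applied to millions of rows.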

### Steps to Implement in PySpark:
#### 1. Load the Dataset:

In [1]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Best Travel Locations") \
    .getOrCreate()

# Load the weather data, reading the header row and inferring column types
weather_data = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("Datasets/hourly_data_combined_2010_to_2019.csv")

weather_data.show(5)
weather_data.printSchema()


[Output truncated in the source. The visible header starts with: city_name, datetime, temperature_2m, relative_humidity_2m, dew_point_2m, apparent_temperature, ...]

#### 2. Add Derived Columns for Analysis:
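Extract the year and month from the datetime column so the data can be filtered and grouped by period. Only month is used in the ranking in step 4; year remains available for per-year breakdowns.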

In [2]:
from pyspark.sql.functions import col, month, year

# Add year and month columns
weather_data = weather_data.withColumn("year", year(col("datetime"))) \
    .withColumn("month", month(col("datetime")))

weather_data.show(5)


[Output truncated in the source. The visible header starts with: city_name, datetime, temperature_2m, relative_humidity_2m, dew_point_2m, ...; the separator row ends with two extra narrow columns, consistent with the new year and month columns.]
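
To show what the derived columns make possible, here is a small plain-Python sketch of the same idea: split each timestamp into a year and a month, then select one specific period. The timestamps and the helper name `year_month` are hypothetical. The "YYYY-MM-DD HH:MM:SS" layout is an assumption about how the datetime values are formatted.

```python
from datetime import datetime

# Hypothetical timestamps; the real values come from the dataset's datetime column
raw = [
    "2010-01-01 00:00:00",
    "2015-07-04 13:00:00",
    "2015-07-04 14:00:00",
    "2019-12-31 23:00:00",
]


def year_month(ts: str) -> tuple:
    """Parse a timestamp string and return its (year, month) pair."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return parsed.year, parsed.month


# Attach the derived values to each timestamp, mirroring the two new columns
derived = [(ts, *year_month(ts)) for ts in raw]

# Select a single period, here July 2015
july_2015 = [ts for ts, y, m in derived if (y, m) == (2015, 7)]
print(july_2015)
```

Filtering on the (year, month) pair isolates one month of one year, while filtering on month alone would pool every July in the dataset.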

#### 3. Filter for Favorable Conditions:
Keep only the hourly rows that satisfy all three criteria: temperature from 20°C to 30°C inclusive, precipitation below 2 mm, and cloud cover below 20%.

In [3]:
# Filter data for favorable outdoor conditions
favorable_conditions = weather_data.filter(
    (col("temperature_2m") >= 20) & (col("temperature_2m") <= 30) &
    (col("precipitation") < 2) &
    (col("cloud_cover") < 20)
)

favorable_conditions.show(5)


[Output truncated in the source. The visible header starts with: city_name, datetime, temperature_2m, relative_humidity_2m, dew_point_2m, ...]

#### 4. Group and Rank Cities:
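Group the filtered rows by city and month, count the favorable hours in each group, and sort in descending order. Because the grouping key is month alone, each count pools that calendar month across all years in the dataset.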

In [4]:
from pyspark.sql.functions import count

# Count favorable hours per city per month
city_ranking = favorable_conditions.groupBy("city_name", "month").agg(
    count("*").alias("favorable_hours")
).orderBy(col("favorable_hours").desc())

city_ranking.show(10)


+---------+-----+---------------+
|city_name|month|favorable_hours|
+---------+-----+---------------+
|   Muscat|   11|           5530|
|   Naples|    8|           5362|
|   Muscat|    3|           5244|
|   Naples|    7|           5021|
| Istanbul|    7|           4995|
|Abu Dhabi|   11|           4972|
|   Muscat|   12|           4891|
| Istanbul|    8|           4743|
| Tel Aviv|    6|           4647|
| Valencia|    7|           4512|
+---------+-----+---------------+
only showing top 10 rows
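
Raw counts favor longer months and depend on how many years are pooled, so it can help to express each count as a share of the hours that were possible. The sketch below does this in plain Python for the first three rows of the ranking. It assumes the dataset holds one observation for every hour from 2010 through 2019 for each city. That coverage is inferred from the file name only and is not verified here. The helper name `possible_hours` is hypothetical.

```python
import calendar


def possible_hours(month: int, first_year: int = 2010, last_year: int = 2019) -> int:
    """Total hours in the given calendar month, summed over the year range (inclusive)."""
    return sum(
        calendar.monthrange(year, month)[1] * 24  # days in that month * 24 hours
        for year in range(first_year, last_year + 1)
    )


# First three rows of the ranking above: (city_name, month, favorable_hours)
top_rows = [("Muscat", 11, 5530), ("Naples", 8, 5362), ("Muscat", 3, 5244)]

# Share of all hours in that month (pooled over the assumed years) that were favorable
shares = [
    (city, month, hours / possible_hours(month))
    for city, month, hours in top_rows
]

for city, month, share in shares:
    print(f"{city}, month {month}: {share:.1%} of hours favorable")
```

Under the stated assumption, November contributes 7,200 possible hours (30 days × 24 hours × 10 years), so Muscat's 5,530 favorable hours cover roughly three quarters of that month.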

