# Phuket Precipitation Analysis

## Intro

Phuket has two main seasons:

 1. Sunny Season, during which all of the regular tourists come.
 2. Rain Season, when the weather is unpredictable and can ruin your trip. The waves are generally high, and swimming in the sea is not allowed.

I started this project because my family is coming to Phuket to visit me during the rainy season. I was concerned about how much it would rain during their stay, because they also want to have fun on the beach. The Rain Season generally runs from May until the end of October, and my family's trip is from 07.07 to 22.07. For that reason, I focus on these dates (i.e. the month of July) from a historical perspective.


## Questions to be answered

 1. Which month is the best to come to Phuket during the Rain Season (i.e. when the weather is primarily clear)?
 2. Which month has the highest precipitation amount and is therefore not worth visiting?
 3. Is the Rain Season worth its name? Is the precipitation amount really so different?
 4. What was the rainiest day in the study period?


## Disclaimer

In this case study, I use Code Table D-7 (Aerodrome present or forecast weather), because the dataset source, Phuket International Airport (many thanks for their work), reports the weather using D-7 codes. You can read more about these weather codes and their use cases via the link above.
Although Code Table D-7 is rich in codes, it assigns no weights to them, so I assigned a weight to each code myself, based on open sources and precipitation amount information.

## Dataset

As mentioned above, the dataset comes from Phuket International Airport, which reports on the weather daily and is the only source of weather conditions on the island that is openly available (for a fee). I bought this dataset from Weather Spark. The study period covers 2015 to 2021. I could have taken a much broader period, but climate change would then affect the outcome of the study.
The dataset contains a lot of information, but we will mostly use the 12 columns of weather codes (and their labels). Each column represents 2 hours of weather observation.

## Process

In [61]:
# Imports
import pandas as pd
import seaborn as sns

### Cleaning and preparation

#### 1. Dataframe selected, duplicates dropped

In [62]:
raw_data = pd.read_csv('../input/phuket-weather-daily-20152021/Phuket Weather Daily 2015-2021 - Copy of 2010-2022 Raw.csv')

main_df = pd.DataFrame(raw_data, columns=[
        'Date (String)',
        'Weather Code 1 (String)',
        'Weather Label 1 (String)',
        'Weather Code 2 (String)',
        'Weather Label 2 (String)',
        'Weather Code 3 (String)',
        'Weather Label 3 (String)',
        'Weather Code 4 (String)',
        'Weather Label 4 (String)',
        'Weather Code 5 (String)',
        'Weather Label 5 (String)',
        'Weather Code 6 (String)',
        'Weather Label 6 (String)',
        'Weather Code 7 (String)',
        'Weather Label 7 (String)',
        'Weather Code 8 (String)',
        'Weather Label 8 (String)',
        'Weather Code 9 (String)',
        'Weather Label 9 (String)',
        'Weather Code 10 (String)',
        'Weather Label 10 (String)',
        'Weather Code 11 (String)',
        'Weather Label 11 (String)',
        'Weather Code 12 (String)',
        'Weather Label 12 (String)'
])

# drop_duplicates() returns a new dataframe, so the result must be assigned back
main_df = main_df.drop_duplicates()

#### 2. Dataset examined (types, null columns etc.)

In [63]:
# .describe() is of little use here, since the dataset holds primarily string values
main_df.info()

Examining the dataset, we can see that these columns are never populated:
- Weather Code 11 (String)
- Weather Label 11 (String)
- Weather Code 12 (String)
- Weather Label 12 (String)

That is why we drop them from our dataframe.

In [64]:
main_df.drop(
    columns=[
        'Weather Code 11 (String)',
        'Weather Label 11 (String)',
        'Weather Code 12 (String)',
        'Weather Label 12 (String)'
    ],
    axis=1,
    inplace=True
)
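
Rather than reading the `.info()` output by eye, the never-populated columns could also be found programmatically. The snippet below is only a minimal sketch on a made-up dataframe (`toy_df` and its values are illustrative, not taken from the dataset), and it assumes that "never populated" means every value in the column is missing.

```python
import pandas as pd

# Toy stand-in for the real data: 'Weather Code 11 (String)' is never populated
toy_df = pd.DataFrame({
    'Date (String)': ['2015-01-01', '2015-01-02', '2015-01-03'],
    'Weather Code 1 (String)': ['-RA', None, 'TSRA'],
    'Weather Code 11 (String)': [None, None, None],
})

# A column counts as empty if every one of its values is missing
empty_columns = toy_df.columns[toy_df.isna().all()].tolist()

# Drop the empty columns and keep the rest
cleaned_df = toy_df.drop(columns=empty_columns)
print(empty_columns)
```

This way the list of columns to drop does not have to be typed out by hand.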

#### 3. Column types checked
The 'Date (String)' column was converted from string to datetime. Two new columns were created to help us aggregate the data later:
- Date Year
- Date Month

In [65]:
main_df['Date (String)'] = pd.to_datetime(main_df['Date (String)'])
# The .dt accessor keeps the extracted values aligned with the dataframe's own index
main_df['Date Year'] = main_df['Date (String)'].dt.year
main_df['Date Month'] = main_df['Date (String)'].dt.month

#### 4. Data completeness checked
Since we are dealing with time-sensitive data, we should check if the last year contains data for the whole year.

In [66]:
main_df.tail()

As we can see, the year 2022 is not complete, so we should drop it from our dataframe.

In [67]:
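# .copy() makes main_df an independent dataframe, so columns can be added later without a SettingWithCopyWarning
main_df = main_df[~main_df['Date Year'].isin([2022])].copy()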
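
Looking at `.tail()` works, but the completeness check could also be done by counting the days observed in each year. This is a rough sketch on generated toy dates (the date range below is made up for illustration); it assumes one row per day and treats a year as incomplete when it has fewer distinct days than the calendar year.

```python
import pandas as pd

# Toy daily data: all of 2021, but only January of 2022 (illustrative values)
dates = pd.date_range('2021-01-01', '2022-01-31', freq='D')
toy_df = pd.DataFrame({'Date (String)': dates})
toy_df['Date Year'] = toy_df['Date (String)'].dt.year

# Number of distinct days observed in each year
days_observed = toy_df.groupby('Date Year')['Date (String)'].nunique()

def days_in_year(year):
    """Return 366 for leap years and 365 otherwise."""
    return 366 if pd.Timestamp(year=int(year), month=1, day=1).is_leap_year else 365

# Years with fewer observed days than the calendar year contains
incomplete_years = [
    int(year) for year, n_days in days_observed.items()
    if n_days < days_in_year(year)
]
print(incomplete_years)
```

The same check would also reveal gaps in the middle of the study period, not only in the last year.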

#### 5. D-7 weather codes dataframe created
As mentioned above, in order to estimate precipitation, we should assign weight values to the given D-7 codes.
I did this manually in a spreadsheet and then transformed the result into lists to create a dataframe.

In [68]:
weather_codes_values = {
        'Weather Code': [
            '-RA', 'TS', 'RA', 'BR', 'RERA', 'HZ', 'RETS', '+TSRA', 'TSRA', 'VCSH', '-TSRA',
            '+RA', 'VCTS', 'RETSRA', '+SHRA', 'IWC17', 'IWC65', 'IWC21', 'IWC60', '-TS', '-DZ',
            'IWC13', 'MIFG', 'IWC95', 'SHRA', 'IWC63', 'IWC61', 'DZ', 'IWC51', 'RESHRA',
            '-SHRA', 'IWC10', '-BR', 'BCFG', 'FG', 'IWC16', 'FU', 'RE', 'PO', 'REDZ', 'FC',
            'REFC'
        ],
        'Weather Label': [
            'light rain', 'thunderstorm', 'rain', 'mist', 'recent rain', 'haze', 'recent thunderstorm',
            'thunderstorm with heavy rain', 'thunderstorm with rain', 'showers in the vicinity',
            'thunderstorm with light rain', 'heavy rain', 'thunderstorm in the vicinity',
            'recent thunderstorm with rain', 'showers of heavy rain', 'thunderstorm', 'heavy rain',
            'rain', 'light intermittent rain', 'mild thunderstorm', 'light drizzle',
            'thunderstorm in the vicinity', 'shallow fog', 'thunderstorm with rain', 'showers of rain',
            'rain', 'light rain', 'drizzle', 'light drizzle', 'recent showers of rain', 'showers of light rain',
            'mist', 'light mist', 'patches of fog', 'fog', 'precipitation in the vicinity', 'smoke',
            'recent unknown weather', 'well-developed dust/sand whirls', 'recent drizzle', 'funnel cloud',
            'recent funnel cloud'
        ],
        'Value': [
            3, 4, 4, 1, 2, 2, 2, 9, 8, 1, 7,
            5, 1, 2, 6, 4, 5, 4, 3, 3, 3,
            1, 2, 8, 5, 4, 3, 4, 3, 2,
            4, 1, 1, 1, 2, 1, 3, 0, 7, 2, 9,
            2
        ]
}

weather_codes_df = pd.DataFrame(weather_codes_values)
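
Because the weights were typed in by hand, it may be worth checking that every code occurring in the data actually has a weight. The following is a sketch under assumptions: `toy_weights_df` is a shortened copy of the weights table, `toy_df` is made up, and `'SQ'` is a hypothetical code used only to show what an uncovered code looks like.

```python
import pandas as pd

# Shortened, illustrative copy of the weights table
toy_weights_df = pd.DataFrame({
    'Weather Code': ['-RA', 'RA', 'TSRA'],
    'Value': [3, 4, 8],
})

# Toy observations; 'SQ' is a hypothetical code with no weight assigned
toy_df = pd.DataFrame({
    'Weather Code 1 (String)': ['-RA', 'RA', None],
    'Weather Code 2 (String)': ['TSRA', 'SQ', None],
})

# Every distinct, non-missing code that occurs in the observations
observed_codes = set(toy_df.melt(value_name='code')['code'].dropna().unique())

# Codes that occur in the data but have no weight yet
missing_codes = sorted(observed_codes - set(toy_weights_df['Weather Code']))

# Each code should appear in the weights table only once
codes_are_unique = toy_weights_df['Weather Code'].is_unique
print(missing_codes, codes_are_unique)
```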

### Weather Condition Score
The Weather Condition Score will be our key metric: a proxy for the daily precipitation amount.
To calculate it, we define a function that sums up the values of the weather codes in each row.

In [69]:
def calc_weather_condition_score(df, row, weather_codes_values_df):
    # `df` is no longer needed; it is kept so that existing calls keep working
    weather_code_columns = [f'Weather Code {i} (String)' for i in range(1, 11)]

    # Build the code -> weight lookup from the weights dataframe
    code_to_value = dict(zip(
        weather_codes_values_df['Weather Code'],
        weather_codes_values_df['Value']
    ))

    score = 0
    for column in weather_code_columns:
        code = row[column]
        # Empty slots are skipped; a code without an assigned weight adds 0
        if pd.notna(code):
            score += code_to_value.get(code, 0)

    return score
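
As a worked example of the scoring rule (a sketch using three weights from the table above and a made-up day, not a row from the dataset): a day reporting `-RA`, `TSRA` and `+RA` scores 3 + 8 + 5 = 16, and empty slots contribute nothing.

```python
import pandas as pd

# Three weights copied from the weights table
code_weights = {'-RA': 3, 'TSRA': 8, '+RA': 5}

# A made-up day: three reported codes, the remaining slot is empty
toy_row = pd.Series({
    'Weather Code 1 (String)': '-RA',
    'Weather Code 2 (String)': 'TSRA',
    'Weather Code 3 (String)': '+RA',
    'Weather Code 4 (String)': None,
})

# Sum the weights of the non-missing codes
toy_score = sum(code_weights[code] for code in toy_row.dropna())
print(toy_score)  # 16
```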

Next, we apply our function to the dataframe and create a new column for the calculated values.

In [70]:
main_df['Weather Condition Score'] = main_df.apply(
    lambda row: calc_weather_condition_score(
        main_df,
        row,
        weather_codes_df
    ),
    axis=1
)
main_df.head(10)
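
Applying a Python function row by row is fine for a few thousand rows, but the same score could probably be computed in a vectorized way. The sketch below is an alternative I am assuming to be equivalent, shown on made-up data (`toy_df` and `toy_weights_df` are illustrative): each code column is mapped to its weights, and the weights are then summed across columns.

```python
import pandas as pd

# Shortened, illustrative copy of the weights table
toy_weights_df = pd.DataFrame({
    'Weather Code': ['-RA', 'RA', 'TSRA'],
    'Value': [3, 4, 8],
})

# Toy observations with empty slots
toy_df = pd.DataFrame({
    'Weather Code 1 (String)': ['-RA', 'RA', None],
    'Weather Code 2 (String)': ['TSRA', None, None],
})

# Lookup series: weather code -> weight
code_to_value = toy_weights_df.set_index('Weather Code')['Value']

code_columns = [c for c in toy_df.columns if c.startswith('Weather Code')]

# Map every code column to its weights; missing or unknown codes become NaN,
# which sum() skips by default
toy_df['Weather Condition Score'] = (
    toy_df[code_columns]
    .apply(lambda column: column.map(code_to_value))
    .sum(axis=1)
    .astype(int)
)
print(toy_df['Weather Condition Score'].tolist())  # [11, 4, 0]
```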

### Question 1
Which month is the best to come to Phuket during the Rain Season (the weather is primarily clear)?

To find an answer, we aggregate the score by year and month. To make the aggregated values easier to use, we create a new dataframe.

In [71]:
aggregated_df = pd.DataFrame(
    main_df.groupby(['Date Year', 'Date Month'])['Weather Condition Score'].sum()
)
aggregated_df.reset_index(inplace=True)
aggregated_df.rename(columns={'Weather Condition Score': 'Score For Month'}, inplace=True)
aggregated_df.head()

To present our data clearly, we create a column with month labels instead of just month numbers.

In [72]:
month_labels = {
    1: 'January',
    2: 'February',
    3: 'March',
    4: 'April',
    5: 'May',
    6: 'June',
    7: 'July',
    8: 'August',
    9: 'September',
    10: 'October',
    11: 'November',
    12: 'December'
}

aggregated_df['Date Month Name'] = aggregated_df['Date Month'].map(month_labels)
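
One thing to watch with month names is that plain strings sort alphabetically, not by calendar. A possible safeguard, sketched here on a made-up three-row dataframe (`toy_df` is illustrative), is to store the names as an ordered categorical so that sorting and plotting follow the calendar order.

```python
import pandas as pd

month_order = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]

# Toy month numbers in non-calendar order
toy_df = pd.DataFrame({'Date Month': [12, 2, 7]})

# Month number -> month name (month 1 is at list position 0)
names = toy_df['Date Month'].map(lambda month: month_order[month - 1])

# Ordered categorical: sorting follows month_order instead of the alphabet
toy_df['Date Month Name'] = pd.Categorical(names, categories=month_order, ordered=True)

sorted_names = toy_df.sort_values('Date Month Name')['Date Month Name'].tolist()
print(sorted_names)  # ['February', 'July', 'December']
```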

In [73]:
sns.boxplot(
    x='Date Month Name',
    y='Score For Month',
    data=aggregated_df,
    palette='YlGnBu'
)

It appears **July** is the best month to come to Phuket during the Rain Season: its median precipitation score is the lowest among the Rain Season months.
During the regular (sunny) season, **February** has the best weather.
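
The reading of the boxplot can be cross-checked numerically by computing the median score per month. This is a sketch with made-up monthly scores (the numbers in `toy_aggregated_df` are illustrative, not real results); it assumes the Rain Season covers May through October, as stated in the intro.

```python
import pandas as pd

# Made-up monthly scores for two years (illustrative only)
toy_aggregated_df = pd.DataFrame({
    'Date Year':       [2020, 2020, 2020, 2021, 2021, 2021],
    'Date Month':      [5, 7, 9, 5, 7, 9],
    'Score For Month': [120, 60, 110, 140, 80, 90],
})

# Keep only the Rain Season months: May (5) through October (10)
rain_season = toy_aggregated_df[toy_aggregated_df['Date Month'].between(5, 10)]

# Median monthly score across years, one value per month
median_by_month = rain_season.groupby('Date Month')['Score For Month'].median()

# Month number with the lowest median score
best_month = median_by_month.idxmin()
print(best_month)  # 7
```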

### Question 2
Which month has the highest precipitation amount and is therefore not worth visiting?

Based on the previous plot, we could say that **May** is the worst month to visit Phuket among all months: its scores are the highest, and its box in the chart is relatively small, meaning the scores vary little from year to year.

### Question 3
Is the Rain Season worth its name? Is the precipitation amount really so different?

To find an answer, I want to calculate the mean precipitation score for each month.

In [74]:
mean_precipitation_df = aggregated_df.groupby(['Date Month', 'Date Month Name'])['Score For Month'].mean().reset_index(name='Mean Precipitation Score')
mean_precipitation_df

In [75]:
sns.lineplot(
    x='Date Month Name',
    y='Mean Precipitation Score',
    data=mean_precipitation_df
)

As we can see, the mean precipitation score in **May** is 700% higher than in **February**. So we can say that yes, the Rain Season is worth its name.
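
For clarity on the arithmetic: "higher by 700%" means the May value is eight times the February value, not seven. The sketch below shows the calculation with hypothetical mean scores chosen only to make the numbers round; `percent_increase` is a helper name introduced here for illustration.

```python
def percent_increase(baseline, value):
    """Return how much `value` exceeds `baseline`, as a percentage of `baseline`."""
    return (value - baseline) / baseline * 100

# Hypothetical mean scores, chosen only to illustrate the arithmetic
feb_mean = 10.0
may_mean = 80.0

increase = percent_increase(feb_mean, may_mean)
print(f'{increase:.0f}%')  # 700%
```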

### Question 4
What was the rainiest day in the study period?

To answer this question, we first calculate the maximum daily score in every month, then find the maximum among the same calendar months across years, and then retrieve the date for each value.

In [76]:
aggregated_df['Max In Month'] = main_df.groupby(['Date Year', 'Date Month'])['Weather Condition Score'].max().reset_index()['Weather Condition Score']
aggregated_df.head()

In [77]:
def calc_most_rainy_day(raw_df, grouped_df):
    df: pd.DataFrame = grouped_df.groupby(['Date Month'])['Max In Month'].max().reset_index(name='Most Rainy Day Value')
    
    dates = list()
    
    for month, value in zip(
        df['Date Month'].values,
        df['Most Rainy Day Value'].values
    ):
        dates.append(raw_df.loc[
            (raw_df['Weather Condition Score'] == value) &
            (raw_df['Date Month'] == month),
            'Date (String)'
        ].head(1).item())
    
    df['Most Rainy Day Date'] = dates
    
    return df

most_rainy_day_df = calc_most_rainy_day(main_df, aggregated_df)
most_rainy_day_df

Now we can take the maximum among these monthly maximums to find the rainiest day in the study period.

In [78]:
max_precipitation_date = most_rainy_day_df.loc[
    (most_rainy_day_df['Most Rainy Day Value'] == most_rainy_day_df['Most Rainy Day Value'].max()),
    ['Most Rainy Day Value', 'Most Rainy Day Date']
]
max_precipitation_date

As we can see, the rainiest day is **10th September 2020**, with a precipitation score of **44**.
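
If only the single rainiest day is needed, without the per-month breakdown, a shorter route is probably `idxmax()` on the daily scores. This is a sketch on made-up rows (`toy_df` and its values are illustrative); note that `idxmax()` returns the first row when several days share the maximum.

```python
import pandas as pd

# Made-up daily scores (illustrative only)
toy_df = pd.DataFrame({
    'Date (String)': pd.to_datetime(['2021-06-01', '2021-06-02', '2021-06-03']),
    'Weather Condition Score': [12, 40, 30],
})

# Row label of the highest score (the first one, if there are ties)
rainiest_idx = toy_df['Weather Condition Score'].idxmax()

# Date and score of that row
rainiest_day = toy_df.loc[rainiest_idx, 'Date (String)']
rainiest_score = toy_df.loc[rainiest_idx, 'Weather Condition Score']
print(rainiest_day.date(), rainiest_score)
```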

## Conclusion
To conclude the study, here are my findings:
1. July is the best month to come to Phuket during the Rain Season, and February is the best during the regular season.
2. May is your worst option if you want to visit Phuket.
3. The Rain Season is worth its name: the mean precipitation score in May is 700% higher than in February.
4. The rainiest day in the study period was 10th September 2020.

I made this case study to consolidate my newly acquired knowledge of data analysis. This analysis is not perfect: many insights remain hidden and many metrics uncalculated. A more complex version may follow in the future, building on the lessons learned here. I chose questions that are interesting to me, but I am open to suggestions.

I hope you now know which month is the best for vacationing on Phuket island. I will be glad if you enjoyed reading.