<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 1: Data Analysis of Singapore Rainfall

--- 
# Part 1

Part 1 requires knowledge of basic Python.

---

### Contents:
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-Data)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

According to the [Meteorological Service Singapore](http://www.weather.gov.sg/climate-climate-of-singapore/#:~:text=Singapore%20is%20situated%20near%20the,month%2Dto%2Dmonth%20variation.), Singapore has a typical tropical climate with abundant rainfall, high and uniform temperatures and high humidity all year round, since it is situated near the equator. Many factors help us understand the climate of a country; this project looks into a few of them, with a focus on rainfall.

Singapore’s climate is characterised by two main monsoon seasons separated by inter-monsoonal periods.  The **Northeast Monsoon** occurs from December to early March, and the **Southwest Monsoon** from June to September.

The major weather systems affecting Singapore that can lead to heavy rainfall are:

- Monsoon surges, or strong wind episodes in the Northeast Monsoon flow bringing about major rainfall events;
- Sumatra squalls, an organised line of thunderstorms travelling eastward across Singapore, having developed over the island of Sumatra or Straits of Malacca west of us;
- Afternoon and evening thunderstorms caused by strong surface heating and by the sea breeze circulation that develops in the afternoon.

Singapore’s climate station has been located at several different sites in the past 140 years. The station had been decommissioned at various points in the past due to changes to local land use in the site’s vicinity, and had to be relocated. Since 1984, the climate station has been located at **Changi**.

There are other metrics of climate such as temperature, humidity, sunshine duration, wind speed and cloud cover. All the weather datasets used in this project come from [data.gov.sg](data.gov.sg), as recorded at the Changi climate station.


### Choose your Data

There are 2 datasets included in the [`data`](./data/) folder for this project. These correspond to rainfall information.

* [`rainfall-monthly-number-of-rain-days.csv`](./data/rainfall-monthly-number-of-rain-days.csv): Monthly number of rain days from 1982 to 2022. A day is considered to have “rained” if the total rainfall for that day is 0.2 mm or more.
* [`rainfall-monthly-total.csv`](./data/rainfall-monthly-total.csv): Monthly total rainfall recorded in mm (millimeters) from 1982 to 2022.

Other relevant weather datasets from [data.gov.sg](data.gov.sg) that you can download and use are as follows:

* [Relative Humidity](https://data.gov.sg/dataset/relative-humidity-monthly-mean)
* [Monthly Maximum Daily Rainfall](https://data.gov.sg/dataset/rainfall-monthly-maximum-daily-total)
* [Hourly wet bulb temperature](https://data.gov.sg/dataset/wet-bulb-temperature-hourly)
* [Monthly mean sunshine hours](https://data.gov.sg/dataset/sunshine-duration-monthly-mean-daily-duration)
* [Surface Air Temperature](https://data.gov.sg/dataset/surface-air-temperature-mean-daily-minimum)

You can also use other datasets for your analysis. Make sure to cite the source when you use them.

Datasets:</br>
___Weather data extracted from data.gov.sg___
1) Monthly number of rainy days from 1982 to 2022
2) Monthly total recorded rainfall in mm (millimeters) from 1982 to 2022
3) Monthly relative humidity (%) from 1982 to 2022
4) Monthly sunshine duration (hours) from 1982 to 2022</br>

___Historical inflation rate extracted from inflationrate.com___
1) Monthly inflation rate (%) from 1982 to 2022

## Problem Statement

The Outdoor Collectives (TOC) is an outdoor apparel shop known for its handheld electronic fans, lightweight umbrellas and sunblock lotions. Established in the early 2000s in Honolulu, Hawaii, TOC has since expanded its operations to America and has set its sights on the ASEAN region, with Singapore as the first country in its market entry strategy. At the same time, TOC has identified that consumers are becoming increasingly reluctant to carry umbrellas, electronic fans and sunblock lotion as part of their everyday carry (EDC), citing the additional space these items take up in their bags. As such, TOC would also like to explore setting up temporary retail spaces around Singapore to sell its products.

TOC's target consumers are Singapore residents of all age groups. The organisation wants to ensure proper inventory management throughout the year, taking Singapore's weather conditions into account. It anticipates that sales of sunblock lotion and visors will be higher on days with more sunlight, and that sales of electronic fans will be higher on days with high humidity levels. Sales of umbrellas are expected to be higher on rainy days.

This study analyses Singapore's historical weather data to help TOC plan supplies of its products efficiently throughout the year.

### Outside Research

Based on your problem statement and your chosen datasets, spend some time doing outside research on how climate change is affecting different industries or additional information that might be relevant. Summarize your findings below. If you bring in any outside tables or charts, make sure you are explicit about having borrowed them. If you quote any text, make sure that it renders as being quoted. **Make sure that you cite your sources.**

___How Weather Affects Consumer behavior and Purchase Decisions___ </br>
Source: https://www.weatherads.io/blog/how-weather-affects-consumer-behavior-and-purchase-decisions </br>
"Weather has a deep-rooted effect on consumer psychology and purchase behavior." </br>
This article provides insights into how weather affects an individual's mood and drives them towards certain purchase decisions.

___How the Consumer Price Index Measures The Cost of Living and Inflation___ </br>
Source: https://www.cnbc.com/select/how-the-consumer-price-index-measures-the-cost-of-living-and-inflation/ </br>
"CPI weighs certain spending categories more than others based on the things people, on average, are spending their money on." </br>
This source explains how inflation affects consumers' purchasing decisions across different categories, such as energy or food.

___It’s the Weather: Quantifying the Impact of Weather on Retail Sales___ </br>
Source: https://link.springer.com/article/10.1007/s12061-021-09397-0 </br>
"There is an inherent seasonality in the retail sector, which occurs both in overall sales volumes and individual product sales Jang (2004) attributes this seasonality to two main factors, the natural climatic seasons and their corresponding weather conditions, and institutional seasons reflecting social norms such as religious or school holidays. "

"Research shows that retail sales are influenced by various economic measures, such as the consumer price index (CPI), disposable income, consumer confidence and unemployment levels."

This source further emphasises the impact of weather on retail sales, coupled with the effects of inflation, in driving consumers' purchasing decisions.

### Coding Challenges

1. Manually calculate mean:

    Write a function that takes in values and returns the mean of the values. Create a list of numbers that you test on your function to check to make sure your function works!
    
    *Note*: Do not use any mean methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

## Function to calculate mean

In [None]:
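def calculate_mean(val_list):
    # Accumulate the running total and the number of values seen
    total_sum = 0
    count = 0
    for x in val_list:
        total_sum += x
        count += 1
    # Guard against division by zero on an empty input
    if count == 0:
        raise ValueError("calculate_mean requires at least one value")
    return total_sum / count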

In [None]:
# test data
list_num=[1,2,3,4,5]

In [None]:
calculate_mean(list_num)

2. Manually calculate standard deviation:

    The formula for standard deviation is below:

    $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

    Where $x_i$ represents each value in the dataset, $\mu$ represents the mean of all values in the dataset and $n$ represents the number of values in the dataset.

    Write a function that takes in values and returns the standard deviation of the values using the formula above. Hint: use the function you wrote above to calculate the mean! Use the list of numbers you created above to test on your function.
    
    *Note*: Do not use any standard deviation methods built-in to any Python libraries to do this! This should be done without importing any additional libraries.

## Function to calculate standard deviation

In [None]:
def calculate_sd(val_list):
    # n is the number of values and mu is their mean
    # (population formula: the sum of squared deviations is divided by n)
    n = len(val_list)
    mu = calculate_mean(val_list)
    # Sum of squared deviations from the mean
    squared_dev_sum = 0
    for x in val_list:
        squared_dev_sum += (x - mu) ** 2
    return (squared_dev_sum / n) ** 0.5

In [None]:
calculate_sd(list_num)
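
As a hand check on the two functions, here is the arithmetic for the test list `[1, 2, 3, 4, 5]`, using the population formula above (dividing by $n$):

$$\mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3$$

$$\sigma = \sqrt{\frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5}} = \sqrt{\frac{10}{5}} = \sqrt{2} \approx 1.414$$

So `calculate_mean(list_num)` should return 3.0 and `calculate_sd(list_num)` should return roughly 1.414.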

--- 
# Part 2

Part 2 requires knowledge of Pandas, EDA, data cleaning, and data visualization.

---

*All libraries used should be added here*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

%matplotlib inline

## Data Import and Cleaning

### Data Import & Cleaning

Import all the datasets that you selected for this project and go through the following steps at a minimum. You are welcome to do further cleaning as you feel necessary. Make sure to comment your code to showcase the intent behind the data processing step.
1. Display the data: print the first 5 rows of each dataframe to your Jupyter notebook.
2. Check for missing values and datatype.
3. Check for any obvious issues with the observations.
4. Fix any errors you identified in steps 2-3.
6. Fix any incorrect data types found in step 2.
    - Fix any individual values preventing other columns from being the appropriate type.
    - If the month column data is better analyzed as month and year, create new columns for the same
7. Rename Columns.
    - Column names should be all lowercase.
    - Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
    - Column names should be unique and informative.
8. Drop unnecessary rows (if needed).
9. Merge dataframes that can be merged.
    - Since different climate metrics are in month format, you can merge them into one single dataframe for easier analysis
10. Perform any additional cleaning that you feel is necessary.
11. Save your cleaned and merged dataframes as csv files.
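
The column-naming rules in step 7 can be applied in one pass. The snippet below is a minimal sketch on a toy dataframe; the header names `Month` and `Total Rainfall` are made up for illustration and are not the project's real column names.

```python
import pandas as pd

# Toy frame with hypothetical messy headers (not the project's real columns)
toy = pd.DataFrame({"Month": ["1982-01"], "Total Rainfall": [107.1]})

# Trim whitespace, lowercase every header, and replace spaces with underscores
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(toy.columns))
```

After this, `toy.total_rainfall` and `toy['total_rainfall']` both work.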

### Import dataset

In [None]:
# Import data
df_monthly_rain_days = pd.read_csv('../data/rainfall-monthly-number-of-rain-days.csv')
df_monthly_rain_total = pd.read_csv('../data/rainfall-monthly-total.csv')
df_monthly_humidity = pd.read_csv('../data/relative-humidity-monthly-mean.csv')
df_monthly_sunshine = pd.read_csv('../data/sunshine-duration-monthly-mean-daily-duration.csv')
df_historical_infl = pd.read_csv('../data/singapore-historical-inflation.csv')

#### Singapore Historical Inflation Dataset

In [None]:
# Display first 5 rows of data
df_historical_infl.head()

In [None]:
# Identify the shape
df_historical_infl.shape

In [None]:
# Drop "Annual" column as it will not be used as part of EDA
df_historical_infl.drop(labels=['Annual'], axis=1, inplace=True)

In [None]:
# Prepare months in a list to reshape
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Map each month abbreviation to its month number (1-12)
month_map = dict(zip(months, range(1, 13)))

In [None]:
# Perform melt
df_melted_infl = df_historical_infl.melt(id_vars=["Year"], value_vars=months, var_name="Month", value_name="inflation")

In [None]:
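# Convert month abbreviations to numbers, then build a 'YYYY-MM' key
df_melted_infl = df_melted_infl.replace({"Month": month_map})
ym = [f"{y}-{m:02d}" for y, m in df_melted_infl[["Year", "Month"]].values]
df_melted_infl.insert(0, "Year-Month", ym)
# The separate Year and Month columns are no longer needed
df_melted_infl.drop(["Year", "Month"], axis=1, inplace=True)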

In [None]:
# Remove '%' so the values can be converted to numbers
df_melted_infl['inflation'] = df_melted_infl['inflation'].str.replace('%', '', regex=False)

In [None]:
# Check reshaped table
df_melted_infl.head()

In [None]:
df_melted_infl.info()

In [None]:
# Convert 'inflation' data type from object to float
# 'Year-Month' data type will be converted to date time after merging with other datasets
df_melted_infl['inflation'] = pd.to_numeric(df_melted_infl['inflation'])

In [None]:
# Rename header for Year-Month for ease of merging after
df_melted_infl.rename(columns={'Year-Month':'month'}, inplace=True)
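
The reshaping steps above (wide to long, month key, percent stripping) can be seen end to end on a small example. This is a sketch on invented values; the two-month table and the name `long_df` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per month (values invented)
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "Jan": ["1.0%", "2.0%"],
    "Feb": ["1.5%", "2.5%"],
})
month_map = {"Jan": 1, "Feb": 2}

# Wide -> long: one row per (Year, Month) pair
long_df = wide.melt(id_vars=["Year"], value_vars=["Jan", "Feb"],
                    var_name="Month", value_name="inflation")

# Build a 'YYYY-MM' key that matches the format of the weather datasets
month_num = long_df["Month"].map(month_map).astype(str).str.zfill(2)
long_df["month"] = long_df["Year"].astype(str) + "-" + month_num

# Strip '%' and convert the strings to floats
long_df["inflation"] = pd.to_numeric(
    long_df["inflation"].str.replace("%", "", regex=False)
)

long_df = long_df[["month", "inflation"]].sort_values("month").reset_index(drop=True)
print(long_df)
```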

#### Monthly Rainy Days Dataset

In [None]:
# Display first 5 rows of data for monthly number of rain days
df_monthly_rain_days.head()

#### Monthly Total Rain Dataset

In [None]:
# Display first 5 rows of data
df_monthly_rain_total.head()

In [None]:
# Check missing values and data types for the monthly total rainfall dataset
df_monthly_rain_total.info()

#### Monthly Mean Humidity Dataset

In [None]:
# Display first 5 rows of data
df_monthly_humidity.head()

In [None]:
df_monthly_humidity.info()

#### Monthly Duration of Sunshine Dataset

In [None]:
# Display first 5 rows of data
df_monthly_sunshine.head()

In [None]:
# High level overview of dataset
df_monthly_sunshine.info()

### Merging of the datasets

In [None]:
# Merge df_monthly_rain_days and df_monthly_rain_total
df_merge1 = pd.merge(df_monthly_rain_days, df_monthly_rain_total, how='left')

In [None]:
# Merge with df_monthly_humidity
df_merge2 = pd.merge(df_merge1, df_monthly_humidity, how='right')

In [None]:
# Merge with df_monthly_sunshine
df_merge3 = pd.merge(df_merge2, df_monthly_sunshine, how='left')

In [None]:
# Merge with df_melted_infl
df_merged = pd.merge(df_merge3, df_melted_infl, how='left')
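
The merges above rely on pandas finding the shared column automatically. A more defensive variant, sketched below on toy data with invented values, names the key explicitly and uses the `validate` and `indicator` options of `merge` to surface duplicate or unmatched months.

```python
import pandas as pd

# Toy monthly tables (invented values); humidity is missing one month on purpose
rain_days = pd.DataFrame({"month": ["1982-01", "1982-02"], "no_of_rainy_days": [10, 5]})
rain_total = pd.DataFrame({"month": ["1982-01", "1982-02"], "total_rainfall": [107.1, 27.8]})
humidity = pd.DataFrame({"month": ["1982-01"], "mean_rh": [81.2]})

# validate="one_to_one" raises if a month appears more than once on either side
merged = rain_days.merge(rain_total, on="month", how="left", validate="one_to_one")

# indicator=True adds a '_merge' column showing where each row came from
merged = merged.merge(humidity, on="month", how="left",
                      validate="one_to_one", indicator=True)

print(merged)
```

Rows marked `left_only` in `_merge` are months with no match in the right-hand table, which is where null values come from after a left merge.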

### Inspect outcome of merged datasets

In [None]:
df_merged.info()

___Note___: The 'month' column has yet to be converted to datetime format.

### Check for null values and convert month to datetime format

In [None]:
# Identify null values
df_merged.isnull().sum()

In [None]:
# Drop null values and assign to a new dataframe
# (.copy() makes it independent of df_merged and avoids SettingWithCopyWarning)
cleaned_df = df_merged.dropna().copy()

In [None]:
# Convert the 'month' column to datetime format
cleaned_df['month'] = pd.to_datetime(cleaned_df['month'], format='%Y-%m')

### Data Dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). 

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry: 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**county_pop**|*integer*|2010 census|The population of the county (units in thousands, where 2.5 represents 2500 people).| 
|**per_poverty**|*float*|2010 census|The percent of the county over the age of 18 living below the 200% of official US poverty rate (units percent to two decimal places 98.10 means 98.1%)|

[Here's a quick link to a short guide for formatting markdown in Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

Provided is the skeleton for formatting a markdown table, with columns headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. **This would be a great thing to copy and paste into your custom README for this project.**

*Note*: if you are unsure of what a feature is, check the source of the data! This can be found in the README.

In [None]:
# Recap of categories
cleaned_df.info()

**To-Do:** *Edit the table below to create your own data dictionary for the datasets you chose.*

|Feature|Type|Dataset|Description|
|---|---|---|---|
|month|datetime64[ns]|rainfall-monthly-number-of-rain-days|Months from 1982 to 2022| 
|no_of_rainy_days|float64|rainfall-monthly-number-of-rain-days|No. of rainy days per month from 1982 to 2022|
|total_rainfall|float64|rainfall-monthly-total|Amount of rainfall (mm) per month from 1982 to 2022|
|mean_rh|float64|relative-humidity-monthly-mean|Humidity levels (%) per month from 1982 to 2022|
|mean_sunshine_hrs|float64|sunshine-duration-monthly-mean-daily-duration|Mean daily hours of sunshine for each month from 1982 to 2022| 
|inflation|float64|singapore-historical-inflation|Inflation rate (%) per month from 1982 to 2022|

## Exploratory Data Analysis

Complete the following steps to explore your data. You are welcome to do more EDA than the steps outlined here as you feel necessary:
1. Summary Statistics.
2. Use a **dictionary comprehension** to apply the standard deviation function you created in Part 1 to each numeric column in the dataframe.  **No loops**.
    - Assign the output to variable `sd` as a dictionary where: 
        - Each column name is now a key 
        - That standard deviation of the column is the value 
        - *Example Output :* `{'rainfall-monthly-total': xxx, 'no_of_rainy_days': xxx, ...}`
3. Investigate trends in the data.
    - Using sorting and/or masking (along with the `.head()` method to avoid printing our entire dataframe), consider questions relevant to your problem statement. Some examples are provided below (but feel free to change these questions for your specific problem):
        - Which month has the highest and lowest total rainfall in 1990, 2000, 2010 and 2020?
        - Which year has the highest and lowest total rainfall in the date range of analysis?
        - Which month has the highest and lowest number of rainy days in 1990, 2000, 2010 and 2020?
        - Which year has the highest and lowest number of rainy days in the date range of analysis?
        - Are there any outlier months in the dataset?
       
    - **The above 5 questions are compulsory. Feel free to explore other trends based on the datasets that you have chosen for analysis. You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.

### Summary Statistics

In [None]:
# Summary statistics
cleaned_df.describe()

In [None]:
# Dictionary comprehension applying calculate_sd (from Part 1) to each numeric column
numeric_columns = cleaned_df.select_dtypes(include='number').columns
sd = {col: calculate_sd(cleaned_df[col]) for col in numeric_columns}

In [None]:
sd
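
One thing worth knowing when comparing results: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`, dividing by $n-1$), while the formula in Part 1 divides by $n$. The sketch below shows the difference on toy data; `population_sd` is a stand-in written for this example, not the notebook's own function.

```python
import pandas as pd

def population_sd(values):
    """Population standard deviation (divide by n), matching the Part 1 formula."""
    n = len(values)
    mean = sum(values) / n
    return (sum((x - mean) ** 2 for x in values) / n) ** 0.5

# Toy data, invented for illustration
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2.0, 4.0, 6.0, 8.0, 10.0]})

# Dictionary comprehension: column name -> population standard deviation
sd_manual = {col: population_sd(toy[col]) for col in toy.columns}

sd_pandas_sample = toy.std().to_dict()      # ddof=1 by default (sample SD)
sd_pandas_pop = toy.std(ddof=0).to_dict()   # ddof=0 matches the Part 1 formula

print(sd_manual)
print(sd_pandas_sample)
print(sd_pandas_pop)
```

With several hundred monthly observations the two versions differ only slightly, but they will not match exactly.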

### Which month has the highest and lowest total rainfall in 1990, 2000, 2010 and 2020?

In [None]:
cleaned_df.head()

In [None]:
# Filter
df_1990 = cleaned_df[(cleaned_df['month']>='1990-01-01') & (cleaned_df['month']<='1990-12-01')]
df_2000 = cleaned_df[(cleaned_df['month']>='2000-01-01') & (cleaned_df['month']<='2000-12-01')]
df_2010=cleaned_df[(cleaned_df['month']>='2010-01-01') & (cleaned_df['month']<='2010-12-01')]
df_2020=cleaned_df[(cleaned_df['month']>='2020-01-01') & (cleaned_df['month']<='2020-12-01')]

In [None]:
# Max value
df_1990[df_1990['total_rainfall']==df_1990['total_rainfall'].max()]

In [None]:
# Min value
df_1990[df_1990['total_rainfall']==df_1990['total_rainfall'].min()]

___Answer___: </br>
For the year 1990, September experienced the highest total rainfall (204.5 mm) while February experienced the lowest total rainfall (24.1 mm).

In [None]:
# Max value
df_2000[df_2000['total_rainfall']==df_2000['total_rainfall'].max()]

In [None]:
# Min value
df_2000[df_2000['total_rainfall']==df_2000['total_rainfall'].min()]

___Answer___: </br>
For the year 2000, November experienced the highest total rainfall (385.7 mm) while February experienced the lowest total rainfall (81.1 mm).

In [None]:
# Max value
df_2010[df_2010['total_rainfall']==df_2010['total_rainfall'].max()]

In [None]:
# Min value
df_2010[df_2010['total_rainfall']==df_2010['total_rainfall'].min()]

___Answer___: </br>
For the year 2010, July experienced the highest total rainfall (298.5 mm) while February experienced the lowest total rainfall (6.3 mm).

In [None]:
# Max value
df_2020[df_2020['total_rainfall']==df_2020['total_rainfall'].max()]

In [None]:
# Min value
df_2020[df_2020['total_rainfall']==df_2020['total_rainfall'].min()]

___Answer___: </br>
For the year 2020, May experienced the highest total rainfall (255.6 mm) while February experienced the lowest total rainfall (65 mm).
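
The four year-by-year lookups above repeat the same pattern, so they could be folded into a short loop that combines boolean masking with sorting. This is a sketch on invented values, assuming a datetime `month` column and a `total_rainfall` column as in the cleaned dataframe; `results` is a name introduced here.

```python
import pandas as pd

# Toy data: invented rainfall totals for three months in each of two years
toy = pd.DataFrame({
    "month": pd.to_datetime(
        ["1990-01", "1990-02", "1990-03", "2000-01", "2000-02", "2000-03"],
        format="%Y-%m",
    ),
    "total_rainfall": [150.0, 20.0, 90.0, 200.0, 80.0, 300.0],
})

results = {}
for year in [1990, 2000]:
    # Boolean mask selecting one calendar year
    subset = toy[toy["month"].dt.year == year]
    # Sort so the wettest month comes first and the driest comes last
    ranked = subset.sort_values("total_rainfall", ascending=False)
    wettest, driest = ranked.iloc[0], ranked.iloc[-1]
    results[year] = (wettest["month"].month, driest["month"].month)
    print(year, "wettest:", wettest["month"].strftime("%B"),
          "driest:", driest["month"].strftime("%B"))
```

Note that `iloc[0]` returns a single row, so ties for the maximum or minimum would need the equality filter used in the cells above.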

### Which year has the highest and lowest total rainfall in the date range of analysis?

In [None]:
# Create a new category for year
cleaned_df['year'] = pd.DatetimeIndex(cleaned_df['month']).year

In [None]:
# Sum the numeric columns per year (the datetime 'month' column cannot be summed)
df_rainfall_years = cleaned_df.groupby(['year']).sum(numeric_only=True)

In [None]:
df_rainfall_years[df_rainfall_years['total_rainfall']==df_rainfall_years['total_rainfall'].max()]

In [None]:
df_rainfall_years[df_rainfall_years['total_rainfall']==df_rainfall_years['total_rainfall'].min()]

___Answer___: </br>
The year 2007 experienced the highest amount of rainfall (2886.2 mm) while the year 1997 experienced the least amount of rainfall (1118.9 mm).

### Which month has the highest and lowest number of rainy days in 1990, 2000, 2010 and 2020?

In [None]:
df_1990[df_1990['no_of_rainy_days']==df_1990['no_of_rainy_days'].max()]

In [None]:
df_1990[df_1990['no_of_rainy_days']==df_1990['no_of_rainy_days'].min()]

___Answer___: </br>
For the year 1990, the highest number of rainy days was in both September and November (17 days) while the lowest number of rainy days was in March (4 days).

In [None]:
df_2000[df_2000['no_of_rainy_days']==df_2000['no_of_rainy_days'].max()]

In [None]:
df_2000[df_2000['no_of_rainy_days']==df_2000['no_of_rainy_days'].min()]

___Answer___: </br>
For the year 2000, the highest number of rainy days was in November (21 days) while the lowest number of rainy days was in May (10 days).

In [None]:
df_2010[df_2010['no_of_rainy_days']==df_2010['no_of_rainy_days'].max()]

In [None]:
df_2010[df_2010['no_of_rainy_days']==df_2010['no_of_rainy_days'].min()]

___Answer___: </br>
For the year 2010, the highest number of rainy days was in November (21 days) while the lowest number of rainy days was in February (4 days).

In [None]:
df_2020[df_2020['no_of_rainy_days']==df_2020['no_of_rainy_days'].max()]

In [None]:
df_2020[df_2020['no_of_rainy_days']==df_2020['no_of_rainy_days'].min()]

___Answer___: </br>
For the year 2020, the highest number of rainy days was in July (22 days) while the lowest number of rainy days was in January (6 days).

### Which year has the highest and lowest number of rainy days in the date range of analysis?

In [None]:
df_rainfall_years[df_rainfall_years['no_of_rainy_days']==df_rainfall_years['no_of_rainy_days'].max()]

In [None]:
df_rainfall_years[df_rainfall_years['no_of_rainy_days']==df_rainfall_years['no_of_rainy_days'].min()]

___Answer___: </br>
The year 2013 experienced the highest number of rainy days (206 days) while the year 1997 experienced the lowest number of rainy days (116 days).

### Are there any outlier months in the dataset?

In [None]:
cleaned_df.info()

In [None]:
cleaned_df.boxplot(column =['no_of_rainy_days'], grid = False)

In [None]:
cleaned_df.boxplot(column =['total_rainfall'], grid = False)

___Answer___: </br>
There are outliers in the total_rainfall category.

In [None]:
cleaned_df.boxplot(column =['mean_rh'], grid = False)

___Answer___: </br>
There are outliers in the mean_rh category.

In [None]:
cleaned_df.boxplot(column =['mean_sunshine_hrs'], grid = False)

___Answer___: </br>
There are outliers in the mean_sunshine_hrs category.

In [None]:
cleaned_df.boxplot(column =['inflation'], grid = False)

___Answer___: </br>
There are outliers in the inflation category.
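
The boxplots flag outliers visually. To list the actual outlier months, the same 1.5 × IQR rule that matplotlib boxplots use by default can be applied directly. The sketch below uses invented values; `iqr_outliers` is a helper written for this example.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy monthly rainfall values (invented); 50 is far above the rest
toy = pd.Series([10, 12, 11, 13, 12, 11, 50], name="total_rainfall")

outliers = iqr_outliers(toy)
print(outliers)
```

Applied to a dataframe column, the returned index identifies which rows (and therefore which months) fall outside the whiskers.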

## Visualize the Data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers. It is important to not only create visualizations, but to **interpret your visualizations** as well.

**Every plot should**:
- Have a title
- Have axis labels
- Have appropriate tick labels
- Text is legible in a plot
- Plots demonstrate meaningful and valid relationships
- Have an interpretation to aid understanding

Here is an example of what your plots should look like following the above guidelines. Note that while the content of this example is unrelated, the principles of visualization hold:

![](https://snag.gy/hCBR1U.jpg)
*Interpretation: The above image shows that as we increase our spending on advertising, our sales numbers also tend to increase. There is a positive correlation between advertising spending and sales.*

---

Here are some prompts to get you started with visualizations. Feel free to add additional visualizations as you see fit:
1. Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features.
    - Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. **However**, they can be extremely useful in identifying relationships of potential interest (as well as identifying potential collinearity before modeling).
    - Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).
2. Visualize distributions using histograms. If you have a lot, consider writing a custom function and use subplots.
    - *OPTIONAL*: Summarize the underlying distributions of your features (in words & statistics)
         - Be thorough in your verbal description of these distributions.
         - Be sure to back up these summaries with statistics.
         - We generally assume that data we sample from a population will be normally distributed. Do we observe this trend? Explain your answers for each distribution and how you think this will affect estimates made from these data.
3. Plot and interpret boxplots. 
    - Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.
    - Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).
    - Each boxplot should:
        - Only include variables of a similar scale
        - Have clear labels for each variable
        - Have appropriate titles and labels
4. Plot and interpret scatter plots to view relationships between features. Feel free to write a custom function, and subplot if you'd like. Functions save both time and space.
    - Your plots should have:
        - Two clearly labeled axes
        - A proper title
        - Colors and symbols that are clear and unmistakable
5. Additional plots of your choosing.
    - Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. **As always, make sure you're interpreting your plots as you go**.

Some ideas for plots that can be generated:

- Plot the histogram of the rainfall data with various bins and comment on the distribution of the data - is it centered, skewed?
- Plot the box-and-whiskers plot. Comment on the different quartiles and identify any outliers in the dataset. 
- Is there a correlation between the number of rainy days and total rainfall in the month? What kind of correlation do you suspect? Does the graph show the same?


### Correlation Heatmap

In [None]:
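# Correlation matrix of the numeric columns only
df_corr = cleaned_df.corr(numeric_only=True)
# Mask the upper triangle (including the diagonal) to hide duplicate cells
# (np.bool was removed from NumPy, so the built-in bool is used instead)
mask = np.triu(np.ones_like(df_corr, dtype=bool))

# Drop the first row and last column, which are fully masked
mask = mask[1:, :-1]
corr = df_corr.iloc[1:,:-1].copy()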

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", linewidth=.5, vmin=-1, vmax=1, cbar_kws={"shrink": .8})
plt.title('Correlation matrix of all numeric values in dataset');

___Remarks___: </br>
Strong correlation between: </br>
- total no. of rainy days and total rainfall
- total no. of rainy days and humidity.
- total rainfall and  humidity.
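
To read the strongest relationships off a correlation matrix without scanning the heatmap by eye, the pairs can be ranked by absolute correlation. This is a sketch on toy columns `a`, `b` and `c` with invented values.

```python
import numpy as np
import pandas as pd

# Toy data (invented): b rises with a, c mostly falls as a rises
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
corr = toy.corr()

# Keep only the cells above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) -> r, drop the masked cells, rank by |r|
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```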

### Histogram subplots

In [None]:
# Create an empty figure; the axes are added with subplot2grid below
fig = plt.figure(figsize=(10, 8))

axs1 = plt.subplot2grid(shape=(2,6), loc=(0,0), colspan=2)
axs2 = plt.subplot2grid((2,6), (0,2), colspan=2)
axs3 = plt.subplot2grid((2,6), (0,4), colspan=2)
axs4 = plt.subplot2grid((2,6), (1,1), colspan=2)
axs5 = plt.subplot2grid((2,6), (1,3), colspan=2)


axs1.hist(cleaned_df['no_of_rainy_days'], bins=10)
axs1.set_title("Distribution of rainy days")
axs1.set_xlabel('No. of rainy days (days)')
axs1.set_ylabel('Frequency')

axs2.hist(cleaned_df['total_rainfall'], bins=10)
axs2.set_title("Distribution of rainfall")
axs2.set_xlabel('Amount of rainfall (mm)')
axs2.set_ylabel('Frequency')

axs3.hist(cleaned_df['mean_rh'], bins=10)
axs3.set_title("Distribution of humidity")
axs3.set_xlabel('Monthly mean relative humidity (%)')
axs3.set_ylabel('Frequency')

axs4.hist(cleaned_df['mean_sunshine_hrs'], bins=10)
axs4.set_title("Distribution of hours of sunshine")
axs4.set_xlabel('Hours of sunshine (hours)')
axs4.set_ylabel('Frequency')

axs5.hist(cleaned_df['inflation'], bins=10)
axs5.set_title("Distribution of inflation")
axs5.set_xlabel('Monthly inflation (%)')
axs5.set_ylabel('Frequency')

fig.tight_layout()

___Remarks___: </br>
- The distributions of rainy days and hours of sunshine are approximately normal.
- The distributions of rainfall (mm) and monthly inflation (%) are right-skewed.
- The distribution of humidity is left-skewed.
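
Visual impressions of skew can be backed up with a statistic. pandas provides `.skew()`, where a positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric shape. The columns below are invented purely to illustrate the sign convention.

```python
import pandas as pd

# Toy columns (invented) with a long right tail, a long left tail, and no tail
toy = pd.DataFrame({
    "right_skewed": [1, 1, 2, 2, 3, 3, 4, 20],
    "left_skewed": [1, 17, 18, 18, 19, 19, 20, 20],
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
})

skewness = toy.skew()
print(skewness)
```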

### Boxplots to identify outliers in data

In [None]:
cleaned_df.boxplot(column =['no_of_rainy_days'], grid = False)

In [None]:
cleaned_df.boxplot(column =['total_rainfall'], grid = False)

In [None]:
cleaned_df.boxplot(column =['mean_rh'], grid = False)

In [None]:
cleaned_df.boxplot(column =['mean_sunshine_hrs'], grid = False)

In [None]:
cleaned_df.boxplot(column =['inflation'], grid = False)

___Remarks___: </br>
- Outliers are present in the total_rainfall, mean_rh, mean_sunshine_hrs and inflation columns.
- Because the dataset is small, we will not exclude the outliers.

### Scatterplot to identify correlations

In [None]:
sns.pairplot(cleaned_df, corner=True)

___Remarks___: </br>
- The pairplot provides a high-level overview of the relationships between the different columns.
- It would be interesting to expand further on the correlation between (1) mean_sunshine_hrs vs no_of_rainy_days, (2) mean_rh and no_of_rainy_days, (3) mean_rh vs mean_sunshine_hrs.

#### Scatterplot - mean_sunshine_hrs vs no_of_rainy_days

In [None]:
plt.scatter(cleaned_df['mean_sunshine_hrs'], cleaned_df['no_of_rainy_days'])
plt.xlabel("mean_sunshine_hrs (hours)")
plt.ylabel("no_of_rainy_days (days)")
plt.title("Scatterplot of mean_sunshine_hrs (hours) vs. no_of_rainy_days (days)")

z = np.polyfit(cleaned_df['mean_sunshine_hrs'], cleaned_df['no_of_rainy_days'], 1)
p = np.poly1d(z)
plt.plot(cleaned_df['mean_sunshine_hrs'],p(cleaned_df['mean_sunshine_hrs']),"r--");

#### Scatterplot - mean_rh vs no_of_rainy_days

In [None]:
plt.scatter(cleaned_df['mean_rh'], cleaned_df['no_of_rainy_days'])
plt.xlabel("mean_rh (%)")
plt.ylabel("no_of_rainy_days (days)")
plt.title("Scatterplot of mean_rh (%) vs. no_of_rainy_days (days)")

z = np.polyfit(cleaned_df['mean_rh'], cleaned_df['no_of_rainy_days'], 1)
p = np.poly1d(z)
plt.plot(cleaned_df['mean_rh'],p(cleaned_df['mean_rh']),"r--");

#### Scatterplot - mean_rh vs mean_sunshine_hrs

In [None]:
plt.scatter(cleaned_df['mean_rh'], cleaned_df['mean_sunshine_hrs'])
plt.xlabel("mean_rh (%)")
plt.ylabel("mean_sunshine_hrs (hours)")
plt.title("Scatterplot of mean_rh (%) vs. mean_sunshine_hrs (hours)")

z = np.polyfit(cleaned_df['mean_rh'], cleaned_df['mean_sunshine_hrs'], 1)
p = np.poly1d(z)
plt.plot(cleaned_df['mean_rh'],p(cleaned_df['mean_rh']),"r--");

___Remarks___: <br>
Based on the plots, it can be observed that:
1) The amount of sunshine decreases as the number of rainy days increases.
2) Humidity increases as the number of rainy days increases.
3) Humidity increases as the amount of sunshine decreases.
* Since humidity rises with the number of rainy days, portable handheld fans can be sold together with umbrellas. A strange combination, but possible.
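
The direction of each relationship above can also be read off as a number instead of a trend line. The sketch below uses synthetic data that merely mimics the signs described in the remarks (the coefficients and noise levels are assumptions, not estimates from the project data). It computes Pearson correlations with `DataFrame.corr()` and the fitted slope with `np.polyfit`, as in the scatterplot cells.

```python
import numpy as np
import pandas as pd

# Synthetic data (NOT the project dataset) built to mimic the observed signs.
rng = np.random.default_rng(seed=1)
rainy_days = rng.integers(low=5, high=25, size=200).astype(float)
demo_df = pd.DataFrame({
    "no_of_rainy_days": rainy_days,
    "mean_sunshine_hrs": 9.0 - 0.2 * rainy_days + rng.normal(0.0, 0.5, size=200),
    "mean_rh": 75.0 + 0.4 * rainy_days + rng.normal(0.0, 1.0, size=200),
})

# Pearson correlation matrix (the default method of DataFrame.corr).
corr = demo_df.corr()
print(corr.round(2))

# Slope of the degree-1 fit; its sign matches the sign of the correlation.
slope, intercept = np.polyfit(demo_df["mean_rh"], demo_df["mean_sunshine_hrs"], 1)
print(round(slope, 3))
```

A coefficient close to -1 or +1 indicates a strong linear relationship, while values near 0 indicate little or none.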

### Line plot for total rainfall across the months for the past 5 years

In [None]:
# Extract the month number (1-12) from the 'month' column
cleaned_df['month-mmm'] = pd.DatetimeIndex(cleaned_df['month']).month

In [None]:
cleaned_df.head()

In [None]:
# Filter data
filtered_df_2022 = cleaned_df[cleaned_df['year']==2022]
filtered_df_2021 = cleaned_df[cleaned_df['year']==2021]
filtered_df_2020 = cleaned_df[cleaned_df['year']==2020]
filtered_df_2019 = cleaned_df[cleaned_df['year']==2019]
filtered_df_2018 = cleaned_df[cleaned_df['year']==2018]

# Group by month to calculate the mean over the same five years (2018-2022)
filtered_df = cleaned_df[cleaned_df['year'].between(2018, 2022)]
mean_rainfall_df = filtered_df.groupby('month-mmm')['total_rainfall'].mean()

In [None]:
# Establish figure size
plt.figure(figsize = (16, 9))

# Plot data
plt.plot(filtered_df_2022['month-mmm'], filtered_df_2022['total_rainfall'], label = '2022', linewidth=2, c='black')
plt.plot(filtered_df_2021['month-mmm'], filtered_df_2021['total_rainfall'], label = '2021', linewidth=2, c='orange')
plt.plot(filtered_df_2020['month-mmm'], filtered_df_2020['total_rainfall'], label = '2020', linewidth=2, c='blue')
plt.plot(filtered_df_2019['month-mmm'], filtered_df_2019['total_rainfall'], label = '2019', linewidth=2, c='magenta')
plt.plot(filtered_df_2018['month-mmm'], filtered_df_2018['total_rainfall'], label = '2018', linewidth=2, c='cyan')
plt.plot(mean_rainfall_df, label = 'Mean', linewidth=7, c='red')


# Legend
plt.legend(loc = 'upper right', fontsize = 20)

# Draw grid
plt.grid(True, linewidth = 0.5, linestyle = '-', c = 'black', alpha = 0.1)

# Set the tick label font size on both axes.
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

# Create title.
plt.title("Total rainfall across the months from 2018-2022", fontsize = 30)

# Label axis
plt.xlabel("Months").set_fontsize(20)
plt.ylabel("Total rainfall (mm)").set_fontsize(20)

In [None]:
mean_rainfall_df

___Remarks___: <br>
- High rainfall is observed from November to January (around the Northeast Monsoon) as well as in April and June (the latter at the start of the Southwest Monsoon).
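
The per-year filtering and the monthly mean can also be expressed as a single month-by-year table, which makes the wettest months easy to read off. This is a minimal sketch on hypothetical values (the rainfall figures below are invented for illustration and are not from the dataset). It reuses the `year`, `month-mmm` and `total_rainfall` column names from the cells above.

```python
import pandas as pd

# Hypothetical monthly totals (mm) for two years and three months; NOT real data.
demo_df = pd.DataFrame({
    "year": [2021, 2021, 2021, 2022, 2022, 2022],
    "month-mmm": [1, 2, 3, 1, 2, 3],
    "total_rainfall": [300.0, 100.0, 150.0, 200.0, 50.0, 250.0],
})

# One row per month, one column per year.
wide = demo_df.pivot(index="month-mmm", columns="year", values="total_rainfall")

# Mean across the year columns, equivalent to groupby('month-mmm').mean().
wide["Mean"] = wide.mean(axis=1)

# Month with the highest mean rainfall.
wettest_month = wide["Mean"].idxmax()
print(wide)
print(wettest_month)
```

Calling `wide.plot()` on such a table would draw one line per column, which is an alternative to writing one `plt.plot` call per year.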

### Line plot for mean relative humidity across the months for the past 5 years

In [None]:
mean_rh_df = filtered_df.groupby('month-mmm')['mean_rh'].mean()

In [None]:
# Establish figure size
plt.figure(figsize = (16, 9))

# Plot data
plt.plot(filtered_df_2022['month-mmm'], filtered_df_2022['mean_rh'], label = '2022', linewidth=2, c='black')
plt.plot(filtered_df_2021['month-mmm'], filtered_df_2021['mean_rh'], label = '2021', linewidth=2, c='orange')
plt.plot(filtered_df_2020['month-mmm'], filtered_df_2020['mean_rh'], label = '2020', linewidth=2, c='blue')
plt.plot(filtered_df_2019['month-mmm'], filtered_df_2019['mean_rh'], label = '2019', linewidth=2, c='magenta')
plt.plot(filtered_df_2018['month-mmm'], filtered_df_2018['mean_rh'], label = '2018', linewidth=2, c='cyan')
plt.plot(mean_rh_df, label = 'Mean', linewidth=7, c='red')


# Legend
plt.legend(loc = 'upper right', fontsize = 20)

# Draw grid
plt.grid(True, linewidth = 0.5, linestyle = '-', c = 'black', alpha = 0.1)

# Set the tick label font size on both axes.
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

# Create title.
plt.title("Mean relative humidity across the months from 2018-2022", fontsize = 30)

# Label axis
plt.xlabel("Months").set_fontsize(20)
plt.ylabel("Mean relative humidity (%)").set_fontsize(20)

In [None]:
mean_rh_df

___Remarks___: <br>
- High humidity levels are observed in months with high rainfall.

### Line plot for mean sunshine hours across the months for the past 5 years

In [None]:
mean_sunshine_df = filtered_df.groupby('month-mmm')['mean_sunshine_hrs'].mean()

In [None]:
# Establish figure size
plt.figure(figsize = (16, 9))

# Plot data
plt.plot(filtered_df_2022['month-mmm'], filtered_df_2022['mean_sunshine_hrs'], label = '2022', linewidth=2, c='black')
plt.plot(filtered_df_2021['month-mmm'], filtered_df_2021['mean_sunshine_hrs'], label = '2021', linewidth=2, c='orange')
plt.plot(filtered_df_2020['month-mmm'], filtered_df_2020['mean_sunshine_hrs'], label = '2020', linewidth=2, c='blue')
plt.plot(filtered_df_2019['month-mmm'], filtered_df_2019['mean_sunshine_hrs'], label = '2019', linewidth=2, c='magenta')
plt.plot(filtered_df_2018['month-mmm'], filtered_df_2018['mean_sunshine_hrs'], label = '2018', linewidth=2, c='cyan')
plt.plot(mean_sunshine_df, label = 'Mean', linewidth=7, c='red')


# Legend
plt.legend(loc = 'upper right', fontsize = 20)

# Draw grid
plt.grid(True, linewidth = 0.5, linestyle = '-', c = 'black', alpha = 0.1)

# Set the tick label font size on both axes.
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

# Create title.
plt.title("Mean sunshine hours across the months from 2018-2022", fontsize = 30)

# Label axis
plt.xlabel("Months").set_fontsize(20)
plt.ylabel("Mean sunshine hours (hours)").set_fontsize(20)

In [None]:
mean_sunshine_df

___Remarks___: <br>
- High sunshine hours are observed from February to April and from July to August.

### Line plot for changes in rainy days per month for the past 5 years

In [None]:
mean_rainydays_df = filtered_df.groupby('month-mmm')['no_of_rainy_days'].mean()

In [None]:
# Establish figure size
plt.figure(figsize = (16, 9))

# Plot data
plt.plot(filtered_df_2022['month-mmm'], filtered_df_2022['no_of_rainy_days'], label = '2022', linewidth=2, c='black')
plt.plot(filtered_df_2021['month-mmm'], filtered_df_2021['no_of_rainy_days'], label = '2021', linewidth=2, c='orange')
plt.plot(filtered_df_2020['month-mmm'], filtered_df_2020['no_of_rainy_days'], label = '2020', linewidth=2, c='blue')
plt.plot(filtered_df_2019['month-mmm'], filtered_df_2019['no_of_rainy_days'], label = '2019', linewidth=2, c='magenta')
plt.plot(filtered_df_2018['month-mmm'], filtered_df_2018['no_of_rainy_days'], label = '2018', linewidth=2, c='cyan')
plt.plot(mean_rainydays_df, label = 'Mean', linewidth=7, c='red')


# Legend
plt.legend(loc = 'upper right', fontsize = 20)

# Draw grid
plt.grid(True, linewidth = 0.5, linestyle = '-', c = 'black', alpha = 0.1)

# Set the tick label font size on both axes.
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

# Create title.
plt.title("Number of rainy days across the months from 2018-2022", fontsize = 30)

# Label axis
plt.xlabel("Months").set_fontsize(20)
plt.ylabel("Number of rainy days").set_fontsize(20)

___Remarks___: <br>
- A high number of rainy days is observed from November to January (around the Northeast Monsoon) as well as in April and June (the latter at the start of the Southwest Monsoon).

## Conclusions and Recommendations

Based on your exploration of the data, what are your key takeaways and recommendations? Make sure to answer your question of interest or address your problem statement here.

___Problem statement___: <br>
The Outdoor Collectives (TOC) is an outdoor apparel shop known for its handheld electronic fans, lightweight umbrellas and sunblock lotions. Established in the early 2000s in Honolulu, Hawaii, TOC has since expanded its operations to America and has set its sights on the ASEAN region, with Singapore being the first country in its market entry strategy. At the same time, TOC has identified that consumers are becoming increasingly reluctant to carry umbrellas, electronic fans and sunblock lotion as part of their everyday carry (EDC), citing the additional space these items take up in their bags. As such, TOC would also like to explore setting up temporary retail spaces around Singapore to sell its products.

TOC's target consumers are Singapore residents of all age groups. The organisation wants to ensure proper inventory management throughout the year, taking into account Singapore's weather conditions. It anticipates that sales of sunblock lotion and visors will be higher on days with more sunlight, and that sales of electronic fans will be higher on days with high humidity levels. Sales of umbrellas are expected to be higher on rainy days.

This study analyses Singapore's historical weather data to help TOC plan product supplies efficiently throughout the year.

___Recommendations___: <br>
TOC can stock more umbrellas and electronic fans from November to January as well as from April to June.

TOC can also plan for more supplies of sunblock lotion from February to April and from July to August.

No higher demand is predicted for any of the products in September and October. TOC can maintain a 'neutral' level of inventory during this period.

|Month|Product With Expected Higher Demand|
|---|---|
|Jan| Umbrellas and electronic fans|
|Feb| Sunblock lotion|
|Mar| Sunblock lotion|
|Apr| Sunblock lotion, umbrellas and electronic fans|
|May| Umbrellas and electronic fans|
|Jun| Umbrellas and electronic fans|
|Jul| Sunblock lotion|
|Aug| Sunblock lotion|
|Sep| nil|
|Oct| nil|
|Nov| Umbrellas and electronic fans|
|Dec| Umbrellas and electronic fans|
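
A table like the one above could also be derived programmatically from the monthly means. The sketch below is one possible approach, not the method used in this notebook: it assumes a month counts as "high" when its value exceeds the median across months, and it uses hypothetical monthly means for four months only. The `products` helper and the threshold rule are both illustrative assumptions.

```python
import pandas as pd

# Hypothetical monthly means for four months; NOT taken from the notebook output.
monthly = pd.DataFrame(
    {
        "total_rainfall": [250.0, 90.0, 120.0, 260.0],
        "mean_sunshine_hrs": [4.5, 7.5, 6.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4], name="month"),
)

# Assumed rule: a month is "high" when it exceeds the median across months.
rainy = monthly["total_rainfall"] > monthly["total_rainfall"].median()
sunny = monthly["mean_sunshine_hrs"] > monthly["mean_sunshine_hrs"].median()

def products(is_rainy: bool, is_sunny: bool) -> str:
    """Map the two flags to the product wording used in the table."""
    items = []
    if is_sunny:
        items.append("sunblock lotion")
    if is_rainy:
        items.append("umbrellas and electronic fans")
    return ", ".join(items).capitalize() if items else "nil"

plan = pd.Series(
    [products(r, s) for r, s in zip(rainy, sunny)],
    index=monthly.index,
    name="expected_higher_demand",
)
print(plan)
```

The choice of threshold drives the result, so a different rule (for example the top third of months) may classify borderline months differently.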

No correlation was observed between inflation levels and the weather data. However, given the year-on-year increase in inflation levels, TOC could explore a discount campaign to drive sales in months where higher demand is expected for at least two products.

___Future explorations___: <br>
1) Although no correlation was observed between historical inflation levels and weather data, additional inflation-related datasets (e.g. Consumer Price Index levels) could change this observation.
2) Geographical data mapping the areas of high rainfall and sunshine, including intensity levels, could enable TOC to plan inventory at a more micro level and decide which regions of Singapore would be worthwhile for setting up its pop-up shops.

Don't forget to create your README!

**To-Do:** *If you combine your problem statement, data dictionary, brief summary of your analysis, and conclusions/recommendations, you have an amazing README.md file that quickly aligns your audience to the contents of your project.* Don't forget to cite your data sources!