# Pandas Exercise
---
In this exercise, we will be using a dataset available from [Kaggle](https://www.kaggle.com/leekahhow/singapore-psi-pm25-20162019). You can download this dataset (`psi_df_2016_2019.csv`) from Moodle. This dataset consists of Pollutant Standards Index (PSI) readings, specifically PM2.5 data.

The dataset has these columns:
* north
* south
* east
* west
* central
* national
* timestamp

We will be going through how to extract a subset of data from the main dataset and plot this subset.

## Importing the libraries
---
Complete the code cell to import the libraries required for:
* pandas
* plotting

In [None]:
import datetime
import pandas as pd
import matplotlib.pyplot as plt

## Loading and viewing the data 

Load the data.

In [None]:
df = pd.read_csv('psi_df_2016_2019.csv', header=0)

Check that the data has been loaded properly.

In [None]:
df.info()

Take a look at the first 5 records.

In [None]:
df.head()

Check the data for any null values

In [None]:
df.isnull().sum()
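
To show how the output of `isnull().sum()` reads, here is a minimal sketch on a made-up dataframe. The values and the `demo` name are assumptions for illustration, not taken from the PSI dataset, and dropping rows is only one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing readings (not from the PSI file)
demo = pd.DataFrame({
    "north": [55.0, np.nan, 60.0],
    "south": [50.0, 52.0, np.nan],
})

# Number of null values in each column
null_counts = demo.isnull().sum()
print(null_counts)

# Total number of null values across the whole dataframe
total_nulls = int(null_counts.sum())

# One option if nulls are present: drop every row that contains a null
cleaned = demo.dropna()
```

If every count is zero, no cleaning step is needed before moving on.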

## Sorting the data

As the timestamps are in string format, we will convert them into pandas datetime objects.

In [None]:
# convert the timestamp column to pandas datetime objects
df['timestamp'] = pd.to_datetime(df['timestamp'])
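
To see what the conversion changes, the sketch below compares the column's dtype before and after calling `pd.to_datetime`. The sample strings are assumptions for illustration; the timestamps in the real file may be formatted differently.

```python
import pandas as pd

# Hypothetical sample timestamps (the real file's format may differ)
sample = pd.DataFrame({
    "timestamp": ["2016-02-07 18:00:00", "2016-02-07 19:00:00"],
})

dtype_before = sample["timestamp"].dtype  # a string/object dtype
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
dtype_after = sample["timestamp"].dtype   # a datetime64 dtype

print(dtype_before, "->", dtype_after)
```

Once the column has a datetime dtype, date parts such as the year and month can be read from it directly.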

Since this data spans multiple years, we would like to split the timestamp into year and month.
<br>
We add `year` and `month` columns to the dataframe.

In [None]:
df['year'] = pd.DatetimeIndex(df['timestamp']).year
df['month'] = pd.DatetimeIndex(df['timestamp']).month
df.head()
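
As an alternative to wrapping the column in `pd.DatetimeIndex`, a datetime column also exposes its parts through the `.dt` accessor. This is a minimal sketch on made-up timestamps (the `demo` dataframe is an assumption for illustration); both approaches should give the same values.

```python
import pandas as pd

# Hypothetical timestamps, already converted to datetime
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-15 08:00", "2017-03-02 21:00"]),
})

# The .dt accessor reads date parts straight from the datetime column
demo["year"] = demo["timestamp"].dt.year
demo["month"] = demo["timestamp"].dt.month

print(demo)
```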

We now need to extract the data for the year 2016. To work with a clean dataset, we would like to remove the columns `timestamp` and `year`.

**Hint:** Find out how to make a copy of an object.

In [None]:
# select the 2016 rows and take a deep copy so that later changes do not affect df
df_2016 = df.loc[df['year'] == 2016].copy()

# discard the 'timestamp' and 'year' columns
df_2016.drop(['timestamp', 'year'], inplace=True, axis=1)
df_2016.tail()
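
The reason for calling `.copy()` can be illustrated with a small sketch. The data and names below are assumptions for illustration. With an explicit copy, changes to the subset leave the original dataframe untouched; without it, depending on the pandas version, modifying a filtered subset may trigger a `SettingWithCopyWarning`.

```python
import pandas as pd

# Hypothetical toy data (not from the PSI file)
demo = pd.DataFrame({
    "year": [2016, 2016, 2017],
    "national": [50, 60, 70],
})

# Filter the rows, then take an independent copy
subset = demo.loc[demo["year"] == 2016].copy()

# Changes to the copy: drop a column and overwrite values
subset.drop(columns=["year"], inplace=True)
subset["national"] = 0

# The original dataframe still has both columns and its original values
print(demo)
print(subset)
```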


Save the dataframe to a CSV file for future use (not required, but it is useful to know how this is done).

In [None]:
df_2016.to_csv('psi_df_2016.csv', index=False)
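
The effect of `index=False` can be checked with a quick round trip. This sketch writes to an in-memory buffer instead of a file so that nothing is created on disk; the `demo` dataframe is an assumption for illustration.

```python
import io

import pandas as pd

# Hypothetical toy data (not from the PSI file)
demo = pd.DataFrame({"north": [55, 60], "month": [1, 2]})

# Write the CSV into an in-memory text buffer, leaving out the row index
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV back from the start of the buffer
buffer.seek(0)
restored = pd.read_csv(buffer)
```

With `index=False`, the header row contains only the column names, so reading the file back does not produce an extra unnamed index column.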

## Aggregating the data for plotting

Calculate the mean PSI by month for the year 2016.

In [None]:
df_mean_by_month_2016 = df_2016.groupby('month').mean()
df_mean_by_month_2016
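
To make the shape of the grouped result concrete, here is a minimal sketch on made-up numbers (the `demo` dataframe is an assumption for illustration). After `groupby('month').mean()`, the `month` column becomes the index and each remaining column holds the mean for that month.

```python
import pandas as pd

# Hypothetical toy readings for two months (not from the PSI file)
demo = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "north": [10.0, 20.0, 30.0, 50.0],
    "south": [12.0, 18.0, 40.0, 60.0],
})

# One row per month; 'month' moves from a column to the index
means = demo.groupby("month").mean()
print(means)

# The index holds the month numbers and can serve as x-axis values
months = means.index.tolist()
```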

Plot the mean PSI values for the year 2016. The x-axis shows the months and the y-axis shows the values of each column.

**Hint:** If you have already used the `groupby` function in the previous code cell, check that your dataframe has the right columns and whether or not the dataframe's index can be used.

In [None]:
# There are 2 ways to plot: with matplotlib directly or through pandas.
# Pandas uses matplotlib under the hood, but sometimes one way is simpler than the other.

# ------------------- Matplotlib way ----------------------------------
fig = plt.figure(figsize=(13,10))
for col in df_mean_by_month_2016.columns:
    plt.plot(df_mean_by_month_2016.index, df_mean_by_month_2016[col], 'o-', label=str(col))

plt.ylabel("PSI PM2.5 Levels")
plt.xlabel("Months")
plt.title('Mean PSI for Year 2016')
plt.legend(fontsize=15)
plt.show()

In [None]:
# ---------------------- Pandas way ---------------------------------
axes = df_mean_by_month_2016.plot(title='Mean PSI for Year 2016', figsize=(13, 10), style='o-')
axes.set_xlabel("Months")
axes.set_ylabel("PSI PM2.5 Levels")
axes.legend(fontsize=15)
plt.show()