# Comparing Data and Dealing with Dates in Pandas

Previously, we saw how to import simple data using `Pandas`.
This worksheet continues in the same vein: it introduces other ways to
manipulate the data we import and shows how Pandas can recognize dates.

We will use data from the website
[GRIDWATCH](https://www.gridwatch.templar.co.uk/), which provides
current and historical data on the UK's electrical grid.  This
includes information on the demand, as well as a breakdown of the
sources of the electricity (e.g., natural gas, solar, hydro, and
wind).

Download the file "gridwatch.csv" by going to the [Myplace site for CP540](https://classes.myplace.strath.ac.uk/course/view.php?id=27428#section-5) or download the file directly by clicking [here](https://classes.myplace.strath.ac.uk/mod/resource/view.php?id=1758384). 


In [None]:
from google.colab import files
 
 
uploaded = files.upload()

## Importing data

We can load the data in this CSV file into a Pandas dataframe in the same manner as we did in the previous notebook.  Let's put the data into the variable `df` and see the names of the columns.

 

In [None]:
import pandas as pd
import io
 
df = pd.read_csv(io.BytesIO(uploaded['gridwatch.csv'])) 
print(df.columns)


If we look closely at the names of the columns, we will notice that many of them start with a space.  This is not necessarily a problem, as long as we remember to include this space when we refer to the columns; however, it is an inconvenience.  We can remove any leading space from the headers by adding an option, `skipinitialspace=True`, to the function that reads the CSV file.



In [None]:
df = pd.read_csv(io.BytesIO(uploaded['gridwatch.csv']), skipinitialspace=True)
print(df.columns)

Now those leading spaces have been eliminated.
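To see the effect of this option in isolation, here is a minimal sketch that uses a tiny, made-up CSV string rather than the real file (the column names and values are assumptions chosen for illustration). It also shows one alternative, stripping the whitespace from the headers after loading.

```python
import io
import pandas as pd

# A tiny, made-up CSV with a space after each comma, mimicking the layout
# described above (these numbers are not real grid data).
csv_text = "id, timestamp, demand\n1, 2022-08-21 00:00:00, 25000\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

raw_names = list(raw.columns)
clean_names = list(clean.columns)
print(raw_names)    # ['id', ' timestamp', ' demand']
print(clean_names)  # ['id', 'timestamp', 'demand']

# Alternative: strip whitespace from the headers after loading.
raw.columns = raw.columns.str.strip()
print(list(raw.columns))
```

Note that stripping the headers afterwards only fixes the column names; `skipinitialspace=True` also removes the leading space from the values themselves.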

To see the contents of the dataframe, we print it.

In [None]:
print(df)

This is a larger file than we used last time, but it can easily be managed in the same way.  It contains around 7 days' worth of energy generation data, recorded at intervals of 5 minutes.  The resulting dataframe has 2016 rows and 25 columns of data.
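As a quick sanity check, the row count follows directly from the sampling interval. The sketch below also builds a matching range of timestamps; the start date used here is an assumption for illustration, not read from the file.

```python
import pandas as pd

readings_per_day = 24 * 60 // 5      # one reading every 5 minutes
print(readings_per_day)              # 288
print(7 * readings_per_day)          # 2016

# Hypothetical start date, used only to illustrate the spacing of the rows.
stamps = pd.date_range('2022-08-21', periods=7 * readings_per_day, freq='5min')
print(stamps[0], stamps[-1])
```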

Currently, the column of data under the heading "timestamp" is not recognized by Pandas as a time; it thinks that it is simply a string.  We can get Pandas to interpret this column of strings as times by using the `to_datetime` function:


In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S')  # Tell Pandas the format in which the timestamp strings are written
print(df)
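The conversion can be checked on a small example. This is a minimal sketch using a made-up dataframe called `demo` (a hypothetical name, with invented timestamps in the same layout as the file); it prints the column type before and after the conversion and then uses the `.dt` accessor, which only works on datetime columns.

```python
import pandas as pd

# Made-up timestamps in the same 'YYYY-MM-DD HH:MM:SS' layout as the file.
demo = pd.DataFrame({'timestamp': ['2022-08-21 00:00:00', '2022-08-21 00:05:00']})
print(demo['timestamp'].dtype)   # a string/object type before conversion

demo['timestamp'] = pd.to_datetime(demo['timestamp'], format='%Y-%m-%d %H:%M:%S')
print(demo['timestamp'].dtype)   # a datetime type after conversion

# Once converted, the .dt accessor exposes the date and time components.
minutes = demo['timestamp'].dt.minute.tolist()
print(minutes)
```

An alternative is to ask `read_csv` to do the conversion while loading, by passing `parse_dates=['timestamp']`.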

## Plotting data

Now we can begin to visualise the data. We can use a simple plot to compare three types of energy generation: wind, hydro, and solar.

In [None]:
import matplotlib.pyplot as plt

plt.plot(df['timestamp'], df['wind'], label='wind')
plt.plot(df['timestamp'], df['hydro'], label='hydro')
plt.plot(df['timestamp'], df['solar'], label='solar')
plt.xlabel('date')
plt.ylabel('power / MW')
plt.xticks(rotation=45)
plt.legend()
plt.show()

Another way to plot the data is to use the plotting capabilities within Pandas.

In [None]:
df.plot(x="timestamp", y=["wind", "hydro", "solar"])
plt.show()

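Suppose now that we want to see how much solar energy is generated each day. We can use the converted timestamp column to select the readings for each day and add them up. The readings are powers in MW taken every 5 minutes, so each one is multiplied by 5/60 hours to give an energy in MW h.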

In [None]:
# One entry (at midnight) for each day covered by the data
date_list = pd.date_range(start='2022-08-21', end='2022-08-27')
print(date_list)

# Empty list to collect the total solar energy for each day
dailysolar = []

for date in date_list:
  # Boolean mask: keep only the rows whose timestamp falls on this day
  df_tmp = df[df['timestamp'].dt.normalize() == date]
  # Each reading spans 5 minutes (5/60 h), so the sum of the power readings
  # in MW multiplied by 5/60 gives the energy in MW h
  dailysolar.append( df_tmp['solar'].sum() * 5.0/60.0 )
print(dailysolar)

# We can then visualise this in a bar graph
plt.bar(date_list, dailysolar)
plt.xticks(rotation=45)
plt.xlabel("Days")
plt.ylabel("Solar Energy Generation / MW h")
plt.show()
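The same daily totals can also be computed without an explicit loop. The following is a minimal sketch of that alternative using `groupby`; it runs on a synthetic dataframe called `demo` (a hypothetical name) holding two days of constant, made-up solar readings, so that the result is easy to verify by hand.

```python
import pandas as pd

# Synthetic stand-in for the real dataframe: two days of 5-minute readings
# with a constant 1200 MW of solar power (made-up values for illustration).
times = pd.date_range('2022-08-21', periods=2 * 288, freq='5min')
demo = pd.DataFrame({'timestamp': times, 'solar': 1200.0})

# Group the rows by calendar date and sum the power readings in each group.
# Each reading covers 5/60 h, so the sum in MW times 5/60 is an energy in MW h.
daily_energy = demo.groupby(demo['timestamp'].dt.date)['solar'].sum() * 5.0 / 60.0
print(daily_energy)
```

With 288 readings of 1200 MW per day, each daily total is 288 × 1200 × 5/60 = 28800 MW h.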

We can also show how the total solar energy is split between the days in a pie chart.

In [None]:
# Label each slice with a short day-month string rather than a full timestamp
plt.pie(dailysolar, labels=date_list.strftime('%d-%m'), labeldistance=1.15)
plt.show()


We can use another Python library, `plotly.express`, to show the data for all three types of energy. This library is interesting because it can be used to make interactive graphs.

In [None]:
import plotly.express as px

fig = px.line(df, x = 'timestamp', y = ['hydro', 'wind', 'solar'],
              labels={
                     "timestamp": "Day",
                     "value": "Power Generated / MW",
                     "variable": "Type of Energy"
                 })

fig.show()

This is a simple graph, similar to the one shown previously in this worksheet, but using `plotly.express` allows us to hover over any data point on the graph and see its value.  This can be expanded on to include filters that show only specific data on the graph, or to show scatter graphs.  More information on this library can be found [here](https://plotly.com/python/line-charts/).


Violin plots show the distribution of the values in a dataset, so we can use them here to show how the power generated varies within each day of the week's worth of data we have. These plots require another Python library, `seaborn`. The code used to generate these plots can be found [here](https://www.python-graph-gallery.com/violin-plot/). It is useful to note that this site contains many other types of plots, along with the code to create them.

## Statistics

To allow us to analyze these data on a day-by-day basis, it is helpful to categorize them according to the date, rather than by the time.  We add a new column containing the date to the dataframe, which we create by using the function `dt.strftime`.  This function creates a string which represents the date in the format we choose.

In [None]:
df['date'] = df['timestamp'].dt.strftime('%d-%m')                     
print(df)                                                                   
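Once each row carries a date label, per-day statistics follow from a `groupby`. The sketch below illustrates this on a small made-up dataframe called `demo` (a hypothetical name, with invented wind values), so it does not depend on the real file.

```python
import pandas as pd

# Made-up readings spanning two days (not real grid data).
demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2022-08-21 00:00:00', '2022-08-21 12:00:00',
                                 '2022-08-22 00:00:00', '2022-08-22 12:00:00']),
    'wind': [1000.0, 3000.0, 2000.0, 6000.0],
})
demo['date'] = demo['timestamp'].dt.strftime('%d-%m')

# Summary statistics of the wind power for each day.
daily_stats = demo.groupby('date')['wind'].agg(['mean', 'min', 'max'])
print(daily_stats)
```

One caveat: the labels are strings, so they are sorted alphabetically. With the `'%d-%m'` format this matches chronological order only while the data stay within a single month.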

Now we can observe the distribution of power generation from wind turbines throughout each day by using a violin plot.

In [None]:
import seaborn as sns
 
ax = sns.violinplot(x=df["date"], y=df["wind"], palette="Pastel1")
ax.set_ylabel("Power Generated / MW")
ax.set_xlabel("Day")
plt.show()

In [None]:
ax = sns.violinplot(x=df["date"], y=df["solar"], palette="Pastel1")
ax.set_ylabel("Power Generated / MW")
ax.set_xlabel("Day")
plt.show()

A note on the above graph: the violins extend into the negative range, even though there are no negative values in the dataframe. This is because a violin plot is drawn from a kernel density estimate, which smooths each data point into a small bump. Many of the solar values are equal or close to 0 (for example, at night), so the smoothed curve spills over to values below 0 and assigns them a non-zero probability density. It does not mean that there are negative values in the data.
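This effect can be reproduced with a few lines of NumPy. The following is a minimal sketch of a Gaussian kernel density estimate, not the exact procedure `seaborn` uses; the function name `gaussian_kde_at`, the sample values, and the bandwidth are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_at(x, samples, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the point(s) x."""
    x = np.atleast_1d(x)[:, None]
    z = (x - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Made-up, non-negative "solar" readings with several zeros (night-time).
samples = np.array([0.0, 0.0, 0.0, 0.0, 50.0, 400.0, 1500.0, 3000.0])
bandwidth = 200.0  # arbitrary smoothing width chosen for illustration

density_below_zero = gaussian_kde_at(-100.0, samples, bandwidth)[0]
print(samples.min())         # the smallest data value is not negative
print(density_below_zero)    # yet the estimated density at -100 is positive
```

If the negative tails are distracting, `sns.violinplot` accepts a `cut` argument; setting `cut=0` limits each violin to the range of the observed data.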

In [None]:
ax = sns.violinplot(x=df["date"], y=df["hydro"], palette="Pastel1")
ax.set_ylabel("Power Generated / MW")
ax.set_xlabel("Day")
plt.show()

## Conclusion

In this worksheet we have seen different types of plots available to us in Python using different modules, and how they can be created from Pandas dataframes. We have also seen how date values within dataframes can be converted and manipulated to allow us to view data more clearly.