# `pandas`

![](https://pandas.pydata.org/pandas-docs/stable/_static/pandas.svg)

[`pandas`](https://pandas.pydata.org/) is a data analysis library written in `Python`. It is a large package with extensive functionality. In this notebook we will explore only some of its visualisation capabilities, which are built upon [`Matplotlib`](https://matplotlib.org/).




# Datasets

Let's load a few datasets to fuel the examples.

In [None]:
from vega_datasets import data
dfw = data.seattle_weather()
dfr = data.la_riots()
dfe = data.iowa_electricity()

Now we have three datasets ready to use: `dfw`, `dfr` and `dfe`.

In [None]:
dfe

Let's also set a few general display options (not essential):

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', None)

# Basic `pandas` charts

We will explore a few very basic charts that require really simple code.

## Line chart and common parameters

In [None]:
dfw

A basic plot is easy...

In [None]:
dfw.info()

In [None]:
dfw.plot(kind='line', x='date', y='temp_min')

Some customizations

In [None]:
dfw.plot(kind='line', x='date', y='temp_min',
         figsize=(15,7), title='Minimum Temperature in Seattle',
         color='lightblue', xlabel='Year', ylabel='º Celsius')

Colours to suit every taste
* https://matplotlib.org/stable/gallery/color/named_colors.html
* https://matplotlib.org/stable/gallery/color/colormap_reference.html

We can also add more than one variable to the chart

In [None]:
dfw.plot(kind='line', x='date', y=['temp_min','temp_max'],
         figsize=(15,7), title='Temperatures in Seattle',
         color=['lightblue','khaki'], xlabel='Year', ylabel='º Celsius')

A few more [options](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)

In [None]:
dfw.plot(kind='line', x='date', y=['temp_min','temp_max'],
         figsize=(15,7), title='Temperatures in Seattle',
         color=['lightblue','khaki'], xlabel='Year', ylabel='º Celsius',
         ylim=(-10,40), grid=True, style=[':','.-'])

In [None]:
dfw.plot(kind='line', x='date', y=['temp_min','temp_max'],
         figsize=(15,7), title='Temperatures in Seattle',
         color=['lightblue','khaki'], xlabel='Year', ylabel='º Celsius',
         grid=True, style=[':','.-'],
         subplots=True, sharey=True)

### **Exercise**

Execute the following code cells and consider the dataset `dfep`, which is generated from the data in `dfe`.

In [None]:
dfe

In [None]:
dfep = dfe.pivot(index='year', columns='source', values='net_generation')
dfep.info()

In [None]:
dfep

In [None]:
dfep.columns
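To see what `pivot` does to the shape of the data, here is a minimal sketch on a toy table. The numbers and the names `long_df` and `wide_df` are made up for illustration and are not taken from the Iowa dataset.

```python
import pandas as pd

# Hypothetical toy data in "long" format: one row per (year, source) pair.
long_df = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "source": ["Fossil Fuels", "Renewables", "Fossil Fuels", "Renewables"],
    "net_generation": [10, 1, 12, 2],
})

# pivot turns it into "wide" format: one row per year, one column per source.
wide_df = long_df.pivot(index="year", columns="source", values="net_generation")
print(wide_df)
```

With one column per source, a single call to `plot` can draw one line per column.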

Use the `plot` function to visualize the data. Try to customise your chart.

In [None]:
dfep.plot(kind='line', title='Energy', style=[':', '.-', '-'])

## Bar Chart

In the dataframe with meteorological data, let's sum the precipitation and wind values by year and month.

In [None]:
dfwg = dfw[['precipitation','wind']].groupby(by=[dfw.date.dt.year, dfw.date.dt.month]).sum()
dfwg.index.set_names('year', level=0, inplace=True)
dfwg.index.set_names('month', level=1, inplace=True)
dfwg.info()

If you don't fully understand all the `pandas` code, it really doesn't matter. We are just generating new data that are more interesting for the visualisations we want to show.
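For readers who do want to follow the grouping step, the sketch below repeats it on a tiny made-up table. The frame `toy` and its values are assumptions for illustration only; the grouping call mirrors the one used for `dfw`.

```python
import pandas as pd

# Hypothetical daily records (made-up values, not the Seattle data).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-02-01", "2013-01-01"]),
    "precipitation": [1.0, 2.0, 3.0, 4.0],
    "wind": [5.0, 6.0, 7.0, 8.0],
})

# Group by the year and the month extracted from the date, then sum each group.
monthly = toy[["precipitation", "wind"]].groupby(
    by=[toy.date.dt.year, toy.date.dt.month]).sum()

# Both index levels inherit the name 'date', so rename them in one call.
monthly.index = monthly.index.set_names(["year", "month"])
print(monthly)
```

The result has a two-level index (`year`, `month`) and one row per month that appears in the data.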

In [None]:
dfwg

Let's now try a bar chart on the data we have.

In [None]:
dfwg.plot(kind='bar')

We see that the index is used in the x-axis and the values of the columns are shown in the y-axis.

Not very good though... let's do some work:

In [None]:
dfwg.plot(kind='bar', y=['precipitation', 'wind'],
          figsize=(15,7), title='Weather in Seattle',
          color=['lightblue','lightgreen'], xlabel='Month', grid=True)

We could focus on a specific year:

In [None]:
dfwg2012 = dfwg.query('year == 2012')
dfwg2012.plot(kind='bar', y=['precipitation','wind'],
              figsize=(15,7), title='Weather in Seattle',
              color=['lightblue','lightgreen'], xlabel='Year 2012', grid=True)

Or we could compare the same month in different years:

In [None]:
dfwgJune = dfwg.query('month == 6')
dfwgJune.plot(kind='bar', y=['precipitation','wind'],
                figsize=(15,7), title='Weather in Seattle',
                color=['lightblue','lightgreen'], xlabel='June', grid=True)

### **Exercise**

Copy the cell above into the cell below and modify it to produce graphs for other months or years. Customize the size, title, labels, colors, etc. to produce a nice chart.

In [None]:
dfwgDec = dfwg.query('month == 12')
dfwgDec.plot(kind='bar', y=['precipitation','wind'],
                figsize=(15,7), title='Weather in Seattle',
                color=['blue','green'], xlabel='December', grid=True)

## Histograms

Histograms are useful to understand the distribution of numeric values. Let's use the `dfr` dataframe.

In [None]:
dfr

In [None]:
dfr.plot(kind='hist', y='age')

It is common to set the number of bins of the histogram explicitly.

In [None]:
dfr.plot(kind='hist', y='age', bins=15)

As usual, we can tune some parameters...

In [None]:
dfr.plot(kind='hist', y='age',
         bins=[10,25,40,55,70,90], xticks=[10,25,40,55,70,90], grid=True,
         color='salmon')
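When `bins` is a list, its values are the bin edges rather than a number of bins. The sketch below counts values per bin with `numpy.histogram`, which is the routine Matplotlib uses for histograms. The ages are made up for illustration; only the edges are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (made-up values, not the dfr data).
ages = pd.Series([12, 18, 25, 31, 40, 44, 58, 67, 71])
edges = [10, 25, 40, 55, 70, 90]

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both.
counts, _ = np.histogram(ages, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```

Six edges define five bins, so unevenly spaced edges are a simple way to draw custom age groups.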

Setting the labels for histograms is a bit different...

In [None]:
ax = dfr.plot(kind='hist', y='age',
         bins=[10,25,40,55,70,90], xticks=[10,25,40,55,70,90], grid=True,
         color='salmon')
ax.set(xlabel='Age group', ylabel='Number of deaths')
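The pattern above works because `DataFrame.plot` returns a Matplotlib `Axes` object, which can be adjusted after the chart has been drawn. A minimal sketch on made-up data (the frame `toy` and the labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical ages (made-up values, not the dfr data).
toy = pd.DataFrame({"age": [12, 18, 25, 31, 40, 44, 58, 67, 71]})

# plot() returns the Axes it drew on; keep a reference to it.
ax = toy.plot(kind="hist", y="age", bins=5)

# Any Axes method is now available, for example labels and title.
ax.set(xlabel="Age group", ylabel="Count")
ax.set_title("Toy histogram")

print(ax.get_xlabel(), "|", ax.get_ylabel())
```

The same approach applies to the other chart kinds whenever a `plot` parameter does not cover the customization you need.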

We can draw multiple histograms together. Let's use the `dfw` dataframe again.

In [None]:
dfw.plot(kind='hist', y=['temp_min', 'temp_max'])

In this case the `alpha` parameter is useful: it sets the transparency of the colors, so overlapping histograms remain visible.

In [None]:
dfw.plot(kind='hist', y=['temp_min', 'temp_max'], alpha=.8)

Remember that all the previous parameters can still be used. For instance, we can set a larger number of bins to classify temperatures.

In [None]:
dfw.plot(kind='hist', y=['temp_min', 'temp_max'], bins=15, alpha=.8)


### **Exercise**

Consider the dataframe `dfep` and produce a histogram for the values in the different columns. Try the `alpha` parameter...

In [None]:
dfep.plot(kind='hist', y=['Fossil Fuels', 'Nuclear Energy','Renewables'], bins=15, alpha=.8)


## Many more...

There are many other options for the kind of plot we can draw from a pandas dataframe. The idea is always similar: the plot has a default behavior that can be modified by tuning the different parameters.

Below we show a few more chart types that can help to understand the data.

In [None]:
dfw.plot(kind='box',
         title='Quartile Values',
         grid=True,
         showmeans=True,
         showfliers=False)

https://en.wikipedia.org/wiki/Box_plot
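The numbers summarized by a box plot can also be computed directly, which helps when reading the chart. The sketch below uses a made-up series `temps`: the box spans the first to the third quartile, the line inside it marks the median, and `showmeans=True` adds a marker at the mean.

```python
import pandas as pd

# Hypothetical temperatures (made-up values, not the dfw data).
temps = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Quartiles: the box goes from q1 to q3, with a line at the median.
q1, median, q3 = temps.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1          # interquartile range, the height of the box
mean = temps.mean()    # position of the marker drawn by showmeans=True

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}, mean={mean}")
```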

In [None]:
dfw.plot(kind='scatter', x='temp_min', y='temp_max', alpha=0.3)

In [None]:
dfr.race.value_counts()

In [None]:
dfr.race.value_counts().plot(kind='pie',
                             autopct='%.1f%%',
                             colormap='summer',
                             title='Deaths (%)',
                             ylabel='Ethnic Group',
                             figsize=(6,6),
                             labels=['Afroamerican', 'Latin', 'White', 'Asian'])


### **Exercise**

Look for good examples where these last charts could be applied.

In [None]:
dfw.plot(kind='scatter', x='temp_min', y='temp_max', alpha=0.3)

# Interaction

Let's see a simple example of how Python widgets can be easily integrated with the `pandas` visualization capabilities to produce interactive notebooks.

Recall the code that produces a bar chart comparing the same month across different years in the Seattle weather dataset:

In [None]:
dfwg = dfw[['precipitation','wind']].groupby(by=[dfw.date.dt.year, dfw.date.dt.month]).sum()
dfwg.index.set_names('year', level=0, inplace=True)
dfwg.index.set_names('month', level=1, inplace=True)
dfwg.info()

In [None]:
dfwg

In [None]:
dfwgJune = dfwg.query('month == 6')
dfwgJune.plot(kind='bar', y=['precipitation','wind'],
                figsize=(15,7), title='Weather in Seattle',
                color=['lightblue','lightgreen'], xlabel='June', grid=True)

It would be convenient to choose the month number and produce the corresponding chart. This can be accomplished with the following code cell:


In [None]:
from ipywidgets import interact_manual, ColorPicker
import matplotlib.pyplot as plt

print('Select the number of the month you want to show:')
def f(month):
  month_name = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
  dfwgJune = dfwg.query('month == @month')
  dfwgJune.plot(kind='bar', y=['precipitation','wind'],
                figsize=(15,7), title='Weather in Seattle',
                color=['lightblue','lightgreen'], xlabel=month_name[month-1], grid=True)
  plt.show()

p = interact_manual(f, month=(1,12))
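The function above relies on the `@` prefix inside `query`, which refers to a Python variable instead of a column or index level. A minimal sketch of that step without any widget, on a made-up frame (the names `toy` and `rows_for_month` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical monthly totals with a (year, month) index, like dfwg.
idx = pd.MultiIndex.from_product([[2012, 2013], [5, 6]], names=["year", "month"])
toy = pd.DataFrame({"precipitation": [1.0, 2.0, 3.0, 4.0]}, index=idx)

def rows_for_month(df, month):
    # 'month' on the left is the index level; '@month' is the function argument.
    return df.query("month == @month")

june = rows_for_month(toy, 6)
print(june)
```

A widget only has to call a function like this one with the value the user selects.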

There are many widgets we can use. Let's see how we could also select the colors for the chart:

In [None]:

print('Select the number of the month you want to show and the colors of the bars:')
def f(month, color_rain, color_wind):
  month_name = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
  dfwgJune = dfwg.query('month == @month')
  dfwgJune.plot(kind='bar', y=['precipitation','wind'],
                figsize=(15,7), title='Weather in Seattle',
                color=[color_rain, color_wind], xlabel=month_name[month-1], grid=True)
  plt.show()

p = interact_manual(f, month=(1,12), color_rain=ColorPicker(value='blue'), color_wind=ColorPicker(value='green'))

## Exercise

Considering the data in the `dfwg` dataframe we have defined above,
create an interactive widget for the user to select a year to visualize a bar chart comparing the cumulative precipitation and wind.

In [None]:
dfwg

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

# 'year' is a level of the dfwg index, not a column
years = dfwg.index.get_level_values('year')

def plot_precipitation_wind(year):
    # Select the rows of the chosen year using the 'year' index level
    df_filtered = dfwg.query('year == @year')

    # Calculate cumulative values for that year
    cumulative_precipitation = df_filtered['precipitation'].sum()
    cumulative_wind = df_filtered['wind'].sum()

    # Create the bar chart
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(['Cumulative Precipitation', 'Cumulative Wind'], [cumulative_precipitation, cumulative_wind], color=['blue', 'green'])

    # Set title and labels
    ax.set_title(f'Cumulative Precipitation and Wind for Year {year}')
    ax.set_ylabel('Value')

    # Show the plot
    plt.show()

# Create the slider; int() converts the NumPy integers to plain Python ints
year_selector = widgets.IntSlider(min=int(years.min()), max=int(years.max()), step=1, value=int(years.min()), description='Select Year:')

# Display the slider together with the chart it controls
display(widgets.interactive(plot_precipitation_wind, year=year_selector))

<hr>
<hr>
Carlos Gregorio Rodríguez

Universidad Complutense de Madrid

<img src="https://static0.makeuseofimages.com/wordpress/wp-content/uploads/2019/11/CC-BY-NC-License.png" alt="cc by nc" width="200"/>

https://creativecommons.org/licenses/by-nc/4.0/