<img src='images/gdd-logo.png' width='300px' align='right' style="padding: 15px">

# Plotting with Altair Review: DuPont

We shall use the following DuPont dataset to practice data analysis with Pandas.

<mark style="color: white; background-color:darkred">Note: If you are using Binder, you must first upload the `logins.csv` file from the SharePoint to the `data` folder. It is not available on Binder.</mark>

```
data/logins.csv
```

It is a dataset containing information about when DuPont employees have logged on to an online platform.

We shall first help you load the data (it is in an unusual format for Python) and provide some context about the dataset:
- [Loading in the data](#load)
- [Initial Visualization](#first)

Further visualization:
- [Looking into weekly trends](#week)
- [Looking into daily trends](#day)


<a id='load'></a>

## Loading in the data

In the previous notebook you read in the binary Excel file and converted it to a CSV file. For this notebook we will use the CSV file.

In [None]:
import pandas as pd
import altair as alt

As mentioned earlier, the dataset contains information about when DuPont employees have logged on to an online platform.

In [None]:
logins = (
    pd.read_csv('data/logins.csv', parse_dates=['login_date'])
    .assign(company=lambda df: df['location'].str.split(':').str[0].str.strip())
    .dropna(subset='company')
)
logins.head()
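To make the `company` step above concrete, here is a minimal sketch on a toy frame. The location strings are invented for illustration and only assume a `company: site` pattern; the real file may look different.

```python
import pandas as pd

# Toy stand-in for the `location` column (hypothetical values, not real data)
toy = pd.DataFrame({"location": ["DuPont : Site A", "IFF: Site B", None]})

toy = (
    toy
    # Keep the text before the first ':' and trim surrounding whitespace
    .assign(company=lambda df: df["location"].str.split(":").str[0].str.strip())
    # Rows without a location have no company, so drop them
    .dropna(subset=["company"])
)
print(toy)
```

The missing location passes through the `.str` methods as a missing value, which is why the `dropna` step is needed afterwards.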

By default, Altair raises an error when a chart's dataset has more than `5000` rows, so we first take a sample of `5000` rows. It's a good idea to work on a sample until the charts look right. Once we have some nice visuals, we can run `alt.data_transformers.disable_max_rows()` and re-run our visualisations on the full data.

In [None]:
logins_sample = logins.sample(5000)
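A possible variation on the sampling step, sketched on invented data: passing `random_state` makes the sample repeatable between runs, and capping the sample size avoids an error if the table has fewer than 5000 rows. The names `toy_logins` and `toy_sample` are placeholders, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `logins`: 12,000 hourly rows with invented values
toy_logins = pd.DataFrame({
    "login_date": pd.Timestamp("2021-01-01")
    + pd.to_timedelta(list(range(12000)), unit="h"),
    "full": 1,
})

# Never ask for more rows than the table has
n_rows = min(5000, len(toy_logins))

# A fixed random_state returns the same rows on every run
toy_sample = toy_logins.sample(n=n_rows, random_state=42)
print(len(toy_sample))
```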

<a id='first'></a>

## Initial Visualization

It's always good to think about what you want to visualize first. We see that there are logins with timestamps, so one area we might want to look into is: **How does the number of logins change over time?**

When working with time series data like this, it makes sense to plot x as the time series and see how the volume of logins changes over time.

Note how in the below code we:
- Use `alt.Chart(logins_sample)` to set up the chart
- Use `.mark_line()` to create this as a line chart
- Use `.encode()` to have the x axis as the `login_date` and the y axis as `full`
- Use `.properties()` to set the chart size to `300x800`

In [None]:
alt.Chart(logins_sample).mark_line().encode(
    x='login_date:T',
    y='full'
).properties(
    height=300,
    width=800
)

The above chart shows us some seasonality, probably with weekly trends. There is also a lot of noise, since there are so many observations for each time period. Therefore there are two areas we can investigate further:

1. [Weekly trends](#week)
    - Which day of the week has the most logins?
    
    
2. [Daily trends](#day)
    - What time of day has the most logins?

**<mark>Exercise: Plot a bar plot</mark>**

- Plot an Altair chart using `alt.Chart()` with the new `logins_weekly` data and with `.mark_bar()` to create a bar plot.
- Use `x='day(login_date)'` to plot each day of the week. Use the right data code (`'Q'`, `'O'`, `'N'` or `'T'`)
- Use `y='sum(full)'` so that each bar represents the sum of the `full` column

**<mark>Exercise: Plotting</mark>**

- Create a chart with your new dataframe, in the same way as above, and add a color argument to color the bars according to `company`
- Change the `width` of the bars in `.mark_bar()`
- Change the `width` and `height` of the chart in `.properties()` to `800` and `300`

**<mark>Bonus: Add a selector</mark>**

Copy & paste your code from above and then do the following:

- Create the variable `selector = alt.selection_single(encodings=['x', 'color'])` (above the chart code)
- Change the `color` argument to be dependent on the selector: `alt.condition(selector, 'company', alt.value('lightgray'))`
- Add `.add_selection()` and use the parameter `selector`

***Answers***

In [None]:
# %load answers/alt-weekly-trends-plotting.py

In [None]:
# %load answers/alt-weekly-trends-company-plotting.py

In [None]:
# %load answers/alt-weekly-trends-company-plotting-bonus.py

<a id='day'></a>

## 2. The daily trends

In the plot it was also clear that there were daily trends, shown by the thin spikes:

In [None]:
alt.Chart(logins_sample).mark_line().encode(
    x='login_date',
    y='full'
).properties(
    height=300,
    width=800
)

**<mark>Exercise: Plotting</mark>**

- Create a chart using `logins_sample`
- The mark type should be `.mark_line()`
- Use `.encode()` with `x='day(login_date)'` and `y='sum(full)'`
- Set the `color=` in `.encode()` to be `day_of_week`. Which is the right data code to use (`:Q`, `:O` or `:N`)?

**<mark>Bonus: Customisation</mark>**

Copy & paste your code from above and then do the following:

- Change the `strokeWidth` in `.mark_line()` to `3`
- Is it necessary to also plot Saturday and Sunday? Use `.transform_filter()` and `alt.FieldOneOfPredicate()` to keep only working days:
```python
.transform_filter(
    alt.FieldOneOfPredicate(field='login_date', timeUnit='day', oneOf=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
)
```
- Add the `.properties()` method to change the size of the chart and to add a `title`
- Add labels to the x and y axes. This is done in `.encode()`: `x=alt.X('name_of_column', axis=alt.Axis(title='Chosen Label'))`
- Add `.interactive()` to your chart to allow panning and zooming.
- Add a `selector` so that you can select the lines:
``` python
# Step 1: create the selector variable
selector = alt.selection_single(encodings=['x', 'color'])
```
``` python
# Step 2: add the method to the chart
.add_selection(
    selector
)
```
``` python
# Step 3: add the condition to the color argument in encode
color=alt.condition(selector, 'day(login_date):O', alt.value('lightgray'))
```
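As a cross-check for the working-day filter in the bonus list above, the same selection can be sketched in pandas. The data is invented; `dt.dayofweek` numbers Monday as 0 and Sunday as 6, so values below 5 are working days.

```python
import pandas as pd

# Toy data: 2021-01-01 was a Friday, followed by Saturday, Sunday and Monday
toy = pd.DataFrame({
    "login_date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
    ),
    "full": [5, 1, 2, 7],
})

# Keep Monday (0) through Friday (4)
working_days = toy[toy["login_date"].dt.dayofweek < 5]
print(working_days)
```

Comparing the totals from a frame like this with the filtered chart is a quick way to confirm that the chart filter removes what you expect.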

***Answers***

In [None]:
# %load answers/alt-daily-trends-plotting.py

In [None]:
# %load answers/alt-daily-trends-bonus.py

# Next Steps - Change up the default colors and max_rows!

Now that we have some charts, we might want to visualise the entire dataset. As well as changing the defaults to allow more rows, we can also change the color theme. Altair offers great out-of-the-box visuals that are professional and easy to look at, but we can change the style to suit our company.

<mark>**Exercise: Change the colors to DuPont colors**</mark>

Look for where the two colors `green` and `blue` are defined in the code below. Change the colors so that they match the branding of DuPont and IFF.

Struggling to find the right colors? You can retrieve the HEX color from images here: https://imagecolorpicker.com/en
Try uploading the DuPont and IFF logos to get the exact colors!

In [None]:
alt.data_transformers.disable_max_rows()

def my_theme():
    return {
        'config': {
            'view': {'continuousHeight': 300, 'continuousWidth': 800},  # from the default theme
            'range': {'category': ['green', 'blue']}
        }
    }

alt.themes.register('my_theme', my_theme)
alt.themes.enable('my_theme')

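# Note: `login_days` is built from the sample and is not used by the chart below
login_days = (
    logins_sample
    .assign(company = lambda df: df['location'].str.split(':').str[0].str.strip(),
            day_of_year = lambda df: df['login_date'].dt.day_of_year)
    .dropna()
)

# Plot the full `logins` data, which is possible now that the row limit is disabled
alt.Chart(logins).mark_line().encode(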
    x='monthdate(login_date):T',
    y='sum(full)',
    color='company'
).properties(
    title='Logins over the Year'
)


# Conclusion

In this notebook we have looked at how Altair can be used as a data-centric way of plotting charts. There are still some data wrangling techniques that we need to use, and we need to remember that our data needs to have named columns, as in a pandas DataFrame (no raw `numpy` arrays allowed).
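As a small illustration of the point about `numpy`, the sketch below wraps a raw array in a DataFrame so that every column has a name an Altair encoding can refer to. The array values and column names are invented.

```python
import numpy as np
import pandas as pd

# A raw array has positions but no column names
raw = np.array([[1, 10], [2, 20], [3, 30]])

# Wrapping it in a DataFrame gives each column a name to use in .encode()
named = pd.DataFrame(raw, columns=["day_of_week", "logins"])
print(named)
```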

In this notebook we have covered:

- Initial visual inspection
- Looking into trends by creating time specific columns and grouping by them
- Different ways to visualize daily and weekly trends