# Data Visualization

## Assignment 7: Interactive Visualizations

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are the links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

<div class="alert alert-info" style="color:black">
    
Assignment Learning Goals:

By the end of the module, students are expected to:

- Create selections within a plot.
- Link selections between plots to highlight and select data.
- Develop widgets to filter plotted data.
- Share interactive visualizations without running a full dashboard or Python.
    

This assignment covers [Module 7](https://viz-learn.mds.ubc.ca/en/module7) of the online course. You should complete this module before attempting this assignment.
 
</div>

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [1]:
# Import libraries needed for this assignment

from hashlib import sha1

import altair as alt
import pandas as pd
import test_assignment7 as t
from vega_datasets import data

alt.data_transformers.enable("default", max_rows=None)
# Handle large data sets without embedding them in the notebook
#alt.data_transformers.enable('data_server');

DataTransformerRegistry.enable('default')

# 0. Let's Bring the Action 

Welcome to the season play-offs. Not only have we gotten to the final round of the assignments but we've also gotten to the final game of batting season!

In this week's game assignment, our goal is to showcase just how useful widgets and interactivity are for communicating our insights to our audience. Yes, there are other ways to answer some of the questions in this assignment, but we want to really focus on how convenient and time-saving certain features can be.
The data we are exploring are from [Sean Lahman's Baseball dataset](http://www.seanlahman.com/baseball-archive/statistics/)  - a baseball resource that has the statistics on players across many different leagues and years. The license can be viewed [here](http://www.seanlahman.com/baseball-archive/statistics/). 

Several modifications have been made to this data so that the visualization process is a home run. 

***NOTE: This is a larger dataset and because of that, the interactivity component is going to be a little slower but just be a little patient.***

It's important to note that in the real world, the data playing field is often not as well maintained as what we provide. Often data is messy and needs a lot of wrangling before we can even begin the task at hand. Think of it as cross-country skiing with lots of bumps and holes that need addressing. In this course, we really want to focus on visualizing the data, since you already learned the wrangling and cleaning part in [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/).

For this assignment we will be looking at 2 tables, one with the players' statistics such as their height, date of birth etc. and a second with their game stats over each season they played and their position. 
The description of the columns of both datasets can be found below and they are available at the links provided.

### [`baseball_players.json`](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players.json): 


| Column            | Description                                                                            |
|-------------------|:---------------------------------------------------------------------------------------|
| player_id         | A unique code assigned to each player.                                                 |
| birth_year        | Year player was born                                                                   |
| birth_month       | Month  player was born                                                                 |
| birth_day         | Day player was born                                                                    |
| birth_date        | year, month and day player was born                                                    |
| birth_country     | Country where player was born                                                          |
| birth_state       | State where player was born                                                            |
| birth_city        | City where player was born                                                             |
| death_year        | Year player died                                                                       |
| death_country     | Country where player died                                                              |
| death_state       | State where player died                                                                |
| name_first        | Player’s first name                                                                    |
| name_last         | Player’s last name                                                                     |
| name_given        | Player’s given name (typically first and middle)                                       |
| height            | Player's height in inches                                                              |
| weight            | Player's weight in pounds                                                              |
| bats              | Player’s batting hand (left (L), right (R), or both (B))                               |
| throws            | Player’s throwing hand (left(L) or right(R))                                           |
| debut             | Date that player made first major league appearance                                    |
| final_game        | Date that player made final major league appearance (blank if still active)            |
| id_country        | The [ISO 3166-1 numeric code](https://en.wikipedia.org/wiki/ISO_3166-1_numeric) of the country in `first_nationality`                                    |
| region            | The continent to which the country from `first_nationality` belongs                    |
| id                | The state code to create a US map                                                      |


### [`states_df.json`](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/states_df.json) 

| Column            | Description                                                                            |
|-------------------|:---------------------------------------------------------------------------------------|
| birth_state       | The unique short form of each state in the US                                          |
| id                | The id used to map each state                                                          |
| player_num        | Number of players                                                                      |


### [`baseball_pitchers.json`](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_pitchers.json) 

| Column            | Description                                                                            |
|-------------------|:---------------------------------------------------------------------------------------|
| player_id         | A unique code assigned to each player.                                                 |
| teamID            | Team                                                                                   |
| W                 | Wins                                                                                   |
| L                 | Losses                                                                                 |
| G                 | Games                                                                                  |
| GS                | Games Started                                                                          |
| CG                | Complete Games                                                                         |
| SHO               | Shutouts                                                                               |
| SV                | Saves                                                                                  |
| IPouts            | Outs Pitched (innings pitched x 3)                                                     |
| H                 | Hits                                                                                   |
| ER                | Earned Runs                                                                            |
| HR                | Homeruns                                                                               |
| BB                | Walks                                                                                  |
| SO                | Strikeouts                                                                             |
| BAOpp             | Opponent’s Batting Average                                                             |
| ERA               | Earned Run Average                                                                     |
| IBB               | Intentional Walks                                                                      |
| HBP               | Batters Hit By Pitch                                                                   |
| WP                | Wild Pitches                                                                           |
| BFP               | Batters faced by Pitcher                                                               |
| GF                | Games Finished                                                                         |
| R                 | Runs Allowed                                                                           |
| SH                | Sacrifices by opposing batters                                                         |
| SF                | Sacrifice flies by opposing batters                                                    |
| GIDP              | Grounded into double plays by opposing batter                                          |
| name_last         | Player’s last name                                                                     |
| name_first        | Player’s first name                                                                    |
| id                | The id of the state the player was born                                                |
| SO_per_BF         | Strikeout per batter faced                                                             |



### [`baseball_players_stats.json`](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players_stats.json) 

       
| Column            | Description                                                                            |
|-------------------|:---------------------------------------------------------------------------------------|
| player_id         | A unique code assigned to each player.                                                 |
| teamID            | Team                                                                                   |
| lgID              | League; levels AA AL FL NL PL UA                                                       |
| G                 | number of games in which a player played                                               |
| AB                | At bat                                                                                 |
| R                 | Runs                                                                                   |
| H                 | Hits - times reached base because of a batted, fair ball without error by the defense  |
| 2B                | Doubles: hits on which the batter reached second base safely                           |
| 3B                | Triples: hits on which the batter reached third base safely                            |
| HR                | Homeruns                                                                               |
| RBI               | Runs Batted In                                                                         |
| SB                | Stolen Bases                                                                           |
| CS                | Caught Stealing                                                                        |
| BB                | Base on Balls                                                                          |
| SO                | Strikeouts                                                                             |
| IBB               | Intentional walks                                                                      |
| HBP               | Hit by pitch                                                                           |
| SH                | Sacrifice hits                                                                         |
| SF                | Sacrifice flies                                                                        |
| GIDP              | Grounded into double plays                                                             |
| name_first        | Player’s first name                                                                    |
| name_last         | Player’s last name                                                                     |
| bats              | Player’s batting hand (left (L), right (R), or both (B))                               |
| throws            | Player’s throwing hand (left(L) or right(R))                                           |
| team_name         | Team’s full name                                                                       |
| salary            | Salary (USD)                                                                           |


# 1. Let the Games Begin - Drafting the Players 

We expect basketball players to be taller than the average person so they can shoot and dunk balls into the net, but are similar physical traits expected of baseball players? Of course, a lot of power is needed to swing a bat and hit a home run, but agility is also an asset when running to home base. This leads us to questions such as ***How are baseball players' height and weight distributed in the data?*** ***Is there a relationship between the two measurements?*** It would be interesting to visualize our players and see if there are any patterns in the data. 

**Question 1.1** 
    <br> {points: 1}

Before we can start asking questions, we need to un-bench the players! 

Read in the data `baseball_players.json` from the link [we've provided](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players.csv) named `players_url`. You may notice that the column for `birth_year` is a little weird. That's ok! This is how JSON stores the value in a way that accommodates "date" fields.

*Assign your data to an object named `players_df`*.
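The unusual `birth_year` values look like Unix timestamps in milliseconds, which is how `pandas` writes dates to JSON by default. A minimal sketch of decoding them, using three values that appear in the `players_df` output of this notebook (the variable names here are our own and not part of the assignment):

```python
import pandas as pd

# Epoch-millisecond values copied from the birth_year column of players_df
raw_birth_year = pd.Series([347155200000, -1136073600000, 63072000000])

# Interpret the integers as milliseconds since 1970-01-01 and keep only the year
decoded_year = pd.to_datetime(raw_birth_year, unit="ms").dt.year

print(decoded_year.tolist())
```

Negative values simply mean dates before 1970. Altair does the equivalent conversion for you when the column is given the temporal type (`:T`).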

In [2]:
players_url = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players.json'
players_df = pd.read_json(players_url)
players_df

Unnamed: 0,playerID,birth_year,birth_month,birth_day,birth_date,birth_country,birth_state,birth_city,death_year,death_country,...,name_given,weight,height,bats,throws,debut,final_game,id_country,region,id
0,aardsda01,347155200000,12,27,1981-12-27,United States,CO,Denver,,,...,David Allan,215,75,R,R,2004-04-06,2015-08-23,840.0,Americas,8.0
1,aaronha01,-1136073600000,2,5,1934-02-05,United States,AL,Mobile,2021-01-01,USA,...,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,840.0,Americas,1.0
2,aaronto01,-978307200000,8,5,1939-08-05,United States,AL,Mobile,1984-01-01,USA,...,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,840.0,Americas,1.0
3,aasedo01,-504921600000,9,8,1954-09-08,United States,CA,Orange,,,...,Donald William,190,75,R,R,1977-07-26,1990-10-03,840.0,Americas,6.0
4,abadan01,63072000000,8,25,1972-08-25,United States,FL,Palm Beach,,,...,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,840.0,Americas,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19069,zupofr01,-978307200000,8,29,1939-08-29,United States,CA,San Francisco,2005-01-01,USA,...,Frank Joseph,182,71,L,R,1957-07-01,1961-05-09,840.0,Americas,6.0
19070,zuvelpa01,-378691200000,10,31,1958-10-31,United States,CA,San Mateo,,,...,Paul,173,72,R,R,1982-09-04,1991-05-02,840.0,Americas,6.0
19071,zuverge01,-1451692800000,8,20,1924-08-20,United States,MI,Holland,2014-01-01,USA,...,George,195,76,R,R,1951-04-21,1959-06-15,840.0,Americas,26.0
19072,zwilldu01,-2587680000000,11,2,1888-11-02,United States,MO,St. Louis,1978-01-01,USA,...,Edward Harrison,160,66,L,L,1910-08-14,1916-07-12,840.0,Americas,29.0


In [3]:
t.test_1_1(players_df)

'Success'

**Question 1.2** 
    <br> {points: 2}

As always, it's important to know the limitations of our data. Which 3 columns are particularly problematic due to minimal data? 

*Assign the column names as string elements in a list named `minimal_columns`*.

In [4]:
players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19074 entries, 0 to 19073
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   playerID       19074 non-null  object 
 1   birth_year     19074 non-null  int64  
 2   birth_month    19074 non-null  int64  
 3   birth_day      19074 non-null  int64  
 4   birth_date     19074 non-null  object 
 5   birth_country  19074 non-null  object 
 6   birth_state    18634 non-null  object 
 7   birth_city     19027 non-null  object 
 8   death_year     9002 non-null   object 
 9   death_country  8999 non-null   object 
 10  death_state    8950 non-null   object 
 11  name_first     19074 non-null  object 
 12  name_last      19074 non-null  object 
 13  name_given     19074 non-null  object 
 14  weight         19074 non-null  int64  
 15  height         19074 non-null  int64  
 16  bats           19074 non-null  object 
 17  throws         19074 non-null  object 
 18  debut 

In [5]:
minimal_columns = ['death_year', 'death_country', 'death_state']
minimal_columns

['death_year', 'death_country', 'death_state']
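Reading the non-null counts from `.info()` works, but the share of missing values per column can also be computed directly. A small sketch on a made-up stand-in for `players_df` (the `toy_df` frame and its values are hypothetical, used only for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for players_df
toy_df = pd.DataFrame({
    "playerID": ["a", "b", "c", "d"],
    "birth_state": ["CO", "AL", None, "CA"],
    "death_year": [None, "2021-01-01", None, None],
})

# Fraction of missing entries per column, largest first
missing_share = toy_df.isna().mean().sort_values(ascending=False)

print(missing_share)
```

Columns at the top of this ranking are the ones with the least usable data.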

In [6]:
# check that the variable exists
assert 'minimal_columns' in globals(
), "Please make sure that your solution is named 'minimal_columns'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 1.3** 
    <br> {points: 2}

Now that the players are warmed up, let's plot!

For this plot (and other plots in this lab)  we need to use the `players_url` as a data source instead of the dataframe object. This helps keep our Jupyter notebook smaller, else it may end up being over 100MB! 
Remember that since we are using a url source we need to make sure all of our columns are referenced with altair datatypes. 

Plot the players' height on the X-axis vs their weight on the Y-axis and assign a colour channel to the region column in a scatterplot using a circle mark (`mark_circle()`).

Give the circle mark a size of 20.  Let's **NOT** include the 0 marks in each axis and make sure to include a title and proper axis labels, taking care to include units. 

*Assign your plot to an object named `players_hw_plot`*.

In [7]:
players_hw_plot = alt.Chart(players_url).mark_circle(size=20).encode(
    alt.X('height:Q', scale=alt.Scale(zero=False), title="Height (in)"),
    alt.Y('weight:Q', scale=alt.Scale(zero=False), title="Weight (lbs)"),
    alt.Color('region:N', title="Region")).properties(title="Size of Baseball Players")

players_hw_plot

In [8]:
t.test_1_3a(players_hw_plot)

'Success'

In [9]:
t.test_1_3b(players_hw_plot)

'Success'

**Question 1.4** 
    <br> {points: 1}

Ok, this is a great start, but as you can see there are a few players outside of the cluster that we may want to identify. A great way to do this is to add the name of the player to a tooltip channel (a tooltip is a box with information that pops up when your mouse hovers over an observation). Using `players_hw_plot` from the previous question, assign the players' first name, last name and `birth_country` to a tooltip channel. 

*Assign your plot to an object named `hw_tooltip_plot`*.

In [10]:
hw_tooltip_plot = players_hw_plot.encode(tooltip=['name_first:N', 'name_last:N', 'birth_country:N'])

hw_tooltip_plot

In [11]:
t.test_1_4(hw_tooltip_plot)

'Success'

**Question 1.5** 
    <br> {points: 3}

Great! This helped us identify each player in the dataset. Currently, there seems to be quite a lot of saturation and overlap in the visualization. How about we use opacity to highlight certain points, using the legend for region as a selection tool?

- First, assign the region column to the field you want to select on and make sure to bind the legend using the `.selection_multi()` function. Save this in an object named `select_region`. 
- Using `hw_tooltip_plot` as a base, set the opacity channel to a condition where depending on `select_region`, either an opacity value of 0.8 or 0 will show. 
- To make sure this all works without any errors, you'll need to add the `.add_selection()` method making sure to take `select_region` as an input. 


*Assign your plot to an object named `hw_legend_plot`*.

In [12]:
select_region = alt.selection_multi(fields=['region'], bind='legend')
hw_legend_plot = hw_tooltip_plot.encode(
    opacity=alt.condition(select_region, alt.value(0.8), alt.value(0))).add_selection(select_region)
hw_legend_plot

In [13]:
t.test_1_5a(select_region)

'Success'

In [14]:
t.test_1_5b(select_region, hw_legend_plot)

'Success'

In [15]:
t.test_1_5c(hw_legend_plot)

'Success'

**Question 1.6** 
    <br> {points: 1}
    
With this new interactive feature can you identify the number of players from Africa? 

*Assign the number in an object of type `int` named `african_players`*.

In [16]:
african_players = 2

african_players

2

In [17]:
t.test_1_6(african_players)

'Success'

# 2. Birth Dates with Selections

Currently, the data we have contains all the possible players in the dataset. It would be nice to find a way to filter and select players based on when they were born. This might help us identify (loosely) whether players' general physique has changed over time. Are players' body shapes changing due to the requirements of the game (more strength, agility)?

In the lecture, we explore how we can sync up multiple plots with a selection tool. Let's make a bar plot that counts up the players by the year they were born and use it as a tool to highlight the players in the scatter plot we made earlier.


**Question 2.1** <br> {points: 2}
    
Before we add interactivity it's a good idea to make a static plot first. Using the `birth_year` column on the x-axis, make a bar plot that counts the values of players born in each year using `players_url` as a data source. Give it a nice colour and make sure that the labels are appropriate for each axis. Since we will be using it as a tool to select points for our earlier plot, do **NOT** give this plot a title. 

*Assign your plot to an object named `dob_plot`.*

In [18]:
dob_plot = alt.Chart(players_url).mark_bar(color="purple").encode(
    alt.X('birth_year:T', title = "Birth Year"),
    alt.Y('count():Q', title="Number of Players"))
dob_plot

In [19]:
t.test_2_1a(dob_plot)

'Success'

In [20]:
t.test_2_1b(dob_plot)

'Success'

**Question 2.2** 
    <br> {points: 5}
    
Now for the fun part! 
Create an  interval selection object named `interval`. We want to make sure we are encoding just the x-axis here. 
Next, using the `dob_plot` we made in **Question 2.1** as a base, encode a colour condition so that bars inside the selected interval are `navy` and bars outside it are `lightgray`. Add the `interval` as a selection option ([some examples here](https://altair-viz.github.io/altair-viz-v4/user_guide/interactions.html)). Set the height and width of this plot to 100 and 600 respectively and save it in an object named `bar_slider`. 

Using the `hw_tooltip_plot` plot made in **Question 1.4** let's add conditions to both the colour and opacity channels and save it in an object named `scatter_plot`. 
- For the colour channel, we want the condition so that if the points within the `interval` are selected, it should display the `region` colour, and if not, then it should display a `lightgray` colour. This will make it clearer which points are highlighted. 
- For the opacity channel, we are going to do something similar to the legend selection tool we did above. In this case, if the points are selected in `interval`, the opacity should be 0.8 and 0.01 for the points outside this interval. 
- Give this plot a width of 600 as well. 
- Don't forget to add the selection option to this plot with `add_selection()`.

*Combine your plots vertically and save them in an object named `dob_combo_plot`.*

In [21]:
interval = alt.selection_interval(encodings=['x'])
bar_slider = dob_plot.encode(
     color=alt.condition(interval, alt.value('navy'), alt.value('lightgray'))).add_selection(interval).properties(
    height=100, width=600)
scatter_plot = hw_tooltip_plot.encode(
    color=alt.condition(interval, 'region:N', alt.value('lightgray')),
    opacity=alt.condition(interval, alt.value(0.8), alt.value(0.01))).add_selection(interval).properties(width=600)
dob_combo_plot = bar_slider & scatter_plot
dob_combo_plot

In [22]:
t.test_2_2a(interval)

'Success'

In [23]:
t.test_2_2b(interval, bar_slider)

'Success'

In [24]:
t.test_2_2c(interval, scatter_plot)

'Success'

In [25]:
t.test_2_2d(bar_slider, scatter_plot, dob_combo_plot)

'Success'

**Question 2.3** 
    <br> {points: 2}
    
Great, now that we have the plot above, let's try and loosely answer the question we posed - "Are newer players from the 90s onwards generally taller than players in 1840-1860s". 

*Assign either "yes" or "No" to an object of type string named `answer2_3`*.

In [26]:
answer2_3 = "yes"

answer2_3

'yes'

In [27]:
# check that the variable exists
assert 'answer2_3' in globals(
), "Please make sure that your solution is named 'answer2_3'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

# 3. Where in the World?

Baseball players originate from all around the world, and we've seen this in the previous plots where we have been colouring by continent, but it would be interesting to go more granular. The majority of players are from the USA and it would be interesting to see if certain states are more prone to the sport than others. Do we expect more players to come from Texas or New York? What about colder states such as Alaska or Maine?

We will also look at a new dataset and explore if better pitchers come from certain states in the US. 

Let's use the map as a tool to select points in our scatter plot we made earlier. 

**Question 3.1** <br> {points: 1}

Use this opportunity to read the `us_10m` data from the `vega_datasets` we imported at the beginning of this assignment (where we import in all the libraries).

You'll need to make sure you import the `url` and use `.topo_feature()`.

*Save it in an object named `us_map`.* 

In [28]:
us_map = alt.topo_feature(data.us_10m.url, 'states')

us_map

UrlData({
  format: TopoDataFormat({
    feature: 'states',
    type: 'topojson'
  }),
  url: 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/us-10m.json'
})

In [29]:
t.test_3_1(us_map)

'Success'

**Question 3.2** <br> {points: 5}

Time to get charting! 

To make this map we will need to combine and create 2 Chart objects; one called `background` and the other called `states_plot`. 

To make the `background` plot: 
- Create a `.mark_geoshape()` chart using the `us_map` data from **Question 3.1**. 
- Give it a white colour and a grey stroke. 
    

To make the `states_plot` plot: 
-  Create a `.mark_geoshape()` chart using the `us_map` data from **Question 3.1**, but this time give it a black stroke and a stroke width of 0.15. 
- This map will require us to lookup data from our other data source but before we do that we need to assign the `player_num` column from it (specifying a quantitative type) to a colour channel. We want to specify an appropriate colour scheme setting `scale=alt.Scale(scheme=....)` within the colour channel and giving it an appropriate title. 
- Assign `alt.Tooltip("birth_state:N", title="State")` and `alt.Tooltip("player_num:Q", title="Number of players")` within the tooltip channel.
- Next, look up the data with `.transform_lookup()`, setting the `lookup` argument to `id` and the `from_` argument to an `alt.LookupData()` that looks up by the `id` column in the data stored at the url in `state_url` (that we have provided for you). Select the columns `birth_state` and `player_num` from this source. 

Combine the plots `states_plot` and `background` to make the complete plot named `players_map`. 
- Add the plots together so they are layered on one another.
- Give the combined plot a height of 250 and a width of 500. 
- Assign an `albersUsa` projection. 

If you need further help with this, take hints from **Question 4.3** in ***Assignment 6***. 
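Conceptually, `.transform_lookup()` behaves much like a left join: each map feature keeps its row and picks up extra columns from the second source wherever the `id` values match. A rough `pandas` analogy, with hypothetical frames and made-up `player_num` values (only the state ids 1, 6 and 8 are taken from the `players_df` output earlier in this notebook):

```python
import pandas as pd

# Stand-in for the map features: one row per state id
map_features = pd.DataFrame({"id": [1, 2, 6, 8]})

# Stand-in for the data at state_url; player_num values are invented
states = pd.DataFrame({
    "id": [1, 6, 8],
    "birth_state": ["AL", "CA", "CO"],
    "player_num": [10, 20, 5],
})

# Left join on id, keeping only the looked-up columns
joined = map_features.merge(
    states[["id", "birth_state", "player_num"]], on="id", how="left"
)

print(joined)
```

The feature with `id` 2 has no match, so its looked-up columns are empty. On the map, a state in that situation would have no value to colour by.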

In [30]:
state_url = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/states_df.json'
background = alt.Chart(us_map).mark_geoshape(color="white", stroke="grey")
states_plot = alt.Chart(us_map).mark_geoshape(stroke='black', strokeWidth=0.15).encode(
    color = alt.Color('player_num:Q', scale=alt.Scale(scheme="blues"), title="Number of Players"),
    tooltip = [alt.Tooltip("birth_state:N", title="State"),
               alt.Tooltip("player_num:Q", title="Number of players")]).transform_lookup(
    lookup='id',
    from_=alt.LookupData(state_url, 'id', ['birth_state', 'player_num']))
players_map = (
     (background + states_plot)
     .properties(height=250, width=500)
     .project(type='albersUsa')
 )
players_map

In [31]:
t.test_3_2a(background)

'Success'

In [32]:
t.test_3_2b(states_plot, players_map)

'Success'

In [33]:
t.test_3_2c(states_plot)

'Success'

In [34]:
t.test_3_2d(states_plot)

'Success'

In [35]:
t.test_3_2e(background, states_plot, players_map)

'Success'

**Question 3.3** <br> {points: 4}

Ok, so we have a map, what else can we do with it? 


What if we wanted to compare pitching statistics and see if some states produce better pitchers? 

To do this we will be looking at the earned run average (ERA) of each pitcher and plot it vs the strikeout rate (Strikeouts/batter faced). 

In baseball, ERA is a well-known and accepted statistic that helps measure the performance of a pitcher ([source](https://www.mlb.com/glossary/standard-stats/earned-run-average#:~:text=ERA%20is%20the%20most%20commonly,runners%20will%20count%20against%20him)). Teams and leagues want pitchers who allow the opposing team fewer runs, so a pitcher with a lower ERA is a more successful player.

Pitchers also want to strike out as many batters as they can in a game to prevent the opposing team from earning runs. A strikeout is when a batter gets three strikes at bat, with a strike being defined as:
 1. a batter swinging at a pitch and missing, 
 2. a batter not swinging at a pitched ball that passes through the strike zone, or
 3. a ball being hit foul. 
 
(you can read more about strikes [here](https://baseballrulesacademy.com/official-rule/nfhs/what-is-a-strike/)). 

The statistic we use here is strikeouts per batter faced: the number of batters a pitcher strikes out divided by all the batters they encountered. 

Pitchers want a higher strikeout/batter faced and a lower ERA. 
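
To make the two statistics concrete, here is a small sketch that computes them for a made-up pitcher. The ERA formula (earned runs per nine innings pitched) is the standard definition and is our assumption here; the assignment data already contains precomputed `ERA` and `SO_per_BF` columns. The function names and season totals are hypothetical.

```python
def earned_run_average(earned_runs, innings_pitched):
    """Earned runs allowed per nine innings pitched (standard ERA definition)."""
    return 9 * earned_runs / innings_pitched


def strikeouts_per_batter_faced(strikeouts, batters_faced):
    """Fraction of batters faced that the pitcher struck out."""
    return strikeouts / batters_faced


# Hypothetical season totals, for illustration only
era = earned_run_average(earned_runs=60, innings_pitched=180)
so_per_bf = strikeouts_per_batter_faced(strikeouts=200, batters_faced=800)

print(era)        # 3.0
print(so_per_bf)  # 0.25
```

A pitcher in the lower-right corner of an ERA vs. strikeout-rate plot (high ERA, low strikeout rate) would be struggling, while one in the upper-left would be doing well.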

Baseball's popularity varies across US states, so it would be interesting to see whether there is a relationship between successful pitchers and the state where they were born. Let's explore it. 
 
Since this question is a little tricky, we've given you the majority of the code. Fill in the blanks (`...`) to produce an interactive panel in which the map we just made is used to select one or more states, and a new scatter plot of pitchers' ERA vs. strikeout rate shows the players from the selected state(s). 

Here we are using a new dataset whose URL is stored in the object `pitcher_url`. 
Again, since Altair cannot infer column types from data stored at a URL, make sure to specify the type of each column. 


In [36]:
pitcher_url = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_pitchers.json'

# state_select = alt....(fields=['id'])

# states_select_map = ....encode(
#     opacity=alt.condition(state_select, alt.value(0.8), alt.value(0.2)),
#     tooltip=[
#         alt....("birth_state:N", title="State Player was Born"),
#         alt.Tooltip("player_num:...", title="Number of players"),
#     ]).transform_lookup(
#     lookup="id",
#     from_=alt....(state_url, "id", ["birth_state", "player_num"])
# ).add_selection(...)

# pitching_scatter = alt.Chart(...).mark_circle(size=20).encode(
#     alt.X("...:Q", 
#           title="ERA (earned run average)",
#           scale=alt.Scale(domain=[0, 55])
#          ),
#     alt.Y("SO_per_BF:Q", 
#           ...="Strikeouts per batter faced", 
#           ...=alt.Scale(domain=[0, 0.5])
#          ),
#     tooltip=["name_first:N", "name_last:N"]
# ).properties(width=500,
#              title=" There is a negative relationship between ERA and Strikout rates of pitchers"
#             ).transform_filter(...)

# map_panel =  pitching_scatter & ...

# map_panel

# your code here
state_select = alt.selection_multi(fields=['id'])

states_select_map = players_map.encode(
    opacity=alt.condition(state_select, alt.value(0.8), alt.value(0.2)),
    tooltip=[
        alt.Tooltip("birth_state:N", title="State Player was Born"),
        alt.Tooltip("player_num:Q", title="Number of players"),]).transform_lookup(
    lookup="id",
    from_=alt.LookupData(state_url, "id", ["birth_state", "player_num"])).add_selection(state_select)

pitching_scatter = alt.Chart(pitcher_url).mark_circle(size=20).encode(
    alt.X("ERA:Q", title="ERA (earned run average)", scale=alt.Scale(domain=[0, 55])),
    alt.Y("SO_per_BF:Q", 
          title="Strikeouts per batter faced", 
          scale=alt.Scale(domain=[0, 0.5])),
    tooltip=["name_first:N", "name_last:N"]).properties(
    width=500,title="There is a negative relationship between ERA and Strikout rates of pitchers").transform_filter(state_select)

map_panel =  pitching_scatter & states_select_map

map_panel

In [37]:
t.test_3_3a(state_select, states_select_map)

'Success'

In [38]:
t.test_3_3b(state_select, pitching_scatter)

'Success'

In [39]:
# check that the variables exist
assert ('states_select_map' in globals() and
        'pitching_scatter' in globals() and
        'map_panel' in globals()
       ), "Please make sure that your solutions are named 'states_select_map', 'pitching_scatter', and 'map_panel'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.4** <br> {points: 1}

Using the plot above, where was the pitcher with the highest strikeout rate and lowest ERA born? 

A) California (CA) 

B) Texas (TX)

C) Illinois (IL)

D) New York (NY)


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_4`.*

In [40]:
answer3_4 = "C"
answer3_4

'C'

In [41]:
t.test_3_4(answer3_4)

'Success'

# 4. Batting up! 

One of the columns in the data shows which side the player bats from. We also know that around 10% of the population is left-handed. It would be interesting to see if the same proportion applies to the baseball players in the data. It might also be interesting to see if the state they are from affects the ratio of left-handed to right-handed batters. Let's make a bar plot that counts the total of each type of batter: left, right, or both (`L`, `R`, `B`). It's important to note that there are a lot of missing values in this column, so we should plot the count of those as well. We've already adjusted the data so that blank values are replaced with "No record", which allows them to be selected as an option. 

Generally speaking, a right-handed player will usually bat right and a left-handed player will usually bat left ([source](https://www.beabetterhitter.com/the-top-and-bottom-hand-swing/)). That being said, there are plenty of players who do not follow this pattern. 



Here we see Michael Perez, who previously played for the Tampa Bay Rays and bats left:

<img src="https://storage.googleapis.com/afs-prod/media/3b5837746f6d464aab4f0d11af4c0389/1000.jpeg" alt="error 404" width="40%"> 

[[image source](https://apnews.com/article/mlb-tampa-bay-rays-kevin-cash-boston-red-sox-archive-d28c4711c096449f004f2b9fe46a2299/gallery/3b5837746f6d464aab4f0d11af4c0389)]

And Frank Thomas (playing for the Oakland Athletics), who was a heavy-hitting right-handed batter before he retired:

<img src="https://cdn.bleacherreport.net/images_root/slides/photos/000/353/197/80945570_original.jpg?1282348588" alt="error 404" width="60%"> 

[[image source](https://bleacherreport.com/articles/440271-albert-pujols-and-the-20-greatest-right-handed-hitters-of-all-time)]


For the interactive portion of these plots, we want to link the bar graph showing the number of players who bat left, right, and both (and those with missing records) to the map of the players' birth states. Maybe there is a relationship between where players come from and the way they bat? Let's find out.


**Question 4.1** <br> {points: 3}


We want to create a horizontal bar plot that displays the count of left (L) batters, right batters (R), switch hitters who bat both ways (B) as well as the players missing this data. Let's first make it without any interactivity. 

Make sure to set proper axis labels but do **NOT** give this plot a title since we need it as a base for the future combined interactive plot we will be making.

*Hints:*
*Create a horizontal bar plot that counts the number of left (L) batters, right batters (R), switch hitters (B) and the players missing this data. We will be using this with the map from above. Sort the categorical variable so that the order of the bars is `L`, `R`, `B`, and `No record` is the last bar. We want to make sure all the bars show even when there are no records for them, so we need to specify the domain to contain all 4 categories (the domain should equal a list of values!). Also, give it a colour to spice it up!*

*Assign your plot to an object named `bats_plot`*.
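
To see why fixing the domain matters, here is a plain-Python analogy (not how Altair works internally): when we count only the values that appear, a category with no players simply vanishes, but when we walk through a fixed list of categories it shows up with a count of zero. The sample `bats` values below are hypothetical.

```python
from collections import Counter

# Hypothetical values of the `bats` column for a handful of players
bats = ["R", "L", "R", "No record", "R", "L"]

# The full set of categories, in the order the bars should appear
sort_order = ["L", "R", "B", "No record"]

counts = Counter(bats)

# Counter returns 0 for missing keys, so "B" still gets an entry
bar_lengths = [(side, counts[side]) for side in sort_order]

print(bar_lengths)
# [('L', 2), ('R', 3), ('B', 0), ('No record', 1)]
```

This will become useful once the bars are filtered by state, since a small state may have no switch hitters at all.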

In [42]:
sort_order = ['L', 'R', 'B', 'No record']

bats_plot = alt.Chart(players_url).mark_bar(color="purple").encode(
    alt.X('count(y):Q', title = "Number of Players"),
    alt.Y('bats:N', sort=sort_order, scale=alt.Scale(domain=sort_order), title="Batting Side"))

bats_plot

In [43]:
# check that the variable exists
assert 'bats_plot' in globals(
), "Please make sure that your solution is named 'bats_plot'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [44]:
t.test_4_1b(bats_plot)

'Success'

**Question 4.2** <br> {points: 1}

Which of the following statements is correct? 

A) Similar to regular life, left-batting (left-handed) players are approximately 10% of the players in the data.

B) Unlike in regular life, the reverse happens. Instead, right-batting (right-handed) players are approximately 10% of the players in the data.

C) In the data, left-batting players are around a quarter of the players.

D) In the data, right-batting players are around a quarter of the players.


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_2`.*

In [45]:
answer4_2 = "C"
answer4_2

'C'

In [46]:
t.test_4_2(answer4_2)

'Success'

**Question 4.3** <br> {points: 3}

Now let's filter the data in the bar plot based on the birth state of the players selected from the `states_select_map` plot we made in **Question 3.3**. 

Using the `state_select` selection tool (from **Question 3.3**), add a transformation filter to the `bats_plot` plot from **Question 4.1** and name the resulting object `filtered_bats_bars`.

Combine `states_select_map` and `filtered_bats_bars` vertically and name the new plot `map_bat_plot`. 
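
If you are wondering what filtering on a selection does to the data, the following plain-Python sketch is one way to think about it. It is a conceptual model only, assuming the default behaviour in which an empty selection (nothing clicked yet) keeps every row. The records and the `filter_by_selection` helper are hypothetical.

```python
# Hypothetical player records; `id` stands in for the state id used by the map
players = [
    {"name": "A", "id": 6, "bats": "L"},
    {"name": "B", "id": 6, "bats": "R"},
    {"name": "C", "id": 36, "bats": "R"},
    {"name": "D", "id": 48, "bats": "B"},
]


def filter_by_selection(rows, selected_ids):
    """Keep rows whose `id` is selected; an empty selection keeps everything."""
    if not selected_ids:
        return list(rows)
    return [row for row in rows if row["id"] in selected_ids]


nothing_clicked = filter_by_selection(players, set())
one_state = filter_by_selection(players, {6})
two_states = filter_by_selection(players, {6, 48})

print(len(nothing_clicked), len(one_state), len(two_states))  # 4 2 3
```

Because the selection is defined on the `id` field, the bar plot's data needs a matching `id` column for the filter to have anything to compare against.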


In [47]:
filtered_bats_bars = bats_plot.transform_filter(state_select)
map_bat_plot = states_select_map & filtered_bats_bars
map_bat_plot

In [48]:
t.test_4_3a(state_select, filtered_bats_bars)

'Success'

In [49]:
t.test_4_3b(states_select_map, filtered_bats_bars, map_bat_plot)

'Success'

**Question 4.4** <br> {points: 1}

Of the players from Wyoming (WY), which side is the most dominant? 

*(Wyoming is square shaped and in the West - to me, Midwest but the [U.S. Census Bureau](https://web.archive.org/web/20130921053705/http://www.census.gov/geo/maps-data/maps/pdfs/reference/us_regdiv.pdf) says West)* 

A) Left (L)

B) Right (R)

C) Switch hitters (B)

D) Left and Right have equal numbers of players


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_4`.*

In [50]:
answer4_4 = "D"
answer4_4

'D'

In [51]:
t.test_4_4(answer4_4)

'Success'

**Question 4.5** <br> {points: 1}

Of the players from New York, which side is the most dominant?

A) Left

B) Right 

C) Both

D) Left and Right have equal numbers of players

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_5`.*

In [52]:
answer4_5 = "B"
answer4_5

'B'

In [53]:
t.test_4_5(answer4_5)

'Success'

**Question 4.6** <br> {points: 1}

How many players from New Hampshire bat left? 

*(New Hampshire is the small triangle state in the upper East coast.)*

*Assign the number in an object of type `int` named `answer4_6`*.

In [54]:
answer4_6 = 15
answer4_6

15

In [55]:
t.test_4_6(answer4_6)

'Success'

# 5. The Value of a Homerun?

Well, we've explored the players' statistics but we have yet to ~celebrate~ analyze their achievements. Let's bring in some new data that contains the home runs and salary of each player in 2016. 
 
To analyze our data better we will be adding interactivity using dropdown menus and radio buttons. 

**Question 5.1** <br> {points: 2}

Read in the data `baseball_players_stats.json` from the link [here](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players_stats.json) that we've provided below.

(This step is so you can take a look at the data) 

*Assign your answer as a dataframe to an object named `player_stats`*.

In [56]:
batting_url = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players_stats.json'
player_stats = pd.read_json(batting_url)
player_stats

Unnamed: 0,playerID,teamID,lgID,G,AB,R,H,2B,3B,HR,...,HBP,SH,SF,GIDP,name_first,name_last,bats,throws,team_name,salary
0,abadfe01,MIN,AL,39,1,0,0,0,0,0,...,0,0,0,0,Fernando,Abad,L,L,Minnesota Twins,1250000
1,arciaos01,MIN,AL,32,103,8,22,4,0,4,...,1,0,0,1,Oswaldo,Arcia,L,R,Minnesota Twins,535000
2,buxtoby01,MIN,AL,92,298,44,67,19,6,10,...,3,4,3,2,Byron,Buxton,R,R,Minnesota Twins,512500
3,doziebr01,MIN,AL,155,615,104,165,35,5,42,...,8,2,5,12,Brian,Dozier,R,R,Minnesota Twins,3000000
4,escobed01,MIN,AL,105,352,32,83,14,2,6,...,1,2,1,7,Eduardo,Escobar,B,R,Minnesota Twins,2150000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
800,strasst01,WAS,NL,24,48,3,10,1,0,0,...,0,7,0,2,Stephen,Strasburg,R,R,Washington Nationals,10400000
801,taylomi02,WAS,NL,76,221,28,51,11,0,7,...,1,0,1,2,Michael,Taylor,R,R,Washington Nationals,524000
802,treinbl01,WAS,NL,73,0,0,0,0,0,0,...,0,0,0,0,Blake,Treinen,R,R,Washington Nationals,524900
803,werthja01,WAS,NL,143,525,84,128,28,0,21,...,4,0,6,17,Jayson,Werth,R,R,Washington Nationals,21733615


In [57]:
# check that the variable exists
assert 'player_stats' in globals(
), "Please make sure that your solution is named 'player_stats'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.2** <br> {points: 2}

This new dataset has been pre-filtered to the players in 2016; otherwise it would have been HUGE!

Let's analyze the players' successes and whether they really *pay* off (pun intended)! Are players with more home runs also highly paid players? Or are there likely other important factors involved in salary decisions? Is this the same for all teams? Let's take a look.

Build a scatter plot using a circle mark (`mark_circle()`) that maps the players' home runs vs. the players' salaries, assigning the colour to the team they played on in 2016. 
Set the x-axis domain from 0-50. For the y-axis, set the domain from 500,000-50,000,000, and transform it to a log scale. You'll also have to use `nice=False` within the `alt.Scale()` function so that the axis gets properly restricted. 
Set the tooltip channel to the players' first name, last name and salary. Make sure there are proper labels for the axis and a title for the plot.
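
To get a feel for what the log transformation does to the salary axis, the sketch below compares where a salary would land on a base-10 log scale versus a linear scale over the same domain. This is an illustration of the arithmetic only, not Altair's internal code, and the helper names are hypothetical.

```python
import math


def log_position(value, domain_min=500_000, domain_max=50_000_000):
    """Fraction of the axis length (0 = bottom, 1 = top) on a base-10 log scale."""
    span = math.log10(domain_max) - math.log10(domain_min)
    return (math.log10(value) - math.log10(domain_min)) / span


def linear_position(value, domain_min=500_000, domain_max=50_000_000):
    """Same fraction on an ordinary linear scale, for comparison."""
    return (value - domain_min) / (domain_max - domain_min)


for salary in (500_000, 5_000_000, 50_000_000):
    print(salary, round(log_position(salary), 3), round(linear_position(salary), 3))
```

A salary of 5,000,000 sits halfway up the log axis but less than a tenth of the way up the linear one, which is why the log scale spreads out the many players earning close to the minimum.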


*Assign your plot to an object named `stats_plot`*.

In [58]:
batting_url = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/baseball_players_stats.json'

stats_plot = alt.Chart(batting_url).mark_circle().encode(
    alt.X('HR:Q', title="Homeruns", scale=alt.Scale(domain=[0,50])),
    alt.Y('salary:Q', title='Salary (log $)', scale=alt.Scale(domain=[500000,50000000], type='log', nice=False)),
    alt.Color('team_name:N', title="Team Name"),
    tooltip = ['name_first:N', 'name_last:N', 'salary:Q']).properties(title='Baseball Players Home Runs & Salary')
stats_plot

In [59]:
t.test_5_2a(stats_plot)

'Success'

In [60]:
t.test_5_2b(stats_plot)

'Success'

**Question 5.3** <br> {points: 4}

Whoa, that's a big legend! It would be better if we could look at the teams individually, since faceting on `team_name` and using a legend with a colour channel are both far too much to handle at once. This is a great opportunity for a dropdown menu where we can highlight a team based on the selection we choose. 

First off, find all the unique values in the `team_name` column and save them in a list or an array in an object named `teams`. Make sure they are sorted alphabetically.  

Create a dropdown menu using `teams` as the options within `.binding_select()` and save it in an object named `dropdown_team`. Give this dropdown an informative label like "Team name". 

Create a single selection tool using `.selection_single()`, set the fields to `team_name`, bind it to `dropdown_team`, and save this as `select_team`. 

Now it's time to connect these to the plot! Use the plot `stats_plot` from **Question 5.2** as a base.
- Encode the colour channel so now it is only a single colour - navy.
- Encode opacity channels to a condition where depending on `select_team`, either an opacity value of 0.8 or 0.08 will result. 
- Finally make sure to add the selection option with `select_team` using `.add_selection()` and `.transform_filter()`.
- Save this plot in an object named `stats_drop_plot`. 
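
The logic behind the dropdown options and the opacity condition can be sketched in plain Python. This is a conceptual illustration rather than Altair code: the three rows are taken from the `player_stats` preview above, and the `opacity_for` helper is hypothetical. We assume that, before any team is chosen, every point counts as selected.

```python
# A few rows from the player_stats preview, reduced to two columns
rows = [
    {"name_last": "Dozier", "team_name": "Minnesota Twins"},
    {"name_last": "Werth", "team_name": "Washington Nationals"},
    {"name_last": "Buxton", "team_name": "Minnesota Twins"},
]

# Unique team names in alphabetical order: the dropdown options
teams = sorted({row["team_name"] for row in rows})


def opacity_for(row, selected_team):
    """Mimic the opacity condition: 0.8 for the chosen team, 0.08 otherwise.

    With no team chosen yet (None), every point is treated as selected.
    """
    if selected_team is None or row["team_name"] == selected_team:
        return 0.8
    return 0.08


print(teams)
# ['Minnesota Twins', 'Washington Nationals']
print([opacity_for(row, "Minnesota Twins") for row in rows])
# [0.8, 0.08, 0.8]
```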

In [61]:
teams = sorted(player_stats['team_name'].unique())
dropdown_team = alt.binding_select(name='Team ',options=teams)
select_team = alt.selection_single(fields=['team_name'], bind=dropdown_team)
stats_drop_plot = stats_plot.encode(
    color=alt.value('navy'),
    opacity=alt.condition(select_team, alt.value(0.8), alt.value(0.08))).add_selection(select_team).transform_filter(select_team)

stats_drop_plot

In [62]:
t.test_5_3a(teams)

'Success'

In [63]:
t.test_5_3b(teams, dropdown_team)

'Success'

In [64]:
t.test_5_3c(dropdown_team, select_team)

'Success'

In [65]:
t.test_5_3d(select_team, stats_drop_plot)

'Success'

**Question 5.4** <br> {points: 1}

Of the following, which team tends to pay their high scoring home run players more than the others?

A) Oakland Athletics

B) Toronto Blue Jays

C) Cincinnati Reds

D) Chicago Cubs

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_4`.*

In [66]:
answer5_4 = "B"
answer5_4

'B'

In [67]:
t.test_5_4(answer5_4)

'Success'

**Question 5.5** <br> {points: 4}

Great! Using the dropdown helps us see the players' home runs and salary on a team basis which helps with our analysis.

Earlier we talked about batting from the left vs. the right side. Do you think that one side is paid more than the other? Are individual teams paying players differently based on how they bat?
Are certain batters more likely to hit home runs? 

To answer these questions, let's add some radio buttons to highlight players based on how they bat. 

Begin by creating a list or an array of all the unique batting sides (the values of the `bats` column), saving this in an object named `batting_side`. 

Using `.binding_radio()`, make a radio menu using `batting_side` as options and save this in an object named `radio_batting`. 

(We made a dropdown menu already for `team_name` in **Question 5.3** called `dropdown_team`.) 

Create a selection tool using `.selection_single()` and set the fields to both `team_name` and `bats`. Set the `bind` argument equal to a dictionary with the `bats` and `team_name` columns as the keys and `radio_batting` and `dropdown_team` as the respective values. Save this as an object named `select_bat_and_team`. 

Finally, let's edit the `stats_plot`.
- Set the color channel to a `navy` color. 
- Make sure to add the selection option **AND** the filtering option using `.add_selection()` and `.transform_filter()` but this time selecting on `select_bat_and_team`.

Save this plot in an object named `stats_radio_plot`. 
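
When a selection covers two fields, a row has to agree with both widgets to survive the filter. Here is a plain-Python sketch of that idea (a conceptual model, not Altair's implementation). The rows come from the `player_stats` preview above, and the `matches` helper and `selection` dictionary are hypothetical.

```python
# A few rows from the player_stats preview, reduced to three columns
rows = [
    {"name_last": "Abad", "team_name": "Minnesota Twins", "bats": "L"},
    {"name_last": "Dozier", "team_name": "Minnesota Twins", "bats": "R"},
    {"name_last": "Escobar", "team_name": "Minnesota Twins", "bats": "B"},
    {"name_last": "Werth", "team_name": "Washington Nationals", "bats": "R"},
]


def matches(row, selection):
    """True when the row agrees with every field that the widgets have set."""
    return all(row[field] == value for field, value in selection.items())


# Dropdown set to the Twins, radio button set to right-handed batters
selection = {"team_name": "Minnesota Twins", "bats": "R"}
filtered = [row["name_last"] for row in rows if matches(row, selection)]

print(filtered)  # ['Dozier']
```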

In [68]:
batting_side = player_stats['bats'].unique()
radio_batting = alt.binding_radio(name='Batting Side ',options=batting_side)
select_bat_and_team = alt.selection_single(fields=['team_name','bats'], bind={'bats':radio_batting,
                                                                             'team_name': dropdown_team})
stats_radio_plot = stats_plot.encode(
    color = alt.value('navy')).add_selection(select_bat_and_team).transform_filter(select_bat_and_team)
stats_radio_plot

In [69]:
t.test_5_5a(batting_side)

'Success'

In [70]:
t.test_5_5b(batting_side, radio_batting)

'Success'

In [71]:
t.test_5_5c(dropdown_team, radio_batting, select_bat_and_team)

'Success'

In [72]:
t.test_5_5d(select_bat_and_team, stats_radio_plot)

'Success'

**Question 5.6** <br> {points: 1}

Looking at the plot above, which players on the San Diego Padres get more home runs? (Remember this is too small of a data set to come to any firm conclusions here) 

A) Left batting players (L)

B) Right batting players (R)

C) Ambidextrous batting players (B) 


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_6`.*

In [73]:
answer5_6 = "B"
answer5_6

'B'

In [74]:
t.test_5_6(answer5_6)

'Success'

You did it! You completed all 7 assignments of Data Visualization. Congratulations! 

Let's now use these skills in our final project. 

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel, clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions

- MDS DSCI 531: Data Visualization I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_531_viz-1) 

- [Sean Lahman's Baseball dataset](http://www.seanlahman.com/baseball-archive/statistics/) 