Name: Jai Gollapudi
<br>
Class: DS4200
<br>
Assignment: HW3

## Part 1: Altair interactive plots

Gapminder is a non-profit organization that promotes global sustainable development and seeks to bridge the gap between misconceptions and data-driven understanding. We are going to explore a small subset of its data, which contains the average income, health score, and population of each country in the world. Region information is also provided.


In [None]:
import altair as alt
import pandas as pd

gapminder = pd.read_csv('gapminder-health-income.csv')
gapminder.head()

### Part 1.1 Add selection (5 points)

Make a scatter plot to show the relationship between average personal income and average health score. Add a tooltip that shows the region name and population. Also, allow the user to select a single country to highlight while all the others turn light grey.

In [None]:
# Creating a point selection to highlight points
point_selection = alt.selection_point(fields=['country'], empty=True)

# Scatter plot with tooltip
scatter_plot = alt.Chart(gapminder).mark_circle().encode(
    x='income:Q',
    y='health:Q',
    color=alt.condition(point_selection, 'region:N', alt.value('lightgrey')),
    tooltip=['country:N', 'region:N', 'population:Q']
).add_params(
    point_selection
).properties(
    width=800,
    height=300,
    title = 'Scatter plot between average personal income and average health score'
)

scatter_plot

### Part 1.2 Customize the color (10 points)

Now choose a customized color map for the previous question. Explain how you chose the color map and apply it to the plot.


In [None]:
# Creating a point selection to highlight points
point_selection = alt.selection_point(fields=['country'], empty=True)

# Setting the domain of regions (converted to a plain Python list)
domain = gapminder['region'].unique().tolist()

# Defining a custom color scale: one color per region, in domain order
color_scale = alt.Scale(domain=domain,
                        range=['rgb(0, 0, 255)',     # Blue
                               'rgb(0, 128, 0)',     # Green
                               'rgb(255, 0, 0)',     # Red
                               'rgb(255, 165, 0)',   # Orange
                               'rgb(0, 0, 0)',       # Black
                               'rgb(255, 255, 0)'])  # Yellow

# Applying the custom color scale to the scatter plot with tool tip
scatter_plot = alt.Chart(gapminder).mark_circle().encode(
    x='income:Q',
    y='health:Q',
    color=alt.condition(point_selection, 
                        alt.Color('region:N', scale=color_scale), 
                        alt.value('lightgrey')),
    tooltip=['country:N', 'region:N', 'population:Q']
).add_params(
    point_selection  
).properties(
    width=800,
    height=300,
    title = 'Scatter plot between average personal income and average health score'
)

scatter_plot

I chose my color scheme from [here](http://vrl.cs.brown.edu/color) to visualize the relationship between average personal income and health scores by region. Explanation:

- Increased Color Diversity: Using six distinct hues (blue, green, red, orange, black, and yellow) gives each region in the gapminder dataset its own color. This ensures that each region can be distinctly identified, which makes the scatter plot easier to interpret.


- Accessibility and Contrast: The selected colors offer high contrast against the light grey color used for non-selected countries. This contrast is crucial for accessibility and ensures that the viewer can easily distinguish between selected and non-selected points, as well as between different regions.


- Aesthetic Appeal: The colors balance aesthetic appeal with functionality. The combination of colors is visually engaging, encouraging viewers to explore the data more deeply. An attractive visualization can significantly enhance the user's engagement and interest in the data being presented.
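
The contrast claim above can be checked numerically. The following is a minimal sketch, assuming the WCAG 2.x relative-luminance and contrast-ratio formulas are an acceptable measure; the helper names (`relative_luminance`, `contrast_ratio`) are illustrative and not part of Altair.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (r, g, b) tuple with values in 0..255."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio between two colors; ranges from 1 to 21."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


LIGHT_GREY = (211, 211, 211)  # CSS 'lightgrey', used for non-selected points

# The six colors from the custom scale above
palette = {
    "blue": (0, 0, 255),
    "green": (0, 128, 0),
    "red": (255, 0, 0),
    "orange": (255, 165, 0),
    "black": (0, 0, 0),
    "yellow": (255, 255, 0),
}

# Print each color's contrast against the light grey background points
for name, rgb in palette.items():
    print(f"{name:>6}: {contrast_ratio(rgb, LIGHT_GREY):.2f}")
```

Colors with a ratio close to 1 are the hardest to tell apart from the greyed-out points, so they are the first candidates to swap if the highlight is not clear enough.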


### Part 1.3 Select across multiple panels (5 points)

Now add an interval selection so that the user can select any income range, and generate a second plot that shows the relationship between income and population for the selected range.

In [None]:
# Defining the interval selection
interval_selection = alt.selection_interval(encodings=['x'], empty=True)


# Scatter plot with tooltip
scatter_plot = alt.Chart(gapminder).mark_circle().encode(
    x='income:Q',
    y='health:Q',
    color=alt.condition(interval_selection, 
                        alt.Color('region:N'), 
                        alt.value('lightgrey')),
    tooltip=['country:N', 'region:N', 'population:Q']
).add_params(
    interval_selection  
).properties(
    width=800,
    height=300,
    title = 'Scatter plot between average personal income and average health score'
)

# Using the selection in the second plot
income_population_plot = alt.Chart(gapminder).mark_circle().encode(
    x='income:Q',
    y='population:Q',
    color='region:N',
    tooltip=['country:N', 'population:Q']
).transform_filter(
    interval_selection
).properties(
    width=800,
    height=300,
    title = 'Scatter plot between average personal income and population'
)

# Combining the plots
combined_plot = scatter_plot & income_population_plot

combined_plot

### Part 1.4 Data binding (10 points)

Instead of using the legend, now include radio buttons for the region so that each selection highlights only one region and turns the other points grey.

In [None]:
# Creating a radio button selection for filtering the data by region
radio_selection = alt.selection_point(
    fields=['region'],
    bind=alt.binding_radio(options=gapminder['region'].unique().tolist()),
    name="Region", 
    empty=True)

# Scatter plot with tooltip
scatter_plot = alt.Chart(gapminder).mark_circle().encode(
    x='income:Q',
    y='health:Q',
    color=alt.condition(radio_selection, 
                        alt.Color('region:N'), 
                        alt.value('lightgrey')),
    tooltip=['country:N', 'population:Q']
).add_params(
    radio_selection
).properties(
    width=800,
    height=300,
    title = 'Scatter plot between average personal income and average health score'
)

scatter_plot


### Part 1.5 Add filter with bars (10 points)

Add a slider such that, for a given slider value, we only show the countries whose population is less than that value.

In [None]:
# Defining the slider widget that spans the range of populations in the data
slider = alt.binding_range(min=int(gapminder['population'].min()),
                           max=int(gapminder['population'].max()),
                           step=10000,
                           name='Population less than:')

# Binding the slider to a parameter that starts at the maximum population
slider_selection = alt.param(name='limit',
                             value=int(gapminder['population'].max()),
                             bind=slider)


# Creating the scatter plot with a filter transformation based on the slider's value
scatter_plot = alt.Chart(gapminder).mark_circle().encode(
    x='income:Q',
    y='health:Q',
    color='region:N',
    tooltip=['country:N', 'region:N', 'population:Q']
).transform_filter(
    alt.datum.population < slider_selection
).add_params(
    slider_selection
).properties(
    width=800,
    height=300,
    title = 'Scatter plot between average personal income and average health score'
)

scatter_plot





## Part 2: D3 basic plots

In this question, we provide a CSV file with the penguin data. You need to make two plots by filling in the given template. When you submit the homework, please submit both the .html and .js files.

### Part 2.1 Scatter plot with groups (30 points)

Using the penguin data, make a scatter plot to show the relationship between flipper length and bill length. Use a different color for each species and add a legend.

Here is a general approach:

1. In the template, we provide a way to read the CSV file. Once the data is read into D3, all the inputs are treated as strings, so the first thing we need to do is convert the data to a numeric type. The code for this is also provided.
2. Define the dimensions and margins for the SVG and create the SVG canvas. (5 points)
3. Set up scales for the x and y axes. Set the domains of the x and y scales to the ranges of bill length and flipper length, padded by 5 on each side. One example of the .min function is provided. The color scale is also provided. (5 points)
4. Add the scales to the plot. (5 points)
5. Add circles for each data point. (5 points)
6. Add x-axis and y-axis labels. (5 points)
7. Add a legend. The legend has two parts: the circle and the text. First, set up a layout for the legend, and then add the circle and text to it. (5 points)
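
The scale setup in step 3 can be sketched outside of D3. This is a minimal Python illustration of the idea, assuming a simple linear mapping; `padded_domain` and `linear_scale` are hypothetical helpers (not D3 functions), and the flipper lengths and canvas height are made-up values.

```python
def padded_domain(values, pad=5):
    """Return (min - pad, max + pad) so points do not sit on the axes."""
    return (min(values) - pad, max(values) + pad)


def linear_scale(domain, range_):
    """Build a function that maps a data value in `domain` to a pixel in `range_`."""
    d0, d1 = domain
    r0, r1 = range_

    def scale(value):
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

    return scale


# Hypothetical flipper lengths (mm) and plot height (px), for illustration only
flipper_lengths = [181.0, 195.0, 210.0, 231.0]
height = 400

y_domain = padded_domain(flipper_lengths)
# The pixel range is flipped because SVG y coordinates grow downward
y_scale = linear_scale(y_domain, (height, 0))

print(y_domain)
print(y_scale(206.0))
```

The flipped range `(height, 0)` is the design choice to notice: the smallest data value lands at the bottom of the canvas and the largest at the top.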

### Part 2.2 Side-by-side boxplot (30 points)

Using the penguin data, make a side-by-side boxplot to show the distribution of flipper length across the three species. To make things easier, we can ignore outliers for now.

Here is a general approach:

1. First, convert the strings into numeric data as we did in the previous question. Set up the SVG canvas and the scales, add the scales to the canvas, and add labels for the scales. (5 points)
2. In order to make a boxplot, we need to calculate some basic metrics for the data. For each species, we need to calculate the q1, median, and q3. We first define a function called `rollupFunction` to list all the variables we need to calculate. Follow the example for q1 to set up the median and q3, or any other values you need. (5 points)
3. Add comments to the following two lines (in the .js file) to explain what the code is doing. (5 points)
    
    ```js
    const quartilesBySpecies = d3.rollup(data, rollupFunction, d => d.species);

    quartilesBySpecies.forEach((quartiles, species) => {
        const x = xScale(species);
        const boxWidth = xScale.bandwidth();
    ```
4. Inside the `.forEach` function, draw the boxes. There are three things you need to draw for the box plot:
    - The vertical line in the middle, from q1 - 1.5 * IQR to q3 + 1.5 * IQR (5 points)
    - The rectangle from q1 to q3. You can fill it with a color to hide the vertical line behind it. (5 points)
    - The horizontal line for the median (5 points)
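
The per-species metrics from steps 2 to 4 can be sketched in Python to check the numbers before drawing. This is a minimal sketch, not the template's `rollupFunction`: `box_stats` is a hypothetical helper, the rows are made-up values, and the `"inclusive"` quantile method is assumed to interpolate the same way as the D3 quantile function.

```python
from statistics import quantiles


def box_stats(values):
    """Compute q1, median, q3 and the whisker ends at 1.5 * IQR beyond the box."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": q1 - 1.5 * iqr,
        "whisker_high": q3 + 1.5 * iqr,
    }


# Hypothetical (species, flipper length in mm) rows, for illustration only
rows = [
    ("Adelie", 181.0), ("Adelie", 186.0), ("Adelie", 190.0),
    ("Adelie", 195.0), ("Adelie", 198.0),
    ("Gentoo", 210.0), ("Gentoo", 215.0), ("Gentoo", 220.0),
    ("Gentoo", 225.0), ("Gentoo", 230.0),
]

# Group the flipper lengths by species, similar in spirit to a rollup
by_species = {}
for species, flipper in rows:
    by_species.setdefault(species, []).append(flipper)

stats_by_species = {s: box_stats(v) for s, v in by_species.items()}

for species, stats in stats_by_species.items():
    print(species, stats)
```

Each resulting entry holds the five values needed for one box: the two whisker ends for the vertical line, q1 and q3 for the rectangle, and the median for the horizontal line.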


## Submission

Once you finish all the questions, submit the Jupyter notebook file for the Altair part, as well as the .html and .js files for the D3 part, to Gradescope.