# Baby Boomers Thru Time
## Demographics Report by Madhurima Narendran (mn25679)

## Homework 4

### Creating Interactive Charts to Visualize Population Shifts over Time with Altair

Baby boomers (often shortened to boomers) are the demographic cohort following the Silent Generation and preceding Generation X. The generation is generally defined as people born from 1946 to 1964, during the post–World War II baby boom. The term is also used outside the United States, but the dates, the demographic context, and the cultural identifiers may vary. The baby boom has been described variously as a "shockwave" and as "the pig in the python." Baby boomers are often parents of late Gen Xers and Millennials. (From [Wikipedia](https://en.wikipedia.org/wiki/Baby_boomers).)

Let's explore this "shockwave" by examining the US Census data available via the `vega_datasets` package. We'll start with some data engineering: adding a column to our population data that denotes generational membership. Then we'll juxtapose the sex distribution of the population using the brushing and linking technique we studied in the lab. Finally, we'll add a slider to animate the transition through time.

In [1]:
# Import the necessary libraries and data
import altair as alt
import pandas as pd
from vega_datasets import data

df_population = data.population()

In [2]:
df_population.head()

| | year | age | sex | people |
|---|---|---|---|---|
| 0 | 1850 | 0 | 1 | 1483789 |
| 1 | 1850 | 0 | 2 | 1450376 |
| 2 | 1850 | 5 | 1 | 1411067 |
| 3 | 1850 | 5 | 2 | 1359668 |
| 4 | 1850 | 10 | 1 | 1260099 |


## Q1 - Add in the "Boomer" label

As we can see from inspecting the dataframe, our data only gives us information for: 
  - year
  - age
  - sex
  - people
  
However, we want to highlight the people born between 1946 and 1964 as a separate group. To accomplish this, we create a new categorical attribute, `Generation`.

Using pandas data manipulation techniques, add a new column to `df_population` named `Generation` that either has the value `Baby Boomer` or `Other`. 

In [3]:
df_population['Generation'] = df_population.apply(lambda row: 'Baby Boomer' if ((row.year - 1964) <= row.age <= (row.year - 1946)) else 'Other', axis=1)
df_population.sample(30)

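| | year | age | sex | people | Generation |
|---|---|---|---|---|---|
| 86 | 1870 | 25 | 1 | 1509059 | Other |
| 221 | 1910 | 75 | 2 | 350900 | Other |
| 291 | 1930 | 60 | 2 | 1783515 | Other |
| 484 | 1980 | 70 | 1 | 2857774 | Other |
| 45 | 1860 | 15 | 2 | 1495999 | Other |
| 489 | 1980 | 80 | 2 | 1919292 | Other |
| 329 | 1940 | 60 | 2 | 2317790 | Other |
| 502 | 1990 | 20 | 1 | 9436188 | Other |
| 332 | 1940 | 70 | 1 | 1280023 | Other |
| 449 | 1970 | 75 | 2 | 2293376 | Other |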
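
The row-wise `apply` in the cell above can also be written as a column-wise comparison on the implied birth year (`year - age`). The following is a minimal sketch on a small hand-made frame; the `toy` rows are illustrative and are not census values, and the sketch assumes the same inclusive 1946 to 1964 window.

```python
import pandas as pd

# Illustrative rows only (not real census counts); same column names as df_population.
toy = pd.DataFrame({
    "year": [1960, 1960, 1990, 1990, 2000, 2000],
    "age":  [10,   20,   30,   50,   35,   40],
})

# Implied birth year of each age group.
birth_year = toy["year"] - toy["age"]

# Series.between is inclusive on both ends by default.
is_boomer = birth_year.between(1946, 1964)
toy["Generation"] = is_boomer.map({True: "Baby Boomer", False: "Other"})

print(toy)
```

Because the ages in this dataset come in five-year steps, the label is decided by the nominal age of each bin, so membership at the edges of the window is approximate.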


## Q2 - Change the encoding for `sex`

As in our lab in class, the sex "Male" is encoded as the number `1` and the sex "Female" is encoded as `2`. Modify the dataframe `df_population` to replace these codes with the corresponding strings, so that the legend comes out correctly when we create our plots (note that you can also map numbers to labels in Altair).

In [4]:
df_population['sex'] = df_population['sex'].apply(lambda x: 'Male' if x == 1 else 'Female')

In [5]:
df_population.head()

| | year | age | sex | people | Generation |
|---|---|---|---|---|---|
| 0 | 1850 | 0 | Male | 1483789 | Other |
| 1 | 1850 | 0 | Female | 1450376 | Other |
| 2 | 1850 | 5 | Male | 1411067 | Other |
| 3 | 1850 | 5 | Female | 1359668 | Other |
| 4 | 1850 | 10 | Male | 1260099 | Other |
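
One possible alternative to the `lambda` above is an explicit dictionary passed to `Series.map`. The sketch below uses a made-up series (`sex_codes`, including a hypothetical unexpected code `3`) to show the difference: an unmapped code becomes `NaN` instead of silently falling into the `Female` branch.

```python
import pandas as pd

# Made-up codes for illustration; 3 stands in for an unexpected value.
sex_codes = pd.Series([1, 2, 1, 2, 3])

# Explicit mapping: anything not listed in the dictionary becomes NaN.
labels = sex_codes.map({1: "Male", 2: "Female"})

print(labels)
print("unmapped codes:", int(labels.isna().sum()))
```

Whether surfacing unexpected codes as missing values is preferable depends on the data; for this dataset, where only `1` and `2` appear in the rows shown above, both approaches give the same labels.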


## Q3 Juxtapose Bar Charts Horizontally 

Create a bar chart of the population distribution in the year 1960, and horizontally juxtapose the bar chart of the population distribution for the year 1990. Plot the total number of people (ignoring the `sex` attribute).

Note - You can slice the data to a given year with pandas before you pass it to Altair. (You can also do this in Altair with filters, but we haven't covered that yet.)

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for `Other`.

Fix the y axis so it is equal in both plots.

In [6]:
bar1990 = alt.Chart(df_population[df_population['year'] == 1990]).mark_bar().encode(
    x = alt.X('age:O', axis=alt.Axis(title='Age')),
    y = alt.Y('people:Q', axis=alt.Axis(title='Number of People')),
    color = alt.Color('Generation:N',
                      scale=alt.Scale(
                          domain=['Baby Boomer', 'Other'],
                          range=['#7D3C98', '#F4D03F']
                      )
    )
).properties(title='Distribution of Ages in 1990')

bar1960 = alt.Chart(df_population[df_population['year'] == 1960]).mark_bar().encode(
    x = alt.X('age:O', axis=alt.Axis(title='Age')),
    y = alt.Y('people:Q', axis=alt.Axis(title='Number of People')),
    color = alt.Color('Generation:N',
                      scale=alt.Scale(
                          domain=['Baby Boomer', 'Other'],
                          range=['#7D3C98', '#F4D03F']
                      )
    )
).properties(title='Distribution of Ages in 1960')

alt.hconcat(bar1960, bar1990).resolve_scale(y='shared')
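
Since the data holds one row per sex for each age, the bars above stack those rows, and the bar height is the total over `sex`. As a rough cross-check of the values being plotted, the totals can be computed directly in pandas. The frame below (`toy`) is a tiny hand-made stand-in with invented counts, not the census data.

```python
import pandas as pd

# Invented counts for illustration; same column names as df_population.
toy = pd.DataFrame({
    "year":   [1960, 1960, 1960, 1960, 1990, 1990, 1990, 1990],
    "age":    [0, 0, 5, 5, 0, 0, 5, 5],
    "sex":    ["Male", "Female"] * 4,
    "people": [10, 9, 8, 7, 12, 11, 6, 5],
})

# Total people per (year, age), ignoring sex.
totals = (
    toy[toy["year"].isin([1960, 1990])]
    .groupby(["year", "age"], as_index=False)["people"]
    .sum()
)

# Largest bar across both years: a candidate upper bound for a common y axis.
y_max = int(totals["people"].max())

print(totals)
print("shared y upper bound:", y_max)
```

A value like `y_max` could be passed as `alt.Scale(domain=(0, y_max))` in both charts as another way to keep the two y axes equal.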

## Q5 - Show the Population Change Over Time with a Slider

Now we have snapshots of two different years next to each other, but what about creating a crude animation by controlling the year displayed with a slider?

Create a slider, using [this example](https://altair-viz.github.io/gallery/us_population_over_time.html) to help guide you. Our plot will look similar, except we have not split our bar chart up by `sex` yet. Name the slider 'Select Year:' (this is controlled in `binding_range`, not in the `selection_single` parameters).

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for `Other`.

Start the slider at 1900.

In [7]:
# Slider label and starting year follow the instructions above ('Select Year:', start at 1900)
slider = alt.binding_range(min=1900, max=2000, step=10, name="Select Year:")
select_year = alt.selection_single(name="year", fields=['year'],
                                   bind=slider, init={'year': 1900})

alt.Chart(df_population).mark_bar().encode(
    x=alt.X('age:O', title='Age'),
    y=alt.Y('people:Q', scale=alt.Scale(domain=(0, 24000000)), title='Number of People'),
    color=alt.Color('Generation:N',
                    scale=alt.Scale(
                          domain=['Baby Boomer', 'Other'],
                          range=['#7D3C98', '#F4D03F']
                      )
    )
).properties(
    width = 700,
    title='Population Distribution by Age in the USA'
).add_selection(
    select_year
).transform_filter(
    select_year
).configure_facet(
    spacing=8
)
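
The year filter keeps only rows whose `year` equals the slider value, so a slider stop with no matching rows would show an empty chart. The helper below is a small sketch for checking this before plotting; `missing_slider_stops` and `example_years` are hypothetical names introduced here, and the year list is made up for illustration.

```python
def missing_slider_stops(years_in_data, start, stop, step):
    """Return slider values in [start, stop] that have no matching year in the data."""
    available = set(years_in_data)
    return [year for year in range(start, stop + 1, step) if year not in available]


# Made-up year list; in the notebook this could be df_population['year'].unique().
example_years = [1900, 1910, 1920, 1940, 1950]

print(missing_slider_stops(example_years, 1900, 1950, 10))  # [1930]
```

An empty list means every slider position has data to display.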

## Q6 - Linking

Let's take a closer look at just the year 2000 data, and find what the distribution of sex is for each individual age grouping.  Plot the distribution of ages as a bar chart for just the year 2000, and link a histogram that will plot the distribution of sex for the current selection.  It should default to no age group selected.  The histogram for the sex distribution should appear below the year 2000 data (vertically concatenated). When a bar on the top chart is selected, indicate its selection by turning the other bars light gray.  The histogram of the sex distribution below it should be a horizontal bar chart. 

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for `Other`.

In [8]:
click = alt.selection_single(encodings=['x'])

age = alt.Chart(df_population[df_population['year'] == 2000]).mark_bar().encode(
    x = alt.X('age:O', axis=alt.Axis(title='Age')),
    y = alt.Y('people:Q', axis=alt.Axis(title='Number of People')),
    color = alt.condition(click, 'Generation:N', alt.value('lightgray'))
).properties(
    title='Distribution of Ages in 2000',
    width=700
).add_selection(
    click
)

sex = alt.Chart(df_population[df_population['year'] == 2000]).mark_bar().encode(
    y = alt.Y('sex:N', title='Sex'),
    x = alt.X('people:Q', title='Number of People'),
    color = alt.Color('Generation:N',
                    scale=alt.Scale(
                          domain=['Baby Boomer', 'Other'],
                          range=['#7D3C98', '#F4D03F']
                      )
    )
).transform_filter(
    click
).properties(
    title='Distribution of Sex for Above Age Selection',
    width=700
)

age & sex
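
To sanity-check what the linked lower chart should display for a given click, the same filtering can be reproduced in pandas. This is only a sketch: `sex_split` is a hypothetical helper, and the `toy` frame uses invented counts in place of the census data.

```python
import pandas as pd

# Invented counts for illustration; same column names as df_population.
toy = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2000, 1990, 1990],
    "age":    [40, 40, 45, 45, 40, 40],
    "sex":    ["Male", "Female", "Male", "Female", "Male", "Female"],
    "people": [100, 110, 90, 95, 80, 85],
})


def sex_split(df, year, age=None):
    """Total people per sex for one year; age=None mimics 'no bar selected'."""
    subset = df[df["year"] == year]
    if age is not None:
        subset = subset[subset["age"] == age]
    return subset.groupby("sex")["people"].sum().to_dict()


print(sex_split(toy, 2000))      # no age selected: all ages in 2000
print(sex_split(toy, 2000, 40))  # age 40 selected
```

With no age passed, the helper sums over every age group, which corresponds to the chart's default state of no bar selected.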

## Q7 - Combine Q5 and Q6 to One Chart

In question 6, we linked the distribution of sex to the age selection for just the year 2000.  Let's visualize all the data by incorporating the year selection slider from question 5 so that you can select which year of data you are viewing. Retain the ability to just select one age group for the sex distribution, and default to no age group selected.

Add a tooltip so you can see exactly how many people are in the age range for the top "Distribution of Ages for the Selected Year" histogram. 

Encode membership of the Baby Boomer generation with color, using "#7D3C98" (purple) for the boomers and "#F4D03F" (gold) for `Other`.

In [9]:
click = alt.selection_single(encodings=['x'])
slider = alt.binding_range(min=1900, max=2000, step=10, name="Select Year")
select_year = alt.selection_single(name="year", fields=['year'],
                                   bind=slider, init={'year': 2000})

age = alt.Chart(df_population).mark_bar().encode(
    x = alt.X('age:O', axis=alt.Axis(title='Age')),
    y = alt.Y('people:Q', axis=alt.Axis(title='Number of People'), scale=alt.Scale(domain=(0,24000000))),
    color = alt.condition(click, 'Generation:N', alt.value('lightgray')),
    tooltip = ["age", "people"]
).properties(
    title='Distribution of Ages for the Selected Year',
    width=700
).add_selection(
    click, select_year
).transform_filter(
    select_year
)

sex = alt.Chart(df_population).mark_bar().encode(
    y = alt.Y('sex:N', title='Sex'),
    x = alt.X('people:Q', title='Number of People'),
    color = alt.Color('Generation:N',
                    scale=alt.Scale(
                          domain=['Baby Boomer', 'Other'],
                          range=['#7D3C98', '#F4D03F']
                      )
    )
).transform_filter(
    click
).transform_filter(
    select_year
).properties(
    title='Distribution of Sex for Above Age Selection',
    width=700
)

age & sex