You will be writing an interactive data visualization article aimed at the public. Your article should feature:

1.	A compelling <"title">: don't forget to specify that you or your group are/is the author/s!

2.	At least <'one central interactive visualization'> featuring your primary dataset. This can be similar to what you submitted in the last phase but does not need to be a dashboard. Remember, this is for the public so it should be large and friendly (see Rubric for details about what we are looking for).

3.	At least <'two contextual visualizations'> - these can be other data visualizations you've done, or images from other places (remember to cite your sources!!).

4.	At least <'3 paragraphs of connective information'> to help a novice understand what is happening in your datasets.

5.	<'Citations'> of all the data sources used and information for the reader to be able to find those datasets themselves.

6.	Links to any analysis Jupyter <'notebooks'> used to create your final visualizations.

## Family Factors and Grades on Students’ Alcohol Consumption and Health Status

## Final Project Group 1 Members

Mingrui Xu

Mengyue Huang

In [23]:
import altair as alt
import pandas as pd

In [24]:
math_df = pd.read_csv('https://query.data.world/s/uxp3sfi63xgzge6gkkycbi2tz3gh6x')
math_df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
644,MS,F,19,R,GT3,T,2,3,services,other,...,5,4,2,1,2,5,4,10,11,10
645,MS,F,18,U,LE3,T,3,1,teacher,services,...,4,3,4,1,1,1,4,15,15,16
646,MS,F,18,U,GT3,T,1,1,other,other,...,1,1,1,1,1,5,6,11,12,9
647,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,6,10,10,10


### Part 1: Main Analysis

#### Connectivity Paragraph 1

Introduction, Purpose, and Each Plot's Purpose:

For this project we use the Student Alcohol Consumption dataset, which records students' Math grades (main data frame) and Portuguese grades (contextual data frame). Our group wants to explore whether some of the features in its columns are related to students' Health Status (1-very bad to 5-very good) ('health') and Weekend Alcohol Consumption (1-very low to 5-very high) ('Walc'). In the figures below, the chosen features include students' Father's Job ('Fjob'), Mother's Job ('Mjob'), Portuguese Final Grade (0-20) ('G3' from por_df) and Math Final Grade (0-20) ('G3' from math_df). The main purpose of this project is to show potential trends between these factors and to draw real-life lessons for future students' health.

In [25]:
myDir = '/Users/98768/Desktop/mengyuehuang.github.io/assets/json/'

In [26]:
brush = alt.selection_interval(encodings=['x','y'])

# 1. heatmap
rect = alt.Chart(math_df).mark_rect().encode(alt.X('Fjob:N', axis=alt.Axis(title="Father's Job", labelAngle=-45)),
                                             alt.Y('Mjob:N', axis=alt.Axis(title="Mother's Job")),
                                             alt.Color("count()"),
                                             tooltip=['Fjob:N', 'Mjob:N']
                                             ).add_selection(brush)

# 2. histogram
hist = alt.Chart(math_df).mark_bar().encode(alt.X('health:Q', bin= True, axis=alt.Axis(title='Current Health Status (1-5)')),
                                             alt.Y('Walc:Q', axis=alt.Axis(title='Weekend Alcohol Consumption (1-5)')),
                                             tooltip=['health:Q', 'Walc:Q']
                                            ).transform_filter(brush)

# 3. connecting together
myDashboard = (rect.properties(width=350,height=350) | hist.properties(width=350,height=350)).configure_axis(labelFontSize=20,titleFontSize=20)
myDashboard

In [27]:
#myDashboard.save(myDir+'main_dashboard.json')

In [28]:
import numpy as np

In [29]:
# Mean weekend alcohol consumption for each (father's job, mother's job) combination
table_alco = pd.pivot_table(math_df, values='Walc', index='Fjob', columns='Mjob', aggfunc='mean', fill_value='na')
table_alco

Fjob / Mjob,at_home,health,other,services,teacher
at_home,2.055556,1.0,2.142857,1.857143,1.5
health,3.0,1.666667,1.666667,1.75,3.4
other,2.25974,2.619048,2.174863,2.362069,2.321429
services,2.388889,2.533333,2.4,2.677966,2.333333
teacher,1.0,3.0,1.375,1.5,2.1875


In [30]:
# Mean health status for each (father's job, mother's job) combination
table_health = pd.pivot_table(math_df, values='health', index='Fjob', columns='Mjob', aggfunc='mean', fill_value='na')
table_health

Fjob / Mjob,at_home,health,other,services,teacher
at_home,2.888889,5.0,3.642857,2.857143,5.0
health,3.0,4.333333,4.666667,4.5,4.2
other,3.545455,3.619048,3.491803,3.913793,3.75
services,3.055556,3.4,3.26,3.610169,3.380952
teacher,4.0,3.0,3.75,3.75,3.5625


Conclusion:

For the heatmap, the darkest cell is in the middle: families where both the father's job and the mother's job are 'other'. This makes sense, since the jobs that fall outside the named categories ('at_home', 'health', 'services', and 'teacher') surely outnumber the jobs inside them, making the 'other' and 'other' combination the one with the highest count. Other common job combinations include 'other' and 'at_home', 'other' and 'services', and 'services' and 'services'.
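The counts behind a heatmap like this can also be read off as a table. The following is a minimal sketch using a few made-up rows rather than the real survey data; the names `toy`, `counts`, and `most_common` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the real survey data.
toy = pd.DataFrame({
    "Fjob": ["other", "other", "other", "services", "teacher"],
    "Mjob": ["other", "other", "at_home", "services", "other"],
})

# Cross-tabulate father's job (rows) against mother's job (columns).
counts = pd.crosstab(toy["Fjob"], toy["Mjob"])

# The most frequent (Fjob, Mjob) pair, found from the group sizes.
most_common = toy.groupby(["Fjob", "Mjob"]).size().idxmax()

print(counts)
print("Most common combination:", most_common)
```

Applied to `math_df`, the same two calls would give the numbers that the heatmap colors encode.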

For the histogram of students' Health Status (1-5) and Weekend Alcohol Consumption (1-5), read together with the two pivot tables above, it seems that the highest average weekend alcohol consumption (3.4) comes from the combination of father's job 'health' and mother's job 'teacher', while the lowest average health status (about 2.86) comes from the combination of father's job 'at_home' and mother's job 'services'.
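Extreme cells like these can be located programmatically instead of by scanning the pivot tables. This is a minimal sketch on made-up rows, not the real survey data; `toy`, `summary`, `highest_walc`, and `lowest_health` are hypothetical names. It also keeps the group size `n`, because an average over only a handful of students should be read with caution.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the real survey data.
toy = pd.DataFrame({
    "Fjob": ["health", "health", "other", "other", "at_home", "at_home"],
    "Mjob": ["teacher", "teacher", "other", "other", "services", "services"],
    "Walc": [4, 3, 2, 2, 1, 2],
    "health": [4, 5, 3, 4, 2, 3],
})

# Mean values and group size for every (father's job, mother's job) pair.
summary = toy.groupby(["Fjob", "Mjob"]).agg(
    mean_walc=("Walc", "mean"),
    mean_health=("health", "mean"),
    n=("Walc", "size"),
)

# idxmax / idxmin return the (Fjob, Mjob) label of the extreme row.
highest_walc = summary["mean_walc"].idxmax()
lowest_health = summary["mean_health"].idxmin()

print(summary)
print("Highest mean Walc:", highest_walc)
print("Lowest mean health:", lowest_health)
```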

### Part 2: Contextual Analysis

In [31]:
por_df = pd.read_csv('https://query.data.world/s/uxp3sfi63xgzge6gkkycbi2tz3gh6x')
por_df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
644,MS,F,19,R,GT3,T,2,3,services,other,...,5,4,2,1,2,5,4,10,11,10
645,MS,F,18,U,LE3,T,3,1,teacher,services,...,4,3,4,1,1,1,4,15,15,16
646,MS,F,18,U,GT3,T,1,1,other,other,...,1,1,1,1,1,5,6,11,12,9
647,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,6,10,10,10


#### Connectivity Paragraph 2

'link to next':

In the visualizations below we use the contextual dataset, which focuses on students' Portuguese grades instead of Math. We look at whether students' grades in these 2 subjects are related to their weekend alcohol consumption ('Walc').
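Because the comparison relies on two separate tables, it may be worth confirming that the two data frames really differ before plotting them side by side. The following is a minimal sketch on made-up frames; `same_table`, `toy_math`, and `toy_por` are hypothetical names and not part of the notebook.

```python
import pandas as pd


def same_table(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True when both frames have identical shape, labels, and values."""
    return bool(a.equals(b))


# Toy frames for illustration only; these are NOT the real survey data.
toy_math = pd.DataFrame({"Walc": [1, 3, 2], "G3": [11, 12, 14]})
toy_por = pd.DataFrame({"Walc": [1, 3, 2], "G3": [13, 10, 15]})

print(same_table(toy_math, toy_math.copy()))  # an exact copy of the same table
print(same_table(toy_math, toy_por))          # the grade columns differ
```

Running `same_table(math_df, por_df)` in the notebook would show whether the two reads returned distinct tables.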

#### Contextual Plot 1

In [32]:
hist1_por = alt.Chart(por_df).mark_bar().encode(alt.X('G3:Q', axis=alt.Axis(title='Portuguese Final Grade (0-20)')),
                                                alt.Y('Walc:Q', axis=alt.Axis(title='Weekend Alcohol Consumption (1-5)')),
                                                tooltip=['G3:Q', 'Walc:Q']
                                               ).configure_axis(labelFontSize=15,titleFontSize=15)
hist1_por

In [33]:
#hist1_por.save(myDir+'hist1_por.json')

In [34]:
hist1_math = alt.Chart(math_df).mark_bar().encode(alt.X('G3:Q', axis=alt.Axis(title='Math Final Grade (0-20)')),
                                                  alt.Y('Walc:Q', axis=alt.Axis(title='Weekend Alcohol Consumption (1-5)')),
                                                  tooltip=['G3:Q', 'Walc:Q']
                                                 ).configure_axis(labelFontSize=15,titleFontSize=15)
hist1_math

In [35]:
#hist1_math.save(myDir+'hist1_math.json')

Conclusion: 

It seems that grades in these 2 subjects have little impact on students' weekend alcohol consumption. In other words, students consume a considerable amount of alcohol no matter what their Portuguese or Math grades are. However, there is still a noticeable trend at the right end of the grade range in both subjects: when students have Portuguese or Math grades of 18-20, their weekend alcohol consumption is lower than at lower grades.
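One way to check a pattern like this numerically is to average weekend consumption within grade bands. This is a minimal sketch on made-up rows, not the real survey data; the band edges and the names `toy`, `bands`, and `by_band` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the real survey data.
toy = pd.DataFrame({
    "G3":   [0, 5, 9, 10, 12, 14, 15, 17, 18, 20],
    "Walc": [3, 2, 4, 3, 2, 3, 2, 3, 1, 1],
})

# Split the 0-20 grade scale into bands; include_lowest keeps a grade of 0
# inside the first band.
bands = pd.cut(
    toy["G3"],
    bins=[0, 9, 13, 17, 20],
    include_lowest=True,
    labels=["0-9", "10-13", "14-17", "18-20"],
)

# Mean weekend alcohol consumption and number of students per band.
by_band = toy.groupby(bands, observed=False)["Walc"].agg(["mean", "size"])
print(by_band)
```

The `size` column matters here: if only a few students reach 18-20, a lower average in that band is weak evidence of a trend.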

#### Connectivity Paragraph 3

'link to next':

Since we have already visualized parents' job combinations against students' health status ('health') and weekend alcohol consumption ('Walc'), it may be helpful to see whether Family Size ('famsize') and Quality of Family Relationships (1-very bad to 5-excellent) ('famrel') are also correlated with students' health status (1-5).

#### Contextual Plot 2

In [43]:
heat2_por = alt.Chart(por_df).mark_rect().encode(alt.X('famsize:N', axis=alt.Axis(title="Family Size (>3 / <=3)", labelAngle=0)),
                                                 alt.Y('famrel:Q', axis=alt.Axis(title="Quality of Family Relationships (1-5)")),
                                                 color='Walc:Q',
                                                 tooltip=['famsize:N', 'famrel:Q', 'Walc:Q']
                                                ).properties(width=300,height=300).configure_axis(labelFontSize=15,titleFontSize=20)
heat2_por

In [44]:
#heat2_por.save(myDir+'heat2_por.json')

In [38]:
# Standard deviation of health status for each (family size, family relationship) cell
table_health3 = pd.pivot_table(por_df, values='health', index='famsize', columns='famrel', aggfunc='std', fill_value='na')
table_health3

famsize / famrel,1,2,3,4,5
GT3,1.772811,1.604732,1.442448,1.376771,1.496102
LE3,1.902379,1.368476,1.350657,1.368171,1.482786


In [39]:
# Mean health status for each (family size, family relationship) cell
table_health2 = pd.pivot_table(por_df, values='health', index='famsize', columns='famrel', aggfunc='mean', fill_value='na')
table_health2

famsize / famrel,1,2,3,4,5
GT3,3.0,3.111111,3.298701,3.608108,3.672
LE3,2.428571,3.454545,3.208333,3.821053,3.363636


Conclusion:

Browsing the plot, it seems that students drink more on weekends when the family consists of 3 or fewer people ('LE3'). Generally, the larger the family and the less harmonious the family atmosphere, the less alcohol students consume on weekends.
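A reading like this can be cross-checked by tabulating mean weekend consumption directly, in the same way the health tables were built. This is a minimal sketch on made-up rows, not the real survey data; `toy`, `table_walc`, and `by_size` are hypothetical names.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the real survey data.
toy = pd.DataFrame({
    "famsize": ["GT3", "GT3", "GT3", "LE3", "LE3", "LE3"],
    "famrel":  [4, 4, 5, 4, 5, 5],
    "Walc":    [2, 3, 1, 4, 2, 3],
})

# Mean weekend alcohol consumption per (family size, relationship quality) cell.
table_walc = pd.pivot_table(
    toy, values="Walc", index="famsize", columns="famrel", aggfunc="mean"
)

# Mean weekend alcohol consumption per family size alone.
by_size = toy.groupby("famsize")["Walc"].mean()

print(table_walc)
print(by_size)
```

Swapping `toy` for `por_df` would show whether the averages support the pattern seen in the heatmap colors.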

## Citation & References

UCI Machine Learning. (2016). Student Alcohol Consumption (Version V2) [Dataset]. Data Society. https://data.world/data-society/student-alcohol-consumption

https://stackoverflow.com/questions/54918651/controlling-bin-widths-in-altair

https://stackoverflow.com/questions/70988235/make-bar-charts-x-axis-markers-horizontal-or-45-degree-readable-in-python-altai

https://altair-viz.github.io/user_guide/generated/core/altair.Scale.html