## Students in Portugal dataset - Teenagers Drinking Habits Analysis

In this project we use a dataset containing information about Portuguese students from two public schools. This is a real-world dataset that was collected to study alcohol consumption in young people and its effects on students' academic performance. The dataset was built from two sources: school reports and questionnaires.

 Attribute contents:

    *1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
    *2. sex - student's sex (binary: 'F' - female or 'M' - male)
    *3. age - student's age (numeric: from 15 to 22)
    *4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
    *5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
    *6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
    *7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    *8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    *9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    *10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    *11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
    *12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
    *13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
    *14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
    *15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
    *16. schoolsup - extra educational support (binary: yes or no)
    *17. famsup - family educational support (binary: yes or no)
    *18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
    *19. activities - extra-curricular activities (binary: yes or no)
    *20. nursery - attended nursery school (binary: yes or no)
    *21. higher - wants to take higher education (binary: yes or no)
    *22. internet - Internet access at home (binary: yes or no)
    *23. romantic - with a romantic relationship (binary: yes or no)
    *24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
    *25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
    *26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
    *27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
    *28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
    *29. health - current health status (numeric: from 1 - very bad to 5 - very good)
    *30. absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

    *31. G1 - first period grade (numeric: from 0 to 20)
    *32. G2 - second period grade (numeric: from 0 to 20)
    *33. G3 - final grade (numeric: from 0 to 20, output target)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import sqrt, arange
from scipy import stats
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
student = pd.read_csv('student-por.csv')

In [None]:
student.head()

In this analysis, we are interested in 3 variables:

    1. Alcohol consumption level (we will create this variable and call it 'acl')
    2. Final grade for the course subject(G3)
    3. Gender of the student

In [None]:
# Rename 'sex' into 'gender'
student.rename(columns={'sex':'gender'}, inplace=True)
# Create the 'alcohol_index' column as a weighted mean of 'Dalc' (5 workdays) & 'Walc' (2 weekend days)
student['alcohol_index'] = (5*student['Dalc'] + 2*student['Walc'])/7
# Create 'acl', the alcohol consumption level: True when 'alcohol_index' is <= 2
student['acl'] = student['alcohol_index'] <= 2
# Map the boolean 'acl' column: True (index <= 2) becomes 'Low', False (index > 2) becomes 'High'
student['acl'] = student['acl'].map({True: 'Low', False: 'High'})
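
To make the weighting concrete, here is a minimal sketch of the same calculation on a tiny made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data (not from student-por.csv)
toy = pd.DataFrame({'Dalc': [1, 1, 3, 5], 'Walc': [1, 4, 3, 5]})

# Weighted mean over a 7-day week: 5 workdays and 2 weekend days
toy['alcohol_index'] = (5 * toy['Dalc'] + 2 * toy['Walc']) / 7

# Index <= 2 is labelled 'Low', anything above 2 is labelled 'High'
toy['acl'] = (toy['alcohol_index'] <= 2).map({True: 'Low', False: 'High'})
print(toy)
```

Note that the second row (Dalc = 1, Walc = 4) still ends up as 'Low', because workday consumption carries most of the weight.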

In [None]:
student.tail(10)

# Confidence Intervals

## Confidence intervals for the mean of the final grade

We can calculate confidence intervals for the means & for proportions.

In [None]:
sample_size = student.shape[0]
print(sample_size)

Because we have a sample size much greater than 30, we can use the __Central Limit Theorem__ to calculate confidence intervals. According to this theorem, the sampling distribution of the mean is approximately normal for large samples, so we can calculate a confidence interval for the mean using the normal distribution.

To get the confidence interval for the mean we need three numbers:
    
    1. Sample mean
    2. Standard error
    3. Confidence level
    
 Formula for the __standard error__:
               
  ![title](standard_error_formula.png)
  
- sigma	= 	sample standard deviation
- n	= 	number of samples
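
Written out in LaTeX, and assuming the image shows the usual formula (it matches the code used in this notebook), the standard error of the mean is:

$$SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$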



In [None]:
sample_mean_grade = student['G3'].mean()
sample_mean_grade

In [None]:
# Apply the formula
std_error_grades = student['G3'].std()/sqrt(sample_size)

In [None]:
# 95% confidence interval(CI) for the mean of final grade
stats.norm.interval(0.95, loc=sample_mean_grade, scale=std_error_grades)
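
As a cross-check on what `stats.norm.interval` does, the sketch below builds the same kind of interval by hand as mean ± z × standard error. It runs on synthetic data (the `grades` array is a made-up stand-in, not the real `G3` column), so the numbers it prints are illustrative only.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a grade column (assumption: not the real data)
rng = np.random.default_rng(0)
grades = rng.normal(loc=11, scale=3, size=500)

mean = grades.mean()
# ddof=1 gives the sample standard deviation, matching pandas' default .std()
se = grades.std(ddof=1) / np.sqrt(grades.size)

# For a 95% two-sided interval the critical value is the 97.5th percentile
z = stats.norm.ppf(0.975)
manual = (mean - z * se, mean + z * se)

builtin = stats.norm.interval(0.95, loc=mean, scale=se)
print(manual)
print(builtin)
```

Both tuples should agree, which shows that the confidence level only enters through the critical value `z`.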

__The 95% confidence interval for the mean of the final grade is between 9.9 and 10.8__

Now let's calculate a confidence interval for the proportion of students with a __High Alcohol Consumption Level__. Again we need three numbers:

    1. Sample proportions
    2. Standard error
    3. Confidence level
    
 For __proportions the standard error__ is given by:
 
   ![title](proportions_standard_error_formula.png)
    
    
  
 - P = proportion of successes in the population (estimated by the sample proportion)
 - n = number of observations in the sample
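
Written out in LaTeX, and assuming the image shows the usual formula (it matches the code used in this notebook), the standard error of a proportion is:

$$SE_{p} = \sqrt{\frac{P(1-P)}{n}}$$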

In [None]:
student['acl'].value_counts()

In [None]:
# As proportions (fractions of the total, between 0 and 1)
student['acl'].value_counts(normalize=True)

In [None]:
high_prop = student['acl'].value_counts(normalize=True)['High']
std_error_prop = sqrt(high_prop*(1-high_prop)/sample_size)

In [None]:
stats.norm.interval(0.98, loc=high_prop, scale=std_error_prop)

__The 98% confidence interval for the proportion of students with a High Alcohol Consumption Level is between 0.21 and 0.31__

# Probability calculations

__Assume that P(High ACL) = 0.25. In a class of 10, what is the probability of finding exactly 5 students with High ACL?__

In [None]:
# use binomial & probability mass function
stats.binom.pmf(k=5, n=10, p=0.25)
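
The same probability can be computed directly from the binomial formula, C(n, k) · p^k · (1 − p)^(n − k). The sketch below is a sanity check rather than a replacement for `stats.binom.pmf`; the variable names are illustrative.

```python
from math import comb
from scipy import stats

n, k, p = 10, 5, 0.25

# Number of ways to choose k students out of n, times the probability
# of any one specific arrangement with k 'High' and n-k 'Low' students
manual = comb(n, k) * p**k * (1 - p)**(n - k)

library = stats.binom.pmf(k=k, n=n, p=p)
print(manual, library)
```

Both values come out at roughly 0.058, so finding exactly 5 such students in a class of 10 is fairly unlikely under this assumption.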

In [None]:
def plot_probs_n(n):
    fig, ax= plt.subplots(1,2, figsize = (14,4))
    ax[0].bar(arange(n+1), stats.binom.pmf(k=arange(n+1), n=n, p=0.25))
    ax[0].set_xticks(arange(n+1))
    ax[0].set_title('Probability mass function')
    ax[1].plot(stats.binom.cdf(k=arange(n+1), n=n, p=0.25))
    ax[1].set_xticks(arange(n+1))
    ax[1].set_title('Cumulative distribution function')

In [None]:
plot_probs_n(10)

# Hypothesis Testing

### 1. Are the population variances equal in the groups of students (Low vs High alcohol consumption)?

Let's perform Bartlett's test, whose Null Hypothesis is that the variances are equal. We will use a significance level of 5%.

In [None]:
student.groupby('acl')['G3'].var()

In [None]:
grades_low_acl = student['G3'][student['acl']=='Low']
grades_high_acl = student['G3'][student['acl']=='High']
# Bartlett's test for equal variances
stats.bartlett(grades_low_acl, grades_high_acl)

According to the test we __cannot__ reject the Null Hypothesis of equal variances, so we will assume that the two groups are samples from populations with the same variance. This information will be useful in the next test.

### 2. Does alcohol consumption affect academic performance?

In [None]:
fig, axes = plt.subplots(1,2, figsize=(14,4))
sns.boxplot(x='acl', y='G3', data=student, ax=axes[0])
sns.pointplot(x='acl', y='G3', data=student, ax=axes[1])

The visualization suggests there is a difference between the mean final grades of the two groups. Now we will perform a formal statistical test of the hypothesis that students with a High alcohol consumption level perform worse than students with a Low alcohol consumption level.

  > #### Null Hypothesis: For both groups (High & Low ACL) the population means of the final grade are equal.
        
  > #### Alternative Hypothesis: The population means of the final grade are different.

In [None]:
# T-test
stats.ttest_ind(grades_low_acl, grades_high_acl, equal_var=True)
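
To show what `equal_var=True` implies, the sketch below recomputes the pooled-variance t statistic and its two-sided p-value by hand and compares them with `stats.ttest_ind`. The two samples `a` and `b` are synthetic and hypothetical, so the printed values say nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two grade groups (assumption: not real data)
rng = np.random.default_rng(1)
a = rng.normal(12, 3, size=200)
b = rng.normal(11, 3, size=80)
n_a, n_b = a.size, b.size

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# t statistic and two-sided p-value with n_a + n_b - 2 degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_a + n_b - 2)

res = stats.ttest_ind(a, b, equal_var=True)
print(t_manual, p_manual)
print(res.statistic, res.pvalue)
```

Pooling the variances is only justified when the equal-variance assumption holds, which is why Bartlett's test was run first.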

Since we got a low p-value we can reject the Null Hypothesis of equal means for the two groups at a level of significance of 5%.

> #### Conclusion: There is a statistically significant difference between the grades of the two analyzed groups. Since the mean for the group with high alcohol consumption is lower than the mean of the other group, the results suggest that alcohol consumption has a negative impact on students' academic performance.

### 3. Do male teenagers drink more than female teenagers?

In [None]:
fig, axes = plt.subplots(1,2, figsize=(14,4))
student['gender'].value_counts().plot(kind='bar', ax=axes[0], title='Gender')
student['acl'].value_counts().plot(kind='bar', ax=axes[1], title='Alcohol Consumption Level')

In [None]:
gender_acl_table = pd.crosstab(student['acl'], student['gender'])
gender_acl_table

In [None]:
fig, axes = plt.subplots(1,2, figsize=(14,4))
gender_acl_table.plot(kind='bar', stacked=True, ax=axes[0]);
(100*(gender_acl_table.T/gender_acl_table.apply(sum, axis=1)).T).plot(kind='bar', stacked=True, ax=axes[1]);

Chi-square test of independence of variables in a contingency table.

This function computes the chi-square statistic & p-value for the hypothesis test of independence of the observed frequencies in the contingency table.

In [None]:
# chi2:test statistic, p:p-value of the test, 
# dof:degrees of freedom, expected :expected frequencies, based on the marginal sums of the table.
chi2, p, dof, expected = stats.chi2_contingency(gender_acl_table)
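
The expected frequencies returned by `chi2_contingency` are what the table would look like if the two variables were independent: each cell equals (row total × column total) / grand total. The sketch below checks this on a hypothetical 2×2 table; the counts in `observed` are made up and are not the real gender/ACL counts.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table (rows: High/Low, columns: F/M)
observed = np.array([[30, 60],
                     [120, 90]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected counts under independence: outer product of the margins / total
expected_manual = row_totals @ col_totals / observed.sum()

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(expected_manual)
print(expected)
```

A small p-value means the observed counts are far from these expected counts, which is evidence against independence.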

In [None]:
p

In [None]:
# Reuse the labels of the observed table so rows and columns always line up
expected_table = pd.DataFrame(expected, index=gender_acl_table.index, columns=gender_acl_table.columns)
expected_table

In [None]:
fig, axes= plt.subplots(1,2, figsize =(14,4))
(100*(gender_acl_table.T/gender_acl_table.apply(sum, axis=1)).T)\
.plot(kind='bar', stacked=True, title='Observed', ax=axes[0])

(100*(expected_table.T/expected_table.apply(sum, axis= 1)).T)\
.plot(kind='bar', stacked=True, title='Expected under NO relation', ax=axes[1])