# Descriptive Statistics Exercises

## Important Notes

**IMPORTANT NOTE 1:** Please remember to **run your code cells** so that you can see the output of your code.

**IMPORTANT NOTE 2:** By default, Jupyter Notebook will only display the output of the last command in a Code cell. Thus, if you have multiple commands in a Code cell and you need to print output of a Python command in the middle of the cell, you have two options: 
- Option 1: Break your Code cell into multiple Code cells and place only one command in each cell so that you can display output of each command.
- Option 2: (**Preferred**) In a Code cell, put print() statements around each Python command whose output you would like to display.
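
As a quick illustration of Option 2, here is a minimal sketch using a small made-up list (`values` is a hypothetical name, not part of the exercise data). The intermediate results are wrapped in `print()`, while the last expression relies on Jupyter's default display.

```python
# Hypothetical data used only to illustrate the display behaviour
values = [3, 1, 4, 1, 5]

print(len(values))   # displayed because of print()
print(sum(values))   # displayed because of print()
max(values)          # last expression: Jupyter displays it without print()
```

Without the two `print()` calls, only the result of `max(values)` would appear below the cell.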

## Tutorial Overview

In this exercise, you will gain insight into public health by generating simple graphical and numerical summaries of a dataset collected by the U.S. Centers for Disease Control and Prevention (CDC).

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage.

Data source: cdc.gov/brfss

We will focus on a random sample of 60 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset. 



**Exercise 1:** Import the `numpy` and `pandas` modules as `np` and `pd` respectively. Then place the `cdc_sample.csv` from our dataset GitHub repository into the same directory as this notebook and read in the data as `cdc`. Display the first 5 rows of the data.

In [3]:
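import warnings
warnings.filterwarnings("ignore")  # keep the notebook output free of library warnings

import numpy as np
import pandas as pd
import io
import requests

# show every column when a data frame is displayed
pd.set_option("display.max_columns", None)

# download the CSV from GitHub (with certificate verification left on) and parse it
url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/cdc_sample.csv'
response = requests.get(url_name)
response.raise_for_status()  # stop early if the download failed
cdc = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

cdc.head()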

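|   | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender | genhlth |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 75 | 265 | 225 | 45 | m | very good |
| 1 | 1 | 1 | 0 | 72 | 150 | 150 | 24 | m | excellent |
| 2 | 1 | 1 | 1 | 69 | 137 | 150 | 47 | m | excellent |
| 3 | 1 | 1 | 1 | 66 | 159 | 125 | 26 | f | good |
| 4 | 1 | 1 | 0 | 63 | 145 | 125 | 33 | f | very good |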


**Exercise 2:** How many variables are there in this dataset? For each variable, identify its data type (e.g., categorical, numerical). If categorical, state the number of levels.

**Hint:** Try using Pandas' `info()` method on your data frame. In the output of this method, the `object` data type ("dtype") stands for a string type, which usually indicates a categorical variable. On the other hand, some numerical variables can actually be categorical in nature (think about `hlthplan`, for instance). You can verify this by also checking the number of distinct values with the `nunique()` method.
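
To see how the two methods complement each other, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical and are not the CDC sample). A column stored as integers can still have only two distinct values, which suggests a binary categorical variable.

```python
import pandas as pd

# Hypothetical toy data frame, not the CDC sample
toy = pd.DataFrame({
    "hlthplan": [1, 0, 1, 1],        # stored as integers, but only two distinct values
    "height": [70, 64, 68, 72],      # numerical
    "gender": ["m", "f", "f", "m"],  # strings, typically shown with dtype object
})

toy.info()               # column names, non-null counts and dtypes
levels = toy.nunique()   # number of distinct values in each column
print(levels)
```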



In [19]:
print('There are', cdc.shape[1], 'variables')

print("""
Categorical variables: 
genhlth (ordinal, 5 levels), exerany (binary), hlthplan (binary), 
smoke100 (binary) and gender (binary)
""")

print("""
Numerical variables: 
height, weight, wtdesire and age
""")

There are 9 variables

Categorical variables: 
genhlth (ordinal, 5 levels), exerany (binary), hlthplan (binary), 
smoke100 (binary) and gender (binary)


Numerical variables: 
height, weight, wtdesire and age



**Exercise 3:** What are the levels in `genhlth`, and how many people fall under each level?

**Hint:** you can use Pandas' `value_counts()` function.
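
As a small sketch of what `value_counts()` returns, the example below uses a made-up Series (`ratings` is hypothetical, not the CDC sample). The result keeps each level next to its count, which is safer than pairing counts with the output of `unique()`, because the two are ordered differently.

```python
import pandas as pd

# Hypothetical ratings, not the CDC sample
ratings = pd.Series(["fair", "good", "good", "excellent", "good", "fair"])

# Index = level, value = count, sorted by count in descending order
counts = ratings.value_counts()
print(counts)

# unique() lists the levels in order of first appearance instead
print(ratings.unique())
```

Here `value_counts()` lists `good` first (3 occurrences), while `unique()` starts with `fair` because it appears first in the data.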

In [44]:
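# Note: unique() returns the levels in order of first appearance, while
# value_counts() sorts by frequency, so the two printed orders are not
# guaranteed to line up. Printing cdc['genhlth'].value_counts() on its own
# shows each level next to its count.
print('The levels of genhlth are', cdc['genhlth'].unique())
print('The number of people in each are', cdc["genhlth"].value_counts().values, ', respectively')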

The levels of genhlth are ['very good' 'excellent' 'good' 'fair' 'poor']
The number of people in each are [18 17 17  7  1] , respectively


**Exercise 4:** Import `matplotlib.pyplot` as `plt` and create a scatterplot of `height` and `weight`, ensuring that the plot has an appropriate title and axis labels. What is the association between these two variables? 

In [45]:
import matplotlib.pyplot as plt
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

# creating scatterplot
plt.scatter(cdc['weight'], cdc['height'])
plt.title('Scatterplot of Height against Weight')
plt.xlabel('Weight')
plt.ylabel('Height')
plt.show()


**Exercise 5:** Find the mean, sample standard deviation, and median of `weight`.
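
The word "sample" matters here. As a minimal sketch on made-up numbers (`w` is a hypothetical Series, not the CDC data), the example below contrasts the Pandas default (`ddof=1`, the sample standard deviation) with the NumPy default (`ddof=0`, the population standard deviation).

```python
import numpy as np
import pandas as pd

# Hypothetical weights in pounds, not the CDC sample
w = pd.Series([150, 160, 170, 180, 190])

print(w.mean())                       # arithmetic mean
print(w.median())                     # middle value
print(w.std())                        # Pandas default ddof=1: sample standard deviation
print(np.std(w.to_numpy()))           # NumPy default ddof=0: population standard deviation
print(np.std(w.to_numpy(), ddof=1))   # same value as the Pandas result
```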

In [59]:
print('Weight mean:', round(cdc['weight'].mean(), 3))
print('Weight standard deviation:', round(cdc['weight'].std(ddof=1), 3))
print('Weight median:', cdc['weight'].median())

Weight mean: 173.3
Weight standard deviation: 49.035
Weight median: 165.0


**Exercise 6:** Find the mean, sample standard deviation, and median of `weight` for respondents who exercised in the past month. Is there any significant difference in the results when compared to the results of Exercise 5?

**Hint:** `exerany` is the variable that is 1 if the respondent exercised in the past month and 0 otherwise.
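
One possible way to organise this, sketched on a made-up data frame (`toy` and its values are hypothetical), is to filter with a boolean mask once, or to let `groupby()` compute the same statistics for both groups side by side.

```python
import pandas as pd

# Hypothetical toy data, not the CDC sample
toy = pd.DataFrame({
    "exerany": [1, 1, 1, 0, 0, 0],
    "weight":  [150, 160, 170, 180, 200, 220],
})

# Boolean mask: keep only the weights of rows where exerany equals 1
exercisers = toy.loc[toy["exerany"] == 1, "weight"]
print(exercisers.mean(), exercisers.std(), exercisers.median())

# The same three statistics for each value of exerany
summary = toy.groupby("exerany")["weight"].agg(["mean", "std", "median"])
print(summary)
```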

In [61]:
print('Weight mean (exercising individual):', round(cdc[cdc['exerany'] == 1]['weight'].mean(), 3))
print('Weight standard deviation (exercising individual):', round(cdc[cdc['exerany'] == 1]['weight'].std(), 3))
print('Weight median (exercising individual):', round(cdc[cdc['exerany'] == 1]['weight'].median(), 3))

Weight mean (exercising individual): 169.733
Weight standard deviation (exercising individual): 36.668
Weight median (exercising individual): 170.0


**Exercise 7:** Create histograms of `weight` for the data examined in Exercises 5 and 6 on the same plot, ensuring that your plot has an appropriate title, axis labels, and legend. Does this plot support your answer to Exercise 6? Also comment on the shape of the distribution.

**Hint:** The `alpha` argument of the plotting functions can be used to change the level of transparency.
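
The following is a minimal sketch of overlaying two histograms, using made-up data (`toy` is hypothetical). It also passes the same bin edges to both calls, which is an assumption on our part rather than a requirement of the exercise, so that the bars of the two groups line up and can be compared directly.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not the CDC sample
toy = pd.DataFrame({
    "exerany": [1, 1, 0, 1, 0, 1, 0, 1],
    "weight":  [140, 155, 210, 165, 190, 150, 250, 175],
})
subset = toy.loc[toy["exerany"] == 1, "weight"]

# Shared bin edges (5 bins) so that bars from both groups line up
bins = np.linspace(toy["weight"].min(), toy["weight"].max(), 6)

fig, ax = plt.subplots()
ax.hist(toy["weight"], bins=bins, label="Full data")
# alpha < 1 makes the second histogram see-through
ax.hist(subset, bins=bins, alpha=0.6, label="exerany = 1")
ax.set_title("Weight: full data vs. respondents who exercise")
ax.set_xlabel("Weight")
ax.set_ylabel("Frequency")
ax.legend()
```

In a notebook with `%matplotlib inline`, the figure is rendered when the cell finishes; elsewhere, call `plt.show()`.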

In [62]:
plt.hist(cdc['weight'], label = 'Full data')
plt.hist(cdc[cdc['exerany'] == 1]['weight'], label ='exerany=1', alpha = 0.8)

plt.title('Histogram of Weight: full data and participants who exercise')
plt.xlabel('Weight')
plt.ylabel('Frequency')
plt.legend()
plt.show()


**Exercise 8:** Continuing our investigation into the `weight` variable, compute the:

- 5-number summary in ascending order (that is, min, Q1, Q2 (median), Q3, and max). 
- interquartile range (IQR) for this variable (which is Q3-Q1). 
- max upper whisker reach and max lower whisker reach. Based on these values, how many outliers are there for `weight`? 

Finally, using Matplotlib, create a boxplot for this variable.

**Hint:** For quantiles, you can use `np.quantile()`.
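
As a worked sketch of these quantities on made-up numbers (`x` is a hypothetical array, not the CDC data), the example below computes the quartiles with `np.quantile()`, the IQR, and the whisker reaches using the usual 1.5 × IQR rule.

```python
import numpy as np

# Hypothetical values, not the CDC sample
x = np.array([10, 12, 13, 15, 18, 20, 22, 60])

# Quartiles with NumPy's default linear interpolation
q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])
iqr = q3 - q1

# Furthest points the whiskers may reach under the 1.5 * IQR rule
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Values outside the whisker reach are flagged as outliers
outliers = x[(x < lower_reach) | (x > upper_reach)]

print(x.min(), q1, q2, q3, x.max())
print(iqr, lower_reach, upper_reach, outliers)
```

For this toy array the quartiles are 12.75, 16.5, and 20.5, the IQR is 7.75, and the only value beyond the upper reach of 32.125 is 60.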

In [106]:
print('Summary:')
print(' - Minimum: ', cdc['weight'].min())
print(' - Q1: ', cdc['weight'].quantile(0.25))
print(' - Median: ', cdc['weight'].quantile(0.5))
print(' - Q3: ', cdc['weight'].quantile(0.75))
print(' - Maximum: ', cdc['weight'].max())

iqr = cdc['weight'].quantile(0.75) - cdc['weight'].quantile(0.25)
print('\n IQR: ', iqr)

print('\nEstimated outliers:')

max_whisker = cdc['weight'].quantile(0.75) + 1.5*iqr
min_whisker = cdc['weight'].quantile(0.25) - 1.5*iqr
print(' - Max upper whisker reach: ', max_whisker)
print(' - Max lower whisker reach: ', min_whisker)

wt_outliers = cdc[(cdc['weight'] > max_whisker) |
               (cdc['weight'] < min_whisker)]['weight']


print(' - Estimated count: ', wt_outliers.shape[0])

Summary:
 - Minimum:  104
 - Q1:  143.75
 - Median:  165.0
 - Q3:  192.75
 - Maximum:  400

 IQR:  49.0

Estimated outliers:
 - Max upper whisker reach:  266.25
 - Max lower whisker reach:  70.25
 - Estimated count:  3
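
Exercise 8 also asks for a boxplot. The cell above stops at the outlier count, so here is a minimal sketch of the plotting step on made-up data (`weights` is hypothetical; with the real data you would pass `cdc['weight']` instead). By default, Matplotlib draws the whiskers to the furthest points within 1.5 × IQR of the box and marks anything beyond them as individual points.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical weights, not the CDC sample
weights = pd.Series([120, 135, 150, 160, 165, 170, 185, 200, 400])

fig, ax = plt.subplots()
box = ax.boxplot(weights.to_numpy())  # default whisker rule: 1.5 * IQR
ax.set_title("Boxplot of weight")
ax.set_ylabel("Weight")
ax.set_xticks([1])
ax.set_xticklabels(["weight"])

# Points drawn beyond the whiskers (the "fliers") are the flagged outliers
flier_values = box["fliers"][0].get_ydata()
print(flier_values)
```

In a notebook with `%matplotlib inline`, the figure is rendered when the cell finishes; elsewhere, call `plt.show()`.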


**Exercise 9:** Similarly, compute the 5-number summary for `wtdesire`. Produce a boxplot of both `wtdesire` and `weight`.

Then compare it with the results from Exercise 8 and comment on the boxplot. 

In [107]:
min_val = cdc['wtdesire'].min()
q1_val  = cdc['wtdesire'].quantile(0.25)
q2_val  = cdc['wtdesire'].quantile(0.50)  # this is also the median
q3_val  = cdc['wtdesire'].quantile(0.75)
max_val = cdc['wtdesire'].max()

print('min:', min_val)
print('q1:', q1_val)
print('q2 (median):', q2_val)
print('q3:', q3_val)
print('max:', max_val)

min: 104
q1: 135.0
q2 (median): 150.0
q3: 176.25
max: 225


In [108]:
plt.boxplot(x=[cdc['wtdesire'], cdc['weight']])
plt.title("Boxplot of wtdesire and weight")
plt.xticks([1,2],['wtdesire', 'weight'])
plt.show()


**Exercise 10:** Create a new data subset called `under25_and_overweight` that contains all respondents under the age of 25 who think their actual weights are over their desired weights.

How many rows are there in this dataset?

What percent of respondents under the age of 25 think that they are overweight?


In [117]:
under25_and_overweight = cdc[(cdc['age'] < 25) & (cdc['weight'] > cdc['wtdesire'])]
under25 = cdc[cdc['age'] < 25]
print('Rows: ', under25_and_overweight.shape[0])
print('Percentage: ', round(under25_and_overweight.shape[0] / under25.shape[0] * 100, 3), '%')

Rows:  4
Percentage:  57.143 %


**Exercise 11:** Let's consider a new variable: the difference between desired weight (`wtdesire`) and current weight (`weight`). Create this variable by subtracting `weight` from `wtdesire` and assigning the result to a new column of the `cdc` data frame called `wdiff`.

In [119]:
cdc['wdiff'] = (cdc['wtdesire']) - (cdc['weight'])

**Exercise 12:** What percent of respondents' `wdiff` is zero? Comment on the result.

In [127]:
print(round((cdc['wdiff'][cdc['wdiff'] == 0]).shape[0] / cdc.shape[0] * 100, 2), '% of respondents are happy with their weight.')

30.0 % of respondents are happy with their weight.


**Exercise 13:** What percent of respondents think they are overweight, that is, their `wdiff` value is less than 0? What percent of respondents think they are underweight?

In [128]:
print(round((cdc['wdiff'][cdc['wdiff'] > 0]).shape[0] / cdc.shape[0] * 100, 2), 
      '% of respondents think they are underweight.')
print(round((cdc['wdiff'][cdc['wdiff'] < 0]).shape[0] / cdc.shape[0] * 100, 2), 
      '% of respondents think they are overweight.')

5.0 % of respondents think they are underweight.
65.0 % of respondents think they are overweight.
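
A more compact way to get such percentages, sketched here on made-up values (`wdiff` below is a hypothetical Series, not the column computed above), is to take the mean of a boolean mask: `True` counts as 1 and `False` as 0, so the mean is the proportion of rows that satisfy the condition.

```python
import pandas as pd

# Hypothetical differences (desired minus actual weight), not the CDC sample
wdiff = pd.Series([0, -10, -25, 5, 0, -3, -40, 0, -15, -5])

pct_zero = (wdiff == 0).mean() * 100    # desired weight equals actual weight
pct_over = (wdiff < 0).mean() * 100     # desired weight is below actual weight
pct_under = (wdiff > 0).mean() * 100    # desired weight is above actual weight

print(round(pct_zero, 2), round(pct_over, 2), round(pct_under, 2))
```

The three percentages always add up to 100, which is a handy sanity check.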


**Exercise 14:** Make a scatterplot of weight versus desired weight. Set the fill color as blue and alpha level as 0.3. Describe the relationship between these two variables.

**Bonus**: Also fit a red line with a slope of 1 and an intercept value of 0. See [this](https://www.featureranking.com/tutorials/python-tutorials/matplotlib/#Lines) for an example of a line fit. 
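
A minimal sketch of one way to approach this, on made-up data (`toy` and its values are hypothetical; with the real data you would use the `cdc` columns). Putting `weight` on the x-axis and `wtdesire` on the y-axis is our assumption, since the exercise does not fix the axes. The red line is drawn with `plot()` between two points on y = x.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, not the CDC sample
toy = pd.DataFrame({
    "weight":   [240, 155, 140, 165, 130, 180, 210],
    "wtdesire": [200, 155, 145, 140, 125, 170, 180],
})

fig, ax = plt.subplots()
ax.scatter(toy["weight"], toy["wtdesire"], color="blue", alpha=0.3)

# Reference line y = x (slope 1, intercept 0) across the range of the data
lo = min(toy["weight"].min(), toy["wtdesire"].min())
hi = max(toy["weight"].max(), toy["wtdesire"].max())
ax.plot([lo, hi], [lo, hi], color="red", label="desired = actual")

ax.set_title("Desired weight against actual weight")
ax.set_xlabel("Weight")
ax.set_ylabel("Desired weight")
ax.legend()
```

With this choice of axes, points below the red line correspond to respondents whose desired weight is lower than their actual weight.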

**Exercise 15:** Create a side-by-side boxplot to determine if men tend to view their weight differently than women.

**Hint**: For this, you will need to use the [Seaborn module](https://www.featureranking.com/tutorials/python-tutorials/seaborn/#Boxplots).
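
A minimal sketch of a side-by-side boxplot with Seaborn, on made-up data (`toy` and its values are hypothetical). It assumes that "viewing their weight" is measured by `wdiff`, the desired weight minus the actual weight, which is one reasonable reading of the exercise rather than the only one.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy data, not the CDC sample
toy = pd.DataFrame({
    "gender":   ["m", "m", "m", "m", "f", "f", "f", "f"],
    "weight":   [200, 180, 190, 170, 150, 140, 160, 130],
    "wtdesire": [190, 180, 175, 170, 130, 130, 140, 125],
})
toy["wdiff"] = toy["wtdesire"] - toy["weight"]

# One box per gender, drawn on the same axes
fig, ax = plt.subplots()
sns.boxplot(data=toy, x="gender", y="wdiff", ax=ax)
ax.set_title("Desired minus actual weight by gender")
ax.set_xlabel("Gender")
ax.set_ylabel("wdiff (wtdesire - weight)")

# Numerical companion to the plot: median wdiff per gender
medians = toy.groupby("gender")["wdiff"].median()
print(medians)
```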

***
www.featureranking.com