# Chapters 1 & 2: Practice Exercises
---

In this exercise, you will gain insight into public health by generating simple graphical and numerical summaries of a dataset collected by the U.S. Centers for Disease Control and Prevention (CDC).

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage.

Data source: https://www.cdc.gov/brfss/

In this exercise, we will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

The data set is available as a CSV file named `cdc.csv` (download it [here](https://raw.githubusercontent.com/imranture/practice_stats/main/datasets/cdc.csv)).

---
**Exercise 1:** Place the CSV file in the same directory as your Jupyter Notebook file. Import Pandas as "pd" and NumPy as "np". Read the data into a Pandas data frame called `df`.

---
**Exercise 2:** How many observations are there in this dataset?

**Hint:** Use the data frame's `shape` attribute (it is an attribute, not a method, so it takes no parentheses).
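
As a minimal sketch of this hint, the snippet below uses a made-up three-row data frame (`toy` and its values are illustrative, not BRFSS data). It shows that `shape` holds a `(rows, columns)` tuple that can be unpacked directly.

```python
import pandas as pd

# Toy data frame with made-up values (not the BRFSS sample)
toy = pd.DataFrame({"weight": [150, 180, 165], "age": [34, 51, 27]})

# shape is an attribute holding (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2
```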

---
**Exercise 3:** Display the first 10 rows.

---
**Exercise 4:** How many variables are there in this dataset? For each variable, identify its data type (e.g., categorical, numerical).

**Hint:** Try using Pandas' `dtypes` attribute on your data frame (no parentheses). In its output, the `object` data type ("dtype") stands for a string type, which usually indicates a categorical variable. However, some variables stored as numbers are actually categorical in nature (think about `hlthplan`, for instance).
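
The distinction between storage type and variable type can be sketched on a small made-up frame. Here `toy` and the text column `label` are hypothetical, introduced only for illustration; `hlthplan` and `weight` borrow names from the exercise but hold invented values.

```python
import pandas as pd

# Made-up values; `label` is a hypothetical text column added for illustration.
toy = pd.DataFrame({
    "hlthplan": [1, 0, 1],
    "weight": [150, 180, 165],
    "label": ["a", "b", "a"],
})

# dtypes is an attribute: one storage type per column
print(toy.dtypes)

# A numeric dtype does not guarantee a numerical variable: in this toy
# frame hlthplan is stored as integers but only takes the codes 0 and 1.
codes = sorted(toy["hlthplan"].unique().tolist())
print(codes)  # [0, 1]
```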

---
**Exercise 5:**  State how many levels each categorical variable has. Print all the levels for each categorical variable.

**Hint:** Use Pandas' `unique()` and `nunique()`.
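
A minimal sketch of the two calls on a made-up Series (the name `levels` and its values are illustrative only): `nunique()` counts the distinct values, and `unique()` returns them in order of first appearance.

```python
import pandas as pd

# Made-up categorical values, not taken from the BRFSS data
levels = pd.Series(["good", "fair", "good", "poor"])

n_levels = levels.nunique()          # number of distinct values
level_list = list(levels.unique())   # distinct values, in order of appearance

print(n_levels)    # 3
print(level_list)  # ['good', 'fair', 'poor']
```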

---
**Exercise 6:** Find the mean, sample standard deviation, and median of `weight`.
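
One detail worth checking when computing a *sample* standard deviation: Pandas' `std()` divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.std()` divides by n by default (`ddof=0`). The sketch below illustrates this on three made-up values (not BRFSS data).

```python
import numpy as np
import pandas as pd

# Three made-up weights
s = pd.Series([150, 160, 170])

sample_sd = s.std()                             # pandas default: ddof=1
numpy_default = np.std(s.to_numpy())            # NumPy default: ddof=0
numpy_sample = np.std(s.to_numpy(), ddof=1)     # matches the pandas result

print(s.mean(), s.median())   # 160.0 160.0
print(sample_sd, numpy_sample)
print(numpy_default)          # smaller, because it divides by n
```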

---
**Exercise 7:** Find the mean, sample standard deviation, and median of `weight` for respondents who exercised in the past month. Is there any significant difference in the results when compared to the results of Exercise 6?

**Hint:** `exerany` is the variable that is 1 if the respondent exercised in the past month and 0 otherwise.
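
Selecting a subgroup like this is typically done with a boolean mask. The following is a sketch on made-up values (the frame `toy` is illustrative; only the column names come from the exercise).

```python
import pandas as pd

# Made-up values; column names follow the exercise
toy = pd.DataFrame({
    "exerany": [1, 0, 1, 1],
    "weight": [150, 200, 160, 170],
})

# Boolean mask: True for rows where the respondent exercised
mask = toy["exerany"] == 1
active_weight = toy.loc[mask, "weight"]

print(active_weight.mean())    # 160.0
print(active_weight.median())  # 160.0
print(active_weight.std())     # sample standard deviation (ddof=1)
```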

---
**Exercise 8:** Compute the 5-number summary for `wtdesire`, the respondent's desired weight in pounds, in ascending order (that is, min, Q1, Q2 (median), Q3, and max). Also compute the interquartile range (IQR) for this variable (which is Q3-Q1). In addition, compute the max upper whisker reach and the max lower whisker reach. Based on these values, how many outliers are there for `wtdesire`? Finally, using Matplotlib, create a boxplot for this variable.

**Hint:** For quantiles, you can use NumPy's `quantile()`.
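
The sketch below works through these quantities on seven made-up values. It assumes the conventional boxplot rule, which the exercise does not spell out: the whiskers can reach at most 1.5 × IQR beyond Q1 and Q3, and anything outside that range counts as an outlier. It also relies on NumPy's default (linear) quantile interpolation.

```python
import numpy as np

# Seven made-up values, not BRFSS data
values = np.array([100, 120, 130, 140, 150, 160, 300])

q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed convention: whiskers reach at most 1.5 * IQR past the quartiles
upper_reach = q3 + 1.5 * iqr
lower_reach = q1 - 1.5 * iqr

# Outliers are the points beyond either reach
n_outliers = int(((values > upper_reach) | (values < lower_reach)).sum())

print(values.min(), q1, q2, q3, values.max())  # 100 125.0 140.0 155.0 300
print(iqr, lower_reach, upper_reach)           # 30.0 80.0 200.0
print(n_outliers)                              # 1
```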

---
**Exercise 9:** Let's consider a new variable: the difference between desired weight (`wtdesire`) and current weight (`weight`). Create this new variable by subtracting `weight` from `wtdesire` in the `df` data frame and assigning the result to a new column called `wdiff`. Display the first 5 rows of `df`. How many columns are there now?
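
Column arithmetic in Pandas is element-wise, so a derived column can be sketched as below. The frame `toy` and its values are made up; the sign convention (desired minus current) is taken from Exercise 11, where a negative `wdiff` means the respondent considers themselves overweight.

```python
import pandas as pd

# Made-up values; column names follow the exercise
toy = pd.DataFrame({
    "weight": [150, 180, 165],
    "wtdesire": [140, 180, 170],
})

# Element-wise subtraction: desired weight minus current weight
toy["wdiff"] = toy["wtdesire"] - toy["weight"]

print(toy.head())
print(toy.shape[1])  # 3 columns after adding wdiff
```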

---
**Exercise 10:** For what percent of respondents is `wdiff` equal to zero? Comment on the result.

---
**Exercise 11:** What percent of respondents think they are overweight, that is, their `wdiff` value is less than 0? What percent of respondents think they are underweight?
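
One common way to turn a condition into a percentage is to take the mean of a boolean Series, since `True` counts as 1 and `False` as 0. A sketch on five made-up `wdiff` values:

```python
import pandas as pd

# Made-up differences (desired weight minus current weight)
wdiff = pd.Series([0, -10, 5, 0, -20])

pct_zero = (wdiff == 0).mean() * 100      # share with no difference
pct_negative = (wdiff < 0).mean() * 100   # desired weight below current weight
pct_positive = (wdiff > 0).mean() * 100   # desired weight above current weight

print(pct_zero, pct_negative, pct_positive)  # 40.0 40.0 20.0
```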

---
**Exercise 12:** Create a side-by-side boxplot to determine if men tend to view their weight differently than women.

**Hint**: For this, you will need to use the *seaborn* module.
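
A side-by-side boxplot with seaborn can be sketched as follows. The frame `toy`, the grouping column name `group`, and all values are hypothetical; substitute the actual grouping column from your data frame. The `Agg` backend line is only there so the script runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values; `group` is a hypothetical two-level grouping column
toy = pd.DataFrame({
    "group": ["m", "m", "m", "f", "f", "f"],
    "wdiff": [0, -10, 5, -20, -5, 0],
})

# One box per level of `group`, drawn on a shared y-axis
fig, ax = plt.subplots()
sns.boxplot(data=toy, x="group", y="wdiff", ax=ax)
ax.set_title("wdiff by group (toy data)")
```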

---
**Exercise 13:** Generate a histogram of `age` with a bin size of 7. Comment on the skewness and modality of this histogram.
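
"Bin size" can be read in two ways, as a number of bins or as a bin width, and the two generally give different histograms. The sketch below shows both readings on made-up ages using `np.histogram`; Matplotlib's `plt.hist` accepts the same kinds of `bins` argument. Which reading the exercise intends is not stated here, so treat both as assumptions.

```python
import numpy as np

# Made-up ages, not BRFSS data
ages = np.array([18, 25, 32, 40, 47, 53, 60, 74])

# Reading 1: exactly 7 equal-width bins between min and max
counts_n, edges_n = np.histogram(ages, bins=7)

# Reading 2: bins that are each 7 years wide, starting at the minimum age
width_edges = np.arange(ages.min(), ages.max() + 7, 7)
counts_w, edges_w = np.histogram(ages, bins=width_edges)

print(len(counts_n))     # 7 bins
print(np.diff(edges_w))  # every bin is 7 wide
print(len(counts_w))     # 8 bins for this toy range
```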

---
**Exercise 14:** Make a scatterplot of weight versus desired weight. Set the fill color to blue and the alpha level to 0.3. Describe the relationship between these two variables.

**Bonus**: Also fit a red line with a slope of 1 and an intercept value of 0.
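
A sketch of the plot and the reference line on made-up values follows. It assumes the "y versus x" reading, with `weight` on the vertical axis and `wtdesire` on the horizontal axis; the line y = x is drawn by plotting two points that span the data range. The `Agg` backend line is only there so the script runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt

# Made-up values, not BRFSS data
wtdesire = [140, 180, 150, 200]
weight = [150, 180, 165, 230]

fig, ax = plt.subplots()
ax.scatter(wtdesire, weight, color="blue", alpha=0.3)

# Reference line with slope 1 and intercept 0, spanning the data range
low = min(wtdesire + weight)
high = max(wtdesire + weight)
ax.plot([low, high], [low, high], color="red")

ax.set_xlabel("wtdesire")
ax.set_ylabel("weight")
```

With these axes, points above the red line correspond to a current weight higher than the desired weight.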

---
*Solutions available [here](https://nbviewer.org/github/imranture/practice_stats/blob/main/ch1and2-exercises_with_solutions.ipynb).*