# Statistical Measures - Lab

## Introduction

In this lab, you will apply your knowledge of statistical measures with Python to solve a real-world problem. You are a junior analyst for a company that sells widgets for use across many different industries/markets. Your boss has asked you to give her a summary of widget sales across these markets for the past year. She wants to know:

- What has been the typical sales volume across markets?
- How variable have sales been across the different markets this past year?
- How much has the company characteristically been spending on each advertising medium (TV, radio, and newspaper) across the different markets for the past year?

## Objectives

You will be able to:
- Calculate central tendency within a variable in Python
- Create visualizations to showcase central tendency
- Compare variables by their central tendency
- Calculate dispersion within a variable

### Sales Data Summary

You have been given a dataset (in CSV format) that contains sales and advertising budget information that you will require for your analysis. There are four columns:
1. `sales`: the number of widgets sold (in thousands)
2. `TV`: the amount of money (in thousands of dollars) spent on TV ads
3. `radio`: the amount of money (in thousands of dollars) spent on radio ads
4. `newspaper`: the amount of money (in thousands of dollars) spent on newspaper ads

In [42]:
# CodeGrade step0
# Run this cell without changes
import csv
import numpy as np
import scipy.stats as stats

## Step 1

Use `csv.DictReader` to load the dataset into a list of dictionaries and save it to a variable named `data`.

In [43]:
# CodeGrade step1
# Replace None with appropriate code
with open("Advertising.csv") as f:
    data = list(csv.DictReader(f))
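
To illustrate what `csv.DictReader` produces, here is a minimal sketch that reads a tiny in-memory CSV rather than `Advertising.csv`. The two sample rows and the names `sample_csv`, `sample_data`, and `first_row` are made up for illustration; only the column names mirror the keys used in this lab's code. The point to notice is that every value comes back as a string.

```python
import csv
import io

# Hypothetical two-row sample; the lab itself reads Advertising.csv from disk.
sample_csv = "TV,radio,newspaper,sales\n100.0,20.0,30.0,15.0\n50.0,10.0,5.0,8.5\n"

with io.StringIO(sample_csv) as f:
    sample_data = list(csv.DictReader(f))

# Each row is a dict keyed by the header names, and every value is a string.
first_row = sample_data[0]
print(first_row["sales"], type(first_row["sales"]).__name__)
```

Because the values are strings, they need to be converted with `float()` before any arithmetic or numeric comparison.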

## Step 2

Extract the sales numbers for each market in the dataset as a list and save it to a variable named `sales`. Then save the TV, radio, and newspaper advertising expenditures to lists named `tv`, `radio`, and `newspaper` respectively.

In [44]:
# CodeGrade step2
# Replace None with appropriate code

sales = []
tv = []
radio = []
newspaper = []

# List comprehension would be great individually but forces multiple iterations
# Doing this in one for loop should be faster and would scale better with more fields
for record in data:
    sales.append(float(record["sales"]))
    tv.append(float(record["TV"]))
    radio.append(float(record["radio"]))
    newspaper.append(float(record["newspaper"]))
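
The `float()` conversion above matters for the later steps. As a small sketch with made-up values (not taken from the dataset), the following shows how `min()` behaves differently on strings and on numbers:

```python
# Hypothetical sales figures as they would arrive from csv.DictReader (strings).
raw_sales = ["9.7", "10.4", "22.1"]

# Strings compare character by character, so "10.4" sorts before "9.7".
min_as_text = min(raw_sales)

# Converting to float first gives the numeric minimum.
numeric_sales = [float(value) for value in raw_sales]
min_as_number = min(numeric_sales)

print(min_as_text, min_as_number)
```

Skipping the conversion would not raise an error here, which makes this kind of mistake easy to miss.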

## Step 3

Provide a summary of the data by:
- Getting the number of markets your company has been engaged in this past year
- Using built-in Python functions to get the minimum and maximum sales across all markets

In [45]:
# CodeGrade step3
# Replace None with appropriate code

# We could accomplish this with numpy but we've been told to use built-in functions
num_markets = len(sales)
min_sales = min(sales)
max_sales = max(sales)

In [46]:
# Run this cell without changes
print(f"""
This dataset contains records for {num_markets} markets


The fewest sales for any market was {min_sales} thousand widgets


The most sales for any market was {max_sales} thousand widgets
""")


This dataset contains records for 200 markets


The fewest sales for any market was 1.6 thousand widgets


The most sales for any market was 27.0 thousand widgets



Run this code to create a histogram of all sales data:

In [47]:
# Run this cell without changes
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 5))

ax.hist(sales, bins=15)

ax.set_xlabel("Sales (thousands of widgets)")
ax.set_ylabel("Count")

ax.set_title("Distribution of Sales across Markets");

# Note: for some reason this cell does not display the histogram.
# We could use plt.show() but I've been told not to make changes
# plt.show()

### Typical Sales Volume

Now let us address the first business question: What has been the typical sales volume across markets?

## Step 4

Based on the histogram, choose an appropriate measure of central tendency for widget sales. Use whatever method you wish to calculate your chosen metric – making any required imports in the cell.

In [48]:
# CodeGrade step4
# Replace None with appropriate code
# Make any imports here
# We have already imported numpy but because the cost of re-importing is very small, we'll do that for safety
import numpy as np

# Assign measure_central_tendency to the mean, median, or mode of the sales data
# Looking at the histogram, there is a skew to the right, which pulls the mean toward the tail
# "Typical" sales is not precisely defined, so we'll use the median, which is less sensitive to skew
sales_array = np.array(sales)
measure_central_tendency = np.median(sales_array)

measure_central_tendency

12.9
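
The reasoning for preferring the median on skewed data can be sketched with a small made-up sample. The values below are hypothetical and are not drawn from the lab's dataset, and the sketch uses the standard-library `statistics` module rather than NumPy:

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is very large.
skewed_sample = [8.0, 9.0, 10.0, 11.0, 12.0, 40.0]

sample_mean = statistics.mean(skewed_sample)      # pulled toward the large value
sample_median = statistics.median(skewed_sample)  # midpoint of the sorted values

print(sample_mean, sample_median)  # 15.0 10.5
```

The single large value raises the mean above five of the six observations, while the median stays near the bulk of the data.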

### Dispersion of Sales Volume

Now that we have a number to represent the typical sales volume, let's answer: How variable have sales been across markets?

## Step 5

Based on the histogram, choose an appropriate measure of dispersion for widget sales. Use whatever method you wish to calculate your chosen metric – making any required imports in the cell. Have your answer be one value rather than a range.

In [49]:
# CodeGrade step5
# Replace None with appropriate code
# Make any imports here
from scipy.stats import iqr

# Assign measure_dispersion
# We'll use IQR to look at the range of the middle 50%
# I would be inclined to round, but am concerned this will mess with the autograder
measure_dispersion = iqr(sales)

measure_dispersion

7.024999999999999
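
As a rough sketch of what the IQR measures, the example below computes it by hand as the third quartile minus the first quartile. It uses the standard-library `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; that should correspond to the default behaviour of `np.percentile` and `scipy.stats.iqr`, though that equivalence is stated here as an assumption rather than checked against this dataset. The sample values are hypothetical.

```python
import statistics

# Hypothetical sample, not the lab's sales data.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# n=4 returns the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

# The IQR is the width of the middle 50% of the data.
sample_iqr = q3 - q1

print(q1, q3, sample_iqr)  # 2.75 6.25 3.5
```

Because it depends only on the quartiles, the IQR is not affected by how extreme the largest or smallest values are, which makes it a natural companion to the median.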

## Step 6

How much has the company characteristically been spending on different advertising media (TV, radio, and newspaper) across the different markets for the past year? Calculate the median expenditure for each medium:

In [50]:
# CodeGrade step6
# Replace None with appropriate code
# make any imports here
# For safety!
import numpy as np

# Since we're not using these np arrays further we'll leave them inline
# calculate median tv expenditure
median_tv_expenditure = np.median(np.array(tv))
# calculate median radio expenditure
median_radio_expenditure = np.median(np.array(radio))
# calculate median newspaper expenditure
median_newspaper_expenditure = np.median(np.array(newspaper))

median_tv_expenditure, median_radio_expenditure, median_newspaper_expenditure

(149.75, 22.9, 25.75)
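
When the same statistic is needed for several variables, one option is to keep the lists in a dictionary and compute the statistic in a single comprehension. This is only a sketch of that pattern: the spend figures and the names `spend_by_medium` and `median_by_medium` are hypothetical, and the autograded cell above still needs the three separately named variables.

```python
import statistics

# Hypothetical spend figures (thousands of dollars), not the lab's data.
spend_by_medium = {
    "tv": [100.0, 150.0, 200.0],
    "radio": [10.0, 20.0, 30.0, 40.0],
    "newspaper": [5.0, 25.0, 45.0],
}

# One median per medium, keyed by the medium's name.
median_by_medium = {
    medium: statistics.median(values)
    for medium, values in spend_by_medium.items()
}

print(median_by_medium)
```
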

## Step 7

How much has the company's spending on each advertising medium (TV, radio, and newspaper) varied across the different markets for the past year? Calculate the IQR for each medium:

In [51]:
# CodeGrade step7
# Replace None with appropriate code
# make any imports here
from scipy.stats import iqr

iqr_tv_expenditure = iqr(tv)
iqr_radio_expenditure = iqr(radio)
iqr_newspaper_expenditure = iqr(newspaper)

# The IQR measures spread (Q3 - Q1) while the median measures location,
# so the two are not expected to match; similar magnitudes here are a
# property of this data, not a validation of the method

iqr_tv_expenditure, iqr_radio_expenditure, iqr_newspaper_expenditure

(144.45, 26.549999999999997, 32.35)
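
Because TV budgets are much larger than radio or newspaper budgets, the raw IQRs are hard to compare directly. One possible way to put them on a common footing, offered here only as an illustrative sketch and not as part of the lab's required answers, is to divide each IQR by the corresponding median. The numbers below are the medians and IQRs printed in the cell outputs of Steps 6 and 7 (with the radio IQR rounded to 26.55); the name `relative_spread` is made up for this example.

```python
# Medians and IQRs as printed by the Step 6 and Step 7 cells
# (thousands of dollars; radio IQR rounded from 26.549999999999997).
medians = {"tv": 149.75, "radio": 22.9, "newspaper": 25.75}
iqrs = {"tv": 144.45, "radio": 26.55, "newspaper": 32.35}

# IQR expressed as a multiple of the median for each medium.
relative_spread = {
    medium: iqrs[medium] / medians[medium]
    for medium in medians
}

for medium, ratio in relative_spread.items():
    print(f"{medium}: {ratio:.2f}")
```

Read this way, TV spending has the largest spread in absolute terms but the smallest relative to its typical level, with radio and newspaper spending varying somewhat more in proportion to their medians.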

### Summary

In this lab you were able to:
- Calculate central tendency within a variable in Python
- Create visualizations to showcase central tendency
- Compare variables by their central tendency
- Calculate dispersion within a variable