---
# Crash Course Python for Data Science - Intro to Statistics 
---
# 01 - Descriptive Statistics I
---



## How to think about descriptive statistics

As the name suggests, descriptive statistics refer to the ways in which you can describe the data you're working with. Data can have many varied characteristics, all of which can tell you something about that data. Let's begin with *measures of central tendency*.  




## What is central tendency?
You're part of a group of about 30 students here trying to learn a bit of Python. Chances are, if asked, you'd be able to compute the average age of this group. The thing is, not all averages are the same. And what's the point of an average, anyway?


First of all, you can think of an *average* as a single value that best represents a group of values.

Now, *averages* (the plural) can be considered interchangeable with *central tendency*. In other words, central tendency = the value(s) that best represent a group of values.

**There are three main measures of central tendency:**
1.   Mean 
2.   Median
3.   Mode





In [0]:
# Let's begin by importing our tools:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use(style='ggplot') 
plt.rcParams['figure.figsize'] = (10, 6)

print('Libraries imported')

## Mean

**The *mean* of a group of values is derived by adding all of the values together, and then dividing that sum by the number of values. When most people hear the word "average," this is what they think of.**






In [0]:
# Let's create a group of values using a normal Python list:
numbers = [12, 12, 13, 11, 24, 55, 23, 87, 65, 40, 1000]

def mean(nums):
  return sum(nums) / len(nums)

mean(numbers)

In [0]:
# NumPy has a built-in function for finding the mean:

np.mean(numbers)

NumPy [documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html) on the `mean` function.

## Median

**The *median* of a group of values is derived by ordering all of the values, then finding the middle one. In other words, it's the value that splits the ordered set in half: as many values fall below it as above it. When the set has an even number of values, the median is the mean of the two middle values.**






In [0]:
numbers = [12, 12, 13, 13, 11, 24, 55, 23, 87, 65, 40, 1000]
sorted(numbers)

In [0]:
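def median(nums):
  # Sort once, then take the middle value (or average the two middle values)
  ordered = sorted(nums)
  n = len(ordered)
  if n < 1:
    return None
  mid = n // 2
  if n % 2 == 1:
    return ordered[mid]
  return (ordered[mid - 1] + ordered[mid]) / 2.0

median(numbers)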

In [0]:
np.median(numbers)

### A note about median vs. mean:

The mean is more susceptible than the median to being pulled in the direction of "outliers." For now, just think about outliers as values that are either really big or really small relative to the rest of the values.
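
To see this pull in action, here is a minimal sketch using the `numbers` list from the median section and Python's standard `statistics` module. The name `without_outlier` is introduced here only for illustration.

```python
import statistics

# The list from the median section, including the extreme value 1000
numbers = [12, 12, 13, 13, 11, 24, 55, 23, 87, 65, 40, 1000]

# Same list with the extreme value dropped (illustrative name, not from the lesson)
without_outlier = [n for n in numbers if n != 1000]

print("mean with outlier:     ", statistics.mean(numbers))
print("mean without outlier:  ", statistics.mean(without_outlier))
print("median with outlier:   ", statistics.median(numbers))
print("median without outlier:", statistics.median(without_outlier))
```

Dropping a single value moves the mean from roughly 113 to roughly 32, while the median only moves from 23.5 to 23.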

## Mode

**The *mode* of a set of values is the value which occurs with the highest frequency. In other words, the most common value.** 






In [0]:
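# NumPy has no built-in function for mode, but Python's standard library includes
# a dedicated statistics module.
# (If several values tie for most common, Python 3.8+ returns the first one it
# encounters; older versions raise a StatisticsError.)

import statistics as stats
stats.mode(numbers)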

Statistics module [documentation](https://docs.python.org/3/library/statistics.html).
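
Because the `numbers` list has a tie (12 and 13 each appear twice), a single mode hides part of the picture. As a sketch, the helper `all_modes` below (a hypothetical name, not part of any library) counts values with `collections.Counter` and returns every value tied for the highest frequency. `statistics.multimode`, available in Python 3.8 and later, reports the same ties.

```python
from collections import Counter
import statistics

numbers = [12, 12, 13, 13, 11, 24, 55, 23, 87, 65, 40, 1000]

def all_modes(nums):
    """Return every value tied for the highest frequency, in sorted order."""
    counts = Counter(nums)
    if not counts:
        return []
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

print(all_modes(numbers))
print(statistics.multimode(numbers))  # requires Python 3.8+
```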

## Different types of variables in statistics

Not all data are the same. For some, the order matters. Others represent categories instead of numbers. This parallels how Python represents different values with different data types: if you recall the module on Python variables, not all variables represent their values in the same way. Let's quickly cover the basic types of variables in statistics:



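*   **Ordinal**: values that can be ordered/ranked; the term is usually reserved for ranked categories such as small/medium/large
*   **Categorical**: categories like gender, political affiliation, or binaries (voted/didn't vote); the variable takes one of a limited set of values
*   **Numeric**: numbers, either discrete (counts, like the number of people at a table) or continuous (measurements, like a bill amount)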
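
pandas can represent these distinctions directly. The sketch below uses a small made-up table (the column names and values are hypothetical, chosen only for illustration) to show an unordered categorical column, an ordered one, and a numeric one.

```python
import pandas as pd

# Made-up survey data, for illustration only
survey = pd.DataFrame({
    "party": ["A", "B", "A", "C"],               # categorical, no natural order
    "rating": ["low", "high", "medium", "low"],   # can be ranked: low < medium < high
    "age": [23, 35, 41, 29],                      # numeric
})

# Unordered categories
survey["party"] = survey["party"].astype("category")

# Ordered categories: the ranking is stated explicitly
survey["rating"] = pd.Categorical(
    survey["rating"], categories=["low", "medium", "high"], ordered=True
)

print(survey.dtypes)
print(survey["rating"].min(), "to", survey["rating"].max())
```

Declaring the order lets pandas answer ranking questions (such as the lowest or highest rating) that make no sense for an unordered column like `party`.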



**Let's put this all together with a familiar dataset:**

In [0]:
# Simply run this cell to import the tips dataset:

df = pd.read_csv("https://raw.githubusercontent.com/axrd/datasets/master/tipdata.csv", index_col=0)
df.head()

*   **total_bill** is numeric and continuous (its values can also be ordered)
*   **tip** is also numeric and continuous
*   **sex** is categorical; it's also binary (male vs. female)
*   **smoker** is the same as *sex*: categorical and binary
*   **day** is categorical but not binary (up to 7 categories, 1 for each day of the week)
*   **time** is categorical
*   **size** is numeric and discrete (can't have 3.4 people at a table)



## Distributions!

**At the end of the Intro to Pandas lesson, we introduced visualizations because a *picture is worth a thousand words.* Let's do the same here so we can see what's going on more easily.**



In [0]:
# Let's make a list of 250 unique random integers between 1 and 999 and assign
# it to a variable "a":

import random
a = random.sample(range(1, 1000), 250)
a

In [0]:
# Let's turn it into a pandas DataFrame and plot it to visualize the distribution 
# of numbers:
data = pd.DataFrame({'random_nums': a}) 

data['random_nums'].hist();

In [0]:
np.mean(data['random_nums'])

In [0]:
np.median(data['random_nums'])

If you re-run the cell with the random number generator, it will create a new random list. Run the plotting cell again to visualize the new list of random numbers. Check out the different distributions of values! What can you say about their central tendency?

### Skewness: left, right, or middle?

![Skewness](https://cdn-images-1.medium.com/max/1600/1*nj-Ch3AUFmkd0JUSOW_bTQ.jpeg)
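
A common rule of thumb is that in a right-skewed distribution (one with a long tail toward large values) the mean sits above the median, because the tail pulls the mean toward it. The sketch below checks this on simulated data. The exponential distribution and the seed are assumptions chosen only to produce a right-skewed sample; they are not part of the lesson's datasets.

```python
import random
import statistics

random.seed(0)  # fixed seed so the example is repeatable

# 1000 draws from an exponential distribution, which has a long right tail
right_skewed = [random.expovariate(1.0) for _ in range(1000)]

sample_mean = statistics.mean(right_skewed)
sample_median = statistics.median(right_skewed)

print("mean:  ", round(sample_mean, 3))
print("median:", round(sample_median, 3))
print("mean > median:", sample_mean > sample_median)
```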

## Variability: how spread out is our data?

**Also known as the *spread* of a set of values, variability measures how different they are from each other. More intuitively, variability is a measure of how far values are from one specific value. Can you guess which one? Hint: how can you describe a group of values with just one value?**



*   **Range**: the difference between the largest and smallest values in a set.
*   **Variance**: the average squared distance of each value from the mean.
*   **Standard deviation**: the square root of the variance; roughly, the typical distance of a value from the mean.
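
These three measures can be computed by hand in a few lines. The sketch below uses a small made-up list (not one of the lesson's datasets) and the sample versions of variance and standard deviation, which divide by n - 1. The results are checked against Python's standard `statistics` module.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data for illustration

# Range: largest value minus smallest value
value_range = max(values) - min(values)

# Sample variance: sum of squared distances from the mean, divided by n - 1
mean = sum(values) / len(values)
squared_distances = [(v - mean) ** 2 for v in values]
variance = sum(squared_distances) / (len(values) - 1)

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print("range:             ", value_range)
print("variance:          ", variance)
print("standard deviation:", std_dev)

# The standard library agrees with the hand calculation
print(math.isclose(variance, statistics.variance(values)))
print(math.isclose(std_dev, statistics.stdev(values)))
```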



In [0]:
# Let's get sales performance data from the sales dataset we've used previously:

sales = pd.read_csv("https://raw.githubusercontent.com/axrd/datasets/master/sales_data.csv", index_col=0)
sales.head()

In [0]:
sales['Sales'].hist(bins=10);

In [0]:
np.mean(sales['Sales'])

In [0]:
# Let's calculate the variance:
# Variance is the average squared distance of each point from the mean
# (ddof=1 divides by n - 1, which gives the sample variance)
np.var(sales['Sales'], ddof=1)

In [0]:
# Let's calculate the standard deviation:
# Standard deviation is the square root of the variance: roughly, the typical
# distance of each point from the mean
np.std(sales['Sales'], ddof=1)

In [0]:
# The standard deviation is the square root of the variance.
np.sqrt(np.var(sales['Sales'], ddof=1))
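
The `ddof=1` argument in the cells above matters. By default, `np.var` and `np.std` divide by n (the population formulas), while pandas' `Series.var` and `Series.std` divide by n - 1 (the sample formulas). A minimal sketch on made-up numbers, not the sales data:

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up data for illustration

population_var = np.var(values.to_numpy())       # NumPy default: ddof=0, divide by n
sample_var = np.var(values.to_numpy(), ddof=1)   # divide by n - 1
pandas_var = values.var()                        # pandas default: ddof=1

print("population variance (ddof=0):", population_var)
print("sample variance (ddof=1):    ", sample_var)
print("pandas Series.var():         ", pandas_var)
```

Passing `ddof=1` to NumPy makes its result match what pandas reports by default.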

In [0]:
# Here's a really useful built-in pandas method to get some summary descriptive
# statistics:

sales.describe()
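
`describe()` bundles several of the measures from this lesson. As a sketch on a made-up Series (not the sales data), its `50%` row is the median and its `std` row is the sample standard deviation:

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up data for illustration

summary = values.describe()
print(summary)

# The 50th percentile is the median; std uses the sample formula (ddof=1)
print("50% equals median:", summary["50%"] == values.median())
print("std equals Series.std():", abs(summary["std"] - values.std()) < 1e-12)
```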

### And that's it. Don't forget to do your exercise.



Also, [this will probably come in handy](http://web.mit.edu/~csvoss/Public/usabo/stats_handout.pdf). Be careful not to just copy and paste things if you're asked a written question. Demonstrate your own thinking. 