In [3]:
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
# Set up the data
data = pd.DataFrame()
data['gender'] = ['male'] * 100 + ['female'] * 100
data['height'] = np.append(np.random.normal(69, 8, 100), np.random.normal(64, 5, 100))
data['weight'] = np.append(np.random.normal(195, 25, 100), np.random.normal(166, 15, 100))

In [3]:
data.head()

,gender,height,weight
0,male,76.075135,164.084642
1,male,72.131834,192.990586
2,male,63.808059,172.382063
3,male,77.234649,212.957441
4,male,71.116512,205.243992


# Describing data with Pandas

So far in this lesson, we've discussed the various ways we can use statistics to describe a given dataset. Now we'll look at how to use the tools of data science, specifically the _pandas_ package, to describe data quickly and easily. This is what you'll actually use day to day when you have to describe or summarize the data you're working with. Rather than writing out formulas or performing calculations by hand, you'll use the tools of programming to get the answers you want easily and efficiently.


## What we've seen before

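We've already shown some of the basic tools. A pandas column has methods like `.mean()` and `.std()`, which mirror the NumPy functions of the same names, to calculate the mean and standard deviation of our data. Note that pandas' `.std()` returns the sample standard deviation (`ddof=1`) by default, whereas NumPy's `std` defaults to the population version (`ddof=0`).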

In [4]:
data.height.mean()

66.884801413027489

In [5]:
data.height.std()

7.0741683857746871

Now, there are many more methods in pandas that describe data with simple aggregates. Median and variance, for example, both have associated pandas methods (`.median()` and `.var()`). As a general rule of thumb, if you're trying to compute a standard statistical measure (the kind you could find in a statistics book), Python probably already has a method for it somewhere. Usually that method will be in NumPy or pandas, but not always. It is always worth a quick search and a check of Stack Overflow to see whether the work has already been done before you go off and write your own functions.
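As a minimal sketch of a few of these aggregate methods, the cell below builds a seeded stand-in for the `height` column. The seed and the variable names are assumptions made for illustration, so the numbers will not match the outputs shown elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the height column; the seed is an arbitrary assumption
rng = np.random.default_rng(0)
height = pd.Series(rng.normal(69, 8, 100), name='height')

median = height.median()                 # middle value of the sorted data
variance = height.var()                  # sample variance (ddof=1 by default)
std = height.std()                       # sample standard deviation (ddof=1 by default)
value_range = height.max() - height.min()  # distance between the extremes

print(median, variance, std, value_range)
```

Each call condenses the whole column into a single number, which is what makes these methods aggregative.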

## The `.describe()` method

So far we've mostly talked about methods with two kinds of output: the output either keeps the same shape as the input, with modified values (the iterative kinds of methods), or it condenses the data into a single value (aggregative methods). There is another group of methods in pandas, and they happen to be supremely useful for quickly and coherently summarizing data in a numeric rather than visual way.

In statistics, there are a lot of descriptive values that are often used in concert with each other. The most classic example is probably mean and standard deviation. Using the two of them together gets you a lot of information about how the data is distributed across values.

Pandas understands this. Sometimes you want more than one value, but less than all of them. You want a set of summary statistics that give you a good, standardized view into the data and its variables. Enter `.describe()`.

In [6]:
data.describe()

,height,weight
count,200.0,200.0
mean,66.884801,180.990596
std,7.074168,24.647523
min,52.559517,135.856048
25%,62.082902,164.321093
50%,66.264664,177.430006
75%,70.631608,197.696568
max,88.28286,257.34854


Let's look at what that did. Firstly, it returned a data frame, but not one of the same size or shape as the one we gave it. Instead, it went through the columns and computed these standard statistical measures for each column where that is possible. We say "where possible" because one column is missing: `gender`. That's because `gender` holds strings rather than numeric values, and we can't compute the mean of strings.
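String columns can still be described; they just get a different set of statistics. The sketch below uses a small made-up frame (the values are illustrative assumptions, not the notebook's data) and calls `.describe()` on the string column directly.

```python
import pandas as pd

# Small illustrative frame with one string column and one numeric column
df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female', 'female'],
    'height': [70.0, 72.5, 64.0, 66.5, 63.0],
})

# For a string column, describe() reports count, unique, top and freq
gender_summary = df['gender'].describe()
print(gender_summary)
```

Here `top` is the most frequent value and `freq` is how many times it appears.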

Now, as for the values themselves. Count should be relatively self-evident, as should min and max. Mean and std (standard deviation) we've also talked about before. The three percentage values are _percentiles_. These values represent cutoff points, below which a certain percentage of the data lies. So, 25% of weights are below 164.32, 50% are below 177.43 (the median), and so on.
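To make the percentile idea concrete, here is a small sketch on a made-up series of the numbers 1 through 100 (an assumption chosen so the arithmetic is easy to check). It compares `.quantile()` with the `25%` row of `.describe()` and measures the share of values below the cutoff.

```python
import pandas as pd

# Illustrative data: the integers 1..100 as floats
values = pd.Series(range(1, 101), dtype=float)

q25 = values.quantile(0.25)            # 25th percentile cutoff
summary = values.describe()            # contains the same cutoff under '25%'
share_below = (values < q25).mean()    # fraction of values below the cutoff

print(q25, summary['25%'], share_below)
```

The cutoff from `.quantile(0.25)` is the same number `.describe()` reports, and a quarter of the values fall below it.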

Together, these values give us a decent image of what each of the variables included looks like. We can get a numerical sense of what we might call their "shape". However, this is only one part of `.describe()`'s capabilities. As we covered in the toolkit unit, we can also group our data. This allows us to be even more insightful with our describe, letting us compare the summary statistics for two different groups of our data.

In [7]:
data.groupby('gender').describe()

height
gender,count,mean,std,min,25%,50%,75%,max
female,100.0,64.631253,5.020057,52.559517,60.82928,64.659156,68.105445,77.571937
male,100.0,69.13835,8.070482,52.810637,63.433466,67.742419,75.588823,88.28286

weight
gender,count,mean,std,min,25%,50%,75%,max
female,100.0,165.695191,13.666863,135.856048,156.197938,166.648353,173.072667,212.956623
male,100.0,196.286001,23.700822,146.452108,179.157897,197.065564,213.03812,257.34854


Now we have twice the output. This may not be the easiest form to read, but it does give us a sense of the difference between the two groups, male and female. In this case we can see that the mean and the quartiles of both height and weight are higher for men than for women, which is what we'd expect given how the data was generated. This kind of grouping can add another layer of insight to our analysis.
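If the wide grouped output is hard to read, a few variations can help. The sketch below rebuilds a seeded version of the dataset (the seed is an assumption, so the numbers differ from those above) and shows three options: describing one column per group, requesting only selected statistics, and transposing the full result.

```python
import numpy as np
import pandas as pd

# Seeded rebuild of the dataset; the seed is an arbitrary assumption
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'gender': ['male'] * 100 + ['female'] * 100,
    'height': np.append(rng.normal(69, 8, 100), rng.normal(64, 5, 100)),
    'weight': np.append(rng.normal(195, 25, 100), rng.normal(166, 15, 100)),
})

# One column at a time: one row per gender, one column per statistic
height_by_gender = data.groupby('gender')['height'].describe()

# Only the statistics of interest, for both numeric columns
means_and_stds = data.groupby('gender')[['height', 'weight']].agg(['mean', 'std'])

# Transposed full summary: statistics as rows, groups as columns
tall_summary = data.groupby('gender').describe().T

print(height_by_gender)
print(means_and_stds)
print(tall_summary)
```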

## Value Counts

Sometimes, you aren't dealing with data that is best summarized in this form. The most common example is strings, where these kinds of numeric methods do not apply. In that case what you're probably interested in is _counts_. pandas gives you an easy way to go over a column of data and return the distinct values as well as the count of each.

In [8]:
data.gender.value_counts()

male      100
female    100
Name: gender, dtype: int64

Now, the first thing to note is that this method is working on `data.gender`, which is a _series_ object rather than a _data frame_ object. Unlike `.describe()`, `.value_counts()` does not summarize a data frame column by column. (Recent pandas versions do have a `DataFrame.value_counts()`, but it counts distinct rows rather than the values within each column.) Luckily each column and row in a data frame is a series, so you can use this method simply by selecting a column as we did above.

There are several reasons to use this method. Firstly, it gives you another way to make sense of your data. In this case it shows us that our data is evenly balanced between males and females, with one hundred samples of each.
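When the balance between categories matters more than the raw counts, the `normalize=True` argument returns proportions instead. A minimal sketch, using a stand-in series with the same 100/100 split as the data above:

```python
import pandas as pd

# Stand-in for the gender column: 100 males followed by 100 females
gender = pd.Series(['male'] * 100 + ['female'] * 100, name='gender')

# Proportions rather than counts; the values sum to 1
shares = gender.value_counts(normalize=True)
print(shares)
```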

There are plenty of other ways this method could be useful. It can reveal outliers or possibly malformed data. For example, if we were to see something like `'Mal'` with a single entry, we'd have found a typo in the data. This method works on both numeric and object data, though it is not valuable to run it on the numeric columns in this example. Can you think of why?

In [9]:
data.weight.value_counts().head()

205.958433    1
205.243992    1
166.652014    1
181.575870    1
193.806997    1
Name: weight, dtype: int64

As you can see, it's not useful because we're dealing with truly continuous random data, so no value is exactly repeated. We simply get a list of all the values with a count of 1 for each.
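One way to still get useful counts from a continuous column is to group the values into intervals first. The sketch below uses the `bins` argument of `Series.value_counts()` on a seeded stand-in for the weight column; the seed, the distribution parameters, and the choice of 5 bins are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Seeded stand-in for a continuous weight column (assumed parameters)
rng = np.random.default_rng(0)
weight = pd.Series(rng.normal(180, 25, 200), name='weight')

# Count values per equal-width interval instead of per exact value
binned = weight.value_counts(bins=5, sort=False)
print(binned)
```

With `sort=False` the intervals appear in ascending order rather than by frequency, which reads like a text histogram.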

However, these two methods, `.describe()` and `.value_counts()`, do often provide incredibly easy and valuable insights into your dataset. You'll want to use them throughout the course as one of the ways to get a first, quick sense of the data before digging in more specifically on points of interest.

# Drill - Describing Data

1) Greg was 14, Marcia was 12, Peter was 11, Jan was 10, Bobby was 8, and Cindy was 6 when they started playing the Brady kids on The Brady Bunch. Cousin Oliver was 8 years old when he joined the show. What are the mean, median, and mode of the kids' ages when they first appeared on the show? What are the variance, standard deviation, and standard error?

In [4]:
bradyBunch = pd.DataFrame()

In [6]:
bradyBunch['Name'] = ['Greg', 'Marcia', 'Peter', 'Jan', 'Bobby', 'Cindy', 'Oliver']
bradyBunch['Age-When-Joined'] = [14, 12, 11, 10, 8, 6, 8]

bradyBunch

,Name,Age-When-Joined
0,Greg,14
1,Marcia,12
2,Peter,11
3,Jan,10
4,Bobby,8
5,Cindy,6
6,Oliver,8


In [13]:
bradyBunch.describe()

,Age-When-Joined
count,7.0
mean,9.857143
std,2.734262
min,6.0
25%,8.0
50%,10.0
75%,11.5
max,14.0


In [14]:
#standard error is N/A because it only applies to samples
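`.describe()` reports the mean, the median (the `50%` row) and the sample standard deviation, but not the mode or the variance. The sketch below computes each measure the question asks for directly. Treating the seven ages as a sample, which is what the standard error requires, is an assumption made here for illustration.

```python
import pandas as pd

ages = pd.Series([14, 12, 11, 10, 8, 6, 8], name='Age-When-Joined')

mean = ages.mean()
median = ages.median()
modes = ages.mode().tolist()        # a list, because ties are possible

sample_var = ages.var()             # divides by n - 1 (ddof=1)
sample_std = ages.std()             # matches the std row of describe()
population_var = ages.var(ddof=0)   # divides by n
population_std = ages.std(ddof=0)
std_error = ages.sem()              # sample std / sqrt(n), only meaningful for a sample

print(mean, median, modes)
print(sample_var, sample_std, population_var, population_std, std_error)
```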

2) Using these estimates, if you had to choose only one estimate of central tendency and one estimate of variance to describe the data, which would you pick and why?

In [None]:
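#Central Tendency:  Mean, because there are no huge outliers, and sensitivity to outliers is the mean's main disadvantage.  Median (the 50th percentile) works well here too.

#Variance:  Standard deviation, because it is in the same units as the ages (years).  Note that the std from .describe() is the sample version (ddof=1); use .std(ddof=0) for the population version.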

3) Next, Cindy has a birthday. Update your estimates- what changed, and what didn't?

In [15]:
# Nothing will change, because the Age Field is "Age-When-Joined" (The Show), which will never change.

# But if Cindy's field does change, then the mean, stdev, and min will change.  The count, 25%, 50%, 75%, and max will stay the same. 
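The second case can be checked with a short before-and-after comparison. This sketch assumes Cindy's age goes from 6 to 7 and works on a standalone copy of the ages rather than on `bradyBunch` itself.

```python
import pandas as pd

before = pd.Series([14, 12, 11, 10, 8, 6, 8], name='age')
after = before.copy()
after.iloc[5] = 7   # Cindy is at position 5; hypothetical birthday update

# Put both summaries side by side and list the rows that differ
comparison = pd.DataFrame({'before': before.describe(), 'after': after.describe()})
changed = comparison.index[comparison['before'] != comparison['after']].tolist()

print(comparison)
print(changed)
```

The median and mode are unaffected because Cindy's new age is still below the middle of the sorted ages and does not create a new most-common value.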

4) Nobody likes Cousin Oliver. Maybe the network should have used an even younger actor. Replace Cousin Oliver with 1-year-old Jessica, then recalculate again. Does this change your choice of central tendency or variance estimation methods?

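In [13]:
oliver = bradyBunch['Name'] == 'Oliver'  # select only Oliver's row; replacing every 8 would also overwrite Bobby's age
bradyBunch.loc[oliver, 'Age-When-Joined'] = 1
bradyBunch.loc[oliver, 'Name'] = 'Jessica'
bradyBunch.describe()

,Age-When-Joined
count,7.0
mean,8.857143
std,4.336995
min,1.0
25%,7.0
50%,10.0
75%,11.5
max,14.0


In [14]:
#Yes, now I will definitely use the median for central tendency, because Jessica is an outlier who pulls the mean down by 1 (from 9.86 to 8.86) while the median stays at 10
#I will still use the standard deviation, though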

5) On the 50th anniversary of The Brady Bunch, four different magazines asked their readers whether they were fans of the show. The answers were:

- TV Guide: 20% fans
- Entertainment Weekly: 23% fans
- Pop Culture Today: 17% fans
- SciPhi Phanatic: 5% fans

Based on these numbers, what percentage of adult Americans would you estimate were Brady Bunch fans on the 50th anniversary of the show?

In [None]:
#There's no way to tell without knowing how to weight the different magazines (for a weighted average), and even then the readers who answered don't represent the adult Americans who didn't take part in the surveys
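To illustrate why the weighting matters, the sketch below compares a plain average of the four percentages with a weighted one. The respondent counts are entirely hypothetical (the question does not give them), and neither number addresses the fact that magazine readers are not a random sample of adult Americans.

```python
import numpy as np

# Fan shares reported by the four magazines, in the order listed in the question
fan_share = np.array([0.20, 0.23, 0.17, 0.05])

# Plain average: treats every magazine as equally important
unweighted = fan_share.mean()

# HYPOTHETICAL respondent counts, invented purely for illustration
respondents = np.array([5000, 3000, 1500, 500])
weighted = np.average(fan_share, weights=respondents)

print(unweighted, weighted)
```

Changing the hypothetical counts changes the weighted estimate, which is why no single figure can be defended from the percentages alone.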