# Olympic-size statistics
We've been using "categorical" data, like countries.  What if we want to work with numeric data?  And what if we care more about columns than rows?  In this worksheet, we'll look at data from the 2014 Sochi Winter Olympics.

We'll start by importing `pandas`.  *You'll get used to putting this command at the top of every notebook.*

In [None]:
import pandas as pd

Now, let's add the list of countries.

In [None]:
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
             'Netherlands', 'Germany', 'Switzerland', 'Belarus',
             'Austria', 'France', 'Poland', 'China', 'Korea', 
             'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
             'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
             'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']
countries

Now, let's get the lists of medal counts for those countries in the 2014 Sochi Winter Olympics.  (*Aren't you glad you didn't have to type those in?*)

In [None]:
gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

Finally, we'll put them into a data frame.  Try the next few lines to see if they work.

In [None]:
olympic_medal_counts = pd.DataFrame(data = {"country":countries,
                                        "gold":gold,
                                        "silver":silver,
                                        "bronze":bronze})

In [None]:
olympic_medal_counts

Do you notice anything strange about the table you just generated?  What is it?

## Arranging columns
From my perspective, the columns are in the wrong order.  Do you know how to fix that issue?  Think back to our previous worksheet and update the command so that you put the country first, then the number of gold medals, then bronze, then silver.


In [None]:
olympic_medal_counts = pd.DataFrame(data = {"country":countries,
                                        "gold":gold,
                                        "silver":silver,
                                        "bronze":bronze})
olympic_medal_counts

You can also arrange the columns by building a new data frame from the old one.  Just as you can use `olympic_medal_counts["gold"]` to get just the column of gold medals, you can use `olympic_medal_counts[["country","gold"]]` to get two columns.  (*Note that we've added an extra pair of square brackets to make a list of column names.*)

In [None]:
olympic_medal_counts[["country","gold"]]
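Here is a small sketch of the same idea on a made-up table.  (The `snacks` data frame, its column names, and its numbers are invented for illustration; they are not part of the Olympic data.)  Listing the column names in the order you want should give you a new data frame with the columns in that order.

```python
import pandas as pd

# Made-up data, just for illustration.
snacks = pd.DataFrame(data = {"apples": [3, 1, 4],
                              "name": ["Ana", "Ben", "Cy"],
                              "pears": [2, 7, 1]})

# Ask for the columns in the order we want:
# name first, then pears, then apples.
reordered = snacks[["name", "pears", "apples"]]
print(reordered)

# The original data frame is unchanged.
print(snacks)
```

Because the result is a new data frame, you need to store it in a variable (as we did with `reordered`) if you want to keep using it.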

## Computing statistics
Unlike the panda data set, which was mostly textual data, our Olympic data set contains lots of numeric data.  That means we can start to compute some statistics.

### Average number of medals
What's the average number of gold medals?  As you may recall, we could add up all of the values and then divide by the total number of values.  Can you remember how to do those two things with lists?

In [None]:
gold_medals = olympic_medal_counts["gold"]

When you've figured out a solution, grab a counselor and show them!  And if you can't figure out a solution, that's okay, too.  Grab a counselor and ask for help!

### Medal statistics, revisited
You may have written something like the following to compute the average number of gold medals.
```
# Use "total" rather than "sum", since sum is the name of a built-in Python function.
total = 0
for x in gold_medals:
    total = total + x
print(total/len(gold_medals))
```
But that's a lot of work.  Fortunately, `pandas` lets you call lots of functions on columns, including `mean`.  Try the following.

In [None]:
gold_medals.mean()

Did you get the same answer (or about the same answer)?  We hope so.

Next, use commands to determine the median, mode, min, and max of the numbers of gold medals.

Are you puzzled by the answer for the mode?  If so, try the following for more information.  (Remember that `value_counts` tells you how many times each value appears.)

In [None]:
gold_medals.value_counts()
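If the mode still looks odd, a tiny made-up example may help.  (The `scores` values below are invented for illustration.)  When two values are tied for "most common", `mode` gives back both of them rather than picking one.

```python
import pandas as pd

# Made-up scores, just for illustration.
scores = pd.Series([2, 5, 2, 7, 5, 9])

# Both 2 and 5 appear twice, so there are two modes.
modes = scores.mode()
print(modes)

# value_counts shows how many times each value appears,
# which lets us check the tie for ourselves.
counts = scores.value_counts()
print(counts)
```

Compare the two outputs: the values listed by `mode` should be the ones with the biggest count in `value_counts`.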

## Selecting rows
Remember how we were able to select just the giant pandas in China with the following command?

```
giant_pandas.loc[giant_pandas["country"] == "China"]
```

You can use comparisons other than `==` in the selection.

See if you can figure out how to select the countries that got more than the average number of gold medals (about 3.8).  You'll want to use a command like the following.

```
olympic_medal_counts[olympic_medal_counts[???] ?? 3.8]
```

You can figure out how many countries got more than the average number of gold medals using `len`.
```
len(olympic_medal_counts[olympic_medal_counts[???] ?? 3.8])
```
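Why does that work?  Here is a sketch on a made-up table that shows the two steps separately.  (The `pets` data frame and the name `is_young` are invented for illustration.)  The comparison produces one `True` or `False` per row, and the square brackets keep only the rows marked `True`.

```python
import pandas as pd

# Made-up data, just for illustration.
pets = pd.DataFrame(data = {"name": ["Rex", "Tom", "Ivy", "Max"],
                            "age": [2, 9, 5, 12]})

# Step 1: the comparison gives one True or False per row.
is_young = pets["age"] <= 5
print(is_young)

# Step 2: putting that inside square brackets keeps only the True rows.
young_pets = pets[is_young]
print(young_pets)

# len counts the rows that are left.
print(len(young_pets))
```

You can write both steps on one line, as in the commands above, or keep them separate if that is easier to read.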

## Practice
Repeat the previous exercise for silver medals and bronze medals.

## Exploration
See what else you can discover about this data set.  For example,
* Which countries had more than the average number of gold medals, more than the average number of silver medals, and more than the average number of bronze medals?
* What is the average number of bronze medals for countries with lots of gold medals?  For countries with few gold medals?
* Which countries had more silver medals than gold medals?
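
Questions like these need a few more tricks.  The sketch below tries them on a made-up table.  (The `teams` data frame and every name in it are invented for illustration.)  It compares one column with another, combines two conditions with `&`, and takes the mean of a column for only the selected rows.

```python
import pandas as pd

# Made-up data, just for illustration.
teams = pd.DataFrame(data = {"team": ["Red", "Blue", "Green", "Gold"],
                             "wins": [8, 3, 6, 5],
                             "losses": [2, 7, 6, 4]})

# Compare one column with another, row by row.
more_wins = teams[teams["wins"] > teams["losses"]]
print(more_wins)

# Combine two conditions with & (meaning "and").
# Each condition needs its own parentheses.
busy_winners = teams[(teams["wins"] > 4) & (teams["losses"] > 3)]
print(busy_winners)

# Compute a statistic for just the selected rows.
average_losses = more_wins["losses"].mean()
print(average_losses)
```

Note that you need `&` here; Python's word `and` does not work on whole columns.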