# pandas
Here we will have a quick play with a pandas DataFrame and use what we've learned about accessing its data to answer some questions.

We stopped ten people in the street and asked them what pets they have. We also recorded the person's sex and age.

In [1]:
import numpy as np
import pandas as pd

In [2]:
pets = pd.DataFrame({'sex': np.array(['M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M']),
                     'age': np.array([21, 45, 23, 56, 47, 70, 34, 30, 19, 62]),
                     # plain Python lists here: the rows have different lengths,
                     # so they cannot form a regular (rectangular) NumPy array
                     'pets': [['cat', 'dog'],
                              ['hamster'],
                              ['cat', 'gerbil'],
                              ['fish', 'hamster', 'gerbil'],
                              ['cat'],
                              ['dog'],
                              ['dog'],
                              ['cat'],
                              ['rabbit', 'cat'],
                              ['dog']]})


We have been asked to analyse the survey responses. In particular, we have been given the following questions:

* What sex was the youngest respondent?
* What age was the person with the most pets?
* What was the most popular pet?
* What was the average age of dog owners?


Firstly, let's just look at the data. With only ten rows we don't strictly need head(), which shows just the first five, but it keeps the output short.

In [3]:
pets.head()

|   | sex | age | pets |
|---|-----|-----|------|
| 0 | M | 21 | [cat, dog] |
| 1 | M | 45 | [hamster] |
| 2 | F | 23 | [cat, gerbil] |
| 3 | M | 56 | [fish, hamster, gerbil] |
| 4 | F | 47 | [cat] |


Notice here, as well, how the notebook has a nice default presentation for DataFrames. And, yes, you can customize this! We won't be going into that here.

## What sex was the youngest respondent?
Hint: you might find the .loc accessor useful here. Break the task down: create a boolean index that is True where the value in the age column equals the minimum of the age column, then select the sex column. The solution below takes a simpler route and sorts the respondents by age, so the youngest appears in the first row.

In [36]:
# one line of code
pets.loc[:, ['age', 'sex']].sort_values(by='age')

|   | age | sex |
|---|-----|-----|
| 8 | 19 | F |
| 0 | 21 | M |
| 2 | 23 | F |
| 7 | 30 | M |
| 6 | 34 | F |
| 1 | 45 | M |
| 4 | 47 | F |
| 3 | 56 | M |
| 9 | 62 | M |
| 5 | 70 | F |


Reading the first row of the sorted output, we see that the youngest respondent (age 19) was female (F).
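The boolean-index approach described in the hint can be sketched as follows. This is one possible version, not the only one; the names `is_youngest` and `youngest_sex` are made up for illustration, and the DataFrame is rebuilt with just the two columns needed so the cell runs on its own.

```python
import pandas as pd

# Same sex and age values as the survey above
pets = pd.DataFrame({
    'sex': ['M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M'],
    'age': [21, 45, 23, 56, 47, 70, 34, 30, 19, 62],
})

# Boolean Series: True only where age equals the minimum age
is_youngest = pets['age'] == pets['age'].min()

# Use the boolean Series to pick rows and 'sex' to pick the column
youngest_sex = pets.loc[is_youngest, 'sex']
print(youngest_sex)
```

Unlike sorting, this returns only the matching row or rows, so it stays readable on a much larger DataFrame. If several respondents shared the minimum age, all of them would be returned.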

## What age was the person with the most pets?
Hint: you may find it useful to _apply_ len, wrapped in a lambda function, to the pets column. Remember that calling len on the pets column itself just returns the length of the Series, which is the number of rows in the DataFrame. Adding useful features to your data is a very common thing in data science, so go ahead and create a new column in our pets DataFrame and call it 'num_pets'.

In [44]:
# task: create new column 'num_pets' which contains the number of pets
# each person had (hint: this is the length of each list in the pets column)
# one line of code here:
pets['num_pets'] = pets['pets'].apply(lambda x: len(x))

In [45]:
# view the DataFrame again to check our new column is there
pets

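|   | sex | age | pets | num_pets |
|---|-----|-----|------|----------|
| 0 | M | 21 | [cat, dog] | 2 |
| 1 | M | 45 | [hamster] | 1 |
| 2 | F | 23 | [cat, gerbil] | 2 |
| 3 | M | 56 | [fish, hamster, gerbil] | 3 |
| 4 | F | 47 | [cat] | 1 |
| 5 | F | 70 | [dog] | 1 |
| 6 | F | 34 | [dog] | 1 |
| 7 | M | 30 | [cat] | 1 |
| 8 | F | 19 | [rabbit, cat] | 2 |
| 9 | M | 62 | [dog] | 1 |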


In [46]:
pets.loc[pets['num_pets'] == pets['num_pets'].max(), 'age']

3    56
Name: age, dtype: int64

So we see the person with the most pets was 56 years old.
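As an aside, here is a hedged sketch of a slightly different route to the same answer. It assumes the `.str.len()` accessor, which measures the length of each element (including lists), and `idxmax()`, which returns the index label of the first maximum. The variable names are illustrative only.

```python
import pandas as pd

# Age and pets columns from the survey, rebuilt so this cell runs on its own
pets = pd.DataFrame({
    'age': [21, 45, 23, 56, 47, 70, 34, 30, 19, 62],
    'pets': [['cat', 'dog'], ['hamster'], ['cat', 'gerbil'],
             ['fish', 'hamster', 'gerbil'], ['cat'], ['dog'], ['dog'],
             ['cat'], ['rabbit', 'cat'], ['dog']],
})

# Length of each list in the pets column (not the length of the column)
num_pets = pets['pets'].str.len()

# Index label of the row with the most pets (first one only, if tied)
top_owner = num_pets.idxmax()

# Look up that respondent's age
age_of_top_owner = pets.loc[top_owner, 'age']
print(top_owner, age_of_top_owner)
```

Note the trade-off: `idxmax()` reports only the first respondent with the maximum, whereas the boolean index used above returns every respondent who ties for the most pets.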

## What was the most popular pet?
This is a very interesting question, given the data, because the data are arranged by respondent, not by pet. We need to _get into_ the pets column now in order to count each type of animal. To do this, we could write a list comprehension that iterates over every animal in every list. Instead, here is a handy way to convert that Series of lists into a single (longer) Series of animals. Ending up with another Series means we still have access to the powerful methods available from pandas.

In [47]:
pet_series = pets['pets'].apply(pd.Series).stack().reset_index(drop=True)
pet_series

0         cat
1         dog
2     hamster
3         cat
4      gerbil
5        fish
6     hamster
7      gerbil
8         cat
9         dog
10        dog
11        cat
12     rabbit
13        cat
14        dog
dtype: object
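As an aside, pandas 0.25 and later also provide `Series.explode`, which flattens a Series of lists more directly. A minimal sketch, assuming such a version is installed; `pets_column` is an illustrative name for a rebuilt copy of the pets column.

```python
import pandas as pd

# The pets column from the survey, as a standalone Series of lists
pets_column = pd.Series([['cat', 'dog'], ['hamster'], ['cat', 'gerbil'],
                         ['fish', 'hamster', 'gerbil'], ['cat'], ['dog'],
                         ['dog'], ['cat'], ['rabbit', 'cat'], ['dog']])

# explode() gives each list element its own row, repeating the original
# index label; reset_index(drop=True) renumbers the rows from 0
pet_series = pets_column.explode().reset_index(drop=True)
print(pet_series)
```

This should give the same fifteen-element Series as the `apply(pd.Series).stack()` version, without building an intermediate DataFrame.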

In [48]:
# task: produce an ordered count of each animal
# one line of code here:
pet_series.value_counts()

cat        5
dog        4
hamster    2
gerbil     2
fish       1
rabbit     1
dtype: int64

Cat is the most popular pet.

Note we could also have approached this task by iterating over the original pets column and building a dictionary with the animal as the key and the count as the value. That requires more explicit iterating and count incrementing, and we would still need to iterate over the final result to find the maximum count. With our approach here, we can easily read the most popular pet animal from the top of the result.
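For comparison, the pure-Python route might look like the sketch below. It uses `collections.Counter` from the standard library, which handles the count incrementing for us; `pet_lists` is an illustrative plain-list copy of the pets column.

```python
from collections import Counter

# The survey answers as plain Python lists, one list per respondent
pet_lists = [['cat', 'dog'], ['hamster'], ['cat', 'gerbil'],
             ['fish', 'hamster', 'gerbil'], ['cat'], ['dog'], ['dog'],
             ['cat'], ['rabbit', 'cat'], ['dog']]

# Iterate over every animal in every respondent's list and count it
counts = Counter(animal for owned in pet_lists for animal in owned)

# most_common(1) returns a list holding the single (animal, count) top entry
print(counts.most_common(1))
```

This works well for a single count, but the result is no longer a pandas object, so any further filtering or joining with the rest of the survey would need extra code.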

## What was the average age of dog owners?
Hint: again, you may find it useful to apply a lambda function to create a boolean index that is True if a respondent said they had a dog and False otherwise.

In [49]:
# example
('dog' in ['dog', 'cat'], 'dog' in ['rabbit'])

(True, False)

In [52]:
# task: use a lambda function to test whether 'dog' is contained in each list of animals,
# extract the age column and then chain the mean method to calculate the average age.
# one line of code here:
pets.loc[pets['pets'].apply(lambda x: 'dog' in x), 'age'].mean()

46.75
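To see where that number comes from, the one-liner can be unpacked into its intermediate steps. This is only an illustrative breakdown; the names `has_dog` and `dog_owner_ages` are not part of the original exercise.

```python
import pandas as pd

# Age and pets columns from the survey, rebuilt so this cell runs on its own
pets = pd.DataFrame({
    'age': [21, 45, 23, 56, 47, 70, 34, 30, 19, 62],
    'pets': [['cat', 'dog'], ['hamster'], ['cat', 'gerbil'],
             ['fish', 'hamster', 'gerbil'], ['cat'], ['dog'], ['dog'],
             ['cat'], ['rabbit', 'cat'], ['dog']],
})

# Step 1: boolean Series, True for respondents whose list contains 'dog'
has_dog = pets['pets'].apply(lambda animals: 'dog' in animals)

# Step 2: the ages of just those respondents
dog_owner_ages = pets.loc[has_dog, 'age']

# Step 3: their mean age
print(dog_owner_ages.tolist(), dog_owner_ages.mean())
```

The four dog owners are aged 21, 70, 34 and 62, and (21 + 70 + 34 + 62) / 4 = 46.75.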

## Conclusion
You've now seen how pandas holds tabular data, where each column can be a different type (e.g. sex is a string and age is a number). Furthermore, pandas provides powerful methods for slicing and dicing the data, letting us answer some very interesting questions with relatively little code. You're well on your journey to becoming a data ninja!
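If you want to check the column types for yourself, the `dtypes` attribute lists one type per column. A small sketch using the first three survey rows; the exact type names printed depend on your pandas version, so no output is shown here.

```python
import pandas as pd

# First three respondents from the survey
pets = pd.DataFrame({
    'sex': ['M', 'M', 'F'],
    'age': [21, 45, 23],
    'pets': [['cat', 'dog'], ['hamster'], ['cat', 'gerbil']],
})

# One entry per column: age is an integer type, while sex and pets
# hold Python strings and lists respectively
column_types = pets.dtypes
print(column_types)
```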