## Pandas deep dive
Let's review some of the key Pandas and NumPy concepts that we will use during this course. First, let's import both libraries.

NumPy is a numerical library that makes it easy to work with large arrays and matrices.

In [3]:
import pandas as pd
import numpy as np
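
To give a feel for what "easy to work with arrays" means, here is a minimal sketch. The matrix values are made up for illustration; it only shows element-wise arithmetic and aggregation along an axis.

```python
import numpy as np

# A small 2 x 3 matrix built from nested lists (illustrative values)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Arithmetic applies to every element at once, no explicit loop needed
doubled = matrix * 2

# Aggregations can run over the whole array or along one axis
total = matrix.sum()               # sum of all elements
column_sums = matrix.sum(axis=0)   # one sum per column

print(matrix.shape)
print(doubled)
print(total, column_sums)
```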

A Pandas Series is a one-dimensional array of indexed data. We can create one from a list, as in this example:

In [4]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The output shows that a Series wraps a sequence of values and a sequence of indexes. The values are simply a NumPy array:

In [6]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The index is an array-like object of type `pd.Index`:

In [7]:
data.index

RangeIndex(start=0, stop=4, step=1)
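
The two parts of a Series can be checked directly. The sketch below rebuilds the same Series and inspects the types; `to_numpy()` is included because current Pandas documentation suggests it as the preferred way to get the underlying array.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values          # the underlying NumPy array
same_values = data.to_numpy() # equivalent accessor
index = data.index            # a RangeIndex, which is a kind of pd.Index

print(type(values))
print(isinstance(index, pd.Index))
print(list(index))
```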

We can access data through the associated index

In [8]:
data[1]

0.5

We can also slice the data:

In [10]:
data[1:3]

1    0.50
2    0.75
dtype: float64
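
With the default index, labels and positions coincide, so `data[1]` is unambiguous. When the labels are other integers the two can differ, and the explicit accessors `.loc` (by label) and `.iloc` (by position) make the intent clear. The labels 10 to 40 below are hypothetical, chosen only to show the difference.

```python
import pandas as pd

# Same values as before, but with integer labels that are not 0..3
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[10, 20, 30, 40])

by_label = data.loc[20]        # the element whose label is 20
by_position = data.iloc[1]     # the second element, whatever its label
position_slice = data.iloc[1:3]  # positions 1 and 2

print(by_label, by_position)
print(position_slice)
```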

The main difference between a NumPy array and a Series is the index: a NumPy array is always indexed by integer position, while a Series index can hold other kinds of labels. For example, we can use strings:

In [11]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [12]:
data['b']

0.5

We can think of the Series as a specialized dictionary. We can even create a Pandas Series object from a Python dictionary

In [13]:
mass_dict = {'Sun': "1.989 × 10^30 kg",
             'Mercury': "3.285 × 10^23 kg",
             'Venus': "4.867 × 10^24 kg",
             'Earth': "5.972 × 10^24 kg",
             'Mars': "6.39 × 10^23 kg"}
mass = pd.Series(mass_dict)
mass

Sun        1.989 × 10^30 kg
Mercury    3.285 × 10^23 kg
Venus      4.867 × 10^24 kg
Earth      5.972 × 10^24 kg
Mars        6.39 × 10^23 kg
dtype: object
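
The dictionary analogy goes a bit further: a Series supports several dictionary-style operations. A small sketch, reusing the letter-indexed Series from earlier (the new label `'e'` and its value are made up for illustration):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

has_b = 'b' in data          # membership tests look at the index, not the values
has_z = 'z' in data
labels = list(data.keys())   # keys() returns the index
data['e'] = 1.25             # assigning to a new label adds an entry

print(has_b, has_z)
print(labels)
print(data)
```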

We can also slice our Series by label. Unlike positional slicing, a label-based slice includes its end point:

In [14]:
mass['Mercury':'Earth']

Mercury    3.285 × 10^23 kg
Venus      4.867 × 10^24 kg
Earth      5.972 × 10^24 kg
dtype: object
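
The difference between label-based and positional slicing is easy to miss, so here is a short side-by-side sketch using the same data: the label slice keeps `'Earth'`, while the positional slice stops before position 3.

```python
import pandas as pd

mass = pd.Series({'Sun': "1.989 × 10^30 kg",
                  'Mercury': "3.285 × 10^23 kg",
                  'Venus': "4.867 × 10^24 kg",
                  'Earth': "5.972 × 10^24 kg",
                  'Mars': "6.39 × 10^23 kg"})

label_slice = mass['Mercury':'Earth']   # end label is included
position_slice = mass.iloc[1:3]         # end position is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```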

### The Pandas DataFrame
The next key object in Pandas is the DataFrame. This object can be considered a generalization of a matrix, with labeled rows and columns.

It can be thought of as an ordered sequence of columns that share a row index. Let's create a new Series and then combine it with the previous one to build a DataFrame.

In [15]:
grav_dict = {'Sun': "274 m/s²", 'Mercury': "3.7 m/s²", 'Venus': "8.87 m/s²",
             'Earth': "9.807 m/s²", 'Mars': "3.721 m/s²"}
grav = pd.Series(grav_dict)
grav

Sun          274 m/s²
Mercury      3.7 m/s²
Venus       8.87 m/s²
Earth      9.807 m/s²
Mars       3.721 m/s²
dtype: object

In [17]:
# Reuse the mass Series created earlier
# and combine both Series into a single DataFrame
objects = pd.DataFrame({'mass': mass,
                        'grav': grav})
objects

| | mass | grav |
|---|---|---|
| Sun | 1.989 × 10^30 kg | 274 m/s² |
| Mercury | 3.285 × 10^23 kg | 3.7 m/s² |
| Venus | 4.867 × 10^24 kg | 8.87 m/s² |
| Earth | 5.972 × 10^24 kg | 9.807 m/s² |
| Mars | 6.39 × 10^23 kg | 3.721 m/s² |
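
The columns share a row index because Pandas aligns the Series by label when it builds the DataFrame. The sketch below makes that visible with two Series whose labels differ. It is an illustration under assumptions: the values are stored as plain floats (units dropped) and `'Venus'` is deliberately left out of `mass`.

```python
import pandas as pd

# Two Series with different label order and different coverage
mass = pd.Series({'Earth': 5.972e24, 'Mars': 6.39e23})
grav = pd.Series({'Mars': 3.721, 'Earth': 9.807, 'Venus': 8.87})

# Rows are matched by label, not by position;
# a label missing from one Series gets NaN in that column
objects = pd.DataFrame({'mass': mass, 'grav': grav})

print(objects)
```

Matching by label means the order in which each Series was written does not matter, and gaps show up as `NaN` instead of silently shifting values.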


We can read the row labels of the DataFrame through its index:

In [18]:
objects.index

Index(['Sun', 'Mercury', 'Venus', 'Earth', 'Mars'], dtype='object')

In [19]:
# read the Dataframe columns
objects.columns

Index(['mass', 'grav'], dtype='object')

In [20]:
# Access one of the columns, similar to a dictionary
objects['grav']

Sun          274 m/s²
Mercury      3.7 m/s²
Venus       8.87 m/s²
Earth      9.807 m/s²
Mars       3.721 m/s²
Name: grav, dtype: object

Note that indexing a DataFrame with a single label selects a *column*, not a row:

In [21]:
objects['mass']

Sun        1.989 × 10^30 kg
Mercury    3.285 × 10^23 kg
Venus      4.867 × 10^24 kg
Earth      5.972 × 10^24 kg
Mars        6.39 × 10^23 kg
Name: mass, dtype: object
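
Since square brackets select columns, rows need a different accessor. A minimal sketch using `.loc` on a two-row version of the table above (the smaller table is only for brevity):

```python
import pandas as pd

objects = pd.DataFrame(
    {'mass': ["5.972 × 10^24 kg", "6.39 × 10^23 kg"],
     'grav': ["9.807 m/s²", "3.721 m/s²"]},
    index=['Earth', 'Mars'])

column = objects['grav']             # a Series indexed by row label
row = objects.loc['Earth']           # a Series indexed by column name
cell = objects.loc['Mars', 'grav']   # a single value: row label, then column

# Square brackets look up columns, so a row label is not found
try:
    objects['Earth']
    row_label_found = True
except KeyError:
    row_label_found = False

print(row)
print(cell, row_label_found)
```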

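You can summarize a Pandas DataFrame with `.describe()`. For columns that hold strings, such as ours, it reports the count, the number of unique values, the most frequent value and its frequency: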

In [22]:
objects.describe()

| | mass | grav |
|---|---|---|
| count | 5 | 5 |
| unique | 5 | 5 |
| top | 1.989 × 10^30 kg | 274 m/s² |
| freq | 1 | 1 |
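
The summary looks like this because the columns contain text. As a sketch of the alternative, the same gravity figures stored as floats (an assumption for illustration: the `m/s²` unit is dropped and kept only in a comment) produce numeric statistics instead.

```python
import pandas as pd

# Surface gravity in m/s², stored as numbers rather than strings
grav = pd.Series({'Sun': 274.0, 'Mercury': 3.7, 'Venus': 8.87,
                  'Earth': 9.807, 'Mars': 3.721})

summary = grav.describe()   # count, mean, std, min, quartiles, max

print(summary)
```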


Let's load our cereal CSV file and try some Pandas functions.

In [24]:
cereal = pd.read_csv('cereal.csv', index_col='name')
cereal.head()

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |


In [25]:
cereal.describe()

| | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 |
| mean | 106.883117 | 2.545455 | 1.012987 | 159.675325 | 2.151948 | 14.597403 | 6.922078 | 96.077922 | 28.246753 | 2.207792 | 1.02961 | 0.821039 | 42.665705 |
| std | 19.484119 | 1.09479 | 1.006473 | 83.832295 | 2.383364 | 4.278956 | 4.444885 | 71.286813 | 22.342523 | 0.832524 | 0.150477 | 0.232716 | 14.047289 |
| min | 50.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 | 1.0 | 0.5 | 0.25 | 18.042851 |
| 25% | 100.0 | 2.0 | 0.0 | 130.0 | 1.0 | 12.0 | 3.0 | 40.0 | 25.0 | 1.0 | 1.0 | 0.67 | 33.174094 |
| 50% | 110.0 | 3.0 | 1.0 | 180.0 | 2.0 | 14.0 | 7.0 | 90.0 | 25.0 | 2.0 | 1.0 | 0.75 | 40.400208 |
| 75% | 110.0 | 3.0 | 2.0 | 210.0 | 3.0 | 17.0 | 11.0 | 120.0 | 25.0 | 3.0 | 1.0 | 1.0 | 50.828392 |
| max | 160.0 | 6.0 | 5.0 | 320.0 | 14.0 | 23.0 | 15.0 | 330.0 | 100.0 | 3.0 | 1.5 | 1.5 | 93.704912 |


In [26]:
cereal.columns

Index(['mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo',
       'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating'],
      dtype='object')

In [28]:
# Returns an array of the unique manufacturers
cereal.mfr.unique()

array(['N', 'Q', 'K', 'R', 'G', 'P', 'A'], dtype=object)
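
Two closely related methods are `nunique()`, which counts the distinct values, and `value_counts()`, which counts how often each one appears. The sketch below uses only the five manufacturer codes visible in the `head()` output above, so the counts describe that small sample and not the full file.

```python
import pandas as pd

# Manufacturer codes of the first five cereals shown earlier
mfr = pd.Series(['N', 'Q', 'K', 'K', 'R'], name='mfr')

distinct = mfr.unique()       # distinct values, in order of first appearance
n_distinct = mfr.nunique()    # how many distinct values there are
counts = mfr.value_counts()   # occurrences per value, most frequent first

print(distinct)
print(n_distinct)
print(counts)
```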

You can use `.loc` to select specific rows.
Let's return the cereals that have a protein content of 4.0 grams or more.

In [30]:
cereal.loc[cereal['protein'] >= 4.0]

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Cheerios | G | C | 110 | 6 | 2 | 290 | 2.0 | 17.0 | 1 | 105 | 25 | 1 | 1.0 | 1.25 | 50.764999 |
| Life | Q | C | 100 | 4 | 2 | 150 | 2.0 | 12.0 | 6 | 95 | 25 | 2 | 1.0 | 0.67 | 45.328074 |
| Maypo | A | H | 100 | 4 | 1 | 0 | 0.0 | 16.0 | 3 | 95 | 25 | 2 | 1.0 | 1.0 | 54.850917 |
| Muesli Raisins; Dates; & Almonds | R | C | 150 | 4 | 3 | 95 | 3.0 | 16.0 | 11 | 170 | 25 | 3 | 1.0 | 1.0 | 37.136863 |
| Muesli Raisins; Peaches; & Pecans | R | C | 150 | 4 | 3 | 150 | 3.0 | 16.0 | 11 | 170 | 25 | 3 | 1.0 | 1.0 | 34.139765 |
| Quaker Oat Squares | Q | C | 100 | 4 | 1 | 135 | 2.0 | 14.0 | 6 | 110 | 25 | 3 | 1.0 | 0.5 | 49.511874 |
| Quaker Oatmeal | Q | H | 100 | 5 | 2 | 0 | 2.7 | -1.0 | -1 | 110 | 0 | 1 | 1.0 | 0.67 | 50.828392 |

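
It helps to see what the expression inside `.loc[...]` is doing. The comparison produces a boolean Series (a mask) with one `True` or `False` per row, and `.loc` keeps the rows marked `True`. The sketch below is self-contained: instead of reading `cereal.csv`, it rebuilds three columns of the five rows shown by `head()`.

```python
import pandas as pd

# Small stand-in for the cereal table: first five rows, three columns
cereal = pd.DataFrame(
    {'calories': [70, 120, 70, 50, 110],
     'protein': [4, 3, 4, 4, 2],
     'sugars': [6, 8, 5, 0, 8]},
    index=pd.Index(['100% Bran', '100% Natural Bran', 'All-Bran',
                    'All-Bran with Extra Fiber', 'Almond Delight'],
                   name='name'))

mask = cereal['protein'] >= 4.0   # boolean Series, one entry per row
high_protein = cereal.loc[mask]   # rows where the mask is True

print(mask)
print(high_protein)
```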

Return the cereals with at least 2 grams of protein and at most 6 grams of sugar:

In [32]:
cereal.loc[(cereal['protein'] >= 2) & (cereal['sugars'] <= 6)]

| name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| Bran Chex | R | C | 90 | 2 | 1 | 200 | 4.0 | 15.0 | 6 | 125 | 25 | 1 | 1.0 | 0.67 | 49.120253 |
| Bran Flakes | P | C | 90 | 3 | 0 | 210 | 5.0 | 13.0 | 5 | 190 | 25 | 3 | 1.0 | 0.67 | 53.313813 |
| Cheerios | G | C | 110 | 6 | 2 | 290 | 2.0 | 17.0 | 1 | 105 | 25 | 1 | 1.0 | 1.25 | 50.764999 |
| Corn Chex | R | C | 110 | 2 | 0 | 280 | 0.0 | 22.0 | 3 | 25 | 25 | 1 | 1.0 | 1.0 | 41.445019 |
| Corn Flakes | K | C | 100 | 2 | 0 | 290 | 1.0 | 21.0 | 2 | 35 | 25 | 1 | 1.0 | 1.0 | 45.863324 |
| Cream of Wheat (Quick) | N | H | 100 | 3 | 0 | 80 | 1.0 | 21.0 | 0 | -1 | 0 | 2 | 1.0 | 1.0 | 64.533816 |
| Crispix | K | C | 110 | 2 | 0 | 220 | 1.0 | 21.0 | 3 | 30 | 25 | 3 | 1.0 | 1.0 | 46.895644 |

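
Conditions are combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>=` or `<=` in Python. The sketch below again uses a stand-in table built from the five `head()` rows, and also shows that `.loc` accepts a list of columns as a second argument.

```python
import pandas as pd

# Small stand-in for the cereal table: first five rows, three columns
cereal = pd.DataFrame(
    {'calories': [70, 120, 70, 50, 110],
     'protein': [4, 3, 4, 4, 2],
     'sugars': [6, 8, 5, 0, 8]},
    index=pd.Index(['100% Bran', '100% Natural Bran', 'All-Bran',
                    'All-Bran with Extra Fiber', 'Almond Delight'],
                   name='name'))

# and: both conditions must hold
both = cereal.loc[(cereal['protein'] >= 2) & (cereal['sugars'] <= 6)]

# or: at least one condition must hold
either = cereal.loc[(cereal['protein'] >= 4) | (cereal['calories'] > 100)]

# not: invert a condition
sugary = cereal.loc[~(cereal['sugars'] <= 6)]

# rows filtered by a condition, columns restricted to a list
subset = cereal.loc[cereal['sugars'] <= 6, ['protein', 'sugars']]

print(both)
print(sugary)
print(subset)
```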

### Exercise 1
- Can you find the average sugar content of the cereals that list a portion size (`cups`) of 1.0?
- Can you find the highest and lowest calorie content in that same selection?

### Exercise 2
- How many cereals from manufacturer `G` have a calorie content higher than 100?