# SI 330: Data Manipulation 
## 02 - Introduction to Pandas

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

# Reminders
* Slack (see link via Canvas announcements)

## Learning Objectives

* Pandas introduction
* explain how pandas operations differ from "traditional" python
* be able to load a CSV file into a Pandas DataFrame
* explain how to extract columns from a DataFrame
* sort a DataFrame
* assign a column as the index of a DataFrame
* filter a DataFrame according to some criteria
* explain how boolean masks work in filtering DataFrames

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = 'gprime'

## <font color="magenta"> Exercise 1 (1 point): Based on the readings ([Chapter 4](https://learning.oreilly.com/library/view/python-for-data/9781491957653/ch04.html#numpy) of Python for Data Analysis), what is the key feature of numpy that underpins much of the functionality of pandas? </font>

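Arrays: specifically the ```ndarray```, numpy's N-dimensional array object, which supports fast vectorized operations on whole blocks of data without explicit loops.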

## Pandas
* high-level library to support data manipulation and analysis
* DataFrame is the primary object we’ll be dealing with
* similar to R’s dataframe
* maps onto tabular structure
* good for time series and econometric data

## Shift from "pythonic" to "pandorable"

* less looping over elements
* lots of built-in functionality
* a "paradigm shift"

# Data structures

We're all familiar with lists:

In [2]:
names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

Now let's say that we want to divide each of those scores by two and assign the results to another variable. There are lots of ways to do this; write one of them (without importing any additional python packages) and assign the results to a variable called ```half```:

## <font color="magenta"> Exercise 2 (1 point): Write some python code to divide all the scores by 2.  The results should be saved to a variable called ```half```. </font>

In [6]:
# insert your code here
half = [x / 2 for x in scores]

If you followed the above instructions, the following cell block should print
a list of floats that looks like ```[40.0, 47.5, 42.5, 35.0]```:


In [7]:
half

[40.0, 47.5, 42.5, 35.0]

We can put data into an array structure that allows us to apply more powerful
functions.  The data structure that we're interested in is called an ```ndarray``` and is from the ```numpy``` package:

In [12]:
import numpy as np
ascores = np.array(scores)

In [13]:
ascores 

array([80, 95, 85, 70])

In [14]:
ahalf = ascores / 2

In [22]:
ahalf

array([40. , 47.5, 42.5, 35. ])

Numpy arrays are powerful, but they have some limitations; for example, every
element of an array must have the same data type (e.g. int).  pandas provides two additional
data structures that are built on numpy ndarrays.
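To make the single-type limitation concrete, here is a minimal sketch (the variable names are illustrative and not part of the notebook). It shows numpy silently converting mixed input to one common type:

```python
import numpy as np

# An ndarray stores a single dtype, so mixed input is coerced to a common type.
ints = np.array([80, 95, 85, 70])       # all integers
mixed = np.array([80, 95.5, 85, 70])    # one float upcasts every element to float
with_text = np.array([80, "Ian"])       # mixing in a string turns everything into strings

print(ints.dtype.kind)        # 'i' means integer
print(mixed.dtype)            # float64
print(with_text.dtype.kind)   # 'U' means unicode string
```

This coercion is one reason a table with both names and numbers is awkward to hold in a single ndarray.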

The first is the Series.  Let's create a simple pandas Series and examine it:

In [23]:
import pandas as pd

In [24]:
from pandas import Series

In [25]:
sscores = Series(scores,name='scores')

In [26]:
sscores

0    80
1    95
2    85
3    70
Name: scores, dtype: int64

So you see a couple of useful things: an index (0 to 3) and a data type (dtype), which in this case is an int64.

**A Series is a one-dimensional ndarray with axis labels**

In [34]:
data = dict(zip(names,scores))

In [35]:
import pandas as pd

In [36]:
data

{'Charlotte': 80, 'Ingrid': 95, 'Ian': 85, 'Eric': 70}

In [39]:
sData = Series(data=data,name='score')

In [40]:
sData

Charlotte    80
Ingrid       95
Ian          85
Eric         70
Name: score, dtype: int64
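As a small sketch of why labels are useful, the Series above can be rebuilt and queried by name rather than by position. The variable names ```scores_by_name``` and ```halved``` are made up for this illustration:

```python
import pandas as pd

# Same names and scores as above, rebuilt so the sketch runs on its own.
scores_by_name = pd.Series(
    {"Charlotte": 80, "Ingrid": 95, "Ian": 85, "Eric": 70}, name="score"
)

print(scores_by_name["Ingrid"])      # look up one value by its label
print(list(scores_by_name.index))    # the labels themselves

halved = scores_by_name / 2          # arithmetic is element-wise and keeps the labels
print(halved["Eric"])
```

Note that dividing the Series works just like dividing the ndarray earlier, with no explicit loop.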

So Series are a bit friendlier than numpy arrays, but they're still only one-dimensional.  Keep in mind that our basic data abstraction is a table, which can
be thought of as a two-dimensional array.  Let's go ahead and create a simple DataFrame with just one column:

In [41]:
from pandas import DataFrame


In [42]:
sData.to_frame()

| | score |
|---|---|
| Charlotte | 80 |
| Ingrid | 95 |
| Ian | 85 |
| Eric | 70 |
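A DataFrame usually has more than one column. One possible way to build a two-column table from the lists used earlier is to pass a dictionary to the ```DataFrame``` constructor; the column names ```name``` and ```score``` are chosen here only for illustration:

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

# Each dictionary key becomes a column name; each list becomes that column's values.
df = pd.DataFrame({"name": names, "score": scores})

print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))
```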


## Today's focus: filtering, slicing and dicing

We're going to use some data from the [Nutrition Facts for McDonald's Menu](https://www.kaggle.com/mcdonalds/nutrition-facts) dataset on [Kaggle](https://www.kaggle.com).

Go ahead and browse the file using JupyterLab.

Now let's load the file using ```read_csv```.

In [2]:
import pandas as pd

In [3]:
menu = pd.read_csv('data/menu.csv')

In [4]:
#menu
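If the data file is not at hand, ```read_csv``` can be tried on an in-memory string instead. The sketch below assumes nothing about the real file beyond two rows that appear in this notebook's output; ```csv_text``` and ```sample``` are illustrative names:

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file: a header line followed by two data rows.
csv_text = """Item,Calories
Egg McMuffin,300
Egg White Delight,250
"""

# read_csv accepts a file path or any file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(list(sample.columns))
```

The first line of the text is taken as the column names, and each later line becomes one row.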

## <font color="magenta"> Exercise 3 (1 point): How many rows and columns are in this dataset?  Include one cell block to determine the number and one markdown block that presents the answer as a complete sentence (i.e. "The McDonald's nutrition data set contains X rows and Y columns"). </font>

In [53]:
len(menu)


260

In [55]:
len(menu.columns)

24

In [57]:
menu.shape

(260, 24)

The McDonald's nutrition data set contains 260 rows and 24 columns.

## Extracting columns 

Getting column names is easy:

In [58]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

Similarly, extracting a specific column is also easy:

In [6]:
menu['Category']
# menu.Category

0               Breakfast
1               Breakfast
2               Breakfast
3               Breakfast
4               Breakfast
5               Breakfast
6               Breakfast
7               Breakfast
8               Breakfast
9               Breakfast
10              Breakfast
11              Breakfast
12              Breakfast
13              Breakfast
14              Breakfast
15              Breakfast
16              Breakfast
17              Breakfast
18              Breakfast
19              Breakfast
20              Breakfast
21              Breakfast
22              Breakfast
23              Breakfast
24              Breakfast
25              Breakfast
26              Breakfast
27              Breakfast
28              Breakfast
29              Breakfast
              ...        
230          Coffee & Tea
231          Coffee & Tea
232    Smoothies & Shakes
233    Smoothies & Shakes
234    Smoothies & Shakes
235    Smoothies & Shakes
236    Smoothies & Shakes
237    Smoot

Multiple columns can also be extracted by passing a list of column names:

In [7]:
menu[['Item','Calories']]

| | Item | Calories |
|---|---|---|
| 0 | Egg McMuffin | 300 |
| 1 | Egg White Delight | 250 |
| 2 | Sausage McMuffin | 370 |
| 3 | Sausage McMuffin with Egg | 450 |
| 4 | Sausage McMuffin with Egg Whites | 400 |
| 5 | Steak & Egg McMuffin | 430 |
| 6 | Bacon, Egg & Cheese Biscuit (Regular Biscuit) | 460 |
| 7 | Bacon, Egg & Cheese Biscuit (Large Biscuit) | 520 |
| 8 | Bacon, Egg & Cheese Biscuit with Egg Whites (R... | 410 |
| 9 | Bacon, Egg & Cheese Biscuit with Egg Whites (L... | 470 |
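The difference between single and double brackets is easy to miss, so here is a short sketch on a toy DataFrame (```toy``` is a made-up name; its rows are copied from the output above):

```python
import pandas as pd

toy = pd.DataFrame({
    "Item": ["Egg McMuffin", "Egg White Delight"],
    "Calories": [300, 250],
    "Category": ["Breakfast", "Breakfast"],
})

one_col = toy["Calories"]               # single brackets with a name: a Series
two_cols = toy[["Item", "Calories"]]    # a list inside the brackets: a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(toy[["Calories"]].shape)          # a one-item list still gives a DataFrame
```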


## Extracting rows

In [63]:
menu.iloc[0]

Category                              Breakfast
Item                               Egg McMuffin
Serving Size                     4.8 oz (136 g)
Calories                                    300
Calories from Fat                           120
Total Fat                                    13
Total Fat (% Daily Value)                    20
Saturated Fat                                 5
Saturated Fat (% Daily Value)                25
Trans Fat                                     0
Cholesterol                                 260
Cholesterol (% Daily Value)                  87
Sodium                                      750
Sodium (% Daily Value)                       31
Carbohydrates                                31
Carbohydrates (% Daily Value)                10
Dietary Fiber                                 4
Dietary Fiber (% Daily Value)                17
Sugars                                        3
Protein                                      17
Vitamin A (% Daily Value)               

In [None]:
menu

You'll notice that the index column is just a series of integers starting with 0.  Sometimes that's fine.  
Other times we want to assign a more useful column as the index.  Note that the values in the index do not need to be unique.

In [None]:
menu_i = menu.set_index('Item')

In [None]:
menu_i

In [None]:
# menu_i.loc[0] would raise a KeyError: the index now holds item names, not integers
menu_i.loc['Egg McMuffin']

In [None]:
menu_i.iloc[0]

We can also extract a row and a slice of its columns:

In [None]:
menu_i.iloc[0,0:2]

Or we can extract a slice of rows along with all of their columns:


In [None]:
menu_i.iloc[1:3,:]
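The cells above mix ```loc``` (label-based) and ```iloc``` (position-based) access. A minimal sketch of the difference, using a three-row toy table whose values are copied from earlier output (the variable names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Item": ["Egg McMuffin", "Egg White Delight", "Sausage McMuffin"],
    "Calories": [300, 250, 370],
}).set_index("Item")

by_label = toy.loc["Egg White Delight", "Calories"]   # look up by row and column labels
by_position = toy.iloc[1, 0]                          # second row, first column

# Label slices include the end label; positional slices exclude the end position.
label_slice = toy.loc["Egg McMuffin":"Egg White Delight"]
position_slice = toy.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

The two slices look similar but return two rows and one row respectively, which is a common source of off-by-one surprises.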

## Sorting
Sorting is supported using ```sort_index``` and ```sort_values```:


In [65]:
menu_sorted_by_cals = menu.sort_values('Calories',ascending=True)

In [67]:
#menu_sorted_by_cals
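As a sketch of how the two sorting methods differ, the toy example below (illustrative names, values copied from earlier output) sorts once by a column and once by the index:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Calories": [300, 250, 370]},
    index=["Egg McMuffin", "Egg White Delight", "Sausage McMuffin"],
)

by_calories = toy.sort_values("Calories", ascending=False)   # order rows by a column
by_name = toy.sort_index(ascending=False)                    # order rows by index labels

print(list(by_calories.index))
print(list(by_name.index))
print(list(toy.index))   # unchanged: both methods return a new DataFrame by default
```

Because neither call modifies ```toy```, the result has to be assigned to a variable, as the notebook does with ```menu_sorted_by_cals```.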

## <font color="magenta"> Exercise 4 (2 points): Display the four menu items that have the most Saturated Fat (the absolute amount, not the % Daily Value).</font>

In [68]:
#menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

In [79]:
satFat = menu.sort_values('Saturated Fat', ascending=False)

In [80]:
satFat.iloc[:4]

| | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | Coffee & Tea | Frappé Chocolate Chip (Large) | 22 fl oz cup | 760 | 280 | 31.0 | 48 | 20.0 | 101 | 1.5 | ... | 111 | 37 | 1 | 5 | 99 | 12 | 20 | 0 | 35 | 6 |
| 82 | Chicken & Fish | Chicken McNuggets (40 piece) | 22.8 oz (646 g) | 1880 | 1060 | 118.0 | 182 | 20.0 | 101 | 1.0 | ... | 118 | 39 | 6 | 24 | 1 | 87 | 0 | 15 | 8 | 25 |
| 32 | Breakfast | Big Breakfast with Hotcakes (Large Biscuit) | 15.3 oz (434 g) | 1150 | 540 | 60.0 | 93 | 20.0 | 100 | 0.0 | ... | 116 | 39 | 7 | 28 | 17 | 36 | 15 | 2 | 30 | 40 |
| 253 | Smoothies & Shakes | McFlurry with M&M’s Candies (Medium) | 16.2 oz (460 g) | 930 | 290 | 33.0 | 50 | 20.0 | 102 | 1.0 | ... | 139 | 46 | 2 | 7 | 128 | 20 | 25 | 0 | 70 | 10 |


## Filtering

More often than extracting a particular row, we want to extract one or more rows that match
some criteria.  For example, to find all the menu items that contain Trans Fats, we could use:


In [91]:
#menu['Trans Fat' ] > 0

In [85]:
menu_trans_fats = menu[menu['Trans Fat'] > 0.0]

We're going to spend time in class explaining what just happened there.

In [86]:
menu['Trans Fat']

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      1.0
6      0.0
7      0.0
8      0.0
9      0.0
10     0.0
11     0.0
12     0.0
13     0.0
14     0.0
15     0.0
16     0.0
17     0.0
18     1.0
19     0.0
20     0.0
21     0.0
22     0.0
23     0.0
24     0.5
25     0.5
26     1.5
27     0.0
28     0.0
29     0.0
      ... 
230    1.0
231    1.5
232    0.0
233    0.0
234    0.0
235    0.0
236    0.0
237    0.0
238    0.0
239    0.0
240    0.0
241    1.0
242    1.0
243    1.0
244    1.0
245    1.0
246    1.0
247    1.0
248    1.0
249    1.0
250    1.0
251    1.0
252    0.5
253    1.0
254    0.0
255    0.5
256    1.0
257    0.0
258    1.0
259    0.0
Name: Trans Fat, Length: 260, dtype: float64

In [87]:
menu['Trans Fat'] > 0.0

0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18      True
19     False
20     False
21     False
22     False
23     False
24      True
25      True
26      True
27     False
28     False
29     False
       ...  
230     True
231     True
232    False
233    False
234    False
235    False
236    False
237    False
238    False
239    False
240    False
241     True
242     True
243     True
244     True
245     True
246     True
247     True
248     True
249     True
250     True
251     True
252     True
253     True
254    False
255     True
256     True
257    False
258     True
259    False
Name: Trans Fat, Length: 260, dtype: bool

In [89]:
#menu[menu['Trans Fat'] > 0.0]
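The filtering step can be broken into two parts: build a boolean mask, then use it to select rows. Here is a minimal sketch on a toy table; the item names and Trans Fat values are hypothetical, not taken from the menu data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Item": ["A", "B", "C", "D"],           # hypothetical item names
    "Trans Fat": [0.0, 1.0, 0.0, 0.5],      # hypothetical values
})

mask = toy["Trans Fat"] > 0.0    # step 1: compare every value, giving a Series of booleans
filtered = toy[mask]             # step 2: keep only the rows where the mask is True

print(mask.tolist())
print(filtered["Item"].tolist())
print(int(mask.sum()))           # True counts as 1, so sum() counts the matching rows
```

Writing ```toy[toy["Trans Fat"] > 0.0]``` simply does both steps in one expression.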

In [90]:
menu.columns

Index(['Category', 'Item', 'Serving Size', 'Calories', 'Calories from Fat',
       'Total Fat', 'Total Fat (% Daily Value)', 'Saturated Fat',
       'Saturated Fat (% Daily Value)', 'Trans Fat', 'Cholesterol',
       'Cholesterol (% Daily Value)', 'Sodium', 'Sodium (% Daily Value)',
       'Carbohydrates', 'Carbohydrates (% Daily Value)', 'Dietary Fiber',
       'Dietary Fiber (% Daily Value)', 'Sugars', 'Protein',
       'Vitamin A (% Daily Value)', 'Vitamin C (% Daily Value)',
       'Calcium (% Daily Value)', 'Iron (% Daily Value)'],
      dtype='object')

## <font color="magenta">Exercise 5 (2 points): List the top 3 breakfast items that have the most Dietary Fiber.</font>

In [94]:
bfast = menu[menu['Category'] == 'Breakfast']
temp = bfast.sort_values('Dietary Fiber', ascending=False)
temp.iloc[:3]

| | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | Breakfast | Big Breakfast with Hotcakes and Egg Whites (La... | 15.4 oz (437 g) | 1050 | 450 | 50.0 | 77 | 16.0 | 81 | 0.0 | ... | 115 | 38 | 7 | 28 | 18 | 35 | 4 | 2 | 25 | 30 |
| 32 | Breakfast | Big Breakfast with Hotcakes (Large Biscuit) | 15.3 oz (434 g) | 1150 | 540 | 60.0 | 93 | 20.0 | 100 | 0.0 | ... | 116 | 39 | 7 | 28 | 17 | 36 | 15 | 2 | 30 | 40 |
| 33 | Breakfast | Big Breakfast with Hotcakes and Egg Whites (Re... | 14.9 oz (423 g) | 990 | 410 | 46.0 | 70 | 16.0 | 78 | 0.0 | ... | 110 | 37 | 6 | 23 | 17 | 35 | 0 | 2 | 25 | 30 |


## <font color="magenta">Exercise 6 (3 points): Show up to three of the best choices for someone who is following the "Atkins Diet" (Google it).  Justify your choices in a markdown block that follows your code.</font>

In [6]:
temp = menu.sort_values('Carbohydrates', ascending=True)
temp.iloc[:3]

| | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136 | Beverages | Dasani Water Bottle | 16.9 fl oz | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 145 | Coffee & Tea | Coffee (Small) | 12 fl oz cup | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 140 | Coffee & Tea | Iced Tea (Child) | 12 fl oz cup | 0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [None]:
# Exclude drinks, then sort ascending so the lowest-carbohydrate items come first
menu[(menu['Category'] != 'Beverages') & (menu['Category'] != 'Coffee & Tea')].sort_values(['Carbohydrates'], ascending = True)
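The filter above combines two conditions with ```&```. The sketch below shows that pattern on a toy table (values copied from outputs in this notebook) next to one possible alternative using ```isin```; the names ```toy```, ```drinks```, ```with_and``` and ```with_isin``` are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Breakfast", "Beverages", "Coffee & Tea", "Chicken & Fish"],
    "Carbohydrates": [31, 0, 0, 118],
})
drinks = ["Beverages", "Coffee & Tea"]

# Each comparison needs its own parentheses because & binds more tightly than != or ==.
with_and = toy[(toy["Category"] != "Beverages") & (toy["Category"] != "Coffee & Tea")]

# isin builds a mask from a list of values; ~ flips True and False.
with_isin = toy[~toy["Category"].isin(drinks)]

print(with_and.equals(with_isin))
print(with_isin.sort_values("Carbohydrates")["Category"].tolist())
```

The ```isin``` form tends to scale better when more than two categories need to be excluded.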

# <font color="magenta">END OF NOTEBOOK</font>
## Remember to submit this file in HTML and IPYNB formats via Canvas.