# NYC Restaurants

I found an external document that summarizes how the NYC Health Department scores and grades restaurants [here](https://www1.nyc.gov/assets/doh/downloads/pdf/rii/how-we-score-grade.pdf).


Key points:
- The department is supposed to monitor ~24k restaurants per year
- Restaurants with a score between 0 and 13 points earn an A, those with 14 to 27 points receive a B, and those with 28 or more receive a C
- "Not graded" usually means the restaurant failed to get an A
- Inspectors assign additional points to reflect the extent of the violation. A violation’s condition level can range from 1 (least extensive) to 5 (most extensive). For example, the presence of one contaminated food item is a condition level 1 violation, generating 7 points. Four or more contaminated food items is a condition level 4 violation, resulting in 10 points. 
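
The score bands above can be written as a small lookup. This is a minimal sketch: `grade_from_score` is a hypothetical helper name, and it only encodes the three point ranges listed here, not any re-inspection or "grade pending" rules.

```python
def grade_from_score(score):
    """Map an inspection score to a letter grade (hypothetical helper)."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 13:      # 0 to 13 points
        return "A"
    if score <= 27:      # 14 to 27 points
        return "B"
    return "C"           # 28 points or more


# Check the boundaries of each band
for s in (0, 13, 14, 27, 28, 45):
    print(s, grade_from_score(s))
```

Lower scores are better, so the boundary values (13, 14, 27, 28) are the ones worth checking.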

Major violation types:
- PUBLIC HEALTH HAZARD, such as failing to keep food at the right temperature, triggers a minimum of 7 points. If the violation can’t be corrected before the inspection ends, the Health Department may close the restaurant until it’s fixed. 
- CRITICAL VIOLATION, for example, serving raw food such as a salad without properly washing it first, carries a minimum of 5 points.
- GENERAL VIOLATION, such as not properly sanitizing cooking utensils, receives at least 2 points. 

## Initial set-up

### Import the libraries and dataset

In [19]:
# libraries
%matplotlib notebook
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt

In [22]:
# import 6 dataset text files with headers
files = ['DOHMH_New_York_City_Restaurant_Inspection_Results_{}_of_6.txt'.format(i)
         for i in range(1, 7)]
frames = [pd.read_csv(f, sep=',', header=0) for f in files]

# combine into one dataframe (each file keeps its own row index, so index values repeat)
nyc = pd.concat(frames, axis=0)

### Data cleaning

In [27]:
# rename first column
nyc = nyc.rename(columns={ nyc.columns[0]: "uniqueID" })

# eliminate any easy duplicates
nyc = nyc.drop_duplicates()

# convert to datetime
nyc['INSPECTION DATE'] = pd.to_datetime(nyc['INSPECTION DATE'])
nyc['GRADE DATE'] = pd.to_datetime(nyc['GRADE DATE'])
nyc['RECORD DATE'] = pd.to_datetime(nyc['RECORD DATE'])

In [28]:
# See the first 5 entries (the default for head())
nyc.head()

# Basic summary
print("Number of rows: ", str(nyc.shape[0]))
print("Number of columns: ", str(nyc.shape[1]))
print("Column names: ", str(nyc.columns))
print("Index method: ", str(nyc.index))
print("Data types for entire dataframe: ")
nyc.info()


Number of rows:  399918
Number of columns:  19
Column names:  Index(['uniqueID', 'CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE',
       'PHONE', 'CUISINE DESCRIPTION', 'INSPECTION DATE', 'ACTION',
       'VIOLATION CODE', 'VIOLATION DESCRIPTION', 'CRITICAL FLAG', 'SCORE',
       'GRADE', 'GRADE DATE', 'RECORD DATE', 'INSPECTION TYPE'],
      dtype='object')
Index method:  Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            66643, 66644, 66645, 66646, 66647, 66648, 66649, 66650, 66651,
            66652],
           dtype='int64', length=399918)
Data types for entire dataframe: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 399918 entries, 0 to 66652
Data columns (total 19 columns):
uniqueID                 399918 non-null int64
CAMIS                    399918 non-null int64
DBA                      399559 non-null object
BORO                     399918 non-null object
BUILDING                 399809 non-

In [29]:
months = nyc.resample('BM').mean()
len(months.index)

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
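
The error above says that `resample` needs a datetime-like index, while `nyc` still has the integer index from the CSV files. A minimal sketch of one way around it follows, using a made-up toy frame in place of `nyc`. It assumes the goal is a monthly average of `SCORE`, and it uses `'MS'` (month-start bins) instead of the original `'BM'` alias.

```python
import pandas as pd

# Toy stand-in for the inspection data (values are made up for illustration)
toy = pd.DataFrame({
    "INSPECTION DATE": pd.to_datetime(
        ["2017-01-05", "2017-01-20", "2017-02-10", "2017-03-01"]),
    "SCORE": [10, 20, 12, 30],
})

# Put the dates in the index first, then aggregate one numeric column
monthly = toy.set_index("INSPECTION DATE")["SCORE"].resample("MS").mean()

print(monthly)
print(len(monthly.index))
```

Selecting the numeric column before calling `mean()` avoids trying to average text columns such as `DBA`.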

In [18]:
# number of restaurants and the number of rows recorded for each
rests = nyc.groupby('CAMIS')
rests = rests.count()
rests = rests.sort_values('uniqueID', ascending = False)
print("Number of restaurants: ", str(rests.shape[0]))
rests

Number of restaurants:  26505


CAMIS,uniqueID,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE
41683816,97,97,97,97,97,97,97,97,97,97,96,95,97,95,36,36,97,97
50001880,95,95,95,95,95,95,95,95,95,95,95,95,95,92,47,47,95,95
40965177,94,94,94,94,94,94,94,94,94,94,91,91,94,86,33,33,94,94
50033122,91,91,91,91,91,91,91,91,91,91,91,91,91,89,27,27,91,91
41459659,90,90,90,90,90,90,90,90,90,90,90,90,90,85,32,32,90,90
41510846,90,90,90,90,90,90,90,90,90,90,89,89,90,88,31,31,90,90
41289382,88,88,88,88,88,88,88,88,88,88,87,87,88,81,28,28,88,88
41630632,86,86,86,86,86,86,86,86,86,86,84,84,86,84,37,37,86,86
41528486,79,79,79,79,79,79,79,79,79,79,77,77,79,76,29,29,79,79
40392685,78,78,78,78,78,78,78,78,78,78,78,76,78,73,31,31,78,78
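
The counts above are rows per `CAMIS`. If each row records one cited violation rather than one visit (an assumption suggested by the `VIOLATION CODE` column, not something confirmed here), the number of distinct inspection dates gives a closer estimate of how often a restaurant was inspected. A sketch on made-up data:

```python
import pandas as pd

# Toy data: restaurant 1 has two visits (three rows), restaurant 2 has one visit (two rows)
toy = pd.DataFrame({
    "CAMIS": [1, 1, 1, 2, 2],
    "INSPECTION DATE": pd.to_datetime(
        ["2017-01-05", "2017-01-05", "2017-06-01", "2016-03-03", "2016-03-03"]),
    "VIOLATION CODE": ["04L", "08A", "10F", "02B", "06C"],
})

# Rows per restaurant (what count() reports)
rows_per_restaurant = toy.groupby("CAMIS").size()

# Distinct inspection dates per restaurant
visits_per_restaurant = toy.groupby("CAMIS")["INSPECTION DATE"].nunique()

print(rows_per_restaurant)
print(visits_per_restaurant)
```

This treats one date as one inspection, which would undercount if a restaurant were inspected twice on the same day.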


In [26]:
nyc['INSPECTION DATE']

0       2015-06-15
1       2014-11-25
2       2016-10-03
3       2017-05-17
4       2017-03-30
5       2015-03-03
6       2017-06-22
7       2017-06-14
8       2015-03-10
9       2015-10-06
10      2015-08-13
11      2015-10-14
12      2016-07-28
13      2017-01-19
14      2017-01-25
15      2017-08-14
16      2014-09-02
17      2014-10-27
18      2014-06-25
19      2014-12-19
20      2014-08-28
21      2016-03-21
22      2016-04-07
23      2016-05-12
24      2017-01-31
25      2017-06-16
26      2016-11-15
27      2016-09-26
28      2017-04-25
29      2015-05-20
           ...    
66623   2016-03-17
66624   2014-12-31
66625   2015-12-17
66626   2017-02-08
66627   2014-09-11
66628   2016-11-17
66629   2017-03-11
66630   2016-04-21
66631   2014-04-10
66632   2014-07-22
66633   2016-07-22
66634   2016-12-07
66635   2015-10-19
66636   2015-04-06
66637   2015-01-16
66638   2014-12-01
66639   2016-06-15
66640   2015-06-11
66641   2016-02-29
66642   2014-08-07
66643   2014-08-19
66644   2016

### What kind of inspections are being done?

In [7]:
inspections = nyc.groupby('INSPECTION TYPE')
inspections = inspections.count()
inspections = inspections.sort_values('uniqueID', ascending = False)
print("Number of inspection types: ", str(inspections.shape[0]))
print("Total number of unique inspections: ", str(nyc.shape[0]))
inspections

Number of inspection types:  34
Total number of unique inspections:  399918


INSPECTION TYPE,uniqueID,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE
Cycle Inspection / Initial Inspection,230431,230431,230431,230431,230366,230431,230431,230431,230431,230431,230431,229870,229696,230431,230431,75876,75717,230431
Cycle Inspection / Re-inspection,99436,99436,99436,99436,99423,99436,99436,99436,99436,99436,99436,99198,99091,99436,99436,97384,97377,99436
Pre-permit (Operational) / Initial Inspection,24702,24702,24702,24702,24687,24702,24702,24702,24702,24702,24702,24550,24515,24702,24702,8909,7269,24702
Pre-permit (Operational) / Re-inspection,10568,10568,10568,10568,10568,10568,10568,10568,10568,10568,10568,10505,10485,10568,10568,10232,10200,10568
Administrative Miscellaneous / Initial Inspection,7911,7911,7911,7911,7910,7911,7911,7911,7911,7911,7911,6062,6062,7911,0,3,2,7911
Smoke-Free Air Act / Initial Inspection,4107,4107,4107,4107,4106,4107,4107,4107,4107,4107,4107,3975,3847,4107,2,0,0,4107
Pre-permit (Non-operational) / Initial Inspection,3957,3957,3957,3957,3956,3957,3957,3957,3957,3957,3957,3691,3691,3957,3957,689,2,3957
Cycle Inspection / Reopening Inspection,3221,3221,3221,3221,3221,3221,3221,3221,3221,3221,3221,3070,3070,3221,3220,1839,1839,3221
Trans Fat / Initial Inspection,2910,2910,2910,2910,2909,2910,2910,2910,2910,2910,2910,2464,2464,2910,0,0,0,2910
Administrative Miscellaneous / Re-inspection,2765,2765,2765,2765,2765,2765,2765,2765,2765,2765,2765,2312,2312,2765,3,0,0,2765


There are 34 categories of inspections, but the cycle initial inspections alone make up 57.62% of all rows, and the cycle re-inspections make up an additional 24.86%. I will focus initially on just those two, rather than the more specialized inspection types based on permits, calorie postings, smoking, trans fats, etc., since I'm most interested in preventing critical food-borne illnesses. This should also help control for variation caused by nuances of the different inspection protocols that I'm not aware of.
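
The two percentages come from dividing each type's row count by the 399,918 total rows. A small sketch of that arithmetic, plus one way to get the share of every type at once with `value_counts(normalize=True)` (shown on a made-up toy column, not the real data):

```python
import pandas as pd

# Shares quoted above, recomputed from the counts in the table
initial_share = round(230431 / 399918 * 100, 2)
reinspect_share = round(99436 / 399918 * 100, 2)
print(initial_share, reinspect_share)

# Same idea for every category at once, on toy data
toy = pd.DataFrame({"INSPECTION TYPE":
    ["Cycle Inspection / Initial Inspection"] * 6
    + ["Cycle Inspection / Re-inspection"] * 3
    + ["Trans Fat / Initial Inspection"]})

shares = toy["INSPECTION TYPE"].value_counts(normalize=True).mul(100).round(2)
print(shares)
```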

In [8]:
# filtered down to just these two types of inspections
init_ins = nyc[nyc['INSPECTION TYPE'] == "Cycle Inspection / Initial Inspection"]
re_ins = nyc[nyc['INSPECTION TYPE'] == "Cycle Inspection / Re-inspection"]

In [11]:
crit_init = init_ins.groupby('CRITICAL FLAG')
crit_init = crit_init.count()
crit_init = crit_init.sort_values('uniqueID', ascending = False)
crit_init

CRITICAL FLAG,uniqueID,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE
Critical,136761,136761,136761,136761,136725,136761,136761,136761,136761,136761,136761,136761,136761,136761,35002,34897,136761,136761
Not Critical,92935,92935,92935,92935,92907,92935,92935,92935,92935,92935,92935,92935,92935,92935,40426,40373,92935,92935
Not Applicable,735,735,735,735,734,735,735,735,735,735,735,174,0,735,448,447,735,735


In [16]:
grade_init = init_ins.groupby(['GRADE', 'CRITICAL FLAG'])
grade_init = grade_init.count()
grade_init = grade_init.sort_values('uniqueID', ascending = False)
grade_init

GRADE,CRITICAL FLAG,uniqueID,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,SCORE,GRADE DATE,RECORD DATE,INSPECTION TYPE
A,Not Critical,40372,40372,40372,40372,40354,40372,40372,40372,40372,40372,40372,40372,40372,40372,40372,40372,40372
A,Critical,34894,34894,34894,34894,34880,34894,34894,34894,34894,34894,34894,34894,34894,34894,34894,34894,34894
A,Not Applicable,447,447,447,447,446,447,447,447,447,447,447,62,0,447,447,447,447
Not Yet Graded,Critical,106,106,106,106,106,106,106,106,106,106,106,106,106,106,1,106,106
Not Yet Graded,Not Critical,53,53,53,53,53,53,53,53,53,53,53,53,53,53,0,53,53
Z,Critical,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
Not Yet Graded,Not Applicable,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1
Z,Not Critical,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


In [17]:
crit_re = re_ins.groupby(['GRADE','CRITICAL FLAG'])
crit_re = crit_re.count()
crit_re = crit_re.sort_values('uniqueID', ascending = False)
crit_re

GRADE,CRITICAL FLAG,uniqueID,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,SCORE,GRADE DATE,RECORD DATE,INSPECTION TYPE
A,Critical,33360,33360,33360,33360,33355,33360,33360,33360,33360,33360,33360,33360,33360,33360,33360,33360,33360
A,Not Critical,30771,30771,30771,30771,30765,30771,30771,30771,30771,30771,30771,30771,30771,30771,30771,30771,30771
B,Critical,17362,17362,17362,17362,17360,17362,17362,17362,17362,17362,17362,17362,17362,17362,17362,17362,17362
B,Not Critical,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137,8137
C,Critical,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055,4055
C,Not Critical,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723,1723
Z,Critical,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077,1077
Z,Not Critical,552,552,552,552,552,552,552,552,552,552,552,552,552,552,552,552,552
A,Not Applicable,316,316,316,316,316,316,316,316,316,316,316,78,0,316,316,316,316
B,Not Applicable,16,16,16,16,16,16,16,16,16,16,16,16,0,16,16,16,16


Strangely, the number of 'A' grades with critical flags is similar to the number of 'A' grades without critical flags! A larger proportion of the records without critical flags do earn an 'A', but some records without critical flags still carry a 'C' grade.
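
Raw counts make this comparison hard to read because the two flag groups have different sizes. A minimal sketch of comparing proportions instead, using `pd.crosstab` with `normalize="index"` on made-up toy rows (not the real data):

```python
import pandas as pd

# Toy rows: 4 critical and 4 non-critical records with assorted grades
toy = pd.DataFrame({
    "CRITICAL FLAG": ["Critical"] * 4 + ["Not Critical"] * 4,
    "GRADE": ["A", "A", "B", "C", "A", "A", "A", "B"],
})

# Each row of the result sums to 1: the grade mix within each flag group
grade_share = pd.crosstab(toy["CRITICAL FLAG"], toy["GRADE"], normalize="index")
print(grade_share)
```

Applied to `re_ins`, the same call would give the share of A, B, C, and Z grades within each critical-flag group.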

In [13]:
print(136761/92935)
print(57242/41849)

1.4715769085920267
1.3678224091376139


In [67]:
# The most frequently ordered item in general
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
print("The most frequently ordered item was: ")
print(c.head(6))

# The most frequently ordered subchoice:
c = chipo.groupby('choice_description').sum()
c = c.sort_values(['quantity'], ascending=False)
print("\nThe most frequently ordered subchoice was: ")
print(c.head(6))

# Can summarize one column this way
total_items_orders = chipo.quantity.sum()
print("\nThe sum of the quantity column is: ", str(total_items_orders))

The most frequently ordered item was: 
                     order_id  quantity  item_price2
item_name                                           
Chicken Bowl           713926       761      7342.73
Chicken Burrito        497303       591      5575.82
Chips and Guacamole    449959       506      2201.04
Steak Burrito          328437       386      3851.43
Canned Soft Drink      304753       351       438.75
Chips                  208004       230       494.34

The most frequently ordered subchoice was: 
                                                    order_id  quantity  \
choice_description                                                       
[Diet Coke]                                           123455       159   
[Coke]                                                122752       143   
[Sprite]                                               80426        89   
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese...     43088        49   
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese...

### Lambda functions

In [43]:
# Tiny function that drops the leading '$' and the trailing space, then turns the str into a float
dollarizer = lambda x: float(x[1:-1])

print ("Item price before being dollarized: \n", str(chipo.item_price[:4]))
chipo = chipo.assign(item_price2 = chipo.item_price.apply(dollarizer))
print ("\n Item price after being dollarized: \n", str(chipo.item_price2[:4]))

Item price before being dollarized: 
 0    $2.39 
1    $3.39 
2    $3.39 
3    $2.39 
Name: item_price, dtype: object

 Item price after being dollarized: 
 0    2.39
1    3.39
2    3.39
3    2.39
Name: item_price2, dtype: float64
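
The slice `x[1:-1]` relies on every price having exactly one leading `$` and one trailing character. A sketch of a vectorized alternative with the `.str` accessor, shown on a made-up Series in the same format as `item_price`, that does not depend on the trailing space being present:

```python
import pandas as pd

# Toy prices in the same "$x.xx " format (the last one has no trailing space)
prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98"], name="item_price")

# Strip surrounding whitespace, drop the leading '$', then convert to float
as_float = prices.str.strip().str.lstrip("$").astype(float)
print(as_float)
```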


### Unique values in a column

In [53]:
# How many different teams are there?
euro_teams['Team'].nunique()

# same thing
print("Number of teams: ", str(euro_teams.Team.nunique()))

Number of teams:  16


### View only certain columns and rows

In [97]:
# filter only giving the column names
discipline = euro_teams[['Team', 'Yellow Cards', 'Red Cards', 'Goals']]
print("Selected columns only: ")
print(discipline)

# filter to certain rows
discipline_shortlist = discipline[3:7]
print("\nSubset of rows: ")
print(discipline_shortlist)


Selected columns only: 
                   Team  Yellow Cards  Red Cards  Goals
0               Croatia             9          0      4
1        Czech Republic             7          0      4
2               Denmark             4          0      4
3               England             5          0      5
4                France             6          0      3
5               Germany             4          0     10
6                Greece             9          1      5
7                 Italy            16          0      6
8           Netherlands             5          0      2
9                Poland             7          1      2
10             Portugal            12          0      6
11  Republic of Ireland             6          1      1
12               Russia             6          0      5
13                Spain            11          0     12
14               Sweden             7          0      5
15              Ukraine             5          0      2

Subset of rows: 
      

### Sort and filter tricks

In [100]:
# discipline is a shorter version of euro_teams

# Take the mean, then round it
print("The rounded mean # of yellow cards is: ", str(round(discipline['Yellow Cards'].mean())))

# Sort based on multiple values (first red cards and then by yellow cards)
print('\nThe red card - yellow card - goals sorted set is:')
print(discipline.sort_values(['Red Cards', 'Yellow Cards', 'Goals'], ascending = False))

# Filter based on number of goals scored
print('\nA subset of only the teams with a minimum 7 goals: ')
print(discipline[discipline.Goals > 6])

# Filter based on team name strings
print('\nOnly those teams that begin with the letter C')
print(discipline[discipline.Team.str.startswith('C')])
                    

The rounded mean # of yellow cards is:  7

The red card - yellow card - goals sorted set is:
                   Team  Yellow Cards  Red Cards  Goals
6                Greece             9          1      5
9                Poland             7          1      2
11  Republic of Ireland             6          1      1
7                 Italy            16          0      6
10             Portugal            12          0      6
13                Spain            11          0     12
0               Croatia             9          0      4
14               Sweden             7          0      5
1        Czech Republic             7          0      4
12               Russia             6          0      5
4                France             6          0      3
3               England             5          0      5
8           Netherlands             5          0      2
15              Ukraine             5          0      2
5               Germany             4          0     10
2          

### Accessing subsets of the data based on location with iloc and loc functions

In [125]:
# use .iloc to slice by integer position
# : means all, 4:7 means positions 4 up to (but not including) 7
print("All teams/rows, but only 3 columns in the middle:")
print(euro_teams.iloc[: , 4:7])

# use negative positions to count from the end
print("\nOnly the last 8 teams, and only the last 3 columns:")
print(euro_teams.iloc[-8:, -3:])

# .loc is another way to slice, using the labels of the columns and indexes
print("\nOnly certain teams and certain columns specified by name:")
print(euro_teams.loc[euro_teams.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']])

# Look up a cell by mixing a label-based row filter with a column name found by position
print("\nMixing it up: ")
col_name = euro_teams.columns[5]
euro_teams.loc[euro_teams.Team.isin(['Croatia']), [col_name]]

All teams/rows, but only 3 columns in the middle:
   Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)
0              51.9%            16.0%                          32
1              41.9%            12.9%                          39
2              50.0%            20.0%                          27
3              50.0%            17.2%                          40
4              37.9%             6.5%                          65
5              47.8%            15.6%                          80
6              30.7%            19.2%                          32
7              43.0%             7.5%                         110
8              25.0%             4.1%                          60
9              39.4%             5.2%                          48
10             34.3%             9.3%                          82
11             36.8%             5.2%                          28
12             22.5%            12.5%                          59
13             55.9%      

Unnamed: 0,% Goals-to-shots
0,16.0%
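
Negative bounds in `.iloc` are easy to mix up: `:-n` drops the last `n` items, while `-n:` keeps only the last `n`. A short sketch on a made-up four-row frame (not `euro_teams`):

```python
import pandas as pd

toy = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Goals": [4, 5, 3, 10],
    "Shots": [32, 40, 65, 80],
})

first_two = toy.iloc[:-2]        # everything except the last 2 rows
last_two = toy.iloc[-2:]         # only the last 2 rows
middle_cols = toy.iloc[:, 1:3]   # column positions 1 and 2; position 3 is excluded

print(first_two)
print(last_two)
print(middle_cols)
```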


### A chain of tidying to group, get means, and then sort

In [130]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drinks = pd.read_csv(url, sep = ',')

# A chain to sort values based on beer serving averages by continent
print('Here is the average number of beer servings by continent: ')
print(drinks.groupby('continent').beer_servings.mean().sort_values(ascending=False))

# A chain to get wine stats by continent
print('\nHere are the stats on wine servings by continent')
print(drinks.groupby('continent').wine_servings.describe())

# A chain to get multiple stats on a given column in one dataframe
print('\nHere are a bunch of stats aggregated about spirit servings: ')
print(drinks.groupby('continent').spirit_servings.agg(['mean', 'median', 'min', 'max']))

Here is the average number of beer servings by continent: 
continent
EU    193.777778
SA    175.083333
OC     89.687500
AF     61.471698
AS     37.045455
Name: beer_servings, dtype: float64

Here are the stats on wine servings by continent
           count        mean        std  min   25%    50%     75%    max
continent                                                               
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.00  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.00  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.00  370.0
OC          16.0   35.625000  64.555790  0.0   1.0    8.5   23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.50  221.0

Here are a bunch of stats aggregated about spirit servings: 
                 mean  median  min  max
continent                              
AF          16.339623     3.0    0  152
AS          60.840909    16.0    0  326
EU         132.555556   122.0  
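
Only five continent codes appear in the grouped output above. One likely cause, assuming the file codes North America as `NA`, is that `pd.read_csv` treats the string `NA` as a missing value by default, and `groupby` then drops those rows. A sketch with made-up rows (not the real file) showing the effect and one possible workaround, `keep_default_na=False`:

```python
import io
import pandas as pd

# Made-up rows in a similar layout; "NA" is meant as a continent code here
csv_text = (
    "country,beer_servings,continent\n"
    "Canada,240,NA\n"
    "France,127,EU\n"
    "USA,249,NA\n"
)

default = pd.read_csv(io.StringIO(csv_text))
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default.groupby("continent").beer_servings.mean())  # the NA rows are missing
print(kept.groupby("continent").beer_servings.mean())     # the NA group is included
```

`keep_default_na=False` turns off the default missing-value markers for every column, so it is only a safe choice when the file has no genuinely empty cells that need to be read as NaN.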

### Eliminate hierarchical indexing with 'unstack'

In [136]:
# Data
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
regiment = pd.DataFrame(raw_data, columns = raw_data.keys())

# Normal indexing
print('Normal way: ')
print(regiment.groupby(['regiment', 'company']).preTestScore.mean())

# Unstack the inner index level (company) into columns
print('\nUnstacked: ')
print(regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack())

# Number of observations overall
print('\n number of observations per regiment/company: ')
print(regiment.groupby(['regiment', 'company']).size())

Normal way: 
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

Unstacked: 
company      1st   2nd
regiment              
Dragoons     3.5  27.5
Nighthawks  14.0  16.5
Scouts       2.5   2.5

 number of observations per regiment/company: 
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
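
`unstack` pivots the inner index level into columns. When the goal is simply a flat table with ordinary columns, `reset_index` is another option. A small sketch on made-up data comparing the two shapes:

```python
import pandas as pd

toy = pd.DataFrame({
    "regiment": ["Dragoons", "Dragoons", "Scouts", "Scouts"],
    "company": ["1st", "2nd", "1st", "2nd"],
    "preTestScore": [3, 27, 2, 4],
})

means = toy.groupby(["regiment", "company"]).preTestScore.mean()

wide = means.unstack()      # one row per regiment, one column per company
flat = means.reset_index()  # one row per (regiment, company) pair

print(wide)
print(flat)
```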


### Random tricks that should probably be organized elsewhere

In [1]:
# To round an integer division up, here's a nice trick I discovered that takes advantage of
# the fact that True == 1 and False == 0 (it works for positive numbers)

round_typical = int(21 / 5)
print("This is typical int rounding down: ", str(round_typical))

round_up = int(21 / 5) + (21 % 5 > 0)
print("This is how you get int to round up: ", str(round_up))


This is typical int rounding down:  4
This is how you get int to round up:  5
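
For comparison, here are two other common ways to round a division up. This is a sketch, and `ceil_div` is a hypothetical helper name. Floor division on the negated numerator stays in integer arithmetic, and unlike the boolean trick it also gives the mathematical ceiling for negative inputs.

```python
import math


def ceil_div(a, b):
    """Ceiling of a / b for integers, using only integer arithmetic."""
    return -(-a // b)


print(ceil_div(21, 5))    # remainder present, so rounds up
print(ceil_div(20, 5))    # exact division, nothing to round
print(math.ceil(21 / 5))  # same idea through floating point
```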
