In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv("cereal.csv")

df.describe()

Unnamed: 0,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
count,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0
mean,106.883117,2.545455,1.012987,159.675325,2.151948,14.597403,6.922078,96.077922,28.246753,2.207792,1.02961,0.821039,42.665705
std,19.484119,1.09479,1.006473,83.832295,2.383364,4.278956,4.444885,71.286813,22.342523,0.832524,0.150477,0.232716,14.047289
min,50.0,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,0.5,0.25,18.042851
25%,100.0,2.0,0.0,130.0,1.0,12.0,3.0,40.0,25.0,1.0,1.0,0.67,33.174094
50%,110.0,3.0,1.0,180.0,2.0,14.0,7.0,90.0,25.0,2.0,1.0,0.75,40.400208
75%,110.0,3.0,2.0,210.0,3.0,17.0,11.0,120.0,25.0,3.0,1.0,1.0,50.828392
max,160.0,6.0,5.0,320.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912


## high level analysis

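there are 77 cereals represented here, and every numeric column has a count of 77, so nothing is recorded as null. the minimum of -1 for carbo, sugars, and potass looks like a placeholder for missing data, though.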
is there a codebook for this? yeah more or less it is the summary on kaggle: 

Fields in the dataset:

Name: Name of cereal
mfr: Manufacturer of cereal
A = American Home Food Products;
G = General Mills
K = Kelloggs
N = Nabisco
P = Post
Q = Quaker Oats
R = Ralston Purina
type:
cold
hot
calories: calories per serving
protein: grams of protein
fat: grams of fat
sodium: milligrams of sodium
fiber: grams of dietary fiber
carbo: grams of complex carbohydrates
sugars: grams of sugars
potass: milligrams of potassium
vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
shelf: display shelf (1, 2, or 3, counting from the floor)
weight: weight in ounces of one serving
cups: number of cups in one serving
rating: a rating of the cereals (Possibly from Consumer Reports?)

questions I have initially: 
- are the less healthy cereals stocked on a particular shelf relative to the others?
- do the same cereals that are low in sugar also have more vitamins? 
- what do the ratings mean? does that correlate to low sugar, high vitamin content, or something else?
- are cereals with high sugar content heavier or lighter per cup? is there any relationship?
- are serving sizes manipulated to keep the sugar levels down in cereals that we know have more sugar overall? do we even have enough information to know this? 
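
A rough sketch of how the shelf and rating questions could be checked, using a group-by and a correlation. The `sample` frame below holds only a handful of rows copied from the printed output further down in this notebook, and the variable names are illustrative; the real check would run on all 77 rows of `df`.

```python
import pandas as pd

# A few rows copied from this notebook's printed output; not the full dataset.
sample = pd.DataFrame(
    {
        "name": ["Cocoa Puffs", "Count Chocula", "Froot Loops",
                 "Mueslix Crispy Blend", "Rice Chex"],
        "shelf": [2, 2, 2, 3, 1],
        "sugars": [13, 13, 13, 13, 2],
        "rating": [22.736446, 22.396513, 32.207582, 30.313351, 41.998933],
    }
)

# Mean sugar per display shelf: does one shelf carry the sugary cereals?
sugar_by_shelf = sample.groupby("shelf")["sugars"].mean()
print(sugar_by_shelf)

# Pearson correlation between sugar and rating
# (a negative value means sugary cereals tend to rate lower).
sugar_rating_corr = sample["sugars"].corr(sample["rating"])
print(f"correlation between sugars and rating: {sugar_rating_corr:.2f}")
```

Five rows are far too few to conclude anything; swapping `sample` for `df` would give the actual answer for this dataset.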

### using what I've learned: 

- the standard deviation of calories is about 20, so a cereal that is more than 40 calories (two standard deviations) from the mean of about 107 in either direction is likely an outlier
- similarly for sugars in grams: two standard deviations is about 9 grams around a mean of about 7, so anything over 16 would be an extreme value. a value can't be 9 below 7, but values close to 0 would still be surprising.
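
The two-standard-deviation rule of thumb above can be written out as a small calculation. This is only a sketch: `two_sd_bounds` is a helper name I made up, and the means and standard deviations are copied from the `df.describe()` output above.

```python
def two_sd_bounds(mean: float, std: float) -> tuple[float, float]:
    """Return (lower, upper) bounds at mean minus/plus two standard deviations."""
    return mean - 2 * std, mean + 2 * std

# Means and standard deviations copied from the df.describe() output above.
cal_low, cal_high = two_sd_bounds(106.883117, 19.484119)
sug_low, sug_high = two_sd_bounds(6.922078, 4.444885)

print(f"calories: {cal_low:.1f} to {cal_high:.1f}")
print(f"sugars:   {sug_low:.1f} to {sug_high:.1f}")
```

By this rule, the calories minimum (50) and maximum (160) from `describe()` both fall outside the band, while the sugars maximum (15) sits just inside it. On the real frame, something like `df[(df["calories"] < cal_low) | (df["calories"] > cal_high)]` would list the flagged cereals.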



## examining variables one by one first


In [5]:
# frequency tables and distribution plots shall we? 
import thinkstats

# just look around first

df.shape


(77, 16)

In [35]:
# let's look at the counts for each sugar value in ascending order, to spot anomalous values
df["sugars"].value_counts().sort_index()

lots_of_sugar = df.query("sugars >= 10")
less_sugar = df.query("sugars < 10")

difference_in_carbs = lots_of_sugar["carbo"].mean() - less_sugar["carbo"].mean()

print("mean carbs for each")
print(f"more sugar: {lots_of_sugar['carbo'].mean()}")
print(f"less sugar: {less_sugar['carbo'].mean()}")

print(f" difference in average carb content between more sugar and less sugar groups {difference_in_carbs}")


mean carbs for each
more sugar: 12.711538461538462
less sugar: 15.558823529411764
 difference in average carb content between more sugar and less sugar groups -2.847285067873303


### ok, it appears as though the higher-sugar cereals have fewer carbs on average; maybe Cohen's d (effect size) would help here.

I think Cohen's d is useful here for the same reason it was useful in the question of whether first-born children are born later than later-born children: we have two groups, each with its own mean and spread, and the raw difference in means doesn't say how large the gap is relative to that spread.

so there is a real quantitative signal in the data. it doesn't establish causation, because there could be other explanations for why the less sugary cereals have about 3 more grams of carbs on average, but those explanations would have to come from outside the dataset. as far as this sample is concerned, the difference shows up, at least for the way we grouped the cereals (sugars >= 10 versus sugars < 10).
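
For reference, a minimal sketch of Cohen's d using a pooled standard deviation. The function name `cohen_d` is my own, and using the sample variance (`ddof=1`) with degrees-of-freedom weights is an assumption; other texts pool the variances slightly differently. The demo numbers are made up so the result is easy to check by hand.

```python
import numpy as np

def cohen_d(group1, group2) -> float:
    """Difference in means divided by the pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)

    # Sample variances (ddof=1), pooled with degrees-of-freedom weights.
    var1 = g1.var(ddof=1)
    var2 = g2.var(ddof=1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

    return float((g1.mean() - g2.mean()) / np.sqrt(pooled_var))

# Made-up values: means differ by 1 and both groups have variance 1.
demo_d = cohen_d([1, 2, 3], [2, 3, 4])
print(demo_d)  # -1.0
```

For the groups above, the call would be `cohen_d(lots_of_sugar["carbo"], less_sugar["carbo"])`; a negative result means the first group has the lower mean.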



In [None]:
df[df["sugars"] == -1]

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
57,Quaker Oatmeal,Q,H,100,5,2,0,2.7,-1.0,-1,110,0,1,1.0,0.67,50.828392


In [None]:
# the Quaker row appears to be bad data; not sure, but it seems to be. maybe I can fix it by imputing the correct amount for Quaker that I look up online, since it is just the one example
# also, this data seems inconsistent with what I found online, which draws the whole thing into suspicion
# anyway, let's change these to 1


df_cleaned = df.copy(deep=True)
df_cleaned.at[57, "sugars"] = 1
df_cleaned.at[57, "carbo"] = 1

df_cleaned.loc[57]  # label-based lookup, matching the .at[57, ...] assignments above

name        Quaker Oatmeal
mfr                      Q
type                     H
calories               100
protein                  5
fat                      2
sodium                   0
fiber                  2.7
carbo                  1.0
sugars                   1
potass                 110
vitamins                 0
shelf                    1
weight                 1.0
cups                  0.67
rating           50.828392
Name: 57, dtype: object
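
An alternative to imputing a value, sketched here on the assumption that -1 means "not recorded": convert the placeholders to NaN so pandas skips them in means and counts. The `describe()` output shows a minimum of -1 for carbo, sugars, and potass, so the same treatment would cover all three columns. The `sample` frame holds two rows copied from this notebook's output, cut down to a few columns; `nutrient_cols` and `as_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Two rows copied from this notebook's output, with only a few columns kept.
sample = pd.DataFrame(
    {
        "name": ["Quaker Oatmeal", "Rice Chex"],
        "carbo": [-1.0, 23.0],
        "sugars": [-1, 2],
        "potass": [110, 30],
    }
)

nutrient_cols = ["carbo", "sugars", "potass"]

# Assumption: -1 is a "not recorded" marker, not a real measurement.
as_missing = sample.copy()
for col in nutrient_cols:
    as_missing[col] = as_missing[col].replace(-1, np.nan)

# Count of missing values per column after the conversion.
print(as_missing[nutrient_cols].isna().sum())

# NaN is skipped by default, so the placeholder no longer drags the mean down.
print(as_missing["carbo"].mean())
```

The trade-off is that rows with NaN drop out of per-column statistics instead of contributing a guessed value.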

In [18]:
df_cleaned["fat"].value_counts()

fat
1    30
0    27
2    14
3     5
5     1
Name: count, dtype: int64

In [None]:
# taking a look at these
df_cleaned["carbo"].value_counts().sort_index()

carbo
1.0     1
5.0     1
7.0     1
8.0     2
9.0     1
10.0    2
10.5    2
11.0    5
11.5    1
12.0    7
13.0    8
13.5    1
14.0    7
15.0    8
16.0    7
17.0    6
18.0    3
19.0    1
20.0    3
21.0    7
22.0    2
23.0    1
Name: count, dtype: int64

In [None]:
# slow and deliberate comparison: cereals with 13 g of sugar versus the highest-carb cereal (23 g), showing that high-sugar cereals aren't necessarily the highest-carb ones
print(df[df["sugars"] == 13])
print(df[df["carbo"] == 23])

                    name mfr type  calories  protein  fat  sodium  fiber  \
14           Cocoa Puffs   G    C       110        1    1     180    0.0   
18         Count Chocula   G    C       110        1    1     180    0.0   
24           Froot Loops   K    C       110        2    1     125    1.0   
46  Mueslix Crispy Blend   K    C       160        3    2     150    3.0   

    carbo  sugars  potass  vitamins  shelf  weight  cups     rating  
14   12.0      13      55        25      2     1.0  1.00  22.736446  
18   12.0      13      65        25      2     1.0  1.00  22.396513  
24   11.0      13      30        25      2     1.0  1.00  32.207582  
46   17.0      13     160        25      3     1.5  0.67  30.313351  
         name mfr type  calories  protein  fat  sodium  fiber  carbo  sugars  \
61  Rice Chex   R    C       110        1    0     240    0.0   23.0       2   

    potass  vitamins  shelf  weight  cups     rating  
61      30        25      1     1.0  1.13  41.998933 

In [None]:
highest_carb = 