## Goals

- For everybody to be able to follow along and learn something
- Use vanilla Python, importing standard-library modules only when needed

## Disclaimers
- Not necessarily the best way or even the right way

# Part 1: Prepare Our Data

### Load the CSV into our project

- name: Name of cereal
- mfr: Manufacturer of cereal
    - A = American Home Food Products
    - G = General Mills
    - K = Kelloggs
    - N = Nabisco
    - P = Post
    - Q = Quaker Oats
    - R = Ralston Purina
- type: cold vs hot
- calories: calories per serving
- protein: grams of protein
- fat: grams of fat
- sodium: milligrams of sodium
- fiber: grams of dietary fiber
- carbo: grams of complex carbohydrates
- sugars: grams of sugars
- potass: milligrams of potassium
- vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended serving
- shelf: display shelf (1, 2, or 3, counting from the floor)
- weight: weight in ounces of one serving
- cups: number of cups in one serving
- rating: a rating of the cereals (Possibly from Consumer Reports?)

Source: https://www.kaggle.com/crawford/80-cereals


In [122]:
import csv

with open('cereal.csv') as f:
    data = list(csv.reader(f))
    
print(data)

[['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating'], ['100% Bran', 'N', 'C', '70', '4', '1', '130', '10', '5', '6', '280', '25', '3', '1', '0.33', '68.402973'], ['100% Natural Bran', 'Q', 'C', '120', '3', '5', '15', '2', '8', '8', '135', '0', '3', '1', '1', '33.983679'], ['All-Bran', 'K', 'C', '70', '4', '1', '260', '9', '7', '5', '320', '25', '3', '1', '0.33', '59.425505'], ['All-Bran with Extra Fiber', 'K', 'C', '50', '4', '0', '140', '14', '8', '0', '330', '25', '3', '1', '0.5', '93.704912']]
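Note that `csv.reader` returns every cell as a string, including the numeric columns, which is why the values are converted with `float` further down. A minimal sketch of this, using a two-line in-memory sample copied from the output above instead of the real file (the `sample` string and the use of `io.StringIO` are assumptions made for illustration):

```python
import csv
import io

# Two lines copied from the output above; this stands in for cereal.csv.
sample = (
    "name,mfr,type,calories,protein\n"
    "100% Bran,N,C,70,4\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header, first = rows[0], rows[1]

print(first[3], type(first[3]))  # the calories cell is the string '70'
print(float(first[3]) + 1)       # convert before doing arithmetic
```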


In [208]:
print('Number of Rows: ', len(data))
print('Number of Cols: ', len(data[0]))

Number of Rows:  78
Number of Cols:  16


In [251]:
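def peak_in(d, rows=6):
    """Print the first `rows` rows of d, with each cell cut or padded to 7 characters."""
    for row in d[0:rows]:
        for col_num, col in enumerate(row):
            if col_num <= 15:  # show at most 16 columns per row
                cell = str(col)[:6]
                print(cell, end=' ' * (7 - len(cell)))
            else:
                print(' ... ')
                break
        print('\n')

    # Use d, not the global data, so the count is right for any table passed in.
    print('** Total Number of columns: ', str(len(d[0])))
    print('** Displaying rows 1 - ', str(rows), ' of ', str(len(d)), '\n')

peak_in(data)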

name   mfr    type   calori protei fat    sodium fiber  carbo  sugars potass vitami shelf  weight cups   rating 

100% B N      C      70     4      1      130    10     5      6      280    25     3      1      0.33   68.402 

100% N Q      C      120    3      5      15     2      8      8      135    0      3      1      1      33.983 

All-Br K      C      70     4      1      260    9      7      5      320    25     3      1      0.33   59.425 

All-Br K      C      50     4      0      140    14     8      0      330    25     3      1      0.5    93.704 

Almond R      C      110    2      2      200    1      14     8      0      25     3      1      0.75   34.384 

** Total Number of columns:  16
** Displaying rows 1 -  6  of  78 



### Randomize Our Rows

In [238]:
import random

data_randomized = data[1:]       # copy every row except the header
random.shuffle(data_randomized)  # shuffles the copy in place

peak_in(data_randomized)

Oatmea G      C      130    3      2      170    1.5    13.5   10     120    25     3      1.25   0.5    30.450 

Just R K      C      110    2      1      170    1      17     6      60     100    3      1      1      36.523 

Raisin G      C      100    3      2      140    2.5    10.5   8      140    25     3      1      0.5    39.703 

Shredd N      C      90     3      0      0      4      19     0      140    0      1      1      0.67   74.472 

Bran C R      C      90     2      1      200    4      15     6      125    25     1      1      0.67   49.120 

Shredd N      C      80     2      0      0      3      16     0      95     0      1      0.83   1      68.235 

... Displaying rows 1 - 6 of 77
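Because the shuffle is not seeded, the row order (and every output below) changes on each run. If repeatable results are wanted, one option is to shuffle with a seeded generator. This is only a sketch: the helper name `shuffled_copy`, the seed value, and the stand-in rows are all made up for illustration.

```python
import random

rows = [['a'], ['b'], ['c'], ['d'], ['e']]  # stand-in for data[1:]

def shuffled_copy(rows, seed):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global one alone
    copy = list(rows)
    rng.shuffle(copy)
    return copy

first_run = shuffled_copy(rows, seed=0)
second_run = shuffled_copy(rows, seed=0)

print(first_run == second_run)    # True: same seed, same order
print(sorted(first_run) == rows)  # True: same rows, only the order differs
```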



In [239]:
data_values_only = []

# Keep only the numeric columns, converted from strings to floats.
# Skipped: 0 = name, 1 = mfr, 2 = type, 12 = shelf.
for row in data_randomized:
    data_values_only.append(
        [float(col) for col_num, col in enumerate(row) if col_num not in {0, 1, 2, 12}]
    )
            
peak_in(data_values_only)

130.0  3.0    2.0    170.0  1.5    13.5   10.0   120.0  25.0   1.25   0.5    30.450 

110.0  2.0    1.0    170.0  1.0    17.0   6.0    60.0   100.0  1.0    1.0    36.523 

100.0  3.0    2.0    140.0  2.5    10.5   8.0    140.0  25.0   1.0    0.5    39.703 

90.0   3.0    0.0    0.0    4.0    19.0   0.0    140.0  0.0    1.0    0.67   74.472 

90.0   2.0    1.0    200.0  4.0    15.0   6.0    125.0  25.0   1.0    0.67   49.120 

80.0   2.0    0.0    0.0    3.0    16.0   0.0    95.0   0.0    0.83   1.0    68.235 

... Displaying rows 1 - 6 of 77
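The column numbers in the cell above are easy to get wrong by one. As an alternative sketch, the columns to drop can be looked up by name in the header row. The shortened `header` and single sample row below are assumptions for illustration, not the full data set.

```python
# Shortened header and one row, copied from the CSV output earlier.
header = ['name', 'mfr', 'type', 'calories', 'protein', 'shelf', 'rating']
rows = [['100% Bran', 'N', 'C', '70', '4', '3', '68.402973']]

skip = {'name', 'mfr', 'type', 'shelf'}  # text and categorical columns
keep = [i for i, col in enumerate(header) if col not in skip]

values_only = [[float(row[i]) for i in keep] for row in rows]
print(values_only)  # [[70.0, 4.0, 68.402973]]
```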



### Column names for reference:

(0) Cals | (1) Prot | (2) Fat | (3) Na | (4) Fib | (5) Carb | (6) Sug | (7) K | (8) Vit | (9) Weight | (10) Cups | (11) Score

In [240]:
data_proportional = []

# Divide each per-serving value by cups per serving (column 10) to get per-cup values.
for row in data_values_only:
    cups = row[10]
    data_proportional.append(
        [col / cups for col_num, col in enumerate(row) if col_num not in {10, 11}]
        + [row[11]]  # Score, unchanged
    )

peak_in(data_proportional)

260.0  6.0    4.0    340.0  3.0    27.0   20.0   240.0  50.0   2.5    30.450 

110.0  2.0    1.0    170.0  1.0    17.0   6.0    60.0   100.0  1.0    36.523 

200.0  6.0    4.0    280.0  5.0    21.0   16.0   280.0  50.0   2.0    39.703 

134.32 4.4776 0.0    0.0    5.9701 28.358 0.0    208.95 0.0    1.4925 74.472 

134.32 2.9850 1.4925 298.50 5.9701 22.388 8.9552 186.56 37.313 1.4925 49.120 

80.0   2.0    0.0    0.0    3.0    16.0   0.0    95.0   0.0    0.83   68.235 

... Displaying rows 1 - 6 of 77
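To make the per-cup scaling concrete, here is the first row from the outputs above worked through on its own. The row has 0.5 cups per serving, so every per-serving value doubles. The constants `CUPS` and `SCORE` are names introduced here for readability; the full rating value is taken from the notebook's `y` output.

```python
# First row of data_values_only as printed above.
row = [130.0, 3.0, 2.0, 170.0, 1.5, 13.5, 10.0, 120.0, 25.0, 1.25, 0.5, 30.450843]

CUPS, SCORE = 10, 11  # column positions, matching the reference list above
cups = row[CUPS]

per_cup = [value / cups for i, value in enumerate(row) if i not in (CUPS, SCORE)]
per_cup.append(row[SCORE])  # the rating is not a per-serving quantity

print(per_cup)
# [260.0, 6.0, 4.0, 340.0, 3.0, 27.0, 20.0, 240.0, 50.0, 2.5, 30.450843]
```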



### Column names for reference:

(0) Cals | (1) Prot | (2) Fat | (3) Na | (4) Fib | (5) Carb | (6) Sug | (7) K | (8) Vit | (9) Weight | (10) Score

### Add Y-Intercept

In [241]:
data_prepped = []

# Prepend a constant 1 to every row; it multiplies the y-intercept parameter.
for row in data_proportional:
    data_prepped.append([1] + row)
            
peak_in(data_prepped)

1      260.0  6.0    4.0    340.0  3.0    27.0   20.0   240.0  50.0   2.5    30.450 

1      110.0  2.0    1.0    170.0  1.0    17.0   6.0    60.0   100.0  1.0    36.523 

1      200.0  6.0    4.0    280.0  5.0    21.0   16.0   280.0  50.0   2.0    39.703 

1      134.32 4.4776 0.0    0.0    5.9701 28.358 0.0    208.95 0.0    1.4925 74.472 

1      134.32 2.9850 1.4925 298.50 5.9701 22.388 8.9552 186.56 37.313 1.4925 49.120 

1      80.0   2.0    0.0    0.0    3.0    16.0   0.0    95.0   0.0    0.83   68.235 

... Displaying rows 1 - 6 of 77
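The leading column of 1s lets the intercept be handled like any other parameter: a prediction becomes a single sum of parameter times value, and the first term is just the intercept times 1. The sketch below assumes a linear model, which the `theta` list in Part 2 suggests. The function name `predict` and the parameter values are made up for illustration.

```python
def predict(theta, x):
    """Linear prediction: theta[0]*x[0] + theta[1]*x[1] + ..."""
    return sum(t * v for t, v in zip(theta, x))

theta = [50.0, -0.25, 2.0]  # made-up parameters: intercept, calories, protein
x = [1, 110.0, 2.0]         # the leading 1 multiplies the intercept

print(predict(theta, x))    # 50.0*1 - 0.25*110.0 + 2.0*2.0 = 26.5
```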



### Split Our Data into Two Parts

### Column Labels for Reference

(0) Y-int | (1) Cals | (2) Prot | (3) Fat | (4) Na | (5) Fib | (6) Carb | (7) Sug | (8) K | (9) Vit | (10) Weight | (11) Score

In [242]:
num_of_training_examples = int(len(data_prepped)*0.8)

training_set   = data_prepped[:num_of_training_examples]
validation_set = data_prepped[num_of_training_examples:]

print('Number of training examples: ' + str(len(training_set)))
print('Number of validation examples: ' + str(len(validation_set)))
# Every column except the last one (Score), so the y-intercept column is counted too.
print('Number of features: ', len(validation_set[0]) - 1)

Number of training examples: 61
Number of validation examples: 16
Number of features:  11
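The 61/16 split comes from `int()` truncating 77 × 0.8 = 61.6 down to 61. Slicing off the first 80% is only a fair split because the rows were shuffled earlier. A small sketch of the same idea as a reusable function (`split_rows` is a hypothetical helper, and the list of integers stands in for the prepared rows):

```python
def split_rows(rows, train_fraction=0.8):
    """Split rows into (training, validation) at the given fraction."""
    cut = int(len(rows) * train_fraction)  # int() truncates, e.g. 61.6 -> 61
    return rows[:cut], rows[cut:]

rows = list(range(77))  # stand-in for the 77 prepared cereal rows
train, validation = split_rows(rows)

print(len(train), len(validation))  # 61 16
```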


# Part 2: Create Our Algorithm

## Initialize Our Parameters For Training Set

In [243]:
X = []
y = []

for row in training_set:
    X.append(row[:11])
    y.append(row[11])
    
m = len(X[0])  # number of parameters: 10 features plus the y-intercept
theta = [0] * m
alpha = 0.3
iterations = 200

print('X (Sample): ')
peak_in(X)
print('---\n')
print('y: ', y, '\n\n---\n')
print('m: ', m, '\n\n---\n')
print('theta: ', theta, '\n\n---\n')

X (Sample): 
1      260.0  6.0    4.0    340.0  3.0    27.0   20.0   240.0  50.0   2.5    

1      110.0  2.0    1.0    170.0  1.0    17.0   6.0    60.0   100.0  1.0    

1      200.0  6.0    4.0    280.0  5.0    21.0   16.0   280.0  50.0   2.0    

1      134.32 4.4776 0.0    0.0    5.9701 28.358 0.0    208.95 0.0    1.4925 

1      134.32 2.9850 1.4925 298.50 5.9701 22.388 8.9552 186.56 37.313 1.4925 

1      80.0   2.0    0.0    0.0    3.0    16.0   0.0    95.0   0.0    0.83   

... Displaying rows 1 - 6 of 61

---

y:  [30.450843, 36.523683, 39.7034, 74.472949, 49.120253, 68.235885, 49.787445, 22.736446, 31.072217, 31.230054, 36.176196, 19.823573, 51.592193, 72.801787, 26.734515, 36.471512, 40.917047, 29.509541, 29.924285, 36.187559, 35.782791, 38.839746, 37.038562, 63.005645, 59.642837, 28.025765, 41.998933, 46.895644, 53.371007, 53.131324, 49.511874, 41.50354, 45.811716, 28.742414, 60.756112, 37.840594, 34.139765, 41.015492, 35.252444, 40.105965, 40.69232, 22.396513, 45.863324, 3

### Transposition

In [244]:
def transpose(matrix):
    return list(map(list, zip(*matrix)))

example_matrix = [[1,2,3],
                  [4,5,6],
                  [7,8,9]]

peak_in(transpose(X))

1      1      1      1      1      1      1      1      1      1      1      1      1      1      1      1       ... 


260.0  110.0  200.0  134.32 134.32 80.0   149.25 110.0  146.66 146.66 133.33 160.0  100.0  134.32 110.0  186.66  ... 


6.0    2.0    6.0    4.4776 2.9850 2.0    4.4776 1.0    4.0    2.6666 2.6666 1.3333 3.0    4.4776 2.0    4.0     ... 


4.0    1.0    4.0    0.0    1.4925 0.0    1.4925 1.0    1.3333 1.3333 1.3333 4.0    1.0    0.0    1.0    1.3333  ... 


340.0  170.0  280.0  0.0    298.50 0.0    343.28 180.0  333.33 93.333 186.66 280.0  200.0  0.0    180.0  226.66  ... 


3.0    1.0    5.0    5.9701 5.9701 3.0    4.4776 0.0    2.0    1.3333 2.6666 0.0    3.0    4.4776 0.0    2.6666  ... 


... Displaying rows 1 - 6 of 11
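The `example_matrix` defined above is never printed, so here is what `transpose` does to it: `zip(*matrix)` pairs up the first items of every row, then the second items, and so on, which turns rows into columns.

```python
def transpose(matrix):
    # zip(*matrix) yields one tuple per column; map(list, ...) turns each into a list.
    return list(map(list, zip(*matrix)))

example_matrix = [[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]

print(transpose(example_matrix))  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```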



### Normalize All the Features

#### Refresher of statistics terms:

- **μ or mu:** The mean (i.e. average) of a list of values.  

- **σ or sigma or standard deviation:** A measure of how spread out the values in a data set are around its mean.

- **range**: Difference between the highest and lowest item in a list
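A quick worked example of the three terms, using the standard `statistics` module on a made-up list of values. Note that `statistics.stdev` is the sample standard deviation (dividing by n - 1), while `statistics.pstdev` is the population version (dividing by n).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers for illustration

mu = statistics.mean(values)                  # 5
sigma_sample = statistics.stdev(values)       # about 2.138 (divides by n - 1)
sigma_population = statistics.pstdev(values)  # 2.0 (divides by n)
value_range = max(values) - min(values)       # 9 - 2 = 7

print(mu, round(sigma_sample, 3), sigma_population, value_range)
```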

In [260]:
import statistics

features = transpose(X)  # one list per column, so each statistic is computed per feature

mu = list(map(statistics.mean, features))      # mean of each feature
sigma = list(map(statistics.stdev, features))  # sample standard deviation of each feature
X_range = [max(f) - min(f) for f in features]  # range (max - min) of each feature

print(mu)
print(sigma)
print(X_range)

# Mean normalization: (value - mean) / range for every column except the
# y-intercept (column 0), which is all 1s, has a range of 0, and is kept as-is.
X_norm = [
    [row[0]] + [(row[j] - mu[j]) / X_range[j] for j in range(1, len(row))]
    for row in X
]

[1, 143.65708529001571, 3.3868106797100874, 1.3743391833678107, 205.74974236908415, 3.022299085793091, 19.232461678772925, 9.248450300670246, 128.52144512586517, 36.23073959249564, 1.3876660896944426]
[0.0, 64.38921318265143, 2.405482356727578, 1.6319113190492316, 125.6383934056823, 4.449316472741617, 9.102905318464357, 6.159673019942178, 134.2504538969575, 28.03486829158596, 0.6235678950991048]
[0, 390.0, 11.369332421964002, 9.09090909090909, 680.0, 30.3030303030303, 69.49253731343283, 22.388059701492537, 848.4848484848485, 133.33333333333334, 3.5]
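The first entry of each list above belongs to the y-intercept column: its range is 0, so it cannot be divided by its range and has to be skipped. A small worked example, assuming mean normalization, (value - μ) / range, is the intended scaling; the 3×3 matrix is made up for illustration:

```python
import statistics

# Column 0 is the y-intercept; columns 1 and 2 are made-up features.
X = [[1, 100.0, 2.0],
     [1, 200.0, 4.0],
     [1, 300.0, 9.0]]

columns = list(zip(*X))
mu = [statistics.mean(c) for c in columns]
X_range = [max(c) - min(c) for c in columns]

# Leave column 0 alone; scale every other column to a mean of 0 and a range of 1.
X_norm = [
    [row[0]] + [(row[j] - mu[j]) / X_range[j] for j in range(1, len(row))]
    for row in X
]

for row in X_norm:
    print(row)
```

After scaling, the second column becomes -0.5, 0.0, 0.5, so features measured in very different units end up on a comparable scale.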


In [266]:
yo = list(map(len,['test','yo','sup']))
print(yo)

[4, 2, 3]
