Copyright 2021 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Vectorization and weighting: Problem solving


For this session, we'll use some data on breakfast cereals, which consists mostly of nutrition information along with some other properties.

Our goal is to predict `fiber` using the name of the cereal.

| Variable | Type | Description |
|:-------|:-------|:-------|
| name     | Nominal | Name of cereal (an ID)                                                                                                                          |
| mfr      | Nominal | Manufacturer of cereal: (A)merican Home Food Products; (G)eneral Mills; (K)elloggs; (N)abisco; (P)ost; (Q)uaker Oats; (R)alston Purina |
| type     | Nominal | (H)ot or (C)old                                                                                                                        |
| calories | Ratio   | calories per serving                                                                                                                   |
| protein  | Ratio   | grams of protein                                                                                                                       |
| fat      | Ratio   | grams of fat                                                                                                                           |
| sodium   | Ratio   | milligrams of sodium                                                                                                                   |
| fiber    | Ratio   | grams of dietary fiber                                                                                                                 |
| carbo    | Ratio   | grams of complex carbohydrates                                                                                                         |
| sugars   | Ratio   | grams of sugars                                                                                                                        |
| potass   | Ratio   | milligrams of potassium                                                                                                                |
| vitamins | Ordinal | vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended                                            |
| shelf    | Ratio   | display shelf (1, 2, or 3, counting from the floor)                                                                                    |
| weight   | Ratio   | weight in ounces of one serving                                                                                                        |
| cups     | Ratio   | number of cups in one serving                                                                                                          |
| rating   | Ratio   | a rating of the cereals (Possibly from Consumer Reports?)                                                                              |
      
<div style="text-align:center;font-size: smaller">
    <b>Source:</b> This dataset was taken from <a href="https://www.kaggle.com/crawford/80-cereals">Kaggle</a>.
</div>
<br>


## Load the data

Start by importing `pandas`.

In [30]:
import pandas as pd


And load the dataframe with `datasets/cereal.csv`, displaying it to make sure it looks right.

In [31]:
dataframe = pd.read_csv('datasets/cereal.csv')

dataframe


Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6.0,280.0,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8.0,135.0,0,3,1.0,1.00,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5.0,320.0,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0.0,330.0,25,3,1.0,0.50,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8.0,,25,3,1.0,0.75,34.384843
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Triples,G,C,110,2,1,250,0.0,21.0,3.0,60.0,25,3,1.0,0.75,39.106174
73,Trix,G,C,110,1,1,140,0.0,13.0,12.0,25.0,25,2,1.0,1.00,27.753301
74,Wheat Chex,R,C,100,3,1,230,3.0,17.0,3.0,115.0,25,1,1.0,0.67,49.787445
75,Wheaties,G,C,100,3,1,200,3.0,17.0,3.0,110.0,25,1,1.0,1.00,51.592193


## Descriptives

`describe` the dataframe.

In [3]:
dataframe.describe()


Unnamed: 0,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
count,77.0,77.0,77.0,77.0,77.0,76.0,76.0,75.0,77.0,77.0,77.0,77.0,77.0
mean,106.883117,2.545455,1.012987,159.675325,2.151948,14.802632,7.026316,98.666667,28.246753,2.207792,1.02961,0.821039,42.665705
std,19.484119,1.09479,1.006473,83.832295,2.383364,3.907326,4.378656,70.410636,22.342523,0.832524,0.150477,0.232716,14.047289
min,50.0,1.0,0.0,0.0,0.0,5.0,0.0,15.0,0.0,1.0,0.5,0.25,18.042851
25%,100.0,2.0,0.0,130.0,1.0,12.0,3.0,42.5,25.0,1.0,1.0,0.67,33.174094
50%,110.0,3.0,1.0,180.0,2.0,14.5,7.0,90.0,25.0,2.0,1.0,0.75,40.400208
75%,110.0,3.0,2.0,210.0,3.0,17.0,11.0,120.0,25.0,3.0,1.0,1.0,50.828392
max,160.0,6.0,5.0,320.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912


----------------------
**QUESTION:**

Does the range of fiber look good for regression?

**ANSWER: (click here to edit)**

*Fiber ranges from 0 to 14, but the middle half of the values (the 25th to 75th percentile) falls between 1 and 3.
So the high-fiber cereals are clearly outliers.*

----------------------

## Vectorize to get features

Import `sklearn.feature_extraction.text` as `text`.

In [9]:
import sklearn.feature_extraction.text as text


Create a `CountVectorizer` using `text`, but pass in a list containing

- freestyle `stop_words='english'`
- freestyle `min_df=2` (only include words that occur in at least two cereal names)

*Filtering out stop words and hapax legomena (words that occur only once) helps us avoid overfitting to noise.*

In [26]:
vectorizer = text.CountVectorizer(stop_words='english', min_df=2)

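To see what the two settings do, here is a minimal sketch on a toy list of names. The names and the variables `toy_names`, `toy_vectorizer`, `toy_matrix`, and `vocabulary` are made up for illustration and are not part of the cereal data.

```python
import sklearn.feature_extraction.text as text

# Hypothetical toy names, not taken from the cereal dataset
toy_names = ["Honey Bran", "Bran Flakes", "Honey and Oats"]

# Same settings as the vectorizer in this notebook
toy_vectorizer = text.CountVectorizer(stop_words='english', min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_names)

# Words that survive both filters, in column order
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_matrix.toarray())
```

Here "and" is dropped as a stop word, and "flakes" and "oats" are dropped because each appears in only one name, so only `bran` and `honey` remain as columns.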

Call `fit_transform` with the vectorizer using the `name` column of `dataframe`, and store in `matrix`.

In [27]:
matrix = vectorizer.fit_transform(dataframe['name'])


Now we're going to put the features back into the `dataframe` as columns.

First create a new dataframe, using a list containing

- with `matrix` do `todense`
- freestyle `columns = ` with `vectorizer` do `get_feature_names`

Store this in `matrix_df` and display it.

In [29]:
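# get_feature_names was removed in scikit-learn 1.2; on newer versions use get_feature_names_out instead
matrix_df = pd.DataFrame(matrix.todense(), columns=vectorizer.get_feature_names())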

matrix_df


Unnamed: 0,100,almond,apple,bran,cheerios,chex,cinnamon,corn,crisp,crispy,crunch,dates,flakes,frosted,fruit,golden,grain,grape,honey,just,muesli,nut,nutri,nuts,oat,oatmeal,puffed,quaker,raisin,raisins,rice,right,shredded,squares,total,wheat,wheaties,wheats
0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
75,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
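The step of turning the sparse matrix into a dataframe can be sketched on toy data as follows. This is only an illustration under some assumptions: the names are invented, `toarray` is used in place of `todense` (both give a dense result), and the `hasattr` check is one possible way to handle the renaming of `get_feature_names` to `get_feature_names_out` in newer scikit-learn releases.

```python
import pandas as pd
import sklearn.feature_extraction.text as text

# Hypothetical toy names, not taken from the cereal dataset
toy_names = ["Bran Flakes", "Corn Flakes"]

toy_vectorizer = text.CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_names)

# Newer scikit-learn versions provide get_feature_names_out;
# older ones only provide get_feature_names
if hasattr(toy_vectorizer, 'get_feature_names_out'):
    feature_names = list(toy_vectorizer.get_feature_names_out())
else:
    feature_names = list(toy_vectorizer.get_feature_names())

# One row per name, one column per word
toy_df = pd.DataFrame(toy_matrix.toarray(), columns=feature_names)
print(toy_df)
```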


----------------------
**QUESTION:**

Does it make sense to weight the term-document matrix? Why or why not?

**ANSWER: (click here to edit)**

*In general, the name of the cereal will only have a single instance of a given word, so it doesn't make much sense to use term frequency.
We might consider document frequency weighting for common words, but it's unclear what they would be right now.*

----------------------
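The reasoning in the answer above can be checked with a small sketch. The toy names are invented, and `TfidfVectorizer` is used here only to show what inverse document frequency weights would look like; it is not used elsewhere in this notebook.

```python
import sklearn.feature_extraction.text as text

# Hypothetical toy names, not taken from the cereal dataset
toy_names = ["Raisin Bran", "Bran Flakes", "Corn Flakes", "Oat Bran"]

counts = text.CountVectorizer().fit_transform(toy_names).toarray()

# Largest count of any word within a single name
max_count = counts.max()

# Number of names each word appears in (columns: bran, corn, flakes, oat, raisin)
doc_freq = (counts > 0).sum(axis=0)

# Inverse document frequency weights for the same vocabulary
tfidf_vectorizer = text.TfidfVectorizer()
tfidf_vectorizer.fit(toy_names)
idf = tfidf_vectorizer.idf_

print(max_count)
print(doc_freq)
print(idf)
```

Because no word repeats within a name, the largest count is 1 and term frequency carries no extra information. The inverse document frequency does differ across words: the common word `bran` gets a lower weight than the rarer `corn`.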

We have the vectorized text in a dataframe, but we still need to add it to `dataframe`.

Set `dataframe` to with `pd` do `concat` using a list containing

- a list containing `dataframe` and `matrix_df`
- freestyle `axis=1` (this appends the columns of the dataframes rather than the rows)

Display it.

In [32]:
dataframe = pd.concat([dataframe, matrix_df], axis=1)

dataframe


Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,100,almond,apple,bran,cheerios,chex,cinnamon,corn,crisp,crispy,crunch,dates,flakes,frosted,fruit,golden,grain,grape,honey,just,muesli,nut,nutri,nuts,oat,oatmeal,puffed,quaker,raisin,raisins,rice,right,shredded,squares,total,wheat,wheaties,wheats
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6.0,280.0,25,3,1.0,0.33,68.402973,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8.0,135.0,0,3,1.0,1.00,33.983679,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5.0,320.0,25,3,1.0,0.33,59.425505,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0.0,330.0,25,3,1.0,0.50,93.704912,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8.0,,25,3,1.0,0.75,34.384843,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,Triples,G,C,110,2,1,250,0.0,21.0,3.0,60.0,25,3,1.0,0.75,39.106174,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
73,Trix,G,C,110,1,1,140,0.0,13.0,12.0,25.0,25,2,1.0,1.00,27.753301,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,Wheat Chex,R,C,100,3,1,230,3.0,17.0,3.0,115.0,25,1,1.0,0.67,49.787445,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
75,Wheaties,G,C,100,3,1,200,3.0,17.0,3.0,110.0,25,1,1.0,1.00,51.592193,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
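As a small illustration of `concat` with `axis=1`, here is a sketch with two toy dataframes. The frames `left` and `right` and their values are hypothetical. Note that `concat` lines up rows by index label, which works here (and in the cell above) because both dataframes use the default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical toy frames with the same default index
left = pd.DataFrame({'name': ['Bran Flakes', 'Corn Flakes'], 'fiber': [5.0, 1.0]})
right = pd.DataFrame({'bran': [1, 0], 'flakes': [1, 1]})

# axis=1 places the columns of `right` next to the columns of `left`
combined = pd.concat([left, right], axis=1)
print(combined)
```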


## Model

Let's create a linear regression model to predict `fiber`.
You may wish to [review multiple regression](https://github.com/memphis-iis/datawhys-content-notebooks/blob/master/Multiple-linear-regression.ipynb).

Import `sklearn.linear_model` and `numpy`.

In [35]:
import sklearn.linear_model as linear_model
import numpy as np


Create a linear regression model.

In [36]:
lm = linear_model.LinearRegression()


Create a dataframe `X` from the columns `100` through `wheats`.

To select these columns, use this freestyle: `dataframe.loc[:,'100':'wheats']` (label-based slices with `.loc` include both endpoints).

In [52]:
X = dataframe.loc[:,'100':'wheats']

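A quick sketch of how a label slice with `.loc` behaves, using a hypothetical one-row dataframe `toy` whose columns mimic the layout above. Unlike positional slices, the end label is included.

```python
import pandas as pd

# Hypothetical one-row frame: two original columns followed by three word columns
toy = pd.DataFrame({'name': ['A'], 'fiber': [1.0], '100': [0], 'bran': [1], 'wheats': [0]})

# Select every row, and the columns from '100' through 'wheats' inclusive
X_toy = toy.loc[:, '100':'wheats']
print(list(X_toy.columns))
```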

Train the model to predict `fiber` using `X`.

In [53]:
lm.fit(X, dataframe['fiber'])


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Get the $r^2$.

In [54]:
lm.score(X, dataframe['fiber'])


0.6580210737680805
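For reference, `score` on a linear regression returns $r^2 = 1 - SS_{res}/SS_{tot}$. The sketch below recomputes it by hand on made-up data to show the two agree; `X_toy`, `y_toy`, and `lm_toy` are hypothetical and unrelated to the cereal data.

```python
import numpy as np
import sklearn.linear_model as linear_model

# Hypothetical indicator features and target values
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_toy = np.array([5.0, 4.0, 1.0, 2.0])

lm_toy = linear_model.LinearRegression().fit(X_toy, y_toy)
predictions = lm_toy.predict(X_toy)

# Residual sum of squares and total sum of squares
ss_res = ((y_toy - predictions) ** 2).sum()
ss_tot = ((y_toy - y_toy.mean()) ** 2).sum()

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, lm_toy.score(X_toy, y_toy))
```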

On the data it was trained on, the model explains almost 66% of the variance of `fiber` using only the name of the cereal!

Let's see what words are doing the work for us:

- for each item `i` in list as sorted zip a list containing freestyle `lm.coef_` and from `X` get `columns`
    - print `i`

In [69]:
for i in sorted(zip(lm.coef_, X.columns)):
    print(i)


(-202458672735294.22, 'just')
(-23508344698801.395, 'nuts')
(-2.528457892445477, 'oat')
(-2.0027899034589813, 'raisin')
(-1.717047306424311, 'raisins')
(-1.3679816284048034, 'golden')
(-1.0721135348111535, 'flakes')
(-1.000557552485842, 'nut')
(-0.8612322105316365, '100')
(-0.7077912870236237, 'chex')
(-0.6600999184466105, 'puffed')
(-0.6218143357579544, 'rice')
(-0.528651693397785, 'crunch')
(-0.46529513196688804, 'cinnamon')
(-0.35246773674370185, 'shredded')
(-0.25053319951339253, 'honey')
(-0.039141901232693406, 'total')
(-0.008956658643282128, 'apple')
(0.12459005337670034, 'quaker')
(0.133703802240185, 'wheats')
(0.15468490062446807, 'corn')
(0.4863786212759184, 'crisp')
(0.7096806223771343, 'wheat')
(0.8338754831446739, 'cheerios')
(0.8665749346086379, 'almond')
(0.9794710833195207, 'grain')
(1.0880246102906006, 'wheaties')
(1.107992257379842, 'dates')
(1.160425769438441, 'nutri')
(1.4139691656399291, 'frosted')
(1.7040460191039914, 'oatmeal')
(1.95817897981296, 'crispy')
(2.183

Some of these seem intuitive, like `bran`, but others less so.

Show all the rows of the dataframe where `fiber` > 3, i.e., above the third quartile (the top 25%).

In [77]:
dataframe[(dataframe['fiber'] > 3)]


Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,100,almond,apple,bran,cheerios,chex,cinnamon,corn,crisp,crispy,crunch,dates,flakes,frosted,fruit,golden,grain,grape,honey,just,muesli,nut,nutri,nuts,oat,oatmeal,puffed,quaker,raisin,raisins,rice,right,shredded,squares,total,wheat,wheaties,wheats
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6.0,280.0,25,3,1.0,0.33,68.402973,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5.0,320.0,25,3,1.0,0.33,59.425505,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0.0,330.0,25,3,1.0,0.5,93.704912,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,Bran Chex,R,C,90,2,1,200,4.0,15.0,6.0,125.0,25,1,1.0,0.67,49.120253,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,Bran Flakes,P,C,90,3,0,210,5.0,13.0,5.0,190.0,25,3,1.0,0.67,53.313813,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
19,Cracklin' Oat Bran,K,C,110,3,3,140,4.0,10.0,7.0,160.0,25,3,1.0,0.5,40.448772,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
27,Fruit & Fibre Dates; Walnuts; and Oats,P,C,120,3,2,160,5.0,12.0,10.0,200.0,25,3,1.25,0.67,40.917047,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
28,Fruitful Bran,K,C,120,3,0,240,5.0,14.0,12.0,190.0,25,3,1.33,0.67,41.015492,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
52,Post Nat. Raisin Bran,P,C,120,3,1,200,6.0,11.0,14.0,260.0,25,3,1.33,0.67,37.840594,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
58,Raisin Bran,K,C,120,3,1,210,5.0,14.0,12.0,240.0,25,2,1.33,0.75,39.259197,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


Let's look at `Total Raisin Bran`, whose name contains both `raisin` and `bran`.
We can see that `bran` has a strong positive coefficient and that `raisin` has a strong negative one.
A linear model adds up the coefficients of the words in a name, so it sometimes needs to weight words in the same name in opposite directions to get a good overall fit.
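To make the opposite weighting concrete, here is a minimal sketch with two indicator columns standing in for `bran` and `raisin`. The fiber values are invented for illustration; the point is only that the prediction for a name is the intercept plus the coefficients of the words it contains.

```python
import numpy as np
import sklearn.linear_model as linear_model

# Columns: bran, raisin (hypothetical indicator features)
X_toy = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
# Hypothetical fiber values: high for bran alone, lower when raisin is added
y_toy = np.array([9.0, 5.0, 1.0, 0.5])

lm_toy = linear_model.LinearRegression().fit(X_toy, y_toy)

# Prediction for a name containing both words, computed two ways
manual = lm_toy.intercept_ + lm_toy.coef_[0] + lm_toy.coef_[1]
predicted = lm_toy.predict(np.array([[1, 1]]))[0]

print(lm_toy.coef_)
print(manual, predicted)
```

In this toy fit the `bran` coefficient comes out positive and the `raisin` coefficient negative, so a name with both words lands between the two extremes.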

Let's look at the mystery of `right`.
Find all rows where `name` contains "Right", i.e. freestyle `dataframe['name'].str.contains("Right")`.

In [80]:
dataframe[dataframe['name'].str.contains("Right")]


Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,100,almond,apple,bran,cheerios,chex,cinnamon,corn,crisp,crispy,crunch,dates,flakes,frosted,fruit,golden,grain,grape,honey,just,muesli,nut,nutri,nuts,oat,oatmeal,puffed,quaker,raisin,raisins,rice,right,shredded,squares,total,wheat,wheaties,wheats
38,Just Right Crunchy Nuggets,K,C,110,2,1,170,1.0,17.0,6.0,60.0,100,3,1.0,1.0,36.523683,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
39,Just Right Fruit & Nut,K,C,140,3,1,170,2.0,20.0,9.0,95.0,100,3,1.3,0.75,36.471512,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


It seems these two cereals contain the words with the minimum and maximum weights. These extreme weights seem to be a collinearity issue: "Just" and "Right" occur only together in these names, so the model cannot separate the contribution of one word from the other.
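The sketch below illustrates why identical columns cause trouble, using a hypothetical feature matrix with columns standing in for `just`, `right`, and `fruit`. It does not reproduce the exact coefficients above; it only shows that very different weights for the two duplicated columns can give exactly the same predictions, so the data cannot pin down either weight on its own.

```python
import numpy as np

# Columns: just, right, fruit (hypothetical); just and right are identical
X_toy = np.array([[1, 1, 0],
                  [1, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]])

# The matrix has 3 columns but only 2 independent ones
rank = np.linalg.matrix_rank(X_toy)

# Two coefficient vectors whose just + right weights have the same sum
coef_small = np.array([0.5, 0.5, 1.0])
coef_extreme = np.array([1000.0, -999.0, 1.0])

pred_small = X_toy @ coef_small
pred_extreme = X_toy @ coef_extreme

print(rank)
print(pred_small, pred_extreme)
```

Possible remedies, which this notebook does not explore, include dropping one of the duplicated columns or using a regularized model.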