This notebook is based on [a notebook](https://nbviewer.jupyter.org/github/agisga/sciruby-notebooks/blob/master/Data%20Analysis/Logistic%20regression%20with%20categorical%20data.ipynb) created by [Alexej](http://github.com/agisga).

# Logistic regression with categorical data

We aim to fit a logistic regression model to the [shelter animal data](https://www.kaggle.com/c/shelter-animal-outcomes) from [Kaggle](https://www.kaggle.com/competitions) using the Ruby gems `daru` and `statsample-glm`.



In [1]:
require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
shelter_data.head(3)

Daru::DataFrame(3x10)

| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 |


In [2]:
shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
nil

We create a 0-1-valued indicator for whether the animal was adopted.

In [3]:
shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']
shelter_data.head 3

Daru::DataFrame(3x11)

| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) | OutcomeType_Adoption |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 | 0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 | 0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 | 1 |

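As a plain-Ruby illustration of what such an indicator amounts to, here is a minimal sketch that does not use `daru` at all. The `outcomes` array is a hypothetical stand-in for the `OutcomeType` column, filled with the three values shown in the rows above; `adoption_indicator` is likewise just an illustrative name.

```ruby
# Hypothetical stand-in for the OutcomeType column (first three rows above).
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption']

# 1 where the outcome is 'Adoption', 0 for every other outcome.
adoption_indicator = outcomes.map { |outcome| outcome == 'Adoption' ? 1 : 0 }

p adoption_indicator
```

The result lines up with the `OutcomeType_Adoption` column in the table above: only the third animal gets a 1.
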

## Model fit

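Now that the data is in an appropriate form, we can fit the logistic regression model with `statsample-glm`. We work with the first 300 rows and first compute the accuracy of the trivial predictor that always guesses the more frequent class, as a baseline.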

In [4]:
small = shelter_data.head 300
small.head 3
m = small['OutcomeType_Adoption'].mean
"Trivial accuracy = #{[m, 1-m].max}"

"Trivial accuracy = 0.5900000000000001"

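The baseline above is the accuracy of always predicting the more frequent class, so a value of 0.59 means that class makes up 59% of the 300 rows. The same computation can be sketched in plain Ruby; `baseline_accuracy` and the five sample labels are hypothetical and only serve the illustration.

```ruby
# Majority-class baseline: the share of the more frequent of the two classes.
def baseline_accuracy(labels)
  positive_rate = labels.sum.to_f / labels.size
  [positive_rate, 1 - positive_rate].max
end

# Hypothetical 0/1 labels: one adoption out of five animals.
sample_labels = [0, 0, 1, 0, 0]
sample_baseline = baseline_accuracy(sample_labels)

puts "Trivial accuracy = #{sample_baseline}"
```
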
In [5]:
require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+AgeuponOutcome(Weeks)'
# Including variables such as Breed and Color raises a NotRegularMatrix exception.
# See https://github.com/SciRuby/statsample-glm/issues/32
glm_adoption = Statsample::GLM.fit_model formula,
  small, :logistic
glm_adoption.coefficients :hash

{:AnimalType_Cat=>-0.798117952110872, :"AgeuponOutcome(Weeks)"=>-0.003617767489740798, :constant=>0.34728399162356716}

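To see how these coefficients turn into a prediction, recall that a logistic model computes a linear score $z$ and maps it to a probability $p = 1 / (1 + e^{-z})$. The following sketch does this by hand with the coefficient values copied from the output above. The helper `adoption_probability` and the hash keys are illustrative names, not part of `statsample-glm`. Under this model, a 52-week-old dog lands slightly above 0.5 and a cat of the same age well below it.

```ruby
# Coefficient values copied from the fitted model's output above.
coef = {
  cat: -0.798117952110872,
  age_weeks: -0.003617767489740798,
  constant: 0.34728399162356716
}

# Logistic model: linear score, then the logistic (sigmoid) function.
def adoption_probability(coef, cat:, age_weeks:)
  z = coef[:constant] + coef[:cat] * cat + coef[:age_weeks] * age_weeks
  1.0 / (1.0 + Math.exp(-z))
end

dog_52 = adoption_probability(coef, cat: 0, age_weeks: 52.0)
cat_52 = adoption_probability(coef, cat: 1, age_weeks: 52.0)

puts format('dog, 52 weeks: %.3f', dog_52)
puts format('cat, 52 weeks: %.3f', cat_52)
```
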
In [6]:
x = Statsample::GLM::Regression.new formula, small, :logistic
x.df_for_regression.head(5)

Daru::DataFrame(5x3)

| | AnimalType_Cat | AgeuponOutcome(Weeks) | OutcomeType_Adoption |
|---|---|---|---|
| 0 | 0 | 52.0 | 0 |
| 1 | 1 | 52.0 | 0 |
| 2 | 0 | 104.0 | 1 |
| 3 | 1 | 3.0 | 0 |
| 4 | 0 | 104.0 | 0 |


In [7]:
prediction = glm_adoption.predict x.df_for_regression[:'AnimalType_Cat', :'AgeuponOutcome(Weeks)']
prediction.map! { |i| i < 0.5 ? 0 : 1 }
prediction.head 5

Daru::Vector(5)

| | |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [8]:
real = x.df_for_regression['OutcomeType_Adoption']
correct = real.zip(prediction).count { |a, b| a == b }
total = real.size
"Our model accuracy: #{correct/total.to_f}"

"Our model accuracy: 0.6366666666666667"

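Note that this accuracy is measured on the same 300 rows used for fitting, and it is only modestly above the trivial accuracy of 0.59. A single accuracy figure also hides which kind of error the model makes. A minimal plain-Ruby sketch of a confusion count is given below; `confusion_counts` is a hypothetical helper, and the two arrays are the first five actual and predicted values shown in the outputs above.

```ruby
# Count each (actual, predicted) pair.
def confusion_counts(actual, predicted)
  counts = Hash.new(0)
  actual.zip(predicted).each { |a, p| counts[[a, p]] += 1 }
  counts
end

actual    = [0, 0, 1, 0, 0]  # OutcomeType_Adoption, first five rows
predicted = [1, 0, 0, 0, 0]  # thresholded predictions, first five rows

counts = confusion_counts(actual, predicted)
accuracy = (counts[[0, 0]] + counts[[1, 1]]) / actual.size.to_f

puts "true negatives:  #{counts[[0, 0]]}"
puts "false positives: #{counts[[0, 1]]}"
puts "false negatives: #{counts[[1, 0]]}"
puts "true positives:  #{counts[[1, 1]]}"
puts "accuracy:        #{accuracy}"
```
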
## Possible next steps

1. Interpret the logistic regression coefficients.
2. Fit logistic regression models with euthanasia, death, etc. as the response variable.
3. Predict adoption, euthanasia, death, etc. on test data.
4. Submit prediction results to Kaggle, and fail against random forest models.
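
As a possible starting point for step 1, the coefficients of a logistic regression can be read as log odds ratios, so exponentiating them gives odds ratios. The sketch below uses the coefficient values copied from the `In [5]` output; the variable names are illustrative. Under this fit, cats have roughly 0.45 times the adoption odds of dogs of the same age, and each additional week of age lowers the odds slightly.

```ruby
# Coefficient values copied from the In [5] output.
coefficients = {
  'AnimalType_Cat'        => -0.798117952110872,
  'AgeuponOutcome(Weeks)' => -0.003617767489740798
}

# exp(coefficient) is the multiplicative change in the odds of adoption
# for a one-unit increase in the predictor, other predictors held fixed.
odds_ratios = coefficients.transform_values { |b| Math.exp(b) }

# Odds ratio for one additional year (52 weeks) of age.
age_per_year = Math.exp(52 * coefficients['AgeuponOutcome(Weeks)'])

odds_ratios.each { |name, ratio| puts format('%-22s %.3f', name, ratio) }
puts format('%-22s %.3f', 'Age (per 52 weeks)', age_per_year)
```

These ratios describe the 300-row subset only, so they say nothing yet about uncertainty or about the full data set.
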