# Regression Notes


- We want to see how much of the variation in our dependent variables (the answers to our survey questions) can be explained by our demographic information (e.g. race, sex, income).
- Our dependent variables are *categorical* (NOT continuous), so simple linear regression is not appropriate and we need a different kind of regression.
- Because our *dependent variable is categorical*, we want to use *logistic regression*, not linear regression.
- In addition, our categorical data is *ordinal*, meaning the categories form an ordered scale: our survey answer choices all range from the least agreement to the most.
- So, we must use *ordinal logistic regression*. Regular (multinomial) logistic regression treats the answer choices as unordered categories and throws away the ranking. Treating the answer codes as plain numbers, as linear regression would, errs in the other direction: it assumes the codes measure the amount of agreement, e.g. that people who choose answer 2 agree 2x as much with the statement as people who choose answer 1. We can't safely make this assumption, and the way our survey choices were written does not imply it to the respondent either. Ordinal logistic regression uses only the order of the answers, which is why we want to go with this method. ("We only know that a rating of 2 is better than a rating of 1, but we don’t know by how much. Such data is essentially a ranking, i.e. ordinal data. It’s better to think of ratings as categories which are ordered from very bad to excellent." [2])
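
The "ranking, not amount" idea in the last bullet can be sketched with an ordered categorical in pandas. This is only an illustration: the five-point scale and the sample answers below are hypothetical, not the survey's actual answer choices.

```python
import pandas as pd

# Hypothetical agreement scale, listed from least to most agreement
scale = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey answers
answers = pd.Series(["agree", "strongly disagree", "neutral", "strongly agree"])

# An ordered categorical records the ranking of the labels and nothing more
dtype = pd.CategoricalDtype(categories=scale, ordered=True)
ratings = answers.astype(dtype)

# Integer codes give each answer's position on the scale (0 = lowest)
rank_codes = ratings.cat.codes.tolist()

# Order comparisons are meaningful; arithmetic such as "2x as much" is not
above_neutral = (ratings > "neutral").tolist()

print(rank_codes)
print(above_neutral)
```

The codes are positions on the scale, so they support comparisons like "higher than neutral" but say nothing about how far apart two answers are.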

## Ordinal Logistic Regression

Examples of multiple ordered response categories include bond ratings and opinion surveys with responses ranging from "strongly agree" to "strongly disagree" [3].
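
To make the model concrete, here is a minimal sketch of how a cumulative-logit (proportional odds) model turns one linear score into a probability for each ordered category. It assumes the common formulation P(y <= k) = sigmoid(theta_k - score). The function name `ordinal_probs`, the thresholds, and the score are made up for illustration and are not fitted to our survey data.

```python
import numpy as np

def ordinal_probs(score, thresholds):
    """Category probabilities under a cumulative-logit model.

    score      : linear predictor w . x for one respondent (assumed value)
    thresholds : increasing cut points theta_1 < ... < theta_{K-1}
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # P(y <= k) = sigmoid(theta_k - score) for each cut point
    cumulative = 1.0 / (1.0 + np.exp(-(thresholds - score)))
    # Pad with 0 and 1 so consecutive differences give P(y = k)
    cumulative = np.concatenate(([0.0], cumulative, [1.0]))
    return np.diff(cumulative)

# Three ordered categories need two cut points
probs = ordinal_probs(score=0.0, thresholds=[-1.0, 1.0])
print(probs)
```

Raising the score shifts probability mass toward the higher categories while the cut points stay fixed, which is how a single set of coefficients can describe every step of the scale.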

# Sources

[1] https://statisticsbyjim.com/regression/choosing-regression-analysis/

[2] https://rikunert.com/ordinal_rating

[3] https://en.wikipedia.org/wiki/Ordered_logit

In [13]:
import mord
from os.path import dirname, join

import numpy as np
# Bunch lives in sklearn.utils; the sklearn.datasets.base module path is deprecated
from sklearn.utils import Bunch

In [14]:

#from mord.datasets.base import load_housing
from sklearn import linear_model, metrics, preprocessing

In [24]:
data.target

array([1, 1, 1, ..., 0, 0, 0])

In [19]:


data = load_housing()
# Work on a copy so the encodings below do not write into a slice of the raw frame
features = data.data.copy()

# Encode the target labels as integers.
# NOTE: LabelEncoder assigns codes in alphabetical order, so if the target holds
# labels such as Low/Medium/High the codes will not follow that ranking.
le = preprocessing.LabelEncoder()
le.fit(data.target)
data.target = le.transform(data.target)

# Map the ordered levels of Infl and Cont to the integers 1..3
levels = {'Low': 1, 'Medium': 2, 'High': 3}
features['Infl'] = features['Infl'].map(levels)
features['Cont'] = features['Cont'].map(levels)

# Encode the housing type as integer codes
le = preprocessing.LabelEncoder()
features['type_encoded'] = le.fit_transform(features['Type'])

# Select columns with a list; a tuple is read as a single (MultiIndex) key
X, y = features[['Infl', 'Cont', 'type_encoded']], data.target

# Baseline that ignores the ordering of the classes.
# In recent scikit-learn versions lbfgs fits a multinomial model by default.
clf1 = linear_model.LogisticRegression(solver='lbfgs')
clf1.fit(X, y)

print('Mean Absolute Error of LogisticRegression: %s' %
      metrics.mean_absolute_error(y, clf1.predict(X)))

# Ordinal models from mord: all-threshold variant
clf2 = mord.LogisticAT(alpha=1.)
clf2.fit(X, y)
print('Mean Absolute Error of LogisticAT %s' %
      metrics.mean_absolute_error(y, clf2.predict(X)))

# Immediate-threshold variant
clf3 = mord.LogisticIT(alpha=1.)
clf3.fit(X, y)
print('Mean Absolute Error of LogisticIT %s' %
      metrics.mean_absolute_error(y, clf3.predict(X)))

# Squared-error variant
clf4 = mord.LogisticSE(alpha=1.)
clf4.fit(X, y)
print('Mean Absolute Error of LogisticSE %s' %
      metrics.mean_absolute_error(y, clf4.predict(X)))


Mean Absolute Error of LogisticRegression: 0.6805472932778108
Mean Absolute Error of LogisticAT 0.6287923854848304
Mean Absolute Error of LogisticIT 0.7441998810232004
Mean Absolute Error of LogisticSE 0.6627007733491969
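
One caveat about the target encoding used in the cell above: `LabelEncoder` sorts labels alphabetically, which does not preserve an ordinal ranking. The sketch below shows the difference on a small hypothetical series; it assumes the satisfaction labels are `Low`, `Medium`, and `High`, which the notes do not state explicitly, so treat the label names and the `order` mapping as assumptions.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical target values (assumed labels, not read from testdata.csv)
sat = pd.Series(["Low", "Medium", "High", "Low"])

# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Medium=2
label_codes = preprocessing.LabelEncoder().fit_transform(sat).tolist()

# An explicit mapping keeps the intended ranking: Low < Medium < High
order = {"Low": 0, "Medium": 1, "High": 2}
ordinal_codes = sat.map(order).tolist()

print(label_codes)
print(ordinal_codes)
```

Ordinal models and the mean absolute error both depend on the integer order of the classes, so it may be worth re-running the comparison with an explicit mapping like this one. The error values printed above come from the alphabetical encoding.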




In [18]:
def load_housing():
    """Load and return the Copenhagen housing survey dataset
       (ordinal classification).
    ==============     ==============
    Samples total                1681
    Dimensionality                  3
    Features              categorical
    Targets       ordered categorical
    ==============     ==============
    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are:
        'data', the data to learn, 'target', the classification targets,
        and 'DESCR', the full description of the dataset.
    Examples
    --------
    >>> housing = load_housing()
    >>> print(housing.data.shape)
    (1681, 3)
    """
    # The import has to come after the docstring; otherwise Python does not
    # treat the string above as the function's docstring.
    from pandas import read_csv

    # Only used by the commented-out description loader below
    module_path = dirname('testdata.csv')

    #fdescr_name = join(module_path, 'descr', 'copenhagen_housing_survey.rst')
    #with open(fdescr_name) as f:
    #    descr_text = f.read()

    #data_file_name = join(module_path, 'data', 'copenhagen_housing_survey.csv')
    data = read_csv('../testdata.csv')

    # The original data set is formatted as a frequency table, but it is more
    # convenient to have one row per observation. Repeat each row position
    # 'Freq' times and select the rows by position.
    index = np.arange(data.shape[0]).repeat(data['Freq'].values)
    # .ix is deprecated; .iloc does the positional selection
    data = data.iloc[index, :]
    # A list, not a tuple: .loc reads a tuple as a single (MultiIndex) key
    features = ['Infl', 'Type', 'Cont']

    # DESCR is a placeholder string because the description file is not loaded
    return Bunch(data=data.loc[:, features],
                 target=data.loc[:, 'Sat'],
                 feature_names=features,
                 DESCR="descr_text")
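
The frequency-table expansion inside `load_housing` can be checked on a toy example. This is a sketch with a made-up two-row table, not the real survey file. Only the `Freq` column name is taken from the function above, and the other values are illustrative.

```python
import pandas as pd

# Made-up frequency table: each row is a combination plus how often it occurred
freq_table = pd.DataFrame({
    "Sat": ["Low", "High"],
    "Infl": ["Low", "Medium"],
    "Freq": [2, 3],
})

# Repeat each row label 'Freq' times, select those rows, then tidy up
expanded = (
    freq_table.loc[freq_table.index.repeat(freq_table["Freq"])]
    .drop(columns="Freq")
    .reset_index(drop=True)
)

print(expanded)
```

After the expansion the number of rows equals the sum of `Freq`, which is why the housing data ends up with one row per respondent.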