# Feature Representation

## Textual Categorical Features

In [1]:
import pandas as pd

In [2]:
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']

In [4]:
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})
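# Passing categories/ordered directly to astype() was removed from pandas; use CategoricalDtype
satisfaction_dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
# Values missing from the category list (here 'Mad') are encoded as -1
df.satisfaction = df.satisfaction.astype(satisfaction_dtype).cat.codes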

In [5]:
df

,satisfaction
0,-1
1,3
2,1
3,2
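
To make the encoding above concrete, here is a minimal pure-Python sketch of what the ordered category codes amount to: each label maps to its position in `ordered_satisfaction`, and any label outside that list (such as 'Mad') falls back to -1. The helper name `encode_ordinal` is hypothetical and is not part of pandas.

```python
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']

def encode_ordinal(values, ordered_categories):
    """Map each value to its rank in ordered_categories, or -1 if it is unknown."""
    rank = {label: i for i, label in enumerate(ordered_categories)}
    return [rank.get(value, -1) for value in values]

codes = encode_ordinal(['Mad', 'Happy', 'Unhappy', 'Neutral'], ordered_satisfaction)
print(codes)  # [-1, 3, 1, 2]
```

Since -1 would sort below every real category, it is probably worth checking for it before handing the codes to a model.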


In [6]:
df = pd.DataFrame({'vertebrates':[
  'Bird',
  'Bird',
  'Mammal',
  'Fish',
  'Amphibian',
  'Reptile',
  'Mammal',
]})

In [9]:
# Method 1: replace each label with an integer code (categories are sorted alphabetically)
df['vertebrates'] = df.vertebrates.astype("category").cat.codes

In [10]:
df

,vertebrates
0,1
1,1
2,3
3,2
4,0
5,4
6,3


In [11]:
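# Method 2: one-hot encode, one column per category
# (df already holds the integer codes from Method 1, hence the vertebrates_0..4 column names)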
df = pd.get_dummies(df,columns=['vertebrates'])

In [12]:
df

,vertebrates_0,vertebrates_1,vertebrates_2,vertebrates_3,vertebrates_4
0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,1.0
6,0.0,0.0,0.0,1.0,0.0
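
The columns above are named after integer codes because the frame had already been through Method 1. As a hedged sketch, applying `pd.get_dummies` to the raw labels instead should give columns named after the categories themselves. The variable name `one_hot` and the choice of `dtype=int` are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': [
    'Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal',
]})

# dtype=int requests 0/1 integers rather than the library's default dtype
one_hot = pd.get_dummies(df, columns=['vertebrates'], dtype=int)

print(list(one_hot.columns))
print(one_hot['vertebrates_Bird'].tolist())
```

Each row has exactly one column set to 1, so no artificial ordering is imposed on the categories, at the cost of one extra column per category.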


## Pure Textual Features

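Also known as the "bag-of-words" model: each document is represented by how many times each vocabulary word occurs in it, ignoring word order.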

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
corpus = [
  "Authman ran faster than Harry because he is an athlete.",
  "Authman and Harry ran faster and faster.",
]

In [15]:
bow = CountVectorizer()
X = bow.fit_transform(corpus) # Sparse Matrix

In [16]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bow.get_feature_names_out().tolist()

['an',
 'and',
 'athlete',
 'authman',
 'because',
 'faster',
 'harry',
 'he',
 'is',
 'ran',
 'than']

In [17]:
X.toarray()

array([[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 2, 0, 1, 0, 2, 1, 0, 0, 1, 0]])
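
To show where the matrix above comes from, here is a rough standard-library sketch of the same counting. It assumes the vectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters; the names `tokenize`, `vocabulary` and `matrix` are made up for this illustration.

```python
import re
from collections import Counter

corpus = [
    "Authman ran faster than Harry because he is an athlete.",
    "Authman and Harry ran faster and faster.",
]

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# One word-count table per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is every distinct word, sorted alphabetically
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(matrix)
```

A `Counter` returns 0 for words it has never seen, which is what fills the zero entries in each row.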

## Graphical Features

1. Split the image into a grid of smaller areas, and attempt feature extraction at each locality. Then return a combined array of all discovered features.
2. Use variable-length gradients and other transformations as the features, such as regions of high / low luminosity, histogram counts for horizontal and vertical black pixels, stroke and edge detection, etc.
3. Resize the picture to a fixed size, convert it to grayscale, then encode every pixel as an element in a one-dimensional feature array.
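
The third approach can be sketched roughly as follows, assuming Pillow and NumPy are installed. A randomly generated image stands in for a real picture, and the 16x16 target size is an arbitrary choice for illustration.

```python
import numpy as np
from PIL import Image

# Synthetic 40x60 RGB image (random pixels) standing in for a real picture
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8)
img = Image.fromarray(pixels)

# Resize to a fixed (width, height), then convert to single-channel grayscale
small = img.resize((16, 16)).convert('L')

# Flatten to a 1D feature vector with values scaled to the range 0-1
features = np.asarray(small, dtype=np.float64).ravel() / 255.0

print(features.shape)  # (256,)
```

Every image processed this way yields a vector of the same length, which is what most estimators expect.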

In [18]:
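# Uses Pillow (PIL) to read the image and NumPy to hold the pixels;
# scipy.misc.imread was removed in SciPy 1.2
import numpy as np
from PIL import Image

In [25]:
# Load the image as an RGB array of shape (height, width, 3)
img = np.asarray(Image.open('mqe.png').convert('RGB'))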

In [26]:
# Is the image too big? Keep every other pixel along each axis (half the width and height, a quarter of the pixels)
img = img[::2, ::2]

In [27]:
# Scale colors from (0-255) to (0-1), then reshape to an (n_pixels, 3) array: one row per pixel, one column per color channel
X = (img / 255.0).reshape(-1, 3)

In [28]:
X

array([[ 0.95686275,  0.95686275,  0.95686275],
       [ 0.95686275,  0.95686275,  0.95686275],
       [ 0.95686275,  0.95686275,  0.95686275],
       ..., 
       [ 0.95686275,  0.95686275,  0.95686275],
       [ 0.95686275,  0.95686275,  0.95686275],
       [ 0.95686275,  0.95686275,  0.95686275]])
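
To make the shapes in the cells above easier to follow, here is a small NumPy-only sketch. It uses a hypothetical 40x60 image filled with the gray value 244 (244 / 255 is about 0.9569, the value seen in the output above) rather than the original file.

```python
import numpy as np

# Hypothetical stand-in image: 40 rows, 60 columns, 3 color channels
img = np.full((40, 60, 3), 244, dtype=np.uint8)

# Keep every other row and every other column
small = img[::2, ::2]

# Scale to 0-1 and flatten the pixel grid; the channel axis is kept
X = (small / 255.0).reshape(-1, 3)

print(img.shape, small.shape, X.shape)  # (40, 60, 3) (20, 30, 3) (600, 3)
```

Here each row of `X` is one pixel, so this layout treats pixels as samples; flattening further with `ravel()` would instead give one feature vector for the whole image.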