# Nature of Data

## Data Type

### Numeric Data
- A measurement, e.g. height, weight, or count
- Can be discrete (e.g. a count) or continuous (e.g. height)

### Categorical Data
- Represents characteristics, e.g. position, team
- Can take on numerical values, but those values have no mathematical meaning
- Ordinal data is categorical data with a meaningful order, e.g. ranking

### Time Series Data
- A collection of observations obtained through repeated measurements over time
- Time series data has an implied ordering: each observation comes before or after the others

Many algorithms assume that input data is numerical.    
For categorical data, this often means converting the categorical data into numerical data
that represents the same patterns.    
One standard way of doing this is with one-hot encoding:    
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html    


For example, in a dataset on baseball players, one feature might be "Handedness" which can take values "left" or "right". Then the data:    

Joe    
Handedness: right    
Jim    
Handedness: left       
    
Would become:      
    
Joe    
Handedness/right: 1     
Handedness/left: 0    
Jim    
Handedness/right: 0    
Handedness/left: 1    
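
A minimal sketch of the same transformation with pandas, assuming the two players are held in a small DataFrame. The frame name `players` and the `/` separator in the column names are illustrative choices made here to mirror the layout above, and `dtype=int` asks for 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical two-row dataset mirroring the Joe/Jim example above.
players = pd.DataFrame({
    "name": ["Joe", "Jim"],
    "Handedness": ["right", "left"],
})

# One new 0/1 column per distinct value of "Handedness".
# Other columns ("name") are passed through unchanged.
encoded = pd.get_dummies(players, columns=["Handedness"], prefix_sep="/", dtype=int)
print(encoded)
```

Each row has exactly one 1 among the `Handedness/...` columns, which is where the name "one-hot" comes from.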

For ordinal data, it often makes sense to simply assign the values to integers. So the following data:    

Joe    
Skill: low    
Jim    
Skill: medium    
Jane    
Skill: high    
    
Would become:    

Joe    
Skill: 0    
Jim    
Skill: 1    
Jane    
Skill: 2    
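
As a rough sketch, the ordinal mapping above can be written with an explicit dictionary. The names `skills` and `skill_order` are made up for this illustration. An explicit mapping is used because sklearn's `LabelEncoder` assigns integers in sorted (alphabetical) label order, which would give high=0, low=1, medium=2 and lose the intended ranking.

```python
import pandas as pd

# Hypothetical dataset mirroring the Joe/Jim/Jane example above.
skills = pd.DataFrame({
    "name": ["Joe", "Jim", "Jane"],
    "Skill": ["low", "medium", "high"],
})

# Spell out the order explicitly so the integers preserve the ranking.
skill_order = {"low": 0, "medium": 1, "high": 2}
skills["Skill"] = skills["Skill"].map(skill_order)
print(skills)
```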

## Encoding using sklearn

Encoding in sklearn is done with the preprocessing module, which offers a variety of options for transforming data before analysis.    
preprocessing module: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing    
We will focus on two forms of encoding for now, the LabelEncoder and the OneHotEncoder.    
LabelEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html    
OneHotEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html    


### Label Encoder

In [1]:
from sklearn import preprocessing
import pandas

In [2]:
sample_data = {'name':['Ray', 'Adam', 'Jason', 'Varun', 'Xiao'],
              'health':['fit','slim','obese','fit','slim']}
sample_data

{'health': ['fit', 'slim', 'obese', 'fit', 'slim'],
 'name': ['Ray', 'Adam', 'Jason', 'Varun', 'Xiao']}

In [3]:
data = pandas.DataFrame(sample_data)
data

  health   name
0    fit    Ray
1   slim   Adam
2  obese  Jason
3    fit  Varun
4   slim   Xiao


In [4]:
data = pandas.DataFrame(sample_data, columns = ['name','health'])
data

    name health
0    Ray    fit
1   Adam   slim
2  Jason  obese
3  Varun    fit
4   Xiao   slim


In [5]:
# we have 3 different labels that we are looking to categorize: slim, fit, obese.
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(data['health'])

LabelEncoder()

In [6]:
# once you have fit the label encoder to the column, you can transform that column to integer data.
label_encoder.transform(data['health'])

array([0, 2, 1, 0, 2])

In [7]:
# you can also fit and transform in a single step
label_encoder.fit_transform(data['health'])

array([0, 2, 1, 0, 2])
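
To see which integer stands for which label, the fitted encoder can be inspected. The sketch below rebuilds the `health` column as a plain list so that it runs on its own; `classes_` and `inverse_transform` are part of `LabelEncoder`, while the variable names `codes` and `decoded` are chosen here for illustration.

```python
from sklearn import preprocessing

# Same labels as the 'health' column above.
health = ["fit", "slim", "obese", "fit", "slim"]

label_encoder = preprocessing.LabelEncoder()
codes = label_encoder.fit_transform(health)

# classes_ holds the labels in sorted order; a label's position is its integer code.
print(label_encoder.classes_.tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = label_encoder.inverse_transform(codes).tolist()
print(decoded)
```

Because the codes follow sorted label order, 'obese' gets 1 and 'slim' gets 2, so the integers here carry no ranking.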

In [50]:
# pandas can one-hot encode the column directly
pandas.get_dummies(data['health'])

   fit  obese  slim
0    1      0     0
1    0      0     1
2    0      1     0
3    1      0     0
4    0      0     1


In [51]:
print(data)

    name health
0    Ray    fit
1   Adam   slim
2  Jason  obese
3  Varun    fit
4   Xiao   slim


In [52]:
# sklearn: one-hot encode the label-encoded column
ohe = preprocessing.OneHotEncoder() # creating OneHotEncoder object
label_encoded_data = label_encoder.fit_transform(data['health'])
label_encoded_data = ohe.fit_transform(label_encoded_data.reshape(-1,1))
print(label_encoded_data)

  (0, 0)	1.0
  (1, 2)	1.0
  (2, 1)	1.0
  (3, 0)	1.0
  (4, 2)	1.0
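
The printout above is a sparse matrix: each line gives a (row, column) position that holds a 1.0. The sketch below converts the result to a dense array, which is easier to read. It assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string columns directly, so the intermediate `LabelEncoder` step is skipped. The names `one_hot` and `dense` are illustrative.

```python
import pandas
from sklearn import preprocessing

data = pandas.DataFrame({"health": ["fit", "slim", "obese", "fit", "slim"]})

ohe = preprocessing.OneHotEncoder()
# OneHotEncoder expects 2-D input, hence the double brackets (a one-column frame).
one_hot = ohe.fit_transform(data[["health"]])

# Convert the sparse result to a regular NumPy array for inspection.
dense = one_hot.toarray()
print(ohe.categories_[0].tolist())  # column order of the one-hot matrix
print(dense)
```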


In [13]:
# Exercise
# load the titanic data
# then perform one-hot encoding on the categorical (object-typed) features
import numpy as np
import pandas as pd

In [21]:
# load the dataset
X = pd.read_csv('titanic-data.csv')

In [22]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
PassengerId    4 non-null int64
Survived       4 non-null int64
Pclass         4 non-null int64
Name           4 non-null object
Sex            4 non-null object
dtypes: int64(3), object(2)
memory usage: 240.0+ bytes


In [55]:
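# keep only the object-typed (string) columns; copy so later assignments modify an independent frame
X = X.select_dtypes(include = [object]).copy()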

In [56]:
X.head()

        Name     Sex
0     Braund    male
1    Cumings  female
2  Heikkinen  female
3   Futrelle  female


In [64]:
X.shape

(4, 2)

In [65]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [66]:
le = LabelEncoder()
for feature in X:
    X[feature] = le.fit_transform(X[feature])

In [67]:
X

   Name  Sex
0     0    1
1     1    0
2     3    0
3     2    0
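
Note that the loop above refits the single encoder `le` on every column, so after the loop it only remembers the labels of the last column. A possible variation, sketched here under the assumption that you want to decode the integers later, keeps one encoder per column in a dictionary. The dictionary name `encoders` is made up for this example, and the frame is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same four rows as X.head() above.
X = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Sex": ["male", "female", "female", "female"],
})

# One fitted encoder per column, so every column can be decoded later.
encoders = {}
for feature in X.columns:
    encoders[feature] = LabelEncoder()
    X[feature] = encoders[feature].fit_transform(X[feature])

# Recover the original names from the integer codes.
names = encoders["Name"].inverse_transform(X["Name"]).tolist()
print(names)
```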


In [68]:
#OneHotEncoder
#The input to this transformer should be a matrix of integers denoting the values taken on by categorical features.
#The output will be a sparse matrix where each column corresponds to one possible value of one
#feature.

#It is assumed that input features take on values in
#the range [0, n_values).
#This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models
#and SVMs with standard kernels.

enc = preprocessing.OneHotEncoder()

In [69]:
enc.fit(X)

OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

In [70]:
onehotlabels = enc.transform(X).toarray()

In [71]:
onehotlabels.shape

(4, 6)

In [73]:
X

   Name  Sex
0     0    1
1     1    0
2     3    0
3     2    0


In [72]:
onehotlabels

array([[ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.]])
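
The six columns of `onehotlabels` are the four distinct names followed by the two sexes, but the bare array does not say so. As a cross-check, and only as a sketch, `pandas.get_dummies` applied to the original string columns produces the same 4 x 6 layout with labelled columns. The frame is rebuilt inline here, and `dtype=int` is an assumption made to get 0/1 integers.

```python
import pandas as pd

# The two object-typed columns of the sample data, before label encoding.
X = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Sex": ["male", "female", "female", "female"],
})

# Every string column is expanded into one 0/1 column per distinct value,
# named <column>_<value> and ordered by sorted value within each column.
dummies = pd.get_dummies(X, dtype=int)
print(dummies.columns.tolist())
print(dummies.shape)
```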

### Time series data leakage


In [None]:
#When dealing with time-series data, it can be tempting to disregard the timing structure and
#simply treat it as the appropriate form of categorical or numerical data.

#One important concern arises if you are building a predictive model that forecasts future data points.
#In this case, it is important NOT to use the future as a source of information!
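
One simple way to respect this rule is to split the data chronologically rather than at random, so that every training observation precedes every test observation. The sketch below uses a made-up ten-day series; the column names `date` and `value` and the 80/20 split are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical daily series with ten observations.
series = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "value": range(10),
})

# Sort by time, then cut once: the past is for training, the future for testing.
series = series.sort_values("date")
split_point = int(len(series) * 0.8)
train = series.iloc[:split_point]
test = series.iloc[split_point:]

# No training row is later than any test row, so the model never sees the future.
print(train["date"].max(), test["date"].min())
```

A random shuffle before splitting would mix future rows into the training set, which is exactly the leakage described above. scikit-learn also provides `sklearn.model_selection.TimeSeriesSplit` for cross-validation that keeps the time order.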


In [None]:
# A hands-on example
#Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
#https://github.com/udacity/ud120-projects

# http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features