# Data Science Introduction

Summer 2021 

## Outline 

1. Section 1: Introduction
    - Overall problem
    - Types of data and how to handle them
    - Types of algorithms 
    - Types of error 
    - Dealing with error

### Overall Problem

- Given some data, make some predictions
- Two main types: supervised learning (labels) and unsupervised learning (no labels)
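
The difference shows up directly in how a model is fit. A minimal sketch, assuming synthetic two-blob data from `make_blobs` and two illustrative model choices (`LogisticRegression` and `KMeans`, neither prescribed by these notes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): two well-separated blobs
X_demo, y_demo = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                            cluster_std=1.0, random_state=0)

# Supervised: the labels y_demo are passed to fit()
clf = LogisticRegression().fit(X_demo, y_demo)
train_accuracy = clf.score(X_demo, y_demo)

# Unsupervised: only the features are passed; the model finds its own groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cluster_ids = km.labels_

print(train_accuracy)
print(cluster_ids[:10])
```

Note that the cluster ids from `KMeans` are arbitrary: cluster 0 need not correspond to label 0.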

### Data
- Possible data types: numeric, categorical, text, image, etc.
- Numeric data: you should almost always scale it
- Categorical data: use one-hot encoding
- Text: much more open-ended, and best practices are still changing
- Images: generally represented by their pixel values

### Numeric Data

- There are many scaling methods; a common one is the standard scaler (z score), which subtracts each feature's mean and divides by its standard deviation. Other options include the min-max scaler and bucketing.
- What matters most is that the values of each feature are comparable, i.e. avoid having one feature in the millions and another on a z-score scale.
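
A small sketch of the three options side by side, assuming a made-up toy matrix (`X_toy`) and using `KBinsDiscretizer` as one possible way to do bucketing:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

# Toy data (hypothetical): one feature in the millions, one small
X_toy = np.array([[1_000_000.0, 0.1],
                  [2_000_000.0, 0.2],
                  [3_000_000.0, 0.3],
                  [4_000_000.0, 0.4]])

# z score by hand: subtract the column mean, divide by the column std
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_toy)

# min-max: rescale each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_toy)

# bucketing: replace each value with the index of its equal-width bin
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X_toy)

print(z_sklearn)
print(X_minmax)
print(X_binned)
```

After any of these transforms the two columns are on the same scale, even though the raw values differ by seven orders of magnitude.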

In [17]:
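# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run as written
from sklearn.datasets import load_boston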
from sklearn.preprocessing import StandardScaler
import pandas as pd

X, y = load_boston(return_X_y=True)
X = pd.DataFrame(X)
# features
X.head()

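| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |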


In [24]:
X.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


In [18]:
# labels (AKA targets) - don't need to scale
print(y[0:10])

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9]


In [21]:
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X))

# this is the data you would use in your model
X_scaled

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.419782 | 0.284830 | -1.287909 | -0.272599 | -0.144217 | 0.413672 | -0.120013 | 0.140214 | -0.982843 | -0.666608 | -1.459000 | 0.441052 | -1.075562 |
| 1 | -0.417339 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 0.194274 | 0.367166 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.441052 | -0.492439 |
| 2 | -0.417342 | -0.487722 | -0.593381 | -0.272599 | -0.740262 | 1.282714 | -0.265812 | 0.557160 | -0.867883 | -0.987329 | -0.303094 | 0.396427 | -1.208727 |
| 3 | -0.416750 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.016303 | -0.809889 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.416163 | -1.361517 |
| 4 | -0.412482 | -0.487722 | -1.306878 | -0.272599 | -0.835284 | 1.228577 | -0.511180 | 1.077737 | -0.752922 | -1.106115 | 0.113032 | 0.441052 | -1.026501 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | -0.413229 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | 0.439316 | 0.018673 | -0.625796 | -0.982843 | -0.803212 | 1.176466 | 0.387217 | -0.418147 |
| 502 | -0.415249 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | -0.234548 | 0.288933 | -0.716639 | -0.982843 | -0.803212 | 1.176466 | 0.441052 | -0.500850 |
| 503 | -0.413447 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | 0.984960 | 0.797449 | -0.773684 | -0.982843 | -0.803212 | 1.176466 | 0.441052 | -0.983048 |
| 504 | -0.407764 | -0.487722 | 0.115738 | -0.272599 | 0.158124 | 0.725672 | 0.736996 | -0.668437 | -0.982843 | -0.803212 | 1.176466 | 0.403225 | -0.865302 |


In [25]:
X_scaled.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | -8.787437e-17 | -6.343191e-16 | -2.682911e-15 | 4.701992e-16 | 2.490322e-15 | -1.14523e-14 | -1.407855e-15 | 9.210902e-16 | 5.441409e-16 | -8.868619e-16 | -9.205636e-15 | 8.163101e-15 | -3.370163e-16 |
| std | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 | 1.00099 |
| min | -0.4197819 | -0.4877224 | -1.557842 | -0.2725986 | -1.465882 | -3.880249 | -2.335437 | -1.267069 | -0.9828429 | -1.31399 | -2.707379 | -3.907193 | -1.531127 |
| 25% | -0.4109696 | -0.4877224 | -0.8676906 | -0.2725986 | -0.9130288 | -0.5686303 | -0.837448 | -0.8056878 | -0.6379618 | -0.767576 | -0.4880391 | 0.2050715 | -0.79942 |
| 50% | -0.3906665 | -0.4877224 | -0.2110985 | -0.2725986 | -0.1442174 | -0.1084655 | 0.3173816 | -0.2793234 | -0.5230014 | -0.4646726 | 0.274859 | 0.3811865 | -0.1812536 |
| 75% | 0.00739656 | 0.04877224 | 1.015999 | -0.2725986 | 0.598679 | 0.4827678 | 0.9067981 | 0.6623709 | 1.661245 | 1.530926 | 0.8065758 | 0.433651 | 0.6030188 |
| max | 9.933931 | 3.804234 | 2.422565 | 3.668398 | 2.732346 | 3.555044 | 1.117494 | 3.960518 | 1.661245 | 1.798194 | 1.638828 | 0.4410519 | 3.548771 |


### Categorical data

One-hot encoding is the most standard method: it turns each category into its own binary column. You can also use dictionary encoders that map category values to other values.

In [41]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')

# use a separate name so the feature DataFrame X above is not overwritten
X_cat = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit_transform(X_cat).toarray()
# columns are Female - Male - 1 - 2 - 3

array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])
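
For the dictionary-style alternative, a minimal sketch using `pandas.Series.map`. The `sizes` column and the `size_to_int` mapping are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical ordered category; the mapping itself is an assumption
sizes = pd.Series(["small", "large", "medium", "small"])
size_to_int = {"small": 0, "medium": 1, "large": 2}

# Each category value is looked up in the dictionary
encoded = sizes.map(size_to_int)
print(encoded.tolist())
```

This suits categories with a natural order. Values missing from the dictionary become `NaN`, so unseen categories need separate handling.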

### Text data

Many tools are available, such as bag of words, embeddings (e.g. BERT), and topic modeling.
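
As a sketch of the simplest of these, bag of words, here is `CountVectorizer` applied to three made-up sentences (the `docs` list is an assumption for illustration). Each document becomes a row of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Vocabulary words ordered by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(counts)
```

Word order is discarded: only how often each word appears in each document is kept.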

### Images

Typically the same processing is applied to pixel values as to numeric data. Sometimes you may need to chunk the image (group pixels into blocks) for more or less granularity.
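
A rough sketch of both steps with NumPy, assuming a randomly generated 8x8 grayscale image with 8-bit pixel values and a 2x2 block size (both are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 8x8 grayscale image with 8-bit pixel values (0-255)
image = rng.integers(0, 256, size=(8, 8)).astype(np.float64)

# Scale pixel values to [0, 1], same idea as scaling numeric features
image_scaled = image / 255.0

# Chunk into 2x2 blocks and average each block: a coarser 4x4 image
block = 2
h, w = image_scaled.shape
image_coarse = image_scaled.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

# Flatten so that each (coarse) pixel becomes one feature
features = image_coarse.ravel()
print(features.shape)
```

Larger blocks give fewer, smoother features; a block size of 1 keeps every pixel.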

## Types of Algorithms

1. Clustering / dimensionality reduction
    - k-means (k-nearest neighbors, despite the similar name, is a supervised method)
    - PCA
    - LDA
2. Prediction
    - regression 
    - classification
3. Embedding
    - for text
4. Neural Nets
    - Deep learning, reinforcement learning, GANs
    
    
(more details next time)
    

## Types of Error

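1. Overfitting
    - Your model fits the training data closely but does not generalize well to unseen data
2. Underfitting
    - Your model is too simplistic, or you've regularized it too heavily, to capture the pattern in the data
3. Leakage
    - Your model learns the 'wrong' things, e.g. from information that would not be available at prediction time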


![alt text](images/fitting.png)
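
One way to see under- and overfitting numerically is to fit polynomials of increasing degree to noisy data and compare training and test error. A minimal sketch, assuming a synthetic noisy sine curve and the illustrative degrees 1, 4, and 15:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Synthetic data (an assumption): a sine curve plus Gaussian noise
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y_noisy = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 40)

x_tr, x_te, y_tr, y_te = train_test_split(x, y_noisy, test_size=0.5, random_state=0)

errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    # Store (training MSE, test MSE) for each polynomial degree
    errors[degree] = (mean_squared_error(y_tr, model.predict(x_tr)),
                      mean_squared_error(y_te, model.predict(x_te)))
    print(degree, errors[degree])
```

Typically the low-degree model has high error on both sets (underfitting), while a very high-degree model tends to show a low training error but a noticeably higher test error (overfitting).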

## Dealing with Error / Getting the most out of your model

1. Test/train split
    - This is a must-do!
2. Parameter tuning
    - Varies for every model type
    - Examples: cost function, regularization, kernel selection, class weights
3. Feature engineering / selection
    - Often useful, sometimes needs domain expertise
    - Examples: mean, max, products of features. Be careful about correlated features and overfitting
4. Ensemble methods
    - Combining multiple models into one bigger model
5. Over/undersampling
    - Used with class imbalance (e.g. anomaly detection); adjusts the class balance of the training set
6. Imputation
    - Filling in missing data points
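
Items 5 and 6 can be sketched in a few lines. This assumes tiny made-up arrays, mean imputation via `SimpleImputer`, and random oversampling via `sklearn.utils.resample`; other strategies exist for both:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

# Imputation: replace missing values (NaN) with the column mean
X_missing = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, np.nan]])
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Oversampling: resample the minority class with replacement until balanced
X_imb = np.arange(20).reshape(10, 2)
y_imb = np.array([0] * 8 + [1] * 2)  # 8 majority rows, 2 minority rows

X_min, X_maj = X_imb[y_imb == 1], X_imb[y_imb == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print(X_imputed)
print(np.bincount(y_balanced))
```

Resampling should be applied to the training set only, so the test set keeps the real class balance.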

In [42]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
     X_scaled, y, test_size=0.3, random_state=42)
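
One caveat: `X_scaled` was produced by fitting the scaler on all rows before the split, so statistics from the test rows influence the training data (a mild form of leakage). A minimal sketch of one way to avoid that, combined with parameter tuning. It assumes synthetic `make_regression` data as a stand-in dataset and `Ridge` as an illustrative model, neither of which comes from the notebook:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=10.0,
                               random_state=42)

# Split first, so the test rows stay unseen by every fitted step
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42)

# The pipeline refits the scaler on the training folds only
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Parameter tuning: cross-validated search over the regularization strength
search = GridSearchCV(pipe, param_grid={"model__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["model__alpha"]
test_r2 = search.score(X_te, y_te)  # R^2 on the held-out test set
print(best_alpha, test_r2)
```

The test set is used once at the end, after the hyperparameter has been chosen by cross-validation on the training data.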