# Data Science Introduction

Summer 2021 

## Outline 

1. Section 1: Introduction
    - Overall problem
    - Types of data and how to handle
    - Types of algorithms 
    - Types of error 
    - Dealing with error

### Overall Problem

- Given some data, make some predictions
- Two main types: supervised learning (labels) and unsupervised learning (no labels)
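
To make the labels / no-labels distinction concrete, here is a minimal sketch on synthetic data. The toy dataset, the variable names, and the choice of `LogisticRegression` and `KMeans` are assumptions made for illustration, not part of the original material.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative only): 200 points in two clearly separated groups
X_toy, y_toy = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Supervised: the model is given the features AND the labels
clf = LogisticRegression().fit(X_toy, y_toy)
supervised_pred = clf.predict(X_toy)

# Unsupervised: the model is given only the features and must find structure
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = km.labels_

print(supervised_pred[:10])
# Cluster ids are arbitrary, so 0/1 may be swapped relative to the true labels
print(cluster_ids[:10])
```

Note that the supervised call is `fit(X_toy, y_toy)` while the unsupervised call is just `fit(X_toy)`.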

### Data
- Possible data types: numeric, categorical, text, image, etc.
- Numeric data: you should almost always scale it
- Categorical data: use one-hot encoding
- Text is much more open-ended, and best practices are still changing
- Image data generally uses pixel values as features

### Numeric Data

- There are many scaling methods; a common one is the standard scaler (z-score), which subtracts each feature's mean and divides by its standard deviation. Other options include the min-max scaler and bucketing (binning).
- What matters most is that the values of each feature are on comparable scales, i.e. avoid having one feature in the millions next to one that is a z-score

In [None]:
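# load_boston was removed in scikit-learn 1.2; the California housing data is a
# common substitute (it is downloaded the first time it is used)
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
import pandas as pd

X, y = fetch_california_housing(return_X_y=True)
X = pd.DataFrame(X)
# features
X.head()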

In [None]:
X.describe()

In [None]:
# labels (AKA targets) - don't need to scale
print(y[0:10])

In [None]:
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X))

# this is the data you would use in your model
X_scaled

In [None]:
X_scaled.describe()
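
As a sanity check on what the scalers do, the sketch below computes the z-score and min-max scaling by hand and compares them with scikit-learn. The small `X_demo` matrix is a made-up example, chosen so that one column is in the millions and the other is small.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy feature matrix: column 0 is in the millions, column 1 is small
X_demo = np.array([[1_000_000.0, 0.5],
                   [2_000_000.0, 1.5],
                   [3_000_000.0, 2.5],
                   [6_000_000.0, 3.5]])

# z-score by hand: subtract the column mean, divide by the column standard deviation
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_demo)

# min-max by hand: map each column onto the range [0, 1]
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
mm_manual = (X_demo - col_min) / (col_max - col_min)
mm_sklearn = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(mm_manual, mm_sklearn))
```

After either transformation both columns live on a similar scale, which is the point of the exercise.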

### Categorical data

One-hot encoding is the most standard method: it turns each category into its own binary column. You can also use dictionary encoders that map each value to another value of your choosing.

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')

# use a separate name so the feature DataFrame X from earlier is not overwritten
X_cat = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit_transform(X_cat).toarray()
# columns are Female, Male, 1, 2, 3
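
A dictionary encoder can be as simple as a Python `dict` applied with pandas. The sketch below assumes a made-up `sizes` column and a hand-picked ordering; it is one way to do this, not the only one.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
sizes = pd.Series(["small", "large", "medium", "small"])

# Dictionary (ordinal) encoding: map each category to a number you choose.
# This mainly makes sense when the categories have a natural order.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes_encoded = sizes.map(size_order)

# One-hot encoding of the same column with pandas, for comparison
sizes_one_hot = pd.get_dummies(sizes)

print(sizes_encoded.tolist())
print(sizes_one_hot.columns.tolist())
```

The dictionary version keeps a single column but implies an ordering; the one-hot version makes no such assumption at the cost of more columns.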

### Text data

Many tools are available, such as bag of words, embeddings (e.g. BERT), and topic modeling.
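
Of these, bag of words is the easiest to show in a few lines. The sketch below uses scikit-learn's `CountVectorizer` on two made-up sentences; each column is a word from the vocabulary and each entry is how often that word appears in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Vocabulary words ordered by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(bow.toarray())
```

Word order is thrown away, which is why embeddings are usually preferred when meaning and context matter.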

### Images

Typically the same processing is applied to pixel values as to numeric data. Sometimes you may need to chunk the data for more or less granularity.
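
A minimal sketch of both ideas, assuming a made-up 8x8 grayscale image with 8-bit pixel values: the pixels are scaled to [0, 1], then averaged in 2x2 blocks as one possible way of "chunking" to a coarser granularity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8x8 grayscale image with 8-bit pixel values (0-255)
image = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)

# Scale pixel values to the range [0, 1]
image_scaled = image.astype(np.float32) / 255.0

# Coarser granularity: average each non-overlapping 2x2 block, giving a 4x4 image
block = 2
h, w = image_scaled.shape
image_coarse = image_scaled.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

# Flatten to a feature vector that a model can consume
features = image_coarse.ravel()
print(features.shape)
```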

## Types of Algorithms

1. Clustering and dimensionality reduction
    - k-means (k-nearest neighbors, despite the similar name, is a supervised method)
    - PCA
    - LDA
2. Prediction
    - regression 
    - classification
3. Embedding
    - for text
4. Neural Nets
    - Deep learning, reinforcement learning, GANs
    
    
(more details next time)
    

## Types of Error

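1. Overfitting
    - Your model fits the training data too closely (noise included) and does not generalize well to unseen data
2. Underfitting
    - Your model is too simplistic, or you've suppressed too much (for example with heavy regularization), so it misses real patterns
3. Leakage
    - Your model learns the 'wrong' things, typically because information that will not be available at prediction time leaked into the training data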


![alt text](images/fitting.png)
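
Under- and overfitting can be reproduced with a few lines of NumPy. The sketch below fits polynomials of increasing degree to a noisy sine wave; the data, the degrees, and the split sizes are all assumptions chosen for illustration, and the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this sketch): a noisy sine wave
x = rng.uniform(-1, 1, size=40)
y_noisy = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

# First 20 points for training, last 20 held out for testing
x_tr, y_tr = x[:20], y_noisy[:20]
x_te, y_te = x[20:], y_noisy[20:]

def mse(coefs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coefs, xs) - ys) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 12):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    train_err[degree] = mse(coefs, x_tr, y_tr)
    test_err[degree] = mse(coefs, x_te, y_te)
    print(f"degree {degree:2d}: train MSE {train_err[degree]:.4f}, "
          f"test MSE {test_err[degree]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones. High error on both sets suggests underfitting, while a low training error paired with a much higher test error suggests overfitting.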

## Dealing with Error / Getting the most out of your model

1. Test/train split
    - This is a must do!
2. Parameter tuning
    - Varies for every model type
    - Cost function, regularization, kernel selection, class weights
3. Feature engineering / selection
    - Often useful, sometimes needs domain expertise
    - Examples: mean, max, or product of existing features; be careful about correlated features and overfitting
4. Ensemble methods
    - Combining multiple models into one bigger model
5. Over/undersampling
    - Used with class imbalance (e.g. anomaly detection); rebalances the classes by tweaking the training data sets
6. Imputation
    - Filling in missing data points
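
Two of the items above, imputation and oversampling, are short enough to sketch directly. The arrays below are made up, and mean imputation and random oversampling with `sklearn.utils.resample` are just one simple choice for each; other strategies exist.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

# --- Imputation: fill missing values (NaN) with the column mean ---
X_missing = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, np.nan]])
X_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
print(X_filled)

# --- Oversampling: repeat minority-class rows until the classes are balanced ---
X_imb = np.arange(20).reshape(10, 2)
y_imb = np.array([0] * 8 + [1] * 2)  # 8 majority rows, 2 minority rows

X_min, y_min = X_imb[y_imb == 1], y_imb[y_imb == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=8, random_state=0
)

X_bal = np.vstack([X_imb[y_imb == 0], X_min_up])
y_bal = np.concatenate([y_imb[y_imb == 0], y_min_up])
print(np.bincount(y_bal))
```

Resampling should be applied to the training set only; the test set should keep the real class balance so that the evaluation stays honest.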

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
     X_scaled, y, test_size=0.3, random_state=42)
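
A common variation is to split first and put the scaler inside a pipeline, so that it is fit on training data only, and then tune a parameter with cross-validation. The sketch below assumes synthetic data from `make_regression` and a `Ridge` model with a small, arbitrary grid of `alpha` values; none of these choices come from the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, so the sketch runs without downloading anything
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# With the scaler inside the pipeline, it is re-fit on the training folds only,
# so no statistics from held-out data leak into training
pipe = make_pipeline(StandardScaler(), Ridge())

# Tune the regularization strength with 5-fold cross-validation on the training set
search = GridSearchCV(pipe, param_grid={"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # R^2 on the held-out test set
```

The test set is touched exactly once, at the very end, which keeps the reported score an honest estimate of performance on unseen data.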