# Part 3: Introduction to Machine Learning

# Outline

1. What is Machine Learning?
    * Supervised Learning vs Unsupervised Learning
2. What are some typical tasks of Supervised Machine Learning?
    * Regression
    * Classification
3. Typical Machine Learning Pipeline
    * Identify the problem
    * Data Exploration & Preprocessing
    * Train/Test/Validation Split
    * Training, Testing and Tuning a model
    * Application and Iteration

# 1. What is Machine Learning? 

## Supervised Learning (Focus today) and Unsupervised Learning

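In supervised learning, the data comes as pairs of inputs x and known outputs y. In unsupervised learning, only the inputs are available and the goal is to find structure in them.

The simplest mapping from x to y is a straight line: y = ax + b

What if we have many (x, y) pairs and want to find the line that fits them best?

For example, let x be the years of working experience and y be the annual salary.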

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv('data/Salary_Data.csv')

In [None]:
data.head()

In [None]:
x = data.iloc[:,0]
y = data.iloc[:,1]

In [None]:
x.head()

In [None]:
y.head()

In [None]:
#draw a scatter plot
fig = plt.figure()
#plot points
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('salary')

#plot best fit line (degree-1 least-squares fit from np.polyfit)
xs = np.unique(x)
plt.plot(xs, np.poly1d(np.polyfit(x, y, 1))(xs), color='red')
plt.show()

Conclusion: In a typical Supervised Machine Learning Task, we are given a set of inputs and outputs, and we are interested in finding the best mapping function between them! 
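To make the "best mapping function" concrete, here is a minimal sketch that recovers a and b in y = ax + b and uses them to predict. The data is synthetic and made up for illustration (it is not `Salary_Data.csv`), and `predict` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical data: years of experience (x) and annual salary (y).
# The "true" slope 9000 and intercept 25000 are invented for this sketch.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 20)
y = 9000 * x + 25000 + rng.normal(0, 2000, size=x.shape)

# A degree-1 least-squares fit returns the coefficients [a, b] of y = ax + b
a, b = np.polyfit(x, y, 1)

def predict(years):
    """Apply the fitted line to a new input."""
    return a * years + b

print('slope a:', a)
print('intercept b:', b)
print('predicted salary at 5 years:', predict(5.0))
```

The fitted a and b should land close to the values used to generate the data, which is what "learning the mapping" means here.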

# 2. What are some typical tasks of Supervised Machine Learning?

#### Regression (for continuous data): Stock price prediction, Housing price prediction
#### Classification (for discrete/categorical data): Fraud Detection, Cat-Dog Classification
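The difference between the two tasks shows up in the type of the target. The sketch below is only an illustration on toy data. The choice of `LinearRegression` and `LogisticRegression` from scikit-learn is an assumption made for this example, not a model prescribed by this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy inputs: one feature, eight samples
X = np.arange(1, 9, dtype=float).reshape(-1, 1)

# Regression: the target is a continuous number
y_continuous = 3.0 * X.ravel() + 2.0
reg = LinearRegression().fit(X, y_continuous)
reg_pred = reg.predict([[10.0]])
print('regression output (a real number):', reg_pred)

# Classification: the target is a discrete label (here 0 or 1)
y_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = LogisticRegression().fit(X, y_labels)
clf_pred = clf.predict([[1.0], [8.0]])
print('classification output (class labels):', clf_pred)
```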

# 3. Typical Machine Learning Pipeline

#### a. Identify the problem: I want to predict the salary of an employee

#### b. Dataset exploration & Preprocessing: What data do I have (years of experience etc.)? Is the data clean enough? Is any data missing? Does categorical data need to be converted into numerical form?

In [None]:
example = pd.read_csv('titanic.csv')

In [None]:
example.head()

In [None]:
example.loc[example['Age'].isnull()]

In [None]:
# convert categorical to numerical
gender = example.loc[:,'Sex']
gender.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
gender_num = le.fit_transform(gender)
gender_num
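The cells above locate missing ages and encode a categorical column. One possible next step, sketched here on a tiny hypothetical table (not the real `titanic.csv`), is to fill the gaps and inspect which number each category received. Filling with the median is an assumption for illustration, and other strategies may suit the data better.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of a Titanic-style table
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Count missing values, then fill them with the column median
n_missing = toy['Age'].isnull().sum()
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Encode the categorical column and recover the label-to-number mapping
le = LabelEncoder()
toy['Sex_num'] = le.fit_transform(toy['Sex'])
mapping = dict(zip(le.classes_, le.transform(le.classes_)))

print('missing ages before filling:', n_missing)
print('mapping:', mapping)
print(toy)
```

`LabelEncoder` assigns integers in sorted order of the class names, so checking `le.classes_` avoids guessing which label became 0.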

#### c. Train/Test/Validation Split

Think of a student taking an A-level course.

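* Train: learning and doing the questions given by the teacher during lessons
* Validation: weekly/monthly exams, used to check progress and adjust how you study
* Test: the A-Level exam, taken once at the end on questions never seen before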

In [None]:
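# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split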

x = np.random.randn(10)
y = np.random.randn(10)

x, y

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

In [None]:
x_train, x_test, y_train, y_test
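The split above produces only a train set and a test set. One common way to also get a validation set is to call `train_test_split` twice, as in this minimal sketch. The 60/20/20 proportions and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples
x = np.arange(10)
y = x * 2

# Step 1: hold out 20% of the data as the test set
x_trainval, x_test, y_trainval, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Step 2: take 25% of the remaining 80% (20% of the total) as validation
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.25, random_state=0)

print('train:', len(x_train), 'val:', len(x_val), 'test:', len(x_test))
```

Fixing `random_state` makes the split reproducible between runs. Leaving it out gives a different split each time.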

#### d. Training/Testing/Tuning a model

In [None]:
# To be covered in later sessions

#### e. Apply to real problem and iterate along the way!