# Classification
https://cyberhelp.sesync.org/basic-Python-lesson/course/

## Lesson Objectives
- Earn your Python “learner’s permit”
- Work with Pandas, the DataFrame package
- Recognize differences between R and Python
- Hit the autobahn with Scikit-learn

### Specific Achievements
- Differentiate between data types and structures
- Learn to use indentation as syntax
- Import data and try a simple “split-apply-combine”
- Implement an SVM for binary classification

Why learn Python? While much of the academic research community has dived into R for an open source toolbox for data science, there are plenty of reasons to learn Python too. Some find they can learn to write scripts more quickly in Python; others find its object orientation a real boon. This lesson works towards a simple “machine learning” problem, which has long been a speciality of Python’s Scikit-learn package.

## Machine Learning
Machine Learning is another take on regression (for continuous response) and classification (for categorical response).

- Emphasis on prediction over parameter inference
- Equal emphasis on probabilistic and non-probabilistic methods: whatever works
- Not necessarily “supervised” (e.g. clustering)

This lesson will step through a non-probabilistic classifier, because it’s at the far end of the spectrum relative to generalized linear regression. Some classification methods have probabilistic interpretations: logistic regression is actually a classifier, whether you follow through to choosing the most likely outcome or are satisfied with estimating its probability. Others optimize the classifier based on other abstract quantities: SVMs maximize the margin, the distance between the decision boundary and the nearest training points (the “support vectors”).
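
The contrast between a probabilistic and a non-probabilistic reading of the same linear score can be sketched in plain Python. This is only an illustration, not how Scikit-learn fits either model: the weights, bias, and observation below are made-up values, and the function names are hypothetical.

```python
import math

def linear_score(weights, bias, x):
    """Signed score w.x + b; its sign says which side of the boundary x falls on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def logistic_probability(score):
    """Probabilistic reading: squash the score into an estimate of P(class 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def hard_label(score):
    """Non-probabilistic reading: keep only the side of the boundary."""
    return 1 if score > 0 else 0

# Hypothetical, hand-picked parameters and one observation (not fitted to any data).
weights = [1.5, -0.5]
bias = -1.0
x = [2.0, 1.0]

score = linear_score(weights, bias, x)
probability = logistic_probability(score)
label = hard_label(score)
print(score, probability, label)
```

A logistic regression stops at the probability (or thresholds it), while an SVM-style classifier reports only the label and the signed score.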

## Jupyter
Sign into JupyterHub and open up worksheet-11.ipynb. This worksheet is a Jupyter Notebook document: it is divided into “cells” that are run independently but access the same Python interpreter. Use the Notebook to write and annotate code.

After opening worksheet-11.ipynb, right click anywhere in your notebook and choose “Create Console for Notebook”. Drag-and-drop the tabs into whatever arrangement you like.

### Variables
Variable assignment attaches the label left of an = to the return value of the expression on its right.

In [None]:
a = 'xyz'
a

Colloquially, you might say the new variable a equals 'xyz', but Python makes it easy to “go deeper”. In practice, the standard Python interpreter keeps a single copy of a short string literal like 'xyz', so it makes a into another label for that same 'xyz', which we can verify with id().

The “in-memory” location of a returned by id() …

In [None]:
id(a)
id('xyz')
a is 'xyz'

… is equal to that of 'xyz' itself. The last line of the cell tests this “sameness” with the is operator, an idiom typical of the Python language: it uses plain English when words will suffice.
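
The identity check can also be tried without comparing against a literal. That sidesteps two wrinkles: Python 3.8 and later emit a SyntaxWarning when is is used with a literal, and whether two equal strings share one object is a detail of the interpreter rather than a guarantee. A minimal sketch, where the name also_a is introduced only for illustration:

```python
a = 'xyz'
also_a = a                            # a second label for the same string object

same_object = a is also_a             # identity: do both labels point at one object?
same_location = id(a) == id(also_a)   # the same check, spelled out with id()
equal_value = a == also_a             # equality: do the values match?

print(same_object, same_location, equal_value)
```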

### Equal but not the Same
The id() function helps demonstrate that “equal” is not the “same”.

In [None]:
b = [1,2,3]
id(b)

Even though b == [1, 2, 3] returns True, these are not the same object:

In [None]:
id([1, 2, 3])
b is [1, 2, 3]
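
The distinction can be made explicit by giving the second list its own label, so both objects exist side by side. A small sketch; the name lookalike is hypothetical:

```python
b = [1, 2, 3]
lookalike = [1, 2, 3]            # a second list, built separately with the same contents

equal = b == lookalike           # compares contents element by element
same = b is lookalike            # compares identity
same_location = id(b) == id(lookalike)

print(equal, same, same_location)
```

Two lists that are alive at the same time never share an id(), however similar their contents.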

### Side-effects
The reason to be aware of what b is has to do with “side-effects”, a very important part of Python programming. A side-effect occurs when an expression generates some ripples other than its return value.

In [None]:
b.pop()
b

Python is an object-oriented language from the ground up: everything is an “object” with some state to be more or less aware of. Side-effects don’t touch the label; they affect the object the label is assigned to (i.e. what it is).

### Question
Re-check the “in-memory” location—is it the same b?
### Answer
Yes! The list got shorter but it is the same list.
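
One way to check the answer is to record the location before and after the pop. A short sketch that rebuilds b so it runs on its own; the names location_before, location_after, and removed are illustrative:

```python
b = [1, 2, 3]
location_before = id(b)

removed = b.pop()            # returns the last element and shortens the list in place
location_after = id(b)

print(removed, b, location_before == location_after)
```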

Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:

In [None]:
c = b
b.pop()
c

The assignment to c does not create a new list, so the side-effect of popping off the tail of b ripples into c.
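
When an independent list is wanted, an explicit copy avoids the ripple. The copy step is an addition to the lesson's example, shown here as a sketch with a hypothetical name d:

```python
b = [1, 2, 3]
c = b             # a second label for the same list
d = b.copy()      # a new list with the same elements (a shallow copy)

b.pop()           # side-effect on the one list that b and c both label

print(c)          # c follows b
print(d)          # d is unaffected
```

A shallow copy duplicates only the outer list; nested mutable elements would still be shared between b and d.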

A common mistake for those coming to Python from R is to write b = b.append(4), which overwrites b with None, the value the append() method returns.
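
The pitfall is easy to see by capturing the return value under a separate name instead of overwriting b. A minimal sketch; result and longer are illustrative names:

```python
b = [1, 2, 3]
result = b.append(4)     # append() modifies b in place and returns None

print(b)                 # the list itself did grow
print(result)            # nothing useful came back

# Concatenation, by contrast, is an expression that returns a new list.
longer = b + [5]
print(longer)
```

Calling b.append(4) on its own line, without an assignment, is the idiomatic form.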

### Data types

In [None]:
...('x')

### Operators

In [None]:
5 ... 7

In [None]:
2 ... 3

In [None]:
... * 2

### Data structures

In [None]:
T = ...
type(...)

In [None]:
T = ...
type(T)

In [None]:
L = ...
type(L)

In [None]:
S1 = set(...)
S2 = {3.14, 'z'}
S1.difference(S2)

In [None]:
user = ...
  ...
  'Last Name': 'Doe',
  'Email': 'j.doe@gmail.com',
}
...

In [None]:
user[...] = 42

### Flow control

In [None]:
squares = []
for ... in range(1, 5):
    ...
    squares....
len(squares)

In [None]:
users = [
    {'Name':'Alice', 'Email':'alice@email.com'},
    {'Name':'Bob', 'Email': 'bob@email.com'},
    ]
for ...:
    if ...:
        print(u['Email'])
    else:
        print('')

### Methods

In [None]:
squares....

In [None]:
...({
    'Nickname':'Jamie',
    'Age':24,
    })
user

### Pandas

In [None]:
...
cbp = ...("data/cbp15co.csv")
...

In [None]:
cbp = pd.read_csv(
    'data/cbp15co.csv',
    ...
)
cbp.dtypes

In [None]:
cbp....[...]

In [None]:
cbp.loc[...]

In [None]:
cbp....[...]

In [None]:
cbp[['NAICS', 'AP']]...

In [None]:
logical_idx = cbp['NAICS']....('[0-9]{2}----')
cbp = cbp.loc[logical_idx]
...

In [None]:
cbp['FIPS'] = cbp['FIPSTATE']....
cbp.head()

### Index

In [None]:
cbp = ...
cbp.head()

In [None]:
cbp = cbp[['EMP', 'AP']]....

In [None]:
... = cbp['EMP']
employment = employment....
employment.head()

### Classification

In [None]:
... = pd.read_csv(
    'data/ruralurbancodes2013.csv',
    dtype={'FIPS':'str'},
    ).set_index('FIPS')
rural_urban['Metro'] = rural_urban['RUCC_2013'] < 4

In [None]:
employment_rural_urban = ...(
    ...,
    ...,
    )

In [None]:
import ...

train = employment_rural_urban.sample(
    ...,
    ...)

In [None]:
from ...
ml = svm.LinearSVC()

X = train.drop(..., axis=1).values[:, :2]
X = np.log(1 + X)
y = train[...].values.astype(int)

ml.....

In [None]:
from sklearn import metrics

metrics....

In [None]:
from mlxtend.plotting import plot_decision_regions

plot_decision_regions(X, y, clf=ml, legend=2)

### Kernel Method

In [None]:
ml = svm....
ml.fit(X, y)

In [None]:
metrics....

In [None]:
plot_decision_regions(...)