# Multi-class Classification - Predict the Poker Hand

Dataset:
https://archive.ics.uci.edu/ml/datasets/Poker+Hand

## Dataset observations

https://archive.ics.uci.edu/ml/machine-learning-databases/poker/poker-hand.names

- 10 classes (labelled 0 to 9)
- 25,010 training samples and 1,000,000 test samples
- missing values: none
- classes are imbalanced (some poker hands are rare)
- the test set is provided separately from the training set
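
The class imbalance noted above can be quantified before any model is trained. The following is a minimal sketch; the `labels` series is a small made-up stand-in for `df['CLASS']`, and the "imbalance ratio" is just one rough indicator, not a standard metric.

```python
import pandas as pd

# Hypothetical stand-in for df['CLASS']; the values are made up for illustration.
labels = pd.Series([0, 0, 0, 1, 1, 0, 9, 1, 0, 2], name="CLASS")

# Relative frequency of each class, ordered by class id.
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the most and the least frequent class as a rough imbalance indicator.
imbalance_ratio = class_share.max() / class_share.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data, the same two lines applied to `df['CLASS']` would show how rare the high-ranking hands are, which matters later when choosing metrics (accuracy alone can hide poor performance on rare classes).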

## Workflow

Data Gathering
1. read_csv

Data Transformation
2. transform dataframe
3. PCA to plot (for classification)
4. shuffle training set (train test split not necessary as there is a separate test set)
5. scale

Training
6. choose your own model
7. compare with at least one other model

Validation
8. metrics
9. learning curve
10. predictions

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import learning_curve

## Data Gathering

1. read_csv

In [6]:
df = pd.read_csv('D:/tmp/poker/poker-hand-training-true.data',
                 names=['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'CLASS'])

df.head()

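| | S1 | C1 | S2 | C2 | S3 | C3 | S4 | C4 | S5 | C5 | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 10 | 1 | 11 | 1 | 13 | 1 | 12 | 1 | 1 | 9 |
| 1 | 2 | 11 | 2 | 13 | 2 | 10 | 2 | 12 | 2 | 1 | 9 |
| 2 | 3 | 12 | 3 | 11 | 3 | 13 | 3 | 10 | 3 | 1 | 9 |
| 3 | 4 | 10 | 4 | 11 | 4 | 1 | 4 | 13 | 4 | 12 | 9 |
| 4 | 4 | 1 | 4 | 13 | 4 | 12 | 4 | 11 | 4 | 10 | 9 |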


In [10]:
df.describe()

| | S1 | C1 | S2 | C2 | S3 | C3 | S4 | C4 | S5 | C5 | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 | 25010.0 |
| mean | 2.508756 | 6.995242 | 2.497721 | 7.014194 | 2.510236 | 7.014154 | 2.495922 | 6.942463 | 2.497321 | 6.962735 | 0.621152 |
| std | 1.116483 | 3.749805 | 1.121767 | 3.766974 | 1.123148 | 3.744974 | 1.116009 | 3.747147 | 1.118732 | 3.741579 | 0.788361 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 2.0 | 4.0 | 1.0 | 4.0 | 2.0 | 4.0 | 1.0 | 4.0 | 1.0 | 4.0 | 0.0 |
| 50% | 3.0 | 7.0 | 2.0 | 7.0 | 3.0 | 7.0 | 2.0 | 7.0 | 3.0 | 7.0 | 1.0 |
| 75% | 4.0 | 10.0 | 4.0 | 10.0 | 4.0 | 10.0 | 3.0 | 10.0 | 3.0 | 10.0 | 1.0 |
| max | 4.0 | 13.0 | 4.0 | 13.0 | 4.0 | 13.0 | 4.0 | 13.0 | 4.0 | 13.0 | 9.0 |


In [9]:
df_test = pd.read_csv('D:/tmp/poker/poker-hand-testing.data',
                      names=['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'CLASS'])

df_test.head()

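| | S1 | C1 | S2 | C2 | S3 | C3 | S4 | C4 | S5 | C5 | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 13 | 2 | 4 | 2 | 3 | 1 | 12 | 0 |
| 1 | 3 | 12 | 3 | 2 | 3 | 11 | 4 | 5 | 2 | 5 | 1 |
| 2 | 1 | 9 | 4 | 6 | 1 | 4 | 3 | 2 | 3 | 9 | 1 |
| 3 | 1 | 4 | 3 | 13 | 2 | 13 | 2 | 1 | 3 | 6 | 1 |
| 4 | 3 | 10 | 2 | 7 | 1 | 2 | 2 | 11 | 4 | 9 | 0 |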


In [11]:
df_test.describe()

| | S1 | C1 | S2 | C2 | S3 | C3 | S4 | C4 | S5 | C5 | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| mean | 2.500493 | 6.997927 | 2.499894 | 7.006097 | 2.500871 | 6.998873 | 2.500393 | 7.002298 | 2.499451 | 6.989481 | 0.616902 |
| std | 1.117768 | 3.743374 | 1.118568 | 3.743481 | 1.118225 | 3.74189 | 1.117245 | 3.74127 | 1.118948 | 3.739894 | 0.773377 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 2.0 | 4.0 | 1.0 | 4.0 | 1.0 | 4.0 | 2.0 | 4.0 | 1.0 | 4.0 | 0.0 |
| 50% | 3.0 | 7.0 | 3.0 | 7.0 | 3.0 | 7.0 | 3.0 | 7.0 | 2.0 | 7.0 | 0.0 |
| 75% | 3.0 | 10.0 | 4.0 | 10.0 | 4.0 | 10.0 | 3.0 | 10.0 | 4.0 | 10.0 | 1.0 |
| max | 4.0 | 13.0 | 4.0 | 13.0 | 4.0 | 13.0 | 4.0 | 13.0 | 4.0 | 13.0 | 9.0 |
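
The `describe()` summaries show suits in the range 1 to 4, ranks in 1 to 13, and classes in 0 to 9. A small check along these lines can confirm that a freshly loaded file matches those expectations. This is only a sketch: `check_poker_frame` is a hypothetical helper, and the three rows fed through `io.StringIO` are the first training rows shown earlier, standing in for the real file.

```python
import io
import pandas as pd

COLUMNS = ['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'CLASS']
SUIT_COLS = ['S1', 'S2', 'S3', 'S4', 'S5']
RANK_COLS = ['C1', 'C2', 'C3', 'C4', 'C5']

# First three training rows from df.head(); stands in for the real data file.
sample = io.StringIO(
    "1,10,1,11,1,13,1,12,1,1,9\n"
    "2,11,2,13,2,10,2,12,2,1,9\n"
    "3,12,3,11,3,13,3,10,3,1,9\n"
)
df_sample = pd.read_csv(sample, names=COLUMNS)


def check_poker_frame(frame):
    """Hypothetical helper: basic sanity checks on a poker-hand DataFrame."""
    suits, ranks, labels = frame[SUIT_COLS], frame[RANK_COLS], frame['CLASS']
    return {
        'no_missing': not frame.isna().any().any(),
        'suits_in_1_4': bool(((suits >= 1) & (suits <= 4)).all().all()),
        'ranks_in_1_13': bool(((ranks >= 1) & (ranks <= 13)).all().all()),
        'class_in_0_9': bool(labels.between(0, 9).all()),
    }


checks = check_poker_frame(df_sample)
print(checks)
```

The same helper could be called on both `df` and `df_test` right after loading.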


## Data Transformation
2. transform dataframe
3. PCA to plot (for classification)
4. shuffle training set (no train-test split needed, as a separate test set is provided)
5. scale

In [None]:
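X = df.loc[:, 'S1':'C5']  # features: suit and rank of each of the five cards
y = df['CLASS']           # target: poker hand class (0 to 9)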
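
The transformation steps listed in this section (feature/target selection, shuffling, scaling, PCA to two components) might be chained as follows. This is a sketch under assumptions: `demo` is a randomly generated stand-in for `df` with the same column names, so its labels carry no meaning, and `sklearn.utils.shuffle` is one of several ways to shuffle.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

COLUMNS = ['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'CLASS']

# Synthetic stand-in for df: random suits (1-4), ranks (1-13) and arbitrary labels.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    col: rng.integers(1, 5 if col.startswith('S') else 14, size=n)
    for col in COLUMNS[:-1]
})
demo['CLASS'] = rng.integers(0, 2, size=n)

# Step 2: separate features and target.
X = demo.loc[:, 'S1':'C5']
y = demo['CLASS']

# Step 4: shuffle rows, keeping X and y aligned.
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)

# Step 5: standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_shuffled)

# Step 3: project onto two principal components for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

For the plot, `plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_shuffled)` would colour the points by class. When the real test set is used, it should be transformed with the scaler fitted on the training data (`scaler.transform(X_test)`) rather than refitted.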

## Training
6. logistic regression
7. SGD logistic regression

## Validation
8. metrics
9. learning curve
10. prediction
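
The validation steps above map onto the `classification_report`, `confusion_matrix` and `learning_curve` functions imported at the top of the notebook. The following sketch shows one way to call them; the data is again a synthetic `make_classification` stand-in, and the model and cross-validation settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import learning_curve, train_test_split

# Synthetic 3-class stand-in for the scaled poker features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 10: predictions on held-out data.
y_pred = model.predict(X_test)

# Step 8: per-class precision, recall and F1, plus the confusion matrix.
print(classification_report(y_test, y_pred, zero_division=0))
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Step 9: cross-validated scores at increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_train, y_train,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")
print(train_sizes)
print(val_scores.mean(axis=1))
```

Plotting `train_scores.mean(axis=1)` and `val_scores.mean(axis=1)` against `train_sizes` gives the learning curve. With imbalanced classes, the per-class rows of the report are usually more informative than overall accuracy.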