# Vertebral column dataset

This lab is based on the AWS machine learning foundations course, module 3, section 1, third question:

Question: Based on the biomechanical features, can you predict whether a patient has an abnormality (disk hernia or spondylolisthesis)?

Why:
* View statistics
* Encode categorical data
* Train and tune a model

Citation

Source: [UCI Vertebral column dataset](https://archive.ics.uci.edu/ml/datasets/vertebral%2Bcolumn)  _-> already downloaded!_

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 


Let's keep it short this time. Only import and encode categoricals.

You have 2 files, each in 2 versions: dat and arff. The "dat" version only has the abbreviated data; the "arff" version also contains the column names and the full names of the classes. We'll be using the arff version.
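To make the arff layout concrete, here is a minimal sketch of what such a file carries and what `loadarff` hands back. The inline file, its relation name `toy`, and its two rows are made up for illustration; only the header-plus-data structure mirrors the real files.

```python
from io import StringIO

import pandas as pd
from scipy.io.arff import loadarff

# Hypothetical miniature ARFF file: a header naming each column, then the data rows.
toy_arff = """@relation toy
@attribute pelvic_tilt numeric
@attribute class {Abnormal,Normal}
@data
22.5,Abnormal
10.1,Normal
"""

# loadarff returns a (data, meta) pair: a NumPy record array plus the header info.
toy_data, toy_meta = loadarff(StringIO(toy_arff))

print(toy_meta.names())  # column names read from the header
print(toy_meta.types())  # attribute types read from the header

toy_df = pd.DataFrame(toy_data)
print(toy_df)
```

Because the column names travel with the file, the DataFrame gets its headers without any manual bookkeeping.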

You could do this manually, but keep your history in mind: Python became the leading programming language because of its many, many open source libraries.

In [10]:
import pandas as pd
from scipy.io.arff import loadarff

# loadarff returns a (data, meta) tuple; element 0 holds the records
raw_data = loadarff('files/vertebral+column/column_2C_weka.arff')
df = pd.DataFrame(raw_data[0])

# append the rows of the three-class file below the two-class rows
raw_data = loadarff('files/vertebral+column/column_3C_weka.arff')
df = pd.concat([df, pd.DataFrame(raw_data[0])])

# df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 620 entries, 0 to 309
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pelvic_incidence          620 non-null    float64
 1   pelvic_tilt               620 non-null    float64
 2   lumbar_lordosis_angle     620 non-null    float64
 3   sacral_slope              620 non-null    float64
 4   pelvic_radius             620 non-null    float64
 5   degree_spondylolisthesis  620 non-null    float64
 6   class                     620 non-null    object 
dtypes: float64(6), object(1)
memory usage: 38.8+ KB
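The summary reports 620 entries with an index that only runs from 0 to 309, because `pd.concat` keeps each frame's original row labels. The sketch below uses two made-up frames (`a` and `b` are hypothetical) to show the effect, and how `ignore_index=True` renumbers the rows if unique labels are wanted. This is an optional variation, not something the lab requires.

```python
import pandas as pd

# Two tiny stand-in frames, each with its own default index 0..1.
a = pd.DataFrame({"x": [1.0, 2.0]})
b = pd.DataFrame({"x": [3.0, 4.0]})

kept = pd.concat([a, b])                      # original labels are kept
fresh = pd.concat([a, b], ignore_index=True)  # rows are renumbered from 0

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```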


The import works fine, but the class values aren't strings, they're [bytes](https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal). Fix that.

In [12]:
df['class'] = df['class'].str.decode('utf-8')
df["class"].unique()

array(['Abnormal', 'Normal', 'Hernia', 'Spondylolisthesis'], dtype=object)
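Why the decoding step matters can be seen in a small stand-alone sketch. The series `raw` is a made-up stand-in for the class column as the loader delivers it; the point is that a bytes value never compares equal to the matching string, so filters like `df["class"] == "Normal"` would silently match nothing.

```python
import pandas as pd

# Hypothetical stand-in for the undecoded class column.
raw = pd.Series([b"Normal", b"Hernia"])
text = raw.str.decode("utf-8")  # same values, now as str

print(type(raw.iloc[0]).__name__)   # bytes
print(type(text.iloc[0]).__name__)  # str

# A bytes value and a str value are different objects, even with the same letters.
print(raw.iloc[0] == "Normal")   # False
print(text.iloc[0] == "Normal")  # True
```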

And turn it into an ordinal/nominal (delete as appropriate) categorical.

In [15]:
from pandas.api.types import CategoricalDtype

# the categories are taken in order of first appearance in the column
values = df["class"].unique()
df["class"] = df["class"].astype(CategoricalDtype(categories=values))

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 620 entries, 0 to 309
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   pelvic_incidence          620 non-null    float64 
 1   pelvic_tilt               620 non-null    float64 
 2   lumbar_lordosis_angle     620 non-null    float64 
 3   sacral_slope              620 non-null    float64 
 4   pelvic_radius             620 non-null    float64 
 5   degree_spondylolisthesis  620 non-null    float64 
 6   class                     620 non-null    category
dtypes: category(1), float64(6)
memory usage: 34.7 KB
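As a rough illustration of the ordinal/nominal distinction, the sketch below builds one categorical of each kind. The `labels` series reuses class names from this dataset, while the `sizes` series and its ordering are invented purely to show what `ordered=True` enables. Both kinds expose integer codes through `.cat.codes`, but only the ordered one supports comparisons such as `>`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = pd.Series(["Normal", "Hernia", "Normal", "Spondylolisthesis"])

# Nominal: categories are just names, with no ranking (ordered defaults to False).
nominal = labels.astype(
    CategoricalDtype(categories=["Normal", "Hernia", "Spondylolisthesis"])
)

# Ordinal (hypothetical example): the category order carries meaning.
sizes = pd.Series(["small", "large", "medium"]).astype(
    CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
)

print(nominal.cat.ordered)          # False
print(nominal.cat.codes.tolist())   # [0, 1, 0, 2]
print((sizes > "small").tolist())   # [False, True, True]
```

The integer codes follow the position of each value in the `categories` list, which is why the order passed to `CategoricalDtype` determines the encoding.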
