### Dealing with categorical features

- scikit-learn does not accept categorical features by default
- we need to convert categorical features into numeric values
- convert to binary features called dummy variables
    - 0 means observation was NOT that category
    - 1 means observation WAS that category
    
Two options exist for creating dummy variables:
- scikit-learn's OneHotEncoder()
- pandas get_dummies()
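
Because the provided dataset has no real genre labels, here is a minimal sketch of both options on a small made-up DataFrame. The `toy_df` name and the genre values are assumptions for illustration only and do not come from `music_clean.csv`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the genre labels are invented for illustration
toy_df = pd.DataFrame({
    "popularity": [60.0, 63.0, 59.0, 54.0],
    "genre": ["Rock", "Jazz", "Rock", "Blues"],
})

# Option 1: pandas get_dummies()
# drop_first=True omits the first category in sorted order (here "Blues"),
# because it is implied whenever all remaining dummy columns are 0
genre_dummies = pd.get_dummies(toy_df["genre"], drop_first=True, dtype=int)

# Attach the dummy columns and drop the original text column
toy_encoded = pd.concat([toy_df.drop("genre", axis=1), genre_dummies], axis=1)
print(toy_encoded)

# Option 2: scikit-learn's OneHotEncoder()
# It expects 2D input and returns a sparse matrix by default,
# so convert to a dense array for display
encoder = OneHotEncoder()
genre_onehot = encoder.fit_transform(toy_df[["genre"]]).toarray()
print(encoder.categories_)
print(genre_onehot)
```

`get_dummies()` is convenient for quick exploration, while `OneHotEncoder()` can be fitted on training data and reused on test data.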

In [7]:
import pandas as pd

# for some reason there are no music genres in the provided dataset
# the genre column is just set to '1'

music_df = pd.read_csv("data/music_clean.csv")
music_df.head()
#music_dummies = pd.get_dummies(music_df["genre"], drop_first = True)

#print(music_dummies.head(20))

| Unnamed: 0.1 | Unnamed: 0 | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36506 | 60.0 | 0.896 | 0.726 | 214547.0 | 0.177 | 2e-06 | 0.116 | -14.824 | 0.0353 | 92.934 | 0.618 | 1 |
| 1 | 37591 | 63.0 | 0.00384 | 0.635 | 190448.0 | 0.908 | 0.0834 | 0.239 | -4.795 | 0.0563 | 110.012 | 0.637 | 1 |
| 2 | 37658 | 59.0 | 7.5e-05 | 0.352 | 456320.0 | 0.956 | 0.0203 | 0.125 | -3.634 | 0.149 | 122.897 | 0.228 | 1 |
| 3 | 36060 | 54.0 | 0.945 | 0.488 | 352280.0 | 0.326 | 0.0157 | 0.119 | -12.02 | 0.0328 | 106.063 | 0.323 | 1 |
| 4 | 35710 | 55.0 | 0.245 | 0.667 | 273693.0 | 0.647 | 0.000297 | 0.0633 | -7.787 | 0.0487 | 143.995 | 0.3 | 1 |


### Handling missing data

Approaches include:

- dropping missing data rows
- imputation of missing elements
    - can use mean, median, or other values
    - for categorical values it is common to use the most frequent value
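
As a rough illustration of these approaches in plain pandas, the sketch below uses a small made-up DataFrame. The `toy_df` name and its values are assumptions, not taken from the music dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame({
    "tempo": [92.9, np.nan, 122.9, 106.1],
    "genre": ["Rock", "Jazz", None, "Rock"],
})

# Count missing values per column before deciding on an approach
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Approach 1: drop every row that contains a missing value
dropped = toy_df.dropna()

# Approach 2: impute
filled = toy_df.copy()
# numeric column: fill with the column mean
filled["tempo"] = filled["tempo"].fillna(filled["tempo"].mean())
# categorical column: fill with the most frequent value
filled["genre"] = filled["genre"].fillna(filled["genre"].mode()[0])
print(filled)
```

For modelling, the fill values should be computed from the training split only, so that no information from the test set leaks into the model.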
    

In [None]:
# pattern for imputation
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
X_cat = music_df["genre"].values.reshape(-1,1)
X_num = music_df.drop(["genre", "popularity"], axis = 1).values
y = music_df["popularity"].values
# split both feature sets with the same random_state so their rows (and y) stay aligned
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size = 0.2, random_state = 12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size = 0.2, random_state = 12)

imp_cat = SimpleImputer(strategy = "most_frequent")

# impute missing values from the training categorical data with fit_transform
X_train_cat = imp_cat.fit_transform(X_train_cat)

# impute missing values from the test data with transform
X_test_cat = imp_cat.transform(X_test_cat)


# instantiate a numerical imputer for numerical data
# this uses mean for missing values by default
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# now combine training data with np.append
X_train = np.append(X_train_num, X_train_cat, axis = 1)

# repeat for test data
X_test = np.append(X_test_num, X_test_cat, axis = 1)
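
To make the difference between `fit_transform` and `transform` in the pattern above concrete, here is a small self-contained sketch. The arrays `X_train_toy` and `X_test_toy` are made-up values for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature arrays with missing entries
X_train_toy = np.array([[1.0], [3.0], [np.nan]])
X_test_toy = np.array([[np.nan], [10.0]])

# The default strategy is "mean"
imp = SimpleImputer()

# fit_transform learns the mean from the training data and fills its gaps
X_train_filled = imp.fit_transform(X_train_toy)

# transform reuses the training mean; it does not recompute it on the test data
X_test_filled = imp.transform(X_test_toy)

# statistics_ holds the fill value learned for each column
print(imp.statistics_)
print(X_test_filled)
```

The missing test value is filled with the training mean rather than a value derived from the test set, which is why the imputer is fitted only on the training split.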

In [None]:
# pattern for using a Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Instantiate an imputer
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors = 3)

# Build steps for the pipeline
steps = [("imputer", imputer), 
         ("knn", knn)]

# Create the pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

### Centering and scaling data
