# Pre-Processing and Training Data Development: CBARQ Survey

### Problem: Does dog breed affect behavioral traits, specifically chasing (or herding)?

The purpose of this project is to assess how dog breed affects behavioral traits, with the focus narrowed to chasing (the previous EDA step of this project explains why). Here I create dummy variables for the breeds and perform a train/test split. No scaler is needed, because every independent variable is a binary dummy value (0 or 1).
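
As a small illustration of the encoding described above, the sketch below builds dummy columns for a made-up three-row frame. The data and the `toy_*` names are hypothetical and are not taken from the CBARQ survey; only the `BreedID` and `chasing` column names mirror the real dataset.

```python
import pandas as pd

# Hypothetical toy data (not from the CBARQ survey), used only to show the encoding.
toy = pd.DataFrame({
    "BreedID": ["Border Collie", "Poodle (Standard)", "Border Collie"],
    "chasing": [3.5, 2.0, 4.0],
})

# One 0/1 indicator column per breed.
toy_dummies = pd.get_dummies(toy["BreedID"]).astype(int)

# Every indicator already lies in {0, 1}, so all features share the same range.
toy_min = int(toy_dummies.min().min())
toy_max = int(toy_dummies.max().max())

print(toy_dummies)
print(toy_min, toy_max)
```

Because each row belongs to exactly one breed, each row of the dummy frame sums to 1, and no column can dominate another by scale.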

In [1]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split

In [2]:
# Evaluate dataset created during EDA to double-check for missing values and datatypes.

df = pd.read_csv('EDA_CBARQ')

# print(df.isnull().sum())

# print(df.describe())

# print(df.head())

# print(df.columns)

# print(df.dtypes)

# Keep only the columns needed for this analysis: the breed and the chasing score
df = df[['BreedID', 'chasing']]

In [3]:
# Creating dummy/indicator variables for the categorical data, BreedID (10 values)

dummies = pd.get_dummies(df['BreedID']).astype('int')

# Create a dataframe with the dummy variables, excluding the original column

df_final = pd.concat([df, dummies], axis=1).drop('BreedID', axis=1)
df_final.head()

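| | chasing | Australian Shepherd | Border Collie | Doberman Pinscher | German Shepherd | Golden Retriever | Labrador Retriever | Mixed Breed/Unknown | Poodle (Standard) | Rottweiler | Soft Coated Wheaten Terrier |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 4.25 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 2.75 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 2.75 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |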


In [4]:
# Separate X and y dataframes

X = df_final[['Australian Shepherd', 'Border Collie', 'Doberman Pinscher',
              'German Shepherd', 'Golden Retriever', 'Labrador Retriever',
              'Mixed Breed/Unknown', 'Poodle (Standard)', 'Rottweiler',
              'Soft Coated Wheaten Terrier']]

y = df_final['chasing']

In [5]:
# Split data into training and test sets (70% train / 30% test).

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=55)
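
A quick way to confirm that a split of this kind behaves as intended is to check the resulting sizes and that features and targets stay aligned. The sketch below does this on a hypothetical 10-row stand-in rather than on the real `X` and `y`; the `demo_*` names and values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-in for the real X and y (values are made up).
demo_X = pd.DataFrame({"Border Collie": [1, 0] * 5, "Rottweiler": [0, 1] * 5})
demo_y = pd.Series([3.0, 2.5] * 5)

# Same arguments as the split used in this notebook.
demo_Xtrain, demo_Xtest, demo_ytrain, demo_ytest = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=55
)

# With 10 rows and test_size=0.3, 7 rows go to training and 3 to testing.
print(demo_Xtrain.shape, demo_Xtest.shape)
```

Fixing `random_state` makes the partition reproducible, so the same rows land in the test set on every run.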

In [6]:
# Since all of my X values are binary 0/1 dummy variables for breeds, I don't need to use a scaler.

Narrowing this dataset to just 'chasing' and breed makes it far simpler and removes much of the noise from other traits that did not show clear relationships. The reasoning behind this restriction is given in the EDA portion of this project.

The data has been separated into X and y variables, with chasing as the dependent variable (y) and the breeds as the independent variables (X), and then into training and test sets. All X values are 0 or 1, and y values range from 1 to 5, because they come from a survey that used those numbers as its rating scale.
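
The properties stated above can be checked programmatically before modeling. The following is a minimal sketch, assuming the 1 to 5 rating range described here; `check_encoded_split` is a hypothetical helper, not part of pandas or scikit-learn, and the `demo_*` data is made up.

```python
import pandas as pd

def check_encoded_split(X, y, low=1.0, high=5.0):
    """Return True if X is strictly one-hot and every y value lies in [low, high].

    Hypothetical helper; the default bounds assume the 1-5 survey scale.
    """
    binary = X.isin([0, 1]).all().all()              # only 0s and 1s in X
    one_breed_per_row = (X.sum(axis=1) == 1).all()   # exactly one breed per dog
    in_range = y.between(low, high).all()            # scores within the scale
    return bool(binary and one_breed_per_row and in_range)

# Made-up stand-in data for demonstration only.
demo_X = pd.DataFrame({"Border Collie": [1, 0], "Rottweiler": [0, 1]})
demo_y = pd.Series([3.25, 2.75])
demo_ok = check_encoded_split(demo_X, demo_y)
print(demo_ok)
```

In the notebook itself, the same helper could be called as `check_encoded_split(Xtrain, ytrain)` and `check_encoded_split(Xtest, ytest)`.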