In this project, I am working with a csv file that contains data on the music that young men and women between the ages of 20 and 30 listen to.
The data has no missing values, so it does not require any cleaning.

The goal of this project is to use the DecisionTreeClassifier from the sklearn.tree module to predict the music genre a young man or young woman might listen to, given their age and gender.

Note1
According to the decision tree documentation, https://scikit-learn.org/stable/modules/tree.html, at the time of writing the decision tree does not support datasets with missing values. If you are planning to use the DecisionTreeClassifier, make sure you clean your data and remove all missing values before you pass that data to the fit function.
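
One way to check for missing values before fitting is sketched below. The small DataFrame is a made-up stand-in for music.csv (with one age deliberately left blank), and dropping incomplete rows is only one possible cleaning strategy, not necessarily the right one for every dataset.

```python
import pandas as pd

# Hypothetical stand-in for music.csv, with one missing age added on purpose
sample = pd.DataFrame({
    "age": [20, 23, None, 26],
    "gender": [1, 1, 1, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Jazz"],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = sample.dropna()
print(len(sample), "rows before cleaning,", len(cleaned), "rows after")
```

If every count printed by isna().sum() is zero, as is the case for the data in this project, no cleaning step is needed.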

Note2
The documentation also states that although decision trees in general are able to handle both numerical and categorical data, the scikit-learn implementation does not support categorical variables for now.

It is very important to bear this in mind before manipulating the data.
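
As a rough illustration of what this means in practice, the sketch below encodes a text column as integers before fitting. The data is made up, and the mapping (1 for male, 0 for female) is an assumed convention chosen to match the numeric gender column used in this project.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in which gender is stored as text instead of 0/1
people = pd.DataFrame({
    "age": [20, 23, 25, 21, 24, 27],
    "gender": ["male", "male", "male", "female", "female", "female"],
    "genre": ["HipHop", "HipHop", "HipHop", "Dance", "Dance", "Dance"],
})

# Encode the text labels as integers (assumed convention: 1 = male, 0 = female)
people["gender"] = people["gender"].map({"male": 1, "female": 0})

features = people[["age", "gender"]]
target = people["genre"]

# Every feature column is now numeric, so the classifier can be fitted
encoded_model = DecisionTreeClassifier(random_state=0)
encoded_model.fit(features, target)
```

Only the input columns need to be numeric here. The genre column can stay as text because it is the label being predicted, not a feature.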


IMPLEMENTATION
- I am first going to import the pandas module in this worksheet. I am using the pandas module to read data from the csv file
- I am importing DecisionTreeClassifier from the sklearn.tree module
- I am importing train_test_split from sklearn.model_selection to split the input and output into training and testing sets
- I am also using the accuracy_score from the sklearn.metrics module to test the accuracy of my predictions

In [173]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [174]:
# Read data from a csv file
music_data = pd.read_csv('music.csv')

In [175]:
# View the first five(5) records of our data
music_data.head()

   age  gender   genre
0   20       1  HipHop
1   23       1  HipHop
2   25       1  HipHop
3   26       1    Jazz
4   29       1    Jazz


In [176]:
music_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     18 non-null     int64 
 1   gender  18 non-null     int64 
 2   genre   18 non-null     object
dtypes: int64(2), object(1)
memory usage: 560.0+ bytes


From the above information, we can see that two columns [age, gender] contain numerical data and one column [genre] contains categorical data.

In this implementation, I will be predicting the music genre that an individual of a certain age and gender might prefer to listen to.
With this in mind, I am splitting the data into two sets.
One set is the input [age, gender] and the other is the output [genre]

In [177]:
# Get the genre column
output = music_data['genre']

In [178]:
# Drop the genre column to have only the age and gender columns
input_data = music_data.drop(columns=['genre'])

If we want to change the gender column from numerical to categorical data by assigning "male" or "female"
based on the numeric value, this is how we would go about it, working on a copy so that the original data stays numeric

new_music_data = music_data.copy()
new_music_data['gender'] = ["male" if g == 1 else "female" for g in new_music_data['gender']]

I am going to use train_test_split to set aside a defined portion of the data for testing and keep the rest for training.
In this case, I am going to use 20% of the data for testing and the rest of the data for training my model

In [206]:
# Split the input and output defined above into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(input_data, output, test_size=0.2)

# I am instantiating the DecisionTreeClassifier and assigning it to the variable model
model = DecisionTreeClassifier()

# I am training the model on the training data
model.fit(X_train, y_train)

# I am making predictions for the test inputs here
predictions = model.predict(X_test)

# I also want to test the accuracy of my predictions, so I am passing the expected test labels and the predictions
# into the accuracy_score function to get the accuracy of my predictions
score = accuracy_score(y_test, predictions)

score

0.75
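
Because train_test_split shuffles the rows randomly, the score above can change from one run to the next. The sketch below shows one way to make the split repeatable and then predict a genre for new individuals, which is the goal stated at the start. The 18 rows are made up for illustration (they are not the contents of music.csv), and random_state=42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Made-up stand-in for music.csv: 18 rows, gender 1 = male, 0 = female
toy = pd.DataFrame({
    "age": [20, 22, 23, 25, 26, 27, 28, 29, 30,
            20, 21, 23, 24, 25, 26, 27, 29, 30],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 4 + ["Jazz"] * 5 + ["Dance"] * 5 + ["Acoustic"] * 4,
})

toy_input = toy.drop(columns=["genre"])
toy_output = toy["genre"]

# A fixed random_state makes the same rows land in the test set on every run
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_input, toy_output, test_size=0.2, random_state=42
)
print(len(toy_X_train), "training rows,", len(toy_X_test), "testing rows")

toy_model = DecisionTreeClassifier(random_state=42)
toy_model.fit(toy_X_train, toy_y_train)
toy_score = accuracy_score(toy_y_test, toy_model.predict(toy_X_test))

# Predict the genre for a 21-year-old male and a 28-year-old female
new_people = pd.DataFrame({"age": [21, 28], "gender": [1, 0]})
new_predictions = toy_model.predict(new_people)
print(new_predictions)
```

With 18 rows and test_size=0.2, four rows are held out for testing, so each wrong prediction changes the score by 0.25. A score measured on so few rows is only a rough indication of how well the model performs.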