# Random Forest Classification - Gender Prediction

In this project, I use a random forest classifier to predict the gender of individuals. The data comes from a Kaggle dataset, available at the link below.

URL - https://www.kaggle.com/datasets/muhammadtalharasool/simple-gender-classification

## Importing libraries

In [152]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

## Importing the dataset

In [153]:
gender_df = pd.read_csv(r"C:\Users\pjhop\OneDrive\Documents\Programming & Coding\Python\Projects\Datasets\gender.csv")

## Exploratory Data Analysis

In [154]:
gender_df.head()

Unnamed: 0,Gender,Age,Height (cm),Weight (kg),Occupation,Education Level,Marital Status,Income (USD),Favorite Color,Unnamed: 9
0,male,32,175,70,Software Engineer,Master's Degree,Married,75000,Blue,
1,male,25,182,85,Sales Representative,Bachelor's Degree,Single,45000,Green,
2,female,41,160,62,Doctor,Doctorate Degree,Married,120000,Purple,
3,male,38,178,79,Lawyer,Bachelor's Degree,Single,90000,Red,
4,female,29,165,58,Graphic Designer,Associate's Degree,Single,35000,Yellow,


Here we can see that the last column, Unnamed: 9, appears to be empty and carries no information, so we drop it in the next step.

In [155]:
gender_df = gender_df.drop(['Unnamed: 9'], axis=1)

In [156]:
gender_df.head()

Unnamed: 0,Gender,Age,Height (cm),Weight (kg),Occupation,Education Level,Marital Status,Income (USD),Favorite Color
0,male,32,175,70,Software Engineer,Master's Degree,Married,75000,Blue
1,male,25,182,85,Sales Representative,Bachelor's Degree,Single,45000,Green
2,female,41,160,62,Doctor,Doctorate Degree,Married,120000,Purple
3,male,38,178,79,Lawyer,Bachelor's Degree,Single,90000,Red
4,female,29,165,58,Graphic Designer,Associate's Degree,Single,35000,Yellow


In [157]:
gender_df.columns

Index([' Gender', ' Age', ' Height (cm)', ' Weight (kg)', ' Occupation',
       ' Education Level', ' Marital Status', ' Income (USD)',
       ' Favorite Color'],
      dtype='object')

Inspecting the column names shows that each one begins with a space. This is a potential problem because every column would have to be referenced with the leading space included (' Age' rather than 'Age'), and dot notation would not work even for single-word names such as Gender and Age. We therefore strip this whitespace using the code below.

In [158]:
gender_df = gender_df.rename(columns=lambda x: x.strip())
gender_df.columns

Index(['Gender', 'Age', 'Height (cm)', 'Weight (kg)', 'Occupation',
       'Education Level', 'Marital Status', 'Income (USD)', 'Favorite Color'],
      dtype='object')

In [159]:
gender_df['Gender'].unique()

array([' male', ' female', 'male', 'female'], dtype=object)

In [160]:
gender_df['Age'].unique()

array([32, 25, 41, 38, 29, 45, 27, 52, 31, 36, 24, 44, 28, 33, 37, 26, 40,
       47, 35, 42, 49, 30, 39, 34, 43], dtype=int64)

In [161]:
gender_df['Height (cm)'].unique()

array([175, 182, 160, 178, 165, 190, 163, 179, 168, 177, 162, 183, 166,
       181, 170, 176, 169, 187, 172, 180, 167, 185, 188, 174, 164, 186,
       184], dtype=int64)

In [162]:
gender_df['Occupation'].unique()

array([' Software Engineer', ' Sales Representative', ' Doctor',
       ' Lawyer', ' Graphic Designer', ' Business Consultant',
       ' Marketing Specialist', ' CEO', ' Project Manager', ' Engineer',
       ' Accountant', ' Architect', ' Nurse', ' Analyst', ' Teacher',
       ' IT Manager', ' Writer', ' Business Analyst', 'Engineer',
       'Teacher', 'Doctor', 'Graphic Designer', 'IT Manager',
       'Sales Representative', 'Lawyer', 'Marketing Specialist',
       'Project Manager', 'Writer', 'Architect', 'Nurse',
       'Business Analyst', 'Accountant', 'CEO', 'Analyst',
       'Software Developer'], dtype=object)

In [163]:
gender_df['Education Level'].unique()

array([" Master's Degree", " Bachelor's Degree", ' Doctorate Degree',
       " Associate's Degree", "Master's Degree", "Bachelor's Degree",
       'Doctorate Degree', "Associate's Degree"], dtype=object)

In [164]:
gender_df['Marital Status'].unique()

array([' Married', ' Single', ' Divorced', ' Widowed', 'Single',
       'Married', 'Divorced'], dtype=object)

In [165]:
gender_df['Income (USD)'].unique()

array([ 75000,  45000, 120000,  90000,  35000, 110000,  50000, 500000,
        80000,  95000,  40000,  55000,  60000,  65000,  85000,  30000,
       150000,  70000, 100000, 130000, 180000, 250000], dtype=int64)

In [166]:
gender_df['Favorite Color'].unique()

array([' Blue', ' Green', ' Purple', ' Red', ' Yellow', ' Black', ' Pink',
       ' Orange', ' Grey', 'Blue', 'Green', 'Red', 'Orange', 'Purple',
       'Yellow', 'Black', 'Grey', 'Pink'], dtype=object)

In [167]:
gender_df.isnull().sum()

Gender             0
Age                0
Height (cm)        0
Weight (kg)        0
Occupation         0
Education Level    0
Marital Status     0
Income (USD)       0
Favorite Color     0
dtype: int64

In [168]:
gender_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131 entries, 0 to 130
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           131 non-null    object
 1   Age              131 non-null    int64 
 2   Height (cm)      131 non-null    int64 
 3   Weight (kg)      131 non-null    int64 
 4   Occupation       131 non-null    object
 5   Education Level  131 non-null    object
 6   Marital Status   131 non-null    object
 7   Income (USD)     131 non-null    int64 
 8   Favorite Color   131 non-null    object
dtypes: int64(4), object(5)
memory usage: 9.3+ KB


When inspecting the column values, I saw that there were no missing values. However, several categorical columns (dtype object) contain values with leading whitespace, so the same category appears twice (for example ' male' and 'male'). If this whitespace is not removed, one-hot encoding will treat these as different categories. The next step in our analysis is therefore to strip the whitespace from the column values.

In [169]:
gender_df['Gender'] = gender_df['Gender'].str.strip()
gender_df['Occupation'] = gender_df['Occupation'].str.strip()
gender_df['Education Level'] = gender_df['Education Level'].str.strip()
gender_df['Marital Status'] = gender_df['Marital Status'].str.strip()
gender_df['Favorite Color'] = gender_df['Favorite Color'].str.strip()
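
The five calls above can also be written as a single pass over every text column. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not rows from the Kaggle file); it detects text columns with `pd.api.types.is_string_dtype` instead of listing them by hand.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same kind of stray whitespace
toy_df = pd.DataFrame({
    "Gender": [" male", "female", " female"],
    "Age": [32, 41, 29],
    "Favorite Color": [" Blue", "Purple", " Yellow"],
})

# Collect the text columns, leaving numeric columns such as Age untouched
text_cols = [c for c in toy_df.columns if pd.api.types.is_string_dtype(toy_df[c])]

# Strip leading and trailing whitespace from every text column at once
toy_df[text_cols] = toy_df[text_cols].apply(lambda col: col.str.strip())

print(text_cols)
print(toy_df["Gender"].unique())
```

The benefit of this form is that a text column added later is cleaned automatically, at the cost of being less explicit about which columns are touched.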

In [170]:
gender_df.head()

Unnamed: 0,Gender,Age,Height (cm),Weight (kg),Occupation,Education Level,Marital Status,Income (USD),Favorite Color
0,male,32,175,70,Software Engineer,Master's Degree,Married,75000,Blue
1,male,25,182,85,Sales Representative,Bachelor's Degree,Single,45000,Green
2,female,41,160,62,Doctor,Doctorate Degree,Married,120000,Purple
3,male,38,178,79,Lawyer,Bachelor's Degree,Single,90000,Red
4,female,29,165,58,Graphic Designer,Associate's Degree,Single,35000,Yellow


In [171]:
gender_df['Gender'] = gender_df['Gender'].map({'male':1, 'female':0})
gender_df.head()

Unnamed: 0,Gender,Age,Height (cm),Weight (kg),Occupation,Education Level,Marital Status,Income (USD),Favorite Color
0,1,32,175,70,Software Engineer,Master's Degree,Married,75000,Blue
1,1,25,182,85,Sales Representative,Bachelor's Degree,Single,45000,Green
2,0,41,160,62,Doctor,Doctorate Degree,Married,120000,Purple
3,1,38,178,79,Lawyer,Bachelor's Degree,Single,90000,Red
4,0,29,165,58,Graphic Designer,Associate's Degree,Single,35000,Yellow


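In this next section, I one-hot encode the remaining categorical columns with pd.get_dummies. I set drop_first=True so that one category per column is dropped: with every category kept, the dummy columns of a variable always sum to 1 and are therefore perfectly collinear. This matters mostly for linear models. A random forest is far less sensitive to multicollinearity, so here the option mainly removes redundant columns.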

In [172]:
occupation = pd.get_dummies(gender_df['Occupation'], drop_first=True)
education_level = pd.get_dummies(gender_df['Education Level'], drop_first=True)
marital_status = pd.get_dummies(gender_df['Marital Status'], drop_first=True)
fav_color = pd.get_dummies(gender_df['Favorite Color'], drop_first=True)
gender_df = pd.concat([gender_df, occupation, education_level, marital_status, fav_color], axis=1)
pd.set_option('display.max_columns', None)
gender_df.head()

Unnamed: 0,Gender,Age,Height (cm),Weight (kg),Occupation,Education Level,Marital Status,Income (USD),Favorite Color,Analyst,Architect,Business Analyst,Business Consultant,CEO,Doctor,Engineer,Graphic Designer,IT Manager,Lawyer,Marketing Specialist,Nurse,Project Manager,Sales Representative,Software Developer,Software Engineer,Teacher,Writer,Bachelor's Degree,Doctorate Degree,Master's Degree,Married,Single,Widowed,Blue,Green,Grey,Orange,Pink,Purple,Red,Yellow
0,1,32,175,70,Software Engineer,Master's Degree,Married,75000,Blue,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0
1,1,25,182,85,Sales Representative,Bachelor's Degree,Single,45000,Green,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0
2,0,41,160,62,Doctor,Doctorate Degree,Married,120000,Purple,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
3,1,38,178,79,Lawyer,Bachelor's Degree,Single,90000,Red,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0
4,0,29,165,58,Graphic Designer,Associate's Degree,Single,35000,Yellow,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
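
To make the effect of `drop_first` concrete, here is a small sketch on a hypothetical three-category column (the `status` series is invented for illustration). It compares the full encoding with the reduced one and shows why one column is redundant.

```python
import pandas as pd

# Hypothetical toy column, not taken from the Kaggle file
status = pd.Series(["Married", "Single", "Divorced", "Single"], name="Marital Status")

full = pd.get_dummies(status)                      # one column per category
reduced = pd.get_dummies(status, drop_first=True)  # first category in sorted order is dropped

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding every row sums to 1, so any one column is implied by the others
print(full.sum(axis=1).tolist())
```

Because the categories are sorted before encoding, the dropped column here is Divorced, which is consistent with the notebook output above, where Married, Single and Widowed remain.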


Now that the dummy variables have been added, I drop the original categorical columns because they are no longer required for the analysis.

In [173]:
gender_df = gender_df.drop(['Occupation', 'Education Level', 'Marital Status', 'Favorite Color'], axis=1)
gender_df.head()

Unnamed: 0,Gender,Age,Height (cm),Weight (kg),Income (USD),Analyst,Architect,Business Analyst,Business Consultant,CEO,Doctor,Engineer,Graphic Designer,IT Manager,Lawyer,Marketing Specialist,Nurse,Project Manager,Sales Representative,Software Developer,Software Engineer,Teacher,Writer,Bachelor's Degree,Doctorate Degree,Master's Degree,Married,Single,Widowed,Blue,Green,Grey,Orange,Pink,Purple,Red,Yellow
0,1,32,175,70,75000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0
1,1,25,182,85,45000,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0
2,0,41,160,62,120000,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
3,1,38,178,79,90000,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0
4,0,29,165,58,35000,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
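
As an alternative to encoding each column separately, concatenating, and then dropping the originals, `pd.get_dummies` accepts a `columns` argument that does all three in one call. The sketch below uses an invented `toy_df`. Note one difference from the approach above: by default the new columns are prefixed with the source column name.

```python
import pandas as pd

# Hypothetical toy frame, not rows from the Kaggle file
toy_df = pd.DataFrame({
    "Age": [32, 25, 41],
    "Marital Status": ["Married", "Single", "Married"],
    "Favorite Color": ["Blue", "Green", "Purple"],
})

# Encode the listed columns, drop the first category of each,
# and remove the original text columns, all in one step
encoded = pd.get_dummies(
    toy_df,
    columns=["Marital Status", "Favorite Color"],
    drop_first=True,
)

print(list(encoded.columns))
```

The prefixes avoid name clashes when two variables share a category label, which the unprefixed columns in the notebook would not guard against.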


## Splitting the data into training and test sets

In [174]:
x = gender_df.drop(['Gender'], axis=1)
y = gender_df['Gender']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=30)

## Using the Random Forest algorithm

The random forest algorithm is a supervised machine learning technique used for both regression and classification. It is an ensemble method that combines the votes of many decision trees to classify each observation. In our case, we use 100 decision trees (n_estimators), which are trained using the bagging method.

In bagging, each decision tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement), which reduces variance and lowers the risk of overfitting compared with a single decision tree. In addition, the random forest algorithm employs feature randomization: only a random subset of the features is considered as candidates at each split. This helps to decorrelate the trees and improve the performance of the model.
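
The two ideas described here, bagging and feature randomization, can be sketched by hand with plain decision trees. This is an illustrative approximation on synthetic data from `make_classification` (an assumption standing in for the gender dataset), not how the notebook's model is built internally. In particular, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the gender dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25  # odd, so a hard majority vote cannot tie
trees = []
in_bag_fractions = []
for i in range(n_trees):
    # Bagging: draw a bootstrap sample (same size as the data, with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    in_bag_fractions.append(len(np.unique(idx)) / len(X))
    # Feature randomization: max_features="sqrt" considers a random
    # subset of the features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Hard majority vote across the trees
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)

# Share of distinct rows per bootstrap sample, typically close to 1 - 1/e
print(round(float(np.mean(in_bag_fractions)), 2))
```

The rows left out of a given bootstrap sample are that tree's out-of-bag rows, which is what makes an out-of-bag score possible without a separate validation set.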

In [175]:
# Instantiate the classifier (oob_score=True also computes an out-of-bag estimate)
rf = RandomForestClassifier(n_estimators=100, random_state=10, oob_score=True)

# Fit the model to the training data
rf.fit(x_train, y_train)

RandomForestClassifier(oob_score=True, random_state=10)

In [176]:
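# Note: rf.oob_score is the constructor flag, so this prints True;
# the fitted out-of-bag accuracy estimate is stored in rf.oob_score_
print(rf.oob_score)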

True
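
The `oob_score=True` flag passed to the classifier asks scikit-learn to score each training row using only the trees that did not see it during fitting. The sketch below, again on synthetic data (an assumption, so the number it prints says nothing about the gender model), shows the difference between the constructor flag `oob_score` and the fitted attribute `oob_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the gender dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

demo_rf = RandomForestClassifier(n_estimators=100, random_state=10, oob_score=True)
demo_rf.fit(X, y)

print(demo_rf.oob_score)   # the constructor argument, echoed back
print(demo_rf.oob_score_)  # fitted out-of-bag accuracy: a float between 0 and 1
```

The out-of-bag estimate is useful on a small dataset because it gives a second accuracy figure without holding out any further rows.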


In [177]:
y_pred = rf.predict(x_test)

## Metrics

In [179]:
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

In [180]:
print('Accuracy: ', round(accuracy * 100, 3))
print('F1 score: ', round(f1 * 100, 3))
print('Recall score: ', round(recall * 100, 3))
print('Precision score: ', round(precision * 100, 3))

Accuracy:  100.0
F1 score:  100.0
Recall score:  100.0
Precision score:  100.0
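
For reference, the four metrics can be derived from the counts of true and false positives and negatives. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical, not the model's predictions) with 1 as the positive class, matching the male = 1 mapping used earlier, and checks the hand-computed values against scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels (1 = male, 0 = female), not output from the fitted model
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

manual_accuracy = (tp + tn) / len(pairs)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(manual_accuracy, accuracy_score(y_true, y_hat))
print(manual_precision, precision_score(y_true, y_hat))
print(manual_recall, recall_score(y_true, y_hat))
print(manual_f1, f1_score(y_true, y_hat))
```

Precision, recall and F1 all depend on which class is treated as positive, so swapping the 0/1 mapping would change them while leaving accuracy the same.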
