<h1>Gender Classification Based on Survey Questions</h1>

Data source: https://www.kaggle.com/miroslavsabo/young-people-survey

The data comes from a self-report survey that asked people about their interests and hobbies, their phobias, and their personality traits, along with questions about themselves (e.g. gender, age, height). Most questions are answered on a scale from 1 to 5, where 1 means not at all interested/afraid/in agreement and 5 means very interested/afraid/in strong agreement; a few, such as Smoking and Alcohol, are categorical. The sample consists of Slovakian students from a college class and their friends, aged 15 to 30.

I want to use the answers to the different types of questions to train a model to classify a person as either male or female.

In [1]:
# Import some libraries (numpy might not be used)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as XGB
from sklearn.metrics import classification_report

In [2]:
# Load in the dataset
data = pd.read_csv('responses.csv')

# Allow all the columns to be seen
pd.set_option('display.max_columns', None)

# First 5 rows of the dataset
data.head()

Unnamed: 0,Music,Slow songs or fast songs,Dance,Folk,Country,Classical music,Musical,Pop,Rock,Metal or Hardrock,Punk,"Hiphop, Rap","Reggae, Ska","Swing, Jazz",Rock n roll,Alternative,Latino,"Techno, Trance",Opera,Movies,Horror,Thriller,Comedy,Romantic,Sci-fi,War,Fantasy/Fairy tales,Animated,Documentary,Western,Action,History,Psychology,Politics,Mathematics,Physics,Internet,PC,Economy Management,Biology,Chemistry,Reading,Geography,Foreign languages,Medicine,Law,Cars,Art exhibitions,Religion,"Countryside, outdoors",Dancing,Musical instruments,Writing,Passive sport,Active sport,Gardening,Celebrities,Shopping,Science and technology,Theatre,Fun with friends,Adrenaline sports,Pets,Flying,Storm,Darkness,Heights,Spiders,Snakes,Rats,Ageing,Dangerous dogs,Fear of public speaking,Smoking,Alcohol,Healthy eating,Daily events,Prioritising workload,Writing notes,Workaholism,Thinking ahead,Final judgement,Reliability,Keeping promises,Loss of interest,Friends versus money,Funniness,Fake,Criminal damage,Decision making,Elections,Self-criticism,Judgment calls,Hypochondria,Empathy,Eating to survive,Giving,Compassion to animals,Borrowed stuff,Loneliness,Cheating in school,Health,Changing the past,God,Dreams,Charity,Number of friends,Punctuality,Lying,Waiting,New environment,Mood swings,Appearence and gestures,Socializing,Achievements,Responding to a serious letter,Children,Assertiveness,Getting angry,Knowing the right people,Public speaking,Unpopularity,Life struggles,Happiness in life,Energy levels,Small - big dogs,Personality,Finding lost valuables,Getting up,Interests or hobbies,Parents' advice,Questionnaires or polls,Internet usage,Finances,Shopping centres,Branded clothing,Entertainment spending,Spending on looks,Spending on gadgets,Spending on healthy eating,Age,Height,Weight,Number of siblings,Gender,Left - right handed,Education,Only child,Village - town,House - block of flats
0,5.0,3.0,2.0,1.0,2.0,2.0,1.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,5.0,4.0,2.0,5.0,4.0,4.0,1.0,5.0,5.0,3.0,1.0,2.0,1.0,5.0,1.0,3.0,3.0,5.0,3.0,5.0,3.0,3.0,3.0,3.0,5.0,3.0,1.0,1.0,1.0,1.0,5.0,3.0,3.0,2.0,1.0,5.0,5.0,1.0,4.0,4.0,2.0,5.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,5,3.0,1.0,3.0,2.0,never smoked,drink a lot,4.0,2.0,2.0,5.0,4.0,2.0,5.0,4.0,4.0,1.0,3.0,5.0,1.0,1.0,3.0,4.0,1.0,3.0,1.0,3.0,1,4.0,5.0,4.0,3.0,2.0,1.0,1.0,1.0,4,2.0,3,i am always on time,never,3.0,4.0,3.0,4.0,3.0,4.0,3.0,5.0,1.0,1.0,3.0,5.0,5.0,1.0,4.0,5.0,1.0,4.0,3.0,2.0,3.0,4.0,3.0,few hours a day,3.0,4.0,5.0,3.0,3.0,1,3.0,20.0,163.0,48.0,1.0,female,right handed,college/bachelor degree,no,village,block of flats
1,4.0,4.0,2.0,1.0,1.0,1.0,2.0,3.0,5.0,4.0,4.0,1.0,3.0,1.0,4.0,4.0,2.0,1.0,1.0,5.0,2.0,2.0,4.0,3.0,4.0,1.0,3.0,5.0,4.0,1.0,4.0,1.0,3.0,4.0,5.0,2.0,4.0,4.0,5.0,1.0,1.0,4.0,4.0,5.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,3.0,2.0,4.0,2.0,5.0,1.0,1.0,1.0,2.0,1.0,1,1.0,3.0,1.0,4.0,never smoked,drink a lot,3.0,3.0,2.0,4.0,5.0,4.0,1.0,4.0,4.0,3.0,4.0,3.0,2.0,1.0,2.0,5.0,4.0,4.0,1.0,2.0,1,2.0,4.0,3.0,2.0,4.0,4.0,4.0,1.0,3,1.0,3,i am often early,sometimes,3.0,4.0,4.0,4.0,4.0,2.0,4.0,2.0,2.0,5.0,4.0,4.0,4.0,1.0,4.0,3.0,5.0,3.0,4.0,5.0,3.0,2.0,3.0,few hours a day,3.0,4.0,1.0,4.0,2.0,5,2.0,19.0,163.0,58.0,2.0,female,right handed,college/bachelor degree,no,city,block of flats
2,5.0,5.0,2.0,2.0,3.0,4.0,5.0,3.0,5.0,3.0,4.0,1.0,4.0,3.0,5.0,5.0,5.0,1.0,3.0,5.0,3.0,4.0,4.0,2.0,4.0,2.0,5.0,5.0,2.0,2.0,1.0,1.0,2.0,1.0,5.0,2.0,4.0,2.0,4.0,1.0,1.0,5.0,2.0,5.0,2.0,3.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,2.0,1.0,1.0,4.0,2.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1,1.0,1.0,1.0,2.0,tried smoking,drink a lot,3.0,1.0,2.0,5.0,3.0,5.0,3.0,4.0,5.0,1.0,5.0,2.0,4.0,1.0,3.0,5.0,4.0,4.0,1.0,5.0,5,5.0,4.0,2.0,5.0,3.0,2.0,5.0,5.0,1,3.0,3,i am often running late,sometimes,2.0,3.0,4.0,3.0,5.0,3.0,4.0,4.0,3.0,4.0,3.0,2.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,4.0,5.0,3.0,1.0,few hours a day,2.0,4.0,1.0,4.0,3.0,4,2.0,20.0,176.0,67.0,2.0,female,right handed,secondary school,no,city,block of flats
3,5.0,3.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,4.0,2.0,2.0,1.0,2.0,5.0,1.0,2.0,1.0,5.0,4.0,4.0,3.0,3.0,4.0,3.0,1.0,2.0,5.0,1.0,2.0,4.0,4.0,5.0,4.0,1.0,3.0,1.0,2.0,3.0,3.0,5.0,4.0,4.0,2.0,5.0,1.0,5.0,4.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,4.0,3.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,3.0,5.0,5,5.0,4.0,5.0,5.0,former smoker,drink a lot,3.0,4.0,4.0,4.0,5.0,3.0,1.0,3.0,4.0,5.0,2.0,1.0,1.0,5.0,5.0,5.0,5.0,4.0,3.0,3.0,1,1.0,2.0,5.0,5.0,5.0,1.0,5.0,4.0,3,3.0,1,i am often early,only to avoid hurting someone,1.0,1.0,5.0,3.0,1.0,3.0,3.0,2.0,5.0,5.0,4.0,5.0,3.0,3.0,2.0,2.0,1.0,2.0,1.0,1.0,,2.0,4.0,most of the day,2.0,4.0,3.0,3.0,4.0,4,1.0,22.0,172.0,59.0,1.0,female,right handed,college/bachelor degree,yes,city,house/bungalow
4,5.0,3.0,4.0,3.0,2.0,4.0,3.0,5.0,3.0,1.0,2.0,5.0,3.0,2.0,1.0,2.0,4.0,2.0,2.0,5.0,4.0,4.0,5.0,2.0,3.0,3.0,4.0,4.0,3.0,1.0,4.0,3.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,5.0,2.0,3.0,3.0,2.0,3.0,1.0,4.0,4.0,1.0,3.0,1.0,3.0,1.0,4.0,3.0,3.0,3.0,2.0,4.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1,2.0,2.0,4.0,3.0,tried smoking,social drinker,4.0,3.0,1.0,2.0,3.0,5.0,5.0,5.0,4.0,2.0,3.0,3.0,2.0,1.0,3.0,5.0,5.0,5.0,1.0,3.0,1,3.0,3.0,4.0,3.0,5.0,3.0,4.0,5.0,3,3.0,3,i am always on time,everytime it suits me,3.0,4.0,2.0,3.0,3.0,3.0,3.0,5.0,4.0,2.0,3.0,5.0,5.0,2.0,3.0,5.0,3.0,3.0,2.0,4.0,3.0,3.0,3.0,few hours a day,4.0,3.0,4.0,3.0,3.0,2,4.0,20.0,170.0,59.0,1.0,female,right handed,secondary school,no,village,house/bungalow


In [3]:
# Remove columns that are intuitively unrelated to gender, or that do not concern hobbies, phobias, or personality.
data.drop(['Age','Height','Weight','Number of siblings','Left - right handed','Only child','Village - town','House - block of flats'], axis=1, inplace=True)

<h2>Imputing Missing Values</h2>

In [4]:
# Check dataset for missing values
data.isnull().sum()

Music                         3
Slow songs or fast songs      2
Dance                         4
Folk                          5
Country                       5
                             ..
Spending on looks             3
Spending on gadgets           0
Spending on healthy eating    2
Gender                        6
Education                     1
Length: 142, dtype: int64

In [5]:
# Replace missing values with the most frequent value in each column
imp = SimpleImputer(strategy='most_frequent')
# SimpleImputer returns a NumPy array, so rebuild a DataFrame (data2) with the original labels
data2 = pd.DataFrame(imp.fit_transform(data), columns=data.columns, index=data.index)

# Check again for missing values
data2.isnull().sum()

Music                         0
Slow songs or fast songs      0
Dance                         0
Folk                          0
Country                       0
                             ..
Spending on looks             0
Spending on gadgets           0
Spending on healthy eating    0
Gender                        0
Education                     0
Length: 142, dtype: int64
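One caveat about the cell above: the imputer is fit on every column, including Gender, so the rows with a missing label receive the majority class as their target. The following is only a minimal sketch of an alternative, run on a made-up toy frame (the values are illustrative and not taken from the survey): drop the unlabeled rows first, then impute the feature columns only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the survey data (values are made up for illustration)
toy = pd.DataFrame({
    'Music':  [5.0, 5.0, np.nan, 5.0, 3.0],
    'Dance':  [2.0, np.nan, 2.0, 4.0, 2.0],
    'Gender': ['female', 'male', 'female', np.nan, 'male'],
})

# Drop rows whose target is missing instead of guessing the label
toy = toy.dropna(subset=['Gender']).copy()

# Impute only the feature columns with their most frequent value
feature_cols = toy.columns.difference(['Gender'])
imp = SimpleImputer(strategy='most_frequent')
toy[feature_cols] = imp.fit_transform(toy[feature_cols])

print(toy.isnull().sum().sum())  # 0
```

Whether dropping a handful of unlabeled rows is preferable to imputing them depends on the data; with only six missing labels either choice is unlikely to change the results much.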

<h2>Encoding Categorical Features</h2>

<h3>Ordinal Features</h3>

In [6]:
# Output of the different categories in the Smoking feature
data2['Smoking'].unique()

array(['never smoked', 'tried smoking', 'former smoker', 'current smoker'],
      dtype=object)

In [7]:
# Output of the different categories in the Alcohol feature
data2['Alcohol'].unique()

array(['drink a lot', 'social drinker', 'never'], dtype=object)

In [8]:
# Output of the different categories in the Punctuality feature
data2['Punctuality'].unique()

array(['i am always on time', 'i am often early',
       'i am often running late'], dtype=object)

In [9]:
# Output of the different categories in the Internet usage feature
data2['Internet usage'].unique()

array(['few hours a day', 'most of the day', 'less than an hour a day',
       'no time at all'], dtype=object)

In [10]:
# Output of the different categories in the Education feature
data2['Education'].unique()

array(['college/bachelor degree', 'secondary school', 'primary school',
       'masters degree', 'doctorate degree',
       'currently a primary school pupil'], dtype=object)

In [11]:
# Encode the above ordinal features as integers
# (assigning the result back avoids relying on in-place replacement through a column selection)
data2['Smoking'] = data2['Smoking'].replace({'never smoked':0, 'tried smoking':1, 'former smoker':2, 'current smoker':3})
data2['Alcohol'] = data2['Alcohol'].replace({'never':0, 'social drinker':1, 'drink a lot':2})
data2['Punctuality'] = data2['Punctuality'].replace({'i am often running late':0, 'i am often early':1, 'i am always on time':2})
data2['Internet usage'] = data2['Internet usage'].replace({'no time at all':0, 'less than an hour a day':1, 'few hours a day':2, 'most of the day':3})
data2['Education'] = data2['Education'].replace({'currently a primary school pupil':0, 'primary school':1, 'secondary school':2, 'college/bachelor degree':3, 'masters degree':4, 'doctorate degree':5})
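The mapping above can also be derived from an explicit ordering, which makes the intended rank of each category easier to read and catches misspelled labels. This is a minimal sketch; `encode_ordinal` is a hypothetical helper written for this illustration, not part of pandas or of the original notebook.

```python
import pandas as pd

def encode_ordinal(series, ordered_levels):
    """Hypothetical helper: map each label to its position in ordered_levels."""
    mapping = {level: rank for rank, level in enumerate(ordered_levels)}
    encoded = series.map(mapping)
    # map() yields NaN for any label that is missing from ordered_levels
    if encoded.isnull().any():
        unknown = series[encoded.isnull()].unique().tolist()
        raise ValueError(f'Unmapped categories: {unknown}')
    return encoded.astype('int64')

# Same order as the Smoking mapping used in the notebook
smoking_levels = ['never smoked', 'tried smoking', 'former smoker', 'current smoker']
toy_smoking = pd.Series(['never smoked', 'current smoker', 'tried smoking', 'former smoker'])

encoded_smoking = encode_ordinal(toy_smoking, smoking_levels)
print(encoded_smoking.tolist())  # [0, 3, 1, 2]
```

Raising on unmapped labels is a design choice: a silent NaN would otherwise only surface later, for example when casting the frame to int.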

<h3>Binary & Nominal Categorical Features</h3>

In [12]:
# Encode the binary Gender feature as 0 (male) and 1 (female)
data2['Gender'] = data2['Gender'].replace({'male':0, 'female':1})

In [13]:
# Output of the different categories in the Lying feature
data2['Lying'].unique()

array(['never', 'sometimes', 'only to avoid hurting someone',
       'everytime it suits me'], dtype=object)

In [14]:
# Dummy encode the Lying feature
lying_dv = pd.get_dummies(data2['Lying'],prefix='Lying',drop_first=True)

# Make the dummy variables become part of the table and remove the original Lying feature
data2[['Lying_never', 'Lying_only to avoid hurting someone', 'Lying_sometimes']] = lying_dv[['Lying_never', 'Lying_only to avoid hurting someone', 'Lying_sometimes']]
data2.drop('Lying', axis=1, inplace=True)
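The same dummy encoding can be done in one call by passing `columns=` to `pd.get_dummies`, which replaces the original column and appends the dummies. A minimal sketch on a toy frame (the `Music` values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'Lying': ['never', 'sometimes', 'only to avoid hurting someone', 'everytime it suits me'],
    'Music': [5, 4, 5, 5],
})

# drop_first=True drops the alphabetically first category ('everytime it suits me'),
# which then acts as the baseline; dtype=int gives 0/1 columns
toy = pd.get_dummies(toy, columns=['Lying'], prefix='Lying', drop_first=True, dtype=int)

print(toy.columns.tolist())
# ['Music', 'Lying_never', 'Lying_only to avoid hurting someone', 'Lying_sometimes']
```

A row answering 'everytime it suits me' is represented by zeros in all three dummy columns.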

In [15]:
# Later when modelling, I found that some columns had abnormal dtypes
# Check dtypes of columns
data2.dtypes

Music                                  object
Slow songs or fast songs               object
Dance                                  object
Folk                                   object
Country                                object
                                        ...  
Gender                                  int64
Education                               int64
Lying_never                             uint8
Lying_only to avoid hurting someone     uint8
Lying_sometimes                         uint8
Length: 144, dtype: object

In [16]:
# Make all column dtypes into int dtypes
data2 = data2.astype('int')

# Check dtypes again
data2.dtypes

Music                                  int32
Slow songs or fast songs               int32
Dance                                  int32
Folk                                   int32
Country                                int32
                                       ...  
Gender                                 int32
Education                              int32
Lying_never                            int32
Lying_only to avoid hurting someone    int32
Lying_sometimes                        int32
Length: 144, dtype: object

<h2>Feature Selection</h2>

In [17]:
# Split data into features and target
X_data = data2[data2.columns.difference(['Gender'])]
y_data = data2['Gender']

# Find which features are unimportant
importance = mutual_info_classif(X_data, y_data)
df_importance = pd.DataFrame(importance, index=X_data.columns)
unimp = list(df_importance.loc[df_importance[0]==0].index)

# Remove unimportant features
X_data = X_data.drop(unimp, axis=1)
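Two things are worth keeping in mind about the selection step above. `mutual_info_classif` adds a small amount of random noise to the features it treats as continuous, so the set of features scoring exactly 0 can change between runs unless `random_state` is fixed. The scores are also computed on the full data set, so the test rows influence which features are kept. The sketch below, on synthetic data (all names and values are assumptions for illustration), fixes the seed and scores on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic 1-5 survey-like data: one feature depends on the label, two are noise
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    'informative': y * 2 + rng.integers(1, 4, size=n),  # 1-3 for class 0, 3-5 for class 1
    'noise_a': rng.integers(1, 6, size=n),
    'noise_b': rng.integers(1, 6, size=n),
})

# Split first, so the test rows play no part in feature selection
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fixed random_state makes the scores reproducible
mi = pd.Series(mutual_info_classif(X_tr, y_tr, random_state=0), index=X_tr.columns)

# Keep features with a positive score and apply the same choice to both splits
keep = mi[mi > 0].index
X_tr_sel, X_te_sel = X_tr[keep], X_te[keep]
print(mi.sort_values(ascending=False))
```

Applying this to the notebook would change which columns are dropped, so the scores reported further down would not be reproduced exactly.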

<h2>Modelling (on mostly default parameters)</h2>
<h3>Prep Work</h3>

In [18]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)

# Initialize the models
lr = LogisticRegression(max_iter=500)
dtc = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)
knn = KNeighborsClassifier()
svm = SVC(random_state=0)
xgb = XGB.XGBClassifier(random_state=0, verbosity = 0, use_label_encoder=False)

# List of classifiers that will be used
classifiers = [lr, dtc, rfc, knn, svm, xgb]
c_names = ['Logistic Regression Model', 'Decision Tree Classification Model', 'Random Forest Classification Model', 'K-Nearest Neighbors Classification Model',
          'Support-Vector Machine Model', 'XGBoost Classification Model']

<h3>Cross Validation</h3>

In [19]:
# Perform 5-fold cross-validation on each model
# Note: cross_val_predict clones and refits the model on each fold, so the fit() call
# below does not affect the cross-validated predictions; it only leaves each model fitted on the full training set
for c, c2 in zip(classifiers, c_names):
    c.fit(X_train, y_train)
    cross_val_pred = cross_val_predict(c, X_train, y_train, cv=5)
    print('{}:\n{}'.format(c2,classification_report(y_train, cross_val_pred)))

Logistic Regression Model:
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       282
           1       0.90      0.90      0.90       425

    accuracy                           0.88       707
   macro avg       0.88      0.88      0.88       707
weighted avg       0.88      0.88      0.88       707

Decision Tree Classification Model:
              precision    recall  f1-score   support

           0       0.68      0.71      0.70       282
           1       0.80      0.78      0.79       425

    accuracy                           0.75       707
   macro avg       0.74      0.75      0.74       707
weighted avg       0.75      0.75      0.75       707

Random Forest Classification Model:
              precision    recall  f1-score   support

           0       0.91      0.82      0.86       282
           1       0.89      0.94      0.91       425

    accuracy                           0.89       707
   macro avg       0.90     

The support-vector machine model tends to give me the best results.

Note: I tried hyperparameter tuning with GridSearchCV, but it did not improve the results and only made the run time much longer.
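For reference, a small grid search over an SVC could look like the sketch below. The grid values, the scoring metric, and the synthetic data are all assumptions made for illustration; the grid actually tried for this notebook is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the survey features (not the real data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A deliberately small, assumed grid: 3 x 2 = 6 candidates, each fitted 5 times
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5, scoring='f1_macro')
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The run time grows with the number of candidates times the number of folds, which is consistent with the slowdown mentioned in the note; a grid that only contains values close to the defaults may also explain why tuning brought no improvement.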