# Cross-validation applied to logistic regression

In [1]:
import numpy as np
import pandas as pd

### Dataset

The dataset used is the Speed Dating dataset from Kaggle: https://www.kaggle.com/datasets/ulrikthygepedersen/speed-dating

The objective is to predict whether two people matched after their first date.

In [2]:
df=pd.read_csv("speeddating.csv",header=0)

In [3]:
# show_counts replaces the deprecated null_counts argument
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Data columns (total 123 columns):
 #    Column                           Non-Null Count  Dtype  
---   ------                           --------------  -----  
 0    has_null                         8378 non-null   object 
 1    wave                             8378 non-null   float64
 2    gender                           8378 non-null   object 
 3    age                              8283 non-null   float64
 4    age_o                            8274 non-null   float64
 5    d_age                            8378 non-null   float64
 6    d_d_age                          8378 non-null   object 
 7    race                             8378 non-null   object 
 8    race_o                           8378 non-null   object 
 9    samerace                         8378 non-null   object 
 10   importance_same_race             8299 non-null   float64
 11   importance_same_religion         8299 non-null   float64
 12   d_im


See [speeddatingcolumns.txt](speeddatingcolumns.txt) for descriptions of the columns. It is a summary of the description on Kaggle, which is not included in the data download.

From the descriptions it is apparent that the float columns are ratings, perhaps on a 1-10 or 1-100 scale (the maximum values must be reviewed to confirm this). Most of the object-type columns, on the other hand, are groupings (bins) of the numerical values. Other columns, such as gender and match, are also objects. I may try dropping all the object-type columns except gender and match, the latter being the label for training.
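A minimal sketch of the two checks mentioned above: reviewing the value range of the numeric columns, and keeping gender and match while dropping the other object columns. It runs on a small made-up frame (`toy`) rather than on `speeddating.csv`, so the values, and the assumption that match is stored as a string, are illustrative only. The names `toy`, `ranges`, `keep` and `subset` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for speeddating.csv; all values are made up.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "age": [21.0, 24.0, None],
    "d_d_age": ["[0-1]", "[2-3]", "[4-6]"],
    "importance_same_race": [2.0, 8.0, 10.0],
    "match": ["0", "1", "0"],  # assumed to be stored as strings
})

# Min and max of every numeric column, to see which scale each rating uses.
ranges = toy.select_dtypes(include="number").agg(["min", "max"]).T
print(ranges)

# Keep the numeric columns plus the two object columns named in the text.
keep = ["gender", "match"]
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
subset = toy[numeric_cols + keep]
print(subset.columns.tolist())
```

On the real data the same two steps would be applied to `df`. Whether match then needs converting to an integer label depends on how it is stored in the file.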

In [5]:
# Work on a copy so the original frame stays untouched
data = df.copy()
# Keep only the non-object columns
data = data.select_dtypes(exclude=['object'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Data columns (total 59 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   wave                           8378 non-null   float64
 1   age                            8283 non-null   float64
 2   age_o                          8274 non-null   float64
 3   d_age                          8378 non-null   float64
 4   importance_same_race           8299 non-null   float64
 5   importance_same_religion       8299 non-null   float64
 6   pref_o_attractive              8289 non-null   float64
 7   pref_o_sincere                 8289 non-null   float64
 8   pref_o_intelligence            8289 non-null   float64
 9   pref_o_funny                   8280 non-null   float64
 10  pref_o_ambitious               8271 non-null   float64
 11  pref_o_shared_interests        8249 non-null   float64
 12  attractive_o                   8166 non-null   f

It seems that the column interests_correlate already takes all the interests (hobbies) into account, as a correlation with the other person's interests. That is why I will drop the individual interest columns.
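To illustrate why a single correlation value can stand in for the individual interest columns, here is a small sketch. It assumes that interests_correlate is a Pearson correlation between the interest ratings of the two participants, which the column descriptions suggest but this notebook does not verify. The rating vectors are made up.

```python
import numpy as np

# Hypothetical 1-10 interest ratings for two participants (made-up values),
# one entry per hobby, e.g. sports, tvsports, exercise, ...
ratings_a = np.array([9, 2, 8, 9, 1, 1, 5], dtype=float)
ratings_b = np.array([8, 3, 7, 10, 2, 1, 6], dtype=float)

# Pearson correlation between the two rating vectors:
# close to 1 means similar tastes, close to -1 means opposite tastes.
corr = np.corrcoef(ratings_a, ratings_b)[0, 1]
print(round(corr, 2))
```

Under that assumption, the pairwise similarity of interests is already summarized in one number per date, which is the motivation for dropping the raw ratings.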

In [6]:
data.columns

Index(['wave', 'age', 'age_o', 'd_age', 'importance_same_race',
       'importance_same_religion', 'pref_o_attractive', 'pref_o_sincere',
       'pref_o_intelligence', 'pref_o_funny', 'pref_o_ambitious',
       'pref_o_shared_interests', 'attractive_o', 'sinsere_o',
       'intelligence_o', 'funny_o', 'ambitous_o', 'shared_interests_o',
       'attractive_important', 'sincere_important', 'intellicence_important',
       'funny_important', 'ambtition_important', 'shared_interests_important',
       'attractive', 'sincere', 'intelligence', 'funny', 'ambition',
       'attractive_partner', 'sincere_partner', 'intelligence_partner',
       'funny_partner', 'ambition_partner', 'shared_interests_partner',
       'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking',
       'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts',
       'music', 'shopping', 'yoga', 'interests_correlate',
       'expected_happy_with_sd_people', 'expected_num_interested_in_me',
 

In [8]:
# Individual interest (hobby) columns to drop
interests = ['sports', 'tvsports', 'exercise', 'dining', 'museums', 'art',
             'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater',
             'movies', 'concerts', 'music', 'shopping', 'yoga']
data.drop(interests, axis=1, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Data columns (total 42 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   wave                           8378 non-null   float64
 1   age                            8283 non-null   float64
 2   age_o                          8274 non-null   float64
 3   d_age                          8378 non-null   float64
 4   importance_same_race           8299 non-null   float64
 5   importance_same_religion       8299 non-null   float64
 6   pref_o_attractive              8289 non-null   float64
 7   pref_o_sincere                 8289 non-null   float64
 8   pref_o_intelligence            8289 non-null   float64
 9   pref_o_funny                   8280 non-null   float64
 10  pref_o_ambitious               8271 non-null   float64
 11  pref_o_shared_interests        8249 non-null   float64
 12  attractive_o                   8166 non-null   f