# Exploratory Data Analysis with Python and Plotly

##### Contents

[1. Load data](#1.-Load-data)

[2. Data Understanding and Cleaning](#2.-Data-Understanding-and-Cleaning)

### 1. Load data

In [1823]:

import pandas as pd

train = pd.read_csv('../data/raw/train.csv') 
test = pd.read_csv('../data/raw/test.csv')


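Concatenate the training and test data so that feature engineering is applied to the test set as well. The `source` column records where each row came from, so the two sets can be separated again later.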

In [1824]:
train['source'] = 'train'
test['source'] = 'test'
combined = pd.concat([train, test], sort=False)
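One side effect of stacking the frames this way is that each keeps its own row labels, which is why `combined.info()` later reports `Index: 1309 entries, 0 to 417`. The sketch below uses small stand-in frames (`train_demo` and `test_demo` are illustrative names, not part of the notebook) to show the duplicated labels and how `ignore_index=True` would avoid them, should unique labels be needed.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are illustrative only)
train_demo = pd.DataFrame({"PassengerId": [1, 2, 3], "source": "train"})
test_demo = pd.DataFrame({"PassengerId": [4, 5], "source": "test"})

# Default concat keeps each frame's own row labels, so 0 and 1 appear twice
stacked = pd.concat([train_demo, test_demo], sort=False)
print(stacked.index.tolist())         # [0, 1, 2, 0, 1]
print(stacked.index.is_unique)        # False

# ignore_index=True relabels the rows 0..n-1
stacked_unique = pd.concat([train_demo, test_demo], sort=False, ignore_index=True)
print(stacked_unique.index.tolist())  # [0, 1, 2, 3, 4]
```

Boolean-mask assignments, which the rest of this notebook relies on, work with either form; label-based lookups such as `combined.loc[0]` return two rows when labels repeat.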

### 2. Data Understanding and Cleaning

#### Understanding

In [1825]:
combined.head(10)



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,source
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,train
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,train
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,train
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,train
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,train
5,6,0.0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,train
6,7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,train
7,8,0.0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,train
8,9,1.0,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,train
9,10,1.0,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,train


In [1826]:
combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  source       1309 non-null   object 
dtypes: float64(3), int64(4), object(6)
memory usage: 143.2+ KB


All of these data types make sense; none need to be converted.

In [1827]:
combined.describe().round(2)


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,1309.0,891.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,655.0,0.38,2.29,29.88,0.5,0.39,33.3
std,378.02,0.49,0.84,14.41,1.04,0.87,51.76
min,1.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,328.0,0.0,2.0,21.0,0.0,0.0,7.9
50%,655.0,0.0,3.0,28.0,0.0,0.0,14.45
75%,982.0,1.0,3.0,39.0,1.0,0.0,31.28
max,1309.0,1.0,3.0,80.0,8.0,9.0,512.33


#### Deal with Missing Values

In [1828]:
missing_values = combined.isnull().sum()
missing_values = missing_values[missing_values > 0]
print("Missing values in combined data:")
print(missing_values)

Missing values in combined data:
Survived     418
Age          263
Fare           1
Cabin       1014
Embarked       2
dtype: int64
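The 418 missing `Survived` values match the size of the test set, which has no labels. A quick way to confirm where the gaps sit is to split the missing share by `source`. The sketch below runs on a toy frame (`demo` is an illustrative name and the values are made up); the same expression should apply to `combined`.

```python
import numpy as np
import pandas as pd

# Toy frame: labels are missing only in the test rows
demo = pd.DataFrame({
    "Survived": [0.0, 1.0, np.nan, np.nan],
    "Age": [22.0, np.nan, 34.5, 47.0],
    "source": ["train", "train", "test", "test"],
})

# Fraction of missing entries per column, split by where the row came from
missing_share = demo.drop(columns="source").isnull().groupby(demo["source"]).mean()
print(missing_share)
```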


Unique values per column:

In [1829]:
unique_values = combined.nunique()
print(unique_values)

for col in combined.columns:
    if combined[col].nunique() < 10:
        print(f"{col}: {combined[col].unique()}") # show unique values for columns with less than 10 unique values


PassengerId    1309
Survived          2
Pclass            3
Name           1307
Sex               2
Age              98
SibSp             7
Parch             8
Ticket          929
Fare            281
Cabin           186
Embarked          3
source            2
dtype: int64
Survived: [ 0.  1. nan]
Pclass: [3 1 2]
Sex: ['male' 'female']
SibSp: [1 0 3 4 2 5 8]
Parch: [0 1 2 5 3 4 6 9]
Embarked: ['S' 'C' 'Q' nan]
source: ['train' 'test']


##### Embarked Column

In [1830]:
print(combined['Embarked'].value_counts())
print(combined['Embarked'].value_counts() / len(combined))


Embarked
S    914
C    270
Q    123
Name: count, dtype: int64
Embarked
S    0.698243
C    0.206264
Q    0.093965
Name: count, dtype: float64


Because about 70% of passengers embarked at Southampton (S), the most common port, it makes sense to fill the two missing values with S.

In [1831]:
print("Before filling:")
print(combined['Embarked'].isnull().sum()) # 2

# Fill missing values
combined['Embarked'] = combined['Embarked'].fillna(combined['Embarked'].mode()[0])

print("\nAfter filling:")
print(combined['Embarked'].isnull().sum()) # should be 0

Before filling:
2

After filling:
0


##### Age Column

In [1832]:
print(missing_values[['Age']] / len(combined))

print(combined['Age'].describe().round(2))

Age    0.200917
dtype: float64
count    1046.00
mean       29.88
std        14.41
min         0.17
25%        21.00
50%        28.00
75%        39.00
max        80.00
Name: Age, dtype: float64


For age, 20% of rows are missing an entry. This is probably an important predictive column, so we will impute the missing values.

First, let's extract a title from each name, because that could be a useful predictor of age.

In [1833]:
# add a new column for the title: the text between the comma and the first period in the name
combined['Title'] = combined['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()

# print the unique values in the title column and the number of occurrences
print(combined['Title'].value_counts())



Title
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Sir               1
Mme               1
Don               1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: count, dtype: int64
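The title sits between the comma and the first period of each name. As an alternative sketch, the same extraction can be written as a single regular expression; the three names are taken from the `head(10)` output above, and `names` and `titles` are illustrative variable names. Note that a split-based extraction keeps the space that follows the comma unless it is stripped.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Split-based extraction: the raw piece still carries the space after the comma
raw = names.str.split(",").str[1].str.split(".").str[0]
print(raw.tolist())     # [' Mr', ' Miss', ' Master']

# Regex variant: skip whitespace after the comma, capture up to the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Miss', 'Master']
```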


Predict the missing values of age using XGBoost

In [1834]:
# Use a gradient-boosted regressor (XGBoost) to predict the missing values of age
# from the columns Pclass, Sex, SibSp, Parch, Embarked, Title and Fare.

import xgboost as xgb
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First, let's prepare the data for XGBoost
# Create a copy of the data we'll use for prediction
age_data = combined[['Age', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Title', 'Fare']].copy()

# Encode categorical variables
le_sex = LabelEncoder()
le_emb = LabelEncoder()
le_title = LabelEncoder()

age_data['Sex'] = le_sex.fit_transform(age_data['Sex'])
age_data['Embarked'] = le_emb.fit_transform(age_data['Embarked'])
age_data['Title'] = le_title.fit_transform(age_data['Title'])

# Split into two sets: known age and unknown age
known_age = age_data[age_data['Age'].notnull()]
unknown_age = age_data[age_data['Age'].isnull()]

# Prepare the training data
X_train = known_age.drop('Age', axis=1)
y_train = known_age['Age']

# Train XGBoost model
xgb_reg = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_reg.fit(X_train, y_train)

# Predict missing ages
X_test = unknown_age.drop('Age', axis=1)
predicted_ages = xgb_reg.predict(X_test)

# Fill the missing values in the original dataframe
combined.loc[combined['Age'].isnull(), 'Age'] = predicted_ages

# Verify the results
print("\nMissing values after prediction:")
print(combined['Age'].isnull().sum())

# Show age distribution statistics after imputation
print("\nAge statistics after imputation:")
print(combined['Age'].describe().round(2))




Missing values after prediction:
0

Age statistics after imputation:
count    1309.00
mean       29.90
std        13.52
min         0.17
25%        21.96
50%        28.12
75%        38.00
max        80.00
Name: Age, dtype: float64
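A simpler baseline can serve as a sanity check on the model-based imputation: fill each missing age with the median age of passengers who share the same title. The sketch below is not what this notebook does; it runs on a toy frame with made-up ages, and `demo`, `group_median` and `baseline_age` are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss"],
    "Age": [30.0, 40.0, np.nan, 2.0, np.nan, np.nan],
})

# Median age within each title; a title with no known ages yields NaN
group_median = demo.groupby("Title")["Age"].transform("median")

# Fill from the title median first, then fall back to the overall median
baseline_age = demo["Age"].fillna(group_median).fillna(demo["Age"].median())
print(baseline_age.tolist())  # [30.0, 40.0, 35.0, 2.0, 2.0, 30.0]
```

If the XGBoost predictions differ greatly from such a baseline for common titles, that would be worth a closer look.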


##### Cabin Column

A cabin entry has two components, a letter and a number. Let's split these into two new columns and drop the original column.

Note that some tickets booked together have multiple cabins recorded. For these we keep only the letter of the first cabin and leave the number as NaN.

In [1835]:
def extract_cabin_info(cabin_entry):
    if pd.isna(cabin_entry):
        return pd.Series([np.nan, np.nan])
    cabins = cabin_entry.split()
    first_cabin = cabins[0]
    cabin_letter = first_cabin[0]
    cabin_number = first_cabin[1:] if len(cabins) == 1 else np.nan
    return pd.Series([cabin_letter, cabin_number])

combined[['Cabin_letter', 'Cabin_number']] = combined['Cabin'].apply(extract_cabin_info)

combined = combined.drop('Cabin', axis=1)

combined.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,source,Title,Cabin_letter,Cabin_number
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,train,Mr,,
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,train,Mrs,C,85.0
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,train,Miss,,
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,train,Mrs,C,123.0
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,train,Mr,,
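The same rule can be traced on a few entries. The sketch below is a vectorized variant of the row-wise function above, shown only as an illustration; the multi-cabin strings are made-up examples of the pattern, and `cabins`, `letter`, `single` and `number` are illustrative names.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", "C23 C25 C27", "F G73", np.nan])

# Letter of the first listed cabin; missing entries stay missing
letter = cabins.str[0]

# Keep the number only when exactly one cabin is listed
single = cabins.str.split().str.len() == 1
number = cabins.str[1:].where(single)

print(pd.DataFrame({"Cabin": cabins, "letter": letter, "number": number}))
```

Only `C85` keeps a number (`'85'`, still a string); the multi-cabin and missing entries end up with NaN in `number`.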


In [1836]:

print(missing_values[['Cabin']] / len(combined))

Cabin    0.774637
dtype: float64


About 77% of cabin letters and 81% of cabin numbers were not recorded. That is too many rows to drop, and the values are hard to predict from our other columns, so we will leave them as NaN.

### 3. Feature Engineering

#### Last Name

In [1837]:
combined['LastName'] = combined['Name'].str.split(',').str[0].str.split('.').str[0]
print(combined['LastName'].value_counts())

LastName
Andersson    11
Sage         11
Goodwin       8
Asplund       8
Davies        7
             ..
Milling       1
Maisner       1
Goncalves     1
Campbell      1
Saether       1
Name: count, Length: 875, dtype: int64


#### Family Size

In [1838]:
# Create family size column
combined['FamilySize'] = combined['SibSp'] + combined['Parch'] + 1  # +1 to include the passenger themselves

# Sort by FamilySize, then LastName, Ticket and Fare to help identify family groups
families = combined.loc[combined['FamilySize'] > 1].sort_values(
    ['FamilySize', 'LastName', 'Ticket', 'Fare'], 
    ascending=[False, False, True, True]
)

families.head(50)




Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,source,Title,Cabin_letter,Cabin_number,LastName,FamilySize
159,160,0.0,3,"Sage, Master. Thomas Henry",male,15.969364,8,2,CA. 2343,69.55,S,train,Master,,,Sage,11
180,181,0.0,3,"Sage, Miss. Constance Gladys",female,23.197212,8,2,CA. 2343,69.55,S,train,Miss,,,Sage,11
201,202,0.0,3,"Sage, Mr. Frederick",male,20.659479,8,2,CA. 2343,69.55,S,train,Mr,,,Sage,11
324,325,0.0,3,"Sage, Mr. George John Jr",male,20.659479,8,2,CA. 2343,69.55,S,train,Mr,,,Sage,11
792,793,0.0,3,"Sage, Miss. Stella Anna",female,23.197212,8,2,CA. 2343,69.55,S,train,Miss,,,Sage,11
846,847,0.0,3,"Sage, Mr. Douglas Bullen",male,20.659479,8,2,CA. 2343,69.55,S,train,Mr,,,Sage,11
863,864,0.0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,23.197212,8,2,CA. 2343,69.55,S,train,Miss,,,Sage,11
188,1080,,3,"Sage, Miss. Ada",female,23.197212,8,2,CA. 2343,69.55,S,test,Miss,,,Sage,11
342,1234,,3,"Sage, Mr. John George",male,39.557892,1,9,CA. 2343,69.55,S,test,Mr,,,Sage,11
360,1252,,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.55,S,test,Master,,,Sage,11


In [1839]:
# count the missing values in each column
combined.isnull().sum()




PassengerId        0
Survived         418
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               1
Embarked           0
source             0
Title              0
Cabin_letter    1014
Cabin_number    1055
LastName           0
FamilySize         0
dtype: int64

This mostly works, with a caveat. `Parch` counts parents and children together, so a parent's row reflects their children rather than their own parents. In the Sage family above, Mr. John George Sage (`SibSp` 1, `Parch` 9) and the children (`SibSp` 8, `Parch` 2) all end up with a family size of 11.

Note that only spouses, siblings, parents and children are counted. Other relatives travelling together, such as grandparents, will not have the correct family size.
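The family-size arithmetic can be checked on a few rows. In the sketch below, the `SibSp`, `Parch` and `Ticket` values are copied from the tables above. The `TicketGroupSize` column is a hypothetical cross-check that is not part of this notebook: it counts the rows that share a ticket, which may catch travelling companions that `SibSp` and `Parch` miss.

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Sage, Mr. John George", "Sage, Master. Thomas Henry", "Allen, Mr. William Henry"],
    "SibSp": [1, 8, 0],
    "Parch": [9, 2, 0],
    "Ticket": ["CA. 2343", "CA. 2343", "373450"],
})

# Same formula as above: siblings/spouses + parents/children + the passenger
demo["FamilySize"] = demo["SibSp"] + demo["Parch"] + 1

# Hypothetical cross-check: number of rows in this frame that share a ticket
demo["TicketGroupSize"] = demo.groupby("Ticket")["Ticket"].transform("count")

print(demo[["Name", "FamilySize", "TicketGroupSize"]])
```

On the full data the ticket count would be computed over all 1309 rows, so the Sage ticket would give a much larger group than the 2 seen in this three-row toy frame.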

#### LastName

The last name is currently our most effective method for grouping families. However, due to its high cardinality, we will encode this feature using target encoding, assigning the mean survival rate of each last name to a new variable, `LastName_encoded`.

In [1840]:
combined["LastName_encoded"] = combined.groupby("LastName")["Survived"].transform("mean")

combined[["LastName", "Survived", "LastName_encoded"]].head(10)

LName_stats = combined.groupby("LastName_encoded").agg(
    Survived_Mean=("Survived", "mean"),
    Count=("Survived", "count")
).sort_values(by="Count", ascending=False)

print(LName_stats)

                  Survived_Mean  Count
LastName_encoded                      
0.000000               0.000000    470
1.000000               1.000000    259
0.500000               0.500000     72
0.666667               0.666667     36
0.333333               0.333333     21
0.750000               0.750000     16
0.222222               0.222222      9
0.250000               0.250000      8
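Two properties of this encoding are worth keeping in mind. A family of one gets an encoded value equal to its own label, and any last name that appears only in the test set gets NaN. A common mitigation, sketched below under assumptions, is to shrink small groups toward the global mean and to use that mean for unseen names. The toy frame, the generic family labels and the `smoothing` value of 2.0 are all illustrative choices, not part of this notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "LastName": ["FamA", "FamA", "FamA", "FamB", "FamB", "FamC"],
    "Survived": [0.0, 0.0, np.nan, 1.0, np.nan, np.nan],
    "source": ["train", "train", "test", "train", "test", "test"],
})

# Use labelled (training) rows only to build the encoding
train_rows = demo[demo["source"] == "train"]
global_mean = train_rows["Survived"].mean()
stats = train_rows.groupby("LastName")["Survived"].agg(["mean", "count"])

# Shrink small groups toward the global mean; `smoothing` is an assumed value
smoothing = 2.0
encoded = (stats["mean"] * stats["count"] + global_mean * smoothing) / (stats["count"] + smoothing)

# Names never seen in training fall back to the global mean
demo["LastName_encoded"] = demo["LastName"].map(encoded).fillna(global_mean)
print(demo)
```

Each training row still contributes to its own encoded value here; an out-of-fold scheme would be needed to remove that dependence entirely.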


In [1841]:
combined = combined.drop('LastName', axis=1)

#### Title

We will also use target encoding for Title, due to its high cardinality.

In [1842]:
combined["Title_encoded"] = combined.groupby("Title")["Survived"].transform("mean")

title_stats = combined.groupby("Title").agg(
    Survived_Mean=("Survived", "mean"),
    Count=("Survived", "count")
).sort_values(by="Count", ascending=False)

print(title_stats)

              Survived_Mean  Count
Title                             
Mr                 0.156673    517
Miss               0.697802    182
Mrs                0.792000    125
Master             0.575000     40
Dr                 0.428571      7
Rev                0.000000      6
Major              0.500000      2
Col                0.500000      2
Mlle               1.000000      2
Sir                1.000000      1
Ms                 1.000000      1
Capt               0.000000      1
Mme                1.000000      1
Lady               1.000000      1
Jonkheer           0.000000      1
Don                0.000000      1
the Countess       1.000000      1
Dona                    NaN      0


In [1843]:
combined = combined.drop('Title', axis=1)

#### Drop columns with high cardinality or non-useful information

In [1844]:
# Calculate original cardinality and data types
cardinality_before = combined.nunique()
dtypes_before = combined.dtypes

# Combine into a DataFrame
cardinality_df_before = pd.DataFrame({
    'Cardinality': cardinality_before,
    'Dtype': dtypes_before
})

print("Cardinality and Data Types (Before Dropping):")
print(cardinality_df_before.sort_values(by="Cardinality", ascending=False))
print("\n")

# Identify high-cardinality columns
high_cardinality = cardinality_before[cardinality_before > 10].index

# Drop only columns with object dtype (numeric columns are kept)
columns_to_drop = [col for col in high_cardinality if pd.api.types.is_object_dtype(combined[col])]

print("Columns to drop (cardinality > 10 and not float):")
for col in columns_to_drop:
    print(f"{col}: {combined[col].dtype}")
# drop the Cabin_
print("\n")

# Drop from the combined frame (covers both train and test rows)
combined.drop(columns=columns_to_drop, inplace=True)

# Calculate new cardinality and data types
cardinality_after = combined.nunique()
dtypes_after = combined.dtypes

# Combine into a DataFrame
cardinality_df_after = pd.DataFrame({
    'Cardinality': cardinality_after,
    'Dtype': dtypes_after
})

print("Cardinality and Data Types (After Dropping):")
print(cardinality_df_after.sort_values(by="Cardinality", ascending=False))

Cardinality and Data Types (Before Dropping):
                  Cardinality    Dtype
PassengerId              1309    int64
Name                     1307   object
Ticket                    929   object
Fare                      281  float64
Age                       224  float64
Cabin_number              101   object
FamilySize                  9    int64
Parch                       8    int64
Cabin_letter                8   object
LastName_encoded            8  float64
Title_encoded               8  float64
SibSp                       7    int64
Pclass                      3    int64
Embarked                    3   object
Sex                         2   object
Survived                    2  float64
source                      2   object


Columns to drop (cardinality > 10 and not float):
Name: object
Ticket: object
Cabin_number: object


Cardinality and Data Types (After Dropping):
                  Cardinality    Dtype
PassengerId              1309    int64
Fare                      

#### Cabin_letter

In [1845]:
# Replace NaNs with 'Missing'
combined['Cabin_letter'] = combined['Cabin_letter'].fillna('Missing')

# Calculate target mean encoding
combined["Cabin_letter_encoded"] = combined.groupby("Cabin_letter")["Survived"].transform("mean")

# Optional: create a stats summary
Cabin_stats = combined.groupby("Cabin_letter").agg(
    Survived_Mean=("Survived", "mean"),
    Count=("Survived", "count")
).sort_values(by="Count", ascending=False)

print(Cabin_stats)

              Survived_Mean  Count
Cabin_letter                      
Missing            0.299854    687
C                  0.593220     59
B                  0.744681     47
D                  0.757576     33
E                  0.750000     32
A                  0.466667     15
F                  0.615385     13
G                  0.500000      4
T                  0.000000      1


In [1846]:
# print average survival rate overall
print(combined["Survived"].mean())

0.3838383838383838


In [1847]:

# drop rows with one missing value

In [1848]:
missing_values = combined.isnull().sum()
print("Missing values in combined data:")
print(missing_values)

Missing values in combined data:
PassengerId               0
Survived                418
Pclass                    0
Sex                       0
Age                       0
SibSp                     0
Parch                     0
Fare                      1
Embarked                  0
source                    0
Cabin_letter              0
FamilySize                0
LastName_encoded        230
Title_encoded             1
Cabin_letter_encoded      0
dtype: int64


In [1849]:
combined.sort_values(by=['source', 'PassengerId'], ascending=[True, True]).head(50)
# count number of rows with source = test
combined[combined['source'] == 'test'].shape[0]


418

In [1850]:
# fill with a numeric sentinel (not the string '-1') so the columns stay numeric
combined['Fare'] = combined['Fare'].fillna(-1)
combined['Title_encoded'] = combined['Title_encoded'].fillna(-1)
combined['LastName_encoded'] = combined['LastName_encoded'].fillna(-1)
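A sentinel of -1 lies outside the observed range of these columns (the minimum `Fare` is 0.0 and survival rates lie in [0, 1]), so it is standardised along with the real values later on. One alternative, shown here only as a sketch on a toy series, is to fill with a central value such as the median. The names `fare`, `filled_sentinel` and `filled_median` are illustrative.

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.2833, np.nan, 8.05], name="Fare")

# Numeric sentinel: the column stays float64, but -1 is not a real fare
filled_sentinel = fare.fillna(-1)

# Assumed alternative: fill with the median of the observed fares
filled_median = fare.fillna(fare.median())

print(filled_sentinel.tolist())  # [7.25, 71.2833, -1.0, 8.05]
print(filled_median.tolist())    # [7.25, 71.2833, 8.05, 8.05]
```

Which option works better depends on the downstream model; tree-based models can often use a sentinel directly, whereas scaled linear models may be more sensitive to it.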


In [1851]:
combined.sort_values(by=['source', 'PassengerId'], ascending=[True, True]).head(50)
# count number of rows with source = test
combined[combined['source'] == 'test'].shape[0]


418

In [1852]:
missing_values = combined.isnull().sum()
print("Missing values in combined data:")
print(missing_values)

Missing values in combined data:
PassengerId               0
Survived                418
Pclass                    0
Sex                       0
Age                       0
SibSp                     0
Parch                     0
Fare                      0
Embarked                  0
source                    0
Cabin_letter              0
FamilySize                0
LastName_encoded          0
Title_encoded             0
Cabin_letter_encoded      0
dtype: int64


### 4. Visual Exploratory Analysis

In [1853]:
import plotly.express as px

fig = px.histogram(combined, x='Age', color='Pclass', barmode='overlay', 
                   facet_col='Sex',
                   title='Age Distribution by Passenger Class and Sex',
                   opacity=0.6)
fig.show()

In [1854]:
fig = px.histogram(combined, x='Cabin_letter', color='Pclass', title='Passengers by Cabin Letter')
fig.show()

In [1855]:
combined = combined.drop('Cabin_letter', axis=1)

In [1856]:
fig = px.box(combined, x='Embarked', y='Fare', color='Embarked', title='Fare by Embarkation Port')
fig.show()


In [1857]:
fig = px.box(combined, x='Pclass', y='Fare', points='all', title='Fare Distribution by Passenger Class')
fig.show()

### Normalise

In [1858]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cols_to_scale = ['Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'LastName_encoded', 'Title_encoded', 'Cabin_letter_encoded']

combined[cols_to_scale] = scaler.fit_transform(combined[cols_to_scale])
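Fitting the scaler on `combined` means the test rows contribute to the mean and standard deviation. If that is a concern, the statistics can be taken from the training rows alone and then applied to every row. The sketch below does this with plain pandas on a toy column (`demo`, `is_train` and `Fare_scaled` are illustrative names); `ddof=0` is used because `StandardScaler` divides by the population standard deviation.

```python
import pandas as pd

demo = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 100.0],
    "source": ["train", "train", "train", "test"],
})

is_train = demo["source"] == "train"

# Statistics from training rows only; ddof=0 matches StandardScaler's population std
mean = demo.loc[is_train, "Fare"].mean()
std = demo.loc[is_train, "Fare"].std(ddof=0)

# Apply the training statistics to every row, train and test alike
demo["Fare_scaled"] = (demo["Fare"] - mean) / std
print(demo)
```

With scikit-learn the equivalent would be to call `fit` on the training rows and `transform` on both sets.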

In [1859]:
missing_values = combined.isnull().sum()
print("Missing values in combined data:")
print(missing_values)

Missing values in combined data:
PassengerId               0
Survived                418
Pclass                    0
Sex                       0
Age                       0
SibSp                     0
Parch                     0
Fare                      0
Embarked                  0
source                    0
FamilySize                0
LastName_encoded          0
Title_encoded             0
Cabin_letter_encoded      0
dtype: int64


### 5. Export Cleaned Data

In [1860]:
train = combined[combined['source'] == 'train'].drop(columns=['source'])
test = combined[combined['source'] == 'test'].drop(columns=['source'])

In [1861]:
train.to_csv('../data/processed/train_cleaned.csv', index=False)
test.to_csv('../data/processed/test_cleaned.csv', index=False)

In [1862]:
combined.sort_values(by=['source', 'LastName_encoded'], ascending=[True, True]).head(400)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,source,FamilySize,LastName_encoded,Title_encoded,Cabin_letter_encoded
1,893,,0.841916,female,1.264771,0.481288,-0.445000,-0.507837,S,test,0.073352,-1.723438,1.425913,-0.513694
2,894,,-0.352091,male,2.374421,-0.479087,-0.445000,-0.455882,Q,test,-0.558346,-1.723438,-0.802416,-0.513694
3,895,,0.841916,male,-0.214761,-0.479087,-0.445000,-0.475697,S,test,-0.558346,-1.723438,-0.802416,-0.513694
8,900,,0.841916,female,-0.880551,-0.479087,-0.445000,-0.503406,C,test,-0.558346,-1.723438,1.425913,-0.513694
10,902,,0.841916,male,-0.172176,-0.479087,-0.445000,-0.490519,S,test,-0.558346,-1.723438,-0.802416,-0.513694
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293,1185,,-1.546098,male,1.708631,0.481288,0.710763,0.939321,S,test,0.705051,1.259457,0.151233,0.534631
302,1194,,-0.352091,male,0.968865,-0.479087,0.710763,-0.237189,S,test,0.073352,1.259457,-0.802416,-0.513694
307,1199,,0.841916,male,-2.150729,-0.479087,0.710763,-0.462407,S,test,0.073352,1.259457,0.664813,-0.513694
308,1200,,-1.546098,male,1.856584,0.481288,0.710763,1.164378,S,test,0.705051,1.259457,-0.802416,2.281800


In [1863]:
missing_values = combined.isnull().sum()
print("Missing values in combined data:")
print(missing_values)

combined.sort_values(by=['source', 'PassengerId'], ascending=[True, True]).head(50)

Missing values in combined data:
PassengerId               0
Survived                418
Pclass                    0
Sex                       0
Age                       0
SibSp                     0
Parch                     0
Fare                      0
Embarked                  0
source                    0
FamilySize                0
LastName_encoded          0
Title_encoded             0
Cabin_letter_encoded      0
dtype: int64


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,source,FamilySize,LastName_encoded,Title_encoded,Cabin_letter_encoded
0,892,,0.841916,male,0.340064,-0.479087,-0.445,-0.491807,Q,test,-0.558346,0.886595,-0.802416,-0.513694
1,893,,0.841916,female,1.264771,0.481288,-0.445,-0.507837,S,test,0.073352,-1.723438,1.425913,-0.513694
2,894,,-0.352091,male,2.374421,-0.479087,-0.445,-0.455882,Q,test,-0.558346,-1.723438,-0.802416,-0.513694
3,895,,0.841916,male,-0.214761,-0.479087,-0.445,-0.475697,S,test,-0.558346,-1.723438,-0.802416,-0.513694
4,896,,0.841916,female,-0.584644,0.481288,0.710763,-0.405619,S,test,0.705051,1.259457,1.425913,-0.513694
5,897,,0.841916,male,-1.176457,-0.479087,-0.445,-0.464823,S,test,-0.558346,-0.23199,-0.802416,-0.513694
6,898,,0.841916,female,0.007169,-0.479087,-0.445,-0.495673,Q,test,-0.558346,1.259457,1.095526,-0.513694
7,899,,-0.352091,male,-0.288738,0.481288,0.710763,-0.082534,S,test,0.705051,1.259457,-0.802416,-0.513694
8,900,,0.841916,female,-0.880551,-0.479087,-0.445,-0.503406,C,test,-0.558346,-1.723438,1.425913,-0.513694
9,901,,0.841916,male,-0.658621,1.441662,-0.445,-0.176294,S,test,0.705051,0.265159,-0.802416,-0.513694
