# 2 Data Wrangling - "Of Genomes And Genetics"

## Import

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## Load The Data

In [10]:
# Folder containing the raw CSV files (absolute path on the author's machine)
raw_data_dir = '/Users/serenahy/Documents/GitHub/DataScienceCapstone-OfGenomesAndGenetics/raw_data'
train_df = pd.read_csv(os.path.join(raw_data_dir, 'train.csv'))
test_df = pd.read_csv(os.path.join(raw_data_dir, 'test.csv'))
sample_submission_df = pd.read_csv(os.path.join(raw_data_dir, 'sample_submission.csv'))

## Data Collection

In [15]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22083 entries, 0 to 22082
Data columns (total 45 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Patient Id                                        22083 non-null  object 
 1   Patient Age                                       20656 non-null  float64
 2   Genes in mother's side                            22083 non-null  object 
 3   Inherited from father                             21777 non-null  object 
 4   Maternal gene                                     19273 non-null  object 
 5   Paternal gene                                     22083 non-null  object 
 6   Blood cell count (mcL)                            22083 non-null  float64
 7   Patient First Name                                22083 non-null  object 
 8   Family Name                                       12392 non-null  object 
 9   Father's name    

In [16]:
train_df.head()

| | Patient Id | Patient Age | Genes in mother's side | Inherited from father | Maternal gene | Paternal gene | Blood cell count (mcL) | Patient First Name | Family Name | Father's name | ... | Birth defects | White Blood cell count (thousand per microliter) | Blood test result | Symptom 1 | Symptom 2 | Symptom 3 | Symptom 4 | Symptom 5 | Genetic Disorder | Disorder Subclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PID0x6418 | 2.0 | Yes | No | Yes | No | 4.760603 | Richard | NaN | Larre | ... | NaN | 9.857562 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | Mitochondrial genetic inheritance disorders | Leber's hereditary optic neuropathy |
| 1 | PID0x25d5 | 4.0 | Yes | Yes | No | No | 4.910669 | Mike | NaN | Brycen | ... | Multiple | 5.52256 | normal | 1.0 | NaN | 1.0 | 1.0 | 0.0 | NaN | Cystic fibrosis |
| 2 | PID0x4a82 | 6.0 | Yes | No | No | No | 4.893297 | Kimberly | NaN | Nashon | ... | Singular | NaN | normal | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | Multifactorial genetic inheritance disorders | Diabetes |
| 3 | PID0x4ac8 | 12.0 | Yes | No | Yes | No | 4.70528 | Jeffery | Hoelscher | Aayaan | ... | Singular | 7.919321 | inconclusive | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | Mitochondrial genetic inheritance disorders | Leigh syndrome |
| 4 | PID0x1bf7 | 11.0 | Yes | No | NaN | Yes | 4.720703 | Johanna | Stutzman | Suave | ... | Multiple | 4.09821 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | Multifactorial genetic inheritance disorders | Cancer |


#### Train Dataset
- The training dataset has 22,083 rows and 45 columns, covering Patient Id, Patient Age, genetic information, blood test results, symptoms, and the two targets (Genetic Disorder and Disorder Subclass). Many columns have missing values, including the targets: row 1 above has no Genetic Disorder.
- It mixes numerical and categorical data, with features that describe genetic traits, family medical history, and patient symptoms.
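The numerical/categorical split described above can be checked programmatically. The following is a minimal sketch on a hypothetical toy frame: the column names come from the dataset, but the values are made up for illustration. On the real data the same calls would be applied to `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df: real column names, illustrative values.
toy_df = pd.DataFrame({
    "Patient Age": [2.0, 4.0, 6.0],
    "Blood cell count (mcL)": [4.76, 4.91, 4.89],
    "Maternal gene": ["Yes", "No", "No"],
    "Birth defects": [None, "Multiple", "Singular"],
})

# Columns with an integer or float dtype.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical/text in this sketch.
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Distinct non-null values per categorical column; a low count suggests
# a true categorical feature rather than free text.
cardinality = toy_df[categorical_cols].nunique()

print(numeric_cols)
print(categorical_cols)
print(cardinality)
```

Dtype alone can mislead: in the `train_df.head()` output the Symptom columns are stored as floats but only hold 0.0 and 1.0 in the rows shown, so they may be better treated as binary flags.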

In [17]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9465 entries, 0 to 9464
Data columns (total 43 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Patient Id                                        9465 non-null   object 
 1   Patient Age                                       9465 non-null   int64  
 2   Genes in mother's side                            9465 non-null   object 
 3   Inherited from father                             8914 non-null   object 
 4   Maternal gene                                     5742 non-null   object 
 5   Paternal gene                                     9465 non-null   object 
 6   Blood cell count (mcL)                            9465 non-null   float64
 7   Patient First Name                                9465 non-null   object 
 8   Family Name                                       148 non-null    object 
 9   Father's name      

In [13]:
test_df.head()

| | Patient Id | Patient Age | Genes in mother's side | Inherited from father | Maternal gene | Paternal gene | Blood cell count (mcL) | Patient First Name | Family Name | Father's name | ... | History of anomalies in previous pregnancies | No. of previous abortion | Birth defects | White Blood cell count (thousand per microliter) | Blood test result | Symptom 1 | Symptom 2 | Symptom 3 | Symptom 4 | Symptom 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PID0x4175 | 6 | No | Yes | No | No | 4.981655 | Charles | NaN | Kore | ... | -99 | 2 | Multiple | -99.0 | slightly abnormal | True | True | True | True | True |
| 1 | PID0x21f5 | 10 | Yes | No | NaN | Yes | 5.11889 | Catherine | NaN | Homero | ... | Yes | -99 | Multiple | 8.179584 | normal | False | False | False | True | False |
| 2 | PID0x49b8 | 5 | No | NaN | No | No | 4.876204 | James | NaN | Danield | ... | No | 0 | Singular | -99.0 | slightly abnormal | False | False | True | True | False |
| 3 | PID0x2d97 | 13 | No | Yes | Yes | No | 4.687767 | Brian | NaN | Orville | ... | Yes | -99 | Singular | 6.884071 | normal | True | False | True | False | True |
| 4 | PID0x58da | 5 | No | NaN | NaN | Yes | 5.152362 | Gary | NaN | Issiah | ... | No | -99 | Multiple | 6.195178 | normal | True | True | True | True | False |


#### Test Dataset
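- The test dataset (9,465 rows, 43 columns) has the same features as the training dataset but omits the two target variables, Genetic Disorder and Disorder Subclass.
- Some encodings differ from the training data: Patient Age is int64 rather than float64, the Symptom columns hold True/False rather than 1.0/0.0, and several columns contain -99, which appears to be a placeholder for missing values.
- The goal is to predict Genetic Disorder and Disorder Subclass for these patients from the provided features.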
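Two structural points about the test data can be verified in code: which columns the training set has that the test set lacks, and how often the value `-99` seen in the `test_df.head()` output occurs. The sketch below uses small hypothetical frames with real column names. Treating `-99` as a missing-value placeholder is an assumption based on the output above, not something the data documentation confirms here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df and test_df (illustrative values only).
toy_train = pd.DataFrame({
    "Patient Id": ["PID0x1", "PID0x2"],
    "Patient Age": [2.0, np.nan],
    "History of anomalies in previous pregnancies": ["Yes", "No"],
    "White Blood cell count (thousand per microliter)": [9.857562, 5.52256],
    "Genetic Disorder": ["A", "B"],
    "Disorder Subclass": ["a", "b"],
})
toy_test = pd.DataFrame({
    "Patient Id": ["PID0x3", "PID0x4"],
    "Patient Age": [6, 10],
    "History of anomalies in previous pregnancies": ["-99", "Yes"],
    "White Blood cell count (thousand per microliter)": [-99.0, 8.179584],
})

# Columns present in train but absent from test (expected: the targets).
only_in_train = [c for c in toy_train.columns if c not in toy_test.columns]

# Assumed sentinel: match both the numeric and the string form, since
# text columns would hold "-99" as a string after reading the CSV.
sentinel_mask = toy_test.eq(-99) | toy_test.eq("-99")
sentinel_counts = sentinel_mask.sum()
sentinel_counts = sentinel_counts[sentinel_counts > 0]

# Replace the sentinel with NaN so isnull() reports it as missing.
toy_test_clean = toy_test.mask(sentinel_mask)

print(only_in_train)
print(sentinel_counts)
print(toy_test_clean.isnull().sum())
```

If the assumption holds, `pd.read_csv` can also apply the conversion at load time through its `na_values` parameter.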

In [20]:
sample_submission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Patient Id         5 non-null      object
 1   Genetic Disorder   5 non-null      object
 2   Disorder Subclass  5 non-null      object
dtypes: object(3)
memory usage: 252.0+ bytes


In [14]:
sample_submission_df.head()

| | Patient Id | Genetic Disorder | Disorder Subclass |
|---|---|---|---|
| 0 | PID0x6418 | Mitochondrial genetic inheritance disorders | Leber's hereditary optic neuropathy |
| 1 | PID0x25d5 | Single-gene inheritance diseases | Cystic fibrosis |
| 2 | PID0x4a82 | Multifactorial genetic inheritance disorders | Diabetes |
| 3 | PID0x4ac8 | Mitochondrial genetic inheritance disorders | Leigh syndrome |
| 4 | PID0x1bf7 | Multifactorial genetic inheritance disorders | Cancer |


#### Sample Submission Dataset
- This dataset provides a format for submitting predictions, including Patient Id, Genetic Disorder, and Disorder Subclass.
- It serves as a template indicating how the predictions should be structured for submission.
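Because the template fixes the expected layout, a predictions frame can be checked against it before submission. The sketch below is a hypothetical helper (`check_submission` is not part of the notebook), and the three checks it runs are assumptions about what a valid submission needs, not official rules.

```python
import pandas as pd

# Column names as listed by sample_submission_df.info().
TEMPLATE_COLUMNS = ["Patient Id", "Genetic Disorder", "Disorder Subclass"]


def check_submission(pred_df):
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper: the checks are assumed, not official requirements.
    """
    if list(pred_df.columns) != TEMPLATE_COLUMNS:
        # Remaining checks rely on the column names, so stop here.
        return ["columns do not match the template"]
    problems = []
    if pred_df["Patient Id"].duplicated().any():
        problems.append("duplicate Patient Id values")
    if pred_df.isnull().any().any():
        problems.append("missing predictions")
    return problems


# Two rows copied from the sample submission shown above.
good = pd.DataFrame({
    "Patient Id": ["PID0x6418", "PID0x25d5"],
    "Genetic Disorder": [
        "Mitochondrial genetic inheritance disorders",
        "Single-gene inheritance diseases",
    ],
    "Disorder Subclass": [
        "Leber's hereditary optic neuropathy",
        "Cystic fibrosis",
    ],
})

# Same frame with one prediction blanked out.
bad = good.copy()
bad.loc[0, "Genetic Disorder"] = None

print(check_submission(good))
print(check_submission(bad))
```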

## Data Organization

## Data Definition

In [21]:
data_types = train_df.dtypes
missing_values = train_df.isnull().sum()
missing_percentage = (missing_values / len(train_df)) * 100

In [22]:
data_summary = pd.DataFrame({
    'Data Type': data_types,
    'Missing Values': missing_values,
    'Missing Percentage (%)': missing_percentage
})

data_summary.sort_values(by="Missing Percentage (%)", ascending=False)


| | Data Type | Missing Values | Missing Percentage (%) |
|---|---|---|---|
| Family Name | object | 9691 | 43.884436 |
| Mother's age | float64 | 6036 | 27.333243 |
| Father's age | float64 | 5986 | 27.106824 |
| Institute Name | object | 5106 | 23.121858 |
| Maternal gene | object | 2810 | 12.72472 |
| Symptom 2 | float64 | 2222 | 10.062039 |
| H/O substance abuse | object | 2195 | 9.939773 |
| Gender | object | 2173 | 9.840149 |
| History of anomalies in previous pregnancies | object | 2172 | 9.83562 |
| Test 5 | float64 | 2170 | 9.826563 |


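### Observations
- High Missing Values: Family Name has the highest missing rate (about 43.9%), followed by Mother's age (27.3%), Father's age (27.1%), and Institute Name (23.1%). No column exceeds 50%.
- Moderate Missing Values: Maternal gene is missing in about 12.7% of rows, and the remaining columns in the table above are each missing in roughly 10% of rows.
- Data Types: The dataset contains object (string) and float64 (numerical) columns. Many object columns, such as those holding Yes/No values, are really categorical features stored as text.
- Columns with No Missing Values: According to `train_df.info()`, Patient Id, Genes in mother's side, Paternal gene, Blood cell count (mcL), and Patient First Name are complete.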
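The summary built in the cells above can be wrapped in a function so that the same report runs on both `train_df` and `test_df`. The sketch below is one way to do that. `missing_summary` is a hypothetical helper name, the toy frame uses illustrative values, and the 40% cutoff is an arbitrary example rather than a recommended threshold.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Per-column dtype, missing count, and missing percentage,
    sorted from most to least missing."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "Data Type": df.dtypes,
        "Missing Values": missing,
        "Missing Percentage (%)": missing / len(df) * 100,
    })
    return summary.sort_values("Missing Percentage (%)", ascending=False)


# Toy stand-in for train_df (illustrative values only).
toy_df = pd.DataFrame({
    "Patient Id": ["PID0x1", "PID0x2", "PID0x3", "PID0x4"],
    "Patient Age": [2.0, 4.0, np.nan, 12.0],
    "Family Name": [None, None, "Hoelscher", "Stutzman"],
})

summary = missing_summary(toy_df)

# Columns whose missing rate exceeds an illustrative 40% cutoff.
high_missing = summary.index[summary["Missing Percentage (%)"] > 40].tolist()

print(summary)
print(high_missing)
```

Columns flagged this way are candidates for dropping or for a dedicated imputation strategy, depending on how informative they are.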

## Data Cleaning