# Exploratory Data Analysis on Polycystic Ovary Syndrome (PCOS)

## Introduction

Polycystic ovary syndrome (PCOS) is one of the most common causes of female infertility, affecting as many as 5 million US women of childbearing age. Women who have PCOS produce more male hormones than normal, which may affect their overall health, even past their childbearing years. Symptoms differ from woman to woman, which makes the condition difficult to diagnose. This analysis explores a range of PCOS symptoms to see whether one or more of them are associated with a PCOS diagnosis.

## About the Data

## Questions of Interest 

 - Some research suggests that women who have PCOS are often also insulin resistant, which is associated with weight gain. Are women with a higher weight and BMI more likely to be diagnosed with PCOS? Do they have a higher blood sugar level (RBS mg/dl)?
 - Do the hormone levels lean towards mean male values or mean female values?
 - Does pregnancy have an impact on PCOS?
 - Does PCOS have an impact on infertility?
 - What are the most frequent symptoms among PCOS patients? Is there a pattern?
 - How likely is it that a patient has undiagnosed PCOS?
 - What lifestyle patterns do patients with PCOS have compared with non-PCOS patients?

## Data Inspection

 - Load the data.
 - Describe the data. How many rows? How many columns?
 - Inspect the rows of the data.
 - Inspect the columns of the data.
 - Inspect missing values.

Import the necessary libraries for data analysis:
 - NumPy as np: Used for linear algebra and array math.
 - pandas as pd: Used for data analysis in a tabular structure.
 - Matplotlib.pyplot as plt: Used for plotting data.
 - Seaborn as sns: A statistical data visualization library built on top of Matplotlib.

In [1]:
import numpy as np 
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns

Load the data. 

In [21]:
PCOS_inf = pd.read_csv("C:\\Users\\sarah\\OneDrive\\Documents\\NYCDS_Bootcamp\\Project 1\\PCOS_infertility.csv")

PCOS_woinf = pd.read_csv("C:\\Users\\sarah\\OneDrive\\Documents\\NYCDS_Bootcamp\\Project 1\\PCOS_data_without_infertility.csv")
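
Several column names in these files carry stray whitespace (for example ` Age (yrs)` and `  I   beta-HCG(mIU/mL)` in the outputs further down), which makes them awkward to select by name. The following is a minimal sketch of one way to normalize them right after loading. The helper `clean_column_names` and the `toy` DataFrame are hypothetical and only stand in for the real files.

```python
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names have no leading, trailing, or repeated whitespace."""
    cleaned = df.copy()
    cleaned.columns = cleaned.columns.str.strip().str.replace(r"\s+", " ", regex=True)
    return cleaned


# Toy frame (illustrative only) that mimics the whitespace seen in the real headers.
toy = pd.DataFrame({" Age (yrs)": [28.0], "  I   beta-HCG(mIU/mL)": [1.99]})
toy = clean_column_names(toy)
print(list(toy.columns))
```

The same helper could be applied to `PCOS_inf` and `PCOS_woinf`, with the caveat that any later cell referring to the original, unstripped names would need to be updated.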

Describe the data.

 - PCOS_inf has 541 rows and 6 columns. No NaNs are found.
 - PCOS_woinf has 999 rows and 45 columns as loaded. Its trailing rows are entirely NaN, so the NaNs are counted and removed in the Data Preparation section.

In [22]:
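# Note: without parentheses the method is not called, so the notebook prints the
# bound-method repr (the DataFrame itself) shown below. Use PCOS_inf.describe()
# to get summary statistics.
PCOS_inf.describe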

<bound method NDFrame.describe of      Sl. No  Patient File No.  PCOS (Y/N)    I   beta-HCG(mIU/mL)  \
0         1             10001           0                    1.99   
1         2             10002           0                   60.80   
2         3             10003           1                  494.08   
3         4             10004           0                    1.99   
4         5             10005           0                  801.45   
..      ...               ...         ...                     ...   
536     537             10537           0                    1.99   
537     538             10538           0                   80.13   
538     539             10539           0                    1.99   
539     540             10540           0                  292.92   
540     541             10541           1                    1.99   

     II    beta-HCG(mIU/mL) AMH(ng/mL)  
0                      1.99       2.07  
1                      1.99       1.53  
2             

In [23]:
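# Note: as above, this prints the bound-method repr rather than summary statistics.
# Use PCOS_woinf.describe() to get them.
PCOS_woinf.describe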

<bound method NDFrame.describe of      Sl. No  Patient File No.  PCOS (Y/N)   Age (yrs)  Weight (Kg)  \
0       1.0               1.0         0.0        28.0         44.6   
1       2.0               2.0         0.0        36.0         65.0   
2       3.0               3.0         1.0        33.0         68.8   
3       4.0               4.0         0.0        37.0         65.0   
4       5.0               5.0         0.0        25.0         52.0   
..      ...               ...         ...         ...          ...   
994     NaN               NaN         NaN         NaN          NaN   
995     NaN               NaN         NaN         NaN          NaN   
996     NaN               NaN         NaN         NaN          NaN   
997     NaN               NaN         NaN         NaN          NaN   
998     NaN               NaN         NaN         NaN          NaN   

     Height(Cm)      BMI  Blood Group  Pulse rate(bpm)   RR (breaths/min)  \
0          152.0    19.3         15.0           

In [57]:
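# Note: this cell (execution count 57) ran after the cleaning steps in Data Preparation,
# so the shape below reflects the cleaned DataFrame rather than the raw 999 x 45 file.
PCOS_woinf.shape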

(541, 42)

In [58]:
PCOS_inf.shape

(541, 6)

In [25]:
PCOS_woinf.head()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Fast Food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),Unnamed: 44
0,1.0,1.0,0.0,28.0,44.6,152.0,19.3,15.0,78.0,22.0,...,1.0,0.0,110.0,80.0,3.0,3.0,18.0,18.0,8.5,
1,2.0,2.0,0.0,36.0,65.0,161.5,#NAME?,15.0,74.0,20.0,...,0.0,0.0,120.0,70.0,3.0,5.0,15.0,14.0,3.7,
2,3.0,3.0,1.0,33.0,68.8,165.0,#NAME?,11.0,72.0,18.0,...,1.0,0.0,120.0,80.0,13.0,15.0,18.0,20.0,10.0,
3,4.0,4.0,0.0,37.0,65.0,148.0,#NAME?,13.0,72.0,20.0,...,0.0,0.0,120.0,70.0,2.0,2.0,15.0,14.0,7.5,
4,5.0,5.0,0.0,25.0,52.0,161.0,#NAME?,11.0,72.0,18.0,...,0.0,0.0,120.0,80.0,3.0,4.0,16.0,14.0,7.0,


In [26]:
PCOS_woinf.tail()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Fast Food (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),Unnamed: 44
994,,,,,,,,,,,...,,,,,,,,,,
995,,,,,,,,,,,...,,,,,,,,,,
996,,,,,,,,,,,...,,,,,,,,,,
997,,,,,,,,,,,...,,,,,,,,,,
998,,,,,,,,,,,...,,,,,,,,,,


In [27]:
#Find the first row with all NaNs.
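
The cell above is left as a stub. The following is a minimal sketch of one way to locate the first row in which every column is NaN, using a small hypothetical `toy` DataFrame rather than the real file.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only): rows 2 and 3 are entirely NaN.
toy = pd.DataFrame(
    {
        "Age (yrs)": [28.0, 36.0, np.nan, np.nan],
        "BMI": [19.3, np.nan, np.nan, np.nan],
    }
)

# Boolean Series: True where every column in the row is NaN.
all_nan_rows = toy.isnull().all(axis=1)

# idxmax returns the first index label holding the maximum (True), if any row qualifies.
first_all_nan = all_nan_rows.idxmax() if all_nan_rows.any() else None
print(first_all_nan)
```

Applied to `PCOS_woinf`, the same two lines would show where the block of empty trailing rows begins.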

## Data Preparation

#### Data Cleaning Steps Taken:

 - Count the nulls in both the PCOS_inf and PCOS_woinf DataFrames.
 - Remove the nulls from both DataFrames.
 - Drop the Unnamed columns found in the PCOS_woinf DataFrame.
 - Remove duplicates from both DataFrames.

#### Data Pre-processing Steps Taken:

 - Merge the two DataFrames together using an outer join to retain all values for all rows for analysis.
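
The merge step is not implemented in the notebook yet. The sketch below shows how an outer join could look with `DataFrame.merge`. It assumes `Sl. No` is the shared key, because in the sample outputs `Patient File No.` runs 10001, 10002, ... in PCOS_inf but 1, 2, ... in PCOS_woinf. That choice of key, the suffixes, and the toy frames are assumptions for illustration.

```python
import pandas as pd

# Toy frames (illustrative only) standing in for PCOS_inf and PCOS_woinf.
inf_toy = pd.DataFrame(
    {"Sl. No": [1, 2, 3], "PCOS (Y/N)": [0, 0, 1], "AMH(ng/mL)": [2.07, 1.53, 6.63]}
)
woinf_toy = pd.DataFrame(
    {"Sl. No": [1.0, 2.0, 4.0], "PCOS (Y/N)": [0.0, 0.0, 0.0], "Age (yrs)": [28.0, 36.0, 37.0]}
)

# PCOS_woinf stores the key as float; cast it so both keys share a dtype.
woinf_toy["Sl. No"] = woinf_toy["Sl. No"].astype(int)

merged = inf_toy.merge(
    woinf_toy,
    on="Sl. No",
    how="outer",                   # keep rows that appear in either frame
    suffixes=("_inf", "_woinf"),   # disambiguate columns present in both frames
    indicator=True,                # adds a _merge column: both / left_only / right_only
)
print(merged)
```

The `_merge` column makes it easy to check how many patients appear in only one of the two files before deciding whether an outer join is really needed.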

In [37]:
#Count the nulls in each column of the PCOS_inf DataFrame.

num_nulls_in_PCOS_winf = np.sum(PCOS_inf.isnull(), axis = 0)

num_nulls_in_PCOS_winf

Sl. No                    0
Patient File No.          0
PCOS (Y/N)                0
  I   beta-HCG(mIU/mL)    0
II    beta-HCG(mIU/mL)    0
AMH(ng/mL)                0
dtype: int64

In [38]:
#Find all nulls in PCOS_woinf DataFrame. 

num_nulls_in_PCOS_woinf = np.sum(PCOS_woinf.isnull(), axis=0)

num_nulls_in_PCOS_woinf

Sl. No                    458
Patient File No.          458
PCOS (Y/N)                458
 Age (yrs)                458
Weight (Kg)               458
Height(Cm)                458
BMI                       458
Blood Group               458
Pulse rate(bpm)           458
RR (breaths/min)          458
Hb(g/dl)                  458
Cycle(R/I)                458
Cycle length(days)        458
Marraige Status (Yrs)     459
Pregnant(Y/N)             458
No. of aborptions         458
  I   beta-HCG(mIU/mL)    458
II    beta-HCG(mIU/mL)    458
FSH(mIU/mL)               458
LH(mIU/mL)                458
FSH/LH                    458
Hip(inch)                 458
Waist(inch)               458
Waist:Hip Ratio           458
TSH (mIU/L)               458
AMH(ng/mL)                458
PRL(ng/mL)                458
Vit D3 (ng/mL)            458
PRG(ng/mL)                458
RBS(mg/dl)                458
Weight gain(Y/N)          458
hair growth(Y/N)          458
Skin darkening (Y/N)      458
Hair loss(

In [47]:
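#Drop rows that are entirely NaN, then drop every column that still contains at least one NaN.
#Note: how = 'any' on columns removes a whole column for a single missing value
#(for example, Marraige Status (Yrs) no longer appears in the output below).

PCOS_woinf = PCOS_woinf.dropna(axis = 0, how = 'all').dropna(axis = 1, how = 'any')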

#Check to see if nulls were removed. 

num_nulls_in_PCOS_woinf = np.sum(PCOS_woinf.isnull(), axis=0)

num_nulls_in_PCOS_woinf

Sl. No                    0
Patient File No.          0
PCOS (Y/N)                0
 Age (yrs)                0
Weight (Kg)               0
Height(Cm)                0
BMI                       0
Blood Group               0
Pulse rate(bpm)           0
RR (breaths/min)          0
Hb(g/dl)                  0
Cycle(R/I)                0
Cycle length(days)        0
Pregnant(Y/N)             0
No. of aborptions         0
  I   beta-HCG(mIU/mL)    0
II    beta-HCG(mIU/mL)    0
FSH(mIU/mL)               0
LH(mIU/mL)                0
FSH/LH                    0
Hip(inch)                 0
Waist(inch)               0
Waist:Hip Ratio           0
TSH (mIU/L)               0
AMH(ng/mL)                0
PRL(ng/mL)                0
Vit D3 (ng/mL)            0
PRG(ng/mL)                0
RBS(mg/dl)                0
Weight gain(Y/N)          0
hair growth(Y/N)          0
Skin darkening (Y/N)      0
Hair loss(Y/N)            0
Acne (Y/N)                0
Reg.Exercise(Y/N)         0
BP _Systolic (mmHg) 

In [48]:
#Remove the unnamed columns from PCOS_woinf DataFrame. 

PCOS_woinf = PCOS_woinf.loc[:, ~PCOS_woinf.columns.str.contains('^Unnamed')]

In [53]:
#Count how often each unique row occurs in the PCOS_inf DataFrame; a size greater than 1 indicates a duplicate.

PCOS_inf.groupby(PCOS_inf.columns.tolist(),as_index=False).size()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),I beta-HCG(mIU/mL),II beta-HCG(mIU/mL),AMH(ng/mL),size
0,1,10001,0,1.99,1.99,2.07,1
1,2,10002,0,60.80,1.99,1.53,1
2,3,10003,1,494.08,494.08,6.63,1
3,4,10004,0,1.99,1.99,1.22,1
4,5,10005,0,801.45,801.45,2.26,1
...,...,...,...,...,...,...,...
536,537,10537,0,1.99,1.99,1.7,1
537,538,10538,0,80.13,1.99,5.6,1
538,539,10539,0,1.99,1.99,3.7,1
539,540,10540,0,292.92,1.99,5.2,1


In [54]:
PCOS_woinf.groupby(PCOS_woinf.columns.tolist(),as_index=False).size()

Unnamed: 0,Sl. No,Patient File No.,PCOS (Y/N),Age (yrs),Weight (Kg),Height(Cm),BMI,Blood Group,Pulse rate(bpm),RR (breaths/min),...,Acne (Y/N),Reg.Exercise(Y/N),BP _Systolic (mmHg),BP _Diastolic (mmHg),Follicle No. (L),Follicle No. (R),Avg. F size (L) (mm),Avg. F size (R) (mm),Endometrium (mm),size
0,1.0,1.0,0.0,28.0,44.6,152.000,19.3,15.0,78.0,22.0,...,0.0,0.0,110.0,80.0,3.0,3.0,18.0,18.0,8.5,1
1,2.0,2.0,0.0,36.0,65.0,161.500,#NAME?,15.0,74.0,20.0,...,0.0,0.0,120.0,70.0,3.0,5.0,15.0,14.0,3.7,1
2,3.0,3.0,1.0,33.0,68.8,165.000,#NAME?,11.0,72.0,18.0,...,1.0,0.0,120.0,80.0,13.0,15.0,18.0,20.0,10.0,1
3,4.0,4.0,0.0,37.0,65.0,148.000,#NAME?,13.0,72.0,20.0,...,0.0,0.0,120.0,70.0,2.0,2.0,15.0,14.0,7.5,1
4,5.0,5.0,0.0,25.0,52.0,161.000,#NAME?,11.0,72.0,18.0,...,0.0,0.0,120.0,80.0,3.0,4.0,16.0,14.0,7.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536,537.0,537.0,0.0,35.0,50.0,164.592,18.5,17.0,72.0,16.0,...,0.0,0.0,110.0,70.0,1.0,0.0,17.5,10.0,6.7,1
537,538.0,538.0,0.0,30.0,63.2,158.000,25.3,15.0,72.0,18.0,...,0.0,0.0,110.0,70.0,9.0,7.0,19.0,18.0,8.2,1
538,539.0,539.0,0.0,36.0,54.0,152.000,23.4,13.0,74.0,20.0,...,0.0,0.0,110.0,80.0,1.0,0.0,18.0,9.0,7.3,1
539,540.0,540.0,0.0,27.0,50.0,150.000,22.2,15.0,74.0,20.0,...,1.0,0.0,110.0,70.0,7.0,6.0,18.0,16.0,11.5,1


In [50]:
#Remove duplicates in the PCOS_inf DataFrame.

In [None]:
#Remove duplicates in the PCOS_woinf DataFrame.
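
The two duplicate-removal cells are still empty. A minimal sketch of how they might be filled in follows, using `DataFrame.duplicated` to count fully repeated rows and `DataFrame.drop_duplicates` to remove them. The `toy` frame is hypothetical.

```python
import pandas as pd

# Toy frame (illustrative only): the last row repeats the second one exactly.
toy = pd.DataFrame(
    {
        "Sl. No": [1, 2, 2],
        "PCOS (Y/N)": [0, 1, 1],
        "AMH(ng/mL)": [2.07, 6.63, 6.63],
    }
)

# duplicated() flags every repeat of an earlier row; the sum is the duplicate count.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```

In the groupby outputs above every `size` shown is 1, so on the real data this step may turn out to be a no-op.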

In [51]:
#replace the BMI column with the correct calc
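
The BMI column contains `#NAME?` entries, which look like spreadsheet formula errors. A minimal sketch of recomputing it from weight and height follows, using the standard definition BMI = weight in kg divided by the square of height in metres. The `toy` frame reuses the first three rows shown in the `head()` output; rounding to one decimal is an assumption.

```python
import pandas as pd

# First three rows from the head() output above, with the broken BMI entries.
toy = pd.DataFrame(
    {
        "Weight (Kg)": [44.6, 65.0, 68.8],
        "Height(Cm)": [152.0, 161.5, 165.0],
        "BMI": ["19.3", "#NAME?", "#NAME?"],
    }
)

# BMI = kg / m^2, so convert centimetres to metres first.
height_m = toy["Height(Cm)"] / 100
toy["BMI"] = (toy["Weight (Kg)"] / height_m**2).round(1)
print(toy["BMI"].tolist())
```

The recomputed value for the first row matches the 19.3 already present in the file, which suggests the formula agrees with how the column was originally derived.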

In [None]:
#replace FSH / LH column to correct calc

In [None]:
#replace Waist to Hip ratio to correct calc
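
The two ratio columns (FSH/LH and Waist:Hip Ratio) can be recomputed from their component columns. The sketch below assumes the column names mean FSH divided by LH and waist divided by hip. All values in `toy` are made up for illustration, and treating a zero denominator as missing is an assumption.

```python
import numpy as np
import pandas as pd

# Toy values (made up, illustrative only).
toy = pd.DataFrame(
    {
        "FSH(mIU/mL)": [8.0, 6.0, 5.0],
        "LH(mIU/mL)": [4.0, 0.0, 2.5],
        "Waist(inch)": [30.0, 32.0, 36.0],
        "Hip(inch)": [36.0, 38.0, 40.0],
    }
)

# Treat a zero LH reading as missing so the ratio is NaN instead of infinity.
lh = toy["LH(mIU/mL)"].replace(0, np.nan)
toy["FSH/LH"] = (toy["FSH(mIU/mL)"] / lh).round(2)

toy["Waist:Hip Ratio"] = (toy["Waist(inch)"] / toy["Hip(inch)"]).round(2)
print(toy[["FSH/LH", "Waist:Hip Ratio"]])
```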

In [None]:
#convert height to feet and inches

In [30]:
#convert weight to lbs
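
The height and weight conversions planned in the two cells above could be sketched as follows. The new column names `Height (ft/in)` and `Weight (lb)`, the helper `cm_to_feet_inches`, and the toy values are assumptions. The conversion factors are 2.54 cm per inch and roughly 2.20462 lb per kg.

```python
import pandas as pd

CM_PER_INCH = 2.54
LB_PER_KG = 2.20462


def cm_to_feet_inches(height_cm: float) -> str:
    """Format a height in centimetres as feet and inches, e.g. '5 ft 5.0 in'."""
    # Round the total first so the inches part can never display as 12.0.
    total_inches = round(height_cm / CM_PER_INCH, 1)
    feet, inches = divmod(total_inches, 12)
    return f"{int(feet)} ft {inches:.1f} in"


# Toy values (illustrative only).
toy = pd.DataFrame({"Height(Cm)": [152.4, 165.0], "Weight (Kg)": [44.6, 65.0]})
toy["Height (ft/in)"] = toy["Height(Cm)"].apply(cm_to_feet_inches)
toy["Weight (lb)"] = (toy["Weight (Kg)"] * LB_PER_KG).round(1)
print(toy)
```

Keeping the original metric columns alongside the converted ones avoids losing precision if BMI is recomputed later.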

In [31]:
#convert blood group to strings
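
The Blood Group column holds numeric codes (15.0, 11.0, 13.0 in the rows shown above). A minimal sketch of mapping them to labels follows. The code-to-label dictionary is an assumption based on the coding usually described for this dataset (11 = A+ through 18 = AB-) and should be checked against the dataset's own data dictionary before use.

```python
import pandas as pd

# ASSUMED coding; verify against the dataset documentation.
BLOOD_GROUP_CODES = {
    11: "A+", 12: "A-", 13: "B+", 14: "B-",
    15: "O+", 16: "O-", 17: "AB+", 18: "AB-",
}

# Codes from the first five rows of the head() output above.
toy = pd.DataFrame({"Blood Group": [15.0, 15.0, 11.0, 13.0, 11.0]})

# Cast the float codes to int, then map each code to its label.
# Codes missing from the dictionary would become NaN and need a follow-up check.
toy["Blood Group"] = toy["Blood Group"].astype(int).map(BLOOD_GROUP_CODES)
print(toy["Blood Group"].tolist())
```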

In [None]:
#convert blood pressure to correct format (systolic over diastolic)
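
One way to present blood pressure as systolic over diastolic is to join the two existing columns into a single string column. The sketch below assumes a new column named `BP (mmHg)` and uses values from the `head()` output above.

```python
import pandas as pd

# Values from the first two rows of the head() output above.
toy = pd.DataFrame(
    {
        "BP _Systolic (mmHg)": [110.0, 120.0],
        "BP _Diastolic (mmHg)": [80.0, 70.0],
    }
)

# Cast to int to drop the trailing ".0", then join as "systolic/diastolic".
systolic = toy["BP _Systolic (mmHg)"].astype(int).astype(str)
diastolic = toy["BP _Diastolic (mmHg)"].astype(int).astype(str)
toy["BP (mmHg)"] = systolic + "/" + diastolic
print(toy["BP (mmHg)"].tolist())
```

The combined string is convenient for display, while the original numeric columns remain the better input for statistics and plots.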

## Exploratory Data Analysis

 - Calculate descriptive statistics.
 - Calculate correlation statistics and draw initial inferences for the research questions.
 - Create a test case.
 - Calculate the statistical significance of the inferences.
 - Create supporting visualizations.

In [None]:
#descriptive stats for each column
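
The descriptive statistics cell is still a stub. The sketch below shows one possible shape for it: overall summaries via `describe()` (called with parentheses) and a per-group summary split on the diagnosis column. The `toy` values are made up for illustration.

```python
import pandas as pd

# Toy values (made up, illustrative only).
toy = pd.DataFrame(
    {
        "PCOS (Y/N)": [0, 0, 1, 0, 1, 1],
        "Age (yrs)": [28.0, 36.0, 33.0, 37.0, 25.0, 30.0],
        "Weight (Kg)": [44.6, 65.0, 68.8, 52.0, 70.0, 71.2],
    }
)

# Count, mean, std, min, quartiles, and max for every numeric column.
overall = toy.describe()

# The same idea split by diagnosis, restricted to a few summary statistics.
by_group = toy.groupby("PCOS (Y/N)")[["Age (yrs)", "Weight (Kg)"]].agg(
    ["mean", "median", "std"]
)
print(overall)
print(by_group)
```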

In [None]:
#stats/graphs for each question
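
As a starting point for the first question of interest (weight, BMI, and blood sugar), the sketch below ranks numeric columns by their Pearson correlation with the diagnosis column. With a 0/1 target this is the point-biserial correlation. The `toy` values are made up, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy values (made up, illustrative only).
toy = pd.DataFrame(
    {
        "PCOS (Y/N)": [0, 0, 1, 0, 1, 1],
        "BMI": [19.3, 22.2, 25.3, 23.4, 27.0, 29.1],
        "RBS(mg/dl)": [92.0, 88.0, 105.0, 95.0, 110.0, 120.0],
    }
)

# Correlation of every other column with the diagnosis, strongest first.
correlations = (
    toy.corr()["PCOS (Y/N)"]
    .drop("PCOS (Y/N)")
    .sort_values(ascending=False)
)
print(correlations)
```

A correlation ranking only suggests where to look. A significance test and a plot per question would still be needed before drawing conclusions.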

## Results

Summarize analysis. Describe outcomes of the analysis and next steps. How would you take this analysis further? How can this analysis be applied to different industries and business problems?

### References

 - https://www.cdc.gov/diabetes/basics/pcos.html#:~:text=What%20is%20PCOS%3F,US%20women%20of%20reproductive%20age.
 
 - https://www.mayoclinic.org/diseases-conditions/pcos/symptoms-causes/syc-20353439?p=1
 
 - https://medlineplus.gov/polycysticovarysyndrome.html 