# Exploratory Data Analysis on Polycystic Ovary Syndrome (PCOS)

## Introduction

Polycystic ovary syndrome, or PCOS, is one of the most common causes of female infertility, affecting as many as 5 million US women of childbearing age. Women who have PCOS produce more male hormones than normal, which may impact their overall health, even past their childbearing years. Symptoms can be different for every woman, which makes PCOS very difficult to diagnose. This analysis explores various PCOS symptoms or variables that increase the likelihood of a PCOS diagnosis or infertility. In this analysis, I would like to explore three specific questions:

 1. Are there any features that are correlated with PCOS?
 2. What are the most frequent symptoms PCOS patients exhibit?
 3. Do non-PCOS patients exhibit similar symptoms to those diagnosed with PCOS?


## About the Data

This data set includes all physical and clinical parameters from a group of patients collected from ten different hospitals across Kerala, India. The original data set and notebook can be found on [Kaggle](https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos?select=PCOS_infertility.csv). The data set contains two files:
 - `PCOS_Data_without_infertility`: Contains 45 columns (representing different parameters) and 541 rows (representing different patients identified by a Patient File Number)
 - `PCOS_infertility`: Contains 6 columns (representing different parameters) and 541 rows (representing different patients identified by a Patient File Number)
 
 Because understanding the features in this data set and what they mean requires specific domain knowledge, a Data Dictionary was created to provide information about each feature.

In [None]:
import pandas as pd 
pd.set_option('display.max_colwidth', 0)

data_dict_filepath = "C:\\Users\\sarah\\OneDrive\\Documents\\NYCDS_Bootcamp\\Project 1\\PCOS_Data_Dictionary.csv"

PCOS_Data_Dict = pd.read_csv(data_dict_filepath)

PCOS_Data_Dict

## Questions of Interest

PCOS patients experience a wide variety of symptoms, and each woman experiences a range of symptoms throughout each individual cycle. Because of this, I would like to center my analysis on three main questions:

1. Are there any symptoms or traits that are correlated with PCOS?
2. What are the most frequent symptoms that PCOS patients exhibit?
3. How likely is a non-PCOS patient to have PCOS based on her current traits or symptoms?

## Data Inspection

Before exploring the questions of interest, we will inspect the data to get a sense of its general structure. In the data inspection, we will complete the following tasks:

- Load the data.
- Describe the data (shape, structure and descriptive statistics). 
- Inspect the data including missing values or `NaN` values. 
- Make initial observations about the data for subsequent steps such as data cleaning and pre-processing.

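Before we begin the inspection, we first import the necessary libraries for data analysis:
 - NumPy as np: Used for linear algebra or matrix math.
 - pandas as pd: Used for data analysis in a tabular structure.
 - Matplotlib.pyplot as plt: Used for plotting data.
 - Seaborn as sns: Library built on top of Matplotlib, used for statistical data visualization.
 - SciPy `stats`, researchpy as rp, and Pingouin as pg: Statistics libraries (Pingouin is used for the ANOVA tests and `stats` for the z-scores later in this notebook).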

In [None]:
import numpy as np 
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns
from scipy import stats
import researchpy as rp
import pingouin as pg

Load the data. 

In [None]:
file_path_with_infertility = "C:\\Users\\sarah\\OneDrive\\Documents\\NYCDS_Bootcamp\\Project 1\\PCOS_infertility.csv"
file_path_without_infertility = "C:\\Users\\sarah\\OneDrive\\Documents\\NYCDS_Bootcamp\\Project 1\\PCOS_data_without_infertility.xlsx"

PCOS_inf = pd.read_csv(file_path_with_infertility)

PCOS_woinf = pd.read_excel(file_path_without_infertility, sheet_name = "Full_new")

Observations:

 - `PCOS_inf` has 541 rows and 6 columns. Additional exploration is needed to locate `NaN` values.
 - `PCOS_woinf` has 999 rows and 45 columns. Additional exploration is needed to locate `NaN` values.

In [None]:
PCOS_inf.describe(exclude = 'category').T

In [None]:
PCOS_woinf.describe(exclude = 'category').T

The first DataFrame, `PCOS_woinf` (without infertility), has 999 rows and 45 columns, but only 541 records. The remaining rows will need to be explored and possibly removed.

In [None]:
PCOS_woinf.shape

The second DataFrame, `PCOS_inf` (infertility records only), has 541 records and 6 columns.

In [None]:
PCOS_inf.shape

Observations:
 - `PCOS_woinf` has a `BMI` column containing the `#NAME?` error. We will have to compute `BMI` to replace the `#NAME?` values.
 - `PCOS_woinf` has 44 `Unnamed:` columns with `NaN` values. We will need to drop the `NaN` values and the `Unnamed:` columns in the DataFrame.

In [None]:
PCOS_woinf.head()

Observations:

- `PCOS_woinf` has multiple records with `NaN` values due to what looks like a formatting issue in the source file that was imported as a DataFrame. We will need to drop these rows.

In [None]:
PCOS_woinf.tail()

Observations:

- `PCOS_woinf` has 541 entries or rows. 
- All columns are of `float64` type except for `BMI`, `FSH/LH`, `Waist:Hip Ratio`, and `AMH(ng/mL)` which are of `object` type.

In [None]:
PCOS_woinf.info()

Observations:

 - `PCOS_inf` has 541 entries or rows with a total of 6 columns, as we found from the `shape` attribute above.
 - `PCOS_inf` has all `int` and `float` types, except for column `AMH(ng/mL)`. 

In [None]:
PCOS_inf.info()

### Important Note About the Data
- Upon initial inspection, I identified that each record in each file had a unique Patient File Number, which would indicate files for unique patients.
- I also assumed that the names `PCOS_woinf` and `PCOS_inf` distinguished patients who were not diagnosed with "Unexplained Infertility" from patients who were.
- However, upon further inspection, the data in both files are the same, with different Patient File Numbers.
- Because of these observations, I will exclude `PCOS_inf` and use only `PCOS_woinf` for the main Exploratory Data Analysis.
- To explore the questions of interest regarding fertility, I will use randomly generated values with the same distributions as the values of patients who are not pregnant according to the `Pregnant(Y/N)` feature. This will allow exploratory analysis of patients classified with infertility versus without infertility. This can be found in the 'Additional EDA' section at the end of this notebook.

## Data Preparation

#### Data Cleaning Steps Taken:

 - Find and remove `NaN` values.
 - Drop `Unnamed:` columns. 
 - Find and remove duplicates.
 - Remove white space in column names.
 - Calculate correct values for `BMI`, `FSH/LH`, and `Waist:Hip Ratio` columns.

Find all `NaN` values in `PCOS_woinf`.

In [None]:
num_nulls_in_PCOS_woinf = PCOS_woinf.isnull().sum()
num_nulls_in_PCOS_woinf

Remove the `NaN` values from rows and columns in `PCOS_woinf`.

In [None]:
PCOS_woinf = PCOS_woinf.dropna(axis = 0, how = 'all').dropna(axis = 1, how = 'any')
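
To illustrate how the two chained `dropna` calls behave, here is a minimal sketch on a made-up DataFrame (the column names `a`, `b`, `c` and all values are hypothetical). `how="all"` on rows removes only rows that are completely empty, whereas `how="any"` on columns removes every column that still contains at least one `NaN`, so a column with a single missing value is dropped entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is entirely empty, column "b" has one extra gap.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
    "c": [7.0, 8.0, np.nan],
})

# Step 1: drop rows where every value is NaN (removes row 2 only).
rows_dropped = toy.dropna(axis=0, how="all")

# Step 2: drop columns that still contain any NaN (removes column "b").
cleaned = rows_dropped.dropna(axis=1, how="any")

print(cleaned)
```

If losing a whole column over one missing value is not intended, filling or dropping the affected rows instead is a possible alternative.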

Check to see if `NaN` values were removed from `PCOS_woinf`. 

In [None]:
#Recompute the counts; the variable defined earlier still holds the counts from before dropna().
PCOS_woinf.isnull().sum()

Remove `Unnamed:` columns from `PCOS_woinf`. 

In [None]:
PCOS_woinf = PCOS_woinf.loc[:, ~PCOS_woinf.columns.str.contains('^Unnamed')]
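
As a small sketch of the column filter above, the snippet below builds a boolean mask from the column labels and inverts it. The toy column names and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one placeholder column created by an empty header cell.
toy = pd.DataFrame({
    "Age (yrs)": [28, 36],
    "Unnamed: 44": [None, None],
    "BMI": [19.3, 24.9],
})

# True for every column whose label starts with "Unnamed".
is_unnamed = toy.columns.str.contains("^Unnamed")

# Keep all rows, and only the columns where the mask is False.
kept = toy.loc[:, ~is_unnamed]

print(list(kept.columns))
```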

Find duplicates in `PCOS_woinf`.

In [None]:
PCOS_woinf.duplicated().sum()

Replace `BMI` with the [correct calculation](https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmicalc.htm). 

In [None]:
PCOS_woinf['Height (m)'] = PCOS_woinf['Height(Cm) '] / 100

PCOS_woinf['BMI'] = PCOS_woinf['Weight (Kg)'] / (PCOS_woinf['Height (m)'] ** 2)
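
As a quick worked check of the formula, BMI is weight in kilograms divided by the square of height in meters. The sketch below uses two made-up patients (the column names `weight_kg` and `height_cm` are hypothetical stand-ins for the notebook's columns).

```python
import pandas as pd

# Hypothetical measurements for two patients.
toy = pd.DataFrame({"weight_kg": [60.0, 72.0], "height_cm": [160.0, 150.0]})

# Convert centimeters to meters, then apply BMI = kg / m^2.
toy["height_m"] = toy["height_cm"] / 100
toy["bmi"] = toy["weight_kg"] / toy["height_m"] ** 2

print(toy)
```

For the first patient this is 60 / 1.6² = 60 / 2.56 ≈ 23.44.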

Replace `FSH/LH` with the correct calculation.

In [None]:
PCOS_woinf['FSH/LH'] = PCOS_woinf['FSH(mIU/mL)'] / PCOS_woinf['LH(mIU/mL)']

Replace `Waist:Hip Ratio` with the correct calculation. 

In [None]:
PCOS_woinf['Waist:Hip Ratio'] = PCOS_woinf['Waist(inch)'] / PCOS_woinf['Hip(inch)']

Add a column `Weight (lbs)` converting weight in kg to weight in lbs for readability.

In [None]:
PCOS_woinf['Weight (lbs)'] = PCOS_woinf['Weight (Kg)'] * 2.205

Create a column `Blood Type (str)` that is represented by the string value vs. the numerical representation defined. 

In [None]:
PCOS_woinf['Blood Type (str)'] = PCOS_woinf['Blood Group'].replace([11, 12, 13, 14, 15, 16, 17, 18], ['A+', 'A-', 'B+', 'B-', 'O+', 'O-', 'AB+', 'AB-'] )
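
An alternative way to express the same recoding is a dictionary passed to `Series.map`. This is only a sketch: the code-to-label pairs are taken from the cell above, while the toy `codes` series (including the code 99) is made up. Unlike `replace`, `map` returns `NaN` for codes that are not in the dictionary, which makes unexpected codes easy to spot.

```python
import pandas as pd

# Code-to-label mapping as used in the notebook cell above.
blood_group_labels = {
    11: "A+", 12: "A-", 13: "B+", 14: "B-",
    15: "O+", 16: "O-", 17: "AB+", 18: "AB-",
}

# Hypothetical codes; 99 is deliberately not a valid blood group code.
codes = pd.Series([11, 15, 18, 99])

# map() looks each code up in the dictionary; unknown codes become NaN.
labels = codes.map(blood_group_labels)

print(labels)
```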

Convert blood pressure to the correct format systolic over diastolic. [Reference Link](https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings).

In [None]:
PCOS_woinf['Blood Pressure (str)'] = PCOS_woinf['BP _Systolic (mmHg)'].astype(str) + '/' + PCOS_woinf['BP _Diastolic (mmHg)'].astype(str)
PCOS_woinf['Blood Pressure (str)'] = PCOS_woinf['Blood Pressure (str)'].replace(['nan/nan'], np.nan)
PCOS_woinf['Blood Pressure (str)']

Correct spelling and rename column headers.

In [None]:
PCOS_woinf.rename(columns = {'No. of aborptions': 'No. of Abortions'}, inplace = True)
PCOS_woinf.rename(columns = {'Marraige Status (Yrs)': 'Marriage Status (Yrs)'}, inplace = True)
PCOS_woinf.rename(columns = {'Pimples(Y/N)': 'Acne (Y/N)'}, inplace = True)

 - Change non-numeric types to floats for numerical variables.
 - Received `ValueError: Unable to parse string "1.99." at position 123.`
 - Used the `replace` method to convert that value into a format that can be coerced with `to_numeric`.

In [None]:
PCOS_woinf['II    beta-HCG(mIU/mL)'] = PCOS_woinf['II    beta-HCG(mIU/mL)'].replace(['1.99.'], '1.99')
PCOS_woinf['II    beta-HCG(mIU/mL)'] = pd.to_numeric(PCOS_woinf['II    beta-HCG(mIU/mL)'])

In [None]:
PCOS_woinf['Waist:Hip Ratio'] = pd.to_numeric(PCOS_woinf['Waist:Hip Ratio'])

- For `PCOS_woinf['AMH(ng/mL)']`, I received a `ValueError` at position 307 with the value 'a'.
- Since this was the only string value in this numerical column, I chose to replace this string with the mean of the column to avoid dropping that record.

In [None]:
PCOS_woinf['AMH(ng/mL)'] = pd.to_numeric(PCOS_woinf['AMH(ng/mL)'], errors = 'coerce').astype('float64')
PCOS_woinf['AMH(ng/mL)'] = PCOS_woinf['AMH(ng/mL)'].fillna(PCOS_woinf['AMH(ng/mL)'].mean())
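
The two repair strategies used above (fixing a known typo, then coercing anything unparseable to `NaN` and filling with the column mean) can be sketched on a small made-up series. The values below are hypothetical and only mirror the kinds of bad entries described.

```python
import pandas as pd

# Hypothetical raw column: one value with a stray trailing dot, one stray letter.
raw = pd.Series(["1.99.", "2.5", "a", "3.5"])

# Fix the known typo explicitly (exact match, not a regular expression).
cleaned = raw.replace("1.99.", "1.99")

# Anything that still cannot be parsed ("a") becomes NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Fill the gap with the mean of the remaining values.
filled = numeric.fillna(numeric.mean())

print(filled)
```

Mean imputation keeps the record but slightly shrinks the column's variance; with a single affected value the effect should be small.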

- Change the type of the categorical variable columns to `bool` from `float64` (`PCOS (Y/N)` is left numeric because its conversion is commented out in the cell below):
 - `Pregnant(Y/N)`
 - `Weight gain(Y/N)`
 - `hair growth(Y/N)`
 - `Skin darkening (Y/N)`
 - `Hair loss(Y/N)`
 - `Acne (Y/N)`
 - `Reg.Exercise(Y/N)`

In [None]:
#PCOS_woinf['PCOS (Y/N)'] = PCOS_woinf['PCOS (Y/N)'].astype('bool')
PCOS_woinf['Pregnant(Y/N)'] = PCOS_woinf['Pregnant(Y/N)'].astype('bool')
PCOS_woinf['Weight gain(Y/N)'] = PCOS_woinf['Weight gain(Y/N)'].astype('bool')
PCOS_woinf['hair growth(Y/N)'] = PCOS_woinf['hair growth(Y/N)'].astype('bool')
PCOS_woinf['Skin darkening (Y/N)'] = PCOS_woinf['Skin darkening (Y/N)'].astype('bool')
PCOS_woinf['Hair loss(Y/N)'] = PCOS_woinf['Hair loss(Y/N)'].astype('bool')
PCOS_woinf['Acne (Y/N)'] = PCOS_woinf['Acne (Y/N)'].astype('bool')
PCOS_woinf['Reg.Exercise(Y/N)'] = PCOS_woinf['Reg.Exercise(Y/N)'].astype('bool')
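
One caveat worth keeping in mind with `astype('bool')`: any non-zero value, including `NaN`, converts to `True`. The sketch below uses a made-up 0/1 series to show this, which is why missing values should be handled before the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical Y/N flag stored as floats, with one missing value.
flags = pd.Series([1.0, 0.0, np.nan])

# 1.0 -> True, 0.0 -> False, but NaN is also treated as True.
as_bool = flags.astype("bool")

print(as_bool.tolist())
```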

In [None]:
PCOS_woinf.columns

- Drop `Sl. No` as this seems like a duplicate column and will not add value to our analysis.
- Drop `Cycle(R/I)` column as this variable was not well-defined in the information about the original data set. 

In [None]:
PCOS_woinf.drop(['Sl. No'], axis = 1, inplace = True)
PCOS_woinf.drop(['Cycle(R/I)'], axis = 1, inplace = True)

Double check `NaN` values are removed before moving on to analysis.

In [None]:
PCOS_woinf.isnull().values.any()

Check Dtypes in `PCOS_woinf` before moving on to analysis.

In [None]:
PCOS_woinf.info()

Get rid of extra space before and after column headers.

In [None]:
PCOS_woinf.columns = [col.strip() for col in PCOS_woinf.columns]
PCOS_woinf.columns
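
An equivalent, vectorized way to trim the headers is the `.str.strip()` accessor on the column index. This is a sketch on a hypothetical empty frame; note that only leading and trailing whitespace is removed, so labels with internal runs of spaces keep them.

```python
import pandas as pd

# Hypothetical headers: leading space, trailing space, and internal spaces.
toy = pd.DataFrame(columns=[" Age (yrs)", "Height(Cm) ", "I   beta-HCG(mIU/mL)"])

# Strip whitespace at both ends of every column label.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```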

In [None]:
PCOS_woinf.dtypes.value_counts()

## Exploratory Data Analysis

In this section, I will continue to explore the data to start to make inferences for further analysis keeping in mind our three questions:

 1. Are there any features that are correlated with PCOS?
 2. What are the most frequent symptoms PCOS patients exhibit?
 3. Do non-PCOS patients exhibit similar symptoms to those diagnosed with PCOS?

In [None]:
PCOS_woinf.groupby("PCOS (Y/N)").mean(numeric_only=True)

In [None]:
#Set stylistic themes for Seaborn plots.
sns.set(style = "white")
light = sns.color_palette("light:#5A9", as_cmap=True)

Plot a correlation matrix to see whether any features are correlated with other features in the data set. The closer a coefficient is to 1 or -1, the stronger the linear relationship between the two features; values near 0 indicate little or no linear relationship.

In [None]:
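#numeric_only = True skips the string columns (e.g. `Blood Type (str)`), which newer pandas versions refuse to correlate.
coorelation_matrix = PCOS_woinf.corr(numeric_only = True).round(2)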
mask = np.triu(np.ones_like(coorelation_matrix))
plt.subplots(figsize = (12, 12))
sns.heatmap(coorelation_matrix, vmax = 1, vmin = -1, cmap = "vlag", square = True, mask = mask).set(title = "PCOS Correlation Matrix")
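
To make the masking step concrete without plotting, the sketch below computes a correlation matrix on synthetic data and hides the diagonal and upper triangle, which is the part the heat map leaves blank. All column names and values are made up; `numeric_only=True` assumes a pandas version that supports that parameter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Synthetic data: one rescaled copy of x, one unrelated column, one text column.
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2.205 * x,           # perfectly correlated with x
    "noise": rng.normal(size=200),   # unrelated to x
    "label": ["a", "b"] * 100,       # non-numeric, skipped by numeric_only
})

corr = toy.corr(numeric_only=True).round(2)

# True on and above the diagonal: these cells are redundant in a symmetric matrix.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Replace the masked cells with NaN so only the lower triangle remains.
lower = corr.mask(mask)

print(lower)
```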

Because the matrix is so large, it is difficult to see which features are correlated with other features. Just from the heat map above, we can make a couple of observations:

- Follicles in the left and right ovaries and symptoms such as skin darkening, hair growth and weight gain are highly correlated with PCOS.
- We can validate these correlation values with what the research has told us so far. PCOS patients are more likely to have a number of follicles that likely will not mature and therefore will prevent the patient from ovulating or having a 'regular' cycle.
- Skin darkening, hair growth and weight gain are all symptoms that are most frequent in PCOS patients due to overproduction of male-linked hormones.

Next Steps:
- For each pair of variables correlated at 0.7 or more, drop one of the pair before further analysis in a one-way ANOVA.
- Because `Weight (Kg)`, `Weight (lbs)` and `BMI` are correlated at .90, I will drop `Weight (Kg)` and `Weight (lbs)` from the ANOVA to avoid including redundant, highly correlated variables.
- Dropping `Follicle No. (L)` since it's highly correlated at .80 with `Follicle No. (R)`.
- Dropping `FSH/LH` as it's highly correlated with `FSH(mIU/mL)`.
- Dropping `Hip(inch)` as it's highly correlated with `Waist(inch)`.
- Dropping `Height(Cm)` as it's highly correlated with `Height (m)`.

In [None]:
coorelation_matrix_filtered = coorelation_matrix.unstack()
coorelation_matrix_filtered = coorelation_matrix_filtered[abs(coorelation_matrix_filtered) >= 0.7]

print(coorelation_matrix_filtered)
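
The filtered output above lists every feature's correlation with itself (always 1.0) and each pair twice. A possible refinement, sketched here on a small made-up matrix, keeps only the strictly upper triangle so that each pair appears once and the diagonal is excluded. The labels `a`, `b`, `c` and the coefficients are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric correlation matrix for three features.
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, -0.75],
     [0.1, -0.75, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# Keep only cells strictly above the diagonal (k=1); everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Reshape to one row per (feature, feature) pair and filter by absolute value.
pairs = upper.stack()
high = pairs[pairs.abs() >= 0.7]

print(high)
```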

Create a new DataFrame `PCOS_woinf_ANOVA` that drops all columns noted above because of a correlation value of 0.7 or more, as well as the categorical columns (other than `PCOS (Y/N)`), for one-way ANOVA analysis.

In [None]:
PCOS_woinf_ANOVA = PCOS_woinf[['PCOS (Y/N)', 'Age (yrs)', 'BMI', 'Pulse rate(bpm)',
       'RR (breaths/min)', 'Hb(g/dl)', 'Cycle length(days)',
       'No. of Abortions', 'I   beta-HCG(mIU/mL)', 'II    beta-HCG(mIU/mL)',
       'FSH(mIU/mL)', 'LH(mIU/mL)', 'Waist(inch)',
       'Waist:Hip Ratio', 'TSH (mIU/L)', 'AMH(ng/mL)', 'PRL(ng/mL)',
       'Vit D3 (ng/mL)', 'PRG(ng/mL)', 'RBS(mg/dl)', 'BP _Systolic (mmHg)',
       'BP _Diastolic (mmHg)', 'Follicle No. (R)',
       'Avg. F size (L) (mm)', 'Avg. F size (R) (mm)', 'Endometrium (mm)', 'Height (m)']]

In [None]:
PCOS_woinf_ANOVA = PCOS_woinf_ANOVA.rename(columns = {'PCOS (Y/N)': "PCOS", 'Age (yrs)': "Age", 'Pulse rate(bpm)': "Pulse_Rate", 'RR (breaths/min)': "Resp_Rate", 'Hb(g/dl)': "Hemoglobin", 'Cycle length(days)' : "Cycle_Length", 'No. of Abortions': "Num_Abortions", 'I   beta-HCG(mIU/mL)': "HCG_Read_1", 'II    beta-HCG(mIU/mL)': "HCG_Read_2", 'FSH(mIU/mL)': "Follicle_Stim_Horm", 'LH(mIU/mL)': "Luteninizing_Horm", 'Waist(inch)': "Waist_in", 'Waist:Hip Ratio': "Waist_Hip_Ratio", 'TSH (mIU/L)': "Thyroid_Horm", 'AMH(ng/mL)': "Anti_Mull_Horm", 'PRL(ng/mL)': "Prolactin", 'Vit D3 (ng/mL)': "Vit_D", 'PRG(ng/mL)': "Progesterone", 'RBS(mg/dl)': "Random_Blood_Sug", 'BP _Systolic (mmHg)': "Systolic", 'BP _Diastolic (mmHg)': "Diastolic", 'Follicle No. (R)': "Foll_No_R", 'Avg. F size (L) (mm)': "Avg_Foll_Size_L", 'Avg. F size (R) (mm)': "Avg_Foll_Size_R", 'Endometrium (mm)': "Endometrium", 'Height (m)' :"Height_m"})

In [None]:
PCOS_woinf_ANOVA.columns

### Hypothesis Testing

My null hypothesis is that, for each variable, the mean among patients who have PCOS is equal to the mean among patients who do not. I will set my significance level to 0.05, which means I accept a 5% chance of rejecting the null hypothesis when it is actually true.

- $H_0$: $\mu_{\text{PCOS},\beta} = \mu_{\text{non-PCOS},\beta}$

- $H_a$: $\mu_{\text{PCOS},\beta} \neq \mu_{\text{non-PCOS},\beta}$

_Where $\beta$ is each independent variable or feature in the data set._

Assuming the null hypothesis is true, the $p$-value is the probability of observing a difference in means at least as extreme as the one we observed. If the $p$-value is higher than the significance level, $\alpha = 0.05$, then we fail to reject (retain) the null hypothesis.

- If the $p$-value is greater than $\alpha$, we retain $H_0$, meaning we do not have sufficient statistical evidence to conclude that the variable we are observing is associated with a PCOS diagnosis.
- If the $p$-value is lower than $\alpha$, we reject $H_0$ in favor of $H_a$, meaning that we have enough statistical evidence to conclude that the variable is associated with a PCOS diagnosis and is therefore significant.
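
As a minimal sketch of the decision rule, the snippet below compares the mean of one feature between two groups with a one-way ANOVA and applies the 0.05 threshold. It swaps in `scipy.stats.f_oneway` rather than the Pingouin call used in this notebook, treats the feature as the dependent variable and PCOS status as the grouping factor (matching the hypotheses as stated), and runs on synthetic, made-up numbers, so its output says nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic values of one hypothetical feature for each group.
feature_pcos = rng.normal(loc=12.0, scale=3.0, size=150)
feature_non_pcos = rng.normal(loc=6.0, scale=3.0, size=300)

alpha = 0.05

# One-way ANOVA: H0 says both groups share the same mean.
f_stat, p_value = stats.f_oneway(feature_pcos, feature_non_pcos)

# A p-value below alpha means the group means differ more than chance would explain.
reject_null = p_value < alpha

print(f"F = {f_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

With only two groups, a one-way ANOVA is equivalent to a two-sample t-test with equal variances assumed.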

In [None]:
pg.anova(dv = 'PCOS', between = ['Age'], data = PCOS_woinf_ANOVA).round(3)
#0.007 < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that age is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['BMI'], data = PCOS_woinf_ANOVA).round(3)
#0.086 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that BMI is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Pulse_Rate'], data = PCOS_woinf_ANOVA).round(3)
#0.393 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Pulse Rate is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Resp_Rate'], data = PCOS_woinf_ANOVA).round(3)
#0.21 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Respiratory Rate is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Hemoglobin'], data = PCOS_woinf_ANOVA).round(3)
#0.243 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Hemoglobin is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Cycle_Length'], data = PCOS_woinf_ANOVA).round(3)
#0.0 (rounded) < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that Cycle Length is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Num_Abortions'], data = PCOS_woinf_ANOVA).round(3)
#0.544 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that the number of abortions is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['HCG_Read_1'], data = PCOS_woinf_ANOVA).round(3)
#0.059 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that HCG is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['HCG_Read_2'], data = PCOS_woinf_ANOVA).round(3)
#0.238 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that HCG is associated with a PCOS diagnosis.
#This second reading would be a confirmation of the first.

In [None]:
pg.anova(dv = 'PCOS', between = ['Follicle_Stim_Horm'], data = PCOS_woinf_ANOVA).round(3)
#0.022 < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that the Follicle Stimulating Hormone is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Luteninizing_Horm'], data = PCOS_woinf_ANOVA).round(3)
#0.133 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that the Luteinizing Hormone is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Waist_in'], data = PCOS_woinf_ANOVA).round(3)
#0.003 < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that Waist measurement is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Waist_Hip_Ratio'], data = PCOS_woinf_ANOVA).round(3)
#0.029 < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that Waist:Hip Ratio is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Thyroid_Horm'], data = PCOS_woinf_ANOVA).round(3)
#0.23 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that the Thyroid Stimulating Hormone is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Anti_Mull_Horm'], data = PCOS_woinf_ANOVA).round(3)
#0.0 (rounded) < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that Anti-Müllerian Hormone is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Prolactin'], data = PCOS_woinf_ANOVA).round(3)
#0.011 < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that Prolactin is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Vit_D'], data = PCOS_woinf_ANOVA).round(3)
#0.1 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Vitamin D is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Random_Blood_Sug'], data = PCOS_woinf_ANOVA).round(3)
#0.483 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Random Blood Sugar is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Progesterone'], data = PCOS_woinf_ANOVA).round(3)
#0.466 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Progesterone is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Systolic'], data = PCOS_woinf_ANOVA).round(3)
#0.886 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Systolic blood pressure is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Diastolic'], data = PCOS_woinf_ANOVA).round(3)
#0.754 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Diastolic blood pressure is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Foll_No_R'], data = PCOS_woinf_ANOVA).round(3)
#0.0 (rounded) < 0.05, reject H0 in favor of Ha. We have enough statistical
#evidence to suggest that Follicle Count in the right ovary is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Avg_Foll_Size_L'], data = PCOS_woinf_ANOVA).round(3)
#0.17 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Avg Follicle Size in the left ovary is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Avg_Foll_Size_R'], data = PCOS_woinf_ANOVA).round(3)
#0.611 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Avg Follicle Size in the right ovary is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Endometrium'], data = PCOS_woinf_ANOVA).round(3)
#0.435 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Endometrium lining measurement is associated with a PCOS diagnosis.

In [None]:
pg.anova(dv = 'PCOS', between = ['Height_m'], data = PCOS_woinf_ANOVA).round(3)
#0.6 > 0.05, fail to reject H0. We do not have enough statistical
#evidence to conclude that Height is associated with a PCOS diagnosis.

Now that I've run my ANOVA tests to see which features are associated with PCOS, I will plot p-values in a bar graph to visualize and summarize the features for which we failed to reject the null hypothesis (p-value above 0.05).

I'll create a new DataFrame with these features and their respective p-values for graphing purposes.

In [None]:
#Create a dictionary containing all features with a p-value above 0.05 (failed to reject H0) and their corresponding p-values.
corr_pvals = { 'Features': ['BMI', 'Pulse_Rate', 'Resp_Rate', 'Hemoglobin', 'Num_Abortions', 
                          'HCG_Read_1', 'HCG_Read_2', 'Luteninizing_Horm','Thyroid_Horm', 'Vit_D','Progesterone', 
                          'Random_Blood_Sug', 'Systolic', 'Diastolic', 'Avg_Foll_Size_L', 'Avg_Foll_Size_R', 
                          'Endometrium', 'Height_m'],
            'P-Values': [0.086, 0.393, 0.21, 0.243, 0.544, 0.059,
                        0.238, 0.133, 0.23, 0.1, 0.466, 0.483, 0.886, 0.754, 0.17, 0.611, 0.435, 0.6]}

In [None]:
#Create the DataFrame from the dictionary above and sort from highest to lowest p-value received from each ANOVA.
corr_pvals_df = pd.DataFrame.from_dict(corr_pvals)
corr_pvals_df = corr_pvals_df.sort_values(by = 'P-Values', ascending = False)

In [None]:
#Create the bar plot of P-values.
sns.barplot(x = "P-Values", y = "Features", data = corr_pvals_df, orient = "h", palette = "light:b_r").set(title = 'P-Values of Features with p > 0.05')

Now that we've taken a look at the numerical features, let's take a look at the symptoms in our data set, which are categorical features.

In [None]:
#Using the crosstab function to create a simple table of PCOS diagnosis and if that patient is pregnant or not. 

pregnancy = pd.crosstab(index = PCOS_woinf["PCOS (Y/N)"], columns = PCOS_woinf['Pregnant(Y/N)'], margins = True, margins_name = "Total", normalize = "index").round(2)
pregnancy

In [None]:
pregnancy.plot.bar(stacked = False)

In [None]:
#To tabulate the categorical features, we need to create a new DataFrame that lets us
#view the data by whether patients exhibit each symptom and by their PCOS diagnosis. This requires
#using the melt() function to change the structure of our DataFrame.

test_df = PCOS_woinf.melt(id_vars = "PCOS (Y/N)", value_vars = ['Weight gain(Y/N)', 'hair growth(Y/N)', 'Skin darkening (Y/N)', 'Hair loss(Y/N)',
       'Acne (Y/N)', 'Reg.Exercise(Y/N)', 'Pregnant(Y/N)'], var_name = "Symptom",  value_name = "Exhibits_Symptom")

In [None]:
#Using the melted dataframe test_df to tabularize our data. 

symptom_data_table = pd.crosstab(test_df.Symptom, columns = [test_df["PCOS (Y/N)"], test_df.Exhibits_Symptom]).apply(lambda row: row/row.sum(), axis = 1)
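
To show what the reshaping does, here is a small sketch of the same melt-then-crosstab pattern on four made-up patients. The column names and the "Y"/"N" labels are hypothetical; the notebook's own columns are booleans. `melt` stacks the symptom columns into one long column, and `crosstab` then counts each combination of symptom, diagnosis and answer before every row is converted to shares that sum to 1.

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per symptom.
toy = pd.DataFrame({
    "PCOS": [1, 1, 0, 0],
    "Weight gain": ["Y", "Y", "N", "Y"],
    "Acne": ["Y", "N", "N", "N"],
})

# Wide to long: one row per (patient, symptom) pair.
long_form = toy.melt(id_vars="PCOS", var_name="Symptom", value_name="Exhibits_Symptom")

# Count patients for each symptom by diagnosis and by whether the symptom is present.
table = pd.crosstab(long_form["Symptom"], [long_form["PCOS"], long_form["Exhibits_Symptom"]])

# Convert each row of counts to shares of that row's total.
shares = table.div(table.sum(axis=1), axis=0)

print(shares)
```

In this toy table, half of all Acne answers come from non-PCOS patients without the symptom, so that cell is 0.5.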

In [None]:
#Formatting the data so it's easier to read. In the table below, the minimum values are highlighted in dark blue.

symptom_data_table.loc[:].style.highlight_min(axis=1, props='color:white; font-weight:bold; background-color:darkblue;')

In [None]:
#Formatting the data so it's easier to read. In the table below, the maximum values are highlighted in dark blue.

symptom_data_table.loc[:].style.highlight_max(axis=1, props='color:white; font-weight:bold; background-color:darkblue;')

Observations about Symptoms:

- Most patients who are not diagnosed with PCOS do not exhibit any symptoms.
- Patients who ARE diagnosed with PCOS have the lowest percentages for exhibiting the Acne and Hair Loss symptoms.
- Patients who ARE diagnosed with PCOS make up the smallest percentage of patients who were pregnant in our sample.

## Conclusion

Summarize analysis. Describe outcomes of the analysis and next steps. How would you take this analysis further? How can this analysis be applied to different industries and business problems?

#### Additional EDA

 - For additional analysis, I created a DataFrame `PCOS_outliers_removed` with outliers removed.
 - Since we cannot safely assume that all of our data is normally distributed, and the data set is already small, I chose to leave the outliers in for the main analysis.
 - The code and DataFrame are defined below if you would like to look at the various plots and data with outliers removed.

In [None]:
constraints = PCOS_woinf.select_dtypes(include = [np.number]).apply(lambda x: np.abs(stats.zscore(x)) < 3, result_type = 'expand').all(axis = 1)
PCOS_outliers_removed = PCOS_woinf.drop(PCOS_woinf.index[~constraints], inplace = False)
PCOS_outliers_removed.shape
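
The z-score filter above keeps a row only if every numeric value in it lies within three standard deviations of its column mean. A minimal sketch of the same idea on synthetic data (the columns `a` and `b` and the injected outlier are made up) looks like this:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data with one obviously extreme value injected in the first row.
toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
toy.loc[0, "a"] = 50.0

# Column-wise absolute z-scores: distance from the column mean in standard deviations.
z_scores = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep a row only if all of its values are within 3 standard deviations.
within_3_sd = (z_scores < 3).all(axis=1)
trimmed = toy[within_3_sd]

print(toy.shape, trimmed.shape)
```

Because one extreme value in any column removes the whole row, this rule can discard many records when the frame has dozens of columns.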

Because Weight Gain is associated with high Random Blood Sugar (RBS), I chose to compare the spread or distribution of those values side by side for PCOS and non-PCOS patients.

Observations:

 - From the histograms below, it looks like patients without PCOS have higher RBS levels than the PCOS patients.
 - The distribution for the non-PCOS patients looks normal, while the distribution for the PCOS patients looks more uniform.
 - This is counter to our original hypothesis.

In [None]:
g = sns.FacetGrid(PCOS_woinf, col="PCOS (Y/N)")
g.map(sns.histplot, "RBS(mg/dl)")

To see if there is any relationship between weight and Random Blood Sugar in PCOS patients, I chose to do a scatter plot. From the scatter plot below, it does not appear that there is any relationship between BMI and Random Blood Sugar values.

In [None]:
sns.relplot(x = "BMI", y = "RBS(mg/dl)", hue = "PCOS (Y/N)", size = "BMI",
           sizes = (400, 40), alpha = .5, palette = "muted", height = 4, data = PCOS_woinf)

In [None]:
sns.scatterplot(data = PCOS_woinf, x = "Weight (lbs)", y = "RBS(mg/dl)", hue = "PCOS (Y/N)")

In [None]:
sns.boxplot(data = PCOS_woinf, x = "RBS(mg/dl)", hue = "PCOS (Y/N)")

In [None]:
sns.violinplot(data = PCOS_woinf, y = "RBS(mg/dl)", x = "PCOS (Y/N)")

In [None]:
sns.lmplot(data = PCOS_woinf, x = "RBS(mg/dl)", y = "BMI", hue = "PCOS (Y/N)")

In [None]:
g = sns.FacetGrid(PCOS_woinf, col="PCOS (Y/N)")
g.map(sns.histplot, "PRL(ng/mL)")