## Analysis of Fracture Form in MrOS V1 Data (EDA3-V1-FF.ipynb)
The Fracture Form is essential in understanding the occurrence, characteristics, and management of fractures within the study population. It provides valuable insights into the prevalence of fractures, associated risk factors, treatment patterns, and the impact of fractures on the participants' health and well-being.

#### 1. [Installation and Importing of Libraries](#eda_import)
#### 2. [Retrieval of Data](#eda_retrieval)
#### 3. [Data Cleanup and Consolidation](#eda_cleanup)
#### 4. [Exploration of NA Values](#eda_na)

### <a name="eda_import"></a>Installation and Importing of Libraries
To explore and visualize the data, we load several libraries. Most are already installed in this environment; seaborn is installed (or upgraded) explicitly for plotting.

In [1]:
!pip install seaborn --upgrade

Requirement already up-to-date: seaborn in /opt/conda/lib/python3.7/site-packages (0.12.2)


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so
from sklearn.decomposition import PCA
import mpl_toolkits.mplot3d
from scipy.stats import chi2_contingency


### <a name="eda_retrieval"></a>Retrieval of Data
The data needs to be retrieved from the Postgres database and stored in a dataframe for us to begin analyzing.

In [3]:
import psycopg2
import sqlalchemy
import getpass

user = "dtfp3"
host = "pgsql.dsa.lan"
database = "casestdysu23t03"
password = getpass.getpass()
connectionstring = "postgresql://" + user + ":" + password + "@" + host + "/" + database
engine = sqlalchemy.create_engine(connectionstring)
connection = None

try:
    connection = engine.connect()
except Exception as err:
    print("An error has occurred trying to connect: {}".format(err))

del password

········
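
A password containing reserved characters such as `@` or `/` would break a connection string built by plain concatenation. The sketch below shows one way to guard against that by percent-encoding the credentials with the standard library. `build_connection_string` is a hypothetical helper and the password is made up; neither is part of the original notebook.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, database):
    """Hypothetical helper: percent-encode credentials before building the URL."""
    return "postgresql://{}:{}@{}/{}".format(
        quote(user, safe=""), quote(password, safe=""), host, database
    )


# Made-up password containing reserved characters, for illustration only
example = build_connection_string(
    "dtfp3", "p@ss/word", "pgsql.dsa.lan", "casestdysu23t03"
)
print(example)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as the concatenated one.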


In [4]:
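def binary2StringLiteral(df):
    """Decode bytes-valued (object dtype) columns to UTF-8 strings, in place."""
    # pd.read_sas returns text columns as raw bytes when no encoding is given
    for column in df.columns:
        if df[column].dtype == "object":
            df[column] = df[column].str.decode('utf-8')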

In [5]:
fafeb23_df = pd.read_sas("/dsa/groups/casestudy2023su/team03/FAFEB23.SAS7BDAT")
# Take an explicit copy so the in-place decoding below does not act on a slice
fafeb23_df = fafeb23_df[["ID","FANOTMOF"]].copy()
binary2StringLiteral(fafeb23_df)
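
`pd.read_sas` leaves text variables as raw bytes unless an `encoding` is supplied, which is why a decoding step is needed. A minimal sketch of a non-mutating alternative follows. It returns a new frame and leaves values that are not bytes untouched. `decode_bytes_columns` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Hypothetical variant: return a copy with bytes values decoded to str."""
    out = df.copy()
    for column in out.columns:
        if out[column].dtype == "object":
            # Decode only bytes; leave str, numbers, and missing values as they are
            out[column] = out[column].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Toy data standing in for a SAS extract (values are made up)
toy = pd.DataFrame({"ID": [b"A001", b"A002"], "FANOTMOF": [1.0, None]})
decoded = decode_bytes_columns(toy)
print(decoded["ID"].tolist())
```

Because the function works on a copy, the input frame keeps its original bytes values.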

### <a name="eda_cleanup"></a>Data Cleanup and Consolidation

In [6]:
query = "SELECT * FROM public.v1_form_FF"
form_ff_df = pd.read_sql_query(query, con=connection)

##Remove Staff IDs
##form_ff_df = form_ff_df.drop(form_ff_df.filter(regex="(STAFF)",axis=1).columns,axis=1)

In [7]:
form_ff_df = form_ff_df.loc[:,form_ff_df.columns.str.startswith(("FF","ID"))]

In [8]:
family_fractures = form_ff_df.filter(regex="(MOM)|(DAD)")

### <a name="eda_na"></a>Exploration of NA Values

In [9]:
family_fractures.isna().sum()

FFMOMOST    2299
FFMOMFX     1762
FFMOMHIP    4703
FFMOMWST    4802
FFMOMSPN    4755
FFMOMOTH    4931
FFMOM         36
FFMOMAGE    5659
FFMOMDIE     393
FFDADOST    2202
FFDADFX     2647
FFDADHIP    5140
FFDADWST    5198
FFDADSPN    5146
FFDADOTH    5202
FFDAD         91
FFDADAGE    5928
FFDADDIE     201
dtype: int64

The family-history variables on the FF form have a large number of missing values: 10 of the 18 columns are missing in more than half of the 5,994 records, so it would be inappropriate to use them in this analysis.
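
The share of missing values per column can be read off directly with `isna().mean()`. The sketch below uses a toy frame with made-up values; the column names are borrowed from the form only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; not real MrOS data
toy = pd.DataFrame({
    "FFMOM": [1.0, 0.0, 1.0, 0.0],
    "FFMOMAGE": [np.nan, np.nan, np.nan, 72.0],
})

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Columns where more than half of the values are missing
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()
print(mostly_missing)
```

Applied to `family_fractures`, the same two lines would list the columns that exceed the 50% mark.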

In [10]:
## Keep only high-fidelity columns: at least 5,800 non-missing values out of 5,994 rows
fracture_hist = form_ff_df.drop(family_fractures.columns,axis=1).dropna(axis=1,thresh=5800)
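
With `axis=1`, the `thresh` argument of `dropna` is the minimum number of non-missing values a column needs in order to be kept. A small sketch on made-up data, where `FFSPARSE` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; FFSPARSE is a hypothetical column
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "FFFRAC": [0.0, 1.0, np.nan, 1.0],           # 3 non-missing values
    "FFSPARSE": [np.nan, np.nan, np.nan, 1.0],   # 1 non-missing value
})

# Keep columns that have at least 3 non-missing values
kept = toy.dropna(axis=1, thresh=3)
print(kept.columns.tolist())
```

In the cell above, `thresh=5800` on 5,994 rows therefore tolerates at most 194 missing values per column, roughly 3%.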

In [11]:
fracture_hist.columns

Index(['ID', 'FFFRAC', 'FFDAUGH', 'FFSIS', 'FFBRO', 'FFSON', 'FFNOHP',
       'FFNOSP', 'FFNOHS', 'FFNOHSW', 'FFNT504', 'FFNT502', 'FFNTGT50',
       'FFNTLE50', 'FFNMGT50', 'FFNMLE50', 'FFFX50', 'FF'],
      dtype='object')

In [12]:
fracture_hist.shape

(5994, 18)

In [13]:
form_ff_df.FFFRAC

0       0.0
1       0.0
2       1.0
3       0.0
4       1.0
       ... 
5989    0.0
5990    1.0
5991    0.0
5992    1.0
5993    1.0
Name: FFFRAC, Length: 5994, dtype: float64