## Group 15 - Group Project

**Question:** Given the wife's age, religion, and standard of living, what is the preferred contraceptive method used?

We first load the CSV file into a dataframe so that we can inspect the data. Descriptive column headers are added, and the columns that are not related to the question are then removed.

In [8]:
import altair as alt
import numpy as np
import pandas as pd
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

alt.data_transformers.disable_max_rows()
alt.renderers.enable("mimetype")

RendererRegistry.enable('mimetype')

In [3]:
cmc_dataset = pd.read_csv("data/cmc.data.csv", header=None,
                     names=[
                         "wife_age", 
                         "wife_education", #1=low 2,3,4=high
                         "husband_education", #1=low 2,3,4=high
                         "num_children_born",
                         "wife_religion", #0=not Islam 1=Islam
                         "wife_working", #0=yes 1=no
                         "husband_occupation", #1,2,3,4 (categorical)
                         "standard_of_living_index", #1=low 2,3,4=high
                         "media_exposure", #0=good 1=not good
                         "contraceptive_method_used", #1= no use 2=long-term 3=short-term
                     ]
                     )
cmc_dataset

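| | wife_age | wife_education | husband_education | num_children_born | wife_religion | wife_working | husband_occupation | standard_of_living_index | media_exposure | contraceptive_method_used |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 2 | 3 | 3 | 1 | 1 | 2 | 3 | 0 | 1 |
| 1 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 3 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 4 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1468 | 33 | 4 | 4 | 2 | 1 | 0 | 2 | 4 | 0 | 3 |
| 1469 | 33 | 4 | 4 | 3 | 1 | 1 | 1 | 4 | 0 | 3 |
| 1470 | 39 | 3 | 3 | 8 | 1 | 0 | 1 | 4 | 0 | 3 |
| 1471 | 33 | 3 | 3 | 4 | 1 | 0 | 2 | 2 | 0 | 3 |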


In [5]:
cmc_dataset_filtered = cmc_dataset.loc[:, ["wife_age", "wife_education","wife_religion","standard_of_living_index", "contraceptive_method_used"]]
cmc_dataset_filtered

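| | wife_age | wife_education | wife_religion | standard_of_living_index | contraceptive_method_used |
|---|---|---|---|---|---|
| 0 | 24 | 2 | 1 | 3 | 1 |
| 1 | 45 | 1 | 1 | 4 | 1 |
| 2 | 43 | 2 | 1 | 4 | 1 |
| 3 | 42 | 3 | 1 | 3 | 1 |
| 4 | 36 | 3 | 1 | 2 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1468 | 33 | 4 | 1 | 4 | 3 |
| 1469 | 33 | 4 | 1 | 4 | 3 |
| 1470 | 39 | 3 | 1 | 4 | 3 |
| 1471 | 33 | 3 | 1 | 2 | 3 |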


Now we split our data into a training set and a test set. The preliminary analysis is exploratory and uses data from the training set only. To keep the test set from influencing our modelling decisions, only the training set will be visualized. For this project, the dataset is split into 75% training and 25% testing.

In [11]:
cmc_train, cmc_test = train_test_split(cmc_dataset_filtered, test_size=0.25, random_state=123)
cmc_train.head()

| | wife_age | wife_education | wife_religion | standard_of_living_index | contraceptive_method_used |
|---|---|---|---|---|---|
| 1208 | 40 | 1 | 1 | 2 | 1 |
| 1297 | 35 | 4 | 1 | 4 | 2 |
| 736 | 19 | 3 | 1 | 3 | 3 |
| 186 | 20 | 2 | 1 | 2 | 1 |
| 1027 | 41 | 2 | 0 | 2 | 1 |
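
The split above is purely random. As an optional variation that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both sets. The sketch below runs on made-up toy data (the `toy` frame and its 8/4/8 class split are assumptions for illustration, not values from the dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 20 rows with an 8/4/8 class split.
toy = pd.DataFrame(
    {
        "wife_age": list(range(20, 40)),
        "contraceptive_method_used": [1] * 8 + [2] * 4 + [3] * 8,
    }
)

# stratify= asks scikit-learn to preserve the class proportions in both sets.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,
    random_state=123,
    stratify=toy["contraceptive_method_used"],
)

train_counts = toy_train["contraceptive_method_used"].value_counts().sort_index()
test_counts = toy_test["contraceptive_method_used"].value_counts().sort_index()
print(train_counts)
print(test_counts)
```

Whether stratifying is worthwhile depends on how uneven the classes are; with a roughly balanced target, a plain random split is often close enough.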


Next we want to know how many observations fall into each contraceptive-method class, since a heavily over-represented class could have a disproportionate influence on the results. As a first step, `info()` confirms the number of rows in the training set and the type of each column.

In [12]:
cmc_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1104 entries, 1208 to 1389
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   wife_age                   1104 non-null   int64
 1   wife_education             1104 non-null   int64
 2   wife_religion              1104 non-null   int64
 3   standard_of_living_index   1104 non-null   int64
 4   contraceptive_method_used  1104 non-null   int64
dtypes: int64(5)
memory usage: 51.8 KB
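
`info()` reports row counts and dtypes but not how the target classes are distributed. A minimal sketch of a per-class count follows, assuming `value_counts` on the target column is what is wanted. So that it runs on its own, it re-types the five rows shown by `cmc_train.head()` above into a hypothetical `sample` frame; in the notebook the same calls would be made on `cmc_train`.

```python
import pandas as pd

# The five rows shown by cmc_train.head() above, re-typed so the sketch is self-contained.
sample = pd.DataFrame(
    {
        "wife_age": [40, 35, 19, 20, 41],
        "wife_education": [1, 4, 3, 2, 2],
        "wife_religion": [1, 1, 1, 1, 0],
        "standard_of_living_index": [2, 4, 3, 2, 2],
        "contraceptive_method_used": [1, 2, 3, 1, 1],
    },
    index=[1208, 1297, 736, 186, 1027],
)

# Absolute number of rows per class (1 = no use, 2 = long-term, 3 = short-term).
class_counts = sample["contraceptive_method_used"].value_counts().sort_index()

# The same information as a share of all rows.
class_share = (
    sample["contraceptive_method_used"].value_counts(normalize=True).sort_index()
)

print(class_counts)
print(class_share)
```

Five rows say nothing about the real class balance; the point is only the shape of the calls.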


In our preliminary analysis, we want to summarize useful information about the training set in a table. We therefore check for missing data and compute the mean, median, maximum, and minimum of each column.

In [35]:
variable_agg = cmc_train.agg(["mean", "median", "max", "min"])
variable_agg

| | wife_age | wife_education | wife_religion | standard_of_living_index | contraceptive_method_used |
|---|---|---|---|---|---|
| mean | 32.570652 | 2.950181 | 0.856884 | 3.130435 | 1.906703 |
| median | 32.0 | 3.0 | 1.0 | 3.0 | 2.0 |
| max | 49.0 | 4.0 | 1.0 | 4.0 | 3.0 |
| min | 16.0 | 1.0 | 0.0 | 1.0 | 1.0 |


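From the table above, we can see that the mean and median of each variable are close, which suggests that no column is strongly skewed. This comparison is most meaningful for `wife_age`; the other columns are integer-coded categories, so their means are only rough summaries.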
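
As a rough sketch of how the mean-versus-median comparison could be backed by a number, the snippet below also computes the sample skewness of `wife_age`. It uses only the five ages shown by `cmc_train.head()` above, so the values illustrate the calls and are not a finding about the dataset.

```python
import pandas as pd

# wife_age values from the five rows shown by cmc_train.head() above.
wife_age = pd.Series([40, 35, 19, 20, 41], name="wife_age")

age_mean = wife_age.mean()
age_median = wife_age.median()
age_skew = wife_age.skew()  # sample skewness; negative means a longer left tail

print(f"mean={age_mean:.1f}, median={age_median:.1f}, skew={age_skew:.2f}")
```

In the notebook, the same three calls on `cmc_train["wife_age"]` would give the figures for the full training set.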

In [40]:
missing_data = cmc_train.isna().sum()
missing_data

wife_age                     0
wife_education               0
wife_religion                0
standard_of_living_index     0
contraceptive_method_used    0
dtype: int64

We picked our variables using academic papers that have found a correlation between each variable and the method of contraception used. We therefore want to see whether each variable shows a clear relationship with the contraceptive method in our training data.
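
Because most of the predictors are integer-coded categories, one way to put numbers beside the charts is a row-normalized cross-tabulation. The sketch below assumes `pd.crosstab` with `normalize="index"` is an acceptable summary. It re-types the five rows shown by `cmc_train.head()` above into a hypothetical `sample` frame so that it runs on its own.

```python
import pandas as pd

# The five rows shown by cmc_train.head() above, re-typed so the sketch is self-contained.
sample = pd.DataFrame(
    {
        "standard_of_living_index": [2, 4, 3, 2, 2],
        "contraceptive_method_used": [1, 2, 3, 1, 1],
    }
)

# Each row is one standard-of-living level; each column is one method.
# normalize="index" turns counts into within-row proportions, so rows sum to 1.
method_by_living = pd.crosstab(
    sample["standard_of_living_index"],
    sample["contraceptive_method_used"],
    normalize="index",
)

print(method_by_living)
```

If the rows of such a table look alike on the full training set, the predictor carries little information about the method; if they differ, the relationship seen in the chart is reflected in the proportions.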

In [59]:
wife_age_correlation = alt.Chart(cmc_train, title = "Age vs. Method of Contraception Used").mark_point().encode(
    x = alt.X("wife_age", title = "Wife's age (years)"),
    y = alt.Y("count()", title = "Number of records")
).facet("contraceptive_method_used")
wife_age_correlation

<VegaLite 4 object>


In [61]:
wife_education_correlation = alt.Chart(cmc_train, title = "Education vs. Method of Contraception Used").mark_point().encode(
    x = alt.X("wife_education", title = "Wife's education"),
    y = alt.Y("count()", title = "Number of records")
).facet("contraceptive_method_used")
wife_education_correlation

<VegaLite 4 object>


In [64]:
living_standard_correlation = alt.Chart(cmc_train, title = "Standard of living vs. Method of Contraception Used").mark_bar().encode(
    x = alt.X("standard_of_living_index", title = "Standard of living"),
    y = alt.Y("count()", title = "Number of records")
).facet("contraceptive_method_used")
living_standard_correlation

<VegaLite 4 object>


In [None]:
wife_religion_correlation = alt.Chart(cmc_train, title = "Religion vs. Method of Contraception Used").mark_bar().encode(
    x = alt.X("wife_religion", title = "Wife's Religion"),
    y = alt.Y("count()", title = "Number of records")
).facet("contraceptive_method_used")
wife_religion_correlation