# Data analysis for building a model that estimates obesity levels from eating habits and physical condition

We begin by exploring this [dataset](https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+) from the [UCI](https://archive.ics.uci.edu/ml/index.html) Machine Learning Repository, collected and donated by:
- Fabio Mendoza Palechor, Email: fmendoza1@cuc.edu.co, Phone: +573182929611
- Alexis de la Hoz Manotas, Email: akdelahoz@gmail.com, Phone: +573017756983

The dataset contains data for estimating obesity levels in people from countries such as Mexico, Peru, and Colombia, based on their eating habits and physical condition.

## Data description

There are 3 different types of input features:
* Objective: factual information.
* Examination: results of a medical examination.
* Subjective: information given by the patient.

### Features:
|Feature|Feature Type|Name in the Dataset|Unit|
|:-------------|:---------------------|:-------------------|:----:|
|Age  | Objective | age    | int (days)|
|Height | Objective | height | int (cm) |
|Weight | Objective | weight | float (kg) |
|Sex  | Objective | gender | 1: female, 2: male |
|Systolic Blood Pressure | Examination | ap_hi | int (mmHg)|
|Diastolic Blood Pressure| Examination | ap_lo | int (mmHg)|
|Cholesterol | Examination | cholesterol | 1: normal, 2: above normal, 3: well above normal |
|Glucose    | Examination | gluc | 1: normal, 2: above normal, 3: well above normal |
|Smoker | Subjective | smoke | binary |
|Alcohol Intake | Subjective | alco | binary |
|Physical Activity | Subjective | active | binary |
|Presence or absence of Cardiovascular Disease | Target variable | cardio | binary |

## Getting to know and cleaning the data

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# load the dataset
# the field delimiter in this file is ','
data = pd.read_csv("ObesityDataSet.csv", delimiter=',')
data.head()

| |Gender|Age|Height|Weight|family_history_with_overweight|FAVC|FCVC|NCP|CAEC|SMOKE|CH2O|SCC|FAF|TUE|CALC|MTRANS|NObeyesdad|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|Female|21.0|1.62|64.0|yes|no|2.0|3.0|Sometimes|no|2.0|no|0.0|1.0|no|Public_Transportation|Normal_Weight|
|1|Female|21.0|1.52|56.0|yes|no|3.0|3.0|Sometimes|yes|3.0|yes|3.0|0.0|Sometimes|Public_Transportation|Normal_Weight|
|2|Male|23.0|1.8|77.0|yes|no|2.0|3.0|Sometimes|no|2.0|no|2.0|1.0|Frequently|Public_Transportation|Normal_Weight|
|3|Male|27.0|1.8|87.0|no|no|3.0|3.0|Sometimes|no|2.0|no|2.0|0.0|Frequently|Walking|Overweight_Level_I|
|4|Male|22.0|1.78|89.8|no|no|2.0|1.0|Sometimes|no|2.0|no|0.0|0.0|Sometimes|Public_Transportation|Overweight_Level_II|
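
Before encoding anything, it can help to check which columns hold text and which hold numbers. The sketch below rebuilds a few columns of the five preview rows inline so that it runs without the CSV file; the `preview` frame and the variable names are illustrative assumptions, and on the real data the same checks would be applied to `data`.

```python
import pandas as pd

# A subset of the five preview rows, rebuilt inline (assumption: stands in for `data`)
preview = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Age": [21.0, 21.0, 23.0, 27.0, 22.0],
    "Height": [1.62, 1.52, 1.80, 1.80, 1.78],
    "Weight": [64.0, 56.0, 77.0, 87.0, 89.8],
    "SMOKE": ["no", "yes", "no", "no", "no"],
    "MTRANS": ["Public_Transportation", "Public_Transportation",
               "Public_Transportation", "Walking", "Public_Transportation"],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Normal_Weight",
                   "Overweight_Level_I", "Overweight_Level_II"],
})

# Split the columns by whether pandas treats them as numeric
numeric_columns = [c for c in preview.columns
                   if pd.api.types.is_numeric_dtype(preview[c])]
categorical_columns = [c for c in preview.columns if c not in numeric_columns]

# Count missing values across the whole frame
missing_total = int(preview.isna().sum().sum())

print("numeric:", numeric_columns)
print("categorical:", categorical_columns)
print("missing values:", missing_total)
print(preview["NObeyesdad"].value_counts())
```

The categorical list is what needs encoding before scikit-learn estimators can consume the frame, and the class counts give a first look at how balanced the target is.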


In [3]:
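from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

# text columns to convert to integer codes (the target NObeyesdad included)
columns_to_encode = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS', 'NObeyesdad']

# the single encoder is refit on every column, so only the last mapping is kept;
# codes follow the sorted order of each column's distinct values
for encode in columns_to_encode:
    data[encode] = labelencoder.fit_transform(data[encode])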
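
Because one encoder is reused for all columns, the mapping from integer codes back to the original labels is lost for every column except the last. A minimal sketch of one alternative, assuming a small illustrative frame named `toy` and a hypothetical `encoders` dictionary, keeps one fitted `LabelEncoder` per column so predictions can be decoded later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame with values taken from the preview rows (assumption)
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Normal_Weight",
                   "Overweight_Level_I", "Overweight_Level_II"],
})
original_labels = toy["NObeyesdad"].tolist()

# Keep one fitted encoder per column instead of reusing a single instance
encoders = {}
for column in toy.columns:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ is sorted, so the position of each label is its integer code
label_map = {code: str(label)
             for code, label in enumerate(encoders["NObeyesdad"].classes_)}

# Integer codes can be turned back into the original strings
decoded = [str(v) for v in encoders["NObeyesdad"].inverse_transform(toy["NObeyesdad"])]

print(label_map)
print(decoded)
```

Note that integer codes impose an arbitrary order on nominal features such as `MTRANS`; tree-based models tolerate this, while distance-based models may be affected by it.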

In [4]:
# features: every column except the target
x = data.drop(columns=['NObeyesdad'])
# target: the encoded obesity level
y = data['NObeyesdad']

In [5]:
from sklearn.model_selection import train_test_split
# hold out 20% of the rows for testing; the fixed seed makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
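
The split above is purely random, so class proportions in the test set can drift from those of the full data. As a hedged aside, `train_test_split` accepts a `stratify` argument that preserves them. The sketch below uses made-up labels (`y_demo`, 80 of one class and 20 of another), not the obesity data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
x_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 80/20 class ratio in both resulting splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = y_tr.mean()  # fraction of class 1 in the training split
test_share = y_te.mean()   # fraction of class 1 in the test split
print(train_share, test_share)
```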

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

dec = DecisionTreeClassifier()
ran = RandomForestClassifier(n_estimators=100)
knn = KNeighborsClassifier(n_neighbors=100)
svm = SVC(random_state=1)
naive = GaussianNB()

models = {"Decision tree" : dec,
          "Random forest" : ran,
          "KNN" : knn,
          "SVM" : svm,
          "Naive bayes" : naive}

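scores = {}

for key, model in models.items():
    model.fit(x_train, y_train)
    #scores[key] = model.score(x_test, y_test)
    # note: cross_val_score clones the estimator and refits it on folds of the
    # data it receives, so these scores come from 5-fold CV inside the test
    # split, not from the model fitted on x_train above
    score = cross_val_score(model, x_test, y_test, cv=5)
    scores[key] = ("%0.2f accuracy with a standard deviation of %0.2f" % (score.mean(), score.std()))

# show results
scores_frame = pd.DataFrame(scores, index=["Accuracy Score"]).T
# the scores are formatted strings, so this sort is lexicographic
scores_frame.sort_values(by=["Accuracy Score"], axis=0, ascending=False, inplace=True)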
scores_frame

| |Accuracy Score|
|:--|:--|
|Random forest|0.89 accuracy with a standard deviation of 0.03|
|Decision tree|0.84 accuracy with a standard deviation of 0.04|
|SVM|0.54 accuracy with a standard deviation of 0.04|
|Naive bayes|0.48 accuracy with a standard deviation of 0.02|
|KNN|0.47 accuracy with a standard deviation of 0.05|
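
Two things may explain part of the gap between the tree-based models and KNN or SVM: the features are not scaled, and the cross-validation runs on the small test split. The sketch below shows a more conventional arrangement under stated assumptions. It uses the wine dataset bundled with scikit-learn as a stand-in (not the obesity data), cross-validates on the training split only, wraps KNN in a pipeline with `StandardScaler`, and scores the selected model once on the held-out test split. The names `candidates` and `cv_means` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): features with very different scales
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "KNN (raw features)": KNeighborsClassifier(n_neighbors=5),
    # The pipeline fits the scaler inside each CV fold, avoiding leakage
    "KNN (standardized)": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
}

# Model comparison uses the training split only
cv_means = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X_train, y_train, cv=5)
    cv_means[name] = fold_scores.mean()

# Refit the best candidate on the full training split, then score once on test
best_name = max(cv_means, key=cv_means.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)

print(cv_means)
print(best_name, test_accuracy)
```

Whether scaling closes the gap on the obesity data is not something this sketch establishes; it only shows how such a comparison could be set up so that the test split stays untouched until the final evaluation.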
