<a href="https://colab.research.google.com/github/joigs/TallerDeGit/blob/master/620454_Estadi%CC%81stica_para_ana%CC%81lisis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working methodology

![](https://www.proglobalbusinesssolutions.com/wp-content/uploads/2017/12/data-mining.jpg)

# Measures of central tendency

## Average or arithmetic mean

$\bar{x}=\frac{\sum_{i=1}^{n}x_i}{n}$

## Variance
### Population

$\sigma^{2}={\frac{{\displaystyle \sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}}{N}}$

### Sample

$s^{2}={\frac{{\displaystyle \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}{n-1}}$

## Standard deviation
### Population : $\sigma=\sqrt{\sigma^{2}}$
### Sample : $s=\sqrt{s^{2}}$
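
As a quick check of the formulas above, the following sketch implements them in plain Python. The helper names (`mean`, `variance`) and the eight example values are illustrative assumptions, not part of the original notebook.

```python
from math import sqrt

def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)

def variance(xs, sample=False):
    """Population variance (divide by N) or sample variance (divide by n - 1)."""
    m = mean(xs)
    squared_deviations = sum((x - m) ** 2 for x in xs)
    divisor = len(xs) - 1 if sample else len(xs)
    return squared_deviations / divisor

# Hypothetical values chosen so the arithmetic is easy to follow by hand
values = [2, 4, 4, 4, 5, 5, 7, 9]

print("Mean:", mean(values))
print("Population variance:", variance(values))
print("Population standard deviation:", sqrt(variance(values)))
print("Sample variance:", variance(values, sample=True))
print("Sample standard deviation:", sqrt(variance(values, sample=True)))
```

The squared deviations from the mean add up to 32, so the population variance is 32/8 and the sample variance is 32/7. The only difference between the two is the divisor.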

In [1]:
import random
random.seed(29)
edades = range(17,66)
muestra = random.sample(edades, 30)
print(muestra)

[52, 21, 39, 55, 62, 35, 22, 49, 63, 42, 43, 18, 19, 23, 45, 31, 57, 59, 32, 27, 36, 34, 30, 40, 65, 50, 48, 24, 61, 37]


In [2]:
import numpy as np
# Note: np.var and np.std default to ddof=0, i.e. the population formulas
print("Mean : {:.2f}".format(np.mean(muestra)))
print("Median : {:.2f}".format(np.median(muestra)))
print("Variance : {:.2f}".format(np.var(muestra)))
print("Standard deviation : {:.2f}".format(np.std(muestra)))
print("Standard deviation * : {:.2f}".format(np.sqrt(np.var(muestra))))

Mean : 40.63
Median : 39.50
Variance : 200.83
Standard deviation : 14.17
Standard deviation * : 14.17
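
The cell above divides by $n$, because `np.var` and `np.std` use `ddof=0` by default. Since `muestra` is a sample, the $n-1$ formula may be the more appropriate one. A minimal sketch of how to request it follows; the list is copied from the output of cell 1, and the variable names `var_pop`, `var_sample` and `std_sample` are illustrative.

```python
import numpy as np
from statistics import variance

# Sample copied from the output of cell 1
muestra = [52, 21, 39, 55, 62, 35, 22, 49, 63, 42, 43, 18, 19, 23, 45,
           31, 57, 59, 32, 27, 36, 34, 30, 40, 65, 50, 48, 24, 61, 37]
n = len(muestra)

var_pop = np.var(muestra)              # ddof=0: divides by n
var_sample = np.var(muestra, ddof=1)   # ddof=1: divides by n - 1
std_sample = np.std(muestra, ddof=1)

print("Population variance : {:.2f}".format(var_pop))
print("Sample variance : {:.2f}".format(var_sample))
print("Sample standard deviation : {:.2f}".format(std_sample))

# statistics.variance always uses the n - 1 divisor
print("statistics.variance : {:.2f}".format(variance(muestra)))
```

The two variances differ only by the factor $n/(n-1)$, so the gap shrinks as the sample grows.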


# Measures of position

In [3]:
# Measures of position
# Third quartile
print("Third quartile : {0}".format(np.percentile(muestra, 75)))
# Second quartile
print("Second quartile : {0}".format(np.percentile(muestra, 50)))
# First quartile
print("First quartile : {0}".format(np.percentile(muestra, 25)))

Third quartile : 51.5
Second quartile : 39.5
First quartile : 30.25
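
The three quartiles can also be requested in a single call by passing a list of percentiles. The sketch below does this and, as a cross-check, compares the result with `statistics.quantiles` using `method="inclusive"`, which is assumed here because it interpolates the same way as NumPy's default. The list is copied from the output of cell 1.

```python
import numpy as np
from statistics import quantiles

# Sample copied from the output of cell 1
muestra = [52, 21, 39, 55, 62, 35, 22, 49, 63, 42, 43, 18, 19, 23, 45,
           31, 57, 59, 32, 27, 36, 34, 30, 40, 65, 50, 48, 24, 61, 37]

# One call returns the first, second and third quartiles
q1, q2, q3 = np.percentile(muestra, [25, 50, 75])
print("Q1 = {0}, Q2 = {1}, Q3 = {2}".format(q1, q2, q3))

# Standard-library cross-check (Python 3.8+)
q_std = quantiles(muestra, n=4, method="inclusive")
print("statistics.quantiles :", q_std)
```

Other interpolation rules give slightly different quartiles for small samples, so it helps to state which method was used when reporting them.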


In [4]:
from statistics import mean, median, median_grouped
print("Median : {0:.2f}".format(median(muestra)))
print("Grouped median : {0:.2f}".format(median_grouped(muestra)))
print("Mean : {0:.2f}".format(mean(muestra)))

Median : 39.50
Grouped median : 39.50
Mean : 40.63


# Working with a dataset

[Data description](http://www.biostatisticien.eu/springeR/jeuxDonnees4-en.html)

In [6]:
import pandas as pd
data = pd.read_excel("nutrition_elderly.xls")


In [7]:
# Number of observations and features
data.shape

(226, 13)

In [13]:
# Show 5 randomly chosen observations
data.sample(5)

     gender  situation  tea  coffee  height  weight  age  meat  fish  raw_fruit  cooked_fruit_veg  chocol  fat
135       2          1    0       2     168      63   69     3     3          4                 5       1    3
187       2          2    6       0     158      62   75     3     3          5                 4       0    2
  3       2          1    0       0     154      45   91     0     4          4                 0       3    2
 24       2          2    3       1     156      56   77     3     2          5                 3       0    4
206       2          1    0       2     171      65   76     3     3          5                 5       5    5


## Variables

There are two types of variables:

+ Qualitative
+ Quantitative

Quantitative variables are further divided into:
+ Discrete
+ Continuous

In [14]:
# Column names
data.columns

Index(['gender', 'situation', 'tea', 'coffee', 'height', 'weight', 'age',
       'meat', 'fish', 'raw_fruit', 'cooked_fruit_veg', 'chocol', 'fat'],
      dtype='object')

$\textbf{Qualitative}$
+ gender
+ situation
+ fat
+ meat
+ fish
+ raw_fruit
+ cooked_fruit_veg
+ chocol

$\textbf{Quantitative, discrete}$
+ tea
+ coffee

$\textbf{Quantitative, continuous}$
+ height
+ weight
+ age
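
One way to record this classification in pandas is to cast the qualitative columns to the `category` dtype, so that numeric summaries skip them automatically. The sketch below uses a small hypothetical frame (`toy`) with one column of each kind; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical mini-frame with one column of each kind (values are made up)
toy = pd.DataFrame({
    "gender": [2, 2, 1, 2, 1, 2],                 # qualitative, stored as codes
    "tea": [0, 1, 0, 2, 6, 3],                    # quantitative, discrete
    "height": [151, 162, 170, 154, 171, 159],     # quantitative, continuous
})

qualitative = ["gender"]

# Mark coded qualitative columns explicitly as categorical
for col in qualitative:
    toy[col] = toy[col].astype("category")

numeric_cols = list(toy.select_dtypes(include="number").columns)
category_cols = list(toy.select_dtypes(include="category").columns)
print("Quantitative :", numeric_cols)
print("Qualitative :", category_cols)

# describe() on a mixed frame summarises only the numeric columns by default
print(toy.describe())
```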

In [15]:
# Information about the available columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            226 non-null    int64
 1   situation         226 non-null    int64
 2   tea               226 non-null    int64
 3   coffee            226 non-null    int64
 4   height            226 non-null    int64
 5   weight            226 non-null    int64
 6   age               226 non-null    int64
 7   meat              226 non-null    int64
 8   fish              226 non-null    int64
 9   raw_fruit         226 non-null    int64
 10  cooked_fruit_veg  226 non-null    int64
 11  chocol            226 non-null    int64
 12  fat               226 non-null    int64
dtypes: int64(13)
memory usage: 23.1 KB


In [16]:
# Get the unique values of the column
data.fat.unique()

array([6, 4, 2, 3, 8, 1, 5, 7])

In [17]:
data.gender.unique()

array([2, 1])

## Column analysis

In [18]:
# Compute summary statistics
data['age'].describe()

count    226.000000
mean      74.477876
std        6.005327
min       65.000000
25%       70.000000
50%       74.000000
75%       78.000000
max       91.000000
Name: age, dtype: float64
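
Note that the `std` reported by `describe()` is the sample standard deviation (divisor $n-1$), whereas `np.std` divides by $n$ unless `ddof=1` is passed. The sketch below illustrates the difference on a short hypothetical series; the five ages are made up and are not rows of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([65, 70, 74, 78, 91])

summary = ages.describe()
std_pandas = summary["std"]           # pandas: ddof=1 (sample formula)
std_numpy = np.std(ages.to_numpy())   # NumPy default: ddof=0 (population formula)

print("describe() std : {:.4f}".format(std_pandas))
print("np.std default : {:.4f}".format(std_numpy))
print("np.std ddof=1  : {:.4f}".format(np.std(ages.to_numpy(), ddof=1)))
```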

In [19]:
# Compute summary statistics
data['gender'].describe()

count    226.000000
mean       1.623894
std        0.485482
min        1.000000
25%        1.000000
50%        2.000000
75%        2.000000
max        2.000000
Name: gender, dtype: float64

**OBSERVATION**

The summary above is not meaningful: gender is a qualitative variable that is merely stored as numeric codes, so statistics such as the mean or the standard deviation have no interpretation.

In [28]:
# Add the descriptive labels
dict_gender = {1:'Male', 2:'Female'}
data['gender_c'] = data['gender'].replace(dict_gender).astype("category")

In [23]:
# Use the describe() method again
data['gender_c'].describe()

count        226
unique         2
top       Female
freq         141
Name: gender_c, dtype: object

In [24]:
# Build a simple frequency table (absolute counts; pass normalize=True for relative frequencies)
data['gender_c'].value_counts()

Female    141
Male       85
Name: gender_c, dtype: int64
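
The table above contains absolute counts. Relative frequencies can be obtained with `normalize=True`. The sketch below rebuilds a stand-in for `gender_c` from the counts shown above (141 and 85) so that it runs without the Excel file; the names `abs_freq`, `rel_freq` and `table` are illustrative.

```python
import pandas as pd

# Stand-in for data['gender_c'], rebuilt from the counts shown above
gender_c = pd.Series(["Female"] * 141 + ["Male"] * 85, dtype="category")

abs_freq = gender_c.value_counts()                 # absolute frequencies
rel_freq = gender_c.value_counts(normalize=True)   # proportions that sum to 1

# Combine both into one frequency table
table = pd.DataFrame({"count": abs_freq, "proportion": rel_freq.round(3)})
print(table)
```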

## Challenge

+ Finish transforming all the other qualitative variables, using the data description at the link provided.

+ Then, using the describe() method, comment on the results obtained.

+ Build relative frequency tables.

In [25]:
# Add the descriptive labels
dict_situation = {1:'Single', 2:'Living with spouse', 3:'Living with family', 4:'Living with someone else'}
data['situation_c'] = data['situation'].replace(dict_situation).astype("category")

In [26]:
data['situation_c'].describe()

count                    226
unique                     3
top       Living with spouse
freq                     119
Name: situation_c, dtype: object

In [27]:
# Build a simple frequency table (absolute counts; pass normalize=True for relative frequencies)
data['situation_c'].value_counts()

Living with spouse    119
Single                 98
Living with family      9
Name: situation_c, dtype: int64

In [29]:
dict_fat = {1:'butter', 2:'margarine', 3:'peanut oil', 4:'sunflower oil', 5:'olive oil', 6:'mix of vegetable oils', 7:'colza oil', 8:'duck or goose fat'}
data['fat_c'] = data['fat'].replace(dict_fat).astype("category")

In [32]:
data['fat_c'].describe()

count               226
unique                8
top       sunflower oil
freq                 68
Name: fat_c, dtype: object

In [31]:
data['fat_c'].value_counts()

sunflower oil            68
peanut oil               48
olive oil                40
margarine                27
mix of vegetable oils    23
butter                   15
duck or goose fat         4
colza oil                 1
Name: fat_c, dtype: int64

In [33]:
dict_meat = {0:'never', 1:'less than once a week', 2:'once a week', 3:'2/3 times a week', 4:'4/6 times a week', 5:'every day'}
data['meat_c'] = data['meat'].replace(dict_meat).astype('category')

In [34]:
data['meat_c'].describe()

count                  226
unique                   6
top       2/3 times a week
freq                    83
Name: meat_c, dtype: object

In [35]:
data['meat_c'].value_counts()

2/3 times a week         83
4/6 times a week         67
every day                61
once a week              11
less than once a week     3
never                     1
Name: meat_c, dtype: int64

In [36]:
dict_fish = {0:'never', 1:'less than once a week', 2:'once a week', 3:'2/3 times a week', 4:'4/6 times a week', 5:'every day'}
data['fish_c'] = data['fish'].replace(dict_fish).astype('category')

In [37]:
data['fish_c'].describe()

count                  226
unique                   6
top       2/3 times a week
freq                   118
Name: fish_c, dtype: object

In [38]:
data['fish_c'].value_counts()

2/3 times a week         118
once a week               61
less than once a week     21
4/6 times a week          15
every day                  7
never                      4
Name: fish_c, dtype: int64

In [43]:
dict_raw_fruit = {0:'never', 1:'less than once a week', 2:'once a week', 3:'2/3 times a week', 4:'4/6 times a week', 5:'every day'}
data['raw_fruit_c'] = data['raw_fruit'].replace(dict_raw_fruit).astype('category')

In [44]:
data['raw_fruit_c'].describe()

count           226
unique            6
top       every day
freq            172
Name: raw_fruit_c, dtype: object

In [45]:
data['raw_fruit_c'].value_counts()

every day                172
4/6 times a week          22
2/3 times a week          14
less than once a week      8
once a week                8
never                      2
Name: raw_fruit_c, dtype: int64

In [46]:
dict_cooked_fruit_veg = {0:'never', 1:'less than once a week', 2:'once a week', 3:'2/3 times a week', 4:'4/6 times a week', 5:'every day'}
data['cooked_fruit_veg_c'] = data['cooked_fruit_veg'].replace(dict_cooked_fruit_veg).astype('category')

In [47]:
data['cooked_fruit_veg_c'].describe()

count           226
unique            6
top       every day
freq            148
Name: cooked_fruit_veg_c, dtype: object

In [48]:
data['cooked_fruit_veg_c'].value_counts()

every day                148
4/6 times a week          36
2/3 times a week          30
once a week                7
less than once a week      3
never                      2
Name: cooked_fruit_veg_c, dtype: int64

In [49]:
dict_chocol = {0:'never', 1:'less than once a week', 2:'once a week', 3:'2/3 times a week', 4:'4/6 times a week', 5:'every day'}
data['chocol_c'] = data['chocol'].replace(dict_chocol).astype('category')

In [50]:
data['chocol_c'].describe()

count           226
unique            6
top       every day
freq             65
Name: chocol_c, dtype: object

In [51]:
data['chocol_c'].value_counts()

every day                65
less than once a week    62
never                    50
2/3 times a week         22
once a week              16
4/6 times a week         11
Name: chocol_c, dtype: int64
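
The five food-frequency columns above share one code-to-label mapping, so the repeated cells could be collapsed into a loop. Because the labels have a natural order, an ordered categorical dtype also lets pandas sort and compare them. The following is only a sketch on a hypothetical mini-frame (`toy`); the names `freq_labels` and `freq_dtype` are illustrative, and `map` is used here in place of the `replace` call from the cells above.

```python
import pandas as pd

# Hypothetical mini-dataset using the same 0-5 frequency codes (values are made up)
toy = pd.DataFrame({
    "meat": [3, 4, 5, 3, 0],
    "fish": [2, 3, 3, 1, 2],
})

freq_labels = {
    0: "never",
    1: "less than once a week",
    2: "once a week",
    3: "2/3 times a week",
    4: "4/6 times a week",
    5: "every day",
}

# Ordered categories: never < less than once a week < ... < every day
freq_dtype = pd.CategoricalDtype(categories=list(freq_labels.values()), ordered=True)

# Apply the same mapping to every frequency column
for col in ["meat", "fish"]:
    toy[col + "_c"] = toy[col].map(freq_labels).astype(freq_dtype)

# Counts listed in category order rather than by frequency
meat_counts = toy["meat_c"].value_counts().sort_index()
print(meat_counts)
print("Highest meat frequency observed:", toy["meat_c"].max())
```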

In [52]:
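# Note: without parentheses, data.head is displayed as a bound method (with the full frame repr); call data.head() to show only the first five rows
data.head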

<bound method NDFrame.head of      gender  situation  tea  coffee  height  weight  age  meat  fish  \
0         2          1    0       0     151      58   72     4     3   
1         2          1    1       1     162      60   68     5     2   
2         2          1    0       4     162      75   78     3     1   
3         2          1    0       0     154      45   91     0     4   
4         2          1    2       1     154      50   65     5     3   
..      ...        ...  ...     ...     ...     ...  ...   ...   ...   
221       2          1    0       1     160      73   74     4     3   
222       2          2    0       3     163      62   68     4     3   
223       1          2    0       2     170      74   71     4     3   
224       2          1    0       2     154      45   77     4     3   
225       2          2    2       0     159      63   69     3     3   

     raw_fruit  ...  chocol  fat  gender_c         situation_c  \
0            1  ...       5    6    Fem