## Challenge 2: Frequency tables

### 1. Objectives:

- Learn to build frequency tables by segmenting our data into bins

---
    
### 2. Development:

#### a) Analyzing distributions with frequency tables

We are going to build frequency tables for the following datasets and columns:

1. Dataset: 'near_earth_objects-jan_feb_1995-clean.csv'
    - Columns to analyze: 'estimated_diameter.meters.estimated_diameter_max' and 'relative_velocity.kilometers_per_second'
2. Dataset: 'new_york_times_bestsellers-clean.json'
    - Columns to analyze: 'price.numberDouble'
3. Dataset: 'melbourne_housing-clean.csv'
    - Columns to analyze: 'land_size'
    
These are the same datasets we plotted in the previous Challenge. Before building the frequency tables, check the range of each column and decide on an appropriate number of bins for each one.
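
As a starting point for choosing the number of bins, the sketch below compares two common rules of thumb (Sturges and Freedman-Diaconis) as implemented by `numpy.histogram_bin_edges`. The helper name `suggest_bin_counts` and the sample values are hypothetical and only for illustration; the suggested counts are a guide, not a replacement for looking at the data.

```python
import numpy as np
import pandas as pd


def suggest_bin_counts(values: pd.Series) -> dict:
    """Return the bin counts suggested by two common rules of thumb."""
    clean = values.dropna().to_numpy()
    return {
        # histogram_bin_edges returns the edges, so bins = edges - 1
        rule: len(np.histogram_bin_edges(clean, bins=rule)) - 1
        for rule in ("sturges", "fd")
    }


# Hypothetical sample values standing in for a real column
sample = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 40.0])

print("range:", sample.max() - sample.min())
print(suggest_bin_counts(sample))
```

The two rules can disagree considerably on skewed data, which is itself a hint that equal-width bins may hide detail in the dense part of the distribution.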

Then build the frequency table for each column and compare it with the boxplots you made in the previous Challenge. Is there any new information? What advantages or disadvantages does this new perspective offer?

Think about which of the two approaches (boxplots or frequency tables) is more useful for detecting outliers. Or are they simply useful in different contexts?
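
To make the comparison concrete, here is a minimal sketch that flags outliers with the rule a boxplot uses for its whiskers (values beyond 1.5 times the IQR from the quartiles) and builds a frequency table from the same values. The function name `iqr_outliers` and the sample data are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values that fall outside the boxplot whiskers (k * IQR rule)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical data with a single extreme value
sample = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 13, 95])

outliers = iqr_outliers(sample)
# Equal-width frequency table over the same values, empty bins included
freq_table = pd.cut(sample, 5).value_counts(sort=False)

print(outliers)
print(freq_table)
```

The IQR rule names the individual outlying values, while the frequency table shows how isolated they are: a run of empty bins between the bulk of the data and the last bin.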

In [4]:
import pandas as pd


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# The exercise lists the '-clean' file; this cell loads the '-raw' one
df = pd.read_csv('/content/drive/MyDrive/Datasets/near_earth_objects-jan_feb_1995-raw.csv', index_col='Unnamed: 0')

In [7]:
df2 = pd.read_json('/content/drive/MyDrive/Datasets/new_york_times_bestsellers-clean.json')

In [8]:
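# The exercise lists the '-clean' file; this cell loads the '-no_nans' one, where the column is named 'Landsize'
df3 = pd.read_csv('/content/drive/MyDrive/Datasets/melbourne_housing-no_nans.csv', index_col='Unnamed: 0')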

In [26]:
diameter = df['estimated_diameter.meters.estimated_diameter_max']
velocity = df['relative_velocity.kilometers_per_second']

In [31]:
price = df2['price.numberDouble']

In [34]:
landsize = df3['Landsize']

In [16]:
diameter.min()

2.978790628

In [17]:
diameter.max()

6516.883821679

In [18]:
diameter.max()-diameter.min()

6513.905031051

In [28]:
velocity.max()-velocity.min()

39.8459916905

## Diameter

In [29]:
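# Split the diameter range into 20 equal-width bins and count the values in each one
bins = pd.cut(diameter, 20)
# observed=False keeps the empty bins in the table
diameter.groupby(bins, observed=False).count()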

estimated_diameter.meters.estimated_diameter_max
(-3.535, 328.674]       207
(328.674, 654.369]       67
(654.369, 980.065]       24
(980.065, 1305.76]       18
(1305.76, 1631.455]       4
(1631.455, 1957.15]       6
(1957.15, 2282.846]       1
(2282.846, 2608.541]      1
(2608.541, 2934.236]      1
(2934.236, 3259.931]      1
(3259.931, 3585.627]      1
(3585.627, 3911.322]      1
(3911.322, 4237.017]      0
(4237.017, 4562.712]      0
(4562.712, 4888.408]      0
(4888.408, 5214.103]      0
(5214.103, 5539.798]      0
(5539.798, 5865.493]      0
(5865.493, 6191.189]      0
(6191.189, 6516.884]      1
Name: estimated_diameter.meters.estimated_diameter_max, dtype: int64
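
The table above shows absolute counts only. As a sketch of how relative and cumulative frequencies could be added to read the same distribution as proportions, the hypothetical helper `frequency_table` below builds all three columns from a series; the sample values are made up and are not taken from the dataset.

```python
import pandas as pd


def frequency_table(values: pd.Series, bins: int) -> pd.DataFrame:
    """Build a table with absolute, relative, and cumulative relative frequencies."""
    # value_counts on a categorical keeps empty bins; sort=False keeps bin order
    counts = pd.cut(values, bins).value_counts(sort=False)
    table = pd.DataFrame({"count": counts})
    table["relative"] = table["count"] / table["count"].sum()
    table["cumulative"] = table["relative"].cumsum()
    return table


# Hypothetical right-skewed data
sample = pd.Series([3, 5, 8, 12, 20, 25, 40, 60, 150, 900])

table = frequency_table(sample, bins=4)
print(table)
```

The cumulative column makes it easy to state what share of the observations falls below a given bin edge.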

## Velocity

In [30]:
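# Split the velocity range into 20 equal-width bins and count the values in each one
bins2 = pd.cut(velocity, 20)
# observed=False keeps the empty bins in the table
velocity.groupby(bins2, observed=False).count()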

relative_velocity.kilometers_per_second
(0.642, 2.674]       5
(2.674, 4.666]      14
(4.666, 6.658]      33
(6.658, 8.651]      33
(8.651, 10.643]     31
(10.643, 12.635]    28
(12.635, 14.628]    26
(14.628, 16.62]     41
(16.62, 18.612]     36
(18.612, 20.604]    18
(20.604, 22.597]    16
(22.597, 24.589]     9
(24.589, 26.581]    11
(26.581, 28.574]     6
(28.574, 30.566]    11
(30.566, 32.558]     5
(32.558, 34.551]     2
(34.551, 36.543]     3
(36.543, 38.535]     3
(38.535, 40.527]     2
Name: relative_velocity.kilometers_per_second, dtype: int64

## Price

In [36]:
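# Split the price range into 20 equal-width bins and count the values in each one
bins3 = pd.cut(price, 20)
# observed=False keeps the empty bins in the table
price.groupby(bins3, observed=False).count()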

price.numberDouble
(14.97, 15.99]      3
(15.99, 16.99]     11
(16.99, 17.99]      0
(17.99, 18.99]      0
(18.99, 19.99]     33
(19.99, 20.99]      0
(20.99, 21.99]     24
(21.99, 22.99]      9
(22.99, 23.99]     39
(23.99, 24.99]    407
(24.99, 25.99]    666
(25.99, 26.99]    591
(26.99, 27.99]    986
(27.99, 28.99]    168
(28.99, 29.99]     75
(29.99, 30.99]      9
(30.99, 31.99]      0
(31.99, 32.99]      0
(32.99, 33.99]      0
(33.99, 34.99]     12
Name: price.numberDouble, dtype: int64

## Land size

In [41]:
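# Split the land size range into 30 equal-width bins and count the values in each one
bins4 = pd.cut(landsize, 30)
# observed=False keeps the empty bins in the table
landsize.groupby(bins4, observed=False).count()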

Landsize
(-76.0, 2533.333]         11512
(2533.333, 5066.667]         88
(5066.667, 7600.0]           19
(7600.0, 10133.333]           9
(10133.333, 12666.667]        0
(12666.667, 15200.0]          3
(15200.0, 17733.333]          7
(17733.333, 20266.667]        0
(20266.667, 22800.0]          2
(22800.0, 25333.333]          0
(25333.333, 27866.667]        0
(27866.667, 30400.0]          0
(30400.0, 32933.333]          0
(32933.333, 35466.667]        0
(35466.667, 38000.0]          2
(38000.0, 40533.333]          1
(40533.333, 43066.667]        1
(43066.667, 45600.0]          0
(45600.0, 48133.333]          0
(48133.333, 50666.667]        0
(50666.667, 53200.0]          0
(53200.0, 55733.333]          0
(55733.333, 58266.667]        0
(58266.667, 60800.0]          0
(60800.0, 63333.333]          0
(63333.333, 65866.667]        0
(65866.667, 68400.0]          0
(68400.0, 70933.333]          0
(70933.333, 73466.667]        0
(73466.667, 76000.0]          2
Name: Landsize, dtype: int64
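
With equal-width bins, almost every row above lands in the first interval, so the table says little about the typical land sizes. One possible alternative, sketched here with made-up values and hypothetical bin edges, is to pass explicit edges to `pd.cut`: narrow bins where the data is dense and a single open-ended bin for the tail.

```python
import numpy as np
import pandas as pd

# Hypothetical land sizes: mostly small values plus one very large one
sample = pd.Series([0, 120, 300, 450, 600, 650, 700, 980, 2400, 76000])

# Hypothetical edges: narrow bins for the dense region, one open-ended bin for the tail
edges = [0, 250, 500, 1000, 2500, np.inf]

# include_lowest=True makes the first interval closed on the left, so 0 is counted
binned = pd.cut(sample, bins=edges, include_lowest=True)
table = binned.value_counts(sort=False)

print(table)
```

The trade-off is that the bins no longer have equal width, so the counts cannot be compared as densities without dividing by each bin's width.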