<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/marco-canas/cielo/blob/main/revision_bibloografica/c_2_geron/2_edwin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>


## [Video presentation of the reading]()

# Literature review report

Pages 80 to 93 of Géron.

# Get the data


In typical environments, your data would be available in a relational database (or some other common data store) and spread across multiple tables, documents, or files.

To access it, you would first need to get your credentials and access authorizations and familiarize yourself with the data schema.

In this project, however, things are much simpler: you will just download a single comma-separated values (CSV) file, `vivienda.csv`, which contains all the data.

You could download the file with your web browser, but it is preferable to write a small function that does it.

Having a function that downloads the data is particularly useful if the data changes regularly: you can write a small script that uses the function to fetch the latest data (or you can set up a scheduled job to do that automatically at regular intervals).

Automating the process of fetching the data is also useful if you need to install the dataset on multiple machines.

Here the `obtener_datos()` function is imported from the `regresion` module and called with the URL of the CSV file:


In [1]:
from regresion import obtener_datos
vivienda = obtener_datos('https://raw.githubusercontent.com/marco-canas/didactica_ciencia_datos/main/datasets/vivienda/vivienda.csv')

Now, when you call `obtener_datos()`, it downloads the `vivienda.csv` file from the DIMATHDATA (Didáctica de la ciencia de datos) repository and converts it into a pandas DataFrame.

This function returns a pandas DataFrame object containing all the data.
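
The `regresion` module is not shown in this report, so the following is only a minimal sketch of what a function like `obtener_datos()` might do, assuming it downloads the CSV once and loads it with `pandas.read_csv`. The name `obtener_datos_sketch` and the `ruta_local` and `forzar` parameters are hypothetical and chosen for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd


def obtener_datos_sketch(url, ruta_local="vivienda.csv", forzar=False):
    """Hypothetical stand-in for regresion.obtener_datos (whose source is not shown).

    Downloads the CSV at `url` to `ruta_local` unless a copy already exists
    (or `forzar` is True), then loads it into a pandas DataFrame.
    """
    ruta = Path(ruta_local)
    if forzar or not ruta.is_file():
        # Create the target folder if needed, then fetch the file.
        ruta.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(ruta))
    return pd.read_csv(ruta)
```

Calling `obtener_datos_sketch(url)` with the URL used above would download `vivienda.csv` on the first call and reuse the local copy afterwards. Passing `forzar=True` fetches the file again, which suits data that changes regularly.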

# Take a quick look at the data structure

Let’s take a look at the top five rows using the DataFrame’s `head()` method (see Figure 2-5).

In [2]:
vivienda.head() 

|   | longitud | latitud | antiguedad | habitaciones | dormitorios | población | hogares | ingresos | proximidad | precio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | NEAR BAY | 452600.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | NEAR BAY | 358500.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | NEAR BAY | 352100.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | NEAR BAY | 341300.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | NEAR BAY | 342200.0 |


Each row represents one district. There are 10 attributes: `longitud` (longitude),
`latitud` (latitude), `antiguedad` (housing median age), `habitaciones` (total rooms),
`dormitorios` (total bedrooms), `población` (population), `hogares` (households),
`ingresos` (median income), `proximidad` (ocean proximity), and `precio`
(median house value).

The `info()` method is useful to get a quick description of the data, in particular
the total number of rows, each attribute’s type, and the number of non-null
values (see Figure 2-6).


In [3]:
vivienda.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   longitud      20640 non-null  float64
 1   latitud       20640 non-null  float64
 2   antiguedad    20640 non-null  float64
 3   habitaciones  20640 non-null  float64
 4   dormitorios   20433 non-null  float64
 5   población     20640 non-null  float64
 6   hogares       20640 non-null  float64
 7   ingresos      20640 non-null  float64
 8   proximidad    20640 non-null  object 
 9   precio        20640 non-null  float64
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


There are 20,640 instances in the dataset, which means that it is fairly small by
Machine Learning standards, but it’s perfect to get started. Notice that the
`dormitorios` attribute has only 20,433 non-null values, meaning that 207
districts are missing this feature. We will need to take care of this later.

All attributes are numerical, except the `proximidad` field. Its type is
`object`, so it could hold any kind of Python object. But since the data was
loaded from a CSV file, you know that it must be a text attribute. When you
looked at the top five rows, you probably noticed that the values in the
`proximidad` column were repetitive, which means that it is probably a
categorical attribute. You can find out what categories exist and how many
districts belong to each category by using the `value_counts()` method:


In [4]:
vivienda['proximidad'].value_counts() 

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: proximidad, dtype: int64
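
The checks described above (which columns are not numerical, how many values are missing, how the categories are distributed) can be sketched on a small frame. The `muestra` frame below is a toy example, not the real dataset: it reuses three columns from the five rows shown by `head()`, and one `dormitorios` value is blanked out on purpose so that there is something to count.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): five rows from the head() output above, with one
# `dormitorios` value replaced by NaN purely for illustration.
muestra = pd.DataFrame({
    "habitaciones": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "dormitorios": [129.0, 1106.0, np.nan, 235.0, 280.0],
    "proximidad": ["NEAR BAY"] * 5,
})

# Columns that are not numerical (candidates for categorical attributes).
no_numericas = list(muestra.select_dtypes(exclude="number").columns)

# Missing values per column, the same information info() reports indirectly.
faltantes = muestra.isna().sum()

# Share of rows per category instead of raw counts.
proporciones = muestra["proximidad"].value_counts(normalize=True)

print(no_numericas)
print(faltantes)
print(proporciones)
```

On the full DataFrame, `vivienda.isna().sum()` should report 207 for `dormitorios`, consistent with 20,640 minus 20,433 in the `info()` output.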

Let’s look at the other fields. The `describe()` method shows a summary of the
numerical attributes (Figure 2-7).

In [5]:
vivienda.describe()

|   | longitud | latitud | antiguedad | habitaciones | dormitorios | población | hogares | ingresos | precio |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


The `count`, `mean`, `min`, and `max` rows are self-explanatory. Note that the null
values are ignored (so, for example, the count of `dormitorios` is 20,433,
not 20,640). The `std` row shows the standard deviation, which measures how
dispersed the values are. The 25%, 50%, and 75% rows show the
corresponding percentiles: a percentile indicates the value below which a given
percentage of observations in a group of observations fall. For example, 25% of
the districts have an `antiguedad` lower than 18, while 50% are lower
than 29 and 75% are lower than 37. These are often called the 25th percentile
(or first quartile), the median, and the 75th percentile (or third quartile).
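
As a small illustration of how these percentile rows relate to pandas' `quantile()` method, the sketch below uses a toy series, not the real column: it contains only the minimum, the three quartiles, and the maximum reported for `antiguedad` above.

```python
import pandas as pd

# Toy series (assumption): the min, quartiles, and max that describe()
# reports for `antiguedad`; this is not the real column.
edades = pd.Series([1.0, 18.0, 29.0, 37.0, 52.0], name="antiguedad")

# quantile() takes fractions between 0 and 1 rather than percentages.
cuartiles = edades.quantile([0.25, 0.50, 0.75])
print(cuartiles)

# describe() reports the same values in its 25%, 50%, and 75% rows.
resumen = edades.describe()
print(resumen[["25%", "50%", "75%"]])
```

On the full dataset, `vivienda['antiguedad'].quantile([0.25, 0.5, 0.75])` should match the corresponding rows of the `describe()` output.
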

Another quick way to get a feel for the type of data you are dealing with is to
plot a histogram for each numerical attribute. A histogram shows the number of
instances (on the vertical axis) that have a given value range (on the horizontal
axis). You can either plot one attribute at a time, or you can call the `hist()`
method on the whole dataset, and it will plot a histogram for each numerical
attribute (see Figure 2-8).
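
A call to `hist()` on a whole DataFrame might look like the sketch below. It runs on toy data (four columns from the five rows shown by `head()`), and the `bins` and `figsize` values are illustrative assumptions rather than settings taken from the source.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (assumption): values copied from the five rows shown by head() above.
muestra = pd.DataFrame({
    "antiguedad": [41.0, 21.0, 52.0, 52.0, 52.0],
    "ingresos": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "precio": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "proximidad": ["NEAR BAY"] * 5,
})

# One histogram per numerical column; the text column is skipped.
ejes = muestra.hist(bins=5, figsize=(8, 6))
plt.tight_layout()

# Collect the titles of the subplots that were actually drawn.
titulos = sorted(eje.get_title() for eje in ejes.ravel() if eje.get_title())
print(titulos)
```

On the full dataset the equivalent call would be something like `vivienda.hist(bins=50, figsize=(20, 15))`, followed by `plt.show()` outside a notebook.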


# References

