# Numerical Data Processing in Python

In this notebook, we explore several techniques for processing numerical data that are fundamental to data science and artificial intelligence. They include data scaling and transformation, which are essential for preparing data for machine learning algorithms.

Processing numerical data is a crucial step in any data science workflow. Scaling and transformation techniques can improve the effectiveness of machine learning algorithms by making the data more manageable and relevant to them. The preprocessing utilities in Scikit Learn can ease this process by providing efficient functions for these transformations.



## Importing Necessary Libraries

First, we need to import the necessary libraries for our data processing. We will be using `timeit` for measuring training times, `numpy` for numerical operations, `pandas` for data manipulation, `matplotlib` for data visualization, and `sklearn` for machine learning tasks.

In [None]:
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

## Loading and Preparing Data

Next, we will load a dataset from `sklearn` datasets. We will use the diabetes dataset for our examples. This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

In [None]:
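# load the features X (442 rows, 10 columns) and the target y (disease progression)
X, y = datasets.load_diabetes(return_X_y=True)
# keep only the column at index 2 (body mass index) as a (442, 1) column vector
raw = X[:, None, 2]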
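
To see which feature the slice above selects, the following sketch loads the same dataset as a `Bunch` object and reads its `feature_names` attribute. The variable names `bunch` and `bmi_column` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

# Load the dataset as a Bunch so that the feature names are available
bunch = datasets.load_diabetes()
X, y = bunch.data, bunch.target

print(X.shape)                 # (n_samples, n_features)
print(bunch.feature_names[2])  # name of the column at index 2

# Indexing with None keeps the selected feature two-dimensional,
# which is the layout scikit-learn estimators expect for X
bmi_column = X[:, None, 2]
print(bmi_column.shape)

# The same column can be obtained by slicing a one-column range
assert np.array_equal(bmi_column, X[:, 2:3])
```

Keeping the feature as a column vector avoids having to reshape it before each call to `fit`.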

## Data Scaling

Data scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

### Max-Min Scaling

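Max-Min Scaling is a scaling technique in which values are shifted and rescaled to a fixed range, most commonly 0 to 1. It is also known as Min-Max normalization. The code cell in this section uses a variant of the formula that maps the values to the range -1 to 1 instead.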
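
As a minimal sketch of the idea, the hypothetical helper `min_max_scale` below maps the minimum of an array to `low` and the maximum to `high`. The sample values are made up for illustration. The last lines check that the symmetric variant agrees with the `(2x - max - min) / (max - min)` form used in this notebook's code.

```python
import numpy as np

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so that the minimum becomes low and the maximum high."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        raise ValueError("cannot scale a constant array")
    unit = (values - v_min) / (v_max - v_min)  # now in [0, 1]
    return low + unit * (high - low)

sample = np.array([2.0, 4.0, 6.0, 10.0])      # made-up values
unit_range = min_max_scale(sample)            # scaled to [0, 1]
symmetric = min_max_scale(sample, -1.0, 1.0)  # scaled to [-1, 1]
print(unit_range)
print(symmetric)

# The symmetric variant agrees with (2x - max - min) / (max - min)
alt = (2 * sample - sample.max() - sample.min()) / (sample.max() - sample.min())
assert np.allclose(symmetric, alt)
```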

### Z-score Normalization

Z-score normalization rescales the data using its mean and standard deviation rather than its minimum and maximum, so it is typically less affected by a single extreme value than Max-Min Scaling. The mean is subtracted from each raw value and the result is divided by the standard deviation, giving a distribution of values with a mean of 0 and a standard deviation of 1.
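
The description above can be sketched with a small hypothetical helper, `z_score`, applied to made-up values. It uses the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

def z_score(values):
    """Subtract the mean, then divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        raise ValueError("cannot normalize a constant array")
    return (values - values.mean()) / std

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values
z = z_score(sample)
print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```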

In [None]:
# max-min scaling: maps the minimum to -1 and the maximum to 1
max_raw = raw.max()
min_raw = raw.min()
scaled = (2*raw - max_raw - min_raw)/(max_raw - min_raw)

# z-score normalization
avg = np.average(raw)
std = np.std(raw)
z_scaled = (raw - avg)/std
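
The same two transformations are also available in `sklearn.preprocessing`. The sketch below applies them to a small made-up column vector and checks that `MinMaxScaler` with `feature_range=(-1, 1)` and `StandardScaler` reproduce the manual formulas. It is one possible way to use these utilities, not the only one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature as a column vector, the layout the scalers expect
feature = np.array([[1.0], [2.0], [4.0], [9.0]])

# Manual formulas, same as in the notebook cell
manual_minmax = (2 * feature - feature.max() - feature.min()) / (feature.max() - feature.min())
manual_z = (feature - np.average(feature)) / np.std(feature)

# Library equivalents; feature_range=(-1, 1) matches the manual variant
sk_minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(feature)
sk_z = StandardScaler().fit_transform(feature)

assert np.allclose(manual_minmax, sk_minmax)
assert np.allclose(manual_z, sk_z)
```

A fitted scaler object also remembers the statistics it learned, so the same transformation can later be applied to new data with `transform`.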

## Training Models and Comparing Training Times

Now, we will train linear regression models using the raw, max-min scaled, and z-score normalized data. We will compare the training times for each to see the impact of our data scaling.

In [None]:
# models for training
def train_raw():
    linear_model.LinearRegression().fit(raw, y)

def train_scaled():
    linear_model.LinearRegression().fit(scaled, y)

def train_z_scaled():
    linear_model.LinearRegression().fit(z_scaled, y)

# time 100 fits of each model, each on its own version of the data
raw_time = timeit.timeit(train_raw, number=100)
scaled_time = timeit.timeit(train_scaled, number=100)
z_scaled_time = timeit.timeit(train_z_scaled, number=100)
print('Training time for raw data : {}'.format(raw_time))
print('Training time for scaled data : {}'.format(scaled_time))
print('Training time for z_scaled data : {}'.format(z_scaled_time))
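
Single timing runs are noisy, so small differences between them can be misleading. A somewhat more robust sketch uses `timeit.repeat` and keeps the fastest run. The helper `best_fit_time` and the synthetic data below are assumptions made for illustration and do not come from the original notebook.

```python
import timeit
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
features = rng.normal(size=(442, 1))  # synthetic stand-in for one feature
target = 3.0 * features[:, 0] + rng.normal(size=442)

def best_fit_time(X, y, repeat=5, number=20):
    """Return the fastest per-fit time in seconds over several repeats."""
    def fit():
        linear_model.LinearRegression().fit(X, y)
    runs = timeit.repeat(fit, repeat=repeat, number=number)
    return min(runs) / number

versions = {
    "raw": features,
    "z-scored": (features - features.mean()) / features.std(),
}
for name, data in versions.items():
    print(f"{name}: {best_fit_time(data, target):.6f} s per fit")
```

Taking the minimum rather than the mean reduces the influence of other processes running on the machine.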

## Non-linear Transformations

In addition to scaling, we can also apply non-linear transformations to our data. These transformations can be useful when our data has a skewed distribution. For example, the `tanh` transformation squashes every value into the range -1 to 1, which reduces the impact of outliers.

In [None]:
# non-linear transformation
tanh_transformed = np.tanh(raw)

def train_tanh_transformed():
    linear_model.LinearRegression().fit(tanh_transformed, y)

tanh_transformed_time = timeit.timeit(train_tanh_transformed, number = 100)
print('Training time for tanh_transformed data : {}'.format(tanh_transformed_time))
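
To illustrate how `tanh` treats values of different magnitudes, the sketch below applies it to a few made-up numbers. Large magnitudes saturate near -1 or 1, while values close to zero are left almost unchanged.

```python
import numpy as np

values = np.array([-50.0, -1.0, -0.01, 0.0, 0.01, 1.0, 50.0])  # made-up values
squashed = np.tanh(values)
print(squashed)

# Every output lies in [-1, 1]; large magnitudes saturate near the bounds
assert np.all(np.abs(squashed) <= 1.0)

# Near zero the function is almost the identity, so small inputs barely change
small = np.array([-0.01, 0.01])
assert np.allclose(np.tanh(small), small, atol=1e-6)
```

For a feature whose magnitudes are well below 1, `tanh` therefore acts almost like the identity, so it is worth checking the range of the input before relying on it to tame outliers.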

## Conclusion

In this notebook, we have explored various techniques for processing numerical data, including scaling and non-linear transformations. We have seen how these techniques can impact the training time of our machine learning models. It's important to remember that the choice of preprocessing techniques depends on the specific dataset and the machine learning algorithm being used. Therefore, it's always a good idea to experiment with different preprocessing techniques to find the best approach for your specific task.