<a href="https://colab.research.google.com/github/mayait/ClaseMachineLearning/blob/main/SupervisedLearning/Classification/LinearRegEjercicio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression

## Objective

*   Use scikit-learn to implement a Linear Regression model
*   Create a model, then train, evaluate, and use it
*   Optimize the model with regularization

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

In [None]:
# Datasets
!wget -nv https://raw.githubusercontent.com/mayait/ClaseAnalisisDatos/main/machine_learning/datasets/advertising_and_sales_clean.csv
!wget -nv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv
!wget -nv https://github.com/mayait/ClaseAnalisisDatos/raw/main/machine_learning/datasets/crime.xlsx

# Fuel consumption

### `FuelConsumptionCo2.csv`:

The fuel consumption dataset, **`FuelConsumptionCo2.csv`**, contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. [Dataset source](http://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64)

*   **MODELYEAR** e.g. 2014
*   **MAKE** e.g. Acura
*   **MODEL** e.g. ILX
*   **VEHICLE CLASS** e.g. SUV
*   **ENGINE SIZE** e.g. 4.7
*   **CYLINDERS** e.g. 6
*   **TRANSMISSION** e.g. A6
*   **FUELTYPE** e.g. z
*   **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9
*   **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9
*   **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2
*   **CO2 EMISSIONS (g/km)** e.g. 182   --> low --> 0

In [None]:
# The file downloaded above is named FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumptionCo2.csv")

## 🌶️ 🐕 Perform an EDA on the fuel consumption data and summarize the findings

- Correlation analysis: evaluate the correlation between the dependent variable and the independent variables with a correlation matrix or scatter plots.
- Normality analysis: check whether the variables follow a normal distribution using histograms.
- Outlier analysis: look for extreme values in the variables.
- Missing-data analysis: check whether values are missing and decide how to handle them.
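
The four checks above can be sketched with pandas alone. The helper below is only an illustration: `eda_summary` and the tiny demo frame are hypothetical, not part of the course material, and the 1.5 × IQR rule is one common outlier convention among several.

```python
import pandas as pd

def eda_summary(df, target):
    """Return basic EDA tables for df (hypothetical helper for illustration)."""
    numeric = df.select_dtypes(include="number")

    # Correlation of every numeric feature with the target
    correlation = numeric.corr()[target].drop(target)

    # Skewness near 0 suggests a roughly symmetric distribution
    skewness = numeric.skew()

    # Outliers: values more than 1.5 * IQR beyond the quartiles
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()

    # Missing values per column (all columns, not only numeric ones)
    missing = df.isna().sum()

    return correlation, skewness, outliers, missing

# Made-up data with one extreme engine size and one missing target value
demo = pd.DataFrame({
    "ENGINESIZE": [1.0, 2.0, 3.0, 4.0, 5.0, 30.0],
    "CO2EMISSIONS": [100.0, 150.0, 200.0, 250.0, 300.0, None],
})
correlation, skewness, outliers, missing = eda_summary(demo, "CO2EMISSIONS")
print(correlation, outliers, missing, sep="\n\n")
```

Histograms (for example `df.hist()`) complement the skewness numbers when judging normality by eye.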

## 🌶️ 🐕 Select the variables

In [None]:
# We will use only the following features for this exercise
# X: 'ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB'
# Y: 'CO2EMISSIONS'
# Split the data into train and test, with 70% for train and 30% for test

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 🌶️ 🐕 Select the features to use
X = df[🌶️🌶️🌶️🌶️]

# 🌶️ 🐕 Select the variable to predict
Y = df[🌶️🌶️🌶️🌶️]

# 🌶️ 🐕 Split the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(🌶️🌶️🌶️🌶️🌶️🌶️🌶️🌶️🌶️)



# Least Squares Method
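
Ordinary least squares picks the coefficients that minimize the sum of squared residuals. As a rough illustration of the problem that `LinearRegression` solves, the sketch below fits the same kind of model with `np.linalg.lstsq` on synthetic data; the data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3 + 2*x1 - 1*x2 + small noise
X_demo = rng.normal(size=(200, 2))
y_demo = 3 + 2 * X_demo[:, 0] - 1 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is estimated together with the slopes
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution of A @ beta ≈ y
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# MSE and R2 computed from their definitions
residuals = y_demo - A @ beta
mse_demo = np.mean(residuals ** 2)
r2_demo = 1 - np.sum(residuals ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)

print("intercept and slopes:", np.round(beta, 2))
print("MSE: %.4f  R2: %.4f" % (mse_demo, r2_demo))
```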

In [None]:
# 🌶️ 🐕 Write a function that trains a linear regression model given X_train, X_test, Y_train, Y_test
# 🌶️ 🐕 The function returns the mean squared error (MSE) and the coefficient of determination (R2) for train and test
# 🌶️ 🐕 Determine whether the model is overfit, underfit, or well fit

from sklearn.metrics import mean_squared_error, r2_score

def entrenar_modelo_minimos_cuadrados(X_train, X_test, Y_train, Y_test):
    # 🌶️ 🐕 Instantiate the model
    regr = 🌶️🌶️🌶️🌶️🌶️🌶️ # Which model?

    # Train the model
    regr.fit(X_train, Y_train)

    # Make the predictions
    Y_pred_train = regr.predict(X_train)
    Y_pred_test = regr.predict(X_test)

    # 🌶️ 🐕 Compute the MSE for train and test with mean_squared_error(...)
    mse_train = 🌶️🌶️🌶️🌶️🌶️🌶️
    mse_test = 🌶️🌶️🌶️🌶️🌶️🌶️

    # 🌶️ 🐕 Compute the R2 for train and test with r2_score(...)
    r2_train = 🌶️🌶️🌶️🌶️🌶️🌶️
    r2_test = 🌶️🌶️🌶️🌶️🌶️🌶️

    # Print the results
    print("MSE train: %.2f" % mse_train)
    print("MSE test: %.2f" % mse_test)
    print("R2 train: %.2f" % r2_train)
    print("R2 test: %.2f" % r2_test)

    # Determine whether the model is overfit, underfit, or well fit
    if r2_train > r2_test:
        print("The model is 🌶️🌶️🌶️🌶️🌶️🌶️")
    elif r2_train < r2_test:
        print("The model is 🌶️🌶️🌶️🌶️🌶️🌶️")
    else:
        print("The model 🌶️🌶️🌶️🌶️🌶️🌶️")

    return mse_train, mse_test, r2_train, r2_test


In [None]:
"""
The result should look like this:
MSE train: 563.97
MSE test: 502.45
R2 train: 0.86
R2 test: 0.88
The model is underfit
(563.9675185805852, 502.4460931860747, 0.8590623862972946, 0.8754316437513163)
"""

entrenar_modelo_minimos_cuadrados(X_train, X_test, Y_train, Y_test)

# Ridge
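
Ridge regression minimizes the squared error plus `alpha` times the squared L2 norm of the coefficients, so larger `alpha` values pull the coefficients toward zero. The following sketch uses the closed-form solution on centered synthetic data to make that shrinkage visible; `ridge_coefficients` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (the intercept is not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
# The second column is almost a copy of the first (strong collinearity)
X_demo[:, 1] = X_demo[:, 0] + rng.normal(scale=0.01, size=100)
y_demo = X_demo[:, 0] + 2 * X_demo[:, 2] + rng.normal(scale=0.5, size=100)

norms = []
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>7}: coef={np.round(coef, 3)}  norm={norms[-1]:.3f}")
```

The coefficient norm never increases as `alpha` grows, which is why very large values in the grid below tend to underfit.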

In [None]:
# Import Ridge from sklearn.linear_model
# Create a function that fits a ridge regression model with different alpha values
# The function returns the MSE and R2 for train and test at the alpha with the best test R2

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

def entrenar_modelo_ridge(X_train, X_test, Y_train, Y_test):

    # Alpha values to try
    alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

    # Collect one row of results (Alpha, MSE and R2 for train and test) per alpha
    resultados = []

    # Iterate over the alpha values
    for alpha in alphas:
        # Instantiate and train the model
        ridge = Ridge(alpha=alpha)
        ridge.fit(X_train, Y_train)

        # Make the predictions
        Y_pred_train = ridge.predict(X_train)
        Y_pred_test = ridge.predict(X_test)

        # Compute the MSE and R2 and store them
        resultados.append({
            'Alpha': alpha,
            'MSE_train': mean_squared_error(Y_train, Y_pred_train),
            'MSE_test': mean_squared_error(Y_test, Y_pred_test),
            'R2_train': r2_score(Y_train, Y_pred_train),
            'R2_test': r2_score(Y_test, Y_pred_test),
        })

    # Build and print the results table
    df_ridge = pd.DataFrame(resultados)
    print(df_ridge)

    # Which alpha has the best test R2? Print all values for that alpha
    print(df_ridge[df_ridge['R2_test'] == df_ridge['R2_test'].max()])

    # Return the metrics of the best alpha rather than those of the last alpha tried
    mejor = df_ridge.loc[df_ridge['R2_test'].idxmax()]
    return mejor['MSE_train'], mejor['MSE_test'], mejor['R2_train'], mejor['R2_test']

entrenar_modelo_ridge(X_train, X_test, Y_train, Y_test)

# Lasso
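
Lasso uses an L1 penalty instead of Ridge's L2 penalty, which can set some coefficients exactly to zero and so acts as a form of feature selection. A small sketch on synthetic data follows; the data and the choice of alphas are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features influence the target
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in [0.01, 0.5, 10.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_demo, y_demo)
    # Count coefficients that the L1 penalty has driven exactly to zero
    zero_counts[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: coef={np.round(lasso.coef_, 2)}  zeros={zero_counts[alpha]}")
```

Once `alpha` is large enough every coefficient is zero and the model predicts the mean of the target, so the upper end of an alpha grid usually marks the underfitting regime.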

In [None]:
# Import Lasso from sklearn.linear_model
# Create a function that fits a lasso regression model with different alpha values
# The function returns the MSE and R2 for train and test at the alpha with the best test R2

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

def entrenar_modelo_lasso(X_train, X_test, Y_train, Y_test):

    # Alpha values to try
    alphas = [0.1, 1.0, 2, 10.0, 100.0, 1000.0]

    # Collect one row of results (Alpha, MSE and R2 for train and test) per alpha
    resultados = []

    # Iterate over the alpha values
    for alpha in alphas:
        # Instantiate and train the model
        lasso = Lasso(alpha=alpha, max_iter=10000)
        lasso.fit(X_train, Y_train)

        # Make the predictions
        Y_pred_train = lasso.predict(X_train)
        Y_pred_test = lasso.predict(X_test)

        # Compute the MSE and R2 and store them
        resultados.append({
            'Alpha': alpha,
            'MSE_train': mean_squared_error(Y_train, Y_pred_train),
            'MSE_test': mean_squared_error(Y_test, Y_pred_test),
            'R2_train': r2_score(Y_train, Y_pred_train),
            'R2_test': r2_score(Y_test, Y_pred_test),
        })

    # Build and print the results table
    df_lasso = pd.DataFrame(resultados)
    print(df_lasso)

    # Which alpha has the best test R2? Print all values for that alpha
    print('The best alpha value is: ')
    print(df_lasso[df_lasso['R2_test'] == df_lasso['R2_test'].max()])

    # Return the metrics of the best alpha rather than those of the last alpha tried
    mejor = df_lasso.loc[df_lasso['R2_test'].idxmax()]
    return mejor['MSE_train'], mejor['MSE_test'], mejor['R2_train'], mejor['R2_test']

entrenar_modelo_lasso(X_train, X_test, Y_train, Y_test)

# Polynomial
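
Polynomial regression is still a linear model: the features are expanded with powers and cross-products, and a linear model is then fitted on the expanded matrix. The sketch below shows what `PolynomialFeatures` produces for one row with two features; the input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

row = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(row)

# Columns are: 1, a, b, a^2, a*b, b^2
print(expanded)

# With five input features, as in this exercise, the column count grows quickly
for degree in (2, 3):
    n_out = PolynomialFeatures(degree=degree).fit(np.zeros((1, 5))).n_output_features_
    print(f"degree={degree}: {n_out} columns")
```

The rapid growth in columns is one reason regularized models such as Ridge and Lasso are often paired with polynomial features.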

In [None]:
#@title Polynomial Fit in Linear Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

# Generate data points
x = np.linspace(1, 7, num=20)
y = x ** 2 + np.random.normal(scale=2, size=20)

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 9))

# First plot: linear regression without polynomial terms
ax1.scatter(x, y, color='blue')
ax1.set_title('Linear Regression Without Polynomial Fit')
ax1.set_xlabel('X')
ax1.set_ylabel('Y')

# Fit a linear function (degree=1)
pendiente, intercep = np.polyfit(x, y, 1)
funcion_lineal = np.poly1d((pendiente, intercep))
x_vals = np.linspace(x[0], x[-1], 100)
y_vals = funcion_lineal(x_vals)
ax1.plot(x_vals, y_vals, color='red')

# Compute RMSE and R^2
y_pred = funcion_lineal(x)
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
ax1.text(0.05, 0.9, 'RMSE: {:.2f}\nR$^2$: {:.2f}'.format(rmse, r2),
         transform=ax1.transAxes, bbox=dict(facecolor='white', edgecolor='black', pad=10))

# Second plot: linear regression with a polynomial fit
ax2.scatter(x, y, color='blue')
ax2.set_title('Linear Regression With Polynomial Fit')
ax2.set_xlabel('X')
ax2.set_ylabel('Y')

# Fit a polynomial function (degree=2)
grado = 2
coeficientes = np.polyfit(x, y, grado)
polinomio = np.poly1d(coeficientes)
x_vals = np.linspace(x[0], x[-1], 100)
y_vals = polinomio(x_vals)
ax2.plot(x_vals, y_vals, color='green')

# Compute RMSE and R^2
y_pred = polinomio(x)
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
ax2.text(0.05, 0.9, 'RMSE: {:.2f}\nR$^2$: {:.2f}'.format(rmse, r2),
         transform=ax2.transAxes, bbox=dict(facecolor='white', edgecolor='black', pad=10))

# Add a title to the figure
fig.suptitle('Polynomial Fit in Linear Regression', fontsize=24)

# Adjust the layout and show the figure
fig.tight_layout()
plt.show()

In [None]:
# Create a function that fits a polynomial regression model with different degrees
# Import PolynomialFeatures
# Create a PolynomialFeatures object with the desired degree
# The function returns a table with the least squares MSE and R2 for train and test at each degree

from sklearn.preprocessing import PolynomialFeatures

def entrenar_modelo_polinomial(X_train, X_test, Y_train, Y_test):

    # Degrees to try
    grados = [2, 3]

    # Collect one row of least squares results per degree
    resultados = []

    # Iterate over the degrees
    for grado in grados:

        # Instantiate the PolynomialFeatures transformer
        poly = PolynomialFeatures(degree=grado)

        # Expand train and test into polynomial features (fitted on train only)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)

        print('Degree: ', grado)

        # Train the model with least squares (function defined in the Least Squares section)
        print('Training model with Least Squares')
        mse_train, mse_test, r2_train, r2_test = entrenar_modelo_minimos_cuadrados(X_train_poly, X_test_poly, Y_train, Y_test)

        # Store the results for this degree
        resultados.append({'Grado': grado, 'MSE_train': mse_train, 'MSE_test': mse_test,
                           'R2_train': r2_train, 'R2_test': r2_test, 'Modelo': 'Least Squares'})

        # Train the model with Ridge
        print('Training model with Ridge')
        entrenar_modelo_ridge(X_train_poly, X_test_poly, Y_train, Y_test)

        # Train the model with Lasso
        print('Training model with Lasso')
        entrenar_modelo_lasso(X_train_poly, X_test_poly, Y_train, Y_test)

    # Return the table instead of discarding it
    return pd.DataFrame(resultados)

entrenar_modelo_polinomial(X_train, X_test, Y_train, Y_test)

# Min-Max Normalization
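
Min-max normalization rescales each feature with x' = (x - min) / (max - min), which maps the data used to compute min and max onto [0, 1]. The minimum and maximum should come from the training data only, so test values can fall slightly outside that range. A NumPy sketch with made-up numbers:

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0], [50.0]])
test = np.array([[5.0], [40.0], [60.0]])

# Statistics are taken from the training split only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())  # values between 0 and 1
print(test_scaled.ravel())   # may fall outside [0, 1]
```

This mirrors calling `fit` on the training data and `transform` on both splits with `MinMaxScaler`.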

In [None]:
#@title Min Max Norm

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Generate data points
x = np.linspace(-30, 500, num=100)
y = 2 * x + np.random.normal(scale=10, size=100)

# Normalize x and y with min-max normalization (one scaler per variable)
scaler = MinMaxScaler()
x_norm = scaler.fit_transform(x.reshape(-1, 1)).ravel()
y_norm = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()

# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 9))

# First plot: original data
ax1.scatter(x, y, color='blue')
ax1.set_title('Original Data')
ax1.set_xlabel('X')
ax1.set_ylabel('Y')

# Second plot: normalized data
ax2.scatter(x_norm, y_norm, color='red')
ax2.set_title('Normalized Data')
ax2.set_xlabel('X (Normalized)')
ax2.set_ylabel('Y (Normalized)')

# Add a title to the figure
fig.suptitle('Min-Max Normalization', fontsize=24)

# Adjust the layout and show the figure
fig.tight_layout()
plt.show()



In [None]:
# Create a function that normalizes the train and test data with MinMaxScaler
# Import MinMaxScaler from sklearn.preprocessing
# Create a MinMaxScaler object
# Fit the scaler on the training data and transform it
# Create a LinearRegression object and fit it to the scaled training data
# Transform the test data using the same scaler
# Predict on the scaled test data with the trained model
# Evaluate the model's performance


from sklearn.preprocessing import MinMaxScaler

def entrenar_modelo_normalizado(X_train, X_test, Y_train, Y_test):

    # Instantiate MinMaxScaler
    scaler = MinMaxScaler()

    # Fit the scaler on the training data and transform it in one step
    X_train_scaled = scaler.fit_transform(X_train)

    # Transform the test data with the scaler fitted on train
    X_test_scaled = scaler.transform(X_test)

    # Train the model with least squares (function defined in the Least Squares section)
    print('Training model with Least Squares')
    entrenar_modelo_minimos_cuadrados(X_train_scaled, X_test_scaled, Y_train, Y_test)

    # Train the model with Ridge
    print('Training model with Ridge')
    entrenar_modelo_ridge(X_train_scaled, X_test_scaled, Y_train, Y_test)

    # Train the model with Lasso
    print('Training model with Lasso')
    entrenar_modelo_lasso(X_train_scaled, X_test_scaled, Y_train, Y_test)


entrenar_modelo_normalizado(X_train, X_test, Y_train, Y_test)

----
Crime

# UCI Communities and Crime Unnormalized Data Set
## Data Overview
Context
This exercise uses a real dataset obtained from the UCI Machine Learning Repository, titled ‘Crime and Communities’.
It contains 147 attributes and 2216 instances.

The per capita crimes variables were calculated using population values included in the 1995 FBI data (which differ from the 1990 Census values).

Content

The variables describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of sworn full-time police officers on patrol.

The per capita violent crimes variable, **ViolentCrimesPerPop**, is the _target_ variable and is calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault.

<ul>
<li><strong>state</strong>:  US state (by number) - if considered, should be considered nominal (nominal)</li>
<li><strong>county</strong>:  numeric code for county - many missing values (numeric)</li>
<li><strong>communityname</strong>:  community name </li>
<li><strong>population</strong>:  population for community (numeric - decimal)</li>
<li><strong>householdsize</strong>:  mean people per household (numeric - decimal)</li>
<li><strong>racepctblack</strong>:  percentage of population that is african american (numeric - decimal)</li>
<li><strong>racePctWhite</strong>:  percentage of population that is caucasian (numeric - decimal)</li>
<li><strong>racePctAsian</strong>:  percentage of population that is of asian heritage (numeric - decimal)</li>
<li><strong>racePctHisp</strong>:  percentage of population that is of hispanic heritage (numeric - decimal)</li>
<li><strong>agePct12t21</strong>:  percentage of population that is 12-21 in age (numeric - decimal)</li>
<li><strong>agePct12t29</strong>:  percentage of population that is 12-29 in age (numeric - decimal)</li>
<li><strong>agePct16t24</strong>:  percentage of population that is 16-24 in age (numeric - decimal)</li>
<li><strong>agePct65up</strong>:  percentage of population that is 65 and over in age (numeric - decimal)</li>
<li><strong>numbUrban</strong>:  number of people living in areas classified as urban (numeric - decimal)</li>
<li><strong>pctUrban</strong>:  percentage of people living in areas classified as urban (numeric - decimal)</li>
<li><strong>medIncome</strong>:  median household income (numeric - decimal)</li>
<li><strong>pctWWage</strong>:  percentage of households with wage or salary income in 1989 (numeric - decimal)</li>
<li><strong>pctWFarmSelf</strong>:  percentage of households with farm or self employment income in 1989 (numeric - decimal)</li>
<li><strong>pctWInvInc</strong>:  percentage of households with investment / rent income in 1989 (numeric - decimal)</li>
<li><strong>pctWSocSec</strong>:  percentage of households with social security income in 1989 (numeric - decimal)</li>
<li><strong>pctWPubAsst</strong>:  percentage of households with public assistance income in 1989 (numeric - decimal)</li>
<li><strong>pctWRetire</strong>:  percentage of households with retirement income in 1989 (numeric - decimal)</li>
<li><strong>medFamInc</strong>:  median family income (differs from household income for non-family households) (numeric - decimal)</li>
<li><strong>perCapInc</strong>:  per capita income (numeric - decimal)</li>
<li><strong>whitePerCap</strong>:  per capita income for caucasians (numeric - decimal)</li>
<li><strong>blackPerCap</strong>:  per capita income for african americans (numeric - decimal)</li>
<li><strong>indianPerCap</strong>:  per capita income for native americans (numeric - decimal)</li>
<li><strong>AsianPerCap</strong>:  per capita income for people with asian heritage (numeric - decimal)</li>
<li><strong>OtherPerCap</strong>:  per capita income for people with 'other' heritage (numeric - decimal)</li>
<li><strong>HispPerCap</strong>:  per capita income for people with hispanic heritage (numeric - decimal)</li>
<li><strong>NumUnderPov</strong>:  number of people under the poverty level (numeric - decimal)</li>
<li><strong>PctPopUnderPov</strong>:  percentage of people under the poverty level (numeric - decimal)</li>
<li><strong>PctLess9thGrade</strong>:  percentage of people 25 and over with less than a 9th grade education (numeric - decimal)</li>
<li><strong>PctNotHSGrad</strong>:  percentage of people 25 and over that are not high school graduates (numeric - decimal)</li>
<li><strong>PctBSorMore</strong>:  percentage of people 25 and over with a bachelors degree or higher education (numeric - decimal)</li>
<li><strong>PctUnemployed</strong>:  percentage of people 16 and over, in the labor force, and unemployed (numeric - decimal)</li>
<li><strong>PctEmploy</strong>:  percentage of people 16 and over who are employed (numeric - decimal)</li>
<li><strong>PctEmplManu</strong>:  percentage of people 16 and over who are employed in manufacturing (numeric - decimal)</li>
<li><strong>PctEmplProfServ</strong>:  percentage of people 16 and over who are employed in professional services (numeric - decimal)</li>
<li><strong>PctOccupManu</strong>:  percentage of people 16 and over who are employed in manufacturing (numeric - decimal) </li>
<li><strong>PctOccupMgmtProf</strong>:  percentage of people 16 and over who are employed in management or professional occupations (numeric - decimal)</li>
<li><strong>MalePctDivorce</strong>:  percentage of males who are divorced (numeric - decimal)</li>
<li><strong>MalePctNevMarr</strong>:  percentage of males who have never married (numeric - decimal)</li>
<li><strong>FemalePctDiv</strong>:  percentage of females who are divorced (numeric - decimal)</li>
<li><strong>TotalPctDiv</strong>:  percentage of population who are divorced (numeric - decimal)</li>
<li><strong>PersPerFam</strong>:  mean number of people per family (numeric - decimal)</li>
<li><strong>PctFam2Par</strong>:  percentage of families (with kids) that are headed by two parents (numeric - decimal)</li>
<li><strong>PctKids2Par</strong>:  percentage of kids in family housing with two parents (numeric - decimal)</li>
<li><strong>PctYoungKids2Par</strong>:  percent of kids 4 and under in two parent households (numeric - decimal)</li>
<li><strong>PctTeen2Par</strong>:  percent of kids age 12-17 in two parent households (numeric - decimal)</li>
<li><strong>PctWorkMomYoungKids</strong>:  percentage of moms of kids 6 and under in labor force (numeric - decimal)</li>
<li><strong>PctWorkMom</strong>:  percentage of moms of kids under 18 in labor force (numeric - decimal)</li>
<li><strong>NumIlleg</strong>:  number of kids born to never married (numeric - decimal)</li>
<li><strong>PctIlleg</strong>:  percentage of kids born to never married (numeric - decimal)</li>
<li><strong>NumImmig</strong>:  total number of people known to be foreign born (numeric - decimal)</li>
<li><strong>PctImmigRecent</strong>:  percentage of _immigrants_ who immigrated within last 3 years (numeric - decimal)</li>
<li><strong>PctImmigRec5</strong>:  percentage of _immigrants_ who immigrated within last 5 years (numeric - decimal)</li>
<li><strong>PctImmigRec8</strong>:  percentage of _immigrants_ who immigrated within last 8 years (numeric - decimal)</li>
<li><strong>PctImmigRec10</strong>:  percentage of _immigrants_ who immigrated within last 10 years (numeric - decimal)</li>
<li><strong>PctRecentImmig</strong>:  percent of _population_ who have immigrated within the last 3 years (numeric - decimal)</li>
<li><strong>PctRecImmig5</strong>:  percent of _population_ who have immigrated within the last 5 years (numeric - decimal)</li>
<li><strong>PctRecImmig8</strong>:  percent of _population_ who have immigrated within the last 8 years (numeric - decimal)</li>
<li><strong>PctRecImmig10</strong>:  percent of _population_ who have immigrated within the last 10 years (numeric - decimal)</li>
<li><strong>PctSpeakEnglOnly</strong>:  percent of people who speak only English (numeric - decimal)</li>
<li><strong>PctNotSpeakEnglWell</strong>:  percent of people who do not speak English well (numeric - decimal)</li>
<li><strong>PctLargHouseFam</strong>:  percent of family households that are large (6 or more) (numeric - decimal)</li>
<li><strong>PctLargHouseOccup</strong>:  percent of all occupied households that are large (6 or more people) (numeric - decimal)</li>
<li><strong>PersPerOccupHous</strong>:  mean persons per household (numeric - decimal)</li>
<li><strong>PersPerOwnOccHous</strong>:  mean persons per owner occupied household (numeric - decimal)</li>
<li><strong>PersPerRentOccHous</strong>:  mean persons per rental household (numeric - decimal)</li>
<li><strong>PctPersOwnOccup</strong>:  percent of people in owner occupied households (numeric - decimal)</li>
<li><strong>PctPersDenseHous</strong>:  percent of persons in dense housing (more than 1 person per room) (numeric - decimal)</li>
<li><strong>PctHousLess3BR</strong>:  percent of housing units with less than 3 bedrooms (numeric - decimal)</li>
<li><strong>MedNumBR</strong>:  median number of bedrooms (numeric - decimal)</li>
<li><strong>HousVacant</strong>:  number of vacant households (numeric - decimal)</li>
<li><strong>PctHousOccup</strong>:  percent of housing occupied (numeric - decimal)</li>
<li><strong>PctHousOwnOcc</strong>:  percent of households owner occupied (numeric - decimal)</li>
<li><strong>PctVacantBoarded</strong>:  percent of vacant housing that is boarded up (numeric - decimal)</li>
<li><strong>PctVacMore6Mos</strong>:  percent of vacant housing that has been vacant more than 6 months (numeric - decimal)</li>
<li><strong>MedYrHousBuilt</strong>:  median year housing units built (numeric - decimal)</li>
<li><strong>PctHousNoPhone</strong>:  percent of occupied housing units without phone (in 1990, this was rare!) (numeric - decimal)</li>
<li><strong>PctWOFullPlumb</strong>:  percent of housing without complete plumbing facilities (numeric - decimal)</li>
<li><strong>OwnOccLowQuart</strong>:  owner occupied housing - lower quartile value (numeric - decimal)</li>
<li><strong>OwnOccMedVal</strong>:  owner occupied housing - median value (numeric - decimal)</li>
<li><strong>OwnOccHiQuart</strong>:  owner occupied housing - upper quartile value (numeric - decimal)</li>
<li><strong>RentLowQ</strong>:  rental housing - lower quartile rent (numeric - decimal)</li>
<li><strong>RentMedian</strong>:  rental housing - median rent (Census variable H32B from file STF1A) (numeric - decimal)</li>
<li><strong>RentHighQ</strong>:  rental housing - upper quartile rent (numeric - decimal)</li>
<li><strong>MedRent</strong>:  median gross rent (Census variable H43A from file STF3A - includes utilities) (numeric - decimal)</li>
<li><strong>MedRentPctHousInc</strong>:  median gross rent as a percentage of household income (numeric - decimal)</li>
<li><strong>MedOwnCostPctInc</strong>:  median owners cost as a percentage of household income - for owners with a mortgage (numeric - decimal)</li>
<li><strong>MedOwnCostPctIncNoMtg</strong>:  median owners cost as a percentage of household income - for owners without a mortgage (numeric - decimal)</li>
<li><strong>NumInShelters</strong>:  number of people in homeless shelters (numeric - decimal)</li>
<li><strong>NumStreet</strong>:  number of homeless people counted in the street (numeric - decimal)</li>
<li><strong>PctForeignBorn</strong>:  percent of people foreign born (numeric - decimal)</li>
<li><strong>PctBornSameState</strong>:  percent of people born in the same state as currently living (numeric - decimal)</li>
<li><strong>PctSameHouse85</strong>:  percent of people living in the same house as in 1985 (5 years before) (numeric - decimal)</li>
<li><strong>PctSameCity85</strong>:  percent of people living in the same city as in 1985 (5 years before) (numeric - decimal)</li>
<li><strong>PctSameState85</strong>:  percent of people living in the same state as in 1985 (5 years before) (numeric - decimal)</li>
<li><strong>FTPolicePerPop</strong>:  sworn full time police officers per 100K population (numeric - decimal)</li>
<li><strong>FTPoliceFieldPerPop</strong>:  sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population (numeric - decimal)</li>
<li><strong>RacialMatchCommPol</strong>:  a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar (numeric - decimal)</li>
<li><strong>PctPolicWhite</strong>:  percent of police that are caucasian (numeric - decimal)</li>
<li><strong>PctPolicBlack</strong>:  percent of police that are african american (numeric - decimal)</li>
<li><strong>PctPolicHisp</strong>:  percent of police that are hispanic (numeric - decimal)</li>
<li><strong>PctPolicAsian</strong>:  percent of police that are asian (numeric - decimal)</li>
<li><strong>PctPolicMinor</strong>:  percent of police that are minority of any kind (numeric - decimal)</li>
<li><strong>OfficAssgnDrugUnits</strong>:  number of officers assigned to special drug units (numeric - decimal)</li>
<li><strong>PolicAveOTWorked</strong>:  police average overtime worked (numeric - decimal)</li>
<li><strong>LandArea</strong>:  land area in square miles (numeric - decimal)</li>
<li><strong>PopDens</strong>:  population density in persons per square mile (numeric - decimal)</li>
<li><strong>PctUsePubTrans</strong>:  percent of people using public transit for commuting (numeric - decimal)</li>
<li><strong>PolicCars</strong>:  number of police cars (numeric - decimal)</li>
<li><strong>PolicOperBudg</strong>:  police operating budget (numeric - decimal)</li>
<li><strong>PctPolicOnPatr</strong>:  percent of sworn full time police officers on patrol (numeric - decimal)</li>
<li><strong>GangUnitDeploy</strong>:  gang unit deployed (numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time)</li>
<li><strong>PolicBudgPerPop</strong>:  police operating budget per population (numeric - decimal)</li>
<li><strong>ViolentCrimesPerPop</strong>:  total number of violent crimes per 100K population (numeric - decimal) target variable to be predicted; not in the crimeTest.csv data.</li>
</ul>
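
Some of the columns described above are text (`communityname`) or have many missing values (`county`), and `LinearRegression` accepts neither strings nor NaN. Whether `crime.xlsx` has already been cleaned is not stated here, so the following is only a sketch of one possible preparation step, shown on a tiny made-up frame; `prepare_features` and the 50% missing-value threshold are assumptions.

```python
import numpy as np
import pandas as pd

def prepare_features(df, target, max_missing=0.5):
    """Keep numeric predictors, drop sparse columns, fill remaining gaps with the median."""
    # Rows without a target value cannot be used for supervised training
    df = df.dropna(subset=[target])
    # Keep only numeric predictors
    X = df.drop(columns=[target]).select_dtypes(include="number")
    # Drop columns whose share of missing values exceeds the threshold
    X = X.loc[:, X.isna().mean() <= max_missing]
    # Fill the remaining gaps with each column's median
    X = X.fillna(X.median())
    return X, df[target]

# Made-up rows that imitate the kinds of columns listed above
demo = pd.DataFrame({
    "communityname": ["A", "B", "C", "D"],
    "county": [np.nan, np.nan, np.nan, 7.0],
    "population": [1000.0, 2000.0, np.nan, 4000.0],
    "ViolentCrimesPerPop": [100.0, 250.0, 300.0, 50.0],
})
X_demo, y_demo = prepare_features(demo, "ViolentCrimesPerPop")
print(X_demo)
```

In a real workflow the medians would be computed on the training split only, to avoid leaking information from the test set.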

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load crime.xlsx into df
df = pd.read_excel("crime.xlsx")

# Select the features: every column except ViolentCrimesPerPop
X = df.drop(['ViolentCrimesPerPop'], axis=1)

# Select the variable to predict
Y = df['ViolentCrimesPerPop']

# Split the data into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Instantiate the model
regr = LinearRegression()

# Train the model
regr.fit(X_train, Y_train)

In [None]:
df.sample(10)