# Exercise with scikit-learn

In this exercise we use some of the scikit-learn features aimed at data analysis.

To begin, we load the CSV, drop rows with null values, convert height and weight to metric units, and keep only the columns and positions of interest.

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

datos = pd.read_csv('players.csv')

# Drop rows with null values
datos.dropna(inplace=True)

# Convert height to cm (whole feet only; 1 ft = 30.48 cm)
datos['height_cm'] = datos['height_feet'] * 30.48

# Convert weight to kg (rounded factor; 1 lb is about 0.4536 kg)
datos['weight_kg'] = datos['weight_pounds'] * 0.45

# Select columns
datos = datos[['first_name', 'last_name', 'position', 'team', 'height_cm', 'weight_kg']]

# Keep only positions F, C or G (copy to avoid chained-assignment warnings later)
datos = datos[datos['position'].isin(['F', 'C', 'G'])].copy()

# Reset the row index
datos.reset_index(drop=True, inplace=True)

datos

Unnamed: 0,first_name,last_name,position,team,height_cm,weight_kg
0,Alex,Abrines,G,"{'id': 21, 'abbreviation': 'OKC', 'city': 'Okl...",182.88,90.00
1,Kosta,Koufos,C,"{'id': 26, 'abbreviation': 'SAC', 'city': 'Sac...",213.36,110.25
2,Michael,Beasley,F,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",182.88,105.75
3,Wade,Baldwin IV,G,"{'id': 11, 'abbreviation': 'HOU', 'city': 'Hou...",182.88,90.00
4,Jared,Terrell,G,"{'id': 18, 'abbreviation': 'MIN', 'city': 'Min...",182.88,102.15
...,...,...,...,...,...,...
376,Mitchell,Robinson,C,"{'id': 20, 'abbreviation': 'NYK', 'city': 'New...",213.36,108.00
377,Collin,Sexton,G,"{'id': 29, 'abbreviation': 'UTA', 'city': 'Uta...",182.88,85.50
378,Landry,Shamet,G,"{'id': 24, 'abbreviation': 'PHX', 'city': 'Pho...",182.88,84.60
379,Anfernee,Simons,G,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",182.88,83.25
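
The cleaning cell drops null rows without reporting how many were removed, and it converts height from the whole-feet column only, using a rounded pound factor. The following is a minimal sketch of how both points could be made explicit. It runs on a small made-up DataFrame: the `height_inches` column and all sample values are assumptions for illustration and are not taken from `players.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real players.csv may have different columns.
sample = pd.DataFrame({
    'first_name': ['Ana', 'Luis', 'Eva'],
    'height_feet': [6.0, np.nan, 7.0],
    'height_inches': [6.0, 3.0, 0.0],   # assumed column, for illustration only
    'weight_pounds': [200.0, 180.0, 245.0],
})

# Count rows before and after dropping nulls
rows_before = len(sample)
sample = sample.dropna().copy()
rows_dropped = rows_before - len(sample)
print(f"Rows dropped: {rows_dropped}")

# 1 ft = 30.48 cm, 1 in = 2.54 cm, 1 lb = 0.45359237 kg (exact definitions)
sample['height_cm'] = sample['height_feet'] * 30.48 + sample['height_inches'] * 2.54
sample['weight_kg'] = sample['weight_pounds'] * 0.45359237
print(sample[['first_name', 'height_cm', 'weight_kg']])
```

Using the exact factors would change the numbers shown in the tables of this exercise, which were produced with whole feet and a factor of 0.45.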


Next, we scale the height and weight columns to the [0, 1] range with scikit-learn's `MinMaxScaler`.

In [22]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
datos[['height_cm', 'weight_kg']] = scaler.fit_transform(datos[['height_cm', 'weight_kg']])

datos

Unnamed: 0,first_name,last_name,position,team,height_cm,weight_kg
0,Alex,Abrines,G,"{'id': 21, 'abbreviation': 'OKC', 'city': 'Okl...",0.5,0.250000
1,Kosta,Koufos,C,"{'id': 26, 'abbreviation': 'SAC', 'city': 'Sac...",1.0,0.625000
2,Michael,Beasley,F,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",0.5,0.541667
3,Wade,Baldwin IV,G,"{'id': 11, 'abbreviation': 'HOU', 'city': 'Hou...",0.5,0.250000
4,Jared,Terrell,G,"{'id': 18, 'abbreviation': 'MIN', 'city': 'Min...",0.5,0.475000
...,...,...,...,...,...,...
376,Mitchell,Robinson,C,"{'id': 20, 'abbreviation': 'NYK', 'city': 'New...",1.0,0.583333
377,Collin,Sexton,G,"{'id': 29, 'abbreviation': 'UTA', 'city': 'Uta...",0.5,0.166667
378,Landry,Shamet,G,"{'id': 24, 'abbreviation': 'PHX', 'city': 'Pho...",0.5,0.150000
379,Anfernee,Simons,G,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",0.5,0.125000
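
With its default settings, `MinMaxScaler` rescales each column independently to [0, 1] using x' = (x - min) / (max - min). The arithmetic can be sketched in plain Python as follows. The helper `min_max_scale` and the height values are made up for illustration, and this is not scikit-learn's implementation.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical heights in cm (5, 6 and 7 feet)
heights_cm = [152.40, 182.88, 213.36]
scaled = min_max_scale(heights_cm)
print(scaled)
```

In the table above, a height of 182.88 cm maps to 0.5 and 213.36 cm maps to 1.0, which is what this formula gives when the column ranges from 152.40 cm to 213.36 cm.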


Next, we one-hot encode the `position` column with scikit-learn's `OneHotEncoder`.

In [23]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
posiciones = encoder.fit_transform(datos[['position']])
df_posiciones = pd.DataFrame(posiciones, columns=encoder.get_feature_names_out(['position']))
datos = datos.join(df_posiciones)

datos

Unnamed: 0,first_name,last_name,position,team,height_cm,weight_kg,position_C,position_F,position_G
0,Alex,Abrines,G,"{'id': 21, 'abbreviation': 'OKC', 'city': 'Okl...",0.5,0.250000,0.0,0.0,1.0
1,Kosta,Koufos,C,"{'id': 26, 'abbreviation': 'SAC', 'city': 'Sac...",1.0,0.625000,1.0,0.0,0.0
2,Michael,Beasley,F,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",0.5,0.541667,0.0,1.0,0.0
3,Wade,Baldwin IV,G,"{'id': 11, 'abbreviation': 'HOU', 'city': 'Hou...",0.5,0.250000,0.0,0.0,1.0
4,Jared,Terrell,G,"{'id': 18, 'abbreviation': 'MIN', 'city': 'Min...",0.5,0.475000,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
376,Mitchell,Robinson,C,"{'id': 20, 'abbreviation': 'NYK', 'city': 'New...",1.0,0.583333,1.0,0.0,0.0
377,Collin,Sexton,G,"{'id': 29, 'abbreviation': 'UTA', 'city': 'Uta...",0.5,0.166667,0.0,0.0,1.0
378,Landry,Shamet,G,"{'id': 24, 'abbreviation': 'PHX', 'city': 'Pho...",0.5,0.150000,0.0,0.0,1.0
379,Anfernee,Simons,G,"{'id': 25, 'abbreviation': 'POR', 'city': 'Por...",0.5,0.125000,0.0,0.0,1.0


Finally, we keep a 2% sample of the players while preserving the proportion of each position (stratified sampling).

First we compute the proportion of each position, then we use `train_test_split` to draw that 2% sample and check the result.

In [26]:
from sklearn.model_selection import train_test_split

# Proportion of each position in the full dataset
conteo_posiciones = datos['position'].value_counts()
proporcion_posiciones = conteo_posiciones / conteo_posiciones.sum()
print("Position proportions:")
print(proporcion_posiciones)

# Stratified 2% sample: the "test" split is the sample we keep
datos_train, datos_test = train_test_split(datos, test_size=0.02, stratify=datos['position'], random_state=12)
datos_test

Position proportions:
position
G    0.464567
F    0.406824
C    0.128609
Name: count, dtype: float64


Unnamed: 0,first_name,last_name,position,team,height_cm,weight_kg,position_C,position_F,position_G
6,Isaiah,Canaan,G,"{'id': 18, 'abbreviation': 'MIN', 'city': 'Min...",0.5,0.25,0.0,0.0,1.0
143,Lou,Williams,G,"{'id': 1, 'abbreviation': 'ATL', 'city': 'Atla...",0.5,0.041667,0.0,0.0,1.0
327,Monte,Morris,G,"{'id': 30, 'abbreviation': 'WAS', 'city': 'Was...",0.5,0.041667,0.0,0.0,1.0
155,Marquese,Chriss,F,"{'id': 21, 'abbreviation': 'OKC', 'city': 'Okl...",0.5,0.583333,0.0,1.0,0.0
285,Julius,Randle,F,"{'id': 20, 'abbreviation': 'NYK', 'city': 'New...",0.5,0.666667,0.0,1.0,0.0
95,Jordan,Bell,F,"{'id': 5, 'abbreviation': 'CHI', 'city': 'Chic...",0.5,0.45,0.0,1.0,0.0
68,Jacob,Evans,G,"{'id': 18, 'abbreviation': 'MIN', 'city': 'Min...",0.5,0.333333,0.0,0.0,1.0
108,Jahlil,Okafor,C,"{'id': 1, 'abbreviation': 'ATL', 'city': 'Atla...",0.5,0.875,1.0,0.0,0.0
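
The cell above displays the sample but does not compute its composition. One way to check it, sketched here on synthetic data (the counts, `test_size` and `random_state` are arbitrary choices for illustration), is to compare `value_counts(normalize=True)` for the full set and for the sample.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 200 players with a 50/40/10 split of positions
players = pd.DataFrame({'position': ['G'] * 100 + ['F'] * 80 + ['C'] * 20})

# Keep the "test" split as the stratified sample
_, sample = train_test_split(
    players, test_size=0.1, stratify=players['position'], random_state=0
)

# Side-by-side proportions for the full data and the stratified sample
comparison = pd.DataFrame({
    'full': players['position'].value_counts(normalize=True),
    'sample': sample['position'].value_counts(normalize=True),
})
print(comparison)
```

In the 8-row sample shown above there are 4 G, 3 F and 1 C, that is 0.500, 0.375 and 0.125. These are close to the full-data proportions but not identical, because with so few rows each class count has to be a whole number.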
