In [30]:
import pandas as pd
import numpy as np

In [3]:
players_df = pd.read_csv('data/laliga_player_stats_english.csv')

In [4]:
display(players_df.head())

Unnamed: 0,Team,Position,Shirt number,Name,Minutes played,Games played,Percentage of games played,Full games played,Percentage of full games played,Games started,...,Corners,Tackles.1,Duels,Man-to-man duels,Aerial duels,Passes,Short passes,Long passes,Through balls,Goals scored per attempt
0,Athletic Club,Goalkeeper,,Hodei Oleaga,0.0,0,0.00%,0,0.00%,0,...,0,0,0,0,0,0.0,0.0,0,0,0
1,Athletic Club,Goalkeeper,1.0,A. Remiro,0.0,0,0.00%,0,0.00%,0,...,0,0,0,0,0,0.0,0.0,0,0,0
2,Athletic Club,Goalkeeper,13.0,Herrerín,2.79,31,82.00%,31,82.00%,31,...,0,0,25,6,19,887.0,128.0,759,1,0
3,Athletic Club,Goalkeeper,25.0,Unai Simón,630.0,7,18.00%,7,18.00%,7,...,0,0,3,2,1,155.0,49.0,106,0,0
4,Athletic Club,Defender,3.0,Núñez,1.063,12,32.00%,11,29.00%,11,...,0,15,107,38,69,536.0,457.0,78,1,0


In [38]:
players_df.shape

(556, 62)

In [40]:
players_df.size

34472

# Cleaning

We start with the 'Team' column, replacing spaces with "_".

In [14]:
players_df['Team'].unique()

array(['Athletic Club', 'Atlético de Madrid', 'CD Leganés', 'D. Alavés',
       'FC Barcelona', 'Getafe CF', 'Girona FC', 'Levante UD',
       'R. Valladolid CF', 'Rayo Vallecano', 'RC Celta', 'RCD Espanyol',
       'Real Betis', 'Real Madrid', 'Real Sociedad', 'SD Eibar',
       'SD Huesca', 'Sevilla FC', 'Valencia CF', 'Villarreal CF'],
      dtype=object)
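
The replacement of spaces by "_" is described but not shown in a cell. A minimal sketch of one way to do it with `Series.str.replace`, run on a small stand-in DataFrame (`sample_df` is hypothetical and only reuses a few of the team names listed above):

```python
import pandas as pd

# Hypothetical stand-in for players_df with a few of the team names listed above.
sample_df = pd.DataFrame(
    {"Team": ["Athletic Club", "Atlético de Madrid", "R. Valladolid CF"]}
)

# Replace every literal space with an underscore (regex=False treats " " as plain text).
sample_df["Team"] = sample_df["Team"].str.replace(" ", "_", regex=False)

team_names = sample_df["Team"].tolist()
print(team_names)
```

On the real data the same call would presumably be applied to `players_df['Team']` directly.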

We identify a problem with the player names in our DataFrame.

In [24]:
len(players_df.Name)

556

In [25]:
len(players_df.Name.unique())

547

There are repeated names (the number of values is greater than the number of unique values). Specifically, 9 values (player names) are repeated.

This will be a problem when linking this table with the per-player market values table, so we will clean the data to fix it.

In [21]:
# First we find out which values are repeated.

pl = players_df.copy()  # Work on a copy of the DataFrame (to keep the original intact).
p = pl.groupby('Name').Team.nunique().reset_index()  # Group by 'Name' and count the distinct 'Team' values for each name.

In [23]:
print(p.loc[p['Team'] > 1])  # Keep the names that appear in more than one team (the repeated ones).

          Name  Team
68       Borja     2
76       Bruno     2
248    Joaquín     2
262   Juanfran     2
280       Koke     2
361      Nacho     2
434    Rodrigo     2
467  Sergio A.     2
512    Vázquez     2
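
As an alternative to the groupby above, `Series.duplicated(keep=False)` flags every row whose name occurs more than once and returns the row labels at the same time. The sketch below uses a hypothetical four-row miniature (`sample_df`), not the real `players_df`:

```python
import pandas as pd

# Hypothetical miniature of players_df; only the two columns used here.
sample_df = pd.DataFrame({
    "Name": ["Borja", "Herrerín", "Borja", "Koke"],
    "Team": ["D. Alavés", "Athletic Club", "R. Valladolid CF", "Atlético de Madrid"],
})

# keep=False marks every occurrence of a repeated name, not just the later ones.
repeated = sample_df[sample_df["Name"].duplicated(keep=False)]
repeated_index = repeated.index.tolist()
print(repeated[["Name", "Team"]])
```

Unlike counting distinct teams per name, this would also catch two players with the same name in the same team.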


In [31]:
duplicate = list(np.where(players_df["Name"] == 'Borja'))
duplicate  # Returns the positions of the rows where 'Name' == 'Borja'.

[array([114, 236])]

In [35]:
# The 'Name' at the positions we looked up is indeed 'Borja'.

print(players_df["Name"][114], players_df["Name"][236])

Borja Borja


In [37]:
# Look up each Borja's team so we can find his surname and change the value.

print(players_df["Team"][114], players_df["Team"][236])

D. Alavés R. Valladolid CF


We will replace the first name with first name + surname.

We do this by hand, looking up each player's surname based on the team he plays for.

In this first case:

   - Borja at Alavés [114] = Borja Bastón
   - Borja at Valladolid [236] = Borja Fernández

Note that we take the opportunity to write each value (player name) exactly as it appears in the market_values table, which saves cleaning steps later.

In [52]:
# Assign through .loc in a single step; chained indexing (df["Name"][i] = ...) raises SettingWithCopyWarning.
players_df.loc[114, "Name"] = 'Borja Bastón'
players_df.loc[236, "Name"] = 'Borja Fernández'


In [56]:
print(players_df["Name"][114], players_df["Name"][236])

Borja Bastón Borja Fernández


We follow the same process for the remaining repeated values.
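
Since the same lookup-and-rename step repeats for every name, it could also be written once as a mapping from (name, team) to full name. This is only a sketch on a hypothetical miniature (`sample_df`); the full names are the ones chosen by hand in this notebook for "Borja" and "Bruno":

```python
import pandas as pd

# Hypothetical miniature of players_df with two of the repeated names.
sample_df = pd.DataFrame({
    "Name": ["Borja", "Borja", "Bruno", "Bruno", "Herrerín"],
    "Team": ["D. Alavés", "R. Valladolid CF", "Getafe CF", "Villarreal CF", "Athletic Club"],
})

# (current name, team) -> full name, as looked up by hand in the notebook.
full_names = {
    ("Borja", "D. Alavés"): "Borja Bastón",
    ("Borja", "R. Valladolid CF"): "Borja Fernández",
    ("Bruno", "Getafe CF"): "Bruno González",
    ("Bruno", "Villarreal CF"): "Bruno Soriano",
}

for (name, team), full_name in full_names.items():
    mask = (sample_df["Name"] == name) & (sample_df["Team"] == team)
    # A single .loc call writes to the DataFrame itself, so no chained-assignment warning.
    sample_df.loc[mask, "Name"] = full_name

new_names = sample_df["Name"].tolist()
print(new_names)
```

Keying on the team rather than on a row position keeps the rename valid even if the rows are reordered or filtered beforehand.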

For the "Bruno" entries.

In [57]:
duplicate = list(np.where(players_df["Name"] == 'Bruno')) 
duplicate 

[array([147, 548])]

In [60]:
print(players_df["Team"][147], players_df["Team"][548])

Getafe CF Villarreal CF


In [61]:
players_df.loc[147, "Name"] = 'Bruno González'
players_df.loc[548, "Name"] = 'Bruno Soriano'


In [62]:
print(players_df["Name"][147], players_df["Name"][548])

Bruno González Bruno Soriano


For the "Joaquín" entries.

In [63]:
duplicate = list(np.where(players_df["Name"] == 'Joaquín')) 
duplicate 

[array([223, 354])]

In [65]:
print(players_df["Team"][223], players_df["Team"][354])

R. Valladolid CF Real Betis


In [66]:
players_df.loc[223, "Name"] = 'Joaquín Fernández'
players_df.loc[354, "Name"] = 'Joaquín'


For the "Juanfran" entries.

In [67]:
duplicate = list(np.where(players_df["Name"] == 'Juanfran')) 
duplicate 

[array([37, 65])]

In [69]:
print(players_df["Team"][37], players_df["Team"][65])

Atlético de Madrid CD Leganés


In [70]:
players_df.loc[37, "Name"] = 'Juanfran Torres'
players_df.loc[65, "Name"] = 'Juanfran'


For the "Koke" entries.

In [None]:
duplicate = list(np.where(players_df["Name"] == 'Koke')) 
duplicate 

In [73]:
print(players_df["Team"][44], players_df["Team"][195])

Atlético de Madrid Levante UD


In [76]:
players_df.loc[44, "Name"] = 'Koke'
players_df.loc[195, "Name"] = 'Koke Vegas'


For the "Nacho" entries.

In [80]:
duplicate = list(np.where(players_df["Name"] == 'Nacho')) 
duplicate 

[array([229, 375])]

In [81]:
print(players_df["Team"][229], players_df["Team"][375])

R. Valladolid CF Real Madrid


In [83]:
players_df.loc[229, "Name"] = 'Nacho Martínez'
players_df.loc[375, "Name"] = 'Nacho Fernández'


For the "Rodrigo" entries.

In [84]:
duplicate = list(np.where(players_df["Name"] == 'Rodrigo')) 
duplicate 

[array([ 46, 525])]

In [85]:
print(players_df["Team"][46], players_df["Team"][525])

Atlético de Madrid Valencia CF


In [86]:
players_df.loc[46, "Name"] = 'Rodri'
players_df.loc[525, "Name"] = 'Rodrigo'


For the "Sergio A." entries.

In [87]:
duplicate = list(np.where(players_df["Name"] == 'Sergio A.')) 
duplicate 

[array([284, 441])]

In [88]:
print(players_df["Team"][284], players_df["Team"][441])

RC Celta SD Eibar


In [89]:
players_df.loc[284, "Name"] = 'Sergio Álvarez'
players_df.loc[441, "Name"] = 'Sergio Álvarez Díaz'  # Note: this one does not match the name in market_values (there it is 'Sergio Álvarez'); we change it so the value is unique.


For the "Vázquez" entries.

In [90]:
duplicate = list(np.where(players_df["Name"] == 'Vázquez')) 
duplicate 

[array([293, 498])]

In [91]:
print(players_df["Team"][293], players_df["Team"][498])

RC Celta Sevilla FC


In [95]:
players_df.loc[293, "Name"] = 'Kevin Vázquez'
players_df.loc[498, "Name"] = 'Franco Vázquez'


We check that no duplicated values remain in this column.

In [98]:
len(players_df.Name)

556

In [99]:
len(players_df.Name.unique())

556
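
The point of making 'Name' unique is the later link with the market values table. A hedged sketch of how that link could be guarded follows; `stats_df`, `market_df`, the column `Market value` and its numbers are all hypothetical placeholders, since the real market_values table is not shown here:

```python
import pandas as pd

# Hypothetical miniatures; the real market_values table and its columns are not shown in the notebook.
stats_df = pd.DataFrame({"Name": ["Borja Bastón", "Borja Fernández", "Koke"]})
market_df = pd.DataFrame({
    "Name": ["Borja Bastón", "Koke", "Borja Fernández"],
    "Market value": [1.0, 2.0, 3.0],  # placeholder numbers, not real valuations
})

# Same check as comparing len(Name) with len(Name.unique()).
assert stats_df["Name"].is_unique

# validate="one_to_one" makes merge raise MergeError if either key column has duplicates.
merged = stats_df.merge(market_df, on="Name", how="left", validate="one_to_one")
values = merged["Market value"].tolist()
print(merged)
```

With `validate` set, any name that is still duplicated would stop the merge with an error instead of silently multiplying rows.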