**Data Mining Project, midterm 2021/2022**
**Authors:** Niko Dalla Noce, Alessandro Ristori, Giuseppe Lombardi
**Date:**




# **Task 1: Data Understanding and Data Preparation**

**Importing libraries**

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy

import zipfile
with zipfile.ZipFile("prj_data.zip", 'r') as zip_ref:
     zip_ref.extractall()

**Load the datasets**

Read the data from the three CSV files; for now, each one is loaded into its own dataframe.

In [120]:
df_male = pd.read_csv("dataset/male_players.csv", sep=",")
df_female = pd.read_csv("dataset/female_players.csv", sep=",")
df_matches = pd.read_csv("dataset/tennis_matches.csv", sep=",")

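# Collect every player name that appears as a winner or a loser in the matches.
players_winner = df_matches[["winner_name"]].rename(columns={"winner_name":"Name"})
players_loser = df_matches[["loser_name"]].rename(columns={"loser_name":"Name"})
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
players = pd.concat([players_winner, players_loser])
# Deduplicate, drop missing names, and sort alphabetically.
players_unique = players["Name"].unique()
players_unique = pd.DataFrame(players_unique, columns=["Name"])
players_unique = players_unique.dropna()
players_unique = players_unique.sort_values(by=["Name"])
players_unique.to_csv("players.csv")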

# Return the rows of df_players whose "Name" equals "name surname" for some row of df.
def players_from_sex(df, df_players):
  players_name = list(df["name"])
  players_surname = list(df["surname"])
  players = list()
  for n, s in zip(players_name, players_surname):
    players.append("{0} {1}".format(n, s))

  players = df_players[df_players["Name"].isin(players)]
  return players

players_male = players_from_sex(df_male, players_unique)
players_female = players_from_sex(df_female, players_unique)
players_male_found = list(players_male["Name"])
players_female_found = list(players_female["Name"])
players_found = players_male_found+players_female_found
players_not_found = players_unique[~players_unique["Name"].isin(players_found)]
print("Number of male players: {0}".format(len(players_male)))
print("Number of female players: {0}".format(len(players_female)))
print("Number of players: {0}".format(len(players_unique)))
print("Number of players with unknown sex: {0}".format(len(players_not_found)))
# for player in players_not_found["Name"]:
#  print(player)

Number of male players: 3018
Number of female players: 7062
Number of players: 10104
Number of players with unknown sex: 30
Alexandar Lazov
Alona Fomina
Andres Artunedo Martinavarro
Antoine Hoang
Ben Patael
Botic van de Zandschulp
Christopher O'Connell
Cristian Garin
Daniel Elahi Galan
Daniel Munoz de la Nava
David O'Hare
Diego Schwartzman
Evgenii Tiurnev
Frances Tiafoe
Franko Skugor
Holger Rune
J.J. Wolf
Jo-Wilfried Tsonga
Joao Menezes
Juan Martin del Potro
Juan Pablo Varillas
Jurabek Karimov
Khumoun Sultanov
Lloyd Harris
Mackenzie McDonald
Pedro Martinez
Sam Groth
Stan Wawrinka
Taylor Fritz
Zeynep  Sena Sarioglan
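
The matching done by players_from_sex relies on exact string equality between "name surname" and the match names. The following is a minimal sketch, on invented toy data, of the same lookup written with vectorized string concatenation, plus a case-insensitive variant. The assumption that letter case explains some of the unmatched names above is ours and has not been verified on the real data; the dataframes df_sex and players_toy are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (hypothetical data).
df_sex = pd.DataFrame({"name": ["Anna", "Juan Martin"],
                       "surname": ["Rossi", "Del Potro"]})
players_toy = pd.DataFrame({"Name": ["Anna Rossi",
                                     "Juan Martin del Potro",
                                     "Stan Wawrinka"]})

# Build "name surname" in one vectorized step instead of a Python loop.
full_names = df_sex["name"].str.cat(df_sex["surname"], sep=" ")

# Exact match, equivalent to the isin() lookup in players_from_sex.
exact = players_toy[players_toy["Name"].isin(full_names)]

# Case-insensitive match: compare lowercased names on both sides.
relaxed = players_toy[players_toy["Name"].str.lower().isin(full_names.str.lower())]

print(list(exact["Name"]))
print(list(relaxed["Name"]))
```

On this toy input the exact lookup finds only "Anna Rossi", while the relaxed one also recovers "Juan Martin del Potro". Whether a relaxed comparison is appropriate for the real dataset would need to be checked by hand.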


Let's take a look at the player dataframes by calling the head() method, which shows the first five rows of each dataframe.

In [None]:
df_male.head()

Unnamed: 0,name,surname
0,Gardnar,Mulloy
1,Pancho,Segura
2,Frank,Sedgman
3,Giuseppe,Merlo
4,Richard Pancho,Gonzales


In [None]:
df_female.head()

Unnamed: 0,name,surname
0,Bobby,Riggs
1,X,X
2,Martina,Hingis
3,Mirjana,Lucic
4,Justine,Henin


Let's now use the info() method to obtain information on the two player datasets.

In [None]:
print("Male players dataframe info")
df_male.info()
print("\nFemale players dataframe info")
df_female.info()

Male players dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55208 entries, 0 to 55207
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     55031 non-null  object
 1   surname  55166 non-null  object
dtypes: object(2)
memory usage: 862.8+ KB

Female players dataframe info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46172 entries, 0 to 46171
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     44505 non-null  object
 1   surname  46172 non-null  object
dtypes: object(2)
memory usage: 721.6+ KB


As the info() output shows, both datasets contain null values: the male dataset has nulls in both name and surname, while the female dataset has nulls only in name.

In [None]:
print("Male players dataframe null values:\n{0}".format(df_male.isnull().any()))
print("\nFemale players dataframe null values:\n{0}".format(df_female.isnull().any()))

Male players dataframe null values:
name       True
surname    True
dtype: bool

Female players dataframe null values:
name        True
surname    False
dtype: bool
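
The any() check only tells us whether a column contains at least one null. As a small sketch on invented data (the dataframe df_toy is hypothetical), sum() gives the number of nulls per column, and any(axis=1) gives the number of rows that dropna() would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a players dataframe with missing values.
df_toy = pd.DataFrame({
    "name": ["Martina", None, "Justine", np.nan],
    "surname": ["Hingis", "Lucic", None, "Henin"],
})

# Number of nulls in each column.
null_counts = df_toy.isnull().sum()

# Number of rows containing at least one null, i.e. rows dropna() removes.
rows_with_null = int(df_toy.isnull().any(axis=1).sum())

print(null_counts)
print("Rows with at least one null:", rows_with_null)
```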


We drop the rows that contain null values.

In [None]:
df_male = df_male.dropna()
df_female = df_female.dropna()
print("Male players dataframe info:")
df_male.info()
print("\nFemale players dataframe info:")
df_female.info()

Male players dataframe info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 55031 entries, 0 to 55207
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     55031 non-null  object
 1   surname  55031 non-null  object
dtypes: object(2)
memory usage: 1.3+ MB

Female players dataframe info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44505 entries, 0 to 46171
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     44505 non-null  object
 1   surname  44505 non-null  object
dtypes: object(2)
memory usage: 1.0+ MB


The same name and surname pair may appear more than once within a dataset. Since two distinct players can legitimately share a full name, we leave these rows unchanged.
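
Even without removing them, it can be useful to know how many rows are involved. A minimal sketch on invented data (df_toy is hypothetical) that flags repeated name and surname pairs with duplicated():

```python
import pandas as pd

# Hypothetical sample: two rows share both name and surname.
df_toy = pd.DataFrame({
    "name": ["Frank", "Pancho", "Frank"],
    "surname": ["Sedgman", "Segura", "Sedgman"],
})

# keep=False marks every member of a duplicated group, not only the repeats.
dup_mask = df_toy.duplicated(subset=["name", "surname"], keep=False)

print(df_toy[dup_mask])
print("Rows involved in duplicates:", int(dup_mask.sum()))
```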

However, some entries are missing or incomplete without being null (see the row with name X and surname X in the df_female.head() output). We need to remove these entries as well.

In [None]:
# Placeholder strings that mark an unknown name or surname.
elements_not_to_keep = ["Unknown", "??"]
# Drop entries whose name or surname is a single character (e.g. "X").
df_male = df_male[df_male["name"].str.len()>1]
df_male = df_male[df_male["surname"].str.len()>1]
# Drop entries whose name or surname is one of the placeholders.
df_male = df_male[~df_male["name"].isin(elements_not_to_keep)]
df_male = df_male[~df_male["surname"].isin(elements_not_to_keep)]
df_male.info()
# df_male[df_male["name"].isin(["??"])]
# df_male.where(df_male.name.str.len()==1)
# Other incompatible entries remain; they should be inspected manually.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53965 entries, 0 to 55207
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     53965 non-null  object
 1   surname  53965 non-null  object
dtypes: object(2)
memory usage: 1.2+ MB
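
The filtering steps above could be wrapped in a helper so that the same rules can be applied to any players dataframe. This is only a sketch: the function drop_placeholder_rows, its parameters, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_placeholder_rows(df, placeholders=("Unknown", "??"), min_len=2):
    """Hypothetical helper: keep rows whose name and surname look usable."""
    mask = pd.Series(True, index=df.index)
    for col in ("name", "surname"):
        # Values shorter than min_len characters (e.g. "X") are rejected.
        long_enough = df[col].str.len() >= min_len
        # Known placeholder strings are rejected as well.
        not_placeholder = ~df[col].isin(list(placeholders))
        mask &= long_enough & not_placeholder
    return df[mask]

# Invented sample resembling the problematic rows seen in the data.
df_toy = pd.DataFrame({
    "name": ["Bobby", "X", "Martina", "??"],
    "surname": ["Riggs", "X", "Hingis", "Unknown"],
})
cleaned = drop_placeholder_rows(df_toy)
print(cleaned)
```

On this sample only the "Bobby Riggs" and "Martina Hingis" rows survive. The thresholds and the placeholder list mirror the cell above and would need to be extended after a manual inspection of the data.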
