# Handling duplicate, missing, or invalid data

## About the Data
In this notebook, we will work with FIFA 22 player data obtained from [Kaggle](https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset).

## Setup
We will be working with the `players_22.csv` file, so we need to handle our imports and read it in.

In [14]:
import pandas as pd

In [15]:
players = pd.read_csv(
    'dirty_data.csv', 
    usecols=['short_name', 'wage_eur', 'age', 'club_name', 'nationality_name', 'preferred_foot']
)

## Finding problematic data
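The `info()` method reports the non-null count and data type of each column, which helps us spot missing values and wrong data types. Here, `wage_eur` and `club_name` each have 19,204 non-null values out of 19,265 entries: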

In [17]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19265 entries, 0 to 19264
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   short_name        19265 non-null  object 
 1   wage_eur          19204 non-null  float64
 2   age               19265 non-null  int64  
 3   club_name         19204 non-null  object 
 4   nationality_name  19265 non-null  object 
 5   preferred_foot    19265 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 903.2+ KB
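
The null counts can also be computed directly rather than read off the `info()` output. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for `players`; the names and values are made up.
toy = pd.DataFrame({
    'short_name': ['A. Player', 'B. Player', 'C. Player'],
    'wage_eur': [1000.0, None, 500.0],
    'club_name': ['Club X', None, 'Club Y'],
})

# isna() returns a Boolean frame; summing it counts the nulls per column.
null_counts = toy.isna().sum()
print(null_counts)
```

Applied to `players`, the same call would give one count per column.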


We can use the `isna()` method (or its alias, `isnull()`) on each series to find the rows that contain nulls:

In [18]:
contain_nulls = players[
    players.short_name.isna() | players.wage_eur.isna() | players.age.isna()
    | players.club_name.isna() | players.nationality_name.isna()
    | players.preferred_foot.isna()
]

In [19]:
contain_nulls.shape[0]

61

In [20]:
contain_nulls.head(10)

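|  | short_name | wage_eur | age | club_name | nationality_name | preferred_foot |
| --- | --- | --- | --- | --- | --- | --- |
| 147 | T. Alderweireld | NaN | 32 | NaN | Belgium | Right |
| 416 | G. Krychowiak | NaN | 31 | NaN | Poland | Right |
| 694 | A. Dzyuba | NaN | 32 | NaN | Russia | Right |
| 926 | M. Ozdoev | NaN | 28 | NaN | Russia | Right |
| 1146 | Y. Zhirkov | NaN | 37 | NaN | Russia | Left |
| 1152 | A. Shunin | NaN | 34 | NaN | Russia | Left |
| 1182 | A. Dragović | NaN | 30 | NaN | Austria | Right |
| 1241 | M. Borjan | NaN | 33 | NaN | Canada | Right |
| 1255 | V. Claesson | NaN | 29 | NaN | Sweden | Right |
| 1759 | D. Kuzyaev | NaN | 28 | NaN | Russia | Right |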
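
Chaining `isna()` checks column by column gets verbose as the number of columns grows. As a sketch of a shorter alternative, the whole-frame `isna()` can be combined with `any(axis=1)`; the frame `toy` below is made up for illustration:

```python
import pandas as pd

# Toy stand-in for `players`; the names and values are made up.
toy = pd.DataFrame({
    'short_name': ['A. Player', 'B. Player', 'C. Player'],
    'wage_eur': [1000.0, None, 500.0],
    'club_name': ['Club X', None, 'Club Y'],
})

# Column-by-column version, in the style used above.
by_column = toy[
    toy.short_name.isna() | toy.wage_eur.isna() | toy.club_name.isna()
]

# Row-wise version: True for any row with at least one null.
row_wise = toy[toy.isna().any(axis=1)]

print(by_column.equals(row_wise))  # True
```

Both selections return the same rows; the row-wise form does not need updating when columns are added.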


We can use the `duplicated()` method to find duplicate rows:

In [21]:
players.duplicated().sum()

26

The default for `keep` is `'first'`, meaning the first occurrence of each duplicated row is not flagged, only the later copies. Passing `keep=False` flags every occurrence, including the first:

In [22]:
players.duplicated(keep=False).sum()

52
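
To make the effect of `keep` concrete, here is a small sketch on a made-up frame in which one row appears three times (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Rows 0, 2, and 3 are identical; row 1 is unique.
toy = pd.DataFrame({
    'short_name': ['A. Player', 'B. Player', 'A. Player', 'A. Player'],
    'age': [20, 25, 20, 20],
})

flag_later = toy.duplicated()                # keep='first': flags rows 2 and 3
flag_earlier = toy.duplicated(keep='last')   # flags rows 0 and 2
flag_all = toy.duplicated(keep=False)        # flags rows 0, 2, and 3

print(flag_later.sum(), flag_earlier.sum(), flag_all.sum())  # 2 2 3
```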

Let's look at a few duplicates:

In [23]:
players[players.duplicated()].head()

|  | short_name | wage_eur | age | club_name | nationality_name | preferred_foot |
| --- | --- | --- | --- | --- | --- | --- |
| 19239 | Yang Haoyu | 1000.0 | 20 | Dalian Professional Football Club | China PR | Right |
| 19240 | R. Mahon | 500.0 | 18 | Bohemian FC | Republic of Ireland | Left |
| 19241 | S. Black | 500.0 | 19 | Finn Harps | Republic of Ireland | Right |
| 19242 | H. Wiles-Richards | 1000.0 | 19 | Bristol City | England | Right |
| 19243 | J. Affonso | 500.0 | 23 | Cerro Largo Fútbol Club | Uruguay | Right |


## Mitigating Issues

### Handling duplicated data
When dealing with duplicate data, we need to think about the ramifications of removing it. If we determine that removing it won't affect our analysis, we can use `drop_duplicates()`:

In [24]:
pl_clean = players.drop_duplicates()

In [25]:
pl_clean.duplicated().sum()

0
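
By default, `drop_duplicates()` only removes rows that match in every column. When duplicates should be judged on a few identifying columns instead, the `subset` and `keep` arguments can be used. The sketch below uses a made-up frame, and the choice of `short_name` plus `club_name` as the key is an assumption for illustration, not something the dataset guarantees to be unique:

```python
import pandas as pd

# The first two rows share a name and club but differ in wage.
toy = pd.DataFrame({
    'short_name': ['A. Player', 'A. Player', 'B. Player'],
    'club_name': ['Club X', 'Club X', 'Club Y'],
    'wage_eur': [1000.0, 1200.0, 500.0],
})

# No two rows are identical in every column, so nothing is removed.
full_match = toy.drop_duplicates()

# Compare only the key columns and keep the last occurrence of each key.
by_key = toy.drop_duplicates(subset=['short_name', 'club_name'], keep='last')

print(len(full_match), len(by_key))  # 3 2
```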

### Dealing with nulls
We could drop nulls, replace them with some arbitrary value, or impute them using the surrounding data. Each of these options may have ramifications, so we must choose wisely.

We can use `dropna()` to drop rows where any column has a null value:

In [26]:
pl_clean.dropna().shape

(19178, 6)

If we pass `how='all'`, we drop only the rows where every column is null. No row in our data meets that condition, so nothing is removed:

In [27]:
pl_clean.dropna(how='all').shape

(19239, 6)

We can restrict the check to a subset of columns with the `subset` argument. Here, a row is dropped only if both `wage_eur` and `club_name` are null:

In [28]:
pl_clean.dropna(
    how='all', subset=['wage_eur', 'club_name']
).shape

(19178, 6)
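
The three `dropna()` variants above can be compared side by side on a small made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Row 0 is complete, row 1 lacks wage and club, row 2 is entirely null.
toy = pd.DataFrame({
    'short_name': ['A. Player', 'B. Player', None],
    'wage_eur': [1000.0, None, None],
    'club_name': ['Club X', None, None],
})

any_null = toy.dropna()            # drops rows 1 and 2 (any null)
all_null = toy.dropna(how='all')   # drops row 2 only (every column null)
subset_all = toy.dropna(           # drops rows 1 and 2 (both listed columns null)
    how='all', subset=['wage_eur', 'club_name']
)

print(any_null.shape, all_null.shape, subset_all.shape)  # (1, 3) (2, 3) (1, 3)
```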

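This can also be performed along columns. With `thresh`, we set the minimum number of non-null values a column must have to be kept. Here, any column that is less than 75% populated would be dropped, which removes none of ours: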

In [29]:
pl_clean.dropna(axis='columns', thresh=int(pl_clean.shape[0] * .75)).shape

(19239, 6)
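
Since every column in our data passes the threshold, the effect of `thresh` is easier to see on a made-up frame where one column is mostly empty (the frame `toy` and its column names are assumptions for illustration):

```python
import pandas as pd

# 'mostly_full' has 3 non-null values out of 4; 'mostly_empty' has 1.
toy = pd.DataFrame({
    'mostly_full': [1.0, 2.0, 3.0, None],
    'mostly_empty': [None, None, None, 4.0],
})

# Keep only the columns with at least 75% non-null values.
min_non_null = int(len(toy) * 0.75)  # 3
kept = toy.dropna(axis='columns', thresh=min_non_null)

print(kept.columns.tolist())  # ['mostly_full']
```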

We can choose to fill in the null values instead with `fillna()`:

In [34]:
pl_clean.assign(
    club_name = pl_clean.club_name.fillna('unknown'),
    wage_eur = pl_clean.wage_eur.fillna(0)
).loc[[147, 416, 694, 926, 1146], :]

|  | short_name | wage_eur | age | club_name | nationality_name | preferred_foot |
| --- | --- | --- | --- | --- | --- | --- |
| 147 | T. Alderweireld | 0.0 | 32 | unknown | Belgium | Right |
| 416 | G. Krychowiak | 0.0 | 31 | unknown | Poland | Right |
| 694 | A. Dzyuba | 0.0 | 32 | unknown | Russia | Right |
| 926 | M. Ozdoev | 0.0 | 28 | unknown | Russia | Right |
| 1146 | Y. Zhirkov | 0.0 | 37 | unknown | Russia | Left |
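
Filling wages with `0` is an arbitrary choice. Imputing from the other rows is another option mentioned earlier. The sketch below fills missing wages with a median, either overall or per group. The frame `toy` is made up, and grouping by `nationality_name` is an assumption for illustration (the rows missing `wage_eur` in our data also lack `club_name`, so the club could not serve as the grouping key). Whether a median is a sensible estimate of a player's wage is a separate question that this sketch does not settle:

```python
import pandas as pd

# Toy stand-in for `pl_clean`; the values are made up.
toy = pd.DataFrame({
    'nationality_name': ['Russia', 'Russia', 'Russia', 'Canada', 'Canada'],
    'wage_eur': [1000.0, None, 3000.0, 500.0, None],
})

# Option 1: fill with the overall median of the known wages.
overall = toy.wage_eur.fillna(toy.wage_eur.median())

# Option 2: fill with the median wage of players sharing the same nationality.
group_median = toy.groupby('nationality_name').wage_eur.transform('median')
per_group = toy.wage_eur.fillna(group_median)

print(overall.tolist())    # [1000.0, 1000.0, 3000.0, 500.0, 1000.0]
print(per_group.tolist())  # [1000.0, 2000.0, 3000.0, 500.0, 500.0]
```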
