## Missing Values

Missing values occur when no data value is stored for a variable in an observation. Missing data is common in almost all research and can have a significant effect on the conclusions drawn from the data: lost data can bias the estimation of parameters and can reduce the representativeness of the sample.

In [8]:
import pandas as pd
data=pd.read_csv("input/fifa.csv", encoding = 'gb2312')
data.head()
#data.columns

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,㈤226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,㈤127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,㈤228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,㈤138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,㈤196.4M


In [9]:
#counting the missing values in each row and sorting the counts in descending order
data.isnull().sum(axis=1).sort_values(ascending=False)

1120    31
568     31
1349    28
609     28
1777    28
        ..
1162     1
1165     1
1166     1
1167     1
0        1
Length: 1822, dtype: int64

In [10]:
#checking if there are any missing values in rows
data.isnull().any(axis=1)

0       True
1       True
2       True
3       True
4       True
        ... 
1817    True
1818    True
1819    True
1820    True
1821    True
Length: 1822, dtype: bool

In [11]:
#checking if there is any row having all the values missing
data.isnull().all(axis=1).sum()

0

In [12]:
#checking for the rows which have more than 50 missing values
data[data.isnull().sum(axis=1)>50]

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause


In [14]:
data.shape

(1822, 89)

In [15]:
print("Before deleting the rows ",data.shape[0])
data=data[data.isnull().sum(axis=1)<=50]
print("After removing the rows having more than 50 missing values ",data.shape[0])

Before deleting the rows  1822
After removing the rows having more than 50 missing values  1822
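
The row filter above can also be written with `DataFrame.dropna` and its `thresh` parameter, which sets the minimum number of non-missing values a row needs in order to be kept. The following is a minimal sketch on a made-up frame; the name `toy` and the cut-off `max_missing = 2` are assumptions for illustration and are not part of the FIFA data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): rows with 0, 1 and 3 missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
    "d": [4.0, 7.0, 8.0],
})

max_missing = 2  # hypothetical cut-off; the notebook uses 50

# Boolean-mask version, same pattern as the cell above
kept_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# dropna version: a row must have at least (n_columns - max_missing) non-missing values
kept_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(kept_mask.equals(kept_thresh))
print(kept_thresh.index.tolist())
```

Both versions keep the same rows here; which one reads better is a matter of taste.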


In [16]:
#checking for the missing values in columns

data.isnull().sum()

Unnamed: 0          0
ID                  0
Name                0
Age                 0
Photo               0
                 ... 
GKHandling          0
GKKicking           0
GKPositioning       0
GKReflexes          0
Release Clause    122
Length: 89, dtype: int64

In [17]:
pd.set_option("display.max_rows",89)
data.isnull().sum()

Unnamed: 0                     0
ID                             0
Name                           0
Age                            0
Photo                          0
Nationality                    0
Flag                           0
Overall                        0
Potential                      0
Club                          14
Club Logo                      0
Value                          0
Wage                           0
Special                        0
Preferred Foot                 0
International Reputation       0
Weak Foot                      0
Skill Moves                    0
Work Rate                      0
Body Type                      0
Real Face                      0
Position                       0
Jersey Number                  0
Joined                       121
Loaned From                 1715
Contract Valid Until          14
Height                         0
Weight                         0
LS                           180
ST                           180
RS        

In [18]:
x=data.isnull().sum()
y=(data.isnull().sum()/data.shape[0])*100
z={'Number of missing values':x,'Percentage of missing values':y}
df=pd.DataFrame(z,columns=['Number of missing values','Percentage of missing values'])
df.sort_values(by='Percentage of missing values',ascending=False)

Unnamed: 0,Number of missing values,Percentage of missing values
Loaned From,1715,94.127333
LWB,180,9.879254
LCM,180,9.879254
RS,180,9.879254
LW,180,9.879254
LF,180,9.879254
CF,180,9.879254
RF,180,9.879254
RW,180,9.879254
LAM,180,9.879254
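
The count-and-percentage table can be wrapped in a small helper so it is easy to rerun after each cleaning step. This is only a sketch: the function name `missing_summary` and the frame `toy` are hypothetical, and `isnull().mean()` is used as a shortcut for dividing the missing count by the number of rows.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    percent = frame.isnull().mean() * 100  # mean of a boolean mask = fraction of True
    summary = pd.DataFrame({
        "Number of missing values": counts,
        "Percentage of missing values": percent,
    })
    return summary.sort_values(by="Percentage of missing values", ascending=False)

# Toy frame (hypothetical data)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [np.nan, np.nan, 1.0, 2.0],
    "z": [np.nan, 1.0, 2.0, 3.0],
})
summary = missing_summary(toy)
print(summary)
```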


In [19]:
data=data.drop(['Loaned From'],axis=1)
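
Instead of naming the column by hand, columns can be dropped whenever their share of missing values exceeds a chosen limit. The sketch below assumes a limit of 50% and uses a made-up frame `toy`; both are illustrative choices, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

threshold = 0.5  # hypothetical limit: drop columns with more than 50% missing

missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index
trimmed = toy.drop(columns=cols_to_drop)

print(list(cols_to_drop))
print(list(trimmed.columns))
```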

In [20]:
print("Let's check the columns after removing Loaned From column",data.columns)

Let's check the columns after removing Loaned From column Index(['Unnamed: 0', 'ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag',
       'Overall', 'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Jersey Number', 'Joined', 'Contract Valid Until', 'Height', 'Weight',
       'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM',
       'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB',
       'LCB', 'CB', 'RCB', 'RB', 'Crossing', 'Finishing', 'HeadingAccuracy',
       'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy',
       'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility',
       'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength',
       'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
       'Penalties', 'Composure', 'M

In [21]:
data.dtypes[data.isnull().any()]

Club                    object
Joined                  object
Contract Valid Until    object
LS                      object
ST                      object
RS                      object
LW                      object
LF                      object
CF                      object
RF                      object
RW                      object
LAM                     object
CAM                     object
RAM                     object
LM                      object
LCM                     object
CM                      object
RCM                     object
RM                      object
LWB                     object
LDM                     object
CDM                     object
RDM                     object
RWB                     object
LB                      object
LCB                     object
CB                      object
RCB                     object
RB                      object
Release Clause          object
dtype: object

In [22]:
#A player with a missing jersey number does not have a jersey number, so it would be illogical to impute the
#missing values using the mean, median or mode. So let's impute the missing value as NA
data['Jersey Number']=data['Jersey Number'].fillna('NA')

In [23]:
data['Club']=data['Club'].fillna(data['Club'].mode()[0])
data['Position']=data['Position'].fillna(data['Position'].mode()[0])
data['Joined']=data['Joined'].fillna(data['Joined'].mode()[0])
data['Contract Valid Until']=data['Contract Valid Until'].fillna(data['Contract Valid Until'].mode()[0])
data['Release Clause']=data['Release Clause'].fillna(data['Release Clause'].mode()[0])
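
The same mode imputation can be applied to several columns in a loop. The sketch below runs on a made-up frame; `toy` and its column names are assumptions for illustration. When several values tie for most frequent, `mode()` returns all of them and `[0]` takes the first.

```python
import pandas as pd

# Toy frame (hypothetical data) with missing values in text columns
toy = pd.DataFrame({
    "club": ["A", "B", None, "A"],
    "joined": ["2018", None, "2018", "2017"],
})

for col in ["club", "joined"]:
    most_common = toy[col].mode()[0]   # most frequent non-missing value
    toy[col] = toy[col].fillna(most_common)

print(toy)
```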


In [24]:
#business logic: fill the missing values in the position columns with 0
position_cols=['RB','RCB','CB','LCB','LB','RWB','RDM','CDM','LDM','LWB','RM','RCM','CM',
               'LCM','LM','RAM','CAM','LAM','RW','RF','CF','LF','LW','RS','ST','LS']
data[position_cols]=data[position_cols].fillna(0)


If all the missing values have to be imputed with 0, as in the case above, the command can be written directly as:
data.fillna(0,inplace=True)
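
A blanket `fillna(0)` also writes 0 into text columns that still contain missing values. When different columns need different fill values, `fillna` accepts a dictionary that maps column names to values. The following sketch uses a made-up frame; `toy`, its columns, and the fill value `"Unknown"` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one numeric and one text column
toy = pd.DataFrame({
    "rating": [88.0, np.nan, 75.0],
    "club": ["A", None, "B"],
})

# Per-column fill values; columns not listed in the dict are left untouched
filled = toy.fillna({"rating": 0, "club": "Unknown"})
print(filled)
```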

In [25]:
data.isnull().sum().sum()

0

## Handling missing values in dataframes

Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very common problem in real-life scenarios.

Missing data is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data, either because the data exists and was not collected or because it never existed.

For example, some users being surveyed may choose not to share their income and others may choose not to share their address; in this way values go missing from many datasets.

* None: None is a Python singleton object that is often used for missing data in Python code.
* NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
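
The two markers can be compared in a short sketch. The variable names `s` and `o` are arbitrary and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64
s = pd.Series([1, None, 3])
print(s.dtype)               # float64
print(s.isnull().tolist())   # [False, True, False]

# NaN is never equal to itself, so test with isnull()/isna() rather than ==
print(np.nan == np.nan)      # False

# In a Series of strings, isnull() flags None as missing as well
o = pd.Series(["a", None])
print(o.isnull().tolist())   # [False, True]
```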

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a pandas DataFrame:

* isnull()
* notnull()
* dropna()
* fillna()
* replace()
* interpolate()
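
The six functions listed above can be tried side by side on a small frame. This is a minimal sketch with made-up data; the frame `toy` and the fill values `0` and `-1` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

null_mask = toy.isnull()             # True where a value is missing
notnull_mask = toy.notnull()         # True where a value is present
dropped = toy.dropna()               # keeps only rows without any missing value
filled = toy.fillna(0)               # replaces every missing value with 0
replaced = toy.replace(np.nan, -1)   # replace() can also target NaN
interpolated = toy.interpolate()     # linear interpolation down each column

print(dropped)
print(interpolated)
```

With the default settings, `interpolate()` only fills forward from known values, so the missing value at the top of column `b` stays missing.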

In [26]:
# importing libraries
import pandas as pd
import numpy as np
# creating dataframe
d = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
df

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0


In [27]:
df.isnull().sum()

First Score     1
Second Score    1
Third Score     1
dtype: int64

In [28]:
df.isnull().sum(axis = 1)

0    1
1    0
2    1
3    1
dtype: int64

In [29]:
df.fillna(df.mean())

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,72.666667
1,90.0,45.0,40.0
2,95.0,56.0,80.0
3,95.0,43.666667,98.0
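
The column mean is only one possible fill value. The column median is a common alternative because it is less sensitive to extreme values. The sketch below rebuilds the same three-column frame and fills it both ways; the names `mean_filled` and `median_filled` are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, 45, 56, np.nan],
    "Third Score": [np.nan, 40, 80, 98],
})

# Each missing value is replaced by the mean / median of its own column
mean_filled = df.fillna(df.mean())
median_filled = df.fillna(df.median())

print(median_filled)
```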


## Missing data in Series
Missing values occur when no data value is stored for a variable. Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.



In [38]:
import pandas as pd
import numpy as np
ser = pd.Series([1,6,8,np.nan,0,np.nan, 4])
ser

0    1.0
1    6.0
2    8.0
3    NaN
4    0.0
5    NaN
6    4.0
dtype: float64

In [39]:
# filling missing data with ffill() function
ser.ffill()

0    1.0
1    6.0
2    8.0
3    8.0
4    0.0
5    0.0
6    4.0
dtype: float64

In [40]:
# filling missing data with bfill() function
ser.bfill()

0    1.0
1    6.0
2    8.0
3    0.0
4    0.0
5    4.0
6    4.0
dtype: float64

In [41]:
# filling missing values
ser.fillna(5)

0    1.0
1    6.0
2    8.0
3    5.0
4    0.0
5    5.0
6    4.0
dtype: float64
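
Two further options for a Series are sketched below: linear interpolation between neighbouring values, and a forward fill that is capped with `limit`. The second Series `gaps` is a made-up example added to show the effect of `limit` on consecutive missing values.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 6, 8, np.nan, 0, np.nan, 4])

# Linear interpolation: each gap is filled from the values on either side
interpolated = ser.interpolate()
print(interpolated.tolist())   # [1.0, 6.0, 8.0, 4.0, 0.0, 2.0, 4.0]

# Hypothetical Series with two consecutive gaps
gaps = pd.Series([1, np.nan, np.nan, 4])

# limit=1 fills at most one consecutive missing value per gap
limited = gaps.ffill(limit=1)
print(limited.isnull().tolist())   # [False, False, True, False]
```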

In [30]:
# importing required libraries
import pandas as pd
import numpy as np

# creating dataframe
d = {"col1": [2019, 2019, 2020],
     "col2": [350, 365, 1],
     "col3": [np.nan, 365, None]
}

df = pd.DataFrame(d)
df

# Solution 1
df.isnull().sum()

col1    0
col2    0
col3    2
dtype: int64

In [31]:
# Solution 2
df.isna().sum()

col1    0
col2    0
col3    2
dtype: int64

In [32]:
# Solution 3
df.isna().any()

col1    False
col2    False
col3     True
dtype: bool

In [33]:
# Solution 4:
df.isna().sum(axis = 1)

0    1
1    0
2    1
dtype: int64

In [35]:
# total number of missing values in the dataframe
df.isnull().sum().sum()

2

In [36]:
# rowwise missing values
df.isnull().sum(axis=1)

0    1
1    0
2    1
dtype: int64

In [37]:
# returns a boolean Series: True for each column that contains at least one missing value
df.isnull().any()

col1    False
col2    False
col3     True
dtype: bool