# Missing Data

Real-world datasets can contain missing data for many reasons.

Many machine learning algorithms and statistical methods cannot work with missing data.

When we encounter missing data, we have to decide what to do with it.

pandas displays missing values as NaN.

pd.NaT (timestamp-associated) and the more recently introduced pd.NA are additional ways of representing null values.


What Null/NA/nan objects look like:
Source: https://github.com/pandas-dev/pandas/issues/28095

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type
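
As a rough illustration of the note above, the sketch below contrasts the default float representation with the nullable `Int64` dtype. The sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Default behaviour: a missing entry forces the integers to float64 and is stored as np.nan.
float_ser = pd.Series([1, np.nan, 3])

# Nullable integer dtype: the values stay integers and the gap is stored as pd.NA.
int_ser = pd.Series([1, None, 3], dtype="Int64")

print(float_ser.dtype)            # float64
print(int_ser.dtype)              # Int64
print(int_ser.iloc[1] is pd.NA)   # True
print(int_ser.isna().tolist())    # [False, True, False]
```

Either way, `isna()` / `isnull()` detects the missing entry, so the detection code in the rest of these notes applies to both representations.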

## Options for Missing Data

### Keep it

### Remove it

### Replace it

## Keeping the Missing Data

### Pros
    Easiest to do
    Doesn't manipulate or change the true data

### Cons
    Many methods do not support NaN
    Often there are reasonable guesses

## Dropping or Removing the Missing Data

### Pros
    Easy to do
    Can be based on rules

### Cons
    Potential to lose a lot of data or useful information
    Limits trained models for future data

### Dropping a row: makes sense when a lot of info is missing in a row

    The percentage of the data that will be dropped should be calculated

### Dropping a column (feature column): makes sense when a lot of info is missing in that particular column/feature
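
One possible way to calculate how much data a drop would cost is sketched below. The `sample` DataFrame is a hypothetical stand-in, not the dataset used later in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a real dataset.
sample = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
})

# Share of rows that contain at least one missing value
# (these are the rows a plain dropna() would remove).
row_loss_pct = 100 * sample.isnull().any(axis=1).mean()

# Share of missing values in each column
# (helps decide whether a whole column is worth dropping).
col_missing_pct = 100 * sample.isnull().mean()

print(f"rows lost by dropping: {row_loss_pct:.0f}%")
print(col_missing_pct.round(1))
```

The mean of a boolean mask is the fraction of `True` values, which is why `.mean()` gives the percentage directly.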

## Filling in the Missing Data

### Pros
    Potential to save a lot of data for use in training a model

### Cons
    Hardest to do and somewhat arbitrary
    Potential to lead to false conclusions

### Fill with the same value
    Fill with the same value (choosing it requires domain research)
    Fill with zero (if NaN is a placeholder)
    Fill with an interpolated or estimated value (much harder; requires reasonable assumptions and domain knowledge)
    

In [7]:
import numpy as np
import pandas as pd

In [8]:
np.nan

nan

In [9]:
pd.NaT

NaT

In [10]:
np.nan == np.nan

False

In [11]:
np.nan is np.nan

True

In [12]:
my_var = np.nan

In [13]:
my_var is np.nan

True
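
The cells above show that `==` never matches NaN, and the `is` check only works when the object is literally `np.nan`. A small sketch of a more robust test, using a NaN created independently of NumPy:

```python
import numpy as np
import pandas as pd

# A NaN that is not the np.nan object itself.
value = float("nan")

print(value == np.nan)   # False: NaN never compares equal to anything
print(value is np.nan)   # False: a different object, so the identity check fails
print(pd.isna(value))    # True
print(np.isnan(value))   # True
```

For that reason `pd.isna` (or `isnull()` on a Series/DataFrame) is usually the safer way to test for missing values.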

In [14]:
df = pd.read_csv("movie_scores.csv")
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [15]:
df.isnull()  # Detect missing values: returns True for each null cell and False for every other cell.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [16]:
df.notnull()  # Detect existing (non-missing) values: returns True for each non-null cell.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [17]:
df["pre_movie_score"].notnull()

# The same check can be applied to a single column.
# For each row it returns False if the value is null and True otherwise.

0     True
1    False
2    False
3     True
4     True
Name: pre_movie_score, dtype: bool

In [18]:
df[df["pre_movie_score"].notnull()]  # Used as a conditional filter, it returns only the rows whose value is not null.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [19]:
df["pre_movie_score"].isnull()  # This mask can be used to inspect the other fields of rows that have no pre_movie_score.


0    False
1     True
2     True
3    False
4    False
Name: pre_movie_score, dtype: bool

In [20]:
df[df["pre_movie_score"].isnull()]  # Returns all columns for the rows that have no pre_movie_score.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
1,,,,,,
2,Hugh,Jackman,51.0,m,,


In [21]:
df[df["pre_movie_score"].isnull() & df["first_name"].notnull()]

# Returns all columns for the rows that have no pre_movie_score but do have a first_name.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,
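
The boolean masks above can also be summed to count missing values. The snippet below is a minimal sketch on a hypothetical `sample` DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same kind of gaps as above.
sample = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh"],
    "pre_movie_score": [8.0, np.nan, np.nan],
})

# True counts as 1, so summing the mask counts the missing values.
missing_per_column = sample.isnull().sum()
missing_per_row = sample.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```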


## Keep Data
Read the DataFrame, inspect the missing values, and leave them as they are.

## Drop Data

## Fill Data

In [22]:
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(axis: 'Axis' = 0, how: 'str' = 'any', thresh=None, subset=None, inplace: 'bool' = False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If a

In [23]:
df.dropna()  # Remove missing values; the default is axis=0.
# Because no axis was given, every row containing a null value was dropped.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [24]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [25]:
df.dropna(thresh=1)  # With thresh=1, rows with at least 1 non-null value are kept.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [26]:
df.dropna(thresh=2)  # Rows with at least 2 non-null values are kept.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [31]:
df.dropna(thresh=4)  # Rows with at least 4 non-null values are kept.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [30]:
df.dropna(thresh=5)  # Rows with at least 5 non-null values are kept (the row with index 2 has only 4 non-null values, so it is dropped).

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [32]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [33]:
df.dropna(axis=1)  # Drops every column that contains a missing value; here every column has at least one, so only the index remains.

0
1
2
3
4
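
Dropping every column with any gap is often too aggressive. As a sketch, `thresh` can be combined with `axis=1` to keep the columns that have enough non-null values. The `sample` DataFrame and the threshold of 4 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: two columns with 4 non-null values, one with only 3.
sample = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
})

# Keep only the columns that have at least 4 non-null values.
kept = sample.dropna(axis=1, thresh=4)

print(kept.columns.tolist())  # ['first_name', 'age']
```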


In [34]:
df.dropna(subset=["last_name"])  # Drops only the rows where "last_name" is missing.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [37]:
df.dropna(subset=["last_name"], thresh=1)  # subset and thresh can also be combined to filter rows.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
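
The `how` parameter listed in the help text offers another rule. A minimal sketch, on a hypothetical `sample` DataFrame, of dropping only the rows where every value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: row 1 is entirely empty, row 2 is only partly empty.
sample = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
})

# how="all" removes a row only when all of its values are missing.
only_empty_rows_removed = sample.dropna(how="all")

print(only_empty_rows_removed.index.tolist())  # [0, 2, 3, 4]
```

Compared with the default `how="any"`, this keeps partially filled rows such as the one at index 2.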


In [41]:
help(df.fillna)

Help on method fillna in module pandas.core.frame:

fillna(value: 'object | ArrayLike | None' = None, method: 'FillnaOptions | None' = None, axis: 'Axis | None' = None, inplace: 'bool' = False, limit=None, downcast=None) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use next valid observation to fill gap.
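
The forward and backward fill behaviour described in the help text can be sketched with the dedicated `ffill()` and `bfill()` methods. The `readings` Series is a made-up example.

```python
import numpy as np
import pandas as pd

# Made-up series with a gap of two values in the middle.
readings = pd.Series([1.0, np.nan, np.nan, 4.0])

# ffill: carry the last valid observation forward into the gap.
forward = readings.ffill()

# bfill: use the next valid observation to fill the gap.
backward = readings.bfill()

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
```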
  

In [42]:
df.fillna("NEW VALUE")  # Replaces all null values.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NEW VALUE,NEW VALUE,NEW VALUE,NEW VALUE,NEW VALUE,NEW VALUE
2,Hugh,Jackman,51.0,m,NEW VALUE,NEW VALUE
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [43]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [44]:
df["pre_movie_score"].fillna(0)

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [45]:
df["pre_movie_score"] = df["pre_movie_score"].fillna(0)

In [46]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,0.0,
2,Hugh,Jackman,51.0,m,0.0,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [48]:
df = pd.read_csv("movie_scores.csv")
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [49]:
df["pre_movie_score"].fillna(df["pre_movie_score"].mean())  # Fill with the mean of the values in the same column.

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64
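
As the `fillna` help text notes, `value` can also be a dict, which allows a different fill value per column. A small sketch on a hypothetical `sample` DataFrame (the `"unknown"` placeholder is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
sample = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh"],
    "pre_movie_score": [8.0, np.nan, 6.0],
})

# One fill value per column; columns not listed in the dict are left untouched.
fill_values = {
    "first_name": "unknown",
    "pre_movie_score": sample["pre_movie_score"].mean(),
}
filled = sample.fillna(fill_values)

print(filled)
```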

## Filling with Interpolation

Be careful with this technique: make sure you really understand whether it is a valid choice for your data. Several interpolation methods are available; the default is linear.

Full Docs on this Method:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

In [50]:
airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}

In [53]:
ser = pd.Series(airline_tix)
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [None]:
# Fill by interpolating between the previous and next values.

In [54]:
ser.interpolate() 

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64
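
The default linear method treats the values as equally spaced and ignores the index. When the index carries numeric meaning, `method="index"` may be a better fit. The sketch below uses a made-up series with an uneven numeric index to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up series: the gap at index 1 sits much closer to index 0 than to index 4.
dist_ser = pd.Series([0.0, np.nan, 40.0], index=[0, 1, 4])

# Default linear method: values are treated as equally spaced.
by_position = dist_ser.interpolate()

# method="index": the numeric index values are used as the x-coordinates.
by_index = dist_ser.interpolate(method="index")

print(by_position.loc[1])  # 20.0
print(by_index.loc[1])     # 10.0
```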