# Handling missing data in pandas

- Import the necessary libraries:

In [4]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

- Check whether each value is `NaN`. `True` at index 2 means that entry is missing:

In [5]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

* To mark a value as missing, assign `None` to it; pandas treats `None` as missing, just like `NaN`:

In [7]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
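
As a small sketch of how these checks combine in practice: the mask returned by `isnull` (or its alias `isna`) can be summed to count missing entries, and `notna` gives the opposite mask for filtering. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are treated as missing by pandas.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

missing_mask = s.isna()               # isna is an alias of isnull
n_missing = int(missing_mask.sum())   # each True counts as 1
present = s[s.notna()]                # keep only the non-missing entries

print(n_missing)
print(present.tolist())
```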

## Filtering out missing data

In [8]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

- To *drop* the entries that contain an `NA` value, use `dropna`:
 - `dropna` returns a new object by default; passing `inplace=True` modifies `data` itself instead.

In [10]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
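
For a Series, `dropna` gives the same result as boolean indexing with `notna`. The sketch below compares the two and confirms that the original Series is left unchanged when `inplace` is not used; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()          # returns a new Series
filtered = data[data.notna()]    # same result via boolean indexing

same_result = dropped.equals(filtered)
original_length = len(data)      # the original Series keeps all 5 entries

print(same_result, original_length)
```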

* `dropna` also works on a *pandas DataFrame*; by default it drops every row that contains at least one `NaN`:

In [12]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [13]:
cleaned = data.dropna()

In [14]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


* Passing `how='all'` drops only the rows in which every value is `NaN`:

In [16]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0
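
To see why the two settings differ, it can help to count the missing values per row. This is a minimal sketch on the same four-row frame: `how='any'` (the default) drops a row with at least one `NaN`, while `how='all'` drops it only when the count equals the number of columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Number of missing values in each row.
nan_per_row = data.isna().sum(axis=1).tolist()

any_rows = data.dropna(how='any').index.tolist()   # default behaviour
all_rows = data.dropna(how='all').index.tolist()   # only fully empty rows go

print(nan_per_row)
print(any_rows, all_rows)
```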


* Passing `axis=1` drops columns instead of rows:

In [17]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [18]:
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [19]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
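
The `how='all'` argument matters here. As a sketch on the same frame: every column contains at least one `NaN` (row 2 is entirely missing), so `axis=1` with the default `how='any'` would remove all columns, not just the empty one.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan   # a column that is entirely missing

only_empty_removed = data.dropna(axis=1, how='all')   # drops column 4 only
any_nan_removed = data.dropna(axis=1)                 # drops every column

print(list(only_empty_removed.columns))
print(any_nan_removed.shape)
```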


* To keep only the rows that contain at least a certain number of non-missing values, pass `thresh` to `dropna`. The example first uses `iloc` **indexing** to insert some `NaN` values:

In [20]:
df = pd.DataFrame(np.random.randn(7, 3))
df

Unnamed: 0,0,1,2
0,1.784113,-0.574788,-0.798074
1,0.284474,0.178317,0.679814
2,1.330352,0.043382,0.354472
3,-1.05311,-0.139292,0.14021
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544


In [26]:
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

In [27]:
df

Unnamed: 0,0,1,2
0,1.784113,,
1,0.284474,,
2,1.330352,,0.354472
3,-1.05311,,0.14021
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544


In [28]:
df.dropna()

Unnamed: 0,0,1,2
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544


In [29]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,1.330352,,0.354472
3,-1.05311,,0.14021
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544
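
`thresh=2` keeps a row only if it has at least two non-missing values. The sketch below checks this against a manual count; it uses `np.arange` instead of random numbers purely so the result is reproducible, which is an assumption of this example rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame above.
df = pd.DataFrame(np.arange(21, dtype=float).reshape(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

non_missing = df.notna().sum(axis=1)   # non-NaN values per row
kept = df.dropna(thresh=2)             # rows with >= 2 non-NaN values
manual = df[non_missing >= 2]          # the same filter written by hand

print(non_missing.tolist())
print(kept.index.tolist())
```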


## Filling missing data

* Fill any `NaN` value with 0:

In [31]:
df.fillna(0)

Unnamed: 0,0,1,2
0,1.784113,0.0,0.0
1,0.284474,0.0,0.0
2,1.330352,0.0,0.354472
3,-1.05311,0.0,0.14021
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544


* Passing a dictionary to `fillna` sets a different fill value for each column (the keys are column labels):

In [32]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,1.784113,0.5,0.0
1,0.284474,0.5,0.0
2,1.330352,0.5,0.354472
3,-1.05311,0.5,0.14021
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544
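
One detail worth illustrating: columns that are not listed in the dictionary are left alone. A minimal sketch with a small hypothetical frame in which column 0 also has a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: every column has a NaN in the first row.
df = pd.DataFrame({0: [np.nan, 1.0], 1: [np.nan, 2.0], 2: [np.nan, 3.0]})

# Only columns 1 and 2 are named, so column 0 keeps its NaN.
filled = df.fillna({1: 0.5, 2: 0})

print(filled)
```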


* Passing `inplace=True` modifies the existing object instead of returning a new one:

In [34]:
df.fillna(0, inplace=True)

In [37]:
df

Unnamed: 0,0,1,2
0,1.784113,0.0,0.0
1,0.284474,0.0,0.0
2,1.330352,0.0,0.354472
3,-1.05311,0.0,0.14021
4,0.443426,-0.06595,0.96197
5,-0.056108,0.582005,0.133648
6,-1.3356,-0.112706,-1.07544
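
The difference is easy to miss, so here is a short sketch on a small hypothetical frame: calling `fillna` without `inplace=True` and discarding the result leaves the frame unchanged. Reassigning with `df = df.fillna(0)` is a common alternative to `inplace=True`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

df.fillna(0)                                  # result discarded, df unchanged
still_missing = int(df.isna().sum().sum())

df.fillna(0, inplace=True)                    # df itself is modified
missing_after_inplace = int(df.isna().sum().sum())

print(still_missing, missing_after_inplace)
```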


* The same `iloc` indexing can be used to set up another *DataFrame* with missing values:

In [38]:
df = pd.DataFrame(np.random.randn(6, 3))

In [39]:
df

Unnamed: 0,0,1,2
0,0.176576,-0.188739,1.302466
1,-0.062286,-0.339221,-0.826183
2,-0.615303,-0.079451,-1.099676
3,-0.336723,-0.392663,0.297082
4,0.011687,-1.893641,-0.791828
5,1.998845,0.589525,-1.710071


In [42]:
df.iloc[3:, 1] = np.nan
df.iloc[:2, 2] = np.nan

In [43]:
df

Unnamed: 0,0,1,2
0,0.176576,-0.188739,
1,-0.062286,-0.339221,
2,-0.615303,-0.079451,-1.099676
3,-0.336723,,0.297082
4,0.011687,,-0.791828
5,1.998845,,-1.710071


* `fillna` also accepts a Series, so passing `df.mean()` fills each column with its own mean:

In [45]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,0.176576,-0.188739,-0.826123
1,-0.062286,-0.339221,-0.826123
2,-0.615303,-0.079451,-1.099676
3,-0.336723,-0.202471,0.297082
4,0.011687,-0.202471,-0.791828
5,1.998845,-0.202471,-1.710071


* Or fill a single column with its own mean:

In [48]:
df[1].fillna(df[1].mean())

0   -0.188739
1   -0.339221
2   -0.079451
3   -0.202471
4   -0.202471
5   -0.202471
Name: 1, dtype: float64
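
This works because `mean` skips `NaN` values by default, so each column's mean is computed from its observed values only. A sketch with small hypothetical numbers that are easy to check by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 10.0, 20.0]})

col_means = df.mean()            # NaN is skipped: a -> 2.0, b -> 15.0
filled = df.fillna(col_means)    # each column is filled with its own mean

print(filled)
```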

# Data Transformation

## Removing duplicates

In [50]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

In [51]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


* The *pandas* `duplicated` method returns a boolean Series that shows whether each row duplicates an earlier row:

In [52]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

* `drop_duplicates` removes the rows for which `duplicated` returns `True`, keeping the first occurrence of each:

In [53]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
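
Both methods also take `subset` and `keep` arguments. The sketch below, on the same frame, shows that `drop_duplicates` matches filtering with the negated `duplicated` mask, and how `subset` and `keep='last'` change which rows survive.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# drop_duplicates is equivalent to keeping rows where duplicated() is False.
manual = data[~data.duplicated()]

by_k1 = data.drop_duplicates(subset=['k1'])   # compare on column k1 only
keep_last = data.drop_duplicates(keep='last') # keep the last occurrence

print(by_k1.index.tolist())
print(keep_last.index.tolist())
```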


## Transforming data using function or mapping

In [54]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [55]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [56]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

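* The `map` method on a Series accepts a function or a dict-like object containing a mapping. Some food names here are capitalized while the dictionary keys are lowercase, so convert the names to lowercase first: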

In [57]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [58]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


* We could also pass a function that does the lowercasing and the lookup in one step:

In [60]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
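
The two approaches differ when a value has no entry in the mapping. With a dictionary, `map` returns `NaN` for unmapped values, whereas a function that indexes the dictionary directly would raise a `KeyError`. The sketch below uses a shortened dictionary, a hypothetical `'tofu'` entry, and `dict.get` with an `'unknown'` default; these are illustrative choices, not part of the notebook above.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['Bacon', 'pastrami', 'tofu'])   # 'tofu' has no mapping

# Dict mapping: unmapped values become NaN.
via_dict = food.str.lower().map(meat_to_animal)

# Function mapping: dict.get supplies a default instead of raising KeyError.
via_func = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(via_dict.isna().tolist())
print(via_func.tolist())
```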