# From the Docs

## Missing Values
From the [Pandas Official Documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html)

---

In [6]:
import pandas as pd
pd.__version__

'0.23.4'

In [7]:
home = pd.read_csv('data_processed/home.csv')
home.head()

Unnamed: 0.1,Unnamed: 0,Browser,Country,Date Submitted,Device,How did you hear about us?,OS,Source URL,User,Medium,OSGroup
0,0,Firefox 57.0,Indonesia,2018-01-05 01:35:47,desktop,Friends / Family - ...,Windows 8.1,https://algorit.ma/,0,Friends / Family,Windows
1,1,Safari 11.0.1,Indonesia,2018-01-05 04:25:43,desktop,Media Publishing,Mac OS X 10.13.1,https://algorit.ma/,0,Media Publishing,Mac
2,2,Chrome 63.0.3239,Indonesia,2018-01-05 16:36:25,desktop,Friends / Family - ...,Windows 10,https://algorit.ma/,0,Friends / Family,Windows
3,3,Chrome 62.0.3202,Indonesia,2018-01-06 16:40:16,desktop,Media Publishing,Windows 10,https://algorit.ma/,0,Media Publishing,Windows
4,4,Chrome 63.0.3239,Indonesia,2018-01-08 03:45:42,desktop,Search Engine,Windows 10,https://algorit.ma/,0,Search Engine,Windows


Use `pd.isna()` to build a boolean mask and select only the rows where the `Country` column is missing:

In [9]:
home.loc[pd.isna(home['Country']),:]

Unnamed: 0.1,Unnamed: 0,Browser,Country,Date Submitted,Device,How did you hear about us?,OS,Source URL,User,Medium,OSGroup
197,197,Chrome 66.0.3359,,2018-05-21 04:13:31,desktop,Search Engine,Windows 7,https://algorit.ma/,0,Search Engine,Windows
376,376,Opera 55.0.2994,,2018-09-06 06:03:52,desktop,Search Engine,Windows 7,https://algorit.ma/,0,Search Engine,Windows
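
Before filtering, it often helps to see how many values are missing in each column. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, not taken from `home.csv`); it shows the same boolean-mask subsetting alongside a per-column count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `home`; the values are made up.
toy = pd.DataFrame({
    'Browser': ['Firefox', 'Chrome', 'Opera', 'Safari'],
    'Country': ['Indonesia', np.nan, np.nan, 'Indonesia'],
})

# Boolean mask: True where Country is missing
mask = toy['Country'].isna()

# Rows with a missing Country; same idea as toy.loc[pd.isna(toy['Country']), :]
missing_rows = toy.loc[mask, :]

# Number of missing values in each column
na_counts = toy.isna().sum()

print(missing_rows)
print(na_counts)
```

`Series.isna()` and `pd.isna()` return the same mask; the method form is simply easier to chain.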


In [18]:
so = pd.read_csv('data_input/stackoverflow_qa.csv')
so = so.iloc[0:5,:]
so

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0
1,5515021,2011-04-01 14:50:44,8,7015,Compute a compounded return series in Python,3,6,7.0,Jason Strimpel,3301.0,Mike Pennington,26995.0
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0
3,6467832,2011-06-24 12:31:45,9,13056,How to get the correlation between two timeser...,1,0,7.0,user814005,117.0,Wes McKinney,43310.0
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0


In [19]:
so['score'].sum()

32

In [20]:
# row-wise mean (axis=1); pandas 0.23.4 uses only the numeric columns here
so.mean(axis=1)

0    689433.875
1    694044.500
2    701576.875
3    815541.500
4    954393.000
dtype: float64

In [22]:
so[['score', 'favoritecount']].cumsum()

Unnamed: 0,score,favoritecount
0,4.0,1.0
1,12.0,8.0
2,14.0,9.0
3,23.0,16.0
4,32.0,21.0


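A more practical example combines `.groupby()` with `.mean()`. Notice that only the numeric columns are aggregated. This is the behavior of the pandas version used here (0.23.4); pandas 2.0 and later no longer drop non-numeric columns silently, so pass `numeric_only=True` there.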

In [23]:
so.groupby('ans_name').mean()

ans_name,id,score,viewcount,answercount,commentcount,favoritecount,quest_rep,ans_rep
HYRY,7577546.0,9.0,2488.0,1.0,0.0,5.0,958.0,54137.0
Mike Pennington,5500623.5,6.0,4564.0,3.0,5.0,4.0,1713.0,26995.0
Wes McKinney,6013219.5,5.5,10224.0,1.5,0.0,4.0,1709.0,43310.0
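
As a rough sketch of how the same grouped mean can be written so that it also runs on recent pandas releases, the example below uses a small made-up frame (`toy` and its values are hypothetical). It restricts the aggregation to numeric columns in two ways: with `numeric_only=True`, and by selecting the columns explicitly.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'ans_name': ['A', 'A', 'B'],
    'title': ['q1', 'q2', 'q3'],      # non-numeric, excluded from the mean
    'score': [4, 8, 9],
    'viewcount': [100, 300, 50],
})

# Option 1: ask pandas to aggregate numeric columns only
group_means = toy.groupby('ans_name').mean(numeric_only=True)

# Option 2: name the columns to aggregate
explicit = toy.groupby('ans_name')[['score', 'viewcount']].mean()

print(group_means)
```

Selecting the columns explicitly is usually the safer choice, since the result does not depend on which columns happen to be numeric.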


In [24]:
home.loc[pd.isna(home['Country']),:]

Unnamed: 0.1,Unnamed: 0,Browser,Country,Date Submitted,Device,How did you hear about us?,OS,Source URL,User,Medium,OSGroup
197,197,Chrome 66.0.3359,,2018-05-21 04:13:31,desktop,Search Engine,Windows 7,https://algorit.ma/,0,Search Engine,Windows
376,376,Opera 55.0.2994,,2018-09-06 06:03:52,desktop,Search Engine,Windows 7,https://algorit.ma/,0,Search Engine,Windows


Using `fillna()` to fill missing values:

In [27]:
home['Country'] = home['Country'].fillna('Indonesia')
home.iloc[[197,376], :]

Unnamed: 0.1,Unnamed: 0,Browser,Country,Date Submitted,Device,How did you hear about us?,OS,Source URL,User,Medium,OSGroup
197,197,Chrome 66.0.3359,Indonesia,2018-05-21 04:13:31,desktop,Search Engine,Windows 7,https://algorit.ma/,0,Search Engine,Windows
376,376,Opera 55.0.2994,Indonesia,2018-09-06 06:03:52,desktop,Search Engine,Windows 7,https://algorit.ma/,0,Search Engine,Windows
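
`fillna()` also accepts a dict that maps column names to fill values, which can be handy when several columns need different defaults. The following is a minimal sketch on a made-up frame (`toy`, its values, and the `'unknown'` placeholder are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'Country': ['Indonesia', np.nan, 'Singapore', 'Indonesia', np.nan],
    'Device': ['desktop', 'mobile', np.nan, 'desktop', 'desktop'],
})

# Fill a single column with a constant, as in the cell above
filled_country = toy['Country'].fillna('Indonesia')

# Fill several columns at once with a per-column dict
filled_all = toy.fillna({'Country': 'Indonesia', 'Device': 'unknown'})

# Instead of hard-coding the value, use the most frequent one (the mode)
most_common = toy['Country'].mode()[0]
filled_mode = toy['Country'].fillna(most_common)

print(filled_all)
```

Filling with a constant or the mode is an assumption about the data; whether it is appropriate depends on why the values are missing.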


Using `dropna()`:

In [41]:
# drop rows (axis=0) that contain at least one NA value
print(home.shape)
home.dropna(axis=0, inplace=True)
print(home.shape)

(409, 11)
(407, 11)
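
`dropna()` has a few parameters that control which rows count as "missing enough" to drop. The sketch below compares them on a made-up frame (`toy` and its values are hypothetical) and returns new frames rather than using `inplace=True`, so the original stays intact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'Country': ['Indonesia', np.nan, 'Indonesia', np.nan],
    'Device': ['desktop', 'mobile', np.nan, np.nan],
    'User': [0, 1, 0, 1],
})

# Default (how='any'): drop rows with at least one NA value
any_na = toy.dropna(axis=0)

# how='all': drop only rows where every value is NA
all_na = toy.dropna(axis=0, how='all')

# subset: look for NA values in the listed columns only
by_country = toy.dropna(subset=['Country'])

# thresh: keep rows that have at least this many non-NA values
at_least_two = toy.dropna(thresh=2)

print(toy.shape, any_na.shape, all_na.shape, by_country.shape, at_least_two.shape)
```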


Using `interpolate()`:

In [44]:
import numpy as np
df = pd.DataFrame({
        'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
        'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
print(df)

     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40


In [45]:
# performs linear interpolation
df.interpolate()

Unnamed: 0,A,B
0,1.0,0.25
1,2.1,1.5
2,3.4,2.75
3,4.7,4.0
4,5.6,12.2
5,6.8,14.4
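
To see where the filled values come from, the sketch below rebuilds the same `df` and checks the linear interpolation by hand: the two missing `B` values sit between 0.25 (row 0) and 4.0 (row 3), so each step is (4.0 - 0.25) / 3 = 1.25. It also shows the `limit` parameter, which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

# Linear interpolation (the default method)
linear = df.interpolate()

# Manual check for column B: equal steps between the known neighbors
step = (4.0 - 0.25) / 3
b_row1 = 0.25 + step        # value expected at row 1
b_row2 = 0.25 + 2 * step    # value expected at row 2

# limit=1 fills at most one consecutive NaN, so row 2 of B stays missing
limited = df.interpolate(limit=1)

print(linear)
print(limited)
```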
