### <span style="color:black"><b>Pandas Tutorial 10</b></span>
---

- Dealing with Missing Data
- Dropping Rows + Columns

---

**Useful DataFrame Methods**

<pre>
df.isna() or df.isnull()
df.fillna()
df.dropna()
df.drop()
</pre>

**Important arguments**
<pre>
axis
inplace
</pre>
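
As a rough illustration of how `axis` and `inplace` affect the methods listed above, here is a minimal sketch on a made-up three-column frame. The name `toy` and its values are hypothetical and are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the tutorial's dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

# axis=0 (or 'index') works on rows: drop every row that contains a NaN.
# Every row has a NaN in column 'c', so nothing survives.
rows_dropped = toy.dropna(axis=0)

# axis=1 (or 'columns') works on columns: drop every column that contains a NaN
cols_dropped = toy.dropna(axis=1)

# drop() removes the labels you name explicitly
no_c = toy.drop(columns=["c"])

# Without inplace=True these methods return a new object and leave `toy` untouched
print(toy.shape)
print(cols_dropped.columns.tolist())
```

Leaving `inplace` at its default and assigning the result to a new name keeps the original frame available for comparison.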

In [1]:
import numpy as np
import pandas as pd
%config InlineBackend.figure_format = 'svg'

**1. Initial cleaning while reading in data**

In [2]:
df = pd.read_csv('2018_health.csv', na_values=['?', '--', '..'])
df.head(10)

Unnamed: 0.1,Unnamed: 0,GDP/Capita,MeanSchooling,InfantMortality,Life_exp,Country,Year,Country Type
0,0,,3.9,,,Afghanistan,2018.0,Developing
1,1,missing,10.1,8.5,78.458,Albania,2018.0,Developing
2,2,4153.733978059152,,20.4,,Algeria,2018.0,Developing
3,3,11633.498008646506,,8.7,76.52,Argentina,2018.0,Developing
4,4,4220.490276712489,11.3,11.0,74.945,Armenia,2018.0,Developing
5,5,57354.96404558911,12.7,3.1,,Australia,2018.0,Developed
6,6,4739.8417102839285,10.6,19.3,72.864,Azerbaijan,2018.0,Developing
7,7,23991.05667693996,9.4,6.1,77.163,Bahrain,2018.0,Developing
8,8,1698.3503942812197,6.1,26.7,72.32,Bangladesh,2018.0,Developing
9,9,17745.19346499627,10.6,12.1,79.081,Barbados,2018.0,Developing
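
The Albania row above still shows the text `missing`, since that token is not in the `na_values` list. The sketch below shows the effect of adding it, using a small in-memory CSV as a stand-in for `2018_health.csv`; the three rows and the name `csv_text` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for 2018_health.csv
csv_text = """Country,GDP/Capita,Life_exp
Afghanistan,?,--
Albania,missing,78.458
Algeria,4153.73,..
"""

# Without 'missing' in na_values, the column keeps a text entry and is not numeric
partial = pd.read_csv(io.StringIO(csv_text), na_values=["?", "--", ".."])

# Adding 'missing' lets pandas parse the whole column as floats
full = pd.read_csv(io.StringIO(csv_text), na_values=["?", "--", "..", "missing"])

print(partial["GDP/Capita"].tolist())
print(full["GDP/Capita"].dtype)
```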


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       122 non-null    int64  
 1   GDP/Capita       118 non-null    object 
 2   MeanSchooling    15 non-null     float64
 3   InfantMortality  121 non-null    float64
 4   Life_exp         119 non-null    float64
 5   Country          122 non-null    object 
 6   Year             121 non-null    float64
 7   Country Type     122 non-null    object 
dtypes: float64(4), int64(1), object(3)
memory usage: 7.8+ KB
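
Note that `GDP/Capita` is reported as `object` rather than `float64`, which usually means at least one entry is text. One possible way to track down such entries is sketched below on a hypothetical series (`gdp` is made up, not read from the file), using `pd.to_numeric` with `errors='coerce'`.

```python
import numpy as np
import pandas as pd

# Hypothetical series mimicking a numeric column polluted by a text token
gdp = pd.Series(["4153.73", "missing", "11633.50", np.nan], name="GDP/Capita")

# errors='coerce' turns anything that cannot be parsed as a number into NaN
gdp_numeric = pd.to_numeric(gdp, errors="coerce")

# Entries that were present originally but failed to parse
bad_entries = gdp[gdp_numeric.isna() & gdp.notna()]

print(gdp_numeric.dtype)
print(bad_entries.tolist())
```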


**2. Finding where missing values are for a particular series**

- Example: Find all rows where MeanSchooling was missing

In [4]:
df.loc[df['MeanSchooling'].isnull(), :]

Unnamed: 0.1,Unnamed: 0,GDP/Capita,MeanSchooling,InfantMortality,Life_exp,Country,Year,Country Type
2,2,4153.733978059152,,20.4,,Algeria,2018.0,Developing
3,3,11633.498008646506,,8.7,76.520000,Argentina,2018.0,Developing
17,17,813.0968489600382,,55.3,61.174000,Burkina Faso,2018.0,Developing
18,18,271.75204437664803,,41.2,61.247000,Burundi,2018.0,Developing
19,19,1512.1267097284697,,23.7,69.570000,Cambodia,2018.0,Developing
...,...,...,...,...,...,...,...,...
117,117,43043.2278164551,,3.7,81.256098,United Kingdom,,Developed
118,118,62996.47128526045,,5.6,78.539024,United States,2018.0,Developed
119,119,17277.970110549588,,6.5,77.770000,Uruguay,2018.0,Developing
120,120,,,16.5,71.573000,Uzbekistan,2018.0,Developing


- Example: Find all rows where Life_exp was missing

In [5]:
df.loc[df.Life_exp.isnull(), :]

Unnamed: 0.1,Unnamed: 0,GDP/Capita,MeanSchooling,InfantMortality,Life_exp,Country,Year,Country Type
0,0,,3.9,,,Afghanistan,2018.0,Developing
2,2,4153.733978059152,,20.4,,Algeria,2018.0,Developing
5,5,57354.96404558911,12.7,3.1,,Australia,2018.0,Developed


**3. Counting how many missing values there are in our dataset**


In [6]:
df.isna().sum()

Unnamed: 0           0
GDP/Capita           4
MeanSchooling      107
InfantMortality      1
Life_exp             3
Country              0
Year                 1
Country Type         0
dtype: int64
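
Because `isna()` returns booleans, `sum()` counts the missing entries per column and `mean()` gives the fraction missing. A minimal sketch on a hypothetical frame (`toy` is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the tutorial's dataset
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [1.0, 2.0, 3.0, 4.0],
})

counts = toy.isna().sum()            # missing entries per column
share = toy.isna().mean()            # fraction of entries missing per column
total = int(toy.isna().sum().sum())  # missing entries in the whole frame

print(counts.to_dict())
print(share.to_dict())
print(total)
```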

**4. Seeing every row that has at least one missing value**

In [7]:
df.loc[df.isnull().any(axis='columns'), :]

Unnamed: 0.1,Unnamed: 0,GDP/Capita,MeanSchooling,InfantMortality,Life_exp,Country,Year,Country Type
0,0,,3.9,,,Afghanistan,2018.0,Developing
2,2,4153.733978059152,,20.4,,Algeria,2018.0,Developing
3,3,11633.498008646506,,8.7,76.520000,Argentina,2018.0,Developing
5,5,57354.96404558911,12.7,3.1,,Australia,2018.0,Developed
17,17,813.0968489600382,,55.3,61.174000,Burkina Faso,2018.0,Developing
...,...,...,...,...,...,...,...,...
117,117,43043.2278164551,,3.7,81.256098,United Kingdom,,Developed
118,118,62996.47128526045,,5.6,78.539024,United States,2018.0,Developed
119,119,17277.970110549588,,6.5,77.770000,Uruguay,2018.0,Developing
120,120,,,16.5,71.573000,Uzbekistan,2018.0,Developing
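
The difference between `any` and `all` along `axis='columns'` can be sketched on a hypothetical frame (`toy` is made up): `any` flags rows with at least one missing value, while `all` flags only rows where every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the tutorial's dataset
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [2.0, 5.0, np.nan],
})

# Rows with at least one missing value
any_missing = toy.loc[toy.isna().any(axis="columns"), :]

# Rows where every value is missing
all_missing = toy.loc[toy.isna().all(axis="columns"), :]

# Rows with no missing values at all
complete = toy.dropna()

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(complete.index.tolist())
```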


**5. Filling in missing values (for a series) with either a constant or the output from an aggregation function**

In [8]:
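# Fill with 20_000 just to see how it works.
# Assign the result back: inplace=True on a selected column may not update df in recent pandas versions.
df['GDP/Capita'] = df['GDP/Capita'].fillna(value=20_000)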

In [9]:
# Skip inplace=True here so the fill is only previewed and df itself is left unchanged
school_mean = df['MeanSchooling'].mean()

df['MeanSchooling'].fillna(value=school_mean)

0       3.900000
1      10.100000
2       8.906667
3       8.906667
4      11.300000
         ...    
117     8.906667
118     8.906667
119     8.906667
120     8.906667
121     8.906667
Name: MeanSchooling, Length: 122, dtype: float64
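
Any aggregation that returns a single number can serve as the fill value. The sketch below compares mean and median fills on a hypothetical series (`s` is made up) and confirms that the original is untouched when the result is not assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value and one large value
s = pd.Series([2.0, np.nan, 4.0, 12.0])

# mean() and median() skip NaN by default
filled_mean = s.fillna(value=s.mean())      # mean of 2, 4, 12
filled_median = s.fillna(value=s.median())  # median of 2, 4, 12

print(filled_mean.tolist())
print(filled_median.tolist())
print(int(s.isna().sum()))  # the original series still has its NaN
```

The median is less sensitive to extreme values, which can matter for skewed columns.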

**6. Filling in missing values across an entire df**

In [10]:
df

Unnamed: 0.1,Unnamed: 0,GDP/Capita,MeanSchooling,InfantMortality,Life_exp,Country,Year,Country Type
0,0,20000,3.9,,,Afghanistan,2018.0,Developing
1,1,missing,10.1,8.5,78.458000,Albania,2018.0,Developing
2,2,4153.733978059152,,20.4,,Algeria,2018.0,Developing
3,3,11633.498008646506,,8.7,76.520000,Argentina,2018.0,Developing
4,4,4220.490276712489,11.3,11.0,74.945000,Armenia,2018.0,Developing
...,...,...,...,...,...,...,...,...
117,117,43043.2278164551,,3.7,81.256098,United Kingdom,,Developed
118,118,62996.47128526045,,5.6,78.539024,United States,2018.0,Developed
119,119,17277.970110549588,,6.5,77.770000,Uruguay,2018.0,Developing
120,120,20000,,16.5,71.573000,Uzbekistan,2018.0,Developing


In [11]:
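# numeric_only=True skips text columns; recent pandas versions raise a TypeError without it
df.fillna(value=df.mean(numeric_only=True))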

Unnamed: 0.1,Unnamed: 0,GDP/Capita,MeanSchooling,InfantMortality,Life_exp,Country,Year,Country Type
0,0,20000,3.900000,20.004959,72.900215,Afghanistan,2018.0,Developing
1,1,missing,10.100000,8.500000,78.458000,Albania,2018.0,Developing
2,2,4153.733978059152,8.906667,20.400000,72.900215,Algeria,2018.0,Developing
3,3,11633.498008646506,8.906667,8.700000,76.520000,Argentina,2018.0,Developing
4,4,4220.490276712489,11.300000,11.000000,74.945000,Armenia,2018.0,Developing
...,...,...,...,...,...,...,...,...
117,117,43043.2278164551,8.906667,3.700000,81.256098,United Kingdom,2018.0,Developed
118,118,62996.47128526045,8.906667,5.600000,78.539024,United States,2018.0,Developed
119,119,17277.970110549588,8.906667,6.500000,77.770000,Uruguay,2018.0,Developing
120,120,20000,8.906667,16.500000,71.573000,Uzbekistan,2018.0,Developing


In [12]:
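# GDP/Capita is stored as text (it still contains the string 'missing'), so calling
# .mean() on it directly raises: TypeError: unsupported operand type(s) for +: 'int' and 'str'.
# Convert it to numbers first; errors='coerce' turns unparseable entries into NaN.
gdp_mean = pd.to_numeric(df['GDP/Capita'], errors='coerce').mean()

values_to_fill = {'GDP/Capita': gdp_mean,
                  'MeanSchooling': 10,
                  'InfantMortality': df['InfantMortality'].median()}

# Use fillna() as a DataFrame method: each key is a column name, each value is that column's fill
df.fillna(value=values_to_fill)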
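
Passing a dict to `fillna()` applies a separate fill value to each named column and leaves unnamed columns alone. A minimal sketch on a hypothetical frame (`toy` and its column names are made up), where the averaged column is already numeric:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the tutorial's dataset
toy = pd.DataFrame({
    "gdp": [1000.0, np.nan, 3000.0],
    "schooling": [np.nan, 9.0, 11.0],
    "country": ["A", None, "C"],
})

# One fill value per column; 'country' is deliberately left out
values_to_fill = {
    "gdp": toy["gdp"].mean(),  # mean of 1000 and 3000
    "schooling": 10,           # constant
}

filled = toy.fillna(value=values_to_fill)

print(filled["gdp"].tolist())
print(filled["schooling"].tolist())
print(int(filled["country"].isna().sum()))  # still missing: no fill value was given
```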

In [13]:
df.columns

Index(['Unnamed: 0', 'GDP/Capita', 'MeanSchooling', 'InfantMortality',
       'Life_exp', 'Country', 'Year', 'Country Type'],
      dtype='object')