# Data manipulation
The following exercises cover manipulating data in `pandas` and `numpy` using indexing and aggregations. We will use the Pokemon dataset.

Remember the following ways of accessing and subsetting data:
- `df.iloc[x,y]` Access element in row `x` and column `y`, e.g. `df.iloc[1,2]`. Can't refer to column names!
- `df['A']` or `df.A` Yields a single column.
- `df.loc[bool_vec, bool_vec]` or `df.loc[bool_vec, col]` Boolean indexing.
- `df[['A','B']]` where `A` and `B` are column names.
- `df.drop('A', axis = 1)` Drop columns.

Boolean vectors are often created using `==`, `<`, `>`. Example:
```python
>>> df['A']
0    0.262266
1    0.881773
2    0.030189
3    0.241150
4    0.546736
Name: A, dtype: float64
>>> df['A'] > 0.5
0    False
1     True
2    False
3    False
4     True
Name: A, dtype: bool
```
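
As a quick check of the access patterns listed above, the sketch below runs each of them on a small made-up frame. The frame `toy` and its columns `A`, `B` and `C` are invented for illustration and are not part of the Pokemon dataset.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the access patterns
toy = pd.DataFrame({
    'A': [0.2, 0.9, 0.5],
    'B': [10, 20, 30],
    'C': ['x', 'y', 'z'],
})

# Positional access: row 1, column 2 (the third column, 'C')
cell = toy.iloc[1, 2]

# A single column comes back as a Series
col_a = toy['A']

# A list of names returns a DataFrame with just those columns
sub = toy[['A', 'B']]

# Boolean indexing: rows where A > 0.4, keeping only column 'B'
mask = toy['A'] > 0.4
picked = toy.loc[mask, 'B']

# Drop a column; the original frame is left unchanged
without_c = toy.drop('C', axis=1)

print(cell)
print(picked.tolist())
print(list(without_c.columns))
```

Note that `drop` returns a new frame, so `toy` still has all three columns afterwards.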

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('Pokemon.csv', encoding = "latin1")
df.head()

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,2,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,3,False
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,2,False


Create a dataframe with the columns HP, Attack and Defense

In [4]:
newdf = df[['HP', 'Attack', 'Defense']]
newdf

,HP,Attack,Defense
0,45,49,49
1,60,62,63
2,80,82,83
3,39,52,43
4,58,64,58
...,...,...,...
146,41,64,45
147,61,84,65
148,91,134,95
149,106,110,90


In [None]:
#ANS
df_sub = df[['HP', 'Attack', 'Defense']]

# Or
df_sub = df.loc[:, ['HP', 'Attack', 'Defense']]

Create a new DataFrame containing all Pokemon of fire type. This can be done using Boolean indexing.

In [6]:
newdf=df.loc[df['Type 1'] == "Fire"]
newdf

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,2,False
5,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,3,False
36,37,Vulpix,Fire,,299,38,41,40,50,65,65,1,False
37,38,Ninetales,Fire,,505,73,76,75,81,100,100,2,False
57,58,Growlithe,Fire,,350,55,70,45,70,50,60,1,False
58,59,Arcanine,Fire,,555,90,110,80,100,80,95,2,False
76,77,Ponyta,Fire,,410,50,85,55,65,65,90,1,False
77,78,Rapidash,Fire,,500,65,100,70,80,80,105,2,False
125,126,Magmar,Fire,,495,65,95,57,100,85,93,1,False


In [None]:
#ANS
df_fire = df.loc[df['Type 1'] == 'Fire', :]
df_fire

Create a dataframe with `Flying` type Pokemon and the columns `Name`, `Type 1`, `Type 2` and `HP`.

In [9]:
newdf=df.loc[df['Type 2'] == "Flying", ['Name', 'Type 1', 'Type 2', 'HP']]
newdf

,Name,Type 1,Type 2,HP
5,Charizard,Fire,Flying,78
11,Butterfree,Bug,Flying,60
15,Pidgey,Normal,Flying,40
16,Pidgeotto,Normal,Flying,63
17,Pidgeot,Normal,Flying,83
20,Spearow,Normal,Flying,40
21,Fearow,Normal,Flying,65
40,Zubat,Poison,Flying,40
41,Golbat,Poison,Flying,75
82,Farfetch'd,Normal,Flying,52


In [None]:
#ANS
df.loc[df['Type 2'] == 'Flying', ['Name', 'HP', 'Type 1', 'Type 2']].head()

## Aggregations

Find the average attack power for the fire pokemon

In [13]:
newdf=df.loc[df['Type 1'] == "Fire"]
newdf['Attack'].mean()

83.91666666666667

In [14]:
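#ANS
# Filter and aggregate in one step, so the cell does not depend on
# `df_fire` having been created by an earlier cell
df.loc[df['Type 1'] == 'Fire', 'Attack'].mean()

83.91666666666667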

Find the maximum defense of pokemon with HP lower than 70.

In [18]:
df_hplowerthan70=df.loc[df['HP'] < 70]
df_hplowerthan70['Defense'].max()

180

In [None]:
#ANS
df.loc[df['HP'] < 70, 'Defense'].max()

Find the median and sum of Sp. Atk of all evolved Grass pokemon with a Total of less than 400.
Hint: Two conditions can be combined like this:
```python
(df['HP'] < 100) & (df['Defense'] > 50)
```
Conditions can also be combined with an "or" statement:
```python
(df['HP'] < 100) | (df['Defense'] > 50)
```
This corresponds to HP less than 100 OR defense larger than 50.
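
A small sketch of why the parentheses matter, using a made-up frame (`toy` and its values are assumptions for illustration). In Python, `&` and `|` bind more tightly than comparison operators such as `<`, so each comparison has to be wrapped in parentheses before the conditions are combined.

```python
import pandas as pd

# Hypothetical values, not taken from the Pokemon dataset
toy = pd.DataFrame({
    'HP': [45, 120, 80, 150],
    'Defense': [40, 60, 90, 30],
})

# AND: both conditions must hold for a row to be kept
both = (toy['HP'] < 100) & (toy['Defense'] > 50)

# OR: at least one condition must hold
either = (toy['HP'] < 100) | (toy['Defense'] > 50)

# NOT: ~ flips every entry of a boolean vector
not_both = ~both

print(both.tolist())
print(either.tolist())
print(toy.loc[both])
```

The combined vector can be stored in a variable and reused in several `df.loc[...]` calls, which keeps long filters readable.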

A pokemon's evolution is determined by `Stage`.


In [25]:
df_subset=df.loc[ (df['Stage'] > 1) & (df['Type 1'] == "Grass") & (df['Total'] < 400) ]
print(df_subset)
print(df_subset['Sp. Atk'].median())
print(df_subset['Sp. Atk'].sum())


     #        Name Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  \
43  44       Gloom  Grass  Poison    395  60      65       70       85   
69  70  Weepinbell  Grass  Poison    390  65      90       50       85   

    Sp. Def  Speed  Stage  Legendary  
43       75     40      2      False  
69       45     55      2      False  
85.0
170


In [28]:
#ANS
cond = (df['Stage'] > 1) & (df['Total'] < 400) & (df['Type 1'] == 'Grass')
print(df.loc[cond,'Sp. Atk'].median())
print(df.loc[cond,'Sp. Atk'].sum())

85.0
170


Find all pokemon which are of type Grass or subtype Poison.

In [30]:
df_grassorpoison = df.loc[(df['Type 1'] == "Grass") | (df['Type 2'] == "Poison")]
print(df_grassorpoison)

       #        Name Type 1   Type 2  Total  HP  Attack  Defense  Sp. Atk  \
0      1   Bulbasaur  Grass   Poison    318  45      49       49       65   
1      2     Ivysaur  Grass   Poison    405  60      62       63       80   
2      3    Venusaur  Grass   Poison    525  80      82       83      100   
12    13      Weedle    Bug   Poison    195  40      35       30       20   
13    14      Kakuna    Bug   Poison    205  45      25       50       25   
14    15    Beedrill    Bug   Poison    395  65      90       40       45   
42    43      Oddish  Grass   Poison    320  45      50       55       75   
43    44       Gloom  Grass   Poison    395  60      65       70       85   
44    45   Vileplume  Grass   Poison    490  75      80       85      110   
47    48     Venonat    Bug   Poison    305  60      55       50       40   
48    49    Venomoth    Bug   Poison    450  70      65       60       90   
68    69  Bellsprout  Grass   Poison    300  50      75       35       70   

In [None]:
#ANS
df.loc[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Poison'),:]

## Vectorization
You have already been using vectorization in the exercises above. An operation such as `df['Type 2'] == 'Flying'` compares all entries in the vector `df['Type 2']` to the string value `'Flying'`. This saves us from having to loop through the vector and compare each entry to `'Flying'`.
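
To make the comparison with a loop concrete, here is a minimal sketch on a made-up Series (the name `types` and its values are assumptions for illustration). Both versions produce the same Boolean values, but the vectorized one is a single expression.

```python
import pandas as pd

types = pd.Series(['Flying', 'Poison', 'Ice', 'Flying'])  # made-up values

# Explicit loop: compare entry by entry
looped = []
for value in types:
    looped.append(value == 'Flying')

# Vectorized: one expression compares the whole Series at once
vectorized = types == 'Flying'

print(looped)
print(vectorized.tolist())
```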

We will try to make use of the vectorization features in `pandas` and in `numpy`. Start out by finding the logarithm of `Defense` and add it as a new column called `log_def`. The logarithm should be loaded from `numpy` as `np.log`, since `pandas` doesn't have a logarithm built in.

Remember new columns can be created like this:
```python
df['new_col_name'] = val
```

In [33]:
print(df)
df['log_def'] = np.log(df['Defense'])
print(df)

       #        Name   Type 1  Type 2  Total   HP  Attack  Defense  Sp. Atk  \
0      1   Bulbasaur    Grass  Poison    318   45      49       49       65   
1      2     Ivysaur    Grass  Poison    405   60      62       63       80   
2      3    Venusaur    Grass  Poison    525   80      82       83      100   
3      4  Charmander     Fire     NaN    309   39      52       43       60   
4      5  Charmeleon     Fire     NaN    405   58      64       58       80   
..   ...         ...      ...     ...    ...  ...     ...      ...      ...   
146  147     Dratini   Dragon     NaN    300   41      64       45       50   
147  148   Dragonair   Dragon     NaN    420   61      84       65       70   
148  149   Dragonite   Dragon  Flying    600   91     134       95      100   
149  150      Mewtwo  Psychic     NaN    680  106     110       90      154   
150  151         Mew  Psychic     NaN    600  100     100      100      100   

     Sp. Def  Speed  Stage  Legendary  
0         6

In [None]:
#ANS
import numpy as np
df['log_def'] = np.log(df['Defense'])

The next task is to standardize the Speed. In order to do this we make use of the following formula:
$$
\tilde{x} = \frac{x-m}{s}
$$
Here $\tilde{x}$ is the standardized value we want to find, $m$ is the mean of the `Speed` column, and $s$ is the standard deviation of the `Speed` column, which can be computed using `np.std`.
Add the new values as a column to the DataFrame and call it `Speed_std`.
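
The formula can be sketched on a few made-up numbers (the Series `speed` below is an assumption for illustration, not the real column). One detail worth knowing: `Series.std()` divides by $n-1$ by default (`ddof=1`), whereas `np.std` on an array divides by $n$ (`ddof=0`), so the two give slightly different values of $s$.

```python
import numpy as np
import pandas as pd

speed = pd.Series([45.0, 60.0, 80.0, 65.0, 100.0])  # made-up values

m = speed.mean()
s_sample = speed.std()                    # pandas default: ddof=1
s_population = np.std(speed.to_numpy())   # numpy default: ddof=0

# Standardize with the sample standard deviation
standardized = (speed - m) / s_sample

print(m)
print(s_sample > s_population)
print(standardized.tolist())
```

After standardizing with `s_sample`, the new values have mean 0 and a sample standard deviation of 1, up to floating point error.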

In [None]:
#ANS
df['Speed_std'] = (df['Speed'] - df['Speed'].mean())/df['Speed'].std()

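# Or using numpy. Note that np.std defaults to ddof=0 (population standard
# deviation) while Series.std() defaults to ddof=1 (sample standard
# deviation), so the two versions give slightly different values.
df['Speed_std'] = (df['Speed'] - np.mean(df['Speed']))/np.std(df['Speed'])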

## Group By
Let us try to do some aggregations on multiple groups. Firstly for each value of `Type 1` find the mean attack value, and sort the result in descending order. This can be done using the method `sort_values()` with the argument `ascending = False`.

Remember to follow the steps:
- Subset
- Group By
- Aggregate
- Sort
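
The four steps above can be sketched as one chained expression. The frame `toy` and its values are made up for illustration, and the column names simply mirror the ones used in the exercises.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Type 1': ['Fire', 'Fire', 'Water', 'Water', 'Grass'],
    'Attack': [50, 70, 40, 60, 90],
    'Name': ['a', 'b', 'c', 'd', 'e'],
})

result = (
    toy[['Type 1', 'Attack']]                    # 1. subset
    .groupby('Type 1')                           # 2. group by
    .mean()                                      # 3. aggregate
    .sort_values(by='Attack', ascending=False)   # 4. sort
)
print(result)
```

Subsetting first keeps non-numeric columns such as `Name` out of the aggregation.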

In [47]:
df[['Type 1', 'Attack']].groupby('Type 1').mean().sort_values(by = "Attack", ascending = False)

Type 1,Attack
Fighting,102.857143
Dragon,94.0
Fire,83.916667
Rock,82.222222
Ground,81.875
Poison,74.428571
Grass,70.666667
Water,70.25
Normal,67.727273
Ice,67.5


In [34]:
#ANS
df[['Type 1', 'Attack']].groupby("Type 1").mean().sort_values(by = "Attack", ascending = False)

Type 1,Attack
Fighting,102.857143
Dragon,94.0
Fire,83.916667
Rock,82.222222
Ground,81.875
Poison,74.428571
Grass,70.666667
Water,70.25
Normal,67.727273
Ice,67.5


Count the number of pokemon for each `Type 1` value, and sort by the counts. This can be done by sorting by one of the other columns in the dataframe, e.g. `Name`.

In [55]:
df.groupby("Type 1").count().sort_values(by='Name')

Type 1,#,Name,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,log_def
Fairy,2,2,0,2,2,2,2,2,2,2,2,2,2
Ice,2,2,2,2,2,2,2,2,2,2,2,2,2
Dragon,3,3,1,3,3,3,3,3,3,3,3,3,3
Ghost,3,3,3,3,3,3,3,3,3,3,3,3,3
Fighting,7,7,0,7,7,7,7,7,7,7,7,7,7
Ground,8,8,2,8,8,8,8,8,8,8,8,8,8
Psychic,8,8,1,8,8,8,8,8,8,8,8,8,8
Electric,9,9,3,9,9,9,9,9,9,9,9,9,9
Rock,9,9,9,9,9,9,9,9,9,9,9,9,9
Bug,12,12,9,12,12,12,12,12,12,12,12,12,12


In [42]:
#ANS
df[["Type 1", 'Name']].groupby("Type 1").count().sort_values(by='Name')

Type 1,Name
Fairy,2
Ice,2
Dragon,3
Ghost,3
Fighting,7
Ground,8
Psychic,8
Electric,9
Rock,9
Bug,12


For each `Type 1` count the number of unique `Type 2` values. Use the method `nunique()`.

In [43]:
#ANS
df[['Type 1', 'Type 2']].groupby("Type 1").nunique().sort_values(by = "Type 2", ascending = False)

Type 1,Type 2
Water,5
Bug,3
Rock,3
Electric,2
Grass,2
Ice,2
Normal,2
Poison,2
Dragon,1
Fire,1


If you did the above exercise correctly, you will see 5 subtypes for water type pokemon. Verify what these subtypes are. You can use the `unique()` method on a single column.

In [57]:
df.loc[df['Type 1'] == 'Water', 'Type 2']

6           NaN
7           NaN
8           NaN
53          NaN
54          NaN
59          NaN
60          NaN
61     Fighting
71       Poison
72       Poison
78      Psychic
79      Psychic
85          NaN
86          Ice
89          NaN
90          Ice
97          NaN
98          NaN
115         NaN
116         NaN
117         NaN
118         NaN
119         NaN
120     Psychic
128         NaN
129      Flying
130         Ice
133         NaN
Name: Type 2, dtype: object

In [44]:
#ANS
df.loc[df['Type 1'] == 'Water', 'Type 2'].unique()

array([nan, 'Fighting', 'Poison', 'Psychic', 'Ice', 'Flying'],
      dtype=object)

It is possible to perform multiple aggregations at once. This can be done using the `agg` method on the DataFrame object. Find the sum, min, max and median for `Attack` and `Defense` for each group of `Type 1` Pokemon. 
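
As a rough sketch of how `agg` behaves with several functions, consider the made-up frame below (`toy` and its values are assumptions for illustration). Passing a list applies every function to every selected column, while a dict lets each column have its own set of aggregations.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Type 1': ['Fire', 'Fire', 'Water', 'Water', 'Water'],
    'Attack': [50, 70, 40, 60, 80],
    'Defense': [30, 50, 20, 40, 90],
})

grouped = toy.groupby('Type 1')[['Attack', 'Defense']]

# A list of function names applies every function to every selected column.
# The result has two-level column labels such as ('Attack', 'min').
summary = grouped.agg(['min', 'max'])

# A dict selects different aggregations per column
per_column = toy.groupby('Type 1').agg({'Attack': ['sum'], 'Defense': ['median']})

print(summary)
print(per_column)
```

A single cell of the result can be read with a tuple as the column label, e.g. `summary.loc['Fire', ('Attack', 'min')]`.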

In [67]:
print(df[['Type 1', 'Attack', 'Defense']].groupby('Type 1').min())
print(df[['Type 1', 'Attack', 'Defense']].groupby('Type 1').max())
print(df[['Type 1', 'Attack', 'Defense']].groupby('Type 1').median())
print(df[['Type 1', 'Attack', 'Defense']].groupby('Type 1').agg(['min','max','median']))

          Attack  Defense
Type 1                   
Bug           20       30
Dragon        64       45
Electric      30       40
Fairy         45       48
Fighting      80       35
Fire          41       40
Ghost         35       30
Grass         40       35
Ground        50       25
Ice           50       35
Normal         5        5
Poison        45       35
Psychic       20       15
Rock          40       65
Water         10       35
          Attack  Defense
Type 1                   
Bug          125      100
Dragon       134       95
Electric      90       95
Fairy         70       73
Fighting     130       80
Fire         130       90
Ghost         65       60
Grass        105      115
Ground       130      120
Ice           85      100
Normal       110       95
Poison       105      120
Psychic      110      100
Rock         120      160
Water        130      180
          Attack  Defense
Type 1                   
Bug         60.0     52.5
Dragon      84.0     65.0
Electric    

In [59]:
#ANS
df[['Type 1', 'Attack', 'Defense']].groupby('Type 1').agg(['sum', 'min', 'max', 'median'])

Type 1,Attack sum,Attack min,Attack max,Attack median,Defense sum,Defense min,Defense max,Defense median
Bug,765,20,125,60.0,685,30,100,52.5
Dragon,282,64,134,84.0,205,45,95,65.0
Electric,558,30,90,60.0,582,40,95,60.0
Fairy,115,45,70,57.5,121,48,73,60.5
Fighting,720,80,130,105.0,427,35,80,60.0
Fire,1007,41,130,84.5,751,40,90,59.0
Ghost,150,35,65,50.0,135,30,60,45.0
Grass,848,40,105,70.0,835,35,115,67.5
Ground,655,50,130,80.0,690,25,120,95.0
Ice,135,50,85,67.5,135,35,100,67.5


### Custom aggregations
Using the `agg` method it is possible to define custom aggregation functions. This can be done by supplying a function that takes a `pandas` series object, and returns a number. As long as the function takes a `pandas` series as input and outputs a number it is a valid aggregator. There are many ways to do this. A simple way is to use a for loop:
```python
def my_sum(series):
    sum_ = 0
    for x in series:
        sum_ = sum_ + x
    return sum_

df.groupby('Type 1').agg({'Attack': [my_sum]})
```

Another example using a reduce operation:
```python
from functools import reduce
def my_sum(series):
    return reduce(lambda x, y: x + y, series, 0)

df.groupby('Type 1').agg({'Attack': [my_sum]})
```

If you are unfamiliar with the reduce function, you can read more about it here: https://www.python-course.eu/lambda.php.
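
For readers new to `reduce`, the sketch below unrolls what it does when given an initial value of 0, and then uses the same function as an aggregator. The data is made up for illustration.

```python
from functools import reduce

import pandas as pd

values = pd.Series([3, 4, 5])  # made-up values

# reduce starts from the initial value 0 and folds in one element at a time:
# ((0 + 3) + 4) + 5
total = reduce(lambda acc, x: acc + x, values, 0)

# The same fold written out as a loop
acc = 0
for x in values:
    acc = acc + x

# Used as a custom aggregator: the function receives one Series per group
def my_sum(series):
    return reduce(lambda a, b: a + b, series, 0)

toy = pd.DataFrame({'Type 1': ['Fire', 'Fire', 'Water'], 'Attack': [3, 4, 5]})
out = toy.groupby('Type 1').agg({'Attack': [my_sum]})

print(total, acc)
print(out)
```

The initial value matters once the lambda contains a condition: without it, the first element is used as the starting accumulator and never passes through the lambda.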

Try to create a custom function that computes the sum of the Defense values that are divisible by 2.

In [78]:
def sumdef2(input):
    i=0
    for cur in input:
        if cur % 2 == 0:
            i=i+cur
    return i

df_defsum2=df.groupby('Type 1').agg({'Attack': [sumdef2]})
print(df_defsum2)

          Attack
         sumdef2
Type 1          
Bug          320
Dragon       282
Electric     320
Fairy         70
Fighting     510
Fire         786
Ghost         50
Grass        404
Ground       440
Ice           50
Normal       914
Poison       638
Psychic      328
Rock         380
Water        684


In [68]:
#ANS

# Solution 1
from functools import reduce
def div2_sum(series):
    # Initialize to 0, in order to avoid summing first element if it is odd
    return reduce(lambda x, y: x + y if y % 2 == 0 else x, series, 0) 
print(df.groupby('Type 1').agg({'Attack': [div2_sum]}))

# Solution 2
def div2_sum2(series):
    running_sum = 0
    for x in series:
        if x % 2 == 0:
            running_sum = running_sum + x
    return running_sum

print(df.groupby('Type 1').agg({'Attack': [div2_sum2]}))

           Attack
         div2_sum
Type 1           
Bug           320
Dragon        282
Electric      320
Fairy          70
Fighting      510
Fire          786
Ghost          50
Grass         404
Ground        440
Ice            50
Normal        914
Poison        638
Psychic       328
Rock          380
Water         684
            Attack
         div2_sum2
Type 1            
Bug            320
Dragon         282
Electric       320
Fairy           70
Fighting       510
Fire           786
Ghost           50
Grass          404
Ground         440
Ice             50
Normal         914
Poison         638
Psychic        328
Rock           380
Water          684


### Custom transformations 
Custom transformations can be done using the `apply` method. The `apply` method takes a function that works on a value from each row and returns a value for each row. The following simple example raises `Attack` to the second power (This could be done more easily using vectorization):
```python
df['Attack'].apply(lambda x: x**2)

```
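
To see that `apply` and the vectorized form agree, here is a minimal sketch on a made-up Series (`attack` and its values are assumptions for illustration).

```python
import pandas as pd

attack = pd.Series([49, 62, 82])  # made-up values

# apply calls the function once per element
squared_apply = attack.apply(lambda x: x ** 2)

# The vectorized form operates on the whole Series at once
squared_vec = attack ** 2

print(squared_apply.tolist())
print(squared_vec.tolist())
```

For simple arithmetic the vectorized form is usually preferable; `apply` is mostly useful when the per-element logic cannot be expressed with built-in operations.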

Try to create a custom function, that takes a string and turns it into upper case. Apply this function on the `Name` of the pokemon DataFrame. Hint: Try to use the `upper()` method.

In [None]:
#ANS

# Solution 1
df['Name'].apply(lambda x: x.upper())

# Solution 2 (Built-in version)
df['Name'].str.upper()

Create a function that cuts off the last two letters of each word, and apply it to the `Name` column. Hint: you can use an index to get parts of a string.

In [None]:
#ANS
df['Name'].apply(lambda x: x[:-2])

## Bonus exercises

Sometimes it is useful to generate a set of dates, e.g. when querying an API for values on certain dates, or when constructing a new dataset.

In this exercise you should make use of the `pd.date_range` function to create a `pandas.DatetimeIndex` (which is what `pd.date_range` returns) with dates from `2018-01-01` to `2018-07-01`. Afterwards, check out this <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases">page</a> and figure out how to get all end-of-month dates in the above specified period, e.g. `2018-02-28`, `2018-03-31`, `2018-04-30`, `2018-05-31`, `2018-06-30`.
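
As a warm-up, the sketch below shows how the `freq` argument changes what `pd.date_range` returns. It deliberately uses different dates and the month-start alias `"MS"`, so the exercise itself is left open.

```python
import pandas as pd

# One entry per calendar day (the default frequency)
days = pd.date_range("2019-01-01", "2019-01-05")

# One entry per month start, using the "MS" offset alias
month_starts = pd.date_range("2019-01-01", "2019-04-01", freq="MS")

# date_range returns a DatetimeIndex; wrap it to get a Series
as_series = pd.Series(days)

print(days)
print(month_starts.strftime("%Y-%m-%d").tolist())
```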

In [None]:
#ANS
import pandas as pd
date_vec = pd.date_range("2018-01-01", "2018-07-01")
print(date_vec)

# freq="M" means month end; pandas 2.2 and later prefer the alias "ME"
month_vec = pd.date_range("2018-01-01", "2018-07-01", freq="M")
print(month_vec)

# Part 2

## One Hot Encoding
We have seen an example of how to perform one hot encoding "by hand". Now we will use the built-in `pandas` function called `pd.get_dummies`. Create a subset containing `Name` and `Type 1` and perform OHE on the `Type 1` column.
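
A minimal sketch of what `pd.get_dummies` produces, on a made-up frame (`toy` and its values are assumptions for illustration). The `dtype=int` argument is used here so that the indicator columns come out as 0/1 integers regardless of the `pandas` version.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Type 1': ['Grass', 'Fire', 'Grass'],
})

# columns= selects which columns to encode; dtype=int asks for 0/1 integers
encoded = pd.get_dummies(toy, columns=['Type 1'], dtype=int)

print(encoded)
```

Each distinct value of `Type 1` becomes its own column, named with the original column name as prefix, and the original `Type 1` column is removed.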

In [79]:
import pandas as pd
df = pd.read_csv('Pokemon.csv', encoding = "latin1")

In [136]:
#ANS
# Load data
import pandas as pd
df = pd.read_csv('Pokemon.csv', encoding = "latin1")

# Perform OHE
pd.get_dummies(df[['Name', 'Type 1']].head(10), columns=['Type 1'])


ohe_column='Type 1'
ohe_performed=pd.get_dummies(df[['Name', ohe_column]].head(10), columns=[ohe_column], prefix='ONEHOT_'+ohe_column)
print(ohe_performed)
ohe_performed['Type 1'] = df['Type 1']
print(ohe_performed)



         Name  ONEHOT_Type 1_Bug  ONEHOT_Type 1_Fire  ONEHOT_Type 1_Grass  \
0   Bulbasaur                  0                   0                    1   
1     Ivysaur                  0                   0                    1   
2    Venusaur                  0                   0                    1   
3  Charmander                  0                   1                    0   
4  Charmeleon                  0                   1                    0   
5   Charizard                  0                   1                    0   
6    Squirtle                  0                   0                    0   
7   Wartortle                  0                   0                    0   
8   Blastoise                  0                   0                    0   
9    Caterpie                  1                   0                    0   

   ONEHOT_Type 1_Water  
0                    0  
1                    0  
2                    0  
3                    0  
4                    0  
5 

## Imputation
As we saw in the slides, imputation is a set of techniques to fill in missing values. We will try to fill in missing values in the pokemon dataset. Read in the dataset `pokemon_missing.csv` and perform mean imputation on the `Speed` column.
The `SimpleImputer` from the `sklearn.impute` module takes a `DataFrame` or 2D `numpy` array. As highlighted in the slides, imputation of string columns can be troublesome. If we pass in the whole DataFrame, the imputer will also try to impute the string column `Type 2`, which has missing values, and it will fail. Therefore we only want to pass the `Speed` column, but as a DataFrame. Example:

```python
>>> type(df['A'])
pandas.core.series.Series

>>> type(df[['A']])
pandas.core.frame.DataFrame
```
Thus we should use the latter approach, since it returns a DataFrame and not a Series.

The imputer has a method called `fit_transform` that should be used.
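
To make the Series versus DataFrame distinction concrete, and to show what mean imputation computes, here is a small sketch on made-up data. It uses plain `pandas` (`fillna`) as a stand-in and is not the `SimpleImputer` itself; the frame `toy` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Speed': [40.0, np.nan, 80.0, np.nan]})  # made-up values

# Single brackets give a 1D Series, double brackets a 2D DataFrame
print(toy['Speed'].shape)
print(toy[['Speed']].shape)

# Remember where the gaps were before filling them
missing = toy['Speed'].isnull()

# Mean imputation by hand: mean() skips NaN by default
filled = toy['Speed'].fillna(toy['Speed'].mean())

print(filled.tolist())
print(filled[missing].tolist())
```

Keeping the `missing` mask from the original column makes it possible to look up exactly which values were filled in afterwards.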

In [84]:
# Import imputer
from sklearn.impute import SimpleImputer

# Initialize imputer
imputer = SimpleImputer()

# Load data
poke_miss = pd.read_csv('pokemon_missing.csv', index_col = 0)

In [102]:
#ANS

# Import imputer
from sklearn.impute import SimpleImputer

# Initialize imputer
imputer = SimpleImputer()


# Load data
poke_miss = pd.read_csv('pokemon_missing.csv', index_col = 0)
print(poke_miss.head())
# Perform imputation and store result
poke_miss['Speed'] = imputer.fit_transform(poke_miss[['Speed']])


print(poke_miss.head())



   #        Name Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  \
0  1   Bulbasaur  Grass  Poison    318  45      49       49       65       65   
1  2     Ivysaur  Grass  Poison    405  60      62       63       80       80   
2  3    Venusaur  Grass  Poison    525  80      82       83      100      100   
3  4  Charmander   Fire     NaN    309  39      52       43       60       50   
4  5  Charmeleon   Fire     NaN    405  58      64       58       80       65   

   Speed  Stage  Legendary  
0   45.0      1      False  
1   60.0      2      False  
2   80.0      3      False  
3    NaN      1      False  
4    NaN      2      False  
   #        Name Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  \
0  1   Bulbasaur  Grass  Poison    318  45      49       49       65       65   
1  2     Ivysaur  Grass  Poison    405  60      62       63       80       80   
2  3    Venusaur  Grass  Poison    525  80      82       83      100      100   
3  4  Charmande

Try to extract the imputed value from the dataframe, without manually looking it up.

HINT: Find the location of the missing values in the original column, and check that location in the new imputed column. Use the `isnull()` method on the original column.

In [124]:
#ANS

import pandas as pd

# Load data
poke_miss_orig = pd.read_csv('pokemon_missing.csv', index_col = 0)

# Locate entries in the loaded data where
# the data is missing
miss = poke_miss_orig['Speed'].isnull()
print(miss)

# Use the constructed index in the imputed
# column
imp_vals = poke_miss.loc[miss, 'Speed']
imp_rows = poke_miss.loc[miss]


print(imp_vals)

print(imp_rows)

# Print the imputed value
print("Imputed value is {:0.3f}".format(imp_vals.iloc[0]))

0      False
1      False
2      False
3       True
4       True
       ...  
146    False
147    False
148    False
149    False
150    False
Name: Speed, Length: 151, dtype: bool
3      68.415385
4      68.415385
24     68.415385
26     68.415385
36     68.415385
43     68.415385
50     68.415385
55     68.415385
60     68.415385
63     68.415385
67     68.415385
73     68.415385
91     68.415385
93     68.415385
102    68.415385
104    68.415385
105    68.415385
116    68.415385
119    68.415385
128    68.415385
131    68.415385
Name: Speed, dtype: float64
       #        Name    Type 1   Type 2  Total  HP  Attack  Defense  Sp. Atk  \
3      4  Charmander      Fire      NaN    309  39      52       43       60   
4      5  Charmeleon      Fire      NaN    405  58      64       58       80   
24    25     Pikachu  Electric      NaN    320  35      55       40       50   
26    27   Sandshrew    Ground      NaN    300  50      75       85       20   
36    37      Vulpix      Fire    

# Bonus exercises

## Custom One Hot Encoder

To really understand what is going on in the one hot encoder, we will try to create our own function that can take any string column and perform one hot encoding. Use it on the `Type 1` column from the pokemon dataset.

Hint: You can get the unique values in a column using the `unique()` method:
```python
df['Type 1'].unique()
```

In [98]:
#ANS
def onehot(df, col_name):
    # Remove null rows
    idx = ~df[col_name].isnull()

    # Get the unique values
    un_vals = df.loc[idx, col_name].unique()
    
    # Loop through all unique values
    for val in un_vals:
        # Create a name for the new column
        new_col_name = "{0}_ONEHOT_{1}".format(col_name, val)
        
        # Initialize new column to 0
        df[new_col_name] = 0
        
        # Insert 1 into the column
        df.loc[df[col_name] == val, new_col_name] = 1
    
    return df
    
df_pre = df.loc[:, ['Name', 'Type 1']]
df_ohe = onehot(df_pre, 'Type 1')
df_ohe.head(10)

,Name,Type 1,Type 1_ONEHOT_Grass,Type 1_ONEHOT_Fire,Type 1_ONEHOT_Water,Type 1_ONEHOT_Bug,Type 1_ONEHOT_Normal,Type 1_ONEHOT_Poison,Type 1_ONEHOT_Electric,Type 1_ONEHOT_Ground,Type 1_ONEHOT_Fairy,Type 1_ONEHOT_Fighting,Type 1_ONEHOT_Psychic,Type 1_ONEHOT_Rock,Type 1_ONEHOT_Ghost,Type 1_ONEHOT_Ice,Type 1_ONEHOT_Dragon
0,Bulbasaur,Grass,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Ivysaur,Grass,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Venusaur,Grass,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Charmander,Fire,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Charmeleon,Fire,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Charizard,Fire,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6,Squirtle,Water,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
7,Wartortle,Water,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
8,Blastoise,Water,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
9,Caterpie,Bug,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


## Custom imputer
This exercise is about creating a custom imputer function for numeric columns. The function should take as argument the dataframe, and a list of columns to perform mean imputation on. If you have time, implement median imputation as well.

In [96]:
#ANS
def custom_imputer(df, impute_cols, impute_type = 'mean'):
    for col_ in impute_cols:
        # Get index of missing values
        idx = df[col_].isnull()
        
        # Get all non-missing values
        df_non_miss = df.loc[~idx, col_]
        
        # Calculate impute value
        if impute_type == 'mean':
            impute_val = df_non_miss.mean()
        elif impute_type == 'median':
            impute_val = df_non_miss.median()
        else:
            raise ValueError("Wrong impute_type argument!")
        
        # Fill in value
        df.loc[idx, col_] = impute_val
        
    return df

poke_miss_orig = pd.read_csv('pokemon_missing.csv', index_col = 0)
custom_imputer(poke_miss_orig, ['Speed'])

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45.000000,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60.000000,2,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80.000000,3,False
3,4,Charmander,Fire,,309,39,52,43,60,50,68.415385,1,False
4,5,Charmeleon,Fire,,405,58,64,58,80,65,68.415385,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,147,Dratini,Dragon,,300,41,64,45,50,50,50.000000,1,False
147,148,Dragonair,Dragon,,420,61,84,65,70,70,70.000000,2,False
148,149,Dragonite,Dragon,Flying,600,91,134,95,100,100,80.000000,3,False
149,150,Mewtwo,Psychic,,680,106,110,90,154,90,130.000000,1,True


## Manually fixing dimensions
In the exercise involving imputation and the `SimpleImputer` from `sklearn` we had to input a DataFrame. This is because the `SimpleImputer` expects two-dimensional input with one column per variable, which a DataFrame provides. In the exercise we extracted a single column as a DataFrame. Another way of doing this is to manually extract the `Speed` column, turn it into a `numpy` array, and use the reshape command (https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html).
This can be done in the following way:
```python
speed_2d_array = df['Speed'].values.reshape(-1, 1)
```
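
What `reshape(-1, 1)` does to the shape can be checked on a tiny made-up array (`speed` below is an assumption for illustration):

```python
import numpy as np

speed = np.array([45.0, 60.0, 80.0])  # made-up values

# -1 lets numpy infer the number of rows; 1 fixes a single column
speed_2d = speed.reshape(-1, 1)

# ravel() flattens the column back into a 1D vector
speed_1d = speed_2d.ravel()

print(speed.shape, speed_2d.shape, speed_1d.shape)
```

The values are unchanged; only the shape goes from a vector of length 3 to a matrix with 3 rows and 1 column, and back.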

Try using this approach and save the result back to the DataFrame


In [97]:
#ANS
import pandas as pd

# Load data
poke_miss = pd.read_csv('pokemon_missing.csv', index_col = 0)

# Import imputer
from sklearn.impute import SimpleImputer

# Initialize imputer
imputer = SimpleImputer()

# The imputer expects 2D input, but we only
# want to impute a single column. Thus we
# reshape the column into a 2D array

# Get the values in the columns into numpy
np_vals = poke_miss.loc[:, 'Speed'].values

# Reshape into 2d array (matrix) instead
# of 1d array (vector)
np_mat = np_vals.reshape(-1,1)

# Apply imputer. Tomorrow we will learn what fit_transform means.
poke_miss.loc[:, 'Speed'] = imputer.fit_transform(np_mat)
poke_miss

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45.000000,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60.000000,2,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80.000000,3,False
3,4,Charmander,Fire,,309,39,52,43,60,50,68.415385,1,False
4,5,Charmeleon,Fire,,405,58,64,58,80,65,68.415385,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,147,Dratini,Dragon,,300,41,64,45,50,50,50.000000,1,False
147,148,Dragonair,Dragon,,420,61,84,65,70,70,70.000000,2,False
148,149,Dragonite,Dragon,Flying,600,91,134,95,100,100,80.000000,3,False
149,150,Mewtwo,Psychic,,680,106,110,90,154,90,130.000000,1,True

