# 75 pandas Exercises: Exercises 11 to 20

Exercises 11 to 20 from [here](https://www.machinelearningplus.com/python/101-pandas-exercises-python/). Each exercise includes the question, the input, and the solution code. Where helpful, alternative solutions and comments explaining the solution or the relevant pandas functionality are added.

Requirements: 
+ pandas
+ numpy

Happy Pandasing! 🐼

## Imports

In [3]:
import pandas as pd
import numpy as np # required for some questions

---

## Exercises 

### 🐼 Exercise 11

**How to bin a numeric series to 10 groups of equal size?** Bin the series `ser` into 10 equal deciles and replace the values with the bin name.

Input 

In [4]:
ser = pd.Series(np.random.random(20)) # ser is a series of 20 random numbers

Desired output (first 5 items)


_0    7th_

_1    9th_

_2    7th_

_3    3rd_

_4    8th_


_dtype: category_
_Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]_

So, pandas supports discretization into equal-sized buckets (quantiles) through `pd.qcut`, and its `labels` argument lets us name the resulting bins directly. This seems like it!

In [6]:
bins = pd.qcut(ser, q=10, labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
bins.head(15)

0      4th
1      1st
2      4th
3      7th
4     10th
5      7th
6      8th
7      9th
8      8th
9      5th
10    10th
11     1st
12     6th
13     6th
14     3rd
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

So, in `bins`, for each index (matching `ser`'s index, of course), we have the quantile that sample belongs to. We can actually try to see what defines a bin.

In [13]:
bins_ret, bins_limits = pd.qcut(ser, q=10, labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'], retbins=True)
print(bins_limits)

[0.04155485 0.11576393 0.19091413 0.25799771 0.32604731 0.50903033
 0.7155983  0.84597955 0.87102839 0.93717372 0.95575487]


In `bins_limits`, we thus have the 11 edges that delimit the 10 bins. The first bin, for instance, contains all samples between `0.04155485` and `0.11576393`.
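As a quick sanity check, here is a minimal sketch of the same `pd.qcut` call on a seeded series. The seed and the names `rng`, `counts` and `limits` are assumptions made for reproducibility and are not part of the original exercise. With 20 distinct values and 10 quantiles, each bin should end up with two samples.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption, not in the exercise)
rng = np.random.default_rng(0)
ser = pd.Series(rng.random(20))

labels = ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']
bins, limits = pd.qcut(ser, q=10, labels=labels, retbins=True)

# Count how many samples landed in each decile
counts = bins.value_counts(sort=False)
print(counts)

# 10 bins are delimited by 11 edges
print(len(limits))
```

That equal count per bin is what distinguishes `pd.qcut` (equal-sized groups) from `pd.cut` (equal-width intervals).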

### 🐼 Exercise 12

**Convert a numpy array to a dataframe of given shape.** Reshape the series `ser` into a dataframe with 7 rows and 5 columns.

Input

In [15]:
ser = pd.Series(np.random.randint(1, 10, 35))

Reshaping in pandas is tricky business (because every row is tied to an index). The easiest way is a bit of back and forth between pandas and numpy: the `np.array` representation of a `pd.Series` or `pd.DataFrame` is accessible through the `.values` attribute.

In [19]:
reshaped_df = pd.DataFrame(ser.values.reshape(7, 5))
print(reshaped_df.shape)

(7, 5)


_Et voilà!_
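To see how the values are laid out after reshaping, here is a small sketch that swaps the random input for the integers 0 to 34 (an assumption, purely for readability) and uses `to_numpy()`, which returns the same underlying array as `.values`.

```python
import numpy as np
import pandas as pd

# 0..34 instead of random values, so the resulting layout is easy to read
ser = pd.Series(np.arange(35))

# reshape fills the new array row by row (C order)
reshaped_df = pd.DataFrame(ser.to_numpy().reshape(7, 5))

print(reshaped_df.iloc[0].tolist())  # first row: the first five items
print(reshaped_df[0].tolist())       # first column: every fifth item
```

So the first five items of the series become the first row, the next five the second row, and so on.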

### 🐼 Exercise 13

**Find the positions of numbers that are multiples of 3 in a `pd.Series`.** 

Input 

In [4]:
ser = pd.Series(np.random.randint(1, 10, 7))

Let's do this using mostly `np` machinery.

In [5]:
idxs = np.argwhere(ser.values % 3 == 0) # == 0 keeps the multiples of 3
print(idxs)

[[1]
 [2]]


_Don't these two just go together like two love birds?_
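A possible variation, sketched here with fixed values chosen for illustration (not the exercise's random input): build the boolean mask in pandas first, then ask numpy for the positions of the `True` entries as a flat array rather than a column vector.

```python
import numpy as np
import pandas as pd

ser = pd.Series([3, 7, 9, 4, 6, 1, 8])  # fixed values, chosen for illustration

# Boolean Series, True where the value is a multiple of 3
mask = ser % 3 == 0

# Flat array of integer positions where the mask is True
positions = np.flatnonzero(mask.to_numpy())
print(positions.tolist())
```

`np.flatnonzero` returns a 1-D array, which is often easier to feed into `.iloc` than the 2-D result of `np.argwhere`.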

### 🐼 Exercise 14

**Extract items from a certain position** From `ser`, extract the items at positions in list `pos`.

Input

In [8]:
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

We can do this with integer-position-based indexing (`.iloc`):

In [14]:
extracted = ser.iloc[pos]
print(extracted)

0     a
4     e
8     i
14    o
20    u
dtype: object


Or using a specific method of `pd.Series` (_which also happens to be the suggested solution_ 😁): 

In [13]:
extracted = ser.take(pos, axis=0) # axis=0 so we select rows
print(extracted)

0     a
4     e
8     i
14    o
20    u
dtype: object
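In this exercise the index is the default 0 to 25, so positions and labels coincide. The sketch below uses a hypothetical non-default index (labels 10, 20, ...) to show that `.iloc` and `.take` go by position, while `.loc` goes by label.

```python
import pandas as pd

# Same kind of data, but with a non-default integer index (hypothetical labels)
ser = pd.Series(list('abcde'), index=[10, 20, 30, 40, 50])
pos = [0, 2, 4]

by_position = ser.iloc[pos]    # positions 0, 2, 4
by_take = ser.take(pos)        # also positional
by_label = ser.loc[[10, 30]]   # labels, not positions

print(by_position.tolist())
print(by_label.tolist())
```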


### 🐼 Exercise 15

**`pd.Series` stacking to form `pd.DataFrame`**. Stack `ser1` and `ser2` vertically and horizontally (to form a dataframe).

Input

In [15]:
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

Horizontal stacking (there's a `pd.concat`!)

In [24]:
df_hor = pd.concat([ser1, ser2], axis=1)
df_hor.head()

|   | 0 | 1 |
|---|---|---|
| 0 | 0 | a |
| 1 | 1 | b |
| 2 | 2 | c |
| 3 | 3 | d |
| 4 | 4 | e |


Vertical stacking

In [21]:
df_ver = pd.concat([ser1, ser2], axis=0)
df_ver.head(10) # so we see all elements

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object

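Alternatively, in pandas versions before 2.0, we can use `pd.Series.append()` to vertically stack the two `pd.Series`. The method was deprecated in pandas 1.4 and removed in 2.0, so on current versions stick to `pd.concat`.

In [26]:
df_ver_append = ser1.append(ser2) # pandas < 2.0 only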
df_ver_append.head(10)

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object

The resulting datatype of each stacking shows why append works for the vertical case. Appending to or vertically stacking a `pd.Series` adds no new dimension, so it returns a longer `pd.Series`. Horizontal concatenation, on the other hand, necessarily adds a new dimension, hence the appearance of the two-dimensional `pandas` data container, `pd.DataFrame`.

In [31]:
print(type(df_ver_append))
print(type(df_hor))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
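The vertically stacked outputs above repeat the index 0 to 4 twice. A minimal sketch of two `pd.concat` options that tidy this up: `ignore_index=True` builds a fresh index, and `keys` names the columns when stacking horizontally. The column names `numbers` and `letters` are made up for illustration.

```python
import pandas as pd

ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

# Vertical stacking with a fresh 0..9 index instead of the repeated 0..4
stacked = pd.concat([ser1, ser2], ignore_index=True)

# Horizontal stacking with explicit (hypothetical) column names via `keys`
side_by_side = pd.concat([ser1, ser2], axis=1, keys=['numbers', 'letters'])

print(stacked.index.tolist())
print(side_by_side.columns.tolist())
```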


### 🐼 Exercise 16

**Get the positions of items of a `pd.Series` A in another `pd.Series` B.** Get the positions of items of `ser2` in `ser1` as a list.

Input

In [2]:
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

A first and simple solution is to use the `numpy` machinery: 

In [11]:
locs = np.array([np.where(i == ser1) for i in ser2]).flatten() # converting to np.array and flattening gets the output into a list-like object 
print(locs)

[5 4 0 8]


There's another, more `pandas`-like solution, using boolean indexing with the `eq()` (equals) method.

In [15]:
locs_pd = [ser1[ser1.eq(i)].index.values[0] for i in ser2]
print(locs_pd)

[5, 4, 0, 8]


The suggested solution uses `pd.Index`, the core `pandas` index datatype, built here from the values of the `pd.Series`. This ordered, sliceable set has a highly performant `pd.Index.get_loc()` method for looking up the location of a given item.

In [16]:
[pd.Index(ser1).get_loc(i) for i in ser2] # gets the location of item i in ser1

[5, 4, 0, 8]
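The list comprehension calls `get_loc()` once per item. As a sketch of a vectorized alternative, `pd.Index.get_indexer()` looks up all targets in one call. Note the assumptions: it requires the values of `ser1` to be unique (true for this input), and it returns `-1` for items that are not found instead of raising.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# One lookup for all targets; needs unique values in ser1,
# and yields -1 for any target that is missing
locs = pd.Index(ser1).get_indexer(ser2).tolist()
print(locs)
```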

### 🐼 Exercise 17

**Compute mean squared error between two series.** Compute the mean squared error of truth and pred series.

**Note:** Mean Squared Error (MSE) is defined as $\frac{1}{N}\sum_{i=1}^{N}(x_{i}-y_{i})^2$, where $N$ is the number of datapoints (so, the simple numeric average of the squared differences).

Input

In [17]:
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

A simple `numpy`-like solution:

In [24]:
mse = np.mean((truth-pred)**2)
print(mse)

0.34617617749719287


Just for learning purposes, let's go for a more convoluted solution, using a `pd.DataFrame` and its built-in methods. 

In [28]:
df = pd.concat((truth, pred), axis=1)
df.columns = ['truth', 'preds']
df.head()

|   | truth | preds |
|---|---|---|
| 0 | 0 | 0.075026 |
| 1 | 1 | 1.820749 |
| 2 | 2 | 2.844877 |
| 3 | 3 | 3.948473 |
| 4 | 4 | 4.392573 |


We create a new column using a small, locally defined function (a so-called _lambda_ or anonymous function). The _lambda_ function takes each row and operates on it.

In [36]:
df['squared_diff'] = df.apply(lambda x: (x.values[1]-x.values[0])**2, axis=1) # by specifying axis=1, the input to the lambda is a row
df.head()

|   | truth | preds | squared_diff |
|---|---|---|---|
| 0 | 0 | 0.075026 | 0.005629 |
| 1 | 1 | 1.820749 | 0.673629 |
| 2 | 2 | 2.844877 | 0.713818 |
| 3 | 3 | 3.948473 | 0.899601 |
| 4 | 4 | 4.392573 | 0.154113 |


Now, we compute the MSE: 

In [41]:
mse_too_much_work = df['squared_diff'].sum()/len(df['squared_diff'])
print(mse_too_much_work)

0.3461761774971928


_Et voilà_, the same result (up to floating-point rounding in the last digit). There are many roads to the same destination. Not all equally efficient, though. Real world problem-solving is all about premature efficiency assessment. Why did I suddenly stop to make reflections? _Oh, Saturdays..._
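For comparison, the row-wise `apply` can be replaced by plain column arithmetic, which pandas vectorizes. This is only a sketch: the random noise is swapped for a fixed offset of 0.5 (an assumption) so the expected MSE, 0.25, is known in advance.

```python
import pandas as pd

truth = pd.Series(range(10))
pred = truth + 0.5  # fixed offset instead of random noise, so the MSE is known

df = pd.DataFrame({'truth': truth, 'preds': pred})

# Column arithmetic is vectorized: no row-wise apply needed
mse = ((df['preds'] - df['truth']) ** 2).mean()
print(mse)
```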

### 🐼 Exercise 18

**Convert the first character of series elements to uppercase**. Capitalize the first character of each word in `ser`.

Input

In [48]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

A simple solution with a list comprehension: 

In [49]:
ser_simple_upper = pd.Series([i[0].upper() + i[1:] for i in ser])
ser_simple_upper.head()

0     How
1      To
2    Kick
3    Ass?
dtype: object

Using `pd.Series.map()`, which, if given a function (it also accepts a dictionary of correspondences between the original value and the desired value), is somewhat similar to the `apply()` used in exercise 17 (_this is one of the suggested solutions_).

In [52]:
ser_not_so_simple_upper = ser.map(lambda x: x.title()) # .title(), on a string, uppercases the first letter of every word and lowercases the rest
ser_not_so_simple_upper.head() 

0     How
1      To
2    Kick
3    Ass?
dtype: object
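pandas also exposes vectorized string methods through the `.str` accessor, so no lambda is strictly needed. The sketch below uses a slightly different input (an assumption, to make the differences visible) and contrasts `.str.capitalize()`, `.str.title()`, and a slice-based version that only touches the first character.

```python
import pandas as pd

# Illustrative input with a multi-word item and a mixed-case item
ser = pd.Series(['how', 'to', 'two words', 'mcDonald'])

capitalized = ser.str.capitalize()  # first character upper, the rest lower
titled = ser.str.title()            # first character of every word upper, the rest lower

# Only uppercase the first character, leaving the rest untouched
first_only = ser.str[:1].str.upper() + ser.str[1:]

print(capitalized.tolist())
print(titled.tolist())
print(first_only.tolist())
```

For the single lowercase words of the exercise all three give the same result; they only diverge on inputs like the last two items.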

### 🐼 Exercise 19

**Calculate the number of characters in each word in a `pd.Series`.**

Input

In [55]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

Similarly to what we did in exercise 18, but now outputting the result to another variable: 

In [59]:
a = ser.map(lambda x: len(x))
print(a.tolist())

[3, 2, 4, 4]


See? Turns out it was useful to introduce _lambda_ functions a few exercises ago.
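As a small aside, the same result can be sketched with the `.str` accessor, which applies `len` to every element without an explicit lambda (the variable name `lengths` is just illustrative).

```python
import pandas as pd

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# Vectorized string length, one value per element
lengths = ser.str.len()
print(lengths.tolist())
```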

### 🐼 Exercise 20

**Difference of differences between the consecutive numbers of `ser`.**

We don't even need to resort to `numpy` machinery, because `pandas` provides a `diff()` method for us.

Input

In [65]:
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

In [67]:
first_diff = ser.diff()
print(first_diff.tolist())

[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]


_BUT WE MUST GO DEEPER!_

In [70]:
second_diff = first_diff.diff()
print(second_diff.tolist())

[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]
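The two steps can also be chained, and for comparison `np.diff` with `n=2` computes the same second difference. A minimal sketch: the main practical difference is that pandas keeps the original length and pads with `NaN`, while numpy drops the two leading positions.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Chained pandas version: keeps the original length, padding with NaN
second_diff = ser.diff().diff()

# numpy version: n=2 applies the difference twice and returns two fewer items
second_diff_np = np.diff(ser.to_numpy(), n=2)

print(second_diff.tolist())
print(second_diff_np.tolist())
```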


---

### See ya next notebook! 🐼

Can't stop? [Exercises 21 to 30](TODO).