# 75 pandas Exercises: Exercises 1 to 10

First set of 10 exercises from [here](https://www.machinelearningplus.com/python/101-pandas-exercises-python/). Each exercise includes the question, the input and the solution's code. Where useful, alternative solutions and comments that explain the solution or the pandas functionality behind it are added.

Requirements: 
+ pandas
+ numpy

Happy Pandasing! 🐼

## Imports

In [1]:
import pandas as pd
import numpy as np # required for some questions

--- 

## Exercises

### 🐼 Exercise 1

__How to import pandas and check the version?__

A pretty easy one:

In [7]:
print("Pandas version is {}".format(pd.__version__))

Pandas version is 0.23.4


### 🐼 Exercise 2

__How to create a series from a list, numpy array and dict?__

Input

In [9]:
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

Pandas and numpy interact very transparently and directly, so a `pd.Series` (the building block of a `pd.DataFrame`) can be easily created from any of these inputs:

In [12]:
list_series = pd.Series(mylist)
arr_series = pd.Series(myarr)
dict_series = pd.Series(mydict)

Visualizing. `pd.Series.head()` can be called without arguments, which shows the first 5 rows, or the number of rows to show (plus column names, if any) can be passed as an argument.

In [15]:
list_series.head(3)

0    a
1    b
2    c
dtype: object

In [16]:
arr_series.head(3)

0    0
1    1
2    2
dtype: int64

In [18]:
dict_series.head()

a    0
b    1
c    2
e    3
d    4
dtype: int64

### 🐼 Exercise 3

**How to convert the index of a series into a column of a dataframe?**
Convert the series `ser` into a dataframe with its index as another column on the dataframe. 

_By default, all dataframes/series have an `index` column, through which each row is assigned an index - a sort of ID used to call it (however, not a unique ID, as the index column can have duplicates)._
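The point about duplicated index labels can be seen in a minimal sketch. The series `dup_ser` and its values are made up for illustration and are not part of the exercise input.

```python
import pandas as pd

# Index labels are allowed to repeat; 'a' appears twice here (illustrative data).
dup_ser = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# Selecting a repeated label returns every matching row, not a single value.
matches = dup_ser.loc['a']

# is_unique tells whether the index could be used as a unique ID.
unique_flag = dup_ser.index.is_unique

print(matches)
print(unique_flag)
```

So an index behaves like a label rather than a primary key, and checking `is_unique` first is a cheap safeguard when uniqueness matters.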

Input

In [29]:
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
ser.index

Index(['a', 'b', 'c', 'e', 'd', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='object')

`ser` has the letters as its index. Moving that index into a regular column:

In [25]:
df = ser.to_frame().reset_index()
df.head()

  index  0
0     a  0
1     b  1
2     c  2
3     e  3
4     d  4


Pandas DataFrames must always have an index. By default, that index is a numeric row counter. `reset_index()` replaces the index of the `pd.Series` with that default counter and, to avoid data loss, converts the old index into a column of the `pd.DataFrame`. Shamelessly stolen from the suggested solution.
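The resulting columns get the default names `index` and `0`. A possible variation, sketched here on a shortened input, names both columns in the same chain. The labels `letter` and `value` are arbitrary choices for this example, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the exercise input: five letters as the index.
ser = pd.Series(np.arange(5), index=list('abcde'))

# rename_axis labels the index, so reset_index uses that label for the new column;
# name= labels the column that holds the original values.
df_named = ser.rename_axis('letter').reset_index(name='value')

print(df_named.head())
```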

### 🐼 Exercise 4

**Combine ser1 and ser2 to form a dataframe.**

Input

In [32]:
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

Joining by manually accessing the raw data of each series

In [35]:
df = pd.DataFrame(data=[ser1.values, ser2.values]).T # otherwise, we would get one big horizontal array
df.head(5)

   0  1
0  a  0
1  b  1
2  c  2
3  e  3
4  d  4


There are more pandaonic (coined it!) ways to create a `DataFrame`, however. One is to pass a `dict`, whose keys are taken as column names.

In [40]:
df_pandaonic = pd.DataFrame({'ser1': ser1.values, 'ser2': ser2.values}) # The .values is optional, pd.Series could be directly passed
df_pandaonic.head(5)

  ser1  ser2
0    a     0
1    b     1
2    c     2
3    e     3
4    d     4


There's an even more elegant way to accomplish this, which does not involve directly instantiating a new `DataFrame`: concatenating the two `Series`.

In [47]:
df_concat = pd.concat(objs=[ser1, ser2], axis=1) # concatenate along the horizontal axis, side-by-side
df_concat.head()

   0  1
0  a  0
1  b  1
2  c  2
3  e  3
4  d  4
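The concatenated frame ends up with the default column names `0` and `1`. A small sketch of one way to name the columns during concatenation uses the `keys` argument of `pd.concat`. The shortened inputs and the labels `ser1`/`ser2` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Shortened stand-ins for the exercise inputs.
ser1 = pd.Series(list('abcde'))
ser2 = pd.Series(np.arange(5))

# With axis=1, each entry in keys becomes the column name of the matching Series.
df_keys = pd.concat([ser1, ser2], axis=1, keys=['ser1', 'ser2'])

print(df_keys.head())
```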


### 🐼 Exercise 5

**Give a name to the `Series` ser calling it ‘alphabets’.**

Input

In [49]:
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

There's probably a built-in `pd.Series` attribute or method to do this, I guess. 

In [57]:
ser.rename('alphabets', inplace=True) # inplace=True renames the original ser instead of returning a new pd.Series
ser.name # checking the result

'alphabets'

_Et voilà_, just as theorized. Alternatively, one could directly access and modify `ser.name`.
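The attribute route mentioned above could look like the sketch below, on a shortened input. The second variant, passing `name` to the constructor, is an extra option not used in the exercise.

```python
import pandas as pd

# Shortened stand-in for the exercise input.
ser = pd.Series(list('abc'))

# Option 1: assign the name attribute directly on an existing Series.
ser.name = 'alphabets'

# Option 2: set the name when the Series is created.
named_at_creation = pd.Series(list('abc'), name='alphabets')

print(ser.name, named_at_creation.name)
```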

### 🐼 Exercise 6

From `ser1` remove items present in `ser2`.

Input

In [3]:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

Let's solve this with simple boolean indexing and `pd.Series.isin()`, which returns a boolean `pd.Series` telling us, for each element of a `pd.Series`, whether it appears in another `pd.Series`.

In [11]:
out_ser = ser1[~ser1.isin(ser2)] # we want to keep the items that are not in ser2, hence the logical negation ~
print(out_ser.head(5))

0    1
1    2
2    3
dtype: int64


Shamelessly stolen from the suggested solution. I tried alternatives with `pd.DataFrame.merge` (which requires converting to `pd.DataFrame`, and no suitable merge strategy was available) and `pd.Series.combine()` (which compares two `pd.Series` element-wise, whereas here each value must be compared against a whole `pd.Series`). Sometimes, even when learning, KISS.
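To make the one-liner easier to follow, here is the same logic split into steps, with the intermediate mask kept in its own variable. The names `mask` and `kept` are introduced only for this sketch.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# One boolean per element of ser1: True where the value also occurs in ser2.
mask = ser1.isin(ser2)

# ~ flips the mask, so only the values absent from ser2 survive the indexing.
kept = ser1[~mask]

print(mask.tolist())
print(kept.tolist())
```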

### 🐼 Exercise 7

Get all items of `ser1` and `ser2` not common to both. Basically, the goal is to remove the intersection. 

Input

In [None]:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

Getting the intersection:

In [28]:
intersect = ser1[ser1.isin(ser2)]

Now, getting the union of the two `pd.Series()` and removing the intersection: 

In [31]:
union = pd.concat([ser1, ser2]) # Series.append() was removed in pandas 2.0; pd.concat also works on older versions
out_ser = union[~union.isin(intersect)] # removing the elements in the intersection series

The suggested solution follows a similar logic but uses `np.intersect1d` and `np.union1d` to get the intersect and union series. `pd.Series` can be directly passed to these numpy methods, but the outputs must be fed back to the `pd.Series()` constructor. While numpy and pandas are often a match made in heaven, at this initial stage of exercises I refrained from putting numpy in the mix. 

_Although sometimes a considerable part of problem-solving is choosing the right tool for the job_. \end{reflection}. 
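For comparison, the numpy-based route described above might look roughly like this. It is a sketch written from that description, not a copy of the suggested solution, and the variable names are chosen here.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Both functions accept Series but return plain numpy arrays (sorted, unique),
# so the results are wrapped back into pd.Series.
ser_union = pd.Series(np.union1d(ser1, ser2))
ser_intersect = pd.Series(np.intersect1d(ser1, ser2))

# Same final step as before: drop everything that sits in the intersection.
out_np = ser_union[~ser_union.isin(ser_intersect)]

print(out_np.tolist())
```

One difference worth noting: `np.union1d` deduplicates, so the shared values appear once in the union before being filtered out.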

### 🐼 Exercise 8

Compute the minimum, 25th percentile, median, 75th, and maximum of `ser`.

Input

In [35]:
ser = pd.Series(np.random.normal(10, 5, 25))

So, several solutions. First, the lazy way, because laziness moves the world forward. 

In [57]:
print(ser.describe()[[s for s in ser.describe().index if s not in ["count", "std", "mean"]]]) # some list-comprehension based filtering

min     0.672789
25%     8.852104
50%    11.361746
75%    16.660299
max    21.837060
dtype: float64


😁 Now, the antithesis. Manually calculating each quantity using `pd.Series` built-in methods:

In [41]:
min_val = ser.min()
max_val = ser.max()
median_val = ser.median()
twofive_percentile_val = ser.quantile(0.25)
sevenfive_percentile_val = ser.quantile(0.75)

# Packing everything up in a DataFrame, for pretty output
out_result = pd.DataFrame(data=[min_val, max_val, median_val, twofive_percentile_val, sevenfive_percentile_val], index=['Min', 'Max', 'Median', '25%', '75%'], columns=['Values'])
out_result.head()

           Values
Min      0.672789
Max     21.837060
Median  11.361746
25%      8.852104
75%     16.660299


Acknowledging the suggested solution just because of its beauty: the `min` is nothing but the 0th percentile (no value below), the median is the 50th percentile (half of the values below, half above) and the `max` is the 100th percentile (no value above).

In [59]:
ser.quantile([0, 0.25, 0.5, 0.75, 1])

0.00     0.672789
0.25     8.852104
0.50    11.361746
0.75    16.660299
1.00    21.837060
dtype: float64
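The equivalence between the quantiles and `min`/`median`/`max` can be checked with a short sketch. The seeded `RandomState(0)` generator is an assumption added so the check is repeatable. It is not the draw used above.

```python
import numpy as np
import pandas as pd

# Hypothetical seeded generator, only so this check gives the same sample on every run.
rng = np.random.RandomState(0)
ser = pd.Series(rng.normal(10, 5, 25))

# Quantiles at 0, 0.5 and 1, in that order.
q = ser.quantile([0, 0.5, 1])

# Compare with the dedicated methods (isclose guards against float rounding).
same_min = np.isclose(q.iloc[0], ser.min())
same_median = np.isclose(q.iloc[1], ser.median())
same_max = np.isclose(q.iloc[2], ser.max())

print(same_min, same_median, same_max)
```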

### 🐼 Exercise 9

Calculate the frequency counts of each unique value of `ser`.

Input

In [64]:
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30))) # Input randomly samples the first 8 letters of the alphabet

Getting a histogram of unique values in a `pd.Series` is something very often used, I dare to say. So I also dare to say there's some built-in method to do it. 

In [66]:
ser.value_counts(normalize=True, sort=True) # normalizing to return a 0-1 normalized histogram

d    0.233333
b    0.166667
c    0.166667
e    0.133333
h    0.133333
a    0.066667
g    0.066667
f    0.033333
dtype: float64

That was not a bold statement. 
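Since the exercise asks for counts and the call above returns proportions, here is a small sketch contrasting the two on a hand-written series. The data is illustrative, not the random input above.

```python
import pandas as pd

# Illustrative data: 'a' twice, 'b' three times, 'c' once.
ser = pd.Series(list('aabbbc'))

# Default: absolute counts, sorted from most to least frequent.
counts = ser.value_counts()

# normalize=True: the same counts divided by the number of values.
shares = ser.value_counts(normalize=True)

print(counts)
print(shares)
```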

### 🐼 Exercise 10

From `ser`, keep the top 2 most frequent items as they are and replace everything else with ‘Other’.

In [95]:
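np.random.RandomState(100) # note: this builds a generator object but does not seed np.random, so the draw below is not reproducible
ser = pd.Series(np.random.randint(1, 5, [12])) # 12 random integers from 1 to 4 (the upper bound, 5, is excluded)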

So, I assume I first must get those two most frequent values:

In [96]:
most_freq = ser.value_counts().index[:2] # We want the values (index of the series), not the counts of each value (the values in the series)

Now, using `pd.Series.isin()` to select everything not equal to either of those and setting it to _Other_.

In [97]:
out_ser = ser.copy(deep=True) # just to preserve the original data, as a good practice
out_ser[~out_ser.isin(most_freq.values)] = 'Other'
print(out_ser)

0         3
1         3
2         3
3         2
4     Other
5     Other
6         2
7         2
8         2
9     Other
10        3
11        3
dtype: object
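An alternative that avoids the copy-then-assign step is `pd.Series.where()`, sketched below. The input is a hand-written series, not the random draw above, and the cast to `object` is a precaution chosen here so that integers and the string 'Other' can share one series.

```python
import pandas as pd

# Hand-written sample (assumption): 3 occurs five times, 2 four times, 1 twice, 4 once.
ser = pd.Series([3, 3, 3, 2, 1, 4, 2, 2, 2, 1, 3, 3])

# The two most frequent values are the first two index labels of value_counts().
top2 = ser.value_counts().index[:2]

# where() keeps values where the condition holds and substitutes 'Other' elsewhere.
out_where = ser.astype(object).where(ser.isin(top2), 'Other')

print(out_where.tolist())
```

Because `where()` returns a new series, the original `ser` stays untouched without an explicit `copy()`.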


---

🐼 First 10 exercises are done! 🐼

Feeling the momentum, riding the wave? [Exercises 11 to 20](TODO).