# 101 Pandas Exercises for Data Analysis

*by Selva Prabhakaran*

From the website: https://www.machinelearningplus.com/python/101-pandas-exercises-python/

*101 python pandas exercises are designed to challenge your logical muscle and to help internalize data manipulation with python’s favorite package for data analysis.*
*The questions come in three levels of difficulty, from L1 (the easiest) to L3 (the hardest).*

**NOTE**: Again, it's 75, not 100, but the exercises are good.

In [1]:
pip install pandas


Note: you may need to restart the kernel to use updated packages.


1. (L1) **Import `pandas` and check the version**

In [2]:
import pandas as pd

pd.__version__

'1.5.3'

2. (L1) **Create a `pandas` series from each of the items below: a list, a NumPy array, and a dictionary**

In [3]:
import numpy as np

# Inputs

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

for data in [mylist, myarr, mydict]:
    print(pd.Series(data).head())

0    a
1    b
2    c
3    e
4    d
dtype: object
0    0
1    1
2    2
3    3
4    4
dtype: int64
a    0
b    1
c    2
e    3
d    4
dtype: int64


3. (L1) **Convert the series `ser` into a dataframe with its index as another column on the dataframe.**

In [4]:
# Inputs

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

ser.to_frame().reset_index()

Unnamed: 0,index,0
0,a,0
1,b,1
2,c,2
3,e,3
4,d,4
5,f,5
6,g,6
7,h,7
8,i,8
9,j,9


4. (L1) **Combine `ser1` and `ser2` to form a dataframe.**

In [5]:
# Inputs

ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

np.all(
    pd.DataFrame([ser1, ser2]).T  # Each series becomes a row, then transpose into columns
    == pd.concat([ser1, ser2], axis=1)  # Concatenate the series directly as columns
)

True

5. (L1) **Give a name to the series `ser` calling it `my_series`.**

In [6]:
# Input

ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

ser.name = "my_series"

ser.head()

0    a
1    b
2    c
3    e
4    d
Name: my_series, dtype: object

6. (L2) **From `ser1` remove items present in `ser2`.**

In [7]:
# Inputs

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# With numpy
with_numpy = np.setdiff1d(ser1, ser2)

# With pandas
with_pandas = ser1[~ser1.isin(ser2)]

print(with_numpy)  # Results in ndarray
print(with_pandas)  # Results in Series

[1 2 3]
0    1
1    2
2    3
dtype: int64


7. (L2) **Get the items not common to both series `ser1` and `ser2`.**

In [12]:
# Input

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

union = np.union1d(ser1, ser2)
intersection = np.intersect1d(ser1, ser2)

print(union, intersection)
union[~pd.Series(union).isin(intersection)]

[1 2 3 4 5 6 7 8] [4 5]


array([1, 2, 3, 6, 7, 8])
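
NumPy also offers this set operation directly. A minimal sketch, assuming the same inputs as above, using `np.setxor1d`, which returns the sorted unique values found in exactly one of the two inputs:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Symmetric difference: values present in exactly one of the two series
not_common = np.setxor1d(ser1, ser2)
print(not_common)
```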

8. (L2) **Compute the minimum, 25th percentile, median, 75th, and maximum of `ser`.**

In [18]:
# Input

ser = pd.Series(np.random.normal(10, 5, 25))

for method in [pd.Series.max, pd.Series.min, pd.Series.median]:
    print(method(ser))
print(ser.quantile(0.25))
print(ser.quantile(0.75))


16.154143778523007
3.754115805210497
8.708207892778665
6.291364503022354
11.356937702418694
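
The five values can also be requested in one call, in the order the exercise lists them. This is only a sketch: the seeded generator `rng` is an assumption added here for reproducibility and is not part of the original input.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, so that repeated runs give the same series
rng = np.random.default_rng(0)
ser = pd.Series(rng.normal(10, 5, 25))

# Quantiles 0, 0.25, 0.5, 0.75 and 1 correspond to the minimum,
# 25th percentile, median, 75th percentile and maximum
five_numbers = ser.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_numbers)
```

The result is a Series indexed by the requested quantiles, so `five_numbers[0.5]` is the median.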


9. (L2) **Calculate the frequency counts of each unique item in `ser`.**

In [24]:
# Input

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

# Through numpy
with_numpy = np.unique(ser.to_numpy(), return_counts=True)

# Through pandas
with_pandas = ser.value_counts()

print(with_numpy)
print(with_pandas)

(array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], dtype=object), array([5, 6, 3, 4, 4, 4, 3, 1]))
b    6
a    5
d    4
e    4
f    4
g    3
c    3
h    1
dtype: int64


10. (L2) **From `ser`, keep the 2 most frequent items as they are and replace everything else with `Other`.**

In [39]:
# Input

np.random.RandomState(100)  # Creates a separate, unused generator; it does not seed np.random
ser = pd.Series(np.random.randint(1, 5, [12]))


ser[~ser.isin(ser.value_counts()[:2].index.to_series())] = 'Other'
print(ser)

0         3
1         3
2     Other
3         1
4     Other
5         3
6         1
7         1
8     Other
9     Other
10        3
11        1
dtype: object
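
A non-mutating variant is sketched below. The fixed input is hypothetical (chosen so the result does not depend on random state), and the cast to `object` is an assumption made so that integers and the string `'Other'` can share one series.

```python
import pandas as pd

# Hypothetical fixed input: 3 and 1 occur four times each, 2 and 4 twice each
ser = pd.Series([3, 3, 2, 1, 4, 3, 1, 1, 2, 4, 3, 1])

# Labels of the two most frequent values (value_counts sorts by frequency)
top2 = ser.value_counts().index[:2]

# Keep values found in top2 and replace all others with 'Other'
result = ser.astype(object).where(ser.isin(top2), 'Other')
print(result)
```

Unlike the in-place assignment, `where` returns a new series and leaves `ser` unchanged.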


11. (L2) **Bin the series `ser` into 10 equal deciles and replace the values with the bin name.**

In [72]:
# Input

ser = pd.Series(np.random.random(20))

with_cut = pd.cut(
    ser,
    bins=np.percentile(ser, np.arange(0, 110, 10)),
    labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'],
    include_lowest=True
)

with_qcut = pd.qcut(
    ser,
    q=np.arange(0, 1.1, .1),
    labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']
)

np.all(with_cut == with_qcut)

True

12. (L1) **Reshape the series `ser` into a dataframe with 7 rows and 5 columns.**

In [76]:
# Input

ser = pd.Series(np.random.randint(1, 10, 35))

pd.DataFrame(
    ser.to_numpy().reshape((7,5))
)

Unnamed: 0,0,1,2,3,4
0,2,7,1,1,6
1,5,5,8,7,2
2,4,3,1,7,9
3,4,7,5,3,2
4,5,2,7,1,3
5,1,3,8,8,5
6,9,8,2,1,5


13. (L2) **Find the positions of numbers that are multiples of 3 from `ser`.**

In [82]:
# Input

ser = pd.Series(np.random.randint(1, 10, 7))

print(ser)
np.argwhere(ser.to_numpy() % 3 == 0)

0    7
1    8
2    1
3    6
4    2
5    3
6    4
dtype: int64


array([[3],
       [5]])
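
`np.argwhere` returns one row per match, which is why the result above is two-dimensional. If a flat list of positions is preferred, `np.flatnonzero` is one option. The sketch below reuses the values printed above as a fixed input.

```python
import numpy as np
import pandas as pd

# Same values as the series printed above, fixed for reproducibility
ser = pd.Series([7, 8, 1, 6, 2, 3, 4])

# flatnonzero returns a 1-D array of the positions where the condition holds
positions = np.flatnonzero(ser.to_numpy() % 3 == 0)
print(positions.tolist())
```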

14. (L1) **From `ser`, extract the items at positions in list `pos`.**

In [85]:
# Input

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

np.all(ser[pos] == ser.take(pos))

True

15. (L1) **Stack two series vertically and horizontally.**

In [89]:
# Input

ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

vertically = pd.concat([ser1, ser2], axis=0)  # axis=0 appends rows
horizontally = pd.concat([ser1, ser2], axis=1)  # axis=1 places the series side by side
print(vertically, horizontally)

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object    0  1
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e


16. (L2) **Get the positions of items of `ser2` in `ser1` as a list.**

In [99]:
# Input

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

np.argwhere(
    ser1.isin(ser2).to_numpy()
)

array([[0],
       [4],
       [5],
       [8]])
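
The result above lists the matching positions in the order they occur in `ser1`, as a 2-D array. If the positions are wanted as a plain list in the order of `ser2`, one possible sketch looks up each item in turn. It assumes every item of `ser2` occurs exactly once in `ser1`; the name `lookup` is introduced here for illustration.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Build an index over the values of ser1 so each value maps to its position.
# Assumption: values are unique, so get_loc returns a single integer.
lookup = pd.Index(ser1)
positions = [lookup.get_loc(item) for item in ser2]
print(positions)
```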

17. (L2) **Compute the mean squared error of `truth` and `pred`.**

**NOTE**: The mean squared error between the two series is the average of the squared differences:

$$MSE = \dfrac{1}{n} \sum_{i=1}^{n} \left(truth_i - pred_i\right)^2$$

In [102]:
# Input

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

np.mean(
    (truth-pred)**2  # Squares of errors
)

0.2408264920442503
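
Because `pred` is random, the value above changes on every run. A small worked example with hypothetical fixed values makes the formula easy to check by hand:

```python
import pandas as pd

# Hypothetical fixed values, chosen so the result can be verified by hand
truth = pd.Series([0.0, 1.0, 2.0, 3.0])
pred = pd.Series([0.5, 1.0, 2.5, 2.0])

# Squared errors are 0.25, 0, 0.25 and 1; their mean is 1.5 / 4
mse = ((truth - pred) ** 2).mean()
print(mse)  # 0.375
```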

18. (L2) **Convert the first character of each element in `ser` to uppercase.**

In [109]:
# Input

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

ser.apply(str.title)

0     How
1      To
2    Kick
3    Ass?
dtype: object
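
`str.title` works for these single lowercase words, but it also capitalizes every later word and lowercases the remaining letters. A sketch that changes only the first character is shown below, using a hypothetical input that makes the difference visible.

```python
import pandas as pd

# Hypothetical input with mixed case and a multi-word item
ser = pd.Series(['how', 'to', 'kick', 'iPhone case'])

# Uppercase the first character only and leave the rest of each string untouched
result = ser.str[:1].str.upper() + ser.str[1:]
print(result.tolist())
```

With `str.title`, the last item would instead become `'Iphone Case'`.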

19. (L2) **Calculate the number of characters in each word in `ser`.**

In [110]:
# Input

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

ser.apply(len)

0    3
1    2
2    4
3    4
dtype: int64

20. (L2) **Compute the difference of differences between consecutive numbers in `ser`.**

In [112]:
# Input

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

ser.diff().diff()

0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    2.0
dtype: float64
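
For comparison, NumPy can apply the difference twice in one call. In this sketch the result has no leading `NaN` values, so it is two items shorter than the input.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# n=2 takes the first difference twice: [2, 3, 4, 5, 6, 6, 8] -> [1, 1, 1, 1, 0, 2]
second_diff = np.diff(ser.to_numpy(), n=2)
print(second_diff.tolist())
```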

21. (L2) **Convert a series of date-strings to a timeseries**

Desired output:

```
0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]
```

In [116]:
# Input

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

ser.astype('datetime64')

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]
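
The cast above ran under pandas 1.5.3. Other pandas versions may handle a unit-less `'datetime64'` cast, or a series that mixes date formats, differently. A sketch that parses each string on its own with `dateutil` (installed as a pandas dependency) avoids relying on that behavior:

```python
import pandas as pd
from dateutil.parser import parse

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303',
                 '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Parse every string independently, so mixed formats in one series are not a problem
parsed = pd.to_datetime(ser.map(parse))
print(parsed)
```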

22. (L2) **Get the day of month, week number, day of year and day of week from `ser`.**

Desired output:

```
Date:  [1, 2, 3, 4, 5, 6]
Week number:  [53, 5, 9, 14, 19, 23]
Day num of year:  [1, 33, 63, 94, 125, 157]
Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
```

In [126]:
# Input

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

dates = []
week_numbers = []
day_numbers = []
weekdays = []

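for date in ser.astype('datetime64'):
    dates.append(date.day)  # Day of month
    week_numbers.append(date.week)  # ISO week number
    day_numbers.append(date.dayofyear)  # Day of year
    weekdays.append(date.day_name())  # Weekday name, e.g. 'Friday'

print("Date: ", dates)
print("Week number: ", week_numbers)
print("Day num of year: ", day_numbers)
print("Day of week: ", weekdays)

Date:  [1, 2, 3, 4, 5, 6]
Week number:  [53, 5, 9, 14, 19, 23]
Day num of year:  [1, 33, 63, 94, 125, 157]
Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']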
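
The same fields can also be read without a Python loop through the `.dt` accessor. This is a sketch under one assumption: the six dates are rewritten here in a single ISO format, with the time of day dropped, so that parsing does not depend on mixed-format handling.

```python
import pandas as pd

# Assumption: the same six dates, rewritten in one uniform format
ser = pd.to_datetime(pd.Series([
    '2010-01-01', '2011-02-02', '2012-03-03',
    '2013-04-04', '2014-05-05', '2015-06-06',
]))

day_of_month = ser.dt.day.tolist()
week_number = ser.dt.isocalendar().week.tolist()  # ISO week number
day_of_year = ser.dt.dayofyear.tolist()
day_of_week = ser.dt.day_name().tolist()

print("Date: ", day_of_month)
print("Week number: ", week_number)
print("Day num of year: ", day_of_year)
print("Day of week: ", day_of_week)
```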
