# Pandas exercises

<font size="3"> Exercises taken from [101 Pandas exercises for data analysis](https://www.machinelearningplus.com/python/101-pandas-exercises-python/) </font>

In [2]:
import pandas as pd
import numpy as np

<font size="3"> 1. How to import pandas and check the version? </font>

In [1]:
# 1

import pandas as pd

pd.__version__ # check version attribute

'1.4.4'

<font size="3"> 2. How to create a series from a list, numpy array and dict? </font>

In [5]:
import numpy as np

# input
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

# use the Series() function to do this
mylist_series = pd.Series(mylist)
myarr_series = pd.Series(myarr)
mydict_series = pd.Series(mydict)

new_series = [mylist_series, myarr_series, mydict_series] # collection of the new series objects
print(list(map(type, new_series)))

[<class 'pandas.core.series.Series'>, <class 'pandas.core.series.Series'>, <class 'pandas.core.series.Series'>]
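
All three inputs produce a Series, but the index differs depending on the input. A minimal sketch on shortened stand-in data (the two-item inputs are made up for illustration): a list or array gets a default integer index, while a dict uses its keys as the index.

```python
import numpy as np
import pandas as pd

# shortened stand-ins for mylist, myarr and mydict
from_list = pd.Series(['a', 'b'])
from_arr = pd.Series(np.arange(2))
from_dict = pd.Series({'a': 0, 'b': 1})

print(from_list.index.tolist())  # default integer positions
print(from_arr.index.tolist())   # default integer positions
print(from_dict.index.tolist())  # the dict keys
```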


<font size="3"> 3. How to convert the index of a series into a column of a dataframe?
Convert the series ser into a dataframe with its index as another column on the dataframe. </font>

In [12]:
# input

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

# use to_frame() to convert to a DataFrame, then chain reset_index() to move the index into a regular column
df = ser.to_frame().reset_index()
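
The cell above shows no output, so here is a small sketch of what the resulting frame looks like. It assumes a three-item stand-in series, and the column names `letter` and `value` are picked for illustration only. By default the former index lands in a column called `index`.

```python
import numpy as np
import pandas as pd

# small stand-in for the 26-letter series
ser = pd.Series(np.arange(3), index=list('abc'))

# to_frame() keeps the index; reset_index() turns it into a column named 'index'
df = ser.to_frame(name='value').reset_index()

# give the former index column a more descriptive (illustrative) name
df = df.rename(columns={'index': 'letter'})
print(df)
```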

<font size="3"> 4. How to combine many series to form a dataframe?
Combine ser1 and ser2 to form a dataframe. </font>

In [4]:
# input
import numpy as np
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

# use pd.DataFrame
df = pd.DataFrame({'col_1': ser1, 'col_2': ser2})
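
An alternative worth knowing is `pd.concat` along the column axis. The sketch below uses shortened five-item series and compares the result with the dict-based construction; both routes should give the same frame.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(list('abcde'))
ser2 = pd.Series(np.arange(5))

# concatenate side by side; keys supplies the column names
df_concat = pd.concat([ser1, ser2], axis=1, keys=['col_1', 'col_2'])

# dict-based construction for comparison
df_dict = pd.DataFrame({'col_1': ser1, 'col_2': ser2})

print(df_concat.equals(df_dict))
```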

<font size="3"> 5. How to assign a name to the series’ index?
Give a name to the series ser, calling it ‘alphabets’. </font>

In [42]:
# input

ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

# use Series.rename(); a scalar argument sets the name of the Series (not of its index) and returns a new Series

ser.rename('alphabets')

0     a
1     b
2     c
3     e
4     d
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: alphabets, dtype: object
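
The exercise title mentions the index, while `rename` with a scalar names the Series itself. A short sketch of the difference, on a three-item stand-in series: `rename_axis` is the method that names the index.

```python
import pandas as pd

ser = pd.Series(list('abc'))

named = ser.rename('alphabets')             # sets Series.name, returns a new Series
axis_named = ser.rename_axis('alphabets')   # sets the index name instead

print(named.name, named.index.name)
print(axis_named.name, axis_named.index.name)
```

Neither call modifies `ser` in place, so the result has to be assigned to keep it.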

<font size="3"> 6. How to get the items of series A not present in series B?
From ser1 remove items present in ser2. </font>

In [59]:
# input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# method 1: convert the series to sets and take the difference (loses the index and any duplicates)
set(ser1) - set(ser2)

# method 2: use isin() and ~ to keep only the items of ser1 that are not in ser2
ser1[ser1.isin(ser2)] # the elements that are shared

ser1_only = ser1[~ser1.isin(ser2)] # everything except the elements that are shared
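
The two methods differ once the data contains repeats. A sketch with a duplicated value added to `ser1` (the extra `2` is an assumption for illustration), using `np.setdiff1d` as the set-style variant:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# boolean mask: keeps the original index labels and the repeated 2
mask_result = ser1[~ser1.isin(ser2)]

# set difference: sorted unique values as a numpy array
unique_result = np.setdiff1d(ser1, ser2)

print(mask_result.tolist())
print(unique_result.tolist())
```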

<font size="3"> 7. How to get the items not common to both series A and series B? </font>

In [6]:
# input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# take the union (in A, B, or both) and the intersection (in both), then drop every value of the union that is also in the intersection
union_ = pd.Series(np.union1d(ser1, ser2)) 
intersect_ = pd.Series(np.intersect1d(ser1, ser2))
not_common = union_[~union_.isin(intersect_)]
not_common



0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64
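
The same result can probably be reached in one step with `np.setxor1d`, which returns the sorted values that appear in exactly one of the two inputs. A minimal sketch on the same data; note that the index is rebuilt from 0, unlike the filtered union above.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# symmetric difference: values in exactly one of the two series
not_common_xor = pd.Series(np.setxor1d(ser1, ser2))
print(not_common_xor.tolist())
```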

<font size="3"> 8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series? Compute the minimum, 25th percentile, median, 75th, and maximum of ser.

 </font>

In [15]:
# Input
ser = pd.Series(np.random.normal(10, 5, 25))

# use numpy's percentile() function; the maximum (100th percentile) is not computed here
min_, first_quar, median_, third_quar = np.percentile(ser, [0, 25, 50, 75])

print(f'Minimum percentile is: {min_}')
print(f'25th percentile is: {first_quar}')
print(f'Median percentile is: {median_}')
print(f'75th percentile is: {third_quar}')


Minimum percentile is: -1.7085185675703016
25th percentile is: 7.44287849457793
Median percentile is: 9.157954504608025
75th percentile is: 13.317499686053495
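
The exercise also asks for the maximum, which is the 100th percentile. A sketch of the full five-number summary follows; the seed value 0 is an arbitrary assumption to make the run repeatable, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed, chosen only for repeatability
ser = pd.Series(rng.normal(10, 5, 25))

# numpy: percentiles on a 0-100 scale
min_, q1, median_, q3, max_ = np.percentile(ser, [0, 25, 50, 75, 100])

# pandas equivalent: quantiles on a 0-1 scale, returned as a Series
summary = ser.quantile([0, 0.25, 0.5, 0.75, 1])
print(summary)
```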


<font size="3"> 9. How to get frequency counts of unique items of a series?
 </font>

In [14]:
# Input

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

# can use the series 'value_counts()' method
ser.value_counts()

d    8
c    5
h    4
b    4
e    3
a    3
f    2
g    1
dtype: int64
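
`value_counts()` sorts by frequency in descending order, and `normalize=True` returns shares instead of counts. A small sketch on fixed data (the six-letter input is made up so the result is deterministic):

```python
import pandas as pd

ser = pd.Series(list('aabbbc'))

counts = ser.value_counts()                 # absolute counts, most frequent first
shares = ser.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```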

<font size="3"> 10. How to keep only the top 2 most frequent values as they are and replace everything else with ‘Other’?
    From ser, keep the top 2 most frequent items as they are and replace everything else with ‘Other’.
    
</font>

In [62]:
# Input

np.random.RandomState(100) # creates a generator but does not seed np.random, so the values below change on every run
ser = pd.Series(np.random.randint(1, 5, [12]))

x = ser.value_counts().index.to_list() # values ordered from most to least frequent; the first two are the ones to keep

# loop through ser and replace every value that is not one of the two most frequent
for i in range(len(ser)):
    if ser[i] not in x[:2]:
        ser[i] = 'Other'
print(f'value counts in order of frequency are: {x} so only {x[0]} and {x[1]} will be kept.')
print(ser)

value counts in order of frequency are: [3, 1, 4, 2] so only 3 and 1 will be kept.
0         3
1         1
2     Other
3         3
4     Other
5     Other
6         3
7         3
8         3
9         1
10    Other
11        1
dtype: object
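
The loop can likely be replaced by a vectorized `isin` plus `where`. The sketch below also draws from the `RandomState` object directly so the run is repeatable; because of that, its values will differ from the output above. Casting to `object` first is a cautious choice so that mixing integers and the string `'Other'` does not depend on implicit upcasting.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(100)        # draw from this generator to get repeatable values
ser = pd.Series(rng.randint(1, 5, 12))

top2 = ser.value_counts().index[:2]     # the two most frequent values
keep = ser.isin(top2)                   # True where the value should be kept

# where() keeps values where the mask is True and substitutes 'Other' elsewhere
result = ser.astype(object).where(keep, 'Other')
print(result)
```

If several values tie for second place, which one counts as "top 2" depends on the order `value_counts()` returns them in.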
