# 101 Pandas Exercises for Data Analysis

Exercises taken from [Machine Learning +](https://www.machinelearningplus.com/python/101-pandas-exercises-python/)

#### 1. How to import pandas and check the version?

In [1]:
import pandas as pd
print(pd.__version__)

1.3.3


#### 2. How to create a series from a list, numpy array and dict?
Create a pandas series from each of the items below: a list, a numpy array and a dictionary.

In [2]:
import numpy as np

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

In [3]:
ser1 = pd.Series(mylist)
ser1.head(5)

0    a
1    b
2    c
3    e
4    d
dtype: object

In [4]:
ser2 = pd.Series(myarr)
ser2.head()

0    0
1    1
2    2
3    3
4    4
dtype: int32

In [5]:
ser3 = pd.Series(mydict)
ser3.head()

a    0
b    1
c    2
e    3
d    4
dtype: int32

#### 3. How to convert the index of a series into a column of a dataframe?
Convert the series `ser3` into a dataframe with its index as another column on the dataframe.

In [6]:
pd.DataFrame(ser3).reset_index().head()

Unnamed: 0,index,0
0,a,0
1,b,1
2,c,2
3,e,3
4,d,4


#### 4. How to combine many series to form a dataframe?

Combine `ser1` and `ser2` to form a dataframe.

In [7]:
pd.DataFrame({'letters': ser1, 'numbers': ser2}).head()

Unnamed: 0,letters,numbers
0,a,0
1,b,1
2,c,2
3,e,3
4,d,4


#### 5. How to assign name to the series’ index?

Give a name to the series `ser1` calling it "*alphabets*".

In [8]:
ser1.name = 'alphabets'
ser1.head()

0    a
1    b
2    c
3    e
4    d
Name: alphabets, dtype: object

#### 6. How to get the items of series A not present in series B?

From `ser1` remove items present in `ser2`.

In [9]:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# filter items in ser1 not (represented by '~') in ser2
ser1[~ser1.isin(ser2)]

0    1
1    2
2    3
dtype: int64

#### 7. How to get the items not common to both series A and series B?

Get all items of `ser1` and `ser2` not common to both.

In [10]:
# same idea from the previous exercise, but concatenating both cases in
# a single series

pd.concat([ser1[~ser1.isin(ser2)], ser2[~ser2.isin(ser1)]])

0    1
1    2
2    3
2    6
3    7
4    8
dtype: int64
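As a cross-check on the `concat` approach above, numpy's `setxor1d` should give the same values in one call. This is only a sketch: it returns a sorted array of unique values, so the original index labels are lost and duplicates are dropped.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# np.setxor1d returns the sorted, unique values found in exactly one input
not_common = pd.Series(np.setxor1d(ser1.to_numpy(), ser2.to_numpy()))
print(not_common.tolist())
```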

#### 8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?
Compute the minimum, 25th percentile, median, 75th, and maximum of `ser`. 

In [11]:
np.random.seed(0)
ser = pd.Series(np.random.normal(10, 5, 25))

print("Min:", ser.min())
print("25th percentile:", ser.quantile(.25))
print("Median:", ser.median())
print("75th percentile:", ser.quantile(.75))
print("Max:", ser.max())

Min: -2.7649490791703926
25th percentile: 9.48390574103221
Median: 12.052992509691862
75th percentile: 14.893689920528697
Max: 21.348773119938038


In [12]:
# More straightforward way
np.percentile(ser, q=[0, 25, 50, 75, 100])

array([-2.76494908,  9.48390574, 12.05299251, 14.89368992, 21.34877312])

#### 9. How to get frequency counts of unique items of a series?

Calculate the frequency counts of each unique value in `ser`.

In [13]:
np.random.seed(0)
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

ser.value_counts()

h    5
a    5
f    4
d    4
b    4
e    3
g    3
c    2
dtype: int64

#### 10. How to keep only top 2 most frequent values as it is and replace everything else as ‘Other’?

From `ser`, keep the 2 most frequent items as they are and replace everything else with "Other".

In [14]:
np.random.seed(0)
ser = pd.Series(np.random.randint(1, 5, [12]))
ser

0     1
1     4
2     2
3     1
4     4
5     4
6     4
7     4
8     2
9     4
10    2
11    3
dtype: int32

In [15]:
keep = ser.value_counts().nlargest(2, keep='all').index
keep

Int64Index([4, 2], dtype='int64')

In [16]:
# isin() covers every value in `keep`, including ties kept by keep='all'
ser.where(ser.isin(keep), other='Other')

0     Other
1         4
2         2
3     Other
4         4
5         4
6         4
7         4
8         2
9         4
10        2
11    Other
dtype: object

#### 11. How to bin a numeric series to 10 groups of equal size?

Bin the series `ser` into 10 equal deciles and replace the values with the bin name.

In [17]:
np.random.seed(0)
ser = pd.Series(np.random.random(20))
ser.head()

0    0.548814
1    0.715189
2    0.602763
3    0.544883
4    0.423655
dtype: float64

In [18]:
pd.qcut(ser, q=10, labels=['1st', '2nd', '3rd', '4th', '5th',
                           '6th', '7th', '8th', '9th', '10th']).head()

0    5th
1    7th
2    6th
3    4th
4    3rd
dtype: category
Categories (10, object): ['1st' < '2nd' < '3rd' < '4th' ... '7th' < '8th' < '9th' < '10th']

#### 12. How to convert a numpy array to a dataframe of given shape?

Reshape the series `ser` into a dataframe with 7 rows and 5 columns

In [19]:
np.random.seed(0)
ser = pd.Series(np.random.randint(1, 10, 35))
ser.head()

0    6
1    1
2    4
3    4
4    8
dtype: int32

In [20]:
pd.DataFrame(ser.to_numpy().reshape((7,5)))

Unnamed: 0,0,1,2,3,4
0,6,1,4,4,8
1,4,6,3,5,8
2,7,9,9,2,7
3,8,8,9,2,6
4,9,5,4,1,4
5,6,1,3,4,9
6,2,4,4,4,8


#### 13. How to find the positions of numbers that are multiples of 3 from a series?

Find the positions of numbers that are multiples of 3 from `ser`.

In [21]:
np.random.seed(0)
ser = pd.Series(np.random.randint(1, 10, 7))
ser

0    6
1    1
2    4
3    4
4    8
5    4
6    6
dtype: int32

In [22]:
ser[ser % 3 == 0].index

Int64Index([0, 6], dtype='int64')
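Note that `.index` returns index labels, which only coincide with positions when the series has the default integer index. A small sketch of the difference, assuming a hypothetical string index on the same values, with `np.flatnonzero` used to get true positions:

```python
import numpy as np
import pandas as pd

# Same values as the exercise, but with a non-default (string) index
ser = pd.Series([6, 1, 4, 4, 8, 4, 6], index=list('abcdefg'))

is_multiple = ser % 3 == 0

# Index labels of the matching items
labels = ser[is_multiple].index.tolist()

# Integer positions of the matching items, independent of the index
positions = np.flatnonzero(is_multiple.to_numpy()).tolist()

print(labels, positions)
```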

#### 14. How to extract items at given positions from a series?

From `ser`, extract the items at the positions listed in `pos`.

In [23]:
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

ser.iloc[pos]

0     a
4     e
8     i
14    o
20    u
dtype: object

#### 15. How to stack two series vertically and horizontally ?

Stack `ser1` and `ser2` vertically and horizontally (to form a dataframe).

In [24]:
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

In [25]:
# Horizontal (dataframe)
pd.concat([ser1, ser2], axis = 1)

Unnamed: 0,0,1
0,0,a
1,1,b
2,2,c
3,3,d
4,4,e


In [26]:
# Vertical
pd.concat([ser1, ser2])

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object

#### 16. How to get the positions of items of series A in another series B?

Get the positions of items of `ser2` in `ser1` as a list.

In [27]:
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

In [28]:
# This returns the positions in the order they occur in ser1, not in the order of ser2:
ser1[ser1.isin(ser2)].index.to_list()

[0, 4, 5, 8]

In [29]:
# Looking up each item of ser2 in turn keeps the order of ser2:
[pd.Index(ser1).get_loc(i) for i in ser2]

[5, 4, 0, 8]
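The list comprehension can probably be replaced by a single vectorised lookup with `Index.get_indexer`. A sketch, assuming the values of `ser1` are unique (the method requires a unique index); items of `ser2` that are absent from `ser1` would come back as -1.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Position in ser1 of each item of ser2, in the order of ser2
positions = pd.Index(ser1).get_indexer(ser2).tolist()
print(positions)
```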

#### 17. How to compute the mean squared error on a truth and predicted series?

Compute the mean squared error of `truth` and `pred` series.

In [30]:
np.random.seed(0)
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

print(truth)
print('\n')
print(pred)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


0    0.548814
1    1.715189
2    2.602763
3    3.544883
4    4.423655
5    5.645894
6    6.437587
7    7.891773
8    8.963663
9    9.383442
dtype: float64


In [31]:
np.mean((truth - pred)**2)

0.41319910235287544

#### 18. How to convert the first character of each element in a series to uppercase?
Capitalize the first character of each word in `ser`.

In [32]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
ser.str.capitalize()

0     How
1      To
2    Kick
3    Ass?
dtype: object

#### 19. How to calculate the number of characters in each word in a series?

In [33]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
ser.apply(len)

0    3
1    2
2    4
3    4
dtype: int64

#### 20. How to compute difference of differences between consecutive numbers of a series?
Compute the difference of differences between the consecutive numbers of `ser`.

In [34]:
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
ser.diff().diff()

0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    2.0
dtype: float64

#### 21. How to convert a series of date-strings to a timeseries?

In [35]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303',
                 '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
# On pandas >= 2.0, pass format='mixed' to parse strings with differing formats
pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

#### 22. How to get the day of month, week number, day of year and day of week from a series of date strings?


Get the day of month, week number, day of year and day of week from `ser`.

In [36]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04',
                 '2014-05-05', '2015-06-06T12:20'])
ser = pd.to_datetime(ser)

print('Day of month:', ser.dt.day.to_list())
print('Week number:', ser.dt.isocalendar().week.to_list())
print('Day of year:', ser.dt.dayofyear.to_list())
print('Day of week:', ser.dt.day_name().to_list())

Day of month: [1, 2, 3, 4, 5, 6]
Week number: [53, 5, 9, 14, 19, 23]
Day of year: [1, 33, 63, 94, 125, 157]
Day of week: ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']


#### 23. How to convert year-month string to dates corresponding to the 4th day of the month?

Change `ser` to dates that start with 4th of the respective months.

In [37]:
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])
ser = ser.map(lambda x: x + ' 04')
pd.to_datetime(ser)

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]

#### 24. How to filter words that contain at least 2 vowels from a series?
From `ser`, extract words that contain at least 2 vowels.

In [38]:
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Convert all strings to lower case, count the vowels in each string
# with a regex and keep the strings with at least 2 vowels
ser[ser.str.lower().str.count('[aeiou]') >= 2]

0     Apple
1    Orange
4     Money
dtype: object

#### 25. How to filter valid emails from a series?

Extract the valid emails from the series `emails`. The regex pattern for valid emails is provided as reference.

In [39]:
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

emails[emails.str.contains(pattern)]

1    rameses@egypt.com
2            matt@t.co
3    narendra@modi.com
dtype: object

#### 26. How to get the mean of a series grouped by another series?

Compute the mean of `weights` of each `fruit`.

In [40]:
np.random.seed(0)

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

df = pd.concat([fruit, weights], axis=1)
df

Unnamed: 0,0,1
0,apple,1.0
1,banana,2.0
2,apple,3.0
3,banana,4.0
4,banana,5.0
5,carrot,6.0
6,apple,7.0
7,carrot,8.0
8,apple,9.0
9,apple,10.0


In [41]:
df.groupby(0).mean().reset_index()

Unnamed: 0,0,1
0,apple,6.0
1,banana,3.666667
2,carrot,7.0


In [42]:
# More straightforward way
weights.groupby(fruit).mean()

apple     6.000000
banana    3.666667
carrot    7.000000
dtype: float64

#### 27. How to compute the euclidean distance between two series?

Compute the [euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between series (points) `p` and `q`, **without using a packaged formula**.

In [43]:
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

(sum((p - q)**2))**.5

18.16590212458495

#### 28. How to find all the local maxima (or peaks) in a numeric series?

Get the positions of peaks (values surrounded by smaller values on both sides) in `ser`.

In [44]:
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

i = 1
pos = []

while i < ser.size - 1:
    # plain `and` is the right operator for two scalar comparisons
    if ser[i] > ser[i - 1] and ser[i] > ser[i + 1]:
        pos.append(i)
    i += 1
pos

[1, 5, 7]
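The loop can likely be replaced by a vectorised version based on the sign of consecutive differences. This is a sketch rather than a drop-in replacement: it assumes strict peaks only, so a flat plateau (two equal neighbouring values at the top) is not reported.

```python
import numpy as np
import pandas as pd

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# Sign of each step between neighbours: +1 rising, -1 falling, 0 flat
step_sign = np.sign(np.diff(ser.to_numpy()))

# A rise (+1) immediately followed by a fall (-1) gives a change of -2;
# adding 1 converts the position in the diff array to the position of the peak
peak_positions = (np.flatnonzero(np.diff(step_sign) == -2) + 1).tolist()
print(peak_positions)
```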

#### 29. How to replace missing spaces in a string with the least frequent character?

Replace the spaces in `my_str` with the least frequent character.

In [45]:
my_str = 'dbc deb abed gade'

# so we don't take spaces into account in the next step
new_str = my_str.replace(' ', '')

char_series = pd.Series(list(new_str)).value_counts()
char_series

d    4
b    3
e    3
a    2
c    1
g    1
dtype: int64

In [46]:
# filter items with the lowest count, take the index and transform into list
least_frequent = char_series[char_series == min(char_series)].index.to_list()

# as there are 2 characters with the same count, I'll replace the spaces with
# both of them
my_str.replace(' ', ''.join(least_frequent))

'dbccgdebcgabedcggade'

#### 30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays) after that having random numbers as values?

In [47]:
np.random.seed(0)

timeseries = pd.date_range(start='2000-01-01', periods=10, freq='W-SAT')
rand_num = np.random.randint(1, 10, 10)

pd.Series(rand_num, timeseries)

2000-01-01    6
2000-01-08    1
2000-01-15    4
2000-01-22    4
2000-01-29    8
2000-02-05    4
2000-02-12    6
2000-02-19    3
2000-02-26    5
2000-03-04    8
Freq: W-SAT, dtype: int32

#### 31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

`ser` has missing dates and values. Make all missing dates appear and fill up with value from previous date.

In [48]:
ser = pd.Series([1,10,3,np.nan],
                index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
ser

2000-01-01     1.0
2000-01-03    10.0
2000-01-06     3.0
2000-01-08     NaN
dtype: float64

In [49]:
# ffill() is the direct equivalent of fillna(method='ffill'), which newer pandas deprecates
ser.resample('D').ffill()

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
Freq: D, dtype: float64

#### 32. How to compute the autocorrelations of a numeric series?

Compute autocorrelations for the first 10 lags of `ser`. Find out which lag has the largest correlation.

In [50]:
np.random.seed(0)
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

# autocorrelations from 1 to 10
autocorr = [ser.autocorr(i) for i in range(1,11)]

# correlation values range from -1 to 1, but the minus sign only indicates
# the direction. Absolute values should be used when trying to find out
# the highest correlation
abs_autocorr = [abs(value) for value in autocorr]

# getting the index 
max_autocorr = abs_autocorr.index(max(abs_autocorr))

print(autocorr)
print('\n')
# the lag number is different from the index
print("Lag having highest correlation:", max_autocorr + 1)

[0.018487209669936965, 0.009196558932265876, 0.026728699085191685, 0.059028493323299104, 0.03815388957192515, -0.20746082111171082, -0.011292219412742142, -0.01065374639721878, -0.13053697739554374, -0.19493061310371246]


Lag having highest correlation: 6
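For reference, `Series.autocorr(lag)` is the Pearson correlation between the series and a copy of itself shifted by `lag`. The sketch below checks that for one lag and also shows a shorter way to pick the strongest lag, using a series indexed by lag number so that no `+ 1` adjustment is needed (the names `manual`, `builtin` and `best_lag` are just illustrative).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

# autocorr(lag) should match the correlation with the shifted series
lag = 3
manual = ser.corr(ser.shift(lag))
builtin = ser.autocorr(lag)

# Index the autocorrelations by lag so idxmax() returns the lag itself
autocorr = pd.Series([ser.autocorr(i) for i in range(1, 11)], index=range(1, 11))
best_lag = autocorr.abs().idxmax()
print(manual, builtin, best_lag)
```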


#### 33. How to import only every nth row from a csv file to create a dataframe?

Import every 50th row of [Boston Housing dataset](https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv) as a dataframe.

In [51]:
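# The callable gets the 0-based line number of the file. Line 0 is the header,
# so `x % 50 != 0` keeps the header plus lines 50, 100, ... (every 50th data row)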
pd.read_csv('https://bit.ly/3Bc1oFn',
           skiprows=lambda x: x % 50 != 0)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.21977,0,6.91,0,0.448,5.602,62.0,6.0877,3,233,17.9,396.9,16.2,19.4
1,0.0686,0,2.89,0,0.445,7.416,62.5,3.4952,2,276,18.0,396.9,6.19,33.2
2,2.73397,0,19.58,0,0.871,5.597,94.9,1.5257,5,403,14.7,351.85,21.45,15.4
3,0.0315,95,1.47,0,0.403,6.975,15.3,7.6534,3,402,17.0,396.9,4.56,34.9
4,0.19073,22,5.86,0,0.431,6.718,17.5,7.8265,7,330,19.1,393.74,6.56,26.2
5,0.05561,70,2.24,0,0.4,7.041,10.0,7.8278,5,358,14.8,371.58,4.74,29.0
6,0.02899,40,1.25,0,0.429,6.939,34.5,8.7921,1,335,19.7,389.85,5.89,26.6
7,9.91655,0,18.1,0,0.693,5.852,77.8,1.5004,24,666,20.2,338.16,29.97,6.3
8,7.52601,0,18.1,0,0.713,6.417,98.3,2.185,24,666,20.2,304.21,19.31,13.0
9,0.17783,0,9.69,0,0.585,5.569,73.5,2.3999,6,391,19.2,395.77,15.1,17.5
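To see which rows the `skiprows` callable keeps without downloading anything, here is a small sketch on a made-up in-memory CSV (the column name `n` and the 200 rows are assumptions for illustration only).

```python
import io
import pandas as pd

# Hypothetical CSV: a header line followed by the numbers 1..200, one per line
csv_text = "n\n" + "\n".join(str(i) for i in range(1, 201))

# Line 0 (the header) is kept because 0 % 50 == 0;
# after that only file lines 50, 100, 150 and 200 survive
every_50th = pd.read_csv(io.StringIO(csv_text), skiprows=lambda x: x % 50 != 0)
print(every_50th['n'].tolist())
```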


#### 34. How to change column values when importing csv to a dataframe?

Import the Boston Housing dataset, but while importing change the `'medv'` (median house value) column so that values < 25 become ‘Low’ and values > 25 become ‘High’.

In [52]:
convert = lambda x: 'Low' if float(x) < 25 else 'High'

pd.read_csv('https://bit.ly/3Bc1oFn',
           converters={'medv': convert})

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,Low
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,Low
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,High
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,High
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,Low
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,Low
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,Low
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,Low


#### 35. How to create a dataframe with rows as strides from a given series?

In [53]:
# TODO: not solved yet

L = pd.Series(range(15))
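One possible approach, as a sketch only. The notebook does not state the window length or the stride, so the values used here (windows of 4 items, moving 2 items at a time) and the helper name `gen_strides` are assumptions.

```python
import numpy as np
import pandas as pd

L = pd.Series(range(15))

def gen_strides(values, stride_len=2, window_len=4):
    """Return overlapping windows of `values` as the rows of a dataframe.

    stride_len and window_len are assumed defaults, not taken from the exercise.
    """
    arr = np.asarray(values)
    # Number of complete windows that fit into the array
    n_windows = (arr.size - window_len) // stride_len + 1
    starts = range(0, n_windows * stride_len, stride_len)
    return pd.DataFrame([arr[s:s + window_len] for s in starts])

strides = gen_strides(L)
print(strides)
```

Trailing items that do not fill a complete window are dropped.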

#### 36. How to import only specified columns from a csv file?

Import the `'crim'` and `'medv'` columns of the Boston Housing dataset as a dataframe.

In [54]:
pd.read_csv('https://bit.ly/3Bc1oFn',
           usecols=['crim', 'medv'])

Unnamed: 0,crim,medv
0,0.00632,24.0
1,0.02731,21.6
2,0.02729,34.7
3,0.03237,33.4
4,0.06905,36.2
...,...,...
501,0.06263,22.4
502,0.04527,20.6
503,0.06076,23.9
504,0.10959,22.0


#### 37. How to get the nrows, ncolumns, datatype, summary stats of each column of a dataframe? Also get the array and list equivalent.

Get the number of rows, columns, datatype and summary statistics of each column of the [Cars93 dataset](https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv). Also get the numpy array and list equivalent of the dataframe.

In [55]:
cars93 = pd.read_csv('https://bit.ly/3GfhcuI')
cars93.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Manufacturer        89 non-null     object 
 1   Model               92 non-null     object 
 2   Type                90 non-null     object 
 3   Min.Price           86 non-null     float64
 4   Price               91 non-null     float64
 5   Max.Price           88 non-null     float64
 6   MPG.city            84 non-null     float64
 7   MPG.highway         91 non-null     float64
 8   AirBags             87 non-null     object 
 9   DriveTrain          86 non-null     object 
 10  Cylinders           88 non-null     object 
 11  EngineSize          91 non-null     float64
 12  Horsepower          86 non-null     float64
 13  RPM                 90 non-null     float64
 14  Rev.per.mile        87 non-null     float64
 15  Man.trans.avail     88 non-null     object 
 16  Fuel.tank.

In [56]:
cars93.describe()

Unnamed: 0,Min.Price,Price,Max.Price,MPG.city,MPG.highway,EngineSize,Horsepower,RPM,Rev.per.mile,Fuel.tank.capacity,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight
count,86.0,91.0,88.0,84.0,91.0,91.0,86.0,90.0,87.0,85.0,91.0,89.0,92.0,87.0,88.0,89.0,74.0,86.0
mean,17.118605,19.616484,21.459091,22.404762,29.065934,2.658242,144.0,5276.666667,2355.0,16.683529,5.076923,182.865169,103.956522,69.448276,38.954545,27.853933,13.986486,3104.593023
std,8.82829,9.72428,10.696563,5.84152,5.370293,1.045845,53.455204,605.554811,486.916616,3.375748,1.045953,14.792651,6.856317,3.778023,3.304157,3.018129,3.120824,600.129993
min,6.7,7.4,7.9,15.0,20.0,1.0,55.0,3800.0,1320.0,9.2,2.0,141.0,90.0,60.0,32.0,19.0,6.0,1695.0
25%,10.825,12.35,14.575,18.0,26.0,1.8,100.75,4800.0,2017.5,14.5,4.0,174.0,98.0,67.0,36.0,26.0,12.0,2647.5
50%,14.6,17.7,19.15,21.0,28.0,2.3,140.0,5200.0,2360.0,16.5,5.0,181.0,103.0,69.0,39.0,27.5,14.0,3085.0
75%,20.25,23.5,24.825,25.0,31.0,3.25,170.0,5787.5,2565.0,19.0,6.0,192.0,110.0,72.0,42.0,30.0,16.0,3567.5
max,45.4,61.9,80.0,46.0,50.0,5.7,300.0,6500.0,3755.0,27.0,8.0,219.0,119.0,78.0,45.0,36.0,22.0,4105.0


In [57]:
cars93.to_numpy()

array([['Acura', 'Integra', 'Small', ..., 2705.0, 'non-USA',
        'Acura Integra'],
       [nan, 'Legend', 'Midsize', ..., 3560.0, 'non-USA', 'Acura Legend'],
       ['Audi', '90', 'Compact', ..., 3375.0, 'non-USA', 'Audi 90'],
       ...,
       ['Volkswagen', 'Corrado', 'Sporty', ..., 2810.0, 'non-USA',
        'Volkswagen Corrado'],
       ['Volvo', '240', 'Compact', ..., 2985.0, 'non-USA', 'Volvo 240'],
       [nan, '850', 'Midsize', ..., 3245.0, 'non-USA', 'Volvo 850']],
      dtype=object)

In [58]:
cars93.to_numpy().tolist()[0]

['Acura',
 'Integra',
 'Small',
 12.9,
 15.9,
 18.8,
 25.0,
 31.0,
 'None',
 'Front',
 '4',
 1.8,
 140.0,
 6300.0,
 2890.0,
 'Yes',
 13.2,
 5.0,
 177.0,
 102.0,
 68.0,
 37.0,
 26.5,
 nan,
 2705.0,
 'non-USA',
 'Acura Integra']

#### 38. How to extract the row and column number of a particular cell with given criterion?

Which manufacturer, model and type has the highest `Price`? What is the row and column number of the cell with the highest `Price` value?

In [59]:
# cars93[[selected columns]][filter]
most_expensive = cars93[['Manufacturer', 'Model', 'Type', 'Price']][cars93['Price'] == cars93['Price'].max()]
most_expensive

Unnamed: 0,Manufacturer,Model,Type,Price
58,Mercedes-Benz,300E,Midsize,61.9


In [60]:
row = most_expensive.index[0]
col = cars93.columns.tolist().index('Price')
row, col

(58, 4)

#### 39. How to rename a specific columns in a dataframe?

Rename the column `Type` as `CarType` in df and replace the "." in column names with "_".

In [61]:
# rename() returns a new dataframe, so the result has to be assigned back
cars93 = cars93.rename(columns={'Type': 'CarType'})

# Replace '.' with '_' and reassign the list back to the column names
cars93.columns = [x.replace('.', '_') for x in cars93.columns.values]
cars93.columns

Index(['Manufacturer', 'Model', 'CarType', 'Min_Price', 'Price', 'Max_Price',
       'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',
       'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',
       'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
       'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',
       'Make'],
      dtype='object')

#### 40. How to check if a dataframe has any missing values?

Check if `df` has any missing values.

In [62]:
cars93.isna().sum().sum() > 0

True

#### 41. How to count the number of missing values in each column?

Count the number of missing values in each column of `df`. Which column has the maximum number of missing values?

In [63]:
count_missing = cars93.isnull().sum()
count_missing

Manufacturer           4
Model                  1
CarType                3
Min_Price              7
Price                  2
Max_Price              5
MPG_city               9
MPG_highway            2
AirBags                6
DriveTrain             7
Cylinders              5
EngineSize             2
Horsepower             7
RPM                    3
Rev_per_mile           6
Man_trans_avail        5
Fuel_tank_capacity     8
Passengers             2
Length                 4
Wheelbase              1
Width                  6
Turn_circle            5
Rear_seat_room         4
Luggage_room          19
Weight                 7
Origin                 5
Make                   3
dtype: int64

In [64]:
cars93.columns[count_missing.argmax()]

'Luggage_room'

#### 42. How to replace missing values of multiple numeric columns with the mean?

Replace missing values in `Min.Price` and `Max.Price` columns with their respective mean.

In [65]:
cars93.fillna(value={'Min_Price': cars93['Min_Price'].mean(),
                     'Max_Price': cars93['Max_Price'].mean()},
             inplace=True)
cars93[['Min_Price', 'Max_Price']].isnull().sum()

Min_Price    0
Max_Price    0
dtype: int64

#### 43. How to use apply function on existing columns with global variables as additional arguments?

In `df`, use `apply` method to replace the missing values in `Min.Price` with the column’s mean and those in `Max.Price` with the column’s median.

In [66]:
cars93 = pd.read_csv('https://bit.ly/3GfhcuI')

fill_values = {'Min.Price': cars93['Min.Price'].mean(),
               'Max.Price': cars93['Max.Price'].median()}

# TODO: not solved yet
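A possible way to finish this one, sketched on a small made-up dataframe instead of the real Cars93 file. `DataFrame.apply` passes each column to the function, and its `args` parameter forwards extra positional arguments, here the `fill_values` dictionary. The helper name `fill_column` and the sample numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two Cars93 price columns
prices = pd.DataFrame({'Min.Price': [10.0, np.nan, 30.0, 20.0],
                       'Max.Price': [15.0, 25.0, np.nan, 80.0]})

fill_values = {'Min.Price': prices['Min.Price'].mean(),
               'Max.Price': prices['Max.Price'].median()}

def fill_column(col, values):
    # col.name is the column label, used to look up that column's fill value
    return col.fillna(values[col.name])

# args forwards fill_values as the second argument of fill_column
filled = prices.apply(fill_column, args=(fill_values,))
print(filled)
```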

#### 44. How to select a specific column from a dataframe as a dataframe instead of a series?

Get the first column (a) in `df` as a dataframe (rather than as a Series).

In [67]:
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

df[['a']]

Unnamed: 0,a
0,0
1,5
2,10
3,15


#### 45. How to change the order of columns of a dataframe?

Actually 3 questions.

1. In `df`, interchange columns `'a'` and `'c'`.
2. Create a generic function to interchange two columns, without hardcoding column names.
3. Sort the columns in reverse alphabetical order, that is column `'e'` first through column `'a'` last.

In [68]:
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

In [69]:
# 1.

df[['c', 'b', 'a', 'd', 'e']]

Unnamed: 0,c,b,a,d,e
0,2,1,0,3,4
1,7,6,5,8,9
2,12,11,10,13,14
3,17,16,15,18,19


In [70]:
# 2.

def swap_columns(df, col1=None, col2=None):
    
    columns_list = df.columns.to_list()
    idx1 = columns_list.index(col1)
    idx2 = columns_list.index(col2)
    columns_list[idx1] = col2
    columns_list[idx2] = col1
        
    return df[columns_list]

In [71]:
swap_columns(df, col1='a', col2='d')

Unnamed: 0,d,b,c,a,e
0,3,1,2,0,4
1,8,6,7,5,9
2,13,11,12,10,14
3,18,16,17,15,19


In [72]:
# 3.

df.sort_index(axis=1, ascending=False, inplace=True)
df

Unnamed: 0,e,d,c,b,a
0,4,3,2,1,0
1,9,8,7,6,5
2,14,13,12,11,10
3,19,18,17,16,15


#### 46. How to set the number of rows and columns displayed in the output?

Change the pandas display settings so that printing a dataframe shows a maximum of 10 rows and 10 columns.

In [73]:
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

In [74]:
cars93

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,...,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,...,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,...,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,...,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,...,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,...,27.0,13.0,3640.0,non-USA,BMW 535i
...,...,...,...,...,...,...,...,...,...,...,...
88,Volkswagen,Eurovan,Van,16.6,19.7,...,34.0,,3960.0,,Volkswagen Eurovan
89,Volkswagen,Passat,Compact,17.6,20.0,...,31.5,14.0,2985.0,non-USA,Volkswagen Passat
90,Volkswagen,Corrado,Sporty,22.9,23.3,...,26.0,15.0,2810.0,non-USA,Volkswagen Corrado
91,Volvo,240,Compact,21.8,22.7,...,29.5,14.0,2985.0,non-USA,Volvo 240


#### 47. How to format or suppress scientific notations in a pandas dataframe?

Suppress scientific notations like ‘e-03’ in `df` and print up to 4 digits after the decimal point.

In [75]:
np.random.seed(0)
df = pd.DataFrame(np.random.random(4)**10, columns=['random'])
df.round(4)

Unnamed: 0,random
0,0.0025
1,0.035
2,0.0063
3,0.0023
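`round` changes the stored values rather than how they are displayed. If only the display should change, the `display.float_format` option is probably a better fit. A sketch with made-up numbers, using `option_context` so the setting does not leak into other cells:

```python
import pandas as pd

# Hypothetical values small enough to be shown in scientific notation by default
df = pd.DataFrame({'random': [2.5e-3, 3.5e-2, 6.3e-7]})

# The option only applies inside the with-block; the data itself is untouched
with pd.option_context('display.float_format', '{:.4f}'.format):
    text = repr(df)

print(text)
```

For a permanent change in the session, the same formatter can be passed to `pd.set_option('display.float_format', ...)`.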


#### 48. How to format all the values in a dataframe as percentages?

Format the values in column `'random'` of `df` as percentages.

In [76]:
np.random.seed(0)
df = pd.DataFrame(np.random.random(4), columns=['random'])
df

Unnamed: 0,random
0,0.548814
1,0.715189
2,0.602763
3,0.544883


In [77]:
# On pandas >= 2.1, DataFrame.applymap is deprecated in favour of DataFrame.map
df.applymap(lambda x: "{:.2%}".format(x))

Unnamed: 0,random
0,54.88%
1,71.52%
2,60.28%
3,54.49%


#### 49. How to filter every nth row in a dataframe?

From `df`, filter the `'Manufacturer'`, `'Model'` and `'Type'` for every 20th row starting from 1st (row 0).

In [78]:
cars93 = pd.read_csv('https://bit.ly/3GfhcuI')

cars93[['Manufacturer', 'Model', 'Type']].iloc[::20]

Unnamed: 0,Manufacturer,Model,Type
0,Acura,Integra,Small
20,Chrysler,LeBaron,Compact
40,Honda,Prelude,Sporty
60,Mercury,Cougar,Midsize
80,Subaru,Loyale,Small


#### 50. How to create a primary key index by combining relevant columns?

In `df`, replace `NaNs` with ‘missing’ in columns `'Manufacturer'`, `'Model'` and `'Type'`, create an index as a combination of these three columns, and check whether the index is a primary key.

In [79]:
cars93 = pd.read_csv('https://bit.ly/3GfhcuI', usecols=[0,1,2,3,5])
cars93

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Max.Price
0,Acura,Integra,Small,12.9,18.8
1,,Legend,Midsize,29.2,38.7
2,Audi,90,Compact,25.9,32.3
3,Audi,100,Midsize,,44.6
4,BMW,535i,Midsize,,
...,...,...,...,...,...
88,Volkswagen,Eurovan,Van,16.6,22.7
89,Volkswagen,Passat,Compact,17.6,22.4
90,Volkswagen,Corrado,Sporty,22.9,23.7
91,Volvo,240,Compact,21.8,23.5


In [80]:
cars93.fillna(value={'Manufacturer': 'missing',
                     'Model': 'missing',
                     'Type': 'missing'},
             inplace=True)
new_index = ['_'.join([x, y, z]) for x, y, z in \
             zip(cars93['Manufacturer'], cars93['Model'], cars93['Type'])]
cars93.index = new_index
cars93

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Max.Price
Acura_Integra_Small,Acura,Integra,Small,12.9,18.8
missing_Legend_Midsize,missing,Legend,Midsize,29.2,38.7
Audi_90_Compact,Audi,90,Compact,25.9,32.3
Audi_100_Midsize,Audi,100,Midsize,,44.6
BMW_535i_Midsize,BMW,535i,Midsize,,
...,...,...,...,...,...
Volkswagen_Eurovan_Van,Volkswagen,Eurovan,Van,16.6,22.7
Volkswagen_Passat_Compact,Volkswagen,Passat,Compact,17.6,22.4
Volkswagen_Corrado_Sporty,Volkswagen,Corrado,Sporty,22.9,23.7
Volvo_240_Compact,Volvo,240,Compact,21.8,23.5


In [81]:
# The combined key was set on cars93, so that is the index to check
cars93.index.is_unique

True
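The key can also be built without a Python loop by using `Series.str.cat`. A sketch on a three-row, made-up miniature of the key columns (the frame `cars` and its contents are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the three key columns
cars = pd.DataFrame({'Manufacturer': ['Acura', np.nan, 'Audi'],
                     'Model': ['Integra', 'Legend', '90'],
                     'Type': ['Small', 'Midsize', 'Compact']})

key_cols = ['Manufacturer', 'Model', 'Type']
cars[key_cols] = cars[key_cols].fillna('missing')

# Join the three columns row by row with '_' as separator
cars.index = cars['Manufacturer'].str.cat([cars['Model'], cars['Type']], sep='_')

# A primary key must identify every row uniquely
print(cars.index.is_unique)
```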