### 1. How to import pandas and check the version?

In [2]:
import pandas as pd
print(pd.__version__)

1.1.5


### 2. How to create a series from a list, numpy array and dict?
Create a pandas series from each of the items below: a list, a numpy array and a dictionary.

In [3]:
# Input
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

In [7]:
mylist_series = pd.Series(mylist) 
myarr_series = pd.Series(np.arange(26))
mydict_series = pd.Series(dict(zip(mylist, myarr)))

#### 2.1. Creating Pandas DataFrame from lists of lists.

In [21]:
pd.DataFrame([['tom',10],['nick',20],['ana',30]], columns=['Name','Age']) # df created rowwise

Unnamed: 0,Name,Age
0,tom,10
1,nick,20
2,ana,30


#### 2.2. Creating DataFrame from dict of ndarrays/lists

In [22]:
data = {'Name': ['tom', 'nick', 'ana'],
        'Age': [10, 20, 30]}
pd.DataFrame(data)

Unnamed: 0,Name,Age
0,tom,10
1,nick,20
2,ana,30


#### 2.3. Creating an indexed DataFrame using arrays.

In [24]:
data = {'Name': ['tom', 'nick', 'ana'],
        'Age': [10, 20, 30]}
pd.DataFrame(data, index=['rank1', 'rank2', 'rank3'])

Unnamed: 0,Name,Age
rank1,tom,10
rank2,nick,20
rank3,ana,30


#### 2.4. Creating Dataframe from list of dicts

In [28]:
data = [{'a':10, 'b':20, 'c':30},
        {'a':5, 'b':15, 'c':25}]
pd.DataFrame(data)

Unnamed: 0,a,b,c
0,10,20,30
1,5,15,25


### 3. How to convert the index of a series into a column of a dataframe?
Difficulty Level: L1

Convert the series ser into a dataframe with its index as another column on the dataframe.

In [11]:
# Input
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

In [29]:
df = ser.to_frame().reset_index()
df.head()

Unnamed: 0,index,0
0,a,0
1,b,1
2,c,2
3,e,3
4,d,4


### 4. How to combine many series to form a dataframe?
Difficulty Level: L1

Combine ser1 and ser2 to form a dataframe.

In [30]:
# Input

import numpy as np
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

In [35]:
# Solution 1
df = pd.DataFrame({'col1':ser1, 'col2':ser2})
df.head(2)

Unnamed: 0,col1,col2
0,a,0
1,b,1


In [38]:
# Solution 2
df = pd.concat([ser1,ser2], axis=1)
df.head(2)

Unnamed: 0,0,1
0,a,0
1,b,1


### 5. How to assign a name to a series?
Difficulty Level: L1

Give a name to the series ser calling it ‘alphabets’.

In [39]:
# Input

ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

In [42]:
ser = pd.Series(ser, name='alphabets')
ser.head(2)

# The name of a Series becomes its index or column name if used to form a Data Frame

0    a
1    b
Name: alphabets, dtype: object

### 6. How to get the items of series A not present in series B?
Difficulty Level: L2

From ser1 remove items present in ser2.

In [52]:
# Input 
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

In [46]:
set(ser1) - set(ser2)

{1, 2, 3}

In [47]:
pd.Series([i for i in ser1 if i not in ser2])
# Wrong! For a Series, `in` tests membership in the index labels (0 to 4 here),
# not in the values, so ser2 must first be converted to a list

0    5
dtype: int64
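
The pitfall above comes from `in` checking a Series' index labels rather than its values. A minimal sketch of the difference, reusing the values of `ser2` (the other variable names are illustrative and not part of the original exercise):

```python
import pandas as pd

ser2 = pd.Series([4, 5, 6, 7, 8])  # default index labels are 0..4

label_hit = 4 in ser2            # True: 4 is an index label
label_miss = 8 in ser2           # False: 8 is a value, but not a label
value_hit = 8 in ser2.values     # True: membership test on the underlying array

# isin() is the vectorised way to test values against another series
vectorised = pd.Series([4, 8]).isin(ser2).tolist()
```

This is why only `5` survived the filter: it is the one value of `ser1` that is not among the labels 0 to 4 of `ser2`.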

In [58]:
# Solution 1
pd.Series([i for i in ser1 if i not in list(ser2)])

0    1
1    2
2    3
dtype: int64

In [62]:
# Solution 2
ser1[~ser1.isin(ser2)]

0    1
1    2
2    3
dtype: int64

### 7. How to get the items not common to both series A and series B?
Difficulty Level: L2

Get all items of ser1 and ser2 not common to both.

In [63]:
# Input

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

In [69]:
# Solution 1 
# pd.concat gives the same result as Series.append, which later pandas versions removed
pd.concat([ser1[~ser1.isin(ser2)], ser2[~ser2.isin(ser1)]])

0    1
1    2
2    3
2    6
3    7
4    8
dtype: int64

In [70]:
# Solution 2
ser_u = pd.Series(np.union1d(ser1,ser2)) # np.union1d()
ser_i = pd.Series(np.intersect1d(ser1,ser2)) # np.intersect1d()
ser_u[~ser_u.isin(ser_i)]

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64
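
As a possible third route, NumPy's `np.setxor1d` computes the set exclusive-or directly. This is a sketch that swaps in a different technique from the two solutions above; note that it returns sorted unique values and resets the index.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Sorted unique values that appear in exactly one of the two inputs
not_common = pd.Series(np.setxor1d(ser1, ser2))
```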

### 8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?
Difficulty Level: L2

Compute the minimum, 25th percentile, median, 75th, and maximum of ser.

In [72]:
# Input

ser = pd.Series(np.random.normal(10, 5, 25))

In [77]:
# Solution 1
# np.percentile expects percentages (0 to 100), not fractions (0 to 1);
# it returns the same values as ser.quantile in Solution 2
np.percentile(ser, [0, 25, 50, 75, 100])

In [80]:
# Solution 2
ser.quantile([0, 0.25, 0.5, 0.75, 1])

0.00    -1.818907
0.25     4.803858
0.50     7.330702
0.75    13.739595
1.00    17.997324
dtype: float64
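
The two functions use different scales for the same quantity: `np.percentile` takes percentages from 0 to 100, while `Series.quantile` takes fractions from 0 to 1. A small sketch on a fixed, illustrative series (not the random one above) shows that they agree once the scales are matched:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # small illustrative series

via_numpy = np.percentile(ser, [0, 25, 50, 75, 100])        # percentages, 0 to 100
via_pandas = ser.quantile([0, 0.25, 0.5, 0.75, 1]).values   # fractions, 0 to 1
```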

### 9. How to get frequency counts of unique items of a series?
Difficulty Level: L1

Calculate the frequency counts of each unique value of ser.

In [81]:
# Input

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

In [84]:
ser.value_counts()

e    8
a    6
b    4
g    3
h    3
f    3
d    2
c    1
dtype: int64

### 10. How to keep only the top 2 most frequent values as they are and replace everything else with ‘Others’?
Difficulty Level: L2

From ser, keep the top 2 most frequent items as they are and replace everything else with ‘Others’.

In [102]:
# Input

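# Note: RandomState(100) creates a generator object that is discarded here, so the
# draw below is not seeded; call np.random.seed(100) for reproducible output
np.random.RandomState(100)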
ser = pd.Series(np.random.randint(1, 5, [12]))
ser

0     3
1     3
2     1
3     1
4     2
5     1
6     2
7     1
8     4
9     2
10    4
11    4
dtype: int64

In [103]:
ser[~ser.isin(ser.value_counts().index[:2])] = 'Others'
ser

0     Others
1     Others
2          1
3          1
4     Others
5          1
6     Others
7          1
8          4
9     Others
10         4
11         4
dtype: object
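
The assignment above modifies `ser` in place and changes its dtype. A non-mutating variant can be sketched with `Series.where`, which keeps values where the condition holds and substitutes the second argument elsewhere. The data below is illustrative, chosen so that the top 2 values are unambiguous.

```python
import pandas as pd

ser = pd.Series([1, 1, 1, 2, 2, 3, 4])  # illustrative data with a clear top 2

top2 = ser.value_counts().index[:2]  # value_counts sorts by frequency, descending

# Cast to object first so the result can hold both numbers and strings
relabelled = ser.astype(object).where(ser.isin(top2), 'Others')
```

The original `ser` is left untouched, which helps when the numeric values are still needed later.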

### 11. How to bin a numeric series to 10 groups of equal size?
Difficulty Level: L2

Bin the series ser into 10 equal deciles and replace the values with the bin name.

In [104]:
# Input

ser = pd.Series(np.random.random(20))
# Desired Output

# First 5 items
# 0    7th
# 1    9th
# 2    7th
# 3    3rd
# 4    8th
# dtype: category
# Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]

In [117]:
pd.qcut(ser, q=10, labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head(2)

0    2nd
1    5th
dtype: category
Categories (10, object): ['1st' < '2nd' < '3rd' < '4th' ... '7th' < '8th' < '9th' < '10th']

### 12. How to convert a series to a dataframe of given shape?
Difficulty Level: L1

Reshape the series ser into a dataframe with 7 rows and 5 columns

In [119]:
# Input

ser = pd.Series(np.random.randint(1, 10, 35))
ser.head()

0    1
1    4
2    1
3    9
4    9
dtype: int64

In [124]:
pd.DataFrame(ser.values.reshape(7,5)).head()

Unnamed: 0,0,1,2,3,4
0,1,4,1,9,9
1,4,1,6,6,6
2,7,2,6,5,8
3,4,4,8,7,2
4,2,9,6,8,9


### 13. How to find the positions of numbers that are multiples of 3 from a series?
Difficulty Level: L2


Find the positions of numbers that are multiples of 3 from ser.

In [133]:
# Input

ser = pd.Series(np.random.randint(1, 10, 7))

In [135]:
ser[ser % 3 == 0].index

Int64Index([1, 5, 6], dtype='int64')
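
Note that `.index` returns index labels, which coincide with positions only because `ser` has the default integer index. A sketch of how the two differ, using a hypothetical series with letter labels and `np.flatnonzero` to get true integer positions:

```python
import numpy as np
import pandas as pd

# Illustrative series with non-default index labels
ser = pd.Series([3, 4, 6, 7, 9], index=list('abcde'))

is_multiple = ser % 3 == 0
labels = ser[is_multiple].index.tolist()              # index labels
positions = np.flatnonzero(is_multiple.to_numpy())    # 0-based integer positions
```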

### 14. How to extract items at given positions from a series
Difficulty Level: L1

From ser, extract the items at positions in list pos.

In [140]:
# Input

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

In [142]:
ser.iloc[pos]  # iloc selects by integer position; loc would select by index label

0     a
4     e
8     i
14    o
20    u
dtype: object

### 15. How to stack two series vertically and horizontally?
Difficulty Level: L1

Stack ser1 and ser2 vertically and horizontally (to form a dataframe).

In [143]:
# Input

ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

In [152]:
pd.concat([ser1,ser2]).to_frame()

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
0,a
1,b
2,c
3,d
4,e


In [149]:
pd.concat([ser1,ser2], axis=1)

Unnamed: 0,0,1
0,0,a
1,1,b
2,2,c
3,3,d
4,4,e


### 16. How to get the positions of items of series A in another series B?
Difficulty Level: L2

Get the positions of items of ser2 in ser1 as a list.

In [153]:
# Input

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

In [169]:
# Solution 1

[np.argwhere(ser1.values == i)[0,0] for i in ser2]

[5, 4, 0, 8]

In [182]:
# Solution 2

[np.where(ser1 == i)[0][0] for i in ser2]

[5, 4, 0, 8]
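
Both solutions scan `ser1` once per item of `ser2`. A vectorised alternative can be sketched with `pd.Index.get_indexer`; this is a different technique from the ones above and assumes the values of `ser1` are unique, which holds for this input.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# get_indexer returns the position of each target in the Index, or -1 if absent.
# It requires the Index values to be unique.
positions = pd.Index(ser1).get_indexer(ser2).tolist()
```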

### 17. How to compute the mean squared error on a truth and predicted series?
Difficulty Level: L2

Compute the mean squared error of truth and pred series.

In [183]:
# Input

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

In [184]:
np.mean((truth - pred)**2)

0.31190697545738433

### 18. How to convert the first character of each element in a series to uppercase?
Difficulty Level: L2

Change the first character of each word in ser to upper case.

In [189]:
# Input

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

In [193]:
pd.Series([word[0].upper() + word[1:] for word in ser])

0     How
1      To
2    Kick
3    Ass?
dtype: object

In [195]:
pd.Series([word.title() for word in ser]) # title() makes the first letter in each word uppercase

0     How
1      To
2    Kick
3    Ass?
dtype: object

In [194]:
ser.map(lambda x: x.title())

0     How
1      To
2    Kick
3    Ass?
dtype: object
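
The approaches above agree on this input, but they are not equivalent in general. The sketch below adds the vectorised `Series.str.capitalize()` and uses a hypothetical multi-word, mixed-case string to show where slicing, `title()` and `capitalize()` diverge:

```python
import pandas as pd

ser = pd.Series(['how', 'to', 'kick', 'ass?'])
capitalised = ser.str.capitalize().tolist()  # vectorised string method

phrase = 'new yorK city'  # hypothetical example string

first_only = phrase[0].upper() + phrase[1:]   # touches only the first character
titled = phrase.title()                        # capitalises every word, lowercases the rest
capitalized = phrase.capitalize()              # capitalises the first word, lowercases the rest
```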

### 19. How to calculate the number of characters in each word in a series?
Difficulty Level: L2

In [196]:
# Input

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

In [197]:
[len(word) for word in ser]

[3, 2, 4, 4]
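
The list comprehension returns a plain Python list. If a Series with the original index is preferred, the `.str` accessor offers a vectorised equivalent, sketched here on the same input:

```python
import pandas as pd

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

lengths = ser.str.len()  # Series of lengths; the index of ser is preserved
```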

### 20. How to compute difference of differences between consecutive numbers of a series?
Difficulty Level: L1

Compute the difference of differences between the consecutive numbers of ser.

In [198]:
# Input

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Desired Output

# [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
# [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]

In [207]:
print(ser.diff().tolist()) # diff() subtracts the previous element from each element
print(ser.diff().diff().tolist())

[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]


### 21. How to convert a series of date-strings to a timeseries?
Difficulty Level: L2

In [208]:
# Input

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
# Desired Output

# 0   2010-01-01 00:00:00
# 1   2011-02-02 00:00:00
# 2   2012-03-03 00:00:00
# 3   2013-04-04 00:00:00
# 4   2014-05-05 00:00:00
# 5   2015-06-06 12:20:00
# dtype: datetime64[ns]

In [213]:
pd.to_datetime(ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

### 22. How to get the day of month, week number, day of year and day of week from a series of date strings?
Difficulty Level: L2

Get the day of month, week number, day of year and day of week from ser.

In [237]:
# Input

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
# Desired output

# Date:  [1, 2, 3, 4, 5, 6]
# Week number:  [53, 5, 9, 14, 19, 23]
# Day num of year:  [1, 33, 63, 94, 125, 157]
# Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']

In [259]:
ser = pd.to_datetime(ser)
# Series.dt -> accessor for the datetime-like properties of the Series values
print(ser.dt.day.tolist()) # dt.day
print(ser.dt.isocalendar().week.tolist()) # dt.isocalendar().week
print(ser.dt.dayofyear.tolist()) # dt.dayofyear
print(ser.dt.day_name().tolist()) # dt.day_name()

[1, 2, 3, 4, 5, 6]
[53, 5, 9, 14, 19, 23]
[1, 33, 63, 94, 125, 157]
['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']


### 23. How to convert year-month string to dates corresponding to the 4th day of the month?
Difficulty Level: L2


Change ser to dates that start with 4th of the respective months.

In [260]:
# Input

ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])
# Desired Output

# 0   2010-01-04
# 1   2011-02-04
# 2   2012-03-04
# dtype: datetime64[ns]

In [265]:
pd.to_datetime('04 ' + ser)

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]

### 24. How to filter words that contain at least 2 vowels from a series?
Difficulty Level: L3

From ser, extract words that contain at least 2 vowels.

In [266]:
# Input

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Desired Output

# 0     Apple
# 1    Orange
# 4     Money
# dtype: object

In [288]:
# Count the vowels in each word and keep words with at least 2
m = [sum(letter in 'aeiou' for letter in word.lower()) >= 2 for word in ser]
ser[m]

0     Apple
1    Orange
4     Money
dtype: object
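
A vectorised alternative can be sketched with `Series.str.count`, which counts regex matches per element. The `flags=re.IGNORECASE` argument takes the place of the `lower()` call in the solution above.

```python
import re
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Number of vowels in each word, ignoring case
vowel_counts = ser.str.count('[aeiou]', flags=re.IGNORECASE)
result = ser[vowel_counts >= 2]
```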

### 25. How to filter valid emails from a series?
Difficulty Level: L3

Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

In [294]:
# Input

emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'

# Desired Output

# 1    rameses@egypt.com
# 2            matt@t.co
# 3    narendra@modi.com
# dtype: object

In [310]:
# Solution 1
import re
mask = [bool(re.match(pattern, email)) for email in emails]
emails[mask]

1    rameses@egypt.com
2            matt@t.co
3    narendra@modi.com
dtype: object

In [311]:
# Solution 2
mask = emails.str.contains(pattern)
emails[mask]

1    rameses@egypt.com
2            matt@t.co
3    narendra@modi.com
dtype: object

### 26. How to get the mean of a series grouped by another series?
Difficulty Level: L2

Compute the mean of weights of each fruit.

In [319]:
# Input

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
print(weights.tolist())
print(fruit.tolist())
#> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
#> ['banana', 'carrot', 'apple', 'carrot', 'carrot', 'apple', 'banana', 'carrot', 'apple', 'carrot']
# Desired output

# values can change due to randomness
# apple     6.0
# banana    4.0
# carrot    5.8
# dtype: float64

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
['apple', 'carrot', 'banana', 'banana', 'banana', 'banana', 'carrot', 'apple', 'carrot', 'banana']


In [321]:
weights.groupby(fruit).mean() # no need to use dataframe

apple     4.5
banana    5.6
carrot    6.0
dtype: float64

### 27. How to compute the euclidean distance between two series?
Difficulty Level: L2

Compute the euclidean distance between series (points) p and q, without using a packaged formula.

In [323]:
# Input

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# Desired Output

# 18.165

In [329]:
np.sqrt(((p - q)**2).sum())

18.16590212458495

### 28. How to find all the local maxima (or peaks) in a numeric series?
Difficulty Level: L3

Get the positions of peaks (values surrounded by smaller values on both sides) in ser.

In [330]:
# Input

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
# Desired output

# array([1, 5, 7])

In [344]:
ser[(ser.diff(-1) > 0) & (ser.diff(1) > 0)]

1    10
5    10
7     7
dtype: int64
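
The solution above returns the peak values, with their positions as the index. To obtain the positions as an array, as in the desired output, one option is to look for sign changes in the first difference. This is a sketch of that alternative technique; a peak is where the slope goes from positive to negative, so the difference of signs is -2.

```python
import numpy as np
import pandas as pd

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# Sign of the slope between neighbours: +1 rising, -1 falling
slope_sign = np.sign(np.diff(ser.to_numpy()))

# A change from +1 to -1 gives -2; shift by 1 to point at the peak itself
peak_positions = np.where(np.diff(slope_sign) == -2)[0] + 1
```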

### 29. How to replace missing spaces in a string with the least frequent character?
Replace the spaces in my_str with the least frequent character.

Difficulty Level: L2

In [351]:
# Input

my_str = 'dbc deb abed gade'
# Desired Output

# 'dbccdebcabedcgade'  # least frequent is 'c'

In [368]:
ser = pd.Series(list(my_str)) # string to list to Series
freq = ser.value_counts()
# 'c' and 'g' both occur once, so either may come last in the sorted counts
least_freq = freq.dropna().index[-1]
my_str.replace(' ', least_freq)

'dbcgdebgabedggade'

### 30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (saturdays) after that having random numbers as values?
Difficulty Level: L2

In [369]:
# Desired output


# values can be random
# 2000-01-01    4
# 2000-01-08    1
# 2000-01-15    8
# 2000-01-22    4
# 2000-01-29    4
# 2000-02-05    2
# 2000-02-12    4
# 2000-02-19    9
# 2000-02-26    6
# 2000-03-04    6

In [382]:
pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
# pd.date_range(start, end, periods, freq)

2000-01-01    7
2000-01-08    3
2000-01-15    7
2000-01-22    5
2000-01-29    8
2000-02-05    6
2000-02-12    4
2000-02-19    3
2000-02-26    1
2000-03-04    9
Freq: W-SAT, dtype: int64

### 31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?
Difficulty Level: L2

ser has missing dates and values. Make all missing dates appear and fill up with value from previous date.

In [4]:
# Input

ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
print(ser)
#> 2000-01-01     1.0
#> 2000-01-03    10.0
#> 2000-01-06     3.0
#> 2000-01-08     NaN
#> dtype: float64

# Desired Output

# 2000-01-01     1.0
# 2000-01-02     1.0
# 2000-01-03    10.0
# 2000-01-04    10.0
# 2000-01-05    10.0
# 2000-01-06     3.0
# 2000-01-07     3.0
# 2000-01-08     3.0

2000-01-01     1.0
2000-01-03    10.0
2000-01-06     3.0
2000-01-08     NaN
dtype: float64


In [27]:
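# resample('D').ffill() only fills the newly created dates, so the existing NaN
# on 2000-01-08 would remain; asfreq() followed by ffill() fills it as well
ser.resample('D').asfreq().ffill()

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     3.0
Freq: D, dtype: float64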

### 32. How to compute the autocorrelations of a numeric series?
Difficulty Level: L3

Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation.

In [88]:
# Input

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

# Desired output

# values will change due to randomness
# [0.29999999999999999, -0.11, -0.17000000000000001, 0.46000000000000002, 0.28000000000000003, -0.040000000000000001, -0.37, 0.41999999999999998, 0.47999999999999998, 0.17999999999999999]
# Lag having highest correlation:  9

In [89]:
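# Lags start at 1 while np.argmax returns a 0-based position, hence the + 1
autocorrelations = [ser.autocorr(i) for i in range(1, 11)]
np.argmax(autocorrelations) + 1

4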

### 33. How to import only every nth row from a csv file to create a dataframe?
Difficulty Level: L2

Import every 50th row of BostonHousing dataset as a dataframe.

In [110]:
df_chunks = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50)
df2 = pd.DataFrame()

In [111]:
for df_chunk in df_chunks:
    df2 = df2.append(df_chunk.iloc[0, :])

In [112]:
df2.head()

Unnamed: 0,age,b,chas,crim,dis,indus,lstat,medv,nox,ptratio,rad,rm,tax,zn
0,65.2,396.9,0.0,0.00632,4.09,2.31,4.98,24.0,0.538,15.3,1.0,6.575,296.0,18.0
50,45.7,395.56,0.0,0.08873,6.8147,5.64,13.45,19.7,0.439,16.8,4.0,5.963,243.0,21.0
100,79.9,394.76,0.0,0.14866,2.7778,8.56,9.42,27.5,0.52,20.9,5.0,6.727,384.0,0.0
150,97.3,372.8,0.0,1.6566,1.618,19.58,14.1,21.5,0.871,14.7,5.0,6.122,403.0,0.0
200,13.9,384.3,0.0,0.01778,7.6534,1.47,4.45,32.9,0.403,17.0,3.0,7.135,402.0,95.0
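
An alternative to reading in chunks is to pass a callable to the `skiprows` parameter of `pd.read_csv`, so that unwanted lines are skipped while parsing. The sketch below uses a small in-memory CSV (hypothetical data, built with `io.StringIO`) instead of the BostonHousing URL, so that it runs without network access.

```python
import io
import pandas as pd

# Stand-in for the CSV file: a header plus 120 data rows (hypothetical data)
csv_text = 'x,y\n' + '\n'.join(f'{i},{i * 2}' for i in range(120))

# skiprows receives the 0-based line number in the file; line 0 is the header.
# Keep the header and every data row whose position is a multiple of 50.
every_50th = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda line: line > 0 and (line - 1) % 50 != 0,
)
```

Unlike the chunked approach, this keeps the original column order and renumbers the index from 0 rather than preserving the labels 0, 50, 100.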


### 34. How to change column values when importing csv to a dataframe?
Difficulty Level: L2

Import the Boston housing dataset, but while importing change the 'medv' (median house value) column so that values > 25 become ‘High’ and all other values become ‘Low’.

In [115]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
    converters = {'medv': lambda x: 'High' if float(x) > 25 else 'Low'}
) # Dict of functions to convert values in certain columns

### 35. How to create a dataframe with rows as strides from a given series?
Difficulty Level: L3

In [117]:
# Input

L = pd.Series(range(15))

# Desired Output

# array([[ 0,  1,  2,  3],
#        [ 2,  3,  4,  5],
#        [ 4,  5,  6,  7],
#        [ 6,  7,  8,  9],
#        [ 8,  9, 10, 11],
#        [10, 11, 12, 13]])

In [126]:
L

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
dtype: int64
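
The notebook shows only the input for this exercise. One way to produce the desired output is sketched below, assuming a window length of 4 and a stride of 2 (both inferred from the desired output); the helper name `gen_strides` is hypothetical.

```python
import numpy as np
import pandas as pd

def gen_strides(values, stride_len=2, window_len=4):
    """Return a 2D array whose rows are overlapping windows of `values`."""
    # Number of full windows that fit when stepping by stride_len
    n_strides = (len(values) - window_len) // stride_len + 1
    starts = range(0, n_strides * stride_len, stride_len)
    return np.array([values[s:s + window_len] for s in starts])

L = pd.Series(range(15))
strides = gen_strides(L.to_numpy())
df = pd.DataFrame(strides)  # each row is one stride
```

The last element (14) is dropped because it does not fit into a full window of 4 at stride 2.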

### 36. How to import only specified columns from a csv file?
Difficulty Level: L1

Import ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a dataframe.

In [128]:
df = pd.read_csv(
    'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
    usecols = ['crim', 'medv']
)
df.head(2)

Unnamed: 0,crim,medv
0,0.00632,24.0
1,0.02731,21.6
