# Supplementary Session on Important Series Methods

In [2]:
import numpy as np
import pandas as pd

In [31]:
# reading datasets
subs = pd.read_csv('subs.csv')
subs = subs.squeeze()
subs

0       48
1       57
2       40
3       43
4       44
      ... 
360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64

In [21]:
kohli_runs = pd.read_csv('kohli_ipl.csv', index_col='match_no')
kohli_runs = kohli_runs.squeeze()
print(type(kohli_runs))
kohli_runs

<class 'pandas.core.series.Series'>


match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64

In [4]:
bollywood = pd.read_csv('bollywood.csv', index_col='movie')
bollywood = bollywood.squeeze()
bollywood

movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object

### astype
`astype()` returns a new series whose elements are cast to the given datatype. Choosing a smaller datatype reduces memory usage.

In [5]:
import sys
sys.getsizeof(kohli_runs)

3456

In [6]:
kohli_runs.astype('int16')

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int16

In [7]:
# size is reduced
sys.getsizeof(kohli_runs.astype('int16'))

2166
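
The `sys.getsizeof` figures above also include the index and a small object overhead. As a rough sketch of how to isolate the saving on the values themselves, the example below uses a hypothetical stand-in Series (`runs`, not the real `kohli_runs` data) and `memory_usage(index=False)`. It also checks that the values fit into `int16` first, since an out-of-range value may not raise an error when cast.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for kohli_runs: 215 small, non-negative integers.
runs = pd.Series(np.arange(215) % 114, dtype='int64')

# int16 stores values from -32768 to 32767; check that the data fits
# before casting, because an overflow may go unnoticed.
limits = np.iinfo(np.int16)
fits = runs.between(limits.min, limits.max).all()

small = runs.astype('int16') if fits else runs

# memory_usage(index=False) counts only the bytes that hold the values.
bytes_int64 = runs.memory_usage(index=False)   # 8 bytes per value
bytes_int16 = small.memory_usage(index=False)  # 2 bytes per value
print(bytes_int64, bytes_int16)
```

Note that `astype()` does not change `runs` itself; the result has to be assigned to a name to be kept.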

### between
`between()` checks whether each value of the series lies within the provided range (both endpoints are included by default) and returns a Boolean Series.

In [8]:
# checking how many times runs have been scored between 51 and 99
kohli_runs[kohli_runs.between(51, 99)].size

43
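
To make the endpoint behaviour concrete, here is a small sketch on made-up scores (the `scores` Series is illustrative only). It assumes pandas 1.3 or later, where the `inclusive` parameter accepts the strings `'both'`, `'neither'`, `'left'` and `'right'`.

```python
import pandas as pd

# Illustrative values sitting on and around the boundaries 51 and 99.
scores = pd.Series([50, 51, 75, 99, 100])

# Default: both endpoints count as "between".
mask_both = scores.between(51, 99)

# Exclude both endpoints (string form assumes pandas >= 1.3).
mask_neither = scores.between(51, 99, inclusive='neither')

# The default is equivalent to chaining two comparisons.
explicit = (scores >= 51) & (scores <= 99)

print(mask_both.sum(), mask_neither.sum())
```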

### clip
`clip()` limits the values of the series to a given range: values outside the range are replaced by the nearest bound. It returns a new Series and leaves the original unchanged.

In [9]:
subs.clip(100, 200)

0      100
1      100
2      100
3      100
4      100
      ... 
360    200
361    200
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64


Now, all the values below 100 have been set to 100 and all the values above 200 have been set to 200 (the lower and upper bounds), while the values lying in between are preserved.
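
The same behaviour can be checked on a handful of made-up numbers. The sketch below uses an illustrative `daily` Series (not the real `subs` data) and also shows that only one bound may be given via the `lower` or `upper` keyword.

```python
import pandas as pd

# Illustrative values below, inside and above the range 100 to 200.
daily = pd.Series([48, 100, 155, 200, 231])

# Both bounds: 48 is raised to 100, 231 is lowered to 200.
clipped = daily.clip(lower=100, upper=200)

# Only a lower bound: large values are left untouched.
floor_only = daily.clip(lower=100)

print(clipped.tolist())
print(floor_only.tolist())
print(daily.tolist())  # the original Series is unchanged
```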

### drop_duplicates
`drop_duplicates()`, as the name suggests, drops the duplicate values of a Series, keeping the first occurrence of each value by default.

In [10]:
temp_series = pd.Series([1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
temp_series

0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     4
9     4
10    4
dtype: int64

In [11]:
temp_series.drop_duplicates()

0    1
2    2
4    3
7    4
dtype: int64

Additionally, we can choose which occurrence of each duplicated value to keep, using the `keep` parameter of `drop_duplicates()`.

In [12]:
# the last occurrence of each duplicated value is kept
temp_series.drop_duplicates(keep='last')

1     1
3     2
6     3
10    4
dtype: int64

In [13]:
# flags every repeat of an earlier value with True (the first occurrence stays False)
temp_series.duplicated()

0     False
1      True
2     False
3      True
4     False
5      True
6      True
7     False
8      True
9      True
10     True
dtype: bool

In [14]:
# to get the number of duplicates in the Series
temp_series.duplicated().sum()

np.int64(7)
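
Besides `'first'` and `'last'`, the `keep` parameter also accepts `False`, which treats every occurrence of a repeated value as a duplicate. A minimal sketch on an illustrative `values` Series:

```python
import pandas as pd

# Illustrative data: 1 and 3 are repeated, 2 and 4 appear once.
values = pd.Series([1, 1, 2, 3, 3, 4])

# keep=False drops every value that occurs more than once.
appear_once = values.drop_duplicates(keep=False)

# duplicated(keep=False) marks all occurrences of repeated values.
all_repeats = values.duplicated(keep=False)

# nunique() counts the distinct values directly.
n_distinct = values.nunique()

print(appear_once.tolist(), all_repeats.tolist(), n_distinct)
```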

### isnull / missing values

In [15]:
temp2_series = pd.Series([1, 2, 3, np.nan, 5, 6, np.nan, 8, np.nan, 10])
temp2_series

0     1.0
1     2.0
2     3.0
3     NaN
4     5.0
5     6.0
6     NaN
7     8.0
8     NaN
9    10.0
dtype: float64

The series' `size` attribute counts every element, including null or missing values, while the series' `count()` method counts only the non-missing values.

In [16]:
print(temp2_series.size)
print(temp2_series.count())

10
7


In [17]:
kohli_runs.isnull().sum()

np.int64(0)

There are 0 null values in the kohli_runs Series. Let's try the temp2_series.

In [18]:
temp2_series.isnull().sum()

np.int64(3)
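
The three numbers are related: `size` equals `count()` plus the number of nulls. The sketch below verifies this on an illustrative `readings` Series and also computes the share of missing values, which is often handy when deciding between filling and dropping.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing entries.
readings = pd.Series([1, 2, np.nan, 4, np.nan])

total = readings.size              # all elements, missing included
present = readings.count()         # non-missing elements only
missing = readings.isnull().sum()  # missing elements only

# The mean of a Boolean Series is the fraction of True values.
share_missing = readings.isnull().mean()

print(total, present, missing, share_missing)
```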

Since there are 3 null values in the Series, we can either **fill** the missing values, or we can **drop** the missing values.

### dropna
`dropna()` generates a new series, dropping all the missing values from the original series.

In [19]:
temp2_series.dropna()

0     1.0
1     2.0
2     3.0
4     5.0
5     6.0
7     8.0
9    10.0
dtype: float64
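
Notice in the output above that the surviving rows keep their original index labels, leaving gaps at 3, 6 and 8. If a continuous index is wanted, one option is to chain `reset_index(drop=True)`, as in this small sketch on an illustrative Series:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing entry at label 2.
s = pd.Series([1.0, 2.0, np.nan, 4.0])

# dropna() keeps the original labels, so label 2 is simply absent.
dropped = s.dropna()

# reset_index(drop=True) renumbers from 0 and discards the old labels.
renumbered = dropped.reset_index(drop=True)

print(dropped.index.tolist())
print(renumbered.index.tolist())
```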

### fillna
`fillna()` is used to replace the missing values with another suitable value, such as the mean, that approximately represents the data points.

In [20]:
temp2_series.fillna(temp2_series.mean())

0     1.0
1     2.0
2     3.0
3     5.0
4     5.0
5     6.0
6     5.0
7     8.0
8     5.0
9    10.0
dtype: float64

All the missing values in the temp2_series are replaced with the **mean** in the above code. The result is a new Series; had the `inplace` parameter been set to `True`, the changes would have been made to the original Series instead.
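
The mean is only one possible fill value. As a hedged illustration on a made-up Series `s`, the sketch below compares a mean fill with a forward fill (`ffill()`, which repeats the last valid value), and confirms that the original Series keeps its missing values until the result is assigned back.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing entries.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# mean() skips NaN by default, so the fill value is (1 + 3 + 5) / 3.
filled_mean = s.fillna(s.mean())

# ffill() propagates the last valid value forward.
filled_ffill = s.ffill()

# Neither call changed s; assigning the result back is one way to keep it.
still_missing = s.isnull().sum()

print(filled_mean.tolist())
print(filled_ffill.tolist())
print(still_missing)
```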

### isin
`isin()` returns a Boolean Series that is `True` wherever an element of the original series matches one of the values passed to `isin()`. This avoids writing a chain of repetitive equality comparisons and keeps the code concise.

In [21]:
# without isin
# returning the values where either condition is true
kohli_runs[(kohli_runs == 49) | (kohli_runs == 99)]

match_no
82    99
86    49
Name: runs, dtype: int64

In [22]:
# the same result, written more concisely with isin
kohli_runs[kohli_runs.isin([49, 99])]

match_no
82    99
86    49
Name: runs, dtype: int64
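
The Boolean Series returned by `isin()` can also be inverted with `~` to select everything except the listed values. A minimal sketch on an illustrative `runs` Series (not the real `kohli_runs` data):

```python
import pandas as pd

# Illustrative scores indexed by a made-up match number.
runs = pd.Series([49, 10, 99, 0, 49], index=[1, 2, 3, 4, 5])

# Rows whose value is 49 or 99.
wanted = runs[runs.isin([49, 99])]

# ~ flips the mask: rows whose value is neither 49 nor 99.
others = runs[~runs.isin([49, 99])]

print(wanted.index.tolist(), others.index.tolist())
```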

### apply
`apply()` applies a custom function to every element of the Series and returns the results as a new Series.

In [29]:
# example: fetching the first name of each lead actor
bollywood.apply(lambda x : x.split()[0].capitalize())

movie
Uri: The Surgical Strike                  Vicky
Battalion 609                             Vicky
The Accidental Prime Minister (film)     Anupam
Why Cheat India                          Emraan
Evening Shadows                            Mona
                                         ...   
Hum Tumhare Hain Sanam                     Shah
Aankhen (2002 film)                     Amitabh
Saathiya (film)                           Vivek
Company (film)                             Ajay
Awara Paagal Deewana                     Akshay
Name: lead, Length: 1500, dtype: object

In [32]:
# example: labelling a day 'good' if its subs count exceeds the mean, otherwise 'bad'
subs.apply(lambda x : 'good day' if x > subs.mean() else 'bad day')

0       bad day
1       bad day
2       bad day
3       bad day
4       bad day
         ...   
360    good day
361    good day
362    good day
363    good day
364    good day
Name: Subscribers gained, Length: 365, dtype: object
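
For a simple two-way label like this, a vectorized alternative is possible. The sketch below, on an illustrative `counts` Series rather than the real `subs` data, computes the mean once outside the lambda and compares the `apply()` result with `np.where`, which evaluates the condition on the whole Series at once. This is one option among several, not the only idiomatic form.

```python
import numpy as np
import pandas as pd

# Illustrative daily counts; their mean is 110.
counts = pd.Series([40, 120, 80, 200])

# Compute the mean once instead of once per element inside the lambda.
mean_count = counts.mean()

labels_apply = counts.apply(lambda x: 'good day' if x > mean_count else 'bad day')

# np.where picks 'good day' where the condition holds, 'bad day' elsewhere.
labels_where = pd.Series(
    np.where(counts > mean_count, 'good day', 'bad day'),
    index=counts.index,
)

print(labels_apply.tolist())
print(labels_where.tolist())
```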

### copy
As seen earlier, `copy()` creates an independent replica of a Series, so that changes made to the replica do not affect the original data.

In [46]:
new = kohli_runs.copy()

In [50]:
# now all the changes being made to the new series/dataframe won't be reflected in the original one
new[1] = 100

In [51]:
print(new.head())
print(kohli_runs.head())

match_no
1    100
2     23
3     13
4     12
5      1
Name: runs, dtype: int64
match_no
1     1
2    23
3    13
4    12
5     1
Name: runs, dtype: int64


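However, without `copy()`, a plain assignment such as `new = kohli_runs` would only create a second name for the same Series, so the change would have appeared in the original dataframe/series as well.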
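
The difference between a plain assignment and `copy()` can be sketched on a tiny illustrative Series (the names `original`, `alias` and `independent` are made up for this example):

```python
import pandas as pd

# Illustrative scores indexed by a made-up match number.
original = pd.Series([1, 23, 13], index=[1, 2, 3])

# Plain assignment: both names refer to the same Series object.
alias = original
alias[1] = 100            # this also changes original

# copy(): a separate Series with its own data.
independent = original.copy()
independent[2] = 0        # original is not affected

print(alias is original)        # same object
print(independent is original)  # different object
print(original.tolist())
print(independent.tolist())
```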