## Introduction
#### A large portion of data analysis is taken up by preparation: loading, cleaning, transforming and rearranging.
#### These tasks are often reported to take up 80% or more of an analyst's time, because the way data is stored in files or databases is rarely in the right format for a particular task.
#### Researchers often prefer to do ad-hoc processing of data from one form to another using languages like R, Python, etc.
#### pandas, together with built-in Python features, provides a high-level, flexible and fast set of tools that enable you to manipulate data into the right form.

## Handling missing data
#### The way missing data is represented in pandas is imperfect but functional for a lot of users.
#### For numeric data, pandas uses the floating-point value NaN (Not a Number). It is called a sentinel value and can be easily detected.

In [2]:
import pandas as pd
import numpy as np

string_data = pd.Series(['aardvak', 'artichoke', np.nan, 'avocado'])
string_data

0      aardvak
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

#### pandas adopted a convention from R, where missing values are referred to as NA (not available).
#### In statistics, NA may either be data that does not exist or data that was not observed.
#### While cleaning data, we should also analyse the missing data itself to identify data collection problems or potential bias caused by it.
#### The built-in Python value None is also treated as NA in object arrays.

In [4]:
string_data[0] = None

string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
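
#### As a rough sketch of the kind of missing-data audit mentioned above, the snippet below counts missing values per column. The DataFrame 'survey' and its column names are made up for illustration; 'isna' is an alias of 'isnull'.

```python
import numpy as np
import pandas as pd

# Hypothetical data: None and np.nan are both treated as missing
survey = pd.DataFrame({
    'name': ['ann', None, 'cat', 'dan'],
    'score': [1.5, np.nan, np.nan, 4.0],
})

missing_per_column = survey.isna().sum()    # number of missing values in each column
missing_share = survey.isna().mean()        # fraction of missing values in each column

print(missing_per_column)
print(missing_share)
```

#### A column with a much higher share of missing values than the others is a reasonable place to start looking for collection problems.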

### Filtering 'Out' Missing Data
#### We always have the option to filter out missing data by hand using 'isnull' and boolean indexing.
#### The 'dropna' method can be useful too. For a Series, it returns the Series with only the non-null data and index values.
#### For a DataFrame, it is a bit more complex. By default, dropna drops any row that contains even one missing value. Passing "how='all'" drops only the rows in which every value is NA.
#### To drop columns instead, pass 'axis=1'.

In [5]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()                                   # default dropna( axis='index' , how = 'any')

0    1.0
2    3.5
4    7.0
dtype: float64

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [7]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                   [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [8]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [9]:
data.dropna(how='all')   # drops only the rows in which every value is NaN

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [10]:
data[4] = NA     # adds a new column labeled 4, filled with NaN
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [11]:
data.dropna(axis=1, how='all')   # axis=1 targets columns: drops only the columns that are entirely NaN

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
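
#### The "by hand" route mentioned earlier can be sketched for a DataFrame as follows. This is only an illustration of how boolean indexing lines up with the two dropna modes; the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Rows with no missing values: same result as frame.dropna()
complete_rows = frame[frame.notnull().all(axis=1)]

# Rows with at least one non-missing value: same result as frame.dropna(how='all')
not_empty_rows = frame[frame.notnull().any(axis=1)]

print(complete_rows)
print(not_empty_rows)
```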


#### A related way to filter DataFrame rows tends to concern time series data.
#### To keep only rows with a certain number of observations, use the 'thresh' argument.
#### 'thresh' is the minimum number of non-NA values a row must have to be kept; any row with fewer non-NA values than thresh is dropped.

In [12]:
df = pd.DataFrame(np.random.randn(7,3))  # 7 rows x 3 columns; the iloc slices below are 0-based and exclude the end position
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

df

Unnamed: 0,0,1,2
0,-1.284292,,
1,-0.704092,,
2,-0.523355,,1.432774
3,0.707629,,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


In [13]:
df.dropna()   # drops every row that contains at least one NaN

Unnamed: 0,0,1,2
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


In [14]:
df

Unnamed: 0,0,1,2
0,-1.284292,,
1,-0.704092,,
2,-0.523355,,1.432774
3,0.707629,,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


In [15]:
df.dropna(thresh=2)  # thresh=2: keep only the rows with at least 2 non-NA values

Unnamed: 0,0,1,2
2,-0.523355,,1.432774
3,0.707629,,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963
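
#### Because the random data above changes on every run, here is a small deterministic sketch of how 'thresh' behaves. The frame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [np.nan, 5.0, 6.0],
    'c': [np.nan, np.nan, 9.0],
})

# Non-NA values per row: row 0 has 1, row 1 has 2, row 2 has 3
non_na_counts = frame.notnull().sum(axis=1)

# thresh=2 keeps the rows that have at least 2 non-NA values
kept = frame.dropna(thresh=2)

print(non_na_counts.tolist())
print(list(kept.index))
```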


### Filling In Missing Data
#### Rather than removing NAs and discarding the other information in the same rows, we can fill in the NAs in different ways.
#### 'fillna' is the workhorse method: the constant we pass replaces the missing values.
#### If we call fillna with a dict, we can use a different fill value for each column.

In [16]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-1.284292,0.0,0.0
1,-0.704092,0.0,0.0
2,-0.523355,0.0,1.432774
3,0.707629,0.0,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


In [17]:
df

Unnamed: 0,0,1,2
0,-1.284292,,
1,-0.704092,,
2,-0.523355,,1.432774
3,0.707629,,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


In the dict passed to fillna below, the entry '1: 0.5' means that any NaN values
in the column labeled 1 are filled with 0.5; likewise, '2: 0' fills column 2 with 0.

In [18]:
df.fillna({1: 0.5, 2: 0})         # dict keys are column labels: column 1 is filled with 0.5, column 2 with 0

Unnamed: 0,0,1,2
0,-1.284292,0.5,0.0
1,-0.704092,0.5,0.0
2,-0.523355,0.5,1.432774
3,0.707629,0.5,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


#### By default fillna returns a new object, but passing 'inplace=True' modifies the existing object instead.
#### The interpolation methods available for reindexing, like 'ffill', can also be used to fill missing values.
#### fillna allows you to do lots of other things, like filling with mean or median values.

In [19]:
_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-1.284292,0.0,0.0
1,-0.704092,0.0,0.0
2,-0.523355,0.0,1.432774
3,0.707629,0.0,-0.543346
4,0.06606,-0.074331,0.684093
5,-0.841558,0.36947,-0.317903
6,-0.31746,-0.791269,0.416963


In [20]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA    # rows 2 onward of column 1 are set to NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-2.298568,0.159573,-0.576376
1,-0.273622,0.72668,-1.773232
2,0.717488,,-0.27473
3,0.89961,,0.251743
4,-0.898328,,
5,-1.432355,,


In [21]:
df.ffill()   # forward fill: each NaN takes the last valid value above it (fillna(method='ffill') is deprecated in recent pandas)

Unnamed: 0,0,1,2
0,-2.298568,0.159573,-0.576376
1,-0.273622,0.72668,-1.773232
2,0.717488,0.72668,-0.27473
3,0.89961,0.72668,0.251743
4,-0.898328,0.72668,0.251743
5,-1.432355,0.72668,0.251743


In [22]:
df.ffill(limit=2)   # limit: maximum number of consecutive NaN values to fill

Unnamed: 0,0,1,2
0,-2.298568,0.159573,-0.576376
1,-0.273622,0.72668,-1.773232
2,0.717488,0.72668,-0.27473
3,0.89961,0.72668,0.251743
4,-0.898328,,0.251743
5,-1.432355,,0.251743
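
#### A minimal sketch of the fill directions on a fixed Series, so the effect of 'limit' is easy to check by eye. The Series 's' is invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # each NaN takes the last valid value before it
forward_limited = s.ffill(limit=1)   # fill at most 1 NaN in each consecutive run
backward = s.bfill()                 # each NaN takes the next valid value after it

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```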


In [23]:
data = pd.Series([1., NA, 3.5, NA, 7])
print(data.mean())
data.fillna(data.mean())

3.8333333333333335


0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
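
#### The same idea extends to a DataFrame. The sketch below, on made-up data, fills each column with its own mean, and then mixes a median with a constant through a dict.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, 10.0, 20.0],
})

# frame.mean() is a Series indexed by column label, so each column
# is filled with its own mean (x with 2.0, y with 15.0)
filled_mean = frame.fillna(frame.mean())

# A dict gives per-column control: the median for 'x', a constant for 'y'
filled_mixed = frame.fillna({'x': frame['x'].median(), 'y': 0})

print(filled_mean)
print(filled_mixed)
```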

## Data Transformation
#### So far we have looked at handling missing data.
#### Transformation covers filtering, cleaning and other related operations.

### Removing Duplicates
#### The DataFrame method 'duplicated' returns a boolean Series indicating whether each row is a duplicate (i.e. has been observed in a previous row) or not.
#### Relatedly, 'drop_duplicates' returns a DataFrame containing the rows for which 'duplicated' is False.

In [24]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1,1,2,3,3,4,4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [25]:
data.duplicated()   # return boolean series

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [26]:
data.drop_duplicates()  # rows that repeat an earlier row in every column are removed

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


#### Both the above methods by default consider all of the columns. You can also specify any subset of the DataFrame to detect duplicates.
#### By default, both keep the first observation in case of duplicates. We can specify "keep='last'" to instead keep the last observation.

In [27]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [28]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [29]:
data.drop_duplicates(['k1', 'k2'], keep='last')  # rows 5 and 6 share the same (k1, k2) pair; keep='last' retains row 6

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
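
#### To see which rows are affected before dropping anything, 'duplicated' accepts the same 'subset' and 'keep' arguments. The sketch below rebuilds the frame from above; 'keep=False' flags every member of a duplicated group.

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4],
                      'v1': range(7)})

# Compare on (k1, k2) only; with keep='last' the earlier copy is the one flagged
flags = frame.duplicated(subset=['k1', 'k2'], keep='last')

# keep=False marks all rows that belong to a duplicated group
all_copies = frame[frame.duplicated(subset=['k1', 'k2'], keep=False)]

print(flags.tolist())
print(all_copies)
```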


### Transforming Data Using a Function or Mapping
#### We sometimes need to make transformations based on the values present in an array, Series or column in a DataFrame.
#### We can use the map method with a function or dict-like object having the mapping to add or change a column.
#### Sometimes the values in the column we map from differ in case from the keys of our mapping. In such a case, we can convert all the values to lowercase first, or pass a function that does it for us.

In [30]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [31]:
data['food']=data['food'].str.lower()
data    

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [32]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [33]:
 # map series based on key: value correspondence

In [34]:
meat_to_animal = {        
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'Pig',
    'nova lox': 'salmon'
}

### map is a Series method: it does not run on an entire DataFrame, so select a column first

In [35]:
data['animal'] = lowercased.map(meat_to_animal)   # look up each food in meat_to_animal and store the result in a new column 'animal'
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,Pig
8,nova lox,6.0,salmon


In [36]:
data['food'].map(lambda x: meat_to_animal[x.lower()])     # map also accepts a function, applied to each value

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       Pig
8    salmon
Name: food, dtype: object
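
#### One difference between the two approaches shows up when a value has no entry in the mapping. The sketch below uses a shortened mapping and a deliberately unmapped value ('tofu'); the 'unknown' default is an arbitrary choice for the example.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['bacon', 'Pastrami', 'tofu'])   # 'tofu' is absent from the mapping

# Passing the dict: values without a matching key become NaN
mapped = food.str.lower().map(meat_to_animal)

# Passing a function: dict.get avoids the KeyError that meat_to_animal[x] would raise
mapped_default = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(mapped.tolist())
print(mapped_default.tolist())
```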

### Replacing Values
#### The fillna method is a special case of more general value replacement.
#### The map method can be used to modify a subset of values, but 'replace' provides a simpler and more flexible way to do so.
#### Passing the sentinel (or garbage) value followed by the replacement value creates a new object with the values replaced.
#### If we want in-place replacement, use "inplace=True".

In [37]:
data = pd.Series( [1., -999., 2., -999., 3.] )
data

0      1.0
1   -999.0
2      2.0
3   -999.0
4      3.0
dtype: float64

In [38]:
data.replace(-999, np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

#### To replace multiple values with a single value, pass a list followed by the substitute value.
#### To use a different replacement for each value, pass a list of substitutes of the same length.
#### You can also pass a dict as the argument to map each value to its substitute.
#### NOTE - 'data.replace' is different from 'data.str.replace'. The latter performs element-wise string substitution.

In [39]:
data.replace( [-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

In [40]:
data

0      1.0
1   -999.0
2      2.0
3   -999.0
4      3.0
dtype: float64

In [41]:
data.replace([-999, 1], [np.nan, 0])  # list of values to replace, matched position by position with a list of substitutes

0    0.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

In [42]:
data.replace({-999: np.nan, 1: 0})  # the same replacements, passed as a dict of {value: substitute}

0    0.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64
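
#### The difference between 'replace' and 'str.replace' noted above can be sketched on a small Series of strings. The values are invented for the example.

```python
import pandas as pd

codes = pd.Series(['N/A', 'a-1', 'b-2'])

# Series.replace matches whole values: only the element equal to 'N/A' changes
whole_value = codes.replace('N/A', 'missing')

# Series.str.replace substitutes inside each string; regex=False treats '-' literally
inside_string = codes.str.replace('-', '_', regex=False)

print(whole_value.tolist())
print(inside_string.tolist())
```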

### Renaming Axis Indexes
#### Just like values, axis labels can also be transformed by a function or mapping to produce differently labeled objects.
#### We can also modify axes in-place without any new data structure.

In [43]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['Ohio', 'Colorado', 'New York'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [44]:
transform = lambda x: x[:4].upper()

data.index.map(transform)        # the function is applied to each index label

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [45]:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


#### To get a transformed version of a dataset without modifying the original, use 'rename'.
#### It can also be used with a dict-like object providing new values for a subset of the axis labels.
#### It saves you from copying the DataFrame manually and then assigning to its index and columns. To modify in place, pass 'inplace=True'.

In [46]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [47]:
data.rename(index={'OHIO':"INDIANA"},
           columns = {'three':'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [48]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
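
#### A short sketch, on a made-up frame, confirming that 'rename' without 'inplace=True' leaves the original labels alone, and that a function and a dict can be mixed in one call.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['ohio', 'colorado'],
                     columns=['one', 'two', 'three'])

# index takes a function; columns takes a dict covering only some of the labels
renamed = frame.rename(index=str.upper, columns={'three': 'third'})

print(list(renamed.index), list(renamed.columns))
print(list(frame.index), list(frame.columns))   # the original labels are untouched
```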


## Discretization and Binning

### cut creates equal-width bins, but the number of samples in each bin can be unequal

### qcut creates bins of unequal width, but the number of samples in each bin is (roughly) equal

#### Continuous data is often discretized or separated into 'bins' for analysis.
#### To bin a set of continuous data, use the 'cut' function from pandas.
#### In the ages example further below, we bin a set of ages into the groups 18 to 25, 26 to 35, 36 to 60, and 61 and older.

In [49]:
x = np.array([24,  7,  2, 25, 22, 29])
print(x)

c = pd.cut(x, 3).value_counts()    # 3 equal-width bins, each spanning 9 units
print(c)

q = pd.qcut(x, 3).value_counts()   # 3 quantile bins: 6 values, so 2 per bin
print(q)

[24  7  2 25 22 29]
(1.973, 11.0]    2
(11.0, 20.0]     0
(20.0, 29.0]     4
Name: count, dtype: int64
(1.999, 17.0]     2
(17.0, 24.333]    2
(24.333, 29.0]    2
Name: count, dtype: int64


In [51]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

#### pandas returns a special Categorical object from the cut function.
#### The output describes the bin that each element falls in. You can treat it like an array of bin names, one per element.
#### Internally, the object contains a 'categories' array specifying the distinct category names, along with a labeling for the 'ages' data in the 'codes' attribute.

In [19]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [20]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [21]:
pd.Series(cats).value_counts()      # counts per bin are unequal with cut (top-level pd.value_counts is deprecated in recent pandas)

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

#### The interval notation used by cut is consistent with mathematical notation: a parenthesis means that the side is open, and a square bracket means that it is closed (inclusive).
#### We can change which side is closed by passing 'right=False'.
#### We can also supply our own bin names by passing a list or array to the 'labels' option.

#### In the call below, 'ages' is the array to bin and 'right' indicates whether each bin includes its right edge.
#### With right=False and the edges [18, 26, 36, 61, 100], the intervals become [18, 26), [26, 36), [36, 61) and [61, 100), which group these ages exactly as before.


In [22]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

In [29]:
group_names = ['Youth', 'YoungAdult', 'MiddleAges', 'Senior']
a = pd.cut(ages, bins, labels=group_names)   # bin labels replace the interval notation
p = pd.DataFrame(a, index=ages)              # index each label by the age it came from
p

Unnamed: 0,0
20,Youth
22,Youth
25,Youth
27,YoungAdult
21,Youth
23,Youth
37,MiddleAges
31,YoungAdult
61,Senior
45,MiddleAges
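
#### The effect of 'right' is easiest to see on values that sit exactly on a bin edge. The sketch below uses a shortened, invented set of ages and edges; the label names are arbitrary. A code of -1 marks a value that landed in no bin.

```python
import pandas as pd

ages = [18, 25, 26, 60]
bins = [18, 25, 35, 60]

# Default right=True: intervals are (18, 25], (25, 35], (35, 60], so 18 falls outside every bin
right_closed = pd.cut(ages, bins)

# right=False: intervals are [18, 25), [25, 35), [35, 60), so now 60 falls outside
left_closed = pd.cut(ages, bins, right=False, labels=['young', 'mid', 'older'])

print(right_closed.codes)
print(left_closed.codes)
```

#### Values that fall outside every bin become NaN in the result, which is worth checking for whenever the edges are chosen by hand.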


#### Instead of explicit bin edges, we can pass an integer number of bins to get equal-width bins computed from the minimum and maximum values of the data.
#### The 'precision' parameter limits the decimal precision of the bin edges; 'precision=2' limits it to 2 digits.

In [35]:
data = np.random.randint(1,12,8)
print(data)

pd.cut(data, 4, precision=2)

[11 11 10  9  2  8  4  9]


[(8.75, 11.0], (8.75, 11.0], (8.75, 11.0], (8.75, 11.0], (1.99, 4.25], (6.5, 8.75], (1.99, 4.25], (8.75, 11.0]]
Categories (4, interval[float64, right]): [(1.99, 4.25] < (4.25, 6.5] < (6.5, 8.75] < (8.75, 11.0]]

#### cut has a closely related function, 'qcut', that bins data based on sample quantiles.
#### Depending on the distribution of the data, cut will usually not give each bin the same number of data points.
#### Because qcut uses sample quantiles, you will obtain bins of roughly equal size.
#### We can also pass our own quantiles (numbers between 0 and 1, inclusive) to qcut.

In [56]:
data = np.random.randn(10)
data

array([-0.05510752, -0.37946806, -0.26335895, -1.44262749, -0.43403356,
       -0.02518621, -0.33412282,  1.02245584,  0.25333321,  1.52012784])

In [57]:
cats = pd.qcut(data, 4)     # quartile bins: equal counts, unequal widths
cats

[(-0.159, 0.184], (-1.444, -0.368], (-0.368, -0.159], (-1.444, -0.368], (-1.444, -0.368], (-0.159, 0.184], (-0.368, -0.159], (0.184, 1.52], (0.184, 1.52], (0.184, 1.52]]
Categories (4, interval[float64, right]): [(-1.444, -0.368] < (-0.368, -0.159] < (-0.159, 0.184] < (0.184, 1.52]]

In [55]:
pd.Series(cats).value_counts()    # roughly equal counts per qcut bin, but unequal bin widths

(-1.289, -0.338]     3
(0.334, 1.456]       3
(-0.338, -0.0948]    2
(-0.0948, 0.334]     2
Name: count, dtype: int64

### qcut also accepts an array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

In [52]:
data

array([-0.72507827,  0.17566237,  1.14862703, -0.23710096, -1.09502591,
       -0.33620549, -1.17454987,  0.43754054, -0.21922989, -1.35035417,
        1.31265846, -0.45782628,  0.14305673, -0.27514335, -1.01911589,
       -0.197656  ])

In [58]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.] )

[(-0.159, 1.072], (-0.535, -0.159], (-0.535, -0.159], (-1.444, -0.535], (-0.535, -0.159], (-0.159, 1.072], (-0.535, -0.159], (-0.159, 1.072], (-0.159, 1.072], (1.072, 1.52]]
Categories (4, interval[float64, right]): [(-1.444, -0.535] < (-0.535, -0.159] < (-0.159, 1.072] < (1.072, 1.52]]
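
#### Since the random data above differs between runs, here is a deterministic sketch of both qcut forms on twelve evenly spaced values. The label names are invented for the example.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 13)        # 12 distinct values, no ties

# Four quantile-based bins: each should receive the same number of values
quartiles = pd.qcut(values, 4)
counts = pd.Series(quartiles).value_counts()

# Custom quantiles: bottom 10%, next 40%, next 40%, top 10%
tails = pd.qcut(values, [0, 0.1, 0.5, 0.9, 1.0],
                labels=['low', 'mid_low', 'mid_high', 'high'])
tail_counts = pd.Series(tails).value_counts()

print(counts)
print(tail_counts)
```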

### Detecting and Filtering Outliers
#### Filtering or transforming outliers is mostly a matter of applying array operations.
#### To find values exceeding a threshold, use boolean indexing together with functions like 'abs()' as required.
#### To select all rows having at least one value that exceeds a threshold, use the 'any(axis=1)' method on the boolean DataFrame. In pandas 2.x the axis must be passed by keyword; 'any(1)' raises a TypeError.
#### Values can also be set based on these criteria, so you can cap values to an interval or threshold.
#### The 'np.sign()' function returns 1 and -1 where the data is positive and negative respectively.

In [60]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.119931,-0.068085,-0.016618,-0.022004
std,0.999296,0.986043,1.003256,0.996187
min,-2.834304,-3.029275,-2.894478,-3.476004
25%,-0.58724,-0.73069,-0.71875,-0.687263
50%,0.128533,-0.081061,-0.020833,-0.03943
75%,0.801665,0.637599,0.619955,0.651041
max,3.377642,2.821492,3.147032,3.178499


In [63]:
col = data[2]
col[np.abs(col) > 3]

61     3.147032
426    3.005477
903    3.117592
Name: 2, dtype: float64

In [67]:
data[(np.abs(data) > 3).any(axis=1)]   # axis must be passed by keyword; any(1) raises a TypeError in pandas 2.x

In [None]:
# Cap values outside the interval [-3, 3]: np.sign gives the direction, 3 the magnitude
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

In [None]:
np.sign(data).head()
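
#### The capping step can be cross-checked against 'DataFrame.clip', which is an alternative way to bound values and is not what the cells above use. The sketch below is self-contained; the seed is arbitrary and only there for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # arbitrary seed, an assumption of this sketch
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Rows where any column exceeds 3 in absolute value
outlier_rows = frame[(frame.abs() > 3).any(axis=1)]

# Cap with sign * 3 on a copy, then compare with clip
capped = frame.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3
clipped = frame.clip(lower=-3, upper=3)

print(len(outlier_rows))
print(np.allclose(capped.values, clipped.values))
```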

## String Manipulation
#### Python is a popular language for raw data manipulation, in part due to its ease of use for string and text processing.
#### Simple text operations can be done using the string object's built-in methods.
#### For more complex pattern matching and text manipulation, we can use regular expressions.
#### pandas enables us to apply both string and regex functions to whole arrays of data.

### String Object Methods
#### In most string manipulation scenarios, the built-in string methods are sufficient.
#### A string can be broken into pieces on a separator using 'split'.
#### It is often combined with 'strip' to trim whitespace, including line breaks.
#### Substrings can be concatenated using the '+' operator.
#### A faster and more Pythonic way is to pass a list or tuple of substrings to the 'join' method of the 'stitching' string.

In [73]:
val = 'a,b,   guido'

val.split(',')

['a', 'b', '   guido']

In [74]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [75]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

In [76]:
'::'.join(pieces)

'a::b::guido'
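
#### As a sketch of the two other routes mentioned in the section introduction, the snippet below does the same split-and-trim with a regular expression, and then applies the split, strip and join steps to a whole Series. The Series 'lines' is invented for the example.

```python
import re
import pandas as pd

val = 'a,b,   guido'

# Regex split: a comma plus any whitespace around it acts as the separator
pieces = re.split(r'\s*,\s*', val)

# The same cleanup applied element-wise to a Series
lines = pd.Series(['a,b,   guido', ' x , y'])
joined = lines.str.split(',').apply(lambda parts: '::'.join(p.strip() for p in parts))

print(pieces)
print(joined.tolist())
```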

#### The 'in' keyword is the best way to detect a substring, although 'index' and 'find' can also be used.
#### There is one major difference between 'find' and 'index': 'index' raises a ValueError if the substring is not found, while 'find' returns -1.
#### 'count' returns the number of occurrences of a particular substring.
#### 'replace' substitutes occurrences of one pattern for another. It is commonly used to delete patterns by passing an empty string as the replacement.

In [77]:
'guido' in val

True

In [78]:
val.index(',')

1

In [79]:
val.find(':')

-1

In [80]:
# val.index(':')   # left commented out: it would raise ValueError because ':' is not in val

In [81]:
val.count(',')

2

In [82]:
val.replace(',', '::')

'a::b::   guido'

In [83]:
val.replace(',', '')

'ab   guido'
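
#### The difference between 'index' and 'find' can be sketched with a small wrapper. 'position_or_none' is a hypothetical helper written for this example, not a built-in.

```python
val = 'a,b,   guido'

def position_or_none(text, sub):
    """Return the first index of sub in text, or None when it is absent."""
    try:
        return text.index(sub)      # raises ValueError when sub is missing
    except ValueError:
        return None

comma_pos = position_or_none(val, ',')
colon_pos = position_or_none(val, ':')

print(comma_pos)
print(colon_pos)
print(val.find(':'))                # find signals absence with -1 instead of raising
```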