# Pandas Functions

## Group By in Pandas

```python

df.groupby()

```

The groupby() function splits the data into groups based on some criteria. pandas objects can be split along any of their axes. Abstractly, grouping defines a mapping from labels to group names.

```python

df.groupby('key')

```

```python

df.groupby(['key1','key2'])

```

```python

df.groupby(key,axis=1)

```

```python

df.groupby([key1,key2],axis=1)

```

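The key can be a column label, a Series, or a list of these. The axis along which the grouping is applied can be 0 (rows) or 1 (columns); the default is 0. Note that grouping with axis=1 is deprecated in recent pandas releases (2.1 and later).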

The parameters of the groupby function are:

- by: mapping, function, label, or list of labels
- axis: int, default 0
- level: If the axis is a MultiIndex (hierarchical), group by a particular level or levels
- as_index: For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
- sort: Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
- group_keys: When calling apply, add group keys to index to identify pieces
- squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type (deprecated in pandas 1.1 and removed in 2.0)
- observed: This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
- dropna: If False, do not drop rows with null values in the grouping keys
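As a rough illustration of how a few of these parameters change the result, the sketch below groups a small made-up DataFrame (the data and variable names are assumptions for this example only) with the defaults, then with as_index=False and sort=False, and finally with dropna=False.

```python
import pandas as pd

# Hypothetical data: one row has a missing grouping key.
df = pd.DataFrame({
    "Type": ["Stationary", "Electrical", None, "Electrical"],
    "Price": [10, 100, 5, 500],
})

# Defaults: group labels become the index, keys are sorted, null keys are dropped.
default_sum = df.groupby("Type")["Price"].sum()

# as_index=False keeps the key as a regular column ("SQL-style" output);
# sort=False keeps the groups in order of first appearance.
flat_sum = df.groupby("Type", as_index=False, sort=False)["Price"].sum()

# dropna=False keeps rows whose key is null as a separate group.
with_null = df.groupby("Type", dropna=False)["Price"].sum()

print(default_sum)
print(flat_sum)
print(with_null)
```

The dropna argument assumes pandas 1.1 or later.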


In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame({'ProductName':['Bulb', 'Pen', 'Fan', 'Bulb'] , 'Price':[100, 10, 500, 100], 'Type': ['Electrical', 'Stationary', 'Electrical', 'Electrical']})


In [3]:
df

| | ProductName | Price | Type |
| --- | --- | --- | --- |
| 0 | Bulb | 100 | Electrical |
| 1 | Pen | 10 | Stationary |
| 2 | Fan | 500 | Electrical |
| 3 | Bulb | 100 | Electrical |


In [4]:
df.groupby(['ProductName']).sum() # Sum each column over the rows that share the same ProductName (string columns are concatenated).

| ProductName | Price | Type |
| --- | --- | --- |
| Bulb | 200 | ElectricalElectrical |
| Fan | 500 | Electrical |
| Pen | 10 | Stationary |


In [5]:
df.groupby(['Type']).sum() # Sum each column over the rows that share the same Type; for Electrical, the sum of Price is 700 (100+500+100).

| Type | ProductName | Price |
| --- | --- | --- |
| Electrical | BulbFanBulb | 700 |
| Stationary | Pen | 10 |


In [6]:
df.groupby([df.Type]).sum()

| Type | ProductName | Price |
| --- | --- | --- |
| Electrical | BulbFanBulb | 700 |
| Stationary | Pen | 10 |


In [7]:
df.groupby(['Type', 'ProductName']).sum() # Sum over the rows that share the same Type and ProductName; for Electrical and Bulb, the sum of Price is 200 (100+100).

| Type | ProductName | Price |
| --- | --- | --- |
| Electrical | Bulb | 200 |
| Electrical | Fan | 500 |
| Stationary | Pen | 10 |


In [8]:
df.groupby(['Type'])[["Price"]].sum() # Select only the Price column before summing; for Electrical, the sum of Price is 700 (100+500+100).

| Type | Price |
| --- | --- |
| Electrical | 700 |
| Stationary | 10 |


## Multiple index

In [9]:
A = [['Bulb', 'Bulb', 'Fan', 'Bulb', 'Fan', 'Fan'],
     ['A', 'B', 'B', 'C', 'C', 'A'],
     [100, 200, 300, 400, 500, 600],
     ]

indx = pd.MultiIndex.from_arrays(A, names=('ProductName', 'Type', 'Price'))

In [10]:
df = pd.DataFrame({'EC': [20., 10., 30., 30., 50., 60.]}, index=indx)
df

| ProductName | Type | Price | EC |
| --- | --- | --- | --- |
| Bulb | A | 100 | 20.0 |
| Bulb | B | 200 | 10.0 |
| Fan | B | 300 | 30.0 |
| Bulb | C | 400 | 30.0 |
| Fan | C | 500 | 50.0 |
| Fan | A | 600 | 60.0 |


In [11]:
df.groupby(level=0).sum() # Sum over the rows that share the same ProductName. Level 0 is the first index level (ProductName), level 1 is the second (Type) and level 2 is the third (Price).
# The sum is computed on the EC column.

| ProductName | EC |
| --- | --- |
| Bulb | 60.0 |
| Fan | 140.0 |


In [12]:
df.groupby(level=1).sum() # Sum over the rows that share the same Type.

| Type | EC |
| --- | --- |
| A | 80.0 |
| B | 40.0 |
| C | 80.0 |


In [13]:
df.groupby(level=2).sum() # Sum over the rows that share the same Price; every price is unique here, so each group holds a single row.

| Price | EC |
| --- | --- |
| 100 | 20.0 |
| 200 | 10.0 |
| 300 | 30.0 |
| 400 | 30.0 |
| 500 | 50.0 |
| 600 | 60.0 |


In [14]:
df.groupby(level="ProductName").sum() # An index level can also be referenced by name; this is equivalent to level=0.

| ProductName | EC |
| --- | --- |
| Bulb | 60.0 |
| Fan | 140.0 |
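To connect level-based grouping with the column-based grouping shown earlier, here is a minimal sketch. It rebuilds a two-level version of the index above (the Price level is left out as a simplification) and compares grouping by level with grouping by the same values as a column; it also applies two aggregations at once with agg.

```python
import pandas as pd

arrays = [
    ["Bulb", "Bulb", "Fan", "Bulb", "Fan", "Fan"],
    ["A", "B", "B", "C", "C", "A"],
]
index = pd.MultiIndex.from_arrays(arrays, names=("ProductName", "Type"))
df = pd.DataFrame({"EC": [20.0, 10.0, 30.0, 30.0, 50.0, 60.0]}, index=index)

# Group by an index level, referenced by name.
by_name = df.groupby(level="ProductName")["EC"].sum()

# Same grouping after moving the index levels back into ordinary columns.
by_column = df.reset_index().groupby("ProductName")["EC"].sum()

# Several aggregations per group in one call.
stats_by_type = df.groupby(level="Type")["EC"].agg(["sum", "mean"])

print(by_name)
print(by_column)
print(stats_by_type)
```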


## Rolling

```python 

df.rolling()

```
The rolling() function provides rolling window calculations on a Series or DataFrame. It slides a window of a given size over the data and performs a calculation on the data points that fall inside each window.

The parameters of the rolling function are:

- window: int, or offset

- min_periods: int, default None

- center: bool, default False

- win_type: str, default None

- on: str, optional

- axis: int, default 0

- closed: str, default None

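The window parameter is the size of the moving window: either a fixed number of observations or a time offset such as '2D'. The min_periods parameter is the minimum number of observations in the window required to produce a value; otherwise the result is NaN. For a fixed-size window it defaults to the window size. The center parameter is a boolean value which indicates whether the window label is placed at the center of the window instead of at its right edge. The win_type parameter names a weighting window function (for example 'gaussian'); if None, all points are weighted equally. The on parameter names a datetime-like column of a DataFrame to use instead of the index when computing the window. The axis parameter is the axis to roll along. The closed parameter controls which endpoints of the window are included ('right', 'left', 'both' or 'neither').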

For example, suppose we have this DataFrame, with Date as the index:

| Date | Value |
| --- | --- |
| 2019-01-01 | 10 |
| 2019-01-02 | 20 |
| 2019-01-03 | 30 |
| 2019-01-04 | 40 |
| 2019-01-05 | 50 |
| 2019-01-06 | 60 |

```python

df.rolling(window=2).sum()

```

| Date | Value |
| --- | --- |
| 2019-01-01 | NaN |
| 2019-01-02 | 30.0 |
| 2019-01-03 | 50.0 |
| 2019-01-04 | 70.0 |
| 2019-01-05 | 90.0 |
| 2019-01-06 | 110.0 |

```python

df.rolling(window=2,min_periods=1).sum()

```

| Date | Value |
| --- | --- |
| 2019-01-01 | 10.0 |
| 2019-01-02 | 30.0 |
| 2019-01-03 | 50.0 |
| 2019-01-04 | 70.0 |
| 2019-01-05 | 90.0 |
| 2019-01-06 | 110.0 |

```python

df.rolling(window=2,min_periods=1,center=True).sum()

```

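| Date | Value |
| --- | --- |
| 2019-01-01 | 10.0 |
| 2019-01-02 | 30.0 |
| 2019-01-03 | 50.0 |
| 2019-01-04 | 70.0 |
| 2019-01-05 | 90.0 |
| 2019-01-06 | 110.0 |

With an even window size of 2, the centered window still covers the current row and the previous one, so the result matches the previous example. The center option changes the output only for wider windows.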

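In short, with window=2 each output value is the sum of the current row and the previous one. In the first example the first value is NaN because the first row has no previous row and min_periods defaults to the window size; with min_periods=1 a single observation is enough to produce a value.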
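The sketch below rebuilds the example data so the rolling sums can be checked directly. It assumes Date is a DatetimeIndex, and it adds two variations that are not part of the examples above: a centered window of size 3 and a time-based window of '2D'.

```python
import pandas as pd

dates = pd.date_range("2019-01-01", periods=6, freq="D")
df = pd.DataFrame({"Value": [10, 20, 30, 40, 50, 60]}, index=dates)

# Fixed window of 2 rows; the first row is NaN because its window is incomplete.
strict = df.rolling(window=2).sum()

# min_periods=1 lets the first row produce a value from a single observation.
relaxed = df.rolling(window=2, min_periods=1).sum()

# Centered window of 3 rows: previous, current and next row.
centered = df.rolling(window=3, min_periods=1, center=True).sum()

# Time-based window: all rows within the 2 days ending at each timestamp.
by_time = df.rolling("2D").sum()

print(strict)
print(relaxed)
print(centered)
print(by_time)
```

With daily data, the '2D' window covers the same two rows as the fixed window of 2, but it would adapt to gaps in the dates.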

In [15]:
import numpy as np
df = pd.DataFrame({'A': np.random.randint(0,10,5),
                   'B': np.random.randint(0,10,5),
                   'C': np.random.randint(0,10,5),
                   'D': np.random.randint(0,10,5)},)

df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 5 | 4 | 6 | 3 |
| 1 | 0 | 7 | 6 | 2 |
| 2 | 9 | 6 | 3 | 3 |
| 3 | 6 | 7 | 2 | 5 |
| 4 | 4 | 1 | 2 | 2 |


In [16]:
df.rolling(2).sum() # Sum over a sliding window of 2 consecutive rows; row 0 is NaN because its window holds only one observation.

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | NaN | NaN | NaN | NaN |
| 1 | 5.0 | 11.0 | 12.0 | 5.0 |
| 2 | 9.0 | 13.0 | 9.0 | 5.0 |
| 3 | 15.0 | 13.0 | 5.0 | 8.0 |
| 4 | 10.0 | 8.0 | 4.0 | 7.0 |


In [17]:
# Removing the NaN values:

df.rolling(2, min_periods=1).sum() # With min_periods=1, a window holding a single observation still produces a value, so row 0 is no longer NaN.

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 5.0 | 4.0 | 6.0 | 3.0 |
| 1 | 5.0 | 11.0 | 12.0 | 5.0 |
| 2 | 9.0 | 13.0 | 9.0 | 5.0 |
| 3 | 15.0 | 13.0 | 5.0 | 8.0 |
| 4 | 10.0 | 8.0 | 4.0 | 7.0 |


In [18]:
df.rolling(1).sum()

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 5.0 | 4.0 | 6.0 | 3.0 |
| 1 | 0.0 | 7.0 | 6.0 | 2.0 |
| 2 | 9.0 | 6.0 | 3.0 | 3.0 |
| 3 | 6.0 | 7.0 | 2.0 | 5.0 |
| 4 | 4.0 | 1.0 | 2.0 | 2.0 |


In [19]:
import seaborn as sns

In [20]:
iris = sns.load_dataset('iris')

In [21]:
type(iris)

pandas.core.frame.DataFrame

In [22]:
iris.head()

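| | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |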


In [23]:
df = iris.drop(['species'], axis=1)

In [24]:
df.rolling(3, min_periods=1, win_type='gaussian').sum(std=3) # Weighted rolling sum over a window of 3 rows using Gaussian weights.
# win_type is the type of weighting window; std is the standard deviation of the Gaussian, passed to the aggregation method.

| | sepal_length | sepal_width | petal_length | petal_width |
| --- | --- | --- | --- | --- |
| 0 | 4.824393 | 3.310858 | 1.324343 | 0.189192 |
| 1 | 9.735201 | 6.337878 | 2.724343 | 0.389192 |
| 2 | 14.170403 | 9.337928 | 3.954091 | 0.578384 |
| 3 | 13.686615 | 8.970353 | 4.043282 | 0.578384 |
| 4 | 13.775807 | 9.532524 | 4.054091 | 0.578384 |
| ... | ... | ... | ... | ... |
| 145 | 19.470453 | 9.164949 | 16.200150 | 6.851414 |
| 146 | 18.997473 | 8.486565 | 15.321766 | 6.462222 |
| 147 | 18.786665 | 8.175757 | 14.837978 | 5.967626 |
| 148 | 18.324493 | 8.581161 | 15.037978 | 5.973030 |


## Where 

```python

df.where()

```

The where() function is used to replace values where the condition is **False**.

The parameters of the where function are:

- cond: boolean Series/DataFrame, array-like, or callable

- other: scalar, Series/DataFrame, or callable

- inplace: bool, default False

- axis: int, default None

- level: int, default None

- errors: str, default ‘raise’

- try_cast: bool, default False

The cond parameter is the condition to check: where it is True the original value is kept. The other parameter supplies the replacement for the entries where the condition is False; if it is omitted, those entries become NaN. The inplace parameter is a boolean value which indicates whether to perform the operation in place. The axis and level parameters give the alignment axis and the alignment level, if needed. The errors and try_cast parameters exist only in older pandas releases; both were removed in pandas 2.0.
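As a small sketch of the behaviour described above (the data is made up for illustration), the block below shows where() with and without a replacement, the complementary mask() method, and callables for both cond and other.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

# No replacement given: entries where the condition is False become NaN.
kept = df.where(df % 2 == 0)

# Explicit replacement value for entries where the condition is False.
filled = df.where(df % 2 == 0, 0)

# mask() is the complement: it replaces entries where the condition is True.
masked = df.mask(df % 2 != 0, 0)

# cond and other may also be callables that receive the DataFrame.
via_callable = df.where(lambda d: d > 2, lambda d: d * 10)

print(kept)
print(filled)
print(masked)
print(via_callable)
```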

In [25]:
df = pd.DataFrame(np.arange(10).reshape(5,2), columns=['A', 'B'])
df

| | A | B |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | 5 |
| 3 | 6 | 7 |
| 4 | 8 | 9 |


In [26]:
df.where(df < 5, -df)

| | A | B |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | -5 |
| 3 | -6 | -7 |
| 4 | -8 | -9 |


In [27]:
df.where( (df<5) | (df%2==0), -df)

| | A | B |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | -5 |
| 3 | 6 | -7 |
| 4 | 8 | -9 |


In [28]:
A = df.where(df['A'] < 5, -df)
A['B'] = df['B']
A

| | A | B |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | 5 |
| 3 | -6 | 7 |
| 4 | -8 | 9 |


In [29]:
df[~(df<5)] = -df
df

| | A | B |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | -5 |
| 3 | -6 | -7 |
| 4 | -8 | -9 |


## Clip

```python

df.clip()

```

The clip() function is used to trim values at input threshold(s).

The parameters of the clip function are:

- lower: int, float, or array_like, default None

- upper: int, float, or array_like, default None

- axis: int or None, default None

- inplace: bool, default False

- *args: tuple positional arguments

- **kwargs: dict keyword arguments

The lower parameter is the minimum threshold value: all values below it are set to it. The upper parameter is the maximum threshold value: all values above it are set to it. The axis parameter is the axis along which array-like thresholds are aligned. The inplace parameter is a boolean value which indicates whether to perform the operation in place. The *args and **kwargs parameters have no effect and are accepted only for compatibility with NumPy.
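The following sketch, using made-up values, illustrates a one-sided clip and array-like thresholds that differ per row. The variable names are assumptions chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 20, 35], "B": [40, 5, 15]})

# One-sided clip: only a lower threshold is applied.
floor_only = df.clip(lower=10)

# Array-like thresholds: one (lower, upper) pair per row, aligned along axis=0.
lower = pd.Series([10, 10, 30])
upper = lower + 10
rowwise = df.clip(lower, upper, axis=0)

print(floor_only)
print(rowwise)
```

In the row-wise case, row 2 is clipped to the range [30, 40], so 35 is kept and 15 is raised to 30.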


In [30]:
df = pd.DataFrame(np.random.randint(0,50,(5,10)), columns=list('ABCDEFGHIJ')) # without the columns=list('ABCDEFGHIJ'), the columns will be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
df

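| | A | B | C | D | E | F | G | H | I | J |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 23 | 43 | 19 | 26 | 23 | 17 | 10 | 7 | 28 |
| 1 | 28 | 13 | 28 | 36 | 17 | 38 | 4 | 32 | 41 | 26 |
| 2 | 6 | 20 | 37 | 0 | 20 | 43 | 2 | 15 | 43 | 21 |
| 3 | 31 | 35 | 0 | 32 | 38 | 8 | 30 | 21 | 46 | 21 |
| 4 | 16 | 37 | 3 | 14 | 9 | 16 | 3 | 6 | 17 | 8 |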


In [31]:
df.clip(10,30) # All numbers less than 10 will be 10 and all numbers greater than 30 will be 30.

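| | A | B | C | D | E | F | G | H | I | J |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10 | 23 | 30 | 19 | 26 | 23 | 17 | 10 | 10 | 28 |
| 1 | 28 | 13 | 28 | 30 | 17 | 30 | 10 | 30 | 30 | 26 |
| 2 | 10 | 20 | 30 | 10 | 20 | 30 | 10 | 15 | 30 | 21 |
| 3 | 30 | 30 | 10 | 30 | 30 | 10 | 30 | 21 | 30 | 21 |
| 4 | 16 | 30 | 10 | 14 | 10 | 16 | 10 | 10 | 17 | 10 |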


In [32]:
df[df < 10] = 10
df[df > 30] = 30

df

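| | A | B | C | D | E | F | G | H | I | J |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10 | 23 | 30 | 19 | 26 | 23 | 17 | 10 | 10 | 28 |
| 1 | 28 | 13 | 28 | 30 | 17 | 30 | 10 | 30 | 30 | 26 |
| 2 | 10 | 20 | 30 | 10 | 20 | 30 | 10 | 15 | 30 | 21 |
| 3 | 30 | 30 | 10 | 30 | 30 | 10 | 30 | 21 | 30 | 21 |
| 4 | 16 | 30 | 10 | 14 | 10 | 16 | 10 | 10 | 17 | 10 |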


## Merge

```python

df.merge()

```

The merge() function is used to merge DataFrame or named Series objects with a database-style join.

The parameters of the merge function are:

- right: DataFrame or named Series

- how: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

- on: label or list

- left_on: label or list, or array-like

- right_on: label or list, or array-like

- left_index: bool, default False

- right_index: bool, default False

- sort: bool, default False

- suffixes: tuple of (str, str), default (‘_x’, ‘_y’)

- copy: bool, default True

- indicator: bool or str, default False

- validate: str, optional

The right parameter is the DataFrame or named Series to merge with. The how parameter is the type of join to perform. The on parameter gives the column or index level names to join on; they must be present in both objects. The left_on and right_on parameters give the column or index level names to join on in the left and the right DataFrame, for keys that are named differently on each side. The left_index and right_index parameters are boolean values which indicate whether to use the index of the left or the right DataFrame as the join key(s). The sort parameter is a boolean value which indicates whether to sort the join keys lexicographically in the result. The suffixes parameter is a pair of string suffixes appended to overlapping column names from the left and the right side. The copy parameter is a boolean value which indicates whether to copy data from the passed objects into the result. The indicator parameter adds a column called _merge with the source of each row (left_only, right_only or both); a string value sets the name of that column instead. The validate parameter is a string which checks that the merge is of the stated type (one_to_one, one_to_many, many_to_one or many_to_many) and raises an error otherwise.
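To make the how, indicator and validate parameters more concrete, here is a minimal sketch on two small made-up frames. The frame contents and variable names are assumptions for illustration and are separate from the cells that follow.

```python
import pandas as pd

left = pd.DataFrame({"E": ["B", "J", "L"], "G": ["A", "E", "E"]})
right = pd.DataFrame({"E": ["L", "B", "S"], "H": [2004, 2008, 2018]})

# Inner join (the default): only keys present on both sides.
inner = pd.merge(left, right, on="E")

# Left join: every row of `left` is kept; unmatched rows get NaN in H.
left_join = pd.merge(left, right, on="E", how="left")

# Outer join with indicator=True records where each row came from.
outer = pd.merge(left, right, on="E", how="outer", indicator=True)

# validate raises MergeError if the keys do not satisfy the stated relationship.
dup_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup_right, on="E", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

print(inner)
print(left_join)
print(outer)
print("one_to_one validation failed:", validation_failed)
```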



In [33]:
df = pd.DataFrame({'E':['B', 'J', 'L', 'S'],'G': ['A', 'E', 'E', 'H']})

df1 = pd.DataFrame({'E': ['L', 'B', 'J', 'S'], 'H': [2004, 2008, 2012, 2018]})

In [34]:
print(df); print(df1)

   E  G
0  B  A
1  J  E
2  L  E
3  S  H
   E     H
0  L  2004
1  B  2008
2  J  2012
3  S  2018


In [35]:
pd.merge(df,df1) # Inner join on the shared column E: only keys present in both DataFrames are kept.

| | E | G | H |
| --- | --- | --- | --- |
| 0 | B | A | 2008 |
| 1 | J | E | 2012 |
| 2 | L | E | 2004 |
| 3 | S | H | 2018 |


In [36]:
df2 = pd.merge(df,df1)
df2

| | E | G | H |
| --- | --- | --- | --- |
| 0 | B | A | 2008 |
| 1 | J | E | 2012 |
| 2 | L | E | 2004 |
| 3 | S | H | 2018 |


In [37]:
df3 = pd.DataFrame({'G': ['E', 'A' , 'H'],
                    'S': ['C', 'G', 'S']})

pd.merge(df2,df3, on='G') # on is used to merge the dataframes by the column G.

| | E | G | H | S |
| --- | --- | --- | --- | --- |
| 0 | B | A | 2008 | G |
| 1 | J | E | 2012 | C |
| 2 | L | E | 2004 | C |
| 3 | S | H | 2018 | S |


In [38]:
df4 = pd.DataFrame({'G': ['A', 'A', 'E', 'E' , 'H', 'H'],
                    'Sk': ['M', 'S', 'C' , 'L' , 'S', 'O']})

df4

| | G | Sk |
| --- | --- | --- |
| 0 | A | M |
| 1 | A | S |
| 2 | E | C |
| 3 | E | L |
| 4 | H | S |
| 5 | H | O |


In [39]:
df5 = pd.merge(df2,df3, on='G') # on is used to merge the dataframes by the column G.
pd.merge(df4, df5, on='G')

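| | G | Sk | E | H | S |
| --- | --- | --- | --- | --- | --- |
| 0 | A | M | B | 2008 | G |
| 1 | A | S | B | 2008 | G |
| 2 | E | C | J | 2012 | C |
| 3 | E | C | L | 2004 | C |
| 4 | E | L | J | 2012 | C |
| 5 | E | L | L | 2004 | C |
| 6 | H | S | S | 2018 | S |
| 7 | H | O | S | 2018 | S |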


## Pivot Table

```python

df.pivot_table()

```

The pivot_table() function is used to create a spreadsheet-style pivot table as a DataFrame.

The parameters of the pivot_table function are:

- values: column to aggregate, optional

- index: column, Grouper, array, or list of the previous


- columns: column, Grouper, array, or list of the previous

- aggfunc: function, list of functions, dict, default numpy.mean

- fill_value: scalar, default None

- margins: bool, default False

- dropna: bool, default True

- margins_name: str, default ‘All’

- observed: bool, default False

The values parameter is the column (or columns) to aggregate. The index and columns parameters give the keys to group by along the rows and along the columns of the resulting table, respectively. The aggfunc parameter is the aggregation to apply: a function, a list of functions, or a dict mapping columns to functions; the default is the mean. The fill_value parameter is the scalar value used to replace missing values in the resulting table. The margins parameter is a boolean value which indicates whether to add an extra row and column with subtotals and grand totals. The dropna parameter is a boolean value which indicates whether to exclude columns whose entries are all NaN. The margins_name parameter is the name of the row and column that contain the totals when margins is True. The observed parameter is a boolean value which indicates whether to show only observed values for categorical groupers.
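A minimal sketch of fill_value and margins on a small made-up table follows. The sales data and its column names are hypothetical and unrelated to the Titanic example below.

```python
import pandas as pd

# Hypothetical sales records; region S never sold a Fan.
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "product": ["Pen", "Fan", "Pen", "Pen"],
    "units": [1, 2, 3, 5],
})

table = sales.pivot_table(
    values="units",      # column to aggregate
    index="region",      # row keys
    columns="product",   # column keys
    aggfunc="sum",
    fill_value=0,        # replace the missing (S, Fan) cell with 0
    margins=True,        # add the "All" row and column with totals
)

print(table)
```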



In [41]:
from seaborn import load_dataset

T = load_dataset('titanic')


In [42]:
T.head(5)

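| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |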


In [49]:
T.rename(columns={'sex': 'gender'}, inplace=True)

In [50]:
T.deck.isnull().sum()

688

In [51]:
T.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   gender       891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [52]:
T.drop(['deck'], axis=1, inplace=True)

In [55]:
T.groupby(['gender', 'class'])['survived'].mean().unstack() # unstack pivots the inner index level (class) into columns, turning the grouped Series into a table.

| gender / class | First | Second | Third |
| --- | --- | --- | --- |
| female | 0.968085 | 0.921053 | 0.5 |
| male | 0.368852 | 0.157407 | 0.135447 |


In [56]:
T.pivot_table('survived', index='gender', columns='class', aggfunc='mean') # aggfunc='mean' computes the mean of survived for each cell.
# The first argument is the column to aggregate, index gives the column whose values label the rows,
# and columns gives the column whose values label the columns of the table.

| gender / class | First | Second | Third |
| --- | --- | --- | --- |
| female | 0.968085 | 0.921053 | 0.5 |
| male | 0.368852 | 0.157407 | 0.135447 |


In [58]:
age_partitions = pd.cut(T['age'], [0,18,80]) # cut assigns each value to a bin defined by the given edges.
# The first argument is the data to bin and the second argument is the list of bin edges.
# The bins are right-closed intervals: (0, 18] and (18, 80]. Missing ages stay NaN.

age_partitions

0      (18.0, 80.0]
1      (18.0, 80.0]
2      (18.0, 80.0]
3      (18.0, 80.0]
4      (18.0, 80.0]
           ...     
886    (18.0, 80.0]
887    (18.0, 80.0]
888             NaN
889    (18.0, 80.0]
890    (18.0, 80.0]
Name: age, Length: 891, dtype: category
Categories (2, interval[int64, right]): [(0, 18] < (18, 80]]
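Because the intervals are closed on the right, a value equal to the lowest edge falls outside every bin. The sketch below uses made-up ages and hypothetical labels to show this, along with the include_lowest option.

```python
import pandas as pd

ages = pd.Series([0, 5, 18, 19, 80, None])

# Right-closed bins (0, 18] and (18, 80]: 0 is outside, 18 is a "child".
groups = pd.cut(ages, [0, 18, 80], labels=["child", "adult"])

# include_lowest=True also closes the first interval on the left, so 0 is binned.
groups_incl = pd.cut(ages, [0, 18, 80], labels=["child", "adult"], include_lowest=True)

print(groups)
print(groups_incl)
```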

In [59]:
T.pivot_table('survived', ['gender', age_partitions], 'class')

| gender | age | First | Second | Third |
| --- | --- | --- | --- | --- |
| female | (0, 18] | 0.909091 | 1.0 | 0.511628 |
| female | (18, 80] | 0.972973 | 0.9 | 0.423729 |
| male | (0, 18] | 0.8 | 0.6 | 0.215686 |
| male | (18, 80] | 0.375 | 0.071429 | 0.133663 |


In [60]:
fare_partitions = pd.qcut(T['fare'], 2) # qcut bins the data by quantiles, so each bin holds roughly the same number of rows.
# The first argument is the data to bin and the second argument is the number of quantile bins.
# With 2 bins the split is at the median fare: (-0.001, 14.454] and (14.454, 512.329].

fare_partitions

0       (-0.001, 14.454]
1      (14.454, 512.329]
2       (-0.001, 14.454]
3      (14.454, 512.329]
4       (-0.001, 14.454]
             ...        
886     (-0.001, 14.454]
887    (14.454, 512.329]
888    (14.454, 512.329]
889    (14.454, 512.329]
890     (-0.001, 14.454]
Name: fare, Length: 891, dtype: category
Categories (2, interval[float64, right]): [(-0.001, 14.454] < (14.454, 512.329]]
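To show how quantile bins differ from equal-width bins, the sketch below applies qcut and cut to the same made-up, skewed fares. The values and labels are assumptions for illustration.

```python
import pandas as pd

fares = pd.Series([5.0, 7.0, 8.0, 10.0, 30.0, 500.0])

# Quantile bins: split at the median, so each bin gets three values.
halves = pd.qcut(fares, 2, labels=["low", "high"])

# Equal-width bins over the same range: the single large fare sits alone.
equal_width = pd.cut(fares, 2, labels=["low", "high"])

print(halves.value_counts())
print(equal_width.value_counts())
```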

In [61]:
T.pivot_table('survived', ['gender', age_partitions,], [fare_partitions, 'class'],)

| gender | age | (-0.001, 14.454] First | (-0.001, 14.454] Second | (-0.001, 14.454] Third | (14.454, 512.329] First | (14.454, 512.329] Second | (14.454, 512.329] Third |
| --- | --- | --- | --- | --- | --- | --- | --- |
| female | (0, 18] | NaN | 1.0 | 0.714286 | 0.909091 | 1.0 | 0.318182 |
| female | (18, 80] | NaN | 0.88 | 0.444444 | 0.972973 | 0.914286 | 0.391304 |
| male | (0, 18] | NaN | 0.0 | 0.26087 | 0.8 | 0.818182 | 0.178571 |
| male | (18, 80] | 0.0 | 0.098039 | 0.125 | 0.391304 | 0.030303 | 0.192308 |


In [62]:
T.pivot_table(index='gender', columns='class', aggfunc={'survived': 'sum', 'fare': 'mean'}) # A dict passed to aggfunc applies a different aggregation to each column: total survivors and mean fare.

| gender | fare First | fare Second | fare Third | survived First | survived Second | survived Third |
| --- | --- | --- | --- | --- | --- | --- |
| female | 106.125798 | 21.970121 | 16.11881 | 91 | 70 | 72 |
| male | 67.226127 | 19.741782 | 12.661633 | 45 | 17 | 47 |


## String Methods

```python

df.col.str

```

The str attribute is used to perform vectorized string operations for Series and Index values. It is available only for string Series and Indexes.

Some of the methods available through the str accessor are:

- cat: Concatenate strings in the Series/Index with a given separator

- split: Split an element of the Series/Index

- rsplit: Split an element of the Series/Index

- get: Extract element from each component at specified position

- join: Join lists contained as elements in the Series/Index with passed delimiter

- get_dummies: Split each string in the Series/Index at specified separator

- contains: Test if pattern or regex is contained within a string of a Series or Index

- replace: Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence

- repeat: Duplicate values (s.str.repeat(3) equivalent to x * 3)

- pad: Add whitespace to left, right, or both sides of strings

- center: Filling left/right side of strings with an arbitrary character

- ljust: Filling right side of strings with an arbitrary character

- rjust: Filling left side of strings with an arbitrary character

- zfill: Pad strings in the Series/Index by prepending ‘0’ characters

- wrap: Split long strings into lines with length less than a given width

- slice: Slice each string in the Series/Index

- slice_replace: Replace slice in each string with passed value

- count: Count occurrences of pattern


Each entry above is a method called through the accessor, for example s.str.split(',') or s.str.contains('abc'), and each is applied element by element. Note that str.cat concatenates strings; it is unrelated to the categorical accessor Series.cat. The split and rsplit methods differ only in direction: split works from the left of each string and rsplit from the right, which matters when the number of splits is limited. The ljust, rjust and center methods are shorthands for pad with the fill on the right side, on the left side, and on both sides, respectively.
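A short sketch chaining several of these methods on a made-up Series follows. The values and variable names are assumptions, and the None entry shows how missing values pass through.

```python
import pandas as pd

s = pd.Series(["a-1", "b-22", None])

# split returns a list per element; get picks one item from each list.
parts = s.str.split("-")
letters = parts.str.get(0)

# zfill left-pads the numeric part with zeros to width 3.
padded = parts.str.get(1).str.zfill(3)

# contains tests each string; na=False treats missing values as no match.
has_b = s.str.contains("b", na=False)

# cat concatenates two Series element by element with a separator.
joined = letters.str.cat(padded, sep=":")

print(letters.tolist())
print(padded.tolist())
print(has_b.tolist())
print(joined.tolist())
```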

In [65]:
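T.gender.str.lower() # str.lower returns a new lowercase Series; the result is not assigned here, so T is unchanged.

T.head()

| | survived | pclass | gender | age | sibsp | parch | fare | embarked | class | who | adult_male | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | Southampton | no | True |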


In [66]:
T.gender.str.capitalize() # str.capitalize uppercases the first character of each string and lowercases the rest.


0        Male
1      Female
2      Female
3      Female
4        Male
        ...  
886      Male
887    Female
888    Female
889      Male
890      Male
Name: gender, Length: 891, dtype: object

In [67]:
T.embark_town.str.len() # Length of each string; the result is float because missing values give NaN.

0      11.0
1       9.0
2      11.0
3      11.0
4      11.0
       ... 
886    11.0
887    11.0
888    11.0
889     9.0
890    10.0
Name: embark_town, Length: 891, dtype: float64

In [69]:
T.gender.str.cat(T.embark_town, sep=' ') # the str.cat is used to concatenate two columns.

0        male Southampton
1        female Cherbourg
2      female Southampton
3      female Southampton
4        male Southampton
              ...        
886      male Southampton
887    female Southampton
888    female Southampton
889        male Cherbourg
890       male Queenstown
Name: gender, Length: 891, dtype: object

## To Datetime

```python

pd.to_datetime()

```

The to_datetime() function is used to convert argument to datetime.

The parameters of the to_datetime function are:

- arg: string, datetime, list, tuple, 1-d array, Series

- errors: {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

- dayfirst: bool, default False

- yearfirst: bool, default False

- utc: bool, default None

- box: bool, default True

- format: string, default None

- exact: bool, True by default

- unit: string, default ‘ns’

- infer_datetime_format: bool, default False

- origin: scalar, default ‘unix’

- cache: bool, default True

The arg parameter is the object to convert: a string, datetime, list, tuple, 1-d array or Series. The errors parameter indicates how to handle values that cannot be parsed: 'raise' raises an exception, 'coerce' sets them to NaT, and 'ignore' returns the input unchanged. The dayfirst parameter is a boolean value which indicates whether to interpret the first value in an ambiguous date (e.g. 01/05/09) as the day (True) or the month (False). The yearfirst parameter is a boolean value which indicates whether to interpret the first value in an ambiguous date as the year. The utc parameter is a boolean value which indicates whether to return a UTC-localized result (converting any tz-aware datetime.datetime objects as well). The box parameter, which controlled whether a DatetimeIndex or an ndarray of values was returned, exists only in older releases and was removed in pandas 1.0. The format parameter is the strftime format used to parse the input, e.g. "%d/%m/%Y"; note that "%f" will parse all the way up to nanoseconds. The exact parameter is a boolean value which indicates whether the format must match the input exactly (True) or may match anywhere in the target string (False). The unit parameter is the unit of a numeric arg (D, s, ms, us, ns), counted from the origin. The infer_datetime_format parameter is a boolean value which asks pandas to infer the format of the datetime strings; it is deprecated in pandas 2.0, where format inference is the default behaviour. The origin parameter is the reference date from which numeric values are counted; the default 'unix' is 1970-01-01. The cache parameter is a boolean value which indicates whether to use a cache of unique converted dates, which can speed up parsing of duplicate date strings.
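The sketch below exercises a few of these parameters on made-up inputs: dayfirst, format, errors='coerce', and unit together with origin. The variable names are assumptions for this example.

```python
import pandas as pd

# Ambiguous date: month first by default, day first on request.
month_first = pd.to_datetime("01/05/2009")
day_first = pd.to_datetime("01/05/2009", dayfirst=True)

# An explicit strftime format removes the ambiguity.
explicit = pd.to_datetime("03/07/2015", format="%d/%m/%Y")

# errors="coerce" turns unparseable values into NaT instead of raising.
coerced = pd.to_datetime(pd.Series(["2015-07-03", "not a date"]), errors="coerce")

# Numeric input is counted in `unit` steps from `origin` (the Unix epoch by default).
from_epoch = pd.to_datetime(1, unit="D")
from_origin = pd.to_datetime(1, unit="D", origin=pd.Timestamp("2015-07-03"))

print(month_first, day_first, explicit)
print(coerced)
print(from_epoch, from_origin)
```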


In [70]:
from datetime import datetime

dates = pd.to_datetime([datetime(2015,7,3), '4th of July, 2015', '2015-Jul-6', '07-07-2015', '20150708'])

dates

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
               '2015-07-08'],
              dtype='datetime64[ns]', freq=None)

In [72]:
dates.to_period('D') # to_period converts the timestamps to periods with daily frequency.


PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
             '2015-07-08'],
            dtype='period[D]')

In [73]:
dates - dates[0] # the dates[0] is the first date.

TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)