### **String Manipulation**


In [None]:
import pandas as pd


In [None]:
df = pd.read_csv('../../Datasets/new_york_times_bestsellers-dirty.csv', index_col=0)

df.head()

Let's start with the 'description' column, where every value begins with 'Descr:'. To remove that prefix, we can use the replace method available through the str accessor of the Series:


In [None]:
df['description'].str.replace('Descr:', '')


In [None]:
df['description'] = df['description'].str.replace('Descr:', '')


In [None]:
df.loc[0, 'description']


As you can see, our strings also have whitespace at the beginning and end. Let's remove it using strip:


In [None]:
df['description'].str.strip()


In [None]:
df['description'] = df['description'].str.strip()


In [None]:
df.loc[0, 'description']
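The two cleaning steps above can also be chained in one expression. The following is a minimal sketch on made-up values (the `sample` Series is hypothetical, not the bestsellers data). It passes `regex=False` so that the pattern is treated as a literal string, which is an optional choice here since 'Descr:' contains no special characters:

```python
import pandas as pd

# Hypothetical values that mimic the 'Descr:' prefix and the stray whitespace
sample = pd.Series(['Descr: A spy thriller. ', 'Descr: A family saga. '])

# Remove the literal prefix first, then trim whitespace from both ends
cleaned = sample.str.replace('Descr:', '', regex=False).str.strip()

print(cleaned.tolist())  # ['A spy thriller.', 'A family saga.']
```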


Now let's look at the 'title' column, whose values are all uppercase. That is hard to read, so we can use a few methods to change the casing:


In [None]:
df['title'].str.lower()


In [None]:
df['title'].str.title()


The latter is more suitable, so let's save it:


In [None]:
df['title'] = df['title'].str.title()


Now let's say we want to split our author column into two columns author_first_name and author_last_name. We can do that with the split method:


In [None]:
df['author'].str.split(' ')


Passing expand=True returns the pieces as separate columns instead of a list in each row:


In [None]:
df['author'].str.split(' ', expand=True)
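To actually store the result as author_first_name and author_last_name, the expanded columns can be assigned back to the DataFrame. Here is a hedged sketch on a small, made-up `authors` DataFrame (the names are illustrative only). It assumes that everything after the first space counts as the last name, which `n=1` enforces by splitting only once:

```python
import pandas as pd

# Hypothetical author names; the real column may hold middle names or several authors
authors = pd.DataFrame({'author': ['John Grisham', 'Nora Roberts', 'Mary Higgins Clark']})

# n=1 splits only at the first space, so we always get exactly two columns
parts = authors['author'].str.split(' ', n=1, expand=True)

authors['author_first_name'] = parts[0]
authors['author_last_name'] = parts[1]

print(authors)
```

Without `n=1`, a name with three words would produce a third column, and shorter names would get missing values in it.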



In [None]:
df.head()


### **MAP**

Let's say we want to transform the data in our 'rank.numberInt' column so that the ranking is given by letters, not numbers.

We know there is a 'NoRank' value in that column, so our conversion dictionary might look like this:

In [None]:
int_a_letter = {
    '1': 'a',
    '2': 'b',
    '3': 'c',
    '4': 'd',
    '5': 'e',
    '6': 'f',
    '7': 'g',
    '8': 'h',
    '9': 'i',
    '10': 'j',
    '11': 'k',
    '12': 'l',
    '13': 'm',
    '14': 'n',
    '15': 'o',
    '16': 'p',
    'NoRank': 'z'
}


We apply it using map:


In [None]:
df['rank.numberInt'].map(int_a_letter).head(20)
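One detail worth knowing about map with a dictionary: any value that has no matching key becomes NaN. The sketch below uses a hypothetical `ranks` Series and a shortened dictionary to show this, and fills the gaps with a placeholder label chosen only for illustration:

```python
import pandas as pd

# Hypothetical rank values stored as strings; '17' has no entry in the dictionary
ranks = pd.Series(['1', '2', 'NoRank', '17'])
letters = {'1': 'a', '2': 'b', 'NoRank': 'z'}

mapped = ranks.map(letters)

# Values without a matching key come back as NaN
print(mapped.isna().tolist())  # [False, False, False, True]

# Replace the missing results with a label of our choice
filled = mapped.fillna('unknown')
print(filled.tolist())  # ['a', 'b', 'z', 'unknown']
```

The keys must also have the same type as the column values: a dictionary with string keys would match nothing in a column of integers.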

map also accepts a function. For example, this function converts the price of a book into its string representation:


In [None]:
def double_to_money(value):
    
    return f'${value} USD'

In [None]:
df['price.numberDouble'].map(double_to_money)


### **APPLY**

Another way to transform values is to apply a function to our DataFrame or Series using apply.

For a Series we can use apply to apply a function "element by element".

In DataFrames we can use this same method to apply functions by rows or by columns.
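As a quick illustration of the row and column cases, here is a small sketch on a made-up `scores` DataFrame (the column names are hypothetical). The `axis` argument decides whether the function receives each column or each row:

```python
import pandas as pd

# Hypothetical numeric data
scores = pd.DataFrame({'week_1': [3, 1, 2], 'week_2': [5, 4, 6]})

# axis=0 (the default): the function receives each column as a Series
column_ranges = scores.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_totals = scores.apply(lambda row: row['week_1'] + row['week_2'], axis=1)

print(column_ranges.to_dict())  # {'week_1': 2, 'week_2': 2}
print(row_totals.tolist())      # [8, 5, 8]
```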

In [None]:
import numpy as np

We can apply functions to our Series with the apply method:


In [None]:
def years_since_bestseller(value):
    
    as_datetime = pd.to_datetime(value, unit='ms')
    today = pd.to_datetime('today')
    difference_in_days = (today - as_datetime).days
    in_years = difference_in_days / 365
    
    return in_years


In [None]:
df['published_date.numberLong'].apply(years_since_bestseller)
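For comparison, the same calculation can usually be written without apply by operating on the whole Series at once. This is only a sketch under stated assumptions: the `timestamps_ms` values are made up, and a fixed `reference` date replaces 'today' so that the result is reproducible:

```python
import pandas as pd

# Hypothetical epoch timestamps in milliseconds
timestamps_ms = pd.Series([1211587200000, 1212192000000])

# Fixed reference date (an assumption, used instead of 'today')
reference = pd.Timestamp('2020-01-01')

# Convert the whole Series at once, then subtract and read the day counts
as_datetime = pd.to_datetime(timestamps_ms, unit='ms')
years = (reference - as_datetime).dt.days / 365

print(years)
```

On large datasets the vectorized form tends to be faster than calling a Python function once per element, although apply remains the more flexible option for logic that cannot be expressed with Series operations.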


Here is another example, where the function takes an extra argument that we pass through args:


In [None]:
def weeks_on_list_percentage_of_maximum(value, max_weeks_on_list):
    
    percentage = value * 100 / max_weeks_on_list
    as_string = f'{percentage:.2f}%'
    
    return as_string

In [None]:
df['weeks_on_list.numberInt'].apply(weeks_on_list_percentage_of_maximum, args=(df['weeks_on_list.numberInt'].max(),))


### **Filters**

Filters let us obtain subsets of the data that share a characteristic we care about. We keep only the rows we want and leave out the rest.

Creating subsets is very useful for understanding the makeup of our dataset and for running analyses on samples of the full data.


Let's say we want all records where the author's name starts with 'R'. First, we use comparison operators (or in this case, the str.startswith method) to get our filter:


In [None]:
df['author'].str.startswith('R')


What we get back is a Series with the same length as the original Series. The method or comparison is applied to each element of the original Series and returns True or False for each value. The resulting Series collects those True and False values, and we can use it to select rows:


In [None]:
df[df['author'].str.startswith('R')].head()
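The mechanics of a boolean filter can be seen on a tiny example. This sketch uses a hypothetical `authors` Series. Because True counts as 1, summing the mask gives the number of matching rows:

```python
import pandas as pd

# Hypothetical author names
authors = pd.Series(['Rick Riordan', 'Stephen King', 'Rachel Hollis'])

# One True/False value per element, in the same order as the original Series
mask = authors.str.startswith('R')
print(mask.tolist())  # [True, False, True]

# Summing a boolean Series counts the True values
print(mask.sum())  # 2

# Indexing with the mask keeps only the rows where it is True
print(authors[mask].tolist())  # ['Rick Riordan', 'Rachel Hollis']
```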


We can also store our filters in variables and then use them:


In [None]:
filter_price_greater_than_20 = df['price.numberDouble'] > 20


In [None]:
df[filter_price_greater_than_20].head()


We can even combine two or more filters using logical operators. With pandas Series, the and operator is written & and the or operator is written |:


In [None]:
filter_rank_number_one = df['rank.numberInt'] == '1'


In [None]:
df[filter_price_greater_than_20 & filter_rank_number_one].head()
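The sketch below shows the three element-wise operators side by side on a made-up `books` DataFrame (the column names and values are hypothetical). It also shows that comparisons written inline need parentheses, because & and | bind more tightly than > or ==:

```python
import pandas as pd

# Hypothetical data
books = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'price': [25.0, 15.0, 30.0, 10.0],
    'rank': ['1', '1', '2', '3'],
})

expensive = books['price'] > 20
top_ranked = books['rank'] == '1'

both = books[expensive & top_ranked]        # and: both conditions hold
either = books[expensive | top_ranked]      # or: at least one condition holds
neither = books[~(expensive | top_ranked)]  # not: invert the combined mask

# Inline comparisons must each be wrapped in parentheses
inline = books[(books['price'] > 20) & (books['rank'] == '1')]

print(both['title'].tolist())     # ['A']
print(either['title'].tolist())   # ['A', 'B', 'C']
print(neither['title'].tolist())  # ['D']
```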


### **SORT**

We can also reorder our data using the sort_values method, which sorts the rows by the value each one has in a given column. The order can be ascending or descending.

Reordering our data can help us better understand the distribution of our data, as well as prepare our set or subsets for display.

In [None]:
df.sort_values('price.numberDouble', ascending=False)


If we convert 'published_date.numberLong' to a datetime, we can also order from the oldest publication to the most recent publication:


In [None]:
df['published_date.numberLong'] = pd.to_datetime(df['published_date.numberLong'], unit='ms')


In [None]:
df.sort_values('published_date.numberLong', ascending=True)


For example, we could first filter to keep only the books from the publisher that has the most 'best sellers', and then sort them from the one that spent the most weeks on the 'best seller' list to the one that spent the fewest:


In [None]:
df['publisher'].value_counts()


In [None]:
df_putnam = df[df['publisher'] == 'Putnam']


In [None]:
df_putnam.sort_values('weeks_on_list.numberInt', ascending=False)


### **PRACTICE**

In [4]:
import numpy as np
import pandas as pd
data = {
    'animal':
    ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
    'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
    'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'priority':
    ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
}
 
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)


In [None]:
# Method 1: combine two comparisons (both bounds included, like between)
df[(df['age'] >= 2) & (df['age'] <= 4)]
# Method 2
df[df['age'].between(2, 4)]

In [None]:
df.groupby('animal')['age'].mean()


In [None]:
df['animal'].value_counts()


In [None]:
df.sort_values(by=['age', 'visits'], ascending=[False, True])


In [None]:
df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

In [None]:
df['animal'] = df['animal'].replace('snake', 'python')
df

In [None]:
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')


EXAMPLE #2

In [5]:
df = pd.DataFrame(np.random.random(size=(5, 5)), columns=list('abcde'))
print(df)


In [None]:
df.sum().idxmin()

EXAMPLE #3

In [None]:
dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
s = pd.Series(np.random.rand(len(dti)), index=dti)
 
s.head(10)


In [None]:
s[s.index.weekday == 2].sum()