Let's import the pandas library

In [None]:
import pandas as pd

Let's imagine you have a Series of company names

In [None]:
companies = pd.Series(['Meta ', 'Ama!on ', '#Apple', '#Netflix', 'Google ', 'CORRUPT'])
companies

However, this data is not clean: it contains extra spaces and stray characters introduced by incorrect data entry. You have noticed that the errors follow a pattern: there may be extra spaces at the end, a number sign at the start, or an exclamation mark accidentally replacing the character 'z'. There are also some cells called 'CORRUPT', which you just want to drop. How can you identify which entries are unclean?

In [None]:
companies.str.contains('!')

In [None]:
companies.str.startswith('#')

In [None]:
companies.str.endswith(' ')
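The three checks above can be combined into a single boolean mask with the `|` (or) operator. This is a minimal sketch that assumes the same `companies` data; the name `needs_cleaning` is illustrative, and the 'CORRUPT' entry is left out of the mask because it is dropped rather than repaired.

```python
import pandas as pd

companies = pd.Series(['Meta ', 'Ama!on ', '#Apple', '#Netflix', 'Google ', 'CORRUPT'])

# regex=False treats '!' as a literal character rather than a regular expression
has_exclamation = companies.str.contains('!', regex=False)
has_number_sign = companies.str.startswith('#')
has_trailing_space = companies.str.endswith(' ')

# An entry needs cleaning if any one of the three checks is True
needs_cleaning = has_exclamation | has_number_sign | has_trailing_space
print(companies[needs_cleaning])
```

Each check returns a boolean Series with the same index as `companies`, which is why the masks can be combined element by element and then used to filter.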

Let's clean up this data. Like Python's own string methods, the `.str` methods do not modify the Series in place: each one returns a new Series, so assign the result to a variable if you want to keep it.

In [None]:
companies.str.replace(' ', '')

In [None]:
companies

In [None]:
removed_spaces = companies.str.replace(' ', '')
removed_spaces

In [None]:
removed_number_signs = removed_spaces.str.replace('#', '')
removed_number_signs

In [None]:
removed_exclamation_marks = removed_number_signs.str.replace('!', 'z')
removed_exclamation_marks

In [None]:
companies_cleaned = removed_exclamation_marks[~removed_exclamation_marks.str.contains('CORRUPT')]  # you can use the conditions above to filter your data
companies_cleaned
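Because each `.str` method returns a new Series, the steps above can also be chained into one expression. This is a sketch under the same assumptions as the cells above; `regex=False` is added here so that every pattern is treated as literal text, and the name `chained` is illustrative.

```python
import pandas as pd

companies = pd.Series(['Meta ', 'Ama!on ', '#Apple', '#Netflix', 'Google ', 'CORRUPT'])

chained = (
    companies
    .str.replace(' ', '', regex=False)   # remove spaces
    .str.replace('#', '', regex=False)   # remove number signs
    .str.replace('!', 'z', regex=False)  # restore the mistyped 'z'
)

# Drop the placeholder entries that cannot be repaired
chained = chained[chained != 'CORRUPT']
print(chained.tolist())
```

Keeping the intermediate variables, as in the cells above, makes each step easier to inspect; chaining is more compact once the steps are known to work.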

You can also change the case of your string Series

In [None]:
companies_cleaned.str.lower()

In [None]:
companies_cleaned.str.upper()

You can find the length of each string in your Series

In [None]:
companies_cleaned.str.len()

In [None]:
companies_cleaned  # these aren't inplace operations; reassign if you want to preserve changes

You can also remove leading and trailing whitespace using the `.str.strip()` method

In [None]:
series_with_whitespace = pd.Series(['   Apple', 'Banana   ', '    Grape  ', '  Orange  '])
series_with_whitespace

In [None]:
series_with_whitespace.str.len()

In [None]:
stripped_series = series_with_whitespace.str.strip()
stripped_series

In [None]:
stripped_series.str.len()
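If only one side should be trimmed, pandas also provides `.str.lstrip()` and `.str.rstrip()`. A short sketch using the same `series_with_whitespace` data; the names `left_stripped` and `right_stripped` are illustrative.

```python
import pandas as pd

series_with_whitespace = pd.Series(['   Apple', 'Banana   ', '    Grape  ', '  Orange  '])

left_stripped = series_with_whitespace.str.lstrip()   # remove leading whitespace only
right_stripped = series_with_whitespace.str.rstrip()  # remove trailing whitespace only

# Comparing lengths shows which side was trimmed
print(left_stripped.str.len().tolist())
print(right_stripped.str.len().tolist())
```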

Pandas can also work with dates and times

In [None]:
dates = pd.Series(['2024-01-01', '2024-02-15', '2024-03-30'])
dates

In [None]:
dates.dtype

In [None]:
datetime_series = pd.to_datetime(dates)  # convert strings to datetime objects
datetime_series

In [None]:
datetime_series.dtype
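Real data often contains strings that are not valid dates. As a hedged sketch, `pd.to_datetime` accepts a `format` string and an `errors='coerce'` option that turns unparseable values into `NaT` instead of raising an error. The `raw_dates` Series and its 'not a date' entry are made up for illustration.

```python
import pandas as pd

# Hypothetical input with one invalid entry
raw_dates = pd.Series(['2024-01-01', '2024-02-15', 'not a date'])

# format states the expected layout; errors='coerce' replaces failures with NaT
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')
print(parsed)

# NaT behaves like a missing value, so isna() finds the entries that failed to parse
print(parsed.isna().tolist())
```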

In [None]:
datetime_series.dt.year  # retrieve the year

In [None]:
datetime_series.dt.month  # retrieve the month

In [None]:
datetime_series.dt.day  # retrieve the day of the month (a number)

In [None]:
datetime_series.dt.weekday  # retrieve the day of the week (a number)

In [None]:
datetime_series.dt.day_name()  # retrieve the name of the day of the week

In [None]:
new_datetime_series = datetime_series + pd.Timedelta(days=10)  # shift every date 10 days ahead
new_datetime_series

In [None]:
is_weekend = new_datetime_series.dt.weekday >= 5  # Monday is 0, so Saturday (5) and Sunday (6) are weekends
is_weekend

In [None]:
new_datetime_series.dt.day_name()
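`dt.weekday` numbers the days from Monday = 0 to Sunday = 6, so Saturday and Sunday are 5 and 6. The sketch below cross-checks a numeric weekend test against the day names, using the original (unshifted) dates because they include a Saturday. The names `weekend_by_number` and `weekend_by_name` are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2024-01-01', '2024-02-15', '2024-03-30']))

# Numeric test: 5 is Saturday, 6 is Sunday
weekend_by_number = dates.dt.weekday >= 5

# Name-based test, which should agree with the numeric one
weekend_by_name = dates.dt.day_name().isin(['Saturday', 'Sunday'])

print(dates.dt.day_name().tolist())
print(weekend_by_number.tolist())
print(weekend_by_number.equals(weekend_by_name))
```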