# 75 pandas Exercises: Exercises 21 to 30

Exercises 21 to 30 from [here](https://www.machinelearningplus.com/python/101-pandas-exercises-python/). Each exercise includes the question, the input and the solution's code. Sometimes, alternative solutions and comments to better explain solutions/pandas functionality are offered.

Requirements: 
+ pandas
+ numpy

Happy Pandasing! 🐼

## Imports

In [2]:
import pandas as pd
import numpy as np # required for some questions

---

## Exercises

### 🐼 Exercise 21

**How to convert a series of date-strings to a timeseries?** 

Input

In [3]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

To convert this to a datetime format understood by Pandas, we would normally specify a format for the parser to match (something along the lines of DD-MM-YYYY). However, `ser` contains several different date formats, so we would need multiple parsing specifications. Thankfully, Pandas already has a built-in parsing engine that can automatically extract dates from common timestamp formats.



In [4]:
new_ser = pd.to_datetime(ser, infer_datetime_format=True) # infer datetime format to activate the built-in parsing engine

In [5]:
print(new_ser)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]


All date strings are now converted to a common, consistent format that can easily be used as a `DataFrame`'s index.
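
A note on versions: in pandas 2.0 and later, `infer_datetime_format` is deprecated, a single format is inferred from the first element, and mixed-format input like ours can raise an error (those releases offer `format="mixed"` for this case). The sketch below sidesteps the issue by parsing each string on its own, which should work across versions but is slower than a vectorised call. The name `parsed` is only illustrative.

```python
import pandas as pd

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Parse every string independently, so each one gets its own format guess.
parsed = pd.Series([pd.to_datetime(s) for s in ser])
print(parsed)
```

For a handful of strings the loop is fine; for large inputs, an explicit `format` (or cleaning the strings into one format first) is usually the better choice.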

### 🐼 Exercise 22

**Get the day of month, week number, day of year and day of week from a series of date strings.**

Input

In [11]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

First, building upon exercise 21, let's convert everything to the `pd.Timestamp` format.

In [12]:
ser_ts = pd.to_datetime(ser)
print(type(ser_ts))
print(type(ser_ts[0]))

<class 'pandas.core.series.Series'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [14]:
print(ser_ts)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]


Okay, we now have a `pd.Series` of timestamps. The `pd.Timestamp` is a complex object holding, in its attributes, all the metadata we are looking for, so let's extract it with a bit of list comprehension magic.

In [31]:
day_of_month = [ts.day for ts in ser_ts] # .day is the day of the month; .days_in_month would return the length of the month instead
week_number = [ts.weekofyear for ts in ser_ts]
day_of_year = [ts.dayofyear for ts in ser_ts]
day_of_week = [ts.day_name() for ts in ser_ts] # previously, it was weekday_name, but that attribute was deprecated in favour of the method day_name().
# It's a method because it has a locale argument, which sets the language in which the day name is returned.

Checking:

In [32]:
print("Day of month: {}".format(day_of_month))
print("Week number: {}".format(week_number))
print("Day of the year: {}".format(day_of_year))
print("Weekday name: {}".format(day_of_week))

Day of month: [1, 2, 3, 4, 5, 6]
Week number: [53, 5, 9, 14, 19, 23]
Day of the year: [1, 33, 63, 94, 125, 157]
Weekday name: ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']


_Voilà!_
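
As an alternative to looping over the timestamps, the same fields can be read in a vectorised way through the `.dt` accessor. This is a minimal sketch, assuming pandas 1.1 or later (where `dt.isocalendar()` is available). The dates are retyped in a single format and without the time of day, purely to keep the example self-contained.

```python
import pandas as pd

# Same dates as in the exercise, written in one format; the time of day is
# dropped because it does not affect any of the extracted fields.
ser_ts = pd.Series(pd.to_datetime(['2010-01-01', '2011-02-02', '2012-03-03',
                                   '2013-04-04', '2014-05-05', '2015-06-06']))

day_of_month = ser_ts.dt.day
week_number = ser_ts.dt.isocalendar().week  # ISO 8601 week number
day_of_year = ser_ts.dt.dayofyear
day_of_week = ser_ts.dt.day_name()

print(day_of_month.tolist())
print(week_number.tolist())
print(day_of_year.tolist())
print(day_of_week.tolist())
```

Each attribute returns a `pd.Series` aligned with `ser_ts`, so the results can go straight into a `DataFrame` without building lists first.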

### 🐼 Exercise 23

**Convert year-month string to dates corresponding to the 4th day of the month?** Change `ser` to dates that start with the 4th day of the respective months.

Input

In [34]:
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

Okay, so maybe we can go for a bit of timestamp arithmetic? Let's start by creating a timestamp `pd.Series`.

In [36]:
ser_ts = pd.to_datetime(ser)
print(ser_ts)

0   2010-01-01
1   2011-02-01
2   2012-03-01
dtype: datetime64[ns]


Basically, we can implement timestamp arithmetic on `pd.Timestamp` objects using `pd.Timedelta` to create our time parcels and, well... arithmetic operators. Since the parser defaulted every date to the 1st of the month, adding 3 days lands us on the 4th.

In [42]:
delta_time = pd.Timedelta(days=3)
ser_ts_delta = ser_ts + delta_time
ser_ts_delta.head()

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]

_Easy peasy!_
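
Another way to get there, sketched below, is to put the day into the string before parsing, so no arithmetic is needed afterwards. The name `ser_4th` is illustrative, and the `%b` directive assumes English month abbreviations (the default locale in most setups).

```python
import pandas as pd

ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

# Prepend the day number, then parse with an explicit day-month-year format.
ser_4th = pd.to_datetime('04 ' + ser, format='%d %b %Y')
print(ser_4th)
```

This version does not rely on the parser defaulting to the 1st of the month, at the cost of hard-coding the input format.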

### 🐼 Exercise 24

**Filter words that contain at least 2 vowels from a `pd.Series`?** From `ser`, extract the words containing at least 2 vowels.

Input

In [41]:
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

So, let's count the vowels in each element of the `pd.Series` and filter based on that. A convenient way to count the elements of a sequence (apart from MacGyvering it) is to use `collections.Counter`.

Let's pre-process the list: 

In [66]:
vowels = list('aeiou') # quick & dirty way to transform a word into a list of characters
ser_lower = ser.str.lower() # putting things in lowercase

Let's map each element in `ser` to its number of vowels. Calling `Counter()` on a string returns a dictionary-like object (a `dict` subclass) mapping each letter in that string to its number of occurrences. By looking up only the keys that are vowels and summing their counts, we get the vowel count for each entry in `ser_lower`.

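**Note**: The `.get()` method of a dictionary allows setting a default value in case a key is not found. With a plain dictionary this matters, because indexing a missing key with `d[key]` raises a `KeyError`. A `Counter` is more forgiving (indexing a missing key returns 0), but `.get(v, 0)` makes the intent explicit: if the vowel we are sweeping for is not present in the word, it counts as 0.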
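
To make this concrete, here is a small sketch of what `Counter` holds for a single word and how a missing vowel is handled. The word `'apple'` and the name `vowel_total` are just for illustration.

```python
from collections import Counter

vowels = list('aeiou')
counts = Counter('apple')  # maps each letter to its number of occurrences

print(counts['p'])         # 2, since 'p' appears twice
print(counts.get('u', 0))  # 0, the default we supplied for a missing key
print(counts['u'])         # 0, a Counter returns 0 for missing keys

# Sum the counts of the vowels only ('a' and 'e' appear once each).
vowel_total = sum(counts.get(v, 0) for v in vowels)
print(vowel_total)         # 2
```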

In [94]:
from collections import Counter
ser_vowels_count = ser_lower.map(lambda x: sum([Counter(x).get(v, 0) for v in vowels]))

With the vowel counts in the `pd.Series` `ser_vowels_count`, let's filter: 

In [93]:
ser[ser_vowels_count >= 2]

0     Apple
1    Orange
4     Money
dtype: object

It's done, partly based on the suggested solution. It looks a little complex at first because a lot is going on in that one-liner above, but hey, we learned about `collections.Counter`, a very useful tool.
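
For comparison, a shorter sketch that stays within pandas uses the string method `Series.str.count`, which counts regex matches per element. This is an alternative to the `Counter` approach, not the suggested solution; the names `vowel_counts` and `filtered` are illustrative.

```python
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Count the characters matching the regex class [aeiou] in each lowercased word.
vowel_counts = ser.str.lower().str.count('[aeiou]')
filtered = ser[vowel_counts >= 2]
print(filtered)
```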

### 🐼 Exercise 25

**Filter valid emails from a `pd.Series`**. Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

Input & regular expression pattern to detect valid emails

In [60]:
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern = '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'