**Problem:**

You are given the following dataset:
1. **Audible Data** : https://1drv.ms/u/s!AiqdXCxPTydhoog8ckLN-6Cw55fzIg?e=EWgZ5d

Your task is to:
- Find the problems with the datasets.
- Define the Data Quality Dimensions.
- Try to clean the datasets.

In [59]:
import pandas as pd
import numpy as np
import re
import datetime

## Issues with the data



### Dirty Data
    1. 



### Messy Data
    1. There seem to be formatting errors in the names of non-English books in the 'name' column (the characters appear garbled).
    2. There are unnecessary prefixes 'Writtenby:' and 'Narratedby:' in the 'author' and 'narrator' columns respectively.
    3. Author names are run together into a single word (e.g. 'RickRiordan') instead of being separated; the same applies to narrators.
    4. Time is given as text in hrs and mins instead of a single standard unit.
    5. The release date is sometimes separated by hyphens and sometimes by slashes, and the year is sometimes in yy format and sometimes in yyyy format. It needs a common format.
    6. The rating is stored as a string and is combined with the number of ratings. Unrated books are marked 'Not rated yet'.
    7. The price is a mix of integers and floats.
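
As an illustration of issue 6, the combined rating string could be split into a numeric star value and a rating count. This is only a sketch: `parse_stars` is a hypothetical helper, the pattern is inferred from values such as '4.5 out of 5 stars41 ratings', and treating 'Not rated yet' as "no stars, zero ratings" is an assumption rather than something the dataset states.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from values like '4.5 out of 5 stars41 ratings'.
# Allowing a singular 'rating' and commas in the count is an assumption.
_STARS = re.compile(r"^([0-9.]+) out of 5 stars([0-9,]+) ratings?$")

def parse_stars(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Return (stars, number_of_ratings) parsed from the 'stars' column."""
    text = text.strip()
    if text == "Not rated yet":
        # Assumption: an unrated book has no star value and zero ratings.
        return None, 0
    match = _STARS.search(text)
    if match is None:
        # Unknown layout: leave both values missing instead of guessing.
        return None, None
    stars = float(match.group(1))
    ratings = int(match.group(2).replace(",", ""))
    return stars, ratings
```

On the DataFrame this could be applied with `df['stars'].apply(parse_stars)` and the resulting tuples unpacked into two new columns.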


In [26]:
df = pd.read_csv(r"../Datasets/audible.csv")

In [27]:
df

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,Writtenby:GeronimoStilton,Narratedby:BillLobely,2 hrs and 20 mins,4-8-2008,English,5 out of 5 stars34 ratings,468
1,The Burning Maze,Writtenby:RickRiordan,Narratedby:RobbieDaymond,13 hrs and 8 mins,1-5-2018,English,4.5 out of 5 stars41 ratings,820
2,The Deep End,Writtenby:JeffKinney,Narratedby:DanRussell,2 hrs and 3 mins,6-11-2020,English,4.5 out of 5 stars38 ratings,410
3,Daughter of the Deep,Writtenby:RickRiordan,Narratedby:SoneelaNankani,11 hrs and 16 mins,5-10-2021,English,4.5 out of 5 stars12 ratings,615
4,"The Lightning Thief: Percy Jackson, Book 1",Writtenby:RickRiordan,Narratedby:JesseBernstein,10 hrs,1-13-2010,English,4.5 out of 5 stars181 ratings,820
...,...,...,...,...,...,...,...,...
87484,Last Days of the Bus Club,Writtenby:ChrisStewart,Narratedby:ChrisStewart,7 hrs and 34 mins,9-3-2017,English,Not rated yet,596
87485,The Alps,Writtenby:StephenO'Shea,Narratedby:RobertFass,10 hrs and 7 mins,21-02-17,English,Not rated yet,820
87486,The Innocents Abroad,Writtenby:MarkTwain,Narratedby:FloGibson,19 hrs and 4 mins,30-12-16,English,Not rated yet,938
87487,A Sentimental Journey,Writtenby:LaurenceSterne,Narratedby:AntonLesser,4 hrs and 8 mins,23-02-11,English,Not rated yet,680
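
The sample rows above hint that the two year formats also differ in day/month order: '1-13-2010' is only valid as month-day-year, while '21-02-17' is only valid as day-month-year. A minimal sketch of a parser built on that reading follows. `parse_release_date` is a hypothetical helper, and the format pairing is an assumption drawn from a handful of rows, so it should be checked against the full column before being relied on.

```python
from datetime import date, datetime
from typing import Optional

def parse_release_date(text: str) -> Optional[date]:
    """Parse a 'releasedate' value into a date, or None if it fits neither format."""
    # Use one separator so hyphenated and slashed dates share the same formats.
    normalized = text.strip().replace("/", "-")
    # Assumption: yyyy values are month-day-year, yy values are day-month-year.
    for fmt in ("%m-%d-%Y", "%d-%m-%y"):
        try:
            return datetime.strptime(normalized, fmt).date()
        except ValueError:
            continue
    return None
```

Values that match neither format come back as `None`, which makes them easy to count and inspect separately.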


In [28]:
# Cleaning the authors and narrator columns

df['author'] = df['author'].apply(lambda x: re.search(r"^Writtenby:(.*)$", x).group(1))

df['narrator'] = df['narrator'].apply(lambda x: re.search(r"^Narratedby:(.*)$", x).group(1))
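
With the prefixes removed, the names are still run together (issue 3 in the list above). One possible approach, sketched here with a hypothetical `split_name` helper, is to insert a space wherever a lowercase letter is directly followed by an uppercase letter. This is a heuristic: it would wrongly split names such as 'McDonald', and it leaves names without such a boundary untouched.

```python
import re

def split_name(name: str) -> str:
    """Insert a space at each lowercase-to-uppercase boundary."""
    # Zero-width match: the position after [a-z] and before [A-Z].
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)

print(split_name("GeronimoStilton"))
print(split_name("StephenO'Shea"))
```

It could be applied to both columns in the same way as the prefix removal, for example `df['author'].apply(split_name)`.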

In [29]:
# Checking which Python types are present in the 'name' column

df['name'].apply(lambda x:type(x)).unique()

array([<class 'str'>], dtype=object)

In [30]:
# Checking for null values in the 'name' column

df['name'].isnull().sum()

0

In [None]:
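def cleaning_time(x: str) -> pd.Timedelta:
    # Parse strings such as '2 hrs and 20 mins', '10 hrs' or '96 mins'.
    mt = re.search(r"^(?:([0-9]*) (?:hr|hrs))?(?: and )?(?:([0-9]*) ?(?:min|mins)?)$", x)
    if mt is None or not any(mt.groups()):
        # Fallback for values that do not parse, e.g. 'Less than 1 minute'.
        return pd.Timedelta('1 min')
    # A missing or empty group counts as zero.
    hours, minutes = (int(g) if g else 0 for g in mt.groups())
    return pd.Timedelta(hours=hours, minutes=minutes)

df['time'] = df['time'].apply(cleaning_time)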

In [62]:
re.search(r'^([0-9]*)(?: (hrs|hr))?(?: and )?([0-9]*)?(?: (mins|min))?$',"96 mins").group(1)

'96'

In [54]:
re.search(r"^(?:([0-9]*) (?:hr|hrs))?(?: and )?(?:([0-9]*) ?(?:min|mins)?)$", "Less than 1 minute") is None

True

In [34]:
df[df.time.str.contains('Less than 1 minute')]

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
1401,The Story of Ice Cream,StacyTaus-Bolstad,BookBuddyDigitalMedia,Less than 1 minute,1-1-2021,English,Not rated yet,164
1403,The Story of Salt,LisaOwings,BookBuddyDigitalMedia,Less than 1 minute,1-1-2021,English,Not rated yet,164
1404,The Story of Milk,StacyTaus-Bolstad,BookBuddyDigitalMedia,Less than 1 minute,1-1-2021,English,Not rated yet,164
1408,The Story of an Apple,StacyTaus-Bolstad,BookBuddyDigitalMedia,Less than 1 minute,1-1-2021,English,Not rated yet,164
1409,We Like the Summer,KatiePeters,BookBuddyDigitalMedia,Less than 1 minute,1-1-2021,English,Not rated yet,164
...,...,...,...,...,...,...,...,...
87171,ç¬¬äºŒåäº”è©±ã‚µãƒ³ãƒ»ãƒŸã‚·ã‚§ãƒ«ã®ã„ã„ã...,æ£®æœ¬å“²éƒŽ,å°é‡Žç”°è‹±ä¸€,Less than 1 minute,20-11-15,japanese,Not rated yet,139
87175,ç¬¬ä¹è©±ã‚ªãƒ©ãƒ³æœ€å¾Œã®å¤•ã¹ï¼šã¼ãã®æ...,æ£®æœ¬å“²éƒŽ,å°é‡Žç”°è‹±ä¸€,Less than 1 minute,19-11-15,japanese,Not rated yet,139
87176,ç¬¬ä¸€è©±ãƒªãƒ¥ãƒ¼ãƒ™ãƒƒã‚¯ã®è¿½æ†¶:ã¼ãã®...,æ£®æœ¬å“²éƒŽ,å°é‡Žç”°è‹±ä¸€,Less than 1 minute,23-07-15,japanese,Not rated yet,139
87180,ç¬¬ä¸ƒè©±ãƒã‚°ãƒ€ãƒ¼ãƒ‰ã®èŒ¶å±‹:ã¼ãã®æ—…...,æ£®æœ¬å“²éƒŽ,å°é‡Žç”°è‹±ä¸€,Less than 1 minute,13-07-15,japanese,Not rated yet,139


In [65]:
#pd.Timestamp("10 hrs, 3 mins")
datetime.datetime("10, 03, 19")

TypeError: 'str' object cannot be interpreted as an integer
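
The TypeError above arises because `datetime.datetime` expects integer arguments (year, month, day, ...) and describes a point in time, not a duration. A duration such as '10 hrs and 3 mins' fits a timedelta better. The sketch below uses only the standard library; `parse_duration` and its regular expression are hypothetical and assume the 'N hrs and M mins' layout seen in the 'time' column.

```python
import re
from datetime import timedelta
from typing import Optional

# Hours and minutes are both optional, e.g. '10 hrs', '96 mins', '1 hr and 1 min'.
_DURATION = re.compile(r"^(?:(\d+) hrs?)?(?: and )?(?:(\d+) mins?)?$")

def parse_duration(text: str) -> Optional[timedelta]:
    """Return the duration as a timedelta, or None if the text does not parse."""
    match = _DURATION.search(text.strip())
    if match is None or not any(match.groups()):
        # Covers values such as 'Less than 1 minute' and empty strings.
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes)

duration = parse_duration("10 hrs and 3 mins")
# A single standard unit (minutes) can be derived from the timedelta.
total_minutes = int(duration.total_seconds() // 60)
```

Returning `None` for unparsed values leaves the decision about 'Less than 1 minute' rows to the caller, for example mapping them to one minute or to zero.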