In [None]:
import pandas as pd

In [None]:
# death_metal.csv
url = 'https://drive.google.com/file/d/11HsCgxJL_PtJ8xxdT5VZbw6e0y0-VKag/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
bands = pd.read_csv(path)

In [None]:
bands.head()

Unnamed: 0,name,country,status,formed_in,genre,theme,active
0,**Act of Destruction?>,united states,ac,2005.0,Melodic Death/Thrash Metal,Death| Love| Life| Evil| Darker Tones,2005-present
1,**Nirvana 2002?>,sweden,su,1988.0,Death Metal,Metaphysical Philosophy| Parapsychology,1988-1992
2,**Olemus?>,austria,ac,1993.0,Death/Black/Gothic Metal,Sadness| Life| Death,1993-present
3,**Misanthrope?>,mexico,oh,2010.0,Death Metal,Death| Destruction| War| Decadence,2010-present
4,**Detonator?>,russia,su,1991.0,Technical Death/Thrash Metal,Loneliness philosophy| state of mind of the pe...,1991-2002


In [None]:
bands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   name       26 non-null     object 
 1   country    26 non-null     object 
 2   status     26 non-null     object 
 3   formed_in  26 non-null     float64
 4   genre      26 non-null     object 
 5   theme      26 non-null     object 
 6   active     26 non-null     object 
dtypes: float64(1), object(6)
memory usage: 1.5+ KB


### **Exercise 1:** 
Cleaning the 'name' column

The names of the bands are messy. They have extra characters: '\*\*' at the beginning and '?>' at the end. Remove them using any or all of these methods:  
- [pandas.Series.str.replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)
- [pandas.Series.str.lstrip](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lstrip.html)
- [pandas.Series.str.rstrip](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.rstrip.html)
- [pandas.Series.str.strip](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html)

In [None]:
bands['name'].head(5)

0    **Act of Destruction?>
1          **Nirvana 2002?>
2                **Olemus?>
3           **Misanthrope?>
4             **Detonator?>
Name: name, dtype: object

#### Solution with pandas.Series.str.replace
This method replaces a chosen sequence of characters with a new one.  
The first parameter is the old sequence you want to change; the second parameter is the new sequence that replaces it.  
If you pass "" (an empty string) as the new sequence, the old sequence is simply deleted, since it is replaced with nothing.

In [None]:
messy_names = bands.loc[:,'name'].copy() # saving a copy of the messy data so we can reset it before each solution

In [None]:
# .replace() with regex (regex=True must be passed explicitly: since pandas 2.0 the default is regex=False)
bands['name'] = bands['name'].str.replace(r'\*\*|\?>', '', regex=True)
bands['name'].head(5)

  


0    Act of Destruction
1          Nirvana 2002
2                Olemus
3           Misanthrope
4             Detonator
Name: name, dtype: object

Explanation of **'\\\*\\\*|\\?>'** regular expression:  
- **First part** **\\\*\\\*** matches the two characters \*\*. Because * is a special character in regex, we put \\ before it so it 'loses its powers' and becomes a regular * (star).  
- **Second part** **\\?>** matches ?>. ? is also a special character, so once again we put \\ before it to make it a regular ? (question mark). > is not a special character in regex, so we can write it as is.  
- **Middle part |** The character | means 'match either what is on its left side or what is on its right side'. If we wrote '\\\*\\\*\\?>' without |, the pattern would look for the exact sequence \*\*?> and we would get no match, because the two parts never appear next to each other in a name.

For practicing regular expressions check this [website](https://regex101.com/)
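
To make the role of | concrete, here is a minimal sketch using Python's built-in `re` module on a single string. The sample value `raw` is a hypothetical stand-in shaped like the names in this notebook; the pattern is the one discussed above.

```python
import re

# Hypothetical sample value, shaped like the messy names in this notebook.
raw = "**Olemus?>"

# With the alternation, each side is matched (and removed) independently.
with_alternation = re.sub(r"\*\*|\?>", "", raw)

# Without the alternation, the pattern only matches the literal text "**?>",
# which never occurs in the sample, so nothing is removed.
without_alternation = re.sub(r"\*\*\?>", "", raw)

print(with_alternation)     # Olemus
print(without_alternation)  # **Olemus?>
```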

In [None]:
#resetting names for second solution
bands.loc[:,'name'] = messy_names 

In [None]:
# .replace() without regex (regex=False)
bands.loc[:,'name'] = bands.loc[:,'name'].str.replace('**','', regex=False)
bands.loc[:,'name'] = bands.loc[:,'name'].str.replace('?>','', regex=False)
bands.loc[:,'name'].head(5)

0    Act of Destruction
1          Nirvana 2002
2                Olemus
3           Misanthrope
4             Detonator
Name: name, dtype: object

#### Solution with pandas.Series.str.strip  
This method removes whitespace, or any of the given characters, from the beginning and end of a string. The argument is treated as a set of characters, not as an exact sequence.  
rstrip() and lstrip() work almost the same way, but rstrip() removes only from the end of a string (right strip) and lstrip() only from the beginning (left strip).  
If you don't pass any parameters to these methods, they remove only whitespace.
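
The following sketch compares the three strip methods on a few hypothetical values (not taken from the dataset). The last value is an invented name that ends in a real question mark, to show one thing worth keeping in mind: because strip() works on a set of characters, it may also remove characters that belong to the name.

```python
import pandas as pd

# Hypothetical sample values; the last name legitimately ends in '?'.
names = pd.Series(["**Olemus?>", "  Olemus  ", "**Who??>"])

stripped_default = names.str.strip()      # no argument: whitespace only
stripped_chars = names.str.strip("*?>")   # any of '*', '?', '>' at either end
left_only = names.str.lstrip("*")         # leading '*' characters only
right_only = names.str.rstrip("?>")       # trailing '?' and '>' characters only

# For comparison: literal (non-regex) replacement removes exact sequences only.
replaced = (names.str.replace("**", "", regex=False)
                 .str.replace("?>", "", regex=False))

print(stripped_chars.tolist())  # ['Olemus', '  Olemus  ', 'Who']
print(replaced.tolist())        # ['Olemus', '  Olemus  ', 'Who?']
```

In the dataset shown in this notebook the two approaches give the same result for the first five rows; the difference only matters when a name itself starts or ends with one of the stripped characters.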

In [None]:
bands.loc[:,'name'] = messy_names #resetting names for third solution

In [None]:
bands.loc[:,'name'] = bands.loc[:,'name'].str.strip('*?>')
bands.loc[:,'name'].head(5)

0    Act of Destruction
1          Nirvana 2002
2                Olemus
3           Misanthrope
4             Detonator
Name: name, dtype: object

### **Exercise 2:** 
Cleaning the country column

The country column has all countries written in lowercase. Change them so that each word starts with a capital letter.  
Best method for this is: [pandas.Series.str.title](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.title.html)

In [None]:
bands['country'].head(5)

0    united states
1           sweden
2          austria
3           mexico
4           russia
Name: country, dtype: object

#### Solution with pandas.Series.str.title

This method capitalizes the first letter of every word in a string and lowercases the remaining letters.  
**'this is a string'** would be changed to **'This Is A String'**
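
For comparison, here is a small sketch of title() next to two related string methods, capitalize() and upper(), on hypothetical sample values. It is only meant to show why title() is the one that fits multi-word country names.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the country column.
countries = pd.Series(["united states", "sweden"])

titled = countries.str.title()            # first letter of every word
capitalized = countries.str.capitalize()  # first letter of the string only
uppercased = countries.str.upper()        # every letter

print(titled.tolist())       # ['United States', 'Sweden']
print(capitalized.tolist())  # ['United states', 'Sweden']
print(uppercased.tolist())   # ['UNITED STATES', 'SWEDEN']
```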

In [None]:
bands.loc[:,'country'] = bands.loc[:,'country'].str.title()
bands.loc[:,'country'].head(5)

0    United States
1           Sweden
2          Austria
3           Mexico
4           Russia
Name: country, dtype: object

### **Exercise 3:**
Cleaning the status column

The status column has some abbreviations instead of the real status.  
Change them in accordance with this:  
* ac = Active  
* su = Split-up  
* cn = Changed name  
* oh = On hold  
* un = Unknown

You can use  
pandas.Series.str.replace -> https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html  
or you can create a dictionary and remap values according to it with  
pandas.Series.replace -> https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html

In [None]:
bands['status'].head(5)

0    ac
1    su
2    ac
3    oh
4    su
Name: status, dtype: object

In [None]:
messy_status = bands.loc[:,'status'].copy() #saving a copy of the messy data so we can reset it before each solution

#### Solution with pandas.Series.str.replace

In [None]:
bands.loc[:,'status'] = bands.loc[:,'status'].str.replace('ac','Active')
bands.loc[:,'status'] = bands.loc[:,'status'].str.replace('su','Split-up')
bands.loc[:,'status'] = bands.loc[:,'status'].str.replace('cn','Changed name')
bands.loc[:,'status'] = bands.loc[:,'status'].str.replace('oh','On hold')
bands.loc[:,'status'] = bands.loc[:,'status'].str.replace('un','Unknown')
bands['status'].head(5)

0      Active
1    Split-up
2      Active
3     On hold
4    Split-up
Name: status, dtype: object
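
One thing to be aware of with this approach: str.replace works on substrings, so it would also rewrite an abbreviation that appears inside a longer value. The sketch below uses hypothetical status values (the value 'unsigned' is invented for illustration and is not in the dataset) and shows one possible way to restrict the replacement to whole values with the regex anchors ^ and $.

```python
import pandas as pd

# Hypothetical values; 'unsigned' is invented to show the substring effect.
status = pd.Series(["un", "unsigned", "ac"])

# Substring replacement: 'un' is also replaced inside 'unsigned'.
substring = status.str.replace("un", "Unknown", regex=False)

# Anchored regex: only values that are exactly 'un' are replaced.
whole_value = status.str.replace(r"^un$", "Unknown", regex=True)

print(substring.tolist())    # ['Unknown', 'Unknownsigned', 'ac']
print(whole_value.tolist())  # ['Unknown', 'unsigned', 'ac']
```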

#### Solution with pandas.Series.replace

This method replaces values specified as single values, lists or dictionaries. We are going to use the solution with a dictionary.

In [None]:
#resetting status for second solution
bands.loc[:,'status'] = messy_status 

In [None]:
# Shape of dictionary is {'value_to_replace' : 'value_to_replace_with'}
dictionary = {'su':'Split-up',
              'cn':'Changed name',
              'oh':'On hold',
              'un':'Unknown',
              'ac':'Active'}

In [None]:
bands.loc[:,'status'] = bands.loc[:,'status'].replace(dictionary)
bands.loc[:,'status'].head(5)

0      Active
1    Split-up
2      Active
3     On hold
4    Split-up
Name: status, dtype: object
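
As a variation on the dictionary approach, pandas.Series.map accepts the same kind of dictionary but treats values that are missing from it differently. The sketch below uses hypothetical data; the code 'xx' is invented to stand for an abbreviation that is not in the dictionary.

```python
import pandas as pd

# Hypothetical values; 'xx' stands for a code missing from the dictionary.
status = pd.Series(["ac", "su", "xx"])

mapping = {"ac": "Active", "su": "Split-up", "cn": "Changed name",
           "oh": "On hold", "un": "Unknown"}

replaced = status.replace(mapping)  # unlisted values are left unchanged
mapped = status.map(mapping)        # unlisted values become NaN

print(replaced.tolist())       # ['Active', 'Split-up', 'xx']
print(mapped.isna().tolist())  # [False, False, True]
```

replace() is the safer choice when some values should pass through untouched; map() can be handy when you want unexpected codes to show up as missing values.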

### **Exercise 4:**
Cleaning the genre column

The genre column stores the genres in a single string, separated by the character /  
1. First, transform the string into a list of strings  
(e.g. 'Avant-garde Black/Death Metal'    to     \[Avant-garde Black, Death Metal\])  
1. Then, create a new column 'number_of_genres' where you will store the number of genres in each list.  

Methods you can use are:  
pandas.Series.str.split -> https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html  
pandas.Series.str.len -> https://pandas.pydata.org/docs/reference/api/pandas.Series.str.len.html

In [None]:
bands['genre'].head(5)

0      Melodic Death/Thrash Metal
1                     Death Metal
2        Death/Black/Gothic Metal
3                     Death Metal
4    Technical Death/Thrash Metal
Name: genre, dtype: object

#### Solution with pandas.Series.str.split & pandas.Series.str.len

We will use the method pandas.Series.str.split to split each string around the given character (in our case /).

In [None]:
bands['genre'] = bands['genre'].str.split('/')
bands['genre'].head(5)

0      [Melodic Death, Thrash Metal]
1                      [Death Metal]
2       [Death, Black, Gothic Metal]
3                      [Death Metal]
4    [Technical Death, Thrash Metal]
Name: genre, dtype: object

Now we can use pandas.Series.str.len to count the number of items in each list and save it to a new column.

In [None]:
bands['number_of_genres'] = bands['genre'].str.len()
bands[['genre','number_of_genres']].head(5)

Unnamed: 0,genre,number_of_genres
0,"[Melodic Death, Thrash Metal]",2
1,[Death Metal],1
2,"[Death, Black, Gothic Metal]",3
3,[Death Metal],1
4,"[Technical Death, Thrash Metal]",2
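
The sketch below repeats the split-and-count steps on three hypothetical strings and adds two optional extras that are not part of the exercise: a cross-check that counts the separators directly, and pandas.Series.explode, which turns each list item into its own row.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the genre column.
genres = pd.Series(["Melodic Death/Thrash Metal",
                    "Death Metal",
                    "Death/Black/Gothic Metal"])

parts = genres.str.split("/")   # each string becomes a list of strings
counts = parts.str.len()        # .str.len() also works on lists

# Cross-check: the number of genres is the number of separators plus one.
counts_from_separators = genres.str.count("/") + 1

# One row per genre; the original index is repeated for each list item.
exploded = parts.explode()

print(counts.tolist())    # [2, 1, 3]
print(exploded.tolist())
```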


### **Exercise 5:** 
Cleaning the active column

The active column contains the year the band was formed, followed by the year it stopped being active, 'present' if it is still active, or '?' if the end year is unknown.  
Create two new columns 'active_from' and 'active_to' and fill them according to the information in the active column.  
Method you can use is:  
pandas.Series.str.extract -> https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html#

In [None]:
bands['active'].head(50)

0     2005-present
1        1988-1992
2     1993-present
3     2010-present
4        1991-2002
5     2013-present
6     2001-present
7        2001-2007
8        2001-2006
9        2006-2013
10       2010-2011
11       1998-2008
12          2005-?
13          1986-?
14    1997-present
15       2012-2013
16          1991-?
17    2009-present
18    2014-present
19       1997-1998
20       1999-2008
21          2002-?
22          1989-?
23       2004-2007
24    2007-present
25       2004-2012
Name: active, dtype: object

#### Solution active_from
We will use the method pandas.Series.str.extract to extract the value we want from the string.  
For the active_from column we want the year that stands before the - character (**1998**-present).  
The regex that will help us capture it is **\d{4}-** where \d represents a digit, {4} says we want 4 digits, and - represents a literal -.  
Notice that the regex in the code has () brackets, **(**\d{4}**)**-, which means we want the regex to match '1998-' but capture only '1998'.
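
The difference between what a regex matches and what it captures can be seen with Python's built-in `re` module. This is a minimal sketch on one hypothetical string; str.extract returns the captured group, not the whole match.

```python
import re

# Hypothetical sample value in the same shape as the active column.
match = re.search(r"(\d{4})-", "1998-present")

whole_match = match.group(0)  # everything the pattern matched
captured = match.group(1)     # only the part inside the () brackets

print(whole_match)  # 1998-
print(captured)     # 1998
```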

In [None]:
bands['active_from'] = bands.loc[:,'active'].str.extract(r'(\d{4})-')
bands['active_from'] 

0     2005
1     1988
2     1993
3     2010
4     1991
5     2013
6     2001
7     2001
8     2001
9     2006
10    2010
11    1998
12    2005
13    1986
14    1997
15    2012
16    1991
17    2009
18    2014
19    1997
20    1999
21    2002
22    1989
23    2004
24    2007
25    2004
Name: active_from, dtype: object

#### Solution active_to
We will use the method pandas.Series.str.extract to extract the value we want from the string.  
For the active_to column we want the year (or ? or present) that stands after the - character (1998-**present**).  
The regex that will help us capture it is **-\d{4}|present|\\?** where - represents a literal -, \d represents a digit, {4} says we want 4 digits, present represents the word present and \\? represents a literal ?.  
The | sign between them says we want to capture any one of the 3 patterns (or|or|or).  
Notice that the regex in the code has () brackets, -**(**\d{4}|present|\\?**)**, which means we want the regex to match '-present' but capture only 'present'. The brackets also group the three alternatives, so the leading - applies to all of them.

In [None]:
bands['active_to'] = bands.loc[:,'active'].str.extract(r'-(\d{4}|present|\?)')
bands['active_to']

0     present
1        1992
2     present
3     present
4        2002
5     present
6     present
7        2007
8        2006
9        2013
10       2011
11       2008
12          ?
13          ?
14    present
15       2013
16          ?
17    present
18    present
19       1998
20       2008
21          ?
22          ?
23       2007
24    present
25       2012
Name: active_to, dtype: object

In [None]:
bands[['active','active_from','active_to']].head(5)

Unnamed: 0,active,active_from,active_to
0,2005-present,2005,present
1,1988-1992,1988,1992
2,1993-present,1993,present
3,2010-present,2010,present
4,1991-2002,1991,2002
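
Note that the extracted values are strings (dtype object in the outputs above). The sketch below shows one possible follow-up, not part of the exercise: extracting both columns in a single call with named groups, then converting the years to numbers with pandas.to_numeric. The sample values are hypothetical, and the variable name `active_to_year` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the active column.
active = pd.Series(["2005-present", "1988-1992", "2005-?"])

# Named groups (?P<name>...) become the column names of the result.
parts = active.str.extract(
    r"(?P<active_from>\d{4})-(?P<active_to>\d{4}|present|\?)"
)

# Every start year is numeric, so a plain conversion works.
parts["active_from"] = pd.to_numeric(parts["active_from"])

# 'present' and '?' are not numbers; errors='coerce' turns them into NaN.
active_to_year = pd.to_numeric(parts["active_to"], errors="coerce")

print(parts)
print(active_to_year)
```

Whether 'present' and '?' should become NaN depends on what you want to do with the column next; keeping the original strings in active_to and adding a separate numeric column is one way to avoid losing that information.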


### **Exercise 6** 
Counting the themes

Count how many times the words Love, Life and Death appear in the theme column.  
Method you can use is:  
pandas.Series.str.count -> https://pandas.pydata.org/docs/reference/api/pandas.Series.str.count.html  

In [None]:
bands['theme'].head(5)

0                Death| Love| Life| Evil| Darker Tones
1              Metaphysical Philosophy| Parapsychology
2                                 Sadness| Life| Death
3                   Death| Destruction| War| Decadence
4    Loneliness philosophy| state of mind of the pe...
Name: theme, dtype: object

#### Solution
We will use the method pandas.Series.str.count to count the occurrences of each word in each row, and then use sum() to add up the counts across all rows.

In [None]:
bands['theme'].str.count('Death').sum()

12

In [None]:
bands['theme'].str.count('Love').sum()

3

In [None]:
bands['theme'].str.count('Life').sum()

4
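
The three cells above can also be written as a loop. The sketch below does that on hypothetical theme strings and adds a stricter variant: str.count takes a regex and counts substrings, so 'Death' is also counted inside a longer word such as the invented 'Deathlike'. Wrapping the word in \b (word boundary) counts whole words only. Whether that distinction matters for the real dataset is not checked here.

```python
import pandas as pd

# Hypothetical theme strings; 'Deathlike' is invented to show the difference.
themes = pd.Series(["Death| Love| Life| Evil",
                    "Sadness| Life| Death",
                    "Deathlike silence| Afterlife"])

words = ["Love", "Life", "Death"]

# Substring counts, as in the solution above (case-sensitive).
substring_counts = {w: int(themes.str.count(w).sum()) for w in words}

# Whole-word counts: \b marks a word boundary on each side of the word.
whole_word_counts = {w: int(themes.str.count(rf"\b{w}\b").sum()) for w in words}

print(substring_counts)   # {'Love': 1, 'Life': 2, 'Death': 3}
print(whole_word_counts)  # {'Love': 1, 'Life': 2, 'Death': 2}
```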

## Final result

In [None]:
bands

Unnamed: 0,name,country,status,formed_in,genre,theme,active,number_of_genres,active_from,active_to
0,Act of Destruction,United States,Active,2005.0,"[Melodic Death, Thrash Metal]",Death| Love| Life| Evil| Darker Tones,2005-present,2,2005,present
1,Nirvana 2002,Sweden,Split-up,1988.0,[Death Metal],Metaphysical Philosophy| Parapsychology,1988-1992,1,1988,1992
2,Olemus,Austria,Active,1993.0,"[Death, Black, Gothic Metal]",Sadness| Life| Death,1993-present,3,1993,present
3,Misanthrope,Mexico,On hold,2010.0,[Death Metal],Death| Destruction| War| Decadence,2010-present,1,2010,present
4,Detonator,Russia,Split-up,1991.0,"[Technical Death, Thrash Metal]",Loneliness philosophy| state of mind of the pe...,1991-2002,2,1991,2002
5,Blood Agent,Germany,Active,2013.0,"[Death, Thrash Metal]",War| Death| Apocalypse,2013-present,2,2013,present
6,Traumagain,Italy,Active,2001.0,[Brutal Death Metal],Nihilism| Death,2001-present,1,2001,present
7,Anaktorian,Finland,Split-up,2001.0,[Melodic Death Metal],Death| emotions| pain,2001-2007,1,2001,2007
8,Revolt,Italy,Changed name,2001.0,"[Death, Thrash Metal]",Society| Hate,2001-2006,2,2001,2006
9,Coldworker,Sweden,Split-up,2006.0,[Death Metal],Death,2006-2013,1,2006,2013
