## Preprocessing

Introduction to the data sets:

Below are the two CSV files of Resident Advisor singles reviews. The `add_df` file was scraped in late February 2018 to extend the original dataset, `old_df`, which was scraped in mid-October 2017.

Warning:
- The web scraping may not have fully covered the singles reviews from October.

In [10]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline  

In [11]:
import os
cwd = os.getcwd()
cwd

'/Users/issy/Desktop/resident-adv-reviews-singles'

In [12]:
# read the csv files from the data folder under the working directory
old_df = pd.read_csv(os.path.join(cwd, "data", "resident-adv.csv")) # original scrape: earliest reviews to Oct 2017
add_df = pd.read_csv(os.path.join(cwd, "data", "new2.csv")) # follow-up scrape: Oct 2017 to Feb 2018
print(old_df.shape)
print(add_df.shape)

(10713, 12)
(236, 9)


In [13]:
old_df.head()

Unnamed: 0,title,artist,single,label,record,style,reviewed_date,release_date,comments,rating,description,URL
0,Epic B - Late Night FlexN,Epic B,Late Night FlexN,Swing Ting,SWINGTING015,Flex Dance Music,2017-10-26,September 2017,1,3.7,Brooklyn's flex dance music has been around fo...,https://www.residentadvisor.net/reviews/21662
1,Fatima Al Qadiri - Shaneera,Fatima Al Qadiri,Shaneera,Hyperdub,HDB110,Bass,2017-10-26,October 2017,8,4.0,Few artists in the bass scene are as explicitl...,https://www.residentadvisor.net/reviews/21671
2,The Room Below - Healing Scaphoids EP,The Room Below,Healing Scaphoids EP,Don't Be Afraid,DBA033,"Broken Beat, House",2017-10-25,September 2017,1,3.6,"I met Henry Keen in Cuba last year, where he a...",https://www.residentadvisor.net/reviews/21659
3,Fatima - Somebody Else,Fatima,Somebody Else,Eglo,EGLO60,R&B,2017-10-25,October 2017,2,3.6,"Fatima's debut LP, a deeply imaginative take o...",https://www.residentadvisor.net/reviews/21658
4,Ciel - Electrical Encounters,Ciel,Electrical Encounters,Peach Discs,PEACH004,"House, Electro",2017-10-25,October 2017,13,4.0,Cindy Li is one of the busiest promoters in To...,https://www.residentadvisor.net/reviews/21657


In [14]:
add_df.head()

Unnamed: 0,title,label,record,style,release_date,comments,rating,description,URL
0,Ex-Terrestrial - Urth Born,Pacific Rhythm,PR003,"Downtempo, Ambient, House",February 2018,Post a comment,3.6,"Dance music history is a limited resource, and...",https://www.residentadvisor.net/reviews/22198
1,Kym Sugiru - Ophelia,Diskotopia,DSK038,"Electro, Dancehall",February 2018,Post a comment,3.6,Ophelia is an eccentric take on two styles we'...,https://www.residentadvisor.net/reviews/22237
2,Stanislav Tolkachev - Catacomb Saints,Pohjola,POHLTD0015,"Techno, Ambient",February 2018,Post a comment,3.7,Stanislav Tolkachev revels in disorientating a...,https://www.residentadvisor.net/reviews/22215
3,Various - Introduction EP,Intergraded,INTGRD001,1 / View,February 2018,3.8,3.8,"Harry Agius, AKA Midland, gives back to the sc...",https://www.residentadvisor.net/reviews/22202
4,Taraval - No Coast,Hypercolour,HYPE069,Techno,March 2018,Post a comment,3.6,"Since his last record as Taraval, Ryan Smith h...",https://www.residentadvisor.net/reviews/22210


Comments:

- `add_df` doesn't have the artist and single split into separate columns as `old_df` does.

- The columns of the two datasets are also in a different order.

- `add_df` also has no reviewed-date column, so only the release date is used.
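
The differences listed above can also be checked programmatically. A minimal sketch, using plain Python sets on the column names shown in the two `head()` outputs (with the dataframes loaded, `set(old_df.columns)` and `set(add_df.columns)` would play the same role):

```python
# Column names copied from the old_df.head() and add_df.head() outputs above.
old_columns = ["title", "artist", "single", "label", "record", "style",
               "reviewed_date", "release_date", "comments", "rating",
               "description", "URL"]
add_columns = ["title", "label", "record", "style", "release_date",
               "comments", "rating", "description", "URL"]

# Columns present in old_df but missing from add_df.
missing_from_add = sorted(set(old_columns) - set(add_columns))
# Columns present in add_df but missing from old_df.
missing_from_old = sorted(set(add_columns) - set(old_columns))

print(missing_from_add)  # ['artist', 'reviewed_date', 'single']
print(missing_from_old)  # []
```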

In [15]:
# reorder old_df's columns; reviewed_date is dropped because add_df has no equivalent
columns = ["title", "single", "artist", "record", "label", "style", "release_date", "rating", "comments", "URL", "description"]
old_df = old_df[columns]

In [16]:
# split titles on the plain " - " separator
partial_split = add_df["title"].str.split(" - ")

# some titles use " \u200e- " (a left-to-right mark before the hyphen) and come back unsplit;
# split those on that separator so every title yields [artist, single]
split_list = []
for parts in partial_split:
    if len(parts) == 1:
        split_list.append(parts[0].split(" \u200e- "))
    else:
        split_list.append(parts)

In [17]:
# create variables for artist and single using split_list
artists = [i[0] for i in split_list]
singles = [i[1] for i in split_list]
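
The two-pass split above can also be written as a single regular-expression split. This is only a sketch, assuming every title is either `artist - single` or `artist \u200e- single`; the helper name `split_title` is hypothetical:

```python
import re

# " - " with an optional left-to-right mark (U+200E) directly before the hyphen.
SEPARATOR = re.compile(r" \u200e?- ")

def split_title(title):
    """Return (artist, single); single is None when no separator is found."""
    parts = SEPARATOR.split(title, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return title, None

print(split_title("Ex-Terrestrial - Urth Born"))  # ('Ex-Terrestrial', 'Urth Born')
print(split_title("Flow \u200e- Trine"))           # ('Flow', 'Trine')
```

Unlike the loop above, `maxsplit=1` keeps any further ` - ` inside the single's name instead of discarding the remainder. With the dataframe loaded, `[split_title(t) for t in add_df["title"]]` would give the artist/single pairs.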

In [18]:
# add artist and single columns to add_df
add_df["artist"] = artists
add_df["single"] = singles
# order columns
add_df = add_df[columns]
add_df.head()

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
0,Ex-Terrestrial - Urth Born,Urth Born,Ex-Terrestrial,PR003,Pacific Rhythm,"Downtempo, Ambient, House",February 2018,3.6,Post a comment,https://www.residentadvisor.net/reviews/22198,"Dance music history is a limited resource, and..."
1,Kym Sugiru - Ophelia,Ophelia,Kym Sugiru,DSK038,Diskotopia,"Electro, Dancehall",February 2018,3.6,Post a comment,https://www.residentadvisor.net/reviews/22237,Ophelia is an eccentric take on two styles we'...
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,"Techno, Ambient",February 2018,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
3,Various - Introduction EP,Introduction EP,Various,INTGRD001,Intergraded,1 / View,February 2018,3.8,3.8,https://www.residentadvisor.net/reviews/22202,"Harry Agius, AKA Midland, gives back to the sc..."
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,March 2018,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."


In [19]:
raw_df = pd.concat([add_df, old_df], ignore_index = True)
raw_df.head()

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
0,Ex-Terrestrial - Urth Born,Urth Born,Ex-Terrestrial,PR003,Pacific Rhythm,"Downtempo, Ambient, House",February 2018,3.6,Post a comment,https://www.residentadvisor.net/reviews/22198,"Dance music history is a limited resource, and..."
1,Kym Sugiru - Ophelia,Ophelia,Kym Sugiru,DSK038,Diskotopia,"Electro, Dancehall",February 2018,3.6,Post a comment,https://www.residentadvisor.net/reviews/22237,Ophelia is an eccentric take on two styles we'...
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,"Techno, Ambient",February 2018,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
3,Various - Introduction EP,Introduction EP,Various,INTGRD001,Intergraded,1 / View,February 2018,3.8,3.8,https://www.residentadvisor.net/reviews/22202,"Harry Agius, AKA Midland, gives back to the sc..."
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,March 2018,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."


In [20]:
# convert release_date into datetime object
raw_df['release_date'] = pd.to_datetime(raw_df["release_date"], infer_datetime_format = True, errors = "coerce")
raw_df

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
0,Ex-Terrestrial - Urth Born,Urth Born,Ex-Terrestrial,PR003,Pacific Rhythm,"Downtempo, Ambient, House",2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22198,"Dance music history is a limited resource, and..."
1,Kym Sugiru - Ophelia,Ophelia,Kym Sugiru,DSK038,Diskotopia,"Electro, Dancehall",2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22237,Ophelia is an eccentric take on two styles we'...
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,"Techno, Ambient",2018-02-01,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
3,Various - Introduction EP,Introduction EP,Various,INTGRD001,Intergraded,1 / View,2018-02-01,3.8,3.8,https://www.residentadvisor.net/reviews/22202,"Harry Agius, AKA Midland, gives back to the sc..."
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,2018-03-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."
5,Binh - Eastern Bloc,Eastern Bloc,Binh,CABARET016,Cabaret Recordings,"Minimal, Techno",2018-02-01,4.0,1,https://www.residentadvisor.net/reviews/22236,Cabaret's streak of fresh and intriguing minim...
6,Mica Levi - Delete Beach,Delete Beach,Mica Levi,DDS024,DDS,"Electronic, Experimental, Ambient",2018-02-01,3.9,3,https://www.residentadvisor.net/reviews/22157,"""I've just always been interested in those gli..."
7,Nathan Fake - Sunder,Sunder,Nathan Fake,ZEN12471,Ninja Tune,"Techno, Electronica",2018-02-01,3.8,3,https://www.residentadvisor.net/reviews/22153,"""I've always been totally turned off by the id..."
8,Albrecht La'Brooy - Tidal River,Tidal River,Albrecht La'Brooy,AMB1801,Apollo,"Ambient, Deep House",2018-01-01,3.7,3,https://www.residentadvisor.net/reviews/22197,There has always been an appreciation of the o...
9,Cesare vs Disorder ‎- Ararapira Jazz,Ararapira Jazz,Cesare vs Disorder,VA072,Vakant,"Minimal, House",2018-02-01,3.5,2,https://www.residentadvisor.net/reviews/22265,Minimal has gone through many phases over the ...
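
Since the release dates shown above all follow a `Month Year` pattern, the format can also be passed explicitly instead of being inferred. A sketch on a small hand-made Series (the third value is an illustrative scraping error, not a row taken from the data), assuming English month names. Recent pandas releases deprecate `infer_datetime_format`, so this form may also avoid a warning there:

```python
import pandas as pd

# Illustrative values: two well-formed dates and one scraping error.
release_dates = pd.Series(["February 2018", "March 2018", "Post a comment"])

# %B is the full month name, %Y the four-digit year; anything that does not
# match becomes NaT because of errors="coerce".
parsed = pd.to_datetime(release_dates, format="%B %Y", errors="coerce")

print(parsed)
```

Each parsed date falls on the first of the month, as in the output above, because the source strings carry no day.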


In [21]:
## grouping the related genre labels ##
# first strip special characters from the style column, keeping letters, digits, whitespace and hyphens
raw_df["style"] = raw_df["style"].replace(regex=True, to_replace=r'[^A-Za-z0-9\s-]+', value='')

# group all tech house
raw_df.replace(inplace=True, to_replace= ["Techno House", "House Techno", "Tech house", "House Tech House", "Techno Tech House", "Tech House House", "Tech House Techno"], value='Tech House')
# group all minimal 
raw_df.replace(inplace=True, to_replace= ["House Minimal","Minimal House", "Minimal Techno", "Techno Minimal", "Minimal house", "Minimal techno", 'Minimal Tech House','Tech House Minimal'] , value='Minimal')
# group all Techno
raw_df.replace(inplace=True, to_replace= ["Techno Ambient", "Techno Bass", "Techno Industrial", "Techno Dub Techno"] , value='Techno')
# group all deep house
raw_df.replace(inplace=True, to_replace= ["House Deep House", "Deep House House", "Deep house"] , value='Deep House')
# group all experimental
raw_df.replace(inplace=True, to_replace= ["Experimental", "Progressive", "Techno Experimental","Experimental Techno", "Experimental House", "House Experimental", "Experimental house", "Experimental techno", "Progressive House", "House Progressive", "Progressive house", "Progressive techno"] , value='Progressive')
# group all breakbeat
raw_df.replace(inplace=True, to_replace= ["Breaks"] , value='Breakbeat')
# group all disco
raw_df.replace(inplace=True, to_replace= ["House Disco", "Disco House"] , value='Disco')
# group all electro
raw_df.replace(inplace=True, to_replace= ["Electro House", "Electro Techno", "House Electro", "Techno Electro"] , value='Electro')
# group dubstep
raw_df.replace(inplace=True, to_replace= ["Dub", "Dub Dubstep", "Grime Dubstep", "Dubstep Grime"] , value='Dubstep')
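
The chain of `replace` calls above can also be expressed as one lookup table applied to the style values only, which leaves other columns untouched if, say, a single happens to be named `Dub`. A sketch with a small subset of the groupings; the names `STYLE_GROUPS` and `normalise_style` are hypothetical:

```python
import re

# Subset of the groupings used above: cleaned style string -> grouped label.
STYLE_GROUPS = {
    "Techno Ambient": "Techno",
    "Minimal Techno": "Minimal",
    "Minimal House": "Minimal",
    "House Deep House": "Deep House",
    "Breaks": "Breakbeat",
}

def normalise_style(style):
    """Strip special characters, then map the result to its group if one exists."""
    cleaned = re.sub(r"[^A-Za-z0-9\s-]+", "", style)
    return STYLE_GROUPS.get(cleaned, cleaned)

print(normalise_style("Techno, Ambient"))    # Techno
print(normalise_style("Minimal, House"))     # Minimal
print(repr(normalise_style("Drum & Bass")))  # 'Drum  Bass' (two spaces remain)
```

With the dataframe loaded this could be applied as `raw_df["style"] = raw_df["style"].map(normalise_style)`, provided the column holds only strings; missing values would need handling first.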

In [22]:
# most popular genres
raw_df["style"].value_counts()

House                                 2222
Techno                                1649
Tech House                             954
Progressive                            634
Minimal                                498
Deep House                             351
Electro                                256
Breakbeat                              211
Dubstep                                205
Disco                                  192
Post a comment                         153
Drum  Bass                              97
Electronica                             60
Grime                                   48
Dub Techno                              46
Bass                                    42
House Downtempo                         37
UK Bass                                 35
House Bass                              30
Deep House Tech House                   29
House Ambient                           29
House Electronica                       28
Trance                                  28
Edits      

#### Cleaning the dataframe of null or wrongly scraped values

In [23]:
# build cleaned dataframes by removing rows with null or wrongly scraped values in certain columns
df_dates = raw_df[pd.notnull(raw_df['release_date'])] # all rows with non-null release_date
df_dates

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
0,Ex-Terrestrial - Urth Born,Urth Born,Ex-Terrestrial,PR003,Pacific Rhythm,Downtempo Ambient House,2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22198,"Dance music history is a limited resource, and..."
1,Kym Sugiru - Ophelia,Ophelia,Kym Sugiru,DSK038,Diskotopia,Electro Dancehall,2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22237,Ophelia is an eccentric take on two styles we'...
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,Techno,2018-02-01,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
3,Various - Introduction EP,Introduction EP,Various,INTGRD001,Intergraded,1 View,2018-02-01,3.8,3.8,https://www.residentadvisor.net/reviews/22202,"Harry Agius, AKA Midland, gives back to the sc..."
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,2018-03-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."
5,Binh - Eastern Bloc,Eastern Bloc,Binh,CABARET016,Cabaret Recordings,Minimal,2018-02-01,4.0,1,https://www.residentadvisor.net/reviews/22236,Cabaret's streak of fresh and intriguing minim...
6,Mica Levi - Delete Beach,Delete Beach,Mica Levi,DDS024,DDS,Electronic Experimental Ambient,2018-02-01,3.9,3,https://www.residentadvisor.net/reviews/22157,"""I've just always been interested in those gli..."
7,Nathan Fake - Sunder,Sunder,Nathan Fake,ZEN12471,Ninja Tune,Techno Electronica,2018-02-01,3.8,3,https://www.residentadvisor.net/reviews/22153,"""I've always been totally turned off by the id..."
8,Albrecht La'Brooy - Tidal River,Tidal River,Albrecht La'Brooy,AMB1801,Apollo,Ambient Deep House,2018-01-01,3.7,3,https://www.residentadvisor.net/reviews/22197,There has always been an appreciation of the o...
9,Cesare vs Disorder ‎- Ararapira Jazz,Ararapira Jazz,Cesare vs Disorder,VA072,Vakant,Minimal,2018-02-01,3.5,2,https://www.residentadvisor.net/reviews/22265,Minimal has gone through many phases over the ...


In [24]:
df_label = raw_df[raw_df['label'] != '<li>\r\n<div>\r\nLabel /'] # drop rows whose label is a leftover HTML fragment from the scrape
df_label

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
0,Ex-Terrestrial - Urth Born,Urth Born,Ex-Terrestrial,PR003,Pacific Rhythm,Downtempo Ambient House,2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22198,"Dance music history is a limited resource, and..."
1,Kym Sugiru - Ophelia,Ophelia,Kym Sugiru,DSK038,Diskotopia,Electro Dancehall,2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22237,Ophelia is an eccentric take on two styles we'...
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,Techno,2018-02-01,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
3,Various - Introduction EP,Introduction EP,Various,INTGRD001,Intergraded,1 View,2018-02-01,3.8,3.8,https://www.residentadvisor.net/reviews/22202,"Harry Agius, AKA Midland, gives back to the sc..."
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,2018-03-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."
5,Binh - Eastern Bloc,Eastern Bloc,Binh,CABARET016,Cabaret Recordings,Minimal,2018-02-01,4.0,1,https://www.residentadvisor.net/reviews/22236,Cabaret's streak of fresh and intriguing minim...
6,Mica Levi - Delete Beach,Delete Beach,Mica Levi,DDS024,DDS,Electronic Experimental Ambient,2018-02-01,3.9,3,https://www.residentadvisor.net/reviews/22157,"""I've just always been interested in those gli..."
7,Nathan Fake - Sunder,Sunder,Nathan Fake,ZEN12471,Ninja Tune,Techno Electronica,2018-02-01,3.8,3,https://www.residentadvisor.net/reviews/22153,"""I've always been totally turned off by the id..."
8,Albrecht La'Brooy - Tidal River,Tidal River,Albrecht La'Brooy,AMB1801,Apollo,Ambient Deep House,2018-01-01,3.7,3,https://www.residentadvisor.net/reviews/22197,There has always been an appreciation of the o...
9,Cesare vs Disorder ‎- Ararapira Jazz,Ararapira Jazz,Cesare vs Disorder,VA072,Vakant,Minimal,2018-02-01,3.5,2,https://www.residentadvisor.net/reviews/22265,Minimal has gone through many phases over the ...


In [25]:
df_rating = raw_df[pd.to_numeric(raw_df["rating"], errors="coerce").notna()] # rows whose rating is a valid number
df_rating

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
0,Ex-Terrestrial - Urth Born,Urth Born,Ex-Terrestrial,PR003,Pacific Rhythm,Downtempo Ambient House,2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22198,"Dance music history is a limited resource, and..."
1,Kym Sugiru - Ophelia,Ophelia,Kym Sugiru,DSK038,Diskotopia,Electro Dancehall,2018-02-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22237,Ophelia is an eccentric take on two styles we'...
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,Techno,2018-02-01,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
3,Various - Introduction EP,Introduction EP,Various,INTGRD001,Intergraded,1 View,2018-02-01,3.8,3.8,https://www.residentadvisor.net/reviews/22202,"Harry Agius, AKA Midland, gives back to the sc..."
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,2018-03-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."
5,Binh - Eastern Bloc,Eastern Bloc,Binh,CABARET016,Cabaret Recordings,Minimal,2018-02-01,4.0,1,https://www.residentadvisor.net/reviews/22236,Cabaret's streak of fresh and intriguing minim...
6,Mica Levi - Delete Beach,Delete Beach,Mica Levi,DDS024,DDS,Electronic Experimental Ambient,2018-02-01,3.9,3,https://www.residentadvisor.net/reviews/22157,"""I've just always been interested in those gli..."
7,Nathan Fake - Sunder,Sunder,Nathan Fake,ZEN12471,Ninja Tune,Techno Electronica,2018-02-01,3.8,3,https://www.residentadvisor.net/reviews/22153,"""I've always been totally turned off by the id..."
8,Albrecht La'Brooy - Tidal River,Tidal River,Albrecht La'Brooy,AMB1801,Apollo,Ambient Deep House,2018-01-01,3.7,3,https://www.residentadvisor.net/reviews/22197,There has always been an appreciation of the o...
9,Cesare vs Disorder ‎- Ararapira Jazz,Ararapira Jazz,Cesare vs Disorder,VA072,Vakant,Minimal,2018-02-01,3.5,2,https://www.residentadvisor.net/reviews/22265,Minimal has gone through many phases over the ...


In [26]:
# keep only the 10 most popular styles
# genre counts of the raw data
counts_genres = raw_df["style"].value_counts()
# filtering on the top 10 also drops 'Post a comment' (a scraping mistake)
df_style = raw_df[raw_df["style"].isin(["Breakbeat", "Tech House", "Minimal", "Techno", "House", "Deep House", "Progressive", "Disco", "Dubstep", "Electro"])]
df_style

Unnamed: 0,title,single,artist,record,label,style,release_date,rating,comments,URL,description
2,Stanislav Tolkachev - Catacomb Saints,Catacomb Saints,Stanislav Tolkachev,POHLTD0015,Pohjola,Techno,2018-02-01,3.7,Post a comment,https://www.residentadvisor.net/reviews/22215,Stanislav Tolkachev revels in disorientating a...
4,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,2018-03-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."
5,Binh - Eastern Bloc,Eastern Bloc,Binh,CABARET016,Cabaret Recordings,Minimal,2018-02-01,4.0,1,https://www.residentadvisor.net/reviews/22236,Cabaret's streak of fresh and intriguing minim...
9,Cesare vs Disorder ‎- Ararapira Jazz,Ararapira Jazz,Cesare vs Disorder,VA072,Vakant,Minimal,2018-02-01,3.5,2,https://www.residentadvisor.net/reviews/22265,Minimal has gone through many phases over the ...
10,Various - PGH Electro Vol. 1,PGH Electro Vol. 1,Various,IW03,Is / Was,Electro,2018-02-01,3.7,4,https://www.residentadvisor.net/reviews/22174,Several Midwest-based producers have received ...
11,Loco Dice - Roots,Roots,Loco Dice,DESOLAT060,Desolat,Tech House,2018-02-01,2.9,8,https://www.residentadvisor.net/reviews/22264,With help from his studio partner Martin Buttr...
14,Philipp Otterbach ‎- Humans,Humans,Philipp Otterbach,TM01,Tour Messier,Techno,2018-02-01,3.8,Post a comment,https://www.residentadvisor.net/reviews/22111,Humans is a murky and unsettling record. A sen...
17,S & M Trading Co. - Metal Surface Repair,Metal Surface Repair,S & M Trading Co.,FIT019,Fit,Tech House,2018-02-01,3.7,2,https://www.residentadvisor.net/reviews/22193,"DJ Sotofett and Aaron ""FIT"" Siegel have worked..."
19,Flow ‎- Trine,Trine,Flow,TAR008,Tardis,Tech House,2018-02-01,4.2,10,https://www.residentadvisor.net/reviews/22249,Recent EPs from artists like The Persuader and...
21,Taraval - No Coast,No Coast,Taraval,HYPE069,Hypercolour,Techno,2018-03-01,3.6,Post a comment,https://www.residentadvisor.net/reviews/22210,"Since his last record as Taraval, Ryan Smith h..."
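
The hard-coded list passed to `isin` above matches the ten most frequent styles in the `value_counts()` output. A sketch of deriving it from the counts rather than typing it out, using the figures printed earlier in a plain dict; with the notebook's `counts_genres` Series, `counts_genres.head(10).index` would be the analogous expression. Excluding `Post a comment` explicitly is an added safeguard, not something the original cell does:

```python
# Counts copied from the value_counts() output above (top twelve rows).
style_counts = {
    "House": 2222, "Techno": 1649, "Tech House": 954, "Progressive": 634,
    "Minimal": 498, "Deep House": 351, "Electro": 256, "Breakbeat": 211,
    "Dubstep": 205, "Disco": 192, "Post a comment": 153, "Drum  Bass": 97,
}

# "Post a comment" is a scraping error rather than a style, so exclude it.
SCRAPE_ERRORS = {"Post a comment"}
valid_counts = {s: n for s, n in style_counts.items() if s not in SCRAPE_ERRORS}

# Ten styles with the highest counts, most frequent first.
top_ten = sorted(valid_counts, key=valid_counts.get, reverse=True)[:10]
print(top_ten)
```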
