# Data wrangling with `pandas`

Methods/functions for data frames (df):
* `read_csv(), .to_csv()` 
* `.head(), .tail(), .describe()`
* `.drop(), .sort_values(), .copy()`

**Attributes** of data frames: 
* `.index` 
* `.columns` 
* `.dtypes`

Methods we've used on individual df columns (they also work on an entire df):
* `.astype()`
* `.isna()`
* `.notna()`

**Indexing:** `[]`, `.loc[]`

We will be working with a data set of popular Spotify songs from 2000-2019 (taken from [Kaggle](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019)), which is saved in `data/songs_normalize.csv`.

In [1]:
# import pandas
import pandas as pd

In [2]:
# read_csv: pandas FUNCTION; reads the .csv file into a pandas dataframe
df = pd.read_csv("data/songs_normalize.csv")

In [4]:
# to_csv: METHOD of pandas dataframe; save the pandas dataframe to a csv file
# (by default the row index is written too, as an unnamed first column)
df.to_csv("data/songs.csv")
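
A small sketch of what the row index does during a save/re-read round trip. The `demo` frame is a made-up stand-in (not the Spotify data), and the CSV text is kept in memory instead of being written to disk. `index=False` is a standard `to_csv` argument that the cell above does not use.

```python
import io
import pandas as pd

# Tiny stand-in frame (hypothetical values, not the Spotify data)
demo = pd.DataFrame({"artist": ["A", "B"], "year": [2000, 2001]})

# Without a path, to_csv() returns the CSV text as a string.
# Default: the row index is written as an extra, unnamed first column.
with_index = demo.to_csv()

# index=False leaves the row index out of the output.
without_index = demo.to_csv(index=False)

# Reading the default output back produces an extra "Unnamed: 0" column.
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))  # ['Unnamed: 0', 'artist', 'year']

reread_clean = pd.read_csv(io.StringIO(without_index))
print(list(reread_clean.columns))  # ['artist', 'year']
```

Whether to keep the index depends on whether it carries information; for a plain 0, 1, 2, ... index it is usually safe to drop.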

In [5]:
# explore the dataframe: number of rows
len(df)

2000

In [7]:
# explore the dataframe: .head()
df.head(3)

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,False,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
1,blink-182,All The Small Things,167066,False,1999,79,0.434,0.897,0,-4.918,1,0.0488,0.0103,0.0,0.612,0.684,148.726,"rock, pop"
2,Faith Hill,Breathe,250546,False,1999,66,0.529,0.496,7,-9.007,1,0.029,0.173,0.0,0.251,0.278,136.859,"pop, country"


In [9]:
# explore the dataframe: .tail()
df.tail(1)

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
1999,Post Malone,Circles,215280,False,2019,85,0.695,0.762,0,-3.497,1,0.0395,0.192,0.00244,0.0863,0.553,120.042,hip hop


In [10]:
# explore the dataframe: .describe()
df.describe()

Unnamed: 0,duration_ms,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,228748.1245,2009.494,59.8725,0.667438,0.720366,5.378,-5.512434,0.5535,0.103568,0.128955,0.015226,0.181216,0.55169,120.122558
std,39136.569008,5.85996,21.335577,0.140416,0.152745,3.615059,1.933482,0.497254,0.096159,0.173346,0.087771,0.140669,0.220864,26.967112
min,113000.0,1998.0,0.0,0.129,0.0549,0.0,-20.514,0.0,0.0232,1.9e-05,0.0,0.0215,0.0381,60.019
25%,203580.0,2004.0,56.0,0.581,0.622,2.0,-6.49025,0.0,0.0396,0.014,0.0,0.0881,0.38675,98.98575
50%,223279.5,2010.0,65.5,0.676,0.736,6.0,-5.285,1.0,0.05985,0.0557,0.0,0.124,0.5575,120.0215
75%,248133.0,2015.0,73.0,0.764,0.839,8.0,-4.16775,1.0,0.129,0.17625,6.8e-05,0.241,0.73,134.2655
max,484146.0,2020.0,89.0,0.975,0.999,11.0,-0.276,1.0,0.576,0.976,0.985,0.853,0.973,210.851


In [11]:
# dataframe (df) attributes: .columns
df.columns

Index(['artist', 'song', 'duration_ms', 'explicit', 'year', 'popularity',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'genre'],
      dtype='object')

In [12]:
# dataframe (df) attributes: .index
df.index

RangeIndex(start=0, stop=2000, step=1)

In [13]:
# dataframe (df) attributes: .dtypes
df.dtypes

artist               object
song                 object
duration_ms           int64
explicit               bool
year                  int64
popularity            int64
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
genre                object
dtype: object

In [18]:
# indexing to a single column
df["year"]

0       2000
1       1999
2       1999
3       2000
4       2000
        ... 
1995    2019
1996    2019
1997    2019
1998    2019
1999    2019
Name: year, Length: 2000, dtype: int64

In [12]:
# what data type is a single column?
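
One way to answer the question above is to call `type()` on the result. The sketch below uses a small made-up `demo` frame as a stand-in for `df`; the same calls should work on the real dataframe.

```python
import pandas as pd

# Stand-in frame with two of the columns from the data set (values are illustrative)
demo = pd.DataFrame({"year": [2000, 1999], "speechiness": [0.0437, 0.0488]})

one_col = demo["year"]                     # single brackets, one column name
two_cols = demo[["year", "speechiness"]]   # a list of column names
just_one = demo[["year"]]                  # a list containing one column name

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
print(type(just_one).__name__)  # DataFrame
```

So a single column is a pandas Series, while indexing with a list of names always gives a DataFrame, even if the list has only one entry.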

In [19]:
# indexing to several columns
df[ ["year", "speechiness"] ]

Unnamed: 0,year,speechiness
0,2000,0.0437
1,1999,0.0488
2,1999,0.0290
3,2000,0.0466
4,2000,0.0516
...,...,...
1995,2019,0.0588
1996,2019,0.1570
1997,2019,0.1090
1998,2019,0.0656


In [20]:
# indexing to a single row
df.loc[20]

artist              Linkin Park
song                 In the End
duration_ms              216880
explicit                  False
year                       2000
popularity                   83
danceability              0.556
energy                    0.864
key                           3
loudness                  -5.87
mode                          0
speechiness              0.0584
acousticness            0.00958
instrumentalness            0.0
liveness                  0.209
valence                     0.4
tempo                   105.143
genre               rock, metal
Name: 20, dtype: object

In [15]:
# what data type is a single row?
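
The same `type()` check works for a row. This is a sketch on a made-up `demo` frame rather than on `df` itself.

```python
import pandas as pd

# Stand-in frame mixing text and numbers, like the songs data
demo = pd.DataFrame({"artist": ["A", "B"], "year": [2000, 1999]})

row = demo.loc[1]

print(type(row).__name__)  # Series
print(list(row.index))     # ['artist', 'year']
print(row.dtype)           # object
```

A single row is also a Series. Its index holds the column names, and because the row mixes text and numbers its dtype is `object`, which matches the `dtype: object` line in the output of `df.loc[20]`.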

In [23]:
# indexing to several rows with a slice
# note: .loc slices by LABEL and includes BOTH endpoints (unlike numpy/list slicing)
df.loc[3:7]

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
3,Bon Jovi,It's My Life,224493,False,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.0263,1.3e-05,0.347,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,False,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.00104,0.0845,0.879,172.656,pop
5,Sisqo,Thong Song,253733,True,1999,69,0.706,0.888,2,-6.959,1,0.0654,0.119,9.6e-05,0.07,0.714,121.549,"hip hop, pop, R&B"
6,Eminem,The Real Slim Shady,284200,True,2000,86,0.949,0.661,5,-4.244,0,0.0572,0.0302,0.0,0.0454,0.76,104.504,hip hop
7,Robbie Williams,Rock DJ,258560,False,2000,68,0.708,0.772,7,-4.264,1,0.0322,0.0267,0.0,0.467,0.861,103.035,"pop, rock"
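
The output above contains rows 3 through 7, so a `.loc` slice includes its end label. The sketch below contrasts this with position-based slicing via `.iloc`, which is not used elsewhere in this notebook. `demo` is a made-up stand-in frame.

```python
import pandas as pd

# Stand-in frame with the default 0..9 index
demo = pd.DataFrame({"value": range(10)})

by_label = demo.loc[3:7]      # label-based: rows 3, 4, 5, 6, 7
by_position = demo.iloc[3:7]  # position-based: rows 3, 4, 5, 6 (end excluded)

print(len(by_label), len(by_position))  # 5 4
```

With the default integer index, labels and positions coincide, so the inclusive endpoint is the only visible difference.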


In [28]:
# changing the data type of one column
# let's convert the "explicit" booleans False/True into integers 0/1
df["explicit"] = df["explicit"].astype(int)
df.dtypes

artist               object
song                 object
duration_ms           int64
explicit              int64
year                  int64
popularity            int64
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
genre                object
dtype: object

In [32]:
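# check for missing values in a column: .notna() marks the available (non-missing) values
# update: chain the ".sum()" method (for pandas dataframes and series) to count them;
# a count equal to len(df) means the column has no missing values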
df["year"].notna().sum()

2000

In [35]:
# update: check for missing values in the entire dataframe
# .notna().sum() counts the available values per column; 2000 everywhere means nothing is missing
df.notna().sum()

artist              2000
song                2000
duration_ms         2000
explicit            2000
year                2000
popularity          2000
danceability        2000
energy              2000
key                 2000
loudness            2000
mode                2000
speechiness         2000
acousticness        2000
instrumentalness    2000
liveness            2000
valence             2000
tempo               2000
genre               2000
dtype: int64
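
Since the songs data has no gaps, the counts above all equal the number of rows. The sketch below uses a made-up `demo` frame that does contain missing values, to show how `.isna().sum()` and `.notna().sum()` complement each other.

```python
import numpy as np
import pandas as pd

# Stand-in frame with deliberately missing entries
demo = pd.DataFrame({
    "year": [2000, np.nan, 2002],
    "genre": ["pop", None, None],
})

missing_per_column = demo.isna().sum()   # number of missing values per column
present_per_column = demo.notna().sum()  # number of available values per column

print(missing_per_column)
print(present_per_column)
```

For every column the two counts add up to `len(demo)`.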

In [36]:
# Jupyter/IPython help syntax: show the documentation of DataFrame.sum
?pd.DataFrame.sum

In [37]:
# check for available values in a row: .notna()
df.loc[0].notna()

artist              True
song                True
duration_ms         True
explicit            True
year                True
popularity          True
danceability        True
energy              True
key                 True
loudness            True
mode                True
speechiness         True
acousticness        True
instrumentalness    True
liveness            True
valence             True
tempo               True
genre               True
Name: 0, dtype: bool

In [22]:
# check for available values in entire data frame: .notna()
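
A possible way to fill in this cell, sketched on a made-up `demo` frame: `.notna()` on a whole dataframe returns a boolean dataframe of the same shape, and `.all()` then condenses it.

```python
import numpy as np
import pandas as pd

# Stand-in frame with one missing tempo value
demo = pd.DataFrame({"year": [2000, 2001], "tempo": [95.0, np.nan]})

available = demo.notna()               # True where a value is present
print(available.shape == demo.shape)   # True

complete_columns = available.all()         # per column: are all values present?
everything_present = available.all().all() # single answer for the whole frame

print(complete_columns)
print(everything_present)  # False
```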

In [40]:
# Boolean indexing: filter for only the year 2000
df[ df["year"]==2000 ]     

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,0,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.300000,0.000018,0.3550,0.894,95.053,pop
3,Bon Jovi,It's My Life,224493,0,2000,78,0.551,0.913,0,-4.063,0,0.0466,0.026300,0.000013,0.3470,0.544,119.992,"rock, metal"
4,*NSYNC,Bye Bye Bye,200560,0,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.040800,0.001040,0.0845,0.879,172.656,pop
6,Eminem,The Real Slim Shady,284200,1,2000,86,0.949,0.661,5,-4.244,0,0.0572,0.030200,0.000000,0.0454,0.760,104.504,hip hop
7,Robbie Williams,Rock DJ,258560,0,2000,68,0.708,0.772,7,-4.264,1,0.0322,0.026700,0.000000,0.4670,0.861,103.035,"pop, rock"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193,OPM,Heaven Is a Halfpipe (If I Die),257426,1,2000,56,0.743,0.894,8,-6.886,1,0.0349,0.075500,0.002830,0.3670,0.770,95.900,rock
196,3LW,No More (Baby I'ma Do Right),263440,0,2000,56,0.721,0.723,2,-7.080,0,0.0631,0.102000,0.000004,0.0651,0.761,88.933,"pop, R&B"
199,Lifehouse,Hanging By A Moment,216360,0,2000,61,0.537,0.858,1,-4.903,1,0.0349,0.000966,0.000000,0.0812,0.502,124.599,"pop, rock, metal"
215,Linkin Park,In the End,216880,0,2000,83,0.556,0.864,3,-5.870,0,0.0584,0.009580,0.000000,0.2090,0.400,105.143,"rock, metal"


In [41]:
# Boolean indexing: filter for only the year 2000 and pop songs
# use (condition1) & (condition2)
# note: =="pop" matches only rows whose genre is exactly "pop", not e.g. "rock, pop"
df[ (df["year"]==2000) & (df["genre"]=="pop") ]

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
0,Britney Spears,Oops!...I Did It Again,211160,0,2000,77,0.751,0.834,1,-5.444,0,0.0437,0.3,1.8e-05,0.355,0.894,95.053,pop
4,*NSYNC,Bye Bye Bye,200560,0,2000,65,0.614,0.928,8,-4.806,0,0.0516,0.0408,0.00104,0.0845,0.879,172.656,pop
12,Bomfunk MC's,Freestyler,306333,0,2000,55,0.822,0.922,11,-5.798,0,0.0989,0.0291,0.325,0.252,0.568,163.826,pop
17,Alice Deejay,Better Off Alone,214883,0,2000,73,0.671,0.88,8,-6.149,0,0.0552,0.00181,0.691,0.285,0.782,136.953,pop
22,Sonique,It Feels So Good,240866,0,2000,62,0.634,0.677,5,-7.278,0,0.0304,0.0117,0.00103,0.126,0.558,135.012,pop
32,Madonna,Music,225973,0,2000,58,0.736,0.802,7,-8.527,1,0.0663,0.00149,0.0876,0.14,0.871,119.854,pop
61,P!nk,Most Girls,298960,0,2000,52,0.742,0.732,2,-6.046,0,0.0311,0.0424,0.00446,0.101,0.694,97.922,pop
65,Madonna,American Pie,273533,0,2000,58,0.631,0.734,5,-7.48,0,0.036,0.348,0.0,0.135,0.591,124.036,pop
71,P!nk,There You Go,202800,0,2000,55,0.822,0.847,10,-6.729,0,0.0917,0.0854,0.0,0.0452,0.668,107.908,pop
72,Vengaboys,Shalala Lala,214819,0,2000,58,0.751,0.901,2,-5.802,1,0.0328,0.0504,0.00308,0.0395,0.973,124.017,pop
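
The filter above keeps only rows whose `genre` is exactly `"pop"`, so songs tagged `"rock, pop"` are left out. If those should count as pop too, one option is the `.str.contains()` string method. This is a sketch on a made-up `demo` frame, and substring matching is only one of several ways to treat multi-genre entries.

```python
import pandas as pd

# Stand-in frame with single and combined genre labels
demo = pd.DataFrame({
    "song": ["s1", "s2", "s3"],
    "genre": ["pop", "rock, pop", "hip hop"],
})

exact = demo[demo["genre"] == "pop"]

# regex=False treats "pop" as plain text rather than a regular expression
contains = demo[demo["genre"].str.contains("pop", regex=False)]

print(len(exact), len(contains))  # 1 2
```

Note that this is a plain substring test, so it would also match any label that merely contains the letters "pop".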


In [42]:
# Boolean indexing: filter only for the years 2005 or 2010
# use (condition1) | (condition2)
df[ (df["year"]==2005) | (df["year"]==2010) ] 

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
60,DJ Ötzi,Hey Baby (Radio Mix),219240,0,2010,58,0.666,0.968,10,-3.196,1,0.0460,0.1230,0.000000,0.3470,0.834,135.099,"pop, easy listening, Dance/Electronic"
255,Diddy,I Need a Girl (Pt. 1) [feat. Usher & Loon],268800,0,2005,63,0.660,0.707,6,-5.758,1,0.2080,0.3970,0.000000,0.2110,0.761,89.279,"hip hop, pop"
360,Snoop Dogg,Beautiful,299146,1,2005,67,0.893,0.740,11,-4.936,0,0.1320,0.2990,0.000000,0.0881,0.963,101.025,"hip hop, pop"
442,JoJo,Leave (Get Out) - Radio Edit,242746,0,2005,49,0.656,0.513,5,-8.691,1,0.2530,0.1560,0.000064,0.0763,0.464,86.891,"hip hop, pop, R&B"
500,Mariah Carey,We Belong Together,201400,0,2005,69,0.840,0.476,0,-7.918,1,0.0629,0.0264,0.000000,0.0865,0.767,139.987,"pop, R&B"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1185,P!nk,F**kin' Perfect,213413,1,2010,60,0.563,0.671,7,-4.788,1,0.0373,0.0422,0.000000,0.3600,0.450,91.964,pop
1193,P!nk,Raise Your Glass,202960,1,2010,76,0.700,0.709,7,-5.006,1,0.0838,0.0048,0.000000,0.0290,0.624,122.019,pop
1195,Bruno Mars,Marry You,230192,0,2010,75,0.621,0.820,10,-4.865,1,0.0367,0.3320,0.000000,0.1040,0.452,144.905,pop
1196,The Band Perry,If I Die Young,222773,0,2010,64,0.606,0.497,4,-6.611,1,0.0277,0.3480,0.000000,0.2750,0.362,130.739,country


In [45]:
# UPDATE: there is also an .isin() method in pandas
df[ df["year"].isin([2005, 2006, 2007, 2008, 2009, 2010]) ]

Unnamed: 0,artist,song,duration_ms,explicit,year,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,genre
60,DJ Ötzi,Hey Baby (Radio Mix),219240,0,2010,58,0.666,0.968,10,-3.196,1,0.0460,0.12300,0.0000,0.3470,0.834,135.099,"pop, easy listening, Dance/Electronic"
118,iio,Rapture (feat.Nadia Ali),253586,0,2006,54,0.661,0.855,8,-8.403,1,0.0377,0.07220,0.0185,0.1990,0.601,123.943,Dance/Electronic
138,Ricky Martin,Nobody Wants to Be Lonely (with Christina Agui...,252706,0,2008,52,0.635,0.854,10,-5.020,0,0.0612,0.00579,0.0083,0.0623,0.590,100.851,"pop, latin"
255,Diddy,I Need a Girl (Pt. 1) [feat. Usher & Loon],268800,0,2005,63,0.660,0.707,6,-5.758,1,0.2080,0.39700,0.0000,0.2110,0.761,89.279,"hip hop, pop"
317,DMX,X Gon' Give It To Ya,217586,1,2007,70,0.761,0.899,10,-3.090,0,0.1830,0.01350,0.0000,0.0719,0.673,95.027,"hip hop, pop"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1185,P!nk,F**kin' Perfect,213413,1,2010,60,0.563,0.671,7,-4.788,1,0.0373,0.04220,0.0000,0.3600,0.450,91.964,pop
1193,P!nk,Raise Your Glass,202960,1,2010,76,0.700,0.709,7,-5.006,1,0.0838,0.00480,0.0000,0.0290,0.624,122.019,pop
1195,Bruno Mars,Marry You,230192,0,2010,75,0.621,0.820,10,-4.865,1,0.0367,0.33200,0.0000,0.1040,0.452,144.905,pop
1196,The Band Perry,If I Die Young,222773,0,2010,64,0.606,0.497,4,-6.611,1,0.0277,0.34800,0.0000,0.2750,0.362,130.739,country
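
For a contiguous range of years, writing out every value for `.isin()` gets long. A sketch of two shorter variants on a made-up `demo` frame: passing `list(range(...))` to `.isin()`, or using the Series method `.between()`, which includes both endpoints by default.

```python
import pandas as pd

# Stand-in frame with a few years inside and outside the range
demo = pd.DataFrame({"year": [2004, 2005, 2007, 2010, 2011]})

# range(2005, 2011) covers 2005..2010 because the stop value is excluded
via_isin = demo[demo["year"].isin(list(range(2005, 2011)))]

# between() keeps values with 2005 <= year <= 2010
via_between = demo[demo["year"].between(2005, 2010)]

print(via_isin.equals(via_between))  # True
```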


In [48]:
# saving and manipulating a PART of the dataset: USE .copy() !!!
# (.copy() makes the subset an independent dataframe, so changing it leaves df untouched)
# save the non-explicit year2000 songs to a separate df, then save the df to csv
nonex2000 = df[ (df["explicit"]==0) & (df["year"]==2000) ].copy()
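
A sketch of the full filter, copy, modify, save sequence on a made-up `demo` frame. The column `duration_min` is a hypothetical addition for illustration, and the CSV is written to an in-memory buffer instead of a file so that the example does not depend on a path.

```python
import io
import pandas as pd

# Stand-in frame with the columns used in the filter (values are illustrative)
demo = pd.DataFrame({
    "song": ["s1", "s2", "s3"],
    "explicit": [0, 1, 0],
    "year": [2000, 2000, 2001],
    "duration_ms": [211160, 284200, 250546],
})

# Filter, then copy so the subset is independent of demo
subset = demo[(demo["explicit"] == 0) & (demo["year"] == 2000)].copy()

# Modifying the copy does not touch the original frame
subset["duration_min"] = subset["duration_ms"] / 60000  # hypothetical new column

# Save the subset; a file path could be used in place of the buffer
buffer = io.StringIO()
subset.to_csv(buffer, index=False)
print(buffer.getvalue())

print("duration_min" in demo.columns)  # False
```

Without `.copy()`, assigning a new column to the filtered result can trigger pandas' `SettingWithCopyWarning` in many pandas versions, because it is ambiguous whether the original frame should change as well.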