# Data wrangling with `pandas`

Methods/functions for data frames (df):
* `pd.read_csv(), .to_csv()`
* `.head(), .tail(), .describe()`
* `.drop(), .sort_values(), .copy()`

**Attributes** of data frames: 
* `.index` 
* `.columns` 
* `.dtypes`

Methods we've used on single df columns (they also work on an entire df):
* `.astype()`
* `.isna()`
* `.notna()`

**Indexing:** `[]`, `.loc[]`

We will be working with a data set of popular Spotify songs from 2000-2019 (taken from [Kaggle](https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019)), which is saved in `data/songs_normalize.csv`.

In [None]:
# import pandas
import pandas as pd

In [None]:
# read_csv: pandas function; read in the .csv file into a pandas dataframe
df = pd.read_csv("data/songs_normalize.csv")

In [None]:
# to_csv: METHOD of pandas dataframe; save the pandas dataframe to a csv file
df.to_csv("data/songs.csv")
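One detail worth knowing about `.to_csv()`: by default it also writes the index as an extra first column, which shows up as `Unnamed: 0` when the file is read back in. The sketch below illustrates this with a small made-up data frame (`toy` and its values are assumptions, not the songs data) and in-memory buffers instead of real files.

```python
import io
import pandas as pd

# Hypothetical miniature data frame; values are made up for illustration
toy = pd.DataFrame({"song": ["A", "B", "C"], "year": [2000, 2005, 2010]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the actual columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(reread_with_index.columns.tolist())     # ['Unnamed: 0', 'song', 'year']
print(reread_without_index.columns.tolist())  # ['song', 'year']
```

If the index carries no information (as with a default 0, 1, 2, ... index), passing `index=False` keeps the saved file identical in shape to the data frame.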

In [None]:
# explore the dataframe: number of rows
len(df)

In [None]:
# explore the dataframe: .head()
df.head(3)

In [None]:
# explore the dataframe: .tail()
df.tail(1)

In [None]:
# explore the dataframe: .describe()
df.describe()

In [None]:
# dataframe (df) attributes: .columns
df.columns

In [None]:
# dataframe (df) attributes: .index
df.index

In [None]:
# dataframe (df) attributes: .dtypes
df.dtypes

In [None]:
# indexing to a single column
df["year"]

In [None]:
# what data type is a single column? (a pandas Series)
type(df["year"])

In [None]:
# indexing to several columns
df[ ["year", "speechiness"] ]

In [None]:
# indexing to a single row
df.loc[20]

In [None]:
# what data type is a single row? (also a pandas Series)
type(df.loc[20])

In [None]:
# indexing to several rows (update: you can slice like in numpy!)
# careful: .loc slices by LABEL and includes BOTH endpoints, so this returns rows 3 to 7
df.loc[3:7]
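To see how `.loc` slicing differs from numpy-style slicing, here is a minimal sketch on a made-up data frame whose index labels do not start at 0 (`toy` and its labels are assumptions). It also uses `.iloc[]`, the position-based counterpart of `.loc[]`, which has not been used in this notebook so far.

```python
import pandas as pd

# Hypothetical data frame with index labels 10..14 instead of 0..4
toy = pd.DataFrame({"year": [2000, 2001, 2002, 2003, 2004]},
                   index=[10, 11, 12, 13, 14])

by_label = toy.loc[11:13]     # labels 11, 12, 13: the end label is included
by_position = toy.iloc[1:3]   # positions 1 and 2: the end position is excluded

print(by_label.index.tolist())     # [11, 12, 13]
print(by_position.index.tolist())  # [11, 12]
```

With the default index 0, 1, 2, ... labels and positions coincide, so the only visible difference is whether the end of the slice is included.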

In [None]:
# changing the data type of one column
# let's convert the "explicit" booleans False/True into integers 0/1
df["explicit"] = df["explicit"].astype(int)
df.dtypes
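As a small illustration of what `.astype(int)` does to Booleans, the sketch below converts a made-up Series (the values are assumptions, not taken from the songs data). Once the column holds 0/1 integers, `.sum()` counts the `True` entries.

```python
import pandas as pd

# Hypothetical Boolean column
explicit = pd.Series([False, True, True], name="explicit")

as_int = explicit.astype(int)   # False -> 0, True -> 1

print(as_int.tolist())  # [0, 1, 1]
print(as_int.sum())     # 2
```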

In [None]:
# check for missing values in a column: .isna()
# update: chain the ".sum()" method (for pandas dataframes and series) to COUNT the missing values
df["year"].isna().sum()

In [None]:
# update: check for missing values in the entire dataframe: .isna()
# update: ".sum()" then gives the number of missing values per column
df.isna().sum()
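The relationship between `.isna()` and `.notna()` can be checked on a tiny made-up data frame with a few gaps (`toy` and its values are assumptions). Both methods return Booleans, and `.sum()` counts the `True` entries per column, so the two counts always add up to the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with one missing value per column
toy = pd.DataFrame({
    "year": [2000, np.nan, 2010],
    "genre": ["pop", "rock", None],
})

missing_per_column = toy.isna().sum()    # number of missing values per column
present_per_column = toy.notna().sum()   # number of available values per column

print(missing_per_column.tolist())   # [1, 1]
print(present_per_column.tolist())   # [2, 2]
```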

In [None]:
?pd.DataFrame.sum

In [None]:
# check for available values in a row: .notna()
df.loc[0].notna()

In [None]:
# check for available values in entire data frame: .notna()
df.notna()

In [None]:
# Boolean indexing: filter for only the year 2000
df[ df["year"]==2000 ]     

In [None]:
# Boolean indexing: filter for only the year 2000 and pop songs
# use (condition1) & (condition2)
df[ (df["year"]==2000) & (df["genre"]=="pop") ]

In [None]:
# Boolean indexing: filter only for the years 2005 or 2010
# use (condition1) | (condition2)
df[ (df["year"]==2005) | (df["year"]==2010) ] 
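The conditions used for Boolean indexing are themselves Boolean Series, so they can be stored in variables and combined. The sketch below uses a made-up data frame (`toy`, `is_2000` and `is_pop` are hypothetical names) and additionally shows `~` for negation, which has not been used in this notebook. The parentheses around each comparison matter because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical miniature data frame
toy = pd.DataFrame({
    "year": [2000, 2000, 2005, 2010],
    "genre": ["pop", "rock", "pop", "rock"],
})

is_2000 = toy["year"] == 2000    # Boolean Series, one entry per row
is_pop = toy["genre"] == "pop"

both = toy[is_2000 & is_pop]          # AND: year 2000 and pop
either = toy[is_2000 | is_pop]        # OR: year 2000 or pop (or both)
neither = toy[~(is_2000 | is_pop)]    # NOT: everything else

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 1, 2]
print(neither.index.tolist())  # [3]
```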

In [None]:
# UPDATE: there is also an .isin() method in pandas
df[ df["year"].isin([2005, 2006, 2007, 2008, 2009, 2010]) ]
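For a run of consecutive years, typing out the whole list is not necessary. A possible shortcut, sketched on made-up data (`toy` is an assumption), is to pass a `range` to `.isin()`, or to use the Series method `.between()`, which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical miniature data frame
toy = pd.DataFrame({"year": [2000, 2005, 2007, 2010, 2019]})

# range(2005, 2011) covers 2005, ..., 2010 (the end of a range is excluded)
with_isin = toy[toy["year"].isin(range(2005, 2011))]

# .between() includes both 2005 and 2010
with_between = toy[toy["year"].between(2005, 2010)]

print(with_isin["year"].tolist())     # [2005, 2007, 2010]
print(with_between["year"].tolist())  # [2005, 2007, 2010]
```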

In [None]:
# saving and manipulating a PART of the dataset: USE .copy() !!!
# save the non-explicit year2000 songs to a separate df, then save the df to csv
nonex2000 = df[ (df["explicit"]==0) & (df["year"]==2000) ].copy()
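Why `.copy()`? A filtered data frame without `.copy()` may still be tied to the original one, and depending on the pandas version, modifying it can trigger a `SettingWithCopyWarning`. With `.copy()` the new data frame is independent. The sketch below shows this on made-up data (`toy`, `subset` and the `decade` column are hypothetical).

```python
import pandas as pd

# Hypothetical miniature data frame
toy = pd.DataFrame({
    "year": [2000, 2000, 2005],
    "explicit": [0, 1, 0],
})

# Filter, then make an independent copy of the result
subset = toy[(toy["explicit"] == 0) & (toy["year"] == 2000)].copy()

# Changing the copy leaves the original data frame untouched
subset["decade"] = "2000s"

print(subset.columns.tolist())  # ['year', 'explicit', 'decade']
print(toy.columns.tolist())     # ['year', 'explicit']
```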