# Exploratory data analysis
Introduction to exploratory data analysis (EDA).

EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the modeling task [(source 1)](https://chartio.com/learn/data-analytics/what-is-exploratory-data-analysis/). It is used to explore the data, find different patterns, relations, and anomalies in the data using some statistical graphs and other visualization techniques, and possibly formulate hypotheses that could lead to new data collection and experiments [(source 2)](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/). More specifically, EDA enables analysts to:
1. get maximum insights from a data set
2. uncover underlying structure
3. extract important variables from the dataset
4. detect outliers and anomalies (if any)
5. test underlying assumptions
6. determine the optimal factor settings

## EDA steps and tools
Practical steps in conducting EDA and frequently used EDA tools.
Based on *pandas2020-main.Sales_Analysis_Pandas_P3_tutorial.ipynb* and *pandas2020-main.TED_Talks_Pandas_P3_tutorial.ipynb*.


Based on [this](https://stackoverflow.com/a/22149930/1899061), in all computations, `axis=...` refers to the axis **along which** the computation is done. In pandas, the default is `axis=0`. This is consistent with `numpy.mean` when the axis is specified explicitly (in `numpy.mean`, `axis=None` by default, which computes the mean over the flattened array): `axis=0` runs along the rows (namely, along the index in pandas), and `axis=1` runs along the columns.
Note also that `axis=0` indicates aggregating along rows and `axis=1` indicates aggregating along columns. This is consistent with how we index into a dataframe. In `df.iloc[<row>, <column>]`, `<row>` is in index position 0 and `<column>` is in index position 1. For added clarity, one may choose to specify `axis='index'` (instead of `axis=0`) or `axis='columns'` (instead of `axis=1`).
**But**, `axis=0` means each row as a bulk - we manipulate a `pd.DataFrame` inter-row, instead of within-row. Likewise, `axis=1` means each column as a bulk, i.e. we manipulate a `pd.DataFrame` inter-column instead of within-column. For example, `<pd.df>.drop("A", axis=1)` will drop a whole column.
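
The `axis` convention can be checked on a tiny dataframe. The following is a minimal sketch; the dataframe `demo` and its columns `A` and `B` are made up for illustration and are not part of the British Invasion datasets.

```python
import pandas as pd

# Hypothetical 3x2 dataframe, used only to illustrate the axis convention
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis=0: the computation runs along the rows, giving one result per column
col_means = demo.mean(axis=0)

# axis=1: the computation runs along the columns, giving one result per row
row_means = demo.mean(axis=1)

# In drop(), axis=1 means "treat each column as a bulk": the whole column A is removed
without_a = demo.drop('A', axis=1)

print(col_means)
print(row_means)
print(without_a.columns.tolist())
```

Writing `axis='index'` and `axis='columns'` instead of `0` and `1` gives the same results.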

### Reading the dataset
- `pd.read_csv()`

### Initial examination and adaptations
- `<pd.df>.shape`, `<pd.df>.head()`, `<pd.df>.tail()`, `<pd.df>.sample()`, `<pd.df>.dtypes`, `<pd.df>.info()`, `<pd.df>.describe()`
- `<pd.df>.loc[...]`, `<pd.df>.iloc[...]` - examine individual cells, columns, rows
    - `loc` works with conditions and column names, `iloc` with numerical indices
    - in both `loc` and `iloc`, multiple columns can be specified as a list of column names, and `:` in each index position means 'all'
    - in `iloc`, both index positions can be specified as lists of numeric values
- `<pd.df>.columns`, `<pd.df>.columns.values`, `<pd.df>.columns.values.tolist()` (or `<pd.df>.columns.to_list()`), `<pd.df>.values`
- `<pd.df>.rename({'<column_1 old name>':'<column_1 new name>', '<column_2 old name>':'<column_2 new name>', ...}, axis='columns')`, `<pd.df>.columns = ['<column_1 name>', '<column_2 name>', ...]` (change the names of all columns in `<pd.df>`)
- `ast.literal_eval()` (using Python's *ast* module to transform a string into a literal value, a list, a tuple or any other container object)
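
As a minimal sketch of the `ast.literal_eval()` item above: when a csv file stores lists as text, each cell arrives as a plain string. The column name `tags` and its values are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical column in which each cell is the *string* representation of a list
demo = pd.DataFrame({'tags': ["['rock', 'pop']", "['beat']"]})
print(type(demo.loc[0, 'tags']))      # a str, not a list

# ast.literal_eval() safely parses Python literals (lists, tuples, numbers, ...)
demo['tags'] = demo['tags'].apply(ast.literal_eval)
print(type(demo.loc[0, 'tags']))      # now a list
print(demo.loc[0, 'tags'])
```

Unlike `eval()`, `ast.literal_eval()` only accepts literals, so it does not execute arbitrary code found in the data.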

### Missing values and value counts
- `sb.heatmap()`, e.g. `sb.heatmap(<pd.df>.isna(),cbar=False,cmap='viridis')` ([example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/))
- `<pd.df>.isna()` (`<pd.df>.isnull()`), `<pd.df>.isna().sum()` (`<pd.df>.isnull().sum()`) ([example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/))
- `<pd.df>['<column>'].value_counts()` (shows only the rows without NAs (default: dropna=True), check shape)
- `<pd.df>['<column>'].value_counts(normalize=True)` (show proportions, rather than frequencies)
- `<pd.Series>.dropna(inplace=True)` (remove missing values from the `<pd.Series>` in place; without `inplace=True`, a new `<pd.Series>` object is returned and the original is left unchanged)

The `cmap` parameter of `sb.heatmap()` denotes a [Matplotlib colormap](https://matplotlib.org/stable/tutorials/colors/colormaps.html#classes-of-colormaps) (`viridis`, `cividis`, `tab20`, `winter`, `BuPu_r`, `ocean`,...).

### Examining individual data items, rows and columns
- `<pd.df>.sample()`
- Simple indexing and fancy indexing: `<pd.df>.iloc[]`, `<pd.df>.loc[]`
- `<pd.df>.index`, `<pd.df>.index[<from>:<to>]`, `<pd.df>.reset_index(drop=True, inplace=True)`
- Indexing using list of values: `<pd.df>.loc[<pd.df>.<column>.isin(<list of values>)]` (select those observations where the value of `<column>` is in the `<list of values>`)
- Indexing in data stats: `<pd.df>.describe().loc['50%', '<column_name>']` (select the median of `<column_name>` from the `<pd.df>` stats computed by `describe()`)

### Grouping and sorting data
- `<pd.df>['<column>'].unique()`, `<pd.df>['<column>'].nunique()`
- `<pd.df>.groupby('<column>')`, `<pd.df>.groupby('<column>').get_group(<value>)`
- `<pd.df>['<column>'].value_counts()`, `<pd.df>['<column>'].value_counts().sort_index()`, `<pd.df>['<column>'].value_counts().sort_index(inplace=True)`
- `<pd.df>.sort_values(by='<column name>', ascending=False/True)`
- `<pd.df>.groupby('<column>').<another column>.<f()>.sort_values(ascending=False)` (aggregate using function `f()`, e.g. `mean()`)
- `<pd.df>.groupby('<column>').<another column>.agg(['<f1 name>', '<f2 name>', ...])` (aggregate using multiple functions, e.g. `mean()`, `count()`,...)

If `sort_values()` is used after `agg(['<f1 name>', '<f2 name>', ...])` (`agg(['<f1 name>', '<f2 name>', ...]).sort_values(by='<f name>', ascending=False)`), it must be given the required `by` argument (`by='<f name>'`) before the optional `ascending=False`, because `agg()` with a list of functions returns a `pd.DataFrame`.


### Data transformations
- `<pd.df>.describe()`
- `pd.to_numeric(<pd.DataFrame object>['<column name>'], errors='coerce')`, `pd.DataFrame.to_numpy()`, `pd.Series.to_numpy()`, `pd.to_datetime()`, ...
- `<pd.df>.<column>.apply(<f_name>)` (apply the `<f_name>` function to each element of the `<column>`; an element can itself be a container, for example a list of other elements)
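
A minimal sketch of two of the transformations listed above, `pd.to_numeric(..., errors='coerce')` and `apply()`. The series `raw` and `nested` are hypothetical.

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed as a number
raw = pd.Series(['1', '2', 'x'])

# errors='coerce' turns unparseable values into NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)

# apply() calls the function once per element; here each element is a list
nested = pd.Series([['a', 'b'], ['c']])
lengths = nested.apply(len)
print(lengths)
```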


### Exploring correlations
Explore correlations between the (numerical) columns.
- `sb.heatmap()`
- [Example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/)

### Data visualization
Plot some bargraphs, scatterplots, boxplots,...
- [Example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/)

### Other
[Other interesting ideas and different ways of using the things from above](https://realpython.com/pandas-python-explore-dataset/#exploring-your-dataset) (see the rest from [that article](https://realpython.com/pandas-python-explore-dataset/) as well).

## Import and configure packages
The `%run` magic might not work well in DataSpell, thus the following `import` statements are copied here from *import_packages.ipynb*:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
%run "../notebooks/import_packages.ipynb"

In [None]:
# # %load_ext autoreload
# # %autoreload 2
# 
# %matplotlib inline
# 
# # %config IPCompleter.greedy=True
# 
# import numpy as np
# import matplotlib as mpl
# import matplotlib.pyplot as plt
# plt.style.use('classic')
# import pandas as pd
# import seaborn as sb
# 
# from plotnine import ggplot, aes, labs, geom_point, geom_line, geom_histogram, theme_xkcd, coord_cartesian, xlim, ylim, xlab, ylab, ggtitle, theme

## Introducing The British Invasion datasets

### Available datasets
The British Invasion datasets, located in the *data* folder:
* *brit.csv* - complete raw dataset (including data from Spotify, Wikipedia, AllMusic, etc.)
* *brit_col_renamed.csv* - same as *brit.csv*, but with column names modified for the sake of consistency
* *brit_performers_stripped.csv* - same as *brit_col_renamed.csv*, but with `\n` etc. stripped from performer names
* *brit_titles_stripped.csv* - same as *brit_performers_stripped.csv*, but with trailing whitespace stripped from song titles
* *attrs.csv* - incomplete raw dataset (some of the attributes from [a Kaggle dataset](https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019))

### Read the *csv* file containing one of the available datasets describing The British Invasion songs
`pd.read_csv()` returns a `pd.DataFrame` object.

As for specifying the path of the dataset properly, see [this](https://stackoverflow.com/questions/35384358/how-to-open-my-files-in-data-folder-with-pandas-using-relative-path) (more specifically, **both** [this](https://stackoverflow.com/a/35384414/1899061) and [this](https://stackoverflow.com/a/43600253/1899061)).

In [None]:
# Get the songs as a pd.DataFrame object from 'data/brit.csv', 
# or from '../data/brit.csv', 
# or '../../data/brit.csv', 
# or ..., 
# depending on where the csv file is located

songs = pd.read_csv('../data/brit.csv')
songs

# If an int column contains NaN values, read_csv() sets all values to float values, because NaN are internally
# represented as float values. To read the int columns as int values and still preserve NaN values where they 
# exist, see this: https://stackoverflow.com/a/72323514. 
# The trick is: df = pd.read_csv('file.csv', dtype={'a': 'Int32', 'b': 'Int32'}), assuming that 'a' and 'b' 
# columns contain int and NaN values.
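
The effect of the nullable `Int32` dtype described in the comments above can be tried without any file on disk. This is a minimal sketch; the csv text and the column names `a` and `b` are made up and read from an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical csv content with missing values in both integer columns
csv_text = "a,b\n1,2\n,4\n3,\n"

# Default behaviour: columns with missing values are read as float64
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.dtypes)

# Nullable integer dtype: values stay integers, missing values become <NA>
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype={'a': 'Int32', 'b': 'Int32'})
print(df_nullable.dtypes)
print(df_nullable)
```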

### Explore the dataset (first steps)

##### Initial examination and adaptations
- `<pd.df>.shape`, `<pd.df>.head()`, `<pd.df>.tail()`, `<pd.df>.sample()`, `<pd.df>.dtypes`, `<pd.df>.info()`, `<pd.df>.describe()`
- `<pd.df>.loc[...]`, `<pd.df>.iloc[...]` - examine individual cells, columns, rows
    - `loc` works with conditions and column names, `iloc` with numerical indices
    - in both `loc` and `iloc`, multiple columns can be specified as a list of column names, and `:` in each index position means 'all'
    - in `iloc`, both index positions can be specified as lists of numeric values
- `<pd.df>.columns`, `<pd.df>.columns.values`, `<pd.df>.columns.values.tolist()` (or `<pd.df>.columns.to_list()`), `<pd.df>.values`
- `<pd.df>.rename({'<column_1 old name>':'<column_1 new name>', '<column_2 old name>':'<column_2 new name>', ...}, axis='columns')`, `<pd.df>.columns = ['<column_1 name>', '<column_2 name>', ...]` (change the names of all columns in `<pd.df>`)
- `ast.literal_eval()` (using Python's *ast* module to transform a string into a literal value, a list, a tuple or any other container object)

###### A sneak peek into the dataset
- `<pd.df>.shape`, `<pd.df>.head()`, `<pd.df>.tail()`, `<pd.df>.sample()`, `<pd.df>.dtypes`, <u>**`<pd.df>.info()`**</u>, `<pd.df>.describe()` (shows descriptive statistics for numerical columns only).

When calling `display()` on the result of a method like `<pd.df>.head()`, `<pd.df>.tail()` or `<pd.df>.sample()`, only a certain default number of columns is displayed. To display *all* columns, use `pd.set_option('display.max_columns', None)` first. To display `<n>` columns, use `pd.set_option('display.max_columns', <n>)` first.

In [None]:
songs.shape
songs.head()
songs.tail()
songs.sample(10)
songs.dtypes
songs.info()
songs.describe()

In [None]:
songs.iloc[34, 12]
songs.iloc[[245, 678, 789], [0, 1, 3]]
songs.iloc[[245, 678, 789], :]
songs.iloc[:, [0, 1, 3]]
songs.loc[songs.Performer == 'The Beatles', ['Title', 'AlbumName', 'Performer']]
songs.loc[songs['Performer'] == 'The Beatles', ['Title', 'AlbumName', 'Performer']]
songs.loc[songs['Performer'] == 'The Beatles']

###### Columns
- `<pd.df>.columns`, `<pd.df>.columns.values`, `<pd.df>.columns.values.tolist()` (or `<pd.df>.columns.to_list()`), `<pd.df>.values`

Show the columns of the `songs` object (which is a `pd.DataFrame` object).

In [None]:
# Get the columns as a pd.Index object, using <pd.df>.columns
songs.columns
# Get the columns as a list, using list(<pd.df>.columns)
list(songs.columns)
# Get the columns as a list, using <pd.df>.columns.tolist() or <pd.df>.columns.to_list()
songs.columns.tolist()
# Get the columns as a numpy.ndarray object, using <pd.df>.columns.values or np.array(<pd.df>.columns)
songs.columns.values
# Get the values of all items in the dataset as a numpy.ndarray of sequences of the values in each item, 
# using <pd.df>.values (the type of both the encompassing and the encompassed sequences is numpy.ndarray)
songs.values
songs.values[0]
type(songs.values)
type(songs.values[0])

###### Renaming columns
- `<pd.df>.rename(columns={'<column_1 old name>':'<column_1 new name>', '<column_2 old name>':'<column_2 new name>', ...}, inplace=True)`, or
- `<pd.df>.rename({'<column_1 old name>':'<column_1 new name>', '<column_2 old name>':'<column_2 new name>', ...}, axis='columns', inplace=True)`;
- `<pd.df>.columns = ['<column_1 name>', '<column_2 name>', ...]` (change the names of all columns in `<pd.df>`)

In [None]:
# Rename the names of some columns
songs.rename(columns={'AlbumName': 'Album', 'Record Label': 'Record_label', 'Track.number': 'Track_number', 
                      'Song.duration': 'Duration', 'shake.the.audience': 'shake_the_audience', 'family.spiritual': 'family_spiritual'}, inplace=True)
songs.columns
# Rename these columns back to their original names


In [None]:
# Save the modified dataset
songs.to_csv('../data/brit_col_renamed.csv', index=False)

###### Adapt the data in columns to the usual formats

In [None]:
# Performer - strip everything after the performer name
for i in range(len(songs)):
    songs.iloc[i, 3] = songs.iloc[i, 3].split('\n')[0].strip()
songs.Performer

In [None]:
# Performer - get rid of the 'feat: ' prefix in some performer names

# songs.loc[songs.Performer.str.startswith('feat: '), 'Performer']

for i in range(len(songs)):
    if songs.Performer[i].startswith('feat: '):
        songs.iloc[i, 3] = songs.iloc[i, 3].split('feat: ')[1].strip()
songs.loc[songs.Performer.str.startswith('feat: '), 'Performer']
not any([p.startswith('feat: ') for p in songs.Performer])

In [None]:
# Save the modified dataset
songs.to_csv('../data/brit_performers_stripped.csv', index=False)

In [None]:
# Title - strip trailing blanks
for i in range(len(songs)):
    songs.iloc[i, 0] = songs.iloc[i, 0].rstrip()

# # Alternatively
# songs.Title = songs.Title.apply(lambda x: x.rstrip())

# print(repr(songs.Title[4]))
any([t.endswith(' ') for t in songs.Title])

In [None]:
# Save the modified dataset
songs.to_csv('../data/brit_titles_stripped.csv', index=False)
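
The row-by-row loops used above for cleaning `Performer` and `Title` could also be written with the vectorized `.str` accessor. The following is a sketch on a made-up three-row dataframe (the values are hypothetical, and the `feat: ` prefix is removed with a regular expression rather than with `split()`), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical rows imitating the kinds of noise handled above
demo = pd.DataFrame({
    'Title': ['Help! ', 'Stay', 'You Really Got Me  '],
    'Performer': ['The Beatles\n(extra text)', 'feat: Cilla Black ', 'The Kinks'],
})

# Performer: keep only the text before the first '\n', then strip blanks
demo['Performer'] = demo['Performer'].str.split('\n').str[0].str.strip()

# Performer: remove a leading 'feat: ' prefix where present
demo['Performer'] = demo['Performer'].str.replace(r'^feat: ', '', regex=True)

# Title: strip trailing blanks
demo['Title'] = demo['Title'].str.rstrip()

print(demo)
```

The `.str` methods address columns by name, so they do not depend on the column positions (`0`, `3`) used with `iloc[]` in the loops.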

##### Missing values and value counts
- `sb.heatmap()`, e.g. `sb.heatmap(<pd.df>.isna(),cbar=False,cmap='viridis')` ([example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/))
- `<pd.df>.isna()` (`<pd.df>.isnull()`), `<pd.df>.isna().sum()` (`<pd.df>.isnull().sum()`) ([example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/))
- `<pd.df>['<column>'].value_counts()` (shows only the rows without NAs (default: dropna=True), check shape)
- `<pd.df>['<column>'].value_counts(normalize=True)` (show proportions, rather than frequencies)
- `<pd.df>.dropna(how='all'/'any', inplace=True)`, `<pd.Series>.dropna(inplace=True)` (remove missing values in place; without `inplace=True`, a new `<pd.df>`/`<pd.Series>` object with missing values removed is returned)

The `cmap` parameter of `sb.heatmap()` denotes a [Matplotlib colormap](https://matplotlib.org/stable/tutorials/colors/colormaps.html#classes-of-colormaps) (`viridis`, `cividis`, `tab20`, `winter`, `BuPu_r`, `ocean`,...).

In [None]:
# Read the dataset
songs = pd.read_csv('../data/brit_titles_stripped.csv')

In [None]:
# Display the heatmap (missing values) of the songs dataset 
# (demonstrate using sb.heatmap() vs. sb.heatmap();)
sb.heatmap(songs.isna(), cbar=False, cmap='cividis');

How many missing values are there? (`<pd.df>.isna().sum()` for all columns, `<pd.df>['<column>'].isna().sum()` for a specific column, `<pd.df>.isna()[['<column1>', '<column2>', ...]].sum()` for selected multiple columns; `isnull()` is the same as `isna()`, and `isna()` is used more often).

Try also `<pd.df>.isna()`, `<pd.df>.isna()[['<column1>', '<column2>', ...]]`, `type(<pd.df>.isna())`, `type(<pd.df>.isna().sum())`, `type(<pd.df>.isna()[['<column1>', '<column2>', ...]].sum())`, `<pd.df>.isna().sum().value_counts()`.

In [None]:
songs.isna()
songs.isna().sum()
type(songs.isna().sum())
songs.acousticness.isna().sum()
songs.isna().acousticness.sum()
songs.isna()[['Performer', 'acousticness', 'Duration']]
songs.isna()[['Performer', 'acousticness', 'Duration']].sum()
type(songs.isna()[['Performer', 'acousticness', 'Duration']].sum())
songs.isna().sum().value_counts()

How many missing values are there in the columns where there *are* missing values? `<i> = <pd.df>.isna().sum() > 0`, `<pd.df>.isna().sum()[<i>]`. 
Try also `<i>`, `type(<i>)`, `<i>[<i>]`, `<pd.df>.loc[:, <i>]`.

In [None]:
# songs.isna().sum() > 0
i = songs.isna().sum() > 0
# i[i]
# songs.loc[:, i]
type(songs.isna().sum())
songs.isna().sum()[i]

Leave out rows with `NaN` values: `<pd.df>.dropna()`, `<pd.df>.<column>.dropna()`, `<pd.df>['<column>'].dropna()`.

In [None]:
songs.dropna()
songs.dropna().isna().sum()
songs.acousticness.isna().sum()
songs.acousticness.dropna().isna().sum()

Show value counts for a dataframe: `<pd.df>.value_counts()`, `<pd.df>.value_counts(normalize=True)`.

In [None]:
songs.value_counts()
songs.value_counts(normalize=True)

In [None]:
songs.value_counts()
type(songs.value_counts())                                      # pd.Series
songs.value_counts().index
type(songs.value_counts().index)                                # a pd.MultiIndex object
songs.value_counts().values
len(songs.value_counts(dropna=False))                           # dropna=True by default

Show duplicates (if any): `<pd.df>.duplicated()` (marks every occurrence except the first one as a duplicate; default: `keep='first'`), `<pd.df>.duplicated(keep=False)` (marks all occurrences as duplicates), `<pd.df>.<column>.duplicated()` (find duplicates based on a specific column), `<pd.df>.duplicated(subset=['<column 1>', '<column 2>',...])` (find duplicates based on multiple specific columns).

Drop duplicates (if any): `<pd.df>.drop_duplicates(inplace=True)`.

In [None]:
songs.duplicated()                                              # a pd.Series object with True for duplicates, False otherwise
songs.loc[songs.duplicated(), ]                                 # careful: by default (keep='first') the first occurrence of each duplicated row is NOT marked, so it is not shown
songs.loc[songs.duplicated(keep=False), ]                       # keep=False marks ALL occurrences of duplicated rows; default: keep='first'
# (https://stackoverflow.com/a/41786821)
counts = songs.value_counts(dropna=False)                       # one entry per distinct row, so its length may differ from len(songs)
counts[counts > 1]                                              # another way to check if there are duplicates: distinct rows that occur more than once
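
The difference between the `keep` options is easiest to see on a few rows. The following is a minimal sketch with a made-up dataframe containing one repeated row.

```python
import pandas as pd

# Hypothetical dataframe in which the first two rows are identical
demo = pd.DataFrame({'Title': ['Stay', 'Stay', 'Help!'], 'Year': [1964, 1964, 1965]})

first_kept = demo.duplicated()              # default keep='first': only later repeats are True
all_marked = demo.duplicated(keep=False)    # every occurrence of a repeated row is True
deduplicated = demo.drop_duplicates()       # returns a new dataframe; demo is unchanged

counts = demo.value_counts()                # one entry per distinct row
repeated = counts[counts > 1]               # distinct rows that occur more than once

print(first_kept.tolist())
print(all_marked.tolist())
print(len(deduplicated))
print(repeated)
```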

### Examining individual data items, rows and columns
- `<pd.df>.sample()`
- Simple indexing and fancy indexing: `<pd.df>.iloc[]`, `<pd.df>.loc[]`
- `<pd.df>.index`, `<pd.df>.index[<from>:<to>]`, `<pd.df>.reset_index(drop=True, inplace=True)`
- Indexing using list of values: `<pd.df>.loc[<pd.df>.<column>.isin(<list of values>)]` (select those observations where the value of `<column>` is in the `<list of values>`)
- Indexing in data stats: `<pd.df>.describe().loc['50%', '<column_name>']` (select the median of `<column_name>` from the `<pd.df>` stats computed by `describe()`)

Take a sample of the dataset to get a feeling of what's in there.

In [None]:
songs.sample(10)

Which columns have *some* missing values? 
Use masking to create the index of such columns, e.g. `<i> = songs.isna().sum() > 0`, and show the type of the result (it's a `pd.Series` object).
Display `<i>.index` and `<i>.values`. 

In [None]:
i = songs.isna().sum() > 0
songs.isna().sum()[i]
i.index
i.values

From the `pd.Series` object `<i>` retrieved in the previous step, select the elements that are `True` (i.e., the names of the columns that have some `NaN` values) - `<i>[<i>]` or `<i>[<i>.values > 0]`; `<pd.df>.isna().sum()[<i>]` shows the corresponding counts. 
Also, from the `<pd.df>` select a subset with only those columns that have *some* `NaN` values - `<pd.df>.loc[:, <i>]`.

In [None]:
songs.isna().sum()[i]
songs.loc[:, i]

From a `<pd.df>` select all rows that have *some* missing values: `<pd.df>[<pd.df>.isna().any(axis=1)]`, `<pd.df>.loc[<pd.df>.isna().any(axis=1)]`, `<pd.df>.loc[<pd.df>.isna().any(axis=1), :]`.

In [None]:
songs[songs.isna().any(axis=1)]
songs.loc[songs.isna().any(axis=1)]
songs.loc[songs.isna().any(axis=1), :]

Select rows based on column conditions: `<pd.df>.loc[<pd.df>.<column 1> == <...>]`, `<pd.df>.loc[(<pd.df>.<column 1> == <...>) & (<pd.df>.<column 2> == <...>)]`, etc. Notice the use of `&`, not `and`.

In [None]:
songs.loc[songs.Duration.isna(), ]
songs.loc[songs.Duration == '2:09', ]
songs.loc[(songs.Duration == '2:09') & (songs.Title == 'Stay'), :]       # &, not and !!!

What are the rows that have missing values in a specific column of a `<pd.df>`? For example, what are the songs with missing `Duration` values?

Using `isna()`, `loc[]`, `iloc[]`, `len()` and `index`.

Calling `loc[]` effectively means *creating a subset* (typically based on a relational or logical expression over one or more columns of the dataset). In other words, `loc[]` creates a *slice* of the dataframe, so the type of the result is `<pd.df>`.

Note that `loc[]` works as `loc[<selected rows>, <selected columns>]`. The indices `<selected rows>` and `<selected columns>` can be created either directly in `loc[]` or beforehand.

If defining the `<selected rows>` index to be used with `loc[]` subsequently, it is a good practice to define it as a boolean *mask* over a single column, like `<pd.df>['<column>'].isna()`, or as a logical expression in which each chunk is a relational expression over a single column, e.g. `<pd.df>['<column1>'].isna() & (<pd.df>['<column2>'] < 23)`. Each relational chunk must be enclosed in parentheses, because `&` binds more tightly than `<`, `==`, etc. The result will be a subset of the original dataframe (i.e., another `<pd.df>`).

Defining the relevant index with a statement like `<pd.df>.loc[<pd.df>['<column>'].isna()].index` is a good starting point when using `iloc[]` subsequently.

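If using `iloc[]`, don't forget the `.index` chunk in the statement used to create the index (such as `<pd.df>.loc[<pd.df>['<column>'].isna()].index`). Without it, the result is another `<pd.df>`. Note that `.index` returns row *labels*, which coincide with the integer positions expected by `iloc[]` only when the dataframe has the default `RangeIndex` (as it does right after `pd.read_csv()`).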

In [None]:
# Define i_iloc, the index to be used with iloc[], starting from <i> = <pd.df>['<column>'].isna();
# iloc[] can be used conveniently here if the relevant index is already defined with <pd.df>.loc[<i>].index, i.e. <pd.df>.loc[<pd.df>['<column>'].isna()].index;
# remember that the second index in iloc[] must be a number too (the relevant column index)
i = songs.Duration.isna()
i
i[i]
i_iloc = songs.loc[i].index
i_iloc
# # Alternatively
# i_iloc = i[i].index
# i_iloc
songs.iloc[i_iloc, [0, 1, 3, 5]]

# Define i_loc, the index (boolean mask) to be used with loc[]
i_loc = songs.Duration.isna()
songs.loc[i_loc, ['Title', 'Album', 'Performer', 'Duration']]

# display(songs.loc[i_loc.index, ['Title', 'Album']])
# display(songs.iloc[i_iloc, [0, 2]])
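
Passing `.index` to `iloc[]`, as in the cell above, works because `songs` has the default integer index. The sketch below uses a made-up dataframe with non-default row labels to show the difference between labels and positions; using `np.flatnonzero()` to get positions is a choice made for this illustration, not something taken from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe whose row labels (10, 20, 30) differ from positions (0, 1, 2)
demo = pd.DataFrame({'Title': ['a', 'b', 'c'], 'Duration': ['2:09', None, None]},
                    index=[10, 20, 30])

mask = demo['Duration'].isna()                   # boolean mask, suitable for loc[]
labels = demo.loc[mask].index                    # row LABELS of the selected rows
positions = np.flatnonzero(mask.to_numpy())      # integer POSITIONS of the selected rows

print(list(labels))
print(positions.tolist())
print(demo.loc[mask, ['Title']])                 # label/mask-based selection
print(demo.iloc[positions, [0]])                 # position-based selection, same rows
```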

Replace `NaN` values in `Duration` with 'No' (`<pd.df>.loc[<i_loc>, '<column>'] = <new value>`, `<pd.df>.iloc[<i_iloc>, <column index>] = <new value>`).

In [None]:
# Make the replacement and display it
songs.loc[i_loc, 'Duration'] = 'No'
songs.loc[i_loc, ['Title', 'Album', 'Performer', 'Duration']]

Double-check the missing values now:

In [None]:
# Use <pd.df>.Duration.isna().sum(), or <pd.df>.isna().sum()['Duration'] or sb.heatmap(<pd.df>.isna(), cmap='...')
songs.Duration.isna().sum()

How many songs from the beginning of The British Invasion are there?

In [None]:
# Define the beginning of The British Invasion as a list comprehension
beginning = [y for y in range(1964, 1966)]
# Display the songs from the early years using a combination of <pd.df>.loc[] and isin()
songs.loc[songs.Year.isin(beginning), ['Title', 'Performer', 'Year']]

### Grouping and sorting data
- `<pd.df>['<column>'].unique()`, `<pd.df>['<column>'].nunique()`
- `<pd.df>.<column>.groupby()`, `<pd.df>.groupby('<column>')`, `<pd.df>.groupby('<column>').get_group(<value>)`
- `<pd.df>['<column>'].value_counts()`, `<pd.df>['<column>'].value_counts().sort_index()`, `<pd.df>['<column>'].value_counts().sort_index(inplace=True)`
- `<pd.df>.sort_values(by='<column name>', ascending=False/True)`
- `<pd.df>.groupby('<column>').<another column>.<f()>.sort_values(ascending=False)` (aggregate using function `f()`, e.g. `mean()`)
- `<pd.df>.groupby('<column>').<another column>.agg(['<f1 name>', '<f2 name>', ...])` (aggregate using multiple functions, e.g. `mean()`, `count()`,...)

If `sort_values()` is used after `agg(['<f1 name>', '<f2 name>', ...])` (`agg(['<f1 name>', '<f2 name>', ...]).sort_values(by='<f name>', ascending=False)`), it must be given the required `by` argument (`by='<f name>'`) before the optional `ascending=False`, because `agg()` with a list of functions returns a `pd.DataFrame`.


How many unique values for `Year` are there in the dataset (`<pd.df>['<column>'].unique()`, `<pd.df>.<column>.unique()`; `<pd.df>['<column>'].nunique()`, `<pd.df>.<column>.nunique()`)?

In [None]:
songs.Year.unique()
songs.Year.nunique()

Group the songs in the dataset by the year of release (`<pd.df>.groupby('<column>')`). The result can be `songs_by_year`. Display it, show its type, and explore its individual groups and their types (`<pd.df>.groupby('<column>').get_group(<value>)`).

In [None]:
songs_by_year = songs.groupby('Year')
songs_by_year
songs_by_year.get_group(1964)

How many songs are there in the dataset for each `Year` (`<pd.df>['<column>'].value_counts()`, `<pd.df>['<column>'].value_counts()[<year>]`, `<pd.df>['<column>'].value_counts().sort_index()`)?

Note that `value_counts()` returns a `pd.Series` object, with the index equal to `<pd.df>['<column>'].unique()` values.

In [None]:
songs.Year.value_counts()
songs.Year.value_counts()[1964]
songs.Year.value_counts().sort_index()

Sort the songs from the dataset by the year of release (`<pd.df>.sort_values(by='<column name>', ascending=False/True)`).
(It is also possible to use `inplace=True` in `sort_values()`, but it will change the order of songs in the dataset from that point on.)

In [None]:
songs.sort_values('Year', ascending=True)
songs.sort_values('Year')

Group the songs in the dataset by the year of release and display `mean` and/or `max` duration of the songs in each year, as well as the number (`count`) of songs in each year (`<pd.df>.groupby('<column>').<another column>.<f()>.sort_values(ascending=False)` (aggregate using function `f()`, e.g. `mean()`), `<pd.df>.groupby('<column>').<another column>.agg(['<f1 name>', '<f2 name>', ...])` (aggregate using multiple functions, e.g. `mean()`, `count()`, `max()`,...)).
If `sort_values()` is used after `agg(['<f1 name>', '<f2 name>', ...])` (`agg(['<f1 name>', '<f2 name>', ...]).sort_values(by='<f name>', ascending=False)`), it must be given the required `by` argument (`by='<f name>'`) before the optional `ascending=False`.

In [None]:
# To make all strings in songs.Duration look alike, insert placeholder values of the form 'mm:ss' 
# for those that are NaN, or that have been previously set to 'No', or the like
i = songs.Duration == 'No'
i[i]
songs.loc[i, 'Duration'] = '0:18'
# songs.loc[i, 'Duration']

In [None]:
# Make sure that now all strings in songs.Duration are of the form 'mm:ss', i.e. they contain ':'
# using songs.Duration.str.contains(':').sum() == len(songs) or all(songs.Duration.str.contains(':'))
all(songs.Duration.str.contains(':'))

In [None]:
# Convert Duration to int

# Define the to_int() function that converts a single string of the form 'mm:ss' to the number of seconds (int)
def to_int(s):
    m, s = s.split(':')
    return int(m) * 60 + int(s)
# Use <pd.df>.<column>.apply(<function>) to convert the entire Duration column to int
songs.Duration = songs.Duration.apply(to_int)
songs.Duration

In [None]:
# Make the groupings and aggregations

# <pd.df>.groupby('<column>').<another column>.<f()>.sort_values(ascending=False)
songs.groupby('Year').Duration.mean()
songs.groupby('Year').Duration.mean().sort_values(ascending=False)
# <pd.df>.groupby('<column>').<another column>.agg(['<f1 name>', '<f2 name>', ...]).sort_values(by='<f name>', ascending=False)
songs.groupby('Year').Duration.agg(['count', 'mean', 'max']).sort_values(by='count', ascending=False)
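
The whole chain (convert `'mm:ss'` strings to seconds, group, aggregate, sort) can be traced on three rows. This is a minimal sketch with made-up values; the helper is named `to_seconds()` here only for illustration and does the same job as `to_int()` above.

```python
import pandas as pd

# Hypothetical mini-dataset: two songs from 1964 and one from 1965
demo = pd.DataFrame({'Year': [1964, 1964, 1965],
                     'Duration': ['2:09', '3:01', '2:30']})

def to_seconds(duration):
    """Convert a string of the form 'mm:ss' to the number of seconds."""
    minutes, seconds = duration.split(':')
    return int(minutes) * 60 + int(seconds)

demo['Duration'] = demo['Duration'].apply(to_seconds)

# agg() with a list of function names returns a dataframe with one column per function,
# so sort_values() needs the by= argument
stats = (demo.groupby('Year')['Duration']
             .agg(['count', 'mean', 'max'])
             .sort_values(by='count', ascending=False))
print(stats)
```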

## Data visualization
Plot some scatterplots, line plots, bar graphs, histograms, box plots, violin plots, heatmaps,...
[Example](https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/)

[Matplotlib examples](https://matplotlib.org/stable/gallery/index.html)

[Seaborn examples](https://seaborn.pydata.org/examples/index.html) (see also [The Python Graph Gallery](https://www.python-graph-gallery.com/); it has a very neat user interface!)

[Plotnine examples](https://plotnine.org/reference/) (click on any element for its API and examples)

<u>**Note that it is also possible to**</u> <u>**[plot lines, bargraphs,... with Pandas only](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.line.html)**</u> (although in such cases Pandas interacts with Matplotlib under the hood).

<b>IMPORTANT: Matplotlib terminology, Figure vs. Axes</b><br>
A `Figure` object in Matplotlib is the outermost container for a Matplotlib graphic, which can contain multiple `Axes` objects. One source of confusion is the name: an `Axes` actually translates into what we think of as an individual plot or graph (rather than the plural of "axis", as we might expect).
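
The Figure vs. Axes distinction can be seen in a few lines. This is a minimal sketch with made-up data; the non-interactive `Agg` backend is selected only so that the code also runs outside a notebook (in a notebook with `%matplotlib inline`, that line can be dropped).

```python
import matplotlib
matplotlib.use('Agg')                 # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt

# One Figure (the outer container) holding two Axes (the individual plots)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot([1, 2, 3], [1, 4, 9])    # a line plot in the first Axes
axes[0].set_title('Left plot')

axes[1].bar(['a', 'b'], [3, 5])       # a bar graph in the second Axes
axes[1].set_title('Right plot')

print(type(fig).__name__)             # Figure
print(len(fig.axes))                  # number of Axes contained in the Figure
```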

##### Missing values

Read the dataset.

In [None]:
# Get the songs as a pd.DataFrame object from 'data/brit_titles_stripped.csv', 
# or from '../data/brit_titles_stripped.csv', or '../../data/brit_titles_stripped.csv', or ...,
# depending on where the csv file is located
songs = pd.read_csv('../data/brit_titles_stripped.csv')
songs

Check for missing values (use, e.g., `sb.heatmap(<pd.df>.isna(), cbar=False, cmap='viridis')`).

In [None]:
sb.heatmap(songs.isna(), cbar=False, cmap='viridis');

Briefly analyze the rows with `NaN`s. To select all such rows, use `any()` (`<pd.df>.loc[<pd.df>.isna().any(axis=1), ['<column 1>', '<column 2>', ...]]`). To select the rows where there are no `NaN`s at all, use `<pd.df>.loc[<pd.df>.notna().all(axis=1), ['<column 1>', '<column 2>', ...]]`.

In [None]:
songs.loc[songs.isna().any(axis=1), ['Title', 'Performer', 'Duration', 'danceability']]

It is difficult to spot any regular pattern, so get rid of `NaN`s in the simplest way possible (`<pd.df>.dropna(inplace=True)`). Make sure that the modified dataset is `NaN`-free (`<pd.df>.isna().sum()`).

In [None]:
songs.dropna(inplace=True)
songs
songs.isna().sum()
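
The two row selections mentioned above, and their relation to `dropna()`, can be checked on a small example. This is a minimal sketch with a made-up dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: the second and third rows each have one missing value
demo = pd.DataFrame({'Title': ['a', 'b', 'c'],
                     'Duration': ['2:09', None, '3:01'],
                     'danceability': [0.5, 0.6, np.nan]})

incomplete = demo.loc[demo.isna().any(axis=1)]    # rows with at least one NaN
complete = demo.loc[demo.notna().all(axis=1)]     # rows with no NaN at all

print(incomplete)
print(complete)
print(complete.equals(demo.dropna()))             # dropna() keeps exactly the complete rows
```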

Save the reduced dataset as `brit_visualization.csv`; it is the starting point for the visualizations.

In [None]:
# Save the modified dataset
songs.to_csv('../data/brit_visualization.csv', index=False)

##### Scatterplot

Read the modified dataset (if necessary).

In [None]:
songs = pd.read_csv('../data/brit_visualization.csv')
songs

Scatterplot the relationship between `Duration` and `danceability`.

Change the format of `Duration` from `str` (`'<minutes>:<seconds>'`) to `int` (the total number of seconds).

In [None]:
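def to_int(duration):
    # Convert a '<minutes>:<seconds>' string to the total number of seconds
    minutes, seconds = duration.split(':')
    return int(minutes) * 60 + int(seconds)
songs.Duration = songs.Duration.apply(to_int)
songs.Duration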
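
As an alternative to `apply()`, the same conversion can be sketched with the vectorized string methods of Pandas. The `durations` series below is hypothetical; the approach assumes that every value has exactly one `:` separator.

```python
import pandas as pd

# Hypothetical durations in '<minutes>:<seconds>' format
durations = pd.Series(['2:31', '3:05', '0:58'])

# Split into two string columns (0: minutes, 1: seconds) and convert both to int
parts = durations.str.split(':', expand=True).astype(int)

# Total number of seconds per song
seconds = parts[0] * 60 + parts[1]
print(seconds.tolist())
```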

Save the modified dataset as `brit_visualization_duration_int.csv`.

In [None]:
# Save the modified dataset
songs.to_csv('../data/brit_visualization_duration_int.csv', index=False)

To set the ranges of values on x-axis and y-axis (`Duration`, `danceability`), check the max values or run `describe()`.

In [None]:
max(songs.Duration)
songs.describe()

###### 1. Matplotlib version

[Matplotlib scatterplot example](https://matplotlib.org/stable/gallery/shapes_and_collections/scatter.html)<br>
[Excellent tutorial on matplotlib](https://realpython.com/python-matplotlib-guide/)

Use the following syntax:<br>
`ax = plt.axes()`<br>
`ax.set(xlim=(<from>, <to>), ylim=(<from>, <to>), xlabel='<xlabel>', ylabel='<ylabel>', title='<title>')`<br>
`ax.scatter(<pd.df>['<X>'], <pd.df>['<Y>'], marker='<marker type>', c='<fill color>', edgecolors='<edgecolor>', s=<marker size>)`; <br>

The `<pd.df>['<X>']` and `<pd.df>['<Y>']` arguments can also be specified as `<pd.df>.<X>` and `<pd.df>.<Y>` if `<X>` and `<Y>` are valid Python identifiers (e.g., single words).
The color parameter (`c`) is optional; if present, it should be a scalar or a sequence whose length matches the number of `(<X>, <Y>)` points. The `marker` parameter is optional as well. Both `c` and `marker` have defaults. For other values of `c` and `marker`, see [this](https://matplotlib.org/stable/gallery/color/named_colors.html#css-colors) and [this](https://matplotlib.org/stable/api/_as_gen/matplotlib.markers.MarkerStyle.html#matplotlib.markers.MarkerStyle.markers), respectively. A good value for `s` is 30-40 for 200-300 markers on the plot.

Alternatively:<br>
`ax.plot(<pd.df>['<X>'], <pd.df>['<Y>'], marker='<marker type>', color='<color>', linestyle='');`<br>

The `linestyle=''` parameter is essential for plotting the dots only - omitting it means that the connecting lines are plotted as well.

In [None]:
ax = plt.axes()
# ax;
ax.set(xlim=(50, 700), ylim=(0, 1), xlabel='Duration', ylabel='danceability', title='danceability(Duration)')
ax;
# ax.scatter(x=songs.Duration, y=songs.danceability)
ax.scatter(x=songs.Duration, y=songs.danceability, c='yellow', edgecolors='black', marker='o');
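
To compare the two approaches side by side, here is a minimal sketch with randomly generated data standing in for `Duration` and `danceability` (the data and the names `ax_scatter` and `ax_plot` are assumptions made for illustration only):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data standing in for Duration and danceability
rng = np.random.default_rng(0)
x = rng.uniform(60, 600, size=50)
y = rng.uniform(0, 1, size=50)

fig, (ax_scatter, ax_plot) = plt.subplots(1, 2, figsize=(8, 3), layout='constrained')

# ax.scatter() draws markers only
ax_scatter.set(xlim=(50, 700), ylim=(0, 1), title='ax.scatter()')
ax_scatter.scatter(x, y, marker='o', c='yellow', edgecolors='black', s=30)

# ax.plot() draws markers only if linestyle='' is given
ax_plot.set(xlim=(50, 700), ylim=(0, 1), title="ax.plot(..., linestyle='')")
ax_plot.plot(x, y, marker='o', color='orange', linestyle='');
```

Removing `linestyle=''` from the second call connects the points in the order in which they appear in the data, which is rarely meaningful for a scatterplot.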

###### 2. Plotnine version
[Plotnine scatterplot example](https://plotnine.org/reference/geom_point.html#plotnine.geom_point)<br>
[Excellent tutorial on plotnine](https://realpython.com/ggplot-python/)


In *Plotnine*, the syntax for setting the ranges on x and y axes is `xlim(<from>, <to>)`, `ylim(<from>, <to>)` (as two separate lines in calling `ggplot()`), or, alternatively, `coord_cartesian(xlim=(<from>, <to>), ylim=(<from>, <to>))` as a single separate line.

If `<x>` and `<y>` values are not columns of a dataframe already (`<X>` and `<Y>`), create a minimal dataframe to support plotting (`<df> = pd.DataFrame({'<X>': <x>, '<Y>': <y>})`).

Use `ggplot` as:

`(`<br>
&emsp;&emsp;`ggplot(<df>, aes(x='<X>', y='<Y>')) +`<br>
&emsp;&emsp;`geom_point(color='<color>', fill='<fill color>', shape='<shape>', size=<size>) +`<br>
&emsp;&emsp;`coord_cartesian(xlim=(<from>, <to>), ylim=(<from>, <to>)) +`<br>
&emsp;&emsp;`theme(figure_size=(10, 7), dpi=60, axis_text_x=element_text(color='<color>', size=<size>), axis_text_y=element_text(color='<color>', size=<size>)) +`<br>
&emsp;&emsp;`labs(x='...', y='...', title='...')`<br>
`).draw()`

The `color`, `fill` and `shape` parameters have defaults. The other values of these parameters are the same as in Matplotlib (see [this](https://matplotlib.org/stable/gallery/color/named_colors.html) and [this](https://matplotlib.org/stable/api/_as_gen/matplotlib.markers.MarkerStyle.html#matplotlib.markers.MarkerStyle), respectively).

In `theme(figure_size=(10, 7), dpi=60, ...)`, the `dpi` parameter is necessary to achieve full control over the plot size (`figure_size` is not enough). It is a good idea to experiment with the actual values for `figure_size` and `dpi`.

Another useful parameter of `theme()` is `axis_text_x=element_text(color='<color>', size=<size>)` (and `axis_text_y=element_text(color='<color>', size=<size>)`). It controls the appearance of the axis tick labels. Similarly, `axis_title=element_text(color='<color>', size=<size>)` can be used in `theme()` to set the color and font size of the axis labels (<b>both simultaneously!</b>), `axis_title_x=element_text(color='<color>', size=<size>)` (and `axis_title_y=element_text(color='<color>', size=<size>)`) changes the color and font size of the x-axis label (y-axis label), and `title=element_text(color='<color>', size=<size>)` does the same for the plot title.

**Note 1:** The IDE may flag `aes(x='<X>', y='<Y>')` as an error, but the code runs anyway; `aes('<X>', '<Y>')` is not flagged. `labs(x='...', y='...', title='...')` is flagged as well, and it works only *with* the keywords (`x=...`, `y=...`, `title=...`). To avoid these warnings, use `xlab('...')`, `ylab('...')` and `ggtitle('...')` as separate lines after calling `ggplot()`.

**Note 2:** Once the figure size is changed for plotnine graphs by calling `theme(figure_size=(10, 7), dpi=60)` or similar, the Matplotlib graphs use the new figure size as well. To change it, use `plt.figure(figsize=...)` in the code for Matplotlib graphs. 

In [None]:
(
    ggplot(songs, aes(x='Duration', y='danceability')) +
    theme(figure_size=(8, 5), dpi=60) + 
    geom_point(color='black', fill='yellow', ) + 
    # xlab('Duration') + 
    # ylab('danceability') + 
    # ggtitle('danceability(Duration)')
    labs(x='Duration', y='danceability', title='danceability(Duration)')
).draw()

###### 3. A brief analysis of the plot: What are the shortest/longest songs and their durations?

In [None]:
# display(<pd.df>['column'] <= <value>)                                    # Boolean mask
# display(type(<pd.df>['column'] <= <value>))                              # pd.Series
# display(<pd.df>[<pd.df>['column'] <= <value>]['column to display'])      # select one column
# display(<pd.df>[<pd.df>['column'] <= <value>][['column 1 to display', 'column 2 to display', ...]])   # select multiple columns (note the double brackets)

# Try this also with .loc[], as well as with .iloc[], with an explicitly set index and with .index

songs.loc[songs.Duration > 300, ['Title', 'Performer', 'Album', 'Year', 'Duration']]
songs.loc[songs.Duration < 90, ['Title', 'Performer', 'Album', 'Year', 'Duration']]
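
The selection pattern above, together with a direct lookup of the single shortest and longest song, can be sketched on a small made-up dataframe. The `toy` data is hypothetical; `idxmin()` and `idxmax()` return the index label of the minimum and maximum value.

```python
import pandas as pd

# Hypothetical toy data; durations are in seconds
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Performer': ['P1', 'P2', 'P1', 'P3'],
    'Duration': [85, 150, 420, 310],
})

# Boolean mask (a pd.Series of bool values)
mask = toy['Duration'] > 300

# Rows selected by the mask, columns selected by label
long_songs = toy.loc[mask, ['Title', 'Duration']]

# The single shortest and longest songs, located by index label
shortest = toy.loc[toy['Duration'].idxmin(), 'Title']
longest = toy.loc[toy['Duration'].idxmax(), 'Title']

print(long_songs)
print(shortest, longest)
```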

##### Line plot

Read the modified dataset (if necessary).

`pd.read_csv()` returns a `pd.DataFrame` object.

As for specifying the path of the dataset properly, see [this](https://stackoverflow.com/questions/35384358/how-to-open-my-files-in-data-folder-with-pandas-using-relative-path) (more specifically, **both** [this](https://stackoverflow.com/a/35384414/1899061) and [this](https://stackoverflow.com/a/43600253/1899061)).

In [None]:
songs = pd.read_csv('../data/brit_visualization_duration_int.csv')
songs

How many songs from each `Year` are there?

In [None]:
# Use <pd.df>['<column>'].value_counts(), <pd.df>['<column>'].value_counts()[<specific value> in <column>]
songs.Year.value_counts()

Sort this result by index: `pd.Series.sort_index()` (there is also `pd.DataFrame.sort_index()`).

In [None]:
# Define val_counts_sorted_by_index
val_counts_sorted_by_index = songs.Year.value_counts().sort_index()
val_counts_sorted_by_index

Preparation for plotting (`counts` on the y-axis, `years` on the x-axis): get the index and the values of `val_counts_sorted_by_index`.

One way of doing it is to use `np.array()` over `val_counts_sorted_by_index.index` and `val_counts_sorted_by_index.values`. However, using `val_counts_sorted_by_index.index` and `val_counts_sorted_by_index.values` directly works as well: `.values` is already an `np.ndarray`, and `.index` is a `pd.Index`, which Matplotlib accepts just like an array.

In [None]:
years = val_counts_sorted_by_index.index
counts = val_counts_sorted_by_index.values
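
The types involved can be checked on a small made-up series. This is a sketch only; `year_column` is a hypothetical stand-in for `songs.Year`.

```python
import numpy as np
import pandas as pd

# Hypothetical years standing in for songs.Year
year_column = pd.Series([1965, 1964, 1965, 1966, 1964, 1965])

val_counts = year_column.value_counts().sort_index()

print(type(val_counts.index))     # a pd.Index (sub)class, not np.ndarray
print(type(val_counts.values))    # np.ndarray

# Explicit conversion of both the index and the values to np.ndarray
years_arr = val_counts.index.to_numpy()
counts_arr = val_counts.to_numpy()
print(years_arr, counts_arr)
```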

And now plot it.

###### 1. Matplotlib version
[Matplotlib line plot example](https://matplotlib.org/stable/gallery/lines_bars_and_markers/simple_plot.html)<br>
[Excellent tutorial on matplotlib](https://realpython.com/python-matplotlib-guide/)

<em>Initial version</em><br>

`ax = plt.axes()`<br>
`ax.set(xlim=(<lower limit>, <upper limit>), ylim=(<lower limit>, <upper limit>), xlabel='...', ylabel='...', title='...')`<br>
`ax.ticklabel_format(useOffset=False)`<br>
`ax.plot(<x>, <y>, color='...', marker='<marker type>', linewidth=<number>, alpha=<number>)`<br>

To prevent the tick labels from being displayed relative to an offset (e.g., `+1.963e3` printed at the end of the axis), make sure to use `ax.ticklabel_format(useOffset=False)`.

Do not use `x=<x>, y=<y>` in `ax.plot()`; it generates an error. Pass `<x>` and `<y>` as positional arguments. For the other parameters, the keywords are necessary.

Examples of parameters in `ax.plot()`: `color='steelblue'`, `linewidth=3`, `alpha=0.8` (alpha: transparency (0-1)).



In [None]:
# Initial version
ax = plt.axes()
ax.set(xlim=(1963, 1968), ylim=(150, 400), xlabel='year', ylabel='count', title='Number of songs by year')
ax.ticklabel_format(useOffset=False)
ax.plot(years, counts, color='steelblue', linewidth=2, alpha=0.8);
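
The effect of `useOffset=False` is easiest to see when the same data is plotted twice. The sketch below uses made-up yearly counts (`toy_years` and `toy_counts` are hypothetical); whether the default formatter actually switches to offset notation depends on the data range and the Matplotlib settings.

```python
import matplotlib.pyplot as plt

# Hypothetical yearly counts
toy_years = [1963, 1964, 1965, 1966, 1967, 1968]
toy_counts = [180, 260, 390, 340, 270, 210]

fig, (ax_default, ax_plain) = plt.subplots(1, 2, figsize=(8, 3), layout='constrained')

# Default tick formatter: Matplotlib may label the ticks relative to an offset
ax_default.plot(toy_years, toy_counts)
ax_default.set_title('default')

# useOffset=False: the full values are shown on the tick labels
ax_plain.plot(toy_years, toy_counts)
ax_plain.ticklabel_format(useOffset=False)
ax_plain.set_title('useOffset=False');
```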

<em>Elaborated version 1 (without `plt.subplots()`)</em><br>

`plt.figure(layout='constrained', facecolor='<color>', figsize=(<x_size>, <y_size>))`&emsp;&emsp;# Set the Figure object parameters<br><br>
`ax = plt.axes()`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# Get the Axes object<br>
`ax.set_facecolor('<color>')`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# Set the Axes object parameters<br>
`ax.set_title('<title>', fontsize=12, loc='left')`<br>
`ax.set_xlabel('<x_label>', fontsize=8)`<br>
`ax.set_ylabel('<y_label>', fontsize=8)`<br>

`ax.set(xlim=(<m>, <n>), ylim=(<p>, <q>))`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# Set `xlim` and `ylim` in a single `ax.set()` call (`ax.set_xlim()` and `ax.set_ylim()` work as well)<br>

`ax.ticklabel_format(useOffset=False)`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# Set the tick parameters<br>
`ax.tick_params(axis='x', labelsize=6)`<br>
`ax.tick_params(axis='y', labelsize=6)`<br>

`ax.plot(years, counts, color='<color>', linewidth=2, alpha=0.8);`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# Display the plot using `ax.plot()`<br>

In `plt.figure(layout='constrained', facecolor='<color>', figsize=(<x_size>, <y_size>))`, using `layout='constrained'` is recommended to avoid overlapping of figure elements when changing the figure size. For a good figure size, use `figsize=(3.5, 2)` or similar.

It is also possible to set the Axes object background color using `plt.axes(facecolor='<color>')` instead of `ax.set_facecolor('<color>')`.

To prevent the tick labels from being displayed relative to an offset (e.g., `+1.963e3` printed at the end of the axis), make sure to use `ax.ticklabel_format(useOffset=False)`.

Experiment with different font sizes for labels, title and ticks.

Do not use `x=<x>, y=<y>` in `ax.plot()`; it generates an error. Pass `<x>` and `<y>` as positional arguments. For the other parameters, the keywords are necessary.

Examples of parameters in `ax.plot()`: `color='steelblue'`, `linewidth=3`, `alpha=0.8` (alpha: transparency (0-1)).

In [None]:
# Elaborated version 1 (without plt.subplots())

# Set the Figure object parameters
plt.figure(layout='constrained', facecolor='lightgreen', figsize=(3.5, 2), )
# Get the Axes object
ax = plt.axes()
# Set the Axes object parameters
ax.set_facecolor('lightyellow')
ax.set_title('Number of songs by year', fontsize=12)
ax.set_xlabel('year', fontsize=8)
ax.set_ylabel('counts', fontsize=8)
# Set xlim and ylim in a single ax.set() call (ax.set_xlim() and ax.set_ylim() work as well)
ax.set(xlim=(1963, 1968), ylim=(150, 400))
# Set the tick parameters
ax.ticklabel_format(useOffset=False)
ax.tick_params(axis='x', labelsize=6)
ax.tick_params(axis='y', labelsize=6)
# Display the plot using ax.plot()
ax.plot(years, counts, color='steelblue', linewidth=2, alpha=0.8);

<em>Elaborated version 2 (with `plt.subplots()`)</em><br>

`fig, ax = plt.subplots(1, 1, layout='constrained', facecolor='<color>', figsize=(<x_size>, <y_size>))`&emsp;&emsp;# Get the Figure and the Axes objects<br>

`ax.plot(years, counts, color='<color>', linewidth=2, alpha=0.8)`&emsp;&emsp;&emsp;# Plot the data on the Axes<br>

`ax.set_title('<Title>', fontsize=12, loc='left')`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&nbsp;# Set the Axes title, background color (face color), labels (incl. font sizes) and limits<br>
`ax.set_facecolor('<color>')`<br>
`ax.set_xlabel('<x_label>', fontsize=8)`<br>
`ax.set_ylabel('<y_label>', fontsize=8)`<br>
`ax.set_xlim(<m>, <n>)`<br>
`ax.set_ylim(<p>, <q>)`<br>

`ax.ticklabel_format(useOffset=False)`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# Set the tick parameters<br>
`ax.tick_params(axis='x', labelsize=6)`<br>
`ax.tick_params(axis='y', labelsize=6)`<br>

`plt.show()`&emsp;&emsp;# Display the plot using `plt.show()`<br>

In `fig, ax = plt.subplots(1, 1, layout='constrained', facecolor='<color>', figsize=(<x_size>, <y_size>))`, using `layout='constrained'` is recommended to avoid overlapping of figure elements when changing the figure size. For a good figure size, use `figsize=(3.5, 2)` or similar.

To prevent the tick labels from being displayed relative to an offset (e.g., `+1.963e3` printed at the end of the axis), make sure to use `ax.ticklabel_format(useOffset=False)`.

Experiment with different font sizes for labels, title and ticks.

Do not use `x=<x>, y=<y>` in `ax.plot()`; it generates an error. Pass `<x>` and `<y>` as positional arguments. For the other parameters, the keywords are necessary.

Examples of parameters in `ax.plot()`: `color='steelblue'`, `linewidth=3`, `alpha=0.8` (alpha: transparency (0-1)).

In [None]:
# Get the Figure and the Axes objects
fig, ax = plt.subplots(1, 1, layout='constrained', facecolor='lightgreen', figsize=(3.5, 2))
# Plot the data on the Axes
ax.plot(years, counts, color='steelblue', linewidth=2, alpha=0.8)
# Set the Axes title, labels (incl. font sizes) and limits
ax.set_title('Number of songs by year', fontsize=10)
ax.set_xlabel('year', fontsize=8)
ax.set_ylabel('counts', fontsize=8)
ax.set_xlim(1963, 1968)
ax.set_ylim(150, 400)
# Set the Axes background color (face color) 
ax.set_facecolor('lightyellow')
# Set the tick parameters
ax.ticklabel_format(useOffset=False)
ax.tick_params(axis='x', labelsize=6)
ax.tick_params(axis='y', labelsize=6)
# Display the plot using plt.show()
plt.show()

###### 2. Plotnine version
[Plotnine line plot example](https://plotnine.org/reference/geom_line.html#plotnine.geom_line)<br>
[Excellent tutorial on plotnine](https://realpython.com/ggplot-python/)

For some reason, running the Matplotlib version immediately before running the Plotnine version sometimes resets all values in `years` to 1970 (!!!), so re-creating `years` here might be necessary.

In [None]:
# years = np.array(val_counts_sorted_by_index.index)
# display(years)


If `<x>` and `<y>` values are not columns of a dataframe already (`<X>` and `<Y>`), create a minimal dataframe to support plotting (`<df> = pd.DataFrame({'<X>': <x>, '<Y>': <y>})`).

In [None]:
df = pd.DataFrame({'Years': years, 'Counts': counts})
df

Use `ggplot` as:

`(`<br>
&emsp;&emsp;`ggplot(<df>, aes(x='<X>', y='<Y>')) +`<br>
&emsp;&emsp;`geom_line(color='<color>', size=<size>, alpha=<transparency, 0-1>, linetype='<linetype>') +`<br>
&emsp;&emsp;`coord_cartesian(xlim=(<from>, <to>), ylim=(<from>, <to>)) +`<br>
&emsp;&emsp;`theme(figure_size=(10, 7), dpi=60, axis_text_x=element_text(color='<color>', size=<size>), axis_text_y=element_text(color='<color>', size=<size>)) +`<br>
&emsp;&emsp;`labs(x='...', y='...', title='...')`<br>
`).draw()`

The `color`, `size` and `linetype` parameters have defaults. The other values of these parameters are pretty much the same as in Matplotlib (see [this](https://matplotlib.org/stable/gallery/color/named_colors.html) and [this](https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html), respectively).

In `theme(figure_size=(10, 7), dpi=60, ...)`, the `dpi` parameter is necessary to achieve full control over the plot size (`figure_size` is not enough). It is a good idea to experiment with the actual values for `figure_size` and `dpi`.

Another useful parameter of `theme()` is `axis_text_x=element_text(color='<color>', size=<size>)` (and `axis_text_y=element_text(color='<color>', size=<size>)`). It controls the appearance of the axis tick labels. Similarly, `axis_title=element_text(color='<color>', size=<size>)` can be used in `theme()` to set the color and font size of the axis labels (<b>both simultaneously!</b>), `axis_title_x=element_text(color='<color>', size=<size>)` (and `axis_title_y=element_text(color='<color>', size=<size>)`) changes the color and font size of the x-axis label (y-axis label), and `title=element_text(color='<color>', size=<size>)` does the same for the plot title.

**Note 1:** The IDE may flag `aes(x='<X>', y='<Y>')` as an error, but the code runs anyway; `aes('<X>', '<Y>')` is not flagged. `labs(x='...', y='...', title='...')` is flagged as well, and it works only *with* the keywords (`x=...`, `y=...`, `title=...`). To avoid these warnings, use `xlab('...')`, `ylab('...')` and `ggtitle('...')` as separate lines after calling `ggplot()`.

**Note 2:** Once the figure size is changed for plotnine graphs by calling `theme(figure_size=(10, 7), dpi=60)` or similar, the Matplotlib graphs use the new figure size as well. To change it, use `plt.figure(figsize=...)` in the code for subsequent Matplotlib graphs. 

Examples of parameters in `geom_line()`: `color='steelblue'`, `size=1`, `linetype='solid'`, `alpha=0.8` (alpha: transparency (0-1)).

In [None]:
(
    ggplot(df, aes(x='Years', y='Counts')) + 
    geom_line() +
    theme(figure_size=(6, 4), dpi=60)
).draw()

In [None]:
(
    ggplot(df, aes(x='Years', y='Counts')) + 
    geom_line(color='red', linetype='dashed') + 
    geom_point(color='grey', fill='red') +
    theme(figure_size=(6, 4), dpi=60) + 
    labs(x='year', y='count', title='Number of songs by year')
).draw()

###### 3. Smoothen the curves
Based on [this](https://stackoverflow.com/a/5284038/1899061).<br><br>
`from scipy.interpolate import make_interp_spline, BSpline`<br>

`<x> = <definition of x-axis variable>`<br>
`<y> = <definition of y-axis variable>`<br>

`<x_smooth> = np.linspace(<x>.min(), <x>.max(), 300)`&emsp;&emsp;&emsp;&emsp;# 300: the number of points to make between `<x>.min()` and `<x>.max()`<br>
`spl = make_interp_spline(<x>, <y>, k=3)`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; # type: BSpline<br>
`<y_smooth> = spl(<x_smooth>)`<br>

`plt.xlim([<lowest value of x to show on the plot>, <highest value of x to show on the plot>])`<br>
`plt.ylim([<lowest value of y to show on the plot>, <highest value of y to show on the plot>])`<br>

`plt.plot(<x_smooth>, <y_smooth>)`<br>
`plt.plot(<x>, <y>)`&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;# optional: show the segmented line on the same plot as well<br>
`plt.show()`



In [None]:
# # 300 represents the number of points to make between T.min and T.max
# T = np.array([6, 7, 8, 9, 10, 11, 12])
# power = np.array([1.53E+03, 5.92E+02, 2.04E+02, 7.24E+01, 2.72E+01, 1.10E+01, 4.70E+00])
#
# # plt.plot(T,power)
# # plt.show()
#
# xnew = np.linspace(T.min(), T.max(), 300)
#
# spl = make_interp_spline(T, power, k=3)  # type: BSpline
# power_smooth = spl(xnew)
#
# plt.plot(xnew, power_smooth)
# plt.show()

# from scipy.interpolate import make_interp_spline, BSpline
# 
# year_smooth = np.linspace(years.min(), years.max(), 300)
# spl = make_interp_spline(years, counts, k=3)  # type: BSpline
# counts_smooth = spl(year_smooth)
# 
# # plt.figure(layout='constrained', figsize=(5, 3), facecolor='lightyellow', alpha=0.5)
# fig, ax = plt.subplots(figsize=(5, 3), layout='constrained', facecolor='beige')
# 
# ax.set_facecolor('navajowhite')
# 
# plt.ticklabel_format(useOffset=False)
# 
# plt.xlim([1963, 1968])
# plt.ylim([150, 400])
# plt.xticks(fontsize=8)
# plt.yticks(fontsize=8)
# 
# plt.xlabel('year', fontsize=10)
# plt.ylabel('count', fontsize=10)
# plt.title('Song counts over years', fontsize=12, color='green')
# 
# plt.plot(year_smooth, counts_smooth)
# plt.plot(years, counts)
# plt.show()

# # Alternatively
# ax = plt.axes()
# ax.set(xlim=(years.min()-1, years.max()+1), ylim=(150, 400), xlabel='year', ylabel='count', title='Song counts over years')
# ax.ticklabel_format(useOffset=False)
# ax.plot(years, counts, color='steelblue', linewidth=2, marker='o', alpha=0.8)
# ax.plot(year_smooth, counts_smooth, color='green', linewidth=2, alpha=0.8);
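
A self-contained sketch of the smoothing recipe is shown below. It uses made-up yearly counts (`toy_years` and `toy_counts` are hypothetical) instead of the values computed from the songs dataset; note that a cubic spline (`k=3`) needs at least four data points.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline

# Hypothetical yearly counts
toy_years = np.array([1963, 1964, 1965, 1966, 1967, 1968], dtype=float)
toy_counts = np.array([180, 260, 390, 340, 270, 210], dtype=float)

# 300 evenly spaced points between the first and the last year
years_smooth = np.linspace(toy_years.min(), toy_years.max(), 300)

# Cubic B-spline that passes through all (year, count) points
spl = make_interp_spline(toy_years, toy_counts, k=3)
counts_smooth = spl(years_smooth)

fig, ax = plt.subplots(figsize=(5, 3), layout='constrained')
ax.ticklabel_format(useOffset=False)
ax.plot(years_smooth, counts_smooth, color='green', label='smoothed')
ax.plot(toy_years, toy_counts, color='steelblue', marker='o', linestyle='', label='data')
ax.set(xlabel='year', ylabel='count', title='Song counts over years')
ax.legend();
```

An interpolating spline can overshoot between the data points, so the smoothed curve is a visual aid rather than an estimate of the counts between two years.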

###### 4. Multiple subplots
(shown here after [this](https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html))

In [None]:
# # From https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html
# fig = plt.figure()
# ax1 = fig.add_axes([0.1, 0.55, 0.8, 0.4],
#                    xticklabels=[], ylim=(-1.2, 1.2))
# ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
#                    ylim=(-1.2, 1.2))
# # Meanings of the numbers in [0.1, 0.55, 0.8, 0.4] ([left, bottom, width, height], as fractions of the figure size):
# #     0.1 - distance of the left edge of the Axes from the left edge of fig
# #     0.55 - distance of the bottom edge of the Axes from the bottom edge of fig
# #     0.8 - width of the Axes
# #     0.4 - height of the Axes
# # Experiment with these numbers to get a better feeling for them

# x = np.linspace(0, 10)
# ax1.plot(np.sin(x))
# ax2.plot(np.cos(x));
# 
# fig, ax = plt.subplots()
# ax


fig = plt.figure(figsize=(6, 6), )
# fig
ax1 = fig.add_axes([0.1, 0.579, 0.8, 0.35],
                   xlim=(1963, 1968), ylim=(150, 400),
                   xlabel='year', ylabel='counts',
                   title='Number of songs recorded over the years')
ax2 = fig.add_axes([0.1, 0.08, 0.8, 0.35],
                   xlim=(1963, 1968), ylim=(150, 400),
                   xlabel='year', ylabel='counts',
                   title='Number of songs recorded over the years')
# display(type(ax1))
ax1.ticklabel_format(useOffset=False)
ax2.ticklabel_format(useOffset=False)

ax1.plot(years, counts, color='steelblue', linewidth=1.5, alpha=0.8)    # alpha: transparency (0-1)
ax2.plot(years, counts, color='purple', linewidth=1.5, alpha=0.8);      # alpha: transparency (0-1)
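
The list passed to `fig.add_axes()` has the form `[left, bottom, width, height]`, with all four numbers given as fractions of the figure size. The following sketch (with sine and cosine curves as placeholder data) reads the position of each Axes back to confirm this:

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 6))

# rect = [left, bottom, width, height], all as fractions of the figure size
ax_top = fig.add_axes([0.1, 0.55, 0.8, 0.4])
ax_bottom = fig.add_axes([0.1, 0.1, 0.8, 0.4])

# Hide the x tick labels of the upper plot
ax_top.tick_params(labelbottom=False)

x = np.linspace(0, 10, 100)
ax_top.plot(x, np.sin(x))
ax_bottom.plot(x, np.cos(x))

# The position of each Axes reports the same four numbers
print(ax_top.get_position().bounds)
print(ax_bottom.get_position().bounds)
```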

##### Histogram

Read the dataset (`brit_visualization_duration_int.csv`).

In [None]:
songs = pd.read_csv('../data/brit_visualization_duration_int.csv')

Plot the histogram of song durations (lengths, times).

Use Pandas to extract song lengths as a `pd.Series` object (`<pd.Series object> = <pd.df>['<column>']`).

In [None]:
# Get the song lengths as a pd.Series object
duration = songs.Duration

In [None]:
# Convert the song lengths into a NumPy array (using <song lengths>.to_numpy(), or np.array(<song lengths>), or <song lengths>.values)
type(duration)
type(duration.values)
duration = duration.values
duration

###### 1. Matplotlib version
[Matplotlib histogram example](https://matplotlib.org/stable/gallery/statistics/hist.html)

Plot the histogram of the song lengths using Matplotlib.

Minimal version: `plt.hist(<x>, bins=<number of bins>);` or `sb.histplot(<x>, bins=<number of bins>)`.

Alternatively:<br>
`plt.figure(layout='constrained', facecolor='<color>', figsize=(3.5, 2), )`<br>
`ax = plt.axes()`<br>
`ax.set(xlabel='...', ylabel='...', title='...')`<br>
`ax.hist(<x>, bins=<number of bins>)`<br>

As for the plot styles, there are a lot of [available styles](https://www.dunderdata.com/blog/view-all-available-matplotlib-styles) that can also be listed in code using `plt.style.available`. See also [this](https://www.analyticsvidhya.com/blog/2021/08/exploring-matplotlib-stylesheets-for-data-visualization/).

Alternatively, plot style can be set using `sb.set_theme(palette='...')` (or just `sb.set()`, but that function might get deprecated and removed from *Seaborn* in the future). See [`sb.set_theme()` documentation](https://seaborn.pydata.org/generated/seaborn.set_theme.html) for the function's parameters and defaults. For `palette='...'`, use the name of a Seaborn palette or of a Matplotlib colormap, e.g. any of [these](https://matplotlib.org/stable/users/explain/colors/colormaps.html#qualitative). Note that `plt.style.available` lists style sheets, not palettes.

In [None]:
# Set plot style using sb.set_theme(palette='Pastel2')
sb.set_theme(palette='Pastel2')

# Plot the histogram - x: song time in [sec]; y: number of songs; 40 bins

# # Minimal version
# plt.hist(duration, bins=40);
# sb.histplot(duration, bins=40);

# A more detailed version
plt.figure(layout='constrained', facecolor='lightgreen', figsize=(3.5, 2), )
ax=plt.axes()
ax.set(xlabel='duration', ylabel='count', title='Song duration histogram')
ax.hist(duration, bins=40);
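
Besides drawing the bars, `ax.hist()` returns the numbers behind the plot. The sketch below uses randomly generated values as a stand-in for the song durations (the data is hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical song durations in seconds
rng = np.random.default_rng(1)
toy_durations = rng.normal(loc=160, scale=30, size=500)

fig, ax = plt.subplots(figsize=(3.5, 2), layout='constrained')
ax.set(xlabel='duration', ylabel='count', title='Song duration histogram')

# ax.hist() returns the bin counts, the bin edges and the drawn patches
bin_counts, bin_edges, patches = ax.hist(toy_durations, bins=40)

# Every value falls into exactly one bin, and there is one more edge than bins
print(int(bin_counts.sum()), len(bin_edges))
```

The returned counts and edges are handy for checking how sensitive the shape of the histogram is to the number of bins.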

###### 2. Plotnine version
[Plotnine histogram example](https://plotnine.org/reference/geom_histogram.html#plotnine.geom_histogram)

Plot the histogram of the song lengths using *Plotnine*.

A minimal, but effective version:<br>
`plot = ggplot(songs, aes(x='<x>'))`<br>
`plot + geom_histogram(bins=40)`<br>

A more detailed version:<br>
`(`<br>
&emsp;&emsp;`ggplot(songs, aes(x='<x>')) +`<br>
&emsp;&emsp;`geom_histogram(bins=40, color='<color>', fill='<fill>', size=<outline thickness>, alpha=<transparency, 0-1>) +`<br>
&emsp;&emsp;`theme(figure_size=(6, 4), dpi=60) +`<br> 
&emsp;&emsp;`labs(x='<x>', y='count', title='<title>')`<br>
`).draw()`

[Excellent tutorial on plotnine](https://realpython.com/ggplot-python/).

In [None]:
# Minimal version
plot = ggplot(songs, aes(x='Duration'))
plot += theme(figure_size=(5, 3), dpi=60)
plot += geom_histogram(bins=80, color='grey', fill='yellow')
plot


To avoid the annoying text output like `<ggplot: (177159008578)>` under the plot, use the following syntax:

`(`<br>
&emsp;&emsp;`ggplot(<pd.df>, aes(x='<x>')) +`<br>
&emsp;&emsp;`geom_histogram(bins=40, color='<color>', fill='<fill>', size=<outline thickness>, alpha=<transparency, 0-1>) +`<br>
&emsp;&emsp;`theme(figure_size=(6, 4), dpi=60) +`<br> 
&emsp;&emsp;`labs(x='<x>', y='count', title='<title>')`<br>
`).draw()`

In [None]:
(
    ggplot(songs, aes(x='Duration')) +
    geom_histogram(bins=40, color='grey', fill='yellow', size=0.5, alpha=0.8) + 
    theme(figure_size=(5, 3), dpi=60) + 
    labs(x='duration', y='count', title='Song duration histogram')
).draw()

##### Bar graph

Read the dataset (`'../data/brit_visualization_duration_int.csv'`) and make some minor transformations.

`pd.read_csv()` returns a `pd.DataFrame` object.

As for specifying the path of the dataset properly, see [this](https://stackoverflow.com/questions/35384358/how-to-open-my-files-in-data-folder-with-pandas-using-relative-path) (more specifically, **both** [this](https://stackoverflow.com/a/35384414/1899061) and [this](https://stackoverflow.com/a/43600253/1899061)).

In [None]:
# Get the songs as a pd.DataFrame object from 'data/brit_visualization_duration_int.csv', or from
# '../data/brit_visualization_duration_int.csv', or '../../data/brit_visualization_duration_int.csv', or ..., 
# depending on where the csv file is located
songs = pd.read_csv('../data/brit_visualization_duration_int.csv')

How many powerful, energetic, intense, loud, and possibly anthemic songs did each British Invasion band release during the period of the British Invasion?

Define a new feature (column in the `songs` dataframe), `powerful`, as a combination of `energy` and `loudness` - songs with `energy` and `loudness` above the corresponding 3rd quartiles are considered powerful.

In [None]:
# Run songs.describe() to see the 3rd quartiles
songs.describe()
# type(songs.describe())

In [None]:
# Display the 3rd quartiles for selected candidate features to describe the new feature, 'powerful' 
# ('danceability', 'energy', 'liveness', 'loudness', 'tempo', 'valence', 'shake_the_audience')
songs.describe().loc['75%', ['danceability', 'energy', 'liveness', 'loudness', 'tempo', 'valence', 'shake_the_audience']]
# songs.describe().loc['75%', ['danceability', 'energy', 'liveness', 'loudness', 'tempo', 'valence', 'shake_the_audience']]['danceability']

In [None]:
# Define threshold values for the candidate features (3rd quartiles, i.e. '75%')
thresholds = songs.describe().loc['75%', ['danceability', 'energy', 'liveness', 'loudness', 'tempo', 'valence', 'shake_the_audience']]
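
The 3rd quartiles can also be computed directly with `quantile(0.75)`, without going through the `describe()` table. A minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column names mimic the songs dataset
toy = pd.DataFrame({
    'energy': [0.2, 0.4, 0.6, 0.8, 1.0],
    'loudness': [-12.0, -10.0, -8.0, -6.0, -4.0],
})

# 3rd quartiles taken from the describe() table ...
thresholds_describe = toy.describe().loc['75%', ['energy', 'loudness']]

# ... and computed directly
thresholds_quantile = toy[['energy', 'loudness']].quantile(0.75)

print(thresholds_describe)
print(thresholds_quantile)
```

Both results are `pd.Series` objects indexed by the column names, so they can be used interchangeably as `thresholds['<column>']`.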

In [None]:
# Define the condition for a song to be powerful (songs with `energy` and `loudness` above the corresponding 3rd quartiles); 
# experiment with different combinations of candidate features

# powerful_condition = ((songs['danceability'] > thresholds['danceability']) &
#                       (songs['energy'] > thresholds['energy']))
# powerful_condition = ((songs['liveness'] > thresholds['liveness']) &
#                       (songs['energy'] > thresholds['energy']))
powerful_condition = ((songs['loudness'] > thresholds['loudness']) &                            # !!!
                      (songs['energy'] > thresholds['energy']))
# powerful_condition = ((songs['loudness'] > thresholds['loudness']) &
#                       (songs['energy'] > thresholds['energy']) &
#                       (songs['valence'] > thresholds['valence']))
# powerful_condition = ((songs['loudness'] > thresholds['loudness']) &
#                       (songs['shake_the_audience'] > thresholds['shake_the_audience']))         # !
# powerful_condition = ((songs['loudness'] > thresholds['loudness']) &
#                       (songs['tempo'] > thresholds['tempo']))
# powerful_condition = ((songs['energy'] > thresholds['energy']) &
#                       (songs['valence'] > thresholds['valence']))                               # !!
powerful_condition
powerful_condition[powerful_condition]

In [None]:
# Define the new feature, 'powerful'
songs['powerful'] = powerful_condition.values
songs['powerful']

In [None]:
# Display these powerful songs and their performers
songs.loc[songs.powerful, ['Title', 'Performer']]

How many British Invasion songs have been powerful, in terms of the definition of `songs.powerful` shown above?

In [None]:
len(songs.loc[songs.powerful, ['Title', 'Performer']])
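
Since `True` counts as 1 in arithmetic, the number of powerful songs can also be obtained with `sum()` over the Boolean column. The sketch below rebuilds the whole step on made-up data, with thresholds fixed by hand (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the thresholds fixed by hand
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'energy': [0.9, 0.5, 0.95, 0.3],
    'loudness': [-5.0, -4.0, -3.0, -11.0],
})
energy_threshold, loudness_threshold = 0.8, -6.0

# Element-wise AND of two Boolean Series; the parentheses are required
toy['powerful'] = (toy['energy'] > energy_threshold) & (toy['loudness'] > loudness_threshold)

# True counts as 1, so sum() gives the number of powerful songs
n_powerful = int(toy['powerful'].sum())
print(n_powerful, len(toy.loc[toy['powerful']]))
```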

<u>Save this version as a new *.csv* file, for use in the subsequent examples.</u> (`<pd.df>.to_csv('<path>')`)

In [None]:
songs.to_csv('../data/brit_visualization_powerful.csv', index=False)

###### Preparing the data for plotting the bar graph

Group the data - group the songs by performers.

In [None]:
songs_by_performer = songs.groupby('Performer')

Use `get_group(<performer>)` to get all songs by a selected performer and `value_counts()` over the resulting group's `powerful` column (showing the `True` and `False` subgroups). This is a precursor to creating the data for the y-axis of the bar graph.

In [None]:
songs_by_performer.get_group('The Animals')
# songs_by_performer.get_group('The Animals').value_counts('powerful')
songs_by_performer.get_group('The Animals').powerful.value_counts()
# songs_by_performer.get_group('The Animals').powerful.value_counts()[False]

Build the data to plot by extracting relevant items from each group.

For x-axis, use `unique()` over the `Performer` column, and then optionally `list()` over the resulting array to make the list of performers.

In [None]:
performers = songs.Performer.unique()
performers = list(performers)
performers

For y-axis, create the lists of the numbers of powerful songs (`powerful`) and of the other ones (`not_powerful`).
(Start from two empty lists. Loop over the list of performers created in the previous step, `get_group()` for each performer and append the `value_counts()[True]` of the `powerful` column of the current performer (`p['powerful']`) to `powerful` if any of `p['powerful']` has the value `True`, otherwise append 0. Do the similar thing for `not_powerful`. Display both lists in the end to double-check the result.)

In [None]:
powerful = []
not_powerful = []
for p in performers:
    s = songs_by_performer.get_group(p)
    powerful.append(s.powerful.value_counts()[True] if any(s.powerful) else 0)
    not_powerful.append(s.powerful.value_counts()[False] if not all(s.powerful) else 0)
print(powerful)
print(not_powerful)
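
The counting loop above can also be written without explicit iteration. The following is a minimal sketch, assuming a small made-up dataframe (`toy`) in place of `songs`; `pd.crosstab()` is swapped in here as an alternative and is not what the notebook itself uses.

```python
import pandas as pd

# Hypothetical toy data standing in for the songs dataframe
toy = pd.DataFrame({
    'Performer': ['A', 'A', 'B', 'B', 'B', 'C'],
    'powerful': [True, False, False, False, False, True],
})

# Cross-tabulate performers against the boolean 'powerful' column;
# reindex() guarantees that both columns exist even if one value never occurs
counts = pd.crosstab(toy['Performer'], toy['powerful']).reindex(columns=[True, False], fill_value=0)
counts.columns = ['powerful', 'not_powerful']
print(counts)
```

The resulting `counts` has the performers as its index (sorted alphabetically, unlike the order returned by `unique()`), so it could be passed to `plot.bar()` directly.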

And now plot the bar graph. Based on the second example from [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html) (using `<pd.df>.plot.bar()`, not Matplotlib or Seaborn).
For a complete list of parameters used in `**kwargs`, see [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html).
For a list of named colors (Matplotlib named colors), see [here](https://matplotlib.org/stable/gallery/color/named_colors.html#css-colors).

First create an auxiliary dataframe to use for plotting. Use `powerful` and `not_powerful` as the columns, <u>and the list of performers created above as the index of the dataframe</u>.

In [None]:
# # The role-model example from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html
# speed = [0.1, 17.5, 40, 48, 52, 69, 88]
# lifespan = [2, 8, 70, 1.5, 25, 12, 28]
# index = ['snail', 'pig', 'elephant', 'rabbit', 'giraffe', 'coyote', 'horse']
# df = pd.DataFrame({'speed': speed, 'lifespan': lifespan}, index=index)

df = pd.DataFrame({'powerful': powerful, 'not_powerful': not_powerful}, index=performers)
df

###### Alternative 1 - plot the bargraph using Pandas (`<pd.df>.plot.bar()`)

[Pandas bargraph example](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html)

Use `ax = <pd.df>.plot.bar()` to plot the bargraph.

Relevant parameters:
- `figsize=(<width>, <height>)` (e.g., (6, 6))
- `rot=<rotation angle [degrees]>` for the x-axis labels
- `ylim=(<from>, <to>)`
- `color={'powerful': 'limegreen', 'not_powerful': 'navajowhite'}` (for a list of Matplotlib named colors, see [here](https://matplotlib.org/stable/gallery/color/named_colors.html#css-colors))
- `edgecolor='<color of bar edges>'`
- `title='<title>'`
- `xlabel='<xlabel>'`
- `ylabel='<ylabel>'`
- `fontsize=<fontsize>` (for all text; suitable fontsizes are 10, 12,...)
- `stacked=True` (the bars for the same x-axis value are stacked on top of one another)

The returned value (`ax`) is usually unnecessary and can be omitted.

It is <b>a very good idea</b> to also use `plt.tight_layout()` <b>after</b> `<pd.df>.plot.bar()` to avoid cutoffs at the bottom of the figure.  

In [None]:
df.plot.bar(figsize=(6, 6), rot=90, ylim=(0, 100), color={'powerful': 'limegreen', 'not_powerful': 'navajowhite'}, edgecolor='grey',
            title='Powerful songs of the British Invasion', xlabel='band', ylabel='count', fontsize=6, stacked=True);
plt.tight_layout()

###### Alternative 2 - plot the bargraph using Seaborn (`sb.countplot()`)

Use `ax = sb.countplot()` to plot the bargraph.

Relevant parameters:
- `data=<pd.df>`
- `x='<column 1>'` (e.g., 'Performer')
- `hue='<column 2>'` (e.g., 'powerful')
- `palette='<palette>'` (e.g., 'Set2'; it is also possible to define custom palettes using Hex codes, e.g. `palette=['#432371','#FAAE7B']`)
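- `dodge=False` to draw the bars of all `hue` levels at the same x position (overlaid rather than side by side; unlike `stacked=True` in Pandas, the bars are not summed on top of one another)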

If necessary, use `plt.xticks(rotation=90)` before `sb.countplot()`.

Note that `sb.countplot()` returns a Matplotlib `Axes` object (`matplotlib.axes.Axes`), so after the call `ax = sb.countplot(...)` all `Axes` methods can be called on `ax` (like `ax.set_title('<title>')`, `ax.set_ylim(...)`, etc.).

In [None]:
plt.figure(figsize=(12, 8), facecolor='navajowhite')
plt.xticks(rotation=90)
# sb.countplot(songs, x='Performer', hue='powerful', palette='viridis', dodge=False).set(title='Powerful songs', );
ax = sb.countplot(songs, x='Performer', hue='powerful', palette=['#9fbf0d', '#db145a'], dodge=False)
ax.set_title('Powerful songs', fontsize=20)
ax.set_ylim(0, 100);

##### Box plot
[Seaborn boxplot example](https://seaborn.pydata.org/generated/seaborn.boxplot.html) (used here as the role model)

For Seaborn color palette names see [this](https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette) or [this](https://10xsoft.org/courses/data-analysis/mastering-data-visualization-with-python/section-4-data-visualization-using-seaborn/colour-palettes-seaborn/). To list the names of some ('qualitative') Seaborn color palettes, use `sb.palettes.SEABORN_PALETTES.keys()` (see [this](https://10xsoft.org/courses/data-analysis/mastering-data-visualization-with-python/section-4-data-visualization-using-seaborn/colour-palettes-seaborn/) and [this](https://www.codecademy.com/article/seaborn-design-ii) for additional named palettes).

Read the dataset (`'../data/brit_visualization_duration_int.csv'`).

`pd.read_csv()` returns a `pd.DataFrame` object.

As for specifying the path of the dataset properly, see [this](https://stackoverflow.com/questions/35384358/how-to-open-my-files-in-data-folder-with-pandas-using-relative-path) (more specifically, **both** [this](https://stackoverflow.com/a/35384414/1899061) and [this](https://stackoverflow.com/a/43600253/1899061)).

In [None]:
# Get the songs as a pd.DataFrame object from 'data/brit_visualization_duration_int.csv', or from
# '../data/brit_visualization_duration_int.csv', or '../../data/brit_visualization_duration_int.csv', or ..., depending on where the csv file is located
songs = pd.read_csv('../data/brit_visualization_duration_int.csv')

Use `sb.boxplot()` to plot some boxplots.

For a single-column boxplot, relevant parameters are `y=<pd.df>['column']` (for 'vertical' boxplot) or `x=<pd.df>['column']` (for 'horizontal' boxplot), and `palette='<palette>'` (e.g., 'Set3', 'pastel', ...; see the links above for other named color palettes). <u>Note that in case `palette` is used, it is also necessary to use `hue=<n>`, where `<n>` can be any value, e.g. 1</u>.

For a multiple-column boxplot, relevant parameters are `data=<pd.df>[['column1', 'column2',...]]`, `orient='v'` (for 'vertical' boxplot) and `palette='<palette>'`. No `hue` is needed, no `legend`.

In [None]:
# display(sb.palettes.SEABORN_PALETTES.keys())

# For a single column (e.g., Duration)
# # sb.boxplot(x=songs.Duration, palette='Set1');

plt.figure(layout='constrained', facecolor='navajowhite', figsize=(3.5, 2), )
# sb.boxplot(y=songs.Duration, palette='Set1', hue=1, legend=False)

# # Alternatively
# sb.boxplot(data=songs, y='Duration', palette='Set1', hue=1, legend=False)
# plt.tight_layout()

# For multiple columns (e.g., energy and acousticness)
sb.boxplot(data=songs[['acousticness', 'energy']], palette='Set3');
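
To make explicit what a box plot summarizes, the sketch below computes the quartiles and the conventional 1.5 × IQR whisker limits by hand. The series `values` is made up for illustration and is not taken from the dataset; the 1.5 factor is assumed here because it is Seaborn's default whisker setting (`whis=1.5`).

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these limits are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```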


##### Violin plot
[Seaborn violin plot example](https://seaborn.pydata.org/generated/seaborn.violinplot.html)

Combines box plot and density plot. Based on [this](https://stackoverflow.com/questions/46134113/seaborn-violin-plot-from-pandas-dataframe-each-column-its-own-separate-violin-p) and [this](https://seaborn.pydata.org/generated/seaborn.violinplot.html).

Read the dataset (`'../data/brit_visualization_duration_int.csv'`).

`pd.read_csv()` returns a `pd.DataFrame` object.

As for specifying the path of the dataset properly, see [this](https://stackoverflow.com/questions/35384358/how-to-open-my-files-in-data-folder-with-pandas-using-relative-path) (more specifically, **both** [this](https://stackoverflow.com/a/35384414/1899061) and [this](https://stackoverflow.com/a/43600253/1899061)).

In [None]:
# Get the songs as a pd.DataFrame object from 'data/brit_visualization_duration_int.csv', or from
# '../data/brit_visualization_duration_int.csv', or '../../data/brit_visualization_duration_int.csv', or ..., depending on where the csv file is located
songs = pd.read_csv('../data/brit_visualization_duration_int.csv')

Use `sb.violinplot()` as follows: first set `x = <pd.df>.loc[<index>, '<column for x-axis>']`, then call `sb.violinplot(data=<pd.df>, x=x, y=<pd.df>['<column for y-axis>'], hue=x, palette='<palette>', legend=False)`.

For example, if the violin plot should represent the density/boxplot diagram of song `Duration` in certain `Year`s, then `<column for x-axis>` is `Year` and `<column for y-axis>` is `Duration`. Good values for `'<palette>'` are, e.g., 'Set3', 'pastel', ...

It is good practice to assign `x` before the call to `sb.violinplot()` and then pass `x=x`. Writing `x=<pd.df>.loc[<index>, '<column for x-axis>']` inline, i.e. `sb.violinplot(x=<pd.df>.loc[<index>, '<column for x-axis>'], y=..., ...)`, might generate an error.

In [None]:
plt.figure(layout='constrained', facecolor='navajowhite', figsize=(10, 7), )

x=songs.loc[songs.Year < 1968, 'Year']
sb.violinplot(songs, x=x, y=songs.Duration, hue=x, palette='Set1', legend=False);

##### Heat map
[Seaborn heat map example](https://seaborn.pydata.org/generated/seaborn.heatmap.html) (used here as the role model)

To create a heatmap, create the corresponding pivot table first. [An intuitive visual explanation of pivot tables](https://support.microsoft.com/en-us/office/overview-of-pivottables-and-pivotcharts-527c8fa3-02c0-445a-a2db-7794676bce96#:~:text=A%20PivotTable%20is%20an%20interactive,unanticipated%20questions%20about%20your%20data.) (start from [this raw table](https://support.microsoft.com/en-us/office/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576), and then see [the corresponding pivot table](https://support.microsoft.com/en-us/office/overview-of-pivottables-and-pivotcharts-527c8fa3-02c0-445a-a2db-7794676bce96#:~:text=A%20PivotTable%20is%20an%20interactive,unanticipated%20questions%20about%20your%20data.) (expand <em>About Pivot Tables</em>)).


Read the dataset (`'../data/brit_visualization_duration_int.csv'`).

`pd.read_csv()` returns a `pd.DataFrame` object.

As for specifying the path of the dataset properly, see [this](https://stackoverflow.com/questions/35384358/how-to-open-my-files-in-data-folder-with-pandas-using-relative-path) (more specifically, **both** [this](https://stackoverflow.com/a/35384414/1899061) and [this](https://stackoverflow.com/a/43600253/1899061)).

In [None]:
# Get the songs as a pd.DataFrame object from 'data/brit_visualization_duration_int.csv', or from
# '../data/brit_visualization_duration_int.csv', or '../../data/brit_visualization_duration_int.csv', or ..., depending on where the csv file is located
songs = pd.read_csv('../data/brit_visualization_duration_int.csv')

The idea: categorize songs according to their *valence*.

In [None]:
# # Plot the density function for 'valence'
# from plotnine import geom_density
# (
#     ggplot(songs, aes(x='valence')) +
#     geom_density()
# ).draw()

###### Alternative 1 - using `pd.qcut()`
Create a new column in the dataframe, e.g. `valence_category`, using the `pd.qcut()` function to split the entire range of `songs.valence` values into five quantile-based subranges, `Very Low` to `Very High` (with approximately equal numbers of elements in each subrange): `songs['valence_category'] = pd.qcut(songs.valence, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])`.

In [None]:
# Create the new column
songs['valence_category'] = pd.qcut(songs.valence, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

In [None]:
# Check the type of its values using type(<pd.df>.<new column>.values)
type(songs.valence_category.values)
# Display the categories in the new column using <pd.df>.<new column>.cat.categories
songs.valence_category.cat.categories

In [None]:
# Check value_counts() for 'valence_category'
songs['valence_category'].value_counts()

###### Alternative 2 - using `pd.cut()`
Create a new column in the dataframe, e.g. `valence_category`, using the `pd.cut()` function to split the entire range of `songs.valence` values into five subranges with explicitly specified bin edges, `Very Low` to `Very High` (with a generally *unequal* number of elements in each subrange): `songs['valence_category'] = pd.cut(songs.valence, bins=[<bin edges>], labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'], include_lowest=True)`. Passing an integer instead (e.g. `bins=5`) would produce equally *spaced* subranges. Note that there is one more bin edge in `<bin edges>` than there are bins (defined in `labels`).

Note that the ranges of values in the bins are defined as `(...]`. Thus make sure to include `include_lowest=True` in the call to `pd.cut()` to include the lowest value in the first bin (i.e., to get its range as `[...]`, not as `(...]`). The highest value in the last bin is always included.

In [None]:
# Extract mean, median and other values of valence as v_mean, v_median, etc. from songs.describe(), to be used as bin edges
songs.describe().loc[:, 'valence']
_, v_mean, _, v_min, v_q1, v_median, v_q3, v_max = songs.describe().loc[:, 'valence']

In [None]:
# Define the list of bin edges (v_min, v_mean, v_median, etc.)
bin_edges = [v_min, v_q1, v_mean, v_median, v_q3, v_max]
# Define the list of bin labels ('Very Low', 'Low', etc.)
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
# Create 'valence_category' using pd.cut(songs['valence'], ...)
songs['valence_category'] = pd.cut(songs['valence'], bins=bin_edges, labels=labels, include_lowest=True)
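
`pd.cut()` expects the bin edges in increasing order, so the edge list above relies on the mean of `valence` being below its median. A small sketch of a guard, using made-up numbers (`values`, `edges_ok`, `edges_bad` and the helper `is_increasing()` are all hypothetical):

```python
import pandas as pd

values = pd.Series([0.1, 0.4, 0.55, 0.7, 0.95])   # made-up valence-like values

edges_ok = [0.0, 0.3, 0.5, 0.6, 0.8, 1.0]          # increasing: accepted by pd.cut()
edges_bad = [0.0, 0.3, 0.6, 0.5, 0.8, 1.0]         # e.g. mean above median: not increasing


def is_increasing(edges):
    """Return True if every edge is strictly greater than the previous one."""
    return all(a < b for a, b in zip(edges, edges[1:]))


print(is_increasing(edges_ok), is_increasing(edges_bad))

# pd.cut() itself refuses edges that are not in increasing order
try:
    pd.cut(values, bins=edges_bad)
    failed = False
except ValueError as err:
    failed = True
    print('pd.cut() rejected the edges:', err)
```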

In [None]:
# Check value_counts() for 'valence_category'
songs['valence_category'].value_counts()
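
To make the difference between `pd.qcut()` and `pd.cut()` concrete, here is a minimal sketch on made-up values (not the actual `valence` column). It uses `bins=5` in `pd.cut()` to get equal-width bins, which differs from the explicit edge list used in the cell above.

```python
import pandas as pd

values = pd.Series([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.90, 1.00])  # made-up data
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

by_quantile = pd.qcut(values, q=5, labels=labels)   # quantile-based: ~equal number of elements per bin
by_width = pd.cut(values, bins=5, labels=labels)    # equal-width bins over the range of the data

# Count the elements per label, keeping the labels in their logical order
quantile_counts = by_quantile.value_counts().reindex(labels, fill_value=0)
width_counts = by_width.value_counts().reindex(labels, fill_value=0)
print(quantile_counts.tolist())
print(width_counts.tolist())
```

With skewed data like this, the equal-width bins can end up very unevenly filled, or even empty, while the quantile-based bins stay balanced.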

###### Alternative 3 - create `valence` categories manually
For example, split the range of `valence` into five subranges, `Very Low` to `Very High`, according to the following criteria:
- `Very Low` is the *valence* from 0 to the first quartile (`songs.valence.describe()['25%']`)
- `Low` is the *valence* from the first quartile to the mean value (`songs.valence.describe()['mean']`), since the mean value is lower than the median value
- `Medium` is the *valence* from the mean value to the median value (`songs.valence.describe()['50%']`)
- `High` is the *valence* from the median value to the third quartile (`songs.valence.describe()['75%']`)
- `Very High` is the *valence* from the third quartile to 1

In [None]:
# # Extract mean, median and other values of valence as v_mean, v_median, etc. from songs.valence.describe().values
# songs.valence.describe()
# _, v_mean, _, _, v_q1, v_median, v_q3, _ = songs.valence.describe().values

Insert a new column, e.g. `valence_category`, and set it to the default value `Medium`. Then split the range of `valence` into five subranges, `Very Low` to `Very High` (find the `max()` of `valence` first). Each such subrange is actually a boolean mask selecting songs based on the value of `valence` (e.g., `very_low = songs['valence'] <= v_q1`). Then use `<pd.df>.loc[<index of selected observations>, <relevant column>]` to change the default value `Medium` where appropriate (e.g., `songs.loc[very_low, 'valence_category'] = 'Very Low'`).

In [None]:
# # Insert a new column, e.g. valence_category and set it to the default value 'Medium'. 
# # Then split the range of valence into five subranges, 'Very Low' to 'Very High'.
# songs['valence_category'] = 'Medium'
# songs.loc[songs.valence <= v_q1, 'valence_category'] = 'Very Low'
# songs.loc[(songs.valence > v_q1) & (songs.valence <= v_mean), 'valence_category'] = 'Low'
# songs.loc[(songs.valence > v_median) & (songs.valence <= v_q3), 'valence_category'] = 'High'
# songs.loc[songs.valence > v_q3, 'valence_category'] = 'Very High'

<u>Save this version as a new *.csv* file, for possible use in other examples.</u> (`<pd.df>.to_csv('<path>')`)

In [None]:
songs.to_csv('../data/brit_visualization_valence_categories.csv', index=False)
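
When `valence_category` was created with `pd.qcut()` or `pd.cut()`, it has a categorical dtype, which a CSV file does not record. The sketch below illustrates this with a made-up dataframe and an in-memory buffer in place of a file on disk (both are assumptions for the sake of a self-contained example).

```python
import io

import pandas as pd

toy = pd.DataFrame({'valence': [0.1, 0.5, 0.9]})   # made-up data
toy['valence_category'] = pd.Categorical(
    ['Very Low', 'Medium', 'Very High'],
    categories=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
    ordered=True,
)

buffer = io.StringIO()            # in-memory stand-in for a .csv file
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(toy['valence_category'].dtype)        # categorical before saving
print(restored['valence_category'].dtype)   # no longer categorical after reading back
```

After reading such a file back, the categories and their order have to be re-established, e.g. with `pd.Categorical(...)`.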

Rearrange the categories of `valence_category` to make the output natural.
Use `<pd.df>['<column>'] = pd.Categorical(<pd.df>['<column>'], categories=['<cat1>', '<cat2>', ...], ordered=True)`. In this example, order categories from `Very High` to `Very Low`.

In [None]:
songs.valence_category = pd.Categorical(songs.valence_category, categories=['Very High', 'High', 'Medium', 'Low', 'Very Low'], ordered=True)
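
The effect of `ordered=True` and of the order given in `categories` can be seen on a small made-up series (`s` and `levels` are illustrative names): sorting follows the declared category order rather than alphabetical order.

```python
import pandas as pd

levels = ['Very High', 'High', 'Medium', 'Low', 'Very Low']
s = pd.Series(['Low', 'Very High', 'Medium', 'Low'])   # made-up data
s = pd.Series(pd.Categorical(s, categories=levels, ordered=True))

print(s.sort_values().tolist())    # sorted by the declared category order
print(s.cat.categories.tolist())   # all declared categories, including unused ones
```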

Create a suitable pivot table. Use `<pivot table> = <pd.df>.pivot_table(values='<column with values to show on the heatmap>', index='<categorical index>', columns='<column>')`
- `values`: e.g. `obscene` or `tempo`
- `index`: to be shown on y-axis, e.g. `valence_category`
- `columns`: to be shown on x-axis, e.g. `Year`

In [None]:
# pivot_table = songs.pivot_table(values='energy', index='valence_category', columns='Year')
pivot_table = songs.pivot_table(values='tempo', index='valence_category', columns='Year')
pivot_table
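
As a minimal illustration of what `pivot_table()` computes, consider the made-up dataframe below (the values are not from the dataset). Since no `aggfunc` is given, each cell holds the mean of `values` for the corresponding index/column combination, which is the Pandas default.

```python
import pandas as pd

# Hypothetical toy data standing in for the songs dataframe
toy = pd.DataFrame({
    'Year': [1964, 1964, 1964, 1965, 1965],
    'valence_category': ['High', 'High', 'Low', 'High', 'Low'],
    'tempo': [100.0, 120.0, 90.0, 130.0, 80.0],
})

# Rows: valence categories; columns: years; cells: mean tempo
table = toy.pivot_table(values='tempo', index='valence_category', columns='Year')
print(table)
```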

Plot the corresponding heatmap. Based on [this](https://pythonbasics.org/seaborn-heatmap/), [this](https://seaborn.pydata.org/generated/seaborn.heatmap.html), and [this](https://stackoverflow.com/a/29648332/1899061).

It is often a good idea to change the default figure size first, using `sb.set_theme(rc={'figure.figsize': (<x_size>, <y_size>)})`, to avoid clutter on the heatmap (alternatively, use something like `plt.figure(layout='constrained', facecolor='navajowhite', figsize=(5, 3.5))`). Here `rc` refers to Matplotlib's runtime configuration settings (the name follows the Unix 'run commands' convention for configuration files). Experiment with `(<x_size>, <y_size>)`. The values that have worked well in this example: (15.7, 5.27).

Then use `sb.heatmap(data=<pivot table>, annot=True, fmt='<format string>', cmap='<color map>');`
- `data=<pivot table>`: the pivot table created in the previous step
- `annot=True`: annotate heatmap cells with values
- `fmt='<format_string>'`: for example, use `'.0f'` to show int values in annotations, not scientific notation (`'g'` for using mixed int and float annotations)
- `cmap='<color map>'`: color map (see [this](https://10xsoft.org/courses/data-analysis/mastering-data-visualization-with-python/section-4-data-visualization-using-seaborn/colour-palettes-seaborn/)); a good one is `viridis`

To set the title for the heatmap, or to change the axes labels, use (<b>AFTER</b> the call to `sb.heatmap()`!) something like:

`plt.title('<title>', loc='left', color='<color>', alpha=0.4, size=14)`<br>
`plt.xlabel('<xlabel>', size=<font size>, color='<color>')`<br>
`plt.ylabel('<ylabel>', size=<font size>, color='<color>')`<br>
`plt.show()`    # it's a must

In [None]:
# sb.set_theme(rc={'figure.figsize': (15.7, 5.27)})
plt.figure(layout='constrained', facecolor='navajowhite', figsize=(5, 3.5))
sb.heatmap(data=pivot_table, annot=True, fmt='.2f', cmap='viridis');
# plt.title('Heatmap', loc='left', color='red', alpha=0.4, size=14)
# plt.xlabel('Year', size=10)
# plt.ylabel('Valence', size=10)
# plt.xticks(size=6, color='red')
# plt.yticks(size=6, color='red')
# plt.show()

##### A fancier example
Average duration of songs over the years, represented as circles with sizes proportional to the numbers of songs.

In [None]:
# songs = pd.read_csv('../data/brit_visualization_valence_categories.csv')
# songs_by_year = songs.groupby('Year')
# years = np.sort(songs.Year.unique())
# years
# 
# avg_duration = []
# for year in years:
#     avg_duration.append(np.mean(songs_by_year.get_group(year)['Duration']))
# avg_duration = np.array(avg_duration)
# 
# rng = np.random.RandomState(370)
# 
# colors = rng.choice(100, size=len(years), replace=False)                    # random sample, no duplicates
# # display(colors)
# 
# sizes = []
# for year in years:
#     sizes.append(len(songs_by_year.get_group(year)) * 100)                  # sizes proportional to the numbers of songs
# 
# # plt.title('Song duration over the years', fontdict={'size': 20})
# # plt.xlabel('Year')
# # plt.ylabel('Duration')
# # plt.xlim(1963, 1968)
# # plt.ticklabel_format(useOffset=False)
# # plt.scatter(years, avg_duration,
# #             c=colors, s=sizes, alpha=0.3,                                   # alpha: the level of transparency
# #             cmap='Set1')                                                    # cmap: a pre-defined color map
# # plt.colorbar();                                                             # show color scale
# # 
# # # Alternatively, but without showing the colorbar
# # ax = plt.axes()
# # ax.set(xlabel='Year', ylabel='Duration', xlim=(1963, 1968),
# #        title='Song duration over the years')
# # plt.ticklabel_format(useOffset=False)
# # ax.scatter(years, avg_duration,
# #            c=colors, s=sizes, alpha=0.3,                                    # alpha: the level of transparency
# #            cmap='Set1');                                                    # cmap: a pre-defined color map