<font size="+3" color='#053c96'><center><h2> Financial Statement Fraud Detection Notebook</h2></center></font>
<figure>
<center><img src ="https://www.netsuite.com/portal/assets/img/business-articles/accounting-software/bnr-financial-statement-fraud.jpg" width = "750" height = '600' alt="Financial Fraud Detection"/></center>
</figure>

## Authors: 

Date: [Current Date]

<a id='table-of-contents'></a>
[Table of Contents](#table-of-contents)

- [Introduction](#introduction)
  * [Overview](#overview)
  * [Problem Statement](#problem-statement)
  * [Objectives](#goals)
- [Importing Libraries](#importing-dependencies)
- [Data](#data)
- [Exploratory Data Analysis](#exploratory-data-analysis)
  * [Data Exploration](#data-exploration)
  * [Data Visualization](#data-visualization)
  * [Summary Statistics](#summary-statistics)
  * [Feature Correlation](#feature-correlation)
- [Data Preparation](#data-preparation)
  * [Data Cleaning](#data-cleaning)
  * [Feature Engineering](#feature-engineering)
  * [Data Transformation](#data-transformation)
- [Modeling](#modeling)
  * [Model Selection](#model-selection)
  * [Model Training](#model-training)
  * [Model Evaluation](#model-evaluation)
  * [Hyperparameter Tuning](#hyperparameter-tuning)
- [Results](#results)
  * [Analysis Results](#analysis-results)
  * [Model Performance](#model-performance)
  * [Feature Importance](#feature-importance)
  * [Implications](#implications)
- [Conclusion](#conclusion)
  * [Summary](#summary)
  * [Limitations](#limitations)
  * [Recommendations](#recommendations)
- [References](#references)

<a id='introduction'></a>
<font size="+2" color='#053c96'><b> Introduction</b></font>  
[back to top](#table-of-contents)  

This project aims to flag fraudulent financial statements from companies that have been issued a guarantee from an insurance company. The purpose is to ensure the financial information used in the credit risk modelling process is sound. The deliverables include a statistical (or other) model that predicts and flags fraudulent financial statements, an analysis of the predicted fraud across different buckets (e.g. industry, year, financial type), and a fraud indicator (or probability) that will allow the use of an extra variable in the credit risk model and give insight into the appropriateness of the data for modelling purposes. Ultimately, the credit underwriters will be the end-users of the output.

<a id='overview'></a>
<font size="+1" color='#780404'><b> Overview</b></font>  
[back to top](#table-of-contents)

A financial statement is a complete report on the health of a business, covering the balance sheet, income statement, and cash flow statement. It shows whether a business has the cash flow to pay bills, repay loans, and purchase stock. It shows where the business generates cash and where that cash goes. It also indicates profitability and gives an overview of the condition of the business, flagging signs of future problems where they exist. The cost of financial statement fraud is estimated at $572 billion per year in the US. Apart from the direct cost, it has harmed employees and investors and undermined the reliability of corporate financial statements, resulting in higher transaction costs and less efficient markets.

<a id='problem-statement'></a>
<font size="+1" color='#780404'><b> Problem Statement</b></font>  
[back to top](#table-of-contents)  

Financial statement fraud refers to the intentional misrepresentation of financial information in order to deceive stakeholders. South Africa has had a number of high-profile cases of financial statement fraud in recent years. One of the most notable cases involved the retail giant Steinhoff International, which was embroiled in a massive accounting scandal in 2017. The company was accused of inflating its earnings and assets, and its share price plummeted as a result. Steinhoff is still dealing with the aftermath of the scandal, which is considered one of the biggest corporate frauds in South African history.  

Financial statement fraud is not limited to South Africa; it is a global issue that affects companies all over the world, including those in Europe, Africa, Asia and America.  

For this project, the need is to ensure that the information used in the credit risk modelling process is sound by identifying and flagging fraudulent financial statements from companies that have been issued a guarantee from an insurance company. Currently, the methods for identifying these fraudulent financial statements are not reliable, which can lead to flawed credit risk models and financial losses for the insurance company.

<a id='goals'></a>
<font size="+1" color='#780404'><b> Objectives</b></font>  
[back to top](#table-of-contents)  

1. Develop a statistical (or other) model that predicts/flags fraudulent financial statements
2. Produce a detailed analysis of predicted fraud across different buckets (e.g. industry, year, financial type) and its correlation with default
3. Develop a fraud indicator or probability that can be used as an extra variable in the credit risk model to improve the accuracy of credit risk assessments. Having a fraud indicator (or probability) will allow us to:
 - Use it as an extra variable in our credit risk model
 - Give us insight into the appropriateness of the data to be used for modelling purposes
4. Provide insights into the appropriateness of the financial data for credit risk modeling purposes.
5. Provide credit underwriters with accurate and reliable information to make informed decisions about the creditworthiness of companies
6. Help insurance companies to identify potential fraud and reduce their exposure to financial losses.
7. Increase the efficiency of the credit risk modeling process by improving the accuracy of the financial data used in the process.
8. Provide possible solutions and recommendations for improving financial reporting standards and regulations to prevent financial statement fraud.
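
Objective 2 (predicted fraud per bucket and its correlation with default) can be sketched on a toy table. This is a minimal illustration under stated assumptions: the column names `industry`, `year`, `fraud_prob`, and `defaulted`, the 0.5 cut-off, and every value are hypothetical and do not come from the project data.

```python
import pandas as pd

# Hypothetical scored statements: every column name and value is illustrative.
scored = pd.DataFrame({
    "industry":   ["retail", "retail", "mining", "mining", "retail", "mining"],
    "year":       [2019, 2020, 2019, 2020, 2020, 2020],
    "fraud_prob": [0.10, 0.80, 0.20, 0.70, 0.60, 0.30],
    "defaulted":  [0, 1, 0, 1, 1, 0],
})

# Flag statements whose fraud probability exceeds an assumed cut-off.
THRESHOLD = 0.5
scored["fraud_flag"] = (scored["fraud_prob"] > THRESHOLD).astype(int)

# Share of flagged statements and observed default rate per industry bucket.
by_industry = scored.groupby("industry")[["fraud_flag", "defaulted"]].mean()

# Pearson correlation between the fraud probability and the default outcome.
corr_with_default = scored["fraud_prob"].corr(scored["defaulted"])

print(by_industry)
print(f"corr(fraud_prob, defaulted) = {corr_with_default:.2f}")
```

The same `groupby` pattern would apply to `year` or any other bucket; the threshold is a design choice that the underwriters would need to calibrate.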

<a id='importing-dependencies'></a>
<font size="+2" color='#053c96'><b> Importing Libraries</b></font>  
[back to top](#table-of-contents)

In [3]:
import os
import sys
import librosa
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.graph_objs as go
from scipy.stats import skew, kurtosis
from IPython.display import Audio
import numpy as np
import scipy
import pickle
import tarfile
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

<a id='data'></a>
<font size="+2" color='#053c96'><b> Data</b></font>  
[back to top](#table-of-contents)

In [None]:
genres = ['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']
data = []

for genre in genres:
    genre_dir = f'genres/{genre}'
    for filename in os.listdir(genre_dir):
        if filename.endswith('.wav'):
            audio_path = os.path.join(genre_dir, filename)
            y, sr = librosa.load(audio_path)

            # Extract features
            central_moments = np.asarray([np.mean((y - np.mean(y)) ** i) for i in range(1, 5)])
            zero_crossing_rate = librosa.feature.zero_crossing_rate(y)
            rmse = librosa.feature.rms(y=y)
            tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
            spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
            spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
            mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
            chroma = librosa.feature.chroma_stft(y=y, sr=sr)
            spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
            spectral_bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)

            # Add features to dataframe
            features = [genre, filename, beats, sr, central_moments, zero_crossing_rate[0], rmse[0], tempo, spectral_contrast, spectral_rolloff, mfccs, chroma, spectral_centroid, spectral_bandwidth]
            data.append(features)

df = pd.DataFrame(data, columns=['Genre', 'Filename', 'Beats', 'SR', 'Central Moments', 'Zero Crossing Rate', 'RMSE', 'Tempo', 'Spectral Contrast', 'Spectral Roll-off', 'MFCC', 'Chroma', 'Spectral Centroid', 'Spectral Bandwidth'])


In [None]:
df.head()

In [None]:
df['Beats Mean'] = df['Beats'].apply(lambda x: np.mean(x))
df['Central Moments Mean'] = df['Central Moments'].apply(lambda x: np.mean(x))
df['Zero Crossing Rate Mean'] = df['Zero Crossing Rate'].apply(lambda x: np.mean(x))
df['RMSE Mean'] = df['RMSE'].apply(lambda x: np.mean(x))
df['Spectral Contrast Mean'] = df['Spectral Contrast'].apply(lambda x: np.mean(x))
df['Spectral Roll-off Mean'] = df['Spectral Roll-off'].apply(lambda x: np.mean(x))
df['MFCC Mean'] = df['MFCC'].apply(lambda x: np.mean(x))
df['Chroma Mean'] = df['Chroma'].apply(lambda x: np.mean(x))
df['Spectral Centroid Mean'] = df['Spectral Centroid'].apply(lambda x: np.mean(x))
df['Spectral Bandwidth Mean'] = df['Spectral Bandwidth'].apply(lambda x: np.mean(x))

In [None]:
df['Beats Var'] = df['Beats'].apply(lambda x: np.var(x))
df['Central Moments Var'] = df['Central Moments'].apply(lambda x: np.var(x))
df['Zero Crossing Rate Var'] = df['Zero Crossing Rate'].apply(lambda x: np.var(x))
df['RMSE Var'] = df['RMSE'].apply(lambda x: np.var(x))
df['Spectral Contrast Var'] = df['Spectral Contrast'].apply(lambda x: np.var(x))
df['Spectral Roll-off Var'] = df['Spectral Roll-off'].apply(lambda x: np.var(x))
df['MFCC Var'] = df['MFCC'].apply(lambda x: np.var(x))
df['Chroma Var'] = df['Chroma'].apply(lambda x: np.var(x))
df['Spectral Centroid Var'] = df['Spectral Centroid'].apply(lambda x: np.var(x))
df['Spectral Bandwidth Var'] = df['Spectral Bandwidth'].apply(lambda x: np.var(x))

In [None]:
df['Beats Std'] = df['Beats'].apply(lambda x: np.std(x))
df['Central Moments Std'] = df['Central Moments'].apply(lambda x: np.std(x))
df['Zero Crossing Rate Std'] = df['Zero Crossing Rate'].apply(lambda x: np.std(x))
df['RMSE Std'] = df['RMSE'].apply(lambda x: np.std(x))
df['Spectral Contrast Std'] = df['Spectral Contrast'].apply(lambda x: np.std(x))
df['Spectral Roll-off Std'] = df['Spectral Roll-off'].apply(lambda x: np.std(x))
df['MFCC Std'] = df['MFCC'].apply(lambda x: np.std(x))
df['Chroma Std'] = df['Chroma'].apply(lambda x: np.std(x))
df['Spectral Centroid Std'] = df['Spectral Centroid'].apply(lambda x: np.std(x))
df['Spectral Bandwidth Std'] = df['Spectral Bandwidth'].apply(lambda x: np.std(x))
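
The mean, variance, and standard deviation cells above repeat one pattern per column. A possible condensed form is sketched below on a toy frame; the `toy` data is made up, and only two of the array-valued columns are shown.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the extracted features: two array-valued columns.
toy = pd.DataFrame({
    "RMSE": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.4, 0.4])],
    "Chroma": [np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([[0.5, 0.5], [0.5, 0.5]])],
})

# One summary statistic per suffix, applied to every array in the column.
stats = {"Mean": np.mean, "Var": np.var, "Std": np.std}
for col in ["RMSE", "Chroma"]:
    for suffix, func in stats.items():
        toy[f"{col} {suffix}"] = [func(arr) for arr in toy[col]]

print(toy.drop(columns=["RMSE", "Chroma"]))
```

Note that each statistic collapses the whole array, including 2-D features such as chroma, into a single number, which matches what the original cells do.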

In [None]:
df.to_csv('classical_punk.csv')

In [4]:
df = pd.read_csv('classical_punk.csv')

<a id='exploratory-data-analysis'></a>
<font size="+2" color='#053c96'><b> Exploratory Data Analysis</b></font>  
[back to top](#table-of-contents)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,Genre,Filename,Beats,SR,Central Moments,Zero Crossing Rate,RMSE,Tempo,Spectral Contrast,Spectral Roll-off,MFCC,Chroma,Spectral Centroid,Spectral Bandwidth,Beats Mean,Central Moments Mean,Zero Crossing Rate Mean,RMSE Mean,Spectral Contrast Mean,Spectral Roll-off Mean,MFCC Mean,Chroma Mean,Spectral Centroid Mean,Spectral Bandwidth Mean,Beats Var,Central Moments Var,Zero Crossing Rate Var,RMSE Var,Spectral Contrast Var,Spectral Roll-off Var,MFCC Var,Chroma Var,Spectral Centroid Var,Spectral Bandwidth Var,Beats Std,Central Moments Std,Zero Crossing Rate Std,RMSE Std,Spectral Contrast Std,Spectral Roll-off Std,MFCC Std,Chroma Std,Spectral Centroid Std,Spectral Bandwidth Std
0,0,blues,blues.00093.wav,[ 16 66 116 161 206 250 298 347 395 ...,22050,[-2.9325247e-09 6.1441921e-03 -5.6447851e-05 ...,[0.00341797 0.00732422 0.01025391 ... 0.024414...,[0.02833905 0.03485174 0.03935784 ... 0.027093...,58.726918,[[ 8.22429642 14.15588624 22.11961896 ... 37.9...,[[2756.25 1227.39257812 333.76464844 .....,[[-3.1871725e+02 -3.5795795e+02 -4.0606140e+02...,[[0.5608211 0.2739241 0.26868102 ... 0.10808...,[[1210.34232667 737.86866616 395.68889509 .....,[[2037.05487171 1575.0307463 977.8708091 .....,604.0,0.001594,0.021697,0.06586,21.403202,929.300104,-4.54897,0.377695,570.349904,995.505854,124035.769231,7e-06,0.000117,0.001804,29.045168,499972.2,12176.344,0.096389,104947.121639,84704.57758,352.187122,0.00263,0.010838,0.042473,5.389357,707.087129,110.34647,0.310465,323.955432,291.040508
1,1,blues,blues.00087.wav,[ 30 45 60 74 89 104 118 133 148 ...,22050,[4.7446376e-09 2.7690521e-02 3.5093213e-04 3.1...,[0.03320312 0.05224609 0.07421875 ... 0.024902...,[0.1732695 0.21864875 0.25347224 ... 0.109719...,172.265625,[[14.83071459 32.80825402 34.44571594 ... 21.8...,[[3628.34472656 3596.04492188 3391.47949219 .....,[[-7.0359711e+01 -7.6047897e+01 -1.0504224e+02...,[[0.36272326 0.34527805 0.42661187 ... 0.92584...,[[1806.71672337 1832.56502159 1715.97134119 .....,[[2075.07396093 1991.85544429 1852.46504775 .....,603.21519,0.007801,0.050869,0.157941,22.66312,3082.603763,1.747352,0.336902,1441.680807,1870.021373,112486.928377,0.000133,0.001155,0.002722,61.844748,2195439.0,3799.4382,0.094721,387155.803107,146843.002778,335.39071,0.011548,0.033986,0.052177,7.864143,1481.701296,61.639584,0.307767,622.218453,383.200995
2,2,blues,blues.00050.wav,[ 12 31 50 69 88 107 126 145 164 ...,22050,[ 2.6774598e-09 3.7224334e-02 -1.3197740e-03 ...,[0.06591797 0.09716797 0.13037109 ... 0.066894...,[0.06419174 0.07681078 0.08746899 ... 0.177763...,135.999178,[[10.99559361 17.8715263 17.74025947 ... 16.4...,[[3908.27636719 4102.07519531 4242.04101562 .....,[[-144.09424 -133.43872 -146.5851 .....,[[0.15900776 0.09031525 0.06776749 ... 0.55469...,[[2120.53124792 2243.43065575 2335.44978231 .....,[[1981.00376406 2007.1879474 2006.01745657 .....,597.412698,0.010525,0.085791,0.182285,21.129626,4174.460398,7.405733,0.401009,1945.252523,2081.76303,117180.686823,0.000246,0.001219,0.003976,83.527988,808384.1,2153.9006,0.089657,147430.672719,73538.994378,342.316647,0.015674,0.03492,0.063054,9.139365,899.101855,46.410133,0.299428,383.96702,271.180741
3,3,blues,blues.00044.wav,[ 13 33 52 71 89 109 127 146 164 ...,22050,[6.8291079e-09 1.9007759e-02 1.5543754e-05 1.1...,[0.06884766 0.09277344 0.13574219 ... 0.066894...,[0.12154406 0.14153785 0.1580186 ... 0.163750...,135.999178,[[ 8.7134595 12.77827152 17.12739404 ... 16.9...,[[5878.56445312 6416.89453125 6804.4921875 .....,[[-109.827324 -93.47169 -103.923355 ... -...,[[0.28844947 0.11785139 0.06642816 ... 0.54242...,[[2342.55113576 2701.20274619 3162.69526471 .....,[[2658.09856758 2800.11575974 2861.73346312 .....,638.373134,0.005054,0.092525,0.136209,21.729472,5198.686455,5.340542,0.390118,2279.031119,2375.23209,134925.099577,6.5e-05,0.000902,0.000467,88.639974,569824.0,2242.4602,0.086307,170912.654795,52988.857635,367.321521,0.008071,0.030041,0.021618,9.41488,754.866877,47.35462,0.29378,413.415838,230.193088
4,4,blues,blues.00078.wav,[ 15 37 59 81 102 124 146 167 189 ...,22050,[ 2.0866318e-09 6.9366850e-02 -2.6873951e-03 ...,[0.07373047 0.09033203 0.109375 ... 0.097656...,[0.1466363 0.25479728 0.3182426 ... 0.293897...,123.046875,[[10.29093286 12.75817456 12.60773402 ... 18.3...,[[4715.77148438 4037.47558594 3499.14550781 .....,[[-53.54439 -12.221306 19.373835 ... ...,[[0.2582713 0.4232984 0.62160033 ... 0.12042...,[[2267.65521067 2168.234727 1908.12894583 .....,[[2204.3183378 2057.33938691 1962.88245481 .....,649.133333,0.021115,0.12382,0.257993,20.964763,4942.777743,8.593781,0.414227,2333.552482,2227.361074,136273.815556,0.000838,0.000559,0.00273,109.423646,481450.5,1316.1877,0.082961,89801.808329,30628.551895,369.152835,0.028947,0.02364,0.052251,10.460576,693.866331,36.2793,0.288029,299.669498,175.010148


In [None]:
df = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'Beats', 'SR', 'Central Moments', 'Zero Crossing Rate',
                      'RMSE', 'Spectral Contrast', 'Spectral Roll-off',
                      'MFCC', 'Chroma', 'Spectral Centroid', 'Spectral Bandwidth'],
             errors='ignore')  # also drops both saved index columns shown in the output above

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df['Genre'].unique()

<a id='data-exploration'></a>
<font size="+1" color='#780404'><b> Data Exploration</b></font>  
[back to top](#table-of-contents)

This function loads and plays an audio file of a specific genre and number using the librosa library. It takes two arguments, genre and num, which specify the genre of the audio and the number of the audio file within that genre, respectively.

In [None]:
def play_audio(genre, num):
    audio = f'/Users/umarkabir/Documents/Qwasar/Classical Punk/genres/{genre}/{genre}.{num}.wav'
    data, sr = librosa.load(audio)
    return Audio(data, rate=sr)
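
The file location follows the pattern `<base>/<genre>/<genre>.<num>.wav`. A minimal sketch of building that path from a configurable base directory is shown below; `DATA_DIR` and `audio_path` are hypothetical names introduced for illustration, and the relative `genres` folder is an assumption based on the feature-extraction cell.

```python
from pathlib import Path

# Hypothetical base directory; point it at wherever the genre folders live.
DATA_DIR = Path("genres")

def audio_path(genre: str, num: str, base_dir: Path = DATA_DIR) -> Path:
    """Return the path '<base_dir>/<genre>/<genre>.<num>.wav'."""
    return base_dir / genre / f"{genre}.{num}.wav"

path = audio_path("blues", "00024")
print(path.as_posix())
```

A helper like this would let the plotting and playback functions share one path definition instead of repeating a machine-specific absolute path.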

In [None]:
play_audio('blues', '00024')

In [None]:
play_audio('classical', '00024')

In [None]:
play_audio('country', '00024')

In [None]:
play_audio('disco', '00024')

In [None]:
play_audio('hiphop', '00024')

In [None]:
play_audio('jazz', '00024')

In [None]:
play_audio('metal', '00024')

In [None]:
play_audio('pop', '00024')

In [None]:
play_audio('reggae', '00024')

In [None]:
play_audio('rock', '00024')

<a id='data-visualization'></a>
<font size="+1" color='#780404'><b> Data Visualization</b></font>  
[back to top](#table-of-contents)

This code generates a frequency bar chart of the 'Tempo' column in a pandas DataFrame.

In [None]:
if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
	df = df.to_frame(index=False)

df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers

s = df[~pd.isnull(df['Tempo'])]['Tempo']
chart = s.value_counts().to_frame(name='data')  # Series.value_counts replaces the deprecated pd.value_counts
chart.index.name = 'labels'
chart = chart.reset_index().sort_values(['data', 'labels'], ascending=[False, True])
chart = chart[:100]
charts = [go.Bar(x=chart['labels'].values, y=chart['data'].values, name='Frequency')]
figure = go.Figure(data=charts, layout=go.Layout({
    'barmode': 'group',
    'legend': {'orientation': 'h'},
    'title': {'text': 'Tempo Value Counts'},
    'xaxis': {'title': {'text': 'Tempo'}},
    'yaxis': {'title': {'text': 'Frequency'}}
}))

from plotly.offline import iplot, init_notebook_mode
#
init_notebook_mode(connected=True)
for chart in charts:
    chart.pop('id', None) # for some reason iplot does not like 'id'
iplot(figure)
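
The counting step behind the bar chart can be seen on a small example. This is a sketch with made-up tempo values; it only mirrors the drop-missing, count, and sort logic, not the plotting.

```python
import pandas as pd

# Illustrative tempo estimates (BPM); these values are made up for the demo.
tempo = pd.Series([123.0, 135.9, 123.0, None, 99.4, 123.0, 135.9])

# Drop missing values, then count how often each distinct tempo occurs.
counts = tempo.dropna().value_counts()

# Sort by frequency (descending), breaking ties by tempo (ascending).
table = (
    counts.rename_axis("labels")
    .reset_index(name="data")
    .sort_values(["data", "labels"], ascending=[False, True])
    .reset_index(drop=True)
)
print(table)
```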

This code generates a 20-bin histogram of the 'Tempo' column with a kernel density estimate (KDE) overlaid on a secondary axis.

In [None]:
if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
	df = df.to_frame(index=False)

# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required
df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers

s = df[~pd.isnull(df['Tempo'])][['Tempo']]
chart, labels = np.histogram(s['Tempo'], bins=20)
import scipy.stats as sts
kde = sts.gaussian_kde(s['Tempo'])
kde_data = kde.pdf(np.linspace(labels.min(), labels.max()))
# main statistics
stats = df['Tempo'].describe().to_frame().T
charts = [
	go.Bar(x=labels[1:], y=chart, name='Histogram'),
	go.Scatter(
		x=list(range(len(kde_data))), y=kde_data, name='KDE',		yaxis='y2', xaxis='x2',		line={'shape': 'spline', 'smoothing': 0.3}, mode='lines'
	)
]
figure = go.Figure(data=charts, layout=go.Layout({
    'barmode': 'group',
    'legend': {'orientation': 'h'},
    'title': {'text': 'Tempo Histogram (bins: 20) w/ KDE'},
    'xaxis2': {'anchor': 'y', 'overlaying': 'x', 'side': 'top'},
    'yaxis': {'side': 'left', 'title': {'text': 'Frequency'}},
    'yaxis2': {'overlaying': 'y', 'side': 'right', 'title': {'text': 'KDE'}}
}))

# If you're having trouble viewing your chart in your notebook try passing your 'chart' into this snippet:
#
from plotly.offline import iplot, init_notebook_mode
#
init_notebook_mode(connected=True)
for chart in charts:
    chart.pop('id', None) # for some reason iplot does not like 'id'
iplot(figure)
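
`np.histogram` returns one more bin edge than it returns counts, which is why the cell above plots the bars against `labels[1:]` (the right edges). A small sketch with made-up values shows the relationship, and how bin centers could be derived as an alternative x-position.

```python
import numpy as np

# Illustrative tempo values (BPM), not taken from the dataset.
values = np.array([60.0, 80.0, 90.0, 100.0, 100.0, 120.0, 140.0, 160.0])

counts, edges = np.histogram(values, bins=4)

# Midpoint of each bin, computed from its left and right edge.
centers = (edges[:-1] + edges[1:]) / 2

print(counts, edges, centers)
```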

This code generates a 20-bin histogram of the 'Central Moments Mean' column with a kernel density estimate (KDE) overlaid on a secondary axis.

In [None]:
if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
	df = df.to_frame(index=False)

# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required
df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers

s = df[~pd.isnull(df['Central Moments Mean'])][['Central Moments Mean']]
chart, labels = np.histogram(s['Central Moments Mean'], bins=20)
import scipy.stats as sts
kde = sts.gaussian_kde(s['Central Moments Mean'])
kde_data = kde.pdf(np.linspace(labels.min(), labels.max()))
# main statistics
stats = df['Central Moments Mean'].describe().to_frame().T
charts = [
	go.Bar(x=labels[1:], y=chart, name='Histogram'),
	go.Scatter(
		x=list(range(len(kde_data))), y=kde_data, name='KDE',		yaxis='y2', xaxis='x2',		line={'shape': 'spline', 'smoothing': 0.3}, mode='lines'
	)
]
figure = go.Figure(data=charts, layout=go.Layout({
    'barmode': 'group',
    'legend': {'orientation': 'h'},
    'title': {'text': 'Central Moments Mean Histogram (bins: 20) w/ KDE'},
    'xaxis2': {'anchor': 'y', 'overlaying': 'x', 'side': 'top'},
    'yaxis': {'side': 'left', 'title': {'text': 'Frequency'}},
    'yaxis2': {'overlaying': 'y', 'side': 'right', 'title': {'text': 'KDE'}}
}))

# If you're having trouble viewing your chart in your notebook try passing your 'chart' into this snippet:
#
from plotly.offline import iplot, init_notebook_mode
#
init_notebook_mode(connected=True)
for chart in charts:
    chart.pop('id', None) # for some reason iplot does not like 'id'
iplot(figure)

This code plots the mean of 'Central Moments Mean' for each 'Filename' (first 100 files) as bars, with the row count per file drawn as a line on a secondary axis.

In [None]:
if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
	df = df.to_frame(index=False)

# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required
df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers

chart = df.groupby(['Filename'], dropna=False)[['Central Moments Mean']].agg(['count', 'mean'])
chart.columns = chart.columns.droplevel(0)
chart.columns = ["count", "data"]
chart.index.name = 'labels'
chart = chart.reset_index()
chart = chart[:100]
charts = [
	go.Bar(x=chart['labels'].values, y=chart['data'].values),
	go.Scatter(
		x=chart['labels'].values, y=chart['count'].values, yaxis='y2',
		name='Frequency', line={'shape': 'spline', 'smoothing': 0.3}, mode='lines'
	)
]
figure = go.Figure(data=charts, layout=go.Layout({
    'barmode': 'group',
    'legend': {'orientation': 'h'},
    'title': {'text': 'Central Moments Mean(mean) Categorized by Filename'},
    'xaxis': {'title': {'text': 'Filename'}},
    'yaxis': {'side': 'left', 'title': {'text': 'Central Moments Mean (mean)'}},
    'yaxis2': {'overlaying': 'y', 'side': 'right', 'title': {'text': 'Frequency'}}
}))

# If you're having trouble viewing your chart in your notebook try passing your 'chart' into this snippet:
#
from plotly.offline import iplot, init_notebook_mode
#
init_notebook_mode(connected=True)
for chart in charts:
    chart.pop('id', None) # for some reason iplot does not like 'id'
iplot(figure)

In [None]:
def show_waveform(x, num):
    # Load WAV file
    wav_file = f'/Users/umarkabir/Documents/Qwasar/Classical Punk/genres/{x}/{x}.{num}.wav'
    y, sr = librosa.load(wav_file)

    # Create x-axis values in seconds (one time stamp per audio sample)
    time = np.arange(len(y)) / sr

    # Create figure
    fig = go.Figure()

    # Add waveform trace
    fig.add_trace(go.Scatter(x=time, y=y, mode='lines'))

    # Set layout
    fig.update_layout(
        title=f'Sample waveform for {x}',
        xaxis_title='Time (s)',
        yaxis_title='Amplitude',
    )

    # Show figure
    fig.show()

This function loads a WAV file of a specific genre and number, and generates a sample waveform plot using the librosa and plotly libraries. It takes two arguments, x and num, which specify the genre of the audio and the number of the audio file within that genre, respectively.
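
The time axis of a waveform follows from the sampling rate: sample `n` occurs at `n / sr` seconds. The sketch below illustrates this for a hypothetical 3-second clip and also shows an optional decimation step, since plotting every sample of a long clip can be slow; the `step` value is an arbitrary assumption.

```python
import numpy as np

sr = 22050                         # librosa.load's default sampling rate
n_samples = 3 * sr                 # a hypothetical 3-second clip
time = np.arange(n_samples) / sr   # seconds elapsed at each sample

# Optional: keep every `step`-th sample so the plot stays responsive.
step = 100
time_ds = time[::step]

print(time[-1], len(time_ds))
```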

In [None]:
show_waveform('blues', '00090')

In [None]:
show_waveform('classical', '00090')

In [None]:
show_waveform('country', '00090')

In [None]:
show_waveform('disco', '00090')

In [None]:
show_waveform('hiphop', '00090')

In [None]:
show_waveform('jazz', '00090')

In [None]:
show_waveform('metal', '00090')

In [None]:
show_waveform('pop', '00090')

In [None]:
show_waveform('reggae', '00090')

In [None]:
show_waveform('rock', '00090')

In [None]:
def show_spectogram(x, num):
    # Load audio file
    audio_path = f'/Users/umarkabir/Documents/Qwasar/Classical Punk/genres/{x}/{x}.{num}.wav'
    y, sr = librosa.load(audio_path)

    # Calculate spectrogram
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)

    # Convert to decibels
    S_dB = librosa.power_to_db(S, ref=np.max)

    # Create figure
    fig = go.Figure()

    # Add heatmap trace (x in seconds, y at the mel band center frequencies)
    fig.add_trace(go.Heatmap(
        z=S_dB,
        x=librosa.frames_to_time(np.arange(S_dB.shape[1]), sr=sr),
        y=librosa.mel_frequencies(n_mels=128, fmin=0, fmax=8000),
        colorscale='Viridis'
    ))

    # Set x and y axis labels
    fig.update_xaxes(title_text='Time (s)')
    fig.update_yaxes(title_text='Frequency (Hz)')

    # Set figure title
    fig.update_layout(title_text=f'Sample spectrogram for {x}')

    # Show figure
    fig.show()


This function loads a WAV file of a specific genre and number, and generates a sample spectrogram plot using the librosa and plotly libraries. It takes two arguments, x and num, which specify the genre of the audio and the number of the audio file within that genre, respectively.
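
The decibel conversion used for the spectrogram can be sketched in plain NumPy. This is an approximation of what `librosa.power_to_db(S, ref=np.max)` is understood to do with its default settings (a floor of `1e-10` and an 80 dB dynamic range); `power_to_db_sketch` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels relative to its maximum."""
    S = np.asarray(S, dtype=float)
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, S.max()))
    # Clip everything more than `top_db` below the peak.
    return np.maximum(log_spec, log_spec.max() - top_db)

# Tiny made-up power spectrogram spanning twelve orders of magnitude.
S = np.array([[1.0, 0.1], [0.01, 1e-12]])
S_dB = power_to_db_sketch(S)
print(S_dB)
```

Each factor of ten in power corresponds to 10 dB, and the quietest cell is clipped at the bottom of the dynamic range rather than shown at its true level.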

In [None]:
show_spectogram('blues', '00090')

In [None]:
show_spectogram('classical', '00090')

In [None]:
show_spectogram('country', '00090')

In [None]:
show_spectogram('disco', '00090')

In [None]:
show_spectogram('hiphop', '00090')

In [None]:
show_spectogram('jazz', '00090')

In [None]:
show_spectogram('metal', '00090')

In [None]:
show_spectogram('pop', '00090')

In [None]:
show_spectogram('reggae', '00090')

In [None]:
show_spectogram('rock', '00090')

In [None]:
def show_sr(x, num):
    # Load audio file
    audio_path = f'/Users/umarkabir/Documents/Qwasar/Classical Punk/genres/{x}/{x}.{num}.wav'
    y, sr = librosa.load(audio_path)

    # Compute spectral rolloff
    spectral_rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]

    # Create plotly trace
    trace = go.Scatter(
        x=[i for i in range(len(spectral_rolloff))],
        y=spectral_rolloff,
        mode='lines',
        name='Spectral Rolloff'
    )

    # Create plot layout
    layout = go.Layout(
        title=f'Sample spectral rolloff for {x}',
        xaxis=dict(title='Frame'),
        yaxis=dict(title='Frequency (Hz)')
    )

    # Create plot
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()


The `show_sr(x, num)` function loads an audio file and computes its spectral rolloff, the frequency below which a fixed share of each frame's spectral energy lies (85% by librosa's default). It then draws a Plotly line plot of the rolloff values, with the frame index on the x-axis and frequency in Hz on the y-axis. The plot title is "Sample spectral rolloff for x", where x is the genre.
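
To make the rolloff idea concrete, the sketch below computes it for a single synthetic frame with NumPy. It is a simplified illustration, not librosa's implementation: `rolloff_frequency` is a hypothetical helper, the signal is two made-up tones, and no windowing or framing is applied.

```python
import numpy as np

def rolloff_frequency(frame, sr, roll_percent=0.85):
    """Frequency below which `roll_percent` of the frame's spectral magnitude lies."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cumulative = np.cumsum(magnitude)
    threshold = roll_percent * cumulative[-1]
    # First bin whose cumulative magnitude reaches the threshold.
    idx = np.searchsorted(cumulative, threshold)
    return freqs[idx]

sr = 8000
t = np.arange(1024) / sr
# Two tones: a strong one at 500 Hz and a weaker one at 2000 Hz.
frame = np.sin(2 * np.pi * 500 * t) + 0.1 * np.sin(2 * np.pi * 2000 * t)
print(rolloff_frequency(frame, sr))
```

Because the 500 Hz tone carries most of the magnitude, the 85% rolloff sits at that tone; raising `roll_percent` toward 1 pushes the result up to the weaker 2000 Hz component.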

In [None]:
show_sr('blues', '00090')

In [None]:
show_sr('classical', '00090')

In [None]:
show_sr('country', '00090')

In [None]:
show_sr('disco', '00090')

In [None]:
show_sr('hiphop', '00090')

In [None]:
show_sr('jazz', '00090')

In [None]:
show_sr('metal', '00090')

In [None]:
show_sr('pop', '00090')

In [None]:
show_sr('reggae', '00090')

In [None]:
show_sr('rock', '00090')

In [None]:
def show_chroma(x, num):
    # Load audio file
    audio_path = f'/Users/umarkabir/Documents/Qwasar/Classical Punk/genres/{x}/{x}.{num}.wav'
    y, sr = librosa.load(audio_path)

    # Compute chroma feature
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    # Create time axis in seconds
    time = librosa.frames_to_time(np.arange(chroma.shape[1]), sr=sr)

    # Create chroma note names
    chroma_note_names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

    # Create heatmap trace
    trace = go.Heatmap(
        x=time,
        y=chroma_note_names,
        z=chroma,
        colorscale='Viridis',
    )

    # Set layout
    layout = go.Layout(
        title=f'Sample chroma feature for {x}',
        xaxis=dict(title='Time (s)'),
        yaxis=dict(title='Chroma Note')
    )

    # Create figure and plot
    fig = go.Figure(data=[trace], layout=layout)
    fig.show()


This function takes in two arguments x and num representing the music genre and the song number, respectively. It then loads the corresponding audio file and computes the chroma feature using librosa's chroma_stft function. It creates a time axis in seconds using frames_to_time function, and chroma note names as a list. It then creates a heatmap trace using go.Heatmap with time as the x-axis, chroma_note_names as the y-axis, and chroma as the z-axis. Finally, it sets the layout with appropriate x and y axis titles, and a title for the figure. It shows the resulting figure using fig.show().
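
Chroma features fold spectral energy into the 12 pitch classes listed in `chroma_note_names`. The mapping from a frequency to its pitch class can be sketched as follows; this shows only the underlying equal-temperament arithmetic (A4 = 440 Hz assumed), not the filter bank librosa uses, and `pitch_class` is a hypothetical helper.

```python
import math

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> str:
    """Map a frequency to the nearest equal-tempered pitch class name."""
    # MIDI note number: 69 is A4, and each semitone is a factor of 2**(1/12).
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[midi % 12]

for f in (261.63, 329.63, 440.0, 880.0):
    print(f, pitch_class(f))
```

Octaves collapse onto the same class, which is why 440 Hz and 880 Hz land on the same row of a chroma heatmap.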

In [None]:
show_chroma('blues', '00090')

In [None]:
show_chroma('classical', '00090')

In [None]:
show_chroma('country', '00090')

In [None]:
show_chroma('disco', '00090')

In [None]:
show_chroma('hiphop', '00090')

In [None]:
show_chroma('jazz', '00090')

In [None]:
show_chroma('metal', '00090')

In [None]:
show_chroma('pop', '00090')

In [None]:
show_chroma('reggae', '00090')

In [None]:
show_chroma('rock', '00090')

In [None]:
def show_zcr(x, num):
    # Load audio file
    audio_path = f'/Users/umarkabir/Documents/Qwasar/Classical Punk/genres/{x}/{x}.{num}.wav'
    y, sr = librosa.load(audio_path)

    # Compute zero crossing rate
    zcr = librosa.feature.zero_crossing_rate(y)

    # Create Plotly figure
    fig = go.Figure(data=go.Scatter(x=librosa.times_like(zcr, sr=sr), y=zcr[0]))
    fig.update_layout(title=f'Zero Crossing Rate for {x} Genre', xaxis_title='Time (s)', yaxis_title='ZCR')
    fig.show()

The show_zcr function takes in two arguments: x, which represents the genre of the music file, and num, which represents the number of the music file. It calculates the zero-crossing rate of the audio file and plots it using Plotly. 
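
The zero-crossing rate is the fraction of adjacent samples whose signs differ. The NumPy sketch below computes it for one synthetic frame; it is a simplified illustration (librosa computes the rate per short frame), and `zero_crossing_rate` here is a hypothetical helper applied to a made-up 441 Hz tone.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

sr = 22050
t = np.arange(sr) / sr                      # one second of samples
tone = np.sin(2 * np.pi * 441 * t + 0.1)    # 441 Hz tone with a small phase offset
zcr = zero_crossing_rate(tone)
print(zcr, 2 * 441 / sr)
```

A pure tone crosses zero twice per period, so its rate is close to `2 * f / sr`; noisier or brighter signals such as distorted guitars tend to score higher.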

In [None]:
show_zcr('blues', '00090')

In [None]:
show_zcr('classical', '00090')

In [None]:
show_zcr('country', '00090')

In [None]:
show_zcr('disco', '00090')

In [None]:
show_zcr('hiphop', '00090')

In [None]:
show_zcr('jazz', '00090')

In [None]:
show_zcr('pop', '00090')

In [None]:
show_zcr('reggae', '00090')

In [None]:
show_zcr('rock', '00090')

<a id='summary-statistics'></a>
<font size="+1" color='#780404'><b> Summary Statistics</b></font>  
[back to top](#table-of-contents)

In [None]:
df.describe(include='all')

In [None]:
import matplotlib.pyplot as plt
def skew_kurt(data, col):
    # Calculate skewness and kurtosis of the selected column
    _skewness = skew(data[col])
    _kurtosis = kurtosis(data[col])

    # Create histogram of the selected column with mean, median, and mode
    sns.histplot(data=data, x=col, kde=True)
    plt.axvline(data[col].mean(), color='r', linestyle='--', label='Mean')
    plt.axvline(data[col].median(), color='g', linestyle='--', label='Median')
    plt.axvline(data[col].mode()[0], color='b', linestyle='--', label='Mode')
    plt.legend()

    # Add text annotation for skewness and kurtosis values
    plt.annotate('Skewness: {:.2f}'.format(_skewness), xy=(0.5, 0.9), xycoords='axes fraction')
    plt.annotate('Kurtosis: {:.2f}'.format(_kurtosis), xy=(0.5, 0.85), xycoords='axes fraction')

    plt.show()
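
Skewness and kurtosis are ratios of central moments. The sketch below computes them directly with NumPy; to my understanding these formulas correspond to the default (biased, excess-kurtosis) settings of `scipy.stats.skew` and `scipy.stats.kurtosis`, and the sample values are made up.

```python
import numpy as np

def central_moment(x, k):
    """k-th central moment: mean of (x - mean(x)) ** k."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** k)

def skewness(x):
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(x):
    # Subtracting 3 makes a normal distribution score 0.
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

A symmetric sample has zero skewness, while a long right tail gives a positive value, which is the pattern to look for in the per-genre tempo plots.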

In [None]:
!pip install --upgrade matplotlib

In [None]:
skew_kurt(df[df['Genre'] == 'blues'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'classical'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'country'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'disco'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'hiphop'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'jazz'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'metal'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'pop'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'reggae'], 'Tempo')

In [None]:
skew_kurt(df[df['Genre'] == 'rock'], 'Tempo')

<a id='feature-correlation'></a>
<font size="+1" color='#780404'><b> Feature Correlation</b></font>  
[back to top](#table-of-contents)

In [None]:
df_corr = df.corr(numeric_only=True)  # skip non-numeric columns such as Genre and Filename

In [None]:
# Compute correlation matrix

# Set figure size and font sizes
fig, ax = plt.subplots(figsize=(50, 50))
sns.set(font_scale=1.9)

# Plot heatmap with adjusted color map
sns.heatmap(df_corr, cmap='coolwarm', annot=True, center=0, square=True)

# Adjust font size of features
ax.set_xticklabels(ax.get_xticklabels(), fontsize=35)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=35)

# Add title and axis labels
plt.title('Correlation Matrix', fontsize=30)
plt.xlabel('Features', fontsize=20)
plt.ylabel('Features', fontsize=20)

# Show plot
plt.show()
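
A large annotated heatmap can be hard to read, so it may help to list only the strongly correlated feature pairs. The sketch below does this on a toy frame; the column names echo the notebook but the values and the 0.9 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric features; column names echo the notebook, values are made up.
toy = pd.DataFrame({
    "RMSE Var": [0.001, 0.004, 0.009, 0.016],
    "RMSE Std": [0.0316, 0.0632, 0.0949, 0.1265],
    "Tempo":    [120.0, 95.0, 130.0, 100.0],
})

corr = toy.corr(numeric_only=True)

# Keep the upper triangle only so each pair appears once, then filter.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

Variance and standard deviation columns of the same feature are expected to show up here, since one is a monotonic function of the other; such pairs are candidates for dropping one member before modeling.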

<a id='data-preparation'></a>
<font size="+2" color='#053c96'><b> Data Preparation</b></font>  
[back to top](#table-of-contents)

<a id='data-cleaning'></a>
<font size="+1" color='#780404'><b> Data Cleaning</b></font>  
[back to top](#table-of-contents)

<a id='feature-engineering'></a>
<font size="+1" color='#780404'><b> Feature Engineering</b></font>  
[back to top](#table-of-contents)

<a id='data-transformation'></a>
<font size="+1" color='#780404'><b> Data Transformation</b></font>  
[back to top](#table-of-contents)

<a id='modeling'></a>

<font size="+2" color='#053c96'><b> Modeling</b></font>  
[back to top](#table-of-contents)

<a id='model-selection'></a>

<font size="+1" color='#780404'><b> Model Selection</b></font>  
[back to top](#table-of-contents)

<a id='model-training'></a>

<font size="+1" color='#780404'><b> Model Training</b></font>  
[back to top](#table-of-contents)

<a id='model-evaluation'></a>

<font size="+1" color='#780404'><b> Model Evaluation</b></font>  
[back to top](#table-of-contents)

<a id='hyperparameter-tuning'></a>
<font size="+1" color='#780404'><b> Hyperparameter Tuning</b></font>  
[back to top](#table-of-contents)

<a id='results'></a>
<font size="+2" color='#053c96'><b> Results</b></font>  
[back to top](#table-of-contents)

<a id='analysis-results'></a>

<font size="+1" color='#780404'><b> Analysis Results</b></font>  
[back to top](#table-of-contents)

<a id='model-performance'></a>

<font size="+1" color='#780404'><b> Model Performance</b></font>  
[back to top](#table-of-contents)

<a id='feature-importance'></a>

<font size="+1" color='#780404'><b> Feature Importance</b></font>  
[back to top](#table-of-contents)

<a id='implications'></a>

<font size="+1" color='#780404'><b> Implications</b></font>  
[back to top](#table-of-contents)

<a id='conclusion'></a>

<font size="+2" color='#053c96'><b> Conclusion</b></font>  
[back to top](#table-of-contents)

<a id='summary'></a>

<font size="+1" color='#780404'><b> Summary</b></font>  
[back to top](#table-of-contents)

<a id='limitations'></a>

<font size="+1" color='#780404'><b> Limitations</b></font>  
[back to top](#table-of-contents)

<a id='recommendations'></a>

<font size="+1" color='#780404'><b> Recommendations</b></font>  
[back to top](#table-of-contents)

<a id='references'></a>

<font size="+2" color='#053c96'><b> References</b></font>  
[back to top](#table-of-contents)