# Intro to Data Science

[Gina Sprint](https://ginasprint.com/)

## Project (60 pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Utilize the Pandas library to
    * Load data from a CSV into a `DataFrame`
    * Work with `DataFrame` and `Series`
    * Join `DataFrame`s
* Clean/prepare data
* Use Matplotlib to visualize data

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [The Spotify Hit Predictor Dataset](https://www.kaggle.com/datasets/theoverman/the-spotify-hit-predictor-dataset) from [Kaggle](https://www.kaggle.com/)

## Github Repository Setup
For the project, you will use GitHub to create a code repository (if private, add gsprint23 as a collaborator under your repo Settings) to submit your source/data files and to track code changes. To turn in your project, submit your GitHub repository's URL to the Moodle project portal.

Note: I highly recommend committing/pushing regularly so your work is always backed up.

## Project Overview
We are going to take a look at a real dataset of songs from the 2010s decade (2010-2019). This dataset contains interesting insights about what makes a song a "hit" (i.e. mainstream successful) or a "flop" (i.e. not mainstream successful). For example, "Shape of You" by Ed Sheeran is a hit, while "The Silence Thereafter" by Craft is a flop (I haven't even heard of that song...). We are going to work with this data in the following ways:
1. Part 1 Setting up the Github repository
1. Part 2 Getting Started with Python
1. Part 3 Working with the Data using Pandas
1. Part 4 Performing Exploratory Data Analysis
     
Dataset source: https://www.kaggle.com/datasets/theoverman/the-spotify-hit-predictor-dataset

## Part 1 Setting up the Github Repository (10 pts)
### Create a Local Repository
Create a folder for this project and then make a README.md. The README.md file should contain at least your name, course information, and a description of the project. Try to use your Markdown skills to make this look nice.

### Download the Data
Download songs-of-10s.csv and genres-of-10s.csv from the files directory on Github: https://github.com/gsprint23/ZIME-Intro-to-Data-Science/tree/master/files. One way to download a file is to click "Raw" then right click on the page and click "Save As." Move these files into the same folder as your local project Git repo. 

songs-of-10s.csv contains attributes of 6,000+ songs from the 2010s decade with detailed attributes about each song downloaded from the [Spotify Web API](https://developer.spotify.com/documentation/web-api/). Here is a sample of the format of the data in songs-of-10s.csv:

|track|artist|uri|danceability|energy|key|...|hit/flop|
|-|-|-|-|-|-|-|-|
|Wild Things|Alessia Cara|`spotify:track:2ZyuwVvV6Z3XJaXIFbspeE`|0.741|0.626|1|...|hit|
|Surfboard|Esquivel!|`spotify:track:61APOtq25SCMuK0V5w2Kgp`|0.447|0.247|5|...|flop|
|...|...|...|...|...|...|...|...|

Here is a description of each attribute in songs-of-10s.csv from the [dataset source](https://www.kaggle.com/datasets/theoverman/the-spotify-hit-predictor-dataset):
- track: The Name of the track.
- artist: The Name of the Artist.
- uri: The resource identifier for the track.
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. 
- energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. 
- key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. 
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. 
- duration_ms:  The duration of the track in milliseconds.
- time_signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
- chorus_hit: This is the author's best estimate of when the chorus would start for the track. It's the timestamp of the start of the third section of the track. This feature was extracted from the data received by the API call for Audio Analysis of that particular track.
- sections: The number of sections the particular track has. This feature was extracted from the data received by the API call for Audio Analysis of that particular track.
- hit/flop: Whether the track was a "hit" or "flop". "hit" implies that this song was featured in the weekly list (issued by Billboard) of Hot-100 tracks in that decade at least once and is therefore a 'hit'. "flop" implies that the track is a 'flop'. The author's conditions for a track being a 'flop' are as follows:
    - The track must not appear in the 'hit' list of that decade.
    - The track's artist must not appear in the 'hit' list of that decade.
    - The track must belong to a genre that could be considered non-mainstream and / or avant-garde. 
    - The track's genre must not have a song in the 'hit' list.
    - The track must have 'US' as one of its markets.
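
The integer encodings for `key` and `mode` described above can be turned into readable labels with a small lookup. This is only a sketch: the helper names `key_name` and `mode_name` are made up for illustration and are not part of the dataset or the Spotify Web API.

```python
# Pitch Class notation: index 0 is C, and each step up is one semitone.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_name(key):
    """Translate the dataset's integer key into a pitch name (hypothetical helper)."""
    if key == -1:
        return "no key detected"
    if not 0 <= key < len(PITCH_CLASSES):
        raise ValueError(f"unexpected key value: {key}")
    return PITCH_CLASSES[key]


def mode_name(mode):
    """Translate the dataset's mode flag: 1 is major, 0 is minor (hypothetical helper)."""
    return "major" if mode == 1 else "minor"


# The two sample rows shown earlier have key values 1 and 5.
print(key_name(1))
print(key_name(5))
```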

genres-of-10s.csv is a file I created for this assignment. It contains the most likely genre for 3,000+ artists from the 2010s decade downloaded from the [Spotify Web API](https://developer.spotify.com/documentation/web-api/). Here is a sample of the format of the data in genres-of-10s.csv:

|artist|genres|
|-|-|
|Witchtrap|black thrash|
|Izïa|french indie pop|
|...|...|

Note that artists typically are associated with multiple genres. We will need to figure out what to do with this later! :)

### Make a Commit and Push
Make a commit with your three files (README.md and 2 CSV data files). Make a Github repository for your project and push your commit to it. Refresh your Github project repository page to make sure the three files made it up there. Congrats, you are done with part 1!!

## Part 2 Getting Started with Python (10 pts)
Note: for this part, do not use the Pandas library.

### Read the Data from File
Write code to read the songs-of-10s.csv file into a 2D Python list object. The header row is the first row in the file and should be stored separately in a 1D Python list object.
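
One possible way to approach this uses the standard library's `csv` module. The sketch below reads from a tiny in-memory stand-in (a few columns copied from the sample table above) so it runs without the real file; the function name `read_table` is just an illustrative choice.

```python
import csv
import io

# In-memory stand-in for songs-of-10s.csv with only a few of the real columns.
SAMPLE_CSV = """track,artist,danceability,energy,key,hit/flop
Wild Things,Alessia Cara,0.741,0.626,1,hit
Surfboard,Esquivel!,0.447,0.247,5,flop
"""


def read_table(infile):
    """Return (header, table) from an open CSV file object."""
    reader = csv.reader(infile)
    header = next(reader)             # the first row is the header (1D list)
    table = [row for row in reader]   # the remaining rows (2D list of strings)
    return header, table


# With the real file you would pass something like
# open("songs-of-10s.csv", newline="") instead of the StringIO object.
header, table = read_table(io.StringIO(SAMPLE_CSV))
print(header)
print(table[0])
```

Note that `csv.reader` returns every value as a string, so numeric columns would still need converting if you want to do arithmetic on them.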

### Get Familiar with the Data
Write code to perform the following:
1. Print out the header
1. Print out how many columns are in the file. Is your number correct?
1. Print out how many rows are in the file. Is your number correct?
1. Prompt the user for a number of rows to print out. Print out that many rows. Are these the correct rows?
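
The steps above could be sketched as follows, assuming `header` and `table` are the 1D and 2D lists from the previous step (here they are hard-coded with a few sample values). The helper `first_rows` is a hypothetical name, and the user prompt is shown only as a comment so the snippet runs without waiting for input.

```python
# Stand-ins for the header and table lists read from the file.
header = ["track", "artist", "hit/flop"]
table = [
    ["Wild Things", "Alessia Cara", "hit"],
    ["Surfboard", "Esquivel!", "flop"],
]


def first_rows(table, n):
    """Return the first n rows, clamping n to the range 0..len(table)."""
    n = max(0, min(n, len(table)))
    return table[:n]


print(header)
print("columns:", len(header))
print("rows:", len(table))

# In the real script, n would come from the user, for example:
#   n = int(input("How many rows would you like to print? "))
n = 1
for row in first_rows(table, n):
    print(row)
```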

## Part 3 Working with the Data using Pandas (20 pts)
### Read and Join the Two Files
Write code to perform the following:
1. Read each csv file into a Pandas `DataFrame` object. The header row is the first row in each of the files.
1. Decide what index to use for each of the `DataFrame`s
1. Print out the first few rows of each `DataFrame` to confirm your index and your columns are set up properly
1. Outer join the two `DataFrame`s appropriately to make one `DataFrame`
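
As a rough sketch of the read-and-join steps, the snippet below uses two small in-memory CSV strings in place of the real files. The genre value for Alessia Cara is a placeholder made up for illustration, and the default integer index is kept on purpose: choosing a better index is still up to you.

```python
import io

import pandas as pd

# In-memory stand-ins for the two CSV files (only a few columns and rows).
SONGS_CSV = """track,artist,danceability,hit/flop
Wild Things,Alessia Cara,0.741,hit
Surfboard,Esquivel!,0.447,flop
"""
GENRES_CSV = """artist,genres
Witchtrap,black thrash
Alessia Cara,placeholder genre
"""

# With the real data these would be pd.read_csv("songs-of-10s.csv") and so on.
songs_df = pd.read_csv(io.StringIO(SONGS_CSV))
genres_df = pd.read_csv(io.StringIO(GENRES_CSV))

print(songs_df.head())
print(genres_df.head())

# An outer join keeps artists that appear in only one of the two tables;
# the columns coming from the other table are filled with NaN for those rows.
joined = songs_df.merge(genres_df, on="artist", how="outer")
print(joined)
```

Joining on the `artist` column is one reasonable choice because it is the only column the two files share.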

### Clean/Prepare the Data
Write code to answer the following questions/perform the following tasks for your joined `DataFrame`:
1. How many instances are there?
    * Does this match your number of rows from part 2?
1. How many attributes are there? Which ones are categorical? Which ones are numeric? 
    * Does this match your number of columns from part 2?
1. Are there any missing values? If so how many and in which column(s)?
    *  Remove instances with missing values
1. How many hits are there and how many flops are there?
1. What are the top 5 artists with the most songs in the dataset?
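
The questions above map onto a handful of Pandas calls. The following is a minimal sketch on a hand-made `DataFrame` with deliberately missing values; the data and the variable names are illustrative assumptions, not the real joined table.

```python
import pandas as pd

# Hand-made stand-in for the joined DataFrame, with some missing values.
joined = pd.DataFrame({
    "track": ["Wild Things", "Surfboard", None],
    "artist": ["Alessia Cara", "Esquivel!", "Witchtrap"],
    "danceability": [0.741, 0.447, None],
    "hit/flop": ["hit", "flop", None],
    "genres": ["placeholder genre", None, "black thrash"],
})

# Instances are rows and attributes are columns.
n_instances, n_attributes = joined.shape
print(n_instances, n_attributes)

# Numeric attributes; the remaining columns are candidates for categorical.
numeric_cols = joined.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Count of missing values in each column.
missing_per_column = joined.isnull().sum()
print(missing_per_column)

# Drop every instance that has at least one missing value.
cleaned = joined.dropna()

# Hits versus flops, and the artists with the most songs.
label_counts = cleaned["hit/flop"].value_counts()
top_artists = cleaned["artist"].value_counts().head(5)
print(label_counts)
print(top_artists)
```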

## Part 4 Performing Exploratory Data Analysis (20 pts)
### Group the Data
Write code to do the following:
1. Group the `DataFrame` by hit/flop
1. For a few of the numeric attributes, what is the mean and standard deviation for hits and for flops? 
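
Grouping and summarizing might look something like the sketch below. The numbers are made up so the result is easy to check by hand; with the real data you would group your cleaned `DataFrame` instead.

```python
import pandas as pd

# Made-up values, chosen so the means are easy to verify by hand.
df = pd.DataFrame({
    "hit/flop": ["hit", "hit", "flop", "flop"],
    "danceability": [0.8, 0.6, 0.4, 0.2],
    "energy": [0.7, 0.5, 0.9, 0.3],
})

# Group by the class label, then compute the mean and the sample
# standard deviation of a few numeric attributes for each group.
grouped = df.groupby("hit/flop")
summary = grouped[["danceability", "energy"]].agg(["mean", "std"])
print(summary.round(2))
```

The result has one row per group and a two-level column index, so a single value can be read with, for example, `summary.loc["hit", ("danceability", "mean")]`.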

### Visualize the Data
Create a `charts/` directory in your repo and save the charts to appropriately named files in this directory. Identify a numeric attribute that you notice has a large difference between hits and flops. Then, produce a [histogram](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.hist.html) with:
* Bars: 30 bins, slightly transparent, different color for hits and flops
* X axis label: `<attribute>`
* Y axis label: "Count"
* Title: "`<attribute>` flop mean = `<2 decimal places>`, stdev = `<2 decimal places>` hit mean = `<2 decimal places>`, stdev = `<2 decimal places>`"
* [Legend](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend)

Example: <img src="https://github.com/gsprint23/ZIME-Intro-to-Data-Science/raw/master/figures/danceability_hist.png" width="400">
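
A histogram meeting the requirements listed above could be sketched like this. The data is made up, the colors and the output file name are arbitrary choices, and the `Agg` backend is selected only so the script runs without a display. Passing the same `range` to both calls is an optional extra that makes the two sets of bins line up.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; charts are saved, not shown
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the cleaned DataFrame.
df = pd.DataFrame({
    "hit/flop": ["hit", "hit", "flop", "flop"],
    "danceability": [0.8, 0.6, 0.4, 0.2],
})

attribute = "danceability"
hits = df.loc[df["hit/flop"] == "hit", attribute]
flops = df.loc[df["hit/flop"] == "flop", attribute]

title = (
    f"{attribute} flop mean = {flops.mean():.2f}, stdev = {flops.std():.2f} "
    f"hit mean = {hits.mean():.2f}, stdev = {hits.std():.2f}"
)

# Use one shared value range so both histograms have identical bin edges.
shared_range = (df[attribute].min(), df[attribute].max())

fig, ax = plt.subplots()
ax.hist(flops, bins=30, range=shared_range, alpha=0.5, color="tab:red", label="flop")
ax.hist(hits, bins=30, range=shared_range, alpha=0.5, color="tab:blue", label="hit")
ax.set_xlabel(attribute)
ax.set_ylabel("Count")
ax.set_title(title)
ax.legend()

# Save into the charts/ directory, creating it if needed.
os.makedirs("charts", exist_ok=True)
out_path = os.path.join("charts", f"{attribute}_hist.png")
fig.savefig(out_path)
plt.close(fig)
```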

### BONUS (5 Extra Credit points)
Produce a bar chart showing the top 10 most frequent genres in the dataset.

Example: <img src="https://github.com/gsprint23/ZIME-Intro-to-Data-Science/raw/master/figures/genre_bar.png" width="400">
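
One way to approach the bonus is to count genre values and plot the ten most frequent. The sketch below uses a made-up `genres` column (the genre strings other than "black thrash" and "french indie pop" are placeholders) and an arbitrary output file name.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; charts are saved, not shown
import matplotlib.pyplot as plt
import pandas as pd

# Made-up genre column standing in for the joined DataFrame's genres.
df = pd.DataFrame({
    "genres": [
        "black thrash", "black thrash", "black thrash",
        "french indie pop", "french indie pop",
        "placeholder genre",
    ],
})

# value_counts() sorts from most to least frequent, so head(10) is the top 10.
genre_counts = df["genres"].value_counts().head(10)

fig, ax = plt.subplots()
ax.bar(genre_counts.index, genre_counts.values)
ax.set_xlabel("Genre")
ax.set_ylabel("Count")
ax.set_title("Top 10 most frequent genres")
ax.tick_params(axis="x", rotation=45)  # long genre names overlap otherwise
fig.tight_layout()

os.makedirs("charts", exist_ok=True)
bar_path = os.path.join("charts", "genre_bar.png")
fig.savefig(bar_path)
plt.close(fig)
```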

## Grading Guidelines
This assignment is worth 60 points. Your assignment will be evaluated based on a successful execution (using the Anaconda Python Distribution v3.9) and adherence to the program requirements. We will grade according to the following criteria:
* 10 pts for correct part 1
    * 5 pts for a Github repo with a README.md
    * 5 pts for CSV files in the Github repo
* 10 pts for correct part 2
    * 5 pts for printing out number of rows and columns from file
    * 5 pts for prompting the user and printing out the requested number of rows
* 20 pts for correct part 3
    * 5 pts for reading and joining the data
    * 5 pts for printing out info about the joined `DataFrame`
    * 5 pts for handling missing values
    * 5 pts for number of hits/flops and for top 5 artists with most songs
* 20 pts for correct part 4
    * 10 pts for grouping by hit/flop and printing means/standard deviations
    * 10 pts for hits/flops histogram
* 5 BONUS pts for genre bar chart