# Lab Instructions

Complete your lab by answering the questions below.  Feel free to add more code and text blocks as necessary.  Make sure that your final answer to each question is clearly stated.  Show your work on programming questions; the more you show, the more credit you can earn.  

Zip the completed file and upload it to submit.  

Questions will be graded using the rubric attached to the assignment.

**Q1:** `artist(s)_name` lists the names of the artist(s) credited on the song.  What type of data is `artist(s)_name` in the Spotify dataset?  Select two terms below that best describe the data type.  Import the Spotify data using `df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')`.

* Categorical

* Quantitative

* Ordinal

* Nominal

* Continuous

* Discrete

* Regular

* Irregular

**A1:** 

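import pandas as pd

# Load the dataset as instructed in the question
df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')
print(df['artist(s)_name'].dtype)  # Likely outputs 'object'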

Best Terms:
Categorical and Nominal. Artist names are text labels with no numeric meaning and no natural order.
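
One way to back up this classification with code is to confirm that the column holds non-numeric labels. The sketch below uses a small made-up DataFrame (the `sample` rows are hypothetical, not taken from `spotify-2023.csv`) so that it runs on its own; the same checks should apply to `df`, assuming the column loads as text.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical rows standing in for the real dataset
sample = pd.DataFrame(
    {"artist(s)_name": ["Artist A", "Artist B", "Artist A", "Artist C, Artist D"]}
)
names = sample["artist(s)_name"]

# Text labels are not numeric, so arithmetic such as a mean is not meaningful
is_numeric = is_numeric_dtype(names)

# The labels only identify groups; counting distinct labels is a valid operation
unique_labels = names.nunique()

print(f"Numeric column? {is_numeric}")
print(f"Distinct labels: {unique_labels}")
```

A non-numeric column whose labels have no inherent ranking fits the categorical and nominal description.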

**Q2:** `artist_count` lists the number of artists credited with the song.  What type of data is `artist_count` in the Spotify dataset?  Select two terms below that best describe the data type. 

* Categorical

* Quantitative

* Ordinal

* Nominal

* Continuous

* Discrete

* Regular

* Irregular

**A2:** 

print(df['artist_count'].dtype)  # Likely outputs 'int64'

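Best Terms:
Quantitative and Discrete. The values are whole-number counts, so arithmetic on them is meaningful but fractional values cannot occur.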
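
A quick way to support this classification is to check that the column is stored as integers and that arithmetic on it makes sense. This is a minimal sketch on a hypothetical `sample` DataFrame (the values are invented for illustration), not output from the real dataset.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical artist counts standing in for the real column
sample = pd.DataFrame({"artist_count": [1, 2, 1, 3, 1]})
counts = sample["artist_count"]

# Integer storage suggests whole-number counts (discrete)
is_integer = is_integer_dtype(counts)

# Only a handful of separate values appear, with nothing in between
distinct_values = sorted(counts.unique().tolist())

# A mean is meaningful for counts (quantitative), even though no single
# song can have a fractional number of artists
mean_count = counts.mean()

print(f"Integer column? {is_integer}")
print(f"Distinct values: {distinct_values}")
print(f"Mean artist count: {mean_count:.2f}")
```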

**Q3:** `artist(s)_name` lists the names of the artist(s) credited on the song. Which artist(s) had the greatest number of songs in the Spotify dataset?  

**A3:** 

import pandas as pd

# Load the dataset
df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

# Count the number of songs per artist(s)_name
artist_song_counts = df['artist(s)_name'].value_counts()

# Find the artist(s) with the maximum number of songs
max_songs = artist_song_counts.max()
top_artists = artist_song_counts[artist_song_counts == max_songs]

# Display the result
print("Artist(s) with the greatest number of songs:")
for artist, count in top_artists.items():
    print(f"{artist}: {count} songs")

**Q4:** The `mode` feature in the Spotify data tells us if the song is written in a major or minor key.  Visualize the distribution of `mode` from the Spotify data.  Make sure to give your figure a title.  

**A4:** 

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

# Verify the 'mode' column values
print("Unique values in 'mode' column:", df['mode'].unique())

# Count the number of songs by mode
mode_counts = df['mode'].value_counts()

# Map mode values to labels for clarity (assuming 1 = Major, 0 = Minor)
mode_labels = {1: 'Major', 0: 'Minor'}
mode_counts.index = mode_counts.index.map(lambda x: mode_labels.get(x, x))

# Create a bar plot
plt.figure(figsize=(8, 5))
mode_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Distribution of Songs by Musical Mode (Major vs. Minor) in Spotify 2023')
plt.xlabel('Musical Mode')
plt.ylabel('Number of Songs')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Print the counts for reference
print("\nNumber of songs by mode:")
print(mode_counts)

**Q5:**  Based on the visualization you created to answer the previous question, are there more popular songs written in a major key or a minor key?

**A5:** 

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

# Check if 'mode' column exists and inspect its values
print("Unique values in 'mode' column:", df['mode'].unique())

# Count songs by mode
mode_counts = df['mode'].value_counts()

# Map numeric codes to labels if present (assuming 1 = Major, 0 = Minor);
# values that are already text labels are left unchanged instead of becoming NaN
mode_labels = {1: 'Major', 0: 'Minor'}
mode_counts.index = mode_counts.index.map(lambda x: mode_labels.get(x, x))

# Display the counts
print("\nNumber of songs by key mode:")
print(mode_counts)

# Determine which mode is more common
most_common_mode = mode_counts.idxmax()
most_common_count = mode_counts.max()
print(f"\nThere are more popular songs written in a {most_common_mode} key ({most_common_count} songs).")

# Create a bar plot for visualization
plt.figure(figsize=(8, 5))
mode_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Number of Popular Songs by Key Mode (Major vs. Minor)')
plt.xlabel('Key Mode')
plt.ylabel('Number of Songs')
plt.xticks(rotation=0)
plt.show()

**Q6:** What is the mean number of beats per minute (`bpm`) in the songs included in the Spotify dataset? 

**A6:** 

import pandas as pd

# Load the dataset
df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

# Verify the column names and check for 'bpm'
print("Column names:", df.columns)

# Calculate the mean bpm
mean_bpm = df['bpm'].mean()

# Display the result
print(f"The mean number of beats per minute (bpm) in the Spotify 2023 dataset is {mean_bpm:.2f} bpm.")

**Q7:** Create a visualization of the number of beats per minute (`bpm`) of the songs in the Spotify dataset.  Make sure to give your visualization a title.  

**A7:** 

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

# Verify the 'bpm' column
print("Column names:", df.columns)
print("Sample bpm values:", df['bpm'].head())
print("Missing bpm values:", df['bpm'].isnull().sum())

# Create a histogram of bpm (drop missing values so the bin range can be computed)
plt.figure(figsize=(10, 6))
plt.hist(df['bpm'].dropna(), bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Beats Per Minute (BPM) in Spotify 2023 Songs')
plt.xlabel('Beats Per Minute (BPM)')
plt.ylabel('Number of Songs')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

**Q8:** In your own words, how would you describe the purpose of data visualization?

**A8:** 

Data visualization turns data into pictures like charts or graphs to make it easier to understand. Its purpose is to show patterns and insights clearly, so people can quickly see what the data means and make better decisions.

**Q9:** In your own words, describe the distribution of a feature.

**A9:** 

The distribution of a feature shows how its values are spread out in a dataset. It reveals patterns, like where most values fall, if they’re bunched up or scattered, and if there are any extreme values. For example, it might show most songs have a tempo around 120 bpm with a few faster or slower ones. It helps you understand what’s normal for that feature.
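
The ideas in this answer (where most values fall, how spread out they are, and whether there are extreme values) can be put into numbers. The sketch below is one possible way to do that, using invented tempo values in a hypothetical `bpm` Series and the common 1.5 × IQR rule for flagging extremes; that rule is a convention chosen here for illustration, not something the lab requires.

```python
import pandas as pd

# Hypothetical tempo values, not taken from the Spotify dataset
bpm = pd.Series([90, 100, 110, 120, 120, 125, 130, 140, 170, 200])

# Count, mean, standard deviation, min, quartiles, and max in one table
summary = bpm.describe()

# Center: the middle value of the sorted data
center = bpm.median()

# Spread: the range covered by the middle half of the values
q1, q3 = bpm.quantile(0.25), bpm.quantile(0.75)
iqr = q3 - q1

# Extreme values: anything more than 1.5 * IQR outside the quartiles
outliers = bpm[(bpm < q1 - 1.5 * iqr) | (bpm > q3 + 1.5 * iqr)]

print(summary)
print(f"Median: {center}, IQR: {iqr}")
print(f"Extreme values: {outliers.tolist()}")
```

Replacing the hypothetical Series with `df['bpm']` should give the same kind of summary for the real data, assuming the column is numeric.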

**Q10:** In your own words, describe one principle of good visualization design.

**A10:** 

A key principle of good visualization design is clarity. Make the chart simple and easy to understand with clear labels and minimal clutter, so people can quickly see the main point without confusion.