### **Cumulative Duke University Pep Band Analysis**

**Importing Libraries**<br>
Before running any analysis, we need to import the libraries we use.

In [33]:
import pandas as pd
import os

**Importing Data**<br>
Next, we merge the song data from every CSV file in the data folder into a single cumulative file.

In [34]:
# access all the CSV files from the input data folder
DATA_DIRECTORY = "../data/"
OUTPUT_FILENAME = "DUMB Pep Band Song Statistics - Cumulative.csv"
dataframes = []

# loop through the CSV files to create pandas dataframes of them,
# skipping the cumulative file itself so that re-running this cell
# does not count previously merged rows a second time
for filename in os.listdir(DATA_DIRECTORY):
    if filename.endswith(".csv") and filename != OUTPUT_FILENAME:
        filepath = os.path.join(DATA_DIRECTORY, filename)
        df = pd.read_csv(filepath)
        dataframes.append(df)

# merge the dataframes together and store the cumulative result
merged_df = pd.concat(dataframes, ignore_index=True)
merged_df.to_csv(os.path.join(DATA_DIRECTORY, OUTPUT_FILENAME), index=False)

**Song Frequency**<br>
Let's check how many times each song was played since records started in February 2024. First, we need to define a function that can find the song frequency.

In [35]:
"""
find_song_frequency function

Input:  input_path = String filepath of the Pep Band csv file
Output: freqs      = Dictionary that tracks frequencies of all songs played

"""
def find_song_frequency(input_path: str):
    
    # store the frequencies of each song played in a dictionary
    # where the key = song name, and value = frequency
    freqs = dict()
    
    with open(input_path, "r") as input_file:
        next(input_file, None) # skip the header
        
        for line in input_file:
            row = line.split(",")
            
            # for each song, add it to the frequency
            song = row[3]
            if song not in freqs:
                freqs[song] = 1
            else:
                freqs[song] += 1
    
    return freqs
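
Note that splitting each line on `","` treats every comma as a column separator, so a song title that itself contains a comma would be miscounted. As a hedged alternative, the sketch below uses the standard library's `csv` reader and `collections.Counter`. The function name `count_songs`, the sample rows, and the header names are hypothetical; only the song column index (3) comes from the function above.

```python
import csv
from collections import Counter
from io import StringIO


def count_songs(lines, song_column=3):
    """Count the values in one column of CSV rows, skipping the header."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header
    # ignore rows that are too short to contain the song column
    return Counter(row[song_column] for row in reader if len(row) > song_column)


# hypothetical sample data; the real header names are not shown in this notebook
sample = StringIO(
    "Date,Opponent,Team,Song\n"
    "2/7,Notre Dame,MBB,Devil\n"
    ',,,"Song, With Comma"\n'
    ",,,Devil\n"
)
sample_freqs = count_songs(sample)
print(sample_freqs.most_common())  # [('Devil', 2), ('Song, With Comma', 1)]
```

A `Counter` behaves like the `freqs` dictionary, so it could be passed to the later cells unchanged; to read the real file, an open file handle can be passed in place of the `StringIO` object.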

Now that we have defined the function, we can find the song statistics.

In [36]:
INPUT_FILEPATH = "../data/DUMB Pep Band Song Statistics - Cumulative.csv"
freqs = find_song_frequency(INPUT_FILEPATH)

We can first check how many songs we have played in total since February 1, 2024.

In [37]:
TOTAL_COUNT = sum(freqs.values())
print(f"Since February 1, 2024, the Duke University Pep Band has played at least {TOTAL_COUNT} times.")

Since February 1, 2024, the Duke University Pep Band has played at least 325 times.


After finding the total, let's break down the results by song name, sorted in decreasing order of frequency.

In [38]:
# set the width that each "song_name" entry is padded to when printing
BUFFER = 30

In [39]:
print(f"Since February 1, 2024, the Duke University Pep Band has played:")
for song_name in sorted(freqs, key=lambda x: -freqs[x]):
    print(f"{song_name.ljust(BUFFER)} {freqs[song_name]} times")

Since February 1, 2024, the Duke University Pep Band has played:
Blue & White                   35 times
Can't                          33 times
Devil                          24 times
Fight Fight                    22 times
Zing It!                       19 times
Everytime                      13 times
That's What I Want             12 times
Dance the Night                11 times
Love from the Other Side       10 times
Runaway Baby                   9 times
Pumpkin                        9 times
Dear Old Duke                  9 times
Overpass                       8 times
Mortal                         8 times
Sail                           8 times
Uma Thurman                    8 times
Spell                          7 times
Wipeout                        7 times
Lucky Strike                   7 times
That That                      5 times
Scream                         5 times
About Damn Time                4 times
Potential Breakup Song         4 times
T. gogo                      

**FlipFolder Frequency**<br>
Let's see how many of the songs in the FlipFolder have been played since February 1, 2024. Note that "D. Ditty", "T. gogo" and "Pumpkin" are not considered FlipFolder songs, as they do not have FlipFolder entries.

In [45]:
TOTAL_FLIPFOLDER_SONGS = 96
NON_FLIPFOLDER_SONGS = 3

In [46]:
print(f"{len(freqs.keys())-NON_FLIPFOLDER_SONGS}/{TOTAL_FLIPFOLDER_SONGS} songs in the FlipFolder have been played since February 1, 2024.")

43/96 songs in the FlipFolder have been played since February 1, 2024.
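
To express this fraction as a percentage, a minimal sketch follows. The two values are copied from the output above, and the lowercase variable names are introduced only for this illustration.

```python
# values copied from the output above
played = 43
total_flipfolder_songs = 96

share = played / total_flipfolder_songs
print(f"{share:.2%} of the FlipFolder has been played.")  # 44.79% of the FlipFolder has been played.
```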


**Fight vs Non-Fight Songs**<br>
What percentage of the songs played were fight songs? The Duke University Pep Band has the following fight songs:

- Fight Fight
- Blue & White
- Zing It!
- Devil
- Can't
- Dear Old Duke


We can identify fight songs using a set.

In [42]:
fight_songs = {"Fight Fight", "Blue & White", "Zing It!", "Devil", "Can't", "Dear Old Duke"}

We can define a function to count the number of fight songs.

In [43]:
"""
count_fight_songs function

Input:  freqs       = Dictionary that tracks frequencies of all songs played
        fight_songs = Set of fight song string names
Output: count       = Integer representing fight song playing count

"""
def count_fight_songs(freqs: dict, fight_songs: set):
    
    # store the number of times that fight songs have been played
    count = 0
    
    # go through each song in the frequency dict
    for song_name in freqs:
        if song_name in fight_songs:
            count += freqs[song_name]
    
    # return the count once calculated
    return count

We can now check the number of times that fight songs were played.

In [44]:
FIGHT_SONG_COUNT = count_fight_songs(freqs, fight_songs)
TOTAL_COUNT = sum(freqs.values())

print(f"Since February 2024, {FIGHT_SONG_COUNT}/{TOTAL_COUNT} songs that were played were fight songs.")

Since February 2024, 142/325 songs that were played were fight songs.


From this, we can conclude that these 6 songs make up a whopping 142/325 (~43.69%) of the pieces played!

**Songs by Date**<br>
On what day were the fewest songs played? On what day were the most played?

In [67]:
"""
find_number_of_songs_by_date function

Input:  input_path = String filepath of the Pep Band csv file
Output: dates      = Dictionary that tracks number of songs played by date

"""
def find_number_of_songs_by_date(input_path: str):
    
    # store the number of songs played by date
    # where the key = game date, and value = number of songs played
    dates = dict()
    
    # store the current date of records
    curr_date = ""
    
    with open(input_path, "r") as input_file:
        next(input_file, None) # skip the header
        
        for line in input_file:
            row = line.split(",")
            
            # update dates if the value isn't null
            if len(row[0]) > 0:
                curr_date = " ".join(row[0:3])
            
            # increase the number of songs played at that date
            if curr_date not in dates:
                dates[curr_date] = 1
            else:
                dates[curr_date] += 1
    
    return dates

First, find the number of songs played on each date. Then, output the results by printing them.

In [68]:
dates = find_number_of_songs_by_date(INPUT_FILEPATH)

In [70]:
print(f"Since February 1, 2024, the Duke University Pep Band has played for these games:")
for curr_date in sorted(dates, key=lambda x: -dates[x]):
    print(f"{curr_date.ljust(BUFFER)} {dates[curr_date]} times")

Since February 1, 2024, the Duke University Pep Band has played for these games:
3/9 UNC MBB                    36 times
2/7 Notre Dame MBB             31 times
2/10 Boston College MBB        30 times
2/28 Louisville MBB            30 times
2/19 Notre Dame WBB            26 times
2/11 UNC WBB                   25 times
2/29 Virginia WBB              23 times
2/8 Wake Forest WBB            22 times
3/7 Georgia Tech WBB           20 times
3/24 JMU MBB                   19 times
3/14 NC State MBB              17 times
3/31 NC State MBB              17 times
3/22 Vermont MBB               16 times
3/29 Houston MBB               13 times
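
To answer the two questions directly rather than reading them off the sorted list, one option is `max` and `min` with the dictionary's `get` method as the key. The sketch below uses a small `sample_dates` dictionary copied from the output above so that it runs on its own; in the notebook, the full `dates` dictionary could be used instead.

```python
# a few per-game counts copied from the output above
sample_dates = {
    "3/9 UNC MBB": 36,
    "2/7 Notre Dame MBB": 31,
    "3/22 Vermont MBB": 16,
    "3/29 Houston MBB": 13,
}

# the key function looks up each game's song count
most = max(sample_dates, key=sample_dates.get)
fewest = min(sample_dates, key=sample_dates.get)

print(f"Most songs:   {most} ({sample_dates[most]})")
print(f"Fewest songs: {fewest} ({sample_dates[fewest]})")
```

Based on the output above, the most songs (36) were played at the 3/9 UNC MBB game and the fewest (13) at the 3/29 Houston MBB game.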


In [None]:
# average number of songs played per game
sum(dates.values()) / len(dates)