*  DSC 540-T302 Data Preparation
*  Week 7 & 8 Exercise
*  Peter Lozano

# Cleaning and Transforming Data

We have two different data sources that we can use for this exercise:
1.   [So Much Candy Data Seriously](https://www.scq.ubc.ca/so-much-candy-data-seriously/)
2.   [The Metropolitan Museum of Art Open Access](https://github.com/metmuseum/openaccess/)

I will need to complete **8** different data cleaning and transformation methods on at least one of these datasets. I should at least pick 2 different methods from each chapter (Chapters 7, 8, 10, and 11) in the textbook.

## Import libraries

In [49]:
import pandas as pd

## Load dataset

In [50]:
# Loading all candy datasets
candy_data = pd.read_excel('Weeks 7 & 8 Data/CANDYDATA.xlsx')
candy_hierarchy_2015 = pd.read_excel('Weeks 7 & 8 Data/CANDY-HIERARCHY-2015-SURVEY-Responses.xlsx')
candy_hierarchy_2016 = pd.read_excel('Weeks 7 & 8 Data/BOING-BOING-CANDY-HIERARCHY-2016-SURVEY-Responses.xlsx')
candy_hierarchy_2017 = pd.read_excel('Weeks 7 & 8 Data/candyhierarchy2017.xlsx')



### Print the first 5 rows of each dataset to understand their structure and null values.

In [51]:
candy_data.head()

Unnamed: 0,ITEM,JOY,DESPAIR,NET FEELIES,NET CLOUT,DESPAIR (NEG)
0,York Peppermint Patties,634,78,556.0,1.639118,-78.0
1,Whole Wheat anything,21,419,-398.0,1.012938,-419.0
2,White Bread,15,473,-458.0,1.12344,-473.0
3,Vicodin,323,210,113.0,1.227036,-210.0
4,Twix,770,26,744.0,1.832497,-26.0


`candy_data` is a combined dataset from the 2015, 2016, and 2017 candy surveys. It is small enough to display every column. What I'm looking for here are obvious null values or structural issues.

In [52]:
candy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ITEM           86 non-null     object 
 1   JOY            87 non-null     int64  
 2   DESPAIR        87 non-null     int64  
 3   NET FEELIES    86 non-null     float64
 4   NET CLOUT      86 non-null     float64
 5   DESPAIR (NEG)  86 non-null     float64
dtypes: float64(3), int64(2), object(1)
memory usage: 4.2+ KB


I read the `.info()` output by comparing the total number of entries [**87**] with the non-null count for each column. If a column's non-null count is less than 87, that column contains null values.
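The same comparison can be done programmatically. The sketch below uses a small made-up frame (`toy`, with illustrative values only, not the real `candy_data`) and assumes that `isnull().sum()` is an acceptable stand-in for reading the counts off `.info()` by eye.

```python
import pandas as pd

# Toy frame standing in for candy_data; the values are illustrative only
toy = pd.DataFrame({
    "ITEM": ["Twix", None, "Vicodin"],
    "JOY": [770, 15, 323],
    "NET FEELIES": [744.0, None, 113.0],
})

# Nulls per column = total rows minus the non-null count that .info() reports
null_counts = toy.isnull().sum()

# Show only the columns that actually contain nulls
print(null_counts[null_counts > 0])
```

This gives one number per column, which scales better than visually scanning the `.info()` listing.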

Now, I will check the remaining datasets for null values and structural issues.

In [53]:
candy_hierarchy_2015.head()

Unnamed: 0,Timestamp,How old are you?,Are you going actually going trick or treating yourself?,[Butterfinger],[100 Grand Bar],[Anonymous brown globs that come in black and orange wrappers],[Any full-sized candy bar],[Black Jacks],[Bonkers],[Bottle Caps],...,[Necco Wafers],"Which day do you prefer, Friday or Sunday?",Please estimate the degrees of separation you have from the following folks [Bruce Lee],Please estimate the degrees of separation you have from the following folks [JK Rowling],Please estimate the degrees of separation you have from the following folks [Malala Yousafzai],Please estimate the degrees of separation you have from the following folks [Thom Yorke],Please estimate the degrees of separation you have from the following folks [JJ Abrams],Please estimate the degrees of separation you have from the following folks [Hillary Clinton],Please estimate the degrees of separation you have from the following folks [Donald Trump],Please estimate the degrees of separation you have from the following folks [Beyoncé Knowles]
0,2015-10-23 08:46:20.451,35,No,JOY,,DESPAIR,JOY,,,,...,,,,,,,,,,
1,2015-10-23 08:46:51.583,41,No,JOY,JOY,DESPAIR,JOY,DESPAIR,DESPAIR,JOY,...,DESPAIR,,,,,,,,,
2,2015-10-23 08:47:34.285,33,No,DESPAIR,DESPAIR,DESPAIR,JOY,DESPAIR,DESPAIR,DESPAIR,...,DESPAIR,,,,,,,,,
3,2015-10-23 08:47:58.964,31,No,JOY,JOY,DESPAIR,JOY,DESPAIR,DESPAIR,JOY,...,DESPAIR,,,,,,,,,
4,2015-10-23 08:48:11.719,30,No,,JOY,DESPAIR,JOY,,,,...,,,,,,,,,,


This is a wide dataset with many columns. Therefore, I will use the `verbose=True` and `show_counts=True` parameters in the `.info()` function to see all columns and their non-null counts.

In [54]:
candy_hierarchy_2015.info(
    # Show all columns
    verbose=True,
    # Show non-null counts
    show_counts=True
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 124 columns):
 #    Column                                                                                                             Non-Null Count  Dtype         
---   ------                                                                                                             --------------  -----         
 0    Timestamp                                                                                                          5630 non-null   datetime64[ns]
 1    How old are you?                                                                                                   5431 non-null   object        
 2    Are you going actually going trick or treating yourself?                                                           5630 non-null   object        
 3     [Butterfinger]                                                                                                    5247 non-nu

From the 2015 hierarchy dataset, I can see that there are a total of 124 columns. That is too many to display comfortably in a Jupyter Notebook, so scanning the full `.info()` listing for null values is impractical.

Since the rest of these datasets are also wide, I will instead count the columns that contain null values by applying `.isnull().any()` to the DataFrame and counting the columns that return `True`. Subtracting that from the total number of columns gives the number of columns with no null values.

In [55]:
candy_hierarchy_2016.head()

Unnamed: 0,Timestamp,Are you going actually going trick or treating yourself?,Your gender:,How old are you?,Which country do you live in?,"Which state, province, county do you live in?",[100 Grand Bar],[Anonymous brown globs that come in black and orange wrappers],[Any full-sized candy bar],[Black Jacks],...,Please estimate the degree(s) of separation you have from the following celebrities [JK Rowling],Please estimate the degree(s) of separation you have from the following celebrities [JJ Abrams],Please estimate the degree(s) of separation you have from the following celebrities [Beyoncé],Please estimate the degree(s) of separation you have from the following celebrities [Bieber],Please estimate the degree(s) of separation you have from the following celebrities [Kevin Bacon],Please estimate the degree(s) of separation you have from the following celebrities [Francis Bacon (1561 - 1626)],"Which day do you prefer, Friday or Sunday?","Do you eat apples the correct way, East to West (side to side) or do you eat them like a freak of nature, South to North (bottom to top)?","When you see the above image of the 4 different websites, which one would you most likely check out (please be honest).",[York Peppermint Patties] Ignore
0,2016-10-24 05:09:23.033,No,Male,22,Canada,Ontario,JOY,DESPAIR,JOY,MEH,...,3 or higher,2,3 or higher,3 or higher,3 or higher,3 or higher,Friday,South to North,Science: Latest News and Headlines,
1,2016-10-24 05:09:54.798,No,Male,45,usa,il,MEH,MEH,JOY,JOY,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Friday,East to West,Science: Latest News and Headlines,
2,2016-10-24 05:13:06.734,No,Female,48,US,Colorado,JOY,DESPAIR,JOY,MEH,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Sunday,East to West,Science: Latest News and Headlines,
3,2016-10-24 05:14:17.192,No,Male,57,usa,il,JOY,MEH,JOY,MEH,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Sunday,South to North,Science: Latest News and Headlines,
4,2016-10-24 05:14:24.625,Yes,Male,42,USA,South Dakota,MEH,DESPAIR,JOY,DESPAIR,...,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,3 or higher,Friday,East to West,ESPN,


In [56]:
# Columns in candy_hierarchy_2016 that contain at least one null value
null_columns = candy_hierarchy_2016.columns[candy_hierarchy_2016.isnull().any()]

# Count of columns with no null values
number_of_non_nulls = len(candy_hierarchy_2016.columns) - len(null_columns)

print(f'There are \033[1m{number_of_non_nulls}\033[0m non-null columns and \033[1m{len(null_columns)}\033[0m columns with null values out of a total of \033[1m{len(candy_hierarchy_2016.columns)}\033[0m columns.')

There are [1m2[0m non-null columns and [1m121[0m columns with null values out of a total of [1m123[0m columns.


In [57]:
candy_hierarchy_2017.head()

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q8: DESPAIR OTHER,Q9: OTHER COMMENTS,Q10: DRESS,Unnamed: 113,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo],"Click Coordinates (x, y)"
0,90258773,,,,,,,,,,...,,,,,,,,,,
1,90272821,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,...,,Bottom line is Twix is really the only candy w...,White and gold,,Sunday,,1.0,,,"(84, 25)"
2,90272829,,Male,49.0,USA,Virginia,,,,,...,,,,,,,,,,
3,90272840,No,Male,40.0,us,or,MEH,DESPAIR,JOY,MEH,...,,Raisins can go to hell,White and gold,,Sunday,,1.0,,,"(75, 23)"
4,90272841,No,Male,23.0,usa,exton pa,JOY,DESPAIR,JOY,DESPAIR,...,,,White and gold,,Friday,,1.0,,,"(70, 10)"


In [58]:
# Columns in candy_hierarchy_2017 that contain at least one null value
null_columns = candy_hierarchy_2017.columns[candy_hierarchy_2017.isnull().any()]

# Count of columns with no null values
number_of_non_nulls = len(candy_hierarchy_2017.columns) - len(null_columns)

print(f'There are \033[1m{number_of_non_nulls}\033[0m non-null columns and \033[1m{len(null_columns)}\033[0m columns with null values out of a total of \033[1m{len(candy_hierarchy_2017.columns)}\033[0m columns.')

There are [1m1[0m non-null columns and [1m119[0m columns with null values out of a total of [1m120[0m columns.


## Chapter 7: #1 Filter out missing data

To find the columns that do or do not contain null values, I can use the following code:

In [59]:
# Columns without null values in candy_hierarchy_2017
candy_hierarchy_2017.columns[~candy_hierarchy_2017.isnull().any()].tolist()

['Internal ID']

I used the `~` operator to invert the boolean values returned by `.isnull().any()`, so that I get `True` for columns without null values. Then, I use `.tolist()` to convert the resulting index object to a list of column names.

Removing the `~` operator will give me the columns that do contain null values.
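As a minimal sketch of both directions, the example below uses a made-up three-column frame (`toy` is hypothetical; only the column names are borrowed from the 2017 survey). It also shows `dropna(axis="columns")`, which I am assuming is an acceptable way to actually filter the null-containing columns out rather than just list them.

```python
import pandas as pd

# Toy frame; values are illustrative only
toy = pd.DataFrame({
    "Internal ID": [1, 2, 3],
    "Q3: AGE": [44.0, None, 40.0],
    "Q4: COUNTRY": ["USA", "us", None],
})

# One boolean per column: True if the column has at least one null
has_null = toy.isnull().any()

cols_without_nulls = toy.columns[~has_null].tolist()
cols_with_nulls = toy.columns[has_null].tolist()

# Drop every column that contains a null value
filtered = toy.dropna(axis="columns")

print(cols_without_nulls, cols_with_nulls)
```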

## Chapter 7: #2 Replace values

Since I'm working with null values, I might as well see what I can do to fill them with meaningful data.

Using the `candy_hierarchy_2016` dataset, I will replace null values in the `Your gender:` column with the string "Not Specified".

In [60]:
# Filling null values in the 'Your gender:' column with 'Not Specified'
candy_hierarchy_2016.fillna({'Your gender:': 'Not Specified'}, inplace=True)

# Counting the number of cases where gender was filled
count_of_filled = len(candy_hierarchy_2016.loc[candy_hierarchy_2016["Your gender:"] == "Not Specified"])

print(f'There were \033[1m{count_of_filled}\033[0m rows where gender was filled with "Not Specified".')

There were [1m9[0m rows where gender was filled with "Not Specified".


---

I will create a hierarchical index of candy names per year across the three survey datasets, producing a new version of `candy_data` that records the year each rating was given. Since I'm not sure the original `candy_data` captures every year of the survey, this index lets me analyze candy popularity with assurance that all years are included.

This requires multiple steps because the naming convention is not consistent from year to year. So, I will need to apply cleaning and transformation methods from each chapter.
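To make the target structure concrete, here is a minimal sketch of a year-by-candy hierarchical index on made-up counts. The frame `toy` and its numbers are hypothetical, and `set_index` is just one assumed way to build the `MultiIndex`.

```python
import pandas as pd

# Toy counts; values are illustrative only
toy = pd.DataFrame({
    "year": ["2015", "2015", "2016"],
    "candy": ["Twix", "Bonkers", "Twix"],
    "JOY": [5, 1, 4],
})

# Two-level (hierarchical) index: year on the outside, candy on the inside
indexed = toy.set_index(["year", "candy"]).sort_index()

# Select one candy in one year, or every candy in a year
twix_2016 = indexed.loc[("2016", "Twix"), "JOY"]
all_2015 = indexed.loc["2015"]

print(indexed)
```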

### Chapter 10 - #1 Split/Apply/Combine
### Chapter 7 - #3 Replace values

In [61]:
# Hierarchical index creation using candy name columns
import re
import string

CANDY_RESP_VALUES = {"joy", "meh", "despair"}  # core survey responses

def canonical_candy_name(name: str) -> str:
    # Use `text` rather than `string` to avoid shadowing the imported `string` module
    text = str(name)
    # Remove survey prefixes like "Q6 | "
    text = re.sub(r"^Q6\s*\|\s*", "", text)
    # Strip surrounding brackets seen in 2015 (e.g., " [Butterfinger]")
    text = text.strip()
    text = re.sub(r"^\[|\]$", "", text.strip())
    # Remove content in parentheses (e.g., "(the board game)", "(we don't know...)")
    text = re.sub(r"\s*\([^)]*\)", "", text)
    # Normalize quotes and odd characters
    text = text.replace("’", "'").replace("“", '"').replace("”", '"')
    # Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Standardize case (Title Case keeps hyphens and apostrophes readable)
    text = text.title()
    return text


The function `canonical_candy_name` takes a raw column name and returns a canonical candy name. It removes the "Q6 | " survey prefix, surrounding brackets, and parenthetical notes, normalizes curly quotes, collapses whitespace, and converts the result to title case. This standardizes candy names across years, since the same candy may be labeled differently from one survey to the next. The next cell redefines the function with alias rules and brand handling.

In [62]:
# Helper regex patterns for difficult cases
ALIASES_REGEX = [
    # Unify plural/singular
    (re.compile(r"\bjolly\s+rancher(?:s)?\b", re.IGNORECASE), "Jolly Ranchers"),
    # Fix spacing around O' Raisins
    (re.compile(r"\bbox'? ?o' ?raisins\b", re.IGNORECASE), "Box' O' Raisins"),
]

def canonical_candy_name(name: str) -> str:
    word = str(name)
    # Remove survey prefixes like "Q6 | " and ignoring case
    word = re.compile(r"^Q6\s*\|\s*", re.IGNORECASE).sub("", word)
    # Strip any leading/trailing whitespace
    word = word.strip()
    # Strip surrounding brackets seen in 2015 (e.g., " [Butterfinger]")
    word = re.compile(r"^\[|\]$", re.IGNORECASE).sub("", word)
    # Remove content in parentheses (e.g., "(the board game)", "(we don't know...)")
    word = re.compile(r"\s*\([^)]*\)", re.IGNORECASE).sub("", word)
    # Normalize quotes and odd characters
    word = word.replace("’", "'").replace("“", '"').replace("”", '"')
    # Collapse whitespace
    word = re.compile(r"\s+").sub(" ", word).strip()

    # Brand normalization: detect/remove, then re-prefix canonical
    has_hershey = bool(
        re.compile(
            r"\bhershey(?:'s)?\b", 
            re.IGNORECASE).search(word)
    )
    
    # Applying corrections to Hershey brand
    word = re.compile(
        # Checking for Hershey brand presence with and without possessive " 's "
        r"\bhershey(?:'s)?\b", 
        re.IGNORECASE).sub(
            # Substitute with empty string to remove brand temporarily
            "", 
            word).strip()

    # Alias rules
    for pat, repl in ALIASES_REGEX:
        word = pat.sub(repl, word)

    # Using capwords() to handle weird capitalization in candy names (e.g., Hershey'S)
    word = string.capwords(word)

    # Placing Brand at front (canonicalized) if True
    if has_hershey:
        word = f"Hershey's {word}".strip()

    return word

There are a few hardcoded names in the `ALIASES_REGEX` list that are difficult to detect algorithmically, such as "Jolly Rancher" vs. "Jolly Ranchers". These cases are specific enough that hardcoding them is easier than writing a general rule that may not work elsewhere.

Also, the Hershey's brand is handled specifically to ensure that the brand name is always at the front of the candy name and that the possessive " 's " is included. This standardizes the naming convention for Hershey's candies, since the brand is referenced in many different ways across the datasets.

There are others, such as specific M&M's varieties, that could be added as well, but I'm not sure whether each one referred to a distinct product or to M&M's in general. So, I left those unchanged in case they were referring to a specific variety.
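The switch from `.title()` to `string.capwords()` in the second version is easiest to see side by side. This is a small standalone sketch on sample strings, not the notebook's full function.

```python
import string

raw = "hershey's milk chocolate"

# str.title() capitalizes after every non-letter, including the apostrophe
titled = raw.title()

# string.capwords() splits on whitespace and capitalizes only each word's first letter
capped = string.capwords(raw)

# Side effect: capwords() lowercases the rest of each word
mms = string.capwords("Peanut M&M's")

print(titled)
print(capped)
print(mms)
```

The lowercasing side effect is why names such as "Peanut M&m's" appear in the later output tables.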

In [63]:
def get_candy_columns(df: pd.DataFrame) -> list[str]:
    cols = []
    # 2015/2016-style: bracketed candy columns (e.g., " [Butterfinger]")
    cols += [c for c in df.columns if c.strip().startswith("[") and c.strip().endswith("]")]
    # 2017-style: "Q6 | Candy Name"
    cols += [c for c in df.columns if c.startswith("Q6 | ")]
    # Keep only columns that actually look like candy responses (JOY/MEH/DESPAIR only)
    def looks_like_candy(col: str) -> bool:
        vals = (
            # Select the candidate column
            pd.Series(df[col])
            # Drops NaN values
            .dropna()
            # Converts to string, strips whitespace, lowercases
            .astype(str)
            .str.strip()
            .str.lower()
            .unique()
            .tolist()
        )
        # Allow empty sets (all NaN) or subsets of known responses
        return len(vals) == 0 or set(vals).issubset(CANDY_RESP_VALUES)
    return [c for c in cols if looks_like_candy(c)]

The function `get_candy_columns` takes a DataFrame and returns the list of columns that hold candy ratings. It matches column-name patterns that I identified manually in each year's dataset: the 2015 and 2016 surveys wrap candy names in square brackets (e.g., "[Butterfinger]"), while the 2017 survey prefixes them with "Q6 | ". It then keeps only the matching columns whose values are limited to JOY, MEH, or DESPAIR (or are entirely null).
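The value check matters because a column can match the bracket pattern without being a candy rating. The sketch below shows this on a made-up frame; the "[Free text]" column and the helper name `only_survey_responses` are hypothetical.

```python
import pandas as pd

CANDY_RESP_VALUES = {"joy", "meh", "despair"}

# Toy frame; "[Free text]" is a made-up column that matches the bracket pattern
toy = pd.DataFrame({
    " [Butterfinger]": ["JOY", None, "DESPAIR"],
    "[Free text]": ["yes please", None, "JOY"],
    "How old are you?": [35, 41, 33],
})

# Step 1: name-based match (leading whitespace is ignored)
bracketed = [
    c for c in toy.columns
    if c.strip().startswith("[") and c.strip().endswith("]")
]

# Step 2: value-based check on the non-null, normalized responses
def only_survey_responses(series: pd.Series) -> bool:
    values = set(series.dropna().astype(str).str.strip().str.lower())
    return values.issubset(CANDY_RESP_VALUES)

candy_cols = [c for c in bracketed if only_survey_responses(toy[c])]
print(candy_cols)
```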

In [64]:
def to_long(df: pd.DataFrame, year: str) -> pd.DataFrame:
    # Returns the candy columns
    candy_cols = get_candy_columns(df)
    # Unpivot (melt) from wide to long format
    long = df[candy_cols].melt(var_name="candy_raw", value_name="rating")
    # Add year column for identification
    long["year"] = year
    # Cleans up candy names
    long["candy"] = long["candy_raw"].apply(canonical_candy_name)
    # Drop rows without a rating
    long = long.dropna(subset=["rating"])
    return long[["year", "candy", "rating"]]

### Chapter 10 - Grouping with Functions
> Strictly speaking, this step concatenates with a function rather than grouping with one. It still creates the `year` key that I group on once the data is concatenated.

Here I am finally applying the functions defined above to transform each year's dataset into a long format with canonical candy names and a year column. I then concatenate all the transformed datasets into a single DataFrame called `candies_long`.
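The wide-to-long step followed by concatenation can be sketched on two tiny made-up frames. `melt_with_year` is a hypothetical, simplified stand-in for `to_long` that skips the column detection and name cleaning.

```python
import pandas as pd

# Toy wide frames: one column per candy, one row per respondent
wide_2015 = pd.DataFrame({"Twix": ["JOY", None], "Bonkers": ["DESPAIR", "DESPAIR"]})
wide_2016 = pd.DataFrame({"Twix": ["MEH"], "Bonkers": [None]})

def melt_with_year(df: pd.DataFrame, year: str) -> pd.DataFrame:
    # Unpivot: each (respondent, candy) cell becomes its own row
    long = df.melt(var_name="candy", value_name="rating")
    long["year"] = year
    # Unanswered cells carry no information, so drop them
    return long.dropna(subset=["rating"])[["year", "candy", "rating"]]

combined = pd.concat(
    [melt_with_year(wide_2015, "2015"), melt_with_year(wide_2016, "2016")],
    ignore_index=True,
)
print(combined)
```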

In [65]:
candies_long = pd.concat([
    to_long(candy_hierarchy_2015, "2015"),
    to_long(candy_hierarchy_2016, "2016"),
    to_long(candy_hierarchy_2017, "2017"),
], ignore_index=True)

print("Rows:", len(candies_long), "Distinct candies:", candies_long["candy"].nunique())
candies_long

Rows: 772352 Distinct candies: 114


Unnamed: 0,year,candy,rating
0,2015,Butterfinger,JOY
1,2015,Butterfinger,JOY
2,2015,Butterfinger,DESPAIR
3,2015,Butterfinger,JOY
4,2015,Butterfinger,JOY
...,...,...,...
772347,2017,York Peppermint Patties,JOY
772348,2017,York Peppermint Patties,MEH
772349,2017,York Peppermint Patties,JOY
772350,2017,York Peppermint Patties,MEH


Checking to see if the transformations were successful by displaying Hershey without the possessive " 's " in the candy name. I should **not** see any results if the transformation was successful.

In [66]:
candies_long.loc[candies_long['candy'].str.contains("Hershey ")]

Unnamed: 0,year,candy,rating


Here is where I implement the grouping by year after concatenating the three years of candy data.

In [67]:
# Group by year and count ratings for each year
candies_long.groupby('year')['rating'].value_counts().unstack().fillna(0)

rating,DESPAIR,JOY,MEH
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015,276420.0,202087.0,0.0
2016,39001.0,41889.0,37400.0
2017,58254.0,64206.0,53095.0


I call `value_counts()` on the `rating` column of each group created by `groupby()`, which counts the occurrences of each unique rating per year.

I then call `unstack()` to pivot the resulting Series into a DataFrame, where each unique rating becomes a separate column. Finally, `fillna(0)` replaces any missing values with 0, indicating that there were no occurrences of that rating for that year.
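Here is the same chain on a five-row made-up frame, which makes the missing-combination behavior easy to see. The `unstack(fill_value=0)` variant is an alternative I am suggesting, not what the notebook uses.

```python
import pandas as pd

# Toy long-format ratings; values are illustrative only
toy = pd.DataFrame({
    "year": ["2015", "2015", "2015", "2016", "2016"],
    "rating": ["JOY", "DESPAIR", "JOY", "MEH", "JOY"],
})

# Count ratings per year, pivot ratings into columns, fill absent combinations
counts = toy.groupby("year")["rating"].value_counts().unstack().fillna(0)

# Alternative: fill during the pivot so no NaN is ever introduced
counts_alt = toy.groupby("year")["rating"].value_counts().unstack(fill_value=0)

print(counts)
```

Because `unstack()` introduces `NaN` before `fillna(0)` runs, the first version yields floats (as in the `276420.0` values above).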

Let me check the data types of the columns in the `candies_long` DataFrame to ensure that the `year` column is of the correct data type for hierarchical indexing.

In [68]:
candies_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 772352 entries, 0 to 772351
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   year    772352 non-null  object
 1   candy   772352 non-null  object
 2   rating  772352 non-null  object
dtypes: object(3)
memory usage: 17.7+ MB


It's an object. That's fine since it is a string representing the year. However, I will explicitly convert it to a string type to avoid any potential issues later on.

### Chapter 11 - Convert between string and date time

Using `.astype('string')` to convert the column from `object`, the default pandas dtype for Python strings, to the dedicated pandas string dtype.

In [69]:
candies_long['year'] = candies_long['year'].astype('string')

# Confirming the change went through
candies_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 772352 entries, 0 to 772351
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   year    772352 non-null  string
 1   candy   772352 non-null  object
 2   rating  772352 non-null  object
dtypes: object(2), string(1)
memory usage: 17.7+ MB


In [70]:
# Convert the 'year' column from string to datetime
candies_long['year'] = pd.to_datetime(candies_long['year'], format='%Y')

# Confirming the change went through
candies_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 772352 entries, 0 to 772351
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   year    772352 non-null  datetime64[ns]
 1   candy   772352 non-null  object        
 2   rating  772352 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 17.7+ MB


I could also use `datetime.strptime()` to parse each string into a datetime object (`strftime()` goes the other way, from datetime to string), but since the column is a pandas Series of year strings, `pd.to_datetime()` with `format='%Y'` is simpler.
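The two directions are easy to mix up, so here is a short sketch of both, using the standard library for single values and pandas for a whole column. The sample values are illustrative.

```python
from datetime import datetime

import pandas as pd

# strptime: string -> datetime ("p" for parse)
parsed = datetime.strptime("2015", "%Y")

# strftime: datetime -> string ("f" for format)
formatted = parsed.strftime("%Y")

# Vectorized equivalents for a pandas Series
years = pd.to_datetime(pd.Series(["2015", "2016", "2017"]), format="%Y")
back_to_text = years.dt.strftime("%Y")

print(parsed, formatted)
print(back_to_text.tolist())
```

With only a year supplied, the month and day default to January 1, which is why the later tables show `2015-01-01`.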

### Chapter 8 - Create hierarchical index &

### Chapter 8 - Combine and Merge Datasets

In [71]:
# Counts of ratings per year per candy
ratings = ["JOY", "MEH", "DESPAIR"]

popularity = (
    candies_long
    .assign(rating=candies_long["rating"].str.upper())
    .groupby(["year", "candy", "rating"])
    # Size does the counting
    .size()
    # Unstack pivots the 'rating' index level into columns
    .unstack("rating", fill_value=0)
    # Reindex to ensure all rating columns are present
    .reindex(columns=ratings, fill_value=0)
    .reset_index()
)

# Preview overall table
popularity.head(10)

rating,year,candy,JOY,MEH,DESPAIR
0,2015-01-01,100 Grand Bar,3431,0,1543
1,2015-01-01,Anonymous Brown Globs That Come In Black And O...,551,0,4781
2,2015-01-01,Any Full-sized Candy Bar,4978,0,375
3,2015-01-01,Black Jacks,339,0,4300
4,2015-01-01,Bonkers,534,0,3914
5,2015-01-01,Bottle Caps,2062,0,2972
6,2015-01-01,Box' O' Raisins,583,0,4643
7,2015-01-01,Brach Products,1110,0,3822
8,2015-01-01,Broken Glow Stick,102,0,5178
9,2015-01-01,Bubble Gum,1388,0,3783


What I find interesting is that there are no "MEH" ratings for candies in 2015. I will look into the unaltered `candies_long` DataFrame to see if there are any "MEH" ratings for 2015.

In [72]:
candies_long.loc[(candies_long['rating'] == 'MEH') & (candies_long['year'] == '2015')]

Unnamed: 0,year,candy,rating


Confirmed! No candy received a "MEH" rating in 2015. Since every candy shows zero, the more likely explanation is that the 2015 survey did not offer "MEH" as a choice, rather than it simply being a great year for candy.

I already combined the yearly datasets in an earlier step (when concatenating all years of candy data). Here I merge the ratings from all years by candy name to see the overall popularity of each candy.

In [73]:
popularity.groupby('candy').agg(
    {'JOY': 'sum', 
     'MEH': 'sum', 
     'DESPAIR': 'sum'
     }
# Resetting index to turn 'candy' back into a column
).reset_index().sort_values(by='JOY', ascending=False).head(10)

rating,candy,JOY,MEH,DESPAIR
3,Any Full-sized Candy Bar,7589,387,407
81,Reese's Peanut Butter Cups,7369,332,685
48,Kit Kat,7252,514,597
93,Snickers,6991,509,833
107,Twix,6990,497,768
17,"Cash, Or Other Forms Of Legal Tender",6814,457,1073
92,Smarties,6718,2042,6629
73,Peanut M&m's,6456,622,1246
105,Tolberone Something Or Other,6176,604,1341
53,Lindt Truffle,6174,613,1358


### Chapter 11 - Generate date range

I will filter to just the candies that were rated in 2015 as my date range example.
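For a filter that spans more than one year, an explicit date range could be generated with `pd.date_range`. This is a sketch on made-up counts and an alternative I am assuming fits the exercise, not what the cell below does.

```python
import pandas as pd

# Toy popularity table; values are illustrative only
toy = pd.DataFrame({
    "year": pd.to_datetime(["2015", "2016", "2017"], format="%Y"),
    "candy": ["Twix", "Twix", "Twix"],
    "JOY": [3, 2, 1],
})

# One timestamp per year start between the two endpoints (inclusive)
survey_years = pd.date_range(start="2015-01-01", end="2016-01-01", freq="YS")

# Keep only rows whose year falls in the generated range
subset = toy[toy["year"].isin(survey_years)]
print(subset)
```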

In [74]:
popularity_2015 = popularity[popularity["year"] == pd.to_datetime("2015-01-01")].sort_values("candy")

popularity_2015

rating,year,candy,JOY,MEH,DESPAIR
0,2015-01-01,100 Grand Bar,3431,0,1543
1,2015-01-01,Anonymous Brown Globs That Come In Black And O...,551,0,4781
2,2015-01-01,Any Full-sized Candy Bar,4978,0,375
3,2015-01-01,Black Jacks,339,0,4300
4,2015-01-01,Bonkers,534,0,3914
...,...,...,...,...,...
87,2015-01-01,"Vials Of Pure High Fructose Corn Syrup, For Ma...",814,0,4255
88,2015-01-01,Vicodin,2506,0,2486
89,2015-01-01,White Bread,268,0,4695
90,2015-01-01,Whole Wheat Anything,444,0,4503
