## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [2]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

### You Try - 4 Warm-Up Problems From Lecture

## You Try

Run the cell below to get your data. Then update the column and index names, using the renaming methods above, so that they are consistent and easy to use. Your choice for how you want the final labels to be!

In [3]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["red green", "Blue_green", "green  "],
                    columns=["ONE", "two", "3", "Four"])
data

Unnamed: 0,ONE,two,3,Four
red green,0,1,2,3
Blue_green,4,5,6,7
green,8,9,10,11


In [4]:
# Your code here
data.rename(columns = {'ONE':'1', 'two':'2', 'Four': '4'}, inplace=True)
data.rename(index = {'red green':'RED GREEN', 'Blue_green': 'BLUE GREEN', 'green  ': 'GREEN'}, inplace=True)  # 'green  ' has two trailing spaces
data


Unnamed: 0,1,2,3,4
RED GREEN,0,1,2,3
BLUE GREEN,4,5,6,7
GREEN,8,9,10,11
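
Dictionary-based renaming only matches labels exactly, so stray whitespace or mixed case is easy to miss. As a minimal sketch of an alternative, the labels can be normalized programmatically with string methods. The names `clean_index`, `word_to_digit`, and `cleaned` are made up for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["red green", "Blue_green", "green  "],
                    columns=["ONE", "two", "3", "Four"])

# Normalize the index: trim whitespace, unify separators, upper-case everything
clean_index = data.index.str.strip().str.replace("_", " ").str.upper()

# Map word-style column names to digits; labels not in the dict are kept as-is
word_to_digit = {"one": "1", "two": "2", "four": "4"}  # hypothetical helper dict
clean_columns = [word_to_digit.get(c.lower(), c) for c in data.columns]

# set_axis returns a new DataFrame, so the original stays available for comparison
cleaned = data.set_axis(clean_index, axis=0).set_axis(clean_columns, axis=1)
print(cleaned)
```

Because the rule is applied to every label, the trailing spaces on `"green  "` are handled without having to spell out the exact key.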


-------------------------------------
## You Try

Run the cell below to create a random list of numbers to represent ages in your population. Then make up your own age range categories (at least 5) and use `.cut()` to break the data into discrete categories. Create a data frame that contains the age, the age range, and the age category code.


In [5]:
ages = [np.random.randint(15,100) for i in range(40)]

In [6]:
# Your code here
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 45, 60, 100]
age_categories = pd.cut(ages, bins)
age_categories

data = pd.DataFrame()
data['age'] = ages
data['range'] = age_categories
data['category_code'] = age_categories.codes
data

Unnamed: 0,age,range,category_code
0,20,"(18, 25]",0
1,22,"(18, 25]",0
2,25,"(18, 25]",0
3,27,"(25, 35]",1
4,21,"(18, 25]",0
5,23,"(18, 25]",0
6,37,"(35, 45]",2
7,31,"(25, 35]",1
8,61,"(60, 100]",4
9,45,"(35, 45]",2
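
The interval notation `(18, 25]` means the right edge is included and the left edge is not. The sketch below uses the same ages and bins, adds readable labels, and checks what happens to an age outside every bin. The label names and the `ages_df` variable are assumptions for illustration only.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 45, 60, 100]
labels = ["18-25", "26-35", "36-45", "46-60", "61-100"]  # hypothetical label names

# With a plain list as input, pd.cut returns a Categorical, which has .codes
age_range = pd.cut(ages, bins, labels=labels)

ages_df = pd.DataFrame({
    "age": ages,
    "range": age_range,
    "category_code": age_range.codes,
})
print(ages_df)

# An age at or below the first edge falls outside every bin: NaN, with code -1
outside = pd.cut([15], bins)
print(outside.codes)
```

This matters for the randomly generated ages, since `np.random.randint(15, 100)` can produce values from 15 to 18 that these bins would not cover.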


-------------------------------------------------------
## You Try

Explain in great detail what each of the lines in the cell did to both create and then update the data frame.

In [7]:
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
display(data)

data[data.abs() > 1] = np.sign(data) *2
display(data)

Unnamed: 0,0,1,2,3
0,0.821321,-0.482852,0.068016,-1.652833
1,-0.669141,-0.550383,-1.138177,0.593453
2,-1.323633,-1.414650,1.990465,0.376296
3,0.246416,0.531270,-1.627804,0.649120
4,-0.650241,-0.146735,0.660337,-1.423949
...,...,...,...,...
995,1.671990,0.548909,0.132020,-1.259460
996,0.454737,-0.215387,0.620800,-1.052695
997,-1.262082,-1.173232,-1.083678,-0.047392
998,-0.504418,0.546393,-0.216549,-1.254691


Unnamed: 0,0,1,2,3
0,0.821321,-0.482852,0.068016,-2.000000
1,-0.669141,-0.550383,-2.000000,0.593453
2,-2.000000,-2.000000,2.000000,0.376296
3,0.246416,0.531270,-2.000000,0.649120
4,-0.650241,-0.146735,0.660337,-2.000000
...,...,...,...,...
995,2.000000,0.548909,0.132020,-2.000000
996,0.454737,-0.215387,0.620800,-2.000000
997,-2.000000,-2.000000,-2.000000,-0.047392
998,-0.504418,0.546393,-0.216549,-2.000000


## Your WORDS here
**data = pd.DataFrame(np.random.standard_normal((1000, 4)))**
So this line generates a 1000 x 4 matrix of random numbers from the standard normal distribution. It also converts the NumPy array into a Pandas DataFrame.

**display(data)**
This line displays the DataFrame so that you can see the original numbers. Because the table has 1000 rows, the notebook shows only the first and last few rows.

**data[data.abs() > 1] = np.sign(data) * 2**
`data.abs()` returns a DataFrame with the absolute value of every element, and comparing it to 1 creates a boolean mask that is True wherever a value's absolute value is greater than one. On the right side, `np.sign(data)` gives the sign of each number (+1 if the number is positive, -1 if it is negative, and 0 if it is zero), and multiplying by 2 turns those into +2, -2, or 0. The assignment overwrites only the cells where the mask is True, so every value with absolute size greater than 1 is replaced by either +2 or -2, depending on its sign. Values between -1 and 1 are left unchanged.

**display(data)**
This displays the updated DataFrame so it can be compared with the original.
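
To make the masking step concrete, here is a small sketch on a 3 x 2 DataFrame with hand-picked values instead of random ones. The variable names `small`, `mask`, and `signs` are made up for the example.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"a": [0.5, -1.5, 2.3], "b": [-0.2, 1.1, -3.0]})

mask = small.abs() > 1   # True where the magnitude exceeds 1
signs = np.sign(small)   # +1.0, -1.0, or 0.0 for each cell

print(mask)
print(signs)

# Only the cells where mask is True are overwritten with +2 or -2
small[mask] = signs * 2
print(small)
```

The values 0.5 and -0.2 stay as they were, while -1.5, 2.3, 1.1, and -3.0 become -2, 2, 2, and -2.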

-------------------------------
## You Try

Break the following string up into a list of strings using string manipulation functions. See if you can create a list like this:

    ['Joanna','Bieri','Redlands','Keep up the good work!']

Try to get all the capitals and spacing correct!

NOTE - lots of different processes will result in this final list, there is not one right way to do this!

In [8]:
a_string = 'joanna_bieri@redlands.edu says:   Keep up the good work!'

In [9]:
# Your code here
my_list = []
for st in a_string.split('@'):
    if "joanna" in st:
        my_list.append(st.split('_')[0].title())
        my_list.append(st.split('_')[1].title())
    else:
        my_list.append(st.split('.')[0].title())
        my_list.append(st.split('says: ')[1].strip())
        
my_list


['Joanna', 'Bieri', 'Redlands', 'Keep up the good work!']
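
Since many approaches give the same list, here is one alternative sketch that uses a single regular expression instead of a loop. The pattern assumes the string always has the shape `first_last@place.tld says: message`, which is an assumption based on this one example.

```python
import re

a_string = 'joanna_bieri@redlands.edu says:   Keep up the good work!'

# Capture first name, last name, domain name, and the message after "says:"
pattern = r"([a-z]+)_([a-z]+)@([a-z]+)\.[a-z]+ says:\s*(.*)"
first, last, place, message = re.match(pattern, a_string).groups()

my_list = [first.title(), last.title(), place.title(), message.strip()]
print(my_list)
```

The `\s*` after `says:` absorbs the extra spaces, so no separate `.strip()` on the left side is needed.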

---------------
## Data Cleaning and Preparation - Day4 HW

## Homework 4

Run the cell below to download the Kaggle dataset.

1. Load the data into Pandas
2. Quickly describe what you see from a data science perspective (vars, observations, concerns with formatting, etc)
3. How many NaNs are in the data? What do the [] mean in the data? Should you remove all rows with NaN? Are there any duplicate rows?
4. Using the dictionary given here, add a genre column using the .map() command.

```{python}
artist_to_genre = {
    "Taylor Swift": "Pop / Country",
    "Beyoncé": "R&B / Pop",
    "Madonna": "Pop",
    "Pink": "Pop Rock",
    "Celine Dion": "Adult Contemporary",
    "Lady Gaga": "Pop / Dance",
    "Katy Perry": "Pop",
    "Cher": "Pop / Disco",
    "Adele": "Soul / Pop"
}
```
5. Bin the number of shows into 'high','medium','low'. Your choice on what the cutoffs should be. Add columns with the cutoffs and the codes.
6. Create dummy variables for the genres. Separate the text by the / symbol
7. Remove the $ from the money columns and turn these into integers.
8. Remove all other special characters from the data.
9. Separate the Year(s) column into two "Tour Start" and "Tour End"
10. Save the final data as a pickle.

    
------------------------------------

Your final notebooks should:

- [ ] Be a completely new notebook with just the Day4 stuff in it: Read in the data, clean it up and save it. 
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] It should run without errors from start to finish.


In [10]:
import kagglehub

# Data was created by scraping: https://en.wikipedia.org/wiki/List_of_highest-grossing_concert_tours_by_women
path = kagglehub.dataset_download("amruthayenikonda/dirty-dataset-to-practice-data-cleaning")

print("Path to dataset files:", path)

Path to dataset files: /Users/admin/.cache/kagglehub/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning/versions/1


In [11]:
# If this gives an error you might have to copy and paste the path from above
# then update the \ with / to make it work
os.listdir(path)

['my_file (1).csv']

**1. Load the data into Pandas**

In [12]:
file_path = os.path.join(path, "my_file (1).csv")
df = pd.read_csv(file_path)


**2. Quickly describe what you see from a data science perspective (vars, observations, concerns with formatting, etc)**


In [13]:
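print(df.shape)  # (rows, columns); print it, since only the last expression is displayed
df.info()        # dtypes and non-null counts (note the parentheses)
df.head()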

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1]
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3]
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6]
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7]
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8]


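What I did: Printed df.shape, df.info() and df.head() to see rows/cols, dtypes, non-null counts, and a preview. From the preview, the money columns are stored as strings with $ and commas, several columns carry bracketed markers like [4], the tour titles include symbols such as † and ‡, and Year(s) holds either a single year or a range.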

**3. How many NaNs are in the data? What do the [] mean in the data? Should you remove all rows with NaN? Are there any duplicate rows?**


In [14]:
df.isna().sum()

Rank                                 0
Peak                                11
All Time Peak                       14
Actual gross                         0
Adjusted gross (in 2022 dollars)     0
Artist                               0
Tour title                           0
Year(s)                              0
Shows                                0
Average gross                        0
Ref.                                 0
dtype: int64

In [15]:
df.duplicated().sum()

np.int64(0)

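What I did: Used df.isna().sum() to count nulls (11 in Peak and 14 in All Time Peak, 25 in total) and df.duplicated().sum() to count duplicate rows (there are none). The [] entries look like footnote markers carried over from the scraped Wikipedia table. I would not drop every row with a NaN, because the missing values are only in the two peak columns and the rest of each row is still usable.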
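
As a rough sketch of how to weigh dropping rows, the snippet below counts missing values three ways on a tiny hand-built sample (four rows loosely modeled on the preview, not the real file) and drops rows only when a column that is actually needed is missing.

```python
import numpy as np
import pandas as pd

# Hand-built sample for illustration; not the full dataset
sample = pd.DataFrame({
    "Rank": [1, 2, 5, 7],
    "Peak": ["1", "1", "2[4]", np.nan],
    "All Time Peak": ["2", "7[2]", np.nan, np.nan],
    "Artist": ["Taylor Swift", "Beyoncé", "Taylor Swift", "Pink"],
})

per_column = sample.isna().sum()                       # NaNs in each column
total_nans = int(per_column.sum())                     # NaNs in the whole table
rows_with_nan = int(sample.isna().any(axis=1).sum())   # rows dropna() would remove

# Drop rows only if a column we rely on is missing
kept = sample.dropna(subset=["Artist"])

print(per_column)
print(total_nans, rows_with_nan, len(kept))
```

Here a plain `dropna()` would remove half the sample, while restricting it with `subset=` keeps every row.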

**4. Using the dictionary given here, add a genre column using the .map() command.**


In [16]:
artist_to_genre = {
    "Taylor Swift": "Pop / Country",
    "Beyoncé": "R&B / Pop",
    "Madonna": "Pop",
    "Pink": "Pop Rock",
    "Celine Dion": "Adult Contemporary",
    "Lady Gaga": "Pop / Dance",
    "Katy Perry": "Pop",
    "Cher": "Pop / Disco",
    "Adele": "Soul / Pop"
}
df['Genre'] = df['Artist'].map(artist_to_genre)
df.head()                

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,Genre
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country
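
One thing worth checking after `.map()` is whether any artist was missing from the dictionary, because those rows silently become NaN. A minimal sketch, using a shortened dictionary and a made-up artist list in which "Shakira" is a hypothetical unmapped name:

```python
import pandas as pd

artist_to_genre = {"Taylor Swift": "Pop / Country", "Madonna": "Pop"}  # shortened for the example

# "Shakira" is a hypothetical artist that is not in the dictionary
artists = pd.Series(["Taylor Swift", "Madonna", "Shakira"], name="Artist")

genre = artists.map(artist_to_genre)

# Artists with no dictionary entry end up as NaN after map()
unmapped = artists[genre.isna()].unique().tolist()
print(genre)
print("Unmapped artists:", unmapped)
```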


**5. Bin the number of shows into 'high','medium','low'. Your choice on what the cutoffs should be. Add columns with the cutoffs and the codes.**

In [17]:
df['Shows_Bin'] = pd.cut(df['Shows'], bins=[0, 50, 100, float('inf')], labels=['Low','Medium','High'])

In [18]:
df.head()

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,Genre,Shows_Bin
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country,Medium
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop,Medium
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop,Medium
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock,High
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country,Medium


What I did: Used pd.cut() to bin Shows into three labeled categories: Low (up to 50 shows), Medium (51 to 100), and High (more than 100).
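
The question also asks for columns with the cutoffs and the codes. A possible sketch, using the Shows values from the first ten rows and the same bin edges; the column names `Shows_Range` and `Shows_Code` are made up here:

```python
import pandas as pd

shows = pd.Series([56, 56, 85, 156, 53, 88, 131, 41, 49, 85], name="Shows")
edges = [0, 50, 100, float("inf")]

binned = pd.DataFrame({"Shows": shows})
binned["Shows_Range"] = pd.cut(shows, bins=edges)  # interval cutoffs, e.g. (50.0, 100.0]
binned["Shows_Bin"] = pd.cut(shows, bins=edges, labels=["Low", "Medium", "High"])
binned["Shows_Code"] = binned["Shows_Bin"].cat.codes  # 0 = Low, 1 = Medium, 2 = High

print(binned)
```

For a Series, the integer codes come from the `.cat.codes` accessor rather than from `.codes` directly.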

**6. Create dummy variables for the genres. Separate the text by the / symbol**

In [19]:
dummies = df['Genre'].str.get_dummies(' / ')
dummies

Unnamed: 0,Adult Contemporary,Country,Dance,Disco,Pop,Pop Rock,R&B,Soul
0,0,1,0,0,1,0,0,0
1,0,0,0,0,1,0,1,0
2,0,0,0,0,1,0,0,0
3,0,0,0,0,0,1,0,0
4,0,1,0,0,1,0,0,0
5,0,0,0,0,1,0,0,0
6,1,0,0,0,0,0,0,0
7,0,0,0,0,0,1,0,0
8,0,0,0,0,1,0,1,0
9,0,1,0,0,1,0,0,0


What I did: Used df['Genre'].str.get_dummies(' / ') to one-hot encode multiple genres per row.
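
The dummy columns above live in their own DataFrame. If they should travel with the rest of the data, one option is to prefix and join them, sketched here on a three-row sample. The `Genre_` prefix and the `combined` name are choices made for this example.

```python
import pandas as pd

genres = pd.DataFrame({
    "Artist": ["Taylor Swift", "Beyoncé", "Pink"],
    "Genre": ["Pop / Country", "R&B / Pop", "Pop Rock"],
})

# Split on " / " so "Pop / Country" sets both the Pop and Country columns to 1
dummies = genres["Genre"].str.get_dummies(" / ").add_prefix("Genre_")

# join() aligns on the index, so each row keeps its own indicator values
combined = genres.join(dummies)
print(combined)
```

Note that "Pop Rock" stays a single genre because it contains no ` / ` separator.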

**7. Remove the $ from the money columns and turn these into integers.**

In [20]:
df.columns = [c.replace('\xa0', ' ') for c in df.columns]
for col in df.columns:
    print(repr(col))

'Rank'
'Peak'
'All Time Peak'
'Actual gross'
'Adjusted gross (in 2022 dollars)'
'Artist'
'Tour title'
'Year(s)'
'Shows'
'Average gross'
'Ref.'
'Genre'
'Shows_Bin'


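What I did: First replaced any non-breaking spaces in the column names so the money columns can be selected by name. Then stripped currency symbols/commas from the money columns and cast them to integer.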

In [21]:
df

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,Genre,Shows_Bin
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country,Medium
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop,Medium
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop,Medium
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock,High
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country,Medium
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9],Pop,Medium
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11],Adult Contemporary,High
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927",[12],Pop Rock,Low
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13],R&B / Pop,Low
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14],Pop / Country,Medium


In [None]:
# Strip bracketed footnotes like "[4]", the "$" sign, and thousands separators, then cast to int
money_cols = ["Actual gross", "Adjusted gross (in 2022 dollars)", "Average gross"]
for col in money_cols:
    df[col] = (
        df[col].astype(str)
        .str.replace(r"\[[^\]]*\]", "", regex=True)  # regex=True is required for patterns
        .str.replace(r"[$,]", "", regex=True)         # literal "$" and "," inside a character class
        .str.strip()
        .astype(int)
    )

df

**8. Remove all other special characters from the data.**

In [30]:
import re
# Python's str.replace() is literal, so use re.sub() for the pattern; non-string cells are left alone
df = df.map(lambda x: re.sub(r'[^A-Za-z0-9 ]', '', x) if isinstance(x, str) else x)
df.head()

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,Genre,Shows_Bin
0,1,1,2.0,780000000,780000000,Taylor Swift,The Eras Tour,20232024,56,13928571,1,Pop Country,Medium
1,2,1,72.0,579800000,579800000,Beyonc,Renaissance World Tour,2023,56,10353571,3,RB Pop,Medium
2,3,14,25.0,411000000,560622615,Madonna,Sticky Sweet Tour 4a,20082009,85,4835294,6,Pop,Medium
3,4,27,107.0,397300000,454751555,Pink,Beautiful Trauma World Tour,20182019,156,2546795,7,Pop Rock,High
4,5,24,,345675146,402844849,Taylor Swift,Reputation Stadium Tour,2018,53,6522173,8,Pop Country,Medium


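What I did: Removed every character that is not a letter, digit, or space from the text fields. This has side effects that show up in the output: footnote digits are merged into the values (1[4] becomes 14 in Peak), accented letters are dropped (Beyoncé becomes Beyonc), and the dash in Year(s) disappears (2023–2024 becomes 20232024).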
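
One way to avoid merging footnote digits into the data is to delete the bracketed markers first and only then strip the remaining symbols. This is a sketch on a hand-built sample, not the method used above; the names `no_refs` and `cleaned` are made up.

```python
import pandas as pd

# Hand-built sample modeled on the preview rows
sample = pd.DataFrame({
    "Peak": ["1", "1[4]", "2[7]"],
    "Tour title": ["The Eras Tour †", "Sticky & Sweet Tour ‡[4][a]",
                   "Beautiful Trauma World Tour"],
})

# 1) Drop bracketed footnote markers such as "[4]" or "[a]" before anything else
no_refs = sample.apply(lambda col: col.str.replace(r"\[[^\]]*\]", "", regex=True))

# 2) Remove remaining symbols, then collapse repeated spaces and trim the ends
cleaned = no_refs.apply(
    lambda col: col.str.replace(r"[^\w\s]", "", regex=True)
                   .str.replace(r"\s+", " ", regex=True)
                   .str.strip()
)

# With the footnotes gone, Peak converts cleanly to integers
cleaned["Peak"] = cleaned["Peak"].astype(int)
print(cleaned)
```

With this order, `1[4]` ends up as 1 rather than 14. Using `\w` instead of `A-Za-z0-9` should also keep accented letters, since `\w` matches Unicode letters in Python.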

**9. Separate the Year(s) column into two "Tour Start" and "Tour End"**

In [46]:
df["Tour Start"] = df["Year(s)"].astype(str).str[:4].astype(int)
df["Tour End"]   = df["Year(s)"].astype(str).str[-4:].astype(int)
df[["Year(s)", "Tour Start", "Tour End"]].head()

Unnamed: 0,Year(s),Tour Start,Tour End
0,20232024,2023,2024
1,2023,2023,2023
2,20082009,2008,2009
3,20182019,2018,2019
4,2018,2018,2018


What I did: Parsed 4-digit years from Year(s) and filled Tour Start/Tour End (single-year rows use the same year twice).
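
Slicing the first and last four characters works because every entry starts and ends with a year. An alternative sketch that looks for the years themselves, and so handles values with or without the dash, could look like this (variable names are made up):

```python
import pandas as pd

years = pd.Series(["2023–2024", "2023", "20082009"], name="Year(s)")

# Find every 4-digit group; a single-year row yields a one-element list
found = years.str.findall(r"\d{4}")

tour_start = found.str[0].astype(int)   # first year found
tour_end = found.str[-1].astype(int)    # last year found (same as first for single years)

print(pd.DataFrame({"Year(s)": years, "Tour Start": tour_start, "Tour End": tour_end}))
```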

**10. Save the final data as a pickle.**

In [48]:
df.to_pickle("final_data.pkl")

What I did: Saved the cleaned DataFrame to a pickle for fast reloads.
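
A quick way to confirm the pickle worked is to read it back and compare. The sketch below does this on a one-row stand-in DataFrame written to a temporary folder, so it does not touch `final_data.pkl`; the frame contents are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned data, including a categorical column like Shows_Bin
frame = pd.DataFrame({
    "Artist": ["Pink"],
    "Shows": [156],
    "Shows_Bin": pd.Categorical(["High"], categories=["Low", "Medium", "High"], ordered=True),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "final_data.pkl")
    frame.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# Unlike a CSV round trip, the pickle keeps dtypes such as category
print(restored.dtypes)
print(restored.equals(frame))
```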