<a href="https://colab.research.google.com/github/minjikim13/career-event-attendance-cohort-analysis/blob/main/Career_Event.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning – Event Attendance

This notebook focuses on cleaning and standardising raw event registration data in preparation for cohort analysis.

Key steps include:
- Removing cancelled registrations
- Deduplicating multiple ticket purchases per person
- Standardising country and university names
- Encoding attendance as a binary variable (1 = attended, 0 = no-show)
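
The four steps above can be sketched as one function. This is a minimal illustration on invented toy rows, not the notebook's actual pipeline: the helper name `clean_registrations` and the toy data are assumptions, and the country step here only trims and title-cases, whereas the notebook uses a hand-written mapping.

```python
import pandas as pd


def clean_registrations(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a registration table (illustrative only)."""
    # 1. Remove cancelled registrations.
    out = df[df["Attendee Status"] != "Not Attending"].copy()
    # 2. Collapse repeated tickets for the same name within the same order.
    out = out.drop_duplicates(subset=["Order #", "First Name", "Last Name"])
    # 3. Standardise a free-text column: trim whitespace, then title-case.
    out["Country_Clean"] = out["Which country are you from?"].str.strip().str.title()
    # 4. Encode attendance: 1 = checked in, 0 = registered but did not check in.
    out["Attended"] = (out["Attendee Status"] == "Checked In").astype(int)
    return out


# Toy rows, invented for this example.
toy = pd.DataFrame({
    "Order #": [1, 1, 2, 3],
    "First Name": ["Ana", "Ana", "Bo", "Cy"],
    "Last Name": ["Lee", "Lee", "Kim", "Roy"],
    "Which country are you from?": [" china", " china", "India", "Nepal"],
    "Attendee Status": ["Checked In", "Checked In", "Attending", "Not Attending"],
})

cleaned = clean_registrations(toy)
print(cleaned[["First Name", "Country_Clean", "Attended"]])
```

The cancelled row and the repeated ticket are dropped, leaving one row per registrant per order.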


In [1]:
from google.colab import files
uploaded = files.upload()

Saving Raw_data.xlsx to Raw_data.xlsx


In [2]:
import pandas as pd

# Each row is one ticket, so this count still includes cancelled and duplicate registrations.
df = pd.read_excel("Raw_data.xlsx")
print(f"Total {len(df)} people!")

Total 261 people!


In [3]:
df.head(5)

| | Order # | Order Date | First Name | Last Name | Which country are you from? | Attendee Status | Which university/private college are you from? | What is your area of study? | Do you have any particular question for our panelists? |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13587464393 | 2025-10-27 13:13:13 | Yanlin | Zhu | China | Attending | Unsw | | |
| 1 | 13588151753 | 2025-10-27 17:24:53 | Sneha | Nair | India | Attending | Macquarie university | IT | |
| 2 | 13588396273 | 2025-10-27 19:16:29 | arnela | tolic | bosnia | Attending | flinders university of south australia | law | I like to offer free guidance and checklist fo... |
| 3 | 13593597953 | 2025-10-28 18:03:52 | Sabina | Sariyeva | Kazakhstan | Attending | UNSW | Environmental Management | |
| 4 | 13593710383 | 2025-10-28 18:41:42 | Naveen Sanjay | Baskaran | India | Attending | Unsw | | |


In [5]:
print("Which columns are included?")
print()

for i, name in enumerate(df.columns, 1):
    print(f"{i},{name}")

Which columns are included?

1,Order #
2,Order Date
3,First Name
4,Last Name
5,Which country are you from?
6,Attendee Status
7,Which university/private college are you from?
8,What is your area of study?
9,Do you have any particular question for our panelists?


In [6]:
print("Attendee Status:")
print(df['Attendee Status'].value_counts())

Attendee Status:
Attendee Status
Attending        168
Checked In        63
Not Attending     30
Name: count, dtype: int64


In [7]:
print("Countries (TOP 50):")
print(df['Which country are you from?'].value_counts().head(50))

Countries (TOP 50):
Which country are you from?
India                    43
China                    38
Nepal                    20
Australia                18
Indonesia                14
Mongolia                  9
Philippines               6
Vietnam                   5
Malaysia                  5
Thailand                  4
china                     4
Colombia                  4
Bangladesh                4
Viet Nam                  3
Singapore                 3
Japan                     3
Russia                    3
Hong Kong                 3
South Korea               3
Uganda                    2
Brazil                    2
Taiwan                    2
Cambodia                  2
Kazakhstan                1
Nigeria/ UK               1
MEXICO                    1
bosnia                    1
Nz                        1
Monglia                   1
Sweden                    1
Nigeria                   1
bangladesh                1
Sri Lanka                 1
Sydney                    1
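
The counts above contain case variants ("china"), spacing variants ("Viet Nam") and typos ("Monglia"). One way to surface candidates for a fix table is fuzzy matching with the standard library's `difflib`. This is a sketch under assumptions: the `known` list is a hypothetical reference list (a real one would cover all countries), and the `0.8` cutoff is an arbitrary starting point. Suggestions still need manual review, and entries such as "Sydney" or "Nz" will not be caught this way.

```python
from difflib import get_close_matches

# A few raw labels copied from the value counts above.
raw_labels = ["India", "China", "china", "Mongolia", "Monglia",
              "Vietnam", "Viet Nam", "MEXICO", "Nz"]

# Hypothetical reference list, kept short for the example.
known = ["India", "China", "Mongolia", "Vietnam", "Mexico", "New Zealand"]
known_by_lower = {name.lower(): name for name in known}

suggestions = {}
for label in raw_labels:
    key = label.strip().lower()
    if key in known_by_lower:
        # Differs from a known name only by case or surrounding whitespace.
        suggestions[label] = known_by_lower[key]
        continue
    # Otherwise take the closest known name, if it is similar enough.
    match = get_close_matches(key, list(known_by_lower), n=1, cutoff=0.8)
    suggestions[label] = known_by_lower[match[0]] if match else None

for label, suggestion in suggestions.items():
    print(f"{label!r} -> {suggestion!r}")
```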


In [8]:
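# keep=False flags every row whose (First Name, Last Name) pair appears more than once,
# so the count printed below is a number of rows, not of distinct people.
duplicates = df[df.duplicated(subset=['First Name', 'Last Name'], keep=False)]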

print(f"duplicate ppl : {len(duplicates)} ppl")
print()
print("example (10):")
print(duplicates[['Order #', 'First Name', 'Last Name']].head(10))

duplicate ppl : 97 ppl

example (10):
        Order #     First Name Last Name
4   13593710383  Naveen Sanjay  Baskaran
5   13593710383  Naveen Sanjay  Baskaran
6   13593710383  Naveen Sanjay  Baskaran
7   13593710383  Naveen Sanjay  Baskaran
8   13593944323        Adithya     Kumar
9   13593944323        Adithya     Kumar
12  13640299073        Adithya     Kumar
13  13640299073        Adithya     Kumar
14  13640299073        Adithya     Kumar
20  13652666903            Het       Gor


In [9]:
df_clean = df[df['Attendee Status'] != 'Not Attending'].copy()

print(f"Before: {len(df)}ppl")
print(f"After: {len(df_clean)}ppl")
print(f"Removed: {len(df) - len(df_clean)}ppl")

Before: 261ppl
After: 231ppl
Removed: 30ppl


In [10]:
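n_before = len(df_clean)  # row count after removing cancelled registrations

# Collapse repeated tickets: same order number and same name.
df_clean = df_clean.drop_duplicates(
    subset=["Order #", "First Name", "Last Name"]
)

print(f"Before: {n_before} ppl")
print(f"After: {len(df_clean)} ppl")
print(f"Duplicates removed: {n_before - len(df_clean)} ppl")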

Before: 231 ppl
After: 188 ppl
Duplicates removed: 43 ppl
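
Because the subset includes `Order #`, a person who registered under two different orders (as the duplicate listing earlier shows for one attendee) still keeps one row per order. If one row per person is wanted instead, a stricter rule could look like the sketch below. The toy rows are invented, the rule "prefer a Checked In row" is an assumption rather than the notebook's method, and matching on name alone can merge two different people who share a name.

```python
import pandas as pd

# Toy rows, invented for this example: "Adi K" registered under two orders.
toy = pd.DataFrame({
    "Order #": [101, 101, 202, 303],
    "First Name": ["Adi", "Adi", "Adi", "Mei"],
    "Last Name": ["K", "K", "K", "Lin"],
    "Attendee Status": ["Attending", "Attending", "Checked In", "Attending"],
})

# Same subset as the cell above: collapses repeats inside one order only.
per_order = toy.drop_duplicates(subset=["Order #", "First Name", "Last Name"])

# Stricter alternative: one row per name, preferring a "Checked In" row if one exists.
per_person = (
    toy.assign(_checked_in=toy["Attendee Status"].eq("Checked In"))
       .sort_values("_checked_in", ascending=False, kind="stable")
       .drop_duplicates(subset=["First Name", "Last Name"])
       .drop(columns="_checked_in")
       .sort_index()
)

print(len(per_order), len(per_person))
```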


In [11]:
country_raw = df_clean["Which country are you from?"]
country_norm = country_raw.str.strip().str.lower()

country_fix = {
    "china": "China",
    "viet nam": "Vietnam",
    "vietnamese": "Vietnam",
    "sydney": "Australia",
    "south korea": "Korea",
    "s.korea": "Korea",
    "nigeria/ uk": "Nigeria",
    "monglia": "Mongolia",
    "nz": "New Zealand",
    "hong kong sar (china)": "Hong Kong",
    "bangladesh": "Bangladesh",
}

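# Note: only values listed in country_fix are re-cased; unmapped values stay
# lowercase (e.g. "india"), so "vietnam" and "Vietnam" remain separate labels.
df_clean["Country_Clean"] = country_norm.replace(country_fix)

# "Before" counts come from the unfiltered df and "After" from df_clean,
# so the two tables cover different sets of rows.
print("Before:")
print(df["Which country are you from?"].value_counts().head(30))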
print()
print("After:")
print(df_clean["Country_Clean"].value_counts().head(30))

Before:
Which country are you from?
India          43
China          38
Nepal          20
Australia      18
Indonesia      14
Mongolia        9
Philippines     6
Vietnam         5
Malaysia        5
Thailand        4
china           4
Colombia        4
Bangladesh      4
Viet Nam        3
Singapore       3
Japan           3
Russia          3
Hong Kong       3
South Korea     3
Uganda          2
Brazil          2
Taiwan          2
Cambodia        2
Kazakhstan      1
Nigeria/ UK     1
MEXICO          1
bosnia          1
Nz              1
Monglia         1
Sweden          1
Name: count, dtype: int64

After:
Country_Clean
China          36
india          30
australia      15
nepal          14
indonesia      13
mongolia        6
Bangladesh      5
philippines     5
vietnam         4
malaysia        4
Korea           4
colombia        4
singapore       3
Vietnam         3
thailand        3
cambodia        2
taiwan          2
hong kong       2
brazil          2
russia          2
japan           
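
The "After" table mixes title-case and lowercase labels, and "vietnam" and "Vietnam" are counted separately, because the mapping only re-cases the values it lists. One possible adjustment, sketched here on a handful of values modelled on the raw counts, is to keep the fix table entirely lowercase and apply the casing once at the end. The name `fix_lower` is hypothetical, and `.str.title()` is a simplification that would mis-case acronyms such as "UK".

```python
import pandas as pd

# Values modelled on the raw counts above.
raw = pd.Series(["India", "china", "Viet Nam", "Vietnam", "Nz", "MEXICO"])

# Keys and values are both lowercase, so casing is decided in one place.
fix_lower = {"viet nam": "vietnam", "nz": "new zealand"}

country_clean = (
    raw.str.strip()
       .str.lower()
       .replace(fix_lower)  # exact whole-value replacement
       .str.title()         # single casing pass over every value
)

print(country_clean.value_counts())
```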

In [12]:
uni_raw = df_clean["Which university/private college are you from?"]
uni_norm = uni_raw.str.strip().str.lower()

uni_fix = {
    # UTS
    "uts": "UTS",
    "university of technology sydney": "UTS",
    "university of technology sydeny": "UTS",
    "university of technology sydney (uts)": "UTS",

    # UNSW
    "unsw": "UNSW",
    "unsw sydney": "UNSW",
    "university of new south wales": "UNSW",

    # USYD
    "university of sydney": "USYD",
    "the university of sydney": "USYD",
    "usyd": "USYD",
    "university of sydeny": "USYD",

    # Macquarie
    "macquarie university": "Macquarie University",
    "macquarie": "Macquarie University",

    # Melbourne Institute of Technology
    "melbourne institute of technology": "Melbourne Institute of Technology",
    "mit": "Melbourne Institute of Technology",
    "mit sydney": "Melbourne Institute of Technology",
    "melbourne institute of technology (sydney campus)": "Melbourne Institute of Technology",

    # Charles Darwin University
    "charles darwin university": "Charles Darwin University",
    "cdu": "Charles Darwin University",
    "charles darwin university sydney campus": "Charles Darwin University",

    # Australian Catholic University
    "acu": "Australian Catholic University",

    # ACAP
    "acap": "ACAP University College",
    "acap university college": "ACAP University College",

    # Kaplan / KBS
    "kaplan business school": "Kaplan Business School",
    "kbs": "Kaplan Business School",
    "kaplan": "Kaplan Business School",

    # SBTA & SELA
    "sbta": "SBTA & SELA",
    "sbta & sela": "SBTA & SELA",
    "sbta sela": "SBTA & SELA",
    "sbta and sela": "SBTA & SELA",
    "stba": "SBTA & SELA",

    # Torrens
    "torrens": "Torrens University",
    "torrens university": "Torrens University",

    # Victoria / Western Sydney
    "victoria university(sydney campus)": "Victoria University",
    # Assumes "vit" means Victoria University; worth checking against the raw entries.
    "vit": "Victoria University",
    "western sydney university": "Western Sydney University",
    "wsu": "Western Sydney University",
}

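# Names without an entry in uni_fix keep their lowercase form (e.g. "university of newcastle").
df_clean["University_Clean"] = uni_norm.replace(uni_fix)

print(f"Before: {uni_raw.nunique()} uni")
print(f"After: {df_clean['University_Clean'].nunique()} uni")
# Counts of the normalised, pre-mapping names: useful for spotting variants still to map.
print(uni_norm.value_counts().head(50))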

Before: 61 uni
After: 24 uni
Which university/private college are you from?
unsw                                                 32
uts                                                  28
university of sydney                                 18
macquarie university                                 13
university of technology sydney                      13
usyd                                                 11
melbourne institute of technology                     5
university of newcastle                               3
unsw sydney                                           3
charles darwin university                             3
kaplan business school                                3
mit                                                   3
the university of sydney                              2
acap university college                               2
university of new south wales                         2
acap                                                  2
sbta & sela                 
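
Since the mapping is maintained by hand, it can help to list the normalised names that have no entry in it yet. The sketch below uses a shortened copy of `uni_fix` and a few values from the counts above; the variable name `unmapped` is an assumption for illustration.

```python
import pandas as pd

# A few normalised values taken from the counts above.
uni_norm = pd.Series([
    "unsw", "uts", "university of newcastle", "usyd", "university of newcastle",
])

# Shortened copy of the mapping, for the example only.
uni_fix = {"unsw": "UNSW", "uts": "UTS", "usyd": "USYD"}

# Names with no key in the mapping pass through replace() unchanged.
unmapped = uni_norm[~uni_norm.isin(list(uni_fix))]

print(unmapped.value_counts())
```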

In [13]:
# 1 = "Checked In"; 0 = still "Attending", i.e. registered but never checked in (no-show).
df_clean["Attended"] = (df_clean["Attendee Status"] == "Checked In").astype(int)

print("Attendance summary")
print(f"Attended: {df_clean['Attended'].sum()} people")
print(f"Did not attend: {len(df_clean) - df_clean['Attended'].sum()} people")

Attendance summary
Attended: 55 people
Did not attend: 133 people
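
From these two counts the overall attendance rate follows directly. The short calculation below reuses the printed figures as plain numbers; on the DataFrame itself the same rate would be `df_clean["Attended"].mean()`, since the column is 0/1.

```python
attended = 55      # "Attended" count printed above
registered = 188   # rows remaining after deduplication

no_show = registered - attended
attendance_rate = attended / registered

print(f"No-shows: {no_show}")
print(f"Attendance rate: {attendance_rate:.1%}")  # 29.3%
```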


In [14]:
df_clean.to_csv("cleaned_data.csv", index=False)

print("File name: cleaned_data.csv")
print(f"Total: {len(df_clean)} people")

File name: cleaned_data.csv
Total: 188 people


In [16]:
df_clean.to_excel('cleaned_data.xlsx', index=False)

files.download('cleaned_data.xlsx')
