# Taming the wild data

Data can come in all sorts of forms, but generally not in the form we are looking for.

## Import data

We'll start by importing data.

In [1]:
import pandas as pd

df = pd.read_csv('../files/wild_data.csv')
df.head()

Unnamed: 0,Name,Math_Score_2023,MathScore2024,Science (23),Sci-Score-24,extra1,Unnamed: 6
0,Alice,85,88.0,89.0,91.0,,
1,Bob,78,,83.0,,,
2,Charlie,90,91.0,,95.0,,


## Cleaning it up

Cleaning this data involves the following steps:

1) Load the data into a Pandas DataFrame. (Done!)

1) Rename the columns to consistent, snake_case format (e.g., math_score_2023).
    * Some of this can be done automatically, but renaming "sciscore24" and "science_23" is not something you'd fix with a regex

1) Drop irrelevant or empty columns (extra1, Unnamed: 6).

1) Handle missing values (e.g., fill or drop with justification).

1) Reshape the data so each row contains "name, subject, year, score"
    * Use melt, extract, or string operations to separate fields like Math_Score_2023

1) Sort the data by name and year

Next step: Rename the columns to consistent, snake_case format (e.g., math_score_2023).

In [2]:
#DELETE
# automatic solution
# df.columns = df.columns.str.replace(r'[^\w\s]', '', regex=True).str.replace(' ', '_').str.lower()

# manual solution
df.columns = ['id', 'math_2023', 'math_2024', 'science_2023', 'science_2024', 'extra_1', 'extra_2']
df.head()

Unnamed: 0,id,math_2023,math_2024,science_2023,science_2024,extra_1,extra_2
0,Alice,85,88.0,89.0,91.0,,
1,Bob,78,,83.0,,,
2,Charlie,90,91.0,,95.0,,
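
To see why the automatic route only gets you part of the way, here is a small sketch that applies the commented-out regex to the original column names. The names are copied by hand from the first `df.head()` output, so this runs without the CSV file; `raw_columns` and `auto_columns` are names made up for this illustration.

```python
import pandas as pd

# Column names as they appear in the first df.head() output
raw_columns = pd.Index([
    'Name', 'Math_Score_2023', 'MathScore2024',
    'Science (23)', 'Sci-Score-24', 'extra1', 'Unnamed: 6',
])

auto_columns = (
    raw_columns
    .str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation; keep letters, digits, _ and spaces
    .str.replace(' ', '_', regex=False)       # spaces become underscores
    .str.lower()
)
print(list(auto_columns))
# ['name', 'math_score_2023', 'mathscore2024', 'science_23', 'sciscore24', 'extra1', 'unnamed_6']
```

The result is lowercase and free of punctuation, but `mathscore2024`, `science_23` and `sciscore24` still follow three different naming patterns, which is why the manual list is used here.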


Next step: Drop irrelevant or empty columns (extra1 and Unnamed: 6, which are called extra_1 and extra_2 after the manual rename).

In [3]:
#DELETE
df = df.drop(columns=['extra_1', 'extra_2'])
df.head()

Unnamed: 0,id,math_2023,math_2024,science_2023,science_2024
0,Alice,85,88.0,89.0,91.0
1,Bob,78,,83.0,
2,Charlie,90,91.0,,95.0
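
If you would rather not name the empty columns by hand, one possible alternative is to let pandas drop every column that contains no values at all. The sketch below rebuilds the three rows by hand (the name `wide` is made up for this illustration) and assumes the two extra columns are completely empty, as the `head()` output suggests.

```python
import numpy as np
import pandas as pd

# The three rows shown above, rebuilt by hand so the sketch runs on its own
wide = pd.DataFrame({
    'id': ['Alice', 'Bob', 'Charlie'],
    'math_2023': [85, 78, 90],
    'math_2024': [88.0, np.nan, 91.0],
    'science_2023': [89.0, 83.0, np.nan],
    'science_2024': [91.0, np.nan, 95.0],
    'extra_1': [np.nan] * 3,
    'extra_2': [np.nan] * 3,
})

# how='all' only drops a column when every value in it is missing
trimmed = wide.dropna(axis='columns', how='all')
print(list(trimmed.columns))
# ['id', 'math_2023', 'math_2024', 'science_2023', 'science_2024']
```

Columns with only some missing scores, such as `math_2024`, survive because `how='all'` requires the whole column to be empty.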


Next step: Handle missing values (e.g., fill or drop with justification).

In [4]:
#DELETE
# Trick question! Don't handle the null values yet; do that once the dataframe has been melted.

Next step: Reshape the data so each row contains "name, subject, year, score". Use melt, extract, or string operations to separate fields like Math_Score_2023.

While you're at it, sort the data by name and year.

In [5]:
#DELETE
df_melted = df.melt(id_vars=['id'], var_name='subject_year', value_name='score')
df_melted[['subject', 'year']] = df_melted['subject_year'].str.split('_', expand=True)
df_melted = df_melted.drop(columns=['subject_year'])
df_melted = df_melted.rename(columns={'id': 'name'})
df_melted = df_melted[['name', 'subject', 'year', 'score']]
df_melted = df_melted.sort_values(by=['name', 'year'])
df_melted.head(20)


Unnamed: 0,name,subject,year,score
0,Alice,math,2023,85.0
6,Alice,science,2023,89.0
3,Alice,math,2024,88.0
9,Alice,science,2024,91.0
1,Bob,math,2023,78.0
7,Bob,science,2023,83.0
4,Bob,math,2024,
10,Bob,science,2024,
2,Charlie,math,2023,90.0
8,Charlie,science,2023,
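
The step description also mentions extract. As a sketch of that variant, `str.extract` with named groups can split the label into subject and year in one go. The series `subject_year` below is a hand-made stand-in for the melted column, and the pattern assumes labels of the form `subject_YYYY`, as produced by the manual rename.

```python
import pandas as pd

# A few melted column labels, in the shape produced by the manual rename
subject_year = pd.Series(['math_2023', 'math_2024', 'science_2023', 'science_2024'])

# Named groups become the column names of the resulting DataFrame
parts = subject_year.str.extract(r'(?P<subject>[a-z]+)_(?P<year>\d{4})')
parts['year'] = parts['year'].astype(int)  # store the year as a number instead of text
print(parts)
```

Compared with `str.split`, the pattern states explicitly what a valid label looks like, and labels that do not match end up as missing values instead of being split in an unexpected place.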


And don't forget the step we skipped earlier: Handle missing values (e.g., fill or drop with justification).

In [6]:
#DELETE
df_melted = df_melted.dropna(subset=['score'])
df_melted.head(20)

Unnamed: 0,name,subject,year,score
0,Alice,math,2023,85.0
6,Alice,science,2023,89.0
3,Alice,math,2024,88.0
9,Alice,science,2024,91.0
1,Bob,math,2023,78.0
7,Bob,science,2023,83.0
2,Charlie,math,2023,90.0
5,Charlie,math,2024,91.0
11,Charlie,science,2024,95.0
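
Why wait with the null values until after melting? A minimal sketch, using the three rows from this notebook rebuilt by hand (`wide`, `dropped_first` and `melted_first` are names made up for this illustration), compares both orders.

```python
import numpy as np
import pandas as pd

# The three rows after renaming and dropping the empty columns
wide = pd.DataFrame({
    'id': ['Alice', 'Bob', 'Charlie'],
    'math_2023': [85, 78, 90],
    'math_2024': [88.0, np.nan, 91.0],
    'science_2023': [89.0, 83.0, np.nan],
    'science_2024': [91.0, np.nan, 95.0],
})

# Dropping first: a student with a single missing score loses the whole row
dropped_first = wide.dropna().melt(id_vars=['id'], value_name='score')

# Melting first: only the individual missing scores are removed
melted_first = wide.melt(id_vars=['id'], value_name='score').dropna(subset=['score'])

print(len(dropped_first), len(melted_first))
# 4 9
```

Dropping on the wide dataframe keeps only Alice, so five valid scores from Bob and Charlie are thrown away along with the three missing ones.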


You should be at the following dataframe by now.

![](../files/2025-05-12-20-23-49.png)



## Predict data

I have an Excel file with all grades of the IT students of the past five years (one file per year). It's not in the form above, but that's nothing some Python code couldn't fix.

Suppose I wanted to use this to build a model that gives the odds of a student passing our program in 3 years, or in 4 years. What would the target column be?

I'll wait.

In [7]:
import time

lst = ["-", "\\", "|", "/"]
for i in range(5):
    for j in range(4):
        print(lst[j], end="\r")
        time.sleep(0.25)



/

Data in this long form is unfit to train a model on. What you need is all of a student's data side by side in one row, the way it was at this point:

![](../files/2025-05-12-20-39-18.png)

Then you can add a column such as "graduated_in_3" or "graduated_in_4". If this column is false, the student hasn't graduated (or at least not in a timely manner).

What happens then is a dimensionality explosion. A student takes about 20 courses per year; times three years, that is 60 courses. That is 60 dimensions to take into account, and that's ignoring the additional information we have (gender, previous education, ...).
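
As a rough sketch of what such a training table could look like, the snippet below pivots long-form grades into one row per student and attaches a target column. Everything in it is hypothetical: the student ids, course names, scores and graduation outcomes are invented for illustration and do not come from the real grade files.

```python
import pandas as pd

# Hypothetical long-form grades: one row per student and course
grades = pd.DataFrame({
    'student': ['s1', 's1', 's1', 's2', 's2', 's2'],
    'course': ['programming_1', 'databases_1', 'networks_1'] * 2,
    'score': [14, 11, 9, 8, 12, 16],
})

# One row per student, one column per course
features = grades.pivot(index='student', columns='course', values='score')

# Hypothetical target column: did the student graduate within three years?
graduated = pd.Series({'s1': True, 's2': False}, name='graduated_in_3')
training = features.join(graduated)

print(training.shape)
# (2, 4)
```

Three courses already give three feature columns next to the target. With roughly 60 courses the same pivot would produce about 60 feature columns, which is the dimensionality problem described above.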