<a href="https://colab.research.google.com/github/hibaessid/fortnite-data-preprocessing/blob/main/FortniteDataManipulationAndPreprocessingTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Fortnite Data Manipulation and Preprocessing Tutorial

In this notebook, we will use the Fortnite dataset to learn how to manipulate data and preprocess it for machine learning. By the end, you will understand how to clean data, handle missing values, and prepare data for analysis. Let's get started!




## Loading and Previewing the Dataset

In this step, we begin by loading the **Fortnite Statistics** dataset into a Pandas DataFrame. A **DataFrame** is a two-dimensional, table-like structure with rows and columns, which is used to store and manipulate data efficiently in Python.

- We use the `pd.read_csv()` function to read the CSV file containing the dataset and store the result in a variable called `df`. This function is part of the **Pandas** library, which is commonly used for data manipulation and analysis.
- The `df.head()` method then returns the first 5 rows of the DataFrame. This gives us a quick preview of the dataset, so we can see what kind of data we are working with.
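
As a small illustration of these two calls, the sketch below reads a CSV from an in-memory string instead of a file. The three rows are copied from the preview of the dataset, restricted to a few columns; the names `csv_text` and `sample` are only used for this example.

```python
import io

import pandas as pd

# A tiny CSV held in memory (hypothetical stand-in for 'Fortnite Statistics.csv')
csv_text = """Date,Time of Day,Placed,Mental State,Eliminations,Accuracy
4/10,6:00 PM,27,sober,2,23%
4/10,6:00 PM,45,sober,1,30%
4/10,6:00 PM,38,high,3,30%
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; the default is n=5
print(sample.head(2))
print(sample.shape)  # (3, 6)
```

Passing a number to `head()` is handy when five rows are too many or too few for a quick look.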


In [None]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('Fortnite Statistics.csv')

# Preview the first 5 rows of the DataFrame
df.head()


Unnamed: 0,Date,Time of Day,Placed,Mental State,Eliminations,Assists,Revives,Accuracy,Hits,Head Shots,Distance Traveled,Materials Gathered,Materials Used,Damage Taken,Damage to Players,Damage to Structures
0,4/10,6:00 PM,27,sober,2,0,0,23%,14,2,271.08,20,20,272,331,621
1,4/10,6:00 PM,45,sober,1,2,0,30%,19,1,396.73,123,30,247,444,998
2,4/10,6:00 PM,38,high,3,0,0,30%,32,1,607.8,71,60,176,322,1109
3,4/10,7:00 PM,30,high,1,3,0,18%,19,1,714.16,244,10,238,330,4726
4,4/10,7:00 PM,16,high,3,1,1,58%,42,18,1140.0,584,150,365,668,2070


## Understanding the Dataset
Before manipulating or preprocessing data, it’s important to understand its structure. The `info()` function gives us details about each column, including data types and whether there are missing values. The `describe()` function provides summary statistics, such as the mean and standard deviation of numerical columns.
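
Before running these calls on the full dataset, here is a minimal sketch of the same checks on a small hand-made sample (values taken from the first five rows of the preview). The `isna().sum()` call is an addition for illustration: it counts missing values per column, which complements the non-null counts reported by `info()`.

```python
import pandas as pd

# Small hypothetical sample built from the first five previewed rows
sample = pd.DataFrame({
    'Placed': [27, 45, 38, 30, 16],
    'Eliminations': [2, 1, 3, 1, 3],
    'Mental State': ['sober', 'sober', 'high', 'high', 'high'],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# describe() summarizes only the numeric columns by default
stats = sample.describe()
print(stats.loc['mean', 'Placed'])  # 31.2
```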


In [None]:
print(df.shape)

(87, 16)


In [None]:
# Display column names
print(df.columns)

Index(['Date', 'Time of Day', 'Placed', 'Mental State', 'Eliminations',
       'Assists', 'Revives', 'Accuracy', 'Hits', 'Head Shots',
       'Distance Traveled', 'Materials Gathered', 'Materials Used',
       'Damage Taken', 'Damage to Players', 'Damage to Structures'],
      dtype='object')


In [None]:
# Check the structure of the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Date                  87 non-null     object 
 1   Time of Day           87 non-null     object 
 2   Placed                87 non-null     int64  
 3   Mental State          87 non-null     object 
 4   Eliminations          87 non-null     int64  
 5   Assists               87 non-null     int64  
 6   Revives               87 non-null     int64  
 7   Accuracy              87 non-null     float64
 8   Hits                  87 non-null     int64  
 9   Head Shots            87 non-null     int64  
 10  Distance Traveled     87 non-null     float64
 11  Materials Gathered    87 non-null     int64  
 12  Materials Used        87 non-null     int64  
 13  Damage Taken          87 non-null     int64  
 14  Damage to Players     87 non-null     int64  
 15  Damage to Structures  87 

- Among the 16 attributes, we already observed that 12 are numeric.
- To get summary statistics (mean, standard deviation, min, max, and quartiles) for these attributes, we use the following:

In [None]:
# Get basic statistics about the dataset
df.describe()

Unnamed: 0,Placed,Eliminations,Assists,Revives,Accuracy,Hits,Head Shots,Distance Traveled,Materials Gathered,Materials Used,Damage Taken,Damage to Players,Damage to Structures
count,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0
mean,22.045977,2.517241,1.482759,0.402299,26.011494,29.735632,4.747126,1137.146322,386.574713,122.712644,244.172414,581.770115,3170.91954
std,13.145791,1.885453,1.388173,0.738631,13.471325,22.093596,5.777298,1110.843642,569.978062,225.17037,124.937399,354.172396,3458.54107
min,1.0,0.0,0.0,0.0,4.0,1.0,0.0,17.86,0.0,0.0,19.0,60.0,175.0
25%,15.0,1.0,0.0,0.0,17.0,12.0,1.0,398.28,75.0,20.0,154.0,328.5,1094.0
50%,21.0,2.0,1.0,0.0,25.0,27.0,3.0,638.17,164.0,50.0,209.0,481.0,2029.0
75%,28.5,3.0,2.0,1.0,32.0,38.0,6.5,1575.0,418.5,145.0,316.5,750.0,3707.0
max,66.0,8.0,6.0,4.0,90.0,105.0,33.0,4460.0,3002.0,1740.0,677.0,1507.0,18026.0


- To describe the non-numeric (nominal) attributes, we use the following:

In [None]:
df.describe(include='object')

Unnamed: 0,Date,Time of Day,Mental State
count,87,87,87
unique,7,2,2
top,4/14,day,sober
freq,24,47,45


## Checking Data Types
Our dataset contains columns with mixed data types (numbers and text). We need to check the data type of each column to decide how to manipulate and process it.
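
One way to list the columns that still need attention is to test each column with `pd.api.types.is_numeric_dtype`. The sketch below does this on a small hypothetical sample; it is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical sample: 'Accuracy' looks numeric but is stored as text because of '%'
sample = pd.DataFrame({
    'Placed': [27, 45, 38],
    'Mental State': ['sober', 'sober', 'high'],
    'Accuracy': ['23%', '30%', '30%'],
})

# Collect the columns whose dtype is not numeric
non_numeric = [col for col in sample.columns
               if not pd.api.types.is_numeric_dtype(sample[col])]
print(non_numeric)  # ['Mental State', 'Accuracy']
```

A column such as `Accuracy` appearing in this list is a hint that it needs cleaning rather than encoding.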



In [None]:

df.dtypes


Unnamed: 0,0
Date,object
Time of Day,object
Placed,int64
Mental State,object
Eliminations,int64
Assists,int64
Revives,int64
Accuracy,object
Hits,int64
Head Shots,int64


## Dropping Features
- Let's remove the `Date` attribute.


In [None]:
df = df.drop(columns=['Date'])

## Converting Values to Numbers
Now that we have separated the characters, we will keep only the numeric part.


## Encoding the "Accuracy" Column
The "Accuracy" column contains numeric information stored as text (objects), because each value ends with a percent sign (%). We will convert this data to a numeric format.


In [None]:
# Convert the 'Accuracy' column to strings (if not already) and remove the '%' symbol, then convert it to float
df['Accuracy'] = df['Accuracy'].astype(str).str.rstrip('%').astype(float)

# Verify the changes
df['Accuracy'].head()


Unnamed: 0,Accuracy
0,23.0
1,30.0
2,30.0
3,18.0
4,58.0
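
The conversion above assumes every value is a clean percentage string. As a hedged alternative for messier data, `pd.to_numeric(..., errors='coerce')` turns anything that cannot be parsed into `NaN` instead of raising an error. The sample values below, including the malformed `'n/a'`, are hypothetical.

```python
import pandas as pd

# Hypothetical values: one has stray spaces, one is not a number at all
raw = pd.Series(['23%', '30%', 'n/a', ' 58% '])

# Trim whitespace, drop the trailing '%', then convert; unparseable values become NaN
cleaned = pd.to_numeric(raw.str.strip().str.rstrip('%'), errors='coerce')
print(cleaned)

# Count how many values could not be converted
print(cleaned.isna().sum())  # 1
```

Counting the resulting `NaN` values is a quick way to see how many entries would need to be fixed or dropped.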


In [None]:
# Check the data types of the columns
df.dtypes


Unnamed: 0,0
Time of Day,object
Placed,int64
Mental State,object
Eliminations,int64
Assists,int64
Revives,int64
Accuracy,float64
Hits,int64
Head Shots,int64
Distance Traveled,float64


## Encoding the "Time of Day" Column
We will now convert the "Time of Day" column into a qualitative (categorical) feature with two possible values: "day" and "night".

- From 6 AM to 6 PM: "day"
- From 6 PM to 6 AM: "night"


In [None]:
# Split values such as '6:00 PM' into the hour and the AM/PM marker
x = df['Time of Day'].str.split(':00 ')

def encode_time_of_day(parts):
    # Convert to a 24-hour clock so that 12 AM (hour 0) and 12 PM (hour 12) are handled correctly
    hour = int(parts[0]) % 12 + (12 if parts[1] == 'PM' else 0)
    # Both boundary hours (6 AM and 6 PM) are encoded as "night"
    return 'day' if 6 < hour < 18 else 'night'

# Update the 'Time of Day' column with the encoded values
df['Time of Day'] = x.apply(encode_time_of_day)

# Verify the changes
df['Time of Day'].head()


Unnamed: 0,Time of Day
0,night
1,night
2,night
3,night
4,night
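
As an alternative sketch, the same encoding can be done without manual string splitting by parsing the times with `pd.to_datetime`. This assumes every value follows the `H:MM AM/PM` pattern seen in the preview. Hours 7 through 17 are treated as "day" and everything else as "night", so 6 PM and 7 PM still map to "night" as in the output above, while noon counts as "day" and midnight as "night". The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample covering the boundary hours as well as noon and midnight
times = pd.Series(['6:00 PM', '7:00 AM', '12:00 PM', '12:00 AM', '6:00 AM', '3:00 PM'])

# Parse the 12-hour clock strings and extract the hour on a 24-hour clock
hours = pd.to_datetime(times, format='%I:%M %p').dt.hour

# between() is inclusive on both ends: 7..17 is "day", the rest is "night"
encoded = hours.between(7, 17).map({True: 'day', False: 'night'})
print(encoded.tolist())
# ['night', 'day', 'day', 'night', 'night', 'day']
```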


What are the shape and data types of the data at this point?

## Converting Categorical Data to Numeric Form

In machine learning, many algorithms require input data to be in numerical format. However, some features in our dataset may be categorical (text-based). To convert these features into numerical format, we use a process called **one-hot encoding**. This technique transforms categorical variables into a series of binary columns, where each unique category becomes a column, and the presence of that category is marked with 1, while absence is marked with 0.

In this section, we will:
1. Select all the columns with **object** (text) data types.
2. Apply one-hot encoding to these categorical features using the `pd.get_dummies()` function.
3. Convert all these binary columns to `int64` data type.
4. Combine the transformed categorical features with the numerical features in the original dataset.


In [None]:
# Step 1: Select all columns with object (text) data types
obj_df = df.select_dtypes(include=['object']).copy()

# Get a list of the categorical columns
obj_col = list(obj_df.columns)

# Step 2: Apply one-hot encoding to these categorical features
x = pd.get_dummies(obj_df, columns=obj_col)

# Step 3: Convert the binary columns to integers
x = x.astype('int64')

# Step 4: Select the numerical columns and join them with the one-hot encoded columns
num_df = df.select_dtypes(include=['number']).copy()
num_df = num_df.join(x)

# Display the transformed DataFrame
num_df.head()


Unnamed: 0,Placed,Eliminations,Assists,Revives,Accuracy,Hits,Head Shots,Distance Traveled,Materials Gathered,Materials Used,...,Date_17-Apr,Date_4/10,Date_4/11,Date_4/13,Date_4/14,Date_4/15,Time of Day_day,Time of Day_night,Mental State_high,Mental State_sober
0,27,2,0,0,23.0,14,2,271.08,20,20,...,0,1,0,0,0,0,0,1,0,1
1,45,1,2,0,30.0,19,1,396.73,123,30,...,0,1,0,0,0,0,0,1,0,1
2,38,3,0,0,30.0,32,1,607.8,71,60,...,0,1,0,0,0,0,0,1,1,0
3,30,1,3,0,18.0,19,1,714.16,244,10,...,0,1,0,0,0,0,0,1,1,0
4,16,3,1,1,58.0,42,18,1140.0,584,150,...,0,1,0,0,0,0,0,1,1,0
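
The same result can also be reached more compactly. The sketch below, on a small hypothetical sample, passes the `columns` and `dtype` arguments to `pd.get_dummies()` so that the listed columns are encoded as integers and the numeric columns pass through unchanged in a single call.

```python
import pandas as pd

# Hypothetical sample with one numeric and two categorical columns
sample = pd.DataFrame({
    'Placed': [27, 45, 38],
    'Time of Day': ['night', 'night', 'day'],
    'Mental State': ['sober', 'sober', 'high'],
})

# Encode only the listed columns; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(sample, columns=['Time of Day', 'Mental State'], dtype=int)
print(encoded.columns.tolist())
# ['Placed', 'Time of Day_day', 'Time of Day_night', 'Mental State_high', 'Mental State_sober']
print(encoded)
```

Listing the columns explicitly also avoids accidentally encoding a text column that should have been cleaned or dropped first.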


## Objective: Apply Min-Max Scaling to Normalize the Fortnite Dataset

**Normalization** is an important step in data preprocessing where we scale the numerical features of our dataset to fall within a specific range, typically [0, 1]. This is particularly useful for machine learning algorithms that are sensitive to the scale of data.
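
Min-Max scaling maps each value `x` of a column to `(x - min) / (max - min)`, where `min` and `max` are that column's smallest and largest values. As a worked sketch, the code below applies the formula by hand to the first five `Placed` values, using the minimum (1) and maximum (66) reported for that column by `df.describe()`.

```python
import pandas as pd

# First five 'Placed' values from the dataset preview
placed = pd.Series([27, 45, 38, 30, 16], name='Placed')

# Column minimum and maximum, taken from the describe() output
col_min, col_max = 1, 66

# Min-Max formula: (x - min) / (max - min)
scaled = (placed - col_min) / (col_max - col_min)
print(scaled.round(6).tolist())
# [0.4, 0.676923, 0.569231, 0.446154, 0.230769]
```

For example, the first value becomes (27 - 1) / (66 - 1) = 26 / 65 = 0.4.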

### Steps:
1. Import the **MinMaxScaler** from the `sklearn.preprocessing` module.
2. Initialize the scaler.
3. Apply the scaler to the numerical features of the dataset.
4. Convert the scaled data back into a Pandas DataFrame to retain the structure.

In [None]:
# Step 1: Import the MinMaxScaler from scikit-learn
from sklearn.preprocessing import MinMaxScaler

# Step 2: Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Step 3: Apply the scaler to the numerical features of the dataset
nor_df = scaler.fit_transform(num_df)

# Step 4: Convert the scaled data back to a Pandas DataFrame
nor_df = pd.DataFrame(nor_df, columns=list(num_df.columns))

# Display the normalized DataFrame
nor_df.head()


Unnamed: 0,Placed,Eliminations,Assists,Revives,Accuracy,Hits,Head Shots,Distance Traveled,Materials Gathered,Materials Used,...,Date_17-Apr,Date_4/10,Date_4/11,Date_4/13,Date_4/14,Date_4/15,Time of Day_day,Time of Day_night,Mental State_high,Mental State_sober
0,0.4,0.25,0.0,0.0,0.22093,0.125,0.060606,0.057004,0.006662,0.011494,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.676923,0.125,0.333333,0.0,0.302326,0.173077,0.030303,0.08529,0.040973,0.017241,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,0.569231,0.375,0.0,0.0,0.302326,0.298077,0.030303,0.132805,0.023651,0.034483,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,0.446154,0.125,0.5,0.0,0.162791,0.173077,0.030303,0.156749,0.081279,0.005747,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.230769,0.375,0.166667,0.25,0.627907,0.394231,0.545455,0.252612,0.194537,0.086207,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


In [None]:
df.describe()

Unnamed: 0,Placed,Eliminations,Assists,Revives,Accuracy,Hits,Head Shots,Distance Traveled,Materials Gathered,Materials Used,Damage Taken,Damage to Players,Damage to Structures
count,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0,87.0
mean,22.045977,2.517241,1.482759,0.402299,26.011494,29.735632,4.747126,1137.146322,386.574713,122.712644,244.172414,581.770115,3170.91954
std,13.145791,1.885453,1.388173,0.738631,13.471325,22.093596,5.777298,1110.843642,569.978062,225.17037,124.937399,354.172396,3458.54107
min,1.0,0.0,0.0,0.0,4.0,1.0,0.0,17.86,0.0,0.0,19.0,60.0,175.0
25%,15.0,1.0,0.0,0.0,17.0,12.0,1.0,398.28,75.0,20.0,154.0,328.5,1094.0
50%,21.0,2.0,1.0,0.0,25.0,27.0,3.0,638.17,164.0,50.0,209.0,481.0,2029.0
75%,28.5,3.0,2.0,1.0,32.0,38.0,6.5,1575.0,418.5,145.0,316.5,750.0,3707.0
max,66.0,8.0,6.0,4.0,90.0,105.0,33.0,4460.0,3002.0,1740.0,677.0,1507.0,18026.0
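
The statistics above describe the original `df`, which the scaler leaves untouched. To confirm that a normalized frame really spans [0, 1], one can inspect its per-column minimum and maximum. The sketch below does this with plain pandas on a small hypothetical sample, so the minima and maxima are those of the sample and not of the full dataset.

```python
import pandas as pd

# Hypothetical sample built from the first five previewed rows
sample = pd.DataFrame({
    'Placed': [27, 45, 38, 30, 16],
    'Hits': [14, 19, 32, 19, 42],
})

# Column-wise Min-Max scaling (assumes no column is constant, which would divide by zero)
normalized = (sample - sample.min()) / (sample.max() - sample.min())

# After scaling, every column should have min 0 and max 1
print(normalized.describe().loc[['min', 'max']])
```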
