In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
path="/content/flashscore.csv"
df=pd.read_csv(path)
df.head(5)

Unnamed: 0,Rank,Points,Tournaments,Name,Nationality
0,1,98715,14,Axelsen Viktor,Denmark
1,2,90984,19,Shi Yu Qi,China
2,3,85101,20,Ginting Anthony Sinisuka,Indonesia
3,4,83594,19,Antonsen Anders,Denmark
4,5,81531,19,Christie Jonatan,Indonesia


# DATA CLEANING

**Handling Missing Values**

In [5]:

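# Count missing values per column (not displayed: a cell only renders its last expression)
df.isnull().sum()
# Drop every row that contains at least one missing value
df_cleaned = df.dropna()
df_cleaned.head(5)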


Unnamed: 0,Rank,Points,Tournaments,Name,Nationality
0,1,98715,14,Axelsen Viktor,Denmark
1,2,90984,19,Shi Yu Qi,China
2,3,85101,20,Ginting Anthony Sinisuka,Indonesia
3,4,83594,19,Antonsen Anders,Denmark
4,5,81531,19,Christie Jonatan,Indonesia


**Handling Duplicate Values**

In [7]:
# Check for duplicate rows, then drop them
df_cleaned.duplicated().sum()
df_cleaned = df_cleaned.drop_duplicates()
df_cleaned.head(7)


Unnamed: 0,Rank,Points,Tournaments,Name,Nationality
0,1,98715,14,Axelsen Viktor,Denmark
1,2,90984,19,Shi Yu Qi,China
2,3,85101,20,Ginting Anthony Sinisuka,Indonesia
3,4,83594,19,Antonsen Anders,Denmark
4,5,81531,19,Christie Jonatan,Indonesia
5,6,81015,21,Naraoka Kodai,Japan
6,7,77998,21,Li Shi Feng,China


**Handling Outliers**

In [8]:
# Numeric columns checked for outliers
num_cols = ['Rank', 'Points', 'Tournaments']
Q1 = df_cleaned[num_cols].quantile(0.25)
Q3 = df_cleaned[num_cols].quantile(0.75)
IQR = Q3 - Q1

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any numeric column
outlier_condition = ((df_cleaned[num_cols] < (Q1 - 1.5 * IQR)) |
                     (df_cleaned[num_cols] > (Q3 + 1.5 * IQR)))

# Keep rows with no flagged value; .copy() avoids SettingWithCopyWarning in later cells
df_cleaned = df_cleaned[~outlier_condition.any(axis=1)].copy()
df_cleaned.head(5)

Unnamed: 0,Rank,Points,Tournaments,Name,Nationality
1,2,90984,19,Shi Yu Qi,China
2,3,85101,20,Ginting Anthony Sinisuka,Indonesia
3,4,83594,19,Antonsen Anders,Denmark
4,5,81531,19,Christie Jonatan,Indonesia
5,6,81015,21,Naraoka Kodai,Japan


**Normalization**

In [9]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_cleaned[['Rank', 'Points', 'Tournaments']] = scaler.fit_transform(df_cleaned[['Rank', 'Points', 'Tournaments']])
df_cleaned.head(5)

Unnamed: 0,Rank,Points,Tournaments,Name,Nationality
1,0.0,1.0,0.478261,Shi Yu Qi,China
2,0.010204,0.918916,0.521739,Ginting Anthony Sinisuka,Indonesia
3,0.020408,0.898145,0.478261,Antonsen Anders,Denmark
4,0.030612,0.869711,0.478261,Christie Jonatan,Indonesia
5,0.040816,0.862599,0.565217,Naraoka Kodai,Japan


**Encoding Categorical Columns**

In [11]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_cleaned['Nationality'] = label_encoder.fit_transform(df_cleaned['Nationality'])
df_cleaned.head(5)

Unnamed: 0,Rank,Points,Tournaments,Name,Nationality
1,0.0,1.0,0.478261,Shi Yu Qi,5
2,0.010204,0.918916,0.521739,Ginting Anthony Sinisuka,16
3,0.020408,0.898145,0.478261,Antonsen Anders,7
4,0.030612,0.869711,0.478261,Christie Jonatan,16
5,0.040816,0.862599,0.565217,Naraoka Kodai,20


**Feature Engineering**

In [12]:
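# Points and Tournaments were Min-Max scaled above, so this is a ratio of scaled
# values; a scaled Tournaments of 0 (the column minimum) produces inf or NaN.
df_cleaned['Points_per_Tournament'] = df_cleaned['Points'] / df_cleaned['Tournaments']
df_cleaned.head(5)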

Unnamed: 0,Rank,Points,Tournaments,Name,Nationality,Points_per_Tournament
1,0.0,1.0,0.478261,Shi Yu Qi,5,2.090909
2,0.010204,0.918916,0.521739,Ginting Anthony Sinisuka,16,1.761255
3,0.020408,0.898145,0.478261,Antonsen Anders,7,1.877939
4,0.030612,0.869711,0.478261,Christie Jonatan,16,1.818486
5,0.040816,0.862599,0.565217,Naraoka Kodai,20,1.526137


**Splitting Data**

In [13]:
from sklearn.model_selection import train_test_split
X = df_cleaned.drop('Nationality', axis=1)
y = df_cleaned['Nationality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head(5), y_train.head(5)

(        Rank    Points  Tournaments            Name  Points_per_Tournament
 50  0.500000  0.193470     0.086957    Momota Kento               2.224901
 71  0.714286  0.065799     0.478261  Bosniuk Danylo               0.137580
 69  0.693878  0.077570     0.739130   Toti Giovanni               0.104947
 16  0.153061  0.572263     0.565217  Weng Hong Yang               1.012466
 40  0.397959  0.237478     0.173913   Zhao Jun Peng               1.365500,
 50    20
 71    34
 69    19
 16     5
 40     5
 Name: Nationality, dtype: int64)

# Summary

1. Data Cleaning

    Missing Values: Rows with missing values were removed with dropna(), so that later analysis or modeling steps do not have to handle gaps in the data.

    Duplicate Values: Duplicate rows were removed with drop_duplicates() so that repeated records do not carry extra weight in the results.

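    Outliers: Outliers were detected and removed with the Interquartile Range (IQR) rule: any row with a Rank, Points, or Tournaments value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR was dropped. The goal is to remove unusual values that could distort the analysis. In this dataset the filter also removes the top-ranked player, Axelsen Viktor, who no longer appears in the output after this step.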
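
The three cleaning steps above can be sketched on a small toy frame. The values and the helper name `clean` are hypothetical and are not taken from flashscore.csv; the sketch only mirrors the order used in the notebook (missing values, then duplicates, then the IQR rule).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing value, two duplicate rows,
# and one extreme Points value
toy = pd.DataFrame({
    "Points": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0, np.nan, 12.0],
    "Tournaments": [5, 6, 5, 7, 6, 6, 6, 6],
})

def clean(frame, cols):
    """Drop rows with missing values, drop duplicate rows, then drop IQR outliers in `cols`."""
    out = frame.dropna().drop_duplicates()
    q1 = out[cols].quantile(0.25)
    q3 = out[cols].quantile(0.75)
    iqr = q3 - q1
    # True wherever a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    is_outlier = (out[cols] < q1 - 1.5 * iqr) | (out[cols] > q3 + 1.5 * iqr)
    # Keep only rows with no flagged value in any of the checked columns
    return out[~is_outlier.any(axis=1)].copy()

cleaned = clean(toy, ["Points", "Tournaments"])
print(cleaned)
```

Note that the quartiles are computed after the missing and duplicate rows are gone, so the order of the steps affects which rows count as outliers.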

2. Normalization of Numeric Columns

    All numeric columns (Rank, Points, Tournaments) were normalized with Min-Max scaling, which maps each column to the range [0, 1]. This keeps columns with larger numeric ranges from dominating machine learning models that are sensitive to feature scale.
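
Min-Max scaling replaces each value x with (x - min) / (max - min), computed per column. The sketch below applies that formula by hand with pandas on hypothetical values, which is what MinMaxScaler does with its default feature range of (0, 1). The helper name `min_max_scale` is illustrative only.

```python
import pandas as pd

# Hypothetical values, chosen only to make the arithmetic easy to follow
toy = pd.DataFrame({
    "Points": [200.0, 500.0, 800.0],
    "Tournaments": [10.0, 15.0, 30.0],
})

def min_max_scale(frame):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    return (frame - col_min) / col_range

scaled = min_max_scale(toy)
print(scaled)
```

A column whose values are all identical has a range of 0, so this hand-written version would divide by zero there; it is meant as an illustration of the formula, not a replacement for the scaler.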

3. Encoding Categorical Columns

    The Nationality column, which holds categorical values, was encoded with Label Encoding. Each distinct country is mapped to an integer so that the column can be used by models that require numeric input.
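
As a small illustration of how the integer codes arise, the sketch below encodes a short hypothetical list of countries. LabelEncoder sorts the distinct labels and uses each label's position in that sorted list as its code, so the codes here differ from the notebook's, where more countries are present.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, not the full Nationality column
nationalities = ["Denmark", "China", "Indonesia", "Denmark", "Japan"]

encoder = LabelEncoder()
codes = encoder.fit_transform(nationalities)

# classes_ holds the sorted distinct labels; a label's code is its index there
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is what makes it possible to translate predictions back into country names later.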

4. Feature Engineering

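    A new feature, Points_per_Tournament, was created by dividing Points by Tournaments to show how many points a player earns relative to the number of tournaments played. Because the division runs after Min-Max scaling, the values are ratios of scaled quantities, not raw points per tournament. The feature can be useful for analyzing player effectiveness.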
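
In the notebook the ratio is computed after Min-Max scaling, so it divides one scaled value by another. If the intent is points per tournament on the original scale (an assumption, not something the notebook states), the ratio could be taken from the raw columns instead. The sketch uses the first three rows shown in the notebook's initial df.head(5) output and guards against a zero tournament count.

```python
import pandas as pd

# Raw (unscaled) values copied from the first rows of the notebook's df.head(5) output
raw = pd.DataFrame({
    "Name": ["Axelsen Viktor", "Shi Yu Qi", "Ginting Anthony Sinisuka"],
    "Points": [98715, 90984, 85101],
    "Tournaments": [14, 19, 20],
})

# Turn a zero tournament count into NaN so the ratio is NaN instead of inf
tournaments = raw["Tournaments"].where(raw["Tournaments"] > 0)
raw["Points_per_Tournament"] = raw["Points"] / tournaments
print(raw)
```

Computing the feature before scaling, and scaling it afterwards if needed, keeps the ratio interpretable as actual points earned per event.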

5. Splitting Data

    The dataset was split into training and testing sets at an 80:20 ratio (test_size=0.2, random_state=42), so that a model can be trained on one part and evaluated on data it has not seen before.
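
The effect of the 80:20 split can be checked on a toy frame. The column names `feature` and `label` are hypothetical; the call uses the same test_size and random_state as the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical rows so that the 80:20 split is easy to verify
toy = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
X = toy[["feature"]]
y = toy["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 8 rows go to training and 2 to testing; features and labels stay aligned by index
print(len(X_train), len(X_test))
print(X_train.index.equals(y_train.index))
```

In the notebook, X still contains the Name string column, which most scikit-learn estimators cannot consume directly, so it would likely need to be dropped or encoded before fitting a model.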