# **AIFEST'24 PREPROCESSING WORKSHOP**

### *About dataset*

This dataset covers 898 Pokemon (1072 rows once alternate forms are included). For each one it records the number, name, first and second type, the stat total and the basic stats (HP, Attack, Defense, Special Attack, Special Defense, and Speed), the generation, and the legendary status. It has been of great use when teaching statistics to kids. With certain types you can also give a geeky introduction to machine learning.

These are the raw attributes that are used for calculating how much damage an attack will do in the games. This dataset is about the pokemon games (NOT pokemon cards or Pokemon Go).

`Number`: The ID for each pokemon

`Name`: The name of each pokemon

`Type 1`: Each pokemon has a type, which determines its weakness/resistance to attacks

`Type 2`: Some pokemon are dual type and have a second type

`Total`: Sum of all stats that come after this, a general guide to how strong a pokemon is

`HP`: Hit points, or health, defines how much damage a pokemon can withstand before fainting

`Attack`: The base modifier for normal attacks (e.g. Scratch, Punch)

`Defense`: The base damage resistance against normal attacks

`SP Atk`: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)

`SP Def`: Special defense, the base damage resistance against special attacks

`Speed`: Determines which pokemon attacks first each round

`Generation`: The generation of games where the pokemon was first introduced

`Legendary`: Some pokemon are much rarer than others, and are dubbed "legendary"

`Inspiration`: The type of a pokemon cannot be inferred only by its Attack and Defense. It would be worthy to find which two variables can define the type of a pokemon, if any. Two variables can be plotted in a 2D space, and used as an example for machine learning. This could mean the creation of a visual example any geeky Machine Learning class would love.


## 2. Installing necessary libraries 


In [None]:
# install the libraries
# pip install pandas
# pip install numpy
# pip install matplotlib
# pip install seaborn
# pip install -U scikit-learn

## 3. Importing the necessary Libraries

We start by importing the necessary libraries for our analysis. Here's what each library is used for:

`pandas`: Used for data manipulation and analysis.

`matplotlib.pyplot`: Provides a MATLAB-like plotting framework.

`seaborn`: A statistical data visualization library based on matplotlib, used for creating attractive and informative statistical graphics.

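`numpy`: Used for numerical computing with arrays.

`LabelEncoder`, `MinMaxScaler`, `StandardScaler` from `sklearn.preprocessing`: Tools for data preprocessing, specifically for encoding categorical features and scaling numerical ones.

`train_test_split` from `sklearn.model_selection` and `classification_report` from `sklearn.metrics`: Used to split the data and evaluate the quick model at the end.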

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


### 4. Loading the dataset

In this section, we are loading the data from our csv file `Pokemon.csv`.

We use the `pd.read_csv()` function from the pandas library and store the data into a DataFrame named `df_pokemon`.


In [None]:
df_pokemon = pd.read_csv('Pokemon.csv')

We use `head()` to display the first few rows of the DataFrame. This gives you a glimpse of what the data looks like (descriptive features, target feature, ...).

In [None]:
df_pokemon.head(5)

### Getting the data columns


In [None]:
df_pokemon.columns

### Data info 


The info() method in pandas provides a concise summary of a DataFrame. It gives information about:

- The total number of entries in the DataFrame.
- The data type of each column.
- The number of non-null values in each column.

In [None]:
df_pokemon.info()

In [None]:
data_shape = df_pokemon.shape
print("The data shape",data_shape)

### Descriptive statistics 

The `describe()` method in pandas is used to generate descriptive statistics of the numerical columns in a DataFrame. It provides the following statistics for each numerical column:

`Count`: Number of non-null observations.

`Mean`: Average value of the observations.

`Standard Deviation (std)`: Measure of the dispersion or spread of the observations.

`Minimum`: Minimum value in the column.

`25th percentile (Q1)`: Value below which 25% of the observations fall.

`Median (50th percentile)`: Middle value of the observations.

`75th percentile (Q3)`: Value below which 75% of the observations fall.

`Maximum`: Maximum value in the column.

In [None]:
df_pokemon.describe()

### 5. Data cleaning

In this section, we perform data cleaning tasks to ensure the quality and integrity of the dataset.



We start by dropping the irrelevant columns (`number`, `name`).

In [None]:
# Drop irrelevant columns (columns= already selects the axis, so axis=1 is not needed)
df_pokemon.drop(columns=['number', 'name'], inplace=True)
df_pokemon.head()


#### Checking for missing values 

In [None]:
df_pokemon.isnull().sum()

There are several strategies to deal with missing values in a dataset. Here are some common techniques:

- Removing missing values: drop either the affected rows or the affected columns

- Imputing missing values:
    - Fill missing values with a constant
    - Fill missing values with the mean, median, or mode (e.g. with `SimpleImputer`)

- Advanced imputation techniques:
    - Interpolation
    - ML algorithms (e.g. `KNNImputer`)

- Custom imputation techniques
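
The strategies listed above can be sketched on a toy frame. The column values below are made up for illustration and are not taken from `Pokemon.csv`; the variable names are assumptions as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with gaps in a numeric and a categorical column
toy = pd.DataFrame({
    "hp": [45.0, np.nan, 80.0, 60.0],
    "type2": ["Poison", None, "Flying", None],
})

# 1. Remove every row that contains a missing value
dropped = toy.dropna()

# 2. Fill with a constant (categorical) or with a statistic (numeric)
filled = toy.copy()
filled["type2"] = filled["type2"].fillna("no type2")
filled["hp"] = filled["hp"].fillna(filled["hp"].median())

# 3. Linear interpolation between neighbouring numeric values
interpolated = toy["hp"].interpolate()

# 4. SimpleImputer applies the same mean/median/mode idea inside scikit-learn
imputer = SimpleImputer(strategy="mean")
hp_imputed = imputer.fit_transform(toy[["hp"]])

print(dropped.shape)
print(filled)
print(interpolated.tolist())
print(hp_imputed.ravel())
```

Which strategy is appropriate depends on why the value is missing: a Pokemon without a second type is not a measurement gap, so a constant label is a reasonable choice there.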
 

Here we can see that many of the Pokemon do not have a second type (`type2`).

These Pokemon genuinely have no second type, so we should not remove the rows. Instead we replace the missing values with `no type2`.

In [None]:
df_pokemon['type2'] = df_pokemon['type2'].fillna('no type2')

In [None]:
df_pokemon.isnull().sum()

### 6. EDA and visualization

#### Exploring the different features of the data 

##### **TYPE 1**

The `value_counts()` method in pandas is used to count the unique values in a Series (a single column of a DataFrame) and return a Series containing counts of unique values.

In [None]:
df_pokemon.type1.value_counts()


Here we see that Grass appears twice because of a spelling error (`Graass`), so we merge the two values into one.

In [None]:
df_pokemon.type1 = df_pokemon.type1.replace({"Graass":"Grass"})


In [None]:
df_pokemon.type1.value_counts()


#### Visualizing 

In [None]:
#visualizing
countplot=sns.countplot(data=df_pokemon,x="type1")
for count in countplot.containers:
    countplot.bar_label(count,)
plt.xticks(rotation=90)
plt.title("Type1 Count")
plt.show()

Observation:

- Water is the most common primary type
- The rarest `type1` value is `Blastoise`, which is the name of a Pokemon rather than a type, so it is most likely a data-entry error

##### **TYPE 2**

In [None]:
df2=df_pokemon.type2.value_counts()
df2

In [None]:
# visualizing 
countplot=sns.countplot(data=df_pokemon ,x="type2")
for count in countplot.containers:
    countplot.bar_label(count,)
plt.xticks(rotation=90)
plt.title("Type2 Count")
plt.show()

Observation:
- Most Pokemon have no second type
- Among those that do, Flying is the most common second type
- Bug is the least common second type

#### Checking the count of the legendary pokemon

In [None]:
countplot=sns.countplot(data=df_pokemon,x="legendary")
for count in countplot.containers:
    countplot.bar_label(count,)
plt.title("Count of Legendary Pokemon")
plt.show()

### Convert Categorical features to numerical using Label Encoding

First we need to understand the type of the categorical data: `nominal` (no inherent order) or `ordinal` (ordered categories).

Numerous techniques have been developed to convert categorical variables into a numerical format that algorithms can work with. Each has its advantages and its particular use cases:

- `Label Encoding`: Identifies all the unique categories within a categorical variable and assigns each one a unique integer. It is best suited to ordinal categorical variables, where the categories have a logical order or progression; on nominal variables the integers imply an order that does not exist. For this we use the `LabelEncoder` class.

- `One-Hot Encoding`: Creates a binary (0 or 1) feature for each category in the original variable, effectively mapping each category to a vector in a high-dimensional binary space. For this we use the `OneHotEncoder` class.

- `Binary Encoding`: A middle ground between label and one-hot encoding. First, each category is assigned an integer, just like in label encoding. Then those integers are written in binary, and each bit becomes its own column. `Binary Encoding` can be implemented using the `BinaryEncoder` class from the `category_encoders` library.



In [None]:
# Convert categorical features to numerical using Label Encoding
label_encoder = LabelEncoder()
for feature in ['type1', 'type2', 'legendary']:
    df_pokemon[feature] = label_encoder.fit_transform(df_pokemon[feature])

In [None]:
df_pokemon

#### Correlation matrix 

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table represents the correlation between two variables. The value is in the range of -1 to 1, where:

- 1 indicates a perfect positive correlation,
- -1 indicates a perfect negative correlation, and
- 0 indicates no correlation at all.
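
As a small worked example, the Pearson coefficient can be computed by hand and compared with what `DataFrame.corr()` returns. The numbers below are hypothetical and the helper name `pearson` is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical stats for five Pokemon
toy = pd.DataFrame({
    "attack":  [49, 62, 82, 100, 130],
    "defense": [49, 63, 83, 123, 111],
    "speed":   [90, 70, 60, 40, 30],
})

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

manual = pearson(toy["attack"], toy["speed"])
matrix = toy.corr()  # Pearson is the default method

print(round(manual, 3))
print(matrix.round(3))
```

In this toy frame `attack` rises while `speed` falls, so their coefficient is negative, and every column has a coefficient of 1 with itself on the diagonal.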

In [None]:
sns.heatmap(df_pokemon.corr(),annot=True,linewidth=.5,fmt='.1f')
plt.show()

#### Outliers detection

In this section, we check for outliers in the dataset.

Outliers are data points that significantly differ from other observations in the dataset. They can arise due to various reasons, such as measurement errors, data entry mistakes, or genuine anomalies in the data. Outliers can skew statistical analyses and machine learning models, leading to misleading results if not properly handled.

#### Checking for outliers 

We use visualizations and statistical techniques to identify outliers in the dataset. One common method is to use box plots and histograms, which display the distribution of numerical variables. In a box plot, any data points that fall outside the whiskers are highlighted as potential outliers.

A histogram provides a visual representation of the distribution of values in a numerical column such as `hp`. It can help you see how frequently each range of values occurs.
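
The whisker rule of a box plot can also be applied numerically. The sketch below flags values that lie more than 1.5 × IQR beyond the quartiles; the helper `iqr_outlier_mask` and the sample values are hypothetical and only illustrate the rule.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical HP values; 255 sits far away from the rest
hp = pd.Series([45, 60, 80, 78, 59, 65, 70, 255], name="hp")
mask = iqr_outlier_mask(hp)

print(hp[mask].tolist())  # [255]
```

Counting the flagged values per column gives a quick numeric complement to the plots.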

In [None]:
df_pokemon.boxplot(figsize=(10, 8))
plt.xticks(rotation=45) 
plt.show()

Handling the outliers: values that lie more than 1.5 × IQR beyond the quartiles are capped (clipped) to those thresholds.

In [None]:

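# Cap only the base-stat columns. The encoded categoricals and the 0/1 legendary
# target are left alone: when fewer than 25% of rows are legendary its IQR is 0,
# and clipping would flatten the whole column to 0.
stat_cols = ['total', 'hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']

# First (Q1) and third (Q3) quartiles of each stat column
q1 = df_pokemon[stat_cols].quantile(0.25)
q3 = df_pokemon[stat_cols].quantile(0.75)

# Calculate interquartile range (IQR)
IQR = q3 - q1

# Define thresholds for outliers
lower_threshold = q1 - 1.5 * IQR
upper_threshold = q3 + 1.5 * IQR

# Cap outliers at the thresholds, working on a copy so df_pokemon stays unchanged
df_capped = df_pokemon.copy()
df_capped[stat_cols] = df_pokemon[stat_cols].clip(lower=lower_threshold, upper=upper_threshold, axis=1)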

df_capped.boxplot(figsize=(10, 8))
plt.xticks(rotation=45)  
plt.title('Boxplots after Capping Outliers')
plt.show()


### Data Normalization & Standardization: 
#### 1. NORMALIZATION: 
**Aim**: Scaling all feature values to a range between 0 and 1.

**Formula**:

\begin{align*}
X_{\text{normalized}} = \frac{{X - X_{\text{min}}}}{{X_{\text{max}} - X_{\text{min}}}}
\end{align*}

**Advantages**:
- Helps to bring all features to the same scale.
- Useful when the features have different ranges.


#### 2. STANDARDIZATION:
**Aim**: Scaling the features so that they have a mean of 0 and a standard deviation of 1.

**Formula**:

\begin{align*}
X_{\text{standardized}} = \frac{{X - X_{\text{mean}}}}{X_{\text{std}}}
\end{align*}


**Advantages**:
- Does not bound values to a specific range, which may be important for certain algorithms.
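- Preserves useful information about outliers and is less affected by them than min-max normalization, because a single extreme value does not compress all the other values into a narrow range.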
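
Both formulas can be checked by hand against scikit-learn on a tiny made-up feature (the values are hypothetical). One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature as a column vector
X = np.array([[10.0], [20.0], [30.0], [50.0]])

# Normalization by hand: (X - min) / (max - min)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization by hand: (X - mean) / std, using the population std (ddof=0)
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(manual_minmax.ravel())  # values 0, 0.25, 0.5 and 1
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(X)))
```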

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalize data
scaler = MinMaxScaler()
df_normalized = scaler.fit_transform(df_capped)

# Standardize data

scaler = StandardScaler()
df_standardized = scaler.fit_transform(df_capped)

#### Saving our cleaned data

In [None]:
df_pokemon.to_csv('Pokemon_cleaned.csv', index=False)

### MODELING: 
Here we fit a quick model to test the cleaned data.

### Split the data into train and test 

In [103]:
x = df_pokemon.iloc[:, :-1]
y = df_pokemon['legendary']

In [104]:
x

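|  | type1 | type2 | total | hp | attack | defense | sp_attack | sp_defense | speed | generation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 13 | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 |
| 1 | 10 | 13 | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 |
| 2 | 10 | 13 | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 |
| 3 | 10 | 13 | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 |
| 4 | 10 | 13 | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1067 | 12 | 18 | 580 | 100 | 145 | 130 | 65 | 110 | 30 | 8 |
| 1068 | 9 | 18 | 580 | 100 | 65 | 60 | 145 | 80 | 130 | 8 |
| 1069 | 15 | 9 | 500 | 100 | 80 | 80 | 80 | 80 | 80 | 8 |
| 1070 | 15 | 11 | 680 | 100 | 165 | 150 | 85 | 130 | 50 | 8 |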


In [105]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
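
The split above is random on every run, so the scores that follow will vary slightly between executions. As an optional variation, and not what this notebook ran, `train_test_split` accepts `random_state` for repeatability and `stratify` to keep the class ratio equal in both parts. The toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 non-legendary (0) and 20 legendary (1)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of legendary rows in each part
```

Stratifying matters most when one class is rare, as legendary Pokemon are here.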

In [106]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)

In [107]:
y_pred = knn.predict(X_test)
y_pred

array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0], dtype=int64)

In [108]:
from sklearn.metrics import accuracy_score

In [109]:
#print accuracy score
test_accuracy = accuracy_score(Y_test, y_pred)
test_accuracy

0.9534883720930233

In [110]:
print(classification_report(Y_test, y_pred))


              precision    recall  f1-score   support

           0       0.96      0.98      0.97       183
           1       0.89      0.78      0.83        32

    accuracy                           0.95       215
   macro avg       0.93      0.88      0.90       215
weighted avg       0.95      0.95      0.95       215

