# Machine Learning 2024-2025 – UMONS 
# Exploratory Data Analysis of the Pokémon dataset

The goal of this lab is to get more familiar with the `pandas` library in Python, which lets you manipulate dataframes, compute statistics of their variables, and visualize them. Data exploration is an important step before using any of the machine learning models that you will discover throughout the course. It gives you a deeper understanding of the content of the dataset, which makes any later manipulation easier.

In this lab, we will work with the 'Pokémon' dataset, which contains the attributes of several Pokémons across various generations:
- `#`: ID for each Pokémon
- `Name`: name of each Pokémon
- `Type 1`: each Pokémon has a type; this determines weakness/resistance to attacks
- `Type 2`: second type for Pokémons that have two types
- `Total`: sum of all stats that come after this; a general guide to how strong a Pokémon is
- `HP`: hit points, or health, defines how much damage a Pokémon can withstand before fainting
- `Attack`: the base modifier for normal attacks (e.g., Scratch, Punch)
- `Defense`: the base damage resistance against normal attacks
- `Sp. Atk`: special attack, the base modifier for special attacks (e.g., Fire Blast, Bubble Beam)
- `Sp. Def`: special defense, the base damage resistance against special attacks
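- `Speed`: determines which Pokémon attacks first each round (the one with the higher value goes first)
- `Generation`: the generation to which the Pokémon belongs
- `Legendary`: whether the Pokémon is Legendary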

**1. Import all necessary libraries.**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use('fivethirtyeight') # Custom plot style

**2. Read the CSV file 'Pokemon.csv' and load it into a DataFrame. Print the first 10 rows.**
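
As a minimal sketch of the loading step, the snippet below reads a tiny CSV from an in-memory string instead of 'Pokemon.csv', so that it runs on its own. The names `csv_text` and `toy`, as well as all the values, are invented for illustration and do not come from the real dataset.

```python
import io

import pandas as pd

# Toy CSV text standing in for 'Pokemon.csv' (all values are invented).
csv_text = """#,Name,Type 1,Type 2,Attack
1,Alpha,Grass,Poison,49
2,Beta,Fire,,52
3,Gamma,Water,,48
"""

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (or fewer if the frame is shorter).
print(toy.head(2))
```

With the real file, the same call would presumably take the path `'Pokemon.csv'` in place of the `io.StringIO` object. Note that empty fields, such as the missing second type above, are read as `NaN`.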

**3. Print technical information about the Pokémon dataset using `.info()`.**

**4. Print the shape of the dataframe.** 

**5. Drop the '#' column and set the dataframe index to the 'Name' column.**
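
One possible way to chain these two operations is sketched below on a toy dataframe (the variable `toy` and its values are invented for illustration).

```python
import pandas as pd

# Toy frame with the same kind of ID and name columns (invented values).
toy = pd.DataFrame(
    {"#": [1, 2, 3], "Name": ["Alpha", "Beta", "Gamma"], "Attack": [49, 52, 48]}
)

# drop(columns=...) returns a new frame without the listed column;
# set_index(...) then moves 'Name' from the columns to the row index.
toy = toy.drop(columns="#").set_index("Name")

print(toy)
```

Both methods return a new dataframe by default, which is why the result is assigned back to `toy`.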

**6. Check whether there are any missing values in the dataframe, and count them per column. For non-numerical variables, replace the missing values with 'Unknown'. Check that the dataframe no longer contains missing values.**
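
The sketch below illustrates one way to count and fill missing values on a toy dataframe. The names `toy`, `missing_per_column` and `text_cols` are hypothetical, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in a non-numerical column (invented values).
toy = pd.DataFrame(
    {
        "Type 1": ["Grass", "Fire", "Water"],
        "Type 2": ["Poison", np.nan, np.nan],
        "Attack": [49, 52, 48],
    }
)

# isna() flags missing cells; summing the flags counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Select the non-numerical columns and fill their missing cells.
text_cols = toy.select_dtypes(exclude="number").columns
toy[text_cols] = toy[text_cols].fillna("Unknown")

# any().any() collapses the flags to a single boolean for the whole frame.
print(toy.isna().any().any())
```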

**7. Change the data types of the variables 'Type 1', 'Type 2', and 'Generation' to categorical data.**
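
A minimal sketch of the conversion, assuming a toy dataframe `toy` with invented values: `astype` accepts a dictionary that maps column names to target types, so several columns can be converted in one call.

```python
import pandas as pd

# Toy frame (invented values).
toy = pd.DataFrame(
    {
        "Type 1": ["Grass", "Fire", "Fire"],
        "Type 2": ["Poison", "Unknown", "Flying"],
        "Generation": [1, 1, 2],
        "Attack": [49, 52, 84],
    }
)

# Convert only the listed columns; the other columns keep their dtype.
toy = toy.astype(
    {"Type 1": "category", "Type 2": "category", "Generation": "category"}
)

print(toy.dtypes)
print(toy["Generation"].cat.categories)
```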

**8. Get general statistics (mean, standard deviation...) for the numerical variables of the dataset. Check that the standard deviation for a column of your choice is correct by computing it using the definition.** 
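
As a hint for the verification, `describe()` and `Series.std()` report the sample standard deviation, $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$, whereas `np.std` divides by $n$ unless `ddof=1` is passed. The sketch below checks this on a toy column (invented values, hypothetical variable names).

```python
import numpy as np
import pandas as pd

# Toy numerical column (invented values).
toy = pd.DataFrame({"Attack": [49, 52, 48, 65, 80]})

values = toy["Attack"]
n = len(values)

# Sample standard deviation from the definition (divide by n - 1).
manual_std = np.sqrt(((values - values.mean()) ** 2).sum() / (n - 1))

# pandas uses ddof=1 by default, both in std() and in describe().
print(manual_std, values.std(), toy.describe().loc["std", "Attack"])

# NumPy defaults to ddof=0, so it only matches when ddof=1 is requested.
print(np.std(values.to_numpy(), ddof=1))
```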

**9. For the categorical variables, count the number of values per category, as well as the co-occurrences, i.e., the number of times each combination of categories occurs together.**
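
One possible approach, sketched on a toy dataframe with invented values, is `value_counts()` for the counts of a single variable and `pd.crosstab` for the co-occurrences of two variables.

```python
import pandas as pd

# Toy frame with two categorical variables (invented values).
toy = pd.DataFrame(
    {
        "Type 1": ["Grass", "Fire", "Fire", "Water"],
        "Legendary": [False, False, True, False],
    }
)

# Number of rows in each category of a single variable.
counts = toy["Type 1"].value_counts()
print(counts)

# Contingency table: one cell per (Type 1, Legendary) combination.
co_occurrences = pd.crosstab(toy["Type 1"], toy["Legendary"])
print(co_occurrences)
```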

**10. Get all the attributes of 'Bulbasaur'.**

**11. Sort the dataframe by increasing values of 'Attack' and decreasing values of 'Defense' (i.e., if two Pokémons have the same value for 'Attack', the one with higher 'Defense' should appear first).** 
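
The mixed sort order can be sketched as follows on a toy dataframe (invented values): `sort_values` takes a list of columns and a matching list of `ascending` flags.

```python
import pandas as pd

# Toy frame where two rows tie on 'Attack' (invented values).
toy = pd.DataFrame(
    {"Attack": [50, 40, 50], "Defense": [30, 60, 70]},
    index=["Alpha", "Beta", "Gamma"],
)

# 'Attack' ascending first; ties are broken by 'Defense' descending.
sorted_toy = toy.sort_values(by=["Attack", "Defense"], ascending=[True, False])

print(sorted_toy)
```

Here 'Beta' comes first (lowest 'Attack'), and 'Gamma' precedes 'Alpha' because it has the higher 'Defense' among the two rows with equal 'Attack'.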

**12. Create a dataframe containing all Pokémons whose 'Type 1' is 'Psychic' and that have more than 100 in 'Attack', less than 40 in 'Defense', and more than 45 in 'Speed'.**
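
A minimal sketch of boolean filtering on a toy dataframe (invented values, hypothetical names `mask` and `selected`). Each condition must be wrapped in parentheses and combined with `&`, because Python's `and` does not work element-wise on Series.

```python
import pandas as pd

# Toy frame (invented values).
toy = pd.DataFrame(
    {
        "Type 1": ["Psychic", "Psychic", "Fire", "Psychic"],
        "Attack": [110, 110, 120, 100],
        "Defense": [35, 50, 30, 30],
        "Speed": [90, 90, 100, 50],
    },
    index=["Alpha", "Beta", "Gamma", "Delta"],
)

# Element-wise conditions combined with & (logical AND).
mask = (
    (toy["Type 1"] == "Psychic")
    & (toy["Attack"] > 100)
    & (toy["Defense"] < 40)
    & (toy["Speed"] > 45)
)

# Indexing with a boolean Series keeps only the rows where it is True.
selected = toy[mask]
print(selected)
```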

**13. Create two new columns, 'AttackAll' and 'DefenseAll', which correspond to the sum of 'Attack' and 'Sp. Atk' and the sum of 'Defense' and 'Sp. Def', respectively.**

**14. Create a new column 'AtkOverDef' corresponding to the ratio of 'AttackAll' to 'DefenseAll' for each Pokémon.**
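
Derived columns such as those of questions 13 and 14 can be sketched with element-wise arithmetic on columns. The toy values are invented, and the special-stat column names 'Sp. Atk' and 'Sp. Def' are an assumption about how they are spelled in the file.

```python
import pandas as pd

# Toy frame; the 'Sp. Atk' / 'Sp. Def' names are assumed (invented values).
toy = pd.DataFrame(
    {
        "Attack": [50, 30],
        "Sp. Atk": [70, 30],
        "Defense": [40, 60],
        "Sp. Def": [20, 60],
    },
    index=["Alpha", "Beta"],
)

# Arithmetic between columns is applied row by row.
toy["AttackAll"] = toy["Attack"] + toy["Sp. Atk"]
toy["DefenseAll"] = toy["Defense"] + toy["Sp. Def"]
toy["AtkOverDef"] = toy["AttackAll"] / toy["DefenseAll"]

print(toy)
```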

**15. Change the column names to uppercase, and remove the '.' characters and the spaces from the column names.**
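
One way to clean the names in a single pass is the `.str` accessor of the column index, sketched below on a toy dataframe whose column names are assumed for illustration.

```python
import pandas as pd

# Toy frame with mixed-case names containing '.' and spaces (assumed names).
toy = pd.DataFrame({"Type 1": ["Fire"], "Sp. Atk": [60], "AttackAll": [112]})

# regex=False treats '.' as a literal dot rather than "any character".
toy.columns = (
    toy.columns.str.upper()
    .str.replace(".", "", regex=False)
    .str.replace(" ", "", regex=False)
)

print(toy.columns.tolist())
```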

**16. Plot a histogram of the different 'TYPE1' categories. The figure must be 16 inches wide and 4 inches high. Use the matplotlib.pyplot library and the `countplot()` function from the seaborn library. The counts should appear in increasing order.**
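
As a rough sketch of how the figure size and the bar order can be controlled, the snippet below builds a count plot on a toy column (invented values). The `order` list is computed with pandas and passed to seaborn. The `Agg` backend line is only there so that the sketch runs without a display and should be dropped inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy categorical column (invented values).
toy = pd.DataFrame({"TYPE1": ["Water", "Water", "Water", "Fire", "Fire", "Grass"]})

# Categories sorted from the least to the most frequent.
order = toy["TYPE1"].value_counts(ascending=True).index.tolist()

# figsize is given as (width, height) in inches.
fig, ax = plt.subplots(figsize=(16, 4))
sns.countplot(data=toy, x="TYPE1", order=order, ax=ax)
fig.tight_layout()
```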

**17. Do the same as above, but for the 'TYPE2' categories.** 

**18. Plot the densities of the variables 'ATTACK', 'DEFENSE' and 'SPEED' on three separate plots. Use the `displot()` function of the seaborn library.**

**19. Plot the density of the variable 'ATTACK' for Legendary and non-Legendary Pokémons. The two densities should appear on different facets of the same plot.**

**20. Generate a scatter plot of the variable 'DEFENSE' on the y-axis, and the variable 'ATTACK' on the x-axis. Legendary and non-Legendary Pokémons should be indicated using different colors.**

**21. Filter the dataframe to contain only Pokémons of generations 1 and 4. Using the filtered dataframe, generate a scatter plot of the variable 'TOTAL' on the y-axis, and the variable 'ATTACK' on the x-axis, separating the two filtered generations. Note that, after filtering the dataframe, you can use the method `Series.cat.remove_unused_categories` to remove unused categories from the plot. The figure should be 8 inches high and 8 inches wide.**
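
The reason for the hint is that a categorical column keeps all of its declared categories after filtering, even those that no longer occur. The sketch below shows this on a toy dataframe (invented values, hypothetical name `filtered`).

```python
import pandas as pd

# Toy frame with a categorical generation column (invented values).
toy = pd.DataFrame(
    {
        "GENERATION": pd.Categorical([1, 2, 4, 1]),
        "ATTACK": [49, 62, 80, 55],
    }
)

# Keep generations 1 and 4 only; .copy() avoids modifying a view of toy.
filtered = toy[toy["GENERATION"].isin([1, 4])].copy()

# Category 2 is still declared although no row uses it any more.
print(filtered["GENERATION"].cat.categories)

# Drop the categories that no longer appear in the data.
filtered["GENERATION"] = filtered["GENERATION"].cat.remove_unused_categories()
print(filtered["GENERATION"].cat.categories)
```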

**22. Create a histogram of the variable 'GENERATION'. Separate Legendary and non-Legendary Pokémons. The counts should appear on the same figure in decreasing order.**

**23. Generate a boxplot of the variable 'TOTAL' with the `boxplot()` function from the seaborn library. How do you interpret it?**
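
As a hint for the interpretation, the quantities that a boxplot summarizes can be computed by hand. The sketch below assumes seaborn's default whisker rule of 1.5 times the interquartile range and uses a toy column with invented values: the box spans the first to the third quartile, the line inside it is the median, and points beyond the fences are drawn individually as outliers.

```python
import pandas as pd

# Toy numerical column with one extreme value (invented values).
total = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 300], name="TOTAL")

# Quartiles: edges of the box and the line inside it.
q1, median, q3 = total.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: the whiskers reach the most extreme points inside these limits.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones shown as individual markers.
outliers = total[(total < lower_fence) | (total > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers.tolist())
```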

**24. Generate one boxplot of the variable 'TOTAL' per category of the variable 'GENERATION'. Separate Legendary and non-Legendary Pokémons. All boxplots must appear on the same plot.**