# **Analysis of Pokemon using K-Means Clustering**
---

**Project Title:** Identify your Pokemon


**Project Description**
In this project, we will cluster Pokemon based on their characteristics.

**Key Questions Answered:**
* How to find the best K for your model
* How to group pokemon based on their numeric characteristics
* How to predict using K-means

**Source:** [data.world](https://data.world/data-society/pokemon-with-stats)

**Dataset Description**
This dataset covers 898 Pokemon (1072 counting alternate forms). For each one it records the number, name, first and second type, stat total, basic stats (HP, Attack, Defense, Special Attack, Special Defense, and Speed), generation, and legendary status. The attributes of each Pokemon are as follows:

* `Number`: The ID for each pokemon

* `Name`: The name of each pokemon

* `Type 1`: Each pokemon has a type, which determines its weakness/resistance to attacks

* `Type 2`: Some pokemon are dual type and have a second type

* `Total`: Sum of all stats that come after this, a general guide to how strong a pokemon is

* `HP`: Hit points, or health, defines how much damage a pokemon can withstand before fainting

* `Attack`: The base modifier for normal attacks (e.g. Scratch, Punch)

* `Defense`: The base damage resistance against normal attacks

* `SP Atk`: Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)

* `SP Def`: Special defense, the base damage resistance against special attacks

* `Speed`: Determines which pokemon attacks first each round

* `Generation`: The generation of games where the pokemon was first introduced

* `Legendary`: Some pokemon are much rarer than others, and are dubbed "legendary"

## **This will be a guided project**
*Each step has been provided for you*

### **Step \#0: Import the following before continuing**
---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### **Step \#1: Load the Pokemon Dataset**
---
*The code has been provided for you*

---



In [None]:
url ="https://query.data.world/s/p4tnasnlximnov7fpjlu2msnmegyrb"
pokemon_df = pd.read_csv(url,  sep = ",")
pokemon_df

### **Step \#2: Identify features to use for clustering the data**
---
*The dataset contains some categorical features. Clustering on categorical features is harder than on numeric ones because of the absence of any natural order, high dimensionality, and the existence of subspace clustering. For this project we will therefore work only with the numeric features of the dataset.*

**2.1)** Identify numeric features of `pokemon_df`.

In [None]:
pokemon_df.dtypes # best way to see data types

In [None]:
pokemon_df.describe() # another way
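As a complement to inspecting `dtypes` by eye, pandas can filter columns by type directly. The sketch below uses a small made-up frame (`toy_df` and its rows are illustrative assumptions, not the real dataset) to show how `select_dtypes` keeps numeric columns and drops string and boolean ones.

```python
import pandas as pd

# Toy stand-in for pokemon_df; rows and values are invented for illustration.
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "type_1": ["Grass", "Fire", "Water"],
    "hp": [45, 39, 44],
    "attack": [49, 52, 48],
    "legendary": [False, False, True],
})

# include="number" keeps integer and float columns only;
# object (string) and bool columns are left out.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Note that an ID column such as `number` would still show up as numeric here, so it has to be dropped by hand.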

**2.2)** Create a new dataframe from `pokemon_df` with only numeric features in it.

**NOTE:** This should not include `number`. Why is that?

In [None]:
new_pokemon_df = pokemon_df[['total','hp','attack','defense', 'sp_attack', 'sp_defense', 'speed', 'generation']]
new_pokemon_df.columns

**2.3)** Create your `X` as a numpy array

To create `X`, use `.values` on your new data frame.
```
X = df.values
```

In [None]:
X = new_pokemon_df.values

### **Step #3: Import your model**
---
*We will also be using standardization so that every feature contributes on a comparable scale. Make sure to include the following line below*
```python
from sklearn.preprocessing import StandardScaler
```

In [None]:
from sklearn.preprocessing import StandardScaler

### **Step #4: Scale the Data**
---

Standardize the data before fitting it:

*Use the following lines to incorporate the scaler*
```python
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```

These lines standardize the data we put into our model.


In [None]:
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
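To see what the scaler actually does, here is a minimal sketch on a made-up array (`toy_X` is an assumption for illustration). Each column is shifted to mean 0 and divided by its standard deviation, so a feature measured in the hundreds no longer dominates one measured in single digits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative features on very different scales.
toy_X = np.array([[300.0, 1.0], [400.0, 3.0], [500.0, 5.0]])

toy_std = StandardScaler().fit_transform(toy_X)

# Equivalent manual computation: (x - column mean) / column std (population std).
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(toy_std, manual))
print(toy_std.mean(axis=0), toy_std.std(axis=0))
```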

### **Step \#5: Determine optimal cluster number with elbow method**
---
*We will use the `yellowbrick` library to implement the elbow method.*

Please run the install line below if you have not installed it yet.

**5.1)** Import `KElbowVisualizer` from `yellowbrick.cluster` and `KMeans` from `sklearn.cluster`.

*Part of this has been provided for you. Import `KMeans` below*

In [None]:
!pip install --quiet yellowbrick

In [None]:
from yellowbrick.cluster import KElbowVisualizer

In [None]:
from sklearn.cluster import KMeans

**5.2)** Set the parameter `k=(4,30)` of `KElbowVisualizer`. Fit, visualize, and determine the optimal value of `k`.

*This has been provided for you*

In [None]:
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,30))

visualizer.fit(X_std)
visualizer.show()
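The idea behind the elbow plot can also be sketched by hand with scikit-learn alone. The example below is a rough illustration on synthetic blobs (the data and the range of `k` are assumptions, not the Pokemon data): it fits one model per `k` and records `inertia_`, the sum of squared distances from each point to its nearest center. The "elbow" is the `k` after which this value stops dropping sharply.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups, for illustration only.
toy_X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```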

### **Step \#6: Fit your model and make a prediction**
---

**6.2)** Now fit the standardized data using the optimal value of `K`.

*Use `fit_predict` on `X_std` to create `y`*

Assign `n_clusters` the optimal value of `k` obtained  from the elbow method to define your KMeans model.

**Add an optional parameter as follows:**
```
random_state = 42
```
`random_state` allows reproducible results, so when you are looking at the answer key later, your results will *be the same as ours*, despite the random aspect. If you are interested in reading more, click [here](https://towardsdatascience.com/why-do-we-set-a-random-state-in-machine-learning-models-bb2dc68d8431).

In [None]:
k=13
kmeans = KMeans(n_clusters=k, random_state=42)

In [None]:
y = kmeans.fit_predict(X_std)
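A quick way to convince yourself of what `random_state` buys you: fit the same model twice with the same seed and compare the labels. This is a minimal sketch on synthetic data (the blobs and `n_init=10` are assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, for illustration only.
toy_X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# Same data, same seed: the random centroid initialization is repeated exactly.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(toy_X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(toy_X)

print(np.array_equal(labels_a, labels_b))
```

Without a fixed seed, the clusters found may be numbered differently from run to run even when the grouping itself is similar.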



### **(Challenge) Step \#7: Visualize the results**
---
Use the features `attack` and  `defense` to visualize how well the clusters separate the data.

**NOTE:** You may have to experiment with different graphs. Since this model used several features, a good two-dimensional visual is difficult to create. Feel free to look at the homework solutions for inspiration.


In [None]:
plt.figure(figsize=(15,10))
for yval in range(k):
    # Columns 2 and 3 of X are attack and defense (original, unscaled values)
    plt.scatter(X[y==yval,2], X[y==yval,3], s=50, label=f'class{yval}')

centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centers[:,2], centers[:,3], s=300, c='white', label='centroids',
            marker='*', edgecolors='black', linewidth=2)


plt.title("Clusters of pokemon by attack and defense")
plt.xlabel("attack")
plt.ylabel("defense")
plt.show()
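One option for a two-dimensional view that uses all features at once is to project the standardized data with PCA before plotting. PCA is not part of this project's steps; the sketch below is only a suggestion, shown on synthetic 8-feature data (all names prefixed with `toy_` are assumptions).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 8-feature data standing in for the numeric Pokemon features.
toy_X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=0)
toy_std = StandardScaler().fit_transform(toy_X)
toy_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(toy_std)

# Project the 8 standardized features down to 2 components for plotting.
pca = PCA(n_components=2)
toy_2d = pca.fit_transform(toy_std)

print(toy_2d.shape)
# Fraction of total variance kept by the 2D view.
print(pca.explained_variance_ratio_.sum())
```

The two columns of `toy_2d` can then be passed to `plt.scatter` with `c=toy_labels` in place of the raw `attack` and `defense` columns.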


### **Step \#8: Calculate the Silhouette Score to evaluate the quality of your clusters**
---
*You have done this a few times and should be able to implement it now*

In [None]:
from sklearn.metrics import silhouette_score
# Score on X_std, the same standardized data the model was fit on
score = silhouette_score(X_std, kmeans.labels_, metric='euclidean')
print(score)
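For intuition about the number this returns: the silhouette score ranges from -1 to 1, with higher values meaning points sit closer to their own cluster than to neighboring ones. The sketch below compares a few values of `k` on synthetic blobs (the data and the candidate values are assumptions for illustration).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 groups, for illustration only.
toy_X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(toy_X)
    # Score the labels against the same data the model was fit on.
    scores[k] = silhouette_score(toy_X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```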

### **Step \#9: Use the model**
---
Given the following values, predict in which cluster this pokemon would fall.
* `total`=300, `hp`=50, `attack`=40, `defense`=60, `sp_attack`=60, `sp_defense`=67, `speed`=40, `generation`=6

Use `kmeans.predict([[]])` to do this, after scaling the values with the same `scaler` used on the training data. Look at homework 3.1 solutions for an example.


In [None]:
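new_pokemon = [[300, 50, 40, 60, 60, 67, 40, 6]]
# The model was fit on standardized data, so scale the new point with the same scaler
new_pokemon_std = scaler.transform(new_pokemon)
prediction = kmeans.predict(new_pokemon_std)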
print(prediction)
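Because the model was fit on standardized data, a new point has to pass through the same fitted scaler before `predict`; otherwise its raw values are compared against centers that live on a different scale. A minimal sketch on made-up data (all `toy_` names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two obvious groups; the first feature is on a much larger scale than the second.
toy_X = np.array([[300.0, 1.0], [320.0, 2.0], [310.0, 1.0],
                  [600.0, 6.0], [620.0, 5.0], [610.0, 6.0]])

toy_scaler = StandardScaler()
toy_std = toy_scaler.fit_transform(toy_X)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_std)

# Use transform (not fit_transform) so the new point is scaled with the
# mean and standard deviation learned from the training data.
new_point = np.array([[305.0, 1.0]])
pred = toy_kmeans.predict(toy_scaler.transform(new_point))

# The new point lands in the same cluster as the first training row.
print(pred[0] == toy_kmeans.labels_[0])
```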

## **Conclusion**
---
You're done! We hope you feel more confident using K-Means on data! If you are feeling up to it, explore online to find cool ways to visualize your clusters. You can also go back to the homework and implement the model that found the best k in terms of silhouette score. The internet will be your best resource for expanding your knowledge and abilities in ML.