# *K-Means* Clustering of the Top 1000 Manga on myanimelist.net

## Import libraries
In this section we import the libraries we need: `pandas`, `matplotlib`, and `sklearn`. From `sklearn` we use `cluster` for `KMeans`, `preprocessing` for `StandardScaler`, and `metrics` for `silhouette_score`.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

## Explore Data
Next, we use the data from https://www.kaggle.com/astronautelvis/top-1000-ranked-mangas-by-myanimelist

**Why did I choose this data for *K-Means* clustering?** As a manga reader, when I visit https://myanimelist.net/topmanga.php I find it hard to decide which manga to read next.

**But can't you just start from rank 1 and pick by score?** To be honest, it's not that simple. If we look closely, the scores and ranks look suspicious once we take the number of members into account. A manga like *JoJo no Kimyou na Bouken Part 7: Steel Ball Run* is ranked 2nd with a score of 9.23, but it has only 160,782 members, compared with 427,894 for *Berserk* at rank 1.

**Why choose the *K-Means* clustering method?** *K-Means* is a simple clustering method that groups data points by how similar they are on a chosen set of features. For this data, we can use `Scored_by`, `Favorites`, and `Score` as features.

In [4]:
data_manga = pd.read_csv("https://raw.githubusercontent.com/jakajek/current-projekt/main/data/top_1000_manga.csv", index_col = 0)
data_manga.head()

Unnamed: 0,Title,Title_Synonym,Title_Japanese,Status,Volumns,Chapters,Publishing,Rank,Score,Scored_by,Popularity,Memebers,Favorites,Synopsis,Publish_period,Genre
0,Berserk,Berserk,ベルセルク,Publishing,unkown,unkown,True,1,9.39,201756,2,427894,80308,"Guts, a former mercenary now known as the ""Bla...","Aug 25, 1989 to present","'Action', 'Adventure', 'Demons', 'Drama', 'Fan..."
1,JoJo no Kimyou na Bouken Part 7: Steel Ball Run,unkown,ジョジョの奇妙な冒険 Part7 STEEL BALL RUN,Finished,24,96,False,2,9.23,94427,29,160782,27459,"In the American Old West, the world's greatest...","Jan 19, 2004 to Apr 19, 2011","'Action', 'Adventure', 'Mystery', 'Historical'..."
2,One Piece,One Piece,ONE PIECE,Publishing,unkown,unkown,True,3,9.15,249936,3,410522,82310,"Gol D. Roger, a man referred to as the ""Pirate...","Jul 22, 1997 to present","'Action', 'Adventure', 'Comedy', 'Fantasy', 'S..."
3,Vagabond,Vagabond,バガボンド,On Hiatus,37,327,False,4,9.13,72613,19,211345,21596,"In 16th century Japan, Shinmen Takezou is a wi...","Sep 3, 1998 to May 21, 2015","'Action', 'Adventure', 'Drama', 'Historical', ..."
4,Monster,Monster,MONSTER,Finished,18,162,False,5,9.1,57801,33,148764,13049,"Kenzou Tenma, a renowned Japanese neurosurgeon...","Dec 5, 1994 to Dec 20, 2001","'Mystery', 'Drama', 'Psychological', 'Seinen'"
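
As a rough illustration of how *K-Means* separates titles by these features, here is a minimal sketch that uses only the five rows previewed above, typed in by hand. The names `preview`, `features`, and `model` are made up for this example, and five rows are far too few for a meaningful clustering; the point is only to show the mechanics.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Values copied by hand from the five preview rows (illustration only)
preview = pd.DataFrame(
    {
        "Title": [
            "Berserk",
            "JoJo no Kimyou na Bouken Part 7: Steel Ball Run",
            "One Piece",
            "Vagabond",
            "Monster",
        ],
        "Scored_by": [201756, 94427, 249936, 72613, 57801],
        "Favorites": [80308, 27459, 82310, 21596, 13049],
        "Score": [9.39, 9.23, 9.15, 9.13, 9.10],
    }
)

# Cluster on the three candidate features, unscaled, into two groups
features = preview[["Scored_by", "Favorites", "Score"]]
model = KMeans(n_clusters=2, n_init=10, random_state=0)
preview["cluster"] = model.fit_predict(features)

print(preview[["Title", "cluster"]])
```

Because `Scored_by` and `Favorites` are in the tens of thousands while `Score` varies by less than one point, the unscaled split is driven almost entirely by the two count columns: *Berserk* and *One Piece* land in one cluster and the other three titles in the other. The cluster numbers themselves are arbitrary.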


In [None]:
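# Keep only manga that are currently publishing or already finished
keep_rows = data_manga['Status'].isin(['Publishing', 'Finished'])
# Columns not needed for clustering (names are spelled as in the dataset)
del_col = ['Title_Synonym', 'Title_Japanese', 'Status', 'Chapters', 'Volumns', 'Publishing', 'Memebers', 'Synopsis', 'Publish_period', 'Genre']
df_manga = data_manga[keep_rows].drop(columns=del_col)
# Add a sequential id starting at 1
df_manga.insert(0, 'id', range(1, 1 + len(df_manga)))
df_manga.head()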

In [None]:
len(df_manga)

In [None]:
df_manga['is_duplicated'] = df_manga.duplicated(subset=['Title'])
df_manga['is_duplicated'].sum()

In [None]:
df_manga = df_manga.drop_duplicates(subset=['Title'])
len(df_manga)

In [None]:
df_manga.head()

In [None]:
# Elbow method: fit K-Means for k = 1..10 and record the within-cluster sum of squares (WCSS)
X = df_manga[['Scored_by', 'Favorites', 'Score']]
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
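
To make the quantity on the y-axis concrete, the sketch below recomputes WCSS by hand and compares it with `inertia_`. It runs on synthetic blobs from `make_blobs`, not on the manga data, and the names `points`, `model`, `manual_wcss`, and `wcss_by_k` are chosen only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated 2-D blobs (an assumption, not the manga data)
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# WCSS by hand: squared distance from each point to its assigned centroid
assigned_centroids = model.cluster_centers_[model.labels_]
manual_wcss = float(((points - assigned_centroids) ** 2).sum())
print(manual_wcss, model.inertia_)

# WCSS for k = 1..5, the values an elbow plot would show
wcss_by_k = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    for k in range(1, 6)
]
print(wcss_by_k)
```

On data like this the values drop sharply up to k = 3 and flatten afterwards, which is the "elbow" the plot is meant to reveal. On real data the bend is often less obvious.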

In [None]:
# Fit K-Means with 3 clusters on two features
kmeans = KMeans(n_clusters=3, random_state=0)
var = df_manga[['Scored_by', 'Favorites']]
df_manga['cluster'] = kmeans.fit_predict(var)
# Centroid coordinates of each cluster
centroids = kmeans.cluster_centers_
cen_x = [c[0] for c in centroids]
cen_y = [c[1] for c in centroids]
# Add each row's centroid coordinates and plot color to the DataFrame
df_manga['cen_x'] = df_manga.cluster.map({0:cen_x[0], 1:cen_x[1], 2:cen_x[2]})
df_manga['cen_y'] = df_manga.cluster.map({0:cen_y[0], 1:cen_y[1], 2:cen_y[2]})
colors = ['#DF2020', '#81DF20', '#2095DF']
df_manga['c'] = df_manga.cluster.map({0:colors[0], 1:colors[1], 2:colors[2]})

In [None]:
plt.scatter(df_manga.Scored_by, df_manga.Favorites, c=df_manga.c, alpha = 0.6, s=10)

In [None]:
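# Standardize the two features, then score the existing cluster labels
scaler = StandardScaler()
var_scaled = scaler.fit_transform(var)
labels = kmeans.predict(var)
# Note: the labels come from a model fitted on the unscaled features,
# while the silhouette distances here are measured on the scaled features
silhouette_score(var_scaled, labels)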
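
The silhouette score can also be used to compare several values of k. The sketch below is one possible way to do that, fitting and scoring in the same standardized feature space. It uses synthetic blobs from `make_blobs` rather than the manga data, and the names `points`, `scaled`, `scores`, and `best_k` are made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated 2-D blobs (an assumption, not the manga data)
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Standardize once, then both fit and score on the standardized features
scaled = StandardScaler().fit_transform(points)

scores = {}
for k in range(2, 6):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

A score close to 1 means the clusters are compact and far apart, while a score near 0 means they overlap. The same loop could be applied to the manga features to check whether 3 clusters is a reasonable choice.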

In [None]:
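# Importing Axes3D registers the 3D projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D
# Interactive backend for rotating the plot; requires the ipympl package
%matplotlib widget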
fig = plt.figure(figsize=(26,6))
ax = fig.add_subplot(131, projection='3d')
ax.scatter(df_manga.Scored_by, df_manga.Favorites, df_manga.Score, c=df_manga.c, s=15)
ax.set_xlabel('Scored_by')
ax.set_ylabel('Favorites')
ax.set_zlabel('Score')
plt.show()

In [None]:
# Keep only the title, rank, and assigned cluster of each manga
df_clust = df_manga[['Title', 'Rank', 'cluster']]
print(df_clust)

In [None]:
group_clust = df_clust.groupby("cluster")["Title"].count()
print(group_clust)

In [None]:
df_clust[df_clust['cluster'] == 1]